Numerous applications involve massive, high-dimensional datasets. For example, the search industry routinely deals with billions of web pages, where each page is often represented as a binary vector in 2^64 dimensions. In computer vision, images are often represented as non-binary vectors in millions of dimensions. Algorithms which are capable of efficiently compressing, retrieving, and mining these datasets are of high practical importance. Mathematically rigorous and computationally efficient hashing methods will be developed to dramatically reduce ultra-high-dimensional datasets. These algorithms will be integrated with a variety of learning techniques including classification, clustering, near-neighbor search, matrix factorizations, etc. The project builds on and extends minwise hashing, and b-bit minwise hashing which are standard hashing techniques in search applications. The project aims to (i) rigorously analyze b-bit minwise hashing and develop, analyze, and apply significantly more efficient (and more accurate) to problems in search and learning; (ii) develop a unified framework of probabilistic hashing which essentially consists of one permutation followed by (at most) one random projection; (iii) develop a unified theory of summary statistics under a variety of engineering constraints (storage space, computational speed, indexing capability, adaptation to streaming, etc.). Hashing algorithms developed under this framework are expected to be substantially much more efficient and more accurate than existing popular algorithms such as random projections and minwise hashing. This general framework allows the design algorithms to accommodate many different data types (sparse or dense data, binary or real-valued data, static or streaming data), many different engineering needs (computing inner products or lp distances, kernel learning or linear learning), and different storage requirements. Anticipated results of the proposed research include rigorous and computationally efficient hashing algorithms for dealing with ultra-high-dimensional datasets, the integration of the resulting hashing algorithms into with a variety of learning techniques for classification, clustering, near-neighbor search, singular value decompositions, matrix factorization, etc; and rigorous experimental evaluation of the resulting methods on big (e.g., TeraByte or potentially PetaByte) data of the order of up to 2^64 dimensions. Broader Impacts: Effective approaches to building predictive models from extremely high dimensional data can impact many areas of science that rely on machine learning as the primary methodology for knowledge acquisition from data. The PI''s education and outreach efforts aim to broaden the participation of women and underrepresented groups. The publications, software, and datasets resulting from the project will be freely disseminated to the larger scientific community.
|Effective start/end date||8/28/13 → 8/31/16|
- National Science Foundation (National Science Foundation (NSF))