The ubiquitous phenomenon of massive data (including data streams) poses considerable challenges for data visualization and exploratory data analysis. About 15 years ago, terabyte datasets were still considered 'ridiculous.' However, modern datasets managed by the Stanford Linear Accelerator Center (SLAC), NASA, the NSA, and others have reached the petabyte scale or larger. Corporations such as Amazon, Wal-Mart, eBay, and search engine firms are also major generators and users of massive data. The general theme of data reduction and summarization has become an active and highly interdisciplinary area of research. This project proposes to develop approximation techniques that transform the original data into a 'fingerprint' or 'sketch' of it. These 'sketches' are reasonably small (hence easy to store) and can provide approximate answers that are usually good enough for practical purposes. The proposal concerns the fundamental problems of processing and transforming massive (possibly dynamic) data. In particular, it focuses on (A) developing systematic, fundamental tools for effective data reduction and efficient data summarization, and (B) applying these tools to improve numerical analysis, visualization, and exploratory data analysis. Two lines of theoretically sound techniques for data reduction and summarization will be developed and further improved: (1) the method of stable random projections (SRP), which is effective for heavy-tailed data; and (2) the method of Conditional Random Sampling (CRS), which is intended mainly for sparse data. Concrete applications of SRP and CRS will be investigated. Widely used basic numerical algorithms can be rewritten to take advantage of SRP or CRS. Popular methods and tools for exploratory data analysis will also benefit considerably from the development of data reduction techniques.
Effective start/end date: 4/16/14 → 2/28/15
- National Science Foundation (NSF)