The need for modern streaming systems to collect and analyze human activities is enormous, in businesses, government, academia, and society. Businesses can improve their operations and production; governments can improve the participation and satisfaction of citizens; society at large can be more sustainable or safe; etc. However, systems that collect and analyze such streams have enormous challenges of scale. At the highest level, this proposal combines a research, education and outreach plan to address these challenges. This research focusses on developing new algorithmic methods and theoretical understanding of modern streaming problems. There is an extensive theory of streaming algorithms for single streams, or to a limited extent to distributed streams, for one (or few) high cardinality dimensional data and simple frequency based analyses. But there are large gaps in creating streaming algorithms for 'extreme' needs in modern data streaming applications, where the dimension of data that is collected is large, multiple streams of collected in distributed vantage points, there is a need to find anomalies in high dimensional spaces, and analyses that are needed are sophisticated including machine learning and other real time decision making tasks. This research develops the algorithmic theory of these extreme streaming problems. In particular, the project develops the algorithmic foundations for using large scale distributed streaming systems and tradeoff quality, certainty, CPU, memory and communication needed to do extreme, streaming, sophisticated analyses. Since modern streaming systems power businesses and deal with behavioral data on users, this work has broad societal impact. The project significantly improves the state of the art in algorithms for modern streaming systems. By providing new, rich algorithmic approaches, the project inspires practitioners in academia and industry to conceive more impactful applications, which are infeasible given the current algorithmic tools.The research program both enables and benefits from an education and outreach program that enhances curriculum, fosters training women and underrepresented minorities. To enable technology transfer the project involves practitioners in streaming systems, for field-testing the methods whenever possible. All publishable results are disseminated in respected academic journals, conferences, and workshops. All code and data sets developed in this work are made openly available to the community via the MassDAL site that already has code that is used by the community for classical streaming problems.Classical streaming algorithms use space sublinear, typically polylogarithmic in the input parameters. When extended to distributed systems, often the focus is on sublinear communication. The research program, here, builds on this algorithmic theory of past 15 years and addresses the modern, extreme needs of streaming applications. The project extends the theory to: many dimensions with large attributes, using far fewer resources in memory, computing and communication; emerging, pipeline models of streaming; more sophisticated analyses from local privacy to deep learning type vector embeddings; etc. The research program addresses fundamental problems. In more detail:(a) Extant streaming algorithms work for one or few dimensions of data of high cardinality. Modern streaming systems collect logs of human activity and have 100s of dimensions, 10s or more of them have very high cardinality. Can one identify the key problems for these domains and develop an algorithmic theory? The PI has identified an effective approach based on graphical modeling of the relationship between the dimensions, and believes this nugget can yield an effective theory.(b) Extant streaming algorithms use polylogarithmic words of memory per analysis when they are considered successful (and lower bounds point to problems for which sublinear space is not sufficient). Modern streaming systems run several orders of magnitude of such analyses, for example, one analysis for each of the millions of users. The project has identified an approach of 'frugal' streaming, where algorithms use a small constant number of words, and develops a theory of frugal streaming algorithms, and their limitations.(c) Modern data stream systems allow pipelining, with the stream (modifiable or not) passing through stages, either at individual sites, or across the sites. The project abstracts and develops algorithmic theory of streaming problems with pipelined streaming systems. Preliminary results indicate that this allows algorithms that use time sublinear in the sublinear space used by the solutions, and there is a rich class of path problems that can be solved in these models which are impossible in classical streaming.(d) Modern systems need sophisticated streaming analyses. For example, streaming systems that collect usage data from users need private methods, and combining local differential privacy with streaming methods is an exciting direction; as another example, systems that analyze usage data might rely on embedding data into vectors with semantics, like word2vec and related deep learning methods. These methods need to work with polylogarithmic space for streaming. As another example, rich classes of graph problems are solvable in property testing framework with sublinear samples, can such classes be solved in streaming models too? The project highlights specific research challenges involved in developing streaming algorithms, and develops algorithms with provable performance guarantees on the tradeoff of resources used, approximation ratio, and probability of success. The project empirically evaluates them when possible.One cannot take known statements of problems and hope to solve them using streaming algorithms. One needs to modify the problems a bit to be amenable to streaming. In classical streaming, 'heavy hitters' and 'few terms' properties helped achieve that. In similar vein, the project identifies certain natural phenomena which helps formulate versions of problems for which extreme streaming solutions can be developed. Contours of this are already seen in using graphical models that limit interactions between dimensions to circumvent high dimensional high cardinality cases, or reusing counters in frugal streaming or sampling the sketch data structure in privacy problems and pipelined streaming, or using the randomness in stream order. This may eventually lead to algorithmic and empirical insights. Overall vision of the project is to provide a principled perspective for design and analysis of streaming algorithms with extreme needs.
|Effective start/end date||7/15/17 → 6/30/20|
- National Science Foundation (National Science Foundation (NSF))