Beyond With-Replacement Sampling For Large-Scale Data Analysis And Optimization

Description

Advances in sensing and processing technologies, communication capabilities and smart devices have enabled deployment of systems where a massive amount of data is collected to make decisions. Many key problems of interest for analyzing and processing big data result in large-scale optimization problems. For a core, very widely used optimization method, which is efficient for such problems where the data points are sampled and processed in a sequential manner, there is a large gap between the theory and practice of this method. This project is about filling this gap by providing novel performance guarantees relevant to practical problems as well as developing novel and faster variants of the optimization method. The methods and techniques developed under the scope of this project will contribute to the efficiency and mathematical foundations of optimization algorithms targeted for big data challenges, contributing to more efficient decision making for a wide variety of large-scale data analysis problems. Incremental gradient (IG) is the core, very widely used optimization method mentioned above and subsumes popular optimization methods in data analysis and machine learning practice such as stochastic gradient descent, randomized coordinate descent and Kaczmarz methods. Various performance guarantees for IG are available if data points are sampled with replacement in an independent identically distributed (i.i.d.) manner. However, these are not helpful in practical scenarios: In practice, data is often sampled in a non-i.i.d fashion without-replacement instead, as the resulting convergence is typically much faster. A first goal in this project is to study and quantify this discrepancy over an interesting class of regression problems, which has been a key open problem. Several techniques and methods are proposed for obtaining asymptotic and non-asymptotic theoretical guarantees for without-replacement sampling schemes. A second goal is to develop fast algorithms with convergence guarantees that go beyond the limitations of the i.i.d. sampling. For this purpose, a new framework for studying several alternative sampling schemes and their performance is developed. Using this framework, novel sampling schemes based on weighted without-replacement sampling and cyclic sampling that can adapt to the dataset and improve upon the performance of the traditional i.i.d. sampling in terms of limiting accuracy are developed.
StatusActive
Effective start/end date7/15/176/30/20

Funding

  • National Science Foundation (NSF)

Fingerprint

Sampling
Processing
Learning systems
Decision making
Communication