Large volumes of data, which are being collected for the purpose of knowledge extraction, have to be stored reliably, efficiently, and securely. Retrieval of large data files from storage has to be fast. Large-scale cloud data storage and distributed file systems have become the backbone of many applications such as web searching, e-commerce, and cluster computing. Cloud services are implemented on top of a distributed storage layer that acts as middleware for the applications and also provides the desired content to users, whose interests range from performing data analytics to watching movies. Users of cloud systems demand that their content and services be readily available and their data be reliably stored. Although there are apparent connections and trade-offs between these two objectives, so far they have been addressed mostly separately. The proposed research will characterize the interplay between reliable data storage and fast data access, and develop methods to jointly optimize these two main objectives of cloud storage. The project team will develop the methodology for design, analysis, and performance evaluation of a broad range of techniques for distributed storage that enable high reliability, robustness, and fast data retrieval. The research team will also develop schemes for storage and distributed computing that optimize both reliability and the latency of data access. The methodology developed in the course of this project will allow system operators to make their content and services readily available to users and to support delay-sensitive applications ranging from individual video streaming to online collaborative tools. This research will minimize the energy requirements of data centers, which have increased massively in recent years.
The project will contribute to the broader areas of coding theory, information theory, and queueing theory, and open new ways of cross-pollination.

This project focuses on efficient data access in distributed file systems that employ codes for reliable and efficient storage. Users of cloud systems demand that their content and services be readily available and their data be reliably stored. Although there are apparent connections and trade-offs between these two objectives, thus far they have been addressed mostly separately. This proposal follows findings that analyze how some of today's solutions for reliable data storage affect the speed of data download under certain access models. This preliminary research has shown that, in some scenarios, the coding schemes used for increasing storage reliability can be further exploited for fast data access, while in others, the coding schemes that seemingly increase data availability actually fail to provide efficient access to popular content (so-called hot data). The proposed research aims to characterize the interplay between reliable data storage and fast data access, and develop methods to jointly optimize these two main objectives of cloud storage. The proposed research will first identify and design schemes for coded-data access and derive bounds on, and estimates of, the expected download time for these schemes. Regardless of which data access scheme is used, the expected download time will depend on realistic service models as well as on the schemes for provisioning and allocating service capacity in the distributed system, which will be addressed next. The work will also focus on the connections between the proposed research and the areas of efficient distributed computing and reduction in data center energy consumption. Preliminary results indicate that these areas are closely connected and that techniques developed for one area can benefit the others.
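As a rough illustration of how a storage code can also speed up access, consider a minimal sketch under assumptions that are not part of the project description: a file is split into k chunks and encoded into n chunks with an MDS code, each chunk sits on its own server, all n chunks are requested in parallel, and the download completes once any k fetches finish. If each fetch time is independent and exponentially distributed with rate `mu` (a simplification that ignores queueing and chunk-size effects), the expected download time is the mean of the k-th order statistic of n exponentials. The function name and parameters below are hypothetical.

```python
def expected_download_time(n: int, k: int, mu: float = 1.0) -> float:
    """Mean time until the fastest k of n parallel fetches complete.

    Assumes i.i.d. Exp(mu) fetch times and no queueing (illustrative model only).
    Uses the standard order-statistic result:
        E[T] = (1/mu) * sum_{i=0}^{k-1} 1 / (n - i)
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if mu <= 0:
        raise ValueError("mu must be positive")
    return sum(1.0 / (n - i) for i in range(k)) / mu


if __name__ == "__main__":
    # Same k = 2 data chunks, with and without redundancy.
    print(expected_download_time(2, 2))  # no redundancy: wait for both chunks
    print(expected_download_time(4, 2))  # (4, 2) code: any 2 of 4 chunks suffice
```

With `mu = 1`, the uncoded case (n = k = 2) gives 1.5 time units, while the (4, 2) code gives 7/12, or about 0.58. In this toy model redundancy always helps; the scenarios mentioned above in which coding fails to serve hot data efficiently arise only once queueing and limited service capacity are taken into account, which this sketch deliberately leaves out.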
Effective start/end date: 9/1/17 → 8/31/20
- National Science Foundation (NSF)
Data storage equipment
Distributed computer systems