The goal of this project is to design algorithms and statistical tools for building complex probabilistic models from massive quantities of data in a computationally efficient manner. The work is motivated by an important current problem in genomics, namely comparative epigenetics. While every cell in an organism has the same DNA sequence, epigenetic marks on the genome are known to be highly correlated with variation between cells. A pressing problem in biology is to compare the epigenetic marks across different cell types in order to understand these differences. While massive amounts of data have been generated for this purpose, there is a great need for computational tools that can operate on these data and provide biologically meaningful solutions. This work will thus advance the state of the art in the analysis of large, complex data sets and advance the field of epigenomics. The broader impact of the work includes organizing workshops and tutorials at machine learning and bioinformatics venues, involving undergraduate students in research, and releasing open source software for the community.

Specifically, this project will focus on spectral learning, which has recently provided principled and computationally efficient methods for learning the parameters of probabilistic graphical models. While spectral learning methods are known for some simple latent variable models, a major barrier to realizing their potential in real-world applications is the lack of associated statistical tools, such as regularization and hypothesis testing, that connect these methods in a principled manner to end-to-end application frameworks. This project proposes to develop such statistical tools by integrating modern spectral learning with the classical econometrics literature on the Generalized Method of Moments.
The project proposes to formulate generalized method of moments procedures for complex graphical models, in the context of spectral learning, as constrained optimization problems, and proposes ways of solving these problems. Finally, the novel algorithms developed will be applied directly to model epigenomics data sets from the ENCODE and Roadmap Epigenomics Projects, yielding methods that can operate on massive quantities of data and provide biologically meaningful solutions. These algorithms and software have the potential for widespread impact on the understanding of complex human diseases such as cancer and mental disorders. This understanding will provide a basis for designing therapeutics for these diseases and advance society towards a future of personalized medicine.
Effective start/end date: 7/1/16 → 6/30/19
- National Science Foundation (NSF)