Predictive Modeling with High-Dimensional Incomplete Data

Project Details


Predictive modeling is the cornerstone of individualized health care. The outcome of interest is most frequently the presence or absence of a health condition, and a large number of predictors are commonly available for model building. Both high-dimensional data and missing data pose great challenges for statistical inference in predictive modeling. The overarching goal of this proposal is to address the methodological challenges of predicting binary outcomes with high-dimensional incomplete data. Specifically, the PIs propose to address these challenges from two perspectives: (1) quantify the uncertainty of risk predictions based on the high-dimensional logistic model; (2) accommodate two study designs in which missingness occurs in a structured way, namely the “positive-only” study design and the two-phase design. Recent years have seen great breakthroughs in statistical inference methods for high-dimensional data arising from a wide spectrum of scientific fields, with a focus primarily on a single regression coefficient in generalized linear models. Inferential methods for confidence interval construction and hypothesis testing for the predicted probability, which is a function of all regression coefficients, are largely lacking. In this proposal we develop innovative statistical methods to fill this methodological gap in high-dimensional data analysis. The proposed methods are also innovative because they accommodate structured incomplete data arising from important sampling designs. To the best of our knowledge, statistical inference methods for high-dimensional data analysis have to date focused exclusively on complete data arising from cross-sectional study designs.
We additionally consider two important study designs with incomplete data: the “positive-only” study design, which arises in EHR phenotyping, and the two-phase design, a cost-effective sampling design that aims to reduce the cost of measuring expensive predictors. We elucidate the methodological challenges of accommodating missing data in downstream analysis and provide corresponding solutions.
Effective start/end date: 9/1/20 → 8/31/23


  • National Institute of General Medical Sciences: $183,360.00
  • National Institute of General Medical Sciences: $183,360.00
  • National Institute of General Medical Sciences: $193,407.00

