Many statistical analyses rely on a model selection procedure. Perhaps the most common model selection problem is variable selection in linear regression. Principled motivations for selection include the desire for interpretability, prevention of over-fitting, and concerns about statistical power. A practical motivation arises when the data are high-dimensional, with more explanatory variables than observations. Relevant applications span the entire domain of scientific inquiry, from neuroscience, medicine, and physics to economics, sociology, and psychology.

A large catalogue of variable selection procedures is now available, and the statistics community has turned its focus to the question of inference after selection. Standard methods of statistical inference are no longer valid when the same data are used both to select a model and to make inferences about that model. It is fundamentally important to have accurate post-selection inference procedures that are also powerful enough to detect observed departures from scientific hypotheses and that avoid the strong distributional assumptions needed for exact inference in finite samples.

This research aims to develop post-selection inference methodology that is both accurate and powerful, with particular emphasis on reducing statistical errors that depend on the sample size. The goal of this project is to further understanding of the asymptotic theory of post-selection inference, particularly selective inference based on the CovTest and truncated Gaussian (TG) statistics, as well as simultaneous inference using the post-selection intervals (PoSI) procedure. The first two procedures can be motivated by selective error control, i.e., error control for the selected model parameters. The PoSI method seeks to control family-wise error rates for all possible sub-model parameters.

While these procedures yield valid post-selection inference, without strong assumptions they are particularly vulnerable, in the realistic setting of small to moderate sample sizes, to the effects of violations of key assumptions, such as overly conservative or inaccurate confidence intervals and low power. In this project, asymptotic expansions, saddle-point approximations, the bootstrap, and related techniques from higher-order asymptotics will be employed to improve accuracy and power for these post-selection inference procedures.
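The statement that standard inference is no longer valid after selection can be illustrated with a small simulation. The sketch below is not part of the project's methodology; it is a minimal illustration under assumed settings (global null, known noise variance of 1, independent Gaussian predictors, and the hypothetical function name `naive_rejection_rates`). It picks the predictor most correlated with the response and then applies an ordinary 5% z-test to it, as if it had been chosen in advance.

```python
import numpy as np


def naive_rejection_rates(n=50, p=10, reps=2000, seed=0):
    """Estimate type I error of a naive 5% z-test under the global null.

    Returns (rate for the data-selected predictor,
             rate for a predictor fixed in advance).
    All settings here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    z_crit = 1.96  # two-sided 5% critical value of the standard normal
    reject_selected = 0
    reject_fixed = 0
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        X /= np.linalg.norm(X, axis=0)  # scale each column to unit norm
        y = rng.standard_normal(n)      # no predictor has any true effect
        z = X.T @ y                     # each z_j is N(0, 1) given X
        j = np.argmax(np.abs(z))        # selection step: strongest predictor
        reject_selected += abs(z[j]) > z_crit  # naive test after selection
        reject_fixed += abs(z[0]) > z_crit     # test on a pre-specified predictor
    return reject_selected / reps, reject_fixed / reps


selected_rate, fixed_rate = naive_rejection_rates()
print(f"naive test on selected predictor: {selected_rate:.3f}")
print(f"test on pre-specified predictor:  {fixed_rate:.3f}")
```

If the ten statistics were exactly independent, the selected predictor would be rejected with probability 1 - 0.95^10, roughly 0.40, rather than the nominal 0.05, while the pre-specified predictor stays near 0.05. Selective and simultaneous procedures such as those named above are designed to correct this kind of inflation.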
Effective start/end date: 8/1/17 → 7/31/20
- National Science Foundation (NSF)
Familywise Error Rate
Selection of Variables