Advanced colorectal neoplasia risk stratification by penalized logistic regression

Yunzhi Lin, Menggang Yu, Sijian Wang, Richard Chappell, Thomas F. Imperiale, Andrew B. Lawson, Duncan Lee, Ying MacNab

Research output: Contribution to journal › Article › peer-review

1 Scopus citation


Colorectal cancer is the second leading cause of death from cancer in the United States. To improve the efficiency of colorectal cancer screening, there is a need to stratify risk for colorectal cancer among the 90% of US residents who are considered "average risk." In this article, we investigate such risk stratification rules for advanced colorectal neoplasia (colorectal cancer and advanced, precancerous polyps). We use a recently completed large cohort study of subjects who underwent a first screening colonoscopy. Logistic regression models have been used in the literature to estimate the risk of advanced colorectal neoplasia based on quantifiable risk factors. However, logistic regression may be prone to overfitting and instability in variable selection. Since most of the risk factors in our study have several categories, it was tempting to collapse these categories into fewer risk groups. We propose a penalized logistic regression method that automatically and simultaneously selects variables, groups categories, and estimates their coefficients by penalizing the L1-norm of both the coefficients and their differences. Hence, it encourages sparsity in the categories, i.e., grouping of the categories, and sparsity in the variables, i.e., variable selection. We apply the penalized logistic regression method to our data. The important variables are selected, with close categories simultaneously grouped, by penalized regression models with and without interaction terms. The models are validated with 10-fold cross-validation. The receiver operating characteristic curves of the penalized regression models dominate the receiver operating characteristic curve of naive logistic regression, indicating superior discriminative performance.
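The penalty described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes dummy-coded categorical predictors with the reference category fixed at zero, all pairwise differences within each variable, and two hypothetical tuning parameters, `lam1` and `lam2`; the function name, the `groups` argument, and the toy data are likewise assumptions made for illustration, and the exact form and weighting of the penalty in the article may differ.

```python
import itertools

import numpy as np


def penalized_neg_loglik(beta0, beta, X, y, groups, lam1, lam2):
    """Logistic negative log-likelihood plus two L1 penalties.

    groups lists, for each categorical variable, the column indices of its
    dummy-coded categories (reference category omitted, i.e. fixed at 0).
    """
    eta = beta0 + X @ beta
    # Negative log-likelihood; logaddexp(0, eta) is a stable log(1 + exp(eta)).
    nll = np.sum(np.logaddexp(0.0, eta) - y * eta)
    # L1 norm of the coefficients: shrinks categories toward the reference,
    # and drops a variable when all of its categories reach zero.
    sparsity = np.sum(np.abs(beta))
    # L1 norm of within-variable differences: equal coefficients mean the
    # two categories are merged into one risk group.
    fusion = 0.0
    for idx in groups:
        for j, k in itertools.combinations(idx, 2):
            fusion += abs(beta[j] - beta[k])
    return nll + lam1 * sparsity + lam2 * fusion


# Toy data: columns 0 and 1 are two categories of one variable,
# column 2 is a separate binary variable.
X = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 0, 1]], dtype=float)
y = np.array([1.0, 1.0, 0.0, 0.0])
groups = [[0, 1], [2]]

fused = np.array([0.5, 0.5, 0.0])  # categories 0 and 1 share a coefficient
split = np.array([0.5, 0.2, 0.0])  # categories 0 and 1 differ by 0.3

print(penalized_neg_loglik(0.0, fused, X, y, groups, lam1=1.0, lam2=1.0))
print(penalized_neg_loglik(0.0, split, X, y, groups, lam1=1.0, lam2=1.0))
```

In this sketch the fusion term is zero for `fused` and positive for `split`, which is the mechanism by which such a penalty favors grouping categories with similar risk. Minimizing the objective requires a solver suited to non-smooth penalties, which is outside the scope of this illustration.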

Original language: English (US)
Pages (from-to): 1677-1691
Number of pages: 15
Journal: Statistical Methods in Medical Research
Issue number: 4
State: Published - Aug 1 2016
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Epidemiology
  • Statistics and Probability
  • Health Information Management


Keywords

  • colorectal cancer
  • interaction
  • lasso
  • penalized logistic regression
  • risk stratification

