### Abstract

Statistical learning theory (SLT) is the statistical formulation of machine learning theory, a body of analytic methods common in "big data" problems. Regression-based SLT algorithms seek to maximize predictive accuracy for some outcome, given a large pool of potential predictors, without overfitting the sample. Research goals in psychology may sometimes call for high dimensional regression. One example is criterion-keyed scale construction, where a scale with maximal predictive validity must be built from a large item pool. Using this as a working example, we first introduce a core principle of SLT methods: minimization of expected prediction error (EPE). Minimizing EPE is fundamentally different from maximizing the within-sample likelihood, and hinges on building a predictive model of sufficient complexity to predict the outcome well, without undue complexity leading to overfitting. We describe how such models are built and refined via cross-validation. We then illustrate how 3 common SLT algorithms (supervised principal components, regularization, and boosting) can be used to construct a criterion-keyed scale predicting all-cause mortality, using a large personality item pool within a population cohort. Each algorithm illustrates a different approach to minimizing EPE. Finally, we consider broader applications of SLT predictive algorithms, both as supportive analytic tools for conventional methods and as primary analytic tools in discovery phase research. We conclude that despite their differences from the classic null-hypothesis testing approach, or perhaps because of them, SLT methods may hold value as a statistically rigorous approach to exploratory regression.
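The contrast between within-sample fit and cross-validated prediction error can be illustrated with a minimal sketch. It is not the authors' analysis: it assumes squared-error loss, uses ridge regression as a stand-in for the regularization family, and runs on simulated data rather than the mortality cohort. All function and variable names (`solve`, `ridge_fit`, `cv_error`, `lams`, and so on) are hypothetical.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ridge_fit(X, y, lam):
    """Ridge coefficients from (X'X + lam * I) beta = X'y (no intercept)."""
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) + (lam if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve(xtx, xty)

def mse(X, y, beta):
    """Mean squared prediction error of beta on (X, y)."""
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(X, y)) / len(y)

def cv_error(X, y, lam, k=5):
    """K-fold cross-validated MSE: each fold is predicted by a model fit without it."""
    n = len(y)
    total = 0.0
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        beta = ridge_fit([X[i] for i in train], [y[i] for i in train], lam)
        total += mse([X[i] for i in test], [y[i] for i in test], beta) * len(test)
    return total / n

# Simulated "item pool": 30 candidate predictors, only 3 of which matter (an assumption).
random.seed(0)
n, p = 60, 30
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [row[0] - row[1] + 0.5 * row[2] + random.gauss(0, 1) for row in X]

lams = [0.01, 1.0, 10.0, 100.0, 1000.0]  # candidate penalty strengths
train_err = {lam: mse(X, y, ridge_fit(X, y, lam)) for lam in lams}
cv_err = {lam: cv_error(X, y, lam) for lam in lams}
best_lam = min(lams, key=cv_err.get)

for lam in lams:
    print(f"lambda={lam:8.2f}  in-sample MSE={train_err[lam]:.3f}  CV MSE={cv_err[lam]:.3f}")
print("penalty chosen by cross-validation:", best_lam)
```

In-sample error can only grow as the ridge penalty increases, so selecting on within-sample fit always favors the least constrained model. The cross-validated error is computed on held-out folds and is therefore the quantity used here as an estimate of EPE when choosing the penalty.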

Original language | English (US) |
---|---|
Pages (from-to) | 603-620 |
Number of pages | 18 |
Journal | Psychological Methods |
Volume | 21 |
Issue number | 4 |
DOIs | https://doi.org/10.1037/met0000088 |
State | Published - Dec 1 2016 |

### All Science Journal Classification (ASJC) codes

- Psychology (miscellaneous)

### Keywords

- Machine learning theory
- Mortality
- Personality
- Psychometrics
- Statistical learning theory

### Cite this

Chapman, B. P., Weiss, A., & Duberstein, P. R. (2016). Statistical learning theory for high dimensional prediction: Application to criterion-keyed scale development. *Psychological Methods*, *21*(4), 603-620. https://doi.org/10.1037/met0000088

**Statistical learning theory for high dimensional prediction: Application to criterion-keyed scale development.** / Chapman, Benjamin P.; Weiss, Alexander; Duberstein, Paul R.

Research output: Contribution to journal › Article

TY - JOUR

T1 - Statistical learning theory for high dimensional prediction

T2 - Application to criterion-keyed scale development

AU - Chapman, Benjamin P.

AU - Weiss, Alexander

AU - Duberstein, Paul R.

PY - 2016/12/1

Y1 - 2016/12/1

KW - Machine learning theory

KW - Mortality

KW - Personality

KW - Psychometrics

KW - Statistical learning theory

UR - http://www.scopus.com/inward/record.url?scp=84988692252&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84988692252&partnerID=8YFLogxK

U2 - 10.1037/met0000088

DO - 10.1037/met0000088

M3 - Article

C2 - 27454257

AN - SCOPUS:84988692252

VL - 21

SP - 603

EP - 620

JO - Psychological Methods

JF - Psychological Methods

SN - 1082-989X

IS - 4

ER -