Cross-validation for the selection of statistical models




1 Cross-validation for the selection of statistical models
Simon J. Mason and Michael K. Tippett, IRI

2 The Model Selection Problem
Given: a family of candidate models Ma and a set of observations. Question: which model should be used? Goals: maximize predictive ability given limited observations, and accurately estimate that predictive ability. Example: linear regression with n = 50 observations and a pool of possible predictors sorted by their correlation with the predictand; M1 uses the first predictor, M2 the first two predictors, and so on. With limited observations it may be difficult to calibrate the parameters of the correct model accurately, reducing its predictive ability, so a simpler model may have greater predictive skill.
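
A minimal sketch of this setup in Python (the synthetic data, the pool of 20 candidate predictors, and the two-predictor "true" model are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)

n, n_pool = 50, 20                       # 50 observations, pool of 20 candidate predictors
X = rng.standard_normal((n, n_pool))
y = 0.8 * X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n)   # "true" model uses 2 predictors

# Sort candidate predictors by the magnitude of their correlation with the predictand
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_pool)])
order = np.argsort(-np.abs(corr))

# Nested models: M1 uses the best predictor, M2 the best two, and so on
def design_matrix(m):
    """Design matrix (with intercept) for model Mm."""
    return np.column_stack([np.ones(n), X[:, order[:m]]])

for m in (1, 2, 3):
    beta, *_ = np.linalg.lstsq(design_matrix(m), y, rcond=None)
    print(f"M{m}: {m} predictors, {len(beta)} fitted parameters")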

3 Estimating predictive ability
Wrong way: calibrate each model with all of the data and choose the model that best fits those same data. In-sample fit can only improve as predictors are added, so this always favors the most complex model.
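
A quick check of that point on the synthetic data from the previous slide (all settings are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n, n_pool = 50, 20
X = rng.standard_normal((n, n_pool))
y = 0.8 * X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n)   # only 2 predictors matter
order = np.argsort(-np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_pool)]))

# In-sample residual sum of squares for the nested models M1, M2, ...
for m in (1, 2, 5, 10, 20):
    A = np.column_stack([np.ones(n), X[:, order[:m]]])
    rss = np.sum((y - A @ np.linalg.lstsq(A, y, rcond=None)[0]) ** 2)
    print(f"M{m:2d}: in-sample RSS = {rss:.2f}")   # RSS never increases as m grows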

4 In-sample skill estimates
Akaike information criterion (AIC): AIC = -2 log(L) + 2p, an asymptotic estimate of the expected out-of-sample error. Bayesian information criterion (BIC): BIC = -2 log(L) + p log(n). Here L is the maximized likelihood, p the number of parameters, and n the number of samples. Both criteria maximize fit while penalizing complexity, rather than simply maximizing the likelihood of the model given the data, and both are well known in the Earth-science literature. For normal multiple linear regression models, minimizing AIC is equivalent to minimizing Mallows' Cp. The difference between the BICs of two models approximates twice the negative log of the Bayes factor, which measures the relative likelihood of one model versus another given the data and equal prior probabilities for the models.
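
For a normal linear regression model the maximized log-likelihood depends on the data only through the residual sum of squares, so both criteria are easy to compute. A minimal sketch (counting the error variance as an extra parameter is a convention assumed here for illustration):

import numpy as np

def aic_bic(X, y):
    """AIC and BIC for a Gaussian linear model y ~ X (X should include an intercept column)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    # Maximized Gaussian log-likelihood with the error variance profiled out
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    k = p + 1                      # fitted coefficients plus the error variance
    return -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)

rng = np.random.default_rng(0)
n = 50
x = rng.standard_normal(n)
y = 0.8 * x + rng.standard_normal(n)
print("AIC, BIC:", aic_bic(np.column_stack([np.ones(n), x]), y))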

5 AIC and BIC: AIC = -2 log(L) + 2p; BIC = -2 log(L) + p log(n)
BIC tends to select simpler models: for n ≥ 8, log(n) > 2, so its complexity penalty is heavier. Asymptotically (with many observations) AIC is inconsistent, while BIC is consistent. For models of the same size, both criteria simply pick the best fit; the relevant case is when the candidate models have different dimensions. A large pool of predictors still leads to over-fitting.
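
A sketch comparing which of the nested models from slide 2 each criterion selects (synthetic data; the particular counts will vary with the noise realization):

import numpy as np

def aic_bic(X, y):
    """AIC and BIC for a Gaussian linear model y ~ X (intercept column included in X)."""
    n, p = X.shape
    rss = np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    k = p + 1                                   # coefficients plus the error variance
    return -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)

rng = np.random.default_rng(1)
n, n_pool = 50, 20
X = rng.standard_normal((n, n_pool))
y = 0.8 * X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n)
order = np.argsort(-np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_pool)]))

scores = [aic_bic(np.column_stack([np.ones(n), X[:, order[:m]]]), y)
          for m in range(1, n_pool + 1)]
aic, bic = map(np.array, zip(*scores))
print("AIC selects", np.argmin(aic) + 1, "predictors; BIC selects", np.argmin(bic) + 1)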

6 Out-of-sample skill estimates
Calibrate and validate models on independent data sets. With many observations, split the data into separate calibration and validation sets. With limited data, repeatedly divide the data instead: leave-1-out or leave-k-out cross-validation. What are the properties of cross-validation, and of model selection by cross-validation?
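
A minimal leave-k-out sketch (here a window of k consecutive cases centered on the target is withheld and only the central case is predicted, one common convention in climate applications and an assumption of this illustration; k = 1 gives ordinary leave-one-out):

import numpy as np

def leave_k_out_rmse(X, y, k=1):
    """Leave-k-out cross-validated RMSE for a linear model y ~ X (intercept included in X; k odd)."""
    n, half, errs = len(y), k // 2, []
    for i in range(n):
        out = np.arange(max(0, i - half), min(n, i + half + 1))   # withheld window around case i
        keep = np.setdiff1d(np.arange(n), out)
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)  # calibrate without the window
        errs.append(y[i] - X[i] @ beta)                           # validate on the central case
    return np.sqrt(np.mean(np.square(errs)))

rng = np.random.default_rng(0)
n = 50
x = rng.standard_normal(n)
y = 0.8 * x + rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
for k in (1, 5, 9):
    print(f"leave-{k}-out CV RMSE: {leave_k_out_rmse(X, y, k):.3f}")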

7 Leave-k-out CV is biased
For a single predictor and predictand, cross-validation underestimates the correlation; increasing k reduces the bias for low correlations but increases it for high correlations (Barnston and van den Dool 1993). For multivariate linear regression, cross-validation overestimates the RMS error, with a bias ~ k/[n(n-k)] (Burman 1989). For a given model with significant skill, large k therefore underestimates skill. The CV skill is itself a random variable, a function of the noise realization; these results are for expected values.
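
A rough Monte Carlo check of the direction of the RMS-error bias for a single fixed model (synthetic data; the centered-window leave-k-out variant from the previous slide and all settings here are illustrative):

import numpy as np

def cv_mse(X, y, k):
    """Centered-window leave-k-out cross-validated mean squared error (k odd)."""
    n, half, errs = len(y), k // 2, []
    for i in range(n):
        keep = np.setdiff1d(np.arange(n), np.arange(max(0, i - half), min(n, i + half + 1)))
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errs.append(y[i] - X[i] @ b)
    return np.mean(np.square(errs))

rng = np.random.default_rng(0)
n, k, n_trials = 30, 9, 1000
beta_true = np.array([0.0, 0.8])
cv, true = [], []
for _ in range(n_trials):
    x = rng.standard_normal(n)
    X = np.column_stack([np.ones(n), x])
    y = X @ beta_true + rng.standard_normal(n)
    cv.append(cv_mse(X, y, k))
    # "True" error of the full-data fit, estimated on a large independent sample
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    X_new = np.column_stack([np.ones(2000), rng.standard_normal(2000)])
    true.append(np.mean((X_new @ beta_true + rng.standard_normal(2000) - X_new @ b) ** 2))

print("mean CV MSE:  ", np.mean(cv))     # expected to exceed the true error on average,
print("mean true MSE:", np.mean(true))   # since each CV fit uses only about n - k cases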

8 On the other hand … Selection bias
“If one begins with a very large collection of rival models, then we can be fairly sure that the winning model will have an accidentally high maximum likelihood term.” (Forster). The true predictive skill of the winning model is therefore likely to be overestimated, which affects both goals: choosing the best model and accurately estimating its skill. Ideally an independent data set would be used to estimate skill. The point is subtle: given a model, cross-validation is likely to underestimate its skill; but if that model was chosen from a large pool of models, its estimated skill is likely to overestimate its true skill. Can the bias of cross-validation and the selection bias offset each other?
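
A Monte Carlo sketch of this selection bias (synthetic data in which no candidate has any real skill; the pool size and sample size are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n, n_models, n_trials = 30, 20, 50
best_scores = []

for _ in range(n_trials):
    y = rng.standard_normal(n)               # predictand is pure noise: every model has zero true skill
    X = rng.standard_normal((n, n_models))   # 20 rival single-predictor models
    scores = []
    for j in range(n_models):
        preds = []
        for i in range(n):                   # leave-one-out CV for model j
            keep = np.delete(np.arange(n), i)
            xk, yk = X[keep, j], y[keep]
            slope = np.cov(xk, yk)[0, 1] / np.var(xk, ddof=1)
            preds.append(yk.mean() + slope * (X[i, j] - xk.mean()))
        scores.append(np.corrcoef(preds, y)[0, 1])
    best_scores.append(max(scores))          # cross-validated skill of the "winning" model

# The winner's apparent skill tends to sit well above the true skill of zero
print("mean CV correlation of the winning model:", np.mean(best_scores))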

9 In-sample and CV estimate
Leave-1-out cross-validation is asymptotically equivalent to AIC (and Mallows' Cp; Stone 1979). Leave-k-out cross-validation is asymptotically equivalent to BIC for a well-chosen k. Increasing k tends to select simpler models: CV with large k penalizes complex models by requiring their many parameters to be estimated from little data. The asymptotic limit again refers to a large number of observations. As before, there is no useful distinction when the models have the same dimension.

10 Leave-k-out cross-validation
Leaving more out tends to select simpler models. The choice of skill metric matters: correlation and RMS error are not simply related, and RMS error selects simpler models in the numerical experiments.
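
A sketch of how the choice of metric can change the selected model, using the nested models from earlier slides and leave-one-out for brevity (all settings are illustrative; the two metrics need not pick the same model size):

import numpy as np

def loo_predictions(X, y):
    """Leave-one-out predictions for a linear model y ~ X (intercept included in X)."""
    n, preds = len(y), np.empty(len(y))
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        preds[i] = X[i] @ b
    return preds

rng = np.random.default_rng(0)
n, n_pool = 50, 10
Xp = rng.standard_normal((n, n_pool))
y = 0.8 * Xp[:, 0] + 0.5 * Xp[:, 1] + rng.standard_normal(n)
order = np.argsort(-np.abs([np.corrcoef(Xp[:, j], y)[0, 1] for j in range(n_pool)]))

corr_by_m, rmse_by_m = [], []
for m in range(1, n_pool + 1):
    p = loo_predictions(np.column_stack([np.ones(n), Xp[:, order[:m]]]), y)
    corr_by_m.append(np.corrcoef(p, y)[0, 1])
    rmse_by_m.append(np.sqrt(np.mean((p - y) ** 2)))

print("model chosen by CV correlation:", np.argmax(corr_by_m) + 1, "predictors")
print("model chosen by CV RMSE:       ", np.argmin(rmse_by_m) + 1, "predictors")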

11 Impact on skill estimates
Leaving more out reduces skill estimate biases in numerical experiments.

12 Better model selected? If the “true” model is simple, leaving more out selects a better model. If the true model is not simple, leaving more out has only a modest impact on model skill: the parameters of a complex model may be hard to estimate accurately, while those of a simpler model may be estimated more accurately. Leaving more out gives more accurate estimates of skill.

13 Conclusions Increasing the pool of predictors increases the chance of over-fitting and of over-estimating skill. AIC and BIC balance data fit against model complexity; BIC chooses simpler models. Leave-k-out cross-validation also penalizes model complexity (leave-1-out is asymptotically equivalent to AIC). Leaving more out selects simpler models and reduces the bias of the skill estimate.

