
1 Variable selection and model building Part II

2 Statement of situation A common situation is that there is a large set of candidate predictor variables. (Note: the examples herein are not really that large.) The goal is to choose a small subset from the larger set so that the resulting regression model is simple and useful: –provides a good summary of the trend in the response –and/or provides good predictions of the response –and/or provides good estimates of the slope coefficients

3 Two basic methods of selecting predictors Stepwise regression: Enter and remove predictors, in a stepwise manner, until no justifiable reason to enter or remove more. Best subsets regression: Select the subset of predictors that do the best at meeting some well-defined objective criterion.

4 Two cautions! The list of candidate predictor variables must include all the variables that actually predict the response. There is no single criterion that will always be the best measure of the “best” regression equation.

5 Best subsets regression … or all possible subsets regression

6 Best subsets regression Consider all of the possible regression models from all of the possible combinations of the candidate predictors. Identify, for further evaluation, models with a subset of predictors that do the “best” at meeting some well-defined criteria. Further evaluate the models identified in the last step. Fine-tune the final model.

7 Example: Cement data
Response y: heat evolved in calories during hardening of cement, on a per gram basis
Predictor x1: % of tricalcium aluminate
Predictor x2: % of tricalcium silicate
Predictor x3: % of tetracalcium alumino ferrite
Predictor x4: % of dicalcium silicate

8 Example: Cement data

9 Why best subsets regression?

# of predictors (p−1)   # of regression models
        1                2:   ( ) (x1)
        2                4:   ( ) (x1) (x2) (x1, x2)
        3                8:   ( ) (x1) (x2) (x3) (x1, x2) (x1, x3) (x2, x3) (x1, x2, x3)
        4               16:   1 none, 4 one, 6 two, 4 three, 1 four

10 Why best subsets regression? If there are p−1 candidate predictors, then there are 2^(p−1) possible regression models formed from them. For example, 10 predictors yield 2^10 = 1024 possible regression models. A best subsets algorithm determines the best subsets of each size, so that candidates for a final model can be identified by the researcher.
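
As an illustration of the count above, a few lines of Python (standard library only) enumerate every candidate subset; the predictor names are simply placeholders for whatever candidates are at hand:

    from itertools import combinations

    predictors = ["x1", "x2", "x3", "x4"]            # 4 candidate predictors
    subsets = [c for k in range(len(predictors) + 1)
                 for c in combinations(predictors, k)]
    print(len(subsets))   # 16 = 2**4 models, counting the intercept-only model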

11 Common ways of judging “best” Different criteria quantify different aspects of the regression model, and so can lead to different choices for the best set of predictors: –R-squared –Adjusted R-squared –MSE (or S = square root of MSE) –Mallow’s C_p –(PRESS statistic)

12 Increase in R-squared R² can only increase as more variables are added. Use R-squared values to find the point where adding more predictors is not worthwhile, because it yields only a very small increase in R-squared. This criterion is most often used in combination with other criteria.

13 Best Subsets Regression: y versus x1, x2, x3, x4 (Cement example)

Response is y

Vars  R-Sq  R-Sq(adj)    C-p       S   x1 x2 x3 x4
   1  67.5     64.5    138.7  8.9639               X
   1  66.6     63.6    142.5  9.0771       X
   2  97.9     97.4      2.7  2.4063    X  X
   2  97.2     96.7      5.5  2.7343    X          X
   3  98.2     97.6      3.0  2.3087    X  X       X
   3  98.2     97.6      3.0  2.3121    X  X   X
   4  98.2     97.4      5.0  2.4460    X  X   X   X

14 Largest adjusted R-squared The adjusted R-squared makes you pay a penalty for adding more predictors. According to this criterion, the best regression model is the one with the largest adjusted R-squared.

15 Smallest MSE According to this criterion, the best regression model is the one with the smallest MSE. Adjusted R-squared increases only if MSE decreases, so the adjusted R-squared and MSE criteria yield the same models.
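
The equivalence between the two criteria follows directly from the definition of adjusted R-squared (SSTO is the total sum of squares, which is the same for every candidate model):

    R²(adj) = 1 − [SSE/(n−p)] / [SSTO/(n−1)] = 1 − MSE / [SSTO/(n−1)]

Because SSTO/(n−1) does not depend on which predictors are chosen, the adjusted R-squared is largest exactly when MSE is smallest.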

16 Best Subsets Regression: y versus x1, x2, x3, x4 (Cement example)

Response is y

Vars  R-Sq  R-Sq(adj)    C-p       S   x1 x2 x3 x4
   1  67.5     64.5    138.7  8.9639               X
   1  66.6     63.6    142.5  9.0771       X
   2  97.9     97.4      2.7  2.4063    X  X
   2  97.2     96.7      5.5  2.7343    X          X
   3  98.2     97.6      3.0  2.3087    X  X       X
   3  98.2     97.6      3.0  2.3121    X  X   X
   4  98.2     97.4      5.0  2.4460    X  X   X   X

17 Mallow’s C_p statistic

18 C_p estimates the size of the bias introduced into the predicted responses by having an underspecified model (a model in which important predictors are missing).

19 Biased prediction If there is no bias, the expected value of the observed responses and the expected value of the predicted responses both equal μ_{Y|x}. Fitting the data with an underspecified model introduces bias, B_i = E(ŷ_i) − μ_{Y|x_i}, into the predicted response at the i-th data point.

20 Biased prediction (figure: predicted responses with no bias vs. with bias)

21 Bias from an underspecified model
Weight = -1.22 + 0.283 Height + 0.111 Water,  MSE = 0.017
Weight = -4.14 + 0.389 Height,  MSE = 0.653

22 Variation in predicted responses For a biased model, the variance in the predicted response at data point i is due to two things: –random sampling variation –variance associated with the bias

23 Total variation in predicted responses Sum the two variance components over all n data points to obtain a measure of the total variation in the predicted responses:

    Γ_p = (1/σ²) [ Σ Var(ŷ_i) + Σ B_i² ]   (sums over i = 1, …, n)

If there is no bias, Γ_p achieves its smallest value, p.

24 A good measure of an underspecified model So Γ_p seems to be a good measure of an underspecified model: the best model is simply the one with the smallest value of Γ_p. We even know that the theoretical minimum of Γ_p is p.
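
Why the minimum is p (a standard least-squares fact, stated here for completeness rather than taken from the slide): for a model with p parameters, the variances of the fitted values sum to pσ², since the hat matrix has trace p. So when every bias B_i is zero,

    Γ_p = (1/σ²) Σ Var(ŷ_i) = (1/σ²)(p σ²) = p.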

25 C_p as an estimate of Γ_p If we know the population variance σ², we can estimate Γ_p by

    Γ̂_p = SSE_p / σ² − (n − 2p) = (n − p) MSE_p / σ² − (n − 2p),

where MSE_p is the mean squared error from fitting the model containing the subset of p−1 predictors (p parameters).

26 Mallow’s C_p statistic But we don’t know σ². So, estimate it using MSE_all, the mean squared error obtained from fitting the model containing all of the candidate predictors. Estimating σ² using MSE_all: –assumes that there are no biases in the full model with all of the predictors, an assumption that may or may not be valid but can’t be tested without additional information –guarantees that C_p = p for the full model.
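
Written out, the estimate takes the usual form of the statistic (the slides report only the Minitab values):

    C_p = SSE_p / MSE_all − (n − 2p)

For the full model, SSE_all = (n − p) MSE_all, so SSE_all / MSE_all = n − p and C_p = (n − p) − (n − 2p) = p. That is why the full model always lands exactly at C_p = p, regardless of how well it fits.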

27 Summary facts about Mallow’s C_p Subset models with small C_p values have a small total (standardized) variance of prediction. When the C_p value is … –near p, the bias is small (next to none), –much greater than p, the bias is substantial, –below p, it is due to sampling error; interpret as no bias. For the largest model with all possible predictors, C_p = p (always).

28 Using the C_p criterion Identify subsets of predictors for which the C_p value is near p (if possible). –The full model always yields C_p = p, so don’t select the full model based on C_p. –If all models, except the full model, yield a large C_p not near p, it suggests some important predictor(s) are missing from the analysis. –When more than one model has a C_p value near p, in general, choose the simpler model or the model that meets your research needs.
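
The output shown on the following slides comes from Minitab. For readers who want to reproduce a table like it outside Minitab, here is a minimal NumPy sketch of the same computation; the function name, argument layout, and printing are illustrative assumptions, not part of the original lecture:

    import itertools
    import numpy as np

    def best_subsets(X, y, names):
        """Fit every subset of the candidate predictors by ordinary least squares
        and report R-sq, adjusted R-sq, Mallow's Cp, and S = sqrt(MSE)."""
        n = len(y)
        ssto = np.sum((y - y.mean()) ** 2)            # total sum of squares

        def fit(cols):
            # Design matrix: intercept column plus the chosen predictors.
            A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            sse = np.sum((y - A @ beta) ** 2)
            return sse, A.shape[1]                    # SSE and p (parameter count)

        sse_all, p_all = fit(range(X.shape[1]))
        mse_all = sse_all / (n - p_all)               # estimate of sigma^2 for Cp

        table = []
        for k in range(1, X.shape[1] + 1):
            for cols in itertools.combinations(range(X.shape[1]), k):
                sse, p = fit(cols)
                r2 = 1 - sse / ssto
                r2_adj = 1 - (n - 1) / (n - p) * sse / ssto
                cp = sse / mse_all - (n - 2 * p)
                s = np.sqrt(sse / (n - p))
                table.append((", ".join(names[c] for c in cols), r2, r2_adj, cp, s))
        return table

    # Hypothetical usage (X is an n-by-4 array of the cement predictors, y the heats):
    # for label, r2, r2a, cp, s in sorted(best_subsets(X, y, ["x1", "x2", "x3", "x4"]),
    #                                     key=lambda row: row[3]):
    #     print(f"{label:15s} R-Sq={100*r2:5.1f} adj={100*r2a:5.1f} Cp={cp:6.1f} S={s:.4f}")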

29 Best Subsets Regression: y versus x1, x2, x3, x4 (Cement example)

Response is y

Vars  R-Sq  R-Sq(adj)    C-p       S   x1 x2 x3 x4
   1  67.5     64.5    138.7  8.9639               X
   1  66.6     63.6    142.5  9.0771       X
   2  97.9     97.4      2.7  2.4063    X  X
   2  97.2     96.7      5.5  2.7343    X          X
   3  98.2     97.6      3.0  2.3087    X  X       X
   3  98.2     97.6      3.0  2.3121    X  X   X
   4  98.2     97.4      5.0  2.4460    X  X   X   X

30 The regression equation is
y = 62.4 + 1.55 x1 + 0.510 x2 + 0.102 x3 - 0.144 x4

Source          DF       SS      MS       F      P
Regression       4  2667.90  666.97  111.48  0.000
Residual Error   8    47.86    5.98
Total           12  2715.76

The regression equation is
y = 52.6 + 1.47 x1 + 0.662 x2

Source          DF      SS      MS       F      P
Regression       2  2657.9  1328.9  229.50  0.000
Residual Error  10    57.9     5.8
Total           12  2715.8

31 The regression equation is
y = 62.4 + 1.55 x1 + 0.510 x2 + 0.102 x3 - 0.144 x4

Source          DF       SS      MS       F      P
Regression       4  2667.90  666.97  111.48  0.000
Residual Error   8    47.86    5.98
Total           12  2715.76

The regression equation is
y = 103 + 1.44 x1 - 0.614 x4

Source          DF      SS      MS       F      P
Regression       2  2641.0  1320.5  176.63  0.000
Residual Error  10    74.8     7.5
Total           12  2715.8

32 Best Subsets Regression: y versus x1, x2, x3, x4 (Cement example)

Response is y

Vars  R-Sq  R-Sq(adj)    C-p       S   x1 x2 x3 x4
   1  67.5     64.5    138.7  8.9639               X
   1  66.6     63.6    142.5  9.0771       X
   2  97.9     97.4      2.7  2.4063    X  X
   2  97.2     96.7      5.5  2.7343    X          X
   3  98.2     97.6      3.0  2.3087    X  X       X
   3  98.2     97.6      3.0  2.3121    X  X   X
   4  98.2     97.4      5.0  2.4460    X  X   X   X

33 The regression equation is
y = 71.6 + 1.45 x1 + 0.416 x2 - 0.237 x4

Predictor      Coef  SE Coef      T      P   VIF
Constant      71.65    14.14   5.07  0.001
x1           1.4519   0.1170  12.41  0.000   1.1
x2           0.4161   0.1856   2.24  0.052  18.8
x4          -0.2365   0.1733  -1.37  0.205  18.9

S = 2.309   R-Sq = 98.2%   R-Sq(adj) = 97.6%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       3  2667.79  889.26  166.83  0.000
Residual Error   9    47.97    5.33
Total           12  2715.76

34 The regression equation is
y = 48.2 + 1.70 x1 + 0.657 x2 + 0.250 x3

Predictor      Coef  SE Coef      T      P   VIF
Constant     48.194    3.913  12.32  0.000
x1           1.6959   0.2046   8.29  0.000   3.3
x2          0.65691  0.04423  14.85  0.000   1.1
x3           0.2500   0.1847   1.35  0.209   3.1

S = 2.312   R-Sq = 98.2%   R-Sq(adj) = 97.6%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       3  2667.65  889.22  166.34  0.000
Residual Error   9    48.11    5.35
Total           12  2715.76

35 The regression equation is
y = 52.6 + 1.47 x1 + 0.662 x2

Predictor      Coef  SE Coef      T      P   VIF
Constant     52.577    2.286  23.00  0.000
x1           1.4683   0.1213  12.10  0.000   1.1
x2          0.66225  0.04585  14.44  0.000   1.1

S = 2.406   R-Sq = 97.9%   R-Sq(adj) = 97.4%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       2  2657.9  1328.9  229.50  0.000
Residual Error  10    57.9     5.8
Total           12  2715.8

36 Stepwise Regression: y versus x1, x2, x3, x4

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15
Response is y on 4 predictors, with N = 13

Step            1        2        3        4
Constant   117.57   103.10    71.65    52.58

x4         -0.738   -0.614   -0.237
T-Value     -4.77   -12.62    -1.37
P-Value     0.001    0.000    0.205

x1                    1.44     1.45     1.47
T-Value              10.40    12.41    12.10
P-Value              0.000    0.000    0.000

x2                             0.416    0.662
T-Value                         2.24    14.44
P-Value                        0.052    0.000

S            8.96     2.73     2.31     2.41
R-Sq        67.45    97.25    98.23    97.87
R-Sq(adj)   64.50    96.70    97.64    97.44
C-p         138.7      5.5      3.0      2.7

37 Residual analysis

38 (figure: residual plots)

39 Example: Modeling PIQ

40 Best Subsets Regression: PIQ versus MRI, Height, Weight

Response is PIQ

Vars  R-Sq  R-Sq(adj)   C-p       S   MRI Height Weight
   1  14.3     11.9     7.3  21.212    X
   1   0.9      0.0    13.8  22.810         X
   2  29.5     25.5     2.0  19.510    X    X
   2  19.3     14.6     6.9  20.878    X           X
   3  29.5     23.3     4.0  19.794    X    X      X

41 Stepwise Regression: PIQ versus MRI, Height, Weight

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15
Response is PIQ on 3 predictors, with N = 38

Step            1        2
Constant    4.652  111.276

MRI          1.18     2.06
T-Value      2.45     3.77
P-Value     0.019    0.001

Height              -2.73
T-Value             -2.75
P-Value             0.009

S            21.2     19.5
R-Sq        14.27    29.49
R-Sq(adj)   11.89    25.46
C-p           7.3      2.0

42 Example: Modeling BP

43 Best Subsets Regression: BP versus Age, Weight, ...

Response is BP

Vars  R-Sq  R-Sq(adj)    C-p        S   Age Weight BSA Dur Pulse Stress
   1  90.3     89.7    312.8   1.7405        X
   1  75.0     73.6    829.1   2.7903               X
   2  99.1     99.0     15.1  0.53269    X   X
   2  92.0     91.0    256.6   1.6246        X              X
   3  99.5     99.4      6.4  0.43705    X   X      X
   3  99.2     99.1     14.1  0.52012    X   X              X
   4  99.5     99.4      6.4  0.42591    X   X      X   X
   4  99.5     99.4      7.1  0.43500    X   X      X       X
   5  99.6     99.4      7.0  0.42142    X   X      X   X   X
   5  99.5     99.4      7.7  0.43078    X   X      X   X         X
   6  99.6     99.4      7.0  0.40723    X   X      X   X   X     X

44 Stepwise Regression: BP versus Age, Weight, BSA, Duration, Pulse, Stress

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15
Response is BP on 6 predictors, with N = 20

Step            1        2        3
Constant    2.205  -16.579  -13.667

Weight      1.201    1.033    0.906
T-Value     12.92    33.15    18.49
P-Value     0.000    0.000    0.000

Age                  0.708    0.702
T-Value              13.23    15.96
P-Value              0.000    0.000

BSA                            4.6
T-Value                        3.04
P-Value                        0.008

S            1.74    0.533    0.437
R-Sq        90.26    99.14    99.45
R-Sq(adj)   89.72    99.04    99.35
C-p         312.8     15.1      6.4

45 The regression equation is
BP = -12.9 + 0.683 Age + 0.897 Weight + 4.86 BSA + 0.0665 Dur

Predictor      Coef  SE Coef      T      P   VIF
Constant    -12.852    2.648  -4.85  0.000
Age         0.68335  0.04490  15.22  0.000   1.3
Weight      0.89701  0.04818  18.62  0.000   4.5
BSA           4.860    1.492   3.26  0.005   4.3
Dur         0.06653  0.04895   1.36  0.194   1.2

S = 0.4259   R-Sq = 99.5%   R-Sq(adj) = 99.4%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       4  557.28  139.32  768.01  0.000
Residual Error  15    2.72    0.18
Total           19  560.00

46 The regression equation is
BP = -13.7 + 0.702 Age + 0.906 Weight + 4.63 BSA

Predictor      Coef  SE Coef      T      P   VIF
Constant    -13.667    2.647  -5.16  0.000
Age         0.70162  0.04396  15.96  0.000   1.2
Weight      0.90582  0.04899  18.49  0.000   4.4
BSA           4.627    1.521   3.04  0.008   4.3

S = 0.4370   R-Sq = 99.5%   R-Sq(adj) = 99.4%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       3  556.94  185.65  971.93  0.000
Residual Error  16    3.06    0.19
Total           19  560.00

47 The regression equation is
BP = -16.6 + 0.708 Age + 1.03 Weight

Predictor      Coef  SE Coef      T      P   VIF
Constant    -16.579    3.007  -5.51  0.000
Age         0.70825  0.05351  13.23  0.000   1.2
Weight      1.03296  0.03116  33.15  0.000   1.2

S = 0.5327   R-Sq = 99.1%   R-Sq(adj) = 99.0%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       2  555.18  277.59  978.25  0.000
Residual Error  17    4.82    0.28
Total           19  560.00

48 Best subsets regression in Minitab Stat >> Regression >> Best Subsets … Specify the response and all of the candidate predictors. If desired, specify predictors that must be included in every model –(researcher’s knowledge!). Select OK. The results appear in the session window.

49 Model building strategy

50 The first step Decide on the type of model needed –Predictive: model used to predict the response variable from a chosen set of predictors. –Theoretical: model based on theoretical relationship between response and predictors. –Control: model used to control a response variable by manipulating predictor variables.

51 The first step (cont’d) Decide on the type of model needed –Inferential: model used to explore strength of relationships between response and predictors. –Data summary: model used merely as a way to summarize a large set of data by a single equation.

52 The second step Decide on the predictor variables and the response variable for which to collect data. Collect the data.

53 The third step Explore the data –Check for outliers, gross data errors, missing values on a univariate basis. –Study bivariate relationships to reveal other outliers, to suggest possible transformations, to identify possible multicollinearities.

54 The fourth step Randomly divide the data into a training set and a test set: –The training set, with at least 15-20 error d.f., is used to fit the model. –The test set is used for cross-validation of the fitted model.

55 The fifth step Using the training set, fit several candidate models: –Use best subsets regression. –Use stepwise regression (it gives only one model unless you specify different alpha-to-remove and alpha-to-enter values).

56 The sixth step Select and evaluate a few “good” models: –Select based on adjusted R², Mallow’s C_p, and the number and nature of the predictors. –Evaluate the selected models for violations of the model assumptions. –If none of the models provides a satisfactory fit, try something else, such as more data, different predictors, a different class of model …

57 The final step Select the final model: –Compare competing models by cross-validating them against the test data. –The model with the larger cross-validation R² is the better predictive model. –Consider residual plots, outliers, parsimony, relevance, and ease of measurement of predictors.
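
As a sketch of the cross-validation step, the following NumPy fragment fits a chosen subset on the training set and evaluates an R-squared-type statistic, 1 − SSE_test/SSTO_test, on the held-out test set. The function and this particular definition of the cross-validation R² are illustrative assumptions; the course may use a different prediction criterion (e.g., one based on PRESS):

    import numpy as np

    def cross_validation_r2(cols, X_train, y_train, X_test, y_test):
        """Fit on the training set, then score the fit on the test set."""
        A_train = np.column_stack([np.ones(len(y_train)), X_train[:, cols]])
        A_test = np.column_stack([np.ones(len(y_test)), X_test[:, cols]])
        beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
        sse_test = np.sum((y_test - A_test @ beta) ** 2)
        ssto_test = np.sum((y_test - y_test.mean()) ** 2)
        return 1 - sse_test / ssto_test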

