Variable selection and model building Part I. Statement of situation A common situation is that there is a large set of candidate predictor variables.

Variable selection and model building Part I

Statement of situation A common situation is that there is a large set of candidate predictor variables. (Note: The examples herein are not really that large.) Goal is to choose a small subset from the larger set so that the resulting regression model is simple and useful: –provides a good summary of the trend in the response –and/or provides good predictions of response –and/or provides good estimates of slope coefficients

What if the regression equation contains “wrong” variables?

When is an estimate unbiased? An estimate is unbiased if the average of the values of the statistics determined from all possible random samples equals the parameter you’re trying to estimate. –An estimated regression coefficient b i is unbiased if the mean of all possible b i equals β i. –The predicted response is unbiased if the mean of all possible equals.

One of four possible situations The model is correctly specified: –The regression equation contains all relevant predictors, including necessary interaction terms and transformations. No redundant or extraneous predictors. –Leads to unbiased regression coefficients and unbiased predictions of the response. –MSE is an unbiased estimate of σ 2.

One of four possible situations The model is underspecified: –The regression equation is missing one or more important predictor variables. –Leads to biased regression coefficients and biased predictions of the response. –MSE is a biased (upward) estimate of σ 2.

A (likely) underspecified model Weight = -1.22 + 0.283 Height + 0.111 Water, MSE = 0.017 Weight = -4.14 + 0.389 Height, MSE = 0.653

One of four possible situations The model contains two or more extraneous variables: –The regression equation contains extraneous variables that are not related to the response or to any of the other predictors. –Leads to unbiased regression coefficients and unbiased predictions of the response. –MSE is an unbiased estimate of σ 2, but has fewer degrees of freedom associated with it.

One of four possible situations The model is overspecified: –The regression equation contains one or more redundant predictor variables. –Leads to unbiased regression coefficients and unbiased predictions of the response. –MSE is an unbiased estimate of σ 2. –Because of multicollinearity, the standard errors of the regression coefficients are inflated. –Model can be used, with caution, for prediction.

A goal, a strategy Know your research question. –Are there a few particular predictors of interest? –Most interested in summary description, in prediction or in effects of predictors? Identify all possible candidate predictors. –Don’t worry about functional form, such as x 2, log x, and interactions, yet.

A goal, a strategy (cont’d) Use variable selection procedures to find the middle ground between underspecified model and model with extraneous variables. Fine-tune the model to get a correctly specified model. –If necessary, change functional form of predictors and add interactions. –Check behavior of residuals.

Two basic methods of selecting predictors Stepwise regression: Enter and remove predictors, in a stepwise manner, until no justifiable reason to enter or remove more. Best subsets regression: Select the subset of predictors that do the best at meeting some well-defined objective criterion.

Two cautions! The list of candidate predictor variables must include all the variables that actually predict the response. There is no single criterion that will always be the best measure of the “best” regression equation.

Stepwise regression

Enter and remove predictors, in a stepwise manner, until there is no justifiable reason to enter or remove any more.

Example: Cement data Response y: heat evolved in calories during hardening of cement on a per gram basis Predictor x 1 : % of tricalcium aluminate Predictor x 2 : % of tricalcium silicate Predictor x 3 : % of tetracalcium alumino ferrite Predictor x 4 : % of dicalcium silicate

Example: Cement data

Stepwise regression: the idea Start with no predictors in the “stepwise model.” At each step, enter or remove a predictor based on partial F-tests (that is, the t-tests). Stop when no more predictors can be justifiably entered or removed from the stepwise model.

Stepwise regression: Preliminary steps 1.Specify an Alpha-to-Enter (α E = 0.15) significance level. 2.Specify an Alpha-to-Remove (α R = 0.15) significance level.

Stepwise regression: Step #1 1.Fit each of the one-predictor models, that is, regress y on x 1, regress y on x 2, …, regress y on x p-1. 2.The first predictor put in the stepwise model is the predictor that has the smallest t-test P-value (below α E = 0.15). 3.If no P-value < 0.15, stop.

Stepwise regression: Step #2 1.Suppose x 1 was the “best” one predictor. 2.Fit each of the two-predictor models with x 1 in the model, that is, regress y on (x 1, x 2 ), regress y on (x 1, x 3 ), …, and y on (x 1, x p-1 ). 3.The second predictor put in stepwise model is the predictor that has the smallest t-test P-value (below α E = 0.15). 4.If no P-value < 0.15, stop.

Stepwise regression: Step #2 (continued) 1.Suppose x 2 was the “best” second predictor. 2.Step back and check the P-value for β 1 = 0. If the P-value for β 1 = 0 has become not significant (above α R = 0.15), remove x 1 from the stepwise model.

Stepwise regression: Step #3 1.Suppose both x 1 and x 2 made it into the two-predictor stepwise model. 2.Fit each of the three-predictor models with x 1 and x 2 in the model, that is, regress y on (x 1, x 2, x 3 ), regress y on (x 1, x 2, x 4 ), …, and regress y on (x 1, x 2, x p-1 ).

Stepwise regression: Step #3 (continued) 1.The third predictor put in stepwise model is the predictor that has the smallest t-test P-value (below α E = 0.15). 2.If no P-value < 0.15, stop. 3.Step back and check P-values for β 1 = 0 and β 2 = 0. If either P-value has become not significant (above α R = 0.15), remove the predictor from the stepwise model.

Stepwise regression: Stopping the procedure The procedure is stopped when adding an additional predictor does not yield a t-test P-value below α E = 0.15.

Example: Cement data

Predictor Coef SE Coef T P Constant 81.479 4.927 16.54 0.000 x1 1.8687 0.5264 3.55 0.005 Predictor Coef SE Coef T P Constant 57.424 8.491 6.76 0.000 x2 0.7891 0.1684 4.69 0.001 Predictor Coef SE Coef T P Constant 110.203 7.948 13.87 0.000 x3 -1.2558 0.5984 -2.10 0.060 Predictor Coef SE Coef T P Constant 117.568 5.262 22.34 0.000 x4 -0.7382 0.1546 -4.77 0.001

Predictor Coef SE Coef T P Constant 103.097 2.124 48.54 0.000 x4 -0.61395 0.04864 -12.62 0.000 x1 1.4400 0.1384 10.40 0.000 Predictor Coef SE Coef T P Constant 94.16 56.63 1.66 0.127 x4 -0.4569 0.6960 -0.66 0.526 x2 0.3109 0.7486 0.42 0.687 Predictor Coef SE Coef T P Constant 131.282 3.275 40.09 0.000 x4 -0.72460 0.07233 -10.02 0.000 x3 -1.1999 0.1890 -6.35 0.000

Predictor Coef SE Coef T P Constant 71.65 14.14 5.07 0.001 x4 -0.2365 0.1733 -1.37 0.205 x1 1.4519 0.1170 12.41 0.000 x2 0.4161 0.1856 2.24 0.052 Predictor Coef SE Coef T P Constant 111.684 4.562 24.48 0.000 x4 -0.64280 0.04454 -14.43 0.000 x1 1.0519 0.2237 4.70 0.001 x3 -0.4100 0.1992 -2.06 0.070

Predictor Coef SE Coef T P Constant 52.577 2.286 23.00 0.000 x1 1.4683 0.1213 12.10 0.000 x2 0.66225 0.04585 14.44 0.000

Predictor Coef SE Coef T P Constant 71.65 14.14 5.07 0.001 x1 1.4519 0.1170 12.41 0.000 x2 0.4161 0.1856 2.24 0.052 x4 -0.2365 0.1733 -1.37 0.205 Predictor Coef SE Coef T P Constant 48.194 3.913 12.32 0.000 x1 1.6959 0.2046 8.29 0.000 x2 0.65691 0.04423 14.85 0.000 x3 0.2500 0.1847 1.35 0.209

Predictor Coef SE Coef T P Constant 52.577 2.286 23.00 0.000 x1 1.4683 0.1213 12.10 0.000 x2 0.66225 0.04585 14.44 0.000

Stepwise Regression: y versus x1, x2, x3, x4 Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15 Response is y on 4 predictors, with N = 13 Step 1 2 3 4 Constant 117.57 103.10 71.65 52.58 x4 -0.738 -0.614 -0.237 T-Value -4.77 -12.62 -1.37 P-Value 0.001 0.000 0.205 x1 1.44 1.45 1.47 T-Value 10.40 12.41 12.10 P-Value 0.000 0.000 0.000 x2 0.416 0.662 T-Value 2.24 14.44 P-Value 0.052 0.000 S 8.96 2.73 2.31 2.41 R-Sq 67.45 97.25 98.23 97.87 R-Sq(adj) 64.50 96.70 97.64 97.44 C-p 138.7 5.5 3.0 2.7

Caution about stepwise regression! Do not over-interpret the order in which predictors are entered into the model. Do not jump to the conclusion … –that all the important predictor variables for predicting y have been identified, or –that all the unimportant predictor variables have been eliminated.

Caution about stepwise regression! (cont’d) Many t-tests for β k = 0 are conducted in a stepwise regression procedure. The probability is high … –that we included some unimportant predictors –that we excluded some important predictors

Drawbacks of stepwise regression The final model is not guaranteed to be optimal in any specified sense. The procedure yields a single final model, although often several equally good models. It doesn’t take into account a researcher’s knowledge about the predictors. –If necessary, force the procedure to include important predictors.

Example: Modeling PIQ

Stepwise Regression: PIQ versus MRI, Height, Weight Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15 Response is PIQ on 3 predictors, with N = 38 Step 1 2 Constant 4.652 111.276 MRI 1.18 2.06 T-Value 2.45 3.77 P-Value 0.019 0.001 Height -2.73 T-Value -2.75 P-Value 0.009 S 21.2 19.5 R-Sq 14.27 29.49 R-Sq(adj) 11.89 25.46 C-p 7.3 2.0

The regression equation is PIQ = 111 + 2.06 MRI - 2.73 Height Predictor Coef SE Coef T P Constant 111.28 55.87 1.99 0.054 MRI 2.0606 0.5466 3.77 0.001 Height -2.7299 0.9932 -2.75 0.009 S = 19.51 R-Sq = 29.5% R-Sq(adj) = 25.5% Analysis of Variance Source DF SS MS F P Regression 2 5572.7 2786.4 7.32 0.002 Error 35 13321.8 380.6 Total 37 18894.6 Source DF Seq SS MRI 1 2697.1 Height 1 2875.6

Example: Modeling BP

Stepwise Regression: BP versus Age, Weight, BSA, Duration, Pulse, Stress Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15 Response is BP on 6 predictors, with N = 20 Step 1 2 3 Constant 2.205 -16.579 -13.667 Weight 1.201 1.033 0.906 T-Value 12.92 33.15 18.49 P-Value 0.000 0.000 0.000 Age 0.708 0.702 T-Value 13.23 15.96 P-Value 0.000 0.000 BSA 4.6 T-Value 3.04 P-Value 0.008 S 1.74 0.533 0.437 R-Sq 90.26 99.14 99.45 R-Sq(adj) 89.72 99.04 99.35 C-p 312.8 15.1 6.4

The regression equation is BP = - 13.7 + 0.702 Age + 0.906 Weight + 4.63 BSA Predictor Coef SE Coef T P Constant -13.667 2.647 -5.16 0.000 Age 0.70162 0.04396 15.96 0.000 Weight 0.90582 0.04899 18.49 0.000 BSA 4.627 1.521 3.04 0.008 S = 0.4370 R-Sq = 99.5% R-Sq(adj) = 99.4% Analysis of Variance Source DF SS MS F P Regression 3 556.94 185.65 971.93 0.000 Error 16 3.06 0.19 Total 19 560.00 Source DF Seq SS Age 1 243.27 Weight 1 311.91 BSA 1 1.77

Stepwise regression in Minitab Stat >> Regression >> Stepwise … Specify response and all possible predictors. If desired, specify predictors that must be included in every model. –(This is where researcher’s knowledge helps!!) Select OK. Results appear in session window.

Variable selection and model building Part I. Statement of situation A common situation is that there is a large set of candidate predictor variables.

Similar presentations

Presentation on theme: "Variable selection and model building Part I. Statement of situation A common situation is that there is a large set of candidate predictor variables."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Variable selection and model building Part I. Statement of situation A common situation is that there is a large set of candidate predictor variables.

Similar presentations

Presentation on theme: "Variable selection and model building Part I. Statement of situation A common situation is that there is a large set of candidate predictor variables."— Presentation transcript:

Similar presentations

About project

Feedback