Topic 18: Model Selection and Diagnostics

Variable Selection
- We want to choose a “best” model that is a subset of the available explanatory variables
- Two separate problems:
  - How many explanatory variables should we use (i.e., subset size)?
  - Given the subset size, which variables should we choose?

KNNL Example
- Page 350, Section 9.2
- Y: survival time of patient (liver operation)
- X’s (explanatory variables):
  - Blood clotting score
  - Prognostic index
  - Enzyme function test
  - Liver function test

KNNL Example cont.
- n = 54 patients
- Start with the usual plots and descriptive statistics
- Time-to-event data are often heavily skewed and typically log-transformed

Data
- Tab-delimited file
- lsurv = ln(surv)
- alcmod and alcheavy are dummy variables for alcohol use

data a1;
  infile 'U:\.www\datasets512\CH09TA01.txt' delimiter='09'x;
  input blood prog enz liver age gender alcmod alcheavy surv lsurv;
run;

Data
Obs  blood  prog  enz  liver  age  surv  lsurv
  1    6.7    62   81   2.59   50   695  6.544
  2    5.1    59   66   1.70   39   403  5.999
  3    7.4    57   83   2.16   55   710  6.565
  4    6.5    73   41   2.01   48   349  5.854
  5    7.8    65  115   4.30   45  2343  7.759
  6    5.8    38   72   1.42        348  5.852
(the gender and alcohol-use columns, and the age value for observation 6, were lost in transcription)

Log Transform of Y
- Recall that the regression model does not require Y itself to be Normally distributed (the Normality assumption is on the error terms)
- In this case, the transform reduces the influence of the long right tail and often stabilizes the variance of the residuals
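As an aside, the KNNL file already contains lsurv, but if it did not, the transform could be computed in the data step; a minimal sketch:

data a1;
  set a1;
  lsurv = log(surv);   * log() is the natural log in SAS;
run;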

Scatterplots
proc corr data=a1 plots=matrix;
  var blood prog enz liver;
run;

proc corr data=a1 plots=scatter;
  var blood prog enz liver;
  with lsurv;
run;

Correlation Summary
Pearson Correlation Coefficients, N = 54
Prob > |r| under H0: Rho=0

         blood     prog      enz      liver
lsurv   0.24619  0.46994  0.65389   0.64926
        0.0727   0.0003   <.0001    <.0001

The Two Problems in Variable Selection
- To determine an appropriate subset size: might use adjusted R2, Cp, MSE, PRESS, AIC, SBC (BIC)
- To determine the best model of that fixed size: might use R2

Adjusted R2
- R2 by its construction can never decrease as variables are added: SSE cannot increase with an additional X, and SSTO stays constant
- Adjusted R2 uses the degrees of freedom to penalize for the number of parameters p

Adjusted R2
- Want to find the model that maximizes
  Ra² = 1 − ((n−1)/(n−p))·(SSE/SSTO) = 1 − MSE/MSTO
- Since MSTO remains constant for a given data set (it depends only on Y), this carries equivalent information to MSE
- Thus we could equivalently choose the model that minimizes MSE
- Details on pages 354-356

Cp Criterion
- The basic idea is to compare subset models with the full model
- A subset model is good if there is no substantial “bias” in the predicted values (relative to the full model)
- Looks at the ratio of the total mean squared error to the true error variance
- See pages 357-359 for details

Cp Criterion
- Cp = SSEp/MSE(full) − (n − 2p), where SSEp is based on a specific choice of p−1 variables and MSE is based on the full set of variables
- If we select the full set, SSEp/MSE = n − p, so Cp = (n−p) − (n−2p) = p

Use of Cp
- p is the number of regression coefficients, including the intercept
- A model is good according to this criterion if Cp ≤ p
- Rule: pick the smallest model for which Cp is smaller than p, or pick the model that minimizes Cp, provided that minimum is smaller than p
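As a quick check with the KNNL data (n = 54): for the full four-variable model we have p = 5, so

  Cp = (n − p) − (n − 2p) = (54 − 5) − (54 − 10) = 49 − 44 = 5,

which matches the C(p) = 5.0000 row for the four-variable model in the selection output below.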

SBC (BIC) and AIC
- Criteria based on log(likelihood) plus a penalty for model complexity
- AIC: minimize AICp = n·ln(SSEp) − n·ln(n) + 2p
- SBC: minimize SBCp = n·ln(SSEp) − n·ln(n) + p·ln(n)
- SBC’s penalty grows with n, so it tends to favor smaller models than AIC

Other approaches
- PRESS (prediction sum of squares): for each case i,
  - delete the case and predict Yi using a model based on the other n−1 cases
  - accumulate the squared differences between observed and predicted values
- Want to minimize PRESS
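A sketch of one way to compute PRESS in SAS: the PRESS= keyword on proc reg’s OUTPUT statement stores the deleted residuals, and summing their squares gives the statistic.

proc reg data=a1;
  model lsurv = blood prog enz liver;
  output out=a2 press=dres;    * deleted residual ei/(1-hii) for each case;
run; quit;

data a3;
  set a2;
  dres2 = dres**2;             * squared deleted residual;
run;

proc means data=a3 sum;
  var dres2;                   * the sum is the PRESS statistic;
run;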

Variable Selection
Additional proc reg MODEL statement options useful in variable selection:
- INCLUDE=n forces the first n explanatory variables into all models
- BEST=n limits the output to the best n models of each subset size (or overall)
- START=n limits output to models that include at least n explanatory variables

Variable Selection
Step-type procedures:
- Forward selection (step up)
- Backward elimination (step down)
- Stepwise (forward selection with a backward glance)
These are very popular, but much better search techniques such as BEST= are now available; a sketch of the stepwise syntax follows.
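The SLENTRY= and SLSTAY= significance levels shown in this sketch are arbitrary choices, not recommendations:

proc reg data=a1;
  model lsurv = blood prog enz liver
        / selection=stepwise slentry=0.10 slstay=0.15;
run; quit;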

Ordering models of the same subset size
- Use R2 or SSE
- This approach can lead us to consider several models that give approximately the same predicted values
- May need to apply knowledge of the subject matter to make a final selection
- Less important if prediction is the key goal

Proc Reg
proc reg data=a1;
  model lsurv = blood prog enz liver
        / selection=rsquare cp aic sbc b best=3;
run;

Selection Results
Number in
Model    R-Square    C(p)       AIC         SBC
  1       0.4276    66.4889   -103.8269    -99.84889
  1       0.4215    67.7148   -103.2615    -99.28357
  1       0.2208   108.5558    -87.1781    -83.20011
  2       0.6633    20.5197   -130.4833   -124.51634
  2       0.5995    33.5041   -121.1126   -115.14561
  2       0.5486    43.8517   -114.6583   -108.69138
  3       0.7573     3.3905   -146.1609   -138.20494
  3       0.7178    11.4237   -138.0232   -130.06723
  3       0.6121    32.9320   -120.8442   -112.88823
  4       0.7592     5.0000   -144.5895   -134.64461

Selection Results (parameter estimates)
Number in
Model   Intercept    blood     prog      enz      liver
  1      5.26426     .         .        0.01512   .
  1      5.61218     .         .        .         0.29819
  1      5.56613     .        0.01367   .         .
  2      4.35058     .        0.01412   0.01539   .
  2      5.02818     .         .        0.01073   0.20945
  2      4.54623    0.10792    .        0.01634   .
  3      3.76618    0.09546   0.01334   0.01645   .
  3      4.40582     .        0.01101   0.01261   0.12977
  3      4.78168    0.04482    .        0.01220   0.16360
  4      3.85195    0.08368   0.01266   0.01563   0.03216

Proc Reg
proc reg data=a1;
  model lsurv = blood prog enz liver
        / selection=cp aic sbc b best=3;
run;

Selection Results
Number in
Model     C(p)     R-Square     AIC         SBC
  3      3.3905     0.7573   -146.1609   -138.20494
  4      5.0000     0.7592   -144.5895   -134.64461
  3     11.4237     0.7178   -138.0232   -130.06723

WARNING: selection=cp just lists the models in order of lowest C(p), regardless of whether a model is good or not

How to Choose with C(p)
- Want small C(p)
- Want C(p) near p
- The original paper suggested plotting C(p) versus p and considering the smallest model that satisfies these criteria
- Can be somewhat subjective when determining “near”

Proc Reg
* outest=b1 creates a data set with the estimates and criteria;
proc reg data=a1 outest=b1;
  model lsurv = blood prog enz liver
        / selection=rsquare cp aic sbc b;
run; quit;

symbol1 v=circle i=none;
symbol2 v=none i=join;
proc gplot data=b1;
  plot _Cp_*_P_ _P_*_P_ / overlay;
run;

(Plot of C(p) versus p with the reference line C(p) = p: the models start to approach the C(p) = p line around p = 4)

Model Validation
- Since the same data were used to generate the parameter estimates, you’d expect the model to predict the fitted Y’s well
- Want to check the model’s predictive ability on a separate data set
- Various techniques of cross-validation (data splitting, leave-one-out) are possible; a data-splitting sketch follows
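A minimal data-splitting sketch, assuming proc surveyselect is available to draw the random split (the seed and the 50/50 split are arbitrary choices):

* Flag a random half of the cases for training (Selected = 1);
proc surveyselect data=a1 out=split samprate=0.5 outall seed=12345;
run;

* Fit the model on the training half and keep the estimates;
proc reg data=split outest=trainest;
  where Selected = 1;
  model lsurv = blood prog enz liver;
run; quit;

* Score the holdout half with the training estimates;
proc score data=split score=trainest out=scored type=parms;
  where Selected = 0;
  var blood prog enz liver;
run;

Comparing the holdout predictions with the observed lsurv values (for example, via the mean squared prediction error) indicates how well the model generalizes.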

Regression Diagnostics
- Partial regression plots
- Studentized deleted residuals
- Hat matrix diagonals
- DFFITS, Cook’s D, DFBETAS
- Variance inflation factor
- Tolerance

KNNL Example
- Page 386, Section 10.1
- Y is the amount of life insurance
- X1 is average annual income
- X2 is a risk aversion score
- n = 18 managers

Read in the data set
data a1;
  infile '../data/ch10ta01.txt';
  input income risk insur;
run;

Partial regression plots
- Also called added variable plots or adjusted variable plots
- One plot for each Xi

Partial regression plots
- These plots show the strength of the marginal relationship between Y and Xi in the full model
- They can also detect:
  - Nonlinear relationships
  - Heterogeneous variances
  - Outliers

Partial regression plots
Consider the plot for X1:
- Use the other X’s to predict Y
- Use the other X’s to predict X1
- Plot the residuals from the first regression versus the residuals from the second regression

The partial option with proc reg and plots=
proc reg data=a1 plots=partial;
  model insur = income risk / partial;
run;

Output
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2       173919          86960       542.33   <.0001
Error             15      2405.1476       160.3431
Corrected Total   17       176324

Root MSE          12.66267    R-Square   0.9864
Dependent Mean   134.44444    Adj R-Sq   0.9845
Coeff Var          9.41851

Output
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Tolerance
Intercept    1       -205.71866          11.39268       -18.06    <.0001       .
income       1          6.28803           0.20415        30.80    <.0001     0.93524
risk         1          4.73760           1.37808         3.44    0.0037     0.93524

Output
- The partial option on the model statement in proc reg generates graphs in the output window
- These are OK for some purposes, but we prefer better-looking plots
- To generate those, we follow the regression steps outlined earlier and use gplot, or use the plots= option shown above

Partial regression plots
* partial regression plot for risk;
proc reg data=a1;
  model insur risk = income;
  output out=a2 r=resins resris;
run; quit;

symbol1 v=circle i=sm70;
proc gplot data=a2;
  plot resins*resris;
run;

The plot for risk

Partial plot for income (code not shown)

Residual plot (vs risk)
proc reg data=a1;
  model insur = risk income;
  output out=a2 r=resins;
run; quit;

symbol1 v=circle i=sm70;
proc sort data=a2; by risk; run;
proc gplot data=a2;
  plot resins*risk;
run;

Residuals vs Risk

Residual plot (vs income)
proc sort data=a2; by income; run;
proc gplot data=a2;
  plot resins*income;
run;

Residuals vs Income

Other “Residuals”
- There are several versions of residuals
- Our usual residuals: ei = Yi − Ŷi
- Studentized residuals: ei divided by its estimated standard error, s(ei) = sqrt(MSE·(1 − hii))
- These are approximately distributed t(n−p) (≈ Normal)

Other “Residuals”
- Studentized deleted residuals:
  - Delete case i and refit the model
  - Compute the predicted value for case i using this refitted model
  - Compute the “studentized residual” for that deleted prediction
- We don’t literally refit n times; a computational shortcut gives the same result (shown below)

Studentized Deleted Residuals
- We use the notation (i) to indicate that case i has been deleted from the model fit computations
- di = Yi − Ŷi(i) is the deleted residual
- It turns out that di = ei/(1 − hii)
- Also, Var(di) = Var(ei)/(1 − hii)², estimated by s²(di) = MSE(i)/(1 − hii)
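Putting these pieces together gives a closed form that avoids any refitting, a standard result (KNNL eq. 10.24):

  t_i = \frac{d_i}{\sqrt{MSE_{(i)}/(1-h_{ii})}}
      = e_i \left[ \frac{n-p-1}{SSE(1-h_{ii}) - e_i^2} \right]^{1/2}

This is exactly what the RStudent column from the influence option reports.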

Residuals
When we examine the residuals, regardless of version, we are looking for:
- Outliers
- Non-normal error distributions
- Influential observations

The r option and studentized residuals
proc reg data=a1;
  model insur = income risk / r;
run;

Output
Obs   Student Residual
 1       -1.206
 2       -0.910
 3        2.121
 4       -0.363
 5       -0.210

The influence option and studentized deleted residuals
proc reg data=a1;
  model insur = income risk / influence;
run;

Output
Obs    Residual    RStudent
 1     -14.7311     -1.2259
 2     -10.9321     -0.9048
 3      24.1845      2.4487
 4      -4.2780     -0.3518
 5      -2.5522     -0.2028
 6      10.3417      1.0138

Hat matrix diagonals
- hii is a measure of how much Yi contributes to its own prediction:
  Ŷi = hi1Y1 + hi2Y2 + … + hiiYi + … + hinYn
- hii is sometimes called the leverage of the ith observation
- It is a measure of the distance between the X values for the ith case and the means of the X values

Hat matrix diagonals
- 0 ≤ hii ≤ 1 and Σ hii = p
- A large value of hii suggests that the ith case is distant from the center of all X’s
- The average value is p/n; values far above this average point to cases that should be examined carefully
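For reference, the matrix form behind these facts (standard notation, with x_i the column vector holding the ith row of X):

  H = X(X'X)^{-1}X', \qquad \hat{Y} = HY, \qquad h_{ii} = x_i'(X'X)^{-1}x_i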

Influence option gives hat diagonals
Obs      H
 1    0.0693
 2    0.1006
 3    0.1890
 4    0.1316
 5    0.0756

DFFITS
- A measure of the influence of case i on Ŷi (a single fitted value)
- Thus it is closely related to hii
- It is a standardized version of the difference between Ŷi computed with and without case i
- Concern if |DFFITS| is greater than 1 for small data sets, or greater than 2·sqrt(p/n) for large data sets
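In formula form (KNNL eq. 10.30):

  DFFITS_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)}\,h_{ii}}} = t_i \sqrt{\frac{h_{ii}}{1-h_{ii}}}

so a case is influential on its own fitted value when both its studentized deleted residual and its leverage are large.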

Cook’s Distance
- A measure of the influence of case i on all of the fitted values (all the cases)
- It is a standardized version of the sum of squared differences between the fitted values computed with and without case i
- Compare with the F(p, n−p) distribution
- Concern if the distance is above the 50th percentile of that distribution
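In formula form (KNNL eq. 10.33), again computable without refitting:

  D_i = \frac{\sum_j (\hat{Y}_j - \hat{Y}_{j(i)})^2}{p \, MSE} = \frac{e_i^2}{p \, MSE} \cdot \frac{h_{ii}}{(1-h_{ii})^2}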

DFBETAS
- A measure of the influence of case i on each of the regression coefficients
- It is a standardized version of the difference between the regression coefficient computed with and without case i
- Concern if |DFBETAS| is greater than 1 in small data sets, or greater than 2/sqrt(n) in large data sets
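In formula form (KNNL eq. 10.35), where c_{kk} is the kth diagonal element of (X'X)^{-1}:

  DFBETAS_{k(i)} = \frac{b_k - b_{k(i)}}{\sqrt{MSE_{(i)}\,c_{kk}}}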

Variance Inflation Factor
- The VIF is related to the variance of the estimated regression coefficients
- We calculate it for each explanatory variable (the VIF option on the model statement prints it)
- One suggested rule is that a VIF of 10 or more indicates excessive multicollinearity
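In formula form, with R_k^2 the squared multiple correlation from regressing X_k on the other explanatory variables:

  (VIF)_k = \frac{1}{1 - R_k^2}

so VIF = 10 corresponds to R_k^2 = 0.9, i.e., 90% of the variation in X_k is explained by the other X’s.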

Tolerance
- TOL = 1 − R2k, where R2k is the squared multiple correlation obtained when all the other explanatory variables are used to predict Xk
- TOL = 1/VIF
- Described in a comment on page 410

Output
Variable    Tolerance
Intercept       .
income      0.93524
risk        0.93524

Full diagnostics
proc reg data=a1;
  model insur = income risk / r partial influence tol;
  id income risk;
  plot r.*(income risk);
run;

Plot statement inside Reg
- Can generate several plots within proc reg
- Need to know the symbol names; these are available in Table 1 once you click on the PLOT command inside the REG syntax documentation
- r. represents the usual residuals
- rstudent. represents the deleted residuals
- p. represents the predicted values

Last slide
- We went over KNNL Chapters 9 and 10
- We used program topic18.sas to generate the output