# Review of Univariate Linear Regression BMTRY 726 3/4/14.



Regression Analysis

We are interested in predicting values of one or more responses from a set of predictors. Regression analysis is an extension of what we discussed with ANOVA and MANOVA. We now allow for inclusion of continuous predictors in place of (or in addition to) the treatment indicators in MANOVA.

Univariate Regression Analysis

A univariate regression model states that the response $Y$ is composed of a mean that depends on a set of independent predictors $z_i$, plus a random error $\varepsilon_i$. Model assumptions:
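
The slide's equations did not survive in this transcript. As a sketch, the standard matrix formulation is given below. The symbols $n$ (observations), $r$ (predictors) and $\beta$ are assumptions here, chosen to match the $Z$ notation used in the collinearity slides:

$$
Y = Z\beta + \varepsilon, \qquad E(\varepsilon) = 0, \qquad \operatorname{Cov}(\varepsilon) = \sigma^2 I_n
$$

Here $Y$ is $n \times 1$, $Z$ is the $n \times (r+1)$ design matrix whose first column is all ones, and $\beta$ is $(r+1) \times 1$. For the likelihood ratio tests later in the deck, the errors are usually also taken to be normal, $\varepsilon \sim N_n(0, \sigma^2 I_n)$.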

Least Squares Estimation

We use the method of least squares to estimate our model parameters.
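
The estimator itself is not preserved in the transcript. A minimal sketch, assuming the usual closed form $\hat\beta = (Z'Z)^{-1}Z'Y$ and using simulated data (the names `z1`, `z2`, `y` are illustrative, not from the slides), might look like this:

```r
# Simulated data; all variable names here are illustrative assumptions.
set.seed(726)
n  <- 50
z1 <- rnorm(n)
z2 <- rnorm(n)
y  <- 1 + 2 * z1 - 0.5 * z2 + rnorm(n)

# Design matrix with an intercept column
Z <- cbind(1, z1, z2)

# Normal equations: solve (Z'Z) b = Z'y for b
beta_hat <- solve(crossprod(Z), crossprod(Z, y))

# lm() should give the same estimates (it uses a QR decomposition internally)
fit <- lm(y ~ z1 + z2)
print(cbind(normal_eq = drop(beta_hat), lm = coef(fit)))
```

Solving the normal equations directly is fine for illustration; in practice `lm()` is preferable because the QR approach is more stable when predictors are nearly collinear.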

Least Squares Estimation

We estimate the variance using the residuals.
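
As a sketch, assuming the usual unbiased estimator $s^2 = \hat\varepsilon'\hat\varepsilon / (n - r - 1)$ with $r$ predictors plus an intercept, the calculation on simulated data (illustrative names only) is:

```r
# Simulated data; names and coefficients are illustrative assumptions.
set.seed(726)
n  <- 50
z1 <- rnorm(n)
z2 <- rnorm(n)
y  <- 1 + 2 * z1 - 0.5 * z2 + rnorm(n)

fit <- lm(y ~ z1 + z2)
res <- residuals(fit)

r  <- 2                             # number of predictors (excluding intercept)
s2 <- sum(res^2) / (n - r - 1)      # residual sum of squares over residual df

# Compare with the square of the "Residual standard error" reported by summary()
c(s2_manual = s2, s2_from_summary = summary(fit)$sigma^2)
```

The divisor $n - r - 1$ is the residual degrees of freedom that `summary()` prints next to the residual standard error.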

Least Squares Estimation

Geometry of Least Squares
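
The figure for this slide is not preserved. One common reading of the geometry, offered here as an assumption about what the slide showed, is that the fitted values are the orthogonal projection of $Y$ onto the column space of $Z$. A small numerical check on simulated data:

```r
# Simulated design and response; all names are illustrative assumptions.
set.seed(726)
n <- 30
Z <- cbind(1, rnorm(n), rnorm(n))
y <- drop(Z %*% c(1, 2, -0.5)) + rnorm(n)

# Hat (projection) matrix H = Z (Z'Z)^{-1} Z'
H     <- Z %*% solve(crossprod(Z)) %*% t(Z)
y_hat <- drop(H %*% y)   # projection of y onto the columns of Z
e_hat <- y - y_hat       # residual vector

# Each quantity below should be zero up to rounding error
orth_check  <- max(abs(crossprod(Z, e_hat)))              # residuals are orthogonal to Z
idem_check  <- max(abs(H %*% H - H))                      # projecting twice changes nothing
pytha_check <- sum(y^2) - (sum(y_hat^2) + sum(e_hat^2))   # y'y = yhat'yhat + e'e
print(c(orth = orth_check, idem = idem_check, pythagoras = pytha_check))
```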

LRT for individual $\beta_i$'s

First we may test whether any predictors affect the response. The LRT is based on the difference in sums of squares between the full and null models…

LRT for individual $\beta_i$'s

Difference in SS between the full and null models…
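
The test statistic on these slides is not preserved. Under normal errors, the likelihood ratio test comparing nested linear models is equivalent to an F test on the change in residual sum of squares. A sketch of the full-versus-null comparison on simulated data (variable names are illustrative assumptions):

```r
# Simulated data in which only z1 truly affects the response.
set.seed(726)
n  <- 60
z1 <- rnorm(n)
z2 <- rnorm(n)
z3 <- rnorm(n)
y  <- 1 + 2 * z1 + rnorm(n)

full <- lm(y ~ z1 + z2 + z3)
null <- lm(y ~ 1)                 # intercept-only model

ss_full <- sum(residuals(full)^2)
ss_null <- sum(residuals(null)^2)

df_num <- df.residual(null) - df.residual(full)   # number of coefficients tested
df_den <- df.residual(full)

F_stat <- ((ss_null - ss_full) / df_num) / (ss_full / df_den)
p_val  <- pf(F_stat, df_num, df_den, lower.tail = FALSE)

# anova() on the two nested fits reports the same F statistic and p-value
print(anova(null, full))
```

The same `anova(reduced, full)` call works for any pair of nested models, not only for the null model.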

Model Building

If we have a large number of predictors, we want to identify the “best” subset. There are many methods of selecting the “best”:

- Examine all possible subsets of predictors
- Forward stepwise selection
- Backward stepwise selection
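
As a rough illustration of the stepwise options, base R's `step()` can run both directions. Note that `step()` adds or drops terms by AIC rather than by p-values, which may differ from the procedure used in the course. The data below are simulated and the names are illustrative assumptions:

```r
# Simulated data: only z1 and z2 truly affect y.
set.seed(726)
n   <- 100
dat <- data.frame(z1 = rnorm(n), z2 = rnorm(n), z3 = rnorm(n), z4 = rnorm(n))
dat$y <- 1 + 2 * dat$z1 - 1.5 * dat$z2 + rnorm(n)

full <- lm(y ~ z1 + z2 + z3 + z4, data = dat)
null <- lm(y ~ 1, data = dat)

# Backward: start from the full model and drop terms while AIC improves
backward <- step(full, direction = "backward", trace = 0)

# Forward: start from the null model and add terms from the full formula
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

print(formula(backward))
print(formula(forward))
```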

Model Building

Though we can consider predictors that are significant, this may not yield the “best” subset (some models may yield similar results). The “best” choice is made by examining some criterion:

- $R^2$
- adjusted $R^2$
- Mallows' $C_p$
- AIC

Since $R^2$ never decreases as predictors are added, Mallows' $C_p$ and AIC are better choices for selecting the “best” predictor subset.

Model Building

Mallows' $C_p$:

- Plot the pairs $(p, C_p)$.
- Good models have coordinates near the 45° line.

Akaike's Information Criterion:

- Smaller is better.
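
The formula for $C_p$ is not preserved in the transcript. A sketch, assuming the common definition $C_p = SS_{res}(p) / s^2_{full} - (n - 2p)$ with $p$ counting the intercept, evaluated over all subsets of a small simulated predictor set (names are illustrative assumptions):

```r
# Simulated data: only z1 and z2 truly affect y.
set.seed(726)
n   <- 100
dat <- data.frame(z1 = rnorm(n), z2 = rnorm(n), z3 = rnorm(n))
dat$y <- 1 + 2 * dat$z1 - 1.5 * dat$z2 + rnorm(n)

preds   <- c("z1", "z2", "z3")
full    <- lm(y ~ z1 + z2 + z3, data = dat)
s2_full <- summary(full)$sigma^2          # error variance estimate from the full model

# All non-empty subsets of the predictors, as a flat list of character vectors
subsets <- unlist(
  lapply(seq_along(preds), function(k) combn(preds, k, simplify = FALSE)),
  recursive = FALSE
)

results <- do.call(rbind, lapply(subsets, function(vars) {
  fit    <- lm(reformulate(vars, response = "y"), data = dat)
  p      <- length(coef(fit))             # parameters, including the intercept
  ss_res <- sum(residuals(fit)^2)
  data.frame(model = paste(vars, collapse = " + "),
             p     = p,
             Cp    = ss_res / s2_full - (n - 2 * p),
             AIC   = AIC(fit))
}))
print(results)
```

By construction the full model always has $C_p = p$, so it sits exactly on the 45° line; the interest is in smaller models that also land near it.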

Model Checking

It is always good to check whether the model is “correct” before using it to make decisions… Information about fit is contained in the residuals. If the model fits well, the estimated error terms should mimic $N(0, \sigma^2)$. So how can we check?

Model Checking

1. Studentized residuals plot
2. Plot residuals versus predicted values
   - Ideally points should be scattered (i.e. no pattern)
   - If a pattern exists, it can show something about the problem

Model Checking

3. Plot residuals versus predictors
4. QQ plot of studentized residuals
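
The four checks listed above can be sketched with base R graphics. This example uses simulated data with illustrative names, and it uses `rstudent()` (externally studentized residuals); the slides may have used a different variant such as `rstandard()`:

```r
# Simulated data; names and coefficients are illustrative assumptions.
set.seed(726)
n   <- 80
dat <- data.frame(z1 = rnorm(n), z2 = rnorm(n))
dat$y <- 1 + 2 * dat$z1 - 0.5 * dat$z2 + rnorm(n)
fit <- lm(y ~ z1 + z2, data = dat)

r_stud <- rstudent(fit)   # externally studentized residuals

op <- par(mfrow = c(2, 2))

# 1. Studentized residuals by observation index
plot(r_stud, ylab = "Studentized residual", main = "1. Index plot")
abline(h = 0, lty = 2)

# 2. Residuals versus predicted values: look for curvature or funnel shapes
plot(fitted(fit), r_stud, xlab = "Predicted value",
     ylab = "Studentized residual", main = "2. Versus predicted")
abline(h = 0, lty = 2)

# 3. Residuals versus one predictor (repeat for each predictor)
plot(dat$z1, r_stud, xlab = "z1",
     ylab = "Studentized residual", main = "3. Versus predictor")
abline(h = 0, lty = 2)

# 4. Normal QQ plot of the studentized residuals
qqnorm(r_stud, main = "4. QQ plot")
qqline(r_stud)

par(op)
```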

Model Checking

While residual analysis is useful, it may miss outliers, i.e. observations that are very influential on predictions.

Leverage:

- How far is the $j$th observation from the others?
- How much pull does $j$ exert on the fit?

Observations that affect inferences are influential.
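
A minimal sketch of these two ideas uses `hatvalues()` for leverage and `cooks.distance()` as one possible influence measure (the slides do not say which measure was used). The data are simulated, with one deliberately planted unusual point, and the $2p/n$ cutoff is only a common rule of thumb:

```r
# Simulated data; names are illustrative assumptions.
set.seed(726)
n   <- 50
dat <- data.frame(z1 = rnorm(n))
dat$y <- 1 + 2 * dat$z1 + rnorm(n)

# Plant one observation far from the others in z1 and well off the trend
dat$z1[n] <- 8
dat$y[n]  <- 0

fit  <- lm(y ~ z1, data = dat)
lev  <- hatvalues(fit)        # diagonal of the hat matrix
cook <- cooks.distance(fit)   # change in fitted values when a point is deleted

p <- length(coef(fit))
print(sum(lev))               # leverages always sum to p
print(which(lev > 2 * p / n)) # observations flagged by the rule of thumb
print(which.max(cook))        # most influential observation by Cook's distance
```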

Collinearity

If $Z$ is not of full rank, some linear combination $Za$ of the columns of $Z$, with $a \neq 0$, equals $0$. In that case the columns are collinear and the inverse of $Z'Z$ does not exist. It is rare that $Za = 0$ exactly, but if some combination is nearly zero, $(Z'Z)^{-1}$ is numerically unstable. This results in very large estimated variances for the model parameters, making it difficult to identify significant regression coefficients.

Collinearity

We can check the severity of multicollinearity using the variance inflation factor (VIF).
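
The VIF formula is not preserved in the transcript. A sketch, assuming the standard definition $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the remaining predictors, computed in base R on simulated data (names are illustrative assumptions):

```r
# Simulated predictors: z2 is built to be strongly correlated with z1.
set.seed(726)
n  <- 200
z1 <- rnorm(n)
z2 <- z1 + rnorm(n, sd = 0.3)
z3 <- rnorm(n)
dat <- data.frame(z1, z2, z3)

# Regress each predictor on the others and convert R^2 to a VIF
vif_manual <- sapply(names(dat), function(v) {
  others <- setdiff(names(dat), v)
  r2 <- summary(lm(reformulate(others, response = v), data = dat))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))

# Equivalent shortcut: the diagonal of the inverse correlation matrix
print(round(diag(solve(cor(dat))), 2))
```

Values near 1 indicate little collinearity. A cutoff of about 10 is often quoted as a sign of a serious problem, though that threshold is a convention rather than a rule.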

Misspecified Model

If important predictors are omitted, the vector of estimated regression coefficients may be biased. The estimates are biased unless the columns of $Z_1$ (the included predictors) and $Z_2$ (the omitted predictors) are independent.
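
The derivation behind this statement is not preserved in the transcript. A sketch under the usual setup, assuming the true model splits the design matrix into included columns $Z_1$ and omitted columns $Z_2$ (the coefficient symbols $\beta_{(1)}$, $\beta_{(2)}$ are assumptions):

$$
Y = Z_1\beta_{(1)} + Z_2\beta_{(2)} + \varepsilon, \qquad \hat\beta_{(1)} = (Z_1'Z_1)^{-1}Z_1'Y
$$

$$
E\big(\hat\beta_{(1)}\big) = (Z_1'Z_1)^{-1}Z_1'\big(Z_1\beta_{(1)} + Z_2\beta_{(2)}\big) = \beta_{(1)} + (Z_1'Z_1)^{-1}Z_1'Z_2\,\beta_{(2)}
$$

The bias term vanishes when $Z_1'Z_2 = 0$, that is, when the included and omitted columns are orthogonal, or when $\beta_{(2)} = 0$.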

Example

Develop a model to predict percent body fat (PBF) using:

- Age, Weight, Height, Chest, Abdomen, Hip, Arm, Wrist

Our full model is `lm(formula = PBF ~ Age + Wt + Ht + Chest + Abd + Hip + Arm + Wrist, data = SSbodyfat)`

Coefficients:

| | Estimate | Std. Error | t value | Pr(>\|t\|) | |
|---|---:|---:|---:|---:|---|
| (Intercept) | -16.67443 | 15.64951 | -1.065 | 0.287711 | |
| Age | 0.03962 | 0.02983 | 1.328 | 0.185322 | |
| Wt | -0.07582 | 0.04704 | -1.612 | 0.108274 | |
| Ht | -0.10940 | 0.09398 | -1.164 | 0.245502 | |
| Chest | -0.06304 | 0.09758 | -0.646 | 0.518838 | |
| Abd | 0.94626 | 0.08556 | 11.059 | < 2e-16 | `***` |
| Hip | -0.07802 | 0.13584 | -0.574 | 0.566263 | |
| Arm | 0.50335 | 0.18791 | 2.679 | 0.007896 | `**` |
| Wrist | -1.78671 | 0.50204 | -3.559 | 0.000448 | `***` |

Residual standard error: 4.347 on 243 degrees of freedom. Multiple R-squared: 0.7388, Adjusted R-squared: 0.7303. F-statistic: 85.94 on 8 and 243 DF, p-value: < 2.2e-16.

LRT

What if we want to test whether the 4 least significant predictors in the model can be removed?
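
The slide's test output is not preserved. An approximate version can be reconstructed from the rounded figures printed in the two `summary()` outputs of this example: the full model above, and the four-predictor model (Wt, Abd, Arm, Wrist) reported later, which is what remains after dropping Age, Ht, Chest and Hip. Because the inputs are rounded, the result is only approximate:

```r
# Residual standard errors and residual df copied from the summary() outputs
s_full <- 4.347; df_full <- 243   # full model with 8 predictors
s_red  <- 4.343; df_red  <- 247   # reduced model: Wt, Abd, Arm, Wrist

# Recover residual sums of squares: SS_res = s^2 * residual df
ss_full <- s_full^2 * df_full
ss_red  <- s_red^2 * df_red

q      <- df_red - df_full        # number of predictors dropped (4)
F_stat <- ((ss_red - ss_full) / q) / (ss_full / df_full)
p_val  <- pf(F_stat, q, df_full, lower.tail = FALSE)

print(c(F_stat = F_stat, p_value = p_val))
```

With these rounded inputs the statistic comes out a little below 1, so there is no evidence that the four dropped predictors improve the fit. With access to the data, `anova(mod4, full_model)` would give the exact values; `full_model` is a hypothetical name, since the slides do not name the full fit.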

Model Subset Selection

First consider the plot of $SS_{res}$ for all possible subsets of the eight predictors.

Model Subset Selection

What about Mallows' $C_p$ and AIC?

Model

Say we choose the model with 4 predictors. Output of `summary(mod4)`:

Call: `lm(formula = PBF ~ Wt + Abd + Arm + Wrist, data = SSbodyfat, x = T)`

Coefficients:

| | Estimate | Std. Error | t value | Pr(>\|t\|) | |
|---|---:|---:|---:|---:|---|
| (Intercept) | -34.85407 | 7.24500 | -4.811 | 2.62e-06 | `***` |
| Wt | -0.13563 | 0.02475 | -5.480 | 1.05e-07 | `***` |
| Abd | 0.99575 | 0.05607 | 17.760 | < 2e-16 | `***` |
| Arm | 0.47293 | 0.18166 | 2.603 | 0.009790 | `**` |
| Wrist | -1.50556 | 0.44267 | -3.401 | 0.000783 | `***` |

Residual standard error: 4.343 on 247 degrees of freedom. Multiple R-squared: 0.735, Adjusted R-squared: 0.7307. F-statistic: 171.3 on 4 and 247 DF, p-value: < 2.2e-16.

Model Check

Say we choose the model with 4 predictors. We need to check our regression diagnostics.

Model Check

What about our predictors versus the residuals?

Model Check

What about influential points?

`SSbodyfat[c(39,175,159,206,36),]`

| Obs | PBF | Wt | Abd | Arm | Wrist |
|---:|---:|---:|---:|---:|---:|
| 39 | 35.2 | 363.15 | 148.1 | 29.0 | 21.4 |
| 175 | 25.3 | 226.75 | 108.8 | 21.0 | 20.1 |
| 159 | 12.5 | 136.50 | 76.6 | 34.9 | 16.9 |
| 206 | 16.6 | 208.75 | 96.3 | 23.1 | 19.4 |
| 36 | 40.1 | 191.75 | 113.1 | 29.8 | 17.0 |

Model Check

What about collinearity?

`round(cor(SSbodyfat[,c(2,3,6,8,9)]), digits=3)`

| | Wt | Abd | Arm | Wrist |
|---|---:|---:|---:|---:|
| Wt | 1.000 | 0.888 | 0.630 | 0.730 |
| Abd | 0.888 | 1.000 | 0.503 | 0.620 |
| Arm | 0.630 | 0.503 | 1.000 | 0.586 |
| Wrist | 0.730 | 0.620 | 0.586 | 1.000 |

`library(HH)`, then `vif(mod4)`:

| Wt | Abd | Arm | Wrist |
|---:|---:|---:|---:|
| 7.040774 | 4.864380 | 1.793374 | 2.273047 |

What do all our model checks tell us about the validity of our model?

What if our investigator really felt all 13 predictors would give the best model? Output of `summary(mod13)`:

Call: `lm(formula = PBF ~ Age + Wt + Ht + Neck + Chest + Abd + Hip + Thigh + Knee + Ankle + Bicep + Arm + Wrist, data = bodyfat, x = T)`

Coefficients:

| | Estimate | Std. Error | t value | Pr(>\|t\|) | |
|---|---:|---:|---:|---:|---|
| (Intercept) | -18.18849 | 17.34857 | -1.048 | 0.29551 | |
| Age | 0.06208 | 0.03235 | 1.919 | 0.05618 | `.` |
| Wt | -0.08844 | 0.05353 | -1.652 | 0.09978 | `.` |
| Ht | -0.06959 | 0.09601 | -0.725 | 0.46925 | |
| Neck | -0.47060 | 0.23247 | -2.024 | 0.04405 | `*` |
| Chest | -0.02386 | 0.09915 | -0.241 | 0.81000 | |
| Abd | 0.95477 | 0.08645 | 11.044 | < 2e-16 | `***` |
| Hip | -0.20754 | 0.14591 | -1.422 | 0.15622 | |
| Thigh | 0.23610 | 0.14436 | 1.636 | 0.10326 | |
| Knee | 0.01528 | 0.24198 | 0.063 | 0.94970 | |
| Ankle | 0.17400 | 0.22147 | 0.786 | 0.43285 | |
| Bicep | 0.18160 | 0.17113 | 1.061 | 0.28966 | |
| Arm | 0.45202 | 0.19913 | 2.270 | 0.02410 | `*` |
| Wrist | -1.62064 | 0.53495 | -3.030 | 0.00272 | `**` |

Residual standard error: 4.305 on 238 degrees of freedom. Multiple R-squared: 0.749, Adjusted R-squared: 0.7353. F-statistic: 54.65 on 13 and 238 DF, p-value: < 2.2e-16.

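Is collinearity problematic?

`vif(mod13)`

| Age | Wt | Ht | Neck | Chest | Abd | Hip |
|---:|---:|---:|---:|---:|---:|---:|
| 2.250450 | 33.509320 | 1.674591 | 4.324463 | 9.460877 | 11.767073 | 14.796520 |

| Thigh | Knee | Ankle | Bicep | Arm | Wrist |
|---:|---:|---:|---:|---:|---:|
| 7.777865 | 4.612147 | 1.907961 | 3.619744 | 2.192492 | 3.377515 |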