Review of Univariate Linear Regression BMTRY 726 3/4/14
Regression Analysis We are interested in predicting values of one or more responses from a set of predictors. Regression analysis is an extension of what we discussed with ANOVA and MANOVA. We now allow for inclusion of continuous predictors in place of (or in addition to) the treatment indicators used in MANOVA.
Univariate Regression Analysis A univariate regression model states that the response Y is composed of a mean, which depends on a set of independent predictors z_i, and a random error ε_i. Model assumptions:
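The equations on this slide did not survive in the transcript. A standard statement of the model and its usual assumptions, written in matrix form, is sketched below. The symbols n (observations), r (predictors), and the design matrix Z with a leading column of ones are assumptions chosen to match the Z′Z notation used on the later collinearity slides; the normality line is the extra assumption usually added for tests and intervals.

```latex
\[
\underset{(n\times 1)}{\mathbf{Y}}
  = \underset{(n\times (r+1))}{\mathbf{Z}}\;
    \underset{((r+1)\times 1)}{\boldsymbol{\beta}}
  + \underset{(n\times 1)}{\boldsymbol{\varepsilon}}
\]
\[
E(\boldsymbol{\varepsilon}) = \mathbf{0},
\qquad
\operatorname{Cov}(\boldsymbol{\varepsilon}) = \sigma^{2}\mathbf{I}
\]
\[
\text{For inference:}\quad
\boldsymbol{\varepsilon} \sim N_{n}\!\left(\mathbf{0},\, \sigma^{2}\mathbf{I}\right)
\]
```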
Least Squares Estimation We use the method of least squares to estimate our model parameters
Least Squares Estimation We estimate the variance using the residuals
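The estimation formulas themselves are missing from the transcript. As a minimal sketch, the usual least squares estimate solves the normal equations (Z′Z)b = Z′y, and the variance estimate divides the residual sum of squares by n − r − 1. The example uses R's built-in mtcars data purely for illustration (it is not the course data), and the variable names are hypothetical.

```r
# Illustrative data only: the built-in mtcars data set, not the course data.
y <- mtcars$mpg
Z <- cbind(1, mtcars$wt, mtcars$hp)   # design matrix with an intercept column
n <- nrow(Z)
r <- ncol(Z) - 1                      # number of predictors

# Least squares estimate: solve the normal equations (Z'Z) b = Z'y.
beta_hat <- solve(crossprod(Z), crossprod(Z, y))

# Residuals and the variance estimate s^2 = SS_res / (n - r - 1).
resid_hat <- y - Z %*% beta_hat
s2 <- sum(resid_hat^2) / (n - r - 1)

# Cross-check both quantities against lm().
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(cbind(normal_eq = drop(beta_hat), lm = coef(fit)))
print(c(s2 = s2, lm_sigma2 = summary(fit)$sigma^2))
```

In practice lm() uses a QR decomposition rather than forming Z′Z, which is numerically safer; the explicit normal equations are shown here only because they mirror the algebra.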
Least Squares Estimation
Geometry of Least Squares
LRT for individual β_i's First we may test whether any predictors affect the response. The LRT is based on the difference in sums of squares between the full and null models…
LRT for individual β_i's Difference in SS between the full and null models…
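The test statistic is not preserved in the transcript. Under normal errors, the likelihood ratio test for nested linear models is equivalent to an F test on the difference in residual sums of squares, which can be sketched as follows. The data (built-in mtcars) and the choice of full and reduced models are illustrative assumptions, not the course example.

```r
# Illustrative nested models on the built-in mtcars data (not the course data).
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

ss_full <- sum(resid(full)^2)        # residual SS, full model
ss_red  <- sum(resid(reduced)^2)     # residual SS, reduced model
q       <- df.residual(reduced) - df.residual(full)  # number of coefficients tested

# F = [(SS_red - SS_full) / q] / [SS_full / residual df of the full model]
F_stat  <- ((ss_red - ss_full) / q) / (ss_full / df.residual(full))
p_value <- pf(F_stat, q, df.residual(full), lower.tail = FALSE)
print(c(F = F_stat, p = p_value))

# anova() on the two fits reports the same test.
print(anova(reduced, full))
```

Setting the reduced model to the intercept-only fit, lm(mpg ~ 1, data = mtcars), gives the overall test of whether any predictor affects the response.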
Model Building If we have a large number of predictors, we want to identify the “best” subset. There are many methods of selecting the “best”: -Examine all possible subsets of predictors -Forward stepwise selection -Backward stepwise selection
Model Building Though we can consider only the predictors that are significant, this may not yield the “best” subset (several models may yield similar results). The “best” choice is made by examining some criterion: -R² -adjusted R² -Mallows' C_p -AIC. Since R² never decreases as predictors are added, Mallows' C_p and AIC are better choices for selecting the “best” predictor subset.
Model Building Mallows' C_p: -Plot the pairs (p, C_p). -Good models have coordinates near the 45° line (C_p ≈ p). Akaike's Information Criterion: -Smaller is better.
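The formulas for these criteria are not in the transcript. A small sketch, assuming the common definition C_p = SS_res(p) / s²_full − (n − 2p) with p counting the intercept, and using R's AIC() function; the candidate list and the mtcars data are illustrative assumptions. R's AIC() includes likelihood constants, so its values can differ from a textbook formula such as n·ln(SS_res/n) + 2p by an amount that is the same for every model fitted to the same data, which leaves the ranking unchanged.

```r
# Illustrative candidates on the built-in mtcars data (not the course data).
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
n       <- nrow(mtcars)
s2_full <- summary(full)$sigma^2     # residual variance of the full model

candidates <- list(
  mpg ~ wt,
  mpg ~ wt + hp,
  mpg ~ wt + hp + qsec,
  mpg ~ wt + hp + qsec + drat
)

criteria <- do.call(rbind, lapply(candidates, function(f) {
  fit <- lm(f, data = mtcars)
  p   <- length(coef(fit))           # parameters, including the intercept
  data.frame(
    model = paste(deparse(f), collapse = ""),
    p     = p,
    Cp    = sum(resid(fit)^2) / s2_full - (n - 2 * p),
    AIC   = AIC(fit)
  )
}))
print(criteria)
```

By construction the full model always has C_p equal to its own p, so it sits exactly on the 45° line; the interesting candidates are smaller models that land near that line.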
Model Checking It is always good to check whether the model is “correct” before using it to make decisions… Information about fit is contained in the residuals. If the model fits well, the estimated error terms should mimic N(0, σ²). So how can we check?
Model Checking 1. Plot of studentized residuals 2. Plot of residuals versus predicted values -Ideally the points should be scattered (i.e. no pattern) -If a pattern exists, it can show something about the problem
Model Checking 3. Plot of residuals versus predictors 4. QQ plot of studentized residuals
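The four checks listed above can be sketched in base R as follows. The fitted model and the mtcars data are illustrative assumptions, and rstudent() (externally studentized residuals) is one of two common definitions; rstandard() gives the internally studentized version. The |t| > 2 cutoff is a rough rule of thumb, not something stated in the slides.

```r
# Illustrative fit on the built-in mtcars data (not the course data).
fit       <- lm(mpg ~ wt + hp, data = mtcars)
r_student <- rstudent(fit)           # externally studentized residuals
y_hat     <- fitted(fit)

# Observations with unusually large studentized residuals (rough cutoff).
print(r_student[abs(r_student) > 2])

# Draw the four diagnostic plots when running interactively.
if (interactive()) {
  par(mfrow = c(2, 2))
  plot(r_student, ylab = "Studentized residual")   # 1. studentized residuals
  abline(h = c(-2, 0, 2), lty = 2)
  plot(y_hat, resid(fit), xlab = "Fitted", ylab = "Residual")  # 2. vs fitted
  abline(h = 0, lty = 2)
  plot(mtcars$wt, resid(fit), xlab = "wt", ylab = "Residual")  # 3. vs a predictor
  abline(h = 0, lty = 2)
  qqnorm(r_student)                                # 4. QQ plot
  qqline(r_student)
}
```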
Model Checking While residual analysis is useful, it may miss outliers, i.e. observations that are very influential on predictions. Leverage: -How far is the j-th observation from the others? -How much pull does observation j exert on the fit? Observations that affect inferences are influential.
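Leverage and influence can be computed directly from a fitted model. The sketch below uses hatvalues() for the leverages (the diagonal of Z(Z′Z)⁻¹Z′) and cooks.distance() as one common influence measure; the slides do not say which influence measure was used, and the cutoffs 2p/n and 4/n are widely quoted rules of thumb rather than anything taken from the source. The mtcars data are again only illustrative.

```r
# Illustrative fit on the built-in mtcars data (not the course data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
h   <- hatvalues(fit)                # leverage of each observation
d   <- cooks.distance(fit)           # Cook's distance (influence on the fit)
p   <- length(coef(fit))             # parameters, including the intercept
n   <- nrow(mtcars)

# Flag observations exceeding either rule-of-thumb cutoff.
flagged <- data.frame(leverage = h, cooks_d = d)[h > 2 * p / n | d > 4 / n, ]
print(flagged)
```

The leverages always sum to p, so 2p/n flags observations with about twice the average leverage.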
Collinearity If Z is not of full rank, some linear combination Za of the columns of Z (with a ≠ 0) equals 0. In that case the columns are collinear and the inverse of Z′Z does not exist. It is rare that Za = 0 exactly, but if some combination is nearly zero, (Z′Z)^-1 is numerically unstable. This results in very large estimated variances for the model parameters, making it difficult to identify significant regression coefficients.
Collinearity We can check for severity of multicollinearity using the variance inflation factor ( VIF )
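The VIF formula is missing from the transcript. Under its usual definition, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. A minimal sketch without extra packages follows; the predictor set and the mtcars data are illustrative assumptions.

```r
# Illustrative predictors from the built-in mtcars data (not the course data).
predictors <- c("wt", "hp", "disp", "drat")

# VIF_j = 1 / (1 - R_j^2), regressing predictor j on the other predictors.
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: the diagonal of the inverse correlation matrix.
vif_corr <- diag(solve(cor(mtcars[, predictors])))

print(round(cbind(vif_manual, vif_corr), 3))
```

A VIF of 1 means the predictor is uncorrelated with the others; values above roughly 10 are often treated as a sign of serious multicollinearity, though that threshold is a convention.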
Misspecified Model If important predictors are omitted, the estimated vector of regression coefficients may be biased. It is biased unless the columns of Z_1 are orthogonal to those of Z_2 (Z_1′Z_2 = 0).
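The derivation behind this statement is not in the transcript. A sketch under the usual setup, assuming the true model splits the design matrix into included columns Z_1 and omitted columns Z_2 with coefficients β_(1) and β_(2) (these subscripted symbols are notation assumed here):

```latex
\[
\text{True model:}\quad
\mathbf{Y} = \mathbf{Z}_1\boldsymbol{\beta}_{(1)} + \mathbf{Z}_2\boldsymbol{\beta}_{(2)} + \boldsymbol{\varepsilon},
\qquad E(\boldsymbol{\varepsilon}) = \mathbf{0}
\]
\[
\text{Fitted with } \mathbf{Z}_1 \text{ only:}\quad
\hat{\boldsymbol{\beta}}_{(1)} = (\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'\mathbf{Y}
\]
\[
E\!\left(\hat{\boldsymbol{\beta}}_{(1)}\right)
= (\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'
  \left(\mathbf{Z}_1\boldsymbol{\beta}_{(1)} + \mathbf{Z}_2\boldsymbol{\beta}_{(2)}\right)
= \boldsymbol{\beta}_{(1)}
  + (\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'\mathbf{Z}_2\,\boldsymbol{\beta}_{(2)}
\]
```

The second term is the bias. It vanishes when Z_1′Z_2 = 0 or when β_(2) = 0.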
Example Develop a model to predict percent body fat (PBF) using: -Age, Weight, Height, Chest, Abdomen, Hip, Arm, Wrist. Our full model is
lm(formula = PBF ~ Age + Wt + Ht + Chest + Abd + Hip + Arm + Wrist, data = SSbodyfat)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)
Age
Wt
Ht
Chest
Abd                                      < 2e-16 ***
Hip
Arm                                              **
Wrist                                            ***
Residual standard error: on 243 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 8 and 243 DF, p-value: < 2.2e-16
LRT What if we want to test whether or not the 4 least significant predictors in the model can be removed?
Model Subset Selection First consider the plot of SS_res for all possible subsets of the eight predictors
Model Subset Selection What about Mallows' C_p and AIC?
Model Say we choose the model with 4 predictors.
> summary(mod4)
Call:
lm(formula = PBF ~ Wt + Abd + Arm + Wrist, data = SSbodyfat, x = T)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                 e-06 ***
Wt                                          e-07 ***
Abd                                      < 2e-16 ***
Arm                                              **
Wrist                                            ***
Residual standard error: on 247 degrees of freedom
Multiple R-squared: 0.735, Adjusted R-squared:
F-statistic: on 4 and 247 DF, p-value: < 2.2e-16
Model Check Say we choose the model with 4 predictors. We need to check our regression diagnostics
Model Check What about our predictors versus the residuals?
Model Check What about influential points?
> SSbodyfat[c(39,175,159,206,36),]
Obs PBF Wt Abd Arm Wrist
Model Check What about collinearity?
> round(cor(SSbodyfat[,c(2,3,6,8,9)]), digits=3)
      Wt Abd Arm Wrist
Wt
Abd
Arm
Wrist
> library(HH)
> vif(mod4)
Wt Abd Arm Wrist
What do all our model checks tell us about the validity of our model?
What if our investigator felt that all 13 predictors would give the best model?
> summary(mod13)
Call:
lm(formula = PBF ~ Age + Wt + Ht + Neck + Chest + Abd + Hip + Thigh + Knee + Ankle + Bicep + Arm + Wrist, data = bodyfat, x = T)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)
Age
Wt
Ht
Neck                                             *
Chest
Abd                                      < 2e-16 ***
Hip
Thigh
Knee
Ankle
Bicep
Arm                                              *
Wrist                                            **
Residual standard error: on 238 degrees of freedom
Multiple R-squared: 0.749, Adjusted R-squared:
F-statistic: on 13 and 238 DF, p-value: < 2.2e-16
Is collinearity problematic?
> vif(mod13)
Age Wt Ht Neck Chest Abd Hip Thigh Knee Ankle Bicep Arm Wrist