Lecture 4 Linear Models III Olivier MISSA, Advanced Research Skills.

1 Lecture 4 Linear Models III Olivier MISSA, Advanced Research Skills

2 Outline
"Refresher" on different types of model:
  Multiple regression
  Polynomial regression
Model building:
  Finding the "best" model.

3 Multiple Regression
When more than one continuous or discrete variable is used to predict a response variable.
Example: the 38 car drivers dataset
> library(faraway)  ## provides the seatpos dataset
> data(seatpos)
> attach(seatpos)
> names(seatpos)
[1] "Age" "Weight" "HtShoes" "Ht" "Seated"
[6] "Arm" "Thigh" "Leg" "hipcenter"
> summary(lm(hipcenter ~ Age))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                             e-09 ***
Age
Multiple R-squared: , Adjusted R-squared:

4 Multiple Regression
> summary(lm(hipcenter ~ Age))
        Estimate Std. Error t value Pr(>|t|)
Age
Multiple R-squared:
> summary(lm(hipcenter ~ Weight))
        Estimate Std. Error t value Pr(>|t|)
Weight                              e-05 ***
Multiple R-squared: 0.41
> summary(lm(hipcenter ~ HtShoes))
        Estimate Std. Error t value Pr(>|t|)
HtShoes                             e-09 ***
Multiple R-squared:
> summary(lm(hipcenter ~ Ht))
        Estimate Std. Error t value Pr(>|t|)
Ht                                  e-09 ***
Multiple R-squared:
> summary(lm(hipcenter ~ Seated))
        Estimate Std. Error t value Pr(>|t|)
Seated                              e-07 ***
Multiple R-squared:

5 Multiple Regression
> summary(lm(hipcenter ~ Arm))
        Estimate Std. Error t value Pr(>|t|)
Arm                                 ***
Multiple R-squared:
> summary(lm(hipcenter ~ Thigh))
        Estimate Std. Error t value Pr(>|t|)
Thigh                               e-05 ***
Multiple R-squared:
> summary(lm(hipcenter ~ Leg))
        Estimate Std. Error t value Pr(>|t|)
Leg                                 e-09 ***
Multiple R-squared:
Figure panels: Age, Ht, Leg

6 Multiple Regression
> summary(mod <- lm(hipcenter ~ Ht + Leg))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                             e-05 ***
Ht
Leg
Residual standard error: on 35 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 35 DF, p-value: 6.517e-09
> summary(lm(hipcenter ~ Ht))
        Estimate Std. Error t value Pr(>|t|)
Ht                                  e-09 ***
Multiple R-squared:
> summary(lm(hipcenter ~ Leg))
        Estimate Std. Error t value Pr(>|t|)
Leg                                 e-09 ***
Multiple R-squared:
Compared with the single-predictor fits, the slope parameters change.
Slide annotations: "Still Significant", "No Longer Significant"

7 Multiple Regression
Beware of strong collinearity
> drop1(mod, test="F")
Single term deletions
Model: hipcenter ~ Ht + Leg
    Df Sum of Sq RSS AIC F value Pr(F)
Ht
Leg
> cor(Ht, Leg)
[1]
> plot(Ht ~ Leg)
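The collinearity warning above can be illustrated with a minimal sketch. It uses simulated data (the variables x1, x2 and y are hypothetical and are not the seatpos measurements), so it runs without the faraway package. The variance inflation factor is computed by hand as 1/(1 - r^2), which is its form when there are exactly two predictors.

```r
## Simulated, hypothetical data: x2 is almost a copy of x1,
## so the two predictors are strongly collinear.
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
y  <- 2 * x1 + rnorm(n)

r   <- cor(x1, x2)        # correlation between the two predictors
vif <- 1 / (1 - r^2)      # variance inflation factor (two-predictor case)

## Standard error of the x1 slope, fitted alone and alongside x2
se_alone <- coef(summary(lm(y ~ x1)))["x1", "Std. Error"]
se_joint <- coef(summary(lm(y ~ x1 + x2)))["x1", "Std. Error"]

cat("cor =", round(r, 3), " VIF =", round(vif, 1), "\n")
cat("SE alone =", round(se_alone, 3), " SE joint =", round(se_joint, 3), "\n")
```

The inflated standard error in the joint fit is why a predictor that is clearly significant on its own can lose its significance once a near-duplicate enters the model.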

8 Multiple Regression
Beware of strong collinearity
> add1(lm(hipcenter ~ Ht), ~ . + Age + Weight + HtShoes + Seated + Arm + Thigh + Leg, test="F")
Single term additions
Model: hipcenter ~ Ht
        Df Sum of Sq RSS AIC F value Pr(F)
Age
Weight
HtShoes
Seated
Arm
Thigh
Leg
No added variable significantly improves the model.

9 Comparing models
How can we compare models on an equal footing, regardless of the number of parameters?
The multiple R² can only increase as more variables enter a model (because the RSS can only decrease), so it is no good for comparing models.
The adjusted R² corrects for the different number of parameters to some extent.
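A short sketch of the point about R², under the assumption of simulated data (x, noise and y are hypothetical names, and noise is unrelated to the response by construction). Adding the useless predictor cannot lower the multiple R², whereas the adjusted R² is free to move in either direction.

```r
set.seed(42)
n     <- 40
x     <- rnorm(n)
noise <- rnorm(n)               # unrelated to y by construction
y     <- 1 + 2 * x + rnorm(n)

small <- lm(y ~ x)
big   <- lm(y ~ x + noise)      # same model plus a useless predictor

r2     <- function(m) summary(m)$r.squared
adj_r2 <- function(m) summary(m)$adj.r.squared

print(c(small = r2(small), big = r2(big)))          # multiple R-squared
print(c(small = adj_r2(small), big = adj_r2(big)))  # adjusted R-squared
```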

10 Akaike Information Criterion
Invented by Hirotugu Akaike in 1971 after a few weeks of sleepless nights stressing over a conference presentation.
The AIC, originally called 'An Information Criterion', penalizes the likelihood of a model according to the number of parameters being estimated:
AIC = 2k - 2 ln(L), where L is the maximized value of the likelihood function and k is the number of parameters.
The lower the AIC value, the better the model.

11 Akaike Information Criterion
When residuals are normally and independently distributed, the AIC has an exact expression and a simplified expression (the two are not equal).
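The slide's own formulas did not survive in this transcript, so the following is the standard textbook form for a least-squares fit, offered as an assumption about what the slide showed. Here n is the number of observations, RSS the residual sum of squares, k the number of estimated parameters and \hat{L} the maximized likelihood.

```latex
\begin{align*}
\ln \hat{L} &= -\frac{n}{2}\left[\ln(2\pi) + \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 1\right] \\
\mathrm{AIC}_{\text{exact}} &= -2\ln \hat{L} + 2k
  = n\ln(2\pi) + n\ln\!\left(\frac{\mathrm{RSS}}{n}\right) + n + 2k \\
\mathrm{AIC}_{\text{simplified}} &= n\ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k
\end{align*}
```

With the same k in both, the two expressions differ by n(ln(2π) + 1), which is constant for a given dataset, so they rank models fitted to the same data identically even though their values are not equal.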

12 Akaike Information Criterion
May need to be corrected for small sample sizes.
A variant, the BIC ('Bayesian Information Criterion', also called the Schwarz criterion), penalizes free parameters more strongly than AIC.
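As a hedged illustration of the AIC variants mentioned above, the sketch below computes them by hand for a small model on the built-in cars dataset (chosen only because it ships with R; it is not part of the lecture). It assumes the usual conventions: AIC() counts the error variance as a parameter, extractAIC() drops the constants and counts only the regression coefficients, and the small-sample correction is the common AICc formula.

```r
fit <- lm(dist ~ speed, data = cars)   # built-in data, illustration only
n   <- nobs(fit)
rss <- deviance(fit)                   # residual sum of squares
p   <- length(coef(fit))               # regression coefficients
k   <- p + 1                           # plus the error variance

## "Exact" form, as returned by AIC()
aic_exact <- n * (log(2 * pi) + log(rss / n) + 1) + 2 * k

## "Simplified" form, as returned by extractAIC()
aic_simple <- n * log(rss / n) + 2 * p

## Small-sample correction (AICc) and BIC, built on the exact form
aicc <- aic_exact + 2 * k * (k + 1) / (n - k - 1)
bic  <- aic_exact - 2 * k + log(n) * k

print(c(AIC = aic_exact, simplified = aic_simple, AICc = aicc, BIC = bic))
```

Since log(n) exceeds 2 once n is 8 or more, the BIC penalty per parameter is the heavier one for all but tiny samples. In step(), passing k = log(n) switches the search from AIC to a BIC-style penalty.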

13 Multiple Regression
Searching for the "best solution"
> mod <- lm(hipcenter ~ ., data=seatpos)  ## starts with everything
> s.mod <- step(mod)  ## by default prunes variables out ('backward')
Start: AIC=
hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg
          Df Sum of Sq RSS AIC
- Ht
- Weight
- Seated
- HtShoes
- Arm
- Thigh
- Age
- Leg
Deletions are ranked in increasing order of AIC.

14 Multiple Regression
Searching for the "best solution"
Step: AIC=  ## after removing Ht
hipcenter ~ Age + Weight + HtShoes + Seated + Arm + Thigh + Leg
          Df Sum of Sq RSS AIC
- Weight
- Seated
- Arm
- Thigh
- HtShoes
- Leg
- Age

15 Multiple Regression
Searching for the "best solution"
Step: AIC=  ## after removing Ht & Weight
hipcenter ~ Age + HtShoes + Seated + Arm + Thigh + Leg
          Df Sum of Sq RSS AIC
- Seated
- Arm
- Thigh
- HtShoes
- Leg
- Age

16 Multiple Regression
Searching for the "best solution"
Step: AIC=  ## after removing Ht, Weight & Seated
hipcenter ~ Age + HtShoes + Arm + Thigh + Leg
          Df Sum of Sq RSS AIC
- Arm
- Thigh
- HtShoes
- Leg
- Age
Step: AIC=  ## after removing Arm as well
hipcenter ~ Age + HtShoes + Thigh + Leg
          Df Sum of Sq RSS AIC
- Thigh
- HtShoes
- Age
- Leg

17 Multiple Regression
Searching for the "best solution"
Step: AIC=  ## after removing Thigh too
hipcenter ~ Age + HtShoes + Leg
          Df Sum of Sq RSS AIC
- Age
- Leg
- HtShoes
> summary(s.mod)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                             e-05 ***
Age
HtShoes
Leg
Residual standard error: on 34 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 34 DF, p-value: 1.437e-08

18 Multiple Regression
Searching for the "best solution"
> anova(s.mod)
Response: hipcenter
          Df Sum Sq Mean Sq F value Pr(>F)
Age                                 *
HtShoes                             e-09 ***
Leg
Residuals
> drop1(s.mod, test="F")
hipcenter ~ Age + HtShoes + Leg
          Df Sum of Sq RSS AIC F value Pr(F)
Age
HtShoes
Leg

19 Multiple Regression
> plot(hipcenter ~ s.mod$fit)
> abline(0, 1, col="red")
> null <- lm(hipcenter ~ 1, data=seatpos)  ## starting from a null model
> s.mod <- step(null, ~ . + Age + Weight + Ht + HtShoes + Seated + Arm + Thigh + Leg, direction="forward")
## intermediate steps removed
Step: AIC=
hipcenter ~ Ht + Leg + Age
          Df Sum of Sq RSS AIC
+ Thigh
+ Arm
+ Seated
+ Weight
+ HtShoes
There is no guarantee that the forward and backward searches will find the same solution.
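The two search directions can be compared side by side with a minimal sketch. It assumes the built-in mtcars dataset as a stand-in for seatpos (so that it runs without the faraway package); the variable names full, null, backward and forward are illustrative. Setting trace = 0 suppresses the step-by-step tables.

```r
## mtcars ships with R; it stands in for seatpos here
full <- lm(mpg ~ ., data = mtcars)   # every predictor
null <- lm(mpg ~ 1, data = mtcars)   # intercept only

## Backward: start from everything and prune
backward <- step(full, direction = "backward", trace = 0)

## Forward: start from nothing and add, up to the full model
forward <- step(null,
                scope = list(lower = ~ 1, upper = formula(full)),
                direction = "forward", trace = 0)

print(formula(backward))
print(formula(forward))
print(c(backward = extractAIC(backward)[2],
        forward  = extractAIC(forward)[2]))
```

Each search is greedy: it only ever takes the single best step from where it currently stands, which is why the two can end at different models with different AIC values.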

20 Multiple Regression
Another example: how is ozone concentration in the atmosphere related to solar radiation, ambient temperature and wind speed?
> ozone.pollution <- read.table("ozone.data.txt", header=TRUE)
> dim(ozone.pollution)
[1]
> names(ozone.pollution)
[1] "rad" "temp" "wind" "ozone"
> attach(ozone.pollution)
> pairs(ozone.pollution, panel=panel.smooth, pch=16, lwd=2)
> model <- lm(ozone ~ ., data=ozone.pollution)

21 Multiple Regression
> summary(model)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                             **
rad                                     *
temp                                    e-09 ***
wind                                    e-06 ***
Residual standard error: on 107 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 107 DF, p-value: < 2.2e-16
> drop1(model, test="F")
ozone ~ rad + temp + wind
     Df Sum of Sq RSS AIC F value Pr(F)
rad                               *
temp                              e-09 ***
wind                              e-06 ***

22 Multiple Regression
> plot(ozone ~ model$fitted, pch=16, xlab="Model Predictions", ylab="Ozone Concentration")
> abline(0, 1, col="red", lwd=2)
> shapiro.test(model$res)
Shapiro-Wilk normality test
data: model$res
W = , p-value = 3.704e-06
> plot(model, which=1)

23 Multiple Regression
Which predictor variable is non-linearly related to ozone?
> library(car)
## cr.plot() is the older name; since car 2.0 the functions are crPlot() and crPlots()
> cr.plot(model, rad, pch=16, main="")
> cr.plot(model, temp, pch=16, main="")
> cr.plot(model, wind, pch=16, main="")
Figure panels: rad, temp, wind

24 Polynomial Regression
When the trend between a predictor variable and the response variable is not linear, the curvature can be "captured" using polynomials of various degrees.
> model2 <- lm(ozone ~ poly(rad,2) + poly(temp,2) + poly(wind,2))
> summary(model2)
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)                                < 2e-16 ***
poly(rad, 2)1                              **
poly(rad, 2)2
poly(temp, 2)1                             e-08 ***
poly(temp, 2)2                             **
poly(wind, 2)1                             e-08 ***
poly(wind, 2)2                             e-05 ***
Residual standard error: on 104 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 6 and 104 DF, p-value: < 2.2e-16

25 Polynomial Regression
But how many degrees should we choose?
> extractAIC(model2)  ## simplified version
[1]
> model3 <- lm(ozone ~ rad + poly(temp,2) + poly(wind,2))
> extractAIC(model3)
[1]
> model4 <- lm(ozone ~ rad + poly(temp,3) + poly(wind,2))
> extractAIC(model4)
[1]
> model5 <- lm(ozone ~ rad + poly(temp,2) + poly(wind,3))
> extractAIC(model5)
[1]
> model6 <- lm(ozone ~ rad + poly(temp,3) + poly(wind,3))
> extractAIC(model6)
[1]
> extractAIC(model)  ## original, strictly linear model
[1]
Slide annotation: "Best Model" (the following slides continue with model4)
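Choosing the polynomial degree by AIC can be sketched as a loop. The example assumes the built-in airquality dataset as a stand-in for ozone.data.txt (whether the two hold the same measurements is not confirmed by the transcript), and the names aq, fits and best_degree are illustrative. Only the degree of the temperature term is varied.

```r
## airquality ships with R; assumed stand-in for ozone.data.txt
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Temp", "Wind")])

## Fit Temp as a polynomial of degree 1 to 4, other terms held fixed
degrees <- 1:4
fits <- lapply(degrees, function(d) {
  lm(Ozone ~ Solar.R + poly(Temp, d) + poly(Wind, 2), data = aq)
})

aic <- sapply(fits, function(m) extractAIC(m)[2])  # simplified AIC
rss <- sapply(fits, deviance)                      # residual sum of squares

print(data.frame(degree = degrees, RSS = rss, AIC = aic))
best_degree <- degrees[which.min(aic)]
cat("Degree with the lowest AIC:", best_degree, "\n")
```

The RSS can only fall as the degree rises, because each model nests the previous one. The AIC adds a cost of 2 per extra coefficient, so it stops rewarding higher degrees once the drop in RSS becomes too small.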

26 Polynomial Regression
> plot(model4, which=1)
> shapiro.test(model4$res)
Shapiro-Wilk normality test
data: model4$res
W = , p-value = 2.267e-05
> plot(model4, which=2)

27 Polynomial Regression
Finding the best transformation of our response variable
> library(MASS)
> boxcox(model4, plotit=TRUE)
> boxcox(model4, plotit=TRUE, lambda=seq(0, 1, by=.1))
> new.ozone <- ozone^(1/3)
> mod4 <- lm(new.ozone ~ rad + poly(temp,3) + poly(wind,2))
> extractAIC(mod4)
[1]
> shapiro.test(mod4$res)
Shapiro-Wilk normality test
data: mod4$res
W = , p-value =
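The Box-Cox step can also be read off numerically rather than from the plot. This sketch again assumes the built-in airquality data as a stand-in for ozone.data.txt and uses a plain linear model for brevity; lambda_hat and ci are illustrative names. The interval uses the usual chi-squared cutoff on the profile log-likelihood, which is also how the plotted confidence band is drawn.

```r
library(MASS)   # provides boxcox()

## airquality ships with R; assumed stand-in for ozone.data.txt
aq  <- na.omit(airquality[, c("Ozone", "Solar.R", "Temp", "Wind")])

## y = TRUE and qr = TRUE keep the pieces boxcox() needs inside the fit
fit <- lm(Ozone ~ Solar.R + Temp + Wind, data = aq, y = TRUE, qr = TRUE)

## Profile log-likelihood over a grid of lambda values, without plotting
bc <- boxcox(fit, lambda = seq(-1, 1, by = 0.05), plotit = FALSE)

lambda_hat <- bc$x[which.max(bc$y)]   # grid value with the highest likelihood

## Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the maximum
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])

cat("lambda_hat =", lambda_hat, " approx. 95% CI =", ci, "\n")
```

A convenient power inside the interval (such as 1/3 or 1/2, or the log for lambda near 0) is usually preferred over the exact maximizer, since it keeps the transformed response easier to interpret.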

28 Polynomial Regression
> par(mfrow=c(2,2))
> plot(mod4)
> anova(mod4)
Analysis of Variance Table
Response: new.ozone
              Df Sum Sq Mean Sq F value Pr(>F)
rad                                     e-13 ***
poly(temp, 3)                           < 2.2e-16 ***
poly(wind, 2)                           e-07 ***
Residuals
> summary(mod4)
Residual standard error: on 104 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 6 and 104 DF, p-value: < 2.2e-16

29 Polynomial Regression
> par(mfrow=c(1,1))
> plot(new.ozone ~ mod4$fitted)
> abline(0, 1, col="red", lwd=2)
> cr.plot(mod4, rad, pch=16, main="")
> cr.plot(mod4, poly(temp,3), pch=16, main="")
> cr.plot(mod4, poly(wind,2), pch=16, main="")
Figure panels: rad, temp, wind

