summary(lm(hipcenter ~ Age)) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) e-09 *** Age Multiple R-squared: , Adjusted R-squared:">

# Lecture 4 Linear Models III Olivier MISSA, Advanced Research Skills.

## Presentation on theme: "Lecture 4 Linear Models III Olivier MISSA, Advanced Research Skills."— Presentation transcript:

Lecture 4 Linear Models III Olivier MISSA, om502@york.ac.ukom502@york.ac.uk Advanced Research Skills

2 Outline "Refresher" on different types of model: Multiple regression Polynomial regression Model building Finding the "best" model.

3 When more than one continuous or discrete variables are used to predict a response variable. Example: The 38 car drivers dataset Multiple Regression > data (seatpos) ## from faraway package > attach(seatpos) > names(seatpos) [1] "Age" "Weight" "HtShoes" "Ht" "Seated" [6] "Arm" "Thigh" "Leg" "hipcenter" > summary(lm(hipcenter ~ Age)) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -192.9645 24.3015 -7.940 2.00e-09 *** Age 0.7963 0.6331 1.258 0.217... Multiple R-squared: 0.0421, Adjusted R-squared: 0.01549...

4 Multiple Regression > summary(lm(hipcenter ~ Age)) Estimate Std. Error t value Pr(>|t|) Age 0.7963 0.6331 1.258 0.217 Multiple R-squared: 0.0421 > summary(lm(hipcenter ~ Weight)) Estimate Std. Error t value Pr(>|t|) Weight -1.0674 0.2134 -5.002 1.49e-05 *** Multiple R-squared: 0.41 > summary(lm(hipcenter ~ HtShoes)) Estimate Std. Error t value Pr(>|t|) HtShoes -4.2621 0.5391 -7.907 2.21e-09 *** Multiple R-squared: 0.6346 > summary(lm(hipcenter ~ Ht)) Estimate Std. Error t value Pr(>|t|) Ht -4.2650 0.5351 -7.970 1.83e-09 *** Multiple R-squared: 0.6383 > summary(lm(hipcenter ~ Seated)) Estimate Std. Error t value Pr(>|t|) Seated -8.844 1.375 -6.432 1.84e-07 *** Multiple R-squared: 0.5347

5 Multiple Regression > summary(lm(hipcenter ~ Arm)) Estimate Std. Error t value Pr(>|t|) Arm -10.351 2.391 -4.329 0.000114 *** Multiple R-squared: 0.3423 > summary(lm(hipcenter ~ Thigh)) Estimate Std. Error t value Pr(>|t|) Thigh -9.100 2.069 -4.398 9.29e-05 *** Multiple R-squared: 0.3495 > summary(lm(hipcenter ~ Leg)) Estimate Std. Error t value Pr(>|t|) Leg -13.795 1.801 -7.658 4.59e-09 *** Multiple R-squared: 0.6196 Age Ht Leg

6 Multiple Regression > summary(mod <- lm(hipcenter ~ Ht + Leg)) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 491.244 99.543 4.935 1.95e-05 *** Ht -2.565 1.268 -2.022 0.0509 Leg -6.136 4.164 -1.473 0.1496 --- Residual standard error: 35.79 on 35 degrees of freedom Multiple R-squared: 0.6594, Adjusted R-squared: 0.6399 F-statistic: 33.88 on 2 and 35 DF, p-value: 6.517e-09 > summary(lm(hipcenter ~ Ht)) Estimate Std. Error t value Pr(>|t|) Ht -4.2650 0.5351 -7.970 1.83e-09 *** Multiple R-squared: 0.6383 > summary(lm(hipcenter ~ Leg)) Estimate Std. Error t value Pr(>|t|) Leg -13.795 1.801 -7.658 4.59e-09 *** Multiple R-squared: 0.6196 slope parameters change Still Significant No Longer Significant

7 > drop1(mod, test="F") Single term deletions Model: hipcenter ~ Ht + Leg Df Sum of Sq RSS AIC F value Pr(F) 44835 275 Ht 1 5236 50071 277 4.0877 0.05089. Leg 1 2781 47616 275 2.1711 0.14957 > cor(Ht,Leg) [1] 0.9097524 > plot(Ht ~ Leg) Multiple Regression Beware of strong collinearity

8 > add1( lm(hipcenter~Ht), ~. +Age +Weight +HtShoes +Seated +Arm +Thigh +Leg, test="F") Single term additions Model: hipcenter ~ Ht Df Sum of Sq RSS AIC F value Pr(F) 47616 275 Age 1 2354 45262 275 1.8199 0.1860 Weight 1 196 47420 277 0.1446 0.7061 HtShoes 1 26 47590 277 0.0189 0.8913 Seated 1 102 47514 277 0.0748 0.7861 Arm 1 76 47540 277 0.0558 0.8147 Thigh 1 5 47611 277 0.0034 0.9538 Leg 1 2781 44835 275 2.1711 0.1496 Multiple Regression Beware of strong collinearity No Added Variable Significantly Improves the Model

9 Comparing models How can we compare models on an equal footing ? (regardless of the number of parameters). The multiple-R 2 can only increase as more variables enter a model (because the RSS can only decrease). The adjusted R 2 corrects for the different number of parameters to some extent. no good to compare models

10 Akaike Information Criterion Invented by Hirotugu Akaike in 1971 after a few weeks of sleepless nights stressing over a conference presentation. The AIC originally called 'An Information Criterion' penalizes the likelihood of a model according to the number of parameters being estimated. Maximized value of the Likelihood function Number of Parameters The lower the AIC value the better the model is

11 Akaike Information Criterion Invented by Hirotugu Akaike in 1971 after a few weeks of sleepless nights stressing over a conference presentation. When residuals are normally & independently distributed: Exact expression Simplified expression (not equal)

12 Akaike Information Criterion May need to be corrected for small sample sizes A variant, the BIC, 'Bayesian Information Criterion' (Schwartz Criterion) penalizes free parameters more strongly than AIC When residuals are normally & independently distributed: Exact expression Simplified expression (not equal)

13 > mod <- lm(hipcenter ~., data=seatpos) ## starts with everything > s.mod <- step(mod) ## by default prunes variables out ('backward') Start: AIC=283.62 hipcenter ~ Age +Weight +HtShoes +Ht +Seated +Arm +Thigh +Leg Df Sum of Sq RSS AIC - Ht 1 5 41267 282 - Weight 1 9 41271 282 - Seated 1 29 41290 282 - HtShoes 1 108 41370 282 - Arm 1 165 41427 282 - Thigh 1 263 41525 282 41262 284 - Age 1 2632 43894 284 - Leg 1 2655 43917 284 Multiple Regression Searching for the "best solution" Deletions ranked in increasing order of AIC

14 Step: AIC=281.63 ## after removing Ht hipcenter ~ Age +Weight +HtShoes +Seated +Arm +Thigh +Leg Df Sum of Sq RSS AIC - Weight 1 11 41278 280 - Seated 1 31 41297 280 - Arm 1 161 41427 280 - Thigh 1 269 41536 280 - HtShoes 1 972 42239 281 41267 282 - Leg 1 2665 43931 282 - Age 1 2809 44075 282 Multiple Regression Searching for the "best solution"

15 Step: AIC=279.64 ## after removing Ht & Weight hipcenter ~ Age +HtShoes +Seated +Arm +Thigh +Leg Df Sum of Sq RSS AIC - Seated 1 35 41313 278 - Arm 1 156 41434 278 - Thigh 1 285 41563 278 - HtShoes 1 975 42253 279 41278 280 - Leg 1 2661 43939 280 - Age 1 3012 44290 280 Multiple Regression Searching for the "best solution"

16 Step: AIC=277.67 ## after removing Ht, Weight & Seated hipcenter ~ Age +HtShoes +Arm +Thigh +Leg Df Sum of Sq RSS AIC - Arm 1 172 41485 276 - Thigh 1 345 41658 276 - HtShoes 1 1853 43166 277 41313 278 - Leg 1 2871 44184 278 - Age 1 2977 44290 278 Step: AIC=275.83 ## after removing Arm as well hipcenter ~ Age + HtShoes + Thigh + Leg Df Sum of Sq RSS AIC - Thigh 1 473 41958 274 41485 276 - HtShoes 1 2341 43826 276 - Age 1 3501 44986 277 - Leg 1 3592 45077 277 Multiple Regression Searching for the "best solution"

17 Step: AIC=274.26 ## after removing Thigh too hipcenter ~ Age + HtShoes + Leg Df Sum of Sq RSS AIC 41958 274 - Age 1 3109 45067 275 - Leg 1 3476 45434 275 - HtShoes 1 4219 46176 276 > summary(s.mod) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 456.2137 102.8078 4.438 9.1e-05 *** Age 0.5998 0.3779 1.587 0.1217 HtShoes -2.3023 1.2452 -1.849 0.0732. Leg -6.8297 4.0693 -1.678 0.1024 --- Residual standard error: 35.13 on 34 degrees of freedom Multiple R-squared: 0.6813, Adjusted R-squared: 0.6531 F-statistic: 24.22 on 3 and 34 DF, p-value: 1.437e-08 Multiple Regression Searching for the "best solution"

18 > anova(s.mod) Response: hipcenter Df Sum Sq Mean Sq F value Pr(>F) Age 1 5541 5541 4.4904 0.04147 * HtShoes 1 80664 80664 65.3647 1.996e-09 *** Leg 1 3476 3476 2.8169 0.10244 Residuals 34 41958 1234 > drop1(s.mod, test="F") hipcenter ~ Age + HtShoes + Leg Df Sum of Sq RSS AIC F value Pr(F) 41958 274 Age 1 3109 45067 275 2.5192 0.12173 HtShoes 1 4219 46176 276 3.4185 0.07318. Leg 1 3476 45434 275 2.8169 0.10244 Multiple Regression Searching for the "best solution"

19 > plot(hipcenter ~ s.mod\$fit) > abline(0,1, col="red") > null <- lm(hipcenter ~1, data=seatpos) ## starting from a null model > s.mod <- step(null, ~. +Age+Weight+Ht+HtShoes+Seated+Arm+Thigh+Leg, direction="forward") ## intermediate steps removed Step: AIC=274.24 hipcenter ~ Ht + Leg + Age Df Sum of Sq RSS AIC 41938 274 + Thigh 1 373 41565 276 + Arm 1 257 41681 276 + Seated 1 121 41817 276 + Weight 1 47 41891 276 + HtShoes 1 13 41925 276 Multiple Regression No guarantee that the forward & backward searches will find the same solution

20 > ozone.pollution <- read.table("ozone.data.txt", header=T) > dim(ozone.pollution) [1] 111 4 > names(ozone.pollution) [1] "rad" "temp" "wind" "ozone" > attach(ozone.pollution) > pairs(ozone.pollution, panel=panel.smooth, pch=16, lwd=2) > model <- lm(ozone ~., data=ozone.pollution) Multiple Regression Another example: How is ozone concentration in the atmosphere related to solar radiation, ambient temperature & wind speed ?

21 > summary(model) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -64.23208 23.04204 -2.788 0.00628 ** rad 0.05980 0.02318 2.580 0.01124 * temp 1.65121 0.25341 6.516 2.43e-09 *** wind -3.33760 0.65384 -5.105 1.45e-06 *** --- Residual standard error: 21.17 on 107 degrees of freedom Multiple R-squared: 0.6062, Adjusted R-squared: 0.5952 F-statistic: 54.91 on 3 and 107 DF, p-value: < 2.2e-16 > drop1(model, test="F") ozone ~ rad + temp + wind Df Sum of Sq RSS AIC F value Pr(F) 47964 682 rad 1 2984 50948 686 6.6565 0.01124 * temp 1 19032 66996 717 42.4567 2.429e-09 *** wind 1 11680 59644 704 26.0567 1.450e-06 *** Multiple Regression

22 > plot(ozone ~ model\$fitted, pch=16, xlab="Model Predictions", ylab="Ozone Concentration") > abline(0,1, col="red", lwd=2) > shapiro.test(model\$res) Shapiro-Wilk normality test data: model\$res W = 0.9173, p-value = 3.704e-06 > plot(model, which=1) Multiple Regression

23 > library(car) > cr.plot(model, rad, pch=16, main="") > cr.plot(model, temp, pch=16, main="") > cr.plot(model, wind, pch=16, main="") Multiple Regression Which predictor variable is non-linearly related to Ozone ? rad temp wind

24 > model2 <- lm(ozone ~ poly(rad,2)+poly(temp,2)+poly(wind,2)) > summary(model2) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 42.099 1.735 24.269 < 2e-16 *** poly(rad, 2)1 65.085 19.379 3.359 0.00110 ** poly(rad, 2)2 -16.259 20.174 -0.806 0.42213 poly(temp, 2)1 142.708 23.502 6.072 2.09e-08 *** poly(temp, 2)2 56.043 19.138 2.928 0.00419 ** poly(wind, 2)1 -125.800 21.391 -5.881 4.99e-08 *** poly(wind, 2)2 88.636 19.199 4.617 1.12e-05 *** --- Residual standard error: 18.28 on 104 degrees of freedom Multiple R-squared: 0.7148, Adjusted R-squared: 0.6984 F-statistic: 43.44 on 6 and 104 DF, p-value: < 2.2e-16 Polynomial Regression When the trend between a predictor variable and the response variable is not linear, the curvature can be "captured" using polynomials of various degrees.

25 > extractAIC(model2) ## Simplified Version [1] 7.0000 651.8105 > model3 <- lm(ozone ~ rad +poly(temp,2) +poly(wind,2)) > extractAIC(model3) [1] 6.0000 650.5016 > model4 <- lm(ozone ~ rad+poly(temp,3)+poly(wind,2) ) > extractAIC(model4) [1] 7.0000 648.0394 > model5 <- lm(ozone ~ rad+poly(temp,2)+poly(wind,3) ) > extractAIC(model5) [1] 7.0000 651.6489 > model6 <- lm(ozone ~ rad+poly(temp,3)+poly(wind,3) ) > extractAIC(model6) [1] 8.0000 649.0149 > extractAIC(model) ## original, strictly linear model [1] 4.0000 681.6233 Polynomial Regression But how many degrees should we choose ? Best Model

26 > plot(model4, which=1) > shapiro.test(model4\$res) Shapiro-Wilk normality test data: model4\$res W = 0.9309, p-value = 2.267e-05 > plot(model4, which=2) Polynomial Regression

27 > library(MASS) > boxcox(model4, plotit=T) > boxcox(model4, plotit=T, lambda=seq(0,1,by=.1) ) > new.ozone <- ozone^(1/3) > mod4 <- lm(new.ozone ~ rad +poly(temp,3) +poly(wind,2) ) > extractAIC(mod4) [1] 7.0000 -162.6569 > shapiro.test(mod4\$res) Shapiro-Wilk normality test data: mod4\$res W = 0.9899, p-value = 0.5855 Polynomial Regression Finding the best transformation of our response variable

28 > par(mfrow=c(2,2)) > plot(mod4) > anova(mod4) Analysis of Variance Table Response: new.ozone Df Sum Sq Mean Sq F value Pr(>F) rad 1 15.531 15.531 71.467 1.859e-13 *** poly(temp, 3) 3 40.947 13.649 62.804 < 2.2e-16 *** poly(wind, 2) 2 8.129 4.065 18.703 1.152e-07 *** Residuals 104 22.602 0.217 > summary(mod4) Residual standard error: 0.4662 on 104 degrees of freedom Multiple R-squared: 0.7408, Adjusted R-squared: 0.7259 F-statistic: 49.55 on 6 and 104 DF, p-value: < 2.2e-16 Polynomial Regression

29 > par(mfrow=c(1,1)) > plot(new.ozone ~ mod4\$fitted) > abline(0,1, col="red", lwd=2) > cr.plot(mod4, rad, pch=16, main="") > cr.plot(mod4, poly(temp,3), pch=16, main="") > cr.plot(mod4, poly(wind,2), pch=16, main="") Polynomial Regression rad temp wind