# Lecture 2 Linear Models I Olivier MISSA, Advanced Research Skills.

## Presentation on theme: "Lecture 2 Linear Models I Olivier MISSA, Advanced Research Skills."— Presentation transcript:

Lecture 2 Linear Models I Olivier MISSA, om502@york.ac.ukom502@york.ac.uk Advanced Research Skills

2 Outline What are Linear models Basic assumptions "Refresher" on different types of model: Single Linear Regression 2 sample t-test Anova (one-way & two-way) Ancova Multiple Linear Regression

3 What are linear models ? Put simply they are Models attempting to "explain" one response variable by combining linearly several predictor variables. In theory, the response variable and the predictor variables can either be continuous, discrete or categorical. This makes them particularly versatile. Indeed, many well known statistical procedures are linear models. e.g. Linear regression, Two-sample t-test, One-way & Two-way ANOVA, ANCOVA,...

4 What are linear models ? For the time being, we are going to assume that (1) the response variable is continuous. (2) the residuals ( ε ) are normally distributed and... (3)... independently and identically distributed. These "three" assumptions define classical linear models. We will see in later lectures ways to bypass these assumptions to be able to cope with an even wider range of situations (generalized linear models, mixed-effects models). But let's take things one step at a time.

5 Starting with a simple example Single Linear regression 1 response variable (continuous) vs. 1 explanatory variable either continuous or discrete (but ordinal !). Attempts to fit a straight line, y = a + b.x > library(faraway) > data(seatpos) > attach(seatpos) > model <- lm(Weight ~ Ht) > model Coefficients: (Intercept) Ht -292.992 2.653

6 But how "significant" is this trend ? Can be approached in a number of ways 1) R 2 the proportion of variance explained. Residual Sum of Squares Total Sum of Squares fitted values average value R 2 = 0.686 Deviance RSS = 14853 TotalSS = 47371

7 1) R 2 the proportion of variance explained. Sum of Squares due to regression SS reg = 32518 RSS = 14853 TotalSS = 47371

8 1) R 2 the proportion of variance explained. > summary(model) Call: lm(formula = Weight ~ Ht) Residuals: Min 1Q Median 3Q Max -29.388 -12.213 -4.876 4.945 59.586 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -292.9915 50.6402 -5.786 1.34e-06 *** Ht 2.6533 0.2989 8.878 1.35e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 20.31 on 36 degrees of freedom Multiple R-squared: 0.6865, Adjusted R-squared: 0.6777 F-statistic: 78.82 on 1 and 36 DF, p-value: 1.351e-10

9 2) model parameters and their standard errors > summary(model)... some outputs left out Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -292.9915 50.6402 -5.786 1.34e-06 *** Ht 2.6533 0.2989 8.878 1.35e-10 ***... How many standard errors away from 0 are the estimates

10 > anova(model) Analysis of Variance Table Response: Weight Df Sum Sq Mean Sq F value Pr(>F) Ht 1 32518 32518 78.816 1.351e-10 *** Residuals 36 14853 413... 3) F-test SS reg RSS MS reg = SS reg / df reg F = MS reg / MSE The F-value so obtained must be compared to the theoretical F probabilities with 1 (numerator) and 36 (denominator) degrees of freedom MSE = RSS / rdf

11 How strong is this trend ? 95% Confidence Interval around Slope 95% future observations band 22 31 13

12 Are the assumptions met ? 1: Is the response variable continuous ? YES ! 2: Are the residuals normally distributed ? > library(nortest) ## Package of Normality tests > ad.test(model\$residuals) ## Anderson-Darling Anderson-Darling normality test data: mod\$residuals A = 1.9511, p-value = 4.502e-05 > plot(model, which=2) ## qqplot of std.residuals Answer: NO ! due to a few outliers ?

13 3a : Are the residuals independent ? > plot(model, which=1) ## residuals vs fitted values. A bad example : non-linear trend Answer: looks OK !

14 3b : Are the residuals identically distributed ? > plot(model, which=3) ## sqrt(abs(standardized(residuals))) vs fitted values. Continuing our bad example Answer: Not perfect, but OK ! Bad OK

16 Is our model OK ? despite having outliers Possible approaches 1) Repeat the analysis, removing the outliers, and check the model parameters. > summary(model) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -292.9915 50.6402 -5.786 1.34e-06 *** Ht 2.6533 0.2989 8.878 1.35e-10 *** > model2 <- lm(Weight ~ Ht, subset=seq(38)[-22]) > summary(model2) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -213.8000 47.4169 -4.509 7.00e-05 *** Ht 2.1731 0.2813 7.727 4.52e-09 *** R 2 = 0.686R 2 = 0.630 minus obs. #22

17 Possible approaches 2) Examine directly the impact each observation has on the model output. Depends on (a) leverage ( h ii ) "How extreme each x-value is compared to the others" Has the potential to influence the fit strongly. When there is only one predictor variable: Σ h ii = 2 In general (for later): Σ h ii = p (no. parameters. in the model)

18 However: the leverage is only part of the story The direct impact of an observation depends also on (b) its residual A point with strong leverage but small residual is not a problem: the model adequately accounts for it. A point with weak leverage but large residual is not a problem either: the model is weakly affected by it. A point with strong leverage and large residual, however, strongly influences the model. Removing the latter point will usually modify the model output to some extent.

19 Influential Observations > plot(model, which=5) ## standardized residuals vs Leverage. 22 31 13

20 A direct measure of influence combining both Leverages and Residuals Cook's distance fitted values when point i is omitted original fitted values Mean Square Error No. of Parameters = RSS / (n-p)

21 Combining both Leverages and Residuals Cook's distance Raw residuals Leverage values Any Di value above 0.5 deserves a closer look and Any Di value above 1 merits special attention. Standardised residuals

22 Combining both Leverages and Residuals Cook's distance Any Di value above 0.5 deserves a closer look and Any Di value above 1 merits special attention.

23 Remove point #22 ? Although point #22 is influential, it does not invalidate the model. We should probably keep it, but note that the regression slope may be shallower than the model suggests. A "reason" for the presence of these outliers can be suggested: some obese people were included in the survey. Depending on the purpose of the model, you may want to keep or remove these outliers. 22 31 13

24 Next : " 2-sample t-test " > data(energy) ## in ISwR package > attach(energy) > plot(expend ~ stature) > stripchart(expend ~ stature, method="jitter") first, let's have a look at the classical t-test t-value Standard Error of the Difference of Means Standard Error of the Mean

25 Classical 2-sample t-test > t.test(expend ~ stature) Welch Two Sample t-test data: expend by stature t = -3.8555, df = 15.919, p-val = 0.001411 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3.459167 -1.004081 sample estimates: mean in group lean mean in group obese 8.066154 10.297778 unequal variance for the difference Assuming Unequal Variance

26 Classical 2-sample t-test > t.test(expend~stature, var.equal=T) Two Sample t-test data: expend by stature t = -3.9456, df = 20, p-value = 0.000799 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3.411451 -1.051796 sample estimates: mean in group lean mean in group obese 8.066154 10.297778 Equal Variance slightly more significant Assuming Equal Variance

27 " 2-sample t-test " as a linear model The t-test could "logically" translate into : However: One of these β parameters is superfluous, and actually makes the model fitting through matrix algebra impossible. So, instead it is usually translated into: when X = "lean" when X = "obese" β0β0 for "all" X when X = "obese" β Δlean β Δobese β default β Δobese

28 " 2-sample t-test " as a linear model > mod <- lm(expend~stature) > summary(mod)... Residuals: Min 1Q Median 3Q Max -1.9362 -0.6153 -0.4070 0.2613 2.8138 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.0662 0.3618 22.297 1.34e-15 *** statureobese 2.2316 0.5656 3.946 0.000799 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Remember : linear models assume Equal Variance doesn't look symmetric difference between lean and obese averages Average for lean category factor w/ only two levels: lean & obese

29 > mod <- lm(expend~stature) > summary(mod)... Residuals: Min 1Q Median 3Q Max -1.9362 -0.6153 -0.4070 0.2613 2.8138 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.0662 0.3618 22.297 1.34e-15 *** statureobese 2.2316 0.5656 3.946 0.000799 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 " 2-sample t-test " as a linear model Remember : linear models assume Equal Variance Standard Error of the Mean (SEM) for lean Standard Error of the Difference of the Means (SEDM)

30 > mod <- lm(expend~stature) > summary(mod)... Residuals: Min 1Q Median 3Q Max -1.9362 -0.6153 -0.4070 0.2613 2.8138 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.0662 0.3618 22.297 1.34e-15 *** statureobese 2.2316 0.5656 3.946 0.000799 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.304 on 20 degrees of freedom Multiple R-squared: 0.4377, Adjusted R-squared: 0.4096 F-statistic: 15.57 on 1 and 20 DF, p-value: 0.000799 " 2-sample t-test " as a linear model Remember : linear models assume Equal Variance Same values as classical t-test (equal variance)

31 > anova(mod) Analysis of Variance Table Response: expend Df Sum Sq Mean Sq F value Pr(>F) stature 1 26.485 26.485 15.568 0.000799 *** Residuals 20 34.026 1.701 " 2-sample t-test " as a linear model SS model F = MS model / MSE RSS MS model = SS model / df model MSE = RSS / rdf RSS = 34.026 TSS = 60.511 SS model = 26.485

32 Are the assumptions met ? 1 : Is the response variable continuous ? YES ! 2 : Are the residuals normally distributed ? > library(nortest) ## Package of Normality tests > ad.test(mod\$residuals) ## Anderson-Darling Anderson-Darling normality test data: mod\$residuals A = 0.9638, p-value = 0.01224 > plot(mod, which=2) ## qqplot of std.residuals Answer: NO !

33 3 : Are the residuals independent and identically distributed ? > plot(mod, which=1) ## residuals vs fitted values. Answer: perhaps not identically distributed > plot(mod, which=3) ## sqrt(abs(standardized(residuals))) vs fitted values. leanobese

34 Transforming the response variable to produce normal residuals Can be optimised using the Box-Cox method. > library(MASS) > boxcox(mod, plotit=T) The method transforms the response y into g λ (y) where: optimal solution λ ≈ - 1/2

35 Is this transformation good enough ? Transforming the response variable to produce normal residuals > new.y <- 2 – 2/sqrt(expend) > mod2 <- lm(new.y ~ stature) > ad.test(residuals(mod2)) Anderson-Darling normality test data: residuals(mod2) A = 0.5629, p-value = 0.1280 > plot(mod2, which=2) > plot(mod, which=2) ## qqplot of std.residuals. Perhaps the assumption of equal variance is not warranted ??

36 Is this transformation good enough ? Transforming the response variable to produce normal residuals > var.test(new.y ~ stature) ## only works if stature is ## a factor with only 2 levels F test to compare two variances data: new.y by stature F = 1.5897, num df = 12, denom df = 8, p-value = 0.5201 alternative hypothesis: true ratio of variances is not equal to 1 95 percent confidence interval: 0.3785301 5.5826741 sample estimates: ratio of variances 1.589701

37 Is this transformation good enough ? Transforming the response variable to produce normal residuals > summary(mod) ## untransformed y Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.0662 0.3618 22.297 1.34e-15 *** statureobese 2.2316 0.5656 3.946 0.000799 *** > summary(mod2) ## transformed y. Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.29049 0.01301 99.176 < 2e-16 *** statureobese 0.08264 0.02034 4.062 0.000608 ***

38 Parametric extension of the t-test to more than 2 groups whose sample sizes can be unequal Next : One-Way Anova > data(coagulation) ## in faraway package ## blood coagulation times among ## 24 animals fed one of 4 diets > attach(coagulation) > plot(coag ~ diet) > stripchart(coag ~ diet, method="jitter")

39 One-Way Anova as an F-test > res.aov <- aov(coag ~ diet) ## classical ANOVA > summary(res.aov) Df Sum Sq Mean Sq F value Pr(>F) diet 3 228.0 76.0 13.571 4.658e-05 *** Residuals 20 112.0 5.6 --- RSS = 112 TSS = 340 SS model = 228

40 One-Way Anova as a linear model > mod <- lm(coag ~ diet) ## as a linear model > summary(mod) ## some output left out Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.100e+01 1.183e+00 51.554 < 2e-16 *** dietB 5.000e+00 1.528e+00 3.273 0.003803 ** dietC 7.000e+00 1.528e+00 4.583 0.000181 *** dietD -1.071e-14 1.449e+00 -7.39e-15 1.000000 Residual standard error: 2.366 on 20 degrees of freedom Multiple R-squared: 0.6706, Adjusted R-squared: 0.6212 F-statistic: 13.57 on 3 and 20 DF, p-value: 4.658e-05 > anova(mod) ## or summary(aov(mod)) Analysis of Variance Table Response: coag Df Sum Sq Mean Sq F value Pr(>F) diet 3 228.0 76.0 13.571 4.658e-05 *** Residuals 20 112.0 5.6 not very useful

41 Are the assumptions met ? 1 : Is the response variable continuous ? No, discrete ! 2 : Are the residuals normally distributed ? > library(nortest) ## Package of Normality tests > ad.test(mod\$residuals) ## Anderson-Darling Anderson-Darling normality test data: mod\$residuals A = 0.301, p-value = 0.5517 > plot(mod, which=2) ## qqplot of std.residuals Answer: Yes !

42 3 : Are the residuals independent and identically distributed ? > plot(mod, which=1) ## residuals vs fitted values. Answer: OK > plot(mod, which=3) ## sqrt(abs(standardized(residuals))) vs fitted values. A DB C 3 obs. > library(car) > levene.test(mod) Levene's Test for Homogeneity of Variance Df F value Pr(>F) group 3 0.6492 0.5926 20

43

44 More Classical Graphs Histogram + Theoretical curve Boxplot Stripchart Barplot Pie chart 3D models