
1 Lecture 2 Linear Models I Olivier MISSA, Advanced Research Skills

2 Outline
What are linear models
Basic assumptions
"Refresher" on different types of model:
- Single linear regression
- Two-sample t-test
- ANOVA (one-way & two-way)
- ANCOVA
- Multiple linear regression

3 What are linear models?
Put simply, they are models that attempt to "explain" one response variable by combining several predictor variables linearly. In theory, the response variable and the predictor variables can each be continuous, discrete or categorical, which makes linear models particularly versatile. Indeed, many well-known statistical procedures are linear models, e.g. linear regression, the two-sample t-test, one-way & two-way ANOVA, ANCOVA, ...

4 What are linear models?
For the time being, we are going to assume that (1) the response variable is continuous, (2) the residuals (ε) are normally distributed, and (3) the residuals are independently and identically distributed. These "three" assumptions define classical linear models. We will see in later lectures ways to relax these assumptions and cope with an even wider range of situations (generalized linear models, mixed-effects models). But let's take things one step at a time.
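In symbols, these assumptions describe the classical linear model (a general statement implied by the assumptions, not a formula copied from the slides):
y = β0 + β1·x1 + β2·x2 + ... + βp·xp + ε,   with the errors ε independent and identically distributed as N(0, σ²)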

5 Starting with a simple example: single linear regression
1 response variable (continuous) vs. 1 explanatory variable, either continuous or discrete (but ordinal!). Attempts to fit a straight line, y = a + b·x.
> library(faraway)
> data(seatpos)
> attach(seatpos)
> model <- lm(Weight ~ Ht)
> model
Coefficients: (Intercept) Ht
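The same fit can be written without attach(), which is generally safer (a minimal sketch; the printed coefficients are whatever lm() returns on your copy of the data):
> model <- lm(Weight ~ Ht, data = seatpos)  ## weight regressed on height
> coef(model)                               ## the intercept a and slope b of y = a + b·x
> plot(Weight ~ Ht, data = seatpos)         ## scatterplot of the data
> abline(model)                             ## overlay the fitted straight line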

6 But how "significant" is this trend?
Can be approached in a number of ways.
1) R², the proportion of variance explained:
R² = 1 − RSS / TotalSS
RSS, the Residual Sum of Squares (the deviance), measures the deviations of the observations from the fitted values; TotalSS, the Total Sum of Squares, measures their deviations from the average value. Here TotalSS = 47371.
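Both sums of squares are easy to recover from the fitted model (a sketch, assuming model and seatpos from the previous slides):
> RSS <- deviance(model)                                 ## residual sum of squares
> TSS <- sum((seatpos$Weight - mean(seatpos$Weight))^2)  ## total sum of squares (47371)
> 1 - RSS/TSS                                            ## R², matches summary(model)$r.squared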

7 1) R², the proportion of variance explained (equivalently):
R² = SSreg / TotalSS
SSreg, the Sum of Squares due to regression, measures the deviations of the fitted values from the average value, so that SSreg = TotalSS − RSS; TotalSS = 47371.

8 1) R², the proportion of variance explained.
> summary(model)
Call: lm(formula = Weight ~ Ht)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e-06 ***
Ht e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: on 36 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 36 DF, p-value: 1.351e-10

9 2) Model parameters and their standard errors
> summary(model)
... some output left out
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e-06 ***
Ht e-10 ***
...
The t value asks how many standard errors away from 0 the estimates are.
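The t value column is literally Estimate divided by Std. Error, which is easy to verify (a sketch):
> cf <- summary(model)$coefficients      ## the table printed above
> cf[, "Estimate"] / cf[, "Std. Error"]  ## reproduces the t value column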

10 3) F-test
> anova(model)
Analysis of Variance Table
Response: Weight
Df Sum Sq Mean Sq F value Pr(>F)
Ht 1 e-10 ***
Residuals 36
MSreg = SSreg / dfreg, MSE = RSS / rdf, and F = MSreg / MSE.
The F value so obtained must be compared to the theoretical F distribution with 1 (numerator) and 36 (denominator) degrees of freedom.
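With the sums of squares from the earlier slides, the F test can be reproduced by hand (a sketch):
> SSreg <- TSS - RSS                   ## sum of squares due to regression
> Fval <- (SSreg / 1) / (RSS / 36)     ## MSreg / MSE, with 1 and 36 df
> pf(Fval, 1, 36, lower.tail = FALSE)  ## the p-value reported by anova()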

11 How strong is this trend?
[Figure: the fitted line with a 95% confidence interval around the slope and a 95% band for future observations.]
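Both intervals shown on the figure come from standard functions (a sketch):
> confint(model)  ## 95% confidence interval for the intercept and the slope
> new <- data.frame(Ht = seq(min(seatpos$Ht), max(seatpos$Ht), length.out = 50))
> pred <- predict(model, newdata = new, interval = "prediction")  ## 95% band for future observations
> matlines(new$Ht, pred, lty = c(1, 2, 2))  ## overlay the fit and the band on the scatterplot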

12 Are the assumptions met?
1: Is the response variable continuous? YES!
2: Are the residuals normally distributed?
> library(nortest)  ## package of normality tests
> ad.test(model$residuals)  ## Anderson-Darling
Anderson-Darling normality test
data: model$residuals
A = , p-value = 4.502e-05
> plot(model, which=2)  ## qqplot of std. residuals
Answer: NO! Due to a few outliers?

13 3a: Are the residuals independent?
> plot(model, which=1)  ## residuals vs fitted values
Answer: looks OK! (A bad example would show a non-linear trend in this plot.)

14 3b: Are the residuals identically distributed?
> plot(model, which=3)  ## sqrt(abs(standardized residuals)) vs fitted values
[Figure: an OK example alongside the continuing bad example.]
Answer: Not perfect, but OK!

15 Another bad example of residuals: heteroscedasticity.
[Figure: two bad examples in which the spread of the residuals changes with the fitted values.]

16 Is our model OK, despite having outliers?
Possible approaches:
1) Repeat the analysis, removing the outliers, and check the model parameters.
> summary(model)  ## all 38 observations; R² = 0.686
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e-06 ***
Ht e-10 ***
> model2 <- lm(Weight ~ Ht, subset=seq(38)[-22])  ## minus obs. #22
> summary(model2)
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e-05 ***
Ht e-09 ***

17 Possible approaches:
2) Examine directly the impact each observation has on the model output. This impact depends on:
(a) its leverage (h_ii): "how extreme each x-value is compared to the others". A high-leverage point has the potential to influence the fit strongly.
When there is only one predictor variable: Σ h_ii = 2.
In general (for later): Σ h_ii = p (the number of parameters in the model).
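Leverages can be extracted directly from the fitted model (a sketch; the 2·mean cut-off is a common rule of thumb, not from the slides):
> h <- hatvalues(model)   ## one leverage value per observation
> sum(h)                  ## equals 2 here: one intercept + one slope
> which(h > 2 * mean(h))  ## flag unusually high-leverage observations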

18 However, the leverage is only part of the story: the direct impact of an observation also depends on
(b) its residual.
A point with strong leverage but a small residual is not a problem: the model adequately accounts for it. A point with weak leverage but a large residual is not a problem either: the model is only weakly affected by it. A point with strong leverage and a large residual, however, strongly influences the model. Removing such a point will usually modify the model output to some extent.

19 Influential observations
> plot(model, which=5)  ## standardized residuals vs leverage

20 A direct measure of influence, combining both leverages and residuals: Cook's distance.
D_i = Σ_j ( ŷ_j(i) − ŷ_j )² / ( p · MSE )
where the ŷ_j(i) are the fitted values when point i is omitted, the ŷ_j are the original fitted values, p is the number of parameters and MSE = RSS / (n − p) is the Mean Square Error.

21 Combining both leverages and residuals: Cook's distance can also be written in terms of the standardised residuals r_i and the leverage values h_ii:
D_i = (r_i² / p) · h_ii / (1 − h_ii)
where the r_i standardise the raw residuals e_i as r_i = e_i / √( MSE·(1 − h_ii) ). Any D_i value above 0.5 deserves a closer look and any D_i value above 1 merits special attention.

22 Combining both leverages and residuals: Cook's distance.
[Figure: Cook's distances for the 38 observations.] Any D_i value above 0.5 deserves a closer look and any D_i value above 1 merits special attention.
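In R the distances come straight from the fitted model (a sketch):
> d <- cooks.distance(model)  ## D_i for every observation
> plot(model, which=4)        ## index plot of Cook's distance
> which(d > 0.5)              ## observations deserving a closer look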

23 Remove point #22?
Although point #22 is influential, it does not invalidate the model. We should probably keep it, but note that the regression slope may be shallower than the model suggests. A "reason" for the presence of these outliers can be suggested: some obese people were included in the survey. Depending on the purpose of the model, you may want to keep or remove these outliers.

24 Next: the "2-sample t-test"
> data(energy)  ## in ISwR package
> attach(energy)
> plot(expend ~ stature)
> stripchart(expend ~ stature, method="jitter")
First, let's have a look at the classical t-test:
t = ( x̄₁ − x̄₂ ) / SEDM
where SEDM = √( SEM₁² + SEM₂² ) is the Standard Error of the Difference of Means, built from the Standard Error of the Mean of each group, SEM = s / √n.
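These quantities can be computed step by step (a sketch; energy is attached, with groups lean and obese):
> m <- tapply(expend, stature, mean)  ## the two group means
> s <- tapply(expend, stature, sd)    ## the two group standard deviations
> n <- table(stature)                 ## the two group sizes
> SEM <- s / sqrt(n)                  ## standard error of each mean
> SEDM <- sqrt(sum(SEM^2))            ## unequal-variance (Welch) SEDM
> (m["lean"] - m["obese"]) / SEDM     ## reproduces the Welch t statistic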

25 Classical 2-sample t-test, assuming unequal variance
> t.test(expend ~ stature)
Welch Two Sample t-test
data: expend by stature
t = , df = , p-val =
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates: mean in group lean, mean in group obese
(the unequal variances enter through the standard error of the difference)

26 Classical 2-sample t-test, assuming equal variance
> t.test(expend~stature, var.equal=T)
Two Sample t-test
data: expend by stature
t = , df = 20, p-value =
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates: mean in group lean, mean in group obese
(slightly more significant than the unequal-variance version)

27 27 " 2-sample t-test " as a linear model The t-test could "logically" translate into : However: One of these β parameters is superfluous, and actually makes the model fitting through matrix algebra impossible. So, instead it is usually translated into: when X = "lean" when X = "obese" β0β0 for "all" X when X = "obese" β Δlean β Δobese β default β Δobese

28 28 " 2-sample t-test " as a linear model > mod <- lm(expend~stature) > summary(mod)... Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) e-15 *** statureobese *** --- Signif. codes: 0 ‘***’ ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Remember : linear models assume Equal Variance doesn't look symmetric difference between lean and obese averages Average for lean category factor w/ only two levels: lean & obese

29 The "2-sample t-test" as a linear model
> summary(mod)
... same coefficients table as on the previous slide ...
Here the Std. Error of the (Intercept) is the Standard Error of the Mean (SEM) for lean, and the Std. Error of statureobese is the Standard Error of the Difference of the Means (SEDM). Remember: linear models assume equal variance.

30 The "2-sample t-test" as a linear model
> summary(mod)
... same coefficients table as above, followed by:
Residual standard error: on 20 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 20 DF, p-value:
These are the same values as the classical t-test (equal variance). Remember: linear models assume equal variance.

31 The "2-sample t-test" as a linear model
> anova(mod)
Analysis of Variance Table
Response: expend
Df Sum Sq Mean Sq F value Pr(>F)
stature 1 ***
Residuals 20
MS_model = SS_model / df_model, MSE = RSS / rdf, and F = MS_model / MSE; TSS = SS_model + RSS.

32 Are the assumptions met?
1: Is the response variable continuous? YES!
2: Are the residuals normally distributed?
> library(nortest)  ## package of normality tests
> ad.test(mod$residuals)  ## Anderson-Darling
Anderson-Darling normality test
data: mod$residuals
A = , p-value =
> plot(mod, which=2)  ## qqplot of std. residuals
Answer: NO!

33 3: Are the residuals independent and identically distributed?
> plot(mod, which=1)  ## residuals vs fitted values
> plot(mod, which=3)  ## sqrt(abs(standardized residuals)) vs fitted values
[Figure: the residuals fall in two columns, lean and obese.]
Answer: perhaps not identically distributed.

34 Transforming the response variable to produce normal residuals
Can be optimised using the Box-Cox method.
> library(MASS)
> boxcox(mod, plotit=T)
The method transforms the response y into g_λ(y), where g_λ(y) = (y^λ − 1) / λ for λ ≠ 0, and g_0(y) = log(y).
[Figure: profile log-likelihood of λ; the optimal solution here is λ ≈ −1/2.]
With λ = −1/2, g_λ(y) = 2 − 2/√y, which is the transformation applied on the next slide.

35 Is this transformation good enough?
Transforming the response variable to produce normal residuals:
> new.y <- 2 - 2/sqrt(expend)
> mod2 <- lm(new.y ~ stature)
> ad.test(residuals(mod2))
Anderson-Darling normality test
data: residuals(mod2)
A = , p-value =
> plot(mod2, which=2)  ## qqplot of std. residuals (compare with plot(mod, which=2))
Perhaps the assumption of equal variance is not warranted??

36 Is this transformation good enough?
> var.test(new.y ~ stature)  ## only works if stature is a factor with only 2 levels
F test to compare two variances
data: new.y by stature
F = , num df = 12, denom df = 8, p-value =
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
sample estimates: ratio of variances

37 Is this transformation good enough?
> summary(mod)  ## untransformed y
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e-15 ***
statureobese ***
> summary(mod2)  ## transformed y
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) < 2e-16 ***
statureobese ***

38 Next: One-way ANOVA
A parametric extension of the t-test to more than 2 groups, whose sample sizes can be unequal.
> data(coagulation)  ## in faraway package: blood coagulation times among 24 animals fed one of 4 diets
> attach(coagulation)
> plot(coag ~ diet)
> stripchart(coag ~ diet, method="jitter")

39 One-way ANOVA as an F-test
> res.aov <- aov(coag ~ diet)  ## classical ANOVA
> summary(res.aov)
Df Sum Sq Mean Sq F value Pr(>F)
diet e-05 ***
Residuals
SS_model = 228, RSS = 112, TSS = 340.
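With the sums of squares on the slide, the F statistic can be checked by hand (a sketch; 4 diets and 24 animals give 3 and 20 degrees of freedom):
> MSmodel <- 228 / 3  ## SS_model / df_model
> MSE <- 112 / 20     ## RSS / residual df
> MSmodel / MSE       ## F ≈ 13.57
> pf(MSmodel / MSE, 3, 20, lower.tail = FALSE)  ## ≈ 4.66e-05, matching the output above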

40 One-way ANOVA as a linear model
> mod <- lm(coag ~ diet)  ## as a linear model
> summary(mod)  ## some output left out
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.100e+01 < 2e-16 ***
dietB 5.000e+00 **
dietC 7.000e+00 ***
dietD
Residual standard error: on 20 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 20 DF, p-value: 4.658e-05
> anova(mod)  ## or summary(aov(mod))
Analysis of Variance Table
Response: coag
Df Sum Sq Mean Sq F value Pr(>F)
diet e-05 ***
Residuals
The coefficients table is not very useful here: each dietX estimate is a difference from the reference diet A, not a group mean.

41 Are the assumptions met?
1: Is the response variable continuous? No, discrete!
2: Are the residuals normally distributed?
> library(nortest)  ## package of normality tests
> ad.test(mod$residuals)  ## Anderson-Darling
Anderson-Darling normality test
data: mod$residuals
A = 0.301, p-value =
> plot(mod, which=2)  ## qqplot of std. residuals
Answer: Yes!

42 3: Are the residuals independent and identically distributed?
> plot(mod, which=1)  ## residuals vs fitted values
> plot(mod, which=3)  ## sqrt(abs(standardized residuals)) vs fitted values
[Figure: residuals for the four diet groups (A, D, B, C); "3 obs." marks overplotted points.]
Answer: OK
> library(car)
> levene.test(mod)
Levene's Test for Homogeneity of Variance
Df F value Pr(>F)
group


44 More classical graphs
- Histogram + theoretical curve
- Boxplot
- Stripchart
- Barplot
- Pie chart
- 3D models

