Published by Zaria Elvin. Modified about 1 year ago.

1
Lecture 10
F-tests in MLR (continued)
Coefficients of Determination
BMTRY 701 Biostatistical Methods II

2
F-tests continued
Two kinds of F-tests: overall (global) and partial
Overall F-test (or Global F-test): tests whether or not there is a regression relation between Y and the set of covariates
For a regression with p covariates, the overall F-test uses F* = MSR/MSE, which follows F(p, n-p-1) under the null hypothesis of no regression relation

3
Recall earlier example
“Full” model: LOS = β0 + β1 INFRISK + β2 ms + β3 NURSE + β4 nurse2 + ε
The overall F-test tests if there is some association, i.e., H0: β1 = β2 = β3 = β4 = 0

4
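> reg1 <- lm(LOS ~ INFRISK + ms + NURSE + nurse2, data=data)
> anova(reg1)
Analysis of Variance Table
Response: LOS
          Df Sum Sq Mean Sq F value Pr(>F)
INFRISK                             e-10 ***
ms                                       *
NURSE
nurse2
Residuals
SSR <- sum(anova(reg1)[1:4, "Sum Sq"])     # regression SS: sum of the four covariate rows
SSE <- anova(reg1)["Residuals", "Sum Sq"]  # error SS: the Residuals row
MSR <- SSR/4
MSE <- SSE/108
Fstar <- MSR/MSE
Fstar
1 - pf(Fstar, 4, 108)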

5
But the global F is part of the “summary” output, so there is no need for the additional calculations
> summary(reg1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.355e e < 2e-16 ***
INFRISK 6.289e e e-06 ***
ms 7.829e e
NURSE 4.136e e
nurse2 e e
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: on 108 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 4 and 108 DF, p-value: 1.298e-08
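The hand calculation of the global F-test can be sketched on simulated data. Everything below (the data frame `sim`, the variables `x1`, `x2`, `x3`, the coefficients, and the seed) is made up for illustration and is not the lecture's SENIC data; the point is only that MSR/MSE built from the ANOVA table matches the F-statistic that `summary()` reports.

```r
# Simulated stand-in for the lecture's data (all values are hypothetical).
set.seed(701)
n  <- 113
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.3)
x3 <- rnorm(n)
y  <- 9 + 0.6 * x1 + 0.8 * x2 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

fit <- lm(y ~ x1 + x2 + x3, data = sim)
tab <- anova(fit)
p   <- 3                                   # number of covariates

SSR <- sum(tab[["Sum Sq"]][1:p])           # sequential SS add up to the regression SS
SSE <- tab["Residuals", "Sum Sq"]
MSR <- SSR / p
MSE <- SSE / (n - p - 1)
Fstar <- MSR / MSE
pval  <- pf(Fstar, p, n - p - 1, lower.tail = FALSE)

# Compare with the global F-statistic stored in the summary object
fs <- summary(fit)$fstatistic              # named vector: value, numdf, dendf
print(c(by_hand = Fstar, from_summary = unname(fs["value"])))
print(pval)
```

Using `lower.tail = FALSE` is equivalent to `1 - pf(...)` but is numerically safer when the p-value is very small.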

6
Partial F-test
“Partial” because it tests “part” of the model.
Tests one or more covariates simultaneously
Can be done using the ANOVA table, if covariates are entered in the ‘correct’ order
Or, by comparing results from regression tables
Examples:

7
ANOVA tables with 3 covariates
Source      SS             df      MS
X1          SS(X1)         1       SS(X1)/1
X2|X1       SS(X2|X1)      1       SS(X2|X1)/1
X3|X2,X1    SS(X3|X2,X1)   1       SS(X3|X2,X1)/1
Error       SSE            n - 4   SSE/(n-4)
Total       SST            n - 1

8
ANOVA tables with 3 covariates
Source      SS             df      MS
Regression  SS(X1,X2,X3)   3       SSR/3
X1          SS(X1)         1       SS(X1)/1
X2|X1       SS(X2|X1)      1       SS(X2|X1)/1
X3|X2,X1    SS(X3|X2,X1)   1       SS(X3|X2,X1)/1
Error       SSE            n - 4   SSE/(n-4)
Total       SST            n - 1
where SS(X1,X2,X3) = SS(X1) + SS(X2|X1) + SS(X3|X2,X1)

9
Interpretation of ANOVA table with >1 covariate
> anova(reg1)
Analysis of Variance Table
Response: LOS
          Df Sum Sq Mean Sq F value Pr(>F)
INFRISK                             e-10 ***
ms                                       *
NURSE
nurse2
Residuals
SSR(INFRISK) =
SSR(ms | INFRISK) =
SSR(NURSE | ms, INFRISK) =
SSR(nurse2 | NURSE, ms, INFRISK) =
What are these F-tests and p-values testing?

10
F-tests and p-values in ANOVA table
They are tests for a covariate, conditional on what is above it in the table.
Example: F statistic for INFRISK
  Is it adjusted for other covariates? No.
  It tests INFRISK in the presence of no other covariates.
  The p-value is very small (on the order of 1e-10 in the ANOVA table).

11
F-tests and p-values in ANOVA table
Example: F statistic for ‘ms’
  Is it adjusted for other covariates? Yes.
  It tests the significance of ms, after adjusting for INFRISK.
  p = 0.03
Example: F statistic for nurse2
  It tests the significance of β4, adjusting for INFRISK, ms, NURSE.
  p = 0.41

12
Interpretation of ANOVA table with >1 covariate
> reg1a <- lm(LOS ~ ms + NURSE + nurse2 + INFRISK, data=data)
> anova(reg1a)
Analysis of Variance Table
Response: LOS
          Df Sum Sq Mean Sq F value Pr(>F)
ms                                       ***
NURSE                                    *
nurse2                                   **
INFRISK                             e-06 ***
Residuals
SSR(ms) =
SSR(NURSE | ms) =
SSR(nurse2 | ms, NURSE) =
SSR(INFRISK | ms, NURSE, nurse2) =

13
Implications
ANOVA table results depend on the order in which the covariates appear
If you want to use the ANOVA table to test one or more covariates, they should come at the end
reg1:
  we can see if INFRISK is significant without any adjustments
  we can see if nurse2 is significant adjusting for everything else
reg1a:
  we can see if INFRISK is significant adjusting for everything else
  we can see if nurse2 is significant, adjusting for NURSE and ms, but not adjusting for INFRISK
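The order dependence can be seen in a small simulation. This is a sketch with hypothetical variables `x1` and `x2` (deliberately correlated), not the lecture's data. It fits the same model with the covariates in two orders and shows that the sequential sums of squares change while their total does not, and that the F-test for the covariate entered last equals the square of its t-statistic.

```r
# Hypothetical simulated data; x2 is built to be correlated with x1.
set.seed(10)
n  <- 113
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + rnorm(n)

fit_a <- lm(y ~ x1 + x2)    # x1 first, x2 last
fit_b <- lm(y ~ x2 + x1)    # x2 first, x1 last
tab_a <- anova(fit_a)
tab_b <- anova(fit_b)

ss_x1_alone    <- tab_a["x1", "Sum Sq"]   # SSR(x1): no adjustment
ss_x1_given_x2 <- tab_b["x1", "Sum Sq"]   # SSR(x1 | x2): adjusted for x2
print(c(unadjusted = ss_x1_alone, adjusted = ss_x1_given_x2))

# The total regression SS does not depend on the order
total_a <- sum(tab_a[c("x1", "x2"), "Sum Sq"])
total_b <- sum(tab_b[c("x1", "x2"), "Sum Sq"])
print(c(total_a, total_b))

# For the covariate entered last, the ANOVA F equals the squared t from summary()
F_last <- tab_a["x2", "F value"]
t_last <- summary(fit_a)$coefficients["x2", "t value"]
print(c(F = F_last, t_squared = t_last^2))
```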

14
F-tests
Global F-test
Partial F-test for ONE covariate

15
F-tests (continued)
Partial F-test for >1 covariate
Implications:
  The denominator is always the MSE from the full model
  The numerator can always be determined by entering the covariates in the order in which you want to test them
Recall: additivity of sums of squares
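The formulas on these two slides did not survive in the transcript. The standard textbook forms, written to match the notation used elsewhere in the lecture, are sketched below. The symbols are assumptions for illustration: p is the number of covariates in the full model, q is the number of covariates being tested, and MSE is always taken from the full model.

```latex
% Global F-test, H_0: \beta_1 = \dots = \beta_p = 0
F^* = \frac{SSR(X_1,\dots,X_p)/p}{SSE(X_1,\dots,X_p)/(n-p-1)}
    = \frac{MSR}{MSE} \sim F(p,\; n-p-1)

% Partial F-test for ONE covariate X_k, H_0: \beta_k = 0
F^* = \frac{SSR(X_k \mid \text{all other } X)/1}{MSE} \sim F(1,\; n-p-1)

% Partial F-test for a set of q covariates (entered last), H_0: all q coefficients are 0
F^* = \frac{SSR(X_{p-q+1},\dots,X_p \mid X_1,\dots,X_{p-q})/q}{MSE} \sim F(q,\; n-p-1)

% Additivity: the numerator SS is the drop in SSE from the reduced to the full model
SSR(X_{p-q+1},\dots,X_p \mid X_1,\dots,X_{p-q})
  = SSE(X_1,\dots,X_{p-q}) - SSE(X_1,\dots,X_p)
```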

16
More on the partial F test
Test whether an individual βk = 0
Test whether a set of βk = 0
Model 1:
Model 2:
Model 3:

17
Testing two or more covariates
To test Model 1 vs. Model 3 we are testing that β3 = 0 AND β4 = 0
H0: β3 = β4 = 0 vs. Ha: β3 ≠ 0 or β4 ≠ 0
If we fail to reject the null hypothesis, the data are consistent with β3 = β4 = 0, and we prefer the simpler Model 3 to Model 1
Model 1: LOS = β0 + β1 INFRISK + β2 ms + β3 NURSE + β4 nurse2 + ε
Model 3: LOS = β0 + β1 INFRISK + β2 ms + ε

18
Interpretation of ANOVA table with >1 covariate
> anova(reg1)
Analysis of Variance Table
Response: LOS
          Df Sum Sq Mean Sq F value Pr(>F)
INFRISK                             e-10 ***
ms                                       *
NURSE
nurse2
Residuals
SSR(INFRISK) =
SSR(ms | INFRISK) =
SSR(NURSE | ms, INFRISK) =
SSR(nurse2 | NURSE, ms, INFRISK) = 1.789

19
Using ANOVA table results
SSR(NURSE, nurse2 | INFRISK, ms) = SSR(NURSE | ms, INFRISK) + SSR(nurse2 | NURSE, ms, INFRISK) = 1.097 + 1.789 = 2.886
MSR = 2.886/2 = 1.443
F* = 1.443/2.565 = 0.563 ~ F(2, 108)
p-value = 0.57

20
R: simpler approach
> anova(reg3)
Analysis of Variance Table
Response: LOS
          Df Sum Sq Mean Sq F value Pr(>F)
INFRISK                             e-10 ***
ms                                       *
Residuals
> anova(reg1, reg3)
Analysis of Variance Table
Model 1: LOS ~ INFRISK + ms + NURSE + nurse2
Model 2: LOS ~ INFRISK + ms
Res.Df RSS Df Sum of Sq F Pr(>F)
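Both routes to the partial F-test (adding up sequential sums of squares from the full model, or calling `anova` on two nested models) give the same statistic. The sketch below checks this on simulated data; the variables `x1` to `x4` and the seed are hypothetical and only mimic the structure of the lecture's example, where the last two covariates are the ones being tested.

```r
# Hypothetical simulated data: x3 and x4 play the role of NURSE and nurse2.
set.seed(19)
n  <- 113
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- x3^2
y  <- 2 + x1 + 0.5 * x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3 + x4)   # covariates to be tested are entered last
reduced <- lm(y ~ x1 + x2)

# Route 1: sequential SS from the full model's ANOVA table
tab    <- anova(full)
ss_set <- sum(tab[c("x3", "x4"), "Sum Sq"])   # SSR(x3, x4 | x1, x2)
mse    <- tab["Residuals", "Mean Sq"]         # MSE from the full model
Fstar  <- (ss_set / 2) / mse
pval   <- pf(Fstar, 2, full$df.residual, lower.tail = FALSE)

# Route 2: compare the nested models directly
cmp <- anova(reduced, full)
print(c(by_hand = Fstar, from_anova = cmp[2, "F"]))
print(c(by_hand = pval,  from_anova = cmp[2, "Pr(>F)"]))
```

Listing the reduced model first keeps the `Df` and `Sum of Sq` columns positive; listing the full model first, as on the slide, flips their signs but leaves F and the p-value unchanged.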

21
R
> summary(reg3)
Call: lm(formula = LOS ~ INFRISK + ms, data = data)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) <2e-16 ***
INFRISK e-08 ***
ms *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: on 110 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 110 DF, p-value: 8.42e-10

22
Testing multiple coefficients simultaneously
Region: it is a ‘factor’ variable with 4 categories
> reg4 <- lm(LOS ~ factor(REGION), data=data)
> anova(reg4)
Analysis of Variance Table
Response: LOS
               Df Sum Sq Mean Sq F value Pr(>F)
factor(REGION)                           e-07 ***
Residuals

23
Continued…
> summary(reg4)
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept) < 2e-16 ***
factor(REGION)2 **
factor(REGION)3 e-05 ***
factor(REGION)4 e-07 ***
Residual standard error: on 109 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 109 DF, p-value: 5.376e-07

24
Recall previous example
Interaction between REGION and MEDSCHL

25
How to test the interaction terms?
Approach 1: Fit two models
  model with interactions
  model without interactions
  compare models using ‘anova’ command
Approach 2: Fit one model
  find SSR for interactions, conditional on main effects
  calculate F-statistic
  calculate p-value

26
Approach 1
> reg5 <- lm(logLOS ~ factor(REGION)*ms, data=data)
> reg6 <- lm(logLOS ~ factor(REGION) + ms, data=data)
> anova(reg6, reg5)
Analysis of Variance Table
Model 1: logLOS ~ factor(REGION) + ms
Model 2: logLOS ~ factor(REGION) * ms
Res.Df RSS Df Sum of Sq F Pr(>F)

27
Approach 2
> anova(reg5)
Analysis of Variance Table
Response: logLOS
                  Df Sum Sq Mean Sq F value Pr(>F)
factor(REGION)                           e-08 ***
ms                                       ***
factor(REGION):ms
Residuals
What are degrees of freedom for the F-test?
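The two approaches can be compared on simulated data of the same shape: a four-level factor, a binary covariate, and n = 113. The names `region`, `ms`, and `y` and all generated values are hypothetical. Because the interaction is the last term entered, its row in the single-model ANOVA table is already the partial F-test, and it agrees with the nested-model comparison.

```r
# Hypothetical simulated data with a 4-level factor and a binary covariate.
set.seed(27)
n      <- 113
region <- factor(sample(1:4, n, replace = TRUE))
ms     <- rbinom(n, 1, 0.4)
y      <- 2 + 0.1 * as.integer(region) + 0.2 * ms + rnorm(n, sd = 0.2)

with_int    <- lm(y ~ region * ms)
without_int <- lm(y ~ region + ms)

# Approach 2: one model, read the interaction row (it is entered last)
tab    <- anova(with_int)
df_num <- tab["region:ms", "Df"]        # (4 - 1) * (2 - 1) interaction terms
df_den <- tab["Residuals", "Df"]        # n minus the 8 fitted coefficients
Fstar  <- tab["region:ms", "Sum Sq"] / df_num / tab["Residuals", "Mean Sq"]
pval   <- pf(Fstar, df_num, df_den, lower.tail = FALSE)

# Approach 1: two nested models
cmp <- anova(without_int, with_int)
print(c(df_num = df_num, df_den = df_den))
print(c(approach_2 = Fstar, approach_1 = cmp[2, "F"]))
```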

28
Concluding remarks re: F-test
Global F-test: not very common, except for very small models
Partial F-test for individual covariate: not very common because it is the same as the t-test
Partial F-test for set of covariates: quite common
  easiest: compare nested models with the ‘anova’ command
  can also use the ANOVA table from the full model to determine the F-statistic

29
Coefficient of Determination
Also called R²
Measures the proportion of the variability in Y explained by the covariates.
Two questions (and think ‘sums of squares’ in ANOVA):
  How do we measure the variance in Y?
  How do we measure the variance explained by the X’s?

30
R²
The coefficient of determination is defined as R² = SSR/SST = 1 - SSE/SST
SST: total variation in Y
SSR: variation explained by the X’s
SSE: variation left over, not explained by the regression
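The definition can be checked directly from the three sums of squares. The following is a minimal sketch on simulated data (variables `x1`, `x2`, `y` and the seed are hypothetical); it computes SST, SSR, and SSE by hand and compares the ratio with the value `summary()` reports.

```r
# Hypothetical simulated data.
set.seed(30)
n  <- 113
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

SST <- sum((y - mean(y))^2)             # total variation in Y
SSR <- sum((fitted(fit) - mean(y))^2)   # variation explained by the X's
SSE <- sum(resid(fit)^2)                # variation left unexplained

R2 <- SSR / SST
print(c(SST = SST, SSR_plus_SSE = SSR + SSE))
print(c(by_hand = R2, one_minus = 1 - SSE / SST, from_summary = summary(fit)$r.squared))
```

The decomposition SST = SSR + SSE relies on the model containing an intercept, which is the case for every model in this lecture.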

31
Use of R²
Similar to correlation
But, not specific to just one X and Y
Partitioning of explained versus unexplained
For certain models, it can be used to determine if addition of a covariate helps ‘predict’

32
SENIC example
> summary(reg1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.355e e < 2e-16 ***
INFRISK 6.289e e e-06 ***
ms 7.829e e
NURSE 4.136e e
nurse2 e e
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: on 108 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 4 and 108 DF, p-value: 1.298e-08
32% of the variance in LOS is explained by the regression model

33
Misunderstandings re: R²
Misunderstanding 1: A high R² indicates that a useful prediction can be made
  there still may be considerable uncertainty, due to small N
  recall that the precision of a prediction depends on how close “X” is to the mean
Misunderstanding 2: A high R² indicates that the regression model is a ‘good fit’
  high R² says nothing about adhering to model assumptions
  standard diagnostics should still be used, even if R² is high
Misunderstanding 3: R² near 0 indicates X and Y are not related
  you can still have strong association with a lot of unexplained variance (e.g., age and cancer)
  for similar reasons as above, need to look at the modeling
  X and Y may be related, but not linearly

34
What if we remove the ‘insignificant’ X’s?
> reg7 <- lm(LOS ~ INFRISK, data=data)
> summary(reg7)
Call: lm(formula = LOS ~ INFRISK, data = data)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) < 2e-16 ***
INFRISK e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: on 111 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 111 DF, p-value: 1.177e-09

35
R² decreased?
Adding a covariate never decreases R², and in practice it almost always increases it. Why?
  there is always at least a little bit explained by the new X
  the only possible way to have no increase in R² would be if the new covariate had an estimated β of exactly 0
  it is ‘almost never’ true that the slope estimate is exactly 0
Extreme case: perfect linear association between two covariates (e.g., age in years and age in months), where the second covariate adds nothing

36
“Solution”: Adjusted R²
Accounts for the number of covariates in the model
“Purists” do not like the adjusted R²
The adjusted R² only increases with a new covariate if the new term “improves” the model more than expected by chance alone.
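The slide's formula is missing from the transcript; the usual definition is adjusted R² = 1 - [(n - 1)/(n - p - 1)] (1 - R²), with p covariates. The sketch below, on simulated data with a hypothetical pure-noise covariate `noise`, shows that R² does not go down when the noise covariate is added, and checks the adjusted value against the formula. Whether the adjusted R² actually drops depends on the particular sample, so the code prints both values rather than asserting a direction.

```r
# Hypothetical simulated data; 'noise' is unrelated to y by construction.
set.seed(36)
n     <- 113
x     <- rnorm(n)
noise <- rnorm(n)
y     <- 1 + 0.5 * x + rnorm(n)

small <- lm(y ~ x)
big   <- lm(y ~ x + noise)

r2  <- c(small = summary(small)$r.squared,     big = summary(big)$r.squared)
adj <- c(small = summary(small)$adj.r.squared, big = summary(big)$adj.r.squared)
print(rbind(r2, adj))

# Adjusted R-squared by hand for the larger model (p = 2 covariates)
p        <- 2
adj_hand <- 1 - (n - 1) / (n - p - 1) * (1 - r2[["big"]])
print(c(by_hand = adj_hand, from_summary = adj[["big"]]))
```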

37
Coefficients of Partial Determination
Measures the marginal contribution of one X variable when all others are already in the model
Intuitively, how much variation in Y are we explaining, after accounting for what is already in the model?
Construction in the two-covariate case:

38
Example: X1 = ms, X2 = INFRISK
> anova(reg3)
Analysis of Variance Table
Response: LOS
          Df Sum Sq Mean Sq F value Pr(>F)
INFRISK                             e-10 ***
ms                                       *
Residuals
> anova(reg7)
Analysis of Variance Table
Response: LOS
          Df Sum Sq Mean Sq F value Pr(>F)
INFRISK                             e-09 ***
Residuals

39
Example: X1 = ms, X2 = INFRISK
SSR(X1|X2) = SSR(ms|INFRISK) =
SSE(X2) = SSE(INFRISK) =
R²(Y 1|2) = SSR(X1|X2) / SSE(X2) =
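The numbers for this example are not preserved in the transcript, but the calculation can be sketched on simulated data. The variables below are hypothetical stand-ins (`x1` binary like ms, `x2` continuous like INFRISK). The coefficient of partial determination is the extra sum of squares for X1 given X2, divided by the error sum of squares of the model that contains only X2.

```r
# Hypothetical simulated data: x1 mimics a binary covariate, x2 a continuous one.
set.seed(39)
n  <- 113
x1 <- rbinom(n, 1, 0.3)
x2 <- rnorm(n)
y  <- 9 + 0.8 * x1 + 0.6 * x2 + rnorm(n)

fit_x2   <- lm(y ~ x2)          # model with X2 only
fit_both <- lm(y ~ x2 + x1)     # X1 entered last, so its row is SSR(X1 | X2)

sse_x2          <- sum(resid(fit_x2)^2)               # SSE(X2)
sse_both        <- sum(resid(fit_both)^2)             # SSE(X1, X2)
ssr_x1_given_x2 <- anova(fit_both)["x1", "Sum Sq"]    # SSR(X1 | X2)

r2_partial <- ssr_x1_given_x2 / sse_x2
print(c(SSR_x1_given_x2 = ssr_x1_given_x2, SSE_x2 = sse_x2, partial_R2 = r2_partial))

# Same quantity as the relative drop in SSE when X1 is added
print((sse_x2 - sse_both) / sse_x2)
```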

40
General Case
Examples with 3 and 4 covariates
Can also be generalized for a set of covariates
