
1 STATS 330: Lecture 10

2 Diagnostics 2
Aim of today's lecture:
- To describe some more remedies for non-planar data
- To look at diagnostics and remedies for non-constant scatter

3 Remedies for non-planar data (cont)
- Last time we looked at diagnostics for non-planar data
- We discussed what to do if the diagnostics indicate a problem
- The short answer was: we transform, so that the model fits the transformed data
- How to choose a transformation? Theory, the ladder of powers, polynomials
- We illustrate with a few examples

4 Example: using theory - cherry trees
A tree trunk is a bit like a cylinder:
Volume = π × (diameter/2)² × height
so
log(Volume) = log(π/4) + 2 log(diameter) + log(height)
and a linear regression using the logged variables should work! In fact R² increases from 95% to 98%, and the residual plots are better.
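A quick sketch of that comparison (the column names volume, diameter and height in cherry.df are the ones used in the fit on the next slide; the R² values in the comments are the ones quoted in the lecture):

raw.reg <- lm(volume ~ diameter + height, data = cherry.df)                  # untransformed fit
log.reg <- lm(log(volume) ~ log(diameter) + log(height), data = cherry.df)   # log-log fit from the cylinder formula
summary(raw.reg)$r.squared   # about 0.948
summary(log.reg)$r.squared   # about 0.978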

5 Example: cherry trees (cont)
> new.reg <- lm(log(volume) ~ log(diameter) + log(height), data = cherry.df)
> summary(new.reg)

Call:
lm(formula = log(volume) ~ log(diameter) + log(height), data = cherry.df)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    -6.63162    0.79979  -8.292 5.06e-09 ***
log(diameter)   1.98265    0.07501  26.432  < 2e-16 ***
log(height)     1.11712    0.20444   5.464 7.81e-06 ***
---
Residual standard error: 0.08139 on 28 degrees of freedom
Multiple R-Squared: 0.9777, Adjusted R-squared: 0.9761
F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16

(Previously 94.8% for the untransformed model.)

6 Example: cherry trees (original)

7 Example: cherry trees (logs)

8 Tyre abrasion data: gam plots
rubber.gam = gam(abloss ~ s(hardness) + s(tensile), data = rubber.df)
par(mfrow = c(1, 2)); plot(rubber.gam)
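A self-contained version of that call (assuming the gam() in use comes from the mgcv package, which is what the s() smooth terms suggest; rubber.df is assumed to hold abloss, hardness and tensile):

library(mgcv)                                                      # assumed source of gam() and s()
rubber.gam <- gam(abloss ~ s(hardness) + s(tensile), data = rubber.df)
par(mfrow = c(1, 2))                                               # one panel per smooth term
plot(rubber.gam)                                                   # estimated smooths with confidence bands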

9 Tyre abrasion data: polynomial
The GAM curve is like a polynomial, so fit a polynomial (i.e. include terms tensile², tensile³, …):
lm(abloss ~ hardness + poly(tensile, 4), data = rubber.df)
The 4 in poly(tensile, 4) is the degree of the polynomial. Usually a lot of trial and error is involved! We have succeeded when
- R² improves
- Residual plots show no pattern
A 4th-degree polynomial works for the rubber data: R² increases from 84% to 94%. A sketch of the comparison follows below.
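A hedged sketch of that before-and-after comparison (the plain linear fit used as the baseline is my assumption; the R² values in the comments are the 84% and 94% quoted on the slide):

rubber.lin <- lm(abloss ~ hardness + tensile, data = rubber.df)              # baseline: tensile enters linearly
rubber.poly4 <- lm(abloss ~ hardness + poly(tensile, 4), data = rubber.df)   # quartic in tensile
summary(rubber.lin)$r.squared     # about 0.84
summary(rubber.poly4)$r.squared   # about 0.94
par(mfrow = c(1, 2))
plot(rubber.lin, which = 1)       # residuals vs fitted: curvature still visible
plot(rubber.poly4, which = 1)     # residuals vs fitted: pattern largely gone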

10 Why 4th degree? Try a 5th-degree fit: the highest significant power is the 4th.
> rubber.lm = lm(abloss ~ poly(tensile, 5) + hardness, data = rubber.df)
> summary(rubber.lm)

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        615.3617    29.8178  20.637 2.44e-16 ***
poly(tensile, 5)1 -264.3933    25.0612 -10.550 2.76e-10 ***
poly(tensile, 5)2   23.6148    25.3437   0.932 0.361129
poly(tensile, 5)3  119.9500    24.6356   4.869 6.46e-05 ***
poly(tensile, 5)4  -91.6951    23.6920  -3.870 0.000776 ***
poly(tensile, 5)5    9.3811    23.6684   0.396 0.695495
hardness            -6.2608     0.4199 -14.911 2.59e-13 ***

Residual standard error: 23.67 on 23 degrees of freedom
Multiple R-squared: 0.9427, Adjusted R-squared: 0.9278
F-statistic: 63.11 on 6 and 23 DF, p-value: 3.931e-13

11 Ladder of powers
- Rather than fit polynomials in some independent variables, guided by gam plots, we can transform the response using the "ladder of powers" (i.e. use y^p as the response rather than y, for some power p)
- Choose p either by trial and error using R², or by using a "Box-Cox" plot (see later in this lecture); a trial-and-error sketch follows below
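A rough sketch of the trial-and-error route (illustrative only: it uses the education expenditure data introduced later in this lecture, with point 50 excluded, and simply records R² for each rung of the ladder; R² values on different response scales are only a rough guide):

powers <- c(-2, -1, -0.5, 0.5, 1, 2)                      # rungs of the ladder
r2 <- sapply(powers, function(p)
  summary(lm(I(educ^p) ~ urban + percap + under18,
             data = educ.df[-50, ]))$r.squared)
names(r2) <- powers
round(r2, 3)
# log(y) plays the role of the p = 0 rung:
summary(lm(log(educ) ~ urban + percap + under18, data = educ.df[-50, ]))$r.squared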

12 Checking for equal scatter
- The model specifies that the scatter about the regression plane is uniform
- In practice this means that the scatter doesn't depend on the explanatory variables or on the mean of the response
- All tests and confidence intervals rely on this

13 Scatter
- Scatter is measured by the size of the residuals
- A common problem is where the scatter increases as the mean response increases
- This means the big residuals happen when the fitted values are big
- Recognize this by a "funnel effect" in the residuals versus fitted values plot; a simulated illustration follows below
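A hedged illustration of the funnel effect using simulated data (not one of the lecture data sets; the error standard deviation is deliberately made proportional to the mean):

set.seed(330)
x <- runif(100, 1, 10)
mu <- 5 + 3 * x
y <- mu + rnorm(100, sd = 0.3 * mu)      # scatter grows with the mean
sim.lm <- lm(y ~ x)
plot(fitted(sim.lm), residuals(sim.lm),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Funnel effect: spread widens as fitted values grow")
abline(h = 0, lty = 2)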

14 Example: education expenditure data
- Data for the 50 states of the USA
- Variables are:
  - Per capita expenditure on education (response), variable educ
  - Per capita income, variable percap
  - Number of residents per 1000 under 18, variable under18
  - Number of residents per 1000 in urban areas, variable urban
- Fit the model educ ~ percap + under18 + urban

15 [Plot of the response: note the outlier]

16 Outlier: point 50 (California)

17 Basic fit, outlier in
> educ.lm = lm(educ ~ urban + percap + under18, data = educ.df)
> summary(educ.lm)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -555.92562  123.46634  -4.503 4.56e-05 ***
urban         -0.00476    0.05174  -0.092    0.927
percap         0.07236    0.01165   6.211 1.40e-07 ***
under18        1.55134    0.31545   4.918 1.16e-05 ***

Residual standard error: 40.53 on 46 degrees of freedom
Multiple R-squared: 0.5902, Adjusted R-squared: 0.5634
F-statistic: 22.08 on 3 and 46 DF, p-value: 5.271e-09

R² is 59%.
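A hedged side note (standard R diagnostics, not a function from the lecture): one way to confirm which observation is the troublesome one is to look at the standardized residuals and the residuals-vs-leverage plot, which should single out observation 50:

plot(educ.lm, which = 5)              # residuals vs leverage; point 50 should stand out
which.max(abs(rstandard(educ.lm)))    # index of the largest standardized residual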

18 Basic fit, outlier out
Note how we exclude point 50 with subset = -50.
> educ50.lm = lm(educ ~ urban + percap + under18, data = educ.df, subset = -50)
> summary(educ50.lm)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430  132.61422  -2.097 0.041664 *
urban          0.06624    0.04966   1.334 0.188948
percap         0.04827    0.01220   3.958 0.000266 ***
under18        0.88983    0.33159   2.684 0.010157 *
---
Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07

R² is now 49%.

19
> par(mfrow = c(1, 2))
> plot(educ50.lm, which = c(1, 3))
Note the funnel effect in the residuals-vs-fitted plot and the increasing relationship in the scale-location plot.

20 Remedies
- Either: transform the response
- Or: estimate the variances of the observations and use "weighted least squares"

21 Transforming the response
Transform to the reciprocal:
> tr.educ50.lm <- lm(I(1/educ) ~ urban + percap + under18, data = educ.df[-50,])
> plot(tr.educ50.lm)
Better!

22 What power to choose?
- How did we know to use reciprocals?
- Think of a more general model I(educ^p) ~ percap + under18 + urban, where p is some power
- Then estimate p from the data using a Box-Cox plot

23 Transforming the response (how?)
boxcoxplot(educ ~ urban + percap + under18, educ.df[-50,])
boxcoxplot is an R330 function that draws the "Box-Cox plot". The minimum is at about -1, pointing to the reciprocal transformation.
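If the R330 boxcoxplot function is not available, a similar picture can be obtained with boxcox from the MASS package (an alternative sketch, not the function used on the slide; MASS plots a profile log-likelihood, so its peak, rather than a minimum, should sit near -1):

library(MASS)
boxcox(educ ~ urban + percap + under18, data = educ.df[-50, ],
       lambda = seq(-2, 1, by = 0.1))   # profile log-likelihood over candidate powers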

24 Weighted least squares
- Tests are invalid if observations do not have constant variance
- If the ith observation has variance v_i σ², then we can get a valid test by using "weighted least squares", minimising the sum of weighted squared residuals Σ r_i²/v_i rather than the sum of squared residuals Σ r_i²
- We need to know the variances v_i

25 Finding the weights
- Step 1: plot the squared residuals versus the fitted values
- Step 2: smooth the plot
- Step 3: estimate the variance of an observation by the smoothed squared residual
- Step 4: the weight is the reciprocal of the smoothed squared residual
- Rationale: the variance is a function of the mean
- Use the R330 function funnel (a hand-rolled sketch of the same steps is shown below)
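A hedged, hand-rolled sketch of those four steps (illustrative only: it uses lowess for the smoothing and does not reproduce the log-scale plot or the slope that the R330 funnel function reports):

sqres <- residuals(educ50.lm)^2                     # Step 1: squared residuals...
fv <- fitted(educ50.lm)                             # ...versus fitted values
plot(fv, sqres, xlab = "Fitted values", ylab = "Squared residuals")
sm <- lowess(fv, sqres)                             # Step 2: smooth the plot
lines(sm)
vars.hand <- pmax(approx(sm$x, sm$y, xout = fv)$y,  # Step 3: smoothed squared residual = variance estimate
                  1e-8)                             # (guard against non-positive smoothed values)
wts <- 1 / vars.hand                                # Step 4: weight = reciprocal of the variance estimate
hand.weighted.lm <- lm(educ ~ urban + percap + under18,
                       weights = wts, data = educ.df[-50, ])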

26 Doing it in R (1)
vars = funnel(educ50.lm)   # an R330 function
The slope is about 1.7, which indicates p = 1 - 1.7 = -0.7: if the standard deviation of the response grows like the mean raised to the power of the slope, then raising the response to the power 1 - slope roughly stabilises the variance, so a power near -1 (the reciprocal) is suggested.

27
> educ50.lm <- lm(educ ~ urban + percap + under18, data = educ.df[-50,])
> summary(educ50.lm)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430  132.61422  -2.097 0.041664 *
urban          0.06624    0.04966   1.334 0.188948
percap         0.04827    0.01220   3.958 0.000266 ***
under18        0.88983    0.33159   2.684 0.010157 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-Squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07

Note the p-values (unweighted fit, shown again for comparison with the next slide).

28
Note the reciprocal weights (weights = 1/vars):
> weighted.lm <- lm(educ ~ urban + percap + under18, weights = 1/vars, data = educ.df[-50,])
> summary(weighted.lm)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -270.29363  102.61073  -2.634   0.0115 *
urban          0.01197    0.04030   0.297   0.7677
percap         0.05850    0.01027   5.694 8.88e-07 ***
under18        0.82384    0.27234   3.025   0.0041 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.019 on 45 degrees of freedom
Multiple R-squared: 0.629, Adjusted R-squared: 0.6043
F-statistic: 25.43 on 3 and 45 DF, p-value: 8.944e-10

Note how the coefficients and p-values change. Conclusion: unequal variances matter, and allowing for them can change the results!

