
1 STATS 330: Lecture 11

2 Outliers and high-leverage points
• An outlier is a point that has a larger or smaller y value than the model would suggest.
  - Can be due to a genuine large error
  - Can be caused by typographical errors in recording the data
• A high-leverage point is a point with extreme values of the explanatory variables.

3 Outliers
• The effect of an outlier depends on whether it is also a high-leverage point.
• A "high-leverage" outlier:
  - Can attract the fitted plane, distorting the fit, sometimes extremely
  - In extreme cases may not have a big residual
  - In extreme cases can increase R²
• A "low-leverage" outlier:
  - Does not distort the fit to the same extent
  - Usually has a big residual
  - Inflates standard errors and decreases R²

4 [Figure: illustrative cases - no outliers and no high-leverage points; a low-leverage outlier (big residual); a high-leverage point that is not an outlier; a high-leverage outlier]

5 Example: the education data (ignoring urban)
[Figure: a high-leverage point is marked]

6 An outlier also?
[Figure: the residual for this point is somewhat extreme]

7 Measuring leverage
It can be shown (see e.g. STATS 310) that the fitted value of case i is related to the response data y_1, ..., y_n by the equation

    ŷ_i = h_i1 y_1 + h_i2 y_2 + ... + h_in y_n

The h_ij depend only on the explanatory variables. The quantities h_ii are called "hat matrix diagonals" (HMDs) and measure the influence y_i has on the ith fitted value. They can also be interpreted as the distance between the x-data for the ith case and the average x-data for all the cases. Thus, they directly measure how extreme the x-values of each point are.
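This relation can be checked directly in R. A minimal sketch, assuming the education data frame educ.df (with columns educ, percapita and under18, as used on a later slide) is loaded and has no missing values: the hat matrix is H = X(XᵀX)⁻¹Xᵀ, and hatvalues() extracts its diagonal.

educ.lm <- lm(educ ~ percapita + under18, data = educ.df)
X <- model.matrix(educ.lm)                 # n x p design matrix (intercept column included)
H <- X %*% solve(t(X) %*% X) %*% t(X)      # hat matrix: fitted values are H %*% y
fitted.by.hand <- as.vector(H %*% educ.df$educ)
all.equal(fitted.by.hand, unname(fitted(educ.lm)))      # TRUE: ŷ = H y
all.equal(unname(diag(H)), unname(hatvalues(educ.lm)))  # TRUE: diag(H) gives the HMDs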

8 Interpreting the HMDs
• Each HMD lies between 0 and 1.
• The average HMD is p/n (p = number of regression coefficients, so p = k + 1).
• An HMD of more than 3p/n is considered extreme.
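A quick sketch of the 3p/n rule, reusing the educ.lm fit from the surrounding slides:

h <- hatvalues(educ.lm)
p <- length(coef(educ.lm))   # number of regression coefficients (k + 1)
n <- length(h)
mean(h)                      # equals p/n, since the HMDs sum to p
which(h > 3 * p / n)         # cases flagged as extreme by the 3p/n rule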

9 Example: the education data

educ.lm <- lm(educ ~ percapita + under18, data = educ.df)

> hatvalues(educ.lm)[50]
       50 
0.3428523 
> 9/50
[1] 0.18

Here n = 50 and p = 3, so 3p/n = 9/50 = 0.18: point 50 is clearly extreme!

10 Studentized residuals
• How can we recognize a big residual? How big is big?
• The actual size depends on the units in which the y-variable is measured, so we need to standardize the residuals.
• We can divide each residual by its standard deviation.
• The variance of a typical residual e is var(e) = (1 - h)σ², where h is the hat matrix diagonal for the point.

11 Studentized residuals (2)
• "Internally studentised" (called "standardised" in R):

      r_i = e_i / ( s √(1 - h_ii) )

• "Externally studentised" (called "studentised" in R):

      t_i = e_i / ( s_(i) √(1 - h_ii) )

Here s² is the usual estimate of σ², and s²_(i) is the estimate of σ² after deleting the ith data point.
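A sketch of how these definitions can be computed directly for educ.lm and checked against the MASS functions introduced two slides below. The deleted-case estimate uses the standard identity (n - p - 1) s²_(i) = (n - p) s² - e_i²/(1 - h_ii).

library(MASS)
e <- residuals(educ.lm)
h <- hatvalues(educ.lm)
s <- summary(educ.lm)$sigma                 # usual estimate of sigma
n <- length(e); p <- length(coef(educ.lm))
s.i <- sqrt(((n - p) * s^2 - e^2 / (1 - h)) / (n - p - 1))  # sigma estimated with case i deleted
internal <- e / (s   * sqrt(1 - h))    # "standardised" in R
external <- e / (s.i * sqrt(1 - h))    # "studentised" in R
all.equal(internal, stdres(educ.lm))   # TRUE
all.equal(external, studres(educ.lm))  # TRUE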

12 Studentized residuals (3)
• How big is big?
• Both types of studentised residual are approximately distributed as standard normals when the model is OK and there are no outliers (in fact the externally studentised residual has a t-distribution).
• Thus, studentised residuals should lie between -2 and 2 with approximately 95% probability.

13 Studentized residuals (4)
Calculating in R:

library(MASS)      # load the MASS library
stdres(educ.lm)    # internally studentised ("standardised" in R)
studres(educ.lm)   # externally studentised ("studentised" in R)

> stdres(educ.lm)[50]
      50 
3.275808 
> studres(educ.lm)[50]
      50 
3.700221 
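Base R provides the same quantities without MASS: rstandard() gives the internally studentised (standardised) residuals and rstudent() the externally studentised ones.

rstandard(educ.lm)[50]   # 3.275808, matches stdres()
rstudent(educ.lm)[50]    # 3.700221, matches studres()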

14 What does studentised mean?

15 Recognizing outliers
• If a point is a low-leverage outlier, the residual will usually be large, so a large residual together with a low HMD indicates an outlier.
• If a point is a high-leverage outlier, a large error will usually cause a large residual.
• However, in extreme cases a high-leverage outlier may not have a very big residual, depending on how much the point attracts the fitted plane. Thus, if a point has a large HMD and the residual is not particularly big, we can't always tell whether the point is an outlier or not.

16 High-leverage outlier
[Figure: the high-leverage outlier has a small residual!]

17 Leverage-residual plot
We can plot the standardised residuals against the leverages (HMDs): the leverage-residual plot (LR plot).

plot(educ.lm, which = 5)

Point 50 has high leverage and a big residual: it is an outlier.
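The same plot can also be drawn by hand, which makes the cut-offs explicit; a minimal sketch using the cut-off values from the surrounding slides:

lev <- hatvalues(educ.lm)
r <- rstandard(educ.lm)
p <- length(coef(educ.lm)); n <- length(lev)
plot(lev, r, xlab = "Leverage (hat values)", ylab = "Standardised residuals")
abline(h = c(-2, 2), lty = 2)     # "big residual" bands
abline(v = 3 * p / n, lty = 2)    # "high leverage" cut-off
text(lev[50], r[50], labels = "50", pos = 2)   # mark point 50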

18 Interpreting LR plots
[Figure: schematic LR plot of standardised residual (horizontal bands at -2 and 2) against leverage (vertical cut-off at 3p/n). Regions are labelled: OK (small residual, low leverage), low-leverage outlier (big residual, low leverage), high-leverage outlier (big residual, high leverage), possible high-leverage outlier (small residual, high leverage).]

19 Residuals and HMDs
[Figure: no big studentized residuals, no big HMDs (3p/n = 0.2 for this example)]

20 Residuals and HMDs (2)
[Figure: one big studentized residual (point 24), no big HMDs (3p/n = 0.2 for this example). The fitted line moves a bit.]

21 Residuals and HMDs (3)
[Figure: no big studentized residuals, one big HMD (point 1), 3p/n = 0.2 for this example. The fitted line hardly moves: point 1 is high leverage but not influential.]

22 Residuals and HMDs (4)
[Figure: one big studentized residual and one big HMD, both point 1 (3p/n = 0.2 for this example). The fitted line moves and the residual is large: point 1 is influential.]

23 Residuals and HMDs (5)
[Figure: no big studentized residuals, one big HMD (3p/n = 0.2). Point 1 is high leverage and influential.]

24 Influential points
• How can we tell if a high-leverage point or outlier is affecting the regression?
• By deleting the point and refitting the regression: a large change in the coefficients means the point is affecting the regression.
• Such points are called influential points.
• We don't want the analysis to be driven by one or two points.

25 "Leave-one-out" measures
• We can calculate a variety of measures by leaving out each data point in turn and looking at the change in key regression quantities such as:
  - coefficients
  - fitted values
  - standard errors
• We discuss each in turn.

26 Example: the education data

            With point 50   Without point 50
Const            -557             -278
percap           0.071            0.059
under18          1.555            0.932
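A sketch of how a comparison like this can be produced, by refitting with case 50 removed (educ.df assumed loaded, with the column names used in the earlier lm() call):

fit.all  <- lm(educ ~ percapita + under18, data = educ.df)
fit.no50 <- update(fit.all, subset = -50)   # refit without case 50
round(cbind("with 50" = coef(fit.all), "without 50" = coef(fit.no50)), 3)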

27 Standardized difference in coefficients: DFBETAS

Formula (the change in the jth coefficient when case i is deleted, standardised by its standard error):

    DFBETAS_j(i) = (b_j - b_j(i)) / ( s_(i) √[(XᵀX)⁻¹]_jj )

Problem when: greater than 1 in absolute value. This is the criterion coded into R.
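In R, dfbetas() returns this quantity as a matrix with one column per coefficient; a sketch applying the |DFBETAS| > 1 criterion:

db <- dfbetas(educ.lm)              # n x p matrix of standardised coefficient changes
which(apply(abs(db), 1, max) > 1)   # cases where some |DFBETAS| exceeds 1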

28 Standardized difference in fitted values: DFFITS

Formula (the change in the ith fitted value when case i is deleted, standardised):

    DFFITS_i = (ŷ_i - ŷ_i(i)) / ( s_(i) √h_ii )

Problem when: greater than 3√(p/(N - p)) in absolute value (p = number of regression coefficients).
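A sketch using dffits() with the 3√(p/(n - p)) cut-off, the value 0.758 computed on a later slide for the education data:

dff <- dffits(educ.lm)
p <- length(coef(educ.lm)); n <- length(dff)
cutoff <- 3 * sqrt(p / (n - p))   # 0.758 for p = 3, n = 50
which(abs(dff) > cutoff)          # cases flagged by the DFFITS criterion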

29 COVRATIO and Cook's D
• COVRATIO: measures the change in the standard errors of the estimated coefficients.
  Problem indicated when COVRATIO is more than 1 + 3p/n or less than 1 - 3p/n.
• Cook's D: measures the overall change in the coefficients.
  Problem when it is more than qf(0.50, p, n - p) (the lower 50% point of the F distribution), roughly 1 in most cases.
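A sketch computing both measures for educ.lm with the cut-offs quoted above; covratio() and cooks.distance() are the standard R functions:

cvr <- covratio(educ.lm)
cd  <- cooks.distance(educ.lm)
p <- length(coef(educ.lm)); n <- length(cd)
which(abs(cvr - 1) > 3 * p / n)   # COVRATIO outside (1 - 3p/n, 1 + 3p/n)
which(cd > qf(0.50, p, n - p))    # Cook's D above the F median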

30 Calculations

> influence.measures(educ.lm)
Influence measures of
  lm(formula = educ ~ under18 + percap, data = educ.df) :

      dfb.1.  dfb.un18  dfb.prcp   dffit  cov.r   cook.d    hat inf
10   0.06381  -0.02222  -0.16792 -0.3631  0.803 4.05e-02 0.0257   *
44   0.02289  -0.02948   0.00298 -0.0340  1.283 3.94e-04 0.1690   *
50  -2.36876   2.23393   1.50181  2.4733  0.821 1.66e+00 0.3429   *

Here p = 3 and n = 50, so 3p/n = 0.18, 3√(p/(n - p)) = 0.758, and qf(0.5, 3, 47) = 0.8002294.

31 Plotting influence

par(mfrow = c(2, 4))      # set up plot window with a 2 x 4 array of plots
influenceplots(educ.lm)   # plot dfbetas, dffits, cov ratio, Cook's D, HMDs

32 [Figure: influence plots for educ.lm]

33 Remedies for outliers
• Correct typographical errors in the data.
• Delete a small number of points and refit (we don't want the fitted regression to be determined by one or two influential points).
• Report the existence of outliers separately: they are often of scientific interest.
• Don't delete too many points (one or two at most).

34 Summary: doing it in R
• LR plot: plot(educ.lm, which = 5)
• Full diagnostic display: plot(educ.lm)
• Influence measures: influence.measures(educ.lm)
• Plots of influence measures:
    par(mfrow = c(2, 4))
    influenceplots(educ.lm)

35 HMD summary
Hat matrix diagonals:
• measure the effect of a point on its fitted value
• measure how outlying the x-values are (how "high-leverage" a point is)
• are always between 0 and 1, with bigger values indicating higher leverage
Points with HMDs of more than 3p/n are considered "high leverage".

