
Unit 1c: Detecting Influential Data Points and Assessing Their Impact
© Andrew Ho, Harvard Graduate School of Education


1 Unit 1c: Detecting Influential Data Points and Assessing Their Impact
http://xkcd.com/539/

2 Course Roadmap: Unit 1c
In this unit:
  - Three measures of atypicality
  - Leverage x Discrepancy = Influence
  - Sensitivity Analyses
Roadmap, starting from Multiple Regression Analysis (MRA):
  - Do your residuals meet the required assumptions?
      - Test for residual normality
      - Use influence statistics to detect atypical datapoints (Today's Topic Area)
  - If your residuals are not independent, replace OLS by GLS regression analysis
      - Use individual growth modeling
      - Specify a multi-level model
  - If time is a predictor, you need discrete-time survival analysis…
  - If your outcome is categorical, you need to use…
      - Binomial logistic regression analysis (dichotomous outcome)
      - Multinomial logistic regression analysis (polytomous outcome)
  - If you have more predictors than you can deal with,
      - Create taxonomies of fitted models and compare them
      - Form composites of the indicators of any common construct
      - Conduct a Principal Components Analysis
      - Use Cluster Analysis
      - Use Factor Analysis: EFA or CFA?
  - If your outcome vs. predictor relationship is non-linear,
      - Use non-linear regression analysis
      - Transform the outcome or predictor

3 Anscombe’s Quartet: Four datasets with identical summary statistics
How might we detect and describe these “atypical” observations?
Unit 1d: Next Class

4 When do we check for atypical observations?
At least first… and last…
We should be nervous here that there is some atypical observation that has particular influence on all of these statistics. We check these assumptions now.

5 Classifying Atypical Observations
Influence: A high-discrepancy, high-leverage observation will have a strong influence on estimated regression coefficients and an impact on all model fit statistics and hypothesis testing.

6 Discrepancy (1): The Raw Residual
Refit Model 6:
    *--------------------------------------------------------------------------------
    * Refit final regression model, estimate & output selected influence statistics
    *--------------------------------------------------------------------------------
    * Refit Model 6, the final regression model:
    eststo: regress ILLCAUSE ILL LAGE ILLxLAGE SES
    * Output and summarize the predicted values ("capture" suppresses an error that
    * would otherwise be generated if you ran this again, redefining the variable):
    capture predict PREDICTED, xb
    summarize PREDICTED
    * Output and summarize the residuals:
    * Raw residuals:
    capture predict RESID, residuals
    summarize RESID
A one-predictor illustration of residuals:
The mean of residuals will always be 0. Why?
The standard deviation will be close to the RMSE. Why?
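One way to approach the two "Why?" questions is sketched below. The notation is an assumption added here, not taken from the slides: e_i is the raw residual, n the number of cases used in the fit, and k the number of predictors in a model that includes an intercept.

```latex
% The OLS normal equation for the intercept forces the residuals to sum to zero:
\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right) = 0
\quad\Longrightarrow\quad \bar{e} = 0 .

% summarize divides the sum of squared residuals by n - 1,
% while the RMSE divides it by the residual degrees of freedom n - k - 1:
SD(e) = \sqrt{\frac{\sum_i e_i^2}{n-1}}, \qquad
RMSE = \sqrt{\frac{\sum_i e_i^2}{n-k-1}}, \qquad
SD(e) = RMSE \cdot \sqrt{\frac{n-k-1}{n-1}} .
```

When n is large relative to k, the final square-root factor is close to 1, which is why the two numbers nearly agree.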

7 Discrepancy (2, 3): The PRESS and Standardized Residual
    * PRESS (Predicted Residual Sum of Squares) residuals:
    capture predict HATSTAT, leverage
    generate PRESS = RESID/(1-HATSTAT)
    summarize PRESS
    * Standardized residuals:
    capture predict STDRESID, rstandard
    summarize STDRESID
Funny thing about atypical observations… they mask themselves. They draw the regression line to themselves, reducing their residuals.
The PRESS residual “unmasks” the atypicality of an observation by calculating a residual for a regression line that is estimated from a dataset that does not include the observation itself.
The standardized residual (also, confusingly, the standardized PRESS residual, and the internally studentized residual) is the PRESS residual expressed in terms of predicted standard deviations of residuals. This arguably results in a more interpretable statistic, where a residual of 2 or 3 standard deviations starts to seem “atypical.”
This does not mean that we should discard observations with standardized residuals > 3, say. If we did, and we recalculated standardized residuals, what might we find?
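The relationships among the three residuals can be written out as follows. The symbols are assumptions introduced for this sketch: e_i is the raw residual (RESID), h_i the leverage (HATSTAT), s the RMSE of the fitted model, and \hat{y}_{(i)} the prediction for case i from a fit that omits case i.

```latex
% PRESS residual: the leave-one-out residual, available without refitting
e_{(i)} = y_i - \hat{y}_{(i)} = \frac{e_i}{1 - h_i}

% Estimated standard deviations of the raw and PRESS residuals
\widehat{SD}(e_i) = s\sqrt{1 - h_i}, \qquad
\widehat{SD}(e_{(i)}) = \frac{s}{\sqrt{1 - h_i}}

% Standardizing either residual gives the same statistic
r_i = \frac{e_i}{s\sqrt{1 - h_i}} = \frac{e_{(i)}}{s / \sqrt{1 - h_i}}
```

Because h_i lies between 0 and 1, the PRESS residual is never smaller in magnitude than the raw residual, and the gap grows with leverage. That is the "unmasking" described above.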

8 Leverage: Extremity of an observation along predictor variables
This outlier has a large residual.
This outlier does not. It masks itself.
http://www.stat.sc.edu/~west/javahtml/Regression.html
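For reference, the usual definition of leverage can be sketched as below. The notation is an assumption added here: X is the n-by-p matrix of predictors including the column of ones, and x_i is the row for case i.

```latex
% Leverage is the i-th diagonal element of the hat matrix
H = X (X^{\top} X)^{-1} X^{\top}, \qquad
h_i = x_i^{\top} (X^{\top} X)^{-1} x_i

% With a single predictor, leverage grows with squared distance from the predictor mean
h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}

% The leverages sum to the number of estimated coefficients, so their mean is p / n
\sum_{i=1}^{n} h_i = p
```

The one-predictor form shows why leverage depends only on where a case sits along the predictor, not on its outcome value.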

9 Cook’s Distance: Influence = Discrepancy * Leverage
Influence: High Discrepancy, High Leverage
The lvr2plot: Leverage (horizontal), squared standardized residual (vertical)
Influence Demo:
    . findit regpt
    . regpt
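The later slides sort and plot a variable named COOKSDSTAT, so it has to be generated at some point after the model is fit. A minimal sketch of how that might look follows, assuming the course dataset is in memory. COOKSCHECK and the graph name Unit1c_lvr2 are hypothetical names chosen for this illustration.

```stata
* Assumes the course dataset is in memory. Refit Model 6 quietly so that
* predict and lvr2plot refer to the intended model.
quietly regress ILLCAUSE ILL LAGE ILLxLAGE SES

* As on the earlier slides, capture skips the error if a variable already exists.
capture predict HATSTAT, leverage
capture predict STDRESID, rstandard
capture predict COOKSDSTAT, cooksd
summarize COOKSDSTAT

* Check the "discrepancy times leverage" reading of Cook's D:
*   D = (standardized residual^2 / p) * h / (1 - h)
* where p counts the estimated coefficients, including the intercept.
* COOKSCHECK is a hypothetical variable name used only for this check.
capture drop COOKSCHECK
generate COOKSCHECK = (STDRESID^2 / (e(df_m) + 1)) * HATSTAT / (1 - HATSTAT)
summarize COOKSDSTAT COOKSCHECK

* Leverage-versus-squared-residual plot, with points labeled by case ID.
lvr2plot, mlabel(ID) name(Unit1c_lvr2, replace)
```

If the two summaries agree up to rounding, the hand-computed product reproduces the Cook's D that Stata reports.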

10 Exploratory Analysis of Discrepancy, Leverage, and Influence
Plot the value of each influence statistic versus the case ID. List the IDs of the cases with extreme values.
    * First, identify cases that are extreme-on-Y:
    * Plot standardized residuals versus ID to pick out the extreme-on-Y cases:
    graph twoway (scatter STDRESID ID, mlabel(ID) msize(small)), name(Unit1c_g2, replace)
    * Sort and list only those cases atypical on the standardized residuals.
    * Recall that they can be both positive and negative:
    sort STDRESID
    list ID STDRESID if STDRESID != . in F/20, clean
    list ID STDRESID if STDRESID != . in -20/L, clean

    * Second, identify the cases that are extreme-on-X:
    * Plot HATSTAT versus ID to pick out the extreme-on-X cases:
    graph twoway (scatter HATSTAT ID, mlabel(ID) msize(small)), name(Unit1c_g3, replace)
    * Sort and list only the cases atypical on HATSTAT.
    * Recall that the HAT statistic measures the "horizontal" distance of each
    * case from the "center" of the data in the predictor plane. It can only
    * take on positive values, and so only atypical cases with large positive
    * and non-missing values are in contention:
    sort HATSTAT
    list ID HATSTAT if HATSTAT != . in -20/L, clean

    * Third, identify the cases that are most influential overall:
    * Plot COOKSDSTAT versus ID to pick out the most influential cases overall:
    graph twoway (scatter COOKSDSTAT ID, mlabel(ID) msize(small)), name(Unit1c_g4, replace)
    * Sort and list the atypical cases that are most influential overall.
    * Recall that Cook's D statistic measures overall impact on the generic fit.
    * It can only take on positive values and so it is only atypical cases with
    * large positive and non-missing values that are in contention:
    sort COOKSDSTAT
    list ID COOKSDSTAT if COOKSDSTAT != . in -20/L, clean
Use scatterplots of statistics on ID and list sorted statistics by ID.

11 Extreme standardized residuals
             ID    STDRESID
      1.    444   -2.933572
      2.    307   -2.640082
      3.    441   -2.495741
      4.    617   -2.462099
      5.    424   -1.921746
      6.    310   -1.759873
      7.    602   -1.722051

             ID    STDRESID
    188.    745    2.100213
    189.    502    2.155126
    190.    702    2.262468
    191.    726    2.450768
    192.    621    2.522492
    193.    423    2.744562
    194.    553    3.084371
Note that extreme values of residuals can be positive or negative.

12 Extreme leverage values
             ID     HATSTAT
    199.    634    .0540155
    200.    331    .0542818
    201.    573    .0542818
    202.    537    .0563484
    203.    322    .0621393
    204.    568    .0677331
    205.    700    .0827226
Note that extreme values of leverage can only be positive.

13 Extreme Cook’s Distance values
             ID  COOKSDSTAT
    188.    336    .0387276
    189.    424    .0418666
    190.    441    .0432583
    191.    745    .0460144
    192.    307    .0557836
    193.    444    .0595759
    194.    553    .0778421
Note that extreme values of Cook’s Distance can only be positive.

14 Back to the Leverage by Residual Squared Plot
For our sensitivity analysis, we’ll take a look at these five points. In a real analysis, we would conduct a full substantive investigation of these children using all available qualitative and quantitative data.

15 Exploring Atypical Observations
    *--------------------------------------------------------------------------------
    * Describe atypical observations.
    *--------------------------------------------------------------------------------
    * Identify atypical cases by their ID #s from the previous analysis.
    * First, tag high-discrepancy observations:
    generate HIRESID = (ID==444 | ID==307 | ID==423 | ID==553)
    * Second, tag high-leverage observations:
    generate HIHAT = (ID==700)
    * Third, tag cases w/ high overall influence:
    generate HICOOK = (ID==553)
    * Sort the atypical cases by their ID and selected characteristics:
    sort HICOOK HIRESID HIHAT ILLCAUSE ILL AGE SES ID
    * List cases for inspection, in a table, sorted by type of atypicality:
    list ID HICOOK HIRESID HIHAT ILLCAUSE PREDICTED ILL AGE SES ///
        if HIRESID==1 | HIHAT==1 | HICOOK==1, sepby(HICOOK HIRESID HIHAT)
Classify atypical observations by their basis for being flagged.
Sort and list cases by selected variables.
Similar observations may be chunked together for sensitivity analyses, e.g., #307 & #444.

16 Bivariate Visualizations of Atypical Observations

17 Sensitivity Analysis
Effects of ILL and logAGE, both as main effects and in a two-way interaction, are fairly robust to the removal of the atypical datapoints. The basic substantive story is preserved throughout.
The main effect of SES fluctuates in magnitude and significance as atypical observations are excluded. The usefulness of socioeconomic status in the prediction of ILLCAUSE is sensitive to the inclusion of atypical observations in this model.
In practice, we typically restrict investigation to high-influence points, unless there is a serious question about whether the other atypical observations are actually part of the target population.
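A sensitivity analysis of this kind could be set up roughly as follows. This is a sketch under assumptions, not the code behind the slide: it assumes the indicator variables HICOOK, HIRESID, and HIHAT exist as generated earlier and that the estout package (which supplies eststo and esttab) is installed. The stored-model names and column titles are hypothetical.

```stata
* Refit Model 6 on the full sample and on subsamples that drop flagged cases.
* FULL, NOCOOK, NORESID, NOHAT and NONE are hypothetical names for this sketch.
eststo FULL:    regress ILLCAUSE ILL LAGE ILLxLAGE SES
eststo NOCOOK:  regress ILLCAUSE ILL LAGE ILLxLAGE SES if HICOOK == 0
eststo NORESID: regress ILLCAUSE ILL LAGE ILLxLAGE SES if HIRESID == 0
eststo NOHAT:   regress ILLCAUSE ILL LAGE ILLxLAGE SES if HIHAT == 0
eststo NONE:    regress ILLCAUSE ILL LAGE ILLxLAGE SES ///
    if HICOOK == 0 & HIRESID == 0 & HIHAT == 0

* Put the fitted models side by side, with standard errors and R-squared,
* to see which coefficients move when atypical cases are excluded.
esttab FULL NOCOOK NORESID NOHAT NONE, se r2 ///
    mtitles("All cases" "No HICOOK" "No HIRESID" "No HIHAT" "None flagged")
```

Reading across a row of the resulting table shows how one coefficient, such as the SES main effect, changes in size and significance as each group of flagged cases is set aside.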

