Missing Data: Analysis and Design John W. Graham The Prevention Research Center and Department of Biobehavioral Health Penn State University.

1 Missing Data: Analysis and Design John W. Graham The Prevention Research Center and Department of Biobehavioral Health Penn State University

2 Presentation in Four Parts (1) Introduction: Missing Data Theory (2) A brief analysis demonstration Multiple Imputation with NORM and Proc MI Amos...break... (3) Attrition Issues (4) Planned missingness designs: 3-form Design

3 Recent Papers Graham, J. W., Cumsille, P. E., & Elek-Fisk, E. (2003). Methods for handling missing data. In J. A. Schinka & W. F. Velicer (Eds.). Research Methods in Psychology (pp. 87-114). Volume 2 of Handbook of Psychology (I. B. Weiner, Editor-in-Chief). New York: John Wiley & Sons. Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351. Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7, 147-177. jgraham@psu.edu

4 Part I: A Brief Introduction to Analysis with Missing Data

5 Problem with Missing Data Analysis procedures were designed for complete data...

6 Solution 1 Design new model-based procedures Missing Data + Parameter Estimation in One Step Full Information Maximum Likelihood (FIML) SEM and Other Latent Variable Programs (Amos, Mx, LISREL, Mplus, LTA)

7 Solution 2 Data-based procedures e.g., Multiple Imputation (MI) Two Steps Step 1: Deal with the missing data (e.g., replace missing values with plausible values) Produce a product Step 2: Analyze the product as if there were no missing data

8 FAQ Aren't you somehow helping yourself with imputation?...

9 NO. Missing data imputation... does NOT give you something for nothing DOES let you make use of all data you have...

10 FAQ Is the imputed value what the person would have given?

11 NO. When we impute a value.. We do not impute for the sake of the value itself We impute to preserve important characteristics of the whole data set...

12 We want... unbiased parameter estimation e.g., b-weights Good estimate of variability e.g., standard errors best statistical power

13 Causes of Missingness Ignorable MCAR: Missing Completely At Random MAR: Missing At Random Non-Ignorable MNAR: Missing Not At Random

14 MCAR (Missing Completely At Random) MCAR 1: Cause of missingness completely random process (like coin flip) MCAR 2: Cause uncorrelated with variables of interest Example: parents move No bias if cause omitted

15 MAR (Missing At Random) Missingness may be related to measured variables But no residual relationship with unmeasured variables Example: reading speed No bias if you control for measured variables

16 MNAR (Missing Not At Random) Even after controlling for measured variables... Residual relationship with unmeasured variables Example: drug use reason for absence
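
The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the variable names, the 0.6 slope, and the 20%/50%/80% missingness probabilities are assumptions chosen for the demonstration, not values from the presentation. It compares the complete-case mean of Y with a mean that "controls for" the measured variable X through a regression fitted to the observed cases.

```python
import random
import statistics

def make_data(mechanism, n=20000, seed=1):
    """Simulate (x, y) pairs; y is set to None when it is missing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)                  # measured covariate
        y = 0.6 * x + rng.gauss(0, 0.8)      # variable of interest, true mean 0
        if mechanism == "MCAR":
            p_missing = 0.5                  # coin flip, unrelated to x or y
        elif mechanism == "MAR":
            p_missing = 0.8 if x > 0 else 0.2    # depends only on measured x
        else:                                # "MNAR"
            p_missing = 0.8 if y > 0 else 0.2    # depends on y itself
        rows.append((x, None if rng.random() < p_missing else y))
    return rows

def complete_case_mean(rows):
    """Mean of y using only the cases where y was observed."""
    return statistics.mean(y for _, y in rows if y is not None)

def regression_adjusted_mean(rows):
    """Fit y on x in the observed cases, then average predictions over all x."""
    obs = [(x, y) for x, y in rows if y is not None]
    mx = statistics.mean(x for x, _ in obs)
    my = statistics.mean(y for _, y in obs)
    slope = (sum((x - mx) * (y - my) for x, y in obs)
             / sum((x - mx) ** 2 for x, _ in obs))
    intercept = my - slope * mx
    return statistics.mean(intercept + slope * x for x, _ in rows)

for mech in ("MCAR", "MAR", "MNAR"):
    data = make_data(mech)
    print(mech, round(complete_case_mean(data), 2),
          round(regression_adjusted_mean(data), 2))
```

In this setup the complete-case mean is close to the true value of 0 only under MCAR. Under MAR the adjusted mean recovers it, and under MNAR neither estimate does.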

17 MNAR Causes The recommended methods assume missingness is MAR But what if the cause of missingness is not MAR? Should these methods be used when MAR assumptions not met?...

18 YES! These Methods Work! Suggested methods work better than “old” methods Multiple causes of missingness Only small part of missingness may be MNAR Suggested methods usually work very well

19 Revisit Question: What if THE Cause of Missingness is MNAR? Example model of interest: X → Y X = Program (prog vs control) Y = Cigarette Smoking Z = Cause of missingness: say, Rebelliousness (or smoking itself) Factors to be considered: % Missing (e.g., % attrition), r YZ, r Z,Ymis

20 r YZ Correlation between cause of missingness (Z) e.g., rebelliousness (or smoking itself) and the variable of interest (Y) e.g., Cigarette Smoking

21 r Z,Ymis Correlation between cause of missingness (Z) e.g., rebelliousness (or smoking itself) and missingness on variable of interest e.g., Missingness on the Smoking variable Missingness on Smoking (Y mis ) Dichotomous variable: Y mis = 1: Smoking variable not missing Y mis = 0: Smoking variable missing

22 How Could the Cause of Missingness be Purely MNAR? r Z,Y = 1.0 AND r Z,Ymis = 1.0 We can get r Z,Y = 1.0 if smoking is the cause of missingness on the smoking variable

23 How Could the Cause of Missingness be Purely MNAR? We can get r Z,Ymis = 1.0 like this: If person is a smoker, smoking variable is always missing If person is not a smoker, smoking variable is never missing But is this plausible? ever?

24 What if the cause of missingness is MNAR? Problems with this statement MAR & MNAR are widely misunderstood concepts I argue that the cause of missingness is never purely MNAR The cause of missingness is virtually never purely MAR either.

25 MAR vs MNAR: MAR and MNAR form a continuum Pure MAR and pure MNAR are just theoretical concepts Neither occurs in the real world MAR vs MNAR NOT dimension of interest

26 MAR vs MNAR: What IS the Dimension of Interest? Question of Interest: How much estimation bias? when cause of missingness cannot be included in the model

27 Bottom Line... All missing data situations are partly MAR and partly MNAR Sometimes it matters... bias affects statistical conclusions Often it does not matter bias has minimal effects on statistical conclusions (Collins, Schafer, & Kam, Psych Methods, 2001)

28 Methods: "Old" vs MAR vs MNAR MAR methods (MI and ML) are ALWAYS at least as good as, usually better than "old" methods (e.g., listwise deletion) Methods designed to handle MNAR missingness are NOT always better than MAR methods

29 References Graham, J. W., & Donaldson, S. I. (1993). Evaluating interventions with differential attrition: The importance of nonresponse mechanisms and use of followup data. Journal of Applied Psychology, 78, 119-128. Graham, J. W., Hofer, S.M., Donaldson, S.I., MacKinnon, D.P., & Schafer, J.L. (1997). Analysis with missing data in prevention research. In K. Bryant, M. Windle, & S. West (Eds.), The science of prevention: methodological advances from alcohol and substance abuse research. (pp. 325-366). Washington, D.C.: American Psychological Association. Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.

30 Analysis: Old and New

31 Old Procedures: Analyze Complete Cases (listwise deletion) may produce bias you always lose some power (because you are throwing away data) reasonable if you lose only 5% of cases often lose substantial power

32 Analyze Complete Cases (listwise deletion)
   1 1 1 1
   0 1 1 1
   1 0 1 1
   1 1 0 1
   1 1 1 0
very common situation only 20% (4 of 20) data points missing but discard 80% of the cases
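
The two percentages on this slide can be checked directly. The sketch below reads the twenty 0/1 indicators as five cases by four variables. That layout is an assumption about the original figure, chosen because it reproduces both numbers.

```python
# 1 = observed, 0 = missing; rows are cases, columns are variables.
# The 5 x 4 arrangement is an assumed reading of the slide.
pattern = [
    [1, 1, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]

def missing_summary(rows):
    """Return (fraction of data points missing, fraction of cases dropped)."""
    total_points = sum(len(r) for r in rows)
    missing_points = sum(r.count(0) for r in rows)
    dropped_cases = sum(1 for r in rows if 0 in r)   # listwise deletion
    return missing_points / total_points, dropped_cases / len(rows)

points_missing, cases_dropped = missing_summary(pattern)
print(points_missing, cases_dropped)
```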

33 Other "Old" Procedures Pairwise deletion May be of occasional use for preliminary analyses Mean substitution Never use it Regression-based single imputation generally not recommended... except...

34 Recommended Model-Based Procedures Multiple Group SEM (Structural Equation Modeling) Latent Transition Analysis (Collins et al.) A latent class procedure

35 Recommended Model-Based Procedures Raw Data Maximum Likelihood SEM aka Full Information Maximum Likelihood (FIML) Amos (James Arbuckle) LISREL 8.5+ (Jöreskog & Sörbom) Mplus (Bengt Muthén) Mx (Michael Neale)

36 Amos 7, Mx, Mplus, LISREL 8.8 Structural Equation Modeling (SEM) Programs In Single Analysis... Good Estimation Reasonable standard errors Windows Graphical Interface

37 Limitation with Model-Based Procedures That particular model must be what you want

38 Recommended Data-Based Procedures EM Algorithm (ML parameter estimation) Norm-Cat-Mix, EMcov, SAS, SPSS Multiple Imputation NORM, Cat, Mix, Pan (Joe Schafer) SAS Proc MI LISREL 8.5+

39 EM Algorithm Expectation - Maximization Alternate between E-step: predict missing data M-step: estimate parameters Excellent parameter estimates But no standard errors must use bootstrap or multiple imputation
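
The E-step/M-step alternation can be sketched for the simplest case: two variables, with x always observed and y sometimes missing. This is a minimal illustration, not the algorithm as implemented in NORM, EMcov, SAS, or SPSS. The function name, the starting values, and the fixed iteration count are assumptions. The E-step adds the residual variance back into the expected squared value, so that "predicting" missing data does not shrink the variance estimate.

```python
import random

def em_bivariate(pairs, iterations=200):
    """EM estimates of the means, variances and covariance of (x, y).

    pairs is a list of (x, y) tuples where y may be None (missing).
    """
    n = len(pairs)
    xs = [x for x, _ in pairs]
    mx = sum(xs) / n
    vx = sum((x - mx) ** 2 for x in xs) / n        # x is complete: no iteration needed
    obs = [y for _, y in pairs if y is not None]
    my = sum(obs) / len(obs)                       # starting values from observed y
    vy = sum((y - my) ** 2 for y in obs) / len(obs)
    cxy = 0.0
    for _ in range(iterations):
        slope = cxy / vx
        intercept = my - slope * mx
        resid_var = vy - slope * cxy
        # E-step: expected sufficient statistics given current parameters
        s_y = s_yy = s_xy = 0.0
        for x, y in pairs:
            if y is None:
                pred = intercept + slope * x
                s_y += pred
                s_yy += pred ** 2 + resid_var      # E[y^2] includes residual variance
                s_xy += x * pred
            else:
                s_y += y
                s_yy += y ** 2
                s_xy += x * y
        # M-step: re-estimate parameters from the expected statistics
        my = s_y / n
        vy = s_yy / n - my ** 2
        cxy = s_xy / n - mx * my
    return mx, my, vx, vy, cxy

# Demonstration data (simulated): y is missing whenever x is large.
rng = random.Random(0)
pairs = []
for _ in range(400):
    x = rng.gauss(0, 1)
    y = 1.0 + 0.5 * x + rng.gauss(0, 1)
    pairs.append((x, None if x > 0.3 else y))
mx, my, vx, vy, cxy = em_bivariate(pairs)
print(round(my, 3), round(cxy / vx, 3))
```

As the slide notes, the result is a set of parameter estimates only. Standard errors have to come from elsewhere, such as the bootstrap or multiple imputation.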

40 Multiple Imputation Problem with Single Imputation: Too Little Variability Because of Error Variance Because covariance matrix is only one estimate

41 Too Little Error Variance Imputed value lies on regression line

42 Imputed Values on Regression Line

43 Restore Error... Add random normal residual
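
The difference between imputing on the regression line and restoring error can be sketched as follows. This is a simplified illustration with simulated data. The helper names and the 50% missingness rate are assumptions, and the sketch handles only the first of the two problems: it still treats the fitted line as known.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares: return intercept, slope, residual SD."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    resid_sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, resid_sd

def impute(pairs, rng, add_noise=True):
    """Fill in missing y (None) from the regression of y on x."""
    obs = [(x, y) for x, y in pairs if y is not None]
    a, b, s = fit_line([x for x, _ in obs], [y for _, y in obs])
    filled = []
    for x, y in pairs:
        if y is None:
            y = a + b * x                  # value on the regression line
            if add_noise:
                y += rng.gauss(0, s)       # add a random normal residual
        filled.append((x, y))
    return filled

# Simulated data: true variance of y is 0.5**2 + 1 = 1.25.
rng = random.Random(3)
data = []
for _ in range(2000):
    x = rng.gauss(0, 1)
    y = 0.5 * x + rng.gauss(0, 1)
    data.append((x, None if rng.random() < 0.5 else y))

var_on_line = statistics.pvariance([y for _, y in impute(data, rng, add_noise=False)])
var_with_error = statistics.pvariance([y for _, y in impute(data, rng, add_noise=True)])
print(round(var_on_line, 2), round(var_with_error, 2))
```

Values imputed on the line understate the variance of y. Adding the residual brings it back toward the variance in the generating model.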

44 Covariance Matrix (Regression Line) only One Estimate Obtain multiple plausible estimates of the covariance matrix ideally draw multiple covariance matrices from population Approximate this with Bootstrap Data Augmentation (Norm) MCMC (SAS 8.2, 9)

45 Regression Line only One Estimate

46 Data Augmentation stochastic version of EM EM E (expectation) step: predict missing data M (maximization) step: estimate parameters Data Augmentation I (imputation) step: simulate missing data P (posterior) step: simulate parameters

47 Data Augmentation Parameters from consecutive steps... too related i.e., not enough variability after 50 or 100 steps of DA... covariance matrices are like random draws from the population

48 Multiple Imputation Allows: Unbiased Estimation Good standard errors provided number of imputations is large enough too few imputations → reduced power with small effect sizes

49 From Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (in press). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science.

50 Part II: Illustration of Missing Data Analysis: Multiple Imputation with NORM and Proc MI

51 Multiple Imputation: Basic Steps Impute Analyze Combine results

52 Imputation and Analysis Impute 40 datasets a missing value gets a different imputed value in each dataset Analyze each data set with USUAL procedures e.g., SAS, SPSS, LISREL, EQS, STATA Save parameter estimates and SE’s

53 Combine the Results Parameter Estimates to Report Average of estimate (b-weight) over 40 imputed datasets

54 Combine the Results Standard Errors to Report Sum of: “within imputation” variance average squared standard error usual kind of variability “between imputation” variance sample variance of parameter estimates over 40 datasets variability due to missing data
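
The combining step can be written in a few lines. The sketch below follows Rubin's standard combining rules, which multiply the between-imputation variance by (1 + 1/m), a small correction that the slide's summary leaves out. The function name and the three-imputation example values are made up for illustration.

```python
import math
import statistics

def pool(estimates, std_errors):
    """Combine one parameter's estimates and SEs across m imputed datasets."""
    m = len(estimates)
    pooled_estimate = statistics.mean(estimates)          # average b-weight
    within = statistics.mean(se ** 2 for se in std_errors)  # average squared SE
    between = statistics.variance(estimates)              # sample variance over datasets
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)

# Hypothetical results from m = 3 imputed datasets.
estimate, se = pool([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(estimate, round(se, 3))
```

With 40 imputations, as on the earlier slide, the correction factor is 1.025, so the simple sum is a close approximation.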

55 Materials for SPSS Regression Starting place http://methodology.psu.edu downloads missing data software Joe Schafer's Missing Data Programs John Graham's Additional NORM Utilities http://mcgee.hhdev.psu.edu/missing/index.html

56 Materials for SPSS Regression SPSS (NORMSPSS) The following six files provide a new (not necessarily better) way to use SPSS regression with NORM imputed datasets steps.pdf norm2mi.exe selectif.sps space.exe spssinf.bat minfer.exe

57 exit for sample analysis

58 Inclusive Missing Data Strategies Auxiliary Variables: What’s All the Fuss? John Graham IES Summer Research Training Institute, June 27, 2007

59 What Is an Auxiliary Variable? A variable correlated with the variables in your model: but not part of the model; not necessarily related to missingness; used to "help" with missing data estimation

60 Benefit of Auxiliary Variables Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351. Graham, J. W., & Collins, L. M. (2007). Using modern missing data methods with auxiliary variables to mitigate the effects of attrition on statistical power. Technical Report, The Methodology Center, Penn State University.

61 Model of Interest

62 Benefit of Auxiliary Variables Example from Graham & Collins (2007)
   X Y Z
   1 1 1   500 complete cases
   1 0 1   500 cases missing Y
X, Y variables in the model (Y sometimes missing) Z is auxiliary variable

63 Benefit of Auxiliary Variables Effective sample size (N'): an analysis involving N cases, with auxiliary variable(s), gives statistical power equivalent to N' complete cases without auxiliary variables

64 Benefit of Auxiliary Variables It matters how highly Y and Z (the auxiliary variable) are correlated For example:
   r YZ     N      gives power of N'    increase
   .40      500    542                   8%
   .60      500    608                  22%
   .80      500    733                  47%
   .90      500    839                  68%
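
The percentage increases on this slide are simply the relative gain of the effective sample size N' over the N = 500 complete cases. The short check below uses the N' values exactly as given on the slide. It does not derive them, and the function name is illustrative.

```python
def percent_gain(n_complete, n_effective):
    """Relative increase of effective N' over N, rounded to a whole percent."""
    return round(100 * (n_effective - n_complete) / n_complete)

# r_YZ -> N' as reported on the slide (N = 500 in every row)
slide_values = {0.40: 542, 0.60: 608, 0.80: 733, 0.90: 839}
gains = {r: percent_gain(500, n_eff) for r, n_eff in slide_values.items()}
print(gains)
```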

65

66 Empirical Illustration The Model Alcohol-related Harm Prevention (AHP) Project with College Students Intent make Vehicle Plans 1 Alcohol Use 1 Took Vehicle Risks 3 Physical Harm 5

67 How Much Data?
   Intent  Alcohol  VehRisk  Harm   Freq
   ______  _______  _______  ____   ____
     0        0        0      0       59
     0        0        0      1      109
     0        0        1      0       99
     0        0        1      1      122
     0        1        0      0        1
     0        1        0      1        2
     0        1        1      1        5
     1        1        0      0      100
     1        1        0      1       46
     1        1        1      0      136
     1        1        1      1      344  ← Complete
   Total                            1023
1 = data 0 = missing

68 Empirical Illustration Complete Cases (N = 344) Intent make Vehicle Plans 1 Alcohol Use 1 Took Vehicle Risks 3 Physical Harm 5 ns t = 0.2 t = -6 t = 5

69 Empirical Illustration Simple MI (no Aux Vars) Intent make Vehicle Plans 1 Alcohol Use 1 Took Vehicle Risks 3 Physical Harm 5 t = 3 t = -9 t = 7 N = 1023

70 Empirical Illustration MI with Aux Vars Intent make Vehicle Plans 1 Alcohol Use 1 Took Vehicle Risks 3 Physical Harm 5 t = 6 t = -10 t = 8 N = 1023 Auxiliary Variables: Intent2, Intent3, Intent4, Intent5 Alcohol2, Alcohol3, Alcohol4, Alcohol5 Risks1, Risks3, Risks4, Risks5 Harm1, Harm2, Harm3, Harm4

71 Effect of Auxiliary Variables on Fraction of Missing Information
                          no aux vars   16 aux vars
   iplnvsep → harm2nv0       .71           .46
   alcsep → harm2nv0         .64           .44
   female → harm2nv0         .48           .27
   vriskfeb → harm2nv0       .85           .67
   iplnvsep → harm2nv0       .76           .53
   alcsep → harm2nv0         .68           .46
   female → harm2nv0         .52           .27
   iplnvsep → vriskfeb       .58           .46
   alcsep → vriskfeb         .56           .32
   female → vriskfeb         .42           .28

72 Methods for Adding Auxiliary Variables Multiple Imputation Amos

73 Adding Auxiliary Variables: MI Simply add Auxiliary variables to imputation model Couldn't be easier Except... There are limits to how many variables can be included in NORM conveniently My current thinking: add Aux Vars judiciously

74 Empirical Illustration MI with Aux Vars Intent make Vehicle Plans 1 Alcohol Use 1 Took Vehicle Risks 3 Physical Harm 5 t = 6 t = -10 t = 8 N = 1023 Auxiliary Variables: Intent2, Intent3, Intent4, Intent5 Alcohol2, Alcohol3, Alcohol4, Alcohol5 Risks1, Risks3, Risks4, Risks5 Harm1, Harm2, Harm3, Harm4

75 Adding Auxiliary Variables: Amos (and other FIML/SEM programs) Graham, J. W. (2003). Adding missing-data relevant variables to FIML-based structural equation models. Structural Equation Modeling, 10, 80-100. Extra DV model: Good for manifest variable models Saturated Correlates ("Spider") Model: Better for latent variable models

76 Covariate Model NOT Adequate Aux Variable Changes X → Y Estimate

77 Extra DV Model Good for Manifest Variable Models Aux Variable does NOT Change X → Y Estimate

78 Spider Model (Graham, 2003) Good for Latent Variable Models Aux Variable does NOT Change X → Y Estimate

79 Extra DV Model (Amos) Real world version gets a little clumsy... but Amos does provide some excellent drawing tools Large models easier in text-based SEM programs (e.g., LISREL)

80 Using Missing Data Analysis and Design to Develop Cost-Effective Measurement Strategies in Prevention Research John Graham IES Summer Research Training Institute, June 27, 2007

81 Planned Missingness Designs: The 3-Form Design

82 Planned Missingness Why would anyone want to plan to have missing data? To manage costs, data quality, and statistical power In fact, we've been doing it for decades...

83 Common Sampling Designs Random sampling of Subjects Items Goal: Collect smaller, more manageable amount of data Draw reasonable conclusions

84 Why NOT Use Planned Missingness? Past: Not convenient to do analyses Present: Many statistical solutions Now is time to consider design alternatives

85 Design Examples

86 Lighten Burden on Respondents The problem: 7th graders can answer only 100 questions We want to ask 133 questions One Solution: The 3-form design

87 Idea Grew out of Practical Need Project SMART (1982) NIDA-funded drug abuse prevention project Johnson, Flay, Hansen, Graham

88 3-Form Design
   Student Received Item Set?
   ----------------------------
            X     A     B     C
   Form 1   yes   yes   yes   NO
   Form 2   yes   yes   NO    yes
   Form 3   yes   NO    yes   yes

89 3-Form Design
   Item Sets   X    A    B    C    total asked
               34   33   33   33   = 133
   form        X    A    B    C    total for each student
   1           34   33   33    0   = 100
   2           34   33    0   33   = 100
   3           34    0   33   33   = 100
Think of it as "leveraging" resources

90 3-Form Design: Item Order Form 1: XAB Form 2: XCA Form 3: XBC

91 3-Form Design: Item Order Form 1: XABC Form 2: XCAB Form 3: XBCA

92 3-Form Design: Item Order Form 1: XABC Form 2: XCAB Form 3: XBCA Give questions as shown, measure reasons for non-completion: poor reading, low motivation, conscientiousness "Managed" missingness

93 Other Designs in the Same Family

94 3-Form Design (Graham, Flay et al., 1984)
           Item Sets
   Form    X    A    B    C    total
           33   33   33   33   133
   __________________________________________
   1       33   33   33    0   100
   2       33   33    0   33   100
   3       33    0   33   33   100

95 6-Form Design (e.g., King, King et al., 2002)
           Item Sets
   Form    X    A    B    C    D    total
           33   33   33   33   33   167
   __________________________________________
   1       33   33   33    0    0   100
   2       33   33    0   33    0   100
   3       33   33    0    0   33   100
   4       33    0   33   33    0   100
   5       33    0   33    0   33   100
   6       33    0    0   33   33   100

96 Split Questionnaire Survey Design SQSD (Raghunathan & Grizzle, 1995)
           Item Sets
   Form    X    A    B    C    D    E    total
           33   33   33   33   33   33   200
   __________________________________________
   1       33   33   33    0    0    0   100
   2       33   33    0   33    0    0   100
   3       33   33    0    0   33    0   ...
   4       33   33    0    0    0   33
   5       33    0   33   33    0    0
   6       33    0   33    0   33    0
   7       33    0   33    0    0   33
   8       33    0    0   33   33    0
   9       33    0    0   33    0   33
   10      33    0    0    0   33   33

97 Family of Designs 3-form Design All combinations of 3 sets taken 2 at a time SQSD (10-form design) All combinations of 5 sets taken 2 at a time 6-form design All combinations of 4 sets taken 2 at a time Complete cases (1-form design) All combinations of 2 sets taken 2 at a time
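
The whole family can be generated from one rule: every form contains the core set X plus one combination of two of the rotated item sets. The sketch below enumerates the forms with `itertools.combinations`. The function name and the tuple representation are illustrative choices.

```python
from itertools import combinations

def build_forms(rotated_sets, core="X", per_form=2):
    """Each form = core set + one combination of `per_form` rotated sets."""
    return [(core,) + combo for combo in combinations(rotated_sets, per_form)]

designs = {
    "complete cases (1-form)": "AB",
    "3-form": "ABC",
    "6-form": "ABCD",
    "SQSD (10-form)": "ABCDE",
}
for name, sets in designs.items():
    forms = build_forms(sets)
    print(name, len(forms), ["".join(f) for f in forms])
```

For the 3-form design this yields XAB, XAC, and XBC, the same assignment of item sets to forms as in the 3-form tables, before any reordering of sets within a form.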

98 Evaluating Designs (Benefits and costs)

99 Number of item sets (4 vs 3) Number of items (133 vs 100) Number of (correlation) effects Sample sizes.....

100 Number of Effects Effects tested with n = N/3 (100) Effects tested with n = 2N/3 (200) Effects tested with total N (300)
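
The three sample sizes follow from counting, for any pair of items, how many forms contain both of their item sets. The sketch below assumes 100 respondents per form (N = 300), matching the numbers in parentheses on the slide. The function and constant names are illustrative.

```python
# Item sets included on each form of the 3-form design
FORMS = {
    1: {"X", "A", "B"},
    2: {"X", "A", "C"},
    3: {"X", "B", "C"},
}

def pair_n(set1, set2, n_per_form=100):
    """Cases available for a correlation between an item in set1 and one in set2."""
    return sum(n_per_form for sets in FORMS.values()
               if set1 in sets and set2 in sets)

print(pair_n("X", "X"))   # both items in the core set: total N
print(pair_n("X", "A"))   # core item with a rotated-set item: 2N/3
print(pair_n("A", "A"))   # two items from the same rotated set: 2N/3
print(pair_n("A", "B"))   # items from two different rotated sets: N/3
```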

101 Evaluating Designs (Benefits and costs ) Number of effects tested with good power (power ≥.80) Take multiple effect sizes into account
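
Whether an effect clears the power ≥ .80 threshold depends on both its size and the n available for that pair of items. The sketch below uses the Fisher z approximation for the power of a two-tailed test of a correlation. This is a common textbook approximation, not necessarily the calculation behind the slides, and the effect sizes looped over are illustrative.

```python
import math
from statistics import NormalDist

def power_for_r(r, n, alpha=0.05):
    """Approximate power to detect a population correlation r with n cases."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Fisher z of r has standard error 1 / sqrt(n - 3)
    noncentrality = abs(math.atanh(r)) * math.sqrt(n - 3)
    # Ignores the negligible chance of rejecting in the wrong tail
    return NormalDist().cdf(noncentrality - z_crit)

for r in (0.10, 0.20, 0.30):
    print(r, [round(power_for_r(r, n), 2) for n in (100, 200, 300)])
```

Under this approximation an effect of r = .20 falls short of .80 power with n = 100 but exceeds it with n = 200, which is why the item sets in which an effect falls matter as much as how many effects there are.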

102 Effect Size (r) 30-40 scenario = Mild Leveraging Scenario

103 Evaluating Designs (Benefits and costs ) Number of effects tested with good power (power ≥.80) … Still Something Missing It's not how many effects But WHICH effects can be tested: Tradeoff Matrix

104

105 1.27 1.20 2.13 1.36 power ratio

106

107 3-Form Design
    Student Received Item Set?
    ----------------------------
             X      A      B        C
             core   peer   parent   other
    Form 1   yes    yes    yes      NO
    Form 2   yes    yes    NO       yes
    Form 3   yes    NO     yes      yes

108 3-Form Design: Implementation Strategies Core Questions in "X" set Keep related questions together in A or B or C sets Example for Collaboration (Hansen & Graham) X set (core items) A: Hansen Set B: Graham set C: Other

109 "Back Against the Wall" Concept 3-form design better received if one of these is true: You CAN ask some number of questions (e.g., 100) You WANT to ask some larger number of questions (e.g., 133) You have been asking 133 questions of respondents Data Collectors (or data gate keepers) say you MUST reduce number of questions

110 Some Future Directions Current power calculations based on zero-order correlations (beneficial) effect of auxiliary variables not taken into account Current power calculations based on level one correlation analysis loss of power will be discounted in multilevel analyses

111 Change in FMI adding 15 Aux Vars from X set
    Predictors             FMI change    r with Aux Vars
    posatt                 .48 → .30     .54
    freetimewithfriends    .47 → .34     .29
    fangry                 .49 → .38     .56
    nparties               .41 → .33     .36
    negatt                 .46 → .37     .26
    sportsimportant        .47 → .39     .16
    nclosefriends          .46 → .40     .20
    carefriends            .46 → .43     .28
    parangry               .39 → .38     .45
    easytalkfriends        .43 → .43     .24
DV: Trouble Dataset: AAPT 7th graders

112 the end

