Working with Missing Values
Alan C. Acock, February 2007
Supporting material is available at www.oregonstate.edu/~acock/missing
Why are the Values Missing: The reason instructs the solution
  By design: completely random
    – Missing Completely at Random (MCAR)
    – 50% of items selected randomly for each interview
    – 50% randomly selected for follow-up
    – Effective when there are too many items or high costs
  Intentionally missing: researcher controlled
    – Boys not asked about first menstruation
    – Drop from analysis
    – Sometimes unintentionally imputed
    – Imputing doesn't necessarily hurt
Why are the Values Missing
  Refusals: we may know the mechanism
    – Adjusted for gender, race, education
    – May be missing at random
    – Otherwise, bias is likely without auxiliary variables
  Missing because of "don't know" responses
    – Between agree and disagree?
    – Can we impute a better value?
    – Should we?
Why are the Values Missing
  Missing by researcher error
    – May be missing completely at random
    – May reflect researcher bias
    – Perceived risk to researcher
    – A missing observation is worse than a missing value
  Code the reason each value is missing
    – NLSY97 uses 5 types of missing values
    – Treat each differently
Why are the Values Missing
  Understand why each value is missing
  Delete observations or variables where you do not intend to impute a value
    – Drop variable
    – Drop observation
Four Questions
  Do I want to have a value for this person?
  Is the value missing completely at random, or
  Do I have auxiliary variables that explain why it is missing, and
  Do I have covariates that predict the score?
Patterns of Missing Values
  MISSING DATA PATTERNS (x = value present; column alignment by pattern number not preserved)
    HLTH      x x x x
    CHILDS    x x x x x x x x x x
    HAP_GEN   x x x x x
    INCOME98  x x x x x x
    AGE       x x x x x x x x
    EDUC      x x x x x
    – What is the problem with HLTH? INCOME98? EDUC?
Patterns of Missing Values
  MISSING DATA PATTERN FREQUENCIES (table of Pattern and Freq columns; values not preserved)
  Throw out 81 people in pattern 2?
  We have data on five of the six variables
  Income might not be a key predictor
  Why is health missing in patterns 5 to 10? Was this by design?
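A pattern table and its frequencies can also be tabulated by hand. The following is a minimal Python sketch under stated assumptions, not the Mplus routine: the six records are invented, and `rows`, `pattern`, and `pattern_counts` are illustrative names.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value.
rows = [
    {"hlth": 3,    "income98": 40,   "educ": 12},
    {"hlth": None, "income98": 55,   "educ": 16},
    {"hlth": 2,    "income98": None, "educ": 12},
    {"hlth": None, "income98": 30,   "educ": 14},
    {"hlth": 4,    "income98": 25,   "educ": None},
    {"hlth": 1,    "income98": None, "educ": 10},
]
variables = ["hlth", "income98", "educ"]


def pattern(row):
    """Return a string such as 'x.x': 'x' = present, '.' = missing."""
    return "".join("x" if row[v] is not None else "." for v in variables)


# Count how many records share each pattern of present/missing values.
pattern_counts = Counter(pattern(r) for r in rows)

for pat, freq in pattern_counts.most_common():
    print(pat, freq)
```

Reading the frequencies next to the patterns is what prompts questions such as whether a large group is missing only one variable.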
Amount of Missing Values
  PROPORTION OF DATA PRESENT (matrix over HLTH, CHILDS, HAP_GEN, INCOME, AGE, EDUC; only the fragment "HLTH.90" survives)
  Income coverage is low with educ, hlth, hap_gen
  If income is just a control variable: find a substitute or impute
  Over 50% of cases are present for all the combinations
  Could be worse if you did 3-way (hlth, income, educ)
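The "proportion of data present" for a pair of variables is the share of cases that have both values. A minimal sketch, assuming invented records and the illustrative names `coverage` and `coverage_matrix`:

```python
# Hypothetical toy records; None marks a missing value.
rows = [
    {"hlth": 3,    "income98": 40,   "educ": 12},
    {"hlth": None, "income98": 55,   "educ": 16},
    {"hlth": 2,    "income98": None, "educ": 12},
    {"hlth": None, "income98": 30,   "educ": 14},
    {"hlth": 4,    "income98": 25,   "educ": None},
    {"hlth": 1,    "income98": None, "educ": 10},
]
variables = ["hlth", "income98", "educ"]


def coverage(a, b):
    """Proportion of records in which both variable a and variable b are present."""
    both = sum(1 for r in rows if r[a] is not None and r[b] is not None)
    return both / len(rows)


# Full pairwise matrix; the diagonal is the proportion present for one variable.
coverage_matrix = {(a, b): coverage(a, b) for a in variables for b in variables}

for a in variables:
    print(a.ljust(9), " ".join(f"{coverage_matrix[(a, b)]:.2f}" for b in variables))
```

A three-way version would count cases with all three values present, which can only be equal to or lower than the lowest pairwise figure.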
Raw Data and Missingness
  Raw Data table: ID, Var1, Var2, Var3 (values not preserved)
  Missingness table: ID, D1, D2, D3 (values not preserved)
Missing Completely at Random (MCAR)
  The missingness is random: D1, D2, D3 are uncorrelated with anything!
  Correlate variables with D1, D2, D3 (or use logistic regression)
  Consider race, gender, age, education
  None of these should be correlated with D1, D2, or D3
  This is not correlating variables with the raw score!
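The check described here correlates a missingness indicator, not the raw score, with other variables. A minimal sketch, assuming invented data in which income is missing for the oldest respondents; `pearson`, `d_income`, and `r_age` are illustrative names.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data: income is missing (None) for the three oldest respondents.
age = [25, 30, 35, 40, 45, 50, 55, 60]
income = [30, 34, 42, 45, 50, None, None, None]

# D = 1 when income is missing, 0 when it is present.
d_income = [1 if v is None else 0 for v in income]

# Correlate the indicator (not the income values) with an observed covariate.
r_age = pearson(age, d_income)
print(f"correlation of age with missingness of income: {r_age:.2f}")
```

A correlation this far from zero would count against MCAR for income in these invented data; it would also suggest age as a candidate auxiliary variable.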
Missing at Random (MAR)
  The missingness is a random pattern after you control for
    – Variables in your analysis
    – Auxiliary variables
  Probability of missingness is NOT dependent on unobserved variables
  Correlate variables with D1, D2, D3
  Consider auxiliary variables: race, gender, age, education
Missing at Random (MAR)
  Include auxiliary variables as mechanisms for missingness
    – If they are correlated significantly with the missingness (D1, D2, D3)
  Data are MAR after controlling for auxiliary variables
  Auxiliary variables are available in many datasets
Problem with Traditional Approaches
  Listwise deletion: the standard default
    – It excludes many observations: 50%?
    – An observation may be missing only one variable, and that variable may not be important
    – In longitudinal program evaluations: missing those with a low level of implementation
    – If MCAR, this reduces power, but is unbiased
    – Without MCAR this is biased
    – Political Science Journal: 50% deleted
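Modest missingness on each variable can compound into heavy losses under listwise deletion. The arithmetic can be sketched as follows; the independence assumption and the values of `p` and `k` are invented for illustration and are not from the presentation.

```python
# Assumption for illustration only: each of k variables is missing
# independently with the same probability p.
p = 0.07   # hypothetical per-variable missing rate
k = 10     # hypothetical number of variables in the model

# A case survives listwise deletion only if all k values are present.
complete_fraction = (1 - p) ** k
lost_fraction = 1 - complete_fraction

print(f"complete cases: {complete_fraction:.1%}, deleted: {lost_fraction:.1%}")
```

Under these assumptions slightly more than half the cases are deleted, even though no single variable is missing for more than 7% of them.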
Problem with Traditional Approaches
  Mean Substitution
    – Mean is often a bad estimate
    – Attenuates variance
    – Reduces effects: variables with missing data, or
    – Exaggerates effects: variables with little missing data
    – Reduces R²
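The variance attenuation is easy to see on a small example. A minimal sketch with invented scores; `scores`, `filled`, `var_observed`, and `var_filled` are illustrative names.

```python
import statistics

# Hypothetical scores; None marks a missing value.
scores = [2, 4, None, 8, None, 10]

observed = [s for s in scores if s is not None]
mean_observed = statistics.mean(observed)

# Mean substitution: every missing score is replaced by the observed mean.
filled = [mean_observed if s is None else s for s in scores]

var_observed = statistics.variance(observed)  # sample variance of the 4 observed scores
var_filled = statistics.variance(filled)      # sample variance after substitution

print(f"variance before: {var_observed:.2f}, after: {var_filled:.2f}")
```

The substituted values sit exactly at the mean, so they add cases without adding any spread, and the variance shrinks.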
Problem with Traditional Approaches
  Pairwise Deletion (rarely used)
    – Each correlation is based on a different subsample
    – Set of correlations: no single sample
    – May not be able to invert the matrix
    – What is the right sample size?
    – If it works, usually better than mean substitution or listwise deletion
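One way the matrix problem arises: correlations estimated on different subsamples need not be mutually consistent, so the assembled matrix may not be positive definite. A minimal sketch with invented correlations; `det3_correlation` is an illustrative helper, and the three values are hypothetical, not results from any dataset.

```python
def det3_correlation(r12, r13, r23):
    """Determinant of the 3x3 correlation matrix with these off-diagonal entries."""
    return 1 + 2 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2


# Hypothetical pairwise correlations, each imagined to come from a different subsample.
# Var1 correlates strongly with Var2 and Var3, yet Var2 and Var3 correlate negatively.
det_pairwise = det3_correlation(0.9, 0.9, -0.9)

# A mutually consistent set for comparison.
det_consistent = det3_correlation(0.9, 0.9, 0.9)

print(f"inconsistent set: {det_pairwise:.3f}, consistent set: {det_consistent:.3f}")
```

A negative determinant means no single sample could have produced these three correlations, and procedures that expect a proper correlation matrix can fail on it.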
Problem with Traditional Approaches
  Ordinary regression imputation
    – Multiple regression is used to predict the missing score
    – Predicted value adds no new information if the predictors are in your model: collinearity
    – Does nothing about the uncertainty of predictions
        If R² = .90, the predicted value is good
        If R² = .10, the predicted value has a lot of noise
    – Thus, predicted values are too good
Problem with Traditional Approaches
  Single Imputation (SPSS Module) (MAR)
    – American Statistician article: done incorrectly
    – Single imputation does not incorporate variability between multiple imputations
    – Reviewers for many journals are not aware of the limitations of single imputation so...
    – Easy to implement using SPSS
Modern Approaches
  Multiple Imputation: assumes MAR
    – Imputation is done 5-20 times
    – Model is estimated 5-20 times
    – Estimates (Rs, Bs, Betas) are averaged
    – Standard errors: variance between solutions is incorporated
    – Reflects uncertainty of the process
    – Always better than single imputation
Modern Approaches
  Multiple Imputation
    – Available with the best statistical packages: Stata, SAS
    – Available with freeware programs that work in conjunction with statistical packages: Norm, Amelia, IVEware, MICE
Modern Approaches
  Full Information Maximum Likelihood (FIML)
    – Assumes MAR
    – Uses all available information
    – Assumes patterns would be the same if no values were missing
    – Results similar to multiple imputation
    – Available with SEM programs: Mplus, LISREL, AMOS, EQS
Modern Approaches
  Full Information Maximum Likelihood
    – Easy changes in SEM programs will do this
    – Researchers rarely include auxiliary variables
    – Researchers rarely include covariates unless they are in the model
    – Possible to add auxiliary/predictor variables
    – Mplus allows for both FIML estimation and multiple imputation: nice to compare results
How Multiple Imputation Works: Non-technical Explanation
  All variables may have some missing values, including the DV
  Eliminate observations with missing values on all variables
    – A missing wave of a panel is just missing values
  Estimate covariance matrix (listwise)
  Regress x_i on remaining variables
How Multiple Imputation Works
  Add a residual based on strength of prediction
    – R² = .90: add small error
    – R² = .10: add big error
  You now have an actual or imputed value for all observations on all variables
  Estimate a covariance matrix
  This covariance matrix should be better because it utilizes more information
How Multiple Imputation Works
  If covariance matrices are different
    – Repeat process until successive covariance matrices are virtually identical
  This provides the first imputed dataset
  Repeat this process m times
    – Result: m imputed datasets with no missing values
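The core step, predicting a missing value and then adding a residual sized to the strength of prediction, can be sketched in a few lines. This is a deliberately simplified illustration, not what Norm, ice, or any other package does: it assumes one incomplete variable `y` and one complete predictor `x`, it fits the regression once on complete cases instead of iterating until the covariance matrices agree, and it treats the regression coefficients as fixed. The data and the names `fit_line`, `impute_once`, and `imputed_datasets` are invented.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 2 degrees of freedom.
    resid_sd = (sum(e ** 2 for e in residuals) / (len(xs) - 2)) ** 0.5
    return a, b, resid_sd


def impute_once(x, y, rng):
    """Fill each missing y with its predicted value plus a random residual."""
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b, sd = fit_line([c[0] for c in complete], [c[1] for c in complete])
    # Weak prediction gives a large sd (big error); strong prediction a small one.
    return [yi if yi is not None else a + b * xi + rng.gauss(0, sd)
            for xi, yi in zip(x, y)]


# Hypothetical data; None marks a missing value of y.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0]

rng = random.Random(2007)  # fixed seed so the sketch is reproducible
m = 5
imputed_datasets = [impute_once(x, y, rng) for _ in range(m)]

for dataset in imputed_datasets:
    print([round(v, 2) for v in dataset])
```

Observed values are identical across the m datasets; only the imputed values differ, and that spread is what later feeds the between-imputation part of the standard errors.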
How Multiple Imputation Works
  Estimate your model with each of your m imputed datasets
  Combine the results using Rubin's rules
    – Parameter estimates: mean of their m values
    – Standard errors: inflate the mean of the standard errors based on how much the solutions vary
    – Standard errors (hence t-tests) will be unbiased if the data are MAR
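Rubin's rules for a single parameter can be written out directly: the pooled estimate is the mean of the m estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. A minimal sketch; the function name `pool_rubin` and the five estimates and standard errors are invented for illustration.

```python
import statistics


def pool_rubin(estimates, std_errors):
    """Pool one parameter across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    pooled_estimate = statistics.mean(estimates)
    within = statistics.mean([se ** 2 for se in std_errors])  # mean squared SE
    between = statistics.variance(estimates)                  # variance across imputations
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, total_variance ** 0.5


# Hypothetical regression coefficient and standard error from m = 5 imputed datasets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
std_errors = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled_est, pooled_se = pool_rubin(estimates, std_errors)
print(f"pooled estimate: {pooled_est:.3f}, pooled SE: {pooled_se:.4f}")
```

The pooled standard error comes out larger than any single-dataset standard error because it also carries the disagreement between imputations.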
How FIML is Implemented: Mplus
  Title: Missing values including mechanisms
  Data: File is miss_systematic-999.dat ;
  Variables: Names are childs satfin male hap_gen ident income98 educ hlth age ;
    Missing are all (-999) ;
    Usevariables are hlth childs hap_gen income98 age educ satfin male ;
  Analysis: Type = missing ;   ! without this, you get listwise deletion
FIML: Mplus Example
  Model:
    hlth on childs hap_gen income98 age educ ;
    satfin on childs hap_gen income98 age educ ;
    male on childs hap_gen income98 age educ ;
  Output: standardized ;
  1. The hlth and satfin lines are the model
  2. The male line is a nonsense equation that includes any covariates or auxiliary variables
Freeware Dedicated Packages
  Columns: Package, Single Imputation, Multiple Imputation, FIML (column positions of the X marks not preserved)
    Amelia   X X
    IVEware  X X
    Norm     X X
    MICE     X X
    Mx       X
Commercial Statistical Packages
  Columns: Package, Single Imputation, Multiple Imputation, FIML (column positions of the X marks not preserved)
    SAS (MI)     X
    SPSS (EM)    X
    Stata (ice)  X X
Commercial FIML Packages
  Columns: Package, Single Imputation, Multiple Imputation, FIML (column positions of the X marks not preserved)
    AMOS    X
    EQS     X
    HLM     X
    LISREL  X
    Mplus   X X
Web Pages for Selected Software
  Amelia: gking.harvard.edu/amelia/
  IVEware
  Norm
  Mx
  SPSS
  LISREL
  Mplus
  SAS
  Stata