Treatment of missing values


1.11) Missing values in autoregressive and cross-lagged models: diagnostics and therapy. What is a missing value? There is unit non-response and item non-response. Answers such as "Don't know", "Refused" or "No opinion" are also often treated as missing. In longitudinal studies the problem of missing values is especially disturbing because of panel mortality, that is, unit non-response in later waves. This is also called wave non-response, attrition or drop-out.

Different patterns of non-response
1) Univariate pattern: for some indicators or items we have full observations, and for some other items we have missing values (no answers). These items may be fully or partly missing.
2) Monotone pattern: may arise in longitudinal studies with attrition. If an item is missing in some wave, it continues to be missing in all later waves.
3) Arbitrary pattern: any set of variables may be missing for any unit.

[Figure: three schematic data matrices with units 1, 2, ..., N in rows and variables y1, y2, y3, ..., yp in columns, with the missing blocks marked "?": 1) univariate pattern, 2) monotone pattern, 3) arbitrary pattern]
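The three patterns can be told apart mechanically from the missingness indicators. The following is a minimal sketch, not part of the original slides: the function name `missing_pattern` and the convention that `None` marks a missing value are assumptions made for illustration, and the monotone check assumes that the columns are already ordered by wave.

```python
def missing_pattern(rows):
    """Classify a data matrix (list of rows, None = missing) as
    'complete', 'univariate', 'monotone' or 'arbitrary'.
    Hypothetical helper; columns are assumed to be ordered by wave."""
    n_vars = len(rows[0])
    # Missingness indicators: True where a value is missing.
    miss = [[v is None for v in row] for row in rows]
    if not any(any(r) for r in miss):
        return "complete"
    # Variables that have at least one missing value.
    incomplete = {j for r in miss for j, m in enumerate(r) if m}
    # Univariate: each unit is either complete or misses exactly
    # the same set of incomplete variables.
    if all(not any(r) or {j for j, m in enumerate(r) if m} == incomplete
           for r in miss):
        return "univariate"
    # Monotone: once a variable is missing, all later ones are missing too.
    if all(all(r[j + 1] for j in range(n_vars - 1) if r[j]) for r in miss):
        return "monotone"
    return "arbitrary"


univariate = [[1, 2, 3], [1, 2, None], [1, 2, None]]
monotone = [[1, 2, 3], [1, 2, None], [1, None, None]]
arbitrary = [[1, None, 3], [None, 2, 3], [1, 2, None]]
```

Note that a univariate pattern also satisfies the monotone condition when the incomplete variables come last, which is why the sketch tests for it first.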

The literature differentiates three kinds of missing values:
1) MCAR (missing completely at random) means that whether the data are missing is entirely unrelated statistically to the values that would have been observed. MCAR is the most restrictive assumption. MCAR can sometimes be established by design, for example by randomly assigning test booklets or blocks of survey questions to different respondents.
2) MAR (missing at random) is a somewhat more relaxed condition. It means that missingness is statistically unrelated to the missing values of the variable itself once the observed variables are taken into account. However, missingness may be related to other, observed variables in the data set. One way to make the MAR assumption more plausible is to include completely observed variables that are highly predictive of the incomplete data.

3) MNAR (missing not at random), or nonignorable missing data, where missingness conveys probabilistic information about the values that would have been observed.

Example: Let us take two variables, education and income. Education has no missing values, and income has some. MCAR would mean that the missingness of income depends neither on education nor on income. MAR would mean that the missingness of income depends on education but, given education, not on income itself. That is, education can predict which income values are missing. MNAR would mean that the missingness of income is not independent of the missing income values themselves, even after controlling for education. For example, high income values are missing more often than low income values.
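The education and income example can be simulated to see what each mechanism does to the observed incomes. This is an illustrative sketch only: all numbers (sample size, coefficients, missingness probabilities) are invented for the demonstration and do not come from the slides.

```python
import math
import random
import statistics

random.seed(1)
n = 20000
# Education (years) is always observed; income depends on education.
education = [random.gauss(12, 3) for _ in range(n)]
income = [1000 + 150 * e + random.gauss(0, 400) for e in education]


def logistic(z):
    return 1 / (1 + math.exp(-z))


def observed_mean(prob_missing):
    """Mean of the income values that remain observed."""
    kept = [y for y, p in zip(income, prob_missing) if random.random() >= p]
    return statistics.mean(kept)


# MCAR: every income value has the same chance of being missing.
p_mcar = [0.3] * n
# MAR: the chance of being missing depends only on education (observed).
p_mar = [logistic((e - 12) / 1.5) for e in education]
# MNAR: the chance of being missing depends on income itself.
p_mnar = [logistic((y - 2800) / 300) for y in income]

full_mean = statistics.mean(income)
mean_mcar = observed_mean(p_mcar)
mean_mar = observed_mean(p_mar)
mean_mnar = observed_mean(p_mnar)
```

In this setup the mean of the observed incomes stays close to the full mean under MCAR and is too low under both MAR and MNAR. The difference between the last two is that under MAR the bias can in principle be corrected with the observed education values, whereas under MNAR it cannot.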

The diagnosis of the kind of missingness is very tricky and cannot always be established, so researchers often simply assume a missingness mechanism for their data. Fortunately, there are solutions that do not depend heavily on the kind of missingness and that can still be applied under MNAR.

MAR: logistic regression is a possible test for MAR. The missingness indicator is regressed on the completely observed variables. But even if some completely observed variables have a significant effect on the missingness, this cannot exclude MNAR. Only experimental designs that include a representative sample of the nonrespondents, for which complete data are obtained, can help to design a model for the missingness in the full data set.
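As a sketch of this diagnostic, the missingness indicator of income can be regressed on a completely observed variable such as education. The example below is an assumption-laden illustration: it uses simulated data and a small hand-written Newton-Raphson routine (`fit_logistic`, a hypothetical helper) so that it runs without any statistics package. In practice one would use the logistic regression routine of one's usual software.

```python
import math
import random


def fit_logistic(x, r, iterations=25):
    """Newton-Raphson fit of P(r = 1) = logistic(b0 + b1 * x).
    Returns (b0, b1, standard error of b1). Hypothetical helper."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ri in zip(x, r):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            w = p * (1 - p)
            g0 += ri - p              # gradient, intercept
            g1 += (ri - p) * xi       # gradient, slope
            h00 += w                  # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / det)      # from the inverse information matrix
    return b0, b1, se_b1


random.seed(4)
n = 5000
educ_c = [random.gauss(0, 3) for _ in range(n)]  # education, centred at its mean
# Missingness of income depends on education (invented coefficients).
miss_mar = [1 if random.random() < 1 / (1 + math.exp(-(-1 + 0.4 * e))) else 0
            for e in educ_c]
# Missingness of income is unrelated to education.
miss_mcar = [1 if random.random() < 0.3 else 0 for _ in range(n)]

_, b1_mar, se_mar = fit_logistic(educ_c, miss_mar)
_, b1_mcar, se_mcar = fit_logistic(educ_c, miss_mcar)
```

A slope that is large relative to its standard error speaks against MCAR. It says nothing about whether missingness additionally depends on the unobserved incomes, which is why the test cannot rule out MNAR.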

So far diagnostics, now therapy.
Traditional methods:
1) Listwise deletion (LD): deleting every case that has any missing value. Advantage: consistent solution (if the data are MCAR). Disadvantage: not efficient, and it often causes a drastic reduction in sample size, especially in studies that involve multiple indicators or sensitive questions such as income.

Therapy (2)
2) Pairwise deletion (PD), also called available case (AC) analysis: calculates each correlation separately. An observation is excluded from a calculation only when it is missing a value needed for that particular correlation. Advantage: smaller loss of cases than with LD. Disadvantages: not efficient, and it can create estimation problems, because the resulting correlation matrix may not be positive definite. There is no single defined N for the sample, since N depends on the pair of variables being computed.

Advantages of the case deletion methods: simplicity. If a missing data problem can be resolved by discarding only a small part of the sample, then the method can be quite effective. However, even in that situation, one should explore the data to make sure that the discarded cases are not influential.
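The difference between the two deletion methods is easiest to see on a toy data set. The sketch below is illustrative only: the values are invented, `None` marks a missing value, and the helper names `listwise`, `pairwise_corr` and `pearson` are assumptions, not functions of any particular package.

```python
import math
import statistics

# Toy data set; None marks a missing value (values are invented).
data = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, None, 2.9, 4.2, None, 6.1],
    "z": [None, 1.0, 2.5, 2.0, 4.0, None],
}


def listwise(data):
    """Keep only the cases in which every variable is observed."""
    names = list(data)
    rows = zip(*(data[v] for v in names))
    kept = [row for row in rows if None not in row]
    return {v: [row[i] for row in kept] for i, v in enumerate(names)}


def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def pairwise_corr(data, a, b):
    """Correlation of a and b from every case where both are observed.
    Returns (correlation, number of cases used)."""
    pairs = [(u, v) for u, v in zip(data[a], data[b])
             if u is not None and v is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys), len(pairs)


complete = listwise(data)
n_listwise = len(complete["x"])
r_xy, n_xy = pairwise_corr(data, "x", "y")
r_xz, n_xz = pairwise_corr(data, "x", "z")
r_yz, n_yz = pairwise_corr(data, "y", "z")
```

Here listwise deletion keeps only 2 of the 6 cases, while the pairwise correlations rest on 4, 4 and 2 cases respectively. Each entry of the pairwise correlation matrix therefore has its own N, which is the source of the problems described above.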

Reweighting
In some non-MCAR situations, it is possible to reduce biases by applying weights. After incomplete cases are removed, the remaining complete cases are reweighted so that their distribution more closely resembles that of the full sample or population with respect to auxiliary variables (Little and Rubin 1987). Reweighting requires some model for the response probabilities in order to calculate the weights. It works better for univariate and monotone missing patterns and becomes complicated to apply if the missingness follows an arbitrary pattern.
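A simple version of reweighting is the weighting-class adjustment: respondents are weighted by the inverse of the response rate in their class of the auxiliary variable. The sketch below assumes a single binary auxiliary variable (education group) and invented response rates, so it only illustrates the idea and is not a general procedure.

```python
import random
import statistics

random.seed(2)
n = 20000
# Auxiliary variable observed for everyone: education group (True = high).
group = [random.random() < 0.5 for _ in range(n)]
income = [random.gauss(3500 if g else 2500, 500) for g in group]
# MAR given group: the high-education group responds less often.
responded = [random.random() < (0.4 if g else 0.8) for g in group]

# Estimated response rate within each weighting class.
rate = {}
for g in (False, True):
    members = [r for r, gi in zip(responded, group) if gi == g]
    rate[g] = sum(members) / len(members)

# Complete cases and their weights (inverse response rate of their class).
resp_income = [y for y, r in zip(income, responded) if r]
resp_weight = [1 / rate[g] for g, r in zip(group, responded) if r]

unweighted = statistics.mean(resp_income)
weighted = (sum(w * y for w, y in zip(resp_weight, resp_income))
            / sum(resp_weight))
full = statistics.mean(income)
```

In this simulation the unweighted mean of the respondents is clearly too low, because the high-education group is underrepresented, while the weighted mean is close to the mean of the full sample. The correction works here only because the response probability depends on nothing but the weighting class.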

Older methods of imputation
Imputation means filling in missing values with plausible values and then continuing with the analysis. Advantages: potentially more efficient than discarding the unit, and it prevents the loss of power caused by a decreasing sample size. Disadvantage: imputation may be difficult to implement well.

1) Imputing unconditional means: the average is preserved, but aspects of the distribution such as the variance are distorted.
2) Hot deck imputation: filling in nonrespondents' missing data with values drawn at random from actual respondents. Advantage: it preserves the variable's distribution. Disadvantage: the method still distorts correlations and other measures of association.
3) Imputing conditional means by regression: the model is first fitted to the cases for which Y is known. Once we have the regression parameters of Y on X, we use them to predict the missing values of Y from the known values of X. It is almost optimal with some corrections for standard errors (Schafer & Schenker 2000). Not recommended for analyses of covariances or correlations, since it overstates the relation between Y and X.

4) Imputing from a conditional distribution: the distortion of covariances can be eliminated if each missing value of Y is replaced not by a regression prediction but by a random draw from the conditional (predictive) distribution of Y given X, that is, the regression prediction plus a random error term.

[Figure: four scatterplots of y against x showing the imputed values under 1) mean substitution, 2) hot deck, 3) conditional mean, 4) predictive distribution]

1) Mean substitution causes all imputed values to fall on a horizontal line. According to simulation studies it produces biased estimates for any type of missingness.
2) Conditional mean substitution causes them to fall on the regression line, which introduces bias.
3) The hot deck produces an elliptical cloud with too little correlation and produces biased estimates for any type of missingness.
4) The only method that produces a reasonable point cloud is imputation from the conditional distribution of Y given X; it is unbiased.
However, with all four methods the coverage of the confidence intervals is very low (see Schafer and Graham 2002, p. 161, for details). Solution: modern methods, namely multiple imputation (MI) and maximum likelihood (ML).
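The four single-imputation methods can be compared in a small simulation. The sketch below is illustrative and rests on invented settings: X and Y are standard-scale variables with a true correlation of 0.6, about 40% of Y is deleted completely at random, and all function names are hypothetical. It looks only at the variance of Y and its correlation with X after imputation.

```python
import math
import random
import statistics

random.seed(3)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.6 * xi + random.gauss(0, 0.8) for xi in x]    # var(y) = 1, corr = 0.6
missing = [random.random() < 0.4 for _ in range(n)]  # MCAR, for simplicity

obs_x = [xi for xi, m in zip(x, missing) if not m]
obs_y = [yi for yi, m in zip(y, missing) if not m]

# Regression of Y on X fitted to the complete cases.
mx, my = statistics.mean(obs_x), statistics.mean(obs_y)
slope = (sum((a - mx) * (b - my) for a, b in zip(obs_x, obs_y))
         / sum((a - mx) ** 2 for a in obs_x))
intercept = my - slope * mx
resid_sd = math.sqrt(sum((b - intercept - slope * a) ** 2
                         for a, b in zip(obs_x, obs_y)) / (len(obs_x) - 2))


def impute(rule):
    """Replace each missing y by rule(x); keep observed values."""
    return [rule(xi) if m else yi for xi, yi, m in zip(x, y, missing)]


def corr(a, b):
    ma, mb = statistics.mean(a), statistics.mean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return sab / math.sqrt(sum((u - ma) ** 2 for u in a)
                           * sum((v - mb) ** 2 for v in b))


y_mean = impute(lambda xi: my)                               # 1) mean substitution
y_hot = impute(lambda xi: random.choice(obs_y))              # 2) hot deck
y_cond = impute(lambda xi: intercept + slope * xi)           # 3) conditional mean
y_draw = impute(lambda xi: intercept + slope * xi
                + random.gauss(0, resid_sd))                 # 4) predictive draw

results = {name: (statistics.pvariance(v), corr(x, v))
           for name, v in [("mean", y_mean), ("hot deck", y_hot),
                           ("conditional mean", y_cond),
                           ("predictive draw", y_draw)]}
```

In this setup mean substitution shrinks both the variance and the correlation, the hot deck preserves the variance but weakens the correlation, the conditional mean inflates the correlation, and only the draw from the predictive distribution stays close to the true values of 1 and 0.6. The sketch says nothing about standard errors or coverage, which remain a problem for all single-imputation methods.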

Therapy (3)
Modern methods:
1) FIML (full information maximum likelihood). FIML maximizes the sum of N casewise contributions to the log-likelihood function. Each contribution measures the discrepancy between the observed data and the current parameter estimates using all available data for that case. FIML is a direct method in the sense that model parameters and standard errors are estimated directly from the available data. Missing data (MD) points are not estimated or imputed, and are essentially treated as values that were never intended to be sampled.
Advantage: the algorithm uses all the available information, and the method is both consistent and efficient under MAR.
Disadvantage: the method is model dependent, as it uses information only from the variables in the model (different variables in the model give different results).
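The casewise construction of the FIML objective can be sketched for the simplest case of two variables. The code below is an assumption-based illustration, not the implementation of any SEM program: it assumes a bivariate normal model with given mean vector `mu` and covariance matrix `sigma`, uses `None` for a missing value, and shows only the objective function, not the optimizer that would maximize it over the model parameters.

```python
import math


def casewise_loglik(case, mu, sigma):
    """Log-likelihood contribution of one case under a bivariate normal
    model. case = (y1, y2), None = missing; only the observed components
    enter the likelihood. Hypothetical helper for illustration."""
    idx = [j for j, v in enumerate(case) if v is not None]
    if not idx:
        return 0.0                      # nothing observed, no contribution
    if len(idx) == 1:
        # One variable observed: univariate normal of that variable.
        j = idx[0]
        var = sigma[j][j]
        d = case[j] - mu[j]
        return -0.5 * (math.log(2 * math.pi) + math.log(var) + d * d / var)
    # Both variables observed: full bivariate normal density.
    d0, d1 = case[0] - mu[0], case[1] - mu[1]
    s00, s01, s11 = sigma[0][0], sigma[0][1], sigma[1][1]
    det = s00 * s11 - s01 * s01
    quad = (s11 * d0 * d0 - 2 * s01 * d0 * d1 + s00 * d1 * d1) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)


def fiml_loglik(cases, mu, sigma):
    """Sum of the N casewise contributions; this is what FIML maximizes."""
    return sum(casewise_loglik(c, mu, sigma) for c in cases)


cases = [(0.0, 0.0), (0.0, None), (None, 1.0)]
total = fiml_loglik(cases, mu=[0.0, 0.0], sigma=[[1.0, 0.0], [0.0, 1.0]])
```

Incomplete cases are neither dropped nor imputed: each one simply contributes the likelihood of the values it does have. In an SEM program, `mu` and `sigma` would be the model-implied moments, which is why the results depend on the variables included in the model.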

Assumptions of ML estimates
1) They assume that the sample is large enough, and the data normally distributed, for the ML estimates to be approximately unbiased.
2) They assume some model for the complete data, and MAR.
3) However, in many realistic applications, and according to simulation studies, departures from the last two assumptions are not large enough to effectively invalidate the results.
4) According to simulations, non-normality is not a crucial problem. Graham and Schafer (1999) reported excellent performance when non-normal data with missing values were imputed, even with small samples.
5) For large enough Ns (over 250), MI and ML estimates are very similar.
6) Conclusion: FIML and Bayesian methods are state of the art and should be used.

The FIML procedure is also attractive because it is very easy to implement. FIML is available in most SEM programs (AMOS, LISREL, Mplus). Data imputation is available in LISREL, Mplus and AMOS.