

Missing Data

What is missing? Missing data are unavoidable, and the notion covers more than what the term is usually associated with. What is missing? ~Cases ~Variables ~Values

Missing Cases

Missing cases - 1 Too few cases ~Here, missing data means not enough data due to the ‘curse of dimensionality’. ~N must increase rapidly as you add variables if you want to maintain even coverage of the space of explanatory variables: ~If 1 variable requires N=10, then... ~2 variables need N=10×10=100, ~3 variables need N=10×10×10=1000, ~D variables need N=10^D.
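The arithmetic on this slide can be sketched in a few lines. The function name `cases_needed` is hypothetical, and the baseline of 10 cases per variable is simply the slide's own illustration, not a general rule.

```python
def cases_needed(per_variable: int, n_variables: int) -> int:
    """Cases needed to keep `per_variable` cases along each of `n_variables` axes."""
    return per_variable ** n_variables


# The slide's illustration: 10 cases per explanatory variable.
for d in (1, 2, 3, 6):
    print(f"{d} variable(s): N = {cases_needed(10, d):,}")
```

The required N grows exponentially in the number of variables, which is why adding explanatory variables quickly turns into a missing-cases problem.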

Missing cases - 1 Too few cases – ctd ~But remember Gelman’s Observation ~We do not have as much data as we would like for our research question. But if we had more data, we would try to fit a more complicated model. And then we would again not have as much data as we would like for our research question...

Missing cases - 2 Sampling and Descriptive Inference ~We are interested in some parameter (say µ) describing a population (size N) ~We only have observations of cases from a random sample (size n) ~Missing cases: (N - n) ~However, the sample mean is a consistent and unbiased estimator of µ ~Cost of missing data: uncertainty about the exact value of the population parameter (expressed in the confidence interval of the estimate)
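A minimal sketch of this situation, using simulated data: the population values (mean 50, SD 10), the sizes N and n, and the normal-approximation 95% interval are all assumptions made for the illustration.

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical population of N cases; in practice it is never fully observed.
population = [random.gauss(50, 10) for _ in range(100_000)]

# Random sample of n cases: the remaining N - n cases are 'missing'.
sample = random.sample(population, 400)
n_missing_cases = len(population) - len(sample)

mean = statistics.fmean(sample)
se = statistics.stdev(sample) / math.sqrt(len(sample))

# Normal-approximation 95% confidence interval for the population mean.
ci_low, ci_high = mean - 1.96 * se, mean + 1.96 * se
print(f"missing cases: {n_missing_cases}")
print(f"sample mean: {mean:.2f}, 95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
```

The width of the interval is the price paid for the N - n unobserved cases: it shrinks as n grows, but the point estimate is unbiased at any n.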

Missing cases - 3 Prediction ~If we are interested in a particular element in a population, which is not (yet) observed, we have a missing cases problem that can be addressed by prediction. ~Prediction of the value of an element is based on estimating the relevant population parameter (e.g., µ) and expressing the uncertainty in terms of the standard error of the estimate, which combines the uncertainty generated by variation in the population with uncertainty generated by estimation.
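The formula on the original slide did not survive in this transcript. For the simplest case, predicting one unobserved element from the sample mean, the standard textbook expression combines the two sources of uncertainty as follows. Here s (the sample standard deviation) and n (the sample size) are notation assumed for this sketch, and the slide's own formula may have been written differently.

```latex
\widehat{\mathrm{SE}}_{\text{pred}}
  = \sqrt{\underbrace{s^{2}}_{\text{variation in the population}}
        + \underbrace{\frac{s^{2}}{n}}_{\text{estimation of } \mu}}
  = s\,\sqrt{1 + \frac{1}{n}}
```

As n grows, the estimation term vanishes but the population term does not: a single element can never be predicted more precisely than the population varies.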

Missing cases - 4 Causal Inference ~Causal inference (about the effect of a factor X) involves the comparison of observations (where X is present) with counterfactual ‘observations’ where X is absent (see King, Keohane and Verba, 1994: 75-84) ~In this situation, half the required cases are missing, and unavoidably so because they pertain to a counterfactual

Missing cases - 4 Causal Inference ctd ~Practice: compare observed cases where X is present with other observed cases where X is absent: ~Involves assumption of unit homogeneity ~If possible: condition on relevant factors or match

Missing cases - 5 Inaccessibility for Observation Particular cases which one would like to observe turn out to be unobservable (at least with the data collection methods chosen): ~Documents are classified ~Crimes/accidents are unreported ~People cannot be interviewed (cannot be found / refuse/ other causes) ~Particularly problematic if unobservable cases differ systematically from other ones, resulting in selection bias in the observations

Missing cases - 6 Selection Bias ~Inaccessibility for observation is often related to variables of interest: ~Classified documents pertain to particularly interesting cases ~Politically uninterested and cynical people are less likely to consent to being interviewed ~Economic sanctions are only imposed where they are expected to have an effect ~Particular kinds of crimes go unreported because the victims feel ashamed or embarrassed (e.g., blackmail)

Missing cases - 6 Selection Bias ctd Selection bias has pernicious consequences for ~Descriptive inference (biased estimates of frequency) ~Causal inference (biased estimates of effects, see King, Keohane and Verba 1994, xx-xx).

Missing Variables

Missing variables – 1 Latent variables Latent variables are always missing. They can often be estimated in situations of multiple-item operationalization and the use of measurement models such as factor analysis and IRT – see Measurement clinic.

Missing variables – 2 Manifest variables Missing manifest variables: ~under-coverage of elaborated concepts (yielding validity problems) ~absent additional operationalizations which would allow the estimation of latent variables (and partial tests of validity assumptions) ~use of ‘container’ measures ~absent independent and control variables required in analysis stages ~Diminish the problem by creative use of proxy variables (including instrumental variables), strategic use of secondary analyses, and possibly by strategies of (synthetic) data linking

Missing Values

Missing values This is the common association with ‘missing data’: for some of the cases observations are missing for some of the variables ~‘Swiss cheese’ analogy ~This is the situation that ‘methods for dealing with missing data’ refer to, but these methods do not deal with (completely) missing cases and (completely) missing variables.

Why worry? ~Practicalities of data-analysis: Most methods require complete data. In case of missing data, software makes the data complete one way or another; you had better know how, and what consequences this may have. The simplest ‘solution’ is deletion of cases with missing values. ~Quality of substantive findings: The manner of handling missing data has consequences for bias, consistency and precision of inferences. ~Cost/benefit considerations: Data require resources (money, time, effort) to be collected and constructed; there is no compelling rationale for not using them optimally (hence one should be wary of deleting available information).

Modelling and missing data ~Data-analysis and modelling are done on empirical data ~Data = f (SER, MDGP), where SER: system of empirical relationships; MDGP: missing data generating process ~SER is the object of our substantive interest; adequately modelling SER from data thus also requires modelling MDGP; failure to do so may lead to inferential errors about SER

Types of missing - MCAR ~MCAR (missing completely at random): the MDGP is independent of all observed variables and independent of the SER ~(usually) data entry errors: neither case attributes (variables X1 to Xk), nor the (unknown) scores on the variable with missing data (Y) predict missingness ~instrument rotation: for each case, determine randomly which version of an instrument to use; the probability of using a particular version is independent of the X’s and of Y.

MCAR ~If MCAR missings are deleted: ~Inferences are unbiased ~Inferences are less precise (due to smaller # of cases) ~But, MCAR is uncommon in actual practice
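A small simulation can illustrate both points. The data-generating values (mean 100, SD 15) and the 30% missingness rate are assumptions chosen for the sketch, not values from the slides.

```python
import math
import random
import statistics

random.seed(7)

# Hypothetical complete data on Y.
y = [random.gauss(100, 15) for _ in range(20_000)]

# MCAR: every value has the same 30% chance of being missing,
# regardless of Y itself or of any other variable.
observed = [v for v in y if random.random() > 0.30]

full_mean = statistics.fmean(y)
cc_mean = statistics.fmean(observed)  # mean after deleting the missings

# Standard errors: deletion leaves the estimate unbiased but less precise.
se_full = statistics.stdev(y) / math.sqrt(len(y))
se_cc = statistics.stdev(observed) / math.sqrt(len(observed))

print(f"mean, all cases:      {full_mean:.2f} (SE {se_full:.3f})")
print(f"mean, complete cases: {cc_mean:.2f} (SE {se_cc:.3f})")
```

The two means agree closely, while the standard error grows because fewer cases remain.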

MCAR (grey: missing on D2 but observed on D1)

Types of missing - MAR ~MAR (missing at random): missingness (I) is random after conditioning on X (observed variables), which implies that I is random within groups (or sub-populations) defined on X ~Implies that missing values can be (partly) predicted from observed values on other variables (as long as there are sufficient cases which have valid scores on both X and Y)

MAR (grey: missing on D2 but observed on D1) ~NB: dependency on X (D1 in the graph) will generally not be as deterministic as depicted here

MAR ~Ignoring missing data will lead to biased estimates (the mean of the black dots underestimates the mean of all dots on D2) ~But the distribution of D1 is known, as well as the relationship between D1 and D2. From this a correct estimate of the mean of D2 can be obtained ~Using information about D1 and about the relationship between D1 and D2 makes the missing data MAR, and allows a correct estimate of the mean of D2.
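The argument can be checked with a simulation. Everything specific here is an assumption made for the sketch: the linear relation between D1 and D2, the missingness probabilities, and the helper name `regression_adjusted_mean`. The correction uses a simple regression of D2 on D1 fitted to the complete cases.

```python
import random
import statistics

random.seed(11)
n = 20_000

# Hypothetical data: D1 is always observed, D2 depends on D1.
d1 = [random.gauss(0, 1) for _ in range(n)]
d2 = [2.0 + 1.5 * x + random.gauss(0, 1) for x in d1]

# MAR: the chance that D2 is missing depends only on the observed D1.
missing = [random.random() < (0.8 if x > 0 else 0.1) for x in d1]


def regression_adjusted_mean(x_all, y_all, is_missing):
    """Fit y = a + b*x on complete cases, then predict at the mean of all x."""
    xo = [x for x, m in zip(x_all, is_missing) if not m]
    yo = [y for y, m in zip(y_all, is_missing) if not m]
    mx, my = statistics.fmean(xo), statistics.fmean(yo)
    b = (sum((x - mx) * (y - my) for x, y in zip(xo, yo))
         / sum((x - mx) ** 2 for x in xo))
    a = my - b * mx
    return a + b * statistics.fmean(x_all)


true_mean = statistics.fmean(d2)
cc_mean = statistics.fmean([y for y, m in zip(d2, missing) if not m])
adjusted_mean = regression_adjusted_mean(d1, d2, missing)

print(f"true mean of D2:          {true_mean:.2f}")
print(f"complete-case mean:       {cc_mean:.2f}")
print(f"regression-adjusted mean: {adjusted_mean:.2f}")
```

Because missingness depends only on D1, the regression of D2 on D1 is still estimated without bias from the complete cases, and combining it with the full distribution of D1 recovers the mean of D2.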

Types of missing - NMAR ~NMAR: not missing at random ~The probability that a value is missing depends on the true, but unknown missing value

NMAR (grey: missing on D2 but observed on D1) ~NB: dependency on Y (D2 in the graph) will generally not be as deterministic as depicted here

NMAR ~Ignoring missing data will lead to biased estimates (the mean of the black dots underestimates the mean of all dots on D2) ~Knowledge about the distribution of D1 does not solve this: it does not make the missing data MAR. ~The only hope in this kind of situation is that variables other than D1 may help to turn the missing data into MAR.
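The same kind of simulation shows why conditioning on D1 is not enough under NMAR. Again, the data-generating model, the missingness probabilities and the helper name `regression_adjusted_mean` are assumptions of this sketch.

```python
import random
import statistics

random.seed(13)
n = 20_000

# Hypothetical data: D1 is always observed, D2 depends on D1.
d1 = [random.gauss(0, 1) for _ in range(n)]
d2 = [2.0 + 1.5 * x + random.gauss(0, 1) for x in d1]

# NMAR: the chance that D2 is missing depends on the value of D2 itself.
missing = [random.random() < (0.8 if y > 2.0 else 0.1) for y in d2]


def regression_adjusted_mean(x_all, y_all, is_missing):
    """Fit y = a + b*x on complete cases, then predict at the mean of all x."""
    xo = [x for x, m in zip(x_all, is_missing) if not m]
    yo = [y for y, m in zip(y_all, is_missing) if not m]
    mx, my = statistics.fmean(xo), statistics.fmean(yo)
    b = (sum((x - mx) * (y - my) for x, y in zip(xo, yo))
         / sum((x - mx) ** 2 for x in xo))
    a = my - b * mx
    return a + b * statistics.fmean(x_all)


true_mean = statistics.fmean(d2)
cc_mean = statistics.fmean([y for y, m in zip(d2, missing) if not m])
adjusted_mean = regression_adjusted_mean(d1, d2, missing)

print(f"true mean of D2:          {true_mean:.2f}")
print(f"complete-case mean:       {cc_mean:.2f}")
print(f"regression-adjusted mean: {adjusted_mean:.2f}")
```

Using D1 reduces the bias in this setup but does not remove it, because the selection on D2 also distorts the estimated relation between D1 and D2.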

MAR/NMAR Mixture As in selection-bias situations, the selection (on D2 or Y) generated by the missing cases results in a biased estimate of the relation between D1 and D2

NMAR into MAR ~The problems generated by NMAR missing values are not an inherent characteristic of the empirical world, but of our data and our imagination ~Additional variables and sensible proxies that are systematically correlated with the variable with NMAR missings may make those missings into MAR (if no such variables existed, missing values would be MCAR) ~Hence the value of (simultaneously) looking at all other possible variables, rather than just a few

What to do? Data deletion strategies: - unless MCAR, will generally bias estimates - always inefficient (loss of precision/power) ~Pairwise deletion: to be discouraged. May lead to inconsistent results (e.g., correlation/covariance matrices that are not positive definite) ~Listwise deletion (aka Complete Case Analysis): except in the case of very few missing values, the cumulation of deleted cases may be enormous
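Both warnings can be made concrete with a toy example. The nine-row dataset below is constructed by hand (an assumption of this sketch) so that each pair of variables is observed on a different subset of cases. The listwise calculation further assumes that missingness is independent across variables.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pairwise_corr(rows, i, j):
    """Correlation of columns i and j, using every row where both are observed."""
    pairs = [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


# Columns A, B, C; None marks a missing value.
rows = [
    (1, 1, None), (2, 2, None), (3, 3, None),   # A and B observed
    (None, 1, 1), (None, 2, 2), (None, 3, 3),   # B and C observed
    (1, None, 3), (2, None, 2), (3, None, 1),   # A and C observed
]

r_ab = pairwise_corr(rows, 0, 1)
r_bc = pairwise_corr(rows, 1, 2)
r_ac = pairwise_corr(rows, 0, 2)

# Determinant of the 3x3 correlation matrix; a negative value means the
# matrix is not positive (semi-)definite.
det = 1 + 2 * r_ab * r_bc * r_ac - r_ab**2 - r_bc**2 - r_ac**2
print(f"r_AB={r_ab:.1f}, r_BC={r_bc:.1f}, r_AC={r_ac:.1f}, determinant={det:.1f}")


def complete_case_share(p_missing, n_variables):
    """Expected share of fully observed cases if each variable is missing
    independently with probability p_missing."""
    return (1.0 - p_missing) ** n_variables


print(f"share left after listwise deletion: {complete_case_share(0.05, 20):.3f}")
```

The pairwise correlations are mutually impossible (A and B move together, B and C move together, yet A and C move in opposite directions), and with only 5% missing per variable, listwise deletion over 20 variables keeps little more than a third of the cases.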

What to do? - 2 ‘Working around’ strategies ~Full Information Maximum Likelihood (FIML) integrates out the missing data when fitting the desired model ~Requires particular assumptions (e.g., multivariate normality) ~In a restricted form available in SPSS MVA procedure

What to do? - 3 Imputation strategies all consist of replacing a missing value with an estimate of the actual value for that case ~‘hot-deck’ and ‘cold-deck’ ~Mean imputation ~EM procedures ~Regression mean imputation ~Multiple imputation

Imputation - 1 ~‘hot-deck’ imputation consists of replacing the missing value by the observed value from another, similar case from the same dataset for which that variable was not missing. ~Requires definition of ‘similar’ ~Reifies the observed value from the donor case, tends to inflate precision ~‘cold-deck’ uses cases from another (but similar) dataset ~Used to be popular amongst Census Bureaus
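A minimal sketch of hot-deck imputation, under one possible definition of ‘similar’: the donor is the complete case with the closest value on a single observed variable x. Both that definition and the function name `hot_deck` are assumptions of this example. Real hot-deck procedures typically match on several variables or within imputation classes.

```python
def hot_deck(rows):
    """rows: list of (x, y) tuples where y may be None.
    Returns a new list in which each missing y is copied from a donor."""
    donors = [(x, y) for x, y in rows if y is not None]
    filled = []
    for x, y in rows:
        if y is None:
            # 'Similar' is defined here as the donor with the closest x
            # (an assumption of this sketch).
            _, y = min(donors, key=lambda d: abs(d[0] - x))
        filled.append((x, y))
    return filled


rows = [(1.0, 10.0), (2.0, 20.0), (3.2, None), (4.0, 40.0), (2.1, None)]
imputed = hot_deck(rows)
print(imputed)
```

The imputed values are real observed values, which is attractive, but they are then treated as if they had been observed for the recipient case, which is how precision gets overstated.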

Imputation - 2 ~Mean imputation consists of replacing the missing value by the mean of the variable in question 

Imputation - 3 Mean imputation ~Is still offered as an option in many analysis procedures of statpacks (e.g., SPSS: regression, factor analysis) ~From previous slide: leads generally to severe bias ~General advice: do not do this!
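One reason for this advice can be seen in a toy example (the eight values below are made up for the illustration): filling in the mean leaves the mean unchanged but shrinks the variance, because the imputed cases add no spread.

```python
import statistics

# Hypothetical variable with three missing values.
y = [2.0, 4.0, 6.0, 8.0, 10.0, None, None, None]

observed = [v for v in y if v is not None]
mean_obs = statistics.fmean(observed)

# Mean imputation: every missing value becomes the observed mean.
imputed = [mean_obs if v is None else v for v in y]

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print(f"variance of observed values:    {var_observed:.2f}")
print(f"variance after mean imputation: {var_imputed:.2f}")
```

The understated variance carries over into attenuated covariances and correlations and into standard errors that are too small.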

Imputation - 4 ~Expectation Maximization (EM) procedures: procedure for arriving at the best point estimates of the true values, given the model (which itself is estimated on the basis of the imputed missings) ~Procedure does not take account of uncertainty in the point estimates, and therefore tends to inflate the precision of estimates ~Procedure assumes that the model is correct ~Procedure increasingly available in statpacks, e.g., SPSS MVA

Imputation - 5 ~Regression-mean imputation: replaces the missing value by the conditional regression mean (ŷ) ~Estimate of slope unbiased ~Precision inflated

Imputation – 6 ~Regression-simulation imputation: replaces the missing value by ŷ + error, where error is a random draw based on the regression-derived residual variance ~Estimate of slope unbiased ~Inflation of precision much less, but ~Reifies the single imputed value (and the certainty of the imputation process); solution: multiple imputation
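The difference between regression-mean and regression-simulation imputation can be sketched as follows. The data-generating model, the 50% MCAR missingness on y, and the use of normally distributed residual draws are assumptions of this example.

```python
import math
import random
import statistics

random.seed(3)
n = 20_000

# Hypothetical data: x fully observed, y missing for about half the cases.
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]
missing = [random.random() < 0.5 for _ in range(n)]

# Fit y = a + b*x on the complete cases.
xo = [xi for xi, m in zip(x, missing) if not m]
yo = [yi for yi, m in zip(y, missing) if not m]
mx, my = statistics.fmean(xo), statistics.fmean(yo)
b = (sum((xi - mx) * (yi - my) for xi, yi in zip(xo, yo))
     / sum((xi - mx) ** 2 for xi in xo))
a = my - b * mx
resid_sd = math.sqrt(
    sum((yi - a - b * xi) ** 2 for xi, yi in zip(xo, yo)) / (len(xo) - 2)
)

# Regression-mean imputation: fill in y-hat.
y_mean_imp = [a + b * xi if m else yi for xi, yi, m in zip(x, y, missing)]

# Regression-simulation imputation: fill in y-hat plus a random residual draw.
y_sim_imp = [a + b * xi + random.gauss(0, resid_sd) if m else yi
             for xi, yi, m in zip(x, y, missing)]

var_true = statistics.variance(y)
var_mean_imp = statistics.variance(y_mean_imp)
var_sim_imp = statistics.variance(y_sim_imp)

print(f"variance of y, complete data:           {var_true:.2f}")
print(f"after regression-mean imputation:       {var_mean_imp:.2f}")
print(f"after regression-simulation imputation: {var_sim_imp:.2f}")
```

Imputing ŷ places every imputed case exactly on the regression line, so the spread of y is understated and the fit looks better than it is. Adding the residual draw restores the spread, but each imputed value is still only one of many equally plausible draws.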

Imputation - 7 Multiple imputation: rather than a single imputed value, multiple ones are (stochastically) derived from a prediction equation. Each is, in principle, as good as any other one, yet they are not the same. ~King et al. (2001) recommend the creation of a number of different, imputed, datasets, on each of which the same model is fitted/estimated. Subsequently the parameters of interest are combined in an appropriate fashion.

Imputation - 8 ~Software for multiple imputation: Amelia II (free download). Requires that the statistical software package R is installed (also freely downloadable) ~Amelia lets you define the missing data model: variables to be used (at least those in subsequent analyses, but also any others that are thought to be predictive) and specific data features (e.g., time dependencies, TSCS) ~It assumes MAR (or NMAR → MAR via the specified variables) and multivariate normality ~The software samples from the conditional distribution of the missings given the observed values of the other variables, which is equivalent to many simultaneous little regressions

Using multiple imputations ~For each of the imputed datasets: run the same analysis ~Let Q be the outcome of interest (parameter, mean, etc.); the estimates of Q from the separate datasets are then combined (more details in King et al. 2001, 2009).
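The formula on the original slide is not preserved in this transcript. The usual combination rules (Rubin's rules, as presented in King et al. 2001) are: the pooled estimate is the average of the m separate estimates, and its variance is the average of the m squared standard errors plus (1 + 1/m) times the variance of the estimates across datasets. A sketch, in which the function name `pool` and the example numbers are hypothetical:

```python
import math
import statistics


def pool(estimates, standard_errors):
    """Combine estimates of Q from m imputed datasets using Rubin's rules.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = statistics.fmean([se ** 2 for se in standard_errors])
    # Between-imputation variance: sample variance of the estimates (divisor m - 1).
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Hypothetical results from m = 5 imputed datasets.
q, se = pool([0.52, 0.47, 0.55, 0.50, 0.49], [0.10, 0.11, 0.10, 0.09, 0.10])
print(f"pooled estimate: {q:.3f}, pooled SE: {se:.3f}")
```

The between-imputation term is what a single imputation cannot provide: it carries the uncertainty about the imputed values into the final standard error.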

Literature ~Allison, P. D. (2000) Multiple imputation for missing data, Sociological Methods and Research 28, pp ~Honaker, J. and King, G. (ms) What to do about missing values in time series cross-section data, Available at ~Horton, N. J. and Kleinman (2007) Much Ado About Nothing, The American Statistician 61(1), pp.79–90. ~King, G., Honaker, J., Joseph, A., and Scheve, K. (2001), Analyzing incomplete political science data, American Political Science Review 95, pp.49–69. ~Little, R. J. A. and Rubin, D. B. (2002) Statistical Analysis With Missing Data (2nd ed.) Chichester: Wiley. ~Schafer, J. L. (1997) Analysis of Incomplete Multivariate Data. London: Chapman and Hall.