Missing Values Raymond Kim Pink Preechavanichwong Andrew Wendel October 27, 2015.

I. Intro: Missing Values and Bias
II. Simulations and Imputation
III. Deletion Methodology
IV. Not Missing at Random

Initial Steps Why are our data missing? What characterizes our missing data? How will that affect bias, the mean, and the standard deviation?

OLS Unbiased Estimator

Initial Steps
1. Identify the reason for the missing data
   - Marriage, graduation, death, etc.
2. Understand the distribution of the missing data
   - Certain groups are more likely to have missing values
3. Decide on the best method of analysis
   - Deletion methods: listwise and pairwise deletion
   - Single imputation methods: mean substitution, dummy variable, single regression
   - Model-based methods: maximum likelihood and multiple imputation
4. Power and bias
   - Too many missing values reduce power
   - Missing values can introduce bias into your estimator

Missing Values and Bias Are missing values moving us away from or closer to the true DGP (data-generating process)?

Conditional Distribution
MCAR (missing completely at random): Probability(Y = Missing | X, Y) = Probability(Y = Missing). The probability that Y is missing does not depend on X or Y.
MAR (missing at random): Probability(Y = Missing | X, Y) = Probability(Y = Missing | X). The probability that Y is missing depends on X but not on Y.
NMAR (not missing at random): Probability(Y = Missing | X, Y) cannot be simplified. The probability that Y is missing depends on Y and possibly on X.
Source: Statistical Models, A. C. Davison, Cambridge University Press
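The three mechanisms can be made concrete with a small simulation. This is a minimal sketch, not taken from the slides: the function name `simulate_missingness`, the missingness probabilities (0.3 and 0.6), and the model Y = X + noise are all assumptions chosen for illustration. It compares the mean of Y over all cases with the mean over the cases that remain observed under each mechanism.

```python
import random
import statistics

def simulate_missingness(n=20000, seed=0):
    """Compare the full-data mean of Y with the observed-case mean under each mechanism.

    All probabilities below are illustrative assumptions, not values from the slides.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [x + rng.gauss(0, 1) for x in xs]  # Y is correlated with X

    observed = {"MCAR": [], "MAR": [], "NMAR": []}
    for x, y in zip(xs, ys):
        # MCAR: the chance of being missing is a constant
        if rng.random() >= 0.3:
            observed["MCAR"].append(y)
        # MAR: the chance of being missing depends only on X
        if rng.random() >= (0.6 if x > 0 else 0.0):
            observed["MAR"].append(y)
        # NMAR: the chance of being missing depends on Y itself
        if rng.random() >= (0.6 if y > 0 else 0.0):
            observed["NMAR"].append(y)

    result = {"full": statistics.fmean(ys)}
    for name, kept in observed.items():
        result[name] = statistics.fmean(kept)
    return result

result = simulate_missingness()
for name, value in result.items():
    print(f"{name:>5}: mean of Y = {value:.3f}")
```

Under these assumptions the MCAR mean should stay close to the full-data mean, while the unconditional means under MAR and NMAR are pulled downward because large values are removed more often.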

[Figure: four panels titled Normal Data, MCAR, NMAR, and MAR. Source: Statistical Models, A. C. Davison, Cambridge University Press]

Bias Matrix – Does Bias Exist? Deletion Mean Imputation None (but reduced power) None < 0 Conditional None Unconditional Yes Conditional Yes < 0 Unconditional Yes Yes Source: Statistical Models, A. C. Davison, Cambridge University Press

Working with Missing Data
MCAR: Deletion, Maximum Likelihood, Multiple Imputation, Single Imputation
MAR: Maximum Likelihood, Multiple Imputation, Single Imputation
NMAR: Sensitivity Analysis, Pattern Mixture Models, Selection Model, Maximum Entropy

Listwise and Pairwise Deletion Missing values are MCAR MAR BIASED NMAR Conditional UNBIASED MCAR MAR

Single Imputation
Mean/Mode Substitution: Replace missing data with the mean or mode. Introduces bias in the estimated variance.
Dummy Variable Control: Create an indicator (1 = missing, 0 = not missing). Impute missing values to a constant.
Conditional Mean Substitution: Replace missing values with the predicted score from a regression. Overestimates model fit.
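The dummy variable approach described on this slide can be sketched in a few lines. The helper name `dummy_variable_adjust`, the fill constant 0.0, and the `income` values are hypothetical and only serve as an illustration. Missing entries are represented as `None`.

```python
def dummy_variable_adjust(values, fill=0.0):
    """Return (filled, indicator): indicator is 1 where the value was missing, else 0.

    Hypothetical helper for illustration; `fill` is the constant used for missing values.
    """
    indicator = [1 if v is None else 0 for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, indicator

# Made-up example values; None marks a missing observation
income = [52.0, None, 61.5, None, 48.0]
filled, missing_flag = dummy_variable_adjust(income)
print(filled)
print(missing_flag)
```

Both columns would then enter the model together, so the indicator absorbs the effect of the arbitrary fill constant.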

Simulations and Imputation

Imputing Values Deal with missing data by generating values for those that are missing. The methods for imputing these values vary in accuracy and complexity. We will focus on single imputation methods and a few multiple imputation methods.

Mean Imputation We can use the mean in place of the missing values. This retains the mean of the observed data. It also causes a negative bias in the variance.
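Both claims on this slide can be checked on a toy example. The sketch below uses made-up numbers and a hypothetical helper `mean_impute`, with `None` marking a missing value. Filling in the observed mean leaves the mean unchanged and shrinks the sample variance by the factor (n_obs - 1) / (n - 1), where n_obs is the number of observed values and n is the total.

```python
import statistics

def mean_impute(values):
    """Replace each None with the mean of the observed values (hypothetical helper)."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

# Made-up data; None marks a missing observation
data = [2.0, None, 4.0, 6.0, None, 8.0]
observed = [v for v in data if v is not None]
imputed = mean_impute(data)

print("mean, observed vs imputed:", statistics.fmean(observed), statistics.fmean(imputed))
print("variance, observed vs imputed:", statistics.variance(observed), statistics.variance(imputed))
```

The imputed values add nothing to the sum of squared deviations but do increase the denominator, which is where the downward bias in the variance comes from.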

Regression Mean Imputation Instead of using the mean, we can use regression to give us predicted values for those missing. This may allow us to achieve better estimates.
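A minimal sketch of regression imputation with one predictor follows. It assumes a simple least-squares line fitted on the complete pairs; the helper name `regression_impute` and the numbers are hypothetical, and `None` marks a missing Y.

```python
import statistics

def regression_impute(xs, ys):
    """Fill missing ys with predictions from a least-squares line fit on complete pairs.

    Hypothetical helper for illustration; assumes xs has no missing values.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    mean_x = statistics.fmean([x for x, _ in pairs])
    mean_y = statistics.fmean([y for _, y in pairs])
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]

# Made-up data that follows y = 2x + 1 exactly
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, None, 9.0, None]
completed = regression_impute(xs, ys)
print(completed)
```

Every imputed point lies exactly on the fitted line, so the completed data look more tightly related than the real data would. That is the overestimated model fit noted for conditional mean substitution.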

Multiple Imputations A more complex way to impute missing values. It imputes the missing values several times, analyzes each completed data set, and combines the results.
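The slides do not say how the separate analyses are combined. One standard choice is Rubin's rules, sketched below as an assumption rather than as the presenters' method: the pooled estimate is the average of the per-imputation estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `pool_estimates` and the input numbers are hypothetical.

```python
import statistics

def pool_estimates(estimates, variances):
    """Pool m per-imputation estimates and their variances using Rubin's rules."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)        # average point estimate
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total

# Made-up results from m = 3 imputed data sets
pooled, total = pool_estimates([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
print(pooled, total)
```

The between-imputation term is what single imputation leaves out: it carries the extra uncertainty that comes from not knowing the missing values.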

A Few R Methods
How can we do this in R?
- Amelia
- mi
- There are many others, and some can be used to treat specific conditions for certain data sets.

Amelia Amelia is an R package that bootstraps the data and uses the bootstrapped samples in a multiple imputation process.

mi “mi” imputes missing values using Bayesian regression methods, which are run a number of times and checked for convergence. This method is very customizable but is also computationally costly.

Additional Resources Additional packages that can be used in R can be found here:

Imputation Summary
- In order to use imputation-based methods, we first need to understand the data and the reason for the “missingness” of the data.
- By knowing this, we can fit the method that we feel is most appropriate to our data set.
- Single imputation methods can give us quick and easy answers to our missing values, but they also bias statistics like the variance.
- Multiple imputation methods can handle the bias better but are complex and require more specialized R packages or software.

Deletion Methodology

Bias
Bias = E(estimate) - parameter.
Bias = 0 means no bias.
Bias > 0: there is a systematic tendency for the estimate to be larger than the parameter it is estimating.
Bias < 0: there is a systematic tendency for the estimate to be smaller than the parameter it is estimating.
Credit: Dr. Westfall

Listwise Vs Pairwise Deletion
What are they? They are methods that discard data.
How do they work?
Listwise (complete-case analysis): Exclude all units for which the outcome or any of the inputs are missing.
Pairwise (available-case analysis): For each pair of variables, exclude only the units in which one or both values are missing.
What is the difference? Pairwise deletion attempts to minimize the data loss that occurs in listwise deletion.
Credit:
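The difference in how much data each method keeps can be seen on a toy table. This is a sketch with made-up rows and hypothetical helper names (`listwise_delete`, `pairwise_counts`); `None` marks a missing value.

```python
def listwise_delete(rows):
    """Keep only the rows with no missing values (complete-case analysis)."""
    return [row for row in rows if all(v is not None for v in row)]

def pairwise_counts(rows):
    """Count the usable rows for each pair of columns (available-case analysis)."""
    n_cols = len(rows[0])
    counts = {}
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            counts[(i, j)] = sum(
                1 for row in rows if row[i] is not None and row[j] is not None
            )
    return counts

# Made-up data: five units, three variables
rows = [
    (1.0, 2.0, 3.0),
    (2.0, None, 1.0),
    (3.0, 4.0, None),
    (4.0, 5.0, 6.0),
    (None, 1.0, 2.0),
]
complete = listwise_delete(rows)
counts = pairwise_counts(rows)
print(len(complete), "complete rows")
print(counts)
```

In this example listwise deletion keeps two of the five units, while every pair of variables still has three usable units under pairwise deletion. The price is that each pairwise statistic is computed on a different subset of units.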

Listwise Vs Pairwise Deletion (Cont'd)
[Figure: two panels titled Listwise deletion and Pairwise deletion]

Listwise Vs Pairwise Deletion (Cont'd)
Pros and cons of listwise and pairwise deletion:
Listwise: The sample after deletion may not be representative of the full sample. Power is reduced, so the Type II error rate increases. Results tend to be biased.
Pairwise: Preserves or increases statistical power in the analyses relative to listwise deletion. The result is the same as listwise deletion if the data have only two variables (columns). Estimates can still be biased (over- or underestimated).
Credit:

Not Missing at Random

Case of NMAR
- Why are our values missing? High-income individuals don’t report income.
- What is the characteristic of the missing data? The missing values are NMAR.

Meboot Package

Evaluation of a Fund Manager While evaluating a fund manager for investment, you notice that the fund did not include 2008 returns for its equity fund. You strongly suspect the missing data are NMAR: the returns were left out because they were bad.

Evaluation of a Fund Manager You find out that the equity fund normally held stocks representative of the entire stock market. The distribution of the missing data may therefore follow the overall US equity market.

Meboot Maximum Entropy

Meboot Maximum Entropy NMAR missing values require the most assumptions. Minimizing bias for NMAR depends heavily on your model setup. There is no “right” answer; we do not know the true DGP. All we can do is minimize bias with well-grounded assumptions.

Questions? THANK YOU!