Survey Inference with Incomplete Data Trivellore Raghunathan Chair and Professor of Biostatistics, School of Public Health Research Professor, Institute.


Survey Inference with Incomplete Data Trivellore Raghunathan Chair and Professor of Biostatistics, School of Public Health Research Professor, Institute for Social Research University of Michigan Presented at the National Conference on Health Statistics, August 16-18, 2010

Goals
Importance of dealing with missing data in national surveys
Weighting and imputation as general-purpose solutions for missing data
Why do we need multiple imputation?
Enhance the use of all available information for the creation of public-use datasets
Design implications and future directions
The other two talks cover several applications and software-related issues

Missing Data
A pervasive problem that is getting worse
– Response rates are generally declining in all surveys (unit nonresponse)
– Subjects who are willing to participate in surveys hesitate to provide all information (item nonresponse)
Threat to the quintessential notion of a representative sample from the population
– Leads to bias of unknown direction and magnitude
– Loss of efficiency

What are the reasons for missing data? (Missing Data Mechanism)
[Diagram: X and Y both observed for some cases; X observed but Y missing (?) for the rest]
Missing completely at random (MCAR)
Missing at random (MAR)
Not missing at random (NMAR)
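The three mechanisms can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the slides: X and Y are simulated normal variables, the function name `simulate` is hypothetical, and the response probabilities (a constant 0.6, or a logistic function of X or of Y) are chosen purely for illustration. It compares the mean of Y over all cases with the mean over the cases where Y is observed.

```python
import math
import random
import statistics


def simulate(mechanism, n=20000, seed=1):
    """Return (mean of Y over all cases, mean of Y over observed cases)."""
    rng = random.Random(seed)
    all_y, observed_y = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, 1.0)  # Y is related to X
        if mechanism == "MCAR":
            p_observed = 0.6  # same for everyone
        elif mechanism == "MAR":
            p_observed = 1.0 / (1.0 + math.exp(-x))  # depends only on observed X
        elif mechanism == "NMAR":
            p_observed = 1.0 / (1.0 + math.exp(-y))  # depends on Y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        all_y.append(y)
        if rng.random() < p_observed:
            observed_y.append(y)
    return statistics.fmean(all_y), statistics.fmean(observed_y)


for mechanism in ("MCAR", "MAR", "NMAR"):
    full_mean, complete_case_mean = simulate(mechanism)
    print(f"{mechanism}: all cases {full_mean:.3f}, complete cases {complete_case_mean:.3f}")
```

In this setup the complete-case mean stays close to the full mean only under MCAR. Under MAR the bias can in principle be removed by conditioning on X; under NMAR it cannot be removed using X alone.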

Analysis
Most complete-case (available-case) analyses are valid under the MCAR assumption
– Default in most software packages
– Unreasonable assumption
MAR assumption is much weaker
– Depends on how good the X are as predictors of Y
– Non-testable assumption
NMAR
– Needs an explicit formulation of the differences between respondents and non-respondents
– Needs external data
– Non-testable assumption

Weighting (Unit Nonresponse)
MAR assumption
Group respondents and non-respondents based on X (adjustment cells)
Attach weights to respondents in each group to compensate for non-respondents in the same group
– Example: White females aged living in Southwest Region
  100 in sample, 20 nonrespondents
  pr(response in cell) = 0.8
  response weight = 1/0.8 = 1.25
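The adjustment-cell calculation can be sketched in a few lines. The function name `nonresponse_weights` and the cell label are hypothetical, and the sketch assumes equal base sampling weights within a cell, which the slide does not state. Each respondent's weight is the inverse of the observed response rate in the cell.

```python
from collections import defaultdict


def nonresponse_weights(sample):
    """sample: iterable of (cell, responded) pairs. Returns {cell: respondent weight}."""
    n_sampled = defaultdict(int)
    n_responded = defaultdict(int)
    for cell, responded in sample:
        n_sampled[cell] += 1
        if responded:
            n_responded[cell] += 1
    weights = {}
    for cell, n in n_sampled.items():
        r = n_responded[cell]
        if r == 0:
            # No respondent can stand in for the nonrespondents in this cell.
            raise ValueError(f"cell {cell!r} has no respondents; collapse it with another cell")
        weights[cell] = n / r  # inverse of the cell response rate
    return weights


# The slide's example cell: 100 sampled, 20 nonrespondents, so 80 respondents.
cell = "example cell"
sample = [(cell, True)] * 80 + [(cell, False)] * 20
weights = nonresponse_weights(sample)
print(weights[cell])  # 1.25
```

With this weight, the 80 respondents jointly represent all 100 sampled cases in the cell (80 × 1.25 = 100).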

Formation of Adjustment Cells
Need X that are predictive of Y (or a collection of Y’s in a multi-purpose survey)
Using X’s that are not predictive of Y will not reduce bias but will increase variance
Current survey practice focuses too much on finding X’s that differentiate respondents from non-respondents, but the predictive power of X for Y is more critical
Need to think proactively about collecting X’s that are related to multiple Y’s through design modification

More General Problem
[Diagram: data matrix with the variables in the data set (Y1, Y2, Y3, Y4, …, Yp) as columns; rows split into complete cases and cases with some missing values; legend distinguishing observed data from missing data]

Imputation Imputation refers to filling in a value for each missing datum based on other information (e.g., a model and observed data)

Imputation
Typically used for item nonresponse
Benefits of imputation
– Completes the data matrix
– If imputation is performed by a producer of public-use data:
  Missing data are handled comparably across secondary data analyses
  Information available to the data producer but not the public can be used in creating imputations

Imputation
Important issues:
– Imputations are not real values
– A single imputation fed into a standard software package is treated as if the imputed values were real
– Variance estimates are then too small, because the uncertainty associated with the imputed values is ignored
– This goes against our “culture”, in which approximations of the sample design (collapsing, combining PSUs, strata, etc.) are chosen to avoid underestimation

Multiple Imputation
Repeat the imputation process several times (say M times)
Uncertainty due to imputation is captured by the “between imputed data” variation
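A minimal sketch of repeating the imputation M times is given below. It is not the sequential regression method used by IVEware: it assumes a single variable Y with missing values, one fully observed predictor X, and a normal linear regression of Y on X. The function names and the simulated data are hypothetical. Each repetition first draws the regression parameters from their approximate posterior and then draws the missing Y values, so the M completed data sets differ from one another.

```python
import math
import random
from statistics import fmean


def impute_once(data, rng):
    """data: list of (x, y) with y=None when missing. Returns one completed copy."""
    observed = [(x, y) for x, y in data if y is not None]
    n = len(observed)
    x_bar = fmean(x for x, _ in observed)
    y_bar = fmean(y for _, y in observed)
    sxx = sum((x - x_bar) ** 2 for x, _ in observed)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in observed) / sxx
    sse = sum((y - y_bar - slope * (x - x_bar)) ** 2 for x, y in observed)

    # Draw parameters: sigma^2 = SSE / chi-square(n - 2), then the coefficients.
    sigma2 = sse / rng.gammavariate((n - 2) / 2.0, 2.0)
    level = rng.gauss(y_bar, math.sqrt(sigma2 / n))      # mean of Y at x = x_bar
    drawn_slope = rng.gauss(slope, math.sqrt(sigma2 / sxx))
    sd = math.sqrt(sigma2)

    completed = []
    for x, y in data:
        if y is None:
            y = level + drawn_slope * (x - x_bar) + rng.gauss(0.0, sd)
        completed.append((x, y))
    return completed


def multiply_impute(data, m=5, seed=0):
    """Return m completed data sets, each from an independent imputation."""
    rng = random.Random(seed)
    return [impute_once(data, rng) for _ in range(m)]


# Hypothetical data: every fourth Y is set to missing.
data_rng = random.Random(42)
data = []
for i in range(200):
    x = data_rng.gauss(0.0, 1.0)
    y = 2.0 + 1.5 * x + data_rng.gauss(0.0, 1.0)
    data.append((x, None if i % 4 == 0 else y))

imputed_sets = multiply_impute(data, m=5, seed=0)
print([round(fmean(y for _, y in d), 3) for d in imputed_sets])
```

The spread of the printed means across the five completed data sets is the between-imputation variation that the slide refers to.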

Analysis of Multiply Imputed Data
Analyze each imputed data set separately
Combine estimates
Combine variances
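The combining formulas did not survive in this transcript. The sketch below assumes the standard combining rules for multiply imputed data (Rubin's rules): the combined estimate is the average of the M estimates, and the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. The function name `combine` and the example numbers are hypothetical.

```python
import math
from statistics import fmean, variance


def combine(estimates, variances):
    """Combine M completed-data estimates and their variances (M >= 2).

    Returns (combined estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need the same number (at least 2) of estimates and variances")
    q_bar = fmean(estimates)   # combined point estimate
    u_bar = fmean(variances)   # within-imputation variance
    b = variance(estimates)    # between-imputation variance (divisor M - 1)
    total = u_bar + (1.0 + 1.0 / m) * b
    if b == 0:
        df = math.inf          # no between-imputation variation
    else:
        df = (m - 1) * (1.0 + u_bar / ((1.0 + 1.0 / m) * b)) ** 2
    return q_bar, total, df


# Hypothetical results from M = 3 completed data sets.
estimate, total_variance, df = combine([10.0, 11.0, 12.0], [1.0, 1.0, 1.0])
print(estimate, round(total_variance, 4), round(df, 3))  # 11.0 2.3333 6.125
```

The standard error of the combined estimate is the square root of the total variance, and intervals are usually based on a t distribution with the returned degrees of freedom.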

Software for Creating Imputations
SAS
– PROC MI
– User-developed IVEWARE
Stata
– ICE
R
– MICE
– MI
SOLAS
AMELIA
SPSS
Stand-Alone
– SRCWARE
– NORM
– PAN
– CAT
Another good source:

Software for Analysis of Imputed Data
SAS
– MIANALYZE
– IVEWARE
SUDAAN
STATA
– MICOMBINE
– MI (newest version has an excellent interface)
R (user-defined macros)
SRCWARE (stand-alone)

MULTIPLE IMPUTATION FOR MISSING INCOME DATA IN THE NATIONAL HEALTH INTERVIEW SURVEY
Schenker, Raghunathan, Chiu, Makuc, Zhang, and Cohen (2006, JASA)
National Health Interview Survey (NHIS)
– Principal source of information on the health of the civilian non-institutionalized population
– Data collected at both family and person levels
– Contains items on health, demographic, and socioeconomic characteristics (e.g., income)
– Allows the study of relationships between health and other characteristics

NHIS
[Table: Percent distribution of types of family income responses by year for the NHIS, 1997–2004]
Missingness appears to be related to several other characteristics, such as health, health insurance, age, race, country of birth, and region of residence

Missing income data multiply imputed for NHIS beginning with 1997
– M = 5 sets of imputations of:
  – employment status for adults (< 4% missing)
  – personal earnings for adults who worked for pay
  – family income (and ratio of family income to Federal poverty threshold)
Imputed income files since 1997, with documentation, available at the NHIS Web site
Used an adaptation of IVEware

Complicating issues handled during imputation
– Hierarchical structure of data
  Families and persons
– Sometimes, one variable (e.g., personal earnings) is restricted based on another variable (e.g., whether worked for pay), but both variables are missing
– Imputation within bounds, e.g., for families that reported income categories rather than exact dollar values
– Several variables used as predictors (including design variables)
  Different types (continuous, categorical, count)
  Small amounts of missingness (mostly 2% or less)
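Imputation within bounds can be sketched as drawing from the imputation model and keeping only values that fall inside the reported category. This is an illustrative assumption, not the NHIS procedure: the normal model, the function name `draw_within_bounds`, and the dollar figures are all hypothetical, and simple rejection sampling is used for clarity rather than efficiency.

```python
import random


def draw_within_bounds(mean, sd, lower, upper, rng, max_tries=10000):
    """Draw from Normal(mean, sd) restricted to the interval [lower, upper)."""
    if not lower < upper:
        raise ValueError("lower bound must be below upper bound")
    for _ in range(max_tries):
        value = rng.gauss(mean, sd)
        if lower <= value < upper:
            return value
    raise RuntimeError("no draw fell within the bounds; check the model or the bounds")


# Hypothetical family: the model predicts 42,000 (sd 15,000), and the family
# reported only that its income was in the 35,000 to 50,000 category.
rng = random.Random(7)
draws = [draw_within_bounds(42000.0, 15000.0, 35000.0, 50000.0, rng) for _ in range(5)]
print([round(d) for d in draws])
```

Every imputed value respects the category the family reported, while still varying from one imputation to the next.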

Results
Estimated percentage of persons of ages in fair or poor health, by ratio of family income to Federal poverty threshold: 2001 NHIS
[Table columns: Ratio to Poverty Threshold | No Imp. (NI): Est., SE | Single Imp. (SI): Est., SE | Mult. Imp. (MI): Est., SE | Ratio of SEs: NI/MI, SI/MI]

Summary of Multiple Imputation
Retains advantages of single imputation
– Consistent analyses
– Data collector’s knowledge
– Rectangular data sets
Corrects disadvantages of single imputation
– Reflects uncertainty in imputed values
– Corrects inefficiency from imputing draws: estimates have high efficiency for modest M, e.g., 5
For this approach to be successful, we need to collect good correlates of variables that are expected to have large amounts of missing values
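The claim about modest M can be checked with the usual approximation for the efficiency of an estimate based on M imputations relative to one based on infinitely many, 1 / (1 + γ/M), where γ is the fraction of missing information. The formula is the standard one from the multiple imputation literature rather than something shown in these slides, and γ = 0.3 is a hypothetical value.

```python
def relative_efficiency(gamma, m):
    """Efficiency (on the variance scale) of M imputations relative to infinitely many."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma is a fraction between 0 and 1")
    if m < 1:
        raise ValueError("need at least one imputation")
    return 1.0 / (1.0 + gamma / m)


gamma = 0.3  # hypothetical fraction of missing information
for m in (1, 3, 5, 10, 20):
    print(f"M = {m:2d}: relative efficiency = {relative_efficiency(gamma, m):.3f}")
```

With γ = 0.3, M = 5 already gives about 94% efficiency, and increasing M to 20 raises it only to about 99%.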