INFO 7470/ILRLE 7400 Statistical Tools: Missing Data Methods John M. Abowd and Lars Vilhuber March 15, 2011.



Outline
Missing data overview
Missing records
  – Frame or census
  – Survey
Missing items
Overview of different products
Overview of methods
Formal multiple imputation methods
3/15/2011 © John M. Abowd and Lars Vilhuber 2011, all rights reserved

Missing Data Overview
Missing data are a constant feature of both sampling frames (derived from censuses) and surveys
Two important types are distinguished
  – Missing record (frame) or interview (survey)
  – Missing item (in either context)
Methods differ depending upon type

Missing Records: Frame or Census
The problem of missing records in a census or sampling frame is detection
By definition in these contexts the problem requires external information to solve

Census of Population and Housing
Dress rehearsal Census
Pre-census housing list review
Census processing of housing units found on a block not present on the initial list
Post-census evaluation survey
Post-census coverage studies

Economic Censuses and the Business Register
Discussed in lecture 2
Start with tax records
Unduplication in the Business Register
Weekly updates
Multi-units updated with Company Organization Survey
Multi-units discovered during the inter-censal surveys are added to the BR

Missing Records: Survey
Non-response in a survey is normally handled within the sample design
Follow-up (up to a limit) to obtain interview/data
Assessment of non-response within sample strata
Adjustment of design weights to reflect non-responses
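
The adjustment of design weights mentioned on this slide can be sketched as follows. This is a minimal illustration, assuming a weighting-class adjustment in which respondents' design weights within each stratum are inflated so that they sum to the stratum's full-sample total. The function name `adjust_weights` and the record layout are hypothetical and do not come from the lecture.

```python
from collections import defaultdict

def adjust_weights(sample):
    """Weighting-class non-response adjustment (illustrative sketch).

    sample: list of dicts with keys 'stratum', 'weight', 'responded'.
    Returns a list of adjusted weights (0.0 for non-respondents).
    """
    total = defaultdict(float)       # sum of design weights, all sampled units
    responding = defaultdict(float)  # sum of design weights, respondents only
    for unit in sample:
        total[unit["stratum"]] += unit["weight"]
        if unit["responded"]:
            responding[unit["stratum"]] += unit["weight"]

    adjusted = []
    for unit in sample:
        if unit["responded"]:
            # Inflate the respondent's weight by the inverse weighted response rate.
            factor = total[unit["stratum"]] / responding[unit["stratum"]]
            adjusted.append(unit["weight"] * factor)
        else:
            adjusted.append(0.0)
    return adjusted

# Made-up sample: stratum A has one non-respondent, stratum B has none.
sample = [
    {"stratum": "A", "weight": 100.0, "responded": True},
    {"stratum": "A", "weight": 100.0, "responded": False},
    {"stratum": "B", "weight": 50.0, "responded": True},
    {"stratum": "B", "weight": 50.0, "responded": True},
]
weights = adjust_weights(sample)
```

The adjusted weights preserve each stratum's weighted total, which is only a valid correction if non-response is unrelated to the survey variables within a stratum.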

Missing Items
Imputation based on the other data in the interview/case (relational imputation)
Imputation based on related information on the same respondent (longitudinal imputation)
Imputation based on statistical modeling
  – Hot deck
  – Cold deck
  – Multiple imputation

Census 2000 PUMS Missing Data
Pre-edit: Applied when the original entry was rejected because it fell outside the range of acceptable values.
Consistency: Imputed missing characteristics based on other information recorded for the person or housing unit.
Hot Deck: Supplied the missing information from the record of another person or housing unit.
Cold Deck: Supplied missing information from a predetermined distribution.

CPS Missing Data
Relational imputation: use other information in the record to infer value
Longitudinal edits: use values from the previous month if present in sample
Hot deck: use values from actual respondents whose data are complete for the relatively few conditioning variables

County Business Patterns
The County and Zipcode Business Patterns data are published from the Employer Business Register
This is important because variables used in these publications are edited to publication standards
The primary imputation method is a longitudinal edit

Economic Censuses
Like demographic products, there are usually both edited and unedited versions of the publication variables in these files
Publication variables (e.g., payroll, employment, sales, geography, ownership) have been edited
Most recent files include allocation flags to indicate that a publication variable has been edited or imputed
Many historical files include variables that have been edited or imputed but do not include the flags

QWI Missing Data Procedures
Individual data
  – Multiple imputation
Employer data
  – Relational edit
  – Bi-directional longitudinal edit
  – Single-value imputation
Job data
  – Use multiple imputation of individual data
  – Multiple imputation of place of work
Use data for each place of work

BLS National Longitudinal Surveys
Non-responses to the first wave never enter the data
Non-responses to subsequent waves are coded as “interview missing”
Respondents are not dropped for missing an interview.
Special procedures are used to fill critical items from missed interviews when the respondent is interviewed again
Item non-response is coded as such

Federal Reserve Survey of Consumer Finances (SCF)
General information on the Survey of Consumer Finances:
Missing data and confidentiality protection are handled with the same multiple imputation procedure

SCF Details
Survey collects detailed wealth information from an over-sample of wealthy households
Item refusals and item non-response are rampant (see Kennickell article)
When there is item refusal, interview instrument attempts to get an interval
The reported interval is used in the missing data imputation
When the response is deemed sensitive for confidentiality protection, the response is treated as an item missing (using the same interval model as above)
First major survey released with multiple imputation.

Relational Imputation
Uses information from the same respondent
Example: respondent provided age but not birth date. Use age to impute birth date.
Example: some members of household have missing race/ethnicity data. Use other members of same household to impute race/ethnicity
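
The first example on this slide can be sketched in a few lines. This is a hedged illustration only: it fills a birth year rather than a full birth date, the field names and the `impute_birth_year` helper are hypothetical, and the allocation flag is added so that the edited value stays distinguishable from a reported one.

```python
def impute_birth_year(record, reference_year):
    """Fill a missing birth year from reported age (relational imputation sketch).

    Age in completed years pins the birth year down only to within one year;
    this sketch takes the later of the two candidates.
    """
    if record.get("birth_year") is None and record.get("age") is not None:
        filled = dict(record)  # leave the input record untouched
        filled["birth_year"] = reference_year - record["age"]
        filled["birth_year_allocated"] = True
        return filled
    return dict(record, birth_year_allocated=False)

# Made-up respondent interviewed in 2011 who reported age only.
person = {"age": 34, "birth_year": None}
result = impute_birth_year(person, 2011)
```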

Longitudinal Imputation
Look at the respondent’s history in the data to get the value
Example: respondent’s employment information missing this month. Impute employment information from previous month
Example: establishment industry code missing this quarter. Impute industry code from most recently reported code
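
The industry-code example can be sketched as a carry-forward edit over an establishment's quarterly history. This is a minimal sketch under stated assumptions: the series is ordered in time, `None` marks a missing quarter, and the codes and the `carry_forward` name are made up for illustration.

```python
def carry_forward(series):
    """Fill None entries with the most recently reported value.

    Returns (filled, flags), where flags[i] is True if position i was imputed.
    Leading gaps with no earlier report are left as None.
    """
    filled, flags = [], []
    last = None
    for value in series:
        if value is None and last is not None:
            filled.append(last)   # impute from the latest earlier report
            flags.append(True)
        else:
            filled.append(value)
            flags.append(False)
            if value is not None:
                last = value      # a reported value becomes the new source
    return filled, flags

# Hypothetical quarterly industry codes for one establishment.
codes = [None, "5411", None, None, "5412", None]
filled, flags = carry_forward(codes)
```

A bi-directional edit would make a second pass backward in time to fill the leading gap that this forward pass leaves untouched.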

Cross Walks and Other Imputations
In business data, converting an activity code (e.g., SIC) to a different activity code (e.g., NAICS) is a form of missing data
In general, the two activity codes are not assigned simultaneously for the same entity
Often these imputations are treated as 1-1 when they are, in fact, many-to-many

Probabilistic Methods for Cross Walks
Inputs:
  – original codes
  – new codes
  – information for computing Pr(new code | original code, other data)
Processing
  – Randomly assign a new code from the appropriate conditional distribution
See Lab 8
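
The processing step can be sketched as follows. This is an illustrative sketch, not the Lab 8 code: it assumes a set of dual-coded records from which Pr(new code | original code) is estimated by simple frequencies, and it ignores the "other data" term. The codes "A", "B", "x", "y", "z" and both function names are placeholders.

```python
import random

def estimate_conditional(pairs):
    """Estimate Pr(new code | original code) from dual-coded records."""
    counts = {}
    for old, new in pairs:
        counts.setdefault(old, {}).setdefault(new, 0)
        counts[old][new] += 1
    return {
        old: {new: n / sum(news.values()) for new, n in news.items()}
        for old, news in counts.items()
    }

def assign_new_code(old, cond, rng):
    """Draw one new code at random from the estimated Pr(new | old)."""
    codes = list(cond[old])
    probs = [cond[old][c] for c in codes]
    return rng.choices(codes, weights=probs, k=1)[0]

# Hypothetical dual-coded records: (original code, new code).
dual_coded = [("A", "x"), ("A", "x"), ("A", "y"), ("B", "z")]
cond = estimate_conditional(dual_coded)

rng = random.Random(0)  # seeded so the draws are reproducible
draws = [assign_new_code("A", cond, rng) for _ in range(1000)]
```

Drawing at random, rather than always assigning the most likely new code, keeps the many-to-many structure of the crosswalk in the imputed data.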

The Theory of Missing Data Models
General principles
Missing at random
Weighting procedures
Imputation procedures
Hot decks
Introduction to model-based procedures

General Principles
Most of today’s lecture is taken from Statistical Analysis with Missing Data, 2nd edition, Roderick J. A. Little and Donald B. Rubin (New York: John Wiley & Sons, 2002).
The basic insight is that missing data should be modeled using the same probability and statistical tools that are the basis of all data analysis.
Missing data are not an anomaly to be swept under the carpet.
They are an integral part of every analysis.

Missing Data Patterns
Univariate non-response
Multivariate non-response
Monotone
General
File matching
Latent factors, Bayesian parameters

Missing Data Mechanisms
The complete data are defined as the matrix Y (n × K).
The pattern of missing data is summarized by a matrix of indicator variables M (n × K).
The data generating mechanism is summarized by the joint distribution of Y and M.

Missing Completely at Random
In this case the missing data mechanism does not depend upon the data Y.
This case is called MCAR.

Missing at Random
Partition Y into observed and unobserved parts.
Missing at random means that the distribution of M depends only on the observed parts of Y.
Called MAR.
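
In symbols, the MCAR and MAR conditions are usually written as below. This follows the standard statement in Little and Rubin; the parameter φ of the missing-data mechanism and the subscripts obs and mis for the observed and unobserved parts of Y are notation assumed here, not defined in the slides above.

```latex
\begin{align*}
\text{MCAR:}\quad & p(M \mid Y, \phi) = p(M \mid \phi)
  && \text{for all } Y, \phi \\
\text{MAR:}\quad  & p(M \mid Y, \phi) = p(M \mid Y_{\mathrm{obs}}, \phi)
  && \text{for all } Y_{\mathrm{mis}}, \phi
\end{align*}
```

MCAR is the special case of MAR in which the mechanism does not depend on the observed part of Y either.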

Not Missing at Random
If the condition for MAR fails, then we say that the data are not missing at random, NMAR.
Censoring and more elaborate behavioral models often fall into this category.

The Rubin and Little Taxonomy
Analysis of the complete records only
Weighting procedures
Imputation-based procedures
Model-based procedures

Analysis of Complete Records Only
Assumes that the data are MCAR.
Only appropriate for small amounts of missing data.
Used to be common in economics, less so in sociology.
Now very rare.
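
A small numerical illustration shows why the MCAR assumption matters for complete-record analysis. The data below are made up: missingness in y depends on an observed group variable, so the data are MAR but not MCAR, and dropping incomplete records shifts the mean.

```python
from statistics import mean

# Hypothetical records: (group, y); y is None when the item is missing.
# The "high" group has larger y and is missing more often.
data = (
    [("low", 10.0)] * 8 + [("low", None)] * 2 +
    [("high", 20.0)] * 4 + [("high", None)] * 6
)

# Complete-record estimate: drop every record with a missing y.
complete_case_mean = mean(y for _, y in data if y is not None)

# Benchmark if every y had been observed, assuming each missing y equals
# its group's reported value (10 for "low", 20 for "high").
true_mean = (10 * 10.0 + 10 * 20.0) / 20
```

Here the complete-record mean is about 13.3 against a benchmark of 15, because the group that is missing more often is under-represented among the complete records.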

Weighting Procedures
Modify the design weights to correct for missing records.
Provide an item weight (e.g., earnings and income weights in the CPS) that corrects for missing data on that variable.
See Bollinger and Hirsch discussion later in lecture.
See complete case and weighted complete case discussion in Rubin and Little.

Imputation-based Procedures
Missing values are filled in and the resulting “completed” data are analyzed
  – Hot deck
  – Mean imputation
  – Regression imputation
Some imputation procedures (e.g., Rubin’s multiple imputation) are really model-based procedures.

Imputation Based on Statistical Modeling
Hot deck: use the data from related cases in the same survey to impute missing items (usually as a group)
Cold deck: use a fixed probability model to impute the missing items
Multiple imputation: use the posterior predictive distribution of the missing item, given all the other items, to impute the missing data

Current Population Survey
Census Bureau Imputation Procedures:
  – Relational Imputation
  – Longitudinal Edit
  – Hot Deck Allocation Procedure

“Hot Deck” Allocation
Labor Force Status
  – Employed
  – Unemployed
  – Not in the Labor Force
(Thanks to Warren Brown)

“Hot Deck” Allocation
                 Black       Non-Black
Male 16 –
                             ID #0062
Female

“Hot Deck” Allocation
                 Black       Non-Black
Male 16 – 24     ID #3502    ID #
                 ID #8177    ID #0062
Female 16-24     ID #9923    ID #
                 ID #4396    ID #2271
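
The cell-based allocation shown on these slides can be sketched as a sequential hot deck. This is a minimal sketch of the general technique, not the Census Bureau's production procedure: records are processed in file order, each cell of the conditioning variables holds the value from the most recent respondent, and a missing item takes the value currently stored in its cell. The record IDs, field names, and cold-deck starting values below are made up.

```python
def sequential_hot_deck(records, cell_keys, item, cold_deck):
    """Sequential hot deck (illustrative sketch).

    records:   list of dicts processed in file order.
    cell_keys: names of the conditioning variables that define a cell.
    item:      name of the item to impute (None means missing).
    cold_deck: starting value per cell, used only until the first
               respondent in that cell has been seen.
    Returns new records with the item filled and an 'allocated' flag.
    """
    deck = dict(cold_deck)  # current donor value for each cell
    out = []
    for rec in records:
        cell = tuple(rec[k] for k in cell_keys)
        new = dict(rec)
        if rec[item] is None:
            new[item] = deck[cell]   # take the value from the last donor
            new["allocated"] = True
        else:
            deck[cell] = rec[item]   # this respondent becomes the donor
            new["allocated"] = False
        out.append(new)
    return out

records = [
    {"id": "0001", "sex": "M", "age": "16-24", "race": "Black", "lfs": "Employed"},
    {"id": "0002", "sex": "M", "age": "16-24", "race": "Black", "lfs": None},
    {"id": "0003", "sex": "F", "age": "16-24", "race": "Non-Black", "lfs": None},
    {"id": "0004", "sex": "F", "age": "16-24", "race": "Non-Black", "lfs": "Unemployed"},
]
cold_deck = {
    ("M", "16-24", "Black"): "Not in the Labor Force",
    ("F", "16-24", "Non-Black"): "Not in the Labor Force",
}
completed = sequential_hot_deck(records, ("sex", "age", "race"), "lfs", cold_deck)
```

Record 0002 receives the labor force status of record 0001, while record 0003 falls back to the cold-deck value because no donor in its cell has been seen yet.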

CPS Example
Effects of hot-deck imputation of labor force status.

Public Use Statistics

Allocated v. Unallocated

Bollinger and Hirsch CPS Missing Data
Studies the effect of the particular assumptions in the CPS hot deck imputer on wage regressions
Census Bureau uses too few variables in its hot deck model
Inclusion of additional variables improves the accuracy of the missing data models
See Bollinger and Hirsch

Model-based Procedures
A probability model based on p(Y, M) forms the basis for the analysis.
This probability model is used as the basis for estimation of parameters or effects of interest.
Some general-purpose model-based procedures are designed to be combined with likelihood functions that are not specified in advance.

Little and Rubin’s Principles
Imputations should be
  – Conditioned on observed variables
  – Multivariate
  – Draws from a predictive distribution
Single imputation methods do not provide a means to correct standard errors for estimation error.

Rest of lecture (PDF format)

Applications to Complicated Data
Computational formulas for MI data
Examples of building multiply-imputed data files

Computational Formulas
Assume that you want to estimate something as a function of the data Q(Y)
Formulas account for missing data contribution to variance
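
The slide's formulas are not reproduced in the transcript. The sketch below implements the standard combining rules for multiply-imputed data as given by Rubin: the point estimate is the average across the m completed data sets, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `combine` and the input numbers are illustrative assumptions.

```python
from statistics import mean, variance

def combine(estimates, variances):
    """Rubin's combining rules for m completed-data analyses.

    estimates: point estimates of Q, one per completed data set.
    variances: the corresponding complete-data variance estimates.
    Returns (q_bar, total_variance, between_variance, within_variance).
    """
    m = len(estimates)
    if m < 2:
        raise ValueError("need at least two completed data sets")
    q_bar = mean(estimates)                 # MI point estimate
    within = mean(variances)                # average within-imputation variance
    between = variance(estimates)           # variance across imputations, divisor m - 1
    total = within + (1 + 1 / m) * between  # total variance of q_bar
    return q_bar, total, between, within

# Made-up results from m = 3 completed data sets.
q_bar, total, between, within = combine([10.0, 12.0, 11.0], [4.0, 5.0, 3.0])
```

The between-imputation term is the contribution of missing data to the variance; a single imputation gives no way to estimate it.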

Examples
Survey of Consumer Finances
Quarterly Workforce Indicators

Survey of Consumer Finances
Codebook description of missing data procedures

How are the QWIs Built?
Raw input files:
  – UI wage records
  – QCEW/ES-202 EQUI report
  – Census Numident/Personal Characteristics File
  – Census Place of Residence
  – LEHD geo-coding system
Processed data files:
  – Individual characteristics
  – Employer characteristics
  – Employment history with earnings

Flow Chart

Processing the Input Files
Each quarter the complete history of every individual, every establishment, and every job is processed through the production system
Missing data on the individual and employment history records are multiply imputed
Missing data on the employer characteristics are singly imputed (explanation to follow)

Garden Variety Problems
Missing demographic data on the individual file (birth date, sex, education, place of residence)
  – Multiple imputations using information from the individual, establishment, and employment history files
  – Model estimation component updated every quarter
This process was used to create the current snapshot (S2004/S2008) but not the current public-use data (updated individual data imputation procedure uses much more information)

The Mother of all Missing Data Problems
The employment history records only code employer to the UI account level
Establishment characteristics (industry, geo-codes) are missing for multi-unit establishments
The establishment (within UI account) is multiply imputed using a dynamic multi-stage probability model
Estimation of the posterior predictive distribution depends on the existence of a state with establishments coded on the UI wage record (MN)

Can It Be Done?
Every quarter the QWI processes over 6 billion employment histories (unique person-employer pairs) covering 1990 to 2010
Approximately 30% of these histories require multiple employer imputations
So, the system does more than 25 billion full information imputations every quarter
The information used for the imputations is current: it includes all of the historical information for the person and every establishment associated with that person’s UI account

Does It Work?
Full assessment using the state that codes both
Summary slide follows
