Published by Angeline Babbs. Modified over 2 years ago.

1
Research on Improvements to Current SIPP Imputation Methods
ASA-SRM SIPP Working Group
September 16, 2008
Martha Stinson

2
Census Imputation Research Plan
Few changes have been made to the actual production imputation methods in many years
The redesign of the SIPP is an opportunity to consider what changes might be made
New committee formed with members from the content, data processing, sampling, and statistical methodology divisions
Incremental approach: test new methods and consider a short list of variables that might be substantially improved

3
Proposed Improvements
1. Model-based approach
2. Use administrative data to mitigate problems caused when survey data are not “missing at random”
3. Multiple imputation

4
Model-based Approach
Hot-deck depends on a donor matrix with reasonable cell sizes
Small cells must sometimes be collapsed
Collapsing cells creates a more heterogeneous group of donors
Hot-deck can’t take account of variables that are dropped in order to combine cells
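
The collapsing problem described on this slide can be illustrated with a minimal hot-deck sketch. It is not the production SIPP procedure: the function name `hot_deck`, the `min_donors` threshold, the field names, and the toy records are all assumptions made for illustration. When a cell has too few donors, the last cell variable is dropped and the donor search is repeated on the coarser cell.

```python
import random

def hot_deck(records, target, cell_vars, min_donors=3, rng=None):
    """Fill missing `target` values (None) from a random donor in the same cell.

    Hypothetical sketch: cells are defined by `cell_vars`; a cell with fewer
    than `min_donors` donors is collapsed by dropping its last variable.
    """
    rng = rng or random.Random(0)
    out = [dict(r) for r in records]          # leave the input untouched
    for rec in out:
        if rec[target] is not None:
            continue
        keys = list(cell_vars)
        while True:
            donors = [r[target] for r in records
                      if r[target] is not None
                      and all(r[k] == rec[k] for k in keys)]
            if len(donors) >= min_donors or not keys:
                break
            keys.pop()                        # collapse: ignore the last variable
        if donors:
            rec[target] = rng.choice(donors)
            rec["collapsed_to"] = tuple(keys) # which variables the donor matched on
    return out

# Toy data (invented): only one disabled woman reports earnings.
records = [
    {"sex": "F", "disabled": False, "earn": 100},
    {"sex": "F", "disabled": False, "earn": 110},
    {"sex": "F", "disabled": True,  "earn": 60},
    {"sex": "F", "disabled": True,  "earn": None},
    {"sex": "M", "disabled": False, "earn": 200},
    {"sex": "M", "disabled": False, "earn": 210},
    {"sex": "M", "disabled": False, "earn": 220},
]
filled = hot_deck(records, "earn", ["sex", "disabled"])
```

In this toy case the disabled woman with missing earnings ends up matched on sex alone, so her donor may be a non-disabled woman. That is the loss of information the slide points to.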

5
Model-based Approach: Research
Consider an imputation method that uses a linear regression to impute missing values
Stratify sample by a set of characteristics and run regressions for each sub-group that is large enough
Sub-groups that are too small are combined
Variables that are dropped from the stratification list are added as explanatory variables in the regression

6
Example
Earnings imputation
– Stratify by age, gender, race, education, industry, and disability
– Including disability may cause some small cells
– Perhaps combine sub-groups of disabled and not-disabled white women in their fifties
– For this sub-group, include disability status as an explanatory variable in the regression of earnings on SIPP characteristics
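
A rough sketch of this example follows, under stated assumptions. The combined sub-group is represented by a small invented data set, the regressors are limited to age and disability status, and `ols` and `impute_earnings` are hypothetical helper names. The sketch imputes the fitted conditional mean only; a production method would typically also add a residual draw so that imputed values are not artificially smooth.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    for c in range(k):                       # Gauss-Jordan with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        piv = A[c][c]
        A[c] = [v / piv for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

def impute_earnings(people):
    """Fit earn ~ 1 + age + disabled on reporters, then fill missing earnings.

    `disabled` was dropped from the stratification to combine two small cells,
    so it enters the regression as an explanatory variable instead.
    """
    obs = [p for p in people if p["earn"] is not None]
    X = [[1.0, p["age"], p["disabled"]] for p in obs]
    b = ols(X, [p["earn"] for p in obs])
    filled = []
    for p in people:
        q = dict(p)
        if q["earn"] is None:
            q["earn"] = b[0] + b[1] * q["age"] + b[2] * q["disabled"]
            q["imputed"] = True
        filled.append(q)
    return b, filled

# Invented combined sub-group (women in their fifties, both disability statuses).
subgroup = [
    {"age": 50, "disabled": 0, "earn": 1500.0},
    {"age": 52, "disabled": 1, "earn": 1220.0},
    {"age": 55, "disabled": 0, "earn": 1550.0},
    {"age": 57, "disabled": 1, "earn": 1270.0},
    {"age": 59, "disabled": 0, "earn": 1590.0},
    {"age": 53, "disabled": 1, "earn": 1230.0},
    {"age": 54, "disabled": 1, "earn": None},
]
coef, filled = impute_earnings(subgroup)
```

Unlike the collapsed hot-deck cell, the imputed value here still reflects the respondent's disability status through its coefficient.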

7
Data Not “Missing At Random”
All imputation methods that use survey data exclusively are built on the assumption that the relationships between survey variables are the same for everyone, regardless of missing data
Assume relationship between X1, X2, X3 and Y can be estimated
Assume if Y is missing, X1, X2, and X3 are good predictors
However, if the relationship between Y and X1, X2, X3 is different when Y is missing, the imputation will be flawed

8
Data Not “Missing At Random”: Research
We can evaluate the magnitude of this problem and mitigate the impact on imputation using administrative data
Information from an outside source can help account for unobservable (in the survey) differences between people

9
Example: 2004 SIPP panel
2004 annual earnings at two main jobs
– Earnings at each job are imputed on a monthly basis
– Sum across jobs and then across months to get annual earnings
– Create count of number of imputed months in the year (range from 0-12)
– If either job has imputed earnings, count the full month as imputed

10
Example: 2004 SIPP panel (cont.)
Split SIPP respondents into groups
1. No months of imputed or missing data
2. 1-4 months of imputed data (no missing)
3. 5-8 months of imputed data (no missing)
4. 9-12 months of imputed data (no missing)
Match earnings report from W-2 records summed for all employers
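
The annual total, the imputed-month count, and the four groups can be sketched as follows. The record layout (`job1`, `job2`, `job1_imp`, `job2_imp`) and the function names are hypothetical, and the sketch assumes all twelve months are present, so the "no missing" condition is not checked.

```python
def annual_earnings(months):
    """Return (annual earnings, number of imputed months) for one person.

    `months` is a list of 12 dicts holding monthly earnings at two jobs and
    a flag per job saying whether that month's amount was imputed.
    """
    total = sum(m["job1"] + m["job2"] for m in months)
    # A month counts as imputed if either job's earnings were imputed.
    n_imputed = sum(1 for m in months if m["job1_imp"] or m["job2_imp"])
    return total, n_imputed

def imputation_group(n_imputed):
    """Map the imputed-month count (0-12) to groups 1-4."""
    if n_imputed == 0:
        return 1
    if n_imputed <= 4:
        return 2
    if n_imputed <= 8:
        return 3
    return 4

# Invented person: job 1 imputed in months 4 and 7, job 2 in months 2-4.
months = [
    {"job1": 1000, "job2": 200,
     "job1_imp": i in (4, 7), "job2_imp": i in (2, 3, 4)}
    for i in range(12)
]
total, n_imputed = annual_earnings(months)
group = imputation_group(n_imputed)
```

Month 4 has both jobs imputed but is counted once, so this person has four imputed months and falls in group 2.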

11
Example: 2004 SIPP panel (cont.)
If earnings are missing at random, the relationship between admin. earnings and other SIPP variables should be the same for all four groups
Test
– regress admin. earnings on SIPP demographic variables separately for each group
– predict earnings for each group using each set of coefficients (four predicted values per group)
– compare each prediction to actual admin. earnings
– if coefficients are good predictors, difference should be zero on average

12
Example: Results
          Coeff1             Coeff2             Coeff3             Coeff4
Group1    Actual1 - pred1    Actual1 - pred2    Actual1 - pred3    Actual1 - pred4
Group2    Actual2 - pred1    Actual2 - pred2    Actual2 - pred3    Actual2 - pred4
Group3    Actual3 - pred1    Actual3 - pred2    Actual3 - pred3    Actual3 - pred4
Group4    Actual4 - pred1    Actual4 - pred2    Actual4 - pred3    Actual4 - pred4
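
The layout of this table can be reproduced with a small sketch. To keep it short it assumes a single explanatory variable per person instead of the full set of SIPP demographic variables, and it uses two invented groups rather than four; `fit` and `cross_prediction_table` are hypothetical names. Each cell holds the mean of actual minus predicted earnings, with rows indexed by the group being predicted and columns by the group whose coefficients are used.

```python
from statistics import mean

def fit(xs, ys):
    """Simple least-squares line; returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def cross_prediction_table(groups):
    """groups: name -> (xs, admin_earnings).

    Returns table[g][h] = mean over group g of (actual - prediction made
    with the coefficients estimated on group h).
    """
    coefs = {g: fit(xs, ys) for g, (xs, ys) in groups.items()}
    return {
        g: {h: mean(y - (a + b * x) for x, y in zip(xs, ys))
            for h, (a, b) in coefs.items()}
        for g, (xs, ys) in groups.items()
    }

# Invented data: group2 earns one unit less than group1 at every x.
groups = {
    "group1": ([1.0, 2.0, 3.0, 4.0], [12.0, 14.0, 16.0, 18.0]),
    "group2": ([1.0, 2.0, 3.0, 4.0], [11.0, 13.0, 15.0, 17.0]),
}
table = cross_prediction_table(groups)
```

The diagonal is zero up to rounding by construction, because a least-squares fit with an intercept has mean-zero residuals on its own sample. A negative off-diagonal entry in the first column means the coefficients from the fully reported group over-predict that group's administrative earnings.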

13
Example: Results
          Coeff1      Obs
Group1    2.22E-14    26,814    No imputes
Group2    -.21        5,134     1-4 months     Might impute too high
Group3    -.01        1,450     5-8 months
Group4    .22         1,409     9-12 months    Might impute too low

14
Multiple Imputation
Since the 1970s, Donald Rubin has argued that imputation adds variability to user-calculated statistics
Traditional methods impute only once
User has no way to account for variability
Multiple imputation allows the user to calculate variance that includes a piece due to imputation

15
Multiple Imputation: Example
How might variance estimates change when switching from single to multiple imputation?
Consider random variable X with mean of .5
Generate 1000 random samples by taking draws for 80 people
20 people have missing data for X

16
Multiple Imputation: Example (cont.)
Impute missing data using 2 methods:
– single implicate/hot deck: every observed value has equal prob. of being donor
– multiple imputation/Bayesian Bootstrap: prob. of being donor changes across implicates but centered around 1/n; create 32 implicates
Calculate mean and 95% confidence interval for all 1000 random samples

17
Multiple Imputation: Example (cont.)
– Case of 1 implicate
  95% confidence interval contains the true value 88% of the time
– Case of multiple implicates
  Calculate variance of mean using Rubin formula
  95% confidence interval contains the true value 96.5% of the time
– What does this mean?
  True hypotheses will be rejected too often using single imputation methods because variance estimates are too small
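
The experiment can be approximated with the sketch below. Several details are assumptions, because the slides do not state them: X is drawn from Uniform(0, 1), the 20 missing values are among the 80 people and are missing completely at random, the Bayesian bootstrap draws donor probabilities from a Dirichlet(1, ..., 1) distribution, and a normal critical value of 1.96 is used in place of Rubin's t reference distribution. The variance of the multiply imputed mean is combined as T = W + (1 + 1/m) B, where W is the average within-implicate variance and B is the variance of the implicate means. Exact coverage rates depend on the seed and on these assumptions, so the sketch is not expected to reproduce 88% and 96.5% exactly.

```python
import math
import random

def mean_var(xs):
    """Sample mean and the estimated variance of that mean (s^2 / n)."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m, s2 / n

def simulate(reps=1000, n=80, n_missing=20, m_implicates=32, seed=1):
    """Return (coverage with single hot deck, coverage with multiple imputation)."""
    rng = random.Random(seed)
    true_mean, z = 0.5, 1.96
    hits_single = hits_multi = 0
    for _ in range(reps):
        obs = [rng.random() for _ in range(n - n_missing)]  # Uniform(0,1): assumed

        # Single hot deck: every observed value is an equally likely donor,
        # and the completed data are then treated as if fully observed.
        q, u = mean_var(obs + rng.choices(obs, k=n_missing))
        hits_single += abs(q - true_mean) <= z * math.sqrt(u)

        # Bayesian bootstrap: donor probabilities vary across implicates
        # (normalized exponential draws are Dirichlet(1, ..., 1) weights).
        qs, us = [], []
        for _ in range(m_implicates):
            w = [rng.expovariate(1.0) for _ in obs]
            q, u = mean_var(obs + rng.choices(obs, weights=w, k=n_missing))
            qs.append(q)
            us.append(u)
        q_bar = sum(qs) / m_implicates
        within = sum(us) / m_implicates
        between = sum((q - q_bar) ** 2 for q in qs) / (m_implicates - 1)
        total = within + (1 + 1 / m_implicates) * between   # Rubin's combining rule
        hits_multi += abs(q_bar - true_mean) <= z * math.sqrt(total)
    return hits_single / reps, hits_multi / reps

cover_single, cover_multi = simulate()
```

The single-implicate interval ignores the extra variance introduced by drawing donors and by having only 60 real observations, which is why its coverage falls short of the nominal 95%.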

18
Examples of Census Research on Imputation Methods
Generalized Additive Model (GAM)
Predictive Mean Matching
Bayesian Bootstrap
Sequential Regression Multiple Imputation (SRMI)

19
Questions for Panel Discussion
1. General thoughts and suggestions on model-based imputation?
2. Suggest specific models?
3. Which variables should we prioritize?
4. Would SIPP user community be willing/able to handle multiple implicates?
