Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Multiple Imputation for households surveys A comparison of methods Stata Users Group Meeting.

Similar presentations


Presentation on theme: "1 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Multiple Imputation for households surveys A comparison of methods Stata Users Group Meeting."— Presentation transcript:

1 1 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Multiple Imputation for households surveys A comparison of methods Stata Users Group Meeting Rodrigo Alfaro – Marcelo Fuenzalida

2 2 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Outline  Multiple Imputation (MI)  Methods for MI  Chilean Household Surveys  Application of MI to EFH  Comments and Conclusions  Appendix

3 3 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Multiple Imputation  In empirical applications researchers must work with incomplete data sets.  A “solution” to the problem above is known as Multiple Imputation (MI) procedure.  MI relies on the assumption Missing At Random:  “…the probability of missing data on Y is unrelated to the value of Y, after controlling for other variables in the analysis.” (Allison, 2002)  In our empirical application we assume MAR.  Validity of assumption is beyond the scope of this analysis.

4 4 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Multiple Imputation  MI is based on the assumption that we have a good proxies of distributions of missing observations in the sample. Given this:  We could “fill the blanks” taking random realizations.  We create m-versions of the complete datasets in order to reflect randomness of the procedure.  So, we change our incomplete data set for a set of complete ones. We do not “solve” missing data problem we just measure it.  Analysis of multiple data sets could be combine using Rubin’s rules (Rubin, 1987).

5 5 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Methods for MI  We divide methods into 2 categories  Conditional: Hot-Deck and Univariate.  Multivariate: Normal and Chained Equations.  Hot-Deck (hotdeck.ado)  Replace missing observations with an observed one taken randomly within a specific group: males with college.  Informal conversations: std. dev.’s are still small.  Univariate (uvis.ado)  Regress variable with missing observations on exogenous variables with no missing.  Draw posterior of estimators (beta & sigma)  “Predict” missing values.

6 6 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Methods for MI  Chained Equation (ice.ado)  Based on Univariate method, but with the possibility of having missing values in exogenous variables.  Using reverse equation missing values of exogenous variables are replaced.  Loop over previous steps.  Normal  Assume Multivariate Normal. Estimate parameters using EM algorithm (or other initial value), and draw imputations using Data Augmentation procedure.  Theory relies on the convergence of EM.  No implemented in Stata. Schafer’s stand-alone package.

7 7 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Methods for MI  Schafer’s stand-alone package: Data.

8 8 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Methods for MI  Schafer’s stand-alone package: EM.

9 9 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Methods for MI  Schafer’s stand-alone package: DA.

10 10 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Methods for MI  Schafer’s stand-alone package: DA.

11 11 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Chilean Households Surveys  We have households surveys with a few number of waves: CASEN, and EPS.  CASEN was created to measure poverty.  EPS was created to evaluate pension system.  At the Central Bank we have been using these surveys to analyze financial fragility of households.  However, CASEN and EPS were not created for this purpose. We need new sample designs.  In 2007, we started a new survey designed for our purposes: EFH.

12 12 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Chilean Households Surveys  Our surveys have different levels of information  Personal information of each member of the household. For example: age, year of education, labor income, etc.  Aggregate information of the household. For example: value of assets (cars, house, financial instruments, etc.), debts (mortgage, consumer loans, educational loans, etc.)  Our variables of interest could be irrelevant for some households in the sample.  Many households have loans with retails-companies instead of borrowing the money directly from banks.  Few households have personal savings invested in financial instruments such as stocks, bonds, etc.

13 13 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Application of MI to EFH  Using conditional methods, we could attach the constraints to the imputation procedure.  We are able to impute labor income for each member of the household, considering only individual level vars.  At the household level, we could impute “banks loans” in a sub-sample of households that declared to have that kind of debt. We use as exogenous variables age, years of education, and gender of interviewee.  We impute “debt in retails-companies” with a different sub-sample but with the same exogenous variables.  But, we cannot impute “debt in retails-companies” with “banks loans” because sub-samples may be different.

14 14 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Application of MI to EFH  Multivariate methods imply groups of households.  Suppose that a household without a house, we could pretend that the value of its house is “zero”. However, that will be affect the correlation between value of the house and total amount of debts.  Our first round includes 3 groups defined by the credit access.  First group includes households without debts in financial institutions and without any kind of assets.  Second group includes households with real assets: cars, and primary house.  Third group adds households with debts in financial institutions.

15 15 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Missing Information Source: EFH 2007. Results for EFH  Low missing rate of information.  However, combining variables reduces sample size.  We will concentrate our analysis in the second group.  We use logit transformation to avoid unbounded results.

16 16 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Source: EFH 2007. Group 2: Conditional

17 17 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008  Households with debt in retails companies.  UVIS and HD could have lower variances than raw data.  We note that ICE and NORM are consistent in sd’s. Source: EFH. Group 2: Multivariate

18 18 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Comments and Conclusions  In our empirical application Hot-Deck as well as Univariate imputation have smaller variances than multivariate methods.  Under a multivariate imputation we are able to have a reasonable standard deviation that reflects the uncertainty of complete data sets.  Moving to multivariate methods we have ICE and NORM. Both have advantages and disadvantages.  ICE is implemented in Stata with many features available to accommodate several models.  In that respect is more general than NORM.

19 19 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Comments and Conclusions  NORM relies in EM algorithm in theoretical terms and DA for imputation method. For that we observed that we need convergence and “reasonable” positive definite matrix.  In practical terms we observed that a high rate of missing data is associated with non-converge of EM algorithm and/or some problems with DA.  In the case of ICE we observed that the algorithm does not have “convergence problem”. We were able to impute data with a high rate of incomplete information.

20 20 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Comments and Conclusions  We think that NORM provides a useful information about the stability of the model.  For that its implementation in Stata would be a good complement for ICE. Don’t you think?  We found 2 versions of NORM code: miss.sas by Paul Allison and norm.R by Alvaro Novo.  We translated miss SAS routine into Stata-ado programming. Allison used SAS package IML (Interactive Matrix Language). We observed that IML is similar to Mata.

21 21 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Comments and Conclusions  However, original code was not “optimized” as Schafer suggested in his book.  A month ago we move to R-code. We found that original code in Fortran was included in R routine.  So, norm.R allows to use Fortran code directly in R.  Speedy up for this meeting, we translated 800 lines of Fortran into Mata in a week.  However, our translation from R to Mata is not good enough… yet.

22 22 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Problem  Besides technical issues on MI we have an unsolved topic for which your opinion is crucial: Aggregation  We want to work at the household level, then aggregation of individual information must be done somehow.  In order to deal with missing observation at individual level we could apply “improper imputations”: (1) replacing with zeros, or (2) replacing with a predicted value.  Alternative, we could code “missing” to the household if any member has missing observation. Because we lost information we discarded it.  Any feasible mixture?  Is it possible to add variables in order to account for incomplete information at individual level?

23 23 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Multiple Imputation for households surveys A comparison of methods Stata Users Group Meeting Rodrigo Alfaro – Marcelo Fuenzalida

24 24 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Source: EFH 2007. All Households: Conditional

25 25 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Source: EFH 2007. All Households: Conditional


Download ppt "1 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Multiple Imputation for households surveys A comparison of methods Stata Users Group Meeting."

Similar presentations


Ads by Google