Presentation on theme: "WGCI - Seminar1 Sequential Regression – A Method for Multipe Imputations of Missing Data Seminar of the Working Group on Composite Indices 27 March 2002."— Presentation transcript:
WGCI - Seminar1 Sequential Regression – A Method for Multipe Imputations of Missing Data Seminar of the Working Group on Composite Indices 27 March 2002 Tanja Srebotnjak, United Nations Statistics Division, 2002
WGCI - Seminar2 Procedure: A multivariate technique for multiply imputing missing data Uses a sequence of regression models Developed by T. E. Raghunathan, J. M. Lepkowski, Peter Solenberger and John Van Hoewyk, [University of Michigan]
WGCI - Seminar3 Procedure (contd.): Assume a dataset of dimension (n p) with item non-response/missingness Partition the dataset into n 1 variables with no missing obs, say X=(X 1,X 2,…,X n1 ) and (n- n 1 ) with missing values Y=(Y 1,Y 2,…,Y n-n1 ) Y is ordered by degree of missingness, from least to most
WGCI - Seminar4 Procedure (contd.): Then, the conditional distribution of Y 1, i=1,2,…,n-n 1, given the observed values is modeled as a regression model of Y i on X, e.g. E(Y 1 |X)=X + e Missing values are imputed using this model Once, Y 1 is imputed, it is used as a predictor for Y 2, i.e. X=(X 1, X 2, …, X n, Y 1 )
WGCI - Seminar5 Procedure (contd.): The algorithm continues cycling through this series of regression models (using updated predictor sets until X=(X 1,X 2,…,X n1,Y 1,…,Y n-n1 ) Now, a new round begins, using the full dataset as predictor for Y 1 again, thus updating the regression coefficients The algorithm is repeated until convergence in the regression coefficients is achieved, i.e. change below a specified margin
WGCI - Seminar6 Procedure (contd.): Finally, the missing values for each Y i are imputed using the corresponding converged regression model In order to yield multiple imputations, the complete algorithm is repeated m times, resulting in m completed datasets The m datasets are analyzed and the results combined to yield final parameter estimates
WGCI - Seminar7 Flexibility: Basically all model types can be fitted Stepwise regressions possible to ensure that only most important predictors enter the model Introduction of randomization in the imputation process through prior distribution on regression coefficients, and/or perturbation of imputed values No assumption of specific distribution for joint distribution necessary
WGCI - Seminar8 Computational realization: SAS callable IVEware (Imputation, Variance Estimation software) developed by Raghunathan, Solenberger, Van Hoewyk Allows for the modeling of linear, logistic, Poisson and generalized logit distributions Prior distributions on coefficients Stepwise regressions with specified R 2 Inclusion of interaction terms in model Restrictions on imputed variables
WGCI - Seminar9 Example: The status of Least Developed Country is based on assessment of several requirements, including data containing 4 key indicators. Missing values in these 4 key indicators render the assessment more difficult. Hence, imputation of missing values would be desirable. Objective: Compare estimated values from LDC dataset with imputations using a srmi approach.
WGCI - Seminar10 Application to LDC dataset: LDC dataset consists of 4 variables: Child mortality rate per 1000, 1995-2000 [ChldMort] Calory intake as % of requirements, 1995/1997 [CalInt] 1 st and 2 nd level gross enrollment ratio, 1995 [EnrRatio] Adult literacy rate in %, 1995 [LitRate]
WGCI - Seminar14 Application to LDC dataset (contd.): Model specifications: All variables are continuous Bounds on imputed values derived from observed values (to guarantee sensible imputations) Min R 2 for inclusion of a variable in model 0.15 M=5 imputations Convergence diagnostics: Stabilization of coefficients and mean, variance of imputed variable
WGCI - Seminar15 Application to LDC dataset (contd.): Largest deviations observed between imputed and estimated values for ChldMort 1, although only 10 values were missing. Sqrt(deviations^2) ChldMort CalInt EnrRatio LitRate 155.89 48.84 82.14 96.57 This is reflected also in the standard deviation of ChldMort Note: the range of ChldMort is [0,500], thus larger deviations are possible compared to the other variables.
WGCI - Seminar16 Application to LDC dataset (contd.): Variable ChldMort Obs Imputed Combined No 118 10 128 Min 6.06 4.40 4.40 Max 262.59 187.249 262.59 Mean 88.20 110.118 89.91 StdDev 65.68 66.2078 65.73 Variable EnrRatio Obs Imputed Combined No 90 38 128 Min 19 13.72 13.72 Max 104 107.73 107.73 Mean 71.81 65.11 69.82 StdDev 22.89 23.47 23.18
WGCI - Seminar17 Application to LDC dataset (contd.): Variable LitRate Obs Imputed Combined No 92 36 128 Min 13.6 11.85 11.85 Max 98.2 97.93 98.2 Mean 69.47 56.92 65.94 StdDev 22.74 23.84 23.65 Variable CalInt Obs Imputed Combined No 119 9 128 Min 75 100.80 75 Max 164 162.8 164 Mean 115.48 124.79 116.13 StdDev 20.26 21.85 20.42
WGCI - Seminar18 Application to LDC dataset (contd.): Srmi provides a flexible method for multiply imputing missing data without requiring the assumption of a specific joint distribution of the data at hand. The srmi approach reflects intuitively the idea to use regression models for imputation purposes. However, it is more powerful than single regression through the full exploitation of the covariance structure.