Prediction and Imputation in ISEE - Tools for more efficient use of combined data sources
Li-Chun Zhang, Statistics Norway
Svein Nordbotton, University of Bergen
Use of a statistical register
- Combining administrative and survey data
  - Model-based prediction or weighting
  - Construction of statistical registers
- Uses of a statistical register
  - Prediction of (sub-)population totals
  - Multiple uses & general database quality => inferential concerns associated with imputation
- How to balance between the two types of inferential concerns?
A triple-goal criterion for statistical registers
A. Efficient prediction of population totals of interest
B. Correct co-variances among survey variables, as well as between survey and auxiliary variables
C. Non-stochastic & constant tabulation
A simultaneous prediction method
- NNI is the only feasible approach in terms of preserving co-variances among all the variables.
- To improve efficiency: introduce restrictions on the imputed totals, which may be obtained separately from imputation, say, through regression prediction.
- To be referred to as NNI with restrictions (NNI-WR).
A simultaneous prediction method:
- Values are generated outside of the sample
- Efficient for prediction of population totals
- Not the optimal (or best) prediction of each specific unit, but of the ensemble of units, now that attention is given to the co-variances among the variables.
About NNI-WR
- Separation of prediction of totals from general imputation concerns, allowing full freedom in search of efficient methods
- Solves variance estimation problem at the same time
- Genuine multivariate imputation with realistic imputed values
- Non-parametric nature and mild regularity condition suggest robustness, compared to standard regression based approaches
- NNI can be made non-stochastic, yielding constant tabulations on repetition
An algorithm and current research
- An algorithm
  - Jump-start phase: to speed up the imputation procedure if desirable
  - Fine-tune phase: relaxation to k-nearest neighbor imputation for better agreement with restrictions; consistency remains
  - Adjustment between the two phases
- Current research
  - How well does the algorithm perform in real statistical productions?
  - Effective way of setting up the restrictions, i.e. maximum control with minimum number of explicit restrictions for imputation?
  - Evaluation of micro-data quality
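The slide names the two phases but gives no details, so the following is only a hypothetical sketch of the idea, not the authors' algorithm. It assumes one covariate x, one survey variable y, a single restriction (a target total for the imputed values, e.g. from a separate regression prediction), and a greedy rule for the fine-tune phase. The names `knn_indices`, `nni_with_restriction`, `target_total` and `max_sweeps` are made up for illustration.

```python
def knn_indices(x_don, x, k):
    """Indices of the k donors nearest to x; ties go to the lower index."""
    order = sorted(range(len(x_don)), key=lambda j: (abs(x_don[j] - x), j))
    return order[:k]


def nni_with_restriction(x_don, y_don, x_rec, target_total, k=3, max_sweeps=10):
    """Hypothetical two-phase sketch: plain NNI, then greedy k-NN relaxation."""
    neighbours = [knn_indices(x_don, x, k) for x in x_rec]

    # Jump-start phase (assumed here to be plain nearest neighbor imputation).
    choice = [nb[0] for nb in neighbours]
    total = sum(y_don[j] for j in choice)
    start_total = total

    # Fine-tune phase: a recipient may switch to another of its k nearest
    # donors whenever that moves the imputed total closer to the restriction.
    for _ in range(max_sweeps):
        improved = False
        for i, nb in enumerate(neighbours):
            for j in nb:
                new_total = total - y_don[choice[i]] + y_don[j]
                if abs(new_total - target_total) < abs(total - target_total):
                    choice[i], total = j, new_total
                    improved = True
        if not improved:
            break
    return choice, start_total, total


# Toy data (invented): six donors, three recipients, target total 58.
x_don = [1, 2, 3, 4, 5, 6]
y_don = [10, 12, 15, 20, 22, 30]
x_rec = [1.4, 3.6, 5.2]
choice, start_total, final_total = nni_with_restriction(
    x_don, y_don, x_rec, target_total=58
)
imputed = [y_don[j] for j in choice]
```

Every imputed value is still an observed donor value, so the imputed data stay realistic; the price of meeting the restriction more closely is that some recipients receive a donor that is not their single nearest one.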
Background information: Some standard methods of prediction and imputation
Basic prediction approach
- Under the general linear model:
  - Target parameter T = linear combination of y-values in the population
  - Estimation of T = prediction of T outside of the selected sample
  - Prediction of individuals: a special case
- Main problems for a statistical register
  - Lack of natural variation in data, especially if many units have the same x-values
  - Infeasible simultaneously for a large number of variables; impractical as production mode; leading to inconsistency of cross-tabulation
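As a minimal sketch of the prediction approach, assume the simplest linear model y = a + b*x with one covariate, fitted by ordinary least squares, and take T to be the population total. The estimate of T is then the observed sample sum plus the predicted values outside the sample. The data and the names `fit_line` and `predict_total` are invented for illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x (single covariate)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def predict_total(x_sample, y_sample, x_nonsample):
    """Observed sum in the sample plus predicted values outside the sample."""
    a, b = fit_line(x_sample, y_sample)
    predicted = [a + b * xi for xi in x_nonsample]
    return sum(y_sample) + sum(predicted), predicted


# Toy data (invented): four sampled units, three units outside the sample.
x_s = [1.0, 2.0, 3.0, 4.0]
y_s = [2.0, 4.0, 6.0, 8.0]
x_r = [2.0, 2.0, 5.0]
total, preds = predict_total(x_s, y_s, x_r)
```

The first two non-sample units share the same x-value and therefore receive exactly the same predicted y-value, which is the lack of natural variation noted on the slide.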
Random regression imputation (RRI)
- To emulate the natural variation in data:
  - Add a random residual to the best predicted y-value
  - Hot-deck as a special case
- Main problems:
  - Extra variance of imputed estimator due to random imputation => never fully efficient
  - Random imputation not the only means for creating natural variation in data
  - Different tabulations on repetition => lack of acceptability and face-value in official statistics
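A minimal sketch of RRI, assuming a one-covariate linear model and assuming the random residual is drawn from the empirical residuals of the observed units (other residual distributions are possible). The names `fit_line` and `random_regression_impute` and the data are invented for illustration.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x (single covariate)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def random_regression_impute(x_obs, y_obs, x_mis, rng):
    """Best predicted y-value plus a randomly drawn observed residual."""
    a, b = fit_line(x_obs, y_obs)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x_obs, y_obs)]
    return [a + b * xi + rng.choice(residuals) for xi in x_mis]


# Toy data (invented).
x_obs = [1.0, 2.0, 3.0, 4.0]
y_obs = [2.1, 3.9, 6.2, 7.8]
x_mis = [2.0, 2.0, 3.5]

imputed = random_regression_impute(x_obs, y_obs, x_mis, random.Random(0))

# Repeating the imputation with different seeds gives different totals,
# which is the "different tabulations on repetition" problem.
totals = {
    round(sum(random_regression_impute(x_obs, y_obs, x_mis, random.Random(s))), 9)
    for s in range(50)
}
```

With an intercept-only model the prediction is the observed mean and the residual is an observed value minus that mean, so the imputed value is simply a randomly chosen observed y-value, i.e. hot-deck.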
Multiple imputation (MI)
- Independent random imputations + formulae for combining results
- Bayesian or frequentist approach
- Main problems:
  - Removes all the extra imputation variance only if infinite number of repetitions. Otherwise, still not fully efficient & non-constant tabulations
  - A common misunderstanding: only MI can yield acceptable measures of accuracy.
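The slide does not spell out the combining formulae; the sketch below assumes the standard ones known as Rubin's rules. With M imputed data sets, the point estimate is the average of the M estimates, and the total variance is the average within-imputation variance W plus (1 + 1/M) times the between-imputation variance B. The function name `combine_mi` and the numbers are invented for illustration.

```python
from statistics import mean, variance


def combine_mi(estimates, within_variances):
    """Combine M completed-data results using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)        # combined point estimate
    w = mean(within_variances)     # average within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor M - 1)
    total_var = w + (1 + 1 / m) * b
    return q_bar, total_var


# Toy input (invented): three imputations, each with variance estimate 1.0.
q_bar, total_var = combine_mi([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
```

The B/M part of the total variance is the price of using finitely many imputations; it vanishes only as M grows without bound, which is the slide's first objection.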
Predictive mean matching (PMM)
- Find the donor among the observed units that has the same predicted y-value & impute the donor's observed y-value
- Noticeable difference from RRI as the chance of multiple donors decreases; PMM is more efficient due to the removal of imputation variance.
- Essentially a marginal, variable-by-variable approach
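A minimal sketch of PMM, assuming a one-covariate linear model for the predicted values and taking the donor with the closest (rather than exactly equal) predicted value, with ties broken by the lower index so that the result is deterministic. The names `fit_line` and `pmm_impute` and the data are invented for illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x (single covariate)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def pmm_impute(x_obs, y_obs, x_mis):
    """Impute the observed y of the donor whose predicted value is closest."""
    a, b = fit_line(x_obs, y_obs)
    pred_obs = [a + b * xi for xi in x_obs]
    imputed = []
    for xi in x_mis:
        target = a + b * xi
        # min() returns the first minimiser, so ties go to the lower index.
        donor = min(range(len(x_obs)), key=lambda j: abs(pred_obs[j] - target))
        imputed.append(y_obs[donor])
    return imputed


# Toy data (invented).
x_obs = [1.0, 2.0, 3.0, 4.0]
y_obs = [2.1, 3.9, 6.2, 7.8]
imputed = pmm_impute(x_obs, y_obs, [2.1, 3.6])
```

Each imputed value is a real observed value, but the matching is done for one y-variable at a time, which is why the slide calls the approach marginal.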
Nearest neighbor imputation (NNI)
- Provided a set of covariates and a distance metric, the donor is the ‘nearest’ observed unit.
- A non-parametric generalization of PMM, with hot-deck as a special case.
- More flexible and practical for multivariate imputation than regression models.
- Chen and Shao (2000): consistent estimator of totals as well as finite population distributions, provided the absolute difference in conditional means of y is bounded by the ‘distance’ between two units. Linear models as special cases.
- Can be made non-stochastic by introducing extra seemingly uncorrelated covariates, such as Zip code.
- Main drawback: usually not efficient (i.e. local smoothing instead of global regression predictor)
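A minimal sketch of multivariate NNI, assuming Euclidean distance on the covariates and assuming that ties in distance are resolved by the closest zip code and then by donor index; this is one possible reading of how an extra covariate makes the procedure non-stochastic. The record layout (keys `x`, `zip`, `y`), the function name `nni_impute` and the data are invented for illustration.

```python
import math


def nni_impute(donors, recipients):
    """Copy all survey values jointly from the nearest observed unit."""
    imputed = []
    for r in recipients:
        best = min(
            range(len(donors)),
            key=lambda j: (
                math.dist(donors[j]["x"], r["x"]),   # covariate distance
                abs(donors[j]["zip"] - r["zip"]),    # deterministic tie-breaker
                j,                                   # final tie-breaker
            ),
        )
        # The whole y-vector comes from one donor, so the relationships
        # among the survey variables are those of a real unit.
        imputed.append(donors[best]["y"])
    return imputed


# Toy data (invented): the first two donors are equally distant from the recipient.
donors = [
    {"x": (0.0, 1.0), "zip": 5020, "y": (100, "A")},
    {"x": (2.0, 1.0), "zip": 5003, "y": (140, "B")},
    {"x": (5.0, 5.0), "zip": 5011, "y": (300, "C")},
]
recipients = [{"x": (1.0, 1.0), "zip": 5010}]
imputed = nni_impute(donors, recipients)
```

Running the function again on the same data always selects the same donors, so tabulations of the imputed data are constant on repetition.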
Artificial neural network (ANN)
- Class of functional imputation
- ANN as generalized regression functions (Bishop, 1995)
- No analytic predictor
- Unrealistic imputed values for categorical variables of interest
- Usually not fully efficient