MEASUREMENT OF THE QUALITY OF STATISTICS

1 MEASUREMENT OF THE QUALITY OF STATISTICS
Item Nonresponse
Orietta Luzi
Istat – Department for National Accounts and Economic Statistics

2 Item nonresponse
Item nonresponse is an error of non-observation.
It occurs when a respondent provides some, but not all, of the required information, or when the information provided cannot be used.
Common causes:
- Interview interruption
- Refusals
- Skipping a group of questions
- "Don't know" answers
Item nonresponse is also known as missing values.

3 Item nonresponse: Actions for preventing item nonresponse
- Careful questionnaire wording
- Guaranteeing statistical confidentiality
- Specific training for interviewers
- Accurate questionnaire instructions (online help for e-questionnaires)
- Adding a "Don't know" option to the question's response items
- …

4 Item nonresponse: Evaluation
Item nonresponse rates can be produced for critical variables (the same rates as those used for unit nonresponse).
Item nonresponse rate = (units not responding to the question of interest) / (units eligible for the question of interest)
This indicator may be difficult to compute for very complex questionnaires with many skip questions and alternative paths.
An alternative indicator is based on the number of missing values that have been filled in during the data processing phase (editing and imputation).
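The rate above can be sketched in a few lines. This is only an illustration: the records, the variable names (`employed`, `hours_worked`) and the routing rule in `is_eligible` are hypothetical, and a unit whose filter question is itself missing is treated here as not eligible, which is one possible convention among several.

```python
from typing import Optional

# Hypothetical mini-survey: None marks a missing answer.
# "hours_worked" is asked only when "employed" == "yes" (a skip question).
records = [
    {"employed": "yes", "hours_worked": 40},
    {"employed": "yes", "hours_worked": None},  # eligible, no answer
    {"employed": "no", "hours_worked": None},   # skipped by design, not eligible
    {"employed": None, "hours_worked": None},   # filter question itself missing
    {"employed": "yes", "hours_worked": 35},
]


def is_eligible(record: dict, item: str) -> bool:
    """Return True when the (assumed) questionnaire routing asks `item` of this unit."""
    if item == "hours_worked":
        return record["employed"] == "yes"
    return True  # "employed" is asked of everyone


def item_nonresponse_rate(records: list, item: str) -> Optional[float]:
    """Units not responding to `item` divided by units eligible for `item`."""
    eligible = [r for r in records if is_eligible(r, item)]
    if not eligible:
        return None  # the rate is undefined when nobody is eligible
    missing = sum(1 for r in eligible if r[item] is None)
    return missing / len(eligible)


rate_employed = item_nonresponse_rate(records, "employed")
rate_hours = item_nonresponse_rate(records, "hours_worked")
```

The denominator is where skip patterns matter: counting all five units for `hours_worked` would mix structural blanks with true nonresponse.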

5 Item nonresponse: Classification of nonresponse (Rubin, 1987)
- MCAR (Missing Completely At Random): the probability that a value is missing depends neither on the observed nor on the missing data
- MAR (Missing At Random): the probability that a value is missing depends only on the observed data
- MNAR (NMAR) (Missing Not At Random): the probability that a value is missing depends on the missing data themselves, possibly in addition to the observed data
This classification is fundamental when using adjustment methods for nonresponse.
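A small simulation can make the three mechanisms concrete. The sketch below is an assumption-laden illustration, not part of the original slides: the data model (`y = x + noise`), the missingness probabilities and the seed are all invented for the example. It compares the respondent mean of `y` under each mechanism with the full-sample mean.

```python
import random
from statistics import mean

rng = random.Random(0)
n = 20000
# x: auxiliary variable, always observed; y: target variable correlated with x
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]


def respondent_mean(x, y, mechanism):
    """Mean of y among respondents under an assumed missingness mechanism."""
    kept = []
    for xi, yi in zip(x, y):
        if mechanism == "MCAR":
            p_missing = 0.3                       # unrelated to any data
        elif mechanism == "MAR":
            p_missing = 0.6 if xi > 0 else 0.1    # depends on observed x only
        else:  # "MNAR"
            p_missing = 0.6 if yi > 0 else 0.1    # depends on y itself
        if rng.random() >= p_missing:
            kept.append(yi)
    return mean(kept)


full_mean = mean(y)
results = {m: respondent_mean(x, y, m) for m in ("MCAR", "MAR", "MNAR")}
```

Under MCAR the respondent mean stays close to the full-sample mean, while under MAR and MNAR it is pulled downwards. In the MAR case the bias can in principle be corrected using `x`; in the MNAR case the observed data alone do not suffice.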

6 Dealing with item nonresponse (1)
- Complete case analysis: only records without missing information are considered (low precision of estimates, additional bias if the mechanism is not MCAR)
- Available case analysis: for each variable, only units with observed data are analysed (bias in estimates of variance/covariance matrices)
- Re-weighting (different systems of weights for different items)
- All case analysis:
  - Modelling methods for incomplete data
  - Imputation
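The difference between the first two approaches can be sketched on a toy data set. The records and variable names below are hypothetical; `None` marks item nonresponse.

```python
from statistics import mean

# Hypothetical data set: None marks item nonresponse.
data = [
    {"income": 30.0, "age": 25.0},
    {"income": None, "age": 46.0},
    {"income": 50.0, "age": None},
    {"income": 80.0, "age": 55.0},
]
variables = ["income", "age"]

# Complete case analysis: keep only records with every variable observed.
complete_cases = [r for r in data if all(r[v] is not None for v in variables)]
cc_means = {v: mean(r[v] for r in complete_cases) for v in variables}

# Available case analysis: for each variable, use every unit that reported it.
ac_means = {v: mean(r[v] for r in data if r[v] is not None) for v in variables}
ac_counts = {v: sum(r[v] is not None for r in data) for v in variables}
```

Complete case analysis uses two records for both variables, whereas available case analysis uses three records per variable, but a different three each time. That changing base is what makes variance/covariance matrices built from available cases inconsistent.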

7 Dealing with item nonresponse (2)
Modelling: given a model f(y; θ) assumed for the data, ML estimates of θ are obtained using all the data (with incomplete data, the EM algorithm is generally used). The estimated model is then used to impute the missing data.
Advantages: explicit assumptions
Drawbacks: costly approach for complex data distributions
(Little and Rubin, 2002; Little, 1988; Dempster et al., 1977)
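As a hedged illustration of the EM idea, the sketch below assumes a bivariate normal model for (x, y) in which x is fully observed and y has gaps. The model, the simulated data and the function name `em_bivariate_normal` are assumptions made for this example, not something the slides specify. For this particular pattern the ML estimate of the mean of y has a known closed form (the regression estimator), which is used as a check.

```python
import random


def em_bivariate_normal(x, y, tol=1e-12, max_iter=10_000):
    """EM estimates of (mu_y, s_xy, s_yy); x is complete, y uses None for gaps."""
    n = len(x)
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n  # ML value, fixed since x is complete
    obs = [yi for yi in y if yi is not None]
    mu_y = sum(obs) / len(obs)                    # starting values from observed y
    s_yy = sum((yi - mu_y) ** 2 for yi in obs) / len(obs)
    s_xy = 0.0
    for _ in range(max_iter):
        slope = s_xy / s_xx
        resid_var = s_yy - s_xy ** 2 / s_xx
        # E-step: expected sufficient statistics given the current parameters
        t_y = t_yy = t_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                ey = mu_y + slope * (xi - mu_x)
                eyy = ey ** 2 + resid_var
            else:
                ey, eyy = yi, yi ** 2
            t_y += ey
            t_yy += eyy
            t_xy += xi * ey
        # M-step: update the parameters from the expected statistics
        new_mu_y = t_y / n
        new_s_xy = t_xy / n - mu_x * new_mu_y
        new_s_yy = t_yy / n - new_mu_y ** 2
        change = abs(new_mu_y - mu_y) + abs(new_s_xy - s_xy) + abs(new_s_yy - s_yy)
        mu_y, s_xy, s_yy = new_mu_y, new_s_xy, new_s_yy
        if change < tol:
            break
    return mu_y, s_xy, s_yy


# Simulated data (assumption): y is missing more often when x is large (MAR).
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(500)]
y_full = [2.0 + 1.5 * xi + rng.gauss(0, 1) for xi in x]
y = [None if (xi > 0.5 and rng.random() < 0.7) else yi for xi, yi in zip(x, y_full)]

mu_em, s_xy_em, s_yy_em = em_bivariate_normal(x, y)

# Closed-form check: respondent mean of y adjusted by the respondent regression slope.
pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
x_obs = sum(p[0] for p in pairs) / len(pairs)
y_obs = sum(p[1] for p in pairs) / len(pairs)
b = sum((p[0] - x_obs) * (p[1] - y_obs) for p in pairs) / sum(
    (p[0] - x_obs) ** 2 for p in pairs
)
mu_closed = y_obs + b * (sum(x) / len(x) - x_obs)
```

Once the parameters are estimated, a missing y can be imputed from the fitted conditional mean `mu_em + (s_xy_em / s_xx) * (x_i - mu_x)`. For richer models there is no closed form and each E-step becomes more expensive, which is the cost the slide mentions.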

8 Dealing with item nonresponse (3)
Imputation: missing data are replaced with suitably estimated values. Often the substituted values are chosen so that the resulting record does not fail the so-called consistency rules (edits) (Kalton and Kasprzyk, 1986 and 1982; Kovar et al., 1995).
Several imputation methods have been proposed in the literature.

10 Imputation methods
- Deductive imputation: used where only one correct value exists, as for a missing component of a balance sum. The value is determined from other values on the same questionnaire.
- Manual imputation: the values of data items deemed erroneous are changed by subject-matter experts, supported by programs specially developed for this purpose. It is usually reserved for a small number of large or critical units (in terms of potential impact on target estimates). Sometimes these units can be re-contacted to collect the missing information.
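Deductive imputation for a balance can be sketched as follows. The item names (`wages`, `other_costs`, `total`) and the helper `deductive_impute` are hypothetical; the only rule assumed is that the total equals the sum of the other items.

```python
def deductive_impute(items: dict, total_key: str = "total") -> dict:
    """Fill the single missing item of a balance where total = sum of the other items."""
    missing = [k for k, v in items.items() if v is None]
    if len(missing) != 1:
        return dict(items)  # no gap, or several gaps: no unique deductive solution
    filled = dict(items)
    gap = missing[0]
    # Sum of the reported components (neither the gap nor the total itself)
    others = sum(v for k, v in items.items() if k not in (gap, total_key))
    if gap == total_key:
        filled[gap] = others
    else:
        filled[gap] = items[total_key] - others
    return filled


fixed_component = deductive_impute({"wages": 60, "other_costs": None, "total": 100})
fixed_total = deductive_impute({"wages": 60, "other_costs": 40, "total": None})
unchanged = deductive_impute({"wages": None, "other_costs": None, "total": 100})
```

When two or more items of the balance are missing the rule no longer pins down a single value, so the record is returned unchanged and left to another method.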

11 Imputation methods Imputation based on statistical models
- Imputation based on explicit models: data are imputed following an explicit model assumed for the data (averages, medians, regressions)
- Imputation based on implicit models: more attention is paid to the algorithm; however, there is still a (possibly unknown) model underlying the data

12 Imputation based on explicit models
- Mean imputation: missing values are replaced by the mean of the observed values. It is conceptually analogous to re-weighting. It can lead to serious bias if respondents and nonrespondents behave significantly differently (in mean) with respect to the target variable being imputed.
- Mean imputation within classes: classes of homogeneous units (imputation classes) are defined before imputation. Missing values in a class are imputed with the class mean or the class mode. If the auxiliary variables used to form the classes are correlated with the variable to be imputed, the bias due to nonresponse and imputation is reduced.
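A minimal sketch of mean imputation within classes, assuming hypothetical records with a class variable `sector` and a target variable `turnover`. The function name and the `imputed` flag are illustrative choices, not part of the slides.

```python
from collections import defaultdict
from statistics import mean


def impute_class_mean(records, target, class_var):
    """Replace missing `target` values with the respondent mean of the unit's class."""
    by_class = defaultdict(list)
    for r in records:
        if r[target] is not None:
            by_class[r[class_var]].append(r[target])
    class_means = {c: mean(vals) for c, vals in by_class.items()}

    out = []
    for r in records:
        r2 = dict(r)
        r2["imputed"] = False
        # A class with no respondents has no mean: the value stays missing.
        if r2[target] is None and r2[class_var] in class_means:
            r2[target] = class_means[r2[class_var]]
            r2["imputed"] = True  # flag imputed values so they can be traced later
        out.append(r2)
    return out


records = [
    {"sector": "A", "turnover": 10.0},
    {"sector": "A", "turnover": 12.0},
    {"sector": "A", "turnover": None},
    {"sector": "B", "turnover": 100.0},
    {"sector": "B", "turnover": None},
    {"sector": "B", "turnover": 110.0},
]
imputed = impute_class_mean(records, "turnover", "sector")
```

With these numbers the overall respondent mean is 58, a value that fits neither sector, while the class means (11 and 105) stay close to each class's respondents.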

13 Imputation based on explicit models
Regression imputation: missing values of a given (response) variable are replaced by values predicted from a regression model fitted on the responding units. The variable with missing values is the dependent variable; predictors are chosen among the available auxiliary variables. Regression models are generally estimated within imputation cells.
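The following sketch shows deterministic regression imputation with a single predictor. It assumes a simple linear model fitted by ordinary least squares on the respondents; the data and function names are made up for illustration, and in practice the fit would be repeated within each imputation cell.

```python
def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((xi - mx) ** 2 for xi in xs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def regression_impute(x, y):
    """Fit on responding units, then predict y where it is missing (None)."""
    respondents = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b = fit_simple_ols([p[0] for p in respondents], [p[1] for p in respondents])
    return [yi if yi is not None else a + b * xi for xi, yi in zip(x, y)]


# Hypothetical data: x is an auxiliary variable observed for every unit.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, None, 9.0, None]
y_imputed = regression_impute(x, y)
```

Here the respondents lie exactly on the line y = 1 + 2x, so the two gaps are filled with the fitted values at x = 3 and x = 5.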

14 Imputation based on implicit models
- Hot-deck imputation: missing values are replaced by a value provided by another respondent (the donor)
  - Random donor: the donor is randomly selected (within imputation cells)
  - Nearest-neighbour donor: the donor is the most similar unit with respect to a distance function computed on appropriate auxiliary variables (within imputation cells) (Chen and Shao, 2000; Chen, Rao and Sitter, 2000)
- Cold-deck imputation: missing values are replaced by a value provided by a unit observed in another survey, or by the same unit in a previous repetition of the survey
- Combined methods: different methods are combined. For example, in predictive mean matching, regression is performed at the first stage and hot-deck at the second stage (Rubin, 1987)
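Both donor-selection rules can be sketched with one helper. Everything named here is hypothetical (`hot_deck`, the `region` cell variable, the `employees` auxiliary variable), and the distance is the absolute difference on a single auxiliary variable, which is only one possible choice.

```python
import random


def hot_deck(records, target, cell_var, aux_var=None, rng=None):
    """Random donor within cells if aux_var is None, else nearest-neighbour donor."""
    rng = rng or random.Random()
    out = []
    for r in records:
        r2 = dict(r)
        if r2[target] is None:
            # Donors: respondents for the target item in the same imputation cell
            donors = [
                d for d in records
                if d[target] is not None and d[cell_var] == r[cell_var]
            ]
            if donors:  # with no donor in the cell the value stays missing
                if aux_var is None:
                    donor = rng.choice(donors)
                else:
                    donor = min(donors, key=lambda d: abs(d[aux_var] - r[aux_var]))
                r2[target] = donor[target]
        out.append(r2)
    return out


records = [
    {"region": "N", "employees": 10, "turnover": 100.0},
    {"region": "N", "employees": 50, "turnover": 600.0},
    {"region": "N", "employees": 12, "turnover": None},
    {"region": "S", "employees": 12, "turnover": 999.0},
]
nn_result = hot_deck(records, "turnover", "region", aux_var="employees")
random_result = hot_deck(records, "turnover", "region", rng=random.Random(7))
```

The unit in region S matches the recipient exactly on `employees` but is never used, because donors are restricted to the recipient's own cell.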

15 Deterministic and Stochastic Imputation
- Deterministic: the estimated value (e.g. by mean or regression) is used directly for imputation
- Stochastic: a random residual term is added to the estimated (predicted) value
Deterministic imputation methods can distort distributions and lead to a decrease in variability. Stochastic methods better preserve the variability of the distribution.
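The loss of variability can be seen in a small simulation. The sketch assumes MCAR nonresponse, mean imputation as the deterministic method, and normally distributed residuals for the stochastic variant; the distribution, missingness rate and seed are invented for the example.

```python
import random
from statistics import mean, pstdev, pvariance

rng = random.Random(42)
full = [rng.gauss(50, 10) for _ in range(5000)]
# MCAR nonresponse (assumption): each value is missing with probability 0.4
observed = [v if rng.random() >= 0.4 else None for v in full]
respondents = [v for v in observed if v is not None]

m = mean(respondents)
sd = pstdev(respondents)

# Deterministic: every gap receives the respondent mean
deterministic = [v if v is not None else m for v in observed]
# Stochastic: respondent mean plus a random residual (normal residuals assumed)
stochastic = [v if v is not None else m + rng.gauss(0, sd) for v in observed]

var_resp = pvariance(respondents)
var_det = pvariance(deterministic)
var_sto = pvariance(stochastic)
```

With about 40% of the values imputed at the mean, the variance of the deterministic data set shrinks to roughly the respondent share of the original variance, while the stochastic version stays close to the respondent variance.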

16 Efficient use of information in imputation
To better preserve univariate and multivariate distributions, the available information can be used:
- as covariates in regression models, hot-deck and predictive mean matching
- to form imputation cells
Imputation cells:
- make it possible to approximate the MCAR assumption within each cell
- should be highly homogeneous internally and highly different from one another
- for hot-deck and predictive mean matching, must contain 'enough' data (available donors)

17 Advantages of imputation
- Simple to use
- Standard methods for complete data can be used in subsequent data analyses
- Reduces bias in univariate statistics compared to complete case and available case analyses
- Uses all the available information, either observed or from other sources (registers, historical data, other sources)

18 Risks of imputation
- Multivariate analyses: imputation generally attenuates the relationships in the data
- Variance (1): imputation introduces a further variance term (imputation variance)
- Variance (2): if imputed data are treated as originally observed, the precision of the estimates is overestimated (underestimation of total variance, confidence intervals that are too narrow, invalid tests, …)

19 Risks of imputation: variance estimation
Variance estimation under single imputation:
- Model-assisted (Särndal, 1992; Rao and Sitter, 1995; Lee, Rancourt and Särndal, 2002)
- Re-sampling techniques (Shao, 2002)
  - Jackknife (Rao and Shao, 1992)
  - Bootstrap (Shao and Sitter, 1996)
- Reversed approach (Shao and Steel, 1999)
Multiple imputation (Rubin, 1987; Schafer, 1997; Raghunathan et al., 2001):
The method consists of imputing the incomplete data set several times (say m). The m completed data sets are then combined in order to estimate the additional uncertainty due to missing data and imputation.
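The slide does not spell out how the m data sets are combined. The sketch below uses the standard combining rules from Rubin (1987): the point estimate is the average of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The input numbers and the function name are purely illustrative.

```python
from statistics import mean, variance


def combine_rubin(estimates, within_variances):
    """Combine m completed-data estimates and their variances (Rubin's rules)."""
    m = len(estimates)
    q_bar = mean(estimates)            # combined point estimate
    w = mean(within_variances)         # average within-imputation variance
    b = variance(estimates)            # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b            # total variance
    return q_bar, w, b, t


# Hypothetical results from m = 3 imputed data sets
q_bar, w, b, t = combine_rubin([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
```

The between-imputation term is what a single imputation cannot supply: treating one completed data set as fully observed amounts to using only the within-imputation variance `w`.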

20 References
Chen J., Shao J. (2000), Nearest Neighbor Imputation for Survey Data, Journal of Official Statistics, No 16, pp.
Chen J., Rao J.N.K., Sitter R. (2000), Efficient Random Imputation for Missing Data in Complex Surveys, Statistica Sinica, Vol. 10, pp.
Dempster A.P., Laird N.M., Rubin D.B. (1977), Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Ser. B, 39, 1-38
Kalton G. (1983), Compensating for missing survey data, Survey Research Center, University of Michigan, 75-76
Kalton G., Kasprzyk D. (1986), The treatment of missing survey data, Survey Methodology, 12, 1, Statistics Canada
Kalton G., Kasprzyk D. (1982), Imputing for missing survey responses, Proceedings of the Section on Survey Research Methods, American Statistical Association, 22-31
Kovar J.G., MacMillan J.H., Whitridge P. (1988), Overview and strategy for the generalized edit and imputation system, Statistics Canada, Methodology Branch, April 1988
Kovar J.G., Whitridge P. (1995), Imputation of business survey data, in Business Survey Methods, John Wiley
Little R.J.A. (1988), Missing data adjustments in large surveys, Journal of Business and Economic Statistics, 6, No 3, pp.
Little R.J.A., Rubin D.B. (2002), Statistical Analysis with Missing Data, 2nd Edition, Wiley, New York
Raghunathan T.E., Lepkowski J.M., Van Hoewyk J., Solenberger P. (2001), A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models, Survey Methodology, 27, No 1, pp.

21 References (continued)
Rao J.N.K. (1996), On variance estimation with imputed survey data, Journal of the American Statistical Association, 91, pp.
Rao J.N.K., Shao J. (1992), Jackknife Variance Estimation with Survey Data under Hot-deck Imputation, Biometrika, 79
Rao J.N.K., Sitter R.R. (1995), Variance Estimation under Two-Phase Sampling with Application to Imputation for Missing Data, Biometrika, 82
Rubin D.B. (1976), Inference and missing data, Biometrika, 63
Rubin D.B. (1987), Multiple Imputation for Nonresponse in Surveys, Wiley, New York
Särndal C.E. (1992), Method for Estimating the Precision of Survey Estimates when Imputation Has Been Used, Survey Methodology
Schafer J.L. (1997), Analysis of Incomplete Multivariate Data, Chapman & Hall
Shao J., Sitter R.R. (1996), Bootstrap for Imputed Survey Data, Journal of the American Statistical Association, 91
Shao J., Steel P. (1999), Variance Estimation for Survey Data with Composite Imputation and Nonnegligible Sampling Fractions, Journal of the American Statistical Association, 94
Shao J. (2002), Replication Methods for Variance Estimation in Complex Surveys with Imputed Data, in Survey Nonresponse, Groves R. et al. eds., J. Wiley and Sons, New York

