Chapter 13: Item nonresponse

Handbook: chapter 14
How to treat missing values?
Single imputation
Effects of single imputation
Multiple imputation

Introduction – Nonresponse
Unit nonresponse: no information is obtained from a sampled person.
Item nonresponse: the person participated in the survey, but answers to some questions are missing.
[Figure: full item response versus item nonresponse]

Introduction – How to deal with item nonresponse?
Casewise deletion: ignore all cases with missing data.
Pairwise deletion: ignore only those cases with missing data on the variables needed for the analysis.
Imputation: substitute estimates for missing data.
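The difference between the two deletion strategies is easy to see on a toy data set. The following is a minimal Python sketch, not part of the original slides; the records, the variable names and the helper functions `casewise` and `pairwise` are invented for illustration.

```python
# Toy records; None marks a missing item. All values are invented.
records = [
    {"age": 34, "income": 2500, "hours": 40},
    {"age": 51, "income": None, "hours": 36},
    {"age": 27, "income": 1900, "hours": None},
    {"age": 45, "income": 3100, "hours": 32},
]

def casewise(rows):
    """Keep only records with no missing item at all."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, variables):
    """Keep records that are complete on the variables used in this analysis."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def mean(values):
    return sum(values) / len(values)

complete = casewise(records)                   # 2 records survive
income_cases = pairwise(records, ["income"])   # 3 records survive

mean_income_casewise = mean([r["income"] for r in complete])
mean_income_pairwise = mean([r["income"] for r in income_cases])
```

Pairwise deletion keeps one more record for the income mean here, at the price that different analyses rest on different subsets of cases.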

Introduction – Estimation under item nonresponse
The effectiveness of deletion and imputation techniques depends on the missing-data pattern.
Under casewise and pairwise deletion, one assumes that cases with missing data are on average the same as cases with full data.
Imputed values follow from a model which assumes that, within the model, item respondents and nonrespondents are on average the same.
Imputed data records cannot be treated the same way as non-imputed records.
Missing-data mechanisms, as for unit nonresponse:
  Missing Completely at Random (MCAR)
  Missing at Random (MAR)
  Not Missing at Random (NMAR)

Introduction – Missing-data mechanisms for item nonresponse
  Missing Completely at Random (MCAR)
  Missing at Random (MAR)
  Not Missing at Random (NMAR)
Examples:
  MCAR: the respondent accidentally forgets to fill in the back of the questionnaire or overlooks a block of questions.
  MAR: older respondents more often do not want to state their income.
  NMAR: the respondent does not want to state the real income because part of it comes from moonlighting (untaxed income).
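The three mechanisms can be made concrete with a small simulation. This is a sketch under invented assumptions: the population model (income rising with age), the response probabilities and all names are hypothetical and only serve to show how the mechanisms differ in their effect on the respondent mean.

```python
import random

random.seed(1)

# Hypothetical population: income rises with age (all numbers invented).
people = []
for _ in range(20000):
    age = random.randint(20, 80)
    income = 1000 + 30 * age + random.gauss(0, 500)
    people.append((age, income))

def mean(values):
    return sum(values) / len(values)

def respondents(prob):
    """Keep each person with probability prob(age, income)."""
    return [(a, y) for a, y in people if random.random() < prob(a, y)]

mcar = respondents(lambda a, y: 0.8)                       # same for everyone
mar = respondents(lambda a, y: 0.9 if a < 60 else 0.5)     # depends on observed age
nmar = respondents(lambda a, y: 0.9 if y < 3000 else 0.5)  # depends on income itself

true_mean = mean([y for _, y in people])
bias_mcar = mean([y for _, y in mcar]) - true_mean
bias_mar = mean([y for _, y in mar]) - true_mean
bias_nmar = mean([y for _, y in nmar]) - true_mean

# Under MAR the bias disappears once we condition on the observed variable:
# within the group aged 60 or older, respondents resemble everyone in the group.
old_all = mean([y for a, y in people if a >= 60])
old_resp = mean([y for a, y in mar if a >= 60])
bias_mar_within_old = old_resp - old_all
```

In this setup the respondent mean is close to the true mean under MCAR, too low under MAR unless age is taken into account, and too low under NMAR in a way that the observed data alone cannot reveal.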

Single imputation – Single versus multiple imputation
Single imputation: a missing value is replaced by a single (synthetic) value.
Multiple imputation: a missing value is replaced by a set of (synthetic) values.
Imputation techniques:
  Deductive imputation
  Imputation of a mean
  Random imputation
  Imputation using donor records
  Imputation using a model with auxiliary variables

Single imputation – Notation
Sample indicators: $a_1, a_2, \ldots, a_N$
Item response indicators: $R_1, R_2, \ldots, R_N$
Target variable: $Y$
Auxiliary variable(s): [symbol not preserved]
Imputed target variable: [symbol not preserved]

Single imputation – Deductive imputation
The value of the missing item can be deduced from the non-missing items.
Example 1: profits and costs are given while total revenue is missing.
Example 2: the respondent is male but does not state how many times he was pregnant.
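Both deductive examples amount to simple logical rules. A minimal sketch follows, assuming the accounting identity revenue = profit + costs; the field names and the two helper functions are hypothetical.

```python
def deduce_revenue(firm):
    """Fill missing revenue from profit and costs (assumes revenue = profit + costs)."""
    filled = dict(firm)
    if (filled["revenue"] is None
            and filled["profit"] is not None
            and filled["costs"] is not None):
        filled["revenue"] = filled["profit"] + filled["costs"]
    return filled

def deduce_pregnancies(person):
    """A male respondent has had zero pregnancies, whatever he left blank."""
    filled = dict(person)
    if filled["sex"] == "male" and filled["pregnancies"] is None:
        filled["pregnancies"] = 0
    return filled

firm = deduce_revenue({"profit": 20, "costs": 80, "revenue": None})
man = deduce_pregnancies({"sex": "male", "pregnancies": None})
woman = deduce_pregnancies({"sex": "female", "pregnancies": None})  # stays missing
```

Records for which no rule applies are left missing, so another technique is still needed for them.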

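Single imputation – Imputation of a mean
Imputation of the overall mean: each missing value is replaced by the mean of all available observations.
Imputation of the mean within strata or groups: each missing value is replaced by the mean of the available observations in the same stratum or group.
Examples:
  Imputation of mean income over all households.
  Imputation of mean profit over persons with the same size and having the same kind of job.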
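Both variants can be sketched in a few lines of Python. The data and the function names `impute_overall_mean` and `impute_group_mean` are invented for illustration; the sketch assumes every group contains at least one available observation.

```python
from statistics import mean

def impute_overall_mean(values):
    """Replace every missing value (None) by the mean of the available values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def impute_group_mean(values, groups):
    """Replace every missing value by the mean of the available values in its group."""
    by_group = {}
    for v, g in zip(values, groups):
        if v is not None:
            by_group.setdefault(g, []).append(v)
    means = {g: mean(vs) for g, vs in by_group.items()}
    # Raises KeyError if a group has no available observation at all.
    return [means[g] if v is None else v for v, g in zip(values, groups)]

income = [2000, None, 3000, 5000, None, 7000]
job = ["clerk", "clerk", "clerk", "manager", "manager", "manager"]

overall = impute_overall_mean(income)      # both gaps filled with 4250
grouped = impute_group_mean(income, job)   # clerk gap: 2500, manager gap: 6000
```

The grouped version uses the job as auxiliary information, so the two imputed incomes differ, whereas the overall mean gives every nonrespondent the same value.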

Single imputation – Random imputation
Impute one of the possible values at random.
Cases with the same missing-data pattern may have different imputed values.
Random imputation can also be employed within strata or groups.
Examples:
  If marital status is missing, sample a value at random from the set {married, not married, divorced, widowed}.
  If marital status is missing, sample a value at random from the set {married, not married, divorced, widowed} in case the respondent is 16 years or older, and otherwise impute "not married".
  If income is missing, fit a normal distribution to the non-missing records and sample a value from the fitted distribution.
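The second and third example could look as follows in Python. This is only a sketch: the function names, the seed and the observed incomes are made up, and the normal fit simply uses the sample mean and standard deviation.

```python
import random
from statistics import mean, stdev

STATUSES = ["married", "not married", "divorced", "widowed"]

def impute_marital_status(age, rng):
    """Random draw for respondents of 16 or older, fixed value for younger ones."""
    if age < 16:
        return "not married"
    return rng.choice(STATUSES)

def impute_income(observed, rng):
    """Draw from a normal distribution fitted to the observed incomes."""
    return rng.gauss(mean(observed), stdev(observed))

rng = random.Random(42)
status_child = impute_marital_status(12, rng)
status_adult = impute_marital_status(40, rng)
income_draw = impute_income([1800, 2400, 2600, 3200], rng)
```

A normal draw can produce implausible values such as a negative income, which conflicts with the requirement that imputed values belong to the domain of valid answers; a practical implementation would have to guard against that.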

Single imputation – Imputation using donors
Hot deck imputation: sample at random from the set of values observed among the item respondents.
Nearest neighbour imputation: define a distance measure, search for the item respondent that is closest to the item nonrespondent, and impute the corresponding value.
Examples:
  If income is missing, identify all persons with the same age and gender and impute one of their incomes at random.
  If level of education is missing, search for the item respondent whose income is closest in absolute value and impute the corresponding level of education.
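The two donor examples can be sketched as follows. The donor records and the helper names `hot_deck` and `nearest_neighbour` are hypothetical, and the distance measure is the absolute difference on a single variable, as in the second example.

```python
import random

def hot_deck(recipient, donors, class_vars, target, rng):
    """Draw the target value of a random donor from the same imputation class."""
    pool = [d[target] for d in donors
            if all(d[v] == recipient[v] for v in class_vars)]
    return rng.choice(pool)   # raises IndexError if the class has no donor

def nearest_neighbour(recipient, donors, match_var, target):
    """Take the target value of the donor closest on match_var."""
    donor = min(donors, key=lambda d: abs(d[match_var] - recipient[match_var]))
    return donor[target]

donors = [
    {"age": 30, "gender": "f", "income": 2500, "education": "secondary"},
    {"age": 30, "gender": "f", "income": 2900, "education": "tertiary"},
    {"age": 55, "gender": "m", "income": 4100, "education": "tertiary"},
]

rng = random.Random(0)
imputed_income = hot_deck({"age": 30, "gender": "f"}, donors,
                          ["age", "gender"], "income", rng)
imputed_education = nearest_neighbour({"income": 2600}, donors,
                                      "income", "education")
```

Because donor values are real answers, both techniques always impute a value from the domain of valid answers, which makes them suitable for qualitative variables such as education.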

Single imputation – Imputation using a model
Imputation of the mean within groups and nearest neighbour imputation implicitly use auxiliary information and thus a model.
Select those strata or nearest neighbours for which the corresponding auxiliary variables relate strongly to the missing item.
More sophisticated imputation techniques have been developed that model missing items by non-missing items.
The situation is similar to unit nonresponse: how to build such models and how to select auxiliary variables?
The main difference between item and unit nonresponse is the availability of non-missing items next to the auxiliary information available from administrative data.

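Single imputation – Ratio and regression imputation
Ratio imputation: impute the value of an auxiliary variable multiplied by a ratio estimated from the item respondents.
Regression imputation: impute the prediction of a regression model fitted on the item respondents.
Examples:
  Model income using size of household and average house value.
  Model health status using age, gender and employment status.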
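A minimal sketch of both techniques with a single auxiliary variable is given below. It assumes the usual textbook forms (ratio of the respondent totals of target and auxiliary variable, and ordinary least squares with one predictor); the slide formulas did not survive, so these forms, the data and the function names are assumptions.

```python
def ratio_impute(y, x):
    """Impute b * x_k, with b the ratio of the respondent totals of y and x."""
    obs = [(yk, xk) for yk, xk in zip(y, x) if yk is not None]
    b = sum(yk for yk, _ in obs) / sum(xk for _, xk in obs)
    return [b * xk if yk is None else yk for yk, xk in zip(y, x)]

def regression_impute(y, x):
    """Impute the least-squares prediction intercept + slope * x_k."""
    obs = [(yk, xk) for yk, xk in zip(y, x) if yk is not None]
    n_obs = len(obs)
    x_bar = sum(xk for _, xk in obs) / n_obs
    y_bar = sum(yk for yk, _ in obs) / n_obs
    slope = (sum((xk - x_bar) * (yk - y_bar) for yk, xk in obs)
             / sum((xk - x_bar) ** 2 for _, xk in obs))
    intercept = y_bar - slope * x_bar
    return [intercept + slope * xk if yk is None else yk
            for yk, xk in zip(y, x)]

house_value = [100, 200, 300, 400]   # auxiliary variable, invented values
income = [1200, 2000, None, 4000]    # target variable with one missing item

by_ratio = ratio_impute(income, house_value)
by_regression = regression_impute(income, house_value)
```

The two imputed incomes differ because the ratio model forces the line through the origin while the regression model also estimates an intercept.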

Single imputation – A general model for imputation
Most of the proposed imputation techniques can be put into a general framework.
Let $B_0, B_1, \ldots, B_p$ be constants and $E_k$ be a random term; then the general model has the form $\hat{Y}_k = B_0 + \sum_{j=1}^{p} B_j X_{kj} + E_k$, where $X_{kj}$ is the value of auxiliary variable $j$ for element $k$.
Imputation of the mean: all terms are zero except $B_0$, which equals the overall mean.
Hot deck imputation: let the random term take values in the set of item responses.
Ratio and regression imputation: take the corresponding estimated parameters. The random term equals zero.
Exception: nearest neighbour imputation.
The benefit of a general framework is that theory can be developed to compare different techniques.
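To see how the special cases arise from one model, the sketch below assumes the general model is a constant plus a linear combination of auxiliary values plus a random term. The function `general_impute`, its parameter names and the numbers are hypothetical.

```python
import random

def general_impute(x_k, b, b0=0.0, random_term=lambda: 0.0):
    """Imputed value = b0 + sum_j b[j] * x_k[j] + random term."""
    return b0 + sum(bj * xj for bj, xj in zip(b, x_k)) + random_term()

observed_y = [1800.0, 2400.0, 2600.0, 3200.0]   # invented item responses
rng = random.Random(3)

# Imputation of the mean: only the constant term is non-zero.
by_mean = general_impute([], [], b0=sum(observed_y) / len(observed_y))

# Hot deck: all constants are zero, the random term is a drawn item response.
by_hot_deck = general_impute([], [], random_term=lambda: rng.choice(observed_y))

# Ratio imputation: one auxiliary value, an estimated ratio, no random term.
by_ratio = general_impute([300.0], [10.0])
```

Nearest neighbour imputation does not fit this form, because the imputed value depends on which donor is closest rather than on fixed coefficients.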

Effects of single imputation – general effects
Imputed value must belong to the domain of valid answers:
  Qualitative variable: some form of donor imputation.
  Quantitative variable: any technique.
Effect on the mean:
  Deterministic imputation: mean not affected.
  Random imputation: mean is affected, but its expected value is not.
Effect on the distribution:
  Deterministic imputation: distribution becomes more 'peaked'.
  Random imputation: preserves the distribution better.
Effect on correlation:
  Both deterministic and random imputation may affect the value of correlations.
  Correlations after imputation will be smaller.
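The effects on the distribution and on correlations can be checked on a small example for one deterministic technique, imputation of the overall mean. The data below are invented; the sketch compares the standard deviation and the correlation before imputation (available cases only) and after.

```python
from math import sqrt
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0, 3.5, None, 4.0, 6.5, None, 7.0, None]   # None marks a missing item

pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
obs_x = [p[0] for p in pairs]
obs_y = [p[1] for p in pairs]

fill = mean(obs_y)                                  # overall mean of available values
y_imputed = [fill if yi is None else yi for yi in y]

sd_before, sd_after = stdev(obs_y), stdev(y_imputed)
r_before = correlation(obs_x, obs_y)
r_after = correlation(x, y_imputed)
```

The imputed values all sit at the mean, so the spread shrinks (the distribution becomes more peaked), and since they carry no information about `x` the correlation is pulled towards zero.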

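Effects of single imputation – some notation
Target variable (with missing values): $Y_1, Y_2, \ldots, Y_N$.
Sample of size $n$: $a_1, a_2, \ldots, a_N$, where $a_k = 1$ if element $k$ is selected and $a_k = 0$ otherwise.
Missing data: $R_1, R_2, \ldots, R_N$, where $R_k = 1$ if the value of element $k$ is available and $R_k = 0$ otherwise.
Number of available observations: $m = \sum_{k=1}^{N} a_k R_k$.
Mean of available observations: $\bar{y}_R = \frac{1}{m} \sum_{k=1}^{N} a_k R_k Y_k$.
Imputation: the value $Y_k$ is missing if $a_k = 1$ and $R_k = 0$. A synthetic value $\hat{Y}_k$ is then used in its place.
Mean of imputed values: $\bar{y}_{IMP} = \frac{1}{n-m} \sum_{k=1}^{N} a_k (1 - R_k) \hat{Y}_k$.
Estimator after imputation: $\bar{y}_I = \frac{1}{n} \left( m \bar{y}_R + (n-m) \bar{y}_{IMP} \right)$.
Expected value and variance of $\bar{y}_I$: [formulas not preserved]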

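Effects of single imputation – imputation of the mean
Imputed value, for all missing $Y_k$: $\hat{Y}_k = \bar{y}_R$, the mean of the available observations.
Mean of imputed values: $\bar{y}_{IMP} = \bar{y}_R$.
Estimator: $\bar{y}_I = \bar{y}_R$.
Expected value: not affected.
Variance: $V(\bar{y}_I) = V(\bar{y}_R)$, the variance of a mean based on the $m$ available observations rather than on all $n$ sampled elements.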

Suppose a researcher is given the complete (imputed) data set and does not know that imputation of the mean has been carried out.
To determine the standard error of the mean, he computes the sample variance and uses it as an estimator of the true population variance $S^2$.
However, the sample variance of the imputed data set is equal to $s_I^2 = \frac{m-1}{n-1} s_R^2$, where $s_R^2$ is the sample variance of the $m$ available observations and $n$ is the sample size.
It therefore underestimates the population variance: estimates are less precise than he thinks!
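Why the sample variance shrinks can be seen in two lines. The derivation below is a sketch in notation chosen for this purpose: $\bar{y}_R$ and $s_R^2$ are the mean and sample variance of the $m$ available observations, and $\bar{y}_I$ and $s_I^2$ those of all $n$ values after every missing value has been replaced by $\bar{y}_R$.

```latex
\begin{align*}
\bar{y}_I &= \frac{1}{n}\left( m\,\bar{y}_R + (n-m)\,\bar{y}_R \right) = \bar{y}_R, \\
s_I^2 &= \frac{1}{n-1}\left[ \sum_{\text{available}} (Y_k - \bar{y}_R)^2
         + \sum_{\text{imputed}} (\bar{y}_R - \bar{y}_R)^2 \right]
       = \frac{m-1}{n-1}\, s_R^2 .
\end{align*}
```

The imputed records add nothing to the sum of squares but do increase the divisor, so with 10% missing values the sample variance is roughly 10% too small.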

Example:
Population of size $N = 19{,}000$. Sample of size $n = 1{,}000$. Population variance $S^2 = 360{,}000$.
Variance of the mean in case of full response: [value not preserved]
10% missing values, so the number of available observations is $m = 900$. Imputation of the mean is carried out.
Variance of the estimator after imputation: [value not preserved]
Standard deviation of all (real and imputed) observations: [value not preserved]
(Wrong) estimate of the variance of the sample mean: [value not preserved]
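The four quantities of this example can be recomputed from the given $N$, $n$, $m$ and $S^2$. The sketch below rests on two assumptions that are not stated in the transcript: simple random sampling without replacement, so that the variance of a mean of `size` observations is `(1 - size/N) * S2 / size`, and the sample variance of the available observations being equal to its expected value $S^2$.

```python
import math

N, n, m = 19_000, 1_000, 900
S2 = 360_000

def var_mean(size):
    """Variance of a sample mean under simple random sampling without replacement."""
    return (1 - size / N) * S2 / size

var_full = var_mean(n)       # full response, about 341
var_imputed = var_mean(m)    # after mean imputation only m real values, about 381

# Sample variance of the imputed data set, using s_I^2 = (m - 1) / (n - 1) * s_R^2
# with s_R^2 replaced by S2.
s2_imputed = (m - 1) / (n - 1) * S2
sd_imputed = math.sqrt(s2_imputed)          # about 569, against a true S of 600

# What the unsuspecting researcher computes as the variance of the mean.
var_naive = (1 - n / N) * s2_imputed / n    # about 307
```

Under these assumptions the researcher would report a variance of about 307 while the true variance of the estimator is about 381, so the precision is overstated.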

Effects of single imputation – random imputation
The imputed value is randomly selected from the available observations.
The expected value is not affected.
The variance consists of two components:
  Normal sampling variance.
  Variance introduced by the imputation mechanism.
Expected value of the standard deviation of all observations: [formula not preserved]
The variance estimator is asymptotically design-unbiased.
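The two variance components can be made visible with a small Monte Carlo experiment. This is a sketch under invented assumptions: a normally distributed hypothetical population, MCAR item nonresponse, and mean imputation as the benchmark that carries only the sampling variance of the respondent mean.

```python
import random
from statistics import mean, pvariance

rng = random.Random(7)
population = [rng.gauss(50, 10) for _ in range(5000)]   # hypothetical target values
n, m = 100, 70        # sample size and number of available observations
reps = 3000

est_mean_imp, est_random_imp = [], []
for _ in range(reps):
    sample = rng.sample(population, n)
    observed = sample[:m]            # MCAR: a random part of the sample responds
    # Mean imputation: the estimator equals the respondent mean.
    est_mean_imp.append(mean(observed))
    # Random imputation: each missing value gets a randomly drawn observed value.
    drawn = [rng.choice(observed) for _ in range(n - m)]
    est_random_imp.append((sum(observed) + sum(drawn)) / n)

var_mean_imp = pvariance(est_mean_imp)       # sampling variance only
var_random_imp = pvariance(est_random_imp)   # sampling plus imputation variance
extra = var_random_imp - var_mean_imp        # share added by the random draws
```

Both estimators centre on the population mean, but the random draws add a second, smaller variance component on top of the sampling variance.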

Multiple imputation
Disadvantage of single imputation: underestimation of the population variance.
Multiple imputation (MI) is a solution to this problem.
MI replaces each missing value by $m > 1$ synthetic values (here $m$ denotes the number of imputations, not the number of available observations).
This leads to $m$ data sets, for each of which we obtain an estimate of the population characteristic.
These estimates can be combined to produce estimates and confidence intervals.

Multiple imputation
MI assumes some kind of model, e.g. a linear model like [formula not preserved]
The effect of imputation depends on the missing-data mechanism:
  If MCAR, we can apply random imputation a number of times.
  If MAR, we can use the linear model above.
  If NMAR, no valid imputation model can be used.
Consequently, if the missing-data mechanism is not modelled properly, analysis of the imputed data sets can be seriously wrong!

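Multiple imputation
Let $\hat{\theta}_j$ denote the estimator of data set $j$ (for $j = 1, 2, \ldots, m$).
The overall estimator is then defined by $\bar{\theta} = \frac{1}{m} \sum_{j=1}^{m} \hat{\theta}_j$.
The variance of this estimator equals $V(\bar{\theta}) = \frac{1}{m} \sum_{j=1}^{m} V(\hat{\theta}_j) + \left(1 + \frac{1}{m}\right) \frac{1}{m-1} \sum_{j=1}^{m} (\hat{\theta}_j - \bar{\theta})^2$,
which can be seen as the within-imputation variance plus the between-imputation variance.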
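Combining the estimates takes only a few lines. The sketch below assumes the standard combining rules attributed to Rubin (average the estimates; total variance = mean within-imputation variance plus (1 + 1/m) times the between-imputation variance); the function name `pool` and the five estimates are invented.

```python
from statistics import mean, variance

def pool(estimates, variances):
    """Combine m completed-data estimates and their variances."""
    m = len(estimates)
    overall = mean(estimates)
    within = mean(variances)          # average variance inside the data sets
    between = variance(estimates)     # spread across data sets, divisor m - 1
    total = within + (1 + 1 / m) * between
    return overall, within, between, total

estimates = [10.0, 12.0, 11.0, 13.0, 9.0]   # one estimate per imputed data set
variances = [4.0, 4.2, 3.8, 4.1, 3.9]       # their estimated variances

overall, within, between, total = pool(estimates, variances)
```

The between-imputation term is what a single imputed data set cannot provide: it reflects the uncertainty about the missing values themselves.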

Multiple imputation – What should be the number of imputations $m$?
Rubin (1987) claims that it should not exceed $m = 10$.
Compared with an infinite number of imputations, the relative increase in variance is approximately equal to $\gamma / m$, where $\gamma$ is the rate of missing information.
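A quick calculation shows why a small number of imputations is usually regarded as sufficient. The sketch assumes the common approximation that the variance with m imputations is (1 + gamma/m) times the variance with infinitely many, so that the relative efficiency is 1 / (1 + gamma/m); the value of gamma is hypothetical.

```python
def relative_efficiency(gamma, m):
    """Efficiency of m imputations relative to infinitely many."""
    return 1 / (1 + gamma / m)

gamma = 0.3   # hypothetical rate of missing information
table = {m: round(relative_efficiency(gamma, m), 3) for m in (1, 3, 5, 10, 20)}
```

With 30% missing information, five imputations already reach about 94% efficiency and ten about 97%, so going far beyond ten gains very little.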
