Chapter 10: Selection of auxiliary variables


1 Chapter 10: Selection of auxiliary variables
Handbook: chapter 9
- The auxiliary variable selection problem
- Getting started
- Level of availability of auxiliary variables
- Variable selection strategies

2 The auxiliary variable selection problem
Response behaviour: R
Target variable: Y
Auxiliary variable: X
Bias of response mean: [formula not preserved in the transcript]
Bounds on correlation: [formula not preserved in the transcript]
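The slide's own bias formula did not survive in the transcript. As a hedged illustration only, the sketch below uses the standard random-response-model result that the bias of the respondent mean equals the population covariance between Y and the response propensity, divided by the mean propensity. The function names and the toy numbers are assumptions made for this example, not taken from the slides.

```python
def expected_respondent_mean(y, rho):
    """Propensity-weighted mean: what the respondent mean estimates on average."""
    return sum(r * v for r, v in zip(rho, y)) / sum(rho)


def response_mean_bias(y, rho):
    """Bias of the respondent mean as Cov(Y, rho) / mean(rho).

    Assumption: each unit i responds with a fixed propensity rho[i];
    the covariance is the population covariance (divisor N).
    """
    n = len(y)
    y_bar = sum(y) / n
    rho_bar = sum(rho) / n
    cov = sum((v - y_bar) * (r - rho_bar) for v, r in zip(y, rho)) / n
    return cov / rho_bar


# Hypothetical toy population: units with Y = 1 respond more often.
y = [1, 0, 1, 0]
rho = [0.8, 0.4, 0.8, 0.4]
bias = response_mean_bias(y, rho)
direct = expected_respondent_mean(y, rho) - sum(y) / len(y)
```

Both routes give the same number here, which shows why an auxiliary variable only helps against bias when it explains part of the link between Y and response behaviour.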

3 The auxiliary variable selection problem
Rationale behind selection of auxiliary variables
- Without nonresponse: auxiliary variables need to relate to key survey topics for variance reduction.
- With nonresponse: auxiliary variables need to relate to key survey topics and/or response behaviour for bias and variance reduction.
Usual practice
- Model for response behaviour
- Model for key survey topics
- Some combination of both sets of auxiliary variables
Practical requirements
- Some auxiliary variables are included for consistency purposes.
- Weighting models are fixed for a longer time period in order to avoid level shifts.

4 The auxiliary variable selection problem – an example
Examples: ownership of a personal computer and of a house

Ownership of a personal computer
Model                                             Estimate
Empty model                                       59.8 %
AgeMar                                            58.5 %
AgeMar + Hvalue                                   57.9 %
AgeMar + Hvalue + Egroup                          57.4 %
AgeMar + Hvalue + Egroup + HHSize                 57.3 %
AgeMar + Hvalue + Egroup + HHSize + SocAllow      57.2 %

Ownership of a house
Model                                             Estimate
Empty model                                       63.3 %
HValue                                            61.2 %
HValue + HHType                                   60.5 %
HValue + HHType + NonNative                       59.8 %
HValue + HHType + NonNative + Province            59.4 %
HValue + HHType + NonNative + Province + AgeMar   59.3 %

5 The auxiliary variable selection problem
Selection and missing-data mechanisms
- The underlying assumption in weighting models is Missing-at-Random (MAR): within strata defined by the auxiliary variables, respondents and nonrespondents are the same on average.
- Even if the final weighting model satisfies MAR, what about the smaller intermediate models?
- The MAR assumption gives no guidance on the selection of auxiliary variables.
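To make the MAR assumption concrete, here is a minimal post-stratification sketch: if respondents and nonrespondents agree on average within each stratum, weighting the respondent stratum means by known population counts removes the bias. The strata names, counts and responses are hypothetical values chosen for illustration.

```python
from collections import defaultdict


def poststratified_mean(respondents, population_counts):
    """Weight respondent stratum means by known population stratum counts.

    respondents: iterable of (stratum, y) pairs observed among respondents.
    population_counts: dict mapping stratum -> population size N_h.
    Assumes every stratum in population_counts has at least one respondent.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for stratum, y in respondents:
        sums[stratum] += y
        counts[stratum] += 1
    total = sum(population_counts.values())
    weighted = sum(
        n_h * sums[h] / counts[h] for h, n_h in population_counts.items()
    )
    return weighted / total


# Hypothetical data: "old" units are over-represented among respondents.
respondents = [
    ("young", 1), ("young", 1),
    ("old", 0), ("old", 0), ("old", 0), ("old", 1),
]
population_counts = {"young": 600, "old": 400}

unweighted = sum(y for _, y in respondents) / len(respondents)
adjusted = poststratified_mean(respondents, population_counts)
```

In this toy case the unweighted respondent mean is 0.5 while the post-stratified mean is 0.7. The adjustment is only as good as the MAR assumption within the chosen strata, which is why the choice of stratifying variables matters.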

6 Getting started
The type and level of the auxiliary variables are important. Three decisions are needed:
- Qualitative variables: definition and number of categories
- Quantitative variables: transformation to categorical measurement level, or higher order terms
- Interactions between auxiliary variables

7 Getting started
Qualitative variables (type of household or business, ethnicity):
- Use the publication classifications of the variables (often also required for consistency).
- Perform an exploratory analysis based on tree methods like CHAID and follow the categories identified as most powerful.

8 Getting started
Quantitative variables (income, turnover, age):
- Usually transformed to categorical variables, unless there is an intrinsic motivation to use the continuous variable.
- Higher order terms (quadratic, cubic) may be added.
- Transformation to categorical variables again uses standard publication classifications or regression trees.

9 Getting started
Interactions strongly increase the number of adjustment parameters, i.e. caution is needed in adding interactions.
Motives for inclusion of interactions:
- Consistency
- Collinearity
- Interactions relate to nonresponse behaviour
- Interactions relate to target variables
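A quick count shows how fast interactions inflate a weighting model. The sketch below compares a main-effects model with a fully crossed model for categorical variables; the category counts are hypothetical and the function names are introduced only for this illustration.

```python
from math import prod


def n_parameters_main_effects(levels):
    """Intercept plus (k - 1) parameters per categorical variable with k levels."""
    return 1 + sum(k - 1 for k in levels)


def n_cells_full_interaction(levels):
    """Number of cells (parameters) when all variables are fully crossed."""
    return prod(levels)


# Hypothetical variables with 12, 5 and 2 categories.
levels = (12, 5, 2)
main = n_parameters_main_effects(levels)   # 1 + 11 + 4 + 1
full = n_cells_full_interaction(levels)    # 12 * 5 * 2
```

With these counts the main-effects model has 17 parameters and the fully crossed model has 120 cells, many of which may contain few or no respondents.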

10 Level of availability of auxiliary variables
- Population level: the auxiliary variable is available for all individuals through linked registry or frame data (ideal situation).
- Sample level: the auxiliary variable is available only for sample units, through paradata observations made by interviewers and data collection staff.
- Aggregated population level: the auxiliary variable is available for respondents and in population tables or counts.

11 Level of availability of auxiliary variables
- From a bias reduction point of view, there is no difference between population level and sample level auxiliary variables.
- Aggregated population level variables are produced by National Statistical Institutes (NSIs) and are often used as gold standards.
- Sample level and aggregated population level auxiliary variables need to be included in the questionnaire or interviewer observations! In other words, variable selection starts in the design of the survey.

12 Variable selection strategies
1. Pre-selection of auxiliary variables from literature on similar surveys;
2. Linkage of available population level variables from registrations;
3. Inclusion of additional auxiliary variables in the survey questionnaire;
4. Identification and observation of additional paradata by interviewers and data collection staff;
5. Modeling of the missing-data mechanism of nonresponse;
6. Modeling of the main survey variables;
7. Combination of auxiliary variable sets from the models resulting from steps 5 and 6;
8. Checking of weight diagnostics and, if necessary, return to step 7.

13 Variable selection strategies
- Combination of auxiliary variable sets is not at all trivial, unless the number and diversity of the target variables are very large.
- When the number and diversity of target variables are large, it is sufficient to model nonresponse.
- Advanced selection strategies account for the relation to target variables and to nonresponse simultaneously:
  - Särndal and Lundström (2010): coefficient of variation of adjustment weights.
  - Schouten (2007): maximal bias of the regression estimator under a worst-case scenario.

14 Variable selection strategies – coefficient of variation
- Särndal and Lundström proved that the coefficient of variation of the adjustment weights is a standard term in the remaining bias of general regression estimators, i.e. it appears regardless of Y.
- It is denoted by [symbol not preserved in the transcript].
- Without a specific Y in mind, optimizing the variation of the adjustment weights is generally the best choice.
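The exact indicator used by Särndal and Lundström is not reproduced in the transcript, so the following is only a generic sketch of the underlying quantity: the coefficient of variation (standard deviation divided by mean) of a set of adjustment weights. The two weight vectors are hypothetical.

```python
from statistics import mean, pstdev


def weight_cv(weights):
    """Coefficient of variation of adjustment weights (population std / mean)."""
    return pstdev(weights) / mean(weights)


# Hypothetical adjustment weights from two candidate weighting models.
weights_model_a = [2.0, 2.0, 2.0, 2.0]   # no differentiation between respondents
weights_model_b = [1.0, 1.0, 3.0, 3.0]   # respondents are weighted differently

cv_a = weight_cv(weights_model_a)
cv_b = weight_cv(weights_model_b)
```

A model whose weights do not vary at all makes no adjustment, so the variation of the weights can be read as a Y-independent signal of how much a candidate auxiliary vector changes the estimates.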

15 Variable selection strategies – maximal bias
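Bias of general regression estimator
- Let [symbol not preserved in the transcript] be the predictor of Y based on X. Then: [formula not preserved in the transcript]
- Objective: minimize the bias under the worst-case scenario, i.e. at the boundaries of the interval.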

16 Variable selection strategies – maximal bias
Selection of auxiliary variables under worst-case scenario
- Observe that [term not preserved in the transcript] is independent of the choice of auxiliary variables.
- The maximal bias using the vector X of auxiliary variables is proportional to [expression not preserved in the transcript].
- Select auxiliary variables according to [criterion not preserved in the transcript].
Properties of selection criterion
- It allows for building up weighting models bottom-up.
- It leads to different models for each Y.

17 Variable selection strategies - general
Implementation of selection criterion
- Need to account for a significant decrease of the criterion, i.e. to account for variance.
Implementation
- Analogous to regression analysis: select forwards and remove backwards.
- Classification trees that use the criterion as split rule and the significance of the decrease as stopping rule.
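The forward selection with backward removal described above can be sketched generically. This is a minimal illustration under stated assumptions, not the authors' implementation: `criterion` is any function to be minimized over sets of variables, and the fixed threshold `min_gain` is a simple stand-in for a proper significance test of the decrease. The lookup table of criterion values is hypothetical.

```python
def forward_backward_select(candidates, criterion, min_gain=0.0):
    """Greedy forward selection with one backward-removal check per round.

    criterion: maps a frozenset of variable names to a value to minimize.
    min_gain: minimal decrease required to add a variable (assumed stand-in
    for a significance test); a variable is dropped again if removing it
    raises the criterion by at most min_gain.
    """
    selected = []
    current = criterion(frozenset())
    while True:
        remaining = [v for v in candidates if v not in selected]
        if not remaining:
            break
        # Forward step: try each remaining variable, keep the best one.
        value, var = min(
            (criterion(frozenset(selected + [v])), v) for v in remaining
        )
        if current - value <= min_gain:
            break
        selected.append(var)
        current = value
        # Backward step: drop at most one earlier variable that became redundant.
        drops = [
            (criterion(frozenset(x for x in selected if x != v)), v)
            for v in selected if v != var
        ]
        if drops:
            value, old = min(drops)
            if value - current <= min_gain:
                selected.remove(old)
                current = value
    return selected, current


# Hypothetical criterion values: "a" is useful alone but redundant given b and c.
TABLE = {
    frozenset(): 1.00,
    frozenset("a"): 0.90, frozenset("b"): 0.95, frozenset("c"): 0.95,
    frozenset("ab"): 0.88, frozenset("ac"): 0.88, frozenset("bc"): 0.80,
    frozenset("abc"): 0.80,
}
model, value = forward_backward_select(["a", "b", "c"], TABLE.__getitem__, min_gain=0.01)
```

In this toy run the variable `a` enters first, then `b` and `c`, after which `a` is removed because the criterion is unchanged without it. Allowing only one removal per round keeps the criterion strictly decreasing from round to round, so the loop terminates.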

18 Example 1 – variance of adjustment weights
Forward selection – backward removal

Model                                                                                    q² (x1000)
Region                                                                                    75
Region + Phone                                                                           114
Region + Phone + Pnonnat2                                                                126
Region + Phone + Pnonnat2 + Hhtype                                                       134
Region + Phone + Pnonnat2 + Hhtype + Age13                                               139
Region + Phone + Pnonnat2 + Hhtype + Age13 + Marstat                                     143
Region + Phone + Pnonnat2 + Hhtype + Age13 + Marstat + Nonnativ                          146
Region + Phone + Pnonnat2 + Hhtype + Age13 + Marstat + Nonnativ + Houseval               148
Region + Phone + Pnonnat2 + Hhtype + Age13 + Marstat + Nonnativ + Houseval + Allowan     149

19 Example 2 – maximal bias
Forward selection – backward removal
Model for ownership of a house (blank cells were not preserved in the transcript)

Model                                            Estimate   Corr. X and Y   Corr. X and R   W
Empty model                                      63.3%                                      1
HValue12                                         61.2%      0.47            0.11            0.875
HValue12 + HhType5                               60.5%      0.52            0.13            0.849
HValue12 + HhType5 + PercAll                     59.8%      0.54            0.15            0.829
HhType5 + PercAll                                60.2%      0.44            0.16            0.886
HValue12 + HhType5 + PercAll + Prov12            59.4%      0.56                            0.820
HhType5 + PercAll + Prov12                       59.9%      0.45            0.17            0.881
HValue12 + PercAll + Prov12                                 0.53                            0.841
HValue12 + HhType5 + PercAll + Prov12 + Age15                                               0.819

20 Example 2 - continued
Forward selection – backward removal (population value is 12.1%)
Model for social allowance (blank cells were not preserved in the transcript)

Model                              Estimate   Corr. X and Y   Corr. X and R   W
Empty model                        10.4%                                      1
AgeMar36                           10.9%      0.34            -0.05           0.940
AgeMar36 + HValue12                11.2%      0.36            -0.08           0.930
MarStat4 + HValue12                           0.22                            0.973
Age15 + HValue12                              0.33            -0.06           0.943
AgeMar36 + HValue12 + Phone2       11.4%      0.37            -0.11           0.925
HValue12 + Phone2                  11.1%      0.16            -0.16           0.975
MarStat4 + HValue12 + Phone2                  0.23                            0.968
Age15 + HValue12 + Phone2          11.3%                      -0.10           0.937
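The definition of W is not preserved in the transcript. As an observation about the numbers only, the rows of Example 2 that list both correlations are numerically close to sqrt((1 - r_XY^2)(1 - r_XR^2)). The sketch below checks this; the formula is an assumption inferred from the tabulated values, and small gaps are expected because the correlations are rounded to two decimals.

```python
import math


def w_from_correlations(r_xy, r_xr):
    """Assumed form of W: sqrt((1 - r_xy^2) * (1 - r_xr^2))."""
    return math.sqrt((1.0 - r_xy ** 2) * (1.0 - r_xr ** 2))


# (corr X and Y, corr X and R, tabulated W) for rows with all three values.
rows = [
    (0.47, 0.11, 0.875), (0.52, 0.13, 0.849), (0.54, 0.15, 0.829),
    (0.44, 0.16, 0.886), (0.45, 0.17, 0.881),                        # house ownership
    (0.34, -0.05, 0.940), (0.36, -0.08, 0.930), (0.33, -0.06, 0.943),
    (0.37, -0.11, 0.925), (0.16, -0.16, 0.975),                      # social allowance
]
gaps = [abs(w_from_correlations(r_xy, r_xr) - w) for r_xy, r_xr, w in rows]
max_gap = max(gaps)
```

Under this reading, W shrinks when the auxiliary variables correlate more strongly with either the target variable or response behaviour, and the empty model has W = 1.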

21 Example 2 - continued
Classification tree: receives social allowance
[Figure: classification tree with 33 numbered nodes. Split labels recoverable from the slide: Married, Divorced, Age mar, Male, Job, WOZ, WOZ<150, <10% non-native, <29 y, 20-54 y, 55-64 y, Couple no children, Couple with children.]

22 The selection of auxiliary variables
Conclusions
- The strongest candidate auxiliary variables are those that relate both to the key survey topics and to the missing-data mechanism.
- Even if MAR is assumed, one needs a criterion to build models and to differentiate between them.
- Selection of auxiliary variables is often a laborious and partially manual process.
- Simultaneous adjustment of a large number of survey target variables complicates selection.
- An efficient search for stratifications leads to nonresponse adjustments that are as effective as models incorporating many variables and interactions.

