1 Some birds, a cool cat and a wolf. Tricks of the trade. Dick Wiggins, City University, London; Gopal Netuveli, Imperial College, University of London. RSS Official Statistics/Statistical Computing Section, 18th May 2005.
2 Acknowledgments Economic and Social Research Council Human Capability and Resilience Network
4 Sample dataset: 100 records randomly selected from the British Household Panel Survey, on the condition that all cases had complete information on age, sex and socio-economic position. The data contain variables selected from wave 1 and wave 11.
5 Terminology. Unit nonresponse: complete absence of any information from a sampled individual or case. Item nonresponse: an individual who cooperates but for some reason has missing values for certain items. Attrition: in longitudinal data, attrition is the cumulative rate of unit nonresponse across waves.
6 Levels of measurement. Nominal: values are just names, e.g. 1 = male, 2 = female. Ordinal: inherent ranking, but intervals are not equal, e.g. RG’s social class. Interval: numerical, intervals are meaningful, but no true zero, e.g. the Celsius and Fahrenheit temperature scales. Ratio: numerical, meaningful intervals, zero defined, e.g. height, income.
7 How is your measure distributed? The distribution of the measure is important and needs to be specified.
8 Pattern of missingness - monotone. Percentage of missingness (Lambda) = (number of missing values / number of values) × 100. Lambda for both monotone and non-monotone missingness = 820/3500 × 100 = 23.4%.
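The Lambda calculation and the monotone check can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the toy data are hypothetical, missing values are coded as `None`, and the monotone check assumes the columns are already in wave order.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[float]]

def percent_missing(rows: Sequence[Row]) -> float:
    """Lambda: missing values as a percentage of all values in the data matrix."""
    total = sum(len(row) for row in rows)
    missing = sum(1 for row in rows for value in row if value is None)
    return 100.0 * missing / total

def is_monotone(rows: Sequence[Row]) -> bool:
    """True if, within every row, nothing is observed after the first missing value."""
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                return False  # an observed value follows a gap
    return True

# Hypothetical toy data: 3 cases, 3 ordered measurements each.
toy = [
    [1.0, 2.0, None],
    [1.0, None, None],
    [1.0, 2.0, 3.0],
]
print(round(percent_missing(toy), 1), is_monotone(toy))

# The slide's own figure: 820 missing values out of 3500.
print(round(100.0 * 820 / 3500, 1))
```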
9 Process of missingness. Missing completely at random (MCAR) assumes that missing values are a simple random sample of all data values. Missing at random (MAR) assumes that missing values are a simple random sample of all data values within subclasses defined by observed data. Missing not at random (MNAR).
10 MCAR, MAR, MNAR. Let Y represent the data, which consists of Yobs (observed data) and Ymis (missing data). Let the missingness be described by a binary variable R, where R = 1 if a value is missing and 0 otherwise. A simple way of describing the process of missingness is then to evaluate the probability P(R=1|Y) using the data Y. In MCAR, the probability does not depend on Y at all. In MAR, we assume the probability can be evaluated using Yobs alone; Ymis is not needed. In MNAR, we need both Yobs and Ymis to evaluate the probability.
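A small simulation can make the three processes concrete. The sketch below is an assumption-laden illustration rather than anything from the talk: x stands in for an always-observed variable, y for the variable that may go missing, and the missingness probabilities (0.35, 0.6, 0.1) are arbitrary choices.

```python
import random
from statistics import mean

def simulate_missingness(mechanism: str, n: int = 20000, seed: int = 0):
    """Return (mean of all y, mean of the y values that remain observed)."""
    rng = random.Random(seed)
    all_y, observed_y = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)       # always observed, so part of Yobs
        y = x + rng.gauss(0.0, 1.0)   # the variable that may go missing
        if mechanism == "MCAR":
            p_missing = 0.35                    # ignores the data entirely
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.1   # depends on observed x only
        elif mechanism == "MNAR":
            p_missing = 0.6 if y > 0 else 0.1   # depends on the value that goes missing
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        r = 1 if rng.random() < p_missing else 0  # R = 1 means missing
        all_y.append(y)
        if r == 0:
            observed_y.append(y)
    return mean(all_y), mean(observed_y)

for mech in ("MCAR", "MAR", "MNAR"):
    full, observed = simulate_missingness(mech)
    print(f"{mech}: full mean {full:.3f}, complete-case mean {observed:.3f}")
```

In this setup the complete-case mean stays close to the full mean only under MCAR. Under MAR the raw complete-case mean is also pulled away, but the distortion runs entirely through x and so can be corrected using observed data; under MNAR it cannot.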
11 Dick’s menagerie: The Ostrich, The Hawk, The Cuckoo, The Owl, The Pussycat, The Wolf.
12 The Ostrich aka listwise deletion: ignores missingness, i.e. assumes MCAR, and drops all cases with missing values. The Hawk aka ad hoc methods: the ad hoc methods used are pairwise deletion, mean substitution and last value carried forward.
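For illustration, the Ostrich and two of the Hawk's ad hoc methods can each be written in a few lines. This is a sketch with hypothetical function names, with missing values coded as `None`; pairwise deletion is left out because it depends on the analysis being run.

```python
from statistics import mean

def listwise_delete(rows):
    """The Ostrich: keep only the cases with no missing values."""
    return [row for row in rows if all(value is not None for value in row)]

def mean_substitute(column):
    """Replace each missing value with the mean of the observed values."""
    observed = [value for value in column if value is not None]
    fill = mean(observed)
    return [fill if value is None else value for value in column]

def locf(series):
    """Last value carried forward along one case's repeated measurements."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)  # stays None until a first value is observed
    return filled

print(listwise_delete([[1, 2], [None, 3], [4, 5]]))
print(mean_substitute([1.0, None, 3.0]))
print(locf([None, 5, None, None, 7, None]))
```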
13 The Cuckoo aka hot decking. Like the cuckoo, hot decking ‘steals’ from other complete records to replace missing records. The choice of the complete record is based on a set of observed variables, so that the complete and the missing records are as similar as possible. Substituting from an adjacent record is a very simple application of this principle, on the assumption that adjacent records will be very similar.
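A stratified hot deck along these lines might look as follows. It is a minimal sketch, not the Stata routine used in the talk: the field names `agegr`, `sex` and `casp19` and the toy records are assumptions, donors are drawn at random within each stratum, and a case with no donor in its stratum is left missing.

```python
import random
from collections import defaultdict

def hot_deck(records, target, strata, seed=0):
    """Fill missing `target` values from a random complete record in the same stratum."""
    rng = random.Random(seed)
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            donors[tuple(rec[key] for key in strata)].append(rec[target])
    imputed = []
    for rec in records:
        new_rec = dict(rec)  # copy, so the input records are not modified
        if new_rec[target] is None:
            pool = donors.get(tuple(rec[key] for key in strata))
            if pool:  # leave the value missing when the stratum has no donor
                new_rec[target] = rng.choice(pool)
        imputed.append(new_rec)
    return imputed

# Hypothetical toy records.
records = [
    {"agegr": 1, "sex": "f", "casp19": 40},
    {"agegr": 1, "sex": "f", "casp19": None},
    {"agegr": 2, "sex": "m", "casp19": 35},
    {"agegr": 2, "sex": "m", "casp19": 30},
    {"agegr": 2, "sex": "m", "casp19": None},
    {"agegr": 3, "sex": "f", "casp19": None},
]
completed = hot_deck(records, target="casp19", strata=("agegr", "sex"))
for rec in completed:
    print(rec)
```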
14 The Owl aka multiple imputation. Works with standard complete-data analysis methods. One set of imputations may be used for many analyses. Can be highly efficient.
16 Efficiency = 1/(1 + (proportion missing / number of imputations))
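To see how quickly efficiency rises with the number of imputations, the formula can be evaluated directly. The sketch below uses the slide's formula as written; the function name is hypothetical, and 0.234 is the proportion missing from the sample dataset (Lambda = 23.4%).

```python
def mi_efficiency(proportion_missing: float, n_imputations: int) -> float:
    """Efficiency = 1 / (1 + proportion missing / number of imputations)."""
    return 1.0 / (1.0 + proportion_missing / n_imputations)

for m in (1, 3, 5, 10, 20):
    print(f"m = {m:2d}: efficiency = {mi_efficiency(0.234, m):.3f}")
```

With about 23% missing, five imputations already give an efficiency above 0.95, which is why small values of m are often considered adequate.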
17 Rubin’s rules for combining estimates. Point estimate: average of the point estimates from each imputed sample. Variance estimate: average within-imputation variance + between-imputation variance inflated by a factor of (1 + 1/number of imputations).
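These combining rules translate directly into code. The following is a minimal sketch with a hypothetical function name and made-up inputs; it takes the between-imputation variance to be the sample variance of the point estimates (divisor m - 1), which is the usual convention.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine m point estimates and their variances from m imputed datasets."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance of the m point estimates
    total_variance = within + (1.0 + 1.0 / m) * between
    return pooled_estimate, total_variance

# Made-up coefficient estimates and squared standard errors from m = 3 imputations.
estimate, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(estimate, round(total_var, 4))
```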
18 The Pussy Cat: modelling (Heckman 2-step procedure). What is modelled? The probability of having a missing value based on fully observed characteristics (e.g. age, sex, socio-economic status) AND the model of interest (e.g. predictors of CASP-19).
19 Equations. Step 1: P(R=1) = f(age, sex, ses). Step 2: CASP-19 = f(age, sex, financial situation, social network, P(R=1)).
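For reference, the textbook form of the Heckman two-step estimator is sketched below. It is the standard formulation rather than a transcription of the slide: S = 1 - R is introduced here as an "observed" indicator, z and x are assumed covariate vectors for the two steps, and the term carried into step 2 is the inverse Mills ratio computed from the step 1 probit, not the raw probability.

```latex
\begin{aligned}
\text{Step 1 (selection, probit):}\quad
  & P(S_i = 1 \mid z_i) = \Phi(z_i^{\top}\gamma), \qquad S_i = 1 - R_i \\
\text{Step 2 (outcome, observed cases only):}\quad
  & E[\,y_i \mid x_i,\ S_i = 1\,] = x_i^{\top}\beta
    + \rho\,\sigma_{\varepsilon}\,\mathrm{IMR}(z_i^{\top}\gamma) \\
\text{Inverse Mills ratio:}\quad
  & \mathrm{IMR}(c) = \frac{\phi(c)}{\Phi(c)}
\end{aligned}
```

Here \phi and \Phi are the standard normal density and distribution function, \sigma_{\varepsilon} is the standard deviation of the outcome error, and \rho is the correlation between the error terms of the two equations. If \rho = 0 the correction term vanishes and the outcome model can be fitted to the observed cases alone.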
20 Strengths and weaknesses. Strength: useful for sensitivity analysis. If the error terms in step 1 and step 2 are significantly correlated, then MNAR should be considered. Weakness: full information is needed on the variables in step 1.
21 Setting up the illustration in Stata. Listwise: default. Hot deck: single imputation. Multiple imputation: m = 5. Heckman: ML.
22 Comparison of results from different methods used to manage missingness. Significant coefficients are emboldened. Hot deck stratification by agegr & sex. Heckman sample equation = agegr+0.06 sex ses. Rho (correlation of error terms in the selection and substantive equations) is significantly different from 0 (p < 0.0001). MNAR to be considered.
23 Advice. Don’t be an Ostrich. Ignore the Hawk. Be the Cuckoo if Lambda is small. Otherwise, use the Owl. Always stroke the Pussy Cat. Await the Wolf.