Handling Missing Data on ALSPAC
Paul Clarke (CMPO, University of Bristol)
ALSPAC Social Science User Group meeting, 21 May 2008
Outline
What causes missing data?
Types of missing data
Methods for missing data: quick overview
ALSPAC Blitz on non-respondents
Investigating MNAR data in ALSPAC
Example ALSPAC analysis
At age 11
Outcome: Mood (ordinal, 3 categories)
  – Depressive symptoms, maternally rated
Main exposure: Physical activity (score)
  – Measured on actigraph, 3 days
Adjustment:
  – BMI (score)
  – Sex, Age at screening
Ordinal logistic regression
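The slide does not state how the ordinal logistic regression is parameterised. As a sketch, assuming the common proportional-odds form, with Mood coded Y = 1, 2, 3 and symbols that are illustrative assumptions rather than the author's notation, the analysis model could be written as:

```latex
\operatorname{logit} P(Y_i \le k \mid \mathbf{x}_i)
  = \theta_k - \left(\beta_1\,\mathrm{PhysAct}_i + \beta_2\,\mathrm{BMI}_i
    + \beta_3\,\mathrm{Sex}_i + \beta_4\,\mathrm{Age}_i\right),
  \qquad k = 1, 2, \quad \theta_1 < \theta_2 .
```

Under this assumed form the coefficient on physical activity is the same for both cut-points, which is what makes it a single "main exposure" effect.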
Missing Value (MV) pattern 1
[Table: MV Pattern by Sex, Mood, BMI, Physical Activity, Age; "?" marks a missing value]
Entries as transcribed: ?? 1204; ??? 5397; ????; 858 ?
Monotone: 90%; Non-monotone: 95%
1 All MV patterns < 200 cases ignored
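A minimal sketch of how MV patterns like those on this slide might be tabulated and classified as monotone or non-monotone. It assumes each case is a dictionary with None for a missing value; the variable names, the column order in `ORDER`, and the toy records are hypothetical and are not ALSPAC data.

```python
from collections import Counter

# Hypothetical column order, from most to least often observed.
ORDER = ["sex", "age", "bmi", "mood", "phys_act"]

def mv_pattern(record):
    """Return a string such as 'OOO??': 'O' = observed, '?' = missing."""
    return "".join("?" if record.get(v) is None else "O" for v in ORDER)

def is_monotone(pattern):
    """Monotone in this column order: once a value is missing, all later ones are too."""
    return "?O" not in pattern

def tabulate(records):
    """Count how many records fall into each MV pattern."""
    return Counter(mv_pattern(r) for r in records)

# Toy records, invented for illustration only.
records = [
    {"sex": 1, "age": 11.2, "bmi": 0.3, "mood": 0, "phys_act": 1.5},
    {"sex": 0, "age": 11.0, "bmi": -0.1, "mood": 1, "phys_act": None},
    {"sex": 1, "age": 11.4, "bmi": None, "mood": None, "phys_act": None},
    {"sex": 0, "age": 11.1, "bmi": None, "mood": 2, "phys_act": 0.7},
]

counts = tabulate(records)
for pattern, n in counts.most_common():
    label = "monotone" if is_monotone(pattern) else "non-monotone"
    print(pattern, n, label)
```

Whether a pattern counts as monotone depends on the column order chosen, so in practice the order would be picked to make as many cases as possible monotone.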
What causes missing data?
[Diagram labels: Non-contact; Refusal; Letter; Telephone calls; Interviewer visits; Interviewer effectiveness; Incentive for participant; Loyalty; Fail to attend clinic; Incomplete questionnaire; Parent characteristics; Parent & child characteristics; Follow-up]
Result of processes leading to:
  – Refusal to answer questions (item)
  – Refusal to participate (unit)
  – No contact (unit)
  – Longitudinal-specific: attrition & drop-out
Non-response mechanism(s) (NRM)
Rubin's definitions 1
Missing Completely At Random (MCAR)
  – NRM independent of observed and missing variables
Missing At Random (MAR)
  – NRM depends only on observed variables
Missing Not At Random (MNAR)
  – NRM depends on missing variables too
1 Little & Rubin (2002) Statistical Analysis with Missing Data
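The practical difference between the three mechanisms can be illustrated with a small simulation. This is a sketch under invented assumptions, not an ALSPAC analysis: X is always observed, Y = X + noise has true mean 0, and the missingness probabilities (0.5, logistic(2x), logistic(2y)) are arbitrary choices made only to show the contrast.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate(mechanism, n=20000, seed=1):
    """Return (xs, ys, obs); obs[i] is False where ys[i] would be missing."""
    rng = random.Random(seed)
    xs, ys, obs = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = x + rng.gauss(0, 1)           # true mean of Y is 0
        if mechanism == "MCAR":
            p_miss = 0.5                   # unrelated to X or Y
        elif mechanism == "MAR":
            p_miss = logistic(2 * x)       # depends on the observed X only
        elif mechanism == "MNAR":
            p_miss = logistic(2 * y)       # depends on the possibly missing Y
        else:
            raise ValueError(mechanism)
        xs.append(x)
        ys.append(y)
        obs.append(rng.random() >= p_miss)
    return xs, ys, obs

def complete_case_mean(ys, obs):
    vals = [y for y, o in zip(ys, obs) if o]
    return sum(vals) / len(vals)

def regression_adjusted_mean(xs, ys, obs):
    """Fit Y ~ X by least squares on complete cases, then average predictions over all X."""
    cx = [x for x, o in zip(xs, obs) if o]
    cy = [y for y, o in zip(ys, obs) if o]
    mx, my = sum(cx) / len(cx), sum(cy) / len(cy)
    sxy = sum((a - mx) * (b - my) for a, b in zip(cx, cy))
    sxx = sum((a - mx) ** 2 for a in cx)
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum(intercept + slope * x for x in xs) / len(xs)

results = {}
for mech in ("MCAR", "MAR", "MNAR"):
    xs, ys, obs = simulate(mech)
    results[mech] = (complete_case_mean(ys, obs),
                     regression_adjusted_mean(xs, ys, obs))
    print(mech, "complete-case mean: %.2f" % results[mech][0],
          "adjusted mean: %.2f" % results[mech][1])
```

In this toy set-up the complete-case mean should sit near zero only under MCAR. Adjusting for X should recover the mean under MAR, while under MNAR the adjusted estimate remains biased.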
Directed Acyclic Graph (DAG): MCAR data
[Diagram: nodes Y, X, C, R]
R independent of the data
MAR data
[Diagram: nodes Y, X, C, R]
R indirectly related to Y through X and C
Methods for MAR data
Complete cases analysis/Listwise deletion
Weighting
  – Weighting classes, post-stratification
(Single) imputation methods
  – e.g. regression, hot-deck/nearest-neighbour
Multiple imputation methods
  – e.g. Norm, MICE
Semiparametric estimators
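Of the methods listed, weighting classes is the simplest to show in a few lines. The sketch below assumes the class variable is observed for everyone and that non-response is MAR given the class. Each respondent is weighted by the inverse of the response rate in their class. The function names and toy data are hypothetical.

```python
from collections import defaultdict

def weighting_class_weights(classes, responded):
    """Weight for each class = 1 / (response rate in that class)."""
    total = defaultdict(int)
    resp = defaultdict(int)
    for c, r in zip(classes, responded):
        total[c] += 1
        if r:
            resp[c] += 1
    # Classes with no respondents get no weight; they cannot be represented.
    return {c: total[c] / resp[c] for c in total if resp[c] > 0}

def weighted_mean(values, classes, responded):
    """Mean of the respondents' values, weighted by inverse class response rate."""
    w = weighting_class_weights(classes, responded)
    num = den = 0.0
    for v, c, r in zip(values, classes, responded):
        if r:
            num += w[c] * v
            den += w[c]
    return num / den

# Toy data, invented: class "B" responds far less often than class "A".
classes   = ["A", "A", "A", "A", "B", "B", "B", "B"]
responded = [True, True, True, True, True, False, False, False]
values    = [1.0, 1.0, 1.0, 1.0, 5.0, None, None, None]

respondent_values = [v for v, r in zip(values, responded) if r]
unweighted = sum(respondent_values) / len(respondent_values)
weighted = weighted_mean(values, classes, responded)
print("unweighted:", unweighted, "weighted:", weighted)
```

The single respondent in class "B" stands in for the three non-respondents in that class, which pulls the weighted mean towards the under-represented group.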
Imputation in practice: pitfalls 1
Omitting the outcome
Imputing non-normal variables
MAR completely implausible
Convergence of iterative procedures
1 Sterne et al. (2008) British Medical Journal
Complex methods
Analysis model
  – e.g. Ordinal logistic regression
Imputation model: Missing given Observed
ALL assume MAR data
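With multiple imputation, the analysis model is fitted to each imputed data set and the results are then combined. The slides do not show the combining step, so the following is a sketch of the standard rules (Rubin's rules) for a single coefficient. The function name and the example numbers are invented for illustration.

```python
import math

def pool_rubin(estimates, variances):
    """Combine m point estimates and their squared standard errors.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = sum(estimates) / m                               # pooled estimate
    within = sum(variances) / m                              # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between                   # total variance
    return q_bar, math.sqrt(total)

# Invented example: one coefficient estimated on three imputed data sets.
estimates = [0.10, 0.14, 0.12]
variances = [0.0004, 0.0004, 0.0004]

pooled, se = pool_rubin(estimates, variances)
print("pooled estimate: %.3f, standard error: %.4f" % (pooled, se))
```

The between-imputation term is what carries the extra uncertainty due to the missing values; it is valid only to the extent that the imputation model and the MAR assumption hold.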
MAR data in reality
[Diagram: nodes Y, X, C, R and an unknown factor "?"]
Unknown factors drive non-response
…correlated with model predictors
…but not with Y
Why is this important?
Weakness of MAR: How do we know?
Central problem: missing data is missing!
MAR is a leap of faith
MNAR data
[Diagram: nodes Y, X, C, R and unknown factors "?"]
Unknowns directly correlated with Y
Physical activity example
[Diagram: nodes Mood, Phys Act, R; BMI, Sex, Age; unknown factor "?"]
NRM is mother-driven (child age 11)
Child must wear actigraph for 3 days
Mother must assess her child's mood
ALSPAC Blitz
Co-ordinated by Family Liaison Unit
4 tranches: Nov 2007-May 2008
Target 5000 teenagers not in last 2 waves
Mini-clinic for the difficult to persuade
Proposed analysis
MAR is context dependent
Risky behaviours (Glyn Lewis et al.)
  – Outcomes: Cannabis use, sexual practices, etc.
  – Risk factors: mental health, sensation seeking, etc.
Basic analysis:
  – Compare follow-up with main sample
  – Still differences after adjustment?
Unit non-response
100% follow-up rate unlikely!
Directly model NRM
Continuum of non-response
  – Hard to contact: less like main sample
  – Weighting scheme (Alho 1990; Wood et al)
Lower bound for MNAR bias
Item non-response
Parallel qualitative post
Items: questions on risky behaviours
What mechanisms drive non-response?
Test hypotheses from this project