Diagnostic methods for checking multiple imputation models Cattram Nguyen, Katherine Lee, John Carlin Biometrics by the Harbour, 30 Nov, 2015.



Motivating example: Longitudinal Study of Australian Children (LSAC)
- 5107 infants (0-1 year) recruited in 2004
- Data collection has occurred every 2 years

Relationship between harsh parental discipline and behavioural problems Bayer et al. (2011) Pediatrics. 128(4):e

Missing data in LSAC
Complete data were observed for 3163 (62%) participants.

Variable                  Number missing   Percentage
Conduct problems                     896          18%
Harsh parenting                     1601          31%
Gender                                 0           0%
Socioeconomic position               505          10%
Financial hardship                   533          10%
Psychological distress               688          13%
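As a quick arithmetic check, the percentages on this slide can be recomputed from the counts and the cohort size of 5107 given on the motivating-example slide. This is a minimal sketch; the variable and function names are illustrative and do not come from the talk.

```python
# Counts transcribed from the slide; 5107 is the cohort size stated earlier.
N_PARTICIPANTS = 5107
N_COMPLETE = 3163

missing_counts = {
    "Conduct problems": 896,
    "Harsh parenting": 1601,
    "Gender": 0,
    "Socioeconomic position": 505,
    "Financial hardship": 533,
    "Psychological distress": 688,
}


def percent_of_cohort(count, total=N_PARTICIPANTS):
    """Return count as a percentage of the cohort, rounded to a whole number."""
    return round(100 * count / total)


missing_percent = {name: percent_of_cohort(n) for name, n in missing_counts.items()}
complete_case_percent = percent_of_cohort(N_COMPLETE)

for name, pct in missing_percent.items():
    print(f"{name:<24}{missing_counts[name]:>6}{pct:>5}%")
print(f"Complete cases: {N_COMPLETE} ({complete_case_percent}%)")
```

The recomputed values agree with the rounded percentages shown on the slide.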

Proposed imputation model
- Multivariate imputation by chained equations (MICE)
- Variables in the imputation model:
  - Analysis model variables
  - Auxiliary variables (22 variables)
  - No transformation of skewed variables
  - Outcome variable included as continuous variable (not dichotomised)
- Created 40 imputed datasets

Proposed imputation diagnostics
1. Graphical comparisons of the observed and imputed data
2. Numerical comparisons of the observed and imputed data
3. Standard regression diagnostics
4. Cross-validation
5. Posterior predictive checking

Graphical comparisons of the observed and imputed data

Summary: graphical comparisons of observed and imputed data
- Exploring the imputed data
- Challenge when working with large numbers of imputed variables
- Difficulty interpreting differences when data are not MCAR


Numerical comparisons of the observed and imputed data
- Formally test for differences between the observed and imputed data
- Highlight variables that may be of concern
- Overcome the challenge of checking all imputed variables
- Proposed numerical methods:
  - Compare means (difference in means greater than 2)
  - Compare variances (ratio of variances less than 0.5)
  - Kolmogorov-Smirnov test (p-value < 0.05)
Abayomi, K. et al. (2008). Journal of the Royal Statistical Society Series C
Stuart, E. et al. (2009). American Journal of Epidemiology
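The three numerical checks can be sketched for a single variable as follows. This is an illustrative, standard-library-only sketch rather than the authors' implementation. Two assumptions are made here that the slide does not state: the mean difference is measured in units of the observed-data standard deviation, and the Kolmogorov-Smirnov p-value uses the large-sample (asymptotic) approximation. All function names and the toy data are hypothetical.

```python
import math
from statistics import mean, variance


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every value <= x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution (an approximation)."""
    if d == 0:
        return 1.0
    lam = d * math.sqrt(n * m / (n + m))
    series = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                 for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2 * series))


def flag_variable(observed, imputed, mean_gap=2.0, var_ratio=0.5, alpha=0.05):
    """Apply the three checks to one variable and report which ones flag it."""
    # ASSUMPTION: the mean difference is scaled by the observed-data SD.
    gap = abs(mean(imputed) - mean(observed)) / math.sqrt(variance(observed))
    ratio = variance(imputed) / variance(observed)
    p = ks_pvalue(ks_statistic(observed, imputed), len(observed), len(imputed))
    return {
        "mean_flag": gap > mean_gap,
        "variance_flag": ratio < var_ratio,
        "ks_flag": p < alpha,
    }


# Hypothetical toy data: imputations that are far too concentrated.
observed_values = [float(v) for v in range(1, 21)]
narrow_imputations = [10.0, 10.1, 10.2, 9.9, 9.8, 10.0]
print(flag_variable(observed_values, narrow_imputations))
```

In this toy case only the variance-ratio check fires: the imputed values share the observed mean but have almost no spread.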

Simulation evaluation of the Kolmogorov-Smirnov test
- Simulated incomplete datasets
- Deliberately misspecified imputation models
- Results:
  - Not useful under MAR
  - Kolmogorov-Smirnov p-values did not correspond to bias/RMSE
  - KS test p-values depend on sample size and amount of missing data
Nguyen C, Carlin J, Lee K (2013). BMC Medical Research Methodology 13:144


Regression diagnostics
- Possible to check the goodness of fit of imputation models using established regression diagnostic tools
  - Residuals, outliers, influential values
White et al. Statistics in Medicine.


Cross-validation
- Assess the predictive performance of the imputation model
- Delete each observed value in turn and use the imputation model to impute the withheld values
Gelman et al. (2005). Biometrics
Honaker et al. (2011). Journal of Statistical Software
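The leave-one-out idea can be sketched as below. As a simplifying assumption, the "imputation model" here is a single-predictor least-squares regression and each withheld value is replaced by its predicted mean; a proper imputation would draw from the predictive distribution. The function names and toy data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_one_out_predictions(xs, ys):
    """Withhold each observed y in turn and predict it from the remaining data."""
    predictions = []
    for i in range(len(ys)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        predictions.append(intercept + slope * xs[i])
    return predictions


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted))
                     / len(observed))


# Hypothetical toy data, used only to show the mechanics.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
outcome = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
predicted = leave_one_out_predictions(predictor, outcome)
for o, p in zip(outcome, predicted):
    print(f"observed={o:5.1f}  predicted={p:6.2f}")
print(f"RMSE = {rmse(outcome, predicted):.3f}")
```

Plotting the predicted values against the observed ones gives the kind of display used to judge the imputation model's predictive performance.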

Cross-validation
Figure: plot of imputed/predicted vs observed values

Summary: cross-validation
- Advantage
  - Can be used to assess imputations produced by any method
- Disadvantages
  - Can only assess adequacy of the imputation model within range of observed values
  - Focuses on predictive ability of the imputation model (does not investigate relationships between variables)


Posterior predictive checking
- Assesses model adequacy with respect to target parameters
- "Replicated" datasets are simulated from the imputation model
- Analyses of interest are applied to replicated datasets

Diagram (based on He and Zaslavsky, 2011): duplicate and concatenate the data, then apply the imputation model to obtain L completed datasets (1st, 2nd, …, Lth) and L replicated datasets (1st, 2nd, …, Lth).
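Once the analysis of interest has been run on each completed dataset and on its paired replicated dataset, the two sets of estimates can be summarised by a posterior predictive p-value. The sketch below assumes this p-value is the proportion of pairs in which the replicated estimate is at least as large as the completed one; that definition, the function name and the numbers are assumptions made for illustration and are not taken from the slides.

```python
def posterior_predictive_pvalue(completed, replicated):
    """Proportion of paired draws with replicated estimate >= completed estimate."""
    if not completed or len(completed) != len(replicated):
        raise ValueError("need two non-empty sequences of equal length")
    exceed = sum(1 for c, r in zip(completed, replicated) if r >= c)
    return exceed / len(completed)


# Hypothetical estimates of one regression coefficient from L = 5 pairs.
completed_estimates = [0.52, 0.48, 0.55, 0.50, 0.47]
replicated_estimates = [0.49, 0.51, 0.53, 0.46, 0.50]

p_value = posterior_predictive_pvalue(completed_estimates, replicated_estimates)
print(f"posterior predictive p-value = {p_value:.2f}")
```

Under this definition, values close to 0 or 1 would suggest that the imputation model reproduces the target quantity poorly, while values near 0.5 would suggest no systematic discrepancy.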

Simulation evaluation of posterior predictive checking
- Simulated incomplete datasets under MAR
- Deliberately misspecified imputation models
Imputation models: 1 = de-skewing, 2 = no de-skewing, 3 = no auxiliary variables, 4 = no outcome variables

Posterior predictive checking: summary
- Advantages
  - Versatile: can be used to check any imputation model
  - Focuses on the effect of the imputation model on target quantities of interest
- Disadvantages
  - Computationally intensive
  - Usefulness diminishes with increased amounts of missing data
Nguyen, C. D., Lee, K. J. and Carlin, J. B. (2015), Posterior predictive checking of multiple imputation models. Biometrical Journal

Posterior predictive checking: logistic regression coefficients
Table columns: Completed, Replicated, pbcom
Table rows: Harsh parenting, Gender, Socioeconomic position, Financial hardship, Psychological distress

Summary
- Graphical diagnostics useful for exploring imputed data
- Numerical comparisons (e.g. KS test) not recommended
- PPC was useful for assessing the model with respect to target parameters
- All methods have strengths and limitations

References
Abayomi, K., Gelman, A., & Levy, M. (2008). Diagnostics for multivariate imputations. Journal of the Royal Statistical Society Series C-Applied Statistics, 57,
Bayer, J. K., Ukoumunne, O. C., Lucas, N., Wake, M., Scalzo, K., & Nicholson, J. M. (2011). Risk Factors for Childhood Mental Health Symptoms: National Longitudinal Study of Australian Children. Pediatrics, 128, e doi: /peds
Gelman, A., Van Mechelen, I., Verbeke, G., Heitjan, D. F., & Meulders, M. (2005). Multiple imputation for model checking: Completed-data plots with missing and latent data. Biometrics, 61(1),
He, Y., & Zaslavsky, A. M. (2011). Diagnosing imputation models by applying target analyses to posterior replicates of completed data. Statistics in Medicine, 31(1), doi: /sim.4413
Nguyen, C., Carlin, J., & Lee, K. (2013). Diagnosing problems with imputation models using the Kolmogorov-Smirnov test: a simulation study. BMC Medical Research Methodology, 13(1), 1-9. doi: /
Nguyen, C. D., Lee, K. J. and Carlin, J. B. (2015), Posterior predictive checking of multiple imputation models. Biometrical Journal
Stuart, E. A., Azur, M., Frangakis, C., & Leaf, P. (2009). Multiple Imputation With Large Data Sets: A Case Study of the Children's Mental Health Initiative. American Journal of Epidemiology, 169(9), doi: /aje/kwp026

Acknowledgements
Missing data group: John Carlin, Katherine Lee, Julie Simpson, Jemisha Apajee, Alysha Madhu De Livera, Anurika De Silva, Panteha Hayati Rezvan, Emily Karahalios, Margarita Moreno Betancur, Laura Rodwell, Helena Romaniuk, Thomas Sullivan
Funding: ViCBiostat