Biostatistics Case Studies 2014, Session 6: An Overview of Missing Data
Youngju Pak, Biostatistician


Goals for this talk
• Become familiar with conceptual and analytical issues in missing data
• Raise awareness of issues relevant to statistical inference when some data are missing
• Introduce general methods to prevent and treat missing data, including multiple imputation

Contents
• When and why data are missing
• Consequences
• Prevention strategies when designing studies
• Classification
• Diagnosis
• Statistical methods
• Final remarks

What is missing data?
• The term "missing data" means that we lack some type of information about the phenomena in which we are interested.
• Missing data usually appear as blank cells in data sets.
• Missing values should be distinguished from "Not Applicable".

The prevalence of missing data
• Among 3 years of publications (about 300 articles) in a prominent psychological journal, about 90% of the articles had missing data.
• The average amount of missing data was above 30%.
(Source: McKnight, PE et al. 2007, p. 3)

When do data go missing in the research process?
• Participant recruitment
  - Survey non-response
• Randomization and implementation of the treatment
  - Preference toward a particular group
  - Refusal to participate in the study after group assignment
• Data collection and maintenance
  - Subject dropout in longitudinal studies
  - Contaminated biological samples
• Data entry
• Data analysis and reporting

Some consequences of missing data
• Reliability of measurements with multiple items
  - Number of items ↓ → inaccuracy of the variance and covariance of items ↑
• Validity of study results
  - (Sample selection) Difference in characteristics between participants and non-participants → selection bias → unrepresentative sample
  - (Randomization) Data missing differentially between groups → initial nonequivalence
  - (Data analysis) Sample size ↓ → power ↓
• Generalizability of results
  - Any or all of the above → difficulty with statistical inference and interpretation of findings → inaccurate knowledge base → misinformed and possibly misleading policy recommendations

“The best solution to handle missing data is to have NONE”. R.A. Fisher

How to prevent missing data?
• Overall study design
• Characteristics of the target population and the sample
• Data collection and measurement
• Treatment implementation
• Data entry

How to prevent missing data? 1. Overall study design
• Measurement occasions and the timing of data collection
  - Avoid excessive data collection
  - Use existing information, such as an expected growth curve
• Number of variables
  - Require a strong justification for each additional variable; "just in case" is a poor justification.
• Assignment to the intervention group
  - Use separate sites or timing, a wait-list control, etc., to limit the effect of participants' preferences
  - Increase incentives as the study progresses to avoid dropout due to improvement or adverse results

How to prevent missing data? 1. Overall study design cont.
• Attrition and retention strategy
  - Differences between participants with complete data and participants with missing data can introduce bias in parameter estimates
  - Use multiple retention strategies:
    o Keeping detailed records
    o Creating a project identity
    o Developing a screening measure to identify individuals at high risk of dropout
    o Training and monitoring research staff, etc.

How to prevent missing data? 2. Characteristics of the target population
• Some strategies
  - In a survey, use wording appropriate for the target population, e.g., "did not finish" instead of "dropped out" for a group of Native Americans
  - Translate questionnaires into participants' dominant language, or use face-to-face interviews for participants with low English proficiency
  - Provide breaks during the interview for seniors
  - Assure confidentiality for sensitive topics (Singer 1995)

How to prevent missing data? 3. Data collection and measurement
• Physiological indices, e.g., blood samples lost to equipment error
  - To prevent: firm protocols, random checks, solutions developed prior to data collection
• Observation of behavior, e.g., facial expressions
  - To prevent: close enough observation distance, multiple observers

How to prevent missing data? 3. Data collection and measurement cont.
• Interviews
  - Inform participants in advance about the conditions and duration of interviews
  - Consider participants' preferences regarding interviewers
  - Order interview items from easiest to most difficult
  - Select and train interviewers carefully
  - A computer-assisted interview (e.g., SurveyMonkey) can reduce the risk of missing data

How to prevent missing data? 4. Treatment implementation & data entry
• Reduce the treatment burden
  - e.g., multiple sessions at short intervals might be more burdensome than a long-term intervention with less frequent sessions
• Improve treatment administration
  - Consider the characteristics of providers, e.g., if providers are viewed as unskilled or unfriendly, participants are more likely to drop out
  - Avoid circumstances that subjects dislike, such as a parking lot far from the study site
• Data entry
  - Double entry or random cross-checking

How to prevent missing data? Summary
• Most strategies have to do with reducing the burden on study participants.
• Feasibility must also be taken into consideration, along with costs and benefits, when selecting prevention strategies, e.g., shorter questionnaires → less missing data, but the breadth or depth of knowledge ↓
• Interventions should be designed to facilitate adherence and to prevent attrition.
• More details can be found in McKnight, et al. 2007, Chapter 4.

Missing data classification
• How best to carry out statistical inference in the presence of missing data depends on the missing data "mechanism".
• The most widely used missing data classification system was introduced by Donald Rubin (1976).
• Three distinct missing data types, based on the missing data mechanism:
  1. Missing Completely At Random (MCAR)
  2. Missing At Random (MAR)
  3. Not Missing At Random (NMAR), also written Missing Not At Random (MNAR)

Missing data classification 1. Rubin’s categories of missing data (Source: McKnight, et al. 2007)

Rubin’s categories of missing data - An example
• Suppose interest centers on determining whether the following factors affect plasma beta-carotene:
  - Age
  - Gender
  - Current smoking status
  - BMI
  - Alcohol use (average # of drinks/week)
  - Dietary beta-carotene as a covariate (mcg/day)
(Source: StatLib data, Dept of Statistics, Carnegie Mellon University)

Rubin’s categories of missing data - An example cont.
• Possibly MCAR
  - Some plasma carotene levels are missing, e.g., some blood samples were lost in transport
  - Some dietary carotene values are missing, e.g., subjects were recruited on a day when the dietician doing the diet-inventory interview called in sick
  - Some items are missing "here and there" due to erratic scanning of data collection forms, e.g., graduate students did not sleep well the night before the work day

Rubin’s categories of missing data - An example cont.
• Possibly MAR
  - Missing demographics: perhaps females tend to omit reporting weight and age
  - Missing dietary beta-carotene: overweight individuals tend to refuse the beta-carotene dietary inventory
  - Clearly not MCAR
  - May be MAR, because missingness is related to other available (observed) variables

Rubin’s categories of missing data - An example cont.
• Possibly MNAR
  - Heavy drinkers tend not to respond to drinks-per-week questions
  - Smokers are reluctant to admit smoking
  - Elderly subjects skip demographic items, such as age, due to poor design of data collection forms
  - In general: any variable for which the probability of a value being missing is related to the value itself
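The three mechanisms in the example above can be sketched with a small simulation. This is only an illustration under made-up assumptions: the distributions, the cut-offs (BMI > 28, more than 5 drinks/week), the missingness probabilities, and the names simulate_missingness and observed_mean are all hypothetical and do not come from the plasma beta-carotene data.

```python
import random

def simulate_missingness(n=2000, seed=0):
    """Simulate drinks/week with three hypothetical missingness indicators."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        bmi = rng.gauss(26, 4)              # always observed
        drinks = max(0.0, rng.gauss(3, 3))  # variable that may go missing
        rows.append({
            "bmi": bmi,
            "drinks": drinks,
            # MCAR: every value has the same 20% chance of being lost
            "miss_mcar": rng.random() < 0.20,
            # MAR: chance of loss depends only on the observed BMI
            "miss_mar": rng.random() < (0.35 if bmi > 28 else 0.10),
            # MNAR: chance of loss depends on the (unobserved) value itself
            "miss_mnar": rng.random() < (0.50 if drinks > 5 else 0.10),
        })
    return rows

def observed_mean(rows, mask):
    """Mean of drinks/week among rows not flagged missing under `mask`."""
    vals = [r["drinks"] for r in rows if not r[mask]]
    return sum(vals) / len(vals)

rows = simulate_missingness()
true_mean = sum(r["drinks"] for r in rows) / len(rows)
means = {m: observed_mean(rows, m) for m in ("miss_mcar", "miss_mar", "miss_mnar")}
print(round(true_mean, 2), {k: round(v, 2) for k, v in means.items()})
```

In this sketch the MNAR mask drops heavy drinkers preferentially, so the mean of the remaining values falls below the full-data mean, while the MCAR mask leaves it roughly unchanged. The MAR mask also leaves it roughly unchanged here, but only because the simulated drinks variable is independent of BMI.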

Missing data classification 2. Dimension of missing data
• Missing on the variable (item nonresponse)
• Missing on the occasion (wave nonresponse)
• Missing on the individual (unit nonresponse)
[Figure: Cattell’s data box (1966), with axes Individuals, Variables/Items, and Occasions]
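One way to make the three dimensions concrete is to scan a small data box and label each gap. The sketch below assumes a nested-list layout box[individual][occasion][variable] with None for a missing value; the layout, the function name classify_missingness, and the toy values are illustrative assumptions, not part of the original slides.

```python
def classify_missingness(box):
    """Label gaps in box[i][t][v] (individual i, occasion t, variable v; None = missing)."""
    summary = {"unit": [], "wave": [], "item": []}
    for i, occasions in enumerate(box):
        # Unit nonresponse: nothing at all was recorded for this individual
        if all(v is None for occ in occasions for v in occ):
            summary["unit"].append(i)
            continue
        for t, occ in enumerate(occasions):
            n_missing = sum(v is None for v in occ)
            if n_missing == len(occ):
                # Wave nonresponse: a whole occasion is missing
                summary["wave"].append((i, t))
            elif n_missing > 0:
                # Item nonresponse: only some variables are missing at this occasion
                summary["item"].extend((i, t, v) for v, x in enumerate(occ) if x is None)
    return summary

box = [
    [[1.0, 2.0], [1.1, 2.1]],      # complete
    [[1.0, None], [1.2, 2.2]],     # one item missing at occasion 0
    [[1.0, 2.0], [None, None]],    # occasion 1 missing entirely
    [[None, None], [None, None]],  # individual never responded
]
result = classify_missingness(box)
print(result)
```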

Missing data classification 3. Mechanism or dimension?
• Less missing data (m.d.) is better, in general.
• For parameter estimation, a large proportion of MCAR data might be preferable to a smaller amount of MAR or MNAR data. Nonetheless, statistical power will be lower.
• The amount of m.d., in combination with the reason, dimension, and mechanism, should be considered in the diagnosis and treatment of m.d.

Diagnostic procedure
• Diagnosis plays an important role in selecting appropriate missing data techniques as well as in interpreting study findings (inferential limitations).
• MCAR diagnostics
  - Two-sample t-test, comparing cases with and without missing values on another observed variable
    o Not effective for multivariate data
  - Little’s MCAR test (1988)
    o A type of chi-square test
    o A significant p-value means the data are not MCAR
    o Available in SPSS
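The two-sample t-test diagnostic can be sketched as follows: split an observed variable by whether another variable is missing, then compare the group means. The data are simulated, the missingness rule (dietary intake more often missing above BMI 28) is a hypothetical choice for illustration, and welch_t is a hand-written Welch statistic rather than a call into any statistics package.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch two-sample t statistic for the difference in means of a and b."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

rng = random.Random(1)
bmi = [rng.gauss(26, 4) for _ in range(1000)]
# Hypothetical indicator: dietary intake is missing more often at high BMI (so not MCAR)
diet_missing = [rng.random() < (0.40 if x > 28 else 0.10) for x in bmi]

bmi_when_missing = [x for x, m in zip(bmi, diet_missing) if m]
bmi_when_observed = [x for x, m in zip(bmi, diet_missing) if not m]
t_stat = welch_t(bmi_when_missing, bmi_when_observed)
print(round(t_stat, 2))
```

A large absolute t statistic suggests that missingness is related to BMI, which is evidence against MCAR. A small one does not establish MCAR, since the test looks at only one variable at a time.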

Diagnostic procedure cont.
• MCAR or MAR (ignorable) vs. NMAR (non-ignorable)?
  - No numerical or graphical test exists.
  - m.d. are non-ignorable when no information is available to explain why the data are missing.
  - Look at sources outside the study data, such as previous findings, double sampling, or intensive follow-up of non-respondents.
  - Schafer (1997) provides guidance on when ignorability is plausible and when it is not.

Handling m.d. in data analytic procedures
• Four different methods
  1. Data deletion method
  2. Data augmentation method
  3. Single imputation method (SI)
  4. Multiple imputation method (MI)

Handling m.d. in data analytic procedures cont. 1. Data deletion method
• Complete case method (listwise deletion)
  - Discard observations with any missing value and include only complete cases
  - Easy to implement
  - If MCAR, parameter estimates are unbiased
  - Can reduce power substantially
• Available case method (pairwise deletion)
  - Discard data only at the level of the variable
  - Can preserve a larger portion of the sample
  - If MCAR, parameter estimates are unbiased
  - Results in a different sample size for each statistic, such as each correlation → stability ↓
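The difference between the two deletion methods shows up in the sample sizes they leave. A minimal sketch on a made-up five-row data set (the values and the helper names listwise and pairwise_n are illustrative only):

```python
def listwise(rows):
    """Keep only rows with no missing value in any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_n(rows, x, y):
    """Number of rows usable for an analysis involving only variables x and y."""
    return sum(r[x] is not None and r[y] is not None for r in rows)

rows = [
    {"age": 40, "bmi": 24.0, "alcohol": 2.0},
    {"age": 55, "bmi": None, "alcohol": 0.0},
    {"age": None, "bmi": 31.5, "alcohol": 7.0},
    {"age": 62, "bmi": 27.1, "alcohol": None},
    {"age": 35, "bmi": 22.8, "alcohol": 1.0},
]
n_listwise = len(listwise(rows))
n_pairs = {(x, y): pairwise_n(rows, x, y)
           for x, y in [("age", "bmi"), ("age", "alcohol"), ("bmi", "alcohol")]}
print(n_listwise, n_pairs)
```

Listwise deletion keeps the same rows for every analysis, while each pairwise count is based on a different subset of rows, which is why correlations computed this way can be mutually inconsistent.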

Handling m.d. in data analytic procedures cont. 2. Data augmentation
• Avoids many of the inherent problems of deletion methods.
• Does not explicitly replace missing values. Instead, an algorithm estimates the parameters while taking into account the observed data, the missing data, the relationships among the observed data, and some underlying statistical assumptions.
• Examples: maximum likelihood (ML), expectation-maximization (EM), Markov chain Monte Carlo (MCMC), the dummy variable method, and the weighting method.
Note: SPSS provides built-in procedures for listwise, pairwise, EM, and regression methods of estimation (Analyze → Missing Value Analysis).

Handling m.d. in data analytic procedures cont. 3. Single Imputation (SI)
• Replace each missing value with a single value
• Replace with:
  - Constant: zero, mean, median
  - Random
    o Hot deck: randomly select a value from the observed data
    o Cold deck: use another data set to replace missing values
  - Nonrandom
    o Last Observation Carried Forward (LOCF)
    o Next Observation Carried Backward (NOCB)
    o Regression predictions
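Three of the single-imputation rules listed above can be sketched in a few lines each. The function names and the toy series are illustrative assumptions; the hot deck here draws from all observed values, whereas practical hot-deck schemes usually draw from similar "donor" cases.

```python
import random
import statistics

def impute_mean(values):
    """Constant imputation: replace each gap with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = statistics.mean(observed)
    return [m if v is None else v for v in values]

def impute_hot_deck(values, rng):
    """Random imputation: replace each gap with a randomly drawn observed value."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

def impute_locf(values):
    """Carry the last observed value forward; gaps before the first observation stay missing."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

series = [5.0, None, 7.0, None, None, 9.0]  # repeated measurements on one subject
print(impute_mean(series))
print(impute_hot_deck(series, random.Random(0)))
print(impute_locf(series))
```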

Handling m.d. in data analytic procedures cont. 3. Single Imputation (SI) cont.
• SI is generally acceptable with a small amount (< 5%) of m.d.
• SI tends to underestimate standard errors, which inflates the type I error rate
  - It ignores the uncertainty in the imputed values
  - Performance may depend on the variability of the items with missing values
• SI tends to perform poorly even when the missing data mechanism is ignorable.
• MI is considered a superior alternative, particularly in the MAR case.
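The understated standard error can be seen directly for mean imputation. In the sketch below, which uses simulated data and assumes that 60 of 200 values are missing completely at random, filling the gaps with the observed mean leaves the sum of squared deviations unchanged while the divisor grows, so the sample variance shrinks by exactly (n_obs - 1)/(n - 1).

```python
import math
import random
import statistics

rng = random.Random(2)
n, n_obs = 200, 140
observed = [rng.gauss(0, 1) for _ in range(n_obs)]  # the 140 values that were recorded

# Mean imputation: fill the 60 gaps with the observed mean
completed = observed + [statistics.mean(observed)] * (n - n_obs)

var_obs = statistics.variance(observed)
var_completed = statistics.variance(completed)
shrink = (n_obs - 1) / (n - 1)  # theoretical shrinkage factor

# Naive standard error of the mean, treating imputed values as if they were real data
se_naive = math.sqrt(var_completed / n)
# Standard error based on the observed values only
se_observed = math.sqrt(var_obs / n_obs)
print(round(var_completed / var_obs, 4), round(shrink, 4))
print(round(se_naive, 4), round(se_observed, 4))
```

The naive standard error is smaller than the one computed from the observed cases alone, even though no new information has been added. This is the mechanism behind the inflated type I error.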

Handling m.d. in data analytic procedures cont. 4. Multiple Imputation (MI)
• MI replaces each missing value with a set of plausible values drawn from an assumed distribution.
• Procedure: impute multiple times (from 3 to 10), repeat the analysis on each completed data set, and aggregate the results across the analyses.
• Pros
  - Provides sound parameter estimates
  - Among the most highly praised methods for statistically handling missing data (Allison 2002, Rubin 1996, Schafer & Graham 2002)
• Cons
  - Requires a substantial sample size
  - The optimal choice of technique is often unclear
  - May be difficult for less experienced researchers, because the distribution and its assumptions must be specified
  - Sensitivity analysis is recommended
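The aggregation step is usually done with Rubin's rules: the pooled estimate is the average of the per-imputation estimates, and the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. A minimal sketch follows; the function name pool_rubin and the five coefficient/standard-error pairs are made up for illustration and are not results from the slides.

```python
import math
import statistics

def pool_rubin(estimates, std_errors):
    """Pool M per-imputation estimates and standard errors using Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)                 # pooled point estimate
    w = statistics.mean(se ** 2 for se in std_errors)  # within-imputation variance
    b = statistics.variance(estimates)                 # between-imputation variance (divisor M - 1)
    total_var = w + (1 + 1 / m) * b
    return q_bar, math.sqrt(total_var)

# Hypothetical regression coefficient estimated on M = 5 imputed data sets
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
std_errors = [0.10, 0.10, 0.10, 0.10, 0.10]
pooled_est, pooled_se = pool_rubin(estimates, std_errors)
print(round(pooled_est, 3), round(pooled_se, 4))
```

The pooled standard error is larger than any single-imputation standard error because the between-imputation term carries the uncertainty about the missing values that single imputation ignores.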

Handling m.d. in data analytic procedures cont. 4. Multiple Imputation (MI) cont.
• Plasma Beta-Carotene Example
  - Dependent variable: natural log of plasma beta-carotene concentration
  - Independent variables: age, gender, current smoking status, BMI, alcohol use, dietary beta-carotene (logged)
  - Complete data: N = 315
  - Second data set with data MAR: N = 216 complete cases
  - Regression analyses compared:
    o Complete data
    o Listwise deletion
    o Multiple imputation, with number of imputations M = 10

Handling m.d. in data analytic procedures cont. 4. Multiple Imputation (MI) cont.
• Plasma Beta-Carotene Example cont.
[Table: Complete Data (N = 315). Columns: Parameter, Parameter Estimate, Standard Error, t Value, Prob t. Rows: Intercept, Age, Female, CurSmoker, BMI, Alcohol, LBeta_Diet. Numeric entries are not preserved in this transcript, apart from Prob t < .0001 for Intercept and BMI.]
[Table: Listwise deletion (N = 216). Same columns and rows. Numeric entries are not preserved in this transcript.]

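Handling m.d. in data analytic procedures cont. 4. Multiple Imputation (MI) cont.
• Plasma Beta-Carotene Example cont.
[Table: Complete Data Results. Columns: Parameter, Parameter Estimate, Standard Error, t Value, Prob t. Rows: Intercept, Age, Female, CurSmoker, BMI, Alcohol, LBeta_Diet. Numeric entries are not preserved in this transcript, apart from Prob t < .0001 for Intercept and BMI.]
[Table: Multiply Imputed Analysis (M = 10). Columns: Estimate, Standard Error, Min, Max, t Value, Prob t. Numeric entries are not preserved in this transcript.]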

Recommended readings for MI
• UW-Madison Social Science Computing Cooperative: readings.htm
• UCLA Institute for Digital Research and Education: missing_data/mi_in_stata_pt1.htm

Final Remarks
• There is no recipe for a single best approach!
• An optimal solution for a particular analysis requires consideration of:
  - Dimensions of missing data
  - The missing data mechanism
  - Reasons for missing data
  - Data types of the variables that are missing
  - Objectives of the study

Final Remarks cont.
• Try to minimize missing data when designing studies.
• Nonetheless, some data can still be missing.
• When data are missing, investigate the reason, dimension, and mechanism to choose the appropriate treatment.
• Deletion methods are sometimes acceptable (e.g., MCAR with about 5% or less missing).

Final Remarks cont.
• Multiple imputation is known to perform well in many cases.
• Distributional assumptions and data types are key components of MI, so it might be hard to implement for less experienced researchers.
• Seek professional help when you consider complicated methods such as multiple imputation.

References
• McKnight, PE, et al. (2007). Missing Data: A Gentle Introduction. New York: The Guilford Press.
• Allison, PD (2002). Missing Data. Thousand Oaks, CA: Sage.
• Little, RJA & Rubin, DB (2002). Statistical Analysis with Missing Data, 2nd ed. New York: Wiley.
• Rubin, DB (1976). Inference and missing data. Biometrika, 63.
• Rubin, DB (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
• Schafer, JL (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
• White, Royston & Wood (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine.
• Van Buuren (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research.