Applied Epidemiologic Analysis - P8400, Fall 2002. Lab 10: Missing Data. Henian Chen, M.D., Ph.D.


What are missing data?
Data sets contain various codes that indicate a lack of response, such as “Don’t know,” “Refused,” and “Unintelligible.”
Why do missing data create difficulty in scientific research?
Most data analysis procedures were not designed for missing data. Most SAS statistical procedures simply exclude any observation that has a missing value on one or more of the analysis variables.
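As a rough illustration of the point about response codes, the sketch below converts special codes to SAS missing values and then counts the missing values. The data set name survey, the variable income, and the codes 97, 98, and 99 are hypothetical; they are not part of the lab data.

```sas
/* Hypothetical coding: 97 = "Don't know", 98 = "Refused", 99 = "Unintelligible" */
data survey_clean;
  set survey;                                  /* 'survey' is an assumed input data set */
  if income in (97, 98, 99) then income = .;   /* recode to SAS numeric missing */
run;

/* N = number of non-missing values, NMISS = number of missing values */
proc means data=survey_clean n nmiss;
  var income;
run;
```

Recoding first matters because a procedure would otherwise treat 97, 98, and 99 as real values rather than excluding them.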

Sources of Missing Data
Missing data may occur because of oversights or data recording errors.
Missing data may occur because some subjects refuse or fail to respond on one or more occasions during the study; some respondents may drop out of the study prematurely.
Some missing data result from the design of the study (planned missing). For example, some items may be measured for only a subset of the sample (e.g., when alcohol use is measured solely in older children).

Problems Caused by Missing Data
May bias parameter estimates
May inflate Type I and Type II error rates
May degrade the performance of confidence intervals
May reduce statistical power

Three Types of Missing Data
1. Missing Completely At Random (MCAR)
Data are missing because of a random process (e.g., equipment failure). Whether a score is missing is unrelated to any measured or unmeasured variable, so estimates of population parameters (e.g., means, variances, correlations) from the observed data are unbiased.

Three Types of Missing Data
2. Missing At Random (MAR)
The probability that a value is missing depends only on the values of other observed variables measured on the participants. Consider a study with income as a key variable of interest. If less educated individuals tend not to report their income, the missing income values may be MAR, because whether an individual responds depends on his or her education and not, once education is taken into account, on income itself.

Three Types of Missing Data
3. Missing Not At Random (MNAR)
The probability that a value is missing depends on the individual’s score on the variable of interest itself, whether measured or unmeasured. Missing income is MNAR if individuals with high or low incomes tend not to report their income.
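The three mechanisms can be made concrete with a small simulation. The sketch below is only an illustration under assumed values: the income scale, the education effect, and every missingness probability are invented for the example and do not come from the lab data.

```sas
data mechanisms;
  call streaminit(2002);                             /* arbitrary seed */
  do id = 1 to 1000;
    educ   = rand('bernoulli', 0.5);                 /* 1 = more educated (hypothetical) */
    income = 30 + 20*educ + rand('normal', 0, 10);   /* hypothetical income scale */

    /* MCAR: every value has the same 30% chance of being missing */
    inc_mcar = income;
    if rand('uniform') < 0.30 then inc_mcar = .;

    /* MAR: missingness depends only on the observed education variable */
    inc_mar = income;
    if rand('uniform') < (0.50 - 0.40*educ) then inc_mar = .;

    /* MNAR: missingness depends on the income value itself */
    inc_mnar = income;
    if income > 50 and rand('uniform') < 0.60 then inc_mnar = .;

    output;
  end;
run;

/* Compare the observed means under each mechanism with the full-data mean */
proc means data=mechanisms n nmiss mean;
  var income inc_mcar inc_mar inc_mnar;
run;
```

Under these assumptions one would expect the MCAR mean to stay close to the full-data mean, while the MAR and MNAR means drift away from it; only in the MAR case can the drift be explained by an observed variable.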

Example
‘MISSING8266.DBF’ contains data from a study of the mental health needs of students in the New York City public school system six months after the attack on the World Trade Center. The data used here contain 5 variables for 8266 students.
CONDUCTD (conduct disorder symptoms): 34.7% missing; minimum 0; SD 1.96
SANXIETY (separation anxiety symptoms): 36.2% missing; minimum 0; SD 1.70
AGE: SD 2.57
SEX (Girl=0, Boy=1): N 8218; 48 missing (0.6%); 4307 girls, 3911 boys
MEDU (mother education; High school or less=0, More than high school=1): 21.6% missing; 1232 high school or less, 5250 more than high school

Example cont.
Complete data: 2045 students (24.74%) have no missing values.
Missing data: 6221 students (75.26%) have a missing value on at least one variable.
The statistical analysis of interest is the regression of conduct disorder symptoms on the 3 demographic variables (age, sex, and mother education).
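Counts like these could be obtained with a short DATA step. The sketch below assumes the data have been read into a SAS data set named missing8266 (the name used in the programs later in this lab), that all five variables are numeric, and that missing values are stored as SAS missing values; the names flag, n_missing, and complete are made up for the example.

```sas
data flag;
  set missing8266;
  /* NMISS counts the missing values among its numeric arguments */
  n_missing = nmiss(of conductd sanxiety age sex medu);
  complete  = (n_missing = 0);   /* 1 = complete case, 0 = at least one missing value */
run;

proc freq data=flag;
  tables complete n_missing;
run;
```

The n_missing table also shows how many students are missing one, two, or more variables, which helps in judging how much information listwise deletion would discard.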

Traditional Approaches to Missing Data
1. Complete Case Analysis
Only individuals with complete information on all the variables under study are included in the statistical analysis. This method (also known as listwise deletion) is very convenient but assumes MCAR. If the missing data really are MCAR, the results are valid but inefficient, because the information in the incomplete records is discarded. If the data are not MCAR, complete case analysis may give biased results. For example, if less educated people are less likely to report their income and income rises with education, the complete-case mean income will overestimate the population mean.
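In SAS, a complete case analysis requires no extra step, because PROC REG drops any observation with a missing value on a model variable. A minimal sketch for the regression of interest, assuming the data set is named missing8266 as in the later programs:

```sas
/* Observations with a missing value on CONDUCTD, AGE, SEX, or MEDU are
   excluded automatically, so this is a listwise-deletion fit. */
proc reg data=missing8266;
  model conductd = age sex medu;
run;
quit;
```

The output reports the number of observations read and the number used, which shows how much of the sample was discarded.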

Traditional Approaches to Missing Data
1. Complete Case Analysis cont.
Dependent Variable: CONDUCTD
Parameter Estimates (N=4288)
[SAS output: DF, Parameter Estimate, Standard Error, t Value, and Pr > |t| for Intercept, AGE, SEX, and MEDU; Pr > |t| is <.0001 for AGE and SEX]
Are the data missing completely at random?
Would you feel comfortable publishing findings based on only 51.9% of the sample?
How would you generalize the results from n=4288 to the full sample of n=8266 and to this population?

Traditional Approaches to Missing Data
2. Single Imputation
a) Mean Substitution: All the missing values for a variable are replaced by the mean of the observed values of that variable. Mean substitution gives valid estimates of means only if the missing values have the same mean as the rest of the sample. Because every imputed case receives the same value, the method understates the variability among the missing cases, so estimates of variances and covariances will be too small.

Traditional Approaches to Missing Data
2. Single Imputation cont.
a) Mean Substitution: The mean of CONDUCTD for the total sample is still 1.44, but the SD decreases from 1.96 to 1.58 because we replaced 2866 missing values with the same mean (1.44). There is no variance among those 2866 students.
Parameter Estimates (N=6447)
Variable    Estimate    Standard Error    Pr > |t|
Intercept   (0.3819)    (0.1867)
AGE         (0.0729)    (0.0118)          <.0001
SEX         (0.3234)    (0.0599)          <.0001
MEDU        ( )         (0.0775)
Compared with the model with N=4288, both the parameter estimates and their standard errors are underestimated.
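One way to carry out mean substitution in SAS is PROC STDIZE with the REPONLY option. This is a sketch of the technique rather than the program used for the slide, and the output data set name meansub is made up for the example.

```sas
/* METHOD=MEAN with REPONLY replaces only the missing values with the
   variable mean; observed values are left unchanged. */
proc stdize data=missing8266 out=meansub method=mean reponly;
  var conductd;
run;

/* The mean is unchanged but the standard deviation shrinks */
proc means data=meansub n mean std;
  var conductd;
run;
```

Comparing the standard deviation before and after substitution makes the loss of variability visible directly.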

Traditional Approaches to Missing Data
2. Single Imputation cont.
b) Regression Substitution: Each missing value is replaced by the predicted value of that variable from a regression fitted to the complete cases only. Even if the mean parameters are correctly estimated, the variance parameters are underestimated, because this method assumes no residual variance around the regression line that predicts the missing values.
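Regression substitution for CONDUCTD might be sketched as follows. The data set names pred and regsub and the variables conductd_hat and imputed are hypothetical, and the choice of AGE, SEX, and MEDU as predictors is an assumption made for the illustration.

```sas
/* The model is fitted on complete cases, but PROC REG still computes a
   predicted value for observations that are missing only the response. */
proc reg data=missing8266 noprint;
  model conductd = age sex medu;
  output out=pred p=conductd_hat;
run;
quit;

data regsub;
  set pred;
  imputed = missing(conductd);   /* 1 = value was missing before substitution */
  if imputed and not missing(conductd_hat) then conductd = conductd_hat;
run;
```

Students who are also missing a predictor have no predicted value, so CONDUCTD stays missing for them. Every imputed value lies exactly on the regression line, which is why the variance is understated.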

Traditional Approaches to Missing Data
2. Single Imputation cont.
Summary: Single imputation treats the imputed values as if they were known values in a complete data set. It therefore does not reflect the uncertainty about the predictions of the unknown missing values, and the resulting SEs of parameter estimates are biased toward zero (underestimated).

Multiple Imputation
MI and MIANALYZE procedures
The MI procedure performs multiple imputation (MI) of missing data. Multiple imputation replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute.

MI and MIANALYZE procedures cont.
In most applications, just three to five imputations are sufficient to obtain excellent results.
MI results in valid statistical inferences that properly reflect the uncertainty due to missing values.
MI offers the promise of improving the accuracy and, often, the statistical power of results.

MI and MIANALYZE procedures cont.
1. The missing data are filled in m times (by default, m=5) to generate m complete data sets.
2. The m complete data sets are analyzed using standard statistical analyses.
3. The results from the m complete data sets are combined by using the MIANALYZE procedure to generate valid statistical inferences about the parameters.
No matter which complete-data analysis is used, the process of combining results from the different data sets is essentially the same.
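The combining step is usually described by Rubin's rules. The notation below is introduced only for this sketch and does not appear in the slides: \hat{Q}_i is the estimate of a parameter from the i-th completed data set and U_i is its squared standard error.

```latex
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i
\qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m}U_i
\qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^2
\qquad
T = \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B
```

The combined point estimate is \bar{Q} and its standard error is \sqrt{T}. The between-imputation variance B is the term that single imputation leaves out, which is why single-imputation standard errors are too small.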

Statistical Assumptions for MI
1. Missing data are either MCAR or MAR, and the data necessary to account for any bias are included in the model.
2. The data are from a multivariate normal distribution. It often makes sense to use a normal model to create multiple imputations even when the observed data are somewhat non-normal. You can also use a TRANSFORM statement to transform variables so that they better conform to the multivariate normality assumption.
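A TRANSFORM statement might look like the sketch below. Whether a log transformation suits CONDUCTD is an assumption made only for illustration, and the output data set name imputation_log is hypothetical.

```sas
proc mi data=missing8266 nimpute=5 seed=88888 out=imputation_log;
  /* C=1 adds 1 before taking logs because symptom counts can be 0.
     Imputation is done on the log scale; values in the OUT= data set
     are transformed back to the original scale. */
  transform log(conductd / c=1);
  mcmc chain=multiple;
  var conductd age sex medu;
run;
```

Because the values are back-transformed in the output data set, later analysis steps can use CONDUCTD on its original scale.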

Statistical Assumptions for MI cont.
3. The imputation model is the same as the analysis model. In practice, the two models may differ, because it is often unwise to include variables that account for bias in the missing data in the same analysis as the explanatory equation for some other phenomenon. For example, one may use education, age, and occupation to estimate income but would often not want all of these variables to compete in a model estimating some disease outcome. Imputation of MAR variables is usually done by the person who collected the data, whereas the final analysis may be done by many other users who share the data set. The data collector has much better knowledge of the data and is likely the best person to do the imputations.

Statistical Assumptions for MI cont.
Generally, you should include as many variables as you can when you do multiple imputation. The precision you lose by including unimportant predictors is usually a relatively small price to pay for the general validity of analyses of the resulting multiply imputed data set.
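As an illustration of an imputation model that is richer than the analysis model, the sketch below adds SANXIETY to the imputation step but leaves it out of the regression. Treating SANXIETY as an auxiliary variable is an assumption made for this example, and the data set names imputation_aux and outreg_aux are hypothetical.

```sas
/* Imputation model: the analysis variables plus SANXIETY as an auxiliary variable */
proc mi data=missing8266 nimpute=5 seed=88888 out=imputation_aux;
  mcmc chain=multiple;
  var conductd sanxiety age sex medu;
run;

/* Analysis model: SANXIETY is not a predictor in the regression */
proc reg data=imputation_aux outest=outreg_aux noprint covout;
  model conductd = age sex medu;
  by _imputation_;
run;
```

The auxiliary variable can help predict the missing values without competing with the demographic predictors in the model of substantive interest.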

SAS program for MI
  proc mi data=missing8266 nimpute=1 seed=88888 minimum= maximum= round=1 out=imputation1;
    mcmc chain=multiple;
    var conductd age sex medu;
  run;
  proc print data=imputation1;
  run;
nimpute: number of imputations, here 1 (the default is 5)
seed: seed for the random number generator. By default, the seed is generated from the time of day on the computer’s clock. Specify the same seed again to reproduce the results.
minimum: minimum values for imputed variables
maximum: maximum values for imputed variables
round: units to which imputed values are rounded
out: output data set with imputed values (no missing data, n=8266)
mcmc: uses a Markov chain Monte Carlo method
chain: single or multiple chains

SAS Output
Data Set: WORK.MISSING8266
Method: MCMC
Multiple Imputation Chain: Multiple Chains
Initial Estimates for MCMC: EM Posterior Mode
Start: Starting Value
Prior: Jeffreys
Number of Imputations: 1
Number of Burn-in Iterations: 200
Seed for random number generator
[SAS output: Missing Data Patterns table with columns Group, CONDUCTD, AGE, SEX, MEDU, Freq, and Percent; X = observed, . = missing]

SAS Output
  proc freq data=imputation1;
    tables sex medu;
  run;
[SAS output: frequency and percent of SEX and MEDU, N=8266]
  proc means data=imputation1;
    var conductd age;
  run;
[SAS output: N, Mean, and Std Dev of CONDUCTD and AGE]
  proc reg data=imputation1;
    model conductd=age sex medu;
  run;
[SAS output: Parameter Estimate, Standard Error, t Value, and Pr > |t| for Intercept, AGE, SEX, and MEDU; Pr > |t| is <.0001 for Intercept, AGE, and SEX]

5 Imputations
  proc mi data=missing8266 nimpute=5 seed=88888 minimum= maximum= round=1 out=imputation5;
    mcmc chain=multiple;
    var conductd age sex medu;
  run;
  proc print data=imputation5;
  run;
  proc means data=imputation5;
    var conductd;
    by _imputation_;
  run;
  proc freq data=imputation5;
    tables sex medu;
    by _imputation_;
  run;

SAS Output
[SAS output for each imputation (Imputation Number = 1 through 4 shown): N, %, Mean, and SD of CONDUCTD, and the frequencies of SEX and MEDU]

SAS program for MIANALYZE
To generate regression coefficients for each of the 5 imputed data sets:
  proc reg data=imputation5 outest=outreg5 noprint covout;
    model conductd=age sex medu;
    by _imputation_;
  run;
To combine the five sets of regression coefficients:
  /* SAS 9 and later releases document the MODELEFFECTS statement in place of VAR here */
  proc mianalyze data=outreg5;
    var intercept age sex medu;
  run;

SAS Output
The MIANALYZE Procedure
Number of Imputations: 5
Multiple Imputation Parameter Estimates
[SAS output: Estimate, Std Error, 95% Confidence Limits, Minimum, Maximum, t, and Pr > |t| for intercept, age, sex, and medu; Pr > |t| is <.0001 for age and sex]