1
Multiple Imputation Using Stata
Chuck Huber, PhD StataCorp University of Michigan January 30, 2018
2
Outline Example Dataset Missing Data Mechanisms
What is multiple imputation? Multiple imputation in Stata Why use multiple imputation?
3
Example Dataset
4
Example Dataset
The objective is to examine the relationship between smoking and heart attacks, adjusting for age, body mass index, educational status, and gender. We want to perform a logistic regression of heart attack (attack) with the other variables as regressors.
5
Example Dataset
6
Example Dataset
7
Complete Case Analysis
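A complete-case analysis of the model described above might look like the sketch below. It assumes Stata's example dataset mheart0 (loaded with webuse, which needs an internet connection) as a stand-in for the presentation's heart.dta, and it assumes the variable names attack, smokes, age, bmi, hsgrad, and female. Stata's estimation commands drop any observation with a missing value in a model variable, so fitting the model directly amounts to listwise deletion.

```stata
* Assumption: mheart0 stands in for the presentation's heart.dta
webuse mheart0, clear

* Observations with a missing value in any model variable are dropped
logistic attack smokes age bmi hsgrad female

* Compare the estimation sample with the full dataset
display "Observations used: " e(N) " of " _N
```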
8
Mean Substitution?
9
Mean Substitution?
10
Mean Substitution?
11
Mean Substitution? Complete Case Analysis (N=132) Mean Substitution
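The effect of mean substitution on the spread of a variable can be checked with a short sketch. It assumes the mheart0 example dataset as a stand-in for heart.dta; bmi_mean is a hypothetical variable name introduced here. Filling every missing value with the observed mean leaves the mean unchanged but shrinks the standard deviation, which is one reason single imputation understates uncertainty.

```stata
* Assumption: mheart0 stands in for the presentation's heart.dta
webuse mheart0, clear

* Mean of the observed bmi values
quietly summarize bmi

* Replace each missing bmi with the observed mean (hypothetical new variable)
generate double bmi_mean = cond(missing(bmi), r(mean), bmi)

* Same mean, but the filled-in variable has a smaller standard deviation
summarize bmi bmi_mean
```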
12
Outline Example Dataset Missing Data Mechanisms
What is multiple imputation? Multiple imputation in Stata Why use multiple imputation?
13
Missing Data Mechanisms
Missing Completely At Random (MCAR) Missing At Random (MAR) Missing Not At Random (MNAR)
14
Missing Completely At Random (MCAR)
Definition: Missing data are MCAR if the reason for missing data is unrelated to the observed or unobserved (missing) data. That is, missing values are a simple random sample of all data values.
Example: Subjects withdraw from a study for reasons unrelated to the study. Data are missing because of equipment failures or data-recording errors.
15
Missing Completely At Random (MCAR)
Missing values are not related to the variable with missing data. Missing values are not related to other observed variables.
16
Missing At Random (MAR)
Definition: Missing data are MAR if the reason for missing data is unrelated to the unobserved (missing) data but may depend on the observed data. That is, missing values are not a simple random sample of all data values.
Example: In a study of blood pressure, subjects withdraw from the study because of severe side effects caused by a high dosage of a treatment. In a study of income, respondents with low education might be less inclined to report their income.
17
Missing At Random (MAR)
Missing values are not related to the variable with missing data. Missing values are related to other observed variables.
18
Missing Not At Random (MNAR)
Definition: Missing data are MNAR if the reason for missing data is related to the unobserved (missing) data.
Example: In a study of income, respondents with low or high income might be less inclined to report their income; in a study of depression, respondents who are depressed might be less likely to report that they are depressed.
19
Missing Not At Random (MNAR)
Missing values are related to the variable with missing data.
20
Checking MCAR vs MAR
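The content of this slide did not survive in the transcript, so the following is only one common informal check, not necessarily the presenter's. It assumes the mheart0 example dataset as a stand-in for heart.dta; miss_bmi is a hypothetical indicator introduced here. If missingness in bmi is associated with the observed variables, that is evidence against MCAR. No check based on observed data alone can rule out MNAR.

```stata
* Assumption: mheart0 stands in for the presentation's heart.dta
webuse mheart0, clear

* Indicator: 1 if bmi is missing, 0 otherwise (hypothetical variable name)
generate byte miss_bmi = missing(bmi)
tabulate miss_bmi

* Does missingness depend on the observed variables?
logistic miss_bmi attack smokes age hsgrad female
```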
21
Outline Example Dataset Missing Data Mechanisms
What is multiple imputation? Multiple imputation in Stata Why use multiple imputation?
22
What is multiple imputation?
Multiple imputation (MI) is a flexible, simulation-based statistical technique for handling missing data. Multiple imputation consists of three steps: Imputation step. M imputations (completed datasets) are generated under some chosen imputation model. Completed-data analysis (estimation) step. The desired analysis is performed separately on each imputation (m = 1, … , M). This is called completed-data analysis and is the primary analysis to be performed once missing data have been imputed. Pooling step. The results obtained from M completed-data analyses are combined into a single multiple-imputation result.
23
Notation and some terminology
Original data are the data containing missing values. With a slight abuse of terminology, by an imputation we mean a copy of the original data in which missing values are imputed. M is the number of imputations. m (= 0, …, M) refers to the original or imputed data: m = 0 means original data and m > 0 means imputed data; m = 1 means the first imputation, m = 2 means the second imputation, etc.
24
The Imputation Step Original Data (m=0) Copy of Data (m = 1)
25
The Imputation Step
26
The Imputation Step bmi_new = (attack) - .47(smokes) - .03(age) - .31(female)
27
The Imputation Step bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + rnormal()
28
The Imputation Step Original Data
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + rnormal()
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + rnormal()
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + rnormal()
29
The Imputation Step Original Data
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + 1.7
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + 0.9
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female)
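The idea behind a single imputation (a regression prediction plus a random draw) can be sketched by hand. The block assumes the mheart0 example dataset as a stand-in for heart.dta; bmi_hat is a hypothetical helper variable, and scaling the noise by the regression's root mean squared error is an assumption of this sketch. It is a simplification: Stata's mi impute regress also draws the regression parameters from their posterior distribution, which this sketch does not do.

```stata
* Assumption: mheart0 stands in for the presentation's heart.dta
webuse mheart0, clear
set seed 12345

* Fit the imputation model on the observations with observed bmi
regress bmi attack smokes age female

* Linear prediction for every observation (hypothetical helper variable)
predict double bmi_hat, xb

* Keep observed values; fill missing ones with prediction plus random noise
generate double bmi_new = cond(missing(bmi), bmi_hat + rnormal(0, e(rmse)), bmi)

summarize bmi bmi_new
```

Repeating the last generate step with fresh random draws would give a different completed dataset each time, which is what distinguishes m = 1, m = 2, and so on.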
30
The Estimation Step Original Data
logistic attack smokes age bmi_new hsgrad female
logistic attack smokes age bmi_new hsgrad female
logistic attack smokes age bmi_new hsgrad female
31
The Pooling Step
The within-imputation variance (W) is calculated for each imputed dataset during the estimation step. The between-imputation variance (B) is calculated during the pooling step. The total variance (T) is then:
$$T = \frac{1}{M}\sum_{m=1}^{M} W_m + \left(1 + \frac{1}{M}\right) B$$
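A small worked example may help make the pooling arithmetic concrete. The numbers are made up for illustration, and the symbols $Q_m$ (the estimate from imputation m), $\bar{Q}$ (the pooled estimate), and $\bar{W}$ (the average within-imputation variance) are notation introduced here, following Rubin's combination rules. Suppose M = 3 imputations give estimates 1.0, 1.2, and 1.4 with within-imputation variances 0.04, 0.05, and 0.06.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{M}\sum_{m=1}^{M} Q_m = \tfrac{1}{3}(1.0 + 1.2 + 1.4) = 1.2 \\
\bar{W} &= \frac{1}{M}\sum_{m=1}^{M} W_m = \tfrac{1}{3}(0.04 + 0.05 + 0.06) = 0.05 \\
B &= \frac{1}{M-1}\sum_{m=1}^{M} (Q_m - \bar{Q})^2 = \tfrac{1}{2}(0.04 + 0 + 0.04) = 0.04 \\
T &= \bar{W} + \left(1 + \frac{1}{M}\right) B = 0.05 + \tfrac{4}{3}(0.04) \approx 0.1033 \\
\sqrt{T} &\approx 0.321
\end{align*}
```

The pooled standard error (about 0.321) is larger than the square root of the average within-imputation variance (about 0.224) because it also reflects the disagreement between imputations.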
32
Outline Example Dataset Missing Data Mechanisms
What is multiple imputation? Multiple imputation in Stata Why use multiple imputation?
33
Main features of Stata’s mi command
Stata’s mi suite of commands performs all three steps of multiple imputation: (1) create imputed datasets, each with the missing values filled in (mi impute); (2) fit your model on each imputed dataset (mi estimate); (3) collect all the model fits and apply Rubin’s combination rules to form “mi-adjusted” parameter estimates and standard errors (mi estimate).
34
Multiple Imputation Using Stata
The mi Control Panel Examining and setting up mi data Univariate imputation Estimation Testing Prediction
35
The mi Control Panel
36
Examine Missing Data
37
Examine Missing Data
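The slide images are not part of this transcript, so the commands below are a sketch of how missing data are typically examined and declared in Stata, not necessarily what the slides showed. The block assumes the mheart0 example dataset as a stand-in for heart.dta; the choice of the mlong storage style is also an assumption.

```stata
* Assumption: mheart0 stands in for the presentation's heart.dta
webuse mheart0, clear

* Which variables have missing values, and how many?
misstable summarize

* Which combinations of variables are missing together?
misstable patterns

* Declare the data as mi data and register the variable to be imputed
mi set mlong
mi register imputed bmi

* Summary of the mi setup (style, number of imputations, registered variables)
mi describe
```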
38
The Imputation Step
NOTE: We’re only using 5 imputations to keep things simple, but you should use at least 20.
39
The Imputation Step
40
The Imputation Step
Three new variables were created by mi set and mi impute:
_mi_id: an identification number for records within an imputed dataset
_mi_miss: an indicator for missing values of the imputed variable
_mi_m: the number (m) of each imputed dataset (m = 0 is the original data)
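The imputation step can be sketched as follows. The block assumes the mheart0 example dataset as a stand-in for heart.dta, the mlong storage style, a linear regression imputation model for bmi that includes all variables of the analysis model, and an arbitrary seed of 12345. It uses 5 imputations to match the slides; in practice add() would be raised to 20 or more.

```stata
* Assumption: mheart0 stands in for the presentation's heart.dta
webuse mheart0, clear

* Declare mi data and register the variable with missing values
mi set mlong
mi register imputed bmi

* Impute bmi by linear regression; rseed() makes the draws reproducible
mi impute regress bmi attack smokes age hsgrad female, add(5) rseed(12345)

* Inspect the result: system variables and number of rows per imputation
mi describe
tabulate _mi_m
```

In the mlong style, only observations with imputed values are stored for m > 0, so the imputed copies are much smaller than the original data.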
41
The Imputation Step
42
Data Management
44
The Estimation Step
45
The Estimation and Pooling Step
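Estimation and pooling happen in one command: the mi estimate prefix fits the model on each imputed dataset and combines the results with Rubin's rules. The sketch assumes the mheart0 example dataset as a stand-in for heart.dta and repeats a hypothetical imputation setup (mlong style, 5 imputations, seed 12345) so that it runs on its own.

```stata
* Assumption: mheart0 stands in for the presentation's heart.dta
webuse mheart0, clear
mi set mlong
mi register imputed bmi
mi impute regress bmi attack smokes age hsgrad female, add(5) rseed(12345)

* Fit the logistic model on each imputation and pool the results
mi estimate: logistic attack smokes age bmi hsgrad female
```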
46
Testing Coefficients
47
Testing Coefficients
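After mi estimate, joint hypotheses about coefficients can be tested with mi test. The sketch below assumes the mheart0 example dataset as a stand-in for heart.dta and a hypothetical imputation setup; the choice of hsgrad and female as the coefficients to test is only for illustration.

```stata
* Assumption: mheart0 stands in for the presentation's heart.dta
webuse mheart0, clear
mi set mlong
mi register imputed bmi
mi impute regress bmi attack smokes age hsgrad female, add(5) rseed(12345)
mi estimate: logistic attack smokes age bmi hsgrad female

* Joint test that the coefficients on hsgrad and female are both zero
mi test hsgrad female
```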
48
Predictions
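Predictions after multiple imputation need the per-imputation estimation results, which mi estimate can save to a file for mi predict to use. The sketch assumes the mheart0 example dataset as a stand-in for heart.dta and a hypothetical imputation setup; miest is a hypothetical file name (written to the current directory) and xb_mi a hypothetical variable name.

```stata
* Assumption: mheart0 stands in for the presentation's heart.dta
webuse mheart0, clear
mi set mlong
mi register imputed bmi
mi impute regress bmi attack smokes age hsgrad female, add(5) rseed(12345)

* Save the estimation results from each imputation to miest.ster
mi estimate, saving(miest, replace): logistic attack smokes age bmi hsgrad female

* Pooled linear prediction (log odds of a heart attack)
mi predict xb_mi using miest

* Summarize the prediction in the original data (m = 0)
mi xeq 0: summarize xb_mi
```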
49
Outline Example Dataset Missing Data Mechanisms
What is multiple imputation? Multiple imputation in Stata Why use multiple imputation?
50
Why use multiple imputation?
The objective of MI is not to predict missing values as close as possible to the true ones but to handle missing data in a way resulting in valid statistical inference (Rubin 1996)
51
Why use multiple imputation?
It is more flexible than fully parametric methods, e.g., maximum likelihood or purely Bayesian analysis. It can be more efficient than listwise deletion (complete-case analysis) and can avoid potential bias. It accounts for missing-data uncertainty and thus, unlike single-imputation methods, does not underestimate the variance of estimates.
52
Statistical validity of MI
MI yields statistically valid inference if the imputation method used is proper per Rubin (1987, 118–119). Loosely speaking, the imputation mechanism that produces the imputations must maintain the existing characteristics of the data and incorporate adequate variability (uncertainty) induced by the unobserved data.
53
Summary
MI is a stochastic method. Remember to set the random-number seed to reproduce the same point estimates later. MI preserves all available data and thus can be more efficient than complete-case analysis. It can also avoid potential bias when complete cases differ from incomplete cases. Unlike fully parametric methods, MI can easily be applied to a wide range of analyses.
54
Summary
MI separates the stochastic imputation step from the analysis step: the imputer and the analyst can be different people! In Stata, use mi impute for imputation and mi estimate for analysis. Use the MI Control Panel to guide you through all the phases of MI.
55
For more information
56
For more information
Files: 09_multiple_imputation.do, heart.dta
Videos:
Multiple imputation in Stata®: Setup, imputation, estimation--regression imputation
Multiple imputation in Stata®: Setup, imputation, estimation--predictive mean matching
Multiple imputation in Stata®: Setup, imputation, estimation--logistic regression
57
Thanks for letting me hang out with you today! Questions?
You can contact me anytime at