Multiple Imputation Using Stata

Chuck Huber, PhD
StataCorp
University of Michigan
January 30, 2018

Outline
Example Dataset
Missing Data Mechanisms
What is multiple imputation?
Multiple imputation in Stata
Why use multiple imputation?

Example Dataset
The objective is to examine the relationship between smoking and heart attacks, adjusting for age, body mass index, educational status, and gender. We want to perform a logistic regression of heart attack (attack) on the other variables.
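
A minimal sketch for loading and inspecting the data. The talk's heart.dta file is not reproduced in this transcript, so the sketch assumes that Stata's shipped example dataset mheart0 (loaded with webuse) has the same variables: attack, smokes, age, bmi, hsgrad, and female. That substitution is an assumption, not something the slides confirm.

```stata
* Assumption: mheart0 (via webuse) stands in for the talk's heart.dta
webuse mheart0, clear

* List variable names, storage types, and labels
describe

* Summary statistics; a smaller Obs count for a variable signals missing values
summarize attack smokes age bmi hsgrad female
```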

Complete Case Analysis
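
The complete case analysis can be sketched as below. Stata's estimation commands drop any observation with a missing value in a model variable, so fitting the model directly on the original data is a complete case analysis. The dataset name mheart0 is an assumed stand-in for the talk's heart.dta.

```stata
* Assumption: mheart0 (via webuse) stands in for the talk's heart.dta
webuse mheart0, clear

* Observations with missing bmi are dropped automatically (listwise deletion)
logistic attack smokes age bmi hsgrad female

* e(N) holds the number of observations actually used in the fit
display "Complete cases used: " e(N)
```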

Mean Substitution?
Side-by-side comparison of the logistic regression results: Complete Case Analysis (N=132) versus Mean Substitution.
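
For comparison, mean substitution might look like the sketch below. The variable name bmi_mean is hypothetical, and mheart0 is an assumed stand-in for heart.dta. Because every filled-in value equals the observed mean, the filled-in variable has less spread than the observed values, which is one reason the approach is questionable.

```stata
* Assumption: mheart0 (via webuse) stands in for the talk's heart.dta
webuse mheart0, clear

* Compute the mean of the observed bmi values; the result is left in r(mean)
summarize bmi, meanonly

* Copy bmi, replacing each missing value with the observed mean
* (bmi_mean is a hypothetical variable name)
generate double bmi_mean = cond(missing(bmi), r(mean), bmi)

* Compare the standard deviations of the original and filled-in variables
summarize bmi bmi_mean

* Refit the model using the mean-substituted variable
logistic attack smokes age bmi_mean hsgrad female
```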

Missing Data Mechanisms
Missing Completely At Random (MCAR)
Missing At Random (MAR)
Missing Not At Random (MNAR)

Missing Completely At Random (MCAR)
Definition: Missing data are MCAR if the reason for missing data is unrelated to the observed or unobserved (missing) data. That is, missing values are a simple random sample of all data values.
Examples: Subjects withdraw from a study for reasons unrelated to the study. Data are missing because of equipment failures or data-recording errors.
Relationship to variables: Missing values are not related to the variable with missing data, and they are not related to other observed variables.

Missing At Random (MAR)
Definition: Missing data are MAR if the reason for missing data is unrelated to the unobserved (missing) data but may depend on the observed data. That is, missing values are not a simple random sample of all data values.
Examples: In a study of blood pressure, subjects withdraw from the study because of severe side effects caused by a high dosage of a treatment; the dosage is observed. In a study of income, respondents with low education might be less inclined to report their income; education is observed.
Relationship to variables: Once the other observed variables are taken into account, missing values are not related to the variable with missing data. Missing values are related to other observed variables.

Missing Not At Random (MNAR)
Definition: Missing data are MNAR if the reason for missing data is related to the unobserved (missing) data.
Examples: In a study of income, respondents with low or high income might be less inclined to report their income. In a study of depression, respondents who are depressed might be less likely to report that they are depressed.
Relationship to variables: Missing values are related to the variable with missing data.
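
The three mechanisms can be illustrated with simulated data. The sketch below is not from the talk: the variables, coefficients, sample size, and seed are all hypothetical. It creates a complete bmi variable and then deletes values under each mechanism.

```stata
* Simulated illustration; every number here is hypothetical
clear
set seed 2018
set obs 1000

generate double age = rnormal(50, 10)
generate double bmi = 25 + 0.05*(age - 50) + rnormal(0, 4)

* MCAR: each value has the same 20% chance of being missing
generate double bmi_mcar = cond(runiform() < 0.2, ., bmi)

* MAR: the chance of being missing depends only on the observed variable age
generate double bmi_mar = cond(runiform() < invlogit(-2 + 0.1*(age - 50)), ., bmi)

* MNAR: the chance of being missing depends on the value of bmi itself
generate double bmi_mnar = cond(runiform() < invlogit(-2 + 0.3*(bmi - 25)), ., bmi)

* Compare the observed means with the mean of the complete variable
summarize bmi bmi_mcar bmi_mar bmi_mnar
```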

Checking MCAR vs MAR
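
One common way to check, sketched here under the assumption that mheart0 stands in for heart.dta, is to model an indicator of missingness as a function of the observed variables. Evidence that missingness depends on observed variables argues against MCAR. The observed data alone cannot distinguish MAR from MNAR. The variable name miss_bmi is hypothetical.

```stata
* Assumption: mheart0 (via webuse) stands in for the talk's heart.dta
webuse mheart0, clear

* Indicator: 1 if bmi is missing, 0 otherwise (miss_bmi is a hypothetical name)
generate byte miss_bmi = missing(bmi)
tabulate miss_bmi

* Is missingness of bmi associated with the fully observed variables?
logistic miss_bmi attack smokes age hsgrad female
```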

What is multiple imputation?
Multiple imputation (MI) is a flexible, simulation-based statistical technique for handling missing data. Multiple imputation consists of three steps:
1. Imputation step. M imputations (completed datasets) are generated under some chosen imputation model.
2. Completed-data analysis (estimation) step. The desired analysis is performed separately on each imputation (m = 1, ..., M). This is called completed-data analysis and is the primary analysis to be performed once missing data have been imputed.
3. Pooling step. The results obtained from the M completed-data analyses are combined into a single multiple-imputation result.

Notation and some terminology
Original data are the data containing missing values.
With a slight abuse of terminology, by an imputation we mean a copy of the original data in which missing values are imputed.
M is the number of imputations.
m (= 0, ..., M) refers to the original or imputed data: m = 0 means the original data and m > 0 means imputed data. For example, m = 1 means the first imputation, m = 2 means the second imputation, and so on.

The Imputation Step
Original Data (m = 0) and a Copy of the Data (m = 1)

The Imputation Step
Each missing value of bmi is replaced with a prediction from the other variables:
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female)

A random draw is added to reflect the uncertainty in the prediction:
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + rnormal()

Each copy of the original data gets its own random draws:
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + rnormal()
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + rnormal()
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + rnormal()

For example:
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + 1.7
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + 0.9
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female)
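
The idea of one imputation can be sketched by hand, as below. This is an illustration under stated assumptions, not how mi impute works internally: mheart0 stands in for heart.dta, the seed is hypothetical, and the random draw is scaled by the residual standard deviation, e(rmse), rather than using an unscaled rnormal(). A proper imputation procedure also accounts for uncertainty in the regression coefficients, which this sketch ignores.

```stata
* Assumption: mheart0 (via webuse) stands in for the talk's heart.dta
webuse mheart0, clear
set seed 2018    // hypothetical seed so the draws can be reproduced

* Fit the imputation model using observations where bmi is observed
regress bmi attack smokes age female

* Linear prediction for every observation, including those with missing bmi
predict double bmi_hat, xb

* One imputation: keep observed bmi; otherwise use prediction plus random noise
generate double bmi_new = cond(missing(bmi), bmi_hat + rnormal(0, e(rmse)), bmi)

* bmi_new has no missing values; bmi still does
summarize bmi bmi_new
```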

The Estimation Step
The same model is fit separately to each imputed copy of the original data:
logistic attack smokes age bmi_new hsgrad female
logistic attack smokes age bmi_new hsgrad female
logistic attack smokes age bmi_new hsgrad female

The Pooling Step
The within-imputation variance (W) is calculated for each imputed dataset during the estimation step. The between-imputation variance (B) is calculated during the pooling step. The total variance (T) is then:
$$T = \frac{1}{M}\sum_{m=1}^{M} W_m + \left(1 + \frac{1}{M}\right) B$$
where W_m is the within-imputation variance from imputation m.
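
A worked example with hypothetical numbers (not from the talk) may make the pooling arithmetic concrete. It assumes M = 3 imputations, writes \hat{q}_m for the estimate of a single coefficient from imputation m and W_m for its squared standard error, and uses the standard Rubin's rules definitions: the pooled estimate \bar{q} is the average of the \hat{q}_m, and B is their sample variance.

```latex
\begin{align*}
(\hat{q}_1, \hat{q}_2, \hat{q}_3) &= (0.50,\ 0.60,\ 0.70), \qquad
(W_1, W_2, W_3) = (0.04,\ 0.05,\ 0.06) \\
\bar{q} &= \tfrac{1}{3}(0.50 + 0.60 + 0.70) = 0.60 \\
\bar{W} &= \tfrac{1}{3}(0.04 + 0.05 + 0.06) = 0.05 \\
B &= \tfrac{1}{3-1}\left[(-0.10)^2 + 0^2 + (0.10)^2\right] = 0.01 \\
T &= \bar{W} + \left(1 + \tfrac{1}{3}\right) B \approx 0.05 + 0.0133 = 0.0633 \\
\sqrt{T} &\approx 0.252
\end{align*}
```

In this example the pooled standard error (about 0.252) is larger than the square root of the average within-imputation variance (about 0.224), because the between-imputation term adds the uncertainty caused by the missing data.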

Main features of Stata’s mi command
Stata’s mi suite of commands performs all three steps of multiple imputation:
1. Create imputed datasets, each with the missing values filled in (mi impute).
2. Fit your model on each imputed dataset (mi estimate).
3. Collect all the model fits and apply Rubin’s combination rules to form “mi-adjusted” parameter estimates and standard errors (mi estimate).

Multiple Imputation Using Stata
The mi Control Panel
Examining and setting up mi data
Univariate imputation
Estimation
Testing
Prediction

The mi Control Panel

Examine Missing Data
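
A short sketch of examining missing data from the command line, assuming mheart0 stands in for the talk's heart.dta. The misstable commands report how many values are missing and in which combinations.

```stata
* Assumption: mheart0 (via webuse) stands in for the talk's heart.dta
webuse mheart0, clear

* Count missing values for each variable that has any
misstable summarize

* Show the patterns of missing values across variables
misstable patterns
```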

The Imputation Step
NOTE: We’re only using 5 imputations to keep things simple, but you should use at least 20.

The Imputation Step
Three new variables were created by mi set and mi impute:
_mi_id: An identification number for records within an imputed dataset
_mi_miss: An indicator for missing values of the imputed variable
_mi_m: The number (m) for each imputed dataset (m = 0 is the original data)
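
The setup and imputation might be typed as in the sketch below. Several choices are assumptions rather than details confirmed by the slides: mheart0 stands in for heart.dta, the mlong storage style is used, hsgrad is included among the predictors, and the seed value is hypothetical.

```stata
* Assumption: mheart0 (via webuse) stands in for the talk's heart.dta
webuse mheart0, clear

* Declare the data to be mi data; mlong is one of several storage styles
mi set mlong

* Register bmi as the variable whose missing values will be imputed
mi register imputed bmi

* Create 5 imputations by linear regression; rseed() makes them reproducible
* (the predictor list and seed value are assumptions)
mi impute regress bmi attack smokes age hsgrad female, add(5) rseed(20180130)

* Summarize the mi setup, including the number of imputations
mi describe
```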

Data Management
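
Two data-management commands that are often useful with mi data are sketched below. The setup lines repeat assumed choices (mheart0 as a stand-in for heart.dta, mlong style, a hypothetical seed), and bmi_sq is a hypothetical variable name.

```stata
* Assumed setup: mheart0 stands in for heart.dta; seed is hypothetical
webuse mheart0, clear
mi set mlong
mi register imputed bmi
mi impute regress bmi attack smokes age hsgrad female, add(5) rseed(20180130)

* Run a command on selected datasets: m=0 (original), m=1, and m=5
mi xeq 0 1 5: summarize bmi

* Create a variable derived from an imputed variable in every dataset
* (bmi_sq is a hypothetical name)
mi passive: generate bmi_sq = bmi^2
```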

The Estimation Step

The Estimation and Pooling Step
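
Estimation and pooling happen in a single command, sketched here with the same assumed setup as before (mheart0 as a stand-in for heart.dta, mlong style, a hypothetical seed). The mi estimate prefix fits the model on each imputation and combines the results with Rubin's rules.

```stata
* Assumed setup: mheart0 stands in for heart.dta; seed is hypothetical
webuse mheart0, clear
mi set mlong
mi register imputed bmi
mi impute regress bmi attack smokes age hsgrad female, add(5) rseed(20180130)

* Fit the logistic model on each imputation and pool the results;
* the or option reports odds ratios instead of coefficients
mi estimate, or: logistic attack smokes age bmi hsgrad female
```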

Testing Coefficients
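
After mi estimate, hypotheses about coefficients can be tested with mi test. The sketch below uses the same assumed setup (mheart0 as a stand-in for heart.dta, a hypothetical seed), and the choice of age and bmi as the coefficients to test is only an illustration.

```stata
* Assumed setup: mheart0 stands in for heart.dta; seed is hypothetical
webuse mheart0, clear
mi set mlong
mi register imputed bmi
mi impute regress bmi attack smokes age hsgrad female, add(5) rseed(20180130)

* Fit and pool the model
mi estimate: logistic attack smokes age bmi hsgrad female

* Joint test that the coefficients on age and bmi are both zero
mi test age bmi
```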

Predictions
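
Predictions from a pooled model require saving the estimation results from each imputation. A sketch under the same assumptions as before follows; the file name miest and the variable name xb_mi are hypothetical.

```stata
* Assumed setup: mheart0 stands in for heart.dta; seed is hypothetical
webuse mheart0, clear
mi set mlong
mi register imputed bmi
mi impute regress bmi attack smokes age hsgrad female, add(5) rseed(20180130)

* Save the estimation results from each imputation (miest is a hypothetical name)
mi estimate, saving(miest, replace): logistic attack smokes age bmi hsgrad female

* Pooled linear prediction (log odds) computed from the saved results
mi predict xb_mi using miest

* Inspect the predictions in the original data (m=0)
mi xeq 0: summarize xb_mi
```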

Why use multiple imputation?
The objective of MI is not to predict missing values as close as possible to the true ones but to handle missing data in a way that results in valid statistical inference (Rubin 1996).

MI also has practical advantages:
It is more flexible than fully parametric methods such as maximum likelihood or purely Bayesian analysis.
It can be more efficient than listwise deletion (complete-case analysis) and can avoid potential bias.
It accounts for missing-data uncertainty and thus, unlike single-imputation methods, does not underestimate the variance of estimates.

Statistical validity of MI
MI yields statistically valid inference if the imputation method used is proper per Rubin (1987, 118–119). Loosely speaking, the imputation mechanism that produces the imputations must maintain the existing characteristics of the data and incorporate adequate variability (uncertainty) induced by the unobserved data.

Summary
MI is a stochastic method. Remember to set the random-number seed to reproduce the same point estimates later.
MI preserves all available data and thus can be more efficient than complete-case analysis. It can also avoid potential bias when complete cases differ from incomplete cases.
Unlike fully parametric methods, MI can easily be applied to a wide range of analyses.
MI separates the stochastic imputation step from the analysis step: the imputer and the analyst can be different people!
In Stata, use mi impute for imputation and mi estimate for analysis.
Use the MI Control Panel to guide you through all the phases of MI.

For more information
Files:
09_multiple_imputation.do
heart.dta
Videos:
Multiple imputation in Stata®: Setup, imputation, estimation--regression imputation
Multiple imputation in Stata®: Setup, imputation, estimation--predictive mean matching
Multiple imputation in Stata®: Setup, imputation, estimation--logistic regression

Thanks for letting me hang out with you today! Questions?
You can contact me anytime at
