Multiple Imputation in Stata (ice): How and when to use it.

How ice() works
Each variable with missing data is the subject of a regression.
– Typically all other variables are used as predictors
– Estimate β and σ via the regression
– Draw σ* from its posterior distribution (non-informative prior)
– Draw β* from its posterior distribution (non-informative prior)
– Find predicted values Ŷ = Xβ*, then either:
   Keep Ŷ for the missing values (default option), or
   Use predictive mean matching
– Move on to the next variable, using the newly imputed values
– Cycle through the variables a number of times (10 is the default)
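As a concrete illustration, here is a minimal sketch of a call that makes these choices explicit. The variables x1–x3 and the output file name are hypothetical; m(), cycles(), and seed() are genuine ice options.
   * impute x1-x3 (hypothetical variables), cycling through the chained
   * equations 10 times and saving 5 imputed datasets
   ice x1 x2 x3 using "C:\path\imputed", m(5) cycles(10) seed(12345)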

Assumptions
– Missing at Random
   No getting around this one. MCAR is fine, of course.
– Distinct Parameters
   Does the missing-data mechanism govern which data-generating parameters you can see? Example: limits of detection.
– Adequate Sample Size
   Hard to quantify. Regression on continuous variables doesn't take much, but other methods certainly can.
– Convergence to a Posterior Distribution
   Standard MI (such as Proc MI) is known to converge to a posterior distribution with enough iterations. ice() does not have this guarantee; this is typically ignored when ice() is used.

Predictive Mean Matching
We have ŷ_mis, the predicted value for an observation with missing information.
– Previously: find the ŷ_obs closest to ŷ_mis and fill in the missing value with that observation's actual observed value.
   This was the default behavior in earlier versions of ice(); it could be a problem, because it does not produce enough variability.
– Currently: find a set of ŷ_obs values close to ŷ_mis, choose one at random, and fill in the missing value with that observation's actual observed value.
   Invoked by using the "match" option.
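A minimal sketch of requesting predictive mean matching; the variable and file names are hypothetical, and plain match is the form named on this slide.
   * impute with predictive mean matching rather than keeping the raw
   * predicted values (hypothetical variable and file names)
   ice npceradm npneurm educ mmselast using "C:\path\imputed_pmm", m(5) match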

Other Regression Methods
– Multinomial Logistic Regression
   For categorical variables, ordered or unordered.
   Finds a probability for each category value, then imputes a value using those probabilities.
   My advice: try to avoid using it, as I've found its results to be incorrect (biased).
– Ordinal Logistic Regression
   For ordered categorical variables.
   My advice: it seems to work well, but it needs a large sample size (n > 1000) to do so.
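A sketch of asking ice to use ordinal rather than multinomial logistic regression for an ordered categorical variable, via the cmd() option; the variable and file names are hypothetical, while cmd() and ologit are standard.
   * impute the ordered 3-category variable npbrkm with ologit instead of
   * the default mlogit (hypothetical variable and file names)
   ice npbrkm npnitm npceradm educ using "C:\path\imputed_ologit", m(5) cmd(npbrkm:ologit)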

Useful Material: How to run ice()
Getting the program
– Help -> Search -> [Search all] "ice imputation"
– Click on st_0067_2
– Click "click here to install"
– This gets you ice and micombine, as well as a few other commands
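The package can also be located from the command line; a sketch is below. findit is a standard Stata command, and an ice package has also been distributed via SSC, so the second line may work depending on your Stata version and setup.
   * search Stata Journal / net resources for the ice package
   findit ice imputation
   * if an SSC copy is available for your setup, this installs it
   ssc install ice, replace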

Running ice
– Have the dataset open:
   insheet using "C:\path\example.csv", clear
– Four variables with missing information:
   npnitm: binary variable
   npceradm, npneurm: continuous variables
   npbrkm: 3-category ordered variable
– Four variables with complete data
– We need to make dummy variables for categorical variables:
   recode npbrkm (4=0) (5=1) (6=0) (.=.), generate(brk5)
   recode npbrkm (4=0) (5=0) (6=1) (.=.), generate(brk6)
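Before calling ice, it can help to confirm how much is actually missing. A quick sketch using standard Stata commands (variable names as above; nmiss is a hypothetical new variable):
   * tabulate the categorical variable including its missing values
   tab npbrkm, missing
   * summarize the other variables (N shows the non-missing counts)
   summarize npnitm npceradm npneurm
   * count missing values per observation across the imputation variables
   egen nmiss = rowmiss(npnitm npceradm npneurm npbrkm)
   tab nmiss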

Running ice, continued (1)
– Call ice():
   ice educ mmselast npdage npgender npnitm npceradm npbrkm brk5 brk6 npneurm using "C:\path\outfile", m(5) passive(brk5:npbrkm==5 \ brk6:npbrkm==6) substitute(npbrkm:brk5 brk6) cmd(npbrkm:mlogit, npnitm:logit)
– Here's what the code pieces do:
   educ … npneurm: variables to be used for imputation
   using "C:\path\outfile": the results are written to outfile.dta
   m(5): 5 imputed datasets
   passive(brk5:npbrkm==5 \ brk6:npbrkm==6): Stata will not impute brk5 and brk6; they are updated from the newly imputed values of npbrkm

Running ice, continued (2)
– Here's what the code pieces do (continued):
   substitute(npbrkm:brk5 brk6): npbrkm won't be used to impute other variables; brk5 and brk6 are used in its place
   cmd(npbrkm:mlogit, npnitm:logit): npbrkm is imputed with multinomial logistic regression, and npnitm with logistic regression
   All other variables with missing data use the default methods:
      continuous: OLS
      2 categories: logistic regression
      >2 categories: multinomial logistic regression

Results
A dataset, outfile.dta
   use "C:\path\outfile.dta", clear
New variables:
– _i: row number within each dataset (not generally used)
– _j: imputed dataset number (same as _Imputation_ from Proc MI)
Analyzing the results using micombine, an example:
   xi: micombine regress mmselast npgender npnitm npceradm i.npbrkm
– xi: expands the i.npbrkm term into dummy variables for the analysis
– micombine: automatically performs the MI analysis, using _j to distinguish the imputed datasets
   See its help file for the list of supported regression commands
   For some methods, SAS's MIANALYZE may be needed
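Before modeling, a quick check that the imputed datasets look sensible; a sketch, with _j and the variable names as described above:
   use "C:\path\outfile.dta", clear
   * one block of rows per imputed dataset
   tab _j
   * compare distributions of the imputed variables across datasets
   bysort _j: summarize npnitm npceradm npneurm npbrkm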

The end. Questions?