Multiple Imputation: Stata (ice)
How and when to use it.

How ice() works
Each variable with missing data is the subject of a regression.
– Typically all other variables are used as predictors
– Estimate β, σ via the regression
– Draw σ* from its posterior distribution (non-informative prior)
– Draw β* from its posterior distribution (non-informative prior)
– Find predicted values: Ŷ = Xβ*, then either:
   » Keep Ŷ for the missing values (default option)
   » Predictive Mean Matching
– Move on to the next variable, using the newly-predicted values
– Cycle through the variables a number of times (10 is default)
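One pass of the regression step for a single continuous variable can be sketched in Stata as follows. This is an illustration on simulated data, not ice's own code: the variables x and y, the seed, and the 30% missingness rate are all made up. It also assumes the imputed value is Xβ* plus a normal residual scaled by σ*, which is the usual way the σ* draw is used in regression imputation; the slide only writes Ŷ = Xβ*.

```stata
* Sketch only: one imputation draw for one continuous variable (simulated data)
clear
set seed 12345
set obs 200
generate x = rnormal()
generate y = 1 + 2*x + rnormal()
replace y = . if runiform() < 0.3        // make about 30% of y missing (MCAR)

* Estimate beta and sigma from the complete cases
regress y x
scalar df    = e(df_r)
scalar s_hat = e(rmse)
matrix b     = e(b)'                      // column vector, order: x, _cons
matrix V     = e(V)

* Draw sigma*: s_hat * sqrt(df / g), where g is a chi-squared(df) draw
scalar s_star = s_hat * sqrt(df / rchi2(df))

* Draw beta*: b + (s_star/s_hat) * chol(V) * z, with z standard normal
matrix z      = (`=rnormal()' \ `=rnormal()')
matrix b_star = b + `=s_star/s_hat' * cholesky(V) * z

* Fill in missing y with X*beta* plus a residual scaled by sigma*
generate y_imp = y
replace y_imp = b_star[1,1]*x + b_star[2,1] + scalar(s_star)*rnormal() if missing(y)
```

ice repeats a step like this for every variable with missing data, feeding the filled-in values into the next regression, and then cycles.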

Assumptions
Missing at Random (MAR)
– No getting around this one. MCAR (missing completely at random) is fine, of course.
Distinct Parameters
– Does the missing data mechanism govern what data-generating parameters you can see? Ex: limits of detection.
Adequate Sample Size
– Hard to quantify. Regression on continuous variables doesn’t take much, but other methods certainly can.
Convergence to a Posterior Distribution
– Standard MI (such as Proc MI) is known to converge to a posterior distribution with enough iterations. ice() does not have this guarantee. This is typically ignored when ice() is used.
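MAR itself cannot be verified from the observed data, but it is easy to check whether missingness is related to the observed variables, which at least separates MCAR from everything else. A rough sketch on simulated data (x, y, and the missingness rule are made up for illustration):

```stata
* Sketch only: is missingness in y related to an observed variable?
clear
set seed 2468
set obs 500
generate x = rnormal()
generate y = 1 + 2*x + rnormal()

* Here missingness in y depends on the observed x (MAR, but not MCAR)
replace y = . if runiform() < invlogit(-1 + x)

* Indicator for missing y, then model it on the observed variable
generate byte y_missing = missing(y)
tabulate y_missing
logit y_missing x
```

A clear association between y_missing and x rules out MCAR and is a reason to keep x in the imputation model.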

Predictive Mean Matching
We have ŷ_mis, the predicted value for an observation where the variable is missing, and ŷ_obs, the predicted values for observations where it is observed
– Previously
   » Find the ŷ_obs that is closest to ŷ_mis, and fill in the missing observation’s value with the true (observed) value belonging to that ŷ_obs
   » Was the default behavior for previous versions of ice()
   » Could be a problem: not enough variability.
– Currently
   » Find a set of ŷ_obs that are close to ŷ_mis, choose one randomly, and fill in the missing observation’s value with the true (observed) value belonging to that ŷ_obs
   » Invoked by using the “match” argument
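The "currently" version of matching can be sketched by hand. This is an illustration on simulated data, not what ice runs internally: the donor pool size k = 3 is an assumption, and for brevity the predictions come from the fitted β rather than a drawn β*.

```stata
* Sketch only: predictive mean matching with a random donor (simulated data)
clear
set seed 2024
set obs 100
generate x = rnormal()
generate y = 1 + 2*x + rnormal()
replace y = . in 81/100                   // last 20 values of y are missing

regress y x
predict double yhat                       // predicted mean for every row

local k = 3                               // size of the donor pool (assumed)
generate y_pmm = y
forvalues i = 81/100 {
    * Distance from this row's prediction to each observed row's prediction
    generate double dist = abs(yhat - yhat[`i']) if !missing(y)
    egen long r = rank(dist), unique      // 1 = closest donor
    local pick = 1 + floor(`k' * runiform())
    * Copy the observed y of the randomly chosen donor
    summarize y if r == `pick', meanonly
    replace y_pmm = r(mean) in `i'
    drop dist r
}
```

Setting k to 1 reproduces the "previously" behavior: the single closest donor is always used, which is where the loss of variability comes from.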

Other Regression Methods
Multinomial Logistic Regression
– For categorical variables, ordered or unordered
– Finds a probability for each category value, then imputes a value using those probabilities.
– My advice: try to avoid using it, as I’ve found its results to be incorrect (biased)
Ordinal Logistic Regression
– For ordered categorical variables
– My advice: it seems to work well, but it needs a large (n>1000) sample size to work
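"Imputes a value using those probabilities" means drawing a category at random with the predicted probabilities as weights. A minimal sketch on simulated data, assuming a three-level variable grp coded 4/5/6 and a single predictor x (both hypothetical), and using the fitted coefficients rather than a posterior draw:

```stata
* Sketch only: impute a 3-category variable from mlogit probabilities
clear
set seed 1357
set obs 300
generate x = rnormal()
generate e = rnormal()
generate grp = 4 + (x + e > -0.5) + (x + e > 0.5)    // values 4, 5, 6
replace grp = . if runiform() < 0.2

mlogit grp x
predict p4 p5 p6                          // one predicted probability per category

* Draw a category by comparing a uniform draw with the cumulative probabilities
generate u = runiform()
generate grp_imp = grp
replace grp_imp = cond(u < p4, 4, cond(u < p4 + p5, 5, 6)) if missing(grp)
```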

Useful Material: How to run ice()
Getting the program
– Help -> Search -> [Search all] “ice imputation”
– Click on st0067_2 (www.stata-journal.com)
– Click “click here to install”
– This gets you ice and micombine, as well as a few other commands

Running ice
– Have the dataset open
   insheet using "C:\path\example.csv", clear
– Four variables with missing information
   » npnitm: binary variable
   » npceradm, npneurm: continuous variables
   » npbrkm: 3-category ordered variable
– Four variables with complete data
– We need to make dummy variables for categorical variables:
   recode npbrkm (4=0) (5=1) (6=0) (.=.), generate(brk5)
   recode npbrkm (4=0) (5=0) (6=1) (.=.), generate(brk6)

Running ice, continued (1)
– Call ice()
   ice educ mmselast npdage npgender npnitm npceradm npbrkm brk5 brk6 npneurm using "C:\path\outfile", m(5) passive(brk5:npbrkm==5 \ brk6:npbrkm==6) substitute(npbrkm:brk5 brk6) cmd(npbrkm:mlogit, npnitm:logit)
– Here’s what the code pieces do:
   » educ … npneurm: variables to be used for imputation
   » using "C:\path\outfile": the result; outfile.dta
   » m(5): 5 imputed datasets
   » passive(brk5:npbrkm==5 \ brk6:npbrkm==6): Stata will not impute brk5 and brk6 directly; they will be updated from the new values in npbrkm

Running ice, continued (2)
– Here’s what the code pieces do:
   » substitute(npbrkm:brk5 brk6): npbrkm won’t be used to impute other variables; brk5 and brk6 will be used in its place
   » cmd(npbrkm:mlogit, npnitm:logit): npbrkm will be imputed by multinomial logistic regression, npnitm by logistic regression
   » All other variables with missing data use default methods:
      continuous: OLS
      n=2 categories: Logistic Regression
      n>2 categories: Multinomial Logistic Regression

Results
A dataset, outfile.dta
– use "C:\path\outfile.dta", clear
New variables
– _i: row number per dataset (not generally used)
– _j: imputed dataset number (same as _Imputation_ from Proc MI)
Analyzing the results using micombine, an example
– xi: micombine regress mmselast npgender npnitm npceradm i.npbrkm
– xi: expand interactions. Used to break npbrkm into dummy variables for the analysis
– micombine: automatically does the MI analysis, using _j to distinguish between the imputed datasets
   » See its help file for a list of supported regression commands
   » For some methods, SAS’s MIANALYZE may be needed
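The "MI analysis" that micombine performs is the pooling of the per-dataset fits by Rubin's rules. For one coefficient, with estimate \hat{Q}_j and squared standard error U_j from imputed dataset j = 1, …, m (the symbols are notation chosen here, not taken from the slides), the standard combination is:

```latex
\begin{aligned}
\bar{Q} &= \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j
  && \text{(pooled estimate)} \\
\bar{U} &= \frac{1}{m}\sum_{j=1}^{m} U_j
  && \text{(within-imputation variance)} \\
B &= \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{Q}_j-\bar{Q}\bigr)^{2}
  && \text{(between-imputation variance)} \\
T &= \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B
  && \text{(total variance; pooled SE} = \sqrt{T}\text{)}
\end{aligned}
```

With m(5) as in the example call, m = 5, so the between-imputation variance is inflated by a factor of 1.2 before being added to the within-imputation variance.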

The end. Questions?
