Texas A&M HSC Jin is designed by Dr. Huber. Korean Female Colon Cancer Risk Factors Range EventNon-event HR95% CIP n%n% Smoking Habits Missing144940079.57407195.70----


1 Texas A&M HSC Jin is designed by Dr. Huber

2 Korean Female Colon Cancer Risk Factors
Table columns: Range | Event (n, %) | Non-event (n, %) | HR | 95% CI | P
Smoking Habits (rows):
  Missing
  No smoking
  Smoked before, but quit
  Currently, 1/2 pack
  Currently, 1/2 to one pack
  Currently, more than one pack
☞ Is smoking protective? Not sure, because of the huge amount of missing data!!

3 Missing data mechanisms (they differ by WHY the data are missing)
1. Missing Completely At Random (MCAR): missingness depends neither on the observed values nor on the missing values
2. Missing At Random (MAR): missingness depends only on the observed values
3. Not Missing At Random (NMAR): missingness depends on both the observed and the missing values
The mechanism affects the effectiveness and the bias of methods for handling missing data.
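The three definitions above are often written as conditions on the probability of missingness. The notation below is an assumption added for illustration and does not appear on the slide: R is the missingness indicator, and Y_obs and Y_mis are the observed and missing parts of the data.

```latex
\begin{align*}
\text{MCAR:} \quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R) \\
\text{MAR:}  \quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R \mid Y_{\mathrm{obs}}) \\
\text{NMAR:} \quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) \text{ still depends on } Y_{\mathrm{mis}}
\end{align*}
```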

4 Methods for handling missing data
1. Complete Case Analysis (CCA)
2. Available Case Analysis (ACA)
3. Mean imputation
4. Expectation-Maximization (EM)
5. Multiple Imputation (MI)
Categories: Older Methods, Single Imputation, Multiple Imputation
This study covers only CCA and MI.

5 Complete Case Analysis (CCA)
[Table: example data with columns Y1, Y2, Y3]
Procedure:
1. Delete all cases with missing values on Y1, Y2, Y3
2. Analyze the remaining cases
Notes:
1. CCA = NOT using any method of handling missing data
2. Deleting cases decreases power (because the sample size is reduced)
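A minimal sketch of the two CCA steps, using only the Python standard library. The toy rows and the helper name complete_cases are hypothetical and are not taken from the slides; None marks a missing value.

```python
from statistics import mean

def complete_cases(rows, columns):
    """Step 1: keep only the rows with no missing (None) value in `columns`."""
    return [row for row in rows if all(row[c] is not None for c in columns)]

# Hypothetical toy data, for illustration only.
rows = [
    {"Y1": 1.0, "Y2": 2.0, "Y3": 3.0},
    {"Y1": None, "Y2": 1.5, "Y3": 2.5},
    {"Y1": 2.0, "Y2": None, "Y3": 4.0},
    {"Y1": 3.0, "Y2": 4.0, "Y3": 5.0},
]

kept = complete_cases(rows, ["Y1", "Y2", "Y3"])

# Step 2: analyze the remaining cases (here, simply the mean of Y1).
y1_mean = mean(row["Y1"] for row in kept)
print(f"{len(rows)} cases -> {len(kept)} complete cases, mean(Y1) = {y1_mean}")
```

Half of the toy cases are dropped even though each dropped row is missing only one value, which is the loss of sample size (and power) noted above.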

6 2. Multiple Imputation (MI)
MI has 3 steps:
(1) Imputation Step
(2) Analysis Step
(3) Combination Step

7 2. MI (1) Imputation Step
[Tables: five imputed copies of the data, each with columns Imputation Number, Y, X1, X2]
Result: "5 complete datasets"

8 2. MI (2) Analysis Step
* Standard statistical procedure > regression for each of the 5 complete datasets separately (analyzed 5 times)
[Table: regression output per imputation. Columns: Imputation Number | Label of model | Type of statistics | Variable names for rows of estimated COV | Dependent variable | Root mean squared error | Intercept | X1 | X2 | Y. For each of the 5 imputations, MODEL1 has one PARMS row and three COV rows (Intercept, X1, X2), all with dependent variable Y.]

9 2. MI (3) Combination Step
> The results from the 5 datasets are combined into ONE result with the combination equations:
1. Combined estimate
2. Total variance
3. Within-imputation variance
4. Between-imputation variance
5. DF
6. Fraction of missing information
7. Confidence interval
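The slide's own equations are not preserved in this transcript. The sketch below uses the standard combining rules usually attributed to Rubin, which are assumed (not confirmed) to be what the slide showed. The function name pool_rubin and the toy numbers are hypothetical.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine m per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)            # 1. combined estimate
    w = mean(variances)                # 3. within-imputation variance
    b = variance(estimates)            # 4. between-imputation variance (divides by m - 1)
    t = w + (1 + 1 / m) * b            # 2. total variance
    if b == 0:
        # No between-imputation variability: nothing is lost to missingness.
        r, df, fmi = 0.0, math.inf, 0.0
    else:
        r = (1 + 1 / m) * b / w        # relative increase in variance
        df = (m - 1) * (1 + 1 / r) ** 2            # 5. degrees of freedom
        fmi = (r + 2 / (df + 3)) / (r + 1)         # 6. fraction of missing information
    return {"estimate": q_bar, "within": w, "between": b,
            "total": t, "df": df, "fmi": fmi, "se": math.sqrt(t)}

# Hypothetical results from m = 3 imputed datasets.
pooled = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled)
```

The confidence interval (item 7) is the combined estimate plus or minus a t quantile with df degrees of freedom times the pooled standard error. The standard library has no t distribution, so that last step is left to a statistics package.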

10 * Comparison of methods to handle missing values (O = yes, X = no)
Criteria                              CCA   ACA   Mean Imputation   EM method   Multiple Imputation
Unbiased Parameter Estimation, MCAR   O     X     X                 O           O
Unbiased Parameter Estimation, MAR    X     X     X                 O           O
Unbiased Parameter Estimation, NMAR   X     X     X                 X           X
Good Estimates of Variability         X     X     X                 X           O
Best Statistical Power                X     O     O                 O           O
MI is the BEST!!
- Excellent estimation of variability: uses the variance among the 'M' estimates
- Best power: data are multiply imputed, so no cases are deleted

11 (1) Imputation step of MI: imputation mechanisms for substituting missing values
Pattern                     Type         Normality   Imputation mechanism
Univariate, monotone        Continuous   O           Regression
Univariate, monotone        Continuous   X           Predictive Mean Matching (PMM)
Multivariate, not monotone  Continuous   -           MCMC
MCMC has NOT been tested on univariate missing data.
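To make the difference between the regression method and PMM concrete, here is a simplified single-predictor sketch. It is an illustration under stated assumptions, not the procedure used in the study: a proper MI implementation also draws the regression parameters from their posterior and adds residual noise, both of which are omitted here. The names fit_line and impute, the donor count k, and the toy data are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def impute(x, z, method, k=3, rng=None):
    """Fill None entries of z using x. method is 'regression' or 'pmm'."""
    rng = rng or random.Random(0)
    observed = [i for i, v in enumerate(z) if v is not None]
    a, b = fit_line([x[i] for i in observed], [z[i] for i in observed])
    filled = list(z)
    for i, v in enumerate(z):
        if v is not None:
            continue
        pred = a + b * x[i]
        if method == "regression":
            # Regression method (simplified): impute the predicted value itself.
            filled[i] = pred
        else:
            # PMM: find the k observed cases whose predictions are closest,
            # then borrow the actual observed value of one randomly chosen donor.
            donors = sorted(observed, key=lambda j: abs(a + b * x[j] - pred))[:k]
            filled[i] = z[rng.choice(donors)]
    return filled

# Hypothetical toy data; None marks a missing value of z.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
z = [2.1, 3.9, None, 8.2, None, 12.1]
print(impute(x, z, "regression"))
print(impute(x, z, "pmm"))
```

Because PMM only ever reuses observed values, it does not rely on the normality of the imputed variable, which matches the "Normality: X" row in the table above.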

12 Simulated Data
* 3000 obs. are generated on Z1 and X1,…,X6 (all variables are continuous)
  (Xs: observed variables; Z1: partly missing variable)
* Z1 and X1,…,X6 are drawn from a multivariate normal distribution with Means = 0 and Correlation = [not shown]

13 Example Data ("A Predictive Study of Coronary Heart Disease")
* 3154 obs. (all variables are continuous)
  - Missing variable: Systolic Blood Pressure (SBP) (Mean: [not shown])
  - Observed variables: DBP (82.02), height (69.78), weight (169.95), age (46.28), BMI (24.52), and Cholesterol (Mean: [not shown])
* Correlation = [not shown]

14 Study design
1. Missing mechanisms (Z1 (SBP) values deleted at 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%)
   1) MCAR: Z1 (SBP) deleted at random
   2) MAR: after sorting by one of the Xs (observed variables), Z1 (SBP) deleted
   3) NMAR: after sorting by Z1 (SBP) itself, Z1 (SBP) deleted
2. Bias is mainly measured by RMSE (Root Mean Square Error) = Sqrt(Variance of Estimates + Bias^2)
   : captures both the accuracy and the variability of the estimates and compares them in the same units.
   * True value = mean of Z1 (SBP) at 0% missing
   * Estimate = mean of Z1 (SBP) at 10% to 80% missing, after MI
   The "smaller" the RMSE, the "better" the estimation.
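The deletion schemes and the RMSE formula above can be sketched as follows. The slides do not say which end of the sorted data was deleted, so deleting the largest values is an assumption made here. The function names and the toy data are hypothetical.

```python
import math
import random
from statistics import mean, pvariance

def make_missing(z, x, mechanism, fraction, rng):
    """Return a copy of z with `fraction` of its values set to None."""
    n = len(z)
    n_del = round(fraction * n)
    if mechanism == "MCAR":
        drop = rng.sample(range(n), n_del)            # delete at random
    elif mechanism == "MAR":
        order = sorted(range(n), key=lambda i: x[i])  # sort by an observed variable
        drop = order[n - n_del:]                      # assumption: delete the top end
    elif mechanism == "NMAR":
        order = sorted(range(n), key=lambda i: z[i])  # sort by the variable itself
        drop = order[n - n_del:]                      # assumption: delete the top end
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    drop = set(drop)
    return [None if i in drop else v for i, v in enumerate(z)]

def rmse(estimates, true_value):
    """RMSE = sqrt(variance of the estimates + bias^2)."""
    bias = mean(estimates) - true_value
    return math.sqrt(pvariance(estimates) + bias ** 2)

# Hypothetical toy data: z is correlated with x.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(200)]
z = [xi + rng.gauss(0, 0.5) for xi in x]

z_nmar = make_missing(z, x, "NMAR", 0.20, rng)
cc_mean = mean(v for v in z_nmar if v is not None)
print(f"true mean = {mean(z):.3f}, complete-case mean under NMAR = {cc_mean:.3f}")
```

With the population variance, variance plus squared bias equals the mean squared distance of the estimates from the true value, so the RMSE stays in the same units as the estimate itself.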

15 Study design (continued)
3. Methods to deal with missing values (to measure the effectiveness of MI)
   - Complete Case Analysis (CCA)
   - Multiple Imputation (MI)
4. Imputation numbers: M = 10, 20, 30, 40, and 50
5. Imputation models
   - (z1 = x1 x2 x3 x4 x5 x6): all variables
   - (z1 = x1 x2 x5): variables highly correlated with z1
   - (z1 = x3 x4 x6): variables rarely correlated with z1
   The z1 = x1 x2 x5 model is the best model because it has the smallest RMSE.

16 Study design (continued)
6. Imputation mechanisms: Regression method, PMM, MCMC
7. Repetitions on each MI (to reduce the random variability of imputation)
   ex) M=10 × 500 reps. → average them → mean of estimates for M=10
       …
       M=50 × 500 reps. → average them → mean of estimates for M=50
8. Statistical software: STATA 11 (Multiple Imputation)

17 [Figure: RMSE by proportion of missing data, CCA vs. MI (simulated data)]
- Under MCAR and MAR, both CCA and MI are good.
- After changing the scale of the Y axis: under all missing mechanisms, MI is better than CCA.
- As the percent of missing increases, the RMSEs increase linearly, and so does the difference in RMSE between CCA and MI.
> With a high amount of missing data, use Multiple Imputation.

18 [Figure: RMSE by proportion of missing data, by number of imputations (simulated data)]
- Results are similar regardless of the number of imputations: the 5 lines (M=10 to M=50) go together and look like 1 line.
- Under MCAR and MAR, MI is good!
- Under NMAR, MI gives biased estimates at 80% missing, because the RMSE is large (≒ 1 SD of the data = 0.99).
> No difference among the different imputation numbers (M) = 10, 20, 30, 40, 50.

19 [Figures: RMSE by proportion of missing data, by imputation mechanism (Reg., PMM, MCMC) (simulated data)]
Mechanism   Normality    Theory       Practically (MI)
MCAR        Normal       Regression   All imputation mechanisms
MAR         Normal       Regression   All imputation mechanisms (Reg. slightly better)
NMAR        Not Normal   PMM          Regression, MCMC
1. Under MCAR and MAR, the Reg. method should theoretically be better because of normality, but all methods are good. However, the Reg. method is slightly better under MAR.
2. Under NMAR, even though normality is not met, the Reg. method is better than PMM.
* The normality assumption may not be important under NMAR.
* MCMC is good under all missing mechanisms. Thus, MCMC can be used for univariate, continuous missing data.

20 [Figure: RMSE by proportion of missing data, CCA vs. MI (example data)]
- Under MCAR and MAR, both CCA and MI are good.
- After changing the scale of the Y axis: under MCAR, MAR, and NMAR, MI produced significantly less biased values than CCA.
- As the percent of missing increases, the RMSEs increase linearly, and so does the difference in RMSE between CCA and MI.
> With a high amount of missing data, Multiple Imputation is preferable.

21 [Figure: RMSE by proportion of missing data, by number of imputations (example data)]
- Results are similar regardless of the number of imputations and the percent of missing.
- Under MCAR and MAR, MI produces unbiased estimates.
- Under NMAR, MI did not perform well at 80% missing, due to the large RMSE (≒ 1 SD of the data = 15.11).
- No difference among the increased imputation numbers = 10, 20, 30, 40, 50.
> Increasing the number of imputations has no significant effect on correcting bias for data with these characteristics.

22 [Figures: RMSE by proportion of missing data, by imputation mechanism (Reg., PMM, MCMC) (example data)]
Mechanism   Normality    Theory   Practically (MI)
MCAR        Not Normal   PMM      All imputation mechanisms
MAR         Not Normal   PMM      All imputation mechanisms (PMM slightly better)
NMAR        Not Normal   PMM      Regression, MCMC
1. Under MCAR and MAR, PMM should theoretically be better because the normality assumption is broken, but all methods are good. However, the PMM method is slightly better under MAR.
2. Under NMAR, even though normality is not met, Reg. has a lower RMSE than PMM.
* The normality assumption may be important only under MAR.
* MCMC is good to use under MCAR, MAR, and NMAR. Thus, MCMC can be used not only for multivariate, continuous missing data but also for univariate, continuous missing data.

23 Conclusion
1. Multiple Imputation (MI) always performed better than Complete Case Analysis.
2. No significant difference among imputation numbers in my data.
3. Under MCAR and MAR, MI produces unbiased estimates even at a high amount of missing data.
4. However, under NMAR, the estimation by MI is also biased at a high amount of missing data.
5. MCMC is good for univariate, continuous missing data under MCAR, MAR, and NMAR.
