Presentation on theme: "Texas A&M HSC Jin is designed by Dr. Huber." Presentation transcript:
Korean Female Colon Cancer Risk Factors: Smoking Habits

Range                          Event n   Event %   Non-event n   Non-event %   HR      95% CI         P
Missing                        1449400   79.57     4071          95.70         -       -              -
No smoking                     351896    19.32     93            2.19          1.000
Smoked before, but quit        4611      0.25      21            0.49          1.174   1.058, 1.303   0.0025
Currently, 1/2 pack            8735      0.48      38            0.89          0.948   0.828, 1.084   0.4339
Currently, 1/2 to one pack     5534      0.30      26            0.61          0.991   0.901, 1.09    0.8457
Currently, more than one pack  1410      0.08      5             0.12          1.015   0.894, 1.153   0.8162

☞ Is smoking protective? Not sure, because of the huge amount of missing data!!
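Types of missing data differ by why the data are missing:
1. Missing Completely At Random (MCAR): missingness depends neither on the observed values nor on the missing values.
2. Missing At Random (MAR): missingness depends only on the observed values.
3. Not Missing At Random (NMAR): missingness depends on the missing values themselves, and possibly also on the observed values.
The mechanism affects the effectiveness and the bias of methods for handling missing data.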
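Methods for handling missing data:
1. Complete Case Analysis (CCA)
2. Available Case Analysis (ACA)
3. Mean imputation
4. Expectation-Maximization (EM)
5. Multiple Imputation (MI)
The methods fall into three groups: older methods, single imputation, and multiple imputation. This presentation covers only CCA and MI.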
1. Complete Case Analysis (CCA)
Example data (Y1, Y2, Y3): 140.20 3125. 103540 254857 304960 355565 374770 1403230 426540 5020020
Steps:
1. Delete all cases with missing values on Y1, Y2, or Y3.
2. Analyze the remaining cases.
Notes:
1. CCA = NOT using any method of handling missing data.
2. By deleting cases, power is decreased (because of the reduced sample size).
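The two CCA steps above can be sketched in a few lines of Python. This is only an illustration: the four rows are hypothetical values (not the slide's data), `None` stands for a missing value, and `complete_cases` is a helper name invented here.

```python
# Complete case analysis (listwise deletion) on a small hypothetical dataset.
# None marks a missing value; the numbers are illustrative only.
from statistics import mean

rows = [
    {"Y1": 10, "Y2": 35, "Y3": 40},
    {"Y1": 25, "Y2": None, "Y3": 57},
    {"Y1": 30, "Y2": 49, "Y3": None},
    {"Y1": 35, "Y2": 55, "Y3": 65},
]

def complete_cases(data, variables):
    """Step 1: keep only rows with an observed value on every listed variable."""
    return [r for r in data if all(r[v] is not None for v in variables)]

kept = complete_cases(rows, ["Y1", "Y2", "Y3"])
n_dropped = len(rows) - len(kept)

# Step 2: analyze the remaining cases (here, simply the mean of Y1).
mean_y1_cca = mean(r["Y1"] for r in kept)
```

Half of the toy rows are dropped even though each has only one missing cell, which is the loss of sample size (and power) noted above.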
2. Multiple Imputation (MI)
MI has 3 steps:
(1) Imputation Step
(2) Analysis Step
(3) Combination Step
2. MI (2) Analysis Step
* Standard statistical procedure > a regression is run on each of the 5 completed datasets separately (analyzed 5 times).

Columns: Imputation = imputation number; Model = label of model; Type = type of statistics; Name = variable names for rows of estimated COV; Dep. = dependent variable; RMSE = root mean squared error.

Obs  Imputation  Model   Type   Name       Dep.  RMSE   Intercept  X1      X2     Y
1    1           MODEL1  PARMS             Y     9.49   417.91     -7.96   -1.64
2    1           MODEL1  COV    Intercept  Y     9.49   722.00     -15.61  -3.26  .
3    1           MODEL1  COV    X1         Y     9.49   -15.61     0.34    0.07   .
4    1           MODEL1  COV    X2         Y     9.49   -3.26      0.07    0.02   .
5    2           MODEL1  PARMS             Y     11.80  405.16     -7.81   -1.53
6    2           MODEL1  COV    Intercept  Y     11.80  1052.74    -23.16  -4.60  .
7    2           MODEL1  COV    X1         Y     11.80  -23.16     0.52    0.10   .
8    2           MODEL1  COV    X2         Y     11.80  -4.60      0.10    0.02   .
9    3           MODEL1  PARMS             Y     3.86   233.43     -4.31   -0.80
10   3           MODEL1  COV    Intercept  Y     3.86   28.82      -0.66   -0.12  .
11   3           MODEL1  COV    X1         Y     3.86   -0.66      0.02    0.00   .
12   3           MODEL1  COV    X2         Y     3.86   -0.12      0.00           .
13   4           MODEL1  PARMS             Y     1.76   221.04     -4.17   -0.74
14   4           MODEL1  COV    Intercept  Y     1.76   5.20       -0.12   -0.02  .
15   4           MODEL1  COV    X1         Y     1.76   -0.12      0.00           .
16   4           MODEL1  COV    X2         Y     1.76   -0.02      0.00           .
17   5           MODEL1  PARMS             Y     1.46   215.80     -4.08   -0.71
18   5           MODEL1  COV    Intercept  Y     1.46   3.36       -0.08   -0.01  .
19   5           MODEL1  COV    X1         Y     1.46   -0.08      0.00           .
20   5           MODEL1  COV    X2         Y     1.46   -0.01      0.00           .
2. MI (3) Combination Step
> The results from the 5 datasets are combined into ONE result with the combination equations, which give:
1. Combined estimate
2. Total variance
3. Within-imputation variance
4. Between-imputation variance
5. Degrees of freedom (DF)
6. Fraction of missing information
7. Confidence interval
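The combination equations themselves are not preserved in this transcript. As a sketch, the standard pooling rules for MI (Rubin's rules) can be written as below; it is an assumption that these are the equations the slide showed. The inputs are the rounded X1 coefficients and X1 variances from the analysis-step output (PARMS and COV rows for the 5 imputations), and `pool_rubin` is a name made up for this example.

```python
# Rubin's rules for pooling one coefficient across M imputed datasets.
# Assumption: these standard formulas stand in for the slide's lost equations.
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, variances):
    m = len(estimates)
    q_bar = mean(estimates)            # 1. combined estimate
    w = mean(variances)                # 3. within-imputation variance
    b = variance(estimates)            # 4. between-imputation variance (divides by m - 1)
    t = w + (1 + 1 / m) * b            # 2. total variance
    r = (1 + 1 / m) * b / w            # relative increase in variance due to missingness
    df = (m - 1) * (1 + 1 / r) ** 2    # 5. degrees of freedom (classic large-sample form)
    lam = (1 + 1 / m) * b / t          # 6. share of variance due to missingness, a simple
                                       #    approximation to the fraction of missing information
    return {"estimate": q_bar, "within": w, "between": b,
            "total": t, "se": sqrt(t), "df": df, "lambda": lam}

# X1 coefficient and its variance in each of the 5 imputations (rounded table values).
x1_estimates = [-7.96, -7.81, -4.31, -4.17, -4.08]
x1_variances = [0.34, 0.52, 0.02, 0.00, 0.00]

pooled = pool_rubin(x1_estimates, x1_variances)
```

With these inputs the pooled X1 coefficient is about -5.67 and the total variance about 5.11, nearly all of it between-imputation variance. The confidence interval (item 7) is the estimate plus or minus the t quantile with `df` degrees of freedom times `se`; it is left out here to keep the sketch within the standard library.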
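* Comparison of methods to handle missing values (O = yes, X = no)

Criteria                                CCA   ACA   Mean imputation   EM method   Multiple Imputation
Unbiased parameter estimation, MCAR     O     X     X                 O           O
Unbiased parameter estimation, MAR      X     X     X                 O           O
Unbiased parameter estimation, NMAR     X     X     X                 X           X
Good estimates of variability           X     X     X                 X           O
Best statistical power                  X     O     O                 O           O

MI is the BEST!! It gives excellent estimation, it captures variability through the variance among the M estimates, and it keeps power because the data are multiply imputed without deleting any cases.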
(1) Imputation step of MI: imputation mechanisms for substituting missing values

Pattern                     Type         Normality   Imputation mechanism
Univariate, monotone        Continuous   O           Regression
Univariate, monotone        Continuous   X           Predictive Mean Matching (PMM)
Multivariate, not monotone  Continuous   -           MCMC

MCMC has NOT been tested on the univariate pattern.
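The difference between the regression method and PMM in the table above can be illustrated with a simplified single-predictor sketch. It is not the Stata implementation: a full MI procedure also draws the regression parameters from their posterior for each imputation, which is omitted here. The data, the function names, and the donor count `k=3` are all hypothetical.

```python
# Simplified regression imputation vs. predictive mean matching (PMM).
# Assumption: one predictor x, and no posterior draw of the regression parameters.
import random

def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def impute_once(x, z, method, k=3, seed=0):
    """Fill the None entries of z once, using x as the only predictor."""
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(z) if v is not None]
    a, b = fit_line([x[i] for i in obs], [z[i] for i in obs])
    resid = [z[i] - (a + b * x[i]) for i in obs]
    sigma = (sum(e * e for e in resid) / (len(obs) - 2)) ** 0.5
    filled = list(z)
    for i, v in enumerate(z):
        if v is not None:
            continue
        pred = a + b * x[i]
        if method == "regression":
            # Prediction plus normal noise: relies on normal residuals.
            filled[i] = pred + rng.gauss(0.0, sigma)
        elif method == "pmm":
            # Borrow an observed value from one of the k cases whose
            # predictions are closest to this one: no normality needed.
            donors = sorted(obs, key=lambda j: abs(a + b * x[j] - pred))[:k]
            filled[i] = z[rng.choice(donors)]
        else:
            raise ValueError(f"unknown method: {method}")
    return filled

# Hypothetical toy data: z is roughly 2*x + 1, with three values missing.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
z = [1.0, 3.2, None, 7.1, 8.8, None, 13.3, 15.0, None, 19.1]

z_reg = impute_once(x, z, "regression", seed=1)
z_pmm = impute_once(x, z, "pmm", seed=1)
```

PMM can only return values that were actually observed, which is why it is the usual choice when normality does not hold; the regression method can produce any real number.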
Simulated Data
* 3000 observations are generated on Z1 and X1,…,X6 (all variables are continuous; the Xs are the observed variables and Z1 is the partly missing variable).
* Z1 and X1,…,X6 are drawn from a multivariate normal distribution with Means = 0 and Correlation =
Example Data ("A Predictive Study of Coronary Heart Disease")
* 3154 observations (all variables are continuous)
  - Missing variable: Systolic Blood Pressure, SBP (mean: 128.63)
  - Observed variables (means): DBP (82.02), height (69.78), weight (169.95), age (46.28), BMI (24.52), and cholesterol (226.37)
* Correlation =
1. Missing Mechanisms: Z1 (SBP) is deleted to reach 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, and 80% missing.
1) MCAR: Z1 (SBP) is deleted at random.
2) MAR: after sorting by one of the Xs (observed variables), Z1 (SBP) is deleted.
3) NMAR: after sorting by Z1 (SBP) itself, Z1 (SBP) is deleted.
2. Bias is mainly measured by RMSE (Root Mean Square Error) = Sqrt(Variance of Estimates + Bias^2). RMSE captures both the accuracy and the variability of the estimates and expresses them in the same units.
* True value = mean of Z1 (SBP) at 0% missing
* Estimate = mean of Z1 (SBP) at 10% to 80% missing, after MI
The smaller the RMSE, the better the estimation.
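The three deletion schemes and the RMSE definition above can be sketched as follows. The slides do not say which end of the sorted data is deleted; deleting the largest values is an assumption made here, and the generated data are hypothetical stand-ins for the study's datasets.

```python
# Deleting Z under MCAR, MAR and NMAR, and RMSE = sqrt(variance + bias^2).
# Assumption: after sorting, the cases with the LARGEST values lose their Z.
import random
from math import sqrt
from statistics import mean, pvariance

def delete_z(x, z, frac, mechanism, rng):
    """Return a copy of z with a fraction `frac` of its values set to None."""
    n = len(z)
    k = round(frac * n)
    if mechanism == "MCAR":      # delete at random
        drop = set(rng.sample(range(n), k))
    elif mechanism == "MAR":     # sort by the observed x, then delete
        drop = set(sorted(range(n), key=lambda i: x[i])[n - k:])
    elif mechanism == "NMAR":    # sort by z itself, then delete
        drop = set(sorted(range(n), key=lambda i: z[i])[n - k:])
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return [None if i in drop else v for i, v in enumerate(z)]

def rmse(estimates, true_value):
    """Root mean square error of repeated estimates of one true value."""
    bias = mean(estimates) - true_value
    return sqrt(pvariance(estimates) + bias ** 2)

# Hypothetical data: x is fully observed, z is correlated with x.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(2000)]
z = [0.7 * xi + rng.gauss(0, 0.7) for xi in x]
true_mean = mean(z)              # the "true value" at 0% missing

# Complete-case mean of z at 30% missing under each mechanism.
cca_mean = {}
for mech in ("MCAR", "MAR", "NMAR"):
    z_miss = delete_z(x, z, 0.30, mech, rng)
    cca_mean[mech] = mean(v for v in z_miss if v is not None)
```

Under MCAR the complete-case mean stays close to `true_mean`, while under MAR and NMAR (as constructed here) it is pulled downward because the deleted cases are the ones with large values. Passing a list of estimates from repeated runs to `rmse` gives the quantity the study reports.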
3. Methods to deal with missing values (to measure the effectiveness of MI)
* Complete Case Analysis (CCA)
* Multiple Imputation (MI)
4. Number of imputations: M = 10, 20, 30, 40, and 50
5. Imputation model
* (z1 = x1 x2 x3 x4 x5 x6): all variables
* (z1 = x1 x2 x5): variables highly correlated with z1
* (z1 = x3 x4 x6): variables rarely correlated with z1
The z1 = x1 x2 x5 model is the best model because it has the smallest RMSE.
6. Imputation Mechanisms: Regression method, PMM, MCMC
7. 500 repetitions of each MI (to reduce the random variability of imputation)
ex) M=10 * 500 reps. → average them → mean of estimates for M=10
…
M=50 * 500 reps. → average them → mean of estimates for M=50
8. Statistical Software: Stata 11 (Multiple Imputation)
CCA vs. MI, simulated data (x-axis: proportion of missing data; lower RMSE is better)
* Under MCAR and MAR, both CCA and MI are good.
* After changing the scale of the Y axis: under all missing mechanisms, MI is better than CCA.
* As the percent of missing increases, the RMSEs increase linearly and the difference in RMSE between CCA and MI grows.
> With a high amount of missing data, use Multiple Imputation.
Number of imputations, simulated data (x-axis: proportion of missing data)
* Results are similar regardless of the number of imputations: the 5 lines (M=10 to M=50) go together and look like 1 line.
* Under MCAR and MAR, MI is good.
* Under NMAR, MI gives biased estimates at 80% missing, because the RMSE is large (≒ 1 SD of the data = 0.99).
> No difference among the numbers of imputations M = 10, 20, 30, 40, 50.
Imputation mechanisms, simulated data (x-axis: proportion of missing data; legend: MCMC / Reg.)

Mechanism  Normality   Theory      Practically (MI)
MCAR       Normal      Regression  All imputation mechanisms
MAR        Normal      Regression  All imputation mechanisms (Reg. slightly better)
NMAR       Not normal  PMM         Regression, MCMC

1. Under MCAR and MAR, the regression method should theoretically be better because normality holds, but all methods are good. The regression method is slightly better under MAR.
2. Under NMAR, even though normality is not met, the regression method is better than PMM.
* The normality assumption may not be important under NMAR.
* MCMC is good under all missing mechanisms. Thus, MCMC can be used for univariate, continuous missing data.
CCA vs. MI, example data (x-axis: proportion of missing data; lower RMSE is better)
* Under MCAR and MAR, both CCA and MI are good.
* After changing the scale of the Y axis: under MCAR, MAR, and NMAR, MI produced significantly less biased values than CCA.
* As the percent of missing increases, the RMSEs increase linearly and the difference in RMSE between CCA and MI grows.
> With a high amount of missing data, Multiple Imputation is preferable.
Number of imputations, example data (x-axis: proportion of missing data)
* Results are similar regardless of the number of imputations and the percent of missing.
* Under MCAR and MAR, MI produces unbiased estimates.
* Under NMAR, MI did not do well at 80% missing, because the RMSE is large (≒ 1 SD of the data = 15.11).
* No difference among the increased numbers of imputations M = 10, 20, 30, 40, 50.
> Increasing the number of imputations has no significant effect on correcting bias for data with these characteristics.
Imputation mechanisms, example data (x-axis: proportion of missing data; legend: MCMC / Reg.)

Mechanism  Normality   Theory  Practically (MI)
MCAR       Not normal  PMM     All imputation mechanisms
MAR        Not normal  PMM     All imputation mechanisms (PMM slightly better)
NMAR       Not normal  PMM     Regression, MCMC

1. Under MCAR and MAR, PMM should theoretically be better because the normality assumption is broken, but all methods are good. PMM is slightly better under MAR.
2. Under NMAR, even though normality is not met, the regression method has a lower RMSE than PMM.
* The normality assumption may be important only under MAR.
* MCMC is good to use under MCAR, MAR, and NMAR. Thus, MCMC can be used not only for multivariate, continuous missing data but also for univariate, continuous missing data.
Conclusion
1. Multiple Imputation (MI) is always better than Complete Case Analysis.
2. There is no significant difference among the numbers of imputations in my data.
3. Under MCAR and MAR, MI produces unbiased estimates even at a high amount of missing.
4. However, under NMAR, the estimation by MI is also biased at a high amount of missing.
5. MCMC is good for univariate, continuous missing data under MCAR, MAR, and NMAR.