# DEPARTMENT OF MATHEMATICS AND STATISTICS Handling Missing Data Tao Sun Lena Zhang Yaqing Chen Francisco Aguirre SSC Case Study 2002.

## Presentation transcript

DEPARTMENT OF MATHEMATICS AND STATISTICS

Handling Missing Data

Tao Sun, Lena Zhang, Yaqing Chen, Francisco Aguirre

SSC Case Study 2002

DEPARTMENT OF MATHEMATICS AND STATISTICS SSC Conference Hamilton Ontario May 2002

Presentation outline

Objective: Compare different approaches to handling missing data from a practitioner’s point of view.

1. Preliminary analysis
   - Various plots
2. Assessing the missing pattern
   - Spearman rank correlation, logistic regression
3. Data analysis with missing data: multiple imputation
   - Random hot deck imputation with bootstrap
   - PROC MI and MIANALYZE (SAS)
   - transcan function (Hmisc library in S-Plus or R)
4. Conclusions
5. Further work

Preliminary analysis: response overview

- Sample size: 2389
- Males: 1097 (45.9%)
- Females: 1292 (54.1%)
- Observed: 1691
- Missing: 698 (28.8%)
- Mean: 0.9129

The response variable is highly skewed to the left.

Figure: histogram of observed responses (DVHST94).

Preliminary analysis: covariates

There are 8 covariates in total; the first 4 are shown here. The response DVHST94 appears to form two clusters (below 0.5 and above 0.5).

DVBMI94 appears to have some “wild” values (= 96):

- 43 observations, all males (3.9% of the male sample).
- Wild values were replaced with the mean DVBMI94 of males.
- DVBMI94 transformation: NEW.DVBMI94 = abs(DVBMI94 - 22).
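The DVBMI94 cleaning steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not say whether the male mean was computed with or without the wild values (the sketch excludes them), and the function name `recode_bmi`, the `"M"`/`"F"` coding and the toy data are hypothetical.

```python
from statistics import mean

WILD_BMI = 96    # "wild" value reported on the slide
BMI_CENTER = 22  # centering constant from the slide's transformation


def recode_bmi(bmi_values, sexes):
    """Replace wild male BMI values with the male mean, then apply abs(x - 22)."""
    # Assumption: the male mean is taken over non-wild male values only.
    valid_male = [b for b, s in zip(bmi_values, sexes)
                  if s == "M" and b != WILD_BMI]
    male_mean = mean(valid_male)
    # Substitute the male mean wherever a male record carries the wild value.
    cleaned = [male_mean if (s == "M" and b == WILD_BMI) else b
               for b, s in zip(bmi_values, sexes)]
    # NEW.DVBMI94 = abs(DVBMI94 - 22)
    return [abs(b - BMI_CENTER) for b in cleaned]


# Toy data (hypothetical, not taken from the case study).
bmi = [24.0, 96, 30.0, 19.0]
sex = ["M", "M", "M", "F"]
new_bmi = recode_bmi(bmi, sex)
print(new_bmi)
```

Folding the variable around 22 treats departures in either direction from that value as equivalent, which is what the absolute-value transformation on the slide does.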

Preliminary analysis: recoding

There are no obvious linear patterns between the covariates and the response DVHST94.

DVPP94 is recoded as dichotomous:

- NEW.DVPP94 = 0 (91% of observations)
- NEW.DVPP94 > 0 (9% of observations)

The AGEGRP covariate is recoded to NEW.AGE: NEW.AGE = mid-range value (AGEGRP) - 20.

Preliminary analysis

Figure: mean DVHST94.

Preliminary analysis

Figure: strength of the marginal relationships between the covariates and the response, using the generalized Spearman chi-square.

Assessing the missing pattern

The missing pattern of the response does not appear to depend on the sampling weights.

Assessing the missing pattern

The missing values depend on age.

Assessing the missing pattern: logistic regression

Coefficients:

| Term | Estimate | Std. Error | z value | Pr(>\|z\|) | Signif. |
|---|---|---|---|---|---|
| (Intercept) | -5.058793 | 0.367083 | -13.781 | < 2e-16 | *** |
| NEW.AGE | 0.181625 | 0.007524 | 24.140 | < 2e-16 | *** |
| SEXMale | -0.847947 | 0.131475 | -6.450 | 1.12e-10 | *** |
| DVHHIN94 | 0.047828 | 0.026768 | 1.787 | 0.0740 | . |
| DVSMKT94 | -0.015131 | 0.031662 | -0.478 | 0.6327 | |
| NEW.DVPP94 = 0 | 0.233188 | 0.226732 | 1.028 | 0.3037 | |
| NUMCHRON | -0.087992 | 0.048783 | -1.804 | 0.0713 | . |
| VISITS | 0.012483 | 0.006563 | 1.902 | 0.0572 | . |
| NEW.WT6 | -0.043935 | 0.077407 | -0.568 | 0.5703 | |
| NEW.DVBMI94 | -0.015622 | 0.017299 | -0.903 | 0.3665 | |

Percentage missing: 24% for males, 34% for females.
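To read the two dominant coefficients, it can help to convert them to odds ratios and to recompute the Wald z statistics from the reported estimates and standard errors. The sketch below does only that arithmetic. It assumes the model is for the probability that the response is missing, which is how the slide's missing percentages by sex suggest it should be read; the helper names are illustrative.

```python
import math

# Estimates and standard errors copied from the logistic regression slide.
coefs = {
    "NEW.AGE": (0.181625, 0.007524),
    "SEXMale": (-0.847947, 0.131475),
}


def odds_ratio(estimate):
    """Multiplicative change in the odds for a one-unit increase in the covariate."""
    return math.exp(estimate)


def wald_z(estimate, std_error):
    """Wald statistic: the estimate divided by its standard error."""
    return estimate / std_error


summary = {name: (odds_ratio(b), wald_z(b, se))
           for name, (b, se) in coefs.items()}

for name, (ratio, z) in summary.items():
    print(f"{name}: odds ratio {ratio:.3f}, z = {z:.2f}")
```

Under that reading, each additional unit of NEW.AGE multiplies the odds of a missing response by about 1.2, and the odds for males are roughly 0.43 times those for females, in line with the lower missing percentage reported for males.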

Multiple imputation

Methods:

- Random hot deck MI with bootstrap
- SAS PROC MI and PROC MIANALYZE
- Function transcan in S-Plus, from the Hmisc library (Frank Harrell)

Multiple imputation

Workflow: incomplete data → imputation → imputed data → analysis → analysis results → pooling → final results.

- Imputation: Impute the missing entries of the incomplete data set B times, resulting in B completed data sets.
- Analysis: Analyze each of the B completed data sets using weighted least squares.
- Pooling: Integrate the B analysis results into a final result. Simple rules exist for combining the B analyses.
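The pooling step is commonly done with Rubin's combining rules, which is presumably what the slide's "simple rules" refers to. The sketch below applies them to a single coefficient: the pooled estimate is the average of the B estimates, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/B). The function name and the numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine B point estimates and their variances with Rubin's rules."""
    b = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance across imputations (divisor B - 1)
    total = within + (1 + 1 / b) * between
    return pooled_estimate, within, between, total


# Hypothetical results for one coefficient from B = 5 completed data sets.
estimates = [0.10, 0.12, 0.11, 0.09, 0.13]
variances = [0.004, 0.005, 0.004, 0.006, 0.005]

pooled_estimate, within, between, total = pool_rubin(estimates, variances)
print(pooled_estimate, within, between, total)
```

The between-imputation term is what makes the pooled standard error reflect the uncertainty caused by the missing data, rather than treating imputed values as if they had been observed.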

Random hot-deck MI with bootstrap

B = 1000 replicates.

- The data are split into records with an observed response and records with a missing response.
- Each missing response is filled in by choosing randomly, with replacement, from the observed responses, with probability proportional to the weights.
- The completed data are analyzed, giving the estimates together with the within variance and R-square.
- The same procedure is repeated for each replicate.
- A 95% CI is computed from the estimated values for judging the significance of the predictors.
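A minimal sketch of the weighted random hot deck follows. It is not the authors' code: for brevity the statistic computed on each completed data set is a weighted mean rather than the weighted regression used in the talk, the percentile interval is one assumed way of forming the 95% CI, and all names and toy values are hypothetical.

```python
import random


def hot_deck_impute(responses, weights, rng):
    """Fill None entries with weighted draws, with replacement, from observed values."""
    donor_values = [y for y in responses if y is not None]
    donor_weights = [w for y, w in zip(responses, weights) if y is not None]
    n_missing = sum(y is None for y in responses)
    # Selection probability is proportional to the donor's sampling weight.
    draws = iter(rng.choices(donor_values, weights=donor_weights, k=n_missing))
    return [next(draws) if y is None else y for y in responses]


def replicate_means(responses, weights, replicates, seed=0):
    """Weighted mean of the completed data for each hot-deck replicate."""
    rng = random.Random(seed)
    total_weight = sum(weights)
    means = []
    for _ in range(replicates):
        completed = hot_deck_impute(responses, weights, rng)
        means.append(sum(w * y for w, y in zip(weights, completed)) / total_weight)
    return means


# Toy data (hypothetical): None marks a missing response.
responses = [0.9, None, 0.4, 1.0, None, 0.8]
weights = [1.0, 2.0, 1.5, 0.5, 1.0, 2.0]

means = sorted(replicate_means(responses, weights, replicates=1000))
# Assumed percentile interval over the 1000 replicate estimates.
ci_low, ci_high = means[25], means[974]
print(ci_low, ci_high)
```

Because donors are drawn without reference to the covariates, the imputed values cannot reflect a dependence of missingness on age or sex, which is the weakness the talk attributes to this method.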

PROC MI and MIANALYZE method

PROC MI:

1. By default, generates 5 imputed values for each missing value.
2. Imputation method: MCMC (Markov chain Monte Carlo).
   - The EM algorithm determines the initial values.
   - MCMC repeatedly simulates the distribution of interest, from which the imputed values are drawn.
3. Assumption: the data follow a multivariate normal distribution.

PROC REG: Fits five weighted linear regression models to the five completed data sets obtained from PROC MI (using the `BY _Imputation_` statement).

PROC MIANALYZE: Reads the parameter estimates and the associated covariance matrix from the analysis performed on the multiply imputed data sets and derives valid statistics for the parameters.

transcan (S-Plus, Hmisc)

- Transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables.
- Approximates the multiple imputation algorithm based on Rubin’s approximate Bayesian bootstrap:
  1. Draws a sample of size r, with replacement, from the r non-missing residuals of the least squares (LS) fit (first bootstrap).
  2. Chooses a sample of size m, with replacement, from this sample of size r, where m is the number of missing values (second bootstrap).
  3. Generates imputed values with the linear imputation model and the bootstrapped residuals.
- Advantages: Does not need a normality assumption or symmetry of the residuals. Applies shrinkage to avoid overfitting.
- Disadvantage: “Freezes” the imputation model before drawing the multiple imputations (Frank Harrell).

This algorithm is repeated B times to obtain the multiply imputed data sets, which are analyzed using WLS with the function lm.
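The two-stage residual draw described on this slide can be sketched as below. This is a plain-Python illustration of the approximate Bayesian bootstrap idea, not a reimplementation of transcan: it ignores the variable transformations and the shrinkage, and it assumes the model predictions for the missing cases are already available. All names and numbers are hypothetical.

```python
import random


def abb_residuals(observed_residuals, m, rng):
    """Two-stage draw: resample r residuals, then draw m from that resample."""
    r = len(observed_residuals)
    # Stage 1: sample of size r, with replacement, from the r observed residuals.
    pool = rng.choices(observed_residuals, k=r)
    # Stage 2: sample of size m, with replacement, from the stage 1 sample.
    return rng.choices(pool, k=m)


def impute_once(predictions_for_missing, observed_residuals, rng):
    """One set of imputations: model prediction plus a bootstrapped residual."""
    draws = abb_residuals(observed_residuals, len(predictions_for_missing), rng)
    return [p + e for p, e in zip(predictions_for_missing, draws)]


# Toy inputs (hypothetical): residuals from complete cases and
# predictions for three cases with a missing response.
residuals = [-0.20, -0.05, 0.00, 0.10, 0.15]
predictions = [0.80, 0.90, 0.70]

rng = random.Random(42)
# Repeat B times (here B = 5) to obtain the multiple imputations.
imputations = [impute_once(predictions, residuals, rng) for _ in range(5)]
print(imputations)
```

The first resampling stage is what adds variability between imputations beyond simply reusing the observed residuals, while the imputation model itself stays fixed, which is the "freezing" noted as a disadvantage.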

Comparing imputation methods

Ranking:

1. transcan (advantage: shrinkage correction to prevent overfitting)
2. PROC MI (drawback: normality assumption)
3. Bootstrap random hot deck (does not use the information in the covariates)

Significant variables

Conclusions about the missing pattern

- The missing values of the response variable DVHST94 are not missing completely at random (MCAR).
- The probability of a missing response depends primarily on the age and sex covariates, which is consistent with the values being missing at random (MAR).

Conclusions about multiple imputation

- The transcan function appeared to perform better than PROC MI for imputing and analyzing this data set, given its non-normality.
- Random hot deck MI with bootstrap gave significantly biased results. This approach does not take into account the information provided by the covariates and is therefore not appropriate for data that are MAR.

Conclusions about the data analysis

- The health status of the population tends to decrease with age.
- People with higher income tend to have better health than people with lower income.
- People with lower health status demand more medical services (visits to a doctor).
- People who are prone to depression have lower health status.
- Smoking does not appear to have a decisive influence on health status.

Future work

- A GLM, specifically a multinomial logistic model, could be used to model the categorical response GQ.H1 and to impute the missing categorical responses.
- Interactions of the significant variables with the non-significant variables should be explored in order to further assess concomitant effects (e.g. smoking and depression).

Acknowledgements: Special thanks to Professor Peggy Ng and George Monette for their support.
