# Bayesian Methods to Handle Missing Data in High-Dimensional Data Sets using Factor Analysis Strategies Thomas R. Belin UCLA Department of Biostatistics.


Bayesian Methods to Handle Missing Data in High-Dimensional Data Sets using Factor Analysis Strategies
- Thomas R. Belin, UCLA Department of Biostatistics
- Juwon Song, Univ. of Texas-M.D. Anderson Cancer Center
- Jianming Wang, Medtronic Inc.

Introduction
- General problem: incomplete high-dimensional longitudinal data
  - A large number of variables
  - A modest number of cases
  - With missing values
- Initially consider cross-sectional data, then consider longitudinal structure

Multiple imputation
- Rationale: useful framework for representing uncertainty due to missingness
- Requires imputations to be proper
- Advice: include available information to the fullest extent possible (Rubin 1996 JASA)
  - avoid bias in the imputation
  - make assumption of ignorable missing data more plausible

Overparameterization concerns
- With modest sample size and large number of variables, even a simple model can be overparameterized
- Example: 50 variables give $50 \times 49/2 = 1225$ correlation parameters in a multivariate normal model with general covariance matrix
- Analysis often proceeds based on arbitrary choice of variables to include or exclude
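
The count in the example above can be checked with a few lines of Python. This is only an illustrative sketch; the function names are made up for this example and do not come from the presentation.

```python
def n_correlations(p: int) -> int:
    """Number of distinct off-diagonal correlations among p variables."""
    return p * (p - 1) // 2


def n_covariance_params(p: int) -> int:
    """Distinct entries of a general p x p covariance matrix (variances plus covariances)."""
    return p * (p + 1) // 2


print(n_correlations(50))       # 1225
print(n_covariance_params(50))  # 1275
```

Adding the 50 variances to the 1225 correlations gives the 1275 free entries of a general covariance matrix, which is already large relative to a modest number of cases.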

Alternative modeling strategies
- Address inestimable or unstable parameters by:
  - deleting variables
  - using proper prior distribution: ridge prior for multivariate normal (MVN) model (Schafer 1997 text)
  - restrictions on covariance matrix (common factors in MVN model)

Factor model for incomplete multivariate normal data
- Idea: ignore factors corresponding to small eigenvalues
- Notation:
  - $Y$: $n \times p$ data matrix with missing items
  - $Z$: $n \times k$ unobserved factor-score matrix, where $k < p$
  - $(Y_i, Z_i)$: iid $(p+k)$-variate normal distribution
  - $Z_i \sim N(0, I_k)$, i.e., assuming orthogonal factors

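Factor model for incomplete multivariate normal data (contd)
- Model: $Y_i = \mu + Z_i \Lambda + \varepsilon_i$, for $i = 1, 2, \ldots, n$
- $\mu$ is the $1 \times p$ mean vector, $\Lambda$ is the $k \times p$ factor-loading matrix, and $\varepsilon_i \sim N(0, \Sigma^2)$, where $\Sigma^2 = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2)$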
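
To make the model concrete, here is a minimal NumPy sketch that simulates from a factor model of this form and compares the sample covariance with the covariance the model implies (loadings transposed times loadings, plus the diagonal unique variances). The names `mu`, `loadings`, and `sigma2`, and all numeric settings, are assumptions chosen for illustration, not values from the presentation.

```python
import numpy as np


def simulate_factor_data(n, mu, loadings, sigma2, rng):
    """Draw n rows from Y_i = mu + Z_i @ loadings + eps_i."""
    k, p = loadings.shape
    z = rng.standard_normal((n, k))                      # Z_i ~ N(0, I_k)
    eps = rng.standard_normal((n, p)) * np.sqrt(sigma2)  # eps_i ~ N(0, diag(sigma2))
    return mu + z @ loadings + eps, z


def implied_covariance(loadings, sigma2):
    """Marginal covariance of Y_i under the model."""
    return loadings.T @ loadings + np.diag(sigma2)


rng = np.random.default_rng(0)
p, k = 6, 2                              # illustrative sizes only
mu = np.zeros(p)
loadings = rng.normal(size=(k, p))       # hypothetical k x p loading matrix
sigma2 = np.full(p, 0.5)                 # hypothetical unique variances
y, z = simulate_factor_data(20000, mu, loadings, sigma2, rng)
max_gap = np.abs(np.cov(y, rowvar=False) - implied_covariance(loadings, sigma2)).max()
print(max_gap)  # small when n is large
```

The implied covariance uses only `k * p + p` parameters rather than the `p * (p + 1) / 2` of a general covariance matrix, which is the motivation for the restriction.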

Model fitting
- Gibbs sampling: based on assumed factor structure (i.e., k known), draw:
  - (a) mean vector
  - (b) factor loadings
  - (c) uniqueness
  - (d) factor scores
  - (e) missing items
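
As a rough illustration of what steps (d) and (e) can look like, the sketch below draws factor scores and then missing items while holding the mean vector, loadings, and unique variances fixed. It assumes a model of the form Y_i = mu + Z_i L + eps_i with Z_i ~ N(0, I) and diagonal error variances; steps (a) to (c) and the priors are omitted, and every name and number here is hypothetical rather than taken from the authors' implementation.

```python
import numpy as np


def gibbs_scores_and_items(y_cur, miss, mu, loadings, sigma2, rng):
    """Steps (d) and (e) of one Gibbs sweep with mu, loadings, sigma2 held fixed.

    y_cur: n x p array, complete (missing items hold their current imputations).
    miss:  n x p boolean array, True where the item was originally missing.
    """
    n, p = y_cur.shape
    k = loadings.shape[0]
    # Conditional covariance of Z_i given a complete row: (I + L D^-1 L')^-1
    post_cov = np.linalg.inv(np.eye(k) + (loadings / sigma2) @ loadings.T)
    chol = np.linalg.cholesky(post_cov)
    # (d) factor scores: Z_i | Y_i ~ N((Y_i - mu) D^-1 L' V, V)
    post_mean = ((y_cur - mu) / sigma2) @ loadings.T @ post_cov
    z = post_mean + rng.standard_normal((n, k)) @ chol.T
    # (e) missing items: Y_ij | Z_i ~ N(mu_j + Z_i L_j, sigma2_j), independent over j
    draw = mu + z @ loadings + rng.standard_normal((n, p)) * np.sqrt(sigma2)
    return z, np.where(miss, draw, y_cur)


rng = np.random.default_rng(1)
n, p, k = 200, 8, 2                                  # illustrative sizes only
mu = np.zeros(p)
loadings = rng.normal(size=(k, p))
sigma2 = np.ones(p)
y_full = mu + rng.standard_normal((n, k)) @ loadings + rng.standard_normal((n, p))
miss = rng.random((n, p)) < 0.1                      # 10% of items set missing at random
y_cur = np.where(miss, mu, y_full)                   # start missing items at the mean
for _ in range(10):
    z, y_cur = gibbs_scores_and_items(y_cur, miss, mu, loadings, sigma2, rng)
```

Observed entries are never overwritten; only the positions flagged in `miss` are redrawn on each sweep.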

Details of model fitting
- Can use weakly informative prior for uniqueness terms $\sigma_j^2$ to avoid degenerate variance estimates
- Can use either noninformative or weakly informative priors for means and factor loadings
- Used transformations to speed convergence
- Multiple modes possible (Rubin and Thayer 1982, 1983 Psychometrika), so simulate multiple chains
- Monitor convergence (Gelman and Rubin 1992 Statistical Science)

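Simulation evaluations
- Evaluate bias, coverage when model is correct, overparameterized, or underparameterized

| n | p | # true factors | # assumed factors |
| --- | --- | --- | --- |
| 100 | 100 | 5 | 5, 10 |
| 100 | 100 | 10 | 5, 10 |
| 500 | 100 | 5 | 5, 10 |
| 500 | 100 | 10 | 5, 10 |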

Simulation factor structure
- Example: each item loads on one factor

Simulation details
- Also considered hypothetical scenario where items load on two factors
- 200 replications for each combination of simulation conditions, giving a simulation standard error of about 1.5% for 95% coverage
- Percentage of missing data ranged from 5% to 25% for each variable
- Three missing-data mechanisms:
  - MAR where available-case analysis might do well
  - MAR where available-case analysis is not expected to do well
  - non-ignorable where a method appropriate under MAR might do well

Simulation results: Factor model, cross-sectional mean
- Factor model performs well when model is correct or overparameterized (coverages range from 93% to 97%)
- Factor model coverage is below nominal level when model is underparameterized (coverages range from 86% to 93%)

Simulation results: Other methods, cross-sectional mean
- MVN frequently fails to converge with n=100 without ridge prior
- MVN with ridge prior has good coverage (94% to 98%), with interval widths typically wider than for factor model (2-16% wider on average, depending on details such as missing data mechanism)
- Available-case analysis performs poorly (coverages ranging from 37% to 88%)

Simulation study based on observed covariance matrix
- Generate multivariate normal data (200 replicates, SE = 1.5% for 95% coverage statistics) with mean and covariance fixed at published values from Harman (1967) study of 24 psychological tests on 145 school children
- Number of factors not known in advance
- Consider 4, 5, 7 factors following earlier analysis
- Also consider 11 factors based on cumulative variance explained exceeding 80% and desire not to underparameterize model

Simulation results: psychological testing example

| Method | Coverage rate |
| --- | --- |
| 4-factor model | 93% - 95% |
| 5-factor model | 93% - 96% |
| 7-factor model | 93% - 95% |
| 11-factor model | 93% - 95% |
| MVN model | 94% - 95% |
| Available-case analysis | 12% - 84% |

- Interval widths for MVN model within 5% of factor model widths, usually within 1%

Application: Emergency room intervention study
- Specialized emergency room intervention vs. standard emergency room treatment for 140 female adolescents after suicide attempt
- Twenty-seven outcomes measured at baseline, 3, 6, 12, 18 months, plus many baseline characteristics
- Most variables 5-25% missing, some 50-60% missing
- Main interests:
  - effectiveness of emergency room intervention
  - whether baseline psychological impairment is related to outcomes over time

Factor model for emergency room intervention study
- 135 variables, including 27 longitudinal outcomes
- Longitudinal outcomes: measures at different time points treated as separate variables
- Assume 30 factors:
  - explained about 80% of the variation
  - simulation analysis: insufficient number of factors can cause serious bias
  - with 27 longitudinal outcomes, general enough to allow each longitudinal variable to represent a separate factor

Emergency-room intervention study: evaluations, results
- After imputation, related longitudinal outcomes to baseline predictors using SAS PROC MIXED
- Compared imputation under factor model with growth-curve imputation strategy developed by Schafer (1997 PAN program)
- No substantial differences seen in significance tests for intervention effect
- Some sensitivity seen in significance of impairment effect and of intervention-by-impairment interactions

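Imputation for longitudinal data
- PAN (Schafer, 1997): uses a multivariate linear mixed-effect model (MLMM)
- Appropriate for multivariate longitudinal data or clustered data
- Imputation by multivariate linear mixed-effect model: $Y_i = X_i \beta + Z_i b_i + \varepsilon_i$, with dimensions $Y_i$: $t \times m$, $X_i$: $t \times p$, $\beta$: $p \times m$, $Z_i$: $t \times q$, $b_i$: $q \times m$, $\varepsilon_i$: $t \times m$
- Assume $\mathrm{vec}(b_i) \sim N(0, \Psi)$ and the rows of $\varepsilon_i$ are iid $N(0, \Sigma)$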

Challenge with MI using PAN
- MI under PAN can easily be over-parameterized
- Example: 15 variables collected longitudinally five times, modeled with 2 random effects in PAN
  - # of parameters in the covariance matrix of the random effects ($15 \times 2 = 30$ dimensions): $30 \times 31/2 = 465$
  - # of parameters in the covariance matrix of the error terms: $15 \times 16/2 = 120$
  - Total # of parameters: 585
- Parameter reduction seems sensible when number of cases is modest, e.g. 300
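
The parameter count in this example follows from counting the distinct entries of two covariance matrices. A small sketch, with variable names chosen here purely for illustration:

```python
def n_cov_params(dim: int) -> int:
    """Distinct entries of a dim x dim symmetric covariance matrix."""
    return dim * (dim + 1) // 2


n_vars, n_random_effects = 15, 2
random_effects_params = n_cov_params(n_vars * n_random_effects)  # 30 x 30 covariance
error_params = n_cov_params(n_vars)                              # 15 x 15 covariance
total = random_effects_params + error_params
print(random_effects_params, error_params, total)  # 465 120 585
```

With about 300 cases, 585 covariance parameters leave roughly one case for every two parameters, which is why some reduction seems sensible.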

Potential solution to over-parameterization
- If those 15 variables feature sizable correlations, they could be viewed as measuring 3-5 underlying factors
- Strategy:
  - Reduce the dimension of the problem by factor analysis
  - Model the estimated factor scores by an MLMM
- Factor structure reflects cross-sectional correlations among variables measured at the same time; MLMM reflects longitudinal correlations

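Ordinary factor analysis model
- Factor analysis model: [equation and distributional assumptions not preserved]
- Because of the resulting indeterminacy, we often assume a standardized form for the factor scores [equations not preserved]
- Also assume that the factor loading matrix is of full rank (Seber, 1977)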

Ordinary factor analysis model (continued)
- Identifiability
  - Solution invariant under orthogonal transformation
  - Common restrictions: [expression not preserved], which is equivalent to k(k-1)/2 restrictions
  - Identifiable if [condition not preserved]
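
The invariance under orthogonal transformation can be seen numerically. The sketch below assumes the usual factor-analytic covariance structure (loadings times loadings transposed, plus a diagonal matrix of unique variances) with standardized factors; the matrices are random placeholders, not quantities from the presentation.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 6, 2
loadings = rng.normal(size=(p, k))         # hypothetical p x k loading matrix
uniq = np.diag(rng.uniform(0.5, 1.5, p))   # hypothetical diagonal unique variances

# Random orthogonal k x k matrix obtained from a QR decomposition
q, _ = np.linalg.qr(rng.normal(size=(k, k)))

cov_original = loadings @ loadings.T + uniq
cov_rotated = (loadings @ q) @ (loadings @ q).T + uniq
print(np.allclose(cov_original, cov_rotated))  # True

# An orthogonal k x k matrix has k(k-1)/2 free parameters,
# matching the number of restrictions needed to pin down the rotation.
n_restrictions = k * (k - 1) // 2
```

Because rotated loadings give exactly the same covariance matrix, the data alone cannot distinguish them, and the k(k-1)/2 restrictions remove that freedom.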

Generalizing factor analysis model
- Standardization of factor scores presents a challenge for generalizing the factor analysis model to the longitudinal setting
- Idea: use error-in-variables representation of factor model

Error-in-variables factor model
- Error-in-variables model (Fuller, 1987): [equation not preserved]
- Interpretation: obtained by partitioning the model terms and substituting [partition and resulting equations not preserved]

Error-in-variables factor model (continued)
- Covariance matrix of Y is [expression not preserved]
- The total # of distinct parameters is [expression not preserved], which is exactly the same as the ordinary model with the additional k(k-1)/2 restrictions used to avoid indeterminacy
- No additional restrictions necessary
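
The equality of parameter counts can be verified by hand under one common parameterization. The derivation below is a sketch under assumptions that are not confirmed by the slides: covariance parameters only (means excluded), an ordinary model with $pk$ loadings and $p$ unique variances, and an error-in-variables form with a free $(p-k) \times k$ loading block, an unrestricted $k \times k$ factor covariance matrix, and $p$ unique variances.

```latex
\begin{align*}
\text{Ordinary model:}\quad
  & pk + p - \frac{k(k-1)}{2} \\[4pt]
\text{Error-in-variables form:}\quad
  & (p-k)k + \frac{k(k+1)}{2} + p \\
  &= pk + p - k^2 + \frac{k^2 + k}{2} \\
  &= pk + p - \frac{k^2 - k}{2} \\
  &= pk + p - \frac{k(k-1)}{2}
\end{align*}
```

Under these assumptions the two counts agree, so fixing part of the loading matrix plays the same role as the k(k-1)/2 restrictions in the ordinary model.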

A Longitudinal Factor Analysis (LFA) model
- Extending the error-in-variables model to LFA

Aspects of LFA model
- The # of factors is the same on each occasion, but the factor loadings and factor scores may change
- No constraints on covariance structure of the factor scores
- The unique-component vectors are uncorrelated with the factors both within and across occasions
- The unique-component errors are uncorrelated within occasion and across occasions

Advantages of LFA model
- Identifiability problem can easily be handled
- Preserves the mean structure and covariance structure, making it possible to study elevation change and pattern change simultaneously
- Can incorporate linear mixed-effect model structure for longitudinal data
- Can incorporate baseline covariates

Implementation
- Use data augmentation (I-step: linear regressions; P-step: analog to ML for multivariate normal with complete data)
- Assume conjugate forms (normal, inverse Wishart) for prior distributions for parameters
- Assume relatively diffuse priors that still produce proper posteriors
- Conditional distributions all in closed form

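Evaluations
- We generated 100 data sets, each with observations drawn from an MVN distribution (mean and variance expressions not preserved) for $i = 1, 2, \ldots, 350$
- $p = 15$ measurements and $k = 5$ factors at $t = 5$ time points
- Each observation vector has dimension $(15 \times 5) \times 1 = 75 \times 1$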

Simulation design
- X incorporates intercept, 3 continuous variables, 1 binary variable and time
- Z allows for random intercepts, slopes
- Fixed-effect coefficients [values not preserved] reflect small to moderate covariate effects for predicting factor scores and a linear trend in factor scores

Simulation design (continued)
- [parameter values not preserved] (chosen to avoid a singular factor loading matrix)
- Missingness introduced using MAR mechanism (a series of binary draws with probabilities depending on observed values)
- Three covariance matrices [symbols not preserved] incorporate relative variances and covariances describing unique variance, common variance among factor scores, and variance of random effects
- Simulation SE of 95% coverage statistics with 100 replicates = 0.0218, margin of error = 0.0427
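
The simulation standard error and margin of error quoted above are consistent with the binomial standard error of a coverage proportion. A short check, assuming the margin of error is 1.96 times the standard error (the function name is illustrative):

```python
import math


def coverage_se(nominal: float, n_reps: int) -> float:
    """Binomial standard error of an estimated coverage rate over n_reps replicates."""
    return math.sqrt(nominal * (1.0 - nominal) / n_reps)


se_100 = coverage_se(0.95, 100)
print(round(se_100, 4))                   # 0.0218
print(round(1.96 * se_100, 4))            # 0.0427
print(round(coverage_se(0.95, 200), 4))   # 0.0154
```

The last line corresponds to the roughly 1.5% standard error quoted for the simulations that used 200 replicates.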

Simulation when number of factors is correctly specified: the mean of a variable which (averaged across simulation replicates) was missing on 27% of individuals

| Analysis Method | M.C. Average | M.C. S.E. | Average 95% Interval length | Actual 95% Coverage |
| --- | --- | --- | --- | --- |
| True value | 17.074 | | | |
| All data | 17.078 | 0.426 | 1.677 | 98% |
| Available data | 18.854 | 0.530 | 2.091 | 7% |
| 5 imputations | 17.072 | 0.567 | 2.231 | 96% |

Simulation when number of factors is correctly specified: the mean of a variable which is missing 100% of the time (i.e. a variable not measured at a given time point)

| Analysis Method | M.C. Average | M.C. S.E. | Average 95% Interval length | Actual 95% Coverage |
| --- | --- | --- | --- | --- |
| True value | 20.8195 | | | |
| All data | 20.7955 | 0.5128 | 2.0170 | 94% |
| Available data | -- | -- | -- | -- |
| 5 imputations | 20.7678 | 0.6503 | 2.5554 | 95% |

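Simulation when number of factors is incorrectly specified: the mean of a variable with average missingness rate = 27% (F = assumed number of factors)

| Analysis Method | M.C. Average | M.C. S.E. | Average 95% Interval length | Actual 95% Coverage |
| --- | --- | --- | --- | --- |
| True value | 17.074 | | | |
| All data | 17.078 | 0.4263 | 1.677 | 98% |
| Available data | 18.854 | 0.5304 | 2.091 | 7% |
| F=5 (true number) | 17.072 | 0.5672 | 2.231 | 96% |
| F=6 | 17.055 | 0.4873 | 1.9153 | 94% |
| F=4 | 17.612 | 0.5962 | 2.3429 | 89% |
| F=3 | 17.663 | 0.6213 | 2.4410 | 86% |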

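Simulation when number of factors is incorrectly specified: the mean of a variable which has a 100% missingness rate (F = assumed number of factors)

| Analysis Method | M.C. Average | M.C. S.E. | Average 95% Interval length | Actual 95% Coverage |
| --- | --- | --- | --- | --- |
| True value | 20.8195 | | | |
| All data | 20.7955 | 0.5128 | 2.0170 | 94% |
| Available data | -- | -- | -- | -- |
| F=5 (true number) | 20.7678 | 0.6503 | 2.5554 | 95% |
| F=6 | 20.9565 | 0.7161 | 2.8142 | 94% |
| F=4 | 20.6473 | 1.1139 | 4.3780 | 91% |
| F=3 | 20.4091 | 1.2484 | 4.9060 | 83% |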

Example using LFA: oral surgery study
- Randomized study of two oral surgery treatments (MMF, RIF) with longitudinal follow-up of quality-of-life (GOHAI) and psychological outcomes
- Hierarchical growth-curve model fit using WinBUGS [model equations not preserved], with a treatment indicator distinguishing RIF from MMF

Findings of interest
- Difference in average intercept and average slope between RIF and MMF ($\beta_{01}$, $\beta_{11}$) significant under MI (NORM or LFA) analysis, not under available-case analysis
- Different interpretations emerge from MI analysis (RIF starts lower, ends with comparable values)
- Compared to MI using NORM, MI using LFA has 17%-34% narrower interval estimates for parameters

Summary and future research
- Summary
  - Factor-analysis methods provide flexible framework for addressing incomplete high-dimensional longitudinal data
- Ongoing and future research
  - Rounding continuous to binary imputations
  - Determining number of factors
  - Robustness of methods to normality assumption
  - Can the parameters in LFA be estimated by EM or related methods?
  - Comparisons with IVEWare and related methods, hot deck approaches

Goal
- To develop general-purpose multiple imputation procedures appropriate for high-dimensional data sets
  - Cross-sectional
  - Longitudinal

Simulation missing data mechanisms
- M1 (MAR): first 99 variables MCAR, missingness on last variable according to logistic regression on other 99 with normally distributed coefficients
- M2 (MAR): first 99 variables MCAR, missingness on last variable according to logistic regression on other variables included in same factor with half-normal distributed coefficients
- M3 (nonignorable but close to MAR): missingness on each variable depends on two other variables in overlapping manner
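
A mechanism in the spirit of M1 can be sketched as follows. The MCAR rate, the coefficient scale, the absence of an intercept, and the function name are all assumptions made for illustration; the presentation does not give these details.

```python
import numpy as np


def m1_style_mask(y, mcar_rate, rng, coef_scale=0.1):
    """Boolean missingness mask: MCAR on all but the last column,
    logistic-regression-driven missingness on the last column.

    coef_scale (assumed) is the standard deviation of the normal coefficients.
    """
    n, p = y.shape
    mask = np.zeros((n, p), dtype=bool)
    mask[:, :-1] = rng.random((n, p - 1)) < mcar_rate   # MCAR on first p-1 variables
    coefs = rng.normal(scale=coef_scale, size=p - 1)    # normally distributed coefficients
    logit = y[:, :-1] @ coefs                           # linear predictor from other variables
    prob = 1.0 / (1.0 + np.exp(-logit))                 # inverse logit
    mask[:, -1] = rng.random(n) < prob                  # binary draw per case
    return mask


rng = np.random.default_rng(2)
y = rng.standard_normal((500, 100))        # placeholder data, 100 variables
mask = m1_style_mask(y, mcar_rate=0.1, rng=rng)
y_obs = np.where(mask, np.nan, y)          # apply the mask
```

An M2-style variant would restrict the predictors to variables loading on the same factor and take absolute values of the coefficients to make them half-normal.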

Simulation results: simple regression coefficient
- Factor model: coverages 93% to 98% when model is correct or overparameterized, 19% to 80% when model is underparameterized
- MVN model: frequently fails to converge with non-informative prior, coverages 91% to 99% with ridge prior
- Available-case analysis: coverages range from 44% to 100%

Equivalence of two factor analysis models
- One can write: [equation not preserved]

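Incorporating multivariate linear mixed-effect model for factor scores
- Rearrange the factor scores in matrix form as a $t \times k$ matrix
- This matrix can then be modeled by a multivariate linear mixed-effect model with dimensions $(t \times k) = (t \times m)(m \times k) + (t \times q)(q \times k) + (t \times k)$: fixed effects, random effects, and error
- We assume that the $t$ rows of the error matrix are iid [distributional assumptions not preserved]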

Modified LFA with covariates
- Combining the LFA with the linear mixed-effect model, we obtain [equation not preserved]

Linear growth curve model estimates: available-case analysis, MI using NORM, MI using LFA

| Estimate | Available-case: posterior mean | Available-case: 95% CI | MI using NORM: posterior mean | MI using NORM: 95% CI | MI using LFA: posterior mean | MI using LFA: 95% CI |
| --- | --- | --- | --- | --- | --- | --- |
| Beta00 | 28.55 | (26.24, 30.92) | 29.30 | (26.35, 32.33) | 28.90 | (26.45, 31.20) |
| Beta01 | -0.29 | (-4.67, 4.05) | -4.24 | (-7.18, -1.44)* | -3.93 | (-5.72, -1.95)* |
| Beta10 | 7.07 | (4.78, 9.24)* | 6.15 | (1.90, 9.79)* | 6.57 | (2.24, 9.34)* |
| Beta11 | 1.86 | (-2.42, 5.96) | 2.72 | (0.20, 5.38)* | 2.69 | (0.92, 5.02)* |

*p<0.05.
