Nuoo-Ting (Jassy) Molitor 1 Chris Jackson 2 With Nicky Best, Sylvia Richardson 1 1 Department of Epidemiology and Public Health Imperial College, London.


1 Bayesian graphical models for combining mismatched administrative and survey data: application to low birth weight and water disinfection by-products
Nuoo-Ting (Jassy) Molitor (1), Chris Jackson (2), with Nicky Best and Sylvia Richardson (1)
(1) Department of Epidemiology and Public Health, Imperial College, London
(2) MRC Biostatistics Unit, Cambridge
jassy.molitor@imperial.ac.uk, chris.jackson@mrc-bsu.cam.ac.uk
http://www.bias-project.org.uk

2 Outline
- Motivation for combining different data sources
- Case study: Chlorination Study
- Data sources
- Statistical modeling
- Simulation and real data analysis

3 Observational studies
Observational studies are filled with many uncertainties other than random errors.
Sources of uncertainty:
- Missing values
- Unobserved confounders
- Measurement errors
- Selection bias
- Random errors
These uncertainties are hard to identify within a single data set.

4 Combining multiple data sources
Research questions are complicated in nature, and a single data set may not be able to provide a sufficient answer.
Example: puzzle

5 Case study Combining birth register, survey and census data to study effects of water disinfection by-products on risk of low birth weight

6 Example of combining different data sources: Chlorination Study
Environmental exposure: chlorine by-products (THMs)
Chlorine reacts with natural organic matter and/or the chemical compound bromide to form organic and inorganic by-products:
- bromate
- chlorite
- haloacetic acids (HAA5)
- total trihalomethanes (THMs)
Outcome: low birth weight (LBW), birth weight < 2.5 kg
- LBW: the baby's birth weight is less than 2.5 kg
- LBWP (LBW and pre-term): LBW babies born at less than 37 weeks
- LBWF (LBW and full-term): LBW babies born at 37 weeks or later
- Gestational age is used to split LBW into LBWP and LBWF
Covariates:
- mother's race/ethnicity
- baby's sex
- mother's smoking status
- mother's age during the pregnancy

7 Available data sources related to the Chlorination Study: why do we need them?
Administrative data (NBR)
- Deals with the small percentage of LBW in the population
- Addresses the inconclusive link between LBW and THMs
Aggregate data
- Used for imputing missing covariates
Survey data (MCS)
- Adjusts for important subject-level covariates
- Allows different types of LBW to be examined

8 Summary of data sources
Administrative data (large; high power, no selection bias): NBR (national birth registry)
- Observed postcode
- Missing smoking and race/ethnicity
- Missing baby's gestational age
Aggregate data (UK)
- Observed postcode
- Census 2001: region-level race/ethnicity composition
- Consumer survey (CACI): region-level tobacco expenditure
Survey data (subset of NBR; low power, selection bias): MCS (Millennium Cohort Study)
- Observed postcode
- Observed smoking and race/ethnicity
- Observed baby's gestational age

9 Building the sub-model: disease sub-model for MCS
Indices: m is the subject index for MCS; r is the region index.
- y: birth weight indicator (1: normal, 2: LBWP, 3: LBWF)
- THM: THM (chlorine by-product) exposure
- C: covariates such as race/ethnicity and smoking, which are missing in the NBR and observed only in the MCS
[Diagram: y_{rm} (normal, LBWP, LBWF) depends on THM_{rm}, C_{rm} and the disease model parameters; nodes are marked as known or unknown.]
Multinomial logistic regression for MCS:
$$y_{rm} \sim \mathrm{Multinomial}(p_{rm,1:3},\, 1)$$
$$\log(p_{rm,2} / p_{rm,1}) = b_{10} + b_{11}\,\mathrm{THM}_{rm} + b_{12}\,C_{rm}$$
$$\log(p_{rm,3} / p_{rm,1}) = b_{20} + b_{21}\,\mathrm{THM}_{rm} + b_{22}\,C_{rm}$$
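
The two log-odds equations determine all three category probabilities once category 1 (normal) is taken as the reference. The following is a minimal Python sketch of that calculation, not the authors' code. It assumes a single binary covariate C for simplicity (the slides use both race and smoking), and the function name and all coefficient values are hypothetical.

```python
import math

def multinomial_logit_probs(thm, c, b1, b2):
    """Return (p1, p2, p3) for the categories normal, LBWP, LBWF.

    b1 = (b10, b11, b12) and b2 = (b20, b21, b22) are the coefficients of the
    two log-odds equations; category 1 (normal) is the reference category.
    """
    eta2 = b1[0] + b1[1] * thm + b1[2] * c  # log(p2 / p1)
    eta3 = b2[0] + b2[1] * thm + b2[2] * c  # log(p3 / p1)
    denom = 1.0 + math.exp(eta2) + math.exp(eta3)
    return (1.0 / denom, math.exp(eta2) / denom, math.exp(eta3) / denom)

# Hypothetical coefficients, chosen only to illustrate the calculation.
p1, p2, p3 = multinomial_logit_probs(
    thm=1, c=1, b1=(-3.0, 0.2, 0.5), b2=(-3.5, 0.5, 0.9)
)
```

The three probabilities sum to one by construction, and exponentiating a coefficient such as b_{11} gives the odds ratio for THM exposure on LBWP relative to normal birth weight.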

10 Building the sub-model: disease sub-model for NBR
Indices: n is the subject index for NBR; r is the region index.
- C: covariates such as race/ethnicity and smoking (missing in the NBR, but observed in the MCS)
- LBWP and LBWF are missing in the NBR because gestational age is missing
[Diagram: y_{rn} (normal, LBWP, LBWF) depends on THM_{rn}, C_{rn} and the disease model parameters; nodes are marked as known or unknown.]
Multinomial logistic regression for NBR:
$$y_{rn} \sim \mathrm{Multinomial}(p_{rn,1:3},\, 1)$$
$$\log(p_{rn,2} / p_{rn,1}) = b_{10} + b_{11}\,\mathrm{THM}_{rn} + b_{12}\,C_{rn}$$
$$\log(p_{rn,3} / p_{rn,1}) = b_{20} + b_{21}\,\mathrm{THM}_{rn} + b_{22}\,C_{rn}$$

11 Missing outcome model: impute LBWP and LBWF for NBR
G-age: gestational age; BW: birth weight.
[Diagram: the NBR and MCS disease sub-models side by side. In each, y (normal or LBW, with LBW split into LBWP and LBWF) depends on THM, C and the disease model parameters. In the NBR, G-age is missing, so the split of LBW into LBWP and LBWF is unknown; in the MCS it is known.]
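
One way to picture this imputation step: for an NBR subject known to be LBW but with unknown gestational age, the subtype can be drawn from the disease-model probabilities restricted to categories 2 and 3. The sketch below is an assumption about how such a step could look, not the authors' missing outcome model (the transcript does not preserve its formula). The function name and the probabilities used are hypothetical.

```python
import random

def impute_lbw_subtype(p, rng):
    """Draw LBWP (2) or LBWF (3) for a subject known to be LBW.

    p = (p1, p2, p3) are category probabilities from a disease model.
    Category 1 (normal) is ruled out, so p2 and p3 are renormalised.
    """
    p_preterm = p[1] / (p[1] + p[2])  # P(y = 2 | y in {2, 3})
    return 2 if rng.random() < p_preterm else 3

# Hypothetical probabilities; the seed only makes the draws reproducible.
rng = random.Random(0)
imputed = [impute_lbw_subtype((0.93, 0.04, 0.03), rng) for _ in range(10)]
```

In a full Bayesian analysis this draw would be repeated at every MCMC iteration, so that uncertainty about the subtype propagates into the coefficient estimates.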

12 Building the sub-model: missing covariate model
Impute C_{rn} using the aggregate data and the MCS data.
[Diagram: the missing covariate model parameters and the aggregate data A_r feed C_{rn} (NBR, unknown) and C_{rm} (MCS, known).]
Since our missing covariates, such as race and smoking, are binary variables, we use a multivariate probit model to account for their correlation.

13 Multivariate probit model (Chib & Greenberg, 1998)
Race: 1 = non-white (Asian, Black, Others), 0 = white
Smoke: 1 = yes, 0 = no
Define underlying continuous variables (Smoke*, Race*):
Smoke = I(Smoke* > 0) and Race = I(Race* > 0)
Correlation: modeled between the underlying variables Smoke* and Race*
S: sampling stratum, included to adjust for selection bias
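
The latent-variable construction can be illustrated by simulation. The sketch below draws correlated standard-normal errors, adds them to the latent means, and thresholds at zero. It is a minimal illustration under stated assumptions: the latent means are passed in directly (in the slides they would depend on the aggregate data and the sampling stratum S), and the function name and all numeric values are hypothetical.

```python
import math
import random

def draw_smoke_race(mean_smoke, mean_race, rho, rng):
    """Draw (Smoke, Race) from a bivariate probit with latent correlation rho."""
    e1 = rng.gauss(0.0, 1.0)
    e2 = rng.gauss(0.0, 1.0)
    smoke_star = mean_smoke + e1
    # Build a second error with correlation rho to the first.
    race_star = mean_race + rho * e1 + math.sqrt(1.0 - rho * rho) * e2
    # The observed binary variables are indicators of positive latent values.
    return int(smoke_star > 0), int(race_star > 0)

# Hypothetical latent means and correlation; the seed is for reproducibility.
rng = random.Random(2024)
draws = [draw_smoke_race(-0.5, -1.0, 0.3, rng) for _ in range(5)]
```

Fitting the model reverses this process: the latent means and rho are estimated from the observed binary pairs in the MCS and then used to impute the pairs that are missing in the NBR.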

14 Unified model
[Diagram: the unified model links the NBR disease sub-model (y_{rn}, THM_{rn}, C_{rn}), the MCS disease sub-model (y_{rm}, THM_{rm}, C_{rm}), the missing outcome model, and the missing covariate sub-model (C_{rn}, C_{rm}, aggregate data A_r, missing covariate model parameters). Nodes are marked as known or unknown.]

15 Model components
1. Disease model (y = {1, 2, 3})
2. Missing outcome model
3. Missing covariates model (multivariate probit)
Notation:
- i: subject index
- N_m: group of subjects who had a missing outcome (y_miss)
- r: region
- u: index for the category of outcome
- y_obs: observed outcome
- X: observed covariates

16 Investigating the performance of the unified model
[Diagram: A (aggregate) → C (0/1) through the missing covariate model; C → Y (1, 2, 3) through the missing outcome model.]
Good performance of the model depends on:
1. How well the aggregate data can inform C (the covariates)
2. How strongly C and Y are linked
We can examine the following 4 data scenarios:
1. Strong (A → C), strong (C → Y)
2. Strong (A → C), weak (C → Y)
3. Weak (A → C), strong (C → Y)
4. Weak (A → C), weak (C → Y)

17 Simulation study
Step 1: Create data (N = 1333) under each scenario.
Step 2: Missing assignment:
- randomly choose 80% of subjects and treat their C as missing
- only 10% of individuals with outcomes in categories 2 or 3 are assigned a missing outcome
Repeat step 2 to generate 20 replicate samples.
Step 3: Compare the prediction based on an analysis using fully observed data (no imputation) with an analysis using partially observed data (imputation).
Note: partially observed data were analyzed under various models:
1. Covariate sub-model (examining A → C)
2. Outcome sub-model (examining C → Y)
3. Unified model (examining A → C and C → Y)
4. Unified model with cut
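
Step 2 and its replication can be sketched as follows. This is one reading of the slide, not the authors' code: 80% of all subjects lose C, and 10% of the subjects whose outcome is in category 2 or 3 lose their outcome. The function name, the outcome counts and the seeds are hypothetical.

```python
import random

def assign_missing(outcomes, rng, frac_c=0.8, frac_y=0.1):
    """Return (indices with C missing, indices with outcome missing)."""
    n = len(outcomes)
    # 80% of all subjects have their covariates C set to missing.
    c_missing = set(rng.sample(range(n), round(frac_c * n)))
    # Only subjects with outcome 2 (LBWP) or 3 (LBWF) can lose their outcome.
    lbw = [i for i, y in enumerate(outcomes) if y in (2, 3)]
    y_missing = set(rng.sample(lbw, round(frac_y * len(lbw))))
    return c_missing, y_missing

# Hypothetical outcome vector with N = 1333; one seed per replicate sample.
outcomes = [1] * 1233 + [2] * 50 + [3] * 50
replicates = [assign_missing(outcomes, random.Random(seed)) for seed in range(20)]
```

Each of the 20 replicates keeps the same simulated data and changes only which values are hidden, so differences between replicates reflect the missingness pattern alone.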

18 Examining the imputation of the missing covariates, one level (A → C)
Strong A → C: the model assigns a higher probability of a covariate pattern to subjects whose true covariates correspond to that pattern than to those whose true pattern is different.
Weak A → C: the ability to discriminate the true covariate pattern decreases.

19 Examining the imputation of the missing covariates, two levels (A → C and C → Y)
Feedback from the outcome model is beneficial to covariate imputation.
The predicted probabilities of the covariate pattern (C = 0,0) are better able to discriminate between subjects whose true covariates are C = 0,0 and those whose are not, particularly in the weak C scenarios.

20 Examining the impact of the imputation model on the Y-C association
Scenario  Parameter      EST    Outcome model only  Unified model  Unified model w/ cut
                                Est (MSE)           Est (MSE)      Est (MSE)
SYSC      beta.smoke[3]  0.9    0.91 (0.01)         1.07 (0.27)    0.25 (0.43)
SYSC      beta.race[3]   1.79   1.83 (0.01)         2.22 (0.25)    1.12 (0.47)
SYWC      beta.smoke[3]  0.99   0.97 (0.00)         0.97 (0.51)    0.15 (0.71)
SYWC      beta.race[3]   2.56   2.57 (0.01)         2.71 (0.49)    0.67 (3.63)
WYSC      beta.smoke[3]  -0.02  0.05 (0.01)         0.57 (1.34)    0.06 (0.07)
WYSC      beta.race[3]   0.32   0.41 (0.03)         0.61 (0.41)    0.18 (0.09)
WYWC      beta.smoke[3]  0.35   0.34 (0.03)         0.91 (0.89)    0.09 (0.11)
WYWC      beta.race[3]   1      1.06 (0.04)         1.23 (1.32)    0.18 (0.84)
Outcome model vs. unified model: the unified model has a higher MSE than the outcome model (more missing values need to be imputed).
Unified vs. unified with cut: a strong Y-C association helps reduce the MSE, but a weak Y-C association does not.
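
The slide does not state how the MSE was computed. A common definition, assumed here, is the mean squared difference between the replicate estimates and a reference value (for example, the estimate from the fully observed data). The sketch below uses hypothetical replicate estimates, not the study's results.

```python
def mse(estimates, reference):
    """Mean squared error of replicate estimates around a reference value."""
    return sum((e - reference) ** 2 for e in estimates) / len(estimates)

# Hypothetical estimates of one coefficient from four replicate samples.
replicate_estimates = [0.8, 1.0, 1.1, 0.9]
error = mse(replicate_estimates, reference=0.9)
```

Under this definition the MSE combines the squared bias of the estimates and their variance across replicates, which is why a model can have an average estimate close to the reference and still show a large MSE.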

21 Real data analysis: a water company in northern England
Data restricted to:
- Singleton births
- Period: Sep 2000 to Aug 2001
Subjects: MCS 1333 + NBR 7945 = total 9278
Missing % in race and smoke: ~85%
Missing % in outcome: ~7%
[Diagram legend: complete observed information; missing race; missing smoke; missing outcome at levels 2 (LBWP) and 3 (LBWF).]

22 Real data analysis: a water company in northern England
Exposure variable: THMs
- Dichotomized into 2 groups:
  low-medium exposure group (<= 60 µg/l): 57.35%
  high exposure group (> 60 µg/l): 42.65%
- Estimated in a separate model for MCS and NBR (Whitaker et al., 2005)
In addition to race and smoking, we also adjust for the baby's sex and the mother's age, which are observed in both MCS and NBR.

23 Models for real data analysis: no imputation vs. imputation
a. Multinomial logistic regression model for MCS data (Bayesian): no imputation
b. Bayesian multiple bias model for combined NBR, MCS and aggregate data: imputes missing outcome and covariates

24 Results for the real data analysis (low birth weight full-term vs. normal)
OR (95% CI)*
Data            Model                            Outcome  THMs            Smoke           Non-white
MCS (1333)      Multinomial logistic (Bayesian)  LBWF     1.64 (0.8-3.1)  2.65 (1.2-5.2)  5.92 (2.2-12.9)
MCS+NBR (9278)  Bayesian multiple bias           LBWF     2.4 (1.1-4.5)   2.5 (1.1-4.7)   5.6 (2.6-10.8)
* 95% Bayesian credible interval
All parameter estimates are adjusted for the baby's sex and the mother's age.
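
Odds ratios with Bayesian credible intervals are typically obtained by exponentiating posterior draws of a log-odds coefficient and taking percentiles. The slides do not describe the exact procedure, so the sketch below is an assumption: it reports the posterior median and the 2.5% and 97.5% quantiles, and the function name and the draws are hypothetical.

```python
import math
import statistics

def odds_ratio_summary(log_or_draws):
    """Return (median OR, lower 2.5% bound, upper 97.5% bound)."""
    ors = [math.exp(b) for b in log_or_draws]
    # n=40 gives cut points every 2.5%; the first and last bound the 95% interval.
    cuts = statistics.quantiles(ors, n=40, method="inclusive")
    return statistics.median(ors), cuts[0], cuts[-1]

# Hypothetical posterior draws of a log-odds coefficient (not study output).
draws = [math.log(x) for x in range(1, 102)]
median_or, lower, upper = odds_ratio_summary(draws)
```

An interval whose lower bound is above 1, as for THMs in the combined analysis, indicates that the posterior puts most of its mass on a positive association.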

25 Conclusion
There is evidence for an association of THM exposure with full-term low birth weight.
Combining the datasets can:
- increase the statistical power of the survey data
- alleviate bias due to confounding in the administrative data
The selection mechanism of the survey must be allowed for when combining data.

26 Thanks: Mireille Toledano, Mark Nieuwenhuijsen, James Bennett, Peter Hambly, Daniela Fecht, John Molitor

