Nuoo-Ting (Jassy) Molitor 1 Chris Jackson 2 With Nicky Best, Sylvia Richardson 1 1 Department of Epidemiology and Public Health Imperial College, London.


1 Bayesian graphical models for combining mismatched administrative and survey data: application to low birth weight and water disinfection by-products
Nuoo-Ting (Jassy) Molitor (1), Chris Jackson (2), with Nicky Best and Sylvia Richardson (1)
(1) Department of Epidemiology and Public Health, Imperial College, London
(2) MRC Biostatistics Unit, Cambridge
jassy.molitor@imperial.ac.uk, chris.jackson@mrc-bsu.cam.ac.uk
http://www.bias-project.org.uk

2 BIAS Project: "Bayesian methods for integrated bias modelling and analysis of multiple data sources" (http://www.bias-project.org.uk)
Observational data in social sciences / epidemiology
- Account for common biases, especially by using multiple data sources
- Bayesian graphical models
Outline of talk
- (10 mins) Overview of graphical models for observational data biases (CJ)
- (20 mins) Case study: combining birth register, survey and census data to study effects of water disinfection by-products on risk of low birth weight (Jassy Molitor)

3 [Diagram: a PREDICTOR and an OUTCOME for observed individuals within the population of interest, annotated with sources of bias: selection bias (by design), non-response (accidental), confounding (by design), missing data (accidental).]

4 [Diagram: predictor and outcome for observed individuals within the population of interest, annotated with selection bias (by design), non-response (accidental), confounding (by design), missing data (accidental) and measurement error.]

5 Graphical model
- More general than a multilevel model
- As well as hierarchical structures (groups of groups of individuals), it can express any relationship between known or unknown quantities
- Represented by a graph with nodes and links
[Example graphs: nodes Y, W, Z, X; genotypes of parents linked to genotypes of children]

6 Advantages of graphical models Mathematical: Use network structure to build a joint probability distribution for known and unknown quantities.

7 Joint distributions and graphical models
Use ideas from graph theory to represent the structure of a joint probability distribution by encoding conditional independencies.
Factorization theorem: the joint distribution is $P(V) = \prod_{v \in V} P(v \mid \mathrm{parents}[v])$
Example for the graph on nodes A, B, C, D, E, F:
$P(A,B,C,D,E,F) = P(A \mid C)\, P(B \mid D,E)\, P(C \mid D,E)\, P(D)\, P(E)\, P(F \mid D,E)$
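The factorization on this slide can be checked numerically. The sketch below reads the parent sets off the slide's formula and treats every node as binary; the function `prob_one` and its numbers are hypothetical and chosen only so that the example runs. It multiplies the conditional terms to obtain the joint probability of one configuration and confirms that the 64 configurations sum to one.

```python
from itertools import product

# Parent sets read off the slide's factorization:
# P(A,B,C,D,E,F) = P(A|C) P(B|D,E) P(C|D,E) P(D) P(E) P(F|D,E)
PARENTS = {
    "A": ("C",),
    "B": ("D", "E"),
    "C": ("D", "E"),
    "D": (),
    "E": (),
    "F": ("D", "E"),
}

def prob_one(parent_values):
    """Hypothetical P(node = 1 | parents); illustrative numbers only."""
    n = max(len(parent_values), 1)
    return 0.2 + 0.6 * sum(parent_values) / n

def joint(assignment):
    """Joint probability of a full 0/1 assignment via the factorization."""
    p = 1.0
    for node, parents in PARENTS.items():
        p1 = prob_one(tuple(assignment[q] for q in parents))
        p *= p1 if assignment[node] == 1 else 1.0 - p1
    return p

NODES = sorted(PARENTS)
total = sum(
    joint(dict(zip(NODES, values)))
    for values in product((0, 1), repeat=len(NODES))
)
all_zero = joint({node: 0 for node in NODES})
print(total, all_zero)
```

Because each factor is a proper conditional distribution and the graph has no cycles, the product is a valid joint distribution whatever numbers are used in `prob_one`.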

8 Advantages of graphical models Mathematical: Use network structure to build a joint probability distribution for known and unknown quantities. Modelling: Easy to represent real-world complexity as a fusion of simpler sub-models.

9 Building complex models: conditional independence provides the mathematical basis for expressing a large system as a fusion of smaller components. [Diagram: the full graph on nodes A, B, C, D, E, F]

10 Building complex models: conditional independence provides the mathematical basis for expressing a large system as a fusion of smaller components. [Diagram: the same graph split into smaller components that share the nodes C, D and E]

11 Advantages of graphical models
- Mathematical: use network structure to build a joint probability distribution for known and unknown quantities.
- Modelling: easy to represent real-world complexity as a fusion of simpler sub-models.
- Inference: Bayesian; unknown quantities have probability distributions, updated as data arrive. Uncertainties are propagated through the model.
- Computational: allow efficient algorithms for estimating Bayesian posterior distributions.
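As a small illustration of "updated as data arrive", the sketch below uses a conjugate Beta prior for an unknown proportion (for instance, a proportion of low birth weight births). This is a textbook example rather than the model used in the talk, and the prior and all counts are hypothetical.

```python
def update_beta(a, b, successes, failures):
    """Posterior Beta(a, b) parameters after observing binomial data."""
    return a + successes, b + failures

# Hypothetical flat prior and two hypothetical batches of births.
prior = (1.0, 1.0)
after_first = update_beta(*prior, successes=3, failures=47)
after_second = update_beta(*after_first, successes=5, failures=45)

# Updating batch by batch gives the same posterior as using all data at once.
all_at_once = update_beta(*prior, successes=8, failures=92)

posterior_mean = after_second[0] / sum(after_second)
print(after_second, all_at_once, posterior_mean)
```

The posterior after the first batch acts as the prior for the second, which is the sense in which distributions are updated as data arrive.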

12 Simple example [Diagram: for each individual, EXPOSURE and OUTCOME are observed data; the Effect linking them is unknown]

13 Simple example [Diagram: as before, with an observed CONFOUNDER added for each individual]

14 Simple example [Diagram: individuals with complete data have observed EXPOSURE, CONFOUNDER and OUTCOME; individuals with missing data have observed EXPOSURE and OUTCOME but an unknown CONFOUNDER (???); the Effect is unknown]

15 Simple example [Diagram: as on slide 14, with two unknown quantities: the effect on the outcome and the effect on the confounder]

16 [Diagram: individuals with complete data (observed EXPOSURE, CONFOUNDER, OUTCOME) and individuals with missing data (unknown CONFOUNDER), linked through the unknown effect on the confounder and effect on the outcome]

17 Building complex models
Key idea: understand a complex system through a global model built from small pieces that are
- comprehensible
- each with only a few variables
- each representing a different data source or bias

18 Case study Combining birth register, survey and census data to study effects of water disinfection by-products on risk of low birth weight

19 Example of combining different data sources: Chlorination Study
Chlorine reacts with natural organic matter and/or the chemical compound bromide to form organic and inorganic by-products: bromate, chlorite, haloacetic acids (HAA5) and total trihalomethanes (THMs).
Environmental exposure: chlorine by-products (THMs)
Outcome: low birth weight (LBW, birth weight < 2.5 kg)
- LBW: baby's birth weight is less than 2.5 kg
- LBWP (LBW and pre-term): LBW babies born at less than 37 weeks
- LBWF (LBW and full-term): LBW babies born at 37 weeks or later
Covariates: mother's race/ethnicity, baby's sex, mother's smoking status, mother's age during the pregnancy
Gestational age (used to split LBW into LBWP and LBWF)

20 Available data sources related to the Chlorination Study: why do we need them?
- Administrative data (NBR): deals with the small percentage of LBW in the population and the inconclusive link between LBW and THMs
- Aggregate data: used for imputing missing covariates
- Survey data (MCS): adjusts for important subject-level covariates; allows different types of LBW to be examined

21 Summary of data sources
- Administrative data: NBR (national birth registry). Large: good power, no selection bias. Observed postcode; missing smoking and race/ethnicity; missing baby's gestational age.
- Survey data: MCS (Millennium Cohort Study), a subset of the NBR. Low power, selection bias. Observed postcode; observed smoking and race/ethnicity; observed baby's gestational age.
- Aggregate data (UK), linked through the observed postcode. Census 2001: region-level race/ethnicity composition. Consumer survey (CACI): region-level tobacco expenditure.

22 Building the sub-model: disease sub-model for MCS
Indices: m = subject index for MCS; r = region index
- $y_{rm}$: birth weight indicator (1: normal, 2: LBWP, 3: LBWF)
- $\mathrm{THM}_{rm}$: THM (chlorine by-product) exposure
- $C_{rm}$: covariates such as race/ethnicity and smoking, which are missing elsewhere and only observed in the MCS
Multinomial logistic regression for MCS:
$y_{rm} \sim \mathrm{Multinomial}(p_{rm,1:3}, 1)$
$\log(p_{rm,2} / p_{rm,1}) = b_{10} + b_{11}\, \mathrm{THM}_{rm} + b_{12}\, C_{rm}$
$\log(p_{rm,3} / p_{rm,1}) = b_{20} + b_{21}\, \mathrm{THM}_{rm} + b_{22}\, C_{rm}$
[Diagram: $\mathrm{THM}_{rm}$ and $C_{rm}$ (known) and the disease model parameters (unknown) point to $y_{rm}$, which takes the values normal, LBWP or LBWF]

23 Building the sub-model: disease sub-model for NBR
Indices: n = subject index for NBR; r = region index
- $y_{rn}$: birth weight indicator; the LBWP and LBWF categories are missing because gestational age is missing
- $C_{rn}$: covariates such as race/ethnicity and smoking (missing in the NBR, but observed in the MCS)
Multinomial logistic regression for NBR:
$y_{rn} \sim \mathrm{Multinomial}(p_{rn,1:3}, 1)$
$\log(p_{rn,2} / p_{rn,1}) = b_{10} + b_{11}\, \mathrm{THM}_{rn} + b_{12}\, C_{rn}$
$\log(p_{rn,3} / p_{rn,1}) = b_{20} + b_{21}\, \mathrm{THM}_{rn} + b_{22}\, C_{rn}$
[Diagram: $\mathrm{THM}_{rn}$ (known), $C_{rn}$ (unknown) and the disease model parameters (unknown) point to $y_{rn}$]
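The two log-odds equations determine all three category probabilities, with "normal" as the baseline category. The sketch below shows that mapping for a single subject. It assumes, for simplicity, that C is a single 0/1 covariate rather than a vector, and the coefficient values are hypothetical and not estimates from the study.

```python
import math

def category_probs(thm, c, b1, b2):
    """Return (p1, p2, p3) for normal, LBWP and LBWF.

    b1 = (b10, b11, b12) and b2 = (b20, b21, b22) are the coefficients of
    the two log-odds equations; category 1 (normal) is the baseline.
    """
    eta2 = b1[0] + b1[1] * thm + b1[2] * c  # log(p2 / p1)
    eta3 = b2[0] + b2[1] * thm + b2[2] * c  # log(p3 / p1)
    denom = 1.0 + math.exp(eta2) + math.exp(eta3)
    return 1.0 / denom, math.exp(eta2) / denom, math.exp(eta3) / denom

# Hypothetical coefficients, chosen only for illustration.
p1, p2, p3 = category_probs(
    thm=1, c=0, b1=(-3.0, 0.2, 0.5), b2=(-3.5, 0.4, 0.9)
)
print(p1, p2, p3)
```

Dividing by the common denominator guarantees that the three probabilities sum to one while preserving both log-odds relations.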

24 Missing outcome model: impute LBWP and LBWF for NBR
G-age: gestational age
[Diagram: the NBR and MCS disease sub-models side by side. Birth weight (BW) is recorded as normal or LBW. In the MCS, G-age is known, so LBW is split into LBWP and LBWF; in the NBR, G-age is missing, so the LBWP/LBWF split is unknown.]

25 Building the sub-model: missing covariate model
Impute $C_{rn}$ in terms of aggregate data and MCS data.
Since our missing covariates, such as race and smoking, are binary variables, we use a multivariate probit model to account for their correlation.
[Diagram: aggregate data $A_r$ (known) and the missing covariate model parameters (unknown) point to $C_{rn}$ (NBR) and $C_{rm}$ (MCS)]

26 Multivariate probit model (Chib & Greenberg, 1998)
- Race: 1 = non-white (Asian, Black, Others), 0 = white
- Smoke: 1 = yes, 0 = no
- Define underlying continuous variables (Smoke*, Race*), with Smoke = I(Smoke* > 0) and Race = I(Race* > 0)
- The correlation between Smoke* and Race* accounts for the association between the two covariates
- S: sampling stratum, included to adjust for selection bias
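The latent-variable construction can be sketched by simulation. The code below draws correlated standard normal errors, adds them to two linear predictors, and thresholds at zero to obtain the binary Smoke and Race indicators. The function name, the predictor values and the correlation are hypothetical; in the talk the predictors would depend on aggregate data and sampling stratum, which is not modelled here.

```python
import math
import random

def simulate_smoke_race(n, mean_smoke, mean_race, rho, seed=0):
    """Draw (smoke, race) pairs from a bivariate probit model.

    mean_smoke and mean_race are hypothetical linear predictors for the
    latent variables; rho is the correlation of the latent errors.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        e1 = rng.gauss(0.0, 1.0)
        # e2 has unit variance and correlation rho with e1.
        e2 = rho * e1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        smoke_star = mean_smoke + e1
        race_star = mean_race + e2
        draws.append((int(smoke_star > 0), int(race_star > 0)))
    return draws

sample = simulate_smoke_race(
    n=20000, mean_smoke=-0.5, mean_race=-1.0, rho=0.3
)
share_smoke = sum(s for s, _ in sample) / len(sample)
share_nonwhite = sum(r for _, r in sample) / len(sample)
print(share_smoke, share_nonwhite)
```

Each marginal probability equals the standard normal CDF evaluated at the corresponding linear predictor, while `rho` controls how often the two indicators occur together.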

27 Unified model
[Diagram: the NBR disease sub-model ($\mathrm{THM}_{rn}$, $C_{rn}$, $y_{rn}$) and the MCS disease sub-model ($\mathrm{THM}_{rm}$, $C_{rm}$, $y_{rm}$) are joined through the disease model parameters; the missing covariate sub-model links aggregate data $A_r$ and the missing covariate model parameters to $C_{rn}$ and $C_{rm}$; the missing outcome model covers the LBWP/LBWF categories of $y_{rn}$. Nodes are marked as known or unknown.]

28 1. Disease model (y = {1, 2, 3}); 2. Missing outcome model; 3. Missing covariates model (multivariate probit)
Notation: i: subject index; $N_m$: group of subjects who had missing outcome ($y_{miss}$); r: region; u: index for the category of outcome; $y_{obs}$: observed outcome; X: observed covariates

29 Investigating the performance of the unified model
[Diagram: aggregate (census) data informs C (0/1) through the missing covariate model; C informs Y (1, 2, 3) through the missing outcome model]
Good performance of the model depends on:
1. How well the aggregate data can inform C (covariate)
2. How strongly C and Y are linked
The MCS data showed:
1. a strong association between the aggregate data and race and smoking
2. a strong association between race and smoking and Y

30 Simulation study
Scenario: strong C-aggregate association, strong Y-C link
- Step 1: Create data (N = 1333) under the scenario.
- Step 2: Randomly assign missing values: 50% for Y = 2 and Y = 3, and 80% for C. Repeat step 2 to generate 10 replicate samples.
- Step 3: Compare the prediction based on an analysis using fully observed data (no imputation) with an analysis using partially observed data (imputation).
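Step 2 of the simulation can be sketched as follows. This assumes each value is masked independently with the stated probability, which is one reading of "randomly assign missing values"; the record layout, the function name and the toy data are hypothetical. A masked outcome keeps the information that the birth was LBW, mirroring the registry data where only the pre-term or full-term split is lost.

```python
import random

def mask_for_simulation(records, p_miss_y=0.5, p_miss_c=0.8, seed=0):
    """Return a copy of records with values set to None at random.

    records: list of dicts with keys 'y' (1, 2 or 3) and 'c' (0 or 1).
    Only outcomes in categories 2 and 3 can be masked.
    """
    rng = random.Random(seed)
    masked = []
    for rec in records:
        new = dict(rec)
        new["lbw"] = rec["y"] in (2, 3)  # LBW status is always observed
        if new["lbw"] and rng.random() < p_miss_y:
            new["y"] = None  # pre-term vs full-term split is unknown
        if rng.random() < p_miss_c:
            new["c"] = None
        masked.append(new)
    return masked

# Toy complete data of the same size as the simulated data set (N = 1333).
records = [{"y": 1 + (i % 3), "c": i % 2} for i in range(1333)]

# Ten replicate samples, one per seed.
replicates = [mask_for_simulation(records, seed=s) for s in range(10)]
frac_c_missing = sum(m["c"] is None for m in replicates[0]) / len(records)
print(frac_c_missing)
```

Using a different seed per replicate keeps each masked data set reproducible while varying which values are removed.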

31 Examining the missing outcome model: imputing Y
In this dataset, missing outcome data are always LBW, either pre-term or full-term (Y = 2 or Y = 3). Therefore, for missing outcome data, we wish to determine the conditional probabilities:
- Pr(Y = 2 | Y = 2 or 3, covariates): conditional probability of LBWP given LBW and covariates
- Pr(Y = 3 | Y = 2 or 3, covariates): conditional probability of LBWF given LBW and covariates
If we are to accurately impute Y, these probabilities must be accurately estimated.
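Assuming the missing outcome model uses the same category probabilities as the multinomial logistic disease model on slides 22 and 23, these conditional probabilities follow from the two log-odds equations. The shorthand $\eta_2$ and $\eta_3$ for the two linear predictors is introduced here for illustration and does not appear on the slides.

```latex
\begin{align*}
\eta_2 &= b_{10} + b_{11}\,\mathrm{THM} + b_{12}\,C, &
\eta_3 &= b_{20} + b_{21}\,\mathrm{THM} + b_{22}\,C, \\
\Pr(Y = 2 \mid Y \in \{2,3\},\ \mathrm{THM},\ C)
  &= \frac{p_2}{p_2 + p_3}
   = \frac{e^{\eta_2}}{e^{\eta_2} + e^{\eta_3}}, \\
\Pr(Y = 3 \mid Y \in \{2,3\},\ \mathrm{THM},\ C)
  &= \frac{p_3}{p_2 + p_3}
   = \frac{e^{\eta_3}}{e^{\eta_2} + e^{\eta_3}}.
\end{align*}
```

The baseline probability $p_1$ cancels, so under this assumption the imputation of the pre-term or full-term split depends only on the difference $\eta_3 - \eta_2$.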

32 Examining the missing outcome model: imputing Y
[Figure: panels for S=0, R=0; S=0, R=1; S=1, R=0; S=1, R=1]
- Y contains 50% missing values at categories 2 and 3; S and R are fully observed
- More challenging: Y contains 50% missing values at categories 2 and 3; S and R contain 80% missing values

33 Examining the missing covariate model: imputing C (smoke and race)
- One-level imputation (aggregate data to C): C contains 80% missing
- Two-level imputation (aggregate data to C to Y): C contains 80% missing; Y contains 50% missing at categories 2 and 3
Cell probabilities by smoke and race: P00 = non-smoker, white; P01 = non-smoker, non-white; P10 = smoker, white; P11 = smoker, non-white
[Figure: panels for Y=1, Y=2, Y=3]

34 Real data analysis: United Utilities water company
Data restricted to singleton births; period: Sep 2000 to Aug 2001
Subjects: MCS 1333 + NBR 7945 = total 9278
Missing % in race and smoke: ~85%
Missing % in outcome: ~7%
[Diagram legend: complete observed information; missing race; missing smoke; missing outcome at levels 2 (LBWP) and 3 (LBWF)]

35 Real data analysis: United Utilities water company
Exposure variable: THMs
- Dichotomized into 2 groups: low-medium exposure (<= 60 µg/l): 57.35%; high exposure (> 60 µg/l): 42.65%
- Estimated in a separate model for MCS and NBR (Whitaker et al, 2005)
In addition to race and smoke, we also adjust for baby's sex and mother's age, which are observed in both MCS and NBR.

36 Models for real data analysis: standard (STATA) vs. Bayesian
a. Multinomial logistic regression model for MCS data: no imputation
b. Bayesian multiple bias model for combined NBR, MCS and aggregate data: imputes missing outcome and covariates

37 Results for the real data analysis (low birth weight full-term vs. normal): OR (95% CI)

Data            Model                         Outcome  THMs             Smoke           Non-white
MCS (1333)      Multinomial logistic (STATA)  LBWF     1.51 (0.8-3.0)   2.4 (1.2-4.9)   4.7 (2.2-10)
MCS+NBR (9278)  Bayesian multiple bias        LBWF     2.13 (1.1-4.2)*  2.6 (1.3-5.3)*  6.9 (3.3-14.5)*

* 95% Bayesian credible interval
All parameter estimates are adjusted for baby's sex and mother's age.
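Since THM exposure is dichotomized, the THM odds ratio for LBWF corresponds to the exponential of the coefficient $b_{21}$ in the disease model. The sketch below converts the reported THM odds ratios back to the log-odds scale and checks whether each interval excludes an odds ratio of 1. The numbers are copied from the table; the function names are hypothetical.

```python
import math

# THM odds ratios and 95% intervals (LBWF vs normal), copied from the table.
reported = {
    "MCS only (STATA)": (1.51, 0.8, 3.0),
    "MCS+NBR (Bayesian)": (2.13, 1.1, 4.2),
}

def to_log_odds(or_value, lower, upper):
    """Map an odds ratio and its interval to the log-odds scale."""
    return math.log(or_value), math.log(lower), math.log(upper)

def interval_excludes_no_effect(lower, upper):
    """True when the interval for the odds ratio does not contain 1."""
    return lower > 1.0 or upper < 1.0

for label, (or_value, lower, upper) in reported.items():
    b, b_lo, b_hi = to_log_odds(or_value, lower, upper)
    print(label, round(b, 3), interval_excludes_no_effect(lower, upper))
```

The MCS-only interval contains 1 while the combined-data interval does not, which is the comparison the results table is drawing.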

38 Conclusion
There is evidence for an association of THM exposure with low birth weight full-term.
Combining the datasets can
- increase the statistical power of the survey data
- alleviate bias due to confounding in the administrative data
We must allow for the selection mechanism of the survey when combining data.

39 Thanks: Mireille Toledano, Mark Nieuwenhuijsen, James Bennett, Peter Hambly, Daniela Fecht, John Molitor

40 [Figure: one-level imputation; strong vs. weak C-aggregate association; panels for Y=1, Y=2, Y=3]

41 [Figure: one-level imputation; Y contains 50% missing values at categories 2 and 3; strong vs. weak Y-C link; panels for S=0, R=0; S=0, R=1; S=1, R=0; S=1, R=1]

42 [Figure: two-level vs. one-level imputation; panels for Y=1, Y=2, Y=3 under three scenarios: strong C-aggregate with strong Y-C; weak C-aggregate with strong Y-C; strong C-aggregate with weak Y-C]

43 [Figure: without cut function vs. cut function]

44 [Figure: two-level imputation; without cut function vs. cut function]

