Martijn Schuemie, Peter Rijnbeek, Jenna Reps, Marc Suchard


1 Integrating Observational Data with Prior Knowledge: Wikipedia-Informed Priors for Predicting Health Outcomes
Martijn Schuemie, Peter Rijnbeek, Jenna Reps, Marc Suchard

2 Clinical outcome prediction
Models used in clinical practice:
- Framingham Risk Score: probability of a cardiovascular outcome in the next 10 years
- CHADS2 score: probability of stroke in patients with atrial fibrillation

3 Fitting models in observational data
[Figure: patient timeline along a time axis. Covariates (diagnoses, drug prescriptions, procedures, measurements) are observed before the point of prediction; the outcome is observed in a predefined prediction interval after it.]

4 Which covariates to include?
Traditionally: an expert identifies n covariates, where n < 25
'Big data': generate all possible covariates, so that 10,000 < n < 100,000
- Each diagnosis code
- Each diagnosis group
- Each drug
- Each drug class
Big data covariates include all hand-picked covariates

5 Bayesian prediction modeling
L1 regularized regression (LASSO) is equivalent to a Laplace prior on each coefficient
Handpicked covariates = prior with large variance on selected covariates, 0 variance on the rest
Big data covariates = same prior on all covariates
[Figure: Laplace prior density; x-axis: coefficient value, y-axis: probability]
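The equivalence stated on this slide can be written out as a short derivation. This is a sketch, not taken from the slides: the symbols b (Laplace scale), \sigma^2 (prior variance), \lambda (L1 penalty), and L(\beta) (likelihood) are notation introduced here, and it assumes independent zero-centered Laplace priors that share one variance.

```latex
% Laplace prior centered on 0 with scale b, so its variance is \sigma^2 = 2b^2
p(\beta_j) = \frac{1}{2b} \exp\!\left(-\frac{|\beta_j|}{b}\right),
\qquad \sigma^2 = 2b^2

% Maximizing the posterior is the same as minimizing an L1-penalized
% negative log-likelihood
\hat{\beta}
  = \arg\max_{\beta} \Big[ \log L(\beta) + \sum_j \log p(\beta_j) \Big]
  = \arg\min_{\beta} \Big[ -\log L(\beta) + \lambda \sum_j |\beta_j| \Big],
\qquad \lambda = \frac{1}{b} = \sqrt{\frac{2}{\sigma^2}}
```

Under this reading, a prior variance of 0 corresponds to an infinite penalty (the covariate is excluded), and a larger variance corresponds to a weaker penalty.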

6 Informed priors
Prior variance of known risk factors ≠ prior variance of other covariates
All priors are still centered on 0
Variances are selected by optimizing the likelihood in a cross-validation
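One way to picture a model with two prior variances is an L1-penalized logistic regression where each coefficient has its own penalty. The sketch below is illustrative only and is not the authors' implementation: it uses plain proximal gradient descent, omits the intercept, assumes the penalty for a coefficient is sqrt(2 / variance), and takes the variances as given instead of selecting them by cross-validation. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def soft_threshold(x, t):
    """Proximal operator of the (per-coefficient) L1 penalty."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def fit_informed_l1_logistic(X, y, prior_variance, n_iter=2000):
    """MAP logistic regression with zero-centered Laplace priors.

    prior_variance holds one variance per column of X (assumption:
    penalty = sqrt(2 / variance)). No intercept is fitted in this sketch.
    """
    penalty = np.sqrt(2.0 / np.asarray(prior_variance, dtype=float))
    # Step size 1/L, where L = ||X||_2^2 / 4 bounds the curvature of the
    # summed logistic negative log-likelihood.
    step = 4.0 / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (prob - y)  # gradient of the negative log-likelihood
        beta = soft_threshold(beta - step * grad, step * penalty)
    return beta


# Synthetic example: the first two columns play the role of known risk factors.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6))
true_beta = np.array([1.5, -1.0, 0.0, 0.0, 0.0, 0.0])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

# Wide prior on the risk factors, very narrow prior on everything else.
variances = [1.0, 1.0, 1e-6, 1e-6, 1e-6, 1e-6]
beta = fit_informed_l1_logistic(X, y, variances)
```

In a setup like the one on the slide, the two variances would be tuned jointly, for example by a grid search that maximizes the held-out likelihood across cross-validation folds.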

7 Sources of prior knowledge
Expert knowledge
Wikipedia

8 Wikipedia pages are linked to coding systems

9 Wikipedia pages link to other pages
Those other pages may be linked to codes as well

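10 Outcome definition
[Flow diagram, from outcome definition to covariates:]
Cohort definition (outcome definition) contains standard concepts, expanded with descendants and higher levels
Standard concepts map to codes
Codes are linked to wiki pages
Wiki pages reference related wiki pages (including redirects)
Related wiki pages are linked to codes
Codes map to standard concepts, expanded with descendants
Standard concepts are used in covariates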
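The chain on this slide can be sketched as a few set operations over lookup tables. Everything below is a hypothetical illustration: the dictionary names, the toy entries, and the function are invented for this example, and the sketch skips the "higher levels" and redirect steps for brevity.

```python
def wiki_informed_concepts(outcome_concepts, concept_to_codes, page_codes,
                           page_links, code_to_concepts, descendants):
    """Follow outcome concepts -> codes -> wiki pages -> linked pages
    -> codes -> concepts (+ descendants). All inputs are plain dicts/sets."""
    # 1. Codes that the outcome's standard concepts map to.
    outcome_codes = set()
    for concept in outcome_concepts:
        outcome_codes |= concept_to_codes.get(concept, set())

    # 2. Wiki pages linked to any of those codes.
    outcome_pages = {page for page, codes in page_codes.items()
                     if codes & outcome_codes}

    # 3. Pages referenced from the outcome pages.
    related_pages = set()
    for page in outcome_pages:
        related_pages |= page_links.get(page, set())

    # 4. Codes attached to the related pages (pages without codes drop out).
    related_codes = set()
    for page in related_pages:
        related_codes |= page_codes.get(page, set())

    # 5. Map codes back to standard concepts and add their descendants.
    concepts = set()
    for code in related_codes:
        concepts |= code_to_concepts.get(code, set())
    for concept in list(concepts):
        concepts |= descendants.get(concept, set())
    return concepts


# Toy data (hypothetical concept names; ICD-10 codes used only as labels).
concept_to_codes = {"heart_failure": {"I50"}}
page_codes = {
    "Heart failure": {"I50"},
    "Hypertension": {"I10"},
    "Chronic kidney disease": {"N18"},
}
page_links = {"Heart failure": {"Hypertension", "Chronic kidney disease", "Stethoscope"}}
code_to_concepts = {"I10": {"hypertension"}, "N18": {"ckd"}}
descendants = {"ckd": {"ckd_stage_3"}}

risk_factor_concepts = wiki_informed_concepts(
    {"heart_failure"}, concept_to_codes, page_codes,
    page_links, code_to_concepts, descendants)
```

The resulting concept set would then mark which covariates receive the separate "risk factor" prior variance.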

11 Proof of concept
Population of interest: Type 2 diabetes mellitus
Prediction time: One year
Outcomes of interest: Stroke, Myocardial infarction, Endocarditis, Heart failure
Databases: CCAE (US insurance claims), JMDC (Japanese insurance claims)

12 Population sizes
                             CCAE         JMDC
Type 2 diabetes diagnosis    5,486,674    89,889
Sampled to max 100,000       100,000      -
Training set                 50,000       44,944
Test set                     50,000       44,945
Unique covariates            22,817       8,829
People with prior outcomes and less than one year of follow up (and no outcome) were removed for each outcome
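The sizes above are consistent with capping each population at 100,000 people and then halving it. A minimal sketch of that step follows; it assumes a simple random sample and a random half split, which the slide does not state, and the function name and seed are hypothetical.

```python
import random


def sample_and_split(person_ids, max_size=100_000, seed=0):
    """Cap the population at max_size, then split it into two halves.

    person_ids must be a sequence (for example a list or a range).
    When the size is odd, the extra person goes to the test set.
    """
    rng = random.Random(seed)
    if len(person_ids) > max_size:
        ids = rng.sample(person_ids, max_size)  # random subset, random order
    else:
        ids = list(person_ids)
        rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


# A population of 89,889 is below the cap, so only the split applies.
train, test = sample_and_split(range(89_889))
print(len(train), len(test))  # 44944 44945
```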

13 Approaches under evaluation
- Strict manual: only including expert-selected risk factors
- Manual informed: separate prior variance for expert-selected risk factors
- Wiki informed: separate prior variance for risk factors automatically derived from Wikipedia
- Uninformed: same prior for all variables
All use L1 regression with 10-fold cross-validation
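The four approaches differ only in which prior variance each covariate receives. The sketch below makes that explicit; it is an illustration, not the authors' code. The function name, the approach labels, and the covariate names are hypothetical, and the manual-informed and wiki-informed approaches share the "informed" branch because they differ only in where the risk-factor list comes from.

```python
def prior_variances(covariates, approach, risk_factors=(), var_risk=1.0, var_other=1.0):
    """Return a {covariate: prior variance} mapping for one approach.

    approach: "strict"     -> risk factors only (variance 0 excludes the rest)
              "informed"   -> separate variances for risk factors and others
              "uninformed" -> one shared variance (var_other) for everything
    """
    risk = set(risk_factors)
    if approach == "strict":
        return {c: (var_risk if c in risk else 0.0) for c in covariates}
    if approach == "informed":
        return {c: (var_risk if c in risk else var_other) for c in covariates}
    if approach == "uninformed":
        return {c: var_other for c in covariates}
    raise ValueError(f"unknown approach: {approach!r}")


# Hypothetical covariates; the risk-factor list could come from an expert
# (manual informed) or from Wikipedia links (wiki informed).
covariates = ["hypertension", "ckd", "statin_use", "ankle_sprain"]
risk_factors = ["hypertension", "ckd"]

strict = prior_variances(covariates, "strict", risk_factors, var_risk=0.2)
informed = prior_variances(covariates, "informed", risk_factors, var_risk=0.2, var_other=0.03)
uninformed = prior_variances(covariates, "uninformed", var_other=0.07)
```

In each case the variance values themselves would be chosen by the 10-fold cross-validation mentioned on the slide; the numbers here are placeholders.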

14 Area under the ROC

15 Area under the ROC Exploring these

16 Model differences for Heart Failure
Uninformed: prior variance = 0.068
Wiki-informed: variance for risk factors = 0.201, variance for other variables = 0.034
[Figure: beta in the wiki-informed model vs. beta in the uninformed model; labeled covariates: Chronic obstructive lung disease, Chronic kidney disease, Kidney disease, Chronic disease of genitourinary system]
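To get a feel for these variances, they can be converted into the equivalent L1 penalty. This worked example assumes the reported values are variances of a zero-centered Laplace prior, so that penalty = sqrt(2 / variance); the slide does not state the parametrization, and the helper name is hypothetical.

```python
import math


def laplace_penalty(variance):
    """L1 penalty implied by a Laplace prior variance (assumed parametrization)."""
    return math.sqrt(2.0 / variance)


heart_failure_variances = {
    "uninformed, all covariates": 0.068,
    "wiki-informed, risk factors": 0.201,
    "wiki-informed, other variables": 0.034,
}

penalties = {label: laplace_penalty(v) for label, v in heart_failure_variances.items()}
for label, penalty in penalties.items():
    print(f"{label}: penalty = {penalty:.2f}")
# uninformed, all covariates: penalty = 5.42
# wiki-informed, risk factors: penalty = 3.15
# wiki-informed, other variables: penalty = 7.67
```

Under this assumption, the wiki-informed model shrinks the Wikipedia-derived risk factors less than the uninformed model does, and shrinks all other variables more.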

17 Portability Train Test

18 Portability Train Test

19 Conclusions
- Expert-informed and Wikipedia-informed priors can lead to different models compared to uninformed priors
- Despite these differences, performance (within and across databases) is comparable
- Strict hand-picking is almost always suboptimal
Next steps:
- Consider other sources of prior knowledge (e.g. models fitted elsewhere)
- Other algorithms (e.g. random forests)
- Try out more scenarios

