Data Mining Cardiovascular Bayesian Networks Charles Twardy †, Ann Nicholson †, Kevin Korb †, John McNeil ‡ (Danny Liew ‡, Sophie Rogers ‡, Lucas Hope.


1 Data Mining Cardiovascular Bayesian Networks
Charles Twardy †, Ann Nicholson †, Kevin Korb †, John McNeil ‡ (Danny Liew ‡, Sophie Rogers ‡, Lucas Hope †)
† School of Computer Science & Software Engineering
‡ Dept. of Epidemiology & Preventive Medicine
Monash University
www.datamining.monash.edu.au/bnepi

2 Overview
Problem: assessment of risk for coronary heart disease (CHD)
1. Knowledge Engineering: medical experts, 2 epidemiological models
2. Data Mining: Busselton Study data, causal discovery (CaMML) + other learners
3. Evaluation
Bayesian network software (Netica)

3 Knowledge Engineering BNs from the medical literature
• The Australian Busselton Study
» every 3 years, 1966-1981, > 8,000 participants
» mortality followup via WA death register + manually
» Cox proportional-hazards model, 2,258 from 1978 cohort
» CHD event base rates: 23% for men, 14% for women
• The German PROCAM Study
» 1979-1985, followup every 2 years, > 25,000 participants
» Scoring model (based on Cox), ~5,000 men
» CHD event base rates: ~6%
General question: are models transferable across populations?

4 Bayesian networks (BNs)
• Use probability theory to represent uncertainty
• Represent a probability distribution graphically, as a directed acyclic graph
• Nodes: random variables (discrete or continuous)
• Arcs indicate conditional dependencies between variables
» e.g. with arcs X → Y and X → Z, P(X,Y,Z) decomposes to P(X)P(Y|X)P(Z|X)
• Conditional Probability Distribution (CPD)
» Associated with each variable: the probability of each state given the parent states
• BN inference
» Evidence: observation of a specific state
» Task: compute the posterior probabilities for query node(s) given the evidence
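The factorization P(X)P(Y|X)P(Z|X) and the inference task on this slide can be illustrated with a minimal Python sketch. It assumes a three-node network with arcs X → Y and X → Z, all variables binary. The probability values are invented for illustration and are not taken from the Busselton or PROCAM models. Inference is done by brute-force enumeration of the joint, which is only practical for tiny networks; tools such as Netica use more efficient algorithms.

```python
from itertools import product

# Hypothetical CPDs for a network X -> Y, X -> Z (all variables binary).
P_X = {True: 0.3, False: 0.7}
P_Y_GIVEN_X = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
P_Z_GIVEN_X = {True: {True: 0.6, False: 0.4}, False: {True: 0.2, False: 0.8}}


def joint(x, y, z):
    """P(X=x, Y=y, Z=z) via the factorization P(X) P(Y|X) P(Z|X)."""
    return P_X[x] * P_Y_GIVEN_X[x][y] * P_Z_GIVEN_X[x][z]


def posterior_x(evidence):
    """Posterior P(X | evidence), where evidence maps 'y' and/or 'z' to a state.

    Sums the joint over every assignment consistent with the evidence,
    then normalizes.
    """
    weights = {True: 0.0, False: 0.0}
    for x, y, z in product([True, False], repeat=3):
        if evidence.get("y", y) != y or evidence.get("z", z) != z:
            continue  # assignment contradicts the observed evidence
        weights[x] += joint(x, y, z)
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


if __name__ == "__main__":
    print(posterior_x({}))                        # no evidence: the prior on X
    print(posterior_x({"y": True}))               # evidence on Y
    print(posterior_x({"y": True, "z": True}))    # evidence on Y and Z
```

With no evidence the function returns the prior on X; each added observation shifts the posterior, which is the kind of reasoning the later "Busselton BN: reasoning" slides show graphically.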

5 The Busselton BN: nodes

6 The Busselton BN: arcs
Figure annotations: predictor variables; uninformative; 10-year risk of CHD event
• BNs summarize the joint distribution: P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S)
• All nodes have an associated conditional probability distribution
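One way to see what "summarize the joint distribution" buys is to count free parameters. The sketch below assumes, purely for illustration, that S, B, Al and At are all binary; the slide does not state their state counts, and the helper name free_parameters is hypothetical.

```python
from math import prod


def free_parameters(parents, cardinality):
    """Free parameters of a discrete BN.

    Each node needs (states - 1) numbers for every combination of
    parent states.
    """
    total = 0
    for node, node_parents in parents.items():
        parent_rows = prod(cardinality[p] for p in node_parents)  # 1 if no parents
        total += parent_rows * (cardinality[node] - 1)
    return total


# Structure from the slide: S is the only parent of B, Al and At.
parents = {"S": [], "B": ["S"], "Al": ["S"], "At": ["S"]}
cardinality = {name: 2 for name in parents}  # assumption: all nodes binary

factored = free_parameters(parents, cardinality)
full_joint = prod(cardinality.values()) - 1

if __name__ == "__main__":
    print("factored BN:", factored, "parameters")
    print("unstructured joint table:", full_joint, "parameters")
```

Under these assumptions the factored form needs fewer numbers than a full joint table, and the gap widens quickly as nodes are added.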

7 The Busselton BN: discretization discretization choices binary nodes

8 The Busselton BN: reasoning

9

10 Bad cholesterol Heavy smoking Normal

11 The Busselton BN: reasoning More risk factors !

12 A risk assessment tool for clinicians
• Previous tool: TAKEHEART
• Combine risk assessment (probability) with costs

13 Risk Assessment Tool: example
• Young, predictor not observed: don’t treat
• Old, predictor not observed: treat
• Not so old, predictor not observed: treat
• Young, predictor observed: don’t treat
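Combining a risk probability with costs to reach a treat / don't-treat decision can be sketched as an expected-cost comparison. This is a generic decision-theoretic illustration, not the TAKEHEART model: the function names, the cost figures and the assumed relative risk reduction are all hypothetical.

```python
def expected_costs(p_event, cost_event, cost_treatment, relative_risk_reduction):
    """Return (expected cost without treatment, expected cost with treatment)."""
    no_treat = p_event * cost_event
    treated_risk = p_event * (1.0 - relative_risk_reduction)
    treat = cost_treatment + treated_risk * cost_event
    return no_treat, treat


def should_treat(p_event, cost_event=100_000.0, cost_treatment=5_000.0,
                 relative_risk_reduction=0.3):
    """Treat only when treating has the lower expected cost.

    All default values are made-up placeholders.
    """
    no_treat, treat = expected_costs(
        p_event, cost_event, cost_treatment, relative_risk_reduction
    )
    return treat < no_treat


if __name__ == "__main__":
    for label, risk in [("low risk", 0.05), ("high risk", 0.25)]:
        print(label, risk, "treat" if should_treat(risk) else "don't treat")
```

In a tool of this kind, the risk passed in would be the posterior from the BN given whatever predictors have been observed, which is why the same patient can move between "treat" and "don't treat" as evidence is entered.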

14 PROCAM BN

15 CaMML: a causal learner
• Developed at Monash University
• Data mines BNs from epidemiological data
• Minimum message length (MML) metric: trades off complexity vs goodness of fit
• MCMC search over model space
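The trade-off between complexity and goodness of fit can be illustrated with a much simpler stand-in for MML. The sketch below scores two candidate structures over binary variables A and B with a BIC/MDL-style two-part code length (parameter cost plus data cost, in bits). This is not CaMML's MML metric, and it compares only two fixed structures rather than running an MCMC search; the function names and data sets are hypothetical.

```python
from collections import Counter
from math import log2


def data_cost_bits(contexts):
    """Negative log-likelihood in bits under maximum-likelihood estimates.

    `contexts` is a list of Counters, one per conditioning context.
    """
    bits = 0.0
    for counts in contexts:
        n = sum(counts.values())
        for c in counts.values():
            if c > 0:
                bits -= c * log2(c / n)
    return bits


def two_part_score(data, b_depends_on_a):
    """Approximate code length: (k / 2) * log2(N) for k parameters, plus data cost."""
    n = len(data)
    a_counts = Counter(a for a, _ in data)
    if b_depends_on_a:
        b_contexts = [Counter(b for a, b in data if a == v) for v in (0, 1)]
        k = 3  # P(A), P(B|A=0), P(B|A=1)
    else:
        b_contexts = [Counter(b for _, b in data)]
        k = 2  # P(A), P(B)
    return 0.5 * k * log2(n) + data_cost_bits([a_counts] + b_contexts)


# Two made-up data sets of (a, b) pairs, 100 cases each.
dependent = [(0, 0)] * 40 + [(1, 1)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 10
independent = [(0, 0)] * 25 + [(0, 1)] * 25 + [(1, 0)] * 25 + [(1, 1)] * 25

if __name__ == "__main__":
    for name, data in [("dependent", dependent), ("independent", independent)]:
        with_arc = two_part_score(data, b_depends_on_a=True)
        without_arc = two_part_score(data, b_depends_on_a=False)
        print(name, "with arc:", round(with_arc, 1), "without arc:", round(without_arc, 1))
```

The shorter total wins: the extra arc pays for itself only when it improves the fit by more than the cost of its additional parameter.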

16 CaMML: summary graph (remove)

17 CaMML: summary graph (mm)

18 CaMML: summary graph (impute)

19 CaMML: example BN

20

21 CaMML: linear BN

22 CaMML: linear summary graph

23 Evaluation
• Predicting 10-year risk of CHD using Busselton data
• Metrics:
» ROC curves (area under curve)
» Bayesian Information Reward (BIR)
• Experiment 1:
» Compare Busselton, PROCAM and CaMML BNs
• Experiment 2:
» Compare CaMML and other standard machine learners (from Weka)
» 90-10 training/testing split, 10-fold cross-validation
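The ROC metric and the 90-10 cross-validation scheme can be sketched in plain Python. The AUC here is computed as the probability that a randomly chosen positive case is scored above a randomly chosen negative case (ties count half), which equals the area under the ROC curve. The fold generator is a generic 10-fold splitter; it is an assumption that it resembles what Weka did in the experiments, and both function names are hypothetical.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def k_fold_indices(n_cases, k=10, seed=0):
    """Yield (train, test) index lists; with k=10 each split is roughly 90-10."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


if __name__ == "__main__":
    print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))   # perfect ranking
    print(auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))   # constant prediction
    for train, test in k_fold_indices(100):
        print(len(train), len(test))
```

A constant prediction, such as assigning everyone the base rate, cannot rank cases and so scores 0.5 on AUC. AUC measures relative ordering only, which is one reason an absolute measure such as BIR is reported alongside it.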

24 Experiment 1: ROC Results
Area under curve (AUC). Figure annotations: priors; extremes: "No-one at risk!" and "Everyone at risk!"

25 Experiment 1: Bayesian Info Reward
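The slides do not give the BIR formula. The sketch below assumes the definition usually attributed to Hope and Korb: for each class, the reward is log(p'/p) if that class is the true one and log((1 - p')/(1 - p)) otherwise, where p is the prior and p' the predicted probability, averaged over classes and then over test cases. Treat it as an illustration of the idea rather than the exact computation behind the reported results. Function names and numbers are hypothetical.

```python
from math import log2


def information_reward(predicted, prior, true_class):
    """Reward for one test case, in bits.

    `predicted` and `prior` are per-class probabilities, each strictly
    between 0 and 1.
    """
    total = 0.0
    for i, (p_hat, p) in enumerate(zip(predicted, prior)):
        if i == true_class:
            total += log2(p_hat / p)
        else:
            total += log2((1.0 - p_hat) / (1.0 - p))
    return total / len(prior)


def mean_information_reward(predictions, prior, true_classes):
    """Average reward over a test set."""
    rewards = [
        information_reward(pred, prior, truth)
        for pred, truth in zip(predictions, true_classes)
    ]
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    prior = [0.8, 0.2]  # made-up base rates: class 0 = no CHD event, class 1 = CHD event
    print(information_reward(prior, prior, 1))        # predicting the prior
    print(information_reward([0.1, 0.9], prior, 1))   # confident and right
    print(information_reward([0.1, 0.9], prior, 0))   # confident and wrong
```

Under this definition, predicting the prior earns zero, confident correct predictions earn a positive reward, and confident wrong ones are penalized, so the score reflects calibration of absolute risk rather than ranking alone.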

26 Experiment 2: ROC Results

27 Experiment 2: Bayesian Info Reward

28 Summary of Results
Experiment 1 (Models of whole data)
• PROCAM model does at least as well as Busselton
» On Busselton data
» For both "relative" (ROC) and "absolute" (BIR) risk
• CaMML models do as well
» But much simpler: only 4 nodes matter to CHD10!
Experiment 2 (Cross-validation of learners)
• Logistic regression does best on both metrics
» Statistically powerful: only 1 parameter per arc
» No search required: structure is given
» No discretization necessary
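The points about logistic regression (one parameter per arc, no discretization) can be made concrete with a small sketch. The predictor names and coefficient values below are invented for illustration and are not the fitted Busselton or PROCAM coefficients. The helper cpt_rows is hypothetical and simply counts the parent-state combinations a discrete node's table would need.

```python
from math import exp

# Hypothetical coefficients: one per predictor ("arc"), plus an intercept.
INTERCEPT = -7.0
COEFFICIENTS = {"age": 0.06, "systolic_bp": 0.015, "smoker": 0.6}


def logistic_risk(case):
    """Risk from continuous or binary predictors, with no discretization step."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * case[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + exp(-z))


def cpt_rows(n_parents, states_per_parent):
    """Rows in a discrete node's conditional probability table."""
    return states_per_parent ** n_parents


if __name__ == "__main__":
    patient = {"age": 60, "systolic_bp": 150, "smoker": 1}
    print("risk:", round(logistic_risk(patient), 3))
    print("logistic parameters:", len(COEFFICIENTS) + 1)
    print("CPT rows with 3 parents of 3 states each:", cpt_rows(3, 3))
```

The number of logistic parameters grows linearly with the number of predictors, while a discrete table grows exponentially, so with limited data each logistic parameter is estimated from far more cases.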

29 Conclusions
• Busselton & PROCAM models appear to perform equally well on Busselton data, using an absolute risk measure (BIR) from the literature
• CaMML results suggest the data have high variance and are too weak to support inference to complex models. Combining data would help.

30 Future directions
• Improve data mining by
» Adding prior knowledge to search
» Assessing whether data sources can be combined; if so, do so
• Investigate combination of continuous and discrete variables in data mining and modeling
• Develop new TAKEHEART model using BNs (taking the best from experts, literature, data mining)
» with intervention modeling (Causal Reckoner)
» with decision support
» with GUI, usable by clinicians

31 References
• G. Assmann, P. Cullen and H. Schulte. Simple scoring scheme for calculating the risk of acute coronary events based on the 10-year follow-up of the Prospective Cardiovascular Münster (PROCAM) study. Circulation, 105(3):310-315, 2002.
• M.W. Knuiman, H.T. Vu and H.C. Bartholomew. Multivariate risk estimation for coronary heart disease: the Busselton Health Study. Australian & New Zealand Journal of Public Health, 22:747-753, 1998.
• C.S. Wallace and K.B. Korb. Learning Linear Causal Models by MML Sampling. In A. Gammerman, editor, Causal Models and Intelligent Data Management, pages 89-111. Springer-Verlag, 1999. www.datamining.monash.edu.au/software/camml
• C.R. Twardy, A.E. Nicholson and K.B. Korb. Knowledge engineering cardiovascular Bayesian networks from the literature. Technical Report 2005/170, School of CSSE, Monash University, 2005.

