Published by Mitchel Cumberledge. Modified over 3 years ago.

1
Association Rule Mining in Type-2 Diabetes Risk Prediction
Gyorgy J. Simon
Dept. of Health Sciences Research, Mayo Clinic
SHARPn Summit 2012

2
Outline
Introduction
Modeling Diabetes Risk
– Association Rule Mining
Results
– Diabetes Disease Network Reconstruction
– Diabetes Risk Prediction
Applicability to SHARP

3
Diabetes
In the US, 25.8 million people (8% of the population) suffer from Diabetes Mellitus
– Type 2 Diabetes Mellitus (DM)
DM leads to significant medical complications
Effective preventive treatments exist
– Identifying subpopulations at risk is important
Pre-Diabetes (PreDM) is a condition that precedes DM
– Fasting glucose 100-125 mg/dL
Identify sets of risk factors that significantly increase the risk of developing diabetes in a pre-diabetic population
– Risk factors: vitals, lab test results, medications, co-morbid conditions
– Co-morbid diseases: obesity, cardiac and vascular conditions
85k Mayo patients, 1999-2004, with research consent

4
Design
[Figure: cohort flow diagram. Study period: 1/1/1999 to 12/31/2004. Follow-up: through 7/2010. Group labels and counts shown in the figure, in extraction order: Normal 84,708; DM 424; PreDM 23,828; Normal 44,156; DM 19,013; Normal 43,809; PreDM 21,826; 2,002; 347; 16,664.]

5
Data
Follow-up Time (FUT): time since PreDM Dx
Co-morbidities: before elevated glucose measurement
– hypertension, hyperlipidemia, obesity, various cardiac and vascular diseases
Age and follow-up time (FUT) are predictive of DM
– They are not modifiable, so we need to compensate for them
Goal is different from high-throughput phenotyping
– None of the patients have the disease
– Predict the risk that patients progress to DM

       Co-morbidities
PID    OB   HTN   …    Glucose   Age   FUT   DM
001    Y    Y          110       55    1.8   Y
002                    115       19    2.5   N
…      …    …     …    …         …     …     …

6
Outline
Introduction
Modeling Diabetes Risk
– Association Rule Mining
Results
– Diabetes Disease Network Reconstruction
– Diabetes Risk Prediction
Applicability to SHARP

7
Computational Model
[Figure: three-level diagram with nodes Age, Sex, Unknown Disease Mechanism, bmi, Tobacco, hdl, HTN, statin, …, glucose, DM Dx]
Level 1: Unmodifiable “nuisance” factors
Level 2: Clinical factors of interest
Level 3: Glucose (“definition” of DM)
We have to adjust for level 1 factors before we can assess the effect of level 2 factors!
Goal: Find sets of clinical factors (level 2) that are associated with elevated risk of DM

8
Modeling Approaches
1. Logistic regression / Survival Analysis
– No ability to discover interactions
2. Decision Trees / Random Forest / Gradient-boosted Trees
– Greedy approach to discover interactions
– No ability to compensate for age and follow-up time (FUT)
3. Association Rule Mining (ARM)
– Specifically designed to discover interactions
– No ability to compensate for age and FUT
Regression Analysis + Association Rule Mining
– Remove the effect of age, gender and FUT
– Find association between the risk factors and the DM risk not explained by age and FUT
Simon et al. AMIA 2011

9
Overview
1st Phase: regression modeling (survival model or logistic regression)
PID    DM   Age   FUT
001    Y    55    1.8
002    N    19    2.5
…      …    …     …
– $O$: observed number of DM incidents
– $E_1$: expected number of DM incidents based on age and sex only
– 1st phase residual: $R_1 = O - E_1$
2nd Phase: Association Rule Mining (table columns: $R_1$; co-morbidities Obese, HTN, …)
– $E_2$: expected number of DM incidents based on co-morbidities only (after adjusting for age and sex)
– 2nd phase residual: $R_2 = O - (E_1 + E_2) = R_1 - E_2$
3rd Phase (table columns: $R_2$; glucose 103, 112, …)
– $E_3$: expected number of DM incidents based on glucose (after adjusting for everything else)
Final prediction: $E = E_1 + E_2 + E_3$
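The residual chain on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `phase_residuals` and all input numbers are hypothetical, and the per-phase expectations are simply passed in rather than fitted by a regression or rule-mining step.

```python
def phase_residuals(observed, e1, e2, e3):
    """Return (R1, R2, E) per patient, following the slide's definitions."""
    r1 = [o - a for o, a in zip(observed, e1)]          # R1 = O - E1
    r2 = [r - b for r, b in zip(r1, e2)]                # R2 = R1 - E2
    final = [a + b + c for a, b, c in zip(e1, e2, e3)]  # E = E1 + E2 + E3
    return r1, r2, final


# Hypothetical values for three patients (not taken from the study).
observed = [1.0, 0.0, 1.0]   # O: observed DM incidents
e1 = [0.10, 0.05, 0.20]      # E1: expected from the 1st phase
e2 = [0.30, 0.00, 0.25]      # E2: expected from the 2nd phase
e3 = [0.20, 0.01, 0.15]      # E3: expected from the 3rd phase

r1, r2, final = phase_residuals(observed, e1, e2, e3)
```

Each phase only sees what the earlier phases left unexplained, which is why the two forms of the 2nd phase residual on the slide agree.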

10
Origins from sales data
Items (columns): co-morbid conditions
Transactions (rows): patients
Itemsets: sets of co-morbid conditions
Goal: find all itemsets (sets of conditions) that frequently co-occur in patients
– One of those conditions should be DM
Support: number of transactions the itemset I appears in
– Support({OB, HTN, IHD}) = 3
Frequent: an itemset I is frequent if support(I) > minsup

Patient   OB   HTN   IHD   …   DM
001       Y    Y     Y         Y
002       Y    Y     Y         Y
003       Y Y
004       Y
005       Y Y Y
[Column positions of the Y marks in rows 003-005 were lost in extraction.]
X: infrequent
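The support and frequency definitions above can be sketched as a brute-force search. This is a minimal illustration, not the algorithm used in the study: the patient records are a hypothetical layout chosen so that Support({OB, HTN, IHD}) = 3 as on the slide, and the names `support` and `frequent_itemsets` are made up for this example.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions (patients) that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def frequent_itemsets(transactions, minsup):
    """All itemsets with support > minsup, found level by level."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, k):
            s = support(candidate, transactions)
            if s > minsup:
                result[candidate] = s
                found = True
        if not found:
            break  # no frequent k-itemset implies no frequent (k+1)-itemset
    return result


# Hypothetical patients; each set holds the conditions recorded for one patient.
patients = [
    {"OB", "HTN", "IHD", "DM"},
    {"OB", "HTN", "IHD", "DM"},
    {"OB", "HTN"},
    {"OB"},
    {"OB", "HTN", "IHD"},
]

frequent = frequent_itemsets(patients, minsup=2)
```

Practical miners avoid enumerating every combination by only extending itemsets that are already frequent; the early `break` above relies on the same property.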

11
Distributional Association Rule Mining
Distributional Association Rules associate an itemset with a continuous outcome.

PID   A B C D …   R
01    Y Y Y Y     .40
02    Y Y Y       .38
03    Y Y Y Y     .39
04    Y Y Y       .41
05    Y Y         .00
06    Y Y         .01
07    Y           .02
08    Y           .00
[Column positions of the Y marks in partially filled rows were lost in extraction.]

Application to Diabetes: find all sets I of co-morbid conditions such that the distribution of risk R is significantly different between the patient population having I and the population without I.
Simon et al, KDD 2011a
[Figure: frequency distributions of R]
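As a rough illustration of the idea, the sketch below splits patients by whether they have an itemset and compares the mean residual risk R in the two groups. It is a simplified stand-in: a difference in means is not a significance test, the assignment of items to patients is hypothetical (only the R values come from the slide), and the function names are invented for this example.

```python
from statistics import mean


def risk_split(itemset, patients):
    """Split R values by whether the patient has every item in itemset."""
    items = set(itemset)
    with_i = [r for conds, r in patients if items <= conds]
    without_i = [r for conds, r in patients if not items <= conds]
    return with_i, without_i


def mean_risk_difference(itemset, patients):
    """Mean R among patients having the itemset minus mean R among the rest."""
    with_i, without_i = risk_split(itemset, patients)
    return mean(with_i) - mean(without_i)


# (conditions, R) pairs; item assignment is hypothetical, R values from the slide.
patients = [
    ({"A", "B", "C", "D"}, 0.40),
    ({"A", "B", "C"}, 0.38),
    ({"A", "B", "C", "D"}, 0.39),
    ({"A", "B", "D"}, 0.41),
    ({"C", "D"}, 0.00),
    ({"A", "C"}, 0.01),
    ({"D"}, 0.02),
    ({"C"}, 0.00),
]

diff = mean_risk_difference({"A", "B"}, patients)
```

Under this hypothetical assignment, the itemset {A, B} separates the high-R patients from the low-R ones, which is the kind of rule the method is meant to surface.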

12
Why Association Rule Mining?
Challenge: Interactions
– Solution: designed to discover associations
Challenge: Missing data
– Solution: asymmetry in items. Absence of an item does not mean that the risk factor was not present
Challenge: Clinical question
– Solution: directly extracts sets of risk factors. Allows for differences in modeling for prediction and for disease mechanism discovery
Challenge: Computational efficiency
– Solution: efficient algorithms exist

13
Outline
Introduction
Modeling Diabetes Risk
– Association Rule Mining
Results
– Diabetes Disease Network Reconstruction
– 4.5-yr DM Risk Prediction
Applicability to SHARP

14
Diabetes Disease Network Reconstruction
Metabolic Syndrome: DM + cardiac/vascular diseases
Use Association Rule Mining to map out the relationships between DM and other metabolic syndrome diseases
– Also measure their effect on DM progression risk
Predictors: age, sex, FUT; co-morbid disease Dx
1st Phase: survival model
2nd Phase: ARM

15
Results

Sup    Cases   P-value   RR     Itemset
7116   819     2.0e-7    1.32   HTN
4729   560     1.7e-8    1.45   OB
8612   964     2.6e-8    1.31   HL
1980   291     1.9e-9    1.78   HTN, OB
4171   534     1.5e-8    1.47   HTN, HL
553    85      8.3e-4    1.86   OB, IHD
2434   335     4.3e-9    1.68   OB, HL
382    66      7.7e-4    2.08   HTN, OB, IHD
1271   204     2.8e-8    1.93   HTN, OB, HL
470    76      7.2e-4    1.93   OB, IHD, HL
339    61      6.1e-4    2.15   HTN, OB, IHD, HL

Interpretation: patients with HTN, OB, IHD and HL have an age- and FUT-adjusted RR of 2.15 for DM.
Effect of age and FUT adjustment
– The entire PreDM population has an 8.4% chance of DM.
– Without age and FUT adjustment, the above subpopulation has 61/339 = 17.9%
– With age and FUT adjustment, $1 - (1 - 0.084)^{2.15} = 17.2\%$
37 Distributional Association Rules were discovered; 11 are significant (Poisson test; Bonferroni adjusted 5%).
Legend
OB: Obesity
HTN: Hypertension
IHD: Ischemic Heart Disease
HL: Hyperlipidemia
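The arithmetic on this slide can be reproduced directly. The sketch below uses the baseline risk of 0.084 that appears in the slide's formula and the p-values from the results table. One assumption is labeled in the code: the Bonferroni threshold is taken to be 5% divided by the 37 discovered rules, which the slide does not spell out.

```python
baseline_risk = 0.084   # PreDM population risk, as used in the slide's formula
relative_risk = 2.15    # adjusted RR for {HTN, OB, IHD, HL}

# Adjusted risk for the subpopulation: 1 - (1 - baseline)^RR
adjusted = 1 - (1 - baseline_risk) ** relative_risk

# Unadjusted risk: observed cases divided by subpopulation size
unadjusted = 61 / 339

# P-values from the results table, keyed by itemset
p_values = {
    "HTN": 2.0e-7, "OB": 1.7e-8, "HL": 2.6e-8,
    "HTN,OB": 1.9e-9, "HTN,HL": 1.5e-8, "OB,IHD": 8.3e-4,
    "OB,HL": 4.3e-9, "HTN,OB,IHD": 7.7e-4, "HTN,OB,HL": 2.8e-8,
    "OB,IHD,HL": 7.2e-4, "HTN,OB,IHD,HL": 6.1e-4,
}

# ASSUMPTION: Bonferroni correction over all 37 discovered rules.
threshold = 0.05 / 37
significant = [name for name, p in p_values.items() if p < threshold]

print(f"adjusted risk:   {adjusted:.1%}")
print(f"unadjusted risk: {unadjusted:.1%}")
print(f"significant rules: {len(significant)} of {len(p_values)} listed")
```

Under that assumed threshold, all 11 listed rules pass, which is consistent with the count of significant rules reported on the slide.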

16
Results
[Figure: diabetes disease network. Each node shows condition(s), subpopulation size, and (relative risk). Nodes recoverable from the extraction: IHD, 2366 (1.16) [p-value .11]; HTN, OB, IHD, 382 (2.08); HTN, IHD, HL, 1210 (1.36) [p-value .015].]
Legend
OB: Obesity
HTN: Hypertension
IHD: Ischemic Heart Disease
HL: Hyperlipidemia

17
Outline
Introduction
Modeling Diabetes Risk
– Association Rule Mining
Results
– Diabetes disease network reconstruction
– 4.5-yr DM risk prediction
Applicability to SHARP

18
DM Progression Risk Prediction
Predicting the probability of progression to DM within 4.5 years
Predictors: age, sex, co-morbid Dx, laboratory results and medication orders
1st Phase: spline logistic regression to adjust for age and sex
2nd Phase: ARM
3rd Phase: linear regression using glucose

19
Machine Learned Indices
Comparison to machine learning methods
– Gradient Boosted Trees (GBM): 10,000 trees
– Linear Model (LM)
– Random Forest (RF): 275-325 trees
– Association Rule Mining (ARM): 100 rules
10-fold CV repeated 50 times
Same predictive performance but more interpretable model
[Figure: C-statistic by method]
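For readers unfamiliar with the evaluation setup, the sketch below shows one way to compute a C-statistic and to generate the splits for 10-fold cross-validation repeated 50 times. It is a generic illustration, not the code behind the slide: the function names, the tie handling (ties count as 0.5), and the shuffling scheme are assumptions.

```python
import random


def c_statistic(scores, labels):
    """Fraction of (case, non-case) pairs in which the case scores higher.

    Tied scores count as half a concordant pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    concordant = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return concordant / (len(pos) * len(neg))


def repeated_kfold(n, k=10, repeats=50, seed=0):
    """Yield (train_indices, test_indices) for k-fold CV repeated several times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)               # new random partition for each repeat
        for fold in range(k):
            test = idx[fold::k]        # every k-th shuffled index forms one fold
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test


# Hypothetical predicted risks and outcomes (1 = progressed to DM).
example_c = c_statistic([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
splits = list(repeated_kfold(100))
```

In a full comparison, each model would be fitted on `train`, scored on `test`, and the resulting C-statistics collected across all 500 splits.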

20
Traditional Indices
Performance similar to San Antonio (Refit)
ARM readily provides a justification as to why the risk is high
Proposed method places the patient on a path in the diabetes network

21
Clinical Validation
Work in progress…
Apply the rules to both normoglycemic and Pre-DM patients
[Figure: each point is a rule]
Patterns similar for lower-risk subpopulations
For high-RR rules, risk of DM is higher for Pre-DM patients

22
Outline
Introduction
Modeling Diabetes Risk
– Association Rule Mining
Results
– Interpretability
– Predictive Performance
Applicability to SHARP

23
High-Throughput Phenotyping (HTP)
We can use the Association Rules as an HTP algorithm
– Discover the rules with ARM
– Validate the rules with an expert clinician

24
Acknowledgment
Peter W. Li, PhD, Health Sciences Research, Mayo Clinic, MN
Pedro J. Caraballo, MD, Internal Medicine, Mayo Clinic, MN
M. Regina Castro, MD, Division of Endocrinology and Metabolism, Mayo Clinic, MN
Terry M. Therneau, PhD, Health Sciences Research, Mayo Clinic, MN
Vipin Kumar, PhD, Department of Computer Science, University of Minnesota

25
References
Vemuri P, Simon G, Kantarci K, Whitwell J, Senjem M, Przybelski S, Gunter J, Josephs K, Knopman D, Boeve B, Ferman T, Dickson D, Parisi J, Petersen R, Jack C. Antemortem differential diagnosis of dementia pathology using structural MRI: Differential-STAND. NeuroImage, 2010.
Caraballo P, Li P, Simon G. Use of Association Rule-mining to Assess Diabetes Risk in Patients with Impaired Fasting Glucose. AMIA, 2011.
Simon G, Kumar V, Li P. A Simple statistical model and association rule filtering. In Proc. ACM International Conference on Data Mining and Knowledge Discovery (KDD), 2011.
Simon G, Li P, Jack C, Vemuri P. Understanding Atrophy Trajectories in Alzheimer’s Disease Using Association Rules on MRI images. In Proc. ACM International Conference on Data Mining and Knowledge Discovery (KDD), 2011.
