Presentation is loading. Please wait.

Presentation is loading. Please wait.

High-Throughput Machine Learning from EHR Data

Similar presentations


Presentation on theme: "High-Throughput Machine Learning from EHR Data"— Presentation transcript:

1 High-Throughput Machine Learning from EHR Data
David Page Department of Biostatistics & Medical Informatics, and Center for Predictive Computational Phenotyping (CPCP) University of Wisconsin-Madison

2 Acknowledgements NIH BD2K Center for Predictive Computational Phenotyping Ross Kleiman Paul Bennett Michael Caldwell Scott Hebbring Miron Livny Peggy Peissig Vitor Santos Costa Humberto Vidaillet Wisconsin Genomics Initiative The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

3 The Electronic Health Record (EHR)
Demographics ID Year of Birth Gender P1 M Diagnoses ID Date Diagnosis Sign/Symptom P1 (PVC) Palpitations Mlers excel at analyzing large data sets and alternative data sources. EHRs fit those descriptions

4 The Electronic Health Record (EHR)
Demographics ID Year of Birth Gender P1 M Diagnoses ID Date Diagnosis Symptoms P1 Atrial fibrillation Dizzy, discomfort ID Date Diagnosis Sign/Symptom P1 Elevated BP Mlers excel at analyzing large data sets and alternative data sources. EHRs fit those descriptions

5 The Electronic Health Record (EHR)
Demographics ID Year of Birth Gender P1 M Diagnoses ID Date Diagnosis Symptoms P1 Atrial fibrillation Dizzy, discomfort ID Date Diagnosis Symptoms P1 Atrial fibrillation Dizzy, discomfort ID Date Diagnosis Sign/Symptom P1 Atrial Fibrillation Shortness of Breath Mlers excel at analyzing large data sets and alternative data sources. EHRs fit those descriptions

6 Precision Medicine (Personalized Medicine)
Individual Patient C + G + E Precision Medicine (Personalized Medicine) Genetic, Clinical, & Environmental Data Predictive Model for Disease Susceptibility & Treatment Response State-of-the-Art Machine Learning Data picture is from: Machine picture is from Microsoft Clip Art j jpg j jpg : Pill picture is from Microsoft Clip Art j jpg : Black stick figure picture is from: Personalized medicine picture is from: Personalized Treatment Wisconsin Genomics Initiative (WGI)

7 Marshfield Clinic EMR Marshfield Clinic
Health system in North Central Wisconsin 1.5M Patient Records spanning 40 years Demographics Diagnoses (ICD-9) Labs Procedures Vitals

8 Electronic Health Record (EHR)
PatientID Date Physician Symptoms Diagnosis P /1/ Smith palpitations hypoglycemic P /1/ Jones fever, aches influenza PatientID Gender Birthdate P M /22/63 PatientID Date Lab Test Result PatientID SNP1 SNP2 … SNP500K P AA AB BB P AB BB AA P /1/01 blood glucose P /9/01 blood glucose This type of database is where the world is headed, with electronic medical records and less-expensive genotyping. Google and Microsoft may be there in a few years, with personally-controlled health records (PCHRs). K.D. Mandl and I. S. Kohane. Tectonic Shifts in the Health Information Economy. New England Journal of Medicine, 358(16): , April 17, 2008. To our knowledge, this type of database is not available now in U.S. Only place we know is Iceland. May become available in near future in U.K., perhaps some other countries with national health service. Much more limited version in future when Framingham adds genetic data, though clinical data is limited and collected only at specified intervals. But we can have this in less than a year in U.S., in Wisconsin, if WGI is funded. All but the red is available right now. When such a database becomes available, will we know what to do with it? WGI proposes what to do with it now (next slide). PatientID Date Prescribed Date Filled Physician Medication Dose Duration P /17/ /18/ Jones prilosec mg 3 months

9 Vision Build predictive models for every diagnosis, every procedure, response to every drug, at press of a button. Translate the most accurate models into the clinic, whether as decision support algorithms or lessons for clinicians, FDA, etc.

10 Data Cleaning Originally 1.5M patients Remove Infrequent Patients
4 diagnoses and 2 encounters 1.1M patients remained (~73%)

11 Case Control Matching 30 days DX Present day Birth Chart data +++++
Death Birth

12 Model Construction and Evaluation
Model nearly every ICD-9 code At least 500 pairs Exclude symptoms Build random forest model Evaluate models via AUC-ROC

13 Predictive Accuracy of Models

14 High-Throughput ML (Kleiman, Bennett, et al.)
Predicting Every ICD Diagnosis Code at the Press of a Button

15 Simulated Prospective Study
How well would these models perform in practice? Evaluate model accuracy on 10,000 test patients 2012 2013 2014 Training Data Study Year Activity Window

16 Simulated Prospective Study Results

17 HTCondor Essential to this Work and Future Work
Over 1M patients Over 4000 different diagnoses (models) 750 trees per model Producing slide 14 took 30K jobs and roughly 123 years of compute time In future, predict all drugs, procedures, and responses In future, predict on 100M or 1B patients In future, add genomics (3B bp per patient) In future, add tumor genomes (1000 genomes per tumor) High-throughput ML applicable to many other domains High-throughput computing applicable to many other tasks in NIH Big Data to Knowledge Program


Download ppt "High-Throughput Machine Learning from EHR Data"

Similar presentations


Ads by Google