Presentation is loading. Please wait.

Presentation is loading. Please wait.

Square wheels: electronic medical records for discovery research in rheumatoid arthritis Robert M. Plenge, M.D., Ph.D. October 30, 2009 NCRR sponsored.

Similar presentations


Presentation on theme: "Square wheels: electronic medical records for discovery research in rheumatoid arthritis Robert M. Plenge, M.D., Ph.D. October 30, 2009 NCRR sponsored."— Presentation transcript:

1 Square wheels: electronic medical records for discovery research in rheumatoid arthritis Robert M. Plenge, M.D., Ph.D. October 30, 2009 NCRR sponsored "Using EHR Data for Discovery Research" HARVARD MEDICAL SCHOOL

2 Key questions What are the regulatory obstacles impacting your work? What are the resource needs required to replicate your work at other institutions? What are the priority short term "translational" questions in your field that would represent the most rapid payoff on investment?

3 Key questions How can I implement your approach, and how much better is it?

4

5 genotype phenotype clinical care

6 genotype phenotype clinical care bottleneck

7 Raychaudhuri et al in press Nature Genetics October 2009: >30 RA risk loci 20031978198720052004 PTPN22 2008 “shared epitope” hypothesis HLA DR4 2007 PADI4CTLA4 TNFAIP3 STAT4 TRAF1- C5 IL2-IL21 CD40 CCL21 CD244 IL2RB TNFRSF14 PRKCQ PIP4K2C IL2RA AFF3 Latest GWAS in 25,000 case-control samples with replication in 20,000 additional samples: >10 new loci 2009 REL BLK TAGAP CD28 TRAF6 PTPRC FCGR2A PRDM1 CD2-CD58 Together explain ~35% of the genetic burden of disease

8 genotype phenotype clinical care bottleneck

9 Genetic predictors of response to anti-TNF therapy in RA PTPRC/CD45 allele n=1,283 patients P=0.0001 Submitted to Arth & Rheum

10 How can we collect DNA and detailed clinical data on >20,000 RA patients?

11 What are the options for collecting clinical data and DNA for genetic studies?

12 Options for clinical + DNA designClinical data DNASample size cost clinical trial +++ +$$$ registry +++++++$$ claims data +n/a+++$ EMR +++++ $

13 Narrative data = free-form written text –info about symptoms, medical history, medications, exam, impression/plan Codified data = structured format –age, demographics, and billing codes Content of EMRs EMRs are increasingly utilized!

14 Gabriel (1994) Arthritis and Rheumatism This is not a new idea… Sens: 89% PPV: 57% Sens: 89% PPV: 57%

15 Gabriel (1994) Arthritis and Rheumatism Conclusion: The sole reliance on such databases for the diagnosis of RA can result in substantial misdiagnosis. …but EMR data are “dirty”

16 Partners HealthCare: 4 million patients

17 Partners HealthCare: linked by EMR

18 Partners HealthCare: organized by i2b2

19 4 million patients 31,171 patients ICD9 RA and/or CCP checked (goal = high sensitivity) 3,585 RA patients Classification algorithm (goal = high PPV) Clinical subsets Discarded blood for DNA

20 Natural language processing (NLP) –disease terms (e.g., RA, lupus) –medications (e.g., methotrexate) –autoantibodies (e.g., CCP, RF) –radiographic erosions Codified data –ICD9 disease codes –prescription medications –laboratory autoantibodies Our library of RA phenotypes Qing Zeng Concept/termAccuracy of concept presence of erosion88% seropositive96% CCP positive98.7% RF positive99.3% etanercept100% methotrexate100%

21 Natural language processing (NLP) –disease terms (e.g., RA, lupus) –medications (e.g., methotrexate) –autoantibodies (e.g., CCP, RF) –radiographic erosions Codified data –ICD9 disease codes –prescription medications –laboratory autoantibodies Our library of RA phenotypes Shawn Murphy

22 ‘Optimal’ algorithm to classify RA: NLP + codified data Regression model with a penalty parameter (to avoid over-fitting) Codified dataNLP data Tianxi Cai, Kat Liao

23 High PPV with adequate sensitivity ✪ 392 out of 400 (98%) had definite or possible RA!

24 This means more patients! ~25% more subjects with the complete algorithm: 3,585 subjects (3,334 with true RA) 3,046 subjects (2,680 with true RA)

25 4 million patients 31,171 patients ICD9 RA and/or CCP checked (goal = high sensitivity) 3,585 RA patients Classification algorithm (goal = high PPV) Discarded blood for DNA

26 Linking the Datamart-Crimson NLP data Codified data

27 Over 3,000 samples collected to date –cost = $10 per sample DNA extracted on >2,400 Buffy coats –cost = $20 per sample –>90% had ≥1 ug of DNA –>99% had ≥5 ug of DNA after WGA Status of i2b2 Crimson collection genotyping of 384 SNPs (RA risk alleles, AIMs, other) is ongoing at Broad Institute

28 Measured autoantibodies from plasma –5 autoantibodies in ~380 RA patients –~85% are CCP+, ~35% ANA+, ~15% TPO+ Question: are non-RA autoantibodies present at increased frequency in RA patients vs matched controls? stay tuned…more data soon! Status of i2b2 Crimson collection

29 Key questions How can I implement your approach, and how much better is it?

30 Key questions What are the regulatory obstacles impacting your work? What are the resource needs required to replicate your work at other institutions? What are the priority short term "translational" questions in your field that would represent the most rapid payoff on investment?

31 Key questions What are the regulatory obstacles impacting your work? What are the resource needs required to replicate your work at other institutions? What are the priority short term "translational" questions in your fields that would represent the most rapid payoff on investment?

32 Regulatory obstacles IRB approval De-identified vs truly anonymous Open question: sharing of genetic data

33 Key questions What are the regulatory obstacles impacting your work? What are the resource needs required to replicate your work at other institutions? What are the priority short term "translational" questions in your fields that would represent the most rapid payoff on investment?

34 Resources required Building a research DataMart –clinical EMR ≠ research EMR –multiple FTE’s to build/maintain NLP expertise –open-source software available –iterative process for fine-tuning Clinical expertise –understand nature of clinical data

35 Resources required (cont.) Statistical expertise –simple algorithm is not sufficient –prepare for the unexpected! –true for narrative and codified Biospecimen collection, DNA extraction –varies by institution –Crimson –Broad Institute

36 Key questions What are the regulatory obstacles impacting your work? What are the resource needs required to replicate your work at other institutions? What are the priority short term "translational" questions in your field that would represent the most rapid payoff on investment?

37 4 million patients 31,171 patients ICD9 RA and/or CCP checked (goal = high sensitivity) 3,585 RA patients Classification algorithm (goal = high PPV) Clinical subsets Discarded blood for DNA

38 Characteristicsi2b2 RACORRONA total number 3,5857,971 Mean age (SD) 57.5 (17.5)58.9 (13.4) Female (%) 79.974.5 Anti-CCP(%) 63N/A RF (%) 74.472.1 Erosions (%) 59.259.7 MTX (%) 59.552.8 Anti-TNF (%) 32.622.6 Clinical features of patients CCP has an OR = 1.5 for predicting erosions

39 Subset patients in clinically meaningful ways: causes of mortality NLP+codified data, together with statistical modeling, to define cardiovascular disease

40 Non-responder to anti-TNF therapy NLP+codified data, together with statistical modeling, to define treatment response

41 Responder to anti-TNF therapy NLP+codified data, together with statistical modeling, to define treatment response

42 Post-marketing surveillance of adverse events NLP+codified data, together with statistical modeling, to define treatment response pharmacovigilance

43 Conclusions

44 Options for clinical + DNA designClinical data DNASample size cost clinical trial +++ +$$$ registry +++++++$$ claims data +n/a+++$ EMR +++++ $ Conclusion: NLP + codified data, together with appropriate statistical modeling, can yield accurate clinical data.

45 Options for clinical + DNA designClinical data DNASample size cost clinical trial +++ +$$$ registry +++++++$$ claims data +n/a+++$ EMR +++++ $ Conclusion: We can collect DNA and plasma in a high-throughput manner.

46 Options for clinical + DNA designClinical data DNASample size cost clinical trial +++ +$$$ registry +++++++$$ claims data +n/a+++$ EMR +++++ $ Conclusion: The cost is reasonable...even for >20,000 RA patients!

47 genotype phenotype clinical care

48 Acknowledgments Zak Kohane Susanne Churchill Vivian Gainer Kat Liao Tianxi Cai Shawn Murphy Qing Zing Soumya Raychaudhuri Beth Karlson Pete Szolovits Lee-Jen Wei Lynn Bry (Crimson) Sergey Goryachev Barbara Mawn & many others ! Namaste!

49

50 Narrative data (NLP text extractions) Codified data (ICD9 codes, etc)

51 Run specific queries

52 Visualize results in a timeline

53 Identifying RA patients in our i2b2 RA DataMart 19932008 Signs and symptoms Diseases that mimick RA Medications specific to RA Notes (including whether seen by a rheumatologist) diagnostic codes for RA Shawn Murphy, Vivian Gainer, others

54 signs and symptoms c/w RA RA without other diseases Specific RA meds, including MTX Seen by rheumatology Many diagnostic codes for RA 19932008 Identifying RA patients in our i2b2 RA DataMart

55 Probability of RA: all 31K subjects Probability of RA Frequency not RARA (n=3,585)

56 ROC curves for algorithms sensitivity 1 - specificity 97% specificity codified + NLP NLP only codified only

57 Other algorithms to classify RA NLP Only Codified only Portability!

58 Classification of RA cases (and not RA) 1.00 0.80 0.60 0.40 0.20 0.00 Probability RA Not RA possibleYes RA threshold 0.29 ???

59 Diagnosis = Ankylosing Spondylitis (but many RA codes) A few signs and symptoms c/w RA NLP with few mentions of RA Specific meds Visits to BWH/MGH diagnostic codes for RA Probability RA = 0.78

60 Diagnosis = JRA (but many RA codes) signs and symptoms c/w RA NLP with “RA” and “JRA” Specific meds Visits to the RA Center at BWH Many diagnostic codes for RA

61 Probability RA = 0.33 Diagnosis not clear initially… signs and symptoms c/w RA NLP without much “RA”, few specific meds (MTX x 1) …and few diagnostic codes for RA, despite multiple LMR notes, including visits to the BWH Arthritis Center

62 Now the false negatives…

63 Diagnosed in 1992, little follow-up For some reason few RA diagnostic codes Probability RA = 0.11

64 Enbrel (etanercept) codified: 1,628 NLP: 3,796 overlap: 1,612 (99%) Note: review of 50 NLP occurrences shows that 38 out of 50 actively on Enbrel Medications: codified data vs. NLP


Download ppt "Square wheels: electronic medical records for discovery research in rheumatoid arthritis Robert M. Plenge, M.D., Ph.D. October 30, 2009 NCRR sponsored."

Similar presentations


Ads by Google