Data Quality (a.k.a. “Data Heterogeneity”) Kent Bailey, Susan Rea Welch, Lacey Hart, Kevin Bruce, Susan Fenton.

Objectives
• Assess data variability within and across institutions
• Assess impact of this variability on secondary use of EMR
• Generate specifications for widgets
  – “Warning Label” for suspect data categories
  – Data quality audits with logs
  – Batch data correction / removal

Current Research: Effects of Variation on Diabetes Phenotyping Algorithm
• Purpose: Compare data relevant to the Type 2 DM eMERGE phenotyping algorithm between Intermountain and Mayo
• Methods:
  1. Identify adult subjects with evidence in any semantic category of the algorithm:
     • ICD-9-CM codes for diabetes mellitus
     • Abnormal glucose or HbA1c
     • Antihyperglycemic medications
     • Capillary glucose (glucometer) procedures

Methods
  2. Collect relevant data on these subjects
     – ICD-9-CM codes
     – Procedure codes
     – Demographic data
     – Smoking status
     – Body mass index
     – Specialty of provider
     – Geographic info
     – Frequency of health care encounters
  3. Describe variation between institutions
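Step 1 of the methods amounts to an "any category" filter, which can be sketched as below. This is a minimal illustration and not the eMERGE algorithm itself: the `Subject` record layout, the `min_age=18` cutoff for "adult", and the treatment of any code beginning with 250 as a diabetes code are assumptions made for the example. The abnormal-lab, medication, and procedure flags are assumed to be computed upstream, since the slides do not give thresholds.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subject:
    """Hypothetical per-subject record; field names are illustrative only."""
    subject_id: str
    age: int
    icd9_codes: List[str] = field(default_factory=list)
    abnormal_glucose_or_hba1c: bool = False   # flagged upstream; threshold not given in the slides
    antihyperglycemic_med: bool = False
    capillary_glucose_procedure: bool = False


def evidence_categories(s: Subject) -> List[str]:
    """Return the semantic categories in which the subject has evidence."""
    found = []
    if any(code.startswith("250") for code in s.icd9_codes):
        found.append("icd9_dm")
    if s.abnormal_glucose_or_hba1c:
        found.append("abnormal_lab")
    if s.antihyperglycemic_med:
        found.append("medication")
    if s.capillary_glucose_procedure:
        found.append("capillary_glucose")
    return found


def select_cohort(subjects, min_age=18):
    """Keep adults with evidence in at least one category (min_age is an assumption)."""
    return [s for s in subjects if s.age >= min_age and evidence_categories(s)]


# Toy subjects, invented for illustration.
subjects = [
    Subject("A", 64, icd9_codes=["250.00"]),
    Subject("B", 45, antihyperglycemic_med=True),
    Subject("C", 52),                          # no evidence in any category
    Subject("D", 12, icd9_codes=["250.01"]),   # evidence, but not an adult
]
cohort_ids = [s.subject_id for s in select_cohort(subjects)]
print(cohort_ids)
```

Returning the list of matching categories, rather than a single yes/no flag, keeps track of which kind of evidence brought each subject into the cohort, which is useful when comparing sites.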

Analysis
• Compare (between institutions) frequencies of data elements
  – ICD9 codes: overall and specific codes
• Compare lab values: number and values
• Compare medications
• Control for:
  – Provider specialty
  – Geographic variables
  – Demographic variables
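One simple way to "control for" a variable when comparing frequencies is to stratify: compute the frequency separately within each level of the control variable at each site. The slides do not say which adjustment method was used, so the following is only a sketch under that assumption. The row layout, the site names, and the `has_250_code` flag are hypothetical.

```python
from collections import defaultdict


def stratified_rates(rows, stratum_key, flag_key):
    """Return {(site, stratum): fraction of subjects with the flag set}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        key = (row["site"], row[stratum_key])
        totals[key] += 1
        hits[key] += 1 if row[flag_key] else 0
    return {key: hits[key] / totals[key] for key in totals}


# Toy rows, invented for illustration; not study data.
rows = [
    {"site": "site_1", "specialty": "endocrinology", "has_250_code": True},
    {"site": "site_1", "specialty": "endocrinology", "has_250_code": True},
    {"site": "site_1", "specialty": "primary_care", "has_250_code": False},
    {"site": "site_1", "specialty": "primary_care", "has_250_code": True},
    {"site": "site_2", "specialty": "endocrinology", "has_250_code": True},
    {"site": "site_2", "specialty": "primary_care", "has_250_code": False},
    {"site": "site_2", "specialty": "primary_care", "has_250_code": False},
]
rates = stratified_rates(rows, "specialty", "has_250_code")
for key in sorted(rates):
    print(key, round(rates[key], 2))
```

Comparing the two sites within the same specialty stratum separates a difference in coding practice from a difference in provider mix.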

Interpretation
• Assess impact of data heterogeneity on phenotyping at different institutions
• Recommendations for
  – High-throughput phenotyping
  – High-throughput screening for clinical trials
• Generalization to other phenotypes
• Hypothesis generation

Preliminary Mayo Results
• Mayo data (ICD codes, abnormal labs, or capillary glucose; limited to Olmsted and surrounding counties)
  – 13,754 subjects
    • 89% Caucasian
    • 2.5% African-American
    • 2.0% Asian
    • 6.5% Native American, Pacific Islander, other, unknown, or refused
  – Mean current age 64, range 20 to 104
  – Sex: 53% male, 47% female

Preliminary Mayo Results (N=13,754)
• Smoking (n=11,626)
  – Current 66%, past 16%, never 13%, unknown 6%
• BMI (limited to < 60) (n=6,338)
  – Mean 32.6 +/- 7.2
  – Median 31.6, quartiles (27.5, 36.6)

Preliminary Results: ICD9 codes
• Complications
  – None: 6,743 (250.0)
  – Ketoacidosis: 1 (250.1)
  – Hyperosmolality: 2 (250.2)
  – Renal: 398 (250.4)
  – Ophthalmic: 1,385 (250.5)
  – Neuro: 586 (250.6)
  – Peripheral circulatory: 25 (250.7)
  – “Other specified”: 312 (250.8)
  – Unspecified: 336 (250.9)

Preliminary Results: ICD9 codes
• 250.X0: Type 2 or unspecified type; controlled, or not specified as uncontrolled
• 250.X1: Type 1; controlled, or not specified as uncontrolled
• 250.X2: Type 2 or unspecified type; uncontrolled
• 250.X3: Type 1; uncontrolled
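The fourth and fifth digits of a 250.xy code can be decoded mechanically. The sketch below uses the complication labels from the complications slide and the fifth-digit meanings listed above; the function name `decode_250` and the returned dictionary layout are illustrative choices, not part of the study's tooling. Fourth digit 3 does not appear on the complications slide, so it is reported as not listed rather than guessed.

```python
# Fourth-digit labels as listed on the complications slide (250.3 is not listed there).
COMPLICATION = {
    "0": "None",
    "1": "Ketoacidosis",
    "2": "Hyperosmolality",
    "4": "Renal",
    "5": "Ophthalmic",
    "6": "Neuro",
    "7": "Peripheral circulatory",
    "8": "Other specified",
    "9": "Unspecified",
}

# Fifth digit: (type group, specified as uncontrolled?)
FIFTH_DIGIT = {
    "0": ("Type 2/U", False),
    "1": ("Type 1", False),
    "2": ("Type 2/U", True),
    "3": ("Type 1", True),
}


def decode_250(code):
    """Decode a five-digit ICD-9-CM 250.xy code; return None if it does not fit the pattern."""
    digits = code.replace(".", "")
    if len(digits) != 5 or not digits.startswith("250"):
        return None
    fourth, fifth = digits[3], digits[4]
    if fifth not in FIFTH_DIGIT:
        return None
    dm_type, uncontrolled = FIFTH_DIGIT[fifth]
    return {
        "complication": COMPLICATION.get(fourth, "not listed on slide"),
        "type": dm_type,
        "uncontrolled": uncontrolled,
    }


decoded = decode_250("250.52")
print(decoded)
```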

Type 2/U vs. Type 1 DM codes
Mayo data: n=13,707

                      Type 2/U DM codes
Type 1 DM codes       0               1+
0                     6,339 (46%)     6,631 (48%)
1+                    483 (4%)        254 (2%)
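A cross-tabulation of this kind can be built by bucketing each subject's count of Type 1 and Type 2/U codes into "0" or "1+". The sketch below is an illustration only: the input layout (a mapping from subject ID to a list of 250.xy codes) is hypothetical, and the reading of the slide's cells as rows for Type 1 and columns for Type 2/U is an assumption. The last lines recompute the slide's percentages from its reported counts.

```python
from collections import Counter


def bucket(n):
    """Collapse a count into the slide's two levels: '0' or '1+'."""
    return "1+" if n > 0 else "0"


def type_crosstab(codes_by_subject):
    """Count subjects by (Type 1 bucket, Type 2/U bucket) from their 250.xy codes."""
    table = Counter()
    for codes in codes_by_subject.values():
        fifth_digits = [c.replace(".", "")[4] for c in codes
                        if c.startswith("250") and len(c.replace(".", "")) == 5]
        n_type1 = sum(d in "13" for d in fifth_digits)    # fifth digit 1 or 3
        n_type2u = sum(d in "02" for d in fifth_digits)   # fifth digit 0 or 2
        table[(bucket(n_type1), bucket(n_type2u))] += 1
    return table


def cell_percentages(table):
    """Express each cell as a whole-number percentage of all subjects."""
    total = sum(table.values())
    return {cell: round(100 * n / total) for cell, n in table.items()}


# Toy input, invented for illustration.
toy = {"A": ["250.00", "250.02"], "B": ["250.01"], "C": [], "D": ["250.03", "250.00"]}
toy_table = type_crosstab(toy)
print(toy_table)

# Cell counts as reported on the slide; keys are (Type 1 bucket, Type 2/U bucket).
mayo = {("0", "0"): 6339, ("0", "1+"): 6631, ("1+", "0"): 483, ("1+", "1+"): 254}
mayo_pct = cell_percentages(mayo)
print(sum(mayo.values()), mayo_pct)
```

The four reported cells sum to 13,707, matching the stated n, and the rounded percentages agree with those on the slide.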

Intermountain peek (sic)

                      Type 2/U ICD9 codes
Type 1 ICD9 codes     0             1+
0                     --            65,983
1+                    2,083         6,629

• Disclaimer: don’t assume data are ready to compare between sites at this point

Back to Mayo
Summary Sample Lab Data

Test name            N          Min      1%       Med.     99%      Max
Glucose (P)          40,786     1        67       127      394      1300
Glucose POCT         211,746    25       63       141      392      600
Hemoglobin A1c, B    35,206     4.0 %    5.1 %    6.9 %    12.1 %   16.7 %
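A summary row of this shape (N, minimum, 1st percentile, median, 99th percentile, maximum) can be computed with the Python standard library. This is a sketch only: the slides do not say which percentile definition was used, so the "inclusive" interpolation method below is an assumption, and `lab_summary` is a hypothetical helper name.

```python
import statistics


def lab_summary(values):
    """Return N, min, 1st percentile, median, 99th percentile, and max."""
    data = sorted(values)
    # 99 cut points; index 0 is the 1st percentile, index 98 the 99th.
    cuts = statistics.quantiles(data, n=100, method="inclusive")
    return {
        "N": len(data),
        "Min": data[0],
        "1%": cuts[0],
        "Med.": statistics.median(data),
        "99%": cuts[98],
        "Max": data[-1],
    }


# Toy values 1..101, chosen so the percentiles are easy to check by hand.
summary = lab_summary(range(1, 102))
print(summary)
```

Reporting the 1st and 99th percentiles next to the minimum and maximum makes isolated extreme values stand out, since a data-entry error moves the extremes far more than the percentiles.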

Future Directions
• Carry out inter-institution comparison
• Study effects of geography, race, etc.
• Implement chart review (on random sample) for “gold standard” definition of Type 2 DM
• Use of lab values / meds for definition of continuous phenotype (DM-ness)
• Extrapolation / generalization to other diseases / phenotypes

Data Quality (a.k.a. “Data Heterogeneity”) Susan Rea Welch

Conclusions: PhD Research Cohort Amplification
– Knowledge Discovery from Databases (KDD)
– Associative Classification Methods
– Classification Rules for Diabetes and Asthma
  • Comparably accurate
  • Concise
  • Consistent with domain knowledge
– Contributed new knowledge
  • Attributes for cohort identification
  • Unanticipated comorbidity associations

Consistency and Novelty: Diabetes
• Elevated quantitative lab glucose assays
  – Frequency 19%, Likelihood 87%
  – Less predictive than glucose by glucometer or urine microalbumin
• Abnormal HbA1c test
  – Equivalent predictive power of HbA1c test order
• Antihyperglycemic medications
  – Variable predictive strength: metformin, insulin, insulin release stimulators, insulin response enhancers

Consistency and Novelty: Asthma
• Medications were most predictive
  – High likelihood: salmeterol, leukotriene receptor antagonist
  – Albuterol / Glucocorticoid combine:
• Pulmonary procedures (CPT hierarchy)
• Female gender
• Abnormal CBC
• Unexpected comorbidity associations
  – Suggests discovery of shared pathways

Associative Classification – What?
• Pattern discovery in transaction database
• Independent of domain expertise
• Deductive, global associations in data
• Induce a general & accurate classifier

Associative Classification – Why?
• No domain expertise attribute selection
• Not affected by missing data
• Proven accuracy
• Understandable rules
• Independent rules
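The core idea of associative classification, finding "attribute -> class" rules that are both frequent and reliable, can be illustrated with a deliberately simplified sketch. It considers only single-attribute rules and is not the method used in the research described here. The thresholds, attribute names, and record layout are hypothetical. The "Frequency" and "Likelihood" figures quoted on the diabetes slide may correspond to quantities like the support and confidence computed below, but that mapping is an assumption.

```python
from collections import Counter


def mine_class_rules(records, min_support=0.2, min_confidence=0.8):
    """Find rules 'attribute -> label' meeting support and confidence thresholds.

    records: list of (set_of_attributes, label) pairs.
    support    = share of all records with both the attribute and the label
    confidence = share of records with the attribute that carry the label
    """
    n = len(records)
    attr_count = Counter()
    pair_count = Counter()
    for attrs, label in records:
        for a in attrs:
            attr_count[a] += 1
            pair_count[(a, label)] += 1
    rules = []
    for (a, label), both in pair_count.items():
        support = both / n
        confidence = both / attr_count[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, label, support, confidence))
    # Strongest rules first: by confidence, then by support.
    return sorted(rules, key=lambda r: (-r[3], -r[2]))


# Toy transactions, invented for illustration; attribute names are hypothetical.
records = [
    ({"metformin", "abnormal_hba1c"}, "diabetes"),
    ({"metformin"}, "diabetes"),
    ({"abnormal_hba1c", "abnormal_cbc"}, "diabetes"),
    ({"abnormal_cbc"}, "control"),
    ({"abnormal_cbc", "abnormal_hba1c"}, "control"),
]
rules = mine_class_rules(records, min_support=0.2, min_confidence=0.6)
for attr, label, support, confidence in rules:
    print(f"{attr} -> {label}: support={support:.2f}, confidence={confidence:.2f}")
```

Because each record is a set of the attributes that are present, an unrecorded attribute simply does not appear in the set, and no imputation step is needed before counting.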

Core Candidate Attributes
• Diagnosis codes
• Provider specialty
• Lab observations
• Procedure codes
• ‘Abnormal’ lab obs.
• Imaging procedures
• Medication list
• Age groups
• Female gender

SHARPn Y2 Research Aims
• Associations reliable across EHRs?
• Improve algorithms’ sensitivity / specificity?
  – AC attribute selection + other classifiers
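Sensitivity and specificity of a phenotyping algorithm are measured against a reference label, such as a chart-review "gold standard". A minimal sketch of the calculation follows; the function name and the toy labels are illustrative and are not study results.

```python
def sensitivity_specificity(predicted, actual):
    """Compute (sensitivity, specificity) from parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)                # algorithm positive, truly positive
    fn = sum((not p) and a for p, a in pairs)          # algorithm negative, truly positive
    tn = sum((not p) and (not a) for p, a in pairs)    # algorithm negative, truly negative
    fp = sum(p and (not a) for p, a in pairs)          # algorithm positive, truly negative
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels, invented for illustration: algorithm output vs. reference standard.
predicted = [True, True, True, False, False, False, False, False]
actual = [True, True, False, True, False, False, False, False]
sens, spec = sensitivity_specificity(predicted, actual)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```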
