1
Patient Matching Algorithm Challenge
Informational Webinar. Caitlin Ryan, PMP | IRIS Health Solutions LLC, Contract Support to ONC. Adam Culbertson, M.S., M.S. | HIMSS Innovator in Residence, ONC
2
Agenda
ONC Overview
Background on Matching
About the Challenge
Eligibility Requirements
Registration
Project Submissions
Winners and Prizes
Calculating Metrics
Creating Test Data
Challenge Q&A
3
Office of the National Coordinator for Health IT (ONC)
The Office of the National Coordinator for Health Information Technology (ONC) is at the forefront of the administration's health IT efforts and is a resource to the entire health system, supporting the adoption of health information technology and the promotion of nationwide health information exchange to improve health care. ONC is organizationally located within the Office of the Secretary of the U.S. Department of Health and Human Services (HHS). ONC is the principal federal entity charged with coordinating nationwide efforts to implement and use the most advanced health information technology and the electronic exchange of health information.
4
ONC Challenges Overview
Challenges are hosted under the statutory authority of the America COMPETES Reauthorization Act of 2010 (Public Law No. 111-358).
ONC Tech Lab - Innovation: Spotlight areas of high interest to ONC and HHS; direct attention to new market opportunities; continue work with the start-up community and administer challenge contests; increase awareness and uptake of new standards and data.
5
ONC Roadmap: Connecting Health and Care for the Nation: A Shared Nationwide Interoperability Roadmap. Released in 2015, it sets out a 10-year vision to achieve an interoperable health IT infrastructure. Section L, Accurate Individual Data Matching, states that patient matching is a fundamental requirement for achieving interoperability.
6
Patient Matching Definition
Patient matching: comparing data from multiple sources to identify records that represent the same patient. It is also called merge-purge, record linkage, and entity resolution in other fields. Related terms include Entity Resolution, Data Matching, Record Linkage, Object Identification, Merge-Purge, Linked Data, Data Deduplication, Identity Resolution, and Schema Matching.
7
Source: Culbertson, A. Patient Matching A-Z, Wednesday, March 2nd, HIMSS 2016, Las Vegas, NV
8
Significant Dates in (Patient) Matching
A timeline of significant dates in (patient) matching:
1918: Soundex US patent
1946: Dunn, Record Linkage
1959: Newcombe, Kennedy, & Axford, Automatic Linkage of Vital Records
1969: Fellegi & Sunter, A Theory of Record Linkage
2002: Grannis et al., Analysis of Identifier Performance Using a Deterministic Linkage Algorithm
2008: Campbell, K. et al., A Comparison of Link Plus, The Link King, and a "Basic" Deterministic Algorithm
2008: RAND Health Report, Identity Crisis: An Examination of the Costs and Benefits of a Unique Patient Identifier for the US Health Care System
2009: Grannis et al., Privacy and Security Solutions for Interoperable Health Information Exchange
2011: Winkler, Matching and Record Linkage
2014: Audacious Inquiry and ONC, Patient Identification and Matching Final Report
2014: Joffe et al., A Benchmark Comparison of Deterministic and Probabilistic Methods for Defining Manual Review Datasets in Duplicate Records Reconciliation
2015: Kho, Abel N., et al., Design and Implementation of a Privacy Preserving Electronic Health Record Linkage Tool
Additional milestones: A Framework for Cross-Organizational Patient Identity Management; HIMSS Patient Identity Integrity Toolkit, Patient Key Performance Indicators; HIMSS Patient Identity Integrity Exemplars; Dusetzina, Stacie B., et al., Linking Data for Health Services Research: A Framework and Instructional Guide; HIMSS hires an Innovator in Residence (IIR) focused on patient matching.
After 9/11 the intelligence industry realized that it had a problem matching non-Anglo names; the problem is faced in many verticals such as intelligence, finance, and marketing.
Notes:
Campbell, Kevin M., Dennis Deck, and Antoinette Krupski. "Record linkage software in the public domain: a comparison of Link Plus, The Link King, and a 'basic' deterministic algorithm." Health Informatics Journal 14.1 (2008): 5-15.
Joffe E, Byrne MJ, Reeder P, et al. "A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation." Journal of the American Medical Informatics Association (JAMIA). 2014;21(1).
Kho, Abel N., et al. "Design and implementation of a privacy preserving electronic health record linkage tool in Chicago." Journal of the American Medical Informatics Association (2015): ocv038.
On the IIR role: "work of the IIR will lead to establishing metrics of patient matching technology approaches and create a pathway for evaluating solutions" and "IIR to develop a vision, strategy, and implementation plan for the near-term deployment of consistent patient data matching in health that builds on the body of work from HHS's Office of the National Coordinator for Health IT (ONC) and healthcare community partners. The IIR will also assess the longer-term applicability of identity management methods, processes and technologies currently in use in healthcare and other sectors."
Source: Culbertson, A. & Miller, K., Patient Matching EHR Ailments: Going from Placebo to Cure, Tuesday, March 1st, HIMSS 2016, Las Vegas, NV
9
The 5 Step Data Match Process
The five steps are: data pre-processing, indexing, comparison, classification, and evaluation.
Pre-processing: characterizes the data and ensures the elements have the same structure and the content follows the same format. It typically involves 3 to 4 steps: remove unwanted characters and words; expand abbreviations and correct misspellings; segment attributes into well-defined and consistent output attributes; verify the correctness of attribute values. Importantly, this step must not overwrite the original input data; new attributes can and should be created that contain the cleaned and standardized data.
Indexing: organizing the data to support better pairing (blocking and the use of a blocking key are common).
Comparison: identifying the similarity between two records, producing comparison vectors.
Classification: based on the comparison results, record pairs are classified as matches, non-matches, or potential matches.
Evaluation: comparing match results with the known ground truth or gold standard.
Source: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Peter Christen, 2012
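To make the five steps concrete, here is a minimal, illustrative Python sketch of pre-processing, blocking, comparison, and classification on a few toy records. The field names, blocking key, similarity measure, and thresholds are assumptions chosen for illustration; they are not taken from the challenge specification or from Christen's book.

```python
# Illustrative sketch of the data match process on toy records.
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "first": "John",  "last": "Smith", "zip": "20001"},
    {"id": 2, "first": "Jon",   "last": "Smith", "zip": "20001"},
    {"id": 3, "first": "Carol", "last": "Jones", "zip": "20814"},
]

def preprocess(rec):
    """Step 1: standardize case/whitespace without overwriting the original fields."""
    rec = dict(rec)
    rec["first_clean"] = rec["first"].strip().upper()
    rec["last_clean"] = rec["last"].strip().upper()
    return rec

def blocking_key(rec):
    """Step 2: index records so only plausible pairs are compared
    (here: first letter of the last name plus the ZIP code)."""
    return rec["last_clean"][:1] + rec["zip"]

def compare(a, b):
    """Step 3: build a comparison vector of field similarities in [0, 1]."""
    sim = lambda x, y: SequenceMatcher(None, x, y).ratio()
    return [sim(a["first_clean"], b["first_clean"]),
            sim(a["last_clean"], b["last_clean"])]

def classify(vector, threshold=0.85):
    """Step 4: classify a pair as match / potential match / non-match."""
    score = sum(vector) / len(vector)
    if score >= threshold:
        return "match"
    return "potential match" if score >= 0.6 else "non-match"

# Run the pipeline: block first, then compare only within blocks.
cleaned = [preprocess(r) for r in records]
blocks = {}
for rec in cleaned:
    blocks.setdefault(blocking_key(rec), []).append(rec)

for block in blocks.values():
    for a, b in combinations(block, 2):
        print(a["id"], b["id"], classify(compare(a, b)))
```

In this toy run, records 1 and 2 share a block and classify as a match, while Carol Jones is never compared against them, which is the point of the indexing step.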
10
Problem
Patient data matching has been noted as one of the key barriers to achieving interoperability in the Nation's Roadmap for Health IT.
Patient matching causes issues for over 50% of health information managers.1
The problem will increase as the volume of health data sharing increases.
Data quality issues make matching more complicated.
There is a lack of knowledge about the performance of patient matching algorithms and a lack of adopted metrics.
11
Data Quality
Data quality is key: garbage in, garbage out.
Data entry errors compound data matching complexity. Various algorithmic solutions exist to address these errors, but none is perfect.
Types of errors: missing or incomplete values; inaccurate data; fat-finger errors; out-of-date information; transposed names; misspelled names.
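As an illustration of how an algorithm might tolerate two of the error types above (misspelled and transposed names), the short sketch below compares names both in the given order and swapped and keeps the better score. The similarity measure, names, and scoring are assumptions for illustration only.

```python
# Tolerating misspelled and transposed names with a simple string similarity.
from difflib import SequenceMatcher

def name_similarity(a_first, a_last, b_first, b_last):
    """Return the better of the straight and field-swapped comparisons."""
    sim = lambda x, y: SequenceMatcher(None, x.upper(), y.upper()).ratio()
    straight = (sim(a_first, b_first) + sim(a_last, b_last)) / 2
    swapped = (sim(a_first, b_last) + sim(a_last, b_first)) / 2  # transposed names
    return max(straight, swapped)

# Misspelling: "Smith" vs "Smyth" still compares closely.
print(name_similarity("John", "Smith", "Jon", "Smyth"))
# Transposition: first and last name entered in the wrong fields still scores 1.0.
print(name_similarity("Smith", "John", "John", "Smith"))
```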
12
"If you can't measure it, you can't improve it.“
Solution "If you can't measure it, you can't improve it.“ -Peter Drucker
13
ONC’s Patient Matching Algorithm Challenge
The goal of this challenge is to:
Bring about greater transparency and data on the performance of existing patient matching algorithms,
Spur the adoption of performance metrics by patient data matching algorithm vendors, and
Positively impact other aspects of patient matching, such as deduplication and linking to clinical data.
Website:
14
Eligibility Requirements
There is no age requirement for this challenge. All members of a team must meet the eligibility requirements.
Contestants shall have registered to participate in the Challenge under the requirements established by ONC.
Contestants shall have complied with all the stated requirements of the Challenge.
Businesses must be incorporated in, and maintain a primary place of business in, the United States; individuals must be citizens or permanent residents of the United States.
Contestants shall not be HHS employees.
15
Eligibility Requirements (cont’d)
May not be a federal entity or federal employee acting within the scope of their employment. Federal grantees may not use federal funds to develop COMPETES Act challenge applications unless consistent with the purpose of their grant award. Federal contractors may not use federal funds from a contract to develop COMPETES Act challenge applications or to fund efforts in support of a COMPETES Act challenge submission. Participants must also agree to indemnify the Federal Government against third party claims for damages arising from or related to Challenge activities.
16
Challenge Process
Register your team.
Unlock the test data set provided by ONC and run your algorithm against it.
Submit results for evaluation; they will be scored against an "answer key."
Receive performance scores and appear on the Challenge leaderboard.
Repeat submissions until you are satisfied with the result, have hit 100 submissions, or the end date has passed.
Participants will unlock and download the test data set at the time of registration. Participants will then run their algorithms and submit their results to the scoring server on the Challenge website. A small set of true match pairs (created and verified through manual review) exists within the large data set and will serve as the "answer key" against which participants' submissions will be scored.
17
Challenge Process
Synthetic Data Set → Download Data in CSV File → Submit Linked Data → Scoring Server (Gold Standard) Returns a Score → Submit Results to Leader Board
18
Registration
Visit the challenge website and fill in all required fields of the registration form.
Create a username and password (one account per team).
Enter a team name, which will be used on the leader board; it can be used to keep team identities private.
Acknowledge and agree to all terms and rules of the Challenge.
19
Challenge Dataset
Dataset synthetically generated by Just Associates using a proprietary software algorithm.
Based on real-world data in an MPI, reflecting actual data discrepancies across the fields.
Known potential duplicate pairs mimic real-world scenarios.
Does not contain PHI.
Approximately 1M patient records.
Available early June; an announcement will be sent out when the data set is made available.
20
Challenge Dataset
Fields include: Enterprise ID, LAST NAME, FIRST NAME, MIDDLE NAME, SUFFIX, DOB, GENDER, SSN, ADDRESS1, ADDRESS2, CITY, STATE, ZIP, PHONE, PHONE2, ALIAS, MOTHERS_MAIDEN_NAME, MRN. SSNs: most are within the 800 range.
Data format: CSV; also available as a FHIR bundle.
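A minimal sketch of loading such a CSV with Python's standard library follows. The file name and the exact header spellings are assumptions; check the headers of the data set you actually download before relying on them.

```python
# Load the challenge CSV into a list of dicts keyed by the header row.
import csv

records = []
with open("patient_matching_data.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        records.append(row)

print(len(records), "records loaded")
# Hypothetical column names; adjust to the real headers in the downloaded file.
print(records[0].get("EnterpriseID"), records[0].get("LAST NAME"))
```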
21
Challenge Dataset example:
Record 1: John Smith, DOB 1-1-1990, phone 202-223-9910, Washington, DC
Record 2: Carol Jones, Bethesda, MD
Record 3: Bobby Johnson, Arlington, VA
Record 4: Johnny Smith, Washington, DC
Submitted match: Record 1 (John Smith, Washington, DC) linked to Record 4 (Johnny Smith, Washington, DC) is sent to the Scoring Server; Records 2 (Carol Jones) and 3 (Bobby Johnson) remain unlinked.
22
Submission Process
One dataset will be provided to all participants.
Participants will submit their matches to the ONC scoring server.
The answer key, separate from the dataset provided to participants, will be used to score submissions.
Submission data format: CSV (optionally a FHIR bundle). Each row lists an Enterprise ID, the Enterprise ID it is linked to, and, optionally, a confidence score for probabilistic algorithms, e.g.: Enterprise ID 1, Enterprise ID 4, 0.90.
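The row layout described above (Enterprise ID, linked Enterprise ID, optional confidence score) could be produced with a few lines of Python. The file name, header-less layout, and the pairs themselves are assumptions for illustration, not the official submission template.

```python
# Write a submission CSV: one row per linked pair plus an optional confidence.
import csv

linked_pairs = [
    ("1", "4", 0.90),   # Enterprise ID 1 linked to Enterprise ID 4, confidence 0.90
    ("7", "12", 0.75),  # invented pair for illustration
]

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for source_id, linked_id, confidence in linked_pairs:
        writer.writerow([source_id, linked_id, confidence])
```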
23
Project Submissions
Teams will submit the results (matched records) of their algorithm tests.
The submission period* will be open for 3 months.
100 submissions are allowed from each individual/team.
Teams can submit at any time during the submission period.
The Challenge will open on June 12th at 12:00 p.m. E.S.T.
Submissions will be allowed until 11:59 p.m. on the last day of the submission period.
*Submission period dates have not been determined. Once the test data set is available, we will add these dates to the challenge website.
24
Project Submissions
Calculation: Precision, Recall, and the tradeoffs between them.
The F-score is the harmonic mean of precision and recall: F-score = 2 x (Precision x Recall) / (Precision + Recall).
Precision: if Jon Smith is the record you are looking for and you search for Jon Smith, precision is the likelihood that a record returned as Jon Smith actually is Jon Smith.
Recall: if Bob Jones has 100 records in a database, how many of Bob Jones's records do you get? A recall of 90% means you got 90 of Bob Jones's records.
Different use cases have different requirements and a different balance of tradeoffs (e.g., security vs. healthcare).
25
Returned Results
Participants will receive an F-score, Precision, Recall, and Run ID.
Month one will include a beta period: new matches found will be manually reviewed to determine match status, previous submissions will be rescored against the updated answer key, and the leader boards will be updated.
After the beta period, all future submissions will be scored against the updated answer key only.
26
Leader Board Example
27
Leader Board Example
28
Winners and Prizes
The total prize purse for this challenge is $75,000. Judging will be based on the empirical evaluation of the performance of the algorithms.
Highest F-score: 1st $25,000, 2nd $20,000, 3rd $15,000.
Best in Category ($5,000 each): Precision, Recall, Best first F-score run.
For the purposes of scoring, all scores will be assessed to three places after the decimal (i.e., 0.xxx).
F-score judging and award conditions: In general, the top three algorithms with the highest F-scores will be selected as the winners. In the event that a single team achieves F-scores that would occupy 1st, 2nd, and/or 3rd place (in any combination, e.g., 1st and 3rd, or 2nd and 3rd), the team will be awarded only one prize, the highest available. After that award, the team's remaining score(s) will be skipped until a different competitor can be awarded the next prize level. In the event of a tie in F-scores, winners will be awarded their prizes based on the fewest runs/tries it took to reach that score. For example, if two teams tied for first place and one reached its high F-score on its 31st try and the other on its 57th try, the first team would be awarded 1st place and the other 2nd place. The same tie-breaking approach will be applied for all award positions. If the fewest-tries method still results in a tie, the teams will share that place and the prize will be split evenly among all who won it.
Best in category prizes: In general, best-in-category prizes will be awarded for precision, recall, and best first run. Unlike the F-score award rules, a team is permitted to win one or all of these best-of-category prizes. If a tie occurs in the precision or recall categories, the same fewest-tries method will be used to break the tie, and the same even split will apply if the tie persists. If the best first F-score run category results in a tie, the award will be split evenly among all who won the category.
29
Best in Category
Best F-score: 1st Place, 2nd Place, 3rd Place.
Best 1st Run F-score: this prize will be awarded to the contestant/team whose first submission to the scoring server results in the highest F-score.
Precision: best precision with recall >= 90%.
Recall: best recall with precision >= 90%.
30
Metrics for Algorithm Performance
31
Patient Matching Goal
The ideal outcome of any matching exercise is correctly answering this one question hundreds or thousands of times: are these two things the same thing? That means correctly identifying all the true positives and true negatives while minimizing the number of errors, false positives and false negatives.
Source: Culbertson, A. Patient Matching A-Z, Wednesday, March 2nd, HIMSS 2016, Las Vegas, NV
32
Patient Matching Terminology
True Positive: the two records represent the same patient. True Negative: the two records don't represent the same patient. Source: Culbertson, A. Patient Matching A-Z, Wednesday, March 2nd, HIMSS 2016, Las Vegas, NV
33
Patient Matching Terminology
False Negative: the algorithm misses a pair of records that should be matched. False Positive: the algorithm links two records that don't actually match. Source: Culbertson, A. & Miller, K., Patient Matching EHR Ailments: Going from Placebo to Cure, Tuesday, March 1st, HIMSS 2016, Las Vegas, NV
34
Evaluation
Each example pair of records from EHR A and EHR B is compared against the truth (gold standard) and the algorithm's decision to assign a match type: True Positive (e.g., Jonathan), True Negative (e.g., Sally, a non-match), False Positive, or False Negative (e.g., Jon).
Source: Culbertson, A. Patient Matching A-Z, Wednesday, March 2nd, HIMSS 2016, Las Vegas, NV
37
Evaluation
Confusion matrix (Algorithm decision vs. Truth):
Truth Positive, Algorithm Positive: True Positive
Truth Negative, Algorithm Positive: False Positive
Truth Positive, Algorithm Negative: False Negative
Truth Negative, Algorithm Negative: True Negative
Precision = True Positives / (True Positives + False Positives)
Recall = True Positives / (True Positives + False Negatives)
Source: Culbertson, A. Patient Matching A-Z, Wednesday, March 2nd, HIMSS 2016, Las Vegas, NV
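Putting the two formulas together with the F-score, here is a short Python sketch; the true positive, false positive, and false negative counts are made up for illustration.

```python
# Compute precision, recall, and F-score from raw counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

tp, fp, fn = 90, 5, 10          # hypothetical counts from a scored submission
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 3), round(r, 3), round(f_score(p, r), 3))  # 0.947 0.9 0.923
```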
38
Evaluation
Calculation: Precision, Recall, and the tradeoffs between them.
The F-score is the harmonic mean of precision and recall: F-score = 2 x (Precision x Recall) / (Precision + Recall).
Precision: if Jon Smith is the record you are looking for and you search for Jon Smith, precision is the likelihood that a record returned as Jon Smith actually is Jon Smith.
Recall: if Bob Jones has 100 records in a database, how many of Bob Jones's records do you get? A recall of 90% means you got 90 of Bob Jones's records.
Different use cases have different requirements and a different balance of tradeoffs (e.g., security vs. healthcare).
39
Creating Test Data Sets
40
Development of Test Data Set
Patient Database → Select Potential Matches (aka Adjudication Pool) → Manual Review (Reviewer 1, Reviewer 2, Reviewer 3) → Human-Reviewed Match Decisions (Answer Key = Ground Truth Data Set) → Compare Algorithm Results Against the Test Data Set
Source: Culbertson, A. & Miller, K., Patient Matching EHR Ailments: Going from Placebo to Cure, Tuesday, March 1st, HIMSS 2016, Las Vegas, NV
41
Development of Ground Truth Sets
Identify a data set that reflects a real-world use case.
Develop potential duplicates.
Human adjudication: review and classify each pair as Match or Non-Match.
Estimate truth using pooled methods that combine multiple matching methods (see the sketch below).
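A hedged sketch of the pooled approach mentioned above: candidate pairs proposed by several different matching methods are combined into a single adjudication pool for human review. The pair lists are invented for illustration.

```python
# Pool candidate pairs from multiple matching methods into one review set.
deterministic_pairs = {(1, 4), (2, 9)}
probabilistic_pairs = {(1, 4), (3, 7), (5, 8)}

# The adjudication pool is the union of every method's candidates;
# reviewers then label each pooled pair as Match or Non-Match to build
# the ground-truth answer key.
adjudication_pool = deterministic_pairs | probabilistic_pairs
print(sorted(adjudication_pool))
```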
42
Issues In Establishing Ground Truth
The first step in evaluation is to determine why the evaluation is being conducted.
Different truth for different applications: security applications vs. a patient health record.
What is the cost of missing a match? Security: lives are lost. Health: a patient safety event, missed medications, allergies, etc., possibly death. But this is the situation today.
What is the cost of wrongly identifying a match? Security: a passenger is inconvenienced or delayed. Health: a patient safety event, wrong medication or treatment, liability, death.
Criteria for truth must be carefully established and well understood; e.g., the question posed to annotators must be carefully phrased.
43
Issues In Establishing Ground Truth (cont’d)
Different truth for different applications: credit check, security applications, customer support, de-duplication of mailing lists.
What is the cost of missing a match? A new record entered into the database, an irritated customer, or lives lost.
Criteria for truth must be carefully established and well understood by annotators; the question posed to annotators must be carefully phrased.
44
Issues In Establishing Ground Truth (cont’d)
How much time and expertise is available to judge (or discount) false positives?
The truth set needs to reflect the real-world test use case.
Evaluation results are only as good as the truth on which they are based, and only as appropriate as the evaluation is to the task that will be performed with the operational system.
Absolute recall is impossible to measure without a completely known test set (i.e., "you don't know what you're missing"); estimate it with pooled results.
45
Examples
Names: B Smith, Bill Smythe, William Smythe, W Smith. Are these the same person?
Dates of birth: 10/12/1972, October 11, 1972, December 10, 1972, 12/10/72, October 12, 1927. Which of these represent the same date?
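One common way to reconcile DOB variants like these is to normalize every value to a single ISO format before comparing. The sketch below is illustrative only; the US-style (month-first) formats are assumptions, and deliberately leave ambiguous two-digit-year values such as 12/10/72 unparsed so they can be flagged for manual review.

```python
# Normalize DOB strings to ISO dates; unparseable values come back as None.
from datetime import datetime

FORMATS = ["%m/%d/%Y", "%B %d, %Y"]  # assumed US-style, month-first formats

def normalize_dob(text):
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # ambiguous or unrecognized value, flag for manual review

for raw in ["10/12/1972", "October 11, 1972", "December 10, 1972",
            "12/10/72", "October 12, 1927"]:
    print(raw, "->", normalize_dob(raw))
```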
46
Get Involved
Webinars on how to participate and a challenge overview: May 24th.
Kicking off the Patient Data Matching Algorithm Challenge in June.
Participant Discussion Board.
Website:
47
Acknowledgments
Thank you to the following individuals and organizations for their involvement in the planning and development of this challenge:
Debbie Bucci and the ONC Team
HIMSS North America; Tom Leary, HIMSS
Greg Downing, HHS Idea Lab, ONC
Jerry and Beth Just and the Just Associates Team
Keith Miller and Andy Gregorowicz, MITRE
Caitlin Ryan, IRIS Health Solutions
Capital Consulting Corporation Team
48
FOR ADDITIONAL QUESTIONS/INFORMATION CONTACT:
Adam Culbertson, Debbie Bucci, (preferred) Phone:
49
Thank you for your interest!
The ONC Team