An Overview of Patient Matching


1 An Overview of Patient Matching
Shaun Grannis, MD MS
Medical Informatics Research Scientist, Regenstrief Institute
Assistant Professor of Family Medicine, Indiana University School of Medicine
U.S. Population Health Technical Work Group Co-Chair, Health Information Technology Standards Panel

2 What We’ll Cover
- Definition and Motivation
- Use Cases
- Barriers to Accurate Patient Identification
- Patient Identifier Characteristics
- Patient Identification Terminology
- Patient Matching Methodologies
- Patient Identification Architectures
- Overview of OpenMRS Patient Matching Process

3 Patient Matching: Description
“… Each person in the world creates a book of life. The book starts with birth and ends with death. Its pages are made up of all the principal events in life. Record linkage is the name given to the process of assembling the pages of this book into one volume. The person retains the same identity throughout the book. Except for advancing age, he is that same person …” - Dunn, 1946

4 Patient Matching: Synonyms and Definition
Synonyms: “Patient Matching”, “Patient Linkage”, “Record Matching”, “Record Linkage”, “Identity Management”
Definition: identify records that represent the same entity.
- Entities are typically individual persons, but can be families, twins, organizations, etc.
- Records contain fields describing the entity. These fields can include “unique” IDs, names, birth dates, addresses, sex, parents’ names, tribe, telephone numbers, etc.
Notes: Terminology - {Patient/Record} {Matching/Linkage} - True Pos - True Neg - False Pos - False Neg - Precision - Recall - De-duplication - MPI/EMPI - Blocking/Grouping - Potential pairs - Global weight - Field weight

5 Motivation Clinical information is fragmented across many independent databases using different identifiers This situation makes record matching challenging for such uses as: Public Health/Administrative Reporting Outcomes management Vital status determination Research Clinical Care - Increasingly health care information is fragmented (distributed) across many independent databases and systems, both WITHIN and AMONG organizations as SEPARATE ISLANDS with different patient identifiers. - This is the case FOR DATA collected WITHIN an institution where there may be multiple identifiers, and FOR DATA collected about the same patient at - different health care institutions - different pharmacy systems - different payers, and so on. - This situation INTERFERES with the aggregation of information as needed for: - public health reporting - clinical research - outcomes management, and health care policyt Aggregation is important NOT ONLY to determine a patient’s health care status, BUT ALSO for population based studies. Record linkage is the process of combining information about an individual, family, or entity residing in one or more databases. So we need a way of linking

6 Patient Matching Use Cases
- Data Aggregation
  - Immunization Registry
- Process Improvement
  - Newborn screening
- Process Evaluation
  - ELR Completeness
- Reporting/Research (combining datasets to evaluate outcomes)
  - Cancer rates among Depressed/Anxious
  - Mortality Assessment – Cancer Survival
  - Assessing effects of Maternal EtOH use on fetal outcomes
- De-identified Linkage
- Health Information Exchange
Notes: Two basic functions can be performed with record linkage:
- Determine whether 2 records represent the same person. (Do you have this person? De-duplicate.)
- Join data about the same person from different sources.

7 Barriers to Accurate Patient Matching
- Recording Errors
  - Phonetic (“Shaun”, “Sean”, “Shawn”)
  - Typographical (Smith → Snith, “07” → “01”)
- Changing Identifiers
  - Last Name (Marriage)
  - Geographic location (Home address, etc.)
- Sharing Identifiers (SSN, etc.)
- Identifiers Limited or Unavailable

8 Ideal Identifier Characteristics
- Unique (e.g., fingerprint, iris, DNA, national ID)
- Ubiquitous (e.g., name, DOB, sex, eye color)
- Unchanging (e.g., DOB, sex, given name, DNA)
- Uncomplicated (e.g., name, DOB, sex)
- Uncontroversial (e.g., avoid sensitive data)
- Easily and inexpensively accessible
No identifier meets all of these characteristics.
Notes: Give examples and counterexamples. Could envision a table with the columns being the characteristics and the rows being different types of identifiers.

9 Patient Matching Terminology
- True match / True link / True positive: truly matching records declared to be the same entity
- False match / False link / False positive: truly non-matching records declared to be the same entity
- True non-match / True non-link / True negative: truly non-matching records not declared to be the same entity
- False non-match / False non-link / False negative: truly matching records not declared to be the same entity

10 Patient Matching Terminology
Matching system declaration versus “truth”:

Declared \ Truth | True Match | True Non-Match
Match | True Match (TM) | False Match (FM)
Non-match | False Non-Match (FNM) | True Non-Match (TNM)

- “Pos Predictive Value” or “Precision” = TM / (TM + FM)
- “Neg Predictive Value” = TNM / (TNM + FNM)
- “Sensitivity” or “Recall” = TM / (TM + FNM)
- “Specificity” = TNM / (TNM + FM)
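The four measures on this slide are simple ratios of the true/false match and non-match counts. The sketch below shows one way to compute them; the class name, function names, and the example counts are made up for illustration and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchCounts:
    tm: int   # true matches: matching records declared a match
    fm: int   # false matches: non-matching records declared a match
    tnm: int  # true non-matches: non-matching records declared a non-match
    fnm: int  # false non-matches: matching records declared a non-match


def precision(c: MatchCounts) -> float:
    """Positive predictive value: TM / (TM + FM)."""
    return c.tm / (c.tm + c.fm)


def negative_predictive_value(c: MatchCounts) -> float:
    """TNM / (TNM + FNM)."""
    return c.tnm / (c.tnm + c.fnm)


def recall(c: MatchCounts) -> float:
    """Sensitivity: TM / (TM + FNM)."""
    return c.tm / (c.tm + c.fnm)


def specificity(c: MatchCounts) -> float:
    """TNM / (TNM + FM)."""
    return c.tnm / (c.tnm + c.fm)


# Hypothetical evaluation of 100 record pairs (illustrative numbers only).
counts = MatchCounts(tm=9, fm=1, tnm=89, fnm=1)
print(precision(counts), recall(counts))
```

Precision and recall are the two figures most often traded off against each other: raising a score threshold tends to reduce false matches at the cost of more false non-matches.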

11 Patient Matching Terminology
- Potential Pairs / Potential Links: record pairs that have not been declared a match or non-match
- Blocking / Grouping: method to limit the search space for potential links, usually by forcing an exact match on one or more fields (analogous to sorting socks by color before pairing)
- Field Agreement Weight / Score: value assigned when two fields are declared to agree
- Field Disagreement Weight / Score: value assigned when two fields are declared to disagree
- Record Pair Score / Composite Score / Global Score: value derived from individual field contributions (typically the product or sum of field weights)
- Score Threshold: record pair score above which a match is declared and/or below which a non-match is declared
Notes: Terminology - De-duplication - MPI/EMPI - Blocking/Grouping - Potential pairs - Global weight - Field weight
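Blocking, as defined on this slide, can be sketched as grouping one file by a key and only pairing records that share that key. The function name, record layout, and choice of birth year as the blocking field are assumptions made for illustration.

```python
from collections import defaultdict


def candidate_pairs(file1, file2, blocking_key):
    """Return only the record pairs whose blocking key agrees exactly."""
    blocks = defaultdict(list)
    for record in file2:
        blocks[blocking_key(record)].append(record)

    pairs = []
    for record in file1:
        # Records in other blocks are never compared at all.
        for candidate in blocks.get(blocking_key(record), []):
            pairs.append((record, candidate))
    return pairs


# Hypothetical records; without blocking there would be 2 x 3 = 6 pairs.
file1 = [
    {"id": "A", "last": "SMITH", "birth_year": 1943},
    {"id": "B", "last": "JONES", "birth_year": 1955},
]
file2 = [
    {"id": "X", "last": "SMYTH", "birth_year": 1943},
    {"id": "Y", "last": "JONES", "birth_year": 1955},
    {"id": "Z", "last": "WILLIAMS", "birth_year": 1968},
]

pairs = candidate_pairs(file1, file2, lambda r: r["birth_year"])
print([(a["id"], b["id"]) for a, b in pairs])
```

The trade-off is that a true match whose blocking field was recorded incorrectly is never formed as a potential pair, which is why several blocking passes on different fields are often combined.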

12 Potential Solutions
- National Patient Identifier
  - Recording errors
  - Sharing IDs
  - Lost IDs
  - Controversial (in some regions)
- Biometrics
  - Require proprietary hardware for all data generators
  - How secure?
  - Privacy concerns
These may help, but they’re not a panacea.
Notes: Biometric privacy concern: use my fingerprint elsewhere

13 Patient Matching Methodologies
[Diagram: matching methodologies along an axis of increasing complexity. Labels: Deterministic, Fuzzy Match, Probabilistic, Machine Learning]

14 Deterministic ‘Rules-based’ or ‘Heuristic’
- Accuracy is highly dependent on the presence of discriminating identifiers (national or local ID, etc.)
- Rule-based, e.g., declare a match if there is an exact match on:
  - National ID + DOB
  - Full Name + Address
  - etc.
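A rule set like the one on this slide can be sketched as a list of field combinations, where a pair is declared a match if every field in any one rule agrees exactly. The field names and the decision to treat missing values as non-agreement are assumptions for illustration.

```python
# Each rule is a tuple of fields that must ALL agree exactly.
RULES = [
    ("national_id", "dob"),
    ("full_name", "address"),
]


def is_deterministic_match(a, b, rules=RULES):
    """Declare a match if any rule's fields are all present and equal."""
    for rule in rules:
        # A missing or empty value never counts as agreement (assumption).
        if all(a.get(f) and a.get(f) == b.get(f) for f in rule):
            return True
    return False


rec1 = {"national_id": "123", "dob": "1943-10-12", "full_name": "JANE SMITH"}
rec2 = {"national_id": "123", "dob": "1943-10-12", "full_name": "JAYNE SMITH"}
rec3 = {"national_id": None, "dob": "1943-10-12", "full_name": "JANE SMITH"}

print(is_deterministic_match(rec1, rec2))  # agrees on national_id + dob
print(is_deterministic_match(rec1, rec3))  # no rule fully satisfied
```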

15 Fuzzy Match
Non-exact agreement, allows for errors:
- “If last name agrees on first 6 characters then declare agreement”
- “If birth date is within 1 month, then declare agreement”
To loosen agreement, string comparators or phonetic transformation functions may be used:
- Soundex (phonetic)
- NYSIIS (phonetic)
- Levenshtein Edit Distance (comparator)
- Jaro-Winkler Comparator (comparator)
- Longest Common Sub-sequence (comparator)
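The two example rules and the Levenshtein comparator from this slide can be sketched as follows. The function names are illustrative, and "within 1 month" is approximated as 31 days, which is an assumption rather than something the slide specifies.

```python
from datetime import date


def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or exact character match)
            ))
        previous = current
    return previous[-1]


def last_name_agrees(a, b, prefix_len=6):
    """Agreement if both names are present and share the first prefix_len characters."""
    return bool(a) and bool(b) and a[:prefix_len].upper() == b[:prefix_len].upper()


def birth_date_agrees(a, b, max_days=31):
    """Agreement if the dates differ by at most max_days (assumed to stand for one month)."""
    return abs((a - b).days) <= max_days


print(levenshtein("SMITH", "SNITH"))
print(last_name_agrees("Williams", "Williamson"))
print(birth_date_agrees(date(1955, 2, 7), date(1955, 2, 1)))
```

An edit distance is usually turned into an agree/disagree decision by comparing it with a small cutoff, often scaled by string length.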

16 Probabilistic/Machine Learning
- Implements a statistical model for matching
- A common model is the Fellegi-Sunter maximum likelihood model
- Establish parameters for the model using machine learning algorithms (EM) or bootstrap review
- Maximum Entropy Model also used
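To make the EM step more concrete, here is a minimal sketch of Expectation Maximization for a Fellegi-Sunter style model with binary field agreement and conditional independence between fields. It is a textbook-style illustration under those assumptions, not the implementation used by any particular system; the function name, starting values, and toy data are invented for the example.

```python
def em_fellegi_sunter(patterns, n_iter=50, p=0.5, m_init=0.9, u_init=0.1):
    """Estimate p (share of true links), m[k] and u[k] from agreement patterns.

    patterns: list of tuples of 0/1, one tuple per record pair, one entry per field.
    m[k]: probability that field k agrees among true links.
    u[k]: probability that field k agrees among non-links.
    """
    n_fields = len(patterns[0])
    m = [m_init] * n_fields
    u = [u_init] * n_fields
    eps = 1e-6  # keeps probabilities away from exactly 0 or 1

    def likelihood(pattern, probs):
        value = 1.0
        for agrees, prob in zip(pattern, probs):
            value *= prob if agrees else (1.0 - prob)
        return value

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a true link.
        g = []
        for pattern in patterns:
            link = p * likelihood(pattern, m)
            non_link = (1.0 - p) * likelihood(pattern, u)
            g.append(link / (link + non_link))

        # M-step: re-estimate parameters from the posterior weights.
        total_link = sum(g)
        total_non_link = sum(1.0 - gj for gj in g)
        for k in range(n_fields):
            m_k = sum(gj for gj, pat in zip(g, patterns) if pat[k]) / total_link
            u_k = sum(1.0 - gj for gj, pat in zip(g, patterns) if pat[k]) / total_non_link
            m[k] = min(max(m_k, eps), 1.0 - eps)
            u[k] = min(max(u_k, eps), 1.0 - eps)
        p = min(max(total_link / len(patterns), eps), 1.0 - eps)

    return p, m, u


# Toy data: 10 pairs that mostly agree and 90 that mostly disagree (made up).
patterns = (
    [(1, 1, 1)] * 8 + [(1, 1, 0)] + [(0, 1, 1)]
    + [(0, 0, 0)] * 80 + [(1, 0, 0)] * 5 + [(0, 0, 1)] * 5
)
p_hat, m_hat, u_hat = em_fellegi_sunter(patterns)
print(p_hat, m_hat, u_hat)
```

No pair is labeled as a link or non-link in the input; the algorithm infers the two groups from the agreement patterns alone, which is what lets the parameters adapt to the data being linked.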

17 Patient Matching Methodologies
Deterministic/Heuristic
- Rapid implementation
- Simple calculations
- Relies on accurate and consistent data
- May not generalize well to other data sets
Probabilistic
- Complex implementation
- Computationally intensive
- More forgiving of data errors
- Algorithms adapt to data being linked

18 Probabilistic (F-S) Example
- Among the 10 true links, the last names agreed in 9/10 pairs (e.g., one of the last names was misspelled). This represents a 90% agreement rate for last name among true links.
- Similarly, among the 90 non-links, last names agreed (by random chance) in 2/90 pairs. This represents a roughly 2% agreement rate for last name among non-links.

19 Probabilistic (F-S) Example
90% / 2% = 45
Records that agree on last name are 45 times more likely to be a true link than a non-link.
Weights for each field are combined to form a composite record pair score. Field disagreement contributes a negative weight and reduces the overall record pair score.
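The ratio on this slide can be turned into field weights. A common convention, assumed here because the slide does not state it, is to take the base-2 logarithm of the ratio so that field weights add up to the record pair score. The last-name rates come from the example; the birth-year rates are invented so that a two-field score can be shown.

```python
import math


def agreement_weight(m, u):
    """Weight when a field agrees: log2 of (rate among true links / rate among non-links)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight when a field disagrees; negative whenever m > u."""
    return math.log2((1 - m) / (1 - u))


def record_pair_score(agreements, params):
    """Sum the field weights for one record pair.

    agreements: {field: True/False}; params: {field: (m, u)}.
    """
    score = 0.0
    for field, agrees in agreements.items():
        m, u = params[field]
        score += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return score


params = {
    "last_name": (0.90, 0.02),   # rates from the slide example
    "birth_year": (0.95, 0.05),  # hypothetical rates for illustration
}

likelihood_ratio = 0.90 / 0.02   # the "45 times more likely" figure
both_agree = record_pair_score({"last_name": True, "birth_year": True}, params)
name_differs = record_pair_score({"last_name": False, "birth_year": True}, params)
print(likelihood_ratio, both_agree, name_differs)
```

Because the logarithm is monotonic, ranking pairs by summed log weights is equivalent to ranking them by the product of the underlying ratios.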

20 Probabilistic (F-S) Example
1. Generate record pairs from File 1 (Records A, B, C) and File 2 (Records X, Y, Z).
2. Each record pair is assigned a score. A histogram of scores may look like:
[Figure: histogram of scores for potential record pairs. Which are true links?]
Notes:
- First, a brief overview of probabilistic record linkage: with any record linkage process, we must generate record pairs from two files.
- Records may contain such information as SSN, name, and birth date.
- Pairs are initially formed by blocking on limited information.
- Blocking refers to the process of grouping similar pairs of records. It’s analogous to sorting socks by color before pairing them up.
- In our case we blocked on SSN; that is, all of our record pairs agreed on social security number.
- Each record pair is then assigned a score: high scores for more likely pairs, and lower scores for less likely pairs.
- Given a distribution of scores, the question arises, “Which are the true links?”, for contained in this distribution are record pairs which should be truly linked and pairs which are non-links. It is this area of overlap (red square) where we will next focus our attention.

21 Probabilistic Linkage Overview: Human Review Thresholds
Notes:
- In reality, most uses of probabilistic algorithms incorporate a human reviewer into the process. That is, any record pair below a lower threshold is considered a non-link, any pair above an upper threshold is considered a true link, and pairs scoring between the two thresholds are sent for human review. So the results of most probabilistic methods are actually probabilistic plus humans, and they work quite well.
- Question: if human review combined with probabilistic matching works well, why wouldn’t we want to use it? One situation is when we cannot afford the high cost of a human operator (human review can require thousands of man-hours of work), or when privacy concerns dominate. In such a case, we may wish to automatically link patient data using all available demographic identifiers. Once linkage is established, patient identifiers may be removed, leaving only unidentified clinical data.
- Therefore, we wanted to evaluate a probabilistic method without human intervention (that is, with a single threshold), so that we can make the right methodology choice (either probabilistic or heuristic) in those situations. We can remove the human operator by picking a single threshold above which we declare a link and below which a non-link.
- We hypothesize that a probabilistic linkage method will perform better than our empirical (exact-agreement deterministic) method because probabilistic methods produce scores that are tailored to the unique characteristics of the specific records being linked. (That will be explained shortly.)
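The two-threshold scheme with human review, and the single-threshold variant without it, can be sketched as below. The function names and label strings are illustrative; treating a score exactly equal to the single threshold as a link is an assumption, since the notes only say "above" and "below".

```python
def classify_with_review(score, lower, upper):
    """Two thresholds: clear non-links, clear links, and a review zone in between."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score > upper:
        return "link"
    if score < lower:
        return "non-link"
    return "human review"


def classify_single_threshold(score, threshold):
    """One threshold, no human reviewer; ties go to 'link' (assumption)."""
    return "link" if score >= threshold else "non-link"


# Hypothetical record pair scores and thresholds.
scores = [12.4, 3.1, -6.0]
print([classify_with_review(s, lower=0.0, upper=8.0) for s in scores])
print([classify_single_threshold(s, threshold=4.0) for s in scores])
```

Moving the single threshold up trades false matches for false non-matches, so its value is usually chosen against a reviewed sample of pairs.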

22 Patient Identity Architectures
There is no ideal architecture, only best principles and practices for a particular use case:
- Patient care
- Reporting/Research
- Registry clean-up
Potential architectures:
- Peer-to-peer
- Patient carried
- Central index

23 Peer-to-Peer
- No central list of patient demographics
- Each participating data source maintains a patient registry
- Each source is queried for potential matches; result sets are linked

24 Peer-to-Peer
[Diagram: peer-to-peer network. Labels: Matcher (four nodes), Query/Matcher]

25 Central Index
- Contains patient identifiers with pointers to clinical data sources. No clinical data is contained in the repository.
- Contributing data sources send patient demographics; matching can be performed in real time or near real time.

Name | Birth Date | Sex | Source
Smith, Jane | 12-Oct-1943 | F | Public Health
Jones, Fred L | 07-Feb-1955 | M | Hospital A
Smith, Jayne | | | Clinic B
Williams, Mary | 20-Dec-1968 | | Clinic A
Mary, Williams | | | Hospital B
Jones, Freddy | 01-Feb-1955 | |

26 Central Index
- Jane receives immunizations at the Health Department. Data is delivered to the Immunization Registry.
- Jane receives immunizations and other care (measurements, labs, diagnoses, ...) at a clinical practice, Clinic A. Data is delivered to the EMR.

27 Central Index
[Diagram: Immunization Registry (registry web interface) and Clinic A (EMR interface), with “???????????” between them]

28 Central Index
Immunization Registry record:
- Patient ID: 123LMNOP
- Name: Jane Doe
- DOB: 01/01/04
- SSN: N/A
- Address: 555 Johnson Road
- City: Indianapolis
- State: Indiana
- ZIP: 46202
Clinic A record:
- Patient ID: 6789XYZ
- Name: Jane Ellen Doe
- DOB: 01/01/04
- SSN:
- Address: 555 Johnson Road
- City: Indianapolis
- State: Indiana
- ZIP: 46202
Central Patient Index entry:
- Global ID: 45678
- Name: Jane Ellen Doe
- Lots of Demographics..
- MRF1 ID: OU81247
- MRF2 ID:
- IMM REG ID: 123LMNOP
- CLINIC A ID: 6789XYZ
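The cross-reference idea on this slide, one global ID pointing at each source's local patient ID, can be sketched as a small in-memory structure. The class and method names are hypothetical; only the ID values are taken from the slide, and a real index would also store the demographics used for matching.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    global_id: str
    name: str
    local_ids: dict = field(default_factory=dict)  # source name -> local patient ID


class CentralPatientIndex:
    """Holds identifiers and pointers only; no clinical data."""

    def __init__(self):
        self._entries = {}   # global ID -> IndexEntry
        self._by_local = {}  # (source, local ID) -> global ID

    def add_entry(self, global_id, name):
        entry = IndexEntry(global_id, name)
        self._entries[global_id] = entry
        return entry

    def link(self, global_id, source, local_id):
        """Record that a source's local ID refers to this global patient."""
        self._entries[global_id].local_ids[source] = local_id
        self._by_local[(source, local_id)] = global_id

    def find_global_id(self, source, local_id):
        return self._by_local.get((source, local_id))

    def pointers(self, global_id):
        """All known local IDs for a patient, keyed by source."""
        return dict(self._entries[global_id].local_ids)


index = CentralPatientIndex()
index.add_entry("45678", "Jane Ellen Doe")
index.link("45678", "IMM REG", "123LMNOP")
index.link("45678", "CLINIC A", "6789XYZ")

# Starting from the clinic's ID, find the same patient in the registry.
gid = index.find_global_id("CLINIC A", "6789XYZ")
print(index.pointers(gid)["IMM REG"])
```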

29 Central Index
[Diagram: Central Patient Index connected to Hospital A, Hospital B, Clinic A, Clinic B, Clinic C, and the Immunization Registry]

30 A Nation-wide Infrastructure of Central Indexes (?)
[Figure: NHII schematic]
- LHIIs/RHIIs combine to form a national infrastructure
- “A network of networks”
Notes: Emphasize the local nature of the system and the importance of information exchange standards (HL7, LOINC, etc.) to transfer information between local and regional exchanges.

31 OpenMRS Patient Matching: Overview
The Patient Matching Module implements the Fellegi-Sunter model and initializes matching parameters using Expectation Maximization.
1. Analytic API Component (analytic phase):
- Fields are examined for NULL values/default values (1900, ‘JOHN DOE’, etc.)
- Data sources to be linked are analyzed to customize probabilistic matching parameters
- Threshold match scores are established
- Blocking variables are established
2. Operational API Component (operational phase):
- Incoming data is preprocessed and validated (case normalized, fields validated)
- Potential pairs are formed (blocking) and scored (frequency scaling was recently implemented through Google Summer of Code)
- Post-processing (detect twins/familial linkages that may represent false matches)
[Diagram: OpenMRS and the Record Linkage Module, connected by flows 1 and 2]
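The preprocessing described here (case normalization and screening out placeholder values such as 1900 or 'JOHN DOE') might look like the sketch below. This is not OpenMRS code; the function, the field names, and the table of default values are assumptions made for illustration.

```python
# Placeholder values that carry no identifying information (illustrative list).
DEFAULT_VALUES = {
    "name": {"JOHN DOE"},
    "birth_year": {"1900"},
}


def preprocess(record):
    """Normalize case and whitespace; replace empty or default values with None."""
    cleaned = {}
    for field_name, value in record.items():
        if value is None:
            cleaned[field_name] = None
            continue
        # Collapse runs of whitespace and upper-case the text.
        text = " ".join(str(value).split()).upper()
        if text == "" or text in DEFAULT_VALUES.get(field_name, set()):
            cleaned[field_name] = None  # treat as missing, not as agreement
        else:
            cleaned[field_name] = text
    return cleaned


print(preprocess({"name": "  john  doe ", "birth_year": 1943}))
print(preprocess({"name": "Jane Doe", "birth_year": "1900"}))
```

Treating defaults as missing matters because two records that both say 'JOHN DOE' would otherwise earn an agreement weight for a value that identifies no one.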

32 OpenMRS Patient Matching: Overview
1. Inbound HL7 Registration or Results message
2. Linking fields validated and cleaned (Name, DOB, etc.)
3. Record passed to Linkage Module
4. Potential pairs scored using the Fellegi-Sunter probabilistic model and returned to the OpenMRS registration handler

34 An Overview of Patient Matching
Questions?
Shaun Grannis, MD MS
Medical Informatics Research Scientist, Regenstrief Institute
Assistant Professor of Family Medicine, Indiana University School of Medicine
U.S. Population Health Technical Work Group Co-Chair, Health Information Technology Standards Panel

35 Bibliography - Theory
- Fellegi IP, Sunter SB. (1969) A Theory for Record Linkage. Journal of the American Statistical Association, 64(328).
- Dunn HL. (1946) Record Linkage. Am J Public Health, 36.
- Newcombe HB. (1988) Handbook of Record Linkage, Methods for Health and Statistical Studies, Administration, and Business. Oxford University Press.
- Newcombe HB, Kennedy JM, Axford SJ, James AP. (1959) Automatic Linkage of Vital Records. Science, 130.
- Gill L. Methods for Automatic Record Matching and Linking and their use in National Statistics. Her Majesty’s Stationery Office, Norwich, 2001.
- Porter E, Winkler W. Approximate String Comparison and its Effect on an Advanced Record Linkage System. Record Linkage Techniques--1997: Proceedings of an International Workshop and Exposition. National Academy Press, Washington DC, 1999.
- Public Health Informatics Institute. The unique records portfolio. Decatur, GA: Public Health Informatics Institute, 2006.

36 Bibliography: Applications and Research (1)
- Christen P. Febrl: A freely available record linkage system with a graphical user interface. Submitted to the Australasian Workshop on Health Data and Knowledge Management (HDKM), Wollongong, January 2008.
- Potosky A, Riley G, Lubitz J, et al. Potential for Cancer Related Health Services Research Using a Linked Medicare-Tumor Registry Database. Medical Care 1993;31(8).
- Whalen D, Pepitone A, Graver L, Busch JD. Linking Client Records from Substance Abuse, Mental Health and Medicaid State Agencies. SAMHSA Publication No. SMA. Rockville, MD: Center for Substance Abuse Treatment and Center for Mental Health Services, Substance Abuse and Mental Health Services Administration, July 2000.
- Liu S, Wen SW. Development of Record Linkage of Hospital Discharge Data for the Study of Neonatal Readmission. Chronic Diseases in Canada 1999;20(2):77-81.
- Pates R, Scully W, et al. Adding Value to Clinical Data by Linkage to a Public Death Registry. MedInfo 2001;10(Pt 2):1384-8.

37 Bibliography: Applications and Research (2)
- Lynch BT, Arends WL. Selection of a surname coding procedure for the SRS record linkage system. Washington, DC: US Department of Agriculture, Sample Survey Research Branch, Research Division, 1977.
- Newman T, Brown A. Use of Commercial Record Linkage Software and Vital Statistics to Identify Patient Deaths. J Am Med Inform Assoc May-June;4(3).
- Schadow G, McDonald CJ. Maintaining Patient Privacy in a Large Scale Multi-Institutional Clinical Case Research Network. AMIA Proceedings (2002 Submission).
- Public Health Informatics Institute. (2006) The Unique Records Portfolio. Decatur, GA: Public Health Informatics Institute.
- Sideli R, Friedman C. Validating Patient Names in an Integrated Clinical Information System. Symposium on Computer Applications in Medical Care, Washington, DC. November 1991.

38 Bibliography: Applications and Research (3)
- Miller PL, Frawley SJ, Sayward FG. IMM/Scrub: a domain-specific tool for the deduplication of vaccination history records in childhood immunization registries. Computers and Biomedical Research 2000;33:126–143.
- Salkowitz SM, Clyde S. De-duplication technology and practices for integrated child-health information systems. Decatur, GA: All Kids Count, Public Health Informatics Institute, 2003.
- Van Den Brandt PA, Schouten LJ, Goldbohm RA, Dorant E, Hunan PMH. Development of a record linkage protocol for use in the Dutch Cancer Registry for epidemiological research. Int J Epidemiol 1990;19:553-8.
- Grannis SJ, Overhage JM, McDonald CJ. Analysis of Identifier Performance Using a Deterministic Linkage Algorithm. Proc AMIA Symp 2002:305-9.
- Grannis SJ, Overhage JM, McDonald CJ. Analysis of a Probabilistic Record Linkage Technique without Human Review. In: Proceedings of American Medical Informatics Association Fall Symposium; 2003; Washington, D.C.
- Integrating the Healthcare Enterprise. (2006) Patient Identifier Cross-Reference (PIX) and Patient Demographic Query (PDQ) HL7 v3 Transaction Updates. Available at: IHE_ITI_TF_Suppl_PIXPDQ_HL7v3_PC_2006_08_15.pdf

