Published by Mary Malone. Modified over 3 years ago.

1
Probabilistic Linkage: Issues and Strategies
Craig A. Mason, Ph.D.
University of Maine
craig.mason@umit.maine.edu

2
Faculty Disclosure Information
In the past 12 months, I have not had a significant financial interest or other relationship with the manufacturer(s) of the product(s) or provider(s) of the service(s) that will be discussed in my presentation.
This presentation will (not) include discussion of pharmaceuticals or devices that have not been approved by the FDA, or of unapproved or "off-label" uses of pharmaceuticals or devices.

3
Acknowledgements
Shihfen Tu, Quansheng Song
Keith Scott, Marygrace Yale, Tony Gonzalez
Derek Chapman

4
Overview of Linkage Process
Two databases containing information on some of the same individuals
[Diagram labels: Birth Certificates; EHDI Diagnostic Data]

5
Overview of Linkage Process
Many births are not in the Diagnostic Data
[Diagram labels: Birth Certificates; EHDI Diagnostic Data]

6
Overview of Linkage Process
Some entries in the EHDI Diagnostic Data do not appear in the Electronic Birth Certificates
[Diagram labels: Birth Certificates; EHDI Diagnostic Data]

7
Overview of Linkage Process
Final linkage is a subset of each
[Diagram labels: Birth Certificates; EHDI Diagnostic Data]

8
Linkage Algorithms
Deterministic
– Exactly match on specified common fields
– Easiest, quickest linkage strategy
– Misconception that this is the gold standard

9
Linkage Algorithms
Deterministic
– May result in significant bias
  Non-traditional spellings in African American names
– Results in errors due to non-links
  Many non-links can result in greater bias than a few erroneous pairings
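To make the non-link problem concrete, here is a minimal sketch of deterministic linkage. The function name `deterministic_link`, the field names, and the sample records (including the variant spelling "Brezinski") are hypothetical and not taken from the presentation.

```python
def deterministic_link(records_a, records_b, fields):
    """Pair records that agree exactly on every field in `fields`."""
    # Index the second dataset by its exact-match key.
    index = {}
    for rec in records_b:
        key = tuple(rec[f] for f in fields)
        index.setdefault(key, []).append(rec)

    pairs = []
    for rec in records_a:
        key = tuple(rec[f] for f in fields)
        for match in index.get(key, []):
            pairs.append((rec["id"], match["id"]))
    return pairs


# Hypothetical records: the same child appears with two spellings of the last name.
births = [
    {"id": "B1", "first": "Zbignew", "last": "Brezinsky", "dob": "2004-03-01"},
    {"id": "B2", "first": "John", "last": "Smith", "dob": "2004-03-02"},
]
diagnostics = [
    {"id": "D1", "first": "Zbignew", "last": "Brezinski", "dob": "2004-03-01"},
    {"id": "D2", "first": "John", "last": "Smith", "dob": "2004-03-02"},
]

links = deterministic_link(births, diagnostics, ["first", "last", "dob"])
print(links)  # B1/D1 is missed because of a one-letter spelling difference
```

A single-character difference in one field is enough to produce a non-link, which is the source of the bias described above.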

10
Linkage Algorithms
Probabilistic
– Statistically estimate the likelihood or odds that two records are for the same individual, even if they disagree on some fields

11
Linkage Algorithms
Factors Impacting Probabilistic Linkage
– Likelihood that fields would agree if a correct link
  Good quality data counts more than poor quality data
– Likelihood that fields would agree if not a correct link
  Rare values count more than common values
– Number of expected matches
Much more complicated and expensive strategy

12
Implementing an Effective Data Linkage
[Cartoon: a flowchart running from "Start" through "Then a miracle occurs" to "out". Caption: "Good work, but I think we might need just a little more detail right here."]
Modified from Kim Church, Maine Genetics Program

13
Probabilistic Matching
Probabilistic matching: two records are not required to match in all fields
– Two records are compared on each of the specified fields.
– A weight w_i is calculated for each field in a potential match, reflecting the strength of the agreement or disagreement
[Diagram labels: w_1, w_2]

14
Factors Influencing Likelihood of Match
Reliability of data fields
– Greater reliability results in increased odds of a correct match
  A match on a high-quality, reliably entered field is good
  Not matching on a poor-quality field with lots of known data entry errors may not be a fatal error
– If a field is pure noise, correct matches will be random across the databases

15
Factors Influencing Likelihood of Match
Frequency of field values
– The more common the value in a field, the greater the odds that the records will be erroneously matched
  A match based on the name Zbignew is a relatively good indicator of a match, even if there may be disagreement in other fields
  A match based on the name John may be of much less value, requiring matches on more fields in order to conclude two records are the same individual
Number of expected matches one would obtain randomly

16
Calculating Match Weights
Weight Calculation
– M-probability
  Probability that a field agrees if the pairing reflects a correct match
– U-probability
  Probability that a field agrees if the pairing reflects an incorrect match
  Chance that a given field will agree randomly
  Approximately = # records with a specific value / total # of records
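The frequency-based approximation of the U-probability can be sketched as follows. The helper name `estimate_u` and the name counts are made up for illustration; real data would come from the files being linked.

```python
from collections import Counter


def estimate_u(values):
    """Approximate the U-probability of each value as
    (# records with that value) / (total # of records)."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


# Hypothetical first-name column with one rare name among 100 records.
first_names = ["John"] * 40 + ["Mary"] * 59 + ["Zbignew"]
u = estimate_u(first_names)
print(u["John"], u["Zbignew"])
```

The common name receives a much larger U-probability than the rare one, so agreement on it carries less evidence of a true match.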

17
Probabilistic Matching
If the field agrees, w_i is equal to …
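The equation on this slide was an image and did not survive extraction. Assuming the deck follows the standard Fellegi-Sunter formulation with base-2 logs (the Notes slide mentions log 2 throughout), the agreement weight would be:

```latex
w_i = \log_2\!\left(\frac{m_i}{u_i}\right)
```

Here m_i is the M-probability and u_i the U-probability for field i, as defined on the weight-calculation slide.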

18
Probabilistic Matching
– m_i for first name = .98; that is, 98% of the time, if it's a correct match, the first names will agree
– u_i for Zbignew is .00001, the probability of randomly getting two first names that are Zbignew
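As a worked example, the sketch below plugs these values into the standard base-2 agreement weight, log2(m/u). That formula is an assumption here, since the slide's own equation was not preserved, and the U-probability used for "John" is invented purely for contrast.

```python
import math


def agreement_weight(m, u):
    """Standard Fellegi-Sunter agreement weight in base 2 (assumed form)."""
    return math.log2(m / u)


w_zbignew = agreement_weight(0.98, 0.00001)  # values from the slide
w_john = agreement_weight(0.98, 0.03)        # u = 0.03 is a made-up value
print(round(w_zbignew, 2))  # about 16.58
print(w_john < w_zbignew)   # agreement on a common name carries less weight
```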

19
Probabilistic Matching
In cases where two records disagree on a specified field, w_i is equal to …
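The disagreement equation was also an image and is missing from this transcript. Under the same assumption of a standard Fellegi-Sunter formulation in base 2, the disagreement weight would be:

```latex
w_i = \log_2\!\left(\frac{1 - m_i}{1 - u_i}\right)
```

Because m_i is normally much larger than u_i, this quantity is negative, so a disagreement lowers the total weight for the pair.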

20
Probabilistic Matching
– m_i for last name = .96; that is, 96% of the time, if it's a correct match, the last names will agree
– u_i for Brezinsky is .00003, the probability of randomly getting two last names that are Brezinsky
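The sketch below evaluates both weights for these last-name values, assuming the standard base-2 forms log2(m/u) for agreement and log2((1-m)/(1-u)) for disagreement. The slide's own equations were not preserved, so these forms are an assumption.

```python
import math


def field_weight(agrees, m, u):
    """Base-2 Fellegi-Sunter weight for one field (assumed standard form)."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


m_last, u_last = 0.96, 0.00003  # values from the slide
w_agree = field_weight(True, m_last, u_last)
w_disagree = field_weight(False, m_last, u_last)
print(round(w_agree, 2), round(w_disagree, 2))  # about 14.97 and -4.64
```

The penalty for disagreeing is much smaller in magnitude than the reward for agreeing, because a 4% error rate makes disagreement on a true match fairly plausible.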

21
Calculating Match Weights
A composite weight, w_t, is calculated for each pair of records
– The sum of weights across all fields used in linkage
Larger w_t suggests a correct match; smaller or negative w_t suggests an incorrect match.
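A minimal sketch of the composite weight follows. It assumes the standard base-2 field weights and, for simplicity, a single (m, u) pair per field rather than the value-specific u described earlier. The record contents and the function names are hypothetical.

```python
import math


def field_weight(agrees, m, u):
    """Base-2 weight for one field (assumed Fellegi-Sunter form)."""
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))


def composite_weight(rec_a, rec_b, params):
    """Sum the field weights over every field used in the linkage."""
    return sum(
        field_weight(rec_a[field] == rec_b[field], m, u)
        for field, (m, u) in params.items()
    )


# (m, u) per field; the numbers reuse the slide examples.
params = {"first": (0.98, 0.00001), "last": (0.96, 0.00003)}

a = {"first": "Zbignew", "last": "Brezinsky"}
b = {"first": "Zbignew", "last": "Brezinski"}  # hypothetical spelling variant
c = {"first": "John", "last": "Smith"}

w_partial = composite_weight(a, b, params)  # first agrees, last disagrees
w_none = composite_weight(a, c, params)     # nothing agrees
print(round(w_partial, 2), w_none < 0)
```

The pair that disagrees only on last name still ends up with a clearly positive total, while the pair that agrees on nothing is negative.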

22
Blocking
Match Determination
– Could compare every record in one dataset with every record in the second dataset
  Results in N_1 x N_2 comparisons
– Blocking
  Records are first blocked on a subset of fields for which a deterministic match is required.
  Within each block, all records from one dataset are compared to all records from the other dataset.
  w_t is calculated for each of these possible pairings.
  The distribution of w_t values across all blocks is examined in order to determine a critical cut-off score necessary to classify two records as a match.
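The blocking step can be sketched as below. The generator `candidate_pairs`, the choice of date of birth as the blocking field, and the sample records are assumptions for illustration; the presentation does not say which fields were used.

```python
from collections import defaultdict
from itertools import product


def candidate_pairs(records_a, records_b, block_fields):
    """Yield only the pairs whose blocking fields match exactly."""
    blocks = defaultdict(lambda: ([], []))
    for rec in records_a:
        blocks[tuple(rec[f] for f in block_fields)][0].append(rec)
    for rec in records_b:
        blocks[tuple(rec[f] for f in block_fields)][1].append(rec)
    # Within each block, compare every record from A with every record from B.
    for side_a, side_b in blocks.values():
        yield from product(side_a, side_b)


# Hypothetical data: four records per file, two dates of birth.
file_a = [{"id": f"A{i}", "dob": "2004-03-01" if i < 2 else "2004-03-02"} for i in range(4)]
file_b = [{"id": f"B{i}", "dob": "2004-03-01" if i < 2 else "2004-03-02"} for i in range(4)]

pairs = list(candidate_pairs(file_a, file_b, ["dob"]))
print(len(pairs), "pairs instead of", len(file_a) * len(file_b))
```

Each surviving pair would then be scored with w_t, and the distribution of those scores inspected to choose a cut-off.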

25
Estimating Probabilities
The total weight required for two records to have a probability, p, of being a match is equal to…
– Where p is the desired probability of a match,
– E is the number of expected matches,
– N_1 and N_2 are the number of records in each database,
– and one term of the formula (an equation image not preserved here) is the base 2 log of the odds of a random match
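The equation itself is not preserved in this transcript. One reconstruction consistent with the definitions listed, assuming the odds that a randomly chosen pair is a match are E / (N_1 N_2 - E), is:

```latex
w_t = \log_2\!\left(\frac{p}{1 - p}\right) - \log_2\!\left(\frac{E}{N_1 N_2 - E}\right)
```

This follows from Bayes' rule in log-odds form: the log odds of a match after seeing the fields equal the log odds of a random match plus the composite weight. The exact form of the original slide's equation may differ.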

26
Estimating Probabilities
From this formula, it is possible to derive an equation for estimating the probability that any two records are a match
[Equation image not preserved. Surviving labels: "if two fields agree, and…"; "if two fields do not agree"; "odds of a random match"]
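A hedged sketch of the two directions of this calculation is given below: the weight needed for a target probability, and the probability implied by a weight. It assumes the odds of a random match are E / (N1 * N2 - E) and that weights are in base 2. The function names and the example counts are hypothetical.

```python
import math


def random_match_log_odds(expected_matches, n1, n2):
    """Base-2 log of the odds that a randomly chosen pair is a true match."""
    return math.log2(expected_matches / (n1 * n2 - expected_matches))


def required_weight(p, expected_matches, n1, n2):
    """Total weight needed for a pair to have probability p of being a match."""
    return math.log2(p / (1 - p)) - random_match_log_odds(expected_matches, n1, n2)


def match_probability(w_t, expected_matches, n1, n2):
    """Probability of a match for total weight w_t (base-2 logistic form)."""
    log_odds = w_t + random_match_log_odds(expected_matches, n1, n2)
    return 1.0 / (1.0 + 2.0 ** (-log_odds))


# Hypothetical files: 1,000 records each, 500 expected true matches.
cutoff = required_weight(0.95, 500, 1000, 1000)
p_back = match_probability(cutoff, 500, 1000, 1000)
print(round(cutoff, 2), round(p_back, 2))
```

The round trip from probability to weight and back recovers the target probability, which is a quick sanity check on the two formulas.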

27
Notes
Note that the probability equation is equivalent to a base-2 version of the logistic probability formula
The computational formula avoids the need to repeatedly calculate powers of 2 and log_2
– This is due to the weights in the exponent themselves being log values
The same probability is obtained using e and the natural log in place of 2 and log_2 throughout
– Base 2 results in improved computational speed
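These notes can be checked numerically. The sketch below assumes the computational formula multiplies the odds of a random match by each field's ratio (m/u on agreement, (1-m)/(1-u) on disagreement). That reading is an inference, since the slide's equation is missing, and the input numbers are illustrative.

```python
import math


def prob_base2(ratios, prior_odds):
    """Logistic form with base-2 weights."""
    exponent = sum(math.log2(r) for r in ratios) + math.log2(prior_odds)
    return 2 ** exponent / (1 + 2 ** exponent)


def prob_natural(ratios, prior_odds):
    """Same formula with e and the natural log."""
    exponent = sum(math.log(r) for r in ratios) + math.log(prior_odds)
    return math.exp(exponent) / (1 + math.exp(exponent))


def prob_direct(ratios, prior_odds):
    """Computational form: no logs or powers, just a product of ratios."""
    odds = prior_odds * math.prod(ratios)
    return odds / (1 + odds)


# First name agrees (m=.98, u=.00001); last name disagrees (m=.96, u=.00003).
ratios = [0.98 / 0.00001, (1 - 0.96) / (1 - 0.00003)]
prior_odds = 500 / (1000 * 1000 - 500)  # hypothetical E, N1, N2

results = [f(ratios, prior_odds) for f in (prob_base2, prob_natural, prob_direct)]
print(results)
```

All three routes give the same probability up to floating-point error; the choice of base affects only how the intermediate weights are expressed.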

28
That's nice, but …..
"All right. But apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, the fresh water system, and public health… What have the Romans ever done for us?"
--- Reg, spokesman for the People's Front of Judea, Monty Python's Life of Brian (and Martin White, UC Berkeley)


© 2017 SlidePlayer.com Inc.

All rights reserved.
