Deterministic Record Linking

1 Deterministic Record Linking
University of North Carolina, Chapel Hill Hye-Chung Kum

2 Example
EISID : E1  ssn : 085-66-9980  first name : Sally  last name : Hill  MI : L  DOB : 3/4/1999
EISID : E2  ssn :  first name : Emily  last name : Brown  MI : K  DOB : 6/2/2004
EISID : E3  ssn :  first name : Mary  last name : Johnson  MI : G  DOB : 5/13/1983
EISID : E4  ssn :  first name : David  last name : Ford  MI : J  DOB : 10/25/1990
SISID : S1  ssn :  first name : Sally  last name : Hill  MI : L  DOB : 3/4/1999
SISID : S2  ssn :  first name : Emily  last name : Brown  MI : K  DOB : 6/2/2004
SISID : S3  ssn :  first name : Mary  last name : Hawkins  MI : J  DOB : 5/13/1983
SISID : S4  ssn :  first name : David  last name : Ford  DOB : 10/23/1990

3 Exact Match
EISID : E1  ssn : 085-66-9980  first name : Sally  last name : Hill  MI : L  DOB : 3/4/1999
SISID : S1  ssn :  first name : Sally  last name : Hill  MI : L  DOB : 3/4/1999

4 Approximate Matching I : SSN
EISID : E2  ssn :  first name : Emily  last name : Brown  MI : K  DOB : 6/2/2004
SISID : S2  ssn :  first name : Emily  last name : Brown  MI : K  DOB : 6/2/2004

5 Approximate Matching II : DOB
EISID : E4  ssn :  first name : David  last name : Ford  MI : J  DOB : 10/25/1990
SISID : S4  ssn :  first name : David  last name : Ford  MI : J  DOB : 10/23/1990

6 Approximate Matching III : Name
EISID : E3  ssn :  first name : Mary  last name : Johnson  MI : G  DOB : 5/13/1983
SISID : S3  ssn :  first name : Mary  last name : Hawkins  MI : J  DOB : 5/13/1983

7 Deterministic Record Linking
Deterministic record linking
  Allows for approximate matching
  Uses explicit approximate rules
  Pro : can control the linkage process
  Con : difficult to implement
Alternative : Probabilistic record linking
  Also approximate matching
  However, uses general rules specified by users
  Based on total probability
  Con : cannot control exactly what is considered a match
  Pro : can use specialized software

8 Approximate Matching : DOB
Element to element match : date, month, year
  Allow for one element difference
  Allow for month and day transposed
DOB : one element
  dob1 : 10/25/1990
  dob2 : 10/23/1990
DOB : transpose
  dob1 : 11/7/1995
  dob2 : 7/11/1995
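The two DOB rules on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `dob_match` and the returned labels are hypothetical and do not come from the slides.

```python
from datetime import date

def dob_match(d1, d2):
    """Classify two dates of birth as 'equal', 'transpose',
    'one_element', or 'diff' (labels are illustrative only)."""
    if d1 == d2:
        return "equal"
    a = (d1.month, d1.day, d1.year)
    b = (d2.month, d2.day, d2.year)
    # Month and day swapped, year unchanged.
    if a[2] == b[2] and a[0] == b[1] and a[1] == b[0]:
        return "transpose"
    # Exactly one of month, day, year differs.
    if sum(x != y for x, y in zip(a, b)) == 1:
        return "one_element"
    return "diff"

print(dob_match(date(1990, 10, 25), date(1990, 10, 23)))
print(dob_match(date(1995, 11, 7), date(1995, 7, 11)))
```

The two calls use the dob1/dob2 pairs from the slide. The transpose check runs before the element count because a transposed pair differs in two elements and would otherwise fall through to 'diff'.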

9 Approximate Matching : Name
First name soundex match
First name is approx
  one letter different (insert or replace) and/or substr
lsound equal or lname approx
  MI=FI
  FI equal
Fsound & Lsound swapped
obs fname kfname mi kmi
1 RUDOLPH RULDOLPH A
2 ALIJAH ALIYAH M
3 CAROL CAROLYN J
4 ANGELIQUE ANGIE D
5 JOHNNY JOHNNY JR L
6 ZACHARY ZACK
7 MICHAEL
8 ANTON COUDRAY C
9 ARTHUR AUTHOR R
10 EDWIN EDDIE
11 GOLDY OWENS

10 Approximate Matching : Name
obs fname kfname mi kmi lname klname
1 RUDOLPH RULDOLPH A SIMARD
2 ALIJAH ALIYAH M FOSS
3 CAROL CAROLYN J YOUNG
4 ANGELIQUE ANGIE D OUELLETTE
5 JOHNNY JOHNNY JR L MAYO
6 ZACHARY ZACK ROGERS
7 MICHAEL GALLAGHER
8 ANTON COUDRAY C CYPRESS
9 ARTHUR AUTHOR R DAVIS
10 EDWIN EDDIE KAHKONE
11 GOLDY OWENS
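The soundex-based name rules (fsound/lsound equal, MI=FI, FI equal, fsound & lsound swapped) might be sketched as below. The slides do not say which Soundex variant was used, so this assumes standard American Soundex. The helper names and the reading of "MI=FI" (middle initial in one record equals the first initial in the other) are assumptions made for illustration.

```python
# Letter groups for standard American Soundex digits 1-6.
_GROUPS = ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"]
_CODE = {ch: str(d) for d, grp in enumerate(_GROUPS, start=1) for ch in grp}

def soundex(name):
    """Four-character American Soundex code, or '' for an empty name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0]
    prev = _CODE.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODE.get(ch, "")
        if code and code != prev:
            out += code
        # H and W do not separate letters that share a code; vowels do.
        if ch not in "HW":
            prev = code
    return (out + "000")[:4]

def initial(name):
    """First letter of a name, upper-cased; '' if missing."""
    return (name or "").strip().upper()[:1]

def mi_equals_fi(mi, other_fname):
    """Assumed reading of MI=FI: middle initial equals the other first initial."""
    return initial(mi) != "" and initial(mi) == initial(other_fname)

def fi_equal(fname, kfname):
    """First initials are present and equal."""
    return initial(fname) != "" and initial(fname) == initial(kfname)

def names_swapped(fname, lname, kfname, klname):
    """First and last name sounds are exchanged between the two records."""
    f, l = soundex(fname), soundex(lname)
    kf, kl = soundex(kfname), soundex(klname)
    return f != "" and l != "" and f == kl and l == kf

print(soundex("RUDOLPH"), soundex("RULDOLPH"))
```

For the first pair in the table, the codes come out as R341 and R434, so a soundex comparison alone would miss it. That is the kind of case the separate "one letter different" rule is there to catch.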

11 Match on ssn (ssn equal)
1 : dob, fsound equal
dob approx
  2 : dob approx, fsound equal
  3 : dob approx, fname approx
  4 : dob approx, lsound equal, & fsound diff, but MI=FI
  5 : dob approx, lsound equal, & fsound diff, but FI equal
  6 : dob approx, lsound and fsound swapped
  7 : dob approx, lname approx & fsound diff but MI=FI (4 with lname approx rather than equal) or FI equal (5 with lname approx rather than equal)
dob mismatch
  8 : fname approx, lsound equal, and dob diff
  9 : fname approx, lsound approx, and dob diff

13 Approximate Matching : SSN
Digit to digit match
  Allow for one digit difference
  Allow for two digit difference if transposed
SSN : one digit
  ssn1 :
  ssn2 :
SSN : transpose
  ssn1 :
  ssn2 :
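A digit-by-digit SSN comparison could look like the sketch below. It borrows the sdiff codes listed on the "Link type" slide (-1 missing, 2 transposed, 10 two digits different but not transposed). Returning the plain count of differing digits in every other case is an assumption, as is requiring the transposed digits to be adjacent. The SSNs in the example are made up.

```python
def normalize_ssn(s):
    """Return the 9 digits of an SSN, or None if it is missing or malformed."""
    if s is None:
        return None
    digits = "".join(c for c in str(s) if c.isdigit())
    return digits if len(digits) == 9 else None

def sdiff(s1, s2):
    """Difference code for two SSNs, loosely following the slide's sdiff."""
    a, b = normalize_ssn(s1), normalize_ssn(s2)
    if a is None or b is None:
        return -1  # one or both ssn missing
    pos = [i for i in range(9) if a[i] != b[i]]
    if len(pos) == 2:
        i, j = pos
        # Assumption: a transposition swaps two adjacent digits.
        if j == i + 1 and a[i] == b[j] and a[j] == b[i]:
            return 2
        return 10  # two digits differ but are not transposed
    return len(pos)  # assumption: otherwise the number of differing digits

print(sdiff("123-45-6789", "123-45-6780"))
print(sdiff("123-45-6789", "123-46-5789"))
```

Under the slide's rule, a pair would be accepted as an approximate SSN match when the code is 1 (one digit) or 2 (transposed).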

14 Match on ndob (dob+fsound)
ssn missing
  1 : lname equal
  2 : lname approx
ssn approx
  3 : lname equal
  4 : lname approx
  5 : lname diff but fname equal
ssn different
  11 : lname equal
  12 : lname approx
lname different
  51 : ssn approx
  52 : ssn missing
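Because every rule here is a combination of an SSN status and a last-name status, the link types can be written as a small lookup. The sketch below is one possible reading of the slide. The status strings, the function name `ndob_link_type`, and the way type 5 is separated from type 51 (by whether the first names are exactly equal) are assumptions.

```python
# (ssn status, lname status) -> link type, for pairs that already agree on
# ndob (dob + fsound). Status labels are illustrative.
NDOB_RULES = {
    ("missing", "equal"): 1,
    ("missing", "approx"): 2,
    ("approx", "equal"): 3,
    ("approx", "approx"): 4,
    ("different", "equal"): 11,
    ("different", "approx"): 12,
}

def ndob_link_type(ssn_status, lname_status, fname_equal=False):
    """Return the link type for an ndob match, or None if no rule applies."""
    if lname_status == "different":
        if ssn_status == "approx":
            # Assumed distinction between types 5 and 51.
            return 5 if fname_equal else 51
        if ssn_status == "missing":
            return 52
        return None
    return NDOB_RULES.get((ssn_status, lname_status))

print(ndob_link_type("missing", "approx"))
print(ndob_link_type("approx", "different", fname_equal=True))
```

Keeping the rules in a table makes it easy to see which combinations have no rule at all (for example ssn different with lname different), which the function reports as None.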

20 Example record pairs
obs SSN kSSN fname kfname lname klname
1 244572812 . APPOLONIA GAVINS
2 ABEL LOMELIGARCIA LOMELI
3 JOSH JOSHUA PHIPPS
4 LENA COOPER
5 MILES KNIGHT JR. KNIGHT
6 MARTHA LYDA HOPKINS
7 AUSTIN AUSTYN TERWILLIGER OMEARA
8 ALISIA ALICE GRAVES WATSON
9 ANNA ANAYA MONTAGUE BOLDING
Matches in green are NOT considered to be a match

23 Example record pairs
obs SSN kSSN fname kfname lname klname
1 244572812 . APPOLONIA GAVINS
2 ABEL LOMELIGARCIA LOMELI
3 JOSH JOSHUA PHIPPS
4 LENA COOPER
5 MILES KNIGHT JR. KNIGHT
6 MARTHA LYDA HOPKINS
7 AUSTIN AUSTYN TERWILLIGER OMEARA
8 ALISIA ALICE GRAVES WATSON
9 ANNA ANAYA MONTAGUE BOLDING
10 BRITTNEY REVELS
11 DANIEL ROBINSON
12 HELEN HALL HOLLER
13 DEBORAH DEBRA LEE LEACH
14 ABIGAHIL GARCIATREJO TREJO
15 APSLEY CARLYLE KARYLE
16 ABIGAIL GENTRY KING
17 RODRIGUEZRINCON HERNANDEZ
18 ABIGAYLE FITZGERALD

25 Match on name (fname+lname)
ssn missing & dob approx
  1 : MI equal
  7 : MI missing
  8 : MI not equal
ssn approx
  3 : dob equal
  dob approx
    4 : one element
    5 : transpose
obs ssn kssn dob kdob
1 09/06/09 09/06/08
2 12/09/75 12/09/76
3 07/15/20 07/07/20
4 11/12/14 12/11/14
5 12/08/94 08/12/94

26 Match on name (fname+lname)
obs Type ssn kssn dob kdob fname lname
1 4 09/06/09 09/06/08 MARION MONTAGUE
2 12/09/75 12/09/76 WILLIAM JOHNSON
3 07/15/20 07/07/20 WILLIE GRANT
4 5 11/12/14 12/11/14 GLADYS SOUTHARD
5 12/08/94 08/12/94 TAYLOR FORD
6 52 09/11/77 . NICOLE PARKER
7 07/07/88 ASAJAH ROSS
8 100 01/31/05 10/31/99 PATRICIA BANEGAS
9 01/12/88 02/12/88 DANIEL ANDRONIC
10 02/27/89 11/15/89 VICTORIA HORN

28 link
Put together all links found
Identify indirect duplicates (type2>10000)
  i.e. both EISID1 & EISID2 link to identical SISID1
  Consider indirect duplicates on both EIS & SIS
Create unique link and indirect duplicate files
  Keep only the first id in data file link
  Create indirect duplicates files dupeis2 & dupsis2
TODO : explore indirect duplicates
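The indirect-duplicate step can be illustrated with a short Python sketch. It assumes the links arrive as (eisid, sisid) pairs already sorted so that the preferred link comes first. The function name `split_links` and the tuple layout of the outputs are hypothetical, and the slides do not describe the actual implementation.

```python
def split_links(links):
    """Split (eisid, sisid) pairs into unique links and indirect duplicates.

    Returns (unique, dupeis2, dupsis2):
      unique  : first link seen for each eisid and each sisid
      dupeis2 : (eisid, kept_eisid) where both link to the same sisid
      dupsis2 : (sisid, kept_sisid) where both link to the same eisid
    """
    first_eis_for = {}  # sisid -> first eisid linked to it
    first_sis_for = {}  # eisid -> first sisid linked to it
    unique, dupeis2, dupsis2 = [], [], []
    for eisid, sisid in links:
        kept_eis = first_eis_for.get(sisid)
        kept_sis = first_sis_for.get(eisid)
        if kept_eis is None and kept_sis is None:
            # Neither id has been linked yet: keep this link.
            first_eis_for[sisid] = eisid
            first_sis_for[eisid] = sisid
            unique.append((eisid, sisid))
            continue
        if kept_eis is not None and kept_eis != eisid:
            dupeis2.append((eisid, kept_eis))
        if kept_sis is not None and kept_sis != sisid:
            dupsis2.append((sisid, kept_sis))
    return unique, dupeis2, dupsis2

links = [("E1", "S1"), ("E2", "S1"), ("E3", "S3"), ("E3", "S4")]
print(split_links(links))
```

In the example, E1 and E2 both link to S1, so E2 is reported as an indirect duplicate of E1, and S4 is reported as an indirect duplicate of S3. The sketch does not follow chains (an id that was set aside as an indirect duplicate is not tracked afterwards), which a full implementation would need to handle.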

29 Create unique list of EIS & SIS
Generate unique full list of each set of ids
  use linkage info
  Link in the duplicates (dupeis & dupsis)
TODO : link in the indirect duplicates eis & sis

30 Data flow
Diagram : Link eis to sis
Files : eis.sas7bdat, sis.sas7bdat, ueis.sas7bdat, usis.sas7bdat, dupeis.sas7bdat, dupsis.sas7bdat, link.sas7bdat, dupeis2.sas7bdat, dupsis2.sas7bdat, eisid.sas7bdat, sisid.sas7bdat
Labels : duplicates, unduplicated, unique records
Counts : 4,308,863; 4,277,402 (99%); 31,461; 1,888,747; 1,638,112 (87%); 250,635; 1,173,404 (27%, 72%); 1270; 493; 28%; 74%

31 Type of links
Exact match   Approx match   (miss)   Freq   %   cum %
ssn, dob, fsound   781094   66.57%
ssn, fsound   dob   52173   4.45%   71.01%
ssn   dob, fsound   10959   0.93%   71.95%
ssn, lsound   fname   (dob mismatch)   9320   0.79%   72.74%
other   7095   0.60%   73.35%
dob, fsound, lname   (ssn=.)   251124   21.40%   94.75%
lname   16189   1.38%   96.13%
23653   2.02%   98.14%
(ssn mismatch)   15544   1.32%   99.47%
4398   0.37%   99.84%
fname, lname   1855   0.16%   100.00%
TOTAL

32 Type of duplicates and links
Type   EIS : freq % cum %   SIS : freq % cum %
DLD 3270 0.08% 4345 0.23%
DLX 8790 0.20% 0.28% 228039 12.07% 12.30%
DXX 19401 0.45% 0.73% 18251 0.97% 13.27%
PLD 3221 0.07% 0.80% 0.17% 13.44%
PLX 8706 1.01% 185066 9.80% 23.24%
PXX 19198 1.45% 16929 0.90% 24.14%
XLD 4.30% 5.75% 0.46% 24.60%
XLX 976411 22.66% 28.41% 51.70% 76.29%
XXX 71.59% 100.00% 447779 23.71%
TOT

33 Number of Duplicates
dups   EIS : freq sets % cum %   SIS : freq sets % cum %
1 4246277 98.55% 1432896 75.86%
2 61600 30800 1.43% 99.98% 338251 169125 17.91% 93.77%
3 942 314 0.02% 100.00% 86379 28793 4.57% 98.35%
4 44 11 0.00% 22928 5732 1.21% 99.56%
5 6020 1204 0.32% 99.88%
6 1662 277 0.09% 99.97%
7 497 71 0.03% 99.99%
8 96 12 0.01%
9 18
TOT

34 Implementation details
Ndob & name must be looped
  multiple matches
Too many match on name
  use half of ssn
  Overlap for transpose

35 Basic Process
Unduplicate EIS (dupeis)
Unduplicate SIS (dupsis)
Link unduplicated EIS & SIS (link)
Generate unique full list of each set of ids (list)
  use linkage info
  Link in the duplicates eis & sis

36 Unduplication
Same as matching between different systems
  Except, match the database to itself
  i.e. EIS to EIS, SIS to SIS
Randomly select one as Primary
TODO : for those not linked using primary ID, try with duplicate ID
TODO : explore indirect duplicate links

37 Conclusion
Future work :
  indirect duplicates
  Link using duplicates
SSNs have been changed from real data

38 Thank You !

39 Type of id
first letter : duplicates status of the id
  P : primary id with duplicates
  D : duplicates (primary info given with prefix ‘l’)
  X : no duplicates
second letter : link status
  L : linked
  X : no linked id
third letter : duplicates status of the linked id
  D : duplicates exist for the linked id
  X : no duplicates for the linked id
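The three-letter type code is simple enough to build with a small function. The sketch below is illustrative only: the function name `id_type` and the argument names are made up, and the choice to force the third letter to X whenever the id is not linked is an assumption (it is consistent with the nine types listed in the "Type of duplicates and links" table).

```python
def id_type(dup_status, linked, linked_has_dups=False):
    """Build the three-letter id type.

    dup_status      : 'primary', 'duplicate', or 'none'
    linked          : True if the id is linked to an id in the other system
    linked_has_dups : True if that linked id has duplicates
    """
    first = {"primary": "P", "duplicate": "D", "none": "X"}[dup_status]
    second = "L" if linked else "X"
    # Assumption: an unlinked id cannot have a 'D' third letter.
    third = "D" if (linked and linked_has_dups) else "X"
    return first + second + third

print(id_type("primary", True))
print(id_type("none", False))
```

The first call gives PLX (a primary id with duplicates, linked to an id that has no duplicates) and the second gives XXX (no duplicates and no link).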

40 EIS & SIS Table
Unique full list of EIS (or SIS) ids
Type : type of id (XXX), see the Type of id slide
All eis info have no prefix
All sis info have prefix ‘k’
Prefix ‘l’ is the link id info
freqeis & freqsis : # of duplicate ids
pindid (eis) & pkindid (sis) is the primary id
indid1-indid3 & kindid1-kindid8 are the duplicate ids

41 Link type
sdiff : # digits different in ssn
  -1 : one or both ssn is missing
  2 : two digits are transposed
  10 : two digits are different but not transposed
ddiff : diff in dob
  -1 : one or both dob is missing
  2 : date and month are transposed
  3 : date, month and year are different
  4 : date and month are different
fdiff (ldiff) : difference in first (last) name
  -1 : one or both are missing
  1 : one letter difference (INDEL or REPL)
  100 : one is a substring of the other
  101 : one letter diff & substring
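The fdiff/ldiff codes read like additive flags (1 for a one-letter difference, 100 for a substring, 101 for both), and the sketch below implements them that way. Treating the code as a sum, returning 0 for identical names, and returning None for any other difference are assumptions. The function names are hypothetical.

```python
def one_letter_diff(a, b):
    """True if a and b differ by exactly one insert/delete or one replacement."""
    if a == b:
        return False
    if len(a) > len(b):
        a, b = b, a
    if len(b) - len(a) > 1:
        return False
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one replacement (REPL)
    return a[i:] == b[i + 1:]          # one insert/delete (INDEL)

def name_diff(a, b):
    """Name difference code: -1 missing, 0 equal, 1, 100, 101, or None."""
    a = (a or "").strip().upper()
    b = (b or "").strip().upper()
    if not a or not b:
        return -1
    if a == b:
        return 0
    code = 0
    if one_letter_diff(a, b):
        code += 1
    if a in b or b in a:
        code += 100
    return code if code else None

print(name_diff("RUDOLPH", "RULDOLPH"))
print(name_diff("CAROL", "CAROLYN"))
```

With the first-name pairs from the name-matching slides, RUDOLPH/RULDOLPH gives 1 (one inserted letter) and CAROL/CAROLYN gives 100 (substring). A pair such as CAROL/CAROLE, used here only as a made-up example, would give 101.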

42 Duplicate type
If duplicate id
  Primary id info is given with prefix “l”
  Duplicate type : lsdiff, lddiff, lfdiff, & lldiff
If primary id
  # of duplicates : freqeis & freqsis
  Duplicate ids : indid1-indid3 (eis) & kindid1-kindid8 (sis)

43 Other tables
link
  Linkage between the primary eis & sis ids
dupeis & dupsis
  List of duplicates with primary id

44 Data flow
eisid : 4,308,863
  ueis (4,277,402) + dupeis (31,461) : 99%
sisid : 1,888,747
  usis (1,638,112) + dupsis (250,635) : 87%
Link : 1,173,404 (eis: 27%, sis: 72%)
  dupeis2 (1,270) + dupsis2 (493)
EIS : 4,308,863 (28%)
SIS : 1,888,747 (74%)

