1
Deterministic Record Linking
Hye-Chung Kum
University of North Carolina, Chapel Hill
2
Example
EISID : E1 | ssn : 085-66-9980 | first name : Sally | last name : Hill | MI : L | DOB : 3/4/1999
EISID : E2 | ssn : | first name : Emily | last name : Brown | MI : K | DOB : 6/2/2004
EISID : E3 | ssn : | first name : Mary | last name : Johnson | MI : G | DOB : 5/13/1983
EISID : E4 | ssn : | first name : David | last name : Ford | MI : J | DOB : 10/25/1990
SISID : S1 | ssn : | first name : Sally | last name : Hill | MI : L | DOB : 3/4/1999
SISID : S2 | ssn : | first name : Emily | last name : Brown | MI : K | DOB : 6/2/2004
SISID : S3 | ssn : | first name : Mary | last name : Hawkins | MI : J | DOB : 5/13/1983
SISID : S4 | ssn : | first name : David | last name : Ford | DOB : 10/23/1990
3
Exact Match
EISID : E1 | ssn : 085-66-9980 | first name : Sally | last name : Hill | MI : L | DOB : 3/4/1999
SISID : S1 | ssn : | first name : Sally | last name : Hill | MI : L | DOB : 3/4/1999
4
Approximate Matching I : SSN
EISID : E2 | ssn : | first name : Emily | last name : Brown | MI : K | DOB : 6/2/2004
SISID : S2 | ssn : | first name : Emily | last name : Brown | MI : K | DOB : 6/2/2004
5
Approximate Matching II : DOB
EISID : E4 | ssn : | first name : David | last name : Ford | MI : J | DOB : 10/25/1990
SISID : S4 | ssn : | first name : David | last name : Ford | MI : J | DOB : 10/23/1990
6
Approximate Matching III : Name
EISID : E3 | ssn : | first name : Mary | last name : Johnson | MI : G | DOB : 5/13/1983
SISID : S3 | ssn : | first name : Mary | last name : Hawkins | MI : J | DOB : 5/13/1983
7
Deterministic Record Linking
Allow for approximate matching
  Use explicit approximate rules
  Pro: can control the linkage process
  Con: difficult to implement
Alternative : Probabilistic record linking
  Also approximate matching
  However, uses general rules specified by users
  Based on total probability
  Con: cannot control exactly what is considered a match
  Pro: can use specialized software
8
Approximate Matching : DOB
Element-to-element match : day, month, year
  Allow for one element difference
  Allow for month and day transposed
DOB : one element
  dob1 : 10/25/1990
  dob2 : 10/23/1990
DOB : transpose
  dob1 : 11/7/1995
  dob2 : 7/11/1995
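The two DOB rules above can be sketched in Python. This is only an illustration under stated assumptions, not the implementation behind these slides: the function name `dob_diff` and the string labels it returns are hypothetical.

```python
from datetime import date


def dob_diff(d1: date, d2: date) -> str:
    """Classify two dates of birth (labels are illustrative, not from the slides).

    Returns 'equal', 'transpose' (month and day swapped, same year),
    'one_element' (exactly one of month, day, year differs), or 'diff'.
    """
    a = (d1.month, d1.day, d1.year)
    b = (d2.month, d2.day, d2.year)
    if a == b:
        return "equal"
    # Month and day swapped, year unchanged.
    if d1.year == d2.year and d1.month == d2.day and d1.day == d2.month:
        return "transpose"
    # Count how many of the three elements disagree.
    mismatches = sum(x != y for x, y in zip(a, b))
    return "one_element" if mismatches == 1 else "diff"
```

With the slide's own examples, 10/25/1990 against 10/23/1990 is a one-element difference, and 11/7/1995 against 7/11/1995 is a transposition.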
9
Approximate Matching : Name
First name soundex match
First name is approx
  one letter different : insert or replace
  and/or substr
lsound equal or lname approx
  MI=FI
  FI equal
Fsound & Lsound swapped

obs fname kfname mi kmi
1 RUDOLPH RULDOLPH A
2 ALIJAH ALIYAH M
3 CAROL CAROLYN J
4 ANGELIQUE ANGIE D
5 JOHNNY JOHNNY JR L
6 ZACHARY ZACK
7 MICHAEL
8 ANTON COUDRAY C
9 ARTHUR AUTHOR R
10 EDWIN EDDIE
11 GOLDY OWENS
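The first-name rules above (soundex match, one letter different, substring) can be sketched as follows. The slides do not say which Soundex variant was used, so the standard four-character American Soundex is assumed here; the helper names `soundex`, `one_letter_apart`, and `is_substring` are hypothetical.

```python
def soundex(name: str) -> str:
    """Standard four-character American Soundex (assumed variant)."""
    groups = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
    code_of = {ch: digit for letters, digit in groups.items() for ch in letters}
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = code_of.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = code_of.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return (result + "000")[:4]


def one_letter_apart(a: str, b: str) -> bool:
    """True if the names differ by exactly one inserted, deleted, or replaced letter."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = sorted((a, b), key=len)
    # Try deleting each letter of the longer name in turn.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def is_substring(a: str, b: str) -> bool:
    """True if one non-empty name is contained in the other (and they differ)."""
    return bool(a) and bool(b) and a != b and (a in b or b in a)
```

On the table rows, RUDOLPH and RULDOLPH differ by one inserted letter, ALIJAH and ALIYAH by one replaced letter, and CAROL is a substring of CAROLYN.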
10
Approximate Matching : Name
obs fname kfname mi kmi lname klname
1 RUDOLPH RULDOLPH A SIMARD
2 ALIJAH ALIYAH M FOSS
3 CAROL CAROLYN J YOUNG
4 ANGELIQUE ANGIE D OUELLETTE
5 JOHNNY JOHNNY JR L MAYO
6 ZACHARY ZACK ROGERS
7 MICHAEL GALLAGHER
8 ANTON COUDRAY C CYPRESS
9 ARTHUR AUTHOR R DAVIS
10 EDWIN EDDIE KAHKONE
11 GOLDY OWENS
11
Match on ssn (ssn equal)
1 : dob, fsound equal
dob approx
  2 : dob approx, fsound equal
  3 : dob approx, fname approx
  4 : dob approx, lsound equal, & fsound diff, but MI=FI
  5 : dob approx, lsound equal, & fsound diff, but FI equal
  6 : dob approx, lsound and fsound swapped
  7 : dob approx, lname approx & fsound diff but MI=FI (4 with lname approx rather than equal) or FI equal (5 with lname approx rather than equal)
dob mismatch
  8 : fname approx, lsound equal, and dob diff
  9 : fname approx, lsound approx, and dob diff
13
Approximate Matching : SSN
Digit to digit match
  Allow for one digit difference
  Allow for two digit difference if transposed
SSN : one digit
  ssn1 :
  ssn2 :
SSN : transpose
  ssn1 :
  ssn2 :
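A minimal Python sketch of the digit-to-digit comparison follows. The return codes mirror the sdiff coding on the "Link type" slide (-1 missing, 2 transposed, 10 two digits different but not transposed, otherwise the number of differing digits). Several points are assumptions: the function names, returning 0 for identical SSNs, returning the raw count for three or more differing digits, and not requiring the two transposed digits to be adjacent. The SSN variants in the usage note are made up from the slide's example value.

```python
from typing import Optional


def sdiff(ssn1: Optional[str], ssn2: Optional[str]) -> int:
    """Digit-to-digit SSN comparison using the deck's sdiff-style codes."""
    if not ssn1 or not ssn2:
        return -1  # one or both SSNs missing
    a = ssn1.replace("-", "")
    b = ssn2.replace("-", "")
    if len(a) != len(b):
        raise ValueError("SSNs must have the same number of digits")
    # Positions where the two SSNs disagree.
    pos = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(pos) == 2:
        i, j = pos
        swapped = a[i] == b[j] and a[j] == b[i]
        return 2 if swapped else 10
    return len(pos)


def ssn_approx(ssn1: Optional[str], ssn2: Optional[str]) -> bool:
    """Approximate match: one digit different, or two digits transposed."""
    return sdiff(ssn1, ssn2) in (1, 2)
```

For example, `sdiff("085-66-9980", "085-66-9981")` gives 1 and `sdiff("085-66-9980", "085-66-9908")` gives 2, so both pairs count as approximate SSN matches under these rules.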
14
Match on ndob (dob+fsound)
ssn missing
  1: lname equal
  2: lname approx
ssn approx
  3: lname equal
  4: lname approx
  5: lname diff but fname equal
ssn different
  11 : lname equal
  12 : lname approx
lname different
  51: ssn approx
  52: ssn missing
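Because the link types are explicit rules, they can be written as a lookup table. The sketch below is one possible reading of the list above and rests on assumptions: records are already paired on ndob (dob + fsound), the comparison results arrive as the hypothetical labels "missing" / "approx" / "diff" for SSN and "equal" / "approx" / "diff" for last name, and type 5 takes precedence over type 51 when the first names are exactly equal. Pairs that fit no rule return `None`.

```python
from typing import Optional

# (ssn status, lname status) -> link type, as read from the rule list above.
NDOB_TYPES = {
    ("missing", "equal"): 1,
    ("missing", "approx"): 2,
    ("approx", "equal"): 3,
    ("approx", "approx"): 4,
    ("diff", "equal"): 11,
    ("diff", "approx"): 12,
    ("approx", "diff"): 51,
    ("missing", "diff"): 52,
}


def ndob_link_type(ssn: str, lname: str, fname_equal: bool = False) -> Optional[int]:
    """Return the link type for a pair already matched on dob + fsound."""
    # Assumed precedence: exact first-name equality upgrades 51 to 5.
    if ssn == "approx" and lname == "diff" and fname_equal:
        return 5
    return NDOB_TYPES.get((ssn, lname))
```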
20
obs SSN kSSN fname kfname lname klname
1 244572812 . APPOLONIA GAVINS
2 ABEL LOMELIGARCIA LOMELI
3 JOSH JOSHUA PHIPPS
4 LENA COOPER
5 MILES KNIGHT JR. KNIGHT
6 MARTHA LYDA HOPKINS
7 AUSTIN AUSTYN TERWILLIGER OMEARA
8 ALISIA ALICE GRAVES WATSON
9 ANNA ANAYA MONTAGUE BOLDING
Matches in green are NOT considered to be a match
23
obs SSN kSSN fname kfname lname klname
1 244572812 . APPOLONIA GAVINS
2 ABEL LOMELIGARCIA LOMELI
3 JOSH JOSHUA PHIPPS
4 LENA COOPER
5 MILES KNIGHT JR. KNIGHT
6 MARTHA LYDA HOPKINS
7 AUSTIN AUSTYN TERWILLIGER OMEARA
8 ALISIA ALICE GRAVES WATSON
9 ANNA ANAYA MONTAGUE BOLDING
10 BRITTNEY REVELS
11 DANIEL ROBINSON
12 HELEN HALL HOLLER
13 DEBORAH DEBRA LEE LEACH
14 ABIGAHIL GARCIATREJO TREJO
15 APSLEY CARLYLE KARYLE
16 ABIGAIL GENTRY KING
17 RODRIGUEZRINCON HERNANDEZ
18 ABIGAYLE FITZGERALD
24
Match on name (fname+lname)
ssn missing & dob approx
  1: MI equal
  7: MI missing
  8: MI not equal
ssn approx
  3: dob equal
  dob approx
    4: one element
    5: transpose
25
Match on name (fname+lname)
ssn missing & dob approx
  1: MI equal
  7: MI missing
  8: MI not equal
ssn approx
  3: dob equal
  dob approx
    4: one element
    5: transpose

obs ssn kssn dob kdob
1 09/06/09 09/06/08
2 12/09/75 12/09/76
3 07/15/20 07/07/20
4 11/12/14 12/11/14
5 12/08/94 08/12/94
26
Match on name (fname+lname)
obs Type ssn kssn dob kdob fname lname
1 4 09/06/09 09/06/08 MARION MONTAGUE
2 12/09/75 12/09/76 WILLIAM JOHNSON
3 07/15/20 07/07/20 WILLIE GRANT
5 11/12/14 12/11/14 GLADYS SOUTHARD
12/08/94 08/12/94 TAYLOR FORD
6 52 09/11/77 . NICOLE PARKER
7 07/07/88 ASAJAH ROSS
8 100 01/31/05 10/31/99 PATRICIA BANEGAS
9 01/12/88 02/12/88 DANIEL ANDRONIC
10 02/27/89 11/15/89 VICTORIA HORN
28
link
Put together all links found
Identify indirect duplicates (type2>10000)
  i.e. both EISID1 & EISID2 link to identical SISID1
  Consider indirect duplicates on both EIS & SIS
Create unique link and indirect duplicate files
  Keep only the first id in data file link
  Create indirect duplicates files dupeis2 & dupsis2
TODO : explore indirect duplicates
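The step of separating unique links from indirect duplicates can be sketched in Python as below. This is an illustration under assumptions, not the code behind the slides: links are taken to be an ordered list of (eisid, sisid) pairs, "first" means first in that order, and the function name `split_links` is hypothetical. The outputs are named after the files on the slide.

```python
def split_links(links):
    """Split (eisid, sisid) pairs into unique links and indirect duplicates.

    Returns (link, dupeis2, dupsis2):
      link    : pairs kept, one per id on each side (first occurrence wins)
      dupeis2 : (duplicate eisid, primary eisid) pairs
      dupsis2 : (duplicate sisid, primary sisid) pairs
    """
    eis_to_sis = {}  # every seen eisid -> primary sisid of its link
    sis_to_eis = {}  # every seen sisid -> primary eisid of its link
    link, dupeis2, dupsis2 = [], [], []
    for eisid, sisid in links:
        eis_seen = eisid in eis_to_sis
        sis_seen = sisid in sis_to_eis
        if not eis_seen and not sis_seen:
            # First time either id appears: keep as the unique link.
            eis_to_sis[eisid] = sisid
            sis_to_eis[sisid] = eisid
            link.append((eisid, sisid))
        elif sis_seen and not eis_seen:
            # A second EIS id reaches an already linked SIS id.
            primary = sis_to_eis[sisid]
            dupeis2.append((eisid, primary))
            eis_to_sis[eisid] = eis_to_sis[primary]
        elif eis_seen and not sis_seen:
            # A second SIS id reaches an already linked EIS id.
            primary = eis_to_sis[eisid]
            dupsis2.append((sisid, primary))
            sis_to_eis[sisid] = sis_to_eis[primary]
        # Both ids already seen: nothing new to record.
    return link, dupeis2, dupsis2
```

For the slide's example, if E1 and E2 both link to S1, the pair (E1, S1) stays in link and E2 is recorded in dupeis2 with E1 as its primary.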
29
Create unique list of EIS & SIS
Generate unique full list of each set of ids
  use linkage info
  Link in the duplicates (dupeis & dupsis)
TODO : link in the indirect duplicates eis & sis
30
Data flow
[Diagram] Link eis to sis
Files: eis.sas7bdat, sis.sas7bdat, eisid.sas7bdat, sisid.sas7bdat, ueis.sas7bdat, usis.sas7bdat, dupeis.sas7bdat, dupsis.sas7bdat, link.sas7bdat, dupeis2.sas7bdat, dupsis2.sas7bdat
Labels: duplicates; unduplicated unique records
Counts and percentages: 4,308,863; 4,277,402 (99%); 1,888,747; 1,638,112 (87%); 31,461; 250,635; 1,173,404 (28%, 74%); 1,270; 493 (27%, 72%)
31
Type of links
Exact match   Approx match (miss)   Freq   %   cum %
ssn, dob, fsound   781094   66.57%
ssn, fsound   dob   52173   4.45%   71.01%
ssn   dob, fsound   10959   0.93%   71.95%
ssn, lsound   fname (dob mismatch)   9320   0.79%   72.74%
other   7095   0.60%   73.35%
dob, fsound, lname (ssn=.)   251124   21.40%   94.75%
lname   16189   1.38%   96.13%
23653   2.02%   98.14%
(ssn mismatch)   15544   1.32%   99.47%
4398   0.37%   99.84%
fname, lname   1855   0.16%   100.00%
TOTAL
32
Type of duplicates and links
EIS SIS freq % cum %
DLD 3270 0.08% 4345 0.23%
DLX 8790 0.20% 0.28% 228039 12.07% 12.30%
DXX 19401 0.45% 0.73% 18251 0.97% 13.27%
PLD 3221 0.07% 0.80% 0.17% 13.44%
PLX 8706 1.01% 185066 9.80% 23.24%
PXX 19198 1.45% 16929 0.90% 24.14%
XLD 4.30% 5.75% 0.46% 24.60%
XLX 976411 22.66% 28.41% 51.70% 76.29%
XXX 71.59% 100.00% 447779 23.71%
TOT
33
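Number of Duplicates
dups | EIS freq | EIS sets | EIS % | EIS cum % | SIS freq | SIS sets | SIS % | SIS cum %
1 | 4246277 | | 98.55% | 98.55% | 1432896 | | 75.86% | 75.86%
2 | 61600 | 30800 | 1.43% | 99.98% | 338251 | 169125 | 17.91% | 93.77%
3 | 942 | 314 | 0.02% | 100.00% | 86379 | 28793 | 4.57% | 98.35%
4 | 44 | 11 | 0.00% | | 22928 | 5732 | 1.21% | 99.56%
5 | | | | | 6020 | 1204 | 0.32% | 99.88%
6 | | | | | 1662 | 277 | 0.09% | 99.97%
7 | | | | | 497 | 71 | 0.03% | 99.99%
8 | | | | | 96 | 12 | 0.01% |
9 | | | | | 18 | | |
TOT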
34
Implementation details
Ndob & name must be looped
  multiple matches
Too many matches on name
  use half of ssn
Overlap for transpose
35
Basic Process
Unduplicate EIS (dupeis)
Unduplicate SIS (dupsis)
Link unduplicated EIS & SIS (link)
Generate unique full list of each set of ids (list)
  use linkage info
  Link in the duplicates eis & sis
36
Unduplication
Same as matching between different systems
  Except, match the database to itself
  i.e. EIS to EIS, SIS to SIS
Randomly select one as Primary
TODO: for those not linked using primary ID, try with duplicate ID
TODO: explore indirect duplicate links
37
Conclusion
Future work :
  indirect duplicates
  Link using duplicates
SSNs have been changed from the real data
38
Thank You !
39
Type of id
first letter:
  P : primary id with duplicates
  D : duplicates (primary info given with prefix ‘l’)
  X : no duplicates
second letter: link status
  L: linked
  X: no linked id
third letter: duplicates status of the linked id
  D: duplicates exist for the linked id
  X: no duplicates for the linked id
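The three-letter type code can be assembled mechanically from the three statuses. A small Python sketch, with a hypothetical function name and boolean parameters chosen for illustration; the third letter is assumed to be X whenever the id is not linked, which is consistent with the nine codes in the "Type of duplicates and links" table (DLD through XXX).

```python
def id_type(is_duplicate: bool, has_duplicates: bool,
            linked: bool, linked_has_duplicates: bool = False) -> str:
    """Build the three-letter id type, e.g. 'PLX' or 'XXX'."""
    # First letter: duplicate status of this id.
    if is_duplicate:
        first = "D"
    elif has_duplicates:
        first = "P"
    else:
        first = "X"
    # Second letter: link status.
    second = "L" if linked else "X"
    # Third letter: duplicate status of the linked id (X when there is no link).
    third = "D" if linked and linked_has_duplicates else "X"
    return first + second + third
```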
40
EIS & SIS Table
Unique full list of EIS (or SIS) ids
Type : type of id (XXX), see next slide
All eis info have no prefix
All sis info have prefix ‘k’
Prefix ‘l’ is the link id info
freqeis & freqsis : # of duplicate ids
Pindid (eis) & pkindid (sis) is the primary id
indid1-indid3 & kindid1-kindid8
41
Link type
sdiff : # digits different in ssn
  -1 : one or both ssn is missing
  2 : two digits are transposed
  10 : two digits are different but not transposed
ddiff : diff in dob
  -1 : one or both dob is missing
  2 : date and month is transposed
  3 : date, month and year are different
  4 : date and month are different
Fdiff (ldiff) : difference in first (last) name
  -1 : one or both are missing
  1 : one letter difference (INDEL or REPL)
  100 : one is a substring of the other
  101 : one letter diff & substring
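The fdiff (ldiff) codes read like additive flags: 1 for a one-letter difference, 100 for a substring, and 101 for both. The Python sketch below follows that reading. The function names are hypothetical, and two return values are assumptions because the slide does not define them: 0 for identical names and `None` for names that differ in some other way.

```python
from typing import Optional


def one_letter_apart(a: str, b: str) -> bool:
    """True if the names differ by exactly one INDEL or replacement."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = sorted((a, b), key=len)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def name_diff(a: Optional[str], b: Optional[str]) -> Optional[int]:
    """Return -1, 0, 1, 100, or 101 in the style of fdiff/ldiff, else None."""
    if not a or not b:
        return -1  # one or both names missing
    a, b = a.strip().upper(), b.strip().upper()
    if a == b:
        return 0  # assumed code for identical names
    code = 0
    if one_letter_apart(a, b):
        code += 1
    if a in b or b in a:
        code += 100
    return code or None  # None: different in some other way (assumption)
```

For example, CAROL against CAROLYN gives 100, ALIJAH against ALIYAH gives 1, and a made-up pair such as ANN against ANNA gives 101.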
42
Duplicate type
If duplicate id
  Primary id info is given with prefix “l”
  Duplicate type
    Lsdiff, lddiff, lfdiff, & lldiff
If primary id
  # of duplicates : freqeis & freqsis
  Duplicate ids
    Indid1-indid3 (eis) & kindid1-kindid8 (sis)
43
Other tables
Link
  Linkage between the primary eis & sis ids
dupeis & dupsis
  List of duplicates with primary id
44
Data flow
eisid: 4,308,863
  ueis (4,277,402) + dupeis (31,461) : 99%
sisid: 1,888,747
  usis (1,638,112) + dupsis (250,635) : 87%
Link : 1,173,404 (eis: 27%, sis: 72%)
  dupeis2 (1,270) + dupsis2 (493)
  EIS: 4,308,863 (28%)
  SIS: 1,888,747 (74%)
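The counts on this slide can be cross-checked with a few lines of Python. The variable names follow the file names on the slide. Using the unduplicated counts (ueis, usis) as denominators for the 27% and 72% link rates is an assumption that happens to reproduce those figures; the slide does not state how the 28% and 74% figures are derived, so they are not recomputed here.

```python
# Counts as stated on the slide.
eisid, ueis, dupeis = 4_308_863, 4_277_402, 31_461
sisid, usis, dupsis = 1_888_747, 1_638_112, 250_635
link = 1_173_404


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# Unduplicated ids plus duplicates should add up to all ids.
eis_adds_up = ueis + dupeis == eisid
sis_adds_up = usis + dupsis == sisid

print(eis_adds_up, sis_adds_up)
print(pct(ueis, eisid), pct(usis, sisid))  # share of ids that are unduplicated
print(pct(link, ueis), pct(link, usis))    # share of unduplicated ids that are linked
```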