Deterministic Record Linking University of North Carolina, Chapel Hill Hye-Chung Kum.

1 Deterministic Record Linking University of North Carolina, Chapel Hill Hye-Chung Kum

2 Example
EIS records
EISID  ssn           first name  last name  MI  DOB
E1     085-66-9980   Sally       Hill       L   3/4/1999
E2     143-25-9304   Emily       Brown      K   6/2/2004
E3     354-563-2343  Mary        Johnson    G   5/13/1983
E4     532-34-9183   David       Ford       J   10/25/1990
SIS records
SISID  ssn           first name  last name  MI  DOB
S1     085-66-9980   Sally       Hill       L   3/4/1999
S2     143-52-9304   Emily       Brown      K   6/2/2004
S3     354-563-2343  Mary        Hawkins    J   5/13/1983
S4     532-34-9183   David       Ford       J   10/23/1990

3 Exact Match
EISID : E1   ssn : 085-66-9980   first name : Sally   last name : Hill   MI : L   DOB : 3/4/1999
SISID : S1   ssn : 085-66-9980   first name : Sally   last name : Hill   MI : L   DOB : 3/4/1999

4 Approximate Matching I : SSN
EISID : E2   ssn : 143-25-9304   first name : Emily   last name : Brown   MI : K   DOB : 6/2/2004
SISID : S2   ssn : 143-52-9304   first name : Emily   last name : Brown   MI : K   DOB : 6/2/2004
Highlighted difference: 25 vs. 52 in the ssn

5 Approximate Matching II : DOB
EISID : E4   ssn : 532-34-9183   first name : David   last name : Ford   MI : J   DOB : 10/25/1990
SISID : S4   ssn : 532-34-9183   first name : David   last name : Ford   MI : J   DOB : 10/23/1990
Highlighted difference: 10/25/1990 vs. 10/23/1990

6 Approximate Matching III : Name
EISID : E3   ssn : 354-563-2343   first name : Mary   last name : Johnson   MI : G   DOB : 5/13/1983
SISID : S3   ssn : 354-563-2343   first name : Mary   last name : Hawkins   MI : J   DOB : 5/13/1983
Highlighted differences: last name (Johnson vs. Hawkins) and MI (G vs. J)

7 Deterministic Record Linking
Allow for approximate matching
Use explicit approximate rules
Pros: can control the linkage process
Con: difficult to implement
Alternative: probabilistic record linking
– Also approximate matching
– However, uses general rules specified by users
– Based on total probability
– Con: cannot control exactly what is considered a match
– Pros: can use specialized software

8 Approximate Matching : DOB
Element-to-element match: date, month, year
Allow for one element difference
Allow for month and day transposed
DOB, one element: dob1 : 10/25/1990, dob2 : 10/23/1990 (25 vs. 23)
DOB, transpose: dob1 : 11/7/1995, dob2 : 7/11/1995 (11/7 vs. 7/11)
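The two DOB rules on this slide can be sketched as a small comparison function. This is only an illustration under assumptions: the function name `dob_approx` is hypothetical, and identical dates are treated as an exact match rather than an approximate one.

```python
from datetime import date


def dob_approx(d1: date, d2: date) -> bool:
    """Return True when two DOBs match approximately (hypothetical helper).

    Approximate means either exactly one of month/day/year differs,
    or month and day are swapped within the same year.
    Identical dates return False (assumed to count as an exact match).
    """
    a = (d1.month, d1.day, d1.year)
    b = (d2.month, d2.day, d2.year)
    differing = sum(x != y for x, y in zip(a, b))
    if differing == 1:
        return True  # one element difference
    # month and day transposed, year unchanged
    return (
        differing == 2
        and d1.year == d2.year
        and d1.month == d2.day
        and d1.day == d2.month
    )


print(dob_approx(date(1990, 10, 25), date(1990, 10, 23)))  # one element differs
print(dob_approx(date(1995, 11, 7), date(1995, 7, 11)))    # month/day transposed
```

Both calls use the example dates from the slide.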

9 Approximate Matching : Name
First name soundex match
First name is approx
– one letter different (insert or replace)
– and/or substr
lsound equal or lname approx
– MI=FI
– FI equal
Fsound & Lsound swapped

10 Approximate Matching : Name
obs  fname      kfname     mi / kmi  lname / klname
1    RUDOLPH    RULDOLPH   A / A     SIMARD
2    ALIJAH     ALIYAH     M         FOSS
3    CAROL      CAROLYN    J         YOUNG
4    ANGELIQUE  ANGIE      D         OUELLETTE
5    JOHNNY     JOHNNY JR  L         MAYO
6    ZACHARY    ZACK       L         ROGERS
7    J          MICHAEL    M         GALLAGHER
8    ANTON      COUDRAY    C / A     CYPRESS
9    ARTHUR     AUTHOR     R / R     DAVIS
10   EDWIN      EDDIE                KAHKONE
11   GOLDY      OWENS      A / A     OWENS / GOLDY
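A minimal sketch of the name comparisons named on these slides. It assumes that fsound and lsound are standard American Soundex codes, which the slides do not confirm (other Soundex variants give different codes for some names). The helper names `soundex`, `one_letter_apart` and `name_approx` are hypothetical. The combined "one letter different and substring" case (for example ZACHARY vs. ZACK) is not attempted here.

```python
def soundex(name: str) -> str:
    """Standard American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (out + "000")[:4]


def one_letter_apart(a: str, b: str) -> bool:
    """True if b equals a with one letter inserted, deleted, or replaced."""
    if a == b:
        return False
    if len(a) > len(b):
        a, b = b, a
    if len(b) - len(a) > 1:
        return False
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one replacement
    return a[i:] == b[i + 1:]          # one insertion into the shorter name


def name_approx(a: str, b: str) -> bool:
    """Approximate name match: one letter different, or one is a substring."""
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b or a == b:
        return False
    return one_letter_apart(a, b) or a in b or b in a


print(name_approx("RUDOLPH", "RULDOLPH"), name_approx("CAROL", "CAROLYN"))
print(soundex("ARTHUR"), soundex("AUTHOR"))
```

Under this Soundex variant ARTHUR and AUTHOR receive different codes while sharing a first initial, which is the "fsound diff but FI equal" situation the rules describe.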

11 Match on ssn (ssn equal)
1 : dob, fsound equal
dob approx
– 2 : dob approx, fsound equal
– 3 : dob approx, fname approx
– 4 : dob approx, lsound equal, & fsound diff, but MI=FI
– 5 : dob approx, lsound equal, & fsound diff, but FI equal
– 6 : dob approx, lsound and fsound swapped
– 7 : dob approx, lname approx & fsound diff but MI=FI (4 with lname approx rather than equal) or FI equal (5 with lname approx rather than equal)
dob mismatch
– 8 : fname approx, lsound equal, and dob diff
– 9 : fname approx, lsound approx, and dob diff
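The numbered rules above read naturally as an ordered cascade. The sketch below is a literal reading under assumptions, not the author's implementation: the `Cmp` record and its field names are hypothetical, the field comparisons are taken as precomputed, rules 2 to 7 fire only when the DOB is approximate (not equal), and "lsound approx" in rule 9 is represented by the `lname_approx` flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cmp:
    """Precomputed comparisons for one candidate pair with equal ssn (hypothetical)."""
    dob: str = "diff"            # "equal", "approx", or "diff"
    fsound_equal: bool = False
    fname_approx: bool = False
    lsound_equal: bool = False
    lname_approx: bool = False
    mi_equals_fi: bool = False   # middle initial of one equals first initial of the other
    fi_equal: bool = False       # first initials equal
    names_swapped: bool = False  # fsound and lsound swapped


def ssn_equal_link_type(c: Cmp) -> Optional[int]:
    """Return the link type 1..9 for an ssn-equal pair, or None if no rule fires."""
    if c.dob == "equal":
        return 1 if c.fsound_equal else None
    if c.dob == "approx":
        if c.fsound_equal:
            return 2
        if c.fname_approx:
            return 3
        if c.lsound_equal and c.mi_equals_fi:
            return 4
        if c.lsound_equal and c.fi_equal:
            return 5
        if c.names_swapped:
            return 6
        if c.lname_approx and (c.mi_equals_fi or c.fi_equal):
            return 7
        return None
    # dob differs
    if c.fname_approx and c.lsound_equal:
        return 8
    if c.fname_approx and c.lname_approx:
        return 9
    return None


print(ssn_equal_link_type(Cmp(dob="approx", lsound_equal=True, fi_equal=True)))
```

Checking the rules in a fixed order means each pair receives the lowest-numbered type it satisfies.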

13 Approximate Matching : SSN
Digit-to-digit match
Allow for one digit difference
Allow for two digit difference if transposed
SSN, one digit: ssn1 : 532-34-9183, ssn2 : 532-34-8183 (9 vs. 8)
SSN, transpose: ssn1 : 143-25-9304, ssn2 : 143-52-9304 (25 vs. 52)
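A short sketch of the digit-to-digit SSN rule. The function name `ssn_approx` is hypothetical, and the sketch assumes that "transposed" means two adjacent digits swapped, as in the slide's 25 vs. 52 example. Identical SSNs return False because they would be an exact match.

```python
def ssn_approx(ssn1: str, ssn2: str) -> bool:
    """True when two SSNs differ by one digit or by one adjacent transposition."""
    a = [c for c in ssn1 if c.isdigit()]
    b = [c for c in ssn2 if c.isdigit()]
    if not a or len(a) != len(b):
        return False
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    if len(diff) == 1:
        return True  # one digit difference
    if len(diff) == 2:
        i, j = diff
        # two differing positions count only if they are a swapped adjacent pair
        return j == i + 1 and a[i] == b[j] and a[j] == b[i]
    return False


print(ssn_approx("532-34-9183", "532-34-8183"))  # one digit differs
print(ssn_approx("143-25-9304", "143-52-9304"))  # 25 and 52 transposed
```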

14 Match on ndob (dob+fsound)
ssn missing
– 1: lname equal
– 2: lname approx
ssn approx
– 3: lname equal
– 4: lname approx
– 5: lname diff but fname equal
ssn different
– 11 : lname equal
– 12 : lname approx
lname different
– 51: ssn approx
– 52: ssn missing

23
obs  SSN        kSSN       fname / kfname     lname / klname
1    244572812  .          APPOLONIA          GAVINS
2    .          .          ABEL               LOMELI GARCIA / LOMELI
3    248511181  .          JOSH / JOSHUA      PHIPPS
4    243352045  243352054  LENA               COOPER
5    239565518  239565519  MILES              KNIGHT JR. / KNIGHT
6    245193584  245493584  MARTHA             LYDA / HOPKINS
7    244779182  244778182  AUSTIN / AUSTYN    TERWILLIGER / OMEARA
8    489987513  489987573  ALISIA / ALICE     GRAVES / WATSON
9    239966568  239966578  ANNA / ANAYA       MONTAGUE / BOLDING
10   227691655  227691633  BRITTNEY           REVELS
11   242339913  239524402  DANIEL             ROBINSON
12   221864852  225206017  HELEN              HALL / HOLLER
13   240212489  222565604  DEBORAH / DEBRA    LEEACH / LEACH
14   238995019  .          ABIGAHIL           GARCIA TREJO / TREJO
15   .          .          APSLEY             CARLYLE / KARYLE
16   .          .          ABIGAIL            GENTRY / KING
17   237999685  .          ABIGAIL            RODRIGUEZ / RINCON HERNANDEZ
18   237998504  .          ABIGAYLE / ABIGAIL FITZGERALD / HERNANDEZ
A single value in a name column is the value the slide prints once for the pair; "." is a missing value.

24 Match on name (fname+lname)
ssn missing & dob approx
– 1: MI equal
– 7: MI missing
– 8: MI not equal
ssn approx
– 3: dob equal
– dob approx
  4: one element
  5: transpose

26 Match on name (fname+lname)
obs  Type  ssn        kssn       dob       kdob      fname     lname
1    4     362201047  326201047  09/06/09  09/06/08  MARION    MONTAGUE
2    4     313416906  313146906  12/09/75  12/09/76  WILLIAM   JOHNSON
3    4     246381056  216381056  07/15/20  07/07/20  WILLIE    GRANT
4    5     238013803  238013830  11/12/14  12/11/14  GLADYS    SOUTHARD
5    5     241191104  241191103  12/08/94  08/12/94  TAYLOR    FORD
6    52    272318863  272318860  09/11/77  .         NICOLE    PARKER
7    52    578111173  578111113  07/07/88  .         ASAJAH    ROSS
8    100   120688146  120688142  01/31/05  10/31/99  PATRICIA  BANEGAS
9    100   133680780  133680798  01/12/88  02/12/88  DANIEL    ANDRONIC
10   100   132769052  132759052  02/27/89  11/15/89  VICTORIA  HORN

28 link
Put together all links found
Identify indirect duplicates (type2>10000)
– i.e. both EISID1 & EISID2 link to identical SISID1
– Consider indirect duplicates on both EIS & SIS
Create unique link and indirect duplicate files
– Keep only the first id in data file link
– Create indirect duplicates files dupeis2 & dupsis2
TODO : explore indirect duplicates
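The step of keeping the first id and setting aside indirect duplicates can be sketched as follows. This is a hedged illustration only: the function `split_links` and its in-memory lists are hypothetical stand-ins for the link, dupeis2 and dupsis2 files, and the type2>10000 flag mentioned on the slide is not modeled.

```python
def split_links(links):
    """Split (eisid, sisid) pairs into unique links and indirect duplicates.

    Returns (link, dupeis2, dupsis2). Each entry of dupeis2 is
    (duplicate eisid, kept eisid); dupsis2 is the same for sisids.
    """
    first_eis = {}  # sisid -> first eisid linked to it
    first_sis = {}  # eisid -> first sisid linked to it
    link, dupeis2, dupsis2 = [], [], []
    for eisid, sisid in links:
        kept_eis = first_eis.get(sisid)
        kept_sis = first_sis.get(eisid)
        if kept_eis is None and kept_sis is None:
            # neither id has been linked yet: keep this pair
            first_eis[sisid] = eisid
            first_sis[eisid] = sisid
            link.append((eisid, sisid))
        elif kept_eis is not None and kept_eis != eisid:
            # a second eisid links to an already linked sisid
            dupeis2.append((eisid, kept_eis))
        elif kept_sis is not None and kept_sis != sisid:
            # a second sisid links to an already linked eisid
            dupsis2.append((sisid, kept_sis))
        # otherwise the pair repeats a link that is already kept
    return link, dupeis2, dupsis2


pairs = [("E1", "S1"), ("E2", "S1"), ("E3", "S3"), ("E3", "S4")]
print(split_links(pairs))
```

Here E2 is reported as an indirect duplicate of E1 because both link to S1, and S4 as an indirect duplicate of S3 because both link to E3.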

29 Create unique list of EIS & SIS
Generate unique full list of each set of ids
– use linkage info
– Link in the duplicates (dupeis & dupsis)
– TODO : link in the indirect duplicates
– eis & sis

30 Data flow
[Diagram: eis.sas7bdat and sis.sas7bdat feed eisid.sas7bdat (4,308,863) and sisid.sas7bdat (1,888,747). Unduplication splits these into unduplicated unique records, ueis.sas7bdat (4,277,402, 99%) and usis.sas7bdat (1,638,112, 87%), and duplicates, dupeis.sas7bdat (31,461) and dupsis.sas7bdat (250,635). "Link eis to sis" produces link.sas7bdat (1,173,404; 27% and 72%) together with dupeis2.sas7bdat (1270) and dupsis2.sas7bdat (493). Final figures: 4,308,863 (28%) and 1,888,747 (74%).]

31 Type of links
Exact match          Approx match (miss)   Freq     %        cum %
ssn, dob, fsound                           781094   66.57%
ssn, fsound          dob                   52173    4.45%    71.01%
ssn                  dob, fsound           10959    0.93%    71.95%
ssn, lsound          fname (dob mismatch)  9320     0.79%    72.74%
ssn                  other                 7095     0.60%    73.35%
dob, fsound, lname   (ssn=.)               251124   21.40%   94.75%
dob, fsound          lname                 16189    1.38%    96.13%
dob, fsound, lname   ssn                   23653    2.02%    98.14%
dob, fsound, lname   (ssn mismatch)        15544    1.32%    99.47%
dob, fsound          other                 4398     0.37%    99.84%
fname, lname         other                 1855     0.16%    100.00%
TOTAL                                      1173404  100.00%

32 Type of duplicates and links
      EIS                          SIS
Type  freq     %        cum %      freq     %        cum %
DLD   3270     0.08%               4345     0.23%
DLX   8790     0.20%    0.28%      228039   12.07%   12.30%
DXX   19401    0.45%    0.73%      18251    0.97%    13.27%
PLD   3221     0.07%    0.80%      3221     0.17%    13.44%
PLX   8706     0.20%    1.01%      185066   9.80%    23.24%
PXX   19198    0.45%    1.45%      16929    0.90%    24.14%
XLD   185066   4.30%    5.75%      8706     0.46%    24.60%
XLX   976411   22.66%   28.41%     976411   51.70%   76.29%
XXX   3084800  71.59%   100.00%    447779   23.71%   100.00%
TOT   4308863  100.00%             1888747  100.00%

33 Number of Duplicates
      EIS                                   SIS
dups  freq     sets     %        cum %      freq     sets     %        cum %
1     4246277           98.55%              1432896           75.86%
2     61600    30800    1.43%    99.98%     338251   169125   17.91%   93.77%
3     942      314      0.02%    100.00%    86379    28793    4.57%    98.35%
4     44       11       0.00%    100.00%    22928    5732     1.21%    99.56%
5                                           6020     1204     0.32%    99.88%
6                                           1662     277      0.09%    99.97%
7                                           497      71       0.03%    99.99%
8                                           96       12       0.01%    100.00%
9                                           18       2        0.00%    100.00%
TOT   4308863  4277402  100.00%             1888747  1638112  100.00%

34 Implementation details
Ndob & name must be looped
– multiple matches
Too many match on name
– use half of ssn
– Overlap for transpose

35 Basic Process
Unduplicate EIS (dupeis)
Unduplicate SIS (dupsis)
Link unduplicated EIS & SIS (link)
Generate unique full list of each set of ids (list)
– use linkage info
– Link in the duplicates
– eis & sis

36 Unduplication
Same as matching between different systems
Except, match the database to itself
– i.e. EIS to EIS, SIS to SIS
Randomly select one as Primary
– TODO: for those not linked using primary ID, try with duplicate ID
TODO: explore indirect duplicate links

37 Conclusion
Future work :
– indirect duplicates
– Link using duplicates
SSNs have been changed from the real data

38 Thank You !

39 Type of id
first letter:
– P : primary id with duplicates
– D : duplicates (primary info given with prefix ‘l’)
– X : no duplicates
second letter: link status
– L: linked
– X: no linked id
third letter: duplicates status of the linked id
– D: duplicates exist for the linked id
– X: no duplicates for the linked id
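The three-letter code can be assembled from four yes/no facts about an id. The sketch below is an illustration under assumptions: the function `id_type` and its parameter names are hypothetical, and the third letter is taken to be X whenever the id is not linked.

```python
def id_type(is_duplicate: bool, has_duplicates: bool,
            linked: bool, linked_has_duplicates: bool) -> str:
    """Build the three-letter type code described on the slide (hypothetical helper)."""
    if is_duplicate:
        first = "D"   # this id is a duplicate of a primary id
    elif has_duplicates:
        first = "P"   # primary id with duplicates
    else:
        first = "X"   # no duplicates
    second = "L" if linked else "X"
    # the third letter only carries information when a linked id exists
    third = "D" if linked and linked_has_duplicates else "X"
    return first + second + third


print(id_type(False, False, True, False))  # XLX
print(id_type(False, True, True, True))    # PLD
```

With this reading, an unlinked id never receives D as its third letter.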

40 EIS & SIS Table
Unique full list of EIS (or SIS) ids
Type : type of id (XXX) – see next slide
All eis info have no prefix
All sis info have prefix ‘k’
Prefix ‘l’ is the link id info
freqeis & freqsis : # of duplicate ids
pindid (eis) & pkindid (sis) is the primary id
indid1-indid3 & kindid1-kindid8

41 Link type
sdiff : # digits different in ssn
– -1 : one or both ssn is missing
– 2 : two digits are transposed
– 10 : two digits are different but not transposed
ddiff : diff in dob
– -1 : one or both dob is missing
– 2 : date and month are transposed
– 3 : date, month and year are different
– 4 : date and month are different
fdiff (ldiff) : difference in first (last) name
– -1 : one or both are missing
– 1 : one letter difference (INDEL or REPL)
– 100 : one is a substring of the other
– 101 : one letter diff & substring

42 Duplicate type
If duplicate id
– Primary id info is given with prefix “l”
– Duplicate type: lsdiff, lddiff, lfdiff, & lldiff
If primary id
– # of duplicates : freqeis & freqsis
– Duplicate ids: indid1-indid3 (eis) & kindid1-kindid8 (sis)

43 Other tables
Link
– Linkage between the primary eis & sis ids
dupeis & dupsis
– List of duplicates with primary id

44 Data flow
eisid: 4,308,863
– ueis (4,277,402) + dupeis (31,461) : 99%
sisid: 1,888,747
– usis (1,638,112) + dupsis (250,635) : 87%
Link : 1,173,404 (eis: 27%, sis: 72%)
– dupeis2 (1,270) + dupsis2 (493)
EIS: 4,308,863 (28%)
SIS: 1,888,747 (74%)
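The counts on this slide can be checked with a few lines of arithmetic. The variable names are hypothetical labels for the slide's figures. The link percentages are computed against the unduplicated counts, which is an assumption that happens to reproduce the 27% and 72% shown.

```python
# Counts as given on the slide
ueis, dupeis, eisid = 4_277_402, 31_461, 4_308_863
usis, dupsis, sisid = 1_638_112, 250_635, 1_888_747
link = 1_173_404


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# Unique records plus duplicates add up to the full id lists
print(ueis + dupeis == eisid, usis + dupsis == sisid)
# Share of unique records in each system
print(pct(ueis, eisid), pct(usis, sisid))
# Share of unique records that were linked (assumed denominators)
print(pct(link, ueis), pct(link, usis))
```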

