Deterministic Record Linking University of North Carolina, Chapel Hill Hye-Chung Kum.


1 Deterministic Record Linking
University of North Carolina, Chapel Hill
Hye-Chung Kum

2 Example
EIS records
– EISID : E1 ; ssn : ; first name : Sally ; last name : Hill ; MI : L ; DOB : 3/4/1999
– EISID : E2 ; ssn : ; first name : Emily ; last name : Brown ; MI : K ; DOB : 6/2/2004
– EISID : E3 ; ssn : ; first name : Mary ; last name : Johnson ; MI : G ; DOB : 5/13/1983
– EISID : E4 ; ssn : ; first name : David ; last name : Ford ; MI : J ; DOB : 10/25/1990
SIS records
– SISID : S1 ; ssn : ; first name : Sally ; last name : Hill ; MI : L ; DOB : 3/4/1999
– SISID : S2 ; ssn : ; first name : Emily ; last name : Brown ; MI : K ; DOB : 6/2/2004
– SISID : S3 ; ssn : ; first name : Mary ; last name : Hawkins ; MI : J ; DOB : 5/13/1983
– SISID : S4 ; ssn : ; first name : David ; last name : Ford ; MI : J ; DOB : 10/23/1990

3 Exact Match
– EISID : E1 ; ssn : ; first name : Sally ; last name : Hill ; MI : L ; DOB : 3/4/1999
– SISID : S1 ; ssn : ; first name : Sally ; last name : Hill ; MI : L ; DOB : 3/4/1999

4 Approximate Matching I : SSN
– EISID : E2 ; ssn : (highlighted digits 25) ; first name : Emily ; last name : Brown ; MI : K ; DOB : 6/2/2004
– SISID : S2 ; ssn : (highlighted digits 52) ; first name : Emily ; last name : Brown ; MI : K ; DOB : 6/2/2004

5 Approximate Matching II : DOB
– EISID : E4 ; ssn : ; first name : David ; last name : Ford ; MI : J ; DOB : 10/25/1990
– SISID : S4 ; ssn : ; first name : David ; last name : Ford ; MI : J ; DOB : 10/23/1990

6 Approximate Matching III : Name
– EISID : E3 ; ssn : ; first name : Mary ; last name : Johnson ; MI : G ; DOB : 5/13/1983
– SISID : S3 ; ssn : ; first name : Mary ; last name : Hawkins ; MI : J ; DOB : 5/13/1983

7 Deterministic Record Linking
Allow for approximate matching
Use explicit approximate rules
Pros : can control the linkage process
Con : difficult to implement
Alternative : Probabilistic record linking
– Also approximate matching
– However, uses general rules specified by users
– Based on total probability
– Con : cannot control exactly what to consider a match or not
– Pros : can use specialized software

8 Approximate Matching : DOB
Element to element match : date, month, year
Allow for one element difference
Allow for month and day transposed
DOB : one element
– dob1 : 10/25/1990
– dob2 : 10/23/1990
DOB : transpose
– dob1 : 11/7/1995
– dob2 : 7/11/1995
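
The DOB rule above can be sketched in a few lines of Python. This is only an illustration of the two allowances on the slide (one element different, month and day transposed); the function name `dob_match` and its return labels are made up for this sketch and are not taken from the original SAS programs.

```python
from datetime import date


def dob_match(d1: date, d2: date) -> str:
    """Classify two dates of birth as 'equal', 'one_element', 'transpose' or 'diff'.

    Hypothetical helper: the labels are illustrative, not the codes used in the slides.
    """
    a = (d1.month, d1.day, d1.year)
    b = (d2.month, d2.day, d2.year)
    if a == b:
        return "equal"
    # Element-to-element comparison: exactly one of month, day, year differs.
    if sum(x != y for x, y in zip(a, b)) == 1:
        return "one_element"
    # Month and day swapped, year unchanged.
    if d1.year == d2.year and d1.month == d2.day and d1.day == d2.month:
        return "transpose"
    return "diff"


print(dob_match(date(1990, 10, 25), date(1990, 10, 23)))  # one_element
print(dob_match(date(1995, 11, 7), date(1995, 7, 11)))    # transpose
```

Working on parsed dates means a transposition is only detectable when both orderings are valid calendar dates, which is the case for the 11/7 and 7/11 example.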

9 Approximate Matching : Name
First name soundex match
First name is approx
– one letter different : insert or replace
– and/or substr
lsound equal or lname approx
– MI=FI
– FI equal
Fsound & Lsound swapped
obs fname kfname mi kmi
1 RUDOLPH L RU L DOLPH AA
2 J ALI J AH Y ALI Y AH M
3 CAROL YN CAROL YN J
4 LIQUE ANGE LIQUE I ANG I E D
5 JOHNNY JR JOHNNY JR L
6 HARY ZAC HARY K ZAC K L
7 J M M ICHAELM
8 A A NTON C C OUDRAYCA
9 A A RTHUR A A UTHOR RR
10 E E DWIN E E DDIE
11 GOLDY OWENS AA

10 Approximate Matching : Name
obs fname kfname mi kmi lname klname
1 RUDOLPH L RU L DOLPH AA SIMARD
2 J ALI J AH Y ALI Y AH M FOSS
3 CAROL YN CAROL YN J YOUNG
4 LIQUE ANGE LIQUE I ANG I E D OUELLETTE
5 JOHNNY JR JOHNNY JR L MAYO
6 HARY ZAC HARY K ZAC K L ROGERS
7 J M M ICHAELM GALLAGHER
8 A A NTON C C OUDRAYCA CYPRESS
9 A A RTHUR A A UTHOR RR DAVIS
10 E E DWIN E E DDIE KAHKONE
11 GOLDY OWENS AA OWENS GOLDY
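
The "fsound" and "lsound" comparisons rely on a Soundex code. Below is a minimal sketch of standard American Soundex in Python, plus a check for the "Fsound & Lsound swapped" case. It is an assumption that this variant is close to what the slides used; the SAS `SOUNDEX` function, for instance, does not pad or truncate to four characters, so the codes can differ. The names `soundex` and `names_swapped` are illustrative.

```python
# Letter-to-digit table for standard American Soundex.
_CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code, or '' for an empty name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            out += code
        if c not in "HW":  # H and W do not separate letters with the same code
            prev = code
    return (out + "000")[:4]


def names_swapped(fname: str, lname: str, kfname: str, klname: str) -> bool:
    """True when first and last name sound like each other's counterpart."""
    return soundex(fname) == soundex(klname) and soundex(lname) == soundex(kfname)


print(soundex("AUSTIN") == soundex("AUSTYN"))
print(names_swapped("GOLDY", "OWENS", "OWENS", "GOLDY"))
```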

11 Match on ssn (ssn equal)
1 : dob, fsound equal
dob approx
– 2 : dob approx, fsound equal
– 3 : dob approx, fname approx
– 4 : dob approx, lsound equal, & fsound diff, but MI=FI
– 5 : dob approx, lsound equal, & fsound diff, but FI equal
– 6 : dob approx, lsound and fsound swapped
– 7 : dob approx, lname approx & fsound diff but MI=FI (4 with lname approx rather than equal) or FI equal (5 with lname approx rather than equal)
dob mismatch
– 8 : fname approx, lsound equal, and dob diff
– 9 : fname approx, lsound approx, and dob diff
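
One way to read the numbered rules above is as an ordered cascade applied to pairs whose ssn is already equal. The sketch below assumes the individual comparisons have been computed beforehand and only shows the ordering. Two points are assumptions rather than something the slide states: an equal dob is allowed to satisfy the "dob approx" rules when rule 1 does not apply, and "lsound approx" in rule 9 is treated as "lname approx". The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comparison:
    """Precomputed comparison flags for one candidate pair (illustrative names)."""
    dob: str = "diff"            # "equal", "approx" or "diff"
    fsound_equal: bool = False
    fname_approx: bool = False
    lsound_equal: bool = False
    lname_approx: bool = False
    mi_eq_fi: bool = False       # middle initial of one record = first initial of the other
    fi_equal: bool = False       # first initials are equal
    sounds_swapped: bool = False


def ssn_rule(c: Comparison) -> Optional[int]:
    """Return the first rule number (1-9) the pair satisfies, or None."""
    if c.dob == "equal" and c.fsound_equal:
        return 1
    if c.dob in ("equal", "approx"):  # assumption: equal also counts as approx
        if c.fsound_equal:
            return 2
        if c.fname_approx:
            return 3
        if c.lsound_equal and c.mi_eq_fi:
            return 4
        if c.lsound_equal and c.fi_equal:
            return 5
        if c.sounds_swapped:
            return 6
        if c.lname_approx and (c.mi_eq_fi or c.fi_equal):
            return 7
        return None
    # dob differs: only the name has to carry the match.
    if c.fname_approx and c.lsound_equal:
        return 8
    if c.fname_approx and c.lname_approx:
        return 9
    return None


print(ssn_rule(Comparison(dob="approx", fsound_equal=True)))
```

Checking the rules in numeric order means the "fsound diff" condition of rules 4, 5 and 7 never needs its own test: a pair with equal fsound has already been caught by rule 1 or 2.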

13 Approximate Matching : SSN
Digit to digit match
Allow for one digit difference
Allow for two digit difference if transposed
SSN : one digit (highlighted digit 9) ; ssn1 : ; ssn2 :
SSN : transpose (highlighted digits 25) ; ssn1 : ; ssn2 :
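
A digit-to-digit SSN comparison along these lines might look like the sketch below. The function name and labels are illustrative, and requiring the two transposed digits to be adjacent is an assumption based on the 25 / 52 example. The SSN strings in the usage lines are invented values starting with 000, which is never issued as a real SSN.

```python
def ssn_match(s1: str, s2: str) -> str:
    """Classify two SSN strings as 'missing', 'equal', 'one_digit', 'transpose' or 'diff'."""
    if not s1 or not s2:
        return "missing"
    if len(s1) != len(s2):
        return "diff"
    if s1 == s2:
        return "equal"
    # Positions where the two strings disagree, compared digit by digit.
    pos = [i for i, (a, b) in enumerate(zip(s1, s2)) if a != b]
    if len(pos) == 1:
        return "one_digit"
    if len(pos) == 2:
        i, j = pos
        # Assumption: only adjacent digits count as a transposition.
        if j == i + 1 and s1[i] == s2[j] and s1[j] == s2[i]:
            return "transpose"
    return "diff"


print(ssn_match("000112533", "000112593"))  # one_digit
print(ssn_match("000112533", "000115233"))  # transpose
```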

14 Match on ndob (dob+fsound)
ssn missing
– 1 : lname equal
– 2 : lname approx
ssn approx
– 3 : lname equal
– 4 : lname approx
– 5 : lname diff but fname equal
ssn different
– 11 : lname equal
– 12 : lname approx
lname different
– 51 : ssn approx
– 52 : ssn missing
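
Matching on ndob amounts to using dob plus the first-name Soundex as a join key and then classifying each candidate pair with the codes above. The sketch below shows both steps under stated assumptions: records are plain dicts with hypothetical fields `id`, `dob` and `fsound`, and the ssn and lname comparisons arrive as precomputed labels. Pairs with equal ssn are not covered here, since they would already have been handled in the ssn pass.

```python
from collections import defaultdict
from typing import Optional


def candidate_pairs(eis, sis):
    """Yield (eis id, sis id) for records sharing the same (dob, fsound) key."""
    index = defaultdict(list)
    for s in sis:
        index[(s["dob"], s["fsound"])].append(s)
    for e in eis:
        for s in index.get((e["dob"], e["fsound"]), []):
            yield e["id"], s["id"]


_RULES = {
    ("missing", "equal"): 1, ("missing", "approx"): 2,
    ("approx", "equal"): 3, ("approx", "approx"): 4,
    ("diff", "equal"): 11, ("diff", "approx"): 12,
}


def ndob_rule(ssn: str, lname: str, fname_equal: bool = False) -> Optional[int]:
    """Map comparison labels to the slide's match codes; None means no match.

    ssn is "missing", "approx" or "diff"; lname is "equal", "approx" or "diff".
    """
    if (ssn, lname) in _RULES:
        return _RULES[(ssn, lname)]
    if lname == "diff":
        if ssn == "approx":
            return 5 if fname_equal else 51
        if ssn == "missing":
            return 52
    return None


eis = [{"id": "E3", "dob": "1983-05-13", "fsound": "M600"}]
sis = [{"id": "S3", "dob": "1983-05-13", "fsound": "M600"},
       {"id": "S4", "dob": "1990-10-23", "fsound": "D130"}]
print(list(candidate_pairs(eis, sis)))
```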

23 obs SSN kSSN fname kfname lname klname
APPOLONIA GAVINS
2 . . ABEL GARCIA LOMELI GARCIA LOMELI
JOSH JOSHUA PHIPPS
LENA COOPER
MILES JR. KNIGHT JR. KNIGHT
MARTHA LYDA HOPKINS
AUSTIN AUSTYN TERWILLIGER OMEARA
ALISIA ALICE GRAVES WATSON
ANNA ANAYA MONTAGUE BOLDING
BRITTNEY REVELS
DANIEL ROBINSON
HELEN A H A LL OER H O LL ER
DEBORAH DEBRA E LE E ACH LE ACH
ABIGAHIL GARCIA GARCIA TREJO TREJO
15 . . APSLEY CARLYLE KARYLE
16 . . ABIGAIL GENTRY KING
ABIGAIL RODRIGUEZ RINCON HERNANDEZ
ABIGAYLE ABIGAIL FITZGERALD HERNANDEZ

24 Match on name (fname+lname)
ssn missing & dob approx
– 1 : MI equal
– 7 : MI missing
– 8 : MI not equal
ssn approx
– 3 : dob equal
– dob approx
  4 : one element
  5 : transpose

26 Match on name (fname+lname)
obs Type ssn kssn dob kdob fname lname
/06/ /06/ 08 MARION MONTAGUE
/09/ /09/ 76 WILLIAM JOHNSON
/ 15 / / 07 /20 WILLIE GRANT
/12 11/12 /14 12/11 12/11 /14 GLADYS SOUTHARD
/08 12/08 /94 08/12 08/12 /94 TAYLOR FORD
/11/77 . NICOLE PARKER
/07/88 . ASAJAH ROSS
/31/ /31/ 99 PATRICIA BANEGAS
/12/ /12/88 DANIEL ANDRONIC
/27 02/27 /89 11/15 11/15 /89 VICTORIA HORN

28 link
Put together all links found
Identify indirect duplicates (type2>10000)
– i.e. both EISID1 & EISID2 link to identical SISID1
– Consider indirect duplicates on both EIS & SIS
Create unique link and indirect duplicate files
– Keep only the first id in data file link
– Create indirect duplicates files dupeis2 & dupsis2
TODO : explore indirect duplicates
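
The step described above (spot ids on one side that link to the same id on the other side, keep only the first link, and set the rest aside) could be sketched as follows. The input is assumed to be an ordered list of (eis id, sis id) pairs; the function name `split_links` and the dict-based output layout are hypothetical, chosen only to mirror the `link`, `dupeis2` and `dupsis2` files named on the slide.

```python
from collections import defaultdict


def split_links(links):
    """Split raw links into a one-to-one link list and indirect duplicates.

    Returns (unique, dupeis2, dupsis2):
    - unique: the first link seen for each eis id and each sis id
    - dupeis2: sis id -> list of eis ids that all link to it (when more than one)
    - dupsis2: eis id -> list of sis ids that all link to it (when more than one)
    """
    by_sis, by_eis = defaultdict(list), defaultdict(list)
    for e, s in links:
        by_sis[s].append(e)
        by_eis[e].append(s)
    dupeis2 = {s: es for s, es in by_sis.items() if len(es) > 1}
    dupsis2 = {e: ss for e, ss in by_eis.items() if len(ss) > 1}

    unique, seen_e, seen_s = [], set(), set()
    for e, s in links:
        # Keep only the first link for any id on either side.
        if e not in seen_e and s not in seen_s:
            unique.append((e, s))
            seen_e.add(e)
            seen_s.add(s)
    return unique, dupeis2, dupsis2


links = [("E1", "S1"), ("E2", "S1"), ("E3", "S3")]
unique, dupeis2, dupsis2 = split_links(links)
print(unique)
print(dupeis2)
```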

29 Create unique list of EIS & SIS
Generate unique full list of each set of ids
– use linkage info
– Link in the duplicates (dupeis & dupsis)
– TODO : link in the indirect duplicates
– eis & sis

30 Data flow : Link eis to sis
eisid.sas7bdat (4,308,863)
– unduplicated : ueis.sas7bdat (4,277,402 ; 99%)
– duplicates : dupeis.sas7bdat (31,461)
sisid.sas7bdat (1,888,747)
– unduplicated : usis.sas7bdat (1,638,112 ; 87%)
– duplicates : dupsis.sas7bdat (250,635)
link.sas7bdat (1,173,404 ; eis : 27%, sis : 72%)
– dupeis2.sas7bdat & dupsis2.sas7bdat
unique records
– eis.sas7bdat (4,308,863 ; 28%)
– sis.sas7bdat (1,888,747 ; 74%)

31 Type of links
Exact match | Approx match (miss) | Freq | % | cum %
ssn, dob, fsound | | | |
ssn, fsound | dob | | | 71.01%
ssn | dob, fsound | | | 71.95%
ssn, lsound | fname (dob mismatch) | | | 72.74%
ssn | other | | | 73.35%
dob, fsound, lname | (ssn=.) | | | 94.75%
dob, fsound | lname | | | 96.13%
dob, fsound, lname | ssn | | | 98.14%
dob, fsound, lname | (ssn mismatch) | | | 99.47%
dob, fsound | other | | | 99.84%
fname, lname | other | | | 100.00%
TOTAL | | | |

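32 Type of duplicates and links
Type | EIS freq | EIS % | EIS cum % | SIS freq | SIS % | SIS cum %
DLD | | | | | |
DLX | | | 0.28% | | | 12.30%
DXX | | | 0.73% | | | 13.27%
PLD | | | 0.80% | | | 13.44%
PLX | | | 1.01% | | | 23.24%
PXX | | | 1.45% | | | 24.14%
XLD | | | 5.75% | | | 24.60%
XLX | | | 28.41% | | | 76.29%
XXX | | | 100.00% | | | 100.00%
TOT | | | | | |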

33 Number of Duplicates
Columns : dups ; EIS (freq, sets, %, cum %) ; SIS (freq, sets, %, cum %)
EIS cum %, from the second row on : 99.98%, 100.00%, 100.00%
SIS cum %, from the second row on : 93.77%, 98.35%, 99.56%, 99.88%, 99.97%, 99.99%, 100.00%, 100.00%
TOT

34 Implementation details
Ndob & name must be looped
– multiple matches
Too many matches on name
– use half of ssn
– Overlap for transpose

35 Basic Process
Unduplicate EIS (dupeis)
Unduplicate SIS (dupsis)
Link unduplicated EIS & SIS (link)
Generate unique full list of each set of ids (list)
– use linkage info
– Link in the duplicates
– eis & sis

36 Unduplication
Same as matching between different systems
Except, match the database to itself
– i.e. EIS to EIS, SIS to SIS
Randomly select one as Primary
– TODO : for those not linked using primary ID, try with duplicate ID
TODO : explore indirect duplicate links
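
As a rough illustration of unduplication, the sketch below takes the duplicate pairs found by matching a system against itself, groups them, and randomly picks one id per group as the primary. Grouping pairs transitively with a union-find structure is an assumption made for this sketch; the slides leave indirect duplicates as a TODO, so the original process may not have merged groups this way. The function name and the `seed` parameter are hypothetical.

```python
import random


def unduplicate(ids, dup_pairs, seed=0):
    """Map every id to the primary id of its duplicate group.

    ids: all ids in one system (e.g. EIS).
    dup_pairs: (id, id) pairs found by matching the system to itself.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent pointers to the group root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in dup_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)

    rng = random.Random(seed)  # seeded so the random choice is repeatable
    primary_of = {}
    for members in groups.values():
        primary = rng.choice(members)
        for m in members:
            primary_of[m] = primary
    return primary_of


mapping = unduplicate(["A", "B", "C"], [("A", "B")])
print(mapping["A"] == mapping["B"], mapping["C"])
```

Ids whose primary is some other id would go to the duplicates file (dupeis or dupsis), and the rest form the unduplicated file.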

37 Conclusion
Future work :
– indirect duplicates
– Link using duplicates
SSNs have been changed from real data

38 Thank You !

39 Type of id
first letter :
– P : primary id with duplicates
– D : duplicates (primary info given with prefix ‘l’)
– X : no duplicates
second letter : link status
– L : linked
– X : no linked id
third letter : duplicates status of the linked id
– D : duplicates exist for the linked id
– X : no duplicates for the linked id
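
The three-letter type can be assembled mechanically from three facts about an id. A minimal sketch, with hypothetical parameter names:

```python
def id_type(is_duplicate: bool, n_duplicates: int,
            linked: bool, linked_has_duplicates: bool) -> str:
    """Build the three-letter id type described above (e.g. 'PLX')."""
    # First letter: D for a duplicate id, P for a primary id that has
    # duplicates, X for an id with no duplicates at all.
    if is_duplicate:
        first = "D"
    elif n_duplicates > 0:
        first = "P"
    else:
        first = "X"
    # Second letter: link status.
    second = "L" if linked else "X"
    # Third letter: only meaningful when there is a linked id.
    third = "D" if linked and linked_has_duplicates else "X"
    return first + second + third


print(id_type(False, 2, True, False))  # PLX
```

Because the third letter falls back to X when nothing is linked, only nine combinations can occur, which agrees with the nine types listed in the "Type of duplicates and links" table.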

40 EIS & SIS Table
Unique full list of EIS (or SIS) ids
Type : type of id (XXX) – see next slide
All eis info have no prefix
All sis info have prefix ‘k’
Prefix ‘l’ is the link id info
freqeis & freqsis : # of duplicate ids
pindid (eis) & pkindid (sis) is the primary id
indid1-indid3 & kindid1-kindid8

41 Link type
sdiff : # digits different in ssn
– -1 : one or both ssn is missing
– 2 : two digits are transposed
– 10 : two digits are different but not transposed
ddiff : diff in dob
– -1 : one or both dob is missing
– 2 : date and month is transposed
– 3 : date, month and year are different
– 4 : date and month are different
fdiff (ldiff) : difference in first (last) name
– -1 : one or both are missing
– 1 : one letter difference (INDEL or REPL)
– 100 : one is a substring of the other
– 101 : one letter diff & substring
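
The fdiff / ldiff codes can be illustrated with a small Python function. This sketch reads code 101 as "both conditions hold at once" (one letter apart and one name contained in the other). Returning 0 for identical names and `None` when neither condition holds are assumptions, since the slide does not list those cases. The function names are hypothetical.

```python
from typing import Optional


def one_letter_apart(a: str, b: str) -> bool:
    """True if b can be made from a by one insert, delete or replace (INDEL or REPL)."""
    if a == b:
        return False
    if len(a) == len(b):
        # Same length: exactly one position replaced.
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = sorted((a, b), key=len)
    # Dropping one letter from the longer name must give the shorter one.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def fdiff(a: Optional[str], b: Optional[str]) -> Optional[int]:
    """Name difference code: -1, 0, 1, 100, 101, or None when the names just differ."""
    if not a or not b:
        return -1
    a, b = a.upper(), b.upper()
    if a == b:
        return 0  # assumption: identical names get 0
    code = 0
    if one_letter_apart(a, b):
        code += 1
    if a in b or b in a:
        code += 100
    return code if code else None


print(fdiff("HALLER", "HOLLER"), fdiff("JOSH", "JOSHUA"), fdiff("ANN", "ANNA"))
```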

42 Duplicate type
If duplicate id
– Primary id info is given with prefix “l”
– Duplicate type : lsdiff, lddiff, lfdiff, & lldiff
If primary id
– # of duplicates : freqeis & freqsis
– Duplicate ids : indid1-indid3 (eis) & kindid1-kindid8 (sis)

43 Other tables
link
– Linkage between the primary eis & sis ids
dupeis & dupsis
– List of duplicates with primary id

44 Data flow
eisid : 4,308,863
– ueis (4,277,402) + dupeis (31,461) : 99%
sisid : 1,888,747
– usis (1,638,112) + dupsis (250,635) : 87%
Link : 1,173,404 (eis : 27%, sis : 72%)
– dupeis2 (1,270) + dupsis2 (493)
EIS : 4,308,863 (28%)
SIS : 1,888,747 (74%)
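
The counts above can be cross-checked with a few lines of arithmetic. The sketch assumes the link percentages are taken over the unduplicated files (ueis and usis); that reading reproduces both 27% and 72%, whereas dividing by the full sisid count would not give 72%. The 28% and 74% figures on the last two lines depend on how linked duplicates were counted, which the slide does not spell out, so they are not recomputed here.

```python
# Counts copied from the data flow slide.
eisid, ueis, dupeis = 4_308_863, 4_277_402, 31_461
sisid, usis, dupsis = 1_888_747, 1_638_112, 250_635
link = 1_173_404


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# Unduplicated plus duplicate records add back up to the full id lists.
print(ueis + dupeis == eisid, usis + dupsis == sisid)  # True True

print(pct(ueis, eisid), pct(usis, sisid))  # 99 87
print(pct(link, ueis), pct(link, usis))    # 27 72
```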

