Record Linkage: A 10-Year Retrospective. Chen Li and Sharad Mehrotra, UC Irvine
Efficient Record Linkage in Large Data Sets. Liang Jin, Chen Li, Sharad Mehrotra. University of California, Irvine. DASFAA, Kyoto, Japan, March 2003
How was the paper written? Two faculty members working in different areas, plus a first-year PhD student
Chen’s Story: 2001 …
Data Integration Problems? Talking to medical doctors…
Example
Table R:
Name | SSN | Addr
Jack Lemmon | | Maple St
Harrison Ford | | Culver Blvd
Tom Hanks | | Main St
… | … | …
Table S:
Name | SSN | Addr
Ton Hanks | | Main Street
Kevin Spacey | | Frost Blvd
Jack Lemon | | Maple Street
… | … | …
Q: Find records from different datasets that could be the same entity
Sharad’s research
Liang’s story: a first-year PhD student at UC Irvine
Challenges: How to define good similarity functions? How to do matching efficiently?
Nested-loop? Not desirable for large data sets: 5 hours for 30K strings!
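To make the cost concrete: a nested-loop join compares every string in one table with every string in the other, so the number of distance computations is the product of the two table sizes. The sketch below is a minimal illustration, assuming edit distance as the similarity function and a threshold `k`. The function names are illustrative and the sample names come from the example tables; this is not code from the paper.

```python
def edit_distance(s, t):
    """Levenshtein distance using the classic two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nested_loop_join(r, s, k):
    """Return all pairs (a, b) with edit_distance(a, b) <= k.

    Performs len(r) * len(s) distance computations, which is what
    makes this approach impractical for large tables.
    """
    return [(a, b) for a in r for b in s if edit_distance(a, b) <= k]


R = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
S = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
pairs = nested_loop_join(R, S, k=2)
# pairs -> [("Jack Lemmon", "Jack Lemon"), ("Tom Hanks", "Ton Hanks")]
```

With 30K strings on each side this loop would run roughly 900 million distance computations, which is consistent with the running time quoted on the slide being measured in hours.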
Our 2-step approach. Step 1: map strings (in a metric space) to objects in a Euclidean space. Step 2: do a similarity join in the Euclidean space.
Advantages: applicable to many metric similarity functions (e.g., edit distance); open to existing algorithms (mapping techniques, join techniques).
Step 1: Map strings into a high-dimensional Euclidean space. (Figure: metric space mapped to Euclidean space)
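One well-known way to perform such a mapping is a FastMap-style pivot projection: for each output dimension, pick two far-apart pivot objects and place every object along the line between them using only pairwise distances. The sketch below assumes that technique purely for illustration; the slides do not say which mapping was used. The names `fastmap` and `residual_sq` are hypothetical, and clamping negative residuals to zero is a simplification made because edit distance is a metric but not a Euclidean one.

```python
import math
import random


def edit_distance(s, t):
    """Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def fastmap(items, dist, dims, seed=0):
    """Map items to points with `dims` coordinates, FastMap style."""
    rng = random.Random(seed)
    n = len(items)
    coords = [[0.0] * dims for _ in range(n)]

    def residual_sq(i, j, axis):
        # Squared distance left over after the first `axis` coordinates.
        d2 = dist(items[i], items[j]) ** 2
        for h in range(axis):
            d2 -= (coords[i][h] - coords[j][h]) ** 2
        return max(d2, 0.0)  # assumption: clamp negative residuals to zero

    for axis in range(dims):
        # Pivot heuristic: start anywhere, then jump to the farthest object twice.
        a = rng.randrange(n)
        b = max(range(n), key=lambda j: residual_sq(a, j, axis))
        a = max(range(n), key=lambda j: residual_sq(b, j, axis))
        dab2 = residual_sq(a, b, axis)
        if dab2 == 0.0:
            break  # nothing left to separate; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Projection onto the pivot line (law of cosines).
            coords[i][axis] = (residual_sq(a, i, axis) + dab2
                               - residual_sq(b, i, axis)) / (2 * dab)
    return coords


names = ["Tom Hanks", "Ton Hanks", "Jack Lemmon", "Jack Lemon", "Kevin Spacey"]
points = fastmap(names, edit_distance, dims=3)
```

The sketch recomputes string distances on every call for brevity; a practical version would cache them.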
Can it preserve distances? Use data set 1 (54K names) as an example, with k=2, d=20. Use k’=5.2 to differentiate similar and dissimilar pairs.
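Once strings are points, a threshold in the Euclidean space (such as the k’=5.2 quoted here) can drive a standard similarity join. The sketch below is one simple possibility, not the join algorithm from the paper: it buckets points into a grid over their first two coordinates and only compares points in adjacent cells. The function name `euclidean_join` and the `grid_dims` parameter are hypothetical, and points are assumed to have at least `grid_dims` coordinates.

```python
import math
from collections import defaultdict
from itertools import product


def euclidean_join(points_r, points_s, eps, grid_dims=2):
    """Return index pairs (i, j) with ||points_r[i] - points_s[j]|| <= eps.

    Two points within eps of each other differ by at most eps in every
    coordinate, so their grid cells (side length eps) differ by at most
    one step per gridded coordinate. Only those cells are inspected.
    """
    def cell(p):
        return tuple(math.floor(c / eps) for c in p[:grid_dims])

    grid = defaultdict(list)
    for j, q in enumerate(points_s):
        grid[cell(q)].append(j)

    offsets = list(product((-1, 0, 1), repeat=grid_dims))
    result = []
    for i, p in enumerate(points_r):
        base = cell(p)
        for off in offsets:
            neighbour = tuple(b + o for b, o in zip(base, off))
            for j in grid.get(neighbour, ()):
                # Final check uses all coordinates, not just the gridded ones.
                if math.dist(p, points_s[j]) <= eps:
                    result.append((i, j))
    return result


points_r = [(0.0, 0.0, 0.0), (10.0, 10.0, 10.0)]
points_s = [(1.0, 2.0, 2.0), (9.0, 10.0, 30.0)]
close = euclidean_join(points_r, points_s, eps=5.2)
# close -> [(0, 0)]
```

Because a mapping of this kind only approximates the original distances, candidate pairs found this way would presumably still be checked against the string similarity function itself.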
Multi-attribute linkage. Example: title + name + year. Different attributes have different similarity functions and thresholds. Consider merge rules in disjunctive format:
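As an illustration of the disjunctive form, a merge rule can be read as an OR of clauses, where each clause is an AND of per-attribute distance bounds. Everything specific in the sketch below is an assumption made for the example: the distance functions (token Jaccard for titles, a `difflib`-based character distance for names, absolute difference for years), the thresholds, and the sample records. None of it is taken from the paper.

```python
from difflib import SequenceMatcher


def char_distance(a, b):
    """1 minus difflib's similarity ratio, in [0, 1]."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard_distance(a, b):
    """1 minus the Jaccard similarity of the lowercased word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def year_distance(a, b):
    return abs(a - b)


# Each attribute gets its own distance function.
DISTANCES = {
    "title": token_jaccard_distance,
    "name": char_distance,
    "year": year_distance,
}

# Hypothetical rule:
#   (title <= 0.3 AND name <= 0.25) OR (name <= 0.1 AND year <= 0)
RULE = [
    {"title": 0.3, "name": 0.25},
    {"name": 0.1, "year": 0},
]


def matches(rec_a, rec_b, rule=RULE):
    """True if any clause has all of its attribute bounds satisfied."""
    return any(
        all(DISTANCES[attr](rec_a[attr], rec_b[attr]) <= limit
            for attr, limit in clause.items())
        for clause in rule
    )


a = {"title": "Efficient Record Linkage in Large Data Sets", "name": "Chen Li", "year": 2003}
b = {"title": "Efficient record linkage in large data sets", "name": "Chen Lee", "year": 2003}
c = {"title": "Record linkage", "name": "Chen Li", "year": 2003}
same_ab = matches(a, b)  # True via the first clause
same_ac = matches(a, c)  # True via the second clause
```

Keeping the rule as data rather than code makes it easy to evaluate one clause at a time, for example by running a separate similarity join per clause and taking the union of the results.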
Secret of the paper …
Work since then … Chen: efficiency; Sharad: quality
Chen’s Work on Efficiency. Gram-based algorithms: indexing, selection algorithms, join algorithms, variable-length grams, selectivity estimation. Trie-based algorithms: instant search.
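To give a flavor of the gram-based idea, the sketch below builds an inverted index from q-grams to strings and uses two classic filters for an approximate selection query: a length filter, and a count filter based on the fact that two strings within edit distance k share at least max(|s|, |t|) - q + 1 - k*q q-grams (counted as multisets). It is a minimal illustration with hypothetical names (`GramIndex`, `candidates`) and is not the API of any existing package; the strings it returns are candidates that would still need to be verified with the real edit distance.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """Multiset of overlapping q-grams of s (no padding)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


class GramIndex:
    def __init__(self, strings, q=2):
        self.q = q
        self.strings = list(strings)
        # gram -> list of (string id, occurrences of the gram in that string)
        self.postings = defaultdict(list)
        for sid, s in enumerate(self.strings):
            for gram, count in qgrams(s, q).items():
                self.postings[gram].append((sid, count))

    def candidates(self, query, k):
        """Strings passing the length and count filters for edit distance <= k."""
        shared = Counter()
        for gram, count in qgrams(query, self.q).items():
            for sid, other in self.postings.get(gram, ()):
                shared[sid] += min(count, other)  # multiset intersection size
        out = []
        for sid, s in enumerate(self.strings):
            if abs(len(s) - len(query)) > k:
                continue  # length filter
            need = max(len(s), len(query)) - self.q + 1 - k * self.q
            if shared[sid] >= need:  # count filter (always passes if need <= 0)
                out.append(s)
        return out


index = GramIndex(["Jack Lemmon", "Harrison Ford", "Tom Hanks"], q=2)
hits_ton = index.candidates("Ton Hanks", k=1)    # -> ["Tom Hanks"]
hits_jack = index.candidates("Jack Lemon", k=1)  # -> ["Jack Lemmon"]
```

For brevity the final loop scans every indexed string; a practical implementation would merge the posting lists so that only strings sharing enough grams are touched.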
The Flamingo Package
Follow-up work in the community: a significant amount of work on approximate string queries (selection, join).
Make an impact?
UCI People Search
Psearch (2008): 2 stories
Fuzzy search
Location-based search
Research commercialization
Lesson learned: hands-on experience is important!