Record Linkage: A 10-Year Retrospective. Chen Li and Sharad Mehrotra, UC Irvine


1 Record Linkage: A 10-Year Retrospective. Chen Li and Sharad Mehrotra, UC Irvine

2 Efficient Record Linkage in Large Data Sets. Liang Jin, Chen Li, Sharad Mehrotra, University of California, Irvine. DASFAA, Kyoto, Japan, March 2003

3

4 How was the paper written? Two faculty working on different areas, plus a 1st-year PhD student

5 Chen’s Story: 2001 …

6 Data Integration Problems? Talking to medical doctors…

7 Example

Table R
Name           SSN           Addr
Jack Lemmon    430-871-8294  Maple St
Harrison Ford  292-918-2913  Culver Blvd
Tom Hanks      234-762-1234  Main St
…              …             …

Table S
Name           SSN           Addr
Ton Hanks      234-162-1234  Main Street
Kevin Spacey   928-184-2813  Frost Blvd
Jack Lemon     430-817-8294  Maple Street
…              …             …

Q: Find records from different datasets that could be the same entity

8 Sharad’s research

9 Liang’s story: 1st-year PhD student at UC Irvine

10 Challenges: How to define good similarity functions? How to do matching efficiently?

11 Nested-loop? Not desirable for large data sets: 5 hours for 30K strings!
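To make the cost concrete, here is a minimal Python sketch of the nested-loop baseline. It assumes edit distance as the similarity function and a threshold `k`; the function names and the threshold value are illustrative and do not come from the slides. Every pair is compared, so the work grows with |R| × |S|.

```python
def edit_distance(s, t):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nested_loop_join(r, s, k):
    """Compare every string in r with every string in s (quadratic cost)."""
    return [(a, b) for a in r for b in s if edit_distance(a, b) <= k]


# Names taken from the example tables R and S; k=2 is an assumed threshold.
R = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
S = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
matches = nested_loop_join(R, S, k=2)
```

With 30K strings on each side this would mean roughly 900 million distance computations, which is why the slide calls the approach undesirable.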

12 Our 2-step approach. Step 1: map strings (in a metric space) to objects in a Euclidean space. Step 2: do a similarity join in the Euclidean space.
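The two steps can be illustrated with a small Python sketch. The slides do not spell out the mapping, so this example swaps in a simple pivot-based embedding (each coordinate is the edit distance to a chosen pivot string), which is an assumption and not the paper's mapping technique. By the triangle inequality each coordinate of two strings differs by at most their edit distance, so with d pivots the squared Euclidean distance is at most d·k² for any pair within edit distance k. The names `embed`, `two_step_join`, and the pivot choice are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def embed(s, pivots):
    """Step 1 (assumed mapping): one coordinate per pivot string."""
    return [edit_distance(s, p) for p in pivots]


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def two_step_join(r, s, k, pivots):
    # Any pair with edit distance <= k has squared Euclidean distance
    # <= d * k^2 in the embedded space, so this filter drops no true match.
    bound = len(pivots) * k * k
    r_vecs = [(a, embed(a, pivots)) for a in r]
    s_vecs = [(b, embed(b, pivots)) for b in s]
    # Step 2: similarity join on the vectors. A plain loop is used here for
    # brevity; a spatial index would replace it on large data.
    candidates = [(a, b) for a, va in r_vecs for b, vb in s_vecs
                  if squared_euclidean(va, vb) <= bound]
    # Verify candidates with the real distance to remove false positives.
    return [(a, b) for a, b in candidates if edit_distance(a, b) <= k]


R = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
S = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
result = two_step_join(R, S, k=2, pivots=["Jack Lemmon", "Kevin Spacey"])
```

The design point is that the expensive string comparisons are confined to the embedding and the final verification, while the join itself runs on numeric vectors where existing multidimensional join techniques apply.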

13 Advantages. Applicable to many metric similarity functions (e.g., edit distance). Open to existing algorithms: mapping techniques and join techniques.

14 Step 1: Map strings into a high-dimensional Euclidean space. Metric Space → Euclidean Space

15 Can it preserve distances? Use data set 1 (54K names) as an example, with k=2, d=20. Use k’=5.2 to differentiate similar and dissimilar pairs.

16 Multi-attribute linkage. Example: title + name + year. Different attributes have different similarity functions and thresholds. Consider merge rules in disjunctive format:

17 Secret of the paper …

18

19 Work since then … Chen: efficiency. Sharad: quality.

20 Chen’s Work on Efficiency. Gram-based algorithms: indexing, selection algorithms, join algorithms, variable-length grams, selectivity estimation. Trie-based algorithms: instant search.
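As a rough illustration of the gram-based idea, the Python sketch below builds an inverted index from q-grams to strings and answers an approximate selection query with the well-known count filter: two strings within edit distance k share at least max(|s|, |t|) − q + 1 − k·q grams. This is a generic textbook sketch, not the Flamingo implementation; the function names, q=2, and the sample data are assumptions.

```python
from collections import Counter, defaultdict


def edit_distance(s, t):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def qgrams(s, q=2):
    """All overlapping substrings of length q (no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Inverted index: gram -> list of (string id, occurrences in that string)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram, count in Counter(qgrams(s, q)).items():
            index[gram].append((sid, count))
    return index


def search(query, strings, index, k, q=2):
    """Return strings within edit distance k of query."""
    # Count grams shared with each indexed string (bag intersection).
    shared = Counter()
    for gram, qcount in Counter(qgrams(query, q)).items():
        for sid, count in index.get(gram, ()):
            shared[sid] += min(qcount, count)
    results = []
    for sid, s in enumerate(strings):
        if abs(len(s) - len(query)) > k:
            continue  # length filter
        needed = max(len(s), len(query)) - q + 1 - k * q
        if needed > 0 and shared[sid] < needed:
            continue  # count filter: too few common grams
        if edit_distance(query, s) <= k:  # verify surviving candidates
            results.append(s)
    return results


names = ["Jack Lemmon", "Harrison Ford", "Tom Hanks", "Kevin Spacey"]
index = build_index(names)
hits = search("Ton Hanks", names, index, k=1)
```

The final loop scans all strings to keep the sketch short; a production index would merge the inverted lists and touch only strings that reach the required count.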

21 The Flamingo Package http://flamingo.ics.uci.edu/

22 Follow-up work in the community. Significant amount of work on approximate string queries: selection and join.

23 Make an impact?

24 UCI People Search

25 Psearch (2008): 2 stories

26 Fuzzy search

27 www.omniplaces.com Location-based search

28 Research commercialization

29 Lesson learned: Hands-on experience is important!

