Published by Jamie Daye. Modified over 2 years ago.

1 Efficient Record Linkage in Large Data Sets
Liang Jin, Chen Li, Sharad Mehrotra
University of California, Irvine
DASFAA, Kyoto, Japan, March 2003

2 Motivation
Correlate data from different data sources (e.g., data integration)
- Data is often dirty
- Needs to be cleansed before being used
Example:
- A hospital needs to merge patient records from different data sources
- They have different formats, typos, and abbreviations

3 Example
Table R
Name           SSN           Addr
Jack Lemmon    430-871-8294  Maple St
Harrison Ford  292-918-2913  Culver Blvd
Tom Hanks      234-762-1234  Main St
…              …             …
Table S
Name           SSN           Addr
Ton Hanks      234-162-1234  Main Street
Kevin Spacey   928-184-2813  Frost Blvd
Jack Lemon     430-817-8294  Maple Street
…              …             …
Find records from different datasets that could be the same entity

4 Another Example
- P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40 (1981)
- Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981

5 Record linkage
Problem statement: “Given two relations, identify the potentially matched records
- Efficiently and
- Effectively”

6 Challenges
How to define good similarity functions?
- Many functions proposed (edit distance, cosine similarity, …)
- Domain knowledge is critical
  Names: “Wall Street Journal” and “LA Times”
  Address: “Main Street” versus “Main St”
How to do matching efficiently?
- Offline join version
- Online (interactive) search
  Nearest-neighbor search
  Range search

7 Outline
- Motivation of record linkage
- Single-attribute case: two-step approach
- Multi-attribute linkage
- Conclusion and related work

8 Single-attribute Case
Given
- two sets of strings, R and S
- a similarity function f between strings (metric space)
  Reflexive: f(s1,s2) = 0 iff s1 = s2
  Symmetric: f(s1,s2) = f(s2,s1)
  Triangle inequality: f(s1,s2) + f(s2,s3) >= f(s1,s3)
- a threshold k
Find: all pairs of strings (r, s) from R and S such that f(r,s) <= k.

9 Nested-loop?
Not desirable for large data sets
5 hours for 30K strings!

10 Our 2-step approach
Step 1: map strings (in a metric space) to objects in a Euclidean space
Step 2: do a similarity join in the Euclidean space

11 Advantages
Applicable to many metric similarity functions
- Use edit distance as an example
- Other similarity functions also tried, e.g., q-gram-based similarity
Open to existing algorithms
- Mapping techniques
- Join techniques

12 Step 1
Map strings into a high-dimensional Euclidean space
[Figure: metric space mapped to Euclidean space]

13 Example: Edit Distance
A widely used metric to define string similarity
ed(s1,s2) = minimum number of operations (insertion, deletion, substitution) needed to change s1 into s2
Example:
  s1: Tom Hanks
  s2: Ton Hank
  ed(s1,s2) = 2
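
The definition above can be sketched with the standard dynamic program for Levenshtein distance. This is a minimal illustration, not the implementation used in the talk; the function name `edit_distance` is an assumption introduced here.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # the slide's example pair
```

The two-row form keeps memory at O(len(b)) while the running time stays O(len(a) * len(b)) per pair, which is why comparing every pair becomes expensive on large sets.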

14 Mapping: StringMap
Input: A list of strings
Output: Points in a high-dimensional Euclidean space that preserve the original distances well
A variation of FastMap
- Each step greedily picks two strings (pivots) to form an axis
- All axes are orthogonal
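
The pivot-and-axis idea can be sketched with a generic FastMap-style projection. This is an illustration of the FastMap technique under stated assumptions, not the StringMap algorithm itself, whose details are not given on the slide. The names `fastmap`, `dist`, and `residual_sq` are hypothetical, and `dist` can be any metric (for example, edit distance).

```python
import math


def fastmap(objects, dist, d):
    """Map objects to d-dimensional points with a FastMap-style projection."""
    n = len(objects)
    coords = [[0.0] * d for _ in range(n)]

    def residual_sq(i, j, axis):
        # Squared distance left after subtracting the first `axis` coordinates.
        # For non-Euclidean metrics this can go slightly negative.
        full = dist(objects[i], objects[j]) ** 2
        used = sum((coords[i][h] - coords[j][h]) ** 2 for h in range(axis))
        return full - used

    for axis in range(d):
        # Greedy pivot heuristic: start at object 0, jump to the farthest twice.
        a = 0
        b = max(range(n), key=lambda j: residual_sq(a, j, axis))
        a = max(range(n), key=lambda j: residual_sq(b, j, axis))
        dab_sq = residual_sq(a, b, axis)
        if dab_sq <= 1e-12:
            break  # nothing left to separate; remaining axes stay at 0
        dab = math.sqrt(dab_sq)
        for i in range(n):
            # Cosine-law projection of object i onto the line through the pivots.
            coords[i][axis] = (residual_sq(a, i, axis) + dab_sq
                               - residual_sq(b, i, axis)) / (2 * dab)
    return coords
```

Each axis costs a linear number of distance evaluations, so the mapping avoids the quadratic all-pairs comparison of the nested-loop approach.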

15 Can it preserve distances?
Data Sources:
- IMDB star names: 54,000
- German names: 132,000
[Figure: distribution of string lengths]

16 Can it preserve distances?
Use data set 1 (54K names) as an example
k=2, d=20
- Use k’=5.2 to differentiate similar and dissimilar pairs.

17 Choose Dimensionality d
Increase d?
Good:
- better to differentiate similar pairs from dissimilar ones.
Bad:
- Step 1: Efficiency ↓
- Step 2: “curse of dimensionality”

18 Choose dimensionality d using sampling
Sample 1Kx1K strings, find their similar pairs (within distance k)
Calculate the maximum of their new distances, w
Define “Cost” of finding a similar pair:
Cost = (# of pairs within distance w) / (# of similar pairs)
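
A small sketch of this cost measure, assuming the slide's fraction reads as the number of sampled pairs within mapped distance w divided by the number of truly similar pairs (so a lower cost means fewer candidates to check per real match). The function name and the input format, a list of (original distance, mapped distance) tuples for the sampled pairs, are assumptions for illustration.

```python
def cost_of_dimensionality(pairs, k):
    """Return (w, cost) for sampled pairs given as (original_dist, mapped_dist)."""
    pairs = list(pairs)
    # Mapped distances of the pairs that are similar in the original space.
    similar = [mapped for orig, mapped in pairs if orig <= k]
    if not similar:
        return None, None  # no similar pair in the sample, cost is undefined
    w = max(similar)  # largest mapped distance among similar pairs
    within_w = sum(1 for _, mapped in pairs if mapped <= w)
    return w, within_w / len(similar)
```

Running this once per candidate d and comparing the resulting costs is one way to pick a dimensionality from the sample.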

19 Choose Dimensionality d
d = 15 ~ 25

20 Choose new threshold k’
Closely related to the mapping property
Ideally, if ed(r,s) <= k, the Euclidean distance between the two corresponding points is <= k’.
Choose k’ using sampling
- Sample 1Kx1K strings, find similar pairs
- Calculate their maximum new distance as k’
- Repeat multiple times, choose their maximum
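
The sampling procedure above might look like the following sketch. The function name, the `dist` and `embed` callables (original metric and string-to-point mapping), and the default parameters are assumptions for illustration; the slide does not state how many rounds were used.

```python
import math
import random


def estimate_k_prime(R, S, dist, embed, k, sample_size=1000, rounds=3, seed=0):
    """Estimate k' as the largest mapped distance seen among sampled similar pairs."""
    rng = random.Random(seed)
    k_prime = 0.0
    for _ in range(rounds):
        # Draw up to sample_size items from each side without replacement.
        sample_r = rng.sample(R, min(sample_size, len(R)))
        sample_s = rng.sample(S, min(sample_size, len(S)))
        for r in sample_r:
            for s in sample_s:
                if dist(r, s) <= k:  # similar in the original metric space
                    k_prime = max(k_prime, math.dist(embed(r), embed(s)))
    return k_prime
```

Because k’ comes from a sample, a similar pair outside the sample can still map farther apart than k’, which is why recall is measured separately.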

21 New threshold k’ in step 2
d=20

22 Step 2: Similarity Join
Input: Two sets of points in Euclidean space.
Output: Pairs of points whose distance is within the new threshold k’.
Many join algorithms can be used

23 Example
Adopted an algorithm by Hjaltason and Samet.
- Build two R-trees.
- Traverse the two trees, find points whose distance is within k’.
- Prune during traversal (e.g., using MinDist).

24 Final processing
Among the pairs produced from the similarity-join step, check their edit distance.
Return those pairs satisfying the threshold k
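
The filter-then-verify flow can be sketched end to end on the names from the earlier example tables. Two loud assumptions: the Euclidean join here is a brute-force stand-in for the R-tree join, and the mapping is a toy one-dimensional embedding (string length) chosen only because a length difference never exceeds the edit distance. Neither is the method from the talk, and the function names are hypothetical.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def filter_and_verify(R, S, embed, k, k_prime):
    """Return (candidates, matches): Euclidean filter first, then edit-distance check."""
    points_r = [(r, embed(r)) for r in R]
    points_s = [(s, embed(s)) for s in S]
    # Similarity-join stand-in: brute force over mapped points (assumption).
    candidates = [(r, s) for r, pr in points_r for s, ps in points_s
                  if math.dist(pr, ps) <= k_prime]
    # Final processing: keep only candidates that satisfy the original threshold k.
    matches = [(r, s) for r, s in candidates if edit_distance(r, s) <= k]
    return candidates, matches


R = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
S = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
candidates, matches = filter_and_verify(
    R, S, embed=lambda s: (float(len(s)),), k=1, k_prime=1.0)
print(len(candidates), matches)
```

The verification step removes false positives introduced by the mapping, so the final answer contains only pairs that truly satisfy the threshold k.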

25 Running time

26 Recall
Recall = (# of found similar pairs) / (# of all similar pairs)
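
As a quick illustration of the ratio, a sketch that compares the pairs returned by the two-step approach against a ground-truth set (for instance, from a nested-loop run on a sample). The function name and the convention of returning 1.0 when there are no similar pairs are assumptions.

```python
def recall(found_pairs, all_similar_pairs):
    """Fraction of the truly similar pairs that were found."""
    truth = set(all_similar_pairs)
    if not truth:
        return 1.0  # assumption: nothing to find counts as full recall
    return len(truth & set(found_pairs)) / len(truth)
```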

27 Outline
- Motivation of record linkage
- Single-attribute case: two-step approach
- Multi-attribute linkage
- Conclusion and related work

28 Multi-attribute linkage
Example: title + name + year
Different attributes have different similarity functions and thresholds
Consider merge rules in disjunctive format:

29 Evaluation strategies
Many ways to evaluate rules
Finding an optimal one: NP-hard
Heuristics:
- Treat different conjuncts independently. Pick the “most efficient” attribute in each conjunct.
- Choose the largest threshold for each attribute. Then choose the “most efficient” attribute among these thresholds.
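
To make the rule format concrete, here is a sketch of checking one record pair against a merge rule in disjunctive form: a list of conjuncts, each a list of (attribute, distance function, threshold) predicates. The rule, the distance functions, and the records below are hypothetical examples, not taken from the talk, and the sketch only evaluates a rule; it does not implement the heuristics for choosing which attribute to join on.

```python
def matches(rec_a, rec_b, rule):
    """True if any conjunct has every attribute predicate within its threshold."""
    return any(
        all(dist(rec_a[attr], rec_b[attr]) <= threshold
            for attr, dist, threshold in conjunct)
        for conjunct in rule
    )


def discrete(a, b):
    """Hypothetical distance: 0 for equal values, 1 otherwise."""
    return 0 if a == b else 1


def year_gap(a, b):
    """Hypothetical distance between two years."""
    return abs(a - b)


# Hypothetical rule: (same title AND year within 1) OR (same name).
rule = [
    [("title", discrete, 0), ("year", year_gap, 1)],
    [("name", discrete, 0)],
]

a = {"title": "Big", "name": "Tom Hanks", "year": 1988}
b = {"title": "Big", "name": "Ton Hanks", "year": 1989}
c = {"title": "Splash", "name": "Ton Hanks", "year": 1984}
print(matches(a, b, rule), matches(a, c, rule))
```

In practice an exact-match predicate on names would be replaced by a string metric such as edit distance with its own threshold, as the multi-attribute slide describes.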

30 Summary
- A novel two-step approach to record linkage.
- Many existing mapping and join algorithms can be adopted
- Applicable to many distance metrics.
- Time and space efficient.
- Multi-attribute case studied

31 Related work
- Learning similarity functions: [Sarawagi and Bhamidipaty, 2003]
- Efficient merge and purge: [Hernandez and Stolfo, 1995]
- String edit-distance join using DBMS: [Gravano et al., 2001]

32 The Flamingo Project on Data Cleansing
