Efficient Exact Set-Similarity Joins. Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. DMX Group, Microsoft Research.

1 Efficient Exact Set-Similarity Joins. Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. DMX Group, Microsoft Research

2 Data Cleaning (Sept. 15, 2006, Set-Similarity Joins)

Name          Street                City        State      Zip
INGRAM MICRO  1600 ST ANDREWS PL    SANTA ANA   CA         92799
GTE CORP      1 STAMFORD FORUM      STAMFORD    CT         (missing)
LOGISOFT      274 GOODMAN ST N      ROCHESTER   (missing)  14607
CIEDC         1800 5TH ST           LINCONL     IL         92799
INGRAM MCRO   1600 ST ANDREW’S PL   SANTA ANA   CA         92799

6 Data Cleaning

Name          Street                City        State   Zip
INGRAM MICRO  1600 ST ANDREWS PL    SANTA ANA   CA      92799
GTE CORP      1 STAMFORD FORUM      STAMFORD    CT      06901
LOGISOFT      274 GOODMAN ST N      ROCHESTER   NY      14607
CIEDC         1800 5TH ST           LINCOLN     IL      92799

7 String Similarity Join. Reference Table (CITY): ALABASTER, ALBERTVILLE, …, LINCOLN, …, YUCAIPA. Table with a City column: …, LINCONL, …

8 String Similarity (Self) Join, shown on the same five-row Name / Street / City / State / Zip table as slide 2 (including the near-duplicate rows INGRAM MICRO and INGRAM MCRO).

9 Strings → Sets [CGK ’06]. 2-grams: microsoft → {mi, ic, cr, ro, os, so, of, ft}; mcrosoft → {mc, cr, ro, os, so, of, ft}. Edit distance ≤ 1 on the strings implies Δ ≤ 4 on the 2-gram sets.
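
The tokenization step on this slide can be sketched in a few lines. This is only an illustration: the function name `qgrams` is made up here, and the sketch assumes plain contiguous 2-grams with no padding characters, which matches the sets shown on the slide.

```python
def qgrams(text: str, q: int = 2) -> set:
    """Return the set of contiguous q-grams of text (no padding; an assumption)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


a = qgrams("microsoft")
b = qgrams("mcrosoft")

# Symmetric difference: 2-grams that occur in exactly one of the two sets.
delta = a ^ b
print(sorted(delta), len(delta))
```

For this pair the symmetric difference has three elements (ic, mc, mi), which is within the Δ ≤ 4 bound stated on the slide.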

10 Strings → Sets. String Sim Join (edit distance ≤ 1) between R and S; the two string columns contain mcrosoft, … and microsoft, …

11 Strings → Sets. Tokenize R and S (mcrosoft, …; microsoft, …), run a Set Sim Join with Δ ≤ 4, then Post-Process.

12 String → Set: Advantages [CGK ’06]: Generalizes to many string similarity funcs; Powerful primitive; Sets ≈ Relations; Leverage relational data processing.

13 Contributions: New algorithms for set-similarity joins; Exact answers; Performance guarantees; Outperform previous exact algorithms (orders of magnitude). Exact answers are important for operators.

14 Outline: Introduction; Algorithms; Experiments; Conclusion

15 Inputs S and R, two collections of 2-gram sets. One collection: { mi, ic, cr, ro, os, so, of, ft }, { lo, og, gi, is, so, of, ft }, { … }, { bo, oe, ei, in, ng }. The other: { mc, cr, ro, os, so, of, ft }, { lg, gi, is, so, of, ft }, { … }.

16 Same inputs S and R, with the join predicate: Intersection size ≥ 5.

19 Intersection size ≥ 5. First output pair: { mc, cr, ro, os, so, of, ft } and { mi, ic, cr, ro, os, so, of, ft }.

20 Intersection size ≥ 5. Second output pair: { lg, gi, is, so, of, ft } and { lo, og, gi, is, so, of, ft }.

21 General form: S = s_1, s_2, s_3, …, s_m and R = r_1, r_2, r_3, …, r_n; output the pairs with Sim(r_i, s_j) ≥ θ. Example pairs as on slide 20.

22 Slide 21 repeated with the annotation "Large".

23 Set-Similarity Join: Symmetric Difference ≤ k. Input: R: r_1, r_2, …, r_n (n sets); S: s_1, s_2, …, s_m (m sets). Output: all pairs (r_i, s_j) such that |r_i Δ s_j| ≤ k. Running example: k = 4.
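
As a baseline for the problem statement on this slide, the join can be written as a nested loop that checks every pair. This is a sketch for illustration only, not one of the algorithms from the talk; the name `naive_symdiff_join` and the assignment of the example sets to R and S are assumptions.

```python
def naive_symdiff_join(R, S, k):
    """Return index pairs (i, j) with |R[i] Δ S[j]| <= k by checking every pair."""
    return [
        (i, j)
        for i, r in enumerate(R)
        for j, s in enumerate(S)
        if len(r ^ s) <= k  # r ^ s is the symmetric difference of two Python sets
    ]


# 2-gram sets taken from the earlier slides; which side is R or S is assumed here.
R = [
    {"mc", "cr", "ro", "os", "so", "of", "ft"},
    {"lg", "gi", "is", "so", "of", "ft"},
]
S = [
    {"mi", "ic", "cr", "ro", "os", "so", "of", "ft"},
    {"lo", "og", "gi", "is", "so", "of", "ft"},
    {"bo", "oe", "ei", "in", "ng"},
]

print(naive_symdiff_join(R, S, k=4))  # [(0, 0), (1, 1)]
```

The cost is one comparison per pair, n × m in total, which is what signature-based schemes try to avoid on large inputs.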

24 Alternate Set Representation: s = { 4, 10, 13, 24, 29, 35, 41, 46, 48 }

25 Alternate Set Representation: s = { 4, 10, 13, 24, 29, 35, 41, 46, 48 }, drawn as marks on a line with ticks at 1, 25, 50.

29 Enumeration: sets r and s with |r Δ s| ≤ 4.

31 Enumeration: sets r and s with |r Δ s| ≤ 4; the positions where they differ are marked "Errors".

32 Enumeration: sets r and s with |r Δ s| ≤ 4; the element domain is divided into partitions 1 to 5.

33 Enumeration: Signature Generation. Sig(s) is a set of five signatures derived from s.

34 Enumeration: Signature Generation. Each signature is hashed with Hash32(): Sig(s) = { 0x4f72ba91, 0x29c8af10, 0x594b2c17, 0xa3b0e20f, 0xdd21f32a }.

35 Property of Signatures: |r Δ s| ≤ 4 ⇒ Sig(r) ∩ Sig(s) ≠ ∅ (sets r and s over partitions 1 to 5).

36 Enumeration: Algorithm. (1) Generate signatures for each r_i, s_j. (2) Enumerate (r_i, s_j) s.t. Sig(r_i) ∩ Sig(s_j) ≠ ∅. (3) Output those satisfying |r_i Δ s_j| ≤ 4.
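
The three steps above can be sketched as follows. Several details are assumptions, because the slides show them only as figures: elements are integers, element e is assigned to partition `e % n_parts`, and a signature is the projection of the set onto a choice of `n_parts - k` partitions. The slide hashes each signature to 32 bits; the sketch keeps the raw tuples. With k = 4 and five partitions this gives five signatures per set, and the pigeonhole argument (four errors touch at most four partitions) gives the property on slide 35.

```python
from collections import defaultdict
from itertools import combinations


def enum_signatures(s, k, n_parts):
    """One signature per choice of (n_parts - k) partitions of the integer set s."""
    if n_parts <= k:
        raise ValueError("need more partitions than allowed errors")
    # Assumption: element e belongs to partition e % n_parts.
    parts = [frozenset(e for e in s if e % n_parts == p) for p in range(n_parts)]
    return {
        (chosen, tuple(parts[p] for p in chosen))
        for chosen in combinations(range(n_parts), n_parts - k)
    }


def signature_join(R, S, k, n_parts):
    """Steps 1 to 3: generate signatures, find candidates, verify each candidate."""
    index = defaultdict(set)  # signature -> indices of sets in S that produce it
    for j, s in enumerate(S):
        for sig in enum_signatures(s, k, n_parts):
            index[sig].add(j)
    result = set()
    for i, r in enumerate(R):
        candidates = set()
        for sig in enum_signatures(r, k, n_parts):
            candidates |= index.get(sig, set())
        for j in candidates:  # post-filter removes false positive candidates
            if len(r ^ S[j]) <= k:
                result.add((i, j))
    return sorted(result)


s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = (s - {4, 10}) | {5, 11}  # differs from s in exactly four elements
print(signature_join([r], [s], k=4, n_parts=5))
```

Because every candidate is verified against the actual predicate, the output is exact; the signatures only decide how many pairs reach the verification step.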

37 Enumeration. Example with sets r_1, …, r_5 and s_1, …, s_5, each shown with its signature set Sig(r_i) or Sig(s_j). The pair (r_2, s_1) is a candidate because Sig(r_2) ∩ Sig(s_1) ≠ ∅.

39 Enumeration. Same example; the candidate pairs are labelled either "Output" or "False positive candidate pairs".

40 Operator flow: R (Id, Elem) and S (Id, Elem) → Gen Signatures → R’ (Id, Sig) and S’ (Id, Sig) → join on R.Sig = S.Sig → δ (duplicate elimination) on R.Id, S.Id → Post-Process each (R.Id, S.Id).
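
The join and duplicate-elimination steps of this flow map directly onto SQL. The sketch below uses Python's standard `sqlite3` module with an in-memory database purely for illustration; the table names `R_sig` and `S_sig`, the function name, and the use of text-valued signatures are assumptions, and the talk does not say which DBMS or schema was used.

```python
import sqlite3


def candidate_pairs(r_sigs, s_sigs):
    """Join R'(Id, Sig) with S'(Id, Sig) on Sig and return distinct (R.Id, S.Id) pairs."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE R_sig (Id INTEGER, Sig TEXT)")
    con.execute("CREATE TABLE S_sig (Id INTEGER, Sig TEXT)")
    con.executemany("INSERT INTO R_sig VALUES (?, ?)", r_sigs)
    con.executemany("INSERT INTO S_sig VALUES (?, ?)", s_sigs)
    rows = con.execute(
        """
        SELECT DISTINCT R_sig.Id AS rid, S_sig.Id AS sid   -- delta: drop duplicate pairs
        FROM R_sig JOIN S_sig ON R_sig.Sig = S_sig.Sig     -- equi-join on the signature
        ORDER BY rid, sid
        """
    ).fetchall()
    con.close()
    return rows


# Sets 1 and 10 share two signatures but are reported once.
print(candidate_pairs([(1, "a"), (1, "b"), (2, "c")],
                      [(10, "a"), (10, "b"), (11, "z")]))  # [(1, 10)]
```

The post-processing step, which checks the real similarity predicate for each returned pair, would run on this result.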

41 No False Positive Candidate Pair: sets r and s over partitions 1 to 5, with |r Δ s| = 5.

42 False Positive Candidate Pair: sets s_1 and s_2 over partitions 1 to 5, with |r Δ s| = 5.

43 Enumeration: Performance (chart; k = 4).

44 Enumeration: Performance (chart; k = 4; "Ideal Performance" marked).

45 Enumeration: sets r and s with |r Δ s| ≤ 4.

46 Enumeration: sets r and s with |r Δ s| ≤ 4; the element domain is divided into partitions 1 to 6.

47 Enumeration: Signature Generation for s_1 over partitions 1 to 6.

51 Enumeration: Signature Generation for s_1 over partitions 1 to 6: (6 choose 2) = 15.

52 Algorithm. (1) Generate signatures for each r_i, s_j. (2) Enumerate (r_i, s_j) s.t. Sig(r_i) ∩ Sig(s_j) ≠ ∅. (3) Output those satisfying |r_i Δ s_j| ≤ 4. Only the signature function changes.

53 Enumeration: Performance (chart; k = 4).

54 False Positive Candidate Pair: sets r and s over partitions 1 to 6, with |r Δ s| = 5.

56 Enumeration: Performance (chart; k = 4; values shown: 5, 15, 35, 4845).
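
The four values on this chart coincide with the number of ways to choose n − 4 partitions out of n for n = 5, 6, 7 and 20. Reading them as signature counts per set for increasing partition counts is an assumption about the chart, but the arithmetic itself is easy to check; the function name below is made up for the sketch.

```python
from math import comb


def num_signatures(n_parts, k):
    """Number of ways to pick the (n_parts - k) partitions that form one signature."""
    return comb(n_parts, n_parts - k)


for n in (5, 6, 7, 20):
    print(n, num_signatures(n, k=4))  # 5, 15, 35, 4845
```

More partitions make each signature more selective, so fewer false positive candidates are expected, but the count of signatures per set grows quickly.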

57 PartEnum: Divide and Conquer. s_1 is split into partitions 1 and 2; for k = 4 the partitions get thresholds k_1 = 2 and k_2 = 1. Generate signatures using Enumeration.
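
A rough sketch of the divide-and-conquer idea on this slide, under stated assumptions: elements are integers, the first level splits by `e % n1`, the second level splits each part by `(e // n1) % n2`, and the function name and parameters are hypothetical. The thresholds work because (k_1 + 1) + (k_2 + 1) = 5 > k = 4: if r and s differ in at most four elements, at least one first-level part stays within its own threshold, and enumeration inside that part then yields a shared signature.

```python
from itertools import combinations


def partenum_signatures(s, thresholds, n2):
    """Two-level signatures: split s into len(thresholds) parts, enumerate inside each.

    A pair with |r Δ s| <= k shares a signature whenever
    sum(k_i + 1 for k_i in thresholds) > k and n2 > max(thresholds).
    """
    n1 = len(thresholds)
    if n2 <= max(thresholds):
        raise ValueError("n2 must exceed every per-part threshold")
    sigs = set()
    for i, k_i in enumerate(thresholds):
        part = [e for e in s if e % n1 == i]  # first-level partition i (assumed rule)
        sub = [
            frozenset(e for e in part if (e // n1) % n2 == p)  # second-level split
            for p in range(n2)
        ]
        for chosen in combinations(range(n2), n2 - k_i):
            sigs.add((i, chosen, tuple(sub[p] for p in chosen)))
    return sigs


s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = (s - {4, 10}) | {5, 11}  # |r Δ s| = 4
shared = partenum_signatures(r, (2, 1), n2=3) & partenum_signatures(s, (2, 1), n2=3)
print(len(partenum_signatures(s, (2, 1), n2=3)), bool(shared))
```

With these toy parameters each set gets (3 choose 1) + (3 choose 2) = 6 signatures. Candidate pairs still need the same verification step as in plain enumeration.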

58 PartEnum: Asymptotic Performance. Theorem: there is an instance of PartEnum such that (a) if |r Δ s| > 7.5 k, then r and s do not share a signature with probability 1 – o(1); (b) the number of signatures per set is O(k^2).

59 PartEnum: Summary. Set-Similarity Joins with predicate |r Δ s| ≤ k; Theoretical guarantees; First exact algorithm.

60 Other results. PartEnum extensions: larger class of set-similarity join predicates (Jaccard); basic idea: reduce to symmetric set difference. WtEnum class of signature functions: use frequency of elements; weighted set-similarity joins.
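
The slide only states that Jaccard predicates are reduced to symmetric set difference. One standard identity that makes such a reduction possible, offered here as an illustration rather than as the authors' construction: for non-empty sets with sizes a and b, Jaccard(r, s) ≥ γ holds exactly when |r Δ s| ≤ (a + b)(1 − γ)/(1 + γ). The function names are made up, and exact fractions are used to avoid rounding at the boundary.

```python
from fractions import Fraction


def jaccard(r, s):
    """Exact Jaccard similarity of two non-empty sets."""
    return Fraction(len(r & s), len(r | s))


def symdiff_bound(size_r, size_s, gamma):
    """Largest |r Δ s| compatible with Jaccard >= gamma for the given set sizes."""
    gamma = Fraction(gamma)
    return (size_r + size_s) * (1 - gamma) / (1 + gamma)


r = {"mc", "cr", "ro", "os", "so", "of", "ft"}
s = {"mi", "ic", "cr", "ro", "os", "so", "of", "ft"}

for gamma in (Fraction(1, 2), Fraction(2, 3), Fraction(4, 5)):
    by_jaccard = jaccard(r, s) >= gamma
    by_symdiff = len(r ^ s) <= symdiff_bound(len(r), len(s), gamma)
    print(gamma, by_jaccard, by_symdiff)  # the two tests agree for every gamma
```

The bound depends on the two set sizes, so a join built on it has to account for sets of different sizes; how PartEnum does that is not shown on the slide.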

61 Outline: Introduction; Algorithms; Experiments; Conclusion

62 Implementation. Same operator flow as slide 40 (Gen Signatures on R (Id, Elem) and S (Id, Elem); join on R.Sig = S.Sig; δ on R.Id, S.Id; Post-Process each R.Id, S.Id), annotated with where the steps run. Labels on the slide: DBMS, Client + DBMS, DBMS, Client.

63 Previous Work. Prefix Filtering [CGK ’06]: exact. Locality Sensitive Hashing [IM ’98]: approximate; false negative rate: 5%.

64 Data Sets. Organization addresses [MS Sales]. Concatenation: Org name, street, city, zip. Input size: 1 million. Avg. length: 11 words, 58 chars. Tokenization: words, n-grams.

65 Jaccard, 1M, MS Sales (chart; values shown: 0.8, 0.85, 0.9).

66 Evaluation. Same operator flow as slide 40, annotated with the labels DBMS, DBMS, Intermediate Result size, Client + DBMS, Client.

67 Jaccard, 1M, MS Sales (chart; values shown: 0.8, 0.85, 0.9).

68 Jaccard, Synthetic (chart).

69 Similar Results for … Other data sets: DBLP, synthetic data sets. Other similarity functions: weighted Jaccard, edit distance.

70 Conclusion. New algorithms for set-similarity joins; Exact; Performance guarantees; Outperform previous exact algorithms. Search: “data cleaning project”

