Jiannan Wang (Tsinghua, China) Guoliang Li (Tsinghua, China) Jianhua Feng (Tsinghua, China)


1 Jiannan Wang (Tsinghua, China) Guoliang Li (Tsinghua, China) Jianhua Feng (Tsinghua, China)

2 Outline: Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity; Signature Scheme for token sets; Signature Scheme for tokens; Experiment; Conclusion
Fast-Join @ ICDE2011, 2011/4/13

3 Background
String Similarity Join
  Find similar string pairs between two string sets
  An essential operation in many applications
Card #    | Name           | Addr                  | Phn
1234****  | Jeffery Ullman | CS Dept. Stanford, CA | 111-1111
1018****  | Marvin Minsky  | CS Dept., MIT, MA     | 222-2222
…         | …              | …                     | …
Card #    | Name            | Email               | Tel
1205****  | David Patterson | Patterson@ucb.com   | 999-9999
0101****  | Jeffrey Ullman  | ullman@stanford.com | (650)111-1111
…         | …               | …                   | …
"Jeffery Ullman" vs. "Jeffrey Ullman": perform a similarity join on the name attribute

4 Background
String Similarity Join
  Find similar string pairs between two string sets
  An essential operation in many applications
User Id   | Query              | Timestamp
1018****  | ICDE 2011 Hanover  | 2011-01-15 10:12:10
1234****  | NBA All Stars 2011 | 2011-01-15 11:05:06
2823****  | ICDE Hannover      | 2011-01-15 11:10:10
6345****  | weather Hanover    | 2011-01-15 12:34:10
…         | …                  | …
Perform a self similarity join on the query attribute

5 Motivation
Existing Similarity Metrics
  Token-based Similarity: Dice, Cosine, Jaccard, …
  Character-based Similarity: Edit Distance, Edit Similarity, …
  Hybrid Similarity: GED [SIGMOD 03]
Example: S1 = "nba mcgrady", S2 = "macgrady nba"
  Jaccard(S1, S2) = 1/3
  ED(S1, S2) = 8
  GED(S1, S2) = 0

6 Outline: Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity; Signature Scheme for token sets; Signature Scheme for tokens; Experiment; Conclusion

7 Token-based Similarity
Dice similarity, Cosine similarity, Jaccard similarity
All are computed from exactly matched token pairs, i.e. T1 ∩ T2
Example: T1 = {nba, mcgrady}, T2 = {macgrady, nba}, |T1 ∩ T2| = 1
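The formulas for the three measures did not survive in the transcript. As a minimal sketch, assuming the standard set-based definitions of Dice, Cosine and Jaccard (the function names are illustrative and not taken from the slides), the example above can be checked as follows:

```python
import math

# Standard set-based definitions; token sets are assumed to be non-empty.
def dice(t1: set, t2: set) -> float:
    return 2 * len(t1 & t2) / (len(t1) + len(t2))

def cosine(t1: set, t2: set) -> float:
    return len(t1 & t2) / math.sqrt(len(t1) * len(t2))

def jaccard(t1: set, t2: set) -> float:
    return len(t1 & t2) / len(t1 | t2)

T1 = {"nba", "mcgrady"}
T2 = {"macgrady", "nba"}

# Only "nba" matches exactly, so |T1 ∩ T2| = 1.
print(dice(T1, T2), cosine(T1, T2), jaccard(T1, T2))
```

All three scores stay low for this pair because "mcgrady" and "macgrady" count as a complete mismatch, even though they differ by one character.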

8 Weighted Bipartite Graph (figure: token sets T1 and T2; token nodes mcgrady, nba, wnba, macgrady, nba; edge weights 0.125, 0.75, 0.875, 0.143, 1, 0.125)
3. Fuzzy Overlap: Maximum Weighted Matching (Quantify token similarity)
Better than |T1 ∩ T2| = 1
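The fuzzy overlap can be sketched in a few lines. The sketch assumes that edge weights are normalized edit similarities, 1 - ED(a, b) / max(|a|, |b|), which is consistent with the weights listed on the slide (for example 0.875 for mcgrady/macgrady and 0.75 for wnba/nba). The function names and the brute-force matching are illustrative choices, not the authors' implementation.

```python
from itertools import permutations

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    """Assumed edge weight: 1 - ED / max length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def fuzzy_overlap(t1: list, t2: list) -> float:
    """Weight of a maximum weighted matching, found by brute force.

    Every assignment of the smaller side to distinct tokens of the larger
    side is tried, which is only practical for small token sets.
    """
    small, large = sorted((list(t1), list(t2)), key=len)
    best = 0.0
    for perm in permutations(large, len(small)):
        total = sum(edit_similarity(a, b) for a, b in zip(small, perm))
        best = max(best, total)
    return best

# nba-nba contributes 1 and mcgrady-macgrady contributes 0.875.
print(fuzzy_overlap(["nba", "mcgrady"], ["macgrady", "nba"]))
```

For larger token sets, an assignment solver such as SciPy's `linear_sum_assignment` with `maximize=True` would replace the brute-force loop.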

9 Fuzzy-Token Similarity
Fuzzy-Dice similarity, Fuzzy-Cosine similarity, Fuzzy-Jaccard similarity
Example: T1 = {nba, mcgrady}, T2 = {macgrady, nba}
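The formulas on this slide were lost in the transcript. A plausible reconstruction, assuming each fuzzy variant substitutes the fuzzy overlap (written here as $|T_1 \,\tilde{\cap}\, T_2|$, a notation chosen for this sketch) for $|T_1 \cap T_2|$ in the standard definition, is worked through below. The example value assumes nba/nba is weighted 1 and mcgrady/macgrady is weighted $1 - 1/8 = 0.875$, giving a fuzzy overlap of 1.875.

```latex
\begin{align*}
\text{Fuzzy-Dice}(T_1, T_2)    &= \frac{2\,|T_1 \,\tilde{\cap}\, T_2|}{|T_1| + |T_2|}
                                = \frac{2 \times 1.875}{2 + 2} = 0.9375 \\
\text{Fuzzy-Cosine}(T_1, T_2)  &= \frac{|T_1 \,\tilde{\cap}\, T_2|}{\sqrt{|T_1| \cdot |T_2|}}
                                = \frac{1.875}{\sqrt{2 \times 2}} = 0.9375 \\
\text{Fuzzy-Jaccard}(T_1, T_2) &= \frac{|T_1 \,\tilde{\cap}\, T_2|}{|T_1| + |T_2| - |T_1 \,\tilde{\cap}\, T_2|}
                                = \frac{1.875}{2 + 2 - 1.875} \approx 0.882
\end{align*}
```

Under these assumptions the fuzzy scores are far higher than the exact-match scores for the same pair (for example Jaccard = 1/3), which reflects the near match between "mcgrady" and "macgrady".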

10 Comparison with Existing Similarities

11 Outline: Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity; Signature Scheme for token sets; Signature Scheme for tokens; Experiment; Conclusion

12 String Similarity Join using Fuzzy-Token Similarity
s1 = "kobe and trancy", s2 = "trcy macgrady mvp", …
s'1 = "kobe bryant age", s'2 = "mvp tracy mcgrady", …
Tokenization:
  T1 = {kobe, and, trancy}, T2 = {trcy, macgrady, mvp}, …
  T'1 = {kobe, bryant, age}, T'2 = {mvp, tracy, mcgrady}, …
Result: (s2, s'2), …
Naive Solution: enumerating N² pairs. Quite expensive!
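The naive solution can be sketched as a nested loop over both string sets. This is a minimal illustration under stated assumptions: whitespace tokenization, normalized edit similarity as the token-level weight, a brute-force matching, and a threshold `tau = 0.75` picked only for the demo. None of the names or parameter values come from the slides.

```python
from itertools import permutations

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def token_sim(a: str, b: str) -> float:
    # Assumed token-level weight: 1 - ED / max length (tokens are non-empty).
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def fuzzy_jaccard(t1: list, t2: list) -> float:
    # Fuzzy overlap via brute-force maximum weighted matching (small sets only).
    small, large = sorted((t1, t2), key=len)
    overlap = max(sum(token_sim(a, b) for a, b in zip(small, perm))
                  for perm in permutations(large, len(small)))
    return overlap / (len(t1) + len(t2) - overlap)

def naive_join(left: list, right: list, tau: float):
    """Compare every pair: len(left) * len(right) similarity computations."""
    results, comparisons = [], 0
    for s in left:
        for r in right:
            comparisons += 1
            if fuzzy_jaccard(s.split(), r.split()) >= tau:
                results.append((s, r))
    return results, comparisons

left = ["kobe and trancy", "trcy macgrady mvp"]
right = ["kobe bryant age", "mvp tracy mcgrady"]
pairs, comparisons = naive_join(left, right, tau=0.75)
print(pairs, comparisons)
```

Every pair pays for a full matching computation, so the cost grows with the product of the two set sizes. That cost is what the signature schemes in the following slides aim to avoid.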

13 Using Existing Methods

14 Our Signature Scheme
The superscript denotes which token generates the signature

15 Outline: Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity; Signature Scheme for Token Sets; Signature Scheme for Tokens; Experiment; Conclusion

16 Prefix Filtering Signature Scheme
Alphabetical Order
Remove 2 largest signatures
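The idea behind prefix filtering can be sketched for the simpler exact-token case, where each token acts as its own signature. This is an assumption made for illustration; the signatures on the slide are generated from tokens in a way the transcript does not show. If two sets must share at least `min_overlap` elements, then after sorting each set in a global order and dropping its `min_overlap - 1` largest elements, any qualifying pair still shares at least one remaining element. With `min_overlap = 3`, this removes the 2 largest signatures, as on the slide.

```python
from itertools import combinations

def prefix_signatures(tokens: set, min_overlap: int) -> set:
    """Keep the sorted prefix after dropping the (min_overlap - 1) largest items."""
    ordered = sorted(tokens)                      # alphabetical global order
    keep = max(len(ordered) - (min_overlap - 1), 0)
    return set(ordered[:keep])

def candidate_pairs(token_sets: dict, min_overlap: int) -> list:
    """Pairs whose prefixes intersect; all other pairs are pruned unverified."""
    sigs = {name: prefix_signatures(toks, min_overlap)
            for name, toks in token_sets.items()}
    return [(a, b) for a, b in combinations(sorted(sigs), 2) if sigs[a] & sigs[b]]

# Hypothetical token sets, not taken from the slides.
token_sets = {
    "A": {"a", "b", "c", "d"},
    "B": {"b", "c", "d", "e"},
    "C": {"w", "x", "y", "z"},
}
print(candidate_pairs(token_sets, min_overlap=3))
```

Candidates still need verification with the real similarity function. The filter only guarantees that no qualifying pair is missed.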

17 Token Sensitive Signature Scheme
Prefix Filtering: No!
Token Sensitive: Yes!

18 Token Sensitive Signature Scheme (Cont’d)
Alphabetical Order
Delete the maximal number of largest signatures that contain 2 tokens
Candidates: {(T2,T4)}
Candidates: {(T1,T2),(T1,T3),(T1,T4),(T2,T4)}

19 Outline: Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity; Signature Scheme for token sets; Signature Scheme for tokens; Experiment; Conclusion

20 Partition-NED Signature Scheme

21 Partition t’

22 Partition t

23 Partition t (Cont’d) -3 -2 2

24 Pruning Techniques
Reduce substrings from 21 to 8

25 Comparison with Partition-ED (SIGMOD 09)

26 Outline: Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity; Signature Scheme for token sets; Signature Scheme for tokens; Experiment; Conclusion

27 Experiment Setup
Data sets
  DBLP Author: Author names from DBLP dataset
  AOL Query Log: Queries from AOL dataset
Environment
  C++, GCC 4.2.3, Ubuntu
  Intel Core 2 Quad X5450 3.00GHz processor and 4 GB memory

28 Result Quality

29 Evaluation on Different Signature Schemes for Tokens

30 Evaluation on Different Signature Schemes for Token Sets

31 Put Everything Together

32 Outline: Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity; Signature Scheme for Token Sets; Signature Scheme for Tokens; Experiment; Conclusion

33 Conclusion
Fuzzy-token similarity
  Hybrid similarity
  Subsumes many well-known similarities
  High result quality
String similarity join using fuzzy-token similarity
  Signature-based framework
  Token-sensitive signature scheme
  Partition-NED signature scheme
  Achieves higher performance than the state-of-the-art methods, both theoretically and experimentally

34 http://dbgroup.cs.tsinghua.edu.cn/wangjn/projects/fastjoin/

