Jiannan Wang (Tsinghua, China), Guoliang Li (Tsinghua, China), Jianhua Feng (Tsinghua, China)
Outline: Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity; Signature Scheme for Token Sets; Signature Scheme for Tokens; Experiment; Conclusion. 2011/4/13, ICDE 2011
Background: String Similarity Join. Find similar string pairs between two string sets. An essential operation in many applications. Example: table 1 (Card #, Name, Addr, Phn) has rows (1234****, Jeffery Ullman, CS Dept. Stanford, CA), (****, Marvin Minsky, CS Dept., MIT, MA), …; table 2 (Card #, Name, Tel) has rows (1205****, David), (0101****, Jeffrey), … Pair highlighted on the slide: "Jeffery Ullman" and "Jeffrey Ullman". Perform a similarity join on the name attribute.
Background: String Similarity Join. Find similar string pairs between two string sets. An essential operation in many applications. Example: a query log (User Id, Query, Timestamp) with rows (1018****, ICDE 2011 Hanover, :12:), (****, NBA All Stars, :05:), (****, ICDE Hannover, :10:), (****, weather Hanover, :34:10), … Perform a self similarity join on the query attribute.
Motivation. Existing similarity metrics: token-based similarity (Dice, Cosine, Jaccard, …); character-based similarity (Edit Distance, Edit Similarity, …); hybrid similarity (GED [SIGMOD 03]). Example: S1 = “nba mcgrady”, S2 = “macgrady nba”: Jaccard(S1, S2) = 1/3, ED(S1, S2) = 8, GED(S1, S2) = 0.
Token-based Similarity: Dice similarity, Cosine similarity, Jaccard similarity. All three count exactly matched token pairs, i.e. T1 ∩ T2. Example: T1 = {nba, mcgrady}, T2 = {macgrady, nba}, |T1 ∩ T2| = 1.
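The formulas on this slide did not survive extraction. As a minimal sketch, assuming the textbook set-based definitions of the three measures (the function names are illustrative, not from the talk), the example can be checked as follows:

```python
import math

def dice(t1, t2):
    # 2 * |T1 ∩ T2| / (|T1| + |T2|)
    return 2 * len(t1 & t2) / (len(t1) + len(t2))

def cosine(t1, t2):
    # |T1 ∩ T2| / sqrt(|T1| * |T2|)
    return len(t1 & t2) / math.sqrt(len(t1) * len(t2))

def jaccard(t1, t2):
    # |T1 ∩ T2| / |T1 ∪ T2|
    return len(t1 & t2) / len(t1 | t2)

T1 = {"nba", "mcgrady"}
T2 = {"macgrady", "nba"}

# Only "nba" matches exactly, so the overlap is 1 even though
# "mcgrady" and "macgrady" differ by a single character.
print(dice(T1, T2), cosine(T1, T2), jaccard(T1, T2))
```

The Jaccard value agrees with the 1/3 quoted on the Motivation slide.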
Weighted Bipartite Graph over the tokens of T1 and T2 (token labels in the figure: mcgrady, nba, wnba, macgrady, nba). 3. Fuzzy Overlap: Maximum Weighted Matching (quantifies token similarity). Better than |T1 ∩ T2| = 1.
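The figure is lost, so the following is only a sketch of how a fuzzy overlap could be computed. It assumes edge weights are edit similarities, that edges below a token threshold `delta` are dropped, and that `delta = 0.8`; none of these details are recoverable from the slide text. The matching is found by brute force, which is only reasonable for tiny token sets.

```python
from itertools import permutations

def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    """1 - ED(a, b) / max(|a|, |b|)."""
    longest = max(len(a), len(b))
    return (longest - edit_distance(a, b)) / longest if longest else 1.0

def fuzzy_overlap(t1, t2, delta=0.8):
    """Weight of a maximum weighted matching in the token bipartite graph.

    Assumption: an edge exists only when edit similarity >= delta.
    """
    small, large = sorted((list(t1), list(t2)), key=len)

    def weight(a, b):
        s = edit_similarity(a, b)
        return s if s >= delta else 0.0

    # Brute force: try every assignment of the smaller side to the larger side.
    return max(sum(weight(a, b) for a, b in zip(small, perm))
               for perm in permutations(large, len(small)))

overlap = fuzzy_overlap(["nba", "mcgrady"], ["macgrady", "nba"])
print(overlap)  # nba-nba contributes 1.0, mcgrady-macgrady contributes 7/8
```

For larger token sets a polynomial-time assignment algorithm (for example the Hungarian method) would replace the permutation loop.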
Fuzzy-Token Similarity: Fuzzy-Dice similarity, Fuzzy-Cosine similarity, Fuzzy-Jaccard similarity. Example: T1 = {nba, mcgrady}, T2 = {macgrady, nba}.
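The slide's formulas are missing from the transcript. A plausible reading, assuming each fuzzy measure is the classical one with the exact overlap replaced by the fuzzy overlap, is given below. The symbol $\widetilde{\cap}_\delta$ and the value 1.875 (edit-similarity weights, token threshold $\delta = 0.8$) are assumptions made for this illustration.

```latex
\begin{align*}
\text{Fuzzy-Dice}(T_1, T_2)    &= \frac{2\,|T_1 \widetilde{\cap}_\delta T_2|}{|T_1| + |T_2|} \\
\text{Fuzzy-Cosine}(T_1, T_2)  &= \frac{|T_1 \widetilde{\cap}_\delta T_2|}{\sqrt{|T_1| \cdot |T_2|}} \\
\text{Fuzzy-Jaccard}(T_1, T_2) &= \frac{|T_1 \widetilde{\cap}_\delta T_2|}{|T_1| + |T_2| - |T_1 \widetilde{\cap}_\delta T_2|}
\end{align*}

% Worked example for T_1 = {nba, mcgrady}, T_2 = {macgrady, nba},
% assuming |T_1 \widetilde{\cap}_\delta T_2| = 1 + 7/8 = 1.875:
\begin{align*}
\text{Fuzzy-Dice}    &= \frac{2 \cdot 1.875}{2 + 2} = 0.9375 \\
\text{Fuzzy-Cosine}  &= \frac{1.875}{\sqrt{2 \cdot 2}} = 0.9375 \\
\text{Fuzzy-Jaccard} &= \frac{1.875}{2 + 2 - 1.875} = \frac{15}{17} \approx 0.882
\end{align*}
```

Under these assumptions the fuzzy Jaccard score is far higher than the exact-match Jaccard of 1/3 for the same pair.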
Comparison with Existing Similarities
String Similarity Join using Fuzzy-Token Similarity. Input strings: s1 = “kobe and trancy”, s2 = “trcy macgrady mvp”, …; s'1 = “kobe bryant age”, s'2 = “mvp tracy mcgrady”, … Tokenization: T1 = {kobe, and, trancy}, T2 = {trcy, macgrady, mvp}, …; T'1 = {kobe, bryant, age}, T'2 = {mvp, tracy, mcgrady}, … Result: (s2, s'2), … Naive solution: enumerating N^2 pairs is quite expensive!
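The naive solution can be sketched as a nested loop over all pairs. The sketch assumes Fuzzy-Jaccard with edit-similarity edge weights, a token threshold `delta = 0.8` and a join threshold `tau = 0.8`; the slide names neither the measure nor the thresholds, so these are illustrative choices, as are the function names.

```python
from itertools import permutations

def edit_similarity(a, b):
    """1 - ED(a, b) / max(|a|, |b|), with ED from the usual dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    longest = max(len(a), len(b))
    return (longest - prev[-1]) / longest if longest else 1.0

def fuzzy_jaccard(s1, s2, delta):
    """Fuzzy-Jaccard of two whitespace-tokenized strings (assumed definition)."""
    small, large = sorted((s1.split(), s2.split()), key=len)
    if not large:
        return 1.0  # two empty strings are treated as identical

    def weight(a, b):
        s = edit_similarity(a, b)
        return s if s >= delta else 0.0

    # Brute-force maximum weighted matching; fine for a handful of tokens.
    overlap = max(sum(weight(a, b) for a, b in zip(small, perm))
                  for perm in permutations(large, len(small)))
    return overlap / (len(small) + len(large) - overlap)

def naive_join(R, S, delta, tau):
    """Compare every pair: |R| * |S| similarity computations."""
    return [(r, s) for r in R for s in S if fuzzy_jaccard(r, s, delta) >= tau]

R = ["kobe and trancy", "trcy macgrady mvp"]
S = ["kobe bryant age", "mvp tracy mcgrady"]
result = naive_join(R, S, delta=0.8, tau=0.8)
print(result)
```

With these thresholds only the pair (s2, s'2) survives, matching the result shown on the slide. Every one of the |R| × |S| pairs is still scored, which is the cost the signature schemes aim to avoid.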
Using Existing Methods
Our Signature Scheme. The superscript denotes which token generates the signature.
Prefix Filtering Signature Scheme. Sort signatures in alphabetical order. Remove the 2 largest signatures.
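The slide's example data is lost. As a hedged sketch of generic prefix filtering: if two sets must share at least `c` signatures, then after sorting under one global order and dropping the `c - 1` largest signatures from each set, the remaining prefixes must still share a signature. Removing 2 signatures corresponds to `c = 3`. The sets and function names below are hypothetical.

```python
def prefix(signatures, c):
    """Drop the c - 1 largest signatures under alphabetical order."""
    ordered = sorted(signatures)
    return ordered[:max(0, len(ordered) - (c - 1))]

def candidate_pairs(sets, c):
    """Pairs of set ids whose prefixes share at least one signature."""
    index = {}      # signature -> ids of earlier sets with it in their prefix
    pairs = set()
    for sid, sigs in enumerate(sets):
        for sig in prefix(sigs, c):
            for other in index.get(sig, []):
                pairs.add((other, sid))
            index.setdefault(sig, []).append(sid)
    return sorted(pairs)

# Hypothetical signature sets, not the ones from the slide.
sets = [
    {"a", "b", "c", "d"},   # id 0, prefix: a, b
    {"b", "c", "d", "e"},   # id 1, prefix: b, c
    {"c", "d", "e", "f"},   # id 2, prefix: c, d
]
print(candidate_pairs(sets, c=3))
```

Sets 0 and 2 share only two signatures and are pruned without being compared. Candidate pairs still need verification, because sharing a prefix signature is necessary but not sufficient.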
Token Sensitive Signature Scheme. Prefix Filtering: No! Token Sensitive: Yes!
Token Sensitive Signature Scheme (Cont'd). Alphabetical order. Delete the maximal number of largest signatures that contain 2 tokens. Candidates: {(T2, T4)}. Candidates: {(T1, T2), (T1, T3), (T1, T4), (T2, T4)}.
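One way to read the deletion rule: scan the signatures from the largest downwards and keep deleting as long as the deleted signatures were generated by at most 2 distinct tokens. The sketch below implements that reading only. The tagged signatures are hypothetical, and the representation as `(signature, token_id)` pairs is an assumption standing in for the superscript notation.

```python
def token_sensitive_prefix(tagged_signatures, max_tokens):
    """Delete as many of the largest signatures as possible, provided the
    deleted ones come from at most max_tokens distinct tokens.

    tagged_signatures: iterable of (signature, token_id) pairs.
    Returns the signatures that remain, in alphabetical order.
    """
    ordered = sorted(tagged_signatures)
    deleted_tokens = set()
    keep = len(ordered)
    while keep > 0:
        token = ordered[keep - 1][1]
        if token not in deleted_tokens and len(deleted_tokens) == max_tokens:
            break  # deleting this one would involve a new token beyond the limit
        deleted_tokens.add(token)
        keep -= 1
    return ordered[:keep]

# Hypothetical data: token 0 generates three large signatures.
tagged = [("zz", 0), ("zy", 0), ("zx", 0), ("yy", 1), ("ab", 2)]

remaining = token_sensitive_prefix(tagged, max_tokens=2)
plain_prefix = sorted(tagged)[:len(tagged) - 2]   # plain rule: drop the 2 largest
print(remaining)
print(plain_prefix)
```

Here the plain rule keeps three signatures while the token-sensitive rule keeps one. Fewer retained signatures means fewer index lookups and candidate pairs, consistent with the smaller candidate set listed on the slide.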
Partition-NED Signature Scheme
Partition t’
Partition t
Partition t (Cont’d)
Pruning Techniques. Reduce the number of substrings from 21 to 8.
Comparison with Partition-ED (SIGMOD 09)
Experiment Setup. Data sets: DBLP Author (author names from the DBLP dataset); AOL Query Log (queries from the AOL dataset). Environment: C++, GCC 4.2.3, Ubuntu; Intel Core 2 Quad X GHz processor and 4 GB memory.
Result Quality
Evaluation on Different Signature Schemes for Tokens
Evaluation on Different Signature Schemes for Token Sets
Put Everything Together
Conclusion. Fuzzy-token similarity: a hybrid similarity; subsumes many well-known similarities; high result quality. String similarity join using fuzzy-token similarity: signature-based framework; token-sensitive signature scheme; Partition-NED signature scheme; achieves higher performance than the state-of-the-art methods both theoretically and experimentally.