
String Similarity Measures and Joins with Synonyms Joint work with Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang Jiaheng Lu Renmin University of China.


1 String Similarity Measures and Joins with Synonyms Joint work with Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang Jiaheng Lu Renmin University of China

2 Motivation
Example (String Measure), no semantics:
S1 = International Conference on Management of Data NY USA
S2 = SIGMOD 2013 New York United States

3 Example (String Measure)
S1 = International Conference on Management of Data NY USA
S2 = SIGMOD 2013 New York United States
Synonyms:
SIGMOD -> International Conference on Management of Data
SIGMOD -> ACM's Special Interest Group on Management Of Data
NY -> New York
USA -> United States
How to use the existing synonyms?

4 Research Problem 1 (String Measurements)
Input: two strings s and t, and a set of synonyms R.
Output: the maximal Jaccard similarity Jaccard(s,t,R) obtainable using R.

5 Research Problem 2 (String Similarity Join)
Input: two sets of strings S and T, a set of synonyms R, and a threshold value.
Output: all similar pairs (s,t), with s in S and t in T, such that Jaccard(s,t,R) >= the threshold.

6 An example of similarity join
Table S1:
ID  String
q1  2013 ACM Intl Conf on Management of Data USA
q2  Very Large Data Bases Conf
q3  VLDB Conf
q4  ICDE 2013
Table S2:
ID  String
s1  SIGMOD
s2  VLDB
Synonyms:
SIGMOD -> International Conference on Management of Data
VLDB -> Very Large Data Bases

7 Existing works on approximate string matching with synonyms
Transformation-based framework (JaccT) [1]: compared with our method.
Machine learning method [2]: Hidden Markov Model-based measure. Depends on training data; not efficient.
[1] A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In ICDE, pages 40–49, 2008.
[2] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39–48, 2003.

8 Outline
Motivation & Problem Statement
String Similarity Measures
String Similarity Joins
Experimental Results
Conclusion

9 String Similarity Measures (Full-expansion)
S1 = International Conference on Management of Data NY USA
S2 = SIGMOD 2013 New York United States
Synonyms:
SIGMOD -> ACM's Special Interest Group on Management Of Data
SIGMOD -> International Conference on Management of Data
NY -> New York
USA -> United States
Expanding using all synonyms:
S1 = "International Conference on Management of Data NY USA SIGMOD New York United States"
S2 = "SIGMOD 2013 New York United States International Conference on Management of Data NY USA ACM's Special Interest Group on Management Of Data"
Jaccard(S1,S2) = 13/18 = 0.72
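The full-expansion arithmetic on this slide can be reproduced with a minimal Python sketch. It assumes whitespace tokenization, case-insensitive matching, and that a rule applies in either direction whenever one side occurs as a contiguous token run in the original string; the helper names (`tokens`, `contains`, `full_expand`, `jaccard`) are illustrative and do not come from the slides.

```python
def tokens(s):
    """Lower-cased whitespace tokens (an assumed tokenization)."""
    return s.lower().split()

def contains(seq, sub):
    """True if sub occurs as a contiguous run inside seq."""
    m = len(sub)
    return any(seq[i:i + m] == sub for i in range(len(seq) - m + 1))

def full_expand(s, rules):
    """Add the other side of every rule that matches the original string."""
    toks = tokens(s)
    expanded = set(toks)
    for lhs, rhs in rules:
        l, r = tokens(lhs), tokens(rhs)
        if contains(toks, l):
            expanded.update(r)
        if contains(toks, r):
            expanded.update(l)
    return expanded

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

RULES = [
    ("SIGMOD", "International Conference on Management of Data"),
    ("SIGMOD", "ACM's Special Interest Group on Management Of Data"),
    ("NY", "New York"),
    ("USA", "United States"),
]
S1 = "International Conference on Management of Data NY USA"
S2 = "SIGMOD 2013 New York United States"

e1, e2 = full_expand(S1, RULES), full_expand(S2, RULES)
print(len(e1 & e2), len(e1 | e2), round(jaccard(e1, e2), 2))  # 13 18 0.72
```

The four tokens that only S2 gains (ACM's, Special, Interest, Group) enlarge the union without enlarging the intersection, which is why full expansion stops at 13/18.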

10 String Similarity Measures (Selective-expansion)
S1 = International Conference on Management of Data NY USA
S2 = SIGMOD 2013 New York United States
Synonyms:
SIGMOD -> ACM's Special Interest Group on Management Of Data
SIGMOD -> International Conference on Management of Data
NY -> New York
USA -> United States
Expanding using only good synonyms:
S1 = "International Conference on Management of Data NY USA SIGMOD New York United States"
S2 = "SIGMOD 2013 New York United States International Conference on Management of Data NY USA"
Jaccard(S1,S2) = 13/14 = 0.93

11 String Similarity Measures (Selective)
Selective-expansion is NP-hard (reduction from 3-SAT).
Greedy algorithm: choose synonyms that can increase the current similarity by computing the similarity-gain.
Property: the greedy result is optimal in more than 70% of cases in practice.
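One way to picture the greedy idea is the following Python sketch. It is a simplified stand-in, not the authors' exact algorithm: it assumes whitespace tokens, case-insensitive matching, rules usable in both directions, and a gain defined as the increase in plain Jaccard after applying one more rule. All function names are illustrative.

```python
def tokens(s):
    return s.lower().split()

def contains(seq, sub):
    m = len(sub)
    return any(seq[i:i + m] == sub for i in range(len(seq) - m + 1))

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def applicable(s, rules):
    """Token sets that matching rules could add to string s."""
    toks, adds = tokens(s), []
    for lhs, rhs in rules:
        l, r = tokens(lhs), tokens(rhs)
        if contains(toks, l):
            adds.append(frozenset(r))
        if contains(toks, r):
            adds.append(frozenset(l))
    return adds

def greedy_selective(s, t, rules):
    """Repeatedly apply the rule with the largest positive similarity gain."""
    a, b = set(tokens(s)), set(tokens(t))
    cands = [(0, add) for add in applicable(s, rules)]
    cands += [(1, add) for add in applicable(t, rules)]
    best = jaccard(a, b)
    while cands:
        scored = []
        for side, add in cands:
            na = a | add if side == 0 else a
            nb = b | add if side == 1 else b
            scored.append((jaccard(na, nb), side, add))
        score, side, add = max(scored, key=lambda x: x[0])
        if score <= best:          # no rule improves the similarity
            break
        best = score
        if side == 0:
            a |= add
        else:
            b |= add
        cands.remove((side, add))
    return best

RULES = [
    ("SIGMOD", "International Conference on Management of Data"),
    ("SIGMOD", "ACM's Special Interest Group on Management Of Data"),
    ("NY", "New York"),
    ("USA", "United States"),
]
S1 = "International Conference on Management of Data NY USA"
S2 = "SIGMOD 2013 New York United States"
print(round(greedy_selective(S1, S2, RULES), 2))  # 0.93
```

On the running example the sketch applies every rule except the "ACM's Special Interest Group" one, whose gain is negative, and ends at 13/14. Because it only ever adds one rule at a time, it can miss combinations that pay off only jointly, which is consistent with a greedy method being optimal in many but not all cases.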

12 Outline
Motivation & Problem Statement
String Similarity Measures
String Similarity Joins
Experimental Results
Conclusion

13 Similarity Joins (Filtering and Verification)
Generate signatures with full expansion.
Filter candidates: prefix method or LSH method.
Verify candidates with a similarity measure: selective expansion or full expansion.

14 String Similarity Joins (SN-Join)
Prefix method, with threshold = 0.8 and global ordering {a b c d e f g h i j k l}:
Strings: S1 = c, k, e, a, f and S2 = d, b, f, e, k
Order the strings: S1 = a, c, e, f, k and S2 = b, d, e, f, k
Get signatures: Sig(s1) = a, c and Sig(s2) = b, d
No overlap between the signatures, so Jacc(s1,s2) < 0.8.
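The prefix method on this slide can be sketched in a few lines of Python. The sketch assumes the standard prefix-filter length for Jaccard, |s| - ceil(threshold * |s|) + 1, which yields the two-token signatures shown above; the function name `prefix_signature` is illustrative.

```python
import math

def prefix_signature(tokens, order, threshold):
    """First |s| - ceil(threshold * |s|) + 1 tokens under the global ordering."""
    ranked = sorted(set(tokens), key=order.index)
    n = len(ranked)
    # Small epsilon guards against floating-point error in threshold * n.
    k = n - math.ceil(threshold * n - 1e-9) + 1
    return ranked[:k]

ORDER = list("abcdefghijkl")
sig1 = prefix_signature(["c", "k", "e", "a", "f"], ORDER, 0.8)
sig2 = prefix_signature(["d", "b", "f", "e", "k"], ORDER, 0.8)

# Disjoint signatures mean the pair cannot reach the threshold.
may_match = bool(set(sig1) & set(sig2))
print(sig1, sig2, may_match)  # ['a', 'c'] ['b', 'd'] False
```

With five tokens and a threshold of 0.8, two strings need at least four common tokens, so they must share one of their first two tokens under the ordering. The actual similarity here is 3/7, so pruning the pair is safe.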

15 Signature selection is important
How to select signatures to enhance the signature filtering power?
It is unrealistic to find a one-size-fits-all solution.

16 Estimation-based signature selection
Three steps to select signatures:
1. Generate multiple signature schemes for each data set.
2. Given two tables for join, quickly estimate the filtering power of each scheme.
3. Select the scheme with the best filtering power.

17 An example on estimator
ID  String                                         Signatures
q1  2013 ACM Intl Conf on Management of Data USA   ACM, International, Conference, on
q2  Very Large Data Bases Conf                     Conf, Conference
q3  VLDB Conf                                      Conf, Conference
q4  ICDE 2013                                      ICDE
Inverted lists (self-join):
ACM: q1
Conf: q2, q3
Conference: q1, q2, q3
International: q1
on: q1
ICDE: q4
Filtering results (candidates): (q2,q3), (q1,q2), (q1,q3)
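The candidate set in this example follows mechanically from the signatures: two strings become a candidate pair when they share at least one signature token. A small Python sketch under that assumption, with illustrative helper names:

```python
from collections import defaultdict
from itertools import combinations

SIGNATURES = {
    "q1": ["ACM", "International", "Conference", "on"],
    "q2": ["Conf", "Conference"],
    "q3": ["Conf", "Conference"],
    "q4": ["ICDE"],
}

def build_inverted_lists(signatures):
    """Map each signature token to the IDs of the strings that contain it."""
    lists = defaultdict(list)
    for sid, sig in signatures.items():
        for tok in sig:
            lists[tok].append(sid)
    return lists

def candidate_pairs(lists):
    """Self-join: every pair of IDs that co-occur in some inverted list."""
    pairs = set()
    for ids in lists.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

lists = build_inverted_lists(SIGNATURES)
cands = candidate_pairs(lists)
print(sorted(cands))  # [('q1', 'q2'), ('q1', 'q3'), ('q2', 'q3')]
```

The number of pairs produced this way is the quantity an estimator needs to approximate: a signature scheme with fewer candidates has more filtering power.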

18 Applying FM sketches on inverted lists
ID  String                                         Signatures
q1  2013 ACM Intl Conf on Management of Data USA   ACM, International, Conference, on
q2  Very Large Data Bases Conf                     Conf, Conference
q3  VLDB Conf                                      Conf, Conference
q4  ICDE 2013                                      ICDE
Inverted lists (self-join):
ACM: q1
Conf: q2, q3
Conference: q1, q2, q3
International: q1
on: q1
ICDE: q4
Filtering results (candidates): (q2,q3), (q1,q2), (q1,q3)
Using a Flajolet-Martin (FM) sketch for each list.

19 FM sketches (Flajolet and Martin, JCSS 1985)
Estimates the number of distinct items in a multi-set of values from [0,…, M-1].
Assume a hash function h(x) that maps incoming values x in [0,…, M-1] uniformly across [0,…, 2^L - 1], where L = O(log M).
Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y.
A value x is mapped to lsb(h(x)).
Example (figure): number of distinct values: 5; for x = 5, bit lsb(h(x)) of the BITMAP is set.
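A minimal single-bitmap FM sketch in Python may help make the definitions concrete. Several choices here are assumptions rather than details from the slides: L = 32, SHA-1 truncated to 32 bits as the hash function, mapping a zero hash to position L - 1, and the usual correction constant of about 0.77351 from the Flajolet-Martin analysis.

```python
import hashlib

L = 32            # assumed bitmap width
PHI = 0.77351     # Flajolet-Martin correction constant

def h(x):
    """Hash x to an integer in [0, 2^L - 1] (SHA-1 is an illustrative choice)."""
    digest = hashlib.sha1(str(x).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def lsb(y):
    """Position of the least-significant 1 bit; L - 1 if y has no 1 bit."""
    return (y & -y).bit_length() - 1 if y else L - 1

def fm_bitmap(values):
    """Set bit lsb(h(x)) for every value; duplicates set the same bit again."""
    bitmap = 0
    for x in values:
        bitmap |= 1 << lsb(h(x))
    return bitmap

def fm_estimate(bitmap):
    """Estimate the distinct count from the position of the lowest zero bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / PHI

a = fm_bitmap(["q1", "q2", "q3"])
b = fm_bitmap(["q3", "q4"])
merged = a | b    # the sketch of a union is the bitwise OR of the sketches
print(fm_estimate(merged))
```

A single bitmap gives a coarse estimate, so practical uses average many independent sketches. The property that matters for inverted lists is that sketches combine with a bitwise OR, so the distinct count of a union of lists can be estimated without materializing the union.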

20 Estimating the filtering power of a signature scheme
Constructing a two-dimensional hash sketch.
Computing tighter upper and lower bounds of the candidate size.

21 String Similarity Filtering with Length Filter
Generate signatures.
Filter candidates: prefix method or LSH method.
Compute lengths and apply the length filter.
Verify candidates with a similarity measure: selective expansion or full expansion.

22 String Similarity Joins (SI-Join)
Length filtering:
Strings: S1 = a b c d e and S2 = x y z
Synonyms: a -> f g h, x -> s, b -> k
Full-expansion: S1 = a b c d e f g h k and S2 = x y z s
Length range: s1: [5, 9], s2: [3, 4]
The ranges do not overlap, and the best possible ratio is 4/5 = 0.8, so Jacc(s1,s2,R) < 0.9.
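The length-filter reasoning can be sketched as follows. The sketch assumes each string's token count lies between its original length and its fully expanded length, and uses the fact that Jaccard similarity never exceeds the ratio of the smaller set size to the larger one. The function names are illustrative.

```python
def max_jaccard_by_length(range1, range2):
    """Upper bound on Jaccard given [min, max] token counts of two strings."""
    lo1, hi1 = range1
    lo2, hi2 = range2
    if hi1 < lo2:              # string 1 is always the shorter one
        return hi1 / lo2
    if hi2 < lo1:              # string 2 is always the shorter one
        return hi2 / lo1
    return 1.0                 # ranges overlap: lengths alone rule nothing out

def length_filter_passes(range1, range2, threshold):
    """Keep the pair only if the length-based bound can reach the threshold."""
    return max_jaccard_by_length(range1, range2) >= threshold

bound = max_jaccard_by_length((5, 9), (3, 4))
print(bound, length_filter_passes((5, 9), (3, 4), 0.9))  # 0.8 False
```

For the slide's example the bound is 4/5 = 0.8, below the 0.9 threshold, so the pair is pruned before any signature comparison or verification.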

23 String Similarity Joins (SI-Join)
Generate signatures.
Filter candidates: prefix/LSH method.
Compute lengths and apply the length filter.
Verify candidates with a similarity measure: selective expansion or full expansion.

24 String Similarity Joins (SI-tree)

25 Outline
Motivation & Problem Statement
String Similarity Measures
String Similarity Joins
Experimental Results
Conclusion

26 Data sets and algorithms
Compared method: JaccT [Arasu et al. ICDE 2008]
Three datasets:
Data   # of strings  String len (avg/max)  # of synonyms  # of applied synonyms (avg/max)
USPS   1M            6.75/…                …              …/5
CONF   10K           5.84/…                …              …/4
SPROT  1M            10.32/20              10K            37.78/104

27 String Similarity Measures: effectiveness of different similarity measures
Selective-expansion (SE) achieves the best effectiveness.

28 String Similarity Joins: efficiency of algorithms
SI-Join achieves the best performance.
S: selective expansion; F: full expansion

29 Prefix scheme vs. LSH scheme
Figure regions: "LSH is better" and "Prefix is better".

30 Estimation effectiveness

31 Outline
Motivation & Problem Statement
String Similarity Measures
String Similarity Joins
Experimental Results
Conclusion

32 Conclusion and future work
String similarity measure with synonyms:
Two new measures and a new join algorithm.
One estimator for signature selection.
Future work: how to deal with synonym ambiguity.
E.g. UW = University of Washington OR UW = University of Waterloo

33 String Similarity Measures and Joins with Synonyms

