# Top-k Set Similarity Joins

Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang, University of New South Wales and NICTA



2 Motivation

- Data Cleaning

| University | City | State | Postal Code |
|---|---|---|---|
| University of New South Wales | Sydney | NSW | 2052 |
| University of Sydney | Sydney | NSW | 2006 |
| University of Melbourne | Melbourne | Victoria | 3010 |
| University of Queensland | Brisbane | Queensland | 4072 |
| University of New South Vales | Sydney | NSW | 2052 |

3 More Applications

- The same headline, "Obama Has Busy Final Day Before Taking Office as Bush Says Farewells", appears in the New York Times (Jan 19th, 2009) and on iht.com (Jan 20, 2009)

4 (Traditional) Set Similarity Join

- Each record is tokenized into a set
- Given a collection of records, the set similarity join problem is to find all pairs of records ⟨x, y⟩ such that sim(x, y) ≥ t
- Common similarity functions:
  - jaccard: J(x, y) = |x ∩ y| / |x ∪ y|
  - cosine: C(x, y) = |x ∩ y| / √(|x| · |y|)
  - dice: D(x, y) = 2 · |x ∩ y| / (|x| + |y|)
- Example: x = {A, B, C, D, E}, y = {B, C, D, E, F}
  - jaccard: 4/6 = 0.67
  - cosine: 4/5 = 0.8
  - dice: 8/10 = 0.8
- What if t is unknown beforehand?
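
The three similarity functions can be checked against the slide's example with a few lines of Python. This is a minimal sketch that assumes records are already tokenized into Python sets; the function names are illustrative.

```python
from math import sqrt

def jaccard(x, y):
    # |x ∩ y| / |x ∪ y|
    return len(x & y) / len(x | y)

def cosine(x, y):
    # |x ∩ y| / sqrt(|x| * |y|)
    return len(x & y) / sqrt(len(x) * len(y))

def dice(x, y):
    # 2 * |x ∩ y| / (|x| + |y|)
    return 2 * len(x & y) / (len(x) + len(y))

x = set("ABCDE")
y = set("BCDEF")
print(round(jaccard(x, y), 2), round(cosine(x, y), 2), round(dice(x, y), 2))
```

The overlap is {B, C, D, E}, so the three values correspond to 4/6, 4/5 and 8/10 as on the slide.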

5 What if t is unknown beforehand?

- Example, using the jaccard similarity function
  - w = {A, B, C, D, E}
  - x = {A, B, C, E, F}
  - y = {B, C, D, E, F}
  - z = {B, C, F, G, H}
- If t = 0.7 → no results
- If t = 0.4 → ⟨w, x⟩, ⟨w, y⟩, ⟨x, y⟩, ⟨x, z⟩, ⟨y, z⟩ → too many results and long running time
- Return the top-k results ranked by their similarity values
  - if k = 1 → a single most similar pair (similarity 4/6 = 0.67)
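
How sensitive the result is to t can be reproduced with a brute-force threshold join over the four example records. This is only an illustrative sketch (all pairs are compared); `threshold_join` is a hypothetical helper name.

```python
from itertools import combinations

def jaccard(x, y):
    return len(x & y) / len(x | y)

def threshold_join(records, t):
    # Compare every pair and keep those with jaccard similarity >= t.
    return [
        (a, b)
        for (a, ra), (b, rb) in combinations(records.items(), 2)
        if jaccard(ra, rb) >= t
    ]

records = {
    "w": set("ABCDE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}

print(threshold_join(records, 0.7))  # no pair reaches 0.7
print(threshold_join(records, 0.4))  # five of the six pairs qualify
```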

6 Top-k Set Similarity Join

- Return top-k pairs of records, ranked by similarity scores.
- Advantages over traditional similarity join
  - without specifying a threshold
  - output results progressively → benefit interactive applications
  - produces most meaningful results under limited resources or time constraints → can be stopped at any time, but still guarantee sim(output results) ≥ sim(unseen pairs)

7 Straightforward Solution

- Start from a certain t, repeat the following steps:
  - answer traditional sim-join with t as threshold
  - if # of results ≥ k, stop and output the k results with highest sim
  - else, decrease t
- Example (jaccard, k = 2)
  - w = {A, B, C, E}
  - x = {A, B, C, E, F}
  - y = {B, C, D, E, F}
  - z = {B, C, F, G, H}
  - t = 0.9 → no result
  - t = 0.8 → ⟨w, x⟩
  - t = 0.7 → ⟨w, x⟩ (results don't change!)
  - t = 0.6 → ⟨w, x⟩, ⟨x, y⟩
- Which thresholds shall we enumerate? 0.8, 0.6
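
The repeat-and-decrease loop can be sketched as follows. The starting threshold (0.9), the step (0.1) and the brute-force inner join are assumptions made for illustration; the slides leave the choice of thresholds open, which is exactly the problem they go on to address.

```python
from itertools import combinations

def jaccard(x, y):
    return len(x & y) / len(x | y)

def topk_by_decreasing_threshold(records, k, start_pct=90, step_pct=10):
    # Thresholds are kept as integer percentages to avoid float drift.
    trace, results = [], []
    for pct in range(start_pct, -1, -step_pct):
        t = pct / 100
        results = [
            (jaccard(ra, rb), a, b)
            for (a, ra), (b, rb) in combinations(records.items(), 2)
            if jaccard(ra, rb) >= t
        ]
        trace.append((t, len(results)))  # (threshold, number of results)
        if len(results) >= k:
            break
    return sorted(results, reverse=True)[:k], trace

records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
top, trace = topk_by_decreasing_threshold(records, k=2)
print(trace)
print(top)
```

The trace shows that the join at t = 0.7 repeats the work done at t = 0.8 without finding anything new, which is the waste the slide points out.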

8 Naïve and Index-Based Algorithms

- Naïve Algorithm: compare every pair of objects → O(n²) time complexity
- Index-based Algorithm [Sarawagi et al. SIGMOD04]: Record Set → Index Construction → Candidate Generation → Verification → Result Pairs
- Inverted lists:

| token | record_id |
|---|---|
| A | w, x, y |
| B | x, z, … |
| C | y, z, … |
| … | … |
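
The index-based pipeline (index construction, candidate generation, verification) can be sketched as below. This is a simplified illustration that indexes every token and is not the algorithm of the cited paper; `index_join` is a hypothetical name, and the sample records are chosen only for the demonstration.

```python
from collections import defaultdict

def jaccard(x, y):
    return len(x & y) / len(x | y)

def index_join(records, t):
    index = defaultdict(list)  # token -> ids of records containing it
    results = []
    for rid, tokens in records.items():
        # Candidate generation: earlier records sharing at least one token.
        candidates = set()
        for tok in tokens:
            candidates.update(index[tok])
            index[tok].append(rid)
        # Verification: compute the exact similarity of each candidate pair.
        for cid in candidates:
            if jaccard(records[cid], tokens) >= t:
                results.append((cid, rid))
    return sorted(results)

records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
print(index_join(records, 0.6))
```

Pairs that share no token are never compared, which is where the saving over the naïve algorithm comes from.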

9 Prefix Filter [Chaudhuri et al. ICDE06, Bayardo et al. WWW07]

- Sort the tokens by a global ordering
  - increasing order of document frequency
- Only need to index the first few tokens (prefix) for each record
- Example: jaccard t = 0.8 → |x ∩ y| ≥ 4 if |x| = |y| = 5
  - x = A B | E F G and y = C D | E F G (tokens sorted, prefix shown before the bar)
  - the prefixes share no token, so the overlap upper bound is O(x, y) = 3 < 4!
- Must share at least one token in prefix to be a candidate pair
- For jaccard, prefix length = |x| * (1 - t) + 1, rounded down → each t is associated with a prefix length
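
The prefix filter test can be sketched like this. The prefix length is computed in the equivalent integer form |x| - ⌈t·|x|⌉ + 1 with a small epsilon to sidestep floating-point rounding; that formulation, the epsilon and the function names are choices made for this sketch. Records are assumed to be token lists already sorted by the global ordering.

```python
from math import ceil

def prefix_length(size, t):
    # Equivalent to floor(size * (1 - t)) + 1 for jaccard threshold t.
    return size - ceil(t * size - 1e-9) + 1

def shares_prefix_token(x, y, t):
    # x and y are token lists sorted by the same global ordering.
    px = set(x[:prefix_length(len(x), t)])
    py = set(y[:prefix_length(len(y), t)])
    return bool(px & py)

x = list("ABEFG")
y = list("CDEFG")
print(prefix_length(5, 0.8))           # prefix of two tokens
print(shares_prefix_token(x, y, 0.8))  # prefixes AB and CD are disjoint
```

Because the two prefixes are disjoint, the pair is pruned without computing its similarity.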

10 Necessary Thresholds

- Each prefix is associated with a threshold, i.e., the maximum possible similarity a record can achieve with other records.
- What thresholds shall we enumerate? All the thresholds with which prefixes are associated!
- Necessary thresholds
  - If we change between different thresholds, there exists a database instance where the results will change
  - extend prefix by one token, and consider the new t
- Example: for x with prefix tokens A, B, C, the associated thresholds are t = 1.0, 0.8, 0.6
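
For jaccard, the threshold tied to a prefix of length p follows from inverting the prefix-length formula: t = (|x| - p + 1) / |x|. The sketch below assumes that inversion and a record of five tokens (the size that reproduces the slide's 1.0, 0.8, 0.6); `prefix_thresholds` is a hypothetical helper, and `Fraction` is used only to keep the values exact.

```python
from fractions import Fraction

def prefix_thresholds(size, count):
    # Threshold associated with each of the first `count` prefix lengths.
    return [Fraction(size - p + 1, size) for p in range(1, count + 1)]

print([float(t) for t in prefix_thresholds(5, 3)])
print([float(t) for t in prefix_thresholds(4, 2)])
```

Each extension of the prefix by one token lowers the associated threshold by 1/|x|.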

11 Event-driven Model

- Problem: repeated invocation of sim-join algorithm
  - t is decreasing → run sim-join algorithm in an incremental way
- Prefix Event
  - initialize prefix length for each record as 1
  - for each prefix event:
    - probe the inverted list of the token for candidate pairs, verify the candidate pairs, and insert them into temp results
    - insert the record into the token's inverted list (e.g. insert x into A's inverted list)
    - extend prefix by one token
  - maintain prefix events with a max-heap on t
  - stop when t ≤ k-th temp result's similarity
- Example thresholds per prefix position: x: 1.0, 0.75; y: 1.0, 0.8, 0.6; z: 1.0, 0.9, 0.8, 0.7
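
The order in which prefix events leave the max-heap can be illustrated on its own. The sketch assumes record sizes of 4, 5 and 10 for x, y and z, which are not stated on the slide but reproduce its threshold sequences; ties are broken by record id, which is an arbitrary choice here. Python's `heapq` is a min-heap, so thresholds are stored negated.

```python
import heapq
from fractions import Fraction

def prefix_events(sizes, limit):
    # Return the first `limit` events as (record_id, prefix_position, t),
    # in non-increasing order of t.
    heap = [(-Fraction(1), rid, 1) for rid in sizes]
    heapq.heapify(heap)
    out = []
    while heap and len(out) < limit:
        neg_t, rid, pos = heapq.heappop(heap)
        out.append((rid, pos, -neg_t))
        n = sizes[rid]
        if pos < n:
            # Extending the prefix by one token lowers t to (n - pos) / n.
            heapq.heappush(heap, (-Fraction(n - pos, n), rid, pos + 1))
    return out

events = prefix_events({"x": 4, "y": 5, "z": 10}, limit=8)
for rid, pos, t in events:
    print(rid, pos, float(t))
```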

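12 topk-join - Example (jaccard, k = 2)

- Records (tokens in sorted order)
  - w = A B C E
  - x = A B C E F
  - y = B C D E F
  - z = B C F G H
- Inverted list

| token | record_id |
|---|---|
| A | w, x |
| B | y, z, x, w |
| C | y, z |

- Temporary results found while processing prefix events: (w,x) = 0.8, (y,z) = 0.43, (x,y) = 0.67
- Some pairs are verified twice!
- Stop when t = 0.6 ≤ 2nd temp result's sim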
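
The event-driven loop can be sketched end to end on this example. This is a simplified illustration rather than the authors' implementation: it remembers every verified pair in a plain set (the memory-hungry option that the deck later improves on), breaks ties between equal thresholds by record id, and uses the jaccard prefix threshold (n - pos) / n for a 0-based position.

```python
import heapq
from collections import defaultdict

def jaccard(x, y):
    return len(set(x) & set(y)) / len(set(x) | set(y))

def topk_join(records, k):
    # records: id -> token list sorted by the global ordering
    index = defaultdict(list)  # token -> ids inserted so far
    verified = set()           # pairs already verified
    top = []                   # min-heap of (sim, pair), at most k entries
    events = [(-1.0, rid, 0) for rid in records]  # (-t, id, position)
    heapq.heapify(events)
    while events:
        neg_t, rid, pos = heapq.heappop(events)
        if len(top) == k and -neg_t <= top[0][0]:
            break  # t <= k-th temp result's similarity
        tok = records[rid][pos]
        for other in index[tok]:  # probe the inverted list
            pair = tuple(sorted((other, rid)))
            if pair in verified:
                continue
            verified.add(pair)
            sim = jaccard(records[other], records[rid])
            if len(top) < k:
                heapq.heappush(top, (sim, pair))
            elif sim > top[0][0]:
                heapq.heapreplace(top, (sim, pair))
        index[tok].append(rid)    # insert into the inverted list
        n = len(records[rid])
        if pos + 1 < n:           # extend the prefix by one token
            heapq.heappush(events, (-(n - pos - 1) / n, rid, pos + 1))
    return sorted(top, reverse=True)

records = {"w": list("ABCE"), "x": list("ABCEF"),
           "y": list("BCDEF"), "z": list("BCFGH")}
result = topk_join(records, 2)
print(result)
```

On these records the loop stops when the next event has t = 0.6, at which point the two temporary results are (w,x) and (x,y).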

13 Optimizations - Verification

- In the above example, (w,x) and (y,z) have been verified twice
- How to avoid repeated verification?
  - memorize all verified pairs with a hash table → too much memory consumption
  - check if this pair will be identified again when it is verified for the first time
  - keep only those that will be identified again before the algorithm stops
  - guarantee no pair will be verified twice
- Example: x = A B D E F, y = A C D E F, with thresholds 1.0, 0.8, 0.6 for the first three prefix positions
  - if k-th temp result's sim = 0.7, the pair won't be identified again! (the next shared token, D, is only reached at t = 0.6)
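
One way to sketch the "will this pair be identified again?" check is to list the shared tokens whose prefix events in both records would still be processed, i.e. whose thresholds exceed the current k-th similarity. This is an illustrative approximation and not necessarily the authors' exact test; the helper names and the parameter `s_k` are hypothetical.

```python
def prefix_threshold(size, pos):
    # jaccard threshold associated with 0-based prefix position `pos`
    return (size - pos) / size

def identifying_tokens(x, y, s_k):
    # Shared tokens whose events in both records have t > s_k.
    pos_in_y = {tok: j for j, tok in enumerate(y)}
    return [
        tok
        for i, tok in enumerate(x)
        if tok in pos_in_y
        and prefix_threshold(len(x), i) > s_k
        and prefix_threshold(len(y), pos_in_y[tok]) > s_k
    ]

def will_be_identified_again(x, y, s_k):
    # More than one such token means the pair would be found a second time.
    return len(identifying_tokens(x, y, s_k)) > 1

x = list("ABDEF")
y = list("ACDEF")
print(will_be_identified_again(x, y, 0.7))
print(will_be_identified_again(x, y, 0.5))
```

Since the k-th similarity only grows while the algorithm runs, a pair judged safe to forget under the current value stays safe later.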

14 Optimizations - Indexing

- How to reduce inverted list size to save memory?
- Example: x = A C D E F, y = B C D E F
  - the pair can only be identified through a shared token, yet the maximum similarity they can achieve is 4/6 = 0.67
- t is decreasing → calculate the upper bound of similarity for future probings into inverted lists
- don't insert into inverted list if this upper bound ≤ k-th temp result's similarity
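
A possible form of the upper bound can be derived as follows; it is an illustration that matches the slide's 4/6 example, not a formula confirmed by the deck. Assume a record of size n is about to index its token at 1-based position p, leaving r = n - p + 1 tokens from that position on, and that any future probing record arrives with threshold at most the current t. Then the overlap is at most r, the prober has at least r / t tokens, and jaccard is at most r / (n - r + r / t).

```python
def future_probe_upper_bound(size, pos, t):
    # pos is the 1-based prefix position of the token being indexed,
    # t is the threshold of the current prefix event.
    remaining = size - pos + 1
    return remaining / (size - remaining + remaining / t)

def should_index(size, pos, t, s_k):
    # Skip the insertion when no future probe can beat the k-th result.
    return future_probe_upper_bound(size, pos, t) > s_k

# x = A C D E F reaching token C (position 2) at t = 0.8
print(round(future_probe_upper_bound(5, 2, 0.8), 2))
print(should_index(5, 2, 0.8, s_k=0.7))
```

With a k-th temporary similarity of 0.7, the bound of about 0.67 means the entry would never contribute a result and can be left out of the inverted list.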

15 Experiment Settings

- Algorithms
  - topk-join
  - pptopk: modified ppjoin [Xiao et al. WWW08], a prefix-filter based approach, with t = 0.95, 0.90, 0.85, ...
- Measure
  - compare topk-join and pptopk (candidate size, running time)
  - output results progressively
- Dataset

| dataset | # of records | avg. record size |
|---|---|---|
| DBLP (author, title) | 855k | 14.0 |
| TREC (author, title, abstract) | 348k | 130.1 |
| TREC-3GRAM | 348k | 868.5 |
| UNIREF-3GRAM (protein seq.) | 500k | 372.9 |

16 Experiment Results

17 Experiment Results

18 Thank you! Questions?

19 Related Work

- Index-based approaches
  - S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
  - C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008.
- Prefix-based approaches
  - S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
  - R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.
  - C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008.
- PartEnum
  - A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.

