Efficient Exact Similarity Searches using Multiple Token Orderings
Jongik Kim (1) and Hongrae Lee (2)
(1) Chonbuk National University, South Korea; (2) Google Inc.

Presentation transcript:

Introduction
Similarity search is important in many applications:
- Data cleaning
- Record linkage
- Near-duplicate detection
- Query refinement
The focus of our work is efficient evaluation of similarity queries:
- Many applications issue queries simultaneously
- Applications usually require fast response times
- Similarity queries therefore need to be evaluated efficiently
(Figure: simultaneous query requests; example of a typo query: "angty bird".)

Problem Definition
Collection of strings (Name): Bill Gates, Linus Torvalds, Steven P. Jobs, Dennis Ritchie, …
Search query q: Steve Jobs
Output: each string s that satisfies sim(q, s) ≥ α
How do we measure the similarity between two strings?
1. Convert each string into a record, where a record is a set of tokens. Each string is tokenized into the set of all its q-gram tokens, where a q-gram is a substring of length q. Example: TS("steve") = {ste, tev, eve} and TS("steven") = {ste, tev, eve, ven}.
2. Count the number of common tokens between the two records (token sets).
The overlap similarity sim("steve", "steven") is defined as |TS("steve") ∩ TS("steven")|, which is 3 here.
Why do we use the overlap similarity? It supports many other similarity measures, e.g. J(x, y) ≥ t ⇔ O(x, y) ≥ t(|x| + |y|) / (1 + t), where J is the Jaccard similarity and O is the overlap similarity.
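The tokenization and the threshold conversion on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, q = 3 is chosen to match the "steve" example, and the small epsilon in the rounding step is an assumption added to guard against floating-point noise.

```python
import math

def token_set(s, q=3):
    """Return the set of all q-grams (length-q substrings) of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def overlap(x, y):
    """Overlap similarity: number of tokens shared by two token sets."""
    return len(x & y)

def jaccard_to_overlap(t, size_x, size_y):
    """Smallest integer overlap O with O / (size_x + size_y - O) >= t.

    Follows from J >= t  <=>  O >= t(|x| + |y|) / (1 + t).
    The 1e-9 slack is an assumption to absorb floating-point error.
    """
    return math.ceil(t * (size_x + size_y) / (1 + t) - 1e-9)

steve = token_set("steve")
steven = token_set("steven")
common = overlap(steve, steven)
needed = jaccard_to_overlap(0.6, len(steve), len(steven))
```

With a Jaccard threshold of 0.6 and token sets of sizes 3 and 4, the bound is 0.6 · 7 / 1.6 = 2.625, so an overlap of at least 3 is required, which the two example strings meet.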

Inverted Lists based Approach

ID  String   Record (token set)
1   area     {ar, re, ea}
2   artisan  {ar, rt, ti, is, sa, an}
3   artist   {ar, rt, ti, is, st}
4   tisk     {ti, is, sk}
…   …        …

1. Make inverted lists for the tokens (ar, sk, ea, is, sa, rt, st, ti, re, an, …).
2. Query: "artist" → {ar, rt, ti, is, st}, overlap threshold: 4.
3. Merge the inverted lists of the query tokens to count occurrences.
4. Answers of the query: 2: "artisan", 3: "artist".
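A minimal sketch of the merge-and-count approach on the four example records follows. The helper names are illustrative, and a hash-based counter stands in for whatever list-merging algorithm an actual system would use.

```python
from collections import Counter, defaultdict

def bigrams(s):
    """Token set of all 2-grams of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def build_index(records):
    """Map each token to the sorted list of record ids containing it."""
    index = defaultdict(list)
    for rid in sorted(records):
        for tok in records[rid]:
            index[tok].append(rid)
    return dict(index)

def merge_count(index, query_tokens, alpha):
    """Count, per record, how many query-token lists it appears in,
    and keep the records whose count reaches the overlap threshold."""
    counts = Counter()
    for tok in query_tokens:
        counts.update(index.get(tok, []))
    return sorted(rid for rid, c in counts.items() if c >= alpha)

strings = {1: "area", 2: "artisan", 3: "artist", 4: "tisk"}
records = {rid: bigrams(s) for rid, s in strings.items()}
index = build_index(records)
answers = merge_count(index, bigrams("artist"), alpha=4)
```

Record 2 shares ar, rt, ti and is with the query (count 4) and record 3 shares all five tokens, so both reach the threshold; records 1 and 4 do not.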

Prefix Filtering based Approach
Query q = "artist" → {ar, rt, ti, is, st} and overlap threshold α = 4
Sort the inverted lists for the query by their sizes, i.e. sort the tokens by their document frequencies (document frequency ordering). In the example the sorted order is st, rt, ar, is, ti.
- Prefix lists: the first |TS(q)| - α + 1 lists
- Suffix lists: the remaining α - 1 lists
Filtering phase (the prefix filtering):
- Merge the prefix lists to generate candidates
Verification phase:
- Search the suffix lists for each candidate: each candidate is looked up in every suffix list to check whether the list contains it
- Binary search is used because suffix lists are usually very long
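The two phases can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the inverted lists are hard-coded from the four example records (area, artisan, artist, tisk), and the function names are hypothetical.

```python
from bisect import bisect_left

def contains(sorted_list, rid):
    """Binary search for rid in a sorted inverted list."""
    i = bisect_left(sorted_list, rid)
    return i < len(sorted_list) and sorted_list[i] == rid

def prefix_filter_search(index, query_tokens, alpha):
    """Return (candidates, answers) for an overlap threshold alpha."""
    # Document frequency ordering: shortest inverted lists first.
    lists = sorted((index.get(t, []) for t in query_tokens), key=len)
    n_prefix = len(lists) - alpha + 1
    if n_prefix <= 0:          # fewer than alpha tokens: no record can qualify
        return [], []
    prefix, suffix = lists[:n_prefix], lists[n_prefix:]
    # Filtering phase: merge the prefix lists, counting occurrences.
    counts = {}
    for lst in prefix:
        for rid in lst:
            counts[rid] = counts.get(rid, 0) + 1
    candidates = sorted(counts)
    # Verification phase: probe every suffix list for each candidate.
    answers = []
    for rid in candidates:
        total = counts[rid] + sum(contains(lst, rid) for lst in suffix)
        if total >= alpha:
            answers.append(rid)
    return candidates, answers

# Inverted lists of the four example records.
index = {"ar": [1, 2, 3], "re": [1], "ea": [1], "rt": [2, 3],
         "ti": [2, 3, 4], "is": [2, 3, 4], "sa": [2], "an": [2],
         "st": [3], "sk": [4]}
candidates, answers = prefix_filter_search(
    index, ["ar", "rt", "ti", "is", "st"], alpha=4)
```

Here the prefix consists of the two shortest lists (st and rt), so only records 2 and 3 become candidates; records 1 and 4 are never touched even though they appear in the longer lists.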

Document Frequency Ordering
General goal: minimize the number of candidates by making use of the document frequency ordering.
Query q = "artist" → {ar, rt, ti, is, st} and overlap threshold α = 4
Sort the tokens by their document frequencies:
- Prefix lists: the first |TS(q)| - α + 1 lists
- Suffix lists: the remaining α - 1 lists
(Figure: the query's lists in two orders, rt st ti ar is and st rt ar is ti, with the candidates produced by each.)
We can reduce:
1. the time for merging, because the prefix lists are short
2. the number of candidates, and therefore the time for verification

Our Observation
Query q = {w1, w2} and overlap threshold α = 2
- Without partitioning: w2 is the prefix list and the number of candidates is 5.
- With partitioning: in one partition w2 is the prefix list and the number of candidates is 0; in the other partition w1 is the prefix list and the number of candidates is 0. The total number of candidates is 0.
Our observation:
- By partitioning a data set, we can artificially modify the document frequencies of tokens in each partition.
- We evaluate a query in each partition and take the union of the results.
- We can reduce the number of candidates by utilizing different token orderings among partitions.
- Because partitions have different token orderings, we need to sort the tokens of a query record separately in each partition.
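The effect can be reproduced on a toy data set. The records below are invented for illustration (the slide's figure is not preserved in the transcript); they are chosen so that the unpartitioned query yields 5 candidates and the partitioned one yields none, matching the counts on the slide.

```python
def inverted_lists(records, tokens):
    """One inverted list (sorted record ids) per query token, built
    only from the given records, so each partition gets its own lists."""
    return {t: sorted(rid for rid, toks in records.items() if t in toks)
            for t in tokens}

def prefix_candidates(records, query, alpha):
    """Candidates produced by prefix filtering within one set of records."""
    lists = sorted(inverted_lists(records, query).values(), key=len)
    n_prefix = max(len(lists) - alpha + 1, 0)
    candidates = set()
    for lst in lists[:n_prefix]:
        candidates.update(lst)
    return candidates

# Hypothetical data: six records contain w1, five other records contain w2.
records = {rid: {"w1", "x"} for rid in range(1, 7)}
records.update({rid: {"w2", "y"} for rid in range(7, 12)})
query, alpha = ["w1", "w2"], 2

# Unpartitioned: w2 has the shorter list, so it is the prefix list.
unpartitioned = prefix_candidates(records, query, alpha)

# Partitioned: the token ordering is recomputed inside each partition.
p1 = {rid: t for rid, t in records.items() if "w1" in t}
p2 = {rid: t for rid, t in records.items() if "w1" not in t}
partitioned = (prefix_candidates(p1, query, alpha)
               | prefix_candidates(p2, query, alpha))
```

In p1 the list of w2 is empty and becomes the prefix list; in p2 the list of w1 is empty and becomes the prefix list. Both partitions therefore produce no candidates, and the union of the per-partition results is still the correct (empty) answer.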

Generalization of the Observation
Notation: I(w) is the inverted list of w, w_p is a prefix token, w_s is a suffix token.
Query q = reaby = {re, ea, ab, by} = {w1, w3, ab, by}, overlap threshold α = 2
Without partitioning: the prefix list is w3, # of candidates: 5
Grouping the records in I(w_p) into P, the number of candidates is reduced by at least |I(w_p)| - |I(w_s) ∩ P|.
- In P1, the prefix list is w1, # of candidates: 2
- In P2, the prefix list is w3, # of candidates: 0
Grouping the records in I(w_s) into P, the number of candidates is reduced by at least |I(w_p) - P|.
- In P1, the prefix list is w3, # of candidates: 2
- In P2, the prefix list is w1, # of candidates: 0
By grouping records containing a token w into a partition, we can benefit queries containing w.

Pivot Set & Partitioning
By grouping records containing a token w into a partition, we can benefit queries containing w.
A pivot set S is a set of tokens such that grouping I(w_i) into one partition does not affect grouping I(w_j) into another partition. We can then benefit queries containing w_i as well as queries containing w_j.
(Figure: tokens w1 to w7 over records r1 to r15, grouped into partitions P1, P2 and P3.)
There are many pivot sets:
- S1 = {w1, w3}
- S2 = {w2, w3, w4}
- S3 = {w3, w5}
- S4 = {w5, w6}
- S5 = {w2, w6}
- S6 = {w3, w7}
Orphan record: randomly select its partition.
Questions:
1. Existence of pivot sets
2. Selection of a good pivot set
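A one-level partitioning step driven by a pivot set might look like the sketch below. The slide only states that records containing a pivot token are grouped together and that orphan records get a random partition; sending a record that contains several pivot tokens to the first matching pivot, and the seeded random generator, are assumptions made here to keep the example deterministic enough to read.

```python
import random

def partition_by_pivots(records, pivots, rng=None):
    """Create one partition per pivot token (pivots must be non-empty).

    A record goes to the partition of the first pivot token it contains
    (an assumption for records with several pivot tokens). A record with
    no pivot token is an orphan and is placed in a random partition.
    """
    rng = rng or random.Random(0)
    partitions = [dict() for _ in pivots]
    for rid, tokens in records.items():
        home = next((i for i, w in enumerate(pivots) if w in tokens), None)
        if home is None:                      # orphan record
            home = rng.randrange(len(pivots))
        partitions[home][rid] = tokens
    return partitions

# Hypothetical records; "a", "b", "c" are non-pivot tokens.
records = {1: {"w1", "a"}, 2: {"w1", "b"}, 3: {"w3", "a"}, 4: {"c"}}
partitions = partition_by_pivots(records, ["w1", "w3"])
```

Records 1 and 2 land in the partition of w1 and record 3 in the partition of w3, while record 4 is an orphan and ends up in whichever partition the random choice selects.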

Relaxation of a Pivot Set
※ JC(S1, S2) = |S1 ∩ S2| / min(|S1|, |S2|)
A pivot set S is a set of tokens such that for any two tokens w_i and w_j in S, JC(I(w_i), I(w_j)) ≤ β.
If JC(S1, S2) = 0.1 (figure): 10% of S1 lies inside S1 ∩ S2 and 90% of S1 outside it, while less than 10% of S2 lies inside the intersection and more than 90% of S2 outside it.
If β = 0.2, the set S = {w2, w3, w4} is a pivot set.
(Figure: tokens w1 to w7 over records r1 to r15.)
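The relaxed definition translates directly into a pairwise check. In the sketch below the inverted lists are invented for illustration (the slide's figure is not preserved), and returning 0 for an empty list is an assumption, since the slide does not cover that case.

```python
from itertools import combinations

def jc(a, b):
    """JC(S1, S2) = |S1 ∩ S2| / min(|S1|, |S2|).
    Returns 0.0 if either set is empty (an assumption)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def is_pivot_set(tokens, index, beta):
    """True if every pair of tokens has JC of their inverted lists <= beta."""
    return all(jc(index[u], index[v]) <= beta
               for u, v in combinations(tokens, 2))

# Hypothetical inverted lists, stored as sets of record ids.
index = {
    "w1": {1, 2, 3, 4, 5, 30},
    "w2": set(range(1, 11)),    # records 1..10
    "w3": set(range(10, 20)),   # records 10..19, shares record 10 with w2
    "w4": {20, 21, 22},
}
relaxed_ok = is_pivot_set(["w2", "w3", "w4"], index, beta=0.2)
too_similar = is_pivot_set(["w1", "w2"], index, beta=0.2)
```

Here w2 and w3 share one record out of ten (JC = 0.1), which β = 0.2 tolerates, whereas w1 shares five of its six records with w2 (JC ≈ 0.83), so those two tokens cannot be in the same pivot set.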

Pivot Set Selection
The weight of a token w is the number of queries that contain w.
Goodness of a pivot set S: [formula not preserved in the transcript]
- By partitioning using tokens contained in many queries, we can benefit many queries.
Selecting the best pivot set is NP-hard (see the paper).
We use a simple greedy algorithm (simplified version; see the paper for the details):
- Select tokens with high weights first.
(Figure: tokens w1 to w7 over records r1 to r15.)
Problem: by selecting the high-frequency token w1 first, we lose the chance to divide the records in I(w2) and I(w4). If we divide the records in I(w2) and I(w4), however, we can benefit more queries.
We solve this problem using the partitioning algorithm.
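One way to read the simplified greedy rule is sketched below: visit tokens in decreasing weight and keep a token only if it stays within β of every token already chosen. This is an interpretation, not the paper's algorithm; the inverted lists, the weights and the tie-breaking by token name are all made up for the example.

```python
def jc(a, b):
    """JC(S1, S2) = |S1 ∩ S2| / min(|S1|, |S2|); 0.0 for an empty set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def greedy_pivot_set(index, weight, beta):
    """Pick tokens by decreasing weight, skipping any token whose inverted
    list overlaps too much (JC > beta) with an already selected token."""
    pivots = []
    # Ties are broken by token name (an arbitrary, illustrative choice).
    for tok in sorted(index, key=lambda t: (-weight.get(t, 0), t)):
        if all(jc(index[tok], index[p]) <= beta for p in pivots):
            pivots.append(tok)
    return pivots

# Hypothetical inverted lists and query weights.
index = {
    "w1": {1, 2, 3, 4, 5, 6},
    "w2": {1, 2, 3, 7},
    "w3": {9, 10},
    "w4": {4, 5, 6, 8},
}
weight = {"w1": 10, "w2": 8, "w3": 5, "w4": 8}
pivots = greedy_pivot_set(index, weight, beta=0.2)
```

The toy data reproduces the problem named on the slide: picking w1 first blocks both w2 and w4, whose lists overlap heavily with I(w1) but not with each other, even though w2 and w4 together carry more weight in this example than w1 alone.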

Partitioning Algorithm
(Figure: tokens w1 to w7 over records r1 to r15; partitions P1 and P2, with P1 further divided into P11 and P12.)
Partitioning algorithm (simplified version; see the paper for the details):
1. Select a pivot set
2. Partition the records using the pivot set
3. In each partition, recursively partition the records and handle local orphan records
4. Balance the overhead and the benefit of partitioning using a cost model
Local orphan record: insert it into either P11 or P12.
Note: recursive partitioning does not affect the relative document frequencies of w1 in each partition.

Experiments

Datasets and statistics

Dataset      # records   Avg # tokens  # partitions
IMDB Actor   1,213,391   16            ED 28, JC 12
IMDB Movie   1,568,891   19            ED 18, JC 12
DBLP Author  2,948,929   15            ED 55, JC 55
Web Corpus   6,000,000   21            ED 54, JC 85

Similarity functions:
- Jaccard similarity (thresholds: 0.6, 0.7, 0.8)
- Edit distance (thresholds: 2, 3, 4)
Search algorithms:
- Jaccard: SequentialMerge, DivideSkip [Li et al., ICDE '08], PPMerge [Xiao et al., WWW '08]
- Edit distance: SequentialMerge, DivideSkip, EDMerge [Xiao et al., PVLDB '08]
- Size filtering [Arasu et al., VLDB '06] (for all algorithms)
Partitioned case vs. unpartitioned case:
- Elapsed times
- Number of candidates

Experiments
Jaccard similarity (DBLP Author)
(Figures: running time; number of candidates.)

Experiments
Edit distance (Web Corpus)
(Figures: running time; number of candidates.)
※ Edit distance: false positives are not removed!

Conclusions
- Studied how to reduce the number of candidates for efficient similarity searches
- Proposed the concept of the pivot set and a partitioning technique using a pivot set
- Showed the benefits of the proposed technique experimentally

THANK YOU!