Published by Derek Snuggs. Modified about 1 year ago.

1
Efficient Top-k Algorithms for Approximate Substring Matching
Presented by Jagadeesh Potluri and Shiva Krishna Imminni

2
Our presentation covers:
- Introduction
- Problem Statement
- Approach
- Performance Results

3
Introduction
With vast amounts of data now available, retrieving similar strings has become a more challenging problem.
Applications: web search, music data retrieval, finding DNA subsequences, and many more.
Given a large database of texts and a query string, we need an efficient way to search for similar strings or substrings.
Edit distance is one of the most widely accepted distance measures for database applications.

4
Problem Statement
Traditional approximate substring matching requires the user to specify a similarity threshold. This is problematic because:
- There is no global threshold value that works for all types of data.
- The threshold value requires fine-tuning.
We therefore opt for algorithms that return the top-K results, where K is the total number of results a user would like to see.

5
Example
Given a query string, a top-K algorithm fetches the K best approximate substring matches from the database of texts.
Query string is ‘Jackson’ and k = 3.
- s_1: substring edit distance is 0.
- s_2: substring edit distance is 3.
- s_3: substring edit distance is 2.
- s_4, s_5, s_6: substring edit distances are 1.
Output: Top 3 = {‘Jackson Pollock’, ‘Jacksomville’, ‘Jakson Pollack’}

6
TopK Algorithms
- TopK-Naïve
- TopK-LB
- TopK-Split

7
TopK-Naïve
Given a set of strings D and a query string σ.
d_sub(s, σ): the minimum among the edit distances between σ and every substring of s.
Algorithm:
1. H_TopK = an empty max-heap (its entries pair a string with its substring edit distance).
2. For every string s in D, compute the substring edit distance d_sub(s, σ).
   - If the size of H_TopK is less than k, insert the string s into H_TopK.
   - Else if d_sub(s, σ) < d_sub(s_R, σ), where s_R is the string at the root of H_TopK, delete s_R from H_TopK and insert the string s into H_TopK.
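The two steps above can be sketched in Python as follows. This is a minimal illustration and not the presenters' implementation: the names `substring_edit_distance` and `topk_naive` are hypothetical, the substring edit distance uses the standard dynamic program in which a match may start and end anywhere in s, and the max-heap is simulated with `heapq` by negating distances.

```python
import heapq


def substring_edit_distance(s, query):
    """Minimum edit distance between `query` and any substring of `s`."""
    # Row 0 is all zeros: a match may start at any position of s for free.
    prev = [0] * (len(s) + 1)
    for i in range(1, len(query) + 1):
        curr = [i] + [0] * len(s)  # query[:i] against an empty substring costs i
        for j in range(1, len(s) + 1):
            cost = 0 if query[i - 1] == s[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,  # match or substitution
                          prev[j] + 1,         # query character left unmatched
                          curr[j - 1] + 1)     # extra character taken from s
        prev = curr
    # The match may also end at any position of s.
    return min(prev)


def topk_naive(strings, query, k):
    """Return (distance, string) pairs for the k closest strings (hypothetical helper)."""
    heap = []  # entries are (-distance, index, string); heap[0] holds the largest distance
    for idx, s in enumerate(strings):
        d = substring_edit_distance(s, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx, s))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, idx, s))  # evict the root, insert s
    return sorted((-neg_d, s) for neg_d, _, s in heap)
```

Every string costs one full dynamic-programming pass of O(|σ| ・ |s|) time, which is why the later algorithms try to skip as many of these passes as possible.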

8
TopK-Naïve
Is this efficient?
- It examines every string s in D and computes the substring edit distance d_sub(s, σ) one by one.
- Computing the substring edit distance d_sub(s, σ) is very expensive.

9
Can we do better?
By utilizing q-grams in the query string (TopK-LB) and inverted q-gram indexes (TopK-Split).

10
q-grams and Inverted q-gram Indexes
Positional q-grams: for the string s = ‘Jackson’ and q = 3, the positional 3-grams of s are (‘Jac’, 1), (‘ack’, 2), (‘cks’, 3), (‘kso’, 4) and (‘son’, 5).
(Figure: inverted q-gram index of D.)
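A short Python sketch of both ideas, for illustration only. The function names are hypothetical, and the posting layout (a list of (string id, position) pairs per q-gram) is an assumption, since the slide's index figure is not preserved in this transcript.

```python
from collections import defaultdict


def positional_qgrams(s, q):
    """Return (q-gram, 1-based start position) pairs of s."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def build_inverted_index(strings, q):
    """Map each q-gram to its postings: (string id, position) pairs (assumed layout)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram, pos in positional_qgrams(s, q):
            index[gram].append((sid, pos))
    return index
```

For example, `positional_qgrams('Jackson', 3)` yields the five positional 3-grams listed on the slide.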

11
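TopK-LB
What is LB (Lower Bound)?
- s: string
- σ: query
- c: number of common q-grams
- q: length of a q-gram
Consistent with the worked example on the next slide, the bound is d(s, σ) ≥ ceil((|s| - q + 1 - c) / q), where |s| is the length of the longer string.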

12
Cont..
Example: s = ‘SAMPLESTR’, |s| = 9 and q = 3, so s has |s| - 3 + 1 = 7 q-grams.
- Assume there are no matching q-grams. The previous formula then gives d(s, σ) ≥ ceil(7 / 3) = 3.
- Assume there are 2 matching q-grams. The previous formula then gives d(s, σ) ≥ ceil((7 - 2) / 3) = 2.
(Figure: ‘SAMPLESTR’ shown alongside ‘SAMPLES’.)
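The two cases above can be checked with a few lines of Python. This sketch assumes the bound has the form ceil((n - q + 1 - c) / q), where n is the length of the longer string and c the number of common q-grams; the function name is hypothetical.

```python
import math


def edit_distance_lower_bound(longer_len, q, common):
    """Lower bound ceil((longer_len - q + 1 - common) / q), clamped at 0."""
    num_qgrams = longer_len - q + 1  # number of q-grams in the longer string
    return max(0, math.ceil((num_qgrams - common) / q))


print(edit_distance_lower_bound(9, 3, 0))  # no matching q-grams: prints 3
print(edit_distance_lower_bound(9, 3, 2))  # two matching q-grams: prints 2
```

The intuition is that a single edit operation can destroy at most q of the q-grams, so every q missing q-grams force at least one edit.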

13
Lower Bound for Substring
Let c_i be the number of common q-grams between σ and s[i, i+|σ|-1].
Time complexity: O(|σ| · |s|).
Towards a tight lower bound and an O(l^2) algorithm, where l is the number of matching q-gram pairs, which is << min(|σ|, |s|).

14
Calculating lo(d_sub(s, σ))
Given σ = ‘Jacksonville’, |σ| = 12 and s = ‘Jack Willson’.
Matching positional 3-grams:
X_σ = {(Jac, 1), (ack, 2), (son, 5), (ill, 9)}
Y_s = {(Jac, 1), (ack, 2), (ill, 7), (son, 10)}
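The two sets above can be reproduced with a small Python sketch. The helper names are hypothetical; positions are 1-based, as on the slides.

```python
def positional_qgrams(s, q):
    """Return (q-gram, 1-based start position) pairs of s."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def matching_positional_qgrams(query, s, q):
    """Keep only the positional q-grams whose q-gram occurs in both strings."""
    query_grams = positional_qgrams(query, q)
    s_grams = positional_qgrams(s, q)
    shared = {g for g, _ in query_grams} & {g for g, _ in s_grams}
    x = [(g, p) for g, p in query_grams if g in shared]  # ordered by position in query
    y = [(g, p) for g, p in s_grams if g in shared]      # ordered by position in s
    return x, y


x_sigma, y_s = matching_positional_qgrams('Jacksonville', 'Jack Willson', 3)
print(x_sigma)  # [('Jac', 1), ('ack', 2), ('son', 5), ('ill', 9)]
print(y_s)      # [('Jac', 1), ('ack', 2), ('ill', 7), ('son', 10)]
```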

15
Finding LB by DYN-LB
m[i, j] is defined for a positional q-gram pair (x_i, p_i) ∈ X_σ, (y_j, r_j) ∈ Y_s such that x_i = y_j. Here (x_i, p_i) is a query q-gram with its position and (y_j, r_j) is an input-string q-gram with its position.
- ‘Jac’ -> m[1, 1] = 0
- ‘ack’ -> m[2, 2] = 0
- ‘ill’ -> m[4, 3] = 2
- ‘son’ -> m[3, 4] = 2

16
Choosing LB
From DYN-LB: ‘Jac’ -> m[1, 1] = 0, ‘ack’ -> m[2, 2] = 0, ‘ill’ -> m[4, 3] = 2 and ‘son’ -> m[3, 4] = 2.
Candidate values for lo(d_sub(s, σ)):
- from m[1, 1]: 0 + 3 = 3
- from m[2, 2]: 0 + 3 = 3
- from m[3, 4]: 2 + 2 = 4
- from m[4, 3]: 2 + 1 = 3
lo(d_sub(s, σ)) = min{4, min{3, 3, 4, 3}} = 3, where the leading 4 is ceil((12 - 3 + 1) / 3).

17
Example for TopK-LB
σ = ‘Jacksen’, |σ| = 7, k = 2
- s_1 and s_2: d_sub(s_1, σ) = 1 and d_sub(s_2, σ) = 4; both are inserted into the max-heap H_TopK. At the end of this step the max-heap contains {(s_1, 1), (s_2, 4)}.
- s_3: lo(d_sub(s_3, σ)) = 2 by DYN-LB and d_sub(s_R, σ) = 4. Since lo(d_sub(s_3, σ)) < 4, compute d_sub(s_3, σ). At the end of this step the max-heap contains {(s_1, 1), (s_3, 3)}.
- s_4: lo(d_sub(s_4, σ)) = 1 and d_sub(s_R, σ) = 3. Since lo(d_sub(s_4, σ)) < 3, compute d_sub(s_4, σ). At the end of this step the max-heap contains {(s_1, 1), (s_4, 2)}.

18
Cont…
- s_5: lo(d_sub(s_5, σ)) = 2, which is not less than d_sub(s_R, σ) = 2, so skip the edit distance computation. The max-heap still contains {(s_1, 1), (s_4, 2)}.
- s_6: lo(d_sub(s_6, σ)) = 1 and d_sub(s_R, σ) = 2. Since lo(d_sub(s_6, σ)) < 2, compute d_sub(s_6, σ). At the end of this step the max-heap contains {(s_1, 1), (s_6, 1)}.
Summary: we computed the substring edit distance for 5 of the 6 strings.
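The pruning loop walked through in this example can be sketched in Python as below. This is an illustration under stated assumptions, not the presenters' code: all function names are hypothetical, and `simple_lower_bound` is a looser stand-in for DYN-LB that only counts how many query q-grams occur anywhere in s. It is still a valid lower bound, because one edit operation can destroy at most q of the query's q-grams.

```python
import heapq
import math


def substring_edit_distance(s, query):
    """Minimum edit distance between `query` and any substring of `s`."""
    prev = [0] * (len(s) + 1)  # a match may start anywhere in s
    for i in range(1, len(query) + 1):
        curr = [i] + [0] * len(s)
        for j in range(1, len(s) + 1):
            cost = 0 if query[i - 1] == s[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return min(prev)  # a match may end anywhere in s


def simple_lower_bound(s, query, q=3):
    """Loose stand-in for DYN-LB (an assumption, not the slides' method)."""
    grams = [query[i:i + q] for i in range(len(query) - q + 1)]
    common = sum(1 for g in grams if g in s)  # query q-grams found somewhere in s
    return max(0, math.ceil((len(grams) - common) / q))


def topk_lb(strings, query, k, lower_bound=simple_lower_bound):
    """Return the top-k (distance, string) pairs and the number of d_sub computations."""
    if k <= 0:
        return [], 0
    heap = []  # entries are (-distance, index, string); heap[0] is the root s_R
    computed = 0
    for idx, s in enumerate(strings):
        # Skip s when its lower bound is not less than the root's distance.
        if len(heap) == k and lower_bound(s, query) >= -heap[0][0]:
            continue
        d = substring_edit_distance(s, query)
        computed += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx, s))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, idx, s))
    return sorted((-neg_d, s) for neg_d, _, s in heap), computed
```

Passing the lower bound as a parameter keeps the loop unchanged if a tighter bound such as DYN-LB is plugged in; a tighter bound simply causes more strings to be skipped.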

19
TopK-Split improves on TopK-LB
In TopK-LB, we calculate the lower bound for every string s. Can we reduce the number of LB computations?
- Split the data set D into D_G+ and D_G-.
- Calculate the LB only for strings that fall in D_G+.

21
Computing Best G’
1. Get the inverted index of the positional q-grams in σ.

22
Cont..
2. Return the non-overlapping q-gram set g’ in σ of size τ that has the minimum u[i, τ].

23
Cont..
Finally, to select the best G’ of size 2, we choose [8, 2] = {‘ack’, ‘onv’} for G’, since u[8, 2] is the minimum among u[6, 2], u[7, 2] and u[8, 2].

24
Performance Results
We have evaluated the performance of TopK-Naïve and TopK-LB by varying K.
Settings: number of strings = 865361, query length = 7, q-gram length = 3, Java heap space = 4096 MB.

25
Cont..
We are going to evaluate the performance (execution time) of the three algorithms by varying the following parameters:
- K
- Query length
- Length of q-grams
- Buffer size
- Input data set (Wikipedia and DBLP)
For all of these parameters, we expect the execution times to rank as TopK-Naïve > TopK-LB > TopK-Split.


