# Efficient Top-k Algorithms for Approximate Substring Matching Presented by Jagadeesh Potluri Shiva Krishna Imminni.


Our presentation covers: Introduction, Problem Statement, Approach, Performance Results.

Introduction: With vast amounts of data available, retrieving similar strings has become a more challenging problem. Applications include web search, music data retrieval, finding DNA subsequences, and many more. Given a large database of texts and a query string, we need an efficient way to search for similar strings or substrings. Edit distance is one of the most widely accepted distance measures for database applications.

Problem Statement: Traditional approximate substring matching requires the user to specify a similarity threshold. This is impractical because:
- There is no global threshold value that works for all types of data.
- The threshold value requires fine-tuning.

Instead, we opt for algorithms that return the top-k results, where k is the total number of results the user would like to see.

Example: Given a query string, a top-k algorithm fetches the k best approximate substring matches from the text database. The query string is ‘Jackson’ and k = 3.
- $s_1$: substring edit distance is 0.
- $s_2$: substring edit distance is 3.
- $s_3$: substring edit distance is 2.
- $s_4$, $s_5$, $s_6$: substring edit distances are 1.

Output: Top 3 = {‘Jackson Pollock’, ‘Jacksomville’, ‘Jakson Pollack’}

TopK Algorithms: TopK-Naïve, TopK-LB, TopK-Split

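TopK-Naïve: Given a set of strings $D$ and a query string $\sigma$. Here $d_{sub}(s, \sigma)$ is the minimum among the edit distances between $\sigma$ and every substring of $s$, and $s_R$ is the string at the root of the heap.

Algorithm:
1. $H_{TopK}$ = an empty max-heap storing (string, substring edit distance) pairs.
2. For every string $s$ in $D$, compute the substring edit distance $d_{sub}(s, \sigma)$.
   - If the size of $H_{TopK}$ is less than $k$, insert the string $s$ into $H_{TopK}$.
   - Else if $d_{sub}(s, \sigma) < d_{sub}(s_R, \sigma)$, delete $s_R$ from $H_{TopK}$ and insert the string $s$ into $H_{TopK}$.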
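The steps above can be sketched in Python as follows. This is a minimal illustration, not the presenters' implementation: the function names are hypothetical, the substring edit distance uses the standard dynamic program in which a match may start at any position of s, and the max-heap is simulated with negated distances because `heapq` is a min-heap. The string ‘Mike Johnson’ in the usage note is made up for the demonstration.

```python
import heapq


def substring_edit_distance(s: str, query: str) -> int:
    """Minimum edit distance between `query` and any substring of `s`."""
    m = len(query)
    # prev[j]: best distance between query[:j] and a substring of s
    # ending at the previous position. Before any character of s is
    # read, only the empty substring exists, so prev[j] = j.
    prev = list(range(m + 1))
    best = prev[m]
    for ch in s:
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start anywhere in s
        for j in range(1, m + 1):
            cost = 0 if ch == query[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitute
                         prev[j] + 1,         # extra character in s
                         cur[j - 1] + 1)      # extra character in query
        best = min(best, cur[m])
        prev = cur
    return best


def top_k_naive(strings, query, k):
    """Scan every string and keep the k smallest substring edit distances."""
    heap = []  # entries (-distance, -index, string); root = current worst
    for idx, s in enumerate(strings):
        d = substring_edit_distance(s, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, -idx, s))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, -idx, s))
    return sorted((-neg_d, s) for neg_d, _, s in heap)
```

For example, `top_k_naive(['Jackson Pollock', 'Mike Johnson', 'Jacksomville', 'Jakson Pollack'], 'Jackson', 3)` keeps the three strings from the earlier example and drops ‘Mike Johnson’.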

TopK-Naïve: Is this efficient? It examines every string $s$ in $D$ and computes the substring edit distance $d_{sub}(s, \sigma)$ one by one, and each computation of $d_{sub}(s, \sigma)$ is very expensive.

Can we do better? Yes, by utilizing q-grams of the query string (TopK-LB) and inverted q-gram indexes (TopK-Split).

q-grams and Inverted q-gram Indexes: Positional q-grams: for the string s = ‘Jackson’ and q = 3, (‘Jac’,1), (‘ack’,2), (‘cks’,3), (‘kso’,4) and (‘son’,5) are the positional 3-grams of s.

[Figure: inverted q-gram index of D]
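A short Python sketch of both ideas follows. Positions are 1-based to match the slide. The posting format of the index, a list of (string id, position) pairs per q-gram, is an assumption made for illustration, since the slide's index figure is not reproduced here; the function names are likewise hypothetical.

```python
from collections import defaultdict


def positional_qgrams(s: str, q: int):
    """Return the (q-gram, 1-based position) pairs of s."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def build_inverted_index(strings, q: int):
    """Map each q-gram to the (string id, position) pairs where it occurs.

    The posting layout is an assumed one, chosen for simplicity.
    """
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram, pos in positional_qgrams(s, q):
            index[gram].append((sid, pos))
    return index
```

Calling `positional_qgrams('Jackson', 3)` yields the five positional 3-grams listed above.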

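TopK-LB: What is LB (lower bound)? For a string $s$, a query $\sigma$, $c$ common q-grams and q-gram length $q$, the edit distance is bounded from below by $d(s, \sigma) \ge \lceil (|s| - q + 1 - c) / q \rceil$.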

Cont.: Take $s$ = SAMPLESTR ($|s| = 9$), $\sigma$ = SAMPLES and $q = 3$, so $s$ has $|s| - 3 + 1 = 7$ q-grams. Assuming there are no matching q-grams, the previous formula gives $d(s, \sigma) \ge \lceil 7/3 \rceil = 3$. Assuming there are 2 matching q-grams, it gives $d(s, \sigma) \ge \lceil (7 - 2)/3 \rceil = 2$.
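The arithmetic in this example can be checked with a small Python sketch. The helper names are hypothetical, and counting common q-grams as a multiset intersection is an assumption, since the slides do not say how duplicates are handled.

```python
from collections import Counter


def qgram_lower_bound(length: int, q: int, c: int) -> int:
    """ceil((length - q + 1 - c) / q), clamped at 0, in integer arithmetic."""
    unmatched = length - q + 1 - c
    return max(0, -(-unmatched // q))


def common_qgram_count(s: str, query: str, q: int) -> int:
    """Number of q-grams shared by s and query (multiset intersection)."""
    grams_s = Counter(s[i:i + q] for i in range(len(s) - q + 1))
    grams_q = Counter(query[i:i + q] for i in range(len(query) - q + 1))
    return sum((grams_s & grams_q).values())
```

With length 9 and q = 3, `qgram_lower_bound(9, 3, 0)` and `qgram_lower_bound(9, 3, 2)` reproduce the two hypothetical cases above. The actual strings SAMPLESTR and SAMPLES share five 3-grams, so the bound for them is lower than in either case.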

Lower Bound for Substring: Let $c_i$ be the number of common q-grams between $\sigma$ and $s[i, i+|\sigma|-1]$. Time complexity: $O(|\sigma| \cdot |s|)$. Towards a tight lower bound and an $O(l^2)$ algorithm, where $l$ is the number of matching q-gram pairs, which is $\ll \min(|\sigma|, |s|)$.

Calculating $lo(d_{sub}(s, \sigma))$: Given $\sigma$ = ‘Jacksonville’, $|\sigma| = 12$ and $s$ = ‘Jack Willson’, the matching positional 3-grams are $X_\sigma$ = {(Jac,1), (ack,2), (son,5), (ill,9)} and $Y_s$ = {(Jac,1), (ack,2), (ill,7), (son,10)}.
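The two sets can be reproduced with a few lines of Python. This sketch only extracts the matching positional q-grams; it does not implement DYN-LB itself, and the function names are hypothetical.

```python
def positional_qgrams(s: str, q: int):
    """Return the (q-gram, 1-based position) pairs of s."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def matching_positional_qgrams(query: str, s: str, q: int):
    """Positional q-grams of query and of s whose q-gram occurs in both."""
    query_grams = positional_qgrams(query, q)
    s_grams = positional_qgrams(s, q)
    shared = {g for g, _ in query_grams} & {g for g, _ in s_grams}
    x = [(g, p) for g, p in query_grams if g in shared]  # ordered by position in query
    y = [(g, p) for g, p in s_grams if g in shared]      # ordered by position in s
    return x, y
```

Calling `matching_positional_qgrams('Jacksonville', 'Jack Willson', 3)` returns the two lists shown above, each ordered by position.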

Finding LB by DYN-LB: $m[i, j]$ is defined for a positional q-gram pair such that $(x_i, p_i) \in X_\sigma$, $(y_j, r_j) \in Y_s$ and $x_i = y_j$, where $(x_i, p_i)$ is a query q-gram with its position and $(y_j, r_j)$ is an input-string q-gram with its position. For the example: ‘Jac’ -> m[1, 1] = 0, ‘ack’ -> m[2, 2] = 0, ‘ill’ -> m[4, 3] = 2, ‘son’ -> m[3, 4] = 2.

Choosing LB: Starting from ‘Jac’ -> m[1, 1] = 0, ‘ack’ -> m[2, 2] = 0, ‘ill’ -> m[4, 3] = 2 and ‘son’ -> m[3, 4] = 2, the candidate values for $lo(d_{sub}(s, \sigma))$ are:
- m[1, 1]: 0 + 3 = 3
- m[2, 2]: 0 + 3 = 3
- m[3, 4]: 2 + 2 = 4
- m[4, 3]: 2 + 1 = 3

$lo(d_{sub}(s, \sigma)) = \min\{4, \min\{3, 3, 4, 3\}\} = 3$, where the outer 4 is $\lceil (12 - 3 + 1)/3 \rceil$.

Example for TopK-LB: $\sigma$ = ‘Jacksen’, $|\sigma| = 7$, $k = 2$.
- $s_1$ and $s_2$, with $d_{sub}(s_1, \sigma) = 1$ and $d_{sub}(s_2, \sigma) = 4$, are inserted into the max-heap $H_{TopK}$. At the end of this step the max-heap contains {(s1, 1), (s2, 4)}.
- Next, $lo(d_{sub}(s_3, \sigma)) = 2$ by DYN-LB and $d_{sub}(s_R, \sigma) = 4$. Since $lo(d_{sub}(s_3, \sigma)) < 4$, compute $d_{sub}(s_3, \sigma)$. At the end of this step the max-heap contains {(s1, 1), (s3, 3)}.
- Next, $lo(d_{sub}(s_4, \sigma)) = 1$ and $d_{sub}(s_R, \sigma) = 3$. Since $lo(d_{sub}(s_4, \sigma)) < 3$, compute $d_{sub}(s_4, \sigma)$. At the end of this step the max-heap contains {(s1, 1), (s4, 2)}.

Cont.:
- Next, $lo(d_{sub}(s_5, \sigma)) = 2$, which is not less than 2, so skip the edit distance computation. At the end of this step the max-heap contains {(s1, 1), (s4, 2)}.
- Next, $lo(d_{sub}(s_6, \sigma)) = 1$ and $d_{sub}(s_R, \sigma) = 2$. Since $lo(d_{sub}(s_6, \sigma)) < 2$, compute $d_{sub}(s_6, \sigma)$. At the end of this step the max-heap contains {(s1, 1), (s6, 1)}.

Summary: we calculated the substring edit distance for 5 of the 6 strings.
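The pruning loop walked through in this example can be sketched in Python as follows. This is an illustration under stated assumptions, not the presenters' code: all names are hypothetical, and the lower bound used here is a simpler and looser stand-in for DYN-LB. It counts how many q-grams of the query occur anywhere in s and relies on the fact that one edit operation can destroy at most q of the query's q-grams. The test strings other than the three from the earlier example are made up.

```python
import heapq


def substring_edit_distance(s: str, query: str) -> int:
    """Minimum edit distance between `query` and any substring of `s`."""
    m = len(query)
    prev = list(range(m + 1))
    best = prev[m]
    for ch in s:
        cur = [0] * (m + 1)  # a match may start at any position of s
        for j in range(1, m + 1):
            cost = 0 if ch == query[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1)
        best = min(best, cur[m])
        prev = cur
    return best


def loose_qgram_lower_bound(s: str, query: str, q: int = 3) -> int:
    """Loose lower bound on the substring edit distance (not DYN-LB)."""
    total = len(query) - q + 1
    if total <= 0:
        return 0
    # c = query q-gram positions whose q-gram occurs somewhere in s
    c = sum(1 for i in range(total) if query[i:i + q] in s)
    return max(0, -(-(total - c) // q))  # ceil((total - c) / q)


def top_k_lb(strings, query, k, lower_bound=loose_qgram_lower_bound):
    """Top-k search that skips strings whose lower bound cannot beat the root."""
    heap = []      # entries (-distance, -index, string); root = current worst
    computed = 0   # number of exact distance computations performed
    for idx, s in enumerate(strings):
        if len(heap) == k and lower_bound(s, query) >= -heap[0][0]:
            continue  # cannot improve the current top-k, skip the costly step
        d = substring_edit_distance(s, query)
        computed += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, -idx, s))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, -idx, s))
    return sorted((-neg_d, s) for neg_d, _, s in heap), computed
```

Passing the bound as a parameter keeps the loop independent of how the bound is computed, so a tighter bound such as DYN-LB could be dropped in without changing the search.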

TopK-Split improves on TopK-LB: In TopK-LB, we calculate the lower bound for every string $s$. Can we reduce the number of LB computations? Split the data set $D$ into $D_G^+$ and $D_G^-$, and calculate the LB only for strings that fall in $D_G^+$.

Computing Best G′: 1. Get the inverted index of the positional q-grams in $\sigma$.

Cont.: 2. Return the non-overlapping q-gram set g′ in $\sigma$ of length $\tau$ that has the minimum $u[i, \tau]$.

Cont.: Finally, to select the best G′ of size 2, we choose [8, 2] = {‘ack’, ‘onv’} for G′, since u[8, 2] is the minimum among u[6, 2], u[7, 2] and u[8, 2].

Performance Results: We evaluated the performance of TopK-Naïve and TopK-LB by varying k. Settings: number of strings = 865,361; query length = 7; q-gram length = 3; Java heap space = 4096 MB.

Cont.: We are going to evaluate the performance (execution time) of the three algorithms by varying the following parameters:
- k
- Query length
- Length of q-grams
- Buffer size
- Input data set (Wikipedia and DBLP)

For all of these parameters, we expect the execution times to be ordered as TopK-Naïve > TopK-LB > TopK-Split.