A Sublinear Algorithm For Weakly Approximating Edit Distance Batu, Ergun, Killian, Magen, Raskhodnikova, Rubinfeld, Sami Presentation by Itai Dinur.


1 A Sublinear Algorithm For Weakly Approximating Edit Distance Batu, Ergun, Killian, Magen, Raskhodnikova, Rubinfeld, Sami Presentation by Itai Dinur

2 Edit Distance (Levenshtein distance) Let A,B be two strings over a fixed alphabet Σ. The edit distance D(A,B) between A and B is defined as the minimum number of character insertions, deletions, and substitutions that transform A into B, or vice versa.

3 Applications Bioinformatics, text processing, web search.

4 Algorithms Wagner and Fischer gave a dynamic programming algorithm that runs in time O(n^2). Masek and Paterson gave an improved algorithm that runs in time O(n^2 / log n).
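The quadratic dynamic program mentioned above can be sketched as follows. This is a minimal textbook version of the Wagner-Fischer recurrence that keeps only two rows of the table; the function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic O(|a| * |b|) dynamic program."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

The running time is proportional to the product of the two lengths, which is what the sublinear tester in the rest of the talk avoids.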

5 The Edit Distance Testing Problem On input A, B and parameters 0<α<1, C>1: If D(A,B) ≤ n^α, output CLOSE with probability at least 2/3. If D(A,B) > n/C, output FAR with probability at least 2/3. Note that the output is unrestricted for n^α < D(A,B) ≤ n/C; e.g. the test cannot distinguish between distances n^0.1 and n^0.9. The algorithm presented for the problem runs in time Õ(n^{max{α/2, 2α-1}}).

6 Motivation In some applications, given many pairs of strings, one is interested in computing the edit distance only for close strings. For string pairs where the edit distance is above a certain threshold, the actual value of the distance is irrelevant.

7 Lower Bound Any probabilistic algorithm for the edit distance testing problem requires Ω(n^{α/2}) queries. The algorithm presented for the problem runs in time Õ(n^{max{α/2, 2α-1}}), which is close to optimal for α ≤ 2/3.
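As a quick check of where the two terms in the exponent cross over (a worked inequality added for illustration, not taken from the slides):

\[
\frac{\alpha}{2} \ge 2\alpha - 1 \iff 1 \ge \frac{3\alpha}{2} \iff \alpha \le \frac{2}{3}
\]

So for α ≤ 2/3 the running time exponent is α/2, which matches the Ω(n^{α/2}) lower bound up to polylogarithmic factors, and for α > 2/3 the exponent is 2α-1.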

8 Other Approximations There are several papers that give better approximation results, but none run in sublinear time. Andoni and Onak give an algorithm that computes the edit distance between two strings up to a factor of 2^{Õ(√log n)} in n^{1+o(1)} time.

9 Algorithm Overview A recursive divide and conquer algorithm. B is broken into substrings, which are recursively matched against A. The matches are pieced together to form a matching for A. It is too expensive to match all the substrings, so a small number of them are sampled and matched, relying on statistical properties of the matchings.

10 Approximate Matching
Definition 1: An interval I = B[s…e] has a (t,E)-(approximate) matching with respect to A if for some interval A[s'…e'], s' = s+t and D(A[s'…e'], I) ≤ E.
Example: A = abcd1234efgh5678, B = cd02. I has a (2,1)-(approximate) matching with respect to A: taking I to be all of B (s = 1), the interval A[3…6] = cd12 differs from cd02 in one character.
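Definition 1 can be checked by brute force on small inputs. The sketch below computes the smallest E for which an interval has a (t,E)-matching by running the edit distance dynamic program against every interval of A that starts at s' = s+t. The function names are hypothetical, positions are 1-indexed as on the slide, and allowing an empty interval of A is an assumption of this sketch.

```python
def min_matching_error(A: str, I: str, s: int, t: int) -> int:
    """Smallest E such that I (starting at position s of B) has a (t, E)-matching in A."""
    start = s + t - 1  # 0-based index of A[s'] with s' = s + t
    if not 0 <= start <= len(A):
        raise ValueError("shift moves the interval outside A")
    # prev[j] = edit distance between the current A-interval and I[:j].
    prev = list(range(len(I) + 1))
    best = prev[-1]  # assumption: matching against the empty interval is allowed
    for r, ca in enumerate(A[start:], start=1):
        curr = [r]
        for j, cb in enumerate(I, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (0 if ca == cb else 1)))
        best = min(best, curr[-1])  # cost if the A-interval ends at this row
        prev = curr
    return best


def has_matching(A: str, I: str, s: int, t: int, E: int) -> bool:
    """True if I has a (t, E)-matching with respect to A."""
    return min_matching_error(A, I, s, t) <= E


if __name__ == "__main__":
    print(min_matching_error("abcd1234efgh5678", "cd02", s=1, t=2))
```

This takes time proportional to |A| times |I| per shift, so it is only a way to test the definition, not part of the sublinear algorithm.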

11 Coordinated Matching
Definition 2: Let I = (I_1, …, I_k) be a collection of intervals. We say that I has a (t,σ,E,D)-coordinated matching with A if for all but D of the intervals I_i ∈ I, I_i has a (t_i,E)-matching with A, where |t - t_i| ≤ σ.
Example: A = abcd1234efgh5678, B = cd0236gjfkl5. I has a (1,1,2,1)-coordinated matching with A.

12 Coordinated Matching to Approximate Matching We decompose an interval I of size S into k disjoint contiguous subintervals, I = (I_1, …, I_k), each of size S' = S/k (assuming k|S). Lemma 1: If (I_1, …, I_k) has a (t, σ, εS', δk)-coordinated matching with A, then I has a (t, βS)-(approximate) matching with A, where β = 2σ/S' + ε + δ.

13 Approximate Matching to Coordinated Matching Lemma 2: Let c > 1 and S > cE. If I has a (t,E)-matching with A, then I = (I_1, …, I_k) has a (t, E, cE/k, k/c)-coordinated matching with A. Lemma 3: If I has a (t,E)-matching with A and k ≥ E, then I = (I_1, …, I_k) has a (t, E, 0, E)-coordinated matching with A.

14 To match A and B
Decompose B into a set I of contiguous disjoint intervals.
Lemma 2 argues that a match for A and B gives a coordinated matching for A and I.
Use a subroutine (COORD-MATCHES) to find coordinated matches for I.
Lemma 1 infers the existence of good matches for B from coordinated matches for I.

15 COORD-MATCHES
COORD-MATCHES(A, I, σ, E, D, ε, c):
Let d be a constant and l = d·log(n).
Choose samples i_1, …, i_l uniformly and independently from [1, …, k].
For each chosen sample i_j, compute T_j = MATCHES(A, I_{i_j}, E).
Let Δ = (D/k + ε/2)·l.
Return the set T, where t ∈ T iff T_j ∩ [t-σ … t+σ] = Ø for at most Δ sets T_j.
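The sampling and voting logic of COORD-MATCHES can be sketched as below. Several things here are assumptions made for illustration rather than details from the slides: the recursive MATCHES call is passed in as a callable `matches`, the finite set of shifts to test is passed in as `candidates`, `log` is taken to be the natural logarithm, and the default value of the constant `d` is arbitrary.

```python
import math
import random


def coord_matches(intervals, matches, candidates, sigma, D, eps, n, d=8, rng=random):
    """Return the candidate shifts t that all but about D + eps*k/2 sampled intervals agree with."""
    k = len(intervals)
    l = max(1, math.ceil(d * math.log(n)))  # number of samples, l = d log n
    # Sample interval indices uniformly and independently (with replacement).
    sampled = [rng.randrange(k) for _ in range(l)]
    # T_j: the set of shifts at which the j-th sampled interval matches A.
    t_sets = [set(matches(intervals[i])) for i in sampled]
    delta = (D / k + eps / 2) * l  # allowed number of disagreeing samples
    result = []
    for t in candidates:
        # A sample disagrees with t if none of its shifts lies in [t - sigma, t + sigma].
        misses = sum(
            1 for tj in t_sets if not any(abs(t - x) <= sigma for x in tj)
        )
        if misses <= delta:
            result.append(t)
    return result


if __name__ == "__main__":
    # Stub for MATCHES: pretend every interval matches A at shift 5 only.
    out = coord_matches(list(range(10)), lambda interval: {5}, range(10),
                        sigma=1, D=0, eps=0.1, n=1024, d=2)
    print(out)
```

In the stub run every sample reports shift 5, so exactly the candidates within σ = 1 of 5 survive regardless of which intervals are sampled.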

16 Sampling Lemma Lemma 4: Suppose that a random element of a set S of size n has a property Z with probability p. For any positive ε and c, there exists d such that for d·log(n) random samples from S, the fraction p' of these samples with property Z satisfies p - ε/2 ≤ p' ≤ p + ε/2 with probability 1 - 1/n^c.
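One standard way to see why d·log(n) samples suffice is Hoeffding's inequality. The derivation below is an illustration under the assumptions that the samples are independent and that log is the natural logarithm; it is not necessarily the proof used in the paper. With l = d ln n samples:

\[
\Pr\left[\,|p' - p| > \tfrac{\varepsilon}{2}\,\right]
\le 2\exp\!\left(-2\,l\left(\tfrac{\varepsilon}{2}\right)^{2}\right)
= 2\exp\!\left(-\tfrac{\varepsilon^{2}}{2}\,d\ln n\right)
= 2\,n^{-d\varepsilon^{2}/2}
\]

Choosing d ≥ 2(c+1)/ε² makes the failure probability at most 2n^{-(c+1)}, which is at most n^{-c} for n ≥ 2.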

17 COORD-MATCHES Lemma 5: With probability 1 - 1/n^{c-1} over the random coins of COORD-MATCHES, the output T of COORD-MATCHES(A, I, σ, E, D, ε, c) has the following properties: If I has a (t,σ,E,D)-coordinated matching then t ∈ T. If t ∈ T then I has a (t,σ,E,D+εk)-coordinated matching.

18 MATCHES(A,I,E) If E ≥ 1, use a recursive call to COORD-MATCHES. If E < 1 (i.e. E = 0), then A must contain the interval I unchanged; the set of t values is computed directly using the algorithm SHIFTS.

19 Implementing SHIFTS A naïve implementation of SHIFTS may give an output set T consisting of n elements. We may restrict the allowed shifts to [-n^α, …, +n^α]. However, we need a running time of o(n^α), so we must further restrict the set of possible outputs.

20 The Approximate Matching problem Actually, we will solve the approximate matching problem: Given a block I = B[s…e] of length b = e-s+1 and a constant c_2 > 1, find all indexes s' such that A[s'…(s'+b-1)] matches I, in the sense that the two substrings have Hamming distance at most b/c_2. Note that if D(A,B) < n^α, it is enough to consider s' in the interval [s-n^α, s+n^α].

21 The Approximate Matching problem Naively, we can randomly sample O(log(n)) indexes i to determine (with high probability) whether the substring A[(t+1)…(t+b)] matches I for a given t, and try all 2n^α possible shifts. This requires Ω(n^α) queries to A.

22 The Ruler Procedure We can compare pairs of characters A[i], I[j] such that a pair is compared for every i-j from 0 to u = 2n^α, with √u queries to each string, given that b > √u. In A, the character positions divisible by √u are queried: A[√u, 2√u, …, u]. In I, √u consecutive positions are queried: I[1…√u]. Define cen = ⌊t/√u⌋ + 1 and mil = t (mod √u); then for i = cen·√u and j = √u - mil, i - j = t.
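The decomposition of a shift into a coarse mark in A and a fine mark in I can be sketched as follows. This assumes √u is an integer (passed as `root_u`) and that shifts satisfy 0 ≤ t < u, so the coarse mark never exceeds u; the function name `ruler_marks` is hypothetical.

```python
def ruler_marks(t: int, root_u: int):
    """Return (i, j) with i a multiple of root_u, 1 <= j <= root_u, and i - j == t."""
    cen = t // root_u + 1   # index of the coarse ("cen") mark in A
    mil = t % root_u        # fine ("mil") offset
    i = cen * root_u        # queried position in A
    j = root_u - mil        # queried position in I
    return i, j


if __name__ == "__main__":
    for t in range(9):
        print(t, ruler_marks(t, 3))
```

With root_u = 3, only positions 3, 6, 9 of A and positions 1, 2, 3 of I are ever used, yet every shift from 0 to 8 is realised as a difference of one A-mark and one I-mark.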

23 The Ruler Procedure To test whether a block matches: pick l = Θ(log(n)) random numbers m_1, m_2, …, m_l from [0, b-√u]. For each cen and mil mark, construct a fingerprint with l offsets, e.g. f(√u) = A[√u+m_1, √u+m_2, …, √u+m_l]. Detect with high probability whether a block matches with shift t by comparing the cen and mil fingerprints, i.e. f(cen·√u) = A[cen·√u+m_1 … cen·√u+m_l] and f(t (mod √u)) = I[t (mod √u)+m_1 … t (mod √u)+m_l].

24 The Ruler Procedure If b ≤ √u we have only O(b) mil marks and Ω(u/b) cen marks. We can find all matching shifts by using O(max{√u, u/b}·log(n)) queries.

25 Efficient Implementation of the Ruler We need an efficient algorithm to compare all fingerprints and return valid shifts.
Example: A = dbadaabcdabddcd, B = abcdab, u = |A|-|B| = 9, √u = 3, l = 2, m_1 = 1, m_2 = 3.
B-List  A-List  Fingerprint
(table initially empty)

26 Efficient Implementation of the Ruler
Example: A = dbadaabcdabddcd, B = abcdab, u = |A|-|B| = 9, √u = 3, l = 2, m_1 = 1, m_2 = 3.
B-List  A-List  Fingerprint
-       3       da

27 Efficient Implementation of the Ruler
Example: A = dbadaabcdabddcd, B = abcdab, u = |A|-|B| = 9, √u = 3, l = 2, m_1 = 1, m_2 = 3.
B-List  A-List  Fingerprint
-       3       da
1       6       bd
-       9       ad
2       -       ca
3       -       db

28 Quantizing the Ruler The explicit list of all matching t can have Ω(u) values. We round the values of t to multiples of some integer Q and return all quantized shifts. The running time is O(max{√u, u/b, u/Q}·log(n)).

29 SHIFTS(A,I,Q)
Initialize the fingerprint data structure.
Pick l = Θ(log(n)) random numbers m_1, m_2, …, m_l.
Add all the fingerprints f(i) of A to the data structure, adding i to the A-list of f(i).
Add all the fingerprints f(j) of I to the data structure, adding j to the B-list of f(j).
Quantize all A-lists and B-lists.
For each fingerprint, output the list of quantized shifts (differences).
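The steps above can be sketched on the small example from the ruler slides (A = dbadaabcdabddcd, B = abcdab, √u = 3, offsets 1 and 3). This is a simplified illustration, not the procedure from the paper: the offsets are passed in rather than drawn at random, √u is assumed to be an integer, and the sketch quantizes the differences i - j directly instead of quantizing the A-lists and B-lists first, so it does not achieve the stated running time when the lists are long.

```python
from collections import defaultdict


def shifts(A: str, I: str, root_u: int, offsets, Q: int = 1):
    """Return the shifts (rounded down to multiples of Q) whose fingerprints agree."""
    u = len(A) - len(I)

    def fingerprint(s: str, pos: int) -> str:
        # Characters at 1-indexed positions pos + m for each offset m.
        return "".join(s[pos + m - 1] for m in offsets)

    # fingerprint -> (A-list, B-list), as in the table on the slides
    table = defaultdict(lambda: ([], []))
    for i in range(root_u, u + 1, root_u):   # coarse marks in A
        table[fingerprint(A, i)][0].append(i)
    for j in range(1, root_u + 1):           # fine marks in I
        table[fingerprint(I, j)][1].append(j)

    found = set()
    for a_list, b_list in table.values():
        for i in a_list:
            for j in b_list:
                found.add(((i - j) // Q) * Q)
    return sorted(found)


if __name__ == "__main__":
    A, B = "dbadaabcdabddcd", "abcdab"
    print(shifts(A, B, root_u=3, offsets=(1, 3)))
```

On this input only the fingerprint bd appears in both lists (position 6 of A and position 1 of B), giving shift 5, and A[6…11] is indeed abcdab.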

30 SHIFTS(A,I,Q) Theorem 1: Procedure SHIFTS finds all quantized shifts of interval I in A, with high probability. It runs in time O(max{√u,u/b,u/Q}log(n)), where u=|A|-b

31 MATCHES(A,I,E)
If E < 1, use SHIFTS to compute T.
If E ≥ 1:
  Set k = min{εn^{1-α}, 2c_1E}.
  Decompose I into a set I of contiguous disjoint intervals of size |I|/k.
  Compute T = COORD-MATCHES(A, I, E, c_1E/k, k/c_1).
Return T.

32 DECIDE(A,B,α,C)
Choose sufficiently small ε and sufficiently large c_1 (given α, C).
Let the quantization parameter be Q = ε·min{n^{1-α}, n^{α/2}}.
Set T = MATCHES(A, B, n^α).
If T is nonempty, output CLOSE; otherwise output FAR.
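To get a feel for the magnitudes involved, the top-level parameters can be computed for concrete inputs. The helper below is only an illustration: the function name and the particular values of ε and c_1 in the usage example are hypothetical, since the slides only require ε to be sufficiently small and c_1 sufficiently large.

```python
def decide_parameters(n: int, alpha: float, eps: float, c1: float) -> dict:
    """Top-level error budget E, quantization Q and branching factor k."""
    E = n ** alpha                                      # MATCHES is called with E = n^alpha
    Q = eps * min(n ** (1 - alpha), n ** (alpha / 2))   # quantization parameter
    k = min(eps * n ** (1 - alpha), 2 * c1 * E)         # number of subintervals
    return {"E": E, "Q": Q, "k": k}


if __name__ == "__main__":
    # Hypothetical values: n = 4096, alpha = 1/2, eps = 1/2, c1 = 2.
    print(decide_parameters(4096, 0.5, 0.5, 2.0))
```

For these values n^{1-α} = 64 and n^{α/2} = 8, so the quantization step is set by the n^{α/2} term.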

33 DECIDE(A,B,α,C) For any fixed α < 1, we can choose constants ε and c_1 such that procedure DECIDE solves the edit distance testing problem with high probability.

34 Running Time Analysis Note that when k = 2c_1E, COORD-MATCHES is called with edit distance parameter c_1E/k = 1/2 < 1, i.e. the next call to MATCHES will call SHIFTS and end the recursion. At each level, the length of the interval input to MATCHES goes down by a factor of k = Ω(n^{1-α}); after r = α/(1-α) levels the intervals are of length n/n^{r(1-α)} = O(n^{1-α}), E = O(n^α/n^{r(1-α)}) = O(1), and SHIFTS will be called next.
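A worked instance of the recursion depth, with α = 3/4 chosen purely for illustration and constant factors ignored:

\[
r = \frac{\alpha}{1-\alpha} = \frac{3/4}{1/4} = 3, \qquad
\frac{n}{n^{r(1-\alpha)}} = \frac{n}{n^{3/4}} = n^{1/4} = n^{1-\alpha}, \qquad
\frac{n^{\alpha}}{n^{r(1-\alpha)}} = \frac{n^{3/4}}{n^{3/4}} = 1
\]

After three levels the intervals have length about n^{1/4} and the error budget is constant, which is the point at which SHIFTS takes over.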

35 Running Time Analysis α < 1/2
One level of recursion.
B is broken to intervals of size O(n^α).
d·log(n) calls to SHIFTS with Q = εn^{α/2}.
Each call takes O(max{√u, u/b, u/Q}·log(n)) = O(max{n^{α/2}, 1, n^{α/2}}·log(n)) = O(n^{α/2} log(n)).
One merge taking O(n^{α/2} log(n)).
Total running time O(n^{α/2} log^2(n)).

36 Running Time Analysis 1/2 < α < 2/3
Two levels of recursion.
At the last level, B is broken to intervals of size O(n^{α/2}).
log^2(n) calls to SHIFTS with Q = εn^{α/2}.
Each call takes O(n^{α/2} log(n)).
log(n) merges, each taking O(n^{α/2} log(n)).
Total running time O(n^{α/2} log^3(n)).

37 Running Time Analysis α > 2/3
r > 2 levels of recursion.
At the last level, B is broken to intervals of size O(n^{1-α}).
log^{O(1)}(n) calls to SHIFTS with Q = εn^{1-α}. Note that n^{1-α} < n^{α/2}.
Each call takes O(max{√u, u/b, u/Q}·log(n)) = O((u/b)·log(n)) = O(n^{2α-1} log(n)).
Total running time Õ(n^{2α-1} log(n)).

38 Conclusion We saw an algorithm for the edit distance testing problem that runs in time Õ(n^{max{α/2, 2α-1}}). Any probabilistic algorithm for the edit distance testing problem requires Ω(n^{α/2}) queries.

