# 1 Approximating Edit Distance in Near-Linear Time Alexandr Andoni (MIT) Joint work with Krzysztof Onak (MIT)


2 Edit Distance

For two strings x, y ∈ ∑^n, ed(x, y) = minimum number of edit operations to transform x into y.
- Edit operations = insertion / deletion / substitution
Important in: computational biology, text processing, etc.
Example: ed(0101010, 1010101) = 2
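The quadratic-time algorithm cited on the next slide is the textbook dynamic program; a minimal sketch:

```python
def edit_distance(x, y):
    """Classic O(n^2) dynamic-programming edit distance.

    dp[i][j] = minimum number of insertions/deletions/substitutions
    needed to transform x[:i] into y[:j].
    """
    n, m = len(x), len(y)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # delete all of x[:i]
    for j in range(m + 1):
        dp[0][j] = j          # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[n][m]
```

On the slide's example, `edit_distance("0101010", "1010101")` returns 2 (delete the leading 0, append a trailing 1).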

3 Computing Edit Distance

Problem: compute ed(x, y) for given x, y ∈ {0,1}^n.
Exactly:
- O(n^2) [Levenshtein'65]
- O(n^2 / log^2 n) for |∑| = O(1) [Masek-Paterson'80]
Approximately, in n^{1+o(1)} time:
- n^{1/3+o(1)} approximation [Batu-Ergun-Sahinalp'06], improving over [Myers'86, Bar-Yossef-Jayram-Krauthgamer-Kumar'04]
Sublinear time:
- distinguish ed(x, y) ≤ n^{1-ε} vs ed(x, y) ≥ n/100 in n^{1-2ε} time [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami'03]

4 Computing via embedding into ℓ_1

Embedding: f : {0,1}^n → ℓ_1
- such that ed(x, y) ≈ ||f(x) − f(y)||_1, up to some distortion (= approximation)
- can then compute ed(x, y) in the time needed to compute f(x).
Best embedding, by [Ostrovsky-Rabani'05]:
- distortion = 2^{Õ(√log n)}
- computation time: ~n^2, randomized (and similar dimension)
- helps for nearest neighbor search and sketching, but not for computation…

5 Our result

Theorem: can compute ed(x, y) in
- n · 2^{Õ(√log n)} time, with
- 2^{Õ(√log n)} approximation.
While it uses some ideas of the [OR'05] embedding, our algorithm does not compute the [OR'05] embedding itself.

6 Sketcher's hat

Two examples of "sketches" obtained from embeddings…
[Johnson-Lindenstrauss]: pick a random k-subspace of R^n; for any q_1, …, q_n ∈ R^n, if q̃_i is the projection of q_i, then, w.h.p.,
- ||q_i − q_j||_2 ≈ ||q̃_i − q̃_j||_2 up to O(1) distortion,
- for k = O(log n).
[Bourgain]: given n vectors q_i, can construct n vectors q̃_i of dimension k = O(log^2 n) such that
- ||q_i − q_j||_1 ≈ ||q̃_i − q̃_j||_1 up to O(log n) distortion.
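The Johnson-Lindenstrauss sketch can be illustrated with a plain random Gaussian projection (a toy version, not the exact construction on the slide; the scaling 1/√k keeps expected squared norms unchanged):

```python
import math
import random

def jl_project(points, k, seed=0):
    """Project points in R^d down to R^k with a random k x d Gaussian
    matrix scaled by 1/sqrt(k).  For k = O(log n), pairwise l2 distances
    are preserved up to O(1) distortion w.h.p. (Johnson-Lindenstrauss)."""
    rng = random.Random(seed)
    d = len(points[0])
    A = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)]
         for _ in range(k)]
    # each projected point has one coordinate per row of A
    return [[sum(a * p[j] for j, a in enumerate(row)) for row in A]
            for p in points]

def l2(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

With a generous k, the distance between a projected pair stays within a small constant factor of the original distance.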

7 Our Algorithm

Let z = xy (the concatenation), and write z[i : i+m] for the substring of z of length m starting at position i.
For each length m in some fixed set L ⊆ [n], compute vectors v_i^m ∈ ℓ_1 such that
- ||v_i^m − v_j^m||_1 ≈ ed( z[i : i+m], z[j : j+m] )
- the dimension of v_i^m is only O(log^2 n).
The vectors {v_i^m} are computed recursively from the {v_i^k} corresponding to shorter substrings (smaller k ∈ L).
Output: ed(x, y) ≈ ||v_1^{n/2} − v_{n/2+1}^{n/2}||_1 (i.e., for m = n/2 = |x| = |y|).

8 Idea: intuition

How to compute {v_i^m} from {v_i^k} for k < m?
[Slide figure not recovered from the transcript.]

9 Key step

Main Lemma: fix n vectors v_i ∈ ℓ_1^k, of dimension k = O(log^2 n). Let s …
[Rest of the slide was lost in transcription.]

10 Proof of Main Lemma

Definitions:
- "low" = log^{O(1)} n
- Graph-metric: shortest path on a weighted graph
- Sparse: Õ(n) edges
- ⊕_min: (⊕_min)^k M is a semi-metric on M^k with "distance" d_{min,M}(x, y) = min_{i=1..k} d_M(x_i, y_i)

Chain of embeddings (distortion per arrow):
EMD over n sets A_i
  → (⊕_min)^low ℓ_1^high    [O(log^2 n)]
  → (⊕_min)^low ℓ_1^low     [O(1)]
  → (⊕_min)^low tree-metric [O(log n)]
  → sparse graph-metric     [O(log^3 n)]
  → ℓ_1^low                 [O(log n), by [Bourgain] (made efficient)]
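The ⊕_min "distance" defined above is only a semi-metric: the triangle inequality can fail, as a two-coordinate example over (R, |·|) shows.

```python
def d_min(x, y):
    """Min-product 'distance' over coordinate metrics (here |.| on R):
    d_min(x, y) = min_i |x_i - y_i|.  A semi-metric, not a metric."""
    return min(abs(a - b) for a, b in zip(x, y))

# With x = (0, 10), y = (0, 0), z = (10, 0):
#   d_min(x, y) = 0 and d_min(y, z) = 0, yet d_min(x, z) = 10,
# so d_min(x, z) > d_min(x, y) + d_min(y, z): the triangle inequality fails.
```

This is why the lemma only needs ⊕_min to be *close* to a metric, not an actual one.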

11 Step 1

EMD over n sets A_i → (⊕_min)^low ℓ_1^high, with O(log^2 n) distortion.
[Proof details from the slide were not transcribed.] q.e.d.

12 Step 2

Lemma 2: can embed an n-point set from ℓ_1^H into (⊕_min)^{O(log n)} ℓ_1^k, for k = log^3 n, with O(1) distortion.
Use weak dimensionality reduction in ℓ_1.
Thm [Indyk'06]: let A be a random* matrix of size k × H, k = log^3 n. Then for any x, y, letting x̃ = Ax, ỹ = Ay:
- no contraction: ||x̃ − ỹ||_1 ≥ ||x − y||_1 (w.h.p.)
- 5-expansion: ||x̃ − ỹ||_1 ≤ 5·||x − y||_1 (with probability 0.01)
Just use O(log n) independent such embeddings:
- their min is an O(1) approximation to ||x − y||_1, w.h.p.
This gives (⊕_min)^low ℓ_1^high → (⊕_min)^low ℓ_1^low with O(1) distortion.

13 Efficiency of Steps 1+2

From steps 1+2, we get some embedding f(·) of the sets A_i = {v_i, v_{i+1}, …, v_{i+s-1}} into (⊕_min)^low ℓ_1^low.
Naively this would take Ω(n·s) = Ω(n^2) time to compute all f(A_i).
Save time using linearity of the sketches:
- f(·) is linear: f(A) = ∑_{a ∈ A} f(a)
- hence f(A_i) = f(A_{i-1}) − f(v_{i-1}) + f(v_{i+s-1})
- compute the f(A_i) in order, for a total of Õ(n) time.
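The sliding-window update above works for any linear sketch f; a minimal sketch of the idea (`f` here is a placeholder for the real embedding of steps 1+2):

```python
def sliding_sketches(vectors, s, f):
    """Compute f(A_i) for all windows A_i = {v_i, ..., v_{i+s-1}} using
    O(1) sketch updates per window, exploiting linearity:
        f(A) = sum over a in A of f(a),
        f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1}).
    f maps one vector to its sketch, given as a list of numbers."""
    n = len(vectors)
    sketches = [f(v) for v in vectors]
    k = len(sketches[0])
    # first window: direct sum of the first s sketches
    cur = [sum(sk[j] for sk in sketches[:s]) for j in range(k)]
    out = [list(cur)]
    for i in range(1, n - s + 1):
        # slide the window: drop v_{i-1}, add v_{i+s-1}
        for j in range(k):
            cur[j] += sketches[i + s - 1][j] - sketches[i - 1][j]
        out.append(list(cur))
    return out
```

Each window costs O(k) instead of O(s·k), giving the Õ(n) total claimed on the slide.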

14 Step 3

Lemma 3: can embed ℓ_1 over {0..M}^p into (⊕_min)^low tree-metric, with O(log n) distortion.
For each Δ = a power of 2, take O(log n) random grids of side Δ; each grid gives one ⊕_min-coordinate.
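The randomly shifted grid can be sketched in one dimension (a toy illustration of the rounding step, not the full construction on the slide):

```python
import random

def grid_ids(points, delta, rng):
    """Assign each 1-D point the id of its cell in a grid of side `delta`,
    shifted by a uniformly random offset in [0, delta).

    Two points at distance d < delta land in the same cell with
    probability 1 - d/delta; points at distance >= delta never do."""
    shift = rng.uniform(0.0, delta)
    return [int((p + shift) // delta) for p in points]
```

Repeating this for every Δ = power of 2, with O(log n) independent shifts each, yields the tree-metric coordinates of the lemma.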

15 Step 4

Lemma 4: suppose we have n points in (⊕_min)^low tree-metric that approximate a metric up to distortion D. Then we can embed them into a graph-metric of size Õ(n) with distortion D.
This gives (⊕_min)^low tree-metric → sparse graph-metric [O(log^3 n)].

16 Step 5

Lemma 5: given a graph with m edges, can embed its graph-metric into ℓ_1^low with O(log n) distortion, in Õ(m) time.
Just implement [Bourgain]'s embedding efficiently:
- choose O(log^2 n) sets B_i
- need to compute the distance from each node to each B_i
- for each B_i, compute its distance to every node with one run of Dijkstra's algorithm, in Õ(m) time.
This gives sparse graph-metric → ℓ_1^low with O(log n) distortion.
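One standard way to get all distances to a set B_i with a single Dijkstra run is to seed the heap with every node of B_i at distance 0 (equivalently, a virtual source attached to B_i by zero-weight edges); a minimal sketch:

```python
import heapq

def dist_to_set(adj, B):
    """Distance from every node to the nearest node of the set B, via one
    multi-source Dijkstra run: seed the heap with all of B at distance 0.

    adj: {u: [(v, w), ...]} -- an undirected weighted graph with m edges;
    runs in O(m log n) time.  Each set B_i contributes one coordinate
    of the Bourgain embedding: node u maps to (dist(u, B_1), dist(u, B_2), ...)."""
    INF = float("inf")
    dist = {u: INF for u in adj}
    heap = [(0, b) for b in B]
    for b in B:
        dist[b] = 0
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

Running this once per set B_i gives all O(log^2 n) coordinates in Õ(m) time each, matching the lemma's bound.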

17 Summary of Main Lemma

Min-product helps to get low dimension (~ a small-size sketch):
- it bypasses the impossibility of dimension reduction in ℓ_1
- it is OK that ⊕_min is not a metric, as long as it is close to a metric.
The chain (the slide marks which arrows are oblivious and which non-oblivious):
EMD over n sets A_i
  → (⊕_min)^low ℓ_1^high    [O(log^2 n)]
  → (⊕_min)^low ℓ_1^low     [O(1)]
  → (⊕_min)^low tree-metric [O(log n)]
  → sparse graph-metric     [O(log^3 n)]
  → ℓ_1^low                 [O(log n)]

18 Conclusion

Theorem: can compute ed(x, y) in n · 2^{Õ(√log n)} time with 2^{Õ(√log n)} approximation.
