Approximating Edit Distance in Near-Linear Time

Alexandr Andoni (MIT)
Joint work with Krzysztof Onak (MIT)

Edit Distance
For two strings $x, y \in \Sigma^n$:
$\mathrm{ed}(x,y)$ = minimum number of edit operations needed to transform $x$ into $y$
Edit operations: insertion, deletion, substitution
Important in computational biology, text processing, etc.
Example: ED( , ) = 2
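
The definition above can be made concrete with the classical quadratic-time dynamic program. This is a minimal sketch for illustration only, not the near-linear algorithm of the talk; the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    # prev[j] = ed(x[:i-1], y[:j]); cur[j] = ed(x[:i], y[:j]).
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # substitute (free if the characters match)
            ))
        prev = cur
    return prev[-1]
```

The table has (|x|+1)(|y|+1) cells, each filled in constant time, which gives the quadratic running time that the approximation algorithms in this talk aim to beat.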

Computing Edit Distance
Problem: compute $\mathrm{ed}(x,y)$ for given $x, y \in \{0,1\}^n$
Exactly:
  $O(n^2)$ [Levenshtein'65]
  $O(n^2/\log^2 n)$ for $|\Sigma| = O(1)$ [Masek-Paterson'80]
Approximately, in $n^{1+o(1)}$ time:
  $n^{1/3+o(1)}$ approximation [Batu-Ergun-Sahinalp'06], improving over [Sahinalp-Vishkin'96, Cole-Hariharan'02, BarYossef-Jayram-Krauthgamer-Kumar'04]
Sublinear time:
  distinguish $\mathrm{ed} \le n^{1-\varepsilon}$ from $\mathrm{ed} \ge n/100$ in $n^{1-2\varepsilon}$ time [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami'03]

Computing via embedding into ℓ1
Embedding: $f : \{0,1\}^n \to \ell_1$ such that $\mathrm{ed}(x,y) \approx \|f(x) - f(y)\|_1$ up to some distortion (= approximation)
Can then compute $\mathrm{ed}(x,y)$ in the time needed to compute $f(x)$
Best embedding by [Ostrovsky-Rabani'05]:
  distortion $= 2^{\tilde{O}(\sqrt{\log n})}$
  computation time: $\sim n^2$, randomized (and similar dimension)
Helps for nearest neighbor search and sketching, but not for computation…

Our result
Theorem: Can compute $\mathrm{ed}(x,y)$ in $n \cdot 2^{\tilde{O}(\sqrt{\log n})}$ time with $2^{\tilde{O}(\sqrt{\log n})}$ approximation
While it uses some ideas of the [OR'05] embedding, it is not an algorithm for computing the [OR'05] embedding

Review of Ostrovsky-Rabani embedding
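$\varphi_m$ = embedding of strings of length $m$
$\delta(m)$ = distortion of $\varphi_m$
The embedding is recursive:
  Partition the string into $b$ blocks ($b$ later chosen to be $\exp(\sqrt{\log m})$)
  Use embeddings $\varphi_k$ for $k \le m/b$
  Embed each block separately as follows…
[Figure: string $X$ divided into blocks of length $m/b$]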

Ostrovsky-Rabani embedding (II)
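[Figure: string $X$ divided into $b$ blocks, with sets $E_1^s, E_2^s, E_3^s, \dots, E_b^s$ attached to the blocks]
$E_i^s$ = recursive embeddings of the length-$s$ substrings (shown for $E_1^s$)
Want to approximate $\mathrm{ed}(x,y)$ by $\sum_{i=1}^{b} \sum_{s \in S} \mathrm{TEMD}_s(E_i^s(x), E_i^s(y))$
$\mathrm{EMD}(A,B)$ = min-cost bipartite matching; the T in TEMD stands for "thresholded"
Finish by embedding TEMD into $\ell_1$ with small distortion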
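
To make "min-cost bipartite matching" concrete, here is a brute-force sketch for tiny point sets. The slide does not spell out the thresholding, so the `threshold` parameter below (capping each pairwise distance) is an assumption made for illustration, and any normalization used in the source papers is omitted. The names `l1` and `emd` are illustrative.

```python
from itertools import permutations

def l1(a, b):
    """l1 distance between two equal-length tuples."""
    return sum(abs(p - q) for p, q in zip(a, b))

def emd(A, B, threshold=None):
    """Cost of the cheapest perfect matching between equal-size lists A and B.

    Brute force over all |A|! matchings, so only usable for tiny inputs.
    If `threshold` is given, each pairwise distance is capped at it
    (an assumed reading of "thresholded").
    """
    assert len(A) == len(B)

    def cost(a, b):
        d = l1(a, b)
        return d if threshold is None else min(d, threshold)

    return min(
        sum(cost(a, B[j]) for a, j in zip(A, perm))
        for perm in permutations(range(len(B)))
    )
```

With the cap in place, a single far-away outlier contributes at most `threshold` to the total, and the cheapest matching itself may change.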

Distortion of [OR] embedding
Suppose TEMD can be embedded into $\ell_1$ with distortion $(\log m)^{O(1)}$
Then [Ostrovsky-Rabani'05] show that the distortion of $\varphi_m$ satisfies
  $\delta(m) \le (\log m)^{O(1)} \cdot [\delta(m/b) + b]$
For $b = \exp[\sqrt{\log m}]$:
  $\delta(m) \le \exp[\tilde{O}(\sqrt{\log m})]$
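
To see how the recurrence yields the stated bound, it can be unrolled. The following is a sketch under simplifying assumptions that are not on the slide: logarithms are natural, the same $b = \exp(\sqrt{\log m})$ is used at every level, the polylogarithmic factor is bounded by a single $C = (\log m)^{c} \ge 2$, rounding of the recursion depth $t$ is ignored, and $\delta(1) = O(1)$.

```latex
\begin{align*}
\delta(m) &\le C\,\delta(m/b) + C\,b \\
          &\le C^{2}\,\delta(m/b^{2}) + (C^{2} + C)\,b \\
          &\le \dots \le C^{t}\,\delta(m/b^{t}) + b \sum_{j=1}^{t} C^{j}
           \le C^{t}\,\delta(1) + 2\,C^{t}\,b,
           \qquad t = \frac{\log m}{\log b} = \sqrt{\log m}, \\
C^{t} &= \exp\!\big(c\,\sqrt{\log m}\,\log\log m\big)
       = \exp\!\big[\tilde{O}(\sqrt{\log m})\big],
\qquad\text{so}\qquad
\delta(m) \le \exp\!\big[\tilde{O}(\sqrt{\log m})\big].
\end{align*}
```

The choice of $b$ balances the two sources of loss: a larger $b$ shortens the recursion (fewer factors of $C$) but increases the additive term $b$.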

Why it is expensive to compute [OR] embedding
$E_1^s$ = recursive embeddings of the length-$s$ substrings
In the first step, need to compute the recursive embedding for $\sim n/b$ strings of length $\sim n/b$
The dimension blows up

Our Algorithm
[Figure: string $z$ (containing $x$ and $y$), with the substring $z[i:i+m]$ starting at position $i$]
For each length $m$ in some fixed set $L \subseteq [n]$, compute vectors $v_i^m \in \ell_1$ such that $\|v_i^m - v_j^m\|_1 \approx \mathrm{ed}(z[i:i+m], z[j:j+m])$ up to distortion $\delta(m)$
The dimension of $v_i^m$ is only $O(\log^2 n)$
Vectors $v_i^m$ are computed inductively from $v_i^k$ for $k \le m/b$ ($k \in L$)
Output: $\mathrm{ed}(x,y) \approx \|v_1^{n/2} - v_{n/2+1}^{n/2}\|_1$ (i.e., for $m = n/2 = |x| = |y|$)

Idea: intuition
$\|v_i^m - v_j^m\|_1 \approx \mathrm{ed}(z[i:i+m], z[j:j+m])$
For each $m \in L$, compute $\varphi_m(z[i:i+m])$ as in the O-R recursive step, except that we use vectors $v_i^k$, $k < m/b$ and $k \in L$, in place of the recursive embeddings of shorter substrings (the sets $E_i^s$)
The resulting $\varphi_m(z[i:i+m])$ have high dimension, $> m/b$…
Apply Bourgain's Lemma to the vectors $\varphi_m(z[i:i+m])$, $i = 1..n-m$
  [Bourgain]: given $n$ vectors $q_i$, construct $n$ vectors $\tilde{q}_i$ of dimension $O(\log^2 n)$ such that $\|q_i - q_j\|_1 \approx \|\tilde{q}_i - \tilde{q}_j\|_1$ up to $O(\log n)$ distortion
  Apply it to the vectors $\varphi_m(z[i:i+m])$ to obtain vectors $v_i^m$ of polylogarithmic dimension
This incurs $O(\log n)$ distortion at each step of the recursion, but that is OK as there are only $\sim\sqrt{\log n}$ steps, giving an additional distortion of only $\exp[\tilde{O}(\sqrt{\log n})]$

Idea: implementation
Essential step is:
Main Lemma: fix $n$ vectors $v_i \in \ell_1$ of dimension $p = O(\log^2 n)$. Let $s < n$. Define $A_i = \{v_i, v_{i+1}, \dots, v_{i+s-1}\}$. Then we can compute vectors $q_i \in \ell_1^k$ for $k = O(\log^2 n)$ such that $\|q_i - q_j\|_1 \approx \mathrm{TEMD}(A_i, A_j)$ up to distortion $\log^{O(1)} n$.
Computing the $q_i$'s takes $\tilde{O}(n)$ time.

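Proof of Main Lemma
Chain of embeddings (distortion of each step in parentheses):
  TEMD over $n$ sets $A_i$
  → $\min^{\mathrm{low}} \ell_1^{\mathrm{high}}$ ($O(\log^2 n)$)
  → $\min^{\mathrm{low}} \ell_1^{\mathrm{low}}$ ($O(1)$)
  → $\min^{\mathrm{low}}$ tree-metric ($O(\log n)$)
  → sparse graph-metric ($O(\log^3 n)$)
  → $\ell_1^{\mathrm{low}}$ ($O(\log n)$; [Bourgain], efficient)
Notation:
  "low" = $\log^{O(1)} n$
  $\min^k M$ is a semi-metric on $M^k$ with "distance" $d_{\min,M}(x,y) = \min_{i=1..k} d_M(x_i, y_i)$
  Graph-metric: shortest path on a weighted graph
  Sparse: $\tilde{O}(n)$ edges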
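
The min-product "distance" is easy to state in code, and a three-point example shows why it is only a semi-metric. This is a minimal sketch with $\ell_1$ as the base metric; the function names are illustrative.

```python
def l1(a, b):
    """l1 distance between two equal-length tuples."""
    return sum(abs(p - q) for p, q in zip(a, b))

def min_product_distance(x, y, base=l1):
    """d_min(x, y) = min over coordinates i of base(x[i], y[i]).

    x and y are k-tuples whose entries are points of the base space.
    """
    return min(base(xi, yi) for xi, yi in zip(x, y))

# Three points of min^2 (R, l1): each is a pair of 1-dimensional points.
x = ((0,), (0,))
y = ((0,), (10,))
z = ((10,), (10,))
# d(x, y) = 0 and d(y, z) = 0, yet d(x, z) = 10: the triangle inequality fails.
```

Distinct points can be at distance 0 and the triangle inequality can fail, which is why the later steps only require the min-product to approximate a genuine metric.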

Step 1
TEMD over $n$ sets $A_i$ → $\min^{\mathrm{low}} \ell_1^{\mathrm{high}}$, distortion $O(\log^2 n)$
Lemma 1: can embed TEMD over $n$ sets in $(\{0..M\}^p, \ell_1)$ into $\min^{O(\log n)} \ell_1^{M^p}$ with $O(\log^2 n)$ distortion, w.h.p.
Use [A-Indyk-Krauthgamer'08] (similar to the Ostrovsky-Rabani embedding)
Embedding: for each $\Delta$ = power of 2
  impose a randomly-shifted grid
  one coordinate per cell, equal to the number of points in the cell
Theorem [AIK]:
  no contraction, w.h.p.
  expected expansion $= O(\log^2 n)$
Just repeat $O(\log n)$ times

Step 2
$\min^{\mathrm{low}} \ell_1^{\mathrm{high}}$ → $\min^{\mathrm{low}} \ell_1^{\mathrm{low}}$, distortion $O(1)$
Lemma 2: can embed an $n$-point set from $\ell_1^M$ into $\min^{O(\log n)} \ell_1^k$, for $k = O(\log^3 n)$, with $O(1)$ distortion.
Use (weak) dimensionality reduction in $\ell_1$
Thm [Indyk'06]: Let $A$ be a matrix of size $M$ by $k = O(\log^3 n)$ with each element chosen from the Cauchy distribution. Then for any $\tilde{x} = Ax$, $\tilde{y} = Ay$:
  no contraction: $\|\tilde{x} - \tilde{y}\|_1 \ge \|x - y\|_1$ (w.h.p.)
  5-expansion: $\|\tilde{x} - \tilde{y}\|_1 \le 5 \cdot \|x - y\|_1$ (with probability 0.01)
Just use $O(\log n)$ of such embeddings

Efficiency of Step 1+2
From steps 1+2, we get some embedding $f()$ of the sets $A_i = \{v_i, v_{i+1}, \dots, v_{i+s-1}\}$ into $\min^{\mathrm{low}} \ell_1^{\mathrm{low}}$
Naively, it would take $\Omega(n \cdot s) = \Omega(n^2)$ time to compute all $f(A_i)$
More efficiently:
  Note that $f()$ is linear: $f(A) = \sum_{a \in A} f(a)$
  Then $f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1})$
  Compute the $f(A_i)$ in order, for a total of $\tilde{O}(n)$ time
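
The sliding-window update can be sketched with any per-point map whose set embedding is the sum over the set. The map `f_point` below (a one-hot bucket vector of fixed dimension `K`) is a hypothetical stand-in for the embedding from steps 1+2, chosen only because it is linear in this sense; indices are 0-based.

```python
K = 8  # hypothetical output dimension (polylog(n) in the talk)

def f_point(v):
    """Hypothetical per-point map: one-hot vector indexed by a simple bucket of v."""
    out = [0] * K
    out[sum(v) % K] = 1
    return out

def f_set_naive(A):
    """f(A) = sum of f(a) over a in A, computed from scratch in O(|A| * K) time."""
    total = [0] * K
    for a in A:
        for j, c in enumerate(f_point(a)):
            total[j] += c
    return total

def sliding_f(vs, s):
    """Return f(A_0), ..., f(A_{n-s}) for A_i = {vs[i], ..., vs[i+s-1]}.

    Uses f(A_i) = f(A_{i-1}) - f(vs[i-1]) + f(vs[i+s-1]), so each step costs O(K).
    """
    cur = f_set_naive(vs[:s])
    out = [cur[:]]
    for i in range(1, len(vs) - s + 1):
        leaving = f_point(vs[i - 1])
        entering = f_point(vs[i + s - 1])
        for j in range(K):
            cur[j] += entering[j] - leaving[j]
        out.append(cur[:])
    return out
```

The total work is O(n · K) instead of O(n · s · K), which matches the $\tilde{O}(n)$ claim when the output dimension is polylogarithmic.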

Step 3 minlow ℓ1low O(log n) minlow tree-metric Lemma 3: can embed ℓ1 over {0..M}p into minO(log^2 n) tree-m, with O(log n) distortion. For each Δ = a power of 2, take O(log n) random grids. Each grid gives a min-coordinate ∞ Δ 

Step 4
$\min^{\mathrm{low}}$ tree-metric → sparse graph-metric, distortion $O(\log^3 n)$
Lemma 4: suppose we have $n$ points in $\min^{\mathrm{low}}$ tree-m, which approximates a metric up to distortion $D$. Then we can embed them into a graph-metric of size $\tilde{O}(n)$ with distortion $D$.

Step 5
sparse graph-metric → $\ell_1^{\mathrm{low}}$, distortion $O(\log n)$
Lemma 5: Given a graph with $m$ edges, can embed the graph-metric into $\ell_1^{\mathrm{low}}$ with $O(\log n)$ distortion in $\tilde{O}(m)$ time.
Just implement Bourgain's embedding:
  Choose $O(\log^2 n)$ sets $B_i$
  Need to compute the distance from each node to each $B_i$
  For each $B_i$, can compute its distance to each node using Dijkstra's algorithm in $\tilde{O}(m)$ time
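
The steps above can be sketched as follows: one multi-source Dijkstra run per set $B_i$, with coordinate $i$ of a node being its distance to $B_i$. The way the sets are sampled here (sizes $1, 2, 4, \dots$, with about $\log n$ repetitions per size) is the standard textbook choice and an assumption about what the slide intends. Normalization of the coordinates is omitted, and the graph is assumed connected with comparable node labels. The names are illustrative.

```python
import heapq
import random

def dist_to_set(adj, sources):
    """Shortest-path distance from every node to the nearest node in `sources`.

    adj maps each node to a list of (neighbor, weight) pairs.
    Runs Dijkstra once, with all sources starting at distance 0.
    """
    dist = {u: float("inf") for u in adj}
    heap = []
    for s in sources:
        dist[s] = 0.0
        heap.append((0.0, s))
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def bourgain_embedding(adj, rng, reps=None):
    """Map each node to the vector of its distances to O(log^2 n) random sets."""
    nodes = sorted(adj)
    n = len(nodes)
    levels = max(1, n.bit_length())  # roughly log2(n) set sizes
    reps = reps or levels            # repetitions per set size
    coords = {u: [] for u in nodes}
    for j in range(levels):
        size = min(n, 2 ** j)
        for _ in range(reps):
            B = rng.sample(nodes, size)
            d = dist_to_set(adj, B)
            for u in nodes:
                coords[u].append(d[u])
    return coords
```

Each coordinate is 1-Lipschitz with respect to the graph metric, since $|d(u,B) - d(v,B)| \le d(u,v)$ by the triangle inequality; the harder direction of Bourgain's theorem is the lower bound on the embedded distances.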

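Summary of Main Lemma
Chain of embeddings (distortion of each step in parentheses):
  TEMD over $n$ sets $A_i$
  → $\min^{\mathrm{low}} \ell_1^{\mathrm{high}}$ ($O(\log^2 n)$)
  → $\min^{\mathrm{low}} \ell_1^{\mathrm{low}}$ ($O(1)$)
  → $\min^{\mathrm{low}}$ tree-metric ($O(\log n)$)
  → sparse graph-metric ($O(\log^3 n)$)
  → $\ell_1^{\mathrm{low}}$ ($O(\log n)$)
The first three steps are oblivious; the last two are non-oblivious
Min-product helps to get low dimension (~small-size sketch)
  bypasses impossibility of dim-reduction in $\ell_1$
OK that it is not a metric, as long as it is close to a metric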

Conclusion + a question
Theorem: can compute $\mathrm{ed}(x,y)$ in $n \cdot 2^{\tilde{O}(\sqrt{\log n})}$ time with $2^{\tilde{O}(\sqrt{\log n})}$ approximation
Question: can we do the following "oblivious" dimensionality reduction in $\ell_1$?
  Given $n$, construct a randomized embedding $\varphi : \ell_1^M \to \ell_1^{\mathrm{polylog}\, n}$ such that for any $v_1, \dots, v_n \in \ell_1^M$, with high probability, $\varphi$ has distortion $\log^{O(1)} n$ on these vectors?
  If $\varphi$ exists, it cannot be linear [Charikar-Sahai'02]
