On Embedding Edit Distance into L_1
Robert Krauthgamer (Weizmann Institute and IBM Almaden)
Based on joint work (i) with Moses Charikar, (ii) with Yuval Rabani, (iii) with Parikshit Gopalan and T.S. Jayram, (iv) with Alex Andoni.
Edit Distance
For x ∈ Σ^n, y ∈ Σ^m:
  ED(x,y) = minimum number of character insertions, deletions and substitutions that transform x into y [aka Levenshtein distance].
Examples:
  ED(00000, 1111) = 5
  ED(01010, 10101) = 2
Applications: genomics, text processing, web search.
For simplicity, assume m = n.
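The definition above is exactly what the textbook dynamic program computes; here is a minimal sketch (the function name is ours, not from the talk):

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions and substitutions transforming x into y."""
    m, n = len(x), len(y)
    # prev[j] = edit distance between x[:i-1] and y[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # delete x[i-1]
                         cur[j - 1] + 1,      # insert y[j-1]
                         prev[j - 1] + cost)  # substitute (or match)
        prev = cur
    return prev[n]
```

Running it on the slide's examples reproduces ED(00000, 1111) = 5 and ED(01010, 10101) = 2.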
Embedding into L_1
An embedding of (X,d) into l_1 is a map f : X → l_1. It has distortion K ≥ 1 if
  d(x,y) ≤ ||f(x) − f(y)||_1 ≤ K·d(x,y)   for all x,y ∈ X.
A very powerful concept (when the distortion is small).
Goal: embed edit distance into l_1 with small distortion.
Motivation:
  Reduce algorithmic problems to l_1, e.g. Nearest-Neighbor Search.
  Study a simple metric space without a norm, e.g. the Hamming cube with cyclic shifts.
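For a finite metric, the distortion of a concrete map into l_1 can be checked by brute force over all pairs; a small illustrative helper of ours (allowing the best uniform rescaling of f):

```python
from itertools import combinations

def distortion(points, d, f):
    """Smallest K such that, after uniformly rescaling f,
    d(x,y) <= ||f(x)-f(y)||_1 <= K*d(x,y) for all pairs x,y."""
    ratios = []
    for x, y in combinations(points, 2):
        l1 = sum(abs(a - b) for a, b in zip(f(x), f(y)))
        ratios.append(l1 / d(x, y))
    # the best rescaling makes the smallest ratio exactly 1 (non-contracting)
    return max(ratios) / min(ratios)
```

For example, the identity map on a subset of the line has distortion 1, while x ↦ x² stretches far pairs relative to near ones.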
Known Results for Edit Distance
Embedding ({0,1}^n, ED) into L_1:
  Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani'05] (previous bound: O(n^{2/3}) [Bar-Yossef-Jayram-K.-Kumar'04])
  Lower bound: Ω(log n) [K.-Rabani'06] (previous bounds: 3/2 [Andoni-Deza-Gupta-Indyk-Raskhodnikova'03] and (log n)^{1/2−o(1)} [Khot-Naor'05])
A large gap, despite significant effort!
Submetrics (Restricted Strings)
Why focus on submetrics of edit distance?
  May admit smaller distortion
  Partial progress towards the general case
  A framework for analyzing non-worst-case instances; e.g. (à la computational biology) handle only "typical" strings
Class 1: a string is k-non-repetitive if all its k-substrings are distinct.
  A random 0-1 string is WHP (2 log n)-non-repetitive. [Figure: example string with k = 7.]
  Yields a submetric containing a 1 − o(1) fraction of the strings.
Class 2: Ulam metric = edit distance on all permutations (here Σ = {1,…,n}).
  Every permutation is 1-non-repetitive.
  Note: k-non-repetitive strings embed into Ulam with distortion k.
[Theory of Computation Seminar, Computer Science Department]
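Class 1 is easy to test directly: collect all length-k substrings and check for duplicates. A minimal sketch (function name ours):

```python
def is_k_nonrepetitive(s, k: int) -> bool:
    """True iff all length-k (contiguous) substrings of s are distinct."""
    subs = [s[i:i + k] for i in range(len(s) - k + 1)]
    return len(subs) == len(set(subs))
```

As the slide notes, every permutation is 1-non-repetitive (all symbols are distinct), while a string like 0101 already repeats its 2-substrings.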
Known Results for Ulam Metric
Embedding the Ulam metric into L_1:
  Upper bound: O(log n) [Charikar-K.'06] (new proof by [Gopalan-Jayram-K.])
  Lower bound: Ω(log n / loglog n) [Andoni-K.'07] (actually qualitatively stronger)
  Near-tight!
For comparison, embedding ({0,1}^n, ED) into L_1:
  Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani'05]
  Lower bound: Ω(log n) [K.-Rabani'06]
  Large gap.
Embedding of permutations
Theorem [Charikar-K.'06]: The Ulam metric of dimension n embeds into l_1 with distortion O(log n).
Proof. Define f with one coordinate per pair of symbols a ≠ b:
  f_{a,b}(P) = 1 / (P^{-1}(b) − P^{-1}(a)),
the signed reciprocal of the distance between the positions of a and b in P.
Intuition: sign(f_{a,b}(P)) is an indicator for "a appears before b" in P; thus |f_{a,b}(P) − f_{a,b}(Q)| "measures" whether {a,b} is an inversion in P vs. Q.
Claim 1: ||f(P) − f(Q)||_1 ≤ O(log n)·ED(P,Q).
  Suppose Q is obtained from P by moving one symbol, say s; the general case then follows by applying the triangle inequality along P, P', P'', …, Q.
  Total contribution of coordinates with s ∈ {a,b}: 2·Σ_k (1/k) ≤ O(log n).
  Total contribution of the other coordinates: Σ_k k·(1/k − 1/(k+1)) ≤ O(log n).
Embedding of permutations – cont.
Claim 2: ||f(P) − f(Q)||_1 ≥ ½ ED(P,Q).
  Assume wlog that P = identity.
  Edit Q into an increasing sequence (and thus into P) using quicksort: choose a random pivot, delete all characters inverted with respect to the pivot, and repeat recursively on the left and right portions.
  The surviving subsequence is increasing, so ED(P,Q) ≤ 2·#deletions.
  For every inversion (a,b) in Q: Pr[a deleted "by" pivot b] ≤ 1/(|Q^{-1}(a) − Q^{-1}(b)| + 1) ≤ 2·|f_{a,b}(P) − f_{a,b}(Q)|.
  Now argue ||f(P) − f(Q)||_1 ≥ E[#quicksort deletions] ≥ ½ ED(P,Q).
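The two claims can be sanity-checked numerically on small permutations. Below is our reconstruction of the embedding, one coordinate per pair of symbols holding the signed reciprocal of the positions' distance (this coordinate convention is our reading of the slides), checked against brute-force edit distance:

```python
from itertools import combinations, permutations

def embed(P):
    """f_{a,b}(P) = 1/(pos_P(b) - pos_P(a)) for every pair a < b."""
    pos = {s: i for i, s in enumerate(P)}
    return {(a, b): 1.0 / (pos[b] - pos[a])
            for a, b in combinations(sorted(P), 2)}

def l1(f, g):
    """l_1 distance between two embeddings on the same coordinate set."""
    return sum(abs(f[c] - g[c]) for c in f)

def ed(P, Q):
    """Plain dynamic-programming edit distance (insert/delete/substitute)."""
    m, n = len(P), len(Q)
    dp = [[max(i, j) if 0 in (i, j) else 0 for j in range(n + 1)]
          for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i-1][j] + 1, dp[i][j-1] + 1,
                           dp[i-1][j-1] + (P[i-1] != Q[j-1]))
    return dp[m][n]

# Claim 2 (the lower bound) over all permutations of {1..5}:
P = tuple(range(1, 6))
for Q in permutations(P):
    assert l1(embed(P), embed(Q)) >= 0.5 * ed(P, Q) - 1e-9
```

Note how the sign of each coordinate flips exactly on inverted pairs, matching the intuition on the previous slide.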
Lower bound for 0-1 strings
Theorem [K.-Rabani'06]: Embedding ({0,1}^n, ED) into L_1 requires distortion Ω(log n).
Proof sketch: Suppose ED embeds with distortion D ≥ 1, and let V = {0,1}^n.
By the cut-cone characterization of L_1, the embedding f into L_1 can be written as a nonnegative combination of cut metrics:
  ||f(x) − f(y)||_1 = Σ_{A⊆V} λ_A · |1_A(x) − 1_A(y)|.
Hence, for every pair of symmetric probability distributions μ and ν over V × V, averaging over the cuts yields some A ⊆ V with
  E_ν[|1_A(x) − 1_A(y)|] / E_μ[|1_A(x) − 1_A(y)|] ≤ D · E_ν[ED(x,y)] / E_μ[ED(x,y)].   (*)
Lower bound for 0-1 strings – cont.
We choose:
  μ = uniform over V × V
  ν = ½(ν_H + ν_S), where
    ν_H = random point + random bit flip (uniform over E_H = {(x,y): ||x−y||_1 = 1})
    ν_S = random point + a cyclic shift (uniform over E_S = {(x, S(x))})
The RHS of (*) evaluates to O(D/n) by a counting argument.
Main Lemma: For all A ⊆ V, the LHS of (*) is Ω(log n)/n.
Together these give D ≥ Ω(log n). The lemma is proved via analysis of Boolean functions on the hypercube.
Lower bound for 0-1 strings – cont.
Recall ν = ½(ν_H + ν_S), where ν_H = random point + random bit flip and ν_S = random point + a cyclic shift.
Lemma: For all A ⊆ V, the LHS of (*) is Ω(log n)/n.
Proof sketch: Assume the contrary, and define f = 1_A.
Lower bound for 0-1 strings – cont.
Claim: I_j ≥ 1/n^{1/8} ⇒ I_{j+1} ≥ 1/(2n^{1/8})   (I_j = influence of coordinate j on f).
Proof: the following diagram commutes, i.e. flipping bit j and then applying the cyclic shift S is the same as applying S and then flipping bit j+1:
  x → x + e_j → S(x + e_j)      (flip bit j, then cyclic shift)
  x → S(x) → S(x) + e_{j+1}     (cyclic shift, then flip bit j+1)
and indeed S(x + e_j) = S(x) + e_{j+1}.
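For small n, both the influences I_j and the commuting identity are easy to verify exhaustively; an illustrative sketch of ours, not code from the talk:

```python
from itertools import product

def influence(f, n, j):
    """I_j = Pr_x[f(x) != f(x + e_j)] over the uniform hypercube {0,1}^n."""
    total = 0
    for x in product((0, 1), repeat=n):
        y = list(x)
        y[j] ^= 1  # flip bit j
        total += f(x) != f(tuple(y))
    return total / 2 ** n

def cyclic_shift(x):
    """Rotate one position: coordinate j moves to coordinate j+1."""
    return x[-1:] + x[:-1]

# Commuting diagram: flip bit 1 then shift == shift then flip bit 2.
x = (0, 1, 0, 1)
assert cyclic_shift((0, 0, 0, 1)) == (1, 0, 0, 0)  # S(x + e_1) = S(x) + e_2
```

A dictator function f(x) = x_0 has I_0 = 1 and I_j = 0 otherwise, while parity has every I_j = 1.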
Communication Complexity Approach
Distance Estimation Problem: Alice holds x ∈ Σ^n, Bob holds y ∈ Σ^n; decide whether d(x,y) ≥ R or d(x,y) ≤ R/A.
Communication-complexity model:
  Two-party protocol with shared randomness
  Promise (gap) version; A = approximation factor
  CC_A = minimum number of bits to decide whp
Previous communication lower bounds: l_∞ [Saks-Sun'02, Bar-Yossef-Jayram-Kumar-Sivakumar'04], l_1 [Woodruff'04], Earthmover [Andoni-Indyk-K.'07].
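For Hamming distance, the natural one-bit protocol in this model is a shared random parity, in the spirit of [Kushilevitz-Ostrovsky-Rabani'98] (the functions below are our own sketch, not the talk's):

```python
import random

def parity_sketch(x, seed, p):
    """One-bit sketch: parity of x restricted to a shared random subset,
    where each coordinate is included independently with probability p."""
    rng = random.Random(seed)  # 'seed' plays the role of shared randomness
    mask = [rng.random() < p for _ in x]
    return sum(b for b, m in zip(x, mask) if m) % 2

def disagreement_prob(d, p):
    """Exact Pr[the two sketches differ] for a pair at Hamming distance d:
    they differ iff the random subset hits the d differing coordinates
    an odd number of times."""
    return (1 - (1 - 2 * p) ** d) / 2
```

With p = 1/R, pairs at distance ≥ R disagree noticeably more often than pairs at distance ≤ R/A, so comparing O(1) such bits separates the gap; driving the gap down to 1+ε costs Θ(1/ε²) bits, matching the bound quoted on the next slide.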
Communication Bounds for Edit Distance
Theorem [Andoni-K.'07]: a tradeoff between approximation and communication for edit distance.
For comparison, for Hamming distance: CC_{1+ε} = Θ(1/ε²) [Kushilevitz-Ostrovsky-Rabani'98], [Woodruff'04].
This is the first computational model where edit distance is provably harder than Hamming!
Corollary 1: approximation A = O(1) requires CC_A ≥ Ω(loglog n).
Corollary 2: communication CC_A = O(1) requires A ≥ Ω*(log n).
Implications for embeddings:
  Embedding ED into L_1 (or squared-L_2) requires distortion Ω*(log n).
  Furthermore, this holds both for 0-1 strings and for permutations (Ulam).
Proof Outline
Step 1 [Yao's minimax theorem]: reduce to distributional complexity. If CC_A ≤ k then, for every two distributions μ_far, μ_close, there is a k-bit deterministic protocol with success probability ≥ 2/3.
Step 2 [Andoni-Indyk-K.'07]: reduce to 1-bit protocols. Further to the above, there are Boolean functions s_A, s_B : Σ^n → {0,1} with advantage
  Pr_{(x,y)~μ_far}[s_A(x) ≠ s_B(y)] − Pr_{(x,y)~μ_close}[s_A(x) ≠ s_B(y)] ≥ Ω(2^{−k}).
Step 3 [Fourier expansion]: reduce to one Fourier level. Furthermore, s_A, s_B depend only on ℓ fixed positions j_1, …, j_ℓ.
Step 4 [Choose distributions]: analyze (x,y) projected on these positions.
  Let μ_close, μ_far include ε-noise (handles a high Fourier level).
  Let μ_close, μ_far include (few/more) block rotations (handles a low Fourier level).
  Key property: the distribution of (x_{j_1}, …, x_{j_ℓ}, y_{j_1}, …, y_{j_ℓ}) is "statistically close" under μ_far vs. under μ_close.
Step 5: reduce Ulam to {0,1}^n; a random mapping Σ → {0,1} works.
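Step 3 rests on the Fourier expansion of Boolean functions over the hypercube; for small n the coefficients can be computed by brute force (an illustrative sketch of ours, not code from the talk):

```python
from itertools import product, combinations

def fourier_coefficients(f, n):
    """Fourier coefficients of f : {0,1}^n -> {-1,+1} with respect to the
    characters chi_S(x) = (-1)^{sum_{i in S} x_i}; the 'level' of a
    coefficient is |S|."""
    points = list(product((0, 1), repeat=n))
    coeffs = {}
    for r in range(n + 1):
        for S in combinations(range(n), r):
            coeffs[S] = sum(f(x) * (-1) ** sum(x[i] for i in S)
                            for x in points) / 2 ** n
    return coeffs
```

For example, the 2-bit parity is concentrated on the single top-level coefficient, and by Parseval the squared coefficients of any Boolean function sum to 1.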
Summary of Known Results
Embedding the Ulam metric into L_1:
  Upper bound: O(log n) [Charikar-K.'06] (new proof by [Gopalan-Jayram-K.])
  Lower bound: Ω(log n / loglog n) [Andoni-K.'07] (qualitatively much stronger)
Embedding ({0,1}^n, ED) into L_1:
  Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani'05]
  Lower bound: Ω(log n) [K.-Rabani'06]
Concluding Remarks
The computational lens: study distance-estimation problems rather than embeddings.
Open problems:
  Still a large gap for 0-1 strings
  Variants of edit distance (e.g. edit distance with block moves)
  Rule out other algorithms (e.g. a "CC model" capturing Indyk's NNS for l_∞)
Recent progress:
  Bypass L_1-embedding by devising new techniques, e.g. using the max (l_∞) product for NNS under the Ulam metric [Andoni-Indyk-K.]
  Analyze/design "good" heuristics, e.g. smoothed analysis [Andoni-K.]