
1 On Embedding Edit Distance into L₁
Robert Krauthgamer (Weizmann Institute and IBM Almaden)
Based on joint work (i) with Moses Charikar, (ii) with Yuval Rabani, (iii) with Parikshit Gopalan and T.S. Jayram, (iv) with Alex Andoni.

2 Edit Distance
For x ∈ Σⁿ, y ∈ Σᵐ, ED(x,y) = the minimum number of character insertions, deletions, and substitutions that transform x into y [aka Levenshtein distance].
Examples: ED(00000, 1111) = 5; ED(01010, 10101) = 2.
Applications: genomics, text processing, web search.
For simplicity: m = n.
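For concreteness, the definition above translates directly into the textbook dynamic program; a minimal sketch (the function name is ours):

```python
def edit_distance(x: str, y: str) -> int:
    """Classic dynamic program for the Levenshtein distance: the minimum
    number of insertions, deletions, and substitutions transforming x
    into y. Runs in O(|x|*|y|) time and O(|y|) space."""
    m, n = len(x), len(y)
    # dp[j] holds the distance between x[:i] and y[:j] for the current i.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, n + 1):
            prev_diag, dp[j] = dp[j], min(
                dp[j] + 1,                           # delete x[i-1]
                dp[j - 1] + 1,                       # insert y[j-1]
                prev_diag + (x[i - 1] != y[j - 1]),  # substitute (free on match)
            )
    return dp[n]
```

Both example values from the slide are reproduced by this recurrence.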

3 Embedding into L₁
An embedding of (X,d) into ℓ₁ is a map f : X → ℓ₁. It has distortion K ≥ 1 if d(x,y) ≤ ‖f(x) − f(y)‖₁ ≤ K·d(x,y) for all x,y ∈ X.
Very powerful concept (when the distortion is small).
Goal: embed edit distance into ℓ₁ with small distortion.
Motivation:
- Reduce algorithmic problems to ℓ₁, e.g. nearest-neighbor search.
- Study a simple metric space without a norm, e.g. the Hamming cube with cyclic shifts.
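On a finite metric, the distortion just defined can be measured by brute force: after the best uniform rescaling of f, the distortion is the ratio between the largest and smallest expansion over all pairs. A small sketch (helper names are ours):

```python
from itertools import combinations

def l1_distortion(points, dist, embed):
    """Smallest K such that, after optimally rescaling embed, we have
    dist(x,y) <= ||embed(x)-embed(y)||_1 <= K*dist(x,y) for all pairs.
    This equals (max expansion ratio) / (min expansion ratio)."""
    ratios = []
    for x, y in combinations(points, 2):
        d = dist(x, y)
        e = sum(abs(a - b) for a, b in zip(embed(x), embed(y)))
        ratios.append(e / d)
    return max(ratios) / min(ratios)
```

For example, the Hamming cube mapped coordinate-wise into R^n is an isometry, so the computed distortion is exactly 1.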

4 Known Results for Edit Distance
Embed ({0,1}ⁿ, ED) into L₁:
- Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani'05] (previous bound: O(n^{2/3}) [Bar-Yossef-Jayram-K.-Kumar'04])
- Lower bound: Ω(log n) [K.-Rabani'06] (previous bounds: (log n)^{1/2−o(1)} [Khot-Naor'05] and 3/2 [Andoni-Deza-Gupta-Indyk-Raskhodnikova'03])
Large gap... despite significant effort!

5 Submetrics (Restricted Strings)
Why focus on submetrics of edit distance?
- May admit smaller distortion
- Partial progress towards the general case
- A framework for analyzing non-worst-case instances
Example (à la computational biology): handle only "typical" strings.
Class 1:
- A string is k-non-repetitive if all its k-substrings are distinct.
- A random 0-1 string is WHP (2 log n)-non-repetitive; this yields a submetric containing a 1−o(1) fraction of the strings.
Class 2:
- Ulam metric = edit distance on all permutations (here Σ = {1,…,n}).
- Every permutation is 1-non-repetitive.
- Note: k-non-repetitive strings embed into Ulam with distortion k.
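The k-non-repetitive condition is easy to test directly; a minimal sketch (function name ours):

```python
def is_k_nonrepetitive(s, k: int) -> bool:
    """A string (or any sequence) is k-non-repetitive if all of its
    contiguous length-k substrings are pairwise distinct."""
    subs = [tuple(s[i:i + k]) for i in range(len(s) - k + 1)]
    return len(subs) == len(set(subs))
```

In particular, any permutation passes the test with k = 1, matching the observation above about the Ulam metric.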

6 Known Results for Ulam Metric
Embed the Ulam metric into L₁:
- Upper bound: O(log n) [Charikar-K.'06] (new proof by [Gopalan-Jayram-K.])
- Lower bound: Ω(log n / log log n) [Andoni-K.'07] (actually qualitatively stronger)
Near-tight! Compare with embedding ({0,1}ⁿ, ED) into L₁ (upper bound 2^{O(√log n)} [Ostrovsky-Rabani'05], lower bound Ω(log n) [K.-Rabani'06]), where a large gap remains.

7 Embedding of Permutations
Theorem [Charikar-K.'06]: The Ulam metric of dimension n embeds into ℓ₁ with distortion O(log n).
Proof. Define f with one coordinate per pair of distinct symbols a, b:
  f_{a,b}(P) = 1 / (P⁻¹(b) − P⁻¹(a)),
where P⁻¹(s) denotes the position of symbol s in P.
Claim 1: ‖f(P) − f(Q)‖₁ ≤ O(log n)·ED(P,Q).
- Suppose Q is obtained from P by moving one symbol, say 's'; the general case then follows by applying the triangle inequality to P, P', P'', …, Q.
- The total contribution of the coordinates with s ∈ {a,b} is 2·Σ_k (1/k) ≤ O(log n); that of the other coordinates is Σ_k k·(1/k − 1/(k+1)) ≤ O(log n).
Intuition: sign(f_{a,b}(P)) is an indicator for "a appears before b" in P; thus |f_{a,b}(P) − f_{a,b}(Q)| "measures" whether {a,b} is an inversion in P vs. Q.
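A sketch of one embedding consistent with the analysis above (all names are ours, and the precise normalization in [Charikar-K.'06] may differ): one coordinate per unordered pair of symbols, holding the reciprocal of the displacement between their positions, so that the sign records which symbol comes first.

```python
from itertools import combinations

def ck_embedding(P):
    """Map a permutation P (a sequence of distinct symbols) to a vector
    with one coordinate per pair {a, b}: the value 1/(pos(b) - pos(a))
    is positive exactly when a precedes b in P."""
    pos = {s: i for i, s in enumerate(P)}
    return {
        (a, b): 1.0 / (pos[b] - pos[a])
        for a, b in combinations(sorted(pos), 2)
    }

def l1_diff(f, g):
    """L1 distance between two embedded permutations."""
    return sum(abs(f[c] - g[c]) for c in f)
```

For the swap (1,2,3) vs. (2,1,3), the inverted pair {1,2} contributes 2 and each pair involving a displaced symbol contributes 1/2, so the L1 difference is 3.0, comfortably above ½·ED = 1.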

8 Embedding of Permutations (cont.)
Claim 2: ‖f(P) − f(Q)‖₁ ≥ ½·ED(P,Q).
- Assume wlog that P = identity.
- Edit Q into an increasing sequence (and thus into P) using quicksort: choose a random pivot, delete all characters inverted with respect to the pivot, and repeat recursively on the left and right portions.
- The surviving subsequence is increasing, so ED(P,Q) ≤ 2·(#deletions).
- Now argue ‖f(P) − f(Q)‖₁ ≥ E[#quicksort deletions] ≥ ½·ED(P,Q): for every inversion (a,b) in Q, Pr[a deleted "by" pivot b] ≤ 1/(|Q⁻¹(a) − Q⁻¹(b)| + 1) ≤ 2·|f_{a,b}(P) − f_{a,b}(Q)|.
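The deletion scheme in Claim 2 also yields, in the extreme, the classical connection to longest increasing subsequences: when P is the identity, deleting everything outside an LIS of Q and re-inserting the deleted symbols shows ED(P,Q) ≤ 2·(n − LIS(Q)). A sketch (helper names ours):

```python
from bisect import bisect_left

def lis_length(seq):
    """Patience-sorting computation of the longest increasing subsequence.
    tails[k] = smallest possible tail of an increasing subsequence of
    length k+1 seen so far."""
    tails = []
    for v in seq:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(tails)

def ulam_upper_bound(P, Q):
    """ED(P,Q) <= 2*(n - LIS), where Q is relabeled by P's order:
    delete every symbol outside a longest increasing subsequence,
    then re-insert the deleted symbols in their P-positions."""
    rank = {s: i for i, s in enumerate(P)}
    return 2 * (len(P) - lis_length([rank[s] for s in Q]))
```

Moving one symbol to the other end, for instance, gives the bound 2, matching one deletion plus one insertion.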

9 Lower Bound for 0-1 Strings
Theorem [K.-Rabani'06]: Embedding ({0,1}ⁿ, ED) into L₁ requires distortion Ω(log n).
Proof sketch: Suppose ED embeds with distortion D ≥ 1, and let V = {0,1}ⁿ. By the cut-cone characterization of L₁, the embedding f into L₁ can be written as a nonnegative combination of cut metrics:
  ‖f(x) − f(y)‖₁ = Σ_{A⊆V} λ_A · |1_A(x) − 1_A(y)|,  with all λ_A ≥ 0.
Hence, for every pair of symmetric probability distributions μ and ν over V × V,
  (*)  min_{A⊆V} E_ν[|1_A(x) − 1_A(y)|] / E_μ[|1_A(x) − 1_A(y)|] ≤ E_ν[‖f(x) − f(y)‖₁] / E_μ[‖f(x) − f(y)‖₁] ≤ D · E_ν[ED(x,y)] / E_μ[ED(x,y)].
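The cut-cone fact invoked here says that every L₁ metric is a nonnegative combination of cut metrics. For a single coordinate (points on a line) the decomposition is explicit: one threshold cut per gap between consecutive values, weighted by the width of the gap. A sketch of that one-dimensional case (names ours):

```python
def line_cut_decomposition(values):
    """Return [(weight, cut_set)] so that for all x, y in values,
    |x - y| = sum of weight * |1_A(x) - 1_A(y)| over the cuts A.
    Each cut A keeps the points at or below a threshold; its weight
    is the gap to the next larger point."""
    pts = sorted(set(values))
    return [
        (hi - lo, {v for v in values if v <= lo})
        for lo, hi in zip(pts, pts[1:])
    ]

def cut_distance(cuts, x, y):
    """Evaluate the cut-metric combination at the pair (x, y)."""
    return sum(w * abs((x in A) - (y in A)) for w, A in cuts)
```

A d-dimensional L₁ metric decomposes the same way, coordinate by coordinate, which is exactly the representation the proof sums over.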

10 Lower Bound for 0-1 Strings (cont.)
We choose:
- μ = uniform over V × V.
- ν = ½(ν_H + ν_S), where ν_H = a random point plus a random bit flip (uniform over E_H = {(x,y): ‖x − y‖₁ = 1}) and ν_S = a random point plus a cyclic shift (uniform over E_S = {(x, S(x))}).
The RHS of (*) evaluates to O(D/n) by a counting argument.
Main Lemma: For all A ⊆ V, the LHS of (*) is Ω(log n)/n.
- Proved via analysis of Boolean functions on the hypercube.

11 Lower Bound for 0-1 Strings (cont.)
Recall ν = ½(ν_H + ν_S), where ν_H = a random point plus a random bit flip and ν_S = a random point plus a cyclic shift.
Lemma: For all A ⊆ V, the LHS of (*) is Ω(log n)/n.
Proof sketch: Assume the contrary, and define f = 1_A.

12 Lower Bound for 0-1 Strings (cont.)
Claim: I_j ≥ 1/n^{1/8} ⇒ I_{j+1} ≥ 1/(2n^{1/8}).
Proof: a commuting diagram. Starting from x, flipping bit j and then applying the cyclic shift S lands at the same point as applying S and then flipping bit j+1:
  S(x + e_j) = S(x) + e_{j+1}.
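The commuting square can be verified mechanically. Assuming S is the cyclic shift by one position to the right, so that bit j moves to slot (j+1) mod n (the direction convention is ours), flipping bit j before shifting agrees with flipping bit j+1 after shifting:

```python
def shift(x: str) -> str:
    """Cyclic shift right by one: the bit at position j moves to
    position (j + 1) mod n."""
    return x[-1:] + x[:-1]

def flip(x: str, j: int) -> str:
    """Flip bit j of the 0-1 string x, i.e. x + e_j over GF(2)."""
    return x[:j] + str(1 - int(x[j])) + x[j + 1:]
```

Exhaustively checking the identity S(x + e_j) = S(x) + e_{j+1} on a few strings confirms the square commutes for every j.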

13 Communication Complexity Approach
Distance Estimation Problem: Alice holds x ∈ Σⁿ and Bob holds y ∈ Σⁿ; using shared randomness and CC_A bits of communication, decide whether d(x,y) ≥ R or d(x,y) ≤ R/A.
Communication complexity model:
- Two-party protocol with shared randomness
- Promise (gap) version; A = approximation factor
- CC_A = minimum number of bits needed to decide whp
Previous communication lower bounds: ℓ∞ [Saks-Sun'02, Bar-Yossef-Jayram-Kumar-Shivakumar'04], ℓ₁ [Woodruff'04], Earthmover [Andoni-Indyk-K.'07].
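For intuition only (and for Hamming rather than edit distance): the simplest shared-randomness protocol for such a gap problem samples random positions and compares them, needing about n/R samples for constant A. A toy sketch, with all names and the threshold choice ours:

```python
import random

def gap_hamming_protocol(x, y, R, A, t, rng):
    """Toy protocol: both parties draw t random positions from shared
    randomness (rng); Alice sends her t bits, Bob counts mismatches.
    Answers 'far' if the sampled mismatch count exceeds the midpoint
    between the rates implied by d >= R and d <= R/A."""
    n = len(x)
    idx = [rng.randrange(n) for _ in range(t)]   # shared random positions
    mismatches = sum(x[i] != y[i] for i in idx)  # Bob compares Alice's bits
    threshold = t * (R + R / A) / (2 * n)
    return "far" if mismatches > threshold else "close"
```

Here the communication is exactly t bits, illustrating the tradeoff between the gap parameters and the number of bits sent; the results above show edit distance does not admit anything this cheap.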

14 Communication Bounds for Edit Distance
Theorem [Andoni-K.'07]: Distance estimation for edit distance exhibits a tradeoff between the approximation factor and the communication; in particular:
- Corollary 1: Approximation A = O(1) requires CC_A ≥ Ω(log log n).
- Corollary 2: Communication CC_A = O(1) requires A ≥ Ω*(log n).
For Hamming distance, in contrast, CC_{1+ε} = Θ(1/ε²) [Kushilevitz-Ostrovsky-Rabani'98], [Woodruff'04]. This is the first computational model where edit distance is provably harder than Hamming!
Implications for embeddings: embedding ED into L₁ (or squared-L₂) requires distortion Ω*(log n); furthermore, this holds both for 0-1 strings and for permutations (Ulam).

15 Proof Outline
Step 1 [Yao's minimax theorem]: Reduce to distributional complexity.
- If CC_A ≤ k, then for every two distributions μ_far, μ_close there is a k-bit deterministic protocol with success probability ≥ 2/3.
Step 2 [Andoni-Indyk-K.'07]: Reduce to 1-bit protocols.
- Further to the above, there are Boolean functions s_A, s_B : Σⁿ → {0,1} with advantage Pr_{(x,y)∈μ_far}[s_A(x) ≠ s_B(y)] − Pr_{(x,y)∈μ_close}[s_A(x) ≠ s_B(y)] ≥ Ω(2^{−k}).
Step 3 [Fourier expansion]: Reduce to one Fourier level.
- Furthermore, s_A, s_B depend only on fixed positions j₁,…,j_ℓ.
Step 4 [Choose distributions]: Analyze (x,y) projected on these positions.
- Let μ_close, μ_far include ε-noise, to handle a high level; let μ_close, μ_far include (few/more) block rotations, to handle a low level.
- Key property: the distribution of (x_{j₁},…,x_{j_ℓ}, y_{j₁},…,y_{j_ℓ}) is "statistically close" under μ_far vs. under μ_close.
Step 5: Reduce Ulam to {0,1}ⁿ.
- A random mapping Σ → {0,1} works.

16 Summary of Known Results
Embed the Ulam metric into L₁:
- Upper bound: O(log n) [Charikar-K.'06] (new proof by [Gopalan-Jayram-K.])
- Lower bound: Ω(log n / log log n) [Andoni-K.'07] (qualitatively much stronger)
Embed ({0,1}ⁿ, ED) into L₁:
- Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani'05]
- Lower bound: Ω(log n) [K.-Rabani'06]

17 Concluding Remarks
The computational lens: study distance-estimation problems rather than embeddings.
Open problems:
- Still a large gap for 0-1 strings
- Variants of edit distance (e.g. edit distance with block moves)
- Rule out other algorithms (e.g. a "CC model" capturing Indyk's NNS for ℓ∞)
Recent progress:
- Bypass L₁-embedding by devising new techniques, e.g. using the max (ℓ∞) product for NNS under the Ulam metric [Andoni-Indyk-K.]
- Analyze/design "good" heuristics, e.g. smoothed analysis [Andoni-K.]

