On Embedding Edit Distance into L_1


Robert Krauthgamer (Weizmann Institute and IBM Almaden)
Based on joint work (i) with Moses Charikar, (ii) with Yuval Rabani, (iii) with Parikshit Gopalan and T.S. Jayram, (iv) with Alex Andoni

Slide 2. Edit Distance
x ∈ Σ^n, y ∈ Σ^m
ED(x,y) = minimum number of character insertions, deletions and substitutions that transform x to y [aka Levenshtein distance]
Examples:
- ED(00000, 1111) = 5
- ED(01010, 10101) = 2
Applications:
- Genomics
- Text processing
- Web search
For simplicity: m = n.
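
The definition above can be checked on the slide's two examples with the textbook dynamic program. This is a minimal sketch, not something taken from the talk; the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(x, y):
    """Levenshtein distance via the standard O(|x|*|y|) dynamic program."""
    # prev[j] = distance between the current prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("00000", "1111"))   # 5
print(edit_distance("01010", "10101"))  # 2
```

The second example shows why edit distance differs from Hamming distance: the strings disagree in every position, yet one deletion at the front and one insertion at the end suffice.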

Slide 3. Embedding into L_1
An embedding of (X,d) into ℓ_1 is a map f : X → ℓ_1.
- It has distortion K ≥ 1 if d(x,y) ≤ ‖f(x)−f(y)‖_1 ≤ K·d(x,y) for all x,y ∈ X.
Very powerful concept (when distortion is small).
Goal: Embed edit distance into ℓ_1 with small distortion.
Motivation:
- Reduce algorithmic problems to ℓ_1, e.g. Nearest-Neighbor Search
- Study a simple metric space without a norm, e.g. Hamming cube w/ cyclic shifts
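
For a finite point set, the distortion of a given map can be measured by brute force. The sketch below is an illustration under one stated assumption: it allows the map to be rescaled, so it reports the ratio between the largest and smallest stretch over all pairs, which equals the smallest K in the definition above after scaling. The names `distortion`, `hamming` and `l1_distance` are hypothetical helpers.

```python
from itertools import combinations, product


def l1_distance(u, v):
    """The l_1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def hamming(x, y):
    """Number of positions where x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def distortion(points, d, f):
    """Largest stretch divided by smallest stretch of f over all pairs.

    Assumption: f may be rescaled, so only the ratio of stretches matters.
    """
    ratios = [l1_distance(f(x), f(y)) / d(x, y)
              for x, y in combinations(points, 2)]
    lo, hi = min(ratios), max(ratios)
    return float("inf") if lo == 0 else hi / lo


cube = list(product((0, 1), repeat=3))
# The Hamming cube sits inside l_1 isometrically.
print(distortion(cube, hamming, lambda x: x))  # 1.0
# Doubling one coordinate stretches some pairs twice as much as others.
print(distortion(cube, hamming, lambda x: (x[0], x[1], 2 * x[2])))  # 2.0
```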

Slide 4. Known Results for Edit Distance
Embed ({0,1}^n, ED) into L_1:
- Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani’05]
  Previous bound: O(n^{2/3}) [Bar-Yossef-Jayram-K.-Kumar’04]
- Lower bound: Ω(log n) [K.-Rabani’06]
  Previous bounds: (log n)^{1/2−o(1)} [Khot-Naor’05] and 3/2 [Andoni-Deza-Gupta-Indyk-Raskhodnikova’03]
Large gap, despite significant effort!

Slide 5. Submetrics (Restricted Strings)
Why focus on submetrics of edit distance?
- May admit smaller distortion
- Partial progress towards the general case
- A framework for analyzing non-worst-case instances
Example (a la computational biology): Handle only “typical” strings
Class 1:
- A string is k-non-repetitive if all its k-substrings are distinct
- A random 0-1 string is WHP (2 log n)-non-repetitive
Yields a submetric containing a 1−o(1) fraction of the strings
Class 2:
- Ulam metric = edit distance on all permutations (here Σ = {1,…,n})
- Every permutation is 1-non-repetitive
- Note: k-non-repetitive strings embed into Ulam with distortion k.
[Figure: sample text “Theory of Computation Seminar, Computer Science Department” with k=7]
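
The k-non-repetitive condition is easy to test directly from its definition. The following is a small sketch; the function name `is_k_non_repetitive` is an illustrative choice, and the helper accepts either a string or a sequence of symbols.

```python
def is_k_non_repetitive(s, k):
    """True if all length-k substrings (contiguous windows) of s are distinct."""
    seen = set()
    for i in range(len(s) - k + 1):
        window = tuple(s[i:i + k])
        if window in seen:
            return False
        seen.add(window)
    return True


print(is_k_non_repetitive([3, 1, 4, 2], 1))  # True: a permutation repeats no symbol
print(is_k_non_repetitive("0101", 2))        # False: "01" occurs twice
print(is_k_non_repetitive("0011", 2))        # True: windows are 00, 01, 11
```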

Slide 6. Known Results for Ulam Metric
Embed Ulam metric into L_1 (near-tight!):
- Upper bound: O(log n) [Charikar-K.’06] (new proof by [Gopalan-Jayram-K.])
- Lower bound: Ω(log n / loglog n) [Andoni-K.’07] (actually qualitatively stronger)
Embed ({0,1}^n, ED) into L_1 (large gap):
- Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani’05]
- Lower bound: Ω(log n) [K.-Rabani’06]

Slide 7. Embedding of permutations
Theorem [Charikar-K.’06]: The Ulam metric of dimension n embeds into ℓ_1 with distortion O(log n).
Proof. Define f(P) with one coordinate f_{a,b}(P) per pair of symbols {a,b} [defining formulas not preserved in the transcript].
Claim 1: ‖f(P)−f(Q)‖_1 ≤ O(log n)·ED(P,Q)
- Suppose Q is obtained from P by moving one symbol, say ‘s’.
- The general case then follows by applying the triangle inequality on P, P’, P’’, …, Q.
- Total contribution of
  coordinates with s ∈ {a,b} is 2 Σ_k (1/k) ≤ O(log n);
  other coordinates is Σ_k k(1/k − 1/(k+1)) ≤ O(log n).
Intuition: sign(f_{a,b}(P)) is an indicator for “a appears before b” in P.
Thus, |f_{a,b}(P) − f_{a,b}(Q)| “measures” whether {a,b} is an inversion in P vs. Q.
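
The defining formula is missing from this transcript. The sketch below assumes the coordinates are f_{a,b}(P) = 1/(P^{-1}[b] − P^{-1}[a]) for each pair a < b, where P^{-1}[s] is the position of symbol s in P. That form is an assumption: it is consistent with the sign intuition and the 1/k sums on the slide, but it is not confirmed by the slide text. The names `ulam_embedding` and `l1_distance` are hypothetical.

```python
from itertools import combinations


def ulam_embedding(perm):
    """Map a permutation to coordinates indexed by pairs a < b.

    Assumed form (not confirmed by the transcript):
    f_{a,b}(P) = 1 / (pos[b] - pos[a]), with pos[s] the position of s in P.
    """
    pos = {sym: i for i, sym in enumerate(perm)}
    return {(a, b): 1.0 / (pos[b] - pos[a])
            for a, b in combinations(sorted(perm), 2)}


def l1_distance(fp, fq):
    """The l_1 distance between two embeddings over the same coordinate set."""
    return sum(abs(fp[key] - fq[key]) for key in fp)


P = (1, 2, 3, 4)
Q = (2, 3, 4, 1)  # P with the symbol 1 moved to the end
fP, fQ = ulam_embedding(P), ulam_embedding(Q)

# The sign records the relative order of the pair.
print(fP[(1, 2)] > 0, fQ[(1, 2)] < 0)
# Only coordinates involving the moved symbol change in this example.
print(l1_distance(fP, fQ))
```

In this example the pairs (2,3), (2,4) and (3,4) keep their relative offsets, so the whole distance comes from the three coordinates that involve the moved symbol.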

Slide 8. Embedding of permutations
Theorem [Charikar-K.’06]: The Ulam metric of dimension n embeds into ℓ_1 with distortion O(log n).
Claim 1: ‖f(P)−f(Q)‖_1 ≤ O(log n)·ED(P,Q)
Claim 2: ‖f(P)−f(Q)‖_1 ≥ ½·ED(P,Q)
- Assume wlog that P = identity.
- Edit Q into an increasing sequence (thus into P) using quicksort:
  Choose a random pivot,
  delete all characters inverted w.r.t. the pivot,
  repeat recursively on the left and right portions.
- Now argue ‖f(P)−f(Q)‖_1 ≥ E[#quicksort deletions] ≥ ½·ED(P,Q):
  The surviving subsequence is increasing ⟹ ED(P,Q) ≤ 2·#deletions.
  For every inversion (a,b) in Q: Pr[a deleted “by” pivot b] ≤ 1/|Q^{-1}[a] − Q^{-1}[b] + 1| ≤ 2·|f_{a,b}(P) − f_{a,b}(Q)|
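
The quicksort editing procedure in Claim 2 can be simulated directly. This is a sketch of one reading of the slide, assuming P is the identity so that “inverted w.r.t. the pivot” means a larger symbol to the pivot's left or a smaller symbol to its right. The function name `quicksort_deletions` is hypothetical.

```python
import random


def quicksort_deletions(seq, rng):
    """Return the symbols deleted by the random-pivot procedure on seq."""
    if len(seq) <= 1:
        return []
    i = rng.randrange(len(seq))  # position of the random pivot
    pivot = seq[i]
    # Survivors: smaller symbols on the left, larger symbols on the right.
    left = [s for s in seq[:i] if s < pivot]
    right = [s for s in seq[i + 1:] if s > pivot]
    # Everything else is inverted with respect to the pivot and is deleted.
    deleted = ([s for s in seq[:i] if s > pivot]
               + [s for s in seq[i + 1:] if s < pivot])
    return (deleted
            + quicksort_deletions(left, rng)
            + quicksort_deletions(right, rng))


Q = [3, 1, 4, 5, 2, 6]
deleted = set(quicksort_deletions(Q, random.Random(0)))
survivors = [s for s in Q if s not in deleted]
print(survivors == sorted(survivors))  # True
```

Whatever pivots are drawn, the survivors form an increasing subsequence, which is the fact behind ED(P,Q) ≤ 2·#deletions: each deleted symbol can be reinserted in its sorted place with one more edit.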

Slide 9. Lower bound for 0-1 strings
Theorem [K.-Rabani’06]: Embedding of ({0,1}^n, ED) into L_1 requires distortion Ω(log n).
Proof sketch:
Suppose f embeds with distortion D ≥ 1, and let V = {0,1}^n.
By the cut-cone characterization of L_1:
- For every pair of symmetric probability distributions σ and τ over V × V, [inequality (*) not preserved in the transcript]
The embedding f into L_1 can be written as [formula not preserved in the transcript]
Hence, [formula not preserved in the transcript]

Slide 10. Lower bound for 0-1 strings
Theorem [K.-Rabani’06]: Embedding of ({0,1}^n, ED) into L_1 requires distortion Ω(log n).
Proof sketch:
Suppose f embeds with distortion D ≥ 1, and let V = {0,1}^n.
By the cut-cone characterization of L_1:
- For every pair of symmetric probability distributions σ and τ over V × V, [inequality (*) not preserved in the transcript]
We choose:
- τ = uniform over V × V
- σ = ½(σ_H + σ_S), where
  σ_H = random point + random bit flip (uniform over E_H = {(x,y): ‖x−y‖_1 = 1})
  σ_S = random point + a cyclic shift (uniform over E_S = {(x,S(x))})
The RHS of (*) evaluates to O(D/n) by a counting argument.
Main Lemma: For all A ⊆ V, the LHS of (*) is Ω(log n)/n.
- Analysis of Boolean functions on the hypercube

Slide 11. Lower bound for 0-1 strings (cont.)
Recall σ = ½(σ_H + σ_S), where
- σ_H = random point + random bit flip
- σ_S = random point + a cyclic shift
Lemma: For all A ⊆ V, the LHS of (*) is [bound not preserved in the transcript]
Proof sketch:
- Assume to the contrary, and define f = 1_A.

Slide 12. Lower bound for 0-1 strings (cont.)
Claim: I_j ≥ 1/n^{1/8} ⟹ I_{j+1} ≥ 1/(2n^{1/8})
Proof:
[Diagram: from x, flipping bit j gives x+e_j and a cyclic shift then gives S(x+e_j); alternatively a cyclic shift gives S(x) and flipping bit j+1 gives S(x)+e_{j+1}. The two paths meet: S(x+e_j) = S(x)+e_{j+1}.]
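
The identity S(x+e_j) = S(x)+e_{j+1} from the diagram can be verified exhaustively for small n. The sketch assumes S moves the bit at position j to position j+1 (mod n) and uses 0-indexed positions; the helper names `flip` and `shift` are illustrative.

```python
from itertools import product


def flip(x, j):
    """x + e_j: flip bit j (0-indexed) of the tuple x."""
    return x[:j] + (1 - x[j],) + x[j + 1:]


def shift(x):
    """Cyclic shift S: the bit at position j moves to position j+1 (mod n)."""
    return x[-1:] + x[:-1]


n = 5
commutes = all(
    shift(flip(x, j)) == flip(shift(x), (j + 1) % n)
    for x in product((0, 1), repeat=n)
    for j in range(n)
)
print(commutes)  # True
```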

Slide 13. Communication Complexity Approach
[Diagram: Alice holds x ∈ Σ^n, Bob holds y ∈ Σ^n; they share randomness and exchange CC_A bits.]
Distance Estimation Problem: decide whether d(x,y) ≥ R or d(x,y) ≤ R/A
Communication complexity model:
- Two-party protocol
- Shared randomness
- Promise (gap) version
- A = approximation factor
- CC_A = min. # bits to decide whp
Previous communication lower bounds:
- ℓ_1 [Saks-Sun’02, Bar-Yossef-Jayram-Kumar-Shivakumar’04]
- ℓ_1 [Woodruff’04]
- Earthmover [Andoni-Indyk-K.’07]

Slide 14. Communication Bounds for Edit Distance
A tradeoff between approximation and communication
Theorem [Andoni-K.’07]: [bound not preserved in the transcript]
For Hamming distance: CC_{1+ε} = Θ(1/ε²) [Kushilevitz-Ostrovsky-Rabani’98], [Woodruff’04]
First computational model where edit is provably harder than Hamming!
Corollary 1: Approximation A = O(1) requires CC_A ≥ Ω(loglog n)
Corollary 2: Communication CC_A = O(1) requires A ≥ Ω*(log n)
Implications to embeddings:
- Embedding ED into L_1 (or squared-L_2) requires distortion Ω*(log n)
- Furthermore, this holds for both 0-1 strings and permutations (Ulam)

Slide 15. Proof Outline
Step 1 [Yao’s minimax Theorem]: Reduce to distributional complexity
- If CC_A ≤ k then for every two distributions μ_far, μ_close there is a k-bit deterministic protocol with success probability ≥ 2/3
Step 2 [Andoni-Indyk-K.’07]: Reduce to 1-bit protocols
- Further to the above, there are Boolean functions s_A, s_B : Σ^n → {0,1} with advantage
  Pr_{(x,y)∈μ_far}[s_A(x) ≠ s_B(y)] − Pr_{(x,y)∈μ_close}[s_A(x) ≠ s_B(y)] ≥ Ω(2^{-k})
Step 3 [Fourier expansion]: Reduce to one Fourier level ℓ
- Furthermore, s_A, s_B depend only on fixed positions j_1,…,j_ℓ
Step 4 [Choose distribution]: Analyze (x,y) ∈ μ projected on these positions
- Let μ_close, μ_far include ε-noise → handle a high level ℓ
- Let μ_close, μ_far include (few/more) block rotations → handle a low level ℓ
Step 5: Reduce Ulam to {0,1}^n
- A random mapping Σ → {0,1} works
Key property: the distribution of (x_{j_1},…,x_{j_ℓ}, y_{j_1},…,y_{j_ℓ}) is “statistically close” under μ_far vs. under μ_close
Compare this additive analysis to our previous analysis: [formula not preserved in the transcript]
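
Step 5 maps permutations to 0-1 strings by a random symbol-to-bit mapping. The sketch below illustrates only the easy direction: applying one fixed map to every symbol can never increase edit distance, since any edit script for the permutations translates to one for the bit strings. That far pairs stay far with good probability is the substance of the proof and is not shown here. The names `random_binary_projection` and `edit_distance` are hypothetical helpers.

```python
import random


def edit_distance(x, y):
    """Levenshtein distance between two sequences (standard dynamic program)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                # delete cx
                            curr[j - 1] + 1,            # insert cy
                            prev[j - 1] + (cx != cy)))  # substitute or match
        prev = curr
    return prev[-1]


def random_binary_projection(alphabet, rng):
    """Assign an independent uniform bit to every symbol of the alphabet."""
    return {s: rng.randrange(2) for s in alphabet}


P = (1, 2, 3, 4, 5)
Q = (2, 3, 4, 5, 1)  # P with the symbol 1 moved to the end
phi = random_binary_projection(P, random.Random(0))
x = tuple(phi[s] for s in P)
y = tuple(phi[s] for s in Q)
print(edit_distance(P, Q), edit_distance(x, y))
```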

Slide 16. Summary of Known Results
Embed Ulam metric into L_1:
- Upper bound: O(log n) [Charikar-K.’06] (new proof by [Gopalan-Jayram-K.])
- Lower bound: Ω(log n / loglog n) [Andoni-K.’07] (qualitatively much stronger)
Embed ({0,1}^n, ED) into L_1:
- Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani’05]
- Lower bound: Ω(log n) [K.-Rabani’06]

Slide 17. Concluding Remarks
The computational lens:
- Study Distance Estimation problems rather than embeddings
Open problems:
- Still a large gap for 0-1 strings
- Variants of edit distance (e.g. edit distance with block-moves)
- Rule out other algorithms (e.g. a “CC model” capturing Indyk’s NNS for ℓ_1)
Recent progress:
- Bypass L_1-embedding by devising new techniques, e.g. using max(ℓ_1) product for NNS under Ulam metric [Andoni-Indyk-K.]
- Analyze/design “good” heuristics, e.g. smoothed analysis [Andoni-K.]