Overcoming the L1 Non-Embeddability Barrier

Overcoming the L1 Non-Embeddability Barrier Robert Krauthgamer (Weizmann Institute) Joint work with Alexandr Andoni and Piotr Indyk (MIT)

Algorithms on Metric Spaces
Fix a metric M. Fix a computational problem. Solve the problem under M.
Metrics: Hamming distance, Ulam metric, earthmover distance, edit distance, …
ED(x,y) = minimum number of edit operations that transform x into y, where an edit operation inserts, deletes, or substitutes a character. Example: ED(0101010, 1010101) = 2.
Problems: compute the distance between x and y; Nearest Neighbor Search (preprocess n strings so that, given a query string, the closest string to it can be found); …
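As an illustration of the ED definition on this slide, here is a minimal sketch of the textbook dynamic program for edit distance. It is not part of the talk; the function name `edit_distance` is chosen here for illustration, and the quadratic running time is exactly what the approximation results discussed later try to avoid.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning the first i characters of x into the empty string
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]


print(edit_distance("0101010", "1010101"))  # the example from this slide
print(edit_distance("1234567", "7123456"))  # a permutation (Ulam) example
```

Both calls evaluate the examples that appear in the slides; the second one is the Ulam example used later in the talk.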

Motivation for Nearest Neighbor
Many applications:
  Image search (Euclidean distance, earthmover distance)
  Processing of genetic information, text processing (edit distance)
  Many others…
Generic search engine

A General Tool: Embeddings
An embedding of M into a host metric (H, d_H) is a map f : M → H that preserves distances approximately.
It has distortion A ≥ 1 if for all x, y ∈ M: d_M(x,y) ≤ d_H(f(x),f(y)) ≤ A·d_M(x,y).
Why? If H is "easy" (= computational problems like NNS can be solved efficiently), then we get good algorithms for the original space M!
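To make the distortion definition concrete, the following sketch measures the distortion of a map on a finite set of points by brute force. It is an illustration under stated assumptions: the helper names (`distortion`, `hamming`, `l1`) are hypothetical, and the map is allowed to be rescaled by a constant, so the reported value is the ratio between the largest and the smallest expansion over all pairs.

```python
import math
from itertools import combinations


def distortion(points, d_M, d_H, f):
    """Smallest A such that, after rescaling f, d_M <= d_H(f(x), f(y)) <= A * d_M on the given points."""
    ratios = []
    for x, y in combinations(points, 2):
        dm = d_M(x, y)
        if dm == 0:
            continue  # identical points impose no constraint
        ratios.append(d_H(f(x), f(y)) / dm)
    if not ratios:
        return 1.0  # fewer than two distinct points: nothing to distort
    if min(ratios) == 0:
        return math.inf  # f collapses two distinct points
    return max(ratios) / min(ratios)


def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


points = ["000", "011", "101", "110", "111"]
as_vector = lambda s: [float(c) for c in s]  # bits as real coordinates
print(distortion(points, hamming, l1, as_vector))
```

Hamming distance on bit strings embeds into ℓ1 isometrically by reading bits as coordinates, so the sketch reports distortion 1 here; the interesting cases in the talk are metrics for which no such map exists.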

Host space?
Popular target metric: ℓ1 = real space with d_1(x,y) = ∑_i |x_i − y_i|
Have efficient algorithms:
  Distance estimation: O(d) for d-dimensional space (often less)
  NNS: c-approx with O(n^{1/c}) query time and O(n^{1+1/c}) space [IM98]
Powerful enough for some things…

Metric | References | Upper bound | Lower bound
Edit distance over {0,1}^d | [OR05]; [KN05, KR06, AK07] | 2^{Õ(√log d)} | Ω(log d)
Ulam (= edit distance over permutations) | [CK06]; [AK07] | O(log d) | Ω̃(log d)
Block edit distance over {0,1}^d | [MS00, CM07]; [Cor03] | Õ(log d) | 4/3
Earthmover distance in ℝ² (sets of size s) | [Cha02, IT03]; [NS07] | O(log s) | Ω(log^{1/2} s)
Earthmover distance in {0,1}^d (sets of size s) | [AIK08]; [KN05] | O(log s · log d) | Ω(log s)

Below logarithmic?
Cannot work with ℓ1.
Other possibilities?
  (ℓ2)^p = real space with dist(x,y) = ||x−y||_2^p: bigger and algorithmically tractable, but not rich enough (often same lower bounds)
  ℓ∞ = real space with dist_∞(x,y) = max_i |x_i − y_i|: rich (includes all metrics), but usually not computationally efficient (high dimension)
And that's roughly it… (at least for efficient NNS)

Meet our new host
Iterated product space Ρ_{2²,∞,1}
[Figure: the space is built in three levels, with dimensions α, β, γ and distances d_1, d_{∞,1}, d_{2²,∞,1}.]
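The figure on this slide did not survive extraction, so the following is a sketch of one reading of the notation, stated as an assumption rather than as the slide's exact definition: the innermost level compares α coordinates under ℓ1, the middle level takes the maximum over β such blocks, and the outer level sums the squares of γ such maxima. The function names are chosen here for illustration.

```python
def d_1(u, v):
    """Innermost level: l1 distance over alpha real coordinates."""
    return sum(abs(a - b) for a, b in zip(u, v))


def d_inf_1(U, V):
    """Middle level: maximum of d_1 over beta rows."""
    return max(d_1(u, v) for u, v in zip(U, V))


def d_22_inf_1(X, Y):
    """Outer level: sum of squared d_inf_1 over gamma blocks (squared-l2 product)."""
    return sum(d_inf_1(U, V) ** 2 for U, V in zip(X, Y))


# gamma = 2 blocks, each with beta = 2 rows of alpha = 3 coordinates
X = [[[0, 0, 0], [1, 1, 1]], [[2, 0, 0], [0, 0, 0]]]
Y = [[[1, 0, 0], [1, 1, 3]], [[0, 0, 0], [0, 1, 0]]]
print(d_22_inf_1(X, Y))
```

In the toy input each block has row distances 1 and 2 (respectively 2 and 1), so each block contributes 2² = 4 to the total.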

Why Ρ_{2²,∞,1}? Because we can…
Rich. Theorem 1: Ulam embeds into Ρ_{2²,∞,1} with O(1) distortion. Dimensions (γ,β,α) = (d, log d, d).
Algorithmically tractable. Theorem 2: Ρ_{2²,∞,1} admits NNS on n points with O(log log n) approximation, O(n^ε) query time, and O(n^{1+ε}) space.
In fact, there is more for Ulam…

Our Algorithms for Ulam
Ulam = edit distance on strings where each symbol appears at most once. Example: ED(1234567, 7123456) = 2.
  A classical distance between rankings
  Exhibits hardness of misalignments (as in general edit)
  All lower bounds same as for general edit (up to Θ̃())
  Distortion of embedding into ℓ1 (and (ℓ2)^p, etc.): Θ̃(log d)
If we ever hope for approximation << log d for NNS under general edit, first we have to get it under Ulam!
Our approach implies new algorithms for Ulam:
  1. NNS with O(log log n) approximation and O(n^ε) query time. Can improve to O(log log d) approximation.
  2. Sketching with O(1)-approximation in log^{O(1)} d space
  3. Distance estimation with O(1)-approximation in time […]
     [BEKMRRS03]: when ED ≈ d, approx d^ε in O(d^{1−2ε}) time

Theorem 1
Theorem 1. Can embed Ulam into Ρ_{2²,∞,1} with O(1) distortion. Dimensions (γ,β,α) = (d, log d, d).
Proof: "geometrization" of Ulam characterizations, previously studied in the context of testing monotonicity (sortedness):
  Sublinear algorithms [EKKRV98, ACCL04]
  Data-stream algorithms [GJKK07, GG07, EH08]

Thm 1: Characterizing Ulam
Consider permutations x, y over [d]. Assume for now: x = identity permutation.
Idea: count the number of characters in y to delete to obtain an increasing sequence (≈ Ulam(x,y)). Call them faulty characters.
Issues: Ambiguity… How do we count them?
Examples: x = 123456789 with y = 234657891; x = 123456789 with y = 341256789.
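The quantity in the "Idea" line can be computed exactly: the fewest deletions that leave an increasing sequence equals the length of y minus the length of its longest increasing subsequence. The sketch below uses the standard patience-sorting technique for that; it is an illustration added here, not the talk's method, and `deletions_to_sort` is a name chosen for this example.

```python
from bisect import bisect_left


def deletions_to_sort(y):
    """Fewest characters to delete from y so that the rest is increasing."""
    # tails[k] = smallest possible last value of an increasing subsequence of length k + 1
    tails = []
    for value in y:
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)
        else:
            tails[k] = value
    return len(y) - len(tails)


for y in ("234657891", "341256789"):
    print(y, deletions_to_sort([int(c) for c in y]))
```

The count is well defined, but the set of deleted characters is not: in 341256789 one can delete either {3, 4} or {1, 2}. This is the ambiguity the slide points at, and it is why the talk moves to a per-character notion of "faulty".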

Thm 1: Characterization – inversions
Definition: chars a < b form an inversion if b precedes a in y.
How to identify a faulty char?
  Has an inversion? Doesn't work: all chars might have an inversion.
  Has many inversions? Still can miss "faulty" chars.
  Has many inversions locally? Same problem.
  Check if either is true!
Examples (x = 123456789 throughout): y = 234567891, y = 213456798, y = 567981234.

Thm 1: Characterization – faulty chars
Definition 1: a is faulty if there exists K > 0 such that a is inverted w.r.t. a majority of the K symbols preceding a in y (OK to consider only K = 2^k).
Lemma [ACCL04, GJKK07]: # faulty chars = Θ(Ulam(x,y)).
Example: x = 123456789, y = 234567891. Take the 4 characters preceding 1 in y: all of them form inversions with 1.
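Definition 1 can be checked directly by brute force. The sketch below assumes x is the identity permutation, reads "majority" as "strictly more than half", and tries every window length K rather than only powers of two; these choices and the name `faulty_chars` are assumptions made for illustration.

```python
def faulty_chars(y):
    """Characters of y that are faulty with respect to the identity permutation (Definition 1)."""
    faulty = set()
    for i, a in enumerate(y):
        inverted = 0  # symbols larger than a among the K symbols preceding a
        for K in range(1, i + 1):
            if y[i - K] > a:
                inverted += 1
            if 2 * inverted > K:  # strict majority of the last K symbols
                faulty.add(a)
                break
    return faulty


for y in ("234567891", "213456798", "567981234"):
    print(y, sorted(faulty_chars([int(c) for c in y])))
```

On the slide's example 234567891 only the character 1 is flagged, even though every character has an inversion, which is the behaviour the lemma needs.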

Thm 1: Characterization → Embedding
To get an embedding, need:
  1. Symmetrization (neither string is the identity)
  2. Deal with "exists", "majority"…?
To resolve (1), use instead X[a;K] …
Definition 2: a is faulty if there exists K = 2^k such that |X[a;2^k] Δ Y[a;2^k]| > 2^k (Δ = symmetric difference).
Example: x = 123456789, y = 123467895, with windows X[5;4] and Y[5;4].
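Definition 2 can be sketched in code as follows. The slide elides the exact definition of X[a;K]; going by its example, the sketch assumes X[a;K] is the set of the K symbols immediately preceding a in x, truncated at the start of the string. That truncation rule, and the names `window` and `faulty_chars_symmetric`, are assumptions of this illustration.

```python
def window(x, a, K):
    """Assumed meaning of X[a;K]: the set of (up to) K symbols immediately preceding a in x."""
    i = x.index(a)
    return set(x[max(0, i - K):i])


def faulty_chars_symmetric(x, y):
    """Characters a with |X[a;K] symmetric-difference Y[a;K]| > K for some K = 2^k (Definition 2)."""
    faulty = set()
    for a in x:
        K = 1
        while K <= len(x):
            if len(window(x, a, K) ^ window(y, a, K)) > K:
                faulty.add(a)
                break
            K *= 2
    return faulty


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(sorted(window(x, 5, 4)), sorted(window(y, 5, 4)))
print(sorted(faulty_chars_symmetric(x, y)))
```

For the slide's example the two windows around 5 are disjoint, so their symmetric difference has 8 > 4 elements. The definition is symmetric in x and y by construction, which is what the symmetrization step asks for.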

Thm 1: Embedding – final step
Example: x = 123456789, y = 123467895, with windows X[5;2²] and Y[5;2²].
We have: […]
Replace by weight?
Final embedding: […] (surviving annotations: an indicator term "equal 1 iff true" and an outer squaring ( )²)

Theorem 2
Theorem 2. Ρ_{2²,∞,1} admits NNS on n points with O(log log n) approximation, O(n^ε) query time, and O(n^{1+ε}) space, for any small ε (ignoring (αβγ)^{O(1)} factors).
A rather general approach: "LSH" on ℓ1-products of general metric spaces.
Of course, this cannot be done directly, but it can be reduced to ℓ∞-products.

Thm 2: Proof
Let's start from basics: ℓ1^α. [IM98]: c-approx with O(n^{1/c}) query time and O(n^{1+1/c}) space (ignoring α^{O(1)} factors).
OK, what about the ℓ∞-product of a metric M? [I02]
  Suppose: NNS for M with c_M-approximation, Q_M query time, S_M space.
  Then: NNS for the ℓ∞-product of M with O(c_M · log log n)-approximation, Õ(Q_M) query time, O(S_M · n^{1+ε}) space.

Thm 2: What about (ℓ2)²-product?
Enough to consider […] (for us, M is the ℓ1-product).
Off-the-shelf? [I04]: gives space ~n or > log n approximation.
We reduce to multiple NNS queries under […].
Instructive to first look at NNS for standard ℓ1 …

Thm 2: Review of NNS for ℓ1
LSH family: a collection H of hash functions such that, for random h ∈ H (parameter w > 0), Pr[h(q)=h(p)] ≈ 1 − ||q−p||_1 / w.
Query just uses the primitive: "return all points p such that h(q) = h(p)".
Can obtain H by imposing a randomly shifted grid of side length w.
Then, for h defined by r_i ∈ [0, w] chosen at random, the primitive becomes: "return all p s.t. |q_i − p_i| < r_i for all i ∈ [d]".
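The randomly shifted grid can be sketched in a few lines. The code below is an illustration under stated assumptions: w denotes the grid side length, a hash value is the grid cell containing the point, and the names `make_grid_hash` and `collision_rate` are invented for this example. It estimates the collision probability empirically so that it can be compared with 1 − ||q−p||_1 / w.

```python
import math
import random


def make_grid_hash(dim, w, rng):
    """Hash a point to its cell in a grid of side w with a random shift per coordinate."""
    shifts = [rng.uniform(0.0, w) for _ in range(dim)]

    def h(point):
        return tuple(math.floor((x + s) / w) for x, s in zip(point, shifts))

    return h


def collision_rate(p, q, w, trials=20000, seed=0):
    """Fraction of random grids under which p and q land in the same cell."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_grid_hash(len(p), w, rng)
        hits += h(p) == h(q)
    return hits / trials


p, q, w = (0.0, 0.0), (0.1, 0.2), 1.0
l1_dist = sum(abs(a - b) for a, b in zip(p, q))
print(collision_rate(p, q, w), 1 - l1_dist / w)
```

For a grid, the exact collision probability is the product of (1 − |q_i − p_i| / w) over the coordinates, which agrees with 1 − ||q−p||_1 / w up to lower-order terms when the points are close relative to w.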

Thm 2: LSH for ℓ1-product
Intuition: abstract LSH!
Recall we had, for ℓ1: with r_i random from [0, w], point p is returned if for all i: |q_i − p_i| < r_i ("return all p s.t. |q_i − p_i| < r_i for all i ∈ [d]").
Equivalently, for all i: |q_i − p_i| / r_i < 1, that is, max_i |q_i − p_i| / r_i < 1. An ℓ∞ product of ℝ!
For the ℓ1-product of M: "return all points p such that max_i d_M(q_i,p_i)/r_i < 1".
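The primitive on this slide is a range query in a coordinate-wise rescaled ℓ∞-product. The sketch below states it as a linear scan with a pluggable base metric d_M; it only illustrates what the primitive returns, not the sublinear data structure the talk relies on, and the name `primitive` and the toy data are assumptions of this example.

```python
def primitive(points, q, r, d_M):
    """Return all p with max_i d_M(q_i, p_i) / r_i < 1 (linear scan, for illustration only)."""
    return [
        p for p in points
        if max(d_M(qi, pi) / ri for qi, pi, ri in zip(q, p, r)) < 1
    ]


def abs_diff(a, b):
    """Base metric M = the real line, which recovers the l1 case of the previous slide."""
    return abs(a - b)


points = [(0.0, 0.0), (0.4, 0.1), (0.1, 0.9)]
q = (0.0, 0.0)
r = (0.5, 0.3)  # thresholds r_i; in the talk they are drawn at random
print(primitive(points, q, r, abs_diff))
```

Swapping `abs_diff` for any other metric on the coordinates gives the same primitive for an ℓ1-product of that metric, which is the abstraction step the slide describes.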

Thm 2: Final
Thus, it is sufficient to solve the primitive for the ℓ1-product of M: "return all points p such that max_i d_M(q_i,p_i)/r_i < 1" (in fact, for k independent choices of (r_1,…,r_d)).
We reduced NNS over the ℓ1-product of M to several instances of NNS over the ℓ∞-product of M (with appropriately scaled coordinates).
Approximation is O(1)·O(log log n).
Done!

Take-home message:
  Can embed combinatorial metrics into iterated product spaces
  Works for Ulam (= edit distance on non-repetitive strings)
  Approach bypasses non-embeddability results into usual-suspect spaces like ℓ1, (ℓ2)², …
Open:
  Embeddings for edit distance over {0,1}^d, EMD, other metrics?
  Understanding product spaces? [Jayram-Woodruff]: sketching
Thank you!