Overcoming the L_1 Non-Embeddability Barrier
Robert Krauthgamer (Weizmann Institute)
Joint work with Alexandr Andoni and Piotr Indyk (MIT)


Slide 2: Algorithms on Metric Spaces
- Fix a metric M
- Fix a computational problem
- Solve the problem under M
Example metrics: Hamming distance, Ulam metric, edit distance, Earthmover distance, …
- ED(x,y) = minimum number of edit operations that transform x into y
- edit operation = insert/delete/substitute a character
- ED(0101010, 1010101) = 2
Example problems:
- Nearest Neighbor Search (NNS): preprocess n strings so that, given a query string, the closest string to it can be found
- Compute the distance between x and y
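The edit distance defined on this slide can be checked with the textbook dynamic program. This is a minimal sketch, not part of the talk; the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    # prev[j] = distance between the processed prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string costs i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("0101010", "1010101"))  # the slide's example
```

This quadratic-time routine only makes the definition concrete; the talk is about avoiding exact computation at this cost.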

Slide 3: Motivation for Nearest Neighbor
Many applications:
- Image search (Euclidean distance, Earthmover distance)
- Processing of genetic information, text processing (edit distance)
- many others…
[Figure: generic search engine]

Slide 4: A General Tool: Embeddings
An embedding of M into a host metric (H, d_H) is a map f : M → H that
- preserves distances approximately
- has distortion A ≥ 1 if for all x,y ∈ M: d_M(x,y) ≤ d_H(f(x),f(y)) ≤ A·d_M(x,y)
Why?
- If H is "easy" (= computational problems like NNS can be solved efficiently in H)
- Then we get good algorithms for the original space M!
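For a finite point set, the distortion in this definition can be measured directly. The sketch below assumes distinct points and an injective map, and uses the fact that rescaling f makes it non-contracting, so the distortion equals the largest stretch ratio divided by the smallest. The helper name `distortion` and the 4-cycle example are illustrative, not from the talk.

```python
from itertools import combinations


def distortion(points, f, d_M, d_H):
    """Distortion of f on a finite point set, after optimal rescaling of f."""
    ratios = [d_H(f(x), f(y)) / d_M(x, y) for x, y in combinations(points, 2)]
    return max(ratios) / min(ratios)


# Hypothetical example: the 4-cycle with shortest-path distance, mapped onto a line.
def cycle_dist(i, j):
    return min(abs(i - j), 4 - abs(i - j))


def line_dist(a, b):
    return abs(a - b)


print(distortion(range(4), lambda v: v, cycle_dist, line_dist))
```

Vertices 0 and 3 are adjacent on the cycle but three apart on the line, which is what drives the ratio in this toy case.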

Slide 5: Host space?
Popular target metric: ℓ_1 (= real space with d_1(x,y) = ∑_i |x_i − y_i|)
Have efficient algorithms:
- Distance estimation: O(d) for d-dimensional space (often less)
- NNS: c-approximation with O(n^{1/c}) query time and O(n^{1+1/c}) space [IM98]
Powerful enough for some things…

Metric | References | Upper bound | Lower bound
Edit distance over {0,1}^d | [OR05]; [KN05, KR06, AK07] | 2^{Õ(√log d)} | Ω(log d)
Ulam (= edit distance over permutations) | [CK06]; [AK07] | O(log d) | Ω̃(log d)
Block edit distance over {0,1}^d | [MS00, CM07]; [Cor03] | Õ(log d) | 4/3
Earthmover distance in ℝ^2 (sets of size s) | [Cha02, IT03]; [NS07] | O(log s) | Ω(log^{1/2} s)
Earthmover distance in {0,1}^d (sets of size s) | [AIK08]; [KN05] | O(log s · log d) | Ω(log s)

Slide 6: Below logarithmic?
Cannot work with ℓ_1. Other possibilities?
- (ℓ_2)^p is bigger and algorithmically tractable, but not rich enough (often the same lower bounds)
- ℓ_∞ is rich (includes all metrics), but usually not computationally efficient (high dimension)
And that's roughly it… (at least for efficient NNS)
Notation:
- (ℓ_2)^p = real space with dist_{2^p}(x,y) = ||x−y||_2^p
- ℓ_∞ = real space with dist_∞(x,y) = max_i |x_i − y_i|

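Slide 7: Meet our new host
Iterated product space: P_{2^2,∞,1} = ⊕_{(ℓ_2)^2}^γ ⊕_{ℓ_∞}^β ℓ_1^α
[Figure: nested blocks. Each innermost block holds α coordinates compared under d_1; β such blocks are combined under d_{∞,1}; γ of those are combined under d_{2^2,∞,1}.]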
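Read from the inside out, the three-level distance can be sketched as follows. This assumes the nesting order suggested by the subscripts: ℓ_1 innermost over α coordinates, ℓ_∞ over β blocks, and squared ℓ_2 outermost over γ blocks. The function name and the nested-list layout are illustrative choices.

```python
def product_distance(x, y):
    """Distance between two points of the iterated product space.

    x and y are nested lists of shape gamma x beta x alpha (an assumed layout).
    """
    total = 0.0
    for x_blocks, y_blocks in zip(x, y):          # gamma outer blocks
        worst = max(                               # l_infinity over beta blocks
            sum(abs(a - b) for a, b in zip(xv, yv))  # l_1 over alpha coordinates
            for xv, yv in zip(x_blocks, y_blocks)
        )
        total += worst ** 2                        # squared l_2: sum of squares, no root
    return total


x = [[[0, 0], [0, 0]], [[0, 0], [0, 0]]]
y = [[[1, 0], [2, 2]], [[0, 0], [1, -1]]]
print(product_distance(x, y))  # inner l_1 values are (1, 4) and (0, 2)
```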

Slide 8: Why P_{2^2,∞,1}?
Because we can…
Rich: Theorem 1. Ulam embeds into P_{2^2,∞,1} with O(1) distortion
- Dimensions (γ,β,α) = (d, log d, d)
Algorithmically tractable: Theorem 2. P_{2^2,∞,1} admits NNS on n points with
- O(log log n) approximation
- O(n^ε) query time and O(n^{1+ε}) space
In fact, there is more for Ulam…

Slide 9: Our Algorithms for Ulam
Ulam = edit distance on strings where each symbol appears at most once. Example: ED(1234567, 7123456) = 2
- A classical distance between rankings
- Exhibits hardness of misalignments (as in general edit)
  - All lower bounds same as for general edit (up to Θ̃())
  - Distortion of embedding into ℓ_1 (and (ℓ_2)^p, etc.): Θ̃(log d)
Our approach implies new algorithms for Ulam:
1. NNS with O(log log n) approximation and O(n^ε) query time
   - Can improve to O(log log d) approximation
2. Sketching with O(1)-approximation in log^{O(1)} d space
3. Distance estimation with O(1)-approximation in time […]
   - [BEKMRRS03]: when ED ≈ d, approximation d^ε in O(d^{1−2ε}) time
If we ever hope for approximation << log d for NNS under general edit, first we have to get it under Ulam!

Slide 10: Theorem 1
Theorem 1. Can embed Ulam into P_{2^2,∞,1} with O(1) distortion
- Dimensions (γ,β,α) = (d, log d, d)
Proof:
- "Geometrization" of Ulam characterizations
- Previously studied in the context of testing monotonicity (sortedness):
  - Sublinear algorithms [EKKRV98, ACCL04]
  - Data-stream algorithms [GJKK07, GG07, EH08]

Slide 11: Thm 1: Characterizing Ulam
Consider permutations x,y over [d]
- Assume for now: x = identity permutation
Idea:
- Count the number of chars in y to delete to obtain an increasing sequence (≈ Ulam(x,y))
- Call them faulty characters
Issues:
- Ambiguity…
- How do we count them?
Examples:
- x = 123456789, y = 234657891
- x = 123456789, y = 341256789
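When x is the identity, the number of characters to delete from y equals d minus the length of a longest increasing subsequence of y. The sketch below uses the standard patience-sorting method for that; the function name is illustrative, and y is assumed to be a list of distinct integers.

```python
from bisect import bisect_left


def min_deletions_to_sorted(y):
    """Fewest characters to delete from y so that the rest is increasing."""
    tails = []  # tails[k] = smallest possible last element of an increasing run of length k+1
    for c in y:
        pos = bisect_left(tails, c)
        if pos == len(tails):
            tails.append(c)
        else:
            tails[pos] = c
    return len(y) - len(tails)


print(min_deletions_to_sorted([2, 3, 4, 6, 5, 7, 8, 9, 1]))
print(min_deletions_to_sorted([3, 4, 1, 2, 5, 6, 7, 8, 9]))
```

In the second example either {1, 2} or {3, 4} can be deleted, which illustrates the ambiguity the slide mentions: the count is well defined, but the set of faulty characters is not.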

Slide 12: Thm 1: Characterization - inversions
Definition: chars a<b form an inversion if b precedes a in y
How to identify a faulty char?
- Has an inversion? Doesn't work: all chars might have an inversion
- Has many inversions? Still can miss "faulty" chars
- Has many inversions locally? Same problem
Check if either is true!
Examples (x = 123456789):
- y = 234567891
- y = 213456798
- y = 567981234
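Counting inversions per character on the slide's examples shows why the naive tests fail. This is a small quadratic-time sketch, assuming x is the identity and y is a list of distinct integers; the function name is illustrative.

```python
def inversion_counts(y):
    """For each char, the number of other chars it forms an inversion with."""
    counts = {c: 0 for c in y}
    for i, b in enumerate(y):
        for a in y[i + 1:]:
            if a < b:  # b precedes the smaller a: the pair (a, b) is an inversion
                counts[a] += 1
                counts[b] += 1
    return counts


print(inversion_counts([2, 3, 4, 5, 6, 7, 8, 9, 1]))
print(inversion_counts([2, 1, 3, 4, 5, 6, 7, 9, 8]))
```

In the first example every character has at least one inversion although moving the single character 1 sorts the string; in the second, no character has more than one inversion.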

Slide 13: Thm 1: Characterization - faulty chars
Definition 1: a is faulty if there exists K>0 s.t.
- a is inverted w.r.t. a majority of the K symbols preceding a in y
- (ok to consider K = 2^k)
Lemma [ACCL04, GJKK07]: number of faulty chars = Θ(Ulam(x,y)).
Example: x = 123456789, y = 234567891. The 4 characters preceding 1 all form inversions with 1.
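Definition 1 translates into a direct check. The sketch below makes two assumptions that the slide leaves open: "majority" means strictly more than K/2, and K only ranges over window sizes that fit before a in y. It also assumes x is the identity, so a preceding symbol b is inverted with a exactly when b > a.

```python
def faulty_chars(y):
    """Chars of y that are faulty under Definition 1 (x = identity)."""
    faulty = []
    for pos, a in enumerate(y):
        for K in range(1, pos + 1):
            window = y[pos - K:pos]  # the K symbols immediately preceding a
            inverted = sum(1 for b in window if b > a)
            if 2 * inverted > K:     # strict majority of the window (assumption)
                faulty.append(a)
                break
    return sorted(faulty)


print(faulty_chars([2, 3, 4, 5, 6, 7, 8, 9, 1]))
print(faulty_chars([2, 1, 3, 4, 5, 6, 7, 9, 8]))
```

On the slide's example only the character 1 is flagged, in line with the lemma's claim that the count tracks the Ulam distance up to constant factors.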

Slide 14: Thm 1: Characterization → Embedding
To get an embedding, need:
1. Symmetrization (neither string is the identity)
2. Deal with "exists", "majority"…?
To resolve (1), use instead X[a;K] …
Definition 2: a is faulty if there exists K = 2^k such that
- |X[a;2^k] Δ Y[a;2^k]| > 2^k (symmetric difference)
Example: x = 123456789, y = 123467895, with the windows X[5;4] and Y[5;4] marked.
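Definition 2 can be sketched in code by reading X[a;K] as the set of K symbols immediately preceding a in x, which matches the slide's marked example. How windows are handled near the start of a string is not recoverable from the transcript; truncating them, as done here, is an assumption, and both function names are illustrative.

```python
def preceding(s, a, K):
    """Set of (up to) K symbols immediately preceding a in s."""
    pos = s.index(a)
    return set(s[max(0, pos - K):pos])


def faulty_symmetric(x, y):
    """Chars that are faulty under Definition 2, trying K = 1, 2, 4, ... up to len(x)."""
    faulty = []
    for a in x:
        K = 1
        while K <= len(x):
            if len(preceding(x, a, K) ^ preceding(y, a, K)) > K:
                faulty.append(a)
                break
            K *= 2
    return faulty


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(preceding(x, 5, 4), preceding(y, 5, 4))
print(faulty_symmetric(x, y))
```

Because the test uses only sets and a symmetric difference, it treats x and y alike, which is the symmetrization the slide asks for.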

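Slide 15: Thm 1: Embedding - final step
We have: [formula not preserved; its inner condition is annotated "equal 1 iff true" and wrapped in ( )^2]
Replace by weight? [formula not preserved]
Final embedding: [formula not preserved]
Example: x = 123456789, y = 123467895, with the windows X[5;2^2] and Y[5;2^2] marked.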
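The formulas on this slide did not survive in the transcript. One plausible reading, consistent with Definition 2 and with the shape of the host space (a sum of squares over a, a maximum over k, and an ℓ_1 difference of indicator vectors), is sketched below. The normalization 1/2^{k+1} and the indicator-vector notation are assumptions, not taken from the slide.

```latex
% Counting faulty characters (an indicator equals its own square):
\mathrm{Ulam}(x,y) \;\approx\; \sum_{a \in [d]} \Bigl( \max_{k}\;
  \mathbf{1}\bigl[\, |X[a;2^k] \,\triangle\, Y[a;2^k]| > 2^k \,\bigr] \Bigr)^{2}

% Replacing the indicator by a weight in [0,1]:
\mathbf{1}\bigl[\, |X[a;2^k] \,\triangle\, Y[a;2^k]| > 2^k \,\bigr]
  \;\longrightarrow\;
  \frac{|X[a;2^k] \,\triangle\, Y[a;2^k]|}{2 \cdot 2^k}
  \;=\; \frac{1}{2^{k+1}} \,\bigl\| \mathbf{1}_{X[a;2^k]} - \mathbf{1}_{Y[a;2^k]} \bigr\|_1

% Resulting map and host-space distance:
f(x) \;=\; \Bigl( \tfrac{1}{2^{k+1}}\, \mathbf{1}_{X[a;2^k]} \Bigr)_{a \in [d],\; k},
\qquad
d_{2^2,\infty,1}\bigl(f(x), f(y)\bigr) \;=\; \sum_{a \in [d]} \Bigl( \max_{k}\;
  \tfrac{1}{2^{k+1}} \bigl\| \mathbf{1}_{X[a;2^k]} - \mathbf{1}_{Y[a;2^k]} \bigr\|_1 \Bigr)^{2}
```

Under this reading the three indices line up with the dimensions (γ,β,α) = (d, log d, d): a ranges over d characters, k over about log d window sizes, and each indicator vector has d coordinates.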

Slide 16: Theorem 2
Theorem 2. P_{2^2,∞,1} admits NNS on n points with
- O(log log n) approximation
- O(n^ε) query time and O(n^{1+ε}) space for any small ε (ignoring (αβγ)^{O(1)} factors)
A rather general approach:
- "LSH" on ℓ_1-products of general metric spaces
- Of course, cannot do, but can reduce to ℓ_∞-products

Slide 17: Thm 2: Proof
Let's start from basics: ℓ_1^α
- [IM98]: c-approximation with O(n^{1/c}) query time and O(n^{1+1/c}) space (ignoring α^{O(1)} factors)
Ok, what about the ℓ_∞-product of a metric M?
- Suppose: NNS for M with c_M-approximation, Q_M query time, S_M space.
- Then: NNS for the ℓ_∞-product of M with O(c_M · log log n)-approximation, Õ(Q_M) query time, O(S_M · n^{1+ε}) space. [I02]

Slide 18: Thm 2: What about (ℓ_2)^2-product?
Enough to consider […]
- (for us, M is the ℓ_1-product)
Off-the-shelf?
- [I04]: gives space ~n^[…] or >log n approximation
We reduce to multiple NNS queries under […]
- Instructive to first look at NNS for standard ℓ_1 …

Slide 19: Thm 2: Review of NNS for ℓ_1
LSH family: collection H of hash functions such that:
- For random h ∈ H (parameter w > 0): Pr[h(q)=h(p)] ≈ 1 − ||q−p||_1 / w
Query just uses the primitive: "return all points p such that h(q)=h(p)"
Can obtain H by imposing a randomly shifted grid of side length w
- Then for h defined by r_i ∈ [0, w] at random, the primitive becomes: "return all p s.t. |q_i − p_i| < r_i for all i ∈ [d]"
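The randomly shifted grid can be simulated to see the collision probability. This is a sketch under stated assumptions: the side length is called `w` here, each coordinate gets an independent uniform shift, and the helper names are illustrative. In one dimension the collision probability is exactly 1 − |q − p|/w for |q − p| < w; in higher dimensions it is a product of such terms, which is close to 1 − ||q−p||_1/w when the distance is small relative to w.

```python
import math
import random


def make_grid_hash(dim, w, rng):
    """Hash a point to the cell of a randomly shifted grid of side length w."""
    shifts = [rng.uniform(0.0, w) for _ in range(dim)]

    def h(p):
        return tuple(math.floor((pi + si) / w) for pi, si in zip(p, shifts))

    return h


def collision_rate(p, q, w, trials=20000, seed=0):
    """Empirical estimate of Pr[h(p) == h(q)] over random grid shifts."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_grid_hash(len(p), w, rng)
        hits += h(p) == h(q)
    return hits / trials


print(collision_rate([0.0], [0.25], w=1.0))  # should be close to 0.75
```

Points that differ by at least w in some coordinate never share a cell, which is why the hash only separates near from far at the scale set by w.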

Slide 20: Thm 2: LSH for ℓ_1-product
Intuition: abstract LSH!
Recall we had: for r_i random from [0, w], point p is returned if for all i: |q_i − p_i| < r_i
Equivalently:
- For all i: |q_i − p_i| / r_i < 1, i.e. max_i |q_i − p_i| / r_i < 1. This is an ℓ_∞ product of ℝ!
Primitives:
- For ℓ_1: "return all p s.t. |q_i − p_i| < r_i for all i ∈ [d]"
- For the ℓ_1-product of M: "return all points p such that max_i d_M(q_i, p_i)/r_i < 1"
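The abstract primitive only needs a distance function for the base metric M. The brute-force filter below is an illustrative sketch, not the data structure from the talk, which answers such queries through NNS on an ℓ_∞-product rather than by scanning. The function name `primitive` and the toy data are assumptions.

```python
def primitive(points, q, r, d_M):
    """Return all points p with max_i d_M(q_i, p_i) / r_i < 1 (linear scan)."""
    return [
        p for p in points
        if max(d_M(qi, pi) / ri for qi, pi, ri in zip(q, p, r)) < 1
    ]


# Toy usage with M = the real line, so d_M is the absolute difference.
def abs_diff(a, b):
    return abs(a - b)


points = [(0.2, 0.1), (0.2, 0.6)]
q = (0.0, 0.0)
r = [1.0, 0.5]  # thresholds r_i, fixed here instead of drawn at random
print(primitive(points, q, r, abs_diff))
```

With M = ℝ this is the same condition as the grid primitive for ℓ_1, and swapping in another `d_M` gives the version for an ℓ_1-product of M.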

Slide 21: Thm 2: Final
Thus, it is sufficient to solve the primitive: "return all points p such that max_i d_M(q_i, p_i)/r_i < 1"
- (in fact, for k independent choices of (r_1,…,r_d))
We reduced NNS over the ℓ_1-product of M to several instances of NNS over the ℓ_∞-product of M (with appropriately scaled coordinates)
Approximation is O(1)·O(log log n)
Done!

Slide 22: Take-home message
- Can embed combinatorial metrics into iterated product spaces
  - Works for Ulam (= edit on non-repetitive strings)
- Approach bypasses non-embeddability results into usual-suspect spaces like ℓ_1, (ℓ_2)^2, …
Open:
- Embeddings for edit over {0,1}^d, EMD, other metrics?
- Understanding product spaces? [Jayram-Woodruff]: sketching
