Published by Mercedes Lanning. Modified over 3 years ago.

1
Overcoming the L_1 Non-Embeddability Barrier
Robert Krauthgamer (Weizmann Institute)
Joint work with Alexandr Andoni and Piotr Indyk (MIT)

2
Algorithms on Metric Spaces
Fix a metric M: Ulam metric, Earthmover distance, …, Hamming distance
Fix a computational problem:
  Compute the distance between x and y
  Nearest Neighbor Search (NNS): preprocess n strings so that, given a query string, the closest string to it can be found
Solve the problem under M
ED(x,y) = minimum number of edit operations that transform x into y
  edit operation = insert, delete, or substitute a character
  ED(0101010, 1010101) = 2
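The definition of ED can be checked with the textbook dynamic program. This is a minimal sketch for illustration only; the function name `edit_distance` is hypothetical, and the quadratic-time algorithm is the standard one, not an algorithm from the talk.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insert/delete/substitute operations turning x into y."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]


print(edit_distance("0101010", "1010101"))
```

On the slide's example, two operations suffice: delete the leading 0 and append a 1.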

3
Motivation for Nearest Neighbor
Many applications:
  Image search (Euclidean distance, Earthmover distance)
  Processing of genetic information, text processing (edit distance)
  many others…
Generic Search Engine

4
A General Tool: Embeddings
An embedding of M into a host metric (H, d_H) is a map f : M → H that preserves distances approximately.
f has distortion A ≥ 1 if for all x, y:
  d_M(x,y) ≤ d_H(f(x),f(y)) ≤ A · d_M(x,y)
Why? If H is “easy” (= computational problems like NNS can be solved efficiently), then we get good algorithms for the original space M!
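To make the distortion definition concrete, the sketch below measures it on a finite point set. All names (`distortion`, `cycle_dist`, `corners`, `l1`, `l2`) and the 4-cycle example are illustrative assumptions, not taken from the talk. The function reports the ratio of the largest to the smallest expansion, which equals the smallest A achievable after rescaling d_H so that no distance shrinks.

```python
from itertools import combinations
import math


def distortion(points, d_M, d_H, f):
    """Distortion of f on a finite point set, after optimal rescaling of d_H.

    Assumes f is injective on the points (otherwise the minimum ratio is 0).
    """
    ratios = [d_H(f(x), f(y)) / d_M(x, y) for x, y in combinations(points, 2)]
    return max(ratios) / min(ratios)


# Hypothetical example: the 4-cycle 0-1-2-3-0 with shortest-path distance,
# mapped to the corners of the unit square.
def cycle_dist(x, y):
    k = abs(x - y) % 4
    return min(k, 4 - k)


corners = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}


def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


print(distortion(range(4), cycle_dist, l1, corners.get))
print(distortion(range(4), cycle_dist, l2, corners.get))
```

Under ℓ_1 the square reproduces the cycle distances exactly, while under ℓ_2 the two diagonals are shortened, so the same map has distortion √2.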

5
Host space?
Popular target metric: ℓ_1
  ℓ_1 = real space with d_1(x,y) = ∑_i |x_i − y_i|
Have efficient algorithms:
  Distance estimation: O(d) for d-dimensional space (often less)
  NNS: c-approx with O(n^{1/c}) query time and O(n^{1+1/c}) space [IM98]
Powerful enough for some things…
Distortion of embedding into ℓ_1:
  Metric | References | Upper bound | Lower bound
  Edit distance over {0,1}^d | [OR05]; [KN05, KR06, AK07] | 2^{Õ(√log d)} | Ω(log d)
  Ulam (= edit distance over permutations) | [CK06]; [AK07] | O(log d) | Ω̃(log d)
  Block edit distance over {0,1}^d | [MS00, CM07]; [Cor03] | Õ(log d) | 4/3
  Earthmover distance in ℝ^2 (sets of size s) | [Cha02, IT03]; [NS07] | O(log s) | Ω(log^{1/2} s)
  Earthmover distance in {0,1}^d (sets of size s) | [AIK08]; [KN05] | O(log s · log d) | Ω(log s)

6
Below logarithmic?
Cannot work with ℓ_1
Other possibilities?
  (ℓ_2)^p is bigger and algorithmically tractable, but not rich enough (often the same lower bounds)
    (ℓ_2)^p = real space with dist_{2p}(x,y) = ||x − y||_2^p
  ℓ_∞ is rich (includes all metrics), but usually not computationally efficient (high dimension)
    ℓ_∞ = real space with dist_∞(x,y) = max_i |x_i − y_i|
And that’s roughly it… (at least for efficient NNS)

7
Meet our new host
Iterated product space P_{22,∞,1}
[Figure: three nested levels. d_1 over α coordinates; d_{∞,1} over β such blocks; d_{22,∞,1} over γ such blocks.]
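The nesting in the figure can be read as one distance formula. The sketch below assumes the usual reading of the notation: an ℓ_1 sum over the α innermost coordinates, a maximum over the β blocks, and a sum of squares (no square root, since the outer space is (ℓ_2)^2) over the γ outer blocks. The function name `product_distance` and the nested-list layout are hypothetical.

```python
def product_distance(x, y):
    """d_{22,inf,1} between points given as nested lists indexed [a][b][c],
    with a in [gamma], b in [beta], c in [alpha]."""
    total = 0.0
    for xa, ya in zip(x, y):
        # Middle level: l_inf over the beta blocks of l_1 distances.
        inner = max(
            sum(abs(u - v) for u, v in zip(xb, yb))  # innermost l_1 distance
            for xb, yb in zip(xa, ya)
        )
        # Outer level: squared l_2, i.e. a plain sum of squares.
        total += inner ** 2
    return total


# Tiny example with (gamma, beta, alpha) = (2, 2, 2).
x = [[[0, 0], [0, 0]], [[0, 0], [0, 0]]]
y = [[[1, 2], [0, 1]], [[0, 0], [3, 0]]]
print(product_distance(x, y))
```

In the example each outer block contributes max(3, 1)^2 = 9 and max(0, 3)^2 = 9 respectively.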

8
Why P_{22,∞,1}?
Because we can…
Theorem 1 (rich). Ulam embeds into P_{22,∞,1} with O(1) distortion
  Dimensions (γ,β,α) = (d, log d, d)
Theorem 2 (algorithmically tractable). P_{22,∞,1} admits NNS on n points with
  O(log log n) approximation
  O(n^ε) query time and O(n^{1+ε}) space
In fact, there is more for Ulam…

9
Our Algorithms for Ulam
Ulam = edit distance on strings where each symbol appears at most once
  ED(1234567, 7123456) = 2
  A classical distance between rankings
  Exhibits hardness of misalignments (as in general edit)
  All lower bounds same as for general edit (up to Θ̃(·))
  Distortion of embedding into ℓ_1 (and (ℓ_2)^p, etc.): Θ̃(log d)
Our approach implies new algorithms for Ulam:
  1. NNS with O(log log n) approx, O(n^ε) query time
     Can improve to O(log log d) approx
  2. Sketching with O(1)-approx in log^{O(1)} d space
  3. Distance estimation with O(1)-approx in time […]
     [BEKMRRS03]: when ED ≈ d, approx d^ε in O(d^{1−2ε}) time
If we ever hope for approximation << log d for NNS under general edit, first we have to get it under Ulam!

10
Theorem 1
Theorem 1. Can embed Ulam into P_{22,∞,1} with O(1) distortion
  Dimensions (γ,β,α) = (d, log d, d)
Proof: “Geometrization” of Ulam characterizations
  Previously studied in the context of testing monotonicity (sortedness):
    Sublinear algorithms [EKKRV98, ACCL04]
    Data-stream algorithms [GJKK07, GG07, EH08]

11
Thm 1: Characterizing Ulam
Consider permutations x, y over [d]
Assume for now: x = identity permutation
Idea: count the number of characters in y to delete to obtain an increasing sequence (≈ Ulam(x,y))
  Call them faulty characters
Issues:
  Ambiguity…
  How do we count them?
Examples:
  x = 123456789, y = 234657891
  x = 123456789, y = 341256789
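The quantity in the idea above, the fewest characters to delete from y so that what remains is increasing, equals the length of y minus its longest increasing subsequence. The sketch below computes it with the standard patience-sorting technique; the function name `chars_to_delete` is hypothetical, and this exact computation is only an illustration, not the characterization the talk goes on to build.

```python
from bisect import bisect_left


def chars_to_delete(y):
    """Fewest deletions that leave y strictly increasing (x = identity)."""
    # tails[i] = smallest possible last element of an increasing
    # subsequence of length i + 1 seen so far.
    tails = []
    for a in y:
        i = bisect_left(tails, a)
        if i == len(tails):
            tails.append(a)  # a extends the longest subsequence found so far
        else:
            tails[i] = a     # a gives a smaller tail for length i + 1
    return len(y) - len(tails)


for s in ("234657891", "341256789"):
    print(s, chars_to_delete([int(c) for c in s]))
```

Both slide examples need two deletions, and both show the ambiguity: in 341256789 one can delete either {3, 4} or {1, 2}, so the count is well defined while the set of "faulty" characters is not.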

12
Thm 1: Characterization – inversions
Definition: chars a < b form an inversion if b precedes a in y
How to identify a faulty char?
  Has an inversion?
    Doesn’t work: all chars might have an inversion
  Has many inversions?
    Still can miss “faulty” chars
  Has many inversions locally?
    Same problem
  Check if either is true!
Examples (x = 123456789): y = 234567891; y = 213456798; y = 567981234

13
Thm 1: Characterization – faulty chars
Definition 1: a is faulty if there exists K > 0 such that a is inverted w.r.t. a majority of the K symbols preceding a in y
  (ok to consider only K = 2^k)
Lemma [ACCL04, GJKK07]: # faulty chars = Θ(Ulam(x,y)).
Example: x = 123456789, y = 234567891. The 4 characters preceding 1 all form inversions with 1.
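Definition 1 translates directly into a brute-force check. The sketch below assumes x is the identity (so a is inverted with a preceding symbol b exactly when b > a), tries every window length K up to the number of symbols that actually precede a, and reads "majority" as strictly more than K/2. The function name `faulty_chars` and these boundary choices are assumptions made for illustration.

```python
def faulty_chars(y):
    """Characters of y that are faulty under Definition 1 (x = identity)."""
    faulty = []
    for pos, a in enumerate(y):
        for K in range(1, pos + 1):
            window = y[pos - K:pos]  # the K symbols preceding a in y
            inverted = sum(1 for b in window if b > a)
            if 2 * inverted > K:     # strict majority of the window
                faulty.append(a)
                break
    return faulty


print(faulty_chars([2, 3, 4, 5, 6, 7, 8, 9, 1]))
print(faulty_chars([2, 1, 3, 4, 5, 6, 7, 9, 8]))
```

In the slide's example 234567891, only the character 1 is reported: every window ending just before it consists entirely of larger symbols, while all other characters have no inversions with what precedes them.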

14
Thm 1: Characterization → Embedding
To get an embedding, need:
  1. Symmetrization (neither string is the identity)
  2. Deal with “exists”, “majority”…?
To resolve (1), use instead X[a;K] …
Definition 2: a is faulty if there exists K = 2^k such that |X[a;2^k] Δ Y[a;2^k]| > 2^k (Δ = symmetric difference)
Example: x = 123456789, y = 123467895, with windows X[5;4] and Y[5;4]

15
Thm 1: Embedding – final step
We have: [formula] (each term equals 1 iff the condition is true)
Replace by weight?
Final embedding: [formula]
Example: x = 123456789, y = 123467895, with windows X[5;2^2] and Y[5;2^2]

16
Theorem 2
Theorem 2. P_{22,∞,1} admits NNS on n points with
  O(log log n) approximation
  O(n^ε) query time and O(n^{1+ε}) space
  for any small ε (ignoring (αβγ)^{O(1)} factors)
A rather general approach
  “LSH” on ℓ_1-products of general metric spaces
  Of course, cannot do, but can reduce to ℓ_∞-products

17
Thm 2: Proof
Let’s start from basics: ℓ_1^α
  [IM98]: c-approx with O(n^{1/c}) query time and O(n^{1+1/c}) space (ignoring α^{O(1)})
Ok, what about the ℓ_∞-product?
  Suppose: NNS for M with c_M-approx, Q_M query time, S_M space.
  Then: NNS for the ℓ_∞-product of M with O(c_M · log log n)-approx, Õ(Q_M) query time, O(S_M · n^{1+ε}) space. [I02]

18
Thm 2: What about the (ℓ_2)^2-product?
Enough to consider […] (for us, M is the ℓ_1-product)
Off-the-shelf?
  [I04]: gives space ~n or >log n approximation
We reduce to multiple NNS queries under […]
Instructive to first look at NNS for standard ℓ_1…

19
Thm 2: Review of NNS for ℓ_1
LSH family: a collection H of hash functions such that, for random h ∈ H and a scale parameter > 0:
  Pr[h(q) = h(p)] ≈ 1 − ||q − p||_1 / (scale parameter)
Query just uses the primitive: “return all points p such that h(q) = h(p)”
Can obtain H by imposing a randomly-shifted grid whose side length is the scale parameter
Then, for h defined by r_i ∈ [0, scale parameter] chosen at random, the primitive becomes: “return all p s.t. |q_i − p_i| < r_i for all i ∈ [d]”
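The randomly-shifted grid can be simulated to see the collision probability it produces. This is a minimal sketch: the name `w` for the scale parameter is an assumption (the slide's symbol did not survive), as are the helper names `make_grid_hash` and `collision_rate`. For a grid of side w, two points collide in coordinate i with probability 1 − |q_i − p_i|/w, so the overall probability is the product of these terms, which is close to 1 − ||q − p||_1/w when the points are near each other relative to w.

```python
import math
import random


def make_grid_hash(dim, w, rng):
    """One hash function: cell index in a grid of side w with a random shift."""
    shifts = [rng.uniform(0.0, w) for _ in range(dim)]

    def h(p):
        return tuple(math.floor((p[i] + shifts[i]) / w) for i in range(dim))

    return h


def collision_rate(q, p, w, trials=20000, seed=0):
    """Empirical Pr[h(q) == h(p)] over independent random shifts."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_grid_hash(len(q), w, rng)
        if h(q) == h(p):
            hits += 1
    return hits / trials


q, p, w = (0.0, 0.0), (1.0, 0.5), 10.0
exact = (1 - 1.0 / w) * (1 - 0.5 / w)  # product over coordinates
print(collision_rate(q, p, w), exact, 1 - 1.5 / w)
```

A real data structure would store the points in a hash table keyed by h(p) rather than compare hashes pairwise; the simulation only illustrates the probability.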

20
Thm 2: LSH for ℓ_1-product
Intuition: abstract LSH!
Recall we had: for r_i random from [0, scale parameter], point p is returned if for all i: |q_i − p_i| < r_i
  For ℓ_1: “return all p s.t. |q_i − p_i| < r_i for all i ∈ [d]”
Equivalently, for all i: […]
  For the ℓ_1-product of M: “return all points p such that max_i d_M(q_i, p_i) / r_i < 1”
  ℓ_∞ product of R!
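The abstract primitive can be written as a linear scan to show what is being asked. This sketch is an assumption-laden illustration: `draw_thresholds`, `threshold_query`, and the scale name `w` are hypothetical, and the scan stands in for the actual ℓ_∞-product NNS data structure that the reduction relies on. Each point is a tuple of coordinates in the base metric M, and `d_M` is any distance function on M.

```python
import random


def draw_thresholds(dim, w, rng):
    """Random thresholds r_i in (0, w]; 1 - rng.random() avoids r_i = 0."""
    return [w * (1.0 - rng.random()) for _ in range(dim)]


def threshold_query(points, q, d_M, r):
    """Return all p with max_i d_M(q_i, p_i) / r_i < 1.

    This is a ball query in an l_inf-product of M whose i-th coordinate
    is scaled by 1 / r_i.
    """
    return [
        p for p in points
        if max(d_M(qi, pi) / ri for qi, pi, ri in zip(q, p, r)) < 1
    ]


# Toy base metric: the real line with |a - b|.
def line_dist(a, b):
    return abs(a - b)


points = [(0, 0), (1, 5), (3, 3)]
print(threshold_query(points, (0, 0), line_dist, [2, 4]))
print(threshold_query(points, (0, 0), line_dist, [4, 6]))
print(draw_thresholds(2, 10.0, random.Random(0)))
```

With the real line as M this is the ℓ_1 primitive from the previous slide; swapping in another `d_M` gives the product version without changing the query.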

21
Thm 2: Final
Thus, sufficient to solve the primitive for the ℓ_1-product of M:
  “return all points p such that max_i d_M(q_i, p_i) / r_i < 1”
  (in fact, for k independent choices of (r_1, …, r_d))
We reduced NNS over the ℓ_1-product of M to several instances of NNS over the ℓ_∞-product of M (with appropriately scaled coordinates)
Approximation is O(1) · O(log log n)
Done!

22
Take-home message:
Can embed combinatorial metrics into iterated product spaces
  Works for Ulam (= edit on non-repetitive strings)
Approach bypasses non-embeddability results into usual-suspect spaces like ℓ_1, (ℓ_2)^2, …
Open:
  Embeddings for edit over {0,1}^d, EMD, other metrics?
  Understanding product spaces? [Jayram-Woodruff]: sketching
