Overcoming the L1 Non-Embeddability Barrier


1 Overcoming the L1 Non-Embeddability Barrier
Robert Krauthgamer (Weizmann Institute)
Joint work with Alexandr Andoni and Piotr Indyk (MIT)

2 Algorithms on Metric Spaces
Fix a metric M, fix a computational problem, and solve the problem under M.
Example metrics: Hamming distance, Ulam metric, earthmover distance.
Edit distance: ED(x,y) = minimum number of edit operations that transform x into y, where an edit operation inserts, deletes, or substitutes a character. ED( , ) = 2
Example problems: compute the distance between x and y; Nearest Neighbor Search.
Nearest Neighbor Search: preprocess n strings so that, given a query string, the closest string to it can be found.
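As a concrete reading of the ED definition above, here is a minimal Python sketch of the textbook dynamic program for edit distance. The function name `edit_distance` and the example strings are illustrative assumptions; the strings shown on the original slide are not preserved in this transcript.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning x into y."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Hypothetical example: drop the leading "b", then append "s".
print(edit_distance("banana", "ananas"))
```

This runs in time proportional to the product of the two string lengths, which is the kind of cost the approximation results later in the talk aim to beat.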

3 Motivation for Nearest Neighbor
Many applications:
Image search (Euclidean distance, earthmover distance)
Processing of genetic information, text processing (edit distance)
Many others…
Generic search engine

4 A General Tool: Embeddings
An embedding of M into a host metric $(H, d_H)$ is a map $f : M \to H$ that preserves distances approximately.
It has distortion $A \ge 1$ if for all $x, y \in M$: $d_M(x,y) \le d_H(f(x), f(y)) \le A \cdot d_M(x,y)$
Why? If H is “easy” (computational problems like NNS can be solved efficiently in it), then we get good algorithms for the original space M!
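The distortion condition can be checked by brute force on a small finite metric. The sketch below is only an illustration under assumptions: the helper name `distortion`, the 4-cycle example, and the map into the real line are hypothetical and do not come from the talk.

```python
from itertools import combinations


def distortion(points, d_M, d_H, f):
    """Smallest A with d_M(x,y) <= d_H(f(x),f(y)) <= A * d_M(x,y) on `points`.

    Returns None if f shrinks some distance, since the definition above
    requires a non-contracting map. Points must be pairwise distinct.
    """
    worst = 1.0
    for x, y in combinations(points, 2):
        dm, dh = d_M(x, y), d_H(f(x), f(y))
        if dh < dm:
            return None
        worst = max(worst, dh / dm)
    return worst


# Hypothetical example: the 4-cycle with shortest-path distance, mapped to the line.
cycle = [0, 1, 2, 3]
d_cycle = lambda i, j: min(abs(i - j), 4 - abs(i - j))
d_line = lambda a, b: abs(a - b)
print(distortion(cycle, d_cycle, d_line, lambda i: i))
```

In this example the pair (0, 3) is adjacent on the cycle but three apart on the line, so that pair determines the distortion.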

5 Host space?
Popular target metric: $\ell_1$ = real space with $d_1(x,y) = \sum_i |x_i - y_i|$
Have efficient algorithms:
Distance estimation: $O(d)$ for d-dimensional space (often less)
NNS: c-approximation with $O(n^{1/c})$ query time and $O(n^{1+1/c})$ space [IM98]
Powerful enough for some things… Known bounds on the distortion of embedding into $\ell_1$:
Metric | References (upper; lower) | Upper bound | Lower bound
Edit distance over $\{0,1\}^d$ | [OR05]; [KN05,KR06,AK07] | $2^{\tilde{O}(\sqrt{\log d})}$ | $\Omega(\log d)$
Ulam (= edit distance over permutations) | [CK06]; [AK07] | $O(\log d)$ | $\tilde{\Omega}(\log d)$
Block edit distance over $\{0,1\}^d$ | [MS00, CM07]; [Cor03] | $\tilde{O}(\log d)$ | 4/3
Earthmover distance in $\mathbb{R}^2$ (sets of size s) | [Cha02, IT03]; [NS07] | $O(\log s)$ | $\Omega(\log^{1/2} s)$
Earthmover distance in $\{0,1\}^d$ (sets of size s) | [AIK08]; [KN05] | $O(\log s \cdot \log d)$ | $\Omega(\log s)$

6 Below logarithmic?
Cannot work with $\ell_1$. Other possibilities?
$(\ell_2)^p$ = real space with $\mathrm{dist}_{2^p}(x,y) = \|x-y\|_2^p$: bigger and algorithmically tractable, but not rich enough (often the same lower bounds).
$\ell_\infty$ = real space with $\mathrm{dist}_\infty(x,y) = \max_i |x_i - y_i|$: rich (includes all metrics), but usually not computationally efficient (high dimension).
And that’s roughly it… (at least for efficient NNS)

7 Meet our new host
Iterated product space $P_{2^2,\infty,1}$, with distance $d_{2^2,\infty,1}$
[Figure: blocks of α coordinates are compared under $d_1$; β such blocks are combined into $d_{\infty,1}$; γ of those are combined into $d_{2^2,\infty,1}$.]
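The defining formula for the host distance did not survive in this transcript. The sketch below assumes the natural reading of the subscripts: an $\ell_1$ distance on each block of α coordinates, a maximum over β blocks, and a sum of squares over γ groups, i.e. $d_{2^2,\infty,1}(x,y) = \sum_{a=1}^{\gamma} \big(\max_{b=1}^{\beta} \|x_{a,b} - y_{a,b}\|_1\big)^2$. Both this formula and the function names are assumptions made for illustration.

```python
def d_1(u, v):
    """l_1 distance between two vectors of length alpha."""
    return sum(abs(a - b) for a, b in zip(u, v))


def d_inf_1(U, V):
    """l_infinity-product of l_1: the largest l_1 distance over beta blocks."""
    return max(d_1(u, v) for u, v in zip(U, V))


def d_22_inf_1(X, Y):
    """Assumed (l_2)^2-product: sum over gamma groups of squared d_inf_1."""
    return sum(d_inf_1(U, V) ** 2 for U, V in zip(X, Y))


# Hypothetical point with (gamma, beta, alpha) = (2, 2, 3), compared to the origin.
X = [[[0, 0, 0], [1, 1, 1]],
     [[2, 0, 0], [0, 0, 0]]]
Y = [[[0, 0, 0], [0, 0, 0]],
     [[0, 0, 0], [0, 0, 0]]]
print(d_22_inf_1(X, Y))
```

Nesting plain functions mirrors the nesting of the product: each level only needs the distance function of the level below it.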

8 Why $P_{2^2,\infty,1}$?
Because we can…
Rich: Theorem 1. Ulam embeds into $P_{2^2,\infty,1}$ with O(1) distortion. Dimensions $(\gamma,\beta,\alpha) = (d, \log d, d)$.
Algorithmically tractable: Theorem 2. $P_{2^2,\infty,1}$ admits NNS on n points with $O(\log\log n)$ approximation, $O(n^\varepsilon)$ query time, and $O(n^{1+\varepsilon})$ space.
In fact, there is more for Ulam…

9 Our Algorithms for Ulam
ED( , ) = 2
Ulam = edit distance on strings where each symbol appears at most once
A classical distance between rankings
Exhibits hardness of misalignments (as in general edit)
All lower bounds same as for general edit (up to $\tilde{\Theta}(\cdot)$)
Distortion of embedding into $\ell_1$ (and $(\ell_2)^p$, etc.): $\tilde{\Theta}(\log d)$
Our approach implies new algorithms for Ulam:
1. NNS with $O(\log\log n)$ approximation, $O(n^\varepsilon)$ query time. Can improve to $O(\log\log d)$ approximation.
2. Sketching with O(1)-approximation in $\log^{O(1)} d$ space
3. Distance estimation with O(1)-approximation in time
[BEKMRRS03]: when $ED \approx d$, approximation $d^\varepsilon$ in $O(d^{1-2\varepsilon})$ time
If we ever hope for approximation $\ll \log d$ for NNS under general edit, first we have to get it under Ulam!

10 Theorem 1
Theorem 1. Can embed Ulam into $P_{2^2,\infty,1}$ with O(1) distortion. Dimensions $(\gamma,\beta,\alpha) = (d, \log d, d)$.
Proof: “geometrization” of Ulam characterizations
Previously studied in the context of testing monotonicity (sortedness):
Sublinear algorithms [EKKRV98, ACCL04]
Data-stream algorithms [GJKK07, GG07, EH08]

11 Thm 1: Characterizing Ulam
Consider permutations x, y over [d]. Assume for now: x = identity permutation.
Idea: count the number of characters in y to delete to obtain an increasing sequence (≈ Ulam(x,y)). Call them faulty characters.
Issues: Ambiguity… How do we count them?
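When x is the identity, the smallest number of characters to delete from y so that the remainder is increasing equals d minus the length of a longest increasing subsequence of y. The sketch below computes that count with the standard patience-sorting method; the function names and the example permutations are illustrative assumptions, not taken from the slides.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of a longest strictly increasing subsequence of seq."""
    tails = []  # tails[k] = smallest possible last value of an increasing run of length k+1
    for v in seq:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(tails)


def deletions_to_sorted(y):
    """Fewest characters to delete from y so that the rest is increasing."""
    return len(y) - lis_length(y)


# Hypothetical example: only the character 1 is out of place.
print(deletions_to_sorted([2, 3, 4, 5, 1]))
```

This gives the count but not a canonical set of deleted characters, which is the ambiguity the slide raises: several different deletion sets can achieve the same minimum.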

12 Thm 1: Characterization – inversions
Definition: chars a < b form an inversion if b precedes a in y.
How to identify a faulty char?
Has an inversion? Doesn’t work: all chars might have an inversion.
Has many inversions? Still can miss “faulty” chars.
Has many inversions locally? Same problem.
Check if either is true!

13 Thm 1: Characterization – faulty chars
Definition 1: a is faulty if there exists K > 0 such that a is inverted w.r.t. a majority of the K symbols preceding a in y (OK to consider only $K = 2^k$).
Lemma [ACCL04, GJKK07]: # faulty chars = Θ(Ulam(x,y)).
Example: 4 characters preceding 1 (all inversions with 1).
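Definition 1 translates almost directly into code. The sketch below assumes x is the identity, reads “majority” as strictly more than half, and only tries values of K up to the number of symbols that actually precede a; these readings, the function name, and the example permutations are assumptions.

```python
def faulty_chars(y):
    """Characters of y that are faulty under Definition 1 (x = identity)."""
    faulty = []
    for i, a in enumerate(y):
        inverted = 0
        # Grow the window of K preceding symbols one symbol at a time.
        for K in range(1, i + 1):
            if y[i - K] > a:      # a larger symbol precedes a: an inversion
                inverted += 1
            if 2 * inverted > K:  # strict majority of the K preceding symbols
                faulty.append(a)
                break
    return faulty


# Hypothetical example: 1 is preceded by four larger characters.
print(faulty_chars([2, 3, 4, 5, 1]))
```

On this example only the character 1 is flagged, which agrees with the single deletion needed to sort the sequence.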

14 Thm 1: Characterization → Embedding
To get an embedding, need:
1. Symmetrization (neither string is the identity)
2. Deal with “exists”, “majority”…?
To resolve (1), use instead X[a;K] …
Definition 2: a is faulty if there exists $K = 2^k$ such that $|X[a;2^k] \,\Delta\, Y[a;2^k]| > 2^k$ (symmetric difference).
Example: X[5;4], Y[5;4]
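A sketch of Definition 2 follows. The transcript elides what X[a;K] is, so two assumptions are made here and should be read as such: X[a;K] is taken to be the set of K symbols immediately preceding a in x, and both strings are padded on the left with the same dummy symbols so that every such set has exactly K elements. The function names and the example are likewise hypothetical.

```python
def preceding_set(s, a, K):
    """Assumed meaning of S[a;K]: the K symbols immediately preceding a in s.

    The string is padded on the left with dummy symbols ("pad", j) that are
    shared by all strings, so the result always has exactly K elements.
    """
    padded = [("pad", j) for j in range(K, 0, -1)] + list(s)
    i = padded.index(a)
    return set(padded[i - K:i])


def faulty_symmetric(x, y):
    """Symbols that are faulty under Definition 2, for K = 1, 2, 4, ... up to len(x)."""
    faulty = []
    for a in x:
        K = 1
        while K <= len(x):
            diff = preceding_set(x, a, K) ^ preceding_set(y, a, K)  # symmetric difference
            if len(diff) > K:
                faulty.append(a)
                break
            K *= 2
    return faulty


# Hypothetical example: y moves the character 1 from the front to the back.
x = [1, 2, 3, 4, 5]
y = [2, 3, 4, 5, 1]
print(faulty_symmetric(x, y))
```

Because the test compares two sets symmetrically, swapping x and y flags the same symbols, which is the symmetrization the slide asks for.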

15 Thm 1: Embedding – final step
We have: an indicator, equal to 1 iff the condition is true. Replace by weight?
Final embedding: ( )^2
Example: $X[5;2^2]$, $Y[5;2^2]$

16 Theorem 2
Theorem 2. $P_{2^2,\infty,1}$ admits NNS on n points with $O(\log\log n)$ approximation, $O(n^\varepsilon)$ query time, and $O(n^{1+\varepsilon})$ space, for any small $\varepsilon$ (ignoring $(\alpha\beta\gamma)^{O(1)}$).
A rather general approach: “LSH” on $\ell_1$-products of general metric spaces.
Of course, cannot do, but can reduce to $\ell_\infty$-products.

17 Thm 2: Proof
Let’s start from basics: $\ell_1^\alpha$
[IM98]: c-approximation with $O(n^{1/c})$ query time and $O(n^{1+1/c})$ space (ignoring $\alpha^{O(1)}$)
OK, what about an $\ell_\infty$-product?
Suppose: NNS for M with $c_M$-approximation, $Q_M$ query time, $S_M$ space.
Then: NNS for the $\ell_\infty$-product of M with $O(c_M \cdot \log\log n)$-approximation, $\tilde{O}(Q_M)$ query time, $O(S_M \cdot n^{1+\varepsilon})$ space. [I02]

18 Thm 2: What about $(\ell_2)^2$-product?
Enough to consider (for us, M is the $\ell_1$-product)
Off-the-shelf? [I04]: gives space ~n or >log n approximation
We reduce to multiple NNS queries under
Instructive to first look at NNS for standard $\ell_1$ …

19 Thm 2: Review of NNS for $\ell_1$
LSH family: a collection H of hash functions such that, for random $h \in H$ (parameter $w > 0$): $\Pr[h(q) = h(p)] \approx 1 - \|q-p\|_1 / w$
Query just uses the primitive: “return all points p such that h(q) = h(p)”
Can obtain H by imposing a randomly shifted grid of side length $w$.
Then, for h defined by $r_i \in [0, w]$ at random, the primitive becomes: “return all p s.t. $|q_i - p_i| < r_i$ for all $i \in [d]$”
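A randomly shifted grid is easy to simulate. The sketch below is an illustration under assumptions: the side-length symbol is written `w` because the original symbol is lost in this transcript, and the function names and test points are hypothetical. With an independent shift per coordinate, two points collide with probability equal to the product of $1 - |q_i - p_i|/w$ over coordinates, which is close to $1 - \|q-p\|_1/w$ when the points are near each other relative to w.

```python
import math
import random


def make_grid_hash(dim, w, rng):
    """Hash a point to the cell of a grid of side w with a random shift per axis."""
    shifts = [rng.uniform(0, w) for _ in range(dim)]

    def h(p):
        return tuple(math.floor((p_i + s_i) / w) for p_i, s_i in zip(p, shifts))

    return h


def collision_rate(q, p, w, trials, rng):
    """Empirical fraction of random grids in which q and p share a cell."""
    hits = 0
    for _ in range(trials):
        h = make_grid_hash(len(q), w, rng)
        hits += h(q) == h(p)
    return hits / trials


# Hypothetical example: l_1 distance 0.3 with w = 1.
rate = collision_rate((0.0, 0.0), (0.1, 0.2), 1.0, 20000, random.Random(0))
print(rate)
```

A nearest-neighbor data structure built on this family would store each point under its hash and, at query time, return the points sharing the query's cell.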

20 Thm 2: LSH for $\ell_1$-product
Intuition: abstract LSH!
Recall we had: for $r_i$ random from $[0, w]$, point p is returned if for all i: $|q_i - p_i| < r_i$
Equivalently, for all i: $|q_i - p_i| / r_i < 1$, an $\ell_\infty$ product of $\mathbb{R}$!
For $\ell_1$: “return all p s.t. $|q_i - p_i| < r_i$ for all $i \in [d]$”
For the $\ell_1$-product of M: “return all points p such that $\max_i d_M(q_i, p_i) / r_i < 1$”
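The abstract primitive can be written as a brute-force filter over a product of metric spaces. This is only a sketch of what the primitive returns, not of the data structure the talk uses to answer it efficiently; the function name, the choice of M as the plane with $\ell_1$ distance, and the example points are assumptions.

```python
def primitive_query(points, q, r, d_M):
    """Return all p with max_i d_M(q_i, p_i) / r_i < 1.

    Each point is a tuple of components from the metric space M,
    and every threshold r_i is assumed to be positive.
    """
    return [
        p for p in points
        if max(d_M(q_i, p_i) / r_i for q_i, p_i, r_i in zip(q, p, r)) < 1
    ]


def l1(u, v):
    """l_1 distance, used here as the component metric d_M."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical example with two components per point.
q = ((0, 0), (0, 0))
near = ((1, 0), (0, 1))   # each component is at distance 1 from q
far = ((3, 0), (0, 0))    # first component is at distance 3 from q
print(primitive_query([near, far], q, (2, 2), l1))
```

Dividing each component distance by its own threshold is what turns the per-coordinate test into a single maximum, i.e. a ball in a scaled $\ell_\infty$-product.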

21 Thm 2: Final
Thus, sufficient to solve the primitive: “return all points p such that $\max_i d_M(q_i, p_i) / r_i < 1$” (in fact, for k independent choices of $(r_1, \ldots, r_d)$)
We reduced NNS over the $\ell_1$-product of M to several instances of NNS over the $\ell_\infty$-product of M (with appropriately scaled coordinates)
Approximation is $O(1) \cdot O(\log\log n)$
Done!

22 Take-home message
Can embed combinatorial metrics into iterated product spaces
Works for Ulam (= edit on non-repetitive strings)
Approach bypasses non-embeddability results into usual-suspect spaces like $\ell_1$, $(\ell_2)^2$, …
Open:
Embeddings for edit over $\{0,1\}^d$, EMD, other metrics?
Understanding product spaces? [Jayram-Woodruff]: sketching
Thank you!

