Lower Bounds for Edit Distance Estimation

1 Lower Bounds for Edit Distance Estimation
Robert Krauthgamer (IBM Almaden). Joint work with Alexandr Andoni (MIT). April 2007.

2 Edit Distance (Levenshtein Distance)
Two strings x, y ∈ Σ^d (for simplicity, of the same length).
ED(x,y) = minimum number of edit operations that transform x into y.
Edit operation = character insertion/deletion/substitution.
Examples: ED(00000, 11101) = 4; ED(01010, 10101) = 2.
Applications: genomics, text processing, web search.
x = ACTACTACTA, y = CCACTACATA: ED(x,y) = 3.
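The quadratic dynamic program behind these examples (discussed on the next slide) can be sketched as follows; this is a minimal reference implementation, and the function name is ours:

```python
def edit_distance(x, y):
    """Classic O(d^2) dynamic program (Wagner-Fischer), one row at a time."""
    prev = list(range(len(y) + 1))            # distances from the empty prefix of x
    for i, cx in enumerate(x, 1):
        cur = [i]                             # deleting all i characters of x[:i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,       # delete cx
                           cur[j - 1] + 1,    # insert cy
                           prev[j - 1] + (cx != cy)))  # substitute (free on a match)
        prev = cur
    return prev[-1]
```

Running it on the slide's examples reproduces the stated values (for the DNA pair: substitute the first A→C, delete the T at position 3, insert an A).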

3 Basic Computational Tasks
1) Compute the distance ED(x,y):
O(d²) by dynamic programming; O(d²/log d) when |Σ| = O(1) [Masek-Paterson'80].
Approximation in linear time? Currently d^{1/3} [Batu-Ergün-Sahinalp'06], improving over [Bar-Yossef-Jayram-K.-Ravi'04] and [Batu-Ergün-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami'03].
For Hamming distance: O(d) time.
Central question: Is edit distance really harder than Hamming? Fact: no computational separation between the two is known.
2) Nearest Neighbor Search (NNS): preprocess n input strings using poly(d,n) storage, so as to answer queries as fast as possible.
No sublinear (in n) algorithm is known.
Approximation in poly(d, log n) query time? Currently 2^{O(√log d)} [Ostrovsky-Rabani'05], improving over [Indyk'04] and [Bar-Yossef-Jayram-K.-Ravi'04].
For Hamming: 1+ε approximation in poly(d, log n) query time [Indyk-Motwani'98], [Kushilevitz-Ostrovsky-Rabani'98].

4 Distance Estimation x2d y2d … CCA bits Decide whether: randomness
Communication complexity model: Two-party protocol Shared randomness Promise (gap) version A = approximation factor CCA = min. # bits to decide whp CCA bits Alice Bob Referee sB(y) sA(x) Sketching model: Referee decides based on s(x), s(y) SKA = min. sketch size to decide Fact: SKA ¸ CCA Decide whether: ED(x,y) ¸ R or ED(x,y) · R/A Lower Bounds for Edit Distance Estimation
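For contrast with the Hamming case, here is a toy one-bit sketch (the parity of a random subset of coordinates, in the spirit of [Kushilevitz-Ostrovsky-Rabani'98]; the exact scheme and all parameters below are our illustrative choices, not the paper's): the referee's acceptance probability already separates HD ≥ R from HD ≤ R/A.

```python
import random

def one_bit_sketch(bits, mask):
    """Parity of the masked coordinates: a 1-bit sketch for Hamming distance."""
    return sum(b for b, keep in zip(bits, mask) if keep) % 2

def agreement_rate(hd, d, p, trials, rng):
    """Empirical Pr[s(x) = s(y)] over pairs at Hamming distance hd, with each
    coordinate kept in the mask independently with probability p.
    Analytically this equals (1 + (1 - 2p)**hd) / 2."""
    agree = 0
    for _ in range(trials):
        x = [rng.randint(0, 1) for _ in range(d)]
        y = x[:]
        for i in rng.sample(range(d), hd):   # flip exactly hd coordinates
            y[i] ^= 1
        mask = [rng.random() < p for _ in range(d)]
        agree += one_bit_sketch(x, mask) == one_bit_sketch(y, mask)
    return agree / trials
```

With d = 1000, R = 100, A = 10 and p = 1/(2R), the analytic agreement rates are about 0.68 for far pairs versus about 0.95 for close pairs, so repeating the 1-bit sketch O(1) times decides the gap whp.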

5 Results Preview
We show a tradeoff between approximation and communication.
Main Theorem: CC_A ≥ Ω(log((log d) / (A·log A))).
Corollary 1: approximation A = O(1) requires CC_A ≥ Ω(log log d).
Corollary 2: sketch size SK_A = O(1) requires A ≥ Ω(log d / log log d).
For Hamming distance: SK_{1+ε} = Θ(1/ε²) ⇒ CC_{O(1)} ≤ O(1) [Kushilevitz-Ostrovsky-Rabani'98], [Woodruff'04].
First model where edit distance is provably harder than Hamming!
Several implications for embeddings and algorithms…

6 Embedding – A Basic Tool
An embedding of edit distance into a metric (X, ρ) is a map f : Σ^d → X.
It has distortion A ≥ 1 if ED(x,y) ≤ ρ(f(x), f(y)) ≤ A·ED(x,y) for all x, y ∈ Σ^d.
Low-distortion embeddings reduce algorithmic tasks (like NNS) to "easier" spaces like ℓ1.
Indeed, [Ostrovsky-Rabani'05] show distortion ≤ 2^{O(√log d)}.
Lower bounds: Ω(log d) [Rabani-K.'06], improving over [Khot-Naor'05].
Caveat: embedding a fixed power (e.g. ED^{0.1}) is generally easier (because ℓ1 ⊆ ℓ2² ⊆ ℓ2⁴ ⊆ …) and suffices for applications like NNS.
The lower bound for ℓ2² is 3/2 [Andoni-Deza-Gupta-Indyk-Raskhodnikova'03].
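Distortion of a concrete map can be measured by brute force on a small set of strings. As an illustrative example of ours (not from the paper), the character-histogram map into ℓ1 is isometric up to scaling on some families, yet has unbounded distortion in general; all helper names are ours:

```python
def edit_distance(x, y):
    """Exact Levenshtein distance (quadratic DP), used only to measure distortion."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]

def histogram(x):
    """Embed a binary string as its character histogram, a point of l1."""
    return (x.count("0"), x.count("1"))

def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def distortion(points):
    """Best A with ED(x,y) <= c * l1(f(x),f(y)) <= A * ED(x,y) over some scaling c,
    measured on the given finite point set."""
    ratios = [l1(histogram(x), histogram(y)) / edit_distance(x, y)
              for i, x in enumerate(points) for y in points[i + 1:]]
    return float("inf") if min(ratios) == 0 else max(ratios) / min(ratios)
```

Pairs with equal histograms (e.g. "0011" vs "0101") force infinite distortion, matching the intuition that ℓ1 embeddings of edit distance are nontrivial.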

7 Hierarchy of Algorithms
Two theorems in [Indyk-Motwani'98, Kushilevitz-Ostrovsky-Rabani'98]:
1) Embedding into ℓ1 implies O(1)-size sketching: distortion A ⇒ (1+ε)A-approximation.
2) Sketching implies NNS (same approximation): sketch size s ⇒ poly(n^s) preprocessing and O(s·log n) query time.
Thus, a hierarchy of approximations:
ℓ1 embedding ⇒ ℓ2² embedding ⇒ … ⇒ ℓ2^{10} embedding ⇒ … ⇒ O(1)-size sketch ⇒ NNS
Revised goal: preclude small sketch size.
Known communication lower bounds: ℓ∞ [Saks-Sun'02, Bar-Yossef-Jayram-Kumar-Sivakumar'04], ℓ1 [Woodruff'04], Earthmover distance [Andoni-Indyk-K.'07].

8 Results for Sketching and Embeddings
Recall the Main Theorem and its corollaries:
Corollary 1: approximation A = O(1) requires CC_A ≥ Ω(log log d).
Corollary 2: sketch size SK_A = O(1) requires A ≥ Ω(log d / log log d).
By the hierarchy argument:
Corollary 3: embedding into ℓ1 requires distortion Ω(log d / log log d).
Corollary 4: even embedding a fixed power of the metric requires such distortion.
Important remark: all results hold also for the Ulam metric.
Ulam metric = edit distance on permutations.
Nearly tight with the O(log d)-distortion embedding into ℓ1 [Charikar-K.'06].

9 Edit Distance Approximation
Metric | Into ℓ1 | Into ℓ2² | O(1)-size sketch
Edit distance on {0,1}^d | 2^{O(√log d)} [Ostrovsky-Rabani'05]; Ω(log d) [Rabani-K.'06] | 3/2 [ADGIR'03] | Ω*(log d) (this paper)
Ulam metric (permutations) | O(log d) [Charikar-K.'06] | 4/3 [Cormode'03] | Ω*(log d) (this paper)
Block edit distance | O*(log d) [Cormode-Muthukrishnan'02] | ? | ?

10 Techniques and Proof Outline
Step 1: Apply Yao's minimax theorem — analyze the distributional complexity of deterministic protocols.
Step 2: Reduce communication complexity to Boolean functions — analyze the advantage of a 1-bit sketching protocol, i.e. functions f, g : Σ^d → {0,1}; show that Pr_{(x,y)∼μ}[f(x) = g(y)] is almost the same for close and far pairs.
Step 3: Go to the Fourier basis — analyze one Fourier level λ at a time.
Step 4: Analyze the correlation between x, y in λ fixed positions j1, …, jλ:
For a high level λ, let μ_close, μ_far include ε-noise (to destroy the correlation).
For a low level λ, let μ_close, μ_far include block rotations.
Step 5: Reduce Ulam to {0,1}^d.

11 Steps 1+2: Reduce to Boolean functions
Definition: a 1-bit LSH protocol is a sketching protocol with |s_A| = |s_B| = 1, where the referee accepts iff s_A(x) = s_B(y).
Let Advantage = Pr[Accept | close] − Pr[Accept | far].
By Yao's minimax, it suffices to bound the distributional complexity: s_A, s_B become functions f, g : Σ^d → {0,1}, and the referee accepts iff f(x) = g(y).
Lemma [Andoni-Indyk-K.'07]: CC ≤ k ⇒ ∃ a 1-bit LSH protocol with Pr[success] ≥ ½ + 2^{−k}/4.
Proof idea: reduce to 1 bit by "guessing all other bits"; a small advantage remains. Note that with one bit, the referee can only test equality.
Remains to show: a distribution μ = (μ_far + μ_close)/2 such that for all f, g: Pr_μ[success] < ½ + O(A·log A / log d).

12 Basic Idea for the Hard Distribution
Fix ε_far = 0.1, ε_close = ε_far/A, and R = √d.
For t ∈ {close, far}, the distribution μ_t generates (x, y) as follows:
1) Pick x ∈ Σ^d at random.
2) z ← ρ_t(x) (namely an ε_t-rotation of an L-block).
3) y ← noise(z) (namely changing R/A positions at random).
WHP:
Choose |Σ| = d³ ⇒ x, z, y are permutations (all symbols distinct).
Choose L = R ⇒ ED(x,z) ≈ ε_t·R.
ED(x,y) ≈ ED(x,z) + R/A ⇒ ED(x,y) ≈ ε_t·R for (x,y) ∈ supp(μ_t).

13 “Position-Tests” View
Simplifying assumption (making the algorithms more powerful): give more power to the upper bound. Consider the following λ-test: the protocol fixes λ positions (in x) and "sees" where these λ symbols end up (in y).
Example: a distribution μ_t that does NOT work:
Partition x into d/L blocks (L ≈ a random power of 2) and rotate each block by ε_t·L.
A 2-test breaks this distribution: pick two random positions (symbols) at distance R/2, and "see" whether the two symbols are at the same distance in y (after the rotation); they are not with probability ≈ 2ε_t.

14 Fooling λ-Tests
We will fool all these λ-tests.
If λ is "big": the random noise destroys the test — at least one of the λ positions is randomized by the noise.
If λ is "small": we want "position indistinguishability":
Fix any λ positions (symbols), and let P_t = the distribution of the positions of these λ symbols after the rotation ρ_t.
Then the statistical distance between P_close and P_far is small.

15 Hard Distribution μ
0) Pick ε to be ε_far ≈ 0.1 or ε_close ≈ ε_far/A, with probability ½ each.
1) Choose a random x ∈ Σ^d.
2) z = ρ(x), where ρ is a "rotation operation":
Pick a random start position for the block.
Pick the block length L ∈ {β¹, β², β³, …, d^{0.01}}, where β ≈ 1.1 and L = β^l with probability ∝ β^{−l} = 1/L.
Goal: any symbol, conditioned on being inside the block, is equally likely to be in a block of length β¹, β², β³, …, d^{0.01}.
Rotate the L-block by εL, to the left or to the right.
3) y = noise_{R/A}(z).
Observe: E[L] = Σ_l (β^l · β^{−l}) ≈ 0.01·log_β d, hence ED(x,z) ≈ E[εL] = ε·E[L] ≈ ε·log d.
Fine print: choose w = R/log d >> 1 rotations independently; this guarantees the distance bound holds w.h.p., not just in expectation.

16 Position Indistinguishability: λ = 1
Let P_t be the distribution of the positions of the λ symbols after applying the rotation ρ_t.
Want: the statistical distance ||P_close − P_far||₁ is small.
Warm-up: λ = 1.
We will count the common probability mass, i.e. the "cancellation", and prove the statistical distance is < O((log A)/d).
Probability of not being in a (single) rotated block: 1 − E[L]/d ≈ 1 − (log d)/d; over all w rotations, ≈ 1 − R/d.
Now condition on being in the block.
First calculation: after conditioning, the block length is equally likely β¹, β², β³, …, d^{0.01}; thus the symbol moves, equally likely, by εβ¹, εβ², εβ³, …, εd^{0.01} positions.
If ε_far/ε_close = β^a: match a 1 − O(a/log d) fraction of the remaining mass.
Not quite right…

17 More on λ = 1
Correct calculation:
With probability (1−ε): move by εβ¹, εβ², εβ³, …, εd^{0.01}, to the right or to the left.
With probability ε: move by (1−ε)β¹, (1−ε)β², (1−ε)β³, …, (1−ε)d^{0.01}, to the right or to the left.
If (1−ε)/ε = β^b: match a 1 − O(b/log d) fraction of the mass within the same ε.
Out of the remaining mass, match a 1 − O(a/log d) fraction between ε_far and ε_close.
In total: the unmatched mass is O((a+b)/log d) of the in-block mass ≈ (log d)/d, i.e.
||P_close − P_far||₁ ≤ O((a+b)/log d)·(log d)/d = O((log A)/d).

18 Position Indistinguishability: λ>1
Almost the same idea:
If L >> the diameter of the set of symbols: treat them as if they were one symbol.
If L << the diameter of the set of symbols: each symbol behaves independently of the others.
There remain only ~λ² "bad L's" (one per pairwise distance); a more careful analysis improves this bound to λ.
Contribution to the statistical distance:
λ · O((log A)/d) for the scales where each symbol behaves independently;
λ · O((log A)/d) from the "bad L's" (when analyzed globally).
Total: ||P_close − P_far||₁ < λ · O((log A)/d).

19 Why we care about λ-Tests
Recall: the protocol is given by f, g : Σ^d → {+1, −1}. Simplification: Σ = {0,1}.
We showed ||P_close − P_far||₁ < λ · O((log A)/d).
Use the Fourier expansion to prove:
Pr[success] < ½ + max_λ { ||P_close − P_far||₁ · Pr_noise[λ symbols not hit by the noise] }.
When λ = d·A/R, there is constant probability of being hit by the R/A-noise; thus
Pr[success] < ½ + (d·A/R) · O((log A)/d) < ½ + O(A·log A / log d). QED
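The "constant probability of being hit" step is elementary to check numerically, under an independent-hits simplification of the noise (the construction actually re-randomizes exactly R/A positions) and toy parameters of our choosing:

```python
def survival(d, noise, lam):
    """Pr[a fixed set of lam positions is untouched] when each of d positions is
    hit independently with rate noise/d (an independence simplification)."""
    return (1 - noise / d) ** lam
```

With d = 10^6, R = √d = 1000, A = 10: a level of λ = d·A/R = 10^4 positions survives the R/A = 100-position noise with probability ≈ e^{−1}, a constant, while small levels survive almost surely.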

20 Step 5: Reduce Ulam to {0,1}^d
Theorem: let x, y be permutations and let π : Σ → {0,1} be a random function. Then whp (i.e. with probability ≥ 1 − 2^{−Ω(ED(x,y))}):
Ω(1)·ED(x,y) ≤ ED(π(x), π(y)) ≤ ED(x,y).
Corollary: SK_A(Ulam_d) ≤ SK_{Ω(A)}({0,1}^d) and CC_A(Ulam_d) ≤ CC_{Ω(A)}({0,1}^d).
Proof idea: let k = ED(x,y).
Find S = {k/4 disjoint pairs of symbols that are inverted in x w.r.t. y}.
An alignment of x, y with cost < k/8 must be "wrong" in at least k/8 places, so an alignment of π(x), π(y) with cost < k/8 must align at least k/8 positions against random bits.
Apply a union bound by counting the number of alignments with cost < k/8.
Caveat: too many alignments! Show that the number of their restrictions to S is ≤ c^{k/8}.

21 Open Problems
Improve the communication complexity lower bound? The reduction to Boolean functions is "lossy".
Lower bounds for streaming algorithms for edit distance? Some upper bounds appear in [Gopalan-Jayram-K.-Kumar'07].
Variants of edit distance, e.g. block edit distance.
Better understanding of the NNS "landscape":
Identify "computational models" that yield NNS algorithms; examples: O(1)-size sketching and Locality-Sensitive Hashing.
Impossibility results that exclude "a family of algorithms".
Thank you!

