Embedded Stringology, Piotr Indyk (MIT)

Presentation transcript:

1 Embedded Stringology Piotr Indyk MIT

2 Combinatorial Pattern Matching
Stringology [Galil]: algorithms for strings (as well as trees and other plants)
– Classic/standard stringology: exact string matching, suffix trees, etc. Tools: automata theory, combinatorics on words
– Non-standard stringology: approximate/noisy pattern matching with mismatches, dictionary problems. Tool: FFT

3 Plan of the talk
– Overview of problems
– Embeddings: what, why?
– Embeddings for stringology
– Open problems

4 Noisy Pattern Matching
– Real-life data is often noisy
– Algorithms should be robust to noise
– How to define noise? Typically, via a distance function. E.g., when searching for pattern P, we accept substrings S such that D(P,S) ≤ k

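5 Distance functions
Hamming: D(P,S) = H(P,S) = # indices i s.t. P_i ≠ S_i
– Simple and general
– Not realistic?
– [Buhler, RECOMB'01]: distance table over {A, G, T, C}:

        A  G  T  C
    A   0  2  3  3
    G   2  0  3  3
    T   3  3  0  2
    C   3  3  2  0

  Encoding: A → 0AA, G → 0GG, T → 1TT, C → 1CC
  [The slide also showed a second table over A, G, T, C with entries +1 and −2; its layout is not recoverable.]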

6 Distance functions ctd.
L_p norms:
– P_i and S_i are real numbers
– D(P,S) = ||P − S||_p

7 Distance functions ctd.
Edit distance: D(P,S) = minimum number of operations needed to transform P into S
– Typical operations:
    insertions, deletions, substitutions of characters (ED)
    swaps, etc.
    copies/reversals of whole blocks (BED)
– Operations reversible ⇒ D(P,S) = D(S,P)
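The ED definition above can be illustrated with the textbook dynamic program. This is a minimal sketch assuming unit cost for insertions, deletions and substitutions only; the name `edit_distance` is chosen here for illustration, and swaps and block operations (BED) are not covered.

```python
def edit_distance(p: str, s: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) by dynamic programming."""
    # prev[j] = distance between the empty prefix of p and s[:j]
    prev = list(range(len(s) + 1))
    for i, pc in enumerate(p, start=1):
        cur = [i]  # distance between p[:i] and the empty string
        for j, sc in enumerate(s, start=1):
            cur.append(min(
                prev[j] + 1,               # delete pc from p
                cur[j - 1] + 1,            # insert sc into p
                prev[j - 1] + (pc != sc),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]
```

The table is filled row by row, so the sketch uses O(|P|·|S|) time and O(|S|) space.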

8 Problems
Pattern matching:
– Exact: given T, |T| = n, and P, |P| = m, find a substring S of T such that D(S,P) ≤ k (if it exists)
– Approximate: can output a substring S' such that D(S',P) ≤ k(1+ε) (if a "≤ k match" exists)
Near neighbor/dictionary/post-office problem:
– Given S = S_1 … S_N, |S_i| ≤ m, build a data structure which does the following: given P, |P| ≤ m, report S_i such that D(S_i,P) ≤ k(1+ε) (if a "≤ k match" exists)
– Variant: S_1 … S_N are all m-substrings of a text T
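For the exact pattern matching problem under Hamming distance, the baseline is to check every alignment directly. The sketch below is that naive O(nm) method, not one of the faster algorithms cited in the talk; the function names are illustrative.

```python
def hamming(p, s):
    """Number of positions where two equal-length sequences differ."""
    if len(p) != len(s):
        raise ValueError("Hamming distance needs equal lengths")
    return sum(a != b for a, b in zip(p, s))


def k_mismatch_positions(text, pattern, k):
    """Start indices i where text[i:i+m] is within Hamming distance k of pattern."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if hamming(pattern, text[i:i + m]) <= k
    ]
```

For example, `k_mismatch_positions("abracadabra", "abra", 0)` reports the two exact occurrences, at positions 0 and 7.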

9 Problems Recap
– Pattern matching or near neighbor
– Under Hamming, L_p or edit distances

10 Embeddings

11 Embeddings: Definition
Assume we have M_1 = (X_1, D_1), M_2 = (X_2, D_2).
A mapping f: X_1 → X_2 is a c-embedding if for any p, q from X_1 we have D_1(p,q) ≤ D_2(f(p),f(q)) ≤ c·D_1(p,q)
Example: the distance table over {A, G, T, C}

        A  G  T  C
    A   0  2  3  3
    G   2  0  3  3
    T   3  3  0  2
    C   3  3  2  0

with the mapping A → 0AA, G → 0GG, T → 1TT, C → 1CC
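The example can be checked mechanically. The sketch below reads the slide's example as a distance table over A, G, T, C (A–G and T–C at distance 2, every other pair at distance 3) together with the map A → 0AA, G → 0GG, T → 1TT, C → 1CC into Hamming space. That reading, and the helper name `distortion`, are assumptions made for illustration.

```python
# Distance table D_1 over {A, G, T, C}, as read from the slide (an assumption).
D1 = {
    ("A", "G"): 2, ("A", "T"): 3, ("A", "C"): 3,
    ("G", "T"): 3, ("G", "C"): 3, ("T", "C"): 2,
}

# Candidate embedding f into strings compared under Hamming distance.
F = {"A": "0AA", "G": "0GG", "T": "1TT", "C": "1CC"}


def hamming(u, v):
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def distortion(d1, f):
    """Smallest c with d1(p,q) <= H(f(p),f(q)) <= c*d1(p,q); None if f contracts a pair."""
    c = 1.0
    for (p, q), d in d1.items():
        d2 = hamming(f[p], f[q])
        if d2 < d:
            return None  # the lower bound of the definition is violated
        c = max(c, d2 / d)
    return c
```

Under this reading `distortion(D1, F)` evaluates to 1.0, so f is a 1-embedding: every table distance equals the Hamming distance between the codewords.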

12 Embeddings for Algorithms

13 Hamming metric
Noisy pattern matching:
– Exact:
    O(n |Σ| log n) [Fischer-Paterson'74]
    O(nk) [Landau-Vishkin, Galil-Giancarlo'85]
    Õ(n m^(1/2)) [Abrahamson, Kosaraju'89]
    Õ(n k^(1/2)) [Amir-Lewenstein-Porat, SODA'00]
    O(n (1 + poly(k)/m)) [Sahinalp-Vishkin, FOCS'96; Cole-Hariharan, SODA'00]
– Approximate:
    O((n/ε^2) log|Σ| log m) [Karloff, IPL'93]
    O((n/ε^2) log m) [Indyk, FOCS'98]
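The O(n |Σ| log n) bound comes from counting matches one symbol at a time with a convolution. The sketch below illustrates that idea with a small pure-Python FFT. It is a teaching version under that interpretation, not the authors' implementation, and all function names are illustrative.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def convolve(x, y):
    """Integer convolution of two 0/1 lists via FFT."""
    size = 1
    while size < len(x) + len(y) - 1:
        size *= 2
    fx = fft(list(x) + [0] * (size - len(x)))
    fy = fft(list(y) + [0] * (size - len(y)))
    prod = fft([u * v for u, v in zip(fx, fy)], invert=True)
    # The unnormalised inverse transform needs a division by size.
    return [round(v.real / size) for v in prod[: len(x) + len(y) - 1]]


def mismatch_counts(text, pattern):
    """Hamming distance between pattern and every m-substring of text."""
    n, m = len(text), len(pattern)
    matches = [0] * (n - m + 1)
    for sym in set(pattern):  # one convolution per alphabet symbol
        t_ind = [1 if c == sym else 0 for c in text]
        p_ind = [1 if c == sym else 0 for c in reversed(pattern)]
        conv = convolve(t_ind, p_ind)
        for i in range(n - m + 1):
            # conv[i + m - 1] counts positions where both strings show sym
            matches[i] += conv[i + m - 1]
    return [m - x for x in matches]
```

Reversing the pattern turns the sliding dot product into an ordinary convolution, which is what lets the FFT handle all n alignments at once.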

14 Karloff's Algorithm
Embed Hamming over Σ into Hamming over {0,1}:
– Take f: Σ → {0,1}^t, t = O(log|Σ|/ε^2), such that for any distinct a, b in Σ, H(f(a),f(b)) = (t/2)(1 ± ε)
– Replace each symbol a in T and P by f(a), obtaining f(T) and f(P)
[Figure: example strings over {a, b, c}.]
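A small experiment shows why such a code is useful: if every pair of distinct codewords differs in about t/2 bits, then the binary Hamming distance divided by t/2 approximates the number of mismatching symbols. The slide does not say how f is built; drawing each codeword uniformly at random, as done below, is an assumption made for illustration, and the function names are hypothetical.

```python
import random


def random_code(alphabet, t, seed=0):
    """Assign each symbol a random t-bit codeword (assumption: random, not explicit)."""
    rng = random.Random(seed)
    return {
        a: tuple(rng.randint(0, 1) for _ in range(t))
        for a in sorted(set(alphabet))
    }


def encode(s, code):
    """Replace each symbol of s by its codeword, giving a binary string."""
    return [bit for ch in s for bit in code[ch]]


def estimate_mismatches(p, s, code, t):
    """Estimate H(p, s) from the binary Hamming distance of the encodings."""
    bits = sum(x != y for x, y in zip(encode(p, code), encode(s, code)))
    return bits / (t / 2)
```

Identical strings always yield an estimate of 0, and each mismatching position contributes roughly 1 once t is large enough for the codeword distances to concentrate around t/2.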

15 L_p norms
L_2: exact, in O(n log m) time
– ||S − P||_2^2 = ||S||_2^2 + ||P||_2^2 − 2 S·P
L_1:
– Exact: Õ(n m^(1/2)) [Indyk-Lewenstein-Lipsky-Porat, ICALP'04]
– Approximate: O((m log m + n) log n / ε^2) [Indyk]; O(n log m log|Σ| / ε^2) [Lipsky-Porat]
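The L_2 identity splits the squared distance into two norms and one dot product per alignment. The sketch below follows that decomposition: window norms come from prefix sums, while the dot products are computed naively to keep the code short. The O(n log m) bound on the slide requires computing them with an FFT-based correlation instead. The function name is illustrative.

```python
def sq_l2_all_windows(text, pattern):
    """Squared L_2 distance between pattern and every m-window of text."""
    n, m = len(text), len(pattern)
    # prefix[i] = sum of squares of text[:i], so window norms cost O(1) each
    prefix = [0.0]
    for x in text:
        prefix.append(prefix[-1] + x * x)
    p_norm = sum(x * x for x in pattern)
    out = []
    for i in range(n - m + 1):
        s_norm = prefix[i + m] - prefix[i]
        # Naive dot product; an FFT correlation would give all of these at once.
        dot = sum(text[i + j] * pattern[j] for j in range(m))
        out.append(s_norm + p_norm - 2 * dot)
    return out
```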

16 L_1 norm
Imagine we have a linear mapping A: R^m → R^t, t = O(log n/ε^2), such that for all P, S: ||P − S||_1 = ||AP − AS||_1 (1 ± ε)
Then we easily get an O(n t log n) algorithm:
– Denote A = [a_1 a_2 … a_t]^T
– Compute AP: O(mt)
– For j = 1..t, compute a_j · T[i..i+m−1], i = 1…n, via FFT: O(n t log n). This gives us AS for all m-substrings S of T
– Estimate ||P − S||_1 for all S: O(nt)
A faster algorithm is obtained by reversing the pattern and text computation

17 Dimensionality reduction in L_1
– Unfortunately, such a mapping A does not exist [Charikar-Sahai, FOCS'02]
– But there are A's such that ||P − S||_1 = median[|AP − AS|] (1 ± ε) with high probability, where the median is taken over the t coordinates [Indyk, FOCS'00]
– Construction uses 1-stable distributions: a_j · x has the same distribution as z · ||x||_1, where z is a 1-stable (Cauchy) random variable
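The median estimator can be tried out directly. The sketch below fills A with standard Cauchy entries, which are 1-stable, and uses the fact that the median of |z| for a standard Cauchy z is 1. The sketch size t, the seed and the function names are illustrative choices, not values from the talk.

```python
import math
import random
import statistics


def cauchy_matrix(t, m, seed=0):
    """t x m matrix of independent standard Cauchy entries."""
    rng = random.Random(seed)
    return [
        [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(m)]
        for _ in range(t)
    ]


def sketch(A, x):
    """Linear sketch Ax."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def estimate_l1(sk_p, sk_s):
    """Median of |AP - AS| over coordinates, an estimate of ||P - S||_1."""
    return statistics.median(abs(a - b) for a, b in zip(sk_p, sk_s))
```

Because the sketch is linear, AP − AS = A(P − S), and each coordinate is distributed as a Cauchy variable scaled by ||P − S||_1. The median is used instead of the mean because a Cauchy variable has no finite expectation.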

18 Bonus section
Consider the following general matching problem (g.m.p.):
– We have an arbitrary metric (D, Σ)
– The distance is D(P,S) = Σ_i D(P[i], S[i]) (sum over positions i)
Theorem [Bourgain'85]: any metric (D, Σ) can be embedded into R^(O(log|Σ|)) under L_1 with distortion O(log|Σ|), in time Õ(|Σ|^2).
Corollary: an O(log|Σ|)-approximate algorithm for the g.m.p. [Lipsky-Porat]

19 Approximate Near Neighbor
c-Approximate Near Neighbor:
– Given: a set P of N points, r > 0, c > 1
– Goal: build a data structure which, for any query q, if there is a point p ∈ P with ||q − p||_2 ≤ r, returns a point p' ∈ P with ||q − p'||_2 ≤ cr
Can be used to solve exact NN
– E.g., report all c-approximate NNs
– Query time depends on the data set
[Figure: query point q with balls of radius r and cr.]

20 Approximate NN in Hamming space
Exact algorithms:
– 2^m space, O(m) query time
– O(Nm) query time (linear scan)
Approximate algorithms:
– Space/time exponential in m: [Arya-Mount-et al], [Clarkson, STOC'97], [Kleinberg, STOC'97], [Har-Peled, FOCS'02]
– Space/time polynomial in m: [Kushilevitz-Ostrovsky-Rabani, STOC'98], [Indyk-Motwani, STOC'98], [Indyk, FOCS'98], …
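The O(Nm) exact baseline is a scan over the data set. A minimal sketch, assuming points are stored as Python integers whose bits are the m coordinates; that representation and the function names are choices made here for illustration.

```python
def hamming_bits(x: int, y: int) -> int:
    """Hamming distance between two bit vectors packed into integers."""
    return bin(x ^ y).count("1")


def near_neighbor_scan(points, q, r):
    """Return some stored point within Hamming distance r of q, or None."""
    for p in points:
        if hamming_bits(p, q) <= r:
            return p
    return None
```

The other exact baseline on the slide trades this query time for 2^m space by precomputing an answer for every possible query.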

21 Approach I: Dim Reduction
Would like to:
– Reduce the dimension m to t = O(log N/ε^2)
– Induce only c = (1+ε) distortion
Possible for:
– L_2 norm [Johnson-Lindenstrauss'84] ⇒ N^(O(log(1/ε)/ε^2)) space, O(d log N/ε^2) query [Indyk-Motwani'98]
– Hamming [Kushilevitz-Ostrovsky-Rabani'98] ⇒ N^(O(1/ε^2)) space, O(d log N/ε^2) query
Tool: random linear map
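For the L_2 case, the random linear map can be sketched with a Gaussian matrix scaled by 1/sqrt(t). This is one standard way to realize a Johnson-Lindenstrauss map; the slide does not specify the construction, so the Gaussian choice, the dimensions and the function names below are assumptions.

```python
import math
import random


def gaussian_map(t, d, seed=0):
    """t x d matrix with N(0, 1/t) entries, so squared norms are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(t)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(t)]


def project(A, x):
    """Apply the linear map: returns Ax."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def l2(x, y):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

With t around log N/ε^2, all pairwise distances among N points are preserved up to a (1 ± ε) factor with high probability, which is what makes searching in the reduced space meaningful.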

22 Approach II: Locality-Sensitive Hashing [Indyk-Motwani'98]
Idea: construct hash functions g: {0,1}^m → U such that for any points p, q:
– If D(p,q) ≤ r, then Pr[g(p)=g(q)] is "high" (that is, "not-so-small")
– If D(p,q) > cr, then Pr[g(p)=g(q)] is "small"
Then we can solve the problem by hashing

23 LSH for Hamming
– g_A(p) = p|_A (p restricted to the coordinate set A), |A| = t
– Works because: [figure not recoverable; it showed g_A applied to example bit vectors]
– However, t is large, so map p → (p|_A · (a_1,…,a_t)) mod M
– Can show #hash tables = N^(1/c): O(N^(1+1/c)) space, O(m N^(1/c) log N) query time
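The bit-sampling scheme can be sketched as a small index with several hash tables, each keyed by the point restricted to a random coordinate set A. Two simplifications are assumptions of this sketch: coordinates are sampled without replacement, and Python's built-in hashing of tuples stands in for the second-level "mod M" hash. The class name and parameters are illustrative.

```python
import random
from collections import defaultdict


class HammingLSH:
    """Bit-sampling LSH over points given as tuples of m bits."""

    def __init__(self, m, t, num_tables, seed=0):
        rng = random.Random(seed)
        # One random coordinate set A of size t per table.
        self.coords = [sorted(rng.sample(range(m), t)) for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _key(self, i, p):
        # g_A(p) = p restricted to the coordinates of table i
        return tuple(p[j] for j in self.coords[i])

    def insert(self, p):
        for i, table in enumerate(self.tables):
            table[self._key(i, p)].append(p)

    def query(self, q, radius):
        """Return a stored point within `radius` of q found in q's buckets, else None."""
        for i, table in enumerate(self.tables):
            for p in table.get(self._key(i, q), ()):
                if sum(a != b for a, b in zip(p, q)) <= radius:
                    return p
        return None
```

Close points agree on a random coordinate with higher probability than far points, so they are more likely to share a bucket. Using about N^(1/c) tables makes it likely that a near neighbor collides with the query in at least one of them.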

24 All m-substrings version
Can:
– Generate the N−m+1 m-substrings of T[1…N]
– Use the LSH algorithm
Drawback: O(m N^(1+1/c)) preprocessing time
But we can hash all m-substrings of T using FFT:
– O(N log m) time per hash function
– O(N^(1+1/c) log m) time total
Other optimizations possible [Buhler, RECOMB'02, …]

25 Edit distance
– Many algorithms for the exact problem
– Approximation algorithms?
– Embeddings?

26 Embeddings of Edit Distance
– ED cannot be embedded into L_1 with distortion ≤ 3/2 [Andoni-Deza-Gupta-Indyk-Raskhodnikova, SODA'02]
– ED over strings of length ≤ m can be embedded* into L_1 with distortion O(m^α) [Bar-Yossef-Jayram-Krauthgamer-Kumar, FOCS'04]

27 Block Edit Distance
If we allow block operations (each with unit cost):
– Move: ababcd → cdabab
– Copy: abcd → abcdab (plus the inverse op)
– Etc.
Then BED can be embedded into L_1 with distortion O(log m log* m) [Cormode-Paterson-Sahinalp-Vishkin, SODA'00; Muthukrishnan-Sahinalp, STOC'00; Cormode-Muthukrishnan, SODA'02]

28 Implications
BED:
– O(log m log* m)-approximate NN with O(N^1.1) space, poly(m) query [Muthukrishnan-Sahinalp'00]
– O(log m log* m)-approximate pattern matching in Õ(n+m) time [Cormode-Muthukrishnan'02]
ED:
– O(m^α)-approximate NN with O(N^1.1) space, poly(m) query, for some α > 0 [Bar-Yossef et al'04]
  Known: O(m^ε)-approximate NN with O(N^(2^(1/ε))) space for any ε > 0 [Indyk, SODA'04]
– O(m^α)-approximate pattern matching in Õ(n+m) time

29 Edit and Hamming Distances
Want to find patterns modified by:
– k insertions/deletions (indels)
– l substitutions
– k << l
Can find a substring [Badoiu-Indyk, SODA'04]:
– With k indels, (1+ε)l substitutions
– In time O(n poly(1/ε + k + log n))
Method: extend the O(nk)-time algorithm:
– Instead of finding the longest T[i…j] that matches a prefix of P exactly, find the longest T[i…j] that matches a prefix of P approximately
– Use the poly(log m + 1/ε) data structure from [Indyk-Koudas-Muthukrishnan, VLDB'00]

30 Conclusions
Examples of embeddings:
– General metrics into L_1
– Concrete metrics into L_1
– Dimensionality reduction
Applications to problems:
– Pattern matching
– Near neighbor

31 Open Problems
Near neighbor:
– Improve the O(m n^(1/c)) query time (but keep small space). Recent (small) improvement for the L_2 norm [Datar-Immorlica-Indyk-Mirrokni, SoCG'04]
– Better space bound for the data set induced by substrings of T of arbitrary length m. Preprocessing for all m's gives O(n^(1+1+1/c)) space
General pattern matching tradeoff:
– Exact: O(|Σ| n log n) time
– log|Σ|-approximate: Õ(n) time

32 Open Problems
– Better embeddings (or lower bounds) for ED or BED into L_1
– Better NN for k indels, l substitutions, k << l

33 The End – Thank You!