1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion

2 Motivating Example: Near-Duplicate Elimination on the Web
Syntactic clustering [Broder, Glassman, Manasse, Zweig 97]:
– Group pages into clusters of “similar” pages
– Keep one “representative” from each cluster
(Pipeline: Crawler → Duplicate elimination → Page Repository)

3 Syntactic Clustering via Sketching [Broder, Glassman, Manasse, Zweig 97]
Challenges: the corpus is huge (billions of pages, 10K/page) – streaming access, limited main memory, linear running time.
Locality Sensitive Hashes [Indyk, Motwani 98]: Pr_h[h(p) = h(q)] = sim(p,q).
Cluster: collection of pages that have a common sketch h(p).
Sketches can be computed in one pass, then stored and processed on a single machine.
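The clustering step on this slide amounts to a one-pass group-by on sketch values. A minimal sketch in Python; the function names and the toy sketch in the usage note are illustrative, not part of the original scheme.

```python
from collections import defaultdict

def cluster_by_sketch(pages, sketch):
    """One pass over the corpus: pages whose sketches collide form a
    cluster; keep one representative per cluster."""
    clusters = defaultdict(list)
    for page in pages:
        clusters[sketch(page)].append(page)
    representatives = [members[0] for members in clusters.values()]
    return representatives, dict(clusters)
```

With a toy sketch such as `lambda p: p[:5]`, pages sharing a 5-character prefix collapse into one cluster; a real deployment would use a locality-sensitive hash as on the slide.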

4 Shingling and Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
w-shingling: S_w(p) = the set of all substrings of p of length w.
resemblance_w(p,q) = Pr_π[min(π(S_w(p))) = min(π(S_w(q)))] = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|, where π is a random permutation.
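The min-hash identity above can be checked empirically: the fraction of random permutations under which the two shingle sets share their minimum estimates the Jaccard resemblance. A small sketch, with parameter defaults chosen for illustration only:

```python
import random

def shingles(s, w):
    """S_w(s): the set of all length-w substrings of s."""
    return {s[i:i + w] for i in range(len(s) - w + 1)}

def minhash_resemblance(p, q, w=4, trials=200, seed=0):
    """Estimate resemblance_w(p,q) by the fraction of random permutations
    under which S_w(p) and S_w(q) have the same minimum element."""
    S, T = shingles(p, w), shingles(q, w)
    universe = list(S | T)
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # A random permutation pi of the shingle universe, as a rank map.
        rank = {sh: i for i, sh in enumerate(rng.sample(universe, len(universe)))}
        if min(S, key=rank.get) == min(T, key=rank.get):
            hits += 1
    return hits / trials
```

Identical strings give estimate 1.0, strings with disjoint shingle sets give 0.0, and in between the estimate concentrates around the Jaccard coefficient.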

5 The Sketching Model
Alice holds x, Bob holds y; using shared randomness, they send short sketches φ(x), φ(y) to a referee.
k vs. r Gap Problem – Promise: d(x,y) ≤ k or d(x,y) ≥ r. Goal: decide which of the two holds.

6 Applications of Sketching
Large data sets – clustering, nearest neighbor schemes, data streams.
Management of files over the network – differential backup, synchronization.
Theory – low distortion embeddings, simultaneous messages communication complexity.

7 Known Sketching Schemes Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98] Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98] [Feigenbaum,Ishai,Malkin,Nissim,Strauss,Wright 01] Cosine similarity [Charikar 02] Earth mover distance [Charikar 02] In this talk: Edit Distance

8 Edit Distance
ED(x,y) for x ∈ Σ^n, y ∈ Σ^m: the minimum number of character insertions, deletions and substitutions that transform x into y.
Examples: ED(00000, 1111) = 5; ED(01010, 10101) = 2.
Applications: genomics, text processing, web search.
For simplicity: m = n, Σ = {0,1}.

9 Computing Edit Distance
Exact computation:
Dynamic programming (1970) – O(n²)
Masek and Paterson (1980) – O(n²/log n)
Impractical for comparing two very long strings. Natural question 1: can we do it in linear time?
Impractical for handling massive document repositories. Natural question 2: are there constant-size sketches of edit distance?
Can we solve the above problems if we settle for approximation? (Focus of this talk.)
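For reference, the classic O(n²) dynamic program mentioned above, in a two-row form that also keeps memory linear:

```python
def edit_distance(x, y):
    """Classic dynamic program: D[i][j] = ED(x[:i], y[:j]),
    computed row by row in O(|x| * |y|) time and O(|y|) space."""
    n, m = len(x), len(y)
    prev = list(range(m + 1))          # D[0][j] = j: insert j characters
    for i in range(1, n + 1):
        cur = [i] + [0] * m            # D[i][0] = i: delete i characters
        for j in range(1, m + 1):
            cur[j] = min(prev[j] + 1,                          # delete x[i-1]
                         cur[j - 1] + 1,                       # insert y[j-1]
                         prev[j - 1] + (x[i - 1] != y[j - 1])) # substitute / match
        prev = cur
    return prev[m]
```

It reproduces the examples from the previous slide: ED(00000, 1111) = 5 and ED(01010, 10101) = 2.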

10 Sketching Schemes for Edit Distance
Algorithm | Gap | Sketch size
Batu et al. | n^α vs. Ω(n) | O(n^max(α/2, 2α−1))
This paper | k vs. O((kn)^(2/3)) | O(1)
This paper (non-repetitive strings) | k vs. O(k²) | O(1)
Negative indications:
– No known embeddings of edit distance into a normed space.
– Every embedding of edit distance into L1 incurs distortion ≥ 3/2 [Andoni, Deza, Gupta, Indyk, Raskhodnikova 03].
– Weak nearest neighbor schemes [Indyk 04].

11 Hamming Distance Sketches [Kushilevitz, Ostrovsky, Rabani 98]
Ham(x,y) = # of positions in which x, y differ.
Gap: k vs. 2k. Sketch size: O(1).
Shared randomness: r_1,…,r_n ∈ {0,1} are independent, with Pr[r_i = 1] = 1/(2k).
Sketch: h(x) = (Σ_i x_i r_i) mod 2, h(y) = (Σ_i y_i r_i) mod 2.
Analysis: Pr[h(x) ≠ h(y)] = Pr[h(x) + h(y) = 1] = Pr[Σ_{i: x_i ≠ y_i} r_i is odd] = ½(1 − (1 − 1/k)^Ham(x,y)).
Full sketch: φ(x) = (h_1(x),…,h_t(x)), φ(y) = (h_1(y),…,h_t(y)), with t = O(1) independent repetitions.
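The [KOR] scheme above is easy to simulate: draw t shared random vectors, compute one parity bit per vector, and observe that the fraction of disagreeing bits tracks ½(1 − (1 − 1/k)^Ham(x,y)). A sketch under the slide's parameters; t and the seed are illustrative:

```python
import random

def kor_disagreement(x, y, k, t=2000, seed=1):
    """Simulate t independent [KOR] parity sketches: each shared vector r has
    Pr[r_i = 1] = 1/(2k); one bit h(x) = (sum_i x_i * r_i) mod 2 per vector.
    Returns the observed fraction of vectors with h(x) != h(y), which
    estimates (1 - (1 - 1/k)**Ham(x, y)) / 2."""
    rng = random.Random(seed)
    n = len(x)
    disagreements = 0
    for _ in range(t):
        r = [1 if rng.random() < 1 / (2 * k) else 0 for _ in range(n)]
        hx = sum(xi * ri for xi, ri in zip(x, r)) % 2
        hy = sum(yi * ri for yi, ri in zip(y, r)) % 2
        disagreements += hx != hy
    return disagreements / t
```

The referee then thresholds this fraction to separate Ham(x,y) ≤ k from Ham(x,y) ≥ 2k.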

12 Edit Distance Sketches: Basic Framework
Underlying principle: ED(x,y) is small iff x and y share many common substrings at nearby positions.
S_x = set of pairs of the form (α, h(i)), where α is a substring of x and h(i) is a “locality sensitive” encoding of the substring’s position.
ED(x,y) small iff the intersection S_x ∩ S_y is large (common substrings at nearby positions).

13 Basic Framework (cont.)
ED(x,y) small iff the symmetric difference S_x Δ S_y is small.
Need to estimate the size of the symmetric difference: a Hamming distance computation on characteristic vectors, using constant-size sketches [KOR].
This reduces Edit Distance to Hamming Distance.
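The reduction on this slide rests on a simple identity: the size of the symmetric difference of two sets equals the Hamming distance between their characteristic vectors over a common universe. A minimal sketch:

```python
def symmetric_difference_via_hamming(Sx, Sy):
    """|Sx Δ Sy| equals the Hamming distance between the characteristic
    0/1 vectors of Sx and Sy over any universe containing both sets."""
    universe = sorted(Sx | Sy)
    vx = [1 if u in Sx else 0 for u in universe]
    vy = [1 if u in Sy else 0 for u in universe]
    return sum(a != b for a, b in zip(vx, vy))
```

In the actual scheme the characteristic vectors are never materialized; the [KOR] sketch of the previous slide is applied to them implicitly.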

14 General Case: Encoding Scheme
Gap: k vs. O((kn)^(2/3)). Parameters: B = n^(2/3)/k^(1/3) windows of size W = n/B each.
S_x = { (α_1,1), (α_2,1), (α_3,2), …, (α_i, win(i)), … }
S_y = { (β_1,1), (β_2,1), (β_3,2), …, (β_i, win(i)), … }

15 Analysis
Case 1: ED(x,y) ≤ k.
If α_i is “unmarked”, it has a matching “companion” β_j.
(α_i, win(i)) ∈ S_x \ S_y only if either α_i is “marked”, or α_i is unmarked but win(i) ≠ win(j).
At most kB marked substrings; at most k · n/W = kB companions with mismatched windows.
Therefore, Ham(S_x, S_y) ≤ 4kB.

16 Analysis (cont.)
Case 2: Ham(S_x, S_y) ≤ 8kB.
If α_i has a “companion” β_j and win(i) = win(j), we can align α_i with β_j using at most W operations; otherwise, substitute the first character of α_i.
At most 8kB substrings of x have no companion.
Therefore, ED(x,y) ≤ 8kB + W · n/B = O((kn)^(2/3)).
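The parameter choice on slide 14 comes from balancing the two cases above (with W = n/B, so W · n/B = n²/B²):

```latex
\mathrm{Ham}(S_x,S_y) \le 4kB, \qquad
\mathrm{ED}(x,y) \le 8kB + W\cdot\frac{n}{B} \;=\; 8kB + \frac{n^2}{B^2}.
```

```latex
kB = \frac{n^2}{B^2}
\;\Longrightarrow\;
B = \frac{n^{2/3}}{k^{1/3}},
\qquad
kB = k^{2/3} n^{2/3} = (kn)^{2/3}.
```

Equating the two terms is what yields the k vs. O((kn)^(2/3)) gap.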

17 Non-repetitive Case: Encoding Scheme
Gap: k vs. O(kW).
t ≥ 1 is the “non-repetitiveness” parameter: no substring of length t repeats within a window of size W = O(k · t).
Alice and Bob choose a sequence of “anchors” in a coordinated way:
π_1: a random permutation on {0,1}^t.
The first anchor in x is the minimal length-t substring of x_1 (under π_1); the first anchor in y is the minimal length-t substring of y_1 (under π_1).

18 11 11 Encoding scheme (cont.) 22 33 44 55 66 77 11 22 33 44 55 66 77 22 33 44 55 66 77 88 11 22 33 44 55 66 77 88 x y S x = { (  1,1),…,(  8,8) } S y = { (  1,1),…,(  8,8) }

19 11 22 33 44 55 66 77 11 22 33 44 55 66 77 11 22 33 44 55 66 77 88 Analysis Case 1: ED(x,y) · k. All anchors are “unmarked” with probability 1 - kt/W =  (1) If  i,  i are unmarked, they are aligned # of mismatching substrings · 2k Ham(S x,S y ) · 2k x y 11 22 33 44 55 66 77 88

20 11 22 33 44 55 66 77 11 22 33 44 55 66 77 88 11 22 33 44 55 66 77 11 22 33 44 55 66 77 88 Analysis (cont.) Case 2: Ham(S x,S y ) · 4k # of mismatching substrings · 4k ED(x,y) · 2 ¢ W ¢ 4k = O(k W). x y

21 Approximation in Linear Time

Arbitrary strings:
Algorithm | Gap | Time | Approx. factor in O(n) time
Dynamic Programming | k vs. k+1 | O(kn) | None
Batu et al. | n^α vs. Ω(n) | O(n^max(α/2, 2α−1)) | None
Cole, Hariharan | k vs. 2k | O(n + k⁴) | O(n^(3/4))
This paper | k vs. k^(7/4) | O(n) | O(n^(3/7))

Non-repetitive strings:
Algorithm | Gap | Time | Approx. factor in O(n) time
Cole, Hariharan | k vs. 2k | O(n + k³) | O(n^(2/3))
This paper | k vs. k^(3/2) | O(n) | O(n^(1/3))

22 Summary and Open Problems
Designed efficient approximation schemes for edit distance – the best sketching and linear-time approximations to date.
Subsequent work:
– O(n^(2/3))-distortion embedding of edit distance into L1 [Indyk 04], [Rabani 04]
– Better embeddings of edit distance into L1 [Ostrovsky, Rabani 05]
– Embeddings of the Ulam metric into L1 [Charikar, Krauthgamer 05]
Open problems:
– Sketch size lower bounds
– Constant factor approximations in linear time
– Better embeddings of edit distance
– Sketching schemes for other distance measures

23 Thank You