Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University.

Presentation transcript:

Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

String Matching with k Mismatches Landau – Vishkin 1986 Galil – Giancarlo 1986 Abrahamson 1987 Amir - Lewenstein - Porat 2000

Exact String Matching
Input: T = t_1 … t_n, P = p_1 … p_m
Output: All locations i of T where P appears
Example: P = A B C A A B
T = A B A B C A A B C A A B C A A B A A …
Answer: {3, 7, 11, …}

Problem: Matching not exact in applications of: Computational Biology Musicology Text Editing Meteorology etc. Need other definitions of string matching!

Approximate String Matching
Idea: Find all text locations where the distance from the pattern is sufficiently small.
Distance metric: HAMMING DISTANCE
Let S = s_1 s_2 … s_m, R = r_1 r_2 … r_m
Ham(S,R) = the number of locations j where s_j ≠ r_j
Example: S = ABCABC, R = ABBAAC, Ham(S,R) = 2
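The Hamming distance defined on this slide can be sketched in a few lines of Python. The function name `hamming` is only illustrative and does not come from the slides.

```python
def hamming(s: str, r: str) -> int:
    """Number of locations j where s[j] != r[j]; both strings must have the same length."""
    if len(s) != len(r):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(1 for a, b in zip(s, r) if a != b)


# The slide's example: S = ABCABC, R = ABBAAC
print(hamming("ABCABC", "ABBAAC"))  # 2
```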

String Matching with Mismatches
Input: T = t_1 … t_n, P = p_1 … p_m
Output: For each i in T, Ham(P, T_i), where T_i = t_i t_{i+1} … t_{i+m-1}
Example: P = A B B A A C
T = A B C A A B C A C …
Ham(P,T_1) = 2, Ham(P,T_2) = 4, Ham(P,T_3) = 6, Ham(P,T_4) = 2
Output: 2, 4, 6, 2, …

String Matching with k Mismatches
Input: T = t_1 … t_n, P = p_1 … p_m
Output: Every i in T s.t. Ham(P, t_i t_{i+1} … t_{i+m-1}) ≤ k
Example: k = 2, P = A B B A A C
T = A B C A A B C A C …
Hamming distances: 2, 4, 6, 2, …
At most k mismatches: Y, N, N, Y, …

Naïve Algorithm (for counting mismatches or k-mismatches problem)
- Go to each location i of the text and compute the Hamming distance of P and T_i
Running Time: O(nm), where n = |T|, m = |P|
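A minimal sketch of the naïve O(nm) algorithm, covering both the counting variant and the k-mismatches variant. The function names are illustrative, and locations are reported 1-based to match the slides.

```python
def mismatch_counts(text: str, pattern: str) -> list[int]:
    """Hamming distance between the pattern and every length-m window of the text."""
    n, m = len(text), len(pattern)
    return [
        sum(1 for j in range(m) if text[i + j] != pattern[j])
        for i in range(n - m + 1)
    ]


def k_mismatch_locations(text: str, pattern: str, k: int) -> list[int]:
    """1-based text locations where the pattern matches with at most k mismatches."""
    return [i + 1 for i, d in enumerate(mismatch_counts(text, pattern)) if d <= k]


T = "ABCAABCAC"
P = "ABBAAC"
print(mismatch_counts(T, P))          # [2, 4, 6, 2]
print(k_mismatch_locations(T, P, 2))  # [1, 4]
```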

The Kangaroo Method (for k-mismatches) Landau – Vishkin 1986 Galil – Giancarlo 1986

Trie
A tree representing a set of strings: { aeef, ad, bbfe, bbfg, c }
[Figure: trie of the five strings, one character per edge]

Trie (Cont)
Assume no string is a prefix of another. Each string corresponds to a leaf.
[Figure: the same trie]

Compressed Trie
Compress unary nodes, label edges by strings.
[Figure: the trie above next to its compressed form, with edge labels a, bbf, c, eef, d, e, g]
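As a rough illustration of the uncompressed trie, the following sketch stores the slide's string set in nested dictionaries. The `END` marker and the function names are assumptions made for this example; compression of unary nodes is not shown.

```python
END = "<end>"  # hypothetical marker key; cannot collide with single-character edges


def build_trie(words):
    """Nested-dict trie: each key is one character, END marks a stored string."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def contains(trie, word):
    """True if word was inserted as a whole string (not merely as a prefix)."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(contains(trie, "bbfg"), contains(trie, "bbf"))  # True False
```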

Suffix tree
Suffix tree of string s: a compressed trie of all suffixes of s.
Prefix-free: add a special character, say $, at the end of s.

Suffix tree (Example)
Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$:
{ $, b$, ab$, bab$, abab$ }
[Figure: suffix tree of abab$]

Suffix Tree properties
- Succinct in space: O(n).
- Can be built in O(n) time: McCreight, Weiner, Ukkonen, Farach-Colton
[Figure: suffix tree of abab$ with leaves numbered 1 to 5]

Exact string matching
Given a pattern P = ab, we traverse the tree according to the pattern (s = abab$).
The leaves below the node we reach correspond to the locations of appearance: 1 and 3.
Prepare tree: O(n) time
Find matches: O(m + occ) time, where occ = # of matches
[Figure: suffix tree of abab$ with leaves numbered 1 to 5]
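The traversal can be illustrated with an uncompressed suffix trie, which is a deliberate simplification: it takes O(n²) time and space to build, unlike the O(n) suffix tree on the slide. The node layout and function names below are assumptions for this sketch only.

```python
def build_suffix_trie(s: str):
    """Uncompressed trie of all suffixes of s + '$'.

    Each node records the 1-based start positions of the suffixes passing
    through it, so a query only has to walk down the pattern.
    """
    s = s + "$"
    root = {"children": {}, "starts": []}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root


def occurrences(trie, pattern: str) -> list[int]:
    """1-based locations where pattern appears in the indexed string."""
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return sorted(node["starts"])


trie = build_suffix_trie("abab")
print(occurrences(trie, "ab"))  # [1, 3]
```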

Lowest common ancestors
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.

Why?
The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes.
Example: s = abbaab$. The leaves of the suffixes aab$ and abbaab$ meet at the node that spells their longest common prefix, a.
[Figure: suffix tree of abbaab$ with leaves numbered 1 to 7]

LCA/LCP properties
Preprocessing time: O(n)
Query time: O(1)
Harel & Tarjan 1984, Schieber & Vishkin 1988, Berkman & Vishkin 1993

The Kangaroo Method (for k-mismatches)
- Create suffix tree for: s = P#T
- Check P at each location i of T by kangarooing
Example: P = A B A B A A B A C A B
T = A B B A C A B A B A B C A B B C A B C A …

The Kangaroo Method (for k-mismatches)
Preprocess:
- Build suffix tree of both P and T: O(n+m) time
- LCA preprocessing: O(n+m) time
Check P at a given text location:
- Kangaroo jump to the next mismatch, at most k+1 jumps of O(1) each: O(k) time
Overall time: O(nk)
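The jumping logic can be sketched as follows. To keep the example short, the O(1) longest-common-prefix queries are answered from a precomputed table that costs O(nm) time and space; this table is only a stand-in for the suffix tree with LCA preprocessing described above, so the sketch shows the jumps and not the O(nk) bound. All names are illustrative.

```python
def lce_table(pattern: str, text: str):
    """lce[i][j] = length of the longest common prefix of pattern[i:] and text[j:].

    Stand-in for suffix tree + LCA queries (O(nm) here instead of O(n+m))."""
    m, n = len(pattern), len(text)
    lce = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if pattern[i] == text[j]:
                lce[i][j] = lce[i + 1][j + 1] + 1
    return lce


def kangaroo_k_mismatches(text: str, pattern: str, k: int) -> list[int]:
    """1-based locations where pattern matches text with at most k mismatches."""
    m, n = len(pattern), len(text)
    lce = lce_table(pattern, text)
    hits = []
    for i in range(n - m + 1):
        p = 0            # current offset inside the pattern
        mismatches = 0
        while True:
            p += lce[p][i + p]      # kangaroo jump over the matching run
            if p >= m:
                break               # reached the end of the pattern
            mismatches += 1         # offset p is a mismatch
            if mismatches > k:
                break               # too many errors, give up on location i
            p += 1                  # step past the mismatch
        if mismatches <= k:
            hits.append(i + 1)
    return hits


print(kangaroo_k_mismatches("ABCAABCAC", "ABBAAC", 2))  # [1, 4]
```

Each location performs at most k+1 jumps, which is where the O(k) per-location cost comes from once the table lookup is replaced by a constant-time LCA query.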

Boolean Convolutions (FFT) Method
P = a b a c c a c b a c a b a c c
T = a b b b c c c a a a a b a c b …
a-mask of P: P_a has a 1 wherever P has an a and a 0 elsewhere.
not-a mask of T: T_not-a has a 1 wherever T has a symbol other than a and a 0 elsewhere.
Multiply P_a and T_not-a to count the mismatches caused by a at every text location (use FFT).

Boolean Convolutions (FFT) Method
Running Time:
One boolean convolution: O(n log m) time
# of matches of all symbols: O(n|Σ| log m) time
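A small sketch of the per-symbol convolution, using a textbook recursive FFT in pure Python so that it runs without extra libraries. For each pattern symbol c it multiplies the reversed c-mask of the pattern with the not-c mask of the text and sums the results. The function names are illustrative, and this version transforms the whole text at once, giving O(n log n) per symbol; the O(n log m) bound is usually obtained by cutting the text into overlapping blocks of length 2m, which is omitted here.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def multiply(x, y):
    """Integer polynomial product of 0/1 vectors x and y via FFT."""
    size = 1
    while size < len(x) + len(y):
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    fy = fft([complex(v) for v in y] + [0j] * (size - len(y)))
    prod = fft([u * v for u, v in zip(fx, fy)], invert=True)
    return [round((v / size).real) for v in prod]


def mismatches_by_convolution(text: str, pattern: str) -> list[int]:
    """Hamming distance of the pattern to every text window, one convolution per symbol."""
    n, m = len(text), len(pattern)
    total = [0] * (n - m + 1)
    for c in set(pattern):
        p_mask = [1 if ch == c else 0 for ch in reversed(pattern)]  # reversed c-mask
        t_mask = [1 if ch != c else 0 for ch in text]               # not-c mask
        conv = multiply(p_mask, t_mask)
        for i in range(n - m + 1):
            total[i] += conv[i + m - 1]  # alignment of pattern start with text[i]
    return total


print(mismatches_by_convolution("ABCAABCAC", "ABBAAC"))  # [2, 4, 6, 2]
```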

Counting Method
Input: Text: T = t_1 … t_n, Pattern: P = p_1 … p_m, Max # of allowed mismatches: k
Assumption: Each pattern element is distinct
Count matches (instead of mismatches): each text symbol increments the counter of the single alignment in which it matches P.
Example: P = a b c d e f g h
T = b g d e f h d c c a b g h h …
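Under the slide's assumption that every pattern symbol is distinct, the counting idea can be sketched as below: a text symbol found at pattern offset j can only match in the alignment that starts j positions earlier, so it increments exactly one counter. The function name and the 0-based counter list are choices made for this example.

```python
def count_matches_distinct(text: str, pattern: str) -> list[int]:
    """counters[i] = number of matches when the pattern is aligned at text[i].

    Assumes all pattern symbols are distinct, as on the slide."""
    n, m = len(text), len(pattern)
    where = {ch: j for j, ch in enumerate(pattern)}  # symbol -> its unique offset in P
    if len(where) != m:
        raise ValueError("pattern symbols must be distinct")
    counters = [0] * max(n - m + 1, 0)
    for pos, ch in enumerate(text):
        j = where.get(ch)
        if j is not None:
            start = pos - j              # the only alignment where this symbol matches
            if 0 <= start <= n - m:
                counters[start] += 1
    return counters


P = "abcdefgh"
T = "bgdefhdccabghh"
matches = count_matches_distinct(T, P)
k = 5
# A location has at most k mismatches exactly when it has at least m - k matches.
print([i + 1 for i, c in enumerate(matches) if len(P) - c <= k])
```

The whole pass costs O(n) time, since each text symbol touches at most one counter.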

O(n√k log m) Algorithm
Frequent symbol: a symbol that appears at least 2√k times in P.
We distinguish between two cases:
Case 1: At least √k frequent symbols.
- Consider the first √k frequent symbols.
- For each of them construct a mask for its first 2√k appearances.
Case 2: Fewer than √k frequent symbols.

Example of Masked Counting
k = 4, 2√k = 4
P = a b a c c a c b a c a b a c c
a-mask: the first 4 appearances of a in P; c-mask: the first 4 appearances of c in P
T = a b b b c c c a a a a b a c b …
Each a in T uses the a-mask, and each c in T uses the c-mask, to increment the counters of the alignments in which it matches a masked pattern position.

Counting Stage: Run through the text and count occurrences of all masks. Time: O(n√k).
For location i of T, if counter_i < k then there is no match at location i.
Why? The total # of elements in all masks is 2√k · √k = 2k.
Important Observations:
1) Sum of all counters ≤ 2√k · n
2) Every counter whose value is less than k already has more than k errors.

How many locations remain?
Sum of all counters: ≤ 2√k · n
Counter value of a potential match: ≥ k
# of potential matches: ≤ 2√k · n / k = 2n/√k
How do we check these locations? Use the Kangaroo Method.
Time: O(k) per location
Overall Time: O((n/√k) · k) = O(n√k)
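The Case 1 filter can be sketched as follows, assuming the thresholds are 2√k appearances per frequent symbol and √k frequent symbols, with √k rounded up when k is not a perfect square. A location with at most k mismatches must match at least (total mask size − k) masked positions, which equals the threshold k when k is a perfect square. Survivors are verified here with a plain Hamming check as a stand-in for the kangaroo method; all names are illustrative.

```python
import math
from collections import Counter


def hamming(a: str, b: str) -> int:
    return sum(1 for x, y in zip(a, b) if x != y)


def case1_k_mismatches(text: str, pattern: str, k: int) -> list[int]:
    """Masked counting filter plus verification; returns 1-based locations (k >= 1)."""
    n, m = len(text), len(pattern)
    r = math.isqrt(k - 1) + 1          # ceil(sqrt(k)); assumed reading of the thresholds
    per_mask = 2 * r                   # appearances kept per frequent symbol
    freq = Counter(pattern)
    frequent = [c for c in dict.fromkeys(pattern) if freq[c] >= per_mask][:r]
    if len(frequent) < r:
        raise ValueError("fewer than sqrt(k) frequent symbols: Case 2 applies")

    # Mask of each chosen symbol: offsets of its first 2*sqrt(k) appearances in P.
    masks = {c: [] for c in frequent}
    for j, ch in enumerate(pattern):
        if ch in masks and len(masks[ch]) < per_mask:
            masks[ch].append(j)
    total = sum(len(v) for v in masks.values())   # 2k when k is a perfect square

    # Counting stage: every masked text symbol increments at most 2*sqrt(k) counters.
    counters = [0] * max(n - m + 1, 0)
    for pos, ch in enumerate(text):
        for j in masks.get(ch, ()):
            start = pos - j
            if 0 <= start <= n - m:
                counters[start] += 1

    need = total - k                   # fewer masked matches means more than k errors
    candidates = [i for i, c in enumerate(counters) if c >= need]
    # Verification (plain Hamming check here; the slides use kangaroo jumps).
    return [i + 1 for i in candidates if hamming(pattern, text[i:i + m]) <= k]


P = "abaccacbacabacc"
T = "ccb" + "abaccbcbacabbcc" + "abcabc"   # copy of P with 2 symbols changed at location 4
print(case1_k_mismatches(T, P, 4))
```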

Case 2: x frequent symbols, x < √k
a) Count all matches of frequent symbols: one boolean convolution per symbol.
Time: O(x n log m) = O(n√k log m)
b) For non-frequent symbols, build full masks.
A non-frequent symbol appears < 2√k times in P, so its mask size is < 2√k.
Count time: O(n√k)

c) Add the results of a) & b) to get the total number of matches at every text location.
Time: a) O(n√k log m), b) O(n√k), c) O(n)
So, Case 2 is O(n√k log m).
Overall Algo. Time: O(n√k log m)

Additional Points 1. O(n log k) For there is a linear time algorithm - O( ) 2. O( n ) Better tradeoff: Define frequent symbol >

O( ) time algorithm
Outline:
1. Find 2k special substrings of the pattern.
2. Construct a forest data structure combining info of the special pattern substrings and the text.
3. Use local counting arguments and quick queries to the forest data structure to prune candidates.
4. Use the kangaroo method to check leftover potential candidates.

k-Mismatches and Matrix Multiplication
“Or-And” matrix multiplication: A×B = C, c_ij = ∨_k (a_ik ∧ b_kj)
Pattern all-mismatch problem: Find all text locations where the pattern mismatches at every character.
Indyk: If there is an algorithm faster than O(n ) for the Pattern all-mismatch problem, then there is a new method for solving “Or-And” matrix multiplication faster than O(n^3).
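For concreteness, here is a direct cubic-time sketch of the "Or-And" product, where entry c_ij is the OR over k of (a_ik AND b_kj). It only illustrates the definition and not the reduction attributed to Indyk; the function name is illustrative.

```python
def or_and_product(A, B):
    """Boolean matrix product: C[i][j] = OR over k of (A[i][k] AND B[k][j])."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [
        [int(any(A[i][k] and B[k][j] for k in range(inner))) for j in range(cols)]
        for i in range(rows)
    ]


A = [[1, 0, 0],
     [0, 1, 1]]
B = [[0, 1],
     [0, 0],
     [1, 0]]
print(or_and_product(A, B))  # [[0, 1], [1, 0]]
```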

OPEN PROBLEMS Hamming Distance in time: O(n log m) Edit Distance? Other metrics?