String Matching with Mismatches Some slides are stolen from Moshe Lewenstein (Bar Ilan University)



String Matching with Mismatches
Landau – Vishkin 1986
Galil – Giancarlo 1986
Abrahamson 1987
Amir – Lewenstein – Porat 2000

Approximate String Matching
Problem: find all text locations where the distance from the pattern is sufficiently small.
Distance metric: HAMMING DISTANCE
Let S = s_1 s_2 … s_m and R = r_1 r_2 … r_m.
Ham(S,R) = the number of locations j where s_j ≠ r_j.
Example: S = ABCABC, R = ABBAAC, Ham(S,R) = 2

Problem 1: Counting mismatches
Input: T = t_1 … t_n, P = p_1 … p_m
Output: for each location i in T, Ham(P, T_i), where T_i = t_i t_{i+1} … t_{i+m-1}
Example: P = A B B A A C, T = A B C A A B C A C …
Ham(P,T_1) = 2, Ham(P,T_2) = 4, Ham(P,T_3) = 6, Ham(P,T_4) = 2
Output: 2, 4, 6, 2, …

Problem 2: k-mismatches
Input: T = t_1 … t_n, P = p_1 … p_m, and a bound k
Output: every location i where Ham(P, t_i t_{i+1} … t_{i+m-1}) ≤ k
Example: k = 2, P = A B B A A C, T = A B C A A B C A C …
Hamming distances: 2, 4, 6, 2, …
Within k (1 = yes, 0 = no): 1, 0, 0, 1, …

Naïve Algorithm (for the counting mismatches or k-mismatches problem)
Go to each location i of the text and compute the Hamming distance of P and T_i.
Running time: O(nm), where n = |T| and m = |P|
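The naïve algorithm can be sketched in a few lines of Python. The function names (`hamming`, `count_mismatches_naive`, `k_mismatch_naive`) are illustrative and not from the slides, and locations are 0-based here while the slides count from 1.

```python
def hamming(s, r):
    """Number of positions j where s[j] != r[j]; s and r must have equal length."""
    if len(s) != len(r):
        raise ValueError("strings must have equal length")
    return sum(1 for a, b in zip(s, r) if a != b)


def count_mismatches_naive(text, pattern):
    """Hamming distance of the pattern against every full-length text window."""
    n, m = len(text), len(pattern)
    return [hamming(pattern, text[i:i + m]) for i in range(n - m + 1)]


def k_mismatch_naive(text, pattern, k):
    """0-based text locations whose window is within k mismatches of the pattern."""
    return [i for i, d in enumerate(count_mismatches_naive(text, pattern)) if d <= k]


# The example from the slides, using only the nine text symbols shown.
print(count_mismatches_naive("ABCAABCAC", "ABBAAC"))   # [2, 4, 6, 2]
print(k_mismatch_naive("ABCAABCAC", "ABBAAC", 2))      # [0, 3]
```

Each of the n - m + 1 windows costs O(m) comparisons, which is where the O(nm) bound comes from.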

The Kangaroo Method (for k-mismatches)
Landau – Vishkin 1986, Galil – Giancarlo 1986
- Create a suffix tree (+ LCA) for s = P#T
- Check P at each location i of T by kangarooing: do up to k LCP queries for every text location
Example: P = A B A B A A B A C A B, T = A B B A C A B A B A B C A B B C A B C A …

The Kangaroo Method (for k-mismatches)
Preprocess:
- Build a suffix tree of both P and T: O(n+m) time
- LCA preprocessing: O(n+m) time
Check P at a given text location:
- Kangaroo jump from one mismatch to the next, O(1) per jump and at most k+1 jumps: O(k) time
Overall time: O(nk)
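A minimal sketch of the jumping logic follows. It is not the O(nk) algorithm as stated: the longest-common-extension query `lce` is answered here by a direct scan, standing in for the constant-time suffix tree + LCA query the slides rely on. The names `lce`, `k_mismatch_kangaroo` and the `sep` parameter are assumptions made for illustration.

```python
def lce(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:].

    Stand-in: a direct scan. The slides answer this in O(1) with a suffix
    tree of P#T plus LCA preprocessing, which is not implemented here.
    """
    length = 0
    while (i + length < len(s) and j + length < len(s)
           and s[i + length] == s[j + length]):
        length += 1
    return length


def k_mismatch_kangaroo(text, pattern, k, sep="#"):
    """0-based text locations i with Ham(pattern, text[i:i+m]) <= k."""
    n, m = len(text), len(pattern)
    s = pattern + sep + text      # sep must not occur in either string
    offset = m + 1                # index in s where the text starts
    hits = []
    for i in range(n - m + 1):
        pos = 0                   # next pattern position to compare
        mismatches = 0
        while True:
            # Jump over the whole matching run in one query.
            pos += lce(s, pos, offset + i + pos)
            if pos >= m:
                hits.append(i)    # reached the end of the pattern
                break
            mismatches += 1       # the jump stopped at a mismatch
            if mismatches > k:
                break
            pos += 1              # step over the mismatch, then jump again
    return hits


print(k_mismatch_kangaroo("ABCAABCAC", "ABBAAC", 2))   # [0, 3]
```

The separator stops a jump from running past the end of the pattern, so each location makes at most k+1 `lce` queries before it is accepted or rejected.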

How do we do counting in less than O(nm)?

Let's start with binary strings.
[Figure: a binary pattern P aligned under a binary text T]
If we pad the pattern and the text with zeros, computing the counts for all alignments is like a convolution of two vectors of length m+n.
We can do this using FFT in O(n log n) time.
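The binary case can be sketched as follows, with a small recursive FFT written out so that the block needs nothing beyond the standard library. The helper names (`fft`, `correlate`, `binary_mismatches`) are illustrative, and splitting the mismatch count into two correlations (pattern 1 over text 0, pattern 0 over text 1) is one common way to do it, not necessarily the one on the original slide.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def correlate(text_bits, pattern_bits):
    """c[i] = sum_j pattern_bits[j] * text_bits[i + j] for every full window i."""
    n, m = len(text_bits), len(pattern_bits)
    size = 1
    while size < n + m:
        size *= 2
    # Zero-pad both vectors; reversing the pattern turns correlation into convolution.
    fa = fft([complex(x) for x in text_bits] + [0j] * (size - n))
    fb = fft([complex(x) for x in reversed(pattern_bits)] + [0j] * (size - m))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    conv = [round(v.real / size) for v in prod]   # unscaled inverse, so divide by size
    return conv[m - 1:n]


def binary_mismatches(text, pattern):
    """Hamming distance at every window for strings over the alphabet {0, 1}."""
    t = [int(ch) for ch in text]
    p = [int(ch) for ch in pattern]
    one_zero = correlate([1 - x for x in t], p)    # pattern 1 over text 0
    zero_one = correlate(t, [1 - x for x in p])    # pattern 0 over text 1
    return [a + b for a, b in zip(one_zero, zero_one)]


print(binary_mismatches("0110100", "101"))   # [2, 2, 0, 3, 1]
```

This version transforms the whole padded text at once, which gives the O(n log n) bound quoted above.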

And if the strings are not binary?
P = a b a c c a c b a c a b a c c
T = a b b b c c c a a a a b a c b ...
a-mask of P: P_a has 1 where P has "a" and 0 elsewhere
P_a = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0
not-a mask of T: T_not_a has 1 where T has a symbol other than "a" and 0 elsewhere
T_not_a = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 ...
Multiply P_a and T_not_a to count the mismatches using "a".

Boolean Convolutions (FFT) Method
Running time:
One boolean convolution: O(n log m) time
# of matches of all symbols: O(n |Σ| log m) time
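The per-symbol decomposition can be sketched like this. To keep the block short, `correlate` is a direct sum standing in for the FFT convolution, so this version runs in O(nm) per symbol and only illustrates how the masks combine; the function names are illustrative.

```python
def correlate(text_bits, pattern_bits):
    """Direct-sum stand-in for an FFT convolution (O(nm), for illustration only)."""
    n, m = len(text_bits), len(pattern_bits)
    return [sum(pattern_bits[j] * text_bits[i + j] for j in range(m))
            for i in range(n - m + 1)]


def count_mismatches_by_symbol(text, pattern):
    """Sum, over each pattern symbol a, the product of P_a with T_not_a."""
    n, m = len(text), len(pattern)
    totals = [0] * max(n - m + 1, 0)
    for a in sorted(set(pattern)):
        p_a = [1 if ch == a else 0 for ch in pattern]     # a-mask of P
        t_not_a = [0 if ch == a else 1 for ch in text]    # not-a mask of T
        for i, c in enumerate(correlate(t_not_a, p_a)):
            totals[i] += c                                # mismatches "using a"
    return totals


print(count_mismatches_by_symbol("ABCAABCAC", "ABBAAC"))   # [2, 4, 6, 2]
```

Every pattern position belongs to exactly one mask, so summing the per-symbol products gives the full Hamming distance at each location. Swapping an FFT-based correlation in for the direct sum is what gives one convolution per symbol of the alphabet.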

How do we do counting in less than O(nm)?
Let's count matches rather than mismatches.

Counter increment
[Figure: pattern P and text T, with counters at the text locations; extracted symbols: a b c d e b b d b g d e f h d c c a b g h h...]
For each character you have a list of the offsets where it occurs in the pattern.
When you see the character in the text, you increment the appropriate counters.
This is fast if all characters are "rare".
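A minimal sketch of the counter scheme, assuming 0-based locations; `count_matches_with_counters` is an illustrative name. Each text character increments one counter per occurrence of that character in the pattern, which is why the scheme is fast only when every character is rare.

```python
from collections import defaultdict


def count_matches_with_counters(text, pattern):
    """matches[i] = number of positions j with pattern[j] == text[i + j]."""
    n, m = len(text), len(pattern)
    offsets = defaultdict(list)        # symbol -> pattern offsets where it occurs
    for j, ch in enumerate(pattern):
        offsets[ch].append(j)
    matches = [0] * max(n - m + 1, 0)
    for t, ch in enumerate(text):
        for j in offsets.get(ch, ()):
            i = t - j                  # window start aligning pattern[j] with text[t]
            if 0 <= i <= n - m:
                matches[i] += 1
    return matches


matches = count_matches_with_counters("ABCAABCAC", "ABBAAC")
print(matches)                                    # [4, 2, 0, 4]
print([len("ABBAAC") - c for c in matches])       # mismatches: [2, 4, 6, 2]
```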

Partition the characters into rare and frequent.
Rare: occurs ≤ c times in the pattern.
For rare characters, run this scheme with the counters.
Takes O(nc) time.

Frequent chars
You have at most m/c of them.
Do a convolution for each.
Total cost: O((m/c) n log n).
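The rare/frequent partition can be sketched as a hybrid of the two schemes. The name `count_matches_hybrid` and the explicit threshold parameter `c` are illustrative, and the per-symbol correlation is again a direct sum standing in for the FFT convolution.

```python
from collections import Counter, defaultdict


def count_matches_hybrid(text, pattern, c):
    """Match counts per window: counters for rare symbols, one correlation per frequent symbol."""
    n, m = len(text), len(pattern)
    windows = max(n - m + 1, 0)
    matches = [0] * windows
    freq = Counter(pattern)

    rare_offsets = defaultdict(list)
    for j, ch in enumerate(pattern):
        if freq[ch] <= c:                  # rare: occurs at most c times in P
            rare_offsets[ch].append(j)

    # Rare symbols: at most c counter increments per text character, O(nc) total.
    for t, ch in enumerate(text):
        for j in rare_offsets.get(ch, ()):
            if 0 <= t - j < windows:
                matches[t - j] += 1

    # Frequent symbols: fewer than m/c of them, one correlation each.
    for a in (s for s in freq if freq[s] > c):
        p_a = [1 if ch == a else 0 for ch in pattern]
        t_a = [1 if ch == a else 0 for ch in text]
        for i in range(windows):
            # Direct sum as a stand-in for the FFT convolution.
            matches[i] += sum(p_a[j] * t_a[i + j] for j in range(m))
    return matches


print(count_matches_hybrid("ABCAABCAC", "ABBAAC", 2))   # [4, 2, 0, 4]
```

With `c = 2` in this example, "A" (three occurrences) is treated as frequent while "B" and "C" go through the counters; the result does not depend on `c`, only the running time does.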

Fix c = √(m log n), which balances the two costs.
Complexity: O(n √(m log n))
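A short worked balance of the two costs quoted on the preceding slides, as a sketch: it takes the O(nc) counter cost and the O((m/c) n log n) convolution cost at face value and ignores constant factors.

```latex
\begin{align*}
\text{cost}(c) &= O\!\left(nc + \frac{m}{c}\, n \log n\right) \\
nc = \frac{m}{c}\, n \log n
  &\;\Longrightarrow\; c^2 = m \log n
  \;\Longrightarrow\; c = \sqrt{m \log n} \\
\text{cost}\!\left(\sqrt{m \log n}\right) &= O\!\left(n \sqrt{m \log n}\right)
\end{align*}
```

A smaller c shifts work to the convolutions and a larger c shifts it to the counters, so the total is minimized, up to constants, where the two terms are equal.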

Back to the k-mismatch problem
Want to beat the O(nk) kangaroo bound.
Frequent symbol: a symbol that appears at least 2√k times in P.

Few (≤ √k) frequent symbols
Do the counters scheme for the non-frequent symbols: O(n√k)
Convolve for each frequent symbol: O(n log n) per symbol

Many (≥ √k) frequent symbols
Intuition: there cannot be too many places where we match.
- Consider √k frequent symbols.
- For each of them, consider the first 2√k appearances.
Do the counters scheme just for these symbols and occurrences.

Example of Masked Counting
k = 4, 2√k = 4
P = a b a c c a c b a c a b a c c
a-mask (first 4 a's) = a _ a _ _ a _ _ a _ _ _ _ _ _
c-mask (first 4 c's) = _ _ _ c c _ c _ _ c _ _ _ _ _
T = a b b b c c c a a a a b a c b ...
Use the a-mask and the c-mask to increment the counter of each text location.

Counting Stage: run through the text and count occurrences of all masks. Time: O(n√k).
For location i of T, if counter_i < k then there is no match at location i.
Why? The total # of elements in all masks is 2√k · √k = 2k.
Important observations:
1) Sum of all counters ≤ 2√k · n
2) Every location whose counter value is less than k already has more than k errors.

How many locations remain?
Sum of all counters: ≤ 2√k · n
Counter value of a potential match: ≥ k
# of potential matches: ≤ 2√k · n / k = 2n/√k
How do we check these locations? Use the Kangaroo Method.
Time: O(k) per location
Overall time: O((n/√k) · k) = O(n√k)
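The counting stage and the verification of the surviving locations can be sketched together. This is an illustration under stated assumptions, not the authors' implementation: it assumes k is a perfect square, it assumes the pattern has at least √k symbols occurring at least 2√k times each, and it verifies candidates by direct comparison where the slides use the O(k) kangaroo check. The function name and the choice of the alphabetically first frequent symbols are arbitrary.

```python
import math
from collections import Counter, defaultdict


def k_mismatch_many_frequent(text, pattern, k):
    """Return (matches, candidates) as lists of 0-based text locations."""
    n, m = len(text), len(pattern)
    root = math.isqrt(k)
    if root * root != k:
        raise ValueError("this sketch assumes k is a perfect square")
    freq = Counter(pattern)
    frequent = sorted(s for s in freq if freq[s] >= 2 * root)[:root]
    if len(frequent) < root:
        raise ValueError("fewer than sqrt(k) frequent symbols")

    # Masks: the first 2*sqrt(k) pattern offsets of each chosen symbol, 2k in total.
    mask = defaultdict(list)
    for j, ch in enumerate(pattern):
        if ch in frequent and len(mask[ch]) < 2 * root:
            mask[ch].append(j)

    # Counting stage: at most 2*sqrt(k) increments per text character.
    counters = [0] * max(n - m + 1, 0)
    for t, ch in enumerate(text):
        for j in mask.get(ch, ()):
            if 0 <= t - j < len(counters):
                counters[t - j] += 1

    # Filter: a counter below k means more than k of the 2k masked offsets mismatch.
    candidates = [i for i, c in enumerate(counters) if c >= k]

    def within_k(i):
        # Direct comparison, standing in for the kangaroo check.
        errors = 0
        for j in range(m):
            if pattern[j] != text[i + j]:
                errors += 1
                if errors > k:
                    return False
        return True

    return [i for i in candidates if within_k(i)], candidates


print(k_mismatch_many_frequent("abbbcccaaaabacb", "abaccacbacabacc", 4))   # ([], [0])
```

On the slide's example the single location survives the filter with a counter of exactly 4, and verification then rejects it because the full window has 6 mismatches.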

Additional Points
Can reduce to O(n √(k log k))