Faster Algorithm for String Matching with k Mismatches (II) Amihood Amir, Moshe Lewenstein, Ely Porat Journal of Algorithms, Vol. 50, 2004, pp. 257-275.

Presentation transcript:

Faster Algorithm for String Matching with k Mismatches (II)
Amihood Amir, Moshe Lewenstein, Ely Porat
Journal of Algorithms, Vol. 50, 2004, pp. 257-275
Date: Dec. 24, 2004
Created by: Hsing-Yen Ann

2004/11/22 Hsing-Yen Ann

Problem Definition
String matching with k mismatches:
  Input:  Text T = t_1 t_2 ... t_n
          Pattern P = p_1 p_2 ... p_m
          A natural number k
  Output: All pairs (i, ham(P, T[i, i+m-1])), where 1 ≤ i ≤ n and ham(P, T[i, i+m-1]) ≤ k
  ham(): Hamming distance (number of mismatches)
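As a baseline for the definition above, here is a minimal brute-force sketch in Python. It is not the paper's algorithm; it only restates the problem in executable form. The function names `ham` and `k_mismatch_naive` are illustrative, and the sketch reports only those starts i for which the window T[i, i+m-1] fits inside the text.

```python
def ham(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def k_mismatch_naive(text: str, pattern: str, k: int) -> list[tuple[int, int]]:
    """Return (i, distance) for every 1-based start i with distance <= k.

    Runs in O(n * m) time; it serves only as a reference point.
    """
    n, m = len(text), len(pattern)
    result = []
    for i in range(n - m + 1):
        d = ham(pattern, text[i:i + m])
        if d <= k:
            result.append((i + 1, d))  # 1-based, as on the slide
    return result
```

For example, `k_mismatch_naive("abcabd", "abc", 1)` reports the exact occurrence at position 1 and the one-mismatch occurrence at position 4.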

Algorithm for Solving this Problem
Two-stage algorithm:
  Marking stage
    Identifies the potential starts of the pattern.
    Reduces the number of locations to be verified.
    This stage is the focus of the paper.
  Verification stage
    Verifies which of the potential candidates is indeed a pattern occurrence.
    Uses the Kangaroo method for speed-up: O(1) time to jump to the next mismatch.
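The verification stage can be sketched as follows. This is a simplified illustration, not the implementation from the paper: the O(1) jump quoted on the slide needs a constant-time longest-common-extension query, which this sketch replaces with a plain character scan (`lce_naive`, a hypothetical helper). The jump-then-count structure of the Kangaroo method is the same. Positions are 0-based here.

```python
def lce_naive(text: str, pattern: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and pattern[j:].

    Stand-in for the O(1) query assumed by the Kangaroo method; this
    version scans characters and is therefore not constant time.
    """
    length = 0
    while (i + length < len(text) and j + length < len(pattern)
           and text[i + length] == pattern[j + length]):
        length += 1
    return length


def verify_candidate(text: str, pattern: str, start: int, k: int) -> bool:
    """True if pattern matches text at 0-based `start` with <= k mismatches."""
    m = len(pattern)
    if start < 0 or start + m > len(text):
        return False
    mismatches = 0
    j = 0
    while j < m:
        j += lce_naive(text, pattern, start + j, j)  # jump over a matching run
        if j >= m:
            break
        mismatches += 1          # position j is a mismatch
        if mismatches > k:
            return False         # stop after k + 1 mismatches
        j += 1                   # step past the mismatch and jump again
    return True
```

With a constant-time extension query, each candidate costs at most k + 1 jumps, which is why the marking stage tries to leave as few candidates as possible.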

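Previous Conclusion
This problem can be solved by the previously presented algorithms in [bound not preserved in the transcript].
  When [condition not preserved]: [not preserved]
  When [condition not preserved]: use another algorithm.
Finally, this problem can be solved in [bound not preserved in the transcript].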

Periodicity
periodic: S is periodic if S = u^j w, where j ≥ 2 and w is a prefix of u.
aperiodic: a string that is not periodic.

  Periodic          Aperiodic
  A A A A A A A     A A B C D E
  AB AB AB A        AB A
  ABCD ABCD ABC     ABCD ABC A
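The definition can be tested mechanically. A string S = u^j w with j ≥ 2 and w a prefix of u is exactly a string whose smallest period is at most |S|/2. The sketch below computes the smallest period from the KMP failure function. The function names are illustrative and do not come from the slides.

```python
def smallest_period(s: str) -> int:
    """Smallest p such that s[i] == s[i + p] for all valid i (0 if s is empty)."""
    n = len(s)
    if n == 0:
        return 0
    fail = [0] * n            # fail[i] = length of longest proper border of s[:i+1]
    k = 0
    for i in range(1, n):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return n - fail[-1]


def is_periodic(s: str) -> bool:
    """True if s = u^j w with j >= 2 and w a prefix of u."""
    return len(s) > 0 and 2 * smallest_period(s) <= len(s)
```

Applied to the examples on the slide, `is_periodic` returns True for AAAAAAA, ABABABA and ABCDABCDABC, and False for AABCDE, ABA and ABCDABCA.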

Breaks
break: an aperiodic substring of a string S.
l-break: a break of length l.
Cole and Hariharan [9] give a linear-time algorithm that finds all l-breaks for a given l.

Breaks (cont'd)
The useful property of a break: if an l-break of P matches T exactly at position i, then the next position in T at which this l-break can match is at least i + (l/2).
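The slide states this property without proof. One way to see it, written as a short derivation (here B denotes the l-break and d the distance between two of its occurrences; both symbols are introduced only for this sketch):

```latex
\begin{align*}
&\text{Assume } B = T[i..i+l-1] = T[i+d..i+d+l-1] \text{ with } 0 < d \le l/2. \\
&\Rightarrow\; B[x] = B[x+d] \quad (1 \le x \le l-d) \\
&\Rightarrow\; B = u^{j} w,\quad u = B[1..d],\; j = \lfloor l/d \rfloor \ge 2,\;
   w \text{ a prefix of } u \\
&\Rightarrow\; B \text{ is periodic, contradicting that } B \text{ is a break.}
\end{align*}
```

So two exact occurrences of the same l-break are more than l/2 positions apart, which is consistent with the bound on the slide.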

Some Lemmas
Lemma 3: Let P be a pattern with 2k disjoint l-breaks and let T be a text. In each match (with at most k mismatches) of P in T, at least k of the l-breaks match exactly.
Lemma 4: Let P be a pattern of length m with fewer than 2k l-breaks, and let T be of length 2m. Then all matches of P in T lie in a substring of T that has at most O(k) l-breaks.
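Lemma 3 is a pigeonhole argument. A short derivation, where e_r (a symbol introduced only for this sketch) denotes the number of mismatches that fall inside the r-th break in a given alignment:

```latex
\begin{align*}
\sum_{r=1}^{2k} e_r &\le k
  && \text{(the breaks are disjoint, so each mismatch lies in at most one break)} \\
\bigl|\{\, r : e_r \ge 1 \,\}\bigr| &\le \sum_{r=1}^{2k} e_r \le k \\
\bigl|\{\, r : e_r = 0 \,\}\bigr| &\ge 2k - k = k
\end{align*}
```

Hence at least k of the 2k breaks are matched exactly in every occurrence with at most k mismatches.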

Time Complexity on Different Cases
Case 1: There are at least 2k disjoint k-breaks in P.
  Time: O(n + m) = O(n)
Case 2: There are at least 2k disjoint l-breaks in P, where 2 ≤ l ≤ k-1.
  Time: O(k log k) for each local match
Case 3: There are not even 2k disjoint 2-breaks.
  Dominated pattern: O(n + m log k + (n k^3 log k)/m)
  Non-dominated pattern: O(n + m log k + (n k^4 log k)/m)

Algorithm for 2k k-breaks in P
Algorithm:
  1. Find all exact matches of all breaks in the text.
  2. For every such match, mark all text locations for pattern occurrences appropriate for this break.
  3. Discard every text location that has fewer than k marks.
Result:
  1. At most 4n/k candidates are left.
  2. The candidates can be marked in O(n + m) time.
  3. The verification stage needs O(n) time.
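The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation. The breaks are passed in as `(offset, length)` pairs into the pattern (finding them is outside this sketch). Exact matching uses repeated `str.find` calls instead of a linear-time multi-pattern matcher, so the O(n + m) bound is not reproduced. Positions are 0-based, and `mark_candidates` is an illustrative name.

```python
from collections import Counter


def mark_candidates(text: str, pattern: str,
                    breaks: list[tuple[int, int]], k: int) -> list[int]:
    """Return 0-based text locations that collect at least k marks."""
    n, m = len(text), len(pattern)
    marks = Counter()
    for offset, length in breaks:
        piece = pattern[offset:offset + length]
        pos = text.find(piece)
        while pos != -1:
            start = pos - offset          # pattern start implied by this match
            if 0 <= start <= n - m:
                marks[start] += 1
            pos = text.find(piece, pos + 1)
    # Step 3: discard locations with fewer than k marks.
    return sorted(s for s, count in marks.items() if count >= k)
```

For instance, with `pattern = "abcdefgh"`, breaks `[(0, 2), (2, 2), (4, 2), (6, 2)]`, `k = 2` and `text = "xxabcdefghxxabcdXfgYxx"`, the surviving locations are 2 (all four breaks match) and 12 (two breaks match). The 4n/k bound can be read off the marking: exact occurrences of one k-break are at least k/2 apart, so each break contributes at most 2n/k marks, the 2k breaks contribute at most 4n marks in total, and at most 4n/k locations can hold k or more of them.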


Algorithm for 2k l-breaks in P
Algorithm:
  1. Let S = {b_1, ..., b_2k} be a set of 2k disjoint l-breaks of P.
  2. Let S' = {b_1', ..., b_f'} be the set of distinct breaks in S. S' can be found in O(m) time.

Algorithm for 2k l-breaks in P (cont'd)
  3. Partition the text T into the local matching form T' = {T_1', T_2', ..., T_{2n/k-1}'}.
     Local match: split the text T into 2n/k - 1 overlapping substrings, each of length k, and solve the problem on each substring separately.
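One reading of the partition step, sketched in Python: pieces of length k that start every k/2 positions give exactly 2n/k - 1 overlapping pieces. The stride of k/2 is an assumption inferred from the piece count on the slide, not something the slide states. The sketch also assumes k is even and n is a multiple of k/2; otherwise the tail of the text is not covered. The name `overlapping_pieces` is illustrative.

```python
def overlapping_pieces(text: str, k: int) -> list[tuple[int, str]]:
    """Split text into length-k pieces starting every k // 2 positions.

    Returns (0-based start, piece) pairs. Assumes k is even and
    len(text) is a multiple of k // 2 (assumptions of this sketch).
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    step = k // 2
    last_start = max(len(text) - k, 0)
    return [(s, text[s:s + k]) for s in range(0, last_start + 1, step)]
```

With n = 12 and k = 4 this yields 5 = 2n/k - 1 pieces, and every window of length k/2 or less lies entirely inside at least one piece.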

Algorithm for 2k l-breaks in P (cont'd)
  4. For each piece T_i' and each break b_j' in S', create a balanced binary tree Tree(i, j).
     The height of each tree is O(log k).
     The number of trees is at most |T'| × |S'| = (2n/k) × 2k = O(n).

Algorithm for 2k l-breaks in P (cont'd)
There are at most n leaf nodes in all trees.
  => The trees can be constructed in O(n) time.
Given l contiguous text locations, the (at most 4) candidates can be identified in time |S'| × O(log k) = O(k log k).
  => All the candidates can be marked in time |T'| × O(k log k) = O(n log k).
There are at most 4n/l candidates.
The verification stage needs O(n) time.
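The two products quoted on these slides work out as follows, using only the bounds |T'| ≤ 2n/k and |S'| ≤ 2k given there:

```latex
\begin{align*}
\text{number of trees} &\le |T'| \cdot |S'| \le \frac{2n}{k} \cdot 2k = 4n = O(n) \\
\text{marking time} &= |T'| \cdot O(k \log k)
  = \frac{2n}{k} \cdot O(k \log k) = O(n \log k)
\end{align*}
```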

Algorithm for no 2k 2-breaks in P
Definitions:
  l-segment: partition P into equal segments of size l.
  Dominated pattern: at most 4k segments do not have the general period w.
  bad l-segment: an l-segment that is not fully within a periodic stretch of S.

Algorithm for no 2k 2-breaks in P (cont'd)
Lemma 6: Let P be a pattern with a dominating period w. In the partition of P into l-segments there are at most 8k bad l-segments.
The algorithm for dominated patterns runs in O(n + m log k + (n k^3 log k)/m) time.
For a non-dominated pattern P, there exists a sparsifying substring P' of length Ω(m/k), and P' is a dominated pattern. The algorithm runs in O(n + (n k^4 log k)/m) time.

Algorithm for no 2k 2-breaks in P (cont'd)
  1. Find all matches of P in T at overlapping (bad l-segment) locations.
  2. For each bad l-segment B, do pattern matching with pattern B and w 2l *.
  3. Do pattern matching with mismatches, with pattern w and text w 2l *.
  4. Compute the number of mismatches of P at the first |w| locations of T using steps 2 and 3.
  5. i ← |w|
  6. While the end of the text is not reached:
     6a. If i is not an overlapping location:
         6aa. number of mismatches at location i ← number of mismatches at location i - |w|
         6ab. i ← i + 1
     6b. Else, if j is the next non-overlapping location:
         6ba. For each bad l-segment that participates in an overlap in the overlapping locations (bad segment vs. bad segment) from i to j, update the number of mismatches it accrues in the next |w| locations.
         6bb. i ← j

