Exact and Approximate Pattern Matching in the Streaming Model. Presented by Tanushree Mitra. Paper by Benny Porat and Ely Porat, FOCS 2009.

Problem Statement
Find all occurrences of a pattern P of length m as a contiguous substring of a text string T of length n, where m < n.

Contributions
Exact pattern matching: a fully online randomized algorithm for the classical pattern matching problem.
- Time complexity: $O(\log m)$ per arriving character.
- Space complexity: $O(\log m)$, breaking the $O(m)$ barrier that held for this problem for a long time.
Approximate pattern matching: an algorithm for the pattern matching with k mismatches problem.
- Time complexity: $O(k^2 \,\mathrm{poly}(\log m))$ per character.
- Space complexity: $O(k^3 \,\mathrm{poly}(\log m))$.

Applications
- Monitoring Internet traffic
- Computational biology
- Large-scale web searching
- Virus and malware detection
- Automatic stock market analysis
- Robotics

Background
Brute force algorithm:
- Slide the pattern along the text, and
- compare it to the corresponding portion of the text.
- Time complexity: $O(mn)$.
A speedup is possible in each of these two steps.
Sliding step, sped up by preprocessing the pattern:
- Knuth-Morris-Pratt algorithm
- Boyer-Moore algorithm
- Ukkonen's algorithm to construct suffix trees
Comparison step:
- Rabin-Karp algorithm
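The brute force baseline from this slide can be sketched in a few lines of Python. This is only an illustration of the slide-and-compare loop; the function name `brute_force_match` is an assumption made for this sketch and does not come from the slides.

```python
def brute_force_match(text: str, pattern: str) -> list[int]:
    """Return every 0-based index at which `pattern` occurs in `text`.

    Slides the pattern along the text and compares it with the
    corresponding portion of the text, for O(mn) time in the worst case.
    """
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):            # sliding step
        if text[start:start + m] == pattern:  # comparison step, O(m)
            matches.append(start)
    return matches
```

For example, `brute_force_match("abracadabra", "abra")` reports the two occurrences at indices 0 and 7. Note that this baseline keeps the whole text and pattern in memory, which is exactly what the streaming algorithm avoids.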

Quick History

The Intuition
Combine the key features of the KMP and Rabin-Karp algorithms to obtain an online algorithm that uses less space.
The Idea
- When the Rabin-Karp algorithm is done with the i-th character and advances to the next position in the text, it does not use any of the information it has gathered.
- The KMP algorithm, on the other hand, puts that information to good use.

Definitions - Fingerprints
- Fingerprint: a string S is mapped to a fingerprint $\phi(S)$.
- Polynomial fingerprint: $\phi_{r,p}(S) = s_1 r + s_2 r^2 + \dots + s_l r^l \bmod p$, where $p \in \Theta(n^4)$ and $r \in F_p$.
- False positives: if $S_1 \neq S_2$, then the probability that $\phi_{r,p}(S_1) = \phi_{r,p}(S_2)$ is less than $1/n^3$.
- Sliding fingerprint
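A minimal Python sketch of the polynomial fingerprint and of one way to slide it is given below. Several things here are assumptions made for illustration: the class and function names, the fixed prime $2^{61} - 1$ (the slide only asks for a prime in $\Theta(n^4)$), the use of `ord` to turn characters into numbers, and the `drop_first` operation, which looks up the departing character in the stored text. The streaming algorithm cannot store the text, so this only demonstrates the arithmetic. The modular inverse via `pow(r, -1, p)` requires Python 3.8 or later.

```python
MERSENNE_61 = 2**61 - 1  # a known prime; stands in for p (assumption)


class SlidingFingerprint:
    """Maintains phi(S) = s_1*r + s_2*r^2 + ... + s_l*r^l mod p."""

    def __init__(self, r: int, p: int = MERSENNE_61):
        if r % p == 0:
            raise ValueError("r must be non-zero modulo p")
        self.r, self.p = r, p
        self.r_inv = pow(r, -1, p)   # modular inverse of r
        self.value = 0               # fingerprint of the current string
        self.length = 0              # l, the current string length
        self.next_power = r % p      # r^(l+1) mod p

    def append(self, ch: str) -> None:
        """Extend the string by one character on the right."""
        self.value = (self.value + ord(ch) * self.next_power) % self.p
        self.next_power = self.next_power * self.r % self.p
        self.length += 1

    def drop_first(self, ch: str) -> None:
        """Remove the leftmost character `ch` and shift all exponents down."""
        self.value = (self.value - ord(ch) * self.r) * self.r_inv % self.p
        self.next_power = self.next_power * self.r_inv % self.p
        self.length -= 1


def fingerprint(s: str, r: int, p: int = MERSENNE_61) -> int:
    fp = SlidingFingerprint(r, p)
    for ch in s:
        fp.append(ch)
    return fp.value


def find_candidates(text: str, pattern: str, r: int, p: int = MERSENNE_61) -> list[int]:
    """Rabin-Karp style scan: indices whose window fingerprint equals the pattern's."""
    m = len(pattern)
    target = fingerprint(pattern, r, p)
    window = SlidingFingerprint(r, p)
    hits = []
    for i, ch in enumerate(text):
        window.append(ch)
        if window.length > m:
            window.drop_first(text[i - m])  # needs the stored text (not streaming)
        if window.length == m and window.value == target:
            hits.append(i - m + 1)
    return hits
```

Every true occurrence is always reported; a reported index that is not a true occurrence is a false positive, which for a randomly chosen `r` happens only with small probability.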

Definitions - Period
- Period: a prefix $S_p = s_1, s_2, \dots, s_l$ of a string $S = s_1, s_2, \dots, s_n$ is a period of S iff $s_i = s_{i+l}$ for all $1 \le i \le n - l$.
- Prefix $P_l$: for a pattern $P = p_1, p_2, \dots, p_m$, the prefix of length l is $P_l = p_1, p_2, \dots, p_l$, with $0 \le l \le m$.
- $\mathrm{period}_{P_l}$: the shortest period of $P_l$.
- If $P_l$ matches the text at a given index i, then there cannot be another match of $P_l$ starting strictly between i and $i + |\mathrm{period}_{P_l}|$.
- Put this information to good use.
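The period definition can be checked directly with a short Python sketch. It tests each candidate length against the definition, which takes quadratic time and is meant only to make the definition concrete; the function names are assumptions made for this illustration, and indices are 0-based here rather than 1-based as on the slide.

```python
def shortest_period_length(s: str) -> int:
    """Length of the shortest prefix of `s` that is a period of `s`.

    A prefix of length l is a period iff s[i] == s[i + l] for every
    valid i. The whole string is always a period of itself.
    """
    n = len(s)
    for l in range(1, n + 1):
        if all(s[i] == s[i + l] for i in range(n - l)):
            return l
    return 0  # only reached for the empty string


def prefix_periods(pattern: str) -> list[int]:
    """Shortest period length of every non-empty prefix P_1, ..., P_m."""
    return [shortest_period_length(pattern[:l]) for l in range(1, len(pattern) + 1)]
```

For instance, `shortest_period_length("abcabcab")` is 3 (the period is "abc"), so two occurrences of "abcabcab" in a text cannot start fewer than 3 positions apart.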

The Idea
- A match at the i-th index tells us the last m characters, so is there any point in saving them?
- Preprocessing phase: calculate the sliding fingerprint of the pattern, $\phi_P$, and of its shortest period, $\phi_{\mathrm{period}_P}$.
- Online phase:
  - Slide the fingerprint $\phi$ over the entire text.
  - While $\phi = \phi_P$, slide $\phi$ by $|\mathrm{period}_{P_l}|$ characters.
  - If we do not reach the end of the text, abort.
- False positives? Slide over the $|\mathrm{period}_{P_l}|$ positions that could be a match. The probability of false positives is very low.
- The text and the pattern should satisfy stringent restrictions.

Go for subpatterns
- $\log m$ subpatterns.
- [Figure: the pattern $p_1, p_2, \dots, p_m$ divided into subpatterns of lengths 1, 2, 4, ..., m/2, labelled $P_1, P_2, P_4, \dots, P_{m/2}$.]
- Starting point: find a position at which the smallest subpattern matches the text. The smallest subpattern has length 1, so this can be found easily.
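As a rough illustration of why only about $\log m$ subpatterns are needed, the sketch below lists subpatterns whose lengths double each time. It assumes the subpatterns are the prefixes of the pattern of length $2^i$, which is how the condition $|\phi| = 2^i$ on the algorithm slide reads; the figure on this slide did not survive, so treat that choice, and the function name, as assumptions.

```python
def power_of_two_prefixes(pattern: str) -> list[str]:
    """Prefixes of `pattern` of length 1, 2, 4, ... (assumed subpatterns)."""
    prefixes = []
    length = 1
    while length <= len(pattern):
        prefixes.append(pattern[:length])
        length *= 2   # doubling gives floor(log2(m)) + 1 prefixes
    return prefixes
```

For a pattern of length 8 this yields four subpatterns, of lengths 1, 2, 4 and 8, so storing one fingerprint per subpattern costs only logarithmic space.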

Algorithm
Guidelines:
- Find a position where $P_i$ is a match, then try to match $P_{i+1}$ from the same starting point as $P_i$.
- If $P_{i+1}$ does not match, use the information that $P_i$ is a match: check in jumps of $|\mathrm{period}_{P_i}|$ until there is no overlap with the area where $P_i$ matches.
Process:
1. Initialize an empty sliding fingerprint $\phi$.
2. For each character that arrives:
  - Extend $\phi$ to include the new character.
  - If $|\phi| = 2^i$ and $\phi \neq \phi_i$ for some $0 \le i \le \log m$: if $\phi$ overlaps the last match by at least $|\mathrm{period}_{P_{i-1}}|$ characters, slide $\phi$ by $|\mathrm{period}_{P_{i-1}}|$ characters; else, abort.
Open question: what if there is a match that starts in the substring of the first process and ends in the substring of the second process?

Exact_PM final Algorithm
- Introduce checkpoints.
- Checkpoint: start a new process at the last checkpoint of each process.
Algorithm, preprocessing phase:
- Initialize an empty sliding fingerprint $\phi$.
- For each $0 \le i \le \log m$, calculate the sliding fingerprints $\phi_i$ of $P_i$ and $\phi_{i,\mathrm{period}}$ of the period of $P_i$.

Final Algorithm - Online Phase
- Start a new process.
- Send every character that arrives to all the processes.
- If some process aborts, start a new process.
- If some process A reaches a checkpoint:
  - Stop the 'son process' of A (if it has one).
  - Start a new 'son process' of A.

Complexity
Space:
- All fingerprints from preprocessing use $O(\log m)$ space.
- Each process saves another fingerprint, and there can be at most $\log m$ processes in parallel.
- Overall usage: $O(\log m)$ space.
Time:
- Each process spends $O(1)$ time on each new character that arrives.
- At any time there are at most $3 \log m$ processes running (process A, the son process of A, and the grandson process of A; A has to die when the great-grandson of A is created).
- Overall running time: $O(\log m)$ per character.

Pattern Matching (1-Mismatch)
- Partition the pattern and the text.
- We need to align every partition $P_{q_i,j}$ of the pattern to $q_i$ text shifts.

Intuition
- For each $P_{q_i,j}$, run $q_i$ processes of Exact_PM.
- $\mathrm{Process}_{q_i,j,\sigma}$ is the $\sigma$-th process of the subpattern $P_{q_i,j}$, for $0 \le \sigma < q_i$. It tries to match $P_{q_i,j}$ to the text by treating the text as if it started from character $\sigma$ ($\tau \bmod q_i = j - \sigma$).
- If, for all $q_i$:
  - $\mathrm{numOfNotMatch}_{q_i,\sigma} = 0$: 'match'.
  - $\mathrm{numOfNotMatch}_{q_i,\sigma} = 1$: 'exactly 1-mismatch'.
  - Otherwise: 'more than 1-mismatch'.
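The counting rule can be illustrated offline with a short Python sketch. It is not the streaming algorithm: each subpattern is compared with plain string equality instead of an Exact_PM process, the window of text is stored, and the primes are simply the smallest primes whose product exceeds m. All names below, and that choice of primes, are assumptions made for this illustration. Here $P_{q,j}$ is taken to be the characters of the pattern at positions congruent to j modulo q.

```python
def small_primes_with_product_above(m: int) -> list[int]:
    """Smallest primes 2, 3, 5, ... whose product exceeds m (illustrative choice)."""
    primes, product, candidate = [], 1, 2
    while product <= m:
        if all(candidate % q != 0 for q in primes):  # trial division
            primes.append(candidate)
            product *= candidate
        candidate += 1
    return primes


def classify_alignment(text: str, pattern: str, start: int) -> str:
    """Classify the alignment of `pattern` at text[start:] by residue classes."""
    m = len(pattern)
    if start < 0 or start + m > len(text):
        raise ValueError("pattern does not fit at this position")
    window = text[start:start + m]
    counts = []
    for q in small_primes_with_product_above(m):
        # Number of residue classes j (mod q) whose subpattern fails to match.
        not_match = sum(1 for j in range(q) if pattern[j::q] != window[j::q])
        counts.append(not_match)
    if all(c == 0 for c in counts):
        return "match"
    if all(c == 1 for c in counts):
        return "exactly 1-mismatch"
    return "more than 1-mismatch"
```

The reason this works: two distinct mismatch positions differ by less than m, so they cannot agree modulo every prime when the product of the primes exceeds m. Some prime therefore places them in different residue classes and its count is at least 2.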

Complexity
Facts:
- We run $\sum_{i=1}^{l} q_i^2$ processes of Exact_PM.
- There exists a constant c such that for any x, there exist $x / \log m$ prime numbers between x and cx.
- We have $q_1, q_2, \dots, q_l$ groups of partitions. Each $q_i$ is a prime number.
Space: $O(\log^4 m / \log\log m)$
Time: $O(\log^3 m / \log\log m)$

Pattern Matching (k-Errors)
Preprocessing phase:
- Initialize a process $\mathrm{Process}_{q_i,j,\sigma}$ of 1-mismatch for each $q_i \in \{q_1, q_2, \dots, q_l\}$, $0 \le j < q_i$ and $0 \le \sigma < q_i$.
Online phase:
- Send the character $\tau$ to each $\mathrm{Process}_{q_i,j,\sigma}$ such that $\tau \bmod q_i = j - \sigma$.
- Let d be the set of all mismatches from all processes that return 'exactly 1-mismatch'.
- If $d > k$: more than k mismatches.

Complexity
Space:
- Run $\sum_{i=1}^{k \log m} q_i^2 \in O(k^3 \log^4 m / \log\log m)$ processes of 1-mismatch in parallel.
- Each process requires $\log^4 m$ space.
- Overall: $O(k^3 \,\mathrm{poly}(\log m))$.
Time:
- The number of processes of the 1-mismatch algorithm is bounded by $\sum_{i=1}^{k \log m} q_i^2 \in O(k^3 \log^4 m / \log\log m)$.
- Running time for each character: $O(\log^3 m)$.
- Overall: $O(k^2 \,\mathrm{poly}(\log m))$.

Concluding Discussion
- The two-dimensional string-matching problem
- The string-matching problem with wild characters. Example: pattern P = {abc#abc#} is found in texts T1 = {abcdcadbaccabc}, T2 = {abcabc}
- String matching with weighted mismatch