Real time pattern matching. Benny Porat, Ely Porat. Bar-Ilan University.


Pattern Matching
• Given a text T and a pattern P, the problem is to find all the substrings of T that are equal to P.
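As a baseline for the problem statement, here is a minimal offline sketch in Python. The function name `find_occurrences` is ours, not from the talk, and the quadratic scan is only meant to pin down what "all occurrences" means.

```python
def find_occurrences(text, pattern):
    """Return every start index s such that text[s:s+m] equals pattern."""
    m = len(pattern)
    # Try each alignment of the pattern against the text, left to right.
    return [s for s in range(len(text) - m + 1) if text[s:s + m] == pattern]
```

Overlapping occurrences are all reported, which is the behaviour the online variants later in the talk also need.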

Online pattern matching
• We get the text character by character.

Outline
• Motivation
• Presentation of 3 online models
• Space lower bound
• A black box algorithm
• Exact and approximate pattern matching in the streaming model

Motivation …
• Monitoring internet traffic

Motivation …
• Stock market

Motivation …
• Espionage

Motivation …
• Viruses and malware

3 online models
Model    Read-only memory               Working memory
First    m, for saving the pattern      O(m)
Second   m, for saving the pattern      O(poly(log m))
Third    0, we can't save the pattern   O(poly(log m))

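Space lower bound (deterministic)
• Assume an algorithm A uses o(m) space to solve the online pattern matching problem.
• Alice holds a string S = s_1, s_2, s_3, ..., s_m. She runs A with S as the pattern and hands A, with its o(m) space, to Bob.
• Bob runs over all the strings Q = q_1, q_2, ..., q_m and inserts each Q as the text for A.
• A reports a match exactly when Q = S, so Bob recovers all of S from only o(m) space.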
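Bob's side of the argument can be sketched in Python. `ToyMatcher` is a hypothetical stand-in for algorithm A (it stores the whole pattern, so it is not itself a small-space algorithm); the point is only that any correct online matcher, copied and fed every candidate Q, gives the pattern away.

```python
import copy
from itertools import product

class ToyMatcher:
    """Hypothetical stand-in for algorithm A; keeps the whole pattern."""
    def __init__(self, pattern):
        self.pattern = pattern
        self.window = ""

    def feed(self, ch):
        """Consume one text character; report whether a match ends here."""
        self.window = (self.window + ch)[-len(self.pattern):]
        return self.window == self.pattern

def bob_recovers(state, m, alphabet="01"):
    """Try every string Q of length m (m >= 1) on a fresh copy of the state."""
    for q in product(alphabet, repeat=m):
        a = copy.deepcopy(state)          # Bob never touches Alice's original
        hits = [a.feed(ch) for ch in q]
        if hits[-1]:                      # a match ending at position m means Q = S
            return "".join(q)
    return None
```

Since Bob can reconstruct any S this way, the state he received must be able to distinguish all strings of length m.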

A black box for online approximate pattern matching. Raphaël Clifford, Benny Porat, Ely Porat. CPM 2008.

Black box for the first model
Model    Read-only memory               Working memory
First    m, for saving the pattern      O(m)

Problem definition
• There are a lot of offline pattern matching algorithms.
• We want to find a black box algorithm that takes most offline pattern matching algorithms and converts them to be pseudo real time.
• Pseudo real time: take the best running time of the offline algorithm and divide it by n. This bounds the time per character. Not amortized!

Result
• For example, we can apply our algorithm to the following problems: Hamming norm, k-mismatch, matching under L_2, matching under L_1, online convolution, ...

Exact and Approximate Pattern Matching in the Streaming Model. Benny Porat, Ely Porat. FOCS 2009.

Solution for the third model
Model    Read-only memory               Working memory
Third    0, we can't save the pattern   O(poly(log m))
• Pattern matching
• Pattern matching with up to k mistakes

It's not minor!
• Cache works much faster than RAM. Now it can fit!
• Antivirus on routers.
• Researchers thought that there was a lower bound and that it couldn't be done.

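Randomized algorithm (RK)
• Pattern: p_{m-1}, ..., p_2, p_1, p_0
• Text: t_1, t_2, t_3, ..., t_i, t_{i+1}, t_{i+2}, ..., t_m, t_{m+1}, ..., t_n
• How can I calculate the signature of the next text window from the current one without remembering t_i?
• All the calculations are in F_q.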
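For reference, a sketch of the classical Rabin-Karp sliding signature in Python. The base `r` and the prime `q` are illustrative choices of ours, not values from the talk, and the explicit string check on a signature hit is the textbook offline variant. The update line shows why the character leaving the window, t_i, is needed, which is exactly what a streaming algorithm cannot keep.

```python
def rabin_karp(text, pattern, q=(1 << 61) - 1, r=256):
    """Return all start indices of pattern in text using a rolling signature."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    top = pow(r, m - 1, q)            # weight of the character leaving the window
    sig_p = sig_t = 0
    for i in range(m):                # signatures of the pattern and first window
        sig_p = (sig_p * r + ord(pattern[i])) % q
        sig_t = (sig_t * r + ord(text[i])) % q
    hits = []
    for s in range(n - m + 1):
        if sig_t == sig_p and text[s:s + m] == pattern:
            hits.append(s)
        if s + m < n:
            # Slide by one: remove text[s], shift, append text[s + m].
            sig_t = ((sig_t - ord(text[s]) * top) * r + ord(text[s + m])) % q
    return hits
```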

Streaming pattern matching
• The pattern starts with z, and there are no more z's in the pattern.
• Each time z appears in the text T, start signing from that position.
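A minimal Python sketch of this easy case, under our own assumptions: the signature is a polynomial fingerprint with illustrative constants `q` and `r`, and the function name is hypothetical. Only the first character, the length m and the pattern's signature are kept once the stream starts; a signature hit is reported without verification, so a false positive is possible in principle.

```python
def stream_match_unique_first(pattern, stream, q=(1 << 61) - 1, r=256):
    """Yield the end index of each (probable) occurrence of pattern in stream.

    Assumes pattern[0] does not occur anywhere else in the pattern.
    """
    z, m = pattern[0], len(pattern)
    if z in pattern[1:]:
        raise ValueError("first character must be unique in the pattern")
    target = 0
    for ch in pattern:                      # signature of the pattern
        target = (target * r + ord(ch)) % q
    signing, sig, length = False, 0, 0
    for i, ch in enumerate(stream):
        if ch == z:                         # start signing at every z
            signing, sig, length = True, 0, 0
        if not signing:
            continue
        sig = (sig * r + ord(ch)) % q
        length += 1
        if length == m:                     # a full window has been signed
            if sig == target:
                yield i
            signing = False
```

Restarting at every z is safe here because a window that contains a second z cannot equal the pattern, so at most one signature is alive at any time.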

No z
• There is a prefix U, of length at most m/2, such that U appears only once in the pattern.
• Seek U in recursion. Each time U is found in the text T, start signing.
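To make the condition on U concrete, a brute-force preprocessing sketch in Python. The helper name `shortest_unique_prefix` is ours, and the talk does not say how U is actually found; this only tests the stated property.

```python
def shortest_unique_prefix(pattern):
    """Return the shortest prefix U with |U| <= m/2 that occurs only once
    in pattern, or None if no such prefix exists."""
    m = len(pattern)
    for length in range(1, m // 2 + 1):
        prefix = pattern[:length]
        # Search from index 1 so the occurrence at index 0 is not counted.
        if pattern.find(prefix, 1) == -1:
            return prefix
    return None
```

When the function returns None, every prefix of length at most m/2 occurs again, which is the "no small U" situation on the next slide.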

No small U
• Look at the first m/2 characters. They appear again somewhere in the pattern.
• Option 1: P = v v v v v v v v followed by a prefix of v.
• Option 2: P = v v v v w, where w isn't a prefix of v and v isn't a prefix of w.
• The length of v is at most m/2.

Solving this case
• Option 2: P = v v v v w, with the length of v at most m/2.
• Search in recursion for v, and count how many times you found it.
• Sign on w: after the copies of v in the text T, start signing.

Solving this case (continued)
• Using O(log m) signatures and counters in the worst case.
• Time = O(log m) in the worst case.

Pattern matching with up to k mistakes
• 1-mismatch
• Pattern matching with up to k mistakes

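Chinese Remainder Theorem
• Let n and m be two coprime integers. Then for any residues a and b there is exactly one x with 0 ≤ x < nm such that x ≡ a (mod n) and x ≡ b (mod m).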
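A small Python illustration of the theorem for two moduli. The function name `crt_pair` is ours, and `pow(n, -1, m)` for the modular inverse needs Python 3.8 or later.

```python
from math import gcd

def crt_pair(a, n, b, m):
    """Return the unique x in [0, n*m) with x = a (mod n) and x = b (mod m)."""
    if gcd(n, m) != 1:
        raise ValueError("n and m must be coprime")
    a %= n
    # Choose k so that a + n*k is congruent to b modulo m.
    k = ((b - a) * pow(n, -1, m)) % m
    return a + n * k
```

Running it over every pair of residues for n = 3 and m = 5 produces each value in 0..14 exactly once, which is the uniqueness the 1-mismatch argument relies on.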

1-mismatch
Pattern: p_1, p_2, p_3, ..., p_m
mod 2: p_1, p_3, p_5, ... | p_2, p_4, p_6, ...
mod 3: p_1, p_4, p_7, ... | p_2, p_5, p_8, ... | p_3, p_6, p_9, ...
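The split into subpatterns by position modulo q is a one-liner in Python; `residue_subpatterns` is our name for it. Note that Python indexes from 0, so the class written p_1, p_3, p_5, ... on the slide is `seq[0::2]` here.

```python
def residue_subpatterns(seq, q):
    """Split seq into q subsequences, one per residue class of the index mod q."""
    return [seq[r::q] for r in range(q)]
```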

1-mismatch
mod 2: p_1, p_3, p_5, ... | p_2, p_4, p_6, ... compared against t_1, t_3, t_5, ... | t_2, t_4, t_6, ...
mod 3: p_1, p_4, p_7, ... | p_2, p_5, p_8, ... | p_3, p_6, p_9, ...
Overall: sum of all primes

1-mismatch
mod 2: p_1, p_3, p_5, ... | p_2, p_4, p_6, ... compared against t_1, t_3, t_5, ... | t_2, t_4, t_6, ...

Problem
• mod 2: p_1, p_3, p_5, ... | p_2, p_4, p_6, ... compared against t_1, t_3, t_5, ... | t_2, t_4, t_6, ...
• When do we compare p_1, p_3, p_5, ... with t_2, t_4, t_6, ...?
• For each q_i we will start to compare for each alignment.

Space complexity
• For each q_i we run our algorithm q_i times, once for each alignment.
• For each alignment we run it again q_i times, once for each shift.
• Overall:

Time complexity
• Each character goes to just one alignment for each shift.
• Overall:

1-mismatch
• Lemma 1: There is exactly one mismatch if and only if there is exactly one subpattern in each group that does not match (by the Chinese Remainder Theorem).
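Lemma 1 can be checked on small inputs with a Python sketch. The assumptions are ours: the subpatterns are compared directly rather than through streaming signatures, the pattern and the text window have equal length, and the product of the chosen primes exceeds that length so that two different positions never share a class for every prime.

```python
def mismatching_classes(pattern, window, q):
    """Residue classes mod q whose subpattern differs from the text's."""
    return [r for r in range(q) if pattern[r::q] != window[r::q]]

def exactly_one_mismatch(pattern, window, primes):
    """True when every group (one per prime) has exactly one bad subpattern."""
    return all(len(mismatching_classes(pattern, window, q)) == 1 for q in primes)
```

With primes 2, 3 and 5 (product 30) this is sound for windows of up to 30 characters: a single mismatch spoils one class per prime, while two mismatches must fall into different classes for at least one prime.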

Pattern matching with up to k mistakes
• Group testing / random selector …

A black box for online approximate pattern matching. Raphaël Clifford, Benny Porat, Ely Porat. CPM 2008.

The idea
• We will split the pattern p_1, p_2, p_3, ..., p_{m-3}, p_{m-2}, p_{m-1}, p_m into log(m) consecutive subpatterns:
  P_1 = p_m
  P_2 = p_{m-2}, p_{m-1}
  P_4 = p_{m-6}, p_{m-5}, p_{m-4}, p_{m-3}
  ...
  P_{m/2} = p_1, p_2, p_3, ..., p_{m/2}
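A Python sketch of the split, taking pieces of length 1, 2, 4, ... from the end of the pattern. The function name is ours, and keeping any leftover prefix as one extra piece is our assumption; the slide does not say how lengths that do not fit the doubling exactly are handled.

```python
def split_from_end(pattern):
    """Split pattern into consecutive pieces of doubling length, last piece first."""
    pieces = []
    end = len(pattern)
    size = 1
    while size <= end:
        pieces.append(pattern[end - size:end])   # piece of length `size`
        end -= size
        size *= 2
    if end > 0:
        pieces.append(pattern[:end])             # leftover prefix (assumption)
    return pieces
```

The number of pieces is at most about log2(m) + 1, and reversing and concatenating them gives back the pattern.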

Bring it online
• Let us look at a subpattern of length m', written P_{m'}. When we get to the i-th character of the text, where is P_{m'} aligned?
• Conclusion 1: We need to know DIFF(P_{m'}, T_{i-m',i}) only at position i+m' of the text.

The idea …
• For each subpattern of length m', we partition the text into overlapping substrings of length 2m'.

The idea …
• For each subpattern of length m' we run the offline algorithm on each partition of the text separately.
• This ensures that we get the differences on time.
• If i = 2lm' or i = 2lm' + m' for some l, run the offline algorithm on the last 2m' characters. We will get all the differences for this section.
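The partition step for a single subpattern can be sketched in Python. The assumptions are ours: the offline algorithm is a naive Hamming-distance routine standing in for whatever black box is plugged in, and the work is done in one burst at each multiple of m' (spreading it out comes later in the talk).

```python
from collections import deque

def offline_hamming(pattern, text):
    """Offline black box: Hamming distance of pattern at every alignment."""
    m = len(pattern)
    return [sum(a != b for a, b in zip(pattern, text[s:s + m]))
            for s in range(len(text) - m + 1)]

def online_differences(sub, stream):
    """Map each window end position (1-based) to its Hamming distance,
    rerunning the offline algorithm on the last 2m' characters whenever
    the position i is a multiple of m'."""
    mp = len(sub)
    last = deque(maxlen=2 * mp)        # only the last 2m' characters are kept
    diffs = {}
    for i, ch in enumerate(stream, start=1):
        last.append(ch)
        if i % mp == 0:
            block = list(last)
            start = i - len(block)     # text offset of the block
            for s, d in enumerate(offline_hamming(sub, block)):
                diffs[start + s + mp] = d
    return diffs
```

Consecutive blocks overlap by m' characters, so every window of length m' lies inside some block and no alignment is missed.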

Running Time
• T(n,m) = nT(m): the running time of the offline algorithm.
• For each subpattern of length m' we get overlapping partitions. Total time for each subpattern:
• Total time:
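The formulas on this slide did not survive in the transcript. As a hedged reconstruction of the arithmetic, assuming partitions of length 2m' that start every m' characters (so roughly n/m' of them) and T(n,m) = nT(m) as stated, the totals would work out as follows.

```latex
\begin{align*}
\text{one run on a partition} &: \quad T(2m', m') = 2m'\,T(m') \\
\text{number of partitions}   &: \quad \approx \frac{n}{m'} \\
\text{total for one subpattern} &: \quad \approx \frac{n}{m'} \cdot 2m'\,T(m') = 2n\,T(m') = O\bigl(n\,T(m')\bigr) \\
\text{total over all subpatterns} &: \quad O\Bigl(n \sum_{j=0}^{\log_2 m} T(2^j)\Bigr)
\end{align*}
```

Dividing the last line by n gives a per-character cost of the order of the sum of T(2^j), which is the shape of bound that "pseudo real time" asks for.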

The problem
• We saw that overall the time is good.
• But take m' = m/2, so that 2m' = m. At position i = 2(t+1)m' we must wait until the run of the offline algorithm on P_{m/2} and the last m characters finishes before we can return the answer for t_i. ⇒ (m/2)T(m) time!

The solution
• We will split the text into partitions of length 1.5m'.

The solution …
• The latest we will get DIFF(P_{m'}, T_{i-m',i}) is at index i + m'/2.
• By Conclusion 1, we can wait m'/2 characters before we need this difference.
• Conclusion 1: We need to know DIFF(P_{m'}, T_{i-m',i}) only at position i+m' of the text.

Spreading the work
• So, we can spread the work over the next m'/2 characters.
• The work on P_1, P_2, P_3 is interleaved; the work on P_1 must finish by the time we need to know the difference of P_1.

Spreading the work …
• Overall, we can spread the work for a specific subpattern evenly among all the characters of the text.
• All that is left to do is to check that the running time does not change.
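One way to picture the spreading, as a Python sketch under our own assumptions: the offline run is written as a generator that pauses after every character comparison, and a fixed budget of comparisons is executed per arriving text character. The names `hamming_job` and `run_spread` are hypothetical, and a real black box would need its offline algorithm to be interruptible in a similar way.

```python
import math
from itertools import islice

def hamming_job(pattern, block, out):
    """Offline Hamming distances over block, pausing after each comparison.
    Finished distances are appended to out."""
    m = len(pattern)
    for s in range(len(block) - m + 1):
        d = 0
        for k in range(m):
            d += pattern[k] != block[s + k]
            if k == m - 1:
                out.append(d)          # this alignment is complete
            yield                      # one unit of work finished

def run_spread(job, total_units, ticks):
    """Run the job over `ticks` arriving characters with an even budget.
    Returns the number of units actually executed at each tick."""
    budget = math.ceil(total_units / ticks)
    return [sum(1 for _ in islice(job, budget)) for _ in range(ticks)]
```

For a subpattern of length m' = 4 and a block of 1.5m' = 6 characters there are 3 alignments and 12 comparisons; spread over the next m'/2 = 2 characters this is 6 comparisons per character instead of 12 at once.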

Running Time
• T(n,m) = nT(m): the running time of the offline algorithm.
• For each subpattern of length m' we now get overlapping partitions. Total time for each subpattern:
• Total time for all the text: not changed!

Running Time …
• By spreading the work we get the total running time for each character:

Conclusion
• We give a space lower bound for deterministic online pattern matching.
• We give a black box algorithm that can adapt any offline algorithm into an online algorithm, using only O(m) space and running in pseudo real time per character.