
Exact and Approximate Pattern Matching in the Streaming Model, by Benny Porat and Ely Porat (FOCS 2009). Presented by Tanushree Mitra.


1 Exact and Approximate Pattern Matching in the Streaming Model, by Benny Porat and Ely Porat (FOCS 2009). Presented by Tanushree Mitra.

2 Problem Statement
Find all occurrences of a pattern P of length m as a contiguous substring of a text T of length n, where m < n.

3 Contributions
Exact pattern matching
- A fully online randomized algorithm for the classical pattern matching problem
- Time complexity: $O(\log m)$ per character that arrives
- Space complexity: $O(\log m)$, breaking the $O(m)$ barrier that held for this problem for a long time
Approximate pattern matching
- An algorithm for the pattern matching with k mismatches problem
- Time complexity: $O(k^2\,\mathrm{poly}(\log m))$ per character
- Space complexity: $O(k^3\,\mathrm{poly}(\log m))$

4 Applications
- Monitoring Internet traffic
- Computational biology
- Large-scale web searching
- Virus and malware detection
- Automatic stock market analysis
- Robotics

5 Background
Brute force algorithm
- Slide the pattern along the text
- Compare it to the corresponding portion of the text
- Time complexity: O(mn)
A speedup is possible in each of these two steps.
Sliding step, sped up by preprocessing the pattern
- Knuth-Morris-Pratt algorithm
- Boyer-Moore algorithm
- Ukkonen's algorithm to construct suffix trees
Comparison step
- Rabin-Karp algorithm
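To make the two steps of the brute force approach concrete, here is a minimal sketch. The function name `brute_force_matches` and the 0-based indexing are illustrative choices, not taken from the slides. The outer loop is the sliding step and the inner loop is the comparison step, which is where the O(mn) bound comes from.

```python
def brute_force_matches(text, pattern):
    """Return every 0-based index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):           # sliding step: n - m + 1 shifts
        offset = 0
        while offset < m and text[start + offset] == pattern[offset]:
            offset += 1                       # comparison step: up to m checks
        if offset == m:                       # all m characters agreed
            matches.append(start)
    return matches
```

KMP and Boyer-Moore reduce the work of the outer loop, while Rabin-Karp replaces the inner loop with a constant-time fingerprint comparison.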

6 Quick History

7 The Intuition
Combine the key features of the KMP and Rabin-Karp algorithms to obtain an online algorithm that uses less space.
The Idea
- When Rabin-Karp's algorithm is done with the i'th character and advances to the next position in the text, it does not use any of the information it has gathered.
- The KMP algorithm, on the other hand, puts that information to good use.

8 Definitions - Fingerprints
Fingerprint: a string $S$ is mapped to a fingerprint $\phi(S)$.
Polynomial fingerprint: for $S = s_1, s_2, \dots, s_l$,
$\phi_{r,p}(S) = s_1 r + s_2 r^2 + \dots + s_l r^l \bmod p$, where $p \in \Theta(n^4)$ and $r \in \mathbb{F}_p$.
False positives: if $S_1 \ne S_2$, then $\Pr[\phi_{r,p}(S_1) = \phi_{r,p}(S_2)] < 1/n^3$.
Sliding fingerprint
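The polynomial fingerprint and its sliding variant can be sketched as follows. This is an illustration under stated assumptions, not the presenters' code. The modulus `P_MOD` is a fixed Mersenne prime rather than a prime chosen in $\Theta(n^4)$, characters are mapped to integers with `ord`, and the names `fingerprint`, `SlidingFingerprint`, `extend`, and `drop_prefix` are hypothetical. The sliding step relies on the identity $\phi(S_1 S_2) = \phi(S_1) + r^{|S_1|}\phi(S_2) \bmod p$.

```python
P_MOD = (1 << 61) - 1  # assumption: a fixed Mersenne prime stands in for p


def fingerprint(s, r, p=P_MOD):
    """phi(S) = s_1*r + s_2*r^2 + ... + s_l*r^l mod p."""
    acc, power = 0, 1
    for ch in s:
        power = power * r % p
        acc = (acc + ord(ch) * power) % p
    return acc


class SlidingFingerprint:
    """Fingerprint of a window that grows on the right and shrinks on the left."""

    def __init__(self, r, p=P_MOD):
        self.r, self.p = r, p
        self.r_inv = pow(r, p - 2, p)  # modular inverse of r (p is prime, r != 0)
        self.value = 0                 # fingerprint of the current window
        self.length = 0                # number of characters in the window
        self.power = 1                 # r^length mod p

    def extend(self, ch):
        """Append one character to the right end of the window."""
        self.power = self.power * self.r % self.p
        self.value = (self.value + ord(ch) * self.power) % self.p
        self.length += 1

    def drop_prefix(self, prefix_value, prefix_len):
        """Remove a prefix whose fingerprint and length are already known."""
        shift = pow(self.r_inv, prefix_len, self.p)
        self.value = (self.value - prefix_value) * shift % self.p
        self.power = self.power * shift % self.p
        self.length -= prefix_len
```

Note that `drop_prefix` needs only the fingerprint and length of the removed prefix, not its characters. That is what allows a window to slide by one period when only the period's fingerprint has been stored.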

9 Definitions - Period
Period: a prefix $S_p = s_1, s_2, \dots, s_l$ of a string $S$ of length $n$ is defined to be a period of $S$ iff $s_i = s_{i+l}$ for all $1 \le i \le n - l$.
Period of $P_l$: for a pattern $P = p_1, p_2, \dots, p_m$, the prefix of length $l$ is $P_l = p_1, p_2, \dots, p_l$, with $0 \le l \le m$. The shortest period of $P_l$ is denoted $\mathrm{period}_{P_l}$.
If $P_l$ matches the text at a given index $i$, then there cannot be another match starting strictly between $i$ and $i + |\mathrm{period}_{P_l}|$.
Put the information to good use.
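The length of the shortest period of every prefix can be computed in one pass with the KMP border (failure) table, since the shortest period of a string of length $l$ has length $l$ minus the length of its longest proper border. The sketch below assumes that standard fact. The function name `shortest_period_lengths` is an illustrative choice.

```python
def shortest_period_lengths(pattern):
    """Entry l-1 of the result is the shortest period length of pattern[:l]."""
    m = len(pattern)
    # border[l] = length of the longest proper prefix of pattern[:l]
    # that is also a suffix of pattern[:l]
    border = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = border[k]              # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        border[i + 1] = k
    return [l - border[l] for l in range(1, m + 1)]
```

For example, every prefix of "abcabcab" with length at least 3 has shortest period "abc", so two matches of such a prefix cannot start fewer than 3 positions apart.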

10 The Idea
A match at the i'th index indicates that we know the last m characters of the text, so is there any point in saving them?
Preprocessing phase
- Calculate the sliding fingerprint of the pattern, $\phi_P$, and of its shortest period, $\phi_{\mathrm{period}_P}$.
Online phase
- Slide the fingerprint $\phi$ over the entire text.
- While $\phi = \phi_P$, slide $\phi$ by $|\mathrm{period}_P|$ characters.
- If we do not reach the end of the text, abort.
False positives?
- We slide over $|\mathrm{period}_P|$ positions that could be a match.
- The probability of a false positive is very low.
- The text and the pattern must satisfy stringent restrictions.

11 Go for subpatterns
Use log m subpatterns.
[Figure: the pattern $p_1, p_2, \dots, p_m$ divided into the subpatterns $P_1, P_2, P_4, \dots, P_{m/2}$]
Starting point: find a position at which the smallest subpattern matches the text. The smallest subpattern has length 1, so this is easy to find.

12 Algorithm
Guidelines
- Find a position where $P_i$ is a match, then try to match $P_{i+1}$ from the same starting point as $P_i$.
- If $P_{i+1}$ does not match, use the information that $P_i$ is a match: check in jumps of $|\mathrm{period}_{P_i}|$ until there is no overlap with the area where $P_i$ matches.
Process
1. Initialize an empty sliding fingerprint $\phi$.
2. For each character that arrives:
- Extend $\phi$ to include the new character.
- If $|\phi| = 2^i$ and $\phi = \phi_i$ for some $0 \le i \le \log m$, then $P_i$ matches at this position.
- Otherwise, if $\phi$ overlaps the last match by at least $|\mathrm{period}_{P_{i-1}}|$ characters, slide $\phi$ by $|\mathrm{period}_{P_{i-1}}|$ characters.
- Else, abort.
What if there is a match that starts in the substring of the first process and ends in the substring of the second process?

13 Exact_PM Final Algorithm
Introduce checkpoints
- Checkpoint: start a new process at the last checkpoint of each process.
Algorithm, preprocessing phase
- Initialize an empty sliding fingerprint $\phi$.
- For each $0 \le i \le \log m$, calculate the sliding fingerprint $\phi_i$ of $P_i$ and the sliding fingerprint $\phi_{i,\mathrm{period}}$ of the period of $P_i$.
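The preprocessing phase can be sketched as below. Several things here are assumptions made for illustration: subpattern $P_i$ is taken to be the prefix of length $2^i$ (suggested by the condition $|\phi| = 2^i$ in the process description), the full pattern is added as a final entry when m is not a power of two, the modulus is a fixed Mersenne prime, and the names `preprocess`, `fingerprint`, and `shortest_period_length` are hypothetical. The sketch reads the whole pattern offline, so it shows what is stored, not how to compute it in small space.

```python
def fingerprint(s, r, p):
    """phi(S) = s_1*r + s_2*r^2 + ... + s_l*r^l mod p."""
    acc, power = 0, 1
    for ch in s:
        power = power * r % p
        acc = (acc + ord(ch) * power) % p
    return acc


def shortest_period_length(s):
    """Length of s minus the length of its longest proper border."""
    border = [0] * (len(s) + 1)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = border[k]
        if s[i] == s[k]:
            k += 1
        border[i + 1] = k
    return len(s) - border[len(s)]


def preprocess(pattern, r, p=(1 << 61) - 1):
    """Return one (length, phi_i, phi_i_period) triple per subpattern."""
    m = len(pattern)                   # assumed to be at least 1
    lengths, length = [], 1
    while length < m:                  # prefix lengths 1, 2, 4, ...
        lengths.append(length)
        length *= 2
    lengths.append(m)                  # assumption: finish with the full pattern
    table = []
    for length in lengths:
        prefix = pattern[:length]
        period = prefix[:shortest_period_length(prefix)]
        table.append((length, fingerprint(prefix, r, p), fingerprint(period, r, p)))
    return table
```

The table holds O(log m) entries of constant size, which matches the space bound quoted for the fingerprints produced by preprocessing.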

14 Final Algorithm - Online Phase
- Start a new process.
- Send every character that arrives to all the processes.
- If some process aborts, start a new process.
- If some process A reaches a checkpoint:
  - Stop the 'son process' of A (if it has one).
  - Start a new 'son process' of A.

15 Complexity
Space
- All fingerprints from preprocessing use $O(\log m)$ space.
- Each process saves another fingerprint, and there can be at most log m processes in parallel.
- Overall usage: $O(\log m)$ space.
Time
- Each process spends $O(1)$ time for each new character that arrives.
- At any time there are at most 3 log m processes running (process A, the son process of A, and the grandson process of A; A has to die when the great-grandson of A is created).
- Overall running time: $O(\log m)$ per character.

16 Pattern Matching (1-Mismatch)
Partition the pattern and the text.
We need to align every partition $P_{q_i,j}$ of the pattern to $q_i$ shifts of the text.

17 Intuition
For each $P_{q_i,j}$, run $q_i$ processes of Exact_PM.
$\mathrm{Process}_{q_i,j,\sigma}$ is the $\sigma$'th process of the subpattern $P_{q_i,j}$, for $0 \le \sigma < q_i$. It tries to match $P_{q_i,j}$ to the text by treating the text as if it starts from the $\sigma$'th character ($\tau \bmod q_i = j - \sigma$).
If for all $q_i$:
- $\mathrm{numOfNotMatch}_{q_i,\sigma} = 0$: 'match'
- $\mathrm{numOfNotMatch}_{q_i,\sigma} = 1$: 'exactly 1-mismatch'
- Otherwise: 'more than 1-mismatch'
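The classification rule can be illustrated offline, for a single alignment of the pattern against an equally long window of the text. This is a simplified sketch, not the streaming algorithm: it compares subsequences directly instead of running Exact_PM processes, uses 0-based positions, and takes $P_{q,j}$ to be the characters at positions congruent to j modulo q. The names `num_not_match` and `classify` are hypothetical, and the sketch assumes the product of the chosen primes is at least the pattern length, so that two distinct positions differ modulo at least one prime.

```python
def num_not_match(pattern, window, q):
    """Count the residue classes j (mod q) whose subpatterns differ."""
    return sum(pattern[j::q] != window[j::q] for j in range(q))


def classify(pattern, window, primes):
    """Classify one alignment; pattern and window must have equal length."""
    counts = [num_not_match(pattern, window, q) for q in primes]
    if all(c == 0 for c in counts):
        return "match"
    if all(c == 1 for c in counts):
        return "exactly 1-mismatch"
    return "more than 1-mismatch"
```

A single mismatch at position x spoils exactly one subpattern per prime, namely the one with j = x mod q. Two mismatches at positions x and y land in different residue classes for at least one prime, which pushes that prime's count to 2.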

18 Complexity
Facts
- We run $\sum_{i=1}^{l} q_i^2$ processes of Exact_PM.
- There exists a constant $c$ such that for any $x$ there exist $x / \log m$ prime numbers between $x$ and $cx$.
- We have $q_1, q_2, \dots, q_l$ groups of partitions. Each $q_i$ is a prime number.
Space: $O(\log^4 m / \log\log m)$
Time: $O(\log^3 m / \log\log m)$

19 Pattern Matching (k-Errors)
Preprocessing phase
- Initialize a process $\mathrm{Process}_{q_i,j,\sigma}$ of 1-mismatch for each $q_i \in \{q_1, q_2, \dots, q_l\}$, $0 \le j < q_i$ and $0 \le \sigma < q_i$.
Online phase
- Send the $\tau$'th character to each $\mathrm{Process}_{q_i,j,\sigma}$ such that $\tau \bmod q_i = j - \sigma$.
- Let d be the number of mismatches collected from all processes that return 'exactly 1-mismatch'.
- If d > k, report more than k mismatches.

20 Complexity
Space
- Run $\sum_{i=1}^{k \log m} q_i^2 \in O(k^3 \log^4 m / \log\log m)$ processes of 1-mismatch in parallel.
- Each process requires $\log^4 m$ space.
- Overall: $O(k^3\,\mathrm{poly}(\log m))$
Time
- The number of processes of the 1-mismatch algorithm is bounded by $\sum_{i=1}^{k \log m} q_i^2 \in O(k^3 \log^4 m / \log\log m)$.
- Running time for each character: $O(\log^3 m)$
- Overall: $O(k^2\,\mathrm{poly}(\log m))$

21 Concluding Discussion
- The two-dimensional string-matching problem
- The string-matching problem with wild characters. Example: the pattern P = {abc#abc#} is found in the texts T1 = {abcdcadbaccabc} and T2 = {abcabc}.
- String matching with weighted mismatch

