Exact String Matching Algorithms Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU.


Presentation on theme: "Exact String Matching Algorithms Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU."— Presentation transcript:

1 Exact String Matching Algorithms Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU

2 Exact Matching: What's the Problem
i:  1  2  3  4  5  6  7  8  9 10 11 12
T:  b  b  a  b  a  x  a  b  a  b  a  y
P = aba
P occurs in T starting at locations 3, 7, and 9. Occurrences of P may overlap, as found at 7 and 9.
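The problem statement can be checked with a short sketch. It is only an illustration built on Python's built-in `str.find`, not one of the algorithms from this deck, and the function name `find_occurrences` is an assumption made for the example.

```python
def find_occurrences(text, pattern):
    """Return the 1-based start positions of every occurrence of pattern in text."""
    positions = []
    start = text.find(pattern)          # str.find gives a 0-based index, or -1
    while start != -1:
        positions.append(start + 1)     # convert to the slide's 1-based positions
        # Resume one character later (not len(pattern) later) so that
        # overlapping occurrences are also reported.
        start = text.find(pattern, start + 1)
    return positions


print(find_occurrences("bbabaxababay", "aba"))  # [3, 7, 9]
```

Resuming the search one position after each hit is what lets the overlapping occurrences at 7 and 9 both be reported.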

3 The Naive Method
The problem is to find whether a pattern P[1..m] occurs within a text T[1..n].
Example: P = abxyabxz and T = xabxyabxyabxz, where m = 8 and n = 13.

4 The Naive Method
If P = aaa and T = aaaaaaaaaa, then m = 3 and n = 10.
In the worst case the naive method makes exactly m(n-m+1) comparisons.
In this case that is 3 × 8 = 24 comparisons, which is Θ(mn).
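The count of 24 can be reproduced with a small sketch that runs the naive method and counts character comparisons. The function name `naive_comparisons` and its return shape are assumptions made for this illustration.

```python
def naive_comparisons(text, pattern):
    """Naive matching. Returns (1-based match positions, number of comparisons)."""
    text_len, pat_len = len(text), len(pattern)
    matches, comparisons = [], 0
    for i in range(text_len - pat_len + 1):   # every alignment of pattern on text
        j = 0
        while j < pat_len:
            comparisons += 1                  # one text/pattern character comparison
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == pat_len:                      # all pattern characters matched
            matches.append(i + 1)
    return matches, comparisons


matches, count = naive_comparisons("a" * 10, "aaa")
print(matches, count)   # [1, 2, 3, 4, 5, 6, 7, 8] 24
```

With a pattern of length 3 and a text of length 10 there are 8 alignments, and each one needs all 3 comparisons, which gives 24.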

5 The Naive Algorithm
    /* text[1..n] and pat[1..m] are treated as 1-indexed, as on the previous slides */
    void naive_search(char text[], char pat[], int n, int m)
    {
        int i, j, k, lim;
        lim = n - m + 1;
        for (i = 1; i <= lim; i++) {      /* search: try each alignment */
            k = i;
            for (j = 1; j <= m && text[k] == pat[j]; j++)
                k++;
            if (j > m)                    /* all m characters matched */
                Report_match_at_position(i);
        }
    }
The worst-case bound can be reduced to O(m + n). For applications with m = 1000 and n = 10,000,000 the improvement is significant.

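6 The Smart Algorithm
If you know that the first character of P (namely a) does not occur again in P until position 5 of P, then after a mismatch you can shift P by four positions instead of one. This skips over three comparisons.
Reasoning of this sort is the key to shifting by more than one character.
[Figure: P with positions 1 to 8 marked, aligned against T]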
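The reasoning can be checked on the example strings from the naive-method slide. The sketch below only verifies the facts the shift relies on; it is not an implementation of the smart algorithm, and the variable names are chosen for this illustration.

```python
P = "abxyabxz"
T = "xabxyabxyabxz"

# 1-based position of the next occurrence of P's first character after position 1.
next_pos = P.find(P[0], 1) + 1
print(next_pos)                      # 5

# With P aligned at position 2 of T, the naive method would next try
# alignments 3, 4 and 5. Each of them fails on its first comparison.
skipped = [T[start - 1] == P[0] for start in (3, 4, 5)]
print(skipped)                       # [False, False, False]

# The alignment reached after shifting by four positions is a real match.
print(T[6 - 1:6 - 1 + len(P)] == P)  # True
```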

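7 The Smarter Algorithm
As before, the shift skips over three comparisons. In addition, after the shift the comparison does not restart at the first character of P: it starts further along P and skips another three comparisons.
[Figure: alignments of P against T before and after the shift]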

8 The Smart Algorithms
Knuth-Morris-Pratt (KMP) algorithm
Boyer-Moore algorithm
These reduce the run time to O(n + m).
The additional knowledge requires preprocessing of the strings.
Usually P is much shorter than T, so P is preprocessed.

9 The Preprocessing Approach
Usually P is preprocessed instead of T.
Sometimes T is preprocessed, e.g. with a suffix tree.
The preprocessing methods are similar in spirit, but often quite different in detail and conceptual difficulty.
The fundamental preprocessing of P is independent of any particular algorithm. Each algorithm uses this information.

10 Basic String Definitions/Notations
Let S be a string.
S[i..j] is the substring of S starting at position i and ending at position j. S[i..j] is empty if i > j.
i:  1  2  3  4  5  6  7  8  9 10 11 12
S:  b  b  a  b  a  x  a  b  a  b  a  y
S[3..7] = abaxa
|S| is the length of the string. Here, |S| = 12.
S[1..i] is the prefix of S that ends at position i. Example: S[1..4] = bbab.
S[i..|S|] is the suffix of S that begins at position i. Example: S[9..12] = abay.
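Because the slides index strings from 1 and include both endpoints, while Python slices are 0-based and exclude the right endpoint, a small translation layer helps when experimenting. The helper names `substring`, `prefix` and `suffix` are assumptions made for this sketch.

```python
def substring(s, i, j):
    """S[i..j] with 1-based, inclusive positions; the empty string when i > j."""
    return s[i - 1:j] if i <= j else ""


def prefix(s, i):
    """S[1..i], the prefix of S that ends at position i."""
    return substring(s, 1, i)


def suffix(s, i):
    """S[i..|S|], the suffix of S that begins at position i."""
    return substring(s, i, len(s))


S = "bbabaxababay"
print(substring(S, 3, 7))   # abaxa
print(prefix(S, 4))         # bbab
print(suffix(S, 9))         # abay
print(len(S))               # 12
```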

11 Basic String Definitions/Notations
A proper prefix, suffix, or substring of S is, respectively, a prefix, suffix, or substring that is neither the entire string S nor the empty string.
For any string S, S(i) denotes the i-th character of S.

12 Preprocessing
Goal: to gather the information needed to speed up the algorithm.
Definitions:
– Z_i: for i > 1, the length of the longest substring of S that starts at i and matches a prefix of S
– Z-box: for any position i > 1 where Z_i > 0, the Z-box at i starts at i and ends at i + Z_i - 1
– r_i: for every i > 1, r_i is the right-most endpoint of the Z-boxes that begin at or before i
– l_i: for every i > 1, l_i is the left endpoint of the Z-box that ends at r_i

13 Preprocessing
Z_i(S) = the length of the longest prefix of S[i..|S|] that matches a prefix of S, where i > 1.
i:  1  2  3  4  5  6  7  8  9 10 11
S:  a  a  b  c  a  a  b  x  a  a  z
Z_5(S) = 3 (aabc…aabx…)
Z_6(S) = 1 (aa…ab…)
Z_7(S) = Z_8(S) = 0
Z_9(S) = 2 (aab…aaz)
We will use Z_i in place of Z_i(S).
A Z-box exists at each i > 1 where Z_i is greater than zero (Figure 1.2, from Gusfield).
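The definition translates directly into a quadratic-time sketch, which is useful as a reference when checking the faster algorithm. The function name `z_naive` and the padded 1-based result list are choices made for this illustration.

```python
def z_naive(s):
    """Z values computed straight from the definition (quadratic in the worst case).

    The result is padded so that z[i] is Z_i for 1-based i. The entries z[0]
    and z[1] are placeholders set to 0, since Z_i is only defined for i > 1.
    """
    n = len(s)
    z = [0] * (n + 1)
    for i in range(2, n + 1):
        length = 0
        # Extend while S(i + length) equals S(1 + length), in 1-based positions.
        while i + length <= n and s[i - 1 + length] == s[length]:
            length += 1
        z[i] = length
    return z


z = z_naive("aabcaabxaaz")
print(z[5], z[6], z[7], z[8], z[9])   # 3 1 0 0 2
```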

14 The l_i and r_i of Z-Box
r_i = the right-most endpoint of the Z-boxes that begin at or before position i.
l_i = the left end of the Z-box that ends at r_i.
[Figure: Z-boxes with endpoints marked at positions 40, 50, 55, 62, 70, 78, 82, 85, 89, 95]
r_78 = 95, l_78 = 78
r_82 = 95, l_82 = 78
r_52 = 50, l_52 = 40
r_75 = 85, l_75 = 70

15 Preprocessing
i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:    a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:    0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r_i:  0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l_i:  0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
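The r_i and l_i rows follow mechanically from the Z row. The sketch below derives them from the Z values listed on this slide. The function name `z_box_bounds` is an assumption, and when several Z-boxes end at the same r_i the sketch keeps the earliest one, which is one valid choice for l_i.

```python
def z_box_bounds(z):
    """Given z[i] = Z_i for 1-based i (z[0] is padding), return the lists r and l."""
    n = len(z) - 1
    r = [0] * (n + 1)
    l = [0] * (n + 1)
    right = left = 0                  # right-most Z-box seen so far
    for i in range(2, n + 1):
        end = i + z[i] - 1            # right end of the Z-box at i, if there is one
        if z[i] > 0 and end > right:
            right, left = end, i
        r[i], l[i] = right, left
    return r, l


# Z_1..Z_17 as listed on the slide (Z_1 is shown as 0 by convention).
Z_ROW = [0, 1, 0, 3, 1, 0, 0, 1, 0, 7, 1, 0, 3, 1, 0, 0, 0]
r, l = z_box_bounds([0] + Z_ROW)
print(r[1:])
print(l[1:])
```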

16 Z-Algorithm
Goal: to calculate Z_i for an input string S of length n in linear time.
Starting from i = 2, calculate Z_2, r_2 and l_2.
For k = 3; k <= n; k++
    In iteration k, calculate Z_k, r_k and l_k based on Z_j, r_j and l_j for j = 2,…,k-1.
For iteration k, the algorithm only needs r_{k-1} and l_{k-1}. Thus, there is no need to keep all r_i and l_i. We use r and l to denote r_{k-1} and l_{k-1}.

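17 Z-Algorithm
In iteration k:
(I) if k <= r
Position k lies inside the Z-box α = S[l..r], which matches the prefix α' = S[1..r'].
k' = k-l+1; r' = r-l+1
β = S[k..r] and β' = S[k'..r']
α = α'; β = β'
[Figure: positions l, k, r in S and the corresponding positions l', k', r' in the prefix]
i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:  a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y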

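18 Z-Algorithm
A) If |γ'| < |β'|, that is, Z_k' < r-k+1, then Z_k = Z_k'.
Here γ' is the Z-box at k', γ'' is the prefix of S that it matches, and γ is the copy of γ' that starts at k.
γ = γ' = γ''; x ≠ y, where x is the character that follows γ'' and y is the character that follows γ'. The character that follows γ is also y, because it still lies inside β = β'.
Example (l = 10, r = 16, k = 13): k' = 4 and Z_4 = 3 < r-k+1 = 4, so Z_13 = 3.
i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:  a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:  0  1  0  3  1  0  0  1  0  7  1  0  3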

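19 Z-Algorithm
B) If |γ'| > |β'|, that is, Z_k' > r-k+1, then Z_k = |β|, i.e., r-k+1.
β = β' = β'', where β'' is the prefix of S of length |β|.
x ≠ y (because α is a Z-box), where x follows α' and y follows α.
Because γ' is longer than β', the character x also follows β''. The character after β is y ≠ x, so the match at k stops at r and Z_k = |β| = r-k+1.
Example (l = 10, r = 14, k = 13): k' = 4 and Z_4 = 3 > r-k+1 = 2, so Z_13 = 2.
i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:  a  a  b  a  a  b  c  a  x  a  a  b  a  a  c  d
Z:  0  1  0  3  1  0  0  1  0  5  1  0  2  1  0  0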

20 Z-Algorithm
C) If |γ'| = |β'|, that is, Z_k' = r-k+1, then Z_k ≥ |β|, i.e., Z_k ≥ r-k+1.
β = β' = β'', where β'' is the prefix of S of length |β|.
x ≠ y (because α is a Z-box), where x follows α' and y follows α.
z ≠ x (because γ' is a Z-box), where z follows β''.
Whether z equals y is unknown, so compare S[r+1,…] with S[|β|+1,…] until a mismatch occurs. Update Z_k, r, and l.
Example (l = 10, r = 14, k = 13): k' = 4 and Z_4 = 2 = r-k+1. S(15) = S(3) = b matches and S(16) = d ≠ S(4) = a, so Z_13 = 3, r = 15 and l = 13.
i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:  a  a  b  a  a  e  c  a  x  a  a  b  a  a  b  d
Z:  0  1  0  2  1  0  0  1  0  5  1  0  3  1  0  0
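To see which case applies at a given position, the following sketch recomputes the Z values from the definition, finds l and r as they stand before iteration k, and classifies k. It is a checking aid under those assumptions rather than part of the linear-time algorithm, and the names `z_naive` and `classify` are hypothetical.

```python
def z_naive(s):
    """Z values from the definition; z[i] is Z_i for 1-based i (z[0], z[1] are 0)."""
    n = len(s)
    z = [0] * (n + 1)
    for i in range(2, n + 1):
        while i + z[i] <= n and s[i - 1 + z[i]] == s[z[i]]:
            z[i] += 1
    return z


def classify(s, k):
    """Return 'A', 'B', 'C' (all with k <= r) or 'II' (k > r) for 1-based position k."""
    z = z_naive(s)
    l = r = 0
    for i in range(2, k):                 # right-most Z-box that begins before k
        if z[i] > 0 and i + z[i] - 1 > r:
            l, r = i, i + z[i] - 1
    if k > r:
        return "II"
    k_prime = k - l + 1                   # position in the prefix that mirrors k
    beta_len = r - k + 1                  # |beta| = |S[k..r]|
    if z[k_prime] < beta_len:
        return "A"
    if z[k_prime] > beta_len:
        return "B"
    return "C"


# The three example strings from the case A, B and C slides, all at k = 13.
print(classify("aabaabcaxaabaabcy", 13))  # A
print(classify("aabaabcaxaabaacd", 13))   # B
print(classify("aabaaecaxaabaabd", 13))   # C
```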

21 Z-Algorithm
(II) if k > r
Compare the characters starting at k with those starting at 1 until a mismatch occurs. This gives Z_k.
Update r and l if necessary: if Z_k > 0, set r = k + Z_k - 1 and l = k.

22 Z-Algorithm
Input: Pattern P (of length n)
Output: Z_i
Algorithm Z
Calculate Z_2, r_2 and l_2 explicitly by comparisons. r = r_2 and l = l_2
for k = 3; k <= n; k++
    if k <= r
        if Z_{k-l+1} < r-k+1, then Z_k = Z_{k-l+1}
        else if Z_{k-l+1} > r-k+1, then Z_k = r-k+1
        else compare the characters starting at r+1 with those starting at |β|+1. Update r and l if necessary
    else
        compare the characters starting at k with those starting at 1. Update r and l if necessary
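The pseudocode can be turned into a runnable sketch along the following lines. It keeps the slides' 1-based positions by padding the result list, folds the special handling of Z_2 into the general loop (k = 2 always falls into the k > r branch), and uses the hypothetical name `z_algorithm`.

```python
def z_algorithm(s):
    """Linear-time Z values; z[i] is Z_i for 1-based i (z[0] and z[1] are 0)."""
    n = len(s)
    z = [0] * (n + 1)
    l = r = 0
    for k in range(2, n + 1):
        if k > r:
            # Case II: compare from position k against the start of the string.
            q = 0
            while k + q <= n and s[k - 1 + q] == s[q]:
                q += 1
            z[k] = q
            if q > 0:
                l, r = k, k + q - 1
        else:
            k_prime = k - l + 1
            beta_len = r - k + 1
            if z[k_prime] < beta_len:       # case A
                z[k] = z[k_prime]
            elif z[k_prime] > beta_len:     # case B
                z[k] = beta_len
            else:
                # Case C: compare S(r+1), ... with S(|beta|+1), ... until a mismatch.
                q = r + 1
                while q <= n and s[q - 1] == s[q - k]:
                    q += 1
                z[k] = q - k
                l, r = k, q - 1
    return z


print(z_algorithm("aabaabcaxaabaabcy")[1:])
# [0, 1, 0, 3, 1, 0, 0, 1, 0, 7, 1, 0, 3, 1, 0, 0, 0]
```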

23 Preprocessing
i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:  a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:  0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:  0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:  0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10

24 Z-Algorithm
Time complexity
#mismatches <= number of iterations, n, because each iteration stops at its first mismatch.
#matches: let q be the number of matches at iteration k; then r increases by at least q. Since r <= n, the total #matches <= n.
T = O(#matches + #mismatches + #iterations) = O(n)
i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:     a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:     0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:     0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:     0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
#m:    0  1  0  3  0  0  0  1  0  7  0  0  0  0  0  0  0
#mis:  0  1  1  1  0  0  1  1  1  1  0  0  0  0  0  0  1
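The bound can be observed by instrumenting the algorithm with counters. The sketch below is one possible compact formulation (cases A and B are merged into a single `min`), with the hypothetical name `z_comparison_counts`. It counts a mismatch only when a comparison is actually made, so running off the end of the string costs nothing.

```python
def z_comparison_counts(s):
    """Run the Z-algorithm on s and return (total matches, total mismatches)."""
    n = len(s)
    z = [0] * (n + 1)                     # z[i] is Z_i for 1-based i
    l = r = 0
    matches = mismatches = 0
    for k in range(2, n + 1):
        if k <= r and z[k - l + 1] != r - k + 1:
            # Cases A and B: Z_k is known without any character comparison.
            z[k] = min(z[k - l + 1], r - k + 1)
            continue
        # Case II starts comparing at k, case C starts just past r.
        q = k if k > r else r + 1
        while q <= n:
            if s[q - 1] == s[q - k]:
                matches += 1
                q += 1
            else:
                mismatches += 1
                break
        z[k] = q - k
        if z[k] > 0 and q - 1 > r:        # every match pushes r to the right
            l, r = k, q - 1
    return matches, mismatches


print(z_comparison_counts("aabaabcaxaabaabcy"))   # (12, 8)
```

For this 17-character string the totals agree with the sums of the #m and #mis rows, and both stay below n.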

25 Simplest Linear Time Exact Matching Algorithm
Input: Pattern P, Text T
Output: Occurrences of P in T
Algorithm Simplest
S = P$T, where $ is a character that does not appear in P or T
for i = 2; i <= |S|; i++
    Calculate Z_i
    If Z_i = |P|, then report that there is an occurrence of P in T starting at position i-|P|-1 of T
Running time: O(|P| + |T| + 1) = O(n + m)
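A minimal end-to-end sketch of this matcher follows. It assumes a non-empty pattern and a separator character that occurs in neither string (checked at run time), and the names `z_values` and `find_with_z` are hypothetical. The Z computation is a compact variant in which cases A and B are handled by one `min`.

```python
def z_values(s):
    """Linear-time Z values; z[i] is Z_i for 1-based i (z[0] and z[1] are 0)."""
    n = len(s)
    z = [0] * (n + 1)
    l = r = 0
    for k in range(2, n + 1):
        if k <= r and z[k - l + 1] != r - k + 1:
            z[k] = min(z[k - l + 1], r - k + 1)   # cases A and B
            continue
        q = max(k, r + 1)                         # case II starts at k, case C at r+1
        while q <= n and s[q - 1] == s[q - k]:
            q += 1
        z[k] = q - k
        if z[k] > 0 and q - 1 > r:
            l, r = k, q - 1
    return z


def find_with_z(pattern, text, sep="$"):
    """Return the 1-based positions in text where pattern occurs, using S = P$T."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    if sep in pattern or sep in text:
        raise ValueError("separator must not occur in pattern or text")
    s = pattern + sep + text
    z = z_values(s)
    m = len(pattern)
    # Position i of S with i > m + 1 corresponds to position i - m - 1 of text.
    return [i - m - 1 for i in range(m + 2, len(s) + 1) if z[i] == m]


print(find_with_z("aba", "bbabaxababay"))   # [3, 7, 9]
```

This version stores Z values for the whole of S for simplicity, so it uses more memory than strictly necessary.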

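26 Simplest Linear Time Exact Matching Algorithm
Takes only O(|P|) extra space: because of the $ separator no Z value exceeds |P|, so k' always falls inside P and only the Z values of P need to be stored.
Alphabet-independent linear time.
[Figure: positions l, k, r in T and the corresponding positions l', k', r' in P, separated by $]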

27 Reference Chapter 1, 2: Exact Matching: Fundamental Preprocessing and First Algorithms

