String-Matching Algorithms (UNIT-5)

1 String-Matching Algorithms (UNIT-5)
ADVANCED ALGORITHMS String-Matching Algorithms (UNIT-5)

2 String Matching :
Let there be a text array T[1..n] of length n and a pattern array P[1..m] of length m, with m ≤ n. The characters of T and P are drawn from a finite alphabet Σ. T and P are called ‘Strings of Characters’. The pattern P occurs with shift s in the text T if 0 ≤ s ≤ n – m and T[s+1..s+m] = P[1..m], i.e., T[s+j] = P[j] for 1 ≤ j ≤ m. If P occurs with shift s in T, then s is a VALID SHIFT; otherwise, s is an INVALID SHIFT.
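The definition of a valid shift can be checked directly. The sketch below is illustrative only: the function name `is_valid_shift` is not part of the slides, and it uses Python's 0-based slices, so the slide's T[s+1..s+m] becomes `text[s:s + m]`.

```python
def is_valid_shift(text: str, pattern: str, s: int) -> bool:
    """Return True if `pattern` occurs in `text` with shift `s`."""
    n, m = len(text), len(pattern)
    # A shift is only meaningful when 0 <= s <= n - m.
    if not 0 <= s <= n - m:
        return False
    # Slide notation T[s+1..s+m] (1-based) is text[s:s+m] (0-based).
    return text[s:s + m] == pattern


# Text and pattern of Ex-1: shift 3 is valid, shift 0 is not.
print(is_valid_shift("abcabaabcabac", "abaa", 3))  # True
print(is_valid_shift("abcabaabcabac", "abaa", 0))  # False
```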

3 The String-matching Problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T. Ex-1 : Let text T : a b c a b a a b c a b a c and pattern P : a b a a. Find the number of valid shifts and the values of s. Answer : There is only one valid shift, s = 3, because T[4..7] = a b a a. The symbol Σ* (read as ‘sigma-star’) denotes the set of all finite-length strings formed using characters from the alphabet Σ.

4 The zero-length string is called the ‘Empty String’.
The empty string, denoted by ‘ɛ’, also belongs to Σ*. The length of a string x is denoted |x|. The concatenation of two strings x and y, denoted xy, has length |x| + |y|. A string ω is a prefix of a string x, denoted ω ⊏ x, if x = ωy for some string y ∊ Σ*. Note that if ω ⊏ x, then |ω| ≤ |x|. Similarly, a string ω is a suffix of a string x, denoted ω ⊐ x, if x = yω for some string y ∊ Σ*. Note that if ω ⊐ x, then |ω| ≤ |x|.

5 Ex-2 : Let abcca be a string.
Here, ab ⊏ abcca and cca ⊐ abcca. Note-1 : The empty string ɛ is both a suffix and a prefix of every string. Note-2 : Both ⊏ and ⊐ are transitive relations. Lemma (overlapping suffixes) : Suppose that x, y, and z are strings such that x ⊐ z and y ⊐ z. If |x| ≤ |y|, then x ⊐ y. If |x| ≥ |y|, then y ⊐ x. If |x| = |y|, then x = y.
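The prefix and suffix relations map directly onto Python's `str.startswith` and `str.endswith`. The sketch below is a hedged illustration: the helper names are made up for this example, and `lemma_holds` only spot-checks the lemma on given strings; it is not a proof.

```python
def is_prefix(w: str, x: str) -> bool:
    """w is a prefix of x: x = w + y for some string y."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """w is a suffix of x: x = y + w for some string y."""
    return x.endswith(w)


def lemma_holds(x: str, y: str, z: str) -> bool:
    """Spot-check the overlapping-suffix lemma for one triple (x, y, z)."""
    if not (is_suffix(x, z) and is_suffix(y, z)):
        return True  # hypothesis not satisfied, nothing to check
    if len(x) <= len(y) and not is_suffix(x, y):
        return False
    if len(x) >= len(y) and not is_suffix(y, x):
        return False
    if len(x) == len(y) and x != y:
        return False
    return True


print(is_prefix("ab", "abcca"), is_suffix("cca", "abcca"))  # True True
print(is_prefix("", "abcca"), is_suffix("", "abcca"))        # True True
print(lemma_holds("ca", "cca", "abcca"))                     # True
```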

6 2. The Naïve String-matching Algorithm :
This algorithm finds all valid shifts using a loop that checks the condition P[1..m] = T[s+1..s+m] for each of the n – m + 1 possible values of s.
NAÏVE-STRING-MATCHER(T,P)
    n = T.length
    m = P.length
    for s = 0 to n – m
        if P[1..m] == T[s+1..s+m]
            print “Pattern occurs with shift” s
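The naive matcher can be sketched in a few lines of Python. This is a minimal illustration under the usual assumptions (0-based strings, shifts collected in a list instead of printed); the function name is chosen for this example only.

```python
from typing import List


def naive_string_matcher(text: str, pattern: str) -> List[int]:
    """Return every valid shift of `pattern` in `text`, in increasing order."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each of the n - m + 1 candidate shifts in turn.
    for s in range(n - m + 1):
        # The slice comparison costs up to m character comparisons.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_string_matcher("acaabc", "aab"))  # [2]
print(naive_string_matcher("aaaaa", "aa"))    # [0, 1, 2, 3]
```

Each candidate shift may cost up to m comparisons, so the worst case is (n – m + 1)·m character comparisons, reached for inputs such as a text and pattern made of a single repeated character.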

7 Ex-3 : Let T = acaabc & P = aab
Find the value of s. Answer : s = 2, because T[3..5] = aab. Ex-4 : Let T = … and P = 0001. Find the values of s. Answer : s = 1, 5 and 11. Ex-5 : Let T = a^n and P = a^m, i.e., n and m copies of the character a, with m ≤ n. Answer : Every shift s = 0 to n – m is valid, i.e., there are n – m + 1 valid shifts.

8 3. The Rabin-Karp Algorithm :
Let Σ = {0, 1, 2, … , 9}, so that each character is a decimal digit and d = |Σ| = 10. A string of k consecutive characters is then read as a length-k decimal number; for example, the string 31415 represents the number 31,415 in radix-d notation. Let there be a text T[1..n] and a pattern P[1..m]. Let p denote the decimal value of P[1..m], and let t_s denote the decimal value of the length-m substring T[s+1..s+m], for s = 0, 1, 2, …, n – m. Then t_s = p iff T[s+1..s+m] = P[1..m]; that is, s is a valid shift iff t_s = p.

9 t_{s+1} = 10 (t_s – 10^{m-1} T[s+1]) + T[s+m+1].
The value of p can be computed using Horner’s rule. Reading P[1..m] as the digits P[1] P[2] P[3] … P[m], we get p = P[m] + 10 (P[m-1] + 10 (P[m-2] + … + 10 (P[2] + 10 P[1])…)). Similarly, t_0 can be computed as t_0 = T[m] + 10 (T[m-1] + 10 (T[m-2] + … + 10 (T[2] + 10 T[1])…)). Each t_{s+1} can then be computed from t_s as follows : t_{s+1} = 10 (t_s – 10^{m-1} T[s+1]) + T[s+m+1]. Subtracting 10^{m-1} T[s+1] removes the high-order digit, multiplying by 10 shifts the remaining digits one position to the left, and adding T[s+m+1] brings in the new low-order digit.
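Horner's rule and the window update can be sketched as follows. This is an illustrative sketch without any modulus, so the values grow with m; the function names `horner_value` and `roll` are assumptions made for this example.

```python
def horner_value(digits: str, d: int = 10) -> int:
    """Evaluate a digit string in radix d using Horner's rule."""
    value = 0
    for ch in digits:
        # Shift the digits accumulated so far one place left, then add the next.
        value = d * value + int(ch)
    return value


def roll(t_s: int, leading: int, incoming: int, m: int, d: int = 10) -> int:
    """Compute t_{s+1} from t_s by dropping `leading` and appending `incoming`."""
    return d * (t_s - d ** (m - 1) * leading) + incoming


print(horner_value("31415"))   # 31415
print(roll(31415, 3, 2, m=5))  # 14152
```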

10 t_{s+1} = (d (t_s – T[s+1] h) + T[s+m+1]) mod q.
Ex-6 : Let m = 5 and t_s = 31415, so that T[s+1] = 3, and let T[s+m+1] = 2. Then t_{s+1} = 10 (t_s – 10^{m-1} T[s+1]) + T[s+m+1] = 10 (31415 – 30000) + 2 = 14150 + 2 = 14152. Since p and t_s may be too large to handle conveniently, they are computed modulo q. Let q be chosen so that dq fits in one computer word; the above recurrence can then be written as : t_{s+1} = (d (t_s – T[s+1] h) + T[s+m+1]) mod q. Here, h ≡ d^{m-1} (mod q), i.e., h is the value of the digit “1” in the high-order position of an m-digit text window.
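The modular recurrence can be checked on the numbers of Ex-6. The sketch below assumes q = 13 purely as an illustrative modulus, and the name `roll_mod` is made up for this example. It relies on Python's `%` returning a non-negative result for a positive modulus, so no extra correction is needed when the inner difference is negative.

```python
def roll_mod(t_s: int, leading: int, incoming: int, h: int, d: int, q: int) -> int:
    """Compute t_{s+1} mod q from t_s mod q."""
    return (d * (t_s - leading * h) + incoming) % q


d, q, m = 10, 13, 5       # q = 13 is an assumed example modulus
h = pow(d, m - 1, q)      # h = d^(m-1) mod q
t_s = 31415 % q           # hash of the current window
t_next = roll_mod(t_s, leading=3, incoming=2, h=h, d=d, q=q)

print(h)                  # 3
print(t_s)                # 7
print(t_next)             # 8
print(14152 % q)          # 8, the hash of the next window computed directly
```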

11 The test t_s ≡ p (mod q) is a fast heuristic test to rule out invalid shifts s.
For any value of s, if t_s ≡ p (mod q) is TRUE but P[1..m] = T[s+1..s+m] is FALSE, then s is called a SPURIOUS HIT. Note : a) If t_s ≡ p (mod q) is TRUE, then t_s = p may or may not be TRUE, so the shift must be verified by comparing the characters. b) If t_s ≡ p (mod q) is FALSE, then t_s ≠ p is definitely TRUE, so s is an invalid shift.
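A spurious hit is easy to see with concrete numbers. In the sketch below, the values 31415 and 67399 and the modulus q = 13 are assumptions chosen for illustration: the two numbers differ, yet they leave the same remainder modulo 13.

```python
q = 13                                      # assumed example modulus
p_value, window_value = 31415, 67399        # assumed pattern and window values

same_hash = p_value % q == window_value % q  # the fast heuristic test passes
really_equal = p_value == window_value       # the explicit comparison fails
is_spurious_hit = same_hash and not really_equal

print(p_value % q, window_value % q)  # 7 7
print(is_spurious_hit)                # True
```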

12 RABIN-KARP-MATCHER (T,P,d,q)
    n = T.length
    m = P.length
    h = d^{m-1} mod q
    p = 0
    t_0 = 0
    for i = 1 to m                      // preprocessing
        p = (d p + P[i]) mod q
        t_0 = (d t_0 + T[i]) mod q
    for s = 0 to n – m                  // matching
        if p == t_s
            if P[1..m] == T[s+1..s+m]
                print “Pattern occurs with shift” s
        if s < n – m
            t_{s+1} = (d (t_s – T[s+1] h) + T[s+m+1]) mod q
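The pseudocode above might be sketched in Python as follows. This is a hedged illustration, not a definitive implementation: it assumes the text and pattern are non-empty strings of decimal digits, uses 0-based indexing, and returns valid shifts and spurious hits as two lists instead of printing. The function name, the return shape, and the digit strings in the usage lines are assumptions made for this example.

```python
from typing import List, Tuple


def rabin_karp_matcher(text: str, pattern: str, d: int = 10,
                       q: int = 13) -> Tuple[List[int], List[int]]:
    """Return (valid_shifts, spurious_hits) for digit strings."""
    n, m = len(text), len(pattern)
    valid: List[int] = []
    spurious: List[int] = []
    if m == 0 or m > n:
        return valid, spurious
    h = pow(d, m - 1, q)              # weight of the high-order digit, mod q
    p = t = 0
    for i in range(m):                # preprocessing via Horner's rule
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    for s in range(n - m + 1):        # matching
        if p == t:
            # Equal hashes only suggest a match; verify character by character.
            if text[s:s + m] == pattern:
                valid.append(s)
            else:
                spurious.append(s)
        if s < n - m:
            # Drop text[s], shift left, bring in text[s + m].
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return valid, spurious


valid, spurious = rabin_karp_matcher("2359023141526739921", "31415", d=10, q=13)
print(valid)     # [6]
print(spurious)  # [12]
```

A small q keeps the arithmetic cheap but produces more spurious hits; a larger prime q reduces them at no extra cost per shift, as long as dq still fits in a machine word.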

13 Ex-7 : Let T = 2359023141526739921 and P = 31415. Here n = 19, m = 5, d = 10, q = 13 and h = 10^4 mod 13 = 3. Initially p = 0 and t_0 = 0. First for statement :
i = 1 : p = 3    t_0 = 2
i = 2 : p = 5    t_0 = 10
i = 3 : p = 2    t_0 = 1
i = 4 : p = 8    t_0 = 6
i = 5 : p = 7    t_0 = 8

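14 Second for statement (s = 0 to 8) :
s    p    t_s    T[s+1..s+m]    p == t_s    s < n – m    t_{s+1}    Remark
0    7    8      23590          FALSE       TRUE         9
1    7    9      35902          FALSE       TRUE         3
2    7    3      59023          FALSE       TRUE         11
3    7    11     90231          FALSE       TRUE         0
4    7    0      02314          FALSE       TRUE         1
5    7    1      23141          FALSE       TRUE         7
6    7    7      31415          TRUE        TRUE         8          S = 6, valid match (VM)
7    7    8      14152          FALSE       TRUE         4
8    7    4      41526          FALSE       TRUE         5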

15 Hence, there is only ONE VALID MATCH at s = 6
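Second for statement (s = 9 to 14) :
s    p    t_s    T[s+1..s+m]    p == t_s    s < n – m    t_{s+1}    Remark
9    7    5      15267          FALSE       TRUE         10
10   7    10     52673          FALSE       TRUE         11
11   7    11     26739          FALSE       TRUE         7
12   7    7      67399          TRUE        TRUE         9          S = 12, spurious hit (SH)
13   7    9      73992          FALSE       TRUE         11
14   7    11     39921          FALSE       FALSE        ---
Hence, there is only ONE VALID MATCH at s = 6, and there is only ONE SPURIOUS HIT at s = 12.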

16 The Knuth-Morris-Pratt Algorithm :
This algorithm is meant for ‘Pattern Matching’. It relies on the prefix function π of the pattern, which encapsulates knowledge about how the pattern matches against shifts of itself: π[q] is the length of the longest prefix of P that is a proper suffix of P[1..q]. Ex-8 : Let the text string T and the pattern P be : T : b a c b a b a b a c a c a c a P : a b a b a c a

17 COMPUTE-PREFIX-FUNCTION (P) :
    m = P.length
    let π[1..m] be a new array
    π[1] = 0
    k = 0
    for q = 2 to m
        while k > 0 and P[k+1] ≠ P[q]
            k = π[k]
        if P[k+1] == P[q]
            k = k + 1
        π[q] = k
    return π
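A Python sketch of the prefix-function computation is given below. It is illustrative only and uses 0-based indexing, so the slide's π[q] is stored at `pi[q - 1]` and the slide's P[k+1] becomes `pattern[k]`; the function name is an assumption for this example.

```python
from typing import List


def compute_prefix_function(pattern: str) -> List[int]:
    """Return pi, where pi[q - 1] is the slide's π[q] (1-based q)."""
    m = len(pattern)
    pi = [0] * m
    k = 0                       # length of the currently matched prefix
    for q in range(1, m):
        # Fall back to shorter borders until the next character can extend one.
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```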

18 Ex-8 (contd…) P : a b a b a c a INIT : m = 7, π[1] = 0, k = 0. Step : q = 2 : Here, k = 0 & P[k+1] = a & P[q] = b. So, while : FALSE & if : FALSE. Hence, π[2] = 0. Step : q = 3 : Here, k = 0 & P[k+1] = a & P[q] = a. So, while : FALSE & if : TRUE, so k = 1. Hence, π[3] = 1.

19 Step : q = 4 : Here, k = 1 & P[k+1] = b & P[q] = b. So, while : FALSE & if : TRUE, so k = 2. Hence, π[4] = 2. Step : q = 5 : Here, k = 2 & P[k+1] = a & P[q] = a. So, while : FALSE & if : TRUE, so k = 3. Hence, π[5] = 3. Step : q = 6 : Here, k = 3 & P[k+1] = b & P[q] = c. So, while : TRUE, so k = 1 ( = π[3] ). Now k = 1 & P[k+1] = b & P[q] = c, so while : TRUE again and k = 0 ( = π[1] ). Then if : FALSE (P[1] ≠ P[6]). Hence, π[6] = 0.

20 Step : q = 7 : Here, k = 0 & P[k+1] = a & P[q] = a. So, while : FALSE & if : TRUE (P[1] == P[7]), so k = 1. Hence, π[7] = 1. Hence the π array is as follows :
q :  1  2  3  4  5  6  7
π :  0  0  1  2  3  0  1
COMPUTE-PREFIX-FUNCTION returns this array π; its last entry is π[7] = 1.

21 KMP-MATCHER (T,P) :
    n = T.length
    m = P.length
    π = COMPUTE-PREFIX-FUNCTION(P)
    q = 0
    for i = 1 to n
        while q > 0 and P[q+1] ≠ T[i]
            q = π[q]
        if P[q+1] == T[i]
            q = q + 1
        if q == m
            print “Pattern occurs with shift” i – m
            q = π[q]
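The matcher can be sketched in Python as below. This is a self-contained illustration under stated assumptions: 0-based indexing (so a match ending at index i has shift i - m + 1), shifts returned as a list instead of printed, and function names chosen for this example. The prefix function is repeated here so that the block runs on its own.

```python
from typing import List


def compute_prefix_function(pattern: str) -> List[int]:
    """pi[q - 1] is the slide's π[q]."""
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text: str, pattern: str) -> List[int]:
    """Return every valid shift of `pattern` in `text`."""
    n, m = len(text), len(pattern)
    if m == 0:
        return list(range(n + 1))   # the empty string occurs at every shift
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0                           # number of pattern characters matched
    for i in range(n):
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]           # next character does not match: fall back
        if pattern[q] == text[i]:
            q += 1                  # next character matches
        if q == m:                  # the whole pattern has been matched
            shifts.append(i - m + 1)
            q = pi[q - 1]           # look for the next, possibly overlapping, match
    return shifts


print(kmp_matcher("bacbababacacaca", "ababaca"))  # [4]
```

The text index i never moves backwards, which is what distinguishes this matcher from the naive one: after a mismatch only q is reduced, using the precomputed π values.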

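22 Ex-8 contd.. KMP-Matcher (T,P) :
INIT : n = 15, m = 7, π = 0 0 1 2 3 0 1, q = 0. In the table, C1 is the test q > 0, C2 is the test P[q+1] ≠ T[i], and wh is the while condition (C1 and C2).
i    q    C1   C2   wh   q=π[q]   if   q++     if   print     q=π[q]
1    0    F    T    F    ---      F    ---     F    ---       ---
2    0    F    F    F    ---      T    q = 1   F    ---       ---
3    1    T    T    T    q = 0    F    ---     F    ---       ---
4    0    F    T    F    ---      F    ---     F    ---       ---
5    0    F    F    F    ---      T    q = 1   F    ---       ---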

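23 Ex-8 contd.. KMP-Matcher (T,P) :
i    q    C1   C2   wh   q=π[q]   if   q++     if   print     q=π[q]
6    1    T    F    F    ---      T    q = 2   F    ---       ---
7    2    T    F    F    ---      T    q = 3   F    ---       ---
8    3    T    F    F    ---      T    q = 4   F    ---       ---
9    4    T    F    F    ---      T    q = 5   F    ---       ---
10   5    T    F    F    ---      T    q = 6   F    ---       ---
11   6    T    F    F    ---      T    q = 7   T    shift 4   q = 1
12   1    T    T    T    q = 0    F    ---     F    ---       ---
13   0    F    F    F    ---      T    q = 1   F    ---       ---
14   1    T    T    T    q = 0    F    ---     F    ---       ---
15   0    F    F    F    ---      T    q = 1   F    ---       ---
Hence, the pattern occurs only once in the text, with shift 4.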

