UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 2: Boyer-Moore Algorithm.


1 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 2: Boyer-Moore Algorithm Lecturer: Dr. Rose Slides by: Dr. Rose January 21, 2003

2 Boyer-Moore Algorithm
Basic ideas:
– Previously discussed ideas for naïve matching:
  1. Successively align P and T to check for a match.
  2. Shift P to the right on match failure.
– New concepts with respect to the naïve algorithm:
  1. Scan from right to left, i.e., ←
  2. Special bad character rule
  3. Suffix shift rule

3 Concept: Right-to-left Scan
How can we check for a match of pattern P at location i in target T?
The naïve algorithm scanned left to right, i.e., it compared T[i+k] with P[1+k] for k = 0 to length(P)-1.
Example: P = adab, T = abaracadabara
  T: a b a r a c a d a b a r a
  P: a d a b
  Comparison 1: a == a
  Comparison 2: d != b

4 Concept: Right-to-left Scan
Alternatively, scan right to left, i.e., compare T[i+k] with P[1+k] for k = length(P)-1 down to 0.
Example: P = adab, T = abaracadabara
  T: a b a r a c a d a b a r a
  P: a d a b
  Comparison 1: b != r

5 Concept: Right-to-left Scan
Why is scanning right to left a good idea?
Answer: by itself, it isn't any better than left to right.
– A naïve approach with right-to-left scanning is also Θ(nm) in the worst case.
– Larger shifts, supported by a clever bad character rule and a suffix shift rule, make it better.
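To make the point concrete, here is a minimal Python sketch of a naïve matcher that scans each alignment from right to left. The function name `naive_right_to_left` is illustrative and does not come from the slides; positions are reported 1-based to match the slide notation.

```python
def naive_right_to_left(pattern: str, text: str) -> list[int]:
    """Return the 1-based start positions of pattern in text."""
    n, m = len(pattern), len(text)
    hits = []
    for start in range(m - n + 1):           # 0-based alignment of P under T
        k = n - 1                            # begin at the right end of P
        while k >= 0 and text[start + k] == pattern[k]:
            k -= 1
        if k < 0:                            # every character matched
            hits.append(start + 1)
        # naive: always shift by exactly one position, whatever was learned
    return hits


print(naive_right_to_left("adab", "abaracadabara"))  # [7]
```

The scan direction changes nothing about the shift, which is always one position. The speed-up in Boyer-Moore comes only from the shift rules introduced next.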

6 Concept: Bad Character Rule
Idea: the mismatched character indicates a safe minimum shift.
Example: P = adacara, T = abaracadabara
  T: a b a r a c a d a b a r a
  P: a d a c a r a
  Comparison 1: a == a
  Comparison 2: r != c
Here the bad character is c. Perhaps we should shift to align this character with its rightmost occurrence in P?

7 Concept: Bad Character Rule
Shift two positions to align the rightmost occurrence of the mismatched character c in P.
  T: a b a r a c a d a b a r a
  P:     a d a c a r a
Now, start matching again from right to left.

8 Concept: Bad Character Rule
Second example: P = adacara, T = abaxaradabara
  T: a b a x a r a d a b a r a
  P: a d a c a r a
  Comparison 1: a == a
  Comparison 2: r == r
  Comparison 3: a == a
  Comparison 4: c != x
Here the bad character is x. The minimum shift should align this character with its occurrence in P. But x doesn't occur in P!

9 Concept: Bad Character Rule
Second example: P = adacara, T = abaxaradabara
  T: a b a x a r a d a b a r a
  P: a d a c a r a
Since x doesn't occur in P, we can shift past it.
  T: a b a x a r a d a b a r a
  P:         a d a c a r a
Now, start matching again from right to left.

10 Concept: Bad Character Rule
We will define a bad character rule that uses the concept of the rightmost occurrence of each letter.
Let R(x) be the rightmost position of the letter x in P for each letter x in our alphabet. If x doesn't occur in P, define R(x) to be 0.

11 Concept: Bad Character Rule
Bad Character Rule: If P[i] mismatches T[k], shift P along T by max[1, i - R(T[k])].
This rule allows us to shift by more than 1 when R(T[k]) + 1 < i.
Otherwise we shift P by one position; in particular, when R(T[k]) >= i we have 1 >= i - R(T[k]).
Obviously this rule is not very useful when R(T[k]) >= i.
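The rule can be sketched in a few lines of Python. The helper names `rightmost_positions` and `bad_character_shift` are illustrative, not from the slides. Positions are 1-based as in the slide notation, and a character missing from P gets R(x) = 0.

```python
def rightmost_positions(pattern: str) -> dict[str, int]:
    """R(x): 1-based rightmost position of each character x in pattern."""
    # later occurrences overwrite earlier ones, so the rightmost one survives
    return {ch: pos for pos, ch in enumerate(pattern, start=1)}


def bad_character_shift(R: dict[str, int], i: int, bad_char: str) -> int:
    """Shift to apply when P[i] (1-based) mismatches the text character bad_char."""
    return max(1, i - R.get(bad_char, 0))


R = rightmost_positions("adacara")
print(bad_character_shift(R, 6, "c"))  # 2  (first example: r != c at i = 6)
print(bad_character_shift(R, 4, "x"))  # 4  (second example: x is not in P)
```

When the rightmost occurrence lies to the right of the mismatch, the expression i - R(T[k]) is negative and the `max` falls back to a shift of 1.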

12 Concept: Extended Bad Character Rule
Extended Bad Character Rule: If P[i] mismatches T[k], shift P along T so that the closest occurrence of the letter T[k] in P to the left of i in P is aligned with T[k].
Example: P = aracara, T = abararadabara
  T: a b a r a r a d a b a r a
  P: a r a c a r a
  Comparison 1: a == a
  Comparison 2: r == r
  Comparison 3: a == a
  Comparison 4: c != r
Position 6 is the rightmost occurrence of r in P. Notice that i - R(T[k]) < 0, i.e., 4 - 6 < 0.
Position 2 is the rightmost occurrence of r to the left of i in P. Notice that 4 - 2 > 0, i.e., this gives us a positive shift.

13 Concept: Extended Bad Character Rule
The amount of shift is i - j, where:
– i is the index of the mismatch in P.
– j is the rightmost occurrence of T[k] to the left of i in P.
Example: P = aracara, T = abataradabara
  T: a b a t a r a d a b a r a
  P: a r a c a r a
  Comparison 1: a == a
  Comparison 2: r == r
  Comparison 3: a == a
  Comparison 4: c != t
There is no occurrence of t in P, thus j = 0. Notice that i - j = 4, i.e., this gives us a positive shift past the point of mismatch.

14 Concept: Extended Bad Character Rule
How do we implement this rule?
We preprocess P (from right to left), recording the position of each occurrence of the letters.
For each character x in Σ, the alphabet, create a list of its occurrences in P. If x doesn't occur in P, then it has an empty list.

15 Concept: Extended Bad Character Rule
Example: Σ = {a, b, c, d, r, t}, P = abataradabara
  a_list = <13, 11, 9, 7, 5, 3, 1>, since 'a' occurs at these positions in P
  b_list = <10, 2>
  c_list = Ø
  d_list = <8>
  r_list = <12, 6>
  t_list = <4>
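A possible Python sketch of this preprocessing and the resulting shift follows. The names `occurrence_lists` and `extended_bad_character_shift` are illustrative, and storing the positions rightmost-first is an assumption that follows from scanning P right to left.

```python
from collections import defaultdict


def occurrence_lists(pattern: str) -> dict[str, list[int]]:
    """Map each character to its 1-based positions in pattern, rightmost first."""
    lists = defaultdict(list)
    for pos in range(len(pattern), 0, -1):   # scan P from right to left
        lists[pattern[pos - 1]].append(pos)
    return dict(lists)


def extended_bad_character_shift(lists: dict[str, list[int]], i: int, bad_char: str) -> int:
    """Shift i - j, where j is the closest occurrence of bad_char left of i (0 if none)."""
    for pos in lists.get(bad_char, []):      # positions are in decreasing order
        if pos < i:
            return i - pos
    return i                                 # j = 0: shift past the mismatch


lists = occurrence_lists("aracara")
print(lists["r"])                                   # [6, 2]
print(extended_bad_character_shift(lists, 4, "r"))  # 2
print(extended_bad_character_shift(lists, 4, "t"))  # 4
```

Walking a list costs at most as many steps as the number of characters already compared in the current alignment, so the lookup does not change the asymptotic cost of a comparison phase.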

16 Concept: Suffix Shift Rule
Recall that we investigated finding prefixes last week. Since we are matching P to T from right to left, we will instead need to use suffixes.
Note: historically, the preprocessing method for finding good suffixes for Boyer-Moore has been regarded as inscrutable.
If you are confused, that is OK. If you are not confused, does that mean you aren't paying close enough attention?

17 Concept: Suffix Shift Rule
Consider the partial right-to-left matching of P to T below. This partial match involves α, a suffix of P.

18 Concept: Suffix Shift Rule
This partial match ends where the first mismatch occurs, where x is aligned with d.

19 Concept: Suffix Shift Rule
We want to find a right-most copy α´ of this substring α in P such that:
  1. α´ is not a suffix of P, and
  2. the character to the left of α´ is not the same as the character to the left of α.

20 Concept: Suffix Shift Rule
1. If α´ exists, shift P to the right such that α´ is now aligned with the substring in T that was previously aligned with α.

21 Concept: Suffix Shift Rule
2. If α´ doesn't exist, shift P right by the least amount such that a prefix of P is aligned with a suffix of α in T.

22 Concept: Suffix Shift Rule
3. If α´ doesn't exist, and there is no prefix of P that matches a suffix of α in T, shift P right by n positions.
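The three cases can be checked with a brute-force Python sketch that tries every shift in increasing order. This is only an illustration, not the linear-time preprocessing developed in the following slides, and the name `good_suffix_shift_bruteforce` is made up for this example. It assumes that α = P[i..n] has matched and that P[i-1] caused the mismatch, with 1-based indices.

```python
def good_suffix_shift_bruteforce(pattern: str, i: int) -> int:
    """Shift after P[i..n] matched and P[i-1] mismatched (1-based, 2 <= i <= n)."""
    n = len(pattern)
    P = " " + pattern                        # pad so that P[1] is the first character
    for s in range(1, n):
        # every matched position still covered by P after the shift must agree
        overlap_ok = all(P[q - s] == P[q] for q in range(max(i, s + 1), n + 1))
        # the character moved under the mismatch must differ from P[i-1]
        left = i - 1 - s
        differs = left < 1 or P[left] != P[i - 1]
        if overlap_ok and differs:
            return s                         # cases 1 and 2: smallest valid shift
    return n                                 # case 3: nothing lines up


print(good_suffix_shift_bruteforce("golgol", 2))                  # 3
print(good_suffix_shift_bruteforce("slydogsaddogdbadbaddog", 20))  # 16
```

Shifts smaller than i correspond to case 1 (a full copy α´ inside P), shifts from i to n - 1 correspond to case 2 (a prefix of P matching a suffix of α), and the fall-through is case 3.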

23 Concept: Suffix Shift Rule
Let L(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L(i)]. If there is no such position, then L(i) = 0.
Example 1: If i = 17 then L(i) = 9
Example 2: If i = 16 then L(i) = 0

24 Concept: Suffix Shift Rule
Let L´(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L´(i)] and s.t. the character preceding the suffix is not equal to P(i-1). If there is no such position, then L´(i) = 0.
Example 1: If i = 20 then L(i) = 12 and L´(i) = 6

25 Concept: Suffix Shift Rule
Example 2: If i = 19 then L(i) = 12 and L´(i) = 0
P = slydogsaddogdbadbaddog

26 Concept: Suffix Shift Rule
Notice that L(i) indicates the right-most copy of P[i..n] that is not a suffix of P.
In contrast, L´(i) indicates the right-most copy of P[i..n] that is not a suffix of P and whose preceding character doesn't match P(i-1).
The relation between L´(i) and L(i) is analogous to the relation between α´ and α.
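Both definitions can be checked directly with a brute-force Python sketch. The function name `L_and_Lprime_bruteforce` is illustrative, and treating a copy that starts at position 1 (so it has no preceding character) as valid for L´ is an assumption made here. The pattern is the one from the examples above.

```python
def L_and_Lprime_bruteforce(pattern: str, i: int) -> tuple[int, int]:
    """Return (L(i), L'(i)) straight from the definitions (1-based, 2 <= i <= n)."""
    n = len(pattern)
    P = " " + pattern                        # pad so that P[1] is the first character
    suffix = pattern[i - 1:]                 # P[i..n]
    length = len(suffix)
    L = Lp = 0
    for end in range(length, n):             # candidate end positions, all < n
        start = end - length + 1
        if pattern[start - 1:end] != suffix:
            continue
        L = end                              # largest matching end position so far
        if start == 1 or P[start - 1] != P[i - 1]:
            Lp = end                         # preceding character differs (or is absent)
    return L, Lp


P = "slydogsaddogdbadbaddog"
print(L_and_Lprime_bruteforce(P, 20))  # (12, 6)
print(L_and_Lprime_bruteforce(P, 19))  # (12, 0)
```

For i = 20 the copy of "dog" ending at position 12 is preceded by d, the same character as P(19), so only the copy ending at position 6 qualifies for L´.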

27 Concept: Suffix Shift Rule
Q: What is the point?
A: If P(i - 1) causes the mismatch and L´(i) > 0, then we can shift P right by n - L´(i) positions.
Example: [figure]

28 Concept: Suffix Shift Rule
If L(i) and L´(i) are different, then obviously shifting by n - L´(i) positions is a greater shift than n - L(i).
Example: [figure]

29 Concept: Suffix Shift Rule
Let N_j(P) denote the length of the longest suffix of P[1..j] that is also a suffix of P.
Example 1: N_6(P) = 3 and N_12(P) = 5.
Example 2: N_3(P) = 2, N_9(P) = 3, N_15(P) = 5, N_19(P) = 0.

30 Concept: Suffix Shift Rule
Q: How are the concepts of N_i and Z_i related?
Recall that Z_i is the length of a maximal substring starting at position i of P that matches a prefix of P.
In contrast, N_i is the length of a maximal substring ending at position i in P that matches a suffix of P.
In the case of Boyer-Moore, we are naturally interested in suffixes since we are scanning right to left.

31 Concept: Suffix Shift Rule
Let P^r denote the mirror image of P; then the relationship can be expressed as N_j(P) = Z_{n-j+1}(P^r).
In words, the length of the substring matching a suffix at position j in P is equal to the length of the corresponding substring matching a prefix in the reverse of P.
Q: Why must this be true?
A: Because they are the same substring, except that one is the reverse of the other.

32 Concept: Suffix Shift Rule
Since N_j(P) = Z_{n-j+1}(P^r), we can use the Z algorithm to compute N in O(n).
Q: How do we do this?
A: We create P^r, the reverse of P, and process it with the Z algorithm.
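A minimal Python sketch of this reduction is given below. `z_values` is one standard formulation of the Z algorithm, not necessarily the version presented in the earlier lecture, and both function names are illustrative. The lists are 0-based, so `n_values(P)[j - 1]` holds N_j(P). Setting the first Z value to n is a convention chosen here, which makes the last entry N_n(P) = n.

```python
def z_values(s: str) -> list[int]:
    """z[k] = length of the longest substring starting at index k that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n                             # convention: the whole string matches itself
    left = right = 0                         # current Z-box is s[left:right]
    for k in range(1, n):
        if k < right:                        # reuse information from inside the Z-box
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1                        # extend by explicit comparisons
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z


def n_values(pattern: str) -> list[int]:
    """Return [N_1(P), ..., N_n(P)] using N_j(P) = Z_{n-j+1}(P^r)."""
    return z_values(pattern[::-1])[::-1]


print(n_values("asdbasasas"))  # [0, 2, 0, 0, 0, 2, 0, 4, 0, 10]
```

Reversing the Z list of the reversed pattern performs the index change j to n - j + 1 in one step.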

33 Concept: Suffix Shift Rule
We can then find L´(i) and L(i) values from N values in linear time with the following:
  For i = 1 to n { L´(i) = 0; }
  For j = 1 to n - 1 {
    i = n - N_j(P) + 1;
    L´(i) = j;
  }
  L(2) = L´(2);
  For i = 3 to n { L(i) = max(L(i - 1), L´(i)); }

34 Concept: Suffix Shift Rule
Example: P = asdbasasas, n = 10
  j:                    1,  2,  3,  4,  5,  6,  7,  8,  9
  Values of N_j(P):     0,  2,  0,  0,  0,  2,  0,  4,  0
  Computed values i:   11,  9, 11, 11, 11,  9, 11,  7, 11
  Values of L´(1..9):   0,  0,  0,  0,  0,  0,  8,  0,  6
  For i = 1 to n { L´(i) = 0; }
  For j = 1 to n - 1 {
    i = n - N_j(P) + 1;
    L´(i) = j;
  }
  L(2) = L´(2);
  For i = 3 to n { L(i) = max(L(i - 1), L´(i)); }
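The loops translate almost line for line into Python. In this sketch the function name `l_prime_and_l` is illustrative, the N values are passed in as a plain list, and an extra slot at index n + 1 is an implementation choice made here to absorb the assignments that occur when N_j(P) = 0.

```python
def l_prime_and_l(N: list[int], n: int) -> tuple[list[int], list[int]]:
    """N[j-1] = N_j(P) for j = 1..n-1. Returns (L', L), each indexed 1..n (index 0 unused)."""
    Lp = [0] * (n + 2)                       # slot n+1 absorbs the case N_j(P) = 0
    for j in range(1, n):
        i = n - N[j - 1] + 1
        Lp[i] = j                            # later (larger) j overwrites earlier ones
    Lp = Lp[:n + 1]                          # drop the out-of-range slot
    L = [0] * (n + 1)
    if n >= 2:
        L[2] = Lp[2]
    for i in range(3, n + 1):
        L[i] = max(L[i - 1], Lp[i])
    return Lp, L


Lp, L = l_prime_and_l([0, 2, 0, 0, 0, 2, 0, 4, 0], 10)
print(Lp[1:])  # [0, 0, 0, 0, 0, 0, 8, 0, 6, 0]
print(L[1:])   # [0, 0, 0, 0, 0, 0, 8, 8, 8, 8]
```

Because j increases through the loop, the last write to each L´(i) is the largest j, which is exactly the "largest position" the definition asks for.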

35 Concept: Suffix Shift Rule
Let l´(i) denote the length of the largest suffix of P[i..n] that is also a prefix of P. Let l´(i) = 0 if no such suffix exists.
Example: P = asasbsasas
  l´(1) = 10 (P[1..n] is P itself)
  l´(2) = 4
  l´(3) = 4
  l´(4) = 4
  l´(5) = 4
  l´(6) = 4
  l´(7) = 4
  l´(8) = 2
  l´(9) = 2
  l´(10) = 0

36 Concept: Suffix Shift Rule
Thm: l´(i) = largest j <= n - i + 1 s.t. N_j(P) = j.
Q: How can we compute l´(i) values in linear time?
A: This is problem #9 in Chapter 2. This would make an interesting homework problem.

37 Boyer-Moore Algorithm
Preprocessing:
  Compute L´(i) and l´(i) for each position i in P.
  Compute R(x), the right-most occurrence of x in P, for each character x in Σ.
Search:
  k = n;
  While k <= m {
    i = n; h = k;
    While i > 0 and P(i) = T(h) { i = i - 1; h = h - 1; }
    if i = 0 {
      report occurrence of P in T ending at position k.
      k = k + n - l´(2);
    }
    else
      Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
  }
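The whole procedure can be put together as a compact Python sketch. Several choices here are assumptions rather than part of the slides: the function name `boyer_moore`, computing N_j(P) directly from its definition (quadratic, for brevity) instead of via the Z algorithm, using a shift of 1 for the good suffix rule when the mismatch is at P(n), and reporting 1-based start positions (k - n + 1) rather than end positions.

```python
def boyer_moore(pattern: str, text: str) -> list[int]:
    """Return the 1-based start positions of every occurrence of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    P, T = " " + pattern, " " + text         # pad for 1-based indexing

    N = [0] * (n + 1)                        # N[j] = N_j(P), straight from the definition
    for j in range(1, n + 1):
        while N[j] < j and P[j - N[j]] == P[n - N[j]]:
            N[j] += 1

    Lp = [0] * (n + 2)                       # L'(i); slot n+1 absorbs N_j(P) = 0
    for j in range(1, n):
        Lp[n - N[j] + 1] = j

    lp = [0] * (n + 2)                       # l'(i) = largest j <= n-i+1 with N_j(P) = j
    best = 0
    for j in range(1, n + 1):
        if N[j] == j:
            best = j
        lp[n - j + 1] = best

    occ = {}                                 # occurrence lists, rightmost first
    for pos in range(n, 0, -1):
        occ.setdefault(P[pos], []).append(pos)

    hits = []
    k = n                                    # T position aligned with P(n)
    while k <= m:
        i, h = n, k
        while i > 0 and P[i] == T[h]:
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k - n + 1)
            k += n - lp[2]
        else:
            j = next((p for p in occ.get(T[h], []) if p < i), 0)
            bad = i - j                      # extended bad character rule
            if i == n:
                good = 1                     # nothing matched yet (assumed convention)
            elif Lp[i + 1] > 0:
                good = n - Lp[i + 1]
            else:
                good = n - lp[i + 1]
            k += max(bad, good)
    return hits


print(boyer_moore("golgol", "lolgolgol"))  # [4]
```

In the run above the first alignment mismatches at P(1), the good suffix rule gives a shift of 3, and the second alignment is a full match ending at k = 9.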

38 Boyer-Moore Algorithm
Example: P = golgol
Preprocessing: Compute L´(i) and l´(i) for each position i in P.
Notice that first we need N_j(P) values in order to compute L´(i) and l´(i) for each position i in P.
  For i = 1 to n { L´(i) = 0; }
  For j = 1 to n - 1 { i = n - N_j(P) + 1; L´(i) = j; }

39 Boyer-Moore Algorithm
Example: P = golgol
Recall that N_j(P) is the length of the longest suffix of P[1..j] that is also a suffix of P.
  N_1(P) = 0, there is no suffix of P that ends with g
  N_2(P) = 0, there is no suffix of P that ends with o
  N_3(P) = 3, there is a suffix of P that ends with l
  N_4(P) = 0, there is no suffix of P that ends with g
  N_5(P) = 0, there is no suffix of P that ends with o
N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3

40 Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6
N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3
Compute L´(i) and l´(i) for each position i in P.
  For i = 1 to n { L´(i) = 0; }
  For j = 1 to n - 1 { i = n - N_j(P) + 1; L´(i) = j; }
  j = 1 ⇒ i = 7, therefore L´(7) = 1
  j = 2 ⇒ i = 7, therefore L´(7) = 2
  j = 3 ⇒ i = 4, therefore L´(4) = 3
  j = 4 ⇒ i = 7, therefore L´(7) = 4
  j = 5 ⇒ i = 7, therefore L´(7) = 5
Since i = 7 = n + 1 lies outside P, the assignments to L´(7) have no effect on L´(1..n).
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3

41 Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6
N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
Compute l´(i) for each position i in P. Recall that l´(i) is the length of the longest suffix of P[i..n] that is also a prefix of P.
  l´(1) = 6 since golgol is the longest suffix of P[1..n] that is a prefix of P.
  l´(2) = 3 since gol is the longest suffix of P[2..n] that is a prefix of P.
  l´(3) = 3 since gol is the longest suffix of P[3..n] that is a prefix of P.
  l´(4) = 3 since gol is the longest suffix of P[4..n] that is a prefix of P.
  l´(5) = 0 since there is no suffix of P[5..n] that is a prefix of P.
  l´(6) = 0 since there is no suffix of P[6..n] that is a prefix of P.
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
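These values can be double-checked with a short Python sketch that applies the definition of l´(i) literally. It is quadratic and is not the linear-time method asked for in the homework problem; the function name `l_prime_small` is illustrative.

```python
def l_prime_small(pattern: str) -> list[int]:
    """Return [l'(1), ..., l'(n)] directly from the definition (quadratic)."""
    n = len(pattern)
    result = []
    for i in range(1, n + 1):
        tail = pattern[i - 1:]               # P[i..n]
        best = 0
        for length in range(len(tail), 0, -1):
            # try the longest suffix of P[i..n] first
            if pattern.startswith(tail[len(tail) - length:]):
                best = length
                break
        result.append(best)
    return result


print(l_prime_small("golgol"))  # [6, 3, 3, 3, 0, 0]
```

The search loop only ever consults l´(i) for i >= 2, so the value at i = 1 (always n under this definition) is shown for completeness.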

42 Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6
N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
Compute the list R(x), the right-most occurrences of x in P, for each character x in Σ = {g, o, l}:
  R(g) = <4, 1>
  R(o) = <5, 2>
  R(l) = <6, 3>

43 Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6, T = lolgolgol, m = 9
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3>
Search:
  k = n;
  While k <= m {
    i = n; h = k;
    While i > 0 and P(i) = T(h) { i = i - 1; h = h - 1; }
    if i = 0 {
      report occurrence of P in T ending at position k.
      k = k + n - l´(2);
    }
    else
      Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
  }

44 Search
  T: l o l g o l g o l
  P: g o l g o l
  i = 6, h = 6
  i = 5, h = 5
  i = 4, h = 4
  i = 3, h = 3
  i = 2, h = 2
  i = 1, h = 1: P(1) != T(1)
Bad Character Rule: there is no occurrence of l, the mismatched character in T, to the left of P(1). This suggests shifting only 1 place.
Good Suffix Rule: Since L´(2) = 0 and l´(2) = 3, shift P by n - l´(2) places, i.e., 6 - 3 = 3 places. Thus k = k + 3 = 9.
But i = 1!
  k = 6;
  While k <= 9 {
    i = 6; h = k;
    While i > 0 and P(i) = T(h) { i = i - 1; h = h - 1; }
    if i = 0 {
      report occurrence of P in T ending at position k.
      k = k + 6 - l´(2);
    }
    else
      Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
  }

45 Search
  T: l o l g o l g o l
  P:       g o l g o l
  i = 6, h = 9
  i = 5, h = 8
  i = 4, h = 7
  i = 3, h = 6
  i = 2, h = 5
  i = 1, h = 4
  i = 0, h = 3
i = 0: report the occurrence of P in T starting at position 4 (ending at k = 9); k = k + 6 - l´(2) = 9 + 6 - 3 = 12.
k = 12 > 9, so we are done!
  k = 6;
  While k <= 9 {
    i = 6; h = k;
    While i > 0 and P(i) = T(h) { i = i - 1; h = h - 1; }
    if i = 0 {
      report occurrence of P in T ending at position k.
      k = k + 6 - l´(2);
    }
    else
      Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
  }

46 Homework 1: Due 2/4/03
Problems from Chapter 1, pages 12-14:
– #2
– #4
– #6
For P = tuttifruttiohrutti, calculate:
– R(x) for all x in Σ. Assume that P contains all x.
– L(i) for each position i.
– L´(i) for each position i.
– N_j(P) for each position 0 < j < n.
– l´(i) for each position i.
Additional problem for graduate students: problem from Chapter 2, page 30:
– #9
