  ;  E       

  ;  E            Searching a String with the Boyer-Moore Algorithm Shana Rose Negin December 14, 2000

Boyer-Moore String Search: How does it work? Examples. Complexity. Acknowledgements.

How Does it Work? The pattern moves left to right along the text. Comparisons within the pattern are done right to left. The algorithm uses two heuristics: Bad Character and Good Suffix. Each heuristic is put into play when a mismatch occurs. Each one proposes how far the pattern can safely move forward without skipping a position where a match could start; the algorithm takes the larger of the two proposals.

Pattern Moves Left to Right [Figure: the pattern "indy" shown under the text "Several hours later, Cindy" at three positions labeled Start, Middle, and End.]

Comparisons Comparisons are done right to left. [Figure: the pattern "indy" shown under the text "Several hours later, Cindy" in four panels labeled First Comparison, Second Comparison, Third Comparison, and Fourth Comparison.]

Three Parts to the Bad Character Heuristic
1. When the comparison gives a mismatch, the bad-character heuristic proposes moving the pattern to the right by an amount so that the bad character from the text will match the rightmost occurrence of the bad character in the pattern.
2. If the bad character doesn’t occur in the pattern, then the pattern may be moved completely past the bad character.
3. If the rightmost occurrence of the bad character is to the right of the current bad character position, then this heuristic makes no proposal.

Bad Character Heuristic 1. When the comparison gives a mismatch, the bad-character heuristic proposes moving the pattern to the right by an amount so that the bad character from the text will match the rightmost occurrence of the bad character in the pattern.
Text: You’ve got a funny face, man.
Pattern: cite
Text: You’ve got a funny face, man.
Shift: cite
Shifted two characters to match up the c’s.

Bad Character Heuristic 2. If the bad character doesn’t occur in the pattern, then the pattern may be moved completely past the bad character.
Text: You’ve got a funny face, man.
Pattern: poor
Text: You’ve got a funny face, man.
Shift: poor
Shifted four characters because the bad character does not occur in the pattern.

Bad Character Heuristic 3. If the rightmost occurrence of the bad character is to the right of the current bad character position, then this heuristic makes no proposal.
Text: There are no babies here.
Pattern: drab
Text: There are no babies here.
Shift: drab
The shift proposed would be negative, so it is ignored.
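The three cases above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the function name `bad_character_shift` is hypothetical, positions are counted from 1 as in the pseudocode, and the mismatch positions used in the examples are assumptions, since the slide alignments did not survive.

```python
def bad_character_shift(pattern, j, bad_char):
    """Shift proposed by the bad-character heuristic.

    j is the 1-based pattern position where the mismatch occurred and
    bad_char is the text character found there. A result <= 0 means the
    heuristic makes no useful proposal (case 3).
    """
    # str.rfind returns -1 when the character is absent, so this is 0 in case 2.
    rightmost = pattern.rfind(bad_char) + 1
    return j - rightmost


# Case 1: 't' of "cite" (position 3) meets a text 'c'; the c's line up.
print(bad_character_shift("cite", 3, "c"))   # 2
# Case 2: last letter of "poor" (position 4) meets a character not in the pattern.
print(bad_character_shift("poor", 4, "x"))   # 4
# Case 3 (assumed alignment): 'r' of "drab" (position 2) meets a text 'b'.
print(bad_character_shift("drab", 2, "b"))   # -2
```

A negative or zero result is why the full algorithm takes the maximum with the good-suffix proposal, which is always at least 1.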

Good Suffix Heuristic The good suffix is the part of the text that has already matched the end of the pattern when the mismatch occurs. The good-suffix heuristic proposes to move the pattern to the right by the least amount so that another group of characters in the pattern lines up with the good suffix found in the text.
Text: ...I wish I had an apple instead of...
Pattern: banana
Text: ...I wish I had an apple instead of...
Shift: banana
Shift two so that the second occurrence of ‘an’ in ‘banana’ matches the characters ‘an’ in the text.
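A brute-force sketch of the good-suffix proposal, assuming the good suffix is whatever tail of the pattern has already matched. The function name `good_suffix_shift` and the parameter `matched_len` are hypothetical; the loop simply tries shifts of 1, 2, ... until the shifted pattern is consistent with the matched tail, which is far slower than the table-based routine but easy to check by hand.

```python
def good_suffix_shift(pattern, matched_len):
    """Smallest shift that keeps the already-matched tail consistent.

    matched_len is how many characters at the end of the pattern matched
    the text before the mismatch.
    """
    m = len(pattern)
    suffix = pattern[m - matched_len:]          # the good suffix
    for shift in range(1, m + 1):
        prefix = pattern[:m - shift]            # part of the pattern that lands under the old window
        # The shift is acceptable if one string is a suffix of the other.
        if prefix.endswith(suffix) or suffix.endswith(prefix):
            return shift
    return m                                    # only reached for an empty pattern


print(good_suffix_shift("banana", 2))   # 2: the earlier "na" lines up with the matched "na"
print(good_suffix_shift("grad", 1))     # 4: no other 'd', so move the whole pattern
```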

EXAMPLE Text: im a grad. dad is glad Pattern: grad [Figure: twelve numbered comparisons against the text, each marked as Bad-character, Good-Suffix, or Match.] 12 comparisons out of 22 characters.

EXAMPLE Text: Where are you moving? What are you doing? Pattern: grad [Figure: numbered comparisons against the text, each marked as Bad-character, Good-Suffix, or Match.] At the last position, ‘grad’ is longer than the remaining text, so that position is discarded before it is counted. 10 comparisons out of 41 characters.

Applets http://www.accessone.com/~lorre/pages/bmi.html http://www.i.kyushu-u.ac.jp/~takeda/PM_DEMO/e.html

The Algorithm:
Sigma = alphabet in use; T = Search string (text); P = Pattern; N = length[T]; M = length[P];
L = Compute_Last_Occurrence_Function(P, M, Sigma);   (for bad-character heuristic)
Y = Compute_Good_Suffix_Function(P, M);              (for good-suffix heuristic)
s = 0;
while (s <= N-M) {
    j = M;
    while (j > 0 AND P[j] == T[s+j])
        j--;
    if (j == 0) {
        print(“Pattern FOUND!!! Location ” s);
        s = s + Y[0];
    } else
        s = s + max(Y[j], j - L[T[s+j]]);
}

Compute_Last_Occurrence_Function
Compute_Last_Occurrence_Function(P, M, Sigma) {
    /* The array L has a field for every letter in the alphabet. When this
       function is finished computing, the number in L[a] is the position,
       counted from the beginning of the pattern starting at 1, of the
       rightmost ‘a’; L[b] is the position of the rightmost ‘b’, and so on.
       A letter that does not occur in the pattern keeps the value 0.
       EXAMPLE: pattern: jeff
       L->  a b c d e f g h i j k
            0 0 0 0 2 4 0 0 0 1 0 */
    for (each character a in Sigma)   // Initialize all fields to 0
        L[a] = 0;
    for (j = 1; j <= M; j++)          // For every letter in the pattern,
        L[P[j]] = j;                  // record its distance from the start of the pattern
    return L;
}
/* COMPLEXITY: O(|Sigma| + M) */
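The same table can be built in Python. This is a sketch under two assumptions that are not in the slides: the alphabet is passed in as a string (here the lowercase ASCII letters), and the table is a dictionary rather than an array. Positions are counted from 1, so that 0 can mean "does not occur".

```python
import string


def compute_last_occurrence(pattern, sigma):
    """Return a dict mapping each character of sigma to the 1-based position
    of its rightmost occurrence in pattern, or 0 if it does not occur."""
    last = {ch: 0 for ch in sigma}               # initialize all fields to 0
    for j, ch in enumerate(pattern, start=1):    # later positions overwrite earlier ones
        last[ch] = j
    return last


table = compute_last_occurrence("jeff", string.ascii_lowercase)
print(table["j"], table["e"], table["f"], table["a"])   # 1 2 4 0
```

Filling the dictionary costs one pass over the alphabet plus one pass over the pattern, which is where the O(|Sigma| + M) bound comes from.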

Compute_Good_Suffix_Function
Compute_Good_Suffix_Function(P, M) {
    /* First get the prefix function of the pattern and of the reversed
       pattern. Y[j] is the shift recommended when the mismatch occurs at
       pattern position j, that is, when the suffix P[j+1..M] has already
       matched. The function looks for the next rightmost occurrence of that
       suffix in the pattern and recommends the shift that lines it up. If
       there is no other occurrence, it recommends the default shift
       M - Pi[M], which is the length of the pattern when no prefix of the
       pattern is also a suffix. */
    Pi = Compute_Prefix_Function(P);
    P’ = Reverse(P);
    Pi’ = Compute_Prefix_Function(P’);
    for (j = 0; j <= M; j++)
        Y[j] = M - Pi[M];
    for (l = 1; l <= M; l++) {
        j = M - Pi’[l];
        if (Y[j] > l - Pi’[l])
            Y[j] = l - Pi’[l];
    }
    return Y;
}
/* COMPLEXITY: O(M) */
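A Python sketch of this routine, assuming Compute_Prefix_Function is the usual Knuth-Morris-Pratt prefix function (the slides do not spell it out). Lists are indexed from 0 in Python, so `pi[q]` and `gamma[j]` below keep the 1-based meaning of the pseudocode and index 0 of `pi` is unused. The names `prefix_function` and `good_suffix_table` are illustrative.

```python
def prefix_function(p):
    """pi[q] = length of the longest proper prefix of p[:q] that is also a
    suffix of p[:q], for q = 1..len(p). pi[0] is unused."""
    m = len(p)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = pi[k]                  # fall back to a shorter border
        if p[k] == p[q - 1]:
            k += 1
        pi[q] = k
    return pi


def good_suffix_table(p):
    """gamma[j] = shift proposed when the mismatch is at 1-based position j
    (gamma[0] is the shift used after a full match)."""
    m = len(p)
    pi = prefix_function(p)
    pi_rev = prefix_function(p[::-1])
    gamma = [m - pi[m]] * (m + 1)      # default shift for every position
    for l in range(1, m + 1):
        j = m - pi_rev[l]
        gamma[j] = min(gamma[j], l - pi_rev[l])
    return gamma


print(good_suffix_table("banana"))   # [6, 6, 6, 2, 2, 2, 1]
print(good_suffix_table("grad"))     # [4, 4, 4, 4, 1]
```

For "grad", no part of the pattern repeats, so every entry is the full pattern length except `gamma[4]`, the case where nothing has matched yet and only a shift of 1 is safe by this heuristic alone.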

The Main Loop
while (s <= N-M) {                               // for every shift
    j = M;
    while (j > 0 AND P[j] == T[s+j])             // for the length of the pattern
        j--;
    if (j == 0) {                                // if you reach the beginning of the pattern,
        print(“Pattern FOUND!!! Location ” s);   // you found the pattern! Tell someone
        s = s + Y[0];                            // and shift by Y[0] (at most the length of the pattern)
    } else
        s = s + max(Y[j], j - L[T[s+j]]);        // else, choose the greater of the two heuristic results
}
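The main loop translates to Python almost line for line. In this sketch the two tables are passed in as arguments, and the example supplies tables for "grad" that were worked out by hand (an assumption made to keep the block short; in practice they would come from the two preprocessing functions). The function name `boyer_moore_matcher` is illustrative, and reported shifts are 0-based offsets into the text.

```python
def boyer_moore_matcher(text, pattern, last, gamma):
    """Return every shift s at which pattern occurs in text.

    last:  dict, 1-based position of each character's rightmost occurrence
           in pattern (missing characters count as 0).
    gamma: list of good-suffix shifts, indexed 0..len(pattern).
    """
    n, m = len(text), len(pattern)
    found = []
    s = 0
    while s <= n - m:                        # for every shift
        j = m
        # Compare right to left; j is a 1-based pattern position.
        while j > 0 and pattern[j - 1] == text[s + j - 1]:
            j -= 1
        if j == 0:                           # whole pattern matched
            found.append(s)
            s += gamma[0]
        else:                                # take the larger proposal
            bad_char = text[s + j - 1]
            s += max(gamma[j], j - last.get(bad_char, 0))
    return found


# Hand-computed tables for the pattern "grad".
last = {"g": 1, "r": 2, "a": 3, "d": 4}
gamma = [4, 4, 4, 4, 1]

print(boyer_moore_matcher("im a grad. dad is glad", "grad", last, gamma))   # [5]
```

Because every entry of `gamma` is at least 1, the shift always advances and the loop terminates even when the bad-character proposal is zero or negative.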

HOWEVER...

IN PRACTICE...

the algorithm takes sub-linear time

Specifically, in the best case, the algorithm’s running time is O(N/M) (length of text over length of pattern)

The running time is best when the letters in the pattern don’t match the letters in the text very often. Since this is generally the case, the average running time ends up being approximately equivalent to the best case, O(N/M) (length of text over length of pattern).
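The O(N/M) best case can be checked by counting comparisons. The sketch below is a simplification, not the full algorithm: it uses only the bad-character rule and falls back to a shift of 1 when that rule proposes nothing useful. The function name `count_comparisons` is hypothetical.

```python
def count_comparisons(text, pattern):
    """Count character comparisons made while scanning text for pattern,
    using only the bad-character rule (shift of 1 as a fallback)."""
    n, m = len(text), len(pattern)
    last = {ch: j for j, ch in enumerate(pattern, start=1)}   # rightmost 1-based positions
    comparisons = 0
    s = 0
    while s <= n - m:
        j = m
        while j > 0:
            comparisons += 1
            if pattern[j - 1] != text[s + j - 1]:
                break
            j -= 1
        if j == 0:
            s += 1                                            # full match: move on by one
        else:
            s += max(1, j - last.get(text[s + j - 1], 0))
    return comparisons


# Best case: no text character occurs in the pattern, so each position
# costs one comparison and the pattern jumps by its full length.
print(count_comparisons("x" * 40, "grad"))   # 10, that is N/M = 40/4
# Unfavorable case: every comparison succeeds and the pattern moves by one.
print(count_comparisons("a" * 40, "aaaa"))   # 148
```

The first call touches only every fourth character of the text, which is what "sub-linear" means here; the second shows how far the count can grow when the pattern matches everywhere.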

Conclusion: The Boyer-Moore algorithm is a very good algorithm. Its worst-case running time is linear in the length of the text when the pattern does not occur in it (Cole's bound); its best-case running time is sub-linear. Most of the time it tends toward the best case rather than the worst case. I recommend the Boyer-Moore algorithm for searching a string.
Shana Negin 252a-as December 14, 2000 Algorithms csc252

Acknowledgements
Cormen et al.: Introduction to Algorithms, Section 34.5
Cole, Richard: “Tight Bounds on the Complexity of the Boyer-Moore String-Matching Algorithm.” New York University
http://www.accessone.com/~lorre/pages/bmi.html
http://www.i.kyushu-u.ac.jp/~takeda/PM_DEMO/e.html

Interesting Uses William Hsu, of Computer Science at the Johns Hopkins University, has used the Boyer-Moore algorithm in a virus detection project. http://www.mactech.com/articles/mactech/Vol.08/08.02/VirusDetection/

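One Problem Unicode, treated as a 16-bit code, has 65,536 possible characters. Since the last-occurrence table needs a field for every character in the alphabet and takes O(|Sigma| + M) time to build, this makes string searching very time consuming, even using Boyer-Moore. http://www-4.ibm.com/software/developer/library/text-searching.html?dwzone=unicode#Boyer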
