Boyer-Moore string search algorithm Book by Dan Gusfield: Algorithms on Strings, Trees and Sequences (1997) Original: Robert S. Boyer, J Strother Moore.


1 Boyer-Moore string search algorithm Book by Dan Gusfield: Algorithms on Strings, Trees and Sequences (1997) Original: Robert S. Boyer, J Strother Moore (1977) Presented by: Vladimir Zoubritsky

2 Agenda Problem Statement; Bad character rule; Boyer-Moore-Horspool algorithm; Good Suffix Rule; Preprocessing; Analysis

3 Problem Statement Given a pattern P(1..n) and a text T(1..m) defined over an alphabet Σ, find one or all occurrences of P in T. The Boyer-Moore algorithm (1977) provides an efficient solution: its running time is linear in the worst case and sub-linear in most practical cases.

4 Right-to-left matching idea Other known algorithms, e.g. Brute Force, match the pattern from left to right. Algorithm: align P with index k of T, start matching at T(k+n-1) and move leftwards; if all letters match, report an occurrence. By itself, matching from right to left has the same running time as Brute Force. The gain comes from the matched suffix: based on it we can decide to skip over ranges of characters.
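
The right-to-left comparison on its own can be sketched as below. This is only a minimal illustration with no skipping yet, so it behaves like Brute Force; the function name `find_right_to_left` is hypothetical, and indices are 0-based here while the slides count from 1.

```python
def find_right_to_left(pattern, text):
    """Return the 0-based start index of every occurrence of pattern in text."""
    n, m = len(pattern), len(text)
    hits = []
    if n == 0 or n > m:
        return hits
    for k in range(m - n + 1):      # alignment: pattern[0] sits under text[k]
        i = n - 1                   # start at the right end of the pattern
        while i >= 0 and pattern[i] == text[k + i]:
            i -= 1                  # move leftwards while characters agree
        if i < 0:                   # ran off the left end: full match
            hits.append(k)
    return hits


print(find_right_to_left("aba", "ababa"))  # [0, 2]
```

Every alignment k is still tried in turn; the shift rules on the following slides are what allow alignments to be skipped.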

5 Algorithm Skeleton 1)Align P with the beginning of T and match from right to left. 2)If the whole of P matched, report an occurrence. 3)Otherwise shift P by the larger of the bad character shift and the good suffix shift, and repeat. Conditional correctness: if neither shift ever skips over an occurrence of P in T, the algorithm reports all occurrences.
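
The skeleton can be sketched with the shift rules passed in as plain functions. Everything named here (`search`, `make_bad_character_rule`, the `(i, k)` rule signature) is an assumption made for illustration. Only a bad character rule is plugged in, and after a full match the sketch simply shifts by 1, which is safe but not what the complete algorithm does.

```python
def search(pattern, text, shift_rules):
    """Skeleton: match right to left, then shift by the largest proposed amount.

    Each rule is called as rule(i, k), where i is the 0-based pattern index of
    the mismatch (-1 after a full match) and k is the current alignment.
    """
    n, m = len(pattern), len(text)
    hits = []
    k = 0
    while n > 0 and k <= m - n:
        i = n - 1
        while i >= 0 and pattern[i] == text[k + i]:
            i -= 1
        if i < 0:
            hits.append(k)
        # never shift by less than 1, otherwise the loop would not advance
        k += max([1] + [rule(i, k) for rule in shift_rules])
    return hits


def make_bad_character_rule(pattern, text):
    """Hypothetical helper: builds a bad character rule for this pattern/text."""
    rightmost = {ch: pos for pos, ch in enumerate(pattern)}  # 0-based positions

    def rule(i, k):
        if i < 0:                   # full match: this rule has nothing to say
            return 1
        return i - rightmost.get(text[k + i], -1)

    return rule


text = "ababa"
print(search("aba", text, [make_bad_character_rule("aba", text)]))  # [0, 2]
```

Because the loop takes the maximum over all rules, a second rule such as a good suffix shift could be added to the list without changing the skeleton.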

6 Bad Character rule Definition For each character x, let R(x) be the position of the right-most occurrence of character x in P. R(x) is defined to be zero if x does not occur in P.
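
As a small illustration, R(x) can be stored in a dictionary. The function name `rightmost_positions` is hypothetical, and positions are kept 1-based to match the definition, so that a missing character naturally maps to 0.

```python
def rightmost_positions(pattern):
    """Map each character of pattern to the 1-based index of its rightmost occurrence."""
    R = {}
    for pos, ch in enumerate(pattern, start=1):
        R[ch] = pos                 # later occurrences overwrite earlier ones
    return R


R = rightmost_positions("abcab")
# Characters that do not occur in the pattern are looked up with default 0.
print(R.get("a", 0), R.get("c", 0), R.get("z", 0))  # 4 3 0
```

The table needs one pass over P, plus space proportional to the number of distinct characters in P.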

7 Bad character shift Definition: Suppose that for a particular alignment of P against T, the rightmost n-i characters of P match their counterparts in T, but the next character to the left, P(i), mismatches its counterpart at position k of T. If the right-most position of the character T(k) in P is j with j < i, shift P so that character j of P is below character k of T; otherwise shift by 1. The shift is therefore max[1, i-R(T(k))].

8 Bad character shift Simple case: the character T(k) aligned with P(n) does not appear in P, so P is shifted by n (to start just after k).

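9 Bad character shift General case: shift by i - R(x), where x = T(k) is the mismatched text character and R(x) < i. Correctness is easy to prove: any smaller shift would place a character of P other than x under T(k).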
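
The formula max[1, i - R(T(k))] covers the simple case, the general case and the fallback to 1. The sketch below evaluates it for the pattern abcab; the function name `bad_character_shift` is hypothetical and i is 1-based as on the slides.

```python
def bad_character_shift(pattern, i, text_char):
    """Shift for a mismatch at 1-based pattern position i against text_char."""
    # R(x): 1-based rightmost position of x in the pattern, 0 if x is absent
    R = {ch: pos for pos, ch in enumerate(pattern, start=1)}
    return max(1, i - R.get(text_char, 0))


p = "abcab"
print(bad_character_shift(p, 5, "z"))  # 5: simple case, z is not in P, shift by n
print(bad_character_shift(p, 5, "c"))  # 2: general case, R(c) = 3 < i = 5
print(bad_character_shift(p, 3, "b"))  # 1: R(b) = 5 lies right of i, fall back to 1
```

The last call shows why the max with 1 is needed: without it the rule would propose a negative shift.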

10 Boyer-Moore-Horspool algorithm Described by Horspool in 1980. Basic idea: use the Boyer-Moore algorithm, but with the bad character shift rule only. The worst-case running time, reached on degenerate inputs, is O(nm). The best case is sub-linear: O(m/n).
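
A minimal sketch of the algorithm in its commonly published form follows. Note one assumption that goes slightly beyond the slide's summary: this form always takes the shift from the text character under the last pattern position, whichever position mismatched, and builds the table from all pattern characters except the last. The name `horspool_search` is hypothetical and indices are 0-based.

```python
def horspool_search(pattern, text):
    """Return the 0-based start index of every occurrence of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    # Distance from the rightmost occurrence of each character in pattern[:-1]
    # to the last pattern position; characters not listed shift by n.
    shift = {ch: n - 1 - pos for pos, ch in enumerate(pattern[:-1])}
    hits = []
    k = 0
    while k <= m - n:
        i = n - 1
        while i >= 0 and pattern[i] == text[k + i]:
            i -= 1
        if i < 0:
            hits.append(k)
        # the shift depends only on the text character under the window's end
        k += shift.get(text[k + n - 1], n)
    return hits


print(horspool_search("abcab", "abcabcabxabcab"))  # [0, 3, 9]
```

Excluding the last pattern character from the table keeps every shift at least 1, so no separate max with 1 is needed.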

11 Boyer-Moore-Horspool worst case A pattern and a text can be constructed so that every shift is 1, the same as Brute Force; for example, a pattern made of one b followed by a's against a text of all a's.
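
Both extremes can be checked by counting character comparisons. The sketch below uses only the bad character shift max[1, i - R(T(k))] as defined on the earlier slides; the function name `count_comparisons` and the two inputs are assumptions chosen for illustration.

```python
def count_comparisons(pattern, text):
    """Count character comparisons made when only the bad character shift is used."""
    n, m = len(pattern), len(text)
    R = {ch: pos for pos, ch in enumerate(pattern, start=1)}  # 1-based, 0 if absent
    comparisons = 0
    k = 0                                   # 0-based alignment of P in T
    while k <= m - n:
        i = n                               # 1-based pattern position
        while i >= 1:
            comparisons += 1
            if pattern[i - 1] != text[k + i - 1]:
                break
            i -= 1
        if i == 0:                          # full match
            k += 1
        else:
            k += max(1, i - R.get(text[k + i - 1], 0))
    return comparisons


# Degenerate input: n comparisons per alignment and a shift of 1 every time.
print(count_comparisons("baa", "a" * 10))   # 24 = 3 * (10 - 3 + 1)
# Favourable input: one comparison per alignment and a shift of n every time.
print(count_comparisons("bbb", "a" * 12))   # 4 = 12 / 3
```

In the first call the mismatch always happens at P(1) against an a whose rightmost position is n, so i - R is negative and the shift falls back to 1.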

12 Boyer-Moore-Horspool best case

13 Boyer-Moore-Horspool Time

14 Good Suffix Rule

15 Good suffix rule (cont'd)

16 Correctness of the good-suffix shift

17 Preprocessing of P The originally published preprocessing algorithm was complex and erroneous. An updated version was still complex. We will use a simpler version based on the Z algorithm. We want the preprocessing to compute values for the functions L’(i) and l’(i) (defined later).
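
For reference, a linear-time Z algorithm can be sketched as follows. This is a standard formulation written with 0-based indices, not necessarily the exact variant used in the presentation; the name `z_array` is hypothetical, and z[0] is set to the full length by convention.

```python
def z_array(s):
    """z[i] = length of the longest substring starting at i that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    left = right = 0                # s[left:right] is the rightmost match found so far
    for i in range(1, n):
        if i < right:
            # reuse the value computed for the mirrored position inside the box
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1               # extend by explicit comparisons
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


print(z_array("aabcaabxaaz"))  # [11, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```

Running the same function on the reversed pattern gives suffix-match lengths instead of prefix-match lengths, which is the form the good suffix preprocessing needs.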

18 Preprocessing of P (cont'd)

19 Preprocessing of P: calculating L’(i)

20 Preprocessing of P: calculating l’(i)

21 Using the preprocessing results
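
As a rough end-to-end sketch, the code below assumes the definitions from Gusfield's book: N(j) is the length of the longest suffix of P(1..j) that is also a suffix of P, L’(i) is the largest j < n with N(j) equal to the length of P(i..n), and l’(i) is the length of the longest suffix of P(i..n) that is also a prefix of P. All function names are hypothetical and the arrays are stored 0-based.

```python
def z_array(s):
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def good_suffix_tables(p):
    n = len(p)
    N = z_array(p[::-1])[::-1]      # N[j]: suffix-match length ending at index j
    big_l = [0] * n                 # L'(i), indexed by 0-based suffix start
    for j in range(n - 1):
        i = n - N[j]
        if i < n:
            big_l[i] = j + 1        # value is a 1-based end position
    small_l = [0] * n               # l'(i), indexed by 0-based suffix start
    for j in range(n):
        if N[j] == j + 1:           # the prefix of length j + 1 is also a suffix
            small_l[n - j - 1] = j + 1
    for i in range(n - 2, -1, -1):  # longer suffixes inherit from shorter ones
        if small_l[i] == 0:
            small_l[i] = small_l[i + 1]
    return big_l, small_l


def boyer_moore(p, t):
    n, m = len(p), len(t)
    if n == 0 or n > m:
        return []
    rightmost = {ch: pos for pos, ch in enumerate(p)}
    big_l, small_l = good_suffix_tables(p)
    hits, k = [], 0
    while k <= m - n:
        i = n - 1
        while i >= 0 and p[i] == t[k + i]:
            i -= 1
        if i < 0:
            hits.append(k)
            k += n - small_l[1] if n > 1 else 1
            continue
        bad = i - rightmost.get(t[k + i], -1)
        if i == n - 1:
            good = 1                # nothing matched, the rule gives no information
        elif big_l[i + 1] > 0:
            good = n - big_l[i + 1]
        else:
            good = n - small_l[i + 1]
        k += max(1, bad, good)
    return hits


print(boyer_moore("abcab", "abcabcabxabcab"))  # [0, 3, 9]
```

This sketch omits Galil's modification, so its worst case when P occurs often in T is not linear.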

22 Boyer-Moore Time Using the linear time implementation of the Z algorithm, the preprocessing takes O(n) time and O(n) space. When P occurs in T, the original Boyer-Moore algorithm can take O(nm) time; a few simple modifications remove these cases [Galil 1979]. A tight bound of 3m comparisons was established for the Boyer-Moore running time [Cole 1991]. An average-case analysis has been proposed, but it is difficult to reduce to a simple expression as in BMH [Tsai 2005]. For other, “Boyer-Moore-like” algorithms the following time bounds were established: 14m (Galil, 1979); 2m (Apostolico et al., 1986); 3m/2 (Colussi et al., 1990); 4m/3 (Colussi et al., 1990).

23 Experimental Analysis On average, for sufficiently large alphabets (8 characters), Boyer-Moore-Horspool has a fast running time and makes a sub-linear number of character comparisons. Both on average and in the worst cases, Boyer-Moore is faster than the “Boyer-Moore-like” algorithms. Data from Michailidis and Margaritis [2001].

24 Questions?

