Boyer Moore Algorithm Idan Szpektor. Boyer and Moore.

What It's About
A string matching algorithm.
Preprocess a pattern P (|P| = n).
For a text T (|T| = m), find all of the occurrences of P in T.
Time complexity: O(n + m), but usually sub-linear.

Right to Left (like in Hebrew)
Matching the pattern from right to left.
For a pattern abc:
     ↓
T: bbacdcbaabcddcdaddaaabcbcb
P: abc
Worst case is still O(n m).

The Bad Character Rule (BCR)
On a mismatch between the pattern and the text, we can shift the pattern by more than one place. Sub-linearity!
T: ddbbacdcbaabcddcdaddaaabcbcb
P: acabc

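BCR Preprocessing
Option 1: a table holding, for each position in the pattern and each character, the size of the shift. O(n |Σ|) space, O(1) access time.
Option 2: a list of positions for each character. O(n + |Σ|) space, O(n) access time, but O(m) in total over the whole search.
Example for P = abacb (each entry is the rightmost position ≤ i at which the character occurs in P):
i:  1  2  3  4  5
P:  a  b  a  c  b
a:  1  1  3  3  3
b:  -  2  2  2  5
c:  -  -  -  4  4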
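The list-of-positions variant can be sketched as follows. This is a minimal illustration, not the presenter's code: the function names are hypothetical, positions are 1-based as in the slides, and the shift for a character that does not occur to the left of the mismatch is assumed to move P just past the mismatch.

```python
from collections import defaultdict

def build_position_lists(pattern):
    """Map each character to its 1-based positions in the pattern, largest first."""
    positions = defaultdict(list)
    for i in range(len(pattern), 0, -1):
        positions[pattern[i - 1]].append(i)
    return positions

def bad_char_shift(positions, i, x):
    """Shift to apply when text character x mismatches at pattern position i.

    Scans the list for x until it finds an occurrence left of i (the O(n)
    access time mentioned on the slide).
    """
    for p in positions.get(x, []):
        if p < i:
            return i - p
    return i  # assumption: x does not occur left of i, so move P past the mismatch

positions = build_position_lists("abacb")
print(bad_char_shift(positions, 5, "a"))
```

For P = abacb, a mismatch at position 5 against the text character a aligns the a at position 3 with the text, a shift of 2.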

BCR - Summary
On a mismatch, shift the pattern to the right until the mismatched text char is aligned with its nearest occurrence in P to the left of the mismatch position.
Still O(n m) worst case running time:
T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: abaaaa

The Good Suffix Rule (GSR)
We want to use the knowledge of the matched characters in the pattern's suffix.
If we matched S characters in T, what is the smallest shift of P (if one exists) that will align another sub-string of P with the same S characters?

GSR (Cont…)
Example 1 - how much to move:
T: bbacdcbaabcddcdaddaaabcbcb
P: cabbabdbab (before the shift)
P: cabbabdbab (after the shift)

GSR (Cont…)
Example 2 - what if there is no alignment:
T: bbacdcbaabcbbabdbabcaabcbcb
P: bcbbabdbabc (before the shift)
P: bcbbabdbabc (after the shift)

GSR - Detailed
We mark the matched sub-string in T with t, the mismatched char in T with x, and the pattern char that failed against it with y (y ≠ x).
1. If t occurs elsewhere in P preceded by a char other than y: shift right to the nearest such occurrence of t in P.
2. Otherwise, shift right to the largest prefix of P that aligns with a suffix of t.

Boyer Moore Algorithm
Preprocess(P)
k := n
while (k ≤ m) do
    Match P and T from right to left, starting at k
    If a mismatch occurs:
        shift P right (advance k) by max(good suffix rule, bad char rule)
    else:
        print the occurrence and shift P right (advance k) by the good suffix rule

Algorithm Correctness
The bad character rule shift never misses a match.
The good suffix rule shift never misses a match.

Preprocessing the GSR - L(i)
L(i): the largest index j < n such that P[i..n] is a suffix of the prefix P[1..j] but P[i-1..n] is not; L(i) = 0 if no such j exists.
i:  1  2  3  4  5  6  7  8  9 10 11 12 13
P:  b  b  a  b  b  a  a  b  b  c  a  b  b
L:  0  0  0  0  0  0  0  0  0  0  9  2 12

Preprocessing the GSR - l(i)
l(i): the length of the longest suffix of P[i..n] that is also a prefix of P (values are given for i = 2..13).
i:  1  2  3  4  5  6  7  8  9 10 11 12 13
P:  b  b  a  b  b  a  a  b  b  c  a  b  b
l:  -  2  2  2  2  2  2  2  2  2  2  2  1

Using L(i) and l(i) in GSR
If a mismatch occurs at position n, shift P by 1.
If a mismatch occurs at position i-1 in P:
    If L(i) > 0, shift P by n - L(i)
    else shift P by n - l(i)
If P was found, shift P by n - l(2).
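Putting the shift rules together, the whole search can be sketched as below. This is a sketch under stated assumptions rather than the presenter's implementation: the names `z_array`, `good_suffix_tables` and `boyer_moore` are hypothetical, the L and l tables are built from Z values over the reversed pattern (the construction described in the later slides), positions follow the slides' 1-based convention, and the bad character rule uses per-character position lists.

```python
def z_array(s):
    """0-indexed Z values: z[i] is the slide's Z(i + 1); z[0] is left as 0."""
    n = len(s)
    z = [0] * n
    left = right = 0  # [left, right) is the prefix match reaching furthest right
    for i in range(1, n):
        if i < right:
            z[i] = min(z[i - left], right - i)
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z

def good_suffix_tables(p):
    """Return 1-indexed lists L and l (index 0 unused, padded to n + 2)."""
    n = len(p)
    z = z_array(p[::-1])
    L, l = [0] * (n + 2), [0] * (n + 2)
    k = 0
    for j in range(1, n):
        nj = z[n - j]          # N(j), read off Z over the reversed pattern
        if nj > 0:
            L[n - nj + 1] = j
        if nj == j:
            k = j
        l[n - j + 1] = k
    return L, l

def boyer_moore(p, t):
    """Return the 1-based start positions of all occurrences of p in t."""
    n, m = len(p), len(t)
    if n == 0 or n > m:
        return []
    L, l = good_suffix_tables(p)
    positions = {}             # char -> 1-based positions in p, largest first
    for i in range(n, 0, -1):
        positions.setdefault(p[i - 1], []).append(i)
    hits = []
    k = n                      # position in T of the right end of P
    while k <= m:
        i, h = n, k
        while i > 0 and p[i - 1] == t[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k - n + 1)
            k += n - l[2]
            continue
        bad = i                # default: move P past the mismatch
        for pos in positions.get(t[h - 1], []):
            if pos < i:
                bad = i - pos
                break
        if i == n:
            good = 1
        elif L[i + 1] > 0:     # mismatch at i is "position i-1" for index i + 1
            good = n - L[i + 1]
        else:
            good = n - l[i + 1]
        k += max(bad, good)
    return hits

print(boyer_moore("abc", "bbacdcbaabcddcdaddaaabcbcb"))
```

For the pattern abc and the text used in the right-to-left example, this reports occurrences starting at positions 9 and 21.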

Building L(i) and l(i) - the Z
For a string s, Z(i) is the length of the longest sub-string of s starting at i that matches a prefix of s (defined for i ≥ 2).
i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
s:  b  b  a  c  d  c  b  b  a  a  b  b  c  d  d
Z:  -  1  0  0  0  0  3  1  0  0  2  1  0  0  0
Naively, we can build Z in O(n^2).

From Z to N
N(i) is the length of the longest suffix of P[1..i] that is also a suffix of P.
N(i) is Z(n - i + 1), built over P reversed.
i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
s:  d  d  c  b  b  a  a  b  b  c  d  c  a  b  b
N:  0  0  0  1  2  0  0  1  3  0  0  0  0  1  -
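The relation between Z and N can be checked directly with the naive O(n^2) construction. The sketch below is illustrative only; `z_naive` and `n_values` are hypothetical names, and the lists are 0-indexed where the slides are 1-indexed.

```python
def z_naive(s):
    """O(n^2) Z values, 0-indexed: z[i] is the slide's Z(i + 1); z[0] is left as 0."""
    n = len(s)
    z = [0] * n
    for i in range(1, n):
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
    return z

def n_values(p):
    """Return [N(1), ..., N(n-1)] by reading Z over the reversed pattern."""
    n = len(p)
    z = z_naive(p[::-1])
    return [z[n - j] for j in range(1, n)]

print(z_naive("bbacdcbbaabbcdd")[1:])
print(n_values("ddcbbaabbcdcabb"))
```

The second string is the first one reversed, so the N values come out as the Z values read backwards.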

Building L(i) in O(n)
L(i): the largest index j < n such that the prefix P[1..j] has P[i..n] as a suffix but not P[i-1..n].
Equivalently, L(i) is the largest index j < n such that N(j) == |P[i..n]| == n - i + 1.
for i := 1 to n, L(i) := 0
for j := 1 to n-1
    if N(j) > 0
        i := n - N(j) + 1
        L(i) := j

Building l(i) in O(n)
l(i): the length of the longest suffix of P[i..n] that is also a prefix of P.
Equivalently, l(i) is the largest j <= |P[i..n]| == n - i + 1 such that N(j) == j.
k := 0
for j := 1 to n-1
    If (N(j) == j), k := j
    l(n - j + 1) := k
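Both loops can be merged into one pass over N. The following is a sketch, assuming N is supplied as a plain list [N(1), ..., N(n-1)]; the function name `build_L_and_l` is hypothetical, and the sample N values were derived by hand from the definition of N for P = bbabbaabbcabb.

```python
def build_L_and_l(N):
    """Build the 1-indexed tables L and l (index 0 unused) from [N(1), ..., N(n-1)]."""
    n = len(N) + 1
    L = [0] * (n + 1)
    l = [0] * (n + 1)
    k = 0
    for j in range(1, n):
        nj = N[j - 1]
        if nj > 0:             # N(j) = 0 would point at index n + 1, so skip it
            L[n - nj + 1] = j  # later (larger) j overwrite earlier ones
        if nj == j:            # the prefix P[1..j] is also a suffix of P
            k = j
        l[n - j + 1] = k
    return L, l

# N(1..12) for P = bbabbaabbcabb, computed by hand from the definition of N.
N = [1, 2, 0, 1, 3, 0, 0, 1, 3, 0, 0, 1]
L, l = build_L_and_l(N)
print(L[1:])
print(l[2:])
```

Because j runs upward, each L(i) ends up holding the largest qualifying j, and k always holds the longest prefix-suffix of length at most j, which is what l(n - j + 1) requires.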

Building Z in O(n)
For calculating Z(i), we want to use the previously calculated Z(1)…Z(i-1).
For each i we remember the index j < i whose match reaches furthest to the right, i.e. the j that maximizes k + Z(k) over all k < i.

Building Z in O(n) (Cont…)
(Figure: the string s with the positions i', j and i marked.)
If i < j + Z(j), then s[i … j + Z(j) - 1] appeared previously, starting at i' = i - j + 1.
Is Z(i') < Z(j) - (i - j)? If so, Z(i) = Z(i').

Building Z in O(n) (Cont…)
Calculate Z(2) explicitly
j := 2, i := 3
While i <= |s|:
    if i >= j + Z(j), calculate Z(i) explicitly
    else
        i' := i - j + 1
        Z(i) := min(Z(i'), Z(j) - (i - j))
        If Z(i') >= Z(j) - (i - j), calculate the tail of Z(i) explicitly
    If j + Z(j) < i + Z(i), j := i
    i := i + 1
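The procedure above can be written compactly as follows. This is a sketch of the standard linear-time Z construction, not necessarily the presenter's exact code: it is 0-indexed, `z_array` is a hypothetical name, and it tracks the furthest-reaching match as a half-open interval [left, right) instead of the pair j, Z(j).

```python
def z_array(s):
    """Linear-time Z values, 0-indexed: z[i] is the slide's Z(i + 1); z[0] is left as 0."""
    n = len(s)
    z = [0] * n
    left = right = 0           # s[left:right] matches s[0:right - left]
    for i in range(1, n):
        if i < right:
            # Reuse Z(i'), capped at the part still covered by the known match.
            z[i] = min(z[i - left], right - i)
        # Explicit comparisons; each success pushes `right` further, so the
        # total number of successful comparisons is at most n.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z

print(z_array("bbacdcbbaabbcdd")[1:])
```

The inner loop also runs when the copied value is strictly shorter than the covered part; in that case it stops after a single failed comparison, so the O(n) bound is unaffected.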

Building Z in O(n) - Analysis
The algorithm builds Z correctly.
The algorithm executes in O(n):
    A new character is matched only once.
    All other operations are in O(1).

Boyer Moore Worst Case Analysis
Assume P consists of n copies of a single char and T consists of m copies of the same char:
T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaaaa
The Boyer Moore algorithm runs in Θ(m n) when finding all the matches.

The Galil Rule
In a specific matching phase, we mark with k the position in T of the right end of P.
We mark with s the position of the last matched char in this phase.
(Figure: positions s, k and k' marked on T, with P shown before and after the shift.)
T: bbacdcbaabcddcdaddaaabcbcb
P: abaab

The Galil Rule (Cont…)
After a shift that moves the left end of P to the right of s, all the chars in positions s < j ≤ k are known to be matching. The algorithm does not need to check them.
An extended Boyer Moore algorithm with the Galil rule runs in O(m + n) worst case (even without the bad-character rule).
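A search that uses only the good suffix rule plus the Galil rule can be sketched as below. This is an illustrative sketch under assumptions, not the presenter's code: the function names are hypothetical, the variable `known` plays the role of k from the previous phase (0 when nothing is known), and the known region is simply discarded whenever the shift leaves the left end of P at or before the mismatch position.

```python
def z_array(s):
    """0-indexed Z values; z[0] is left as 0."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(z[i - left], right - i)
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z

def good_suffix_tables(p):
    """Return 1-indexed lists L and l (index 0 unused, padded to n + 2)."""
    n = len(p)
    z = z_array(p[::-1])
    L, l = [0] * (n + 2), [0] * (n + 2)
    k = 0
    for j in range(1, n):
        nj = z[n - j]
        if nj > 0:
            L[n - nj + 1] = j
        if nj == j:
            k = j
        l[n - j + 1] = k
    return L, l

def search_with_galil(p, t):
    """1-based start positions of p in t, using the good suffix and Galil rules."""
    n, m = len(p), len(t)
    if n == 0 or n > m:
        return []
    L, l = good_suffix_tables(p)
    hits = []
    k = n        # position in T of the right end of P
    known = 0    # T positions from the left end of P up to `known` already match
    while k <= m:
        i, h = n, k
        while i > 0 and h > known and p[i - 1] == t[h - 1]:
            i -= 1
            h -= 1
        if i == 0 or h == known:
            # Either everything was compared, or the scan reached the known region.
            hits.append(k - n + 1)
            shift = n - l[2]
            known = k
        else:
            if i == n:
                shift = 1
            elif L[i + 1] > 0:
                shift = n - L[i + 1]
            else:
                shift = n - l[i + 1]
            # Keep T[..k] as known only if the new left end lies right of the mismatch h.
            known = k if k + shift - n + 1 > h else 0
        k += shift
    return hits

print(search_with_galil("aaa", "aaaaa"))
```

On the worst-case input of the previous slide (P and T made of a single repeated char), each phase after the first compares only the one new character before reaching the known region.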

Don’t Sleep Yet…

O(n + m) proof - Outline
Preprocess in O(n) - already proved.
1. Properties of strings.
2. Proof of search in O(m) if P is not in T, using only the good suffix rule.
3. Proof of search in O(m) even if P is in T, adding the Galil rule.

Properties of Strings
If for two strings δ, γ: δγ = γδ, then there is a string β such that δ = β^i and γ = β^j, with i, j > 0. Proof by induction.
Definition: a string s is semiperiodic with period β if s consists of a non-empty suffix of β (possibly the entire β) followed by one or more complete copies of β.

Properties of Strings (Cont…)
A string is prefix semiperiodic if it contains one or more complete copies of β followed by a non-empty prefix of β.
A string is prefix semiperiodic iff it is semiperiodic with a period of the same length.

Lemma 1
Suppose P occurs in T starting at position p and also at position q, q > p.
If q - p ≤ ⌊n/2⌋ then P is semiperiodic with period α = P[n-(q-p)+1…n].

Proof - when P is Not Found in T
We have R rounds during the search.
After each round i the good suffix rule decides on a right shift of s_i chars.
Σ s_i ≤ m
We shall use Σ s_i as an upper bound.

Proof (Cont…)
For each round i we count the matched chars by:
f_i - the number of chars matched for the first time.
g_i - the number of chars already matched in previous rounds.
Σ f_i ≤ m, since each char of T is matched for the first time at most once.
We want to prove that g_i ≤ 3 s_i (⇒ Σ g_i ≤ 3m).

Proof (Cont…)
No round finds P ⇒ each round i matched a substring t_i and one bad char x_i in T (x_i t_i ⊆ T).
T: bbacdcbaabcbbabdbabcaabcbcb
P: bdbabc
|t_i| + 1 ≤ 3 s_i ⇒ g_i ≤ 3 s_i (because g_i + f_i = |t_i| + 1).
For the rest of the proof we assume that for the specific round i: |t_i| + 1 > 3 s_i.

Lemma 2 (|t_i| + 1 > 3 s_i)
In round i we look at the matched suffix of P, marked P*. P* = y_i t_i, y_i ≠ x_i.
Both P* and t_i are semiperiodic with period α of length s_i, and hence with minimal length period β, α = β^k.
Proof: by Lemma 1.

Lemma 3 (|t_i| + 1 > 3 s_i)
Suppose P overlapped t_i during round i. We shall examine in what ways P could overlap t_i in previous rounds.
In any round h < i, the right end of P could not have been aligned with the right end of any full copy of β in t_i.
Proof: otherwise both round h and round i fail at char x_i, and the two possible cases of shift after round h are both invalid.

Lemma 4 (|t_i| + 1 > 3 s_i)
In round h < i, P can correctly match at most |β|-1 chars in t_i.
Proof: by Lemma 3, the right end of P is not aligned with the right end of a full copy of β in t_i in round h. Thus if it matched |β| chars or more there is a suffix γ of β followed by a prefix δ of β such that δγ = γδ. By the string properties there is a substring μ such that β = μ^k, k > 1. This contradicts the minimal period size property of β.

Lemma 5 (|t_i| + 1 > 3 s_i)
If in round h < i the right end of P is aligned with a char in t_i, it can only be aligned with one of the following:
One of the left-most |β|-1 chars of t_i.
One of the right-most |β| chars of t_i.
Proof: if not, by Lemmas 3 and 4, at most |β|-1 chars are matched, and only from the middle of a β copy, while there are at least |β|; a shift cannot pass the right end of that β copy.

Proof (Cont…)
If |t_i| + 1 > 3 s_i then g_i ≤ 3 s_i.
Using Lemma 5, in previous rounds we could match only the bad char x_i, the last |β|-1 chars in t_i, or start from the first |β| right chars in t_i. In the last case, using Lemma 4, we can only match up to |β|-1 chars.
In total we could previously match:
g_i ≤ 1 + (|β|-1) + (|β| + |β|-1) = 3|β| - 1 ≤ 3|β| ≤ 3 s_i

Proof - Final
Number of matches = ∑(f_i + g_i) = ∑ f_i + ∑ g_i ≤ m + ∑ 3 s_i ≤ m + 3m = 4m

Proof - when P is Found in T
Split the rounds into two groups:
“match” rounds - an occurrence of P in T was found.
“mismatch” rounds - P was not found in T.
We have proved O(m) for “mismatch” rounds.

Proof (Cont…)
After P was found in T, P will be shifted by a constant length s (s = n - l(2)).
n + 1 ≤ 3s ⇒ ∑ matches in “match” rounds ≤ ∑ 3s ≤ 3m
For the rest of the proof we assume that: n + 1 > 3s

Proof (n + 1 > 3s)
By Lemma 1, P is semiperiodic with minimal length period β, |β| = s.
If round i+1 is also a “match” round then, by the Galil rule, only the new |β| chars are compared.
A contiguous series of “match” rounds, i…i+k, is called a “run”.

Proof (n + 1 > 3s)
∑ The length of a “run”, not including chars that were already matched in previous “runs” ≤ m
How many chars in a “run” were already matched in previous “runs”?

Lemma (n + 1 > 3s)
Suppose k-1 was a “match” round and k is a “mismatch” round that ends the “run”. If k' > k is the first “match” round after it, then it overlaps at most |β|-1 chars with the previous “run” (ended by round k-1).
Proof: the left end of P at round k' cannot be aligned with the left end of a full copy of β at round k-1. As a result, P cannot overlap |β| chars or more with round k-1.

Proof (n + 1 > 3s)
By the Lemma, and because the shift after every “match” round is of |β|, only the first round of a “run” can overlap, and only with the last previous “run”.
⇒ ∑ The length of the chars that were already matched in previous “runs” ≤ m

Proof (n + 1 > 3s) - Final
∑ The length of a “run” = ∑ The length of a “run”, not including chars that were already matched in previous “runs” + ∑ The length of the chars that were already matched in previous “runs” ≤ m + m = 2m
