Presentation is loading. Please wait.

Presentation is loading. Please wait.

UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 2: KMP Algorithm Lecturer:

Similar presentations


Presentation on theme: "UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 2: KMP Algorithm Lecturer:"— Presentation transcript:

1 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 2: KMP Algorithm Lecturer: Dr. Rose Slides by: Dr. Rose January 28, 2003

2 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology KMP Algorithm Preliminaries: –KMP can be easily explained in terms of finite state machines. –KMP has a easily proved linear bound –KMP is usually not the method of choice

3 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology KMP Algorithm Recall that the naïve approach to string matching is  (mn). How can we reduce this complexity? –Avoid redundant comparisons –Use larger shifts Boyer-Moore good suffix rule Boyer-Moore extended bad character rule

4 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology KMP Algorithm KMP finds larger shifts by recognizing patterns in P. –Let sp i (P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P. –By definition sp 1 = 0 for any string. –Q: Why does this make sense? –A: The proper suffix must be the empty string

5 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology KMP Algorithm Example: P = abcaeabcabd –P[1..2] = ab hence sp 2 = ? –sp 2 = 0 –P[1..3] = abc hence sp 3 = ? –sp 3 = 0 –P[1..4] = abca hence sp 4 = ? –sp 4 = 1 –P[1..5] = abcae hence sp 5 = ? –sp 5 = 0 –P[1..6] = abcaea hence sp 6 = ? –sp 6 = 1

6 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology KMP Algorithm Example Continued –P[1..7] = abcaeab hence sp 7 = ? –sp 7 = 2 –P[1..8] = abcaeabc hence sp 8 = ? –sp 8 = 3 –P[1..9] = abcaeabca hence sp 9 = ? –sp 9 = 4 –P[1..10] = abcaeabcab hence sp 10 = ? –sp 10 = 2 –P[1..11] = abcaeabcabd hence sp 11 = ? –sp 11 = 0

7 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology KMP Algorithm Like the  /  concept for Boyer-Moore, there is an analogous sp i /sp´ i concept. Let sp´ i (P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that characters P(i + 1) and P(sp´ i + 1) are unequal. Example: P = abcdabce sp´ 7 = 3 Obviously sp´ i (P) <= sp i (P), since the later is less restrictive.

8 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology KMP Algorithm KMP Shift Rule: 1.Mismatch case: Let position i+1 in P and position k in T be the first mismatch in a left-to-right scan. Shift P to the right, aligning P[1..sp´ i ] with T[k- sp´ i..k-1] 2.Match case: If no mismatch is found, an occurrence of P has been found. Shift P by n – sp´ n spaces to continue searching for other occurrences.

9 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology KMP Algorithm Observations: –The prefix P[1..sp´ i ] of the shifted P is shifted to match the corresponding substring in T. –Subsequent character matching proceeds from position sp´ i + 1 –Unlike Boyer-Moore, the matched substring is not compared again. –The shift rule based on sp´ i guarantees that the exact same mismatch won’t occur at sp´ i + 1 but doesn’t guarantee that P(sp´ i +1) = T(k)

10 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology KMP Algorithm Example: P = abcxabcde –If a mismatch occurs at position 8, P will be shifted 4 positions to the right. –Q: Where did the 4 position shift come from? –A: The number of position is given by i - sp´ i, in this example i = 7, sp´ 7 = 3,  7 – 3 = 4 –Notice that we know the amount of shift without knowing anything about T other than there was a mismatch at position 8..

11 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology KMP Algorithm Example Continued: P = abcxabcde –After the shift, P[1..3] lines up with T[k-4..k-1] –Since it known that P[1..3] must match T[k-4..k-1], no comparison is needed. –The scan continues from P(4) & T(k) Advantages of KMP Shift Rule 1. P is often shifted by more than 1 character, (i - sp´ i ) 2.The left-most sp´ i characters in the shifted P are known to match the corresponding characters in T.

12 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology KMP Algorithm Full Example: T = xyabcxabcxadcdqfeg P = abcxabcde Assume that we have already shifted past the first two positions in T. xyabcxabcxadcdqfeg abcxabcde ^1^1 ^ 2 ^ 3 ^ 4 ^ 5 ^ 6 ^ 7 ^ 8 d!=x, shift 4 places abcxabcde ^ 1 start again from position 4

13 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Preprocessing for KMP Approach: show how to derive sp´ values from Z values. Definition: Position j > 1 maps to i if i = j + Z j (P) – 1 –Recall that Z j (P) denotes the length of the Z-box starting at position j. –This says that j maps to i if i is the right end of a Z-box starting at j.

14 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Preprocessing for KMP Theorem. For any i > 1, sp´ i (P) = Z j = i – j + 1 Where j > 1 is the smallest position that maps to i. If  j then sp´ i (P) = 0 Similarly for sp: For any i > 1, sp i (P) = i – j + 1 Where j, i  j > 1, is the smallest position that maps to i or beyond. If  j then sp i (P) = 0

15 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Preprocessing for KMP Given the theorem from the preceding slide, the sp´ i and sp i values can be computed in linear time using Z i values: For i = 1 to n { sp´ i = 0;} For j = n downto 2 { i = j + Z i (P) – 1; sp´ i = Z i ; } sp n (P) = sp´ n (P); For i = n - 1 downto 2 { sp i (P) = max[sp i+1 (P) - 1, sp´ i (P)];}

16 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Preprocessing for KMP Defn. Failure function F´(i) = sp´ i-1 + 1, 1  i  n + 1, sp´ 0 = 0 (similarly F(i) = sp i-1 + 1, 1  i  n + 1, sp 0 = 0) Idea: –We maintain a pointer i in P and c in T. – After a mismatch at P(i+1) with T(c), shift P to align P(sp´ i + 1) with T(c), i.e., i = sp´ i + 1. –Special case 1: i = 1  set i = F´(1) = 1 & c = c + 1 –Special case 2: we find P in T,  shift n - sp´ n spaces, i.e., i = F´(n + 1) = sp´ n + 1.

17 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Full KMP Algorithm Preprocess P to find F´(k) = sp´ k-1 +1 for k from 1 to n + 1 c = 1; p = 1; While c + (n – p)  m { While P(p) = T( c )and p  n { p = p + 1; c = c + 1;} If (p = n + 1) then report an occurrence of P at position c – n of T. if (p = 1) then c = c + 1; p = F´(p) ;}

18 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Full KMP Algorithm xyabcxabcxabcdefeg abcxabcde ^ 1 a!=x p != n+1 p = 1!  c = 2 p = F’(1) = 1 c = 1; p = 1; While c + (n – p)  m { While P(p) = T( c )and p  n { p = p + 1; c = c + 1;} If (p = n + 1) then report an occurrence of P at position c – n of T. if (p = 1) then c = c + 1; p = F´(p) ; }

19 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Full KMP Algorithm xyabcxabcxabcdefeg abcxabcde ^ 1 a!=y p != n+1 p = 1!  c = 3 p = F’(1) = 1 c = 1; p = 1; While c + (n – p)  m { While P(p) = T( c )and p  n { p = p + 1; c = c + 1;} If (p = n + 1) then report an occurrence of P at position c – n of T. if (p = 1) then c = c + 1; p = F´(p) ; } abcxabcde

20 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Full KMP Algorithm xyabcxabcxabcdefeg p != n+1 p = 8!  don’t change c p = F´(8) = 4 abcxabcde ^1^1 ^ 2 ^ 3 ^ 4 ^ 5 ^ 6 ^ 7 ^ 8 d!=x c = 1; p = 1; While c + (n – p)  m { While P(p) = T( c )and p  n { p = p + 1; c = c + 1;} If (p = n + 1) then report an occurrence of P at position c – n of T. if (p = 1) then c = c + 1; p = F´(p) ; }

21 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology p = 4, c = 10 ^ 4 Full KMP Algorithm xyabcxabcxabcdefeg p = n+1 ! abcxabcde ^ 5 ^ 6 ^ 7 ^ 8 abcxabcde c = 1; p = 1; While c + (n – p)  m { While P(p) = T( c )and p  n { p = p + 1; c = c + 1;} If (p = n + 1) then report an occurrence of P at position c – n of T. if (p = 1) then c = c + 1; p = F´(p) ; } ^ 9

22 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Real-Time KMP Q: What is meant by real-time algorithms? A: Typically these are algorithms that are meant to interact synchronously in the real world. –This implies a known fixed turn-around time for processing a task –Many embedded scheduling systems are examples involving real-time algorithms. –For KMP this means that we require a constant time for processing all strings of length n.

23 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Real-Time KMP Q: Why is KMP not real-time? A: For any mismatched character in T, we may try matching it several times. –Recall that sp´ i only guarantees that P(i + 1) and P(sp´ i + 1) differ –There is NO guarantee that P(i + 1) and T(k) match We need to ensure that a mismatch at T(k) does NOT entail additional matches at T(k). This means that we have to compute sp´ i values with respect to all characters in  since any could appear in T.

24 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Real-Time KMP Define: sp´ (i,x) (P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp´ i + 1) is x. This is will tell us exactly what shift to use for each possible mismatch. A mismatched character T(k) will never be involved in subsequent comparisons.

25 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Real-Time KMP Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons? A: Because the shift will shift P so that either the matching character aligns with T(k) or P will be shifted past T(k). This results in a real-time version of KMP. Let’s consider how we can find the sp´ (i,x) (P) values in linear time.

26 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Real-Time KMP Thm. For P[i + 1]  x, sp´ (i,x) (P) = i - j + 1 –Here j is the smallest position such that j maps to i and P(Z j + 1) = x. –If there is no such j then where sp´ (i,x) (P) = 0 For i = 1 to n { sp´ (i,x) = 0 for every character x;} For j = n downto 2 { i = j + Z i (P) – 1; x = P(Z j + 1); sp´ (i,x) = Z i ; }

27 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Real-Time KMP Notice how this works: –Starting from the right Find i the right end of the Z box associated with j Find x the character immediately following the prefix corresponding to this Z box. Set sp´ (i,x) = Z i, the length of this Z box. For i = 1 to n { sp´ (i,x) = 0 for every character x;} For j = n downto 2 { i = j + Z i (P) – 1; x = P(Z j + 1); sp´ (i,x) = Z i ;}


Download ppt "UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 2: KMP Algorithm Lecturer:"

Similar presentations


Ads by Google