Data Structures and Algorithms (AT70.02) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: CLRS “Intro. To Algorithms” book website (copyright McGraw Hill) adapted and supplemented

CLRS “Intro. To Algorithms” Ch. 32: String Matching

Text is an array T[1..n] of length n of elements from a finite alphabet $\Sigma$. Pattern P is an array P[1..m] of length m ≤ n of elements from $\Sigma$. Pattern P occurs with shift s in text T if T[s+1..s+m] = P[1..m]. If P occurs with shift s in T, then s is a valid shift. String matching problem: find all valid shifts. Note: Valid shifts must be in the range 0 ≤ s ≤ n-m, so there are n-m+1 possible values for a valid shift.

Terminology: The set of all finite-length strings using characters from an alphabet $\Sigma$ is denoted $\Sigma^*$. The length of a string x is denoted |x|. The zero-length empty string is denoted $\varepsilon$. The concatenation of two strings x and y is denoted xy. A string w is a prefix of a string x, denoted† $w \sqsubset x$, if x = wy for some string $y \in \Sigma^*$. A string w is a suffix of a string x, denoted† $w \sqsupset x$, if x = yw for some string $y \in \Sigma^*$. Denote the k-character prefix P[1..k] of a string P[1..m] by $P_k$. † Notation may differ from the text.

Naïve string matcher (check every shift s from 0 to n-m character by character). Time complexity: O((n – m + 1)m)
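The naïve matcher referred to on this slide can be sketched as follows. This is a minimal illustration, not the slide's own pseudocode: the function name `naive_string_matcher` is hypothetical, and shifts are reported 0-based, which coincides with the slides' shift s because T[s+1] is `text[s]` in Python.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s such that text[s:s+m] equals pattern."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):  # candidate shifts 0 .. n-m
        j = 0
        # Compare character by character; stop at the first mismatch.
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:  # all m characters matched
            shifts.append(s)
    return shifts


print(naive_string_matcher("abbbabcca", "ab"))  # [0, 4]
```

Each of the n-m+1 shifts costs at most m comparisons, which gives the O((n – m + 1)m) bound.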

Rabin-Karp Strategy: Choose a hash function $H: \Sigma^* \to$ integers. Compute H(P[1..m]). For each successive s, from 0 to n-m: compute H(T[s+1..s+m]) and compare it with H(P[1..m]). If $H(T[s+1..s+m]) \ne H(P[1..m])$, then $T[s+1..s+m] \ne P[1..m]$, so there is no match. If H(T[s+1..s+m]) = H(P[1..m]), then there is possibly a match (remember that the hash values of two different elements may collide!). In this case, explicitly check for a match T[s+1..s+m] = P[1..m] by comparing character by character as in the naïve matcher. If H(T[s+1..s+m]) = H(P[1..m]) but $T[s+1..s+m] \ne P[1..m]$, then we are said to have a spurious hit.
Design goals for the hash function H: (1) It should be possible to compute H(T[s+2..s+m+1]) from H(T[s+1..s+m]) efficiently. I.e., starting from H(T[1..m]) it should be possible to efficiently calculate the successive hash values H(T[2..m+1]), H(T[3..m+2]), …, each with the help of the previous one. (2) It should be possible to efficiently compare H(T[s+1..s+m]) with H(P[1..m]).

Example: Suppose $\Sigma = \{0, 1, \dots, 9\}$, the set of decimal digits. Define the hash function $H: \Sigma^* \to$ integers by H(w) = decimal number represented by w. E.g., if w = 23903, then H(w) = 23,903; if w = 02858, then H(w) = 2,858 (note that a decimal character string is simply a representation of an integer, but they are not the same thing). Given a pattern P[1..m], the value H(P[1..m]), call it p, can be computed via Horner’s rule: $p = P[m] + 10(P[m-1] + 10(P[m-2] + \dots + 10(P[2] + 10P[1]) \dots))$. Let $t_s$ denote the value H(T[s+1..s+m]). Then $t_0$ can be computed using Horner’s rule (as p above). Moreover, $t_{s+1}$ can be computed from $t_s$ using the recurrence: $t_{s+1} = 10(t_s - 10^{m-1} T[s+1]) + T[s+m+1]$ (32.1). Efficiency: If the constant $10^{m-1}$ is pre-computed, and if $10^{m-1}$, p and $t_s$, for all s, can each be contained in a single computer word, then each execution of the above equation takes a constant number of arithmetic operations and comparing p and $t_s$ is a single word comparison operation as well. Therefore, total time: $\Theta(n-m+1)$. Question: Do we have to worry about spurious hits in this example?!
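Horner’s rule and recurrence (32.1) can be tried out with a short sketch. The function name `decimal_window_values` and the sample digit string are illustrative assumptions. Python integers are arbitrary precision, so the single-word assumption from the slide is not modelled here.

```python
def decimal_window_values(text: str, m: int) -> list[int]:
    """Return t_0, t_1, ..., t_{n-m} for a string of decimal digits."""
    digits = [int(c) for c in text]
    n = len(digits)
    if m == 0 or m > n:
        return []
    high = 10 ** (m - 1)  # the pre-computed constant 10^(m-1)
    t = 0
    for digit in digits[:m]:  # Horner's rule gives t_0
        t = 10 * t + digit
    values = [t]
    for s in range(n - m):
        # Recurrence (32.1): drop the leading digit T[s+1] (digits[s]),
        # shift left by one decimal place, append T[s+m+1] (digits[s+m]).
        t = 10 * (t - high * digits[s]) + digits[s + m]
        values.append(t)
    return values


print(decimal_window_values("31415926", 5))  # [31415, 14159, 41592, 15926]
```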

The assumption that $10^{m-1}$, p and $t_s$ fit into a single computer word is not always feasible. Instead, computation is done mod a prime number q. The prime q is usually chosen so that 10q fits into a single word, in which case the operations involved in the modified recurrence (32.1), $t_{s+1} = (10(t_s - 10^{m-1} T[s+1]) + T[s+m+1]) \bmod q$, can each be executed as a single-precision arithmetic operation. However, spurious hits are now an issue and the worst-case running time is $\Theta((n-m+1)m)$, like the naïve matcher, because every valid shift has to be checked character by character, and there are potentially n-m+1 valid shifts. In practice, though, we expect only a few (possibly a constant number of) valid shifts and only a few spurious hits (which also have to be verified character by character), in which case performance is much better than the worst case.

In general, if the alphabet $\Sigma$ is of size d, then it is interpreted as $\Sigma = \{0, 1, \dots, d-1\}$ and a string in $\Sigma^*$ is interpreted as a radix-d integer. Correspondingly, (32.1), which was modified earlier to $t_{s+1} = (10(t_s - 10^{m-1} T[s+1]) + T[s+m+1]) \bmod q$, is modified now to: $t_{s+1} = (d(t_s - T[s+1]h) + T[s+m+1]) \bmod q$ (32.2), where $h = d^{m-1} \bmod q$.

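Figure: Rabin-Karp example with pattern P = 31415 and modulus q = 13, where $31415 \equiv 7 \pmod{13}$; the figure's text T is not reproduced.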
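The Rabin-Karp strategy with recurrence (32.2) might be sketched as below. Assumptions, none of which come from the slides: the names `horner_mod` and `rabin_karp_matcher`, the `code` parameter that maps a character to its integer value, the defaults d = 256 and q = 1_000_003, and the sample digit string used as text. Every hash hit is verified character by character, so spurious hits never reach the result.

```python
def horner_mod(s, d, q, code=ord):
    """Value of s read as a radix-d number, reduced mod q (Horner's rule)."""
    value = 0
    for c in s:
        value = (d * value + code(c)) % q
    return value


def rabin_karp_matcher(text, pattern, d=256, q=1_000_003, code=ord):
    """Return all valid shifts (0-based) of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                   # h = d^(m-1) mod q
    p = horner_mod(pattern, d, q, code)    # hash of the pattern
    t = horner_mod(text[:m], d, q, code)   # t_0
    shifts = []
    for s in range(n - m + 1):
        # Equal hashes only suggest a match; verify to rule out a spurious hit.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Recurrence (32.2); Python's % keeps the result in 0..q-1.
            t = (d * (t - code(text[s]) * h) + code(text[s + m])) % q
    return shifts


print(horner_mod("31415", 10, 13, int))  # 7
print(rabin_karp_matcher("2359023141526739921", "31415", d=10, q=13, code=int))  # [6]
```

With a modulus as small as 13, many windows share a hash value with the pattern, which is why the explicit comparison is needed.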

Horner’s method

Finite Automaton Review: A finite automaton M is a 5-tuple $(Q, q_0, A, \Sigma, \delta)$, where Q is a finite set of states; $q_0 \in Q$ is the start state; $A \subseteq Q$ is a distinguished set of accepting states; $\Sigma$ is a finite input alphabet; $\delta$ is a function from $Q \times \Sigma$ into Q, called the transition function of M. If M reads input character a when in state q, then it changes to state $\delta(q, a)$. If its current state q is in A, then M is said to accept the string read so far. M induces a function $\varphi: \Sigma^* \to Q$, called the final-state function, such that $\varphi(w)$ is the state of M after reading the string w. $\varphi$ can be defined recursively as follows: $\varphi(\varepsilon) = q_0$ and $\varphi(wa) = \delta(\varphi(w), a)$. Therefore, M accepts w if and only if $\varphi(w) \in A$.

String Matching with Finite Automata. Strategy: Design a finite automaton that reads the text string character by character, going to an accept state only if the pattern has just been seen. E.g., if the text is “abbbabcca” and the pattern is “ab”, we want the automaton to consume a, b, b, b, a, b, c, c, a and enter an accept state only at the b’s that complete an occurrence of “ab” (the 2nd and 6th characters, shown in red on the slide). Exercise: Design an automaton as required by the strategy for the particular example above.

Implementing the Strategy: For a pattern P[1..m] define a function $\sigma: \Sigma^* \to \{0, 1, \dots, m\}$, called the suffix function for P, such that $\sigma(x)$ is the length of the longest prefix of P that is a suffix of x. I.e., $\sigma(x) = \max\{k : P_k \sqsupset x\}$, where $P_k \sqsupset x$ means that $P_k$ is a suffix of x. Example: If the pattern is P = aba, then $\sigma(\varepsilon) = 0$, $\sigma(abab) = 2$, $\sigma(abbaba) = 3$. How about $\sigma(ababb)$ and $\sigma(bbba)$? For a pattern P[1..m] of length m, $\sigma(x) = m$ if and only if $P \sqsupset x$, i.e., if and only if the pattern is at the end of x. This leads to defining the string-matching automaton M corresponding to P[1..m] as follows: State set Q = {0, 1, …, m}. Start state $q_0$ is 0. The only accepting state is m. The transition function is defined by $\delta(q, a) = \sigma(P_q a)$. Intuition: As M reads the string T = T[1]T[2]…T[n] character by character, it goes to state $\sigma(T[1]T[2] \dots T[i])$ after reading T[i] (Why? To be proved!). Therefore, if it is in the accepting state m after reading T[i], then the pattern P has just been seen.
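The suffix function can be computed directly from its definition. The sketch below is a brute-force illustration; the name `suffix_function` is hypothetical, and no attempt is made at efficiency.

```python
def suffix_function(pattern: str, x: str) -> int:
    """Length of the longest prefix of pattern that is a suffix of x."""
    # Try the longest candidate first; k = 0 (the empty prefix) always succeeds.
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):
            return k
    return 0  # not reached: the loop always returns at k = 0


# The three values stated on the slide for P = aba:
print(suffix_function("aba", ""))        # 0
print(suffix_function("aba", "abab"))    # 2
print(suffix_function("aba", "abbaba"))  # 3
```

The same function can be used to check answers to the two questions left open on the slide.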

Pattern P = ababaca

Matching time (excluding preprocessing time to compute $\delta$): $\Theta(n)$

Lemma 32.1 (Overlapping Suffix Lemma): Suppose that x, y and z are strings s.t. $x \sqsupset z$ and $y \sqsupset z$ (both are suffixes of z). If |x| ≤ |y|, then $x \sqsupset y$. If |x| ≥ |y|, then $y \sqsupset x$. If |x| = |y|, then x = y. Proof: …

Lemma 32.2 (Suffix-function Inequality Lemma): For any string x and character a, we have $\sigma(xa) \le \sigma(x) + 1$. Proof: …

Lemma 32.3 (Suffix-function Recursion Lemma): For any string x and character a, if $q = \sigma(x)$, then $\sigma(xa) = \sigma(P_q a)$. Proof: …

Theorem 32.4: If $\varphi$ is the final-state function of a string-matching automaton for a given pattern P and T[1..n] is an input text for the automaton, then $\varphi(T_i) = \sigma(T_i)$ for i = 0, 1, …, n, where $T_i$ denotes the prefix T[1..i]. Proof: By induction on i. For i = 0, the theorem is trivially true because $T_0 = \varepsilon$, so that $\varphi(T_0) = 0 = \sigma(T_0)$. Assume, inductively, that $\varphi(T_i) = \sigma(T_i)$. We shall prove $\varphi(T_{i+1}) = \sigma(T_{i+1})$. Let q denote $\varphi(T_i) = \sigma(T_i)$. Suppose T[i+1] = a. Then, $\varphi(T_{i+1}) = \varphi(T_i a) = \delta(\varphi(T_i), a) = \delta(q, a) = \sigma(P_q a)$ (by definition of the transition function of this automaton) $= \sigma(T_i a)$ (by the Suffix-function Recursion Lemma) $= \sigma(T_{i+1})$. Therefore, as claimed earlier, M does go into state $\sigma(T_i)$ after reading $T_i$, so that if it is in the accepting state m after reading $T_i$, then P has just been seen.

Suppose that, instead of $\sigma(x)$, we tried to define a function τ(x) as the length of the longest suffix of P that is a suffix of x. Would this lead to a finite automaton to recognize matches of P? No! Because we are unable to uniquely define a transition function. E.g., consider the pattern P = aab. Now if the last letter we have read is not b, we must be in state 0. Say, next we do read b. Then, which state do we go to, i.e., what is the transition $\delta(0, b)$? If the string is aab, we should go to state 3; however, if it is bab, we should go to state 2.

Running time = preprocessing time for the FA-matcher: $O(m^3 |\Sigma|)$ (this can be improved to $O(m|\Sigma|)$). Pseudocode annotations: // O(m) time, // $O(|\Sigma|)$ time, // O(m) time, // test: O(m) time; their product gives the $O(m^3 |\Sigma|)$ bound.
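A direct, unoptimized construction of the transition function from δ(q, a) = σ(P_q a), together with the matcher that runs it, might look like the sketch below. The function names and the dictionary representation of δ are illustrative assumptions. The alphabet passed in is assumed to contain every character of the pattern, and text characters outside it send the automaton to state 0.

```python
def compute_transition_function(pattern: str, alphabet: str) -> dict:
    """Build delta[(q, a)] = sigma(P_q a) for all states q and characters a."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):                # O(m) states
        for a in alphabet:                # O(|alphabet|) characters
            k = min(m, q + 1)             # sigma(P_q a) is at most q + 1
            # Shrink k until P_k is a suffix of P_q a (k = 0 always works).
            while not (pattern[:q] + a).endswith(pattern[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta


def finite_automaton_matcher(text: str, delta: dict, m: int) -> list[int]:
    """Return all valid shifts (0-based), reading each text character once."""
    q = 0
    shifts = []
    for i, c in enumerate(text):
        q = delta.get((q, c), 0)          # unknown characters reset to state 0
        if q == m:                        # accepting state: pattern just seen
            shifts.append(i - m + 1)
    return shifts


delta = compute_transition_function("ababaca", "abc")
print(delta[(5, "b")])                                   # 4
print(finite_automaton_matcher("abababacaba", delta, 7))  # [2]
```

The three nested loops and the suffix test inside the innermost one mirror the four cost annotations that multiply to the cubic preprocessing bound.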

Running time (excluding preprocessing time to compute $\delta$): $\Theta(n)$

Knuth-Morris-Pratt Algorithm. Strategy: Improve on the string-matching automaton by avoiding the time-consuming computation of the transition function $\delta$. Instead, pre-compute in $\Theta(m)$ time an auxiliary function $\pi$ that contains information about how the pattern P matches against shifts of itself. Precisely, given pattern P[1..m], the prefix function for P is the function $\pi: \{1, 2, \dots, m\} \to \{0, 1, \dots, m-1\}$ such that $\pi(q) = \max\{k : k < q \text{ and } P_k \sqsupset P_q\}$. I.e., $\pi(q)$ is the length of the longest prefix of P that is a proper suffix of $P_q$.

π(5) = 3 indicates that a shift of +1 to the right cannot be valid; however, +2 is potentially valid.

Facts about the Prefix Function: Given a pattern P[1..m], we’ll show that all the prefixes of P that are proper suffixes of a given prefix $P_q$ can be listed by iterating the prefix function $\pi$. Let $\pi^*(q) = \{\pi(q), \pi^{(2)}(q), \pi^{(3)}(q), \dots, \pi^{(t)}(q) = 0\}$, where $\pi^{(2)}(q) = \pi(\pi(q))$, $\pi^{(3)}(q) = \pi(\pi(\pi(q)))$, etc. I.e., $\pi^*(q)$ is the list of all possible values obtained by repeatedly applying the prefix function $\pi$ to q. Lemma 32.5 (Prefix-function Iteration Lemma): Let P be a pattern of length m with prefix function $\pi$. Then, for q = 1, 2, …, m, we have $\pi^*(q) = \{k : k < q \text{ and } P_k \sqsupset P_q\}$, where $P_k \sqsupset P_q$ means that $P_k$ is a suffix of $P_q$. Proof: Induction on q† … † Text does a different induction, but induction on q seems simplest.
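The iteration lemma can be checked on a small case. In this sketch the prefix function is computed by brute force from its definition, so that the comparison does not depend on any faster algorithm. The names `prefix_function_by_definition`, `pi_star` and `proper_suffix_prefix_lengths` are illustrative assumptions.

```python
def prefix_function_by_definition(pattern: str) -> list[int]:
    """pi[q] for q = 1..m straight from the definition (pi[0] is unused)."""
    m = len(pattern)
    pi = [0] * (m + 1)
    for q in range(1, m + 1):
        # Largest k < q such that P_k is a suffix of P_q; k = 0 always qualifies.
        pi[q] = max(k for k in range(q) if pattern[:q].endswith(pattern[:k]))
    return pi


def pi_star(pi: list[int], q: int) -> list[int]:
    """Values pi(q), pi(pi(q)), ... down to and including 0."""
    values = []
    while q > 0:
        q = pi[q]
        values.append(q)
    return values


def proper_suffix_prefix_lengths(pattern: str, q: int) -> set[int]:
    """The set {k : k < q and P_k is a suffix of P_q} from Lemma 32.5."""
    return {k for k in range(q) if pattern[:q].endswith(pattern[:k])}


pi = prefix_function_by_definition("ababaca")
print(pi[1:])          # [0, 0, 1, 2, 3, 0, 1]
print(pi_star(pi, 5))  # [3, 1, 0]
print(proper_suffix_prefix_lengths("ababaca", 5))  # {0, 1, 3}
```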

Lemma 32.6: Let P be a pattern of length m, and let $\pi$ be the prefix function for P. For q = 1, 2, …, m, if $\pi(q) > 0$, then $\pi(q) - 1 \in \pi^*(q-1)$. Proof: If $\pi(q) = r > 0$, then r < q and $P_r \sqsupset P_q$ ($P_r$ is a suffix of $P_q$). Therefore, r – 1 < q – 1 and $P_{r-1} \sqsupset P_{q-1}$. By the previous lemma, $\pi(q) - 1 = r - 1 \in \pi^*(q-1)$.
For q = 2, 3, …, m, define the subset $E_{q-1} \subseteq \pi^*(q-1)$ by $E_{q-1} = \{k \in \pi^*(q-1) : P[k+1] = P[q]\}$. I.e., $E_{q-1}$ consists of those values k < q – 1 for which $P_k \sqsupset P_{q-1}$ and for which $P_{k+1} \sqsupset P_q$ because P[k+1] = P[q]. Equivalently, $E_{q-1}$ consists of those values $k \in \pi^*(q-1)$ such that we can extend $P_k$ to $P_{k+1}$ and get a proper suffix of $P_q$.
Corollary 32.7: Let P be a pattern of length m, and let $\pi$ be the prefix function for P. For q = 2, 3, …, m, $\pi(q) = \begin{cases} 0 & \text{if } E_{q-1} = \emptyset, \\ 1 + \max\{k \in E_{q-1}\} & \text{if } E_{q-1} \ne \emptyset. \end{cases}$ Proof: Straightforward use of above lemma…

Correctness follows from Cor. 32.7. Running time = preprocessing time for KMP-matcher: $\Theta(m)$ (by amortized analysis!)
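The prefix-function computation that this slide analyses can be sketched as follows. It is a Python rendering of the standard linear-time scheme rather than a copy of the slide's pseudocode. The name `compute_prefix_function` and the 1-based list layout with an unused slot 0 are choices made here; with 0-based strings, P[k+1] is `pattern[k]` and P[q] is `pattern[q - 1]`.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """Return pi with pi[q] for q = 1..m (pi[0] is unused)."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0  # invariant at the top of each iteration: k == pi[q - 1]
    for q in range(2, m + 1):
        # Walk down the values in pi*(q-1) until P_k can be extended by P[q].
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1  # extend: the 1 + max{k in E_{q-1}} case of Cor. 32.7
        pi[q] = k   # stays 0 when E_{q-1} is empty
    return pi


print(compute_prefix_function("ababaca")[1:])  # [0, 0, 1, 2, 3, 0, 1]
```

The variable k increases by at most 1 per iteration of the outer loop and every pass of the inner loop decreases it, which is the basis of the amortized linear bound.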

Correctness follows from the fact that the KMP-MATCHER simulates the FINITE-AUTOMATON-MATCHER. Running time: $\Theta(n)$ (again by amortized analysis)
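A self-contained sketch of the matcher follows. The names `compute_prefix_function` and `kmp_matcher` are choices made here, and shifts are reported 0-based. The variable q plays the role of the automaton state, and the prefix function stands in for the transition table.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """Return pi with pi[q] for q = 1..m (pi[0] is unused)."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text: str, pattern: str) -> list[int]:
    """Return all valid shifts (0-based) of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0  # number of pattern characters matched so far
    for i in range(n):
        # On a mismatch, fall back to the next shorter prefix that still fits.
        while q > 0 and pattern[q] != text[i]:
            q = pi[q]
        if pattern[q] == text[i]:
            q += 1
        if q == m:  # whole pattern matched, ending at text[i]
            shifts.append(i - m + 1)
            q = pi[q]  # continue, allowing overlapping matches
    return shifts


print(kmp_matcher("abababacaba", "ababaca"))  # [2]
print(kmp_matcher("aaaa", "aa"))              # [0, 1, 2]
```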

Problems Ex. 32.1-1 Ex. 32.1-3 Ex. 32.2-1 Ex. 32.2-2 Ex. 32.2-3 Ex. 32.3-1 Ex. 32.3-2 Ex. 32.3-3 Ex. 32.3-4 Ex. 32.4-1 Ex. 32.4-2 Ex. 32.4-4 Prob. 32-1
