Data Structures and Algorithms (AT70.02), Comp. Sc. and Inf. Mgmt.
Asian Institute of Technology
Instructor: Prof. Sumanta Guha
Slide Sources: CLRS “Intro. To Algorithms” book website (copyright McGraw Hill), adapted and supplemented

CLRS “Intro. To Algorithms” Ch. 32: String Matching

Text T is an array T[1..n] of length n of elements from a finite alphabet Σ.
Pattern P is an array P[1..m] of length m ≤ n of elements from Σ.
Pattern P occurs with shift s in text T if T[s+1..s+m] = P[1..m]. If P occurs with shift s in T, then s is a valid shift.
String matching problem: find all valid shifts.
Note: Valid shifts must be in the range 0 ≤ s ≤ n−m, so there are n−m+1 possible values for a valid shift.

Terminology:
  The set of all finite-length strings using characters from an alphabet Σ is denoted Σ*.
  The length of a string x is denoted |x|.
  The zero-length empty string is denoted ε.
  The concatenation of two strings x and y is denoted xy.
  A string w is a prefix of a string x, denoted† w ⊂ x, if x = wy for some string y ∈ Σ*.
  A string w is a suffix of a string x, denoted† w ⊃ x, if x = yw for some string y ∈ Σ*.
  Denote the k-character prefix P[1..k] of a string P[1..m] by P_k.
†Notation different from text.

Time complexity of the naïve matcher: O( (n − m + 1)m )
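The slide's pseudocode did not survive in this transcript. As a minimal sketch, assuming the bound refers to the naïve matcher mentioned later in the Rabin-Karp slides, the brute-force approach tries every shift s and compares up to m characters at each one. The function name is hypothetical, and shifts are 0-based, matching the convention 0 ≤ s ≤ n−m.

```python
def naive_string_matcher(text, pattern):
    """Return every valid shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # n - m + 1 candidate shifts; each slice comparison costs O(m).
    for s in range(n - m + 1):
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


if __name__ == "__main__":
    print(naive_string_matcher("abbbabcca", "ab"))
```

If the pattern is longer than the text, the range is empty and no shifts are reported.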

Rabin-Karp Strategy
Choose a hash function H: Σ* → integers.
Compute H( P[1..m] ).
For each successive s, from 0 to n−m:
  Compute H( T[s+1..s+m] ).
  Compare H( T[s+1..s+m] ) with H( P[1..m] ).
  If H( T[s+1..s+m] ) ≠ H( P[1..m] ), then T[s+1..s+m] ≠ P[1..m], so there is no match.
  If H( T[s+1..s+m] ) = H( P[1..m] ), then there is possibly a match (remember that the hash values of two different elements may collide!). In this case, explicitly check for a match T[s+1..s+m] = P[1..m] by comparing character by character as in the naïve matcher.
  If H( T[s+1..s+m] ) = H( P[1..m] ) but T[s+1..s+m] ≠ P[1..m], then we are said to have a spurious hit.
Design goals for the hash function H:
  It should be possible to compute H( T[s+2..s+m+1] ) from H( T[s+1..s+m] ) efficiently. I.e., starting from H( T[1..m] ), it should be possible to efficiently calculate the successive hash values H( T[2..m+1] ), H( T[3..m+2] ), …, each with the help of the previous one.
  It should be possible to efficiently compare H( T[s+1..s+m] ) with H( P[1..m] ).

Example: Suppose Σ = {0, 1, …, 9}, the set of decimal digits. Define the hash function H: Σ* → integers by H(w) = decimal number represented by w. E.g., if w = 23903, then H(w) = 23,903; if w = 02858, then H(w) = 2,858 (note that a decimal character string is simply a representation of an integer, but they are not the same thing).
Given a pattern P[1..m], the value H( P[1..m] ), call it p, can be computed via Horner’s rule:
  p = P[m] + 10( P[m−1] + 10( P[m−2] + … + 10( P[2] + 10 P[1] ) … ))
Let t_s denote the value H( T[s+1..s+m] ). Then t_0 can be computed using Horner’s rule (as p above). Moreover, t_{s+1} can be computed from t_s using the recurrence:
  t_{s+1} = 10( t_s − 10^{m−1} T[s+1] ) + T[s+m+1]    (32.1)
Efficiency: If the constant 10^{m−1} is pre-computed, and if 10^{m−1}, p and t_s, for all s, can each be contained in a single computer word, then each execution of the above equation takes a constant number of arithmetic operations, and comparing p and t_s is a single word comparison operation as well. Therefore, total time: Θ(n−m+1).
Question: Do we have to worry about spurious hits in this example?!
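The Horner's-rule evaluation and recurrence (32.1) can be sketched as follows. This is an illustration only: the function names are hypothetical, indices are 0-based (so T[s+1] on the slide is text[s] here), and Python integers are unbounded, so the single-word assumption is not modelled.

```python
def horner_decimal(digits):
    """Evaluate a string of decimal digits by Horner's rule."""
    value = 0
    for ch in digits:
        value = 10 * value + int(ch)
    return value


def window_values(text, m):
    """Return [t_0, t_1, ..., t_{n-m}], the values of all length-m windows."""
    n = len(text)
    if m < 1 or m > n:
        return []
    high = 10 ** (m - 1)            # pre-computed constant 10^(m-1)
    t = horner_decimal(text[:m])    # t_0
    values = [t]
    for s in range(n - m):
        # Recurrence (32.1): drop the leading digit, shift left, append the next digit.
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


if __name__ == "__main__":
    print(horner_decimal("02858"))
    print(window_values("2390314", 5))
```

Each update in the loop uses a constant number of arithmetic operations, which is the point of the efficiency remark above.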

The assumption that 10^{m−1}, p and t_s fit into a single computer word is not always feasible.
Instead, computation is done mod a prime number q. The prime q is usually chosen so that 10q fits into a single word, in which case the operations involved in the modified recurrence (32.1)
  t_{s+1} = ( 10( t_s − 10^{m−1} T[s+1] ) + T[s+m+1] ) mod q
can each be executed as a single-precision arithmetic operation.
However, spurious hits are now an issue, and the worst-case running time is Θ( (n−m+1)m ), like the naïve matcher, because every valid shift has to be checked character by character, and there are potentially n−m+1 valid shifts.
In practice, though, we expect only a few (possibly a constant number of) valid shifts, and only a few spurious hits (which also have to be verified character by character), in which case performance is much better than the worst case.

In general, if the alphabet Σ is of size d, then it is interpreted as Σ = {0, 1, …, d−1} and a string in Σ* is interpreted as a radix-d integer. Correspondingly, (32.1), which was modified earlier to
  t_{s+1} = ( 10( t_s − 10^{m−1} T[s+1] ) + T[s+m+1] ) mod q
is modified now to:
  t_{s+1} = ( d( t_s − T[s+1] h ) + T[s+m+1] ) mod q    (32.2)
where h = d^{m−1} mod q.

Example (figure; the digits of the text T are not preserved in this transcript): Pattern P = 31415, working modulo q = 13: 31415 ≡ 7 (mod 13).
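The Rabin-Karp strategy with the modular recurrence (32.2) might be sketched as below. The defaults d = 256 and q = 1_000_003, the use of ord() to map characters to integers, and the function name are assumptions made for illustration; they are not taken from the slides. Every hash hit is verified character by character, so spurious hits never produce wrong answers.

```python
def rabin_karp_matcher(text, pattern, d=256, q=1_000_003):
    """Return all valid shifts (0-based), using a rolling hash mod q."""
    n, m = len(text), len(pattern)
    if m == 0:
        return list(range(n + 1))   # the empty pattern matches at every shift
    if m > n:
        return []
    h = pow(d, m - 1, q)            # h = d^(m-1) mod q
    p = t = 0
    for i in range(m):              # Horner's rule, reduced mod q at each step
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Equal hashes only suggest a match; verify to rule out a spurious hit.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Recurrence (32.2); Python's % always returns a non-negative result.
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


if __name__ == "__main__":
    print(31415 % 13)
    print(rabin_karp_matcher("abbbabcca", "ab"))
```

Choosing a very small q (for example q = 3) makes spurious hits frequent but still yields correct output, at the cost of more character-by-character checks.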

Horner’s method

Finite Automaton Review
A finite automaton M is a 5-tuple (Q, q0, A, Σ, δ), where
  Q is a finite set of states.
  q0 ∈ Q is the start state.
  A ⊆ Q is a distinguished set of accepting states.
  Σ is a finite input alphabet.
  δ is a function from Q × Σ into Q, called the transition function of M.
If M reads input character a when in state q, then it changes to state δ(q, a). If its current state q is in A, then M is said to accept the string read so far.
M induces a function φ: Σ* → Q, called the final-state function, such that φ(w) is the state of M after reading the string w. φ can be defined recursively as follows:
  φ(ε) = q0
  φ(wa) = δ(φ(w), a)
Therefore, M accepts w if and only if φ(w) ∈ A.
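A small sketch of the final-state function: iterating δ over the characters of w is equivalent to unrolling the recursion φ(wa) = δ(φ(w), a). The two-state automaton below, which tracks the parity of the number of a's, is a hypothetical example chosen for illustration and does not come from the slides.

```python
def final_state(delta, q0, w):
    """Compute phi(w): start in q0 and apply delta once per character of w."""
    state = q0                      # phi(empty string) = q0
    for a in w:
        state = delta[(state, a)]   # phi(wa) = delta(phi(w), a)
    return state


def accepts(delta, q0, accepting, w):
    """M accepts w if and only if phi(w) is an accepting state."""
    return final_state(delta, q0, w) in accepting


# Hypothetical automaton over {a, b}: state 0 = even number of a's, state 1 = odd.
PARITY_DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 0, (1, "b"): 1,
}

if __name__ == "__main__":
    print(accepts(PARITY_DELTA, 0, {0}, "abab"))
```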

String Matching with Finite Automata
Strategy: Design a finite automaton that reads the text string character by character, going to an accept state only if the pattern P has just been seen. In other words, the automaton accepts strings with suffix P.
E.g., if the text is “abbbabcca” and the pattern is “ab”, we want the automaton to consume a, b, b, b, a, b, c, c, a, entering an accept state only at the b’s that complete an occurrence of ab (the 2nd and 6th characters, shown in red on the slide).
Exercise: Design an automaton as required by the strategy for the particular example above.

Implementing the Strategy
For a pattern P[1..m], define a function σ: Σ* → {0, 1, …, m}, called the suffix function for P, such that σ(x) is the length of the longest prefix of P that is a suffix of x. I.e.,
  σ(x) = max{k : P_k ⊃ x}
Example: If the pattern is P = aba, then σ(ε) = 0, σ(abab) = 2, σ(abbaba) = 3. How about σ(ababb) and σ(bbba)?
For a pattern P[1..m] of length m, σ(x) = m if and only if P ⊃ x, i.e., if and only if the pattern is at the end of x.
This leads to defining the string-matching automaton M corresponding to P[1..m] as follows:
  State set Q = {0, 1, …, m}.
  Start state q0 is 0.
  The only accepting state is m.
  The transition function is defined by δ(q, a) = σ(P_q a).
Intuition: As M reads the string T = T[1]T[2]…T[n] character by character, it goes to state σ(T[1]T[2]…T[i]) after reading T[i] (Why? To be proved!). Therefore, if it is in the accepting state m after reading T[i], then the pattern P has just been seen.
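The suffix function can be computed directly from its definition by trying prefix lengths from longest to shortest. This is a brute-force sketch for checking small examples by hand, not an efficient method; the function names are hypothetical.

```python
def suffix_function(pattern, x):
    """sigma(x): length of the longest prefix of pattern that is a suffix of x."""
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):   # k = 0 always succeeds (empty prefix)
            return k


def transition(pattern, q, a):
    """delta(q, a) = sigma(P_q a), where P_q is the q-character prefix of pattern."""
    return suffix_function(pattern, pattern[:q] + a)


if __name__ == "__main__":
    for x in ["", "abab", "abbaba"]:
        print(repr(x), suffix_function("aba", x))
```

The same function can be used to check answers to the two questions posed on the slide.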

Pattern P = ababaca

Matching time (excluding preprocessing time to compute δ): Θ(n)

Lemma 32.1 (Overlapping Suffix Lemma)
Suppose that x, y and z are strings such that x ⊃ z and y ⊃ z (i.e., x and y are both suffixes of z).
  If |x| ≤ |y|, then x ⊃ y.
  If |x| ≥ |y|, then y ⊃ x.
  If |x| = |y|, then x = y.
Proof: …

Lemma 32.2 (Suffix-function Inequality Lemma)
For any string x and character a, we have σ(xa) ≤ σ(x) + 1. Proof: …

Lemma 32.3 (Suffix-function Recursion Lemma)
For any string x and character a, if q = σ(x), then σ(xa) = σ(P_q a). Proof: …

Theorem 32.4: If φ is the final-state function of a string-matching automaton for a given pattern P and T[1..n] is an input text for the automaton, then φ(T_i) = σ(T_i) for i = 0, 1, …, n, where T_i denotes the prefix T[1..i].
Proof: By induction on i. For i = 0, the theorem is trivially true because T_0 = ε, so that φ(T_0) = 0 = σ(T_0).
Assume, inductively, that φ(T_i) = σ(T_i). We shall prove φ(T_{i+1}) = σ(T_{i+1}). Let q denote φ(T_i) = σ(T_i). Suppose T[i+1] = a. Then,
  φ(T_{i+1}) = φ(T_i a)
             = δ(φ(T_i), a)
             = δ(q, a)
             = σ(P_q a)    (by definition of the trans. fn. of this automaton)
             = σ(T_i a)    (by Suffix-function Recursion Lemma)
             = σ(T_{i+1})
Therefore, as claimed earlier, M does go into state σ(T_i) after reading T_i, so that if it is in the accepting state m after reading T_i, then P has just been seen.

Suppose that, instead of σ(x), we tried to define a function τ(x) as the length of the longest suffix of P that is a suffix of x. Would this lead to a finite automaton to recognize matches of P?
No! Because we would be unable to uniquely define a transition function. E.g., consider the pattern P = aab. If the last letter we have read is not b, we must be in state 0. Say we next read b. Then, which state do we go to, i.e., what is the transition δ(0, b)? If the string read so far is aab, we should go to state 3; however, if it is bab, we should go to state 2.

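Annotations on the pseudocode for computing δ (the pseudocode itself is not preserved in this transcript): loop over states q, O(m) iterations; loop over characters a ∈ Σ, O(|Σ|) iterations; inner loop over candidate values k, O(m) iterations; each suffix test, O(m) time.
Running time = preprocessing time for the FA-matcher: O(m^3 |Σ|) (this can be improved to O(m|Σ|))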
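A direct, unoptimized way to build the transition table follows the definition δ(q, a) = σ(P_q a): for each state and character, count k down from min(m, q+1) until P_k is a suffix of P_q a. The sketch below assumes the alphabet is passed in as a string of characters; the function name and the dictionary representation of δ are illustrative choices.

```python
def compute_transition_function(pattern, alphabet):
    """Build delta as a dict {(q, a): next_state} in O(m^3 |alphabet|) time."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):                  # m + 1 states
        for a in alphabet:                  # |alphabet| characters
            k = min(m, q + 1)               # longest prefix that could possibly fit
            # Each endswith test costs O(m); k decreases at most m + 1 times.
            while not (pattern[:q] + a).endswith(pattern[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta


if __name__ == "__main__":
    table = compute_transition_function("ababaca", "abc")
    for q in range(8):
        print(q, [table[(q, a)] for a in "abc"])
```

The loop always terminates because the empty prefix (k = 0) is a suffix of every string.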

Running time (excluding preprocessing time to compute δ): Θ(n)
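Once δ is available, matching is a single left-to-right scan with one table lookup per text character, which gives the Θ(n) bound. In this sketch the helper build_delta is a compact brute-force table builder included only to keep the example self-contained; treating characters outside the alphabet as a transition to state 0 is an assumption of the sketch, and the pattern is assumed to be non-empty.

```python
def build_delta(pattern, alphabet):
    """Brute-force delta(q, a) = sigma(P_q a); helper for this example only."""
    m = len(pattern)
    return {
        (q, a): max(k for k in range(min(m, q + 1) + 1)
                    if (pattern[:q] + a).endswith(pattern[:k]))
        for q in range(m + 1) for a in alphabet
    }


def finite_automaton_matcher(text, delta, m):
    """Scan text once; report the 0-based shift whenever state m is reached."""
    q = 0
    shifts = []
    for i, a in enumerate(text):
        q = delta.get((q, a), 0)    # unknown character: no prefix of P can end here
        if q == m:
            shifts.append(i - m + 1)
    return shifts


if __name__ == "__main__":
    pattern = "ababaca"
    delta = build_delta(pattern, "abc")
    print(finite_automaton_matcher("abababacaba", delta, len(pattern)))
```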

Knuth-Morris-Pratt Algorithm
Strategy: Improve on the string-matching automaton by avoiding the time-consuming computation of the transition function δ. Instead, pre-compute in Θ(m) time an auxiliary function π that contains information about how the pattern P matches against shifts of itself.
Precisely, given pattern P[1..m], the prefix function for P is the function π: {1, 2, …, m} → {0, 1, …, m−1} such that
  π(q) = max{k : k < q and P_k ⊃ P_q}
I.e., π(q) is the length of the longest prefix of P that is a proper suffix of P_q.
Q: What is the relation between the suffix fn. σ(P_q) and π(q)?
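The definition of π can be evaluated literally, which is useful for checking small cases even though it is far slower than the Θ(m) method the strategy calls for. A brute-force sketch (the function name is hypothetical; the result is a dict keyed by q = 1..m to mirror the 1-based notation):

```python
def prefix_function_by_definition(pattern):
    """pi(q) = max{k : k < q and P_k is a suffix of P_q}, for q = 1..m."""
    m = len(pattern)
    pi = {}
    for q in range(1, m + 1):
        p_q = pattern[:q]
        # k = 0 always qualifies, so the max is over a non-empty set.
        pi[q] = max(k for k in range(q) if p_q.endswith(pattern[:k]))
    return pi


if __name__ == "__main__":
    print(prefix_function_by_definition("ababaca"))
```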

(Figure caption) π(5) = 3 indicates that a shift of +1 to the right cannot be valid; however, +2 is potentially valid.

Facts about the Prefix Function
Given a pattern P[1..m], we’ll show that all the prefixes of P that are proper suffixes of a given prefix P_q can be listed by iterating the prefix function π.
Let π*(q) = {π(q), π^(2)(q), π^(3)(q), …, π^(t)(q) = 0}, where π^(2)(q) = π(π(q)), π^(3)(q) = π(π(π(q))), etc. I.e., π*(q) is the list of all possible values obtained by repeatedly applying the prefix function π to q.
Lemma 32.5 (Prefix-function Iteration Lemma): Let P be a pattern of length m with prefix function π. Then, for q = 1, 2, …, m, we have π*(q) = {k : k < q and P_k ⊃ P_q}.
Proof: Induction on q†…
†Text does a different induction, but induction on q seems simplest.
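Lemma 32.5 can be checked numerically on a small pattern by iterating π until it reaches 0 and comparing the result with the set given by the definition. This is a sanity check under brute-force helpers, not a proof; all function names are hypothetical.

```python
def prefix_function(pattern):
    """Brute-force pi, keyed by q = 1..m (helper for this example only)."""
    return {
        q: max(k for k in range(q) if pattern[:q].endswith(pattern[:k]))
        for q in range(1, len(pattern) + 1)
    }


def pi_star(pi, q):
    """List pi(q), pi(pi(q)), ... down to and including 0."""
    values = []
    k = pi[q]
    while True:
        values.append(k)
        if k == 0:
            return values
        k = pi[k]


def proper_suffix_prefix_lengths(pattern, q):
    """{k : k < q and P_k is a suffix of P_q}, straight from the definition."""
    return {k for k in range(q) if pattern[:q].endswith(pattern[:k])}


if __name__ == "__main__":
    P = "ababaca"
    pi = prefix_function(P)
    print(pi_star(pi, 5), proper_suffix_prefix_lengths(P, 5))
```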

Lemma 32.6: Let P be a pattern of length m, and let π be the prefix function for P. For q = 1, 2, …, m, if π(q) > 0, then π(q) − 1 ∈ π*(q − 1).
Proof: If π(q) = r > 0, then r < q and P_r ⊃ P_q. Therefore, dropping the last character of both strings, r − 1 < q − 1 and P_{r−1} ⊃ P_{q−1}. By the previous lemma, π(q) − 1 = r − 1 ∈ π*(q − 1).
For q = 2, 3, …, m, define the subset E_{q−1} ⊆ π*(q − 1) by
  E_{q−1} = {k ∈ π*(q − 1) : P[k+1] = P[q]}
I.e., E_{q−1} consists of those values k < q − 1 for which P_k ⊃ P_{q−1} and for which P_{k+1} ⊃ P_q because P[k+1] = P[q]. Equivalently, E_{q−1} consists of those values k ∈ π*(q − 1) such that we can extend P_k to P_{k+1} and get a proper suffix of P_q.
Corollary 32.7: Let P be a pattern of length m, and let π be the prefix function for P. For q = 2, 3, …, m,
  π(q) = 0                      if E_{q−1} = ∅,
  π(q) = 1 + max{k ∈ E_{q−1}}   if E_{q−1} ≠ ∅.
Proof: Straightforward use of above lemma…

Correctness follows from Cor. 32.7.
Running time = preprocessing time for KMP-matcher: Θ(m) (by amortized analysis!)
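The slide's pseudocode is not preserved here. The following sketch follows the standard linear-time computation, in which k walks down the values π*(q − 1) until P[k+1] = P[q], as Corollary 32.7 suggests. The list is 1-indexed with an unused slot 0 to mirror the slide notation; that layout and the function name are choices made for this sketch.

```python
def compute_prefix_function(pattern):
    """Return pi as a list where pi[q] is defined for q = 1..m (pi[0] unused)."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0                                   # invariant: k = pi[q - 1] at loop start
    for q in range(2, m + 1):
        # pattern[k] is P[k+1] and pattern[q - 1] is P[q] in 1-based notation.
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]                       # next smaller candidate in pi*(q - 1)
        if pattern[k] == pattern[q - 1]:
            k += 1                          # extend P_k to P_{k+1}
        pi[q] = k
    return pi


if __name__ == "__main__":
    print(compute_prefix_function("ababaca")[1:])
```

The amortized bound comes from k increasing by at most 1 per iteration of the outer loop, so the total number of decreases in the inner loop is at most m.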

Correctness follows from the fact that the KMP-MATCHER simulates the equivalent FINITE-AUTOMATON-MATCHER.
Running time: Θ(n) (again by amortized analysis)
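A sketch of the matcher in the same style: q plays the role of the automaton state (the number of pattern characters currently matched), and π is consulted on a mismatch instead of a precomputed transition table. Shifts are reported 0-based; the handling of an empty pattern is an assumption of this sketch, and the function names are illustrative.

```python
def compute_prefix_function(pattern):
    """pi[q] for q = 1..m (pi[0] unused), computed in Theta(m) time."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text, pattern):
    """Return all valid shifts (0-based) of pattern in text in Theta(n + m) time."""
    n, m = len(text), len(pattern)
    if m == 0:
        return list(range(n + 1))   # assumption: empty pattern matches everywhere
    pi = compute_prefix_function(pattern)
    q = 0                           # number of characters matched so far
    shifts = []
    for i in range(n):
        while q > 0 and pattern[q] != text[i]:
            q = pi[q]               # fall back to the next shorter matching prefix
        if pattern[q] == text[i]:
            q += 1
        if q == m:                  # whole pattern matched, ending at text[i]
            shifts.append(i - m + 1)
            q = pi[q]               # keep looking for overlapping matches
    return shifts


if __name__ == "__main__":
    print(kmp_matcher("abababacaba", "ababaca"))
```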

Problems Ex. 32.1-1 Ex. 32.1-3 Ex. 32.2-1 Ex. 32.2-2 Ex. 32.2-3
