Data Structures and Algorithms (AT70.02) Comp. Sc. and Inf. Mgmt


Data Structures and Algorithms (AT70.02) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology. Instructor: Prof. Sumanta Guha. Slide Sources: CLRS “Intro. To Algorithms” book website (copyright McGraw Hill), adapted and supplemented.

CLRS “Intro. To Algorithms” Ch. 32: String Matching

Text is an array T[1..n] of length n of elements from a finite alphabet Σ. Pattern P is an array P[1..m] of length m ≤ n of elements from Σ. Pattern P occurs with shift s in text T if T[s+1..s+m] = P[1..m]. If P occurs with shift s in T, then s is a valid shift. String matching problem: find all valid shifts. Note: valid shifts must lie in the range 0 ≤ s ≤ n−m, so there are n−m+1 candidate shift values to check.

Terminology: The set of all finite-length strings over an alphabet Σ is denoted Σ*. The length of a string x is denoted |x|. The zero-length empty string is denoted ε. The concatenation of two strings x and y is denoted xy. A string w is a prefix of a string x, denoted† w ⊏ x, if x = wy for some string y ∈ Σ*. A string w is a suffix of a string x, denoted† w ⊐ x, if x = yw for some string y ∈ Σ*. Denote the k-character prefix P[1..k] of a string P[1..m] by Pk. †Notation different from text.

Naïve string matcher: check every candidate shift character by character. Time complexity: O( (n − m + 1)m )
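The naïve matcher the slide's bound refers to can be sketched as follows (a minimal sketch; `naive_match` is an illustrative name, and shifts are 0-indexed as in most code, rather than the slides' 1-indexed arrays):

```python
def naive_match(T, P):
    """Return all valid shifts s (0-indexed) with T[s:s+m] == P.

    For each of the n-m+1 candidate shifts, the slice comparison may
    inspect up to m characters, giving the O((n-m+1)m) bound."""
    n, m = len(T), len(P)
    return [s for s in range(n - m + 1) if T[s:s + m] == P]
```

For example, `naive_match("abbbabcca", "ab")` reports the shifts 0 and 4.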

Rabin-Karp Strategy: Choose a hash function H: Σ* → integers. Compute H( P[1..m] ). For each successive s, from 0 to n−m: compute H( T[s+1..s+m] ) and compare it with H( P[1..m] ). If H( T[s+1..s+m] ) ≠ H( P[1..m] ), then T[s+1..s+m] ≠ P[1..m], so there is no match. If H( T[s+1..s+m] ) = H( P[1..m] ), then there is possibly a match (remember that the hash values of two different strings may collide!). In this case, explicitly check for a match T[s+1..s+m] = P[1..m] by comparing character by character as in the naïve matcher. If H( T[s+1..s+m] ) = H( P[1..m] ) but T[s+1..s+m] ≠ P[1..m], then we are said to have a spurious hit. Design goals for the hash function H: It should be possible to compute H( T[s+2..s+m+1] ) from H( T[s+1..s+m] ) efficiently, i.e., starting from H( T[1..m] ) it should be possible to efficiently calculate the successive hash values H( T[2..m+1] ), H( T[3..m+2] ), …, each with the help of the previous one. It should also be possible to efficiently compare H( T[s+1..s+m] ) with H( P[1..m] ).

Example: Suppose Σ = {0, 1, …, 9}, the set of decimal digits. Define the hash function H: Σ* → integers by H(w) = the decimal number represented by w. E.g., if w = 23903, then H(w) = 23,903; if w = 02858, then H(w) = 2,858 (note that a decimal character string is simply a representation of an integer, but they are not the same thing). Given a pattern P[1..m], the value H( P[1..m] ), call it p, can be computed via Horner’s rule: p = P[m] + 10( P[m−1] + 10( P[m−2] + … + 10( P[2] + 10 P[1] ) … ) ). Let ts denote the value H( T[s+1..s+m] ). Then t0 can be computed using Horner’s rule (as p above). Moreover, ts+1 can be computed from ts using the recurrence: ts+1 = 10( ts − 10^(m−1) T[s+1] ) + T[s+m+1] (32.1). Efficiency: if the constant 10^(m−1) is pre-computed, and if 10^(m−1), p and ts, for all s, can each be contained in a single computer word, then each execution of the above equation takes a constant number of arithmetic operations, and comparing p and ts is a single word-comparison operation as well. Therefore, total time: Θ(n−m+1). Question: do we have to worry about spurious hits in this example?!
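Horner's rule and the sliding-window recurrence (32.1) can be sketched as follows (function names are illustrative; indices are 0-based, so the recurrence drops T[s] and appends T[s+m]):

```python
def horner_value(w, d=10):
    """H(w): evaluate the digit string w as a radix-d integer via Horner's rule."""
    p = 0
    for c in w:
        p = d * p + int(c)
    return p

def next_window(t_s, T, s, m, d=10):
    """Recurrence (32.1): compute t_{s+1} from t_s in O(1) arithmetic ops
    by dropping the leading digit and appending the next one."""
    return d * (t_s - d ** (m - 1) * int(T[s])) + int(T[s + m])
```

For instance, with T = "23590" and m = 2, the window hash 23 rolls to 35 in constant time instead of being recomputed from scratch.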

The assumption that 10^(m−1), p and ts fit into a single computer word is not always feasible. Instead, computation is done mod a prime number q. The prime q is usually chosen so that 10q fits into a single word, in which case the operations involved in the modified recurrence (32.1), ts+1 = ( 10( ts − 10^(m−1) T[s+1] ) + T[s+m+1] ) mod q, can each be executed as a single-precision arithmetic operation. However, spurious hits are now an issue, and the worst-case running time is Θ( (n−m+1)m ), like the naïve matcher, because every valid shift has to be checked character by character, and there are potentially n−m+1 valid shifts. In practice, though, we expect only a few valid shifts (possibly a constant number) and only a few spurious hits (which also have to be verified character by character), in which case performance is much better than the worst case.

In general, if the alphabet Σ is of size d, then it is interpreted as Σ = {0, 1, …, d−1} and a string in Σ* is interpreted as a radix-d integer. Correspondingly, (32.1), which was modified earlier to ts+1 = ( 10( ts − 10^(m−1) T[s+1] ) + T[s+m+1] ) mod q, is modified now to: ts+1 = ( d( ts − T[s+1]h ) + T[s+m+1] ) mod q (32.2), where h = d^(m−1) mod q.
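Putting the pieces together, the Rabin-Karp matcher using recurrence (32.2) can be sketched as follows (a sketch with d = 10 and q = 13 as defaults, matching the slide example; the function name is illustrative). Spurious hits are eliminated by an explicit character-by-character check:

```python
def rabin_karp(T, P, d=10, q=13):
    """Find all (0-indexed) shifts of digit-string P in T, hashing mod q."""
    n, m = len(T), len(P)
    if m > n:
        return []
    h = pow(d, m - 1, q)                 # h = d^(m-1) mod q
    p = t = 0
    for i in range(m):                   # Horner's rule, mod q
        p = (d * p + int(P[i])) % q
        t = (d * t + int(T[i])) % q
    shifts = []
    for s in range(n - m + 1):
        if p == t and T[s:s + m] == P:   # verify: rules out spurious hits
            shifts.append(s)
        if s < n - m:                    # recurrence (32.2)
            t = (d * (t - int(T[s]) * h) + int(T[s + m])) % q
    return shifts
```

Note that Python's `%` always returns a non-negative result for a positive modulus, so the subtraction inside the recurrence needs no extra correction step.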

Figure: Rabin-Karp example with pattern P = 31415 and modulus q = 13, so that p = 31415 mod 13 = 7. (The text T and the window hash values are shown in the slide figure.)

Horner’s method

Finite Automaton Review: A finite automaton M is a 5-tuple (Q, q0, A, Σ, δ), where Q is a finite set of states, q0 ∈ Q is the start state, A ⊆ Q is a distinguished set of accepting states, Σ is a finite input alphabet, and δ is a function from Q × Σ into Q called the transition function of M. If M reads input character a when in state q, then it changes to state δ(q, a). If its current state q is in A, then M is said to accept the string read so far. M induces a function φ: Σ* → Q, called the final-state function, such that φ(w) is the state of M after reading the string w. φ can be defined recursively as follows: φ(ε) = q0; φ(wa) = δ( φ(w), a ). Therefore, M accepts w if and only if φ(w) ∈ A.
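A minimal sketch of the final-state function φ as code, exercised on a hypothetical even-parity automaton (the state set, alphabet and transitions below are invented purely for illustration, not taken from the slides):

```python
def run_dfa(delta, q0, accept, w):
    """phi(w): the state reached after reading w starting from q0.
    The string w is accepted iff the returned state is in `accept`."""
    q = q0
    for a in w:
        q = delta[(q, a)]
    return q

# Hypothetical example: 2-state automaton over {0,1} accepting strings
# with an even number of 1s (state 0 = even so far, state 1 = odd so far).
delta = {(0, '0'): 0, (0, '1'): 1, (1, '0'): 1, (1, '1'): 0}
```

E.g., reading "0110" returns the accepting state 0, while "1101" (three 1s) ends in the rejecting state 1.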

String Matching with Finite Automata. Strategy: design a finite automaton that reads the text string character by character, going to an accept state only when the pattern P has just been seen. In other words, the automaton accepts strings with suffix P. E.g., if the text is “abbbabcca” and the pattern “ab”, we want the automaton to consume a, b, b, b, a, b, c, c, a, entering an accept state only at the two b’s that complete an occurrence of “ab”. Exercise: design an automaton as required by the strategy for the particular example above.

Implementing the Strategy: For a pattern P[1..m] define a function σ: Σ* → {0, 1, …, m}, called the suffix function for P, such that σ(x) is the length of the longest prefix of P that is a suffix of x. I.e., σ(x) = max{ k : Pk ⊐ x }. Example: if the pattern is P = aba, then σ(ε) = 0, σ(abab) = 2, σ(abbaba) = 3. How about σ(ababb) and σ(bbba)? For a pattern P[1..m] of length m, σ(x) = m if and only if P ⊐ x, i.e., if and only if the pattern is at the end of x. This leads to defining the string-matching automaton M corresponding to P[1..m] as follows: state set Q = {0, 1, …, m}; start state q0 = 0; the only accepting state is m; the transition function is defined by δ(q, a) = σ(Pq a). Intuition: as M reads the string T = T[1]T[2]…T[n] character by character, it goes to state σ(T[1]T[2]…T[i]) after reading T[i] (Why? To be proved!). Therefore, if it is in the accepting state m after reading T[i], then the pattern P has just been seen.
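The suffix function σ, the transition function δ(q, a) = σ(Pq a), and the resulting matcher can be sketched directly from these definitions (names are illustrative, indices 0-based; this is the straightforward construction, not the optimized one):

```python
def sigma(P, x):
    """Suffix function: length of the longest prefix of P that is a suffix of x."""
    for k in range(min(len(P), len(x)), -1, -1):
        if x.endswith(P[:k]):        # P[:0] == "" always matches, so k=0 is a floor
            return k

def build_delta(P, alphabet):
    """delta(q, a) = sigma(P_q a): the straightforward construction,
    costing O(m) states x O(|alphabet|) chars x O(m^2) per suffix scan."""
    m = len(P)
    return {(q, a): sigma(P, P[:q] + a)
            for q in range(m + 1) for a in alphabet}

def fa_matcher(T, P, alphabet):
    """Scan T once; report shift i - m + 1 whenever the accept state m is entered."""
    delta = build_delta(P, alphabet)
    m, q, shifts = len(P), 0, []
    for i, a in enumerate(T):
        q = delta[(q, a)]
        if q == m:
            shifts.append(i - m + 1)
    return shifts
```

On the earlier example, `fa_matcher("abbbabcca", "ab", "abc")` enters the accept state at exactly the two b’s completing “ab”, reporting shifts 0 and 4.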

Pattern P = ababaca

Matching time (excluding preprocessing time to compute δ): Θ(n)

Lemma 32.1 (Overlapping-Suffix Lemma): Suppose that x, y and z are strings such that x ⊐ z and y ⊐ z. If |x| ≤ |y|, then x ⊐ y. If |x| ≥ |y|, then y ⊐ x. If |x| = |y|, then x = y. Proof: …

Lemma 32.2 (Suffix-Function Inequality Lemma): For any string x and character a, we have σ(xa) ≤ σ(x) + 1. Proof: …

Lemma 32.3 (Suffix-Function Recursion Lemma): For any string x and character a, if q = σ(x), then σ(xa) = σ(Pq a). Proof: …

Theorem 32.4: If φ is the final-state function of a string-matching automaton for a given pattern P, and T[1..n] is an input text for the automaton, then φ(Ti) = σ(Ti) for i = 0, 1, …, n. Proof: by induction on i. For i = 0, the theorem is trivially true because T0 = ε, so that φ(T0) = 0 = σ(T0). Assume, inductively, that φ(Ti) = σ(Ti); we shall prove φ(Ti+1) = σ(Ti+1). Let q denote φ(Ti) = σ(Ti), and suppose T[i+1] = a. Then: φ(Ti+1) = φ(Ti a) = δ( φ(Ti), a ) = δ(q, a) = σ(Pq a) (by definition of the transition function of this automaton) = σ(Ti a) (by the Suffix-Function Recursion Lemma, since q = σ(Ti)) = σ(Ti+1). Therefore, as claimed earlier, M does go into state σ(Ti) after reading Ti, so that if it is in the accepting state m after reading Ti, then P has just been seen.

Instead of σ(x), suppose we tried to define a function τ(x) as the length of the longest suffix of P that is a suffix of x. Would this lead to a finite automaton to recognize matches of P? No! Because we would be unable to uniquely define a transition function. E.g., consider the pattern P = aab. If the last letter we have read is not b, we must be in state 0. Say, next we do read b. Then which state do we go to, i.e., what is the transition δ(0, b)? If the string is aab, we should go to state 3; however, if it is bab, we should go to state 2.

Running time = preprocessing time for the FA-matcher (COMPUTE-TRANSITION-FUNCTION): O(m³|Σ|) — the loop over states contributes O(m), the loop over characters O(|Σ|), the loop over candidate k another O(m), and each suffix test O(m). This can be improved to O(m|Σ|).

Running time (excluding preprocessing time to compute δ): Θ(n)

Knuth-Morris-Pratt Algorithm. Strategy: improve on the string-matching automaton by avoiding the time-consuming computation of the transition function δ. Instead, pre-compute in Θ(m) time an auxiliary function π that contains information about how the pattern P matches against shifts of itself. Precisely, given pattern P[1..m], the prefix function for P is the function π: {1, 2, …, m} → {0, 1, …, m−1} such that π(q) = max{ k : k < q and Pk ⊐ Pq }. I.e., π(q) is the length of the longest prefix of P that is a proper suffix of Pq. Q: What is the relation between the suffix fn. σ(Pq) and π(q)?
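The Θ(m) preprocessing can be sketched as follows (a sketch of CLRS's COMPUTE-PREFIX-FUNCTION; the returned list is 0-indexed, so `pi[q-1]` holds the slides' π(q)):

```python
def compute_prefix(P):
    """Prefix function of P: pi[q-1] = length of the longest prefix of P
    that is a proper suffix of P[:q]."""
    m = len(P)
    pi = [0] * m
    k = 0                                # length matched so far
    for q in range(1, m):
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]                # fall back via the prefix function
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    return pi
```

For P = ababaca this yields [0, 0, 1, 2, 3, 0, 1]; in particular pi[4] = 3, i.e., π(5) = 3, matching the figure caption below.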

π(5) = 3 indicates that a shift of +1 to the right cannot be valid; however, +2 is potentially valid.

Facts about the Prefix Function: Given a pattern P[1..m], we’ll show that all the prefixes of P that are proper suffixes of a given prefix Pq can be listed by iterating the prefix function π. Let π*(q) = { π(q), π^(2)(q), π^(3)(q), …, π^(t)(q) = 0 }, where π^(2)(q) = π(π(q)), π^(3)(q) = π(π(π(q))), etc. I.e., π*(q) is the list of all values obtained by repeatedly applying the prefix function π to q. Lemma 32.5 (Prefix-Function Iteration Lemma): Let P be a pattern of length m with prefix function π. Then, for q = 1, 2, …, m, we have π*(q) = { k : k < q and Pk ⊐ Pq }. Proof: induction on q†… †Text does a different induction, but induction on q seems simplest.
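Lemma 32.5 can be checked on a small example by computing π directly from its definition and iterating it (helper names are illustrative; q is 1-indexed as in the slides):

```python
def naive_pi(P, q):
    """pi(q) straight from the definition: the largest k < q with
    P[:k] a (proper) suffix of P[:q]. k = 0 always qualifies."""
    return max(k for k in range(q) if P[:q].endswith(P[:k]))

def pi_star(P, q):
    """pi*(q): iterate pi until 0 is reached; by Lemma 32.5 this lists
    exactly the k < q with P[:k] a proper suffix of P[:q]."""
    chain = []
    while q > 0:
        q = naive_pi(P, q)
        chain.append(q)
    return chain
```

For P = ababaca and q = 5 (prefix ababa), the chain is [3, 1, 0], and indeed the proper suffixes of ababa that are prefixes of P are exactly aba, a and ε.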

Lemma 32.6: Let P be a pattern of length m, and let π be the prefix function for P. For q = 1, 2, …, m, if π(q) > 0, then π(q) − 1 ∈ π*(q − 1). Proof: if π(q) = r > 0, then r < q and Pr ⊐ Pq. Therefore, r − 1 < q − 1 and Pr−1 ⊐ Pq−1. By the previous lemma, π(q) − 1 = r − 1 ∈ π*(q − 1). For q = 2, 3, …, m, define the subset Eq−1 ⊆ π*(q − 1) by Eq−1 = { k ∈ π*(q − 1) : P[k+1] = P[q] }. I.e., Eq−1 consists of those values k < q − 1 for which Pk ⊐ Pq−1 and for which Pk+1 ⊐ Pq because P[k+1] = P[q]. Equivalently, Eq−1 consists of those values k ∈ π*(q − 1) such that we can extend Pk to Pk+1 and get a proper suffix of Pq. Corollary 32.7: Let P be a pattern of length m, and let π be the prefix function for P. For q = 2, 3, …, m: π(q) = 0 if Eq−1 = ∅, and π(q) = 1 + max{ k ∈ Eq−1 } if Eq−1 ≠ ∅. Proof: straightforward use of the above lemma…

Correctness follows from Cor. 32.7. Running time = preprocessing time for the KMP-matcher (COMPUTE-PREFIX-FUNCTION): Θ(m) (by amortized analysis!)

Correctness follows from the fact that KMP-MATCHER simulates the FINITE-AUTOMATON-MATCHER. Running time: Θ(n) (again by amortized analysis)
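A self-contained sketch of KMP-MATCHER, with the prefix-function computation inlined so the whole algorithm is visible in one place (Θ(m) preprocessing followed by a Θ(n) scan; shifts are 0-indexed):

```python
def kmp_matcher(T, P):
    """Find all (0-indexed) shifts of P in T using the prefix function pi."""
    n, m = len(T), len(P)
    # COMPUTE-PREFIX-FUNCTION: Theta(m), as on the preceding slides.
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    # The Theta(n) scan: q counts characters of P matched so far.
    shifts = []
    q = 0
    for i in range(n):
        while q > 0 and P[q] != T[i]:
            q = pi[q - 1]        # slide the pattern against itself
        if P[q] == T[i]:
            q += 1
        if q == m:               # full match ending at position i
            shifts.append(i - m + 1)
            q = pi[q - 1]        # continue looking for overlapping matches
    return shifts
```

On the earlier running example, `kmp_matcher("abbbabcca", "ab")` returns the same shifts as the automaton-based matcher, but without ever building δ.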

Problems Ex. 32.1-1 Ex. 32.1-3 Ex. 32.2-1 Ex. 32.2-2 Ex. 32.2-3