Yangjun Chen 1 String Matching String matching problem - prefix - suffix - automata - String-matching automata - prefix function - Knuth-Morris-Pratt algorithm.

Presentation transcript:

Chapter 32: String Matching
1. Finding all occurrences of a pattern in a text is a problem that arises frequently in text-editing programs.
2. String-matching problem
   text: an array T[1..n] containing n characters drawn from a finite alphabet Σ (for instance, Σ = {1, 2} or Σ = {a, b, …, z})
   pattern: an array P[1..m] (m ≤ n)

Definition
We say that pattern P occurs with shift s in text T (or, equivalently, that pattern P occurs beginning at position s + 1 in text T) if 0 ≤ s ≤ n - m and T[s + 1 .. s + m] = P[1 .. m] (i.e., if T[s + j] = P[j] for 1 ≤ j ≤ m).
Valid shift: s is a valid shift if P occurs with shift s in T. Otherwise, s is an invalid shift.
We will find all the valid shifts.
Example: text T = abcabaabcabac, pattern P = abaa, valid shift s = 3.

Naïve algorithm
Naïve-String-Matcher(T, P)
  n ← length[T]
  m ← length[P]
  for s ← 0 to n - m
      do if T[s + 1 .. s + m] = P[1 .. m]
            then print "Pattern occurs with shift" s
The time complexity of this algorithm is bounded by O(nm). In the following, we will discuss the Knuth-Morris-Pratt algorithm, which needs only O(n + m) time.
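The naïve matcher above can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the function name `naive_string_matcher` is an assumption, and Python's 0-based slices stand in for the 1-based arrays, so the returned values are exactly the shifts s.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each shift s = 0, 1, ..., n - m in turn.
    for s in range(n - m + 1):
        # text[s:s + m] corresponds to T[s+1 .. s+m] in the 1-based notation.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_string_matcher("abcabaabcabac", "abaa"))  # [3]
```

Each of the n - m + 1 shifts may compare up to m characters, which is where the O(nm) bound comes from.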

Knuth-Morris-Pratt algorithm - Finite automata
A finite automaton M is a 5-tuple (Q, q0, A, Σ, δ), where
  Q: a finite set of states
  q0: the start state
  A ⊆ Q: a distinguished set of accepting states
  Σ: a finite input alphabet
  δ: a function from Q × Σ into Q, called the transition function of M.
Example: Q = {0, 1}, q0 = 0, A = {1}, Σ = {a, b}
  δ(0, a) = 1, δ(0, b) = 0, δ(1, a) = 0; the value of δ(1, b) is given in the transition table.
[Figure: transition table of δ (inputs a, b; states 0, 1) and the state-transition diagram]

Knuth-Morris-Pratt algorithm - String-matching automata
  Σ*: the set of all finite-length strings formed using characters from the alphabet Σ
  ε: the zero-length empty string
  |x|: the length of string x
  xy: the concatenation of two strings x and y, which has length |x| + |y| and consists of the characters from x followed by the characters from y
  prefix: a string w is a prefix of a string x, denoted w ⊏ x, if x = wy for some y ∈ Σ*.
  suffix: a string w is a suffix of a string x, denoted w ⊐ x, if x = yw for some y ∈ Σ*.
Example: ab ⊏ abcca. cca ⊐ abcca.

Knuth-Morris-Pratt algorithm - String-matching automata
  P_k: P[1..k] (k ≤ m), a prefix of P[1..m]
  suffix function σ: a mapping from Σ* to {0, 1, …, m} such that σ(x) is the length of the longest prefix of P that is a suffix of x: σ(x) = max{k : P_k ⊐ x}.
Note that P_0 = ε is a suffix of every string.
Example (P = ab): σ(ε) = 0, σ(ccaca) = 1, σ(ccab) = 2.
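The suffix function can be evaluated directly from its definition. The sketch below is only a brute-force illustration; the name `suffix_function` and its argument order are assumptions made for this example.

```python
def suffix_function(pattern: str, x: str) -> int:
    """sigma(x): length of the longest prefix of pattern that is a suffix of x."""
    # Try the longest candidate prefix first and shrink it step by step.
    # k = 0 always succeeds because the empty string is a suffix of every string.
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):
            return k
    return 0  # never reached; kept so the function is total


print(suffix_function("ab", ""))       # 0
print(suffix_function("ab", "ccaca"))  # 1
print(suffix_function("ab", "ccab"))   # 2
```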

Knuth-Morris-Pratt algorithm - String-matching automata
For a pattern P[1..m], its string-matching automaton can be constructed as follows.
1. The state set Q is {0, 1, …, m}. The start state q0 is state 0, and state m is the only accepting state.
2. The transition function δ is defined by the following equation, for any state q and character a:
   δ(q, a) = σ(P_q a)
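One way to tabulate δ is to apply the definition δ(q, a) = σ(P_q a) literally for every state and character. The sketch below does this by brute force; the function name `compute_transition_function` and the representation of δ as a list of dictionaries are assumptions, and no attempt is made at efficiency.

```python
def compute_transition_function(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """Return delta as a list indexed by state q, mapping each character to the next state."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            candidate = pattern[:q] + a  # the string P_q a
            # sigma(P_q a): the largest k such that P_k is a suffix of P_q a.
            k = min(m, q + 1)
            while not candidate.endswith(pattern[:k]):
                k -= 1
            row[a] = k
        delta.append(row)
    return delta


delta = compute_transition_function("ababaca", "abc")
print(delta[5])  # {'a': 1, 'b': 4, 'c': 6}
```

The inner loop always stops, since P_0 = ε is a suffix of every string.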

Knuth-Morris-Pratt algorithm - Example
P = ababaca
[Figure: state-transition diagram of the string-matching automaton for P (states 0 to 7) and the table of its transition function δ over the inputs a, b, c]

Knuth-Morris-Pratt algorithm - String matching by using the finite automaton
Finite-Automaton-Matcher(T, δ, m)
  n ← length[T]
  q ← 0
  for i ← 1 to n
      do q ← δ(q, T[i])
         if q = m
            then print "pattern occurs with shift" i - m
If the finite automaton is available, the algorithm needs only O(n + m) time.
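The matcher above could look like this in Python. It is a sketch under stated assumptions: `build_delta` is a hypothetical brute-force helper included only to make the example self-contained, δ is stored as a list of dictionaries, and any text character outside the supplied alphabet is assumed to send the automaton back to state 0.

```python
def build_delta(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """Brute-force delta(q, a) = sigma(P_q a); hypothetical helper for this example."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            candidate = pattern[:q] + a
            k = min(m, q + 1)
            while not candidate.endswith(pattern[:k]):
                k -= 1
            row[a] = k
        delta.append(row)
    return delta


def finite_automaton_matcher(text: str, delta: list[dict[str, int]], m: int) -> list[int]:
    """Scan text once and return every shift at which the pattern occurs."""
    q = 0
    shifts = []
    for i, ch in enumerate(text, start=1):  # i is 1-based, as in the pseudocode
        # Assumption: characters missing from the table lead back to state 0.
        q = delta[q].get(ch, 0)
        if q == m:
            shifts.append(i - m)
    return shifts


pattern = "ababaca"
print(finite_automaton_matcher("abababacaba", build_delta(pattern, "abc"), len(pattern)))  # [2]
```

The scan itself touches each text character exactly once; all the remaining cost sits in building δ.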

Knuth-Morris-Pratt algorithm - Example
P = ababaca, T = abababacaba
[Figure: state-transition diagram of the string-matching automaton for P, states 0 to 7]
step 1: q = 0, T[1] = a. Go into the state q = 1.
step 2: q = 1, T[2] = b. Go into the state q = 2.
step 3: q = 2, T[3] = a. Go into the state q = 3.
step 4: q = 3, T[4] = b. Go into the state q = 4.
step 5: q = 4, T[5] = a. Go into the state q = 5.
step 6: q = 5, T[6] = b. Go into the state q = 4.
step 7: q = 4, T[7] = a. Go into the state q = 5.
step 8: q = 5, T[8] = c. Go into the state q = 6.
step 9: q = 6, T[9] = a. Go into the state q = 7.
Since q = m = 7 after step 9, the pattern occurs with shift 9 - 7 = 2.

Knuth-Morris-Pratt algorithm - Dynamic computation of the transition function δ
We need not compute δ in full in advance. Instead, an auxiliary function π, called the prefix function, is used to calculate δ-values "on the fly".
  prefix function π: a mapping from {1, …, m} to {0, 1, …, m} such that π(q) = max{k : k < q, P_k ⊐ P_q}.
  Comparison with the suffix function: σ(x) = max{k : P_k ⊐ x}.

Knuth-Morris-Pratt algorithm - Example
P = ababababca
  i:     1  2  3  4  5  6  7  8  9  10
  P[i]:  a  b  a  b  a  b  a  b  c  a
  π[i]:  0  0  1  2  3  4  5  6  0  1
[Figure: the pattern P slid against itself, showing the prefixes P_8, P_6, P_4, P_2, P_0]
π[8] = 6, π[6] = 4, π[4] = 2, π[2] = 0

Knuth-Morris-Pratt algorithm - Function π^(u)(j)
  i) π^(1)(j) = π(j), and
  ii) π^(u)(j) = π(π^(u-1)(j)), for u > 1.
That is, π^(u)(j) is just π applied u times to j.
Example: π^(2)(6) = π(π(6)) = π(4) = 2.
- How to use π^(u)(j)?
Suppose that the automaton is in state j, having read T[1..k], and that T[k+1] ≠ P[j+1]. Then apply π repeatedly to find the smallest value of u for which either
  1. π^(u)(j) = l and T[k+1] = P[l+1], or
  2. π^(u)(j) = 0 and T[k+1] ≠ P[1].
[Figure: pattern P aligned against text T, marking positions j, π(j), π^(2)(j) in P and position k in T]

Knuth-Morris-Pratt algorithm - How to use π^(u)(j)?
  1. π^(u)(j) = l and T[k+1] = P[l+1], or
  2. π^(u)(j) = 0 and T[k+1] ≠ P[1].
That is, the automaton backs up through π^(1)(j), π^(2)(j), … until either Case 1 or Case 2 holds for π^(u)(j) but not for π^(u-1)(j). If Case 1 holds, the automaton enters state l + 1. If Case 2 holds, it enters state 0. In either case, the input pointer is advanced to position T[k+2].
In Case 1, since P[1..j] was the longest prefix of P that is a suffix of T[1..k], P[1..π^(u)(j) + 1] is the longest prefix of P that is a suffix of T[1..k+1]. In Case 2, no nonempty prefix of P is a suffix of T[1..k+1].
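The backing-up rule can be sketched as a single state-transition step. The function name `next_state` and the 1-based `pi` list (index 0 unused) are assumptions made for this illustration; the π values are those of the pattern ababababca from the example slides. Unlike the description above, the sketch also handles the case in which no backing up is needed.

```python
def next_state(pattern: str, pi: list[int], j: int, c: str) -> int:
    """State entered from state j on input character c, backing up through pi."""
    m = len(pattern)
    # pattern[j] is P[j+1] in the 1-based notation; state m has no next character.
    while j > 0 and (j == m or pattern[j] != c):
        j = pi[j]  # back up: pi(j), pi(pi(j)), ...
    if pattern[j] == c:
        return j + 1  # Case 1: enter state l + 1
    return 0          # Case 2: no nonempty prefix matches


P = "ababababca"
PI = [0, 0, 0, 1, 2, 3, 4, 5, 6, 0, 1]  # PI[q] for q = 1..10; PI[0] is unused

print(next_state(P, PI, 8, "a"))  # 7  (8 -> 6, then P[7] = a matches)
print(next_state(P, PI, 8, "b"))  # 0  (8 -> 6 -> 4 -> 2 -> 0, and P[1] != b)
```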

Knuth-Morris-Pratt algorithm - Algorithm
KMP-Matcher(T, P)
  n ← length[T]
  m ← length[P]
  π ← Compute-Prefix-Function(P)
  q ← 0
  for i ← 1 to n
      do while q > 0 and P[q + 1] ≠ T[i]
            do q ← π[q]
         if P[q + 1] = T[i]
            then q ← q + 1
         if q = m
            then print "pattern occurs with shift" i - m
                 q ← π[q]
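A possible Python rendering of KMP-Matcher is sketched below. The names are assumptions, the arrays are 0-based (so `pi[q - 1]` holds the slides' π[q]), a non-empty pattern is assumed, and a compact prefix-function helper is included only to keep the example self-contained.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[q - 1] is the slides' pi[q]: longest proper prefix of P_q that is also its suffix."""
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text: str, pattern: str) -> list[int]:
    """Return every shift at which pattern occurs in text."""
    if not pattern:
        raise ValueError("pattern must be non-empty")  # assumption of this sketch
    m = len(pattern)
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = pi[q - 1]  # fall back to the next shorter border
        if pattern[q] == ch:
            q += 1
        if q == m:
            shifts.append(i + 1 - m)
            q = pi[q - 1]  # keep going to find overlapping occurrences
    return shifts


print(kmp_matcher("ababaababababca", "ababababca"))  # [5]
```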

Knuth-Morris-Pratt algorithm - Algorithm
Compute-Prefix-Function(P)
  1. m ← length[P]
  2. π[1] ← 0
  3. k ← 0
  4. for q ← 2 to m
  5.     do while k > 0 and P[k + 1] ≠ P[q]
  6.           do k ← π[k]     /* if k = 0 or P[k + 1] = P[q], going out of the while-loop */
  7.        if P[k + 1] = P[q]
  8.           then k ← k + 1
  9.        π[q] ← k
  10. return π
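The same procedure can be written in Python roughly as follows. This is a sketch with 0-based indexing, so entry `pi[q - 1]` of the returned list corresponds to π[q] in the pseudocode; the function name is an assumption.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """Return pi as a 0-based list: pi[q - 1] equals the pseudocode's pi[q]."""
    m = len(pattern)
    pi = [0] * m  # pi[1] = 0 in the 1-based notation
    k = 0         # length of the border carried over from the previous prefix
    for q in range(1, m):  # 0-based index of the character P[q + 1]
        # Shrink the border until it can be extended by pattern[q] or becomes empty.
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababababca"))  # [0, 0, 1, 2, 3, 4, 5, 6, 0, 1]
```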

Knuth-Morris-Pratt algorithm - Example
P = ababababca, T = ababaababababca
Compute prefix function:
  π[1] = 0, k = 0
  q = 2: P[k + 1] = P[1] = a, P[q] = P[2] = b, P[k + 1] ≠ P[q]; π[q] ← k (π[2] ← 0)
  q = 3: P[k + 1] = P[1] = a, P[q] = P[3] = a, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[3] ← 1)
  k = 1
  q = 4: P[k + 1] = P[2] = b, P[q] = P[4] = b, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[4] ← 2)

Knuth-Morris-Pratt algorithm - Example
  k = 2
  q = 5: P[k + 1] = P[3] = a, P[q] = P[5] = a, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[5] ← 3)
  k = 3
  q = 6: P[k + 1] = P[4] = b, P[q] = P[6] = b, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[6] ← 4)
  k = 4
  q = 7: P[k + 1] = P[5] = a, P[q] = P[7] = a, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[7] ← 5)
  k = 5
  q = 8: P[k + 1] = P[6] = b, P[q] = P[8] = b, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[8] ← 6)

Knuth-Morris-Pratt algorithm - Example
  k = 6
  q = 9: P[k + 1] = P[7] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; k ← π[k] (k ← π[6] = 4)
         P[k + 1] = P[5] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; k ← π[k] (k ← π[4] = 2)
         P[k + 1] = P[3] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; k ← π[k] (k ← π[2] = 0)
  k = 0
  q = 9: P[k + 1] = P[1] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; π[q] ← k (π[9] ← 0)
  q = 10: P[k + 1] = P[1] = a, P[q] = P[10] = a, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[10] ← 1)

Knuth-Morris-Pratt algorithm
Theorem. Algorithm Compute-Prefix-Function(P) computes π in O(|P|) steps.
Proof. The cost of the while statement is proportional to the number of times k is decremented by the statement k ← π[k] following do in line 6. Each such assignment strictly decreases k, because π[k] < k, and k never becomes negative. The only way k is increased is by assigning k ← k + 1 in line 8. Since k = 0 initially, and line 8 is executed at most |P| - 1 times, we conclude that the while statement on lines 5 and 6 cannot be executed more than |P| times. Thus, the total cost of executing lines 5 and 6 is O(|P|). The remainder of the algorithm is clearly O(|P|), and thus the whole algorithm takes O(|P|) time.
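The counting argument in the proof can be checked empirically by instrumenting the prefix-function computation. The sketch below is illustrative only: the function name and the `decrements` counter are assumptions added for this example, and the list is 0-based.

```python
def prefix_function_with_count(pattern: str) -> tuple[list[int], int]:
    """Return (pi, decrements), where decrements counts executions of k <- pi[k]."""
    m = len(pattern)
    pi = [0] * m
    k = 0
    decrements = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]    # line 6 of the pseudocode
            decrements += 1
        if pattern[k] == pattern[q]:
            k += 1           # line 8: the only place where k grows
        pi[q] = k
    return pi, decrements


for p in ["ababababca", "aaaaab", "abcabcabd"]:
    _, count = prefix_function_with_count(p)
    # k is raised at most |P| - 1 times, so it cannot be lowered more often than that.
    assert count <= len(p) - 1

print(prefix_function_with_count("ababababca")[1])  # 3
```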