String Matching 15-211 Fundamental Data Structures and Algorithms April 22, 2003.



Announcements
Quiz #4 available after class today!
- Available until Wednesday midnight
Homework 6 is out!
- Due on Thursday May 1, 11:59pm
- Tournament will run on May 7, details to come...
Final exam on May 8
- 8:30am-11:30am, UC McConomy
- Review on May 4, details TBA

String Matching

Why String Matching?
Finding patterns in documents formed using a large alphabet
- Word processing: search/modify/replace
- Web searching: search/display
Applications in molecular biology
- Biological molecules can often be approximated as sequences of amino acids
- Very large volumes of data, doubling every 18 months
- Need efficient string matching algorithms
Applications in systems and software design
- The main data form used to exchange information is TEXT, so text pattern matching is very important
Big Question
- Given a text T of length n and a pattern P of length m (m <= n), how do we find any or all occurrences of P in T?

String Matching
Text string T[0..N-1]: T = "abacaabaccabacabaabb"
Pattern string P[0..M-1]: P = "abacab"
Think of a naive algorithm to find the pattern P in T.
How much work is needed to determine that? Can we do better?
Better string matching algorithms:
- Use finite automata
- Use combinatorial properties

String Matching
Let T and P be strings built over a finite alphabet Σ with |Σ| = σ.
Text string T[0..N-1]: T = "abacaabaccabacabaabb"
Pattern string P[0..M-1]: P = "abacab"
Where is the first instance of P in T? T[10..15] = P[0..5]

String Matching
abacaabaccabacabaabb
abacab
The brute force algorithm: 22+6 = 28 comparisons.
The brute force algorithm requires O(nm) operations in the worst case.

Brute Force, v.1

static int match(char[] T, char[] P) {
    int n = T.length;
    int m = P.length;
    // Try every alignment of P against T
    for (int i = 0; i <= n-m; i++) {
        int j = 0;
        while (j < m && T[i+j] == P[j]) j++;
        if (j == m) return i;   // match starts at i
    }
    return -1;                  // no match
}

Brute Force, v.2 (one loop)

static int match(char[] T, char[] P) {
    int n = T.length;
    int m = P.length;
    int i = 0;
    int j = 0;
    do {
        if (T[i] == P[j]) { i++; j++; }
        else { i = i-j+1; j = 0; }   // back i up, restart the pattern
    } while (j < m && i < n);
    if (j == m) return i-m;
    else return -1;
}
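To make the two versions concrete, here is a small runnable sketch (the wrapper class BruteForce and the main method are our additions, not from the slides) exercising v.1 on the deck's running example:

```java
// Sketch: the brute-force matcher (v.1 above) wrapped in a runnable class.
// The class name BruteForce and the demo in main are our additions.
public class BruteForce {
    static int match(char[] T, char[] P) {
        int n = T.length;
        int m = P.length;
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m && T[i + j] == P[j]) j++;
            if (j == m) return i;        // first match starts at i
        }
        return -1;                       // no match
    }

    public static void main(String[] args) {
        char[] T = "abacaabaccabacabaabb".toCharArray();
        char[] P = "abacab".toCharArray();
        // The slides' example: T[10..15] = P[0..5]
        System.out.println(match(T, P)); // prints 10
    }
}
```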

String Matching
Text string T[0..N-1]: T = "abacaabaccabacabaabb"
Pattern string P[0..M-1]: P = "abacab"
Where is the first instance of P in T? T[10..15] = P[0..5]
In general, how many comparisons T[i] ? P[j] are needed to do the search?
Worst case: O(NM)

A bad case
65 comparisons are needed. How many of them could be avoided?

Typical text matching
This is a sample sentence
- s- - s- - s- - sente
20+5 = 25 comparisons are needed.
(The match is near the same point in the target string as in the previous example.)
In practice, 0 <= j <= 2.

String Matching
Brute force worst case: O(MN)
- Expensive for long patterns in repetitive text
How to improve on this? Intuition:
- Don't look at the text more than once.
- Remember what is learned from previous matches.

Motivation with FSM
Consider the alphabet {a,b,c} and the FSM given below.
What is the language accepted by this FSM? What can we learn from this FSM?
[FSM diagram: states 1 (Start) through 4 (End), with transitions labeled a, b, c, b/c]

Clever string matching
Cook published an abstract result about machine models.
- Match in O(N+M) vs. O(MN)?!
Knuth and Pratt studied it and refined it into a simple algorithm.
Morris, annoyed at a design problem in implementing a text editor, discovered the same algorithm.
- How to avoid decrementing i?
Knuth, Morris, and Pratt published together in 1977.

Morris

String Matching Meanwhile … Boyer and Moore discovered another algorithm that is even faster (for some uses) in the average case. Gosper independently discovered the same algorithm. Boyer and Moore published in 1977.

String Matching In 1980, Karp and Rabin discovered a simpler algorithm.  Uses hashing idea: quickly compute hashes for all M-length substrings in T, and compare with the hash for P.

Knuth Morris Pratt

The KMP idea
Take advantage of what we already know during the match process.
- Suppose P[0..5] matches T[10..15], but P[6] != T[16].
- Suppose we know that P[0] != any of T[11..15].
Then the next possible match is P[0] ? T[16].

KMP example
Match fails: T[i] != P[j]
- i = 6
- j = 6
Next match attempt:
- i = 6
- j =

Brute Force vs. KMP
A worst-case example:
- Brute force: 210 comparisons
- KMP: 42 comparisons

Brute Force vs. KMP
Brute force: 21 comparisons
abcdeabcdeabcedfghijkl
- bc- - bc- - bcedfg
KMP: 19 comparisons
abcdeabcdeabcedfghijkl
- bc- - bc- - bcedfg

Brute Force vs. KMP
Brute force: 21 comparisons
abcdeabcdeabcedfghijkl
- bc- - bc- - bcedfg
KMP: 19 comparisons, plus 5 preparation comparisons
abcdeabcdeabcedfghijkl
- bc- - bc- - bcedfg

KMP – The Big Idea
Retain information from prior attempts.
Compute in advance how far to jump in P when a match fails.
- Suppose the match fails at P[j] != T[i+j].
- Then we know P[0..j-1] = T[i..i+j-1].
We must next try P[0] ? T[i+1].
- But we know T[i+1] = P[1].
- So there is another way to compare: P[1] ? P[0]. If so, increment j by 1. No need to look at T.
- What if P[1] = P[0] and P[2] = P[1]? Then increment j by 2. Again, no need to look at T.
In general, we can determine how far to jump without any knowledge of T!

Implementing KMP
Never decrement i, ever.
- Always comparing T[i] with P[j].
Compute a table f of how far to move j when a match fails.
- The next match will compare T[i] with P[f[j-1]].
Do this by matching P against itself in all positions.

Building the Table for f
P = (pattern lost)
Find self-overlaps.
[Table: Prefix | Overlap | j | f — entries lost]

What f means
[Table: Prefix | Overlap | j | f — entries lost]
If f is zero, there is no self-match.
- This is good news:
- Set j = 0. Do not change i. The next match is T[i] ? P[0].
f non-zero implies there is a self-match.
- This is bad news:
- E.g., f = 2 means P[0..1] = P[j-2..j-1].
- Hence we must restart the comparison at P[2], since we already know T[i-2..i-1] = P[0..1].
In general:
- Set j = f[j-1]. Do not change i. The next match is T[i] ? P[f[j-1]].
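As a concrete illustration of the table (our own runnable sketch, using the computeF method shown later on the "KMP pre-process" slide; the wrapper class FailureTableDemo is ours):

```java
// Sketch: computing the failure table f for a sample pattern.
// computeF follows the "KMP pre-process" slide; the wrapper class,
// main, and the choice of pattern "abacab" are our additions.
import java.util.Arrays;

public class FailureTableDemo {
    static int[] computeF(char[] P) {
        int m = P.length;
        int[] f = new int[m];
        f[0] = 0;
        int i = 1, j = 0;
        while (i < m) {
            if (P[j] == P[i]) { f[i] = j + 1; i++; j++; }  // overlap extends
            else if (j > 0) j = f[j - 1];                  // reuse earlier f
            else { f[i] = 0; i++; }                        // no self-overlap
        }
        return f;
    }

    public static void main(String[] args) {
        // For "abacab", prefixes "a" and "ab" re-occur,
        // giving f = [0, 0, 1, 0, 1, 2].
        System.out.println(Arrays.toString(computeF("abacab".toCharArray())));
    }
}
```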

Favorable conditions
P = (pattern lost)
Find self-overlaps.
[Table: Prefix | Overlap | j | f — entries lost]

Mixed conditions
P = (pattern lost)
Find self-overlaps.
[Table: Prefix | Overlap | j | f — entries lost]

Poor conditions
P = (pattern lost)
Find self-overlaps.
[Table: Prefix | Overlap | j | f — entries lost]

KMP matcher

static int match(char[] T, char[] P) {
    int n = T.length;
    int m = P.length;
    int[] f = computeF(P);   // use f to determine the next value for j
    int i = 0;
    int j = 0;
    while (i < n) {
        if (P[j] == T[i]) {
            if (j == m-1) return i-m+1;   // full match
            i++; j++;
        }
        else if (j > 0) j = f[j-1];       // fall back in P; never decrement i
        else i++;
    }
    return -1;
}

KMP pre-process

static int[] computeF(char[] P) {
    int m = P.length;
    int[] f = new int[m];
    f[0] = 0;
    int i = 1;
    int j = 0;
    while (i < m) {
        if (P[j] == P[i]) {
            f[i] = j+1;
            i++; j++;
        }
        else if (j > 0) j = f[j-1];   // use previous values of f
        else { f[i] = 0; i++; }
    }
    return f;
}
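The matcher and the pre-process above fit together like this (a runnable sketch; the wrapper class KMPDemo and the main method are our additions):

```java
// Sketch: the KMP matcher and computeF from the two slides above,
// combined into one runnable class (class name and main are ours).
public class KMPDemo {
    static int[] computeF(char[] P) {
        int m = P.length;
        int[] f = new int[m];
        f[0] = 0;
        int i = 1, j = 0;
        while (i < m) {
            if (P[j] == P[i]) { f[i] = j + 1; i++; j++; }
            else if (j > 0) j = f[j - 1];
            else { f[i] = 0; i++; }
        }
        return f;
    }

    static int match(char[] T, char[] P) {
        int n = T.length, m = P.length;
        int[] f = computeF(P);
        int i = 0, j = 0;
        while (i < n) {
            if (P[j] == T[i]) {
                if (j == m - 1) return i - m + 1;  // full match
                i++; j++;
            } else if (j > 0) j = f[j - 1];        // never decrement i
            else i++;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Same answer as brute force on the slides' running example.
        System.out.println(match("abacaabaccabacabaabb".toCharArray(),
                                 "abacab".toCharArray())); // prints 10
    }
}
```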

KMP Performance
At each iteration, one of three cases:
- T[i] = P[j]: i increases
- T[i] != P[j] and j > 0: i-j increases
- T[i] != P[j] and j = 0: i increases and i-j increases
Hence, a maximum of 2N iterations.
Constructing f[] needs 2M iterations.
Thus worst-case performance is O(N+M).

KMP Summary
- Performs the comparisons from left to right.
- Preprocessing phase in O(m) space and time complexity.
- Searching phase in O(n+m) time complexity (independent of the alphabet size).
- Performs at most 2n-1 text character comparisons, using information gathered during the scan of the text.

Boyer Moore

Brute Force vs. KMP
Brute force: 21 comparisons
abcdeabcdeabcedfghijkl
- bc- - bc- - bcedfg
KMP: 19 comparisons
abcdeabcdeabcedfghijkl
- bc- - bc- - bcedfg

Brute Force vs. B-M
Brute force: 21 comparisons
abcdeabcdeabcedfghijkl
- bc- - bc- - bcedfg
Boyer-Moore: 8 comparisons
abcdeabcdeabcedfghijkl
- g f d e c b

Boyer Moore
Perhaps the most efficient algorithm for general pattern matching.
Ideas:
- Scan the pattern from right to left (and the target from left to right). Allows for bigger jumps on early failures.
- Could use a table similar to KMP's, but follow a better idea: use information about T as well as P in deciding what to do next.

Brute Force vs. B-M
Brute force: 23 comparisons
This string is textual
- t- - textual
Boyer-Moore: 10 comparisons
This string is textual
- l a u t x e t

Brute Force vs. B-M
Brute force: 25 comparisons
This is a sample sentence
Boyer-Moore: 5 comparisons
This is a sample sentence
- foobar

Boyer Moore
Ideas:
- Scan the pattern from right to left (and the target from left to right). Allows for bigger jumps on early failures.
- Could use a table similar to KMP's, but follow a better idea: use information about T as well as P in deciding what to do next.
- If T[i] does not appear in the pattern, skip forward beyond the end of the pattern.

Boyer Moore matcher

static int[] buildLast(char[] P) {
    int[] last = new int[128];
    int m = P.length;
    // Default -1: mismatch char is nowhere in the pattern,
    // so last says "jump the whole pattern length".
    for (int i = 0; i < 128; i++) last[i] = -1;
    // Otherwise, last says "jump to align the pattern with the
    // last instance of this char".
    for (int j = 0; j < P.length; j++) last[P[j]] = j;
    return last;
}

Boyer Moore matcher

static int match(char[] T, char[] P) {
    int[] last = buildLast(P);   // use last to determine the next value for i
    int n = T.length;
    int m = P.length;
    int i = m-1;
    int j = m-1;
    if (i > n-1) return -1;
    do {
        if (P[j] == T[i])
            if (j == 0) return i;   // full match
            else { i--; j--; }
        else {
            // bad-character jump, then restart at the pattern's right end
            i = i + m - Math.min(j, 1 + last[T[i]]);
            j = m - 1;
        }
    } while (i <= n-1);
    return -1;
}
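Putting buildLast and the matcher together gives a runnable sketch (the wrapper class BMDemo and the main method are our additions):

```java
// Sketch: Boyer-Moore with the bad-character ("last") rule only,
// combining buildLast and match from the slides above.
// The wrapper class BMDemo and main are our additions.
public class BMDemo {
    static int[] buildLast(char[] P) {
        int[] last = new int[128];               // ASCII alphabet
        for (int i = 0; i < 128; i++) last[i] = -1;
        for (int j = 0; j < P.length; j++) last[P[j]] = j;
        return last;
    }

    static int match(char[] T, char[] P) {
        int[] last = buildLast(P);
        int n = T.length, m = P.length;
        int i = m - 1, j = m - 1;
        if (i > n - 1) return -1;
        do {
            if (P[j] == T[i]) {
                if (j == 0) return i;            // full match
                i--; j--;
            } else {
                // bad-character jump; j = m-1 restarts at the right end
                i = i + m - Math.min(j, 1 + last[T[i]]);
                j = m - 1;
            }
        } while (i <= n - 1);
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(match("This is a string".toCharArray(),
                                 "ring".toCharArray())); // prints 12
    }
}
```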

KMP vs. B-M
KMP: 13 comparisons
B-M: (example lost)

KMP vs. B-M
KMP: 16 comparisons
This is a string
- ring
Boyer-Moore: 7 comparisons
This is a string
- g n i r

KMP vs. B-M
KMP: 16 comparisons
This is a string
- tring
Boyer-Moore: 8 comparisons
This is a string
- g n i r t

Matching Summary

Boyer-Moore Summary
- Performs the comparisons from right to left.
- Preprocessing phase in O(m+σ) time and space complexity.
- Searching phase in O(mn) worst-case time complexity.
- At most 3n text character comparisons in the worst case when searching for a non-periodic pattern.
- O(n/m) best performance.

Knuth-Morris-Pratt Summary
- For text, similar performance to brute force. Can be slower, due to precomputation.
- Works well for self-repetitive patterns in self-repetitive text.
- Never decrements i: can match an input stream.
- Intuition: derives from thinking about a matching FSM.

Karp and Rabin
In 1980, Karp and Rabin discovered a simpler algorithm.
- Uses a hashing idea: quickly compute hashes for all M-length substrings in T, and compare with the hash for P.
- Compute the hashes in a cumulative way, so each T[i] needs to be seen only once.
- Average-case time is O(M+N).
- The worst case, O(MN), is unlikely (requires hash collisions throughout).
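A minimal rolling-hash sketch of the idea (our own illustration, not code from the lecture; the class name RabinKarpDemo, base 256, and modulus 1,000,000,007 are arbitrary choices):

```java
// Sketch: Rabin-Karp with a polynomial rolling hash.
// Our illustration of the idea, not code from the lecture;
// the base and modulus choices are arbitrary.
public class RabinKarpDemo {
    static int match(char[] T, char[] P) {
        int n = T.length, m = P.length;
        if (m > n) return -1;
        long B = 256, MOD = 1_000_000_007L;
        long pow = 1;                        // B^(m-1) mod MOD
        for (int k = 0; k < m - 1; k++) pow = (pow * B) % MOD;
        long hp = 0, ht = 0;                 // hash of P, hash of window
        for (int k = 0; k < m; k++) {
            hp = (hp * B + P[k]) % MOD;
            ht = (ht * B + T[k]) % MOD;
        }
        for (int i = 0; ; i++) {
            if (ht == hp) {                  // verify to rule out collisions
                int j = 0;
                while (j < m && T[i + j] == P[j]) j++;
                if (j == m) return i;
            }
            if (i + m >= n) return -1;
            // Roll the window: drop T[i], append T[i+m]
            ht = ((ht - (T[i] * pow) % MOD + MOD) % MOD * B + T[i + m]) % MOD;
        }
    }

    public static void main(String[] args) {
        System.out.println(match("abacaabaccabacabaabb".toCharArray(),
                                 "abacab".toCharArray())); // prints 10
    }
}
```

The explicit verification step on hash equality is what keeps the worst case at O(MN) while collisions stay unlikely in practice.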

Next
Go to recitation Wednesday.
- Discuss more about string matching algorithms (very important!)
On Thursday, we will discuss Union-Find.
- Many, many applications
- Read chapter 24
Work on Homework 6.

End