String Matching Fundamental Data Structures and Algorithms April 22, 2003
Announcements Quiz #4 available after class today! available until Wednesday midnight Homework 6 is out! Due on Thursday May 1, 11:59pm Tournament will run on May 7 details to come… Final exam on May 8 8:30am-11:30am, UC McConomy Review on May 4, details TBA
String Matching
Why String Matching? Finding patterns in documents formed using a large alphabet: word processing (search/modify/replace), web searching (search/display). Applications in molecular biology: biological molecules can often be approximated as sequences of amino acids; very large volumes of data (doubles every 18 months); need efficient string matching algorithms. Applications in systems and software design: the main data form used to exchange information is TEXT, so text pattern matching is very important. Big Question: Given a string T of length n and a pattern P of length m (m <= n), how do we find any or all occurrences of pattern P in T?
String Matching Text string T[0..N-1] T = “ abacaabaccabacabaabb ” Pattern string P[0..M-1] P = “ abacab ” Think of a naïve algorithm to find the pattern P in T How much work is needed to determine that? Can we do better? Better String Matching Algorithms Use finite automata Use combinatorial properties
String Matching Let T and P be strings built over a finite alphabet Σ with |Σ| = σ. Text string T[0..N-1] T = “abacaabaccabacabaabb” Pattern string P[0..M-1] P = “abacab” Where is the first instance of P in T? T[10..15] = P[0..5]
String Matching T = abacaabaccabacabaabb, P = abacab. The brute force algorithm: 22+6 = 28 comparisons. Brute Force Algorithm requires O(nm) operations.
Brute Force, v.1
    static int match(char[] T, char[] P) {
        int n = T.length;
        int m = P.length;
        for (int i = 0; i <= n - m; i++) {     // try every alignment of P against T
            int j = 0;
            while (j < m && T[i + j] == P[j]) j++;
            if (j == m) return i;              // all m characters matched
        }
        return -1;
    }
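Brute Force, v.2 (one loop)
    static int match(char[] T, char[] P) {
        int n = T.length;
        int m = P.length;
        int i = 0;   // position in T
        int j = 0;   // position in P
        // A while loop (not do-while), so an empty T or P never indexes out of bounds.
        while (j < m && i < n) {
            if (T[i] == P[j]) { i++; j++; }
            else { i = i - j + 1; j = 0; }     // back up i and restart the pattern
        }
        if (j == m) return i - m;
        else return -1;
    }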
String Matching Text string T[0..N-1] T = “ abacaabaccabacabaabb ” Pattern string P[0..M-1] P = “ abacab ” Where is the first instance of P in T? T[10..15] = P[0..5] In general, how many comparisons T[i] = P[j] ? are needed to do the search? Worst case: O(NM)
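To make the comparison counts concrete, here is a minimal sketch that instruments the brute-force search with a counter. The class and method names (BruteForceCount, countComparisons) are illustrative and not from the lecture; the loop structure follows Brute Force, v.1.

```java
// Sketch only: brute-force search with a comparison counter.
// BruteForceCount and countComparisons are hypothetical names.
final class BruteForceCount {
    /** Character comparisons made until the first match is found (or the search gives up). */
    static int countComparisons(String text, String pattern) {
        char[] T = text.toCharArray();
        char[] P = pattern.toCharArray();
        int n = T.length, m = P.length;
        int comparisons = 0;
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                comparisons++;                 // one T[i+j] == P[j] test
                if (T[i + j] != P[j]) break;
                j++;
            }
            if (j == m) return comparisons;    // first match found at i
        }
        return comparisons;                    // no match
    }

    public static void main(String[] args) {
        // The running example: 22 comparisons on failed alignments + 6 on the match = 28.
        System.out.println(countComparisons("abacaabaccabacabaabb", "abacab"));
        // Repetitive text: each of the 16 alignments fails only on its 5th character, so 80.
        System.out.println(countComparisons("aaaaaaaaaaaaaaaaaaaa", "aaaab"));
    }
}
```

The second call illustrates the O(NM) worst case: every alignment pays for almost the whole pattern before it fails.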
A bad case: 65 comparisons are needed. How many of them could be avoided?
Typical text matching T = “This is a sample sentence”, P = “sente”: 20+5 = 25 comparisons are needed. (The match is near the same point in the target string as the previous example.) In practice, 0 ≤ j ≤ 2.
String Matching Brute force worst case O(MN) Expensive for long patterns in repetitive text How to improve on this? Intuition: Don’t look at the text more than once. Remember what is learned from previous matches
Motivation with FSM Consider the alphabet {a,b,c} and the FSM given below. What is a language accepted by this FSM? What can we learn from this FSM? [Figure: FSM with states 1 (Start), 2, 3, 4 (End) and transitions labeled a, b, c, b/c]
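As a rough illustration of the automaton view, the sketch below builds a matching FSM for an arbitrary pattern over {a,b,c} and runs the text through it one symbol at a time. It is not the FSM from the slide; the names AutomatonMatch, buildDfa and ALPHABET are assumptions, and the table is built in a deliberately naive way for readability.

```java
// Sketch only: a matching automaton over the alphabet {a,b,c}.
// State q means "the last q symbols read equal P[0..q-1]".
final class AutomatonMatch {
    static final String ALPHABET = "abc";   // assumed alphabet, as on the slide

    /** delta[q][c] = length of the longest prefix of p that is a suffix of p[0..q-1] + c. */
    static int[][] buildDfa(String p) {
        int m = p.length();
        int[][] delta = new int[m + 1][ALPHABET.length()];
        for (int q = 0; q <= m; q++) {
            for (int c = 0; c < ALPHABET.length(); c++) {
                String seen = p.substring(0, q) + ALPHABET.charAt(c);
                int k = Math.min(m, seen.length());
                while (k > 0 && !seen.endsWith(p.substring(0, k))) k--;
                delta[q][c] = k;
            }
        }
        return delta;
    }

    /** Index of the first occurrence of p in t, or -1. Each text symbol is read exactly once. */
    static int match(String t, String p) {
        if (p.isEmpty()) return 0;
        int[][] delta = buildDfa(p);
        int state = 0;
        for (int i = 0; i < t.length(); i++) {
            int c = ALPHABET.indexOf(t.charAt(i));
            if (c < 0) throw new IllegalArgumentException("symbol outside the alphabet");
            state = delta[state][c];
            if (state == p.length()) return i - p.length() + 1;   // accepting state reached
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(match("abacaabaccabacabaabb", "abacab"));
    }
}
```

The point of the sketch is that the text index only moves forward; everything the matcher needs to remember about earlier symbols is encoded in the current state.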
Clever string matching Cook published an abstract result about machine models: match in O(N+M) vs. O(MN)?! Knuth and Pratt studied it and refined it into a simple algorithm. Morris, annoyed at a design problem in implementing a text editor (how to avoid decrementing i?), discovered the same algorithm. KMP published together in 1977.
Morris
String Matching Meanwhile … Boyer and Moore discovered another algorithm that is even faster (for some uses) in the average case. Gosper independently discovered the same algorithm. Boyer and Moore published in 1977.
String Matching In 1980, Karp and Rabin discovered a simpler algorithm. Uses hashing idea: quickly compute hashes for all M-length substrings in T, and compare with the hash for P.
Knuth Morris Pratt
The KMP idea Take advantage of what we already know during the match process. Suppose P = … Suppose P[0..5] matches T[10..15]. Suppose P[6] ≠ T[16]. Suppose we know that P[0] ≠ any of T[11..15]. Then the next possible match is P[0] ? T[16].
KMP example Match fails: T[i] ≠ P[j], i = 6, j = 6. Next match attempt: i = 6, j =
Brute Force vs. KMP A worst-case example: brute force = 210 comparisons, KMP = 42 comparisons.
Brute Force vs. KMP T = abcdeabcdeabcedfghijkl, P = bcedfg. Brute force: 21 comparisons. KMP: 19 comparisons, plus 5 preparation comparisons.
KMP – The Big Idea Retain information from prior attempts. Compute in advance how far to jump in P when a match fails. Suppose the match fails at P[j] ≠ T[i+j]. Then we know P[0..j-1] = T[i..i+j-1]. We must next try P[0] ? T[i+1]. But we know T[i+1] = P[1], so there is another way to compare: P[1] ? P[0]. If they are equal, increment j by 1. No need to look at T. What if P[1]=P[0] and P[2]=P[1]? Then increment j by 2. Again, no need to look at T. In general, we can determine how far to jump without any knowledge of T!
Implementing KMP Never decrement i, ever. Comparing T[i] with P[j]. Compute a table f of how far to jump j forward when a match fails. The next match will compare T[i] with P[f[j-1]] Do this by matching P against itself in all positions.
Building the Table for f P = … Find self-overlaps. Table columns: Prefix, Overlap, j, f.
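A small sketch of what such a table looks like, printed for three example patterns. The pattern abacab is the one used earlier in the lecture; abcdef and aaaaa are hypothetical patterns chosen here to show the no-overlap and heavy-overlap extremes. The class name FailureTableDemo and the printTable helper are illustrative; the table construction follows the same scheme as the lecture's computeF.

```java
// Sketch only: print each prefix of a pattern with its longest self-overlap.
final class FailureTableDemo {
    /** f[k] = length of the longest proper prefix of P[0..k] that is also a suffix of P[0..k]. */
    static int[] computeF(char[] P) {
        int m = P.length;
        int[] f = new int[m];                  // f[0] is 0 by default
        int i = 1, j = 0;
        while (i < m) {
            if (P[j] == P[i]) { f[i] = j + 1; i++; j++; }
            else if (j > 0) j = f[j - 1];      // fall back to a shorter overlap
            else { f[i] = 0; i++; }
        }
        return f;
    }

    static void printTable(String pattern) {
        int[] f = computeF(pattern.toCharArray());
        System.out.println("P = " + pattern);
        System.out.println("k  f[k]  prefix  overlap");
        for (int k = 0; k < f.length; k++) {
            String prefix = pattern.substring(0, k + 1);
            String overlap = pattern.substring(0, f[k]);
            System.out.printf("%d  %d  %s  %s%n", k, f[k], prefix,
                    overlap.isEmpty() ? "(none)" : overlap);
        }
    }

    public static void main(String[] args) {
        printTable("abacab");   // f = 0 0 1 0 1 2
        printTable("abcdef");   // no self-overlap: all zeros
        printTable("aaaaa");    // maximal self-overlap: 0 1 2 3 4
    }
}
```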
What f means (table columns: Prefix, Overlap, j, f)
If f is zero, there is no self-match. This is good news: set j=0, do not change i. The next match is T[i] ? P[0].
f non-zero implies there is a self-match. This is bad news: e.g., f=2 means P[0..1] = P[j-2..j-1]. Hence the new comparison must start at j-2, since we know T[i-2..i-1] = P[0..1].
In general: set j=f[j-1], do not change i. The next match is T[i] ? P[f[j-1]].
Favorable conditions P = … Find self-overlaps. Table columns: Prefix, Overlap, j, f.
Mixed conditions P = … Find self-overlaps. Table columns: Prefix, Overlap, j, f.
Poor conditions P = … Find self-overlaps. Table columns: Prefix, Overlap, j, f.
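KMP matcher
    static int match(char[] T, char[] P) {
        int n = T.length;
        int m = P.length;
        if (m == 0) return 0;                  // empty pattern matches at position 0
        int[] f = computeF(P);                 // failure table, built from P alone
        int i = 0;                             // position in T, never decremented
        int j = 0;                             // position in P
        while (i < n) {
            if (P[j] == T[i]) {
                if (j == m - 1) return i - m + 1;   // full match
                i++; j++;
            }
            else if (j > 0) j = f[j - 1];      // use f to determine next value for j
            else i++;
        }
        return -1;
    }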
KMP pre-process
    static int[] computeF(char[] P) {
        int m = P.length;
        int[] f = new int[m];
        if (m == 0) return f;                  // nothing to compute for an empty pattern
        f[0] = 0;
        int i = 1;
        int j = 0;
        while (i < m) {
            if (P[j] == P[i]) { f[i] = j + 1; i++; j++; }
            else if (j > 0) j = f[j - 1];      // use previous values of f
            else { f[i] = 0; i++; }
        }
        return f;
    }
KMP Performance At each iteration, one of three cases: (1) T[i] = P[j]: i increases; (2) T[i] ≠ P[j] and j>0: i-j increases; (3) T[i] ≠ P[j] and j=0: i increases and i-j increases. Neither i nor i-j ever decreases, and both are at most N; hence, a maximum of 2N iterations. Constructing f[] needs 2M iterations. Thus worst case performance is O(N+M).
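The iteration bound can be checked empirically with a minimal sketch that counts loop iterations in both phases. Each iteration performs exactly one character comparison, so the counters double as comparison counts. The class name KmpCount and the two static counters are illustrative additions; the loops follow the lecture's match and computeF.

```java
// Sketch only: KMP with counters for the preparation and search phases.
final class KmpCount {
    static int prepComparisons;     // comparisons made while building f
    static int searchComparisons;   // comparisons made while scanning T

    static int[] computeF(char[] P) {
        int m = P.length;
        int[] f = new int[m];
        int i = 1, j = 0;
        while (i < m) {
            prepComparisons++;
            if (P[j] == P[i]) { f[i] = j + 1; i++; j++; }
            else if (j > 0) j = f[j - 1];
            else { f[i] = 0; i++; }
        }
        return f;
    }

    /** Returns the index of the first match or -1, and resets and fills both counters. */
    static int match(String text, String pattern) {
        char[] T = text.toCharArray();
        char[] P = pattern.toCharArray();
        int n = T.length, m = P.length;
        prepComparisons = 0;
        searchComparisons = 0;
        if (m == 0) return 0;
        int[] f = computeF(P);
        int i = 0, j = 0;
        while (i < n) {
            searchComparisons++;
            if (P[j] == T[i]) {
                if (j == m - 1) return i - m + 1;
                i++; j++;
            }
            else if (j > 0) j = f[j - 1];
            else i++;
        }
        return -1;
    }

    public static void main(String[] args) {
        int pos = match("abcdeabcdeabcedfghijkl", "bcedfg");
        // Match at 11 with 19 search comparisons and 5 preparation comparisons.
        System.out.println(pos + " " + searchComparisons + " " + prepComparisons);
        match("aaaaaaaaaaaaaaaaaaaa", "aaaab");
        System.out.println(searchComparisons + " <= " + 2 * 20);
    }
}
```

On the repetitive input the search count stays under 2N, whereas brute force on the same input needs on the order of NM comparisons.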
KMP Summary performs the comparisons from left to right; preprocessing phase in O(m) space and time complexity; searching phase in O(n+m) time complexity (independent from the alphabet size); performs at most 2n-1 character comparisons during the scan of the text.
Boyer Moore
Brute Force vs. B-M T = abcdeabcdeabcedfghijkl, P = bcedfg. Brute force = 21 comparisons. B-M = 8 comparisons.
Boyer Moore Perhaps the most efficient algorithm for general pattern matching Ideas Scan pattern from right to left (and target from left to right) Allows for bigger jumps on early failures Could use a table similar to KMP. But follow a better idea: Use information about T as well as P in deciding what to do next.
Brute Force vs. B-M T = This string is textual, P = textual. Brute force = 23 comparisons. B-M = 10 comparisons.
Brute Force vs. B-M T = This is a sample sentence, P = foobar. Brute force: 25 comparisons. B-M: 5 comparisons.
Boyer Moore Ideas Scan pattern from right to left (and target from left to right) Allows for bigger jumps on early failures Could use a table similar to KMP. But follow a better idea: Use information about T as well as P in deciding what to do next. If T[i] does not appear in the pattern, skip forward beyond the end of the pattern.
Boyer Moore matcher
    static int[] buildLast(char[] P) {
        int[] last = new int[128];             // assumes 7-bit ASCII characters
        for (int i = 0; i < 128; i++) last[i] = -1;          // default: char is nowhere in the pattern
        for (int j = 0; j < P.length; j++) last[P[j]] = j;   // rightmost index of each pattern char
        return last;
    }
Mismatch char is nowhere in the pattern (default): last says “jump the distance”. Mismatch char is a pattern char: last says “jump to align pattern with last instance of this char”.
Boyer Moore matcher
    static int match(char[] T, char[] P) {
        int[] last = buildLast(P);
        int n = T.length;
        int m = P.length;
        if (m == 0) return 0;                  // empty pattern matches at position 0
        int i = m - 1;                         // position in T
        int j = m - 1;                         // position in P, scanned right to left
        if (i > n - 1) return -1;
        do {
            if (P[j] == T[i]) {
                if (j == 0) return i;
                else { i--; j--; }
            } else {
                // Use last to determine next value for i (T[i] must be below 128).
                i = i + m - Math.min(j, 1 + last[T[i]]);
                j = m - 1;
            }
        } while (i <= n - 1);
        return -1;
    }
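The comparison counts quoted for B-M in the examples can be reproduced with a minimal sketch that adds a counter to the matcher. This is the simplified last-occurrence variant shown in the lecture, not the full Boyer-Moore algorithm with its additional shift rule. The class name BoyerMooreCount and the two-element return value are illustrative choices, and the 128-entry table assumes ASCII input.

```java
// Sketch only: last-occurrence Boyer-Moore matcher with a comparison counter.
final class BoyerMooreCount {
    static int[] buildLast(char[] P) {
        int[] last = new int[128];             // assumes ASCII characters only
        for (int i = 0; i < 128; i++) last[i] = -1;
        for (int j = 0; j < P.length; j++) last[P[j]] = j;
        return last;
    }

    /** Returns {index of first match or -1, number of character comparisons}. */
    static int[] search(String text, String pattern) {
        char[] T = text.toCharArray();
        char[] P = pattern.toCharArray();
        int n = T.length, m = P.length;
        int comparisons = 0;
        if (m == 0) return new int[] {0, 0};
        if (m > n) return new int[] {-1, 0};
        int[] last = buildLast(P);
        int i = m - 1, j = m - 1;
        while (i <= n - 1) {
            comparisons++;
            if (P[j] == T[i]) {
                if (j == 0) return new int[] {i, comparisons};
                i--; j--;
            } else {
                i = i + m - Math.min(j, 1 + last[T[i]]);   // jump using the mismatched text char
                j = m - 1;
            }
        }
        return new int[] {-1, comparisons};
    }

    public static void main(String[] args) {
        int[] r = search("abcdeabcdeabcedfghijkl", "bcedfg");
        System.out.println(r[0] + " " + r[1]);   // match at 11 after 8 comparisons
        r = search("This is a sample sentence", "foobar");
        System.out.println(r[0] + " " + r[1]);   // no match, 5 comparisons
    }
}
```

In the foobar case only 5 of the 25 text characters are ever inspected, which is the behavior behind the O(n/m) best case.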
KMP B-M 13 comparisons comparison
KMP vs. B-M T = This is a string, P = ring. KMP: 16 comparisons. B-M: 7 comparisons.
KMP vs. B-M T = This is a string, P = tring. KMP: 16 comparisons. B-M: 8 comparisons.
Matching Summary
Boyer-Moore Summary performs the comparisons from right to left; preprocessing phase in O(m+σ) time and space complexity, where σ is the alphabet size; searching phase in O(mn) time complexity; 3n text character comparisons in the worst case when searching for a non periodic pattern; O(n / m) best performance.
Knuth-Morris-Pratt Summary For text, similar performance to brute force Can be slower, due to precomputation Works well for self-repetitive patterns in self-repetitive text Never decrements i. Matching an input stream … Intuition: derives from thinking about a Matching FSM.
Karp and Rabin In 1980, Karp and Rabin discovered a simpler algorithm. Uses hashing ideas: quickly compute hashes for all M-length substrings in T, and compare with the hash for P. Compute the hashes in a cumulative way, so each T[i] needs to be seen only once. Average case time is O(M+N). Worst case is unlikely (all collisions) at O(MN).
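The cumulative hashing idea can be sketched as follows. This is a minimal illustration under assumptions, not the authors' exact formulation: the polynomial hash, the constants BASE and MOD, and the class name RabinKarp are choices made here. Each window hash is derived from the previous one in constant time, and a hash hit is confirmed by a direct comparison so that collisions cannot produce a false match.

```java
// Sketch only: Karp-Rabin style search with a rolling polynomial hash.
final class RabinKarp {
    static final long BASE = 256;              // assumed radix
    static final long MOD = 1_000_000_007L;    // assumed prime modulus

    static int match(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        long high = 1;                         // BASE^(m-1) mod MOD, weight of the leading char
        for (int k = 1; k < m; k++) high = (high * BASE) % MOD;

        long hp = 0, ht = 0;                   // hash of P and of the current window of T
        for (int k = 0; k < m; k++) {
            hp = (hp * BASE + pattern.charAt(k)) % MOD;
            ht = (ht * BASE + text.charAt(k)) % MOD;
        }

        for (int i = 0; ; i++) {
            // Equal hashes are only a candidate; verify to rule out a collision.
            if (hp == ht && text.regionMatches(i, pattern, 0, m)) return i;
            if (i + m >= n) return -1;
            // Roll the window: drop text[i], append text[i+m].
            ht = (ht - text.charAt(i) * high % MOD + MOD) % MOD;
            ht = (ht * BASE + text.charAt(i + m)) % MOD;
        }
    }

    public static void main(String[] args) {
        System.out.println(match("abacaabaccabacabaabb", "abacab"));
    }
}
```

The verification step is where the O(MN) worst case comes from: if every window collided with the pattern's hash, each position would cost a full M-character comparison.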
Next Go to recitation Wednesday Discuss more about string matching algorithms very important! On Thursday, we will discuss Union Find Many many applications Read chapter 24 Work on Homework 6
End