String Matching 15-211 Fundamental Data Structures and Algorithms April 22, 2003.


2 String Matching 15-211 Fundamental Data Structures and Algorithms April 22, 2003

3 Announcements

Quiz #4 available after class today!
- Available until Wednesday midnight
Homework 6 is out!
- Due on Thursday May 1, 11:59pm
- Tournament will run on May 7; details to come...
Final exam on May 8
- 8:30am-11:30am, UC McConomy
- Review on May 4, details TBA

4 String Matching

5 Why String Matching?

Finding patterns in documents formed using a large alphabet:
- Word processing: search/modify/replace
- Web searching: search/display
Applications in molecular biology:
- Biological molecules can often be approximated as sequences of amino acids
- Very large volumes of data, doubling every 18 months
- Need efficient string matching algorithms
Applications in systems and software design:
- Text is the main data form used to exchange information, so text pattern matching is very important
Big question:
- Given a text T of length n and a pattern P of length m (m <= n), how do we find any or all occurrences of P in T?

6 String Matching

Text string T[0..N-1]: T = "abacaabaccabacabaabb"
Pattern string P[0..M-1]: P = "abacab"
Think of a naive algorithm to find the pattern P in T.
- How much work is needed to determine that?
- Can we do better?
Better string matching algorithms:
- Use finite automata
- Use combinatorial properties

7 String Matching

Let T and P be strings built over a finite alphabet Σ with |Σ| = σ.
Text string T[0..N-1]: T = "abacaabaccabacabaabb"
Pattern string P[0..M-1]: P = "abacab"
Where is the first instance of P in T? T[10..15] = P[0..5]

8 String Matching

The brute force algorithm on the running example:

T = abacaabaccabacabaabb
P =           abacab        (first match at offset 10)

22 comparisons in the failed alignments + 6 in the match itself = 28 comparisons. The brute force algorithm requires O(nm) operations in the worst case.

9 Brute Force, v.1

static int match(char[] T, char[] P) {
    int n = T.length;
    int m = P.length;
    for (int i = 0; i <= n-m; i++) {
        int j = 0;
        while (j < m && T[i+j] == P[j])
            j++;
        if (j == m) return i;   // match starting at position i
    }
    return -1;                  // no match
}
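To make the comparison counts on these slides concrete, here is a small sketch (not part of the lecture code; the class name and helper are hypothetical) that counts the character comparisons performed by brute-force v.1:

```java
// Hypothetical helper, not lecture code: counts the character
// comparisons made by the brute-force v.1 matcher above.
public class BruteCount {
    static int comparisons(char[] T, char[] P) {
        int n = T.length, m = P.length, count = 0;
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                count++;                    // one T[i+j] vs. P[j] comparison
                if (T[i + j] != P[j]) break;
                j++;
            }
            if (j == m) return count;       // stop at the first match, as v.1 does
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(comparisons("abacaabaccabacabaabb".toCharArray(),
                                       "abacab".toCharArray()));   // prints 28
    }
}
```

On the running example this reproduces the 22 + 6 = 28 comparisons claimed on slide 8.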

10 Brute Force, v.2 (one loop)

static int match(char[] T, char[] P) {
    int n = T.length;
    int m = P.length;
    int i = 0;
    int j = 0;
    do {
        if (T[i] == P[j]) {
            i++;
            j++;
        } else {
            i = i - j + 1;   // back the text pointer up past this alignment
            j = 0;
        }
    } while (j < m && i < n);
    if (j == m) return i - m;
    else return -1;
}

11 String Matching

Text string T[0..N-1]: T = "abacaabaccabacabaabb"
Pattern string P[0..M-1]: P = "abacab"
Where is the first instance of P in T? T[10..15] = P[0..5]
In general, how many comparisons T[i] ? P[j] are needed to do the search? Worst case: O(NM)

12 A bad case

T = 00000000000000001
P = 00001

Each alignment matches 0000 and fails on the fifth character, until the final alignment finds 00001.
60 + 5 = 65 comparisons are needed. How many of them could be avoided?


14 Typical text matching

T = This is a sample sentence
P = sente

Each 's' in the text triggers only a short partial match before the final alignment finds sente.
20 + 5 = 25 comparisons are needed. (The match is near the same point in the target string as in the previous example.) In practice, 0 <= j <= 2 for most alignments.

15 String Matching

Brute force worst case: O(MN)
- Expensive for long patterns in repetitive text
How to improve on this? Intuition:
- Don't look at the text more than once.
- Remember what is learned from previous matches.

16 Motivation with FSM

Consider the alphabet {a, b, c} and the FSM given below.
- What is the language accepted by this FSM?
- What can we learn from this FSM?
[Figure: a four-state automaton, states 1 (Start) through 4 (End), with transitions labeled a, b, c]

17 Clever string matching

1970: Cook published an abstract result about machine models.
- A match in O(N+M) instead of O(MN)?!
Knuth and Pratt studied it and refined it into a simple algorithm.
Morris, annoyed at a design problem in implementing a text editor, discovered the same algorithm.
- How to avoid decrementing i?
Knuth, Morris, and Pratt published together in 1976.

18 Morris

19 String Matching

Meanwhile, Boyer and Moore discovered another algorithm that is even faster (for some uses) in the average case. Gosper independently discovered the same algorithm. Boyer and Moore published in 1977.

20 String Matching

In 1980, Karp and Rabin discovered a simpler algorithm.
- It uses a hashing idea: quickly compute hashes for all M-length substrings of T, and compare them with the hash of P.

21 Knuth-Morris-Pratt

22 The KMP idea

Take advantage of what we already know during the match process.
- Suppose P = 1000000.
- Suppose P[0..5] matches T[10..15].
- Suppose P[6] != T[16].
- Suppose we know that P[0] matches none of T[11..15].
Then the next possible match is P[0] ? T[16].

23 KMP example

T = 10000010000000000
P = 1000000

Match fails: T[i] != P[j] with i = 6, j = 6.
Next match attempt: i = 6, j = 0 (the pattern slides forward so that P[0] lines up with T[6]; i is not decremented).

24 Brute Force vs. KMP

A worst-case example:

T = 0000000000000000000000000001
P = 00000000000001

Brute force: every failed alignment matches thirteen 0s and fails on the final 1, so 196 + 14 = 210 comparisons.
KMP: 28 + 14 = 42 comparisons.

25 Brute Force vs. KMP

T = abcdeabcdeabcedfghijkl
P = abcedfg

Brute force: 21 comparisons. KMP: 19 comparisons.

26 Brute Force vs. KMP

Same example:

T = abcdeabcdeabcedfghijkl
P = abcedfg

Brute force: 21 comparisons. KMP: 19 comparisons, plus 5 preparation comparisons to build the table.

27 KMP - The Big Idea

Retain information from prior attempts: compute in advance how far to jump in P when a match fails.
- Suppose the match fails at P[j] != T[i+j].
- Then we know P[0..j-1] = T[i..i+j-1].
We must next try P[0] ? T[i+1].
- But we know T[i+1] = P[1], so we can instead compare P[1] ? P[0]. If they are equal, increment j by 1. No need to look at T.
- What if P[1] = P[0] and P[2] = P[1]? Then increment j by 2. Again, no need to look at T.
In general, we can determine how far to jump without any knowledge of T!

28 Implementing KMP

Never decrement i, ever.
- We are always comparing T[i] with P[j].
Compute a table f telling how far to fall back in P when a match fails.
- After a failure at P[j], the next comparison is T[i] ? P[f[j-1]].
Build f by matching P against itself in all positions.

29 Building the Table for f

P = 1010011. Find self-overlaps:

Prefix   Overlap  j  f
1        -        1  0
10       -        2  0
101      1        3  1
1010     10       4  2
10100    -        5  0
101001   1        6  1
1010011  1        7  1

30 What f means

(Same f table as on the previous slide, for P = 1010011.)

If f is zero, there is no self-match.
- This is good news: set j = 0 and do not change i. The next comparison is T[i] ? P[0].
If f is non-zero, there is a self-match.
- This is bad news: e.g., f = 2 means P[0..1] = P[j-2..j-1], so the new comparison must start at j = 2, since we already know T[i-2..i-1] = P[0..1].
In general: set j = f[j-1] and do not change i. The next comparison is T[i] ? P[f[j-1]].

31 Favorable conditions

P = 1234567. Find self-overlaps:

Prefix   Overlap  j  f
1        -        1  0
12       -        2  0
123      -        3  0
1234     -        4  0
12345    -        5  0
123456   -        6  0
1234567  -        7  0

32 Mixed conditions

P = 1231234. Find self-overlaps:

Prefix   Overlap  j  f
1        -        1  0
12       -        2  0
123      -        3  0
1231     1        4  1
12312    12       5  2
123123   123      6  3
1231234  -        7  0

33 Poor conditions

P = 1111110. Find self-overlaps:

Prefix   Overlap  j  f
1        -        1  0
11       1        2  1
111      11       3  2
1111     111      4  3
11111    1111     5  4
111111   11111    6  5
1111110  -        7  0

34 KMP matcher

static int match(char[] T, char[] P) {
    int n = T.length;
    int m = P.length;
    int[] f = computeF(P);
    int i = 0;
    int j = 0;
    while (i < n) {
        if (P[j] == T[i]) {
            if (j == m-1) return i - m + 1;   // full match
            i++;
            j++;
        } else if (j > 0) {
            j = f[j-1];   // use f to determine the next value for j
        } else {
            i++;
        }
    }
    return -1;
}

35 KMP pre-process

static int[] computeF(char[] P) {
    int m = P.length;
    int[] f = new int[m];
    f[0] = 0;
    int i = 1;
    int j = 0;
    while (i < m) {
        if (P[j] == P[i]) {
            f[i] = j + 1;
            i++;
            j++;
        } else if (j > 0) {
            j = f[j-1];   // reuse previously computed values of f
        } else {
            f[i] = 0;
            i++;
        }
    }
    return f;
}
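As a sanity check (not part of the lecture), running a self-contained copy of computeF on the example patterns from slides 29-33 reproduces the f columns of those tables; the class name is illustrative:

```java
import java.util.Arrays;

// Self-contained copy of the slide's computeF, applied to the
// lecture's example patterns (class name is illustrative).
public class FTables {
    static int[] computeF(char[] P) {
        int m = P.length;
        int[] f = new int[m];
        f[0] = 0;
        int i = 1, j = 0;
        while (i < m) {
            if (P[j] == P[i]) { f[i] = j + 1; i++; j++; }
            else if (j > 0)   { j = f[j - 1]; }
            else              { f[i] = 0; i++; }
        }
        return f;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(computeF("1010011".toCharArray())));  // [0, 0, 1, 2, 0, 1, 1]
        System.out.println(Arrays.toString(computeF("1234567".toCharArray())));  // [0, 0, 0, 0, 0, 0, 0]
        System.out.println(Arrays.toString(computeF("1231234".toCharArray())));  // [0, 0, 0, 1, 2, 3, 0]
        System.out.println(Arrays.toString(computeF("1111110".toCharArray())));  // [0, 1, 2, 3, 4, 5, 0]
    }
}
```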

36 KMP Performance

At each iteration, one of three cases holds:
- T[i] = P[j]: i increases
- T[i] != P[j] and j > 0: i-j increases
- T[i] != P[j] and j = 0: i increases and i-j increases
Hence there are at most 2N iterations. Constructing f[] needs at most 2M iterations. Thus worst-case performance is O(N+M).

37 KMP Summary

- Performs the comparisons from left to right.
- Preprocessing phase in O(m) space and time.
- Searching phase in O(n+m) time (independent of the alphabet size).
- Performs at most 2n-1 text character comparisons during the scan of the text.

38 Boyer-Moore

39 Brute Force vs. KMP

T = abcdeabcdeabcedfghijkl
P = abcedfg

Brute force: 21 comparisons. KMP: 19 comparisons.

40 Brute Force vs. B-M

T = abcdeabcdeabcedfghijkl
P = abcedfg

Brute force: 15 + 6 = 21 comparisons. Boyer-Moore, scanning the pattern right to left: 2 + 6 = 8 comparisons.

41 Boyer-Moore

Perhaps the most efficient algorithm for general pattern matching. Ideas:
- Scan the pattern from right to left (and the target from left to right). This allows bigger jumps on early failures.
- Could use a table similar to KMP's, but follow a better idea: use information about T as well as P in deciding what to do next.

42 Brute Force vs. B-M

T = This string is textual
P = textual

Brute force: 16 + 7 = 23 comparisons. Boyer-Moore: 3 + 7 = 10 comparisons.

43 Brute Force vs. B-M

T = This is a sample sentence
P = foobar

Brute force: 25 comparisons. Boyer-Moore: 5 comparisons.

44 Boyer-Moore

Ideas:
- Scan the pattern from right to left (and the target from left to right). This allows bigger jumps on early failures.
- Could use a table similar to KMP's, but follow a better idea: use information about T as well as P in deciding what to do next.
- In particular: if T[i] does not appear in the pattern, skip forward beyond the end of the pattern.

45 Boyer-Moore matcher

static int[] buildLast(char[] P) {
    int[] last = new int[128];   // one entry per ASCII character
    for (int i = 0; i < 128; i++)
        last[i] = -1;            // default: mismatch char is nowhere in the pattern ("jump the full distance")
    for (int j = 0; j < P.length; j++)
        last[P[j]] = j;          // mismatch char is in the pattern ("jump to align its last instance")
    return last;
}

46 Boyer-Moore matcher

static int match(char[] T, char[] P) {
    int[] last = buildLast(P);
    int n = T.length;
    int m = P.length;
    int i = m - 1;
    int j = m - 1;
    if (i > n-1) return -1;      // pattern longer than text
    do {
        if (P[j] == T[i]) {
            if (j == 0) return i;   // full match
            i--;
            j--;
        } else {
            i = i + m - Math.min(j, 1 + last[T[i]]);   // use last to determine the next value for i
            j = m - 1;
        }
    } while (i <= n-1);
    return -1;
}
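A quick usage sketch (not from the slides): packaging the two routines above into one class and running them on the "This is a string" / "ring" example from a later slide finds the match at index 12. The class name is illustrative.

```java
// Self-contained packaging of the slides' Boyer-Moore code
// (class name is illustrative).
public class BoyerMoore {
    static int[] buildLast(char[] P) {
        int[] last = new int[128];
        for (int i = 0; i < 128; i++) last[i] = -1;
        for (int j = 0; j < P.length; j++) last[P[j]] = j;
        return last;
    }

    static int match(char[] T, char[] P) {
        int[] last = buildLast(P);
        int n = T.length, m = P.length;
        int i = m - 1, j = m - 1;
        if (i > n - 1) return -1;
        do {
            if (P[j] == T[i]) {
                if (j == 0) return i;      // full match found
                i--; j--;
            } else {
                i = i + m - Math.min(j, 1 + last[T[i]]);  // last-occurrence shift
                j = m - 1;
            }
        } while (i <= n - 1);
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(match("This is a string".toCharArray(),
                                 "ring".toCharArray()));     // 12
        System.out.println(match("abacaabaccabacabaabb".toCharArray(),
                                 "abacab".toCharArray()));   // 10
    }
}
```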

47 KMP vs. B-M

T = 1234561234356
P = 7777777

KMP: 13 comparisons. Boyer-Moore: 1 comparison.

48 KMP vs. B-M

T = This is a string
P = ring

KMP: 16 comparisons. Boyer-Moore: 7 comparisons.

49 KMP vs. B-M

T = This is a string
P = tring

KMP: 16 comparisons. Boyer-Moore: 8 comparisons.

50 Matching Summary

51 Boyer-Moore Summary

- Performs the comparisons from right to left.
- Preprocessing phase in O(m + σ) time and space.
- Searching phase in O(mn) worst-case time.
- At most 3n text character comparisons in the worst case when searching for a non-periodic pattern.
- O(n/m) best-case performance.

52 Knuth-Morris-Pratt Summary

- For typical text, similar performance to brute force; can even be slower, due to the precomputation.
- Works well for self-repetitive patterns in self-repetitive text.
- Never decrements i, so it can match an input stream.
- Intuition: derives from thinking about a matching FSM.

53 Karp and Rabin

In 1980, Karp and Rabin discovered a simpler algorithm.
- It uses hashing: quickly compute hashes for all M-length substrings of T, and compare them with the hash of P.
- Compute the hashes cumulatively (a rolling hash), so each T[i] needs to be seen only once.
- Average-case time is O(M+N).
- The O(MN) worst case (all hash collisions) is unlikely.
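The slides give no code for Karp-Rabin, so here is a minimal sketch under stated assumptions: a base-256 rolling hash modulo an arbitrarily chosen prime, with an explicit character-by-character check on every hash hit to rule out collisions. The class name, base, and modulus are all illustrative, not from the lecture.

```java
// Minimal Karp-Rabin sketch (illustrative, not lecture code).
public class RabinKarp {
    static final int B = 256;         // hash base (assumed)
    static final int Q = 1_000_003;   // prime modulus (assumed)

    static int match(char[] T, char[] P) {
        int n = T.length, m = P.length;
        if (m == 0 || m > n) return -1;
        long hP = 0, hT = 0, pow = 1;
        for (int k = 0; k < m - 1; k++) pow = (pow * B) % Q;   // B^(m-1) mod Q
        for (int k = 0; k < m; k++) {
            hP = (hP * B + P[k]) % Q;    // hash of the pattern
            hT = (hT * B + T[k]) % Q;    // hash of the first window
        }
        for (int i = 0; ; i++) {
            // On a hash hit, verify char by char: collisions are possible.
            if (hP == hT && equalAt(T, P, i)) return i;
            if (i == n - m) return -1;
            // Roll the hash: drop T[i], append T[i+m].
            hT = (((hT - T[i] * pow % Q + (long) Q * Q) % Q) * B + T[i + m]) % Q;
        }
    }

    static boolean equalAt(char[] T, char[] P, int i) {
        for (int j = 0; j < P.length; j++)
            if (T[i + j] != P[j]) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(match("abacaabaccabacabaabb".toCharArray(),
                                 "abacab".toCharArray()));   // 10
    }
}
```

Each T[i] enters the rolling hash once and leaves it once, which is the "seen only once" property the slide mentions; the verification step keeps the algorithm correct even when two different windows hash to the same value.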

54 Next

Go to recitation Wednesday.
- Discuss more about string matching algorithms (very important!).
On Thursday, we will discuss Union-Find.
- Many, many applications.
- Read chapter 24.
Work on Homework 6.

55 End

