CSE 30331 Lecture 23 – String Matching Simple (Brute-Force) Approach Knuth-Morris-Pratt Algorithm Boyer-Moore Algorithm.


1 CSE 30331 Lecture 23 – String Matching Simple (Brute-Force) Approach Knuth-Morris-Pratt Algorithm Boyer-Moore Algorithm

2 The Problem
Find the first occurrence of the pattern P in the text T.
The number of characters in P is m.
The number of characters in T is n.

3 The Simple Approach
For each position j in the text: if T[j..j+m) matches P[0..m), stop: pattern found at position j
Advantage: simple to implement
Disadvantage: may require the ability to push previously read characters back into the input stream
Worst-case efficiency: O(m*n)
The pattern is moved forward only one position each time a mismatch is found, no matter how much of the pattern matched prior to the mismatch character
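A minimal sketch of the simple approach, assuming the whole text is held in memory as a std::string (so no push-back into an input stream is needed). The function name bruteForceSearch and the non-empty-pattern precondition are assumptions made for this illustration, not part of the lecture code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the position of the first occurrence of pattern in text, or -1.
// Assumes a non-empty pattern (m >= 1), as on the slides.
int bruteForceSearch(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    const std::size_t n = text.size();
    const std::size_t m = pattern.size();
    // Try every alignment j for which T[j..j+m) lies inside the text.
    for (std::size_t j = 0; j + m <= n; ++j) {
        std::size_t k = 0;
        while (k < m && text[j + k] == pattern[k]) {
            ++k;  // characters agree so far
        }
        if (k == m) {
            return static_cast<int>(j);  // T[j..j+m) matches P[0..m)
        }
        // On a mismatch the pattern moves forward by exactly one position.
    }
    return -1;
}
```

Each alignment can cost up to m comparisons and there are up to n - m + 1 alignments, which is where the O(m*n) worst case comes from.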

4 Knuth-Morris-Pratt (KMP)
Based on an FSA (finite-state automaton) for recognizing the pattern P
The FSA is represented by a KMP flowchart
  States are letters in the pattern P
  Arcs are SUCCESS or FAIL
On success (T[j] == P[k]): move forward with the match (j++ and k++)
On failure (T[j] != P[k]): move backward in the pattern (or, equivalently, shift the pattern forward over the text) to align pattern character P[fail[k]] with text character T[j], preserving the longest matching prefix

5 KMP Fail Links: hubbahubba
Example pattern: hubbahubba
P:        H  U  B  B  A  H  U  B  B  A
k:        0  1  2  3  4  5  6  7  8  9
fail[k]: -1  0  0  0  0  0  1  2  3  4
Match to text: hubbahubbletelescope...
hubbahubba               last A != L, fail[9] = 4
     hubbahubba          first A != L, fail[4] = 0
         hubbahubba      H != L, fail[0] = -1
          hubbahubba     L is skipped, matching restarts with k = 0
hubbahubbletelescope...
         ^

6 KMP – Building Fail Links
Pattern: ABABDD
If P[k] != T[j] then k_new = fail[k] is the pattern position at which matching resumes: P[0..fail[k]) is the longest prefix of the pattern that matches the text just before the mismatch character T[j]
Finding fail[k]:
  Go to P[k-1] and find its fail[k-1] (the prefix that matches up to P[k-2])
  If P[fail[k-1]] matches P[k-1], then fail[k] becomes fail[k-1] + 1
  Else follow the next fail arrow, fail[fail[k-1]], and repeat
Flowchart: Read char, then states A B A B D D * (positions 0 1 2 3 4 5)

7 KMP – Building Fail Links
void kmpSetup(char P[], int m, int fail[])
{
    int k, s;
    fail[0] = -1;                  // ch != P[0], read another ch
    for (k = 1; k < m; k++) {
        s = fail[k-1];             // start from the fail link of the previous position
        while (s >= 0) {           // if not back to start
            if (P[s] == P[k-1])    // duplicate char found
                break;             // so, stop following links
            s = fail[s];           // follow next fail link
        }
        fail[k] = s + 1;           // one past the matching prefix position
    }
}

8 KMP – Building Fail Links
Pattern: A B A B D D   Fail: -1 0
Trace of kmpSetup for k = 1: s = fail[0] = -1, so the while loop is skipped; fail[1] = s + 1 = -1 + 1 = 0

9 KMP – Building Fail Links
Pattern: A B A B D D   Fail: -1 0 0
Trace of kmpSetup for k = 2: s = fail[1] = 0; the while loop runs once: P[0]:'A' != P[1]:'B', so s = fail[0] = -1; fail[2] = -1 + 1 = 0

10 KMP – Building Fail Links
Pattern: A B A B D D   Fail: -1 0 0 1
Trace of kmpSetup for k = 3: s = fail[2] = 0; the while loop runs once: P[0]:'A' == P[2]:'A', so break; fail[3] = 0 + 1 = 1

11 KMP – Building Fail Links
Pattern: A B A B D D   Fail: -1 0 0 1 2
Trace of kmpSetup for k = 4: s = fail[3] = 1; the while loop runs once: P[1]:'B' == P[3]:'B', so break; fail[4] = 1 + 1 = 2

12 KMP – Building Fail Links
Pattern: A B A B D D   Fail: -1 0 0 1 2 0
Trace of kmpSetup for k = 5: s = fail[4] = 2; the while loop runs twice: P[2]:'A' != P[4]:'D', so s = fail[2] = 0; then P[0]:'A' != P[4]:'D', so s = fail[0] = -1; fail[5] = -1 + 1 = 0

13 KMP Fail Links: on mismatch, new k = fail[k]
Example pattern: ABABDD   fail: -1 0 0 1 2 0
Text     Before   After      Mismatch                     Result
X?????   ABABDD   .ABABDD    A != X so fail[0] = -1       skip X & k=0
AX????   ABABDD   .ABABDD    B != X so fail[1] = 0        k=0 (shifts pattern 1)
ABX???   ABABDD   ..ABABDD   2nd A != X so fail[2] = 0    k=0 (shifts pattern 2)
ABAX??   ABABDD   ..ABABDD   2nd B != X so fail[3] = 1    k=1 (shifts pattern 2)

14 KMP Fail Links: on mismatch, new k = fail[k]
Example pattern: ABABDD   fail: -1 0 0 1 2 0
Text     Before   After         Mismatch                     Result
ABABX?   ABABDD   ..ABABDD      D != X so fail[4] = 2        k=2 (shifts pattern 2)
ABABDX   ABABDD   .....ABABDD   2nd D != X so fail[5] = 0    k=0 (shifts pattern 5)
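The shift amounts in the ABABDD examples follow one rule: a mismatch at pattern position k moves the pattern forward by k - fail[k] text positions. A small sketch of that arithmetic, where the helper name shiftsFromFailLinks is an assumption introduced here and the fail table is supplied by the caller:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For each pattern position k, the number of text positions the pattern
// slides forward when a mismatch occurs at P[k]: shift[k] = k - fail[k].
// With fail[0] = -1 this gives shift[0] = 1 (the text character is skipped).
std::vector<int> shiftsFromFailLinks(const std::vector<int>& fail) {
    std::vector<int> shift(fail.size());
    for (std::size_t k = 0; k < fail.size(); ++k) {
        const int pos = static_cast<int>(k);
        assert(fail[k] < pos);  // a fail link always points to the left of k
        shift[k] = pos - fail[k];
    }
    return shift;
}
```

For fail = -1 0 0 1 2 0 this yields shifts 1 1 2 2 2 5, the same values as in the example rows.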

15 KMP Scan Algorithm
int kmpScan(char P[], char T[], int m, int fail[])
{
    int match = -1;                 // position of match in text
    int j = 0, k = 0;
    while (!atEndOfText(T, j)) {    // there is more text
        if (k == -1) {              // nothing in pattern matched last text char, so
            j++;                    // get next text character
            k = 0;                  // start pattern over
        } else if (T[j] == P[k]) {
            j++; k++;               // move forward one character in pattern and text
            if (k == m) {           // matched entire pattern, so stop
                match = j - m;      // (checked here so a match ending at the last
                break;              //  text character is not missed)
            }
        } else {
            k = fail[k];            // follow fail link to best restart in pattern
        }
    }
    return match;
}
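The scan relies on an atEndOfText helper that the slides do not define. As a hedged, self-contained sketch, the version below replaces it with a length check on a std::string and bundles the fail-link setup; the names buildFailLinks and kmpSearch are assumptions made for this illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Fail links as on the slides: fail[0] = -1, and for k >= 1 fail[k] is the
// pattern position at which to resume after a mismatch at P[k].
std::vector<int> buildFailLinks(const std::string& pattern) {
    assert(!pattern.empty());
    const int m = static_cast<int>(pattern.size());
    std::vector<int> fail(m);
    fail[0] = -1;
    for (int k = 1; k < m; ++k) {
        int s = fail[k - 1];
        while (s >= 0 && pattern[s] != pattern[k - 1]) {
            s = fail[s];  // follow the next fail link
        }
        fail[k] = s + 1;
    }
    return fail;
}

// Returns the position of the first occurrence of pattern in text, or -1.
int kmpSearch(const std::string& text, const std::string& pattern) {
    const std::vector<int> fail = buildFailLinks(pattern);
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    int j = 0, k = 0;
    while (j < n) {  // stands in for !atEndOfText(T, j)
        if (k == -1) {
            ++j;     // nothing matched this text character: skip it
            k = 0;
        } else if (text[j] == pattern[k]) {
            ++j;
            ++k;
            if (k == m) {
                return j - m;  // entire pattern matched
            }
        } else {
            k = fail[k];  // j never moves backward, so the text is read once
        }
    }
    return -1;
}
```

Because j never decreases, this version needs no push-back of text characters, unlike the simple approach.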

16 KMP - Efficiency
Building Fail Links – O(m)
Scanning text – O(n)
Overall – O(m+n) = O(n), since m <= n

17 Boyer-Moore (BM) Heuristic #1
Match the pattern right-to-left
Create a charJump[ch] array with an entry for each character in the alphabet (ASCII code)
If T[j] != P[k] then
  If T[j] appears in P[0..k) then the rightmost occurrence is aligned with T[j]
  Else the pattern P is aligned beginning at T[j+1]
j_new = j + charJump[T[j]]; matching resumes with T[j_new] and P[m-1]
This skips multiple text characters WITHOUT ever examining them

18 Boyer-Moore Algorithm Heuristic #2
matchJump[k] = slide[k] + m – k
slide[k] is the amount of slide needed to align the substrings
m – k is the length of the suffix (substring) being realigned (for 1-based k; with 0-based indexing the matched suffix P[k+1..m) has length m – k – 1)
Similar to KMP fail links, but calculated right to left
If a suffix has matched in P and T and that same substring appears elsewhere in P, then upon a mismatch the pattern P is "slid" to align the rightmost such matching substring with the suffix in T
Matching resumes at the new end of the pattern, determined by matchJump[k]

19 BM - Example
Pattern: BATSANDCATS
BATSANDCATS                        first pattern alignment
  BATSANDCATS                      charJump[T[j]] aligns N's
       BATSANDCATS                 matchJump[k] aligns ATS's
TWOOLDGNATSCANBELIKEBATSANDCATS    the text
New j (where matching resumes) is at the end of pattern P, but which: (S =?= A) or (S =?= I)
Use MAX(charJump[T[j]], matchJump[k])

20 Computing individual charJumps
// find cJ[ch] for each character ch in pattern P
void computeJumps(char P[], int m, int alpha, int charJump[])
{
    // assume jump distance is entire pattern length for all
    // characters that do not match a pattern letter.
    for (int ch = 0; ch < alpha; ch++)
        charJump[ch] = m;
    // for characters in the pattern, the jump is the distance from the
    // rightmost occurrence to the end of the pattern
    for (int k = 0; k < m; k++)
        charJump[(int)P[k]] = m - (k+1);
}

21 Computing substring matchJumps
void computeMatchJumps(char P[], int m, int matchJump[])
{
    int k, s, low, shift, *sufx = new int[m+1];
    // note: sufx[0] tells what suffix matches a prefix of P
    for (k = 0; k < m; k++)
        matchJump[k] = m + 1;        // larger than any real slide
    sufx[m] = m + 1;
    for (k = m-1; k >= 0; k--)       // k indexes sufx array, k-1 indexes P and matchJump
    {
        s = sufx[k+1];
        while (s <= m) {
            if (P[k] == P[s-1])      // P indices 0..m-1, sufx indices 0,1..m
                break;
            if (s-(k+1) < matchJump[s-1])   // Mismatch between P[k] and P[s-1]
                matchJump[s-1] = s-(k+1);
            s = sufx[s];
        }
        sufx[k] = s - 1;
    }
    // (function continues on the next slide)

22 Computing substring matchJumps
    // (continued from the previous slide)
    // if no suffix match at k+1, compute slide based on prefix that
    // matches suffix. Prefix length = (m - shift).
    low = 1; shift = sufx[0];
    while (shift <= m) {
        for (k = low; k <= shift; k++) {
            if (shift < matchJump[k-1])
                matchJump[k-1] = shift;
        }
        low = shift + 1;
        shift = sufx[shift];
    }
    // Add number of matched characters to slide amount
    for (k = 0; k < m; k++)
        matchJump[k] += (m - k - 1);
    delete [] sufx;
}

23 BM Scan Algorithm
int boyerMooreScan(char P[], char T[], int m, int charJump[], int matchJump[])
{
    int match = -1, j = m-1, k = m-1, jump;
    while (!endOfText(T, j)) {
        if (k < 0) {
            match = j + 1;           // entire pattern matches, so stop
            break;
        }
        if (T[j] == P[k]) {
            j--; k--;                // continue match right-to-left
        } else {
            jump = matchJump[k];     // use the larger of the two jumps
            if (charJump[(int)T[j]] > matchJump[k])
                jump = charJump[(int)T[j]];
            j += jump;               // jump forward & restart matching at right
            k = m-1;
        }
    }
    return match;
}

24 BM - Example
Pattern: WOWWOW   mJump: 8 7 6 7 3 1   cJump: 'W'=0, 'O'=1, others=6
WOWTHISISWOWXOWWOWWOW     the TEXT (21 chars)
     1  1111111111121     # of comparisons (15)
WOWWOW                    W != I, cJ[I]=6, mJ[5]=1
      WOWWOW              W != S, cJ[S]=6, mJ[2]=6
         WOWWOW           W != X, cJ[X]=6, mJ[3]=7
              WOWWOW      W != O, cJ[O]=1, mJ[5]=1
               WOWWOW     match
Note: cJump['W']=0 means simply that if the TEXT character is 'W', the pattern realignment placing the rightmost pattern 'W' over the text 'W' is achieved by not moving the pattern
Note: the algorithm will NOT work using only cJump, because a cJump smaller than the number of characters already matched would not move the pattern forward
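The jump tables in this example can be checked mechanically. The sketch below is a self-contained restatement of the lecture's routines using std::string and std::vector; the function names, the default alphabet size of 256, and the use of a length check in place of endOfText are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// charJump[ch]: distance from the rightmost occurrence of ch to the end of
// the pattern, or m if ch does not occur in the pattern.
std::vector<int> computeCharJumps(const std::string& p, int alpha = 256) {
    const int m = static_cast<int>(p.size());
    std::vector<int> charJump(alpha, m);
    for (int k = 0; k < m; ++k)
        charJump[static_cast<unsigned char>(p[k])] = m - k - 1;
    return charJump;
}

// matchJump[k]: slide amount plus the number of characters already matched.
std::vector<int> computeMatchJumps(const std::string& p) {
    const int m = static_cast<int>(p.size());
    std::vector<int> matchJump(m, m + 1), sufx(m + 1);
    sufx[m] = m + 1;
    for (int k = m - 1; k >= 0; --k) {
        int s = sufx[k + 1];
        while (s <= m) {
            if (p[k] == p[s - 1]) break;
            matchJump[s - 1] = std::min(matchJump[s - 1], s - (k + 1));
            s = sufx[s];
        }
        sufx[k] = s - 1;
    }
    // Slides based on a prefix of the pattern that matches a suffix.
    int low = 1, shift = sufx[0];
    while (shift <= m) {
        for (int k = low; k <= shift; ++k)
            matchJump[k - 1] = std::min(matchJump[k - 1], shift);
        low = shift + 1;
        shift = sufx[shift];
    }
    for (int k = 0; k < m; ++k)
        matchJump[k] += m - k - 1;  // add the matched suffix length
    return matchJump;
}

// Returns the position of the first occurrence of pattern in text, or -1.
int boyerMooreSearch(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    const std::vector<int> charJump = computeCharJumps(pattern);
    const std::vector<int> matchJump = computeMatchJumps(pattern);
    int j = m - 1, k = m - 1;
    while (j < n) {
        if (k < 0) return j + 1;  // entire pattern matched
        if (text[j] == pattern[k]) {
            --j;
            --k;  // continue matching right-to-left
        } else {
            j += std::max(charJump[static_cast<unsigned char>(text[j])],
                          matchJump[k]);
            k = m - 1;  // restart at the right end of the pattern
        }
    }
    return -1;
}
```

For the pattern WOWWOW, computeMatchJumps returns 8 7 6 7 3 1, and searching the 21-character example text returns position 15, the final alignment shown in the example.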

25 BM Algorithm Efficiency
Building charJump[ ] – O(alpha + m), where alpha is the alphabet size
Building matchJump[ ] – O(m)
Scanning text – O(n)
In practice, only about one in every 3 or 4 text characters is examined, so BM is quite fast
Overall – O(n)

26 String Matching Program Program to demonstrate all three approaches to string matching demos\strScan.cpp

