**Algorithm : Design & Analysis [19]**

String Matching

**In the last class…**

- Optimal Binary Search Tree
- Separating a Sequence of Words
- Dynamic Programming Algorithms

**String Matching**

- Simple String Matching
- KMP Flowchart Construction
- Jump at Fail
- KMP Scan

**String Matching: Problem Description**

Search the text T, a string of characters of length n, for the pattern P, a string of characters of length m (usually $m \ll n$).

The result:

- If T contains P as a substring, return the index at which the substring starts in T.
- Otherwise, fail.

**Straightforward Solution**

[Figure: P = $p_1 \dots p_m$ aligned under T = $t_1 \dots t_n$ starting at $t_i$, the first matched character. The matched window $p_1 \dots p_{k-1}$ against $t_i \dots t_{i+k-2}$ expands to the right; the next comparison is $p_k$ vs. $t_{i+k-1}$.]

Note: If $p_k$ fails to match $t_{i+k-1}$, backtracking occurs, and a new cycle of character matching starts from $t_{i+1}$. In the worst case, nearly n backtrackings occur and there are nearly $m-1$ comparisons in each cycle, so the worst-case complexity is $\Theta(mn)$.
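The straightforward solution can be sketched as follows. This is a minimal illustration, not the slide's own code: the function name `naive_match` and the comparison counter are assumptions added here, and the returned index is 1-based to agree with the $t_1 \dots t_n$ notation.

```python
def naive_match(text, pattern):
    """Return (1-based start index or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):        # i is the 0-based start of the current cycle
        k = 0
        while k < m:
            comparisons += 1
            if text[i + k] != pattern[k]:
                break                 # mismatch: backtrack and restart at i + 1
            k += 1
        if k == m:
            return i + 1, comparisons
    return -1, comparisons


# A text of repeated 'A' forces almost m comparisons in every cycle.
print(naive_match("AAAAAAAAAB", "AAAB"))   # (7, 28)
```

With n = 10 and m = 4 there are 7 cycles of 4 comparisons each, which is the kind of behavior behind the $\Theta(mn)$ bound.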

**Disadvantages of Backtracking**

- More comparisons are needed.
- Up to $m-1$ most recently matched characters have to be readily available for re-examination (a concern for texts that are too long to be loaded in their entirety).

**An Intuitive Finite Automaton for Matching a Given Pattern**

Why no backtracking? The automaton memorizes the matched prefix.

[Figure: automaton for the pattern “AABC” over the alphabet {A, B, C}, with a start node, nodes 1 to 4, and a stop node * that is reached when the pattern is matched.]

- Advantage: each character in the text is checked only once.
- Difficulty: construction of the automaton. There are too many edges (for a large alphabet) to be defined and stored.
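One way to see both the advantage and the difficulty is to build the transition table directly. The sketch below is an assumption-laden illustration: the names `build_dfa` and `dfa_match` are hypothetical, states are numbered 0 to m by the count of pattern characters matched so far (the slide numbers its nodes differently), and every text character is assumed to belong to the given alphabet.

```python
def build_dfa(pattern, alphabet):
    """dfa[q][c] = length of the longest prefix of pattern that is a suffix of pattern[:q] + c."""
    m = len(pattern)
    dfa = []
    for q in range(m + 1):
        row = {}
        for c in alphabet:
            seen = pattern[:q] + c
            k = min(m, len(seen))
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[c] = k
        dfa.append(row)
    return dfa


def dfa_match(text, pattern, alphabet):
    """Scan text once; return the 1-based start of the first match, or -1."""
    dfa = build_dfa(pattern, alphabet)
    m = len(pattern)
    q = 0
    for j, c in enumerate(text):      # each text character is examined exactly once
        q = dfa[q][c]
        if q == m:
            return j - m + 2
    return -1


table = build_dfa("AABC", "ABC")
print(sum(len(row) for row in table))      # 15 edges for m = 4 and 3 symbols
print(dfa_match("BAAABCA", "AABC", "ABC")) # 3
```

The table holds $(m+1) \cdot |\Sigma|$ edges, which is what becomes costly for a large alphabet.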

**The Knuth-Morris-Pratt Flowchart**

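[Figure: KMP flowchart for P = “ABABCB”: a “get next text char.” node, cells 1 to 6 labeled A, B, A, B, C, B, and a stop node *; each cell has a success link and a failure link.]

An example: T = “ACABAABABA”, P = “ABABCB”

| KMP cell number | Text being scanned | Success or failure |
|---|---|---|
| 1 | A | s |
| 2 | C | f |
| 1 | C | f |
| 0 | C | get next char. |
| 1 | A | s |
| 2 | B | s |
| 3 | A | s |
| 4 | A | f |
| 2 | A | f |
| 1 | A | s |
| 2 | B | s |
| 3 | A | s |
| 4 | B | s |
| 5 | A | f |
| 3 | A | s |
| 4 | - | F (text exhausted) |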

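**Matched Frame**

[Figure: P = ABABABCB aligned against T = … ABABAB x …; the first six characters form the matched frame, and x is to be compared next.]

- Moving P forward by 4 chars may result in an error: with T = ABABABABCB …, the occurrence of P would be skipped.
- If x is not C, the matched frame moves to the right by 2 chars (P: ABAB ABCB against T: ABABAB x …), which is equivalent to moving the pointer for P backward.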

**Sliding the Matched Frame**

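When a mismatch occurs:

[Figure: $p_1 \dots p_{k-1}$ is aligned with $t_i \dots t_{j-1}$ (the matched frame); the mismatch is at $p_k$ vs. $t_j$.]

The matched frame slides, with its breadth changed as well:

[Figure: after sliding, $p_1 \dots p_{r-1}$ is aligned with $t_{j-r+1} \dots t_{j-1}$, which equals $p_{k-r+1} \dots p_{k-1}$. The new matched frame is as large as possible, and the next comparison is $p_r$ vs. $t_j$.]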

Which means: when the match fails at node k, the next comparison is $t_j$ vs. $p_r$.

**Fail Links**

Out of each node of the KMP flowchart is a fail link, leading to node r, where r is the largest non-negative integer satisfying $r<k$ such that $p_1, \dots, p_{r-1}$ matches $p_{k-r+1}, \dots, p_{k-1}$ (stored in fail[k]).

Note: r is independent of T.

[Figure: the pointer for T only moves forward; the pointer for P moves backward from k to r, so P slides forward by $k-r$.]
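The definition of the fail link can be checked by brute force before any clever construction. The sketch below is only an illustration of the definition: `fail_by_definition` is a hypothetical name, the pattern is padded with one character so that `p[1..m]` follows the 1-based notation, and fail[1] = 0 is taken as given.

```python
def fail_by_definition(pattern):
    """fail[k] = largest r < k with p_1..p_{r-1} equal to p_{k-r+1}..p_{k-1}; index 0 is unused."""
    m = len(pattern)
    p = " " + pattern                 # pad so that p[1] is the first pattern character
    fail = [0] * (m + 1)              # fail[1] stays 0
    for k in range(2, m + 1):
        for r in range(k - 1, 0, -1): # try the largest r first
            if p[1:r] == p[k - r + 1:k]:
                fail[k] = r
                break
    return fail


print(fail_by_definition("ABABABCB")[1:])   # [0, 1, 1, 2, 3, 4, 5, 1]
```

The slices compare $r-1$ characters each, so r = 1 always succeeds with an empty match; the result depends on P alone, never on T.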

**Computing the Fail Links**

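Thinking recursively, let fail[k-1] = s, so that $p_1 \dots p_{s-1}$ matches $p_{k-s} \dots p_{k-2}$. The characters to be compared are $p_s$ and $p_{k-1}$.

[Figure: P slid against itself, with $p_1 \dots p_{s-1}$ lined up under $p_{k-s} \dots p_{k-2}$ (matched) and $p_s$ under $p_{k-1}$ (to be compared).]

- Case 1: $p_s = p_{k-1}$. The match extends by one character, so fail[k] = s+1.
- Case 2: $p_s \neq p_{k-1}$. Think recursively again: $p_{fail[s]}$ is lined up under $p_{k-1}$ and compared next.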

**Recursion on Node fail[s]**

Thinking recursively, at the beginning s = fail[k-1].

Case 2: $p_s \neq p_{k-1}$. Then $p_s$ is replaced by $p_{fail[s]}$, that is, s assumes the new value fail[s]. Proceed with the new s:

- If case 1 applies ($p_s = p_{k-1}$): fail[k] = s+1, or
- If case 2 applies ($p_s \neq p_{k-1}$): take another new s.

The process stops at the latest when s reaches 0, giving fail[k] = 0+1 = 1.

**Computing Fail Links: an Example**

Constructing the KMP flowchart for P = “ABABABCB”, assuming that fail[1] to fail[6] have been computed.

[Figure: flowchart with the “get next text char.” node and cells 1 to 8 labeled A, B, A, B, A, B, C, B, followed by the stop node * (9).]

- fail[7]: since fail[6] = 4 and $p_6 = p_4$, fail[7] = fail[6]+1 = 5 (case 1).
- fail[8]: fail[7] = 5, but $p_7 \neq p_5$, so let s = fail[5] = 3; but $p_7 \neq p_3$, so, tracing back, let s = fail[3] = 1. Still $p_7 \neq p_1$. Further, let s = fail[1] = 0, so fail[8] = 0+1 = 1 (case 2).
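The chain of values that s takes in this example can be replayed mechanically. In the sketch below, `fail_chain` is a hypothetical helper, and the table `fail` is typed in by hand for P = “ABABABCB” (index 0 is unused padding); it is an illustration of the two cases, not part of the flowchart construction itself.

```python
def fail_chain(pattern, k, fail):
    """Return the successive values of s tried while computing fail[k]."""
    p = " " + pattern                 # pad so that p[1] is the first pattern character
    s = fail[k - 1]
    chain = [s]
    while s >= 1 and p[s] != p[k - 1]:
        s = fail[s]                   # case 2: follow the fail link of s
        chain.append(s)
    return chain                      # fail[k] is the last value plus 1


fail = [0, 0, 1, 1, 2, 3, 4, 5, 1]    # fail[1..8] for "ABABABCB"
print(fail_chain("ABABABCB", 7, fail))   # [4]           -> fail[7] = 5
print(fail_chain("ABABABCB", 8, fail))   # [5, 3, 1, 0]  -> fail[8] = 1
```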

**Constructing KMP Flowchart**

Input: P, a string of characters; m, the length of P.

Output: fail, the array of failure links, filled.

    void kmpSetup(char[] P, int m, int[] fail) {
        // P[1..m] holds the pattern; index 0 is unused.
        int k, s;
        fail[1] = 0;
        for (k = 2; k <= m; k++) {
            s = fail[k - 1];
            while (s >= 1) {
                if (P[s] == P[k - 1])
                    break;
                s = fail[s];
            }
            fail[k] = s + 1;
        }
    }

The for loop executes $m-1$ times, and the while loop executes at most m times since fail[s] is always less than s. So a rough bound on the complexity is $O(m^2)$.
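For experimentation, the construction translates almost line for line into Python. This is a sketch under the same 1-based convention: `kmp_setup` is a hypothetical name, the pattern is padded with one character, and the returned list has an unused entry at index 0.

```python
def kmp_setup(pattern):
    """Return fail[0..m] for the pattern; fail[0] is unused and fail[1] = 0."""
    m = len(pattern)
    p = " " + pattern                 # pad so that p[1] is the first pattern character
    fail = [0] * (m + 1)
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1 and p[s] != p[k - 1]:
            s = fail[s]               # case 2: retry with a shorter matched frame
        fail[k] = s + 1               # case 1, or s has reached 0
    return fail


print(kmp_setup("ABABABCB")[1:])      # [0, 1, 1, 2, 3, 4, 5, 1]
```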

**Number of Character Comparisons**

Counting the comparisons of $p_s$ with $p_{k-1}$ in kmpSetup:

- Successful comparison: at most once for a specified k, totaling at most $m-1$.
- Unsuccessful comparison: always followed by a decrease of s (s = fail[s]). Since s is initialized as 0, s increases by one each time, and s is never negative, the number of decreases cannot be larger than the number of increases.
- The two lines fail[k] = s+1 and s = fail[k-1] combine to increase s by 1, which is done $m-2$ times.

So the total number of character comparisons is at most $(m-1)+(m-2) = 2m-3$.
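The $2m-3$ bound can be observed empirically by instrumenting the construction. The sketch below assumes a hypothetical `kmp_setup_counted` that mirrors the setup loop and counts every comparison of $p_s$ with $p_{k-1}$; it is meant as a sanity check, not a proof.

```python
def kmp_setup_counted(pattern):
    """Return (fail, comparisons), counting each test of p_s against p_{k-1}."""
    m = len(pattern)
    p = " " + pattern                 # pad so that p[1] is the first pattern character
    fail = [0] * (m + 1)
    comparisons = 0
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1:
            comparisons += 1
            if p[s] == p[k - 1]:
                break                 # successful comparison: at most one per k
            s = fail[s]               # unsuccessful comparison: s decreases
        fail[k] = s + 1
    return fail, comparisons


fail, count = kmp_setup_counted("ABABABCB")
print(count, "<=", 2 * 8 - 3)         # 8 <= 13
```

For “ABABABCB” the count is 8, well under the bound of 13, and far below the rough $O(m^2)$ estimate.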

**KMP Scan: the Algorithm**

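Input: P and T, the pattern and text; m, the length of P; fail, the array of failure links for P.

Output: the index in T where a copy of P begins, or -1 if there is no match.

    int kmpScan(char[] P, char[] T, int m, int[] fail) {
        // P and T are 1-indexed; endText(T, j) is true once j is past the end of T.
        int match = -1;
        int j = 1, k = 1;             // j indexes T, and k indexes P
        while (k <= m && !endText(T, j)) {
            if (k == 0) {
                j++; k = 1;
            } else if (T[j] == P[k]) {
                j++; k++;             // one character matched
            } else {
                k = fail[k];          // following the failure link
            }
        }
        if (k > m)
            match = j - m;            // also catches a match that ends at the last character of T
        return match;
    }

- Each time a new cycle of the loop begins, $p_1, \dots, p_{k-1}$ are matched.
- The loop body is executed at most 2n times. Why? Each iteration either advances j, which happens at most n times, or follows a failure link, which decreases k; since k increases only together with j, the failure links followed cannot outnumber the advances of j.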
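A runnable version of the scan, for trying the examples from these slides, might look as follows. It is a sketch under stated assumptions: `kmp_scan` and `kmp_setup` are hypothetical names, both strings are padded to keep 1-based indices, the failure links are computed inside the function, and the test for k > m is made after the loop so that a match ending at the last text character is reported.

```python
def kmp_setup(pattern):
    """Failure links fail[1..m] (index 0 unused), as in the flowchart construction."""
    m = len(pattern)
    p = " " + pattern
    fail = [0] * (m + 1)
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1 and p[s] != p[k - 1]:
            s = fail[s]
        fail[k] = s + 1
    return fail


def kmp_scan(pattern, text):
    """Return the 1-based index in text where pattern first begins, or -1."""
    m, n = len(pattern), len(text)
    fail = kmp_setup(pattern)
    p, t = " " + pattern, " " + text  # pad so that index 1 is the first character
    j = k = 1                         # j indexes the text, k indexes the pattern
    while j <= n and k <= m:
        if k == 0:
            j += 1                    # get next text character
            k = 1
        elif t[j] == p[k]:
            j += 1                    # one character matched
            k += 1
        else:
            k = fail[k]               # follow the failure link; j never moves back
    return j - m if k > m else -1


print(kmp_scan("ABABCB", "ACABAABABA"))     # -1
print(kmp_scan("ABABABCB", "ABABABABCB"))   # 3
```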

**Skipping Characters in String Matching**

[Figure: the pattern “must” shown at 12 successive alignments beneath the text “If you wish to understand others you must …”.]

- The characters in P are checked in reverse order (from right to left).
- The copy of P begins at $t_{38}$.
- Matching is achieved in 18 comparisons.

**Distance of Jumping Forward**

With knowledge of P, the distance by which the pointer for T jumps forward is determined by the mismatched text character itself, independent of its location in T.

[Figure: the current $t_j$ = A; P slides so that the rightmost ‘A’ in P, at $p_k$, lines up with $t_j$, and the new j is aligned with $p_m$.]

charJump[‘A’] = $m-k$, where $p_k$ is the rightmost ‘A’ in P.

**Computing the Jump: Algorithm**

Input: pattern string P; m, the length of P; alphabet size alpha = $|\Sigma|$.

Output: array charJump, indexed 0, …, alpha-1, storing the jumping offsets for each char in the alphabet.

    void computeJumps(char[] P, int m, int alpha, int[] charJump) {
        // P[1..m] holds the pattern; index 0 is unused.
        char ch;
        int k;
        for (ch = 0; ch < alpha; ch++)
            charJump[ch] = m;         // for all chars not in P, jump by m
        for (k = 1; k <= m; k++)
            charJump[P[k]] = m - k;
    }

The complexity is $\Theta(|\Sigma|+m)$. The increasing order of k ensures that, for symbols duplicated in P, the jump is computed according to the rightmost occurrence.
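The “must” example can be reproduced with charJump alone. The sketch below makes several assumptions that are not in the slides: a dictionary with default m stands in for the array indexed by the alphabet, the names `compute_jumps` and `char_jump_scan` are hypothetical, and the jump is never allowed to be smaller than $m-k+1$ so that P always moves forward by at least one character.

```python
def compute_jumps(pattern):
    """charJump for every character of the pattern; later (rightmost) occurrences win."""
    m = len(pattern)
    return {ch: m - k for k, ch in enumerate(pattern, start=1)}


def char_jump_scan(pattern, text):
    """Right-to-left scan using charJump only; returns (1-based index or -1, comparisons)."""
    m, n = len(pattern), len(text)
    char_jump = compute_jumps(pattern)
    p, t = " " + pattern, " " + text  # pad so that index 1 is the first character
    comparisons = 0
    j = k = m
    while j <= n:
        if k < 1:
            return j + 1, comparisons
        comparisons += 1
        if t[j] == p[k]:
            j -= 1
            k -= 1
        else:
            # assumed safeguard: advance P by at least one position
            j += max(char_jump.get(t[j], m), m - k + 1)
            k = m
    return -1, comparisons


print(compute_jumps("must"))   # {'m': 3, 'u': 2, 's': 1, 't': 0}
print(char_jump_scan("must", "If you wish to understand others you must"))   # (38, 18)
```

The result agrees with the figures quoted for this text: the copy of P begins at $t_{38}$ and is found in 18 comparisons.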

**Partially Matched Substring**

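[Figure: P = “batsandcats” aligned under T = “…… d a t s ……”; the suffix “ats” is matched, and ‘d’ in T mismatches ‘c’ in P at the current j.]

- With charJump[‘d’] = 4, the new j is 4 positions beyond the current j, so P moves only 1 char.
- Remembering the matched suffix, we can get a better jump: lining up the earlier “ats” in P with the matched text moves P by 7 chars.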

**Forward to Match the Suffix**

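[Figure: a mismatch at $p_k$ against $t_j$, with the matched suffix $p_{k+1} \dots p_m$. A substring identical to the matched suffix occurs in P as $p_{r+1} \dots p_{r+m-k}$. P slides forward by slide[k] so that this substring lines up with the matched text, and j jumps from the old j to the new j (aligned with $p_m$) by matchJump[k].]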

**Partial Match for the Suffix**

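[Figure: a mismatch at $p_k$ against $t_j$, with the matched suffix $p_{k+1} \dots p_m$. No entire substring identical to the matched suffix occurs in P, so a prefix $p_1 \dots p_q$ of P (which may be empty) that matches the end of the matched suffix is lined up instead. P slides forward by slide[k], and j jumps from the old j to the new j by matchJump[k].]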

**matchJump and slide**

[Figure: P slides forward by slide[k] so that $p_{r+1} \dots p_{r+m-k}$ lines up with the matched text; j moves from the old j to the new j by matchJump[k].]

- slide[k]: the distance P slides forward after a mismatch at $p_k$, with $m-k$ chars matched to the right.
- matchJump[k]: the distance that j, the pointer for T, jumps; that is, matchJump[k] = slide[k] + m - k.
- Let r ($r<k$) be the largest index such that $p_{r+1}$ starts a substring matching the matched suffix of P, and $p_r \neq p_k$; then slide[k] = k - r.
- If no such r is found, the longest prefix of P, of length q, matching the end of the matched suffix of P is lined up. Then slide[k] = m - q.
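These two rules can be applied literally, by brute force, to check small examples. In the sketch below the function name `match_jumps_by_definition` is hypothetical, r = 0 is accepted without the $p_r \neq p_k$ test because there is no $p_0$, and index 0 of both returned lists is unused padding. It illustrates the definitions and is not an efficient construction.

```python
def match_jumps_by_definition(pattern):
    """Return (slide, match_jump), both 1-based, computed directly from the definitions."""
    m = len(pattern)
    p = " " + pattern                 # pad so that p[1] is the first pattern character
    slide = [0] * (m + 1)
    match_jump = [0] * (m + 1)
    for k in range(1, m + 1):
        length = m - k                # number of characters matched to the right of p_k
        suffix = p[k + 1:]
        found = None
        for r in range(k - 1, -1, -1):          # largest r first
            if p[r + 1:r + 1 + length] == suffix and (r == 0 or p[r] != p[k]):
                found = r
                break
        if found is not None:
            slide[k] = k - found
        else:
            q = length - 1            # longest proper prefix matching the end of the suffix
            while q > 0 and p[1:q + 1] != p[m - q + 1:]:
                q -= 1
            slide[k] = m - q
        match_jump[k] = slide[k] + m - k
    return slide, match_jump


slide, match_jump = match_jumps_by_definition("wowwow")
print(slide[1:])        # [3, 3, 3, 5, 2, 1]
print(match_jump[1:])   # [8, 7, 6, 7, 3, 1]
```

For P = “batsandcats” the same function gives slide[8] = 7, the 7-char move obtained by remembering the matched suffix “ats”.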

**Computing matchJump: Example**

P = “wowwow” (m = 6); matchJump is computed from right to left.

- k = 6: the matched suffix is empty ($m-k = 0$). slide[6] = 1, so matchJump[6] = 1.
- k = 5: the matched suffix is “w” ($m-k = 1$). slide[5] = 5-3 = 2, so matchJump[5] = 3.

**Computing matchJump: Example**

P = “wowwow” (m = 6); matchJump is computed from right to left.

- k = 4: the matched suffix is “ow” ($m-k = 2$). The earlier “ow” in P is preceded by a character equal to $p_k$, so it is not lined up. No substring is found, but a prefix of length 1 matches, so slide[4] = m-1 = 5 and matchJump[4] = 7.
- k = 3: the matched suffix is “wow” ($m-k = 3$). slide[3] = 3-0 = 3, so matchJump[3] = 6.

**Computing matchJump: Example**

P = “wowwow” (m = 6); matchJump is computed from right to left.

- k = 2: the matched suffix is “wwow” ($m-k = 4$). No substring is found, but a prefix of length 3 matches, so slide[2] = m-3 = 3 and matchJump[2] = 7.
- k = 1: the matched suffix is “owwow” ($m-k = 5$). No substring is found, but a prefix of length 3 matches, so slide[1] = m-3 = 3 and matchJump[1] = 8.

**The Boyer-Moore Algorithm**

    void computeMatchJumps(char[] P, int m, int[] matchJump) {
        // P[1..m] holds the pattern; index 0 is unused.
        int k, r, s, low, shift;
        int[] sufx = new int[m + 1];
        for (k = 1; k <= m; k++)
            matchJump[k] = m + 1;
        sufx[m] = m + 1;
        // Computing slide[k]
        for (k = m - 1; k >= 0; k--) {
            s = sufx[k + 1];
            while (s <= m) {
                if (P[k + 1] == P[s])
                    break;
                matchJump[s] = Math.min(matchJump[s], s - (k + 1));
                s = sufx[s];
            }
            sufx[k] = s - 1;
        }
        // computing prefix length is necessary;
        // change slide value to matchJump by addition;
    }

sufx[k] = x means that the substring starting from $p_{k+1}$ matches the suffix starting from $p_{x+1}$.
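A complete, runnable sketch of the whole method is given below. Several parts are assumptions rather than content of the slide: the prefix step and the final addition of $m-k$ are filled in following the two comments at the end of the slide's code, the scan loop combining charJump and matchJump with max is the usual Boyer-Moore scan and is not shown on the slide, a dictionary with default m replaces the charJump array, and all function names are hypothetical.

```python
def compute_jumps(pattern):
    """charJump for each pattern character; characters not in the pattern default to m."""
    m = len(pattern)
    return {ch: m - k for k, ch in enumerate(pattern, start=1)}


def compute_match_jumps(pattern):
    """matchJump[1..m] (index 0 unused), built with the sufx links."""
    m = len(pattern)
    p = " " + pattern
    match_jump = [m + 1] * (m + 1)    # m + 1 is an impossibly large slide
    sufx = [0] * (m + 1)
    sufx[m] = m + 1
    for k in range(m - 1, -1, -1):
        s = sufx[k + 1]
        while s <= m and p[k + 1] != p[s]:
            match_jump[s] = min(match_jump[s], s - (k + 1))
            s = sufx[s]
        sufx[k] = s - 1
    low, shift = 1, sufx[0]           # assumed prefix step: slide by m - q
    while shift <= m:
        for k in range(low, shift + 1):
            match_jump[k] = min(match_jump[k], shift)
        low = shift + 1
        shift = sufx[shift]
    for k in range(1, m + 1):
        match_jump[k] += m - k        # change slide to matchJump by addition
    return match_jump


def boyer_moore_scan(pattern, text):
    """Return the 1-based index in text where pattern first begins, or -1."""
    m, n = len(pattern), len(text)
    char_jump = compute_jumps(pattern)
    match_jump = compute_match_jumps(pattern)
    p, t = " " + pattern, " " + text
    j = k = m
    while j <= n:
        if k < 1:
            return j + 1
        if t[j] == p[k]:
            j -= 1
            k -= 1
        else:
            j += max(char_jump.get(t[j], m), match_jump[k])
            k = m
    return -1


print(compute_match_jumps("wowwow")[1:])   # [8, 7, 6, 7, 3, 1]
print(boyer_moore_scan("must", "If you wish to understand others you must"))   # 38
```

The matchJump values for “wowwow” agree with the worked example, which is a useful check on the filled-in steps.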

**Home Assignment**

pp.508-: 11.4, 11.8, 11.9, 11.13, 11.18

