# Algorithm : Design & Analysis [19]

## Presentation on theme: "Algorithm : Design & Analysis [19]"— Presentation transcript:

Algorithm : Design & Analysis [19]
String Matching Algorithm : Design & Analysis [19]

In the last class… Optimal Binary Search Tree
Separating Sequence of Word Dynamic Programming Algorithms

String Matching Simple String Matching KMP Flowchart Construction
Jump at Fail KMP Scan

String Matching: Problem Description
Search the text T, a string of characters of length n For the pattern P, a string of characters of length m (usually, m<<n) The result If T contains P as a substring, returning the index starting the substring in T Otherwise: fail

Straightforward Solution
p1 … pk-1 pk … pm P : Next comparison ? t1 … ti … ti+k-2 ti+k … ti+m-1 … tn T : Matched window Expanding to right First matched character Note: If it fails to match pk to ti+k-1, then backtracking occurs, a cycle of new matching of characters starts from ti+1.In the worst case, nearly n backtracking occurs and there are nearly m-1 comparisons in one cycle, so (mn)

More comparisons are needed Up to m-1 most recently matched characters have to be readily available for re-examination. (Considering those text which are too long to be loaded in entirety)

An Intuitive Finite Automaton for Matching a Given Pattern
Why no backtracking? Memorize the prefix. Alphabet={A,B,C} B B,C A A A B C 1 2 3 4 * B,C A start node C stop node matched! Automaton for pattern “AABC” Advantage: each character in the text is checked only once Difficulty: Construction of the automaton – too many edges(for a large alphabet) to defined and stored

The Knuth-Morris-Pratt Flowchart
Success Failure 2 Get next text char. A B A B C B * 1 3 4 5 6 An example: T=“ACABAABABA”, P=“ABABCB” KMP cell number Text being scanned A C C C A B A A A A B A B A A - Success or Failure s f f C s s s f f s s s s f s F get next char.

Matched Frame P: ABABABCB T: ... ABABAB x …
Moving for 4 chars may result in error. to be compared next matched frame P: ABABABCB T: ABABABABCB … If x is not C P: ABAB ABCB T: ABABAB x … The matched frame move to right for 2 chars, which is equal to moving the pointers backward.

Sliding the Matched Frame
When dismatching occurs: p …… pk-1 pk …… …… t1 …… ti …… tj-1 tj …… Matched frame Dismatching Matched frame slides, with its breadth changed as well: p1 …… pr-1 pr …… p1 …… pk-r+1 …… pk-1 t1 …… ti …… pj-r+1 …… tj-1 tj …… As large as possible. New matched frame Next comparison

Which means: When fail at node k, next comparison is pk vs. pr Fail Links Out of each node of KMP flowchart is a fail link, leading to node r, where r is the largest non-negative interger satisfying r<k and p1,…,pr-1 matches pk-r+1,…,pk-1. (stored in fail[k]) Note: r is independent of T. r pointer for T forward P P pointer for P backward k-r k

To be compared Thinking recursively, let fail[k-1]=s: p …… ps-1 ps ps+1 …… …… p1 …… pk-r …… pk-2 pk-1 pk …… pm Matched To be compared and thinking recursively Case 2: pspk-1 p1… pfail[s]-1 pfail[s] p1 …… ps ps ps+1 …… p1 … pk-r+1 …… pk pk-1 pk …… pm Case 1 ps=pk-1 fail[k]=s+1

Recursion on Node fail[s]
Thinking recursively, at the beginning, s=fail[k-1]: Case 2: pspk-1 p1… pfail[s]-1 pfail[s] p1 …… ps ps ps+1 …… p1 … pk-r+1 …… pk pk-1 pk …… pm ps is replaced by pfail[s], that is, new value assumed for s Then, proceeding on new s, that is: If case 1 applys (ps=pk-1): fail[k]=s+1, or If case 2 applys (pspk-1): another new s

Constructing the KMP flowchart for P = “ABABABCB” Assuming that fail[1] to fail[6] has been computed Get next text char. A B A B A B C B * 1 2 3 4 5 6 7 8 9 fail[7]: ∵fail[6]=4, and p6=p4, ∴fail[7]=fail[6]+1=5 (case 1) fail[8]: fail[7]=5, but p7p5, so, let s=fail[5]=3, but p7p3, keeping back, let s=fail[3]=1. Still p7p1. Further, let s=fail[1]=0, so, fail[8]=0+1=1.(case 2)

Constructing KMP Flowchart
Input: P, a string of characters; m, the length of P Output: fail, the array of failure links, filled void kmpSetup (char [] P, int m, int [] fail) int k, s; fail[1]=0; for (k=2; km; k++) s=fail[k-1]; while (s1) if (ps= = pk-1) break; s=fail[s]; fail[k]=s+1; For loop executes m-1 times, and while loop executes at most m times since fail[s] is always less than s. So, the complexity is roughly O(m2)

Number of Character Comparisons
Success comparison: at most once for a specified k, totaling at most m-1 2m-3 fail[1]=0; for (k=2; km; k++) s=fail[k-1]; while (s1) if (ps= = pk-1) break; s=fail[s]; fail[k]=s+1; Unsuccess comparison: Always followed by decreasing of s. Since: s is initialed as 0, s increases by one each time s is never negative So, the counting of decreasing can not be larger than that of increasing These 2 lines combine to increase s by 1, done m-2 times

KMP Scan: the Algorithm
Input: P and T, the pattern and text; m, the length of P; fail: the array of failure links for P. Output: index in T where a copy of P begins, or -1 if no match int kmpScan(char[ ] P, char[ ] T, int m, int[ ] fail) int match, j,k; //j indexes T, and k indexes P match=-1; j=1; k=1; while (endText(T,j)=false) if (k>m) match=j-m; break; if (k= =0) j++; k=1; else if ( tj= =pk) j++; k++; //one character matched else k=fail[k]; //following the failure link return match Each time a new cycle begins, p1,…pk-1 matched Executed at most 2n times, why?

Skipping Characters in String Matching
must must must must must must must must must must must must If you wish to understand others you must … Checking the characters in P, in reverse order The copy of the P begins at t38. Matching is achieved in 18 comparisons

Distance of Jumping Forward
With the knowledge of P, the distance of jumping forward for the pointer of T is determined by the character itself, independent of the location in T. p1 … A … A … pm =pk p1 … A … A … ps … pm t …… tj=A …… …… tn new j Rightmost ‘A’ current j charJump[‘A’] = m-k

Computing the Jump: Algorithm
Input: Pattern string P; m, the length of P; alphabet size alpha=|| Output: Array charJump, indexed 0,…, alpha-1, storing the jumping offsets for each char in alphabet. void computeJumps(char[ ] P, int m, int alpha, int[ ] charJump char ch; int k; for (ch=0; ch<alpha; ch++) charJump[ch]=m; //For all char no in P, jump by m for (k=1; km; k++) charJump[pk]=m-k; (||+m) The increasing order of k ensure that for duplicating symbols in P, the jump is computed according to the rightmost

Partially Matched Substring
matched suffix P: b a t s a n d c a t s T: …… d a t s …… New j Move only 1 char Current j charJump[‘d’]=4 Remember the matched suffix, we can get a better jump P: b a t s a n d c a t s T: …… d a t s …… New j Move 7 chars

Forward to Match the Suffix
p1 …… pk pk+1 …… pm Matched suffix Dismatch …… t …… tj tj+1 …… …… tn Substring same as the matched suffix occurs in P p1 …… pr pr+1 …… pr+m-k …… pm slide[k] p …… pk pk+1 …… pm t …… tj tj+1 …… …… tn matchJump[k] New j Old j

Partial Match for the Suffix
p1 …… pk pk+1 …… pm Matched suffix Dismatch …… t …… tj tj+1 …… …… tn No entire substring same as the matched suffix occurs in P p1 …… pq …… pm May be empty slide[k] p1 …… pk pk+1 …… pm t …… tj tj+1 …… …… tn matchJump[k] New j Old j

matchjump[k]=slide[k]+m-k
matchjump and slide p1 …… pr pr+1 …… pr+m-k …… pm p …… pk pk+1 …… pm t …… tj tj+1 …… …… tn Old j New j slide[k] matchJump[k] slide[k]: the distance P slides forward after dismatch at pk, with m-k chars matched to the right matchjump[k]: the distance j, the pointer of P, jumps, that is: matchjump[k]=slide[k]+m-k Let r(r<k) be the largest index, such that pr+1 starts a largest substring matching the matched suffix of P, and prpk, then slide[k]=k-r If the r not found, the longest prefix of P, of length q, matching the matched suffix of P will be lined up. Then slide[k]=m-q.

Computing matchJump: Example
P = “ w o w w o w ” Direction of computing w o w w o w w o w w o w matchJump[6]=1 Slide[6]=1 (m-k)=0 t …… tj …… pk Matched is empty w o w w o w w o w w o w matchJump[5]=3 pk Slide[5]=5-3=2 (m-k)=1 t1 …… tj w …… Matched is 1

Computing matchJump: Example
P = “ w o w w o w ” Direction of computing w o w w o w w o w w o w matchJump[4]=7 Not lined up =pk No found, but a prefix of length 1, so, Slide[4] = m-1=5 t1 …… tj o w …… Matched is 2 w o w w o w w o w w o w matchJump[3]=6 pk Slide[3]=3-0=3 (m-k)=3 t1 …… tj w o w …… Matched is 3

Computing matchJump: Example
P = “ w o w w o w ” Direction of computing w o w w o w w o w w o w matchJump[2]=7 No found, but a prefix of length 3, so, Slide[2] = m-3=3 t1 …… tj w w o w …… Matched is 4 w o w w o w w o w w o w matchJump[1]=8 No found, but a prefix of length 3, so, Slide[1] = m-3=3 t1 …… tj o w w o w …… Matched is 5

The Boyer-Moore Algorithm
Void computeMatchjumps(char[] P, int m, int[] matchjump) int k, r, s, low, shift; int sufx=new int[m+1] for (k=1; km; k++) matchjump[k]=m+1; sufx[m]=m+1; for (k=m-1; k0; k--) s=sufix[k+1] while (sm) if (pk+1==ps) break; matchjump[s] = min (matchjump[s], s-(k+1)); s = sufx[k]; sufx[k]=s-1; Sufx[k]=x means a substring starting from pk+1 matches suffix starting from px+1 Computing slide[k] // computing prefix length is necessary; // change slide value to matchjump by addition;

Home Assignment pp.508- 11.4 11.8 11.9 11.13 11.18