Presentation is loading. Please wait.

Presentation is loading. Please wait.

String Matching Algorithm : Design & Analysis [19]

Similar presentations


Presentation on theme: "String Matching Algorithm : Design & Analysis [19]"— Presentation transcript:

1 String Matching Algorithm : Design & Analysis [19]

2 In the last class… Optimal Binary Search Tree Separating Sequence of Word Dynamic Programming Algorithms

3 String Matching Simple String Matching KMP Flowchart Construction Jump at Fail KMP Scan

4 String Matching: Problem Description Search the text T, a string of characters of length n For the pattern P, a string of characters of length m (usually, m<

5 Straightforward Solution t 1 … t i … t i+k-2 t i+k-1 … t i+m-1 … t n p 1 … p k-1 p k … p m T : … P : ? First matched character Matched window Expanding to right Next comparison Note: If it fails to match p k to t i+k-1, then backtracking occurs, a cycle of new matching of characters starts from t i+1.In the worst case, nearly n backtracking occurs and there are nearly m-1 comparisons in one cycle, so  (mn)

6 Disadvantages of Backtracking More comparisons are needed Up to m-1 most recently matched characters have to be readily available for re-examination. (Considering those text which are too long to be loaded in entirety)

7 An Intuitive Finite Automaton for Matching a Given Pattern 1234 * start node stop node matched! B,CB,C C B,CB,C A B A AABC Automaton for pattern “AABC” Alphabet={A,B,C} Advantage: each character in the text is checked only once Difficulty: Construction of the automaton – too many edges(for a large alphabet) to defined and stored Advantage: each character in the text is checked only once Difficulty: Construction of the automaton – too many edges(for a large alphabet) to defined and stored Why no backtracking? Memorize the prefix. Why no backtracking? Memorize the prefix.

8 The Knuth-Morris-Pratt Flowchart Get next text char. AABBBC * An example: T=“ACABAABABA”, P=“ABABCB” KMP cell number Text being scanned A C C C A B A A A A B A B A A - Success or Failures f f C s s s f f s s s s f s F get next char. Success Failure

9 P: ABABABCB T:... ABABAB x … Matched Frame matched frame to be compared next If x is not C P: ABAB ABCB T:... ABABAB x … The matched frame move to right for 2 chars, which is equal to moving the pointers backward. P: ABABABCB T:... ABABABABCB … Moving for 4 chars may result in error.

10 Matched frame slides, with its breadth changed as well: p 1 …… p r-1 p r …… p 1 …… p k-r+1 …… p k-1 t 1 …… t i …… p j-r+1 …… t j-1 t j …… Sliding the Matched Frame When dismatching occurs: p 1 …… p k-1 p k …… …… t 1 …… t i …… t j-1 t j …… Matched frame Dismatching New matched frameNext comparison As large as possible.

11 Fail Links Out of each node of KMP flowchart is a fail link, leading to node r, where r is the largest non-negative interger satisfying r

12 Computing the Fail Links Thinking recursively, let fail[k-1]=s: p 1 …… p s-1 p s p s+1 …… …… p 1 …… p k-r+1 …… p k-2 p k-1 p k …… p m To be compared Matched Case 1 p s =p k-1 fail[k]=s+1 Case 2: p s  p k-1 p 1 … p fail[s]-1 p fail[s] p 1 …… p s-1 p s p s+1 …… p 1 … p k-r+1 …… p k-2 p k-1 p k …… p m To be compared and thinking recursively

13 Recursion on Node fail[s] Thinking recursively, at the beginning, s=fail[k-1]: Case 2: p s  p k-1 p 1 … p fail[s]-1 p fail[s] p 1 …… p s-1 p s p s+1 …… p 1 … p k-r+1 …… p k-2 p k-1 p k …… p m p s is replaced by p fail[s], that is, new value assumed for s Then, proceeding on new s, that is: If case 1 applys (p s =p k-1 ): fail[k]=s+1, or If case 2 applys (p s  p k-1 ): another new s

14 Computing Fail Links: an Example Constructing the KMP flowchart for P = “ABABABCB” Assuming that fail[1] to fail[6] has been computed Get next text char. AABBABCB * fail[7]: ∵ fail[6]=4, and p 6 =p 4, ∴ fail[7]=fail[6]+1=5 (case 1) fail[8]: fail[7]=5, but p 7  p 5, so, let s=fail[5]=3, but p 7  p 3, keeping back, let s=fail[3]=1. Still p 7  p 1. Further, let s=fail[1]=0, so, fail[8]=0+1=1.(case 2)

15 Constructing KMP Flowchart Input: P, a string of characters; m, the length of P Output: fail, the array of failure links, filled void kmpSetup (char [] P, int m, int [] fail) int k, s; fail[1]=0; for (k=2; k  m; k++) s=fail[k-1]; while (s  1) if (p s = = p k-1 ) break; s=fail[s]; fail[k]=s+1; For loop executes m-1 times, and while loop executes at most m times since fail[s] is always less than s. So, the complexity is roughly O(m 2 ) For loop executes m-1 times, and while loop executes at most m times since fail[s] is always less than s. So, the complexity is roughly O(m 2 )

16 Number of Character Comparisons Success comparison: at most once for a specified k, totaling at most m-1 Success comparison: at most once for a specified k, totaling at most m-1 Unsuccess comparison: Always followed by decreasing of s. Since: s is initialed as 0, s increases by one each time s is never negative So, the counting of decreasing can not be larger than that of increasing Unsuccess comparison: Always followed by decreasing of s. Since: s is initialed as 0, s increases by one each time s is never negative So, the counting of decreasing can not be larger than that of increasing fail[1]=0; for (k=2; k  m; k++) s=fail[k-1]; while (s  1) if (p s = = p k-1 ) break; s=fail[s]; fail[k]=s+1; These 2 lines combine to increase s by 1, done m-2 times  2m-3

17 Input: P and T, the pattern and text; m, the length of P; fail: the array of failure links for P. Output: index in T where a copy of P begins, or -1 if no match int kmpScan(char[ ] P, char[ ] T, int m, int[ ] fail) int match, j,k; //j indexes T, and k indexes P match=-1; j=1; k=1; while (endText(T,j)=false) if (k>m) match=j-m; break; if (k= =0) j++; k=1; else if ( t j = =p k ) j++; k++; //one character matched else k=fail[k]; //following the failure link return match KMP Scan: the Algorithm Each time a new cycle begins, p 1,…p k-1 matched Executed at most 2n times, why?

18 Skipping Characters in String Matching If you wish to understand others you must … must Checking the characters in P, in reverse order must The copy of the P begins at t 38. Matching is achieved in 18 comparisons

19 Distance of Jumping Forward With the knowledge of P, the distance of jumping forward for the pointer of T is determined by the character itself, independent of the location in T. p 1 … A … A … p m p 1 … A … A … p s … p m t 1 …… t j =A …… …… t n  current j new j Rightmost ‘A’ charJump[‘A’] = m-k =pk=pk

20 Computing the Jump: Algorithm Input: Pattern string P; m, the length of P; alphabet size alpha=|  | Output: Array charJump, indexed 0,…, alpha-1, storing the jumping offsets for each char in alphabet. Input: Pattern string P; m, the length of P; alphabet size alpha=|  | Output: Array charJump, indexed 0,…, alpha-1, storing the jumping offsets for each char in alphabet. void computeJumps(char[ ] P, int m, int alpha, int[ ] charJump char ch; int k; for (ch=0; ch

21 Partially Matched Substring P: b a t s a n d c a t s T: …… d a t s …… matched suffix Current j charJump[‘d’]=4 New j Move only 1 char Remember the matched suffix, we can get a better jump P: b a t s a n d c a t s T: …… d a t s …… New j Move 7 chars

22 Forward to Match the Suffix p 1 …… p k p k+1 …… p m t 1 …… t j t j+1 …… …… t n  …… Matched suffix Dismatch Substring same as the matched suffix occurs in P p 1 …… p r p r+1 …… p r+m-k …… p m p 1 …… p k p k+1 …… p m t 1 …… t j t j+1 …… …… t n Old j New j slide[k] matchJump[k]

23 Partial Match for the Suffix p 1 …… p k p k+1 …… p m t 1 …… t j t j+1 …… …… t n  …… Matched suffix Dismatch No entire substring same as the matched suffix occurs in P p 1 …… p q …… p m p 1 …… p k p k+1 …… p m t 1 …… t j t j+1 …… …… t n Old j New j slide[k] matchJump[k] May be empty

24 matchjump and slide p 1 …… p r p r+1 …… p r+m-k …… p m p 1 …… p k p k+1 …… p m t 1 …… t j t j+1 …… …… t n Old j New j slide[k] matchJump[k] slide[k]: the distance P slides forward after dismatch at p k, with m-k chars matched to the right matchjump[k]: the distance j, the pointer of P, jumps, that is: matchjump[k]=slide[k]+m-k Let r(r

25 Computing matchJump: Example P = “ w o w w o w ” matchJump[6]=1 Direction of computing w o w t 1 …… t j ……  Matched is empty w o w matchJump[5]=3 w o w t 1 …… t j w …… Matched is 1 w o w  Slide[6]=1 (m-k)=0 pkpk pkpk Slide[5]=5-3=2 (m-k)=1

26 Computing matchJump: Example P = “ w o w w o w ” matchJump[4]=7 Direction of computing w o w t 1 …… t j o w ……  Matched is 2 w o w matchJump[3]=6 w o w t 1 …… t j w o w …… Matched is 3 w o w Not lined up  =pk=pk No found, but a prefix of length 1, so, Slide[4] = m-1=5 pkpk Slide[3]=3-0=3 (m-k)=3

27 Computing matchJump: Example P = “ w o w w o w ” matchJump[2]=7 Direction of computing w o w t 1 …… t j w w o w ……  Matched is 4 w o w matchJump[1]=8 w o w t 1 …… t j o w w o w …… Matched is 5 w o w  No found, but a prefix of length 3, so, Slide[2] = m-3=3 No found, but a prefix of length 3, so, Slide[1] = m-3=3

28 The Boyer-Moore Algorithm Void computeMatchjumps(char[] P, int m, int[] matchjump) int k, r, s, low, shift; int sufx=new int[m+1] for (k=1; k  m; k++) matchjump[k]=m+1; sufx[m]=m+1; for (k=m-1; k  0; k--) s=sufix[k+1] while (s  m) if (p k+1 ==p s ) break; matchjump[s] = min (matchjump[s], s-(k+1)); s = sufx[k]; sufx[k]=s-1; Sufx[k]=x means a substring starting from p k+1 matches suffix starting from p x+1 Computing slide[k] // computing prefix length is necessary; // change slide value to matchjump by addition;

29 Home Assignment pp


Download ppt "String Matching Algorithm : Design & Analysis [19]"

Similar presentations


Ads by Google