1
**Algorithm : Design & Analysis [19]**

String Matching

2
**In the last class…**

- Optimal Binary Search Tree
- Separating Sequence of Words
- Dynamic Programming Algorithms

3
**String Matching**

- Simple String Matching
- KMP
- Flowchart Construction
- Jump at Fail
- KMP Scan

4
**String Matching: Problem Description**

Search the text T, a string of characters of length n, for the pattern P, a string of characters of length m (usually $m \ll n$).

The result: if T contains P as a substring, return the index in T at which the substring starts; otherwise, fail.

5
**Straightforward Solution**

[Figure: P = $p_1 \dots p_m$ aligned under a window of T = $t_1 \dots t_n$ that starts at $t_i$, the first matched character. The matched window expands to the right, and the next comparison is $p_k$ against $t_{i+k-1}$.]

Note: if $p_k$ fails to match $t_{i+k-1}$, backtracking occurs, and a new cycle of character matching starts from $t_{i+1}$. In the worst case, nearly n backtrackings occur with nearly m-1 comparisons in each cycle, so the worst-case cost is $\Theta(mn)$.
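The straightforward solution can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the name `simple_match` and the decision to return a 1-based position (to agree with the $t_1 \dots t_n$ numbering) or -1 on failure are assumptions.

```python
def simple_match(text, pattern):
    """Return the 1-based start of the first copy of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):          # i is the 0-based start of the window
        k = 0
        # Expand the matched window to the right, one character at a time.
        while k < m and text[i + k] == pattern[k]:
            k += 1
        if k == m:                      # every pattern character matched
            return i + 1
        # Otherwise backtrack: the next cycle restarts one position later.
    return -1


print(simple_match("ACABAABABA", "ABAB"))    # 1-based start of the match
print(simple_match("ACABAABABA", "ABABCB"))  # -1, no match
```

Each window restart discards everything learned from the previous cycle, which is where the $\Theta(mn)$ worst case comes from.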

6
**Disadvantages of Backtracking**

- More comparisons are needed.
- Up to m-1 of the most recently matched characters have to be readily available for re-examination (consider texts that are too long to be loaded in their entirety).

7
**An Intuitive Finite Automaton for Matching a Given Pattern**

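Why no backtracking? The automaton memorizes the matched prefix.

[Figure: automaton for the pattern "AABC" over the alphabet {A, B, C}. From the start node, nodes 1 to 4 are linked by the edges A, A, B, C and lead to the stop node * (matched!); the remaining edges, labeled A, B, C or B,C, lead back to earlier nodes.]

Advantage: each character in the text is checked only once.

Difficulty: construction of the automaton. Too many edges (for a large alphabet) have to be defined and stored.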
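To make the trade-off concrete, here is a small Python sketch of such an automaton. It is an illustration under assumptions: the functions `build_automaton` and `automaton_match` are hypothetical names, states are numbered 0..m by the number of matched characters (the slide numbers its nodes from 1), the table is built by brute force, and the text is assumed to contain only characters of the given alphabet.

```python
def build_automaton(pattern, alphabet):
    """delta[q][c] = length of the longest prefix of pattern that is a
    suffix of pattern[:q] + c (state q means q characters are matched)."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for c in alphabet:
            seen = pattern[:q] + c
            k = min(m, q + 1)
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[c] = k
        delta.append(row)
    return delta


def automaton_match(text, pattern, alphabet):
    """Return the 1-based start of the first match, or -1."""
    m = len(pattern)
    if m == 0:
        return 1
    delta = build_automaton(pattern, alphabet)
    state = 0
    for j, c in enumerate(text, start=1):   # each text character is read once
        state = delta[state][c]
        if state == m:
            return j - m + 1
    return -1


print(build_automaton("AABC", "ABC")[2])       # edges out of the state "AA"
print(automaton_match("AAABC", "AABC", "ABC"))
```

The table holds (m+1)·|alphabet| entries, which is the storage cost the slide points to for large alphabets.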

8
**The Knuth-Morris-Pratt Flowchart**

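[Figure: KMP flowchart for P = "ABABCB". Nodes 1 to 6 are labeled A, B, A, B, C, B and are followed by the success node *. Each node has a Success link to the next node and a Failure link to an earlier node; the "get next text char." node feeds node 1.]

An example: T = "ACABAABABA", P = "ABABCB".

| KMP cell number | Text being scanned | Success or Failure |
|---|---|---|
| 1 | A | s |
| 2 | C | f |
| 1 | C | f |
| 0 | C | get next char. |
| 1 | A | s |
| 2 | B | s |
| 3 | A | s |
| 4 | A | f |
| 2 | A | f |
| 1 | A | s |
| 2 | B | s |
| 3 | A | s |
| 4 | B | s |
| 5 | A | f |
| 3 | A | s |
| 4 | - | F |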

9
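**Matched Frame**

With P = ABABABCB aligned against T = … ABABAB x …, the matched frame is ABABAB and x is the character to be compared next.

If x is not C, the matched frame moves to the right by 2 characters, leaving ABAB matched. This is equivalent to moving the pointer for P backward.

Moving by 4 characters may result in an error: for T = ABABABABCB …, the copy of P that starts at the third character would be skipped.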

10
**Sliding the Matched Frame**

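When a mismatch occurs:

[Figure: $p_1 \dots p_{k-1}$ matches $t_i \dots t_{j-1}$ (the matched frame), and $p_k$ mismatches $t_j$.]

The matched frame slides, with its breadth changed as well:

[Figure: after the slide, $p_1 \dots p_{r-1}$ lines up with $p_{k-r+1} \dots p_{k-1}$, that is, with $t_{j-r+1} \dots t_{j-1}$. This new matched frame is as large as possible, and the next comparison is $p_r$ against $t_j$.]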

11
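**Fail Links**

Out of each node of the KMP flowchart is a fail link. The fail link of node k leads to node r, where r is the largest non-negative integer satisfying r < k such that $p_1, \dots, p_{r-1}$ matches $p_{k-r+1}, \dots, p_{k-1}$. This value is stored in fail[k].

Which means: when the scan fails at node k, the next comparison is $t_j$ vs. $p_r$ ($p_r$ takes the place of $p_k$).

Note: r is independent of T.

[Figure: P slides forward along T by k-r, which corresponds to moving the pointer for P backward from k to r.]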
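The definition of the fail link can be checked directly with a brute-force Python sketch. The function name `fail_by_definition` is hypothetical; the list it returns is 1-based with index 0 unused, and fail[1] = 0 is taken as the convention used by the setup algorithm.

```python
def fail_by_definition(pattern):
    """fail[k] = largest r < k such that p_1..p_{r-1} equals p_{k-r+1}..p_{k-1}.
    The returned list is 1-based; index 0 is unused and fail[1] is 0."""
    m = len(pattern)
    fail = [0] * (m + 1)
    for k in range(2, m + 1):
        for r in range(k - 1, 0, -1):       # try the largest r first
            # pattern[:r-1] is p_1..p_{r-1}; pattern[k-r:k-1] is p_{k-r+1}..p_{k-1}
            if pattern[: r - 1] == pattern[k - r : k - 1]:
                fail[k] = r
                break                        # r = 1 always succeeds (empty match)
    return fail


print(fail_by_definition("ABABABCB")[1:])
```

This costs far more than the recursive computation, but it depends only on P, which illustrates the note that r is independent of T.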

12
**Computing the Fail Links**

Thinking recursively, let fail[k-1] = s, so that $p_1 \dots p_{s-1}$ matches $p_{k-s} \dots p_{k-2}$. The characters to be compared are $p_s$ and $p_{k-1}$.

Case 1: $p_s = p_{k-1}$. The match extends by one character, so fail[k] = s+1.

Case 2: $p_s \neq p_{k-1}$. Fall back to the shorter match $p_1 \dots p_{fail[s]-1}$, compare $p_{fail[s]}$ with $p_{k-1}$, and keep thinking recursively.

13
**Recursion on Node fail[s]**

Thinking recursively, at the beginning s = fail[k-1].

Case 2: $p_s \neq p_{k-1}$. Then $p_s$ is replaced by $p_{fail[s]}$, that is, s assumes the new value fail[s].

Then proceed with the new s:

- if case 1 applies ($p_s = p_{k-1}$): fail[k] = s+1, or
- if case 2 applies ($p_s \neq p_{k-1}$): take another new s.

14
**Computing Fail Links: an Example**

Constructing the KMP flowchart for P = "ABABABCB", assuming that fail[1] to fail[6] have been computed.

[Figure: KMP flowchart with nodes 1 to 8 labeled A, B, A, B, A, B, C, B, followed by the success node * as node 9.]

fail[7]: since fail[6] = 4 and $p_6 = p_4$, fail[7] = fail[6] + 1 = 5 (case 1).

fail[8]: fail[7] = 5, but $p_7 \neq p_5$, so let s = fail[5] = 3. But $p_7 \neq p_3$, so, falling back again, let s = fail[3] = 1. Still $p_7 \neq p_1$. Further, let s = fail[1] = 0, so fail[8] = 0 + 1 = 1 (case 2).

15
**Constructing KMP Flowchart**

Input: P, a string of characters; m, the length of P.

Output: fail, the array of failure links, filled.

    void kmpSetup(char[] P, int m, int[] fail)
        int k, s;
        fail[1] = 0;
        for (k = 2; k <= m; k++)
            s = fail[k-1];
            while (s >= 1)
                if (p_s == p_{k-1}) break;
                s = fail[s];
            fail[k] = s + 1;

The for loop executes m-1 times, and the while loop executes at most m times in each iteration, since fail[s] is always less than s. So the complexity is roughly $O(m^2)$.
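A direct Python transcription of the setup procedure is sketched below. The name `kmp_setup` and the padding trick are assumptions made for illustration: a dummy character is put in front of the pattern so that `p[k]` corresponds to $p_k$, and the returned list leaves index 0 unused.

```python
def kmp_setup(pattern):
    """Return the failure links as a 1-based list (index 0 unused)."""
    m = len(pattern)
    p = " " + pattern            # pad so that p[k] is p_k
    fail = [0] * (m + 1)         # fail[1] = 0
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1:
            if p[s] == p[k - 1]:     # case 1: the match extends
                break
            s = fail[s]              # case 2: fall back to a shorter match
        fail[k] = s + 1
    return fail


print(kmp_setup("ABABABCB")[1:])
```

For P = "ABABABCB" this gives fail[7] = 5 and fail[8] = 1, the two values of the worked example.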

16
**Number of Character Comparisons**

The body of kmpSetup, for reference:

    fail[1] = 0;
    for (k = 2; k <= m; k++)
        s = fail[k-1];
        while (s >= 1)
            if (p_s == p_{k-1}) break;
            s = fail[s];
        fail[k] = s + 1;

Successful comparisons: at most one for a specified k, totaling at most m-1.

Unsuccessful comparisons: each is always followed by a decrease of s. Since

- s is initialized to 0,
- the two lines fail[k] = s+1 and s = fail[k-1] combine to increase s by 1, which is done m-2 times, and
- s is never negative,

the number of decreases cannot be larger than the number of increases, so there are at most m-2 unsuccessful comparisons.

In total, there are at most (m-1) + (m-2) = 2m-3 character comparisons.
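The 2m-3 bound can be observed empirically by instrumenting the setup loop. The sketch below is illustrative only: `count_setup_comparisons` is a hypothetical helper that repeats the setup logic and counts every evaluation of $p_s = p_{k-1}$.

```python
def count_setup_comparisons(pattern):
    """Run the fail-link computation and count character comparisons."""
    m = len(pattern)
    p = " " + pattern            # pad so that p[k] is p_k
    fail = [0] * (m + 1)
    comparisons = 0
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1:
            comparisons += 1     # one comparison of p_s with p_{k-1}
            if p[s] == p[k - 1]:
                break
            s = fail[s]
        fail[k] = s + 1
    return comparisons


for pat in ["ABABABCB", "AAAAAAAB", "wowwow"]:
    m = len(pat)
    print(pat, count_setup_comparisons(pat), "bound:", 2 * m - 3)
```

The bound applies for m ≥ 2; a one-character pattern needs no comparisons at all.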

17
**KMP Scan: the Algorithm**

Input: P and T, the pattern and text; m, the length of P; fail, the array of failure links for P.

Output: index in T where a copy of P begins, or -1 if no match.

    int kmpScan(char[] P, char[] T, int m, int[] fail)
        int match, j, k;   // j indexes T, and k indexes P
        match = -1;
        j = 1; k = 1;
        while (endText(T, j) == false)
            // Each time a new cycle begins, p_1, …, p_{k-1} are matched.
            if (k > m)
                match = j - m;
                break;
            if (k == 0)
                j++; k = 1;
            else if (t_j == p_k)
                j++; k++;      // one character matched
            else
                k = fail[k];   // following the failure link
        return match;

The loop body is executed at most 2n times. Why? Each iteration either advances j, which can happen at most n times, or follows a failure link, which decreases k; since k increases only together with j, failure links are followed at most n times.
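The scan can be sketched in Python as follows. Several details are assumptions rather than part of the slides: the failure links are recomputed inside the block so that it is self-contained, `endText(T, j)` is taken to mean `j > n`, and a final check is added after the loop because, under that reading, a copy of P ending exactly at the last character of T would otherwise go unreported.

```python
def kmp_setup(pattern):
    """Failure links as a 1-based list (index 0 unused)."""
    m = len(pattern)
    p = " " + pattern
    fail = [0] * (m + 1)
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1 and p[s] != p[k - 1]:
            s = fail[s]
        fail[k] = s + 1
    return fail


def kmp_scan(text, pattern):
    """Return the 1-based index in text where a copy of pattern begins, or -1."""
    n, m = len(text), len(pattern)
    fail = kmp_setup(pattern)
    t, p = " " + text, " " + pattern     # pad so that t[j] is t_j, p[k] is p_k
    j = k = 1
    while j <= n:                        # assumed meaning of endText(T, j) == false
        if k > m:
            return j - m
        if k == 0:
            j += 1
            k = 1
        elif t[j] == p[k]:
            j += 1
            k += 1                       # one character matched
        else:
            k = fail[k]                  # follow the failure link
    return j - m if k > m else -1        # assumption: match ending at end of text


print(kmp_scan("ACABAABABA", "ABABCB"))   # -1
print(kmp_scan("ABABABABCB", "ABABABCB"))
```

Note that j never moves backward, so the text can be consumed as a stream.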

18
**Skipping Characters in String Matching**

[Figure: the pattern "must" shown at 12 successive alignments under the text "If you wish to understand others you must …".]

The characters in P are checked in reverse order. The copy of P begins at $t_{38}$. Matching is achieved in 18 comparisons.

19
**Distance of Jumping Forward**

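With the knowledge of P, the distance by which the pointer of T jumps forward is determined by the text character itself, independent of its location in T.

[Figure: the current $t_j$ is 'A'. P contains several occurrences of 'A', the rightmost one at $p_k$. P slides so that this rightmost 'A' lines up with $t_j$, and the new j is aligned with $p_m$.]

charJump['A'] = m-k, where $p_k$ is the rightmost 'A' in P.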

20
**Computing the Jump: Algorithm**

Input: pattern string P; m, the length of P; alphabet size alpha = $|\Sigma|$.

Output: array charJump, indexed 0, …, alpha-1, storing the jumping offsets for each character in the alphabet.

    void computeJumps(char[] P, int m, int alpha, int[] charJump)
        char ch; int k;
        for (ch = 0; ch < alpha; ch++)
            charJump[ch] = m;        // for all characters not in P, jump by m
        for (k = 1; k <= m; k++)
            charJump[p_k] = m - k;

The running time is $\Theta(|\Sigma| + m)$. The increasing order of k ensures that, for symbols duplicated in P, the jump is computed according to the rightmost occurrence.
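The jump table can be tried out on the "must" example with the Python sketch below. Two things here are assumptions, not taken from the slides: a dictionary with default m stands in for the array indexed by the alphabet, and the scan loop `simple_bm_scan` (right-to-left comparison, jumping by the larger of charJump and m-k+1 so that P always moves forward) is one plausible way to use charJump alone.

```python
def compute_jumps(pattern):
    """char_jump[c] = m - k for the rightmost occurrence p_k of c in pattern.
    Characters absent from the dictionary jump by m."""
    m = len(pattern)
    char_jump = {}
    for k, ch in enumerate(pattern, start=1):   # increasing k: rightmost wins
        char_jump[ch] = m - k
    return char_jump


def simple_bm_scan(text, pattern):
    """Return (1-based match position or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 1, 0
    char_jump = compute_jumps(pattern)
    t, p = " " + text, " " + pattern            # pad so that t[j] is t_j
    comparisons = 0
    j = k = m
    while j <= n:
        comparisons += 1
        if t[j] == p[k]:
            if k == 1:
                return j, comparisons           # whole pattern matched
            j -= 1
            k -= 1
        else:
            # Jump by charJump, but always far enough to move P forward.
            j += max(char_jump.get(t[j], m), m - k + 1)
            k = m
    return -1, comparisons


print(compute_jumps("must"))
print(simple_bm_scan("If you wish to understand others you must", "must"))
```

On this text the sketch finds the copy of P at position 38 after 18 comparisons.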

21
**Partially Matched Substring**

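P: b a t s a n d c a t s

T: …… d a t s ……

The matched suffix is "ats", and the current j points at the 'd' in T. Since charJump['d'] = 4, the new j is only 4 positions ahead, so P moves only 1 character.

Remembering the matched suffix, we can get a better jump: the "ats" of "bats" is lined up with the matched text, so P moves 7 characters.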

22
**Forward to Match the Suffix**

[Figure: $p_{k+1} \dots p_m$ is the matched suffix, and $p_k$ mismatches $t_j$. A substring identical to the matched suffix occurs earlier in P, as $p_{r+1} \dots p_{r+m-k}$. P slides forward by slide[k] so that this substring lines up with $t_{j+1} \dots$, and j jumps from the old j to the new j by matchJump[k].]

23
**Partial Match for the Suffix**

[Figure: $p_{k+1} \dots p_m$ is the matched suffix, and $p_k$ mismatches $t_j$. No entire substring identical to the matched suffix occurs in P. Instead, a prefix $p_1 \dots p_q$ of P (which may be empty) is lined up with the end of the matched text. P slides forward by slide[k], and j jumps from the old j to the new j by matchJump[k].]

24
**matchJump and slide**

[Figure: P before and after the slide, with the substring $p_{r+1} \dots p_{r+m-k}$ lined up with the matched text; slide[k] is the distance between the two positions of P, and matchJump[k] is the distance from the old j to the new j.]

- slide[k]: the distance P slides forward after a mismatch at $p_k$, with m-k characters matched to the right.
- matchJump[k]: the distance that j, the pointer of T, jumps, that is: matchJump[k] = slide[k] + m - k.

Let r (r < k) be the largest index such that $p_{r+1}$ starts a substring matching the matched suffix of P, and $p_r \neq p_k$; then slide[k] = k - r.

If no such r is found, the longest prefix of P, of length q, that matches the end of the matched suffix of P is lined up. Then slide[k] = m - q.
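The two rules can be turned into a brute-force Python sketch that follows the definition literally. The name `match_jump_by_definition` is hypothetical, and one detail is an assumption: r = 0 is accepted as a candidate (there is no $p_0$, so the condition $p_r \neq p_k$ is treated as satisfied), which is what the case slide[3] = 3 - 0 for "wowwow" suggests.

```python
def match_jump_by_definition(pattern):
    """Return matchJump as a 1-based list (index 0 unused), computed
    directly from the definitions of slide[k] and matchJump[k]."""
    m = len(pattern)
    jump = [0] * (m + 1)
    for k in range(1, m + 1):
        suffix = pattern[k:]             # the matched suffix p_{k+1}..p_m
        length = m - k
        slide = None
        for r in range(k - 1, -1, -1):   # largest r first
            # pattern[r:r+length] is p_{r+1}..p_{r+m-k}
            same = pattern[r : r + length] == suffix
            differs = r == 0 or pattern[r - 1] != pattern[k - 1]
            if same and differs:
                slide = k - r
                break
        if slide is None:
            # Longest prefix of P that matches the end of the matched suffix.
            q = max(q for q in range(length + 1)
                    if pattern[:q] == pattern[m - q :])
            slide = m - q
        jump[k] = slide + length
    return jump


print(match_jump_by_definition("wowwow")[1:])
print(match_jump_by_definition("batsandcats")[8])   # mismatch at 'c', "ats" matched
```

For "batsandcats" with "ats" matched, slide[8] = 7, which is the 7-character move of the earlier example.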

25
**Computing matchJump: Example**

P = "wowwow". The values are computed from right to left, from k = 6 down to k = 1.

k = 6: the matched suffix is empty, so m-k = 0. slide[6] = 1 and matchJump[6] = 1.

k = 5: the matched text is "w" (1 character), so m-k = 1. slide[5] = 5-3 = 2 and matchJump[5] = 3.

26
**Computing matchJump: Example**

P = "wowwow", continued.

k = 4: the matched text is "ow" (2 characters). The earlier occurrence of "ow" is preceded by a character equal to $p_k$, so it is not lined up. No substring is found, but there is a prefix of length 1, so slide[4] = m-1 = 5 and matchJump[4] = 7.

k = 3: the matched text is "wow" (3 characters), so m-k = 3. slide[3] = 3-0 = 3 and matchJump[3] = 6.

27
**Computing matchJump: Example**

P = "wowwow", continued.

k = 2: the matched text is "wwow" (4 characters). No substring is found, but there is a prefix of length 3, so slide[2] = m-3 = 3 and matchJump[2] = 7.

k = 1: the matched text is "owwow" (5 characters). No substring is found, but there is a prefix of length 3, so slide[1] = m-3 = 3 and matchJump[1] = 8.

28
**The Boyer-Moore Algorithm**

    void computeMatchJumps(char[] P, int m, int[] matchJump)
        int k, r, s, low, shift;
        int[] sufx = new int[m+1];
        for (k = 1; k <= m; k++)
            matchJump[k] = m + 1;
        sufx[m] = m + 1;
        // Computing slide[k]
        for (k = m-1; k >= 0; k--)
            s = sufx[k+1];
            while (s <= m)
                if (p_{k+1} == p_s) break;
                matchJump[s] = min(matchJump[s], s - (k+1));
                s = sufx[s];
            sufx[k] = s - 1;
        // computing prefix length is necessary;
        // change slide value to matchJump by addition;

sufx[k] = x means that the substring starting from $p_{k+1}$ matches the suffix starting from $p_{x+1}$.
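A runnable Python sketch of the whole computation is given below. The first loop follows the pseudocode; the last two steps are only named in the slide's comments, so their bodies here are a reconstruction and should be read as an assumption: the prefix step walks the chain sufx[0], sufx[sufx[0]], … to cap the slide values, and the final loop adds m-k.

```python
def compute_match_jumps(pattern):
    """Return matchJump as a 1-based list (index 0 unused)."""
    m = len(pattern)
    p = " " + pattern                    # pad so that p[k] is p_k
    match_jump = [m + 1] * (m + 1)       # impossibly large slide to start with
    sufx = [0] * (m + 1)
    sufx[m] = m + 1
    # Slides caused by an earlier occurrence of the matched suffix.
    for k in range(m - 1, -1, -1):
        s = sufx[k + 1]
        while s <= m:
            if p[k + 1] == p[s]:
                break
            match_jump[s] = min(match_jump[s], s - (k + 1))
            s = sufx[s]
        sufx[k] = s - 1
    # Assumed prefix step: slides caused by a prefix of P matching
    # the end of the matched suffix.
    low, shift = 1, sufx[0]
    while shift <= m:
        for k in range(low, shift + 1):
            match_jump[k] = min(match_jump[k], shift)
        low, shift = shift + 1, sufx[shift]
    # Assumed final step: turn slide[k] into matchJump[k] by adding m - k.
    for k in range(1, m + 1):
        match_jump[k] += m - k
    return match_jump


print(compute_match_jumps("wowwow")[1:])
```

For "wowwow" the result agrees with the values worked out by hand for k = 1 to 6.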

29
**Home Assignment**

pp.508-: 11.4, 11.8, 11.9, 11.13, 11.18
