1 Reverse Colussi algorithm
Fastest pattern matching in strings, Colussi, L., Journal of Algorithms, Vol. 16, No. 2, 1994, pp
Advisor: Prof. R. C. T. Lee
Speaker: Y. K. Shie

2 The Reverse Colussi Algorithm solves the string matching problem in the spirit of the original Colussi Algorithm.

3 The Main Points of the Reverse Colussi Algorithm
1. It changes the bad character rule from matching one character to matching a pair of characters.
2. It divides the positions into special positions and non-special positions. Special positions allow a smaller number of steps to shift. The Reverse Colussi Algorithm processes the special positions first.

4 Note that the Colussi Algorithm does not consider all of the positions where the prefix function assumes the value -1. This is justified by the following fact: a position where the prefix function assumes -1 allows the largest number of steps to shift. Thus the Colussi Algorithm examines all positions which allow a smaller number of steps to shift, which is a safe action.

5 In this Reverse Colussi Algorithm, we define some points which are special and some points which are not special. Special points allow a smaller number of steps to shift than non-special points. Thus, in the Reverse Colussi Algorithm, we examine the special positions first. We shall make this clear later.

6 Ti is the ith character in T (1≦i≦n). Pj is the jth character in P (1≦j≦m). The bad character rule is like Rule 2-1, the Character Matching Rule.

7 Rule 2-1: Character Matching Rule (A Special Version of Rule 2)
For any character x in T, find the nearest x in P which is to the left of x in T.

8 Implication of Rule 2-1
Case 1: If there is an x in P to the left of x in T, move P so that the two x's match.

9 Case 2: If no such x exists in P, consider the partial window defined by x in T and the string to the left of it.
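
The two cases of Rule 2-1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: positions are 0-based, the function name `rule_2_1_shift` and its parameters are hypothetical, and Case 2 is interpreted as moving P just past x.

```python
def rule_2_1_shift(P, j, x):
    """Shift for Rule 2-1 when text character x is aligned with P[j].

    Case 1: the nearest x in P to the left of position j is moved under x.
    Case 2 (assumed reading): no such x exists, so P is moved just past x.
    """
    for k in range(j - 1, -1, -1):   # scan leftwards, starting at position j - 1
        if P[k] == x:
            return j - k             # Case 1: align P[k] with x
    return j + 1                     # Case 2: P starts right after x


if __name__ == "__main__":
    P = "GCAGAGAG"
    for x in "ACT":
        print(x, rule_2_1_shift(P, len(P) - 1, x))
```

For the last position of P = GCAGAGAG, the character A gives a shift of 1, C gives 6, and T, which does not occur in P, gives 8.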

10 rcBc table
Consider the following case, where the last character X of the window of T does not match the last character of P.

11 rcBc table
Suppose we successfully find an X in P.

12 rcBc table
Then we can move P so that the X in P is aligned with the X in T.

13 rcBc table
Suppose that, after this move, the last character Y of the window of T does not match the last character of P.

14 rcBc table
Then we try to find a pair of X and Y in P such that, after we move P, these X and Y in P match the X and Y in T.

15 Thus, the Reverse Colussi Algorithm uses a very special version of Rule 2, in which a pair of characters is matched.

16 How do we find this pair of characters in P?
We use the rcBc Table.

17 rcBc table
Y is the last character of the window of T. s is the length of the shift in the last step. k is an integer. Positions of P are counted from 0 here.
Case 1: If we can find P_{m-k-1} = Y and P_{m-k-s-1} = P_{m-s-1}, we fill the minimal k into rcBc[Y, s].
Case 2: If we can find P_{m-k-1} = Y and k > m-s-1, we fill the minimal k into rcBc[Y, s].
Case 3: Otherwise, we fill m into rcBc[Y, s].
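
The three cases can be turned into a small table builder. The Python sketch below is not taken from the slides: it assumes 0-based positions in P, assumes that case 2 fills the minimal k just as case 1 does, and uses the hypothetical name `build_rcbc`. Characters that do not occur in P get no row, since for them case 3 always applies.

```python
def build_rcbc(P):
    """Return a dict so that table[Y][s] is the shift rcBc[Y, s] for s = 1..m."""
    m = len(P)
    table = {}
    for y in sorted(set(P)):
        row = [m] * (m + 1)            # index 0 is unused; default is case 3
        for s in range(1, m + 1):
            for k in range(1, m):      # try the smallest shift first
                if P[m - k - 1] != y:
                    continue           # P must carry Y at position m-k-1
                # case 2: the remembered character falls off the left end of P
                # case 1: P carries the remembered character at position m-k-s-1
                if k > m - s - 1 or P[m - k - s - 1] == P[m - s - 1]:
                    row[s] = k
                    break
        table[y] = row
    return table


if __name__ == "__main__":
    rcbc = build_rcbc("GCAGAGAG")
    for y, row in rcbc.items():
        print(y, row[1:])
```

For P = GCAGAGAG this sketch gives rcBc[A, 1] = 8, rcBc[A, 2] = 5 and rcBc[A, 3] = 5.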

18 ex: s = 1, Y = A
X = A. XY = AA does not exist in P, so rcBc[A, 1] = 8.

19 ex: s = 2, Y = A
X = G. Looking for an X and a Y in P that are 2 positions apart: such a pair exists, so rcBc[A, 2] = 5.

20 ex: s = 3, Y = A
X = A. Looking for a Y in P: the Y with k = 5 qualifies (case 2, since k > m-s-1 = 4), so rcBc[A, 3] = 5.

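21 ex: the complete rcBc table for P = GCAGAGAG
Rows: present matched character of T (Y). Columns: length of previous shift (s).
Y \ s  1 2 3 4 5 6 7 8
A      8 5 5 3 3 3 1 1
C      8 6 6 6 6 6 6 6
G      2 2 2 4 4 2 2 2
T      8 8 8 8 8 8 8 8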

22 rcGs table
We build the rcGs table, which corresponds to the good suffix rules of the Boyer-Moore algorithm. The good suffix rules are like Rule 1, the Suffix to Prefix Rule, and Rule 2, the Substring Matching Rule.

23 Rule 1: The Suffix to Prefix Rule
For a window to have any chance to match a pattern, in some way, there must be a suffix of the window which is equal to a prefix of the pattern.

24 Rule 2: The Substring Matching Rule
For any substring u in T, find the nearest u in P which is to the left of it. If such a u exists in P, move P so that the two u's match; otherwise, we may define a new partial window.

25 A repeating suffix of a string S is a suffix which appears somewhere else in S.
For instance, ABA is a repeating suffix of CABAGTABA. BA is also a repeating suffix.

26 Let x be the character to the left of a repeating suffix. A repeating suffix u of S is a maximal repeating suffix if xu does not appear elsewhere in S. For instance, in CABAGTABA, ABA is a maximal repeating suffix because TABA does not appear anywhere else in S, while BA is not because ABA appears somewhere else in S.
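
The definition on this slide, taken literally, can be checked with a few lines of Python. This is only a sketch: the helper names `occurrences` and `maximal_repeating_suffixes` are hypothetical, and overlapping occurrences are counted.

```python
def occurrences(s, u):
    """Start positions of every (possibly overlapping) occurrence of u in s."""
    return [i for i in range(len(s) - len(u) + 1) if s.startswith(u, i)]


def maximal_repeating_suffixes(s):
    """Suffixes u that appear elsewhere in s while xu appears nowhere else."""
    result = []
    for length in range(1, len(s)):
        start = len(s) - length
        u = s[start:]
        if len(occurrences(s, u)) < 2:
            continue                      # u is not a repeating suffix
        xu = s[start - 1:]                # u extended by the character on its left
        if len(occurrences(s, xu)) == 1:  # xu occurs only as a suffix
            result.append(u)
    return result


if __name__ == "__main__":
    print(maximal_repeating_suffixes("CABAGTABA"))
```

For CABAGTABA the sketch reports ABA only: A and BA are repeating suffixes, but BA and ABA, respectively, appear elsewhere in the string.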

27 Given a pattern P, denote all positions to the left of maximal repeating suffixes of P as special positions. The Reverse Colussi Algorithm considers these special positions first. In this case (P = GCAGAGAG), we can see that the following suffixes are all maximal suffixes:
G (corresponding substring: G)
AG (corresponding substring: CAG)
AGAG (corresponding substring: CAGAG)

28 For P = GCAGAGAG (positions 0 to 7), the special positions are 3, 5 and 6.

29 For each maximal suffix u, let the last position of the corresponding substring be located at p. Then, if a mismatch occurs at the special position associated with u, we may move P by m-p-1 steps, where m is the length of P (Rule 2). ex: for P = GCAGAGAG, m = 8; for u = AGAG (special position 3), the substring associated with u ends at p = 5.

30 So we can move P by 8 - 5 - 1 = 2 steps. The number of steps moved for each special position is stored in a table, called hmin.

31 For the special position i = 3, we record its length of move 2 (8-5-1) as hmin[2] = 3.

32 For the special position i = 5, we record its length of move 4 (8-3-1) as hmin[4] = 5.

33 For the special position i = 6, we record its length of move 7 (8-0-1) as hmin[7] = 6.
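
The hmin entries of slides 31 to 33 can be reproduced with a short Python sketch. It assumes 0-based positions and, following those worked examples rather than a formal definition, accepts an earlier occurrence of the suffix u when it starts at the beginning of P or when the character to its left differs from the character at the special position. The function name `special_positions` and the dictionary form of hmin are illustrative choices.

```python
def special_positions(P):
    """Map each special position i to its shift m - p - 1."""
    m = len(P)
    shifts = {}
    for i in range(m - 1):                # the last position is handled by rcBc
        u = P[i + 1:]                     # suffix to the right of position i
        for q in range(i, -1, -1):        # rightmost earlier occurrence first
            if P.startswith(u, q) and (q == 0 or P[q - 1] != P[i]):
                p = q + len(u) - 1        # last position of that occurrence
                shifts[i] = m - p - 1
                break
    return shifts


if __name__ == "__main__":
    shifts = special_positions("GCAGAGAG")
    hmin = {shift: i for i, shift in shifts.items()}
    print(shifts)   # special position -> shift
    print(hmin)     # shift -> special position
```

For P = GCAGAGAG this yields the special positions 3, 5 and 6 with shifts 2, 4 and 7, that is hmin[2] = 3, hmin[4] = 5 and hmin[7] = 6.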

34 Note that for special positions, Rule 2 (substring matching rule) can be used.
For non-special positions, Rule 1 (suffix to prefix rule) can be used.

35 The basic idea of the Reverse Colussi Algorithm is as follows:
1. We consider special positions first and non-special positions next.
2. We use Rule 2 (substring matching rule) when we consider special positions.
3. We use Rule 1 (suffix to prefix rule) when we consider non-special positions.

36 After we compare the special positions, we must compare the remaining positions, called non-special positions. We compare those non-special positions from left to right. The number of steps moved for each non-special position is stored in a table, called rmin. The value of rmin can be found by Rule 1 (the suffix to prefix rule).

37 If a suffix S of P which lies entirely to the right of a non-special position i is equal to a prefix of P, rmin(i) = m-|S|, where |S| is the length of S and the longest such S is used. If no such S exists, rmin(i) = m.
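
A direct Python reading of this rule is sketched below. It assumes 0-based positions and picks the longest qualifying suffix; the function name `rmin` and the pattern ABABACABABA are hypothetical and only serve the illustration.

```python
def rmin(P, i):
    """Shift for a mismatch at non-special position i (Rule 1)."""
    m = len(P)
    # A suffix lying entirely to the right of i has at most m - i - 1 characters.
    for length in range(m - i - 1, 0, -1):
        if P[:length] == P[m - length:]:  # this suffix is also a prefix
            return m - length
    return m                              # no such suffix exists


if __name__ == "__main__":
    print(rmin("GCAGAGAG", 0))            # suffix G is a prefix: 8 - 1
    P = "ABABACABABA"                     # hypothetical pattern, m = 11
    print([rmin(P, i) for i in (0, 6, 9)])
```

The hypothetical pattern ABABACABABA has suffixes of lengths 5, 3 and 1 that are also prefixes, so positions further to the right fall back to shorter suffixes and larger shifts (6, then 8, then 10).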

38 ex1: P = GCAGAGAG, m = 8
The suffix S = G is equal to a prefix and lies to the right of some non-special positions, so the values of rmin of these non-special positions are m-|S| = 8-1 = 7.

39 ex2: m = 11
A suffix S with |S| = 5 is equal to a prefix and lies to the right of some non-special positions, so the values of rmin of these non-special positions are m-|S| = 11-5 = 6.

40 ex2: m = 11
We find a shorter suffix S with |S| = 3 which is equal to a prefix and lies to the right of some further non-special positions, so the values of rmin of these non-special positions are m-|S| = 11-3 = 8.

41 ex2: m = 11
And we find an even shorter suffix S with |S| = 1 which is equal to a prefix and lies to the right of some further non-special positions, so the values of rmin of these non-special positions are m-|S| = 11-1 = 10.

42 ex3: m = 12
No suffix is equal to any prefix, so the values of all non-special positions in rmin are m = 12.

43 rcGs table
After we build those tables, we can use them to build the rcGs table. ex: P = GCAGAGAG. The table has one column per index i and the rows Pi, hmin[i], rmin[i] and rcGs[i].

44 rcGs table
First, for every i where hmin[i] is nonempty, we fill that index i into the rcGs table.

45 rcGs table
Second, for every i where rmin[i] is nonempty, we fill that rmin value into the rcGs table.

46 rcGs table
If P exactly matches T, we can move P by Rule 1. Therefore, we fill rcGs[8] = m-|S| = 8-1 = 7.
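
The filling order described on slides 44 to 46 can be sketched in Python as follows. The inputs are the values worked out earlier for P = GCAGAGAG; the function name `build_rcgs`, the 0-based list layout, and the choice to leave out the last position of P (which the search example checks through rcBc) are assumptions of this sketch.

```python
def build_rcgs(m, special_shift, rmin, full_match_shift):
    """Return (comparison order, rcGs list) for positions 0..m-2 of P."""
    # Special positions come first, ordered by their shift.
    order = sorted(special_shift, key=special_shift.get)
    # Non-special positions follow, from left to right.
    order += [i for i in range(m - 1) if i not in special_shift]
    rcgs = [special_shift[i] if i in special_shift else rmin[i] for i in order]
    rcgs.append(full_match_shift)         # shift used after an exact match
    return order, rcgs


if __name__ == "__main__":
    special_shift = {3: 2, 5: 4, 6: 7}    # special position -> shift
    rmin = {0: 7, 1: 7, 2: 7, 4: 7}       # non-special position -> rmin
    order, rcgs = build_rcgs(8, special_shift, rmin, 7)
    print(order)
    print(rcgs)
```

Entry t of the returned list (counting from 0) plays the role of rcGs[t+1] on the slides, so the first entry is rcGs[1] = 2 and the last is rcGs[8] = 7.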

47 ex: searching T for P with m = 8, starting with s = m = 8 and using the rcBc and rcGs tables.

48 ex: Shift by 1 (rcBc[A][s], s = 8), and change s = 1

49 ex: Shift by 2 (rcGs[1]), and change s = 2

50 ex: Shift by 2 (rcGs[1]), and change s = 2

51 ex: Shift by 7 (rcGs[8]), and change s = 7

52 ex: Shift by 2 (rcGs[1]), and change s = 2

53 ex: Shift by 5 (rcBc[A][s], s = 2), and change s = 5
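
The whole search loop can be sketched in Python as below. This is an illustrative reconstruction from the rules in these slides, not the authors' implementation: positions are 0-based, all function names are hypothetical, the preprocessing is written for clarity rather than for the stated time bound, and the sample text in the usage lines is an assumed input chosen for this sketch.

```python
def build_rcbc(P):
    m = len(P)
    table = {}
    for y in set(P):
        row = [m] * (m + 1)                          # default: case 3
        for s in range(1, m + 1):
            for k in range(1, m):
                if P[m - k - 1] == y and (
                        k > m - s - 1 or P[m - k - s - 1] == P[m - s - 1]):
                    row[s] = k                       # minimal k of case 1 or 2
                    break
        table[y] = row
    return table


def special_positions(P):
    m, shifts = len(P), {}
    for i in range(m - 1):
        u = P[i + 1:]
        for q in range(i, -1, -1):                   # rightmost occurrence first
            if P.startswith(u, q) and (q == 0 or P[q - 1] != P[i]):
                shifts[i] = i + 1 - q                # equals m - p - 1
                break
    return shifts


def longest_border(P, max_len):
    for length in range(min(max_len, len(P) - 1), 0, -1):
        if P[:length] == P[len(P) - length:]:
            return length
    return 0


def reverse_colussi_search(P, T):
    """Return (match positions, list of shifts performed)."""
    m, n = len(P), len(T)
    rcbc = build_rcbc(P)
    special = special_positions(P)
    order = sorted(special, key=special.get)
    order += [i for i in range(m - 1) if i not in special]
    rcgs = [special[i] if i in special else m - longest_border(P, m - i - 1)
            for i in order]
    rcgs.append(m - longest_border(P, m - 1))        # shift after a full match
    matches, shifts = [], []
    j, s = 0, m
    while j <= n - m:
        y = T[j + m - 1]
        if y != P[m - 1]:                            # last character: use rcBc
            s = rcbc[y][s] if y in rcbc else m
        else:                                        # other positions: use rcGs
            t = 0
            while t < m - 1 and P[order[t]] == T[j + order[t]]:
                t += 1
            if t == m - 1:
                matches.append(j)
            s = rcgs[t]
        shifts.append(s)
        j += s
    return matches, shifts


if __name__ == "__main__":
    # Assumed sample text, not taken from the slides.
    print(reverse_colussi_search("GCAGAGAG", "GCATCGCAGAGAGTATACAGTACG"))
```

With the assumed text above, the sketch performs the shifts 1, 2, 2, 7, 2, 5, the same sequence as in the example, and reports one match at position 5.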

54 Time complexity
The preprocessing phase takes O(m^2) time and O(mσ) space. The searching phase takes O(n) time, with at most 2n text character comparisons in the worst case.


56 Thank you~

