Advisor: Prof. R. C. T. Lee Speaker: Y. L. Chen


1 Colussi Algorithm
Based on: Colussi, L., "Correctness and Efficiency of Pattern Matching Algorithms", Information and Computation, Vol. 95, 1991.

2 The main principle of Colussi Algorithm
We point out that there are positions where a large number of jumps is allowed. We first process the positions where only a small number of jumps is allowed. It is obviously safe to do so. Besides, processing the positions in this order lets us look into the future.

3 The Colussi Algorithm is a modification of the KMP Algorithm
In the KMP Algorithm, we always construct the KMP function. For instance, for the case of ATCATCATCA, the KMP function is as follows:
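The KMP function can be computed with the usual preprocessing loop. The following is a minimal sketch, assuming the variant of the KMP function that takes the value -1 (as in the later slides) and 0-based positions; the name `kmp_table` is chosen here for illustration and does not come from the slides.

```python
def kmp_table(p):
    """Return the KMP function of p, using -1 where no useful border exists."""
    m = len(p)
    kmp = [-1] * m
    i, j = 0, -1
    while i < m - 1:
        # Fall back through shorter borders until p[i] can extend one.
        while j > -1 and p[i] != p[j]:
            j = kmp[j]
        i += 1
        j += 1
        # If p[i] == p[j], a mismatch at i would also be a mismatch at j,
        # so reuse the value already computed for j.
        kmp[i] = kmp[j] if p[i] == p[j] else j
    return kmp


print(kmp_table("ATCATCATCA"))  # [-1, 0, 0, -1, 0, 0, -1, 0, 0, -1]
```

Under these assumptions, positions 0, 3, 6 and 9 of ATCATCATCA receive the value -1.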

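4 Condition for KMP[i] = -1
KMP[i] = -1 if Condition A holds and, for every j which satisfies Condition B, Condition C also holds.
Condition A: p0 = pi
Condition B: p(0, j) is a proper suffix of p(0, i-1)
Condition C: pj+1 = pi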

5 KMP[4] = -1 because it satisfies the condition.
i     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P     b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
KMP  -1  0 -1  1 -1
There is no proper suffix of p(0, 3) which is equal to a prefix of p(0, 3), so Conditions B and C have nothing to check. p0 = p4, so Condition A holds. KMP[4] = -1 because it satisfies the condition.

6 KMP[15] = -1 because it satisfies the condition.
i     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P     b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
KMP  -1  0 -1  1 -1  0 -1  1  4 -1  0 -1  1 -1  0 -1
There are two proper suffixes of p(0, 14) which are equal to a prefix of p(0, 14): p(0, 1) = p(13, 14) and p(0, 5) = p(9, 14) (Condition B). For p(0, 5), we have p6 = p15; for p(0, 1), we have p2 = p15 (Condition C). Also, p0 = p15 (Condition A). KMP[15] = -1 because it satisfies the condition.
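The three conditions can be checked directly by brute force. The sketch below is only an illustration of Conditions A, B and C with 0-based positions; the function name `is_hole_by_conditions` is hypothetical, and a real implementation would compute the KMP function in linear time instead.

```python
def is_hole_by_conditions(p, i):
    """Return True if position i satisfies the condition for KMP[i] = -1."""
    if p[0] != p[i]:                    # Condition A: p0 = pi
        return False
    for j in range(i - 1):              # candidate prefixes p(0, j), j <= i - 2
        prefix = p[:j + 1]
        if p[:i].endswith(prefix):      # Condition B: p(0, j) is a suffix of p(0, i-1)
            if p[j + 1] != p[i]:        # Condition C must hold for this j
                return False
    return True


P = "bcbabcbaebcbabcba"
print([i for i in range(len(P)) if is_hole_by_conditions(P, i)])
# [0, 2, 4, 6, 9, 11, 13, 15]
```

Positions 4 and 15, the two cases discussed above, both appear in the result.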

7 First, construct the preprocessing tables: the Kmp, Kmin, Rmin and Shift functions.
Second, the set of pattern positions is divided into two disjoint subsets. Each attempt then consists of two phases. In the first phase, comparisons are performed from left to right between the text characters and the pattern positions for which the value of the Kmp function is strictly greater than -1. These positions are called noholes. If a mismatch happens in the first phase, we shift according to the Shift function. If all noholes match, we go to the second phase. The second phase compares the remaining positions (called holes) from right to left. If a mismatch happens in the second phase, we again shift according to the Shift function.
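The two phases can be sketched as follows. This is a simplified illustration, not Colussi's full algorithm: it restarts the comparisons from the first nohole at every attempt, so it applies the same shifts but does not reproduce the comparison bound of the original. The names `kmp_table`, `shift_table` and `colussi_search` are assumptions made for this sketch, and a full match is followed by a shift of shift[0], as in the example slides.

```python
def kmp_table(p):
    """KMP function of p with -1 entries (0-based positions)."""
    m = len(p)
    kmp = [-1] * m
    i, j = 0, -1
    while i < m - 1:
        while j > -1 and p[i] != p[j]:
            j = kmp[j]
        i += 1
        j += 1
        kmp[i] = kmp[j] if p[i] == p[j] else j
    return kmp


def shift_table(p, kmp):
    """shift[i] = Rmin[i] for holes (Kmp[i] = -1), Kmin[i] = i - Kmp[i] otherwise."""
    m = len(p)
    shift = []
    for i in range(m):
        if kmp[i] == -1:
            # Rmin[i]: smallest period of p greater than i (m is always a period).
            shift.append(next(k for k in range(i + 1, m + 1) if p[k:] == p[:m - k]))
        else:
            shift.append(i - kmp[i])
    return shift


def colussi_search(text, p):
    """Return (match positions, attempted alignments) for a non-empty pattern p."""
    if not p:
        raise ValueError("pattern must be non-empty")
    m, n = len(p), len(text)
    kmp = kmp_table(p)
    shift = shift_table(p, kmp)
    noholes = [i for i in range(m) if kmp[i] > -1]             # phase 1, left to right
    holes = [i for i in range(m - 1, -1, -1) if kmp[i] == -1]  # phase 2, right to left
    matches, attempts = [], []
    s = 0
    while s <= n - m:
        attempts.append(s)
        for i in noholes + holes:
            if text[s + i] != p[i]:
                s += shift[i]          # mismatch in either phase
                break
        else:
            matches.append(s)          # every position matched
            s += shift[0]
    return matches, attempts


print(colussi_search("ATATCCTATCATATCA", "ATCATATCA"))  # ([7], [0, 2, 7])
```

On the text and pattern used in the later example slides, the sketch tries the alignments 0, 2 and 7 and reports the occurrence at position 7.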

8 Consider any location i, where Kmp[i] = -1
If a mismatch occurs at this location, the KMP Algorithm shifts i - (-1) = i + 1 steps. If Kmp[i] = j ≠ -1, j must be larger than -1. The number of steps moved is i - j < i + 1.

9 If we ignore the locations i where Kmp[i] = -1, it is safe, because at every other location we move a smaller number of steps.

10 Ex: The pattern is “ATCATATCA”.
The Colussi algorithm uses three other preprocessing functions, namely Kmin, Rmin and Shift. Let us first recall the Kmp function as follows.
i     0  1  2  3  4  5  6  7  8
P     A  T  C  A  T  A  T  C  A
Kmp  -1  0  0 -1  0  2  0  0 -1

11 The Kmin function
Kmin[i] = i - Kmp[i] (the number of jumps in the KMP Algorithm). If i is a nohole, we set Kmin[i]. For example, Kmin[1] = 1 - Kmp[1] = 1 - 0 = 1.

13 Definition of Period
An integer k is a period of a pattern p of length m if for any i, 0 <= i < m - k, pi = pi+k. In other words, p(k, m-1) = p(0, m-k-1). According to the above definition, a given pattern p may have many periods. For instance, for the case of ATCATATCA, there are three periods, namely 5, 8, and 9. We can verify that pi+5 = pi for i = 0 to 3. Note that the length of a pattern is trivially a period of it.
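The definition of a period translates directly into a short check. The following sketch lists every period of a pattern by testing the definition for each k; the function name `periods` is chosen here for illustration only.

```python
def periods(p):
    """Return every k in 1..m such that p[i] == p[i + k] for all 0 <= i < m - k."""
    m = len(p)
    return [k for k in range(1, m + 1)
            if all(p[i] == p[i + k] for i in range(m - k))]


print(periods("ATCATATCA"))  # [5, 8, 9]
```

The last entry is always the length of the pattern, which is trivially a period.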

14 The Rmin function
If i is a hole, Rmin[i] is the smallest period of p greater than i. (This is the number of jumps for a hole under the condition that we have already matched all characters after i.) Rmin implies that we can look into the future in the Colussi Algorithm. The periods are 5, 8 and 9, so we set Rmin[0] = 5.

15 The Rmin function
If i is a hole, Rmin[i] is the smallest period of p greater than i. The periods are 5, 8 and 9, so we set Rmin[3] = 5.

16 The Rmin function
If i is a hole, Rmin[i] is the smallest period of p greater than i. The periods are 5, 8 and 9, so we set Rmin[8] = 9.
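The three Rmin values above can be reproduced with a small helper. This is a sketch under the assumption that the holes of ATCATATCA are the positions 0, 3 and 8 used on these slides; the name `smallest_period_greater_than` is hypothetical.

```python
def smallest_period_greater_than(p, i):
    """Return the smallest period k of p with k > i (len(p) is always a period)."""
    m = len(p)
    for k in range(i + 1, m + 1):
        # k is a period exactly when p(k, m-1) equals p(0, m-k-1).
        if p[k:] == p[:m - k]:
            return k


P = "ATCATATCA"
holes = [0, 3, 8]  # positions where Kmp[i] = -1, taken from the slides
print({i: smallest_period_greater_than(P, i) for i in holes})  # {0: 5, 3: 5, 8: 9}
```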

17 The shift function
If Kmp[i] = -1, then shift[i] = Rmin[i]; else shift[i] = Kmin[i].

18 The shift function
Kmp[0] = -1, so shift[0] = Rmin[0]. Then we can set shift[0] = 5.

19 The shift function
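The complete set of tables for ATCATATCA can be generated as follows. This is a minimal sketch that combines the definitions of Kmp, Kmin, Rmin and shift given on the previous slides, with 0-based positions; the function names are assumptions made for the sketch, and `None` marks an entry that is not defined for that position.

```python
def kmp_table(p):
    """KMP function of p with -1 entries (0-based positions)."""
    m = len(p)
    kmp = [-1] * m
    i, j = 0, -1
    while i < m - 1:
        while j > -1 and p[i] != p[j]:
            j = kmp[j]
        i += 1
        j += 1
        kmp[i] = kmp[j] if p[i] == p[j] else j
    return kmp


def colussi_tables(p):
    """Return (kmp, kmin, rmin, shift) for the pattern p."""
    m = len(p)
    kmp = kmp_table(p)
    kmin, rmin, shift = [None] * m, [None] * m, [0] * m
    for i in range(m):
        if kmp[i] == -1:
            # Hole: Rmin[i] is the smallest period of p greater than i.
            rmin[i] = next(k for k in range(i + 1, m + 1) if p[k:] == p[:m - k])
            shift[i] = rmin[i]
        else:
            # Nohole: Kmin[i] = i - Kmp[i].
            kmin[i] = i - kmp[i]
            shift[i] = kmin[i]
    return kmp, kmin, rmin, shift


kmp, kmin, rmin, shift = colussi_tables("ATCATATCA")
print(kmp)    # [-1, 0, 0, -1, 0, 2, 0, 0, -1]
print(kmin)   # [None, 1, 2, None, 4, 3, 6, 7, None]
print(rmin)   # [5, None, None, 5, None, None, None, None, 9]
print(shift)  # [5, 1, 2, 5, 4, 3, 6, 7, 9]
```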

20 We give two kinds of examples where Kmp[i] = -1 to explain Rmin[i].
The condition is satisfied in this case.
i     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P     b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
Kmp  -1  0 -1  1 -1
Here the prefix p(0, 7) is equal to the suffix p(9, 16), and the positions after p4 have already been matched. If a mismatch occurs at p4, we jump 4 steps for the MP algorithm, we jump 5 steps for the KMP algorithm, and we jump 9 steps for the Colussi algorithm because Rmin[4] = 9. But we must understand that, for the Colussi Algorithm, all positions after p4 have already been matched. Therefore we can look into the future.

21 We give two kinds of examples where Kmp[i] = -1 to explain Rmin[i].
The condition is satisfied in this case.
i     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P     b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
Kmp  -1  0 -1  1 -1  0 -1  1  4 -1  0 -1  1 -1  0 -1
If a mismatch occurs at p15, we jump 9 steps for the MP algorithm (15 - 6, because the longest suffix of p(0, 14) which is equal to a prefix has length 6), we jump 16 steps for the KMP algorithm, and we jump 17 steps for the Colussi algorithm because Rmin[15] = 17.

22 The Colussi Algorithm uses the Rmin function.
Actually, it is using the suffix to prefix rule implicitly. We shall explain this point in the following slides.

23 Note that Rmin is used when all of the locations where Kmp[i] ≠ -1 have been processed and have been found matched.
For a location i where Kmp[i] = -1, we know that we may jump i + 1 steps. But, for the Colussi algorithm, we use Rmin, and Rmin[i] is always larger than i. Why?

24 Note that Rmin[i] is defined as the smallest period of p which is larger than i.
Let m be the length of p.
Case 1: Rmin[i] is equal to m. In this case, we know that no suffix of p shorter than m - i is equal to a prefix of p.
Case 2: Rmin[i] is smaller than m. In this case, there is a suffix of p, of length m - Rmin[i], which is equal to a prefix of p.
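The link between periods and the suffix to prefix rule can be made concrete: k is a period of p exactly when the suffix of length m - k is equal to the prefix of the same length. The sketch below checks this on the 17-character example pattern; the helper names `periods` and `border_lengths` are chosen here for illustration.

```python
def periods(p):
    """All k in 1..m with p[i] == p[i + k] for every 0 <= i < m - k."""
    m = len(p)
    return [k for k in range(1, m + 1)
            if all(p[i] == p[i + k] for i in range(m - k))]


def border_lengths(p):
    """Lengths b < m such that the prefix and the suffix of length b are equal."""
    m = len(p)
    return [b for b in range(m) if p[:b] == p[m - b:]]


P = "bcbabcbaebcbabcba"
print(periods(P))         # [9, 13, 17]
print(border_lengths(P))  # [0, 4, 8]
# Each period k corresponds to the border of length len(P) - k.
print([len(P) - k for k in reversed(periods(P))] == border_lengths(P))  # True
```

For i = 4 the smallest period larger than i is 9, which corresponds to the suffix of length 8 (Case 2); for i = 15 it is 17, the length of the pattern (Case 1).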

25 Furthermore, Rmin is used when we scan from right to left.
That is, all locations after location i have already been matched. Therefore, we may use the suffix to prefix rule now.

26 Rule 1: The Suffix to Prefix Rule
For a window to have any chance to match a pattern, in some way, there must be a suffix of the window which is equal to a prefix of the pattern.

27 The Implication of Rule 1:
Find the longest suffix U of the window which is equal to some prefix of P. Slide the pattern so that this prefix of P is aligned with U.

28 Example
T = GCATCGACAGACTATACAGTACG
P = GACGGATCA
The longest suffix of the window which is equal to a prefix of P is “GAC”, the first three characters of P, so we slide the window by 9 - 3 = 6.
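The rule can be sketched in a few lines. The position of the window is not recoverable from the transcript, so the sketch assumes the window is T(3, 11) = TCGACAGAC (0-based), which is the alignment consistent with the suffix “GAC” and the shift of 6 stated above. The function name `suffix_prefix_shift` is hypothetical, and the sketch assumes the window has the same length as the pattern and has already failed to match it.

```python
def suffix_prefix_shift(window, pattern):
    """Return (length of the longest window suffix equal to a pattern prefix, shift)."""
    m = len(pattern)
    # Try the longest proper overlap first; the shift aligns the prefix with the suffix.
    for length in range(m - 1, 0, -1):
        if window[-length:] == pattern[:length]:
            return length, m - length
    return 0, m  # no overlap at all: slide past the whole window


T = "GCATCGACAGACTATACAGTACG"
P = "GACGGATCA"
window = T[3:12]  # assumed window position, see the note above
print(window)                          # TCGACAGAC
print(suffix_prefix_shift(window, P))  # (3, 6)
```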

29 Let us consider the following example:
i     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P     b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a
Kmp  -1  0 -1  1 -1
Note that Rmin is only used when we scan from right to left and a mismatch occurs at a location i where Kmp[i] = -1. In the above example, let us consider i = 4. The smallest period of the pattern larger than i = 4 is 9. Therefore Rmin[4] = 9. This means that we may jump 9 steps, as shown in the next slide.

30 Jumping 9 steps after a mismatch at p4
T    b c b a x c b a e b c b a b c b a
P    b c b a b c b a e b c b a b c b a
P                      b c b a b c b a e b c b a b c b a
From the definition of period, we know that p(0, 7) = p(9, 16). Since we scan from right to left, we know that T(9, 16) = p(9, 16) = p(0, 7). Therefore, we may move p0 to the position of p9.

31 If a mismatch happens in the first phase, we move according to shift[i].
If all noholes match, we run the second phase.
Example, first attempt:
Text:    ATATCCTATCATATCA
Pattern: ATCATATCA
match (nohole 1: T = T)

32 Example, first attempt:
Text:    ATATCCTATCATATCA
Pattern: ATCATATCA
mismatch (nohole 2: the text has A, the pattern has C)

33 Example, first attempt:
Text:    ATATCCTATCATATCA
Pattern:   ATCATATCA
Shift[2] = 2

34 Example, second attempt:
Text:    ATATCCTATCATATCA
Pattern:   ATCATATCA
match (first phase: the noholes 1, 2, 4, 5, 6 and 7 are compared from left to right and all match)

41 If a mismatch happens in the second phase, we move according to shift[i].
If all holes match, we move by shift[0].
Example, second attempt:
Text:    ATATCCTATCATATCA
Pattern:   ATCATATCA
match (hole 8: A = A)

42 Example, second attempt:
Text:    ATATCCTATCATATCA
Pattern:   ATCATATCA
mismatch (hole 3: the text has C, the pattern has A)

43 Shift[3] = 5
Example, second attempt:
Text:    ATATCCTATCATATCA
Pattern:        ATCATATCA
Shift[3] = 5, because the matched suffix ATCA is a prefix of the pattern.

44 Example, third attempt:
Text:    ATATCCTATCATATCA
Pattern:        ATCATATCA
match (all noholes match in the first phase, and all holes match in the second phase)

49 Shift[0] = 5
Example, third attempt:
Text:    ATATCCTATCATATCA
Pattern:        ATCATATCA
The pattern is found. We then move by Shift[0] = 5, because the suffix ATCA is a prefix of the pattern.

50 Why does the Colussi Algorithm ignore the locations where Kmp[i] = -1?
Note that when Kmp[i] = -1 and we scan from left to right, we jump i + 1 steps. In all other cases, Kmp[i] = j with j > -1, and we jump i - j < i + 1 steps. This means that it is safe to ignore the locations where Kmp[i] = -1.

51 Colussi Algorithm Time complexity
The preprocessing phase can be done in O(m) space and time. The searching phase can then be done in O(n) time, and furthermore at most 3n/2 text character comparisons are performed during the searching phase.

52 References
[B92] BRESLAUER, D., Efficient String Algorithmics, Ph.D. Thesis, Report CU, Computer Science Department, Columbia University, New York, NY, 1992.
[C91] COLUSSI, L., Correctness and efficiency of the pattern matching algorithms, Information and Computation 95(2), 1991.
[CGG90] COLUSSI, L., GALIL, Z., GIANCARLO, R., On the exact complexity of string matching, in Proceedings of the 31st IEEE Annual Symposium on Foundations of Computer Science, 1990.
[GG92] GALIL, Z., GIANCARLO, R., On the exact complexity of string matching: upper bounds, SIAM Journal on Computing, Vol. 21, No. 3, 1992.

