
String Matching Algorithms Based upon the Uniqueness Property
Advisor: Prof. R. C. T. Lee
Speaker: C. W. Lu
C. W. Lu and R. C. T. Lee, 2007, String Matching Algorithms Based upon the Uniqueness Property, The 24th Workshop on Combinatorial Mathematics and Computation Theory, pp. 385-392.

String matching problem
– Given a text string T of length n and a pattern string P of length m.
– Find all occurrences of P in T.
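As a baseline for the problem statement, here is a minimal brute-force sketch. The function name `find_all` is hypothetical and is not part of the talk; it simply tries every alignment of P against T.

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return every start index at which pattern occurs in text (overlaps included)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Try each of the n - m + 1 alignments and compare the whole window.
    return [s for s in range(n - m + 1) if text[s:s + m] == pattern]


print(find_all("acgccgcgcccgcgctcaaa", "cgc"))  # [1, 4, 6, 10, 12]
```

This costs O(nm) comparisons in the worst case; the rules that follow aim to skip alignments instead of testing each one.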

Rule 1: The Suffix to Prefix Rule
Suppose u is the longest suffix of the window that is also a prefix of P. We can then move P so that the prefix u of P is aligned with the suffix u of the window.
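Rule 1 can be sketched as a small helper. The name `suffix_prefix_shift` is hypothetical, and the sketch assumes u must be a proper prefix of P so that the resulting move is at least one step.

```python
def suffix_prefix_shift(window: str, pattern: str) -> int:
    """Number of steps to move the pattern so that its prefix u lines up with
    the longest suffix u of the window (u is a proper prefix of the pattern)."""
    longest = min(len(window), len(pattern) - 1)
    for k in range(longest, 0, -1):          # try the longest candidate first
        if window.endswith(pattern[:k]):
            return len(window) - k
    return len(window)                       # no overlap: move past the window


# Window taken from T = acgccgcgccc..., pattern P = catagtagcct.
print(suffix_prefix_shift("acgccgcgccc", "catagtagcct"))  # 10
```

Here only the single character "c" is both a suffix of the window and a prefix of P, so P moves 11 - 1 = 10 steps.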

The Uniqueness Property of a String
For any substring V of P, if V occurs in P only once, V is a unique substring. When V matches some substring of T, we can move P in such a way that a prefix of P matches a suffix of V.

Example
T = acgccgcgcccgcgctcaaa (positions 0 to 19)
P = catagtagcct (positions 0 to 10)
Suppose we use the substring “cc” as the unique substring.
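The uniqueness test itself can be sketched as follows. The helper names `occurrences` and `is_unique` are hypothetical; overlapping occurrences are counted.

```python
def occurrences(pattern: str, v: str) -> list[int]:
    """Start indices of every (possibly overlapping) occurrence of v in pattern."""
    if not v:
        return []
    return [s for s in range(len(pattern) - len(v) + 1) if pattern.startswith(v, s)]


def is_unique(pattern: str, v: str) -> bool:
    """True when v occurs in pattern exactly once."""
    return len(occurrences(pattern, v)) == 1


P = "catagtagcct"
print(occurrences(P, "cc"), is_unique(P, "cc"))  # [8] True
print(occurrences(P, "ag"), is_unique(P, "ag"))  # [3, 6] False
```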

Algorithm 1 - The Longest Prefix with Unique Suffix Matching Algorithm
We further relax the uniqueness requirement by noting that the substring does not have to be unique in the entire pattern P. A substring which is unique in a prefix of P suffices. Therefore, we only have to find the longest prefix of P whose suffix is unique within that prefix.

Example
P = CACTAGCCACTCTC
The substring TC occurs twice in P, but it is unique in the prefix CACTAGCCACTC. Move P 11 steps.

Example
P = CACTAGCCACTCTC
The substring G is also unique in the prefix CACTAG. Move P 6 steps.
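The two moves above (11 steps for TC, 6 steps for G) can be reproduced with the sketch below. The function name `unique_suffix_shift` is hypothetical. The overlap step is an assumption: it applies Rule 1 to the matched substring, so that a suffix of the substring which is also a prefix of P shortens the move. The slides state only the resulting numbers.

```python
from typing import Optional


def unique_suffix_shift(pattern: str, start: int, end: int) -> Optional[int]:
    """Move length when v = pattern[start:end+1] matches the text, provided v
    is unique in the prefix pattern[:end+1]; None if v occurs earlier."""
    v = pattern[start:end + 1]
    if pattern[:end + 1].find(v) != start:   # an earlier occurrence exists
        return None
    # Assumed Rule 1 step: longest proper suffix of v that is a prefix of pattern.
    overlap = 0
    for k in range(len(v) - 1, 0, -1):
        if pattern.startswith(v[-k:]):
            overlap = k
            break
    return end + 1 - overlap


P = "CACTAGCCACTCTC"
print(unique_suffix_shift(P, 10, 11))  # TC at positions 10-11 -> 11
print(unique_suffix_shift(P, 5, 5))    # G at position 5 -> 6
print(unique_suffix_shift(P, 12, 13))  # TC at 12-13 is not unique -> None
```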

P = CACTAGCCACTCTC
In the above example, using the unique substring TC, we could move P 11 steps whenever TC matches TC in T; using the unique substring G, we could move P 6 steps whenever G matches G in T.
Is the unique substring TC better than the unique substring G?

We should notice that the more often the unique substring appears in T, the more efficient our algorithm is. In general, the probability that TC in P matches the characters of T aligned with it is 1/16 (assuming an alphabet of size 4), while the probability that G in P matches the character of T aligned with it is 1/4. Thus, the size of the unique substring is also important.

P = CACTAGCCACTCTC
For every one time that TC in P matches TC in T and moves P by 11 steps, G in P may match G in T four times and move P by 6 steps each time. So we expect the substring G to be better than the substring TC in general.

We now define a ratio σ to determine which substring is better. Let Σ be the alphabet. The larger σ is, the better the efficiency that can be achieved in the searching phase.
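The formula for the ratio is not reproduced in this transcript. One reading that is consistent with the surrounding slides (match probability 1/|Σ|^size, and the later requirement that a size-2 substring move P more than 3*4 steps to beat a size-1 substring that moves P 3 steps) is the move length divided by |Σ| raised to the substring size. The sketch below uses that assumed definition; `sigma` is a hypothetical name.

```python
def sigma(shift: int, size: int, alphabet_size: int = 4) -> float:
    """Assumed ratio: move length weighted by the chance 1 / |alphabet|**size
    that a substring of the given size matches the text."""
    return shift / alphabet_size ** size


# P = CACTAGCCACTCTC: TC moves P 11 steps, G moves P 6 steps.
print(sigma(11, 2))  # 0.6875
print(sigma(6, 1))   # 1.5, so G is preferred under this reading
```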

Preprocessing Phase
P = CAGACGACCCCAACAGC (positions 0 to 16)
Σ = {A, C, G, T}, |Σ| = 4.
T = ACGCCGCGCCCGCGCTCAAA… (positions 0 to 19)
Find the longest prefix with a unique suffix of size one.

Preprocessing Phase
T = ACGCCGCGCCCGCGCGCAAA… (positions 0 to 19)
P = CAGACGACCCCAACAGC (positions 0 to 16)
We have found the unique substring of size 1, and we could use it to move P 3 steps. Next, we try to find a unique substring of size 2 that lets us move P more than 3*4 steps. Thus, we only consider the substrings of p_12 p_13 … p_16.
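The preprocessing for a fixed substring size can be sketched as below. The name `best_unique_suffix` is hypothetical, the sketch scans every end position rather than only p_12 … p_16, and it reuses the assumed Rule 1 overlap step to turn a position into a move length.

```python
from typing import Optional, Tuple


def best_unique_suffix(pattern: str, size: int) -> Optional[Tuple[int, int, int]]:
    """Return (start, end, shift) with the largest shift among substrings of the
    given size that do not occur earlier in the pattern; None if there is none."""
    best = None
    for end in range(size - 1, len(pattern)):
        start = end - size + 1
        v = pattern[start:end + 1]
        if pattern.find(v) != start:         # v already occurs further left
            continue
        # Assumed Rule 1 step: proper suffix of v that is also a prefix of pattern.
        overlap = next((k for k in range(size - 1, 0, -1)
                        if pattern.startswith(v[-k:])), 0)
        shift = end + 1 - overlap
        if best is None or shift > best[2]:
            best = (start, end, shift)
    return best


P = "CAGACGACCCCAACAGC"
print(best_unique_suffix(P, 1))  # (2, 2, 3): G at position 2, move 3 steps
print(best_unique_suffix(P, 2))  # (15, 16, 16): GC at 15-16, move 16 steps
```

Since 16 > 3*4, the size-2 substring GC is the better choice for this pattern.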

Searching Phase
T = …CGCCGCGCCCGCGCGCAAA…
P = CAGACGACCCCAACAGC (positions 0 to 16)
If the unique substring mismatches, move P one step.

Searching Phase
T = …CGCCGCGCCCGCGCGCAAA…
P = CAGACGACCCCAACAGC (positions 0 to 16)
If the unique substring GC matches GC in T, move P 16 steps.
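The two searching rules (move one step on a mismatch, move the precomputed distance on a match) can be combined into the sketch below. The function name and the explicit full-window check before each long move are assumptions; the slides only describe the moves. The position and move length of the unique substring are passed in as precomputed values.

```python
def search_unique_suffix(text: str, pattern: str,
                         start: int, end: int, shift: int) -> list[int]:
    """Report all occurrences of pattern, probing only pattern[start:end+1]."""
    v = pattern[start:end + 1]
    n, m = len(text), len(pattern)
    hits, pos = [], 0
    while pos + m <= n:
        if text.startswith(v, pos + start):      # unique substring matches
            if text.startswith(pattern, pos):    # assumed: verify whole window
                hits.append(pos)
            pos += shift
        else:                                    # mismatch: move one step
            pos += 1
    return hits


P = "CAGACGACCCCAACAGC"
T = "TTGC" + P + "AAGC" + P + "T"
print(search_unique_suffix(T, P, 15, 16, 16))  # [4, 25]
```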

As discussed above, the size of the unique substring is important. In the following, we introduce another algorithm, which uses a unique substring of size one.

Algorithm 2 - Longest Substring with Unique Character Matching Algorithm
Let x be any character in the window. For a move of P to produce a meaningful match with T, the same character x must occur in P to the left of the position currently aligned with x in T.

In the preprocessing phase, we try to find the longest substring p’ in P such that x occurs in p’ only once. That is, p’ = p_i p_{i+1} … p_j and p_j occurs in p’ only once.

If the unique character x matches x in T, we can move P |p’| steps.

Example
P = CACTAGCCACTCTC (positions 0 to 13)
In this example, we would find the longest substring p_4 p_5 … p_10 with a unique character p_10. If the character p_10 matches T, we can move P 7 steps.
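The preprocessing of Algorithm 2 can be sketched as a single left-to-right pass. The function name is hypothetical, and the sketch assumes, as in the example, that the unique character is the last character of p’, so |p’| is the distance from p_j back to its previous occurrence (or j + 1 if there is none).

```python
def longest_unique_char_substring(pattern: str) -> tuple[int, int]:
    """Return (j, length): position j of the unique character and the length of
    the longest substring ending at j in which pattern[j] occurs only once."""
    last_seen: dict[str, int] = {}
    best_j, best_len = 0, 0
    for j, ch in enumerate(pattern):
        length = j - last_seen.get(ch, -1)   # distance to previous occurrence
        if length > best_len:
            best_j, best_len = j, length
        last_seen[ch] = j
    return best_j, best_len


P = "CACTAGCCACTCTC"
j, length = longest_unique_char_substring(P)
print(j, length, P[j - length + 1:j + 1])  # 10 7 AGCCACT
```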

Searching Phase
T = …CGCCTCGCTCGCGTGCTAA…
P = CACTAGCCACTCTC (positions 0 to 13)
If p_10 mismatches, move P one step.

Searching Phase
T = …CGCCTCGCTCGCGTGCTAA…
P = CACTAGCCACTCTC (positions 0 to 13)
If p_10 matches T, move P 7 steps.
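A minimal sketch of this searching phase follows. The function name and the full-window check before the long move are assumptions; the position j of the unique character and the move length are passed in as precomputed values (10 and 7 for this pattern).

```python
def search_unique_char(text: str, pattern: str, j: int, shift: int) -> list[int]:
    """Report all occurrences of pattern, probing only the character pattern[j]."""
    n, m = len(text), len(pattern)
    hits, pos = [], 0
    while pos + m <= n:
        if text[pos + j] == pattern[j]:          # unique character matches
            if text.startswith(pattern, pos):    # assumed: verify whole window
                hits.append(pos)
            pos += shift
        else:                                    # mismatch: move one step
            pos += 1
    return hits


P = "CACTAGCCACTCTC"
T = "GG" + P + "CACTAG" + P
print(search_unique_char(T, P, 10, 7))  # [2, 22]
```

The long move is safe because any smaller move would align the matched text character with one of the six pattern characters to the left of p_10, none of which is T.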

Algorithm 3 - The Unique Pairwise Substring Algorithm
The substring p_i p_{i+1} … p_{j-1} p_j is called a unique pairwise substring if it satisfies the condition that p_i p_{i+1} … p_{j-1} p_j occurs in the prefix p_1 p_2 … p_{j-1} p_j of P exactly once, and no p_k p_{k+1} … p_{k+j-i} exists in p_1 p_2 … p_{j-1} such that p_k = p_i and p_{k+j-i} = p_j.

Example
P = CACTCAGCCACTCGC (positions 0 to 14)
The substring TCG (p_11 p_12 p_13) is a unique pairwise substring because no p_k p_{k+1} p_{k+2} exists in p_1 p_2 … p_12 such that p_k = p_11 = T and p_{k+2} = p_13 = G.
The substring CAC (p_8 p_9 p_10) is not a unique pairwise substring because there exists a substring p_2 p_3 p_4 in p_1 p_2 … p_9 such that p_2 = p_8 = C and p_4 = p_10 = C.
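The pairwise condition can be checked directly. The sketch below uses 0-based positions, as in the index row of the example, and the hypothetical name `is_unique_pairwise`. It tests only the endpoint condition, since an earlier full occurrence of the substring would also be an earlier pair with the same two end characters.

```python
def is_unique_pairwise(pattern: str, i: int, j: int) -> bool:
    """True when no earlier position k < i has pattern[k] == pattern[i] and
    pattern[k + (j - i)] == pattern[j]."""
    d = j - i
    return not any(pattern[k] == pattern[i] and pattern[k + d] == pattern[j]
                   for k in range(i))


P = "CACTCAGCCACTCGC"
print(is_unique_pairwise(P, 11, 13))  # TCG -> True
print(is_unique_pairwise(P, 8, 10))   # CAC -> False (C at 2 and C at 4)
```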

Suppose p_i p_{i+1} … p_{j-1} p_j is a unique pairwise substring. If p_i and p_j match T, we have two cases to move P.
Case 1: there exists k such that p_j = p_k, where 0 ≤ k ≤ j-i-1. We can move P j-k steps, taking the largest such k.

Case 2: p_j ≠ p_k for every k with 0 ≤ k ≤ j-i-1. We can move P j+1 steps.
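The two cases can be sketched as one function. The name `pairwise_shift` is hypothetical, and positions are 0-based. Scanning k downward from j-i-1 returns the largest matching k, which gives the smallest and therefore safe move in Case 1.

```python
def pairwise_shift(pattern: str, i: int, j: int) -> int:
    """Move length once pattern[i] and pattern[j] have both matched the text."""
    for k in range(j - i - 1, -1, -1):
        if pattern[k] == pattern[j]:   # Case 1: p_j reappears at position k
            return j - k
    return j + 1                       # Case 2: no such k


P = "CACTCAGCCACTCGC"
print(pairwise_shift(P, 11, 13))  # Case 2: G is not among p_0, p_1 -> 14
print(pairwise_shift(P, 11, 14))  # Case 1: p_14 = C = p_2 -> 12
```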

Example
P = CACTCAGCCACTCGC (positions 0 to 14)
T = …CGCCTCGCTCGTGGGCTAA…
If we choose p_11 p_12 p_13 as the unique pairwise substring, we can move P 14 steps when p_11 and p_13 match T.

There may be many unique pairwise substrings in the pattern. We select the one located rightmost in the pattern.
Example
P = CACTCAGCGACTCGC (positions 0 to 14)
The substrings p_5 p_6, p_7 p_8 p_9 and p_11 p_12 p_13 are all unique pairwise substrings. We select p_11 p_12 p_13 because it gives the largest move.
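The selection step can be sketched as an exhaustive scan over all pairs (i, j). This is an assumption about how the choice might be made, not the authors' procedure: the hypothetical `best_unique_pairwise` picks the largest move and, among ties, the substring found last in the scan, which is the one that ends and starts furthest right.

```python
from typing import Optional, Tuple


def best_unique_pairwise(pattern: str) -> Optional[Tuple[int, int, int]]:
    """Return (i, j, shift) for the unique pairwise substring with the largest
    move; ties go to the candidate found last (largest j, then largest i)."""
    best = None
    for j in range(1, len(pattern)):
        for i in range(j):
            d = j - i
            # Skip if an earlier pair at the same distance has the same end characters.
            if any(pattern[k] == pattern[i] and pattern[k + d] == pattern[j]
                   for k in range(i)):
                continue
            # Case 1 uses the largest k <= d - 1 with p_k == p_j; otherwise Case 2.
            shift = next((j - k for k in range(d - 1, -1, -1)
                          if pattern[k] == pattern[j]), j + 1)
            if best is None or shift >= best[2]:
                best = (i, j, shift)
    return best


print(best_unique_pairwise("CACTCAGCGACTCGC"))  # (11, 13, 14)
```

The scan is cubic in the pattern length, which is acceptable for a one-off preprocessing sketch on short patterns.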

Example
P = CACTCAGCCACTCGC (positions 0 to 14)
T = …CGCCTCGCTCGTGGGCTAA…
If p_11 or p_13 mismatches, move P one step.

Example
P = CACTCAGCCACTCGC (positions 0 to 14)
T = …CGCCTCGCTCGTGGGCTAA…
If p_11 and p_13 match T, move P 14 steps.
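The searching phase of Algorithm 3 can be sketched in the same style as the earlier ones. The function name and the full-window check before the long move are assumptions; the endpoints i, j and the move length are passed in as precomputed values (11, 13 and 14 for this pattern).

```python
def search_unique_pairwise(text: str, pattern: str,
                           i: int, j: int, shift: int) -> list[int]:
    """Report all occurrences of pattern, probing only pattern[i] and pattern[j]."""
    n, m = len(text), len(pattern)
    hits, pos = [], 0
    while pos + m <= n:
        if text[pos + i] == pattern[i] and text[pos + j] == pattern[j]:
            if text.startswith(pattern, pos):    # assumed: verify whole window
                hits.append(pos)
            pos += shift                         # both end characters matched
        else:
            pos += 1                             # either one mismatched
    return hits


P = "CACTCAGCCACTCGC"
T = "CG" + P + "TCG" + P
print(search_unique_pairwise(T, P, 11, 13, 14))  # [2, 20]
```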


Thanks for your attention.

