# 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol.


1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249 Advisor: Prof. R. C. T. Lee Speaker: C. W. Lu

2 Let x and y be two strings. The edit distance d(x, y) is the minimum number of character insertions, deletions, and replacements needed to convert string x into y. The k-difference string matching problem: –Given a text T of length n, a pattern P of length m, and an error bound k. –Find all positions i of T such that there exists a suffix S of T(1, i) with d(S, P) ≤ k.
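
As a minimal sketch of the definition above, the edit distance can be computed with the standard dynamic-programming recurrence (the string-to-string correction scheme of Wagner and Fischer [30]). The function name `edit_distance` is illustrative and does not come from the slides.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and replacements turning x into y."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]  # distance between x[:i] and the empty string
        for j, cy in enumerate(y, 1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # replace cx by cy, or keep it if equal
            ))
        prev = cur
    return prev[-1]
```

This costs O(|x| |y|) time per pair of strings, which is why the indexed approach in the following slides avoids calling it on every substring of T.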

3 The approach of this paper is as follows: Given a pattern P and an error bound k, we generate all possible edited patterns P′ that can be derived from P with at most k errors. Then we conduct an exact match of every such P′ against T.

4 Example: T = abbaaa, P = aba and k = 1. From P and k, we generate the following P′s: ba, aaba, baba, bba, aa, abba, aaa, ab, abaa, abab, abb, aba.
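
The generate-then-match idea can be sketched by brute force, assuming a small alphabet. The helper name `edits_within` is hypothetical. It enumerates every string reachable from P with at most k single-character edits, and the result contains every pattern listed in the example above.

```python
def edits_within(pattern: str, k: int, alphabet: str) -> set[str]:
    """All strings reachable from pattern with at most k single-character edits."""
    seen = {pattern}
    frontier = {pattern}
    for _ in range(k):
        nxt = set()
        for s in frontier:
            for i in range(len(s) + 1):
                for c in alphabet:
                    nxt.add(s[:i] + c + s[i:])          # insertion before position i
                if i < len(s):
                    nxt.add(s[:i] + s[i + 1:])          # deletion of s[i]
                    for c in alphabet:
                        nxt.add(s[:i] + c + s[i + 1:])  # replacement of s[i]
        frontier = nxt - seen   # only strings not seen before need further editing
        seen |= nxt
    return seen


candidates = edits_within("aba", 1, "ab")
# Exact matching of each candidate against T, here with a plain substring test.
matches = sorted(q for q in candidates if q in "abbaaa")
```

The number of candidates grows roughly like (|A| m)^k, so this is only practical for small k. The slides replace the plain substring test with suffix-array range queries.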

5 Then we conduct an exact matching of all P′s against T. Any success indicates that there is a substring S in T such that d(S, P) ≤ k. How can we generate all the P′s we want? We use the following observation.

6 Let S be a substring of T with S = S_1 S_2, and let P = P_1 P_2. If d(S_1, P_1) ≤ k and Dist(S_2, P_2) = 0, then d(S, P) ≤ k.

7 Example: T = ACACAAAAACACC (positions 1 to 13), P = AGABCA (positions 1 to 6), k = 2. Consider the substring S = T(6, 11) = AAAACA. Let S_1 = T(6, 9) = AAAA and S_2 = T(10, 11) = CA, with P_1 = AGAB and P_2 = CA. Dist(S_1, P_1) = 2 ≤ k and Dist(S_2, P_2) = 0, so Dist(S, P) = 2 ≤ k.

8 Example: T = ACACAAAAACACC (positions 1 to 13), P = AGABCA (positions 1 to 6), k = 2. Consider the substring S = T(8, 11) = AACA. Let S_1 = T(8, 9) = AA and S_2 = T(10, 11) = CA, with P_1 = AGAB and P_2 = CA. Dist(S_1, P_1) = 2 ≤ k and Dist(S_2, P_2) = 0, so Dist(S, P) = 2 ≤ k.

9 Based upon the above observation, we can generate all edited patterns P′ by editing a prefix of P and keeping the remaining suffix untouched. Consider P = aba, k = 1.

10 P = aba, k = 1. Edited patterns generated at each position i. Each edit uses one error; the unedited pattern aba uses none.
– i = 1: ba (deletion), aaba (insertion), baba (insertion), bba (substitution)
– i = 2: aa (deletion), aaba (insertion), abba (insertion), aaa (substitution)
– i = 3: ab (deletion), abaa (insertion), abba (insertion), abb (substitution)
– i = 4: abaa (insertion), abab (insertion)

11 P = aba, k = 2. Edited patterns after the first edit at each position i. Each uses one error; the unedited pattern aba uses none.
– i = 1: ba (deletion), aaba (insertion), baba (insertion), bba (substitution)
– i = 2: aa (deletion), aaba (insertion), abba (insertion), aaa (substitution)
– i = 3: ab (deletion), abaa (insertion), abba (insertion), abb (substitution)
– i = 4: abaa (insertion), abab (insertion)

12 P = aba, k = 2. Second-level edits starting from ba, which has already used one error. Each edit below brings the count to 2.
– i = 2: a (deletion), aba (insertion), bba (insertion), aa (substitution)
– i = 3: b (deletion), baa (insertion), bba (insertion), bb (substitution)
– i = 4: baa (insertion), bab (insertion)

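13 For i = 1 to m+1, split P = P_L P_R at position i, so that the edited pattern is P′ = P′_L P_R with k′ = Dist(P_L, P′_L) ≤ k and Dist(P_R, P_R) = 0. At position i, one of the following is applied:
– Deletion of P[i], k′++.
– Replacement of P[i] by a character of A, k′++.
– Insertion of a character of A before P[i], k′++.
– No operation.
Terminate if k′ > k.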

14 Our problem now becomes the following: Given a pattern P, we produce a modified pattern P′. Our job is to determine whether P′ exactly matches some substring of T. For example, suppose P = aba. We have ba as one of the modified patterns, so we would like to find out whether ba matches exactly with a substring of T.

15 This exact matching can be found by using the suffix array and the inverse suffix array.

16 Suffix Array Let T = t_0 t_1 … t_n be a text, where t_0, t_1, …, t_{n-1} are symbols of an alphabet A and t_n = \$ is a special symbol that is not in A and is smaller than any symbol in A. The jth suffix of T is defined as T(j, n) = t_j … t_n and is denoted by T_j. The suffix array SA[0..n] of T is an array of the integers j representing the suffixes T_j, sorted in lexicographic order of the corresponding suffixes.

17 Example: T = GACAGTTCG\$ (positions 0 to 9). Suffixes of T: {GACAGTTCG\$, ACAGTTCG\$, CAGTTCG\$, AGTTCG\$, GTTCG\$, TTCG\$, TCG\$, CG\$, G\$, \$}. Lexicographic order: \$, ACAGTTCG\$, AGTTCG\$, CAGTTCG\$, CG\$, G\$, GACAGTTCG\$, GTTCG\$, TCG\$, TTCG\$ = T_9, T_1, T_3, T_2, T_7, T_8, T_0, T_4, T_6, T_5. Hence SA[0..9] = 9 1 3 2 7 8 0 4 6 5.
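
A minimal sketch of the definition, assuming the text already ends with the sentinel `$` (which sorts below the letters in ASCII). It sorts the suffix start positions directly and is not one of the linear-time constructions the slides cite; `suffix_array` is an illustrative name.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in lexicographic order of the suffixes."""
    # Naive construction: compares whole suffixes, so it is far from linear time.
    return sorted(range(len(text)), key=lambda j: text[j:])


sa = suffix_array("GACAGTTCG$")
```

For the example text this reproduces the array SA[0..9] shown on the slide.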

18 Inverse Suffix Array The inverse suffix array of T is denoted by SA^-1. SA^-1[i] equals the number of suffixes that are lexicographically smaller than T_i.

19 Example: T = GACAGTTCG\$ (positions 0 to 9). Lexicographic order: \$ (T_9), ACAGTTCG\$ (T_1), AGTTCG\$ (T_3), CAGTTCG\$ (T_2), CG\$ (T_7), G\$ (T_8), GACAGTTCG\$ (T_0), GTTCG\$ (T_4), TCG\$ (T_6), TTCG\$ (T_5). SA[0..9] = 9 1 3 2 7 8 0 4 6 5. SA^-1[0..9] = 6 1 3 2 7 9 8 4 5 0. SA^-1[SA[x]] = x. SA^-1[0] = 6 because there are 6 suffixes smaller than T_0 = GACAGTTCG\$.
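
Since SA^-1[SA[x]] = x, the inverse array can be filled in one pass over SA. The sketch below assumes SA is given as a Python list; the function name is illustrative.

```python
def inverse_suffix_array(sa: list[int]) -> list[int]:
    """inv[j] = rank of suffix T_j, i.e. the number of suffixes smaller than T_j."""
    inv = [0] * len(sa)
    for rank, j in enumerate(sa):
        inv[j] = rank  # the suffix starting at j sits at position `rank` in SA
    return inv


inv = inverse_suffix_array([9, 1, 3, 2, 7, 8, 0, 4, 6, 5])
```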

20 SA and SA^-1 each occupy O(n log n) bits. Both data structures can be constructed in linear time [13, 15, 17].

21 In this paper, an interval [st..ed] is called the range of the suffix array of T corresponding to a string P if [st..ed] is the largest interval such that P is a prefix of every suffix T_j for j = SA[st], SA[st+1], …, SA[ed]. We write [st..ed] = range(T, P).

22 Example: T = GACAGTTCG\$ (positions 0 to 9). Lexicographic order: \$ (T_9), ACAGTTCG\$ (T_1), AGTTCG\$ (T_3), CAGTTCG\$ (T_2), CG\$ (T_7), G\$ (T_8), GACAGTTCG\$ (T_0), GTTCG\$ (T_4), TCG\$ (T_6), TTCG\$ (T_5). SA[0..9] = 9 1 3 2 7 8 0 4 6 5. Let P = G. G is a prefix of T_8, T_0 and T_4, where T_8 = T_SA[5], T_0 = T_SA[6] and T_4 = T_SA[7]. So st = 5, ed = 7, and range(T, P) = [5..7].
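
One way to compute range(T, P) from scratch is two binary searches over the suffix array, comparing P with the first |P| characters of each suffix. This is a sketch under the assumption that the text ends with `$` and P contains no `$`. It costs O(|P| log n) and is not the incremental method of the lemmas that follow; `sa_range` is a hypothetical name.

```python
def sa_range(text: str, sa: list[int], pattern: str):
    """Inclusive interval (st, ed) of SA whose suffixes start with pattern, or None."""
    m = len(pattern)

    def prefix(r: int) -> str:
        return text[sa[r]:sa[r] + m]  # first m characters of the suffix at rank r

    lo, hi = 0, len(sa)
    while lo < hi:                    # first rank whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    st, hi = lo, len(sa)
    while lo < hi:                    # first rank whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    ed = lo - 1
    return (st, ed) if st <= ed else None


text = "GACAGTTCG$"
sa = sorted(range(len(text)), key=lambda j: text[j:])  # naive suffix array
g_range = sa_range(text, sa, "G")
```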

23 Lemma 1 (Gusfield [12]) Given a text T together with its suffix array, assume [st..ed] = range(T, P). Then, for any character c, the interval [st′..ed′] = range(T, Pc) can be computed in O(log n) time.
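
A sketch of how Lemma 1 can be realized: all suffixes in [st..ed] share the prefix P, so they are ordered by their character at offset |P|, and two binary searches on that single character narrow the interval. The name `extend_range` and its signature are assumptions made for illustration.

```python
def extend_range(text: str, sa: list[int], st: int, ed: int, m: int, c: str):
    """Given (st, ed) = range(T, P) with |P| = m, return range(T, Pc) or None."""

    def next_char(r: int) -> str:
        # Character following the shared prefix; the slice is empty past the end of text.
        return text[sa[r] + m:sa[r] + m + 1]

    lo, hi = st, ed + 1
    while lo < hi:                    # first rank whose next character is >= c
        mid = (lo + hi) // 2
        if next_char(mid) < c:
            lo = mid + 1
        else:
            hi = mid
    new_st, hi = lo, ed + 1
    while lo < hi:                    # first rank whose next character is > c
        mid = (lo + hi) // 2
        if next_char(mid) <= c:
            lo = mid + 1
        else:
            hi = mid
    new_ed = lo - 1
    return (new_st, new_ed) if new_st <= new_ed else None


text = "GACAGTTCG$"
sa = sorted(range(len(text)), key=lambda j: text[j:])  # naive suffix array
ga_range = extend_range(text, sa, 5, 7, 1, "A")         # from range(T, G) to range(T, GA)
```

Each comparison inspects one character, so the two searches take O(log n) time, matching the bound in the lemma.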

24 Lemma 2 Given the interval [st_1..ed_1] = range(T, P_1) and the interval [st_2..ed_2] = range(T, P_2), we can find the interval [st..ed] = range(T, P_1 P_2) in O(log n) time using the suffix array and the inverse suffix array of T.

25 Let [st_1..ed_1] = range(T, P_1), [st_2..ed_2] = range(T, P_2), [st..ed] = range(T, P_1 P_2). [st..ed] is a subinterval of [st_1..ed_1].

26 Example: T = GACAGTTCG\$ (positions 0 to 9). Lexicographic order: \$ (T_9), ACAGTTCG\$ (T_1), AGTTCG\$ (T_3), CAGTTCG\$ (T_2), CG\$ (T_7), G\$ (T_8), GACAGTTCG\$ (T_0), GTTCG\$ (T_4), TCG\$ (T_6), TTCG\$ (T_5). SA[0..9] = 9 1 3 2 7 8 0 4 6 5. Let P_1 = G and P_2 = A. range(T, P_1) = [5..7], so range(T, P_1 P_2) must be within [5..7]. How can we find the exact interval within [5..7]?

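27 By the definition of the suffix array, the suffixes T_SA[st_1], T_SA[st_1+1], …, T_SA[ed_1] are in increasing lexicographic order. The suffixes T_SA[st_2], T_SA[st_2+1], …, T_SA[ed_2] are also in increasing lexicographic order.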

28 Lexicographic order: \$ (T_9), ACAGTTCG\$ (T_1), AGTTCG\$ (T_3), CAGTTCG\$ (T_2), CG\$ (T_7), G\$ (T_8), GACAGTTCG\$ (T_0), GTTCG\$ (T_4), TCG\$ (T_6), TTCG\$ (T_5). T_2 = CAGTTCG\$ and T_{2+1} = T_3 = AGTTCG\$. T_{2+1} is obtained by deleting the prefix of length 1 from T_2. In general, T_{i+1} can be obtained by deleting the prefix of length 1 from T_i.

29 Example: T = GACAGTTCG\$ (positions 0 to 9). Lexicographic order: \$ (T_9), ACAGTTCG\$ (T_1), AGTTCG\$ (T_3), CAGTTCG\$ (T_2), CG\$ (T_7), G\$ (T_8), GACAGTTCG\$ (T_0), GTTCG\$ (T_4), TCG\$ (T_6), TTCG\$ (T_5). SA[0..9] = 9 1 3 2 7 8 0 4 6 5. Let P_1 = G and P_2 = A. range(T, P_1) = [5..7], and T_8 < T_0 < T_4. Deleting the first character G gives T_{8+1}, T_{0+1}, T_{4+1}, and indeed T_9 < T_1 < T_5.

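30 Let m_1 = |P_1|. Since the suffixes in [st_1..ed_1] all share the prefix P_1, the lexicographic order of T_{SA[st_1]+m_1}, T_{SA[st_1+1]+m_1}, …, T_{SA[ed_1]+m_1} is also increasing. Thus SA^-1[SA[st_1]+m_1] < SA^-1[SA[st_1+1]+m_1] < … < SA^-1[SA[ed_1]+m_1]. To find st and ed, we find the smallest st in [st_1..ed_1] such that st_2 ≤ SA^-1[SA[st]+m_1] ≤ ed_2 and the largest ed in [st_1..ed_1] such that st_2 ≤ SA^-1[SA[ed]+m_1] ≤ ed_2.

31 Example: T = GACAGATCG\$ (positions 0 to 9). Lexicographic order: \$ (T_9), ACAGATCG\$ (T_1), AGATCG\$ (T_3), ATCG\$ (T_5), CAGATCG\$ (T_2), CG\$ (T_7), G\$ (T_8), GACAGATCG\$ (T_0), GATCG\$ (T_4), TCG\$ (T_6). SA[0..9] = 9 1 3 5 2 7 8 0 4 6. SA^-1[0..9] = 7 1 4 2 8 3 9 5 6 0. Let P_1 = G and P_2 = A. range(T, P_1) = [6..8], so 6 ≤ st and ed ≤ 8. range(T, P_2) = [1..3]. range(T, P_1 P_2) = [st..ed] with st = 7 and ed = 8.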
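
The search described for Lemma 2 can be sketched as two binary searches over [st_1..ed_1], using the rank of each suffix after its prefix P_1 is removed. The function name `concat_range` and the parameter `m1` (the length of P_1) are illustrative assumptions.

```python
def concat_range(sa, inv, r1, m1, r2):
    """Given r1 = range(T, P1) with |P1| = m1 and r2 = range(T, P2), return range(T, P1 P2) or None."""
    (st1, ed1), (st2, ed2) = r1, r2

    def shifted_rank(r: int) -> int:
        # Rank of the suffix obtained by deleting the prefix P1 from the suffix at rank r.
        return inv[sa[r] + m1]

    lo, hi = st1, ed1 + 1
    while lo < hi:                    # smallest rank whose shifted rank is >= st2
        mid = (lo + hi) // 2
        if shifted_rank(mid) < st2:
            lo = mid + 1
        else:
            hi = mid
    st, hi = lo, ed1 + 1
    while lo < hi:                    # first rank whose shifted rank is > ed2
        mid = (lo + hi) // 2
        if shifted_rank(mid) <= ed2:
            lo = mid + 1
        else:
            hi = mid
    ed = lo - 1
    return (st, ed) if st <= ed else None


text = "GACAGATCG$"
sa = sorted(range(len(text)), key=lambda j: text[j:])  # naive suffix array
inv = [0] * len(sa)
for rank, j in enumerate(sa):
    inv[j] = rank
ga = concat_range(sa, inv, (6, 8), 1, (1, 3))           # range(T, G) and range(T, A)
```

On the example text the shifted ranks over [6..8] are 0, 1 and 3, so only ranks 7 and 8 fall inside [1..3].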

32 To find the interval of the first character of P: We construct an array C such that for any c in A, C[c] stores the total number of occurrences in T of all characters c′ in A with c′ ≤ c. Then range(T, p_1) = [C[c_2]+1 … C[c]], where c = p_1 is the first character of P and c_2 is the character immediately before c in A.

33 Example: T = GACAGTTCG\$ (positions 0 to 9). Lexicographic order: \$ (T_9), ACAGTTCG\$ (T_1), AGTTCG\$ (T_3), CAGTTCG\$ (T_2), CG\$ (T_7), G\$ (T_8), GACAGTTCG\$ (T_0), GTTCG\$ (T_4), TCG\$ (T_6), TTCG\$ (T_5). SA[0..9] = 9 1 3 2 7 8 0 4 6 5. Let P = GACAGCA. C[A] = 2, C[C] = 4, C[G] = 7, C[T] = 9. range(T, p_1) = [C[C]+1 … C[G]] = [5 … 7].
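
A small sketch of the array C, assuming the alphabet is given as a string of symbols and that the sentinel `$` is not counted (it occupies rank 0 of the suffix array). The names `build_c` and `first_char_range` are hypothetical.

```python
from collections import Counter


def build_c(text: str, alphabet: str) -> dict[str, int]:
    """C[c] = number of occurrences in text of alphabet symbols that are <= c."""
    counts = Counter(text)
    c_array, total = {}, 0
    for ch in sorted(alphabet):
        total += counts[ch]
        c_array[ch] = total
    return c_array


def first_char_range(c_array: dict[str, int], alphabet: str, ch: str):
    """Suffix-array interval of suffixes starting with ch, or None if ch does not occur."""
    order = sorted(alphabet)
    idx = order.index(ch)
    # Rank 0 belongs to the suffix "$", so the smallest symbol starts at rank 1.
    lo = c_array[order[idx - 1]] + 1 if idx > 0 else 1
    hi = c_array[ch]
    return (lo, hi) if lo <= hi else None


c_array = build_c("GACAGTTCG$", "ACGT")
g_range = first_char_range(c_array, "ACGT", "G")
```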

34 Lemma 3 Given the suffix array and the inverse suffix array of T, assume [st..ed] = range(T, P). For any character c, provided the array C has been computed in advance, we can find the interval [st′..ed′] = range(T, cP) in O(log n) time.

35 I. Construct Fst[1..m+1] and Fed[1..m+1] such that [Fst[i]..Fed[i]] = range(T, P[i..m]).
II. Call kapproximate([0..n], 1, 0, ε, ε).
kapproximate([s..e], i, k′, P′_L, Υ)
begin
1. Given [Fst[i]..Fed[i]] = range(T, P[i..m]) and [s..e] = range(T, P′_L), by Lemma 2 find [st..ed] = range(T, P′_L P[i..m]).
2. Report occurrences of P′ = P′_L P[i..m] in [st..ed] if the interval exists.
3. If (k′ = k) return.
4. For j := i to m+1
(a) (when j ≤ m, deletion at j) Call kapproximate([s..e], j+1, k′+1, P′_L, dΥ).
(b) (when j ≤ m, replacement at j) For each c in A: i. Given [s..e] = range(T, P′_L), by Lemma 1 find [s′..e′] = range(T, P′_L c). ii. Call kapproximate([s′..e′], j+1, k′+1, P′_L c, rΥ).
(c) (insertion at j) For each c in A: i. Given [s..e] = range(T, P′_L), by Lemma 1 find [s′..e′] = range(T, P′_L c). ii. Call kapproximate([s′..e′], j, k′+1, P′_L c, iΥ).
(d) (when j ≤ m) Given [s..e] = range(T, P′_L), by Lemma 1 find [s′..e′] = range(T, P′_L P[j]). s := s′; e := e′; P′_L := P′_L P[j]; Υ := uΥ.
end
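
The procedure can be sketched in Python as follows. This is an illustrative reading of the pseudocode under several assumptions: the text ends with `$`, the pattern uses only alphabet symbols, k < |P|, the suffix array is built by naive sorting, and the edit trace Υ is omitted. Instead of reporting intervals it collects the 1-based end position of every occurrence. All function names are hypothetical.

```python
def narrow(st, ed, key, low, high):
    """Sub-interval of [st..ed] where low <= key(r) <= high (key non-decreasing), or None."""
    lo, hi = st, ed + 1
    while lo < hi:                      # first r with key(r) >= low
        mid = (lo + hi) // 2
        if key(mid) < low:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, ed + 1
    while lo < hi:                      # first r with key(r) > high
        mid = (lo + hi) // 2
        if key(mid) <= high:
            lo = mid + 1
        else:
            hi = mid
    return (first, lo - 1) if first <= lo - 1 else None


def k_difference(text, pattern, k, alphabet):
    """1-based end positions of substrings of text within edit distance k of pattern."""
    sa = sorted(range(len(text)), key=lambda j: text[j:])   # naive suffix array
    inv = [0] * len(sa)
    for rank, j in enumerate(sa):
        inv[j] = rank
    full, m, ends = (0, len(sa) - 1), len(pattern), set()

    def extend(rng, length, c):     # Lemma 1: range(T, Xc) from range(T, X), |X| = length
        return narrow(rng[0], rng[1],
                      lambda r: text[sa[r] + length:sa[r] + length + 1], c, c)

    def concat(rng, length, rng2):  # Lemma 2: range(T, XY) from range(T, X) and range(T, Y)
        return narrow(rng[0], rng[1], lambda r: inv[sa[r] + length], rng2[0], rng2[1])

    suffix_rng = []                 # suffix_rng[i] = range(T, pattern[i:]), 0-based Fst/Fed
    for i in range(m + 1):
        rng = full
        for offset, c in enumerate(pattern[i:]):
            rng = rng and extend(rng, offset, c)
        suffix_rng.append(rng)

    def rec(rng, i, errors, left_len):  # rng = range(T, P'_L), left_len = |P'_L|
        hit = suffix_rng[i] and concat(rng, left_len, suffix_rng[i])
        if hit:                                            # steps 1 and 2
            for r in range(hit[0], hit[1] + 1):
                ends.add(sa[r] + left_len + m - i)
        if errors == k:                                    # step 3
            return
        for j in range(i, m + 1):                          # step 4
            if j < m:
                rec(rng, j + 1, errors + 1, left_len)      # (a) deletion at j
            for c in alphabet:
                nxt = extend(rng, left_len, c)
                if nxt is None:
                    continue                               # P'_L c does not occur in T
                if j < m:
                    rec(nxt, j + 1, errors + 1, left_len + 1)  # (b) replacement at j
                rec(nxt, j, errors + 1, left_len + 1)      # (c) insertion at j
            if j < m:                                      # (d) keep pattern[j] unedited
                rng = extend(rng, left_len, pattern[j])
                if rng is None:
                    return
                left_len += 1

    rec(full, 0, 0, 0)
    return ends


ends = k_difference("abbaaa$", "aba", 1, "ab")
```

Branches whose prefix P′_L c does not occur in T are skipped, so the recursion only explores edited prefixes that can still match somewhere in the text.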

36 After an O(n)-time preprocessing of the text T into an O(n log n)-bit data structure, the algorithm solves the k-difference problem in O(|A|^k m^k log n + output time) time.

37 References
[1] A. Amir, D. Keselman, G.M. Landau, M. Lewenstein, N. Lewenstein, M. Rodeh, Indexing and dictionary matching with one error, in: Proc. Sixth WADS, Lecture Notes in Computer Science, vol. 1663, Springer, Berlin, 1999, pp. 181–192.
[2] A. Amir, M. Lewenstein, E. Porat, Faster algorithms for string matching with k mismatches, in: Proc. 11th Ann. ACM-SIAM Symp. on Discrete Algorithms, 2000, pp. 794–803.
[3] R.A. Baeza-Yates, G. Navarro, A faster algorithm for approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM96), pp. 1–23.
[4] R.A. Baeza-Yates, G. Navarro, A practical index for text retrieval allowing errors, in: CLEI, vol. 1, November 1997, pp. 273–282.
[5] R. Boyer, S. Moore, A fast string matching algorithm, CACM 20 (1977) 762–772.
[6] A.L. Buchsbaum, M.T. Goodrich, J. Westbrook, Range searching over tree cross products, in: ESA 2000, pp. 120–131.
[7] A. Cobbs, Fast approximate matching using suffix trees, in: Proc. Sixth Ann. Symp. on Combinatorial Pattern Matching (CPM95), Lecture Notes in Computer Science, vol. 807, Springer, Berlin, 1995, pp. 41–54.
[8] R. Cole, L.A. Gottlieb, M. Lewenstein, Dictionary matching and indexing with errors and don't cares, in: Proc. 36th Ann. ACM Symp. on Theory of Computing, 2004, pp. 91–100.
[9] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proc. 41st IEEE Symp. on Foundations of Computer Science (FOCS00), 2000, pp. 390–398.

38 [10] G. Gonnet, A tutorial introduction to computational biochemistry using Darwin, Technical Report, Informatik E.T.H., Zurich, Switzerland, 1992.
[11] R. Grossi, J.S. Vitter, Compressed suffix arrays and suffix trees with applications to text indexing and string matching, in: Proc. 32nd ACM Symp. on Theory of Computing, 2000, pp. 397–406.
[12] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, 1997.
[13] W.K. Hon, K. Sadakane, W.K. Sung, Breaking a time-and-space barrier in constructing full-text indices, in: Proc. IEEE Symp. on Foundations of Computer Science, 2003.
[14] P. Jokinen, E. Ukkonen, Two algorithms for approximate string matching in static texts, in: Proc. MFCS91, Lecture Notes in Computer Science, vol. 520, Springer, Berlin, 1991, pp. 240–248.
[15] D.K. Kim, J.S. Sim, H. Park, K. Park, Linear-time construction of suffix arrays, in: CPM 2003, pp. 186–199.
[16] D.E. Knuth, J. Morris, V. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323–350.
[17] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, in: CPM 2003, pp. 200–210.
[18] G.M. Landau, U. Vishkin, Fast parallel and serial approximate string matching, J. Algorithms 10 (1989) 157–169.
[19] U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput. 22 (5) (1993) 935–948.

39 [20] E.M. McCreight, A space economical suffix tree construction algorithm, J. ACM 23 (2) (1976) 262–272.
[21] G. Navarro, A guided tour to approximate string matching, ACM Comput. Surveys 33 (1) (2001) 31–88.
[22] G. Navarro, R.A. Baeza-Yates, A new indexing method for approximate string matching, in: Proc. 10th Ann. Symp. on Combinatorial Pattern Matching (CPM99), pp. 163–185.
[23] G. Navarro, R.A. Baeza-Yates, A hybrid indexing method for approximate string matching, J. Discrete Algorithms 1 (1) (2000) 205–239.
[24] G. Navarro, R. Baeza-Yates, E. Sutinen, J. Tarhio, Indexing methods for approximate string matching, IEEE Data Eng. Bull. 24 (4) (2001) 19–27.
[25] G. Navarro, E. Sutinen, J. Tanninen, J. Tarhio, Indexing text with approximate q-grams, in: Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 1848, Springer, Berlin, 2000.
[26] K. Sadakane, T. Shibuya, Indexing huge genome sequences for solving various problems, Genome Informatics 12 (2001) 175–183.
[27] F. Shi, Fast approximate string matching with q-blocks sequences, in: Proc. Third South American Workshop on String Processing (WSP96), Carleton University Press, 1996.
[28] E. Sutinen, J. Tarhio, Filtration with q-samples in approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM96), pp. 50–63.
[29] E. Ukkonen, Approximate matching over suffix trees, in: Proc. Combinatorial Pattern Matching 1993, vol. 4, Springer, Berlin, June 1993, pp. 228–242.
[30] R.A. Wagner, M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 168–173.

40 Thank you!
