1 Approximate String Matching Using Compressed Suffix Arrays. Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249. Advisor: Prof. R. C. T. Lee. Speaker: C. W. Lu.

2 Let x and y be two strings. The edit distance d(x, y) is the minimum number of character insertions, deletions, and replacements needed to convert string x into y. The k-difference string matching problem: given a text T of length n, a pattern P of length m, and an error bound k, find all positions i of T such that there exists a suffix S of T(1, i) with d(S, P) ≤ k.
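As a point of reference for the problem statement, here is a minimal sketch of the classical dynamic-programming solution that needs no index. It is not the method of the paper; the function name `k_difference_end_positions` is an assumption made for illustration. Row 0 of the table is kept at 0 so that a match may start anywhere in T.

```python
def k_difference_end_positions(text, pattern, k):
    """Return the 1-indexed positions i of text such that some suffix S of
    text[:i] satisfies d(S, pattern) <= k (plain dynamic programming)."""
    m = len(pattern)
    # column[j] = best edit distance between pattern[:j] and a suffix of the
    # text read so far; before reading anything, pattern[:j] costs j deletions.
    column = list(range(m + 1))
    hits = []
    for i, t in enumerate(text, start=1):
        new = [0]  # the empty pattern prefix matches the empty suffix for free
        for j in range(1, m + 1):
            new.append(min(
                column[j - 1] + (pattern[j - 1] != t),  # match or replacement
                column[j] + 1,                          # extra text character
                new[j - 1] + 1,                         # missing pattern character
            ))
        column = new
        if column[m] <= k:
            hits.append(i)
    return hits
```

For T = abbaaa, P = aba and k = 1 (the example used on the following slides) this reports the end positions 2, 3, 4, 5 and 6.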

3 The approach of this paper is as follows: given a pattern P and an error bound k, we generate all possible modified patterns P′ that contain at most k errors relative to P. Then we conduct an exact match of every such P′ against T.

4 Example: T = abbaaa, P = aba and k = 1, over the alphabet {a, b}. From P and k, we generate the following modified patterns P′: ba, aaba, baba, bba, aa, abba, aaa, ab, abaa, abab, abb, aba.

5 Then we conduct an exact matching of every P′ against T. Any success indicates that there is a substring S in T such that d(S, P) ≤ k. How can we generate all the modified patterns P′ that we want? We use the following observation.
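The generate-then-match idea can be sketched directly, without any index. The helper names `one_edit`, `neighborhood` and `matching_variants` are assumptions introduced for this illustration, and the exact match is a plain substring test rather than the suffix-array search developed later in the talk.

```python
def one_edit(s, alphabet):
    """All strings obtained from s by exactly one insertion, deletion or replacement."""
    out = set()
    for i in range(len(s) + 1):
        for c in alphabet:
            out.add(s[:i] + c + s[i:])          # insertion before position i
        if i < len(s):
            out.add(s[:i] + s[i + 1:])          # deletion of s[i]
            for c in alphabet:
                out.add(s[:i] + c + s[i + 1:])  # replacement of s[i]
    return out


def neighborhood(pattern, k, alphabet):
    """All strings within edit distance k of pattern (breadth-first expansion)."""
    seen = {pattern}
    frontier = {pattern}
    for _ in range(k):
        frontier = {t for s in frontier for t in one_edit(s, alphabet)} - seen
        seen |= frontier
    return seen


def matching_variants(text, pattern, k, alphabet):
    """Modified patterns that occur exactly somewhere in text."""
    return sorted(v for v in neighborhood(pattern, k, alphabet) if v in text)
```

With T = abbaaa, P = aba, k = 1 and alphabet {a, b}, `neighborhood` yields 12 strings and `matching_variants` keeps the 7 of them that occur in T.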

6 Let S be a substring of T with S = S_1 S_2, and let P = P_1 P_2. If d(S_1, P_1) ≤ k and d(S_2, P_2) = 0, then d(S, P) ≤ k. [Figure: T containing S = S_1 S_2, aligned with P = P_1 P_2.]

7 Example: T = ACACAAAAACACC (positions 1 to 13), P = AGABCA (positions 1 to 6), k = 2. Consider the substring S = T(6, 11) = AAAACA. Let S_1 = T(6, 9) = AAAA and S_2 = T(10, 11) = CA, with P_1 = AGAB and P_2 = CA. Dist(S_1, P_1) = 2 ≤ k and Dist(S_2, P_2) = 0, so Dist(S, P) = 2 ≤ k.

8 Example: T = ACACAAAAACACC (positions 1 to 13), P = AGABCA (positions 1 to 6), k = 2. Consider the substring S = T(8, 11) = AACA. Let S_1 = T(8, 9) = AA and S_2 = T(10, 11) = CA, with P_1 = AGAB and P_2 = CA. Dist(S_1, P_1) = 2 ≤ k and Dist(S_2, P_2) = 0, so Dist(S, P) = 2 ≤ k.
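The two examples can be checked numerically with a standard edit-distance routine. This is only a verification sketch; `edit_distance` and `substring` are illustrative names, and the split P_1 = AGAB, P_2 = CA is the one implied by Dist(S_2, P_2) = 0.

```python
def edit_distance(x, y):
    """Classical edit distance with unit-cost insertion, deletion and replacement."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                # delete cx
                           cur[j - 1] + 1,             # insert cy
                           prev[j - 1] + (cx != cy)))  # match or replace
        prev = cur
    return prev[-1]


def substring(text, start, end):
    """T(start, end) with 1-indexed, inclusive bounds, as on the slides."""
    return text[start - 1:end]


T = "ACACAAAAACACC"
P = "AGABCA"
P1, P2 = P[:4], P[4:]  # AGAB and CA
```

Both splits give a prefix distance of 2 and a suffix distance of 0, and the distance of the whole substring to P does not exceed k = 2.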

9 Based upon the above observation, we can generate all edited patterns P′ by editing a prefix of P and keeping the remaining suffix untouched. Consider P = aba, k = 1.

10 P = aba, k = 1. Starting from P = aba with 0 errors used, each edit at position i uses 1 error; "no operation" keeps the error count at 0 and moves on to position i + 1.
i = 1: ba (deletion), aaba (insertion), baba (insertion), bba (substitution), aba (no operation).
i = 2: aa (deletion), aaba (insertion), abba (insertion), aaa (substitution), aba (no operation).
i = 3: ab (deletion), abaa (insertion), abba (insertion), abb (substitution), aba (no operation).
i = 4: abaa (insertion), abab (insertion).

11 P = aba, k = 2. The first level of edits is the same as for k = 1 on the previous slide, and every edited string has used 1 error. Because k = 2, each of them can now be edited once more at a later position.

12 P = aba, k = 2. Continuing from ba (deletion at i = 1, 1 error used), each further edit brings the error count to 2; "no operation" keeps it at 1.
i = 2: a (deletion), aba (insertion), bba (insertion), aa (substitution), ba (no operation).
i = 3: b (deletion), baa (insertion), bba (insertion), bb (substitution), ba (no operation).
i = 4: baa (insertion), bab (insertion).

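13 For i = 1 to m+1, split P as P = P_L P_R with P_R = P[i..m]. The current modified pattern is P′ = P′_L P_R, where k′ = Dist(P′_L, P_L) ≤ k and the suffix P_R is untouched (distance 0). At position i one of the following is applied:
Deletion of P[i], k′++.
Replacement of P[i] by a character of A, k′++.
Insertion of a character of A before P[i], k′++.
No operation: P[i] is kept and we move to position i + 1.
Terminate if k′ > k.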
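The left-to-right enumeration can be sketched as a short recursion. This is an illustrative reading of the slide, not the authors' code: `modified_patterns` and `expand` are hypothetical names, positions are 0-based, and duplicates (the same P′ reached by different edits) are simply absorbed by a set.

```python
def modified_patterns(pattern, k, alphabet):
    """Enumerate every P' = P'_L + P[i:] reachable with at most k edits,
    editing positions from left to right and never touching the suffix."""
    m = len(pattern)
    results = set()

    def expand(i, used, left):
        # left is the edited prefix P'_L; pattern[i:] is the untouched suffix P_R
        results.add(left + pattern[i:])
        if used == k:
            return
        for j in range(i, m + 1):
            if j < m:
                expand(j + 1, used + 1, left)          # deletion of pattern[j]
                for c in alphabet:
                    expand(j + 1, used + 1, left + c)  # replacement of pattern[j]
            for c in alphabet:
                expand(j, used + 1, left + c)          # insertion before pattern[j]
            if j < m:
                left += pattern[j]                     # no operation, move right

    expand(0, 0, "")
    return results
```

For P = aba over {a, b}, k = 1 gives the 12 strings listed on slide 10, and k = 2 additionally produces strings such as a, bb and bab from slide 12.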

14 Our problem now becomes the following: given a pattern P, we produce a modified pattern P′. Our job is to determine whether P′ exactly matches some substring of T. For example, suppose P = aba. We have ba as one of the modified patterns, so we would like to find out whether ba matches exactly with a substring of T.

15 This exact matching can be found by using the suffix array and the inverse suffix array.

16 Suffix Array. Let T = t_0 t_1 … t_n, where t_0, t_1, …, t_{n-1} are characters of an alphabet A and t_n = $ is a special symbol that is not in A and is smaller than any symbol in A. The j-th suffix of T is defined as T(j, n) = t_j … t_n and is denoted by T_j. The suffix array SA[0..n] of T is an array of the integers j, each representing the suffix T_j, sorted in the lexicographic order of the corresponding suffixes.

17 Example: T = GACAGTTCG$ (positions 0 to 9).
Suffixes of T: {GACAGTTCG$, ACAGTTCG$, CAGTTCG$, AGTTCG$, GTTCG$, TTCG$, TCG$, CG$, G$, $}
Lexicographic order: $, ACAGTTCG$, AGTTCG$, CAGTTCG$, CG$, G$, GACAGTTCG$, GTTCG$, TCG$, TTCG$ = T_9, T_1, T_3, T_2, T_7, T_8, T_0, T_4, T_6, T_5.
i:     0 1 2 3 4 5 6 7 8 9
SA[i]: 9 1 3 2 7 8 0 4 6 5
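The example can be reproduced with a deliberately naive construction that sorts the suffixes directly. This is a sketch for small inputs only, not one of the linear-time constructions cited later; `build_suffix_array` is an illustrative name, and it relies on `$` sorting before the letters in ASCII.

```python
def build_suffix_array(text):
    """Naive suffix array: sort the start positions by the suffix they begin.
    text is assumed to end with '$', which is smaller than any letter in ASCII."""
    return sorted(range(len(text)), key=lambda j: text[j:])


SA = build_suffix_array("GACAGTTCG$")
```

`SA` then holds 9, 1, 3, 2, 7, 8, 0, 4, 6, 5, matching the table above.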

18 Inverse Suffix Array. The inverse suffix array of T is denoted SA^-1. SA^-1[i] equals the number of suffixes that are lexicographically smaller than T_i.

19 Example: T = GACAGTTCG$ (positions 0 to 9).
Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5).
i:        0 1 2 3 4 5 6 7 8 9
SA[i]:    9 1 3 2 7 8 0 4 6 5
SA^-1[i]: 6 1 3 2 7 9 8 4 5 0
SA^-1[SA[x]] = x. SA^-1[0] = 6 because there are 6 suffixes smaller than T_0 = GACAGTTCG$.
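Given SA, the inverse array follows from the identity SA^-1[SA[x]] = x. A minimal sketch, with `build_suffix_array` and `invert_suffix_array` as illustrative names and a naive sort standing in for a real construction:

```python
def build_suffix_array(text):
    """Naive construction by sorting suffixes (text ends with '$')."""
    return sorted(range(len(text)), key=lambda j: text[j:])


def invert_suffix_array(sa):
    """isa[j] = rank of suffix T_j, i.e. the number of suffixes smaller than T_j."""
    isa = [0] * len(sa)
    for rank, start in enumerate(sa):
        isa[start] = rank
    return isa


SA = build_suffix_array("GACAGTTCG$")
ISA = invert_suffix_array(SA)
```

`ISA` comes out as 6, 1, 3, 2, 7, 9, 8, 4, 5, 0, the bottom row of the table.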

20 SA and SA^-1 each occupy O(n log n) bits. Both data structures can be constructed in linear time [13, 15, 17].

21 In this paper, an interval [st..ed] is called the range of the suffix array of T corresponding to a string P if [st..ed] is the largest interval such that P is a prefix of every suffix T_j for j = SA[st], SA[st+1], …, SA[ed]. We write [st..ed] = range(T, P).

22 Example: T = GACAGTTCG$ (positions 0 to 9).
Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5).
i:     0 1 2 3 4 5 6 7 8 9
SA[i]: 9 1 3 2 7 8 0 4 6 5
P = G. G is a prefix of T_8, T_0 and T_4, where T_8 = T_SA[5], T_0 = T_SA[6] and T_4 = T_SA[7]. So st = 5, ed = 7, and range(T, P) = [5..7].
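One straightforward way to compute range(T, P) from scratch is a pair of binary searches over the suffix array, comparing P against the first |P| characters of each suffix. The sketch below assumes the naive `build_suffix_array` helper and an illustrative function name `suffix_range`; it costs O(|P| log n) and is not the incremental method of the lemmas that follow.

```python
def build_suffix_array(text):
    """Naive construction by sorting suffixes (text ends with '$')."""
    return sorted(range(len(text)), key=lambda j: text[j:])


def suffix_range(text, sa, pattern):
    """Return (st, ed) = range(T, P), or None if P does not occur in T."""
    m = len(pattern)
    # Smallest index whose suffix, truncated to m characters, is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    st = lo
    # Smallest index whose truncated suffix is > pattern; ed is one before it.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    ed = lo - 1
    return (st, ed) if st <= ed else None


TEXT = "GACAGTTCG$"
SA = build_suffix_array(TEXT)
```

On the example text, `suffix_range(TEXT, SA, "G")` gives (5, 7).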

23 Lemma 1 (Gusfield [12]). Given a text T together with its suffix array, assume [st..ed] = range(T, P). Then, for any character c, the interval [st′..ed′] = range(T, Pc) can be computed in O(log n) time.
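A sketch of how such an extension can be done: inside [st..ed] every suffix starts with P, so the suffixes are ordered by their character at offset |P|, and two binary searches on that single character locate the sub-interval for Pc. The names `extend_range` and `depth` are assumptions for this illustration, and P is assumed not to contain `$`.

```python
def build_suffix_array(text):
    """Naive construction by sorting suffixes (text ends with '$')."""
    return sorted(range(len(text)), key=lambda j: text[j:])


def extend_range(text, sa, st, ed, depth, c):
    """Given (st, ed) = range(T, P) with depth = |P|, return range(T, Pc) or None.
    Only the character at offset depth is compared, so the cost is O(log n)."""
    lo, hi = st, ed + 1
    while lo < hi:                       # first suffix whose next character is >= c
        mid = (lo + hi) // 2
        if text[sa[mid] + depth] < c:
            lo = mid + 1
        else:
            hi = mid
    new_st = lo
    hi = ed + 1
    while lo < hi:                       # first suffix whose next character is > c
        mid = (lo + hi) // 2
        if text[sa[mid] + depth] <= c:
            lo = mid + 1
        else:
            hi = mid
    new_ed = lo - 1
    return (new_st, new_ed) if new_st <= new_ed else None


TEXT = "GACAGTTCG$"
SA = build_suffix_array(TEXT)
```

Starting from range(T, G) = [5..7], extending by A narrows the interval to [6..6].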

24 Lemma 2. Given the interval [st_1..ed_1] = range(T, P_1) and the interval [st_2..ed_2] = range(T, P_2), we can find the interval [st..ed] = range(T, P_1 P_2) in O(log n) time using the suffix array and the inverse suffix array of T.

25 Let [st_1..ed_1] = range(T, P_1), [st_2..ed_2] = range(T, P_2), and [st..ed] = range(T, P_1 P_2). Then [st..ed] is a subinterval of [st_1..ed_1].

26 Example: T = GACAGTTCG$ (positions 0 to 9).
Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5).
i:     0 1 2 3 4 5 6 7 8 9
SA[i]: 9 1 3 2 7 8 0 4 6 5
P_1 = G, P_2 = A. range(T, P_1) = [5..7], so range(T, P_1 P_2) must lie within [5..7]. How can we find the exact interval within [5..7]?

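27 By the definition of the suffix array, the suffixes T_SA[st_1], T_SA[st_1+1], …, T_SA[ed_1] are in increasing lexicographic order. Since they all share the prefix P_1, the suffixes T_{SA[st_1]+|P_1|}, T_{SA[st_1+1]+|P_1|}, …, T_{SA[ed_1]+|P_1|} obtained by removing that prefix are also in increasing lexicographic order.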

28 Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5).
T_2 = CAGTTCG$ and T_{2+1} = T_3 = AGTTCG$. T_{2+1} is obtained by deleting the prefix of length 1 from T_2. In general, T_{i+1} is obtained by deleting the prefix of length 1 from T_i.

29 Example: T = GACAGTTCG$ (positions 0 to 9).
Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5).
i:     0 1 2 3 4 5 6 7 8 9
SA[i]: 9 1 3 2 7 8 0 4 6 5
P_1 = G, P_2 = A. range(T, P_1) = [5..7], which holds T_8 < T_0 < T_4. Deleting the common prefix G gives T_{8+1}, T_{0+1}, T_{4+1}, and indeed T_9 < T_1 < T_5.

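30 The lexicographic order of T_{SA[st_1]+|P_1|}, T_{SA[st_1+1]+|P_1|}, …, T_{SA[ed_1]+|P_1|} is also increasing. Thus SA^-1[SA[st_1]+|P_1|] < SA^-1[SA[st_1+1]+|P_1|] < … < SA^-1[SA[ed_1]+|P_1|]. To find st and ed, we find the smallest st in [st_1..ed_1] such that st_2 ≤ SA^-1[SA[st]+|P_1|] ≤ ed_2 and the largest ed in [st_1..ed_1] such that st_2 ≤ SA^-1[SA[ed]+|P_1|] ≤ ed_2. Both can be found by binary search.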

31 Example: T = GACAGATCG$ (positions 0 to 9).
Lexicographic order: $ (T_9), ACAGATCG$ (T_1), AGATCG$ (T_3), ATCG$ (T_5), CAGATCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGATCG$ (T_0), GATCG$ (T_4), TCG$ (T_6).
i:        0 1 2 3 4 5 6 7 8 9
SA[i]:    9 1 3 5 2 7 8 0 4 6
SA^-1[i]: 7 1 4 2 8 3 9 5 6 0
P_1 = G, P_2 = A. range(T, P_1) = [6..8], so 6 ≤ st, ed ≤ 8. range(T, P_2) = [1..3]. For i = 6, 7, 8 the values SA^-1[SA[i]+1] are 0, 1, 3; those inside [1..3] occur at i = 7 and i = 8. Hence range(T, P_1 P_2) = [st..ed] with st = 7 and ed = 8.
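The binary search behind Lemma 2 can be sketched as follows. The function name `concat_range` and its argument layout are assumptions made for illustration; the arrays are built naively, and P_1 is assumed not to contain `$` so that SA[i] + |P_1| is always a valid position.

```python
def build_suffix_array(text):
    """Naive construction by sorting suffixes (text ends with '$')."""
    return sorted(range(len(text)), key=lambda j: text[j:])


def invert_suffix_array(sa):
    """isa[j] = rank of suffix T_j in the suffix array."""
    isa = [0] * len(sa)
    for rank, start in enumerate(sa):
        isa[start] = rank
    return isa


def concat_range(sa, isa, range1, len_p1, range2):
    """Given range1 = range(T, P1) and range2 = range(T, P2), return
    range(T, P1 P2) or None, using O(log n) probes of SA and SA^-1."""
    st1, ed1 = range1
    st2, ed2 = range2

    def key(i):
        # Rank of the suffix left after stripping P1 from T_SA[i];
        # these ranks increase with i inside [st1..ed1].
        return isa[sa[i] + len_p1]

    lo, hi = st1, ed1 + 1
    while lo < hi:                 # smallest i with key(i) >= st2
        mid = (lo + hi) // 2
        if key(mid) < st2:
            lo = mid + 1
        else:
            hi = mid
    st = lo
    hi = ed1 + 1
    while lo < hi:                 # smallest i with key(i) > ed2
        mid = (lo + hi) // 2
        if key(mid) <= ed2:
            lo = mid + 1
        else:
            hi = mid
    ed = lo - 1
    return (st, ed) if st <= ed else None


SA = build_suffix_array("GACAGATCG$")
ISA = invert_suffix_array(SA)
```

With range(T, G) = [6..8] and range(T, A) = [1..3] on T = GACAGATCG$, `concat_range(SA, ISA, (6, 8), 1, (1, 3))` gives (7, 8).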

32 To find the interval of the first character of P: we construct an array C such that, for any c in A, C[c] stores the total number of occurrences in T of all characters c′ with c′ ≤ c. Then, for the first character c = p_1 of P, range(T, p_1) = [C[c_2]+1 .. C[c]], where c_2 is the character immediately before c in A.

33 Example: T = GACAGTTCG$ (positions 0 to 9).
Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5).
i:     0 1 2 3 4 5 6 7 8 9
SA[i]: 9 1 3 2 7 8 0 4 6 5
P = GACAGCA. C[A] = 2, C[C] = 4, C[G] = 7, C[T] = 9. range(T, p_1) = [C[C]+1 .. C[G]] = [5..7].
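A small sketch of the array C and the resulting single-character range. The names `build_c` and `first_char_range` are illustrative; the terminator `$` is not counted, and for the smallest character of A the lower end is taken to be 1, since index 0 of the suffix array always holds the suffix `$`.

```python
from collections import Counter


def build_c(text, alphabet):
    """C[c] = number of characters c' <= c in text, with c' ranging over the alphabet."""
    counts = Counter(text)
    c_array, running = {}, 0
    for ch in sorted(alphabet):
        running += counts[ch]
        c_array[ch] = running
    return c_array


def first_char_range(c_array, alphabet, ch):
    """range(T, ch) = [C[prev] + 1 .. C[ch]], where prev precedes ch in the alphabet."""
    order = sorted(alphabet)
    idx = order.index(ch)
    lo = c_array[order[idx - 1]] + 1 if idx > 0 else 1
    hi = c_array[ch]
    return (lo, hi) if lo <= hi else None


C = build_c("GACAGTTCG$", "ACGT")
```

For the example text this yields C[A] = 2, C[C] = 4, C[G] = 7, C[T] = 9, and `first_char_range(C, "ACGT", "G")` is (5, 7).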

34 Lemma 3. Given the suffix array and the inverse suffix array of T, assume [st..ed] = range(T, P). For any character c, if the array C is available in advance, we can find the interval [st′..ed′] = range(T, cP) in O(log n) time.

35 I. Construct Fst[1..m+1] and Fed[1..m+1] such that [Fst[i]..Fed[i]] = range(T, P[i..m]).
II. Call kapproximate([0..n], 1, 0, ε, ε).
kapproximate([s..e], i, k′, P′_L, Υ)
begin
  1. Given [Fst[i]..Fed[i]] = range(T, P[i..m]) and [s..e] = range(T, P′_L), by Lemma 2 find [st..ed] = range(T, P′_L P[i..m]).
  2. Report occurrences of P′ = P′_L P[i..m] in [st..ed] if the interval exists.
  3. If (k′ = k) return.
  4. For j := i to m+1
     (a) (when j ≤ m, deletion at j) Call kapproximate([s..e], j+1, k′+1, P′_L, dΥ).
     (b) (when j ≤ m, replacement at j) for each c in A
         i. Given [s..e] = range(T, P′_L), by Lemma 1 find [s′..e′] = range(T, P′_L c).
         ii. Call kapproximate([s′..e′], j+1, k′+1, P′_L c, rΥ).
     (c) (insertion at j) for each c in A
         i. Given [s..e] = range(T, P′_L), by Lemma 1 find [s′..e′] = range(T, P′_L c).
         ii. Call kapproximate([s′..e′], j, k′+1, P′_L c, iΥ).
     (d) (when j ≤ m) Given [s..e] = range(T, P′_L), by Lemma 1 find [s′..e′] = range(T, P′_L P[j]). s := s′; e := e′; P′_L := P′_L P[j]; Υ := uΥ;
end
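The control flow of kapproximate can be sketched in Python under simplifying assumptions: the range of each candidate P′ is recomputed from scratch by binary search instead of being maintained incrementally with Lemmas 1 and 2, the edit-trace argument Υ is dropped, and branches are not pruned. The names `k_approximate` and `suffix_range` are illustrative, and pattern positions are 0-based.

```python
def build_suffix_array(text):
    """Naive construction by sorting suffixes (text ends with '$')."""
    return sorted(range(len(text)), key=lambda j: text[j:])


def suffix_range(text, sa, pattern):
    """(st, ed) = range(T, P) by binary search, or None if P does not occur."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    st, hi = lo, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return (st, lo - 1) if st <= lo - 1 else None


def k_approximate(text, sa, pattern, k, alphabet):
    """Return the set of (start, P') such that P' occurs in text at 0-based
    position start and P' is reachable from pattern with at most k edits."""
    m = len(pattern)
    hits = set()

    def recurse(i, used, left):
        candidate = left + pattern[i:]            # P' = P'_L P[i..m]
        found = suffix_range(text, sa, candidate)
        if found is not None:
            for r in range(found[0], found[1] + 1):
                hits.add((sa[r], candidate))
        if used == k:
            return
        for j in range(i, m + 1):
            if j < m:
                recurse(j + 1, used + 1, left)          # (a) deletion at j
                for c in alphabet:
                    recurse(j + 1, used + 1, left + c)  # (b) replacement at j
            for c in alphabet:
                recurse(j, used + 1, left + c)          # (c) insertion at j
            if j < m:
                left += pattern[j]                      # (d) keep P[j]

    recurse(0, 0, "")
    return hits
```

For T = abbaaa$ and P = aba with k = 1, the reported occurrences end at text positions 2 through 6 (1-indexed), which agrees with the definition of the k-difference problem on slide 2.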

36 After an O(n)-time preprocessing of the text T into an O(n log n)-bit data structure, the algorithm solves the k-difference problem in O(|A|^k m^k log n + output time) time.

37 References
[1] A. Amir, D. Keselman, G.M. Landau, M. Lewenstein, N. Lewenstein, M. Rodeh, Indexing and dictionary matching with one error, in: Proc. Sixth WADS, Lecture Notes in Computer Science, vol. 1663, Springer, Berlin, 1999, pp. 181–192.
[2] A. Amir, M. Lewenstein, E. Porat, Faster algorithms for string matching with k mismatches, in: Proc. 11th Ann. ACM-SIAM Symp. on Discrete Algorithms, 2000, pp. 794–803.
[3] R.A. Baeza-Yates, G. Navarro, A faster algorithm for approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM96), pp. 1–23.
[4] R.A. Baeza-Yates, G. Navarro, A practical index for text retrieval allowing errors, in: CLEI, vol. 1, November 1997, pp. 273–282.
[5] R. Boyer, S. Moore, A fast string matching algorithm, CACM 20 (1977) 762–772.
[6] A.L. Buchsbaum, M.T. Goodrich, J. Westbrook, Range searching over tree cross products, in: ESA 2000, pp. 120–131.
[7] A. Cobbs, Fast approximate matching using suffix trees, in: Proc. Sixth Ann. Symp. on Combinatorial Pattern Matching (CPM95), Lecture Notes in Computer Science, vol. 807, Springer, Berlin, 1995, pp. 41–54.
[8] R. Cole, L.A. Gottlieb, M. Lewenstein, Dictionary matching and indexing with errors and don't cares, in: Proc. 36th Ann. ACM Symp. on Theory of Computing, 2004, pp. 91–100.
[9] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proc. 41st IEEE Symp. on Foundations of Computer Science (FOCS00), 2000, pp. 390–398.

38 [10] G. Gonnet, A tutorial introduction to computational biochemistry using Darwin, Technical Report, Informatik E.T.H., Zurich, Switzerland, 1992.
[11] R. Grossi, J.S. Vitter, Compressed suffix arrays and suffix trees with applications to text indexing and string matching, in: Proc. 32nd ACM Symp. on Theory of Computing, 2000, pp. 397–406.
[12] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, 1997.
[13] W.K. Hon, K. Sadakane, W.K. Sung, Breaking a time-and-space barrier in constructing full-text indices, in: Proc. IEEE Symp. on Foundations of Computer Science, 2003.
[14] P. Jokinen, E. Ukkonen, Two algorithms for approximate string matching in static texts, in: Proc. MFCS91, Lecture Notes in Computer Science, vol. 520, Springer, Berlin, 1991, pp. 240–248.
[15] D.K. Kim, J.S. Sim, H. Park, K. Park, Linear-time construction of suffix arrays, in: CPM 2003, pp. 186–199.
[16] D.E. Knuth, J. Morris, V. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323–350.
[17] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, in: CPM 2003, pp. 200–210.
[18] G.M. Landau, U. Vishkin, Fast parallel and serial approximate string matching, J. Algorithms 10 (1989) 157–169.
[19] U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput. 22 (5) (1993) 935–948.

39 [20] E.M. McCreight, A space economical suffix tree construction algorithm, J. ACM 23 (2) (1976) 262–272.
[21] G. Navarro, A guided tour to approximate string matching, ACM Comput. Surveys 33 (1) (2001) 31–88.
[22] G. Navarro, R.A. Baeza-Yates, A new indexing method for approximate string matching, in: Proc. 10th Ann. Symp. on Combinatorial Pattern Matching (CPM99), pp. 163–185.
[23] G. Navarro, R.A. Baeza-Yates, A hybrid indexing method for approximate string matching, J. Discrete Algorithms 1 (1) (2000) 205–239.
[24] G. Navarro, R. Baeza-Yates, E. Sutinen, J. Tarhio, Indexing methods for approximate string matching, IEEE Data Eng. Bull. 24 (4) (2001) 19–27.
[25] G. Navarro, E. Sutinen, J. Tanninen, J. Tarhio, Indexing text with approximate q-grams, in: Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 1848, Springer, Berlin, 2000.
[26] K. Sadakane, T. Shibuya, Indexing huge genome sequences for solving various problems, Genome Informatics 12 (2001) 175–183.
[27] F. Shi, Fast approximate string matching with q-blocks sequences, in: Proc. Third South American Workshop on String Processing (WSP96), Carleton University Press, 1996.
[28] E. Sutinen, J. Tarhio, Filtration with q-samples in approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM96), pp. 50–63.
[29] E. Ukkonen, Approximate matching over suffix trees, in: Proc. Combinatorial Pattern Matching 1993, vol. 4, Springer, Berlin, June 1993, pp. 228–242.
[30] R.A. Wagner, M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 168–173.

40 Thank you!
