1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol.


1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp Advisor: Prof. R. C. T. Lee Speaker: C. W. Lu

2 Let x and y be two strings. The edit distance d(x, y) is the minimum number of character insertions, deletions, and replacements needed to convert string x into y. The k-difference string matching problem: given a text T of length n, a pattern P of length m, and an error bound k, find all positions i of T such that there exists a suffix S of T(1, i) with d(S, P) ≤ k.
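The definitions on this slide can be made concrete with a small sketch. The function names `edit_distance` and `k_difference_end_positions` are illustrative and do not come from the slides. The second function is a plain dynamic-programming scan over the text, used here only to pin down what the problem asks for; it is not the index-based method of the paper.

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions and replacements turning x into y."""
    prev = list(range(len(y) + 1))            # distances from "" to each prefix of y
    for i, cx in enumerate(x, start=1):
        curr = [i]                            # distance from x[:i] to ""
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                 # delete cx
                            curr[j - 1] + 1,             # insert cy
                            prev[j - 1] + (cx != cy)))   # replace, or keep if equal
        prev = curr
    return prev[-1]


def k_difference_end_positions(text, pattern, k):
    """1-based positions i such that some suffix S of T(1, i) has d(S, P) <= k."""
    m = len(pattern)
    col = list(range(m + 1))   # col[r]: best distance between P(1, r) and a suffix of T(1, i)
    ends = []
    for i, t in enumerate(text, start=1):
        new = [0]              # the empty suffix matches the empty pattern prefix
        for r in range(1, m + 1):
            new.append(min(col[r] + 1,
                           new[r - 1] + 1,
                           col[r - 1] + (pattern[r - 1] != t)))
        col = new
        if col[m] <= k:
            ends.append(i)
    return ends


positions = k_difference_end_positions("abbaaa", "aba", 1)
```

This scan takes O(nm) time per query, which is the cost an index on T is meant to avoid.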

3 The approach of this paper is as follows: given a pattern P and an error bound k, we generate all possible edited patterns P' that contain at most k errors with respect to P. Then we conduct an exact match of every such P' against T.

4 Example: T = abbaaa, P = aba and k = 1. From P and k, we generate the following edited patterns P': ba, aaba, baba, bba, aa, abba, aaa, ab, abaa, abab, abb, aba.
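The example can be reproduced with a brute-force sketch. The helper names `one_edit_neighbours` and `edited_patterns` are hypothetical, and the alphabet {a, b} is an assumption read off the example. Over that alphabet the pattern aba has 12 distinct strings within one edit, including abab (insert b at the end).

```python
def one_edit_neighbours(p, alphabet):
    """All strings reachable from p by one insertion, deletion or substitution."""
    out = set()
    for i in range(len(p) + 1):
        for c in alphabet:
            out.add(p[:i] + c + p[i:])          # insertion before position i
        if i < len(p):
            out.add(p[:i] + p[i + 1:])          # deletion of p[i]
            for c in alphabet:
                out.add(p[:i] + c + p[i + 1:])  # substitution (yields p itself when c == p[i])
    return out


def edited_patterns(p, k, alphabet):
    """All strings within edit distance k of p, built one edit at a time."""
    result = {p}
    frontier = {p}
    for _ in range(k):
        frontier = {q for s in frontier for q in one_edit_neighbours(s, alphabet)} - result
        result |= frontier
    return result


candidates = edited_patterns("aba", 1, "ab")
# Exact matching of each candidate against T, here with Python's substring test.
hits = sorted(q for q in candidates if q in "abbaaa")
```

The substring test stands in for the suffix-array lookup that the later slides develop.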

5 Then we conduct an exact matching of all P' against T. Any success indicates that there is a substring S in T such that d(S, P) ≤ k. How can we generate all the P' that we want? We use the following observation.

6 Observation: let S be a substring of T with S = S_1 S_2, and let P = P_1 P_2. If d(S_1, P_1) ≤ k and d(S_2, P_2) = 0, then d(S, P) ≤ k. [Figure: S = S_1 S_2 inside T, aligned with P = P_1 P_2.]

7 Example: T = ACACAAAAACACC, P = AGABCA, k = 2. Consider the substring S = T(6, 11) = AAAACA. Let S_1 = T(6, 9) = AAAA and S_2 = T(10, 11) = CA, and split P into P_1 = AGAB and P_2 = CA. Then Dist(S_1, P_1) = 2 ≤ k and Dist(S_2, P_2) = 0, so Dist(S, P) = 2 ≤ k.

8 Example: T = ACACAAAAACACC, P = AGABCA, k = 2. Consider the substring S = T(8, 11) = AACA. Let S_1 = T(8, 9) = AA and S_2 = T(10, 11) = CA, and split P into P_1 = AGAB and P_2 = CA. Then Dist(S_1, P_1) = 2 ≤ k and Dist(S_2, P_2) = 0, so Dist(S, P) = 2 ≤ k.

9 Based upon the above observation, we can generate all edited patterns P' by editing a prefix of P and keeping the remaining suffix untouched. Consider P = aba, k = 1.

10 P = aba, k = 1. Edited patterns generated at each position i (k' is the number of errors used so far):
i = 1: ba (deletion, k' = 1); aaba (insertion, k' = 1); baba (insertion, k' = 1); bba (substitution, k' = 1); aba (no operation, k' = 0).
i = 2: aa (deletion, k' = 1); aaba (insertion, k' = 1); abba (insertion, k' = 1); aaa (substitution, k' = 1); aba (no operation, k' = 0).
i = 3: ab (deletion, k' = 1); abaa (insertion, k' = 1); abba (insertion, k' = 1); abb (substitution, k' = 1); aba (no operation, k' = 0).
i = 4: abaa (insertion, k' = 1); abab (insertion, k' = 1).

11 P = aba, k = 2. The first round of edits produces the same patterns as in the k = 1 case on the previous slide, each with k' = 1 (or k' = 0 for the unedited aba). Because k' < k, each of these patterns can be edited once more at the positions to the right of its first edit.

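12 P = aba, k = 2. Second round of edits, starting from ba (k' = 1, obtained by deleting at i = 1):
i = 2: a (deletion, k' = 2); aba (insertion, k' = 2); bba (insertion, k' = 2); aa (substitution, k' = 2); ba (no operation, k' = 1).
i = 3: b (deletion, k' = 2); baa (insertion, k' = 2); bba (insertion, k' = 2); bb (substitution, k' = 2); ba (no operation, k' = 1).
i = 4: baa (insertion, k' = 2); bab (insertion, k' = 2).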

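13 For i = 1 to m+1: split P at position i into P = P_L P_R. The edited pattern is P' = P'_L P_R, where k' = Dist(P_L, P'_L) ≤ k and Dist(P_R, P_R) = 0. At position i, one of the following is applied: deletion (k'++), replacement by a character of the alphabet (k'++), insertion of a character of the alphabet (k'++), or no operation. Terminate if k' > k. [Figure: P split into P_L and P_R at position i, with the four operations.]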

14 Our problem now becomes the following: given a pattern P, we produce a modified pattern P'. Our job is to determine whether P' exactly matches some substring of T or not. For example, suppose P = aba. We have ba as one of the modified patterns, so we would like to find out whether ba matches exactly with a substring of T.

15 This exact matching can be found by using the suffix array and the inverse suffix array.

16 Suffix Array. Let T = t_0 t_1 … t_n, where t_0, t_1, …, t_(n-1) are characters of an alphabet A and t_n = $ is a special symbol that is not in A and is smaller than any symbol in A. The jth suffix of T is defined as T(j, n) = t_j … t_n and is denoted by T_j. The suffix array SA[0..n] of T is an array of integers j, each representing the suffix T_j, sorted in the lexicographic order of the corresponding suffixes.

17 Example: T = GACAGTTCG$. Suffixes of T: {GACAGTTCG$, ACAGTTCG$, CAGTTCG$, AGTTCG$, GTTCG$, TTCG$, TCG$, CG$, G$, $}. Lexicographic order: $, ACAGTTCG$, AGTTCG$, CAGTTCG$, CG$, G$, GACAGTTCG$, GTTCG$, TCG$, TTCG$ = T_9, T_1, T_3, T_2, T_7, T_8, T_0, T_4, T_6, T_5.
i:     0 1 2 3 4 5 6 7 8 9
SA[i]: 9 1 3 2 7 8 0 4 6 5
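A minimal sketch of the definition, assuming the text already ends with the sentinel $ (which sorts before the letters in ASCII). The function name `suffix_array` is illustrative. This version simply sorts the suffixes and is far from the linear-time constructions the slides cite; it is meant only for checking small examples.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda j: text[j:])


sa = suffix_array("GACAGTTCG$")
ordered_suffixes = ["GACAGTTCG$"[j:] for j in sa]
```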

18 Inverse Suffix Array. The inverse suffix array of T is denoted by SA^-1[i]. SA^-1[i] equals the number of suffixes that are lexicographically smaller than T_i.

19 Example: T = GACAGTTCG$. Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5).
i:        0 1 2 3 4 5 6 7 8 9
SA[i]:    9 1 3 2 7 8 0 4 6 5
SA^-1[i]: 6 1 3 2 7 9 8 4 5 0
SA^-1[SA[x]] = x. SA^-1[0] = 6 because there are 6 suffixes smaller than T_0 = GACAGTTCG$.
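Since SA^-1[SA[x]] = x, the inverse array can be filled in one pass over SA. A short sketch follows; the function name `inverse_suffix_array` is illustrative, and the suffix array of GACAGTTCG$ is written out as a literal so the block stands alone.

```python
def inverse_suffix_array(sa):
    """inv[j] = rank of suffix T_j, i.e. the number of suffixes smaller than T_j."""
    inv = [0] * len(sa)
    for rank, j in enumerate(sa):
        inv[j] = rank
    return inv


sa = [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]   # suffix array of GACAGTTCG$
inv = inverse_suffix_array(sa)
```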

20 SA and SA^-1 each occupy O(n log n) bits. Both data structures can be constructed in linear time [13, 15, 17].

21 In this paper, an interval [st..ed] is called the range of the suffix array of T corresponding to a string P if [st..ed] is the largest interval such that P is a prefix of every suffix T_j for j = SA[st], SA[st+1], …, SA[ed]. We write [st..ed] = range(T, P).

22 Example: T = GACAGTTCG$. Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5), so SA = [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]. Let P = G. G is a prefix of T_8, T_0 and T_4, where T_8 = T_SA[5], T_0 = T_SA[6] and T_4 = T_SA[7]. Hence st = 5, ed = 7, and range(T, P) = [5..7].
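The range of a pattern can be located directly with two binary searches over the suffix array, comparing the pattern with the first |P| characters of each suffix. This is a sketch for checking the definition; the names `suffix_array` and `sa_range` are illustrative, and each comparison costs O(|P|), so this is not the incremental method of the lemmas that follow.

```python
def suffix_array(text):
    """Naive suffix array: suffix start positions sorted by suffix."""
    return sorted(range(len(text)), key=lambda j: text[j:])


def sa_range(text, sa, pattern):
    """Inclusive (st, ed) such that pattern prefixes exactly the suffixes SA[st..ed], or None."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                         # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    st = lo
    hi = len(sa)
    while lo < hi:                         # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return (st, lo - 1) if st < lo else None


text = "GACAGTTCG$"
sa = suffix_array(text)
range_g = sa_range(text, sa, "G")
```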

23 Lemma 1 (Gusfield [12]). Given a text T together with its suffix array, assume [st..ed] = range(T, P). Then, for any character c, the interval [st'..ed'] = range(T, Pc) can be computed in O(log n) time.
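One straightforward way to obtain this bound is sketched below, under the assumption that P does not contain the sentinel $. All suffixes in [st..ed] share the prefix P, so their characters at offset |P| appear in sorted order and can be binary-searched for c. The function name `extend_range` and its parameters are illustrative; the slide does not spell out the procedure, so this is not necessarily the one given in [12].

```python
def extend_range(text, sa, st, ed, plen, c):
    """Given (st, ed) = range(T, P) with |P| = plen, return range(T, Pc) or None."""
    lo, hi = st, ed + 1
    while lo < hi:                         # first suffix whose next character is >= c
        mid = (lo + hi) // 2
        if text[sa[mid] + plen] < c:
            lo = mid + 1
        else:
            hi = mid
    new_st = lo
    hi = ed + 1
    while lo < hi:                         # first suffix whose next character is > c
        mid = (lo + hi) // 2
        if text[sa[mid] + plen] <= c:
            lo = mid + 1
        else:
            hi = mid
    return (new_st, lo - 1) if new_st < lo else None


text = "GACAGTTCG$"
sa = [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]       # suffix array of text
range_ga = extend_range(text, sa, 5, 7, 1, "A")   # extend range(T, "G") = [5..7] by "A"
```

Each step inspects a single character, so the two searches take O(log n) time in total.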

24 Lemma 2. Given the interval [st_1..ed_1] = range(T, P_1) and the interval [st_2..ed_2] = range(T, P_2), we can find the interval [st..ed] = range(T, P_1 P_2) in O(log n) time using the suffix array and the inverse suffix array of T.

25 Let [st_1..ed_1] = range(T, P_1), [st_2..ed_2] = range(T, P_2), and [st..ed] = range(T, P_1 P_2). Then [st..ed] is a subinterval of [st_1..ed_1].

26 Example: T = GACAGTTCG$. Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5). Let P_1 = G and P_2 = A. range(T, P_1) = [5..7], so range(T, P_1 P_2) must lie within [5..7]. How can we find the exact interval within [5..7]?

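27 By the definition of the suffix array, the suffixes T_{SA[st_1]}, T_{SA[st_1+1]}, …, T_{SA[ed_1]} are in increasing lexicographic order. Since all of them start with P_1, the suffixes obtained by removing P_1, namely T_{SA[st_1]+|P_1|}, T_{SA[st_1+1]+|P_1|}, …, T_{SA[ed_1]+|P_1|}, are also in increasing lexicographic order.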

28 Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5). T_2 = CAGTTCG$ and T_{2+1} = T_3 = AGTTCG$: T_{2+1} is obtained by deleting the prefix of length 1 from T_2. In general, T_{i+1} can be obtained by deleting the prefix of length 1 from T_i.

29 Example: T = GACAGTTCG$. Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5). Let P_1 = G and P_2 = A. range(T, P_1) = [5..7], that is, T_8 < T_0 < T_4. Deleting the first character from each gives T_{8+1}, T_{0+1}, T_{4+1}, and indeed T_9 < T_1 < T_5.

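30 The suffixes T_{SA[st_1]+|P_1|}, …, T_{SA[ed_1]+|P_1|} are also in increasing lexicographic order. Thus SA^-1[SA[st_1]+|P_1|] < SA^-1[SA[st_1+1]+|P_1|] < … < SA^-1[SA[ed_1]+|P_1|]. To find st and ed, we find the smallest st in [st_1..ed_1] such that st_2 ≤ SA^-1[SA[st]+|P_1|] ≤ ed_2, and the largest ed in [st_1..ed_1] such that st_2 ≤ SA^-1[SA[ed]+|P_1|] ≤ ed_2.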

31 Example: T = GACAGATCG$. Lexicographic order: $ (T_9), ACAGATCG$ (T_1), AGATCG$ (T_3), ATCG$ (T_5), CAGATCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGATCG$ (T_0), GATCG$ (T_4), TCG$ (T_6).
i:        0 1 2 3 4 5 6 7 8 9
SA[i]:    9 1 3 5 2 7 8 0 4 6
SA^-1[i]: 7 1 4 2 8 3 9 5 6 0
Let P_1 = G and P_2 = A. range(T, P_1) = [6..8], so 6 ≤ st and ed ≤ 8. range(T, P_2) = [1..3]. For range(T, P_1 P_2) = [st..ed] we obtain st = 7 and ed = 8.
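The search behind Lemma 2 can be sketched as two binary searches over [st_1..ed_1], using the fact that SA^-1[SA[i] + |P_1|] increases with i inside that interval. The names `concat_range` and `rank_after` are illustrative, the suffix array is built naively, and P_1 is assumed not to contain $.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda j: text[j:])


def inverse_suffix_array(sa):
    inv = [0] * len(sa)
    for rank, j in enumerate(sa):
        inv[j] = rank
    return inv


def concat_range(sa, inv, r1, r2, len1):
    """range(T, P1 P2) from r1 = range(T, P1), r2 = range(T, P2) and len1 = |P1|."""
    (st1, ed1), (st2, ed2) = r1, r2

    def rank_after(i):
        # Rank of the suffix left after stripping P1 from T_SA[i].
        return inv[sa[i] + len1]

    lo, hi = st1, ed1 + 1
    while lo < hi:                         # smallest i with rank_after(i) >= st2
        mid = (lo + hi) // 2
        if rank_after(mid) < st2:
            lo = mid + 1
        else:
            hi = mid
    st = lo
    hi = ed1 + 1
    while lo < hi:                         # first i with rank_after(i) > ed2
        mid = (lo + hi) // 2
        if rank_after(mid) <= ed2:
            lo = mid + 1
        else:
            hi = mid
    ed = lo - 1
    return (st, ed) if st <= ed else None


sa = suffix_array("GACAGATCG$")
inv = inverse_suffix_array(sa)
range_ga = concat_range(sa, inv, (6, 8), (1, 3), 1)   # P1 = G, P2 = A
```

No characters of the text are compared here; only SA and SA^-1 are consulted, which is what gives the O(log n) bound regardless of pattern length.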

32 To find the interval of the first character of P: we construct an array C such that, for any c in A, C[c] stores the total number of occurrences in T of all characters c' with c' ≤ c. Then range(T, p_1) = [C[c_2]+1 .. C[c]], where c = p_1 is the first character of P and c_2 is the character immediately before c in A.

33 Example: T = GACAGTTCG$. Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5). Let P = GACAGCA, so p_1 = G. C[A] = 2, C[C] = 4, C[G] = 7, C[T] = 9. range(T, p_1) = [C[C]+1 .. C[G]] = [5..7].
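A small sketch of the C array and the first-character lookup. The names `build_c` and `first_char_range` are illustrative. For the smallest character of A, which has no predecessor c_2, the sketch assumes C[c_2] = 0, which is consistent with the sentinel $ occupying index 0 of the suffix array.

```python
from collections import Counter


def build_c(text, alphabet):
    """C[c] = number of characters in text that are <= c, counted over the alphabet only."""
    counts = Counter(text)
    c_array = {}
    total = 0
    for ch in sorted(alphabet):
        total += counts[ch]
        c_array[ch] = total
    return c_array


def first_char_range(c_array, alphabet, ch):
    """range(T, ch) = [C[c2] + 1 .. C[ch]], where c2 precedes ch in the alphabet."""
    order = sorted(alphabet)
    idx = order.index(ch)
    prev = c_array[order[idx - 1]] if idx > 0 else 0   # assumption: no predecessor means 0
    st, ed = prev + 1, c_array[ch]
    return (st, ed) if st <= ed else None


c_array = build_c("GACAGTTCG$", "ACGT")
range_g = first_char_range(c_array, "ACGT", "G")
```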

34 Lemma 3. Given the suffix array and the inverse suffix array of T, assume [st..ed] = range(T, P). For any character c, assuming the array C has been computed in advance, we can find the interval [st'..ed'] = range(T, cP) in O(log n) time.

35
I. Construct Fst[1..m+1] and Fed[1..m+1] such that [Fst[i]..Fed[i]] = range(T, P[i..m]).
II. Call kapproximate([0..n], 1, 0, ε, ε).
kapproximate([s..e], i, k', P'_L, Υ)
begin
1. Given [Fst[i]..Fed[i]] = range(T, P[i..m]) and [s..e] = range(T, P'_L), by Lemma 2 find [st..ed] = range(T, P'_L P[i..m]).
2. Report occurrences of P' = P'_L P[i..m] in [st..ed] if the interval exists.
3. If (k' = k) return.
4. For j := i to m+1
   (a) (when j ≤ m, deletion at j) Call kapproximate([s..e], j+1, k'+1, P'_L, dΥ).
   (b) (when j ≤ m, replacement at j) For each c in A:
       i. Given [s..e] = range(T, P'_L), by Lemma 1 find [s'..e'] = range(T, P'_L c).
       ii. Call kapproximate([s'..e'], j+1, k'+1, P'_L c, rΥ).
   (c) (insertion at j) For each c in A:
       i. Given [s..e] = range(T, P'_L), by Lemma 1 find [s'..e'] = range(T, P'_L c).
       ii. Call kapproximate([s'..e'], j, k'+1, P'_L c, iΥ).
   (d) (when j ≤ m) Given [s..e] = range(T, P'_L), by Lemma 1 find [s'..e'] = range(T, P'_L P[j]). s := s'; e := e'; P'_L := P'_L P[j]; Υ := uΥ;
end
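The control flow of kapproximate can be followed in a simplified Python sketch. It departs from the slide in several labelled ways: indices are 0-based, the edit script Υ is not tracked, the intervals are recomputed with a direct binary search (`sa_range`) instead of being maintained through Lemmas 1 and 2 with the Fst/Fed tables, and the loop stops as soon as the edited prefix no longer occurs in T. All function names are illustrative.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda j: text[j:])


def sa_range(text, sa, pattern):
    """Inclusive suffix-array interval of pattern, or None (direct binary search)."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    st = lo
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return (st, lo - 1) if st < lo else None


def k_approximate(text, pattern, k, alphabet):
    """Map each edited pattern P' occurring in text to its sorted 0-based start positions."""
    sa = suffix_array(text)
    m = len(pattern)
    found = {}

    def recurse(left, i, errors):
        # left plays the role of P'_L; pattern[i:] is the untouched suffix.
        candidate = left + pattern[i:]
        rng = sa_range(text, sa, candidate)
        if candidate and rng is not None:
            found[candidate] = sorted(sa[rng[0]:rng[1] + 1])
        if errors == k:
            return
        for j in range(i, m + 1):
            if sa_range(text, sa, left) is None:
                return                                   # no extension of left can occur
            if j < m:
                recurse(left, j + 1, errors + 1)         # deletion at j
                for c in alphabet:                       # replacement at j (every c, as on the slide)
                    recurse(left + c, j + 1, errors + 1)
            for c in alphabet:                           # insertion before j
                recurse(left + c, j, errors + 1)
            if j < m:
                left += pattern[j]                       # no operation at j

    recurse("", 0, 0)
    return found


result = k_approximate("abbaaa$", "aba", 1, "ab")
```

The same edited pattern can be reached along several paths, so the dictionary also serves to deduplicate reports.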

36 After an O(n)-time preprocessing of the text T into an O(n log n)-bit data structure, the algorithm solves the k-difference problem in O(|A|^k m^k log n + output time) time.

37 References
[1] A. Amir, D. Keselman, G.M. Landau, M. Lewenstein, N. Lewenstein, M. Rodeh, Indexing and dictionary matching with one error, in: Proc. Sixth WADS, Lecture Notes in Computer Science, vol. 1663, Springer, Berlin, 1999, pp. 181–192.
[2] A. Amir, M. Lewenstein, E. Porat, Faster algorithms for string matching with k mismatches, in: Proc. 11th Ann. ACM-SIAM Symp. on Discrete Algorithms, 2000, pp. 794–803.
[3] R.A. Baeza-Yates, G. Navarro, A faster algorithm for approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM96), pp. 1–23.
[4] R.A. Baeza-Yates, G. Navarro, A practical index for text retrieval allowing errors, in: CLEI, vol. 1, November 1997, pp. 273–282.
[5] R. Boyer, S. Moore, A fast string matching algorithm, CACM 20 (1977) 762–772.
[6] A.L. Buchsbaum, M.T. Goodrich, J. Westbrook, Range searching over tree cross products, in: ESA 2000, pp. 120–131.
[7] A. Cobbs, Fast approximate matching using suffix trees, in: Proc. Sixth Ann. Symp. on Combinatorial Pattern Matching (CPM95), Lecture Notes in Computer Science, vol. 807, Springer, Berlin, 1995, pp. 41–54.
[8] R. Cole, L.A. Gottlieb, M. Lewenstein, Dictionary matching and indexing with errors and don't cares, in: Proc. 36th Ann. ACM Symp. on Theory of Computing, 2004, pp. 91–100.
[9] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proc. 41st IEEE Symp. on Foundations of Computer Science (FOCS00), 2000, pp. 390–398.

38
[10] G. Gonnet, A tutorial introduction to computational biochemistry using Darwin, Technical Report, Informatik E.T.H., Zurich, Switzerland,
[11] R. Grossi, J.S. Vitter, Compressed suffix arrays and suffix trees with applications to text indexing and string matching, in: Proc. 32nd ACM Symp. on Theory of Computing, 2000, pp. 397–406.
[12] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge,
[13] W.K. Hon, K. Sadakane, W.K. Sung, Breaking a time-and-space barrier in constructing full-text indices, in: Proc. IEEE Symp. on Foundations of Computer Science,
[14] P. Jokinen, E. Ukkonen, Two algorithms for approximate string matching in static texts, in: Proc. MFCS91, Lecture Notes in Computer Science, vol. 520, Springer, Berlin, 1991, pp. 240–248.
[15] D.K. Kim, J.S. Sim, H. Park, K. Park, Linear-time construction of suffix arrays, in: CPM 2003, pp. 186–199.
[16] D.E. Knuth, J. Morris, V. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323–350.
[17] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, in: CPM 2003, pp. 200–210.
[18] G.M. Landau, U. Vishkin, Fast parallel and serial approximate string matching, J. Algorithms 10 (1989) 157–169.
[19] U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput. 22 (5) (1993) 935–948.

39
[20] E.M. McCreight, A space economical suffix tree construction algorithm, J. ACM 23 (2) (1976) 262–272.
[21] G. Navarro, A guided tour to approximate string matching, ACM Comput. Surveys 33 (1) (2001) 31–88.
[22] G. Navarro, R.A. Baeza-Yates, A new indexing method for approximate string matching, in: Proc. 10th Ann. Symp. on Combinatorial Pattern Matching (CPM99), pp. 163–185.
[23] G. Navarro, R.A. Baeza-Yates, A hybrid indexing method for approximate string matching, J. Discrete Algorithms 1 (1) (2000) 205–
[24] G. Navarro, R. Baeza-Yates, E. Sutinen, J. Tarhio, Indexing methods for approximate string matching, IEEE Data Eng. Bull. 24 (4) (2001) 19–27.
[25] G. Navarro, E. Sutinen, J. Tanninen, J. Tarhio, Indexing text with approximate q-grams, in: Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 1848, Springer, Berlin,
[26] K. Sadakane, T. Shibuya, Indexing huge genome sequences for solving various problems, Genome Informatics 12 (2001) 175–183.
[27] F. Shi, Fast approximate string matching with q-blocks sequences, in: Proc. Third South American Workshop on String Processing (WSP96), Carleton University Press,
[28] E. Sutinen, J. Tarhio, Filtration with q-samples in approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM96), pp. 50–63.
[29] E. Ukkonen, Approximate matching over suffix trees, in: Proc. Combinatorial Pattern Matching 1993, vol. 4, Springer, Berlin, June 1993, pp. 228–242.
[30] R.A. Wagner, M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 168–173.

40 Thank you!