1 Average-Optimal Multiple Approximate String Matching. Kimmo Fredriksson, Gonzalo Navarro. ACM Journal of Experimental Algorithmics, Vol. 9, Article No. 1.4, 2004, Pages 1-47. Professor: R.C.T. Lee. Speaker: K.W. Liu.
2 The Problem. The approximate string matching problem: given a text T[1...n] and a pattern P[1...m] over a finite alphabet of size σ, find the approximate occurrences of P in T, allowing at most k differences (insertions, deletions, substitutions).
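As a baseline for what counts as an approximate occurrence, the following is a minimal sketch of the classical dynamic-programming search (the approach the presentation later attributes to Sellers). It is not the filtering algorithm of the paper. The function name `approx_occurrences` and the choice to report 1-based end positions are assumptions made for illustration.

```python
def approx_occurrences(text, pattern, k):
    """Return the 1-based positions j such that some substring of text
    ending at j is within edit distance k of pattern."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # substring of the text read so far that ends at the current position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, 1):
        prev_diag = col[0]  # row 0 is always 0: an occurrence may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(prev_diag + (pattern[i - 1] != c),  # match / substitution
                         old + 1,                            # extra text character
                         col[i - 1] + 1)                     # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


print(approx_occurrences("abcdefg", "cde", 0))  # [5]
```

This scan costs O(mn) time, which is the cost the window-based filtering described on the following slides tries to avoid on most of the text.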
3 For a window of size m - k, if there exists a substring S_1 in this window such that its edit distance with every substring of P is greater than k, we move P. Our algorithm scans from the right, as shown in Fig. 3. [Fig. 3: text T and pattern P aligned, with S_1 marked inside a window of size m - k.]
4 For a window of size m - k, if there exists a suffix S_1 such that its edit distance with every substring of P is greater than k, we move P to S_1. Our algorithm scans from the right, as shown in Fig. 3. [Fig. 3: text T and pattern P aligned, with the suffix S_1 marked inside a window of size m - k.]
5 But how do we know that ED(S_1, S_2) > k for every substring S_2 of P? We use a very useful lemma.
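6 Lemma. Consider strings Q and P. Let Q be divided into pieces q_1, q_2, ..., q_n. [Figure: Q split into consecutive pieces q_1, q_2, ..., q_n.] For each q_i, let p_i be the substring of P such that ED(q_i, p_i) is the smallest among all substrings of P. Then ED(Q, P) >= ED(q_1, p_1) + ED(q_2, p_2) + ... + ED(q_n, p_n).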
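7 Proof: Divide P into n pieces p'_1, p'_2, ..., p'_n so that an optimal alignment of Q with P aligns each q_i with p'_i. [Figure: Q split into q_1, q_2, ..., q_n above P split into p'_1, p'_2, ..., p'_n.] Then ED(Q, P) = ED(q_1, p'_1) + ... + ED(q_n, p'_n). Since each p'_i is a substring of P, ED(q_i, p'_i) >= ED(q_i, p_i) for every i, and hence ED(Q, P) >= ED(q_1, p_1) + ... + ED(q_n, p_n).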
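8 To determine whether ED(S_1, S_2) > k, we may use the lemma. We divide the window into small pieces t_1, t_2, ..., t_a. For each t_i, we find the substring p_i of P for which ED(p_i, t_i) is the smallest. If the sum of these smallest distances exceeds k, then by the lemma no occurrence of P can contain the window. [Fig. 7: window W of T divided into pieces t_1, t_2, ..., each paired with its closest substring p_1, p_2, ... of P.]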
10 In general, to find such a p_i, we may use dynamic programming [Sellers 1980]. However, we can instead use small pieces of a special kind. It is customary to call a piece of length L an L-gram. Let us use 2-grams.
11 Note that for two substrings P and Q which are of length 2, the edit distance between them is equal to the Hamming distance between them. Thus, we may use 2-grams in our algorithm.
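The claim about 2-grams can be checked exhaustively. The sketch below compares the edit distance and the Hamming distance for every pair of 2-grams over the DNA alphabet used in the later example slides; the helper names `edit_distance` and `hamming` are illustrative choices, not taken from the presentation.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Plain recursive edit distance (insertion, deletion, substitution)."""
    if not a or not b:
        return len(a) + len(b)
    return min(edit_distance(a[1:], b[1:]) + (a[0] != b[0]),  # match / substitute
               edit_distance(a[1:], b) + 1,                   # delete from a
               edit_distance(a, b[1:]) + 1)                   # insert into a


def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


grams = ["".join(g) for g in product("acgt", repeat=2)]
mismatches = [(p, q) for p in grams for q in grams
              if edit_distance(p, q) != hamming(p, q)]
print(mismatches)  # []
```

The equality is specific to very short pieces. For length 3 the two measures can already differ: `edit_distance("act", "cta")` is 2 (delete the leading a, append an a), while `hamming("act", "cta")` is 3.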
12 Our algorithm: build a table D that stores, for each possible 2-gram over the alphabet, the smallest edit distance between that 2-gram and all substrings of the pattern P. This is done in the preprocessing stage.
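One way to fill such a table is sketched below, assuming a small dynamic program in which the first row is all zeros so that the 2-gram may align with a substring starting anywhere in P. The names `min_dist_to_substrings` and `build_d_table` are hypothetical, and the pattern and alphabet are the ones used in the example slides of this presentation.

```python
from itertools import product


def min_dist_to_substrings(gram, pattern):
    """Smallest edit distance between gram and any substring of pattern."""
    # row[j] = best distance between the gram prefix read so far and
    # any substring of pattern that ends at position j.
    row = [0] * (len(pattern) + 1)  # empty prefix matches an empty substring anywhere
    for i, g in enumerate(gram, 1):
        new = [i]
        for j, p in enumerate(pattern, 1):
            new.append(min(row[j - 1] + (g != p),  # match / substitution
                           row[j] + 1,             # gram character left unmatched
                           new[j - 1] + 1))        # pattern character left unmatched
        row = new
    return min(row)  # the substring may end anywhere in pattern


def build_d_table(pattern, alphabet, l=2):
    """Map every l-gram over alphabet to its smallest distance to pattern."""
    grams = ("".join(t) for t in product(alphabet, repeat=l))
    return {g: min_dist_to_substrings(g, pattern) for g in grams}


D = build_d_table("ttaatatat", "acgt")
print(D["aa"], D["gg"], D["ga"])  # 0 2 1
```

The table has σ^l entries, so for the DNA alphabet and 2-grams it holds 16 values and each lookup during the scan is a constant-time dictionary access.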
13 Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1. [Figure: T with the first window of size m - k marked.] Smallest edit distance between aa and all substrings of P = 0. Smallest edit distance between gg and all substrings of P = 2 > k.
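14 Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1. [Figure: T with the shifted window of size m - k starting at position i.] Smallest edit distance between tt and all substrings of P = 0. Smallest edit distance between aa and all substrings of P = 0. Smallest edit distance between at and all substrings of P = 0. Smallest edit distance between ga and all substrings of P = 1 = k. The sum of these distances does not exceed k, so this window cannot be skipped and must be verified. [Figure: T with a span labelled m + 2k around the window of size m - k; after verification the window moves from i to i + 1.]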
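15 Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1. To verify the window, we find the edit distance between gaataattta and P. [Figure: T with the window of size m - k and the verified area of size m + k marked; the next window starts at i + 1.]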
16 Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1. [Figure: T shown twice, with the window of size m - k at two successive positions.]
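The example slides can be replayed with a short sketch of the filter for a single pattern. It rests on several assumptions that are not spelled out in the transcript: the 2-grams in a window are read right to left without overlap, a window whose running sum exceeds k is shifted to start one character after the start of the last 2-gram read, and a surviving window at position i is verified by comparing P with the m + k text characters starting at i. The names `last_row` and `filter_search`, the 0-based positions, and the returned list of verified windows are all illustrative.

```python
from itertools import product


def last_row(a, b, free_start=False):
    """Final DP row: distance between a and every prefix of b.
    With free_start, a may align with a substring of b starting anywhere."""
    row = [0] * (len(b) + 1) if free_start else list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        new = [i]
        for j, y in enumerate(b, 1):
            new.append(min(row[j - 1] + (x != y), row[j] + 1, new[j - 1] + 1))
        row = new
    return row


def filter_search(text, pattern, k, alphabet, l=2):
    """Return (starts, verified): 0-based start positions of occurrences
    with at most k differences, and the window positions that were verified."""
    m, n, w = len(pattern), len(text), len(pattern) - k
    # Preprocessing: table D of smallest distances from each l-gram to P.
    D = {"".join(g): min(last_row("".join(g), pattern, free_start=True))
         for g in product(alphabet, repeat=l)}
    starts, verified, i = [], [], 0
    while i + w <= n:
        total, pos, shift_to = 0, i + w - l, None
        while pos >= i:                      # read l-grams from the right
            total += D[text[pos:pos + l]]
            if total > k:                    # no occurrence can start in i..pos
                shift_to = pos + 1
                break
            pos -= l
        if shift_to is not None:
            i = shift_to
            continue
        verified.append(i)                   # window survived the filter
        if min(last_row(pattern, text[i:i + m + k])) <= k:
            starts.append(i)
        i += 1
    return starts, verified


T, P = "ctagggaataatttacaatt", "ttaatatat"
starts, verified = filter_search(T, P, 1, "acgt")
print(verified[0])  # 5
```

On the example, the first window ctagggaa is discarded after reading aa and gg, and the first verified window starts at 0-based position 5, which is the text gaataattta examined on slide 15. Every character of the text is assumed to belong to `alphabet`; otherwise the table lookup would fail.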
17 In the preprocessing stage, we build a table D that records, for each possible l-gram (string of length l) over the alphabet, the smallest edit distance between that l-gram and all substrings of P.
18 D table: example (step by step). For example, P = aacaccgaa. [Table: the D_P row, with one entry for each 2-gram aa, ac, ag, at, ca, cc, cg, ct, ga, gc, gg, gt, ta, tc, tg, tt.] For P = aacaccgaa vs aa.
19 For P = aacaccgaa vs ac. [Table: the D_P row, with one entry for each 2-gram aa, ac, ag, at, ca, cc, cg, ct, ga, gc, gg, gt, ta, tc, tg, tt.]
20 For P = aacaccgaa vs ag. For P = aacaccgaa vs at. For P = aacaccgaa vs ca. For P = aacaccgaa vs cc. [Table: the D_P row, with one entry for each 2-gram aa, ac, ag, at, ca, cc, cg, ct, ga, gc, gg, gt, ta, tc, tg, tt.]
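The step-by-step table can be reproduced by brute force, which is slow but easy to check by hand: enumerate every substring of P (including the empty one) and take the smallest edit distance to each 2-gram. This is a sketch for cross-checking only; the helper names `edit_distance` and `d_entry` are hypothetical, and the 2-grams are listed in the same order as the table header on the slides.

```python
from itertools import product


def edit_distance(a, b):
    """Standard edit distance between two whole strings."""
    row = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        new = [i]
        for j, y in enumerate(b, 1):
            new.append(min(row[j - 1] + (x != y), row[j] + 1, new[j - 1] + 1))
        row = new
    return row[-1]


def d_entry(gram, pattern):
    """Smallest edit distance between gram and any substring of pattern."""
    n = len(pattern)
    subs = {pattern[s:e] for s in range(n + 1) for e in range(s, n + 1)}
    return min(edit_distance(gram, s) for s in subs)


grams = ["".join(t) for t in product("acgt", repeat=2)]
row = [d_entry(g, "aacaccgaa") for g in grams]
print(" ".join(grams))
print(" ".join(f"{d:>2}" for d in row))
# aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
#  0  0  1  1  0  0  0  1  0  1  1  1  1  1  1  2
```

The entries equal to 0 are exactly the 2-grams that occur in P, and tt is the only 2-gram at distance 2 because t does not occur in P at all.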
21 Time complexity The average complexity of the algorithm is for
22 The end
23 [BYN2000] New models and algorithms for multidimensional approximate pattern matching. BAEZA-YATES, R. AND NAVARRO, G. Journal of Discrete Algorithms 1, 1, 21–49. Special issue on Matching Patterns.
[BYN2002] New and faster filters for multiple approximate string matching. BAEZA-YATES, R. AND NAVARRO, G. Random Structures and Algorithms 20, 23–49.
[BYR99] Modern Information Retrieval. BAEZA-YATES, R. AND RIBEIRO-NETO, B. Addison-Wesley, Reading, MA.
[BYN99] Faster approximate string matching. BAEZA-YATES, R. A. AND NAVARRO, G. Algorithmica 23, 2, 127–158.
[CL94] Sublinear approximate string matching and biological applications. CHANG, W. AND LAWLER, E. Algorithmica 12, 4/5, 327–344.
[CM94] Approximate string matching and local similarity. CHANG, W. AND MARR, T. In Proceedings of 5th Combinatorial Pattern Matching (CPM94). LNCS, vol Springer-Verlag, Berlin, 259–273.
[CCGJLPR94] Speeding up two string matching algorithms. CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., AND RYTTER, W. Algorithmica 12, 4/5, 247–267.
24 [CR94] Text Algorithms. CROCHEMORE, M. AND RYTTER, W. Oxford University Press, Oxford, UK.
[DM79] Automatic Speech and Speaker Recognition. DIXON, R. AND MARTIN, T., Eds. IEEE Press.
[EL90] A review of segmentation and contextual analysis techniques for text recognition. ELLIMAN, D. AND LANCASTER, I. Pattern Recogn. 23, 3/4, 337–346.
[F2003] Row-wise tiling for the Myers bit-parallel approximate string matching algorithm. FREDRIKSSON, K. In Proceedings of 10th Symposium on String Processing and Information Retrieval (SPIRE03). LNCS, vol Springer-Verlag, Berlin, 66–79.
[FN2003] Average-optimal multiple approximate string matching. FREDRIKSSON, K. AND NAVARRO, G. In Proceedings of 14th Combinatorial Pattern Matching (CPM03). LNCS, vol –128.
[FN2004] Improved single and multiple approximate string matching. FREDRIKSSON, K. AND NAVARRO, G. In Proceedings of 15th Combinatorial Pattern Matching (CPM04). LNCS, vol Springer-Verlag, Berlin, 457–471.
25 [GL89] Simple and efficient string matching with k mismatches. GROSSI, R. AND LUCCIO, F. Information Processing Letters 33, 3, 113–120.
Practical fast searching in strings. HORSPOOL, R. Software Practice and Experience 10, 501–506.
[HFN2004] Increased bit-parallelism for approximate string matching. HYYRÖ, H., FREDRIKSSON, K., AND NAVARRO, G. In Proceedings of 3rd Workshop on Efficient and Experimental Algorithms (WEA04). LNCS, vol Springer-Verlag, Berlin, 285–298.
[HN2002] Faster bit-parallel approximate string matching. HYYRÖ, H. AND NAVARRO, G. In Proceedings of 13th Combinatorial Pattern Matching (CPM02). LNCS, vol Springer-Verlag, Berlin, 203–224. Extended version to appear in Algorithmica.
[JTU96] A comparison of approximate string matching algorithms. JOKINEN, P., TARHIO, J., AND UKKONEN, E. Software Practice and Experience 26, 12, 1439–
[K92] Techniques for automatically correcting words in text. KUKICH, K. ACM Computing Surveys 24, 4, 377–439.
[KS94] A pattern-matching model for intrusion detection. KUMAR, S. AND SPAFFORD, E. In Proceedings of National Computer Security Conference. 11–21.
[LT94] On the searchability of electronic ink. LOPRESTI, D. AND TOMKINS, A. In Proceedings of 4th International Workshop on Frontiers in Handwriting Recognition. 156–165.
26 [MM96] Approximate multiple string search. MUTH, R. AND MANBER, U. In Proceedings of 7th Combinatorial Pattern Matching (CPM96). LNCS, vol Springer-Verlag, Berlin, 75–86.
[M99] A fast bit-vector algorithm for approximate string matching based on dynamic programming. MYERS, E. W. J. ACM 46, 3, 395–415.
[N2001] A guided tour to approximate string matching. NAVARRO, G. ACM Computing Surveys 33, 1, 31–88.
[NB99] Very fast and simple approximate string matching. NAVARRO, G. AND BAEZA-YATES, R. Inf. Process. Lett. 72, 65–70.
[NB2001] Improving an algorithm for approximate pattern matching. NAVARRO, G. AND BAEZA-YATES, R. Algorithmica 30, 4, 473–502.
[NF2004] Average complexity of exact and approximate multiple string matching. NAVARRO, G. AND FREDRIKSSON, K. Theor. Comput. Sci. 321, 2-3, 283–290.
[NR2000] Fast and flexible string matching by combining bit-parallelism and suffix automata. NAVARRO, G. AND RAFFINOT, M. ACM J. Exp. Algorithmics 5, 4.
[NR2002] Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences. NAVARRO, G. AND RAFFINOT, M. Cambridge University Press, Cambridge, UK.
[NSTT2000] Indexing text with approximate q-grams. NAVARRO, G., SUTINEN, E., TANNINEN, J., AND TARHIO, J. In Proceedings of 11th Combinatorial Pattern Matching (CPM00). LNCS, vol Springer-Verlag, Berlin, 350–363.
27 [PS80] Decision trees and random access machines. PAUL, W. AND SIMON, J. In Proceedings of International Symposium on Logic and Algorithmic (Zurich). 331–340.
[SK83] Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. SANKOFF, D. AND KRUSKAL, J., Eds. Addison-Wesley, Reading, MA.
[S80] The theory and computation of evolutionary distances: Pattern recognition. SELLERS, P. J. Algorithms 1, 359–373.
[ST96] Filtration with q-samples in approximate string matching. SUTINEN, E. AND TARHIO, J. In Proceedings of 7th Combinatorial Pattern Matching. LNCS, vol Springer-Verlag, Berlin, 50–63.
[TU93] Approximate Boyer–Moore string matching. TARHIO, J. AND UKKONEN, E. SIAM J. Comput. 22, 2, 243–260.
[U85] Finding approximate patterns in strings. UKKONEN, E. J. Algorithms 6, 132–137.
[W95] Introduction to Computational Biology. WATERMAN, M. Chapman and Hall, London.
[Y79] The complexity of pattern matching for a random string. YAO, A. C. SIAM J. Comput. 8, 3, 368–387.