Published by Andrew Shelton. Modified over 4 years ago.

1
1 Average-Optimal Multiple Approximate String Matching. Kimmo Fredriksson, Gonzalo Navarro. ACM Journal of Experimental Algorithmics, Vol. 9, Article No. 1.4, 2004, Pages 1-47. Professor: R.C.T. Lee. Speaker: K.W. Liu.

2
2 The Problem. The approximate string matching problem: given a text T[1...n] and a pattern P[1...m] over some finite alphabet of size σ, find the approximate occurrences of P in T, allowing at most k differences (insertions, deletions, substitutions).
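As a baseline for the problem statement above, here is a minimal sketch of the classical column-wise dynamic programming search that reports every text position where an approximate occurrence ends. It is not the filtering algorithm of the talk; the function name `approx_search` and the choice to report 1-based end positions are assumptions made for illustration.

```python
def approx_search(text: str, pattern: str, k: int) -> list[int]:
    """Return the 1-based positions j such that some substring of `text`
    ending at j is within edit distance k of `pattern`."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any substring
    # of the text that ends at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # old value of col[i - 1]
        # col[0] stays 0: an occurrence may start anywhere in the text.
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra text character
                      col[i - 1] + 1)    # missing pattern character
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends


print(approx_search("abcdef", "cxe", 1))
```

This baseline takes time proportional to m·n, which is the cost that the window-based filtering in the following slides tries to avoid on average.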

3
3 For a window of size m - k, if there exists a substring S_1 in this window whose edit distance to every substring of P is greater than k, we move P. Our algorithm scans from the right, as shown below. [Fig. 3: text T and pattern P, with S_1 inside the window of size m - k]

4
4 For a window of size m - k, if there exists a suffix S_1 of the window whose edit distance to every substring of P is greater than k, we move P to S_1. Our algorithm scans from the right, as shown below. [Fig. 3: text T and pattern P, with the suffix S_1 of the window of size m - k]

5
5 But how do we know that ED(S_1, S_2) > k? We use a very useful lemma.

6
6 Lemma: Consider strings Q and P. Let Q be divided into pieces q_1, q_2, …, q_n as shown below. For each q_i, let p_i be the substring of P for which ED(q_i, p_i) is smallest among all substrings of P. [Figure: Q divided into q_n, …, q_2, q_1]

7
7 Proof: Divide P into n pieces p'_n, …, p'_2, p'_1 as shown below. [Figure: Q = q_n … q_2 q_1 aligned above P = p'_n … p'_2 p'_1]
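The conclusion of the lemma does not survive in the slide text. A plausible reading, consistent with how the lemma is used on the following slides, is that the sum of the per-piece minima is a lower bound on ED(Q, P). Under that assumption, and taking p'_i to be the piece of P that an optimal alignment of Q and P pairs with q_i, the argument can be written as:

```latex
\begin{align*}
ED(Q, P) &= \sum_{i=1}^{n} ED(q_i, p'_i) \\
         &\ge \sum_{i=1}^{n} ED(q_i, p_i)
\end{align*}
```

The inequality holds because each p'_i is itself a substring of P, while p_i minimizes the distance to q_i over all substrings of P. Consequently, if the sum on the right exceeds k, then ED(Q, P) > k.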

8
8 To determine whether ED(S_1, S_2) > k, we may use the lemma. We divide the window into small pieces t_1, t_2, …, t_a. For each t_i, we find the substring p_i of P for which ED(p_i, t_i) is smallest. [Fig. 7: text T with window W divided into pieces …, t_2, t_1, and pattern P with the corresponding substrings p_1, p_2]


10
10 In general, to find such a p_i, we may use dynamic programming [Sellers 1980]. However, we may instead use a special kind of small piece. It is customary to call a piece of length L an L-gram. Let us use 2-grams.
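The dynamic programming step mentioned above can be sketched as follows: the pattern is treated as the "text", and the start and end of the matched substring are left free. The function name `min_ed_to_substrings` is an assumption introduced for illustration.

```python
def min_ed_to_substrings(piece: str, pattern: str) -> int:
    """Smallest edit distance between `piece` and any substring of `pattern`."""
    # row[j] = smallest edit distance between the processed prefix of `piece`
    # and any substring of `pattern` that ends at position j.
    row = [0] * (len(pattern) + 1)  # empty prefix vs. empty substring
    for i, c in enumerate(piece, start=1):
        new_row = [i]  # only the empty substring ends at 0: i deletions
        for j in range(1, len(pattern) + 1):
            cost = 0 if pattern[j - 1] == c else 1
            new_row.append(min(row[j - 1] + cost,     # match or substitution
                               row[j] + 1,            # drop c from the piece
                               new_row[j - 1] + 1))   # extra pattern character
        row = new_row
    return min(row)  # the best substring may end anywhere in the pattern


print(min_ed_to_substrings("ga", "ttaatatat"))
```

The cost is proportional to the piece length times m for each piece, which is why precomputing the values for short, fixed-length pieces is attractive.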

11
11 Note that for two strings x and y of length 2, the edit distance between them equals the Hamming distance between them. Thus, we may use 2-grams in our algorithm.
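The claim about length-2 strings can be checked exhaustively for a small alphabet. The sketch below assumes the DNA alphabet `acgt` used in the later example; the helper names `edit_distance` and `hamming` are introduced here for illustration.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Standard edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb),  # match or substitution
                           prev[j] + 1,               # delete ca
                           cur[j - 1] + 1))           # insert cb
        prev = cur
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


grams = ["".join(p) for p in product("acgt", repeat=2)]
all_equal = all(edit_distance(x, y) == hamming(x, y)
                for x in grams for y in grams)
print(all_equal)
```

The equality is specific to very short strings: for length 3, `abc` and `bca` have Hamming distance 3 but edit distance 2.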

12
12 Our algorithm: Build a table D that stores, for each possible 2-gram over the alphabet, the smallest edit distance between that 2-gram and all substrings of the pattern P. This is done in the preprocessing stage.
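For 2-grams the table can be filled without running a full dynamic program. The sketch below relies on a shortcut that is introduced here and not stated in the slides: the minimum is 0 when the 2-gram occurs in P, 1 when at least one of its characters occurs in P, and 2 otherwise. The function name `build_d_2grams` is likewise an assumption.

```python
from itertools import product


def build_d_2grams(pattern: str, alphabet: str) -> dict[str, int]:
    """Table D for 2-grams: smallest edit distance to any substring of `pattern`."""
    bigrams = {pattern[i:i + 2] for i in range(len(pattern) - 1)}
    chars = set(pattern)
    table = {}
    for x, y in product(alphabet, repeat=2):
        if x + y in bigrams:
            table[x + y] = 0  # the 2-gram itself occurs in the pattern
        elif x in chars or y in chars:
            table[x + y] = 1  # one character matches, the other is an error
        else:
            table[x + y] = 2  # neither character occurs in the pattern
    return table


D = build_d_2grams("ttaatatat", "acgt")
print(D["aa"], D["gg"], D["ga"])
```

The table has σ² entries, so for the four-letter alphabet it holds 16 values.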

13
13 Example
T = ctagggaataatttacaatt
P = ttaatatat
k = 1
The first window of size m - k = 8 is ctagggaa. Its 2-grams are read from the right:
Smallest edit distance between aa and all substrings of P = 0
Smallest edit distance between gg and all substrings of P = 2 > k

14
14 Example
T = ctagggaataatttacaatt
P = ttaatatat
k = 1
The next window of size m - k is gaataatt. Its 2-grams are read from the right:
Smallest edit distance between tt and all substrings of P = 0
Smallest edit distance between aa and all substrings of P = 0
Smallest edit distance between at and all substrings of P = 0
Smallest edit distance between ga and all substrings of P = 1 == k
[Figure: T with the window of size m - k, a region of size m + 2k, and positions i and i+1 marked]
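The two windows of the example can be replayed with a short sketch. The D values are the ones quoted in the slides. The shift rule (restart the window just after the first character of the 2-gram that pushed the sum above k) is an assumption inferred from the example, where the second window begins right after the first g of gg. The name `scan_window` and the 0-based indexing are also assumptions.

```python
def scan_window(text, start, table, m, k, l=2):
    """Scan text[start:start + m - k] from right to left in l-gram steps.

    Returns (True, new_start) when the window can be discarded, or
    (False, start) when the accumulated distance never exceeds k and the
    window has to be verified.
    """
    total = 0
    pos = start + (m - k) - l  # start of the rightmost l-gram
    while pos >= start:
        total += table[text[pos:pos + l]]
        if total > k:
            return True, pos + 1  # assumed shift: skip the l-gram's first character
        pos -= l
    return False, start


T = "ctagggaataatttacaatt"
# Values of D quoted in the example (P = ttaatatat).
D = {"aa": 0, "gg": 2, "tt": 0, "at": 0, "ga": 1}

print(scan_window(T, 0, D, m=9, k=1))  # -> (True, 5)
print(scan_window(T, 5, D, m=9, k=1))  # -> (False, 5)
```

In the first call the sum reaches 0 + 2 = 2 > k after two 2-grams, so the window is discarded. In the second call the sum is 0 + 0 + 0 + 1 = 1, which does not exceed k, so the window cannot be ruled out.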

15
15 Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1. We now find the edit distance between gaataattta and P. [Figure: T with the window of size m - k and the area of size m + k starting at position i+1]
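One way to carry out this verification step is sketched below. It assumes the goal is to decide whether an occurrence of P with at most k differences starts at the beginning of the area, so it computes the edit distance between P and every prefix of the area. The last entry is the distance to the whole area. The function name `distances_to_prefixes` is hypothetical.

```python
def distances_to_prefixes(area: str, pattern: str) -> list[int]:
    """row[j] = edit distance between `pattern` and the prefix area[:j]."""
    prev = list(range(len(area) + 1))  # empty pattern vs. area[:j]: j edits
    for i, cp in enumerate(pattern, start=1):
        cur = [i]
        for j, ca in enumerate(area, start=1):
            cur.append(min(prev[j - 1] + (cp != ca),  # match or substitution
                           prev[j] + 1,               # pattern character unmatched
                           cur[j - 1] + 1))           # area character unmatched
        prev = cur
    return prev


row = distances_to_prefixes("gaataattta", "ttaatatat")
k = 1
print("distance to the whole area:", row[-1])
print("occurrence starts here:", min(row) <= k)
```

Only this area of m + k characters is examined, so the verification cost per surviving window is bounded by m(m + k) cell updates.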

16
16 Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1. [Figure: T shown twice, with a window of size m - k marked]

17
17 In the preprocessing stage, we build a table D that records, for each possible l-gram (a string of length l over the alphabet), the smallest edit distance between that l-gram and all substrings of P.

18
18 D table: example (step by step)
For example, P = aacaccgaa. Every entry starts at 2:
2-gram  aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
D_p     2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
First, P = aacaccgaa is compared with aa.

19
19 Next, P = aacaccgaa is compared with ac. After aa and ac:
2-gram  aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
D_p     0  0  2  2  2  2  2  2  2  2  2  2  2  2  2  2

20
20 Then P = aacaccgaa is compared with ag, at, ca, and cc. After these steps:
2-gram  aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
D_p     0  0  1  1  0  0  2  2  2  2  2  2  2  2  2  2
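The step-by-step table can be reproduced for any l with a small sketch that runs a dynamic program for every l-gram. The function names `min_ed_to_substrings` and `build_d_table` are assumptions, and the slides do not say how the table is actually filled.

```python
from itertools import product


def min_ed_to_substrings(gram: str, pattern: str) -> int:
    """Smallest edit distance between `gram` and any substring of `pattern`."""
    row = [0] * (len(pattern) + 1)  # a substring may start anywhere
    for i, c in enumerate(gram, start=1):
        new_row = [i]
        for j, p in enumerate(pattern, start=1):
            new_row.append(min(row[j - 1] + (c != p),
                               row[j] + 1,
                               new_row[j - 1] + 1))
        row = new_row
    return min(row)  # a substring may end anywhere


def build_d_table(pattern: str, alphabet: str, l: int) -> dict[str, int]:
    """Table D over all l-grams, in lexicographic order of the alphabet."""
    return {"".join(g): min_ed_to_substrings("".join(g), pattern)
            for g in product(alphabet, repeat=l)}


D = build_d_table("aacaccgaa", "acgt", 2)
print(list(D.items())[:6])
```

The first six entries (aa, ac, ag, at, ca, cc) agree with the values 0, 0, 1, 1, 0, 0 shown in the last table. The remaining entries are not shown in the slides; the sketch simply computes them.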

21
21 Time complexity The average complexity of the algorithm is for

22
22 The end

23
23 [BYN2000] New models and algorithms for multidimensional approximate pattern matching. BAEZA-YATES, R. AND NAVARRO, G. 2000. Journal of Discrete Algorithms 1, 1, 21–49. Special issue on Matching Patterns.
[BYN2002] New and faster filters for multiple approximate string matching. BAEZA-YATES, R. AND NAVARRO, G. 2002. Random Structures and Algorithms 20, 23–49.
[BYR99] Modern Information Retrieval. BAEZA-YATES, R. AND RIBEIRO-NETO, B. 1999. Addison-Wesley, Reading, MA.
[BYN99] Faster approximate string matching. BAEZA-YATES, R. A. AND NAVARRO, G. 1999. Algorithmica 23, 2, 127–158.
[CL94] Sublinear approximate string matching and biological applications. CHANG, W. AND LAWLER, E. 1994. Algorithmica 12, 4/5, 327–344.
[CM94] Approximate string matching and local similarity. CHANG, W. AND MARR, T. 1994. In Proceedings of 5th Combinatorial Pattern Matching (CPM94). LNCS, vol. 807. Springer-Verlag, Berlin, 259–273.
[CCGJLPR94] Speeding up two string matching algorithms. CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., AND RYTTER, W. 1994. Algorithmica 12, 4/5, 247–267.

24
24 [CR94] Text Algorithms. CROCHEMORE, M. AND RYTTER, W. 1994. Oxford University Press, Oxford, UK.
[DM79] Automatic Speech and Speaker Recognition. DIXON, R. AND MARTIN, T., Eds. 1979. IEEE Press.
[EL90] A review of segmentation and contextual analysis techniques for text recognition. ELLIMAN, D. AND LANCASTER, I. 1990. Pattern Recogn. 23, 3/4, 337–346.
[F2003] Row-wise tiling for the Myers bit-parallel approximate string matching algorithm. FREDRIKSSON, K. 2003. In Proceedings of 10th Symposium on String Processing and Information Retrieval (SPIRE03). LNCS, vol. 2857. Springer-Verlag, Berlin, 66–79.
[FN2003] Average-optimal multiple approximate string matching. FREDRIKSSON, K. AND NAVARRO, G. 2003. In Proceedings of 14th Combinatorial Pattern Matching (CPM03). LNCS, vol. 2676. 109–128.
[FN2004] Improved single and multiple approximate string matching. FREDRIKSSON, K. AND NAVARRO, G. 2004. In Proceedings of 15th Combinatorial Pattern Matching (CPM04). LNCS, vol. 3109. Springer-Verlag, Berlin, 457–471.

25
25 [GL89] Simple and efficient string matching with k mismatches. GROSSI, R. AND LUCCIO, F. 1989. Information Processing Letters 33, 3, 113–120.
Practical fast searching in strings. HORSPOOL, R. 1980. Software Practice and Experience 10, 501–506.
[HFN2004] Increased bit-parallelism for approximate string matching. HYYRÖ, H., FREDRIKSSON, K., AND NAVARRO, G. 2004. In Proceedings of 3rd Workshop on Efficient and Experimental Algorithms (WEA04). LNCS, vol. 3059. Springer-Verlag, Berlin, 285–298.
[HN2002] Faster bit-parallel approximate string matching. HYYRÖ, H. AND NAVARRO, G. 2002. In Proceedings of 13th Combinatorial Pattern Matching (CPM02). LNCS, vol. 2373. Springer-Verlag, Berlin, 203–224. Extended version to appear in Algorithmica.
[JTU96] A comparison of approximate string matching algorithms. JOKINEN, P., TARHIO, J., AND UKKONEN, E. 1996. Software Practice and Experience 26, 12, 1439–1458.
[K92] Techniques for automatically correcting words in text. KUKICH, K. 1992. ACM Computing Surveys 24, 4, 377–439.
[KS94] A pattern-matching model for intrusion detection. KUMAR, S. AND SPAFFORD, E. 1994. In Proceedings of National Computer Security Conference. 11–21.
[LT94] On the searchability of electronic ink. LOPRESTI, D. AND TOMKINS, A. 1994. In Proceedings of 4th International Workshop on Frontiers in Handwriting Recognition. 156–165.

26
26 [MM96] Approximate multiple string search. MUTH, R. AND MANBER, U. 1996. In Proceedings of 7th Combinatorial Pattern Matching (CPM96). LNCS, vol. 1075. Springer-Verlag, Berlin, 75–86.
[M99] A fast bit-vector algorithm for approximate string matching based on dynamic programming. MYERS, E. W. 1999. J. ACM 46, 3, 395–415.
[N2001] A guided tour to approximate string matching. NAVARRO, G. 2001. ACM Computing Surveys 33, 1, 31–88.
[NB99] Very fast and simple approximate string matching. NAVARRO, G. AND BAEZA-YATES, R. 1999. Inf. Process. Lett. 72, 65–70.
[NB2001] Improving an algorithm for approximate pattern matching. NAVARRO, G. AND BAEZA-YATES, R. 2001. Algorithmica 30, 4, 473–502.
[NF2004] Average complexity of exact and approximate multiple string matching. NAVARRO, G. AND FREDRIKSSON, K. 2004. Theor. Comput. Sci. 321, 2-3, 283–290.
[NR2000] Fast and flexible string matching by combining bit-parallelism and suffix automata. NAVARRO, G. AND RAFFINOT, M. 2000. ACM J. Exp. Algorithmics 5, 4.
[NR2002] Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences. NAVARRO, G. AND RAFFINOT, M. 2002. Cambridge University Press, Cambridge, UK.
[NSTT2000] Indexing text with approximate q-grams. NAVARRO, G., SUTINEN, E., TANNINEN, J., AND TARHIO, J. 2000. In Proceedings of 11th Combinatorial Pattern Matching (CPM00). LNCS, vol. 1848. Springer-Verlag, Berlin, 350–363.

27
27 [PS80] Decision trees and random access machines. PAUL, W. AND SIMON, J. 1980. In Proceedings of International Symposium on Logic and Algorithmic (Zurich). 331–340.
[SK83] Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. SANKOFF, D. AND KRUSKAL, J., Eds. 1983. Addison-Wesley, Reading, MA.
[S80] The theory and computation of evolutionary distances: Pattern recognition. SELLERS, P. 1980. J. Algorithms 1, 359–373.
[ST96] Filtration with q-samples in approximate string matching. SUTINEN, E. AND TARHIO, J. 1996. In Proceedings of 7th Combinatorial Pattern Matching. LNCS, vol. 1075. Springer-Verlag, Berlin, 50–63.
[TU93] Approximate Boyer–Moore string matching. TARHIO, J. AND UKKONEN, E. 1993. SIAM J. Comput. 22, 2, 243–260.
[U85] Finding approximate patterns in strings. UKKONEN, E. 1985. J. Algorithms 6, 132–137.
[W95] Introduction to Computational Biology. WATERMAN, M. 1995. Chapman and Hall, London.
[Y79] The complexity of pattern matching for a random string. YAO, A. C. 1979. SIAM J. Comput. 8, 3, 368–387.
