1 Very fast and simple approximate string matching Information Processing Letters, 72:65-70, 1999. G. Navarro and R. Baeza-Yates Advisor: Prof. R. C. T. Lee Speaker: H. M. Chen
2 Our approximate string matching problem is defined as follows: Given a pattern string P of length m, a text string T of length n, and a maximal number k of errors allowed, find all text positions where a substring of T matches P with edit distance at most k.
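To make the problem statement concrete, here is a minimal Python sketch of the classical column-wise dynamic-programming search, in which the top row is fixed at 0 so that a match may start anywhere in T. The function name approx_search and the choice to report 1-based end positions are assumptions made for illustration; they are not taken from the paper.

```python
def approx_search(pattern, text, k):
    """Return the 1-based end positions in text where some substring
    matches pattern with edit distance at most k."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # substring of text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        diag = col[0]  # value of col[i-1] in the previous column
        # col[0] stays 0: an occurrence may start at any text position.
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(diag + cost,      # match or substitution
                      col[i] + 1,       # extra text character
                      col[i - 1] + 1)   # missing pattern character
            diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends
```

For example, approx_search("abc", "xxabdxx", 1) reports the end positions 4 and 5, which correspond to the substrings "ab" and "abd".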
3 This paper is based upon the following lemma, presented by the same authors in [A Hybrid Indexing Method for Approximate String Matching]. Lemma: Let T and P be two strings. Let P be divided into j pieces p_1, p_2, …, p_j. If ed(T, P) ≤ k, then there exists at least one p_i and a substring S of T such that ed(S, p_i) ≤ ⌊k/j⌋.
4 If we let j = k+1, then ⌊k/j⌋ = ⌊k/(k+1)⌋ = 0. In this case, if ed(T, P) ≤ k, then at least one p_i occurs in T exactly. If, in a certain window, we find an exact match of some p_i, we use the dynamic programming approach to determine whether there is an approximate match of P with at most k errors in this window.
5 If, in a window, we cannot find an exact match of any p_i, we ignore the window. That is, we do not have to check whether there is an approximate match inside it.
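The partition step can be sketched as follows. This is one possible way to cut P into k+1 contiguous pieces of near-equal length; the helper name split_pattern and the (offset, piece) return shape are assumptions for illustration only.

```python
def split_pattern(pattern, pieces):
    """Split pattern into `pieces` contiguous parts of near-equal length.

    Returns a list of (offset, piece) pairs, where offset is the 0-based
    start of the piece inside pattern.
    """
    base, extra = divmod(len(pattern), pieces)
    result = []
    start = 0
    for i in range(pieces):
        # The first `extra` pieces get one additional character.
        length = base + (1 if i < extra else 0)
        result.append((start, pattern[start:start + length]))
        start += length
    return result
```

With P = ATCCTC and k = 2, split_pattern("ATCCTC", 3) gives the pieces AT, CC and TC at offsets 0, 2 and 4. Only windows of T that contain an exact occurrence of one of these pieces need to be verified.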
6 Question: How large can the window be? Answer: The largest window that has to be examined for an approximate match with edit distance at most k has size m+2k, where m is the length of the pattern. This is explained in the following slide.
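7 Consider the following case. Suppose P exactly matches a substring S of T. We may extend S by k characters to the right and k characters to the left, which forms a window of size m+2k. Any substring obtained by extending S by at most k characters in total is an approximate match of P with edit distance less than or equal to k, and since those characters may all fall on one side, the window must reach k characters in both directions. [Figure: text T with substring S of length m aligned with P, extended by k characters on each side]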
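A small sketch of how such a window might be computed once a piece has been found. It assumes 1-based text positions, a piece that starts at 0-based offset `offset` inside P, and a window clipped to the text boundaries. The name candidate_window and its parameters are hypothetical.

```python
def candidate_window(pos, offset, m, k, n):
    """Return the 1-based inclusive (start, end) of the window to verify.

    pos    -- 1-based text position where the piece was found
    offset -- 0-based offset of that piece inside the pattern
    m, k   -- pattern length and number of errors allowed
    n      -- text length
    """
    aligned_start = pos - offset          # where pattern[0] would line up
    start = max(1, aligned_start - k)     # k extra characters to the left
    end = min(n, aligned_start + m - 1 + k)  # k extra characters to the right
    return start, end
```

When nothing is clipped, the window has exactly m+2k characters; for instance candidate_window(3, 0, 6, 1, 14) returns (2, 9), a window of length 8.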
8 Let us consider the case where we limit the number of errors to at most k. Then we split the pattern P into k+1 pieces. Since each piece is rather small, there is a high probability that it appears exactly in T. Thus, when the pieces are small, as in this case, we cannot eliminate many substrings.
9 Our idea is as follows: After determining the occurrences of exact matches of the small pieces, we go on to determine the occurrences of larger pieces of P in T. Example with k = 3: P = AAABBBCCCDDD is split into AAABBB and CCCDDD, which are in turn split into AAA, BBB, CCC and DDD.
10 bc table. The only thing we need to do is construct a table for each piece of P, as follows. Let x be a character in the alphabet. If x occurs in the piece, we record the position of the last occurrence of x, counted from the right end of the piece. If x does not occur in the piece, we record m+1, where m here is the length of the piece.
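The construction of one bc table can be sketched in a few lines. This is a minimal illustration, assuming the table is stored as a dictionary plus a separate default value for characters that do not occur in the piece; the name bc_table is hypothetical.

```python
def bc_table(piece):
    """Build the bc table of one piece.

    Returns (table, default): table maps each character of the piece to
    the position of its last occurrence counted from the right end
    (rightmost character = 1); default = len(piece) + 1 is used for
    every character that does not occur in the piece.
    """
    length = len(piece)
    table = {}
    for i, ch in enumerate(piece):
        # Later occurrences overwrite earlier ones, so the entry that
        # survives belongs to the last occurrence of ch.
        table[ch] = length - i
    return table, length + 1
```

For the piece AT this gives A = 2 and T = 1, and every other character gets 3.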
11 Suppose we have P = ATCCTC with k = 2. We divide P into three pieces: p_1 = AT, p_2 = CC and p_3 = TC. To search for exact matches, we actually perform an exhaustive search. Let us assume that we search for AT, and let X be the text character immediately to the right of the current window. There are three cases: Case 1: X = A. We move AT 2 steps. Case 2: X = T. We move AT 1 step. Case 3: X ≠ A and X ≠ T. We move AT 3 steps. [Figure: text T with AT aligned below it and X marking the text character just after AT]
12 Let us assume that we search for CC. Case 1: X = C. We move CC 1 step. Case 2: X ≠ C. We move CC 3 steps. [Figure: text T with CC aligned below it and X marking the text character just after CC]
13 Let us assume that we search for TC. Case 1: X = T. We move TC 2 steps. Case 2: X = C. We move TC 1 step. Case 3: X ≠ T and X ≠ C. We move TC 3 steps. [Figure: text T with TC aligned below it and X marking the text character just after TC]
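14 Based on the three discussions above, we take the minimum value for each character over all pieces and obtain the following shift table (* stands for any other character):
bc table for p_1 = AT: A = 2, T = 1, * = 3
bc table for p_2 = CC: C = 1, * = 3
bc table for p_3 = TC: T = 2, C = 1, * = 3
shift table: A = 2, T = 1, C = 1, * = 3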
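Merging the per-piece tables into one shift table amounts to a character-wise minimum. The sketch below assumes all pieces have the same length, as in this example; the name shift_table is hypothetical.

```python
def shift_table(pieces):
    """Combine the bc tables of equal-length pieces into one shift table.

    Returns (shifts, default): shifts[ch] is the minimum, over all
    pieces, of the distance from the last occurrence of ch to the right
    end of the piece; default = piece length + 1 covers all characters
    that occur in no piece.
    """
    length = len(pieces[0])  # assumption: every piece has this length
    default = length + 1
    shifts = {}
    for piece in pieces:
        for i, ch in enumerate(piece):
            # Keep the smallest shift seen so far for this character.
            shifts[ch] = min(shifts.get(ch, default), length - i)
    return shifts, default
```

Taking the minimum is the safe choice: a larger shift could skip over an occurrence of the piece that asked for the smaller one.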
15 T = TCCAAGTTATAGCTC, p_1 = AT, p_2 = CC, p_3 = TC. Shift table: A = 2, T = 1, C = 1, * = 3. First step: We open a window of length 2 and compare it with AT, CC and TC. We find that it exactly matches p_3. We then shift the window according to the shift table value of the character at the next position (here C, so 1 position). Second step: We find that the window CC exactly matches p_2. We then shift the window 2 positions. Third step: We cannot find AA among p_1, p_2 and p_3. We then shift the window 3 positions and continue comparing.
16 T = TCCAAGTTATAGCTC, P = ATCCTC. Using this shift table (A = 2, T = 1, C = 1, * = 3), we find AT occurring at position 9 in T, CC at position 2, and TC at positions 1 and 14. Table d contains all text positions of P's pieces. Table d: AT: 9; CC: 2; TC: 1, 14.
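The scan that fills Table d can be sketched as below. This is an illustrative reading of the slides, not the authors' code: it assumes equal-length pieces, compares each window against the pieces with a dictionary lookup, and shifts by the table entry of the text character just after the window. The name find_pieces is hypothetical, and the shift table is rebuilt inside the function so that the block is self-contained.

```python
def find_pieces(text, pieces):
    """Return Table d: a dict mapping each piece to the list of 1-based
    text positions where it occurs exactly."""
    length = len(pieces[0])  # assumption: all pieces have equal length
    default = length + 1
    shifts = {}
    for piece in pieces:
        for i, ch in enumerate(piece):
            shifts[ch] = min(shifts.get(ch, default), length - i)

    table_d = {piece: [] for piece in pieces}
    n = len(text)
    pos = 0  # 0-based start of the current window
    while pos + length <= n:
        window = text[pos:pos + length]
        if window in table_d:
            table_d[window].append(pos + 1)
        if pos + length >= n:
            break  # no character after the window to look up
        # Shift according to the character right after the window.
        pos += shifts.get(text[pos + length], default)
    return table_d
```

On the example above, find_pieces("TCCAAGTTATAGCTC", ["AT", "CC", "TC"]) visits the windows starting at positions 1, 2, 4, 7, 9, 11, 12, 13 and 14 and reproduces Table d.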
17 T = CAABCAAABDAACB, P = ABCACABCDDCA, k = 3. P is split into ABCACA and BCDDCA, which are in turn split into the pieces ABC, ACA, BCD and DCA.
18 T = CAABCAAABDAACB, P = ABCACABCDDCA, k = 3. Table d: ABC: 3; ACA: NULL; BCD: NULL; DCA: NULL. Shift table: A = 1, B = 2, C = 1, D = 1, * = 4.
19 T = CAABCAAABDAACB, P = ABCACABCDDCA. 1. ABC is found in T. Search for ABCACA with k = 1 in the window around this occurrence. Here m = 6, so the window length is m + 2k = 8. Found!
20 T = CAABCAAABDAACB, P = ABCACABCDDCA. 2. Search for ABCACABCDDCA with k = 3 in the window around this occurrence. We cannot find ABCACABCDDCA in T with k = 3, so we stop comparing.
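Steps 1 and 2 can be sketched as a bottom-up verification. The code below is a simplified illustration under loudly stated assumptions: k+1 is a power of two, m is divisible by k+1, blocks double in size at each level, and the error budget grows as 0, 1, 3, ... (e becomes 2e+1). The names min_errors and verify_hierarchically are hypothetical.

```python
def min_errors(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    col = list(range(len(pattern) + 1))
    best = col[-1]
    for c in text:
        diag = col[0]
        for i in range(1, len(pattern) + 1):
            new = min(diag + (pattern[i - 1] != c), col[i] + 1, col[i - 1] + 1)
            diag, col[i] = col[i], new
        best = min(best, col[-1])
    return best


def verify_hierarchically(text, pattern, k, leaf_index, leaf_pos):
    """Check ever larger blocks of pattern around an exact piece match.

    leaf_index -- 0-based index of the matched piece among the k+1 pieces
    leaf_pos   -- 1-based text position of the exact match
    Returns a list of (block, errors_allowed, passed), one entry per
    level, stopping at the first level that fails.
    """
    piece_len = len(pattern) // (k + 1)
    # 0-based text index that lines up with pattern[0].
    start0 = leaf_pos - 1 - leaf_index * piece_len
    size, errors = 1, 0
    report = []
    while size < k + 1:
        size, errors = 2 * size, 2 * errors + 1
        first = (leaf_index // size) * size  # first piece of the block
        block = pattern[first * piece_len:(first + size) * piece_len]
        # Window of block length + 2 * errors, clipped to the text.
        lo = max(0, start0 + first * piece_len - errors)
        hi = start0 + (first + size) * piece_len + errors
        passed = min_errors(block, text[lo:hi]) <= errors
        report.append((block, errors, passed))
        if not passed:
            break
    return report
```

For the example, verify_hierarchically("CAABCAAABDAACB", "ABCACABCDDCA", 3, 0, 3) first checks ABCACA with 1 error in the 8-character window AABCAAAB, which succeeds, and then checks the whole pattern with 3 errors, which fails.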
21 Time complexity: the search cost is O(kn/m) = O(αn), where α = k/m is the error level.