1 Very fast and simple approximate string matching. Information Processing Letters, 72:65-70, 1999. G. Navarro and R. Baeza-Yates. Advisor: Prof. R. C. T. Lee. Speaker: H. M. Chen
2 Our approximate string matching problem is defined as follows: Given a pattern string P of length m, a text string T of length n, and a maximum number k of allowed errors, find all text positions where a substring of T matches P with edit distance at most k.
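As an illustration of this problem statement, the classical column-wise dynamic programming search (usually attributed to Sellers) can be sketched as follows. This is a minimal sketch, not code from the paper. The function name `approx_search` and the choice to report 1-based end positions of matches are assumptions made here.

```python
def approx_search(pattern, text, k):
    """Return the 1-based end positions in text of substrings that match
    pattern with edit distance at most k."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any substring
    # of text ending at the current position. A match may start anywhere,
    # so col[0] always stays 0.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # value of col[i - 1] in the previous column
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra text character
                      col[i - 1] + 1)    # missing pattern character
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends


print(approx_search("AT", "TCCAAGTTATAGCTC", 0))  # [10]
```

This costs O(mn) time, which is why the filtering idea on the following slides tries to run it only on a few small windows.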
3 This paper is based upon the following lemma, presented by the same authors in [A hybrid indexing method for approximate string matching]. Lemma: Let T and P be two strings, and let P be divided into j pieces p_1, p_2, …, p_j. If ed(T, P) ≤ k, then there exist at least one piece p_i and a substring S of T such that ed(S, p_i) ≤ ⌊k/j⌋.
4 If we let j = k+1, then ⌊k/j⌋ = ⌊k/(k+1)⌋ = 0. In this case, if ed(T, P) ≤ k, then at least one p_i occurs in T exactly. If we find an exact match of some p_i inside a certain window, we use the dynamic programming approach to determine whether this window contains an approximate match of P with at most k errors.
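The filtering condition with j = k+1 pieces can be sketched as below. The helper names `split_pattern` and `may_contain_match` are hypothetical, and splitting P into pieces of (almost) equal length is an assumption of this sketch; the slides only show examples where m is a multiple of k+1.

```python
def split_pattern(pattern, k):
    """Split pattern into k + 1 contiguous pieces of (almost) equal length."""
    j = k + 1
    m = len(pattern)
    if m < j:
        raise ValueError("pattern too short for k + 1 non-empty pieces")
    base, extra = divmod(m, j)
    pieces, start = [], 0
    for i in range(j):
        length = base + (1 if i < extra else 0)
        pieces.append(pattern[start:start + length])
        start += length
    return pieces


def may_contain_match(window, pattern, k):
    """Necessary condition: a window can hold a match with at most k errors
    only if at least one of the k + 1 pieces occurs in it exactly."""
    return any(piece in window for piece in split_pattern(pattern, k))


print(split_pattern("ATCCTC", 2))  # ['AT', 'CC', 'TC']
```

A window that fails `may_contain_match` can be skipped without running the dynamic programming check; a window that passes still has to be verified.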
5 If we cannot find an exact match of any p_i inside a window, we ignore the window. That is, we do not have to check whether there is an approximate match inside it.
6 Question: How large can the window be? Answer: The largest window that has to be examined for an approximate match with edit distance at most k has size m+2k, where m is the length of the pattern. This is explained on the following slide.
7 Consider the following case. Suppose P exactly matches a substring S in T. We may extend S by k characters to the right and k characters to the left. This forms a window of size m+2k. Any substring of T that matches P with edit distance at most k around this alignment must lie inside the window, because k errors can move either end of the match by at most k characters. [Figure: P of length m aligned with substring S of T, with S extended by k characters on each side.]
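One possible way to compute this window once a piece has been found is sketched below. The function `verification_window` and its parameters are hypothetical names: `pos` is the 1-based text position where the piece was found, `offset` is the 0-based offset of that piece inside the pattern being verified, and `n` is the text length. Clipping the window to the text boundaries is an assumption of this sketch.

```python
def verification_window(pos, offset, m, k, n):
    """Return the 1-based inclusive (start, end) of the text window to verify.

    The pattern start is aligned at pos - offset; the window extends k
    characters to the left of that start and k characters to the right of
    the aligned pattern end, giving m + 2k characters before clipping.
    """
    aligned_start = pos - offset
    start = aligned_start - k
    end = aligned_start + m - 1 + k
    return max(1, start), min(n, end)


print(verification_window(50, 4, 6, 2, 1000))  # (44, 53), i.e. 6 + 2*2 = 10 characters
```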
8 Suppose we allow at most k errors and split the pattern P into k+1 pieces. Since each piece is rather short, there is a high probability that it appears exactly in T. Thus, when the pieces are short, as in this case, the filter cannot eliminate many substrings.
9 Our idea is as follows: After determining the exact occurrences of the small pieces, we go on to check for occurrences of progressively larger pieces of P in T. Example with k = 3: P = AAABBBCCCDDD is split into AAABBB and CCCDDD, which are in turn split into AAA, BBB, CCC and DDD.
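The hierarchy of pieces in this example can be generated with a short sketch. It assumes that each piece is halved and that each half is allowed ⌊k/2⌋ errors (the lemma with j = 2), and that k+1 is a power of two so the halving ends with exactly k+1 pieces that must match exactly. The function name `hierarchy` is hypothetical.

```python
def hierarchy(pattern, k):
    """Return the levels of the split, from the whole pattern down to the
    pieces that must match exactly, as lists of (piece, allowed_errors)."""
    levels = [[(pattern, k)]]
    while levels[-1][0][1] > 0:
        next_level = []
        for piece, err in levels[-1]:
            mid = len(piece) // 2
            next_level.append((piece[:mid], err // 2))
            next_level.append((piece[mid:], err // 2))
        levels.append(next_level)
    return levels


for level in hierarchy("AAABBBCCCDDD", 3):
    print(level)
```

For the example above this yields three levels: the whole pattern with 3 errors, two halves with 1 error each, and four pieces with 0 errors.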
10 bc table: The only thing we need to do is construct a table for each piece of P, as follows. Let x be a character in the alphabet. If x occurs in the piece, we record the position of its last occurrence, counted from the right end of the piece (the rightmost character has position 1). If x does not occur in the piece, we record the length of the piece plus 1.
11 Suppose we have P = ATCCTC with k = 2. We divide P into three pieces: p_1 = AT, p_2 = CC and p_3 = TC. To search for exact matches, we actually perform an exhaustive search. Let X be the text character just to the right of the current window, and let us assume that we search for AT. Note that there are three cases: Case 1: X = A. We move AT 2 steps. Case 2: X = T. We move AT 1 step. Case 3: X ≠ A and X ≠ T. We move AT 3 steps.
12 Let us assume that we search for CC. Case 1: X = C. We move CC 1 step. Case 2: X ≠ C. We move CC 3 steps.
13 Let us assume that we search for TC. Case 1: X = T. We move TC 2 steps. Case 2: X = C. We move TC 1 step. Case 3: X ≠ T and X ≠ C. We move TC 3 steps.
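The case analyses for AT, CC and TC can be reproduced with a small bc table builder. This is a sketch under the assumption that X is the text character immediately to the right of the window, as in Sunday's algorithm; the names `bc_table` and `bc_shift` are hypothetical.

```python
def bc_table(piece):
    """Map each character of piece to the position of its last occurrence,
    counted from the right end (the rightmost character has position 1)."""
    length = len(piece)
    table = {}
    for i, ch in enumerate(piece):
        table[ch] = length - i  # later occurrences overwrite earlier ones
    return table


def bc_shift(piece, x):
    """Number of steps to move piece when the character after the window is x."""
    return bc_table(piece).get(x, len(piece) + 1)


print(bc_table("AT"))       # {'A': 2, 'T': 1}
print(bc_shift("AT", "G"))  # 3
```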
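14 Based upon the three discussions above, we choose the minimum value for each character and obtain the following shift table. bc tables: p_1 = AT: A = 2, T = 1, * = 3; p_2 = CC: C = 1, * = 3; p_3 = TC: T = 2, C = 1, * = 3. Shift table: A = 2, T = 1, C = 1, * = 3. Here * stands for any other character.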
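Taking the per-character minimum over the bc tables can be sketched as follows. The function name `shift_table` is hypothetical, and the sketch assumes that all pieces have the same length, as in the examples on these slides.

```python
def shift_table(pieces):
    """Combine the bc tables of equal-length pieces by taking, for each
    character, the minimum shift. Returns (table, default_shift), where
    default_shift applies to characters that occur in no piece."""
    length = len(pieces[0])
    if any(len(p) != length for p in pieces):
        raise ValueError("this sketch assumes pieces of equal length")
    default = length + 1
    table = {}
    for piece in pieces:
        for i, ch in enumerate(piece):
            # length - i is the distance of this occurrence from the right end
            table[ch] = min(table.get(ch, default), length - i)
    return table, default


print(shift_table(["AT", "CC", "TC"]))  # ({'A': 2, 'T': 1, 'C': 1}, 3)
```

Using the minimum is what makes the shift safe for all pieces at once: no piece can be skipped over by a shift that is too long for it.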
15 T = TCCAAGTTATAGCTC, p_1 = AT, p_2 = CC, p_3 = TC. Shift table: A = 2, T = 1, C = 1, * = 3. First step: We open a window of length 2 and compare it with AT, CC and TC. We find that it matches p_3 exactly. We then shift the window according to the shift table value of the character at the next position. Second step: We find that the window CC matches p_2 exactly. Then we shift the window 2 positions. Third step: We cannot find AA among p_1, p_2 and p_3. We then shift the window 3 positions and continue comparing.
16 T = TCCAAGTTATAGCTC, P = ATCCTC. Using this shift table (A = 2, T = 1, C = 1, * = 3), we find AT occurring at position 9 in T, CC at position 2, and TC at positions 1 and 14. Table d contains all text positions of P's pieces. Table d: AT: 9; CC: 2; TC: 1, 14.
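The whole scan that fills Table d can be sketched as a multi-pattern variant of Sunday's algorithm. This is an illustrative sketch, not the authors' code: the function name `find_pieces` is hypothetical, pieces are assumed to have equal length, positions are reported 1-based as on the slides, and an empty list plays the role of NULL.

```python
def find_pieces(text, pieces):
    """Return Table d: for each piece, the 1-based text positions where it
    occurs exactly, found with a shared Sunday-style shift table."""
    length = len(pieces[0])
    if any(len(p) != length for p in pieces):
        raise ValueError("this sketch assumes pieces of equal length")
    shift = {}
    for piece in pieces:
        for i, ch in enumerate(piece):
            shift[ch] = min(shift.get(ch, length + 1), length - i)

    table_d = {piece: [] for piece in pieces}
    n = len(text)
    pos = 0  # 0-based start of the current window
    while pos + length <= n:
        window = text[pos:pos + length]
        if window in table_d:
            table_d[window].append(pos + 1)
        nxt = pos + length  # index of the character just after the window
        if nxt >= n:
            break
        pos += shift.get(text[nxt], length + 1)
    return table_d


print(find_pieces("TCCAAGTTATAGCTC", ["AT", "CC", "TC"]))
# {'AT': [9], 'CC': [2], 'TC': [1, 14]}
```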
17 T = CAABCAAABDAACB, P = ABCACABCDDCA, k = 3. P is split into ABCACA and BCDDCA, which are in turn split into the pieces ABC, ACA, BCD and DCA.
18 T = CAABCAAABDAACB, P = ABCACABCDDCA, k = 3. Table d: ABC: 3; ACA: NULL; BCD: NULL; DCA: NULL. Shift table: A = 1, B = 2, C = 1, D = 1, * = 4.
19 T = CAABCAAABDAACB, P = ABCACABCDDCA. 1. ABC is found in T. We search for ABCACA with k = 1 in the window around this occurrence. Now the pattern length m is six, so the window length is m+2k = 8. Found!
20 T = CAABCAAABDAACB, P = ABCACABCDDCA. 2. We search for ABCACABCDDCA with k = 3 in the corresponding window. But we cannot find ABCACABCDDCA in T with k = 3, so we stop comparing.
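The two verification steps of this example can be traced with the sketch below. It is a simplified illustration under several assumptions: each level halves the pattern and the error budget (⌊k/2⌋), k+1 is a power of two, and each level is checked by plain dynamic programming inside a window of at most m+2k characters. The names `has_match` and `verify_upward` are hypothetical.

```python
def has_match(pattern, text, k):
    """True if some substring of text matches pattern with at most k errors."""
    m = len(pattern)
    col = list(range(m + 1))
    for c in text:
        prev_diag = col[0]
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(prev_diag + cost, col[i] + 1, col[i - 1] + 1)
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            return True
    return False


def verify_upward(pattern, text, k, leaf_start, leaf_len, pos):
    """Verify progressively larger pieces around an exact piece occurrence.

    leaf_start: 0-based offset of the piece inside pattern
    pos:        1-based text position where the piece was found
    Returns a list of (sub_pattern, allowed_errors, found), stopping at the
    first level that fails.
    """
    # Walk down from the whole pattern to the piece, recording each ancestor.
    lo, hi, err = 0, len(pattern), k
    chain = [(lo, hi, err)]
    while hi - lo > leaf_len:
        mid = (lo + hi) // 2
        if leaf_start < mid:
            hi = mid
        else:
            lo = mid
        err //= 2
        chain.append((lo, hi, err))

    results = []
    # Skip the piece itself (last entry) and check its ancestors bottom-up.
    for lo, hi, err in reversed(chain[:-1]):
        m = hi - lo
        aligned = pos - (leaf_start - lo)  # text position aligned with lo
        start = max(1, aligned - err)
        end = min(len(text), aligned + m - 1 + err)
        found = has_match(pattern[lo:hi], text[start - 1:end], err)
        results.append((pattern[lo:hi], err, found))
        if not found:
            break
    return results


print(verify_upward("ABCACABCDDCA", "CAABCAAABDAACB", 3, 0, 3, 3))
# [('ABCACA', 1, True), ('ABCACABCDDCA', 3, False)]
```

The design point is that the expensive check on the full pattern is attempted only after the cheaper check on the half-length piece has succeeded.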
21 Time complexity: the search cost is O(kn/m) = O(αn), where α = k/m is the error level.