# A Hybrid Indexing Method for Approximate String Matching, Journal of Discrete Algorithms, No. 1, Vol. 1, 2000, pp. 205-239, Gonzalo Navarro and Ricardo Baeza-Yates

1 A Hybrid Indexing Method for Approximate String Matching. Journal of Discrete Algorithms, No. 1, Vol. 1, 2000, pp. 205-239, Gonzalo Navarro and Ricardo Baeza-Yates. Advisor: Prof. R. C. T. Lee. Speaker: Y. K. Shieh.

2 The approximate string matching problem is: given a text $T$ of length $n$, a pattern $P$ of length $m$ ($n > m$), and a threshold $k$ on the number of "errors" allowed in a match, find all occurrences of the pattern in the text with at most $k$ errors.

3 This paper uses an exhaustive searching mechanism. We open a window $T'$ in $T$ with size $m+k$ (Rule 2) and try to determine whether we can be sure that every prefix $T''$ of this window $T'$ has $ed(T'', P) > k$. If the answer is yes, we ignore this window; otherwise, we use dynamic programming to examine whether any prefix $T''$ of the window $T'$ has $ed(T'', P) \le k$.

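4 We use dynamic programming to compute the edit distance between two strings. A matrix $C_{0 \ldots m,\, 0 \ldots n}$ is filled, where $C_{j,i}$ represents the minimum number of operations needed to match $T_{1 \ldots i}$ to $P_{1 \ldots j}$. This is computed as follows: $C_{j,0} = j$, $C_{0,i} = i$, and $C_{j,i} = C_{j-1,i-1}$ if $P_j = T_i$, otherwise $C_{j,i} = 1 + \min(C_{j-1,i-1},\, C_{j-1,i},\, C_{j,i-1})$. The entries $C_{j,0}$ and $C_{0,i}$ represent the edit distance between a string of length $j$ or $i$ and the empty string.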

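5 Example: T = surgery, P = survey, k = 2.

|   |   | s | u | r | g | e | r | y |
|---|---|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| s | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| u | 2 | 1 | 0 | 1 | 2 | 3 | 4 | 5 |
| r | 3 | 2 | 1 | 0 | 1 | 2 | 3 | 4 |
| v | 4 | 3 | 2 | 1 | 1 | 2 | 3 | 4 |
| e | 5 | 4 | 3 | 2 | 2 | 1 | 2 | 3 |
| y | 6 | 5 | 4 | 3 | 3 | 2 | 2 | 2 |

There are only three prefixes of T, namely surge, surger and surgery, whose edit distances with P = survey are smaller than or equal to k = 2.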
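The table-filling step on the previous two slides can be sketched in Python. This is a minimal illustration of the standard edit-distance recurrence, not code from the paper; the function names `edit_distance_matrix` and `prefixes_within` are made up for this sketch.

```python
def edit_distance_matrix(pattern: str, text: str) -> list[list[int]]:
    """Return C with C[j][i] = ed(pattern[:j], text[:i])."""
    m, n = len(pattern), len(text)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m + 1):
        C[j][0] = j  # pattern prefix of length j against the empty string
    for i in range(n + 1):
        C[0][i] = i  # empty string against a text prefix of length i
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            if pattern[j - 1] == text[i - 1]:
                C[j][i] = C[j - 1][i - 1]
            else:
                C[j][i] = 1 + min(C[j - 1][i - 1], C[j - 1][i], C[j][i - 1])
    return C


def prefixes_within(pattern: str, text: str, k: int) -> list[str]:
    """Non-empty prefixes of text whose edit distance to pattern is at most k."""
    last_row = edit_distance_matrix(pattern, text)[-1]
    return [text[:i] for i in range(1, len(text) + 1) if last_row[i] <= k]


print(edit_distance_matrix("survey", "surgery")[-1])
print(prefixes_within("survey", "surgery", 2))
```

Only the last row of the matrix is needed to decide which prefixes of the text are within distance k of the whole pattern.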

6 Let us now see how we can be sure that, for a window $T'$ with size $m+k$, every prefix $T''$ of $T'$ has $ed(T'', P) > k$. We present Lemma 1 of this paper as follows.

7 Lemma 1. Let $T'$ in $T$ and $P$ be two strings such that $ed(T', P) \le k$. Let $P = P_1 x_1 P_2 x_2 \ldots x_{j-1} P_j$, for strings $P_i$ and $x_i$ and for any $j \ge 1$. Then at least one string $P_i$ appears in $T'$ with at most $\lfloor k/j \rfloor$ errors. Thus, we always divide the pattern into $j$ pieces. We shall point out how to divide later.

8 To be more precise, we may say that if $ed(T', P) \le k$, there exist a $P_i$ in $P$ and a substring $T''$ of $T'$ such that $ed(P_i, T'') \le \lfloor k/j \rfloor$.
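The bound in Lemma 1 follows from a pigeonhole argument. A sketch, writing $k_i$ for the number of operations that an optimal alignment of $P$ to $T'$ spends inside piece $P_i$ (the symbol $k_i$ is introduced here for illustration and does not appear in the slides):

```latex
\[
\sum_{i=1}^{j} k_i \;\le\; ed(T', P) \;\le\; k
\quad\Longrightarrow\quad
\min_{1 \le i \le j} k_i \;\le\; \frac{k}{j}
\quad\Longrightarrow\quad
\min_{1 \le i \le j} k_i \;\le\; \left\lfloor \frac{k}{j} \right\rfloor
\]
```

The first inequality holds because the pieces are disjoint, so their operations are counted separately. The last step uses the fact that every $k_i$ is an integer. The piece attaining the minimum is aligned to some substring of $T'$ with at most $\lfloor k/j \rfloor$ errors.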

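9 Lemma 1 tells us that if, for all $P_i$ in $P$ and every substring $b$ in $T'$, $ed(P_i, b) > \lfloor k/j \rfloor$, then $ed(P, T') > k$. Suppose that there is a window $T'$ with size $m+k$ such that, for all $P_i$ in $P$ and every substring $b$ in $T'$, $ed(P_i, b) > \lfloor k/j \rfloor$. Then we can be sure that, for every prefix $T''$ of $T'$, for all $P_i$ in $P$ and every substring $b$ in $T''$, $ed(P_i, b) > \lfloor k/j \rfloor$. [Figure: a substring b inside a prefix T'' of the window T' of T, aligned with a piece P_i of P.]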

10 Let us define the following condition. Condition A: for all $P_i$ in $P$ and every substring $b$ in $T'$, $ed(P_i, b) > \lfloor k/j \rfloor$. Thus, if Condition A is satisfied, then for every prefix $T''$ of $T'$, $ed(T'', P) > k$. In such a case, we ignore $T'$ and shift P one step to the right.
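Condition A can be checked directly on a small window, which may help to see what the filter decides. The sketch below is a brute-force illustration only: it assumes the threshold $\lfloor k/j \rfloor$ from Lemma 1, and it scans the window with a dynamic program instead of using the suffix tree and automata that the slides describe next. The names `min_substring_distance` and `condition_a` are hypothetical.

```python
def min_substring_distance(piece: str, window: str) -> int:
    """Smallest ed(piece, b) over all substrings b of window (b may be empty)."""
    # prev[j] = best distance of piece[:j] to a substring ending at the current
    # window position; row 0 stays 0 because a substring may start anywhere.
    prev = list(range(len(piece) + 1))
    best = prev[-1]
    for ch in window:
        cur = [0]
        for j in range(1, len(piece) + 1):
            cost = 0 if piece[j - 1] == ch else 1
            cur.append(min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1))
        best = min(best, cur[-1])
        prev = cur
    return best


def condition_a(window: str, pieces: list[str], k: int) -> bool:
    """True when every piece is farther than floor(k/j) from every substring."""
    threshold = k // len(pieces)
    return all(min_substring_distance(p, window) > threshold for p in pieces)


# Windows taken from the full example later in the talk (P = CAAG, k = 1).
print(condition_a("ACGGA", ["CA", "AG"], 1))  # window can be skipped if True
print(condition_a("CAAAG", ["CA", "AG"], 1))
```

When `condition_a` returns `True` the window cannot contain a match and no verification is needed for it.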

11 Question: how can we be sure that the above condition is satisfied? The approach: for each $P_i$, we generate all possible modified strings $P_i'$ whose distances from $P_i$ are smaller than or equal to k. After generating all possible modified strings $P_i'$, we may use the suffix tree of T to find all occurrences of $P_i'$, for all i, in T with at most $\lfloor k/j \rfloor$ errors.

12 We still have the following questions. Question 1: How to divide P into j pieces? Question 2: How to generate all modified $P_i$'s? Question 3: How to find the occurrences of the $P_i$'s in T with edit distance less than or equal to $\lfloor k/j \rfloor$?

13 Question 1: How to divide P into j pieces? It can be proved that an optimal method is to partition P into $j$ pieces with $j = (m+k)/\log_\sigma n$, where $\sigma$ is the alphabet size. We get $j$ pieces of P, and the size of every piece is around $\log_\sigma n$.

14 Question 2: How to generate all modified $P_i$'s? Generating all modified strings within a given distance of $P_i$ can be done trivially. One method can be found in [HHLS2006], which was reported by C. W. Lu. Another method can be found in [HM2007], reported by L. C. Chen. In this paper, the authors used the second method, the one in [HM2007].
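To make "all modified strings" concrete, here is a brute-force sketch that enumerates every string within edit distance k of a piece by applying single insertions, deletions and substitutions k times. It is not the method of [HHLS2006] or [HM2007], whose details are not given in these slides; the names `neighbors_one_edit` and `neighborhood` and the explicit `alphabet` parameter are assumptions of this sketch.

```python
def neighbors_one_edit(s: str, alphabet: str) -> set[str]:
    """All strings reachable from s by at most one edit operation."""
    out = set()
    for i in range(len(s) + 1):
        for c in alphabet:
            out.add(s[:i] + c + s[i:])  # insert c before position i
        if i < len(s):
            out.add(s[:i] + s[i + 1:])  # delete s[i]
            for c in alphabet:
                out.add(s[:i] + c + s[i + 1:])  # substitute s[i] by c
    return out


def neighborhood(s: str, k: int, alphabet: str) -> set[str]:
    """All strings over alphabet within edit distance k of s."""
    seen = {s}
    frontier = {s}
    for _ in range(k):
        frontier = {t for u in frontier
                    for t in neighbors_one_edit(u, alphabet)} - seen
        seen |= frontier
    return seen


print(sorted(neighborhood("CA", 1, "ACG")))
```

The set grows quickly with k and the alphabet size, which is one reason to keep the pieces short and the per-piece error threshold small.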

15 We can use non-deterministic finite automata (NFA). An NFA is a five-tuple $M = (Q, \Sigma, \delta, q_0, F)$, where $Q$ is a finite set of states, $\Sigma$ is a finite input alphabet, $\delta$ is a mapping from $Q \times (\Sigma \cup \{\varepsilon\})$ into the set of subsets of $Q$, $q_0 \in Q$ is an initial state, and $F \subseteq Q$ is a set of final states.
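One common way to build such an automaton for approximate matching is the edit-distance (Levenshtein) NFA, whose states are pairs (characters of the pattern consumed, errors used). The sketch below simulates it with a set of active states. It is an illustration under that assumption and is not claimed to be the exact automaton drawn on the following slides; `epsilon_closure` and `accepts` are hypothetical names.

```python
def epsilon_closure(states: set, m: int, k: int) -> set:
    """Add states reached by skipping pattern characters (one error each)."""
    closure = set(states)
    stack = list(states)
    while stack:
        i, e = stack.pop()
        if i < m and e < k and (i + 1, e + 1) not in closure:
            closure.add((i + 1, e + 1))
            stack.append((i + 1, e + 1))
    return closure


def accepts(pattern: str, k: int, text: str) -> bool:
    """True when ed(pattern, text) <= k, decided by NFA simulation."""
    m = len(pattern)
    active = epsilon_closure({(0, 0)}, m, k)
    for ch in text:
        nxt = set()
        for i, e in active:
            if i < m and pattern[i] == ch:
                nxt.add((i + 1, e))          # match
            if e < k:
                nxt.add((i, e + 1))          # extra character in the text
                if i < m:
                    nxt.add((i + 1, e + 1))  # substitution
        active = epsilon_closure(nxt, m, k)
        if not active:
            return False                     # out of active states
    return any(i == m for i, _ in active)


print(accepts("abac", 2, "aa"))
print(accepts("abac", 1, "aa"))
```

With k = 0 the automaton degenerates to exact matching, which is the case used for the pieces CA and AG in the full example.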

16 P = abac, k = 2. The finite automaton M accepts $L_k(P)$. $L_k(P)$ = {aa, ab, ac, ba, bc, aaa, aab, aac, aba, abb, abc, acc, baa, bab, bac, bbc, bcc, aaaa, aaab, aaac, aaba, aabc, aaca, aacb, aacc, abaa, abab, abac, abba, abbb, abbc, abca, abcb, abcc, baac, babc, bbac, bbbc, bcac}.

17 P = abac, k = 2. The finite automaton M accepts $L_k(P)$, the same set as on the previous slide. [Figure: M recognizing the string aa.]

18 Full example: T = GACACGGACCAAAGCAG, n = 17; P = CAAG, m = 4; k = 1.

19 P = CAAG. $j = (m + k) / \log_\sigma n = (4 + 1) / \log_3 17 = 1.9388$. Therefore, we partition P into two pieces: $P_1$ = CA and $P_2$ = AG. According to Lemma 1, at least one piece appears in a substring of T with at most $\lfloor k/j \rfloor = \lfloor 1/2 \rfloor = 0$ errors. This means that we look for exact matches of $P_1$ and $P_2$.
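The arithmetic on this slide can be reproduced with a few lines of Python. The sketch assumes that the fractional value of j is rounded to the nearest integer and that the pieces are contiguous and of near-equal length (the separators $x_i$ of Lemma 1 are empty); both are assumptions made here, since the slide only states the result. `number_of_pieces` and `split_pattern` are hypothetical helper names.

```python
import math


def number_of_pieces(m: int, k: int, n: int, sigma: int) -> float:
    """j = (m + k) / log_sigma(n), before rounding."""
    return (m + k) / math.log(n, sigma)


def split_pattern(pattern: str, j: int) -> list[str]:
    """Cut pattern into j contiguous pieces of near-equal length."""
    m = len(pattern)
    bounds = [round(i * m / j) for i in range(j + 1)]
    return [pattern[bounds[i]:bounds[i + 1]] for i in range(j)]


j_real = number_of_pieces(m=4, k=1, n=17, sigma=3)
j = round(j_real)                  # assumption: round to the nearest integer
pieces = split_pattern("CAAG", j)
threshold = 1 // j                 # floor(k / j) with k = 1
print(j_real, j, pieces, threshold)
```

With a threshold of 0 the search for the pieces is an exact search, which is why the tree traversal in the example only records exact occurrences of CA and AG.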

20 NFA with k = 1 of $P_1$ = CA. [Figure: NFA for CA.] NFA with k = 1 of $P_2$ = AG. [Figure: NFA for AG.]

21 T = GACACGGACCAAAGCAG. We construct the suffix tree of T. [Figure: suffix tree of T with the terminator \$ appended; the leaves are numbered 1 to 17 by suffix start position.]

22 We only need to consider the tree levels from the root down to depth 3. [Figure: the suffix tree of T = GACACGGACCAAAGCAG truncated at depth 3; positions 1 and 7 share one leaf.]
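A depth-limited index of this kind can be sketched with a plain suffix trie, in which every node remembers the start positions of the suffixes that pass through it. This is a simplification for illustration: a real suffix tree is compacted and built in linear time, and the slides traverse it with the NFAs rather than with an exact lookup. The dictionary layout and the names `build_truncated_suffix_trie` and `occurrences` are assumptions of this sketch.

```python
def build_truncated_suffix_trie(text: str, depth: int) -> dict:
    """Trie of all suffixes of text, cut off after `depth` characters."""
    root = {"children": {}, "positions": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:start + depth]:
            node = node["children"].setdefault(
                ch, {"children": {}, "positions": []})
            node["positions"].append(start + 1)  # 1-based, as in the slides
    return root


def occurrences(trie: dict, piece: str) -> list[int]:
    """Start positions of exact occurrences of piece (len(piece) <= depth)."""
    node = trie
    for ch in piece:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["positions"]


trie = build_truncated_suffix_trie("GACACGGACCAAAGCAG", depth=3)
print(occurrences(trie, "CA"), occurrences(trie, "AG"))
```

The two lists together give the probable positions that the verification phase has to examine.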

23 T = GACACGGACCAAAGCAG, k = 1. [Figure: the truncated suffix tree shown with the NFA of $P_1$ and the NFA of $P_2$.]

24-25 [Figure: a branch of the tree is traversed. Not an exact match; out of active states.]

26 [Figure: tree traversal.] Exact match; out of active states. We record positions 13 and 16, where AG occurs.

27 [Figure: tree traversal continues.]

28 [Figure: tree traversal.] Exact match; out of active states. We record positions 3, 10 and 15, where CA occurs.

29-35 [Figure: the remaining branches are traversed. Each runs out of active states without an exact match.]

36 After we find all probable positions in T, we verify the substrings around those positions. The probable positions of T are 3, 10, 13, 15 and 16. We use dynamic programming to verify whether any approximate match of P occurs in T at the above locations.

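37 The probable positions of T are 3, 10, 13, 15, 16. Window of size m+k: GACAC, k = 1.

|   |   | G | A | C | A | C |
|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| C | 1 | 1 | 2 | 2 | 3 | 4 |
| A | 2 | 2 | 1 | 2 | 2 | 3 |
| A | 3 | 3 | 2 | 2 | 2 | 3 |
| G | 4 | 3 | 3 | 3 | 3 | 3 |

No approximate matching with k = 1 found.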

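38 Window of size m+k: ACACG, k = 1.

|   |   | A | C | A | C | G |
|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| C | 1 | 1 | 1 | 2 | 3 | 4 |
| A | 2 | 1 | 2 | 1 | 2 | 3 |
| A | 3 | 2 | 2 | 2 | 2 | 3 |
| G | 4 | 3 | 3 | 3 | 3 | 2 |

No approximate matching with k = 1 found.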

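39 Window of size m+k: CACGG, k = 1.

|   |   | C | A | C | G | G |
|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| C | 1 | 0 | 1 | 2 | 3 | 4 |
| A | 2 | 1 | 0 | 1 | 2 | 3 |
| A | 3 | 2 | 1 | 1 | 2 | 3 |
| G | 4 | 3 | 2 | 2 | 1 | 2 |

CACG is found.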

40 The probable positions of T are 3, 10, 13, 15, 16. This window of size m+k does not include any probable position. Therefore we can ignore this window.

41 The next window of size m+k does not include any probable position either. Therefore we can shift the window directly.

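42 Window of size m+k: GGACC, k = 1.

|   |   | G | G | A | C | C |
|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| C | 1 | 1 | 2 | 3 | 3 | 4 |
| A | 2 | 2 | 2 | 2 | 3 | 4 |
| A | 3 | 3 | 3 | 2 | 3 | 4 |
| G | 4 | 3 | 3 | 3 | 3 | 4 |

No approximate matching with k = 1 found.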

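43 Window of size m+k: GACCA, k = 1.

|   |   | G | A | C | C | A |
|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| C | 1 | 1 | 2 | 2 | 3 | 4 |
| A | 2 | 2 | 1 | 2 | 3 | 3 |
| A | 3 | 3 | 2 | 2 | 3 | 3 |
| G | 4 | 3 | 3 | 3 | 3 | 4 |

No approximate matching with k = 1 found.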

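44 Window of size m+k: ACCAA, k = 1.

|   |   | A | C | C | A | A |
|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| C | 1 | 1 | 1 | 2 | 3 | 4 |
| A | 2 | 1 | 2 | 2 | 2 | 3 |
| A | 3 | 2 | 2 | 3 | 2 | 2 |
| G | 4 | 3 | 3 | 3 | 3 | 3 |

No approximate matching with k = 1 found.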

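45 Window of size m+k: CCAAA, k = 1.

|   |   | C | C | A | A | A |
|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| C | 1 | 0 | 1 | 2 | 3 | 4 |
| A | 2 | 1 | 1 | 1 | 2 | 3 |
| A | 3 | 2 | 2 | 1 | 1 | 2 |
| G | 4 | 3 | 3 | 2 | 2 | 2 |

No approximate matching with k = 1 found.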

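46 Window of size m+k: CAAAG, k = 1.

|   |   | C | A | A | A | G |
|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| C | 1 | 0 | 1 | 2 | 3 | 4 |
| A | 2 | 1 | 0 | 1 | 2 | 3 |
| A | 3 | 2 | 1 | 0 | 1 | 2 |
| G | 4 | 3 | 2 | 1 | 1 | 1 |

CAA, CAAA and CAAAG are found.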

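47 Window of size m+k: AAAGC, k = 1.

|   |   | A | A | A | G | C |
|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| C | 1 | 1 | 2 | 3 | 4 | 4 |
| A | 2 | 1 | 1 | 2 | 3 | 4 |
| A | 3 | 2 | 1 | 1 | 2 | 3 |
| G | 4 | 3 | 2 | 2 | 1 | 2 |

AAAG is found.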

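48 Window of size m+k: AAGCA, k = 1.

|   |   | A | A | G | C | A |
|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| C | 1 | 1 | 2 | 3 | 3 | 4 |
| A | 2 | 1 | 1 | 2 | 3 | 3 |
| A | 3 | 2 | 1 | 2 | 3 | 3 |
| G | 4 | 3 | 2 | 1 | 2 | 3 |

AAG is found.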

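49 Window of size m+k: AGCAG, k = 1.

|   |   | A | G | C | A | G |
|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| C | 1 | 1 | 2 | 2 | 3 | 4 |
| A | 2 | 1 | 2 | 3 | 2 | 3 |
| A | 3 | 2 | 2 | 3 | 3 | 3 |
| G | 4 | 3 | 2 | 3 | 4 | 3 |

No approximate matching with k = 1 found.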

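50 Window of size m: GCAG, k = 1.

|   |   | G | C | A | G |
|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 |
| C | 1 | 1 | 1 | 2 | 3 |
| A | 2 | 2 | 2 | 1 | 2 |
| A | 3 | 3 | 3 | 2 | 2 |
| G | 4 | 3 | 4 | 3 | 2 |

No approximate matching with k = 1 found.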

51 Window of size m-k: CAG, k = 1.

|   |   | C | A | G |
|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 |
| C | 1 | 0 | 1 | 2 |
| A | 2 | 1 | 0 | 1 |
| A | 3 | 2 | 1 | 1 |
| G | 4 | 3 | 2 | 1 |

CAG is found.
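The window-by-window verification walked through on slides 37 to 51 can be sketched as follows. The code slides a window of at most m+k characters over T, skips every window that contains no probable position, and otherwise fills the dynamic programming table to report the prefixes within distance k. It follows the procedure as presented in this talk; the names `last_dp_row` and `verify` and the (start, prefix) output format are choices made for this sketch.

```python
def last_dp_row(pattern: str, text: str) -> list[int]:
    """Last row of the edit-distance matrix: ed(pattern, text[:i]) for all i."""
    prev = list(range(len(text) + 1))
    for j, pc in enumerate(pattern, start=1):
        cur = [j]
        for i, tc in enumerate(text, start=1):
            cost = 0 if pc == tc else 1
            cur.append(min(prev[i - 1] + cost, prev[i] + 1, cur[i - 1] + 1))
        prev = cur
    return prev


def verify(text: str, pattern: str, k: int, candidates: list[int]):
    """Return (window start, matching prefix) pairs, positions 1-based."""
    m, n = len(pattern), len(text)
    found = []
    for start in range(1, n + 1):
        window = text[start - 1:start - 1 + m + k]  # shorter near the end of T
        end = start + len(window) - 1
        if not any(start <= c <= end for c in candidates):
            continue  # no probable position inside: skip the window
        row = last_dp_row(pattern, window)
        for length in range(1, len(window) + 1):
            if row[length] <= k:
                found.append((start, window[:length]))
    return found


for hit in verify("GACACGGACCAAAGCAG", "CAAG", 1, [3, 10, 13, 15, 16]):
    print(hit)
```

The same occurrence in T may be reported from several overlapping windows, as in the slides, where CAAAG, AAAG and AAG all end at position 14.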

52 Time complexity. The preprocessing time for constructing the automata and the suffix tree of T is $O(|N| \cdot m)$ and $O(n)$ respectively, where $|N|$ is the number of states in an NFA and $m$ is the length of the pattern. The search time obtained using the partitioning scheme is $O(n^{\lambda} \log n)$, where $\lambda < 1$ when the error level tolerated satisfies $\alpha < 1 - e/\sqrt{\sigma}$, with $e = 2.718\ldots$.


57 Thank you
