# Text Indexing and Dictionary Matching with One Error (Amir, A., Keselman, D., Landau, G. M., et al., Journal of Algorithms, 37, 2000, pp. 309-325)

1 Text Indexing and Dictionary Matching with One Error Amir, A., Keselman, D., Landau, G. M., et al., Journal of Algorithms, 37, 2000, pp. 309-325 Adviser: R. C. T. Lee Speaker: C. W. Cheng

2 Problem Definition The Indexing Problem –Input: a text T of length n over alphabet Σ, a pattern P of length m over alphabet Σ, and an integer k. –Output: all occurrences of P in T with at most k mismatches.

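3 Main idea We build a suffix tree of the text T and a suffix tree of the reversed text T^R, which plays the role of a prefix tree. For each mismatch position j = 1, 2, …, m we look up the prefix P_1…P_{j-1} in the prefix tree and the suffix P_{j+1}…P_m in the suffix tree. If some occurrence of the prefix ends exactly two positions before an occurrence of the suffix starts, P occurs there with at most one error, located at the skipped position.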

4 Processing 1. Construct a suffix tree S_T of the text T and a suffix tree S_{T^R} of the reversed text T^R = t_n … t_1.

5 Ex T=AGCAGAT T^R=TAGACGA

6 Ex T=AGCAGAT T^R=TAGACGA

7 Processing 2. For each of the suffix trees, link all leaves of the suffix tree in a left-to-right order.

8 Ex T=AGCAGAT T^R=TAGACGA

9 Processing 3. For each of the suffix trees, set pointers from each tree node v to its leftmost leaf v_l and its rightmost leaf v_r in the linked list.

10 Ex T=AGCAGAT T^R=TAGACGA

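11 Processing 4. Designate each leaf in S_T by the starting location of its suffix. Designate each leaf in S_{T^R} by n – i + 3, where i is the starting position of the leaf's suffix in T^R. That suffix is the reversal of a prefix of T ending at position n – i + 1, so n – i + 3 is the position where a pattern suffix must start once one text character is skipped.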

12 Ex T=AGCAGAT T^R=TAGACGA
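The four preprocessing steps can be imitated without building real suffix trees. The sketch below is only an illustration under an assumption that is not in the slides: it sorts the suffixes directly, which gives the same left-to-right leaf order as a suffix tree with sorted children but costs far more time. The function name `leaf_lists` is hypothetical.

```python
def leaf_lists(text):
    """Return (V, W): the leaf labels of S_T and S_{T^R} in left-to-right order.

    Assumption: sorting suffixes stands in for reading the leaves of a
    suffix tree whose children are ordered alphabetically.
    """
    n = len(text)
    rev = text[::-1]
    # 1-based starting positions of the suffixes, in lexicographic order.
    order_t = sorted(range(1, n + 1), key=lambda i: text[i - 1:])
    order_r = sorted(range(1, n + 1), key=lambda i: rev[i - 1:])
    V = order_t                        # leaf label in S_T: start of the suffix
    W = [n - i + 3 for i in order_r]   # leaf label in S_{T^R}: n - i + 3
    return V, W


V, W = leaf_lists("AGCAGAT")
print(V)  # [4, 1, 6, 3, 5, 2, 7]
print(W)  # [3, 6, 8, 5, 4, 7, 9]
```

Every node of either tree then corresponds to a contiguous slice of V or W, which is what the pointers v_l and v_r delimit.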

13 Query Processing For j = 1, …, m do –1. Find node v, the location of P_{j+1} … P_m in S_T, if such a node exists. –2. Find node w, the location of P_{j-1} … P_1 in S_{T^R}, if such a node exists. –3. If v and w both exist, let V[v_l … v_r] and W[w_l … w_r] be the labels of the leaves under v and w, and compute their intersection I. For every i ∈ I, P occurs with at most one mismatch at T_{i-j} … T_{i-j+m-1}.
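The query loop can be sketched as follows. This is a minimal illustration, not the paper's data structure: the helper names `occurrences` and `one_mismatch_matches` are hypothetical, and plain scans of T stand in for the two suffix-tree lookups and for the range query. What it keeps from the slides is the labeling: a suffix occurrence is labeled by its start, a prefix occurrence by its end plus 2, and equal labels mean exactly one skipped character between the two.

```python
def occurrences(text, piece):
    """1-based start positions of piece in text (the empty string occurs at 1..len+1)."""
    k = len(piece)
    return [s + 1 for s in range(len(text) - k + 1) if text[s:s + k] == piece]


def one_mismatch_matches(text, pattern):
    """Sorted 1-based starts of windows that differ from pattern in at most one position."""
    m = len(pattern)
    starts = set()
    for j in range(1, m + 1):            # j = pattern position of the tolerated mismatch
        suffix = pattern[j:]             # P_{j+1} ... P_m
        prefix = pattern[:j - 1]         # P_1 ... P_{j-1}
        # V: labels of the suffix occurrences (their start positions).
        V = set(occurrences(text, suffix))
        # W: a prefix occurrence starting at s ends at s + j - 2; adding 2
        # gives the label s + j, the counterpart of the n - i + 3 designation.
        W = {s + j for s in occurrences(text, prefix)}
        for i in V & W:
            starts.add(i - j)            # the occurrence is T_{i-j} ... T_{i-j+m-1}
    return sorted(starts)


print(one_mismatch_matches("actgacctcagctta", "ctaa"))  # [2, 7, 12]
```

Because the skipped character is never compared, an exact occurrence of P is reported as well, which fits the "at most one mismatch" output.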

14 Example Ex T=actgacctcagctta P=ctaa k=1

15 Ex T=actgacctcagctta T^R=attcgactccagtca P=ctaa Suffix Tree of T

16 Ex T=actgacctcagctta T^R=attcgactccagtca P=ctaa Suffix Tree of T^R

17 Ex T=actgacctcagctta T^R=attcgactccagtca P=ctaa Suffix Tree of T j=1 v=P_{j+1}…P_m=taa w=P_{j-1}…P_1=ε V[v_l…v_r]=∅

18 Suffix Tree of T^R Ex T=actgacctcagctta T^R=attcgactccagtca P=ctaa j=1 v=P_{j+1}…P_m=taa w=P_{j-1}…P_1=ε V[v_l…v_r]=∅ W[w_l…w_r]={3,12,…,14} I=∅

19 Suffix Tree of T Ex T=actgacctcagctta T^R=attcgactccagtca P=ctaa j=2 v=P_{j+1}…P_m=aa w=P_{j-1}…P_1=c V[v_l…v_r]=∅

20 Suffix Tree of T^R Ex T=actgacctcagctta T^R=attcgactccagtca P=ctaa j=2 v=P_{j+1}…P_m=aa w=P_{j-1}…P_1=c V[v_l…v_r]=∅ W[w_l…w_r]={4,8,9,14,11} I=∅

21 Suffix Tree of T Ex T=actgacctcagctta T^R=attcgactccagtca P=ctaa j=3 v=P_{j+1}…P_m=a w=P_{j-1}…P_1=tc V[v_l…v_r]={15,5,1,10}

22 Suffix Tree of T^R Ex T=actgacctcagctta T^R=attcgactccagtca P=ctaa j=3 v=P_{j+1}…P_m=a w=P_{j-1}…P_1=tc V[v_l…v_r]={15,5,1,10} W[w_l…w_r]={5,10,15} I={5,10,15}

23 When j=3, the intersection of V={15,5,1,10} and W={5,10,15} is I={5,10,15}. Therefore P occurs with one mismatch at T_{i-j} … T_{i-j+m-1} for all i ∈ I, that is, at T_2…T_5 = ctga, T_7…T_10 = ctca and T_12…T_15 = ctta. T=actgacctcagctta P=ctaa
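The three reported windows can be checked directly against P. The snippet below is a small verification sketch; `hamming` is a hypothetical helper that counts the mismatching positions of two equal-length strings.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


T, P = "actgacctcagctta", "ctaa"
j, m = 3, len(P)

# For each label i in the intersection, the occurrence starts at i - j (1-based).
windows = {i: T[i - j - 1:i - j - 1 + m] for i in (5, 10, 15)}
distances = {i: hamming(w, P) for i, w in windows.items()}

print(windows)    # {5: 'ctga', 10: 'ctca', 15: 'ctta'}
print(distances)  # {5: 1, 10: 1, 15: 1}
```

In all three windows the single mismatch sits at pattern position j = 3, the position that the query skipped.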

24 Suffix Tree of T Ex T=actgacctcagctta T^R=attcgactccagtca P=ctaa j=4 v=P_{j+1}…P_m=ε w=P_{j-1}…P_1=atc V[v_l…v_r]={15,5,…,13}

25 Suffix Tree of T^R Ex T=actgacctcagctta T^R=attcgactccagtca P=ctaa j=4 v=P_{j+1}…P_m=ε w=P_{j-1}…P_1=atc V[v_l…v_r]={15,5,…,13} W[w_l…w_r]=∅ I=∅

26 Range Query Problem In step 3, given nodes v and w, we want to find the leaves that appear both in interval [v l … v r ] and in the interval [w l … w r ], where the four end points of the two intervals are defined in step P.3 of the preprocessing. Thus, we are seeking a solution to the range query problem.

27 Problem Definition of Range Query Input: two permutation arrays V=[v_1, v_2, …, v_n] and W=[w_1, w_2, …, w_n] of the same n elements, and four constants i, j, k and l, where k and l are range lengths with i+k-1 ≤ n and j+l-1 ≤ n. Output: the intersection of the elements of V[i … i+k-1] and W[j … j+l-1].

28 Example V=[8,5,1,4,3,7,6,2] W=[3,6,4,7,2,1,5,8] i=3, k=4, j=2, l=5 Output: the intersection of [v_3, v_4, v_5, v_6] and [w_2, w_3, w_4, w_5, w_6]

29 Preprocessing V=[8,5,1,4,3,7,6,2] W=[3,6,4,7,2,1,5,8] [Figure: grid with positions 1 to 8 of V on one axis and positions 1 to 8 of W on the other]

30 Preprocessing V=[8,5,1,4,3,7,6,2] W=[3,6,4,7,2,1,5,8] [Figure: the same grid with the elements 8, 5, 1, 4, 3, 7, 6, 2 marked]

31 Preprocessing V=[8,5,1,4,3,7,6,2] W=[3,6,4,7,2,1,5,8] [Figure: the same grid] The intersection of [v_3, v_4, v_5, v_6] and [w_2, w_3, w_4, w_5, w_6] is {1,4,7}.
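The query on this example can be reproduced with a few lines. The sketch assumes, as the example does, that k and l count elements, so the ranges are V[i … i+k-1] and W[j … j+l-1]; the function name `range_intersection` is hypothetical. It scans the V range and checks each element's position in W, a naive stand-in for the grid range-search structure of [O88].

```python
def range_intersection(V, W, i, k, j, l):
    """Elements common to V[i .. i+k-1] and W[j .. j+l-1] (1-based positions)."""
    # Position of every element in W, so each element e becomes the grid
    # point (position of e in V, position of e in W).
    pos_in_w = {value: idx for idx, value in enumerate(W, start=1)}
    # Keep the elements of the V range whose W position falls in the W range.
    return sorted(e for e in V[i - 1:i - 1 + k] if j <= pos_in_w[e] <= j + l - 1)


V = [8, 5, 1, 4, 3, 7, 6, 2]
W = [3, 6, 4, 7, 2, 1, 5, 8]
print(range_intersection(V, W, i=3, k=4, j=2, l=5))  # [1, 4, 7]
```

Seen on the grid, the query asks for the points inside an axis-parallel rectangle, which is why a two-dimensional range-search structure applies.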

32 Time Complexity of Range Query Problem Using Overmars' algorithm [O88], the range query problem can be solved with preprocessing time [formula not preserved] and query time [formula not preserved], where k is the number of points in the range. [O88] Overmars, M. H., Efficient data structures for range searching on a grid, J. Algorithms 9, 1988, pp. 254-275.

33 Time Complexity For the indexing problem, the preprocessing time is [formula not preserved] and a query can be answered in [formula not preserved], where tocc is the number of occurrences of the pattern in the text with one error.

34 The Dictionary Matching Problem

35 Problem Definition The Dictionary Matching Problem –Input: 1. A dictionary P = {p_1, …, p_s}, where p_i, i = 1, …, s, are patterns over alphabet Σ, and [symbol not preserved] is the sum of the lengths of all the dictionary patterns. 2. A text T of length n over alphabet Σ. 3. An integer k. –Output: all occurrences of any dictionary pattern in T with at most k mismatches.

36 Main idea We build a suffix tree and a prefix tree of D, the concatenation of all dictionary patterns. For each text position j = 1, 2, …, n we match the text suffix T_{j+1}…T_n in the suffix tree and the reversed text prefix T_{j-1}…T_1 in the prefix tree. If a pattern suffix found on the first path and a prefix of the same pattern found on the second path leave exactly one pattern character between them, that pattern occurs in T with at most one error, located at text position j.

37 Processing 1. Construct a suffix tree S_D of the string D and a suffix tree S_{D^R} of the string D^R, where D is the concatenation of all dictionary patterns, with a separator at the end of each pattern, and D^R is the reversal of D.

38 Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$

39 Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ Suffix Tree of D (S_D)

40 Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ Suffix Tree of D^R (S_{D^R})
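The two strings of this example can be produced as follows. This is a sketch with a hypothetical function name, `build_dictionary_strings`; it assumes, as the example shows, that D^R lists the reversed patterns in reverse order with the separator again placed at the end of each one.

```python
def build_dictionary_strings(patterns, sep="$"):
    """Return (D, D_R) for a list of patterns, following the example's layout."""
    # D: every pattern followed by the separator.
    d = "".join(p + sep for p in patterns)
    # D_R: reversed patterns, last pattern first, each followed by the separator.
    d_rev = "".join(p[::-1] + sep for p in reversed(patterns))
    return d, d_rev


print(build_dictionary_strings(["tca", "gctga", "gca"]))
# ('tca$gctga$gca$', 'acg$agtcg$act$')
```

A character-by-character reversal of D would instead begin with the separator; the layout above keeps one separator after each reversed pattern, so both trees can be processed the same way.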

41 Processing 2. Modify the suffix trees S_D and S_{D^R} as follows. For each separator that is treefirst but not edgefirst, i.e., it appears on an edge (u,v) labeled σ\$σ′ with σ ≠ ε, break (u,v) into (u,w) and (w,v). Label (u,w) with σ and (w,v) with \$σ′.

42 Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ Suffix Tree of D (S_D)

43 Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ Suffix Tree of D^R (S_{D^R})

44 Preprocessing 3. Scan the suffix trees S_D and S_{D^R} and modify them as follows. For each vertex v consider the associated string L(v), i.e., the string from the root to v. Label v with all the locations of the pattern suffixes (in S_D) or pattern prefixes (in S_{D^R}) that are equal to L(v). To implement this, note that all the relevant suffixes share the prefix L(v)\$. So go to the edge (v,w) whose label begins with \$, if such an edge exists, and scan the subtree rooted at w to find all relevant suffixes.

45 Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ Suffix Tree of D (S_D)

46 Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ Suffix Tree of D^R (S_{D^R})

47 Query Processing For j = 1, …, n do –1. Find node v, the location of the longest prefix of t_{j+1} … t_n in S_D. –2. Find node w, the location of the longest prefix of t_{j-1} … t_1 in S_{D^R}. –3. Find the intersection of the markings of the nodes on the path from the root to v in S_D and on the path from the root to w in S_{D^R}.
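The dictionary query loop can be sketched without suffix trees. The code below is an illustration under assumptions that are not in the slides: the function name `dictionary_one_mismatch` is hypothetical, hash tables keyed by pattern prefixes and suffixes replace S_D and S_{D^R}, and a marking is stored as a pair (pattern index, mismatch position q) instead of a location in D. It also treats the empty prefix and the empty suffix as markings, so a mismatch in the first or last character of a pattern is found too.

```python
from collections import defaultdict


def dictionary_one_mismatch(text, patterns):
    """Sorted (start, pattern) pairs where pattern occurs in text with at most one mismatch."""
    # Index every split "prefix ? suffix" of every pattern, once by its
    # prefix and once by its suffix; q is the 1-based position of "?".
    by_prefix = defaultdict(set)
    by_suffix = defaultdict(set)
    for idx, p in enumerate(patterns):
        for q in range(1, len(p) + 1):
            by_prefix[p[:q - 1]].add((idx, q))
            by_suffix[p[q:]].add((idx, q))

    n = len(text)
    found = set()
    for j in range(1, n + 1):              # text position of the tolerated mismatch
        V = set()                          # markings met reading t_{j+1} ... t_n
        for end in range(j, n + 1):
            V.update(by_suffix.get(text[j:end], ()))
        W = set()                          # markings met reading t_{j-1} ... t_1
        for begin in range(j - 1, -1, -1):
            W.update(by_prefix.get(text[begin:j - 1], ()))
        for idx, q in V & W:               # same pattern, same split on both sides
            found.add((j - q + 1, patterns[idx]))
    return sorted(found)


print(dictionary_one_mismatch("acagccga", ["tca", "gctga", "gca"]))
# [(1, 'gca'), (1, 'tca'), (4, 'gca'), (4, 'gctga')]
```

The pair (4, 'gca') comes from gcc against gca, where the mismatch is the last pattern character and the matching suffix is empty.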

48 Example T=acagccga P={tca,gctga,gca} k=1

49 Suffix Tree of D (S_D) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga

50 Suffix Tree of D^R (S_{D^R}) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga

51 Suffix Tree of D (S_D) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga j=1 v=T_{j+1}…T_n=cagccga w=T_{j-1}…T_1=ε V={10,2}

52 Suffix Tree of D^R (S_{D^R}) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga j=1 v=T_{j+1}…T_n=cagccga w=T_{j-1}…T_1=ε V={10,2} W={13,5,…,8} I={10,2}

53 When j=1, the intersection of V={10,2} and W={13,5,…,8} is I={10,2}. Therefore T_1…T_3 = aca matches the patterns tca and gca with one mismatch. T=acagccga P={tca,gctga,gca}

54 Suffix Tree of D (S_D) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga j=2 v=T_{j+1}…T_n=agccga w=T_{j-1}…T_1=a V={11,8,3}

55 Suffix Tree of D^R (S_{D^R}) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga j=2 v=T_{j+1}…T_n=agccga w=T_{j-1}…T_1=a V={11,8,3} W=∅ I=∅

56 Suffix Tree of D (S_D) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga j=3 v=T_{j+1}…T_n=gccga w=T_{j-1}…T_1=ca V=∅

57 Suffix Tree of D^R (S_{D^R}) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga j=3 v=T_{j+1}…T_n=gccga w=T_{j-1}…T_1=ca V=∅ W=∅ I=∅

58 Suffix Tree of D (S_D) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga j=4 v=T_{j+1}…T_n=ccga w=T_{j-1}…T_1=aca V=∅

59 Suffix Tree of D^R (S_{D^R}) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga j=4 v=T_{j+1}…T_n=ccga w=T_{j-1}…T_1=aca V=∅ W=∅ I=∅

60 Suffix Tree of D (S_D) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga j=5 v=T_{j+1}…T_n=cga w=T_{j-1}…T_1=gaca V=∅

61 Suffix Tree of D^R (S_{D^R}) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga j=5 v=T_{j+1}…T_n=cga w=T_{j-1}…T_1=gaca V=∅ W={6,11} I=∅

62 Suffix Tree of D (S_D) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga j=6 v=T_{j+1}…T_n=ga w=T_{j-1}…T_1=cgaca V={7}

63 Suffix Tree of D^R (S_{D^R}) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga j=6 v=T_{j+1}…T_n=ga w=T_{j-1}…T_1=cgaca V={7} W={7,12} I={7}

64 When j=6, the intersection of V={7} and W={7,12} is I={7}. Therefore T_4…T_8 = gccga matches the pattern gctga with one mismatch. T=acagccga P={tca,gctga,gca}

65 Suffix Tree of D (S_D) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga j=7 v=T_{j+1}…T_n=a w=T_{j-1}…T_1=ccgaca V={11,8,3}

66 Suffix Tree of D^R (S_{D^R}) Example P={tca,gctga,gca} D=tca\$gctga\$gca\$ D^R=acg\$agtcg\$act\$ T=acagccga j=7 v=T_{j+1}…T_n=a w=T_{j-1}…T_1=ccgaca V={11,8,3} W=∅ I=∅

67 Time Complexity For the dictionary matching problem, the preprocessing time is [formula not preserved] and a query can be answered in [formula not preserved], where tocc is the number of occurrences of dictionary patterns in the text with one error.

68 References [AC75] A. V. Aho and M. J. Corasick, Efficient string matching, Comm. Assoc. Comput. Mach. 18, No. 6, 1975, pp. 333-340. [AF91] A. Amir and M. Farach, Adaptive dictionary matching, Proc. 32nd IEEE FOCS, 1991, pp. 760-766. [AFI95] A. Amir, M. Farach, R. M. Idury, et al., Improved dynamic dictionary matching, Inform. and Comput. 119, 1995, pp. 258-282. [AKL99] A. Amir, D. Keselman, G. M. Landau, et al., Indexing and dictionary matching with one error, Proc. 1999 Workshop on Algorithms and Data Structures (WADS), 1999, pp. 181-192. [AG97] A. Apostolico and Z. Galil (eds.), Pattern Matching Algorithms, Oxford University Press, 1997. [BM77] R. S. Boyer and J. S. Moore, A fast string searching algorithm, Comm. Assoc. Comput. Mach. 20, 1977, pp. 762-772. [BG96] G. S. Brodal and L. Gasieniec, Approximate dictionary queries, Proc. 7th Annual Symposium on Combinatorial Pattern Matching (CPM96), 1996, pp. 65-74. [O88] M. H. Overmars, Efficient data structures for range searching on a grid, J. Algorithms 9, 1988, pp. 254-275.