Published by Landon Vaughn. Modified over 2 years ago.

1
1 String Matching with k Mismatches by Using the Kangaroo Method
Efficient string matching with k mismatches, Landau, G.M., and Vishkin, U., Theoret. Comput. Sci. 43, 1986, pp
Speaker: C. C. Lin
Adviser: R. C. T. Lee

2
2 Problem definition:
Input: A text T with length n, a pattern P with length m and a mismatch threshold k.
Output: All substrings of T with length m that match P with at most k mismatches.
Example: T = A G C T G C D C A C G I A B, P = A G C C, k = 2.
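The problem statement can be made concrete with a brute-force baseline. This is only a sketch of the definition, not the Kangaroo method: it compares every window character by character in O(nm) time. The function names are hypothetical, and positions are reported 0-based.

```python
def hamming_within(window: str, pattern: str, k: int) -> bool:
    """Return True if window and pattern differ in at most k positions."""
    mismatches = 0
    for a, b in zip(window, pattern):
        if a != b:
            mismatches += 1
            if mismatches > k:
                return False
    return True


def k_mismatch_brute_force(text: str, pattern: str, k: int) -> list[int]:
    """Return the 0-based starts of all length-m windows within k mismatches."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if hamming_within(text[i:i + m], pattern, k)]


# The example from the problem definition slide, written without spaces.
print(k_mismatch_brute_force("AGCTGCDCACGIAB", "AGCC", 2))  # [0, 3]
```

The two reported windows are AGCT (1 mismatch) and TGCD (2 mismatches).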

3
3 The concept of the Kangaroo method can be explained by the following figure. Assume it is known beforehand that $t_1 t_2 \dots t_a = p_1 p_2 \dots p_a$ and that $t_{a+1} \neq p_{a+1}$. Then we do not have to compare $t_1 t_2 \dots t_{a+1}$ with $p_1 p_2 \dots p_{a+1}$ character by character; we record one mismatch and jump directly to matching the suffixes beginning at $t_{a+2}$ and $p_{a+2}$.
Text: $t_1\ t_2 \dots t_a\ t_{a+1}\ t_{a+2}\ t_{a+3} \dots t_k \dots$
Pattern: $p_1\ p_2 \dots p_a\ p_{a+1}\ p_{a+2}\ p_{a+3} \dots p_k \dots$
(mismatch at position a+1)

4-8
4-8 T = ABCCABDADBDETADBAADFDAAEERDXTDADCT… P = ETBDBCCDFDC The Kangaroo method will process as follows. [Animation across slides 4 to 8: the comparison starts with a mismatch count of k=0, and each subsequent slide advances to the next mismatch, giving k=1, k=2, k=3 and k=4.]
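The jumps illustrated on these slides can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: `lce_naive` scans characters one by one, whereas the slides later obtain the same value in O(1) from a suffix tree. The function names are hypothetical. The alignment offset 11 (0-based, where ET occurs in T) is an assumption, because the figure is not preserved; it yields four mismatches, which is consistent with the counter reaching k=4.

```python
def lce_naive(text: str, i: int, pattern: str, j: int) -> int:
    """Length of the longest common prefix of text[i:] and pattern[j:]."""
    length = 0
    while (i + length < len(text) and j + length < len(pattern)
           and text[i + length] == pattern[j + length]):
        length += 1
    return length


def kangaroo_mismatches(text: str, pattern: str, start: int) -> list[int]:
    """Return the pattern offsets that mismatch when pattern is aligned at text[start]."""
    mismatches = []
    j = 0
    m = len(pattern)
    while j < m:
        j += lce_naive(text, start + j, pattern, j)  # hop over the matching run
        if j < m:
            mismatches.append(j)  # offset j is the next mismatch
            j += 1                # resume right after it
    return mismatches


T = "ABCCABDADBDETADBAADFDAAEERDXTDADCT"
P = "ETBDBCCDFDC"
print(kangaroo_mismatches(T, P, 11))  # [2, 5, 6, 10]
```

Each loop iteration corresponds to one hop: the matching run is skipped as a whole and only the mismatching position is recorded.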

9
9 We continue the above process. Whenever it is known that a substring of T exactly matches a substring of P, we skip this substring. The process stops when k+1 mismatches have been found. Example: T=ABAABBCCDD, P=ACDCB and k=2. Comparing ABAAB with ACDCB gives 3 mismatches, which is k+1, so we stop and discard ABAAB. Then we start to compare BAABB with ACDCB.
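The stopping rule can be sketched as follows. This is an illustrative sketch with hypothetical function names; the inner loop that skips equal characters is a naive stand-in for the constant-time jump, so only the early termination at k+1 mismatches is the point here.

```python
def count_until_threshold(text: str, pattern: str, start: int, k: int) -> int:
    """Count mismatches at one alignment, stopping as soon as k + 1 are found.

    Returns a value in 0..k when the window is accepted, or k + 1 when it is
    rejected. Assumes start + len(pattern) <= len(text).
    """
    m = len(pattern)
    mismatches = 0
    j = 0
    while j < m:
        # Skip the run of equal characters (naive stand-in for an O(1) jump).
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j < m:
            mismatches += 1
            if mismatches > k:
                return mismatches  # k + 1 mismatches: stop and discard the window
            j += 1
    return mismatches


def k_mismatch_positions(text: str, pattern: str, k: int) -> list[int]:
    """Return the 0-based starts of all windows accepted under threshold k."""
    m = len(pattern)
    return [s for s in range(len(text) - m + 1)
            if count_until_threshold(text, pattern, s, k) <= k]


print(count_until_threshold("ABAABBCCDD", "ACDCB", 0, 2))  # 3, so ABAAB is discarded
print(k_mismatch_positions("ABAABBCCDD", "ACDCB", 2))      # []
```

For this input every window has at least three mismatches, so no location is reported.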

10
10 Before we introduce the Kangaroo algorithm, we first introduce the suffix tree and the lowest common ancestor of two nodes. The properties of both are used in the Kangaroo algorithm.

11
11 S = ABCDEADDBE [Figure: the suffix tree of S.] The suffix tree of a string of length n can be constructed in O(n) time (Weiner, 1973; McCreight, 1976; Ukkonen, 1995).
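To make the data structure tangible, here is a naive, uncompressed suffix trie for the same string. It is a deliberately simplified stand-in: it takes quadratic time and space, and is not one of the linear-time constructions cited on the slide. The function names and the `"leaf"` marker key are hypothetical.

```python
def build_suffix_trie(s: str) -> dict:
    """Insert every suffix of s + '$' into a nested-dict trie (O(n^2) time)."""
    s = s + "$"  # the end marker is assumed not to occur in s
    root: dict = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i  # start position of the suffix ending at this node
    return root


def contains_substring(trie: dict, query: str) -> bool:
    """Every substring of s is a prefix of some suffix, so walk down from the root."""
    node = trie
    for ch in query:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("ABCDEADDBE")
print(contains_substring(trie, "EADD"))  # True
print(contains_substring(trie, "DDA"))   # False
```

A suffix tree stores the same paths but merges each chain of single-child nodes into one edge, which is what brings the size down to O(n).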

12
12 After O(n) preprocessing at construction time, the lowest common ancestor of two leaf nodes can be found in O(1) time (Harel and Tarjan, 1984).
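The following sketch shows one way to answer lowest-common-ancestor queries in O(1) per query. It is not the Harel and Tarjan algorithm: it swaps in the simpler Euler tour plus sparse table technique, whose preprocessing takes O(n log n) rather than O(n). The function name `build_lca` and the small example tree are hypothetical, and the recursive traversal assumes a tree shallow enough for Python's recursion limit.

```python
def build_lca(children: dict[str, list[str]], root: str):
    """Preprocess a rooted tree and return a function lca(u, v)."""
    euler: list[str] = []   # nodes in Euler-tour order
    depth: list[int] = []   # depth of each tour entry
    first: dict[str, int] = {}  # first tour index of each node

    def dfs(node: str, d: int) -> None:
        first[node] = len(euler)
        euler.append(node)
        depth.append(d)
        for child in children.get(node, []):
            dfs(child, d + 1)
            euler.append(node)  # revisit the parent after each child
            depth.append(d)

    dfs(root, 0)

    # table[j][i] = tour index of the shallowest entry in euler[i : i + 2**j].
    n = len(euler)
    table = [list(range(n))]
    j = 1
    while (1 << j) <= n:
        prev, half = table[-1], 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if depth[a] <= depth[b] else b)
        table.append(row)
        j += 1

    def lca(u: str, v: str) -> str:
        lo, hi = sorted((first[u], first[v]))
        j = (hi - lo + 1).bit_length() - 1  # largest power of two fitting the range
        a, b = table[j][lo], table[j][hi - (1 << j) + 1]
        return euler[a if depth[a] <= depth[b] else b]

    return lca


tree = {"root": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
lca = build_lca(tree, "root")
print(lca("c", "d"))  # a
print(lca("c", "e"))  # root
```

The shallowest node visited between the first occurrences of u and v on the tour is their lowest common ancestor, so each query reduces to a range-minimum lookup.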

13
13 The Kangaroo method constructs a suffix tree for the text T and the pattern P. Let X denote the leaf corresponding to the suffix of T that starts at the text location being verified, and let Y denote the leaf corresponding to the pattern. The lowest common ancestor of X and Y gives the longest common prefix of the two strings, and therefore the position of the next mismatch. Repeating this query after each mismatch verifies a text location against k mismatches in O(k) time. The next slides illustrate the method.

14
14 Two suffix strings: ANBECF$ and ANCEC$. We can see that they share the prefix AN, followed by a mismatch between B and C (mismatches=1). We now have to find whether there are any mismatches between ECF and EC.

15
15 The remaining suffix strings are ECF$ and EC$. They share the prefix EC, and because we reach $, the verification is finished (mismatches=1). Thus the number of mismatches between ANBECF and ANCEC is 1.

16
16 By finding the lowest common ancestor of a text suffix and a pattern suffix in the suffix tree, we avoid comparing all characters. This is useful when the text and the pattern share many matching characters, because those characters are never compared one by one. Finding the lowest common ancestor of two suffixes amounts to finding the next mismatch between the two strings.

17
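17 Input: T=ABCCBDCDBC, P=ABCD and k=2. [Figure: the suffix tree of T and P.] In the following slides, k denotes the number of mismatches found so far at the current text location; a location is discarded once this count reaches 3.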

18
18 The lowest common ancestor of ABCD and ABCCBDCDBC. T=ABCCBDCDBC P=ABCD k=1, return ABCC.

19
19 The lowest common ancestor of ABCD and BCCBDCDBC. T=ABCCBDCDBC P=ABCD k=1.

20
20 The lowest common ancestor of BCD and CCBDCDBC. T=ABCCBDCDBC P=ABCD k=2.

21
21 The lowest common ancestor of CD and CBDCDBC. T=ABCCBDCDBC P=ABCD k=3, discard BCCB.

22
22 The lowest common ancestor of ABCD and CCBDCDBC. T=ABCCBDCDBC P=ABCD k=1.

23
23 The lowest common ancestor of BCD and CBDCDBC. T=ABCCBDCDBC P=ABCD k=2.

24
24 The lowest common ancestor of CD and BDCDBC. T=ABCCBDCDBC P=ABCD k=3, discard CCBD.

25
25 The lowest common ancestor of ABCD and CBDCDBC. T=ABCCBDCDBC P=ABCD k=1.

26
26 The lowest common ancestor of BCD and BDCDBC. T=ABCCBDCDBC P=ABCD k=2.

27
27 The lowest common ancestor of D and CDBC. T=ABCCBDCDBC P=ABCD k=3, discard CBDC.

28
28 The lowest common ancestor of ABCD and BDCDBC. T=ABCCBDCDBC P=ABCD k=1.

29
29 The lowest common ancestor of BCD and DCDBC. T=ABCCBDCDBC P=ABCD k=2.

30
30 The lowest common ancestor of CD and CDBC. T=ABCCBDCDBC P=ABCD k=2, return BDCD.

31
31 The lowest common ancestor of ABCD and DCDBC. T=ABCCBDCDBC P=ABCD k=1.

32
32 The lowest common ancestor of BCD and CDBC. T=ABCCBDCDBC P=ABCD k=2.

33
33 The lowest common ancestor of CD and DBC. T=ABCCBDCDBC P=ABCD k=3, discard DCDB.

34
34 The lowest common ancestor of ABCD and CDBC. T=ABCCBDCDBC P=ABCD k=1.

35
35 The lowest common ancestor of BCD and DBC. T=ABCCBDCDBC P=ABCD k=2.

36
36 The lowest common ancestor of CD and BC. T=ABCCBDCDBC P=ABCD k=3, discard CDBC.

37
37 Input: T=ABCCBDCDBC, P=ABCD and k=2. Output: ABCC and BDCD.

38
38 To use the Kangaroo method, we construct a suffix tree for the text T of length n and the pattern P of length m in O(n+m) time. With the Kangaroo method, each mismatch is found in O(1) time. We stop as soon as more than k mismatches have been found, so at most k+1 mismatches are located per text location. Therefore, verifying one text location takes O(k) time.

39
39 Thus, the time complexity of finding all locations of the text T that match the pattern P with at most k mismatches is O(nk).
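The whole pipeline can be sketched end to end. This sketch makes a substitution that should be stated plainly: instead of a suffix tree with lowest-common-ancestor queries, it uses a suffix array, Kasai's LCP array and a sparse table, which also answer "longest common prefix of two suffixes" in O(1) per query. The suffix array is built by naive sorting, so preprocessing here is not linear; only the O(k) verification per location mirrors the analysis above. The function names and the separators `#` and `$` (assumed absent from T and P) are hypothetical.

```python
def build_lce(s: str):
    """Return lce(i, j): length of the longest common prefix of s[i:] and s[j:]."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # naive suffix array
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r

    # Kasai's algorithm: lcp[r] = common prefix length of suffixes sa[r-1] and sa[r].
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0

    # Sparse table for range-minimum queries over lcp.
    table = [lcp]
    b = 1
    while (1 << b) <= n:
        prev, half = table[-1], 1 << (b - 1)
        table.append([min(prev[i], prev[i + half])
                      for i in range(n - (1 << b) + 1)])
        b += 1

    def lce(i: int, j: int) -> int:
        if i == j:
            return n - i
        lo, hi = sorted((rank[i], rank[j]))
        lo += 1  # the answer is min(lcp[lo+1 .. hi])
        b = (hi - lo + 1).bit_length() - 1
        return min(table[b][lo], table[b][hi - (1 << b) + 1])

    return lce


def kangaroo_search(text: str, pattern: str, k: int) -> list[int]:
    """Return 0-based starts of windows matching pattern with at most k mismatches."""
    n, m = len(text), len(pattern)
    lce = build_lce(text + "#" + pattern + "$")
    p0 = n + 1  # offset of the pattern inside the joined string
    hits = []
    for start in range(n - m + 1):
        j = mismatches = 0
        while j < m:
            j += lce(start + j, p0 + j)  # one O(1) jump to the next mismatch
            if j < m:
                mismatches += 1
                if mismatches > k:
                    break  # k + 1 mismatches: discard this location
                j += 1
        if mismatches <= k:
            hits.append(start)
    return hits


print(kangaroo_search("ABCCBDCDBC", "ABCD", 2))  # [0, 4]
```

Positions 0 and 4 correspond to the substrings ABCC and BDCD, the output given for T=ABCCBDCDBC, P=ABCD and k=2.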

40
40 References
For Construction of Suffix Trees:
[M76] McCreight, E.M., A Space-Economical Suffix Tree Construction Algorithm, J. ACM 23 (1976).
[U95] Ukkonen, E., On-line Construction of Suffix Trees, Algorithmica 14 (1995).
For Finding Lowest Common Ancestor:
[HT84] Harel, D. and Tarjan, R.E., Fast Algorithms for Finding Nearest Common Ancestors, SIAM Journal on Computing 13 (1984).

41
41 References
For String Matching with k Mismatches:
[LV86] Landau, G.M., and Vishkin, U., Efficient string matching with k mismatches, Theoret. Comput. Sci. 43 (1986).

42
42 Thank you
