Published by Ryan Ortiz. Modified over 3 years ago.

1
1 Finding approximate palindromes in strings. Pattern Recognition, vol. 35, pp , 2002. Alexandre H. L. Porto and Valmir C. Barbosa. Advisor: Prof. R. C. T. Lee. Speaker: L. C. Chen.

2
2 Definition. S: a string of n characters. S[i]: the i-th character of S. S[i..j]: the substring of S whose first and last characters are S[i] and S[j]. S^R: the reverse of S. Example: S = abcab, S^R = bacba.

3
3 Definition. An even (odd) palindrome is a string of the form S^R S (S^R a S), where a is a single character. Thus abaccaba is an even palindrome because abac is the reverse of caba. c: the center of a palindrome S[i…j] in S. Example: S = cbaccaba; S[2…7] = baccab is an even palindrome with center c = 4, i.e. it splits into S[2…4] = bac and S[5…7] = cab.

4
4 Edit distance. In edit distance, there are three types of differences between two strings X and Y:
Insertion: a symbol of Y is missing in X at a corresponding position. Example: X = A-T, Y = AGT.
Substitution: symbols at corresponding positions are distinct. Example: X = ACC, Y = TCC.
Deletion: a symbol of X is missing in Y at a corresponding position. Example: X = GCA, Y = G-A.

5
5 ED(A, B) denotes the edit distance between two strings A and B: the minimum number of substitutions, insertions and deletions of characters needed to transform B into A. Example alignment: A = abcab-a, B = cb-abbc, which uses 1 insertion, 2 substitutions and 1 deletion.
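
The definition of ED can be made concrete with a minimal sketch of the textbook dynamic program. The function name `edit_distance` is an assumption for illustration and does not come from the slides or the paper.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of substitutions, insertions and deletions
    that turn y into x (the measure is symmetric)."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance between x[:i] and the empty string
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j - 1] + (cx != cy),  # substitution or match
                prev[j] + 1,               # symbol of x has no partner in y
                curr[j - 1] + 1,           # symbol of y has no partner in x
            ))
        prev = curr
    return prev[-1]


# The three one-error examples from the edit distance slide.
print(edit_distance("AT", "AGT"))   # insertion
print(edit_distance("ACC", "TCC"))  # substitution
print(edit_distance("GCA", "GA"))   # deletion
```

Only two rows are kept in memory, which is enough when the distance alone is needed.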

6
6 Approximate palindromes. An approximate palindrome with error up to k: a string that differs from the form S^R S (S^R a S) by at most k edit operations, i.e. the edit distance between its right part and the reverse of its left part is at most k. An approximate palindrome is maximal if no other approximate palindrome for the same c and k exists having strictly greater size, or the same size but strictly fewer errors.

7
7 To simplify our discussion, we only discuss even approximate palindromes here. Example: S = aabaabcd. At c = 3, abaa and aabaa are even approximate palindromes, and aabaa is a maximal approximate palindrome. For abaa the single error is a substitution (substitute b with a); for aabaa it is a deletion (delete b).

8
8 Problem. Given a string T of size n, we want to find all maximal approximate palindromes in T with up to k errors. For each c, we find the largest i and j in T[c+1…n] and T^R[1…c] respectively such that ED(T[c+1…i], T^R[1…j]) ≤ k. Here T^R[1…c] denotes the reverse of T[1…c].

9
9 Let S_2 = T^R[1…c] and S_1 = T[c+1…n], where 1 ≤ c ≤ n. In the dynamic programming approach, we construct a matrix D of size (n+1) × (m+1), where D_{i,j} is the minimum edit distance between S_1[1…i] and S_2[1…j], and n and m now denote the lengths of S_1 and S_2 respectively.

10
10 T = dbcaabac and k = 2. At c = 3, S_2 = T^R[1…3] = cbd and S_1 = T[4…8] = aabac. We can find that the maximal approximate palindrome is bcaab. (Figure: DP table with S_1 = aabac indexed by i and S_2 = cbd indexed by j; arrows in the table mark substitution-or-match, deletion and insertion steps.)
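
The table-based search on this slide can be sketched as follows. This is a brute-force illustration that fills the whole matrix D for one center, not the faster method the paper uses; the names `edit_table` and `maximal_even_approx_palindrome` are hypothetical.

```python
def edit_table(s1: str, s2: str) -> list[list[int]]:
    """D[i][j] = edit distance between s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),
                D[i - 1][j] + 1,
                D[i][j - 1] + 1,
            )
    return D


def maximal_even_approx_palindrome(T: str, c: int, k: int) -> tuple[str, int]:
    """Largest approximate palindrome centered after position c (1-based),
    preferring fewer errors among candidates of equal size."""
    s1 = T[c:]          # S_1 = T[c+1..n]
    s2 = T[:c][::-1]    # S_2 = reverse of T[1..c]
    D = edit_table(s1, s2)
    best = (0, 0, 0, 0)  # (size, -errors, i, j); the empty string always qualifies
    for i in range(len(s1) + 1):
        for j in range(len(s2) + 1):
            if D[i][j] <= k:
                best = max(best, (i + j, -D[i][j], i, j))
    _, neg_errors, i, j = best
    return T[c - j:c + i], -neg_errors


print(maximal_even_approx_palindrome("dbcaabac", 3, 2))
```

For the slide's input the only cell with D ≤ 2 and i + j = 5 is (i, j) = (3, 2), which corresponds to T[2…6] = bcaab.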

11
11 How can we compute the table faster? In this paper, the method in [LV89] (L. Y. Huang) was used.

12
12 We shall heavily use the concept of diagonal. Diagonal d is defined as all of the D_{i,j} where d = i - j. The diagonal property: D_{i,j} - D_{i-1,j-1} = 0 or 1. It means that along a diagonal, the values are monotonically non-decreasing [U85]. (Figure: a small DP table with example diagonals marked.)
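
The diagonal property can be checked numerically. The sketch below fills the full table for the strings used in the later slides (S_1 = gggtcta, S_2 = gttc) and collects every difference D_{i,j} - D_{i-1,j-1}; the helper names are assumptions made for this illustration.

```python
def edit_table(s1: str, s2: str) -> list[list[int]]:
    """D[i][j] = edit distance between s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[max(i, j) if i == 0 or j == 0 else 0 for j in range(m + 1)]
         for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),
                D[i - 1][j] + 1,
                D[i][j - 1] + 1,
            )
    return D


def diagonal_steps(D: list[list[int]]) -> set[int]:
    """All values of D[i][j] - D[i-1][j-1] that occur in the table."""
    return {D[i][j] - D[i - 1][j - 1]
            for i in range(1, len(D))
            for j in range(1, len(D[0]))}


def diagonal(D: list[list[int]], d: int) -> list[int]:
    """Values on diagonal d (i - j = d), listed by increasing row j."""
    return [D[j + d][j] for j in range(len(D[0]))
            if 0 <= j + d < len(D)]


D = edit_table("gggtcta", "gttc")
print(diagonal_steps(D))   # only 0 and 1 should appear
print(diagonal(D, 1))      # values along diagonal d = 1
```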

13
13 Consider diagonal d = 0. Let us find the largest j, if it exists, such that (i, j) is on Diagonal d (i - j = d) and D_{i,j} = 0. Let us now label all of these locations. Example: S_1 = gggtcta, S_2 = gttc. (Figure: DP table with S_1 indexed by i and S_2 indexed by j; on Diagonal 0, D_{1,1} = 0.)

14
14 Having found the above locations (i, j) where D_{i,j} = 0, we can further find the largest j, if it exists, such that (i, j) is on Diagonal d and D_{i,j} = 1. To do this, we use the following observation: each element in Diagonal d can only influence elements in Diagonals d-1, d and d+1.

15
15 Let us consider any (i, j) location on Diagonal d. D_{i,j} can only be influenced by D_{i-1,j-1} on Diagonal d (substitution), D_{i-1,j} on Diagonal d-1 (insertion) and D_{i,j-1} on Diagonal d+1 (deletion). Thus, we conclude that we only need to consider Diagonals d-1, d and d+1 for each D_{i,j}.

16
16 Observe the following two strings T_1 and T_2. If i and j are the largest i and j such that ED(T_1[1…i], T_2[1…j]) = k and T_1[i+1] ≠ T_2[j+1], then ED(A_1 + x, A_2 + y) = k+1, where A_1 = T_1[1…i], A_2 = T_2[1…j], x = T_1[i+1] and y = T_2[j+1].

17
17 Consider T_1 = abcd and T_2 = cbde. ED(T_1[1…i], T_2[1…j]) = 2. The largest such i and j are 2 and 3 respectively, and T_1[i+1] = c ≠ T_2[j+1] = e. Thus ED(ab + c, cbd + e) = 2 + 1 = 3.

18
18 Based upon the above discussion, on a diagonal d, we can find the largest i and j such that D_{i,j} = e. How can we find the largest row containing a value smaller than or equal to k? We let L_{d,e} denote the largest row j such that D_{i,j} is on Diagonal d (i - j = d) and D_{i,j} = e ≤ k.

19
19 Let L_{d,e} denote the largest row j such that D_{i,j} is on Diagonal d (i - j = d) and D_{i,j} = e ≤ k. Based upon this definition, e is the edit distance between S_1[1…i] and S_2[1…j] such that i and j are the largest such indices, and S_2[j+1] ≠ S_1[i+1]. Example with S_1 = gggtcta and S_2 = gttc, at d = 0: L_{0,0} = 1, L_{0,1} = 2, L_{0,2} = 3 and L_{0,3} = 4.

20
20 How can we compute the values L_{d,e}? We define
row_{d,e} = max[ L_{d,e-1} + 1 (substitution), L_{d-1,e-1} (insertion), L_{d+1,e-1} + 1 (deletion) ].
Then L_{d,e} = row_{d,e} + t, where t is the length of the longest common prefix of S_1[d + row_{d,e} + 1…n] and S_2[row_{d,e} + 1…m]. If t = 0, it means that S_1[d + row_{d,e} + 1] ≠ S_2[row_{d,e} + 1].
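
A single application of this recurrence can be sketched as below. The function `next_L` and its parameter names are assumptions for illustration; the three neighbouring values are passed in by the caller, and the common prefix is found by direct comparison rather than by the LCA technique the slides mention.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b (naive scan)."""
    t = 0
    while t < len(a) and t < len(b) and a[t] == b[t]:
        t += 1
    return t


def next_L(s1: str, s2: str, d: int,
           L_same: int, L_lower: int, L_upper: int) -> tuple[int, int, int]:
    """One step of the recurrence for diagonal d.

    L_same  = L_{d,e-1}   (substitution, stays on diagonal d)
    L_lower = L_{d-1,e-1} (insertion, arrives from diagonal d-1)
    L_upper = L_{d+1,e-1} (deletion, arrives from diagonal d+1)
    Returns (row_{d,e}, t, L_{d,e}). Assumes 0 <= row + d, i.e. the
    starting cell lies inside the table.
    """
    row = max(L_same + 1, L_lower, L_upper + 1)
    row = min(row, len(s2), len(s1) - d)  # stay inside the table
    t = lcp(s1[row + d:], s2[row:])       # slide down the diagonal on matches
    return row, t, row + t


# Values taken from the worked slides for S_1 = gggtcta, S_2 = gttc.
print(next_L("gggtcta", "gttc", 1, 1, 2, 0))    # computing L_{1,2}
print(next_L("gggtcta", "gttc", -1, 0, 0, 1))   # computing L_{-1,1}
```

With the boundary values used on the slides, the first call gives row 2, t = 2 and L_{1,2} = 4, and the second gives row 2, t = 0 and L_{-1,1} = 2.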

21
21 Consider D_{3,2}. L_{1,1} = 1: the largest j on d = 1 with D_{i,j} = 1 is j = 1. In this case, d = 1 and e = 2. L_{d,e-1} = L_{1,1} = 1, L_{d-1,e-1} = L_{0,1} = 2 and L_{d+1,e-1} = L_{2,1} = 0. Thus row_{d,e} = row_{1,2} = max(L_{1,1} + 1, L_{0,1}, L_{2,1} + 1) = max(1+1, 2, 0+1) = max(2, 2, 1) = 2. (Figure: DP table for S_1 = gggtcta and S_2 = gttc with diagonals d = 0, 1 and 2 marked.)

22
22 How to compute L_{-1,1}? Here S_1 = gggtcta, S_2 = gttc, e = 1 and d = -1.
row_{-1,1} = max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = max[(L_{-1,0} + 1), (L_{-2,0}), (L_{0,0} + 1)] = max[0+1, 0, 1+1] = max[1, 0, 2] = 2.
Since S_1[d + row_{d,e} + 1] = S_1[-1+2+1] = S_1[2] = g ≠ S_2[row_{d,e} + 1] = S_2[2+1] = t, L_{-1,1} = row_{-1,1} + 0 = 2.

23
23 How to compute L_{1,2}? Here S_1 = gggtcta, S_2 = gttc and d = 1.
row_{1,2} = max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = max[(L_{1,1} + 1), (L_{0,1}), (L_{2,1} + 1)] = max[1+1, 2, 0+1] = max[2, 2, 1] = 2.
Since the length of the longest common prefix of S_1[d + row_{1,2} + 1…n] = S_1[4…7] = tcta and S_2[row_{1,2} + 1…m] = S_2[3…4] = tc is 2, L_{1,2} = row_{1,2} + 2 = 4.

24
24 L_{d,e} = row_{d,e} + t, where t is the length of the longest common prefix of S_1[d + row_{d,e} + 1…n] and S_2[row_{d,e} + 1…m]. How can we compute t? In this paper, LCA (lowest common ancestor) is used.

25
25 Consider two strings T_1 = A_1 + S_1 and T_2 = A_2 + S_2. If ED(A_1, A_2) = k and S_1 = S_2, then ED(A_1 + S_1, A_2 + S_2) = k.

26
26 When we find that ED(A_1, A_2) = k, we want to determine whether a longest common prefix S of B_1 and B_2 exists, where B_1 and B_2 are the parts of the two strings that follow A_1 and A_2. This paper uses LCA (lowest common ancestor) to find S.

27
27 To find such an S, if it exists, we may concatenate S_1 and S_2 into a new string. The two suffixes of interest, one starting inside S_1 and one starting inside S_2, then have the common prefix S.

28
28 Let us concatenate S_1 = gggtcta and S_2 = gttc into a new string, gggtctagttc. Consider D_{3,2} on diagonal d = 1: the suffix after ggg is tctagttc, and the suffix after gt is tc. Note that these two suffixes have a common prefix of length 2. Thus we have D_{3,2} = D_{4,3} = D_{5,4} = 2.

29
29 S_1 = gggtcta, S_2 = gttc. Let us concatenate S_1 and S_2 into a new string, gggtctagttc, and then construct its suffix tree. The suffix after ggg is tctagttc, and the suffix after gt is tc. Note that these two suffixes have a common ancestor tc of length 2.
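
The concatenation idea can be illustrated without building a suffix tree. The sketch below compares two suffixes of the concatenated string character by character; it stands in for the LCA query and does not reproduce the constant-time lookup the paper relies on. The function name and the 0-based offsets `a` and `b` are assumptions for this example.

```python
def lcp_in_concatenation(s1: str, s2: str, a: int, b: int) -> int:
    """Longest common prefix of the suffix of s1 starting at offset a
    and the suffix of s2 starting at offset b (both 0-based), read
    from the single string u = s1 + s2."""
    u = s1 + s2
    p = a              # where the s1 suffix starts in u
    q = len(s1) + b    # where the s2 suffix starts in u
    # A match must not run past the end of s1 or the end of s2.
    limit = min(len(s1) - a, len(s2) - b)
    t = 0
    while t < limit and u[p + t] == u[q + t]:
        t += 1
    return t


s1, s2 = "gggtcta", "gttc"
print(s1 + s2)                              # the concatenated string
print(lcp_in_concatenation(s1, s2, 3, 2))   # suffixes tctagttc and tc
print(lcp_in_concatenation(s1, s2, 3, 1))   # suffixes tctagttc and ttc
```

The `limit` cap is a design choice of this sketch: because no separator is placed between S_1 and S_2, a match is not allowed to continue from S_1 into S_2.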

30
30 Algorithm
Initialization:
  for all d, 1 ≤ d ≤ k+1, d e, L_{d,e} = -1.
  for all d, -(k+1) ≤ d ≤ -1, L_{d,|d|-1} = -1, L_{d,|d|-2} = |d|-2.
  for all e, -1 ≤ e ≤ k, L_{n+1,e} = -1.
Find L_{0,0} = the length of the longest common prefix of S_1 and S_2.
For e = 1 to k do
  For d = -e to e do
    row_{d,e} = max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)]
    row_{d,e} = min(row_{d,e}, m)
    while row_{d,e} < m and row_{d,e} + d …
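
Since the loop on this slide is cut off, the following is only a sketch of how the diagonal computation could be completed, under stated assumptions: undefined neighbours are treated as minus infinity instead of using the slide's initialization values, the extension along a diagonal is a naive character scan, and all function names are hypothetical.

```python
NEG = float("-inf")  # stands for "no such cell" (an assumption of this sketch)


def lcp(s1: str, a: int, s2: str, b: int) -> int:
    """Longest common prefix of s1[a:] and s2[b:], by direct comparison."""
    t = 0
    while a + t < len(s1) and b + t < len(s2) and s1[a + t] == s2[b + t]:
        t += 1
    return t


def furthest_rows(s1: str, s2: str, k: int) -> dict:
    """L[(d, e)] = largest row j on diagonal d (i - j = d) with D[i][j] <= e."""
    n, m = len(s1), len(s2)
    L = {(0, 0): lcp(s1, 0, s2, 0)}

    def get(d: int, e: int):
        return L.get((d, e), NEG)

    for e in range(1, k + 1):
        for d in range(-e, e + 1):
            if d > n or -d > m:
                continue  # this diagonal does not intersect the table
            row = max(get(d, e - 1) + 1,   # substitution
                      get(d - 1, e - 1),   # insertion
                      get(d + 1, e - 1) + 1)  # deletion
            if row == NEG:
                continue
            row = min(row, m, n - d)       # clamp to the table
            L[(d, e)] = row + lcp(s1, row + d, s2, row)
    return L


def maximal_even_approx_palindrome(T: str, c: int, k: int) -> tuple[str, int]:
    """Largest approximate palindrome centered after position c (1-based)."""
    s1, s2 = T[c:], T[:c][::-1]
    L = furthest_rows(s1, s2, k)
    # Size is i + j = 2j + d; ties prefer fewer errors.
    size, neg_e, i, j = max((2 * j + d, -e, j + d, j) for (d, e), j in L.items())
    return T[c - j:c + i], -neg_e


print(furthest_rows("gggtcta", "gttc", 2)[(1, 2)])
print(maximal_even_approx_palindrome("cttggggtcta", 4, 2))
```

On the example from the following slides (T = cttggggtcta, c = 4, k = 2) this sketch reproduces the values L_{0,0} = 1, L_{-1,1} = 2, L_{0,1} = 2, L_{1,1} = 1, L_{1,2} = 4 and L_{2,2} = 2. Only O(k^2) entries are computed per center, compared with the full (n+1) × (m+1) table.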

31
31 Example: T = cttggggtcta and k = 2. At c = 4, T[1…4] = cttg, S_2 = T^R[1…4] = gttc and S_1 = T[5…11] = gggtcta. (Figure: empty DP table with S_1 = gggtcta indexed by i and S_2 = gttc indexed by j = 1…4.)

32
32 At d = 0, find the largest j such that S_2[1…j] is equal to S_1[1…i]; then we set L_{0,0} = j. Here S_2[1] = S_1[1] = g and S_2[2] = t ≠ S_1[2] = g, so L_{0,0} = 1.

33
33 row_{-1,1} = max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = max[0, 0, 2] = 2. The length of the longest common prefix of ggtctagttc and tc is 0, so L_{-1,1} = 2 (e = 1, d = -1).

34
34 The length of LCA of ggtctagttc and tc is 0.

35
35 row_{0,1} = max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = max[2, 0, 1] = 2. The length of the common prefix of gtctagttc and tc is 0, so L_{0,1} = 2 (e = 1, d = 0).

36
36 The length of LCA of gtctagttc and tc is 0.

37
37 row_{1,1} = max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = 1. The length of the common prefix of gtctagttc and ttc is 0, so L_{1,1} = 1 (e = 1, d = 1).

38
38 The length of LCA of gtctagttc and ttc is 0.

39
39 row_{1,2} = max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = 2 (e = 2, d = 1).

40
40 We find that the longest common prefix of tc and tctagttc is tc, so L_{1,2} = row_{1,2} + 2 = 2 + 2 = 4 (e = 2, d = 1).

41
41 The length of LCA of tctagttc and tc is 2.

42
42 row_{2,2} = max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = 1. We find that the length of the common prefix of ttc and tctagttc is 1, so L_{2,2} = row_{2,2} + 1 = 1 + 1 = 2 (e = 2, d = 2).

43
43 The length of LCA of ttc and tctagttc is 1.

44
44 T = cttggggtcta and k = 2. At c = 4, T[1…4] = cttg, S_2 = T^R[1…4] = gttc and S_1 = T[5…11] = gggtcta. cttggggtc is the maximal approximate palindrome.

45
45 References
[U85] E. Ukkonen, Finding approximate patterns in strings, Journal of Algorithms, Vol. 6, 1985, pp
[LV89] G. Landau and U. Vishkin, Fast parallel and serial approximate string matching, Journal of Algorithms, Vol. 10, 1989, pp
