1 Finding approximate palindromes in strings Pattern Recognition, vol.35, pp. 2581-2591, 2002 Alexandre H. L Porto and Valmir C. Barbosa Advisor: Prof.


1 1 Finding approximate palindromes in strings. Pattern Recognition, vol. 35, pp. 2581-2591, 2002. Alexandre H. L. Porto and Valmir C. Barbosa. Advisor: Prof. R. C. T. Lee. Speaker: L. C. Chen.

2 2 Definition. $S$: a string of $n$ characters. $S[i]$: the $i$th character of $S$. $S[i..j]$: the substring of $S$ whose first and last characters are $S[i]$ and $S[j]$. $S^R$: the reverse of $S$. Example: $S$ = abcab, $S^R$ = bacba.

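3 3 Definition. An even (odd) palindrome is a string of the form $S^R S$ ($S^R a S$, where $a$ is a single character). Thus abaccaba is an even palindrome, because abac is the reverse of caba. $c$: the center of a palindrome $S[i..j]$ in $S$; for an even palindrome, $c$ is the position of the last character of its left half. Example: for $S$ = cbaccaba, $S[2..7]$ = baccab is an even palindrome with center $c = 4$.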
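
The definitions above can be checked mechanically. The following is a minimal sketch in Python; the function names `is_palindrome` and `is_even_palindrome` are illustrative and do not come from the slides. Python indices are 0-based, so the slides' $S[2..7]$ becomes the slice `S[1:7]`.

```python
def is_palindrome(s: str) -> bool:
    """Exact palindrome test: s equals its own reverse."""
    return s == s[::-1]


def is_even_palindrome(s: str) -> bool:
    """True if s has the form reverse(X) + X, i.e. even length and palindromic."""
    return len(s) % 2 == 0 and is_palindrome(s)


S = "cbaccaba"
sub = S[1:7]  # S[2..7] in the slides' 1-based notation
print(sub, is_even_palindrome(sub))
```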

4 4 Edit distance. In edit distance, there are three types of differences between two strings $X$ and $Y$. Insertion: a symbol of $Y$ is missing in $X$ at a corresponding position (X = AT, Y = AGT). Substitution: symbols at corresponding positions are distinct (X = ACC, Y = TCC). Deletion: a symbol of $X$ is missing in $Y$ at a corresponding position (X = GCA, Y = GA).

5 5 $ED(A, B)$ denotes the edit distance between two strings $A$ and $B$: the minimum number of substitutions, insertions and deletions of characters needed to transform $B$ into $A$. Example: aligning A = abcab-a with B = cb-abbc uses 1 insertion, 2 substitutions and 1 deletion, so $ED(A, B) = 4$.
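
The edit distance defined above can be computed with the textbook dynamic program. This is a minimal sketch rather than code from the paper; the function name `edit_distance` is an assumption, and it keeps only two rows of the table at a time.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions relating a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]  # against the empty prefix of b, i edits are needed
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitution (cost 1)
                prev[j] + 1,               # character of a left unmatched
                cur[j - 1] + 1,            # character of b left unmatched
            ))
        prev = cur
    return prev[-1]


# The slide's example strings with the gap symbols removed.
print(edit_distance("abcaba", "cbabbc"))
```

For the slide's example the call reports 4, matching the one insertion, two substitutions and one deletion of the alignment shown.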

6 6 Approximate palindromes. An approximate palindrome with error up to $k$: a string of the form $XY$ ($XaY$ in the odd case) whose reversed left half is within edit distance $k$ of its right half, that is, $ED(X^R, Y) \le k$. An approximate palindrome is maximal if no other approximate palindrome for the same $c$ and $k$ has strictly greater size, or the same size but strictly fewer errors.

7 7 To simplify our discussion, we only discuss even approximate palindromes here. Example: $S$ = aabaabcd and $k = 1$. At $c = 3$, abaa (substitute b with a) and aabaa (delete b) are even approximate palindromes, and aabaa is a maximal approximate palindrome.

8 8 Problem. Given a string $T$ of size $n$, we want to find all maximal approximate palindromes in $T$ with up to $k$ errors. For each center $c$, we find the largest $i$ and $j$ in $T[c+1..n]$ and $T^R[1..c]$ respectively such that $ED(T[c+1..i], T^R[1..j]) \le k$, where $T^R[1..c]$ denotes the reverse of $T[1..c]$.

9 9 Let $S_2 = T^R[1..c]$ and $S_1 = T[c+1..n]$, where $1 \le c \le n$. In the dynamic programming approach, we construct a matrix $D$ of size $(n+1) \times (m+1)$, where $D_{i,j}$ is the minimum edit distance between $S_1[1..i]$ and $S_2[1..j]$, and $n$ and $m$ here denote the lengths of $S_1$ and $S_2$ respectively.

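10 10 Example: $T$ = dbcaabac and $k = 2$. At $c = 3$, $S_2 = T^R[1..3]$ = cbd and $S_1 = T[4..8]$ = aabac. [Figure: DP matrix $D$ with $S_1$ = aabac indexed by $i$ and $S_2$ = cbd indexed by $j$; arrows mark a substitution or match, a deletion, and an insertion.] We can find that the maximal approximate palindrome is bcaab.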
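
The example above can be reproduced by filling the whole matrix $D$ and scanning it. The sketch below is the plain quadratic dynamic program, not the faster diagonal method the slides go on to describe; the names `dp_matrix` and `maximal_even_palindrome` are illustrative. Among all cells with $D_{i,j} \le k$ it keeps the one with the largest size $i + j$, breaking ties by fewer errors.

```python
def dp_matrix(s1: str, s2: str):
    """D[i][j] = edit distance between s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    return D


def maximal_even_palindrome(T: str, c: int, k: int):
    """Return (substring, errors) of the maximal even approximate palindrome at center c."""
    s1 = T[c:]         # T[c+1..n] in the slides' 1-based notation
    s2 = T[:c][::-1]   # reverse of T[1..c]
    D = dp_matrix(s1, s2)
    best_key, best = (0, 0), ("", 0)
    for i in range(len(s1) + 1):
        for j in range(len(s2) + 1):
            key = (i + j, -D[i][j])  # larger size first, then fewer errors
            if D[i][j] <= k and key > best_key:
                best_key, best = key, (T[c - j:c + i], D[i][j])
    return best


print(maximal_even_palindrome("dbcaabac", 3, 2))
```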

11 11 How can we compute the table faster? This paper uses the method of [LV89] (L. Y. Huang).

12 12 We shall heavily use the concept of a diagonal. Diagonal $d$ is defined as the set of all $D_{i,j}$ with $d = i - j$. The diagonal property [U85]: $D_{i,j} - D_{i-1,j-1} = 0$ or $1$. It means that the values along a diagonal are monotonically non-decreasing. [Figure: a small DP matrix with two diagonals marked.]

13 13 Consider diagonal $d = 0$ for $S_1$ = gggtcta and $S_2$ = gttc. Let us find the largest $j$, if it exists, such that $(i, j)$ is on diagonal $d$ ($i - j = d$) and $D_{i,j} = 0$. Let us now label all of these locations. [Figure: DP matrix for $S_1$ and $S_2$ with diagonal 0 marked.]

14 14 Having found the above locations $(i, j)$ where $D_{i,j} = 0$, we can further find the largest $j$, if it exists, such that $(i, j)$ is on diagonal $d$ and $D_{i,j} = 1$. To do this, we use the following observation: each element in diagonal $d$ can only influence elements in diagonals $d-1$, $d$ and $d+1$.

15 15 Let us consider any location $(i, j)$ on diagonal $d$. $D_{i,j}$ can only be influenced by $D_{i-1,j-1}$ on diagonal $d$ (substitution), $D_{i-1,j}$ on diagonal $d-1$ (insertion) and $D_{i,j-1}$ on diagonal $d+1$ (deletion). Thus, we conclude that we only need to consider diagonals $d-1$, $d$ and $d+1$ for each $D_{i,j}$.

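16 16 Observe the following two strings $T_1$ and $T_2$. Let $A_1 = T_1[1..i]$ and $A_2 = T_2[1..j]$, and let $x = T_1[i+1]$ and $y = T_2[j+1]$. If $i$ and $j$ are the largest $i$ and $j$ such that $ED(T_1[1..i], T_2[1..j]) = k$ and $T_1[i+1] \ne T_2[j+1]$, then $ED(A_1 + x, A_2 + y) = k + 1$.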

17 17 Consider $T_1$ = abcd and $T_2$ = cbde, with $ED(T_1[1..i], T_2[1..j]) = 2$. The largest such $i$ and $j$ are 2 and 3 respectively, and $T_1[i+1] \ne T_2[j+1]$. Thus $ED(ab + c, cbd + e) = 2 + 1 = 3$.

18 18 Based upon the above discussion, on a diagonal $d$ we can find the largest $i$ and $j$ such that $D_{i,j} = e$. How can we find the largest row containing a value smaller than or equal to $k$? We let $L_{d,e}$ denote the largest row $j$ such that $D_{i,j}$ is on diagonal $d$ ($i - j = d$) and $D_{i,j} = e \le k$.

19 19 Let $L_{d,e}$ denote the largest row $j$ such that $D_{i,j}$ is on diagonal $d$ ($i - j = d$) and $D_{i,j} = e \le k$. Based upon this definition, $e$ is the edit distance between $S_1[1..i]$ and $S_2[1..j]$ such that $i$ and $j$ are the largest such ones, and $S_2[j+1] \ne S_1[i+1]$. Example with $S_1$ = gggtcta and $S_2$ = gttc: at $d = 0$, $L_{0,0} = 1$, $L_{0,1} = 2$, $L_{0,2} = 3$ and $L_{0,3} = 4$. [Figure: DP matrix for $S_1$ and $S_2$ with diagonal $d = 0$ marked.]

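20 20 How can we compute the $L_{d,e}$ values? We define $\mathrm{row}_{d,e} = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)]$, where the three terms correspond to substitution, insertion and deletion respectively. Then $L_{d,e} = \mathrm{row}_{d,e} + t$, where $t$ is the length of the longest common prefix of $S_1[d + \mathrm{row}_{d,e} + 1..n]$ and $S_2[\mathrm{row}_{d,e} + 1..m]$. If $t = 0$, it means that $S_1[d + \mathrm{row}_{d,e} + 1] \ne S_2[\mathrm{row}_{d,e} + 1]$.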
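
The recurrence above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the paper's code: the name `furthest_rows` is hypothetical, undefined entries are treated as minus infinity instead of using the slides' boundary values, diagonals are limited to those that actually cross the matrix, and $t$ is found by direct character comparison rather than by an LCA query.

```python
NEG = float("-inf")  # assumption: stands in for every L entry that is not defined


def furthest_rows(s1: str, s2: str, k: int) -> dict:
    """Return L[(d, e)]: the furthest row j on diagonal d = i - j reachable with e errors."""
    n, m = len(s1), len(s2)
    L = {}

    def get(d, e):
        return L.get((d, e), NEG)

    def extend(d, row):
        # Slide along diagonal d while the next characters match (this adds t).
        while row < m and row + d < n and s1[row + d] == s2[row]:
            row += 1
        return row

    L[(0, 0)] = extend(0, 0)
    for e in range(1, k + 1):
        for d in range(max(-e, -m), min(e, n) + 1):
            row = max(get(d, e - 1) + 1,      # substitution: same diagonal
                      get(d - 1, e - 1),      # insertion: from diagonal d-1
                      get(d + 1, e - 1) + 1)  # deletion: from diagonal d+1
            row = min(row, m, n - d)          # stay inside the matrix
            L[(d, e)] = extend(d, row)
    return L


L = furthest_rows("gggtcta", "gttc", 2)
print(L[(0, 0)], L[(1, 2)])
```

On the slides' running example ($S_1$ = gggtcta, $S_2$ = gttc) this yields $L_{0,0} = 1$ and $L_{1,2} = 4$, in agreement with the values worked out by hand.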

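21 21 Consider $D_{3,2}$, which lies on diagonal $d = 1$. $L_{1,1} = 1$: the largest $j$ on $d = 1$ with $D_{i,j} = 1$ is $j = 1$. In this case, $d = 1$ and $e = 2$. $L_{d,e-1} = L_{1,1} = 1$, $L_{d-1,e-1} = L_{0,1} = 2$ and $L_{d+1,e-1} = L_{2,1} = 0$. Thus $\mathrm{row}_{d,e} = \mathrm{row}_{1,2} = \max(L_{1,1} + 1, L_{0,1}, L_{2,1} + 1) = \max(1 + 1, 2, 0 + 1) = \max(2, 2, 1) = 2$. [Figure: DP matrix for gggtcta and gttc with diagonals $d = 0$, $d = 1$ and $d = 2$ marked.]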

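22 22 How to compute $L_{-1,1}$ for $S_1$ = gggtcta and $S_2$ = gttc ($e = 1$, $d = -1$)? $\mathrm{row}_{-1,1} = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = \max[(L_{-1,0} + 1), (L_{-2,0}), (L_{0,0} + 1)] = \max[0 + 1, 0, 1 + 1] = \max[1, 0, 2] = 2$. Since $S_1[d + \mathrm{row}_{d,e} + 1] = S_1[-1 + 2 + 1] = S_1[2]$ = g differs from $S_2[\mathrm{row}_{d,e} + 1] = S_2[2 + 1] = S_2[3]$ = t, we have $t = 0$ and $L_{-1,1} = \mathrm{row}_{-1,1} + 0 = 2$. [Figure: DP matrix with diagonal $d = -1$ marked.]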

23 23 How to compute $L_{1,2}$ for $S_1$ = gggtcta and $S_2$ = gttc ($e = 2$, $d = 1$)? $\mathrm{row}_{1,2} = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = \max[(L_{1,1} + 1), (L_{0,1}), (L_{2,1} + 1)] = \max[1 + 1, 2, 0 + 1] = \max[2, 2, 1] = 2$. Since the longest common prefix of $S_1[d + \mathrm{row}_{1,2} + 1..n] = S_1[4..7]$ = tcta and $S_2[\mathrm{row}_{1,2} + 1..m] = S_2[3..4]$ = tc has length 2, $L_{1,2} = \mathrm{row}_{1,2} + 2 = 4$. [Figure: DP matrix with diagonal $d = 1$ marked.]

24 24 $L_{d,e} = \mathrm{row}_{d,e} + t$, where $t$ is the length of the longest common prefix of $S_1[d + \mathrm{row}_{d,e} + 1..n]$ and $S_2[\mathrm{row}_{d,e} + 1..m]$. How can we compute $t$? In this paper, the LCA (lowest common ancestor) is used.

25 25 Consider two strings $T_1 = A_1 S_1$ and $T_2 = A_2 S_2$. If $ED(A_1, A_2) = k$ and $S_1 = S_2$, then $ED(A_1 + S_1, A_2 + S_2) = k$.

26 26 When we find $ED(A_1, A_2) = k$, we want to determine whether a longest common prefix $S$ of the remaining parts $B_1$ and $B_2$ exists. This paper uses the LCA (lowest common ancestor) to find $S$.

27 27 To find such an $S$, if it exists, we may concatenate $S_1$ and $S_2$ into a new string. The two suffixes of interest are then suffixes of the new string, and they have the common prefix $S$.

28 28 Let us concatenate $S_1$ = gggtcta and $S_2$ = gttc into the new string gggtctagttc. Consider $D_{3,2}$: the suffix of the new string after ggg is tctagttc, and the suffix after gt (within $S_2$) is tc. These two suffixes have a common prefix of length 2. Thus we have $D_{3,2} = D_{4,3} = D_{5,4} = 2$. [Figure: DP matrix with diagonal $d = 1$ marked.]

29 29 Let us concatenate $S_1$ = gggtcta and $S_2$ = gttc into the new string gggtctagttc, and then construct its suffix tree. The suffix after ggg is tctagttc, and the suffix after gt (within $S_2$) is tc. The lowest common ancestor of these two suffixes in the suffix tree spells their common prefix tc, of length 2.
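
The quantity that the LCA query returns can be illustrated without building a suffix tree. The sketch below is a naive stand-in, not the paper's constant-time method: it compares characters of the concatenated string directly, and the name `lcp_in_concatenation` is hypothetical. The comparison is capped so that it never runs past the end of either original string.

```python
def lcp_in_concatenation(s1: str, s2: str, i: int, j: int) -> int:
    """Length of the longest common prefix of s1[i:] and s2[j:] (0-based offsets),
    read off the concatenation s1 + s2 as on the slides."""
    text = s1 + s2
    a = i               # where the suffix of s1 starts in the concatenation
    b = len(s1) + j     # where the suffix of s2 starts in the concatenation
    limit = min(len(s1) - i, len(s2) - j)  # do not cross either string's end
    t = 0
    while t < limit and text[a + t] == text[b + t]:
        t += 1
    return t


# Suffix after "ggg" in gggtcta versus suffix after "gt" in gttc.
print(lcp_in_concatenation("gggtcta", "gttc", 3, 2))
```

A suffix tree with LCA preprocessing answers the same query in constant time, which is what makes the diagonal method fast; the loop here takes time proportional to the answer.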

30 30 Algorithm
Initialization:
  for all $d$, $1 \le d \le k+1$, and all $e$ with $d > e$: $L_{d,e} = -1$.
  for all $d$, $-(k+1) \le d \le -1$: $L_{d,|d|-1} = -1$, $L_{d,|d|-2} = |d|-2$.
  for all $e$, $-1 \le e \le k$: $L_{n+1,e} = -1$.
Find $L_{0,0}$ = the length of the longest common prefix of $S_1$ and $S_2$.
For $e = 1$ to $k$ do
  For $d = -e$ to $e$ do
    $\mathrm{row}_{d,e} = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)]$
    $\mathrm{row}_{d,e} = \min(\mathrm{row}_{d,e}, m)$
    while $\mathrm{row}_{d,e} < m$ and $\mathrm{row}_{d,e} + d$ …

31 31 Example: $T$ = cttggggtcta and $k = 2$. At $c = 4$, $T[1..4]$ = cttg, $S_2 = T^R[1..4]$ = gttc and $S_1 = T[5..11]$ = gggtcta. [Figure: empty DP matrix with $S_1$ = gggtcta indexed by $i$ and $S_2$ = gttc indexed by $j = 1..4$.]

32 32 At $d = 0$, find the largest $j$ such that $S_2[1..j]$ is equal to $S_1[1..i]$ with $i = j$, then set $L_{0,0} = j$. Here $S_2[1] = S_1[1]$ and $S_2[2] \ne S_1[2]$, so $L_{0,0} = 1$. [Figure: DP matrix with the zero entry on diagonal $d = 0$ filled in.]

33 33 With $e = 1$ and $d = -1$: $\mathrm{row}_{-1,1} = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = \max[0, 0, 2] = 2$. The length of the longest common prefix of ggtctagttc and tc is 0, so $L_{-1,1} = 2$. [Figure: DP matrix with diagonal $d = -1$ filled in.]

34 34 In the suffix tree, the LCA of the suffixes ggtctagttc and tc has string depth 0.

35 35 With $e = 1$ and $d = 0$: $\mathrm{row}_{0,1} = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = \max[2, 0, 1] = 2$. The length of the longest common prefix of gtctagttc and tc is 0, so $L_{0,1} = 2$. [Figure: DP matrix with diagonal $d = 0$ filled in.]

36 36 In the suffix tree, the LCA of the suffixes gtctagttc and tc has string depth 0.

37 37 With $e = 1$ and $d = 1$: $\mathrm{row}_{1,1} = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = 1$. The length of the longest common prefix of gtctagttc and ttc is 0, so $L_{1,1} = 1$. [Figure: DP matrix with diagonal $d = 1$ filled in.]

38 38 In the suffix tree, the LCA of the suffixes gtctagttc and ttc has string depth 0.

39 39 With $e = 2$ and $d = 1$: $\mathrm{row}_{1,2} = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = 2$. [Figure: DP matrix with diagonal $d = 1$ partly filled in.]

40 40 We find that the longest common prefix of tc and tctagttc is tc, of length 2, so $L_{1,2} = \mathrm{row}_{1,2} + 2 = 2 + 2 = 4$. [Figure: DP matrix with diagonal $d = 1$ filled in.]

41 41 In the suffix tree, the LCA of the suffixes tctagttc and tc has string depth 2.

42 42 With $e = 2$ and $d = 2$: $\mathrm{row}_{2,2} = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = 1$. We find that the length of the longest common prefix of ttc and tctagttc is 1, so $L_{2,2} = \mathrm{row}_{2,2} + 1 = 1 + 1 = 2$. [Figure: DP matrix with diagonal $d = 2$ filled in.]

43 43 In the suffix tree, the LCA of the suffixes ttc and tctagttc has string depth 1.

44 44 $T$ = cttggggtcta and $k = 2$. At $c = 4$, $T[1..4]$ = cttg, $S_2 = T^R[1..4]$ = gttc and $S_1 = T[5..11]$ = gggtcta. Since $L_{1,2} = 4$ gives $j = 4$ and $i = j + d = 5$, cttggggtc is the maximal approximate palindrome. [Figure: DP matrix with all entries of value at most 2 filled in.]

45 45 References
[U85] E. Ukkonen, Finding approximate patterns in strings, Journal of Algorithms, Vol. 6, 1985.
[LV89] G. Landau and U. Vishkin, Fast parallel and serial approximate string matching, Journal of Algorithms, Vol. 10, 1989.

