Advisor: Prof. R. C. T. Lee Speaker: L. C. Chen


1 Finding approximate palindromes in strings
Alexandre H. L. Porto and Valmir C. Barbosa, Pattern Recognition, vol. 35, pp
Advisor: Prof. R. C. T. Lee Speaker: L. C. Chen

2 Definition
S: a string of n characters.
S[i]: the ith character in S.
S[i..j]: the substring of S whose first and last characters are S[i] and S[j].
$S^R$: the reverse of S.
Example: S = abcab, $S^R$ = bacba.

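3 Definition
An even (odd) palindrome is a string of the form $S^R S$ ($S^R a S$, where a is a single character). Thus abaccaba is an even palindrome because abac is the reverse of caba.
S[c]: the center of a palindrome S[i…j] in S; for an even palindrome, c is the position of the last character of its left half.
Example: in S = abaccaba (positions 1 to 8), S[2…7] = baccab is an even palindrome with c = 4.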
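As a minimal sketch of the definition above, the even and odd cases can be checked by comparing the reversed left half with the right half. The function names are illustrative and do not come from the presentation.

```python
def is_even_palindrome(s: str) -> bool:
    """True if s has the form S^R S: even length, left half reversed equals right half."""
    half, rem = divmod(len(s), 2)
    return rem == 0 and s[:half][::-1] == s[half:]


def is_odd_palindrome(s: str) -> bool:
    """True if s has the form S^R a S: odd length, halves around the middle character mirror."""
    half, rem = divmod(len(s), 2)
    return rem == 1 and s[:half][::-1] == s[half + 1:]


if __name__ == "__main__":
    print(is_even_palindrome("abaccaba"))  # the slide's example
    print(is_even_palindrome("baccab"))    # S[2...7] of abaccaba
    print(is_odd_palindrome("abcba"))
```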

4 Edit distance
In edit distance, there are three types of differences between two strings X and Y:
Insertion: a symbol of Y is missing in X at a corresponding position. Example: X: A - T, Y: A G T.
Substitution: symbols at corresponding positions are distinct. Example: X: A C C, Y: T C C.
Deletion: a symbol of X is missing in Y at a corresponding position. Example: X: G C A, Y: G - A.

5 ED(A, B) denotes the edit distance between two strings A and B: the minimum number of substitutions, insertions and deletions of characters needed to transform B into A.
Example: A = abcab-a and B = cb-abbc (aligned, with - marking a gap). Insertion: 1, substitution: 2 and deletion: 1, so ED(A, B) = 4.
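The definition can be made concrete with the standard dynamic-programming recurrence for edit distance. This is a sketch for illustration only: the name edit_distance is an assumption, and the row-by-row formulation is the textbook one rather than anything specific to the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning b into a."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j - 1] + (ca != cb),   # match or substitution
                prev[j] + 1,                # skip a character of a
                cur[j - 1] + 1,             # skip a character of b
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("abcaba", "cbabbc"))  # the two strings from the slide, gaps removed
```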

6 Approximate palindromes
An approximate palindrome with up to k errors: a string of the form $X^R Y$ ($X^R a Y$ in the odd case) such that ED(X, Y) ≦ k. With k = 0 this is an exact palindrome.
An approximate palindrome is maximal if no other approximate palindrome for the same c and k exists having strictly greater size, or the same size but strictly fewer errors.

7 To simplify our discussion, we only discuss even approximate palindromes here.
Example: S = aabaabcd (positions 1 to 8) and k = 1.
At c = 3, abaa and aabaa are even approximate palindromes:
abaa: substitute b with a.
aabaa: delete b.
aabaa is a maximal approximate palindrome.

8 Problem
Given a string T of size n, we want to find all maximal approximate palindromes in T with up to k errors.
For each c, we find the largest i' and j' such that ED(T[c+1…i'], $T^R$[1…j']) ≦ k, where T[c+1…i'] is a prefix of T[c+1…n] and $T^R$[1…j'] is a prefix of $T^R$[1…c], the reverse of T[1…c].

9 Let S2 = $T^R$[1…c] and S1 = T[c+1…n], where 1 ≦ c ≦ n.
In the dynamic programming approach, we construct a matrix D of size (n'+1) × (m'+1), where $D_{i,j}$ is the minimum edit distance between S1[1…i] and S2[1…j], and n' and m' are the lengths of S1 and S2 respectively.

10 Example: T = dbcaabac and k = 2.
At c = 3, S2 = $T^R$[1…3] = cbd and S1 = T[4…8] = aabac.
[Figure: the matrix D, with S1 = aabac along i = 1…5 and S2 = cbd along j = 1…3.]
Arrows in the matrix: ↖: substitution or a match; ↑: deletion; ←: insertion.
We can find that the maximal approximate palindrome is bcaab.
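The example can be reproduced with a brute-force sketch that fills the whole matrix and scans it for the best cell. The function names are illustrative, Python indices are 0-based while the slides are 1-based, and the tie-break (larger size first, then fewer errors) follows the definition of maximality given earlier.

```python
def dp_table(s1: str, s2: str) -> list[list[int]]:
    """D[i][j] = edit distance between s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),  # substitution or match
                D[i][j - 1] + 1,
                D[i - 1][j] + 1,
            )
    return D


def maximal_at_center(T: str, c: int, k: int) -> str:
    """Maximal even approximate palindrome centered after position c (1-based), up to k errors."""
    s2 = T[:c][::-1]   # reverse of T[1..c]
    s1 = T[c:]         # T[c+1..n]
    D = dp_table(s1, s2)
    best_key, best_i, best_j = (0, 0), 0, 0
    for i in range(len(s1) + 1):
        for j in range(len(s2) + 1):
            if D[i][j] <= k:
                key = (i + j, -D[i][j])   # larger size first, then fewer errors
                if key > best_key:
                    best_key, best_i, best_j = key, i, j
    return T[c - best_j:c + best_i]


if __name__ == "__main__":
    print(maximal_at_center("dbcaabac", 3, 2))
```

This costs time proportional to n' × m' per center, which is what the diagonal method in the following slides is meant to improve.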

11 How can we compute the table faster?
In this paper, the method in [LV89] (L. Y. Huang) was used.

12 We shall heavily use the concept of a diagonal.
Diagonal d is defined as all of the $D_{i,j}$'s where d = i - j.
The diagonal property [U85]: $D_{i,j} - D_{i-1,j-1}$ = 0 or 1. It means that the values on a diagonal never decrease as we move along it.
[Figure: a matrix with Diagonal 0 and Diagonal 2 marked.]
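The diagonal property can be checked numerically on a small matrix. The sketch below uses the strings S1 = gggtcta and S2 = gttc that appear later in the presentation; dp_table and diagonal are hypothetical helper names, and D is indexed as D[i][j] with i over S1 and j over S2.

```python
def dp_table(s1: str, s2: str) -> list[list[int]]:
    """D[i][j] = edit distance between s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),
                D[i][j - 1] + 1,
                D[i - 1][j] + 1,
            )
    return D


def diagonal(D: list[list[int]], d: int) -> list[int]:
    """Values D[i][j] with i - j = d, listed by increasing j."""
    n, m = len(D) - 1, len(D[0]) - 1
    return [D[j + d][j] for j in range(max(0, -d), min(m, n - d) + 1)]


if __name__ == "__main__":
    D = dp_table("gggtcta", "gttc")
    for d in range(-4, 8):
        values = diagonal(D, d)
        steps = [b - a for a, b in zip(values, values[1:])]
        print(d, values, all(s in (0, 1) for s in steps))
```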

13 Consider diagonal d = 0. Let us find the largest j, if it exists, such that (i, j) is on Diagonal d (i - j = d) and $D_{i,j}$ = 0. Let us now label all of these locations.
S1 = gggtcta, S2 = gttc.
[Figure: the matrix for S1 (i = 1…7) and S2 (j = 1…4) with Diagonal 0 marked.]

14 Having found the above locations (i, j) where $D_{i,j}$ = 0, we can further find the largest j, if it exists, such that (i, j) is on Diagonal d and $D_{i,j}$ = 1. To do this, we use the following observation: each element in Diagonal d can only influence elements in Diagonals d-1, d and d+1.

15 Let us consider any (i, j) location on Diagonal d.
$D_{i,j}$ can only be influenced as shown below:
$D_{i-1,j-1}$ (Diagonal d): substitution.
$D_{i,j-1}$ (Diagonal d+1): deletion.
$D_{i-1,j}$ (Diagonal d-1): insertion.
Thus, we conclude that we only need to consider Diagonals d-1, d and d+1 for each $D_{i,j}$.

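16 Observe the following two strings: T1 begins with A1 = T1[1…i] followed by x = T1[i+1], and T2 begins with A2 = T2[1…j] followed by y = T2[j+1].
If i and j are the largest i and j such that ED(T1[1…i], T2[1…j]) = k and T1[i+1] ≠ T2[j+1], then ED(A1+x, A2+y) = k+1.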

17 Consider T1 = abcd and T2 = cbde. ED(T1[1…i], T2[1…j]) = 2. The largest such i and j are 2 and 3 respectively, and T1[i+1] = c ≠ T2[j+1] = e. Thus ED(ab+c, cbd+e) = 2+1 = 3.

18 Based upon the above discussion, on a diagonal d, we can find the largest i and j such that $D_{i,j}$ = e.
How can we find the largest row containing a value smaller than or equal to k? We let $L_{d,e}$ denote the largest row j such that $D_{i,j}$ is on Diagonal d (i - j = d) and $D_{i,j}$ = e ≦ k.

19 Let $L_{d,e}$ denote the largest row j such that $D_{i,j}$ is on Diagonal d (i - j = d) and $D_{i,j}$ = e ≦ k.
Based upon this definition, e is the edit distance between S1[1…i] and S2[1…j], where i and j are the largest such indices, and S2[j+1] ≠ S1[i+1].
S1 = gggtcta, S2 = gttc.
[Figure: the matrix for S1 (i = 1…7) and S2 (j = 1…4) with Diagonal 0 marked.]
At d = 0: $L_{0,0}$ = 1, $L_{0,1}$ = 2, $L_{0,2}$ = 3 and $L_{0,3}$ = 4.

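20 How can we compute the value of $L_{d,e}$?
We define
$$\mathrm{row}_{d,e} = \max\left[\, L_{d,e-1} + 1,\; L_{d-1,e-1},\; L_{d+1,e-1} + 1 \,\right],$$
where the three terms correspond to substitution, insertion and deletion respectively.
$L_{d,e} = \mathrm{row}_{d,e} + t$, where t is the length of the longest common prefix of S1[d + $\mathrm{row}_{d,e}$ + 1 … n'] and S2[$\mathrm{row}_{d,e}$ + 1 … m'].
If t = 0, it means that S1[d + $\mathrm{row}_{d,e}$ + 1] ≠ S2[$\mathrm{row}_{d,e}$ + 1].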

21 Consider $D_{3,2}$. $L_{1,1}$ = 1: the largest j on d = 1 for $D_{i,j}$ = 1 is j = 1.
In this case, d = 1 and e = 2. $L_{d,e-1}$ = $L_{1,1}$ = 1, $L_{d-1,e-1}$ = $L_{0,1}$ = 2 and $L_{d+1,e-1}$ = $L_{2,1}$ = 0.
Thus $\mathrm{row}_{1,2}$ = max($L_{1,1}$ + 1, $L_{0,1}$, $L_{2,1}$ + 1) = max(1+1, 2, 0+1) = max(2, 2, 1) = 2.
[Figure: the matrix for S1 = gggtcta (i = 1…7) and S2 = gttc (j = 1…4) with Diagonals 0, 1 and 2 marked.]

22 e = 1, d = -1. S1 = gggtcta, S2 = gttc. How to compute $L_{-1,1}$?
$\mathrm{row}_{-1,1}$ = max[$L_{d,e-1}$ + 1, $L_{d-1,e-1}$, $L_{d+1,e-1}$ + 1] = max[$L_{-1,0}$ + 1, $L_{-2,0}$, $L_{0,0}$ + 1] = max[0+1, 0, 1+1] = max[1, 0, 2] = 2.
Since S1[d + $\mathrm{row}_{d,e}$ + 1] = S1[-1+2+1] = S1[2] = g ≠ S2[$\mathrm{row}_{d,e}$ + 1] = S2[2+1] = t, $L_{-1,1}$ = $\mathrm{row}_{-1,1}$ + 0 = 2.

23 S1 = gggtcta, S2 = gttc. How to compute $L_{1,2}$?
[Figure: part of the matrix for S1 (i = 1…7) and S2 (j = 1…4) with Diagonal 1 marked.]
$\mathrm{row}_{1,2}$ = max[$L_{d,e-1}$ + 1, $L_{d-1,e-1}$, $L_{d+1,e-1}$ + 1] = max[$L_{1,1}$ + 1, $L_{0,1}$, $L_{2,1}$ + 1] = max[1+1, 2, 0+1] = max[2, 2, 1] = 2.
Since the length of the longest common prefix of S1[d + $\mathrm{row}_{1,2}$ + 1 … n'] = S1[4…7] = tcta and S2[$\mathrm{row}_{1,2}$ + 1 … m'] = S2[3…4] = tc is 2, $L_{1,2}$ = $\mathrm{row}_{1,2}$ + 2 = 4.

24 $L_{d,e} = \mathrm{row}_{d,e} + t$, where t is the length of the longest common prefix of S1[d + $\mathrm{row}_{d,e}$ + 1 … n'] and S2[$\mathrm{row}_{d,e}$ + 1 … m']. How can we compute t? In this paper, LCA (lowest common ancestor) is used.

25 Consider two substrings T1 and T2: T1 consists of A1 followed by S1 and then a character x; T2 consists of A2 followed by S2 and then a character y.
If ED(A1, A2) = k and S1 = S2, then ED(A1+S1, A2+S2) = k.

26 When we find that ED(A1, A2) = k, we want to determine whether the longest common prefix S of B1 and B2 exists.
[Figure: B1 beginning with S1 and B2 beginning with S2.]
This paper will use LCA (lowest common ancestor) to find S.

27 To find such S, if it exists, we may concatenate S1 and S2 into a new string.
[Figure: the concatenated string with the suffixes S1' and S2' marked.]
Obviously, suffixes S1' and S2' have a common prefix S.

28 S1 = gggtcta, S2 = gttc.
Consider $D_{3,2}$. In the concatenation of S1 and S2, the substring after ggg is tctagttc = S1'. The substring after gt in S2 is tc = S2'. Note that S2' and S1' have a common prefix of length 2. Thus we have that $D_{3,2}$ = $D_{4,3}$ = $D_{5,4}$ = 2.
[Figure: the matrix for S1 (i = 1…7) and S2 (j = 1…4) with Diagonal 1 marked.]

29 S1 = gggtcta, S2 = gttc. Let us concatenate S1 and S2 to be a new string as follows: gggtctagttc. And then we construct the suffix tree of it. The substring after ggg is tctagttc = S1'. The substring after gt is tc = S2'. Note that the lowest common ancestor of S2' and S1' in the suffix tree spells tc, of length 2.
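The quantity that the suffix tree and LCA query deliver is the length of the longest common prefix of two suffixes of the concatenated string. The sketch below computes the same value by direct character comparison. It is a stand-in for illustration, not the constant-time LCA method the paper relies on, and lcp_of_suffixes is a hypothetical name. Positions are 0-based.

```python
def lcp_of_suffixes(text: str, p: int, q: int) -> int:
    """Length of the longest common prefix of text[p:] and text[q:] (naive scan)."""
    t = 0
    while p + t < len(text) and q + t < len(text) and text[p + t] == text[q + t]:
        t += 1
    return t


if __name__ == "__main__":
    s1, s2 = "gggtcta", "gttc"
    text = s1 + s2                       # gggtctagttc
    # S1' starts after "ggg" (index 3); S2' starts after "gt" inside S2 (index 7 + 2).
    print(text[3:], text[9:], lcp_of_suffixes(text, 3, len(s1) + 2))
```

A real implementation would usually put a separator character that occurs in neither string between S1 and S2, so that a match can never run across the boundary.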

30 Algorithm
Initialization:
  for all d, 1 ≦ d ≦ k+1, with d > e: $L_{d,e}$ = -1.
  for all d, -(k+1) ≦ d ≦ -1: $L_{d,|d|-1}$ = -1 and $L_{d,|d|-2}$ = |d|-2.
  for all e, -1 ≦ e ≦ k: $L_{n'+1,e}$ = -1.
  Find $L_{0,0}$ = the length of the longest common prefix of S1 and S2.
For e = 1 to k do
  For d = -e to e do
    $\mathrm{row}_{d,e}$ = max[$L_{d,e-1}$ + 1, $L_{d-1,e-1}$, $L_{d+1,e-1}$ + 1]
    $\mathrm{row}_{d,e}$ = min($\mathrm{row}_{d,e}$, m')
    if $\mathrm{row}_{d,e}$ < m' and $\mathrm{row}_{d,e}$ + d < n' then
      find t = the length of the longest common prefix of S1[d + $\mathrm{row}_{d,e}$ + 1 … n'] and S2[$\mathrm{row}_{d,e}$ + 1 … m']
      $\mathrm{row}_{d,e}$ = $\mathrm{row}_{d,e}$ + t
    $L_{d,e}$ = $\mathrm{row}_{d,e}$
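The diagonal computation can be sketched in Python as follows. Several choices here are assumptions and not taken from the slides: the boundary initialization is replaced by a "missing entry" sentinel, the longest common prefix is found by direct comparison instead of an LCA query, L[(d, e)] is read as the furthest row on Diagonal d reachable with at most e errors, and the function names are illustrative.

```python
def furthest_rows(s1: str, s2: str, k: int) -> dict[tuple[int, int], int]:
    """L[(d, e)] = furthest row j on diagonal d (i - j = d) with D[i][j] <= e."""
    n, m = len(s1), len(s2)
    NEG = -10**9                      # sentinel for entries that do not exist
    L: dict[tuple[int, int], int] = {}
    for e in range(k + 1):
        for d in range(max(-e, -m), min(e, n) + 1):
            if e == 0:
                row = 0
            else:
                row = max(
                    L.get((d, e - 1), NEG) + 1,      # substitution
                    L.get((d - 1, e - 1), NEG),      # insertion
                    L.get((d + 1, e - 1), NEG) + 1,  # deletion
                )
            row = min(row, m, n - d)  # stay inside the matrix
            # Slide along the diagonal while characters match (the common prefix of length t).
            while row < m and row + d < n and s1[row + d] == s2[row]:
                row += 1
            L[(d, e)] = row
    return L


def maximal_palindrome_at(T: str, c: int, k: int) -> str:
    """Maximal even approximate palindrome centered after position c (1-based)."""
    s2, s1 = T[:c][::-1], T[c:]
    L = furthest_rows(s1, s2, k)
    best_size, best_j, best_d = 0, 0, 0
    for e in range(k + 1):            # ascending e: ties keep the fewer-error entry
        for d in range(-e, e + 1):
            if (d, e) in L and 2 * L[(d, e)] + d > best_size:
                best_size, best_j, best_d = 2 * L[(d, e)] + d, L[(d, e)], d
    return T[c - best_j:c + best_j + best_d]


if __name__ == "__main__":
    print(furthest_rows("gggtcta", "gttc", 2)[(1, 2)])
    print(maximal_palindrome_at("cttggggtcta", 4, 2))
```

Only the entries with |d| ≦ e ≦ k are filled, so the number of table cells is proportional to k squared per center; the per-cell cost depends on how the common prefix is found.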

31 Example: T = cttggggtcta and k = 2.
At c = 4, T[1…4] = cttg, S2 = $T^R$[1…4] = gttc and S1 = T[5…11] = gggtcta.
[Figure: the matrix for S1 (i = 1…7) and S2 (j = 1…4).]

32 At d = 0, find the largest j such that S2[1…j] is equal to S1[1…i]; then we set $L_{0,0}$ = j.
S2[1] = S1[1] and S2[2] ≠ S1[2], so $L_{0,0}$ = 1.
[Figure: the matrix with Diagonal 0 marked.]

33 e = 1, d = -1.
$\mathrm{row}_{-1,1}$ = max[$L_{d,e-1}$ + 1, $L_{d-1,e-1}$, $L_{d+1,e-1}$ + 1] = max[0, 0, 2] = 2.
The length of the longest common prefix of ggtctagttc and tc is 0.
$L_{-1,1}$ = 2.

34 The length of LCA of ggtctagttc and tc is 0.

35 e = 1, d = 0.
$\mathrm{row}_{0,1}$ = max[$L_{d,e-1}$ + 1, $L_{d-1,e-1}$, $L_{d+1,e-1}$ + 1] = max[2, 0, 1] = 2.
The length of the longest common prefix of gtctagttc and tc is 0.
$L_{0,1}$ = 2.

36 The length of LCA of gtctagttc and tc is 0.

37 e = 1, d = 1.
$\mathrm{row}_{1,1}$ = max[$L_{d,e-1}$ + 1, $L_{d-1,e-1}$, $L_{d+1,e-1}$ + 1] = 1.
The length of the longest common prefix of gtctagttc and ttc is 0.
$L_{1,1}$ = 1.

38 The length of LCA of gtctagttc and ttc is 0.

39 e = 2, d = 1.
$\mathrm{row}_{1,2}$ = max[$L_{d,e-1}$ + 1, $L_{d-1,e-1}$, $L_{d+1,e-1}$ + 1] = 2.

40 e = 2, d = 1.
We find that the longest common prefix of S2' = tc and S1' = tctagttc is tc.
$L_{1,2}$ = $\mathrm{row}_{1,2}$ + 2 = 2 + 2 = 4.

41 The length of LCA of tctagttc and tc is 2.

42 e = 2, d = 2.
$\mathrm{row}_{2,2}$ = max[$L_{d,e-1}$ + 1, $L_{d-1,e-1}$, $L_{d+1,e-1}$ + 1] = 1.
We find that the length of the longest common prefix of S2' = ttc and S1' = tctagttc is 1.
$L_{2,2}$ = $\mathrm{row}_{2,2}$ + 1 = 1 + 1 = 2.

43 The length of LCA of ttc and tctagttc is 1.

44 T = cttggggtcta and k = 2. S1 = gggtcta, S2 = gttc.
At c = 4, T[1…4] = cttg, $T^R$[1…4] = gttc and T[5…11] = gggtcta.
cttggggtc is the maximal approximate palindrome.

45 References
[U85] E. Ukkonen, Finding approximate patterns in strings, Journal of Algorithms, Vol. 6, 1985, pp
[LV89] G. Landau and U. Vishkin, Fast parallel and serial approximate string matching, Journal of Algorithms, Vol. 10, 1989, pp

