1
**Finding approximate palindromes in strings**

Pattern Recognition, vol. 35, pp , Alexandre H. L. Porto and Valmir C. Barbosa. Advisor: Prof. R. C. T. Lee. Speaker: L. C. Chen.

2
**Definition**

S: a string of n characters. S[i]: the ith character in S. S[i..j]: the substring of S whose first and last characters are S[i] and S[j]. $S^R$: the reverse of S. Example: S = abcab, $S^R$ = bacba.

3
Definition: An even (odd) palindrome is a string of the form $S^R S$ ($S^R a S$). Thus abaccaba is an even palindrome because abac is the reverse of caba. S[c] is the center of a palindrome S[i…j] in S; for an even palindrome, c is the last position of its left half. Example: for S = abaccaba (positions 1 to 8), S[2…7] = baccab is an even palindrome and c = 4.

4
**Edit distance**

In edit distance, there are three types of differences between two strings X and Y:

Insertion: a symbol of Y is missing in X at a corresponding position. Example: X: A - T, Y: A G T.

Substitution: symbols at corresponding positions are distinct. Example: X: A C C, Y: T C C.

Deletion: a symbol of X is missing in Y at a corresponding position. Example: X: G C A, Y: G - A.

5
ED(A, B) denotes the edit distance between two strings A and B: the minimum number of substitutions, insertions and deletions of characters needed to transform B into A. Example: for A = abcaba and B = cbabbc, the alignment A = abcab-a, B = cb-abbc uses 1 insertion, 2 substitutions and 1 deletion, so ED(A, B) = 4.
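The definition above can be checked with the textbook dynamic programming recurrence. The following is a minimal sketch, not code from the paper; the function name `edit_distance` is an assumption made for illustration.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions
    needed to transform b into a (unit cost for each operation)."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,  # substitution or match
                          prev[j] + 1,         # symbol of a missing in b
                          curr[j - 1] + 1)     # symbol of b missing in a
        prev = curr
    return prev[len(b)]


# The slide example without the gap symbols: A = abcaba, B = cbabbc.
print(edit_distance("abcaba", "cbabbc"))  # 4
```

Only two rows are kept at a time because the distance alone is needed here; the later slides need the whole table.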

6
**Approximate palindromes**

An approximate palindrome with up to k errors is a string of the form XY (even) or XaY (odd) such that ED($X^R$, Y) ≦ k; with k = 0 this reduces to the exact form $S^R S$ ($S^R a S$). An approximate palindrome is maximal if no other approximate palindrome for the same c and k has strictly greater size, or the same size but strictly fewer errors.

7
**Example of even approximate palindromes**

To simplify our discussion, we only discuss even approximate palindromes here. Let S = aabaabcd (positions 1 to 8) and k = 1. At c = 3, abaa and aabaa are both even approximate palindromes: abaa needs one substitution (substitute b with a) and aabaa needs one deletion (delete b). Of the two, aabaa is a maximal approximate palindrome.

8
Problem: Given a string T of size n, we want to find all maximal approximate palindromes in T with up to k errors. For each c, we find the largest i' and j' in T[c+1…n] and $T^R$[1…c] respectively such that ED(T[c+1…i'], $T^R$[1…j']) ≦ k. Here $T^R$[1…c] denotes the reverse of T[1…c].

9
**Let S2 = $T^R$[1…c] and S1 = T[c+1…n], where 1 ≦ c ≦ n.**

In the dynamic programming approach, we construct a matrix D of size (n'+1) × (m'+1), where $D_{i,j}$ is the minimum edit distance between S1[1…i] and S2[1…j], and n' and m' are the lengths of S1 and S2 respectively.

10
**Example**

T = dbcaabac and k = 2. At c = 3, S2 = $T^R$[1…3] = cbd and S1 = T[4…8] = aabac.

[Figure: dynamic programming table with S1 = aabac along i = 1…5 and S2 = cbd along j = 1…3. Arrows: ↖ substitution or match, ↑ deletion, ← insertion.]

We can find that the maximal approximate palindrome is bcaab.
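The example above can be reproduced with a direct sketch of the dynamic programming approach: split T at c, fill the whole table D, and keep the cell with the largest i + j among those with $D_{i,j}$ ≦ k (fewer errors break ties). The function names `dp_table` and `maximal_at_center` are illustrative assumptions, and the tie-breaking between cells of equal size and equal error count is arbitrary here.

```python
def dp_table(s1, s2):
    """D[i][j] = edit distance between s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D


def maximal_at_center(T, c, k):
    """Longest even approximate palindrome around center c (1-based) with <= k errors."""
    s1 = T[c:]          # S1 = T[c+1…n]
    s2 = T[:c][::-1]    # S2 = reverse of T[1…c]
    D = dp_table(s1, s2)
    best_key, best_ij = (0, 0), (0, 0)
    for i in range(len(s1) + 1):
        for j in range(len(s2) + 1):
            if D[i][j] <= k:
                key = (i + j, -D[i][j])   # larger size first, then fewer errors
                if key > best_key:
                    best_key, best_ij = key, (i, j)
    i, j = best_ij
    return T[c - j:c + i]   # j characters left of the center, i to the right


print(maximal_at_center("dbcaabac", 3, 2))  # bcaab
```

This costs O(n'm') per center, which is what the diagonal method on the following slides avoids.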

11
**How can we compute the table faster?**

In this paper, the method of [LV89] (L. Y. Huang) is used.

12
**We shall heavily use the concept of a diagonal.**

Diagonal d is defined as all of the $D_{i,j}$'s where d = i - j. The diagonal property: $D_{i,j} - D_{i-1,j-1}$ = 0 or 1. It means that along a diagonal the values never decrease [U85].

[Figure: table with Diagonal 0 and Diagonal 2 marked.]
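The diagonal property can be observed on the strings used in the later slides. The sketch below lists the values along one diagonal of the full table; the function name `diagonal_values` is an assumption, and d is assumed to lie between -m' and n' so that the diagonal exists.

```python
def diagonal_values(s1, s2, d):
    """Values D[i][j] along diagonal d = i - j, from its first cell to its last."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost, D[i - 1][j] + 1, D[i][j - 1] + 1)
    # The first cell of diagonal d lies on the border of the table.
    i, j = max(d, 0), max(-d, 0)
    values = []
    while i <= n and j <= m:
        values.append(D[i][j])
        i += 1
        j += 1
    return values


vals = diagonal_values("gggtcta", "gttc", 1)
print(vals)                                   # [1, 1, 2, 2, 2]
print([b - a for a, b in zip(vals, vals[1:])])  # every step is 0 or 1
```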

13
**Let us now label all of these locations.**

Consider diagonal d = 0. Let us find the largest j, if it exists, such that (i, j) is on Diagonal d (i - j = d) and $D_{i,j}$ = 0. Let us now label all of these locations.

[Figure: table for S1 = gggtcta (i = 1…7) and S2 = gttc (j = 1…4) with Diagonal 0 marked.]

14
Having found the above locations (i, j) where $D_{i,j}$ = 0, we can further find the largest j, if it exists, such that (i, j) is on Diagonal d and $D_{i,j}$ = 1. To do this, we use the following observation: each element in Diagonal d can only influence elements in Diagonals d-1, d and d+1.

15
**Let us consider any (i, j) location on Diagonal d.**

$D_{i,j}$ can only be influenced by three neighbours: $D_{i-1,j-1}$ on Diagonal d (substitution), $D_{i,j-1}$ on Diagonal d+1 (deletion) and $D_{i-1,j}$ on Diagonal d-1 (insertion). Thus, we conclude that we only need to consider Diagonals d-1, d and d+1 for each $D_{i,j}$.

16
**Observe the following two strings:**

If i and j are the largest i and j such that ED(T1[1…i], T2[1…j]) = k and T1[i+1] ≠ T2[j+1], then ED(A1+x, A2+y) = k+1, where A1 = T1[1…i], A2 = T2[1…j], x = T1[i+1] and y = T2[j+1].

17
Consider T1 = abcd and T2 = cbde. ED(T1[1…i], T2[1…j]) = 2. The largest such i and j are 2 and 3 respectively (T1[1…2] = ab, T2[1…3] = cbd), and T1[i+1] = c ≠ T2[j+1] = e. Thus ED(ab+c, cbd+e) = 2+1 = 3.

18
**Based upon the above discussion, on a diagonal d, we can find the largest i and j such that $D_{i,j}$ = e.**

How can we find the largest row containing a value smaller than or equal to k? We let $L_{d,e}$ denote the largest row j such that $D_{i,j}$ is on Diagonal d (i - j = d) and $D_{i,j}$ = e ≦ k.

19
**Let $L_{d,e}$ denote the largest row j such that $D_{i,j}$ is on Diagonal d (i - j = d) and $D_{i,j}$ = e ≦ k.**

Based upon this definition, e is the edit distance between S1[1…i] and S2[1…j] such that i and j are the largest such values, and S2[j+1] ≠ S1[i+1]. For S1 = gggtcta and S2 = gttc, at d = 0: $L_{0,0}$ = 1, $L_{0,1}$ = 2, $L_{0,2}$ = 3 and $L_{0,3}$ = 4.

[Figure: table for S1 (i = 1…7) and S2 (j = 1…4) with diagonal d = 0 marked.]

20
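**How can we compute the $L_{d,e}$ values?**

We define

$$\mathrm{row}_{d,e} = \max\bigl(L_{d,e-1}+1,\ L_{d-1,e-1},\ L_{d+1,e-1}+1\bigr),$$

where the three terms correspond to substitution, insertion and deletion respectively. Then $L_{d,e} = \mathrm{row}_{d,e} + t$, where t is the length of the longest common prefix of S1[d+$\mathrm{row}_{d,e}$+1…n'] and S2[$\mathrm{row}_{d,e}$+1…m']. If t = 0, it means that S1[d+$\mathrm{row}_{d,e}$+1] ≠ S2[$\mathrm{row}_{d,e}$+1].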

21
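**Example: computing $\mathrm{row}_{1,2}$**

Consider $D_{3,2}$. $L_{1,1}$ = 1: the largest j on d = 1 for $D_{i,j}$ = 1 is j = 1. In this case, d = 1 and e = 2. $L_{d,e-1} = L_{1,1} = 1$, $L_{d-1,e-1} = L_{0,1} = 2$ and $L_{d+1,e-1} = L_{2,1} = -1$ (the boundary value used by the algorithm when d > e). Thus $\mathrm{row}_{d,e} = \mathrm{row}_{1,2} = \max(L_{1,1}+1,\ L_{0,1},\ L_{2,1}+1) = \max(1+1,\ 2,\ -1+1) = \max(2, 2, 0) = 2$.

[Figure: table for S1 (i = 1…7) and S2 (j = 1…4) with diagonals d = 0, d = 1 and d = 2 marked.]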

22
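**How to compute $L_{-1,1}$?**

Here e = 1, d = -1, S1 = gggtcta and S2 = gttc.

$\mathrm{row}_{-1,1} = \max(L_{d,e-1}+1,\ L_{d-1,e-1},\ L_{d+1,e-1}+1) = \max(L_{-1,0}+1,\ L_{-2,0},\ L_{0,0}+1) = \max(-1+1,\ 0,\ 1+1) = \max(0, 0, 2) = 2$, using the boundary values $L_{-1,0}$ = -1 and $L_{-2,0}$ = 0 from the initialization of the algorithm.

Since S1[d+$\mathrm{row}_{d,e}$+1] = S1[-1+2+1] = S1[2] = g ≠ S2[$\mathrm{row}_{d,e}$+1] = S2[2+1] = t, we have t = 0 and $L_{-1,1} = \mathrm{row}_{-1,1} + 0 = 2$.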

23
**How to compute $L_{1,2}$?**

S1 = gggtcta, S2 = gttc, d = 1.

[Figure: partially filled table for S1 (i = 1…7) and S2 (j = 1…4) with diagonal d = 1 marked.]

$\mathrm{row}_{1,2} = \max(L_{d,e-1}+1,\ L_{d-1,e-1},\ L_{d+1,e-1}+1) = \max(L_{1,1}+1,\ L_{0,1},\ L_{2,1}+1) = \max(1+1,\ 2,\ -1+1) = \max(2, 2, 0) = 2$. Since the length of the longest common prefix of S1[d+$\mathrm{row}_{1,2}$+1…n'] = S1[4…7] = tcta and S2[$\mathrm{row}_{1,2}$+1…m'] = S2[3…4] = tc is 2, $L_{1,2} = \mathrm{row}_{1,2} + 2 = 4$.

24
$L_{d,e} = \mathrm{row}_{d,e} + t$, where t is the length of the longest common prefix of S1[d+$\mathrm{row}_{d,e}$+1…n'] and S2[$\mathrm{row}_{d,e}$+1…m']. How can we compute t? In this paper, LCA (lowest common ancestor) is used.

25
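**Consider two substrings T1 and T2 as shown below:**

[Figure: T1 consists of A1 followed by S1 and a character x; T2 consists of A2 followed by S2 and a character y.]

If ED(A1, A2) = k and S1 = S2, then ED(A1+S1, A2+S2) = k.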

26
**This paper will use LCA (lowest common ancestor) to find S.**

When we find ED(A1, A2) = k, we want to determine whether a common prefix S of B1 and B2 exists, where B1 and B2 are the parts of the two strings that follow A1 and A2. This paper uses LCA (lowest common ancestor) to find the longest such S.

27
**Suffixes S1' and S2' have a common prefix S.**

To find such S, if it exists, we may concatenate S1 and S2 into a new string. Let S1' and S2' be the suffixes of the new string that start right after A1 and right after A2. Obviously, suffixes S1' and S2' have a common prefix S.

28
**Example: S1 = gggtcta, S2 = gttc, d = 1**

The concatenation of S1 and S2 is gggtctagttc. Consider $D_{3,2}$: the substring after ggg is tctagttc = S1', and the substring after gt is tc = S2'. Note that S2' and S1' have a common prefix of length 2. Thus we have $D_{3,2} = D_{4,3} = D_{5,4} = 2$.

[Figure: table for S1 (i = 1…7) and S2 (j = 1…4) with diagonal d = 1 marked.]

29
S1 = gggtcta, S2 = gttc. Let us concatenate S1 and S2 into a new string as follows: gggtctagttc. Then we construct the suffix tree of it. The substring after ggg is tctagttc = S1'. The substring after gt is tc = S2'. Note that the lowest common ancestor of the leaves for S2' and S1' spells their common prefix tc, of length 2.
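To see what such a longest-common-prefix query returns, the sketch below swaps the suffix tree and LCA machinery for a plain precomputed table. This is a deliberately simpler stand-in, not the paper's method: it costs O(n'm') time and space, so it gives none of the speed-up, only the same answers. The name `build_lcp_table` and the 0-based indexing are assumptions made for illustration.

```python
def build_lcp_table(s1, s2):
    """lcp[i][j] = length of the longest common prefix of s1[i:] and s2[j:] (0-based)."""
    n, m = len(s1), len(s2)
    lcp = [[0] * (m + 1) for _ in range(n + 1)]
    # Fill from the ends of the strings backwards: a match at (i, j)
    # extends the common prefix that starts at (i + 1, j + 1).
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if s1[i] == s2[j]:
                lcp[i][j] = lcp[i + 1][j + 1] + 1
    return lcp


lcp = build_lcp_table("gggtcta", "gttc")
print(lcp[3][2])  # S1[4…7] = tcta against S2[3…4] = tc  -> 2
print(lcp[3][1])  # S1[4…7] = tcta against S2[2…4] = ttc -> 1
```

After the table is built each query is a single lookup, which mirrors the constant-time LCA query on the suffix tree.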

30
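**Algorithm**

Initialization:

For all d, 1 ≦ d ≦ k+1, and all e with d > e: $L_{d,e}$ = -1.

For all d, -(k+1) ≦ d ≦ -1: $L_{d,|d|-1}$ = -1 and $L_{d,|d|-2}$ = |d|-2.

For all e, -1 ≦ e ≦ k: $L_{n'+1,e}$ = -1.

Find $L_{0,0}$ = the length of the longest common prefix of S1 and S2.

Main loop: for e = 1 to k do, for d = -e to e do:

(1) $\mathrm{row}_{d,e} = \max(L_{d,e-1}+1,\ L_{d-1,e-1},\ L_{d+1,e-1}+1)$.

(2) $\mathrm{row}_{d,e} = \min(\mathrm{row}_{d,e},\ m')$.

(3) If $\mathrm{row}_{d,e}$ < m' and $\mathrm{row}_{d,e}$ + d < n', find t = the length of the longest common prefix of S1[d+$\mathrm{row}_{d,e}$+1…n'] and S2[$\mathrm{row}_{d,e}$+1…m'], and set $\mathrm{row}_{d,e} = \mathrm{row}_{d,e} + t$.

(4) $L_{d,e} = \mathrm{row}_{d,e}$.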
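The steps above can be sketched as follows. This is an illustrative reading of the slide, under several assumptions: the longest common prefix is found by a naive character scan rather than by suffix tree and LCA, the boundary values are supplied by a hypothetical helper `get` instead of a pre-filled table, the loop over d is restricted to diagonals that exist (-m' ≦ d ≦ n'), and an entry stays at the last row of its diagonal once the end of either string is reached.

```python
def lcp_length(a, b):
    """Length of the longest common prefix of a and b (naive scan)."""
    t = 0
    while t < len(a) and t < len(b) and a[t] == b[t]:
        t += 1
    return t


def furthest_rows(s1, s2, k):
    """Return {(d, e): L[d, e]} for 0 <= e <= k and every existing diagonal d with |d| <= e."""
    n, m = len(s1), len(s2)
    L = {}

    def get(d, e):
        # Computed entries first, then the boundary values of the initialization.
        if (d, e) in L:
            return L[(d, e)]
        if d < 0 and e == -d - 2:
            return -d - 2
        return -1

    L[(0, 0)] = lcp_length(s1, s2)
    for e in range(1, k + 1):
        for d in range(max(-e, -m), min(e, n) + 1):
            row = max(get(d, e - 1) + 1,      # substitution, same diagonal
                      get(d - 1, e - 1),      # insertion, from diagonal d-1
                      get(d + 1, e - 1) + 1)  # deletion, from diagonal d+1
            row = min(row, m, n - d)          # stay inside the table
            # Slide along the diagonal while the characters match.
            row += lcp_length(s1[row + d:], s2[row:])
            L[(d, e)] = row
    return L


L = furthest_rows("gggtcta", "gttc", 2)
print(L[(0, 0)], L[(-1, 1)], L[(0, 1)], L[(1, 1)], L[(1, 2)], L[(2, 2)])  # 1 2 2 1 4 2
```

Only O(k) diagonals are touched for each error level, so with constant-time prefix queries the table has O(k²) entries instead of the O(n'm') cells of the full matrix.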

31
**Example**

T = cttggggtcta and k = 2. At c = 4, T[1…4] = cttg, S2 = $T^R$[1…4] = gttc and S1 = T[5…11] = gggtcta.

[Figure: empty table with S1 = gggtcta along i = 1…7 and S2 = gttc along j = 1…4.]

32
**Computing $L_{0,0}$**

At d = 0, find the largest j such that S2[1…j] is equal to S1[1…i], with i = j on this diagonal; then we set $L_{0,0}$ = j. Here S2[1] = S1[1] = g but S2[2] = t ≠ S1[2] = g, so $L_{0,0}$ = 1.

33
**e = 1, d = -1**

$\mathrm{row}_{-1,1} = \max(L_{d,e-1}+1,\ L_{d-1,e-1},\ L_{d+1,e-1}+1) = \max(0, 0, 2) = 2$. The length of the longest common prefix of ggtctagttc and tc is 0, so $L_{-1,1}$ = 2.

34
**The LCA of the suffixes ggtctagttc and tc gives a common prefix of length 0.**

35
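e = 1, d = 0: $\mathrm{row}_{0,1} = \max(L_{d,e-1}+1,\ L_{d-1,e-1},\ L_{d+1,e-1}+1) = \max(L_{0,0}+1,\ L_{-1,0},\ L_{1,0}+1) = \max(1+1,\ -1,\ -1+1) = \max(2, -1, 0) = 2$, using the boundary values $L_{-1,0}$ = $L_{1,0}$ = -1. The length of the common prefix of gtctagttc and tc is 0, so $L_{0,1}$ = 2.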

36
**The LCA of the suffixes gtctagttc and tc gives a common prefix of length 0.**

37
**e = 1, d = 1**

[Figure: partially filled table for S1 = gggtcta (i = 1…7) and S2 = gttc (j = 1…4) with diagonal d = 1 marked.]

$\mathrm{row}_{1,1} = \max(L_{d,e-1}+1,\ L_{d-1,e-1},\ L_{d+1,e-1}+1) = \max(L_{1,0}+1,\ L_{0,0},\ L_{2,0}+1) = \max(-1+1,\ 1,\ -1+1) = 1$. The length of the common prefix of gtctagttc and ttc is 0, so $L_{1,1}$ = 1.

38
**The LCA of the suffixes gtctagttc and ttc gives a common prefix of length 0.**

39
**e = 2, d = 1**

$\mathrm{row}_{1,2} = \max(L_{d,e-1}+1,\ L_{d-1,e-1},\ L_{d+1,e-1}+1) = \max(L_{1,1}+1,\ L_{0,1},\ L_{2,1}+1) = 2$.

40
**We find that the longest common prefix of tc and tctagttc is tc.**

e = 2, d = 1.

[Figure: partially filled table for S1 = gggtcta (i = 1…7) and S2 = gttc (j = 1…4) with diagonal d = 1 marked.]

With S2' = tc and S1' = tctagttc, the longest common prefix is tc, of length 2. Thus $L_{1,2} = \mathrm{row}_{1,2} + 2 = 2 + 2 = 4$.

41
**The LCA of the suffixes tctagttc and tc gives a common prefix of length 2.**

42
**e = 2, d = 2**

$\mathrm{row}_{2,2} = \max(L_{d,e-1}+1,\ L_{d-1,e-1},\ L_{d+1,e-1}+1) = \max(L_{2,1}+1,\ L_{1,1},\ L_{3,1}+1) = 1$. We find that the length of the common prefix of S2' = ttc and S1' = tctagttc is 1. Thus $L_{2,2} = \mathrm{row}_{2,2} + 1 = 1 + 1 = 2$.

43
**The LCA of the suffixes ttc and tctagttc gives a common prefix of length 1.**

44
**Result**

T = cttggggtcta and k = 2. At c = 4, T[1…4] = cttg, S2 = $T^R$[1…4] = gttc and S1 = T[5…11] = gggtcta. The entry $L_{1,2}$ = 4 gives j = 4 and i = j + d = 5, so T[c-j+1…c+i] = T[1…9] = cttggggtc is the maximal approximate palindrome.
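The last step, going from the $L_{d,e}$ values to the palindrome itself, can be sketched as below. The function name `palindrome_from_rows` and the dictionary layout `{(d, e): row}` are assumptions; the table passed in the usage example contains only the six values worked out on the preceding slides.

```python
def palindrome_from_rows(T, c, rows):
    """Pick the longest even approximate palindrome around center c (1-based)
    from a table {(d, e): largest row j on diagonal d with e errors}."""
    best_key, best = None, ""
    for (d, e), j in rows.items():
        i = j + d                 # characters taken to the right of the center
        key = (i + j, -e)         # larger size first, then fewer errors
        if best_key is None or key > best_key:
            best_key = key
            best = T[c - j:c + i]  # T[c-j+1…c+i] in the 1-based slide notation
    return best


rows = {(0, 0): 1, (-1, 1): 2, (0, 1): 2, (1, 1): 1, (1, 2): 4, (2, 2): 2}
print(palindrome_from_rows("cttggggtcta", 4, rows))  # cttggggtc
```

The winning entry is (d, e) = (1, 2) with row 4, which covers 4 characters to the left of the center and 5 to the right.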

45
References

[U85] E. Ukkonen, Finding approximate patterns in strings, Journal of Algorithms, Vol. 6, 1985, pp

[LV89] G. Landau and U. Vishkin, Fast parallel and serial approximate string matching, Journal of Algorithms, Vol. 10, 1989, pp
