Advisor: Prof. R. C. T. Lee Speaker: L. C. Chen


Finding approximate palindromes in strings
Alexandre H. L. Porto and Valmir C. Barbosa, Pattern Recognition, vol. 35, pp. 2581-2591, 2002
Advisor: Prof. R. C. T. Lee Speaker: L. C. Chen

Definition
S: a string of n characters.
S[i]: the ith character in S.
S[i…j]: the substring of S whose first and last characters are S[i] and S[j].
S^R: the reverse of S.
Example: S = abcab, S^R = bacba.

Definition
An even (odd) palindrome is a string of the form S^R S (S^R a S, where a is a single character). Thus abaccaba is an even palindrome because abac is the reverse of caba.
c: the center of a palindrome S[i…j] in S; for an even palindrome, c is the last position of its left half.
Example: S = abaccaba (positions 1 to 8). S[2…7] = baccab is an even palindrome and its center is c = 4.

Edit distance
In edit distance, there are three types of differences between two strings X and Y:
Insertion: a symbol of Y is missing in X at a corresponding position.
  X: A - T
  Y: A G T
Substitution: symbols at corresponding positions are distinct.
  X: A C C
  Y: T C C
Deletion: a symbol of X is missing in Y at a corresponding position.
  X: G C A
  Y: G - A

ED(A, B) denotes the edit distance between two strings A and B, defined as the minimum number of substitutions, insertions and deletions of characters needed to transform B into A.
Example alignment:
  A = a b c a b - a
  B = c b - a b b c
Insertion: 1, Substitution: 2 and Deletion: 1.

Approximate palindromes
An approximate palindrome with error up to k: a string of the form XS (XaS) such that ED(X^R, S) ≦ k, that is, the reversed left half is within edit distance k of the right half.
An approximate palindrome is maximal if no other approximate palindrome for the same c and k exists having strictly greater size, or the same size but strictly fewer errors.

To simplify our discussion, we only discuss even approximate palindromes here.
Example: S = aabaabcd (positions 1 to 8) and k = 1. At c = 3, abaa (substitute b with a) and aabaa (delete b) are even approximate palindromes, and aabaa is a maximal approximate palindrome.

Problem
Given a string T of size n, we want to find all maximal approximate palindromes in T with up to k errors.
For each c, let T^R[1…c] denote the reverse of T[1…c]. We find the largest i’ and j’ in T[c+1…n] and T^R[1…c] respectively such that ED(T[c+1…i’], T^R[1…j’]) ≦ k.

Let S2 = T^R[1…c] and S1 = T[c+1…n], where 1≦c≦n. In the dynamic programming approach, we construct a matrix D of size (n’+1) × (m’+1), where D_{i,j} is the minimum edit distance between S1[1…i] and S2[1…j], and n’ and m’ are the lengths of S1 and S2 respectively.

Example: T = dbcaabac and k = 2. At c = 3, S2 = T^R[1…3] = cbd and S1 = T[4…8] = aabac.
[Figure: dynamic programming table D for S1 (index i = 1…5) and S2 (index j = 1…3).]
Arrows in the table: ↖ substitution or a match, ↑ deletion, ← insertion.
We can find that the maximal approximate palindrome is bcaab.
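The dynamic programming approach can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it fills the whole table D and then scans it for the cell with D ≦ k that gives the largest palindrome. The function names `edit_matrix` and `maximal_even_approx_palindrome` are made up for this sketch, and indices are 0-based as usual in Python.

```python
def edit_matrix(s1, s2):
    """D[i][j] = edit distance between s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost,  # substitution or match
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    return D


def maximal_even_approx_palindrome(T, c, k):
    """Return (palindrome, errors) for the center between T[c-1] and T[c]."""
    s1 = T[c:]          # right part, read left to right
    s2 = T[:c][::-1]    # left part, reversed
    D = edit_matrix(s1, s2)
    best_i, best_j, best_err = 0, 0, 0
    for i in range(len(s1) + 1):
        for j in range(len(s2) + 1):
            if D[i][j] > k:
                continue
            size, best_size = i + j, best_i + best_j
            # larger size wins; on equal size, fewer errors wins
            if size > best_size or (size == best_size and D[i][j] < best_err):
                best_i, best_j, best_err = i, j, D[i][j]
    return T[c - best_j:c + best_i], best_err
```

For T = dbcaabac, c = 3 and k = 2 this returns bcaab with 2 errors. Filling the full table costs time proportional to n’·m’ for every center, which is what the diagonal method in the following slides avoids.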

How can we compute the table faster? In this paper, the method in [LV89]( L.Y. Huang) was used.

We shall heavily use the concept of a diagonal. Diagonal d is defined as all of the D_{i,j}’s where d = i – j.
The diagonal property: D_{i,j} - D_{i-1,j-1} = 0 or 1 [U85]. It means that along a diagonal, the values never decrease.
[Figure: a small table with Diagonal 0 and Diagonal 2 marked.]
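The diagonal property can be checked numerically on a small table. The sketch below is only a sanity check under the usual unit-cost edit distance; the helper names `edit_matrix` and `has_diagonal_property` are chosen for this illustration.

```python
def edit_matrix(s1, s2):
    """D[i][j] = edit distance between s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost,
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    return D


def has_diagonal_property(D):
    """True if every step down a diagonal increases the value by 0 or 1."""
    return all(D[i][j] - D[i - 1][j - 1] in (0, 1)
               for i in range(1, len(D))
               for j in range(1, len(D[0])))


print(has_diagonal_property(edit_matrix("gggtcta", "gttc")))
```

Because the values on a diagonal never decrease, each diagonal can be summarized by the last row at which each value e occurs, which is the idea behind the labels introduced next.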

Consider diagonal d = 0. Let us find the largest j, if it exists, such that (i, j) is on Diagonal d (i - j = d) and D_{i,j} = 0. Let us now label all of these locations.
Example: S1 = gggtcta, S2 = gttc.
[Figure: table for S1 (i = 1…7) and S2 (j = 1…4) with Diagonal 0 marked.]

Having found the above locations (i, j) where D_{i,j} = 0, we can further find the largest j, if it exists, such that (i, j) is on Diagonal d and D_{i,j} = 1. To do this, we use the following observation: each element in Diagonal d can only influence elements in Diagonals d-1, d and d+1.

Let us consider any (i, j) location on Diagonal d. D_{i,j} can only be influenced by three neighbours:
D_{i-1,j-1} on Diagonal d (substitution),
D_{i,j-1} on Diagonal d+1 (deletion),
D_{i-1,j} on Diagonal d-1 (insertion).
Thus, we conclude that we only need to consider Diagonals d-1, d and d+1 for each D_{i,j}.

Observe the following two strings T1 and T2. Let A1 = T1[1…i], x = T1[i+1], A2 = T2[1…j] and y = T2[j+1]. If i and j are the largest i and j such that ED(T1[1…i], T2[1…j]) = k and T1[i+1] ≠ T2[j+1], then ED(A1+x, A2+y) = k+1.

Consider T1 = abcd and T2 = cbde. ED(T1[1…i], T2[1…j]) = 2. The largest such i and j are 2 and 3 respectively, and T1[i+1] ≠ T2[j+1]. Thus ED(ab+c, cbd+e) = 2+1 = 3.

Based upon the above discussion, on a diagonal d, we can find the largest i and j such that D_{i,j} = e. How can we find the largest row containing a value smaller than or equal to k? We let L_{d,e} denote the largest row j such that D_{i,j} is on Diagonal d (i - j = d) and D_{i,j} = e ≦ k.

Let L_{d,e} denote the largest row j such that D_{i,j} is on Diagonal d (i - j = d) and D_{i,j} = e ≦ k. Based upon this definition, e is the edit distance between S1[1…i] and S2[1…j], where i and j are the largest such indices on the diagonal and S2[j+1] ≠ S1[i+1].
Example: S1 = gggtcta, S2 = gttc. At d = 0: L_{0,0} = 1, L_{0,1} = 2, L_{0,2} = 3 and L_{0,3} = 4.
[Figure: table for S1 (i = 1…7) and S2 (j = 1…4) with Diagonal 0 marked.]
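The definition of L_{d,e} can be made concrete by reading the labels directly off a fully computed table. The sketch below does exactly that, so it is a reference for checking values rather than a fast method; `edit_matrix` and `label_from_table` are names invented for this illustration.

```python
def edit_matrix(s1, s2):
    """D[i][j] = edit distance between s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost,
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    return D


def label_from_table(D, d, e):
    """Largest j with i - j = d and D[i][j] == e, or None if there is none."""
    n, m = len(D) - 1, len(D[0]) - 1
    best = None
    for j in range(m + 1):
        i = j + d
        if 0 <= i <= n and D[i][j] == e:
            best = j
    return best


D = edit_matrix("gggtcta", "gttc")
print([label_from_table(D, 0, e) for e in range(4)])  # [1, 2, 3, 4]
```

For S1 = gggtcta and S2 = gttc the same helper gives 2 for d = -1, e = 1; 1 for d = 1, e = 1; 4 for d = 1, e = 2; and 2 for d = 2, e = 2.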

How can we compute the value of L_{d,e}? We define
row_{d,e} = max( L_{d,e-1} + 1, L_{d-1,e-1}, L_{d+1,e-1} + 1 ),
where the three terms correspond to substitution, insertion and deletion respectively. Then
L_{d,e} = row_{d,e} + t,
where t is the length of the longest common prefix of S1[d+row_{d,e}+1…n’] and S2[row_{d,e}+1…m’]. If t = 0, it means that S1[d+row_{d,e}+1] ≠ S2[row_{d,e}+1].

Consider D_{3,2}. L_{1,1} = 1: the largest j on d = 1 for D_{i,j} = 1 is j = 1. In this case, d = 1 and e = 2, so L_{d,e-1} = L_{1,1} = 1, L_{d-1,e-1} = L_{0,1} = 2 and L_{d+1,e-1} = L_{2,1} = 0. Thus row_{d,e} = row_{1,2} = max(L_{1,1}+1, L_{0,1}, L_{2,1}+1) = max(1+1, 2, 0+1) = max(2, 2, 1) = 2.
[Figure: table for S1 = gggtcta and S2 = gttc with Diagonals d = 0, d = 1 and d = 2 marked.]

e = 1, d = -1. S1 = gggtcta, S2 = gttc. How to compute L_{-1,1}?
row_{-1,1} = max( L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1 ) = max( L_{-1,0}+1, L_{-2,0}, L_{0,0}+1 ) = max(0+1, 0, 1+1) = max(1, 0, 2) = 2.
Since S1[d+row_{d,e}+1] = S1[-1+2+1] = S1[2] = g ≠ S2[row_{d,e}+1] = S2[2+1] = t, L_{-1,1} = row_{-1,1} + 0 = 2.

e = 2, d = 1. S1 = gggtcta, S2 = gttc. How to compute L_{1,2}?
row_{1,2} = max( L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1 ) = max( L_{1,1}+1, L_{0,1}, L_{2,1}+1 ) = max(1+1, 2, 0+1) = max(2, 2, 1) = 2.
Since the length of the longest common prefix of S1[d+row_{1,2}+1…n’] = S1[4…7] = tcta and S2[row_{1,2}+1…m’] = S2[3…4] = tc is 2, L_{1,2} = row_{1,2} + 2 = 4.
[Figure: partially filled table for S1 and S2 with Diagonal d = 1 marked.]

L_{d,e} = row_{d,e} + t, where t is the length of the longest common prefix of S1[d+row_{d,e}+1…n’] and S2[row_{d,e}+1…m’]. How can we compute t? In this paper, LCA (lowest common ancestor) is used.

Consider two substrings T1 = A1 S1 x and T2 = A2 S2 y. If ED(A1, A2) = k and S1 = S2, then ED(A1+S1, A2+S2) = k.

When we find ED(A1, A2) = k, we want to determine whether the longest common prefix S of B1 and B2 exists, where B1 and B2 are the parts of the two strings that follow A1 and A2. This paper will use LCA (lowest common ancestor) to find S.

To find such S, if it exists, we may concatenate S1 and S2 into a new string. Obviously, suffixes S1’ and S2’ of the new string have a common prefix S.

S1 = gggtcta, S2 = gttc. Let us concatenate S1 and S2 into a new string: gggtctagttc. Consider D_{3,2}. The substring after ggg is tctagttc = S1’. The substring after gt is tc = S2’. Note that S2’ and S1’ have a common prefix of length 2. Thus we have D_{3,2} = D_{4,3} = D_{5,4} = 2.
[Figure: table for S1 and S2 with Diagonal d = 1 marked.]

S1 = gggtcta, S2 = gttc. Let us concatenate S1 and S2 into a new string: gggtctagttc. Then we construct its suffix tree. The substring after ggg is tctagttc = S1’. The substring after gt is tc = S2’. Note that S2’ and S1’ have a lowest common ancestor tc of length 2.
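The quantity obtained from the suffix tree is the longest common prefix of two suffixes of the concatenation S1 + S2. The sketch below computes the same number by direct character comparison. It is a stand-in for the constant-time LCA query, not the suffix tree method itself. The name `lcp_in_concatenation` and the 0-based offsets are assumptions of this illustration.

```python
def lcp_in_concatenation(s1, s2, i, j):
    """Length of the longest common prefix of s1[i:] and s2[j:] (0-based),
    read off the concatenated string s1 + s2."""
    text = s1 + s2
    a = i                # start of the suffix that begins inside s1
    b = len(s1) + j      # start of the suffix that begins inside s2
    t = 0
    # The first suffix is cut at the end of s1 so that a match never
    # runs over the boundary between the two strings.
    while a + t < len(s1) and b + t < len(text) and text[a + t] == text[b + t]:
        t += 1
    return t


# Suffix after "ggg" in S1 against suffix after "gt" in S2.
print(lcp_in_concatenation("gggtcta", "gttc", 3, 2))  # 2
```

A direct scan costs time proportional to the length of the match, whereas a suffix tree with LCA preprocessing answers each such query in constant time.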

Algorithm
Initialization:
  for all d, 1≦d≦k+1, d>e: L_{d,e} = -1.
  for all d, -(k+1)≦d≦-1: L_{d,|d|-1} = -1, L_{d,|d|-2} = |d|-2.
  for all e, -1≦e≦k: L_{n’+1,e} = -1.
  Find L_{0,0} = the length of the longest common prefix of S1 and S2.
For e = 1 to k do
  For d = -e to e do
    row_{d,e} = max( L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1 )
    row_{d,e} = min( row_{d,e}, m’ )
    if row_{d,e} < m’ and row_{d,e}+d < n’ then
      find t = the length of the longest common prefix of S1[d+row_{d,e}+1…n’] and S2[row_{d,e}+1…m’]
      row_{d,e} = row_{d,e} + t
    L_{d,e} = row_{d,e}
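A runnable sketch of the diagonal computation is given below, under several assumptions that differ from the slide: indices are 0-based, missing table entries are treated as a large negative sentinel instead of the explicit initialization, L[(d, e)] stores the furthest row on Diagonal d reachable with at most e errors, and the longest common prefix is found by direct comparison rather than by LCA queries on a suffix tree. The function names are invented for this illustration.

```python
def lcp(s1, i, s2, j):
    """Length of the longest common prefix of s1[i:] and s2[j:]."""
    t = 0
    while i + t < len(s1) and j + t < len(s2) and s1[i + t] == s2[j + t]:
        t += 1
    return t


def diagonal_table(s1, s2, k):
    """L[(d, e)]: furthest row j on diagonal d = i - j with at most e errors."""
    n, m = len(s1), len(s2)
    NEG = -(n + m + 2)              # sentinel for entries that do not exist
    L = {}
    for e in range(k + 1):
        for d in range(-e, e + 1):
            if d > n or -d > m:     # diagonal lies outside the table
                continue
            if e == 0:
                row = 0
            else:
                row = max(L.get((d, e - 1), NEG) + 1,      # substitution
                          L.get((d - 1, e - 1), NEG),      # insertion
                          L.get((d + 1, e - 1), NEG) + 1)  # deletion
                row = min(row, m, n - d)                   # stay inside the table
            row += lcp(s1, row + d, s2, row)               # slide along matches
            L[(d, e)] = row
    return L


def maximal_even_approx_palindrome(T, c, k):
    """Return (palindrome, errors) for the center between T[c-1] and T[c]."""
    s1, s2 = T[c:], T[:c][::-1]
    L = diagonal_table(s1, s2, k)
    best_i, best_j, best_err = 0, 0, 0
    for e in range(k + 1):          # ascending e: ties keep the fewest errors
        for d in range(-e, e + 1):
            if (d, e) in L:
                j = L[(d, e)]
                i = j + d
                if i + j > best_i + best_j:
                    best_i, best_j, best_err = i, j, e
    return T[c - best_j:c + best_i], best_err


print(maximal_even_approx_palindrome("cttggggtcta", 4, 2))
```

For T = cttggggtcta, c = 4 and k = 2 the call returns cttggggtc with 2 errors. Only about (2k+1)(k+1)/... table entries, one per pair (d, e) with |d| ≦ e ≦ k, are computed per center instead of the full n’ × m’ table.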

Example: T = cttggggtcta and k = 2. At c = 4, T[1…4] = cttg, S2 = T^R[1…4] = gttc and S1 = T[5…11] = gggtcta.
[Figure: empty table for S1 (i = 1…7) and S2 (j = 1…4).]

At d = 0, find the largest j such that S2[1…j] is equal to S1[1…i]; then we set the value of L_{0,0} = j. Here S2[1] = S1[1], so L_{0,0} = 1.
[Figure: table with Diagonal d = 0 marked.]

e = 1, d = -1.
row_{-1,1} = max( L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1 ) = max(1, 0, 2) = 2.
The length of the longest common prefix of ggtctagttc and tc is 0, so L_{-1,1} = 2.

The length of LCA of ggtctagttc and tc is 0.

e = 1, d = 0.
row_{0,1} = max( L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1 ) = max(2, 0, 1) = 2.
The length of the longest common prefix of gtctagttc and tc is 0, so L_{0,1} = 2.

The length of LCA of gtctagttc and tc is 0.

e = 1, d = 1.
row_{1,1} = max( L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1 ) = 1.
The length of the longest common prefix of gtctagttc and ttc is 0, so L_{1,1} = 1.
[Figure: partially filled table with Diagonal d = 1 marked.]

The length of LCA of gtctagttc and ttc is 0.

e = 2, d = 1.
row_{1,2} = max( L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1 ) = 2.

e = 2, d = 1. We find that the longest common prefix of S2’ = tc and S1’ = tctagttc is tc. Thus L_{1,2} = row_{1,2} + 2 = 2 + 2 = 4.
[Figure: partially filled table with Diagonal d = 1 marked.]

The length of LCA of tctagttc and tc is 2.

e = 2, d = 2.
row_{2,2} = max( L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1 ) = 1.
We find that the length of the longest common prefix of S2’ = ttc and S1’ = tctagttc is 1. Thus L_{2,2} = row_{2,2} + 1 = 1 + 1 = 2.

The length of LCA of ttc and tctagttc is 1.

T = cttggggtcta and k = 2. At c = 4, T[1…4] = cttg, S2 = T^R[1…4] = gttc and S1 = T[5…11] = gggtcta. cttggggtc is the maximal approximate palindrome.

References
[U85] E. Ukkonen, Finding approximate patterns in strings, Journal of Algorithms, Vol. 6, 1985, pp. 132-137.
[LV89] G. Landau and U. Vishkin, Fast parallel and serial approximate string matching, Journal of Algorithms, Vol. 10, 1989, pp. 157-169.