# School of CSE, Georgia Tech

Analysis of Real World NP-Complete Graph Problem: DCJ Median Algorithm to Find Ancestor of Genome of Three. Zhaoming Yin, School of CSE, Georgia Tech

Fundamentals Sequence alignment is a way of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. ...

Fundamentals In the 1980s, Jeffrey Palmer studied the evolution of plant organelles by comparing the mitochondrial genomes of cabbage and turnip. The genes were 99% similar, yet these surprisingly identical gene sequences differed in gene order. This study helped pave the way to analyzing genome rearrangements in molecular evolution. Rearrangement types shown on the slide: inversion, transposition, inverted transposition.
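The three rearrangement types named above can be sketched on signed gene orders. This is a minimal illustration, not taken from the slides: the function names, the index convention (half-open segments `g[i:j]`), and the example genome are all assumptions.

```python
def inversion(g, i, j):
    """Reverse the segment g[i:j] and flip the sign of every gene in it."""
    return g[:i] + [-x for x in reversed(g[i:j])] + g[j:]


def transposition(g, i, j, k):
    """Move the segment g[i:j] so that it follows g[j:k] (requires i < j <= k)."""
    return g[:i] + g[j:k] + g[i:j] + g[k:]


def inverted_transposition(g, i, j, k):
    """Like transposition, but the moved segment is also inverted."""
    moved = [-x for x in reversed(g[i:j])]
    return g[:i] + g[j:k] + moved + g[k:]


genome = [1, 2, 3, 4, 5, 6]  # hypothetical example genome
print(inversion(genome, 2, 5))               # [1, 2, -5, -4, -3, 6]
print(transposition(genome, 1, 3, 5))        # [1, 4, 5, 2, 3, 6]
print(inverted_transposition(genome, 1, 3, 5))  # [1, 4, 5, -3, -2, 6]
```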

Fundamentals A maximum parsimony phylogeny can be approached by optimizing each ancestral node of an unrooted phylogeny in terms of its three or more immediate neighbours, modern or ancestral, and iterating across the tree until the objective function converges (to a local optimum) at all nodes.

Break Point Graph [Figure: breakpoint graphs for the example gene orders 1 2, -1 2, 1 2 3 4 5 6 and 1 -5 -2 3 -6 -4. Vertices are labelled index/extremity: 0/+1, 1/-1, 2/+2, 3/-2, 4/+3, 5/-3, 6/+4, 7/-4, 8/+5, 9/-5, 10/+6, 11/-6.]
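A breakpoint graph of this kind can be built programmatically. The sketch below uses the vertex numbering from the slide (+g maps to 2(g-1), -g to 2(g-1)+1). It assumes circular genomes and a convention in which reading gene `a` then gene `b` joins extremity `a` to extremity `-b`. Both assumptions, and the function names, are illustrative rather than taken from the presentation.

```python
def vertex(extremity):
    """Map a signed extremity to its vertex index: +g -> 2(g-1), -g -> 2(g-1)+1."""
    gene = abs(extremity)
    return 2 * (gene - 1) + (0 if extremity > 0 else 1)


def matching(genome):
    """Adjacency edges of a circular signed genome as a perfect matching (dict)."""
    m = {}
    n = len(genome)
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        u, w = vertex(a), vertex(-b)  # assumed convention: a is left at `a`, b entered at `-b`
        m[u], m[w] = w, u
    return m


def cycles(a, b):
    """Count alternating cycles in the union of two perfect matchings."""
    seen, total = set(), 0
    for start in a:
        if start in seen:
            continue
        total += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = a[v]       # follow an edge of the first genome
            seen.add(w)
            v = b[w]       # then an edge of the second genome
    return total


print(matching([1, 2]))                          # {0: 3, 3: 0, 2: 1, 1: 2}
print(cycles(matching([1, 2]), matching([-1, 2])))  # 1
```

With two genes and one cycle, the two small genomes differ by a single rearrangement under the usual "genes minus cycles" count for circular genomes.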

MBG/0-Matching [Figure: multiple breakpoint graph (MBG) on the extremities +1, -1, ..., +6, -6, with a 0-matching.]

Subgraph/Decomposer [Figure: MBG of the three genomes 1 2 3 4 5 6, 1 -5 -2 3 -6 -4 and 1 3 5 -4 6 -2, with a subgraph and an H-crossing edge marked.]

Adequate Subgraph Definition: In an MBG for a set of genomes G, a connected subgraph H of size m is an adequate subgraph if $c_{\max}(H) \ge \tfrac{1}{2} m N_G$; it is strongly adequate if $c_{\max}(H) > \tfrac{1}{2} m N_G$. (m is the size of the subgraph, counted as half its number of vertices; $N_G$ is the number of genomes, which is 3 for the median-of-three problem.) Property: An adequate subgraph is simple if it does not contain another adequate subgraph. Lemma: An adequate subgraph is a decomposer.
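The inequality in the definition can be checked with integer arithmetic. The helper below is a hypothetical illustration: the function name and return labels are assumptions, and m is taken to be half the number of vertices of the subgraph.

```python
def adequacy(cmax, m, n_genomes=3):
    """Classify a subgraph from its maximum cycle count cmax and its size m."""
    # Compare 2*cmax with m*n_genomes to avoid fractions in cmax >= m*N/2.
    if 2 * cmax > m * n_genomes:
        return "strongly adequate"
    if 2 * cmax == m * n_genomes:
        return "adequate"
    return "not adequate"


# Two vertices joined by edges of two colours: the 0-edge closes 2 cycles, m = 1.
print(adequacy(cmax=2, m=1))  # strongly adequate
# A size-2 subgraph whose best 0-matching closes 3 cycles meets the bound exactly.
print(adequacy(cmax=3, m=2))  # adequate
```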

Algorithm: AS1() Here v[i] denotes the neighbour of v along the edge of colour i.

    for each v do
        if v[0]=v[1] or v[0]=v[2] or v[1]=v[2]
            these two points are an AS; the edge connecting them is the major set;
        endif
    endfor
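The AS1 test above translates almost directly into code. In this minimal sketch the MBG is stored as a list of three dictionaries, one per colour, with `mbg[c][v]` being the neighbour of `v` along colour `c`. That representation and the name `find_as1` are assumptions for illustration, not the author's data structure.

```python
def find_as1(mbg):
    """Return vertex pairs (v, w) joined by edges of at least two colours."""
    found = []
    for v in mbg[0]:
        for a, b in ((0, 1), (0, 2), (1, 2)):
            w = mbg[a][v]
            # v < w reports each pair once instead of once per endpoint
            if w == mbg[b][v] and v < w:
                found.append((v, w))  # (v, w) is the major-set edge
                break
    return found


# Hypothetical three-genome example on four vertices (two genes).
first = {0: 3, 3: 0, 1: 2, 2: 1}
second = {1: 3, 3: 1, 0: 2, 2: 0}
print(find_as1([first, second, first]))  # [(0, 3), (1, 2)]
```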

Algorithm: AS2() [Figure: example configurations (1) and (2), with edges labelled c, c1 and c2.] Here c1 and c2 denote the two colours other than c.

    for each color c do
        for each v do
            if v[c1][c]=v[c][c2] (1)
            or v[c2][c]=v[c][c1] (2)
            or v[c2][c1]=v[c][c2] (3)
            or v[c1][c2]=v[c][c1] (4)
                v, v[c], v[c1], v[c2] are AS;
                (1), major set is (v,v[c1]) and (v[c],v[c2]) or
                (2), major set is (v,v[c2]) and (v[c],v[c1]) or
                (3), major set is (v,v[c]) and (v[c1],v[c][c2]) or
                (4), major set is (v,v[c]) and (v[c2],v[c][c1])
            endif
        endfor
    endfor

Algorithm: AS2() [Figure: example configurations (1) and (2), with edges labelled c, c1 and c2.]

    for each color c do
        for each v do
            if v[c1][c]=v[c][c1] and (v[c1]=v[c][c2] || v[c1]=v[c][c2]) (1)
            or v[c1][c2]=v[c][c1] and (v[c1]=v[c][c2] || v[c1]=v[c][c2]) (2)
                v, v[c], v[c1], v[c2] are AS;
                (1), major set is (v,v[c1]) and (v[c],v[c2]) or
                (2), major set is (v,v[c2]) and (v[c],v[c1])
            endif
        endfor
    endfor

Algorithm: AS2() [Figure: example configuration with edges labelled c, c1 and c2.]

    for each color c do
        for each v do
            if v[c1][c]=v[c][c1] and v[c2][c]=v[c][c2] and v[c1]!=v[c][c2] and v[c2]!=v[c][c1]
                v, v[c], v[c1], v[c2] are AS;
                (1), major set is (v,v[c1]) and (v[c],v[c2])
            endif
        endfor
    endfor

Algorithm: AS2() In this case, there are two major sets. [Figure: example configuration with edges labelled c and c1.]

    for each color c do
        for each v do
            if v[c1][c]=v[c][c1] and type three is not found
                v, v[c], v[c1], v[c2] are AS;
                (1), major set is (v,v[c1]) and (v[c],v[c2]), and (v,v[c]) and (v[c1],v[c][c1])
            endif
        endfor
    endfor

Algorithm: AS4()--type 5-3-5
[Figure: AS4 configuration of type 5-3-5, with vertices labelled core, po0, po1, po2, po11, po22 and edge colours c0, c1, c2.]

Algorithm: Shrink() [Figure: Shrink() example on a graph with vertices numbered 1 to 11, shown over three slides.]
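The slides illustrate Shrink() only with figures. One common formulation of shrinking, given here as an assumption about what the figures depict rather than as the author's code, fixes a median edge (u, v) and then handles each colour separately. If u and v are already adjacent in that colour, a cycle is closed. Otherwise their two other neighbours are joined directly. In both cases u and v are then removed. The dictionary representation (`mbg[c][v]` is the neighbour of `v` in colour `c`) is hypothetical.

```python
def shrink(mbg, u, v):
    """Fix the median edge (u, v); return (smaller MBG, number of cycles closed)."""
    closed, smaller = 0, []
    for colour in mbg:
        m = dict(colour)              # work on a copy, leave the input intact
        x, y = m.pop(u), m.pop(v)     # neighbours of u and v in this colour
        if x == v:                    # u and v adjacent: the median edge closes a cycle
            closed += 1
        else:                         # otherwise reconnect the two dangling neighbours
            m[x], m[y] = y, x
        smaller.append(m)
    return smaller, closed


first = {0: 3, 3: 0, 1: 2, 2: 1}
second = {1: 3, 3: 1, 0: 2, 2: 0}
smaller, closed = shrink([first, second, first], 0, 3)
print(closed)   # 2
print(smaller)  # [{1: 2, 2: 1}, {1: 2, 2: 1}, {1: 2, 2: 1}]
```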

Branch and Bound Algorithm

Branch and Bound Algorithm(1)
If no branch reaches the current upper bound, decrease the bound. If no element is left in memory, load others from disk.

Branch and Bound Algorithm(2)
Take an intermediate subgraph and check whether it can be trimmed or whether it is the final solution. If there are too many elements in memory, store them on disk.

Upper Bound and Lower Bound: Upper Bound
The DCJ distance between genomes obeys the triangle inequality. So, given three genomes G1, G2, G3, the triangle inequality bounds from below the total distance from the median genome to them. Because the distance is defined through the number of cycles in the breakpoint graph, this gives an upper bound on the cycle number.
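The formulas for this slide are not in the transcript. One derivation consistent with the text is sketched below, assuming circular genomes with $n$ genes so that $d(A,B) = n - c(A,B)$, where $c(A,B)$ is the number of cycles in the breakpoint graph of $A$ and $B$. The symbols $M$ (median), $d$, $c$ and $n$ are introduced here and may differ from the slide's notation.

```latex
\begin{align}
d(M,G_i) + d(M,G_j) &\ge d(G_i,G_j), \qquad 1 \le i < j \le 3 \\
\sum_{i=1}^{3} d(M,G_i) &\ge \tfrac{1}{2}\bigl(d(G_1,G_2) + d(G_1,G_3) + d(G_2,G_3)\bigr) \\
d(A,B) &= n - c(A,B) \\
\sum_{i=1}^{3} c(M,G_i) &\le \tfrac{3n}{2} + \tfrac{1}{2}\bigl(c(G_1,G_2) + c(G_1,G_3) + c(G_2,G_3)\bigr)
\end{align}
```

The second line is the sum of the three instances of the first, divided by two. The last line follows by substituting the third into the second.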

Best First Search Best-first search is used because it keeps the search space minimal. However, it needs a lot of space to store its footprint, which makes the branch and bound algorithm I/O bound. [Figure: search tree with numbered nodes.]
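A best-first branch and bound for the median can be sketched with a priority queue ordered by upper bound. This is a toy illustration under several assumptions and not the author's implementation. The MBG is a list of three dictionaries (`mbg[c][v]` is the neighbour of `v` in colour `c`). The bound is the triangle-inequality bound, (3k + pairwise cycles) / 2 for a graph on 2k vertices. Every state stays in memory, whereas the slides describe spilling to disk. All function names are hypothetical.

```python
import heapq
from itertools import count


def cycles(a, b):
    """Count alternating cycles in the union of two perfect matchings."""
    seen, total = set(), 0
    for start in a:
        if start in seen:
            continue
        total += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = a[v]
            seen.add(w)
            v = b[w]
    return total


def shrink(mbg, u, v):
    """Fix median edge (u, v); return (smaller MBG, cycles closed)."""
    closed, smaller = 0, []
    for colour in mbg:
        m = dict(colour)
        x, y = m.pop(u), m.pop(v)
        if x == v:
            closed += 1
        else:
            m[x], m[y] = y, x
        smaller.append(m)
    return smaller, closed


def upper_bound(mbg):
    """Triangle-inequality bound on the cycles still obtainable from mbg."""
    k = len(mbg[0]) // 2
    pairwise = cycles(mbg[0], mbg[1]) + cycles(mbg[0], mbg[2]) + cycles(mbg[1], mbg[2])
    return (3 * k + pairwise) // 2


def median_cycles(mbg):
    """Maximum total number of cycles over all median matchings (best-first search)."""
    tie = count()  # unique tie-breaker so the heap never compares dictionaries
    heap = [(-upper_bound(mbg), next(tie), 0, mbg)]
    while heap:
        _, _, done, graph = heapq.heappop(heap)
        if not graph[0]:
            return done  # first complete state popped is optimal
        u = min(graph[0])
        for v in graph[0]:
            if v != u:
                smaller, closed = shrink(graph, u, v)
                total = done + closed
                heapq.heappush(heap, (-(total + upper_bound(smaller)), next(tie), total, smaller))


first = {0: 3, 3: 0, 1: 2, 2: 1}
second = {1: 3, 3: 1, 0: 2, 2: 0}
print(median_cycles([first, second, first]))  # 5
```

Because a complete state has nothing left to shrink, its priority equals its true cycle count. No state still in the queue has a higher bound, so the first complete state popped cannot be beaten. The queue can grow quickly, which is the memory pressure the slide refers to.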

Reference
[1] Andrew Wei Xu and David Sankoff, Decompositions of multiple breakpoint graphs and rapid exact solutions to the median problem. In: K.A. Crandall and J. Lagergren (Eds.), Proceedings of the Workshop on Algorithms in Bioinformatics, WABI 2008, Lecture Notes in Bioinformatics 5251, Springer.
[2] Yancopoulos, S., Attie, O., Friedberg, R.: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinform. 21, 3340-3346 (2005).
[3] Andrew Wei Xu, A Fast and Exact Algorithm for the Median of three Problem: a Graph Decomposition Approach. Journal of Computational Biology, 2009, 16(10), 1-13.