DCJUC: A Maximum Parsimony Simulator for Constructing Phylogenetic Tree of Genomes with Unequal Contents Zhaoming Yin Bader-Polo Joint Group Meeting, Nov 11, 2013
Contribution Research Aspect -A framework for finding the maximum parsimony tree from input genomes with unequal contents. -Proved that adequate subgraph theory applies to unequal-content data, which reduces the search space. -Provides a benchmark for the HPC community. Engineering Aspect -Implemented software with many state-of-the-art features, such as the supertree method, GAS initialization, and spectral partitioning. -The software produces a tree with not only the topology but also the type and number of evolutionary events (visualization!).
Why Is the Phylogenetic Tree Problem Hard? For N genomes, there are (2N-5)!! possible unrooted binary tree topologies. For each topology, we need to compute at least one median, and the number of possible median gene orders is (g-2)!!, where g is the number of genes. Validating each candidate median is NP-hard if the gene content has duplications. So computing the MP tree for genomes with unequal contents is NP-hard over NP-hard over NP-hard!
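To give a feel for how fast the topology search space grows, here is a minimal sketch that uses the standard count of unrooted binary tree topologies on N labeled leaves, (2N-5)!!. The function names are illustrative and are not taken from DCJUC.

```python
def double_factorial(k):
    """Return k!! = k * (k-2) * (k-4) * ... down to 1 or 2 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_topologies(n_leaves):
    """Number of unrooted binary tree topologies on n_leaves labeled leaves."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * n_leaves - 5)


# Growth of the search space for a few genome counts.
for n in (4, 8, 12):
    print(n, unrooted_topologies(n))
```

For 8 genomes this already gives 10395 topologies, and for 12 genomes more than 650 million, before any median computation is counted.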
Phylogenetic Tree This picture presents the phylogeny of the “12 Drosophila.” From
Maximum Parsimony Concept Of all possible topologies, the most parsimonious tree is the one with the minimum total tree length.
Genome Rearrangement In the 1980s, Jeffrey Palmer studied the evolution of plant organelles by comparing the mitochondrial genomes of cabbage and turnip. The genes were 99% similar, yet these surprisingly identical gene sequences differed in gene order. This study helped pave the way to analyzing genome rearrangements in molecular evolution. Rearrangement types illustrated: inversion, transposition, inverted transposition.
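The three rearrangement types named on this slide can be sketched on signed gene orders, where a negative sign marks a reversed gene. This is an illustrative sketch only: the function names and the index convention (half-open segment `[i, j)` moved to just before position `k`, with `i < j <= k`) are assumptions, not DCJUC code.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the segment genome[i:j] so it sits just before original position k."""
    if not (0 <= i < j <= k <= len(genome)):
        raise ValueError("expected 0 <= i < j <= k <= len(genome)")
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Like transposition, but the moved segment is also inverted."""
    if not (0 <= i < j <= k <= len(genome)):
        raise ValueError("expected 0 <= i < j <= k <= len(genome)")
    moved = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + moved + genome[k:]


genome = [1, 2, 3, 4, 5]
print(inversion(genome, 1, 3))                  # segment 2,3 becomes -3,-2
print(transposition(genome, 1, 3, 4))           # segment 2,3 moves after 4
print(inverted_transposition(genome, 1, 3, 4))  # moved and inverted
```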
Step 2-2: How to Compute Median (LK) [Figure: sequence of search steps ending in "stop"]
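Step 2-2: How to Evaluate Median [Figure: a candidate median (med) connected to genomes 1, 2 and 3; gene orders shown: (1, 2, 3, 3, 4, 6, 5), (1, 2, 3, 4, 3, 6, 5), (1, 2, 3, 4, 6, 3, 5), (1, 2, 5, 4, 6, 3, 3)] Median score: Dis(m,1) + Dis(m,2) + Dis(m,3)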
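The scoring rule on this slide, Dis(m,1) + Dis(m,2) + Dis(m,3), can be sketched with a pluggable distance function. The `adjacency_distance` below is only a toy stand-in (it counts adjacencies that appear in one gene order but not the other) so that the example runs on gene orders with duplicates; it is not the distance DCJUC uses, and all names here are hypothetical.

```python
from collections import Counter


def adjacency_distance(a, b):
    """Toy stand-in distance: adjacencies found in one gene order but not the other."""
    def adjacencies(genome):
        # Unordered pairs of neighbouring genes, kept as a multiset.
        return Counter(tuple(sorted(pair)) for pair in zip(genome, genome[1:]))

    ca, cb = adjacencies(a), adjacencies(b)
    return sum(((ca - cb) + (cb - ca)).values())


def median_score(median, leaves, dist=adjacency_distance):
    """Sum of distances from a candidate median to each leaf genome."""
    return sum(dist(median, leaf) for leaf in leaves)


def best_median(candidates, leaves, dist=adjacency_distance):
    """Pick the candidate with the smallest median score."""
    return min(candidates, key=lambda m: median_score(m, leaves, dist))


# Gene orders taken from the slide; which one is the median is not assumed here.
leaves = [
    [1, 2, 3, 4, 3, 6, 5],
    [1, 2, 3, 4, 6, 3, 5],
    [1, 2, 5, 4, 6, 3, 3],
]
candidates = leaves + [[1, 2, 3, 3, 4, 6, 5]]
for m in candidates:
    print(m, median_score(m, leaves))
print("best:", best_median(candidates, leaves))
```

Swapping in a real distance only requires passing a different `dist` callable; the scoring and selection logic stay the same.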
Step 2-2: How to Evaluate Median Genomes: (1, 2, 3, 3, 4, 6, 5) and (1, 2, 3, 4, 3, 5). Find a mapping first (NP-hard), dis = 1. Genomes: (1, 2, 3, 3, 4, 6, 5) and (-2, -1, 3, 3, 4, 5). Complete the loss (polynomial), dis = 2. Genomes: (1, 2, 3, 4, 6, 5) and (-2, -1, 3, 4, 6, 5). Compute DCJ (polynomial), dis = 3. Result: (1, 2, 3, 4, 6, 5).
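The last, polynomial step can be illustrated with the standard DCJ distance for two genomes with equal gene content. The sketch below assumes each genome is a single circular chromosome with no duplicated genes, in which case the distance is n - c, where n is the number of genes and c is the number of cycles in the adjacency graph. The slide does not say whether the genomes are linear or circular, so that is an assumption, and the function names are illustrative.

```python
def adjacency_partners(genome):
    """Map each gene extremity to its neighbour in a circular signed genome."""
    partner = {}
    n = len(genome)
    for idx in range(n):
        a, b = genome[idx], genome[(idx + 1) % n]
        # A positive gene reads tail -> head, a negative gene head -> tail.
        right_end = (abs(a), "h") if a > 0 else (abs(a), "t")
        left_end = (abs(b), "t") if b > 0 else (abs(b), "h")
        partner[right_end] = left_end
        partner[left_end] = right_end
    return partner


def dcj_distance_circular(g1, g2):
    """DCJ distance n - c for two circular genomes over the same gene set."""
    genes1, genes2 = sorted(map(abs, g1)), sorted(map(abs, g2))
    if genes1 != genes2 or len(set(genes1)) != len(genes1):
        raise ValueError("genomes must have equal content and no duplicates")
    p1, p2 = adjacency_partners(g1), adjacency_partners(g2)
    seen, cycles = set(), 0
    for start in p1:
        if start in seen:
            continue
        cycles += 1
        x = start
        while x not in seen:
            # Alternate between an adjacency of g1 and an adjacency of g2.
            seen.add(x)
            y = p1[x]
            seen.add(y)
            x = p2[y]
    return len(g1) - cycles


# The equal-content pair from the slide: one inversion apart.
print(dcj_distance_circular([1, 2, 3, 4, 6, 5], [-2, -1, 3, 4, 6, 5]))
```

On the slide's final pair this sketch returns 1, which matches the single step added between dis = 2 and dis = 3.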
Step 3: Merge Disks Decompose the genomes into disks. Construct a tree for each disk. Merge the trees using a specific consensus method: strict, majority, etc. Disambiguation.
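The difference between strict and majority consensus can be sketched by treating each tree as a set of clades (groups of leaves). This simplified sketch assumes all input trees share the same leaf set, which is not generally true for overlapping disks in a supertree method, so it illustrates only the consensus rule and not DCJUC's merge step. The names are hypothetical.

```python
from collections import Counter


def consensus_clades(trees, method="strict"):
    """Return the clades kept by a strict or majority-rule consensus.

    Each tree is an iterable of clades, and each clade is a frozenset of leaf names.
    """
    counts = Counter(clade for tree in trees for clade in set(tree))
    n = len(trees)
    if method == "strict":
        return {clade for clade, c in counts.items() if c == n}
    if method == "majority":
        return {clade for clade, c in counts.items() if c > n / 2}
    raise ValueError("method must be 'strict' or 'majority'")


ab, bc, abc = frozenset("ab"), frozenset("bc"), frozenset("abc")
trees = [{ab, abc}, {ab, abc}, {bc, abc}]  # three trees on leaves a, b, c, d

print(consensus_clades(trees, "strict"))    # only clades present in every tree
print(consensus_clades(trees, "majority"))  # clades present in more than half
```

Strict consensus keeps fewer clades and so yields a less resolved tree; majority rule resolves more but can keep a grouping that some input trees contradict.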
Step 4: Initialization Init by insertion, which is local. Init by prospection, which is global. [Figure: tree with node labels X 12 c b e d]
Step 5: Iterative Refinement [Figure: panels a and b]
Review Step 1: Spectral partition. Step 2: Subtree construction. Step 3: Supertree merge. Step 4: Initialization of the complete tree using the General Adequate Subgraph (GAS) method. Step 5: Iterative refinement until the complete tree converges.
Result – Simulated Data [Figure: model tree grown from a seed, with #Theta + #gamma + #phi operations] We grow our own tree, so we know the total number of evolutionary events in the model tree.
Result – Accuracy % of duplication: 0.1. % of loss: 0.1. Theta is the % of inversion. There are 8 species, so 2*8-3 = 13 edges. The average accuracy is ~90%.
Result – Real Data SCRaMbLE Matrix We can represent a SCRaMbLEd strain by its vector. The sign gives the orientation. The color encodes the position in the synthetic chromosome.
Result – Real Data #inversion:#insertion/deletion:#duplication
Why Many-core BnB? There are already many distributed-memory MIP BnB frameworks (PICO, PEBBL, ALPS, COIN-OR). Load balance in distributed BnB relies heavily on ramp-up, and run-time load balancing is not efficient. But today's petaflop machines are mostly hybrid systems (distributed plus many-core or accelerators).