DCJUC: A Maximum Parsimony Simulator for Constructing Phylogenetic Tree of Genomes with Unequal Contents Zhaoming Yin Bader-Polo Joint Group Meeting, Nov 11, 2013
Contribution Research Aspect -A framework for solving the maximum parsimony tree problem when the input genomes have unequal contents. -Proved that adequate subgraph theory is applicable to unequal-content data, which reduces the search space. -Provides a benchmark for the HPC community. Engineering Aspect -Implemented software with many state-of-the-art features, such as the supertree method, the GAS initialization method, spectral partitioning, etc. -The software can produce a tree showing not only the topology but also the type and number of the different evolutionary events (visualization!).
Why Is the Phylogenetic Tree Problem Hard? For N genomes, there are (2N-5)!! possible unrooted binary tree topologies. For each topology, we need to compute at least one median; the number of possible median orders is (g-2)!!, where g is the number of genes. Validating each possible median is NP-hard if the gene content has duplications. So the complexity of computing the MP tree for genomes with unequal contents is: NP-hard over NP-hard over NP-hard!
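To give a feel for how fast the topology count grows, here is a minimal sketch that evaluates the standard closed form (2N-5)!! for the number of unrooted binary trees on N labeled leaves. The function names are illustrative and are not taken from the DCJUC software.

```python
def double_factorial(k):
    """Return k!! = k * (k-2) * (k-4) * ...; defined as 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_topologies(n):
    """Number of unrooted binary tree topologies on n labeled leaves (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * n - 5)


if __name__ == "__main__":
    for n in (4, 8, 12):
        print(n, num_unrooted_topologies(n))
```

Even before any median is computed, exhaustive enumeration of topologies is already out of reach for a few dozen genomes.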
Phylogenetic Tree This picture presents the phylogeny of the “12 Drosophila.” From http://insects.eugenes.org/species
Maximum Parsimony Concept [Figure: example tree topologies with numbered nodes] Of all possible topologies, the maximum parsimony tree is the one that has the minimum total tree length.
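The selection rule can be sketched in a few lines: score each candidate topology by the sum of its edge lengths and keep the smallest. The topology names and edge lengths below are made-up illustration values, not results from the talk, and the sketch assumes the edge lengths (event counts) have already been computed for each topology.

```python
def tree_length(edge_lengths):
    """Total tree length: the sum of the evolutionary distances on all edges."""
    return sum(edge_lengths)


def most_parsimonious(candidates):
    """Return the name of the candidate topology with minimum total length."""
    return min(candidates, key=lambda name: tree_length(candidates[name]))


# Hypothetical example: the three unrooted topologies on four genomes,
# each with 2*4-3 = 5 edges and invented edge lengths.
candidates = {
    "topology_A": [3, 1, 2, 4, 2],
    "topology_B": [2, 1, 2, 3, 1],
    "topology_C": [4, 2, 2, 3, 2],
}

best = most_parsimonious(candidates)
print(best, tree_length(candidates[best]))
```

In practice the expensive part is obtaining the edge lengths, since that requires solving the median problems at the internal nodes.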
Review Step 1: Spectral partition Step 2: Subtree construction Step 3: Supertree merge Step 4: Initialization of the complete tree using the General Adequate Subgraph (GAS) method. Step 5: Iterative refinement until the complete tree converges.
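As an illustration of Step 1 only, the sketch below bisects a small graph using the sign of the Fiedler vector (the eigenvector of the graph Laplacian with the second-smallest eigenvalue), computed by shifted power iteration. This is the textbook form of spectral bisection; how DCJUC builds its graph and partitions it is not stated on the slides, so the similarity graph over genomes used here is an assumption, and the function name is hypothetical.

```python
def fiedler_partition(adj, iters=500):
    """Split nodes in two by the sign of an approximate Fiedler vector.

    adj is a symmetric 0/1 (or weighted) adjacency matrix as a list of lists.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    # Eigenvalues of L = D - A lie in [0, 2*max(deg)], so c*I - L is positive
    # definite and its dominant non-constant eigenvector is the Fiedler vector.
    c = 2.0 * max(deg) + 1.0
    # Deterministic, non-constant start vector with zero mean.
    v = [i - (n - 1) / 2.0 for i in range(n)]
    for _ in range(iters):
        # w = (c*I - L) v
        w = [(c - deg[i]) * v[i] + sum(adj[i][j] * v[j] for j in range(n))
             for i in range(n)]
        # Remove the constant component (eigenvalue 0 of L), then normalize.
        mean = sum(w) / n
        w = [x - mean for x in w]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    left = [i for i in range(n) if v[i] < 0]
    right = [i for i in range(n) if v[i] >= 0]
    return left, right


# Hypothetical similarity graph: two triangles joined by a single edge (2-3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1

parts = fiedler_partition(adj)
print(parts)
```

Each part would then be handed to subtree construction (Step 2) before the supertree merge (Step 3).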
Result – Simulated Data [Figure: a seed genome evolved by #theta + #gamma + #phi operations] We grow our own model tree, so we know the total number of evolutionary events in it.
Result – Accuracy % of duplication: 0.1. % of loss: 0.1. Theta is the % of inversion. There are 8 species, so the tree has 2*8-3 = 13 edges. So the average accuracy is ~90%.
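The edge count on this slide follows from how an unrooted binary tree grows: a three-leaf star has 3 edges, and attaching each further leaf splits one existing edge in two and adds one pendant edge. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def num_edges_unrooted_binary(n):
    """Edges in an unrooted binary tree with n leaves, counted by stepwise addition."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    edges = 3  # star tree on the first three leaves
    for _ in range(n - 3):
        edges += 2  # one edge is split in two, plus one new pendant edge
    return edges


if __name__ == "__main__":
    print(num_edges_unrooted_binary(8))
```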
Result – Real Data SCRaMbLE Matrix We can represent a SCRaMbLEd strain by its vector. The sign gives the orientation. The color encodes the position in the synthetic chromosome.
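A signed vector of this kind can be sketched as a list of signed integers, where the absolute value is the segment's position in the synthetic chromosome and the sign is its orientation. The five-segment reference and the `invert` helper below are illustrative assumptions, not data or code from the talk.

```python
def invert(genome, i, j):
    """Return a copy of genome with segments i..j (inclusive) reversed and sign-flipped."""
    flipped = [-g for g in reversed(genome[i:j + 1])]
    return genome[:i] + flipped + genome[j + 1:]


def orientations(genome):
    """Read the orientation of each segment from its sign."""
    return ["+" if g > 0 else "-" for g in genome]


# Hypothetical reference chromosome with five segments, all forward.
reference = [1, 2, 3, 4, 5]

# A strain in which segments 2..4 have been inverted as one block.
strain = invert(reference, 1, 3)
print(strain, orientations(strain))
```

Applying the same inversion twice restores the reference, which is a quick sanity check on the representation.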
Result – Real Data #inversion:#insertion/deletion:#duplication
Why Many-core BnB? There are many distributed-memory MIP BnB frameworks (PICO, PEBBL, ALPS, COIN-OR). Load balance in distributed BnB relies heavily on ramp-up; run-time load balancing is not efficient. But today's petaflops machines are mostly hybrid systems (distributed + many-core or accelerators).