1 Parallel & Distributed Systems and Algorithms for Inference of Large Phylogenetic Trees with Maximum Likelihood
Alexandros Stamatakis, LRR TU München
Contact: stamatak@cs.tum.edu

2 ICS/IMBB Iraklion Alexandros Stamatakis: Phylogenetic Inference with RAxML2 Slide: 2
Outline
Motivation
Introduction to phylogenetic tree inference
Statistical inference methods
Maximum Likelihood & associated problems
Solutions:
- 2 simple heuristics
- parallel & distributed implementation
Results
Conclusion
Availability & Future Work

3 Motivation: Towards a "Tree of Life"
30,000 organisms available; current trees contain <= 1000 organisms
Where we are: [figure]

4 Motivation: Towards a "Tree of Life"
30,000 organisms available; current trees contain <= 1000 organisms
Where we want to get to: [figure]

5 Phylogenetic Tree Inference
Input: a "good" multiple alignment of a distinguished, highly conserved part of the DNA sequences
Output: an unrooted binary tree with the sequences at its leaves (every node has degree 1 or 3)
Various methods for phylogenetic tree inference exist; they differ in computational complexity and in the quality of the resulting trees
Most accurate methods: Maximum Likelihood (ML) and Bayesian phylogenetic inference:
+ most sound and flexible methods
+ other methods are not suited for large/complex trees
-- most computationally intensive methods

6 ML and Bayesian methods
T. Williams et al. (March 2003): comparative analysis with simulated data shows that MrBayes is the best program
Guindon et al. (May 2003): PHYML is a very fast & accurate ML program for real & simulated data, and faster than MrBayes
ML (PHYML, RAxML2):
+ Significantly faster than MrBayes
+ Provides reference/starting trees for Bayesian methods
-- Less powerful statistical model
Bayesian inference (MrBayes):
+ Powerful statistical model
-- MCMC convergence problem
Memory requirements for a 1000/10000-taxon alignment:
- RAxML: 200MB/750MB
- PHYML: 900MB/8.8GB
- MrBayes: 1150MB/unknown

7 MCMC Convergence Problem
[Figure]

8 What does ML compute?
Maximum Likelihood calculates:
1. Topologies
2. Branch lengths v[i]
3. Likelihood of the tree
Goal: Find the tree topology which maximizes the likelihood
Problem I: The number of possible topologies is exponential in n, the number of sequences
Problem II: Computing the likelihood value and optimizing the branch lengths is expensive
Solution: Algorithmic optimizations (previous work) + new heuristics + HPC
[Figure: example tree with sequences S1-S5 at the leaves and branch lengths v1-v7]
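To make Problem I concrete, the following minimal sketch counts the unrooted binary topologies and branches for n sequences. It uses the standard combinatorial results (2n-5)!! topologies and 2n-3 branches, which the slide does not state explicitly; the function names are hypothetical and chosen for illustration only.

```python
def num_unrooted_topologies(n):
    """Number of distinct unrooted binary tree topologies for n >= 3 leaves.

    Uses the standard result (2n-5)!! = 3 * 5 * ... * (2n-5).
    """
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for factor in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= factor
    return count


def num_branches(n):
    """Number of branches (and branch lengths v[i]) in such a tree: 2n-3."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return 2 * n - 3


# The slide's example tree has 5 sequences and 7 branch lengths v1-v7.
print(num_branches(5), num_unrooted_topologies(5))
print(num_unrooted_topologies(10))
```

For the five-sequence example this gives 7 branches, matching v1-v7, and 15 topologies; ten sequences already allow more than two million, which is why exhaustive search is ruled out for trees of the sizes discussed here.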

9 New Heuristics for RAxML
Two common methods to build a tree:
- Progressive addition of organisms, e.g. the stepwise addition algorithm
- Use a (random, simple) starting tree containing all organisms and optimize its likelihood by applying topological changes
RAxML (Randomized Axelerated Maximum Likelihood) computes a parsimony starting tree with dnapars -> fast, and yields a relatively "good" initial likelihood
dnapars uses stepwise addition -> randomize the sequence input order to obtain distinct starting trees
Optimize the starting tree by applying rearrangements
Accelerate rearrangements with two simple ideas
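The randomization step can be sketched as follows: since stepwise addition depends on the order in which sequences are inserted, shuffling that order gives different starting trees. This is only an illustration under assumptions; `randomized_input_orders` is a hypothetical helper, and the actual call to dnapars is not shown.

```python
import random


def randomized_input_orders(taxa, n_orders, seed=0):
    """Return n_orders shuffled copies of the taxon list (hypothetical helper).

    Each order would be handed to a stepwise-addition program to build
    one starting tree; a fixed seed keeps the runs reproducible.
    """
    rng = random.Random(seed)
    orders = []
    for _ in range(n_orders):
        order = list(taxa)
        rng.shuffle(order)  # in-place random permutation
        orders.append(order)
    return orders


for order in randomized_input_orders(["S1", "S2", "S3", "S4", "S5"], 3, seed=42):
    print(order)
```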

10-11 Subtree Rearrangements
[Figure: tree with subtrees ST1-ST6]

12-15 Subtree Rearrangements
[Figure: tree with subtrees ST1-ST6, labelled "+1"]

16-17 Subtree Rearrangements
[Figure: tree with subtrees ST1-ST6, labelled "+2"]

18 Subtree Rearrangements
[Figure: tree with subtrees ST1-ST6]
Optimize all branches

19 Subtree Rearrangements
[Figure: tree with subtrees ST1-ST6]
Need to optimize all branches?
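The figures are missing from this transcript, so the following is a sketch under an assumption: that the labels "+1" and "+2" denote how many nodes away from its original position a pruned subtree may be re-inserted. The tree layout, the node names u, v, w, x, and the function `insertion_branches` are all hypothetical.

```python
from collections import deque


def insertion_branches(adj, subtree_root, attach, max_dist):
    """List (branch, distance) pairs reachable within max_dist of a pruned subtree.

    adj maps each node of an unrooted binary tree to the set of its neighbours.
    The subtree rooted at subtree_root is cut off at its attachment node attach.
    """
    # After pruning, the two remaining neighbours of attach become adjacent.
    a, b = sorted(n for n in adj[attach] if n != subtree_root)
    found = []
    queue = deque([(a, b, 1), (b, a, 1)])  # (node, node we came from, distance)
    while queue:
        node, came_from, dist = queue.popleft()
        if dist > max_dist:
            continue
        for nxt in adj[node]:
            if nxt == came_from or nxt == attach:
                continue  # do not walk back or into the pruned part
            found.append((frozenset((node, nxt)), dist))
            queue.append((nxt, node, dist + 1))
    return found


# Hypothetical tree with six subtrees treated as leaves and inner nodes u, v, w, x.
edges = [("ST1", "u"), ("ST2", "u"), ("u", "v"), ("ST3", "v"), ("v", "w"),
         ("ST4", "w"), ("w", "x"), ("ST5", "x"), ("ST6", "x")]
adj = {}
for p, q in edges:
    adj.setdefault(p, set()).add(q)
    adj.setdefault(q, set()).add(p)

print(len(insertion_branches(adj, "ST3", "v", 1)))
print(len(insertion_branches(adj, "ST3", "v", 2)))
```

In this toy tree, pruning ST3 leaves four candidate branches at distance 1 and six within distance 2. Each candidate insertion yields a new topology whose likelihood must be evaluated, which is what motivates the question on the last slide of this sequence.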

20-21 Idea 1: Local Optimization of Branch Length
[Figure: tree with subtrees ST1-ST6]

22 Why is Idea 1 useful?
Local optimization of branch lengths:
- Updates fewer likelihood vectors -> significantly faster
- Allows higher rearrangement settings -> better trees
Likelihood depends strongly on topology
Fast exploration of a large number of topologies
Straightforward parallelization
Store the best 20 trees from each rearrangement step
Branch length optimization of the best 20 trees only
Experimental results justify this mechanism
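The mechanism described on this slide amounts to a two-stage filter: score every candidate topology cheaply, keep the best 20, and spend the expensive optimization only on those. A minimal sketch, assuming hypothetical callbacks `quick_score` (likelihood after local optimization) and `full_optimize` (likelihood after optimizing the tree further); neither is RAxML's actual interface.

```python
import heapq


def select_and_refine(candidates, quick_score, full_optimize, keep=20):
    """Two-stage selection over candidate topologies.

    Stage 1 ranks all candidates by the cheap score and keeps `keep` of them.
    Stage 2 runs the expensive optimization on the survivors only.
    Returns (best_score, best_tree).
    """
    shortlist = heapq.nlargest(keep, candidates, key=quick_score)
    refined = [(full_optimize(tree), tree) for tree in shortlist]
    return max(refined, key=lambda pair: pair[0])


# Toy stand-ins: "trees" are integers, and likelihood peaks at 50.
def toy_quick(tree):
    return -abs(tree - 50)


def toy_full(tree):
    return -abs(tree - 50) + 0.5  # pretend further optimization gains a little


print(select_and_refine(range(100), toy_quick, toy_full))
```

With 100 candidates the expensive function is called only 20 times. The approach relies on the slide's observation that likelihood depends strongly on topology, so the cheap ranking is assumed to be a good predictor of the final one.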

23-26 Idea 2: Subsequent Application of Topological Changes
[Figure sequence: trees with subtrees ST1-ST6]

27 Why is Idea 2 useful?
During the initial 5-10 rearrangement steps, many improved topologies are encountered
Accelerates likelihood improvement in the initial optimization phase
Enables fast optimization of random starting trees
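One way to read Idea 2 is as the difference between evaluating every rearrangement against the same tree and adopting each improvement immediately, so that later rearrangements within the same step start from the improved topology. The sketch below illustrates that reading with a toy score; `rearrangement_step`, the integer "trees", and the moves are hypothetical stand-ins, not RAxML code.

```python
def rearrangement_step(tree, moves, score, apply_immediately):
    """Run one rearrangement step and return (best_tree, best_score).

    With apply_immediately=False every move is applied to the original tree.
    With apply_immediately=True an improved tree replaces the working tree
    at once, so the remaining moves build on it.
    """
    current = tree
    best, best_score = tree, score(tree)
    for move in moves:
        candidate = move(current if apply_immediately else tree)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
            if apply_immediately:
                current = candidate
    return best, best_score


# Toy problem: a "tree" is an integer, the score peaks at 10, each move adds 1.
def toy_score(x):
    return -(x - 10) ** 2


toy_moves = [lambda x: x + 1] * 5

print(rearrangement_step(0, toy_moves, toy_score, apply_immediately=False))  # (1, -81)
print(rearrangement_step(0, toy_moves, toy_score, apply_immediately=True))   # (5, -25)
```

In the toy run a single step advances five positions instead of one. This mirrors the claim that the gain is largest early on, when improved topologies are frequent.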

28 Remainder of this Talk
Motivation
Introduction to phylogenetic tree inference
Statistical inference methods
Maximum Likelihood & associated problems
Solutions:
- 2 simple heuristics
- parallel & distributed implementation
Results
Conclusion
Availability & Future Work

29-31 Basic Parallel & Distributed Algorithm
Basic idea: Distribute work by subtrees instead of topologies (e.g. parallel fastDNAml)
Simple master-worker architecture
Subsequent application of topological changes introduces non-determinism
[Figure sequence: tree with subtrees ST1-ST6; the master issues MPI_Send(ST3_ID, tree), then MPI_Send(ST2_ID, tree)]
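The master-worker scheme can be sketched without MPI: the master hands each worker a subtree ID together with the current tree, and each worker reports the best result found for its subtree. In this minimal sketch, threads from Python's standard library stand in for the MPI processes implied by the slides, and `rearrange_subtree` is a hypothetical placeholder with a toy score.

```python
from concurrent.futures import ThreadPoolExecutor


def rearrange_subtree(subtree_id, tree):
    """Hypothetical worker task: rearrange one subtree of `tree`.

    A real worker would evaluate all rearrangements of the subtree;
    here a toy score that peaks at subtree 3 stands in for the likelihood.
    """
    score = -abs(subtree_id - 3)
    return subtree_id, score


def master(tree, subtree_ids, n_workers=4):
    """Distribute one job per subtree ID and return the best (id, score)."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Counterpart of MPI_Send(ST_ID, tree) on the slides.
        futures = [pool.submit(rearrange_subtree, sid, tree) for sid in subtree_ids]
        results = [future.result() for future in futures]
    return max(results, key=lambda result: result[1])


print(master("current-tree", range(1, 7)))
```

Collecting all results before choosing keeps this sketch deterministic. When improved trees are adopted as soon as any worker reports them, the outcome depends on arrival order, which is the non-determinism the slide mentions.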

32 Differences between Parallel & Distributed Algorithm
Parallel: a best-tree list of max(20, #workers) trees is maintained and merged at the master
Parallel: the master distributes the max(20, #workers) best trees as topology strings to the workers for branch length optimization
Distributed: each worker maintains a local best list of 20 trees
Distributed: each worker performs fast branch length optimization locally on all 20 trees -> returns only the best topology to the master
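The two policies differ mainly in where the best list lives and how much is sent back. A minimal sketch of both, assuming trees are represented as (score, topology_string) pairs; the function names and the `optimize` callback are hypothetical.

```python
import heapq


def best_list_size(n_workers):
    """Parallel version: list length max(20, #workers), so no worker idles."""
    return max(20, n_workers)


def merge_at_master(worker_lists, n_workers):
    """Parallel version: the master merges all reported (score, topology) pairs."""
    merged = [pair for worker_list in worker_lists for pair in worker_list]
    return heapq.nlargest(best_list_size(n_workers), merged, key=lambda p: p[0])


def distributed_report(local_list, optimize):
    """Distributed version: optimize the local best 20 and return one result."""
    local_best = heapq.nlargest(20, local_list, key=lambda p: p[0])
    optimized = [(optimize(score), topology) for score, topology in local_best]
    return max(optimized, key=lambda p: p[0])


print(best_list_size(8), best_list_size(32))
```

Returning a single topology per worker keeps communication small, which presumably matters more in the distributed setting than on a cluster with a fast interconnect.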

33 Sequential Results
50 distinct simulated 100-taxon alignments
- Measured average execution times & topological distance (RF-rate) from the "true" tree
- PHYML: 35.21 seconds, RF-rate: 0.0796
- MrBayes: 945.32 seconds, RF-rate: 0.0741
- RAxML: 29.27 seconds, RF-rate: 0.0818
9 distinct real alignments containing 101-1000 taxa
- Measured execution times & final likelihood values
- RAxML yields the best-known likelihood for all data sets
- RAxML faster than PHYML & MrBayes
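The RF-rate is a normalized Robinson-Foulds distance: the number of bipartitions found in one tree but not the other, divided by the maximum possible. The sketch below assumes the common normalization 2(n-3) for binary trees with n taxa, which the slide does not specify, and represents trees as nested tuples of taxon names; all function names are hypothetical.

```python
def leaves(tree):
    """Set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Non-trivial splits of the tree, each stored as the side without the
    alphabetically smallest taxon so that equal splits compare equal."""
    all_taxa = leaves(tree)
    reference = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - clade if reference in clade else clade
        if 2 <= len(side) <= len(all_taxa) - 2:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def rf_rate(tree_a, tree_b):
    """Symmetric difference of the split sets, normalized by 2(n-3)."""
    n = len(leaves(tree_a))
    differing = bipartitions(tree_a) ^ bipartitions(tree_b)
    return len(differing) / (2 * (n - 3))


true_tree = (("A", "B"), ("C", "D"), "E")
print(rf_rate(true_tree, (("A", "B"), ("C", "E"), "D")))  # 0.5
print(rf_rate(true_tree, (("A", "C"), ("B", "D"), "E")))  # 1.0
```

Under this normalization, 0 means identical topologies and 1 means no shared non-trivial split.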

34 Sequential Results: Real Data
Program columns give final ln likelihood values, each followed by the execution time.
data       PHYML      secs   MrBayes    secs    RAxML      secs   R > PHY secs  PAXML      hrs
101_SC     -74097.6   153    -77191.5   40527   -73919.3   617    31            -73975.9   47
150_SC     -44298.1   158    -52028.4   49427   -44142.6   390    33            -44146.9   164
150_ARB    -77219.7   313    -77196.7   29383   -77189.7   178    67            -77189.8   300
200_ARB    -104826.5  477    -104856.4  156419  -104742.6  272    99            -104743.3  775
250_ARB    -131560.3  787    -133238.3  158418  -131468.0  1067   249           -131469.0  1947
500_ARB    -253354.2  2235   -263217.8  366496  -252499.4  26124  493           -252588.1  7372
1000_ARB   -402215.0  16594  -459392.4  509148  -400925.3  50729  1893          -402282.1  9898
218_RDPII  -157923.1  403    -158911.6  138453  -157526.0  6774   244           n/a
500_ZILLA  -22186.8   2400   -22259.0   96557   -21033.9   29916  67            n/a


36-38 Sequential Results: Real Data
[Figures]

39 Parallel Results: Speedup 1000_ARB
[Figure]

40 Distributed Results: First Tests
Platforms:
- Infiniband cluster: 10 Intel Xeon 2.4 GHz
- Sunhalle: 50 Sun workstations for CS students
Alignments:
- 1000_ARB
- 2025_ARB
- Larger trees to come
Results:
- Program executed correctly & terminated
- RAxML@home yielded the best-known tree for 2025_ARB

41 Biological Results: 1st ML 10,000-taxon tree
Calculated 5 parsimony starting trees + 3-4 initial rearrangement steps sequentially on a Xeon 2.4 GHz
Further rearrangements of those 5 trees in parallel on 32 or 64 Xeon 2.66 GHz processors at RRZE
Accumulated CPU hours per tree: ~3200 hours
Best ln likelihood: -949539, worst: -950026
Problems:
- Quality assessment? Bootstrap not feasible
- Consense crashes for > 5 trees
- MrBayes/PHYML crash on 32-bit/4GB
- MrBayes crashed on Itanium
- Visualization?


43 Conclusion
RAxML is not able to handle protein data
RAxML is not able to perform model parameter optimization
BUT:
- RAxML is easy to parallelize/distribute
- Accurate & fast for large trees
- Significantly lower memory requirements than MrBayes/PHYML
Conclusion: Implement model parameter optimization & protein data support in RAxML

44 Availability & Future Work
Further development & distribution of RAxML@home
Big production runs with RAxML@home
Survey: ML supertrees vs. integral trees
Alignment split-up methods for ML supertrees
RAxML implementation on GPUs
RAxML2 download, benchmark, code: wwwbode.in.tum.de/~stamatak
RAxML@home development: www.sourceforge.com/projects/axml

