CIS786, Lecture 5 Usman Roshan.


1 http://creativecommons.org/licenses/by-sa/2.0/

2 CIS786, Lecture 5 Usman Roshan

3 Previously
– DCM decompositions in detail
– DCM1 improved significantly over NJ
– DCM2 did not always improve over TNT (for solving MP)
– New DCM3 improved over DCM2 but not better than TNT

4 Previously

5 Disk Covering Methods (DCMs) DCMs are divide-and-conquer booster methods. They divide the dataset into small subproblems, compute subtrees using a given base method, merge the subtrees, and refine the supertree. DCMs to date:
– DCM1: for improving statistical performance of distance-based methods
– DCM2: for improving heuristic search for MP and ML
– DCM3: the latest, fastest, and best (in accuracy and optimality) DCM

6 DCM2 technique for speeding up MP searches
1. Decompose sequences into overlapping subproblems
2. Compute subtrees using a base method
3. Merge subtrees using the Strict Consensus Merge (SCM)
4. Refine to make the tree binary
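As a sketch, the four-step recipe can be written as a generic pipeline. `dcm_boost` and all four component callables are hypothetical placeholders, not the actual DCM code; in the real method the base method would be, e.g., a TNT search and the merge step the Strict Consensus Merge.

```python
def dcm_boost(sequences, decompose, base_method, scm_merge, refine):
    # 1. Decompose sequences into overlapping subproblems
    subsets = decompose(sequences)
    # 2. Compute a subtree for each subset with the base method
    subtrees = [base_method(s) for s in subsets]
    # 3. Merge the subtrees (Strict Consensus Merge in DCM2)
    supertree = scm_merge(subtrees)
    # 4. Refine the merged tree to make it binary
    return refine(supertree)

# Toy stand-ins for the four components, just to exercise the pipeline:
seqs = list("ABCDEF")
result = dcm_boost(
    seqs,
    decompose=lambda s: [s[:4], s[2:]],               # overlapping halves
    base_method=sorted,                               # placeholder "tree" builder
    scm_merge=lambda ts: sorted({x for t in ts for x in t}),
    refine=tuple,
)
```

The point of the skeleton is only the dataflow: subsets overlap, so the merge step can reconcile the subtrees on their shared taxa.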

7 DCM1(NJ)

8 Computing tree for one threshold

9 Error as a function of evolutionary rate. [Plot legend: NJ, DCM1-NJ+MP]

10 I. Comparison of DCMs (4583 sequences) Base method is the TNT-ratchet. DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets.

11 DCM2 decomposition on 500 rbcL genes (Zilla dataset). Blue: separator; red: subset 1; pink: subset 2. Visualization produced by the graphviz program, which draws the graph according to the specified distances. Nodes: species in the dataset. Distances: p-distances (Hamming) between the DNA sequences.
1. The separator is very large
2. The subsets are very large
3. Scattered subsets

12 DCM3 decomposition - example

13 Approximate centroid-edge DCM3 decomposition – example
1. Locate the centroid edge e (O(n) time)
2. Set the closest leaves around e to be the separator (O(n) time)
3. The remaining leaves in the subtrees around e form the subsets (each unioned with the separator)
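A minimal sketch of step 1, assuming the guide tree is given as an undirected adjacency map: the centroid edge is the edge whose removal splits the leaves most evenly, and one DFS over the tree finds it in O(n) time. Function and variable names here are illustrative, not taken from the DCM3 implementation.

```python
def centroid_edge(adj, root):
    """Return the edge of the tree whose removal minimizes the size
    of the heavier leaf side. adj: {node: [neighbors]}; leaves have degree 1."""
    total = sum(1 for v in adj if len(adj[v]) == 1)   # number of leaves
    best = None                                       # (heavier side size, edge)

    def leaves_below(v, parent):
        nonlocal best
        count = 1 if len(adj[v]) == 1 else 0          # v itself, if a leaf
        for w in adj[v]:
            if w == parent:
                continue
            c = leaves_below(w, v)
            # removing edge (v, w) separates c leaves from total - c
            heavier = max(c, total - c)
            if best is None or heavier < best[0]:
                best = (heavier, (v, w))
            count += c
        return count

    leaves_below(root, None)
    return best[1]

# Two internal nodes a, b with two leaves each; the centroid edge is (a, b):
adj = {'a': ['x', 'y', 'b'], 'b': ['a', 'z', 'w'],
       'x': ['a'], 'y': ['a'], 'z': ['b'], 'w': ['b']}
edge = centroid_edge(adj, 'a')
```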

14 DCM2 decomposition on 500 rbcL genes (Zilla dataset). Blue: separator; red: subset 1; pink: subset 2. Visualization produced by the graphviz program, which draws the graph according to the specified distances. Nodes: species in the dataset. Distances: p-distances (Hamming) between the DNA sequences.
1. The separator is very large
2. The subsets are very large
3. Scattered subsets

15 DCM3 decomposition on 500 rbcL genes (Zilla dataset). Blue: separator (and subset); red: subset 2; pink: subset 3; yellow: subset 4. Visualization produced by the graphviz program, which draws the graph according to the specified distances. Nodes: species in the dataset. Distances: p-distances (Hamming) between the DNA sequences.
1. The separator is small
2. The subsets are small
3. Compact subsets

16 Comparison of DCMs. Dataset: 4583 actinobacteria ssu rRNA from RDP. Base method is the TNT-ratchet. DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets. DCM3 followed by TNT-ratchet doesn't improve over TNT. Recursive-DCM3 followed by TNT-ratchet doesn't improve over TNT. [Plot: average MP score above optimal, as a percentage of the optimal (0.00-0.30), vs. hours (0-24), for TNT, DCM2, DCM3, and Rec-DCM3]

17 Local optima are a problem. [Diagram: cost over the space of phylogenetic trees, marking a global optimum and a local optimum]

18 Local optima are a problem. [Plot: average MP score above optimal, as a percentage of the optimal, vs. hours]

19 Iterated local search: escape local optima by perturbation. [Diagram: local search reaches a local optimum; a perturbation moves away from it; local search then resumes from the perturbed tree]

20 Iterated local search: Recursive-Iterative-DCM3. [Diagram: as above, with the output of Recursive-DCM3 playing the role of the perturbation, followed by local search]
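The iterated-local-search loop sketched on these slides can be outlined as follows; a toy 1-D integer problem stands in for MP/ML tree search, and all names are illustrative.

```python
import random

def iterated_local_search(start, local_search, perturb, score, iters=25, seed=0):
    """Generic ILS: local search to an optimum, then repeatedly perturb the
    best solution, re-run local search, and keep the result if it improves."""
    rng = random.Random(seed)
    best = local_search(start)
    for _ in range(iters):
        candidate = local_search(perturb(best, rng))
        if score(candidate) < score(best):     # keep the better local optimum
            best = candidate
    return best

# Toy instance: minimize (x - 7)^2 over the integers.
score = lambda x: (x - 7) ** 2

def hill_climb(x):
    # Local search: step +/-1 while it improves the score.
    while True:
        better = min((x - 1, x, x + 1), key=score)
        if better == x:
            return x
        x = better

jump = lambda x, rng: x + rng.randint(-10, 10)   # perturbation
best = iterated_local_search(50, hill_climb, jump, score)
```

In Rec-I-DCM3 the perturbation is the Recursive-DCM3 decompose/solve/merge step and the local search is the base heuristic (e.g., TNT-ratchet).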

21 Rec-I-DCM3(TNT-ratchet) improves upon unboosted TNT-ratchet. Comparison of DCMs for solving MP. [Plot: average MP score above optimal, as a percentage of the optimal (0.00-0.30), vs. hours (0-24), for TNT, DCM2, DCM3, Rec-DCM3, and Rec-I-DCM3]

22 I. Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet.

23 I. Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet.

24 I. Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet.

25 I. Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet.

26 I. Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet. Note the improvement in DCMs as we move from the default to recursion to iteration to recursion+iteration.

27 Improving upon TNT. But what happens after 24 hours? We studied boosting of TNT-ratchet; other TNT heuristics are actually better, and improving upon them may not be possible. Can we improve upon the default TNT search?

28 Improving upon TNT

29 2000 Eukaryotes rRNA

30 6722 3-domain+2-org rRNA

31 13921 Proteobacteria rRNA

32 How to run Rec-I-DCM3 then? Unanswered question: can Rec-I-DCM3 also improve upon the better TNT heuristics? It improves upon the default TNT search, but we don't know what happens with the stronger heuristics. Therefore, for a large-scale analysis, first determine the best settings of the software (e.g., TNT or PAUP*) on the dataset, and then use those in conjunction with Rec-I-DCM3 with various subset sizes.

33 Maximum likelihood

34 Four problems
– Given tree, edge lengths, and ancestral states, find the likelihood of the tree: polynomial time
– Given tree and edge lengths, find the likelihood of the tree: polynomial-time dynamic programming

35 Second case Ron Shamir’s lectures

36 Second case Ron Shamir’s lectures Exponential time summation!

37 Second case Ron Shamir’s lectures Exponential time summation! Can be solved in polytime using dynamic programming---similar to computing MP scores

38 Second case-DP

39

40 Complexity?

41 Second case-DP Complexity? For each node and each site we do k^2 work (k = number of states), so the total is mnk^2 for m sites and n nodes

42 Maximum likelihood Four problems –Given data, tree, edge lengths, and ancestral states find likelihood of tree: polynomial time –Given data, tree and edge lengths find likelihood of tree: polynomial time dynamic programming –Given data and tree, find likelihood: unknown complexity

43 Third case
1. Assign arbitrary values to all edge lengths except one, t_rv
2. Optimize the resulting function of one parameter using EM or Newton-Raphson
3. Repeat for the other edges
4. Stop when the improvement in likelihood is less than delta
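A sketch of this coordinate-wise scheme. For simplicity, a bracketed shrinking-step search stands in for EM/Newton-Raphson on each edge, and a toy concave surrogate (peaked at length 0.05) stands in for the true log-likelihood; both substitutions are this example's assumptions.

```python
def optimize_edge_lengths(lengths, loglik, delta=1e-8):
    """Coordinate ascent: optimize one edge length at a time with the others
    fixed, sweeping repeatedly until the log-likelihood gain drops below delta."""
    cur = loglik(lengths)
    while True:
        for i in range(len(lengths)):
            step = 0.1
            while step > 1e-6:
                improved = False
                for cand in (lengths[i] + step, max(1e-9, lengths[i] - step)):
                    trial = lengths[:i] + [cand] + lengths[i + 1:]
                    if loglik(trial) > loglik(lengths):
                        lengths = trial       # accept the 1-D improvement
                        improved = True
                        break
                if not improved:
                    step /= 2                 # shrink the 1-D search step
        new = loglik(lengths)
        if new - cur < delta:                 # stop: gain less than delta
            return lengths
        cur = new

# Toy surrogate "log-likelihood" maximized when every length equals 0.05:
surrogate = lambda ts: -sum((t - 0.05) ** 2 for t in ts)
result = optimize_edge_lengths([0.5, 0.2, 0.001], surrogate)
```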

44 Maximum likelihood Four problems –Given data, tree, edge lengths, and ancestral states find likelihood of tree: polynomial time –Given data, tree and edge lengths find likelihood of tree: polynomial time dynamic programming –Given data and tree, find likelihood: unknown complexity –Given data find tree with best likelihood: unknown complexity

45 ML is a very hard problem. The number of potential trees grows exponentially:

# Taxa   # Trees
5        15
10       2,027,025
15       7,905,853,580,625
50       2.84 × 10^76

This approaches the number of atoms in the universe (~10^80).
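These counts follow the double factorial (2n-5)!!, the standard count of distinct unrooted binary trees on n taxa; a few lines reproduce the first rows of the table.

```python
def num_unrooted_trees(n):
    """(2n-5)!! = 3 * 5 * ... * (2n-5): unrooted binary trees on n >= 3 taxa."""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

counts = {n: num_unrooted_trees(n) for n in (5, 10, 15)}
```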

46 Local search. A greedy ML tree followed by local search using TBR moves. Software packages: PAUP*, PHYLIP, PhyML, RAxML. We now look at RAxML in detail. Major RAxML innovations:
– Good starting trees
– Subtree rearrangements
– Lazy rescoring

47 TBR

48

49

50

51

52

53 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1

54 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1

55 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1 +1

56 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1 +1

57 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1 +1

58 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1 +1

59 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1 +2

60 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1 +2

61 Sequential RAxML Compute randomized parsimony starting tree with dnapars from PHYLIP Every run starts from a distinct point in the search space!

62 Sequential RAxML Compute randomized parsimony starting tree with dnapars from PHYLIP Apply exhaustive subtree rearrangements RAxML performs fast lazy rearrangements

63 Sequential RAxML Compute randomized parsimony starting tree with dnapars from PHYLIP Apply exhaustive subtree rearrangements Iterate while tree improves

64 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1

65 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1

66 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1 +1

67 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1 +1

68 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1 +1

69 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1 +1

70 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1 +2

71 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1 +2

72 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1 Optimize all branches

73 Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1 Need to optimize all branches ?

74 Idea 1: Lazy Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1

75 Idea 1: Lazy Subtree Rearrangements ST5 ST2 ST6 ST4 ST3 ST1

76 Why is Idea 1 useful? Lazy subtree rearrangements:
– Update fewer likelihood vectors → significantly faster
– Allow higher rearrangement settings → better trees
Likelihood depends strongly on topology, so this gives fast exploration and fast pre-scoring of a large number of topologies, with mostly straightforward parallelization. The best 20 trees from each rearrangement step are stored, and branch lengths are optimized for these 20 trees only. Experimental results justify this mechanism.

77 Idea 2:Subsequent Application of Changes ST5 ST2 ST6 ST4 ST3 ST1 ST5 ST2 ST6 ST4 ST1 ST5 ST2 ST6 ST4 ST1 ST5 ST2 ST6 ST4 ST1 ST3

78 Why is Idea 2 useful? During the initial 5-10 rearrangement steps many improved topologies are encountered, accelerating the likelihood improvement during the initial optimization phase and giving fast optimization of random starting trees. However, subsequent application of changes is hard to parallelize.

79 RAxML comparison to other programs

80 Parallel & Distributed RAxML
– Design goals: minimize communication overhead, attain good speedup
– Master-worker architecture
– 2 computational phases:
I. Computation of (# workers) parsimony trees
II. Rearrangement of subtrees at each worker
The program is non-deterministic: every run yields a distinct result, even with a fixed starting tree.

81 Parallel RAxML: Phase I Distribute alignment file & compute parsimony trees Master Process

82 Parallel RAxML: Phase I Receive parsimony trees & select best as starting tree Master Process

83 Parallel RAxML: Phase II Distribute currently best tree Master Process

84 Parallel RAxML: Phase II Workers issue work requests Master Process

85 Parallel RAxML: Phase II Distribute subtree IDs Master Process

86 Parallel RAxML: Phase II Distribute subtree IDs Master Process Only one integer must be sent to each node!
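The phase-II protocol can be sketched as a master-worker loop. Here a thread pool and a toy scoring function stand in for the MPI processes and the likelihood evaluation; `rearrange_subtree` is a hypothetical stand-in, not the actual RAxML worker code.

```python
from concurrent.futures import ThreadPoolExecutor

def rearrange_subtree(subtree_id):
    # Hypothetical worker task: the real worker would prune subtree
    # `subtree_id` from its copy of the currently best tree, lazily rescore
    # candidate reinsertion points, and report the best one back. A toy
    # score stands in for the log-likelihood here.
    return subtree_id, -1000.0 - subtree_id

# Master: the only payload per work request is one integer, the subtree ID.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(rearrange_subtree, range(8)))

# Master collects the result trees and keeps the best-scoring one.
best_id, best_score = max(results, key=lambda r: r[1])
```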

87 Parallel RAxML: Phase II ST1ST2 ST4ST3

88 Parallel RAxML: Phase II ST1ST2 ST4ST3

89 Parallel RAxML: Phase II Receive result trees and continue with best tree Master Process

90 ML trees on large alignments. Significant progress over the last 3-4 years. Many programs can now infer large trees of ≥ 500 organisms. Technical aspects are becoming an increasingly important and limiting factor.

Program             Largest tree       Limitation
parallel GAML       3000 organisms     memory
parallel IQPNNI     1500 organisms     memory
PHYML               2500 organisms     memory
MrBayes             1000 organisms     memory
parallel RAxML      10000 organisms    available resources (60 CPUs)
Rec-I-DCM3(RAxML)   7769 organisms     available resources (16 CPUs)
DPRml               417 organisms      memory
TREE-PUZZLE         257 organisms      data structures

91 Improving RAxML RAxML: fastest heuristic for constructing highly optimal maximum likelihood phylogenies on large datasets But can it be further improved? We compare Rec-I-DCM3(RAxML) against RAxML on real and simulated data

92 Real data study. Dataset: 20 real datasets ranging from 101 to 8780 sequences (DNA and rRNA). Methods studied:
– Recursive-Iterative-DCM3 (Rec-I-DCM3), with maximum subset sizes of 1/2 (below 2K sequences), 1/4 (between 2K and 6K), and 1/8 (above 6K); base and global methods: default and fast RAxML
– RAxML: HKY model, tr/tv ratio estimated
10 runs on datasets with at most 2K taxa, 5 runs on datasets with more than 2K taxa:
– RAxML: run until completion
– Rec-I-DCM3: run for the same amount of time as RAxML

93 Small datasets 250 ARB RNA 500 rbcL DNA (Zilla dataset)

94 Medium datasets 1000 ARB RNA 2025 ARB RNA

95 Large datasets 6722 RNA (Gutell) 8780 ARB RNA

96 Comparison across all datasets

Dataset size   Improvement as %   Steps improvement   Max p   Avg p
101 (SC)       -0.004%            -2.7                0.45    0.25
150 (SC)       0.007%             3.2                 0.43    0.18
150 (ARB)      0%                 0.3                 0.54    0.36
193 (Vinh)     0.06%              38.6                0.78    0.64
200 (ARB)      -0.006%            -6.5                0.54    0.35
218 (RDP)      0.014%             21                  0.42    0.26
250 (ARB)      0.014%             19                  0.55    0.34
439 (PG)       0%                 0.1                 0.65    0.27

97 Comparison across all datasets

Dataset size   Improvement as %   Steps improvement   Max p   Avg p
476 (PG)       -0.004%            -4                  0.89    0.18
500 (rbcL)     0.011%             11                  0.18    0.09
567 (PG)       0.006%             13.9                0.33    0.17
854 (PG)       0.03%              42                  0.32    0.14
921 (KJ)       0.06%              109.6               0.39    0.15
1000 (ARB)     0.031%             123                 0.55    0.35
1663 (ARB)     -0.004%            -11.7               0.48    0.2

98 Comparison across all datasets

Dataset size           Improvement as %   Steps improvement   Max p   Avg p
2025 (ARB)             -0.002%            -6                  0.56    0.36
2415 (Bininda-Emonds)  0.004%             23                  0.48    0.2
6673 (RG)              1.251%             6877                1       0.29
7769 (RG)              2.338%             13290               1       0.33
8780 (ARB)             0.03%              270                 0.55    0.23

99 Summary. Out of 20 datasets, Rec-I-DCM3(RAxML) finds better trees on 15. On datasets below 500 taxa, Rec-I-DCM3(RAxML) wins in 6 out of 9; on datasets with 500 or more taxa, it wins in 9 out of 11. But what about accuracy?

100 Simulation study. Model trees: beta-model with beta = -1 (software provided by Li-San Wang at UPenn). Model of sequence evolution: General Time Reversible (GTR) with gamma-distributed site rates and invariant sites; all parameters determined from the NJ tree on the rbcL500 (Zilla) dataset using PAUP*; seqgen used for evolving sequences. Simulation parameters:
– Model trees: 5 model phylogenies each of 1000, 2000, and 4000 taxa
– Evolutionary rates: branch lengths of each tree were scaled to yield low and moderate evolutionary rates (0.01 and 0.02)
– Sequence length: 1000
Methods: RAxML under the GTR+CAT model until completion, and one iteration of Rec-I-DCM3(RAxML with GTR+CAT)

101 1000 taxa. [ML score comparison; low rates: -914142.7 vs. -914164.8; moderate rates: -1030180.6 vs. -1030237.6]

102 2000 taxa. [ML score comparison; low rates: -1824123.2 vs. -1824145.5; moderate rates: -2069493.9 vs. -2069504.6]

103 4000 taxa. [ML score comparison; low rates: -3665486.2 vs. -3665526.5; moderate rates: -4137922 vs. -4138244]

104 Summary Rec-I-DCM3(RAxML) finds more accurate trees much faster than RAxML ML scores are also improved Improvement pronounced on large and divergent datasets

105 Parallel Rec-I-DCM3. [Diagram: as in iterated local search, the output of Recursive-DCM3 is followed by local search to escape the local optimum.]
(1) Solve the subproblems in parallel
(2) Merge the subtrees in the proper subtree order
Uses the parallel RAxML developed by Du and Stamatakis.

106 Real data study. Dataset: 6 real datasets ranging from 500 to 7769 sequences (DNA and rRNA). Max subset sizes: 100 for dataset 1, 125 for dataset 2, 500 for datasets 3-6. Methodology:
– One iteration of Rec-I-DCM3
– P-Rec-I-DCM3 run for the same amount of time as Rec-I-DCM3
– 3 runs of each method on each dataset
– Same starting tree for each method

107 P-Rec-I-DCM3 vs Rec-I-DCM3

Dataset                              Parallel LH   Sequential LH   Improvement in steps   Improvement (as a %)
500 rbcL (Zilla)                     -99945        -99967          22                     0.022%
2560 rbcL (Kallersjo)                -354944       -355088         144                    0.041%
4114 16s Actinobacteria (RDP)        -383108       -383524         416                    0.11%
6281 ssu rRNA Eukaryotes (ERNA)      -1270379      -1270785        406                    0.032%
6458 16s Firmicutes Bacteria (RDP)   -900875       -902077         1202                   0.13%
7769 rRNA 3-dom+2org (Gutell)        -540334       -541019         685                    0.13%

108 Speedup values

Dataset     Processors   Base   Global   Overall
Dataset 1   4            4      2.4      2.6
            8            4.7    2.8      3.6
            16           4.85   2.78     3.5
Dataset 2   4            3      2.68     2.7
            8            5.3    3.2      3.45
            16           7      4.2      4.6
Dataset 3   4            1.95   2.6      2.2
            8            5.5    5        5.3
            16           6.7    5.7      6.2

109 Speedup values

Dataset     Processors   Base   Global   Overall
Dataset 4   4            2.9    2.3      2.6
            8            4.2    4.9      4.6
            16           8.3    5.3      6.3
Dataset 5   4            2.3    2.7      2.5
            8            4.8    4.4      4.7
            16           7.6    5.1      5.8
Dataset 6   4            3.2    1.95     2.2
            8            4.8    2.5      3
            16           5.4    2.8      3.3

110 Parallel performance limits. Performance appears sub-optimal because of significant load imbalance caused by differing subproblem sizes. Optimal speedup = (total subproblem time) / (longest subproblem time).
Dataset 3: 19 subproblems, of which 3 require at least 5K seconds (max is 5569 seconds); optimal speedup: 37353/5569 = 6.71
Dataset 6: 43 subproblems, of which the longest takes 12164 seconds; optimal speedup: 63620/12164 = 5.23

Dataset     Processors   Base   Global   Overall
Dataset 3   4            1.9    2.6      2.2
            8            5.5    5        5.3
            16           6.7    5.7      6.2
Dataset 6   4            3.2    1.95     2.2
            8            4.8    2.5      3
            16           5.4    2.8      3.3
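The load-imbalance bound used above is just total work divided by the longest task, which reproduces the slide's numbers.

```python
def optimal_speedup(total_time, longest_subproblem):
    # No schedule can finish before the longest subproblem does,
    # so speedup is bounded by total work / longest task.
    return total_time / longest_subproblem

bound3 = optimal_speedup(37353, 5569)    # dataset 3
bound6 = optimal_speedup(63620, 12164)   # dataset 6
```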

111 Conclusions. RAxML is the fastest and most accurate ML heuristic to date, yet improvement is possible with Rec-I-DCM3 boosting. Rec-I-DCM3 and P-Rec-I-DCM3 could be used for tree-of-life reconstructions (fast and accurate). Viewed as iterated local search, divide-and-conquer works well for escaping local optima (can this be used for other combinatorial optimization problems?).

