Fast Algorithms for Minimum Evolution Richard Desper, NCBI Olivier Gascuel, LIRMM.


Overview
I. Statement of the phylogeny reconstruction problem and various approaches to solving it.
II. Tree length formula as a function of average distances.
III. Greedy algorithms for tree building and tree swapping.
IV. Simulation results.
V. A few extras regarding consistency and branch lengths.

Phylogeny Reconstruction
• General problem: reconstruct the evolutionary history for a set L of extant species.
• Input: a multiple sequence alignment for L, or a matrix of estimates of pairwise evolutionary distances.
• Output: a weighted phylogeny representing the history of L and its common ancestors.

Methods
• Likelihood methods: model-based likelihood maximization.
• Parsimony methods: minimize the total number of mutations in the tree.
• Distance methods: fit the tree structure to inferred evolutionary distances. Leading methods include Felsenstein-Fitch-Margoliash weighted least-squares and Neighbor-Joining and its variants.

Felsenstein-Fitch-Margoliash Least-squares Method
• FITCH searches the space of topologies by iteratively adding leaves and by tree swapping.
• Edge weights and topology are chosen to minimize a weighted sum of the squared differences $(D_{ij} - D^T_{ij})^2$, with weights determined by $s_{ij}$ ($D$ is the input metric, $D^T$ is the induced tree metric). If $s_{ij} = 1$ for all $i$ and $j$, the criterion reduces to $\sum_{i<j} (D_{ij} - D^T_{ij})^2$ and this is called the ordinary least-squares (OLS) method.

Minimum Evolution
• Developed by Rzhetsky and Nei (1992) as a modification of the OLS method.
• For each topology T:
  - define the function l assigning OLS lengths to the edges of T;
  - define the size of the tree as $l(T) = \sum_{e \in T} l(e)$.
• Choose the T minimizing l(T).

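Recursive Definition of $\Delta_{A|B}$
If A = {a} and B = {b}, $\Delta_{A|B} = D_{ab}$.
For $B = B_1 \cup B_2$, $\Delta_{A|B} = \frac{|B_1|}{|B|}\Delta_{A|B_1} + \frac{|B_2|}{|B|}\Delta_{A|B_2}$.
The average distances for all pairs of non-intersecting subtrees of a given topology can be calculated in O(n^2) time.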
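The recursion can be checked on a toy example. The sketch below is a minimal illustration, not the O(n^2) all-pairs procedure from the slides: it assumes the size-weighted form of the recursion, represents subtrees as nested pairs of leaf labels, and uses a hypothetical symmetric distance table `D`. The function names are made up for this example.

```python
def size(t):
    """Number of leaves in a subtree: a leaf label or a (left, right) pair."""
    return size(t[0]) + size(t[1]) if isinstance(t, tuple) else 1

def avg(A, B, D):
    """Ordinary (size-weighted) average distance between disjoint subtrees."""
    if isinstance(B, tuple):
        # B = B1 u B2: weight each half by its number of leaves.
        B1, B2 = B
        n1, n2 = size(B1), size(B2)
        return (n1 * avg(A, B1, D) + n2 * avg(A, B2, D)) / (n1 + n2)
    if isinstance(A, tuple):
        # B is a leaf but A is not: swap roles and recurse on A.
        return avg(B, A, D)
    return D[A][B]  # base case: two leaves

# Hypothetical input distances on four taxa, stored symmetrically.
pairs = {("a", "b"): 3.0, ("a", "c"): 5.0, ("a", "d"): 7.0,
         ("b", "c"): 4.0, ("b", "d"): 6.0, ("c", "d"): 2.0}
D = {}
for (x, y), v in pairs.items():
    D.setdefault(x, {})[y] = v
    D.setdefault(y, {})[x] = v

# Average distance from leaf a to the subtree (b, (c, d)).
value = avg("a", ("b", ("c", "d")), D)
```

With size weighting, the result is the plain mean of the leaf-to-leaf distances between the two subtrees, whatever the internal shape of each subtree.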

External OLS Edge Length Function
If e is the edge connecting the leaf i to the subtrees A and B, then $l(e) = \frac{1}{2}(\Delta_{A|i} + \Delta_{B|i} - \Delta_{A|B})$.

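Internal OLS Edge Length Function
If e is the internal edge separating subtrees A and B from subtrees C and D, the length of e is (Vach, 1988)
$l(e) = \frac{1}{2}\left[\lambda(\Delta_{A|C} + \Delta_{B|D}) + (1-\lambda)(\Delta_{A|D} + \Delta_{B|C}) - (\Delta_{A|B} + \Delta_{C|D})\right]$,
where $\lambda = \frac{|A||D| + |B||C|}{(|A|+|B|)(|C|+|D|)}$.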
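A quick sanity check of the two edge length functions: on distances that fit a tree exactly, they should return the true edge lengths. The sketch below assumes the usual OLS forms, l(e) = ½(Δ_{A|i} + Δ_{B|i} − Δ_{A|B}) for an external edge and the λ-weighted expression for an internal edge separating A, B from C, D. The quartet, its edge lengths, and the function names are hypothetical.

```python
def external_length(d_Ai, d_Bi, d_AB):
    """OLS length of the edge joining leaf i to subtrees A and B."""
    return 0.5 * (d_Ai + d_Bi - d_AB)

def ols_lambda(nA, nB, nC, nD):
    """Weight lambda computed from the subtree sizes |A|, |B|, |C|, |D|."""
    return (nA * nD + nB * nC) / ((nA + nB) * (nC + nD))

def internal_length(d, lam):
    """OLS length of the internal edge separating A, B from C, D."""
    return 0.5 * (lam * (d["AC"] + d["BD"])
                  + (1.0 - lam) * (d["AD"] + d["BC"])
                  - (d["AB"] + d["CD"]))

# Additive quartet ((a,b),(c,d)): leaf edges 1, 2, 3, 4 and internal edge 5.
d = {"AB": 3.0, "AC": 9.0, "AD": 10.0, "BC": 10.0, "BD": 11.0, "CD": 7.0}

lam = ols_lambda(1, 1, 1, 1)          # all four subtrees are single leaves
inner = internal_length(d, lam)       # recovers the internal edge length

# Leaf a hangs between subtree {b} and subtree {c, d}.
d_a_cd = (d["AC"] + d["AD"]) / 2      # average distance from a to {c, d}
d_b_cd = (d["BC"] + d["BD"]) / 2      # average distance from b to {c, d}
leaf_a = external_length(d["AB"], d_a_cd, d_b_cd)
```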

Tree length formula Lemma : with T as to the right, let denote the root of subtree X, and the edge to X for Then, C e A D B

Tree Length Formula With T as in prior slide, Using lemma and branch length formula for l(e),

General approach
• To search the space of topologies, we keep two data structures in memory:
  - the size of each subtree of the given topology;
  - the matrix of average distances $\Delta_{X|Y}$ for X, Y disjoint subtrees in the given topology.
• As we move from one topology to another, we update the matrix efficiently, changing only the entries that need it.

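Tree Swapping by NNI
NNI (nearest neighbor interchange) swapping is a basic step in topology building and searching.
[Figure: before the swap, edge e separates subtrees A and B from C and D; after the swap, B and C have exchanged places.]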

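Tree Length after NNI
Given T → T', the tree swap in the prior slide (B and C exchanged), and l the edge length function:
$l(T') = l(T) + \frac{1}{2}\left[(1-\lambda)(\Delta_{A|C} + \Delta_{B|D}) - (1-\lambda')(\Delta_{A|B} + \Delta_{C|D}) + (\lambda - \lambda')(\Delta_{A|D} + \Delta_{B|C})\right]$   (1)
where λ and λ' are constants depending on the topologies.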
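To see the sign convention at work, the change in tree length can be evaluated on a four-leaf example. The sketch below is an assumption-laden illustration: it takes the length change to be ½[(1 − λ)(Δ_{A|C} + Δ_{B|D}) − (1 − λ')(Δ_{A|B} + Δ_{C|D}) + (λ − λ')(Δ_{A|D} + Δ_{B|C})], and it takes λ' to be λ with the roles of B and C exchanged. The distances and function names are hypothetical.

```python
def lam_before(nA, nB, nC, nD):
    """lambda for the topology AB|CD, from the subtree sizes."""
    return (nA * nD + nB * nC) / ((nA + nB) * (nC + nD))

def lam_after(nA, nB, nC, nD):
    """lambda' for the topology AC|BD (assumed: B and C exchanged)."""
    return (nA * nD + nB * nC) / ((nA + nC) * (nB + nD))

def nni_length_change(d, lam, lam_prime):
    """l(T') - l(T) for the swap of B and C; negative means T' is shorter."""
    return 0.5 * ((1.0 - lam) * (d["AC"] + d["BD"])
                  - (1.0 - lam_prime) * (d["AB"] + d["CD"])
                  + (lam - lam_prime) * (d["AD"] + d["BC"]))

sizes = (1, 1, 1, 1)
lam, lam_prime = lam_before(*sizes), lam_after(*sizes)

# Distances that fit AB|CD exactly (leaf edges 1, 2, 3, 4; internal edge 5).
fits_T = {"AB": 3.0, "AC": 9.0, "AD": 10.0, "BC": 10.0, "BD": 11.0, "CD": 7.0}
# Distances that fit AC|BD instead, so the swap should help.
fits_T_prime = {"AB": 9.0, "AC": 3.0, "AD": 10.0, "BC": 10.0, "BD": 7.0, "CD": 11.0}

worse = nni_length_change(fits_T, lam, lam_prime)         # swap lengthens the tree
better = nni_length_change(fits_T_prime, lam, lam_prime)  # swap shortens the tree
```

Only the sign of the change matters when selecting a swap, so the terms of l(T) that do not depend on how A, B, C, D are arranged never need to be computed.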

OLS: FASTNNI
1. Pre-compute the average distances between non-intersecting subtrees (O(n^2) computations).
2. Loop over all internal edges and select the best swap using Equation (1) (O(n)).
3. If no swap improves the length of the tree, stop and return the tree; otherwise perform the best swap, update the matrix of average distances, and repeat Step 2 (O(n) per swap; there is only one new split).
Thus, if we require p swaps, the total complexity of FASTNNI is O(n^2 + pn).

Balanced Minimum Evolution
• Gascuel (2000) observed that the OLS/ME method was weaker than NJ in approximating the correct topology.
• To simplify tree length computation, Pauplin (2000) proposed a “balanced” version of Minimum Evolution, which weights each subtree equally when calculating averages: if A and B are subtrees of T, with $B = B_1 \cup B_2$, then $\Delta_{A|B} = \frac{1}{2}(\Delta_{A|B_1} + \Delta_{A|B_2})$.
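The effect of equal weighting is easiest to see next to the ordinary average. The sketch below assumes the balanced recursion Δ_{A|B} = ½(Δ_{A|B1} + Δ_{A|B2}) and uses hypothetical distances; subtrees are nested pairs of leaf labels and the function name is made up for this example.

```python
def balanced_avg(A, B, D):
    """Balanced average distance: each half of a subtree gets weight 1/2."""
    if isinstance(B, tuple):
        return 0.5 * (balanced_avg(A, B[0], D) + balanced_avg(A, B[1], D))
    if isinstance(A, tuple):
        # B is a leaf but A is not: swap roles and recurse on A.
        return balanced_avg(B, A, D)
    return D[A][B]  # base case: two leaves

# Hypothetical symmetric input distances on four taxa.
pairs = {("a", "b"): 3.0, ("a", "c"): 5.0, ("a", "d"): 7.0,
         ("b", "c"): 4.0, ("b", "d"): 6.0, ("c", "d"): 2.0}
D = {}
for (x, y), v in pairs.items():
    D.setdefault(x, {})[y] = v
    D.setdefault(y, {})[x] = v

balanced = balanced_avg("a", ("b", ("c", "d")), D)
ordinary = (D["a"]["b"] + D["a"]["c"] + D["a"]["d"]) / 3
```

Here leaf b carries weight ½ while c and d carry ¼ each, so the balanced value depends on the shape of the subtree and differs from the ordinary mean over the same three leaves.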

BNNI
1. Calculate the balanced averages of all pairs of subtrees (O(n^2)).
2. Calculate the improvement for each swap using Equation (2).
3. If no tree swap improves the length of the tree, stop and return the tree; otherwise update the matrix of average distances and repeat Step 2 (O(n diam(T)) per swap).
The average complexity, when performing p swaps, is O(n^2 + pn diam(T)).

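Updating Subtree Averages
[Figure: tree T with edge e and subtrees A, B, C, D; X and Y are subtrees rooted at x and y.]
Q: How many recalculations? If we perform the B-C tree swap, then we must recalculate the averages $\Delta_{X|Y}$ for the affected pairs of subtrees. (Hint: you can count (x, y) pairs.)
A: O(n diam(T)).
Typical expected values for diam(T): O(log n) under the Yule-Harding distribution, O(√n) under the uniform distribution.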

Building trees from scratch We have NNI algorithms for OLS and balanced branch lengths. But what if we have no initial topology for NNIs?

OLS: Greedy Minimum Evolution
1. Start with the three-taxon tree T_3.
2. For k = 4 to n:
   a) Calculate $\Delta_{k|A}$ for each subtree A in T_{k-1}.
   b) Express the cost of inserting k along edge e as f(e). (Use Equation (3) on the next slide.)
   c) Choose the e minimizing f. Insert k along e to form T_k.
   d) Update the matrix of average distances between every pair of 2-distant subtrees.
GME runs in O(n^2) time.

Greedy Minimum Evolution
[Figure: topologies T and T', each with subtrees A, B, C and the inserted leaf k.]
Let L = l(T). We use a variant of Equation (1), where D = {k}.

Balanced Minimum Evolution
Same as GME, except for the following modifications to Step 2:
a) Calculate balanced average distances instead of ordinary average distances.
b) Use λ = ½ to find the weights for insertion points.
d) Average distances must be kept for all pairs of subtrees.
BME runs in O(n^2 diam(T)) time.

Simulations
• Created 24- and 96-taxon trees, 2000 of each size, with a Yule-Harding process (⇒ molecular clock).
• Edge lengths multiplied by (1.0 + μX), where X is exponentially distributed.
• Generated trees with three rates of evolution.
• SeqGen used to generate sequences for each tree and rate (12,000 in all).
• DNADIST used to calculate distance matrices.

Results: topological distances
• BNNI improved all input trees.
• This improvement is large with fast rates and high numbers of taxa.
• NNI trees are close to the best possible for BME.
• The quality of the NNI tree is (mostly) independent of the starting point.
• FASTNNI trees are comparable to NJ as n grows to 96.

Computational Times (in MM:SS)
Columns: 24 Taxa, 96 Taxa, 1000 Taxa, 4000 Taxa
GME + BNNI  :02.1
HGT/FP  :33.1
NJ/BIONJ  :55.9
WEIGHBOR
FITCH
Computations done on Sun Enterprise E4500/E5500 running Solaris 8 on MHz processors with 7 GB memory.

Average number of NNIs
We see that the average number of NNIs is considerably lower than the number of taxa.
Columns: 24 Taxa, 96 Taxa, 1000 Taxa, 4000 Taxa
Rows: GME + FASTNNI, GME + BNNI, BME + BNNI

BME = WLS
Why does the balanced approach work so well?
• Pauplin’s formula for the length of a tree is $l(T) = \sum_{i<j} 2^{1 - p_T(i,j)} D_{ij}$.
• BME is a weighted least squares approach with variances proportional to $2^{p_T(i,j)}$, where $p_T(i,j)$ is the length (number of edges) of the (i, j) path in T.
Distantly related taxa see their importance decrease exponentially.
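Pauplin's formula can be tried directly on a small tree. The sketch below is a minimal illustration that assumes the formula in the form l(T) = Σ_{i<j} 2^{1 − p_T(i,j)} D_ij, with p_T(i,j) counted in edges. The adjacency-list representation, the toy distances, and the function names are hypothetical.

```python
from collections import deque

def path_edge_counts(adj, leaves):
    """Number of edges on the path between every pair of leaves (BFS per leaf)."""
    counts = {}
    for src in leaves:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for dst in leaves:
            if src < dst:
                counts[(src, dst)] = dist[dst]
    return counts

def pauplin_length(adj, leaves, D):
    """Balanced tree length: each D_ij is weighted by 2^(1 - p_T(i, j))."""
    p = path_edge_counts(adj, leaves)
    return sum(2.0 ** (1 - p[pair]) * D[pair] for pair in p)

leaves = ["a", "b", "c", "d"]
# Distances that fit ((a,b),(c,d)) with leaf edges 1, 2, 3, 4 and internal edge 5.
D = {("a", "b"): 3.0, ("a", "c"): 9.0, ("a", "d"): 10.0,
     ("b", "c"): 10.0, ("b", "d"): 11.0, ("c", "d"): 7.0}

true_topology = {"a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
                 "u": ["a", "b", "v"], "v": ["c", "d", "u"]}
wrong_topology = {"a": ["u"], "c": ["u"], "b": ["v"], "d": ["v"],
                  "u": ["a", "c", "v"], "v": ["b", "d", "u"]}

length_true = pauplin_length(true_topology, leaves, D)
length_wrong = pauplin_length(wrong_topology, leaves, D)
```

On these tree-like distances the formula returns the sum of the true edge lengths for the generating topology and a larger value for the alternative, which is the behavior minimum evolution relies on.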

Bonus features
• BME is a consistent method: as observed distances converge to true distances, the true topology becomes the minimum evolution tree.
• The BNNI tree has no negative branch lengths: a negative value of the branch length function implies that an NNI leads to a smaller tree.

Consistency of Balanced ME
• Theorem: Suppose S is a weighted tree and $\mathcal{T}$ is a tree topology incompatible with S. Let T be the tree of topology $\mathcal{T}$ with weights determined by the balanced scheme. Then l(T) > l(S).
• Lemma: it suffices to prove the case when S is a split metric.

Balanced ME consistency
• Basic idea: let l be the tree length function on the space of topologies. We find a sequence of topologies T = T_0, T_1, ..., T_k = S such that:
  - each T_{i+1} can be reached from T_i via one of two simple topological transformations;
  - l(T_i) > l(T_{i+1}) for all i.
• Proof structure modeled after the OLS/ME proof (Rzhetsky and Nei, 1993).

Type I transformation
[Figure: subtrees A, B, C, D before and after the NNI.]
Color the leaves black or white according to the split metric S.
A Type I transformation uses an NNI to form a larger monochromatic cluster.
This transformation reduces the size of the tree under l.

Type II transformation
[Figure: subtrees A1, A2, B1, B2, C before and after the transformation.]
A Type II transformation uses two NNIs to form two monochromatic subtrees.
This transformation also reduces the size of the tree under l.

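Positive Branch Lengths after BNNI
Recall that, with e separating subtrees A and B from C and D, the balanced length of e (λ = ½) is described by
$l(e) = \frac{1}{4}(\Delta_{A|C} + \Delta_{B|D} + \Delta_{A|D} + \Delta_{B|C}) - \frac{1}{2}(\Delta_{A|B} + \Delta_{C|D})$.
We do not perform the B-C switch because it would not shorten the tree, i.e. $\Delta_{A|C} + \Delta_{B|D} \ge \Delta_{A|B} + \Delta_{C|D}$.
Similarly, for the other switch across e, $\Delta_{A|D} + \Delta_{B|C} \ge \Delta_{A|B} + \Delta_{C|D}$.
Thus $l(e) \ge 0$.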
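The argument can be replayed numerically. The sketch below assumes the balanced edge length with λ = ½ and assumes that each of the two NNIs across e changes the tree length by ¼ of the corresponding difference of pair sums. Under those assumptions the edge length is exactly the sum of the two NNI costs, so it cannot be negative when neither swap is an improvement. The distances and function names are hypothetical.

```python
def balanced_internal_length(d):
    """Balanced length (lambda = 1/2) of the edge separating A, B from C, D."""
    return (0.25 * (d["AC"] + d["BD"] + d["AD"] + d["BC"])
            - 0.5 * (d["AB"] + d["CD"]))

def nni_costs(d):
    """Increase in balanced tree length for each of the two NNIs across e."""
    within = d["AB"] + d["CD"]
    swap_b_c = 0.25 * ((d["AC"] + d["BD"]) - within)   # exchange B and C
    swap_b_d = 0.25 * ((d["AD"] + d["BC"]) - within)   # exchange B and D
    return swap_b_c, swap_b_d

# Distances fitting AB|CD: neither swap helps, and the edge is positive.
good = {"AB": 3.0, "AC": 9.0, "AD": 10.0, "BC": 10.0, "BD": 11.0, "CD": 7.0}
# Distances fitting AC|BD: the edge is negative and the B-C swap helps.
bad = {"AB": 9.0, "AC": 3.0, "AD": 10.0, "BC": 10.0, "BD": 7.0, "CD": 11.0}

good_length, good_costs = balanced_internal_length(good), nni_costs(good)
bad_length, bad_costs = balanced_internal_length(bad), nni_costs(bad)
```

In the second case the negative edge length coincides with a negative cost for the B-C swap, which is why a tree returned by BNNI carries no negative internal edges.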

Conclusions
• BME + BNNI runs in O((n^2 + pn) diam(T)) and outputs trees comparable to (better than) FITCH, Weighbor, BioNJ, or NJ.
• FastME is faster than NJ or its variants.
• BNNI consistently improved output trees in all settings, even when WLS/Fitch trees were input.
• BNNI outputs trees without negative branch lengths.
• FASTME software available at or