Clustering methods Tree building methods for distance-based trees


Clustering methods: tree building methods for distance-based trees
For a set of sequences, all possible pairwise distances are calculated using your method of choice (JC69, K2P, GTR, etc.).
This provides measures of dissimilarity that can then be clustered according to several major tree-building algorithms. Now, how do you build a tree?
  U(W)PGMA: Unweighted (Weighted) Pair Group Method with Arithmetic Means
  Minimum Evolution
  Neighbor Joining
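
As a rough illustration of the first step, the sketch below computes JC69-corrected pairwise distances from aligned sequences. The function names and the toy sequences are hypothetical, and gaps and ambiguous bases are not handled.

```python
import math
from itertools import combinations

def p_distance(seq1, seq2):
    """Proportion of aligned sites that differ (no gap handling)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)

def jc69_distance(seq1, seq2):
    """Jukes-Cantor correction: d = -3/4 * ln(1 - 4p/3)."""
    p = p_distance(seq1, seq2)
    if p >= 0.75:
        return float("inf")  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)

def pairwise_distances(seqs):
    """Map each unordered pair of names to its JC69 distance."""
    return {frozenset((a, b)): jc69_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}

# Toy alignment, made up for illustration only.
TOY = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGAACTA"}
for pair, dist in pairwise_distances(TOY).items():
    print(sorted(pair), round(dist, 4))
```

The resulting map plays the role of the distance matrix that the clustering methods take as input.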

U(W)PGMA: Unweighted (Weighted) Pair Group Method with Arithmetic Means
A single best rooted tree is built by using the calculated distances to group pairs of OTUs.
A molecular clock is assumed, so all terminal nodes are equidistant from the root.
Summary:
  1. Obtain the distance matrix.
  2. Group the two most closely related taxa.
  3. Find the mean of their distances to each of the other taxa.
  4. Treat these two as a single new OTU.
  5. Continue until you run out of sequences.


U(W)PGMA details
Start with your distance matrix:
       A  B  C  D  E
    B  2
    C  4  4
    D  6  6  6
    E  6  6  6  4
    F  8  8  8  8  8
Group the taxa with the shortest distance (A and B).
The depth of the divergence between A and B is their distance divided by 2, so A and B each get a branch of length 1.
Recalculate the distances between AB and the other taxa.

U(W)PGMA details
Recalculate the distances between AB and the other taxa:
  d(AB)C = (dAC + dBC)/2 = 4
  d(AB)D = (dAD + dBD)/2 = 6, etc.
       AB  C  D  E
    C   4
    D   6  6
    E   6  6  4
    F   8  8  8  8
Repeat with the next closest pair, D and E, which each get a branch of length 2.
C and AB are equally close, so we could just as easily group C with AB first.
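
The update step can be sketched as a small helper. The function name and its size arguments are assumptions for illustration: with both sizes left at 1 it reproduces the simple average used on the slide, while passing the cluster sizes gives the size-weighted average that UPGMA uses once clusters contain more than one taxon.

```python
def merged_distance(d_ik, d_jk, size_i=1, size_j=1):
    """Distance from a newly merged cluster (i + j) to another cluster k.

    With equal sizes this is the plain mean of the two distances;
    unequal sizes weight each distance by the number of taxa behind it.
    """
    return (size_i * d_ik + size_j * d_jk) / (size_i + size_j)

# Values from the worked example: dAB = 2, dAC = dBC = 4, dAD = dBD = 6.
depth_AB = 2 / 2                   # depth of the A/B divergence
d_AB_C = merged_distance(4, 4)     # distance from AB to C
d_AB_D = merged_distance(6, 6)     # distance from AB to D
print(depth_AB, d_AB_C, d_AB_D)
```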

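U(W)PGMA details
Recalculate the distances between DE and the other taxa:
       AB  C  DE
    C   4
    DE  6  6
    F   8  8  8
Join the next closest pair, AB and C. Their distance is 4, so the new node sits at depth 2: C gets a branch of length 2, and the branch from AB to the new node has length 1.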

U(W)PGMA details
Recalculate the distances between ABC and the other taxa:
       ABC  DE
    DE   6
    F    8   8
How long should the branch joining ABC to DE be? The distance between D (or E) and A, B or C should total 6, so the new node sits at depth 3 and the branches leading to it from ABC and from DE each have length 1.
Join DE to ABC.

U(W)PGMA details
This leaves only F to add to the tree.
The total distance between F and any other taxon should be 8, so the root sits at depth 4: F gets a branch of length 4, and the branch from ABCDE to the root has length 1.
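
The whole walkthrough can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the matrix is reconstructed from the values in the worked example (an assumption), the helper names are hypothetical, and ties are broken by taking the first closest pair, so C joins AB before D joins E. As the slides note, either order is equally valid.

```python
from itertools import combinations

def from_lower_triangle(names, rows):
    """Build a {frozenset({a, b}): distance} map from lower-triangle rows."""
    return {frozenset((names[i + 1], names[j])): float(v)
            for i, row in enumerate(rows) for j, v in enumerate(row)}

def upgma(names, dist):
    """Return the merges as (cluster_a, cluster_b, node_depth) tuples."""
    sizes = {name: 1 for name in names}   # cluster label -> number of taxa
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        # Closest pair of clusters; ties go to the first pair encountered.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        depth = d[frozenset((a, b))] / 2
        new = "".join(sorted(a + b))
        # Size-weighted mean distance from the new cluster to every other one.
        for c in sizes:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    sizes[a] * d[frozenset((a, c))]
                    + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[new] = sizes.pop(a) + sizes.pop(b)
        merges.append((a, b, depth))
    return merges

NAMES = ["A", "B", "C", "D", "E", "F"]
# Rows for B, C, D, E, F against the columns A..E (assumed reconstruction).
ROWS = [[2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]]
MERGES = upgma(NAMES, from_lower_triangle(NAMES, ROWS))
for a, b, depth in MERGES:
    print(f"join {a} + {b} at depth {depth}")
```

Branch lengths follow from the depths: each branch is the depth of its parent node minus the depth of the child node, with leaves at depth 0.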

U(W)PGMA weakness: the constant molecular clock assumption
[Figure: two trees relating taxa 1 to 5]
Note that 3 and 5 are actually less divergent from one another than 4 and 5. This can't be depicted in the tree using this method.
Non-homogeneous rates can introduce serious problems with tree reconstruction when some taxa evolve much faster than others.

U(W)PGMA is rarely used anymore
Other, newer methods avoid its main weakness: the ultrametricity required by the constant molecular clock assumption.
Newer methods do not require ultrametricity, but they still require additivity: the distance between any pair of OTUs must equal the sum of the lengths of the branches connecting them, not an average.
This allows the molecular clock to vary among taxa.
       A   B   C   D   E
    B  5
    C  4   7
    D  7  10   7
    E  6   9   6   5
    F  8  11   8   9   8
[Figure: tree for taxa A to F with unequal branch lengths]
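
The two properties can be checked directly on a distance matrix. The sketch below uses the standard three-point test for ultrametricity (in every triple, the two largest distances are equal) and the four-point test for additivity (in every quartet, the two largest pairwise sums are equal). Both matrices are reconstructions of the worked examples and the helper names are hypothetical.

```python
from itertools import combinations

def from_lower_triangle(names, rows):
    """Build a {frozenset({a, b}): distance} map from lower-triangle rows."""
    return {frozenset((names[i + 1], names[j])): float(v)
            for i, row in enumerate(rows) for j, v in enumerate(row)}

def is_ultrametric(names, d, tol=1e-9):
    """Three-point condition: the two largest distances in each triple match."""
    for i, j, k in combinations(names, 3):
        trio = sorted([d[frozenset((i, j))], d[frozenset((i, k))],
                       d[frozenset((j, k))]])
        if abs(trio[2] - trio[1]) > tol:
            return False
    return True

def is_additive(names, d, tol=1e-9):
    """Four-point condition: the two largest pair sums in each quartet match."""
    for i, j, k, l in combinations(names, 4):
        sums = sorted([d[frozenset((i, j))] + d[frozenset((k, l))],
                       d[frozenset((i, k))] + d[frozenset((j, l))],
                       d[frozenset((i, l))] + d[frozenset((j, k))]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

NAMES = ["A", "B", "C", "D", "E", "F"]
CLOCK = from_lower_triangle(
    NAMES, [[2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]])
NO_CLOCK = from_lower_triangle(
    NAMES, [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]])

print(is_ultrametric(NAMES, CLOCK), is_additive(NAMES, CLOCK))
print(is_ultrametric(NAMES, NO_CLOCK), is_additive(NAMES, NO_CLOCK))
```

The second matrix fails the ultrametric test but passes the additive one, which is the situation UPGMA cannot represent but neighbor joining can.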

Minimum Evolution method
The best tree is the one that minimizes the total length of the tree (the sum of the lengths of all branches).
Reasonably good at finding the best tree.
Major drawback: all possible trees must be examined, which is computationally onerous when dealing with large numbers of taxa.
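
To give a sense of why an exhaustive search is onerous, the sketch below counts the possible binary tree topologies using the standard double-factorial formulas: (2n - 5)!! unrooted trees and (2n - 3)!! rooted trees for n labeled taxa. The function names are illustrative.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 or 2."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def num_unrooted_trees(n):
    """Number of unrooted binary trees on n >= 3 labeled taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n - 5)

def num_rooted_trees(n):
    """Number of rooted binary trees on n >= 2 labeled taxa."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n - 3)

for n in (4, 6, 10, 20):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Six taxa already give 105 unrooted topologies, and the count grows faster than exponentially from there.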

Neighbor Joining method (Saitou and Nei, 1987)
A heuristic method for determining the best distance-based tree.
Heuristic methods explore a subset of all possible trees in the hope that the best tree lies within that subset.
Heuristic methods often fall into the 'hill-climbing' category:
  Start with a tree and alter it in some way.
  If the alteration makes it worse (using some optimality criterion), abandon it and try again.
  If the alteration makes it better, keep it and continue.
Heuristic methods include stepwise addition and branch swapping (more on these later).
NJ is a star decomposition heuristic.

The major drawback of heuristics is that the search can become trapped at a local optimum in tree space.
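
The local-optimum problem is easy to reproduce on a toy example. The sketch below is not a tree search: it hill-climbs over positions in a made-up list of scores (lower is better), with hypothetical `neighbors` and `score` functions standing in for tree rearrangements and an optimality criterion.

```python
def hill_climb(start, neighbors, score):
    """Move to the best improving neighbor until none improves the score."""
    current = start
    while True:
        better = [c for c in neighbors(current) if score(c) < score(current)]
        if not better:
            return current          # no alteration helps: an optimum
        current = min(better, key=score)

# Toy "tree space": each index is a candidate, each value its score.
LANDSCAPE = [5, 3, 4, 6, 2, 1, 7]

def neighbors(i):
    """Candidates reachable by one alteration (adjacent positions)."""
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]

def score(i):
    return LANDSCAPE[i]

print(hill_climb(0, neighbors, score))  # stops at index 1, score 3
print(hill_climb(3, neighbors, score))  # reaches index 5, score 1
```

Starting from position 0 the search stops at a score of 3 even though a score of 1 exists elsewhere, which is why heuristic searches are often repeated from several starting points.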

NJ
Combines computational speed with unique results (you always get a single best tree).
The NJ algorithm
Compute the net divergence, r, for every end node:
  rA = 5 + 4 + 7 + 6 + 8 = 30
  rB = 5 + 7 + 10 + 9 + 11 = 42
  rC = 32, rD = 38, rE = 34, rF = 44
Create a rate-corrected distance matrix:
  Mij = dij - (ri + rj)/(N - 2), where N = number of end nodes
  MAB = 5 - (30 + 42)/4 = -13
  MAC = 4 - (30 + 32)/4 = -11.5
Distance matrix (d):
       A   B   C   D   E
    B  5
    C  4   7
    D  7  10   7
    E  6   9   6   5
    F  8  11   8   9   8
Rate-corrected matrix (M):
       A      B      C      D      E
    B  -13
    C  -11.5  -11.5
    D  -10    -10    -10.5
    E  -10    -10    -10.5  -13
    F  -10.5  -10.5  -11    -11.5  -11.5
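
These two calculations can be sketched as follows. The matrix is reconstructed from the net divergences and distances quoted in the example (an assumption), and the function names are illustrative.

```python
from itertools import combinations

def from_lower_triangle(names, rows):
    """Build a {frozenset({a, b}): distance} map from lower-triangle rows."""
    return {frozenset((names[i + 1], names[j])): v
            for i, row in enumerate(rows) for j, v in enumerate(row)}

def net_divergence(names, d):
    """r_i: sum of the distances from node i to every other end node."""
    return {i: sum(d[frozenset((i, j))] for j in names if j != i)
            for i in names}

def rate_corrected(names, d):
    """M_ij = d_ij - (r_i + r_j) / (N - 2) for every pair of end nodes."""
    r = net_divergence(names, d)
    n = len(names)
    return {frozenset((i, j)): d[frozenset((i, j))] - (r[i] + r[j]) / (n - 2)
            for i, j in combinations(names, 2)}

NAMES = ["A", "B", "C", "D", "E", "F"]
D = from_lower_triangle(
    NAMES, [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]])

R = net_divergence(NAMES, D)
M = rate_corrected(NAMES, D)
print(R)
smallest = min(M.values())
print(smallest, [sorted(p) for p, v in M.items() if v == smallest])
```

The pairs with the smallest rate-corrected distance are the candidates for the next join.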

The NJ algorithm
Draw your star phylogeny.
[Figure: star phylogeny with A, B, C, D, E and F joined at a single central node]

The NJ algorithm
Define a new node, U, that groups the pair of taxa with the smallest rate-corrected distance: either AB or DE (both have M = -13). Here, U = AB.
Determine the branch lengths from U to A and B:
  SAU = dAB/2 + (rA - rB)/(2(N - 2))
  SAU = 5/2 + (30 - 42)/(2(6 - 2)) = 1
Because the distances are additive, SBU = dAB - SAU = 4.
[Figure: A and B joined at node U with branch lengths 1 and 4; C, D, E and F remain attached to the star]
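
The branch-length arithmetic can be checked with a short helper. The function name is hypothetical; the inputs are dAB = 5, rA = 30, rB = 42 and N = 6 from the example.

```python
def branch_lengths(d_ij, r_i, r_j, n):
    """Lengths of the branches from the new node to i and to j.

    s_i = d_ij / 2 + (r_i - r_j) / (2 * (n - 2)); s_j = d_ij - s_i.
    """
    s_i = d_ij / 2 + (r_i - r_j) / (2 * (n - 2))
    return s_i, d_ij - s_i

S_AU, S_BU = branch_lengths(5, 30, 42, 6)
print(S_AU, S_BU)
```

The taxon with the larger net divergence (B here) ends up on the longer branch, which is how NJ accommodates unequal rates.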

The NJ algorithm
Redefine the distances based on the new node:
  dCU = (dAC + dBC - dAB)/2
  dCU = (4 + 7 - 5)/2 = 3, etc.
       U  C  D  E
    C  3
    D  6  7
    E  5  6  5
    F  7  8  9  8
Repeat the previous steps.
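
The same update applied to every remaining taxon gives the first column of the new matrix. In this sketch the function name is illustrative and the input distances are taken from the reconstructed example matrix (an assumption).

```python
def distance_to_new_node(d_ik, d_jk, d_ij):
    """Distance from the node joining i and j to another end node k."""
    return (d_ik + d_jk - d_ij) / 2

D_AB = 5
# (dAk, dBk) for each remaining taxon k.
TO_A_AND_B = {"C": (4, 7), "D": (7, 10), "E": (6, 9), "F": (8, 11)}

D_U = {k: distance_to_new_node(d_ak, d_bk, D_AB)
       for k, (d_ak, d_bk) in TO_A_AND_B.items()}
print(D_U)
```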

The NJ algorithm
Calculate the net divergences for every end node (U now counts as an end node, so N = N - 1 = 5):
  rU = 21, rC = 24, rD = 27, rE = 24, rF = 32
Compute the rate-corrected distances. UC and DE share the smallest value, -12.
Create a new node, V, from either UC or DE. Here, V = UC.
Calculate the branch lengths from node V to C and U:
  SUV = dCU/2 + (rU - rC)/(2(N - 2))
  SUV = 3/2 + (21 - 24)/(2(5 - 2)) = 1
  SCV = dCU - SUV = 2
Compute the new distance matrix from V to all terminal nodes:
       V  D  E
    D  5
    E  4  5
    F  6  9  8

The NJ algorithm
And so on...
[Figure: successive trees as the remaining nodes are joined, ending with a fully resolved tree containing internal nodes U, V and W]
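
Putting the steps together, a compact sketch of the full loop is shown below. It follows the formulas used in this walkthrough, with a matrix reconstructed from the example (an assumption) and internal node labels of its own (`node0`, `node1`, ...) rather than U, V and W. Ties are broken by taking the first minimal pair, so the join order may differ from the slides without changing the final tree.

```python
from itertools import combinations

def from_lower_triangle(names, rows):
    """Build a {frozenset({a, b}): distance} map from lower-triangle rows."""
    return {frozenset((names[i + 1], names[j])): float(v)
            for i, row in enumerate(rows) for j, v in enumerate(row)}

def neighbor_joining(names, dist):
    """Return the unrooted tree as a list of (node, node, length) edges."""
    nodes = list(names)
    d = dict(dist)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of every current end node.
        r = {i: sum(d[frozenset((i, j))] for j in nodes if j != i)
             for i in nodes}
        # Pair with the smallest rate-corrected distance.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - (r[p[0]] + r[p[1]]) / (n - 2))
        u = f"node{next_id}"
        next_id += 1
        d_ij = d[frozenset((i, j))]
        s_i = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((i, u, s_i))
        edges.append((j, u, d_ij - s_i))
        # Distances from the new node to every remaining end node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))   # final connecting branch
    return edges

NAMES = ["A", "B", "C", "D", "E", "F"]
D = from_lower_triangle(
    NAMES, [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]])
EDGES = neighbor_joining(NAMES, D)
for a, b, length in EDGES:
    print(f"{a} - {b}: {length}")
```

Because the input matrix is additive, the branch lengths returned reproduce every pairwise distance in the matrix exactly.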

The NJ algorithm
Note that this is an unrooted tree. Where do we put the root?
Using external information, we can determine where the root belongs.
[Figure: the unrooted NJ tree and a rooted version of it]

UPGMA vs NJ

Alternatives to classical NJ
  BIONJ (Gascuel, 1997)
  Generalized neighbor joining (Pearson et al., 1999)
  Weighted neighbor joining (Bruno et al., 2000)
  Multi-neighbor-joining (Silva et al., 2005)
  Relaxed neighbor joining (Evans et al., 2006)
Software: PHYLIP, MEGA, PAUP, DAMBE, TREECON

Clustering methods: pros and cons
Pros:
  Easy to understand
  Easy to implement
  Fast
  Produce a single, best tree
  Use an explicit substitution model
  Topology AND branch lengths are calculated (NJ)
Cons:
  Reduce most of the data to a single value (reducing the amount of information)
  Different sets of sequences can yield the same distance matrix