Distance-based methods for phylogenetic tree reconstruction Colin Dewey BMI/CS 576 Fall 2015.

Slides:



Advertisements
Similar presentations
Computing a tree Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Advertisements

Phylogenetic Tree A Phylogeny (Phylogenetic tree) or Evolutionary tree represents the evolutionary relationships among a set of organisms or groups of.
. Class 9: Phylogenetic Trees. The Tree of Life Evolution u Many theories of evolution u Basic idea: l speciation events lead to creation of different.
Multiple Sequence Alignment & Phylogenetic Trees.
Computing a tree Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Lecture 13 CS5661 Phylogenetics Motivation Concepts Algorithms.
Parsimony based phylogenetic trees Sushmita Roy BMI/CS 576 Sep 30 th, 2014.
Phylogenetics - Distance-Based Methods CIS 667 March 11, 2204.
Phylogenetic trees Sushmita Roy BMI/CS 576 Sep 23 rd, 2014.
Molecular Evolution Revised 29/12/06
Tree Reconstruction.
Problem Set 2 Solutions Tree Reconstruction Algorithms
From Ernst Haeckel, 1891 The Tree of Life.  Classical approach considers morphological features  number of legs, lengths of legs, etc.  Modern approach.
UPGMA Algorithm.  Main idea: Group the taxa into clusters and repeatedly merge the closest two clusters until one cluster remains  Algorithm  Add a.
Lecture 7 – Algorithmic Approaches Justification: Any estimate of a phylogenetic tree has a large variance. Therefore, any tree that we can demonstrate.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Phylogenetic Reconstruction: Distance Matrix Methods Anders Gorm Pedersen Molecular Evolution Group Center for.
. Computational Genomics 5a Distance Based Trees Reconstruction (cont.) Modified by Benny Chor, from slides by Shlomo Moran and Ydo Wexler (IIT)
Bioinformatics Algorithms and Data Structures
Building phylogenetic trees Jurgen Mourik & Richard Vogelaars Utrecht University.
Distance methods. UPGMA: similar to hierarchical clustering but not additive Neighbor-joining: more sophisticated and additive What is additivity?
The Tree of Life From Ernst Haeckel, 1891.
CISC667, F05, Lec15, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Phylogenetic Trees (II) Distance-based methods.
Phylogenetic Trees Tutorial 6. Measuring distance Bottom-up algorithm (Neighbor Joining) –Distance based algorithm –Relative distance based Phylogenetic.
Multiple sequence alignment
. Class 9: Phylogenetic Trees. The Tree of Life D’après Ernst Haeckel, 1891.
Phylogeny Tree Reconstruction
Phylogenetic Trees Tutorial 6. Measuring distance Bottom-up algorithm (Neighbor Joining) –Distance based algorithm –Relative distance based Phylogenetic.
Building Phylogenies Distance-Based Methods. Methods Distance-based Parsimony Maximum likelihood.
Phylogenetic trees Tutorial 6. Distance based methods UPGMA Neighbor Joining Tools Mega phylogeny.fr DrewTree Phylogenetic Trees.
Phylogenetic trees Sushmita Roy BMI/CS 576
9/1/ Ultrametric phylogenies By Sivan Yogev Based on Chapter 11 from “Inferring Phylogenies” by J. Felsenstein.
Molecular evidence for endosymbiosis Perform blastp to investigate sequence similarity among domains of life Found yeast nuclear genes exhibit more sequence.
Phylogenetics Alexei Drummond. CS Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) (B)
PHYLOGENETIC TREES Dwyane George February 24,
1 Chapter 7 Building Phylogenetic Trees. 2 Contents Phylogeny Phylogenetic trees How to make a phylogenetic tree from pairwise distances –UPGMA method.
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Molecular phylogenetics 1 Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Sections
BINF6201/8201 Molecular phylogenetic methods
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
OUTLINE Phylogeny UPGMA Neighbor Joining Method Phylogeny Understanding life through time, over long periods of past time, the connections between all.
Phylogenetic Trees Tutorial 5. Agenda How to construct a tree using Neighbor Joining algorithm Phylogeny.fr tool Cool story of the day: Horizontal gene.
Building phylogenetic trees. Contents Phylogeny Phylogenetic trees How to make a phylogenetic tree from pairwise distances  UPGMA method (+ an example)
Introduction to Phylogenetics
Evolutionary tree reconstruction (Chapter 10). Early Evolutionary Studies Anatomical features were the dominant criteria used to derive evolutionary relationships.
394C, Spring 2013 Sept 4, 2013 Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT.
Evolutionary tree reconstruction
Algorithms in Computational Biology11Department of Mathematics & Computer Science Algorithms in Computational Biology Building Phylogenetic Trees.
Phylogenetic Analysis Gabor T. Marth Department of Biology, Boston College BI420 – Introduction to Bioinformatics Figures from Higgs & Attwood.
Comp. Genomics Recitation 8 Phylogeny. Outline Phylogeny: Distance based Probabilistic Parsimony.
Phylogeny Ch. 7 & 8.
Phylogenetic trees Sushmita Roy BMI/CS 576 Sep 23 rd, 2014.
Tutorial 5 Phylogenetic Trees.
1 CAP5510 – Bioinformatics Phylogeny Tamer Kahveci CISE Department University of Florida.
598AGB Basics Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT.
Part 9 Phylogenetic Trees
Distance-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
Hierarchical clustering approaches for high-throughput data Colin Dewey BMI/CS 576 Fall 2015.
CSCE555 Bioinformatics Lecture 13 Phylogenetics II Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:
Distance based phylogenetics
dij(T) - the length of a path between leaves i and j
Clustering methods Tree building methods for distance-based trees
Multiple Alignment and Phylogenetic Trees
Hierarchical clustering approaches for high-throughput data
The Tree of Life From Ernst Haeckel, 1891.
Inferring phylogenetic trees: Distance and maximum likelihood methods
Phylogenetic Trees.
#30 - Phylogenetics Distance-Based Methods
Lecture 7 – Algorithmic Approaches
Phylogeny.
Presentation transcript:

Distance-based methods for phylogenetic tree reconstruction Colin Dewey BMI/CS Fall 2015

Basic idea of distance-based methods Suppose we can compute a “distance”, d ij, between each pair of taxa based on some data (e.g., sequences) Can we come up with a tree structure (with lengths assigned to branches) that accurately reflect the pairwise distances? L 26 L 16 L 68 L 58 L 78 L 47 L 37 d 15 ≟ L 16 + L 68 + L 58

Distance-based methods for phylogenetic tree reconstruction Given n × n distance matrix for n units, construct the tree for these n units Algorithms – UPGMA – Neighbor joining Assume additivity and sometimes a “molecular clock” Additivity means we can add up the branch lengths of the tree connecting two nodes and get their distances – In other words, “additivity” of the distances means that there exists some tree that perfectly explains these distances

Defining distance between sequences Fractional alignment mismatch for two sequences i and j – p ij = m ij /L ij Gives an estimate of changes per site – m ij : Number of mismatches between sequences i and j – L ij : Number of aligned positions between sequences i and j – Assumes that changes have happened only once Underestimates the distance between sequences Jukes Cantor distance – Removes assumption above – The simplest evolutionary distance d ij between sequences i and j, where p ij is the fractional mismatch defined above

UPGMA algorithm for phylogenetic tree reconstruction UPGMA: Unweighted pair group method using arithmetic averages Represent all sequences as the leaf nodes of a tree Start with just the leaf nodes At each step, merge two closest nodes to create a new node in the tree – Set new node at height determined by nodes being merged – Recompute distance between new node and all other nodes Intermediate nodes will correspond to a set of sequences We will call sequences associated with an intermediate node i cluster C i Need to compute – Distance between two clusters of sequences – Height

Computing distance between clusters Let i and j be two nodes Let C i be the cluster of sequences for node i Let C j be the cluster of sequences for node j |C j | : Number of sequences in C j Distance between nodes i and j

Computing distance from a new node Let k be a new node to be created from merging i and j Let C i be the cluster of sequences for node i Let C j be the cluster of sequences for node j Distance d kl between nodes k and l, l≠i and l≠j This is equal to

UPGMA algorithm Input – n sequences – Distance matrix for all pairs of n sequences, d ij Output – Tree T Initialization – Assign each sequence i to its own cluster C i – Define one leaf of T for each sequence Iterate until only two clusters remain – Find two nodes C i and C j that have the smallest d ij – Define new cluster C k = C i U C j – Define daughters of k as i and j, place at height d ij /2 – Add k to cluster set. Remove i and j from the set of clusters Terminate – When only two clusters C i and C j remain, place root at d ij /2

UPGMA example ABCDE A08853 B0388 C088 D05 E0 AEDBC AEBCD 0885 B038 C08 D0 AEDBC initial state after one merge Example calculation

UPGMA example (cont.) AEDBC AEBCD AE085 BC08 D0 AEDBC AED08 BC0 AEDBC AEDBC after two merges after three merges final state

UPGMA relies on the molecular clock assumption Sequences diverge at the same rate at all points in the phylogeny Distance from any leaf to root is the same. If this is true the distances are said to have an “ultrametric” property This assumption is rarely true in practice

The molecular clock assumption & ultrametric data Ultrametric data: for any triplet of sequences, i, j, k, the distances are either all equal, or two are equal and the remaining one is smaller ABCDE A08853 B0388 C088 D05 E0 AEDBC

Problem with UPGMA when the molecular clock assumption does not hold Actual tree 2341 Constructed by UPGMA

Neighbor joining The assumption about the ultra-metric property is too strong – Most sequences diverge at different rates A more relaxed requirement is that of additivity – Distance between a pair of species/nodes is equal to the sum of the branch lengths Neighbor Joining (NJ) Uses a similar idea to construct trees as UPGMA – That is, consider pairs of nodes and joins them Produces unrooted trees

How to select nodes for joining? Given all pairwise distances for n sequences d ij denote the distance between node i and j Should we select node pairs with the smallest d ij ? A B C D This will give us an incorrect tree

Selecting nodes to join r i : “Average” distance from all other leaves L : number of leaves Neighbor joining requires us to correct the distance to account for distances from all other nodes. The corrected distance is denoted as D ij

Defining the distance to a new node i j m k d km ? New node Given d ij, d im, d jm, how to calculate distance of existing node m to new node k ?

Updating Distances in Neighbor Joining can calculate the distance from a leaf to its parent node in the same way i j m k New node

Updating Distances in Neighbor Joining we can generalize this so that we take into account the distance to all other leaves where and L is the set of leaves this is more robust if data aren’t strictly additive

Algorithm for NJ Initialization – T be set the of leaf nodes – L = T – Compute r i for all i in L – Compute D ij Iteration – Pick a pair i, j from L such that D ij is smallest – Define new node k – Compute d ik, d jk, add edge between k and i, and between k to j – Add k to L, remove i and j from L – Compute D mn for all nodes m, n in L Terminate – If L has two nodes, add the edge between these two.

Can we check for additivity? Check for additivity: For four leaves, i, j, k, l and the distances d ij, d ik, d il, d jk, d jl, d kl i j k l The three sums of two distances i j k l i j k l i j k l Should be such that two of these are equal, and larger than the third.

Comparing NJ and UPGMA UPGMA – Rooted tree – Assumptions: Molecular clock assumption/ultrametric distance and additivity NJ – Unrooted tree – Assumption: Additivity

Rooting a tree An unrooted tree can be converted to a rooted tree using an outgroup species Outgroup: a species known to be more distantly related to all the species than each of the species themselves Find the branch where the outgroup is selected to be added That gives the root candidate root outgroup