Distance based phylogenetics

Distance based phylogenetics Usman Roshan

Phylogenetics Study of how species relate to each other. “Nothing in biology makes sense, except in the light of evolution”, Theodosius Dobzhansky, Am. Biol. Teacher (1973). Rich in computational problems. Fundamental tool in comparative bioinformatics.

Why phylogenetics? Study of evolution: origin and migration of humans, origin and spread of disease. Many applications in comparative bioinformatics: sequence alignment, motif detection (phylogenetic motifs, evolutionary trace, phylogenetic footprinting), correlated mutation (useful for structural contact prediction), protein interaction, gene networks, vaccine development, and many more…

Phylogeny Problem [Figure: taxa U, V, W, X, Y with sequences AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT; the goal is to infer the tree relating them]

Bipartitions Phylogenies are equivalent to bipartitions

Topological differences
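The link between trees and bipartitions can be made concrete with a small sketch. It assumes a tree is written as nested Python tuples with string leaf labels (a hypothetical encoding, not one used in the slides), collects the non-trivial bipartitions of each tree, and counts the bipartitions found in only one of two hypothetical four-leaf trees as a simple measure of topological difference.

```python
def leaves(tree):
    """Return the set of leaf labels below a node (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def bipartitions(tree):
    """Collect the non-trivial bipartitions (both sides have >= 2 leaves)."""
    all_leaves = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        if len(side) > 1 and len(other) > 1:
            # Store the split as an unordered pair of leaf sets.
            splits.add(frozenset((side, other)))
        for child in node:
            visit(child)

    visit(tree)
    return splits

# Two hypothetical trees on the same four leaves.
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = ((("B", "C"), "A"), "D")

splits_1 = bipartitions(tree_1)
splits_2 = bipartitions(tree_2)
# Bipartitions present in exactly one of the two trees.
difference = splits_1 ^ splits_2
```

Here `tree_1` induces the bipartition (AB, CD) and `tree_2` induces (BC, AD), so each tree has one bipartition that the other lacks.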

Phylogeny Problem Two main methodologies:
1. Alignment first and phylogeny second. Construct an alignment using one of the MANY alignment programs in the literature. Do manual (eye) adjustments if necessary. Apply a phylogeny reconstruction method. Fast but biologically not realistic. The phylogeny is highly dependent on the accuracy of the alignment (but so is the alignment on the phylogeny!).
2. Simultaneous alignment and phylogeny reconstruction. Output both an alignment and a phylogeny. Computationally much harder. Biologically more realistic, as insertions, deletions, and mutations occur during the evolutionary process.

First methodology Compute an alignment (for now we assume we are given one), then construct a phylogeny using one of two approaches.
Distance-based methods. Input: distance matrix of pairwise statistical distance estimates between the aligned sequences. Output: phylogenetic tree. Fast but less accurate.
Character-based methods. Input: sequence alignment. Accurate but computationally very hard.

Distance-based methods

Evolution on a single edge Poisson process: the number of changes in a fixed time interval t is independent of the number of changes in any other non-overlapping time interval u; the expected number of changes in time interval t is proportional to the length of the interval; no changes occur in a time interval of length 0. Let X be the number of nucleotide changes on a single edge. We assume X is a Poisson process. Probability dictates that $P(X = k) = e^{-\lambda} \lambda^k / k!$, where $\lambda$ is the expected number of changes on the edge.

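Evolution on a single edge We want to compute $p(e)$, the probability of observing a nucleotide change on edge e. Under the two-state (0,1) model, the probability of observing a change is just the sum of the probabilities of k changes over all possible values of k, excluding even ones because an even number of changes cannot be seen: $p(e) = \sum_{k\ \mathrm{odd}} e^{-\lambda(e)} \lambda(e)^k / k! = \frac{1}{2}\left(1 - e^{-2\lambda(e)}\right)$, where $\lambda(e)$ is the expected number of changes on e.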

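Evolution on a single edge The expected number of nucleotide changes on a given edge e is given by $\lambda(e) = -\frac{1}{2}\ln\left(1 - 2p(e)\right)$, where $p(e)$ is the probability of observing a change on e. Key: $\lambda$ is additive.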
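A quick numerical check of the single-edge relationship, as a sketch under the two-state (0,1) model: summing the Poisson probabilities of an odd number of changes should agree with the closed form (1 - exp(-2λ))/2, and inverting that closed form should recover λ. The function names and the value λ = 0.3 are illustrative choices, not taken from the slides.

```python
import math

def prob_observed_change(lam, terms=60):
    """Sum Poisson probabilities over odd k (an odd number of changes is visible)."""
    return sum(math.exp(-lam) * lam**k / math.factorial(k)
               for k in range(1, terms, 2))

def expected_changes(p):
    """Invert p = (1 - exp(-2*lam)) / 2; only defined for p < 1/2."""
    return -0.5 * math.log(1.0 - 2.0 * p)

lam = 0.3                                   # illustrative expected number of changes
p = prob_observed_change(lam)               # truncated series over odd k
closed_form = (1.0 - math.exp(-2.0 * lam)) / 2.0
recovered = expected_changes(p)             # should be close to lam
```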

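Additivity Assume we have a path P of k edges and that $p_1, p_2, \ldots, p_k$ are the probabilities of change on each edge of the path. Using induction we can show that the probability of change along the whole path is $p(P) = \frac{1}{2}\left(1 - \prod_{i=1}^{k} (1 - 2p_i)\right)$. The multiplicative term is hard to deal with and does not easily decompose into a product or sum of the $p_i$'s.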

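Additivity But the expected number of nucleotide changes on the path P is elegant: $\lambda(P) = -\frac{1}{2}\ln\left(1 - 2p(P)\right) = -\frac{1}{2}\sum_{i=1}^{k} \ln(1 - 2p_i) = \sum_{i=1}^{k} \lambda_i$, where $\lambda_i = -\frac{1}{2}\ln(1 - 2p_i)$ is the expected number of changes on the i-th edge.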
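The contrast between the two quantities can be checked numerically. The sketch below assumes the two-state model, where the change probability along a path is (1 - Π(1 - 2p_i))/2, and uses three made-up edge probabilities: the path probability is not the sum of the edge probabilities, while the expected numbers of changes do add up.

```python
import math

def path_change_probability(edge_ps):
    """Probability that the two ends of a path differ (two-state model)."""
    product = 1.0
    for p in edge_ps:
        product *= 1.0 - 2.0 * p
    return 0.5 * (1.0 - product)

def expected_changes(p):
    """Expected number of changes corresponding to change probability p < 1/2."""
    return -0.5 * math.log(1.0 - 2.0 * p)

edge_ps = [0.05, 0.10, 0.20]                 # hypothetical per-edge probabilities
p_path = path_change_probability(edge_ps)    # not equal to sum(edge_ps)
lam_path = expected_changes(p_path)
lam_sum = sum(expected_changes(p) for p in edge_ps)   # equals lam_path
```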

Evolutionary models Simple 0,1 alphabet evolutionary model: i.i.d. model, uniformly random root sequence. Jukes-Cantor: uniformly random root sequence.

Evolutionary models General Markov Model Uniformly random root sequence i.i.d. model For time reversible models

Variation across sites Standard assumption of how sites can vary is that each site has a multiplicative scaling factor Typically these scaling factors are drawn from a Gamma distribution (or Gamma plus invariant)
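As an illustration of per-site scaling factors, the sketch below draws one rate per site from a Gamma distribution. The shape value, the mean-1 parameterization (scale = 1/shape), and the function names are assumptions made for the example; the slides do not specify them, and invariant sites are not modelled here.

```python
import random

def sample_site_rates(num_sites, alpha, seed=0):
    """Draw one multiplicative rate per site from Gamma(shape=alpha, scale=1/alpha)."""
    rng = random.Random(seed)
    # With scale 1/alpha the rates have mean 1, so edge lengths keep their meaning.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(num_sites)]

def scaled_edge_lengths(edge_length, rates):
    """Expected number of changes on one edge, site by site."""
    return [rate * edge_length for rate in rates]

rates = sample_site_rates(10000, alpha=0.5)   # small alpha: strong rate variation
mean_rate = sum(rates) / len(rates)           # should be close to 1
per_site = scaled_edge_lengths(0.1, rates)    # hypothetical edge of length 0.1
```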

Special issues Molecular clock: the expected number of changes for a site is proportional to time No-common-mechanism model: there is a random variable for every combination of edge and site

Evolutionary distance estimation

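Estimating evolutionary distances For sequences A and B, what is the evolutionary distance under the Jukes-Cantor model?
A: ACCTGTGGGTAACCACCC
B: ACCTGAGGGATAGGTCCG
It is $d_{AB} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where p is the probability that a site differs between A and B. But we don’t know what p is.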

Estimating evolutionary distances Assume nucleotide changes are Bernoulli trials (i.i.d. trials of success or failure), where p is the probability of success (a change) in each of the n trials (n is the sequence length). Compute a maximum likelihood estimate for p.
A: ACCTGTGGGTAACCACCC
B: ACCTGAGGGATAGGTCCG
   0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 1

Estimating evolutionary distance We want to find the value of p that maximizes the probability $P = p^k (1 - p)^{n-k}$, where k is the number of sites that differ. Set dP/dp to 0 and solve for p to get $\hat{p} = k/n$.

Estimating evolutionary distances For A and B, $\hat{p} = 7/18$. Continuing in this manner we estimate p for all pairs of sequences in the alignment and convert each estimate to a Jukes-Cantor distance. We now have a distance matrix under a biologically sound evolutionary model.

Distance methods

Distance methods UPGMA: similar to hierarchical clustering but not additive. Neighbor-joining: more sophisticated and additive. What is additivity?

Additivity
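One standard way to test additivity is the four-point condition: for every four taxa a, b, c, d, the two largest of the sums D_ab + D_cd, D_ac + D_bd, D_ad + D_bc must be equal. The sketch below checks it on a hypothetical matrix built from a tree with pendant edges 1, 2, 3, 4 (for A, B, C, D) and an internal edge of length 5, and on a perturbed copy; the helper names and the example numbers are not from the slides.

```python
from itertools import combinations

def symmetric(pairs):
    """Build a dict-of-dicts distance matrix from {(a, b): distance}."""
    D = {}
    for (a, b), d in pairs.items():
        D.setdefault(a, {})[b] = d
        D.setdefault(b, {})[a] = d
    return D

def is_additive(D, tol=1e-9):
    """Four-point condition: in every quartet the two largest sums are equal."""
    for a, b, c, d in combinations(list(D), 4):
        sums = sorted([D[a][b] + D[c][d],
                       D[a][c] + D[b][d],
                       D[a][d] + D[b][c]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Path lengths in the hypothetical tree ((A:1, B:2):5, (C:3, D:4)).
tree_like = symmetric({("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
                       ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7})
# Same matrix with one entry changed, which breaks additivity.
perturbed = symmetric({("A", "B"): 3, ("A", "C"): 8, ("A", "D"): 10,
                       ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7})

additive_ok = is_additive(tree_like)
additive_bad = is_additive(perturbed)
```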

UPGMA UPGMA is not additive but works for ultrametric trees. Takes O(n^3) time.
      A   B   C   D
  A       6  26  26
  B          26  26
  C               6
  D
[Figure: tree on leaves A, B, C, D with edge lengths 3, 3, 3, 3, 10, 10]

UPGMA
1. Initialize n clusters, where each cluster i contains the sequence i.
2. Find the closest pair of clusters i, j, using the distances in matrix D.
3. Make them neighbors in the tree by adding a new node (ij), and set the height of (ij), i.e. its distance to the leaves in i and j, to $D_{ij}/2$.
4. Update the distance matrix D: for all clusters k, set $D_{(ij),k} = \frac{n_i D_{ik} + n_j D_{jk}}{n_i + n_j}$ ($n_i$ and $n_j$ are the sizes of clusters i and j, respectively).
5. Delete the columns and rows for i and j in D and add new ones corresponding to cluster (ij), with distances as computed above.
6. Go to step 2 until only one cluster is left.

UPGMA
      A   B   C   D
  A       6  26  26
  B          26  26
  C               6
  D
[Figure: UPGMA tree on leaves A, B, C, D with labels 13, 13, 3, 3, 3, 3]
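The UPGMA steps can be sketched as follows. This is a minimal, unoptimized version that assumes the size-weighted average update for cluster distances and represents each merged cluster as a nested tuple (an illustrative encoding); it is run on the ultrametric matrix from the slide.

```python
from itertools import combinations

def upgma(names, D):
    """names: leaf labels; D: {(a, b): distance}. Returns (tree, node heights)."""
    size = {n: 1 for n in names}
    height = {n: 0.0 for n in names}
    dist = {frozenset(pair): float(d) for pair, d in D.items()}
    active = list(names)
    while len(active) > 1:
        # Step 2: closest pair of clusters.
        i, j = min(combinations(active, 2),
                   key=lambda pair: dist[frozenset(pair)])
        new = (i, j)
        size[new] = size[i] + size[j]
        # Step 3: the new node sits at half the distance between i and j.
        height[new] = dist[frozenset((i, j))] / 2.0
        # Step 4: size-weighted average distance to every other cluster.
        for k in active:
            if k != i and k != j:
                dist[frozenset((new, k))] = (
                    size[i] * dist[frozenset((i, k))]
                    + size[j] * dist[frozenset((j, k))]
                ) / size[new]
        # Step 5: replace i and j by the merged cluster.
        active = [k for k in active if k != i and k != j] + [new]
    return active[0], height

ultrametric = {("A", "B"): 6, ("A", "C"): 26, ("A", "D"): 26,
               ("B", "C"): 26, ("B", "D"): 26, ("C", "D"): 6}
tree, height = upgma(["A", "B", "C", "D"], ultrametric)
```

On this matrix the sketch joins A with B and C with D at height 3, then joins the two clusters at height 13, so the branch from each cluster to the root has length 13 - 3 = 10.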

UPGMA Doesn’t work (in general) for non-ultrametric trees.
      A   B   C   D
  A      13  16  26
  B          12  19
  C              13
  D
[Figure: tree on leaves B, C, A, D with edge lengths 3, 3, 3, 3, 10, 10]

UPGMA UPGMA constructs an incorrect tree here.
      A   B   C   D
  A      13  16  26
  B          12  19
  C              13
  D
[Figure: UPGMA tree on leaves B, C, A, D with edge labels 6 and 7.25]

UPGMA Bipartition (BC,AD) is not in the true tree. [Figure: true tree on leaves A, D, B, C (edge lengths 3, 3, 3, 3, 10, 10) next to the UPGMA tree (edge labels 6 and 7.25)]
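A short sketch of why the second matrix trips up UPGMA. It checks the three-point (ultrametric) condition, under which the two largest distances in every triple must be equal, and finds the closest pair, which is the pair UPGMA joins first. The helper names are illustrative; both matrices are the ones shown on the UPGMA slides.

```python
from itertools import combinations

def lookup(D, a, b):
    """Distance between a and b in a {(a, b): distance} matrix, either key order."""
    return D[(a, b)] if (a, b) in D else D[(b, a)]

def is_ultrametric(names, D, tol=1e-9):
    """Three-point condition: in every triple the two largest distances are equal."""
    for a, b, c in combinations(names, 3):
        d = sorted([lookup(D, a, b), lookup(D, a, c), lookup(D, b, c)])
        if abs(d[2] - d[1]) > tol:
            return False
    return True

def closest_pair(names, D):
    """The pair UPGMA would join first."""
    return min(combinations(names, 2), key=lambda pair: lookup(D, *pair))

names = ["A", "B", "C", "D"]
first = {("A", "B"): 6, ("A", "C"): 26, ("A", "D"): 26,
         ("B", "C"): 26, ("B", "D"): 26, ("C", "D"): 6}
second = {("A", "B"): 13, ("A", "C"): 16, ("A", "D"): 26,
          ("B", "C"): 12, ("B", "D"): 19, ("C", "D"): 13}

first_is_ultrametric = is_ultrametric(names, first)
second_is_ultrametric = is_ultrametric(names, second)
first_join = closest_pair(names, second)
```

For the second matrix the closest pair is (B, C), so UPGMA starts by grouping B with C, which is how the bipartition (BC,AD) ends up in its output.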

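Neighbor joining Additive and O(n^3) time.
1. Initialization: same as UPGMA.
2. For each species i compute $u_i = \sum_{k \neq i} D_{ik} / (n - 2)$, where n is the current number of clusters.
3. Select i and j for which $D_{ij} - u_i - u_j$ is minimum.
4. Make them neighbors in the tree by adding a new node (ij), and set the distances from (ij) to i and j as $d_{i,(ij)} = \frac{1}{2} D_{ij} + \frac{1}{2}(u_i - u_j)$ and $d_{j,(ij)} = \frac{1}{2} D_{ij} + \frac{1}{2}(u_j - u_i)$.

Neighbor joining (continued)
5. Update the distance matrix D: for all clusters k, set $D_{(ij),k} = \frac{1}{2}\left(D_{ik} + D_{jk} - D_{ij}\right)$.
6. Delete the columns and rows for i and j in D and add new ones corresponding to cluster (ij), with distances as computed above.
7. Go to step 2 until two nodes/clusters are left.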

NJ NJ constructs the correct tree for additive matrices.
      A   B   C   D
  A      13  16  26
  B          12  19
  C              13
  D
[Figure: tree on leaves B, C, A, D with edge lengths 3, 3, 3, 3, 10, 10]
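A compact sketch of neighbor joining, assuming the common formulation with u_i = Σ_k D_ik / (n - 2), the selection criterion D_ij - u_i - u_j, and the update D_(ij),k = (D_ik + D_jk - D_ij)/2. Merged nodes are encoded as nested tuples and the result is a list of (node, neighbor, branch length) edges; both are illustrative choices. It is run on the four-taxon matrix from the slide.

```python
from itertools import combinations

def neighbor_joining(names, D):
    """names: leaf labels; D: {(a, b): distance}. Returns a list of tree edges."""
    dist = {frozenset(pair): float(d) for pair, d in D.items()}

    def d(a, b):
        return dist[frozenset((a, b))]

    active = list(names)
    edges = []  # (node, neighbor, branch length)
    while len(active) > 2:
        n = len(active)
        # Average-like distance of each node to all the others.
        u = {i: sum(d(i, k) for k in active if k != i) / (n - 2) for i in active}
        # Pick the pair minimizing D_ij - u_i - u_j.
        i, j = min(combinations(active, 2),
                   key=lambda pair: d(*pair) - u[pair[0]] - u[pair[1]])
        new = (i, j)
        edges.append((i, new, 0.5 * d(i, j) + 0.5 * (u[i] - u[j])))
        edges.append((j, new, 0.5 * d(i, j) + 0.5 * (u[j] - u[i])))
        # Distances from the new node to every remaining node.
        for k in active:
            if k != i and k != j:
                dist[frozenset((new, k))] = 0.5 * (d(i, k) + d(j, k) - d(i, j))
        active = [k for k in active if k != i and k != j] + [new]
    a, b = active
    edges.append((a, b, d(a, b)))  # connect the last two nodes
    return edges

matrix = {("A", "B"): 13, ("A", "C"): 16, ("A", "D"): 26,
          ("B", "C"): 12, ("B", "D"): 19, ("C", "D"): 13}
edges = neighbor_joining(["A", "B", "C", "D"], matrix)
joined = [parent for _, parent, _ in edges[:-1]]
```

On this matrix the sketch pairs A with B and C with D, so the resulting tree contains the bipartition (AB,CD) rather than (BC,AD).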

Simulation studies

Simulation studies The true evolutionary tree is never known in practice. Simulation allows us to study the accuracy of methods under biologically realistic scenarios. The mathematics behind phylogenetics is often complex and challenging. Simulation allows us to study algorithms when theoretical analysis is not possible, and to examine algorithm performance under various conditions such as different evolutionary rates, sequence lengths, or numbers of taxa.

Statistical consistency As sequence lengths tend to infinity the distance estimation improves and eventually leads to the true additive matrix If a method like NJ is then applied we get the true tree. In practice, however, we have limited sequence length. Therefore we want to know how much sequence length a method requires to achieve low error

Convergence rates Can be studied experimentally or theoretically Theoretical results offer loose bounds Experiments (under simulation) provide more realistic bounds on sequence lengths
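A toy simulation in this spirit, as a sketch rather than a realistic study: it assumes the Jukes-Cantor model for a single pair of sequences, draws each site as "different" with the model probability p = 3/4 (1 - exp(-4d/3)) instead of simulating the sequences themselves, and records how far the estimated distance is from the true one at several sequence lengths. The true distance of 0.3, the seed, and the lengths are arbitrary choices.

```python
import math
import random

def jc_diff_probability(d):
    """Probability that a site differs after d expected changes per site (JC)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def jc_distance(p):
    """Jukes-Cantor distance for an observed fraction p < 3/4 of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def estimate_distance(true_d, n_sites, rng):
    """Simulate n_sites independent sites and estimate the distance from them."""
    p = jc_diff_probability(true_d)
    k = sum(rng.random() < p for _ in range(n_sites))
    p_hat = k / n_sites
    if p_hat >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return jc_distance(p_hat)

rng = random.Random(1)
true_d = 0.3
errors = {n: abs(estimate_distance(true_d, n, rng) - true_d)
          for n in (100, 1000, 10000, 100000)}
```

Printing `errors` shows the absolute estimation error at each length; a fuller study would repeat each setting many times and use whole trees rather than a single pair.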

Sequence length requirements

Sequence length requirements

Typical performance study

Sequence lengths for NJ Sequence lengths required to obtain 90% accuracy

Error rate of NJ