Introduction to Phylogeny Cédric Notredame Centro de Regulacio Genomica Adapted from Aiden Budd’s Lecture on Phylogeny.


PRANK T-Coffee MAFFT M-Coffee

Which MSA Method ?

By eye: using Jalview; using seq_reformat and the T-Coffee CORE index; using Gblocks; using trimAl

Gene tree vs. Species tree: The evolutionary history of genes reflects that of the species that carry them, except in two cases: horizontal transfer, i.e. gene transfer between species (e.g. bacteria, mitochondria); and gene duplication (orthology / paralogy).

Orthology / Paralogy

Reconstruction of species phylogeny: artefacts due to paralogy !! Gene loss can occur during evolution : even with complete genome sequences it may be difficult to detect paralogy !!

Methods for Phylogenetic reconstruction Main families of methods :

Number of possible tree topologies for n taxa
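
The figure for this slide is not preserved in the transcript. As a hedged illustration, the standard counts for fully bifurcating trees with n labeled taxa are (2n-5)!! unrooted topologies and (2n-3)!! rooted topologies, where !! is the double factorial. The function names below are illustrative, not taken from the lecture.

```python
def double_factorial(k: int) -> int:
    """Product k * (k-2) * (k-4) * ... down to 1 or 2; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_unrooted_topologies(n: int) -> int:
    """Unrooted bifurcating topologies for n >= 3 labeled taxa: (2n-5)!!"""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n - 5)


def n_rooted_topologies(n: int) -> int:
    """Rooted bifurcating topologies for n >= 2 labeled taxa: (2n-3)!!"""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n - 3)


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, n_unrooted_topologies(n), n_rooted_topologies(n))
```

The rapid growth of these numbers is why exhaustive tree searches become impractical beyond a few dozen sequences.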

Parsimony (1)

Some properties of Parsimony: Several trees can be equally parsimonious (same length, the shortest of all possible lengths). The position of changes on each branch is not uniquely defined => parsimony cannot define tree branch lengths in a unique way. The number of trees to evaluate grows extremely fast with the number of compared sequences: the search for the shortest tree must often be restricted to a fraction of the set of all possible tree shapes (heuristic search) => there is no mathematical certainty of finding the shortest (most parsimonious) tree.

Evolutionary Distances: They measure the total number of substitutions that occurred on both lineages since divergence from the last common ancestor, divided by sequence length, and are expressed in substitutions / site. (Diagram: an ancestor diverging into sequence 1 and sequence 2.)

Quantification of evolutionary distances (1): The problem of hidden or multiple changes. D (true evolutionary distance) ≥ fraction of observed differences (p). D = p + hidden changes. Through hypotheses about the nature of the residue substitution process, it becomes possible to estimate D from observed differences between sequences.

Molecular Clock

Quantification of evolutionary distances (2): Kimura’s two-parameter distance (DNA). Hypotheses of the model: (a) All sites evolve independently and follow the same process. (b) Substitutions occur according to two probabilities: one for transitions, one for transversions. Transitions: G ↔ A or C ↔ T. Transversions: all other changes. (c) The base substitution process is constant in time. Quantification of evolutionary distance (d) as a function of the fraction of observed differences (p: transitions, q: transversions): d = -(1/2) ln[(1 - 2p - q) sqrt(1 - 2q)]. Kimura (1980) J. Mol. Evol. 16:111
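
As a minimal sketch of the two-parameter distance, the code below counts transitions and transversions between two aligned DNA sequences and applies Kimura's formula d = -(1/2) ln(1 - 2p - q) - (1/4) ln(1 - 2q). The function name and the choice to skip any column containing a non-ACGT character are assumptions made for illustration.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences."""
    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Assumption: columns with gaps or ambiguous bases are ignored.
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared    # fraction of sites showing a transition
    q = transversions / compared  # fraction of sites showing a transversion
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        return math.inf           # saturated: the distance cannot be computed
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)


if __name__ == "__main__":
    print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

The corrected distance is always at least as large as the observed fraction of differences p + q, which reflects the hidden changes the model accounts for.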

Quantification of evolutionary distances (3): PAM and Kimura’s distances (proteins). Hypotheses of the model (Dayhoff, 1979): (a) All sites evolve independently and follow the same process. (b) Each type of amino acid replacement has a given, empirical probability: large numbers of highly similar protein sequences have been collected, and the probabilities of replacement of any a.a. by any other have been tabulated. (c) The amino acid substitution process is constant in time. Quantification of evolutionary distance (d): the number of replacements most compatible with the observed pattern of amino acid changes and individual replacement probabilities. Kimura’s empirical approximation: d = -ln(1 - p - 0.2 p^2) (Kimura, 1983), where p = fraction of observed differences
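
Kimura's (1983) empirical approximation for proteins is usually written d = -ln(1 - p - 0.2 p^2). The short sketch below evaluates it; the function name and the convention of returning infinity once the logarithm's argument is no longer positive are illustrative assumptions.

```python
import math


def kimura_protein_distance(p: float) -> float:
    """Kimura's empirical protein distance from the fraction p of differing sites."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    argument = 1.0 - p - 0.2 * p * p
    if argument <= 0.0:
        return math.inf  # too divergent for the approximation
    return -math.log(argument)


if __name__ == "__main__":
    for p in (0.05, 0.1, 0.3, 0.6):
        print(p, round(kimura_protein_distance(p), 4))
```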

Quantification of evolutionary distances (7): Other distance measures Several other, more realistic models of the evolutionary process at the molecular level have been used : Accounting for biased base compositions (Tajima & Nei). Accounting for variation of the evolutionary rate across sequence sites. etc...

Correspondence between trees and distance matrices: Any phylogenetic tree induces a matrix of distances between sequence pairs. “Perfect” distance matrices correspond to a single phylogenetic tree.

Building phylogenetic trees by distance methods. General principle: Sequence alignment → (1) Matrix of evolutionary distances between sequence pairs → (2) (unrooted) tree. (1) Measuring evolutionary distances. (2) Tree computation from a matrix of distance values.

Distance matrix -> tree (1): (Diagram: unrooted tree with branch lengths l_i, l_j, l_k, l_m, l_c, l_r.) Any unrooted tree induces a distance δ between sequences: δ(i,m) = l_i + l_c + l_r + l_m. It is possible to compute the values of branch lengths that create the best match between δ and the evolutionary distance d: minimize the sum, over all sequence pairs (x, y), of (d(x,y) - δ(x,y))^2. It is then possible to compute the total tree length: S = sum of all branch lengths. Tree topology ==> «best» branch lengths ==> total tree length

UPGMA: Very simple principle. Very unreliable method.
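
The UPGMA figures are not preserved in the transcript. As a hedged sketch of the principle as it is commonly described: repeatedly merge the two closest clusters, place the new node at half their distance, and replace their distances to every other cluster by a size-weighted average. The data layout (nested tuples of the form (left, right, height)) is an illustrative choice.

```python
def upgma(names, matrix):
    """Build a UPGMA tree from a symmetric distance matrix (list of lists).

    Returns a nested tuple (left, right, height); leaves are plain names.
    """
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, size)
    dist = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of active clusters.
        (i, j), dij = min(dist.items(), key=lambda item: item[1])
        (tree_i, size_i), (tree_j, size_j) = clusters.pop(i), clusters.pop(j)
        # Size-weighted average distance from the merged cluster to the others.
        new_dist = {}
        for k in clusters:
            dik = dist[(min(i, k), max(i, k))]
            djk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (size_i * dik + size_j * djk) / (size_i + size_j)
        # Drop every distance that involved i or j, then add the new ones.
        dist = {pair: v for pair, v in dist.items() if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j, dij / 2), size_i + size_j)
        next_id += 1
    return next(iter(clusters.values()))[0]


if __name__ == "__main__":
    print(upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]]))
```

Because every node is placed at equal height above its descendants, the method implicitly assumes a molecular clock, which is one reason it is considered unreliable when rates differ among lineages.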

Distance matrix -> tree (2): The Minimum Evolution Method. For each possible topology, compute its total length S. Keep the tree with the smallest S value. Problem: this method is very computation-intensive. It is practically unusable with more than ~25 sequences => an approximate (heuristic) method is needed: Neighbor-Joining, a heuristic for the minimum evolution principle

Distance matrix -> tree (3): The Neighbor-Joining Method: algorithm. Step 1: Use the distances d measured between the N sequences. Step 2: For all pairs i and j, consider the bush-like topology in which i and j are grouped, and compute S_i,j, the sum of all “best” branch lengths. Step 3: Retain the pair (i,j) with the smallest S_i,j value. Group i and j in the tree. Step 4: Compute new distances d between N-1 objects, the pair (i,j) and the N-2 remaining sequences: d_(i,j),k = (d_i,k + d_j,k) / 2. Step 5: Return to step 1 as long as N ≥ 4. Saitou & Nei (1987) Mol. Biol. Evol. 4:406
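
One iteration of the steps above can be sketched as follows. The expression used for S_i,j is the closed form given by Saitou & Nei (1987) for the bush-like topology in which i and j are joined; it is not shown in the transcript, so treat it as an assumption of this sketch. The reduction step uses the simple averaging rule stated in step 4. Function names are illustrative.

```python
from itertools import combinations


def nj_tree_length(d, i, j):
    """S_ij: sum of least-squares branch lengths when i and j are joined."""
    n = len(d)
    others = [k for k in range(n) if k not in (i, j)]
    to_pair = sum(d[i][k] + d[j][k] for k in others) / (2 * (n - 2))
    among_others = sum(d[k][l] for k, l in combinations(others, 2)) / (n - 2)
    return to_pair + d[i][j] / 2 + among_others


def best_pair(d):
    """Step 3: the pair (i, j) with the smallest S_ij."""
    return min(combinations(range(len(d)), 2),
               key=lambda pair: nj_tree_length(d, *pair))


def merge_pair(d, i, j):
    """Step 4: reduced matrix, merged pair placed last, d_(ij),k = (d_ik + d_jk) / 2."""
    others = [k for k in range(len(d)) if k not in (i, j)]
    reduced = [[d[a][b] for b in others] for a in others]
    merged_row = [(d[i][k] + d[j][k]) / 2 for k in others]
    for row, value in zip(reduced, merged_row):
        row.append(value)
    reduced.append(merged_row + [0.0])
    return reduced


if __name__ == "__main__":
    # Tree-like distances for ((A,B),(C,D)) with branch lengths 2, 3, 4, 5 and internal branch 1.
    d = [[0, 5, 7, 8], [5, 0, 8, 9], [7, 8, 0, 9], [8, 9, 9, 0]]
    i, j = best_pair(d)
    print((i, j), nj_tree_length(d, i, j))
    print(merge_pair(d, i, j))
```

For tree-like distances, S_i,j for a true neighbor pair equals the total length of the tree, and every wrong pairing gives a larger value.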

Distance matrix -> tree (5): The Neighbor-Joining Method (NJ): properties. NJ is a fast method, even for hundreds of sequences. The NJ tree is an approximation of the minimum evolution tree (the tree whose total branch length is minimal). In that sense, the NJ method is very similar to parsimony, because branch lengths represent substitutions. NJ always produces unrooted trees, which need to be rooted by the outgroup method. NJ always finds the correct tree if distances are tree-like. NJ performs well when substitution rates vary among lineages. Thus NJ should find the correct tree if distances are well estimated.

Maximum likelihood methods (1) (programs fastDNAml, PAUP*, PROML, PROTML) Hypotheses The substitution process follows a probabilistic model whose mathematical expression, but not parameter values, is known a priori. Sites evolve independently from each other. All sites follow the same substitution process (some methods use a discrete gamma distribution of site rates). Substitution probabilities do not change with time on any tree branch. They may vary between branches.

Maximum likelihood methods (2): Probabilistic model of the evolution of homologous sequences. l_i, branch lengths = expected number of substitutions per site along branch i. θ, relative rates of base substitutions (e.g., transition/transversion, G+C-bias). Thus, one can compute Proba_branch i (x → y) for any bases x & y, any branch i, any set of θ values

Maximum likelihood algorithm (1). Step 1: Let us consider a given rooted tree, a given site, and a given set of branch lengths. Let us compute the probability that the observed pattern of nucleotides at that site has evolved along this tree. S1, S2, S3, S4: observed bases at the site in seq. 1, 2, 3, 4. S5, S6, S7: unknown and variable ancestral bases (S7 at the root). l1, l2, …, l6: given branch lengths. P(S1, S2, S3, S4) = Σ_S7 Σ_S5 Σ_S6 P(S7) P_l5(S7,S5) P_l6(S7,S6) P_l1(S5,S1) P_l2(S5,S2) P_l3(S6,S3) P_l4(S6,S4), where P(S7) is estimated by the average base frequencies in the studied sequences.

Maximum likelihood algorithm (2)

Maximum likelihood algorithm (3). Step 2: compute the probability that entire sequences have evolved: P(Sq1, Sq2, Sq3, Sq4) = Π_all sites P(S1, S2, S3, S4). Step 3: compute the branch lengths l1, l2, …, l6 and the value of parameter θ that give the highest P(Sq1, Sq2, Sq3, Sq4) value. This is the likelihood of the tree. Step 4: compute the likelihood of all possible trees. The tree predicted by the method is the one with the highest likelihood.
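
The per-site sum over ancestral bases and the product over sites can be sketched for the four-sequence rooted tree of these slides. The lecture leaves the substitution model general; this sketch assumes the Jukes-Cantor model (equal rates, equal base frequencies) purely to keep the branch probabilities short, and it evaluates the likelihood for given branch lengths without optimizing them.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(x, y, length):
    """Jukes-Cantor probability that base x becomes y along a branch of given length."""
    e = math.exp(-4.0 * length / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def site_likelihood(s1, s2, s3, s4, lengths, root_freqs=None):
    """P(S1,S2,S3,S4): sum over ancestral bases S7 (root), S5 and S6."""
    l1, l2, l3, l4, l5, l6 = lengths
    if root_freqs is None:
        root_freqs = {b: 0.25 for b in BASES}  # assumption: uniform base frequencies
    total = 0.0
    for s7, s5, s6 in product(BASES, repeat=3):
        total += (root_freqs[s7]
                  * jc_prob(s7, s5, l5) * jc_prob(s7, s6, l6)
                  * jc_prob(s5, s1, l1) * jc_prob(s5, s2, l2)
                  * jc_prob(s6, s3, l3) * jc_prob(s6, s4, l4))
    return total


def log_likelihood(seqs, lengths):
    """Log of the product over all sites, for four aligned sequences."""
    return sum(math.log(site_likelihood(*column, lengths)) for column in zip(*seqs))


if __name__ == "__main__":
    lengths = (0.1, 0.2, 0.15, 0.3, 0.05, 0.05)
    print(log_likelihood(["ACGT", "ACGA", "ACTT", "GCTT"], lengths))
```

Working with the sum of logarithms rather than the raw product avoids numerical underflow on long alignments.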

Maximum likelihood : properties This is the best justified method from a theoretical viewpoint. Sequence simulation experiments have shown that this method works better than all others in most cases. But it is a very computer-intensive method. It is nearly always impossible to evaluate all possible trees because there are too many. A partial exploration of the space of possible trees is done.

Reliability of phylogenetic trees: the bootstrap The phylogenetic information expressed by an unrooted tree resides entirely in its internal branches. The tree shape can be deduced from the list of its internal branches. Testing the reliability of a tree = testing the reliability of each internal branch.

Bootstrap procedure The support of each internal branch is expressed as percent of replicates.

“Bootstrapped” tree

Bootstrap procedure: properties. Internal branches supported by ≥ 90% of replicates are considered statistically significant. The bootstrap procedure only detects whether the sequence length is sufficient to support a particular node. It does not help determine whether the tree-building method is good. A wrong tree can have 100% bootstrap support for all its branches!
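
The figure describing the procedure is not preserved in the transcript. As a hedged sketch of the resampling step as it is usually done: each replicate alignment is built by drawing alignment columns (sites) at random, with replacement, until it has as many columns as the original. A tree is then built from each replicate with the method under study (not shown here). The function name and the list-of-strings alignment layout are illustrative.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment given as equal-length strings."""
    n_sites = len(alignment[0])
    # Draw column indices with replacement; the same columns are used for every sequence.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only so the example is repeatable
    alignment = ["ACGTAC", "ACGAAC", "TCGAAT"]
    for _ in range(3):
        print(bootstrap_alignment(alignment, rng))
```

Sampling whole columns keeps each site's pattern across sequences intact, which is what the trees built from the replicates depend on.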

Bayesian inference of phylogenetic trees. Aim: compute the posterior probability of all tree topologies, given the sequence alignment. The posterior Pr(τ | X) combines the likelihood of tree + parameters with the prior probability of parameter values. τ: tree topology. X: aligned sequences. v: set of tree branch lengths. θ: parameters of the substitution model (e.g., transition/transversion ratio)

Analytical computation of Pr(τ | X) is impossible in general. A computational technique called Metropolis-coupled Markov chain Monte Carlo [MC³] is used to generate a statistical sample from the posterior distribution of trees (example: generate a random sample of 10,000 trees). Result: - Retain the tree having the highest probability (the one found most often in the sample). - Compute the posterior probabilities of all clades of that tree: the fraction of sampled trees containing a given clade.
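
The summary step can be sketched independently of the sampler. The code below assumes the sampled trees are already available as rooted nested tuples of taxon names (a layout chosen for illustration); it retains the most frequently sampled topology and reports, for each of its clades, the fraction of sampled trees containing that clade.

```python
from collections import Counter


def clades(tree):
    """Return (leaf set, set of clades) for a rooted tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def summarize_sample(sampled_trees):
    """Most frequent topology and the posterior probability of each of its clades."""
    clade_sets = [frozenset(clades(tree)[1]) for tree in sampled_trees]
    best_topology, _ = Counter(clade_sets).most_common(1)[0]
    n = len(sampled_trees)
    support = {clade: sum(1 for cs in clade_sets if clade in cs) / n
               for clade in best_topology}
    return best_topology, support


if __name__ == "__main__":
    sample = [(("A", "B"), ("C", "D"))] * 3 + [(("A", "C"), ("B", "D"))]
    topology, support = summarize_sample(sample)
    for clade, prob in support.items():
        print(sorted(clade), prob)
```

Identifying a topology by its set of clades makes the count insensitive to the order in which children are written.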

Reyes et al. (2004) Mol. Biol. Evol. 21:397–403

Overcredibility of Bayesian estimation of clade support? Bayesian clade support is much stronger than bootstrap support: Douady et al. (2003) Mol. Biol. Evol. 20:248–254. (Figure labels: Bayesian posterior probability; bootstrapped posterior probability; bootstrap support in ML analysis.)

So, Bayesian clade support is high and bootstrap clade support is low: which one is closer to the true support? Conclusions from simulation experiments: - When sequence evolution fits exactly the probability model used, Bayesian support is correct and bootstrap support is pessimistic. - Bayesian inference is sensitive to small model misspecifications and becomes too optimistic.

PHYML: a Fast and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood. ML requires finding the quantitative (e.g., branch lengths) and qualitative (tree topology) parameter values that correspond to the highest probability for the sequences to have evolved. PHYML adjusts topology and branch lengths simultaneously. Because only a few iterations are sufficient to reach an optimum, PHYML is a fast, but accurate, ML algorithm. Guindon & Gascuel (2003) Syst. Biol. 52(5):696–704

Tree and sequence simulation experiment. Legend: P, PHYML; F, fastDNAml; L, NJML; D, DNAPARS; N, NJ. Settings: 5000 random trees; 40 taxa, 500 bases; no molecular clock; varying tree length; K2P, transition/transversion ratio = 2

Comparison of running time for various tree-building algorithms: distance < parsimony ~ PHYML << Bayesian < classical ML. Programs, in the same order: NJ; DNAPARS; PHYML; MrBayes; fastDNAml, PAUP

WWW resources for molecular phylogeny (1). Compilations: • A list of sites and resources: • An extensive list of phylogeny programs: phylip/software.html. Databases of rRNA sequences and associated software: • The rRNA WWW Server - Antwerp, Belgium. • The Ribosomal Database Project - Michigan State University

WWW resources for molecular phylogeny (2). Database similarity searches (Blast): Multiple sequence alignment: • ClustalX: multiple sequence alignment with a graphical interface (for all types of computers). and go to ‘software’ • Web interface to ClustalW algorithm for proteins: and press “clustal”

WWW resources for molecular phylogeny (3). Sequence alignment editor: • SEAVIEW: for Windows and Unix. Programs for molecular phylogeny: • PHYLIP: an extensive package of programs for all platforms • PAUP*: a high-performance commercial package • PHYLO_WIN: a graphical interface, for Unix only • MrBayes: Bayesian phylogenetic analysis • PHYML: fast maximum likelihood tree building • WWW interface at Institut Pasteur, Paris

WWW resources for molecular phylogeny (4). Tree drawing: NJPLOT (for all platforms). Lecture notes of molecular systematics

WWW resources for molecular phylogeny (5). Books: • Laboratory techniques: Molecular Systematics (2nd edition), Hillis, Moritz & Mable eds.; Sinauer, • Molecular evolution: Fundamentals of Molecular Evolution (2nd edition); Graur & Li; Sinauer, • Evolution in general: Evolution (2nd edition); M. Ridley; Blackwell, 1996.

Quantification of evolutionary distances (4): Synonymous and non-synonymous distances (coding DNA): Ka, Ks. Hypothesis of the previous models: (a) All sites evolve independently and follow the same process. Problem: in protein-coding genes, there are two classes of sites with very different evolutionary rates. Non-synonymous substitutions (which change the a.a.): slow. Synonymous substitutions (which do not change the a.a.): fast. Solution: compute two evolutionary distances. Ka = non-synonymous distance = number of non-synonymous substitutions / number of non-synonymous sites. Ks = synonymous distance = number of synonymous substitutions / number of synonymous sites

The genetic code

Quantification of evolutionary distances (6): Calculation of Ka and Ks. The details of the method are quite complex. Roughly: Split all sites of the 2 compared genes into 3 categories: I: non-degenerate, II: partially degenerate, III: totally degenerate. Compute the number of non-synonymous sites = I + 2/3 II. Compute the number of synonymous sites = III + 1/3 II. Compute the numbers of synonymous and non-synonymous changes. Compute Ka and Ks with Kimura’s 2-parameter method. Frequently, one of these two situations occurs: Evolutionarily close sequences: Ks is informative, Ka is not. Evolutionarily distant sequences: Ks is saturated, Ka is informative. Li, Wu & Luo (1985) Mol. Biol. Evol. 2:150
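
The site-counting part of this outline can be sketched for a single coding sequence. The sketch assumes the standard genetic code, treats stop codons as just another translation outcome, and classifies each codon position by how many of the three possible base changes leave the encoded amino acid unchanged (none: I, all three: III, otherwise: II). The published method averages these counts over the two compared genes and then counts changes; neither step is shown here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def site_class(codon, pos):
    """Degeneracy class (I, II or III) of position pos (0-2) within a codon."""
    unchanged = sum(
        1 for b in BASES
        if b != codon[pos]
        and CODON_TABLE[codon[:pos] + b + codon[pos + 1:]] == CODON_TABLE[codon]
    )
    if unchanged == 0:
        return "I"    # non-degenerate
    if unchanged == 3:
        return "III"  # totally degenerate
    return "II"       # partially degenerate


def count_sites(coding_seq):
    """Return (class counts, number of non-synonymous sites, number of synonymous sites)."""
    counts = {"I": 0, "II": 0, "III": 0}
    usable = len(coding_seq) - len(coding_seq) % 3
    for start in range(0, usable, 3):
        codon = coding_seq[start:start + 3].upper()
        if codon not in CODON_TABLE:  # assumption: skip codons with gaps or ambiguity
            continue
        for pos in range(3):
            counts[site_class(codon, pos)] += 1
    nonsyn_sites = counts["I"] + 2 * counts["II"] / 3
    syn_sites = counts["III"] + counts["II"] / 3
    return counts, nonsyn_sites, syn_sites


if __name__ == "__main__":
    print(count_sites("ATGGGGTTT"))
```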

Ka and Ks: example. Utrophin gene of rat (AJ002967) and mouse (Y12229)

Saturation: loss of phylogenetic signal. When compared homologous sequences have experienced too many residue substitutions since divergence, it is impossible to determine the phylogenetic tree, whatever the tree-building method used. NB: with distance methods, the saturation phenomenon may express itself through the mathematical impossibility of computing d. Example: Jukes-Cantor: p → 0.75 => d → ∞. NB: saturation often may not be detectable
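
The Jukes-Cantor example can be made concrete with the model's distance formula, d = -(3/4) ln(1 - 4p/3), which is not written out in the slides. In this sketch the function name and the convention of returning infinity for p ≥ 0.75 are illustrative choices.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor distance from the observed fraction p of differing sites."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    if p >= 0.75:
        return math.inf  # saturation: the logarithm's argument is no longer positive
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.6, 0.74, 0.749):
        print(p, round(jukes_cantor_distance(p), 3))
```

Near p = 0.75 a tiny change in p produces a very large change in d, so estimated distances between highly divergent sequences carry little information.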