Tree Inference Methods


Tree Inference Methods
Methods to infer phylogenetic trees: introduction
- There is no one correct method.
- Methods are grouped according to two criteria:
  - Does the method use discrete character states or distance matrices?
  - Does it cluster OTUs (operational taxonomic units) in a stepwise manner or evaluate a number of possible trees?

Tree Inference Methods
Discrete character state methods
- Data include sequences, morphological characters, physiological characters, restriction maps, etc.
- Each character is (usually) analyzed separately and independently.
- The best tree is deduced from a set of possible trees using the character state data.
- Information about individual characters is retained throughout the analysis and can be used to reconstruct ancestral states if necessary.
- Extremely computer intensive: beyond a certain number of taxa, it is impossible to evaluate all possible trees.
Distance matrix methods
- Calculate a measure of dissimilarity and abandon any information about the actual character states.
- The distance matrix, which represents the genetic or evolutionary distance, is then used to build a tree from the ground up.
- No need to evaluate multiple trees; computationally simple.
- Information is lost.
- No way to reconstruct ancestral states.
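To make the "impossible to evaluate all possible trees" point concrete, here is a minimal sketch that counts unrooted, fully bifurcating trees with the standard formula (2n - 5)!! for n labeled taxa. The function name is illustrative and does not come from the slides.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted bifurcating trees for n_taxa labeled taxa."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5)
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Four taxa give 3 trees and five give 15, but ten taxa already give more than two million, which is why exhaustive searches stop being practical so quickly.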

Tree Inference Methods
Tree evaluation methods
- With these methods, you have some criterion for selecting a 'best' tree based on the data.
- If possible, perform an exhaustive search: evaluate every possible tree using the criterion and choose the best one.
- Not possible for large numbers of OTUs.
- Algorithms allow us to evaluate subsets of trees, but we risk never identifying the best tree.
- Many 'best' trees are possible (even likely).
Clustering methods
- Construct a tree from nothing using specific algorithms.
- Cluster the two most closely related taxa, then add the next most closely related, and so on.
- Fast.
- Produce only one tree.
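The stepwise clustering idea can be sketched with UPGMA-style average linkage. The slides do not name a specific algorithm here, so UPGMA is swapped in as one common example; the function name, the nested-tuple tree representation, and the distance values are all made up for illustration.

```python
from itertools import combinations


def upgma_topology(names, dist):
    """Repeatedly merge the two closest clusters (average linkage).

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> distance between taxa a and b
    Returns the tree topology as nested tuples.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Find the closest pair among the current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        # Distance from the merged cluster to each other cluster is the
        # size-weighted mean of the two old distances.
        for c in sizes:
            if c != a and c != b:
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))]
                    + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Made-up distances for four hypothetical taxa.
pairwise = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
tree = upgma_topology(["A", "B", "C", "D"], pairwise)
print(tree)
```

With these distances A and B join first, then C, then D, and exactly one tree comes out, which matches the "fast, produce only one tree" description above.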

Models of DNA Evolution
Clustering methods: obtaining genetic distances
Nucleotide substitution models
- In order to calculate a genetic distance, we must have some model of DNA evolution on which to "hang our hat".
- General assumptions of most models (often violated at least slightly):
  - All sites are independent of one another.
  - Sites are homogeneous in their rates of change.
  - Markovian: given the present state, future changes are unaffected by past states.
  - Temporal homogeneity.

Models of DNA Evolution
- Compensatory changes are one example of a violated assumption: a change at one site is matched by a change at another, so the sites are not independent of one another.

Models of DNA Evolution
- Strictly speaking, these assumptions apply only to regions undergoing little or no selection.
- Our task is to find a mathematical method to model the (presumed) stochastic processes that introduced the observed differences among sequences.

Models of DNA Evolution
A model should:
- Provide a consistent measure of dissimilarity among sequences.
- Provide distances that are linearly proportional to the time since divergence (if a molecular clock is assumed).
- Provide distances representing the branch lengths on an evolutionary tree.
The basic model simply counts the differences: the p-distance (p = number of differences / number of sites compared).
- Intuitively simple, but probably accurate in very few cases because of homoplasy.
- Homoplasy: a character state shared by a set of sequences but not present in their common ancestor; a misleading phylogenetic signal.
- Most commonly, homoplasy is introduced by multiple and back substitutions.
- P-distances almost invariably underestimate the actual number of changes.
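A minimal sketch of the p-distance calculation. The function name, the example sequences, and the choice to skip gapped columns are assumptions made for illustration; other gap treatments exist.

```python
def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Skip columns that contain a gap in either sequence (an assumption).
    pairs = [
        (a, b)
        for a, b in zip(seq1.upper(), seq2.upper())
        if a != "-" and b != "-"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    differences = sum(a != b for a, b in pairs)
    return differences / len(pairs)


# Made-up 10-site alignment with three mismatches.
example = p_distance("ACGTACGTAC", "ACGTTCGAAG")
print(example)
```

Because the count only sees the final states, a site that changed twice still contributes at most one difference, which is where the underestimate comes from.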

Models of DNA Evolution
Saturation: the point at which any phylogenetic signal is lost; so many changes have occurred that the sequences are essentially random with respect to one another.
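Saturation can be illustrated numerically. The sketch below assumes the Jukes-Cantor model covered later in these slides, under which the expected p-distance after d substitutions per site is 3/4 (1 - e^(-4d/3)); the function name is illustrative.

```python
import math


def expected_p_jc69(d: float) -> float:
    """Expected proportion of differing sites after d substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"d = {d:.1f}  expected p = {expected_p_jc69(d):.3f}")
```

The observed proportion levels off near 0.75, the value expected for two random sequences with equal base frequencies, no matter how many more substitutions occur.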

Models of DNA Evolution
Substitutions as homogeneous Markov processes
- Markov processes are specified by Q matrices.
- Q is a 4x4 matrix in which each position gives the instantaneous rate of change from one base to another.
- μ = mutation rate
- a = rate at which the A-C change occurs relative to other possible changes
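The matrix itself did not survive in this transcript. As an assumption about what the slide showed, the general form commonly written with these symbols is given below, with rows and columns ordered A, C, G, T. Here b through f are the relative rates for the remaining base pairs, π_X is the equilibrium frequency of base X, and each diagonal entry (shown as a dot) is the negative sum of the other entries in its row.

```latex
Q = \begin{pmatrix}
\cdot        & \mu a \pi_C  & \mu b \pi_G  & \mu c \pi_T \\
\mu a \pi_A  & \cdot        & \mu d \pi_G  & \mu e \pi_T \\
\mu b \pi_A  & \mu d \pi_C  & \cdot        & \mu f \pi_T \\
\mu c \pi_A  & \mu e \pi_C  & \mu f \pi_G  & \cdot
\end{pmatrix}
```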

Models of DNA Evolution
Most Q matrices represent a time-homogeneous, time-continuous, stationary Markov process.
Assumptions:
- Markov property: at any given site in a sequence, the rate of change from base i to base j is independent of the base that occupied the site prior to i.
- Time homogeneous/continuous: substitution rates do not change over time.
- Stationary: the relative frequencies of the bases (πA, πC, πG, πT) are at equilibrium.
- Many models are also time-reversible: at equilibrium the amount of change from i to j equals that from j to i (πi qij = πj qji), so the process looks the same forward and backward in time.
These assumptions don't make much sense biologically but are necessary if substitutions are to be modeled as stochastic processes.

Models of DNA Evolution
Jukes-Cantor (JC69): the simplest model
Assumptions:
- Equilibrium frequencies of the four nucleotides are 25% each (πA = πC = πG = πT = 1/4).
- All substitutions are equally probable (a = b = c = d = e = f = 1).
Once the Q matrix is stated, the probability of change from one base to another over evolutionary time, P(t), is obtained by calculating the matrix exponential, P(t) = e^(Qt).
Matrix algebra is involved. I took it back in 1991. Forgive me.
The resulting correction is d = -(3/4) ln(1 - (4/3)p), where p is the observed distance (p-distance).
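The step from the matrix exponential to the correction formula can be sketched as follows. This assumes a parameterization in which every off-diagonal entry of Q equals μ/4; the time t and the two-lineage setup are introduced here for illustration and are not in the slides.

```latex
\begin{aligned}
P_{ii}(t) &= \tfrac14 + \tfrac34 e^{-\mu t}, \qquad
P_{ij}(t) = \tfrac14 - \tfrac14 e^{-\mu t} \quad (i \neq j) \\
p &= 1 - P_{ii}(2t) = \tfrac34\left(1 - e^{-2\mu t}\right)
  && \text{(two lineages, each evolving for time } t\text{)} \\
d &= 2t \cdot \tfrac34 \mu = \tfrac32 \mu t
  && \text{(expected substitutions per site)} \\
p &= \tfrac34\left(1 - e^{-4d/3}\right)
\;\Longrightarrow\;
d = -\tfrac34 \ln\!\left(1 - \tfrac43 p\right)
\end{aligned}
```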

Models of DNA Evolution
Using JC69
- Note the parallel substitution at position 9.
- The actual distance is higher than the observed distance.
- 6 changes actually occurred.

Models of DNA Evolution
Using JC69
- p = 4/10 = 0.4
- d (JC69) = -(3/4) ln[1 - (4/3)(0.4)] = 0.5716
- This is a more reasonable estimate of the number of changes that actually occurred: about 5.7 over 10 sites, compared with the 4 observed.
- What assumptions of JC69 are violated?
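The worked example can be checked with a few lines of code. This is a minimal sketch; the function name is illustrative.

```python
import math


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance from an observed p-distance."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 distance is undefined for p >= 0.75 (saturation)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


d = jc69_distance(4 / 10)
print(f"p = 0.4 -> d = {d:.4f}")
```

The guard on p reflects the logarithm's domain: once the observed distance reaches 0.75, the correction has no finite value.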

Models of DNA Evolution
Kimura 2-parameter (K2P)
- Generally, transitions occur at higher rates than transversions.
- This violates the equal-rate assumption of JC69.

Models of DNA Evolution
Kimura 2-parameter
- Different rates must be considered for transitions (α) and transversions (β), which changes the Q matrix.
- π remains 1/4 for all bases.
- d = (1/2) ln[1/(1 - 2P - Q)] + (1/4) ln[1/(1 - 2Q)]
- P and Q are the proportions of sites that differ between the sequences due to transitions and transversions, respectively.
- Note what happens if α = β ...
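A minimal sketch of the K2P distance, counting transitions and transversions directly from two aligned sequences. The function names, the example sequences, and the decision to skip any column that is not A, C, G, or T in both sequences are assumptions for illustration.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_from_proportions(P: float, Q: float) -> float:
    """K2P distance from transition (P) and transversion (Q) proportions."""
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("distance undefined: sequences are too divergent")
    return 0.5 * math.log(1.0 / a) + 0.25 * math.log(1.0 / b)


def k2p_distance(seq1: str, seq2: str) -> float:
    """Classify each mismatch as a transition or transversion, then correct."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguity codes (an assumption)
        sites += 1
        if x == y:
            continue
        # A transition stays within purines or within pyrimidines.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    return k2p_from_proportions(transitions / sites, transversions / sites)


# Made-up example: two transitions (A/G) and one transversion (A/C) in 10 sites.
d_k2p = k2p_distance("AAAAAAAAAA", "GGCAAAAAAA")
print(round(d_k2p, 4))
```

When one third of the observed differences are transitions (P = p/3, Q = 2p/3), this formula gives the same value as the Jukes-Cantor correction for the same p.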

Models of DNA Evolution
Felsenstein (1981): F81
- In most taxa, A+T ≠ C+G.
- If there are only a few G's, substitutions involving G (for example, G to A) will be rare compared to other substitutions.
- This violates the equal-frequency (and therefore equal-rate) assumptions of JC69.

Models of DNA Evolution
Felsenstein (1981): F81
- Different frequencies must be considered for each base while the relative substitution rates are the same for all, which changes the Q matrix.
- π can differ for each base (πA ≠ πC ≠ πG ≠ πT).
- Note that this model assumes similar base composition for all sequences under consideration.
- Note what happens if πA = πC = πG = πT ...

Models of DNA Evolution
Hasegawa, Kishino and Yano (HKY85)
- Combines F81 and K2P: unequal base frequencies plus separate transition and transversion rates.
General Time Reversible (GTR)
- Allows all six pairs of substitutions to have distinct rates.
- Allows unequal base frequencies.
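The relationship between these models can be sketched by building a GTR rate matrix and treating the simpler models as restricted parameter choices. This assumes the common parameterization q_ij = μ · r_ij · π_j for i ≠ j; the function names and the example frequencies are illustrative.

```python
BASES = "ACGT"


def gtr_q_matrix(rates, freqs, mu=1.0):
    """Build a 4x4 rate matrix (rows/columns in A, C, G, T order).

    rates: relative rates keyed by base pair: 'AC', 'AG', 'AT', 'CG', 'CT', 'GT'
    freqs: equilibrium base frequencies keyed by base
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                pair = "".join(sorted(x + y))
                q[i][j] = mu * rates[pair] * freqs[y]
        # Diagonal entry makes each row sum to zero.
        q[i][i] = -sum(q[i])
    return q


def jc69_q():
    """JC69: all relative rates equal, all frequencies 1/4."""
    rates = {pair: 1.0 for pair in ("AC", "AG", "AT", "CG", "CT", "GT")}
    return gtr_q_matrix(rates, {b: 0.25 for b in BASES})


def hky85_q(kappa, freqs):
    """HKY85: transitions (A<->G, C<->T) scaled by kappa, unequal frequencies."""
    rates = {"AC": 1.0, "AG": kappa, "AT": 1.0, "CG": 1.0, "CT": kappa, "GT": 1.0}
    return gtr_q_matrix(rates, freqs)


# Made-up base frequencies for illustration.
example_freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
q_hky = hky85_q(2.0, example_freqs)
for row in q_hky:
    print(["%.3f" % value for value in row])
```

Setting kappa to 1 in `hky85_q` gives F81, and using equal frequencies with kappa different from 1 gives K2P, so each simpler model is a constrained version of the more general one.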

Models of DNA Evolution
A variety of other models exist:
- Tajima-Nei (1984): refines JC69 for more accurate rates of nucleotide substitution.
- Tamura 3-parameter (1992): corrects for multiple hits, accounting for transition/transversion and G+C content biases.
- Tamura-Nei (1993): corrects for multiple hits and treats purine and pyrimidine transitions differently.

Models of DNA Evolution
Varying substitution rates among sites in sequences (rate heterogeneity) can be compensated for.
- Most often, a gamma (Γ) distribution is used.
- An α value that determines the shape of the distribution can be estimated from the data and incorporated into the calculations.

Models of DNA Evolution
- Small values of α (≤ 1): L-shaped Γ distribution and extreme rate variation among sites; most sites are nearly invariable but a few sites have very high substitution rates.
- Large values of α (> 1): bell-shaped Γ distribution and minimal rate variation among sites.
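The effect of α on the shape can be sketched by evaluating the gamma density directly. This assumes the common convention of fixing the mean rate at 1 (shape α, rate α), which the slides do not state; the function name is illustrative.

```python
import math


def gamma_rate_density(r: float, alpha: float) -> float:
    """Density at relative rate r > 0 of a gamma distribution with mean 1."""
    return (
        alpha ** alpha * r ** (alpha - 1.0) * math.exp(-alpha * r)
        / math.gamma(alpha)
    )


for alpha in (0.5, 5.0):
    values = [gamma_rate_density(r, alpha) for r in (0.01, 0.5, 0.8, 1.0, 2.0)]
    print(alpha, ["%.3f" % v for v in values])
```

With α = 0.5 the density is highest next to zero and falls away (L-shaped), while with α = 5 it rises to a peak at r = 0.8 and falls on both sides (bell-shaped).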

Models of DNA Evolution
Choosing the wrong model may give the wrong tree.
- Wrong model → incorrect branch lengths, Ti/Tv ratios, divergence rate estimates, mutation rates, and divergence dates.
What model to choose and how to choose it?
- Generally, more complex models fit the data better.
- Thus, it may seem best to use the most complex model by default.
- However, more parameters must be estimated, making computation more difficult (longer) and increasing the possibility of error in estimation.
- Find a happy medium between complexity and practicality.

Models of DNA Evolution
Choosing a model
- The fit of a model to the data is measured by the likelihood: the probability of the data (D), given a model of evolution (M), a vector of model parameters (θ), a tree topology (τ) and a vector of branch lengths (ν):
  L = P(D | M, θ, τ, ν)
- The log likelihood is often used to ease computation:
  l = ln P(D | M, θ, τ, ν)
Likelihood ratio test (LRT)
- LRT statistic: LRT = 2 (l1 - l0)
- l1 = the maximum log likelihood under the more complex model (alternative hypothesis)
- l0 = the maximum log likelihood under the less complex model (null hypothesis)
- Always ≥ 0 for nested models.
- A large value means the more complex model fits better.
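A minimal sketch of a likelihood ratio test for two nested models. It assumes the statistic is compared with a chi-square distribution whose degrees of freedom equal the difference in the number of free parameters. The log-likelihood values are made up, and the small chi-square tail function is written out only to avoid a third-party dependency.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square distribution with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-z) * sum over k < df/2 of z^k / k!
        return math.exp(-z) * sum(
            z ** k / math.factorial(k) for k in range(df // 2)
        )
    # Odd df: erfc(sqrt(z)) plus terms z^(k - 1/2) / Gamma(k + 1/2)
    tail = math.erfc(math.sqrt(z))
    tail += math.exp(-z) * sum(
        z ** (k - 0.5) / math.gamma(k + 0.5) for k in range(1, (df - 1) // 2 + 1)
    )
    return tail


def likelihood_ratio_test(l0: float, l1: float, df: int):
    """Return the LRT statistic 2 (l1 - l0) and its chi-square p-value."""
    statistic = 2.0 * (l1 - l0)
    return statistic, chi2_sf(statistic, df)


# Hypothetical maximized log likelihoods: simpler model vs. one extra parameter.
statistic, p_value = likelihood_ratio_test(l0=-2345.6, l1=-2339.1, df=1)
print(f"LRT = {statistic:.2f}, p = {p_value:.5f}")
```

A small p-value here would favor the more complex model; a large one suggests the extra parameter is not buying enough improvement in fit.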

Models of DNA Evolution
Choosing a model
Hierarchical likelihood ratio test (hLRT)
- Most of the models described above are nested, or hierarchical; e.g., JC69 is a special case of F81 in which the base frequencies are equal.
- ModelTest will perform all possible comparisons and evaluate them using a χ² test.

Models of DNA Evolution
Choosing a model
Information criteria
- The likelihood of each model is penalized by a function of the number of free parameters (K) in the model; more parameters = higher penalty.
- Akaike Information Criterion (AIC): AIC = -2l + 2K
- AIC estimates the amount of information lost when we use a particular model.
- Small values are better.
- Software: ModelTest, ProtTest
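A minimal sketch of ranking candidate models by AIC. The log-likelihood values are made up, and K here counts only the free substitution-model parameters (0 for JC69, 1 for K2P, 4 for HKY85), which is a simplifying assumption since branch lengths are ignored.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: -2 l + 2 K."""
    return -2.0 * log_likelihood + 2.0 * n_params


# Hypothetical (log likelihood, K) pairs for three candidate models.
candidates = {
    "JC69": (-2345.6, 0),
    "K2P": (-2339.1, 1),
    "HKY85": (-2338.9, 4),
}
scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:6s} AIC = {score:.1f}")
print("preferred:", best)
```

In this made-up case HKY85 has the highest likelihood, but its extra parameters cost more than they gain, so the lower AIC goes to K2P.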

Models of DNA Evolution
Choosing a model
Bayesian methods
- Bayes factors are similar to the LRT.
- Posterior probabilities can be calculated.
- Most commonly, the Bayesian Information Criterion (BIC) is calculated: BIC = -2l + K ln n, where n is the sample size (commonly taken as the number of sites in the alignment).
- Smaller = better.
- Software: ModelTest, ProtTest
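A minimal sketch contrasting BIC with AIC, using the standard definition BIC = -2l + K ln n. The model labels, log likelihoods, parameter counts, and the alignment length of 1000 sites are all made up for illustration.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: -2 l + 2 K."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian Information Criterion: -2 l + K ln n."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


# Hypothetical fits: a simple model and a richer one that fits slightly better.
n_sites = 1000
models = {"simple": (-1000.0, 1), "complex": (-996.0, 4)}

best_aic = min(models, key=lambda m: aic(*models[m]))
best_bic = min(models, key=lambda m: bic(*models[m], n_sites))
print("AIC prefers:", best_aic)
print("BIC prefers:", best_bic)
```

Because ln n exceeds 2 once n is larger than about 7, BIC penalizes each extra parameter more heavily than AIC, so the two criteria can disagree, as they do in this example.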