Tree Inference Methods

Name: Tree Inference Methods
Uploaded: 2017-07-15T09:04:22+00:00
Duration: PTM15S6
Channel: Alyson McBride
Description: Tree Inference Methods

Tree Inference Methods
Methods to infer phylogenetic trees – Introduction
There is no one correct method.
Methods are grouped according to two criteria:
  Does the method use discrete character states or distance matrices?
  Does it cluster operational taxonomic units (OTUs) in a stepwise manner or evaluate a number of possible trees?

Discrete character state methods
Include sequences, morphological characters, physiological characters, restriction maps, etc.
Each character is (usually) analyzed separately and independently.
The best tree is deduced from a set of possible trees using the character state data.
These methods retain information about individual characters throughout the analysis, so they can be used to reconstruct ancestral states if necessary.
Extremely computer intensive.
Beyond a certain number of taxa, it is impossible to evaluate all possible trees.

Distance matrix methods
Calculate a measure of dissimilarity and abandon any information about the actual character states.
The distance matrix represents the genetic or evolutionary distance.
The distance matrix is then used to build a tree from the ground up.
No need to evaluate multiple trees; computationally simple.
Information is lost.
No way to reconstruct ancestral states.
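To make the "impossible to evaluate all possible trees" point concrete, the sketch below counts strictly bifurcating trees. The slides do not give a formula; this uses the standard result that n taxa have (2n-5)!! unrooted and (2n-3)!! rooted binary topologies. The function names are illustrative only.

```python
def num_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating topologies: (2n-5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product when n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def num_rooted_trees(n_taxa):
    """Number of rooted bifurcating topologies: (2n-3)!! for n >= 2.

    Rooting a tree is equivalent to adding one extra taxon to an unrooted tree.
    """
    return num_unrooted_trees(n_taxa + 1)


if __name__ == "__main__":
    for n in (4, 10, 20):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Ten taxa already give about two million unrooted topologies, which is why exhaustive evaluation stops being practical so quickly.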

Tree evaluation methods
With these methods, you have some criterion for selecting a ‘best’ tree based on the data.
If possible, perform an exhaustive search of all possible trees, evaluate all of them using the criterion, and choose the best one.
This is not possible for large numbers of OTUs.
Algorithms allow us to evaluate subsets of trees, but we risk never identifying the best tree.
Many ‘best’ trees are possible (even likely).

Clustering methods
Construct a tree from nothing using specific algorithms.
Cluster the two most closely related taxa.
Then add the next most closely related taxon, and so on.
Fast.
Produce only one tree.
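The stepwise clustering idea can be sketched with UPGMA, one common clustering algorithm. The slides do not name a specific algorithm, so UPGMA is a stand-in here, and the four-taxon distance matrix is made up for illustration.

```python
from itertools import combinations


def upgma(names, matrix):
    """Cluster taxa stepwise; returns the topology as nested tuples.

    names  : list of taxon labels
    matrix : symmetric list-of-lists of pairwise distances
    """
    # Each cluster id maps to (subtree, number of taxa in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {(i, j): matrix[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Distance from the merged cluster to every remaining cluster is the
        # size-weighted average of the two old distances.
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        # Drop all distances involving the two merged clusters.
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D"]        # hypothetical taxa
    d = [[0, 2, 6, 10],
         [2, 0, 6, 10],
         [6, 6, 0, 10],
         [10, 10, 10, 0]]              # hypothetical distances
    print(upgma(taxa, d))
```

Note that the procedure returns exactly one tree and never compares alternatives, which is the trade-off the slide describes.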

Models of DNA Evolution
Clustering Methods: Obtaining Genetic Distances
Nucleotide substitution models
In order to calculate a genetic distance, we must have some model of DNA evolution on which to “hang our hat”.
General assumptions of most models (often violated at least slightly):
  All sites are independent of one another.
  Sites are homogeneous in their rates of change.
  Markovian: given the present state, future changes are unaffected by past states.
  Temporal homogeneity: rates do not change over time.
Example of a violation: compensatory changes, in which a change at one site is offset by a change at another, so the sites are not independent.
Strictly speaking, these assumptions apply only to regions undergoing little or no selection.
Our task is to determine a mathematical method to model the (presumed) stochastic processes that introduced the observed differences among sequences.

A model should:
  Provide a consistent measure of dissimilarity among sequences.
  Provide distances that are linearly proportional to the time since divergence (if a molecular clock is assumed).
  Provide distances representing the branch lengths on an evolutionary tree.
The basic model is just counting the number of differences: the p-distance (p = number of differences / number of sites).
It is intuitively simple but probably accurate in only a few cases because of homoplasy.
Homoplasy – a character state shared by a set of sequences but not present in the common ancestor; a misleading phylogenetic signal.
Most commonly, homoplasy is introduced by multiple and back substitutions.
P-distances almost invariably underestimate the actual number of changes.
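The p-distance defined above is a straightforward count. This is a minimal sketch that assumes two pre-aligned sequences of equal length with no gaps or ambiguity codes; the example sequences are invented.

```python
def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned, non-empty and equal in length")
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differences / len(seq1)


if __name__ == "__main__":
    # Hypothetical 10-site alignment with 2 observed differences.
    print(p_distance("ACGTACGTAC", "ACGTTCGAAC"))
```

Because it counts only the differences that are still visible, a site that changed twice (or changed and reverted) contributes at most one difference, or none.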

Saturation – the point at which any phylogenetic signal is lost; so many changes have occurred that the sequences are essentially random with respect to one another.

Substitutions as homogeneous Markov processes
Markov processes are specified in Q matrices.
A Q matrix is a 4x4 matrix in which each position gives the instantaneous rate of change from one base to another.
  μ = mutation rate
  a = rate at which an A-C change occurs relative to other possible changes

Most Q matrices represent a time-homogeneous, time-continuous, stationary Markov process.
Assumptions:
  Markov property – at any given site in a sequence, the rate of change from base i to base j is independent of the base that occupied the site prior to i.
  Time homogeneous/continuous – substitution rates do not change over time, and a change can occur at any instant.
  Stationary – the relative frequencies of the bases (πA, πC, πG, πT) are at equilibrium.
Many models are also time-reversible – the process looks the same forward and backward in time, so that πi·qij = πj·qji. The rate from i to j equals the rate from j to i only when the base frequencies are equal.
These assumptions don’t make much sense biologically but are necessary if substitutions are to be modeled as stochastic processes.

Jukes-Cantor (JC69) – the simplest model
Assumptions:
  Equilibrium frequencies for the four nucleotides are 25% each (πA = πC = πG = πT = 1/4).
  Equal probabilities exist for any substitution (a = b = c = d = e = f = 1).
Once the Q matrix is stated, the probability of change from one base to another over evolutionary time, P(t), is obtained by calculating the matrix exponential.
Matrix algebra is involved. I took it back in ... Forgive me.
The resulting correction becomes
  d = -(3/4) ln(1 - (4/3)p)
where p = the observed distance (p-distance).
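The matrix-exponential step can be checked numerically. The sketch below assumes one particular scaling of the JC69 rate matrix (off-diagonal entries μ/4, diagonal -3μ/4), because the Q matrix itself did not survive in this transcript. The small Taylor-series `mat_exp` helper is written only for this illustration; it is not a library routine.

```python
import math


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(M, squarings=10, terms=12):
    """Matrix exponential by scaling-and-squaring with a Taylor series."""
    n = len(M)
    scale = 2.0 ** squarings
    small = [[x / scale for x in row] for row in M]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds small**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[r + t for r, t in zip(rr, tt)] for rr, tt in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def jc69_q(mu=1.0):
    """JC69 rate matrix under the assumed scaling: all changes equally likely."""
    return [[-0.75 * mu if i == j else 0.25 * mu for j in range(4)]
            for i in range(4)]


def jc69_p(t, mu=1.0):
    """Transition probabilities P(t) = exp(Q t)."""
    return mat_exp([[q * t for q in row] for row in jc69_q(mu)])


if __name__ == "__main__":
    t = 0.5
    P = jc69_p(t)
    print(P[0][0], 0.25 + 0.75 * math.exp(-t))   # same base after time t
    print(P[0][1], 0.25 - 0.25 * math.exp(-t))   # one specific different base
```

Under this scaling the expected proportion of differing sites is p = (3/4)(1 - e^(-μt)) and the expected number of substitutions per site is d = (3/4)μt; eliminating μt gives the correction d = -(3/4) ln(1 - (4/3)p) quoted on the slide.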

Using JC69
Note the parallel substitution at position 9.
The actual distance is higher than the observed distance.
6 changes actually occurred.

Using JC69
p = 4/10 = 0.4
d (JC69) = -(3/4) ln[1 - (4/3)(0.4)] ≈ 0.572
A more reasonable estimate of the number of actual changes that occurred.
What assumptions of JC69 are violated?
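The worked example can be reproduced in a few lines. This sketch simply evaluates the JC69 correction d = -(3/4) ln(1 - (4/3)p); the function name is illustrative.

```python
import math


def jc69_distance(p):
    """JC69-corrected distance from an observed p-distance."""
    if not 0.0 <= p < 0.75:
        # At p >= 0.75 the logarithm is undefined: the sequences are saturated.
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    p = 4 / 10
    print(round(jc69_distance(p), 3))  # about 0.572 substitutions per site
```

For 10 sites, d ≈ 0.572 corresponds to roughly 5.7 substitutions, much closer to the 6 changes that actually occurred than the 4 observed differences.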

Kimura 2-parameter (K2P)
Generally, transitions occur at higher rates than transversions.
This violates the rate assumptions of JC69.

Kimura 2-parameter
A different rate must be considered for transitions (α) and transversions (β), changing the Q matrix to:
π remains ¼ for all bases.
  d = (1/2) ln[1/(1 - 2P - Q)] + (1/4) ln[1/(1 - 2Q)]
P and Q are the proportional differences between sequences due to transitions and transversions, respectively.
Note, if α = β …
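A sketch of the K2P distance, computed directly from the formula d = (1/2) ln[1/(1 - 2P - Q)] + (1/4) ln[1/(1 - 2Q)]. It assumes aligned sequences containing only A, C, G and T; the example sequences are invented.

```python
import math

PURINES = {"A", "G"}  # C and T are the pyrimidines


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned, non-empty and equal in length")
    transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    P = transitions / len(seq1)
    Q = transversions / len(seq1)
    return (0.5 * math.log(1.0 / (1.0 - 2.0 * P - Q))
            + 0.25 * math.log(1.0 / (1.0 - 2.0 * Q)))


if __name__ == "__main__":
    # Hypothetical alignment: 2 transitions (A->G) and 1 transversion (A->C).
    print(k2p_distance("AAAAAAAAAA", "GGAAAAAAAC"))
```

Here P = 0.2 and Q = 0.1, so the p-distance is 0.3 while the corrected distance is about 0.40.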

Felsenstein (1981) - F81
In most taxa, A+T ≠ C+G.
If there are only a few G’s, the rate of substitution from G to A will be low compared to other substitutions.
This violates the rate assumptions of JC69.

Felsenstein (1981) - F81
Different frequencies must be considered for each base while substitution rates are the same for all, changing the Q matrix to:
π is unique for each base (πA ≠ πC ≠ πG ≠ πT).
Note that this model assumes similar base composition for all sequences under consideration.
Note, if πA = πC = πG = πT …

Hasegawa, Kishino and Yano (HKY85)
Combines F81 and K2P.

General Time Reversible (GTR)
Allows all six pairs of substitutions to have distinct rates.
Allows unequal base frequencies.
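How these models nest can be seen by building their rate matrices from one function. The sketch below assumes a common parameterization, q_ij = μ · r_ij · π_j for i ≠ j, with each diagonal entry set so the row sums to zero. The slides' own Q matrices are missing from this transcript, so the layout, the pair keys and the example values are all assumptions.

```python
BASES = "ACGT"


def gtr_rate_matrix(rates, freqs, mu=1.0):
    """Build a GTR-style Q matrix.

    rates : dict of relative rates keyed by base pair
            ("AC", "AG", "AT", "CG", "CT", "GT")
    freqs : dict of equilibrium base frequencies keyed by base
    """
    Q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                pair = "".join(sorted(x + y))
                Q[i][j] = mu * rates[pair] * freqs[y]
        Q[i][i] = -sum(Q[i])  # diagonal is still 0.0 here, so this is -row sum
    return Q


def is_time_reversible(Q, freqs, tol=1e-12):
    """Check detailed balance: pi_i * q_ij == pi_j * q_ji for all i, j."""
    return all(
        abs(freqs[x] * Q[i][j] - freqs[y] * Q[j][i]) < tol
        for i, x in enumerate(BASES) for j, y in enumerate(BASES)
    )


if __name__ == "__main__":
    equal = {b: 0.25 for b in BASES}
    skewed = {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30}   # hypothetical
    all_one = {p: 1.0 for p in ("AC", "AG", "AT", "CG", "CT", "GT")}
    kappa = 4.0                                              # hypothetical
    ti_tv = dict(all_one, AG=kappa, CT=kappa)

    jc69 = gtr_rate_matrix(all_one, equal)    # equal rates, equal frequencies
    f81 = gtr_rate_matrix(all_one, skewed)    # equal rates, unequal frequencies
    hky85 = gtr_rate_matrix(ti_tv, skewed)    # two rate classes, unequal frequencies
    print(is_time_reversible(hky85, skewed))
```

Supplying six distinct values in `rates` together with unequal `freqs` gives the full GTR model; the simpler models are the same matrix with constraints applied.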

A variety of other models exist:
  Tajima-Nei (1984) – refines JC69 for more accurate rates of nucleotide substitution
  Tamura 3-parameter (1992) – corrects for multiple hits
  Tamura-Nei (1993) – corrects for multiple hits, considers purine and pyrimidine transitions differently

Varying substitution rates among sites in sequences (rate heterogeneity) can be compensated for.
Most times, a gamma (Γ) distribution is used.
An α value that determines the shape of the distribution can be estimated from the data and incorporated into calculations.

Small values of α (<1) = L-shaped Γ-distribution and extreme rate variation among sites; most sites are nearly invariable but a few sites have very high substitution rates.
Large values of α (>1) = bell-shaped Γ-distribution and minimal rate variation among sites.
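The effect of α can be seen by drawing per-site relative rates from a gamma distribution. This sketch assumes the usual convention of fixing the mean rate at 1 (shape α, scale 1/α), so that the variance is 1/α. The α values, site count and the 0.1 threshold are arbitrary choices for illustration.

```python
import random


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative substitution rates for n_sites from Gamma(alpha, 1/alpha)."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); scale = 1/alpha gives mean 1.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def fraction_slow(rates, threshold=0.1):
    """Fraction of sites evolving at less than `threshold` times the mean rate."""
    return sum(1 for r in rates if r < threshold) / len(rates)


if __name__ == "__main__":
    for alpha in (0.2, 1.0, 5.0):
        rates = sample_site_rates(alpha, 20000)
        mean = sum(rates) / len(rates)
        print(f"alpha={alpha}: mean={mean:.2f}, "
              f"slow sites={fraction_slow(rates):.1%}, max={max(rates):.1f}")
```

With a small α a large share of sites barely change while a few evolve many times faster than average; with a large α nearly all sites sit close to the mean rate.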

Choosing the wrong model may give the wrong tree.
Wrong model → incorrect branch lengths, transition/transversion (Ti/Tv) ratios, divergence rate estimates, mutation rates, and divergence dates.
What model to choose and how to choose it?
Generally, more complex models fit the data better.
Thus, it may seem best to use the most complex model by default.
However, more parameters must be estimated, making computation more difficult (longer) and increasing the possibility of error in estimation.
Find a medium between complexity and practicality.

Choosing a model
The fit of a model to the data is proportional to the probability of the data (D), given a model of evolution (M), a vector of model parameters (θ), a tree topology (τ) and a vector of branch lengths (ν):
  L = P(D | M, θ, τ, ν)
Often the log likelihood is used to ease computation:
  l = ln P(D | M, θ, τ, ν)
Likelihood ratio test (LRT)
LRT statistic:
  LRT = 2(l1 – l0)
  l1 = the maximum log likelihood under the more complex model (alternative hypothesis)
  l0 = the maximum log likelihood under the less complex model (null hypothesis)
Always ≥ 0.
Large value = the more complex model is better.
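A minimal sketch of the LRT computation. It assumes the two models are nested and that the statistic is compared against a chi-square distribution with degrees of freedom equal to the difference in free parameters. The log-likelihood values are invented, and the small `chi2_sf` helper is written for this example so that only the standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution (integer df >= 1)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 1 and df = 2, then step up two df at a time.
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(half))
    else:
        a, q = 1.0, math.exp(-half)
    while a < df / 2.0 - 1e-9:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(lnl_null, lnl_alt, extra_params):
    """Return (LRT statistic, p-value) for two nested models."""
    statistic = 2.0 * (lnl_alt - lnl_null)
    return statistic, chi2_sf(statistic, extra_params)


if __name__ == "__main__":
    # Hypothetical maximum log likelihoods: JC69 (null) vs K2P (one extra parameter).
    stat, p_value = likelihood_ratio_test(-2500.0, -2490.0, extra_params=1)
    print(stat, p_value)
```

A small p-value means the extra parameter improves the fit more than chance would explain, so the simpler model is rejected.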

Choosing a model
Hierarchical likelihood ratio test (hLRT)
Most of the models described above are nested, or hierarchical; e.g., JC69 is a special case of F81 in which the base frequencies are equal.
ModelTest will perform all possible comparisons and evaluate them using a χ² test.

Choosing a model
Information criteria
The likelihood of each model is penalized by a function of the number of free parameters (K) in the model; more parameters = higher penalty.
Akaike Information Criterion (AIC)
  AIC = -2l + 2K
AIC estimates the amount of information lost when we use a particular model.
Small values are better.
ModelTest, ProtTest
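A short sketch of model ranking by AIC = -2l + 2K. The log-likelihood values are invented, and K counts only substitution-model parameters (branch lengths are left out), which is a simplification made for this example.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: smaller is better."""
    return -2.0 * log_likelihood + 2.0 * n_params


if __name__ == "__main__":
    # model name -> (hypothetical maximum log likelihood, free parameters K)
    candidates = {
        "JC69": (-2500.0, 0),
        "HKY85": (-2450.0, 4),
        "GTR": (-2448.0, 8),
    }
    scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
    for name, score in sorted(scores.items(), key=lambda item: item[1]):
        print(name, score)
```

In this made-up case GTR has the highest likelihood, but its four extra parameters cost more than the small gain in fit, so HKY85 has the lowest AIC.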

Choosing a model
Bayesian methods
Bayes factors are similar to the LRT.
Posterior probabilities can be calculated.
Most commonly, the Bayesian Information Criterion (BIC) is calculated:
  BIC = -2l + K ln n
where n is the sample size (often taken to be the number of sites in the alignment).
Smaller = better.
ModelTest & ProtTest
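A matching sketch for BIC, using the standard definition BIC = -2l + K ln n. The log-likelihood values and the alignment length are invented, and taking n as the number of sites is an assumption (a common approximation).

```python
import math


def bic(log_likelihood, n_params, sample_size):
    """Bayesian Information Criterion: smaller is better."""
    return -2.0 * log_likelihood + n_params * math.log(sample_size)


if __name__ == "__main__":
    n_sites = 1000  # hypothetical alignment length used as the sample size
    # model name -> (hypothetical maximum log likelihood, free parameters K)
    candidates = {
        "JC69": (-2500.0, 0),
        "HKY85": (-2450.0, 4),
        "GTR": (-2448.0, 8),
    }
    scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in candidates.items()}
    for name, score in sorted(scores.items(), key=lambda item: item[1]):
        print(name, round(score, 2))
```

Because ln n exceeds 2 once n is larger than about 7, BIC penalizes each extra parameter more heavily than AIC for any realistic alignment.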
