Estimating Evolutionary Distances from DNA Sequences Lecture 14 ©Shlomo Moran, parts based on Ilan Gronau.


1 Estimating Evolutionary Distances from DNA Sequences Lecture 14 ©Shlomo Moran, parts based on Ilan Gronau

2 Distance Based Methods for Reconstructing Phylogenies
1. Compute distances between all taxon pairs.
2. Find an edge-weighted tree that best describes the distances.
The distances are implied by the assumed “model tree”.
[Figure: an edge-weighted tree with edge weights 4, 5, 7, 2, 1, 2, 10, 6, 1.]

3 Model Tree: A Probabilistic Model of Evolution
A model tree consists of a DNA distribution at the root and stochastic transition matrices at the edges. Mutations along the edges occur with probabilities defined by the transition matrices.
[Figure: a model tree whose vertices are labeled with DNA sequences, e.g. AATCCTG, ATAGCTG, AATGGGC.]

4 From Model Tree to Additive Distances
We need to assign lengths d(e) to the edges of the tree such that the resulting distance matrix is additive: for all u,v, d(u,v) = ∑{d(e): edge e is on the path connecting u and v}. We do this for a simple evolutionary model, the CFN model.
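The additivity condition can be illustrated with a minimal Python sketch. The tree, its vertex names, and its edge lengths below are hypothetical and not taken from the lecture; the function simply sums edge lengths along the unique path between each pair of vertices.

```python
from collections import defaultdict

def path_distances(edges):
    """Return d(u, v) for all vertex pairs of a tree given as (u, v, length) triples."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = {}
    for root in list(adj):
        # Depth-first walk from root; in a tree the path to each vertex is unique.
        stack = [(root, None, 0.0)]
        while stack:
            node, parent, d = stack.pop()
            dist[(root, node)] = d
            for nxt, w in adj[node]:
                if nxt != parent:
                    stack.append((nxt, node, d + w))
    return dist

# Hypothetical 4-leaf tree: leaves a, b hang off x; leaves c, d hang off y.
edges = [("a", "x", 4), ("b", "x", 5), ("x", "y", 2), ("c", "y", 1), ("d", "y", 6)]
dist = path_distances(edges)
print(dist[("a", "b")], dist[("a", "c")], dist[("b", "d")])
```

Here d(a,c) is the sum of the lengths of the three edges a-x, x-y, y-c, which is exactly the condition the slide asks the edge lengths to satisfy.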

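5 The CFN (Cavender-Farris-Neyman) 2-States Model
The CFN 2-states model distinguishes between two types of DNA bases: purines {A,G} and pyrimidines {C,T}. A transition is a substitution within one class (purine to purine, or pyrimidine to pyrimidine); a transversion is a substitution between the two classes.
CFN: ignore transitions, count only transversions.
[Figure: the four bases grouped into purines and pyrimidines, with transitions and transversions labeled by the rates α and β.]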

6 The CFN 2-States Model
Purines are marked by 0 and pyrimidines by 1.
Uniform distribution on the root: prob(s(r)=0) = prob(s(r)=1) = 0.5.
On each (directed) edge e = (u → v): $0 < p(s(v)=0 \mid s(u)=1) = p(s(v)=1 \mid s(u)=0) = p_e < 0.5$. The transition matrix of e (rows: state at u, columns: state at v) is
$$\begin{pmatrix} 1-p_e & p_e \\ p_e & 1-p_e \end{pmatrix}.$$
This implies a uniform distribution at each vertex.
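A minimal Python sketch of the 0/1 encoding and of the uniformity claim. The function names are illustrative, not part of the lecture; `propagate` applies the symmetric CFN transition matrix of one edge to a state distribution.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")

def to_cfn_states(seq):
    """Encode a DNA string as CFN states: purine -> 0, pyrimidine -> 1."""
    states = []
    for base in seq.upper():
        if base in PURINES:
            states.append(0)
        elif base in PYRIMIDINES:
            states.append(1)
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return states

def propagate(dist, p_e):
    """Push a distribution (prob of 0, prob of 1) across an edge with change probability p_e."""
    q0, q1 = dist
    return (q0 * (1 - p_e) + q1 * p_e, q0 * p_e + q1 * (1 - p_e))

print(to_cfn_states("AATCCTG"))
print(propagate((0.5, 0.5), 0.2))
```

Because the matrix is symmetric, the uniform distribution is mapped to itself on every edge, which is why every vertex of the tree ends up uniform.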

7 Mutation probabilities of edges are undirected
The mutation (state-change) probabilities of each (directed) edge in a CFN model tree are the same in both directions. For each edge e = (u,v) and b ∈ {0,1}:
$$p(s(v)=b \mid s(u)=1-b) = p(s(u)=1-b \mid s(v)=b) = p_e.$$
[Figure: the transition matrices of the edge in the directions u → v and v → u are identical, i.e. $p_{uv} = p_{vu}$.]

8 State-Change Probabilities along Paths
Hence, we can ignore the direction of the edges when computing the state-change probability for any pair of vertices (u,v). This state-change probability is symmetric: $p_{uv} = p(s(v)=0 \mid s(u)=1) = p(s(v)=1 \mid s(u)=0)$.
Our goal is to convert the values of $p_{uv}$ to additive distances d(u,v). First we express $p_{uv}$, for any pair of vertices u,v, as a function of the mutation probabilities along the path connecting u and v.

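9 Probability of State-Changes along a Path in the CFN 2-States Model
Consider a path of two edges from u to v, with state-change probabilities $p_1$ and $p_2$. Since state-change probabilities are equal in both directions, the directions of the edges can be ignored. The state at v differs from the state at u exactly when the state changes on one of the two edges but not on the other:
$$p_{uv} = p_1(1-p_2) + (1-p_1)p_2.$$
Direct generalization of this formula to longer paths is tedious. However, there is a simple formula for the probability of a state change along an arbitrarily long path from u to v.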

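10 State-Change Probabilities along a Path
Each edge e changes the state with probability $0 < p_e < 0.5$. Let $e_1,\dots,e_l$ be the path of l edges from u to v.
Claim: the probability $p_{uv}$ of a state change between u and v is given by:
$$p_{uv} = \frac{1}{2}\Bigl(1 - \prod_{i=1}^{l}\bigl(1 - 2p_{e_i}\bigr)\Bigr).$$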

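11 State-Change Probabilities along a Path (cont.): Proof of the formula
Define the following imaginary stochastic “bond-opening process” for changing a state along an edge e. Initially all edges are “bonded”. For each edge e:
1. With probability $2p_e$, open the bond on e.
2. If the bond was opened, set the state to 0 or 1 with equal probability (0.5).
This process implies that at each edge $e_i$ the state is changed with probability $2p_{e_i} \cdot 0.5 = p_{e_i}$.
With this process we have: if no bond on the path is opened, which happens with probability $\prod_{i=1}^{l}(1 - 2p_{e_i})$, the state at v equals the state at u. Otherwise, the state at v is set uniformly at random by the last opened bond, so it differs from the state at u with probability 0.5. Hence
$$p_{uv} = \frac{1}{2}\Bigl(1 - \prod_{i=1}^{l}\bigl(1 - 2p_{e_i}\bigr)\Bigr).$$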
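The bond-opening argument can be verified exactly on short paths by enumerating every outcome of the process. This is an illustrative sketch rather than part of the lecture: each edge either keeps its bond (probability $1 - 2p_e$), or opens it and resets the state to 0 or to 1 (probability $p_e$ each).

```python
from itertools import product
from math import prod, isclose

def change_prob_by_bond_process(ps, start=0):
    """Exact probability that the state at the end of the path differs from `start`."""
    total = 0.0
    # Per edge: the bond stays closed ("keep"), or opens and the state is reset to 0 or 1.
    outcomes = ["keep", 0, 1]
    for choice in product(outcomes, repeat=len(ps)):
        prob = 1.0
        state = start
        for p, c in zip(ps, choice):
            if c == "keep":
                prob *= 1 - 2 * p
            else:
                prob *= p          # 2p (bond opens) times 0.5 (this particular state)
                state = c
        if state != start:
            total += prob
    return total

ps = [0.1, 0.2, 0.05, 0.3]         # illustrative edge probabilities
closed_form = 0.5 * (1 - prod(1 - 2 * p for p in ps))
print(change_prob_by_bond_process(ps), closed_form)
assert isclose(change_prob_by_bond_process(ps), closed_form)
```

The enumeration visits $3^l$ outcomes, so it is only meant as a check for small l.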

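12 Bond Probabilities → Additive Distances
Define $\theta_e = 1 - 2p_e$ for each edge e and $\theta_{u,v} = 1 - 2p_{uv}$ for each pair of vertices u,v. By the formula for $p_{uv}$, $\theta_{u,v} = \prod_{i=1}^{l}\theta_{e_i}$ over the edges $e_1,\dots,e_l$ of the path from u to v, and therefore $-\log\theta_{u,v} = \sum_{i=1}^{l} -\log\theta_{e_i}$.
Thus, d(u,v) = −log θ_{u,v} is an additive metric on the tree.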
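A small numeric illustration of why the logarithm is needed, assuming a hypothetical path u - w - v with made-up edge probabilities: the state-change probabilities themselves do not add up along the path, but the values −log(1 − 2p) do.

```python
from math import log, isclose

def compose(p1, p2):
    """State-change probability over two consecutive edges."""
    return p1 * (1 - p2) + p2 * (1 - p1)

def cfn_distance(p):
    """d = -log(theta) with theta = 1 - 2p (natural logarithm)."""
    return -log(1 - 2 * p)

p_uw, p_wv = 0.1, 0.25             # hypothetical edge probabilities
p_uv = compose(p_uw, p_wv)
print(p_uv, p_uw + p_wv)                                             # not additive
print(cfn_distance(p_uv), cfn_distance(p_uw) + cfn_distance(p_wv))   # additive
assert isclose(cfn_distance(p_uv), cfn_distance(p_uw) + cfn_distance(p_wv))
```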

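13 A Physical Interpretation of d(u,v) = −log θ_{u,v}
Common physical models of evolution view mutations as “Poisson processes”. For the CFN model, this means that a mutation on edge e is a random event that occurs at some frequency $\lambda_e$ (i.e., $\lambda_e$ is the average number of mutations per site on e). The state at the end of e differs from the state at its start exactly when an odd number of mutations occurred on e. With this interpretation, it can be shown that
$$p_e = \frac{1}{2}\bigl(1 - e^{-2\lambda_e}\bigr), \quad\text{i.e.}\quad \theta_e = e^{-2\lambda_e},\quad -\log\theta_e = 2\lambda_e$$
(taking log to be the natural logarithm). Thus $\frac{1}{2}d(u,v) = \sum_i \lambda_{e_i}$ is the expected number of mutations per site that occur on the tree-path between u and v.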
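The Poisson interpretation can be checked numerically under one stated assumption: the state on an edge changes exactly when the number of mutations on it is odd. The sketch sums the Poisson probabilities of odd counts and compares the result with the closed form $\frac{1}{2}(1 - e^{-2\lambda})$; the rate 0.3 is an arbitrary illustrative value.

```python
from math import exp, factorial, log, isclose

def prob_odd_poisson(lam, terms=60):
    """P(odd number of events) for a Poisson(lam) count, by truncated summation."""
    return sum(exp(-lam) * lam**k / factorial(k) for k in range(1, terms, 2))

def cfn_change_prob(lam):
    """Closed form for the same probability: 0.5 * (1 - exp(-2 * lam))."""
    return 0.5 * (1 - exp(-2 * lam))

lam = 0.3                          # illustrative mutation rate on one edge
p_e = cfn_change_prob(lam)
print(prob_odd_poisson(lam), p_e)
print(-log(1 - 2 * p_e), 2 * lam)  # edge length equals twice the expected mutation count
assert isclose(prob_odd_poisson(lam), p_e)
```

Under this assumption, −log θ on an edge equals twice the expected number of mutations on it, so d is proportional to the expected number of mutations along the path.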

14 Estimated Distance Matrix
We saw that the values {d(u,v) = −log θ_{uv} : u,v are leaves of T} form an additive metric on T’s leaves. Hence, if we had these distances, we could reconstruct T in O(n^2) time (e.g. by DLCA).
In practice we estimate {d(u,v) = −log θ_{uv}} from the sequences. The estimate uses the fact that θ_{uv} = 1 − 2p_{uv}, and p_{uv} is naturally approximated by the fraction of sites at which the sequences at u and v differ (their Hamming distance divided by the sequence length), as we show next.
[Figure: leaf sequences A: AATGGGC, B: AATCCTG, C: ATAGCTG, D: GAACGTA, E: AAACCGA, F: GGGGATT, G: TCTGGGA, H: TCCGGAA, I: AGCCGTG, J: ACCGTTG, from which the distance matrix is estimated.]

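15 Estimating the Additive Distances
Definition: H(u,v), the Hamming distance between (the sequence at) u and (the sequence at) v, is the number of sites in which u and v have different states.
H(u,v) can be used to estimate d(u,v) by the following steps, where k is the sequence length:
1. Estimate $p_{uv}$ by $\hat{p}_{uv} = H(u,v)/k$.
2. Estimate $\theta_{uv}$ by $\hat{\theta}_{uv} = 1 - 2\hat{p}_{uv}$.
3. Estimate d(u,v) by $\hat{d}(u,v) = -\log\hat{\theta}_{uv}$.
Since the sites evolve independently, $H(u,v)/k$ converges to $p_{uv}$ as k grows, and so $\hat{d}(u,v)$ converges to d(u,v).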
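The estimation steps can be sketched in Python as follows. This is a minimal illustration, assuming that H(u,v) counts sites whose CFN states (purine vs. pyrimidine) differ and that the estimate is −log(1 − 2H/k); returning infinity when the observed fraction reaches 0.5 is a choice made for this sketch, not something stated in the lecture. The two sequences are the ones labeled B and C in the lecture's figure.

```python
from math import log, inf

PURINES = set("AG")

def hamming_cfn(seq_u, seq_v):
    """Number of sites at which the two sequences have different CFN states."""
    if len(seq_u) != len(seq_v):
        raise ValueError("sequences must have equal length")
    return sum((a in PURINES) != (b in PURINES) for a, b in zip(seq_u, seq_v))

def estimate_distance(seq_u, seq_v):
    """Estimate d(u,v) = -log(theta_uv) from two aligned sequences."""
    k = len(seq_u)
    p_hat = hamming_cfn(seq_u, seq_v) / k   # step 1: estimate p_uv
    theta_hat = 1 - 2 * p_hat               # step 2: estimate theta_uv
    if theta_hat <= 0:
        return inf                          # saturated: the logarithm is undefined
    return -log(theta_hat)                  # step 3: estimate d(u,v)

b, c = "AATCCTG", "ATAGCTG"
print(hamming_cfn(b, c), estimate_distance(b, c))
```

With only 7 sites the estimate is very rough; the accuracy improves as the sequences get longer.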

16 Consistency of Distance Based Algorithms in the CFN Model
A tree reconstruction algorithm is said to be “consistent” for a probabilistic model of evolution if the following holds for any phylogenetic tree which fits the model: when the sequence length goes to ∞, the reconstructed tree is w.h.p. (with high probability) the true tree.
Thus, the previous slide shows that distance based methods are consistent for the CFN model.

17 Reconstructing Trees Generated by the CFN Model
The longer the sequences, the more accurate our estimate of d(u,v) = −log θ_{uv}. Thus the accuracy of the estimate is a function of the sequence length, k.
A practical question: how long should the sequences be in order to guarantee an accurate reconstruction? Much research on this question was done in the last decade.
The bottom line is that estimates of long distances are very noisy: the sequence length needed to accurately estimate a distance d grows exponentially with d (recall that d is proportional to the expected number of mutations between vertices). Hence reconstruction should attempt to use only small distances.
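A rough way to see the exponential growth, offered as a heuristic sketch and not as the bound proved in the literature: for the estimate $\hat{d} = -\log(1 - 2H/k)$, a first-order (delta-method) approximation gives $\mathrm{Var}(\hat{d}) \approx (e^{2d} - 1)/k$, so keeping the standard deviation below a tolerance ε needs roughly $k \ge (e^{2d} - 1)/\varepsilon^2$ sites. The function names and the tolerance value are illustrative.

```python
from math import exp, ceil

def approx_std_of_estimate(d, k):
    """First-order approximation of the standard deviation of the estimate of d from k sites."""
    return ((exp(2 * d) - 1) / k) ** 0.5

def approx_length_needed(d, eps):
    """Approximate number of sites needed for the standard deviation to drop below eps."""
    return ceil((exp(2 * d) - 1) / eps ** 2)

for d in (0.5, 1.0, 2.0, 4.0):
    print(d, approx_length_needed(d, eps=0.1))
```

Each unit added to d multiplies the required length by roughly $e^2$, which matches the qualitative message that only short distances can be estimated reliably from sequences of realistic length.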

18 More Involved Mutation Models
More involved models allow different types of mutations to have different probabilities.
The bad news is that the same exponential lower bound on the length of the sequences needed to estimate the distances holds for these models.
The good news is that when the model allows several types of mutations, there are many different distance functions which can be used, so it is possible to select for each model tree a distance function which is best for this tree.

