Presentation is loading. Please wait.

Presentation is loading. Please wait.

Molecular Evolution and Phylogenetic Tree Reconstruction

Similar presentations


Presentation on theme: "Molecular Evolution and Phylogenetic Tree Reconstruction"— Presentation transcript:

1 Molecular Evolution and Phylogenetic Tree Reconstruction
1 4 3 2 5 Molecular Evolution and Phylogenetic Tree Reconstruction

2 Phylogenetic Trees Nodes: species Edges: time of independent evolution
Edge length represents evolution time AKA genetic distance Not necessarily chronological time

3 Inferring Phylogenetic Trees
Trees can be inferred by several criteria: Morphology of the organisms Can lead to mistakes! Sequence comparison Example: Mouse: ACAGTGACGCCCCAAACGT Rat: ACAGTGACGCTACAAACGT Baboon: CCTGTGACGTAACAAACGA Chimp: CCTGTGACGTAGCAAACGA Human: CCTGTGACGTAGCAAACGA

4 Inferring Phylogenetic Trees
Sequence-based methods Deterministic (Parsimony) Probabilistic (SEMPHY) Distance-based methods UPGMA Neighbor-Joining Can compute distances from sequences 4

5 Distance Between Two Sequences
Basic principles: Degree of sequence difference is proportional to length of independent sequence evolution Only use positions where alignment is certain – avoid areas with (too many) gaps 5

6 Distance Between Two Sequences
Given sequences xi, xj, Define dij = distance between the two sequences One possible definition: dij = fraction f of sites u where xi[u]  xj[u] Better scores are derived by modeling evolution as a continuous change process

7 Outline Molecular Evolution Distance Methods Sequence Methods
UPGMA / Average Linkage Neighbor-Joining Sequence Methods Deterministic (Parsimony) Probabilistic (SEMPHY)

8 Molecular Evolution Q: How can we model evolution on nucleotide level? (ignore gaps, focus on substitutions) A: Consider what happens at a specific position for small time interval Δt P(t) = vector of probabilities of {A,C,G,T} at time t μAC = rate of transition from A to C per unit time μA = μAC + μAG + μAT rate of transition out of A pA(t+Δt) = pA(t) – pA(t) μA Δt + pC(t) μCA Δt + … 8

9 P(t+Δt) = P(t) + Q P(t) Δt
Molecular Evolution In matrix/vector notation, we get P(t+Δt) = P(t) + Q P(t) Δt where Q is the substitution rate matrix 9

10 Molecular Evolution This is a differential equation: P’(t) = Q P(t)
A substitution rate matrix Q implies a probability distribution over {A,C,G,T} at each position, including stationary (equilibrium) frequencies πA, πC, πG, πT Each Q is an evolutionary model (some work better than others) 10

11 Evolutionary Models Jukes-Cantor Kimura Felsenstein HKY 11

12 Estimating Distances Solve the differential equation and compute expected evolutionary time given sequences Jukes-Cantor Kimura 12

13 Outline Molecular Evolution Distance Methods Sequence Methods
UPGMA / Average Linkage Neighbor-Joining Sequence Methods Deterministic (Parsimony) Probabilistic (SEMPHY) 13

14 A simple clustering method for building tree
UPGMA (unweighted pair group method using arithmetic averages) Or the Average Linkage Method Given two disjoint clusters Ci, Cj of sequences, 1 dij = ––––––––– {p Ci, q Cj}dpq |Ci|  |Cj| Claim that if Ck = Ci  Cj, then distance to another cluster Cl is: dil |Ci| + djl |Cj| dkl = –––––––––––––– |Ci| + |Cj|

15 Algorithm: Average Linkage
Initialization: Assign each xi into its own cluster Ci Define one leaf per sequence, height 0 Iteration: Find two clusters Ci, Cj s.t. dij is min Let Ck = Ci  Cj Define node connecting Ci, Cj, and place it at height dij/2 Delete Ci, Cj Termination: When two clusters i, j remain, place root at height dij/2 1 4 3 2 5 1 4 2 3 5

16 Average Linkage Example
w x y z 6 8 4 2 v w xyz 6 8 vw xyz 8 4 v w x yz 6 8 4 3 2 1 v w x y z

17 Ultrametric Distances and Molecular Clock
Definition: A distance function d(.,.) is ultrametric if for any three distances dij  dik  dij, it is true that dij  dik = dij The Molecular Clock: The evolutionary distance between species x and y is 2 the Earth time to reach the nearest common ancestor That is, the molecular clock has constant rate in all species The molecular clock results in ultrametric distances years 1 4 2 3 5

18 Ultrametric Distances & Average Linkage
1 4 2 3 5 Average Linkage is guaranteed to reconstruct correctly a binary tree with ultrametric distances Proof: Exercise

19 Weakness of Average Linkage
Molecular clock: all species evolve at the same rate (Earth time) However, certain species (e.g., mouse, rat) evolve much faster Example where UPGMA messes up: Correct tree AL tree 3 2 4 1 4 2 3 1

20 Additive Distances 1 d1,4 12 4 8 3 13 7 9 5 11 10 6 2 Given a tree, a distance measure is additive if the distance between any pair of leaves is the sum of lengths of edges connecting them Given a tree T & additive distances dij, can uniquely reconstruct edge lengths: Find two neighboring leaves i, j, with common parent k Place parent node k at distance dkm = ½ (dim + djm – dij) from any node m  i, j

21 d(x, y) + d(z, w) < d(x, z) + d(y, w) = d(x, w) + d(y, z)
Additive Distances z x w y For any four leaves x, y, z, w, consider the three sums d(x, y) + d(z, w) d(x, z) + d(y, w) d(x, w) + d(y, z) One of them is smaller than the other two, which are equal d(x, y) + d(z, w) < d(x, z) + d(y, w) = d(x, w) + d(y, z)

22 Reconstructing Additive Distances Given T
x T D y 5 4 v w x y z 10 17 16 15 14 9 3 z 3 4 7 w 6 v If we know T and D, but do not know the length of each leaf, we can reconstruct those lengths

23 Reconstructing Additive Distances Given T
x T D y v w x y z 10 17 16 15 14 9 z w v

24 Reconstructing Additive Distances Given T
x v w x y z 10 17 16 15 14 9 T y z a w D1 v dax = ½ (dvx + dwx – dvw) a x y z 11 10 9 15 14 day = ½ (dvy + dwy – dvw) daz = ½ (dvz + dwz – dvw)

25 Reconstructing Additive Distances Given T
x a x y z 11 10 9 15 14 T y 5 4 b 3 z 3 a c 4 7 w D2 6 d(a, c) = 3 d(b, c) = d(a, b) – d(a, c) = 3 d(c, z) = d(a, z) – d(a, c) = 7 d(b, x) = d(a, x) – d(a, b) = 5 d(b, y) = d(a, y) – d(a, b) = 4 d(a, w) = d(z, w) – d(a, z) = 4 d(a, v) = d(z, v) – d(a, z) = 6 Correct!!! v a b z 6 10 D3 a c 3

26 Neighbor-Joining Guaranteed to produce the correct tree if distance is additive May produce a good tree even when distance is not additive Step 1: Finding neighboring leaves Define Dij = (N – 2) dij – ki dik – kj djk Claim: The above “magic trick” ensures that Dij is minimal iff i, j are neighbors 1 3 0.1 0.1 0.1 0.4 0.4 2 4

27 Algorithm: Neighbor-Joining
Initialization: Define T to be the set of leaf nodes, one per sequence Let L = T Iteration: Pick i, j s.t. Dij is minimal Define a new node k, and set dkm = ½ (dim + djm – dij) for all m  L Add k to T, with edges of lengths dik = ½ (dij + ri – rj), djk = dij – dik where ri = (N – 2)-1 ki dik Remove i, j from L; Add k to L Termination: When L consists of two nodes, i, j, and the edge between them of length dij

28 Outline Molecular Evolution Distance Methods Sequence Methods
UPGMA / Average Linkage Neighbor-Joining Sequence Methods Deterministic (Parsimony) Probabilistic (SEMPHY) 28

29 Parsimony One of the most popular methods: Idea:
GIVEN multiple alignment FIND tree & history of substitutions explaining alignment Idea: Find the tree that explains the observed sequences with a minimal number of substitutions Two computational subproblems: Find the parsimony cost of a given tree (easy) Search through all tree topologies (hard)

30 Example: Parsimony Cost of One Column
{A, B} C++ A B A B A A {A} {B} {A} {A}

31 Parsimony Scoring Given a tree, and an alignment column u
Label internal nodes to minimize the number of required substitutions Initialization: Set cost C = 0; node k = 2N – 1 (last leaf) Iteration: If k is a leaf, set Rk = { xk[u] } // Rk is simply the character of kth species If k is not a leaf, Let i, j be the daughter nodes; Set Rk = Ri  Rj if intersection is nonempty Set Rk = Ri  Rj, and increment C if intersection is empty Termination: Minimal cost of tree for column u, = C

32 Example {B} {A,B} {A} {B} {A} {A,B} {A} A A A A B B A B {A} {A} {A}

33 Parsimony Traceback Traceback:
Choose an arbitrary nucleotide from R2N – 1 for the root Having chosen nucleotide r for parent k, If r  Ri choose r for daughter i Else, choose arbitrary nucleotide from Ri Easy to see that this traceback produces some assignment of cost C

34 inadmissible with Traceback
Example Admissible with Traceback x B Still optimal, but inadmissible with Traceback A {A, B} A B x {A} B {A, B} A B A B x B x A B A B A {A} {B} {A} {B} A B A B A x A x A B A B

35 Another Parsimony Algorithm
Let C(v) be cost for subtree rooted at node v Let C(v,x) be cost for subtree rooted at v if we force v to have value x Initialization: For each leaf v C(v) = 0 C(v,x) = 0 if x is input character that labels v; C(v,x) = ∞ otherwise Iteration: Let u, w be children of v C(v,x) = min(C(u) + 1, C(u,x)) + min(C(v) + 1, C(v,x)) C(v) = min C(v,x) Termination: Minimal cost is C(root) 35

36 Probabilistic Methods
xroot t1 t2 x1 x2 A more refined measure of evolution along a tree than parsimony P(x1, x2, xroot | t1, t2) = P(xroot) P(x1 | t1, xroot) P(x2 | t2, xroot) If we use Jukes-Cantor, for example, and x1 = xroot = A, x2 = C, t1 = t2 = 1, = pA¼(1 + 3e-4α) ¼(1 – e-4α) = (¼)3(1 + 3e-4α)(1 – e-4α)

37 Probabilistic Methods
xroot xu x2 xN x1 If we know all internal labels xu, P(x1, x2, …, xN, xN+1, …, x2N-1 | T, t) = P(xroot)jrootP(xj | xparent(j), tj, parent(j)) Usually we don’t know the internal labels, therefore P(x1, x2, …, xN | T, t) = xN+1 xN+2 … x2N-1 P(x1, x2, …, x2N-1 | T, t)

38 Felsenstein’s Likelihood Algorithm
To calculate P(x1, x2, …, xN | T, t) Initialization: Set k = 2N – 1 Iteration: Compute P(Lk | a) for all a   If k is a leaf node: Set P(Lk | a) = 1(a = xk) If k is not a leaf node: 1. Compute P(Li | b), P(Lj | b) for all b, for daughter nodes i, j 2. Set P(Lk | a) = b,c P(b | a, ti) P(Li | b) P(c | a, tj) P(Lj | c) Termination: Likelihood at this column = P(x1, x2, …, xN | T, t) = aP(L2N-1 | a)P(a) Let P(Lk | a) denote the prob. of all the leaves below node k, given that the residue at k is a

39 Felsenstein’s Likelihood Algorithm
Define: and recursively compute: 39

40 Felsenstein’s Likelihood Algorithm
Now using u and U we can compute: and 40

41 Probabilistic Methods
Given M (ungapped) alignment columns of N sequences, Define likelihood of a tree: L(T, t) = P(Data | T, t) = m=1…M P(x1m, …, xnm | T, t) Maximum Likelihood Reconstruction: Given data X = (xij), find a topology T and length vector t that maximize likelihood L(T, t)

42 Current popular methods
HUNDREDS of programs available! Some recommended programs: Discrete—Parsimony-based Rec-1-DCM3 Tandy Warnow and colleagues Probabilistic SEMPHY Nir Friedman and colleagues


Download ppt "Molecular Evolution and Phylogenetic Tree Reconstruction"

Similar presentations


Ads by Google