Phylogeny Tree Reconstruction


2 Inferring Phylogenies
Trees can be inferred by several criteria:
 Morphology of the organisms
 Sequence comparison
Example:
Orc:    ACAGTGACGCCCCAAACGT
Elf:    ACAGTGACGCTACAAACGT
Dwarf:  CCTGTGACGTAACAAACGA
Hobbit: CCTGTGACGTAGCAAACGA
Human:  CCTGTGACGTAGCAAACGA

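3 Modeling Evolution
During an infinitesimal time $\Delta t$, there is not enough time for two substitutions to happen on the same nucleotide.
So we can estimate $P(x \mid y, \Delta t)$ for $x, y \in \{A, C, G, T\}$.
Then let
$$S(\Delta t) = \begin{pmatrix} P(A \mid A, \Delta t) & \cdots & P(A \mid T, \Delta t) \\ \vdots & \ddots & \vdots \\ P(T \mid A, \Delta t) & \cdots & P(T \mid T, \Delta t) \end{pmatrix}$$
[Figure: a branch with labels x, y, and t]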

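4 Modeling Evolution
Reasonable assumption: multiplicative (implying a stationary Markov process)
$$S(t + t') = S(t)\,S(t')$$
That is, $P(x \mid y, t + t') = \sum_z P(x \mid z, t)\, P(z \mid y, t')$
Jukes-Cantor: constant rate of evolution $\alpha$
For a short time $\epsilon$, with rows and columns ordered A, C, G, T:
$$S(\epsilon) = I + R\epsilon = \begin{pmatrix} 1 - 3\alpha\epsilon & \alpha\epsilon & \alpha\epsilon & \alpha\epsilon \\ \alpha\epsilon & 1 - 3\alpha\epsilon & \alpha\epsilon & \alpha\epsilon \\ \alpha\epsilon & \alpha\epsilon & 1 - 3\alpha\epsilon & \alpha\epsilon \\ \alpha\epsilon & \alpha\epsilon & \alpha\epsilon & 1 - 3\alpha\epsilon \end{pmatrix}$$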

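5 Modeling Evolution
Jukes-Cantor: for longer times,
$$S(t) = \begin{pmatrix} r(t) & s(t) & s(t) & s(t) \\ s(t) & r(t) & s(t) & s(t) \\ s(t) & s(t) & r(t) & s(t) \\ s(t) & s(t) & s(t) & r(t) \end{pmatrix}$$
where
$$r(t) = \tfrac{1}{4}\left(1 + 3e^{-4\alpha t}\right), \qquad s(t) = \tfrac{1}{4}\left(1 - e^{-4\alpha t}\right)$$
Derivation:
$S(t + \epsilon) = S(t)\,S(\epsilon) = S(t)\,(I + R\epsilon)$
Therefore, $\dfrac{S(t + \epsilon) - S(t)}{\epsilon} = S(t)\,R$
In the limit $\epsilon \to 0$, $S'(t) = S(t)\,R$
Equivalently,
$r' = -3\alpha r + 3\alpha s$
$s' = -\alpha s + \alpha r$
Solving these differential equations with $r(0) = 1$ and $s(0) = 0$ gives the expressions for $r(t)$ and $s(t)$ above.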
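The multiplicative property can be checked numerically against the closed-form Jukes-Cantor entries. The following is a minimal sketch, not part of the original slides; the rate value `ALPHA` and the helper names `jukes_cantor` and `matmul` are assumptions chosen for illustration.

```python
import math

ALPHA = 0.1  # assumed substitution rate, illustrative only


def jukes_cantor(t, alpha=ALPHA):
    """Return S(t) as a 4x4 nested list, rows and columns ordered A, C, G, T."""
    decay = math.exp(-4.0 * alpha * t)
    r = 0.25 * (1.0 + 3.0 * decay)  # diagonal entry: same nucleotide after time t
    s = 0.25 * (1.0 - decay)        # off-diagonal entry: one specific other nucleotide
    return [[r if i == j else s for j in range(4)] for i in range(4)]


def matmul(a, b):
    """Plain 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


# Compare S(t + t') with S(t) S(t') for two arbitrary durations.
lhs = jukes_cantor(0.7 + 1.8)
rhs = matmul(jukes_cantor(0.7), jukes_cantor(1.8))
max_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(4) for j in range(4))
print("largest entrywise difference:", max_gap)
```

The difference should be at the level of floating-point rounding, which is what S(t + t') = S(t)S(t') predicts.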

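6 Modeling Evolution
Kimura:
Transitions: A/G, C/T
Transversions: A/T, A/C, G/T, C/G
Transitions (rate $\alpha$) are much more likely than transversions (rate $\beta$)
With rows and columns ordered A, C, G, T:
$$S(t) = \begin{pmatrix} r(t) & s(t) & u(t) & s(t) \\ s(t) & r(t) & s(t) & u(t) \\ u(t) & s(t) & r(t) & s(t) \\ s(t) & u(t) & s(t) & r(t) \end{pmatrix}$$
where
$s(t) = \tfrac{1}{4}\left(1 - e^{-4\beta t}\right)$
$u(t) = \tfrac{1}{4}\left(1 + e^{-4\beta t} - 2e^{-2(\alpha + \beta)t}\right)$
$r(t) = 1 - 2s(t) - u(t)$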
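As a sanity check on the Kimura matrix, the sketch below builds S(t) and confirms that each row sums to 1 and that the model collapses to Jukes-Cantor when the two rates are equal. It uses the standard form of u(t), which vanishes at t = 0. The function name `kimura` and the numeric rates are assumptions for illustration, not values from the slides.

```python
import math


def kimura(t, alpha, beta):
    """Return the Kimura two-parameter matrix S(t), ordered A, C, G, T.

    alpha is the transition rate (A<->G, C<->T), beta the transversion rate.
    """
    e4b = math.exp(-4.0 * beta * t)
    e2ab = math.exp(-2.0 * (alpha + beta) * t)
    s = 0.25 * (1.0 - e4b)               # each specific transversion
    u = 0.25 * (1.0 + e4b - 2.0 * e2ab)  # the single transition
    r = 1.0 - 2.0 * s - u                # no net change
    return [[r, s, u, s],
            [s, r, s, u],
            [u, s, r, s],
            [s, u, s, r]]


S = kimura(0.5, alpha=0.4, beta=0.1)  # assumed rates, transitions 4x faster
row_sums = [sum(row) for row in S]
print("row sums:", row_sums)
print("P(A->G) =", S[0][2], " P(A->C) =", S[0][1])
```

With alpha greater than beta, the transition entry u(t) exceeds the transversion entry s(t) at short times, matching the statement that transitions are more likely.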

7 Parsimony
One of the most popular methods
Idea: Find the tree that explains the observed sequences with a minimal number of substitutions
Two computational subproblems:
1. Find the parsimony cost of a given tree (easy)
2. Search through all tree topologies (hard)

8 Parsimony Scoring
Given a tree and an alignment column u
Label internal nodes to minimize the number of required substitutions
Initialization: Set cost C = 0; k = 2N − 1
Iteration:
If k is a leaf, set $R_k = \{ x_k[u] \}$
If k is not a leaf, let i, j be the daughter nodes;
Set $R_k = R_i \cap R_j$ if the intersection is nonempty
Set $R_k = R_i \cup R_j$, and C += 1, if the intersection is empty
Termination: Minimal cost of the tree for column u = C
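The scoring procedure above can be sketched as a short recursive function. This is an illustrative sketch under assumptions of our own: the tree is encoded as nested 2-tuples of leaf names, and the names `fitch_cost`, `leaf1` ... `leaf4` are hypothetical.

```python
def fitch_cost(tree, column):
    """Parsimony cost of one alignment column on a rooted binary tree.

    tree:   a leaf name (str) or a 2-tuple (left_subtree, right_subtree)
    column: dict mapping each leaf name to its character in this column
    """
    cost = 0

    def visit(node):
        nonlocal cost
        if isinstance(node, str):          # leaf: R_k = {x_k[u]}
            return {column[node]}
        left, right = node
        r_i, r_j = visit(left), visit(right)
        common = r_i & r_j
        if common:                         # nonempty intersection: no substitution
            return common
        cost += 1                          # empty intersection: one substitution
        return r_i | r_j

    visit(tree)
    return cost


column = {"leaf1": "A", "leaf2": "B", "leaf3": "A", "leaf4": "B"}
grouped_unlike = (("leaf1", "leaf2"), ("leaf3", "leaf4"))  # pairs A with B
grouped_alike = (("leaf1", "leaf3"), ("leaf2", "leaf4"))   # pairs A with A
print(fitch_cost(grouped_unlike, column))
print(fitch_cost(grouped_alike, column))
```

The two topologies differ in cost for the same column, which is why the search over topologies matters.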

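9 Example
[Figure: parsimony scoring on trees with leaves A, B, A, B. Node labels shown: {A, B} with C += 1 (twice), and {A}, {B}.]

10 Example
[Figure: parsimony scoring on trees with leaves A, A, A, B and B, A, B, A. Node labels shown: {A}, {B}, and {A, B}.]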

11 Number of labeled unrooted tree topologies
How many possibilities are there for leaf 4?
[Figure: unrooted tree on leaves 1, 2, 3 with candidate positions for leaf 4]

12 Number of labeled unrooted tree topologies
How many possibilities are there for leaf 4?
For the 4th leaf, there are 3 possibilities

13 Number of labeled unrooted tree topologies
How many possibilities are there for leaf 5?
For the 5th leaf, there are 5 possibilities

14 Number of labeled unrooted tree topologies
How many possibilities are there for leaf 6?
For the 6th leaf, there are 7 possibilities

15 Number of labeled unrooted tree topologies
How many possibilities are there for leaf n?
For the nth leaf, there are 2n − 5 possibilities: an unrooted bifurcating tree on n − 1 leaves has 2n − 5 edges, and the new leaf can attach to any one of them

16 Number of labeled unrooted tree topologies
#unrooted trees for n taxa: $(2n-5)(2n-7)\cdots 3 \cdot 1 = \dfrac{(2n-5)!}{2^{n-3}\,(n-3)!}$
#rooted trees for n taxa: $(2n-3)(2n-5)(2n-7)\cdots 3 = \dfrac{(2n-3)!}{2^{n-2}\,(n-2)!}$
N = 10: #unrooted 2,027,025; #rooted 34,459,425
N = 30: #unrooted $8.7 \times 10^{36}$; #rooted $4.95 \times 10^{38}$
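The counts above are easy to reproduce. The sketch below computes both the product form and the factorial form and checks that they agree; the function names are hypothetical and `math.prod` requires Python 3.8 or later.

```python
import math


def unrooted_trees(n):
    """Number of labeled unrooted bifurcating trees on n >= 3 taxa: (2n-5)!!"""
    return math.prod(range(1, 2 * n - 4, 2))  # 1 * 3 * ... * (2n - 5)


def rooted_trees(n):
    """Number of labeled rooted bifurcating trees on n >= 2 taxa: (2n-3)!!"""
    return math.prod(range(1, 2 * n - 2, 2))  # 1 * 3 * ... * (2n - 3)


def unrooted_trees_closed_form(n):
    """Same count via (2n-5)! / (2^(n-3) (n-3)!)."""
    return math.factorial(2 * n - 5) // (2 ** (n - 3) * math.factorial(n - 3))


for n in (4, 5, 6, 10, 30):
    print(n, unrooted_trees(n), rooted_trees(n))
```

Because Python integers are arbitrary precision, the N = 30 values are computed exactly rather than in floating point.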

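17 Probabilistic Methods
A more refined measure of evolution along a tree than parsimony
$$P(x_1, x_2, x_{root} \mid t_1, t_2) = P(x_{root})\, P(x_1 \mid t_1, x_{root})\, P(x_2 \mid t_2, x_{root})$$
If we use Jukes-Cantor, for example, and $x_1 = x_{root} = A$, $x_2 = C$, $t_1 = t_2 = 1$, then with $p_A = \tfrac{1}{4}$:
$$= p_A \cdot \tfrac{1}{4}\left(1 + 3e^{-4\alpha}\right) \cdot \tfrac{1}{4}\left(1 - e^{-4\alpha}\right) = \left(\tfrac{1}{4}\right)^3 \left(1 + 3e^{-4\alpha}\right)\left(1 - e^{-4\alpha}\right)$$
[Figure: root $x_{root}$ with daughters $x_1$ and $x_2$ on branches of length $t_1$ and $t_2$]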

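18 Computing the Likelihood of a Tree
Define $P(L_k \mid a)$: probability of the subtree rooted at $x_k$, given that $x_k = a$
Then,
$$P(L_k \mid a) = \Big( \sum_b P(L_i \mid b)\, P(b \mid a, t_{ki}) \Big) \Big( \sum_c P(L_j \mid c)\, P(c \mid a, t_{kj}) \Big)$$
[Figure: node $x_k$ with daughters $x_i$ and $x_j$ on branches of length $t_{ki}$ and $t_{kj}$]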

19 Felsenstein’s Likelihood Algorithm
To calculate $P(x_1, x_2, \ldots, x_N \mid T, t)$
Initialization: Set k = 2N − 1
Recursion: Compute $P(L_k \mid a)$ for all $a \in \Sigma$
If k is a leaf node:
Set $P(L_k \mid a) = 1(a = x_k)$
If k is not a leaf node:
1. Compute $P(L_i \mid b)$, $P(L_j \mid b)$ for all b, for daughter nodes i, j
2. Set $P(L_k \mid a) = \sum_{b, c} P(b \mid a, t_i)\, P(L_i \mid b)\, P(c \mid a, t_j)\, P(L_j \mid c)$
Termination: Likelihood at this column $= P(x_1, x_2, \ldots, x_N \mid T, t) = \sum_a P(L_{2N-1} \mid a)\, P(a)$
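The recursion above can be sketched for a single alignment column as follows. This is a minimal sketch under assumptions of our own: Jukes-Cantor substitution probabilities with rate `alpha = 1.0`, a uniform prior of 1/4 at the root, and a hypothetical tree encoding in which an internal node is a tuple `(left, t_left, right, t_right)`.

```python
import math

ALPHABET = "ACGT"


def jc_prob(b, a, t, alpha=1.0):
    """P(b | a, t) under Jukes-Cantor; alpha = 1.0 is an assumed rate."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * decay) if a == b else 0.25 * (1.0 - decay)


def conditional(node, column):
    """Return {a: P(L_k | a)} for the subtree rooted at node."""
    if isinstance(node, str):  # leaf: indicator of the observed character
        return {a: 1.0 if a == column[node] else 0.0 for a in ALPHABET}
    left, t_i, right, t_j = node
    p_i = conditional(left, column)
    p_j = conditional(right, column)
    # The double sum over (b, c) factorizes into a product of two single sums.
    return {
        a: sum(jc_prob(b, a, t_i) * p_i[b] for b in ALPHABET)
           * sum(jc_prob(c, a, t_j) * p_j[c] for c in ALPHABET)
        for a in ALPHABET
    }


def column_likelihood(root, column, prior=0.25):
    """Sum over root states of P(L_root | a) P(a), with a uniform prior."""
    return sum(prior * p for p in conditional(root, column).values())


# Two leaves observed as A and C, each one time unit below the root.
tree = ("x1", 1.0, "x2", 1.0)
likelihood = column_likelihood(tree, {"x1": "A", "x2": "C"})
print(likelihood)
```

For this two-leaf tree the result should equal ¼ · s(2), since summing over the root state is the same as following a single Jukes-Cantor branch of length 2 from one leaf to the other.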

20 Probabilistic Methods
Given M (ungapped) alignment columns of N sequences,
Define the likelihood of a tree:
$$L(T, t) = P(\text{Data} \mid T, t) = \prod_{m = 1 \ldots M} P(x_{1m}, \ldots, x_{Nm} \mid T, t)$$
Maximum Likelihood Reconstruction:
Given data $X = (x_{ij})$, find a topology T and length vector t that maximize the likelihood L(T, t)

21 ML Reconstruction
Task 1  Edge length estimation
Given a topology, find the edge lengths that maximize the likelihood
Accomplished by iterative methods such as Expectation Maximization (EM) or Newton-Raphson optimization
Each iteration costs on the order of the number of taxa times the number of sequence positions
These methods are only guaranteed to find a local maximum but, in practice, usually find the global maximum

22 ML Reconstruction
Task 2  Find a tree topology that maximizes the likelihood
 More challenging
Naïve, exhaustive search of tree space is infeasible
The effectiveness of heuristic paradigms, such as simulated annealing, genetic algorithms, or other search methods, is hampered by the need to re-estimate edge lengths afresh for each candidate tree
 The problem is tackled by iterative procedures that greedily construct the desired tree

23 Structural EM Approach
Based on the development of the Structural EM method for learning Bayesian networks
Similar to the standard EM procedure, except that each EM iteration optimizes not only the edge lengths but also the topology
Overview of the algorithm
 Choose a tree $(T^1, t^1)$ using, say, Neighbor-Joining
 Improve the tree in successive iterations
 In the l-th iteration, we start with the bifurcating tree $(T^l, t^l)$ and construct a new bifurcating tree $(T^{l+1}, t^{l+1})$. We use $(T^l, t^l)$ to define a measure $Q(T, t : T^l, t^l)$ of the expected log likelihood of trees, and then find a bifurcating tree that maximizes this expected log likelihood

24 Definition of Terms
Σ: fixed, finite alphabet of characters, e.g. Σ = {A, C, G, T} for DNA sequences
T: bifurcating tree, in which each internal node is adjacent to exactly three other nodes
N: number of sequences/nodes in the tree
$X_i$: random variable representing node i in the tree T, where i = 1, 2, ..., N
(i, j) ∈ T: nodes i and j in T have an edge between them
t: vector comprising a non-negative duration or edge length $t_{i,j}$ for each edge (i, j) ∈ T
The pair (T, t) constitutes a phylogenetic tree

25 Definition of Terms
D: complete data set
$p_{a \to b}(t)$
 Probability that character a transforms into b in duration t
 Multiplicative: $p_{a \to b}(t + t') = \sum_{c \in \Sigma} p_{a \to c}(t)\, p_{c \to b}(t')$
$p_a$: prior distribution of character a
$S_i(a)$  frequency count of occurrences of letters
 The number of times we see the character a at node i in D
 $S_i(a) = \sum_m 1\{X_i[m] = a\}$
$S_{i,j}(a, b)$  frequency count of co-occurrences of letters
 The number of times we see the character a at node i and the character b at node j in D
 $S_{i,j}(a, b) = \sum_m 1\{X_i[m] = a,\ X_j[m] = b\}$

26 Structural EM Iterations
Three steps
 E-Step: Compute expected frequency counts for all links (i, j) and for all character states a, b ∈ Σ
 M-Step I: Optimize link lengths by computing, for each link (i, j), its best length
 M-Step II:
Construct a topology $T_*^{l+1}$ that maximizes $W^{l+1}(T)$, by finding a maximum spanning tree
Construct a bifurcating topology $T^{l+1}$ such that $L(T_*^{l+1}, t^{l+1}) = L(T^{l+1}, t^{l+1})$

27 E-Step
In each iteration, calculate the expected frequency count for all links (i, j) using the formula
$$E[S_{i,j}(a, b) \mid D, T^l, t^l] = \sum_m P(X_i[m] = a,\ X_j[m] = b \mid x_{[1 \ldots N]}[m], T^l, t^l)$$
Here $x_{[1 \ldots N]}[m]$ is the set of characters observed at all nodes at position m in the data, and $(T^l, t^l)$ is the current tree
Computing these expected counts takes $O(M \cdot N^2 \cdot |\Sigma|^3)$ time, and storing them requires $O(N^2 \cdot |\Sigma|^2)$ space
We can speed this up by a factor of $O(|\Sigma|)$ by calculating approximate counts instead:
$$S^*_{i,j}(a, b) = \sum_m P(X_i[m] = a \mid x_{[1 \ldots N]}[m], T^l, t^l)\; P(X_j[m] = b \mid x_{[1 \ldots N]}[m], T^l, t^l)$$

28 M-Steps
M-Step I
 After each E-Step, we calculate each link (i, j)’s best length using
$$t^{l+1}_{i,j} = \arg\max_t\ L_{local}\big(E[S_{i,j}(a, b) \mid D, T^l, t^l],\ t\big)$$
where
$$L_{local}\big(E[S_{i,j}(a, b) \mid D, T^l, t^l],\ t\big) = \sum_{a, b} E[S_{i,j}(a, b)]\, \big[\log p_{a \to b}(t) - \log p_b\big]$$
M-Step II
 Construct a topology $T_*^{l+1}$ that maximizes $W^{l+1}(T)$, by finding a maximum spanning tree
 Construct a bifurcating topology $T^{l+1}$ such that $L(T_*^{l+1}, t^{l+1}) = L(T^{l+1}, t^{l+1})$
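M-Step I is a one-dimensional optimization per link, which the sketch below illustrates. It is not SEMPHY's implementation: it assumes Jukes-Cantor probabilities with rate `alpha = 1.0` and a uniform prior $p_b = 1/4$, and it swaps in a plain golden-section search as the optimizer. All function names and the search bounds are hypothetical.

```python
import math

ALPHABET = "ACGT"


def jc_prob(a, b, t, alpha=1.0):
    """p_{a->b}(t) under Jukes-Cantor; alpha = 1.0 is an assumed rate."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * decay) if a == b else 0.25 * (1.0 - decay)


def local_log_likelihood(counts, t):
    """L_local = sum_{a,b} E[S(a,b)] * (log p_{a->b}(t) - log p_b), with p_b = 1/4."""
    return sum(n * (math.log(jc_prob(a, b, t)) - math.log(0.25))
               for (a, b), n in counts.items() if n > 0)


def best_edge_length(counts, lo=1e-6, hi=10.0, iterations=100):
    """Golden-section search for the t in [lo, hi] that maximizes L_local."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iterations):
        m1 = hi - inv_phi * (hi - lo)
        m2 = lo + inv_phi * (hi - lo)
        if local_log_likelihood(counts, m1) < local_log_likelihood(counts, m2):
            lo = m1  # maximum lies to the right of m1
        else:
            hi = m2  # maximum lies to the left of m2
    return 0.5 * (lo + hi)


# Synthetic expected counts for 1000 positions generated at a true length of 0.3.
true_t = 0.3
counts = {(a, b): 1000 * 0.25 * jc_prob(a, b, true_t)
          for a in ALPHABET for b in ALPHABET}
estimate = best_edge_length(counts)
print(estimate)
```

Golden-section search is adequate here because, under Jukes-Cantor, L_local has a single maximum in t; a production implementation would more likely use Newton-Raphson or a closed-form update.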

29 Empirical Evaluation
The algorithm has been implemented in C++ in a program called SEMPHY
For protein sequences
 Compared results to MOLPHY, a leading ML application for phylogeny reconstruction based on protein sequences
 Used the basic Structural EM algorithm, with a Neighbor-Joining tree as the starting point
 Evaluated performance on synthetic data sets, generated by constructing an original phylogeny and then sampling from its distribution
 The synthetic data sets were split into a test set and a training set
 Ran both programs on the data

30 Empirical Evaluation
For protein sequences:
 Results of basic Structural EM
The quality of both methods deteriorates as the number of taxa increases and as the number of training positions decreases; when both conditions hold, both methods do poorly
SEMPHY outperforms MOLPHY in terms of quality of solutions
MOLPHY’s running time grows much faster than SEMPHY’s running time, which is only quadratic in the number of taxa
Both MOLPHY’s and SEMPHY’s running times grow linearly in the number of training positions

