
Phylogeny II: Parsimony, ML, SEMPHY

Slide 2: Phylogenetic Tree
- Topology: bifurcating
  - Leaves: 1…N
  - Internal nodes: N+1…2N-2
[Figure: a bifurcating tree with a leaf, a branch, and an internal node labeled]

Slide 3: Character-Based Methods
- We start with a multiple alignment
- Assumptions:
  - All sequences are homologous
  - Each position in the alignment is homologous
  - Positions evolve independently
  - No gaps
- We seek to explain the evolution of each position in the alignment

Slide 4: Parsimony
- Character-based method
- A way to score trees (but not to build trees!)
- Assumptions:
  - Independence of characters (no interactions)
  - The best tree is the one on which the fewest changes take place

Slide 5: A Simple Example
What is the parsimony score of the following alignment?
  Aardvark (A): CAGGTA
  Bison    (B): CAGACA
  Chimp    (C): CGGGTA
  Dog      (D): TGCACT
  Elephant (E): TGCGTA

Slide 6: A Simple Example
- Each column is scored separately.
- Consider the first column (C, C, C, T, T): the minimal tree has one evolutionary change, separating the C-leaves from the T-leaves.

Slide 7: Evaluating Parsimony Scores
- How do we compute the parsimony score for a given tree?
- Traditional parsimony: each base change has a cost of 1
- Weighted parsimony: each change from a to b is weighted by a cost c(a,b)

Slide 8: Traditional Parsimony
- Solved independently for each position
- Linear-time solution
[Figure: example with leaves labeled a, g, a; internal candidate sets {a,g} and {a}]
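The intersect/union procedure behind traditional parsimony (Fitch's algorithm) can be sketched as follows for a single position. The tuple-based tree encoding and function name are illustrative, not from the slides.

```python
# Fitch's algorithm for traditional parsimony at one position.
# A tree is either a single-character string (leaf) or a pair (left, right).

def fitch(tree):
    """Return (candidate_set, cost) for the subtree `tree`."""
    if isinstance(tree, str):           # leaf: observed character, cost 0
        return {tree}, 0
    left, right = tree
    ls, lc = fitch(left)
    rs, rc = fitch(right)
    common = ls & rs
    if common:                          # intersection non-empty: no change here
        return common, lc + rc
    return ls | rs, lc + rc + 1         # disjoint sets: charge one change
```

On the slide's example, `fitch((("a", "g"), "a"))` produces the internal set `{a, g}`, then `{a}` at the top, with a total cost of 1; the first column of the earlier five-species example (C, C, C, T, T) likewise scores 1 on a tree grouping the Cs apart from the Ts.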

Slide 9: Evaluating Weighted Parsimony
- Dynamic programming on the tree. Let S(i,a) = cost of the subtree rooted at node i, given that i is labeled by character a.
- Initialization: for each leaf i, set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞.
- Iteration: if k is a node with children i and j, then
    S(k,a) = min_b [S(i,b) + c(a,b)] + min_b [S(j,b) + c(a,b)]
- Termination: the cost of the tree is min_a S(r,a), where r is the root.
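The dynamic program above can be sketched directly (Sankoff's algorithm) for one position. The tree encoding, alphabet, and names are illustrative; with unit costs c(a,b) = 1 for a ≠ b it reduces to traditional parsimony.

```python
# Weighted parsimony DP at one position. A tree is either a character
# (leaf) or a pair (left, right); `cost[a][b]` plays the role of c(a,b).
import math

ALPHABET = "ACGT"

def sankoff(tree, cost):
    """Return S as a dict: S[a] = min cost of the subtree if its root is labeled a."""
    if isinstance(tree, str):   # leaf: 0 for the observed character, infinity otherwise
        return {a: (0.0 if a == tree else math.inf) for a in ALPHABET}
    i, j = tree
    Si, Sj = sankoff(i, cost), sankoff(j, cost)
    return {a: min(Si[b] + cost[a][b] for b in ALPHABET)
             + min(Sj[b] + cost[a][b] for b in ALPHABET)
            for a in ALPHABET}

# Unit costs recover traditional parsimony:
unit = {a: {b: (0 if a == b else 1) for b in ALPHABET} for a in ALPHABET}
S = sankoff((("A", "G"), "A"), unit)
best = min(S.values())   # cost of the tree: min_a S(r, a)
```

Here `best` is 1, matching Fitch's algorithm on the same tree; keeping argmin traceback pointers at each node recovers the optimal ancestral labels.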

Slide 10: Cost of Evaluating Parsimony
- The score is evaluated on each position independently; scores are then summed over all positions.
- If there are n nodes, m characters (positions), and k possible values for each character, the complexity is O(nmk).
- By keeping traceback information, we can reconstruct the most parsimonious values at each ancestral node.

Slide 11: Maximum Parsimony
Position:   1 2 3 4 5 6 7 8 9 10
Species 1:  A G G G T A A C T G
Species 2:  A C G A T T A T T A
Species 3:  A T A A T T G T C T
Species 4:  A A T G T T G T C G
How many possible unrooted trees?

Slide 12: Maximum Parsimony
How many possible unrooted trees are there for these 4 species? Only 3; the following slides score each of them.
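In general, the number of distinct unrooted bifurcating trees on N labeled species is the double factorial (2N-5)!! = 3 · 5 · … · (2N-5). A small helper (illustrative, not from the slides) answers the question and shows how quickly the count explodes:

```python
# Number of distinct unrooted bifurcating trees on n labeled leaves:
# (2n-5)!! = 3 * 5 * ... * (2n-5), with 1 tree for n <= 3.

def num_unrooted_trees(n):
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count
```

For 4 species this gives 3 trees, so exhaustive scoring is easy here; for 10 species it is already 2,027,025, which is why tree search (next section) becomes the hard part.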

Slide 13: Maximum Parsimony
How many substitutions does each candidate tree require? The maximum-parsimony (MP) tree is the one requiring the fewest.

Slide 14: Maximum Parsimony
Position 1 (A, A, A, A) is identical in all four species, so it contributes 0 changes on each of the three candidate trees (running scores: 0, 0, 0).

Slide 15: Maximum Parsimony
Position 2 (G, C, T, A) contributes 3 changes (per-position scores so far: 0, 3).

Slide 16: Maximum Parsimony
[Figure: position 2 (1: G, 2: C, 3: T, 4: A) mapped onto the three candidate trees; all four states differ, so every tree requires 3 changes]

Slide 17: Maximum Parsimony
Position 3 (G, G, A, T) contributes 2 changes (per-position scores so far: 0, 3, 2).

Slide 18: Maximum Parsimony
Position 4 (G, A, A, G) is the first position whose cost differs across the trees; per-position scores so far:
  Tree 1: 0 3 2 2
  Tree 2: 0 3 2 1
  Tree 3: 0 3 2 2

Slide 19: Maximum Parsimony
[Figure: position 4 (1: G, 2: A, 3: A, 4: G) mapped onto the three candidate trees, requiring 2, 2, and 1 changes respectively]

Slide 20: Maximum Parsimony
Per-position costs and totals for the three candidate trees:
  Tree 1: 0 3 2 2 0 1 1 1 1 3 = 14
  Tree 2: 0 3 2 1 0 1 2 1 2 3 = 15
  Tree 3: 0 3 2 2 0 1 2 1 2 3 = 16

Slide 21: Maximum Parsimony
Tree 1, with per-position costs 0 3 2 2 0 1 1 1 1 3 (total 14), requires the fewest substitutions and is therefore the maximum-parsimony tree.

Slide 22: Searching for Trees

Slide 23: Searching for the Optimal Tree
- Exhaustive search: very intensive
- Branch and bound: a compromise
- Heuristic search: fast; usually starts from a neighbor-joining (NJ) tree

Slide 24: Phylogenetic Tree Assumptions
- Topology: bifurcating
  - Leaves: 1…N
  - Internal nodes: N+1…2N-2
- Lengths t = {t_i}, one for each branch
- Phylogenetic tree = (Topology, Lengths) = (T, t)

Slide 25: Probabilistic Methods
- The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences.
- Background probabilities: q(a)
- Mutation probabilities: P(a|b, t)
- Models for evolutionary mutations, used to derive these probabilities:
  - Jukes-Cantor
  - Kimura 2-parameter model

Slide 26: Jukes-Cantor Model
- A model for mutation rates in which mutation occurs at a constant rate.
- Each nucleotide is equally likely to mutate into any other nucleotide, with rate α.

Slide 27: Kimura 2-Parameter Model
- Allows a different rate for transitions and for transversions.

Slide 28: Mutation Probabilities
- The rate matrix R is used to derive the mutation probability matrix S(t).
- S is obtained by integration, S(t) = exp(Rt). For Jukes-Cantor:
    P(a|a, t) = (1 + 3e^{-4αt}) / 4
    P(b|a, t) = (1 - e^{-4αt}) / 4   for b ≠ a
- The background distribution q can be obtained by letting t → ∞, giving q(a) = 1/4.
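A minimal sketch of the Jukes-Cantor closed form; `alpha` and the function name are illustrative. Its sanity checks mirror the slides: rows sum to 1, letting t grow recovers the uniform background q(a) = 1/4, and the lack-of-memory property holds.

```python
# Jukes-Cantor transition probability P(b | a, t) in closed form.
# `alpha` is the per-nucleotide substitution rate.
import math

def jc_prob(a, b, t, alpha=1.0):
    e = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * e) if a == b else 0.25 * (1.0 - e)
```

For example, summing `jc_prob("A", b, t)` over b gives 1 for any t, and `jc_prob(a, c, t1)` composed over intermediate states c reproduces `jc_prob(a, b, t1 + t2)`, the Chapman-Kolmogorov (lack-of-memory) identity.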

Slide 29: Mutation Probabilities
Both models satisfy the following properties:
- Lack of memory: P(b|a, t1 + t2) = Σ_c P(c|a, t1) P(b|c, t2)
- Reversibility: there exist stationary probabilities {P_a} such that P_a P(b|a, t) = P_b P(a|b, t)

Slide 30: Probabilistic Approach
- Given P, q, the tree topology, and branch lengths, we can compute the joint probability of any assignment of characters to the nodes: the background probability at the root times a mutation probability along each branch. For example, for a tree with leaves x1, x2, x3, internal node x4 (parent of x1 and x2), root x5, and branch lengths t1…t4:
    P(x1, …, x5) = q(x5) P(x4|x5, t4) P(x1|x4, t1) P(x2|x4, t2) P(x3|x5, t3)

Slide 31: Computing the Tree Likelihood
- We are interested in the probability of the observed data (the leaves) given the tree and branch "lengths": P(x1, …, xN | T, t).
- It is computed by summing over all assignments to the internal nodes.
- This can be done efficiently using an upward (leaves-to-root) traversal of the tree.

Slide 32: Tree Likelihood Computation
- Define P(L_k|a) = probability of the leaves below node k, given that x_k = a.
- Initialization: for each leaf k, set P(L_k|a) = 1 if x_k = a, and 0 otherwise.
- Iteration: if k is a node with children i and j (on branches of lengths t_i and t_j), then
    P(L_k|a) = [Σ_b P(b|a, t_i) P(L_i|b)] · [Σ_c P(c|a, t_j) P(L_j|c)]
- Termination: the likelihood is Σ_a q(a) P(L_r|a), where r is the root.
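The upward pass above (Felsenstein's pruning algorithm) can be sketched as follows, assuming Jukes-Cantor mutation probabilities; the tuple tree encoding and names are illustrative.

```python
# Pruning pass for one alignment position. A tree is either a character
# (leaf) or a tuple (child_i, t_i, child_j, t_j) with branch lengths.
import math

ALPHABET = "ACGT"

def jc_prob(a, b, t, alpha=1.0):
    e = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * e) if a == b else 0.25 * (1.0 - e)

def cond_lik(tree):
    """P(L_k | a) for every a, as a dict."""
    if isinstance(tree, str):                      # leaf: indicator on x_k
        return {a: (1.0 if a == tree else 0.0) for a in ALPHABET}
    i, ti, j, tj = tree
    Li, Lj = cond_lik(i), cond_lik(j)
    return {a: sum(jc_prob(a, b, ti) * Li[b] for b in ALPHABET)
             * sum(jc_prob(a, b, tj) * Lj[b] for b in ALPHABET)
            for a in ALPHABET}

def likelihood(tree, q=0.25):
    """Termination step: sum_a q(a) P(L_r | a), with uniform q for JC."""
    return sum(q * p for p in cond_lik(tree).values())
```

A useful check: summing `likelihood` over all possible leaf patterns of a fixed tree gives 1, since the model is a proper distribution over leaf data.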

Slide 33: Maximum Likelihood (ML)
- Score each tree by the likelihood of the data, P(X|T, t) = Π_j P(x_j|T, t), the product running over alignment positions (assumption of independent positions).
- Branch lengths t can be optimized by gradient ascent or by EM.
- We look for the highest-scoring tree by exhaustive search or by sampling methods (Metropolis).

Slide 34: Optimal Tree Search
- Perform a search over possible topologies T1, T2, …, Tn.
- For each candidate topology, optimize the branch lengths parametrically (e.g. by EM); the likelihood surface over the parameter space has local maxima.

Slide 35: Computational Problem
- Such procedures are computationally expensive!
- Computing optimal parameters for each candidate requires a non-trivial optimization step.
- We spend non-negligible computation on every candidate, even low-scoring ones.
- In practice, such learning procedures can only consider small sets of candidate structures.

Slide 36: Structural EM
Idea: use the parameters found for the current topology to help evaluate new topologies.
Outline: perform the search in (T, t) space, using EM-like iterations:
- E-step: use the current solution to compute expected sufficient statistics for all topologies
- M-step: select a new topology based on these expected sufficient statistics

Slide 37: The Complete-Data Scenario
- Suppose we observe H, the ancestral sequences, in addition to the leaves.
- Define S_{i,j}, a matrix counting co-occurrences of each character pair (a,b) between taxa i and j.
- The complete-data score F is a linear function of the S_{i,j} matrices.
- Find: the topology T that maximizes F.

Slide 38: Expected Likelihood
- Start with a tree (T0, t0).
- Define Q(T, t) = E[log P(X, H | T, t)], the expected complete-data log-likelihood, where the expectation over the ancestral sequences H is taken with respect to (T0, t0).
- Theorem: l(T, t) - l(T0, t0) ≥ Q(T, t) - Q(T0, t0).
- Consequence: an improvement in the expected score guarantees an improvement in the likelihood.

Slide 39: Proof
The theorem is a simple application of Jensen's inequality, as in the convergence proof of standard EM.

Slide 40: Algorithm Outline
- Start from the original tree (T0, t0).
- Compute expected sufficient statistics and weights for every pair of nodes.
- Unlike standard EM for trees, we compute all possible pairwise statistics. Time: O(N²M).

Slide 41: Algorithm Outline: Pairwise Weights
- From the expected statistics, compute a weight W(i, j) for every pair of nodes; this stage also computes the branch length for each pair (i, j).
- Then find the best tree with respect to these weights.

Slide 42: Algorithm Outline: Maximum Spanning Tree
- A fast greedy procedure finds the tree T' of maximum total weight.
- By construction, Q(T', t') ≥ Q(T0, t0), and thus l(T', t') ≥ l(T0, t0).
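One concrete realization of the fast greedy procedure is Kruskal's algorithm run over the pairwise weights, heaviest edge first. The encoding below is illustrative; in structural EM the input would be the W(i, j) weights computed in the E-step.

```python
# Maximum spanning tree by Kruskal's algorithm: sort edges by weight
# (descending) and add each edge that does not close a cycle, tracked
# with a union-find structure.

def max_spanning_tree(n, weights):
    """weights: dict {(i, j): w} over node pairs. Returns the chosen edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    tree = []
    for (i, j), w in sorted(weights.items(), key=lambda e: -e[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                       # edge joins two components
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

The greedy choice is optimal for spanning trees (matroid property), which is what licenses the Q(T', t') ≥ Q(T0, t0) guarantee once Q decomposes over edges.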

Slide 43: Algorithm Outline: Fix Tree
- Remove redundant nodes and add nodes to break vertices of large degree, constructing a bifurcating tree T1.
- This operation preserves the likelihood: l(T1, t') = l(T', t') ≥ l(T0, t0).

Slide 44: Assessing Trees: the Bootstrap
- Often we do not trust that the tree found is the "correct" one.
- Bootstrapping:
  - Sample (with replacement) n positions from the alignment
  - Learn the best tree for each sample
  - Look for tree features that are frequent across the sampled trees
- For some models this procedure approximates the tree posterior P(T | X1, …, Xn)
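The resampling loop can be sketched as follows. Here `build_tree` is a hypothetical stand-in for any tree-construction method (e.g. NJ or maximum parsimony) that returns a list of the resulting tree's features, such as its splits; the function and parameter names are illustrative.

```python
# Bootstrap support: resample alignment columns with replacement,
# rebuild a tree per replicate, and count how often each feature recurs.
import random
from collections import Counter

def bootstrap_support(alignment, build_tree, replicates=100, seed=0):
    """alignment: list of equal-length sequences. Returns a Counter of features."""
    rng = random.Random(seed)
    n = len(alignment[0])
    support = Counter()
    for _ in range(replicates):
        cols = [rng.randrange(n) for _ in range(n)]          # sample positions
        sample = ["".join(seq[c] for c in cols) for seq in alignment]
        support.update(build_tree(sample))                   # features of this tree
    return support
```

Dividing each count by `replicates` gives the usual bootstrap support values reported on tree branches.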

Slide 45: New Tree
- Theorem: l(T1, t1) ≥ l(T0, t0).
- These steps (compute statistics and weights, find the maximum spanning tree, fix it into a bifurcation T1) are repeated until convergence.
