“Inferring Phylogenies” Joseph Felsenstein Excellent reference


1 Phylogenetics
“Inferring Phylogenies” (Joseph Felsenstein): excellent reference

2 What is a phylogeny?

3 Different Representations
Cladogram - branching pattern only
Phylogram - branch lengths are estimated and drawn proportional to the amount of change along the branch
Rooted - implies directionality of change
Unrooted - does not imply directionality
How do you root a tree?

4 What is a phylogeny used for?

5 Estimate a Phylogeny Sp1 ACCGTCTTGTTA Sp2 AGCGTCATCAAA
Sp4 ACCGTCTTGATA Sp5 AGCCTCTTCATA
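
One simple way to start comparing the sequences on this slide is to count the sites at which each pair differs. The slides do not say which method was used to build the working tree, so the sketch below is only an illustration; the names `SEQS`, `count_differences`, and `pairwise_differences` are made up for this example.

```python
from itertools import combinations

# Sequences copied from the slide (no Sp3 sequence appears in the transcript).
SEQS = {
    "Sp1": "ACCGTCTTGTTA",
    "Sp2": "AGCGTCATCAAA",
    "Sp4": "ACCGTCTTGATA",
    "Sp5": "AGCCTCTTCATA",
}


def count_differences(a, b):
    """Number of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_differences(seqs):
    """Map each unordered pair of names to its raw difference count."""
    return {
        (m, n): count_differences(seqs[m], seqs[n])
        for m, n in combinations(sorted(seqs), 2)
    }


DIFFS = pairwise_differences(SEQS)
for pair, d in DIFFS.items():
    print(pair, d)
```

Sp1 and Sp4 differ at a single site, which makes them the closest pair under this raw count.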

6 Estimate a Phylogeny Sp1 ACCGTCTTGTTA Sp2 AGCGTCATCAAA
Sp4 ACCGTCTTGATA Sp5 AGCCTCTTCATA

7 Working Tree sp2 sp1 c2 sp3 sp5 sp4

8 Estimate a Phylogeny Sp1 ACCGTCTTGTTA Sp2 AGCGTCATCAAA
Sp4 ACCGTCTTGATA Sp5 AGCCTCTTCATA

9 Working Tree sp2 sp1 c2 sp3 c4 sp5 sp4

10 Estimate a Phylogeny Sp1 ACCGTCTTGTTA Sp2 AGCGTCATCAAA
Sp4 ACCGTCTTGATA Sp5 AGCCTCTTCATA

11 Working Tree sp2 sp1 c7 c2 sp3 c4 sp5 sp4

12 Estimate a Phylogeny Sp1 ACCGTCTTGTTA Sp2 AGCGTCATCAAA
Sp4 ACCGTCTTGATA Sp5 AGCCTCTTCATA

13 Working Tree sp2 sp1 c7 c2 sp3 c4 c9 sp5 sp4

14 Estimate a Phylogeny Sp1 ACCGTCTTGTTA Sp2 AGCGTCATCAAA
Sp4 ACCGTCTTGATA Sp5 AGCCTCTTCATA

15 Working Tree sp2 sp1 c10 c7 c2 sp3 c4 c9 sp5 sp4

16 Estimate a Phylogeny Sp1 ACCGTCTTGTTA Sp2 AGCGTCATCAAA
Sp4 ACCGTCTTGATA Sp5 AGCCTCTTCATA

17 Final Tree sp2 sp1 c10 c11 c2 c7 sp3 c4 c9 sp5 sp4

18 What optimality criteria do we use then?
Parsimony
Likelihood
Bayesian
Distance methods?

19 Parsimony Why should we choose a specific grouping?
Maximum parsimony: we should accept the hypothesis that explains the data most simply and efficiently.
“Parsimony is simply the most robust criterion for choosing between competing scientific hypotheses. It is not a statement about how evolution may or may not have taken place.” [1]
[1] Kitching, I. J.; Forey, P. L.; Humphries, J. & Williams, D. M. Cladistics: the theory and practice of parsimony analysis. The Systematics Association Publication No. 11.

20 Parsimony
Optimality criterion that chooses the topology with the fewest character-state transformations.
Optimizes one component: tree topology (pattern based).
Most parsimonious tree: the one (or several) with the minimum number of evolutionary changes (smallest tree length).
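
The tree length for a given topology can be counted with Fitch's algorithm. This is a minimal sketch, assuming unordered characters with equal weight for every change and a rooted binary tree written as nested pairs. The function names are hypothetical, and the sequences are the four shown on the "Estimate a Phylogeny" slides.

```python
SEQS = {
    "Sp1": "ACCGTCTTGTTA",
    "Sp2": "AGCGTCATCAAA",
    "Sp4": "ACCGTCTTGATA",
    "Sp5": "AGCCTCTTCATA",
}


def fitch_site(tree, states):
    """Return (state_set, changes) for one character.

    `tree` is a leaf name (str) or a pair (left, right);
    `states` maps leaf name -> observed state at this character.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can share a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, seqs):
    """Sum the minimum number of changes over all aligned sites."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for i in range(n_sites):
        column = {name: s[i] for name, s in seqs.items()}
        total += fitch_site(tree, column)[1]
    return total


TREE_A = (("Sp1", "Sp4"), ("Sp2", "Sp5"))
TREE_B = (("Sp1", "Sp2"), ("Sp4", "Sp5"))
print(tree_length(TREE_A, SEQS), tree_length(TREE_B, SEQS))
```

Under these assumptions the grouping (Sp1, Sp4) versus (Sp2, Sp5) needs fewer changes than the alternative, so parsimony prefers it.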

21 Reconstructing trees via sequence data
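[Figure: character matrix of six characters (1 to 6) for outgroup O and taxa A, B, C, D, with the resulting tree and the character changes mapped onto its branches: 1. T=>A; 2. G=>A; 3. T=>C; 4. A=>G; 4. A=>C; 5. A=>gap; 6. T=>G. Tree length = 8]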

22 Neighbor-joining Method

23 NJ distance matrices

24 NJ distance matrices

25 NJ distance matrices

26 NJ distance matrices

27 Finished NJ tree
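
The distance matrices from these slides are not preserved in the transcript. As a stand-in, the sketch below shows the pair-selection step of neighbor joining, using raw difference counts between the four sequences from the "Estimate a Phylogeny" slides as the distances (an assumption made only for illustration). A full neighbor-joining run would also compute branch lengths and rebuild the matrix after each join; the names `q_matrix` and `pick_pair` are hypothetical.

```python
from itertools import combinations

TAXA = ["Sp1", "Sp2", "Sp4", "Sp5"]

# Raw difference counts between the slide sequences (illustrative distances).
DIST = {
    frozenset(("Sp1", "Sp2")): 5,
    frozenset(("Sp1", "Sp4")): 1,
    frozenset(("Sp1", "Sp5")): 4,
    frozenset(("Sp2", "Sp4")): 4,
    frozenset(("Sp2", "Sp5")): 3,
    frozenset(("Sp4", "Sp5")): 3,
}


def q_matrix(dist, taxa):
    """Neighbor-joining Q criterion for every pair of taxa.

    Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
    """
    n = len(taxa)

    def d(a, b):
        return 0 if a == b else dist[frozenset((a, b))]

    row_sum = {a: sum(d(a, b) for b in taxa) for a in taxa}
    return {
        (a, b): (n - 2) * d(a, b) - row_sum[a] - row_sum[b]
        for a, b in combinations(taxa, 2)
    }


def pick_pair(q):
    """Pair with the smallest Q value; ties go to the first pair listed."""
    return min(q, key=q.get)


Q = q_matrix(DIST, TAXA)
print(Q)
print("join first:", pick_pair(Q))
```

With four taxa, the pairs (Sp1, Sp4) and (Sp2, Sp5) tie on Q because joining either one implies the same split of the tree.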

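28 Models of Evolution
Pyrimidines: T, C
Purines: A, G
Transitions: substitutions within a class (A<->G, C<->T)
Transversions: substitutions between classes (purine<->pyrimidine)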
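
The distinction can be made concrete by classifying the differences between two aligned sequences. This is a small sketch with hypothetical helper names (`classify`, `count_ts_tv`); it assumes unambiguous A/C/G/T characters and no gaps.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(x, y):
    """Label a pair of nucleotides as identical, transition, or transversion."""
    x, y = x.upper(), y.upper()
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a nucleotide: {base!r}")
    if x == y:
        return "identical"
    both_purines = x in PURINES and y in PURINES
    both_pyrimidines = x in PYRIMIDINES and y in PYRIMIDINES
    return "transition" if (both_purines or both_pyrimidines) else "transversion"


def count_ts_tv(a, b):
    """Return (transitions, transversions) between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    kinds = [classify(x, y) for x, y in zip(a, b)]
    return kinds.count("transition"), kinds.count("transversion")


# Sp1 and Sp2 from the "Estimate a Phylogeny" slides.
print(count_ts_tv("ACCGTCTTGTTA", "AGCGTCATCAAA"))
```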

29 Maximum Likelihood
Base frequencies: $f_A + f_G + f_C + f_T = 1$
Base exchange: $f_s + f_v = 1$
R-matrix: the six relative rate parameters sum to 1
Gamma shape parameter
Number of discrete gamma-distribution categories
Pinvar: $f_{var} + f_{inv} = 1$
Likelihood: $L = \prod_i l_i$, where $i$ indexes each character

30 Maximum Likelihood L = Pr(D|H)
[Figure: tree with observed tip states C, G, G, A, G, unobserved internal nodes w, x, y, z, and branch lengths t1 to t8]
The likelihood is not the probability that the tree is the true tree; rather, it is the probability that the tree has given rise to the data we collected. Likelihood requires three elements. We have talked about two, the data and the tree (hypothesis); the third is the model of evolution.

31 ML cont. The probability that the nucleotide at time t is i, and the probability that the nucleotide at time t is j (j ≠ i), are each given by a formula that depends on the model of evolution.
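
The slide does not name the model behind these two probabilities. As an illustration only, the sketch below assumes the Jukes-Cantor model, the simplest case, in which a site that starts as nucleotide i changes to each of the other three nucleotides at the same rate `alpha`. The function names are hypothetical.

```python
import math


def jc69_same(t, alpha):
    """P(nucleotide at time t is i | it started as i) under Jukes-Cantor."""
    return 0.25 + 0.75 * math.exp(-4.0 * alpha * t)


def jc69_diff(t, alpha):
    """P(nucleotide at time t is one specific j != i | it started as i)."""
    return 0.25 - 0.25 * math.exp(-4.0 * alpha * t)


for t in (0.0, 0.1, 1.0, 10.0):
    print(t, jc69_same(t, 1.0), jc69_diff(t, 1.0))
```

At t = 0 the site is certainly unchanged. As t grows, all four nucleotides approach probability 1/4, which is why very long branches carry little information.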

32 The conditional probability of H given D: posterior probability
Bayes Theorem
$$P(H \mid D) = \frac{P(H)\,P(D \mid H)}{P(D)}$$
H = hypothesis, D = data
$P(H \mid D)$: the conditional probability of H given D (posterior probability)
$P(H)$: prior probability or marginal probability of H
$P(D \mid H)$: likelihood function
$P(D) = \sum_H P(H)\,P(D \mid H)$: prior probability or marginal probability of D; the normalizing constant that ensures $\sum_H P(H \mid D) = 1$

33 Take Home Message
Likelihood: represents the probability of the data given the hypothesis => difficult to interpret
Bayesian approach: estimates the probability of the hypothesis given the data => estimates the probability for the hypothesis of interest
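
A small numeric sketch of the difference: given a prior and a likelihood for each competing hypothesis, Bayes' theorem rescales their products so that the posteriors sum to 1. The three tree labels and all numbers below are invented for illustration, and `posterior` is a hypothetical helper.

```python
def posterior(priors, likelihoods):
    """Posterior P(H|D) for each hypothesis H from P(H) and P(D|H)."""
    evidence = sum(priors[h] * likelihoods[h] for h in priors)  # P(D)
    if evidence == 0:
        raise ValueError("the data have zero probability under every hypothesis")
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}


# Hypothetical values: equal priors on three candidate trees.
PRIORS = {"tree1": 1 / 3, "tree2": 1 / 3, "tree3": 1 / 3}
LIKELIHOODS = {"tree1": 0.006, "tree2": 0.003, "tree3": 0.001}

POST = posterior(PRIORS, LIKELIHOODS)
print(POST)
```

The likelihoods on their own are tiny numbers that are hard to read, whereas the posteriors are direct statements about the hypotheses.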

34 Bayesian Inference of Phylogeny
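Notation: $\tau_i$ is the i-th tree topology, $\nu_i$ its branch lengths, $\theta$ the substitution-model parameters, $X$ the data, and $B(s)$ the number of possible trees over which the sum runs.
$$f(\tau_i \mid X) = \frac{f(\tau_i)\, f(X \mid \tau_i)}{\sum_{j=1}^{B(s)} f(\tau_j)\, f(X \mid \tau_j)}$$
Calculating the posterior probability (pP) of a tree involves a summation over all possible trees and, for each tree, integration over all combinations of branch lengths (bl) and substitution-model parameter values:
$$f(\tau_i, \nu_i, \theta \mid X) = \frac{f(\tau_i, \nu_i, \theta)\, f(X \mid \tau_i, \nu_i, \theta)}{\sum_{j=1}^{B(s)} \int_{\nu_j, \theta} f(\tau_j, \nu_j, \theta)\, f(X \mid \tau_j, \nu_j, \theta)\, d\nu_j\, d\theta}$$
Inferences of any single parameter are based on the marginal distribution of the parameter:
$$f(\tau_i \mid X) = \frac{\int_{\nu_i, \theta} f(\tau_i, \nu_i, \theta)\, f(X \mid \tau_i, \nu_i, \theta)\, d\nu_i\, d\theta}{\sum_{j=1}^{B(s)} \int_{\nu_j, \theta} f(\tau_j, \nu_j, \theta)\, f(X \mid \tau_j, \nu_j, \theta)\, d\nu_j\, d\theta}$$
This marginal probability distribution of the topology, for example, integrates out all the other parameters.
Advantage: the power of the analysis is focused on the parameter of interest (i.e., the topology of the tree).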

35 Estimating phylogenies
Exhaustive searches
Branch and bound methods
Rise in computational time versus rise in solution space

36 How many topologies are there?
When we add a species to a tree, the number of ways in which we can do so is equal to the number of branches, including the branch at the bottom of the tree. There are 3 such branches in a two-species tree. Every time we add a new species, it adds a new interior node plus two new branches. Thus, after choosing one of the 3 possible places to add the third species, the fourth can be added in any of 5 places, the fifth in any of 7, and so on.
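
The counting argument above multiplies out to 3 x 5 x 7 x ... x (2n - 3) rooted bifurcating trees for n species. A short sketch (the function name is hypothetical):

```python
def count_rooted_trees(n_species):
    """Number of rooted bifurcating tree topologies for n labeled species.

    The k-th species (k >= 3) can be attached to any of the 2k - 3 branches
    of the tree built so far, counting the branch at the bottom of the tree.
    """
    if n_species < 1:
        raise ValueError("need at least one species")
    count = 1
    for k in range(3, n_species + 1):
        count *= 2 * k - 3
    return count


for n in (2, 3, 4, 5, 10, 20):
    print(n, count_rooted_trees(n))
```

The count passes 34 million at 10 species, which is why exhaustive searches stop being practical so quickly.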

37 The Phylogenetic Problem

38 HIV-1 Whole Genomes: 1993 - 15 genomes; 2003 (Jan) - 397 genomes
The two trees represent complete HIV-1 genomes (limited to those with over 7000 bp sequenced) from the Los Alamos National Labs HIV database. The sparse tree represents those genomes sequenced in 1993 or earlier (determined by the sequence submission date to GenBank, not the publication date, since several seemed to be cobbled together from multiple sources). There were 15 genomes by 1993, mostly from subtype B and a few from subtype D, with a final alignment of 8097 characters. The dense tree represents 397 complete HIV-1 genomes, the current complement of genomes available. The search on the Los Alamos database came up with 416 genomes, but a few were deleted during the alignment process due to stretches of questionable sequence. The final alignment length was 8583 characters. Both trees are color coded by subtype and major groups of recombinants. Both trees were constructed using Neighbor Joining, with the results of Modeltest providing the model of evolution for tree construction.

39 Tree Space - the final frontier

40 Heuristic Searches
Nearest-neighbor interchanges (NNI) - swap two adjacent branches on the tree.
Subtree pruning and regrafting (SPR) - remove a branch from the tree (either an interior or an exterior branch) with a subtree attached to it. The subtree is then reinserted into the remaining tree in all possible places.
Tree bisection and reconnection (TBR) - an interior branch is broken, and the two resulting fragments of the tree are considered as separate trees. All possible connections are made between a branch of one and a branch of the other.
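
As a rough illustration of the smallest of these moves: an interior branch separates four subtrees into two pairs, and an NNI swaps one subtree from each side, so each interior branch has exactly two NNI neighbors. The sketch below assumes the four subtrees are already known and uses a hypothetical helper name; a real search would apply this at every interior branch and rescore each result.

```python
def nni_neighbors(a, b, c, d):
    """Two NNI rearrangements around an interior branch.

    The current arrangement groups (a, b) on one side of the branch and
    (c, d) on the other. Each argument may be a leaf name or a subtree.
    """
    return [
        ((a, c), (b, d)),  # swap b with c
        ((a, d), (b, c)),  # swap b with d
    ]


current = (("Sp1", "Sp4"), ("Sp2", "Sp5"))
for neighbor in nni_neighbors("Sp1", "Sp4", "Sp2", "Sp5"):
    print(current, "->", neighbor)
```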

41 Other approaches
Tree-fusing - find two near-optimal trees and exchange subgroups between the two trees
Genetic algorithms - a simulation of evolution with a genotype that describes the tree and a fitness function that reflects the optimality of the tree
Disc Covering - upcoming paper

42 Phylogenetic Accuracy?
Consistency - A phylogenetic method is consistent for a given evolutionary model if the method converges on the correct tree as the data available to the method become infinite. All methods are consistent when their assumptions (explicit and implicit) are met, and all methods are inconsistent when these assumptions are violated sufficiently.
Efficiency - Statistical efficiency is a measure of how quickly a method converges on the correct solution as more data are applied to the problem. In the case of phylogenetic methods, efficiency may be measured in terms of the number of characters required to find the correct solution at a given frequency or in terms of the frequency of correct solutions at a given sample size.
Robustness - Robustness refers to the degree to which violations of assumptions will affect the performance of phylogenetic methods. All methods are based on explicit and/or implicit assumptions about the evolutionary process, and yet we know these assumptions are violated to one degree or another in real data.

43

44 How reliable is MY phylogeny?
Bootstrap analysis
Jackknife analysis
Posterior probabilities (Bayesian approaches)
Decay indices

45 Bootstrap

46 Pseudoreplicates
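
A bootstrap pseudoreplicate is built by sampling alignment columns with replacement until the new alignment is as long as the original. The sketch below shows only this resampling step, using the four sequences from the "Estimate a Phylogeny" slides. The function name and the fixed seed are assumptions for the example; a full analysis would rebuild a tree from each pseudoreplicate and tally how often each grouping recurs.

```python
import random

SEQS = {
    "Sp1": "ACCGTCTTGTTA",
    "Sp2": "AGCGTCATCAAA",
    "Sp4": "ACCGTCTTGATA",
    "Sp5": "AGCCTCTTCATA",
}


def bootstrap_replicate(seqs, rng):
    """Resample alignment columns with replacement.

    Returns the pseudoreplicate alignment and the list of sampled column
    indices. The same columns are used for every sequence, so each
    resampled site stays aligned across species.
    """
    n_sites = len(next(iter(seqs.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    replicate = {name: "".join(s[i] for i in cols) for name, s in seqs.items()}
    return replicate, cols


rng = random.Random(0)  # fixed seed so the example is repeatable
replicate, cols = bootstrap_replicate(SEQS, rng)
for name, s in replicate.items():
    print(name, s)
```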

