Phylogenetic inference or “How to recognize a tree from quite a long way away”
Mikael Thollesson, Evolutionary Biology Centre, Uppsala University
Slides are available on the course’s web page

“Bioinformation in the cell”
[Figure: information flow in the cell, from DNA via RNA and mRNA to polypeptide, protein and enzyme; the labelled steps are transcription, splicing, translation, protein folding and coenzyme activation]

“Extended bioinformation”
[Figure: DNA replication; an original sense/anti-sense strand pair gives two pairs, each combining one original strand with a new strand (new anti-sense strand, new sense strand)]

Phylogeny from a bioinformatic viewpoint
• A phylogeny is the (event) history more or less exclusively shared by some kind of biological replicators
• These replicators can in practice be, for example,
  – Species, populations, strains
  – Genomes, genes
• Phylogenies can usually be modelled as trees; “phylogeny” and “phylogenetic tree” have thus become more or less synonymous, even though they are not
• The objective of phylogenetic analysis is to infer this history and these events, usually resulting in a phylogenetic hypothesis in the form of a tree (together with cosmology, the only science dealing with particular histories)
[Figure: tree with terminals labelled 000, 010, 111 and 110]

[Figure: the sequence GCCACTTTCGCGATCA accumulates substitutions along a branching history over time; in a second panel only the observed sequences remain and the history is unknown (“?”)]
Observed sequences:
GCgACTTTCGCGATta
GCgACTTTCcCGATtA
GCgACTagCGCGATCA
GCCACaTTCcCGATCA
GCCACaTTCcCGAgCA
GCgACTTTCGC--Tta
• Ordering the sequences hierarchically by shared evolutionary novelties, synapomorphies, produces a phylogenetic hypothesis (tree)
• We cannot distinguish between novelties and ancestral states; we only see the differences
• Parallel substitutions and multiple substitutions at the same site create ambiguities about the hierarchy
• We must make some a priori assumption of homology; for sequences, this is the same as doing a multiple alignment

Taxon          limbs   amnion
Lion           yes     yes
Bald eagle     yes     yes
Bullfrog       yes     no
Cod            no      no
White shark    no      no

Columns are characters, cell entries are character states, and rows are taxa (terminal units).
[Figure: tree of Lion, Bald eagle, Bullfrog, Cod and White shark]

Lineus geniculatus       TGGGCTGGGATGAAGGGAAGTATCGTGGGCCCGG
Micrura akkeshiensis     GGGGCTAGAATGAATGGGA-TAACGAGCCCCCGA
Myoisophagus versicolor  GGGGCTAGAATGAAAGAAA-GTTTGAGACCTCAT
Parvicirrus dubius       GGGACTGGAATGAAAGAAA-TTTTGAGGCCTTAA

1. Gather data from the entities whose phylogeny we are interested in
2. Select a criterion to evaluate how well each possible tree fits the observed data
   [Figure: the candidate trees for the four taxa (Lg, Ma, Mv, Pd), with scores such as 22, 26 and 27]
3. Find the tree that best fits the data and choose it as the preferred hypothesis
   [Figure: the preferred tree of Mi. akkeshiensis, Li. geniculatus, Pa. dubius and My. versicolor, with score 22]
4. Evaluate the sampling variation in the data to see if you have enough support for your conclusion
   [Figure: the preferred tree with a support value of 95%]

Why do phylogenetics? – Prediction
• Prospective biomedical compounds from sponges (Porifera)
• Treatment of microsporidia
• Gauging biodiversity for conservation
“Taxa are not related because of similarity, but similar due to relatedness”

Why? – Sequence of evolutionary events
• Why oaks retain their leaves, in contrast to other deciduous trees
  [Figure label: Evergreens]
• Evolution of metabolic pathways
• Tracing infection histories for viruses

Why? – (Ab)use of the comparative method
• Correlation between ability to fly and being black and white
Species, populations, or genes (i.e., entities corresponding to replicators) are not independent samples/observations, since they have a more or less inclusively shared history

Trees and terminology
[Figure: tree with terminals A, B, C and D, labelled “branch or edge” and “node or vertex”; a rooted version with terminals D, C, A, B and a marked clade; a tree with edges e1 to e6]
• Terminal nodes (external vertices) represent taxa or genes on which we have observations
• Internal vertices represent inferred splitting events (may be interpreted as ancestral species or gene copies)
• Unrooted vs. rooted trees
• Rooting is normally done using a designated outgroup

Relatedness
X is defined to be more closely related to Y than to Z if and only if X shares a (more recent) history with Y that it does not share with Z
[Figure: trees with terminals A, B, C and D drawn in different orders]

“The standard recipe” for phylogenetic inference
• Collect your data
• Select an optimality criterion (“Which tree is better?”)
• Optional: do data transformations (“corrections”)
• Select a search strategy and find the best hypothesis (according to the selected criterion) using this search method
• Assess the variation in your data in some way
• There are really only two big theoretical problems in phylogenetic inference…
  – The criterion and calculating the score
  – Finding the best tree

Step 1 – Data collection
• Any observation of inherited traits is in principle useful
• Primary homology assessment: from traits to characters and character states; for sequence data this corresponds to alignment
• Pair-wise differences (e.g., DNA-DNA hybridization, histocompatibility) can also be used, although with a limited set of criteria
• Include one or several outgroups for rooting

Step 2 – Optimality criteria, some selected

Assumptions shared by (almost) all optimality criteria/methods
• Characters are independent (and thus the order in the data matrix does not matter)
  – Special models exist for, e.g., rRNA and codons
• The substitution process is homogeneous over time/in the entire tree (overall rate can vary)
  – Special models do not make this assumption
• Substitution rates are the same for all characters
  – Rate variation among characters can be accommodated easily in most methods

Parsimony optimality criterion
• Given two trees, the one requiring the lowest number of character changes necessary to explain the observed character distribution is the better
  – The parsimony score for a tree is the minimum number of required changes
  – This score is frequently referred to as number of steps or tree length
  – The method can be modified using non-uniform weights
    • Character weights (positional weights)
    • Character state weights (transformational weights)

Parsimony – an example
acgtatgga
acgggtgca
aacggtgga
aactgtgca
[Figure: alternative trees for the four sequences, with the states of one character (c, c, a, a) mapped onto the terminals; total tree length 7 for the first tree and 8 for the second]
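The tree lengths in this example can be checked with a small sketch of Fitch's algorithm for unweighted parsimony. The taxon names s1 to s4 are placeholders for the four sequences in the order listed (the labels on the slide are not legible in this transcript), and trees are written as nested tuples. The root position is arbitrary here, since the unweighted parsimony length does not depend on it.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, changes) for one character.

    tree is either a leaf name (str) or a pair (left, right).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The two subtrees can agree on a state: no extra change needed.
        return common, left_cost + right_cost
    # No shared state: one change is required on one of the two branches.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# s1..s4 are hypothetical names for the four example sequences.
alignment = {
    "s1": "acgtatgga",
    "s2": "acgggtgca",
    "s3": "aacggtgga",
    "s4": "aactgtgca",
}
trees = {
    "((s1,s2),(s3,s4))": (("s1", "s2"), ("s3", "s4")),
    "((s1,s3),(s2,s4))": (("s1", "s3"), ("s2", "s4")),
    "((s1,s4),(s2,s3))": (("s1", "s4"), ("s2", "s3")),
}
lengths = {name: tree_length(tree, alignment) for name, tree in trees.items()}
for name, length in lengths.items():
    print(name, length)
```

With four terminals there are only three unrooted topologies, so scoring all of them is already an exhaustive search; the first grouping needs 7 steps and the other two need 8.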

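Using substitution models – Why?
[Figure: observed differences plotted against actual changes]
Example: Jukes-Cantor model
P_{ij}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}, if i = j
P_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}, if i ≠ j
• Jukes-Cantor is the simplest model in a class of models for DNA called general time-reversible (GTR) models
• GTR (the most complex symmetric model) has six different rates (one for each pair of bases) and different base frequencies
[Figure: the four bases A, G, T and C connected pairwise]
P(t) = e^{Qt}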

Pair-wise distances – an example
1  acgtatggac
2  acgggtgcac
3  aacggtggac
4  aactgtgcac

     1     2     3     4
1    0     0.3   0.4   0.4
2    0.3   0     0.3   0.3
3    0.4   0.3   0     0.2
4    0.4   0.3   0.2   0

Sequences 3 and 4 differ at 2 of 10 sites: p = 2/10 = 0.2 (p distance)
Jukes-Cantor distance: d = -\frac{3}{4} \ln(1 - \frac{4}{3} p)
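The distance matrix for this example can be recomputed with a short sketch. It uses the p distance (proportion of differing sites) and the standard Jukes-Cantor correction d = -(3/4) ln(1 - (4/3) p); the function names are illustrative only.

```python
import math


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jc_distance(p):
    """Jukes-Cantor corrected distance; only defined for p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


seqs = ["acgtatggac", "acgggtgcac", "aacggtggac", "aactgtgcac"]
p = [[p_distance(a, b) for b in seqs] for a in seqs]
d = [[jc_distance(x) for x in row] for row in p]

for row in p:
    print("  ".join(f"{x:.1f}" for x in row))
print(f"JC distance between sequences 3 and 4: {d[2][3]:.4f}")
```

The corrected distance is always at least as large as the p distance, because it accounts for multiple substitutions at the same site.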

Minimum evolution optimality criterion
• Starts by calculating pair-wise distances between all terminal taxa/sequences
  – These calculations can incorporate explicit substitution models, e.g., Jukes-Cantor
• Given two trees, the one having the lowest sum of branch lengths when fitted to the data is the better
• One way to fit the data is to use the constraints below, or a least-squares approximation
  – No branch can have negative length, e_{ij} ≥ 0
  – The path between two terminals along the tree is at least as long as the pair-wise distance, e_{ij} ≥ d_{ij}
• The score is commonly referred to as tree length (as for parsimony)

Maximum likelihood optimality criterion
• Given two trees, the one with the higher likelihood, i.e. the one with the higher conditional probability of observing the data, is the better
  – Site likelihood is the conditional probability of the data at one site (one character) given the assumed model of evolution and the parameters of the model
  – Data set likelihood is the product of the site likelihoods (character independence)
• Likelihood values under different models are comparable, thus giving us a way to test the adequacy of the model
• The model consists of
  – A substitution model, e.g. Jukes-Cantor
  – A tree with branch lengths

Likelihood of a one-branch tree (for Jukes-Cantor!)
Taxon1  AC
Taxon2  CC
L_tot = L_1 · L_2, or log L_tot = log L_1 + log L_2
[Figure: Taxon1 (AC) and Taxon2 (CC) joined by a single branch]

Another one-branch tree
30 nucleotides from globin genes of two primates on a one-edge tree
            * *
Gorilla    GAAGTCCTTGAGAAATAAACTGCACACTGG
Orangutan  GGACTCCTTGAGAAATAAACTGCACACTGG
There are two differences and 28 similarities
[Figure: lnL plotted against αt; the maximum is at αt = 0.02327, where lnL = -51.133956]
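The numbers on this slide can be reproduced with a minimal sketch. It assumes the Jukes-Cantor transition probabilities in the form P(same) = 1/4 + 3/4·exp(-4αt) and P(different) = 1/4 - 1/4·exp(-4αt), with equal base frequencies of 1/4; that parametrization is an assumption, chosen because it gives exactly the values shown. The data set log-likelihood is the sum of the site log-likelihoods.

```python
import math


def jc_prob(same, at):
    """Jukes-Cantor probability of ending in the same / a given different base."""
    e = math.exp(-4.0 * at)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def log_likelihood(seq1, seq2, at):
    """Log-likelihood of two aligned sequences joined by one branch of length at."""
    total = 0.0
    for x, y in zip(seq1, seq2):
        # 0.25 is the equilibrium frequency of the base at one end of the branch.
        total += math.log(0.25 * jc_prob(x == y, at))
    return total


gorilla = "GAAGTCCTTGAGAAATAAACTGCACACTGG"
orangutan = "GGACTCCTTGAGAAATAAACTGCACACTGG"

n = len(gorilla)
k = sum(x != y for x, y in zip(gorilla, orangutan))

# Closed-form maximum likelihood estimate of alpha*t under this model.
at_hat = -0.25 * math.log(1.0 - 4.0 * k / (3.0 * n))
lnl_hat = log_likelihood(gorilla, orangutan, at_hat)

print(f"{k} differences in {n} sites")
print(f"alpha*t = {at_hat:.5f}, lnL = {lnl_hat:.6f}")
```

Evaluating `log_likelihood` at other branch lengths, for instance 0.01 or 0.05, gives lower values, which is what the curve on the slide illustrates.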

Likelihoods of a more interesting tree…
• Bases at internal nodes are unknown
[Figure: four-terminal tree with observed bases A, A, C and T, internal nodes u and v, and branches e1 to e5]
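Because the bases at u and v are unknown, the site likelihood is obtained by summing over all 4 × 4 combinations of internal states. The sketch below does this by brute force under Jukes-Cantor. The exact arrangement in the figure is not recoverable from this transcript, so the layout used here is an assumption: the first two terminals attach to u by e1 and e2, the last two attach to v by e3 and e4, and e5 joins u and v. The branch lengths are arbitrary illustration values.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(i, j, at):
    """Jukes-Cantor probability of base i changing to base j along a branch."""
    e = math.exp(-4.0 * at)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def site_likelihood(tips, lengths):
    """Likelihood of one site on the assumed tree ((t1,t2)u,(t3,t4)v)."""
    t1, t2, t3, t4 = tips
    e1, e2, e3, e4, e5 = lengths
    total = 0.0
    for u, v in product(BASES, repeat=2):
        total += (
            0.25                       # prior probability of base u
            * jc_prob(u, t1, e1)
            * jc_prob(u, t2, e2)
            * jc_prob(u, v, e5)        # internal branch between u and v
            * jc_prob(v, t3, e3)
            * jc_prob(v, t4, e4)
        )
    return total


lengths = (0.1, 0.1, 0.1, 0.1, 0.05)   # hypothetical branch lengths
site_l = site_likelihood("AACT", lengths)
print(f"site likelihood for A, A, C, T: {site_l:.6g}")

# Sanity check: the likelihoods of all 256 possible site patterns sum to 1.
pattern_sum = sum(site_likelihood(t, lengths) for t in product(BASES, repeat=4))
print(f"sum over all patterns: {pattern_sum:.6f}")
```

The brute-force sum grows as 4 to the power of the number of internal nodes; practical programs use Felsenstein's pruning algorithm to compute the same quantity efficiently.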

Step 3 – Finding the best tree
• Number of (rooted) trees for n terminals is (2n-3)·(2n-5)·(2n-7)·…·3·1
  – 3 taxa -> 3 trees
  – 4 taxa -> 15 trees
  – 10 taxa -> 34 459 425 trees
  – 25 taxa -> 1.19·10^30 trees
  – 52 taxa -> 2.75·10^80 trees
• Finding the optimal tree is an NP-complete or NP-hard problem

Search strategies
• Exact: will find the best tree (according to the selected criterion)
  – Exhaustive: up to ca. 10 taxa
  – Branch and bound: up to ca. 15 taxa
• Heuristic: limits the search to a “reasonable” set of trees; may not find the optimal tree
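The tree counts above follow directly from the product formula. A minimal sketch, with an illustrative function name:

```python
def rooted_tree_count(n):
    """Number of rooted binary trees for n terminals: (2n-3)(2n-5)...3*1."""
    if n < 2:
        raise ValueError("need at least two terminals")
    count = 1
    for factor in range(3, 2 * n - 2, 2):   # 3, 5, ..., 2n-3
        count *= factor
    return count


for n in (3, 4, 10, 25, 52):
    print(f"{n:2d} taxa -> {rooted_tree_count(n):.3g} trees")
```

Python integers have arbitrary precision, so the exact counts are available as well; the rapid growth is the reason exhaustive search is limited to small numbers of taxa.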

Heuristic tree searches usually
• start with hill climbing (greedy algorithms) to obtain a starting tree
  – Star decomposition
  – Stepwise addition
• and proceed with some flavour of branch swapping to improve on the starting tree and find better trees

Heuristic tree search – Star decomposition
[Figure: a star tree of terminals A to E is resolved step by step by joining pairs of terminals]

Heuristic tree search – Stepwise addition
[Figure: starting from the tree of A, B and C, terminals D and E are added one at a time; each possible placement is scored (831, 783 and 837 for D; 921, 914, 915, 905 and 916 for E)]

Heuristic tree search – Branch swapping
[Figure: trees with terminals A to I rearranged by SPR (subtree pruning and regrafting) and by TBR (tree bisection and reconnection)]

Step 2+3 – A dirty shortcut to get a tree…
• Instead of evaluating each tree, some methods build a tree using a specific algorithm, usually from pair-wise distances
• Neighbor-joining (NJ) is one such method that is widely used
  – NJ can roughly be viewed as a star decomposition minimizing the sum of branch lengths (evolutionary change)
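As an illustration of such an algorithmic method, here is a compact sketch of neighbor-joining. It repeatedly picks the pair of nodes minimizing the usual criterion Q(a, b) = (n - 2)·d(a, b) - r(a) - r(b), where r is the row sum of the distance matrix, and joins the pair at a new internal node. The distance matrix is a made-up additive toy example, not data from the slides, and the internal node names u0, u1, … are arbitrary.

```python
def neighbor_joining(names, dist):
    """Return a list of (node, neighbour, branch length) edges of the NJ tree.

    dist is a symmetric dict of dicts of pair-wise distances.
    """
    d = {a: dict(row) for a, row in dist.items()}   # work on a copy
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair with the smallest Q value.
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"u{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Distances from the new node to every remaining node.
        d[u] = {}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


# Hypothetical toy distances between five terminals a..e.
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {x: {y: matrix[i][j] for j, y in enumerate(names)}
        for i, x in enumerate(names)}

edges = neighbor_joining(names, dist)
for node, neighbour, length in edges:
    print(f"{node} - {neighbour}: {length}")
print("sum of branch lengths:", sum(length for _, _, length in edges))
```

The result is a single unrooted tree with branch lengths; no alternative trees are scored along the way.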

What is a “good” method?
• Efficiency – Time to find a/the solution
• Power – Rate of convergence/how much data are needed
• Consistency – Convergence to “correct” solution as data are added
• Robustness – Performance when assumptions are violated
• Falsifiability – Rejection of the model when it is inadequate

Performance on simulated data
[Figure: frequency of correct inference plotted against sequence length; panels labelled “All 0.50” and “0.30 and 0.05 respectively”]

Some pros and cons of selected methods
• Pair-wise, algorithmic approach (e.g. Neighbor-joining)
  + Fast
  + Models can be used when transforming to distances
  - Information is lost when transforming to pair-wise distances
  - One will get a tree, but no measure of goodness to compare with other hypotheses (when using algorithmic methods like NJ)
• Parsimony
  + Philosophically appealing: Occam’s razor (no unnecessary assumptions)
  + Can be applied to most kinds of data without prior knowledge
  - Can be inconsistent
  - Can be computationally slow
• Maximum likelihood
  + Model based; enables statistical tests and handles problems with multiple substitutions
  - Model based; models can be inadequate and give misleading results
  - Computationally veeeeery slooooowww

Step 4 – Assessing the variation in the data
• Variation cannot be assessed by repeated sampling from the statistical population; we have a unique sample
• We have to rely on resampling from the data already at hand
  – Jack-knife: resampling without replacement
  – Bootstrap: resampling with replacement

Bootstrap
1. Original data set with n characters; original analysis, e.g. MP, ML, NJ.
2. Draw n characters randomly with replacement. Repeat m times, giving m pseudo-replicates, each with n characters.
3. Repeat the original analysis on each of the pseudo-replicate data sets.
4. Evaluate the results from the m analyses.
[Figure: trees of Aus, Beus, Ceus and Deus from the original analysis and from the pseudo-replicates; one group receives 75% support]
• Bootstrap proportions between 0.5 and 1 can be interpreted as a measure of confidence or support
• Values below 0.5 are nonsense
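The resampling step can be sketched as follows: characters (alignment columns) are drawn with replacement for the bootstrap, and without replacement for the jack-knife. The sequences for Aus, Beus, Ceus and Deus are invented for illustration, and the 50% retention used for the jack-knife is an assumption, not a value from the slides. The tree-building analysis itself is left out.

```python
import random


def columns(alignment):
    """Return the alignment as a list of columns (one tuple per character)."""
    names = list(alignment)
    n = len(alignment[names[0]])
    return [tuple(alignment[name][i] for name in names) for i in range(n)]


def bootstrap_replicate(alignment, rng):
    """Draw n columns with replacement from an alignment with n columns."""
    n = len(next(iter(alignment.values())))
    picks = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}


def jackknife_replicate(alignment, rng, keep_fraction=0.5):
    """Draw a subset of columns without replacement (assumed 50% retention)."""
    n = len(next(iter(alignment.values())))
    k = max(1, int(n * keep_fraction))
    picks = sorted(rng.sample(range(n), k))
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}


# Hypothetical data for the four taxa named on the slide.
alignment = {
    "Aus":  "ACGTACGT",
    "Beus": "ACGTACGA",
    "Ceus": "ACCTTCGA",
    "Deus": "AGCTTCGA",
}

rng = random.Random(1)   # fixed seed so the run is repeatable
m = 100
replicates = [bootstrap_replicate(alignment, rng) for _ in range(m)]
print(replicates[0])
```

Each pseudo-replicate would then be analysed with the same method as the original data, and the proportion of replicates recovering a given group is reported as its bootstrap support.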

What can go wrong?
• Sampling error (i.e., due to finite data)
  – Assessed by, for example, the bootstrap
• Systematic error (inconsistent method)
  – Tests of the adequacy of the models used
  – Using different methods with different properties and comparing the results
• Inadequate tree search (heuristics)
• Reality
  – A tree may be a poor model of the real history
  – Information has been lost by subsequent evolutionary changes
• “Species” vs. “gene” trees

What is wrong with this tree?
[Figure: tree of Canis, Mus and Gadus with bootstrap value 100]
• Negligible (within sequence) sampling error: high bootstrap values
• Tree estimated by a consistent method

The expected tree…
[Figure: a gene duplication mapped onto the “species” tree, and the resulting “gene” trees]

Two copies (paralogs) present in the genomes
[Figure: gene tree with terminals Canis, Mus, Gadus and Mus, Canis; relationships labelled “Paralogous” and “Orthologous”]

What we have actually studied…
[Figure: tree of Canis, Gadus and Mus]
• To detect a paralogy problem, several different genes can be used to infer the “species” phylogeny

To conclude
• Phylogenetic inference deals with historical events and information transfer: the evolutionary history
• Results from phylogenetic analyses are hypotheses for further testing; the true history will remain unknown
• Inference is mathematically intricate and computationally heavy, and as a result methods for phylogenetic inference are legion. A good place to start looking for software is http://evolution.genetics.washington.edu/phylip/software.html
• There are several pitfalls to avoid when doing the analyses and when interpreting them, and most of the problems are data dependent…
• But… phylogenies have great explanatory power (the only means we have to predict properties of organisms), and ignoring the shared histories can sometimes give completely bogus results in comparative studies
