Phylogenetic reconstruction
What is phylogenetic analysis and why should we perform it? Phylogenetic analysis has two major components: 1. Phylogeny inference or “tree building”. 2. Character and rate analysis.
Common Phylogenetic Tree Terminology. Terminal Nodes (Leaves) A, B, C, D, E: represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny. Internal Nodes or Divergence Points: represent hypothetical ancestors of the taxa. Ancestral Node or ROOT of the Tree. Branches, Lineages or Clades. In graph terms: Vertices or Nodes, and Edges.
Phylogenetic trees diagram the evolutionary relationships between the taxa. [Figure: tree of Taxon A, Taxon B, Taxon C, Taxon E and Taxon D.] ((A,(B,C)),(D,E)) = the above phylogeny as nested parentheses. There is no meaning to the spacing between the taxa or to their order. The branch-length axis has either - no scale (for ‘cladograms’), - a scale proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or - a scale proportional to time (for ‘ultrametric trees’ or true evolutionary trees).
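The nested-parentheses notation can be read mechanically. The following is a minimal Python sketch, assuming label-only strings without branch lengths; the function names `parse_newick`, `leaves` and `clades` are illustrative and not part of the original slides.

```python
def parse_newick(text):
    """Parse a label-only nested-parentheses tree into nested tuples."""
    text = text.replace(" ", "").rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip '('
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ')'
            return tuple(children)
        # Otherwise read a leaf label up to the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    return parse_node()


def leaves(node):
    """Return the leaf labels below a node, in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def clades(node):
    """Return the leaf set of every internal node (one set per clade)."""
    if isinstance(node, str):
        return []
    found = [frozenset(leaves(node))]
    for child in node:
        found.extend(clades(child))
    return found


tree = parse_newick("((A,(B,C)),(D,E))")
print(tree)          # (('A', ('B', 'C')), ('D', 'E'))
print(leaves(tree))  # ['A', 'B', 'C', 'D', 'E']
```

Swapping the order of children inside any pair of parentheses changes the drawing but not the set of clades, which is why the order of the taxa carries no meaning.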
A few examples of what can be inferred from phylogenetic trees built from DNA or protein sequence data: Which species are the closest living relatives of modern humans? Did the infamous Florida Dentist infect his patients with HIV? What were the origins of specific transposable elements? Plus countless others…..
Using Phylogeny to Understand Gene Duplication and Loss. A. A gene tree. B. The gene tree superimposed on a species tree, allowing identification of the duplication and loss events.
Which species are the closest living relatives of modern humans? [Figure: two alternative trees of Humans, Chimpanzees, Bonobos, Gorillas and Orangutans, labeled “Mitochondrial DNA and most nuclear DNA-encoded genes” and “The pre-molecular view”, with time axes in MYA running from 0 to 15-30 and from 0 to 14.]
Did the Florida Dentist infect his patients with HIV? Phylogenetic tree of HIV sequences from the DENTIST, his patients, and local HIV-infected people. [Figure: tree with tips for the DENTIST, Patients A to G, and Local controls 2, 3, 9 and 35; groups of patients are marked “Yes” or “No”.] Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist. From Ou et al. (1992) and Page & Holmes (1998).
A few examples of what can be learned from character analysis using phylogenies as analytical frameworks: When did specific episodes of positive Darwinian selection occur during evolutionary history? What was the most likely geographical location of the common ancestor of the African apes and humans? Plus countless others…..
The goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees. [Figure: three trees of taxa A to E: a completely unresolved or "star" phylogeny, a partially resolved phylogeny, and a fully resolved, bifurcating phylogeny; a polytomy (multifurcation) and a bifurcation are labeled.]
There are three possible unrooted trees for four taxa (A, B, C, D). [Figure: Tree 1, Tree 2 and Tree 3, the three ways of pairing the four taxa.] Which one is correct?
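The number of unrooted trees increases in a greater than exponential manner with the number of taxa: (2N - 5)!! = # unrooted trees for N taxa, where !! is the double factorial (the product of every second integer down to 1). [Figure: unrooted trees with three to six taxa, A to F.]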
Inferring evolutionary relationships between the taxa requires rooting the tree. [Figure: an unrooted tree of taxa A, B, C, D with a root placed on one branch, and the resulting rooted tree.] Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.
Now, try it again with the root at another position. [Figure: the same unrooted tree of taxa A, B, C, D with the root on a different branch, and the resulting rooted tree.] Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.
An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees. [Figure: the unrooted tree 1 with root positions 1 to 5 marked on its branches, and the resulting rooted trees 1a, 1b, 1c, 1d and 1e.] These trees show five different evolutionary relationships among the taxa!
All of these rearrangements show the same evolutionary relationships between the taxa. [Figure: rooted tree 1a redrawn several times with its branches rotated around the internal nodes.]
Think for yourself How many unrooted trees are there with 4 taxa? With 5 taxa? How many rooted trees are there with 4 taxa? With 5 taxa?
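(2N - 3)!! = # rooted trees for N taxa. Each unrooted tree theoretically can be rooted anywhere along any of its branches, so the number of rooted trees equals the number of unrooted trees times the number of branches, 2N - 3. [Figure: unrooted trees with four, five and six taxa, A to F.]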
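The double-factorial counts are easy to check numerically. Below is a small Python sketch; the function names are illustrative and the formulas assume strictly bifurcating trees with labeled tips.

```python
def double_factorial(n):
    """Product n * (n - 2) * (n - 4) * ... down to 1 or 2."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def num_unrooted_trees(n_taxa):
    """(2N - 5)!! bifurcating unrooted trees for N >= 3 labeled taxa."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


def num_rooted_trees(n_taxa):
    """(2N - 3)!! bifurcating rooted trees for N >= 2 labeled taxa."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n_taxa - 3)


for n in (4, 5, 10):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
# 4 3 15
# 5 15 105
# 10 2027025 34459425
```

Each rooted count is the unrooted count multiplied by the 2N - 3 branches on which a root can be placed, which is a quick sanity check on both formulas.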
Finding the best tree
Number of (rooted) trees: 3 taxa -> 3 trees; 4 taxa -> 15 trees; 10 taxa -> 34,459,425 trees; 25 taxa -> 1.19 × 10^30 trees; 52 taxa -> 2.75 × 10^80 trees.
Finding the optimal tree is an NP-hard problem.
Search strategies:
- Exact: exhaustive search; branch and bound
- Algorithmic: greedy algorithms (including neighbor-joining)
- Heuristic, systematic: branch-swapping (NNI, SPR, TBR)
- Heuristic, stochastic: Markov chain Monte Carlo (MCMC); genetic algorithms
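Rooting Trees. Molecular clock: root = midpoint of the longest span. Extrinsic evidence (outgroup): e.g. select a fungus as the root for plants. [Figure: tree of A, B, C, D with branch lengths 10, 2, 3, 5, 2 and an outgroup.] Midpoint example: d(A,D) = 10 + 3 + 5 = 18, so the midpoint lies at 18 / 2 = 9 from either tip.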
Phylogenetic Models All sequences are homologous Each position in alignment homologous Positions evolve independently
Steps in Analysis Data Model (Alignment) DNA base substitution model Build Trees Algorithm based vs Criterion based Distance based vs Character-based
Practicalities. Quality of input data is critical. Examine data from all possible angles: distance, parsimony, likelihood. The outgroup taxon is critical: there is a problem if the outgroup shares a selective property with a subset of the ingroup. Order of input can be problematic: try different orders! Assess the variation in your data in some way.
Types of data used in phylogenetic inference:
Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.
Taxa        Characters
Species A   ATGGCTATTCTTATAGTACG
Species B   ATCGCTAGTCTTATATTACA
Species C   TTCACTAGACCTGTGGTCCA
Species D   TTGACCAGACCTGTGGTCCG
Species E   TTGACCAGTTCTCTAGTTCG
Distance-based methods: Transform the sequence data into pairwise distances, and use the matrix during tree building.
            A      B      C      D      E
Species A   ----   0.20   0.50   0.45   0.40
Species B   0.23   ----   0.40   0.55   0.50
Species C   0.87   0.59   ----   0.15   0.40
Species D   0.73   1.12   0.17   ----   0.25
Species E   0.59   0.89   0.61   0.31   ----
Example 1 (above the diagonal): Uncorrected “p” distance (= observed sequence difference, given as a proportion). Example 2 (below the diagonal): Kimura 2-parameter distance (estimate of the true number of substitutions between taxa).
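Both distance measures can be recomputed from the aligned sequences. The sketch below is an illustration, not the slide author's code: it assumes gap-free DNA sequences of equal length, and the function names are made up for this example. The Kimura 2-parameter formula used is the standard one, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), with P the proportion of transitions and Q the proportion of transversions.

```python
import math

# Transitions are purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T) changes.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def p_distance(seq1, seq2):
    """Uncorrected proportion of sites that differ."""
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance from transition and transversion proportions."""
    n = len(seq1)
    pairs = [frozenset((a, b)) for a, b in zip(seq1, seq2) if a != b]
    p = sum(pair in TRANSITIONS for pair in pairs) / n      # transitions
    q = sum(pair not in TRANSITIONS for pair in pairs) / n  # transversions
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


species_a = "ATGGCTATTCTTATAGTACG"
species_b = "ATCGCTAGTCTTATATTACA"
print(round(p_distance(species_a, species_b), 2))    # 0.2
print(round(k2p_distance(species_a, species_b), 2))  # 0.23
```

For species A and B there are four differences in 20 sites (one transition, three transversions), which reproduces the 0.20 and 0.23 entries of the matrix.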
Types of computational methods: Clustering algorithms: Use pairwise distances. Are purely algorithmic methods. Optimality approaches: Use either character or distance data, and prefer the tree with - minimum branch lengths, - fewest number of events, or - highest likelihood. Warning: Finding an optimal tree is not necessarily the same as finding the “true” tree.
Molecular phylogenetic tree building methods:
                         COMPUTATIONAL METHOD
DATA TYPE      Clustering algorithm        Optimality criterion
Characters     (none)                      PARSIMONY, MAXIMUM LIKELIHOOD
Distances      UPGMA, NEIGHBOR-JOINING     MINIMUM EVOLUTION, LEAST SQUARES
Tree-Building Methods Distance UPGMA, NJ, FM, ME Character Maximum Parsimony Maximum Likelihood
Distance Methods. Measure distance (dissimilarity). Methods: UPGMA (Unweighted Pair Group Method with Arithmetic Mean), NJ (Neighbor-Joining), FM (Fitch-Margoliash), ME (Minimum Evolution).
Inferring Trees and Ancestors CCCAGG CCCAAG-> CCCAAG CCCAAA-> CCCAAA CCCAAA-> CCCAAC
Different Criteria
Sequences: 1 = CCCAGG, 2 = CCCAAG, 3 = CCCAAA, 4 = CCCAAC
Pairwise differences: 1-2: 1; 1-3: 2; 1-4: 2; 2-3: 1; 2-4: 1; 3-4: 1
1,2 can be sister taxa AND 3,4 can be sister taxa. Infer the ancestor of 1,2 and of 3,4.
UPGMA: Distance measure. Clustering: All leaves are assigned to a cluster; the clusters are then iteratively merged according to their distance. The distance between two clusters i and j is defined as: $d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\, q \in C_j} d_{pq}$ (1), where $|C_i|$ and $|C_j|$ denote the number of sequences in cluster i and j, respectively.
UPGMA: Replacing. Node k replaces nodes i and j with their union: $C_k = C_i \cup C_j$. The new distances between the new node k and all other clusters l are computed according to: $d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$ (2)
UPGMA: Algorithm
Initialization: Assign each sequence i to its own cluster C_i. Define one leaf of T for each sequence, and place it at height zero.
Iteration: Determine the two clusters i, j for which d_ij is minimal. Define a new cluster k by C_k = C_i ∪ C_j, and define d_kl for all l by (2). Define a node k with daughter nodes i and j, and place it at height d_ij / 2. Add k to the current clusters and remove i and j.
Termination: When only two clusters i, j remain, place the root at height d_ij / 2.
UPGMA example: Step 1, alignment -> distance
Distance: Example 1: Uncorrected “p” distance (= observed percent sequence difference). Example 2: Kimura 2-parameter distance (estimate of the true number of substitutions between taxa).
Distance matrix:
      A     B     C     D     E     F     G
A     -
B    63     -
C    94    79     -
D   111    96    47     -
E    67    20    83   100     -
F    23    58    89   106    62     -
G   107    92    43    16    96   102     -
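UPGMA: distance -> phylogeny. [Figure: rooted UPGMA tree with leaves in the order A, F, B, E, C, D, G; the final reduced distance matrix has the single entry AFBE vs. CDG = 94.]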
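The UPGMA procedure can be sketched in a few lines of Python and run on the seven-taxon matrix A to G from the example. This is a minimal illustration under stated assumptions: the function name `upgma`, the dictionary-of-pairs layout and the parenthesized cluster names are choices made for this sketch, and the update step uses the cluster-size-weighted average of the two merged rows.

```python
from itertools import combinations


def upgma(labels, dist):
    """Return the list of merges (cluster_i, cluster_j, node_height).

    `dist` maps frozenset({a, b}) to the distance between leaves a and b.
    """
    sizes = {label: 1 for label in labels}  # current clusters and their sizes
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        i, j = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        d_ij = d[frozenset((i, j))]
        n_i, n_j = sizes.pop(i), sizes.pop(j)
        k = f"({i},{j})"
        # Size-weighted average distance from the new cluster to every other one.
        for other in sizes:
            d[frozenset((k, other))] = (
                d[frozenset((i, other))] * n_i + d[frozenset((j, other))] * n_j
            ) / (n_i + n_j)
        sizes[k] = n_i + n_j
        merges.append((i, j, d_ij / 2))  # new node sits at half the distance
    return merges


labels = "ABCDEFG"
lower_triangle = {
    "B": [63],
    "C": [94, 79],
    "D": [111, 96, 47],
    "E": [67, 20, 83, 100],
    "F": [23, 58, 89, 106, 62],
    "G": [107, 92, 43, 16, 96, 102],
}
dist = {
    frozenset((row, labels[col])): value
    for row, values in lower_triangle.items()
    for col, value in enumerate(values)
}
for i, j, height in upgma(labels, dist):
    print(i, j, height)
```

The first three merges join D with G (height 8), B with E (height 10) and A with F (height 11.5), giving the same grouping AFBE versus CDG as the slide. With the size-weighted update this sketch puts the final AFBE to CDG distance at 96.25, whereas the slide shows 94; a plain unweighted average of the two merged rows at each step gives 93.75, which may be what the slide used.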
Minimum evolution (ME) methods Optimality criterion: The tree(s) with the shortest sum of the branch lengths (or overall tree length) is chosen as the best tree.
Character Methods Maximum Parsimony minimal changes to produce data Maximum Likelihood
Parsimony methods: Optimality criterion: The ‘most-parsimonious’ tree is the one that requires the fewest number of evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.
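Parsimony: an example. Four aligned sequences: acgtatgga, acgggtgca, aacggtgga, aactgtgca. [Figure: two alternative trees for these sequences, with the states of one site (c, c, a, a) mapped onto the branches.] Total tree length: 7 for the first tree, 8 for the second.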
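The two tree lengths can be reproduced by counting changes site by site with Fitch's algorithm. The sketch below is illustrative: the taxon names s1 to s4 are hypothetical labels for the four sequences in the order listed, and each unrooted tree is written as an arbitrarily rooted pair of pairs (the parsimony length does not depend on where the root is placed).

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change needed here.
        return common, left_cost + right_cost
    # Otherwise one change is required on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Sum the per-site minimum number of changes over the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[site] for name, seq in alignment.items()})[1]
        for site in range(n_sites)
    )


# Hypothetical taxon names for the four sequences of the example.
alignment = {
    "s1": "acgtatgga",
    "s2": "acgggtgca",
    "s3": "aacggtgga",
    "s4": "aactgtgca",
}
for tree in (
    (("s1", "s2"), ("s3", "s4")),
    (("s1", "s3"), ("s2", "s4")),
    (("s1", "s4"), ("s2", "s3")),
):
    print(tree, tree_length(tree, alignment))
# lengths 7, 8 and 8
```

Grouping the first two sequences together is the most parsimonious arrangement because sites 2 and 3 (c, c, a, a and g, g, c, c) each need only one change on that tree and two on the alternatives.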
Maximum likelihood (ML) methods. Optimality criterion: ML methods evaluate phylogenetic hypotheses in terms of the probability that a proposed model of the evolutionary process and the proposed unrooted tree would give rise to the observed data. The tree found to have the highest likelihood is considered to be the preferred tree.
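Using models. [Figure: observed differences plotted against actual changes; labels A, G, C, T.] Example: Jukes-Cantor, $P_{ij}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$ if i = j, $P_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$ if i ≠ j. p: proportion of different nucleotides.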
Maximum likelihood. Given two trees, the one with the higher likelihood, i.e. the one with the higher conditional probability of observing the data, is the better. Site likelihood is the conditional probability of the data at one site given the assumed model of evolution and the parameters of the model. Data set likelihood is the product of the site likelihoods. Likelihood values under different models are comparable, thus giving us a way to test the adequacy of the model.
30 nucleotides from -globin genes of two primates on a one-edge tree.
Gorilla     GAAGTCCTTGAGAAATAAACTGCACACTGG
Orangutan   GGACTCCTTGAGAAATAAACTGCACACTGG
(* marks the two differing sites.) There are two differences and 28 similarities. [Figure: lnL plotted against αt.] Maximum at αt = 0.02327, lnL = -51.133956.
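The quoted maximum can be reproduced with a few lines of Python. The sketch assumes the Jukes-Cantor model with equal base frequencies of 1/4 and the parametrization in which a site stays the same with probability 1/4 + 3/4 e^(-4αt) and changes to one specific other base with probability 1/4 - 1/4 e^(-4αt); the function names are illustrative.

```python
import math


def jc_log_likelihood(seq1, seq2, alpha_t):
    """Log-likelihood of two aligned sequences joined by a single edge."""
    decay = math.exp(-4 * alpha_t)
    p_same = 0.25 + 0.75 * decay  # probability that a site is unchanged
    p_diff = 0.25 - 0.25 * decay  # probability of one specific different base
    log_l = 0.0
    for a, b in zip(seq1, seq2):
        # 1/4 is the equilibrium frequency of the base at one end of the edge.
        log_l += math.log(0.25 * (p_same if a == b else p_diff))
    return log_l


def jc_mle_alpha_t(seq1, seq2):
    """Closed-form maximum-likelihood estimate of alpha * t."""
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    return -0.25 * math.log(1 - 4 * p / 3)


gorilla = "GAAGTCCTTGAGAAATAAACTGCACACTGG"
orangutan = "GGACTCCTTGAGAAATAAACTGCACACTGG"
best = jc_mle_alpha_t(gorilla, orangutan)
print(round(best, 5))                                         # 0.02327
print(round(jc_log_likelihood(gorilla, orangutan, best), 4))  # -51.134
```

Evaluating `jc_log_likelihood` on a grid of αt values traces out the likelihood curve, with its peak at the estimate returned by `jc_mle_alpha_t`.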
Likelihoods of a more interesting tree. Data for one site is shown on the tree (tip states A, A, C, T). Edge lengths (d_1 to d_5 on the tree) are defined as 3(αt)_i. The computational root is chosen arbitrarily (homogeneous models) at an internal node (arrow). u is the state at the root node, v at the other internal node.
Assessing the variation. Jackknife: resampling without replacement. Bootstrap: resampling with replacement. From the characters (sites), draw randomly as many times as there are characters. Analyze this resampled data set in the same way as in the original study. Repeat this 100+ times and summarize, for example, as a majority-rule consensus tree. Bootstrap proportions between 0.5 and 1 can be interpreted as a measure of confidence or support.
Bootstrap. Original data set with n characters. Draw n characters randomly with replacement. Repeat m times, giving m pseudo-replicates, each with n characters. Run the original analysis (e.g. MP, ML, NJ), then repeat the original analysis on each of the pseudo-replicate data sets. Evaluate the results from the m analyses. [Figure: trees of Aus, Beus, Ceus and Deus from the original analysis and from the pseudo-replicates, with one clade labeled 75%.]
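The resampling step itself is short. The following Python sketch draws whole alignment columns with replacement; the function name, the seed argument and the toy alignment are illustrative assumptions (the sequences are made up, attached to the placeholder taxon names from the slide), and the tree-building step that would be run on each pseudo-replicate is left out.

```python
import random


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Return pseudo-replicate alignments made by resampling columns."""
    rng = random.Random(seed)
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    replicates = []
    for _ in range(n_replicates):
        # Draw n_sites column indices with replacement; the same columns are
        # used for every taxon so that each site stays intact.
        columns = [rng.randrange(n_sites) for _ in range(n_sites)]
        replicates.append(
            {name: "".join(alignment[name][c] for c in columns) for name in names}
        )
    return replicates


# Toy alignment with made-up sequences, for illustration only.
toy = {
    "Aus": "ACGTACGTAC",
    "Beus": "ACGTACGTTC",
    "Ceus": "ACCTACGATC",
    "Deus": "TCCTACGATC",
}
pseudo = bootstrap_replicates(toy, n_replicates=100, seed=1)
print(len(pseudo), len(pseudo[0]["Aus"]))  # 100 10
```

Each pseudo-replicate would then be analyzed with the same method as the original data, and the support for a clade is the fraction of the m resulting trees that contain it.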
Resampling statistics in phylogenetics ”…provides us with a confidence interval…[of] the phylogeny that would be estimated on repeated sampling of many characters from the underlying pool of characters” (Felsenstein 1985) True? We don’t know. The exact statistical interpretation remains unclear.
Pros and cons of some methods
Pair-wise, algorithmic approach: + Fast. + Models can be used when transforming to distances. - Information is lost when transforming to pair-wise distances. - One will get a tree, but no measure of goodness to compare with other hypotheses.
Parsimony: + Philosophically appealing (Occam’s razor). - Can be inconsistent. - Can be computationally slow.
Maximum likelihood: + Model based. - Model based. - Computationally veeeeery slow.
Computation. For large data sets (many taxa), exact solutions for any method employing an optimality criterion (parsimony, likelihood, minimum evolution) are not computationally feasible, because the number of possible trees is too large to search exhaustively.
What can go wrong? Sampling error: assessed by, for example, the bootstrap. Systematic error (inconsistent method): tests of the adequacy of the models used. Reality: a tree may be a poor model of the real history; information has been lost by subsequent evolutionary changes; “species” vs. “gene” trees.
What is wrong with this tree? [Figure: tree of Canis, Mus and Gadus with a support value of 100.] Negligible (within-sequence) sampling error. Tree estimated by a consistent method.
Gene duplication “Species” tree “Gene” trees The expected tree…
[Figure: gene tree with tips Canis, Mus, Gadus and a second Mus and Canis pair; relationships are labeled paralogous or orthologous.] Two copies (paralogs) are present in the genomes.