2 What is phylogenetic analysis and why should we perform it?
Phylogenetic analysis has two major components:
1. Phylogeny inference or “tree building”: the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.).
2. Character and rate analysis: using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest.
3 Common Phylogenetic Tree Terminology
[Figure: a tree with leaves A to E, labeling the edges, the vertices (nodes), and the taxa (genes, populations, species, etc.) used to infer the phylogeny.]
4 Common Phylogenetic Tree Terminology
[Figure: a rooted tree with leaves A to E, labeled as follows.]
- Terminal nodes (leaves): represent the taxa (genes, populations, species, etc.) used to infer the phylogeny.
- Branches, lineages or clades.
- Internal nodes or divergence points: represent hypothetical ancestors of the taxa.
- Ancestral node or root of the tree.
5 Phylogenetic trees diagram the evolutionary relationships between the taxa
[Figure: the same tree of Taxon A to Taxon E drawn in several equivalent ways.]
There is no meaning to the spacing between the taxa or to the order in which they are drawn.
Branch lengths are one of:
- unscaled (for ‘cladograms’),
- proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’),
- proportional to time (for ‘ultrametric trees’ or true evolutionary trees).
All of these drawings say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.
((A,(B,C)),(D,E)) = the above phylogeny as nested parentheses
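The nested-parentheses notation can be read mechanically. The following is a minimal sketch, not a full Newick parser: it assumes a topology-only string with single-token taxon names, no branch lengths and no quoting. The function names `parse_nested`, `leaves` and `clades` are hypothetical helpers introduced for illustration.

```python
def parse_nested(text):
    """Parse a topology-only nested-parentheses string into nested tuples."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos                       # otherwise: read a taxon name
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()


def leaves(tree):
    """Set of taxon names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """All groups of taxa that descend from one internal node."""
    if isinstance(tree, str):
        return set()
    found = {leaves(tree)}
    for child in tree:
        found |= clades(child)
    return found


tree = parse_nested("((A,(B,C)),(D,E))")
for clade in sorted(clades(tree), key=len):
    print(sorted(clade))
```

The clades recovered this way are exactly the groupings described in words: {B, C}, {D, E}, {A, B, C}, and the full set of taxa.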
6 A few examples of what can be inferred from phylogenetic trees built from DNA or protein sequence data:
- Which species are the closest living relatives of modern humans?
- Did the infamous Florida Dentist infect his patients with HIV?
- What were the origins of specific transposable elements?
- Plus countless others...
7 Using Phylogeny to Understand Gene Duplication and Loss
[Figure, left: a gene tree.]
[Figure, right: the gene tree superimposed on a species tree, allowing identification of the duplication and loss events.]
8 Which species are the closest living relatives of modern humans?
[Figure: two alternative trees of humans, chimpanzees, bonobos, gorillas and orangutans, one per view below; time-axis labels: 14 MYA and 15-30 MYA.]
Mitochondrial DNA and most nuclear DNA-encoded genes show that bonobos and chimpanzees are related more closely to humans than either is to gorillas.
The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA.
9 Did the Florida Dentist infect his patients with HIV?
[Figure: phylogenetic tree of HIV sequences from the dentist, his patients, and local HIV-infected people. Tip labels in order: Patient C, Patient A, Patient G, Patient B, Patient E, Patient A, DENTIST, Local control 2, Local control 3, Patient F, Local control 9, Local control 35, Local control 3, Patient D.]
Yes: the HIV sequences from Patients A, B, C, E and G fall within the clade of HIV sequences found in the dentist.
No: Patient F and Patient D.
From Ou et al. (1992) and Page & Holmes (1998)
10 A few examples of what can be learned from character analysis using phylogenies as analytical frameworks:
- When did specific episodes of positive Darwinian selection occur during evolutionary history?
- What was the most likely geographical location of the common ancestor of the African apes and humans?
- Plus countless others...
11 Phylogenetic Resources
- NCBI Taxonomy Browser
- “Tree of Life”
12 The goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees:
[Figure: three trees of taxa A to E, from least to most resolved.]
- Completely unresolved or "star" phylogeny (a polytomy or multifurcation)
- Partially resolved phylogeny
- Fully resolved, bifurcating phylogeny (every split is a bifurcation)
13 There are three possible unrooted trees for four taxa (A, B, C, D)
[Figure: the three unrooted trees. Which one is correct?]
Phylogenetic tree building (or inference) methods are aimed at discovering which of the possible unrooted trees is "correct".
We would like this to be the “true” biological tree, that is, one that accurately represents the evolutionary history of the taxa.
However, we must settle for discovering the computationally correct or optimal tree for the phylogenetic method of choice.
14 The number of unrooted trees increases in a greater than exponential manner with number of taxa
(2N - 5)!! = # unrooted trees for N taxa
15 Inferring evolutionary relationships between the taxa requires rooting the tree:
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.
[Figure: an unrooted tree of taxa A to D with a root position marked, and the resulting rooted tree.]
Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.
16 Now, try it again with the root at another position:
[Figure: the same unrooted tree of taxa A to D with the root at a different position, and the resulting rooted tree.]
Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.
17 An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees
[Figure: the unrooted tree 1 with root positions 1 to 5 marked on its branches, and the resulting rooted trees 1a to 1e.]
These trees show five different evolutionary relationships among the taxa!
18 All of these rearrangements show the same evolutionary relationships between the taxa
[Figure: rooted tree 1a redrawn several times with the taxa A to D appearing in different orders.]
19 Think for yourself
- How many unrooted trees are there with 4 taxa?
- How many rooted trees are there with 4 taxa?
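20 Each unrooted tree theoretically can be rooted anywhere along any of its branches
[Figure: an unrooted tree of taxa A to F; the number of unrooted trees multiplied by the number of branches gives the number of rooted trees.]
An unrooted bifurcating tree with N taxa has 2N - 3 branches, so (2N - 5)!! × (2N - 3) = (2N - 3)!!
(2N - 3)!! = # rooted trees for N taxa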
21 Finding the best tree
Search strategies:
- Exact: exhaustive; branch and bound
- Algorithmic: greedy algorithms (including Neighbor-joining)
- Heuristic:
  - Systematic: branch-swapping (NNI, SPR, TBR)
  - Stochastic: Markov Chain Monte Carlo (MCMC); genetic algorithms
Number of (rooted) trees:
- 3 taxa -> 3 trees
- 4 taxa -> 15 trees
- 10 taxa -> 34,459,425 trees
- 25 taxa -> 1.19·10^30 trees
- 52 taxa -> 2.75·10^80 trees
Finding the optimal tree is an NP-complete problem
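The tree counts follow directly from the double-factorial formulas, (2N - 5)!! for unrooted and (2N - 3)!! for rooted bifurcating trees. A minimal sketch, with hypothetical function names, that reproduces the figures for small and large N:

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (taken as 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees, (2N - 5)!!, for N >= 3."""
    return double_factorial(2 * n_taxa - 5)


def rooted_trees(n_taxa):
    """Number of rooted bifurcating trees, (2N - 3)!!, for N >= 2."""
    return double_factorial(2 * n_taxa - 3)


for n in (3, 4, 10, 25, 52):
    print(f"{n:>2} taxa: {unrooted_trees(n):.3g} unrooted, {rooted_trees(n):.3g} rooted")
```

Python integers are arbitrary precision, so the exact counts are available even for 52 taxa; the formatting above only rounds them for display.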
22 Rooting Trees
Molecular clock: root = midpoint of the longest span.
[Figure: unrooted tree of taxa A to D with branch lengths 10, 2, 3 and 5.]
d(A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
Extrinsic evidence (outgroup): e.g., select a fungus as the root (outgroup) for plants.
23 Phylogenetic Models
- All sequences are homologous
- Each position in the alignment is homologous
- Positions evolve independently
24 Steps in Analysis
- Data Model (Alignment)
- DNA base substitution model
- Build Trees
  - Algorithm based vs Criterion based
  - Distance based vs Character-based
25 Practicalities
- Quality of input data critical
- Examine data from all possible angles: distance, parsimony, likelihood
- Outgroup taxon critical: problem if outgroup shares a selective property with a subset of ingroup
- Order of input can be problematic: try different orders!
- Assess the variation in your data in some way
26 Types of data used in phylogenetic inference:
Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.
Taxa       Characters
Species A  ATGGCTATTCTTATAGTACG
Species B  ATCGCTAGTCTTATATTACA
Species C  TTCACTAGACCTGTGGTCCA
Species D  TTGACCAGACCTGTGGTCCG
Species E  TTGACCAGTTCTCTAGTTCG
Distance-based methods: Transform the sequence data into pairwise distances, and use the matrix during tree building.
[Table: pairwise distance matrix for Species A to E; the cell values are not present.]
Example 1: Uncorrected “p” distance (= observed percent sequence difference)
Example 2: Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
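Both example distances can be computed from the five aligned sequences. The sketch below assumes gap-free sequences of equal length and uses the standard Kimura 2-parameter formula d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. The function names are hypothetical.

```python
import math
from itertools import combinations

alignment = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}
PURINES = {"A", "G"}


def p_distance(s1, s2):
    """Uncorrected distance: proportion of sites that differ."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def k2p_distance(s1, s2):
    """Kimura 2-parameter distance from transition/transversion proportions."""
    transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1       # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1     # purine<->pyrimidine
    P = transitions / len(s1)
    Q = transversions / len(s1)
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


for x, y in combinations(alignment, 2):
    p = p_distance(alignment[x], alignment[y])
    k = k2p_distance(alignment[x], alignment[y])
    print(f"{x}-{y}: p = {p:.2f}, K2P = {k:.2f}")
```

The corrected distance is always at least as large as the uncorrected one, and the gap widens for more divergent pairs, because the correction estimates substitutions hidden by multiple hits at the same site.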
27 Types of computational methods:
Clustering algorithms: Use pairwise distances. Are purely algorithmic methods, in which the algorithm itself defines the tree selection criterion. Tend to be very fast programs that produce singular trees rooted by distance. No objective function to compare to other trees, even if numerous other trees could explain the data equally well.
Warning: Finding a singular tree is not necessarily the same as finding the "true" evolutionary tree.
Optimality approaches: Use either character or distance data. First define an optimality criterion (minimum branch lengths, fewest number of events, highest likelihood), and then use a specific algorithm for finding trees with the best value for the objective function. Can identify many equally optimal trees, if such exist.
Warning: Finding an optimal tree is not necessarily the same as finding the "true" tree.
28 Molecular phylogenetic tree building methods:
These are mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available today, each having strengths and weaknesses. Most can be classified as follows:
DATA TYPE    COMPUTATIONAL METHOD: Optimality criterion    COMPUTATIONAL METHOD: Clustering algorithm
Characters   PARSIMONY, MAXIMUM LIKELIHOOD                 (none)
Distances    MINIMUM EVOLUTION, LEAST SQUARES              UPGMA, NEIGHBOR-JOINING
34 UPGMA: Distance measure
Clustering: Each leaf is first assigned to its own cluster; the clusters are then iteratively merged according to their distance.
The distance between two clusters i and j is defined as:
$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$
where $|C_i|$ and $|C_j|$ denote the number of sequences in cluster i and j, respectively.
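35 UPGMA: Replacing
Node k replaces nodes i and j with their union:
$$C_k = C_i \cup C_j \qquad (1)$$
The new distances between the new node k and all other clusters l are computed according to:
$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|} \qquad (2)$$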
36 UPGMA: Algorithm
Initialization:
- Assign each sequence i to its own cluster $C_i$.
- Define one leaf of T for each sequence, and place it at height zero.
Iteration:
- Determine the two clusters i, j for which $d_{ij}$ is minimal.
- Define a new cluster k by $C_k = C_i \cup C_j$, and define $d_{kl}$ for all l by (2).
- Define a node k with daughter nodes i and j, and place it at height $d_{ij}/2$.
- Add k to the current clusters and remove i and j.
Termination:
- When only two clusters i, j remain, place the root at height $d_{ij}/2$.
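The steps above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not a reference one: the input is a full symmetric distance matrix as a list of lists, ties are broken by taking the first minimal pair, the size-weighted average is used to update distances to a merged cluster, and the output is a Newick-style string. The function name `upgma` and the four-taxon matrix are made up for the example.

```python
from itertools import combinations


def upgma(names, matrix):
    """Return (Newick string, root height) for a symmetric distance matrix."""
    n = len(names)
    # cluster id -> (Newick subtree, number of leaves, node height)
    clusters = {i: (names[i], 1, 0.0) for i in range(n)}
    d = {(i, j): float(matrix[i][j]) for i, j in combinations(range(n), 2)}

    def dist(a, b):
        return d[(a, b) if a < b else (b, a)]

    next_id = n
    while len(clusters) > 1:
        # Iteration: pick the closest pair of current clusters.
        i, j = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        (sub_i, n_i, h_i), (sub_j, n_j, h_j) = clusters[i], clusters[j]
        height = dist(i, j) / 2
        k = next_id
        next_id += 1
        # Distances from the merged cluster k to every other cluster l.
        for l in clusters:
            if l != i and l != j:
                d[(l, k)] = (dist(i, l) * n_i + dist(j, l) * n_j) / (n_i + n_j)
        del clusters[i], clusters[j]
        subtree = f"({sub_i}:{height - h_i:g},{sub_j}:{height - h_j:g})"
        clusters[k] = (subtree, n_i + n_j, height)
    (tree, _, root_height), = clusters.values()
    return tree + ";", root_height


# Hypothetical ultrametric distances for four taxa.
names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree, root_height = upgma(names, matrix)
print(tree, root_height)
```

Because every node is placed at half the distance between its daughters, all leaves end up equidistant from the root, which is the molecular-clock assumption built into UPGMA.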
37 UPGMA example: Step 1
Alignment -> distance (e.g., uncorrected “p” distance, the observed percent sequence difference, or Kimura 2-parameter distance, an estimate of the true number of substitutions between taxa)
[Table: distance matrix for taxa A to G; the cell values are run together in the source as 639479111964767208310023588910662107924316102.]
49 Clustering methods (UPGMA & N-J)
Optimality criterion: NONE. The algorithm itself builds ‘the’ tree.
Advantages:
- Can be used on indirectly-measured distances (immunological, hybridization).
- Distances can be ‘corrected’ for unseen events.
- The fastest of the methods available.
- Can therefore analyze very large datasets quickly (needed for HIV, etc.).
- Can be used for some types of rate and date analysis.
Disadvantages:
- Similarity and relationship are not necessarily the same thing, so clustering by similarity does not necessarily give an evolutionary tree.
- Cannot be used for character analysis!
- Have no explicit optimization criteria, so one cannot even know if the program worked properly to find the correct tree for the method.
50 Minimum evolution (ME) methods
Optimality criterion: The tree(s) with the shortest sum of the branch lengths (or overall tree length) is chosen as the best tree.
Advantages:
- Can be used on indirectly-measured distances (immunological, hybridization).
- Distances can be ‘corrected’ for unseen events.
- Usually faster than character-based methods.
- Can be used for some rate analyses.
- Has an objective function (as compared to clustering methods).
Disadvantages:
- Information lost when characters transformed to distances.
- Cannot be used for character analysis.
- Slower than clustering methods.
51 Character Methods
- Maximum Parsimony: minimal changes to produce data
- Maximum Likelihood
52 Parsimony methods:
Optimality criterion: The ‘most-parsimonious’ tree is the one that requires the fewest number of evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.
Advantages:
- Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’).
- Can be used on molecular and non-molecular (e.g., morphological) data.
- Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy).
- Can be used for character (can infer the exact substitutions) and rate analysis.
- Can be used to infer the sequences of the extinct (hypothetical) ancestors.
Disadvantages:
- Are simple, intuitive, and logical (derived from “Medieval logic”, not statistics!)
- Can become positively misleading in the “Felsenstein Zone”.
54 Parsimony: an example
a  acgtatgga
b  acgggtgca
g  aacggtgga
d  aactgtgca
[Figure: the three possible unrooted trees for the four taxa, each with its total tree length.]
Total tree length: 7
Total tree length: 8
Total tree length: 8
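The three tree lengths can be checked by counting changes site by site. The sketch below uses Fitch's set-intersection counting on trees written as nested tuples, which is an assumption about how the lengths were obtained; the function name `fitch_length` is hypothetical. Rooting each four-taxon tree on its internal branch does not change the count.

```python
seqs = {
    "a": "acgtatgga",
    "b": "acgggtgca",
    "g": "aacggtgga",
    "d": "aactgtgca",
}


def fitch_length(tree, seqs):
    """Minimum number of changes on a binary tree given as nested tuples."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for site in range(n_sites):

        def state_set(node):
            nonlocal total
            if isinstance(node, str):           # leaf: its observed state
                return {seqs[node][site]}
            left, right = (state_set(child) for child in node)
            common = left & right
            if common:                          # daughters agree: no change
                return common
            total += 1                          # daughters disagree: one change
            return left | right

        state_set(tree)
    return total


trees = {
    "((a,b),(g,d))": (("a", "b"), ("g", "d")),
    "((a,g),(b,d))": (("a", "g"), ("b", "d")),
    "((a,d),(b,g))": (("a", "d"), ("b", "g")),
}
for label, tree in trees.items():
    print(label, fitch_length(tree, seqs))
```

Under this counting, the tree grouping a with b (and g with d) has length 7 and the other two have length 8, so it is the most parsimonious of the three.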
55 Maximum likelihood (ML) methods
Optimality criterion: ML methods evaluate phylogenetic hypotheses in terms of the probability that a proposed model of the evolutionary process and the proposed unrooted tree would give rise to the observed data. The tree found to have the highest ML value is considered to be the preferred tree.
Advantages:
- Are inherently statistical and evolutionary model-based.
- Usually the most ‘consistent’ of the methods available.
- Can be used for character (can infer the exact substitutions) and rate analysis.
- Can be used to infer the sequences of the extinct (hypothetical) ancestors.
- Can help account for branch-length effects in unbalanced trees.
- Can be applied to nucleotide or amino acid sequences, and other types of data.
Disadvantages:
- Are not as simple and intuitive as many other methods.
- Are computationally very intense (limits number of taxa and length of sequence).
- Like parsimony, can be fooled by high levels of homoplasy.
- Violations of the assumed model can lead to incorrect trees.
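56 Using models
Example: Jukes-Cantor
[Figure: the four nucleotides A, G, C, T, with every substitution occurring at the same rate $\alpha$.]
Substitution rates: $q_{ij} = \alpha$ if $i \neq j$, and $q_{ij} = -3\alpha$ if $i = j$.
Substitution probabilities after time t: $P_{ij}(t) = \tfrac{1}{4} + \tfrac{3}{4} e^{-4\alpha t}$ if $i = j$, and $P_{ij}(t) = \tfrac{1}{4} - \tfrac{1}{4} e^{-4\alpha t}$ if $i \neq j$.
Observed differences: $p_t = \tfrac{3}{4}\left(1 - e^{-4\alpha t}\right)$, where $p_t$ is the proportion of different nucleotides.
Actual changes: $d = 3\alpha t = -\tfrac{3}{4} \ln\left(1 - \tfrac{4}{3} p_t\right)$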
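As a small numerical illustration of the difference between observed differences and actual changes under Jukes-Cantor, the sketch below applies the standard correction d = -3/4 ln(1 - 4p/3) to a few observed proportions. The function name `jc_distance` is hypothetical, and returning infinity at saturation (p >= 0.75) is a choice made for this example.

```python
import math


def jc_distance(p):
    """Jukes-Cantor estimate of substitutions per site from observed proportion p."""
    if p >= 0.75:
        return math.inf      # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.20, 0.50, 0.70):
    print(f"observed {p:.2f} -> corrected {jc_distance(p):.3f}")
```

For small p the corrected and observed values nearly coincide; as p approaches 0.75, the value expected for unrelated sequences with equal base frequencies, the corrected distance grows without bound.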
57 Maximum likelihood
- Given two trees, the one with the higher likelihood, i.e. the one with the higher conditional probability of observing the data, is the better.
- Site likelihood is the conditional probability of the data at one site given the assumed model of evolution and parameters of the model.
- Data set likelihood is the product of the site likelihoods.
- Likelihood values under different models are comparable, thus giving us a way to test the adequacy of the model.
58 30 nucleotides from ψη-globin genes of two primates on a one-edge tree
           * *
Gorilla   GAAGTCCTTGAGAAATAAACTGCACACTGG
Orangutan GGACTCCTTGAGAAATAAACTGCACACTGG
There are two differences and 28 similarities.
[Figure: lnL as a function of αt; the annotated values of αt and lnL are missing.]
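The likelihood for this two-sequence data set can be written out directly. The sketch below assumes the Jukes-Cantor model with equal base frequencies of 1/4 and rate α per specific substitution, so a site with identical bases contributes (1/4)(1/4 + 3/4 e^{-4αt}) and a site with different bases contributes (1/4)(1/4 - 1/4 e^{-4αt}). The function name `log_likelihood` is hypothetical.

```python
import math

gorilla = "GAAGTCCTTGAGAAATAAACTGCACACTGG"
orangutan = "GGACTCCTTGAGAAATAAACTGCACACTGG"


def log_likelihood(alpha_t, n_same, n_diff):
    """lnL of a one-edge tree under Jukes-Cantor, as a function of alpha*t."""
    decay = math.exp(-4 * alpha_t)
    p_same = 0.25 + 0.75 * decay   # probability a site keeps its base
    p_diff = 0.25 - 0.25 * decay   # probability of one specific other base
    return ((n_same + n_diff) * math.log(0.25)
            + n_same * math.log(p_same)
            + n_diff * math.log(p_diff))


n_diff = sum(a != b for a, b in zip(gorilla, orangutan))
n_same = len(gorilla) - n_diff

# The maximum has a closed form: the value of alpha*t at which the expected
# proportion of differing sites equals the observed proportion.
p_observed = n_diff / len(gorilla)
alpha_t_hat = -0.25 * math.log(1 - 4 * p_observed / 3)
max_lnL = log_likelihood(alpha_t_hat, n_same, n_diff)

print(f"differences = {n_diff}, alpha*t = {alpha_t_hat:.5f}, lnL = {max_lnL:.5f}")
```

Evaluating `log_likelihood` over a grid of αt values traces out the curve whose peak is at `alpha_t_hat`; the corresponding edge length in substitutions per site is 3αt.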
59 Likelihoods of a more interesting tree
- Data for one site is shown on the tree.
- Edge lengths are defined as $d_i = 3(\alpha t)_i$.
- The computational root is chosen arbitrarily (homogeneous models) at an internal node (arrow).
- u is the state at the root node, v at the other internal node.
[Figure: four-taxon tree with tip states A, C, A, T and edge lengths d1 to d5.]
60 Assessing the variation
- Jack-knife: resampling without replacement
- Bootstrap: resampling with replacement
  - From the characters (sites), draw randomly as many times as there are characters
  - Analyze this re-sampled data set in the same way as the study
  - Repeat this 100+ times and summarize, for example, as a majority rule consensus tree
  - Bootstrap proportions between 0.5 and 1 can be interpreted as a measure of confidence or support
61 Bootstrap
1. Original analysis, e.g. MP, ML, NJ, of the original data set with n characters. [Figure: tree of Aus, Beus, Ceus, Deus.]
2. Draw n characters randomly with replacement. Repeat m times, giving m pseudo-replicates, each with n characters.
3. Repeat the original analysis on each of the pseudo-replicate data sets.
4. Evaluate the results from the m analyses. [Figure: the tree of Aus, Beus, Ceus, Deus with a support value of 75% on one branch.]
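The resampling procedure itself is short. In the sketch below, the alignment and the function names are made up for illustration, and `closest_pair` is only a stand-in for a real tree-building analysis (MP, ML, NJ): it reports the two taxa with the fewest mismatches, so the output is the proportion of pseudo-replicates supporting each pairing.

```python
import random
from collections import Counter
from itertools import combinations

alignment = {
    "Aus":  "ATGGCTATTCTTATAGTACG",
    "Beus": "ATCGCTAGTCTTATATTACA",
    "Ceus": "TTCACTAGACCTGTGGTCCA",
    "Deus": "TTGACCAGACCTGTGGTCCG",
}


def pseudo_replicate(alignment, rng):
    """Draw n columns with replacement from an alignment of n columns."""
    n = len(next(iter(alignment.values())))
    columns = [rng.randrange(n) for _ in range(n)]
    return {taxon: "".join(seq[c] for c in columns)
            for taxon, seq in alignment.items()}


def closest_pair(alignment):
    """Stand-in analysis: the pair of taxa with the fewest mismatches."""
    def mismatches(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return frozenset(min(combinations(sorted(alignment), 2), key=mismatches))


def bootstrap(alignment, analysis, m=100, seed=0):
    """Repeat the analysis on m pseudo-replicates; return result proportions."""
    rng = random.Random(seed)
    counts = Counter(analysis(pseudo_replicate(alignment, rng)) for _ in range(m))
    return {result: count / m for result, count in counts.items()}


support = bootstrap(alignment, closest_pair, m=200)
for pair, proportion in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(pair), f"{proportion:.2f}")
```

Swapping `closest_pair` for a function that returns the set of clades of an inferred tree would give per-clade bootstrap proportions of the kind drawn on a consensus tree.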
62 Resampling statistics in phylogenetics
“…provides us with a confidence interval…[of] the phylogeny that would be estimated on repeated sampling of many characters from the underlying pool of characters” (Felsenstein 1985)
True? We don’t know. The exact statistical interpretation remains unclear.
63 Pros and cons of some methods
Pair-wise, algorithmic approach
+ Fast
+ Models can be used when transforming to distances
- Information is lost when transforming to pair-wise distances
- One will get a tree, but no measure of goodness to compare with other hypotheses
Parsimony
+ Philosophically appealing (Occam’s razor)
- Can be inconsistent
- Can be computationally slow
Maximum likelihood
+ Model based
- Model based
- Computationally veeeeery slow
64 Computation
For large data sets (many taxa), exact solutions for any method employing an optimality criterion (parsimony, likelihood, minimum evolution) are not computationally feasible.
65 What can go wrong?
- Sampling error: assessed by, for example, the bootstrap
- Systematic error (inconsistent method): tests of the adequacy of models used
- Reality:
  - A tree may be a poor model of the real history
  - Information has been lost by subsequent evolutionary changes
- “Species” vs. “gene” trees
66 What is wrong with this tree?
[Figure: tree of Canis, Mus and Gadus with support values of 100 on its branches.]
- Negligible (within sequence) sampling error
- Tree estimated by a consistent method
67 The expected tree…
[Figure: a “species” tree with a gene duplication marked, and the resulting “gene” trees.]
68 Two copies (paralogs) present in the genomes
[Figure: gene tree for Canis, Mus and Gadus with two groups labeled orthologous, and the relationship between the groups labeled paralogous.]