Phylogenetic reconstruction


1 Phylogenetic reconstruction

2 What is phylogenetic analysis and why should we perform it?
Phylogenetic analysis has two major components:
1. Phylogeny inference or “tree building”: the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.).
2. Character and rate analysis: using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest.

3 Common Phylogenetic Tree Terminology
[Figure: a tree of taxa A to E with its parts labeled: edges, and vertices or nodes.] The TAXA (genes, populations, species, etc.) are the entities used to infer the phylogeny.

4 Common Phylogenetic Tree Terminology
[Figure: the same tree of taxa A to E with further labels.]
Terminal Nodes (Leaves): represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny.
Internal Nodes or Divergence Points: represent hypothetical ancestors of the taxa.
Ancestral Node or ROOT of the Tree.
Branches, Lineages or Clades.

5 Phylogenetic trees diagram the evolutionary relationships between the taxa
[Figure: two drawings of the same tree of Taxon A to Taxon E.]
The spacing between the taxa and the order in which they are listed have no meaning.
The branch lengths have:
- no scale (for ‘cladograms’),
- a scale proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’),
- a scale proportional to time (for ‘ultrametric trees’ or true evolutionary trees).
These trees say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.
((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses
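
The nested-parentheses notation can be read mechanically. The following is a minimal sketch, not a full Newick parser: it assumes plain leaf names with no branch lengths and no trailing semicolon, and the helper names `parse_newick` and `clades` are illustrative only.

```python
def parse_newick(text):
    """Parse a string such as "((A,(B,C)),(D,E))" into nested tuples."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos                       # a leaf name runs up to the next delimiter
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()


def clades(tree):
    """Return the set of leaves below every internal node."""
    found = []

    def leaves(n):
        if isinstance(n, str):
            return frozenset([n])
        below = frozenset().union(*(leaves(child) for child in n))
        found.append(below)
        return below

    leaves(tree)
    return found


tree = parse_newick("((A,(B,C)),(D,E))")
tree_clades = clades(tree)
```

Here `frozenset("BC")` and `frozenset("DE")` appear in `tree_clades`, matching the statement that B and C (and D and E) group together, while `frozenset("AB")` does not.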

6 A few examples of what can be inferred from phylogenetic trees built from DNA or protein sequence data: Which species are the closest living relatives of modern humans? Did the infamous Florida Dentist infect his patients with HIV? What were the origins of specific transposable elements? Plus countless others…..

7 Using Phylogeny to Understand Gene Duplication and Loss
A gene tree. The gene tree superimposed on a species tree, allowing identification of the duplication and loss events.

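8 Speciation
[Figure: an ancestral gene a is passed through a speciation event to Species I and Species II; the two copies of a are orthologs.]

9 Gene duplication
[Figure: an ancestral gene is duplicated into copies a and b, which then accumulate mutations; a and b are paralogs.]

10 Gene duplication & speciation
[Figure: an ancestral gene is duplicated into a and b, which accumulate mutations (paralogs); a later speciation gives Species I and Species II, each carrying both a and b. The copies of the same gene in the two species are orthologs.]

11 Gene Phylogeny
[Figure: gene tree of Species I a, Species II a, Species I b and Species II b. The two a genes are orthologs, the two b genes are orthologs, and the a and b genes are paralogs.]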

12 Which species are the closest living relatives of modern humans?
[Figure: two trees of humans, chimpanzees, bonobos, gorillas and orangutans, with time scales labeled 14 MYA and 15-30 MYA.]
Mitochondrial DNA and most nuclear DNA-encoded genes show that bonobos and chimpanzees are related more closely to humans than either is to gorillas.
The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA.

13 Did the Florida Dentist infect his patients with HIV?
[Figure: phylogenetic tree of HIV sequences from the DENTIST, his patients, and local HIV-infected people (local controls).]
Yes: The HIV sequences from patients A, B, C, E and G fall within the clade of HIV sequences found in the dentist.
No: The sequences from patients D and F group with the local controls.
From Ou et al. (1992) and Page & Holmes (1998)

14 A few examples of what can be learned from character analysis using phylogenies as analytical frameworks: When did specific episodes of positive Darwinian selection occur during evolutionary history? What was the most likely geographical location of the common ancestor of the African apes and humans? Plus countless others…..

15 Phylogenetic Resources
NCBI Taxonomy Browser “Tree of Life” TreeBase

16 Tree of Life

17 The goal of phylogeny inference
The goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees:
- Completely unresolved or "star" phylogeny (a single polytomy or multifurcation)
- Partially resolved phylogeny
- Fully resolved, bifurcating phylogeny (every internal node is a bifurcation)
[Figure: the three kinds of tree drawn for taxa A to E.]

18 There are three possible unrooted trees for four taxa (A, B, C, D)
[Figure: the three unrooted trees. Which one is correct?]
Phylogenetic tree building (or inference) methods are aimed at discovering which of the possible unrooted trees is “correct”. We would like this to be the “true” biological tree, that is, one that accurately represents the evolutionary history of the taxa. However, we must settle for discovering the computationally correct or optimal tree for the phylogenetic method of choice.

19 The number of unrooted trees increases in a greater than exponential manner with number of taxa
(2N - 5)!! = # unrooted trees for N taxa

20 Inferring evolutionary relationships between the taxa requires rooting the tree:
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.
[Figure: an unrooted tree of taxa A, B, C, D with the root position marked, and the resulting rooted tree.]
Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

21 Now, try it again with the root at another position:
[Figure: the same unrooted tree with the root at another position, and the resulting rooted tree.]
Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

22 An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees
[Figure: the unrooted tree 1 with root positions 1 to 5 marked on its branches, and the corresponding rooted trees 1a to 1e.]
These trees show five different evolutionary relationships among the taxa!

23 All of these rearrangements show the same evolutionary relationships between the taxa
[Figure: several drawings of rooted tree 1a in which the taxa A, B, C and D appear in different orders.]

24 Think for yourself How many unrooted trees are there with 4 taxa?
How many rooted trees are there with 4 taxa?

25 Each unrooted tree theoretically can be rooted anywhere along any of its branches
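[Figure: an unrooted tree of taxa A to F with a possible root position on each branch.]
An unrooted tree for N taxa has 2N - 3 branches, so (# unrooted trees) x (# branches) = (2N - 5)!! x (2N - 3):
(2N - 3)!! = # rooted trees for N taxa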

26 Finding the best tree
Search strategies:
- Exact: exhaustive; branch and bound
- Algorithmic: greedy algorithms (including Neighbor-joining)
- Heuristic, systematic: branch-swapping (NNI, SPR, TBR)
- Heuristic, stochastic: Markov Chain Monte Carlo (MCMC); genetic algorithms
Number of (rooted) trees:
3 taxa -> 3 trees
4 taxa -> 15 trees
10 taxa -> 34,459,425 trees
25 taxa -> 1.19·10^30 trees
52 taxa -> 2.75·10^80 trees
Finding the optimal tree is an NP-complete problem.
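
The tree counts can be checked directly from the double-factorial formulas. This is a small sketch assuming the standard results (2N - 5)!! for unrooted and (2N - 3)!! for rooted bifurcating trees; the function names are illustrative.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees: (2N - 5)!!"""
    return double_factorial(2 * n_taxa - 5)


def rooted_trees(n_taxa):
    """Number of rooted bifurcating trees: (2N - 3)!!"""
    return double_factorial(2 * n_taxa - 3)


for n in (3, 4, 10, 25, 52):
    print(n, "taxa:", unrooted_trees(n), "unrooted,", rooted_trees(n), "rooted")
```

Python integers are arbitrary precision, so the counts for 25 and 52 taxa are computed exactly rather than as floating-point approximations.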

27 Rooting Trees
[Figure: tree of taxa A, B, C, D with branch lengths 10, 2, 3 and 5.]
Molecular Clock: root = midpoint of the longest span. d(A,D) = 10 + 3 + 5 = 18, midpoint = 18 / 2 = 9.
Extrinsic Evidence (Outgroup): select an outgroup as the root, e.g. a fungus as root for plants.

28 Steps in Analysis
Input: Alignment
Choose substitution model
Build Trees:
- Algorithm based vs Criterion based
- Distance based vs Character-based

29 Practicalities
Quality of input data is critical.
Examine data from all possible angles: distance, parsimony, likelihood.
Assess the variation in your data in some way.

30 Types of data used in phylogenetic inference:
Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.
Taxa       Characters
Species A  ATGGCTATTCTTATAGTACG
Species B  ATCGCTAGTCTTATATTACA
Species C  TTCACTAGACCTGTGGTCCA
Species D  TTGACCAGACCTGTGGTCCG
Species E  TTGACCAGTTCTCTAGTTCG
Distance-based methods: Transform the sequence data into pairwise distances, and use the matrix during tree building.
[Table: pairwise distance matrix for Species A to E; the values did not survive in this transcript.]
Example 1: Uncorrected “p” distance (= observed percent sequence difference)
Example 2: Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
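
Both example distances can be computed from the aligned characters. The sketch below uses the Species A and Species B sequences from the slide and the standard Kimura 2-parameter formula d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions; the function name is illustrative. The formula is undefined for very divergent pairs (the logarithm's argument becomes non-positive), which this sketch does not handle.

```python
import math

# Transitions are purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T) changes.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def p_and_k2p(x, y):
    """Return (uncorrected p distance, Kimura 2-parameter distance)."""
    n = len(x)
    diffs = [(a, b) for a, b in zip(x, y) if a != b]
    n_transitions = sum(frozenset(pair) in TRANSITIONS for pair in diffs)
    P = n_transitions / n                    # proportion of transitions
    Q = (len(diffs) - n_transitions) / n     # proportion of transversions
    p = len(diffs) / n
    k2p = -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))
    return p, k2p


species_a = "ATGGCTATTCTTATAGTACG"
species_b = "ATCGCTAGTCTTATATTACA"
p_ab, k2p_ab = p_and_k2p(species_a, species_b)
print(f"p = {p_ab:.3f}, K2P = {k2p_ab:.3f}")
```

The corrected distance comes out larger than the observed one, since it estimates substitutions hidden by multiple changes at the same site.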

31 Types of computational methods:
Clustering algorithms: Use pairwise distances. Are purely algorithmic methods, in which the algorithm itself defines the tree selection criterion. Tend to be very fast programs that produce singular trees rooted by distance. No objective function to compare to other trees, even if numerous other trees could explain the data equally well. Warning: Finding a singular tree is not necessarily the same as finding the “true” evolutionary tree.
Optimality approaches: Use either character or distance data. First define an optimality criterion (minimum branch lengths, fewest number of events, highest likelihood), and then use a specific algorithm for finding trees with the best value for the objective function. Can identify many equally optimal trees, if such exist. Warning: Finding an optimal tree is not necessarily the same as finding the “true” tree.

32 Molecular phylogenetic tree building methods:
Are mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available today, each having strengths and weaknesses. Most can be classified as follows:
DATA TYPE Characters, COMPUTATIONAL METHOD Optimality criterion: PARSIMONY, MAXIMUM LIKELIHOOD
DATA TYPE Distances, COMPUTATIONAL METHOD Optimality criterion: MINIMUM EVOLUTION, LEAST SQUARES
DATA TYPE Distances, COMPUTATIONAL METHOD Clustering algorithm: UPGMA, NEIGHBOR-JOINING

33 Tree-Building Methods
Distance UPGMA, NJ, FM, ME Character Maximum Parsimony Maximum Likelihood

34 Distance Methods
Measure distance (dissimilarity)
Methods:
- UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
- NJ (Neighbor joining)
- FM (Fitch-Margoliash)
- ME (Minimum Evolution)

35 Inferring Trees and Ancestors
[Figure: tree with leaves CCCAGG, CCCAAG, CCCAAA and CCCAAC, and inferred ancestors CCCAAG and CCCAAA.]

36 UPGMA: Clustering

37 UPGMA: Distance measure
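Clustering: Each leaf is first assigned to its own cluster; the clusters are then merged iteratively according to their distance. The distance between two clusters i and j is defined as the average distance between their members:
$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$
where |Ci| and |Cj| denote the number of sequences in cluster i and j, respectively.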

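38 UPGMA: Replacing
Node k replaces nodes i and j with their union:
$$C_k = C_i \cup C_j \qquad (1)$$
The new distances between the new node k and all other clusters l are computed according to:
$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|} \qquad (2)$$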

39 UPGMA: Algorithm
Initialization:
- Assign each sequence i to its own cluster Ci.
- Define one leaf of T for each sequence, and place it at height zero.
Iteration:
- Determine the two clusters i, j for which di,j is minimal.
- Define a new cluster k by Ck = Ci U Cj, and define dkl for all l by (2).
- Define a node k with daughter nodes i and j, and place it at height di,j/2.
- Add k to the current clusters and remove i and j.
Termination:
- When only two clusters i, j remain, place the root at height di,j/2.
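
The algorithm above can be sketched in a few lines. This is an illustrative implementation, not the code behind the slides: the distances are the seven-taxon values from the example slides that follow, read as a lower-triangular matrix, with d(E,G) = 96 inferred from the merged DG row (an assumption, since that cell is missing in the transcript). New clusters are named by concatenating the names of the clusters they replace.

```python
from itertools import combinations

names = "ABCDEFG"
# Lower-triangular rows: distances from each taxon to the taxa before it.
rows = {
    "B": [63],
    "C": [94, 79],
    "D": [111, 96, 47],
    "E": [67, 23, 83, 100],
    "F": [20, 58, 89, 106, 62],
    "G": [107, 92, 43, 16, 96, 102],   # 96 (E-G) is inferred, see lead-in
}


def upgma(names, rows):
    """Return the list of merges as (cluster_i, cluster_j, node_height)."""
    d = {frozenset((names[col], row)): float(value)
         for row, values in rows.items() for col, value in enumerate(values)}
    size = {name: 1 for name in names}          # |C_i| for every current cluster
    merges = []
    while len(size) > 1:
        # Pick the two clusters with minimal distance.
        i, j = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        d_ij = d[frozenset((i, j))]
        k = i + j
        # Size-weighted average distance from the new cluster to every other one.
        for l in size:
            if l not in (i, j):
                d[frozenset((k, l))] = (
                    d[frozenset((i, l))] * size[i] + d[frozenset((j, l))] * size[j]
                ) / (size[i] + size[j])
        size[k] = size.pop(i) + size.pop(j)
        merges.append((i, j, d_ij / 2))          # new node sits at height d_ij / 2
    return merges


merges = upgma(names, rows)
for i, j, height in merges:
    print(f"merge {i} + {j} at height {height}")
```

Because this sketch keeps unrounded distances, later values can differ slightly from the rounded numbers shown on the example slides; the order of the merges is what it is meant to illustrate.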

40 UPGMA example: Step 1 Alignment -> distance
Example: observed percent sequence difference
Distance matrix:
     A    B    C    D    E    F    G
A    -
B   63    -
C   94   79    -
D  111   96   47    -
E   67   23   83  100    -
F   20   58   89  106   62    -
G  107   92   43   16   96  102    -
DNA/RNA overview

41 Step 2: distance -> clade
     A    B    C    D    E    F    G
A    -
B   63    -
C   94   79    -
D  111   96   47    -
E   67   23   83  100    -
F   20   58   89  106   62    -
G  107   92   43   16   96  102    -
The smallest distance is d(D,G) = 16, so D and G form the first clade.

42 Step 3: merge D and G
     A    B    C    E    F   DG
A    -
B   63    -
C   94   79    -
E   67   23   83    -
F   20   58   89   62    -
DG 109   94   45   98  104    -

43 Step 4
Same matrix as step 3. The smallest remaining distance is d(A,F) = 20, so A and F are merged next.

44 Step 5
    AF    B    C    E   DG
AF   -
B   61    -
C   92   79    -
E   65   23   62    -
DG 107   94   45   98    -

45 Step 6
Same matrix as step 5. The smallest remaining distance is d(B,E) = 23, so B and E are merged next.

46 Step 7
    AF   BE    C   DG
AF   -
BE  63    -
C   92   71    -
DG 107   96   45    -

47 Step 8
Same matrix as step 7. The smallest remaining distance is d(C,DG) = 45, so C and DG are merged next.

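48 Step 9
     AF   BE  CDG
AF    -
BE   63    -
CDG 102   88    -

49 Step 10
Same matrix as step 9. The smallest remaining distance is d(AF,BE) = 63, so AF and BE are merged next.
[Figure: the tree being assembled from the merged clusters.]

50 UPGMA: distance -> phylogeny
      AFBE  CDG
AFBE    -
CDG    94    -
[Figure: the resulting tree, with the Root joining AFBE and CDG.]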

51 Classification of phylogenetic inference methods
DATA TYPE Characters, COMPUTATIONAL METHOD Optimality criterion: PARSIMONY, MAXIMUM LIKELIHOOD
DATA TYPE Distances, COMPUTATIONAL METHOD Optimality criterion: MINIMUM EVOLUTION, LEAST SQUARES
DATA TYPE Distances, COMPUTATIONAL METHOD Clustering algorithm: UPGMA, NEIGHBOR-JOINING

52 Clustering methods (UPGMA & N-J)
Optimality criterion: NONE. The algorithm itself builds ‘the’ tree.
Advantages:
- Can be used on indirectly-measured distances (immunological, hybridization).
- Distances can be ‘corrected’ for unseen events.
- The fastest of the methods available. Can therefore analyze very large datasets quickly (needed for HIV, etc.).
- Can be used for some types of rate and date analysis.
Disadvantages:
- Similarity and relationship are not necessarily the same thing, so clustering by similarity does not necessarily give an evolutionary tree.
- Cannot be used for character analysis!
- Have no explicit optimization criteria, so one cannot even know if the program worked properly to find the correct tree for the method.

53 Minimum evolution (ME) methods
Optimality criterion: The tree(s) with the shortest sum of branch lengths (or overall tree length) is chosen as the best tree.
Advantages:
- Can be used on indirectly-measured distances (immunological, hybridization).
- Distances can be ‘corrected’ for unseen events.
- Usually faster than character-based methods.
- Can be used for some rate analyses.
- Has an objective function (as compared to clustering methods).
Disadvantages:
- Information lost when characters transformed to distances.
- Cannot be used for character analysis.
- Slower than clustering methods.

54 Character Methods
Maximum Parsimony: minimal changes to produce data
Maximum Likelihood

55 Parsimony methods
Optimality criterion: The ‘most-parsimonious’ tree is the one that requires the fewest number of evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.
Advantages:
- Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’).
- Can be used on molecular and non-molecular (e.g., morphological) data.
- Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy).
- Can be used for character (can infer the exact substitutions) and rate analysis.
- Can be used to infer the sequences of the extinct (hypothetical) ancestors.
Disadvantages:
- Are simple, intuitive, and logical (derived from “Medieval logic”, not statistics!)
- Can become positively misleading in the “Felsenstein Zone”.
[Figure: a tree in the “Felsenstein Zone”.]

56 Parsimony
1 CCCAGG
2 CCCAAG
3 CCCAAA
4 CCCAAC
Comparison  # changes
1-2         1
1-3         2
1-4         2
2-3         1
2-4         1
3-4         1
1,2 can be sister taxa AND 3,4 can be sister taxa
Infer ancestor of 1,2 and 3,4

57 Parsimony
[Figure: tree with leaves CCCAGG, CCCAAG, CCCAAA and CCCAAC, and inferred ancestors CCCAAG and CCCAAA.]
3 changes

58 Calculate # changes | tree
a acgtatgga
b acgggtgca
g aacggtgga
d aactgtgca
[Figure: the three possible trees for a, b, g and d, each with the states of one site (a: c, b: c, g: a, d: a) mapped onto the leaves.]
Total tree length: 7
Total tree length: 8
Total tree length: 8

59 Calculate # changes | tree
a acgtatgga
b acgggtgca
g aacggtgga
d aactgtgca
[Figure: the three possible trees for a, b, g and d, each with the states of one site (a: c, b: c, g: a, d: a) mapped onto the leaves.]
Total tree length: 7
Total tree length: ?
Total tree length: ?
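
The tree lengths can be counted site by site with Fitch's small-parsimony algorithm. This is a minimal sketch: the slides do not say which counting procedure was used, the trees are written as nested tuples with an arbitrary root (the root position does not change the count), and the function names are illustrative.

```python
seqs = {
    "a": "acgtatgga",
    "b": "acgggtgca",
    "g": "aacggtgga",
    "d": "aactgtgca",
}


def fitch(node, seqs, site):
    """Return (possible states at this node, changes needed below it) for one site."""
    if isinstance(node, str):                       # leaf: its observed state, no changes
        return {seqs[node][site]}, 0
    left_states, left_cost = fitch(node[0], seqs, site)
    right_states, right_cost = fitch(node[1], seqs, site)
    common = left_states & right_states
    if common:                                      # children agree: no extra change
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def tree_length(tree, seqs):
    """Total number of changes over all sites."""
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch(tree, seqs, site)[1] for site in range(n_sites))


trees = [
    (("a", "b"), ("g", "d")),
    (("a", "g"), ("b", "d")),
    (("a", "d"), ("b", "g")),
]
lengths = [tree_length(tree, seqs) for tree in trees]
print(lengths)
```

The tree with the smallest length is the most-parsimonious one for these four sequences.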

60 Maximum likelihood (ML) methods
Optimality criterion: ML methods evaluate phylogenetic hypotheses in terms of the probability that a proposed model of the evolutionary process and the proposed unrooted tree would give rise to the observed data. The tree found to have the highest ML value is considered to be the preferred tree.
Advantages:
- Are inherently statistical and evolutionary model-based.
- Usually the most ‘consistent’ of the methods available.
- Can be used for character (can infer the exact substitutions) and rate analysis.
- Can be used to infer the sequences of the extinct (hypothetical) ancestors.
- Can help account for branch-length effects in unbalanced trees.
- Can be applied to nucleotide or amino acid sequences, and other types of data.
Disadvantages:
- Are not as simple and intuitive as many other methods.
- Are computationally very intense (limits number of taxa and length of sequence).
- Like parsimony, can be fooled by high levels of homoplasy.
- Violations of the assumed model can lead to incorrect trees.

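61 Using models
Example: Jukes-Cantor
[Figure: the four nucleotides A, G, C, T, with every substitution occurring at the same rate α.]
Substitution rates:
$$q_{ij} = \begin{cases} \alpha, & \text{if } i \neq j \\ -3\alpha, & \text{if } i = j \end{cases}$$
Substitution probabilities after time t:
$$P_{ij}(t) = \begin{cases} \tfrac{1}{4} + \tfrac{3}{4} e^{-4\alpha t}, & \text{if } i = j \\ \tfrac{1}{4} - \tfrac{1}{4} e^{-4\alpha t}, & \text{if } i \neq j \end{cases}$$
Observed differences: $p_t = \tfrac{3}{4}\left(1 - e^{-4\alpha t}\right)$, where $p_t$ is the proportion of different nucleotides.
Actual changes: $d = 3\alpha t = -\tfrac{3}{4} \ln\left(1 - \tfrac{4}{3} p_t\right)$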

62 Maximum likelihood
Given two trees, the one with the higher likelihood, i.e. the one with the higher conditional probability of observing the data, is the better tree.

63 30 nucleotides from ψη-globin genes of two primates on a one-edge tree
           *  *
Gorilla    GAAGTCCTTGAGAAATAAACTGCACACTGG
Orangutan  GGACTCCTTGAGAAATAAACTGCACACTGG
There are two differences and 28 similarities.
[Table: values of αt and the corresponding lnL; the numbers did not survive in this transcript.]
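
The log-likelihood of this one-edge tree can be evaluated for any value of αt. The sketch below assumes the Jukes-Cantor model with equal base frequencies of 1/4 and the standard probabilities P(same) = 1/4 + 3/4 e^(-4αt) and P(different) = 1/4 - 1/4 e^(-4αt); the αt values tried on the slide are not recoverable, so the values used here are arbitrary.

```python
import math

gorilla = "GAAGTCCTTGAGAAATAAACTGCACACTGG"
orangutan = "GGACTCCTTGAGAAATAAACTGCACACTGG"


def jc_log_likelihood(x, y, alpha_t):
    """lnL of two aligned sequences joined by one edge, under Jukes-Cantor."""
    decay = math.exp(-4 * alpha_t)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    lnl = 0.0
    for a, b in zip(x, y):
        # 1/4 is the probability of the state at one end of the edge.
        lnl += math.log(0.25 * (p_same if a == b else p_diff))
    return lnl


n_diff = sum(a != b for a, b in zip(gorilla, orangutan))
p = n_diff / len(gorilla)
# Under Jukes-Cantor the maximum is reached where exp(-4 * alpha_t) = 1 - 4p/3.
alpha_t_hat = -math.log(1 - 4 * p / 3) / 4
lnl_hat = jc_log_likelihood(gorilla, orangutan, alpha_t_hat)

for alpha_t in (0.01, alpha_t_hat, 0.05):
    print(f"alpha*t = {alpha_t:.4f}  lnL = {jc_log_likelihood(gorilla, orangutan, alpha_t):.4f}")
```

Values of αt on either side of the estimate give a lower lnL, which is the sense in which the estimate is the maximum-likelihood one.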

64 Likelihoods of a more interesting tree
[Figure: a four-leaf tree with leaf states A, C, A and T and edge lengths d1 to d5.]
Data for one site is shown on the tree.
Edge lengths are defined as d_i = 3(αt)_i.
The computational root is chosen arbitrarily (homogeneous models) at an internal node (arrow).
u is the state at the root node, v the state at the other internal node.

65 Confidence assessment: Bootstrap

66 Bootstrap
1. Original data set with n characters. Original analysis, e.g. MP, ML, NJ.
2. Draw n characters randomly with replacement. Repeat m times.
3. This gives m pseudo-replicates, each with n characters.
4. Repeat the original analysis on each of the pseudo-replicate data sets.
5. Evaluate the results from the m analyses.
[Figure: trees of Aus, Beus, Ceus and Deus from the original analysis and from the pseudo-replicates; the final tree carries a bootstrap value of 75%.]
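
The resampling step can be sketched as follows. To keep the example short, the "analysis" is a deliberately trivial stand-in (which two taxa have the smallest uncorrected p distance) rather than a real MP, ML or NJ run, and the alignment is the five-species example from the earlier slide; the function names and the seed are illustrative assumptions.

```python
import random

alignment = {
    "Species A": "ATGGCTATTCTTATAGTACG",
    "Species B": "ATCGCTAGTCTTATATTACA",
    "Species C": "TTCACTAGACCTGTGGTCCA",
    "Species D": "TTGACCAGACCTGTGGTCCG",
    "Species E": "TTGACCAGTTCTCTAGTTCG",
}


def p_distance(x, y):
    return sum(a != b for a, b in zip(x, y)) / len(x)


def closest_pair(aln):
    """Stand-in analysis: the pair of taxa with the smallest p distance."""
    taxa = sorted(aln)
    scored = [(p_distance(aln[a], aln[b]), a, b)
              for i, a in enumerate(taxa) for b in taxa[i + 1:]]
    _, a, b = min(scored)
    return a, b


def pseudo_replicate(aln, rng):
    """Draw n alignment columns with replacement (same columns for every taxon)."""
    n = len(next(iter(aln.values())))
    columns = [rng.randrange(n) for _ in range(n)]
    return {taxon: "".join(seq[c] for c in columns) for taxon, seq in aln.items()}


def bootstrap_support(aln, m, seed=0):
    """Fraction of m pseudo-replicates that recover the original result."""
    rng = random.Random(seed)
    original = closest_pair(aln)
    hits = sum(closest_pair(pseudo_replicate(aln, rng)) == original for _ in range(m))
    return original, hits / m


original_pair, support = bootstrap_support(alignment, m=200)
print(original_pair, f"{100 * support:.0f}%")
```

In a real analysis the stand-in would be replaced by the tree-building method itself, and the support would be tallied for each clade of the original tree.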

67 Pros and cons of some methods
Pair-wise, algorithmic approach
+ Fast
+ Models can be used when transforming to distances
- Information is lost when transforming to pair-wise distances
- One will get a tree, but no measure of goodness to compare with other hypotheses
Parsimony
+ Philosophically appealing
- Can be computationally slow
Maximum likelihood
+ Model based
- Model based
- Computationally veeeeery slow

68 Computation
For large data sets (many taxa), exact solutions for any method employing an optimality criterion (parsimony, likelihood, minimum evolution) are not computationally feasible.

69 What can go wrong? Reality
- A tree may be a poor model of the real history
- Information has been lost by subsequent evolutionary changes
- “Species” vs. “gene” trees

70 What is wrong with this tree?
[Figure: tree of Canis, Mus and Gadus with support values of 100 on the branches.]

71 The expected tree… Gene duplication “Species” tree “Gene” trees

72 Two copies (paralogs) present in the genomes
[Figure: gene tree in which each of the two gene copies has its own Canis, Mus and Gadus branches. Genes within one copy are orthologous; genes from different copies are paralogous.]

73 What we have studied… Canis Gadus Mus

74 HIV Genome Diversity
- Error prone (RT) replication
- High rate of replication: 10^10 virions/day
- In vivo selection pressure

75 HIV tree
[Figure: HIV trees built from ENV and from GAG; sequences that change position between the two trees are recombinants. AIDS 1996, 10:S13]

76 To conclude
- Trash in, trash out: alignment is crucial
- Try several methods, for consistency
- Beware of paralogs
- If recombination is possible: each site has its own tree

