Phylogenetics. Alexei Drummond. CS369 2007.


1 Phylogenetics Alexei Drummond

2 Friday quiz
How many rooted binary trees having 20 labeled terminal nodes are there?
(A) 2027025
(B) 34459425
(C) $8.20 \times 10^{21}$
(D) $3.21 \times 10^{70}$
Bonus question: What about unrooted trees?

3 Computational Biology
Multiple sequence alignment (global, local)
Evolutionary tree reconstruction
Substitution matrices
Pairwise sequence alignment (global and local)
Database searching (BLAST)
Sequence statistics
Adapted from slide by Dannie Durant

4 Molecules as Documents of Evolutionary History
Macromolecules contain information about the processes and history that formed them
HIV-1 (UK)  ATCGGATGCTAAAGCATATGACACAGAGGTACATAATGTTT
HIV-1 (USA) ATCAGATGCTAGAGCTTATGATACAGAGGTACA---TGTTT
However, this information is often fragmentary, camouflaged or lost completely
One of the aims of computational biology is to recover as much of this information as possible and decipher its meaning

5 Phylogenetics
Views similarity (homology) as evidence of common ancestry
–Homology: similarity that is the result of inheritance from a common ancestor
Uses tree diagrams to portray relationships based upon recency of common ancestry
Monophyletic groups (clades) contain species which are more closely related to each other than to any outside of the group
Phylogenetics has in recent years become a statistical science based on probabilistic models of evolution.

6 Types of Phylogenies
[Figure: two trees of the same seven taxa (Bacterium 1–3, Eukaryote 1–4)]
Cladograms show clusters
–Branch lengths are meaningless
Phylograms show clusters and branch lengths
–Branch lengths can represent time or genetic distance
–Vertical dimension is meaningless

7 Rooting trees using an outgroup
[Figure: two panels, "Unrooted tree" and "Rooted by outgroup", showing archaea, eukaryote and bacteria lineages; labels: outgroup, root of ingroup, Monophyletic Group (clade)]

8 Anatomy of a tree
[Figure: a tree of Bacterium 1–3 and Eukaryote 1–4 with labeled parts: Root, Internal node, Internal branch or edge, External branch or edge, External node or tip, Taxon]

9 Problems in Phylogenetics
Correctly aligning multiple sequences
Choosing an evolutionary model of sequence change
–To estimate the genetic distance between sequences
Inferring phylogenetic trees
Testing evolutionary hypotheses
–(we won’t cover this material in 369)

10 How many trees are there?
For n taxa there are $(2n-3)!! = (2n-3)(2n-5)\cdots(3)(1)$ rooted, binary trees:
n     #trees                     comment
4     15                         enumerable by hand
5     105                        enumerable by hand on a rainy day
6     945                        enumerable by computer
7     10395                      still searchable very quickly on computer
8     135135                     a bit more than the number of hairs on your head
9     2027025                    greater than the population of Auckland
10    34459425                   ≈ upper limit for exhaustive searching; about the number of possible combinations of numbers in the UK National Lottery
20    $8.20 \times 10^{21}$      ≈ upper limit for branch-and-bound searching
48    $3.21 \times 10^{70}$      ≈ the number of particles in the universe
136   $2.11 \times 10^{267}$     number of trees to choose from in the “Out of Africa” data (Vigilant et al., 1991)
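
The counts in this table can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the unrooted count relies on the standard identity that n tips give $(2n-5)!!$ unrooted binary trees, the same as the rooted count for n - 1 tips.

```python
def rooted_binary_trees(n: int) -> int:
    """Number of rooted binary trees on n labeled tips: (2n - 3)!!."""
    if n < 2:
        raise ValueError("need at least two tips")
    count = 1
    for odd in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n - 3
        count *= odd
    return count


def unrooted_binary_trees(n: int) -> int:
    """Number of unrooted binary trees on n labeled tips: (2n - 5)!!."""
    if n < 3:
        raise ValueError("need at least three tips")
    # Rooting an unrooted tree on any of its 2n - 3 branches gives a rooted
    # tree, so the unrooted count equals the rooted count for n - 1 tips.
    return rooted_binary_trees(n - 1)


for n in (4, 5, 6, 7, 8, 9, 10, 20):
    print(n, rooted_binary_trees(n), unrooted_binary_trees(n))
```

Python integers are arbitrary precision, so the same function also handles n = 48 and n = 136 without overflow.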

11 Phylogenetic Reconstruction
There are essentially two types of data for phylogenetic tree estimation:
–Distance data, usually stored in a distance matrix; e.g. DNA×DNA hybridisation data, morphometric differences, immunological data, pairwise genetic distances
–Character data, usually stored in a character array; e.g. multiple sequence alignment of DNA sequences, morphological characters.
[Figure: a character array (rows: taxa A–E; columns: numbered characters with 0/1 states) and a distance matrix over taxa A–E]
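
Character data can be turned into distance data. As a rough sketch, the code below computes the simplest pairwise genetic distance, the proportion of differing sites (p-distance), for the two aligned HIV-1 fragments shown on slide 4. Skipping gapped columns is an assumption made here for illustration; the slides do not say how gaps should be treated, and no evolutionary model is applied.

```python
def p_distance(seq_a: str, seq_b: str, gap: str = "-"):
    """Return (mismatches, compared_sites, proportion) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: ignore columns containing a gap
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return mismatches, compared, mismatches / compared


hiv_uk = "ATCGGATGCTAAAGCATATGACACAGAGGTACATAATGTTT"
hiv_usa = "ATCAGATGCTAGAGCTTATGATACAGAGGTACA---TGTTT"

mismatches, compared, p = p_distance(hiv_uk, hiv_usa)
print(f"{mismatches} differences over {compared} ungapped sites, p = {p:.3f}")
```

Applying this to every pair of rows in a character array fills in a distance matrix of the kind used by the clustering methods.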

12 Phylogenetic Reconstruction
Given the huge number of possible trees even for small data sets, we have two options:
–Build one according to some clustering algorithm
–Assign a “goodness of fit” criterion (an objective function) and find the tree(s) which optimise(s) this criterion

13 Phylogenetic Reconstruction
Type of Data        Tree Building Method
                    Clustering Algorithm        Optimality Criterion
Distances           UPGMA, Neighbor-Joining     Minimum Evolution
Nucleotide Sites                                Maximum Parsimony, Maximum Likelihood

14 Clustering Algorithms
The clustering algorithms are usually very fast and simple, but
–there is no explicit optimality criterion, so we have no measure of how good the tree is!
–we do not get any idea about other potential trees: were there any better trees?
Common methods are Neighbour-Joining and UPGMA.

15 Clustering Algorithms
The UPGMA and neighbor-joining (NJ) methods are both greedy heuristics which join, at each step, the two closest* sub-trees that are not already joined.
They are based on the minimum evolution principle.
An important concept in both of these methods is a pair of neighbors, which is defined as two nodes that are connected via a single node:
[Figure: neighbors A and B connected via Node 1]
* NJ uses rate-corrected distances

16 UPGMA Example
     A     B     C     D
A    0
B    8     0
C    7     9     0
D    12    14    11    0
[Tree so far: A and C joined at height 3.5]

17 UPGMA Example
      AC      B     D
AC    0
B     8.5     0
D     11.5    14    0
[Tree so far: A and C joined at height 3.5; B joins (A,C) at height 4.25, 0.75 above the (A,C) node]

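18 UPGMA Example
       ABC      D
ABC    0
D      12.33    0
[Tree: A and C joined at height 3.5; B joins at height 4.25 (0.75 above the (A,C) node); D joins at height 6.17 (1.92 above the (A,B,C) node)]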
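
The three steps of this example can be replayed in code. The following is a minimal sketch of UPGMA written as size-weighted average linkage; the function name, the frozenset-keyed distance dictionary and the nested-parenthesis cluster names are illustrative choices, not something taken from the slides.

```python
from itertools import combinations


def upgma(labels, dist):
    """Run UPGMA; dist maps frozenset({a, b}) to the distance between a and b.

    Returns a list of (cluster_name, height) pairs in merge order.
    """
    clusters = {label: (label,) for label in labels}  # name -> member leaves
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Step 1: find the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2  # the new node sits halfway between them
        new = f"({a},{b})"
        members = clusters[a] + clusters[b]
        # Step 3: distance from the new cluster to every other cluster is the
        # mean over all leaf pairs, i.e. a size-weighted average.
        for c in clusters:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    len(clusters[a]) * d[frozenset((a, c))]
                    + len(clusters[b]) * d[frozenset((b, c))]
                ) / len(members)
        # Step 2: replace the two clusters by their union.
        del clusters[a], clusters[b]
        clusters[new] = members
        merges.append((new, height))
    return merges


example = {
    frozenset("AB"): 8, frozenset("AC"): 7, frozenset("BC"): 9,
    frozenset("AD"): 12, frozenset("BD"): 14, frozenset("CD"): 11,
}
for name, height in upgma("ABCD", example):
    print(f"{name} joined at height {height:.2f}")
```

On the example matrix the merge heights come out as 3.5, 4.25 and about 6.17, matching the slides.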

19 UPGMA weaknesses
     A     B     C     D
A    0
B    8     0
C    7     9     0
D    12    14    11    0
[Figure: tree on A, B, C, D with branch lengths 3, 5, 3, 1, 2 and 6]
There is a (non clock-like) tree that fits the distance matrix exactly!
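
The claim can be checked numerically. The sketch below assumes one reading of the figure's branch labels: A and B hang off one internal node with branches 3 and 5, an internal branch of length 1 leads to the node carrying C (branch 3), and D is reached through branches of length 2 and 6 on either side of the root. The node names `x`, `y` and `root` are hypothetical. Under that reading every path length equals the corresponding matrix entry, while the root-to-tip depths differ, so the tree is not clock-like.

```python
from collections import defaultdict

# Assumed reading of the figure: (node, node) -> branch length.
edges = {
    ("A", "x"): 3, ("B", "x"): 5, ("x", "y"): 1,
    ("C", "y"): 3, ("y", "root"): 2, ("root", "D"): 6,
}

matrix = {
    frozenset("AB"): 8, frozenset("AC"): 7, frozenset("BC"): 9,
    frozenset("AD"): 12, frozenset("BD"): 14, frozenset("CD"): 11,
}


def path_lengths(tree_edges, source):
    """Sum branch lengths from source to every other node of the tree."""
    adjacency = defaultdict(list)
    for (u, v), length in tree_edges.items():
        adjacency[u].append((v, length))
        adjacency[v].append((u, length))
    dist = {source: 0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v, length in adjacency[u]:
            if v not in dist:  # a tree has exactly one path to each node
                dist[v] = dist[u] + length
                stack.append(v)
    return dist


fits = all(
    path_lengths(edges, a)[b] == matrix[frozenset((a, b))]
    for a in "ABCD" for b in "ABCD" if a < b
)
depths = {tip: path_lengths(edges, "root")[tip] for tip in "ABCD"}
print("tree fits matrix exactly:", fits)
print("root-to-tip depths:", depths)
```

In this tree A and B are neighbors, whereas UPGMA joined A and C first.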

20 UPGMA properties
UPGMA assumes that the rates of evolution are clock-like.
–Assumes the rate of substitution is the same on all branches of the tree
Produces a rooted tree

21 Neighbor-joining
Most widely-used distance-based method for phylogenetic reconstruction
UPGMA illustrated that it is not enough to pick the closest neighbors (at least when there is rate heterogeneity across branches)
Idea: take into account averaged distances to other leaves as well
Produces an unrooted tree

22 The basic idea
We start by moving every node i closer to all other nodes by this amount:
$r_i = \frac{1}{n-2} \sum_{k \neq i} d_{ik}$
As a result the new (squashed) distances are:
$D_{ij} = d_{ij} - (r_i + r_j)$
We are pushing node i closer to all other nodes by an amount slightly more than the average distance to all other taxa.
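
A small numerical illustration may help. The sketch below assumes the correction takes the form used in Durbin et al., $r_i = \frac{1}{n-2}\sum_{k \neq i} d_{ik}$ and $D_{ij} = d_{ij} - (r_i + r_j)$, and applies it to the four-taxon matrix from the UPGMA example. The function name and data layout are illustrative.

```python
def squashed_distances(labels, d):
    """Return (r, D): per-node corrections and rate-corrected distances.

    d maps frozenset({a, b}) to the observed distance between a and b.
    """
    n = len(labels)
    if n < 3:
        raise ValueError("the n - 2 divisor needs at least three nodes")
    # Divide by n - 2 rather than n - 1: slightly more than the plain average.
    r = {
        i: sum(d[frozenset((i, k))] for k in labels if k != i) / (n - 2)
        for i in labels
    }
    D = {
        (i, j): d[frozenset((i, j))] - (r[i] + r[j])
        for i in labels for j in labels if i < j
    }
    return r, D


example = {
    frozenset("AB"): 8, frozenset("AC"): 7, frozenset("BC"): 9,
    frozenset("AD"): 12, frozenset("BD"): 14, frozenset("CD"): 11,
}
r, D = squashed_distances("ABCD", example)
print("r:", r)
print("D:", D)
```

Although A and C have the smallest raw distance (7), the smallest squashed distances belong to the pairs (A, B) and (C, D), which are the neighbor pairs of the tree that fits the matrix exactly.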

23 The basic idea
In effect, the nodes that were far away from everything get pushed towards everything quite a lot. This counteracts the effect of long branches.
[Figure: tree on A, B, C, D with branch lengths 0.3 and 0.1]
UPGMA would incorrectly group A and B, whereas NJ would reconstruct the correct tree in this case.

24 Neighbor-joining
We use an algorithm very similar to UPGMA to connect the two closest nodes, i and j, using these new squashed distances.
We join these into a cluster and make a new node k to correspond to their ancestor, and pick distances from i, j and all other nodes to k.
The squashed distances are updated at each step.
See Durbin book, p171 for details.
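
The procedure can be sketched as follows. The slide defers the exact update rules to Durbin et al., so the branch-length and distance-update formulas below are the commonly stated ones and should be treated as an assumption, not as the lecturer's exact method. The function name and the `node1`, `node2` labels are illustrative.

```python
from itertools import combinations


def neighbor_joining(labels, dist):
    """Return the unrooted tree as a list of (node, node, branch_length) edges.

    dist maps frozenset({a, b}) to the distance between a and b.
    """
    d = dict(dist)
    nodes = list(labels)
    edges = []
    next_id = 1
    while len(nodes) > 2:
        n = len(nodes)
        r = {
            i: sum(d[frozenset((i, m))] for m in nodes if m != i) / (n - 2)
            for i in nodes
        }
        # Pick the pair with the smallest squashed distance d_ij - (r_i + r_j).
        i, j = min(
            combinations(nodes, 2),
            key=lambda pair: d[frozenset(pair)] - r[pair[0]] - r[pair[1]],
        )
        k = f"node{next_id}"
        next_id += 1
        d_ij = d[frozenset((i, j))]
        length_i = 0.5 * (d_ij + r[i] - r[j])  # assumed update rule
        edges.append((i, k, length_i))
        edges.append((j, k, d_ij - length_i))
        for m in nodes:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij
                )
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))  # connect the last two nodes
    return edges


example = {
    frozenset("AB"): 8, frozenset("AC"): 7, frozenset("BC"): 9,
    frozenset("AD"): 12, frozenset("BD"): 14, frozenset("CD"): 11,
}
for u, v, length in neighbor_joining("ABCD", example):
    print(u, v, length)
```

On the matrix from the UPGMA example this joins A with B (branches 3 and 5) and C with D (branches 3 and 8), connected by an internal branch of length 1, so every path length in the resulting tree equals the matching matrix entry.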

25 Runtime of the algorithm
Both of these clustering-based algorithms take $O(n^3)$ time once we have the distance matrix.
There are n steps and in each step we do:
–(1) find the smallest distance
–(2) join these two taxa
–(3) compute the distance from the new ancestor to all others
Step (1) takes $O(n^2)$ and the other two steps take $O(n)$
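
The cubic bound follows by summing the per-step costs. As a sketch, write $m$ for the number of clusters remaining at a given step and let $c_1$, $c_2$ be unspecified constants (both introduced here for illustration):

```latex
T(n) \;=\; \sum_{m=2}^{n} \left( c_1 m^2 + c_2 m \right)
     \;\le\; n \left( c_1 n^2 + c_2 n \right)
     \;=\; O(n^3)
```

The quadratic search for the smallest distance in step (1) dominates each step, and there are about n steps.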

