Phylogenetic Tree Reconstruction
Modified version of slides by Dr. Chun-Chieh Shih, Institute of Information Sciences, Academia Sinica.

OUTLINE
– Concept of evolutionary trees
– Flowchart of phylogenetic analysis
– Tree reconstruction methods
– Evaluation of reconstructed trees

Why reconstruct phylogenetic trees?
For example:
– Understand evolutionary history
– Map pathogen strain diversity for vaccines
– Assist in the epidemiology of infectious diseases
– Aid in predicting the function of novel genes
– Biodiversity studies
– Understand microbial ecologies

Concept of evolutionary trees
Rooted tree
– One sequence (the root) is defined to be the common ancestor of all other sequences
– If the molecular clock hypothesis holds, it is possible to predict a root
Unrooted tree
– Indicates evolutionary relationships without revealing the location of the oldest ancestor

Concept of evolutionary trees: Number of Trees

                                      Unrooted trees                  Rooted trees
# sequences   # pairwise distances    # trees        # branches/tree  # trees        # branches/tree
3             3                       1              3                3              4
4             6                       3              5                15             6
5             10                      15             7                105            8
6             15                      105            9                945            10
10            45                      2,027,025      17               34,459,425     18
30            435                     8.69 × 10^36   57               4.95 × 10^38   58

For N sequences: $N(N-1)/2$ pairwise distances; $\frac{(2N-5)!}{2^{N-3}(N-3)!}$ unrooted trees with $2N-3$ branches each; $\frac{(2N-3)!}{2^{N-2}(N-2)!}$ rooted trees with $2N-2$ branches each.
Taken From:
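The closed-form counts for N sequences can be checked with a few lines of Python. This is a minimal sketch; the function names `num_unrooted_trees` and `num_rooted_trees` are illustrative choices, not taken from any library.

```python
from math import factorial

def num_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 sequences."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

def num_rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 sequences."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

# Print: N, pairwise distances, unrooted trees, rooted trees
for n in (3, 4, 5, 6, 10, 30):
    pairs = n * (n - 1) // 2
    print(n, pairs, num_unrooted_trees(n), num_rooted_trees(n))
```

The integer division is exact because both expressions are products of odd numbers (double factorials), so no rounding is involved even for large N.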

Types of data used in phylogenetic inference
Character-based methods: use the aligned characters, such as DNA or protein sequences, directly during tree inference.
Distance-based methods: transform the sequence data into pairwise distances, and use the resulting matrix during tree building.

Flowchart of phylogenetic analysis
1. Data set collection
2. Multiple sequence alignment
3. Tree construction
   – Character-based (optimality criteria): parsimony, maximum likelihood
   – Distance-based: UPGMA, NJ, Fitch-Margoliash, KITSCH
4. Test reliability of the tree by analytical and/or resampling procedures

Distance Methods
Finding the closest neighbors among a group of sequences:
– Calculate the number of changes between each pair in a group of sequences (also the first step in producing a multiple sequence alignment)
– Identify the tree that correctly positions neighbors and whose branch lengths reproduce the original data as closely as possible
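The first step, counting changes between each pair of aligned sequences, can be sketched as follows. The four sequences are hypothetical and were invented for this illustration, and `hamming` is just a helper name; real analyses usually apply a substitution-model correction on top of these raw counts.

```python
from itertools import combinations

def hamming(a, b):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

# Hypothetical aligned sequences (not from the slides).
aligned = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "ACGAACGA",
    "s4": "TCGAACTA",
}

# Pairwise change counts: the raw material for a distance table.
changes = {(a, b): hamming(aligned[a], aligned[b])
           for a, b in combinations(aligned, 2)}

for (a, b), count in changes.items():
    # Second number is the p-distance: changes per aligned site.
    print(a, b, count, count / len(aligned[a]))
```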

Distance Methods - Example
[Figure: distances between sequences and the resulting distance table]

Distance Methods - Example: Distance Programs in Phylip
– FITCH: estimates a phylogenetic tree assuming additivity of branch lengths, using the Fitch-Margoliash method
– KITSCH: same as FITCH, but under the assumption of a molecular clock
– NEIGHBOR: estimates phylogenies using either:
   – Neighbor-joining (no molecular clock assumed)
   – Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (molecular clock assumed)

Distance Methods - UPGMA
Construct a distance tree
A: GCTTGTCCGTTACGAT
B: ACTTGTCTGTTACGAT
C: ACTTGTCCGAAACGAT
D: ACTTGACCGTTTCCTT
E: AGATGACCGTTTCGAT
F: ACTACACCCTTATGAG

Distance table:
     A   B   C   D   E
B    2
C    4   4
D    6   6   6
E    6   6   6   4
F    8   8   8   8   8

Clustering: all leaves are assigned to a cluster; the clusters are then iteratively merged according to their distance.

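Distance Methods - UPGMA
The distance between two clusters i and j is defined as the average distance between their members:
$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$
where $|C_i|$ and $|C_j|$ denote the number of sequences in clusters i and j, respectively.
When clusters i and j are replaced by $C_k = C_i \cup C_j$, the distances between the new node k and all other clusters l are computed according to:
$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$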

Distance Methods - UPGMA
Step I: Initialization.
– Assign each sequence i to its own cluster $C_i$.
– Define one leaf of T for each sequence, and place it at height zero.
Step II: Iteration.
– Determine the two clusters i, j for which $d_{ij}$ is minimal.
– Define a new cluster k by $C_k = C_i \cup C_j$, and define $d_{kl}$ for all l.
– Define a node k with daughter nodes i and j, and place it at height $d_{ij}/2$.
– Add k to the current clusters and remove i and j.
Step III: Termination.
– When only two clusters i, j remain, place the root at height $d_{ij}/2$.
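The three steps can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `upgma`, the `frozenset`-keyed distance dictionary and the string labels for merged clusters are assumptions made here. The input is the six-sequence distance table from the UPGMA example in these slides.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster by UPGMA; dist maps frozenset({a, b}) to the distance between a and b.

    Returns the merges in order as (new cluster label, node height) pairs.
    """
    clusters = {name: 1 for name in names}      # cluster label -> number of leaves
    d = dict(dist)                              # working copy of the distances
    merges = []
    while len(clusters) > 1:
        # Step II: pick the two clusters with minimal distance.
        i, j = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((i, j))] / 2
        k = f"({i},{j})"
        ni, nj = clusters[i], clusters[j]
        # Size-weighted average distance from the new cluster to every other one.
        for l in clusters:
            if l not in (i, j):
                d[frozenset((k, l))] = (d[frozenset((i, l))] * ni +
                                        d[frozenset((j, l))] * nj) / (ni + nj)
        del clusters[i], clusters[j]
        clusters[k] = ni + nj
        merges.append((k, height))
    return merges

names = ["A", "B", "C", "D", "E", "F"]
lower = {"B": [2], "C": [4, 4], "D": [6, 6, 6],
         "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
dist = {frozenset((row, names[col])): value
        for row, values in lower.items() for col, value in enumerate(values)}

merges = upgma(names, dist)
for label, height in merges:
    print(label, height)
```

After A and B are joined, the pairs ((A,B), C) and (D, E) are tied at distance 4, so the order of the second and third merges depends on tie-breaking; the node heights are the same either way.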

Distance Methods - UPGMA
First round: choose the most similar pair, cluster them together and calculate the new distance matrix.

     A   B   C   D   E
B    2
C    4   4
D    6   6   6
E    6   6   6   4
F    8   8   8   8   8

A and B are the closest pair (distance 2): join them at height 1 (branch lengths A: 1, B: 1).
dist((A,B),C) = (distAC + distBC)/2 = 4
dist((A,B),D) = (distAD + distBD)/2 = 6
dist((A,B),E) = (distAE + distBE)/2 = 6
dist((A,B),F) = (distAF + distBF)/2 = 8

       A,B   C   D   E
C      4
D      6     6
E      6     6   4
F      8     8   8   8

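Distance Methods - UPGMA
Second round

       A,B   C   D   E
C      4
D      6     6
E      6     6   4
F      8     8   8   8

D and E are joined at height 2 (branch lengths D: 2, E: 2).

Third round

       A,B   C   D,E
C      4
D,E    6     6
F      8     8   8

C is joined to (A,B) at height 2 (branch length 2 to C, 1 to the (A,B) node).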

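Distance Methods - UPGMA
Fourth round

       AB,C   D,E
D,E    6
F      8      8

(AB,C) and (D,E) are joined at height 3 (branch length 1 to each of the two nodes).

Fifth round

       ABC,DE
F      8

F is joined to (ABC,DE); the ROOT is placed at height 4 (branch length 4 to F, 1 to the (ABC,DE) node).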

Distance Methods - UPGMA
The UPGMA clustering method is very sensitive to unequal evolutionary rates
– It assumes that the evolutionary rate is the same for all branches
– Clustering works only if the data are ultrametric
Ultrametric tree
– A special kind of additive tree in which the tips are all equidistant from the root
Additive tree
– A cladogram with branch lengths; also called a phylogram or metric tree
[Figure: an ultrametric tree and an additive tree with example branch lengths]

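Distance Methods - UPGMA
UPGMA fails when rates of evolution are not constant.
[Figure: true tree with unequal terminal branch lengths A: 1, B: 4, C: 2, D: 3, E: 2, F: 4 and internal branches of length 1]

Distance matrix derived from the true tree:
     A   B   C   D   E
B    5
C    4   7
D    7  10   7
E    6   9   6   5
F    8  11   8   9   8

Wrong topology: UPGMA first joins A and C (distance 4) rather than the true neighbors A and B.
[Figure: successive UPGMA clustering steps leading to the wrong topology]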

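Distance Methods – Neighbor Joining
The Four Point Condition
For a four-leaf tree in which A, B are neighbors and C, D are neighbors, with terminal branch lengths a, b, c, d and internal branch length x:
$d_{AC} + d_{BD} = d_{AD} + d_{BC} = a + b + c + d + 2x = d_{AB} + d_{CD} + 2x$
The 4-point condition (neighbors on the left, non-neighbors on the right):
$d_{AB} + d_{CD} < d_{AC} + d_{BD}$
$d_{AB} + d_{CD} < d_{AD} + d_{BC}$
Neighbors are closer than non-neighbors.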
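The condition can be tested directly on a distance table. The sketch below uses the distances among A, B, C and D from the unequal-rates UPGMA example in these slides; the function names `pairing_sums` and `four_point` and the tolerance `tol` are illustrative assumptions.

```python
def pairing_sums(dist, w, x, y, z):
    """The three sums d + d for the three ways of pairing up four leaves."""
    return {
        ((w, x), (y, z)): dist[w][x] + dist[y][z],
        ((w, y), (x, z)): dist[w][y] + dist[x][z],
        ((w, z), (x, y)): dist[w][z] + dist[x][y],
    }

def four_point(dist, w, x, y, z, tol=1e-9):
    """Return (pairing with the smallest sum, whether the two larger sums are equal)."""
    ordered = sorted(pairing_sums(dist, w, x, y, z).items(), key=lambda item: item[1])
    (best_pairing, _), (_, middle), (_, largest) = ordered
    return best_pairing, abs(largest - middle) <= tol

# Distances among A, B, C, D from the unequal-rates example matrix.
pairs = {("A", "B"): 5, ("A", "C"): 4, ("A", "D"): 7,
         ("B", "C"): 7, ("B", "D"): 10, ("C", "D"): 7}
dist = {}
for (p, q), value in pairs.items():
    dist.setdefault(p, {})[q] = value
    dist.setdefault(q, {})[p] = value

neighbors, holds = four_point(dist, "A", "B", "C", "D")
print(neighbors, holds)
```

Here A and C have the smallest single distance (4), yet the pairing sums identify A, B and C, D as the neighbor pairs, which is why a method based on the four-point condition is not misled by the long branch leading to B.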

Distance Methods – Neighbor Joining
– Sequence pairs are chosen to give the best least-squares estimate of branch length
– Begin with a star topology: no neighbors have been joined
– The tree is modified by joining pairs of sequences
[Figure: star topology with leaves A, B, C, D, E]

Distance Methods – Neighbor Joining
The pair is chosen by calculating the sum of branch lengths for the corresponding tree.
If A and B are joined:
[Figure: tree with leaves A, B, C, D, E after joining A and B]

Distance Methods – Neighbor Joining
– Neighbor-Joining approximates the least-squares tree, assuming additivity but without resorting to the assumption of a molecular clock.
– Idea: join clusters that are not only close to one another, but are also far from the rest.
– In each iteration: find the direct ancestor of two species in the tree, i.e. of two neighboring leaves.

Distance Methods – Neighbor Joining
Example: neighboring leaves i, j with ancestor k.
Join i and j: remove them from the list of leaf nodes and add k to the list, with distances to every other leaf m defined as
$$d_{km} = \tfrac{1}{2}\,(d_{im} + d_{jm} - d_{ij})$$
Problem: it is not sufficient to simply pick the two closest leaves.

Distance Methods – Neighbor Joining
Solution: for node i, define the average distance $u_i$ to all other leaves (n is the current number of leaves):
$$u_i = \frac{1}{n-2} \sum_{k \neq i} d_{ik}$$
and the corrected distances:
$$D_{ij} = d_{ij} - u_i - u_j$$
Minimum-evolution criterion: minimize the sum of all branch lengths.
Nodes i and j that are clustered next are those for which $D_{ij}$ is smallest.

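Distance Methods – Neighbor Joining
Initialization:
1. Initialize n clusters with the given species, one species per cluster
2. Set the size of each cluster to 1: $n_i \leftarrow 1$
3. In the output tree T, assign a leaf for each species
Iteration:
1. For each species i, compute $u_i = \frac{1}{n-2} \sum_{k \neq i} d_{ik}$, where n is the current number of clusters
2. Choose the i and j for which $d_{ij} - u_i - u_j$ is smallest
3. Join clusters i and j into a new cluster with corresponding node k, and set $d_{km} = \tfrac{1}{2}(d_{im} + d_{jm} - d_{ij})$ for every other cluster m. Calculate the branch lengths from i and j to the new node as $d_{ik} = \tfrac{1}{2}(d_{ij} + u_i - u_j)$ and $d_{jk} = \tfrac{1}{2}(d_{ij} + u_j - u_i)$
4. Delete clusters i and j from the list of clusters and add k
5. If more than two nodes remain, go back to 1. Otherwise, connect the two remaining nodes i and j by a branch of length $d_{ij}$ and end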
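A compact Python sketch of this iteration is given below. It is an illustration under stated assumptions, not PHYLIP's NEIGHBOR: the function name `neighbor_joining`, the internal labels `node0`, `node1`, ... and the edge-list output format are choices made here. The input is the distance matrix from the unequal-rates UPGMA example in these slides, where the true branch lengths are known.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Return the tree as a list of (node, neighbor, branch length) edges.

    dist is a symmetric dict of dicts: dist[a][b] is the distance between a and b.
    """
    d = {a: dict(dist[a]) for a in names}       # working copy
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Average distance of each node to all the others (divisor n - 2).
        u = {i: sum(d[i][m] for m in nodes if m != i) / (n - 2) for i in nodes}
        # Pick the pair minimising the corrected distance d_ij - u_i - u_j.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[p[0]][p[1]] - u[p[0]] - u[p[1]])
        k = f"node{next_id}"
        next_id += 1
        length_i = 0.5 * (d[i][j] + u[i] - u[j])
        length_j = d[i][j] - length_i
        edges.append((i, k, length_i))
        edges.append((j, k, length_j))
        # Distances from the new node k to every remaining node m.
        d[k] = {}
        for m in nodes:
            if m not in (i, j):
                d[k][m] = d[m][k] = 0.5 * (d[i][m] + d[j][m] - d[i][j])
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    a, b = nodes
    edges.append((a, b, d[a][b]))               # connect the last two nodes
    return edges

names = ["A", "B", "C", "D", "E", "F"]
lower = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
         "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
dist = {name: {} for name in names}
for row, values in lower.items():
    for col, value in zip(names, values):
        dist[row][col] = dist[col][row] = value

edges = neighbor_joining(names, dist)
lengths = {node: length for node, _, length in edges}
for edge in edges:
    print(edge)
```

On this matrix the sketch joins A with B first and assigns them branch lengths 1 and 4, whereas UPGMA starts by joining A and C. Because the output is unrooted, F ends up on a single branch of length 5, the sum of the two branches either side of the root in the true tree.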

Maximum Parsimony
– Predicts the evolutionary tree by minimizing the number of steps required to generate the observed variation
– For each position, the phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified
– Trees producing the smallest number of changes over all sequence positions are identified
– Time-consuming algorithm
– Only works well if the sequences have strong sequence similarity

Maximum Parsimony
Step I: Input: multiple sequence alignment
Step II: For each aligned position, identify the phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes
Step III: Continue the analysis for every position in the sequence alignment
Step IV: Sequence variations at each site in the alignment are placed at the tips of the trees

Maximum Parsimony - Example
[Figure: four aligned sequences with numbered positions]
Informative sites: a site must favor one tree over another → site 5 is informative, but sites 1, 6, 8 are not.
To be informative, a site must also have the same sequence character in at least two genomes → only sites 5, 7, and 9 are informative according to this rule.
[Figure: the possible trees for position 5]
Combining sites 5, 7, and 9, the left tree is the best tree for these 4 sequences.
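The rule can be written as a short check: a site is informative when at least two different characters each occur in at least two sequences. In the sketch below the four-sequence alignment is an assumption, chosen so that sites 5, 7 and 9 come out informative as stated above (the slide's own figure is not reproduced here), and `informative_sites` is an illustrative name.

```python
from collections import Counter

def informative_sites(seqs):
    """1-based positions where at least two characters each occur at least twice."""
    sites = []
    for pos, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos)
    return sites

# Assumed illustrative alignment (four sequences, nine sites).
seqs = [
    "AAGAGTGCA",
    "AGCCGTGCG",
    "AGATATCCA",
    "AGAGATCCG",
]

sites = informative_sites(seqs)
print(sites)
```

Constant sites (1, 6, 8 here) and sites where only one character is shared (2, 3) or none is (4) cost the same on every tree, so they cannot favor one topology over another.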

Maximum Parsimony - Example
What is the parsimony score of [Figure: labeled tree]?

Maximum Parsimony - Example
Site         1  2  3  4  5  6  7  8  9  10
Species 1    A  G  G  G  T  A  A  C  T  G
Species 2    A  C  G  A  T  T  A  T  T  A
Species 3    A  T  A  A  T  T  G  T  C  T
Species 4    A  A  T  G  T  T  G  T  C  G
How many possible unrooted trees?

Maximum Parsimony - Example
How many substitutions?
[Figure: tree; 1 change vs. 5 changes]

Maximum Parsimony - Example
Site         1  2  3  4  5  6  7  8  9  10
Species 1    A  G  G  G  T  A  A  C  T  G
Species 2    A  C  G  A  T  T  A  T  T  A
Species 3    A  T  A  A  T  T  G  T  C  T
Species 4    A  A  T  G  T  T  G  T  C  G
Changes counted so far: 0 0 0

Maximum Parsimony - Example
(same alignment)
Changes counted so far: 0 3

Maximum Parsimony - Example
(same alignment)
Changes counted so far:
0 3 2 2
0 3 2 1

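Maximum Parsimony - Example
Site         1  2  3  4  5  6  7  8  9  10
Species 1    A  G  G  G  T  A  A  C  T  G
Species 2    A  C  G  A  T  T  A  T  T  A
Species 3    A  T  A  A  T  T  G  T  C  T
Species 4    A  A  T  G  T  T  G  T  C  G

Minimum substitutions per site on each of the three unrooted trees:
Site                 1  2  3  4  5  6  7  8  9  10   Total
Tree (1,2)|(3,4)     0  3  2  2  0  1  1  1  1  2    13
Tree (1,3)|(2,4)     0  3  2  2  0  1  2  1  2  2    15
Tree (1,4)|(2,3)     0  3  2  1  0  1  2  1  2  2    14

Site 10 (G, A, T, G) has three distinct characters and needs 2 changes on every tree, in the same way as site 3. The tree (1,2)|(3,4) requires the fewest substitutions.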

Maximum Parsimony – Searching for Trees
# Taxa     3    4    5    10          50           100
# Trees    1    3    15   2 × 10^6    2 × 10^74    2 × 10^182
Imagine how large $10^{182}$ is...

Maximum Parsimony: where maximum parsimony fails
– Parsimony can give misleading information when rates of sequence change vary among the branches of the tree represented by the sequence data
– Real tree: two long branches in which G has changed to A independently, possibly with some intermediate steps
– In parsimony analysis, rates of change along all branches of the tree are assumed equal; therefore the tree predicted by parsimony will not be correct

Maximum Parsimony - Example
Standard problem: Maximum Parsimony (Hamming distance Steiner Tree)
– Input: set S of n aligned sequences of length k
– Output: a phylogenetic tree T
   – leaf-labeled by the sequences in S
   – with additional sequences of length k labeling the internal nodes of T
  such that $\sum_{(u,v) \in E(T)} H(u,v)$ is minimized, where $H(u,v)$ is the Hamming distance between the sequences labeling the two ends of edge $(u,v)$.

Maximum Parsimony - Example
Maximum parsimony (example)
– Input: four sequences
   – ACT
   – ACA
   – GTT
   – GTA
– Question: which of the three trees has the best MP score?

Maximum Parsimony - Example
All possible unrooted trees:
– (ACT, GTT) | (ACA, GTA)
– (ACA, GTT) | (ACT, GTA)
– (ACT, ACA) | (GTT, GTA)

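Maximum Parsimony - Example
Possible substitutions:
– (ACT, GTT) | (ACA, GTA): edge costs 1, 2, 2; MP score = 5
– (ACA, GTT) | (ACT, GTA): internal nodes labeled ACA and ACT, edge costs 3, 1, 3; score = 7 for this labeling (a better labeling, e.g. both internal nodes labeled ACT, costs 6)
– (ACT, ACA) | (GTT, GTA): internal nodes labeled ACA and GTA, edge costs 1, 2, 1; MP score = 4. Optimal MP tree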

Maximum Parsimony - Example
Maximum Parsimony: computational complexity
[Figure: the optimal tree (ACT, ACA) | (GTT, GTA), internal nodes labeled ACA and GTA, edge costs 1, 2, 1, MP score = 4]
– Finding the optimal MP tree is NP-hard
– For a given tree, an optimal labeling can be computed in linear time O(nk)
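One standard way to score a fixed tree in time proportional to nk is Fitch's algorithm, which is swapped in here as an illustration (the slides do not name a specific method). The sketch represents each unrooted four-leaf tree as a nested tuple rooted on its internal edge, which does not change the score; `fitch` and `parsimony_score` are illustrative names.

```python
def fitch(tree, site):
    """Return (possible states at this node, changes below it) for one site.

    A tree is either a leaf (its sequence as a string) or a (left, right) tuple.
    """
    if isinstance(tree, str):
        return {tree[site]}, 0
    left_states, left_cost = fitch(tree[0], site)
    right_states, right_cost = fitch(tree[1], site)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is needed at this node.
    return left_states | right_states, left_cost + right_cost + 1

def parsimony_score(tree, k):
    """Minimum number of substitutions over all k sites."""
    return sum(fitch(tree, site)[1] for site in range(k))

trees = [
    (("ACT", "GTT"), ("ACA", "GTA")),
    (("ACA", "GTT"), ("ACT", "GTA")),
    (("ACT", "ACA"), ("GTT", "GTA")),
]
scores = [parsimony_score(tree, 3) for tree in trees]
print(scores)
```

For the tree (ACA, GTT) | (ACT, GTA) the optimal labeling found this way costs 6, one less than the 7 obtained with the internal labels ACA and ACT; the tree (ACT, ACA) | (GTT, GTA) remains the best with a score of 4.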

Maximum likelihood approach
– The method uses probability calculations to find a tree that best accounts for the variation in a set of sequences
– Similar to the maximum parsimony method in that the analysis is performed on each column of a multiple sequence alignment
– Start with an evolutionary model of sequence change that provides estimates of the rates of substitution of one base for another (transitions and transversions)

Maximum likelihood approach
– Statistical method: powerful and flexible, but also computationally complex
– Given a particular tree and a model of evolutionary change, calculate the likelihood of the tree based on the data, i.e. the given multiple sequence alignment
– Likelihood(tree | data) is proportional to Probability(data | tree)

Maximum likelihood approach
– Tree with branches k; $v_k$ are the branch lengths
– $P_{AC}(t)$: probability of the character change A → C in time t
– We don't know the character states inside the tree (in the past), so calculate for all possibilities, e.g. A, C, G, T

Maximum likelihood approach
For two of the possible assignments of internal states:
$L = p(A)\,P_{AA}(v_1)\,P_{AA}(v_2)\,P_{AG}(v_4)\,P_{AA}(v_5)\,P_{AA}(v_6)\,P_{AA}(v_3)\,P_{AA}(v_7)\,P_{AA}(v_8)$
$L = p(A)\,P_{AA}(v_1)\,P_{AG}(v_2)\,P_{GG}(v_4)\,P_{GA}(v_5)\,P_{AA}(v_6)\,P_{AA}(v_3)\,P_{AA}(v_7)\,P_{AA}(v_8)$

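Maximum likelihood approach
For internal states $s_0, \dots, s_3$ and leaf states $s_4, \dots, s_8$, each assignment contributes the term
$L = p(s_0)\,P_{s_0 s_1}(v_1)\,P_{s_1 s_2}(v_2)\,P_{s_2 s_4}(v_4)\,P_{s_2 s_5}(v_5)\,P_{s_1 s_6}(v_6)\,P_{s_0 s_3}(v_3)\,P_{s_3 s_7}(v_7)\,P_{s_3 s_8}(v_8)$
and the likelihood of the site is obtained by summing this term over all possible internal states.
– Maximum likelihood does best in simulation but is also the slowest method
– A variety of new heuristics find the ML tree faster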
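The sum over internal states can be made concrete for the tree above, with root $s_0$, internal nodes $s_1, s_2, s_3$ and leaves $s_4, \dots, s_8$. The sketch below assumes the Jukes-Cantor model with equal base frequencies, so $p(s_0) = 1/4$; the branch lengths of 0.1 are hypothetical, and the function names are illustrative. It computes the site likelihood twice: by brute-force enumeration of the formula, and by a recursive factorisation that follows the idea behind Felsenstein's pruning algorithm.

```python
import math
from itertools import product

BASES = "ACGT"
PARENT = {1: 0, 3: 0, 2: 1, 6: 1, 4: 2, 5: 2, 7: 3, 8: 3}   # child -> parent
CHILDREN = {0: (1, 3), 1: (2, 6), 2: (4, 5), 3: (7, 8)}

def jc_prob(a, b, v):
    """Jukes-Cantor probability of a -> b on a branch of length v (subst. per site)."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def one_term(states, v):
    """The product p(s0) * prod P(parent -> child) for one full assignment."""
    term = 0.25                                  # p(s0) under equal base frequencies
    for child, parent in PARENT.items():
        term *= jc_prob(states[parent], states[child], v[child])
    return term

def likelihood_by_enumeration(leaf_states, v):
    """Sum the term over all 4^4 assignments of the internal states s0..s3."""
    total = 0.0
    for internal in product(BASES, repeat=4):
        states = dict(leaf_states)
        states.update(enumerate(internal))       # nodes 0..3
        total += one_term(states, v)
    return total

def conditional(node, state, leaf_states, v):
    """P(leaf data below node | node has the given state)."""
    if node not in CHILDREN:
        return 1.0 if leaf_states[node] == state else 0.0
    result = 1.0
    for child in CHILDREN[node]:
        result *= sum(jc_prob(state, x, v[child]) *
                      conditional(child, x, leaf_states, v) for x in BASES)
    return result

def likelihood_by_pruning(leaf_states, v):
    return sum(0.25 * conditional(0, s0, leaf_states, v) for s0 in BASES)

leaf_states = {4: "G", 5: "A", 6: "A", 7: "A", 8: "A"}   # leaf pattern of the two example terms
v = {branch: 0.1 for branch in range(1, 9)}              # hypothetical branch lengths

print(likelihood_by_enumeration(leaf_states, v))
print(likelihood_by_pruning(leaf_states, v))
```

Both functions return the same value; the factorised version avoids enumerating every combination of internal states, which matters once trees have more than a handful of internal nodes.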

Maximum likelihood approach
Maximum Likelihood (ML)
– Given: a stochastic model of sequence evolution (e.g. Jukes-Cantor) and a set S of sequences
– Objective: find a tree T and probabilities p(e) of substitution on each edge that maximize the probability of the data
Preferred by some systematists, but even harder than MP in practice.

Quality of the tree
– Phylogenetic trees can vary dramatically with slight changes in the data
– We want to know which branches are reliable, and which branches do not have strong support from the data
– Bootstrapping is the most common method used
– It is a general statistical technique for determining how much error is in a set of results

Confidence assessment: Bootstrapping
1. Start from the original data set with n characters and the original analysis (e.g. MP, ML, NJ)
2. Draw n characters randomly with replacement. Repeat m times to obtain m pseudo-replicates, each with n characters
3. Repeat the original analysis on each of the pseudo-replicate data sets
4. Evaluate the results from the m analyses
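The resampling loop can be sketched as follows. Everything named here is an assumption for illustration: `resample_columns` and `bootstrap_support` are hypothetical helpers, and `closest_pair` is only a toy stand-in for the "original analysis" (a real study would rerun MP, ML or NJ on each pseudo-replicate). The alignment is the four-species example used in the maximum parsimony slides.

```python
import random
from collections import Counter
from itertools import combinations

def resample_columns(alignment, rng):
    """Draw n alignment columns with replacement to build one pseudo-replicate."""
    n = len(next(iter(alignment.values())))
    columns = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}

def closest_pair(alignment):
    """Toy analysis: the pair of sequences with the fewest differences."""
    def differences(pair):
        return sum(x != y for x, y in zip(alignment[pair[0]], alignment[pair[1]]))
    return frozenset(min(combinations(sorted(alignment), 2), key=differences))

def bootstrap_support(alignment, analysis, m, seed=0):
    """Fraction of the m pseudo-replicates in which each analysis result appears."""
    rng = random.Random(seed)
    counts = Counter(analysis(resample_columns(alignment, rng)) for _ in range(m))
    return {result: count / m for result, count in counts.items()}

alignment = {
    "Species 1": "AGGGTAACTG",
    "Species 2": "ACGATTATTA",
    "Species 3": "ATAATTGTCT",
    "Species 4": "AATGTTGTCG",
}

support = bootstrap_support(alignment, closest_pair, m=200)
for pair, fraction in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(pair), fraction)
```

Passing the analysis in as a function keeps the resampling independent of the tree-building method, mirroring the slide: the same loop applies whether the original analysis is MP, ML or NJ.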

Confidence assessment
[Figure: bootstrap sampling of phylogenies]

Confidence assessment
What do the bootstrap values mean?
Bootstrap values for phylogenetic trees do not follow proper statistical behavior:
– A bootstrap value of 95% is actually close to 100% confidence in that branch
– A bootstrap value of 75% is often close to 95% confidence
– A bootstrap value of 60% indicates much lower confidence
– Less than 50% bootstrap: no confidence in that branch over an alternative

Computer Software for Phylogenetics
Due to the lack of consensus among evolutionary biologists about basic principles for phylogenetic analysis, it is not surprising that there is a wide array of computer software available for this purpose.
– PHYLIP is a free package that includes 30 programs that compute various phylogenetic algorithms on different kinds of data.
– The GCG package (available at most research institutions) contains a full set of programs for phylogenetic analysis, including simple distance-based clustering and the complex cladistic analysis program PAUP (Phylogenetic Analysis Using Parsimony).
– CLUSTALX is a multiple alignment program that includes the ability to create trees based on Neighbor Joining.
– MacClade is a well-designed cladistics program that allows the user to explore possible trees for a data set.

Phylogenetics on the Web
There are several phylogenetics servers available on the Web
– some of these will change or disappear in the near future
– these programs can be very slow, so keep your sample sets small
The Institut Pasteur, Paris has a PHYLIP server at:
Louxin Zhang at the Natl. University of Singapore has a WebPhylip server:
The Belozersky Institute at Moscow State University has their own "GeneBee" phylogenetics server:
The Phylodendron website is a tree drawing program with a nice user interface and a lot of options; however, the output is limited to GIFs at 72 dpi, which is not publication quality.

Other Web Resources
– Joseph Felsenstein (author of PHYLIP) maintains a comprehensive list of phylogeny programs at:
– Introduction to Phylogenetic Systematics, Peter H. Weston & Michael D. Crisp, Society of Australian Systematic Biologists
– University of California, Berkeley Museum of Paleontology (UCMP)

Software Hazards
– There are a variety of programs for Macs and PCs, but you can easily tie up your machine for many hours with even moderately sized data sets (e.g. fifty 300 bp sequences)
– Moving sequences into different programs can be a major hassle due to incompatible file formats
– Just because a program can perform a given computation on a set of data does not mean that it is the appropriate algorithm for that type of data

Which Method to Choose?
Depends upon the sequences that are being compared:
– Strong sequence similarity → Maximum parsimony
– Clearly recognizable sequence similarity → Distance methods
– All others → Maximum likelihood
Best to choose at least two approaches and compare the results: if they are similar, you can have more confidence.

Which Method to Choose?

Comparison of Methods (Tony Weisstein)
Neighbor-joining
– Uses only pairwise distances
– Minimizes distance between nearest neighbors
– Very fast
– Easily trapped in local optima
– Good for generating a tentative tree, or choosing among multiple trees
Maximum parsimony
– Uses only shared derived characters
– Minimizes total distance
– Slow
– Assumptions fail when evolution is rapid
– Best option when tractable (<30 taxa, homoplasy rare)
Maximum likelihood
– Uses all data
– Maximizes tree likelihood given specific parameter values
– Very slow
– Highly dependent on the assumed evolution model
– Good for very small data sets and for testing trees built using other methods

More Topics Related to Phylogenetics
– Phylogenetic epidemiology
– Supertree / Tree of Life
– Phylogeography

Idea of the ‘Tree of Life’
– The idea that the evolution of life can be represented as a tree, with leaves corresponding to extant species and nodes to extinct ancestors, came from Charles Darwin
– The earliest trees, drawn by Ernst Haeckel and others, were based on a general idea of a hierarchy of relationships between species and higher taxa
– Gradually, quantitative criteria were developed to measure the degree of morphological difference that was thought to reflect evolutionary distance

Winds of Change
– In the early days of molecular phylogenetics, a gene tree was usually equated with the species tree
– This view was typified by the use of ribosomal RNA (rRNA) sequences as the principal molecular phylogenetic marker
– This resulted in the discovery of a previously unrecognized domain of life, the Archaea, and in a tree topology that has been aptly called the ‘standard model’ of evolution
– This model involves the early descent of the bacterial clade from the last universal common ancestor and a subsequent separation of archaea and eukaryotes
– All this was to change once comparative genomics yielded more information and multiple complete genome sequences became available for comparison

The three domains of Life
Identified by phylogenetic analysis of the highly conserved 16S ribosomal RNA

Three strategies for constructing phylogenies
Homologous single-gene data set
– Relies on many taxa for a single gene
– Fewer genes and fewer samples
Sequence concatenation
– Combine or concatenate multiple sequences for the same set of species
– Needs close concordance of species sampling among genes, which is difficult because of the hit-or-miss sampling in the databases
– Large sequence alignments
Supertree construction
– Sample multiple genes only for minimally overlapping sets of species
– Tree constructed from a set of subtrees

What is the difficulty in computing? (David Hillis, Science, 2003)
– With current computational tools, phylogenetic analysis of 1,000 species is possible with adequate computer resources
– It is currently impossible to reach a reasonable solution for 500,000 species, even with months of computation
– Tree of Life (30,000 species); Assembling the Tree of Life (ATOL)
PARALLEL ALGORITHMS FOR GENETICS

Assembling large data matrices by concatenation
Domination by biological problems
Advantages
– Improves the accuracy of a specific portion of a tree
– The addition of species can be useful in cases of so-called ‘long-branch attraction’, in which high substitution rates or long intervals of time can mislead phylogenetic inference methods
Two potential problems
– Multiple genes can mix phylogenetic signals arising from different evolutionary histories
– Some sequences are usually unavailable for some species (‘missing data’), with possible deleterious effects on accuracy

Reconstruction of trees from large data matrices
Domination by computational problems
Two issues in constructing phylogenetic trees
– Computation time
– Reliability
Two time-consuming computational problems
– Multiple sequence alignment
– Phylogenetic inference
   – Optimal methods (parsimony and maximum likelihood) are time-consuming
   – Even heuristic approaches are slow: months of processor time were devoted to a heuristic parsimony analysis of the Chase et al. dataset of ~500 sequences, and it never ran to completion (Sanderson and Driskell, 2004)

Synthesis of large trees: supertree
A tree constructed from a set of trees
Advantages
– Independent studies can be combined into a single tree
– Initial trees can be based on different kinds of data
– Initial trees can be obtained by different methodologies
– Initial trees often have been selected from competing trees by professional judgment
– There are most likely no common data for all species
– Methods such as maximum likelihood would not be computationally tractable on such a large dataset

Synthesis of large trees: supertree
Classification (Wilkinson et al., 2001; Bininda-Emonds et al., 2002)
– Present
– Past
Supertree techniques past and present (Bininda-Emonds, 2004)

Reconstructing the “Tree” of Life
– Handling large datasets: millions of species
– The “Tree of Life” is not really a tree: reticulate evolution

Phylogenetic Epidemiology

Infectious diseases are caused by pathogens
– pathogen: microbe that causes disease
– microbe: microscopic organism
– The major classes of disease-causing microbes are viruses, bacteria, and eukaryotes (protists, fungi, and worms)
RNA Viruses
– RNA viruses are more often associated with epidemic and emerging diseases in humans than DNA viruses
– The gene sequences of many RNA viruses change so rapidly that it is possible to watch spatial and temporal patterns unfold on a ‘real time’ scale that is not usually visible in other organisms
– Diseases caused by RNA viruses: avian influenza, HIV, dengue...

The rapidity of RNA virus evolution is caused by a combination of (Holmes, 2004):
– Extremely high mutation rates
– Short generation times
– Immense population sizes
These factors produce rates of nucleotide substitution that are, on average, some six orders of magnitude higher than those in eukaryotes and DNA viruses (Jenkins et al. 2002).
The high rates of substitution found in viruses and bacteria allow phylogenies to be reconstructed for sequences that have diverged only recently.
Molecular phylogenies have come to play an increasingly important role in epidemiological studies of microbial pathogens, as they provide information about the location, timing, and mechanisms by which virulent strains arise.

Guan et al. (2002) Emergence of multiple genotypes of H5N1 avian influenza viruses in Hong Kong SAR. Proc Natl Acad Sci U S A, 99, 8950-8955.

Moya, A., Holmes, E.C., and Gonzalez-Candelas, F. (2004) The population genetics and evolutionary epidemiology of RNA viruses. Nat Rev Microbiol, 2, 279-288.

[Figure] Maximum likelihood estimate of the phylogeny of eight strains of influenza A isolated from humans, swine, and birds, based on an analysis of the HA gene. The divergence years prior to 1870, estimated using a partially constrained molecular clock, are shown at the left of each branch. The branch lengths (after 1870) are calibrated in units of years (scale at bottom).
Rannala, B. 2002. Molecular phylogenies and virulence evolution. In Adaptive Dynamics of Infectious Diseases: In Pursuit of Virulence Management

Difficulties With Phylogenetic Analysis
– Horizontal or lateral transfer of genetic material (for instance through viruses) makes it difficult to determine the phylogenetic origin of some evolutionary events
– Garbage in, garbage out! The alignment is crucial
– Genes under selective pressure can evolve rapidly, masking earlier changes that occurred in the phylogeny

Difficulties With Phylogenetic Analysis
– Two sites within the compared sequences may be evolving at different rates
– Rearrangements of genetic material can lead to false conclusions
– Duplicated genes can evolve along separate pathways, leading to different functions

Phylogenetics - Issues
Gene trees vs species trees
– Gene duplication can complicate phylogenetic analysis
– Paralogues (duplicated genes) do not fit in the evolutionary tree
Choice of target sequence type
– Ribosomal RNA (slowest change / mutation rate)
   Use for very long-term evolutionary studies, spanning species boundaries & biological kingdoms
– DNA / RNA (fastest change / mutation rate)
   (a) Use for short-term studies of closely-related species
   (b) Contains more evolutionary information than protein
– Protein (medium change / mutation rate)
   (a) Use for wide species comparisons
   (b) More reliable alignment than DNA

NO HOMEWORK! Happy??
A problem will appear in the Final Exam: give an example and design a flowchart to show how to construct a tree.
Your answer should include, at least:
(a) Where did you find the example? (Google, books, or papers)
(b) Why did you choose this example? (curiosity, simplicity, or no reason?)
(c) Where do you plan to get the sequences? (a database in the public domain)
(d) What kind of methods do you plan to use to construct your tree?
(e) Why do you plan not to use other methods?
Just go to Google and find YOUR OWN answer!

