Wellcome Trust Workshop Working with Pathogen Genomes Module 6 Phylogeny.

1 Wellcome Trust Workshop Working with Pathogen Genomes Module 6 Phylogeny

2 Phylogeny
Phylogeny refers to the ancestry of a biological lineage, but the term is also used synonymously with ‘phylogenetic tree’.
Taxonomy began by grouping taxa together based on morphology at various structural levels.
Phylogeny is tree-like, or dichotomous.
Phylogeny provides the historical basis for the comparative method.

3 Principle of phylogenetics
Inferring relationships is about similarity.
Homology describes similarity due to common inheritance from an ancestor. Homologous characters provide useful similarity.
Homoplasy describes similarity due to independent acquisition of the same or a superficially similar character state. Homoplasious characters provide a misleading picture of phylogeny.
Distance in a phylogenetic tree reflects a decreasing number of shared, homologous characters (assuming that evolution maximises homology).

4 Phylogenetic trees in biology
Tool for understanding biological processes
Examination of phylogeny to determine distance to characterized molecules
–draw conclusions regarding biological functions not otherwise apparent
–multiple alignments vs. pairwise homology
Genomes are historical entities
–their structure and function reflect the past

5 Applications to genome biology
Gene family evolution
–orthology vs paralogy
–gene duplications and losses can be inferred through comparisons of ‘gene’ and ‘species’ trees
–the placement of a gene in the ‘wrong’ position within a phylogeny is used to support horizontal gene transfer
Microarray data analysis
–Comparative genomic hybridization (CGH) distance matrix
Phylogenomics
–gene order, gene content and concatenated sequences can be used to infer phylogeny
Recombination
–tests for recombination and gene conversion use phylogenetic profiles to detect breakpoints

6 Building a phylogenetic tree
Identify protein, DNA or RNA sequences of interest
–Fasta format file of concatenated
Multiple sequence alignment
–ClustalX
Construct phylogeny
–PHYML
View and edit tree
–ATV

7 Overview of ClustalX Procedure
1. Quick pairwise alignment: calculate distance matrix
2. Neighbor-joining tree (guide tree)
3. Progressive alignment following guide tree
[Figure: CLUSTAL W procedure diagram; one surviving label reads ‘alpha-helices’]

8 Creating multiple alignments
Phylogeny is meaningless unless it is based on a well-done alignment
Issues to consider
–Alignment parameters: weight matrix parameters, gap penalties
–Truncated sequences
–Non-homologous sequences

9 Multiple alignments: parameters

10 Multiple alignments: Gap penalties
[Figure panels: high gap penalties; default gap penalties; low gap penalties]

11 Multiple alignments: truncated sequences

12 Multiple alignments: non-homologous sequences

13 Constructing phylogenies
Stages in constructing phylogenies:
1. Data scoring; producing genetic distances or character states (‘distance’ or ‘discrete’ data).
2. Tree sorting; processes for searching ‘tree-space’, e.g., hill-climbing or MCMC.
3. Estimation; identifying the most acceptable tree topology and model parameters using a variety of methods (‘clustering’ or ‘optimising’ methods).
Phylogenetic methods:
–Algorithmic: Neighbor-joining, UPGMA
–Tree-searching: Maximum parsimony, Maximum likelihood, Bayesian inference
No one method is best for all circumstances
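The data-scoring stage can be illustrated with a minimal sketch that turns aligned sequences into genetic distances. It assumes DNA sequences of equal length with `-` as the gap character, and it uses the Jukes-Cantor (JC69) correction as one simple example of a distance model; the slides do not name a specific model, and the function names are illustrative only.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of ungapped aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jc69_distance(p):
    """JC69 correction: expected substitutions per site for an observed p-distance."""
    if p >= 0.75:
        raise ValueError("p-distance is saturated under JC69")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(seqs):
    """Return (names, square matrix) of JC69 distances for a dict name -> sequence."""
    names = list(seqs)
    matrix = [
        [0.0 if a == b else jc69_distance(p_distance(seqs[a], seqs[b])) for b in names]
        for a in names
    ]
    return names, matrix
```

A matrix of this kind is the ‘distance’ data that clustering methods such as neighbor-joining take as input; ‘discrete’ methods work on the alignment columns directly.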

14 Neighbor Joining (NJ)
[Figure: lower-triangular genetic distance matrix for taxa A to I; only the A-B entry (0.001) survives in this transcript]
Principles:
–Tree topology and branch lengths are estimated from a genetic distance matrix.
Advantages:
–A single tree is estimated by minimising genetic distance, in a short time and with little computational expenditure.
Disadvantages:
–The method lacks accuracy because there is no attempt to correct for potential bias (homoplasy).
–The method lacks precision because the outcome is partly contingent on the tree with which the search process begins.
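As a rough illustration of how a tree and its branch lengths can be estimated from a distance matrix, here is a compact sketch of the standard neighbor-joining recurrences (the Q criterion for choosing a pair, then the branch-length and distance-update formulas). It is not the code of any particular package. The function name is hypothetical, ties are broken by taking the first minimal pair, and the result is written as a Newick string with an arbitrary root placement.

```python
def neighbor_joining(names, dist):
    """Build an NJ tree from a square distance matrix; returns a Newick string."""
    nodes = list(names)
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Choose the pair that minimises the Q criterion.
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the new internal node to a and b.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"({a}:{len_a:.4f},{b}:{len_b:.4f})"
        # Distances from the new node to every remaining node.
        d[new] = {new: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    a, b = nodes
    return f"({a}:{d[a][b]:.4f},{b});"
```

Because equal Q values are resolved by input order in this sketch, reordering the taxa can change which pair is joined first when ties occur.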

15 Maximum parsimony (MP)
Principles:
–Searches through tree topologies in ‘tree-space’ using a ‘hill-climbing’ algorithm.
–Scores trees on their ‘length’, i.e., the number of character state changes required to explain the distribution of characters on a given tree topology.
–Looks for the tree with the minimum number of changes, i.e., the topology with the fewest character changes overall.
Advantages:
–Generally accurate method with few assumptions.
–Phylogenetic hypotheses can be statistically tested by comparing the lengths of different trees.
–Tree estimation is relatively fast and undemanding.
Disadvantages:
–There are typically several shortest trees, resulting in a potentially ambiguous consensus topology.
–There is no explicit model of evolution and so the method is prone to error under certain circumstances, e.g., long-branch attraction (homoplasy).
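Tree ‘length’ can be made concrete with a short sketch of Fitch's small-parsimony algorithm, one common way to count the minimum number of state changes on a fixed topology. The slides do not say which scoring algorithm is used, so this is an assumed choice. Trees are represented here as nested 2-tuples of leaf names, a convention adopted for this example only.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one alignment site.

    tree   : a leaf name (str) or a 2-tuple of subtrees
    states : dict mapping leaf name -> character state at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        # Children can agree on a state: no extra change needed here.
        return shared, left_cost + right_cost
    # Children disagree: one change is required on this pair of branches.
    return left_set | right_set, left_cost + right_cost + 1

def tree_length(tree, alignment):
    """Sum the Fitch score over all sites of an alignment (dict name -> sequence)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )
```

A parsimony search would call `tree_length` on each candidate topology and keep the shortest; comparing `(("A", "B"), ("C", "D"))` with `(("A", "C"), ("B", "D"))` on the same alignment shows how the score discriminates between topologies.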

16 Maximum likelihood (ML)
Principles:
–Looks for the tree that, under a given model of evolution, maximizes the likelihood of the observed data.
–Applies a complex model of DNA or protein sequence evolution that estimates parameters for specific substitutions and other qualities of molecular sequences.
–Locates the most likely tree topology through a hill-climbing algorithm.
Various models accommodate sources of molecular homoplasy that might result in the wrong tree:
–‘Multiple hits’ (substitutional saturation)
–Rate convergence
–Rate heterogeneity
–Base composition bias
–Codon usage bias
–Secondary structure
–Covariance

17 Maximum Likelihood
Advantages:
–Highly accurate because considerable biological realism is introduced through the substitutional model. This allows various forms of homoplasy to be corrected for.
–Phylogenetic estimation within the likelihood framework provides a robust statistical context in which to evaluate specific hypotheses.
–A single tree is produced that is generally precise.
Disadvantages:
–The complexity of the estimation process means that it is slow and computationally demanding.
–The hill-climbing algorithm is susceptible to local optima and so is not guaranteed to return the optimal solution.
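The likelihood idea can be shown on the smallest possible problem: estimating the single branch length separating two aligned sequences. The sketch below assumes the JC69 model (chosen for simplicity, not taken from the slides) and uses a toy step-halving hill climb; real ML programs optimise topology, all branch lengths and model parameters jointly. All names are illustrative.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of branch length d given n_diff differences in n_sites sites."""
    x = 4.0 * d / 3.0
    p_same = 0.25 + 0.75 * math.exp(-x)   # P(site unchanged)
    p_diff = -0.25 * math.expm1(-x)       # P(change to one specific other base)
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))

def hill_climb(f, x=0.1, step=0.1, tol=1e-6):
    """Maximise f over x > 0 by stepping uphill and halving the step when stuck."""
    best = f(x)
    while step > tol:
        moved = False
        for cand in (x + step, x - step):
            if cand > 0:
                value = f(cand)
                if value > best:
                    x, best, moved = cand, value, True
                    break
        if not moved:
            step /= 2.0
    return x

def ml_distance(n_sites, n_diff):
    """ML estimate of the JC69 distance between two sequences."""
    return hill_climb(lambda d: jc69_log_likelihood(d, n_sites, n_diff))
```

On this one-parameter surface there is a single peak, so the climb cannot get trapped; the local-optimum problem noted above arises in the far larger space of tree topologies.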

18 Bootstrapping a tree
Statistical estimate of the reliability of groupings
Sites in an alignment are resampled with replacement to generate trees
Process is iterated multiple times ( times)
Agreement among the resulting trees is summarized with a majority-rule consensus tree
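A minimal sketch of the resampling step, under stated assumptions: alignment columns are drawn with replacement, and a toy ‘grouping’ (the pair of taxa with the smallest p-distance) stands in for a full tree-building method. The replicate count of 100 is an arbitrary illustrative default, not a value from the slides, and every function name is hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def resample_sites(alignment, rng):
    """Draw alignment columns with replacement to make one bootstrap replicate."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}

def closest_pair(alignment):
    """Toy stand-in for tree building: the two taxa with the smallest p-distance."""
    def p(a, b):
        return sum(x != y for x, y in zip(a, b)) / len(a)
    pair = min(combinations(sorted(alignment), 2),
               key=lambda ab: p(alignment[ab[0]], alignment[ab[1]]))
    return frozenset(pair)

def bootstrap_support(alignment, n_reps=100, seed=0):
    """Fraction of replicates in which each pair is recovered as the closest pair."""
    rng = random.Random(seed)
    counts = Counter(closest_pair(resample_sites(alignment, rng))
                     for _ in range(n_reps))
    return {pair: count / n_reps for pair, count in counts.items()}
```

In a real analysis `closest_pair` would be replaced by a full tree estimate for each replicate, and support would be tallied for every grouping in the tree before building the consensus.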

19 Bayesian
Principles:
–Based on the notion of posterior probabilities: probabilities that are estimated, based on some model (prior expectations), after learning something about the data.
–Uses an MCMC process to search through tree-space.
–Selects the tree-topology with the highest probability, given the data.
Advantages:
–Intuitive
–Potential for any complex model.
–Provides both parameter estimates (i.e., tree) and their probabilities in a single analysis.
–Many different hypotheses can be evaluated in a single analysis.
–The MCMC algorithm makes integrating over all parameter values fast and accurate; MCMCs are able to break out of local optima.

20 Bayesian (BI)
Disadvantages:
–An evolutionary model must be specified a priori, in the form of prior probabilities (‘priors’). Is there sufficient knowledge of these probabilities?
–The MCMC must be run long enough for variation in the parameter estimates to smooth out or reach ‘convergence’. The time required is never certain.
–Posterior probabilities describe the absolute probability of particular nodes and branch lengths; these can be overestimated.
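To make the MCMC idea concrete, the sketch below runs a Metropolis sampler on a single parameter: the JC69 distance between two sequences, with an exponential prior. The model, prior rate, proposal width, chain length and burn-in are all assumptions chosen for illustration. Programs such as MrBayes sample trees and many parameters at once, which this toy does not attempt.

```python
import math
import random

def log_posterior(d, n_sites, n_diff, prior_rate=1.0):
    """Unnormalised log posterior: JC69 likelihood times an Exponential(prior_rate) prior."""
    if d <= 0:
        return float("-inf")
    x = 4.0 * d / 3.0
    p_same = 0.25 + 0.75 * math.exp(-x)
    p_diff = -0.25 * math.expm1(-x)
    log_lik = ((n_sites - n_diff) * math.log(0.25 * p_same)
               + n_diff * math.log(0.25 * p_diff))
    return log_lik - prior_rate * d

def metropolis(n_sites, n_diff, n_steps=20000, burn_in=2000, step_sd=0.05, seed=0):
    """Sample the posterior of d with a symmetric Gaussian random-walk proposal."""
    rng = random.Random(seed)
    d = 0.1
    lp = log_posterior(d, n_sites, n_diff)
    samples = []
    for i in range(n_steps):
        cand = d + rng.gauss(0.0, step_sd)
        lp_cand = log_posterior(cand, n_sites, n_diff)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_cand - lp)):
            d, lp = cand, lp_cand
        if i >= burn_in:
            samples.append(d)
    return samples
```

The retained samples approximate the posterior, so their mean and spread give a parameter estimate and its uncertainty from one run. Deciding whether `n_steps` and `burn_in` are large enough is the convergence question raised above.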

21 Remember All trees are wrong

22 Cladograms and phylograms
Cladograms show branching order; branch lengths are meaningless.
Phylograms show branch order and branch lengths.
[Figure: two trees, each with tips Bacterium 1, Bacterium 2, Bacterium 3 and Eukaryote 1, Eukaryote 2, Eukaryote 3, Eukaryote 4]
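The distinction can be seen in the Newick text format widely used by tree programs: a phylogram carries `:length` annotations, and a cladogram of the same topology omits them. The helper below is a small sketch that assumes unquoted labels containing no colons; the function name is illustrative.

```python
import re

def to_cladogram(newick):
    """Drop branch lengths from a Newick string, keeping only the branching order."""
    return re.sub(r":[0-9.eE+-]+", "", newick)

phylogram = "((A:0.1,B:0.2):0.05,C:0.3);"
cladogram = to_cladogram(phylogram)
```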

23 Rooting using an outgroup
The root defines common ancestry.
[Figure: an unrooted tree of archaea, eukaryote and bacteria lineages, and a tree rooted by an outgroup with the root and two monophyletic groups marked]

24 Further details
Textbooks:
–Page & Holmes. Molecular Evolution: A Phylogenetic Approach. Blackwell Science.
–Felsenstein. Inferring Phylogenies. Sinauer Associates.
–Hall. Phylogenetic trees made easy. Sinauer Associates.
Software:
–Phyml
–PAUP* (NJ, MP, ML)
–PHYLIP (NJ, MP, ML)
–MrBayes (Bayesian)
–Splitstree (Networks)
–FindModel (Model Test)
Websites:
–MultiPhyl (ML via )
–Felsenstein’s Phylogeny program page (links to available software)
