Wellcome Trust Workshop Working with Pathogen Genomes Module 6 Phylogeny
Phylogeny
Phylogeny refers to the ancestry of a biological lineage, but the term is also used as a synonym for a phylogenetic tree.
Taxonomy began by grouping taxa together based on morphology at various structural levels.
Phylogeny is tree-like, or dichotomous.
Phylogeny provides the historical basis for the comparative method.
Principle of phylogenetics
Inferring relationships is about similarity.
Homology describes similarity due to common inheritance from an ancestor. Homologous characters are the informative kind of similarity.
Homoplasy describes similarity due to independent acquisition of the same, or a superficially similar, character state. Homoplasious characters give a misleading picture of phylogeny.
Distance in a phylogenetic tree reflects a decreasing number of shared, homologous characters (assuming that evolution maximises homology).
Phylogenetic trees in biology
A tool for understanding biological processes.
Examination of phylogeny to determine distance to characterized molecules:
– draw conclusions regarding biological functions not otherwise apparent
– multiple alignments vs. pairwise homology
Genomes are historical entities: their structure and function reflect the past.
Applications to genome biology
Gene family evolution
– orthology vs paralogy
– gene duplications and losses can be inferred through comparisons of ‘gene’ and ‘species’ trees
– the placement of a gene in the ‘wrong’ position within a phylogeny is used to support horizontal gene transfer
Microarray data analysis
– comparative genomic hybridization (CGH) distance matrix
Phylogenomics
– gene order, gene content and concatenated sequences can be used to infer phylogeny
Recombination
– tests for recombination and gene conversion use phylogenetic profiles to detect breakpoints
Building a phylogenetic tree
Identify protein, DNA or RNA sequences of interest
– Fasta format file of concatenated
Multiple sequence alignment
– ClustalX
Construct phylogeny
– PHYML
View and edit tree
– ATV
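The first step in the workflow above starts from a Fasta format file. As a minimal sketch of what that input looks like to a program, the Python below reads Fasta text into a dictionary. The function name `parse_fasta` and the example records are hypothetical and only illustrate the format; they are not part of ClustalX, PHYML or ATV.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse Fasta-formatted text into {record name: sequence}."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first word of the header as the record name.
            name = header[0] if header else f"record{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical two-record example.
example = """>seq1 hypothetical gene, strain 1
ACGTAC
GTAC
>seq2 hypothetical gene, strain 2
ACGTACGTAA
"""
seqs = parse_fasta(example)
```

Each header line starts with `>`, and the sequence lines that follow it are joined into one string per record.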
Overview of ClustalX procedure
1. Quick pairwise alignment: calculate distance matrix
2. Neighbor-joining tree (guide tree)
3. Progressive alignment following guide tree
[Figure labels: alpha-helices; CLUSTAL W]
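The distance matrix in the first step can be illustrated with a small Python sketch. It assumes the pairs have already been aligned to equal length, and it uses the simple proportion of differing sites (p-distance) rather than whatever scoring ClustalX applies internally. The names `p_distance` and `distance_matrix` and the toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise p-distances for every unordered pair of sequences."""
    return {(i, j): p_distance(seqs[i], seqs[j]) for i, j in combinations(seqs, 2)}


# Hypothetical aligned toy sequences.
toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA"}
dm = distance_matrix(toy)
# dm[("A", "B")] is 0.125: one of eight sites differs.
```

A matrix of this kind is what the guide tree in the second step is built from.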
Creating multiple alignments
A phylogeny is meaningless unless it is based on a good alignment.
Issues to consider
– Alignment parameters: weight matrix parameters, gap penalties
– Truncated sequences
– Non-homologous sequences
Constructing phylogenies
Stages in constructing phylogenies:
1. Data scoring: producing genetic distances or character states (‘distance’ or ‘discrete’ data).
2. Tree sorting: processes for searching ‘tree-space’, e.g., hill-climbing or MCMC.
3. Estimation: identifying the most acceptable tree topology and model parameters using a variety of methods (‘clustering’ or ‘optimising’ methods).
Phylogenetic methods:
Algorithmic
– Neighbor-joining
– UPGMA
Tree-searching
– Maximum parsimony
– Maximum likelihood
– Bayesian inference
No one method is best for all circumstances.
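To see why ‘tree-space’ has to be searched heuristically rather than exhaustively, it helps to count the candidate topologies. The number of unrooted, fully bifurcating trees for n taxa is (2n − 5)!!, a standard result. The Python sketch below evaluates it; the function name `num_unrooted_trees` is hypothetical.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating topologies: (2n - 5)!! = 3 * 5 * ... * (2n - 5)."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

Ten taxa already give 2,027,025 topologies, which is why tree-searching methods rely on hill-climbing or MCMC instead of scoring every tree.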
Neighbor Joining (NJ)
     A      B      C      D      E      F      G      H      I
A    ·
B  0.001    ·
C  0.025  0.024    ·
D  0.003  0.002  0.019    ·
E  0.336  0.331  0.219  0.231    ·
F  0.021  0.019  0.001  0.018  0.233    ·
G  0.001         0.025  0.002  0.256  0.023    ·
H  0.056  0.044  0.005  0.042  0.132  0.051  0.043    ·
I  0.325  0.300  0.116  0.195  0.005  0.122  0.366  0.213    ·
Principles: Tree topology and branch lengths are estimated from a genetic distance matrix.
Advantages: A single tree is estimated by minimising genetic distance, in a short time and with little computational expenditure.
Disadvantages: The method lacks accuracy because there is no attempt to correct for potential bias (homoplasy). The method lacks precision because the outcome is partly contingent on the tree with which the search process begins.
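How a tree and its branch lengths come out of a distance matrix can be sketched in a few lines of Python. This is a minimal version of the standard neighbor-joining recipe (join the pair minimising Q(i, j) = (n − 2)·d(i, j) − r(i) − r(j), where r is a row sum), written for illustration only. The function name, the dict-of-dicts input layout and the four-taxon toy matrix are assumptions; ties are simply broken by iteration order.

```python
from itertools import combinations


def neighbor_joining(dist: dict) -> str:
    """Return an unrooted NJ tree as a Newick string (needs >= 3 taxa).

    dist is a symmetric dict of dicts: dist[a][b] == dist[b][a].
    """
    d = {a: dict(row) for a, row in dist.items()}  # work on a copy
    while len(d) > 3:
        n = len(d)
        r = {a: sum(d[a][b] for b in d if b != a) for a in d}  # row sums
        # Pick the pair with the smallest Q value.
        i, j = min(combinations(d, 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        # Branch lengths from i and j to the new internal node u.
        li = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        lj = d[i][j] - li
        u = f"({i}:{li:.3f},{j}:{lj:.3f})"
        # Distances from u to every remaining taxon.
        new_row = {k: 0.5 * (d[i][k] + d[j][k] - d[i][j])
                   for k in d if k not in (i, j)}
        del d[i], d[j]
        for k, row in d.items():
            del row[i], row[j]
            row[u] = new_row[k]
        d[u] = new_row
    # Resolve the final three nodes around a central point.
    x, y, z = d
    lx = 0.5 * (d[x][y] + d[x][z] - d[y][z])
    ly = 0.5 * (d[x][y] + d[y][z] - d[x][z])
    lz = 0.5 * (d[x][z] + d[y][z] - d[x][y])
    return f"({x}:{lx:.3f},{y}:{ly:.3f},{z}:{lz:.3f});"


# Hypothetical additive distances generated from the tree ((A:1,B:2):3,C:4,D:5).
toy = {
    "A": {"B": 3, "C": 8, "D": 9},
    "B": {"A": 3, "C": 9, "D": 10},
    "C": {"A": 8, "B": 9, "D": 9},
    "D": {"A": 9, "B": 10, "C": 9},
}
print(neighbor_joining(toy))  # (C:4.000,D:5.000,(A:1.000,B:2.000):3.000);
```

Because the toy distances are exactly additive, the sketch recovers the branch lengths they were generated from; real distance data carry noise and will not fit a tree this cleanly.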
Maximum parsimony (MP)
Principles:
– Searches through tree topologies in ‘tree-space’ using a ‘hill-climbing’ algorithm.
– Scores trees on their ‘length’, i.e., the number of character state changes required to explain the distribution of characters on a given tree topology.
– Looks for the shortest tree, i.e., the topology requiring the fewest character changes overall.
Advantages:
– Generally accurate method with few assumptions.
– Phylogenetic hypotheses can be statistically tested by comparing the lengths of different trees.
– Tree estimation is relatively fast and undemanding.
Disadvantages:
– There are typically several shortest trees, resulting in a potentially ambiguous consensus topology.
– There is no explicit model of evolution and so the method is prone to error under certain circumstances, e.g., long-branch attraction (homoplasy).
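Tree ‘length’ can be made concrete with Fitch's small-parsimony algorithm, which counts the minimum number of state changes a fixed tree needs for each alignment column. The Python sketch below is one way to write it, assuming a rooted binary tree given as nested 2-tuples of leaf names; the function names and the four-taxon alignment are hypothetical.

```python
def fitch(tree, states):
    """Return (possible ancestral states, number of changes) for one site."""
    if isinstance(tree, str):           # a leaf: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:                          # children agree: no extra change needed
        return common, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Sum the Fitch change counts over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[k] for taxon, seq in alignment.items()})[1]
        for k in range(n_sites)
    )


# Hypothetical alignment and two candidate topologies.
aln = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}
t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
print(tree_length(t1, aln), tree_length(t2, aln))  # 2 4
```

Here the first topology needs two changes and the second needs four, so parsimony prefers the first. A full analysis would repeat this scoring across many topologies during the search.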
Maximum likelihood (ML)
Principles:
– Looks for the tree that, under a given model of evolution, maximizes the likelihood of the observed data.
– Applies a complex model of DNA or protein sequence evolution that estimates parameters for specific substitutions and other qualities of molecular sequences.
– Locates the most likely tree topology through a hill-climbing algorithm.
Various models accommodate sources of molecular homoplasy that might result in the wrong tree:
– ‘Multiple hits’ (substitutional saturation)
– Rate convergence
– Rate heterogeneity
– Base composition bias
– Codon usage bias
– Secondary structure
– Covariance
Maximum likelihood (ML)
Advantages:
– Highly accurate because considerable biological realism is introduced through the substitution model. This allows various forms of homoplasy to be corrected for.
– Phylogenetic estimation within the likelihood framework provides a robust statistical context in which to evaluate specific hypotheses.
– A single tree is produced that is generally precise.
Disadvantages:
– The complexity of the estimation process means that it is slow and computationally demanding.
– The hill-climbing algorithm is susceptible to local optima and so is not guaranteed to return the optimal solution.
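The likelihood idea, and the correction for ‘multiple hits’, can be shown in the simplest possible setting: two aligned sequences under the Jukes-Cantor (JC69) model, which is swapped in here as an illustrative model and is far simpler than the models used in practice. The Python sketch below maximises the likelihood of the distance d by a plain grid search (a stand-in for hill-climbing) and compares it with the known closed-form JC69 estimate. The function names and the counts (10 differences in 100 sites) are hypothetical.

```python
import math


def jc69_log_likelihood(d: float, n_diff: int, n_sites: int) -> float:
    """Log-likelihood of distance d (substitutions per site) under JC69."""
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))  # chance that a site differs
    return (n_diff * math.log(p / 3.0)
            + (n_sites - n_diff) * math.log(1.0 - p)
            + n_sites * math.log(0.25))


def jc69_distance(n_diff: int, n_sites: int) -> float:
    """Closed-form ML distance; only defined when fewer than 75% of sites differ."""
    p_hat = n_diff / n_sites
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)


n_diff, n_sites = 10, 100                        # hypothetical counts
grid = [i / 10000 for i in range(1, 20000)]      # candidate distances 0.0001 .. 1.9999
d_grid = max(grid, key=lambda d: jc69_log_likelihood(d, n_diff, n_sites))
d_closed = jc69_distance(n_diff, n_sites)
print(round(d_grid, 4), round(d_closed, 4))
```

Both estimates come out near 0.107, slightly above the observed proportion of 0.10, because the model allows for sites that changed more than once. Real ML programs optimise the topology and many branch lengths and model parameters at the same time, which is where the computational cost arises.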
Bootstrapping a tree
Statistical estimate of the reliability of groupings.
Sites (columns) in the alignment are resampled with replacement to build pseudo-replicate alignments, and a tree is generated from each.
The process is iterated multiple times (100-1000 times).
Agreement among the resulting trees is summarized with a majority-rule consensus tree.
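The resampling step can be sketched in Python as follows. Columns are drawn with replacement to make each replicate, and a caller-supplied test reports whether the grouping of interest was recovered. The names `bootstrap_replicate`, `bootstrap_support` and `a_groups_with_b` are hypothetical, and the grouping test is a deliberately crude stand-in for building a real tree from each replicate.

```python
import random


def bootstrap_replicate(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def bootstrap_support(alignment, grouping_found, n_reps=100, seed=0) -> float:
    """Fraction of replicates in which grouping_found(replicate) is True."""
    rng = random.Random(seed)
    hits = sum(bool(grouping_found(bootstrap_replicate(alignment, rng)))
               for _ in range(n_reps))
    return hits / n_reps


def a_groups_with_b(aln: dict[str, str]) -> bool:
    """Stand-in for tree building: is A closer to B than to C by site differences?"""
    def diffs(x, y):
        return sum(p != q for p, q in zip(aln[x], aln[y]))
    return diffs("A", "B") < diffs("A", "C")


# Hypothetical alignment in which A and B are identical and C differs everywhere.
aln = {"A": "ACGTACGT", "B": "ACGTACGT", "C": "TGCATGCA"}
support = bootstrap_support(aln, a_groups_with_b, n_reps=100)
```

In a real analysis each replicate would be passed to the same tree-building method as the original alignment, and the support values would be read from the consensus of the replicate trees.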
Bayesian
Principles:
– Based on the notion of posterior probabilities: probabilities that are estimated, based on some model (prior expectations), after learning something about the data.
– Uses an MCMC process to search through tree-space.
– Selects the tree topology with the highest probability, given the data.
Advantages:
– Intuitive.
– Potential for any complex model.
– Provides both parameter estimates (i.e., tree) and their probabilities in a single analysis.
– Many different hypotheses can be evaluated in a single analysis.
– The MCMC algorithm makes integrating over all parameter values fast and accurate; MCMCs are able to break out of local optima.
Bayesian inference (BI)
Disadvantages:
– An evolutionary model must be specified a priori, in the form of prior probabilities (‘priors’). Is there sufficient knowledge of these probabilities?
– The MCMC must be run long enough for variation in the parameter estimates to smooth out or reach ‘convergence’. The time required is never certain.
– Posterior probabilities describe the absolute probability of particular nodes and branch lengths; these can be overestimated.
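A toy MCMC run shows how priors, the chain and burn-in fit together. The Python sketch below uses a Metropolis sampler to approximate the posterior of a single pairwise distance under the Jukes-Cantor model with an exponential prior. It does not sample tree topologies, as real Bayesian phylogenetics programs do, and the model, prior rate, step size, burn-in length and data (10 differences in 100 sites) are all illustrative assumptions.

```python
import math
import random


def log_posterior(d: float, n_diff: int, n_sites: int, prior_rate: float = 1.0) -> float:
    """Unnormalised log posterior: JC69 log-likelihood plus exponential log-prior."""
    if d <= 0:
        return -math.inf                         # distances must be positive
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    log_lik = n_diff * math.log(p / 3.0) + (n_sites - n_diff) * math.log(1.0 - p)
    return log_lik - prior_rate * d


def metropolis(n_diff, n_sites, n_steps=5000, step_sd=0.05, start=0.5, seed=1):
    """Random-walk Metropolis sampler over the distance d."""
    rng = random.Random(seed)
    d, lp = start, log_posterior(start, n_diff, n_sites)
    samples = []
    for _ in range(n_steps):
        cand = d + rng.gauss(0.0, step_sd)       # symmetric proposal
        lp_cand = log_posterior(cand, n_diff, n_sites)
        # Accept uphill moves always, downhill moves with probability exp(difference).
        if lp_cand >= lp or rng.random() < math.exp(lp_cand - lp):
            d, lp = cand, lp_cand
        samples.append(d)
    return samples


samples = metropolis(n_diff=10, n_sites=100)
kept = samples[1000:]                            # discard burn-in
posterior_mean = sum(kept) / len(kept)
print(round(posterior_mean, 3))
```

The occasional acceptance of downhill moves is what lets the chain leave local optima. Choosing the burn-in and judging convergence remain judgment calls, as the disadvantages above point out.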