1 Introduction to Molecular Evolution
Mike Thomas
October 3, 2002
2 What we can learn from multiple sequence alignments
- An alignment is a hypothesis about the relatedness of a set of genes
- This information can be used to reconstruct the evolutionary history of those genes
- The history of the genes can provide us with information about the structure, function, and significance of a gene or family of genes
- We can also use the reconstructed history to test hypotheses about evolution itself:
  - Rates of change
  - The degree of change
  - Implications of change, etc.
- We can then pose and test hypotheses about the evolution of phenomena unrelated to the genes:
  - Evolution of flight in insects
  - Evolution of humans
  - Evolution of disease
3 Assumptions made by phylogenetic methods:
- The sequences are correct
- The sequences are homologous
- Each position is homologous
- The sampling of taxa or genes is sufficient to resolve the problem of interest
- Sequence variation is representative of the broader group of interest
- Sequence variation contains sufficient phylogenetic signal (as opposed to noise) to resolve the problem of interest
- Each position in the sequence evolved independently
4 How do you extract this information from an alignment?
5 Answer: a tree
[Figure: Haeckel’s Tree of Life, labeled with “Higher” organisms and “Lower” organisms.]
- A phylogenetic tree is a hierarchical, graphical representation of relationships
6 Other Ways to Represent Phylogenies
- Cladogram showing the phylogenetic relationships between four species.
- Relationships of the same four species represented as a set of nested parentheses.
- Evolutionary relationships of the same four species with nine synapomorphies (shared, derived characters) plotted on the branches.
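The nested-parentheses representation is commonly written in Newick format. The following is a minimal sketch, assuming four hypothetical species named A to D and a tree stored as nested Python tuples. Both the names and the data structure are illustration choices, not taken from the slide.

```python
def to_newick(tree):
    """Convert a tree stored as nested tuples into a nested-parentheses string."""
    if isinstance(tree, str):
        # A leaf is just its species name.
        return tree
    # An internal node is a tuple of subtrees; join them inside parentheses.
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


# Hypothetical cladogram: A and B are sisters, C joins them, D branches off first.
cladogram = ((("A", "B"), "C"), "D")
newick = to_newick(cladogram) + ";"
print(newick)  # (((A,B),C),D);
```

Each pair of parentheses corresponds to one internal node of the cladogram, so the same grouping information is carried by both representations.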
7 Using Phylogeny to Understand Gene Duplication and Loss
- A gene tree.
- The gene tree superimposed on a species tree, allowing identification of the duplication and loss events.
8 Problems with Phylogenetic Inference
- How do we know what the potential candidate trees are?
- How do we choose which tree is (most likely) the true tree?
9 Number of Possible Trees
Number of taxa or genes    Number of possible rooted trees
3                          3 (drawn on the slide, with tips ordered ABC, BAC, and CBA)
4                          15
5                          105
7                          10,395
10 Recipe for reconstructing a phylogeny
- Select an optimality criterion
- Select a search strategy
- Use the selected search strategy to generate a series of trees, and apply the selected optimality criterion to each tree, always keeping track of the “best” tree examined thus far
11 Search strategy: Which is the right tree?
- When m is the number of taxa, the number of possible rooted trees is $\frac{(2m-3)!}{2^{m-2}\,(m-2)!}$
- For 10 taxa, the number of trees is 34,459,425
- Many trees can be discarded because they are obviously wrong
- Sometimes, there is a general or even specific grouping that can serve as a start for the tree search
- There are a number of approaches to tree searches that can be used
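The tree-count formula can be checked numerically. The sketch below evaluates $(2m-3)!/[2^{m-2}(m-2)!]$ and, as a cross-check, enumerates every rooted bifurcating tree for a few taxa by attaching each new taxon to every possible branch. The function names and the nested-tuple tree representation are assumptions made for illustration.

```python
from math import factorial


def num_rooted_trees(m):
    """Number of rooted, bifurcating trees for m >= 2 taxa."""
    return factorial(2 * m - 3) // (2 ** (m - 2) * factorial(m - 2))


def insert_taxon(tree, taxon):
    """Yield every tree made by attaching taxon to one branch of tree."""
    # Attach on the branch leading to this (sub)tree.
    yield (tree, taxon)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_taxon(left, taxon):
            yield (new_left, right)
        for new_right in insert_taxon(right, taxon):
            yield (left, new_right)


def all_rooted_trees(taxa):
    """Enumerate all rooted bifurcating trees by adding one taxon at a time."""
    trees = [taxa[0]]
    for taxon in taxa[1:]:
        trees = [new for tree in trees for new in insert_taxon(tree, taxon)]
    return trees


for m in (3, 4, 5, 7, 10):
    print(m, num_rooted_trees(m))
# 3 3
# 4 15
# 5 105
# 7 10395
# 10 34459425

print(len(all_rooted_trees(list("ABCD"))))  # 15
```

Enumeration is only practical for a handful of taxa, which is why the exact and heuristic search strategies matter for larger data sets.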
12 Search Strategies
Strategy                    Type
Stepwise addition           Algorithmic
Star decomposition          Algorithmic
Exhaustive                  Exact
Branch & bound              Exact
Branch swapping             Heuristic
Genetic algorithm           Heuristic
Markov chain Monte Carlo    Heuristic
But we still need to evaluate the trees in order to identify the one most likely to be the true tree
13 Choose an optimality criterion to evaluate trees
- Commonalities can be found, but how can these be used to evaluate a tree?
14 General differences between optimality criteria
Criteria compared: Minimum Evolution, Maximum Parsimony, Maximum Likelihood
- Model based vs. “model free”
- Can account for many types of sequence substitutions vs. assumes that all substitutions are equal
- Works well with strong or weak sequence similarity vs. works only when sequence similarity is high
- Computationally fast vs. computationally slow
- Well understood statistical properties (easy to test) vs. poorly understood statistical properties (hard to test)
- Branch lengths: can be estimated accurately (important for molecular clocks) vs. cannot be estimated accurately vs. can be estimated with some degree of accuracy
15 Maximum Parsimony
- The parsimony score is the minimum number of required changes, or steps
- Only shared, derived characters are used
- The score for each character (site) is called the character score
- The character scores summed over all sites give the tree length
- The tree (out of all examined trees) with the lowest tree length is the most parsimonious tree… and most likely to be the true tree
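One common way to obtain the character score of a site on a given tree is Fitch's algorithm, which the slide does not name. It is swapped in here as an illustration. The sketch below sums the character scores into a tree length and compares two candidate trees. The four taxa, their sequences, and the nested-tuple tree format are all hypothetical.

```python
def fitch(tree, states):
    """Return (candidate ancestral states, minimum changes) for one character.

    tree is a taxon name or a 2-tuple of subtrees; states maps taxon -> state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The two subtrees can agree on a state: no extra change needed.
        return common, left_cost + right_cost
    # Otherwise at least one change is required at this node.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Sum the character scores over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        site = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch(tree, site)[1]
    return total


# Hypothetical aligned sequences for four taxa.
alignment = {"W": "AAGC", "X": "AAGT", "Y": "GCGT", "Z": "GCAT"}
tree_1 = (("W", "X"), ("Y", "Z"))
tree_2 = (("W", "Y"), ("X", "Z"))

print(tree_length(tree_1, alignment))  # 4
print(tree_length(tree_2, alignment))  # 6
best = min([tree_1, tree_2], key=lambda t: tree_length(t, alignment))
```

This version scores every site. Sites that are not parsimony-informative add the same number of steps to every tree, so they do not change which tree is shortest.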
16 Example: Maximum Parsimony
[Figure: two candidate trees for taxa X, H, G, and F, with morphological character states (numbers of teeth, toes, ribs, and vertebrae; lobe shape; leg length) mapped onto the branches. One tree has a length of 6 steps, the other 12 steps.]
18 Another example with nucleotide data
- Alignment of four hypothetical DNA sequences.
- Most parsimonious rooted cladogram for this alignment.
- Corresponding unrooted cladogram.
19 Issues & problems with parsimony
- Multiple trees may be the most parsimonious (have the same tree length)
  - A consensus tree can be constructed to visualize the congruity & discontinuity between these
- Branch lengths (and, therefore, rates of change) cannot be accurately estimated
- No explicit model of change is used, even when one might be well supported
- The most parsimonious tree(s) may not be the true tree
20 Minimum Evolution (Distance)
- All data are used, even though some may not be shared, derived characters
- The branch lengths represent distance between a taxon and an ancestor, given an assumed model of evolution
- The pairwise distances are calculated for each pair of taxa, given an assumed model of evolution
- The tree length is the sum of the branch lengths across a tree
- The tree (out of all examined trees) with the lowest tree length is the minimum evolution tree… and most likely to be the true tree
21 The tree is different than a parsimony tree
- Hypothetical evolutionary relationships between three DNA sequences, in which the horizontal branch lengths are proportional to the number of character-state changes along the branches.
- Topology of the parsimonious cladogram that would be constructed from the sequence similarities produced by such an evolutionary history if multiple substitutions had occurred at several sites.
22 Models of evolution: choosing parameters
Factors that Affect Phylogenetic Inference
- Relative base frequencies (A, G, T, C)
- Transition/transversion ratio
- Number of substitutions per site
- Number of nucleotides (or amino acids) in sequence
- Different rates in different parts of the molecule
- Synonymous/non-synonymous substitution ratio
- Substitutions that are uninformative or obfuscatory
  - Parallel substitutions
  - Convergent substitutions
  - Back substitutions
  - Coincidental substitutions
In general, the more factors that are accounted for by the model (i.e., more parameters), the larger the error of estimation. It is often best to use fewer parameters by choosing the simpler model.
23 Some distance models: p-distance
- $p = n_d / n$, where n is the number of sites (nucleotides or amino acids), and $n_d$ is the number of differences between the two sequences examined.
- Very robust when divergence times are recent and the effect of complicating phenomena is minor
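The p-distance is simple enough to compute directly. This is a minimal sketch, assuming two already-aligned sequences of equal length with no gaps or ambiguous characters. The example sequences are made up.

```python
def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n = len(seq1)
    n_d = sum(a != b for a, b in zip(seq1, seq2))  # number of differing sites
    return n_d / n


# Hypothetical aligned sequences differing at 2 of 10 sites.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
print(p)  # 0.2
```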
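24 Some distance models: Jukes-Cantor
- Used to estimate the number of substitutions per site
- The expected number of substitutions per site between two sequences is $d = 2rt = 6\alpha t$, where $r = 3\alpha$ is the substitution rate per site and t is the time since the sequences diverged
- It is estimated as $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where p is the proportion of sites that differ between the 2 sequences
- Variance can be calculated
- No correction is made for unequal nucleotide frequencies or differential substitution rates: all four bases are assumed to be equally frequent, and every substitution occurs at the same rate α
[Figure: substitution rate matrix for A, T, C, G, expressed in terms of the single rate α.]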
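The Jukes-Cantor correction can be applied to an observed p-distance as follows. This is a sketch: the function names are assumptions, and the variance expression $p(1-p)/[n(1-\frac{4}{3}p)^2]$ is the usual large-sample formula for this estimator, which the slide mentions but does not write out.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance from the proportion of differing sites p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4.0 * p / 3.0)


def jukes_cantor_variance(p, n):
    """Approximate sampling variance of the distance for n compared sites."""
    return p * (1 - p) / (n * (1 - 4.0 * p / 3.0) ** 2)


d = jukes_cantor(0.2)
v = jukes_cantor_variance(0.2, 10)
print(round(d, 4))  # 0.2326
```

The corrected distance is always at least as large as p, because it adds back an estimate of the multiple substitutions that the raw count of differences cannot see.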
25 Some distance models: Kimura two-parameter
- Used to estimate the number of substitutions per site
- $d = 2rt$, where r is the substitution rate (per site, per year) and t is the time since the two sequences diverged; $r = \alpha + 2\beta$, so $d = 2\alpha t + 4\beta t$
- Accounts for different transition and transversion rates
- Nucleotide frequencies are assumed to be equal; variance is greater than under Jukes-Cantor
[Figure: the pyrimidines (C, T) and the purines (A, G), with arrows marked α = transition rate and β = transversion rate.]
- These are treated the same for long divergence times.
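In practice the Kimura two-parameter distance is estimated from the observed proportions of transitional differences (P) and transversional differences (Q) with $d = -\frac{1}{2}\ln(1 - 2P - Q) - \frac{1}{4}\ln(1 - 2Q)$. That estimator is the standard one for this model but does not appear on the slide, so treat the sketch below as an illustration. The sequences are hypothetical and are assumed to be aligned, ungapped, and limited to A, C, G, and T.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n = len(seq1)
    transitions = 0
    transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # Purine <-> purine or pyrimidine <-> pyrimidine is a transition.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    P = transitions / n
    Q = transversions / n
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


# Hypothetical sequences: 1 transition and 2 transversions over 10 sites.
d = k2p_distance("ACGTACGTAC", "GCGTTCGTAA")
print(round(d, 4))  # 0.3831
```

In this example transversions are exactly twice as common as transitions, which is what the Jukes-Cantor model expects, so the result coincides with the Jukes-Cantor distance for p = 0.3.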
26 Other models
- Hasegawa, Kishino, and Yano (HKY): corrects for unequal nucleotide frequencies and takes transition/transversion bias into account
- Unrestricted model: allows different rates between all pairs of nucleotides
- General Time Reversible model: allows different rates between all pairs of nucleotides and corrects for unequal nucleotide frequencies
- Many other models have been invented to correct for specific problems
- The more parameters are introduced, the larger the variance becomes
27 Ways to build trees with distance models: ME
- Minimum Evolution (ME) trees can be found by exhaustive searches or heuristic searches (starting with a reasonable tree or eliminating unlikely possible trees)
- For each tree examined, the total tree length is calculated as the sum of branch lengths calculated using a given model
- ME, like Maximum Parsimony, may generate a number of equal-scoring ME trees and may not actually result in the true tree
28 Ways to build trees with distance models: UPGMA
- UPGMA (unweighted pair-group method using arithmetic averages)
- Generally accurate for molecular evolution when substitution rates are relatively constant, but this can rarely be assumed to be true
- Method:
  - Distances for each pair of taxa are computed using the chosen distance method
  - The pair with the smallest value d are combined into a single, composite taxon
  - The distances from this composite taxon to all other taxa are computed
  - The next pair with the smallest d is chosen (including consideration of pairings with the composite taxon)
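The UPGMA steps can be sketched as a short clustering loop. This is an illustrative implementation under stated assumptions: the input is a symmetric distance matrix given as a list of lists, the output is a topology in nested tuples without branch lengths, and the four-taxon matrix is made up.

```python
from itertools import combinations


def upgma(names, matrix):
    """Cluster taxa by UPGMA; matrix[i][j] is the distance between taxa i and j."""
    # Each active cluster: id -> (subtree, number of original taxa it contains).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {(i, j): matrix[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        # Combine the pair of clusters with the smallest distance.
        i, j = min(dist, key=dist.get)
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        # Distance from the composite taxon to every other cluster is the
        # arithmetic average over all original taxa in the merged pair.
        new_dist = {}
        for k in clusters:
            d_ik = dist[tuple(sorted((i, k)))]
            d_jk = dist[tuple(sorted((j, k)))]
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        # Drop distances involving the two merged clusters, then add the new ones.
        dist = {pair: v for pair, v in dist.items() if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical distances among four taxa.
names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(names, matrix)
print(tree)  # ('D', ('C', ('A', 'B')))
```

The loop simply repeats the pick-the-closest-pair step until one cluster remains. Weighting by cluster size is what makes the averages "unweighted" with respect to the original taxa.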
29 Ways to build trees with distance models: Neighbor Joining
- Neighbor Joining (NJ) is a very robust method that is accurate even when substitution rates are not constant, and generally recovers the ME tree (although this is not always the case)
- Method:
  - We construct a “star” tree and compute the sum of all branches, $S_O$ (this will be greater than the sum of all branches for the final tree, $S_F$)
  - We then pick a pair of taxa to be “neighbors” (say, taxa 1 & 2) and compute the sum of all branches, $S_{1,2}$
  - All other pairs of taxa are then placed as neighbors and the sum of all branches computed
  - The neighbors whose pairing results in the greatest reduction in the sum of all branches will be kept
  - Then, another round of neighbor joining is conducted, including using the neighbor pair retained in the first round
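Rather than recomputing the sum of all branches for every candidate pair, implementations usually rank pairs with the Q criterion, $Q_{ij} = (n-2)\,d_{ij} - \sum_k d_{ik} - \sum_k d_{jk}$, which is generally presented as selecting the same neighbors as the sum-of-branches rule. The sketch below performs one round of neighbor selection with that swapped-in formulation. The function name and the five-taxon matrix are assumptions made for illustration.

```python
def pick_neighbors(names, dist):
    """Choose one pair of neighbors from a symmetric distance matrix (n > 2).

    Returns (name_i, name_j, length_i, length_j), where the lengths are the
    branch lengths from each neighbor to the new internal node joining them.
    """
    n = len(names)
    row_sums = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Lower Q means a larger reduction in total tree length.
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    length_i = 0.5 * dist[i][j] + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    length_j = dist[i][j] - length_i
    return names[i], names[j], length_i, length_j


# Illustrative distance matrix for five taxa.
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
result = pick_neighbors(names, dist)
print(result)  # ('a', 'b', 2.0, 3.0)
```

Note that d and e have the smallest raw distance (3), yet a and b are joined first. This is the behavior that lets NJ cope with unequal substitution rates, where UPGMA would simply join the closest pair.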
30 Example: The evolution of flight in stoneflies
- Reconstruction of the order Plecoptera (stoneflies) from 18S rRNA sequences
- Kimura 2-parameter distance used
- Tree rooted with known outgroup species
- Neighbor-Joining tree building method used to construct first tree; tree search was conducted to ensure that the NJ tree was also the ME tree
- Characters related to flight were then mapped onto the tree
[Figure labels: defined outgroup taxa; scale, in substitutions/site.]
31 Maximum Likelihood
- The site likelihoods represent the probability of the data for one site given an assumed model of evolution
- Overall likelihood is the product of the site likelihoods
- Trees are evaluated by comparing log-likelihood scores
- Likelihood scores are comparable across models as well as trees, so they provide a way of testing the goodness of fit of a model
- The tree (out of all examined trees) with the highest likelihood score is the maximum likelihood tree… and most likely to be the true tree
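A full tree likelihood requires summing over unobserved ancestral states, which is beyond a short example. The sketch below instead illustrates the core idea on the simplest case: two aligned sequences under the Jukes-Cantor model, where each site likelihood is computed from the model, the logs are summed (equivalent to multiplying the site likelihoods), and the distance with the highest log-likelihood is kept. The function name, the grid search, and the sequences are assumptions made for illustration.

```python
import math


def jc_log_likelihood(seq1, seq2, d):
    """Log-likelihood of two aligned sequences d substitutions/site apart (d > 0)."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability a base is unchanged
    p_diff = 0.25 - 0.25 * decay   # probability of one specific different base
    log_l = 0.0
    for a, b in zip(seq1, seq2):
        # Site likelihood: equilibrium frequency of base a (1/4) times the
        # probability of observing b at the other end of the branch.
        site_likelihood = 0.25 * (p_same if a == b else p_diff)
        log_l += math.log(site_likelihood)  # sum of logs = log of the product
    return log_l


# Hypothetical aligned sequences differing at 2 of 10 sites.
seq1, seq2 = "ACGTACGTAC", "ACGTTCGTAA"

# Evaluate candidate distances on a grid and keep the best-scoring one.
candidates = [i / 1000 for i in range(1, 1501)]
best_d = max(candidates, key=lambda d: jc_log_likelihood(seq1, seq2, d))
print(best_d)
```

For this two-sequence case the best-scoring distance lands next to the Jukes-Cantor distance for p = 0.2 (about 0.2326), which shows that the distance correction and the likelihood criterion rest on the same model.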
32 All material through next Tuesday (10/8) will be covered by the exam
- Examples of phylogenetic reconstructions
- Uses of phylogenetic trees
- Other research using molecular evolution
- Next Thursday: exam 1