Published by Simon Shields. Modified over 9 years ago.
1
Phylogenetic Analysis
2
General comments on phylogenetics
Phylogenetics is the branch of biology that deals with evolutionary relatedness.
It uses some measure of evolutionary relatedness, e.g., morphological features.
3
Phylogenetics on sequence data is an attempt to reconstruct the evolutionary history of those sequences.
Relationships between individual sequences are not necessarily the same as those between the organisms they are found in.
4
The ultimate goal is to use data from many sequences to give information about the phylogenetic history of organisms.
Phylogenetic relationships are usually depicted as trees, with branches representing ancestors of "children"; the tips at the bottom of the tree (individual organisms) are leaves.
Individual branch points are nodes.
5
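Phylogenetic trees
[Figure: a rooted tree with leaves A, B, C, D and an explicit time axis; an unrooted tree with leaves A, B, C, D, where the direction of time is undetermined.]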
6
We will only consider binary trees: edges split only into two branches (daughter edges).
Rooted trees have an explicit ancestor; the direction of time is explicit in these trees.
Unrooted trees do not have an explicit ancestor; the direction of time is undetermined in such trees.
7
Types of phylogenetic analysis methods
Phenetic (distance methods): trees are constructed based on observed characteristics, not on evolutionary history.
Cladistic (parsimony and maximum likelihood methods): trees are constructed by fitting observed characteristics to some model of evolutionary history.
8
Distance methods
Measuring distance: just as in multiple alignment, distance represents all the differences at the various positions. These differences can be treated as equal or weighted according to empirical knowledge of substitution rates.
9
Another way to say this is that there is a set of distances $d_{ij}$ between each pair of sequences $i, j$ in the dataset.
$d_{ij}$ can be the fraction $f$ of sites $u$ at which the residues $x_i$ and $x_j$ differ, or $d_{ij}$ can be such a fraction weighted in some way (e.g., the Jukes-Cantor distance).
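The two distance definitions above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the gap handling (skipping columns with `-`) is an assumption, and the Jukes-Cantor correction uses the standard nucleotide formula $d = -\frac{3}{4}\ln(1 - \frac{4}{3}f)$, which the slides do not spell out.

```python
import math

def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction f of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption of this sketch: columns with a gap in either sequence are skipped.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(f: float) -> float:
    """Jukes-Cantor corrected distance for a nucleotide p-distance f."""
    if f >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for f >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)

f = p_distance("ACGTACGT", "ACGTACGA")  # one difference in eight sites
d = jukes_cantor(f)                     # slightly larger than f
```

The corrected distance is always at least as large as the raw fraction, because it accounts for sites that may have changed more than once.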
10
Clustering algorithms
UPGMA: this is the distance clustering method that is used in pileup to make the guide tree.
$d_{ij}$ is the average distance between pairs of sequences found in two clusters, $C_i$ and $C_j$.
Text's notation: $|C_i|$ = number of sequences in $C_i$.
11
The algorithm in the text means just what we said before: find the closest distance between two sequences and cluster those; then find the next closest distance and cluster those; as sequences are added to existing clusters, find the average distance between existing clusters.
Work through the notation!
UPGMA assumes a molecular clock mechanism of evolution.
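As a rough sketch of the clustering loop just described, the following Python function repeatedly merges the closest pair of clusters and averages distances weighted by cluster size. The data layout (a dict keyed by `frozenset` pairs, nested tuples for the tree) and the function name are choices made for this example, not anything prescribed by the text.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster leaves by average distance.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance between leaves a and b.
    Returns a nested tuple (left, right, height) for the rooted tree.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves |C|
    d = dict(dist)
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        # Under a molecular clock the new node sits halfway between them.
        new = (a, b, d[frozenset((a, b))] / 2)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to every other cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[new] = size_a + size_b
    return next(iter(clusters))

pairs = {("A", "B"): 2, ("C", "D"): 4, ("A", "C"): 8,
         ("A", "D"): 8, ("B", "C"): 8, ("B", "D"): 8}
dist = {frozenset(k): v for k, v in pairs.items()}
tree = upgma(["A", "B", "C", "D"], dist)
```

The node heights are placed at half the merge distance, which is where the molecular clock assumption enters: both daughters are taken to be equally far from their ancestor.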
12
Neighbor-joining: corrects for UPGMA's assumption of the same rate of evolution for each branch by modifying the distance matrix to reflect different rates of change.
The net difference between sequence $i$ and all other sequences is
$r_i = \sum_k d_{ik}$
13
The rate-corrected distance matrix is then
$M_{ij} = d_{ij} - (r_i + r_j)/(n - 2)$
Join the two sequences whose $M_{ij}$ is minimal; then calculate the distance from this new node $k$ to every other sequence $m$ using
$d_{km} = (d_{im} + d_{jm} - d_{ij})/2$
Again correct for rates and join nodes.
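The neighbor-joining steps above translate fairly directly into code. The sketch below returns only the topology (branch lengths are left out), and the data layout and function name are assumptions made for the example. The test distances come from a tree in which B and D evolve three times faster than A and C, so the smallest raw distance (A to C) links two sequences that are not neighbors.

```python
from itertools import combinations

def neighbor_joining(names, pair_dist):
    """Return the joined topology as nested 2-tuples.

    names:     list of leaf labels.
    pair_dist: dict mapping (a, b) -> distance, one entry per unordered pair.
    """
    nodes = list(names)
    d = {frozenset(k): v for k, v in pair_dist.items()}

    def get(a, b):
        return d[frozenset((a, b))]

    while len(nodes) > 2:
        n = len(nodes)
        # Net difference r_i = sum_k d_ik over all other current nodes.
        r = {i: sum(get(i, k) for k in nodes if k != i) for i in nodes}
        # Pick the pair with minimal rate-corrected distance M_ij.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: get(*p) - (r[p[0]] + r[p[1]]) / (n - 2),
        )
        new = (i, j)
        nodes.remove(i)
        nodes.remove(j)
        # Distance from the new node to each remaining node m.
        for m in nodes:
            d[frozenset((new, m))] = (get(i, m) + get(j, m) - get(i, j)) / 2
        nodes.append(new)
    return tuple(nodes)

pair_dist = {("A", "B"): 4, ("A", "C"): 3, ("A", "D"): 5,
             ("B", "C"): 5, ("B", "D"): 7, ("C", "D"): 4}
tree = neighbor_joining(["A", "B", "C", "D"], pair_dist)
```

On these distances a closest-pair method such as UPGMA would join A with C first, whereas the rate correction groups A with B and C with D.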
14
In-class exercise I
Retrieve the file named phylo2 from bioinfI.list in my directory.
Open it in the editor and select all the sequences.
Select Functions > Evolution > PAUPSearch; in Tree Optimality Criterion choose distance; in Method for Obtaining Best Tree choose heuristic. Leave everything else as default (make sure the bootstrap option is not selected).
Select Run. Inspect the output.
15
Parsimony methods
Parsimony methods are based on the idea that the most probable evolutionary pathway is the one that requires the smallest number of changes from some ancestral state.
For sequences, this implies treating each position separately and finding the minimal number of substitutions at each position.
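One standard way to find the minimal number of substitutions at a position on a given tree is Fitch's algorithm. The slides do not name it, so treat the following as an illustrative sketch rather than the method used in class; the tree encoding (nested 2-tuples of leaf names) and the function names are assumptions of the example.

```python
def fitch_cost(tree, states):
    """Minimal number of changes at one site on a fixed rooted binary tree.

    tree:   a leaf name (str) or a 2-tuple (left_subtree, right_subtree).
    states: dict mapping leaf name -> residue observed at this site.
    Returns (candidate ancestral states, number of changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_cost(left, states)
    right_set, right_cost = fitch_cost(right, states)
    common = left_set & right_set
    if common:
        # Both daughters can share a state: no extra change needed here.
        return common, left_cost + right_cost
    # No shared state: one substitution is required on this split.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the per-site minimal changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_cost(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

site = {"A": "A", "B": "A", "C": "G", "D": "G"}
cost_1 = fitch_cost((("A", "B"), ("C", "D")), site)[1]  # 1 change
cost_2 = fitch_cost((("A", "C"), ("B", "D")), site)[1]  # 2 changes
```

Grouping the two sequences that share a residue needs one change, while splitting them across the tree needs two, so the first topology is the more parsimonious for this site.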
16
Example of parsimonious tree building
The tree on the left requires only one change, the tree on the right requires two: the left tree is the most parsimonious.
17
Parsimony methods assign a cost to each tree available to the dataset, then screen those trees and select the most parsimonious.
Screening all the trees available to even a smallish dataset would take too much time; the branch and bound method builds trees with increasing numbers of leaves but abandons a topology whenever the current partial tree already has a bigger cost than the best complete tree found so far.
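To get a feel for why exhaustive screening is impractical, it helps to count the trees. The sketch below uses the well-known counts for binary trees with labeled leaves, $(2n-5)!!$ unrooted and $(2n-3)!!$ rooted; these formulas are not given in the slides, and the function names are made up for the example.

```python
def num_unrooted_trees(n_leaves: int) -> int:
    """Number of unrooted binary trees on n labeled leaves: (2n - 5)!!."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n_leaves - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count

def num_rooted_trees(n_leaves: int) -> int:
    """Number of rooted binary trees on n labeled leaves: (2n - 3)!!."""
    if n_leaves < 2:
        raise ValueError("need at least 2 leaves")
    count = 1
    for k in range(3, 2 * n_leaves - 2, 2):  # odd factors 3, 5, ..., 2n - 3
        count *= k
    return count

for n in (4, 10, 20):
    print(n, num_unrooted_trees(n))
```

Ten sequences already give about two million unrooted topologies, which is why pruning strategies such as branch and bound, or heuristic searches, are used instead of full enumeration.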
18
In-class exercise II
Use the same data set and program as in exercise I, but choose maximum parsimony. Use heuristic for the tree building method.
Inspect your tree. Compare it to the distance-generated tree.
19
Maximum likelihood methods
Maximum likelihood reconstructs a tree according to an explicit model of evolution. For the given model, no other method will work as well.
But such models must be simple, because the method is computationally intensive.
20
Actually, all the other methods discussed implicitly use a simple model of evolution similar to the typical model made explicit in maximum likelihood:
All sites are selectively neutral.
All sites mutate independently, with equal forward and reverse rates given by a single rate parameter, written $\mu$ here.
21
Also assume discrete generations and that sites change independently.
Given this model, we can calculate the probability that a site with initial nucleotide $i$ will change to nucleotide $j$ within time $t$:
$P^t_{ij} = \delta_{ij} e^{-\mu t} + (1 - e^{-\mu t}) g_j$,
where $\mu$ is the mutation rate, $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise, and $g_j$ is the equilibrium frequency of nucleotide $j$.
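The change probability above is easy to evaluate numerically. In the sketch below, `mu` stands for the rate parameter of the model and `freqs` for the equilibrium frequencies; both names, and the function name, are choices made for this example.

```python
import math

def transition_prob(i: str, j: str, t: float, mu: float, freqs: dict) -> float:
    """Probability that nucleotide i is replaced by j after time t."""
    delta = 1.0 if i == j else 0.0   # 1 when no change is required
    decay = math.exp(-mu * t)        # probability that no mutation event occurred
    return delta * decay + (1.0 - decay) * freqs[j]

freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
row_sum = sum(transition_prob("A", j, 0.5, 1.0, freqs) for j in freqs)
```

Two sanity checks follow from the formula: each row of probabilities sums to one, and for very large `t` the probability of ending in `j` approaches the equilibrium frequency of `j` regardless of the starting nucleotide.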
22
The likelihood that some site is in state $i$ at the $k$th node of a tree is $L_i(k)$.
The likelihoods for all states for each site for each node are calculated separately; the product of the likelihoods for each site gives the overall likelihood for the observed data.
Different tree topologies are searched to find the highest overall likelihood.
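A minimal sketch of this per-node, per-site calculation for a fixed tree is shown below. It follows the usual recursive scheme (often called Felsenstein's pruning algorithm, a name the slides do not use) and sums log-likelihoods over sites, which is equivalent to multiplying the per-site likelihoods. The tree encoding, the rate parameter `mu`, and all function names are assumptions of the example.

```python
import math

BASES = "ACGT"

def p_change(i, j, t, mu, freqs):
    """Probability of nucleotide i becoming j after time t."""
    decay = math.exp(-mu * t)
    return (1.0 if i == j else 0.0) * decay + (1.0 - decay) * freqs[j]

def conditional_likelihoods(node, site, mu, freqs):
    """Return {state i: L_i(node)} for one site.

    node: a leaf name (str) or ((left, t_left), (right, t_right)),
          where t_left and t_right are branch lengths.
    site: dict mapping leaf name -> observed nucleotide at this site.
    """
    if isinstance(node, str):
        return {b: 1.0 if site[node] == b else 0.0 for b in BASES}
    (left, t_left), (right, t_right) = node
    l_left = conditional_likelihoods(left, site, mu, freqs)
    l_right = conditional_likelihoods(right, site, mu, freqs)
    return {
        i: sum(p_change(i, j, t_left, mu, freqs) * l_left[j] for j in BASES)
        * sum(p_change(i, j, t_right, mu, freqs) * l_right[j] for j in BASES)
        for i in BASES
    }

def site_likelihood(tree, site, mu, freqs):
    """Average the root likelihoods over the equilibrium frequencies."""
    l_root = conditional_likelihoods(tree, site, mu, freqs)
    return sum(freqs[i] * l_root[i] for i in BASES)

def log_likelihood(tree, alignment, mu, freqs):
    """Sum of per-site log-likelihoods (log of the product over sites)."""
    length = len(next(iter(alignment.values())))
    return sum(
        math.log(site_likelihood(
            tree, {name: seq[k] for name, seq in alignment.items()}, mu, freqs))
        for k in range(length)
    )

freqs = {b: 0.25 for b in BASES}
tree = ((("s1", 0.1), ("s2", 0.1)), 0.2), ("s3", 0.3)
score = log_likelihood(tree, {"s1": "ACGT", "s2": "ACGA", "s3": "ACTT"}, 1.0, freqs)
```

A full maximum likelihood search would repeat this scoring over many topologies and branch lengths, which is where the computational cost comes from.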
23
Maximum likelihood is maybe the "gold standard" for phylogenetic analysis; but because of its computational intensity it can only be used for select data and only after much initial fine tuning of many parameters of sequence alignments.
Often used to distinguish between several already generated trees.
24
DO NOT USE MAXIMUM LIKELIHOOD TO BUILD TREES, TODAY OR FOR YOUR PROJECT. PLEASE!
25
Assessing trees
The bootstrap: randomly sample all positions (columns in an alignment) with replacement, meaning some columns can be repeated, but conserving the number of positions; build a large dataset of these randomized samples.
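The resampling step can be sketched as follows: draw column indices with replacement until the replicate has as many columns as the original alignment. The alignment layout (a dict of equal-length strings) and the function name are assumptions of this example.

```python
import random

def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Resample alignment columns with replacement, keeping the length fixed."""
    length = len(next(iter(alignment.values())))
    # The same column index may be drawn several times; others may be missed.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}

alignment = {"s1": "ACGTAC", "s2": "ACGAAC", "s3": "TCGAAT"}
rng = random.Random(0)  # fixed seed so the replicates are reproducible
replicates = [bootstrap_alignment(alignment, rng) for _ in range(1000)]
```

Each replicate keeps whole columns intact, so the rows stay aligned with one another; only the choice and multiplicity of columns varies.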
26
Bootstrap alignment process
27
Then use your method (distance, parsimony, likelihood) to generate another tree.
Do this a thousand or so times.
Note that if the assumptions the method is based on hold, you should always get the same tree from the bootstrapped alignments as you did originally.
The frequency of some feature of your phylogeny in the bootstrapped set gives some measure of the confidence you can have for this feature.
28
In-class exercise III
Use the same dataset and select distance again. This time, select the bootstrap box. In options, make sure to select the box labelled Save a file containing PAUP screen output. Take defaults for everything else. Run.
Inspect your output. In particular, look at the paup.log file and compare it to the paupdisplay.figure file.
29
Repeat for the maximum parsimony method.
Were the original trees (not bootstrapped) meaningful?