Published by Augustine Freeman. Modified over 9 years ago.
1
Algorithms in Computational Biology
Department of Mathematics & Computer Science
Building Phylogenetic Trees
2
Phylogeny
All organisms on Earth had a common ancestor
  Evidence from morphological, biochemical, and gene sequence data
Phylogeny
  The history of organismal lineages as they change through time
Phylogenetic tree
  A tree showing the evolutionary relationships among various biological species
  All living organisms today, from the smallest microbes to the largest plants and animals, are connected by the passage of genes along the branches of the phylogenetic tree
3
Phylogenetic Tree of Life
4
Inferring Phylogenies
Traditionally
  Use morphological characters (from both living and fossilized organisms)
1962
  Zuckerkandl & Pauling showed that molecular sequences can be used to infer phylogenies
  Assumes that current sequences descended from a common ancestral gene in a common ancestral species
5
Major Tree Building Algorithms
Distance based
Parsimony
Maximum likelihood
6
Orthologue vs Paralogue
Both are homologous genes (homologues)
Orthologues are a set of genes that diverged from a common ancestor through speciation
  Homologous genes from different species
Paralogues are a set of genes that diverged from a common ancestor through gene duplication
  Homologous genes from the same species
7
A Tree of Orthologues
A tree of orthologues based on a set of alpha hemoglobins
8
A Tree of Paralogues
9
Background on Trees
Nodes and edges
  Internal nodes: unobserved ancestors
Edge length
  On average, corresponds to an evolutionary time period
  Variations
    Different proteins can change at different rates
    The same sequence can evolve much faster in some organisms than in others
Root of a phylogenetic tree
  Ultimate ancestor of all the species in the tree
  Some algorithms provide the location of the root, while others do not
10
Counting and Labeling Trees
Counting
  For a rooted tree with n leaves
    As we move up the tree, the edges coalesce as each new node is reached
    In addition to the n leaves, there are n-1 branch nodes (internal nodes plus the root node), for a total of 2n-1 nodes
    There are 2n-2 edges (not counting the edge above the root node)
  For an unrooted tree with n leaves
    Total number of nodes = 2n-2
    Total number of edges = 2n-3
Labeling (for a rooted tree)
  Label the leaves 1 to n
  Label the branch nodes n+1 to 2n-2
  Label the root 2n-1
11
Rooting an Unrooted Tree
[Figure: trees on leaves 1, 2 and 3 illustrating the ways of rooting an unrooted tree]
12
How Many Possible Topologies?
# of leaves   Ways to add nth leaf   # of edges in the resulting tree   # of unrooted trees
4             3                      5                                  3
5             5                      7                                  3x5
6             7                      9                                  3x5x7
7             9                      11                                 3x5x7x9
...           ...                    ...                                ...
n             2n-5                   2n-3                               3x5x7x9x...x(2n-5) = (2n-5)!!
# of rooted trees: (2n-3)!!
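The double-factorial counts on this slide can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not part of the original slides.

```python
def double_factorial(m: int) -> int:
    """Return m * (m - 2) * (m - 4) * ... down to 1 (1 for m <= 1)."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted binary topologies on n >= 3 labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * n - 5)


def num_rooted_trees(n: int) -> int:
    """Number of rooted binary topologies on n >= 2 labeled leaves: (2n-3)!!."""
    if n < 2:
        raise ValueError("need at least 2 leaves")
    return double_factorial(2 * n - 3)


if __name__ == "__main__":
    for n in range(4, 11):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

The rapid growth of these numbers is why exhaustive search over topologies quickly becomes impractical as leaves are added.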
13
Making a Tree from Pairwise Distances
Distance measure
  First find f, the fraction of sites that differ between two sequences, presupposing an alignment of the two sequences
  The fraction of differences expected by chance (by random substitution) is about 3/4
  Jukes-Cantor distance (odds ratio)
Clustering methods
  UPGMA
  Neighbor-joining
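The slide names the Jukes-Cantor distance without showing its formula. The standard form is d = -(3/4) ln(1 - 4f/3), which grows without bound as f approaches the chance level of 3/4. A minimal sketch follows; the function names are hypothetical, and the code assumes two already-aligned, gap-free sequences of equal length.

```python
import math


def fraction_different(seq_a: str, seq_b: str) -> float:
    """Fraction f of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, non-empty, and equal in length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def jukes_cantor_distance(f: float) -> float:
    """Jukes-Cantor distance d = -(3/4) * ln(1 - 4f/3), defined for 0 <= f < 3/4."""
    if not 0.0 <= f < 0.75:
        raise ValueError("f must satisfy 0 <= f < 3/4")
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


if __name__ == "__main__":
    f = fraction_different("ACGTACGT", "ACGTTCGA")
    print(f, jukes_cantor_distance(f))
```

For small f the distance is close to f itself; the correction only becomes large when many sites have changed.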
14
Unweighted Pair Group Method Using Arithmetic Averages (UPGMA) [Sokal & Michener, 1958]
Overview
1. Cluster the sequences
2. Amalgamate two clusters at each stage, creating a new node on the tree
3. Assemble the tree upwards, each node being added above the others
4. The edge lengths are determined by the difference in the heights of the nodes at the top and bottom of an edge
15
Distance Measure Used in UPGMA
The distance between two clusters $C_i$ and $C_j$ is the average distance between pairs of sequences, one from each cluster:
  $d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\ q \in C_j} d_{pq}$
The distance between two clusters $C_k$ and $C_l$, if $C_k$ is the union of two clusters $C_i$ and $C_j$:
  $d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$
16
Algorithm: UPGMA
Initialization
  Assign each sequence i to its own cluster $C_i$
  Define one leaf of T for each sequence, and place it at height zero
Iteration
  Determine the two clusters i, j for which $d_{ij}$ is minimal (if there are ties, pick one at random)
  Define a new cluster k by $C_k = C_i \cup C_j$, and define $d_{kl}$ for all l using the arithmetic average
  Define a node k with daughter nodes i and j, and place it at height $d_{ij}/2$
  Add k to the current clusters and remove i and j
Termination
  When only two clusters i, j remain, place the root at height $d_{ij}/2$
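The steps above can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the representation (distances in a dict keyed by `frozenset` pairs, internal nodes as `(left, right, height)` tuples) is an assumption made here, and ties are broken by taking the first minimal pair instead of a random one.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster leaves by UPGMA.

    names: list of leaf labels (strings).
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b.
    Returns the root as a nested tuple (left, right, height); leaves stay as labels.
    """
    d = dict(dist)
    # Each active cluster id maps to (subtree, number of leaves in the cluster).
    clusters = {name: (name, 1) for name in names}
    next_id = 0
    while len(clusters) > 1:
        # Pick the pair of clusters with minimal distance (first one wins on ties).
        i, j = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        dij = d[frozenset((i, j))]
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        k = ("node", next_id)  # hypothetical id for the new cluster
        next_id += 1
        # Size-weighted arithmetic average of the distances to every other cluster.
        for l in clusters:
            d[frozenset((k, l))] = (
                d[frozenset((i, l))] * size_i + d[frozenset((j, l))] * size_j
            ) / (size_i + size_j)
        # The new node sits at height d_ij / 2 above the leaves.
        clusters[k] = ((tree_i, tree_j, dij / 2), size_i + size_j)
    (root, _size), = clusters.values()
    return root


if __name__ == "__main__":
    pairs = {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0}
    distances = {frozenset(p): v for p, v in pairs.items()}
    print(upgma(["A", "B", "C"], distances))
```

The length of an edge is the height of its upper node minus the height of its lower node, with leaves at height zero.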
17
An Example
18
An Example (cont'd)
19
Molecular Clock Assumption in UPGMA
UPGMA produces a rooted tree
Edge lengths in the resulting tree can be viewed as times measured by a molecular clock with a constant rate
  The sum of times down a path to the leaves from any node is the same, whatever the path
  The distances $d_{ij}$ are said to be ultrametric if, for any triplet of sequences $x_i$, $x_j$, $x_k$, the distances $d_{ij}$, $d_{jk}$, $d_{ik}$ are either all equal, or two are equal and the remaining one is smaller
  True for a tree with a molecular clock
Implied additivity
  The edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them
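The ultrametric condition is easy to test directly on a distance table. A minimal sketch, assuming distances are stored in a dict keyed by `frozenset` pairs (a representation chosen here for illustration) and using a small tolerance for floating-point comparisons:

```python
from itertools import combinations


def is_ultrametric(names, d, tol=1e-9):
    """Check the three-point condition for every triplet of leaves.

    For each triplet the two largest of the three pairwise distances must be
    equal; the third is then automatically no larger than them.
    """
    for a, b, c in combinations(names, 3):
        _smallest, middle, largest = sorted(
            (d[frozenset((a, b))], d[frozenset((b, c))], d[frozenset((a, c))])
        )
        if largest - middle > tol:
            return False
    return True


if __name__ == "__main__":
    clock = {frozenset(p): v for p, v in
             {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0}.items()}
    print(is_ultrametric(["A", "B", "C"], clock))
```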
20
Molecular Clocks
Mutations may build up in any given stretch of DNA at a reliable rate
If the rate of mutation of a gene is reliable, the gene can be used as a molecular clock
Such a gene can be a powerful tool for estimating the dates of lineage-splitting events
21
Example
The entire length of DNA of a gene changes at a rate of approximately one base per 25 million years
22
What If the Molecular Clock Property Fails?
[Figure: two trees on leaves 1, 2, 3 and 4]
A tree that is reconstructed incorrectly by UPGMA (right)
23
Additivity
Given a tree, its edge lengths are additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them
Built-in assumption in UPGMA
24
Test for Additivity
For every set of four leaves 1, 2, 3 and 4, two of the three sums $d_{12} + d_{34}$, $d_{13} + d_{24}$ and $d_{14} + d_{23}$ must be equal and no smaller than the third
[Figure: tree on leaves 1, 2, 3 and 4]
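The four-point test can be applied mechanically to a distance table. A minimal sketch, assuming distances are stored in a dict keyed by `frozenset` pairs (an illustrative choice) and allowing a small floating-point tolerance:

```python
from itertools import combinations


def is_additive(names, d, tol=1e-9):
    """Check the four-point condition for every quadruple of leaves."""

    def dist(a, b):
        return d[frozenset((a, b))]

    for a, b, c, e in combinations(names, 4):
        sums = sorted((
            dist(a, b) + dist(c, e),
            dist(a, c) + dist(b, e),
            dist(a, e) + dist(b, c),
        ))
        # The two largest sums must coincide; the smallest may be lower.
        if sums[2] - sums[1] > tol:
            return False
    return True


if __name__ == "__main__":
    table = {(1, 2): 0.3, (1, 3): 0.5, (1, 4): 0.6,
             (2, 3): 0.6, (2, 4): 0.5, (3, 4): 0.9}
    d = {frozenset(p): v for p, v in table.items()}
    print(is_additive([1, 2, 3, 4], d))
```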
25
Joining a Pair of Neighboring Leaves
[Figure: node k joins leaf nodes i and j; m is another leaf]
  $d_{im} = d_{ik} + d_{km}$
  $d_{jm} = d_{jk} + d_{km}$
  $d_{ij} = d_{ik} + d_{jk}$
  $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$
Node k joins leaf nodes i and j
26
Closest Pairs of Leaves Are Not Necessarily Neighboring Leaves
[Figure: tree on leaves 1, 2, 3 and 4 with edge lengths 0.1 and 0.4]
d table
       1     2     3
  2   0.3
  3   0.5   0.6
  4   0.6   0.5   0.9
27
Compensation for Long Edges
D table
       1      2      3
  2   -1.1
  3   -1.2   -1.1
  4   -1.1   -1.2   -1.1
$r_1 = 0.7$, $r_2 = 0.7$, $r_3 = 1$, $r_4 = 1$
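The r values and the D table on this slide are consistent with the usual neighbor-joining definitions, which the slide text itself does not spell out: $r_i = \frac{1}{|L|-2} \sum_{k \neq i} d_{ik}$ and $D_{ij} = d_{ij} - (r_i + r_j)$. The sketch below recomputes them under that assumption, using the distances from the preceding d table with $d_{14}$ taken as 0.6, the value implied by $r_1 = 0.7$. Function and variable names are illustrative.

```python
from itertools import combinations


def nj_corrected(names, d):
    """Return (r, D) where r[i] is the average distance from i to the other
    leaves (divided by |L| - 2) and D[{i, j}] = d_ij - (r_i + r_j)."""
    n = len(names)
    if n < 3:
        raise ValueError("need at least 3 leaves")
    r = {
        i: sum(d[frozenset((i, k))] for k in names if k != i) / (n - 2)
        for i in names
    }
    D = {
        frozenset((i, j)): d[frozenset((i, j))] - (r[i] + r[j])
        for i, j in combinations(names, 2)
    }
    return r, D


if __name__ == "__main__":
    table = {(1, 2): 0.3, (1, 3): 0.5, (1, 4): 0.6,
             (2, 3): 0.6, (2, 4): 0.5, (3, 4): 0.9}
    d = {frozenset(p): v for p, v in table.items()}
    r, D = nj_corrected([1, 2, 3, 4], d)
    print({i: round(v, 3) for i, v in r.items()})
    print({tuple(sorted(p)): round(v, 3) for p, v in D.items()})
```

Although leaves 1 and 2 are closest in d, the minimal D entries belong to the pairs (1, 3) and (2, 4), which are the true neighbors.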
28
Algorithm: Neighbor-Joining
Initialization
  Define T to be the set of leaf nodes, one for each given sequence, and put L = T
Iteration
  Pick a pair i, j in L for which $D_{ij}$ is minimal
  Define a new node k and set $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$ for all m in L
  Add k to T with edges of lengths $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$ and $d_{jk} = d_{ij} - d_{ik}$, joining k to i and j, respectively
  Remove i and j from L and add k
Termination
  When L consists of two leaves i and j, add the remaining edge between i and j, with length $d_{ij}$
Produces an unrooted tree
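A compact Python sketch of the iteration above. It assumes the usual definitions $r_i = \frac{1}{|L|-2} \sum_{k \neq i} d_{ik}$ and $D_{ij} = d_{ij} - (r_i + r_j)$, stores distances in a dict keyed by `frozenset` pairs, and returns the tree as a list of `(node, node, length)` edges. All of these representation choices, and the names used, are illustrative.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree by neighbor-joining.

    names: list of leaf labels; dist: dict frozenset({a, b}) -> distance.
    Returns a list of edges (node_a, node_b, length).
    """
    d = dict(dist)
    active = list(names)  # the set L of the algorithm
    edges = []
    next_id = 0
    while len(active) > 2:
        n = len(active)
        r = {
            i: sum(d[frozenset((i, m))] for m in active if m != i) / (n - 2)
            for i in active
        }
        # Pick the pair minimising D_ij = d_ij - (r_i + r_j).
        i, j = min(
            combinations(active, 2),
            key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dij = d[frozenset((i, j))]
        k = ("node", next_id)  # hypothetical id for the new internal node
        next_id += 1
        # Distances from the new node k to every remaining node m.
        for m in active:
            if m != i and m != j:
                d[frozenset((k, m))] = 0.5 * (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - dij
                )
        dik = 0.5 * (dij + r[i] - r[j])
        edges.append((k, i, dik))
        edges.append((k, j, dij - dik))
        active = [m for m in active if m != i and m != j] + [k]
    i, j = active
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


if __name__ == "__main__":
    table = {(1, 2): 0.3, (1, 3): 0.5, (1, 4): 0.6,
             (2, 3): 0.6, (2, 4): 0.5, (3, 4): 0.9}
    d = {frozenset(p): v for p, v in table.items()}
    for edge in neighbor_joining([1, 2, 3, 4], d):
        print(edge)
```

Because ties in $D_{ij}$ are broken by iteration order here, the order in which pairs are joined may vary, but for additive input the resulting unrooted tree and edge lengths are the same.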
29
Rooting Trees
Outgroup
  A species known to be more distantly related to each of the remaining species than they are to each other
Find the root by adding an outgroup
  The point in the tree where the edge to the outgroup joins is expected to be the best candidate for the root
In the absence of a convenient outgroup, methods are quite ad hoc
  E.g. picking the midpoint of the longest chain of consecutive edges, if the deviation from a molecular clock is not too great
30
Assumptions Used by UPGMA and Neighbor-Joining
UPGMA (molecular clock with implied additivity)
  The edge lengths in the resulting tree can be viewed as times measured by a molecular clock with a constant rate
  The divergence of sequences is assumed to occur at the same constant rate at all points in the tree
  The distance from an internal node to a leaf node will always be the same no matter what path is taken
Neighbor-joining
  It is possible for the molecular clock property to fail while additivity still holds
  Assumes additivity only
31
Parsimony
Most widely used tree building algorithm
Works by finding the tree that can explain the observed sequences with a minimum number of substitutions
Two components to the algorithm
1. The computation of a cost for a given tree T
2. A search through all trees to find the overall minimum of this cost
32
Notations Used in Weighted Parsimony
$S_k(a)$ denotes the minimal cost for the assignment of a to node k
$S(a, b)$: cost for each substitution of a by b
33
Algorithm: Weighted Parsimony
Compute the minimum cost at site u [Sankoff & Cedergren 1983]
Initialization
  Set k = 2n-1, the number of the root node
Recursion: compute $S_k(a)$ for all a as follows
  If k is a leaf node: set $S_k(a) = 0$ for $a = x^k_u$, and $S_k(a) = \infty$ otherwise
  If k is not a leaf node: compute $S_i(b)$, $S_j(b)$ for all b at the daughter nodes i, j, and define $S_k(a) = \min_b (S_i(b) + S(a, b)) + \min_b (S_j(b) + S(a, b))$
Termination
  Minimal cost of tree = $\min_a S_{2n-1}(a)$
Weighted parsimony reduces to traditional parsimony if $S(a, a) = 0$ for all a and $S(a, b) = 1$ for all $a \neq b$
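The recursion for $S_k(a)$ translates almost directly into a recursive function. In this sketch a tree for a single site is written as nested pairs whose leaves are single-character strings; that encoding, the function names, and the cost functions in the example are assumptions made for illustration.

```python
import math


def sankoff(tree, alphabet, cost):
    """Return {a: S_k(a)} for the root k of `tree` at one site.

    tree:     a leaf residue (str) or a pair (left_subtree, right_subtree).
    alphabet: iterable of possible residues.
    cost:     function cost(a, b) giving S(a, b).
    """
    if isinstance(tree, str):
        # Leaf: cost 0 for the observed residue, infinity for anything else.
        return {a: (0 if a == tree else math.inf) for a in alphabet}
    left, right = tree
    s_left = sankoff(left, alphabet, cost)
    s_right = sankoff(right, alphabet, cost)
    return {
        a: min(s_left[b] + cost(a, b) for b in alphabet)
        + min(s_right[b] + cost(a, b) for b in alphabet)
        for a in alphabet
    }


def weighted_parsimony_cost(tree, alphabet, cost):
    """Minimal cost of the tree at this site: the minimum over a of S_root(a)."""
    return min(sankoff(tree, alphabet, cost).values())


if __name__ == "__main__":
    unit = lambda a, b: 0 if a == b else 1
    print(weighted_parsimony_cost((("A", "B"), ("A", "B")), "AB", unit))
```

With the unit cost function this gives the traditional parsimony cost; other cost functions weight particular substitutions more heavily.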
34
Algorithm: Traditional Parsimony [Fitch 1971]
Initialization
  Set C = 0 and k = 2n-1
Recursion: to obtain the set $R_k$
  If k is a leaf node: set $R_k = \{x^k_u\}$
  If k is not a leaf node: compute $R_i$, $R_j$ for the daughter nodes i, j of k, and set $R_k = R_i \cap R_j$ if this intersection is not empty, or else set $R_k = R_i \cup R_j$ and increment C
Termination
  Minimal cost of the tree = C
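The set recursion can be written as a short recursive function. As a sketch, a tree for one site is encoded as nested pairs with single-character strings at the leaves; the encoding and the function name are illustrative assumptions, and the counter C is returned alongside the set instead of being kept as a global.

```python
def fitch(tree):
    """Return (R_k, C) for the root k of `tree` at one site.

    tree: a leaf residue (str) or a pair (left_subtree, right_subtree).
    R_k is the set of candidate residues at the node and C is the number of
    substitutions counted in the subtree below it.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    r_left, c_left = fitch(left)
    r_right, c_right = fitch(right)
    common = r_left & r_right
    if common:
        # Non-empty intersection: no extra substitution is needed here.
        return common, c_left + c_right
    # Empty intersection: take the union and count one substitution.
    return r_left | r_right, c_left + c_right + 1


if __name__ == "__main__":
    print(fitch((("A", "B"), ("A", "B"))))
```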
35
Parsimony Example
[Figure: tree with leaves labeled A and B and the set {A, B} at the root]
Minimum cost = 2
Obtained by traditional parsimony
[Figure: assignments of A and B to the internal nodes, with substitutions marked X]
36
Parsimony Example (cont'd)
[Figure: tree with nodes labeled A and B]
Minimum cost tree: not obtained by traditional parsimony
37
Enumeration of Unrooted Trees
Enumerate all unrooted trees by an array $[i_3][i_5][i_7][i_9] \ldots [i_{2n-5}]$
  Take the unrooted tree with 3 sequences x1, x2 and x3 and add an edge for x4 on the edge labeled by $i_3$. Since the new edge divides the preexisting edge in two, the total number of edges is now 3 + 2 = 5
  The value of $i_5$ determines which of these five edges x5 is added to
Think of $[i_3][i_5][i_7][i_9] \ldots [i_{2n-5}]$ as an odometer ...
38
Counting Trees (cont'd)
Counting complete trees
  The rightmost counter advances until it reaches 2n-5
  The next-to-rightmost counter clicks forward by 1 when the rightmost counter goes back to 1
  The third-from-right counter clicks forward by 1 when the next-to-rightmost counter, having reached 2n-7, goes back to 1
  And so on
Counting both complete and incomplete trees
  Add 0 as a possible value of each counter, meaning that there is no edge of the order specified by the counter
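The odometer over complete trees can be sketched with `itertools.product`, which advances the rightmost position fastest, just like the counters described above. The function name is hypothetical, and the sketch only generates counter settings; it does not build the corresponding trees.

```python
from itertools import product


def tree_odometer(n):
    """Iterate over all settings (i_3, i_5, ..., i_{2n-5}) for n >= 4 leaves.

    Counter i_m ranges over 1..m, the m edges available when the next leaf
    is attached. The rightmost counter advances fastest.
    """
    if n < 4:
        raise ValueError("need at least 4 leaves")
    ranges = [range(1, m + 1) for m in range(3, 2 * n - 4, 2)]
    return product(*ranges)


if __name__ == "__main__":
    settings = list(tree_odometer(5))
    print(len(settings))
    print(settings[:6])
```

The number of settings equals 3 x 5 x ... x (2n-5), matching the count of unrooted trees on n leaves.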
39
Selecting Labeled Branching Patterns by Branch and Bound
Start from the odometer setting [1][0][0]...[0]
Let the smallest cost found so far for a complete tree be C
Branch and bound
  Adding more leaves can only increase the cost
  There is no point in branching out if the current cost is larger than the minimum cost so far
Implementation trick
  Whenever the cost of the current subtree T is more than C, we know that T is not part of the optimal tree
  If all the counters to the right of a given non-zero counter are 0, then instead of advancing them all to 1 we can click the rightmost non-zero counter one forward
40
An Example of Branch-and-Bound
[Figure: odometer settings 3...70000, 3...71111 and 3...80000]
Skip 3...70001 to 3...7(2n-11)(2n-9)(2n-7)(2n-5) and go directly to 3...80000 if the cost of 3...70000 is higher than the minimum cost found so far
41
Assessing the Trees: the Bootstrap
Bootstrapping (sampling with replacement)
  Given a dataset consisting of an alignment of sequences, generate an artificial dataset by picking columns from the alignment at random with replacement
  Generate a large number (on the order of thousands) of artificial alignment datasets
  For each artificially generated dataset, build a tree
Assessing phylogenetic features
  Find the frequency with which each phylogenetic feature appears in the thousands of trees generated above
  The higher the frequency, the more confidence we have in a phylogenetic feature
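The resampling step can be sketched as follows. The sketch assumes that each artificial alignment has the same number of columns as the original, and it takes the tree-building step as a caller-supplied function `build_tree_features` that returns hashable features of the inferred tree (for example, clades as frozensets of leaf names). That callback and the other names are hypothetical and stand in for whichever tree-building method is used.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement.

    alignment: list of equal-length aligned sequences (strings).
    rng:       a random.Random instance.
    Returns a new alignment with the same number of rows and columns.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_support(alignment, build_tree_features, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which each tree feature appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        resampled = bootstrap_alignment(alignment, rng)
        # Count each feature at most once per replicate tree.
        counts.update(set(build_tree_features(resampled)))
    return {feature: count / replicates for feature, count in counts.items()}


if __name__ == "__main__":
    aln = ["ACGTAC", "ACGTTC", "TCGAAC"]
    print(bootstrap_alignment(aln, random.Random(42)))
```

Whole columns are resampled together, so the pairing of residues across sequences at each site is preserved.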
42
Describing a Tree in the New Hampshire Standard Format
[Figure: rooted tree with leaves A, B, C, D and E]
Tree file representation of the above rooted tree, starting at the beginning of the file:
(B,(A,C,E),D);
(B:6.0,(A:5.0,C:3.0,E:4.0):5.0,D:11.0);
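To make the bracket notation concrete, here is a small recursive parser for the subset of the format used on this slide: nested parentheses, node names, optional `:length` suffixes, and a terminating semicolon. It is a sketch only; quoted labels, comments, and other features of the full format are not handled, and the `(name, length, children)` tuple layout is an assumption made for illustration.

```python
def parse_newick(text):
    """Parse a New Hampshire (Newick) string into (name, length, children) tuples."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("tree description must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1  # consume ":"
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """List the leaf names of a parsed tree from left to right."""
    name, _length, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


if __name__ == "__main__":
    tree = parse_newick("(B:6.0,(A:5.0,C:3.0,E:4.0):5.0,D:11.0);")
    print(leaf_names(tree))
```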
43
Visualize Trees
Phylip DrawTree
44
Visualize Trees
Cladogram
45
Visualize Trees
Phenogram
46
Visualize Trees
Curve-O-Gram
47
Visualize Trees
Eurogram
48
Programs to Build Phylogenetic Trees
PAUP
  Includes parsimony, maximum likelihood, and distance methods
Phylip
  Includes parsimony, distance matrix, and likelihood methods, as well as bootstrapping and consensus trees
MrBayes
  Bayesian estimation of phylogeny
  Uses a simulation technique called Markov chain Monte Carlo (MCMC) to approximate the posterior probabilities of trees
NoTung
  Incorporates duplication/loss parsimony into phylogenetic tasks
...