Chap. 7. Building Trees.


1 Chap. 7. Building Trees

2 Variation Most organisms can increase exponentially
Most organisms can increase in number exponentially. If all organisms survived and multiplied at the same rate, there would be no change in the frequency of the variants, and thus no evolution. Growth is limited by food, space, predators, etc. When population size is limited, not all variants survive, opening the possibility of natural selection. Chance effects also exist: equal-sized populations of two variants will not stay the same even when their fitness is identical. Through this chance effect, called random drift, one variant can take over the whole population. This implies that evolution can occur even without natural selection, which is referred to as neutral evolution.

3 Mutation Any change in a gene sequence that is passed on to offspring
Caused by damage to the DNA molecule (from radiation, etc.) or by errors in replication.
Point mutation – the simplest form of mutation; occurs all over DNA sequences.
Transition – mutation within the purines (A, G) or within the pyrimidines (C, T/U).
Transversion – mutation between the two nucleotide groups.
Effects depend on where mutations occur:
Non-coding region – no effect on proteins, hence neutral; but may have significant effects if the mutation occurs in a control region.
Coding region – a synonymous substitution does not change the amino acid; a non-synonymous substitution replaces the amino acid with another, or introduces a stop codon.

4 Mutation Indel mutation Gene inversion
Small indels of a single base or a few bases are frequent, caused by slippage during DNA replication.
Slippage is particularly frequent within repeated sequences (GCGC…), where insertion or deletion of an extra GC causes only a slight misalignment.
The CAG repeat region in the huntingtin gene can expand, causing Huntington's disease.
Indels cause a frame shift if their length is not a multiple of three.
Gene inversion: whole genes are copied to offspring in the reverse direction.
Translocation: whole genes can be deleted from one place in the genome and inserted into another.

5 Historical background: insulin
Mature insulin consists of an A chain and B chain heterodimer connected by disulphide bridges The signal peptide and C peptide are cleaved, and their sequences display fewer functional constraints.

6

7 Note the sequence divergence in the
disulfide loop region of the A chain

8 [Figure: rates of nucleotide substitution for different regions, ranging from about 0.1 × 10⁻⁹ to 1 × 10⁻⁹ substitutions/site/year]

9 Historical background: insulin
By the 1950s, it became clear that amino acid substitutions occur nonrandomly. For example, Sanger and colleagues noted that most amino acid changes in the insulin A chain are restricted to a disulfide loop region. Such differences are called "neutral" changes (Kimura, 1968; Jukes and Cantor, 1969). Subsequent studies at the DNA level showed that the rate of nucleotide (and of amino acid) substitution is about six- to ten-fold higher in the C peptide, relative to the A and B chains.

10 Historical background: insulin
Surprisingly, insulin from the guinea pig (and from the related coypu) evolves seven times faster than insulin from other species. Why? The answer is that guinea pig and coypu insulin do not bind two zinc ions, while insulin molecules from most other species do. There was a relaxation of the structural constraints on these molecules, so the genes diverged rapidly.

11 Guinea pig and coypu insulin have undergone an
extremely rapid rate of evolutionary change Arrows indicate positions at which guinea pig insulin (A chain and B chain) differs from both human and mouse

12 Molecular clock hypothesis
In the 1960s, sequence data were accumulated for small, abundant proteins such as globins, cytochromes c, and fibrinopeptides. Some proteins appeared to evolve slowly, while others evolved rapidly. Linus Pauling, Emanuel Margoliash and others proposed the hypothesis of a molecular clock: For every given protein, the rate of molecular evolution is approximately constant in all evolutionary lineages

13 Dickerson (1971)
[Figure: corrected amino acid changes per 100 residues (m) plotted against millions of years since divergence]

14 Molecular clock hypothesis: implications
If protein sequences evolve at constant rates, they can be used to estimate the times at which sequences diverged. This is analogous to dating geological specimens by radioactive decay. For example, from fossil evidence, human and gorilla diverged 11 million years ago (MYA), and human globins have twice as many substitutions as gorilla globins.

15 K: number of substitutions per site
r: rate of substitution per year
r = K / (2T), where T is the time since divergence (the factor of 2 counts both diverging lineages)
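The clock calculation above can be sketched in a few lines of Python; the example values (K = 0.1, T = 50 million years) are hypothetical illustration numbers, not from the slides.

```python
# A minimal sketch of the molecular clock relation r = K / (2T).

def substitution_rate(K, T):
    """Rate of substitution per site per year, given K substitutions per
    site between two lineages that diverged T years ago. The factor of 2
    accounts for substitutions accumulating on both lineages."""
    return K / (2 * T)

def divergence_time(K, r):
    """Estimated divergence time from K and a known rate r."""
    return K / (2 * r)

# Hypothetical example: K = 0.1 substitutions/site, divergence 50 MYA
r = substitution_rate(0.1, 50e6)   # 1e-9 substitutions/site/year
T = divergence_time(0.1, r)        # recovers 50e6 years
```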

16 Molecular phylogeny uses trees to depict evolutionary relationships among organisms. These trees are based upon DNA and protein sequence data.
[Figure: an example tree with leaves A–E and internal nodes F–I, branch lengths in units of one, drawn against a time axis]

17 Globin phylogeny by Dayhoff (1972)

18 Globin phylogeny by Dayhoff in evolutionary time (1972)

19

20 Tree nomenclature
[Figure: the example tree again, with leaves A–E labeled as taxa, internal nodes F–I, and branch lengths in units of one, against a time axis]

21 Tree nomenclature: clades
[Figure: clade ABF (a monophyletic group) highlighted on the example tree]

22 Examples of clades Lindblad-Toh et al., Nature 438: 803 (2005), fig. 10

23 Phylogenetic Methods
A family of related sequences that evolved from a common ancestor is studied with phylogenetic trees showing the order of evolution.
We want a tree representation showing the divergence among species and the evolutionary distances.
Such trees are usually unrooted.
[Figure: the three possible unrooted trees for four taxa A–D]

24 Phylogenetic Trees
A rooted tree provides the direction of evolution and its distances; an unrooted tree is less informative.
Finding a root: use a known species relationship if available. If none is known, use the mid-point method: find the point on the tree from which the mean distance to the leaves is identical on either side – this assumes the same rate of evolution everywhere.


26 Tree Construction Multiple sequences are aligned
Use JC or other models to compute pairwise evolutionary distances.
From the distance matrix, use a clustering method:
Join the closest two clusters to form a larger one.
Recompute distances between all clusters.
Repeat the two steps above until all species are connected.
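The Jukes–Cantor (JC) correction mentioned in the first step can be sketched as follows: it turns the observed fraction of differing sites p into an evolutionary distance that accounts for multiple substitutions at the same site. The example sequences are hypothetical.

```python
import math

# A sketch of the Jukes-Cantor distance between two aligned sequences.

def jc_distance(seq1, seq2):
    """JC-corrected distance: d = -(3/4) ln(1 - (4/3) p),
    where p is the observed fraction of differing sites."""
    assert len(seq1) == len(seq2)
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)

jc_distance("ACGTACGT", "ACGTACGA")   # p = 1/8, corrected d ≈ 0.137 > p
```

The corrected distance is always at least the raw mismatch fraction, because some sites have changed more than once.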

27 Tree-building methods
Tree-building methods are either distance-based or character-based. Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score. Examples of distance-based algorithms are UPGMA and neighbor-joining. Character-based methods include maximum parsimony and maximum likelihood. Parsimony analysis involves the search for the tree with the fewest amino acid (or nucleotide) changes that account for the observed differences between taxa.

28 Example with Globin

29 Distance-based tree Calculate the pairwise alignments; if two sequences are related, put them next to each other on the tree

30 Character-based tree: identify positions that best describe how characters (amino acids) are derived from common ancestors

31 Tree from Distance Matrix
Given a weighted tree, with weights on edges representing evolutionary distances.
Additive distances: d(i,c) + d(c,j) = D(i,j), for an internal node c on the path between leaves i and j.
Find the nearest leaves and join them to the same parent.
It is not easy to find neighboring leaves: the closest pair in the distance matrix need not be neighbors in the tree.

32 Reconstructing the tree: shorten all hanging edges
Reduce the length of every hanging (leaf) edge by the same small amount δ; every leaf-to-leaf entry of the distance matrix then decreases by 2δ.
Find a leaf whose edge reaches weight 0, remove that leaf, and recurse on the smaller tree.

33 Additive matrix

34 Distance-based methods: UPGMA trees
UPGMA is a simple approach for making trees. An UPGMA tree is always rooted. An assumption of the algorithm is that the molecular clock is constant for sequences in the tree. If there are unequal substitution rates, the tree may be wrong. While UPGMA is simple, it is less accurate than the neighbor-joining approach (described next). Page 256

35 UPGMA Method
The distance between two clusters is defined as the mean of the distances between species in the two clusters.
Human cluster vs. chimpanzee/pygmy-chimpanzee cluster: the mean of the human–chimpanzee and human–pygmy-chimpanzee distances.
Produces a rooted tree; the tree distance from chimpanzee and from pygmy chimpanzee to their shared node is half their pairwise distance.
All species end aligned at the right (because the same rate of molecular evolution is assumed in every lineage). Page 256

36

37 UPGMA Example
Pick the smallest distance, merge A and B, then recompute distances to the new cluster AB:
C to AB: (60+50)/2 = 55
D to AB: (100+90)/2 = 95
E to AB: (90+80)/2 = 85

38 UPGMA Example
Updated distance matrix:

      AB   C    D
C     55
D     95   40
E     85   50   30

Pick the smallest distance (D–E = 30), merge D and E, and recompute: AB to DE: (95+85)/2 = 90; C to DE: (40+50)/2 = 45

      AB   C
C     55
DE    90   45
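The recomputation step in the UPGMA example above can be sketched directly: the distance between two clusters is the mean over all cross-cluster species pairs. The pairwise values below are the ones used in the worked example (the A–B distance itself is not needed for these updates).

```python
# A sketch of UPGMA's cluster-distance rule, using the example's numbers.

d = {("A", "C"): 60, ("B", "C"): 50,
     ("A", "D"): 100, ("B", "D"): 90,
     ("A", "E"): 90, ("B", "E"): 80,
     ("C", "D"): 40, ("C", "E"): 50, ("D", "E"): 30}

def dist(x, y):
    """Symmetric lookup of a pairwise species distance."""
    return d.get((x, y), d.get((y, x)))

def cluster_dist(c1, c2):
    """Mean pairwise distance between two clusters (lists of species)."""
    pairs = [(x, y) for x in c1 for y in c2]
    return sum(dist(x, y) for x, y in pairs) / len(pairs)

cluster_dist(["A", "B"], ["C"])        # (60+50)/2 = 55
cluster_dist(["A", "B"], ["D", "E"])   # 90, matching (95+85)/2
cluster_dist(["C"], ["D", "E"])        # (40+50)/2 = 45
```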

39 Neighbor-Joining (NJ) Method
Additive distance in an unrooted tree: the distance between two species is the sum of the branch lengths connecting them.
NJ method: construct an unrooted tree whose branch lengths fit the distance matrix among species as closely as possible.
Algorithm: join two neighbors and replace them by a new internal node; keep repeating this step until all species are covered.

40 Making trees using neighbor-joining
The neighbor-joining method of Saitou and Nei (1987) is especially useful for making a tree having a large number of taxa. Begin by placing all the taxa in a star-like structure. Page 259

41 Tree-building methods: Neighbor joining
Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs (operational taxonomic units) via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.

42 Tree-building methods: Neighbor joining
Define the distance from X to Y by dXY = (d1Y + d2Y − d12) / 2

43

44 NJ Example
Six taxa A–F. For each taxon i, compute its net divergence C(i) = Σj D(i,j) / (n−2), here with n = 6:
C(A) = (5+4+7+6+8)/4 = 7.5
C(B) = 10.5
C(C) = 8
C(D) = 9.5
C(E) = 8.5
C(F) = 11
Then evaluate D(i,j) − C(i) − C(j) for each pair:
D(A,B) − C(A) − C(B) = 5 − 7.5 − 10.5 = −13
D(D,E) − C(D) − C(E) = −13
Pick the smallest (e.g. A,B).

45 NJ Example
Join A and B to a new internal node U1. The branch lengths are:
A to U1: D(A,B)/2 + (C(A) − C(B))/2 = 5/2 + (7.5 − 10.5)/2 = 1
B to U1: D(A,B)/2 + (C(B) − C(A))/2 = 5/2 + (10.5 − 7.5)/2 = 4
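The branch-length step of the join can be sketched as a small function; the inputs below are the values computed in the example (d(A,B) = 5, C(A) = 7.5, C(B) = 10.5).

```python
# A sketch of the NJ branch-length rule when neighbours i and j are
# joined into a new node u:
#   L(i,u) = d(i,j)/2 + (C(i) - C(j))/2
# where C(x) is the net divergence sum_k d(x,k) / (n - 2).

def nj_branch_lengths(d_ij, C_i, C_j):
    """Branch lengths from i and from j to their new parent node."""
    L_i = d_ij / 2 + (C_i - C_j) / 2
    L_j = d_ij / 2 + (C_j - C_i) / 2
    return L_i, L_j

nj_branch_lengths(5, 7.5, 10.5)   # (1.0, 4.0), matching the example
```

Note that the more divergent taxon (B) gets the longer branch, which is how NJ accommodates unequal rates.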

46 NJ Example
[Figure: the reduced distance matrix over U1, C, D, E, F after the join]

47 Making trees using character-based methods
The main idea of character-based methods is to find the tree requiring the fewest changes, i.e., with the shortest total branch length. Thus we seek the most parsimonious ("simplest") tree. Identify informative sites; for example, constant characters are not parsimony-informative. Construct trees, counting the number of changes required to create each tree. For about 12 taxa or fewer, evaluate all possible trees exhaustively; for more than 12 taxa, perform a heuristic search. Select the shortest tree (or trees). Page 260

48 As an example of tree-building using maximum
parsimony, consider these four taxa: AAG AAA GGA AGA How might they have evolved from a common ancestor such as AAA?

49 Tree-building methods: Maximum parsimony
[Figure: three candidate topologies for the taxa AAG, AAA, GGA, AGA, each with ancestral states such as AAA and AGA assigned to internal nodes and the number of changes marked on each branch; the trees require 3, 4, and 4 changes respectively]
In maximum parsimony, choose the tree(s) with the lowest cost (fewest changes, i.e., shortest branch lengths).
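The three costs can be checked with a short sketch of Fitch's algorithm, scoring each quartet topology ((t1,t2),(t3,t4)) site by site.

```python
# A sketch of Fitch's parsimony scoring for a quartet tree.

def step(s1, s2):
    """Fitch set operation: intersection if non-empty (cost 0),
    otherwise union (cost 1)."""
    inter = s1 & s2
    return (inter, 0) if inter else (s1 | s2, 1)

def fitch_cost(t1, t2, t3, t4):
    """Minimal number of changes on the topology ((t1,t2),(t3,t4))."""
    cost = 0
    for a, b, c, d in zip(t1, t2, t3, t4):
        left, cl = step({a}, {b})      # state set above (t1, t2)
        right, cr = step({c}, {d})     # state set above (t3, t4)
        _, croot = step(left, right)   # combine at the root
        cost += cl + cr + croot
    return cost

fitch_cost("AAG", "AAA", "GGA", "AGA")   # 3: the most parsimonious
fitch_cost("AAG", "GGA", "AAA", "AGA")   # 4
fitch_cost("AAG", "AGA", "AAA", "GGA")   # 4
```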

50 Parsimony
Use the simplest possible explanation of the data, i.e., the one with the fewest assumptions.
Binary states: 0 for the ancestral character, 1 for a derived character.
0 may be the ancestral tetrapod forelimb bone structure; 1 may be the bone structure in the bird wing.
C and D possess a derived character not possessed by A and B.
Tree a: the character need only have evolved once, on the + branch.
Tree b: evolved once (+) and then lost (*).
Tree c: evolved independently (+) on two branches.
Parsimony criterion – tree a is the simplest (a single state change).

51

52 Small parsimony Problem
Find the most parsimonious labeling of the internal vertices.
Input: a tree T with each leaf labeled by an m-character string.
Output: a labeling of the internal vertices of T minimizing the parsimony score.
Characters in the string are independent, so the problem can be solved independently for each character; assume each leaf is labeled by a single character.
This is a special case of the weighted problem below, where the length of an edge is the Hamming distance: for a k-letter alphabet, dH(v,w) = 0 if v = w, and 1 otherwise.

53 Weighted Small Parsimony Problem
Find the labeling of internal vertices with minimal weighted parsimony score.
Input: a tree T with each leaf labeled by a letter of a k-letter alphabet, and a k×k scoring matrix δ.
Output: a labeling of the internal vertices of T minimizing the weighted parsimony score.
David Sankoff, dynamic programming, 1975. For an internal vertex v with children u and w:
st(v) = mini { si(u) + δi,t } + minj { sj(w) + δj,t }
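Sankoff's recurrence can be sketched for one character on a binary tree; the four-leaf tree and the unit-cost matrix below are hypothetical illustration choices (any k×k matrix δ works).

```python
# A sketch of Sankoff's dynamic programme for one character.

ALPHABET = "ACGT"

def delta(i, t):
    """Hypothetical unit-cost scoring matrix (Fitch's special case)."""
    return 0 if i == t else 1

def sankoff(node):
    """node is a leaf character (str) or a pair (left, right) of subtrees.
    Returns s: dict state -> minimal subtree cost given that state."""
    if isinstance(node, str):
        return {t: (0 if t == node else float("inf")) for t in ALPHABET}
    left, right = map(sankoff, node)
    return {t: min(left[i] + delta(i, t) for i in ALPHABET)
             + min(right[j] + delta(j, t) for j in ALPHABET)
            for t in ALPHABET}

tree = (("A", "C"), ("C", "G"))        # hypothetical four-leaf tree
score = min(sankoff(tree).values())    # overall parsimony score: 2
```

The overall score is the minimum over root states, and a traceback through the minimizing choices recovers the internal labeling.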

54 Parsimony Example
Five species (alpha, beta, gamma, delta, epsilon) scored for 6 characters, each with a 0 or 1 state. Calculate how many changes of state are needed on a candidate tree.
[Table of the character-state matrix and the example tree]

55 Reconstruction of Character 1
[Figure: reconstruction of character 1 on the example tree; red marks state 1, regular type state 0]

56 Reconstruction of Character 2
[Figure: alternative reconstructions of character 2 on the example tree]

57 Reconstruction of Character 3
[Figure: reconstruction of character 3 on the example tree]

58 Reconstruction of character 4, 5
[Figure: reconstruction of characters 4 and 5 on the example tree]

59 Reconstruction of character 6
[Figure: reconstruction of character 6 on the example tree]

60 Reconstruction with All Changes
Total number of changes, making a random choice where more than one reconstruction ties: 9
[Figure: the example tree with the changing characters marked per branch: 2,6; 5; 4; 2,5; 4; 1,3]

61 Most Parsimonious Tree
[Figure: two candidate trees with changes marked per branch, requiring 9 and 8 changes; the 8-change tree is the more parsimonious]

62 Most Parsimonious Trees
[Figure: three rooted trees with changes marked per branch; they are identical when unrooted]

63 How to determine Branch Lengths
Given an unrooted tree, we want to determine the number of changes on each branch.
The placement of changes can be ambiguous, so use the average over all possible reconstructions of each character.
[Figure: the example tree with averaged branch lengths such as 1.5, 0.5, 1, 2.5]

64 Large Parsimony Problem
Find a tree with n leaves having the minimal parsimony score.
Input: an n×m matrix M describing n species, each represented by an m-character string.
Output: a tree T with n leaves labeled by the n rows of M, and a labeling of the internal vertices, with minimal parsimony score over all possible trees and labelings.
The problem is NP-complete.
Greedy heuristic: start with an arbitrary tree, then move from one tree to another by nearest-neighbor interchange whenever the move lowers the parsimony score.

65 Probabilistic Models: Likelihood Ratios
Example: predict helices and loops in a protein.
Known info: helices have a high content of hydrophobic residues.
ph(a) and pl(a): frequencies of amino acid a occurring in a helix or in a loop.
LH and LL: likelihoods that a sequence of N amino acids is in a helix or a loop:
LH = ∏i=1..N ph(ai), LL = ∏i=1..N pl(ai)
Rather than the likelihoods themselves, their ratio carries the information:
LH/LL: is the sequence more or less likely to be a helical or a loop region?
S = ln(LH/LL) = Σi=1..N ln(ph(ai)/pl(ai)): positive for a helical region.
Partition a sequence into N-amino-acid segments (N = 300) and score each.
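The log-odds score S can be sketched as follows. The two frequency tables are hypothetical illustration values, not real helix/loop statistics.

```python
import math

# A sketch of the log-odds score S = sum_i ln(p_h(a_i) / p_l(a_i)).
# The frequencies below are hypothetical, chosen so that hydrophobic
# residues (L, A) favour helix and glycine (G) favours loop.

p_helix = {"L": 0.15, "A": 0.12, "K": 0.06, "G": 0.05}
p_loop  = {"L": 0.08, "A": 0.08, "K": 0.07, "G": 0.10}

def log_odds(segment):
    """Positive score: segment looks helix-like; negative: loop-like."""
    return sum(math.log(p_helix[a] / p_loop[a]) for a in segment)

log_odds("LLAA")   # > 0: a hydrophobic stretch scores helix-like
log_odds("GGGG")   # < 0: scores loop-like
```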

66 Prior and Posterior Probabilities
The previous example has two hypotheses (helix or loop): the sequence is described by models 0 and 1, defined by ph and pl.
Generalize to k hypotheses: models Mk (k = 0, 1, 2, …).
Given a test dataset D, what is the probability that D is described by each of the models?
Known info: prior probabilities Pprior(Mk) for each model, from other information sources.
Compute the likelihood of D according to each model: L(D|Mk).
Of interest is not the probability of D arising from Mk, but the probability of D being described by Mk:
Ppost(Mk|D) ∝ L(D|Mk) Pprior(Mk): the posterior probability.
Ppost(Mk|D) = L(D|Mk) Pprior(Mk) / Σi L(D|Mi) Pprior(Mi) => Bayesian probability.

67 Bayesian Probability
Basic principles: we make inferences using posterior probabilities. If the posterior probability of one model is higher, it can be chosen as the best model with confidence.
Special case: two models, with priors Pprior0 and Pprior1.
Pposti = Li Ppriori / (L0 Pprior0 + L1 Pprior1)
Log-odds score: S′ = ln(L1 Pprior1 / L0 Pprior0) = ln(L1/L0) + ln(Pprior1/Pprior0) = S + ln(Pprior1/Pprior0)
The difference between S′ and S is simply an additive constant, so the ranking of models is identical whether we use S′ or S.
Warning: if Pprior1 is small, S has to be large to make S′ positive.
When Pprior0 = Pprior1, S′ = S.
Ppost1 = 1/(1 + L0 Pprior0 / (L1 Pprior1)) = 1/(1 + exp(−S′))
S′ = 0 → Ppost1 = 1/2; S′ large and positive → Ppost1 ≈ 1.
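The two-model posterior is a logistic function of S′, which a few lines make concrete:

```python
import math

# A sketch of the two-model posterior:
#   P_post1 = 1 / (1 + exp(-S')), with S' = S + ln(P_prior1 / P_prior0)

def posterior1(S, prior0, prior1):
    """Posterior probability of model 1, given log-likelihood-ratio S
    and the two prior probabilities."""
    S_prime = S + math.log(prior1 / prior0)
    return 1.0 / (1.0 + math.exp(-S_prime))

posterior1(0.0, 0.5, 0.5)   # equal priors, S = 0  ->  exactly 1/2
posterior1(5.0, 0.5, 0.5)   # strong evidence for model 1 -> near 1
```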

68 Maximum Likelihood (ML) Phylogeny
Given a model of sequence evolution and a proposed tree structure, compute the likelihood that the known sequences would have evolved on that tree. ML chooses the tree that maximizes this likelihood. Three sets of parameters:
Tree topology
Branch lengths
Values of the parameters in the rate matrix

69 What is the Likelihood in an ML Tree?
Given a model of sequence evolution at a site, compute conditional likelihoods upward from the leaves:
L(X) = PXA(t1) PXG(t2)
L(Y) = PYG(t4) ΣX PYX(t3) L(X)
L(W) = ΣY ΣZ PWY(t5) L(Y) PWZ(t6) L(Z)
Total likelihood for the site: L = ΣW πW L(W), where πW is the equilibrium probability; this equals the posterior sum over the different ancestral states.
[Figure: tree with root W (branches t5, t6) to internal nodes Y and Z; Y has child X (branch t3) and leaf G (t4); X has leaves A (t1) and G (t2); Z has two leaves T, T]
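The recursion above (Felsenstein's pruning algorithm) can be sketched for one site under the Jukes–Cantor model; the branch lengths below are hypothetical.

```python
import math

# A sketch of Felsenstein's pruning algorithm for one site under JC.

BASES = "ACGT"

def jc_prob(t):
    """JC transition probabilities P_xy(t), as a dict of dicts."""
    same = 0.25 + 0.75 * math.exp(-4 * t / 3)
    diff = 0.25 - 0.25 * math.exp(-4 * t / 3)
    return {x: {y: (same if x == y else diff) for y in BASES} for x in BASES}

def conditional(node):
    """node = observed base (leaf) or list of (child, branch_length).
    Returns L: dict base -> likelihood of the data below, given base."""
    if isinstance(node, str):
        return {x: (1.0 if x == node else 0.0) for x in BASES}
    L = {x: 1.0 for x in BASES}
    for child, t in node:
        P, Lc = jc_prob(t), conditional(child)
        for x in BASES:
            L[x] *= sum(P[x][y] * Lc[y] for y in BASES)
    return L

def site_likelihood(tree):
    """Total likelihood: sum over root states, weighted by pi = 1/4."""
    L = conditional(tree)
    return sum(0.25 * L[x] for x in BASES)

# The slide's topology, with hypothetical branch lengths of 0.1:
# W -> (Y, Z); Y -> (X, 'G'); X -> ('A', 'G'); Z -> ('T', 'T')
tree = [([([("A", 0.1), ("G", 0.1)], 0.1), ("G", 0.1)], 0.1),
        ([("T", 0.1), ("T", 0.1)], 0.1)]
lik = site_likelihood(tree)
```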

70 Computing the Posterior Probability
An ML tree maximizes the likelihood of the data given the tree, L(data|tree). We want the posterior probability P(tree|data).
From Bayes' theorem: P(tree|data) = L(data|tree)·Pprior(tree) / Σ L(data|tree)·Pprior(tree), where the summation is over all possible trees.
That is, the posterior probability ∝ L(data|tree)·Pprior(tree); the problem is the summation over all possible trees.
Moreover, what we really want is, given the data, the posterior probability that a particular clade of interest is present:
Ppost(clade|data) = Σclade L(data|tree)·Pprior(tree) / Σall trees L(data|tree)·Pprior(tree)
In practice, Ppost(clade|data) ≈ (number of sampled trees containing the clade) / (total number of trees in the sample).
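The practical estimate in the last line is a simple count over a tree sample; the tiny sample below is hypothetical.

```python
# A sketch of estimating clade support from a sample of trees (e.g.
# from MCMC). Each sampled tree is represented by its set of clades;
# a clade is a frozenset of taxon names. The sample is hypothetical.

def clade_support(sampled_trees, clade):
    """Fraction of sampled trees containing the clade."""
    hits = sum(clade in tree for tree in sampled_trees)
    return hits / len(sampled_trees)

samples = [{frozenset({"human", "chimp"})},
           {frozenset({"human", "chimp"})},
           {frozenset({"human", "gorilla"})}]
clade_support(samples, frozenset({"human", "chimp"}))   # 2/3
```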

71 Making trees using maximum likelihood
Maximum likelihood is an alternative to maximum parsimony. It is computationally intensive. A likelihood is calculated for the probability of each residue in an alignment, based upon some model of the substitution process. What are the tree topology and branch lengths that have the greatest likelihood of producing the observed data set? ML is implemented in the TREE-PUZZLE program, as well as PAUP and PHYLIP. Page 262

72 Bayesian inference of phylogeny with MrBayes
Calculate: Pr[Tree | Data] = Pr[Data | Tree] × Pr[Tree] / Pr[Data]
Pr[Tree | Data] is the posterior probability distribution of trees. Ideally this involves a summation over all possible trees. In practice, Markov chain Monte Carlo (MCMC) is run to estimate the posterior probability distribution. Notably, Bayesian approaches require you to specify prior assumptions about the model of evolution.

73 Evaluating trees: bootstrapping
Bootstrapping is a commonly used approach to measuring the robustness of a tree topology. Given a branching order, how consistently does an algorithm find that branching order in a randomly permuted version of the original data set? Page 266

74 Evaluating trees: bootstrapping
To bootstrap, make an artificial dataset by randomly sampling columns from your multiple sequence alignment, keeping the dataset the same size as the original. Do 100 (to 1,000) bootstrap replicates. Observe the percentage of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates. Support above 70% is considered significant. Page 266
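One bootstrap replicate, sampling alignment columns with replacement, can be sketched as follows (the toy alignment is hypothetical):

```python
import random

# A sketch of building one bootstrap replicate of an alignment by
# sampling its columns with replacement, keeping the original size.

def bootstrap_alignment(alignment, rng=random):
    """alignment: list of equal-length sequences (rows).
    Returns a new alignment with resampled columns."""
    n_cols = len(alignment[0])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in cols) for row in alignment]

aln = ["ACGTAC", "ACGTAA", "ACCTAC"]     # hypothetical toy alignment
replicate = bootstrap_alignment(aln)
# same shape: 3 rows, 6 columns, each column drawn from the original
```

A tree is then built from each replicate, and clade support is the fraction of replicate trees containing that clade.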

75 In 61% of the bootstrap resamplings, ssrbp and btrbp (pig and cow RBP) formed a distinct clade. In 39% of the cases, another protein joined the clade (e.g. ecrbp), or one of these two sequences joined another clade.

76 Bootstrapping
A method of assessing the reliability of trees; the numbers shown on a rooted tree are called bootstrap percentages.
Distances computed from models are subject to chance fluctuations, and bootstrapping addresses the question of whether these fluctuations influence the tree configuration.
Bootstrapping deliberately constructs sequence data sets that differ by small random fluctuations from the real sequences, and checks whether the same tree topology is obtained.
The randomized sequences are constructed by sampling columns of the alignment.

77 Bootstrapping
Generate 100 or 1,000 randomized data sets and compute what percentage of the resulting trees contain the same group.
A 77% bootstrap value is considered reliable; a value like 24% makes it doubtful that the taxa form a clade.
71% for human/chimpanzee/pygmy chimpanzee falls between two high figures:
Chimpanzee/pygmy chimpanzee always form a clade.
Gorilla/human/chimpanzee/pygmy chimpanzee always form a clade.
(Gorilla,(human,chimpanzees)) appears more frequently than (human,(gorilla,chimpanzees)) or (chimpanzees,(gorilla,human)).
Thus we can conclude that (human, chimpanzees) is the more reliable grouping.
A consensus tree can be constructed: determine the frequency of each possible clade, then build the tree by adding clades from the most frequent downward.

78 ML vs. Parsimony
Parsimony is fast; ML requires each tree topology to be optimized.
ML is model-based; parsimony's implicit model is equal substitution rates.
Parsimony can incorporate models, but it is not clear what the weights should be.
Parsimony tries to minimize the number of substitutions, irrespective of branch lengths; ML allows changes to be more likely on longer branches. On a long branch, there is no reason to minimize the number of substitutions.
Parsimony is strong for evaluating trees based on qualitative characters.

79 Phylogeny programs Joseph Felsenstein
Phylogeny.fr

