# The distance between sequences, Part I. Foundations M. Elizabeth Corey, Ph.D. UCSC Extension and UC Berkeley Extension.


The distance between sequences, Part I. Foundations M. Elizabeth Corey, Ph.D. UCSC Extension and UC Berkeley Extension

A simple start. Suppose we have two sequences, $A = \{a_1, a_2, \dots, a_m\}$ and $B = \{b_1, b_2, \dots, b_n\}$, and we want to know how similar they are. What is the basis for their similarity?

The practical measure. What we usually do is obtain an alignment and then score it using the sum of the pairwise scores over the aligned positions $k$: $S = \sum_k s(a_k, b_k)$.

The nice metric Wouldn’t it be nice if we could simply say that the distance between sequences was the geometric sum of the distances between loci in the sequences?

Dynamic programming methods. A computational method dating from the 1950s, introduced to biology as “Needleman-Wunsch” in 1970. A numerical value is assigned to every cell in the array, giving the similarity or dissimilarity of the residues. In the example shown:

- match = +1
- mismatch = null (value 0)

[Figure: comparison array with the sequence abcnjrqclcrpm across the top and ajcjnrckcrbp down the side; matching cells hold 1.]

Dynamic programming methods. GOAL: for each cell, find the maximum possible score for an alignment ending at that point.

- Search the subrow and subcolumn, as shown, for the highest score.
- Add this to the score for the current cell.
- Proceed row by row through the array.

[Figure: the same comparison array, partially filled with cumulative scores from the bottom row upward.]
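The cell-by-cell scoring can be sketched in a few lines. This is a minimal sketch, not the slide's exact procedure: it uses the more common three-neighbor recurrence rather than the subrow and subcolumn scan, and it assumes gaps cost nothing, since the slides state only match = +1 and mismatch = 0. The function name `alignment_score` is hypothetical.

```python
def alignment_score(a, b, match=1, mismatch=0):
    """Best global alignment score with free gaps (an assumption, see above)."""
    rows, cols = len(a), len(b)
    # best[i][j] holds the best score for aligning a[:i] with b[:j].
    best = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            best[i][j] = max(
                best[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                best[i - 1][j],             # leave a[i-1] unpaired
                best[i][j - 1],             # leave b[j-1] unpaired
            )
    return best[rows][cols]


score = alignment_score("abcnjrqclcrpm", "ajcjnrckcrbp")
```

With these scores the result is simply the largest number of residues that can be matched in order.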

Maximum bipartite matching. A series of solutions, starting with Dijkstra in the 1950s. Find the set of matches that provides maximum flow. Each match, $a_i$ to $b_j$, has a capacity equal to its pairwise score. [Figure: bipartite graph between A and B, with the edge from $a_1$ to $b_1$ labeled $s(a_1, b_1)$.]

Alignment’s not really the problem Optimal alignment falls into a set of problems with a long history in computer science. The underlying metric for distances between sequences falls in the province of biology.

Beguiled by a matrix (PAM)

PAM PAM starts with closely related sequences from 34 superfamilies, grouped into 71 evolutionary trees. PAM rests on a measure of amino acid “mutability”. PAM attempts to capture a representative slice of evolutionary behavior.

PAM (from Dayhoff, Schwartz and Orcutt)

- Obtain alignments for homologous proteins.
- Compute scoring matrix elements using [equation image not preserved], where $a_{ij}$ is the substitution frequency, $m_i$ is the mutability of $i$, and $\lambda$ is a proportionality constant.
- Extrapolate to longer evolutionary distances by using $\{S(\lambda)\}^n$.

Limitations of PAM matrices PAM matrices are built from alignments with > 85% identity. The entries in the initial scoring matrix, S(t=1) arise from short time interval substitutions; raising S(1) to a higher power may not capture some interesting substitutions with longer rate constants.
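Extrapolating to longer distances amounts to repeated matrix multiplication. The sketch below assumes a made-up two-letter mutation probability matrix, `toy_pam1`, standing in for the real 20 x 20 matrix; the helpers `mat_mul` and `mat_pow` are hypothetical names, not part of any PAM software.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy data (an assumption): each letter changes with probability 0.01 per step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
```

Every entry of the high power is built from the one-step entries alone, which is why substitutions that only show up over long intervals cannot appear in the extrapolated matrix.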

The Gutzwiller temptation. An abstract dynamic system $(M, \mu, \varphi_t)$:

- a measurable space, $M$, composed of the set of all sequences;
- a measure $\mu$ based on transition probabilities;
- a group of automorphisms, $\varphi_t$, that map $M$ onto itself and preserve $\mu$, where the variable $t$ runs through the integers.

What’s Bernoulli got to do with it? A scheme with subshift:

- The measure on $M$ is generated by the sets $A_{i,j,k} = \{a \mid a_i = j,\ a_{i+1} = k\}$, whose measure is given by a matrix of transition probabilities $p_{jk} \ge 0$.
- A future event $a_1$ depends on $a_0$; hence, memory.
- Realized in the geodesic flow on a compact closed surface of constant negative curvature.

System behaviors Ergodicity: Transition probabilities are positive recurrent and aperiodic. Mixing: Inheritance and Mendelian exceptions lead to mixing. K-systems: Speciation events rigidly segregate M; other segregations exist.

Our salad days: Jukes-Cantor, HKY, Kimura 2-Parameter, PAM, BLOSUM

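General Stationary Time-reversible Model. With rows and columns ordered A, C, G, T:

$$R = \begin{pmatrix} \cdot & p_C r_{CA} & p_G r_{GA} & p_T r_{TA} \\ p_A r_{AC} & \cdot & p_G r_{GC} & p_T r_{TC} \\ p_A r_{AG} & p_C r_{CG} & \cdot & p_T r_{TG} \\ p_A r_{AT} & p_C r_{CT} & p_G r_{GT} & \cdot \end{pmatrix}$$

Time reversibility: $p_i r_{ij} = p_j r_{ji}$. (Diagonal elements are such that rows sum to zero.)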

General Stationary Time-reversible Model. $P(t) = e^{Rt}$. Given rates, one can find transition probabilities, and vice versa.

Jukes-Cantor

$$R = \begin{pmatrix} -3a & a & a & a \\ a & -3a & a & a \\ a & a & -3a & a \\ a & a & a & -3a \end{pmatrix}$$
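For the Jukes-Cantor rate matrix the exponential $P(t) = e^{Rt}$ has a well-known closed form, which makes a convenient check on a generic series expansion. The sketch below is illustrative only: the function names and the values of `a` and `t` are assumptions, and the truncated Taylor series is adequate only for small $Rt$.

```python
import math


def jc_rate_matrix(a):
    """Jukes-Cantor rates: a off the diagonal, -3a on it."""
    return [[-3 * a if i == j else a for j in range(4)] for i in range(4)]


def jc_transition(a, t):
    """Closed-form Jukes-Cantor transition probabilities after time t."""
    decay = math.exp(-4 * a * t)
    same, diff = 0.25 + 0.75 * decay, 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def expm_series(r, t, terms=30):
    """Truncated Taylor series for exp(R t); fine for small entries of R t."""
    n = len(r)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # After this step, term equals (R t)^k / k!.
        term = [[sum(term[i][m] * r[m][j] * t for m in range(n)) / k
                 for j in range(n)] for i in range(n)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


p_series = expm_series(jc_rate_matrix(0.1), 2.0)
p_closed = jc_transition(0.1, 2.0)
```

The two matrices agree to numerical precision, and each row sums to one, as a transition probability matrix must.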

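Kimura 2-Parameter. With rows and columns ordered A, C, G, T:

$$R = \begin{pmatrix} \cdot & b & a & b \\ b & \cdot & b & a \\ a & b & \cdot & b \\ b & a & b & \cdot \end{pmatrix}$$

$a/b$ = transition/transversion bias.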

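HKY (Hasegawa, Kishino, Yano). With rows and columns ordered A, C, G, T:

$$R = \begin{pmatrix} \cdot & p_C & \kappa p_G & p_T \\ p_A & \cdot & p_G & \kappa p_T \\ \kappa p_A & p_C & \cdot & p_T \\ p_A & \kappa p_C & p_G & \cdot \end{pmatrix}$$

$\kappa$ multiplies the transition entries (A↔G, C↔T), so it is the transition/transversion rate ratio.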

The BLOSUMn matrices

- Start with multiple, ungapped alignments of proteins found using PROTOMAT.
- Build clusters by placing together sequences with N% identity.
- Measure the score for each pair, defined as $s_{ij} = 2 \log_2 (p_{ij} / e_{ij})$, where $e_{ij}$ is the expected probability of occurrence of the $i,j$ pair and $p_{ij}$ is the observed probability of the $i,j$ pair.
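The log-odds score can be sketched from a table of aligned-pair counts. This is a minimal sketch under stated assumptions: the expected frequencies $e_{ij}$ are derived from the marginal residue frequencies of the same counts (the usual BLOSUM construction, which the slide does not spell out), the counts are made up, and the scores are left unrounded. The name `log_odds_scores` is hypothetical.

```python
import math
from collections import Counter


def log_odds_scores(pair_counts):
    """pair_counts maps an unordered residue pair (x, y) to its count."""
    total = sum(pair_counts.values())
    observed = {pair: count / total for pair, count in pair_counts.items()}

    # Marginal frequency of each residue: half of every pair it appears in.
    freq = Counter()
    for (x, y), p in observed.items():
        freq[x] += p / 2
        freq[y] += p / 2

    scores = {}
    for (x, y), p in observed.items():
        # An unordered pair of distinct residues can arise in two orders.
        expected = freq[x] ** 2 if x == y else 2 * freq[x] * freq[y]
        scores[(x, y)] = 2 * math.log2(p / expected)
    return scores


# Toy counts over a two-letter alphabet (an assumption, not real data).
toy_scores = log_odds_scores({("A", "A"): 50, ("A", "B"): 25, ("B", "B"): 25})
```

Pairs seen more often than chance predicts score positive, and pairs seen less often score negative.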

Limitations. Naive approach: measure frequencies of aligned pairs and gaps in randomly selected confirmed alignments to get $p_{ij}$, and use a “random” set of sequences to obtain $e_{ij}$.

- Difficulty 1: it is difficult to get a good random sample of sequences or alignments, because databases are biased.
- Difficulty 2: when sequences diverged from a common ancestor recently, $p_{ij}$ (for $i \ne j$) is small and $s$ is strongly negative. When sequences diverged long ago, $p_{ij}$ tends to $e_{ij}$ and $s$ approaches zero.

A short compendium of distances and scores Jukes-Cantor distance Kimura distance Dayhoff evolutionary distance BLOSUM scores Profile scores Average scores

References

- Gu, X. & Li, W., 1996. A general additive distance with time-reversibility and rate variation among nucleotide sites. Proc. Natl. Acad. Sci. USA 93: 4671-4676.
- Hasegawa, M., Kishino, H., & Yano, T., 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22: 160-174.
- Sanderson, M. J. & Shaffer, H. B., 2002. Troubleshooting molecular phylogenetic analyses. Annu. Rev. Ecol. Syst. 33: 49-72.

The distance between sequences, Part II. Careful Measures M. Elizabeth Corey, Ph.D. UCSC Extension and UC Berkeley Extension

Exceptions to Mendel’s Laws The theory: a chromosomal basis of inheritance Some so-called exceptions: linkage and recombination gene conversion transposition and mobile genetic elements A plethora of other mutations: point mutations, reversions, deletions, frameshifts, duplications, inversions “Exceptions” do not result in rejection of Mendelian genetics but a better understanding of the mechanisms underlying Mendelian inheritance.

Mutation frequencies (# mutations/generation)

- Frequency of point mutation: $10^{-7}$ to $10^{-8}$.
- Reversion of point mutations: ~$10^{-8}$. Sometimes called back mutation, sometimes called convergence.
- Reversion of deletion mutations: undetectably small.
- “Loss of function” mutations result in grossly lower biological fitness. The rate of extinction due to gross “loss of function” is much greater than the rate of reversion, so the line will die long before reversion can occur. In the aggregate, the record will show a pseudo-reversion.

Mutation frequencies

- Deletions: $10^{-6}$, dependent on chromosomal region. Caveat: may be underestimated; deletions are less detectable because they are often lethal.
- Frameshifts: $10^{-6}$, often repaired.
- Duplications: $10^{-3}$. In E. coli, approximately 0.1% of a culture for a given region of the chromosome.
- Inversions: hard to detect, not always mutations.
- Gene conversions: still unknown. Reparative.
- Mutators increase mutation frequencies by a factor of ~100; they work on “hot spots”.

Protein-based inheritance – Prions Proteins that change their shape in response to fluctuating environmental pressures, and then maintain that shape during mitosis and meiosis, constitute a form of cellular memory. Various structural conformations are propagated outside of the traditional genetic framework.

Hsp90 and Sup35. A buffer for silent polymorphisms: Hsp90

- promotes the folding of signal transducers
- buffers the effects of many silent polymorphisms
- may serve as a capacitor of evolutionary change, storing and releasing genetic variation

“Epigenetic inheritance”: the Sup35 prion

James Joyce’s List

- Milk
- Call mom!
- Lettuce
- Plumb the smithy of my soul for the unborn race-consciousness…
- Rent

Thriving in fluctuating environments by exploiting pre-existing genetic variations.

References. Recent Publications on Conformational Change and Evolution

- Queitsch, C., Sangster, T.A. and Lindquist, S., 2002. Hsp90 as a capacitor of phenotypic variation. Nature 417: 618-624.
- Jensen, M.A., True, H.L., Chernoff, Y.O., and Lindquist, S., 2001. Molecular Population Genetics and Evolution of a Prion-like Protein in Saccharomyces cerevisiae. Genetics 159: 527-535.
- True, H.L., and Lindquist, S.L., 2000. A yeast prion provides an exploratory mechanism for genetic variation and phenotypic diversity. Nature 407: 477-483.
- Rutherford, S.L. and Lindquist, S., 1998. Hsp90 as a capacitor for morphological evolution. Nature 396: 336-342.

Mutations and time Take a series of sequences and figure out how different they are by counting up their substitutions. A B C 5 substitutions 3 substitutions 6 substitutions

Mutations and time. What process takes us from A to B to C? [Figure: A to B by a gene conversion; B to C by a (repairable) frameshift and 2 accepted point mutations; no direct ancestry between A and C.]

Counting mutations. Consider a counting process $\{N(t),\ t \in T\}$ where $N(t_j) - N(t_i)$ is the number of mutations in the time interval $(t_i, t_j]$.

- $N(\tau_{AB})$ = 1 GC
- $N(\tau_{BC})$ = 1 FS, 2 PM
- No direct ancestry between A and C, but we can still count substitutions: $N(\tau_{AC})$ = 6 PM

Times on the edges of the tree. With $t_0 = 0$, the “interoccurrence” times between mutations, $\tau_1 = t_1 - t_0,\ \tau_2 = t_2 - t_1,\ \dots,\ \tau_i = t_i - t_{i-1}$, are exponential variables with mean $1/\lambda$, such that $P[\tau_i > h] = e^{-\lambda h}$ and $P[\tau_i \le h] = 1 - e^{-\lambda h}$ for $h \ge 0$.
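The exponential waiting-time model is easy to simulate. A minimal sketch, assuming an illustrative rate of one mutation per 10,000 years; the function names are hypothetical and the random generator is seeded only to make the run repeatable.

```python
import math
import random


def prob_wait_exceeds(rate, h):
    """P[tau > h] for an exponential waiting time with the given rate."""
    return math.exp(-rate * h)


def simulate_mutation_times(rate, horizon, rng):
    """Mutation times in (0, horizon], built from exponential gaps."""
    times = []
    t = rng.expovariate(rate)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)  # next interoccurrence time
    return times


rate = 1 / 10_000  # assumed: one mutation per 10,000 years
times = simulate_mutation_times(rate, 1_000_000, random.Random(0))
```

The number of simulated mutations is the value of the counting process $N(t)$ at the horizon, and its expectation is the rate times the horizon.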

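Edge times

- Gene conversions: $\lambda_{gc}$ = 1 gc / 2,000* years
- Frame shifts: $\lambda_{fs}$ = 1 shift / 5,000 years
- Point mutations: $\lambda_{pm}$ = 1 pm / 10,000 years

*Just a wild guess.

[Figure: edges among A, B, C.]

- A to B: $1/\lambda_{gc}$ = 2,000 yrs
- B to C: $2/\lambda_{pm} + 1/\lambda_{fs}$ = 25,000 yrs
- A to C: $6/\lambda_{pm}$ = 60,000 yrs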

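Edge times. Population of A = $10^5$; population of B = $10^6$; population of C = I don’t care.

- A to B: $1/(N_a \lambda_{gc}) = 2 \times 10^{-2}$ yrs
- B to C: $2/(N_b \lambda_{pm}) + 1/(N_b \lambda_{fs}) = 25 \times 10^{-3}$ yrs
- A to C: $6/(N_a \lambda_{pm}) = 60 \times 10^{-2}$ yrs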
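The edge-time arithmetic is a sum of mean waiting times, optionally divided by a population size. A small sketch using the guessed rates from the slides; `WAIT_YEARS` and `edge_time` are hypothetical names, and dividing by the population follows the slides' approach rather than any established model.

```python
# Mean waiting time in years per event in a single lineage (the slides' guesses).
WAIT_YEARS = {"gc": 2_000, "fs": 5_000, "pm": 10_000}


def edge_time(events, population=1):
    """Expected years for the listed events, scaled down by population size."""
    years = sum(count * WAIT_YEARS[kind] for kind, count in events.items())
    return years / population


a_to_b = edge_time({"gc": 1})            # one gene conversion
b_to_c = edge_time({"pm": 2, "fs": 1})   # two point mutations and a frameshift
a_to_c = edge_time({"pm": 6})            # six point mutations
a_to_b_population = edge_time({"gc": 1}, population=10**5)
b_to_c_population = edge_time({"pm": 2, "fs": 1}, population=10**6)
```

Per lineage this gives 2,000, 25,000, and 60,000 years; scaling by the populations of A and B gives 0.02 and 0.025 years.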

Calculating divergence times. Doolittle, R.F., Feng, D.-F., Tsang, S., Cho, G., and Little, E. “Determining Divergence Times of the Major Kingdoms of Living Organisms With a Protein Clock.” Science 271, pp. 470-477, 1996.

Calculating divergence times Task: Build a model for evolutionary time based on pairwise distances, d ij, and the fossil record –Start with the vertebrate fossil record - the biogeochemistry gives reliable times. –Map the fossil-based phylogeny to the sequence based phylogeny and compare edge lengths. –Adjust the sequence-based time model to match the vertebrate fossil record.

Using the fossil record

Readjusting the clock After sampling the vertebrate fossil-record and fitting the sequence data to the fossil- record, they maintain the same clock. Result: Eukaryotes and Prokaryotes diverged about 2.5 billion years ago.

On fitting the fossil record to sequence data Challenges: unequal rates of change in different species due to: –different reproductive cycles in different species –different base population sizes in different species. Obtaining bacterial mutation rates using vertebrate mutation rates when we are looking at the evolution of populations: how viable is it?

Population mutation. Suppose the average rate of mutation per site is about $10^{-7}$ (ignoring duplications). Compare the lengths of reproductive cycles:

- Prokaryotes (blue-green algae and bacteria): 20 minutes to an hour per generation.
- Humans: in the US, the average time to first child is 24.8 years.

How many times does a bacterium reproduce in the time it takes a human being to reproduce? At one generation per hour over 25 years: 24 × 365 × 25 = 219,000. So if we are comparing bacterial mutation rates to human mutation rates and we are looking at aggregate populations, we have to adjust by a factor of $10^6$.

Population mutation. Size of the base population on planet Earth:

- $5 \times 10^{30}$ prokaryotes (UG, Bill Whitman), including about a mole of bacteria
- $3 \times 10^{9}$ humans

How many bacteria are there, propagating how fast, in comparison to humans? Worst-case ratio? Calculate using base population × rate of generation × number of mutable genes:

$$\frac{10^{23} \times 10^{6} \times 10^{3}}{10^{9} \times 1 \times 10^{4}} = 10^{19}$$
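Both back-of-the-envelope figures can be checked directly. A short sketch with hypothetical names; the inputs are the round numbers quoted on the slides, and the exponents 23 + 6 + 3 over 9 + 0 + 4 give a ratio of $10^{19}$.

```python
HOURS_PER_YEAR = 24 * 365


def bacterial_generations(human_generation_years, minutes_per_division):
    """Bacterial generations that fit into one human generation."""
    minutes = human_generation_years * HOURS_PER_YEAR * 60
    return minutes / minutes_per_division


hourly = bacterial_generations(25, 60)  # one division per hour
fast = bacterial_generations(25, 20)    # one division every 20 minutes

# Worst-case ratio: population * generation rate * mutable genes, kept as
# powers of ten so the arithmetic stays exact.
bacteria_exponent = 23 + 6 + 3
human_exponent = 9 + 0 + 4
ratio_exponent = bacteria_exponent - human_exponent
```

The 20-minute division time gives about 6.6 × 10^5 generations, which is where a rounded adjustment factor of 10^6 comes from.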

One final issue: The Success Question When mutations succeed, they succeed within an ecological niche. So when we ask “When did a species arise?”, it is not enough to ask about the likelihood of a certain kind of mutation, one must also ask: what is the likelihood that that mutation arose in a niche that would support it? So, don’t forget about acceptance rates.

The FOXP2 point mutation Enard et al, “Molecular evolution of FOXP2, a gene involved in speech and language”, Nature, Vol. 418, August 22, 2002

Silent/expressed mutations in FOXP2. Edge labels are: amino acid / DNA substitutions. [Figure: tree over Human, Gorilla, and Orangutan with internal nodes HG and OHG; edge labels 0/7, 0/2, 1/2, 2/2.]

Selective sweeps Measures for determining the existence of a sweep: –Tajima’s D: from Genetics, 1989 (conservative) –Fay and Wu’s H: from “Hitchhiking under positive Darwinian selection”, Genetics, 2000. Also, Griffiths and Tavare estimate selection using linked SNP data

Population mutation rates. $\theta_{ia} = 4 N_a \mu_i$ is the population mutation rate for site $i$ in species $a$, where $N_a$ is the effective population size of species $a$ and $\mu_i$ is the mutation rate per generation at site $i$.

Tajima’s D for FOXP2. $\pi \approx 0.03\%$; $S/a_n = 0.079\%$, where $S$ is the number of segregating sites and $a_n = \sum_{i=1}^{n-1} 1/i$ for a sample of $n$ sequences.
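The two quantities that Tajima's D compares can be computed from any ungapped alignment. A minimal sketch on a made-up four-sequence alignment; the function names are hypothetical, both estimates are reported per site, and the variance term that normalizes D is deliberately left out.

```python
from itertools import combinations


def harmonic_a(n):
    """a_n = sum of 1/i for i = 1 .. n-1, for a sample of n sequences."""
    return sum(1 / i for i in range(1, n))


def segregating_sites(seqs):
    """Number of alignment columns with more than one distinct character."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def nucleotide_diversity(seqs):
    """pi: mean pairwise differences, per site."""
    pairs = list(combinations(seqs, 2))
    differences = sum(sum(x != y for x, y in zip(s, t)) for s, t in pairs)
    return differences / len(pairs) / len(seqs[0])


def watterson_theta(seqs):
    """S / a_n, per site."""
    return segregating_sites(seqs) / harmonic_a(len(seqs)) / len(seqs[0])


toy = ["AAAA", "AAAT", "AATT", "AAAA"]  # toy data, not FOXP2
pi = nucleotide_diversity(toy)
theta_w = watterson_theta(toy)
```

A value of $\pi$ well below $S/a_n$, as reported for FOXP2, makes D negative, which is the signature expected after a selective sweep.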

Discovering different rate constants

Finding the time of appearance of the FOXP2 segregation Sample current human population worldwide. Generate trees with different times for the human sequence data. Measure the likelihood of the different trees.

Multiple rates. The automorphism $\varphi$ mapping $M$ onto itself used to be a simple shift operation. Now it incorporates several underlying processes, including:

- mutation of the bases (mutation rate)
- expression of the mutations (expression rate)
- stabilization of a conformational phenotype (stabilization rate)
- success of the substitution (acceptance rate)

The distance between sequences, Part III. Algorithms for phylogenies M. Elizabeth Corey, Ph.D. UCSC Extension and UC Berkeley Extension

Motivation

- Phylogenies provide measures of similarity and can lay a foundation for scoring alignments.
- Rate structures provide indicators for motifs.
- Branch points allow us to identify and classify interesting bases.
  - If the branch points are in phenotypic trees, the mutating bases can be used as phenotypic identifiers.
  - If the branch points are in genotypic trees, mutating (nonsilent) bases can be used as genetic identifiers.

What goes into a phylogeny? Distance measures (UPGMA, NN) Site info (MLE and Parsimony) Substitution scores Equilibrium distributions for MLE Pairwise Alignment Multiple Alignment Phylogenies Transitional probability data

What do we get in return? Guide trees; rates and probabilities; scoring matrices. [Diagram: pairwise alignment, multiple alignment, phylogenies, transitional probability data.]

Part III: Goals Depict methods for finding guide trees for progressive multiple alignment. Clarify the differences between MLE, Maximum Parsimony and Distance Methods and identify the optimization techniques appropriate for each. Define a new approach for faster identification of near-optimal phylogenies.

Progressive multiple alignment

- Choose a set of scores for sequence comparison:
  - Alignment scores from Needleman-Wunsch, Smith-Waterman and variants.
  - Consensus word score from BLAST, PSI-BLAST and others.
  - Substitution (scoring) matrices: PAM, BLOSUM, Jukes-Cantor, etc.
- Construct a reputable guide tree:
  - Hierarchical clustering (UPGMA, Neighbor-Joining, Fitch and Margoliash).
  - Maximum Parsimony (simple or weighted).
  - Maximum Likelihood Estimation (MLE).
- Use the guide tree to produce an alignment.

Tree evaluation - Parsimony Given a semi-labeled tree, it is possible to determine the tree’s internal nodes (ancestral sequences) using a parsimony algorithm. Evaluation function: A summation of the scored mutations in the parsimonious tree.

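Parsimony, illustrated. [Figure: a four-leaf tree.]

- Leaves ABC and ADC join at node 1, A(B or D)C; cost is 1.
- Leaves ABE and ACC join at node 2, A(B or C)(E or C); cost is 2.
- Nodes 1 and 2 join at node 3, ABC; cost is 3.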

Example: Simple Parsimony

- Initialization: set the cost $C = 0$. Set $k = 2n-1$, where $n$ is the number of sequences.
- Recursion to compute node $N_k$:
  - If $k$ is a leaf node, $N_k$ = sequence $k$.
  - If $k$ is not a leaf node, compute $N_i$ and $N_j$ for the daughter nodes of $N_k$. At each position, set $N_k = N_i \cap N_j$ where the intersection of $N_i$ and $N_j$ is nonempty; otherwise increment the cost by the number of nonmatching residues and set $N_k = N_i \cup N_j$.
- Termination: minimum cost of tree = $C$.
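The recursion can be sketched as a short recursive function that returns, for every position, the set of candidate residues together with the accumulated cost. This is an illustrative sketch with unit cost per mismatch; the tree encoding (a leaf is a string, an internal node is a pair of subtrees) and the name `simple_parsimony` are assumptions.

```python
def simple_parsimony(tree):
    """Return (candidate residue sets per position, cost) for a binary tree."""
    if isinstance(tree, str):
        # A leaf: each position has exactly one candidate and no cost.
        return [{residue} for residue in tree], 0

    left_sets, left_cost = simple_parsimony(tree[0])
    right_sets, right_cost = simple_parsimony(tree[1])
    cost = left_cost + right_cost
    node_sets = []
    for left, right in zip(left_sets, right_sets):
        common = left & right
        if common:
            node_sets.append(common)        # the daughters agree on something
        else:
            node_sets.append(left | right)  # they disagree: one mutation
            cost += 1
    return node_sets, cost


# The four leaves from the illustration above.
root_sets, total_cost = simple_parsimony((("ABC", "ADC"), ("ABE", "ACC")))
```

On the four leaves of the illustration this reproduces the root sequence ABC at a total cost of 3.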

Tree evaluation – Distance methods Given a set of alignment scores, but without assuming a tree topology, it is possible to determine a tree and its edge lengths using a distance method. This is sometimes called minimum evolution and includes the hierarchical clustering methods. Evaluation function: The sum of the edge lengths.

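Hierarchical Clustering, illustrated: UPGMA. [Figure: five points are clustered step by step. Points 1 and 2 join at node 6, with $t_1 = t_2 = \tfrac{1}{2} d_{12}$; points 4 and 5 join at node 7; node 7 and point 3 join at node 8; nodes 6 and 8 join at the root, node 9, at height $\tfrac{1}{2} d_{68}$. From Durbin et al, 2001.]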

Algorithm: UPGMA

- Input: $N$ sequences and their relative distances, $d_{ij}$.
- Initialization: assign each sequence to its own cluster, $C_i$. Define a leaf of $T$ for each sequence and place it at height 0.
- Iteration:
  - Pick two clusters $C_i$, $C_j$ such that $d_{ij}$ is minimal.
  - Define a new cluster $k$ by $C_k = C_i \cup C_j$.
  - Define a new set of distances $\{d_{kl}\}$ between $C_k$ and all current clusters.
  - Define a node $k$ with daughter nodes $i$ and $j$, and place it at height $h_k = \tfrac{1}{2} d_{ij}$.
  - Add $k$ to the set of current clusters and remove $i$ and $j$.
- Termination (rooted): when only two clusters $i$, $j$ remain, add the root at height $\tfrac{1}{2} d_{ij}$.
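The UPGMA steps translate almost line for line into code. A minimal sketch, assuming distances are supplied in a dictionary keyed by `frozenset` pairs and that a cluster is represented by the nested tuple of its daughters; these representations and the name `upgma` are illustrative choices.

```python
from itertools import combinations


def upgma(names, dist):
    """Return (tree as nested tuples, {node: height})."""
    d = dict(dist)
    size = {name: 1 for name in names}
    height = {name: 0.0 for name in names}
    active = list(names)
    while len(active) > 1:
        # Pick the closest pair of current clusters.
        i, j = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        k = (i, j)
        height[k] = d[frozenset((i, j))] / 2
        size[k] = size[i] + size[j]
        active = [c for c in active if c not in (i, j)]
        for m in active:
            # Size-weighted mean equals the average over all leaf pairs.
            d[frozenset((k, m))] = (
                size[i] * d[frozenset((i, m))] + size[j] * d[frozenset((j, m))]
            ) / size[k]
        active.append(k)
    return active[0], height


# Toy ultrametric distances (an assumption, not data from the slides).
toy = {frozenset(p): v for p, v in {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 6,
    ("B", "C"): 6, ("B", "D"): 6, ("C", "D"): 4}.items()}
tree, heights = upgma(["A", "B", "C", "D"], toy)
```

On these toy distances A and B merge at height 1, C and D at height 2, and the root sits at height 3.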

Tree evaluation - MLE Given a tree topology and sequences preassigned to each leaf, it is possible to determine a tree’s edge lengths using maximum likelihood estimation. Evaluation function: the likelihood of the tree.

Estimating Likelihood Estimate branch lengths by viewing evolution as a random process Requires a probability model of evolution as a function of time. –For DNA one can use Jukes-Cantor model (all nucleotides have same substitution rates), or Kimura model (different rates for transitions and transversions). –For proteins one can use Dayhoff, but in the probability form not the log-odds form.

Estimating Likelihood. [Figure: a four-leaf tree with leaves $s_1, \dots, s_4$, internal nodes $s_5$ and $s_6$, and root $s_0$.]

- $s_1$, etc. are the bases or residues observed in the extant and ancestral taxa.
- $v = \mu t$, where $\mu$ is the substitution rate and $t$ is absolute time.
- $P_{i,j}(v)$ is the probability that the residue at node $s_i$ becomes the residue at node $s_j$ in time $v$.
- $\pi_0$ is the prior probability of the bases or residues at any position.

The likelihood for this tree is:

$$L = \pi_0\, P_{0,5}(v_5)\, P_{5,1}(v_1)\, P_{5,2}(v_2)\, P_{0,6}(v_6)\, P_{6,3}(v_3)\, P_{6,4}(v_4)$$

Example: Likelihood. For each mutating site $u$ in a set of sequences:

- Initialization: set $k = 2n-1$, where $n$ is the number of sequences.
- Recursion: compute $P(L_k \mid a)$ for each symbol $a$ in the alphabet as follows.
  - If $k$ is a leaf node: if $x_{k,u} = a$, then $P(L_k \mid a) = 1$; else $P(L_k \mid a) = 0$.
  - If $k$ is not a leaf node: compute $P(L_i \mid a)$ and $P(L_j \mid a)$ for all $a$ at daughters $i, j$, and set $P(L_k \mid a) = \sum_{b,c} P(b \mid a, t_i)\, P(L_i \mid b)\, P(c \mid a, t_j)\, P(L_j \mid c)$.
- Termination: likelihood for site $u$ = $\sum_a \pi_a\, P(L_{2n-1} \mid a)$, where $\pi_a$ is the equilibrium value of the probability distribution for $a$.
- Concluding step: combine the likelihoods for each site.
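The recursion can be written compactly for DNA. This sketch assumes the Jukes-Cantor model with a unit rate and uniform equilibrium frequencies; the tree encoding (a leaf is a sequence string, an internal node is a pair of `(subtree, edge_time)` entries) and all function names are illustrative assumptions.

```python
import math

ALPHABET = "ACGT"


def jc_prob(b, a, t, rate=1.0):
    """P(b | a, t) under Jukes-Cantor."""
    decay = math.exp(-4 * rate * t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def conditional(node, site):
    """Return {a: P(L_k | a)} for the subtree rooted at node."""
    if isinstance(node, str):
        return {a: 1.0 if node[site] == a else 0.0 for a in ALPHABET}
    (left, t_i), (right, t_j) = node
    p_i, p_j = conditional(left, site), conditional(right, site)
    # The double sum over b and c factorizes into a product of two sums.
    return {a: sum(jc_prob(b, a, t_i) * p_i[b] for b in ALPHABET)
               * sum(jc_prob(c, a, t_j) * p_j[c] for c in ALPHABET)
            for a in ALPHABET}


def site_likelihood(root, site):
    """Sum over root states, weighted by the equilibrium frequency 1/4."""
    return sum(0.25 * p for p in conditional(root, site).values())


def log_likelihood(root, n_sites):
    """Combine sites by adding log-likelihoods (sites assumed independent)."""
    return sum(math.log(site_likelihood(root, u)) for u in range(n_sites))


two_leaf = (("A", 0.1), ("A", 0.1))
value = site_likelihood(two_leaf, 0)
```

For two identical leaves the result equals one quarter of the probability that A stays A over the combined time 0.2, which is a quick sanity check on the recursion.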

Maximizing Likelihood Estimation over edge times. Likelihood estimation includes a step for computing the likelihood of some character “a” at node k given the subtree of k. While we know that there is the possibility of substitutions leading to a, these depend on how long a time we have to make those substitutions, and we do not know the edge times of the tree. We must explore a series of possible times in order to maximize the likelihood. A method that maximizes likelihoods over edge times is what is usually referred to as MLE. Standard MLE procedures do not maximize likelihoods over all topologies of the tree.

Comparisons between MLE, Parsimony and Distance Methods

| Algorithm | Requires semi-labeled tree | Requires scored alignments | Order | Results: edge weights | Results: internal tree nodes | Resulting tree is ultrametric |
| --- | --- | --- | --- | --- | --- | --- |
| MLE | Yes | No | $La^{2n-1}$, $2an^2$ | Transitional probabilities | Subtree probability | Yes |
| Parsimony | Yes | No | $2an^2$ | Mutation counts | Ancestral sequences | No |
| Distance methods | No | Yes | $2n^2$ | Distance measures, e.g. alignment scores | UPGMA: a cluster of sequences | UPGMA: yes; neighbor-joining: no |

Exploring different topologies Successive addition and rearrangement –Very common method (see Phylip programs including: PROTPARS, DNAPARS, DNACOMP, DNAML, DNAMLK, RESTML, KITSCH, FITCH, CONTML, MIX and DOLLOP) –Sequences are taken in the order that they appear in the input file and successively added to a tree. MCMC

Successive addition Initialization: –Place the set of sequences into L. – Create a tree,T, with one node – the root. Iteration: for each sequence in L –Remove a sequence from L and add it as a leaf to T. –Apply a process of local rearrangement (in Felsenstein’s package, there are (n-1)(2n-3) arrangements.) –Score each locally arranged tree. –set T to equal the best scoring tree. Termination: Globally rearrange the tree by swapping subtrees, score each globally rearranged tree and accept the tree with the best score.

Markov Chain Monte Carlo A Bayesian method for phylogenetic inference –Moderately new method rooted in molecular dynamics. –Topologies are randomly generated and scored so that a representative set of most likely tree topologies can be identified. Mau, Newton and Larget (1998) apply MCMC to sample trees using Bayes theorem. The following explanation is based on their methodology - the mistakes are mine, the facts and foundations, theirs.

Introduction to the method. $\Omega$ is the set of all semi-labeled trees.

Introduction to the method. Sampling the set of trees. [Figure: proposals $Q_1$, $Q_2$, $Q_3$ sampling the tree space.]

Introduction to the method. A chain of accepted samples. [Figure: successive trees on leaves a, b, c, d, including the leaf orders abcd and acbd, linked by proposals $Q_{01}$, $Q_{12}$, $Q_{23}$, …]

Introduction to the method. The partitioned space with representatives $\{\tau_1, \tau_3\}$.

MCMC propaganda. MCMC methods:

- allow exact inference, provided certain convergence criteria are demonstrated;
- are efficient and can handle many more taxa or sequences;
- measure uncertainty during tree construction (no bootstrapping needed).

Summary of the Algorithm

1. Choose a starting tree.
2. Perturb the current tree’s topology and branch lengths to find a new tree.
3. Measure the likelihood for the new tree.
4. Compare the new tree to the last tree and decide whether or not to accept it into the chain.
5. If you’ve got a sufficiently long chain, check the characteristics of your sample to see if there is convergence to a set of representative topologies. If so, stop. Otherwise, go to 2.

Subproblems to be discussed

1. How do we represent the tree so that it is easy to operate on? Cophenetic matrices.
2. What is our perturbation operator?
3. How do we build our sampling chain?
4. When are we done sampling?

The Cophenetic Matrix. Some notation:

- $\tau$: a topology
- $n$: a node
- $a(n)$: the ancestor of a node
- $L$: a leaf node (the leaves are the current record)
- $I$: an internal node (the historical record)

Cophenetic Trees. Labeled history $(t_1, t_2)$ provides an order on coalescent levels. [Figure: tree with leaves $L_1$, $L_2$, $L_3$ and internal nodes $I_0$, $I_1$, $I_2$ on levels 0 to 2, with $t_1$ and $t_2$ marking the intervals between levels.]

Example: A Cophenetic Tree These trees are described in terms of nodes coalescing or merging backwards in time. t1= 0.8 t2=0.3 t3=0.7 t4=0.5 t5=0.9 t6=1.5 total: 4.7

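Example: Cophenetic Matrix. The cophenetic matrix for the previous tree (entries as shown on the slide):

| Leaf | 5 | 7 | 4 | 1 | 2 | 6 | 3 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 0 | 9.4 | | | | | |
| 7 | | 0 | 1.6 | 4.6 | 6.4 | | |
| 4 | | | 0 | 4.6 | 6.4 | | |
| 1 | | | | 0 | | | |
| 2 | | | | | 0 | 3.6 | |
| 6 | | | | | | 0 | 2.2 |
| 3 | | | | | | | 0 |

The tree representation $(\sigma, a)$ is {(5,7,4,1,2,6,3), (4.7, 0.8, 2.3, 3.2, 1.8, 1.1)}.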

The Cophenetic Matrix. Theorem: for any weighted binary tree with labeled leaf nodes, the tree topology and branch lengths can be uniquely determined using the within-tree distances between all pairs of leaf nodes (Lapointe and Legendre, 1992). Note that each permutation of the leaf labels generates a different n × n symmetric matrix of distances.
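For a clock-like tree drawn with its leaves in order, the whole matrix follows from the leaf order and the superdiagonal alone: the distance between two leaves is the largest superdiagonal entry lying between them. The sketch below assumes that property and rebuilds the example matrix from the representation {(5,7,4,1,2,6,3), (4.7, 0.8, 2.3, 3.2, 1.8, 1.1)}, reading the second tuple as node heights, so that each superdiagonal entry is twice a height. The name `cophenetic_matrix` is hypothetical.

```python
def cophenetic_matrix(order, superdiagonal):
    """Rebuild all pairwise distances from leaf order and the superdiagonal."""
    n = len(order)
    matrix = {}
    for i in range(n):
        for j in range(i, n):
            # Largest adjacent-leaf distance between positions i and j;
            # an empty slice (i == j) gives the zero diagonal.
            value = max(superdiagonal[i:j], default=0.0)
            matrix[(order[i], order[j])] = value
            matrix[(order[j], order[i])] = value
    return matrix


order = (5, 7, 4, 1, 2, 6, 3)
heights = (4.7, 0.8, 2.3, 3.2, 1.8, 1.1)
matrix = cophenetic_matrix(order, [2 * h for h in heights])
```

The rebuilt entries agree with the ones visible in the example, such as 4.6 between leaves 7 and 1 and 6.4 between leaves 4 and 2.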

What is the perturbation operator? $Q$ is the proposal function and it has two stages:

- $Q_1$ randomly selects a new leaf order.
- $Q_2$ perturbs the values of the matrix superdiagonal.

The proposal mechanism is symmetrical: $Q(\tau_n, \tau_{n+1}) = Q(\tau_{n+1}, \tau_n)$.

Details on Q1 and Q2

- $Q_1$ samples one of the $2^{n-1}$ leaf orderings of the current tree model.
- $Q_2$ simultaneously and independently modifies the elements of the superdiagonal, drawing each new value from a uniform distribution on $(a_i \pm d)$, where $d$ is constant.

By applying both types of perturbation, $Q_1$ and $Q_2$, all the permutations of trees can be reached.

Illustration of Q2

Subproblems to be discussed

1. How do we represent the tree so that it is easy to operate on? Use cophenetic matrices.
2. What is our perturbation operator? Q.
3. How do we build our sampling chain? Apply Metropolis-Hastings.
4. When are we done?

Acceptance with Metropolis-Hastings. Given a tree $\tau$, Metropolis-Hastings:

1. Applies $Q$ to build a new tree, $\tau^*$.
2. Always accepts the new tree when it is more likely than the old one, and sometimes accepts it when it is less likely than the old one.

Acceptance with Metropolis-Hastings: the algorithm. If $P(\tau^*) > P(\tau)$, accept $\tau^*$ into the chain; else accept $\tau^*$ into the chain with probability $P(\tau^*) / P(\tau)$.
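The acceptance rule is independent of what is being sampled, so it can be sketched on a toy state space. In the sketch below the "trees" are just the integers 0 to 3 with made-up weights, the proposal is a symmetric step to a neighbor, and all names are hypothetical; a real sampler would substitute the tree proposal $Q$ and the tree likelihood.

```python
import math
import random


def metropolis_hastings(start, propose, log_prob, steps, rng):
    """Build a chain using a symmetric proposal and the rule above."""
    chain = [start]
    current, current_lp = start, log_prob(start)
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_lp = log_prob(candidate)
        # Accept if more probable; otherwise with probability P(new) / P(old).
        if (candidate_lp >= current_lp
                or rng.random() < math.exp(candidate_lp - current_lp)):
            current, current_lp = candidate, candidate_lp
        # On rejection the current state is repeated in the chain.
        chain.append(current)
    return chain


weights = [1.0, 2.0, 3.0, 4.0]  # toy unnormalized target (an assumption)


def propose(state, rng):
    """Symmetric proposal: step to a neighboring state on a ring."""
    return (state + rng.choice((-1, 1))) % len(weights)


def log_prob(state):
    return math.log(weights[state])


chain = metropolis_hastings(0, propose, log_prob, 40_000, random.Random(1))
freq_of_3 = chain.count(3) / len(chain)
```

Over a long run the visit frequencies approach the normalized weights, which is the sense in which the most frequent topologies in a tree chain are the most probable ones.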

Acceptance with Metropolis-Hastings. The final step in evaluating the acceptance test is evaluating $P(\tau^*) / P(\tau)$. This is easy: $P(\tau)$ is approximated using the likelihood estimate (LE) of $\tau$.

Size of chain and convergence

- How many trees do you have to propose before you begin to get a good enough sample? Mau et al, 1998, sample over about 2500 trees for Clarkia, a phylogeny with 9 leaves.
- How do you test that you are done? At the end of the run, we say that we have convergence if there is a small set of topologies with high relative frequency in the chain.
- What’s the result? The topologies with the highest frequencies are the reported reconstructions.

Mixing. To obtain a confidence measure, the algorithm must be run more than once: each run generates a chain of accepted trees. Chains “mix” well when they come up with the same representative topologies, starting from different tree topologies. If running a sufficient number of independent chains is computationally prohibitive, Suchard et al, 2002, provide a “poor man's estimate of the uncertainty”.

Example with binary data (from Mau, et al, 1998)

- 9 species of genus Clarkia (California plants)
- 120 restriction sites
- Data translated into a 9 x 120 matrix of zeroes and ones, representing the absence or presence of a restriction site in the genome of each species.

Running the MCMC algorithm

- Random starting trees.
- Chains of length 250,000 were subsampled at a rate of 1/100, giving 2500 trees.
- Each run took 20 minutes on a Sparc 10.
- Convergence was inferred by reproducibility across runs with very different starting trees.

The most common topologies for Clarkia A = 1,2; B = 3,4; C = 5,6; D=8,9

References Smouse and Li (1989) introduced the Bayesian paradigm, but not the notation, to the phylogeny reconstruction problem. Goldman (1993) used non-Bayesian Monte Carlo tests of significance to assess the adequacy of evolutionary models. Griffiths and Tavare (1994) constructed Markov chains to compute likelihoods for ancestral inference. Mau, Newton and Larget (1998) apply MCMC to sample trees using Bayes theorem.

Drill-down: Rates. The way I use it, and I admit this is quirky, motif means the genetic profile for a functional structure. Using the following definitions:

- Let $r_G$ be the rate of mutation for a gene.
- Let $r_E$ be the rate of expressed mutation for the protein G encodes.
- Let $r_S$ be the rate of structural mutation for the protein G encodes.
- Let $r_F$ be the rate of functional mutation for the protein G encodes.

$r_G > r_E > r_S > r_F$. Note that the rate of neutral mutations is $r_N = r_G - r_F$. The “true” rate of mutation for a motif is $r_F$; the observed rate of mutation for members of a motif in a genotypic tree is $r_G$. If we want motif branchings, we eliminate all branchings in the phylogeny occurring with rates $r_N$.

Drill-down: Semi-labeled trees. Trees with a defined branching pattern and defined leaf labels but WITHOUT edge lengths or internal node labels. In our terms, phylogenies with known branching patterns but without information about ancestors or mutation times. [Figure: tree with leaves labeled nccbac, nacbac, ncbbbc, nccnaa.]

Drill-down: Progressive Alignments As you move up the tree, add to sum of characters in growing alignment

Progressive Alignments. The sum of characters in a growing alignment can be represented in a table of values called a frequency matrix or a profile.

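Progressive Alignments. Alignments are frozen once they are made. Scores are then calculated between aligned positions tabulated in a frequency matrix, using a scoring table, for example $S_{ij} = 2 \times \text{G:G} + 1 \times \text{A:G}$. [Table: upper-triangular scoring table over the residues A, G, S, T; rows as extracted: A 41310, G 326, S 214, T 8.]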

Algorithm: Neighbor-joining

- Input: $N$ sequences and their relative distances, $d_{ij}$.
- Initialization: define a leaf of $T$ for each sequence.
- Iteration:
  - Pick two nodes $i$, $j$ such that $d_{ij} - (r_i + r_j)$ is minimal.
  - Define a node $k$ with daughter nodes $i$ and $j$, joined to them by edges of length $e_{ik} = \tfrac{1}{2}(d_{ij} + r_i - r_j)$ and $e_{jk} = d_{ij} - e_{ik}$.
  - Define a new set of distances $\{d_{kl}\}$ between $k$ and all current nodes.
  - Add $k$ to the set of current nodes and remove $i$ and $j$.
- Termination (unrooted): when only two nodes $i$, $j$ remain, add an edge of length $d_{ij}/2$.
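A compact sketch of neighbor-joining follows. It assumes the usual definition $r_i = \frac{1}{n-2} \sum_m d_{im}$ over the $n$ current nodes and the update $d_{km} = \tfrac{1}{2}(d_{im} + d_{jm} - d_{ij})$, and it joins the last two nodes with a single edge of length $d_{ij}$, which is the common convention for an unrooted tree; placing a root midway would split that edge in two. The data layout and the name `neighbor_joining` are illustrative.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return a list of edges (node, neighbor, length) of the unrooted tree."""
    d = dict(dist)
    active = list(names)
    edges = []
    while len(active) > 2:
        n = len(active)
        # Average distance from each node to all the others.
        r = {i: sum(d[frozenset((i, m))] for m in active if m != i) / (n - 2)
             for i in active}
        i, j = min(combinations(active, 2),
                   key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
        k = (i, j)
        d_ij = d[frozenset((i, j))]
        e_ik = 0.5 * (d_ij + r[i] - r[j])
        edges.append((k, i, e_ik))
        edges.append((k, j, d_ij - e_ik))
        active = [m for m in active if m not in (i, j)]
        for m in active:
            d[frozenset((k, m))] = 0.5 * (
                d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij)
        active.append(k)
    i, j = active
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


# Toy additive distances from a tree with edges A:1, B:2, C:3, D:4, middle:5.
toy = {frozenset(p): v for p, v in {
    ("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
    ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}.items()}
edges = neighbor_joining(["A", "B", "C", "D"], toy)
```

Because the toy distances are additive, the recovered edge lengths are exactly the ones used to build them, including unequal lengths on sister branches.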

Comparison: Neighbor-joining and UPGMA

- Minimization:
  - UPGMA uses $d_{ij}$.
  - Neighbor-joining uses $d_{ij} - (r_i + r_j)$, where $r_i = \frac{1}{|L| - 2} \sum_{m \in L} d_{im}$ and $L$ is the set of current nodes.
- Distance measures:
  - For distances between leaves $i$ and $j$, $d_{ij}$ is the same in both algorithms.
  - For distances between clusters, UPGMA uses $d_{ij} = \frac{1}{|C_i||C_j|} \sum_{p \in C_i,\ q \in C_j} d_{pq}$.
  - For distances between nodes $k$ and $m$, neighbor-joining uses $d_{km} = \tfrac{1}{2}(d_{im} + d_{jm} - d_{ij})$, where $i$ and $j$ are the daughters of $k$.
- Edge lengths:
  - UPGMA sets the height of node $k$ to half the distance between daughters $i$, $j$ ($\tfrac{1}{2} d_{ij}$).
  - Neighbor-joining sets the edge length between $k$ and daughter $i$ to $\tfrac{1}{2}(d_{ij} + r_i - r_j)$, and between $k$ and daughter $j$ to $d_{ij} - d_{ik}$.

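Drill-down: MLE. Simplest case, site $u = 3$. [Figure: node $k$ with state $a$ and two leaf daughters: $i$ = nccbabc, reached with $P(c \mid a, t_i)$, and $j$ = ncbbcbc, reached with $P(b \mid a, t_j)$.]

- $P(L_i \mid c) = 1$
- $P(L_j \mid b) = 1$
- $P(L_k \mid a) = P(c \mid a, t_i)\, P(b \mid a, t_j)$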

Drill-down: Enumerating topologies

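Drill-down: Acceptance with Metropolis-Hastings. A proposed tree $\tau^*$ is accepted with probability

$$\min\left(1,\ \frac{P(\tau^*)\, Q(\tau^*, \tau)}{P(\tau)\, Q(\tau, \tau^*)}\right)$$

However, because the proposal is symmetric, you can step forward or backward with equal probability: $Q(\tau, \tau^*) = Q(\tau^*, \tau)$. Hence our test becomes

$$\min\left(1,\ \frac{P(\tau^*)}{P(\tau)}\right)$$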
