Lecture 10 – Models of DNA Sequence Evolution Correct for multiple substitutions in calculating pairwise genetic distances. Derive transformation probabilities.

Presentation transcript:

Lecture 10 – Models of DNA Sequence Evolution
Correct for multiple substitutions in calculating pairwise genetic distances. Derive transformation probabilities for likelihood-based methods.
$$\mathrm{Prob}(R_r \mid \tau) = \pi_m \times P_{m,k}(v_{3,1}) \times P_{k,A}(v_{1,w}) \times P_{k,G}(v_{1,x}) \times P_{m,l}(v_{3,2}) \times P_{l,C}(v_{2,y}) \times P_{l,C}(v_{2,z})$$
It is the $P_{i,j}$'s that we need a substitution model to calculate. The models typically used are Markov processes. A Poisson process is a stochastic process that can be used to model events in time; the time between events is exponentially distributed, with a rate parameter equal to the rate of the process.
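As a rough illustration of how the $P_{i,j}$ terms combine, the sketch below evaluates the product for one assignment of internal states (root $m$, internal nodes $k$ and $l$) with tips A, G, C, C. It assumes the Jukes-Cantor probabilities from this lecture, a root frequency of 1/4, and equal branch lengths of 0.1; the function names and the dictionary keys (`v31`, `v1w`, ...) are hypothetical labels for the branch lengths in the formula. Summing over all internal-state assignments, as is usual for a site likelihood, is shown in the last lines.

```python
import math

def jc_prob(i, j, alpha_t):
    """Jukes-Cantor probability of base i becoming base j; alpha_t = alpha * t."""
    decay = math.exp(-4.0 * alpha_t)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def reconstruction_prob(m, k, l, branches, root_freq=0.25):
    """Probability of tips (A, G, C, C) for one choice of internal states m, k, l."""
    return (root_freq                                  # assumed root-state frequency
            * jc_prob(m, k, branches["v31"])
            * jc_prob(k, "A", branches["v1w"])
            * jc_prob(k, "G", branches["v1x"])
            * jc_prob(m, l, branches["v32"])
            * jc_prob(l, "C", branches["v2y"])
            * jc_prob(l, "C", branches["v2z"]))

# Hypothetical branch lengths, all set to alpha * t = 0.1.
branches = {name: 0.1 for name in ("v31", "v1w", "v1x", "v32", "v2y", "v2z")}

bases = "ACGT"
# Sum over every reconstruction of the three internal nodes.
site_likelihood = sum(reconstruction_prob(m, k, l, branches)
                      for m in bases for k in bases for l in bases)
print(site_likelihood)
```

Reconstructions that require few changes (for example m = A, k = A, l = C) contribute far more to the sum than ones that require many.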

Jukes-Cantor Model
The probability of a site remaining constant is $p_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$. The probability of a site changing to a particular different nucleotide is $p_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$. Here $\alpha$ is the rate at which any nucleotide changes to any other per unit time.
Given that the state at the site is $i$ at time $t_0$, we start by calculating the probability of state $i$ at that site at time $t_1$: $p_i(0) = 1$ and $p_i(1) = 1 - 3\alpha$.

Jukes-Cantor Model
Now, what is the probability of this site having state $i$ at $t_2$? There are two ways for the site to have state $i$ at $t_2$:
1) It still hasn't changed since time $t_0$.
2) It has changed to something else and back again.
Therefore, $p_i(2) = (1 - 3\alpha)\, p_i(1) + \alpha\, [1 - p_i(1)]$, where
$(1 - 3\alpha)\, p_i(1)$ is the probability of no change at the site during the second time step, $(1 - 3\alpha)$, times the probability of the site having state $i$ at time $t_1$, $p_i(1)$; and
$\alpha\, [1 - p_i(1)]$ is the probability of a change to $i$, $\alpha$, times the probability that the site is not in state $i$ at time $t_1$, $1 - p_i(1)$.

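We have a recurrence equation:
$$p_i(t+1) = (1 - 3\alpha)\, p_i(t) + \alpha\, [1 - p_i(t)] = p_i(t) - 3\alpha\, p_i(t) + \alpha - \alpha\, p_i(t)$$
We can calculate the change in $p_i(t)$ across one time step, $\Delta t$:
$$p_i(t+1) - p_i(t) = -3\alpha\, p_i(t) + \alpha - \alpha\, p_i(t)$$
so $\Delta p_i(t) = \alpha - 4\alpha\, p_i(t)$ and, approximating the discrete process by a continuous one,
$$\frac{dp_i(t)}{dt} = \alpha - 4\alpha\, p_i(t)$$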
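A small numerical check can make the step from the recurrence to continuous time more concrete. The sketch below iterates the discrete recurrence and compares it with the continuous-time solution $p_i(t) = \frac{1}{4} + (p_i(0) - \frac{1}{4}) e^{-4\alpha t}$; the value $\alpha = 0.001$ and the function names are illustrative assumptions, and the two agree closely only when $\alpha$ is small per time step.

```python
import math

def iterate_recurrence(alpha, steps, p0=1.0):
    """Apply p(t+1) = (1 - 3*alpha) * p(t) + alpha * (1 - p(t)) for `steps` steps."""
    p = p0
    for _ in range(steps):
        p = (1.0 - 3.0 * alpha) * p + alpha * (1.0 - p)
    return p

def closed_form(alpha, t, p0=1.0):
    """Continuous-time Jukes-Cantor solution for the probability of state i."""
    return 0.25 + (p0 - 0.25) * math.exp(-4.0 * alpha * t)

alpha = 0.001  # illustrative per-step rate
for t in (1, 10, 100, 1000):
    print(t, iterate_recurrence(alpha, t), closed_form(alpha, t))
```

Both columns start near 1 and decay toward 1/4, the equilibrium frequency of each base under this model.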

Jukes-Cantor Model
$$p_i(t) = \frac{1}{4} + \left(p_i(0) - \frac{1}{4}\right) e^{-4\alpha t}$$
We have the probability that a site has a particular nucleotide after time $t$, given in terms of its initial state.
If the site starts in state $i$, then $p_i(0) = 1$. Therefore, $p_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$.
If the site starts in a different state $j \neq i$, then $p_i(0) = 0$, and $p_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$.
$\alpha$ is an instantaneous rate, so we have modeled branch length (rate × time) explicitly in our expectations.
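These probabilities lead to the usual Jukes-Cantor correction for multiple substitutions, which is not written out on the slide. A hedged sketch: if $d$ is the expected number of substitutions per site separating two sequences ($d = 3\alpha T$ for total elapsed time $T$), the expected proportion of differing sites is $p = \frac{3}{4}(1 - e^{-4d/3})$, which inverts to $d = -\frac{3}{4}\ln(1 - \frac{4}{3}p)$. The function names below are illustrative.

```python
import math

def jc_expected_p(d):
    """Expected proportion of differing sites for a JC distance d (substitutions/site)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def jc_distance(p):
    """Jukes-Cantor corrected distance from the observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

for p in (0.01, 0.1, 0.3, 0.6):
    print(p, jc_distance(p))
```

The corrected distance is always at least as large as the observed proportion, and the gap widens as sequences approach saturation at $p = 0.75$.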

The JC model makes several assumptions.
1) All substitutions are equally likely; we have a single substitution type.
2) Base frequencies are assumed to be equal; each of the four nucleotides occurs at 25% of sites.
3) Each site has the same probability of experiencing a substitution as any other; we have an equal-rates model.
4) The process is constant through time.
5) Sites are independent of each other.
6) Substitution is a Markov process.
Q-matrix (rows and columns in the order A, C, G, T):
$$Q = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}$$

Substitution types and base frequencies.
For the general case (rows and columns in the order A, C, G, T):
$$Q = \begin{pmatrix} -\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a \pi_C & \mu b \pi_G & \mu c \pi_T \\ \mu g \pi_A & -\mu(g\pi_A + d\pi_G + e\pi_T) & \mu d \pi_G & \mu e \pi_T \\ \mu h \pi_A & \mu j \pi_C & -\mu(h\pi_A + j\pi_C + f\pi_T) & \mu f \pi_T \\ \mu i \pi_A & \mu k \pi_C & \mu l \pi_G & -\mu(i\pi_A + k\pi_C + l\pi_G) \end{pmatrix}$$
where $\mu$ is the average instantaneous substitution rate; $a, b, c, \ldots, l$ are relative rate parameters (one of them is set to 1); and the $\pi_i$'s are the frequencies of the base that is being substituted to.
Note that the relative rates are not symmetric (for example, $a$ and $g$ may differ), and therefore the full model is non-reversible. Setting $a = g$, $b = h$, $c = i$, $d = j$, $e = k$, and $f = l$ makes the model time-reversible.

Substitution types and base frequencies.
General Time-Reversible Model
$$Q = \begin{pmatrix} -\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a \pi_C & \mu b \pi_G & \mu c \pi_T \\ \mu a \pi_A & -\mu(a\pi_A + d\pi_G + e\pi_T) & \mu d \pi_G & \mu e \pi_T \\ \mu b \pi_A & \mu d \pi_C & -\mu(b\pi_A + d\pi_C + f\pi_T) & \mu f \pi_T \\ \mu c \pi_A & \mu e \pi_C & \mu f \pi_G & -\mu(c\pi_A + e\pi_C + f\pi_G) \end{pmatrix}$$
There are six relative transformation rates (one of which is set to 1). There are four base frequencies that must sum to 1. Note that this is not a symmetric matrix, but it can be decomposed into $R$ and $\Pi$.
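One way to see what the GTR matrix encodes is to build it from its parameters and check the properties claimed for it. The sketch below is a minimal illustration: the function name `gtr_q`, the dictionary layout, and the example rates and frequencies are all assumptions. It checks that each row sums to zero and that $\pi_i q_{ij} = \pi_j q_{ji}$, the detailed-balance condition that corresponds to time-reversibility.

```python
def gtr_q(rates, freqs, mu=1.0):
    """Build a GTR rate matrix (order A, C, G, T) from six relative rates a..f."""
    order = "ACGT"
    pair_rate = {
        frozenset("AC"): rates["a"], frozenset("AG"): rates["b"],
        frozenset("AT"): rates["c"], frozenset("CG"): rates["d"],
        frozenset("CT"): rates["e"], frozenset("GT"): rates["f"],
    }
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(order):
        for j, y in enumerate(order):
            if i != j:
                # Off-diagonal: mu * relative rate * frequency of the target base.
                q[i][j] = mu * pair_rate[frozenset((x, y))] * freqs[y]
        # Diagonal: negative sum of the off-diagonal entries in the row.
        q[i][i] = -sum(q[i])
    return q

# Illustrative parameter values (f is the rate fixed to 1).
rates = {"a": 1.2, "b": 4.0, "c": 0.8, "d": 1.1, "e": 3.5, "f": 1.0}
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
q_gtr = gtr_q(rates, freqs)

pi = [freqs[b] for b in "ACGT"]
reversible = all(abs(pi[i] * q_gtr[i][j] - pi[j] * q_gtr[j][i]) < 1e-12
                 for i in range(4) for j in range(4))
print(reversible)
```

The matrix itself is not symmetric whenever the base frequencies differ, yet the detailed-balance check still passes.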

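Substitution types and base frequencies.
Visual GTR
$$R = \begin{pmatrix} -\mu(a+b+c) & \mu a & \mu b & \mu c \\ \mu a & -\mu(a+d+e) & \mu d & \mu e \\ \mu b & \mu d & -\mu(b+d+f) & \mu f \\ \mu c & \mu e & \mu f & -\mu(c+e+f) \end{pmatrix}, \qquad \Pi = \begin{pmatrix} \pi_A & 0 & 0 & 0 \\ 0 & \pi_C & 0 & 0 \\ 0 & 0 & \pi_G & 0 \\ 0 & 0 & 0 & \pi_T \end{pmatrix}$$
The off-diagonal elements of $Q$ are those of the product $R\Pi$; each diagonal element of $Q$ is then set so that its row sums to zero.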

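Common Simplifications
Transition-type substitutions occur at a higher rate than transversion substitutions. The Kimura two-parameter (K2P) model was the first to address this. So we set $b = e = \kappa$ (for transitions) and $a = c = d = f = 1$ (for transversions). All $\pi_i = \frac{1}{4}$.
For K2P:
$$Q = \begin{pmatrix} -\mu(\kappa + 2)/4 & \mu/4 & \mu\kappa/4 & \mu/4 \\ \mu/4 & -\mu(\kappa + 2)/4 & \mu/4 & \mu\kappa/4 \\ \mu\kappa/4 & \mu/4 & -\mu(\kappa + 2)/4 & \mu/4 \\ \mu/4 & \mu\kappa/4 & \mu/4 & -\mu(\kappa + 2)/4 \end{pmatrix}$$
where $\alpha = \mu\kappa/4$ is the transition rate and $\beta = \mu/4$ is the transversion rate. Thus, $\kappa = \alpha/\beta$.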
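The K2P restriction can be sketched in a few lines by assigning rate $\kappa$ to the transition pairs (A↔G, C↔T) and rate 1 to everything else, with all base frequencies equal to 1/4. The function name and the value $\kappa = 2$ are illustrative assumptions.

```python
BASES = "ACGT"
TRANSITIONS = {frozenset("AG"), frozenset("CT")}  # purine<->purine, pyrimidine<->pyrimidine

def k2p_q(kappa, mu=1.0):
    """Build the K2P rate matrix (order A, C, G, T) with equal base frequencies."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                rate = kappa if frozenset((x, y)) in TRANSITIONS else 1.0
                q[i][j] = mu * rate / 4.0
        q[i][i] = -sum(q[i])  # equals -mu * (kappa + 2) / 4
    return q

q_k2p = k2p_q(kappa=2.0)
alpha, beta = q_k2p[0][2], q_k2p[0][1]  # A->G (transition), A->C (transversion)
print(alpha / beta)
```

Setting `kappa=1.0` makes every off-diagonal entry equal, which recovers the Jukes-Cantor matrix with rate $\mu/4$ per substitution type.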

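Hasegawa-Kishino-Yano (HKY) Model
For HKY:
$$Q = \begin{pmatrix} -\mu(\kappa\pi_G + \pi_Y) & \mu\pi_C & \mu\kappa\pi_G & \mu\pi_T \\ \mu\pi_A & -\mu(\kappa\pi_T + \pi_R) & \mu\pi_G & \mu\kappa\pi_T \\ \mu\kappa\pi_A & \mu\pi_C & -\mu(\kappa\pi_A + \pi_Y) & \mu\pi_T \\ \mu\pi_A & \mu\kappa\pi_C & \mu\pi_G & -\mu(\kappa\pi_C + \pi_R) \end{pmatrix}$$
where $\pi_R = \pi_A + \pi_G$ and $\pi_Y = \pi_C + \pi_T$.
Many other models restrict the Q-matrix in other ways.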
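HKY combines the transition/transversion distinction with unequal base frequencies. A minimal sketch, assuming the same A, C, G, T ordering and using illustrative parameter values and a hypothetical function name:

```python
def hky_q(kappa, freqs, mu=1.0):
    """Build the HKY rate matrix (order A, C, G, T)."""
    bases = "ACGT"
    transitions = {frozenset("AG"), frozenset("CT")}
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(bases):
        for j, y in enumerate(bases):
            if i != j:
                rate = kappa if frozenset((x, y)) in transitions else 1.0
                # Rate is weighted by the frequency of the base being substituted to.
                q[i][j] = mu * rate * freqs[y]
        q[i][i] = -sum(q[i])
    return q

freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # illustrative frequencies
q_hky = hky_q(kappa=2.0, freqs=freqs)

pi_R = freqs["A"] + freqs["G"]  # purine frequency
pi_Y = freqs["C"] + freqs["T"]  # pyrimidine frequency
print(q_hky[0][0], -(2.0 * freqs["G"] + pi_Y))
```

With all four frequencies set to 1/4, the off-diagonal entries reduce to the K2P values $\mu\kappa/4$ and $\mu/4$.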

Some common models
There are 203 special cases of the GTR model, or 406 if we also allow base frequencies to be constrained to be equal.
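The figure 203 matches the number of ways to partition the six relative rates into groups that are constrained to be equal, which is the Bell number $B_6$; doubling it for equal versus unequal base frequencies gives 406. The sketch below computes Bell numbers with the Bell triangle; the function name is illustrative.

```python
def bell_number(n):
    """Number of ways to partition a set of n items into non-empty groups."""
    row = [1]
    for _ in range(n):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left to the one just written.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

n_rate_schemes = bell_number(6)   # partitions of the rates a, b, c, d, e, f
print(n_rate_schemes, 2 * n_rate_schemes)
```

For example, Jukes-Cantor and K2P correspond to the partitions {abcdef} and {be}{acdf} combined with equal base frequencies.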

Calculating Transformation Probabilities.
The $Q$ and $R$ matrices we have been discussing define the instantaneous rates of substitution from one nucleotide to another. Convert the rates to probabilities by matrix exponentiation:
$$P(t) = e^{Qt}$$
Closed-form solutions are available for simple models such as Jukes-Cantor and K2P. Again, it is these $P_{ij}$ that are used in the likelihood function.
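To illustrate the exponentiation step, the sketch below computes $e^{Qt}$ for the Jukes-Cantor matrix with a plain-Python Taylor series plus scaling and squaring, and compares the result with the closed-form $p_{ii}(t)$ and $p_{ij}(t)$. The helper names, the number of terms and squarings, and the values $\alpha = 0.1$, $t = 2$ are assumptions chosen for illustration; in practice a library routine such as `scipy.linalg.expm` would normally be used instead.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, t, squarings=10, terms=12):
    """Approximate exp(Q * t) by a Taylor series with scaling and squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    m = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, m)]   # m^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # undo the scaling: exp(A)^2 = exp(2A)
    return result

alpha, t = 0.1, 2.0
q_jc = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
p = mat_exp(q_jc, t)

p_same = 0.25 + 0.75 * math.exp(-4 * alpha * t)
p_diff = 0.25 - 0.25 * math.exp(-4 * alpha * t)
print(p[0][0], p_same)
print(p[0][1], p_diff)
```

Each row of the resulting matrix sums to 1, as expected for transition probabilities.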