
Presentation transcript:

Q and P(t)

What is the probability of going from i (C?) to j (G?) in time t with rate matrix Q?

i. $P(0) = I$
ii. $P(\varepsilon)$ is close to $I + \varepsilon Q$ for small $\varepsilon$
iii. $P'(0) = Q$
iv. $\lim_{t \to \infty} P(t)$ has the equilibrium frequencies of the 4 nucleotides in each row
v. Waiting time in state j, $T_j$: $P(T_j > t) = e^{q_{jj} t}$
vi. $QE = 0$, where $E_{ij} = 1$ (all i, j)
vii. $PE = E$
viii. If $AB = BA$, then $e^{A+B} = e^{A} e^{B}$

Expected number of events at equilibrium
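The properties listed on this slide can be checked numerically. The following is a minimal sketch, not the lecturer's code: it approximates P(t) by composing many small steps of the form I + εQ (property ii), and the rate values in `Q` are made-up illustrative numbers. The helper names `mat_mul`, `transition_matrix` and `make_rate_matrix` are hypothetical.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(Q, t, squarings=20):
    """Approximate P(t) = exp(Qt) as (I + Q*t/2^s)^(2^s) by repeated squaring."""
    n = len(Q)
    eps = t / 2 ** squarings
    P = [[(1.0 if i == j else 0.0) + eps * Q[i][j] for j in range(n)]
         for i in range(n)]
    for _ in range(squarings):
        P = mat_mul(P, P)
    return P


def make_rate_matrix(off_diagonal):
    """Fill the diagonal so that every row sums to zero (property vi: QE = 0)."""
    Q = [row[:] for row in off_diagonal]
    for i in range(len(Q)):
        Q[i][i] = -sum(Q[i][j] for j in range(len(Q)) if j != i)
    return Q


# Hypothetical rates in the order A, C, G, T (illustrative numbers only).
Q = make_rate_matrix([
    [0.0, 0.1, 0.3, 0.1],
    [0.2, 0.0, 0.1, 0.3],
    [0.3, 0.1, 0.0, 0.1],
    [0.1, 0.4, 0.2, 0.0],
])

P_short = transition_matrix(Q, 0.5)
P_long = transition_matrix(Q, 500.0)  # rows approach the equilibrium frequencies

print([round(sum(row), 6) for row in P_short])  # property vii: each row sums to 1
print([round(x, 4) for x in P_long[0]])         # property iv: first and last rows
print([round(x, 4) for x in P_long[3]])         # should agree for large t
```

Repeated squaring is chosen here only because it mirrors property ii directly; a production analysis would normally rely on a library matrix exponential.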

Jukes-Cantor (JC69): Total Symmetry

Rate matrix, R (rows: FROM, columns: TO):

        A    C    G    T
  A     -    α    α    α
  C     α    -    α    α
  G     α    α    -    α
  T     α    α    α    -

Stationary distribution: (1,1,1,1)/4.

Transition probabilities after time t, with a = α*t:

$P(\text{equal}) = \tfrac{1}{4}\left(1 + 3e^{-4a}\right) \approx 1 - 3a$

$P(\text{diff.}) = \tfrac{3}{4}\left(1 - e^{-4a}\right) \approx 3a$
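A small sketch of the JC69 transition probabilities follows. It assumes that P(diff.) means the total probability of ending in any of the three other nucleotides, 3/4(1 − e^{−4a}), so that the two probabilities sum to one and the small-a approximations 1 − 3a and 3a hold. The function name `jc69_probs` is hypothetical.

```python
import math


def jc69_probs(a):
    """Return (P(equal), P(diff.)) under JC69, where a = alpha * t.

    P(diff.) is the total probability of any of the 3 other nucleotides.
    """
    e = math.exp(-4.0 * a)
    return 0.25 * (1.0 + 3.0 * e), 0.75 * (1.0 - e)


for a in (0.001, 0.1, 1.0, 10.0):
    p_equal, p_diff = jc69_probs(a)
    print(a, round(p_equal, 6), round(p_diff, 6))
```

For large a both values approach the stationary limits 1/4 and 3/4.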

Principle of Inference: Likelihood

Likelihood function L(): the probability of the data as a function of the parameters, $L(\theta, D)$.

If the data are a series of independent experiments, L() becomes a product of the likelihoods of each experiment, and l() becomes the sum of the log-likelihoods of each experiment.

In likelihood analysis the parameter is not viewed as a random variable.

LogLikelihood function l(): $\ln L(\theta, D)$

[Figure panels: Likelihood, LogLikelihood]
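As an illustration of the log-likelihood being a sum over independent experiments, the sketch below treats each aligned site of two sequences as one experiment under JC69. The counts (80 identical sites, 20 differing sites) are invented for the example, the multinomial constant is dropped, and the function names are hypothetical. The grid maximum is compared with the standard closed-form JC69 estimate.

```python
import math


def jc69_loglik(a, n_same, n_diff):
    """Log-likelihood (up to an additive constant) of a = alpha * t under JC69.

    Each site is an independent experiment, so the log-likelihood is a sum:
    n_same terms of ln P(equal) plus n_diff terms of ln P(diff.).
    """
    e = math.exp(-4.0 * a)
    p_same = 0.25 * (1.0 + 3.0 * e)
    p_diff = 0.75 * (1.0 - e)
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)


def grid_mle(n_same, n_diff, step=0.001, upper=2.0):
    """Maximise the log-likelihood over a coarse grid of a values."""
    grid = [step * i for i in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda a: jc69_loglik(a, n_same, n_diff))


def jc69_mle(n_same, n_diff):
    """Closed-form maximiser, valid when the differing fraction is below 3/4."""
    p = n_diff / (n_same + n_diff)
    return -0.25 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical data: 100 aligned sites, 20 of which differ.
print(grid_mle(80, 20), jc69_mle(80, 20))
```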

From Q to P for Jukes-Cantor

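Exponentiation/Powering of Matrices

By eigenvalues: if $Q = B \Lambda B^{-1}$, where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_4)$, then $e^{Qt} = B e^{\Lambda t} B^{-1}$ and $e^{\Lambda t} = \mathrm{diag}(e^{\lambda_1 t}, \dots, e^{\lambda_4 t})$.

Finding $\Lambda$: $\det(Q - \lambda I) = 0$

Finding $B$: $(Q - \lambda_i I) b_i = 0$

Numerically: [formula not preserved in the transcript], where k ~ 6-10

JC69: [content not preserved in the transcript]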
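The two routes named on this slide can be compared on JC69. The sketch below makes two assumptions that are not in the transcript: the numerical route is taken to be the power series of the matrix exponential truncated after k terms, and the eigenvalue route uses the known JC69 eigenvalues 0 and −4α, which give P(t) = E/4 + e^{−4αt}(I − E/4). All function names are hypothetical.

```python
import math


def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm_series(Q, t, k=10):
    """Truncated series sum_{i=0}^{k} (Qt)^i / i! (assumed numerical scheme)."""
    n = len(Q)
    A = [[Q[i][j] * t for j in range(n)] for i in range(n)]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in term]
    for i in range(1, k + 1):
        # term becomes (Qt)^i / i! by multiplying the previous term by Qt / i.
        term = [[x / i for x in row] for row in mat_mul(term, A)]
        total = [[total[r][c] + term[r][c] for c in range(n)] for r in range(n)]
    return total


def jc69_by_eigenvalues(alpha, t):
    """P(t) for JC69 from its eigenvalues 0 and -4*alpha (multiplicity 3)."""
    e = math.exp(-4.0 * alpha * t)
    return [[0.25 + e * ((1.0 if i == j else 0.0) - 0.25) for j in range(4)]
            for i in range(4)]


alpha = 0.1  # illustrative rate
Q_jc = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
print(expm_series(Q_jc, 1.0, k=10)[0])
print(jc69_by_eigenvalues(alpha, 1.0)[0])
```

The truncated series is only reliable when the entries of Qt are small; for long branches a scaling step or the eigenvalue route is the safer choice.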

Kimura 2-parameter model - K80

Rate matrix Q (rows: FROM, columns: TO):

        A    C    G    T
  A     -    β    α    β
  C     β    -    β    α
  G     α    β    -    β
  T     β    α    β    -

a = α*t, b = β*t

[Figure: from Q to P(t), with the starting nucleotide marked "start"]
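A sketch of the K80 transition probabilities in the common textbook parametrisation, which is an assumption here: α is the rate of the single transition and β the rate of each of the two transversions, so the eigenvalues of Q are 0, −4β and −2(α+β). The time scaling may differ from the one used elsewhere in these slides, and `k80_probs` is a hypothetical name.

```python
import math


def k80_probs(a, b):
    """Return (identity, transition, each transversion) for a = alpha*t, b = beta*t."""
    e1 = math.exp(-4.0 * b)
    e2 = math.exp(-2.0 * (a + b))
    p_identity = 0.25 * (1.0 + e1 + 2.0 * e2)
    p_transition = 0.25 * (1.0 + e1 - 2.0 * e2)
    p_transversion = 0.25 * (1.0 - e1)  # applies to each of the two transversions
    return p_identity, p_transition, p_transversion


p_id, p_ts, p_tv = k80_probs(0.3, 0.1)
print(p_id, p_ts, p_tv, p_id + p_ts + 2 * p_tv)
```

Setting a equal to b recovers the Jukes-Cantor probabilities, which is a convenient sanity check.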

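Felsenstein81 & Hasegawa, Kishino & Yano 85

Unequal base composition (Felsenstein, 1981; F81):

$Q_{i,j} = C \pi_j$ for i unequal to j

$\mathrm{Tr}/\mathrm{Tv} = \dfrac{\pi_T \pi_C + \pi_A \pi_G}{(\pi_T + \pi_C)(\pi_A + \pi_G)}$

Tr/Tv & composition bias (Hasegawa, Kishino & Yano, 1985; HKY85):

$Q_{i,j} = (\alpha/\beta)\, C \pi_j$ if i -> j is a transition

$Q_{i,j} = C \pi_j$ if i -> j is a transversion

$\mathrm{Tr}/\mathrm{Tv} = (\alpha/\beta)\, \dfrac{\pi_T \pi_C + \pi_A \pi_G}{(\pi_T + \pi_C)(\pi_A + \pi_G)}$

Rates to frequent nucleotides are high, where $\pi = (\pi_A, \pi_C, \pi_G, \pi_T)$.

[Figure: the four nucleotides A, G, T, C with transitions and transversions between them]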
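The construction of these rate matrices can be sketched as follows. This is an illustration under stated assumptions: the factor multiplying transitions is passed as `ratio` (taken to stand for α/β, with `ratio=1` giving F81), the base frequencies are invented, and the ratio computed is the equilibrium flux of transitions divided by that of transversions. Function names are hypothetical.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(x, y):
    """A transition stays within purines (A, G) or within pyrimidines (C, T)."""
    return (x in PURINES) == (y in PURINES)


def hky85_rate_matrix(pi, ratio=1.0, C=1.0):
    """Q[i][j] = ratio*C*pi[j] for transitions, C*pi[j] for transversions."""
    Q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                Q[i][j] = C * pi[j] * (ratio if is_transition(x, y) else 1.0)
        Q[i][i] = -sum(Q[i])  # diagonal still 0 here, so rows end up summing to 0
    return Q


def transition_transversion_ratio(pi, Q):
    """Equilibrium flux of transitions divided by that of transversions."""
    ts = tv = 0.0
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                if is_transition(x, y):
                    ts += pi[i] * Q[i][j]
                else:
                    tv += pi[i] * Q[i][j]
    return ts / tv


pi = [0.1, 0.2, 0.3, 0.4]  # hypothetical frequencies of A, C, G, T
Q_hky = hky85_rate_matrix(pi, ratio=2.0)
print(transition_transversion_ratio(pi, Q_hky))
```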

Measuring Selection

Certain events have functional consequences and will be selected out. The strength and localization of this selection is of great interest.

The selection criteria could in principle be anything, but selection against amino acid changes is by far the most important.

[Figure: a sequence changing by single substitutions, labelled with encoded amino acids]
ThrSer ACGTCA
ThrPro ACGCCA
ArgSer AGGCCG
ThrSer ACGCCG
ThrSer ACTCTG
AlaSer GCTCTG
AlaSer GCACTG

The Genetic Code

i. 3 classes of sites: 4 (3rd), (3rd), T → A (2nd)

Problems:
i. Not all fit into those categories.
ii. Change in one site can change the status of another.

Possible events in the genetic code (remade from Li, 1997)

Substitutions           Number   Percent
Total in all codons     549
Synonymous
Nonsynonymous
  Missense
  Nonsense              23       4

Possible number of substitutions: 61 (codons) * 3 (positions) * 3 (alternative nucleotides) = 549.
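The counts in this table can be recomputed by enumerating every single-nucleotide change in the 61 sense codons. The sketch below assumes the standard genetic code, written in the usual compact T, C, A, G ordering; the names `CODE` and `count_substitutions` are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG; '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def count_substitutions():
    """Classify all single-nucleotide substitutions starting from sense codons."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, amino_acid in CODE.items():
        if amino_acid == "*":
            continue  # only the 61 sense codons are starting points
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODE[mutant] == "*":
                    counts["nonsense"] += 1
                elif CODE[mutant] == amino_acid:
                    counts["synonymous"] += 1
                else:
                    counts["missense"] += 1
    return counts


counts = count_substitutions()
total = sum(counts.values())
for kind, n in counts.items():
    print(kind, n, round(100.0 * n / total))
```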

Kimura's 2 parameter model & Li's Model

Selection on the 3 kinds of sites, (a,b) → (?,?):

(f*α, f*β)
2-2: (α, f*β)
4: (α, β)

[Figure: rates α and β leaving the "start" nucleotide, and the corresponding probabilities]

alpha-globin from rabbit and mouse.

Ser Thr Glu Met Cys Leu Met Gly Gly
TCA ACT GAG ATG TGT TTA ATG GGG GGA
  *   *  *    *  *  *         * **
TCG ACA GGG ATA TAT CTA ATG GGT ATA
Ser Thr Gly Ile Tyr Leu Met Gly Ile

Sites (rates)   Total   Conserved     Transitions   Transversions
(a*f, b*f)      274     246 (.8978)   12 (.0438)    16 (.0584)
(a, b*f)        77      51 (.6623)    21 (.2727)    5 (.0649)
(a, b)          78      47 (.6026)    16 (.2051)    15 (.1923)

$Z(\alpha t, \beta t) = .50\,[1 + \exp(-2\beta t) - 2\exp(-t(\alpha + \beta))]$ transition

$Y(\alpha t, \beta t) = .25\,[1 - \exp(-2\beta t)]$ transversion

$X(\alpha t, \beta t) = .25\,[1 + \exp(-2\beta t) + 2\exp(-t(\alpha + \beta))]$ identity

$L(\text{observations}, a, b, f) = C(429, 274, 77, 78) \cdot \{X(af, bf)^{246}\, Y(af, bf)^{12}\, Z(af, bf)^{16}\} \cdot \{X(a, bf)^{51}\, Y(a, bf)^{21}\, Z(a, bf)^{5}\} \cdot \{X(a, b)^{47}\, Y(a, b)^{16}\, Z(a, b)^{15}\}$

where a = αt and b = βt.

Estimated parameters: a, b, 2*b, (a + 2*b), f [values not preserved in the transcript]

Transition and transversion rates per site class: (a*f, 2*b*f), (a, 2*b*f), (a, 2*b) [values not preserved in the transcript]

Expected number of replacement and synonymous substitutions:
Replacement sites: ... (0.3742/0.6744)*77 [result not preserved]
Silent sites: [not preserved]

K_s = .6644, K_a = .1127
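The structure of this likelihood can be sketched in code using the counts from the table. Two assumptions are made loudly: the site probabilities use the common textbook K80 parametrisation (α for the transition, β for each transversion), which may be scaled differently from the X, Y, Z functions on the slide, and the multinomial constant C(429,274,77,78) is dropped. The parameter values tried at the end are illustrative, not the estimates from the lecture, and all names are hypothetical.

```python
import math


def k80_class_probs(a, b):
    """(identity, transition, any transversion) under textbook K80 scaling."""
    e1 = math.exp(-4.0 * b)
    e2 = math.exp(-2.0 * (a + b))
    p_ts = 0.25 * (1.0 + e1 - 2.0 * e2)
    p_tv = 0.5 * (1.0 - e1)  # both transversions together
    return 1.0 - p_ts - p_tv, p_ts, p_tv


# (conserved, transitions, transversions) and the rate pair of each site class.
SITE_CLASSES = [
    ((246, 12, 16), lambda a, b, f: (a * f, b * f)),
    ((51, 21, 5), lambda a, b, f: (a, b * f)),
    ((47, 16, 15), lambda a, b, f: (a, b)),
]


def log_lik(a, b, f):
    """Log-likelihood of (a, b, f), summed over the three site classes."""
    total = 0.0
    for counts, rates in SITE_CLASSES:
        probs = k80_class_probs(*rates(a, b, f))
        total += sum(n * math.log(p) for n, p in zip(counts, probs))
    return total


def k80_estimate(n_id, n_ts, n_tv):
    """Closed-form (a, b) that reproduces the observed proportions of one class."""
    n = n_id + n_ts + n_tv
    P, Q = n_ts / n, n_tv / n
    b = -0.25 * math.log(1.0 - 2.0 * Q)
    a = -0.5 * math.log(1.0 - 2.0 * P - Q) - b
    return a, b


a_hat, b_hat = k80_estimate(47, 16, 15)  # unconstrained class, rates (a, b)
print(a_hat, b_hat)
print(log_lik(a_hat, b_hat, 1.0), log_lik(a_hat, b_hat, 0.3))
```

A full analysis would maximise `log_lik` jointly over a, b and f with a numerical optimiser; the comparison of f = 1.0 with f = 0.3 only shows that selection against replacement changes improves the fit.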

Extension to Overlapping Regions (Hein & Stoevlbaek, 95)

Rates by site class in the 1st and 2nd reading frame:

(f1 f2 a, f1 f2 b)   (f2 a, f1 f2 b)   (f2 a, f2 b)
(f1 a, f1 f2 b)      (f2 a, f1 f2 b)   (a, f2 b)
(f1 a, f1 b)         (a, f1 b)         (a, b)

Ziheng Yang has an alternative model to this, where sites are lumped into the same category if they have the same configuration of positions and reading frames.

Example: Gag & Pol from HIV

[Figure residue: gag, pol, sites, Gag, Pol]

MLE: a = .084, b = .024, a + 2b = .133, f_gag = .403, f_pol = .229

HIV1 Analysis

Hasegawa, Kishino & Yano Substitution Model Parameters: α*t, β*t, π_A, π_C, π_G, π_T

Selection factors (standard deviations not preserved in the transcript):

GAG   0.385
POL   0.220
VIF   0.407
VPR   0.494
TAT   1.229
REV   0.596
VPU   0.902
ENV   0.889
NEF   0.928

Estimated Distance per Site: [not preserved in the transcript]

Statistical Test of Models (Goldman, 1990)

Data: 3 sequences of length L
ACGTTGCAA...
AGCTTTTGA...
TCGTTTCGA...

A. Likelihood (free multinomial model, 63 free parameters)

$L_1 = p_{AAA}^{\#AAA} \cdots p_{AAC}^{\#AAC} \cdots p_{TTT}^{\#TTT}$, where $p_{N_1 N_2 N_3} = \#(N_1 N_2 N_3)/L$

B. Jukes-Cantor and unknown branch lengths

$L_2 = p_{AAA}(l_1', l_2', l_3')^{\#AAA} \cdots p_{TTT}(l_1', l_2', l_3')^{\#TTT}$

[Figure: three-leaf tree with branch lengths l1, l2, l3 leading to the three sequences]

Test statistics:
I. $\sum (\text{expected} - \text{observed})^2 / \text{expected}$, or
II. $-2 \ln Q = 2(\ln L_1 - \ln L_2)$

JC69 Jukes-Cantor: 3 parameters => $\chi^2$ with 60 degrees of freedom

Problems:
i. Too few observations per pattern.
ii. Many competing hypotheses.

Parametric bootstrap:
i. Maximum likelihood to estimate the parameters.
ii. Simulate with estimated model.
iii. Make simulated distribution of -2 lnQ.
iv. Where is real -2 lnQ in this distribution?
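Part A and the last bootstrap step can be sketched as below. The example uses only the nine alignment columns visible on the slide as toy data; fitting the Jukes-Cantor tree (L2) and simulating under it are not shown, so `bootstrap_p_value` simply takes a list of already simulated statistics. All function names are hypothetical.

```python
import math
from collections import Counter


def free_multinomial_loglik(seqs):
    """ln L1: each site pattern gets probability (pattern count) / L."""
    columns = list(zip(*seqs))
    length = len(columns)
    counts = Counter(columns)
    return sum(n * math.log(n / length) for n in counts.values())


def lr_statistic(ln_l1, ln_l2):
    """Test statistic II: -2 ln Q = 2 (ln L1 - ln L2)."""
    return 2.0 * (ln_l1 - ln_l2)


def bootstrap_p_value(observed, simulated):
    """Fraction of simulated statistics at least as large as the observed one."""
    return sum(1 for s in simulated if s >= observed) / len(simulated)


# The visible prefixes of the three sequences, used as toy data.
toy_alignment = ["ACGTTGCAA", "AGCTTTTGA", "TCGTTTCGA"]
ln_l1 = free_multinomial_loglik(toy_alignment)
print(ln_l1)
```

With so few columns almost every pattern is seen once, which illustrates the first problem listed on the slide: too few observations per pattern.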

Rate variation between sites: iid each site

The rate at each position is drawn independently from a distribution, typically a Γ (or lognormal) distribution. G(α,β) has density $\alpha^{\beta} x^{\beta - 1} e^{-\alpha x} / \Gamma(\beta)$, where α is called the scale parameter and β the form parameter.

Let $L(p_i, \theta, t)$ be the likelihood for observing the i'th pattern, t all time lengths, θ the parameters describing the process, and $f(r_i)$ the continuous distribution of rate(s). Then
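Averaging a site likelihood over a rate distribution can be illustrated numerically. The sketch assumes a gamma distribution with mean one (rate parameter equal to the form parameter, called `shape` below), uses the JC69 probability of an unchanged site as the per-site likelihood, and integrates with a simple midpoint rule. The closed form used for comparison follows from the gamma moment generating function. All names are hypothetical.

```python
import math


def gamma_density(x, shape, rate):
    """Gamma density rate^shape * x^(shape-1) * exp(-rate*x) / Gamma(shape)."""
    return rate ** shape * x ** (shape - 1.0) * math.exp(-rate * x) / math.gamma(shape)


def average_over_rates(site_lik, shape, upper=40.0, steps=40000):
    """Midpoint-rule integral of site_lik(r) * f(r) dr for a mean-one gamma f."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * h
        total += site_lik(r) * gamma_density(r, shape, shape) * h
    return total


def jc69_equal(a):
    """JC69 probability that a site is unchanged, a = alpha * t."""
    return 0.25 * (1.0 + 3.0 * math.exp(-4.0 * a))


def jc69_equal_gamma(a, shape):
    """Closed form of the rate-averaged probability for mean-one gamma rates."""
    return 0.25 * (1.0 + 3.0 * (1.0 + 4.0 * a / shape) ** (-shape))


a, shape = 0.3, 2.0  # illustrative values
numeric = average_over_rates(lambda r: jc69_equal(a * r), shape)
print(numeric, jc69_equal_gamma(a, shape))
```

The midpoint rule is adequate here because a form parameter of 2 gives a density that is bounded near zero; values below 1 would need a quadrature that handles the singularity, or the discrete-category approximation commonly used in phylogenetics software.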