Markov Chain Models BMI/CS 576 Fall 2010

Motivation for Markov Models in Computational Biology

There are many cases in which we would like to represent the statistical regularities of some class of sequences:
– genes
– various regulatory sites in DNA (e.g., where RNA polymerase and transcription factors bind)
– proteins in a given family
Markov models are well suited to this type of task.

Markov Chain Models

A Markov chain model is defined by:
– a set of states; we'll think of a traversal through the states as generating a sequence, with each state adding a character to the sequence (except for begin and end)
– a set of transitions with associated probabilities; the transitions emanating from a given state define a distribution over the possible next states

A Markov Chain Model for DNA

[Figure: a chain with a begin state and one state per nucleotide (A, C, G, T); the arrows between states are transitions, each labeled with a transition probability.]

Markov Chain Models

[Figure: the DNA chain above, extended with begin and end states.]
A Markov chain can also have an end state, which allows the model to represent:
– a distribution over sequences of different lengths
– preferences for ending sequences with certain symbols

Markov Chain Models

Let X be a sequence of L random variables X_1 … X_L representing a biological sequence generated through some process. We can ask how probable the sequence is given our model. For any probabilistic model of sequences, we can write this probability as (recall the chain rule of probability):

Pr(x) = Pr(x_L, x_{L-1}, …, x_1)
      = Pr(x_L | x_{L-1}, …, x_1) Pr(x_{L-1} | x_{L-2}, …, x_1) … Pr(x_1)

The key property of a (1st-order) Markov chain is that the probability of each x_i depends only on the value of x_{i-1}:

Pr(x_i | x_{i-1}, …, x_1) = Pr(x_i | x_{i-1})

The Probability of a Sequence for a Given Markov Chain Model

[Figure: the begin/end DNA chain from the previous slide.]
For example, tracing the sequence cggt through the model:

Pr(cggt) = Pr(c | begin) Pr(g | c) Pr(g | g) Pr(t | g) Pr(end | t)

Markov Chain Notation

The transition parameters can be denoted by a_{x_{i-1} x_i}, where

a_{x_{i-1} x_i} = Pr(x_i | x_{i-1})

Similarly, we can denote the probability of a sequence x as

Pr(x) = a_{B x_1} ∏_{i=2}^{L} a_{x_{i-1} x_i}

where a_{B x_1} represents the transition from the begin state. This gives a probability distribution over sequences of length L.
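
To make the notation concrete, here is a minimal Python sketch of computing log Pr(x) under a first-order chain; the parameter values below are uniform placeholders for illustration, not estimates from the lecture.

```python
import math

def sequence_log_prob(x, begin_probs, trans_probs):
    """log Pr(x) = log a_{B x_1} + sum over i of log a_{x_{i-1} x_i}."""
    logp = math.log(begin_probs[x[0]])       # transition from the begin state
    for prev, curr in zip(x, x[1:]):         # consecutive pairs (x_{i-1}, x_i)
        logp += math.log(trans_probs[prev][curr])
    return logp

# Placeholder uniform parameters, for illustration only.
begin = {b: 0.25 for b in "acgt"}
trans = {p: {b: 0.25 for b in "acgt"} for p in "acgt"}
print(sequence_log_prob("cggt", begin, trans))   # = log(0.25^4)
```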

Estimating the Model Parameters

Given some data (e.g., a set of sequences from CpG islands), how can we determine the probability parameters of our model?
One approach: maximum likelihood estimation
– given a set of data D
– set the parameters θ to maximize Pr(D | θ)
– i.e., make the data D look as likely as possible under the model

Maximum Likelihood Estimation

Let's use a very simple sequence model:
– every position is independent of the others
– every position is generated from the same multinomial distribution
We want to estimate the parameters Pr(a), Pr(c), Pr(g), Pr(t), and we're given the sequences

accgcgctta
gcttagtgac
tagccgttac

Then the maximum likelihood estimates are the observed frequencies of the bases:

Pr(a) = 6/30 = 0.2   Pr(c) = 9/30 = 0.3   Pr(g) = 7/30 ≈ 0.233   Pr(t) = 8/30 ≈ 0.267
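
A minimal sketch of this computation; the dict comprehension reproduces the estimates above:

```python
from collections import Counter

seqs = ["accgcgctta", "gcttagtgac", "tagccgttac"]
counts = Counter("".join(seqs))                    # n_a, n_c, n_g, n_t
total = sum(counts.values())                       # 30 bases in all
mle = {b: counts[b] / total for b in "acgt"}
print(mle)  # a: 0.2, c: 0.3, g: ~0.233, t: ~0.267
```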

Maximum Likelihood Estimation

Suppose instead we saw the following sequences:

gccgcgcttg
gcttggtggc
tggccgttgc

Then the maximum likelihood estimates are:

Pr(a) = 0/30 = 0   Pr(c) = 9/30 = 0.3   Pr(g) = 13/30 ≈ 0.433   Pr(t) = 8/30 ≈ 0.267

Do we really want to set Pr(a) to 0?

A Bayesian Approach

Instead of estimating parameters strictly from the data, we could start with some prior belief for each. For example, we could use Laplace estimates:

Pr(a) = (n_a + 1) / Σ_i (n_i + 1)

where n_i represents the number of occurrences of character i and the +1 is a pseudocount. Using Laplace estimates with the sequences

gccgcgcttg
gcttggtggc
tggccgttgc

Pr(a) = 1/34   Pr(c) = 10/34   Pr(g) = 14/34   Pr(t) = 9/34

A Bayesian Approach

A more general form: m-estimates

Pr(a) = (n_a + p_a m) / (N + m)

where p_a is the prior probability of a and m is the number of "virtual" instances. With m = 8 and uniform priors, and the sequences

gccgcgcttg
gcttggtggc
tggccgttgc

Pr(a) = (0 + 0.25 × 8) / (30 + 8) = 2/38 ≈ 0.053   Pr(c) = (9 + 0.25 × 8) / (30 + 8) = 11/38 ≈ 0.289
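
Both estimators fit one formula; a minimal sketch, where the Laplace estimate is the special case m = 4 with uniform priors:

```python
from collections import Counter

seqs = ["gccgcgcttg", "gcttggtggc", "tggccgttgc"]
counts = Counter("".join(seqs))
total = sum(counts.values())                       # 30

def m_estimate(base, m, prior=0.25):
    """(n_a + p_a * m) / (N + m); Laplace is m = 4 with uniform priors."""
    return (counts[base] + prior * m) / (total + m)

print(m_estimate("a", m=4))   # Laplace: 1/34 ~ 0.029
print(m_estimate("a", m=8))   # m-estimate: 2/38 ~ 0.053
```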

Estimation for 1st-Order Probabilities

To estimate a 1st-order parameter, such as Pr(c | g), we count the number of times that c follows the history g in our given sequences. Using Laplace estimates with the sequences

gccgcgcttg
gcttggtggc
tggccgttgc

(g is followed by a 0 times, by c 7 times, by g 3 times, and by t 2 times, for 12 transitions in all):

Pr(a | g) = (0 + 1) / (12 + 4)   Pr(c | g) = (7 + 1) / (12 + 4) = 0.5
Pr(g | g) = (3 + 1) / (12 + 4)   Pr(t | g) = (2 + 1) / (12 + 4)
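
A sketch of the same counting for first-order parameters, with one pseudocount per possible successor:

```python
from collections import defaultdict

seqs = ["gccgcgcttg", "gcttggtggc", "tggccgttgc"]

pair_counts = defaultdict(lambda: defaultdict(int))
for s in seqs:
    for prev, curr in zip(s, s[1:]):     # count curr following history prev
        pair_counts[prev][curr] += 1

def laplace_cond(curr, prev, alphabet="acgt"):
    """Pr(curr | prev) with a pseudocount of 1 for each possible successor."""
    n_prev = sum(pair_counts[prev].values())
    return (pair_counts[prev][curr] + 1) / (n_prev + len(alphabet))

print(laplace_cond("c", "g"))            # (7 + 1) / (12 + 4) = 0.5
```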

Example Application: CpG Islands

– CG dinucleotides are rarer in eukaryotic genomes than expected given the marginal probabilities of C and G
– but the regions upstream of genes are richer in CG dinucleotides than elsewhere; these are called CpG islands
– CpG islands are useful evidence for finding genes
We could predict CpG islands with Markov chains:
– one to represent CpG islands
– one to represent the rest of the genome

Markov Chains for Discrimination

Suppose we want to distinguish CpG islands from other sequence regions. Given sequences from CpG islands and sequences from other regions, we can construct:
– a model to represent CpG islands
– a null model to represent the other regions
We can then score a test sequence x by the log-odds ratio (see the sketch below):

score(x) = log [ Pr(x | CpG model) / Pr(x | null model) ]
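
A minimal scoring sketch: each model argument is a transition table of the kind estimated above, with "B" standing for the begin state; the actual tables are assumed to come from CpG-island and background training sequences.

```python
import math

def log_odds(x, cpg_model, null_model):
    """score(x) = log Pr(x | CpG model) - log Pr(x | null model)."""
    score, prev = 0.0, "B"
    for curr in x:
        score += math.log(cpg_model[prev][curr]) - math.log(null_model[prev][curr])
        prev = curr
    return score  # > 0 favors the CpG model
```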

Markov Chains for Discrimination

Parameters estimated for the CpG and null models:
– human sequences containing 48 CpG islands
– about 60,000 nucleotides
The transition matrices below are the values given in Durbin et al., Biological Sequence Analysis (rows are the current base, columns the next base):

CpG (+) model:
       a      c      g      t
  a  0.180  0.274  0.426  0.120
  c  0.171  0.368  0.274  0.188
  g  0.161  0.339  0.375  0.125
  t  0.079  0.355  0.384  0.182

null (−) model:
       a      c      g      t
  a  0.300  0.205  0.285  0.210
  c  0.322  0.298  0.078  0.302
  g  0.248  0.246  0.298  0.208
  t  0.177  0.239  0.292  0.292

Markov Chains for Discrimination

[Figure: histogram of log-odds scores; light bars represent negative sequences, dark bars represent positive sequences. The actual figure here is not from a CpG island discrimination task, however.]
Figure from A. Krogh, "An Introduction to Hidden Markov Models for Biological Sequences," in Computational Methods in Molecular Biology, Salzberg et al., editors, 1998.

Markov Chains for Discrimination

Why use this score? Bayes' rule tells us

Pr(CpG | x) = Pr(x | CpG) Pr(CpG) / Pr(x)
Pr(null | x) = Pr(x | null) Pr(null) / Pr(x)

If we're not taking into account the prior probabilities of the two classes (Pr(CpG) and Pr(null)), then we just need to compare Pr(x | CpG) and Pr(x | null).

Higher Order Markov Chains

The Markov property specifies that the probability of a state depends only on the previous state:

Pr(x_i | x_{i-1}, …, x_1) = Pr(x_i | x_{i-1})

But we can build more "memory" into our states by using a higher order Markov model. In an nth order Markov model:

Pr(x_i | x_{i-1}, …, x_1) = Pr(x_i | x_{i-1}, …, x_{i-n})

Selecting the Order of a Markov Chain Model

Higher order models remember more "history", and additional history can have predictive value.
Example:
– predict the next word in this sentence fragment: "…finish __" (up, it, first, last, …?)
– now predict it given more history: "Nice guys finish __"

Selecting the Order of a Markov Chain Model

But the number of parameters we need to estimate grows exponentially with the order:
– for modeling DNA, we need on the order of 4^(n+1) parameters for an nth order model
The higher the order, the less reliable we can expect our parameter estimates to be:
– estimating the parameters of a 2nd order Markov chain from the complete genome of E. coli, we'd see each (n+1)-mer > 72,000 times on average
– estimating the parameters of a 9th order chain, we'd see each (n+1)-mer ~ 4 times on average

Higher Order Markov Chains

An nth order Markov chain over some alphabet A is equivalent to a first order Markov chain over the alphabet A^n of n-tuples.
Example: a 2nd order Markov model for DNA can be treated as a 1st order Markov model over the alphabet AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT.
Caveat: we still process a sequence one character at a time; e.g., the sequence ACGGT corresponds to the state path AC → CG → GG → GT.
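
A sketch of training an nth order chain this way: the states are the n-character histories, and we still slide along the sequence one character at a time.

```python
from collections import defaultdict

def nth_order_counts(seqs, n):
    """Count (n-character history -> next character) transitions."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(n, len(s)):
            counts[s[i - n:i]][s[i]] += 1     # history is the previous n chars
    return counts

counts = nth_order_counts(["acggtacggt"], 2)
print(dict(counts["ac"]))                      # {'g': 2}: g follows 'ac' twice
```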

A Fifth Order Markov Chain

[Figure: a fifth order chain drawn as a first order chain over 5-mers (states GCTAC, AAAAA, TTTTT, …). From state GCTAC there are transitions to CTACA, CTACC, CTACG, and CTACT, with probabilities Pr(A | GCTAC), Pr(C | GCTAC), etc.; the begin state enters GCTAC with probability Pr(GCTAC).]

Inhomogeneous Markov Chains

In the Markov chain models we have considered so far, the probabilities do not depend on where we are in a given sequence. In an inhomogeneous Markov model, we can have different distributions at different positions in the sequence. This is equivalent to expanding the state space so that states keep track of which position they correspond to.
Consider modeling codons in protein coding regions.

An Inhomogeneous Markov Chain

[Figure: three copies of the A, C, G, T states, one for each codon position (pos 1, pos 2, pos 3); the begin state feeds into position 1, position-1 states transition to position-2 states, position 2 to position 3, and position 3 back to position 1.]
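
A sketch of collecting position-specific counts for such a chain, assuming each training sequence starts in frame; counts[k] holds transitions into codon position k.

```python
from collections import defaultdict

def codon_position_counts(coding_seqs):
    """First-order transition counts, kept separately per codon position."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]
    for s in coding_seqs:
        for i in range(1, len(s)):
            pos = i % 3                        # codon position (0-based) of s[i]
            counts[pos][s[i - 1]][s[i]] += 1
    return counts

counts = codon_position_counts(["atgccgaaatcg"])
print(dict(counts[0]["g"]))   # successors of g landing in 1st codon positions
```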

An Application of Markov Chain Models: Finding Genes

– build Markov models of coding and noncoding regions
– apply the models to ORFs or fixed-sized windows of sequence
e.g., GeneMark [Borodovsky et al.]
– a popular system for identifying genes in bacterial genomes
– uses 5th order inhomogeneous Markov chain models

Open Reading Frames (ORFs)

An open reading frame is a sequence that could potentially encode a protein:
– it begins with a potential start codon
– it ends with a potential stop codon
– it has no internal in-frame stop codons
– it satisfies some minimum length requirement
Example (ATG marks the start codon, TAG the stop codon; see the sketch below):

ATG CCG AAA TCG CAT GCA TTT GTG … TAG
start                              stop
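
A single-strand ORF scanner as a sketch; real gene finders also scan the reverse complement and may allow alternative start codons.

```python
def find_orfs(seq, min_len=60):
    """Return (start, end) of ATG..stop stretches with no internal in-frame stop."""
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # potential start codon
            elif start is not None and codon in stops:
                if i + 3 - start >= min_len:   # minimum length requirement
                    orfs.append((start, i + 3))
                start = None
    return orfs

print(find_orfs("ATGCCGAAATCGCATGCATTTGTGTAG", min_len=9))  # [(0, 27)]
```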

Approaches to Finding Genes

– search by sequence similarity: find genes by looking for matches to sequences that are known to be related to genes
– search by signal: find genes by identifying the sequence signals involved in gene expression
– search by content: find genes by statistical properties that distinguish protein-coding DNA from noncoding DNA
– combined: state-of-the-art systems for gene finding combine these strategies

Gene Finding: Search by Content

Encoding a protein affects the statistical properties of a DNA sequence:
– some amino acids are used more frequently than others (Leu more popular than Trp)
– different numbers of codons for different amino acids (Leu has 6, Trp has 1)
– for a given amino acid, usually one codon is used more frequently than others; this is termed codon preference, and these preferences vary by species

Codon Preference in E. coli

AA    codon   freq per 1000
Gly   GGG     1.89
Gly   GGA     0.44
Gly   GGU     …
Gly   GGC     …
Glu   GAG     …
Glu   GAA     …
Asp   GAU     …
Asp   GAC     43.26
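
A sketch of how per-1000 codon frequencies like those above could be tabulated from a set of in-frame coding sequences:

```python
from collections import Counter

def codon_usage_per_1000(coding_seqs):
    """Codon frequencies per 1000 codons, pooled over in-frame sequences."""
    counts = Counter()
    for s in coding_seqs:
        counts.update(s[i:i + 3] for i in range(0, len(s) - 2, 3))
    total = sum(counts.values())
    return {codon: 1000 * n / total for codon, n in counts.items()}

usage = codon_usage_per_1000(["ATGCCGAAATCGCATGCATTTGTGTAG"])
print(usage["ATG"])   # each of the 9 distinct codons here gets ~111.1
```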

Base Composition for Various Organisms

[Table not reproduced in the transcript.]

Reading Frames

A given sequence may encode a protein in any of the six reading frames (three on each strand):

5'– G C T A C G G A G C T T C G G A G C –3'
3'– C G A T G C C T C G A A G C C T C G –5'

Markov Models & Reading Frames

Consider modeling a given coding sequence. For each "word" we evaluate, we'll want to consider its position with respect to the reading frame we're assuming. For example, in the sequence GCTACGGAGCTTCGGAGC, with the reading frame starting at the first base:
– GCTACG: the final G is in the 3rd codon position
– CTACGG: the final G is in the 1st position
– TACGGA: the final A is in the 2nd position

A Fifth Order Inhomogeneous Markov Chain

[Figure: three copies of the 5-mer state space, one per codon position. For example, from state GCTAC in position 1 there are transitions to states CTACA, CTACC, CTACG, and CTACT in position 2, and from those on to position-3 states (TACAA, TACAC, TACAG, TACAT, …); the start state enters with probability Pr(GCTAC).]