Algorithms in Computational Biology
Markov Chains and Hidden Markov Models
Department of Mathematics & Computer Science

Example: CpG Islands

- Dinucleotide CG (written CpG to distinguish it from a C-G base pair)
- The C within a CpG is typically methylated
- Methyl-C is more likely to mutate to T
- CpG dinucleotides are therefore rarer in the genome than would be expected from the independent probabilities of C and G
- The methylation process is suppressed in short stretches of the genome
- There are more CpG dinucleotides in the promoter regions of genes
- CpG islands are regions with many CpGs, typically a few hundred to a few thousand bases long

Questions about CpG Islands

- Given a short stretch of genomic sequence, how would we decide whether it comes from a CpG island or not?
- Given a long piece of sequence, how would we find the CpG islands in it, if there are any?

Markov Chains

[Diagram: a four-state Markov chain over the nucleotides A, C, G, T, with transition edges between the states]

Key Property of a Markov Chain

The probability of each symbol x_i depends only on the value of the preceding symbol x_{i-1}.
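As a concrete illustration (not on the original slide), here is a minimal sketch of scoring a sequence under a first-order Markov chain via the chain rule P(x) = P(x_1) Π_i a_{x_{i-1} x_i}; the transition values below are made-up placeholders, not trained CpG parameters:

```python
import math

def markov_log_prob(seq, start, trans):
    """log P(x) = log P(x_1) + sum_i log a(x_{i-1}, x_i)
    under a first-order Markov chain."""
    logp = math.log(start[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])
    return logp

# Illustrative placeholder parameters: uniform start, CpG slightly suppressed.
start = {b: 0.25 for b in "ACGT"}
trans = {p: {c: 0.25 for c in "ACGT"} for p in "ACGT"}
trans["C"] = {"A": 0.30, "C": 0.30, "G": 0.10, "T": 0.30}
print(markov_log_prob("ACGCGT", start, trans))
```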

Modeling the Beginning and End of Sequences

[Diagram: the A, C, G, T chain extended with a begin state B and an end state E]

Using Markov Chains for Discrimination

Two chains are trained: a CpG island model ('+') and a non-CpG island model ('-').

[Table: the 4x4 transition matrices a+_st and a-_st over A, C, G, T for the two models; the numeric entries did not survive the transcript]

Using Markov Chains for Discrimination (cont.)

For discrimination, the log-odds ratio is calculated:

S(x) = log [ P(x | model+) / P(x | model-) ] = Σ_{i=1..L} log ( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} )

[Table: the 4x4 matrix of per-dinucleotide log-odds terms over A, C, G, T; entries not recoverable from the transcript]
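A sketch of the scoring step, assuming two trained transition matrices whose entries did not survive the transcript (the values below are placeholders, not the deck's trained parameters):

```python
import math

def log_odds(seq, trans_plus, trans_minus):
    """S(x) = sum_i log( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} ).
    Positive scores favor the CpG island model."""
    return sum(math.log(trans_plus[p][c] / trans_minus[p][c])
               for p, c in zip(seq, seq[1:]))

# Placeholder matrices; in practice both are estimated from labeled data.
trans_plus  = {p: {c: 0.25 for c in "ACGT"} for p in "ACGT"}
trans_minus = {p: {c: 0.25 for c in "ACGT"} for p in "ACGT"}
trans_plus["C"]  = {"A": 0.25, "C": 0.25, "G": 0.35, "T": 0.15}
trans_minus["C"] = {"A": 0.30, "C": 0.30, "G": 0.10, "T": 0.30}
print(log_odds("GCGCGC", trans_plus, trans_minus))
```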

Histogram of Length-Normalized Scores

[Figure: histograms of length-normalized log-odds scores for sequences from CpG islands and from non-CpG islands]

Locating CpG Islands in a DNA Sequence

- Input: a long DNA sequence X = (x_1, x_2, ..., x_L) ∈ Σ*
- Output: the CpG islands along X
- Use the two Markov chain models
- Calculate the log-odds score for a window of length k (e.g., 100)
- A total of L - k + 1 scores will be computed and plotted
- CpG islands will stand out with positive values
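A sketch of the windowing step, reusing the log_odds function from the previous sketch; the window length k = 100 follows the slide:

```python
def window_scores(seq, trans_plus, trans_minus, k=100):
    """Length-normalized log-odds score for every window of length k;
    returns the L - k + 1 scores, positive ones suggesting CpG islands."""
    return [log_odds(seq[i:i + k], trans_plus, trans_minus) / k
            for i in range(len(seq) - k + 1)]
```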

Problems with Markov Chain Models in Locating CpG Islands

- CpG islands have sharp boundaries
- CpG islands have variable lengths
- These problems can be better addressed by building a single model for the entire sequence that incorporates both Markov chains (the '+' and '-' models): a hidden Markov model

Formal Definition of an HMM

A hidden Markov model is a triplet M = (Σ, Q, Θ), where:
- Σ is an alphabet of symbols
- Q is a finite set of states, capable of emitting symbols from the alphabet Σ
- Θ is a set of probabilities, comprised of:
  - state transition probabilities, denoted by a_kl for each k, l ∈ Q
  - emission probabilities, denoted by e_k(b) for each state k ∈ Q and b ∈ Σ

Formal Definition of an HMM (cont.)

- A state sequence, or path, is π = (π_1, π_2, ..., π_L); it follows a simple Markov chain (the probability of a state depends only on the previous state)
- State transition probability: a_kl = P(π_i = l | π_{i-1} = k)
- Emission probability: given a sequence X = (x_1, x_2, ..., x_L), the emission probability e_k(b) is defined as e_k(b) = P(x_i = b | π_i = k)
- The probability that the sequence X was generated by M given the path π is:

P(x, π) = a_{0 π_1} Π_{i=1..L} e_{π_i}(x_i) a_{π_i π_{i+1}}

(where π_0 = 0 is the begin state and π_{L+1} = 0 the end state)
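The joint probability translates directly into code. A minimal sketch in which a start distribution plays the role of the begin transitions a_{0k} and the end transition is omitted (an assumption, since many HMMs are written without an explicit end state):

```python
import math

def joint_log_prob(x, path, start, trans, emit):
    """log P(x, pi): begin transition, then alternate emissions and
    transitions along the path."""
    logp = math.log(start[path[0]])
    for i, (sym, state) in enumerate(zip(x, path)):
        logp += math.log(emit[state][sym])
        if i + 1 < len(path):
            logp += math.log(trans[state][path[i + 1]])
    return logp
```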

An HMM for Detecting CpG Islands in a Long DNA Sequence

- Alphabet: Σ = {A, C, G, T}
- States: Q = {A+, C+, G+, T+, A-, C-, G-, T-}
- Emissions:
  State:          A+ C+ G+ T+ A- C- G- T-
  Emitted symbol: A  C  G  T  A  C  G  T
- The emission probability of each state X+ and X- is 1 for emitting symbol X and 0 for emitting any other symbol (a special feature of this HMM)

Transition Matrix for the CpG Island HMM

[Table: the 8x8 transition matrix over the states A+, ..., T+, A-, ..., T-; the numeric entries did not survive the transcript]

p is the probability of staying in a CpG island, and q is the probability of staying in a non-CpG island.

Occasionally Dishonest Casino Dealer

- In a casino, a dealer uses a fair die most of the time, but occasionally switches to a loaded die
- The loaded die has probability 0.5 for a six and probability 0.1 for each of the numbers one to five
- The dealer switches from the fair to the loaded die with probability 0.05 before each roll, and switches back with probability 0.1
- In each state of the Markov process the outcomes of a roll have different probabilities, so the process can be modeled using an HMM

HMM for the Occasionally Dishonest Casino Dealer

- Q = {F, L} (fair, loaded)
- Σ = {1, 2, 3, 4, 5, 6}
- What is hidden? Which die is being rolled.
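The casino model can be written down directly from the slide; a sketch (the uniform start distribution is an assumption, since the slide does not say how the first die is chosen):

```python
STATES = ["F", "L"]                      # F = fair die, L = loaded die (hidden)
START  = {"F": 0.5, "L": 0.5}            # assumed uniform initial choice
TRANS  = {"F": {"F": 0.95, "L": 0.05},   # fair -> loaded with probability 0.05
          "L": {"F": 0.10, "L": 0.90}}   # loaded -> fair with probability 0.10
EMIT   = {"F": {r: 1 / 6 for r in "123456"},
          "L": {"1": 0.1, "2": 0.1, "3": 0.1, "4": 0.1, "5": 0.1, "6": 0.5}}
```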

HMMs Generate Sequences

To generate a sequence from an HMM:
- Choose π_1 according to the probabilities a_{0i}
- An observation x_1 is emitted according to the probabilities e_{π_1}
- Choose π_2 according to the probabilities a_{π_1 i}
- An observation x_2 is emitted according to the probabilities e_{π_2}
- And so forth...

P(x) is the probability that sequence x was generated by the model. The joint probability of an observed sequence x and a state sequence π is:

P(x, π) = a_{0 π_1} Π_{i=1..L} e_{π_i}(x_i) a_{π_i π_{i+1}}
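A minimal sampler for the generative procedure above, usable with the casino parameters from the previous sketch:

```python
import random

def draw(dist):
    """Sample a key from a {outcome: probability} dict."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def sample(length, start, trans, emit):
    """Generate (observations, hidden path) by walking the HMM."""
    path = [draw(start)]
    for _ in range(length - 1):
        path.append(draw(trans[path[-1]]))
    obs = [draw(emit[s]) for s in path]
    return "".join(obs), "".join(path)

# e.g. rolls, dice = sample(20, START, TRANS, EMIT)
```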

Most Probable State Path

- A CpG island example: the sequence CGCG can be emitted by (C+, G+, C+, G+), (C-, G-, C-, G-), (C+, G-, C+, G-), among other paths
- Which state sequence is most likely for the observation?
- The most probable path is defined as:

π* = argmax_π P(x, π)

- If the probability v_k(i) of the most probable path ending in state k at position i is known for all states k, then v_l(i+1) is defined recursively:

v_l(i+1) = e_l(x_{i+1}) max_k ( v_k(i) a_kl )

Finding the Most Probable Path Using the Viterbi Algorithm

- Initialization (i = 0): v_0(0) = 1, v_k(0) = 0 for k > 0
- Recursion (i = 1...L):
  v_l(i) = e_l(x_i) max_k ( v_k(i-1) a_kl )
  ptr_i(l) = argmax_k ( v_k(i-1) a_kl )
- Termination:
  P(x, π*) = max_k ( v_k(L) a_k0 )
  π*_L = argmax_k ( v_k(L) a_k0 )
- Traceback (i = L...1): π*_{i-1} = ptr_i(π*_i)

Viterbi Example

Most probable path for the sequence CGCG.

[Table: Viterbi values v_k(i) with rows B, A+, C+, G+, T+, A-, C-, G-, T- and columns for C, G, C, G; only the begin row (1, 0, 0, 0, 0) is recoverable from the transcript]

Sequence of Die Rolls Predicted by the Viterbi Algorithm

[Figure: a sequence of die rolls annotated with the die (fair or loaded) predicted by the Viterbi algorithm]

Finding the Probability of a Sequence for an HMM: the Forward Algorithm

Definition: f_k(i) = P(x_1 ... x_i, π_i = k), the probability of the observed sequence up to and including x_i, requiring that π_i = k.

Algorithm:
- Initialization (i = 0): f_0(0) = 1, f_k(0) = 0 for k > 0
- Recursion (i = 1...L): f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl
- Termination: P(x) = Σ_k f_k(L) a_k0
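A sketch of the forward algorithm, unscaled for clarity (see the numerical-stability slides at the end) and with begin/end transitions replaced by a start distribution:

```python
def forward(x, states, start, trans, emit):
    """f_k(i) = P(x_1..x_i, pi_i = k); returns the table and P(x)."""
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for sym in x[1:]:
        f.append({l: emit[l][sym] * sum(f[-1][k] * trans[k][l] for k in states)
                  for l in states})
    px = sum(f[-1][k] for k in states)   # termination, end transitions omitted
    return f, px
```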

Posterior State Probability

- We want to know the most probable state for an observation x_i
- We need to find the probability that observation x_i came from each state k, given the observed sequence:

P(π_i = k | x) = f_k(i) b_k(i) / P(x)

where b_k(i) = P(x_{i+1} ... x_L | π_i = k) is computed by the backward algorithm.

Finding b_k(i) Using the Backward Algorithm

- Initialization (i = L): b_k(L) = a_k0 for all k
- Recursion (i = L-1, ..., 1): b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i+1)
- Termination: P(x) = Σ_l a_0l e_l(x_1) b_l(1)
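A matching sketch of the backward algorithm; since these sketches omit an explicit end state, b_k(L) is initialized to 1 rather than a_k0:

```python
def backward(x, states, trans, emit):
    """b_k(i) = P(x_{i+1}..x_L | pi_i = k), filled from the right."""
    L = len(x)
    b = [None] * L
    b[L - 1] = {k: 1.0 for k in states}
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l]
                       for l in states)
                for k in states}
    return b
```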

Posterior Decoding

- Approach 1: define a path of individually most probable states,
  π̂_i = argmax_k P(π_i = k | x)
- Approach 2: for a function g(k) on the states, define
  G(i | x) = Σ_k P(π_i = k | x) g(k)
- E.g., to find the posterior probability according to the model that base i is in a CpG island, we can let
  g(k) = 1 for k ∈ {A+, C+, G+, T+}
  g(k) = 0 for k ∈ {A-, C-, G-, T-}
  Then G(i | x) is precisely that posterior probability.
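Combining the forward and backward sketches above gives posterior decoding; a minimal version:

```python
def posterior(x, states, start, trans, emit):
    """P(pi_i = k | x) = f_k(i) * b_k(i) / P(x) for every position i."""
    f, px = forward(x, states, start, trans, emit)
    b = backward(x, states, trans, emit)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]

# Approach 1: pick the individually most probable state at each position, e.g.
# decoded = "".join(max(p, key=p.get)
#                   for p in posterior(rolls, STATES, START, TRANS, EMIT))
```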

Use of Posterior Decoding

[Figure: shaded areas show when the roll was generated by the loaded die]

Parameter Estimation for HMMs

- Model specification
  - Structure design: what states there are and how they are connected
  - Assignment of parameter values: transition probabilities a_kl and emission probabilities e_k(b)
- Estimation framework
  - Training sequences x^1, ..., x^n
  - Work in log space

Estimation When the State Sequence Is Known

When the paths are known, count the number of times A_kl that each transition k → l is used and the number of times E_k(b) that symbol b is emitted from state k in the training data, and take the maximum likelihood estimates:

a_kl = A_kl / Σ_{l'} A_kl'        e_k(b) = E_k(b) / Σ_{b'} E_k(b')
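A sketch of this counting-based estimation on fully labeled (observation, path) pairs, with pseudocounts to avoid zero probabilities on small training sets:

```python
from collections import Counter

def estimate_known_paths(pairs, states, symbols, r=1.0):
    """ML estimates a_kl = A_kl / sum A_kl', e_k(b) = E_k(b) / sum E_k(b')
    from (observation, path) string pairs, with pseudocount r."""
    A, E = Counter(), Counter()
    for x, pi in pairs:
        A.update(zip(pi, pi[1:]))          # transition counts A_kl
        E.update(zip(pi, x))               # emission counts E_k(b)
    trans = {k: {l: (A[k, l] + r) / sum(A[k, m] + r for m in states)
                 for l in states} for k in states}
    emit = {k: {b: (E[k, b] + r) / sum(E[k, c] + r for c in symbols)
                for b in symbols} for k in states}
    return trans, emit
```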

Estimation When Paths Are Unknown

- Baum (1971)
- Calculate A_kl and E_k(b) as the expected number of times each transition or emission is used, given the training sequences
- Subject to local maxima; the result depends on the starting values of the parameters
- The probability that a_kl is used at position i in sequence x is:

P(π_i = k, π_{i+1} = l | x, θ) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)

Expected Transition and Emission Counts

The expected number of times that a_kl is used is obtained by summing over all positions and over all training sequences:

A_kl = Σ_j (1 / P(x^j)) Σ_i f_k^j(i) a_kl e_l(x^j_{i+1}) b_l^j(i+1)

The expected number of times that letter b appears in state k:

E_k(b) = Σ_j (1 / P(x^j)) Σ_{i: x^j_i = b} f_k^j(i) b_k^j(i)

Baum-Welch Training (an EM Algorithm)

- Initialization: pick arbitrary model parameters
- Recurrence:
  - Set all the A and E variables to their pseudocount values r (or to zero)
  - For each sequence j = 1...n:
    - Calculate f_k(i) for sequence j using the forward algorithm
    - Calculate b_k(i) for sequence j using the backward algorithm
    - Add the contribution of sequence j to A and E
  - Calculate the new model parameters
  - Calculate the new log likelihood of the model
- Termination: stop if the change in log likelihood is less than some predefined threshold or the maximum number of iterations is exceeded
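One iteration of this loop, reusing the forward and backward sketches above; wrap it in a loop that stops when the log-likelihood change falls below a threshold:

```python
import math

def baum_welch_step(seqs, states, start, trans, emit, r=0.01):
    """One Baum-Welch iteration: accumulate expected counts A_kl and E_k(b)
    over all training sequences, then renormalize.  Returns the new
    (trans, emit) and the current log likelihood."""
    A = {k: {l: r for l in states} for k in states}          # pseudocounts
    E = {k: {b: r for b in emit[k]} for k in states}
    loglik = 0.0
    for x in seqs:
        f, px = forward(x, states, start, trans, emit)
        b = backward(x, states, trans, emit)
        loglik += math.log(px)
        for i in range(len(x) - 1):                          # expected transitions
            for k in states:
                for l in states:
                    A[k][l] += f[i][k] * trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] / px
        for i, sym in enumerate(x):                          # expected emissions
            for k in states:
                E[k][sym] += f[i][k] * b[i][k] / px
    new_trans = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_emit = {k: {s: E[k][s] / sum(E[k].values()) for s in E[k]} for k in states}
    return new_trans, new_emit, loglik
```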

Modeling of Labeled Sequences

- HMMs can be used to predict the labeling of unannotated sequences
- Training the HMMs separately:
  - Separately train the model for CpG islands and the model for non-CpG islands, then combine them into a larger HMM
  - Tedious, especially if there are more than two classes involved
- It would be nicer to estimate everything at once:
  - The training set includes all classes (e.g., CpG islands and non-CpG islands)
  - Each sequence is labeled with the corresponding classes
  - Let y = y_1, ..., y_L be the labels on the observation x = x_1, ..., x_L

Modeling of Labeled Sequences (cont.)

- The model can be estimated with a slight modification of the Baum-Welch algorithm
- Allow only valid paths through the model
- A valid path is one where the state labels and sequence labels agree, i.e., π_i has label y_i
- During the forward and backward algorithms this corresponds to setting f_l(i) = 0 and b_l(i) = 0 for all states l with a label different from y_i

Discriminative Estimation

- When modeling labeled sequences, the following likelihood is maximized:

Σ_j log P(x^j, y^j | θ)

- Since obtaining a good prediction of y is our primary interest, it is preferable to maximize the conditional maximum likelihood:

Σ_j log P(y^j | x^j, θ) = Σ_j [ log P(x^j, y^j | θ) - log P(x^j | θ) ]

where the first term is the probability calculated by the forward algorithm for the labeled sequences, and the second is the probability calculated by the forward algorithm disregarding all the labels.

HMM Model Structure

- Choice of model topology
  - A fully connected model causes local maxima
  - In practice, successful HMMs are constructed by carefully deciding which transitions are allowed in the model, based on knowledge about the problem under investigation
- Duration modeling
  - The probability of staying in a state decays exponentially with length (a geometric distribution): P(L) = (1 - p) p^(L-1), where p is the self-transition probability and 1 - p the probability of leaving the state
  - To model a more complex length distribution, introduce several states with the same distribution over residues and transitions between each other, e.g., to obtain a negative binomial length distribution

[Diagram: a chain of states, each with self-transition probability p and exit probability 1 - p]
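A small sketch of the two length distributions mentioned above: a single self-looping state gives a geometric distribution, while a chain of n identical states gives a negative binomial:

```python
from math import comb

def geometric_len(L, p):
    """P(L) = (1 - p) * p**(L - 1): one state with self-transition p."""
    return (1 - p) * p ** (L - 1)

def neg_binomial_len(L, p, n):
    """Length distribution of n chained states, each with self-loop p:
    P(L) = C(L-1, n-1) * (1 - p)**n * p**(L - n), for L >= n."""
    return comb(L - 1, n - 1) * (1 - p) ** n * p ** (L - n)
```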

Numerical Stability of HMM Algorithms

- Probabilities get too small when multiplying many probabilities in the Viterbi, forward, and backward algorithms
- Consequences of underflow error:
  - The program could crash
  - The program could keep running and produce arbitrarily wrong numbers

Improving Numerical Stability

- Log transform: work with log probabilities, so products become sums
- Scaling of probabilities: for each position i, define a scaling variable s_i and divide the forward and backward values by it so they stay within the floating-point range
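A sketch of the scaling idea: dividing each forward column by its sum s_i keeps the values in a safe floating-point range, and log P(x) is recovered as the sum of the log s_i:

```python
import math

def forward_scaled(x, states, start, trans, emit):
    """Scaled forward algorithm: returns log P(x) without underflow."""
    f = {k: start[k] * emit[k][x[0]] for k in states}
    logpx = 0.0
    for i in range(len(x)):
        if i > 0:
            f = {l: emit[l][x[i]] * sum(prev[k] * trans[k][l] for k in states)
                 for l in states}
        s = sum(f.values())                  # scaling variable s_i
        logpx += math.log(s)
        prev = {k: v / s for k, v in f.items()}
    return logpx
```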