
1 Algorithms in Computational Biology: Markov Chains and Hidden Markov Models (Department of Mathematics & Computer Science)

2 Example: CpG Islands
- The dinucleotide CG is written CpG to distinguish it from a C-G base pair.
- The C within a CpG is typically methylated, and methyl-C is more likely to mutate to T.
- CpG dinucleotides are therefore rarer in the genome than would be expected from the independent probabilities of C and G.
- The methylation process is suppressed in short stretches of the genome, notably the promoter regions of genes, so more CpG dinucleotides survive there.
- CpG islands are regions with many CpGs, typically a few hundred to a few thousand bases long.

3 Questions About CpG Islands
- Given a short stretch of genomic sequence, how would we decide if it comes from a CpG island or not?
- Given a long piece of sequence, how would we find the CpG islands in it, if there are any?

4 Markov Chains
[State diagram: four states A, C, G, T with transition arrows between them.]

5 Key Property of a Markov Chain
The probability of each symbol x_i depends only on the value of the preceding symbol x_{i-1}:
  P(x_i | x_1, ..., x_{i-1}) = P(x_i | x_{i-1})
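
To make the property concrete, here is a minimal sketch (not from the slides) of scoring a DNA sequence under a first-order Markov chain; the uniform transition values are placeholders.

```python
import math

# Hypothetical transition probabilities P(next base | current base);
# each row sums to 1. Real values would be estimated from data.
trans = {prev: {nxt: 0.25 for nxt in "ACGT"} for prev in "ACGT"}

def chain_log_prob(seq, trans, start=0.25):
    """log P(x) = log P(x_1) + sum over i of log a_{x_{i-1} x_i}."""
    p = math.log(start)  # assume a uniform start distribution
    for prev, cur in zip(seq, seq[1:]):
        p += math.log(trans[prev][cur])
    return p

print(chain_log_prob("CGCG", trans))
```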

6 Modeling the Beginning and End of Sequences
[State diagram: the A, C, G, T chain augmented with a begin state B and an end state E.]

7 Using Markov Chains for Discrimination
CpG island model (+):
  +     A      C      G      T
  A   0.180  0.274  0.426  0.120
  C   0.171  0.368  0.274  0.188
  G   0.161  0.339  0.375  0.125
  T   0.079  0.355  0.384  0.182

Non-CpG island model (-):
  -     A      C      G      T
  A   0.300  0.205  0.285  0.210
  C   0.322  0.298  0.078  0.302
  G   0.248  0.246  0.298  0.208
  T   0.177  0.239  0.292  0.292

(Rows give the current base, columns the next base; each row sums to 1.)

8 Cont'
For discrimination, the log-odds ratio is calculated:
  S(x) = log [ P(x | model+) / P(x | model-) ] = Σ_{i=1}^{L} log ( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} ) = Σ_{i=1}^{L} β_{x_{i-1} x_i}

  β      A       C       G       T
  A   -0.740   0.419   0.580  -0.803
  C   -0.913   0.302   1.812  -0.685
  G   -0.624   0.461   0.331  -0.730
  T   -1.169   0.573   0.393  -0.679
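
A sketch of the discrimination step in code, using the β entries above; positive scores favor the CpG island model.

```python
# Log-odds table beta from the slide (rows: previous base, cols: next base).
beta = {
    "A": {"A": -0.740, "C": 0.419, "G": 0.580, "T": -0.803},
    "C": {"A": -0.913, "C": 0.302, "G": 1.812, "T": -0.685},
    "G": {"A": -0.624, "C": 0.461, "G": 0.331, "T": -0.730},
    "T": {"A": -1.169, "C": 0.573, "G": 0.393, "T": -0.679},
}

def log_odds(seq):
    """S(x) = sum of beta[x_{i-1}][x_i] over consecutive pairs."""
    return sum(beta[a][b] for a, b in zip(seq, seq[1:]))

print(log_odds("CGCG"))  # strongly positive: CpG-island-like
print(log_odds("TATA"))  # negative: background-like
```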

9 Histogram of Length-Normalized Scores
[Histogram: length-normalized log-odds scores for non-CpG island sequences (clustered at negative values) and CpG islands (clustered at positive values).]

10 Locating CpG Islands in a DNA Sequence
Input: a long DNA sequence X = (x_1, x_2, ..., x_L) ∈ Σ*
Output: the CpG islands along X.
Method: use the Markov chain models.
- Calculate the log-odds score for a window of length k (e.g., 100) sliding along the sequence.
- A total of L - k + 1 scores will be computed and plotted.
- CpG islands will stand out with positive values.
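
A windowing sketch following the slide's description; the scorer is passed in (e.g., the log_odds function from the previous sketch), and k = 100 is the example window length.

```python
def window_scores(seq, score, k=100):
    """Length-normalized scores for each of the L - k + 1 windows."""
    return [score(seq[i:i + k]) / k for i in range(len(seq) - k + 1)]

# Usage (assuming log_odds from the previous sketch):
#   scores = window_scores(genome, log_odds, k=100)
# Windows with positive scores are candidate CpG islands.
```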

11 Problems with Markov Chain Models in Locating CpG Islands
- CpG islands have sharp boundaries, which fixed windows blur.
- CpG islands have variable lengths, so no single window size fits well.
- These problems are better addressed by building a single model for the entire sequence that incorporates both Markov chains.

12 Formal Definition of an HMM
A hidden Markov model is a triplet M = (Σ, Q, Θ), where
- Σ is an alphabet of symbols
- Q is a finite set of states, capable of emitting symbols from the alphabet Σ
- Θ is a set of probabilities, comprising:
  - state transition probabilities, denoted a_kl for each k, l ∈ Q
  - emission probabilities, denoted e_k(b) for each state k ∈ Q and symbol b ∈ Σ

13 Cont'
State sequence (path): π = (π_1, π_2, ..., π_L)
- The path follows a simple Markov chain: the probability of a state depends only on the previous state.
- State transition probability: a_kl = P(π_i = l | π_{i-1} = k)
- Emission probability: given a sequence X = (x_1, x_2, ..., x_L), e_k(b) = P(x_i = b | π_i = k)
- The probability that the sequence X was generated by M given the path π is:
  P(x | π) = ∏_{i=1}^{L} e_{π_i}(x_i)

14 An HMM for Detecting CpG Islands in a Long DNA Sequence
Alphabet: Σ = {A, C, G, T}
States: Q = {A+, C+, G+, T+, A-, C-, G-, T-}
Emissions:
  State:          A+ C+ G+ T+ A- C- G- T-
  Emitted symbol: A  C  G  T  A  C  G  T
The emission probability of each state X+ and X- is 1 for emitting symbol X and 0 for emitting other symbols (a special feature of this HMM).

15 Transition Matrix for CpG Island HMM
[8x8 transition matrix over the + and - states.]
p is the probability of staying in a CpG island, and q is the probability of staying in a non-CpG island.

16 Occasionally Dishonest Casino Dealer
In a casino, a dealer uses a fair die most of the time, but occasionally he switches to a loaded die. The loaded die has probability 0.5 for a six and probability 0.1 for each of the numbers one to five. The dealer switches from the fair to the loaded die with probability 0.05 before each roll, and switches back with probability 0.1. In each state of the Markov process the outcomes of a roll have different probabilities, so the process can be modeled using an HMM.

17 HMM for the Occasionally Dishonest Casino Dealer
Q = {F, L} (fair, loaded)
Σ = {1, 2, 3, 4, 5, 6}
What is hidden? The state sequence, i.e., which die was used for each roll; only the rolls themselves are observed.
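
The casino HMM written out as plain probability tables (a sketch; the state names F and L and the dict layout are my choices):

```python
states = ["F", "L"]  # fair die, loaded die

# Transition probabilities from the previous slide: switch fair -> loaded
# with probability 0.05, loaded -> fair with probability 0.1.
trans = {"F": {"F": 0.95, "L": 0.05},
         "L": {"F": 0.10, "L": 0.90}}

# Emission probabilities: the fair die is uniform; the loaded die rolls
# a six with probability 0.5 and each of one to five with probability 0.1.
emit = {"F": {r: 1 / 6 for r in "123456"},
        "L": {"1": 0.1, "2": 0.1, "3": 0.1, "4": 0.1, "5": 0.1, "6": 0.5}}
```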

18 HMMs Generate Sequences
Generating a sequence from an HMM:
- Choose π_1 according to the probabilities a_0i.
- An observation x_1 is emitted according to the probabilities e_{π_1}.
- Choose π_2 according to the probabilities a_{π_1 i}.
- An observation x_2 is emitted according to the probabilities e_{π_2}.
- And so forth...
P(x) is the probability that sequence x was generated by the model. The joint probability of an observed sequence x and a state sequence π is:
  P(x, π) = a_{0 π_1} ∏_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}}   (with π_{L+1} = 0 for the end state)
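
A minimal generator following this recipe, reusing the casino tables from the previous sketch; the uniform start distribution a_0i used here is an assumption.

```python
import random

def generate(length, states, start, trans, emit):
    """Sample (observations, path) from the HMM, one state and emission at a time."""
    path, obs = [], []
    state = random.choices(states, weights=[start[k] for k in states])[0]
    for _ in range(length):
        path.append(state)
        symbols = list(emit[state])
        obs.append(random.choices(symbols,
                                  weights=[emit[state][s] for s in symbols])[0])
        state = random.choices(states,
                               weights=[trans[state][l] for l in states])[0]
    return "".join(obs), "".join(path)

rolls, dice = generate(20, states, {"F": 0.5, "L": 0.5}, trans, emit)
print(rolls)
print(dice)
```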

19 Most Probable State Path
A CpG island example: the sequence CGCG can be emitted by, among others:
  (C+, G+, C+, G+), (C-, G-, C-, G-), (C+, G-, C+, G-)
Which state sequence is most likely for the observation?
The most probable path is defined as:
  π* = argmax_π P(x, π)
If the probability v_k(i) of the most probable path ending in state k after the first i observations is known for all states k, then:
  v_l(i+1) = e_l(x_{i+1}) max_k ( v_k(i) a_kl )
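
To compare candidate paths directly, one can evaluate P(x, π) for each; a sketch, where a start distribution stands in for a_{0k} and the end-state factor is omitted:

```python
def joint_prob(obs, path, start, trans, emit):
    """P(x, pi) = a_{0 pi_1} * product over i of e_{pi_i}(x_i) * a_{pi_{i-1} pi_i}."""
    p = start[path[0]] * emit[path[0]][obs[0]]
    for i in range(1, len(obs)):
        p *= trans[path[i - 1]][path[i]] * emit[path[i]][obs[i]]
    return p

# Evaluating joint_prob("CGCG", p, start, trans, emit) for each candidate
# path p (e.g., ["C+", "G+", "C+", "G+"]) and taking the maximum identifies
# the most probable of the enumerated paths.
```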

20 Finding the Most Probable Path Using the Viterbi Algorithm
Initialization (i = 0): v_0(0) = 1, v_k(0) = 0 for k > 0
Recursion (i = 1...L):
  v_l(i) = e_l(x_i) max_k ( v_k(i-1) a_kl )
  ptr_i(l) = argmax_k ( v_k(i-1) a_kl )
Termination:
  P(x, π*) = max_k ( v_k(L) a_k0 )
  π*_L = argmax_k ( v_k(L) a_k0 )
Traceback (i = L...1): π*_{i-1} = ptr_i(π*_i)
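
A direct transcription of the recursion, done in log space (anticipating the numerical-stability discussion on slide 38) and with the begin/end terms replaced by a plain start distribution, which is a simplification:

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Most probable path; assumes all probabilities used are nonzero
    (zeros would need explicit -inf handling)."""
    v = [{k: math.log(start[k]) + math.log(emit[k][obs[0]]) for k in states}]
    ptr = []
    for x in obs[1:]:
        row, back = {}, {}
        for l in states:
            # max over k of v_k(i-1) + log a_kl
            k_best = max(states, key=lambda k: v[-1][k] + math.log(trans[k][l]))
            back[l] = k_best
            row[l] = v[-1][k_best] + math.log(trans[k_best][l]) + math.log(emit[l][x])
        v.append(row)
        ptr.append(back)
    last = max(states, key=lambda k: v[-1][k])  # best final state
    path = [last]
    for back in reversed(ptr):                  # traceback
        path.append(back[path[-1]])
    return list(reversed(path)), v[-1][last]
```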

21 Viterbi Example
Most probable path for sequence CGCG (v_k(i) values):

  v     -     C      G      C      G
  B     1     0      0      0      0
  A+    0     0      0      0      0
  C+    0     0.13   0      0.011  0
  G+    0     0      0.034  0      0.003
  T+    0     0      0      0      0
  A-    0     0      0      0      0
  C-    0     0.13   0      0.002  0
  G-    0     0      0.010  0      0.0002
  T-    0     0      0      0      0

Tracing back from the largest final value (G+), the most probable path runs through the + states: C+, G+, C+, G+.

22 Sequence of Die Rolls Predicted by Viterbi Algorithm
[Figure: a sequence of casino rolls with the true die and the Viterbi-predicted die (fair/loaded) marked along the sequence.]

23 Finding the Probability of a Sequence for an HMM: the Forward Algorithm
Definition: f_k(i) = P(x_1 ... x_i, π_i = k), the probability of the observed sequence up to and including x_i, requiring that π_i = k.
Algorithm:
- Initialization (i = 0): f_0(0) = 1, f_k(0) = 0 for k > 0
- Recursion (i = 1...L): f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl
- Termination: P(x) = Σ_k f_k(L) a_k0
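
A sketch of the same recursion in code; as in the Viterbi sketch, a start distribution stands in for the begin state and the end-state factor a_k0 is dropped, which is an assumption.

```python
def forward_prob(obs, states, start, trans, emit):
    """P(x) summed over all paths: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl."""
    f = {k: start[k] * emit[k][obs[0]] for k in states}
    for x in obs[1:]:
        f = {l: emit[l][x] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    return sum(f.values())
```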

24 Posterior State Probability
We want to know the most probable state for an observation x_i. We need to find the probability that observation x_i came from each state k, given the observed sequence:
  P(π_i = k | x) = f_k(i) b_k(i) / P(x)
where b_k(i) = P(x_{i+1} ... x_L | π_i = k) is computed by the backward algorithm.

25 Finding b_k(i) Using the Backward Algorithm
- Initialization (i = L): b_k(L) = a_k0 for all k
- Recursion (i = L-1, ..., 1): b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i+1)
- Termination: P(x) = Σ_l a_0l e_l(x_1) b_l(1)
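
The corresponding sketch; with no explicit end state, the initialization b_k(L) = 1 replaces a_k0, an assumption consistent with the forward sketch above.

```python
def backward_table(obs, states, trans, emit):
    """Full backward table; b[i][k] corresponds to b_k(i+1) in the slide's
    1-based indexing."""
    b = [{k: 1.0 for k in states}]  # b_k(L) = 1
    for x in reversed(obs[1:]):     # x plays the role of x_{i+1}
        b.insert(0, {k: sum(trans[k][l] * emit[l][x] * b[0][l] for l in states)
                     for k in states})
    return b
```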

26 Posterior Decoding
Approach 1: assign each position the individually most probable state:
  π̂_i = argmax_k P(π_i = k | x)
Approach 2: for a function g(k) on the states, define
  G(i | x) = Σ_k P(π_i = k | x) g(k)
E.g., to find the posterior probability according to the model that base i is in a CpG island, let
  g(k) = 1 for k ∈ {A+, C+, G+, T+}
  g(k) = 0 for k ∈ {A-, C-, G-, T-}
Then G(i | x) is precisely the posterior probability that base i is in an island.
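
Putting forward and backward together for the posterior (a sketch; backward_table is the function from the previous sketch, and the full forward table is rebuilt here so the block otherwise stands alone):

```python
def posteriors(obs, states, start, trans, emit):
    """P(pi_i = k | x) = f_k(i) * b_k(i) / P(x) for every position i."""
    f = [{k: start[k] * emit[k][obs[0]] for k in states}]
    for x in obs[1:]:
        f.append({l: emit[l][x] * sum(f[-1][k] * trans[k][l] for k in states)
                  for l in states})
    b = backward_table(obs, states, trans, emit)
    px = sum(f[-1].values())
    return [{k: f[i][k] * b[i][k] / px for k in states}
            for i in range(len(obs))]

# For the CpG HMM, summing the posterior over {A+, C+, G+, T+} at each i
# gives the g-weighted quantity G(i | x) described above.
```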

27 Use of Posterior Decoding
[Figure: posterior probabilities plotted along a sequence of casino rolls; shaded areas show when the roll was generated by the loaded die.]

28 Parameter Estimation for HMMs
Model specification:
- Structure design: what states there are and how they are connected.
- Assignment of parameter values: transition probabilities a_kl and emission probabilities e_k(b).
Estimation framework:
- Training sequences x^1, ..., x^n.
- Work in log space.

29 Estimation When the State Sequence Is Known
Count the number of times A_kl that transition k to l is used and the number of times E_k(b) that symbol b is emitted from state k in the training data (plus pseudocounts, to avoid zero probabilities), then take the maximum likelihood estimates:
  a_kl = A_kl / Σ_{l'} A_{kl'}
  e_k(b) = E_k(b) / Σ_{b'} E_k(b')
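
A counting sketch of these estimates, with pseudocount r (function and variable names are my own):

```python
def estimate_known_paths(training, states, alphabet, r=1.0):
    """training: list of (observations, path) pairs with known states."""
    A = {k: {l: r for l in states} for k in states}    # transition counts
    E = {k: {b: r for b in alphabet} for k in states}  # emission counts
    for obs, path in training:
        for i, (x, k) in enumerate(zip(obs, path)):
            E[k][x] += 1
            if i + 1 < len(path):
                A[k][path[i + 1]] += 1
    trans = {k: {l: A[k][l] / sum(A[k].values()) for l in states}
             for k in states}
    emit = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet}
            for k in states}
    return trans, emit
```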

30 Estimation When Paths Are Unknown
Baum (1971):
- Calculate A_kl and E_k(b) as the expected number of times each transition or emission is used, given the training sequences.
- The method is subject to local maxima and depends on the starting values of the parameters.
The probability that transition a_kl is used at position i in sequence x is:
  P(π_i = k, π_{i+1} = l | x, Θ) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)

31 Expected Transition and Emission Counts
The expected number of times that a_kl is used is obtained by summing over all positions and over all training sequences:
  A_kl = Σ_j (1 / P(x^j)) Σ_i f_k^j(i) a_kl e_l(x^j_{i+1}) b_l^j(i+1)
The expected number of times that letter b appears in state k:
  E_k(b) = Σ_j (1 / P(x^j)) Σ_{i: x_i^j = b} f_k^j(i) b_k^j(i)

32 Baum-Welch Training (EM Algorithm)
Initialization: pick arbitrary model parameters.
Recurrence:
- Set all the A and E variables to their pseudocount values r (or to zero).
- For each sequence j = 1 ... n:
  - Calculate f_k(i) for sequence j using the forward algorithm.
  - Calculate b_k(i) for sequence j using the backward algorithm.
  - Add the contribution of sequence j to A and E.
- Calculate the new model parameters.
- Calculate the new log likelihood of the model.
Termination: stop when the change in log likelihood is less than some predefined threshold or the maximum number of iterations is exceeded.
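
A sketch of the expectation step for one training sequence, combining the forward and backward tables with the position-wise formula from slide 30; normalizing the returned counts (as in the known-path sketch earlier) completes one EM iteration. Begin/end states are again replaced by a start distribution, an assumption.

```python
def expected_counts(obs, states, start, trans, emit):
    """Expected transition (A) and emission (E) counts for one sequence."""
    f = [{k: start[k] * emit[k][obs[0]] for k in states}]  # forward table
    for x in obs[1:]:
        f.append({l: emit[l][x] * sum(f[-1][k] * trans[k][l] for k in states)
                  for l in states})
    b = [{k: 1.0 for k in states}]                          # backward table
    for x in reversed(obs[1:]):
        b.insert(0, {k: sum(trans[k][l] * emit[l][x] * b[0][l] for l in states)
                     for k in states})
    px = sum(f[-1].values())                                # P(x)
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {} for k in states}
    for i in range(len(obs) - 1):                           # expected A_kl
        for k in states:
            for l in states:
                A[k][l] += (f[i][k] * trans[k][l]
                            * emit[l][obs[i + 1]] * b[i + 1][l]) / px
    for i, x in enumerate(obs):                             # expected E_k(b)
        for k in states:
            E[k][x] = E[k].get(x, 0.0) + f[i][k] * b[i][k] / px
    return A, E
```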

33 Modeling of Labeled Sequences
HMMs can be used to predict the labeling of unannotated sequences.
Training separate models:
- Separately train the model for CpG islands and the model for non-CpG islands, then combine them into a larger HMM.
- This is tedious, especially if there are more than two classes involved.
It would be nicer to estimate everything at once:
- The training set includes all classes (e.g., CpG islands and non-CpG islands).
- Each sequence is labeled with its corresponding classes.
- Let y = y_1, ..., y_L be the labels on the observation x = x_1, ..., x_L.

34 Cont'
The model can be estimated with a slight modification of the Baum-Welch algorithm:
- Allow only valid paths through the model.
- A valid path is one where the state labels and sequence labels agree, i.e., π_i has label y_i.
- During the forward and backward algorithms this corresponds to setting f_l(i) = 0 and b_l(i) = 0 for all states l with a label different from y_i.
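
In code, the constraint is just a mask in the forward recursion (a sketch; label is a hypothetical function mapping a state to its class, e.g., A+ to "island"):

```python
def forward_labeled(obs, labels, states, start, trans, emit, label):
    """Forward algorithm over valid paths only: states whose label
    disagrees with y_i are zeroed out at position i."""
    f = {k: start[k] * emit[k][obs[0]] if label(k) == labels[0] else 0.0
         for k in states}
    for x, y in zip(obs[1:], labels[1:]):
        f = {l: emit[l][x] * sum(f[k] * trans[k][l] for k in states)
                if label(l) == y else 0.0
             for l in states}
    return sum(f.values())  # P(x, y): probability of x with labeling y
```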

35 Discriminative Estimation
When modeling labeled sequences, the following likelihood is maximized:
  Σ_j log P(x^j, y^j | Θ)
If obtaining a good prediction of y is our primary interest, it is preferable to maximize the conditional maximum likelihood instead:
  Σ_j log P(y^j | x^j, Θ) = Σ_j [ log P(x^j, y^j | Θ) - log P(x^j | Θ) ]
The first term is the probability calculated by the forward algorithm for the labeled sequences; the second is the probability calculated by the forward algorithm disregarding all the labels.

36 HMM Model Structure
Choice of model topology:
- A fully connected model causes local maxima.
- In practice, successful HMMs are constructed by carefully deciding which transitions are allowed in the model, based on knowledge about the problem under investigation.
Duration modeling:
- With a single self-transition, the probability of staying in a state decays exponentially with length (a geometric distribution):
  P(L) = (1 - p) p^(L-1)   (p: self-transition probability; 1 - p: probability of leaving the state)
- To model a more complex length distribution, introduce several states with the same distribution over residues and transitions between each other; a chain of such states yields a negative binomial length distribution.
[Diagram: a chain of states, each with self-transition probability p and exit probability 1 - p.]
(A small simulation of this effect follows.)
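
A quick simulation sketch of the duration claim: one self-looping state gives geometric lengths, while a chain of n such states spreads the stay across n geometric waits, giving a negative-binomial-shaped length distribution (function and parameter names are my own):

```python
import random

def duration(n_states, p):
    """Symbols emitted while traversing a chain of n_states identical states,
    each with self-transition probability p."""
    length = 0
    for _ in range(n_states):
        length += 1                   # emit on entering the state
        while random.random() < p:    # stay with probability p
            length += 1
    return length

# n_states = 1 reproduces P(L) = (1 - p) p^(L - 1); larger n concentrates
# the length distribution around its mean n / (1 - p).
```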

37 Numerical Stability of HMM Algorithms
Probabilities become vanishingly small when many probabilities are multiplied in the Viterbi, forward, and backward algorithms.
Consequences of underflow errors:
- The program may crash.
- The program may keep running and produce arbitrary wrong numbers.

38 Improving Numerical Stability
- Log transform: work with log probabilities, so products become sums (the usual choice for Viterbi).
- Scaling of probabilities: for each position i define a scaling variable s_i (e.g., s_i = Σ_k f_k(i)), divide the forward variables by it, and accumulate log P(x) = Σ_i log s_i.
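
A scaled forward sketch using s_i = Σ_k f_k(i); the rescaled variables stay in [0, 1] and log P(x) accumulates as Σ_i log s_i (start-distribution convention as in the earlier sketches):

```python
import math

def forward_scaled(obs, states, start, trans, emit):
    """Returns log P(x) without underflow by rescaling at each position."""
    f = {k: start[k] * emit[k][obs[0]] for k in states}
    s = sum(f.values())                 # s_1
    f = {k: f[k] / s for k in states}
    log_px = math.log(s)
    for x in obs[1:]:
        f = {l: emit[l][x] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
        s = sum(f.values())             # s_i
        f = {k: f[k] / s for k in states}
        log_px += math.log(s)
    return log_px
```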

