
1 Hidden Markov Models CBB 231 / COMPSCI 261

2 What is an HMM? An HMM is a stochastic machine M = (Q, Σ, P_t, P_e) consisting of the following:
- a finite set of states, Q = {q_0, q_1, ..., q_m}
- a finite alphabet, Σ = {s_0, s_1, ..., s_n}
- a transition distribution P_t : Q × Q → [0,1], i.e., P_t(q_j | q_i)
- an emission distribution P_e : Q × Σ → [0,1], i.e., P_e(s_j | q_i)

An example: M_1 = ({q_0, q_1, q_2}, {Y, R}, P_t, P_e), in which state q_1 emits Y 100% of the time and state q_2 emits R 100% of the time:

P_t = {(q_0,q_1,1), (q_1,q_1,0.8), (q_1,q_2,0.15), (q_1,q_0,0.05), (q_2,q_2,0.7), (q_2,q_1,0.3)}
P_e = {(q_1,Y,1), (q_1,R,0), (q_2,Y,0), (q_2,R,1)}
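As a concrete representation, here is a minimal sketch of M_1 in Python (the dict-of-dicts layout is just one convenient encoding, not something specified in the slides):

```python
# A sketch of M_1 as plain Python dicts. State 0 is the silent
# start/stop state and emits nothing.
M1 = {
    "states": [0, 1, 2],
    "alphabet": ["Y", "R"],
    # P_t(q_j | q_i), stored as trans[i][j]
    "trans": {
        0: {1: 1.0},
        1: {0: 0.05, 1: 0.80, 2: 0.15},
        2: {1: 0.30, 2: 0.70},
    },
    # P_e(s | q_i), stored as emit[i][s]
    "emit": {
        1: {"Y": 1.0, "R": 0.0},
        2: {"Y": 0.0, "R": 1.0},
    },
}
```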

3 Probability of a Sequence

[State diagram of M_1, as on the previous slide.]

P(YRYRY | M_1) = a_{0→1} b_{1,Y} a_{1→2} b_{2,R} a_{2→1} b_{1,Y} a_{1→2} b_{2,R} a_{2→1} b_{1,Y} a_{1→0}
              = 1 · 1 · 0.15 · 1 · 0.3 · 1 · 0.15 · 1 · 0.3 · 1 · 0.05
              = 0.00010125

(Here a_{i→j} = P_t(q_j | q_i) and b_{i,s} = P_e(s | q_i); because q_1 can emit only Y and q_2 only R, this is the only path that can generate YRYRY.)
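The product above can be computed mechanically from the M1 dictionary; a small sketch (path_probability is a hypothetical helper, and [1, 2, 1, 2, 1] is the only path that can emit YRYRY with nonzero probability):

```python
def path_probability(hmm, path, seq):
    """P(seq, path): the product of transition and emission
    probabilities along a path that starts and ends in state 0."""
    p = 1.0
    prev = 0
    for state, symbol in zip(path, seq):
        p *= hmm["trans"][prev].get(state, 0.0)   # a_{prev -> state}
        p *= hmm["emit"][state].get(symbol, 0.0)  # b_{state, symbol}
        prev = state
    return p * hmm["trans"][prev].get(0, 0.0)     # final return to q_0

print(path_probability(M1, [1, 2, 1, 2, 1], "YRYRY"))  # 0.00010125
```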

4 Another Example

M_2 = (Q, Σ, P_t, P_e), with Q = {q_0, q_1, q_2, q_3, q_4} and Σ = {A, C, G, T}.

Emission distributions:

state   A     C     G     T
q_1     10%   40%   20%   30%
q_2     11%   43%   29%   17%
q_3     35%   15%   25%   25%
q_4     27%   22%   37%   14%

[State diagram: transitions among q_0 ... q_4 with probabilities 100%, 50%, 65%, 35%, 20%, and 80%; the exact wiring is shown in the original figure.]

5 Finding the Most Probable Path

Example: the sequence C A T T A A T A G can be generated by more than one path through M_2; of the two paths in the figure, the top has probability 7.0×10⁻⁷ and the bottom 2.8×10⁻⁹.

The most probable path is:

States:   1 2 2 2 2 2 2 2 4
Sequence: C A T T A A T A G

resulting in this parse: feature 1: C; feature 2: ATTAATA; feature 3: G.

6 Decoding with an HMM

Given a sequence S = x_0 x_1 ... x_{L−1}, decoding means finding the most probable state path. The probability of a path φ = q_0 y_1 y_2 ... y_L q_0 jointly with S factors into an emission probability and a transition probability at each position:

P(S, φ) = P_t(q_0 | y_L) · ∏_{i=1}^{L} P_e(x_{i−1} | y_i) · P_t(y_i | y_{i−1})

and the decoder seeks φ* = argmax_φ P(S, φ).

7 The Best Partial Parse

Define V(i, k) as the probability of the most probable partial path that emits x_0 ... x_{k−1} and ends in state q_i. The best full parse can then be assembled from the best partial parses.

8 The Viterbi Algorithm

[DP matrix: states × sequence positions; cell (i, k) is computed from column k−1.]

V(i, k) = P_e(x_{k−1} | q_i) · max_j V(j, k−1) · P_t(q_i | q_j)

Each cell holds the probability of the best partial path ending in state q_i after the first k symbols have been emitted; filling the matrix column by column for k = 1 ... L yields the most probable complete path.

9 Viterbi: Traceback

Each cell also stores a traceback pointer T(i, k) to the predecessor state chosen by the max; following the pointers back from the best cell in the final column recovers the path, ending in state 0:

T(T(T(... T(T(i, L−1), L−2) ..., 2), 1), 0) = 0

10 Viterbi Algorithm in Pseudocode

Precompute, for efficiency:

trans[q_i] = { q_j | P_t(q_i | q_j) > 0 }   (predecessors of q_i)
emit[s]    = { q_i | P_e(s | q_i) > 0 }     (states that can emit s)

The algorithm then proceeds in four steps: initialization; fill out the main part of the DP matrix; choose the best state from the last column of the DP matrix; traceback.
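Below is a runnable Python sketch of the same four steps, assuming the M1 encoding introduced earlier; working in log space to avoid underflow is my choice, not the slides':

```python
import math

def viterbi(hmm, seq):
    """Most probable state path for seq, assuming every path starts and
    ends in the silent state 0. Returns (log probability, state path)."""
    states = [q for q in hmm["states"] if q != 0]

    def lg(p):                       # log that tolerates zero probabilities
        return math.log(p) if p > 0 else float("-inf")

    # initialization: leave state 0 and emit the first symbol
    V = [{q: lg(hmm["trans"][0].get(q, 0)) + lg(hmm["emit"][q].get(seq[0], 0))
          for q in states}]
    back = []

    # fill out the main part of the DP matrix
    for x in seq[1:]:
        col, ptr = {}, {}
        for q in states:
            j = max(states, key=lambda s: V[-1][s] + lg(hmm["trans"][s].get(q, 0)))
            col[q] = (V[-1][j] + lg(hmm["trans"][j].get(q, 0))
                      + lg(hmm["emit"][q].get(x, 0)))
            ptr[q] = j
        V.append(col)
        back.append(ptr)

    # choose the best state from the last column (including the return to 0)
    best = max(states, key=lambda q: V[-1][q] + lg(hmm["trans"][q].get(0, 0)))
    logp = V[-1][best] + lg(hmm["trans"][best].get(0, 0))

    # traceback
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return logp, path[::-1]

logp, path = viterbi(M1, "YRYRY")
print(path, math.exp(logp))          # [1, 2, 1, 2, 1]  0.00010125
```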

11 The Forward Algorithm: Probability of a Sequence

F(i, k) represents the probability P(x_0 ... x_{k−1}, q_i) that the machine emits the subsequence x_0 ... x_{k−1} by any path ending in state q_i, i.e., so that symbol x_{k−1} is emitted by state q_i.

12 The Forward Algorithm: Probability of a Sequence

[Same DP matrix as for Viterbi: cell (i, k) is computed from column k−1.]

Viterbi:  V(i, k) = P_e(x_{k−1} | q_i) · max_j V(j, k−1) · P_t(q_i | q_j)   (the single most probable path)
Forward:  F(i, k) = P_e(x_{k−1} | q_i) · Σ_j  F(j, k−1) · P_t(q_i | q_j)   (sum over all paths)

i.e., the forward algorithm is the Viterbi algorithm with the max replaced by a sum.

13 The Forward Algorithm in Pseudocode

Two steps: fill out the DP matrix; then sum over the final column to get P(S).
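A matching Python sketch, again assuming the M1 encoding from earlier; it differs from the Viterbi sketch only in replacing max with a sum (plain probabilities are used here for clarity; a production version would rescale or work in log space):

```python
def forward(hmm, seq):
    """P(seq | hmm): sum over all paths that start and end in state 0."""
    states = [q for q in hmm["states"] if q != 0]
    # initialization: leave state 0 and emit the first symbol
    F = {q: hmm["trans"][0].get(q, 0) * hmm["emit"][q].get(seq[0], 0)
         for q in states}
    # fill out the DP matrix, one column at a time
    for x in seq[1:]:
        F = {q: hmm["emit"][q].get(x, 0) *
                sum(F[j] * hmm["trans"][j].get(q, 0) for j in states)
             for q in states}
    # sum over the final column, including the final return to state 0
    return sum(F[q] * hmm["trans"][q].get(0, 0) for q in states)

print(forward(M1, "YRYRY"))  # 0.00010125 -- equals the Viterbi result
                             # here, since only one path can emit YRYRY
```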

14 Training an HMM from Labeled Sequences

Sequence: CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC
Labels:   011111112222222111111222211111112222111110

(the labels carry a leading and a trailing 0 for the start/stop state q_0)

Transition counts (and resulting probabilities):

from \ to   0         1          2
0           0 (0%)    1 (100%)   0 (0%)
1           1 (4%)    21 (84%)   3 (12%)
2           0 (0%)    3 (20%)    12 (80%)

Emission counts (and resulting probabilities):

state   A         C         G         T
1       6 (24%)   7 (28%)   5 (20%)   7 (28%)
2       3 (20%)   2 (13%)   7 (47%)   3 (20%)
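Labeled-sequence training is just counting and normalizing; a sketch that reproduces the tables above from this one labeled example (train_labeled is a hypothetical helper, not code from the slides):

```python
from collections import Counter

def train_labeled(seq, labels):
    """Maximum-likelihood P_t and P_e from one labeled sequence;
    labels carries one extra 0 at each end for the start/stop state."""
    t_count, from_tot = Counter(), Counter()
    e_count, state_tot = Counter(), Counter()
    for a, b in zip(labels, labels[1:]):   # count transitions a -> b
        t_count[(a, b)] += 1
        from_tot[a] += 1
    for x, q in zip(seq, labels[1:-1]):    # count emissions of x in q
        e_count[(q, x)] += 1
        state_tot[q] += 1
    P_t = {ab: n / from_tot[ab[0]] for ab, n in t_count.items()}
    P_e = {qx: n / state_tot[qx[0]] for qx, n in e_count.items()}
    return P_t, P_e

S = "CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC"
L = [int(c) for c in "011111112222222111111222211111112222111110"]
P_t, P_e = train_labeled(S, L)
print(P_t[(1, 1)], P_e[(2, "G")])   # 0.84  0.466..., matching the tables
```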

15 Recall: Eukaryotic Gene Structure

[Figure: the coding segment of the complete mRNA runs from the start codon (ATG) to the stop codon (TGA). Exons are separated by introns, each of which begins at a donor site (GT) and ends at an acceptor site (AG).]

16 Using an HMM for Gene Prediction

[Figure: the input sequence AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACGCGA is decoded with a Markov model having intergenic, start-codon, exon, donor, intron, acceptor, and stop-codon states (plus q_0); the most probable path through the model yields the gene prediction: exon 1, exon 2, exon 3.]

17 Higher Order Markovian Eukaryotic Recognizer (HOMER)

Versions developed in what follows: H_3, H_5, H_17, H_27, H_77, H_95.

18 HOMER, version H_3

Three states: I = intron, E = exon, N = intergenic (plus q_0). Tested on 500 Arabidopsis genes:

          nucleotides      splice sites   start/stop   exons            genes
          Sn   Sp   F      Sn   Sp        Sn   Sp      Sn   Sp   F      Sn   #
baseline  100  28   44     0    0         0    0       0    0    0      0    0
H_3       53   88   66     0    0         0    0       0    0    0      0    0

19 Recall: Sensitivity and Specificity

Sn = TP / (TP + FN)        (sensitivity: the fraction of real features that are found)
Sp = TP / (TP + FP)        (specificity, as the term is used in gene finding: the fraction of predictions that are correct)
F  = 2·Sn·Sp / (Sn + Sp)   (their harmonic mean)
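In code, the three measures are one-liners (a trivial sketch of the definitions above; degenerate zero counts are not handled):

```python
def sn_sp_f(tp, fp, fn):
    """Sensitivity, specificity (i.e., precision, as the term is
    used in gene finding), and their harmonic mean F."""
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    return sn, sp, 2 * sn * sp / (sn + sp)
```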

20 HOMER, version H_5

Three exon states, one for each of the three codon positions:

      nucleotides      splice sites   start/stop   exons            genes
      Sn   Sp   F      Sn   Sp        Sn   Sp      Sn   Sp   F      Sn   #
H_3   53   88   66     0    0         0    0       0    0    0      0    0
H_5   65   91   76     1    3         3    3       0    0    0      0    0

21 HOMER, version H_17

Adds explicit submodels for the start codon, stop codon, donor site, and acceptor site. [State diagram: intergenic, start codon, exon, donor, intron, acceptor, and stop codon states around q_0.]

       nucleotides      splice sites   start/stop   exons            genes
       Sn   Sp   F      Sn   Sp        Sn   Sp      Sn   Sp   F      Sn   #
H_5    65   91   76     1    3         3    3       0    0    0      0    0
H_17   81   93   87     34   48        43   37      19   24   21     7    35

22 Maintaining Phase Across an Intron

coordinates: 0    5    10   15   20   25   30
sequence:    GTATGCGATAGTCAAGAGTGATCGCTAGACC
phase:       01201201.............2012012012

The codon phase (0, 1, 2) of the coding sequence must be remembered across each intron (the dots mark the intron): here the first exon ends in phase 1, so the first position after the intron must resume in phase 2.
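The bookkeeping can be sketched as follows: given the exon intervals of a parse, assign each coding position its phase so that phase continues across introns rather than restarting (the intervals below are read off the example figure; the helper name is hypothetical):

```python
def coding_phases(exons):
    """Map each exonic coordinate to its codon phase (0, 1, 2).
    exons: ordered list of (start, end) half-open intervals.
    The phase counter runs on across introns."""
    phases, phase = {}, 0
    for start, end in exons:
        for pos in range(start, end):
            phases[pos] = phase
            phase = (phase + 1) % 3
    return phases

# The example above: an 8 nt exon, a 13 nt intron, a 10 nt exon.
ph = coding_phases([(0, 8), (21, 31)])
print(ph[7], ph[21])   # 1 2 -- the second exon resumes in phase 2
```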

23 HOMER, version H_27

Three separate intron models, one for each phase:

       nucleotides      splice sites   start/stop   exons            genes
       Sn   Sp   F      Sn   Sp        Sn   Sp      Sn   Sp   F      Sn   #
H_17   81   93   87     34   48        43   37      19   24   21     7    35
H_27   83   93   88     40   49        41   36      23   27   25     8    38

24 Recall: Weight Matrices

[Figure: weight matrices for the signals — ATG (start codons); TGA, TAA, TAG (stop codons); GT (donor splice sites); AG (acceptor splice sites).]
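A weight matrix gives each position of a signal its own emission distribution; a minimal scoring sketch (the 4-position donor matrix below is made-up illustrative data, not HOMER's trained values):

```python
import math

# Hypothetical donor-site weight matrix: one emission distribution
# per position (positions 2-3 are the near-invariant GT).
donor_wmm = [
    {"A": 0.30, "C": 0.15, "G": 0.40, "T": 0.15},
    {"A": 0.60, "C": 0.10, "G": 0.15, "T": 0.15},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
]

def wmm_log_score(wmm, window):
    """Log probability of a window under a weight matrix: the sum of
    independent per-position log emission probabilities."""
    return sum(math.log(col[x]) for col, x in zip(wmm, window))

print(wmm_log_score(donor_wmm, "GAGT"))   # score a candidate donor site
```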

25 HOMER, version H_77

Models positional biases near the splice sites:

       nucleotides      splice sites   start/stop   exons            genes
       Sn   Sp   F      Sn   Sp        Sn   Sp      Sn   Sp   F      Sn   #
H_27   83   93   88     40   49        41   36      23   27   25     8    38
H_77   88   96   92     66   67        51   46      47   46   46     13   65

26 HOMER, version H_95

       nucleotides      splice sites   start/stop   exons            genes
       Sn   Sp   F      Sn   Sp        Sn   Sp      Sn   Sp   F      Sn   #
H_77   88   96   92     66   67        51   46      47   46   46     13   65
H_95   92   97   94     79   76        57   53      62   59   60     19   93

27 Summary of HOMER Results

28 Higher-order Markov Models

In the sequence A C G C T A, consider the emission of the G: an nth-order Markov model conditions each emitted symbol on the n preceding symbols.

0th order: P(G)
1st order: P(G | C)
2nd order: P(G | AC)
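Estimating a higher-order emission distribution is again counting, now keyed by the preceding n-mer; a sketch (kth_order_probs is a hypothetical helper, and the two training strings are toy data):

```python
from collections import Counter

def kth_order_probs(training_seqs, k):
    """Estimate P(x | k preceding symbols) by counting (k+1)-mers."""
    ctx, ctx_x = Counter(), Counter()
    for s in training_seqs:
        for i in range(k, len(s)):
            ctx[s[i-k:i]] += 1            # context occurrences
            ctx_x[(s[i-k:i], s[i])] += 1  # context followed by s[i]
    return {cx: n / ctx[cx[0]] for cx, n in ctx_x.items()}

P2 = kth_order_probs(["ACGCTAACGCGT", "CCACGT"], 2)
print(P2.get(("AC", "G"), 0.0))   # estimate of P(G | AC)
```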

29 Higher-order Markov Models

Effect of the Markov order of H_95's models:

order   nucleotides      splice sites   start/stop   exons            genes
        Sn   Sp   F      Sn   Sp        Sn   Sp      Sn   Sp   F      Sn   #
0       92   97   94     79   76        57   53      62   59   60     19   93
1       95   98   97     87   81        64   61      72   68   70     25   127
2       98   –    –      91   82        65   62      76   69   72     27   136
3       98   –    –      91   82        67   63      76   69   72     28   140
4       98   97   98     90   81        69   64      76   68   72     29   143
5       98   97   98     90   81        66   62      74   67   70     27   137

30 Variable-Order Markov Models

Rather than fixing a single order everywhere, a variable-order model can use long contexts where the training counts support them and fall back on (or interpolate with) shorter contexts where the data are sparse.
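The slides give only the results of interpolation; as one hedged sketch of the underlying idea, each order's estimate can be blended with the lower orders (the fixed mixing weight lam and the helper name are illustrative assumptions; practical interpolated Markov models derive the weights from the context counts):

```python
def interpolated_prob(probs_by_order, context, x, lam=0.5):
    """Blend P(x | context) across orders 0..len(context): each order's
    estimate is mixed with the recursively interpolated lower orders."""
    if not context:                                # order 0: no context left
        return probs_by_order[0].get(("", x), 0.0)
    p = probs_by_order[len(context)].get((context, x), 0.0)
    lower = interpolated_prob(probs_by_order, context[1:], x, lam)
    return lam * p + (1 - lam) * lower

# probs_by_order[k] would come from kth_order_probs(seqs, k) above.
```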

31 Interpolation Results

32 Summary

- An HMM is a stochastic generative model which emits sequences.
- Parsing with an HMM can be accomplished using a decoding algorithm (such as Viterbi) to find the most probable (MAP) path generating the input sequence.
- Training of unambiguous HMMs can be accomplished using labeled sequence training.
- Training of ambiguous HMMs can be accomplished using Viterbi training or the Baum-Welch algorithm (next lesson...).
- Posterior decoding can be used to estimate the probability that a given symbol or substring was generated by a particular state (next lesson...).

