1 Hidden Markov Models Hsin-Min Wang Institute of Information Science, Academia Sinica References: 1.L. R. Rabiner and B. H. Juang,

Presentation transcript:

1 Hidden Markov Models. Hsin-Min Wang, Institute of Information Science, Academia Sinica. References: 1. L. R. Rabiner and B. H. Juang (1993), Fundamentals of Speech Recognition, Chapter 6. 2. X. Huang et al. (2001), Spoken Language Processing, Chapter 8. 3. L. R. Rabiner (1989), “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE, vol. 77, no. 2, February 1989.

2 Hidden Markov Model (HMM) History –Published in Baum’s papers in the late 1960s and early 1970s –Introduced to speech processing by Baker (CMU) and Jelinek (IBM) in the 1970s –Introduced to computational biology in the late 1980s: Lander and Green (1987) used HMMs in the construction of genetic linkage maps; Churchill (1989) employed HMMs to distinguish coding from noncoding regions in DNA

3 Hidden Markov Model (HMM) Assumptions –A speech signal (or DNA sequence) can be characterized as a parametric random process –The parameters can be estimated in a precise, well-defined manner Three fundamental problems –Evaluation of the probability (likelihood) of a sequence of observations given a specific HMM –Determination of the best sequence of model states –Adjustment of the model parameters so as to best account for the observed signal/sequence

4 Hidden Markov Model (HMM) Given an initial model as follows: [Figure: three states S1, S2, S3; emission distribution {X:1/3, Y:1/3, Z:1/3}; arc label 1/3.] We can train one HMM for each of the following two classes, using that class's training data.
Training set for class 1: 1. XYYZXYZXXYZ 2. XYZXYZ 3. XYZXXYZ 4. YYXYZXY 5. YZXXYZZXY 6. ZXZZXYZX 7. ZXYZXYZX 8. ZXYZX 9. ZXXYZX
Training set for class 2: 1. YYYZZYZ 2. ZZYXYY 3. XXZZYYY 4. YYXYYXZ 5. ZZXXYYXY 6. YYYZZYXX 7. XYYYYYXYX 8. ZZZZZ 9. YYXXX
We can then decide which class each of the following test sequences belongs to: XYZXYZZXY and XXYXYZZZZXXY.

5 Brief Review of Probability Theory Consider the simple scenario of rolling two dice, labeled die 1 and die 2. Define the following three events: X: Die 1 lands on 3. Y: Die 2 lands on 1. Z: The dice sum to 8, i.e., Z = {(2,6), (3,5), (4,4), (5,3), (6,2)}. Prior probability: P(X) = P(Y) = 1/6, P(Z) = 5/36. Joint probability: P(X,Y) (or P(X∩Y)) = 1/36, since X∩Y = {(3,1)}; two events X and Y are statistically independent if and only if P(X,Y) = P(X) × P(Y). P(Y,Z) = 0; two events Y and Z are mutually exclusive if and only if Y∩Z = ∅, i.e., P(Y∩Z) = 0. Conditional probability: P(Y|X) = P(X,Y) / P(X), so here P(Y|X) = P(Y) and P(Z|Y) = 0.
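The dice probabilities on this slide can be checked by enumerating all 36 outcomes. The sketch below is only an illustration; the helper name `prob` and the predicate functions are chosen here and are not part of the original slides.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (die 1, die 2).
OUTCOMES = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate over (die1, die2)."""
    hits = sum(1 for o in OUTCOMES if event(o))
    return Fraction(hits, len(OUTCOMES))

def X(o): return o[0] == 3          # die 1 lands on 3
def Y(o): return o[1] == 1          # die 2 lands on 1
def Z(o): return o[0] + o[1] == 8   # the dice sum to 8

p_x, p_y, p_z = prob(X), prob(Y), prob(Z)
p_xy = prob(lambda o: X(o) and Y(o))   # joint probability P(X,Y)
p_yz = prob(lambda o: Y(o) and Z(o))   # joint probability P(Y,Z)
p_y_given_x = p_xy / p_x               # conditional probability P(Y|X)

print(p_x, p_y, p_z)        # priors
print(p_xy == p_x * p_y)    # True when X and Y are independent
print(p_yz)                 # 0 when Y and Z are mutually exclusive
print(p_y_given_x == p_y)   # independence again, via P(Y|X)
```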

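6 General Pattern Recognition Problem Posterior probability by Bayes’ rule: P(λ_i | O) = P(O | λ_i) P(λ_i) / P(O), where O is the observation and λ_i is the model of class i. Maximum a posteriori (MAP) criterion: choose the class i that maximizes P(O | λ_i) P(λ_i). Maximum likelihood criterion: choose the class i that maximizes P(O | λ_i). P(λ) is usually assumed to be uniformly distributed, since we don’t know which class is more likely to occur; the two criteria then give the same decision.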

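7 Markov Chain By the chain rule, P(X_1, X_2, …, X_n) = P(X_1) ∏_{i=2}^{n} P(X_i | X_1, …, X_{i-1}). First-order Markov chain: P(X_i | X_1, …, X_{i-1}) = P(X_i | X_{i-1}), so P(X_1, X_2, …, X_n) = P(X_1) ∏_{i=2}^{n} P(X_i | X_{i-1}).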

8 N-gram Language Model P(this is a book) = P(this) × P(is | this) × P(a | this is) × P(book | this is a), where P(this) = #“this” / #words, P(is | this) = #“this is” / #“this”, P(a | this is) = #“this is a” / #“this is”, and P(book | this is a) = #“this is a book” / #“this is a”. Longer histories become more difficult to estimate. Bigram approximation: P(this is a book) ≈ P(this) × P(is | this) × P(a | is) × P(book | a), where P(a | is) = #“is a” / #“is” and P(book | a) = #“a book” / #“a”.
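As a minimal sketch of the count-based bigram estimate, the code below trains on a three-sentence toy corpus. The corpus and the function names `train_bigram` and `bigram_prob` are made up for illustration; smoothing and sentence-boundary markers, which a real language model would need, are deliberately left out.

```python
from collections import Counter

def train_bigram(sentences):
    """Count single words and adjacent word pairs."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(sentence, unigrams, bigrams):
    """P(w1) * prod P(w_i | w_{i-1}) using raw relative frequencies."""
    words = sentence.split()
    total = sum(unigrams.values())
    p = unigrams[words[0]] / total            # P(this) = #"this" / #words
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:               # unseen history: no estimate
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]   # #"prev cur" / #"prev"
    return p

# Hypothetical toy corpus, not from the slides.
corpus = ["this is a book", "this is a pen", "that is a book"]
uni, bi = train_bigram(corpus)
p_book = bigram_prob("this is a book", uni, bi)
print(p_book)
```

With this corpus the estimate is (2/12) × 1 × 1 × (2/3) = 1/9.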

9 Observable Markov Model (Rabiner 1989) The parameters of a Markov chain with N states labeled {1,…,N}, where q_t denotes the state at time t, are a_ij = P(q_t = j | q_{t-1} = i), 1≤i,j≤N, and π_i = P(q_1 = i), 1≤i≤N. The output of the process is the set of states at each time instant t, where each state corresponds to an observable event X_i. There is a one-to-one correspondence between the observable sequence and the Markov chain state sequence.

10 Markov Chain Model – Ex 1 A 3-state Markov chain λ [Figure: states S1, S2, S3 emitting X, Y, Z respectively.] –State 1 generates symbol X only, State 2 generates symbol Y only, State 3 generates symbol Z only –Given a sequence of observed symbols O = {ZXYYZXYZ}, the only corresponding state sequence is Q = {S3 S1 S2 S2 S3 S1 S2 S3}, and the corresponding probability is P(O|λ) = P(ZXYYZXYZ|λ) = P(Q|λ) = P(S3 S1 S2 S2 S3 S1 S2 S3|λ) = π(S3) P(S1|S3) P(S2|S1) P(S2|S2) P(S3|S2) P(S1|S3) P(S2|S1) P(S3|S2) = 0.1 × 0.3 × 0.3 × 0.7 × 0.2 × 0.3 × 0.3 × 0.2 = 2.268 × 10^-5
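The product on this slide can be reproduced in a few lines. The sketch lists only the probabilities quoted on the slide, because the rest of the transition matrix is not in the transcript; the dictionary layout and the function name `chain_probability` are illustrative choices.

```python
# Only the values quoted on the slide; other entries are not given.
INITIAL = {"S3": 0.1}                  # pi(S3)
TRANSITION = {                         # keyed (from_state, to_state)
    ("S3", "S1"): 0.3,                 # P(S1 | S3)
    ("S1", "S2"): 0.3,                 # P(S2 | S1)
    ("S2", "S2"): 0.7,                 # P(S2 | S2)
    ("S2", "S3"): 0.2,                 # P(S3 | S2)
}
# Each state emits exactly one symbol, so the symbols fix the state path.
SYMBOL_TO_STATE = {"X": "S1", "Y": "S2", "Z": "S3"}

def chain_probability(symbols):
    """Probability of the unique state path behind an observed symbol string."""
    states = [SYMBOL_TO_STATE[s] for s in symbols]
    p = INITIAL[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= TRANSITION[(prev, cur)]
    return p

p_seq = chain_probability("ZXYYZXYZ")
print(f"{p_seq:.4e}")
```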

11 Markov Chain Model – Ex 2 A three-state Markov chain for the Dow Jones Industrial average The probability of 5 consecutive up days (Huang et al., 2001)

12 Extension to Hidden Markov Model HMM: an extended version of the Observable Markov Model –The observation is a probabilistic function (discrete or continuous) of a state instead of a one-to-one correspondence with a state –The model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (hidden) What is hidden? The state sequence! Given only the observation sequence, we cannot be sure which state sequence generated it.

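13 Hidden Markov Model – Ex 1 A 3-state discrete HMM λ [Figure: initial model with states S1, S2, S3 and emission distributions {X:.3, Y:.2, Z:.5}, {X:.7, Y:.1, Z:.2}, {X:.3, Y:.6, Z:.1}.] –Given an observation sequence O = {XYZ}, there are 3^3 = 27 possible state sequences, therefore P(O|λ) is computed by summing over all of them: P(O|λ) = Σ_Q P(O, Q|λ) = Σ_Q P(Q|λ) P(O|Q, λ)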

14 Hidden Markov Model – Ex 2 (Huang et al., 2001) Given a three-state Hidden Markov Model λ for the Dow Jones Industrial Average: How do we find the probability P(up, up, up, up, up|λ)? How do we find the optimal state sequence of the model that generates the observation sequence “up, up, up, up, up”? cf. the Markov chain model: here 3^5 = 243 state sequences can generate “up, up, up, up, up”.

15 Elements of an HMM An HMM is characterized by the following: 1. N, the number of states in the model 2. M, the number of distinct observation symbols per state 3. The state transition probability distribution A = {a_ij}, where a_ij = P[q_{t+1} = j | q_t = i], 1≤i,j≤N 4. The observation symbol probability distribution in state j, B = {b_j(v_k)}, where b_j(v_k) = P[o_t = v_k | q_t = j], 1≤j≤N, 1≤k≤M 5. The initial state distribution π = {π_i}, where π_i = P[q_1 = i], 1≤i≤N For convenience, we usually use the compact notation λ = (A, B, π) to indicate the complete parameter set of an HMM –This also requires specification of the two model parameters N and M
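One possible way to hold the parameter set λ = (A, B, π) in code is sketched below. The class name `HMM` and its `validate` method are hypothetical. The numbers are the 3-state, 2-codeword values that appear on the segmental K-means example slide later in this deck, used here only because they form a complete parameter set.

```python
from dataclasses import dataclass
from fractions import Fraction as F

@dataclass
class HMM:
    A: list        # N x N, A[i][j] = P(q_{t+1} = j | q_t = i)
    B: list        # N x M, B[j][k] = P(o_t = v_k | q_t = j)
    pi: list       # length N, pi[i] = P(q_1 = i)
    symbols: list  # the M observation symbols v_1 .. v_M

    @property
    def N(self):
        return len(self.A)

    @property
    def M(self):
        return len(self.symbols)

    def validate(self):
        """Check that pi and every row of A and B is a distribution."""
        assert sum(self.pi) == 1
        assert all(len(r) == self.N and sum(r) == 1 for r in self.A)
        assert all(len(r) == self.M and sum(r) == 1 for r in self.B)
        return True

# Exact fractions keep the row sums free of rounding error.
model = HMM(
    A=[[F(3, 4), F(1, 4), 0], [0, F(2, 3), F(1, 3)], [0, 0, 1]],
    B=[[F(3, 4), F(1, 4)], [F(1, 3), F(2, 3)], [F(2, 3), F(1, 3)]],
    pi=[1, 0, 0],
    symbols=["X", "Y"],
)
print(model.N, model.M, model.validate())
```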

16 Two Major Assumptions for HMM First-order Markov assumption –The state transition depends only on the origin and destination states: a_ij = P(q_{t+1} = j | q_t = i), 1≤i,j≤N –The state transition probability is time invariant Output-independent assumption –The observation depends only on the state that generates it, not on its neighboring observations

17 Three Basic Problems for HMMs Given an observation sequence O = (o_1, o_2, …, o_T) and an HMM λ = (A, B, π) –Problem 1: How to compute P(O|λ) efficiently, e.g., P(up, up, up, up, up|λ)? → Evaluation Problem –Problem 2: How to choose an optimal state sequence Q = (q_1, q_2, …, q_T) which best explains the observations? → Decoding Problem –Problem 3: How to adjust the model parameters λ = (A, B, π) to maximize P(O|λ)? → Learning/Training Problem

18 Solution to Problem 1

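19 Solution to Problem 1 - Direct Evaluation Given O and λ, find P(O|λ) = Pr{observing O given λ} Evaluate all possible state sequences Q = (q_1, q_2, …, q_T) of length T that generate the observation sequence O: P(O|λ) = Σ_Q P(O, Q|λ) = Σ_Q P(Q|λ) P(O|Q, λ) P(Q|λ): the probability of the path Q –By the first-order Markov assumption, P(Q|λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} … a_{q_{T-1} q_T} P(O|Q, λ): the joint output probability along the path Q –By the output-independent assumption, P(O|Q, λ) = b_{q_1}(o_1) b_{q_2}(o_2) … b_{q_T}(o_T)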

20 Solution to Problem 1 - Direct Evaluation (cont’d) [Figure: trellis of states S1, S2, S3 against time 1, 2, 3, …, T-1, T with observations o_1, o_2, o_3, …, o_{T-1}, o_T. Legend: a node S_j means b_j(o_t) has been computed; an arc a_ij means a_ij has been computed.]

21 Solution to Problem 1 - Direct Evaluation (cont’d) –A huge computation requirement: O(N^T), since there are N^T state sequences, i.e., exponential computational complexity –A more efficient algorithm can be used to evaluate P(O|λ): the Forward Procedure/Algorithm
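To make the exponential cost concrete, the sketch below enumerates every state path for the Dow Jones HMM whose parameters (Huang et al., 2001) are quoted on the forward-procedure example slide of this deck. Only b_j(up) appears in the transcript, so the observation sequence is restricted to "up" symbols. The function name `direct_evaluation` is an illustrative choice.

```python
from itertools import product

PI = [0.5, 0.2, 0.3]                 # pi_1, pi_2, pi_3
A = [[0.6, 0.2, 0.2],                # A[i][j] = a_{(i+1)(j+1)}
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
B_UP = [0.7, 0.1, 0.3]               # b_j(up); other symbols are not needed

def direct_evaluation(T):
    """P(up, ..., up | lambda) for T observations by summing over N**T paths."""
    N = len(PI)
    total, n_paths = 0.0, 0
    for Q in product(range(N), repeat=T):
        p = PI[Q[0]] * B_UP[Q[0]]                    # pi_{q1} * b_{q1}(o_1)
        for t in range(1, T):
            p *= A[Q[t - 1]][Q[t]] * B_UP[Q[t]]      # a_{q(t-1) qt} * b_{qt}(o_t)
        total += p
        n_paths += 1
    return total, n_paths

p_up_up, paths_2 = direct_evaluation(2)
_, paths_5 = direct_evaluation(5)
print(p_up_up, paths_2, paths_5)
```

For T = 2 the sum runs over 9 paths; for T = 5 it already runs over 3^5 = 243.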

22 Solution to Problem 1 - The Forward Procedure Based on the HMM assumptions, the calculation of P(q_t | q_{t-1}, λ) and P(o_t | q_t, λ) involves only q_{t-1}, q_t, and o_t, so it is possible to compute the likelihood with recursion on t Forward variable: α_t(i) = P(o_1, o_2, …, o_t, q_t = i | λ) –The probability of the joint event that o_1, o_2, …, o_t are observed and the state at time t is i, given the model λ

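23 Solution to Problem 1 - The Forward Procedure (cont’d) α_{t+1}(j) = P(o_1, …, o_{t+1}, q_{t+1} = j | λ) = Σ_{i=1}^{N} P(o_1, …, o_t, q_t = i | λ) P(q_{t+1} = j | q_t = i, λ) P(o_{t+1} | q_{t+1} = j, λ) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(o_{t+1}) –Output-independent assumption: o_{t+1} depends only on q_{t+1} –First-order Markov assumption: q_{t+1} depends only on q_t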

24 Solution to Problem 1 - The Forward Procedure (cont’d) α_3(2) = P(o_1, o_2, o_3, q_3 = 2|λ) = [α_2(1) × a_12 + α_2(2) × a_22 + α_2(3) × a_32] × b_2(o_3), where the subscript of α is the time index and the argument is the state index. [Figure: trellis of states S1, S2, S3 against time 1, 2, 3, …, T-1, T with observations o_1, o_2, o_3, …, o_{T-1}, o_T, showing α_2(1), α_2(2), α_2(3) feeding state S2 at time 3 through a_12, a_22, a_32 and b_2(o_3). Legend: a node S_j means b_j(o_t) has been computed; an arc a_ij means a_ij has been computed.]

25 Solution to Problem 1 - The Forward Procedure (cont’d) Algorithm 1. Initialization: α_1(i) = π_i b_i(o_1), 1≤i≤N 2. Induction: α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(o_{t+1}), 1≤t≤T-1, 1≤j≤N 3. Termination: P(O|λ) = Σ_{i=1}^{N} α_T(i) –Complexity: O(N^2 T), cf. O(N^T) for direct evaluation Based on the lattice (trellis) structure –Computed in a left-to-right time-synchronous manner, where each cell for time t is completely computed before proceeding to time t+1 –All state sequences, regardless of their earlier history, merge into N nodes (states) at each time instant t

26 Solution to Problem 1 - The Forward Procedure (cont’d) A three-state Hidden Markov Model for the Dow Jones Industrial Average (Huang et al., 2001). P(up, up|λ)?
Transition probabilities: a_11=0.6, a_12=0.2, a_13=0.2; a_21=0.5, a_22=0.3, a_23=0.2; a_31=0.4, a_32=0.1, a_33=0.5
Initial probabilities: π_1=0.5, π_2=0.2, π_3=0.3
Output probabilities: b_1(up)=0.7, b_2(up)=0.1, b_3(up)=0.3
α_1(1) = 0.5*0.7 = 0.35, α_1(2) = 0.2*0.1 = 0.02, α_1(3) = 0.3*0.3 = 0.09
α_2(1) = (0.35*0.6 + 0.02*0.5 + 0.09*0.4)*0.7 = 0.1792
α_2(2) = (0.35*0.2 + 0.02*0.3 + 0.09*0.1)*0.1 = 0.0085
α_2(3) = (0.35*0.2 + 0.02*0.2 + 0.09*0.5)*0.3 = 0.0357
P(up, up|λ) = α_2(1) + α_2(2) + α_2(3) = 0.2234
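The forward recursion for this example can be sketched as follows. The function `forward` and its `emissions` argument (one vector of b_j(o_t) per time step) are an illustrative interface, chosen because the transcript only gives the output probabilities for "up".

```python
PI = [0.5, 0.2, 0.3]
A = [[0.6, 0.2, 0.2],
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
B_UP = [0.7, 0.1, 0.3]   # b_j(up)

def forward(pi, A, emissions):
    """emissions[t][j] = b_j(o_t). Returns (alpha table, P(O | lambda))."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [[pi[i] * emissions[0][i] for i in range(N)]]
    # Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(o_{t+1})
    for t in range(1, len(emissions)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(N)) * emissions[t][j]
            for j in range(N)
        ])
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return alpha, sum(alpha[-1])

alpha, likelihood = forward(PI, A, [B_UP, B_UP])   # O = (up, up)
print(alpha[1], likelihood)
```

Each time step costs N^2 multiplications, which is where the O(N^2 T) complexity comes from.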

27 Solution to Problem 2

28 Solution to Problem 2 - The Viterbi Algorithm The Viterbi algorithm can be regarded as a dynamic programming algorithm applied to the HMM, or as a modified forward algorithm –Instead of summing probabilities from different paths coming to the same destination state, the Viterbi algorithm picks and remembers the best path, finding a single optimal state sequence Q* –The Viterbi algorithm can also be illustrated in a trellis framework similar to the one for the forward algorithm

29 Solution to Problem 2 - The Viterbi Algorithm (cont’d) [Figure: trellis of states S1, S2, S3 against time 1, 2, 3, …, T-1, T with observations o_1, o_2, o_3, …, o_{T-1}, o_T.]

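30 Solution to Problem 2 - The Viterbi Algorithm (cont’d) 1. Initialization: δ_1(i) = π_i b_i(o_1), ψ_1(i) = 0, 1≤i≤N 2. Induction: δ_{t+1}(j) = max_i [δ_t(i) a_ij] b_j(o_{t+1}), ψ_{t+1}(j) = argmax_i [δ_t(i) a_ij], 1≤t≤T-1, 1≤j≤N 3. Termination: P* = max_i δ_T(i), q_T* = argmax_i δ_T(i) 4. Backtracking: q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, T-2, …, 1; Q* = (q_1*, q_2*, …, q_T*) is the best state sequence Complexity: O(N^2 T)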

31 Solution to Problem 2 - The Viterbi Algorithm (cont’d) A three-state Hidden Markov Model for the Dow Jones Industrial Average (Huang et al., 2001).
Transition probabilities: a_11=0.6, a_12=0.2, a_13=0.2; a_21=0.5, a_22=0.3, a_23=0.2; a_31=0.4, a_32=0.1, a_33=0.5
Initial probabilities: π_1=0.5, π_2=0.2, π_3=0.3
Output probabilities: b_1(up)=0.7, b_2(up)=0.1, b_3(up)=0.3
δ_1(1) = 0.5*0.7 = 0.35, δ_1(2) = 0.2*0.1 = 0.02, δ_1(3) = 0.3*0.3 = 0.09
δ_2(1) = max(0.35*0.6, 0.02*0.5, 0.09*0.4)*0.7 = 0.35*0.6*0.7 = 0.147, ψ_2(1) = 1
δ_2(2) = max(0.35*0.2, 0.02*0.3, 0.09*0.1)*0.1 = 0.35*0.2*0.1 = 0.007, ψ_2(2) = 1
δ_2(3) = max(0.35*0.2, 0.02*0.2, 0.09*0.5)*0.3 = 0.35*0.2*0.3 = 0.021, ψ_2(3) = 1
The most likely state sequence that generates “up up”: 1 1
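A compact sketch of the Viterbi recursion on the same numbers is given below. States are 0-indexed in the code, so the slide's state 1 is index 0. The function name `viterbi` and the `emissions` interface (one vector of b_j(o_t) per time step) are illustrative assumptions.

```python
PI = [0.5, 0.2, 0.3]
A = [[0.6, 0.2, 0.2],
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
B_UP = [0.7, 0.1, 0.3]   # b_j(up)

def viterbi(pi, A, emissions):
    """Returns (best 0-indexed state path, its probability, delta table)."""
    N, T = len(pi), len(emissions)
    # Initialization: delta_1(i) = pi_i * b_i(o_1), psi_1(i) = 0
    delta = [[pi[i] * emissions[0][i] for i in range(N)]]
    psi = [[0] * N]
    # Induction: keep only the best predecessor of each state
    for t in range(1, T):
        d_row, p_row = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            d_row.append(delta[t - 1][best_i] * A[best_i][j] * emissions[t][j])
            p_row.append(best_i)
        delta.append(d_row)
        psi.append(p_row)
    # Termination: best final state
    last = max(range(N), key=lambda i: delta[-1][i])
    # Backtracking through the stored predecessors
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, delta[-1][last], delta

path, p_star, delta = viterbi(PI, A, [B_UP, B_UP])   # O = (up, up)
print([s + 1 for s in path], p_star)                 # 1-indexed states
```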

32 Some Examples

33 Isolated Digit Recognition [Figure: two trellises over time 1, 2, 3, …, T-1, T with observations o_1, o_2, o_3, …, o_{T-1}, o_T, one for each of the three-state (S1, S2, S3) digit models labeled 1 and 0, each with nodes labeled B and E.]

34 Continuous Digit Recognition [Figure: a single trellis over time 1, 2, 3, …, T-1, T with observations o_1, o_2, o_3, …, o_{T-1}, o_T, combining the two digit models labeled 1 and 0, with states S1, S2, S3 and S4, S5, S6 and nodes labeled B and E.]

35 Continuous Digit Recognition (cont’d) [Figure: the combined trellis over time for the two digit models labeled 1 and 0 (states S1, S2, S3 and S4, S5, S6). Highlighted nodes of the best state sequence: S1 S1 S2 S6 S3 S3 S4 S5 S5.]

36 CpG Island Recognition Two Questions Q1: Given a short sequence, does it come from a CpG island? Q2: Given a long sequence, how would we find the CpG islands in it?

37 CpG Island Recognition – Q1 Given a sequence x, a probabilistic model M1 of CpG islands, and a probabilistic model M2 of non-CpG island regions –Compute p1 = P(x|M1) and p2 = P(x|M2) –If p1 > p2, then x comes from a CpG island (CpG+) –If p2 > p1, then x does not come from a CpG island (CpG-) Each model is a four-state Markov chain with states S1:A, S2:C, S3:T, S4:G. [Tables: 4×4 transition probabilities over A, C, G, T for CpG+ and for CpG-; values not preserved.] CpG+ has a large C→G transition probability; CpG- has a small one.
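The decision rule p1 > p2 can be sketched with two first-order Markov chains over A, C, G, T. The transition tables from the slide did not survive extraction, so the numbers below are hypothetical placeholders chosen only so that C→G is large in the CpG+ table and small in the CpG- table. The initial-base probability is assumed to be the same in both models, so it cancels in the comparison.

```python
import math

# HYPOTHETICAL transition tables (rows: current base, columns: next base).
PLUS = {   # stands in for M1, the CpG island model
    "A": {"A": 0.18, "C": 0.27, "G": 0.43, "T": 0.12},
    "C": {"A": 0.17, "C": 0.37, "G": 0.27, "T": 0.19},
    "G": {"A": 0.16, "C": 0.34, "G": 0.38, "T": 0.12},
    "T": {"A": 0.08, "C": 0.36, "G": 0.38, "T": 0.18},
}
MINUS = {  # stands in for M2, the non-island model
    "A": {"A": 0.30, "C": 0.21, "G": 0.28, "T": 0.21},
    "C": {"A": 0.32, "C": 0.30, "G": 0.08, "T": 0.30},
    "G": {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20},
    "T": {"A": 0.18, "C": 0.24, "G": 0.29, "T": 0.29},
}

def log_odds(seq):
    """log P(x | M1) - log P(x | M2), summed over adjacent base pairs."""
    return sum(math.log(PLUS[a][b] / MINUS[a][b]) for a, b in zip(seq, seq[1:]))

def classify(seq):
    # Ties fall to CpG-, a choice the slide does not specify.
    return "CpG+" if log_odds(seq) > 0 else "CpG-"

print(classify("CGCGCG"), classify("ATATAT"))
```

Working in log space avoids underflow on long sequences, since the raw product of many transition probabilities shrinks quickly.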

38 CpG Island Recognition – Q2 A two-state HMM with states S1 and S2, one for CpG+ and one for CpG- regions, transition probabilities a_11, a_12, a_21, a_22, and emission distributions {A: 0.3, C: 0.2, G: 0.2, T: 0.3} and {A: 0.2, C: 0.3, G: 0.3, T: 0.2}. Observable: … A C T C G A G T A … Hidden: S1 S1 S1 S1 S2 S2 S2 S2 S1

39 A Toy Example: 5’ Splice Site Recognition (from “What is a hidden Markov model?”, by Sean R. Eddy) The 5’ splice site indicates the “switch” from an exon to an intron Assumptions: –Uniform base composition on average in exons (25% each base) –Introns are A/T rich (40% each for A and T, 10% each for C and G) –The 5’ splice site (5’SS) consensus nucleotide is almost always a G (say, 95% G and 5% A)

40 A Toy Example: 5’ Splice Site Recognition

41 Solution to Problem 3

42 Solution to Problem 3 – Maximum Likelihood Estimation of Model Parameters How to adjust (re-estimate) the model parameters λ = (A, B, π) to maximize P(O|λ)? –The most difficult of the three problems, because there is no known analytical method that maximizes the joint probability of the training data in closed form; the data is incomplete because of the hidden state sequence –The problem can be solved by the iterative Baum-Welch algorithm, also known as the forward-backward algorithm; the EM (Expectation-Maximization) algorithm is well suited to this problem –Alternatively, it can be solved by the iterative segmental K-means algorithm: the model parameters are adjusted to maximize P(O, Q*|λ), where Q* is the state sequence given by the Viterbi algorithm; this provides a good initialization for Baum-Welch training

43 Solution to Problem 3 – The Segmental K-means Algorithm Assume that we have a training set of observations and an initial estimate of model parameters –Step 1 : Segment the training data The set of training observation sequences is segmented into states, based on the current model, by the Viterbi Algorithm –Step 2 : Re-estimate the model parameters –Step 3: Evaluate the model If the difference between the new and current model scores exceeds a threshold, go back to Step 1; otherwise, return Why?

44 Solution to Problem 3 – The Segmental K-means Algorithm (cont’d) 3 states and 2 codewords (observations X and Y). [Figure: trellis of states s1, s2, s3 against the training observations O_1, O_2, …, O_10; the training data itself is not preserved.]
Re-estimated parameters: π_1=1, π_2=π_3=0; a_11=3/4, a_12=1/4, a_13=0; a_21=0, a_22=2/3, a_23=1/3; a_31=0, a_32=0, a_33=1; b_1(X)=3/4, b_1(Y)=1/4; b_2(X)=1/3, b_2(Y)=2/3; b_3(X)=2/3, b_3(Y)=1/3
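The parameter values on this slide are consistent with simple counting along one segmented sequence of 10 observations (4 in s1, 3 in s2, 3 in s3). The sketch below assumes that reading. Both the state segmentation and the X/Y sequence are hypothetical reconstructions, since the actual training data is not in the transcript, and the function name `reestimate` is an illustrative choice.

```python
from collections import Counter
from fractions import Fraction

def reestimate(states, observations, n_states, symbols):
    """Count-based re-estimation from one Viterbi-segmented sequence."""
    trans = Counter(zip(states, states[1:]))     # (i, j) transition counts
    emit = Counter(zip(states, observations))    # (j, symbol) emission counts
    leaving = Counter(states[:-1])               # transitions out of each state
    visits = Counter(states)                     # time steps spent in each state
    pi = [Fraction(int(states[0] == i)) for i in range(n_states)]
    A = [[Fraction(trans[(i, j)], leaving[i]) if leaving[i] else Fraction(0)
          for j in range(n_states)] for i in range(n_states)]
    B = [[Fraction(emit[(j, s)], visits[j]) if visits[j] else Fraction(0)
          for s in symbols] for j in range(n_states)]
    return pi, A, B

# HYPOTHETICAL data: 0-indexed states, so s1 is 0, s2 is 1, s3 is 2.
states = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
observations = list("XXYXYXYXXY")

pi, A, B = reestimate(states, observations, 3, ["X", "Y"])
print(pi)
print(A)
print(B)
```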

45 Solution to Problem 3 – Segmental K-means vs. Baum-Welch

46 Solution to Problem 3 – The Baum-Welch Algorithm Define two new variables: γ_t(i) = P(q_t = i | O, λ) –Probability of being in state i at time t, given O and λ ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ) –Probability of being in state i at time t and state j at time t+1, given O and λ

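47 Solution to Problem 3 – The Baum-Welch Algorithm (cont’d) Re-estimation formulae for π, A, and B are: π_i = γ_1(i) (expected frequency of state i at time t = 1); a_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i) (expected number of transitions from state i to state j over the expected number of transitions out of state i); b_j(v_k) = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j) (expected number of times in state j observing v_k over the expected number of times in state j)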

48 Solution to Problem 3 – The Baum-Welch Algorithm (cont’d) We can use the forward-backward algorithm to obtain γ_t(i) = P(q_t = i | O, λ) and ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ)

49 Solution to Problem 3 – The Forward Procedure Forward variable: α_t(i) = P(o_1, o_2, …, o_t, q_t = i | λ) –Probability of the joint event that o_1, o_2, …, o_t are observed and the state at time t is i, given the model λ –α_3(1) = P(o_1, o_2, o_3, q_3 = 1|λ) = [α_2(1) × a_11 + α_2(2) × a_21 + α_2(3) × a_31] × b_1(o_3) [Figure: trellis of states S1, S2, S3 against time 1, 2, 3, …, T-1, T with observations o_1, o_2, o_3, …, o_{T-1}, o_T, showing α_2(1), α_2(2), α_2(3) feeding α_3(1) through a_11, a_21, a_31 and b_1(o_3).]

50 Solution to Problem 3 – The Backward Procedure Backward variable: β_t(i) = P(o_{t+1}, o_{t+2}, …, o_T | q_t = i, λ) –Probability of the partial observation sequence o_{t+1}, o_{t+2}, …, o_T, given state i at time t and the model λ –β_3(1) = P(o_4, o_5, …, o_T | q_3 = 1, λ) = a_11 × b_1(o_4) × β_4(1) + a_12 × b_2(o_4) × β_4(2) + a_13 × b_3(o_4) × β_4(3) [Figure: trellis of states S1, S2, S3 against time 1, 2, 3, …, T-1, T with observations o_1, o_2, o_3, o_4, …, o_{T-1}, o_T, showing β_3(1) computed from β_4(1) through a_11 and b_1(o_4).]

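51 Solution to Problem 3 – The Backward Procedure (cont’d) Algorithm 1. Initialization: β_T(i) = 1, 1≤i≤N 2. Induction: β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j), t = T-1, T-2, …, 1, 1≤i≤N 3. Termination: P(O|λ) = Σ_{i=1}^{N} π_i b_i(o_1) β_1(i) cf. the forward procedure, where P(O|λ) = Σ_{i=1}^{N} α_T(i)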
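A minimal sketch of the backward recursion follows, run on the Dow Jones HMM values (Huang et al., 2001) quoted earlier in this deck and on the observation sequence (up, up). The function name `backward` and the `emissions` interface (one vector of b_j(o_t) per time step) are illustrative assumptions.

```python
PI = [0.5, 0.2, 0.3]
A = [[0.6, 0.2, 0.2],
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
B_UP = [0.7, 0.1, 0.3]   # b_j(up)

def backward(pi, A, emissions):
    """emissions[t][j] = b_j(o_t). Returns (beta table, P(O | lambda))."""
    N, T = len(pi), len(emissions)
    # Initialization: beta_T(i) = 1
    beta = [[1.0] * N for _ in range(T)]
    # Induction, right to left:
    # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(
                A[i][j] * emissions[t + 1][j] * beta[t + 1][j] for j in range(N)
            )
    # Termination: P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    likelihood = sum(pi[i] * emissions[0][i] * beta[0][i] for i in range(N))
    return beta, likelihood

beta, likelihood_b = backward(PI, A, [B_UP, B_UP])   # O = (up, up)
print(beta[0], likelihood_b)
```

The likelihood obtained this way should agree with the forward value P(up, up|λ) = 0.2234, which is a convenient consistency check.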

52 Solution to Problem 3 – The Forward-Backward Algorithm Relation between the forward and backward variables (Huang et al., 2001): α_t(i) β_t(i) = P(O, q_t = i | λ), so P(O|λ) = Σ_{i=1}^{N} α_t(i) β_t(i) for any t

53 Solution to Problem 3 – The Forward-Backward Algorithm (cont’d)

54 Solution to Problem 3 – The Forward-Backward Algorithm (cont’d) γ_t(i) = P(q_t = i | O, λ) = α_t(i) β_t(i) / P(O|λ) –Probability of being in state i at time t, given O and λ ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O|λ) –Probability of being in state i at time t and state j at time t+1, given O and λ

55 Solution to Problem 3 – The Forward-Backward Algorithm (cont’d) P(q_3 = 1, O|λ) = α_3(1) * β_3(1) [Figure: trellis of states S1, S2, S3 against time 1, 2, 3, …, T-1, T with observations o_1, o_2, o_3, …, o_{T-1}, o_T; the node for state S1 at time 3 is marked with α_3(1) and β_3(1).]

56 Solution to Problem 3 – The Forward-Backward Algorithm (cont’d) P(q_3 = 1, q_4 = 3, O|λ) = α_3(1) * a_13 * b_3(o_4) * β_4(3) [Figure: trellis of states S1, S2, S3 against time 1, 2, 3, …, T-1, T with observations o_1, o_2, o_3, …, o_{T-1}, o_T; the arc from S1 at time 3 to S3 at time 4 is marked with α_3(1), a_13, b_3(o_4), and β_4(3).]
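Putting the pieces together, the sketch below computes γ_t(i) and ξ_t(i, j) from the forward and backward variables and then performs one re-estimation of π and A. It uses the Dow Jones HMM values (Huang et al., 2001) quoted earlier in this deck with the observation sequence (up, up, up). B is not re-estimated because only b_j(up) is available in the transcript. All function names are illustrative, and this shows a single Baum-Welch step rather than a full training loop.

```python
PI = [0.5, 0.2, 0.3]
A = [[0.6, 0.2, 0.2],
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
B_UP = [0.7, 0.1, 0.3]   # b_j(up)

def forward_backward(pi, A, em):
    """em[t][j] = b_j(o_t). Returns (gamma, xi, P(O | lambda))."""
    N, T = len(pi), len(em)
    alpha = [[pi[i] * em[0][i] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * em[t][j]
                      for j in range(N)])
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * em[t + 1][j] * beta[t + 1][j] for j in range(N))
                   for i in range(N)]
    p_obs = sum(alpha[-1])
    # gamma_t(i) = alpha_t(i) * beta_t(i) / P(O | lambda)
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]
    # xi_t(i,j) = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j) / P(O | lambda)
    xi = [[[alpha[t][i] * A[i][j] * em[t + 1][j] * beta[t + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    return gamma, xi, p_obs

def reestimate_pi_A(gamma, xi):
    """pi_i = gamma_1(i); a_ij = sum_t xi_t(i,j) / sum_{t<T} gamma_t(i)."""
    N = len(gamma[0])
    new_pi = list(gamma[0])
    new_A = [[sum(x[i][j] for x in xi) / sum(g[i] for g in gamma[:-1])
              for j in range(N)] for i in range(N)]
    return new_pi, new_A

gamma, xi, p_obs = forward_backward(PI, A, [B_UP] * 3)   # O = (up, up, up)
new_pi, new_A = reestimate_pi_A(gamma, xi)
print(new_pi)
print(new_A)
```

Two identities make useful sanity checks: each γ_t sums to 1 over the states, and Σ_j ξ_t(i, j) = γ_t(i), which is why every re-estimated row of A sums to 1.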

57 Lagrange multiplier