# Lecture 15 Hidden Markov Models Dr. Jianjun Hu mleg.cse.sc.edu/edu/csce833 CSCE833 Machine Learning University of South Carolina Department of Computer.

## Presentation on theme: "Lecture 15 Hidden Markov Models Dr. Jianjun Hu mleg.cse.sc.edu/edu/csce833 CSCE833 Machine Learning University of South Carolina Department of Computer."— Presentation transcript:

Lecture 15 Hidden Markov Models Dr. Jianjun Hu mleg.cse.sc.edu/edu/csce833 CSCE833 Machine Learning University of South Carolina Department of Computer Science and Engineering

Outline Evaluation problem of HMM Decoding problem of HMM Learning problem of HMM –A multi-million dollar algorithm—viterbi algorithm

Evaluation problem. Given the HMM M=(A, B,  ) and the observation sequence O=o 1 o 2... o K, calculate the probability that model M has generated sequence O. Decoding problem. Given the HMM M=(A, B,  ) and the observation sequence O=o 1 o 2... o K, calculate the most likely sequence of hidden states s i that produced this observation sequence O. Learning problem. Given some training observation sequences O=o 1 o 2... o K and general structure of HMM (numbers of hidden and visible states), determine HMM parameters M=(A, B,  ) that best fit training data. O=o 1...o K denotes a sequence of observations o k  { v 1,…,v M }. Main issues using HMMs :

Typed word recognition, assume all characters are separated. Character recognizer outputs probability of the image being particular character, P(image|character). 0.5 0.03 0.005 0.31 z c b a Word recognition example(1). Hidden state Observation

If lexicon is given, we can construct separate HMM models for each lexicon word. Amherstam he r s t Buffalobu ff a l o 0.5 0.03 Here recognition of word image is equivalent to the problem of evaluating few HMM models. This is an application of Evaluation problem. Word recognition example(2). 0.40.6

We can construct a single HMM for all words. Hidden states = all characters in the alphabet. Transition probabilities and initial probabilities are calculated from language model. Observations and observation probabilities are as before. a m he r s t b v f o Here we have to determine the best sequence of hidden states, the one that most likely produced word image. This is an application of Decoding problem. Word recognition example(3).

Evaluation problem. Given the HMM M=(A, B,  ) and the observation sequence O=o 1 o 2... o K, calculate the probability that model M has generated sequence O. Trying to find probability of observations O=o 1 o 2... o K by means of considering all hidden state sequences (as was done in example) is impractical: N K hidden state sequences - exponential complexity. Use Forward-Backward HMM algorithms for efficient calculations. (Dynamic programming!) Define the forward variable  k (i) as the joint probability of the partial observation sequence o 1 o 2... o k and that the hidden state at time k is s i :  k (i)= P(o 1 o 2... o k, q k = s i ) Evaluation Problem.

s1s1 s2s2 sisi sNsN s1s1 s2s2 sisi sNsN s1s1 s2s2 sjsj sNsN s1s1 s2s2 sisi sNsN a 1j a 2j a ij a Nj Time= 1 k k+1 K o 1 o k o k+1 o K = Observations Trellis representation of an HMM

Initialization:  1 (i)= P(o 1, q 1 = s i ) =  i b i (o 1 ), 1<=i<=N. Forward recursion:  k+1 (i)= P(o 1 o 2... o k+1, q k+1 = s j ) =  i P(o 1 o 2... o k+1, q k = s i, q k+1 = s j ) =  i P(o 1 o 2... o k, q k = s i ) a ij b j (o k+1 ) = [  i  k (i) a ij ] b j (o k+1 ), 1<=j<=N, 1<=k<=K-1. Termination: P(o 1 o 2... o K ) =  i P(o 1 o 2... o K, q K = s i ) =  i  K (i) Complexity : N 2 K operations. Forward recursion for HMM

s1s1 s2s2 sisi sNsN s1s1 s2s2 sjsj sNsN s1s1 s2s2 sisi sNsN s1s1 s2s2 sisi sNsN a j1 a j2 a ji a jN Time= 1 k k+1 K o 1 o k o k+1 o K = Observations Trellis representation of an HMM

Define the forward variable  k (i) as the joint probability of the partial observation sequence o k+1 o k+2... o K given that the hidden state at time k is s i :  k (i)= P(o k+1 o k+2... o K |q k = s i ) Initialization:  K (i)= 1, 1<=i<=N. Backward recursion:  k (j)= P(o k+1 o k+2... o K | q k = s j ) =  i P(o k+1 o k+2... o K, q k+1 = s i | q k = s j ) =  i P(o k+2 o k+3... o K | q k+1 = s i ) a ji b i (o k+1 ) =  i  k+1 (i) a ji b i (o k+1 ), 1<=j<=N, 1<=k<=K-1. Termination: P(o 1 o 2... o K ) =  i P(o 1 o 2... o K, q 1 = s i ) =  i P(o 1 o 2... o K |q 1 = s i ) P(q 1 = s i ) =  i  1 (i) b i (o 1 )  i Backward recursion for HMM

Decoding problem. Given the HMM M=(A, B,  ) and the observation sequence O=o 1 o 2... o K, calculate the most likely sequence of hidden states s i that produced this observation sequence. We want to find the state sequence Q = q 1 …q K which maximizes P(Q | o 1 o 2... o K ), or equivalently P(Q, o 1 o 2... o K ). Brute force consideration of all paths takes exponential time. Use efficient Viterbi algorithm instead. Define variable  k (i) as the maximum probability of producing observation sequence o 1 o 2... o k when moving along any hidden state sequence q 1 … q k-1 and getting into q k = s i.  k (i) = max P(q 1 … q k-1, q k = s i, o 1 o 2... o k ) where max is taken over all possible paths q 1 … q k-1. Decoding problem

Andrew Viterbi In 1967 he invented the Viterbi algorithm, which he used for decoding convolutionally encoded data. It is still used widely in cellular phones for error correcting codes, as well as for speech recognition, DNA analysis, and many other applications of Hidden Markov models In September 2008, he was awarded the National Medal of Science for developing "the 'Viterbi algorithm,' and for his contributions to Code Division Multiple Access (CDMA) wireless technology that transformed the theory and practice of digital communications."

General idea: if best path ending in q k = s j goes through q k-1 = s i then it should coincide with best path ending in q k-1 = s i. s1s1 sisi sNsN sjsj a ij a Nj a 1j q k-1 q k  k (i) = max P(q 1 … q k-1, q k = s j, o 1 o 2... o k ) = max i [ a ij b j (o k ) max P(q 1 … q k-1 = s i, o 1 o 2... o k-1 ) ] To backtrack best path keep info that predecessor of s j was s i. Viterbi algorithm (1)

Initialization:  1 (i) = max P(q 1 = s i, o 1 ) =  i b i (o 1 ), 1<=i<=N. Forward recursion:  k (j) = max P(q 1 … q k-1, q k = s j, o 1 o 2... o k ) = max i [ a ij b j (o k ) max P(q 1 … q k-1 = s i, o 1 o 2... o k-1 ) ] = max i [ a ij b j (o k )  k-1 (i) ], 1<=j<=N, 2<=k<=K. Termination: choose best path ending at time K max i [  K (i) ] Backtrack best path. This algorithm is similar to the forward recursion of evaluation problem, with  replaced by max and additional backtracking. Viterbi algorithm (2)

Learning problem. Given some training observation sequences O=o 1 o 2... o K and general structure of HMM (numbers of hidden and visible states), determine HMM parameters M=(A, B,  ) that best fit training data, that is maximizes P(O | M). There is no algorithm producing optimal parameter values. Learning problem (1)

If training data has information about sequence of hidden states (as in word recognition example), then use maximum likelihood estimation of parameters: a ij = P(s i | s j ) = Number of transitions from state s j to state s i Number of transitions out of state s j b i (v m ) = P(v m | s i )= Number of times observation v m occurs in state s i Number of times in state s i Learning problem (2) The state sequence is generated by Viterbi algoritm

CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC 011111112222222111111222211111112222111110 011111112222222111111222211111112222111110 to state 012 from state 00 (0%)1 (100%)0 (0%) 11 (4%)21 (84%)3 (12%) 20 (0%)3 (20%)12 (80%) symbol ACGT in state 1 6 (24%) 7 (28%) 5 (20%) 7 (28%) 2 3 (20%) 2 (13%) 7 (47%) Example: Training Unambiguous Models transitions emissions

HMM Training Ambiguous Models The Problem: We have training sequences, but not the associated paths (state labels) Two Solutions: 1. Viterbi Training: Start with random HMM parameters. Use Viterbi to find the most probable path for each training sequence, and then label the sequence with that path. Use labeled sequence training on the resulting set of sequences and paths. 2. Baum-Welch Training: sum over all possible paths (rather than the single most probable one) to estimate expected counts A i, j and E i,k ; then use the same formulas as for labeled sequence training on these expected counts:

training features Viterbi Decoder Sequence Labeler Labeled Sequence Trainer paths {  i } initial submodel M new submodel M labeled features final submodel M * repeat n times Viterbi Training (iterate over the training sequences) (find most probable path for this sequence) (label the sequence with that path)

General idea: a ij = P(s i | s j ) = Expected number of transitions from state s j to state s i Expected number of transitions out of state s j b i (v m ) = P(v m | s i )= Expected number of times observation v m occurs in state s i Expected number of times in state s i  i = P(s i ) = Expected frequency in state s i at time k=1. Baum-Welch algorithm

s1s1 s2s2 sisi sNsN s1s1 s2s2 sisi sNsN s1s1 s2s2 sjsj sNsN s1s1 s2s2 sisi sNsN a ij Time= 1 k k+1 K o 1 o k o k+1 o K = Observations Trellis representation of an HMM  k (i)  k+1 (j)

Define variable  k (i,j) as the probability of being in state s i at time k and in state s j at time k+1, given the observation sequence o 1 o 2... o K.  k (i,j)= P(q k = s i, q k+1 = s j | o 1 o 2... o K )  k (i,j)= P(q k = s i, q k+1 = s j, o 1 o 2... o k ) P(o 1 o 2... o k ) = P(q k = s i, o 1 o 2... o k ) a ij b j (o k+1 ) P(o k+2... o K | q k+1 = s j ) P(o 1 o 2... o k ) =  k (i) a ij b j (o k+1 )  k+1 (j)  i  j  k (i) a ij b j (o k+1 )  k+1 (j) Baum-Welch algorithm: expectation step(1)

Define variable  k (i) as the probability of being in state s i at time k, given the observation sequence o 1 o 2... o K.  k (i)= P(q k = s i | o 1 o 2... o K )  k (i)= P(q k = s i, o 1 o 2... o k ) P(o 1 o 2... o k ) =  k (i)  k (i)  i  k (i)  k (i) Baum-Welch algorithm: expectation step(2)

We calculated  k (i,j) = P(q k = s i, q k+1 = s j | o 1 o 2... o K ) and  k (i)= P(q k = s i | o 1 o 2... o K ) Expected number of transitions from state s i to state s j = =  k  k (i,j) Expected number of transitions out of state s i =  k  k (i) Expected number of times observation v m occurs in state s i = =  k  k (i), k is such that o k = v m Expected frequency in state s i at time k=1 :  1 (i). Baum-Welch algorithm: expectation step(3)

a ij = Expected number of transitions from state s j to state s i Expected number of transitions out of state s j b i (v m ) = Expected number of times observation v m occurs in state s i Expected number of times in state s i  i = ( Expected frequency in state s i at time k=1) =  1 (i). =  k  k (i,j)  k  k (i) =  k  k (i,j)  k,o k = v m  k (i) Baum-Welch algorithm: maximization step

Baum-Welch Training compute Fwd & Bkwd DP matrices accumulate expected counts for E & A

Download ppt "Lecture 15 Hidden Markov Models Dr. Jianjun Hu mleg.cse.sc.edu/edu/csce833 CSCE833 Machine Learning University of South Carolina Department of Computer."

Similar presentations