
Hidden Markov Models

Hidden Markov Model In some Markov processes, we may not be able to observe the states directly.

Hidden Markov Model
A HMM is a quintuple (S, E, π, A, B):
S = {s_1, …, s_N}: the values of the hidden states
E = {e_1, …, e_T}: the values of the observations
π: probability distribution of the initial state
A: transition probability matrix
B: emission probability matrix
[Graphical model: hidden chain X_1 → … → X_{t-1} → X_t → X_{t+1} → … → X_T, with each X_t emitting observation e_t]
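As a concrete instance of the quintuple, the sketch below encodes a hypothetical two-state "umbrella world" (a common textbook example; the states, observation names, and numbers are illustrative assumptions, not taken from these slides):

```python
import numpy as np

# A toy HMM in the (S, E, pi, A, B) form above.
# The umbrella-world numbers are illustrative assumptions.
states = ["Rain", "NoRain"]              # hidden state values S
obs_values = ["Umbrella", "NoUmbrella"]  # observation values E

pi = np.array([0.5, 0.5])                # initial state distribution
A = np.array([[0.7, 0.3],                # A[i, j] = P(X_{t+1}=s_j | X_t=s_i)
              [0.3, 0.7]])
B = np.array([[0.9, 0.1],                # B[i, k] = P(obs_values[k] | X_t=s_i)
              [0.2, 0.8]])

# Every row of A and B is a conditional distribution, so it sums to 1.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```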

Inferences with HMM
Filtering: P(x_t | e_{1:t}). Given an observation sequence, compute the distribution of the last state.
Decoding: argmax_{x_{1:t}} P(x_{1:t} | e_{1:t}). Given an observation sequence, compute the most likely hidden state sequence.
Learning: argmax_θ P_θ(e_{1:t}), where θ = (π, A, B) are the parameters of the HMM. Given an observation sequence, find the transition probability and emission probability tables that assign the observations the highest probability. This is unsupervised learning.

Filtering
P(X_{t+1} | e_{1:t+1}) = P(X_{t+1} | e_{1:t}, e_{t+1})
= P(e_{t+1} | X_{t+1}, e_{1:t}) P(X_{t+1} | e_{1:t}) / P(e_{t+1} | e_{1:t})
= P(e_{t+1} | X_{t+1}) P(X_{t+1} | e_{1:t}) / P(e_{t+1} | e_{1:t})
where P(X_{t+1} | e_{1:t}) = Σ_{x_t} P(X_{t+1} | x_t, e_{1:t}) P(x_t | e_{1:t})
This has the same form as P(x_t | e_{1:t}), so use recursion.
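The recursion above can be sketched in a few lines. This is a minimal illustration using the hypothetical two-state umbrella world (state 0 = Rain, observation 0 = Umbrella); normalizing at each step plays the role of dividing by P(e_{t+1} | e_{1:t}):

```python
import numpy as np

def filter_hmm(pi, A, B, obs):
    """Return P(X_t | e_{1:t}) for each t via the filtering recursion.
    pi: initial distribution, A: transition matrix, B: emission matrix,
    obs: sequence of observation indices."""
    f = pi * B[:, obs[0]]
    f = f / f.sum()                # normalize by P(e_1)
    beliefs = [f]
    for e in obs[1:]:
        f = B[:, e] * (A.T @ f)    # one-step prediction, then weight by evidence
        f = f / f.sum()            # normalize by P(e_{t+1} | e_{1:t})
        beliefs.append(f)
    return beliefs

# Hypothetical umbrella-world parameters.
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
beliefs = filter_hmm(pi, A, B, [0, 0])
print(beliefs[-1][0])  # ≈ 0.883: rain is likely after two umbrella days
```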

Filtering Example

Viterbi Algorithm
Compute argmax_{x_{1:t}} P(x_{1:t} | e_{1:t}).
Since P(x_{1:t} | e_{1:t}) = P(x_{1:t}, e_{1:t}) / P(e_{1:t}), and P(e_{1:t}) remains constant as we consider different x_{1:t},
argmax_{x_{1:t}} P(x_{1:t} | e_{1:t}) = argmax_{x_{1:t}} P(x_{1:t}, e_{1:t}).
Since the Markov chain is a Bayes net,
P(x_{1:t}, e_{1:t}) = P(x_0) Π_{i=1}^{t} P(x_i | x_{i-1}) P(e_i | x_i).
Equivalently, minimize
−log P(x_{1:t}, e_{1:t}) = −log P(x_0) + Σ_{i=1}^{t} (−log P(x_i | x_{i-1}) − log P(e_i | x_i)).

Viterbi Algorithm
Given a HMM (S, E, π, A, B) and observations e_{1:t}, construct a graph consisting of 1 + tN nodes:
one initial node, and N nodes at each time i, where the jth node at time i represents X_i = s_j.
The link from node X_{i-1} = s_j to node X_i = s_k has length −log [P(X_i = s_k | X_{i-1} = s_j) P(e_i | X_i = s_k)].

The problem of finding argmax_{x_{1:t}} P(x_{1:t} | e_{1:t}) then becomes that of finding the shortest path from the initial node x_0 to one of the N nodes at time t.
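A sketch of this shortest-path view, as min-sum over −log probabilities (the two-state umbrella parameters are hypothetical; here the initial distribution and first emission are folded into the start costs, and all referenced probabilities are assumed strictly positive so the logs are defined):

```python
import math

def viterbi(pi, A, B, obs):
    """Most likely hidden state sequence via the -log shortest-path view."""
    N = len(pi)
    # Cost of reaching each state at time 1 (initial node folded in).
    cost = [-math.log(pi[i] * B[i][obs[0]]) for i in range(N)]
    back = []  # back[t][j]: best predecessor of state s_j at step t+1
    for e in obs[1:]:
        prev, cost, ptr = cost, [], []
        for j in range(N):
            c, best_i = min((prev[i] - math.log(A[i][j]), i) for i in range(N))
            cost.append(c - math.log(B[j][e]))
            ptr.append(best_i)
        back.append(ptr)
    # Pick the cheapest endpoint, then trace the path backwards.
    j = cost.index(min(cost))
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return path[::-1]

# Hypothetical umbrella world: two umbrella days, then none.
pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.3, 0.7]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(viterbi(pi, A, B, [0, 0, 1]))  # [0, 0, 1]: Rain, Rain, NoRain
```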

Example

Baum-Welch Algorithm
The previous two kinds of computation need the parameters θ = (π, A, B). Where do these probabilities come from?
Relative frequency? But the states are not observable!
Solution: the Baum-Welch algorithm
Unsupervised learning from observations
Find argmax_θ P_θ(e_{1:t})

Baum-Welch Algorithm
Start with an initial set of parameters θ_0 (possibly arbitrary).
Compute pseudo counts: how many times did the transition from X_{i-1} = s_j to X_i = s_k occur?
Use the pseudo counts to obtain another (better) set of parameters θ_1.
Iterate until P_{θ_{k+1}}(e_{1:t}) is no bigger than P_{θ_k}(e_{1:t}).
This is a special case of EM (Expectation-Maximization).

Pseudo Counts
Given the observation sequence e_{1:T}, the pseudo count of the link from X_t = s_i to X_{t+1} = s_j is the probability P(X_t = s_i, X_{t+1} = s_j | e_{1:T}).

Update HMM Parameters
For each t:
add P(X_t = s_i, X_{t+1} = s_j | e_{1:T}) to count(i, j);
add P(X_t = s_i | e_{1:T}) to count(i);
add P(X_t = s_i | e_{1:T}) to count(i, e_t).
Updated a_ij = count(i, j) / count(i); updated b_i(e) = count(i, e) / count(i).

P(X t =s i,X t+1 =s j |e 1:T ) =P(X t =s i,X t+1 =s j, e 1:t, e t+1, e t+2:T )/ P(e 1:T ) =P(X t =s i, e 1:t )P(X t+1 =s j |X t =s i )P(e t |X t+1 =s j ) P(e t+2:T |X t+1 =s j )/P(e 1:T ) =P(X t =s i, e 1:t ) a ij b je t P(e t+2:T |X t+1 =s j )/ P(e 1:T ) =  i (t) a ij b je t β j (t+1)/P(e 1:T )

Forward Probability
α_i(t) = P(e_{1:t}, X_t = s_i)
Recursion: α_j(t+1) = b_j(e_{t+1}) Σ_i α_i(t) a_ij

Backward Probability
β_i(t) = P(e_{t+1:T} | X_t = s_i)
Recursion: β_i(t) = Σ_j a_ij b_j(e_{t+1}) β_j(t+1), with β_i(T) = 1

[Trellis diagram over times t−1, t, t+1, t+2: the link from X_t = s_i to X_{t+1} = s_j carries forward probability α_i(t), weight a_ij b_j(e_{t+1}), and backward probability β_j(t+1)]

P(X_t = s_i | e_{1:T})
= P(X_t = s_i, e_{1:t}, e_{t+1:T}) / P(e_{1:T})
= P(e_{t+1:T} | X_t = s_i, e_{1:t}) P(X_t = s_i, e_{1:t}) / P(e_{1:T})
= P(e_{t+1:T} | X_t = s_i) P(X_t = s_i, e_{1:t}) / P(e_{1:T})
= α_i(t) β_i(t) / P(e_{1:T})
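Putting the forward and backward probabilities together, one E-step of Baum-Welch (the pseudo-count computation) can be sketched as follows. This is a minimal illustration with the hypothetical umbrella-world parameters; α and β are kept unnormalized, so P(e_{1:T}) = Σ_i α_i(T):

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Unnormalized alpha[t, i] = P(e_{1:t+1}, X=s_i) and
    beta[t, i] = P(e_{t+2:T} | X=s_i), with 0-based t."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = B[:, obs[t]] * (A.T @ alpha[t - 1])
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

def pseudo_counts(pi, A, B, obs):
    """xi[t, i, j] = P(X_t=s_i, X_{t+1}=s_j | e_{1:T})
                   = alpha_i(t) a_ij b_j(e_{t+1}) beta_j(t+1) / P(e_{1:T})."""
    alpha, beta = forward_backward(pi, A, B, obs)
    evidence = alpha[-1].sum()  # P(e_{1:T})
    xi = np.array([
        alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]
        for t in range(len(obs) - 1)
    ]) / evidence
    return xi

# Hypothetical umbrella-world parameters.
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
xi = pseudo_counts(pi, A, B, [0, 0, 1])
# Each time slice of xi is a joint distribution over (s_i, s_j).
assert np.allclose(xi.sum(axis=(1, 2)), 1.0)
```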

Speech Recognition

Phones

Speech Signal
[Figure: waveform and spectrogram of a speech signal]

Feature Extraction
[Figure: the signal is divided into frames (Frame 1, Frame 2, …); each frame is converted into a feature vector (X_1, X_2, …)]

Speech System Architecture
Speech input → acoustic analysis → feature vectors x_1 … x_T.
The phoneme inventory and pronunciation lexicon define the acoustic model P(x_1 … x_T | w_1 … w_k); the language model provides P(w_1 … w_k).
Global search: maximize P(x_1 … x_T | w_1 … w_k) · P(w_1 … w_k) over word sequences w_1 … w_k → recognized word sequence.

HMM for Speech Recognition
Word model: states start_0, n_1, iy_2, d_3, end_4 (the phone sequence n-iy-d, i.e. the word "need"), with transitions a_01, a_12, a_23, a_34, self-loops a_11, a_22, a_33, and skip transition a_24.
Observation sequence: o_1 o_2 o_3 o_4 o_5 o_6 …, scored by emission probabilities b_j(o_t), e.g. b_1(o_1), b_1(o_2), …

Language Modeling
Goal: determine which sequence of words is more likely:
I went to a party. / Eye went two a bar tea.
Rudolph the Red Nose reigned here. / Rudolph the Red knows rain, dear. / Rudolph the red nose reindeer.
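One common way to realize such a language model is an n-gram model. The sketch below scores the first pair of sentences with hypothetical bigram probabilities (a real model would estimate them from a large corpus; the numbers here are invented for illustration):

```python
import math

# Hypothetical bigram probabilities; a real language model would
# estimate these from a large text corpus.
bigram_prob = {
    ("i", "went"): 0.1, ("went", "to"): 0.4, ("to", "a"): 0.3,
    ("a", "party"): 0.05, ("a", "bar"): 0.01,
}

def log_prob(words, floor=1e-8):
    """Bigram log-probability; unseen pairs get a small floor probability."""
    return sum(math.log(bigram_prob.get(pair, floor))
               for pair in zip(words, words[1:]))

s1 = "i went to a party".split()
s2 = "eye went two a bar tea".split()
assert log_prob(s1) > log_prob(s2)  # the model prefers the sensible sentence
```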

Summary
HMM: filtering, decoding, learning
Speech recognition: feature extraction from the signal; HMMs for speech recognition
