

1 Fall 2001 EE669: Natural Language Processing. Lecture 9: Hidden Markov Models (HMMs) (Chapter 9 of Manning and Schütze). Dr. Mary P. Harper, ECE, Purdue University, harper@purdue.edu, yara.ecn.purdue.edu/~harper

2 Markov Assumptions. Often we want to consider a sequence of random variables that are not independent; rather, the value of each variable depends on previous elements in the sequence. Let X = (X1, ..., XT) be a sequence of random variables taking values in some finite set S = {s1, ..., sN}, the state space. If X possesses the following properties, then X is a Markov chain:
– Limited Horizon: P(Xt+1 = sk | X1, ..., Xt) = P(Xt+1 = sk | Xt), i.e., the next state depends only on the current state.
– Time Invariant (Stationary): P(Xt+1 = sk | Xt) = P(X2 = sk | X1), i.e., the dependency does not change over time.

3 Definition of a Markov Chain. A is a stochastic N × N matrix of transition probabilities with:
– aij = Pr{transition from si to sj} = Pr(Xt+1 = sj | Xt = si)
– aij ≥ 0 for all i, j, and Σj=1..N aij = 1 for all i
π is a vector of N elements representing the initial state probability distribution, with:
– πi = Pr{the initial state is si} = Pr(X1 = si)
– Σi=1..N πi = 1
– Can avoid this by creating a special start state s0.

4 Markov Chain [state-transition diagram]

5 Markov Process & Language Models. Chain rule: P(W) = P(w1, w2, ..., wT) = Πi=1..T p(wi | w1, w2, ..., wi-1). n-gram language models are Markov processes (chains) of order n-1: P(W) = P(w1, w2, ..., wT) = Πi=1..T p(wi | wi-n+1, wi-n+2, ..., wi-1). Using just one distribution (e.g., trigram model: p(wi | wi-2, wi-1)):
Positions: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Words: My car broke down , and within hours Bob 's car broke down , too .
p(, | broke down) = p(w5 | w3, w4) = p(w14 | w12, w13)
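The trigram computation above can be sketched in Python. The maximum-likelihood estimator and the use of the slide's 16-token sentence as a toy corpus are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict

def train_trigram(tokens):
    """Maximum-likelihood trigram model: p(w | u, v) = count(u,v,w) / count(u,v)."""
    tri, bi = defaultdict(int), defaultdict(int)
    for u, v, w in zip(tokens, tokens[1:], tokens[2:]):
        tri[(u, v, w)] += 1
        bi[(u, v)] += 1
    return lambda w, u, v: tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

tokens = "My car broke down , and within hours Bob 's car broke down , too .".split()
p = train_trigram(tokens)
# Positions 5 and 14 both have "," after "broke down", so the MLE is 2/2 = 1.0.
print(p(",", "broke", "down"))
```

Both occurrences of "broke down" condition the same distribution, which is exactly the point of the slide: one distribution is reused at positions 5 and 14.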

6 Example with a Markov Chain [diagram: states t, e, h, a, p, i plus a Start state, with arcs labeled by transition probabilities such as 1.0, 0.6, 0.4, 0.3]

7 Probability of a Sequence of States. P(t, a, p, p) = 1.0 × 0.3 × 0.4 × 1.0 = 0.12. P(t, a, p, e) = ?
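A quick sketch of this computation in Python. The transition table below fills in only the arcs needed for the example (start→t = 1.0, t→a = 0.3, a→p = 0.4, p→p = 1.0, read off the product above); treat these numbers, and the resulting P(t, a, p, e) = 0, as assumptions about the diagram rather than its full specification.

```python
# Only the arcs needed for this example are filled in (an assumption,
# not the complete chain from the diagram).
P_START = {"t": 1.0}
A = {("t", "a"): 0.3, ("a", "p"): 0.4, ("p", "p"): 1.0}

def seq_prob(states):
    """P(X1, ..., Xn) = P(X1) * product of A[Xi -> Xi+1]; absent arcs count as 0."""
    prob = P_START.get(states[0], 0.0)
    for s, s_next in zip(states, states[1:]):
        prob *= A.get((s, s_next), 0.0)
    return prob

print(round(seq_prob(["t", "a", "p", "p"]), 2))  # 0.12
print(seq_prob(["t", "a", "p", "e"]))            # 0.0 under this table: no p -> e arc
```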

8 Hidden Markov Models (HMMs). Sometimes it is not possible to know precisely which states the model passes through; all we can do is observe some phenomenon that occurs in a state with some probability distribution. An HMM is an appropriate model when you do not know the state sequence that the model passes through, but only some probabilistic function of it. For example:
– Word recognition (discrete utterance or within continuous speech)
– Phoneme recognition
– Part-of-speech tagging
– Linear interpolation

9 Example HMM. Process:
– According to π, pick a state qi. Select a ball from urn i (state is hidden).
– Observe the color (observable).
– Replace the ball.
– According to A, transition to the next urn from which to pick a ball (state is hidden).
N states corresponding to urns: q1, q2, ..., qN. M colors for balls found in urns: z1, z2, ..., zM. A: N × N matrix; transition probability distribution (between urns). B: N × M matrix; for each urn, a probability distribution over colors: bij = Pr(COLOR = zj | URN = qi). π: N-vector; initial state probability distribution (where do we start?)

10 What is an HMM? Green circles are hidden states, dependent only on the previous state. [diagram]

11 What is an HMM? Purple nodes are observations, dependent only on their corresponding hidden state (for a state-emission model). [diagram]

12 Elements of an HMM. An HMM (the most general case) is a five-tuple (S, K, π, A, B), where:
– S = {s1, s2, ..., sN} is the set of states,
– K = {k1, k2, ..., kM} is the output alphabet,
– π = {πi}, i ∈ S, the initial state probabilities,
– A = {aij}, i, j ∈ S, the transition probabilities,
– B = {bijk}, i, j ∈ S, k ∈ K (arc emission), or B = {bik}, i ∈ S, k ∈ K (state emission), the emission probabilities.
State sequence: X = (x1, x2, ..., xT+1), each xt a state in S. Observation (output) sequence: O = (o1, ..., oT), ot ∈ K.

13 HMM Formalism (S, K, π, A, B). S: {s1 ... sN} are the values for the hidden states. K: {k1 ... kM} are the values for the observations. [trellis diagram of hidden states and their emitted observations]

14 HMM Formalism (S, K, π, A, B). π = {πi} are the initial state probabilities. A = {aij} are the state transition probabilities. B = {bik} are the observation (emission) probabilities. [trellis diagram]

15 Example of an Arc-Emit HMM [diagram: four states, entry at state 1; transitions with probabilities 0.6, 0.4, 1, 0.88, 0.12; each arc carries its own emission distribution over {t, o, e}, e.g., p(t)=.5, p(o)=.2, p(e)=.3 on one arc, p(t)=.8, p(o)=.1, p(e)=.1 on another, and so on]

16 HMMs. 1. N states in the model: a state has some measurable, distinctive properties. 2. At clock time t, you make a state transition (to a different state or back to the same state), based on a transition probability distribution that depends on the current state (i.e., the one you are in before making the transition). 3. After each transition, you emit an observation symbol according to a probability distribution that depends on the current state or the current arc. Goal: from the observations, determine which model generated the output, e.g., in word recognition with a different HMM for each word, choose the one that best fits the input word (based on state transitions and output observations).

17 Example HMM. N states corresponding to urns: q1, q2, ..., qN. M colors for balls found in urns: z1, z2, ..., zM. A: N × N matrix; transition probability distribution (between urns). B: N × M matrix; for each urn, a probability distribution over colors:
– bij = Pr(COLOR = zj | URN = qi)
π: N-vector; initial state probability distribution (where do we start?)

18 Example HMM. Process:
– According to π, pick a state qi. Select a ball from urn i (state is hidden).
– Observe the color (observable).
– Replace the ball.
– According to A, transition to the next urn from which to pick a ball (state is hidden).
Design: choose N and M. Specify a model μ = (A, B, π) from the training data. Adjust the model parameters (A, B, π) to maximize P(O|μ). Use: given O and μ = (A, B, π), what is P(O|μ)? If we compute P(O|μ) for all models μ, then we can determine which model most likely generated O.

19 Simulating a Markov Process.
t := 1; start in state si with probability πi (i.e., X1 = i)
forever do:
  move from state si to state sj with probability aij (i.e., Xt+1 = j)
  emit observation symbol ot = k with probability bik (bijk for arc emission)
  t := t + 1
end
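The loop above can be sketched as a generator for a state-emission HMM. The two-state model below uses made-up toy probabilities; only the control flow mirrors the slide.

```python
import random

# Toy two-state, state-emission HMM (made-up probabilities).
PI = {"s1": 0.8, "s2": 0.2}
A = {"s1": {"s1": 0.6, "s2": 0.4}, "s2": {"s1": 0.3, "s2": 0.7}}
B = {"s1": {"x": 0.9, "y": 0.1}, "s2": {"x": 0.2, "y": 0.8}}

def draw(dist, rng):
    """Sample one key from a {outcome: probability} dict."""
    r, acc = rng.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k  # guard against floating-point rounding

def simulate(T, seed=0):
    """Run the loop from the slide: pick a start state, then emit and move T times."""
    rng = random.Random(seed)
    state = draw(PI, rng)
    obs = []
    for _ in range(T):
        obs.append(draw(B[state], rng))   # emit o_t with probability b_ik
        state = draw(A[state], rng)       # move s_i -> s_j with probability a_ij
    return obs

print(simulate(8))
```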

20 Why Use Hidden Markov Models? HMMs are useful when one can think of underlying events probabilistically generating surface events. Example: part-of-speech tagging. HMMs can be efficiently trained using the EM algorithm. Another example where HMMs are useful is in generating parameters for linear interpolation of n-gram models. If we assume that some set of data was generated by an HMM, then the HMM can be used to calculate the probabilities of possible underlying state sequences.

21 Fundamental Questions for HMMs. Given a model μ = (A, B, π), how do we efficiently compute how likely a certain observation sequence is, that is, P(O|μ)? Given the observation sequence O and a model μ, how do we choose a state sequence (X1, ..., XT+1) that best explains the observations? Given an observation sequence O, and a space of possible models found by varying the model parameters μ = (A, B, π), how do we find the model that best explains the observed data?

22 Probability of an Observation. Given the observation sequence O = (o1, ..., oT) and a model μ = (A, B, π), we wish to know how to efficiently compute P(O|μ). Summing over all state sequences X = (X1, ..., XT+1), we find: P(O|μ) = ΣX P(X|μ) P(O|X, μ). This is simply the probability of an observation sequence given the model. Direct evaluation of this expression is extremely inefficient; however, there are dynamic programming methods that compute it quite efficiently.

23 Probability of an Observation. Given an observation sequence and a model, compute the probability of the observation sequence. [trellis diagram: observations o1, ..., ot-1, ot, ot+1, ..., oT]

24-27 Probability of an Observation [four slides building up the trellis diagram: hidden states x1, ..., xT emitting observations o1, ..., oT, with emission probabilities b and transition probabilities a added step by step]

28 Probability of an Observation (State Emit) [trellis diagram]

29 Probability of an Observation with an Arc-Emit HMM [same four-state diagram as slide 15, with arc emission distributions such as p(t)=.5, p(o)=.2, p(e)=.3; p(t)=.8, p(o)=.1, p(e)=.1; p(t)=0, p(o)=0, p(e)=1; p(t)=.1, p(o)=.7, p(e)=.2; p(t)=0, p(o)=.4, p(e)=.6; p(t)=0, p(o)=1, p(e)=0]. Summing over the three paths that can generate "toe", where each factor pair is a transition probability times the probability of emitting the required symbol on that arc: p(toe) = (.6×.8)(.88×.7)(1×.6) + (.4×.5)(1×1)(.88×.2) + (.4×.5)(1×1)(.12×1) ≈ .237
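As an arithmetic check, the three-path sum can be reproduced in a few lines. The pairing of transition and emission factors below follows the reading of the diagram given above, which is an assumption about a figure that did not survive extraction.

```python
import math

# Each path that can generate "toe" contributes a product of
# (transition probability x emission probability of the symbol on that arc).
paths = [
    [(0.6, 0.8), (0.88, 0.7), (1.0, 0.6)],   # path 1
    [(0.4, 0.5), (1.0, 1.0), (0.88, 0.2)],   # path 2
    [(0.4, 0.5), (1.0, 1.0), (0.12, 1.0)],   # path 3
]
p_toe = sum(math.prod(a * b for a, b in path) for path in paths)
print(round(p_toe, 3))  # 0.237
```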

30 Making Computation Efficient. To avoid this complexity, use dynamic programming or memoization techniques, exploiting the trellis structure of the problem. Use an array of states versus time to compute the probability of being at each state at time t+1 in terms of the probabilities for being in each state at time t. A trellis can record the probability of all initial subpaths of the HMM that end in a certain state at a certain time. The probability of longer subpaths can then be worked out in terms of the shorter subpaths. A forward variable, αi(t) = P(o1 o2 ... ot-1, Xt = i | μ), is stored at (si, t) in the trellis and expresses the total probability of ending up in state si at time t.

31 The Trellis Structure [diagram]

32 Forward Procedure. Special structure gives us an efficient solution using dynamic programming. Intuition: the probability of the first t observations is the same for all possible length-(t+1) state sequences (so don't recompute it!). Define: αi(t) = P(o1 ... ot-1, Xt = i | μ). [trellis diagram]

33 Forward Procedure. Note: we omit |μ from the formulas. [derivation on slide]

34 Forward Procedure. Note: P(A, B) = P(A|B) · P(B). [derivation on slide]

35-40 Forward Procedure [six slides stepping through the derivation of the forward recursion on the trellis]

41 The Forward Procedure. Forward variables are calculated as follows:
– Initialization: αi(1) = πi, 1 ≤ i ≤ N
– Induction: αj(t+1) = Σi=1..N αi(t) aij bijot, 1 ≤ t ≤ T, 1 ≤ j ≤ N
– Total: P(O|μ) = Σi=1..N αi(T+1)
This algorithm requires 2N²T multiplications (much less than the direct method, which takes on the order of (2T+1)·N^(T+1)).
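A minimal sketch of the forward procedure. The slide's recursion uses arc emissions bijot; this version uses the simpler state-emission form bik, and the two-state parameters are toy assumptions, not from the lecture.

```python
# Toy two-state, state-emission HMM (illustrative parameters).
PI = [0.8, 0.2]
A = [[0.6, 0.4], [0.3, 0.7]]                       # a_ij
B = [{"x": 0.9, "y": 0.1}, {"x": 0.2, "y": 0.8}]   # b_ik

def forward(obs):
    """Forward procedure: alpha[i] holds P(o1..ot, X_t = i) after step t;
    the total P(O | model) is the sum of the final alphas."""
    N = len(PI)
    alpha = [PI[i] * B[i][obs[0]] for i in range(N)]         # initialization
    for o in obs[1:]:                                        # induction
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)                                        # total

print(forward(["x", "y", "x"]))
```

Each induction step does N multiplications per state per time step, matching the 2N²T operation count quoted above.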

42 The Backward Procedure. We could also compute these probabilities by working backward through time. The backward procedure computes backward variables, which give the total probability of seeing the rest of the observation sequence given that we were in state si at time t: βi(t) = P(ot ... oT | Xt = i, μ). Backward variables are useful for the problem of parameter reestimation.

43 Backward Procedure [trellis diagram]

44 The Backward Procedure. Backward variables can be calculated working backward through the trellis as follows:
– Initialization: βi(T+1) = 1, 1 ≤ i ≤ N
– Induction: βi(t) = Σj=1..N aij bijot βj(t+1), 1 ≤ t ≤ T, 1 ≤ i ≤ N
– Total: P(O|μ) = Σi=1..N πi βi(1)
Backward variables can also be combined with forward variables: P(O|μ) = Σi=1..N αi(t) βi(t), 1 ≤ t ≤ T+1

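The backward recursion and the forward-backward combination can be sketched together. Again a state-emission toy model is assumed; the code checks that Σi αi(t)βi(t) gives the same P(O) at every time t, and uses the same quantities to pick the individually most likely state at each position.

```python
# Toy two-state, state-emission HMM (illustrative parameters).
PI = [0.8, 0.2]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [{"x": 0.9, "y": 0.1}, {"x": 0.2, "y": 0.8}]

def forward_backward(obs):
    """Return (alpha, beta): alpha[t][i] = P(o1..ot, X_t = i),
    beta[t][i] = P(o_{t+1}..o_T | X_t = i)."""
    N, T = len(PI), len(obs)
    alpha = [[PI[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    beta = [[1.0] * N]                                  # initialization at t = T
    for t in range(T - 1, 0, -1):                       # induction, backwards
        beta.insert(0, [sum(A[i][j] * B[j][obs[t]] * beta[0][j] for j in range(N))
                        for i in range(N)])
    return alpha, beta

obs = ["x", "y", "x"]
alpha, beta = forward_backward(obs)
p_obs = sum(alpha[-1])
# Combination: sum_i alpha_i(t) * beta_i(t) equals P(O) at every time t.
assert all(abs(sum(a * b for a, b in zip(alpha[t], beta[t])) - p_obs) < 1e-12
           for t in range(len(obs)))
# gamma_i(t) = alpha_i(t) beta_i(t) / P(O): the individually most likely states.
best = [max(range(len(PI)), key=lambda i: alpha[t][i] * beta[t][i])
        for t in range(len(obs))]
print(p_obs, best)
```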

46 Probability of an Observation: Forward Procedure, Backward Procedure, and their Combination [trellis diagram with the corresponding formulas]

47 Finding the Best State Sequence. One method consists of optimizing on the states individually. For each t, 1 ≤ t ≤ T+1, we would like to find the X̂t that maximizes P(Xt | O, μ). Let γi(t) = P(Xt = i | O, μ) = P(Xt = i, O | μ) / P(O|μ) = αi(t) βi(t) / Σj=1..N αj(t) βj(t). The individually most likely state is X̂t = argmax 1≤i≤N γi(t), 1 ≤ t ≤ T+1. This quantity maximizes the expected number of states that will be guessed correctly. However, it may yield a quite unlikely state sequence.

48 Best State Sequence. Find the state sequence that best explains the observation sequence: the Viterbi algorithm. [trellis diagram]

49 Finding the Best State Sequence: The Viterbi Algorithm. The Viterbi algorithm efficiently computes the most likely state sequence. To find the most likely complete path, compute: argmaxX P(X | O, μ). To do this, it is sufficient to maximize for a fixed O: argmaxX P(X, O | μ). We define δj(t) = max over X1..Xt-1 of P(X1 ... Xt-1, o1 ... ot-1, Xt = j | μ). ψj(t) records the node of the incoming arc that led to this most probable path.

50 Viterbi Algorithm. δj(t): the probability of the state sequence that maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t. [trellis diagram]

51 Viterbi Algorithm: recursive computation. [trellis diagram]

52 Finding the Best State Sequence: The Viterbi Algorithm. The Viterbi algorithm works as follows:
– Initialization: δj(1) = πj, 1 ≤ j ≤ N
– Induction: δj(t+1) = max 1≤i≤N δi(t) aij bijot, 1 ≤ j ≤ N
– Store backtrace: ψj(t+1) = argmax 1≤i≤N δi(t) aij bijot, 1 ≤ j ≤ N
– Termination and path readout: X̂T+1 = argmax 1≤i≤N δi(T+1); X̂t = ψX̂t+1(t+1); P(X̂) = max 1≤i≤N δi(T+1)
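The four steps above can be sketched for a state-emission HMM with toy parameters (the slide's arc emissions bijot become per-state emissions bjot here, an assumption of this sketch).

```python
# Toy two-state, state-emission HMM (illustrative parameters).
PI = [0.8, 0.2]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [{"x": 0.9, "y": 0.1}, {"x": 0.2, "y": 0.8}]

def viterbi(obs):
    """Return (most likely state path, joint probability P(path, O))."""
    N = len(PI)
    delta = [PI[i] * B[i][obs[0]] for i in range(N)]          # initialization
    psi = []                                                  # backtraces
    for o in obs[1:]:                                         # induction
        back = [max(range(N), key=lambda i: delta[i] * A[i][j]) for j in range(N)]
        delta = [delta[back[j]] * A[back[j]][j] * B[j][o] for j in range(N)]
        psi.append(back)                                      # store backtrace
    last = max(range(N), key=lambda i: delta[i])              # termination
    path = [last]
    for back in reversed(psi):                                # path readout
        path.insert(0, back[path[0]])
    return path, delta[last]

print(viterbi(["x", "y", "x"]))
```

On this toy model the Viterbi path happens to agree with the individually most likely states; slide 47's warning is that this need not hold in general.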

53 Viterbi Algorithm. Compute the most likely state sequence by working backwards. [trellis diagram]

54 Third Problem: Parameter Estimation. Given a certain observation sequence, we want to find the values of the model parameters μ = (A, B, π) which best explain what we observed. Using Maximum Likelihood Estimation, find values to maximize P(O|μ), i.e., argmaxμ P(Otraining|μ). There is no known analytic method to choose μ to maximize P(O|μ). However, we can locally maximize it by an iterative hill-climbing algorithm known as Baum-Welch or the Forward-Backward algorithm. (This is a special case of the EM algorithm.)

55 Parameter Estimation: The Forward-Backward Algorithm. We don't know what the model is, but we can work out the probability of the observation sequence using some (perhaps randomly chosen) model. Looking at that calculation, we can see which state transitions and symbol emissions were probably used the most. By increasing the probability of those, we can choose a revised model which gives a higher probability to the observation sequence.

56 Parameter Estimation. Given an observation sequence, find the model that is most likely to produce that sequence; there is no analytic method. Given a model and training sequences, update the model parameters to better fit the observations. [trellis diagram]

57 Definitions. The probability of traversing the arc from state i to state j at time t, given the observation sequence O: pt(i, j) = P(Xt = i, Xt+1 = j | O, μ) = αi(t) aij bijot βj(t+1) / P(O|μ). Let: γi(t) = P(Xt = i | O, μ) = Σj=1..N pt(i, j).

58 The Probability of Traversing an Arc [diagram: αi(t) at state si at time t, the arc labeled aij bijot, and βj(t+1) at state sj at time t+1]

59 Definitions. The expected number of transitions from state i in O: Σt=1..T γi(t). The expected number of transitions from state i to j in O: Σt=1..T pt(i, j).

60 Reestimation Procedure. Begin with a model μ, perhaps selected at random. Run O through the current model to estimate the expectations of each model parameter. Change the model to maximize the values of the paths that are used a lot, while respecting the stochastic constraints. Repeat this process (hoping to converge on optimal values for the model parameters μ).

61 The Reestimation Formulas. The expected frequency in state i at time t = 1 gives the new initial probabilities: π̂i = γi(1).

62 Reestimation Formulas. âij = Σt=1..T pt(i, j) / Σt=1..T γi(t); for arc emissions, b̂ijk = Σ{t : ot = k} pt(i, j) / Σt=1..T pt(i, j).
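One full reestimation step can be sketched as follows. This uses the state-emission variant (b̂ik instead of the slide's arc-emission b̂ijk), toy parameters, and a hard-coded alphabet {"x", "y"}; it illustrates the shape of the updates rather than reproducing the lecture's exact formulas.

```python
# Toy two-state, state-emission HMM; parameters and alphabet are assumptions.
PI = [0.8, 0.2]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [{"x": 0.9, "y": 0.1}, {"x": 0.2, "y": 0.8}]

def reestimate(obs):
    """One Baum-Welch step: return updated (pi, A, B)."""
    N, T = len(PI), len(obs)
    alpha = [[PI[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    beta = [[1.0] * N]
    for t in range(T - 1, 0, -1):
        beta.insert(0, [sum(A[i][j] * B[j][obs[t]] * beta[0][j] for j in range(N))
                        for i in range(N)])
    p_obs = sum(alpha[-1])
    # p_t(i, j): probability of traversing the arc i -> j at time t, given O.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # gamma_i(t): probability of being in state i at time t, given O.
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)] for t in range(T)]
    pi_new = gamma[0]
    a_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    b_new = [{k: sum(g[i] for g, o in zip(gamma, obs) if o == k) /
                 sum(g[i] for g in gamma) for k in ("x", "y")} for i in range(N)]
    return pi_new, a_new, b_new

pi2, a2, b2 = reestimate(["x", "y", "x", "x", "y"])
assert all(abs(sum(row) - 1.0) < 1e-9 for row in a2)         # rows stay stochastic
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in b2)
print(pi2)
```

Iterating this step is exactly the hill-climbing loop of slide 60: each pass cannot decrease P(O|μ).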

63 Parameter Estimation: probability of traversing an arc at time t, and probability of being in state i. [trellis diagram]

64 Parameter Estimation: now we can compute the new estimates of the model parameters. [trellis diagram]

65 HMM Applications: parameters for interpolated n-gram models; part-of-speech tagging; speech recognition.

