# Pattern Recognition Chapter 3: Hidden Markov Models (HMMs)

1 Pattern Recognition Chapter 3 Hidden Markov Models (HMMs)

2 Sequential and temporal patterns

- Sequential patterns:
  - The order of the data points is irrelevant.
  - No explicit sequencing...
- Temporal patterns:
  - The result of a time process (e.g., time series).
  - Can be represented by a number of states.
  - States at time $t$ are influenced directly by states in previous time steps (i.e., correlated).

3 Hidden Markov Models (HMMs)

- HMMs are appropriate for problems that have an inherent temporality:
  - Speech recognition
  - Gesture recognition
  - Human activity recognition

4 First-Order Markov Models

- They are represented by a graph where every node corresponds to a state $\omega_i$.
- The graph can be fully connected, with self-loops.
- Links between nodes $\omega_i$ and $\omega_j$ are associated with a transition probability:

  $$P(\omega(t+1)=\omega_j \mid \omega(t)=\omega_i) = a_{ij}$$

  which is the probability of going to state $\omega_j$ at time $t+1$ given that the state at time $t$ was $\omega_i$ (first-order model).

5 First-Order Markov Models (cont’d)

- The following constraints should be satisfied:

  $$a_{ij} \ge 0 \ \text{ for all } i, j, \qquad \sum_{j} a_{ij} = 1 \ \text{ for all } i$$

- Markov models are fully described by their transition probabilities $a_{ij}$.

6 Example: Weather Prediction Model

- Assume three weather states:
  - $\omega_1$: Precipitation (rain, snow, hail, etc.)
  - $\omega_2$: Cloudy
  - $\omega_3$: Sunny
- Transition matrix: a $3 \times 3$ table with rows and columns indexed by $\omega_1, \omega_2, \omega_3$ (entries not preserved in the transcript).

7 Computing $P(\omega^T)$ of a sequence of states $\omega^T$

- Given a sequence of states $\omega^T = (\omega(1), \omega(2), \ldots, \omega(T))$, the probability that the model generated $\omega^T$ is equal to the product of the corresponding transition probabilities:

  $$P(\omega^T) = \prod_{t=1}^{T} P(\omega(t) \mid \omega(t-1))$$

  where $P(\omega(1) \mid \omega(0)) = P(\omega(1))$ is the prior probability of the first state.

8 Example: Weather Prediction Model (cont’d)

- What is the probability that the weather for eight consecutive days is “sun-sun-sun-rain-rain-sun-cloudy-sun”?

  $$\omega^8 = \omega_3\,\omega_3\,\omega_3\,\omega_1\,\omega_1\,\omega_3\,\omega_2\,\omega_3$$

  $$P(\omega^8) = P(\omega_3)\,P(\omega_3 \mid \omega_3)\,P(\omega_3 \mid \omega_3)\,P(\omega_1 \mid \omega_3)\,P(\omega_1 \mid \omega_1)\,P(\omega_3 \mid \omega_1)\,P(\omega_2 \mid \omega_3)\,P(\omega_3 \mid \omega_2) = 1.536 \times 10^{-4}$$
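The computation can be sketched in a few lines of Python. The transition matrix and the prior $P(\omega_3) = 1$ used below are assumptions, since the slide's matrix is not preserved in the transcript; the values were chosen because they reproduce the quoted $1.536 \times 10^{-4}$. The function name `sequence_probability` is likewise only illustrative.

```python
# Assumed transition matrix: A[i][j] = P(next state = j | current state = i).
# State order: 0 = precipitation, 1 = cloudy, 2 = sunny.
A = [
    [0.4, 0.3, 0.3],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
]
RAIN, CLOUDY, SUNNY = 0, 1, 2


def sequence_probability(states, prior_first, a):
    """Prior of the first state times the product of the transitions."""
    p = prior_first
    for prev, cur in zip(states, states[1:]):
        p *= a[prev][cur]
    return p


days = [SUNNY, SUNNY, SUNNY, RAIN, RAIN, SUNNY, CLOUDY, SUNNY]
# Assumption: the first day is known to be sunny, so its prior is 1.
p = sequence_probability(days, prior_first=1.0, a=A)
print(f"{p:.4e}")
```

Eight days give seven transitions, so the product has eight factors once the prior is included.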

9 Limitations of Markov models

- In Markov models, each state is uniquely associated with an observable event.
  - Once an observation is made, the state of the system is trivially retrieved.
- Such systems are of little use in most practical applications.

10 Hidden States and Observations

- Assume that observations are a probabilistic function of each state.
  - Each state can generate a number of outputs (i.e., observations) according to a unique probability distribution.
  - Each observation can potentially be generated at any state.
- The state sequence is not directly observable.
  - It can only be approximated from a sequence of observations.

11 First-order HMMs

- We augment the model such that when it is in state $\omega(t)$ it also emits some symbol $v(t)$ (a visible state) from a set of possible symbols.
- We have access to the visible states only, while the $\omega(t)$ are unobservable.

12 Example: Weather Prediction Model (cont’d)

- Observations: $v_1$: temperature, $v_2$: humidity, etc.

13 First-order HMMs

- For every sequence of hidden states, there is an associated sequence of visible states:

  $$\omega^T = (\omega(1), \omega(2), \ldots, \omega(T)) \;\rightarrow\; V^T = (v(1), v(2), \ldots, v(T))$$

- When the model is in state $\omega_j$ at time $t$, the probability of emitting a visible state $v_k$ at that time is denoted as:

  $$P(v(t)=v_k \mid \omega(t)=\omega_j) = b_{jk}, \quad \text{where } \sum_{k} b_{jk} = 1 \text{ for all } j$$

  (observation probabilities)

14 Absorbing State

- Given a state sequence and its corresponding observation sequence:

  $$\omega^T = (\omega(1), \omega(2), \ldots, \omega(T)) \;\rightarrow\; V^T = (v(1), v(2), \ldots, v(T))$$

  we assume that $\omega(T)=\omega_0$ is some absorbing state, which uniquely emits the symbol $v(T)=v_0$.
- Once it has entered the absorbing state, the system cannot escape from it.

15 HMM Formalism

- An HMM is defined by $\{\Omega, V, \pi, A, B\}$:
  - $\Omega = \{\omega_1, \ldots, \omega_n\}$ are the possible states
  - $V = \{v_1, \ldots, v_m\}$ are the possible observations
  - $\pi = \{\pi_i\}$ are the prior state probabilities
  - $A = \{a_{ij}\}$ are the state transition probabilities
  - $B = \{b_{jk}\}$ are the observation probabilities
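One way to hold these five components in code is a small container that also checks the stochastic constraints on the priors, $A$ and $B$. This is only a sketch: the class name `HMM`, the helper `_is_distribution`, the symbol labels and every number below are hypothetical.

```python
from dataclasses import dataclass


def _is_distribution(p, tol=1e-9):
    """True if p is non-negative and sums to 1 (within tol)."""
    return all(x >= 0 for x in p) and abs(sum(p) - 1.0) <= tol


@dataclass
class HMM:
    states: list    # Omega: names of the hidden states
    symbols: list   # V: names of the visible symbols
    pi: list        # pi[i]   = P(first state = i)
    a: list         # a[i][j] = P(next state = j | current state = i)
    b: list         # b[j][k] = P(symbol k | state j)

    def validate(self):
        """Check shapes and that pi and every row of a and b is a distribution."""
        n, m = len(self.states), len(self.symbols)
        ok_pi = len(self.pi) == n and _is_distribution(self.pi)
        ok_a = len(self.a) == n and all(
            len(row) == n and _is_distribution(row) for row in self.a)
        ok_b = len(self.b) == n and all(
            len(row) == m and _is_distribution(row) for row in self.b)
        return ok_pi and ok_a and ok_b


# Hypothetical weather model with two humidity readings as observations.
model = HMM(
    states=["precipitation", "cloudy", "sunny"],
    symbols=["low humidity", "high humidity"],
    pi=[0.2, 0.3, 0.5],
    a=[[0.4, 0.3, 0.3], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]],
    b=[[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]],
)
print(model.validate())
```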

16 Some Terminology

- Causal: the probabilities depend only upon previous states.
- Ergodic: every one of the states has a non-zero probability of occurring given some starting state.
- “Left-right” HMM

17 Coin toss example

- You are in a room with a barrier (e.g., a curtain) through which you cannot see what is happening.
- On the other side of the barrier is another person who is performing a coin (or multiple coin) toss experiment.
- The other person will tell you only the result of the experiment, not how he obtained that result!
  - e.g., $V^T = HHTHTTHH \ldots T = v(1), v(2), \ldots, v(T)$

18 Coin toss example (cont’d)

- Problem: derive an HMM to explain the observed sequence of heads and tails.
  - The coins represent the states; these are hidden because we do not know which coin was tossed each time.
  - The outcome of each toss represents an observation.
  - A “likely” sequence of coins may be inferred from the observations.
  - As we will see, the state sequence will not be unique in general.

19 Coin toss example: 1-fair coin model

- There are 2 states, each associated with either heads (state 1) or tails (state 2).
- The observation sequence uniquely defines the states (the model is not hidden).

20 Coin toss example: 2-fair coins model

- There are 2 states, but neither state is uniquely associated with either heads or tails (i.e., each state can be associated with a different fair coin).
- A third coin is used to decide which of the fair coins to flip.

21 Coin toss example: 2-biased coins model

- There are 2 states, with each state associated with a biased coin.
- A third coin is used to decide which of the biased coins to flip.
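A small simulation may help make the 2-biased coins model concrete. The sketch below treats the "third coin" as a per-state probability of keeping the current coin, which is one possible reading rather than something the slide specifies. The function name, the uniform choice of the first coin and all probabilities are hypothetical.

```python
import random


def toss_two_biased_coins(n_tosses, p_heads, p_stay, rng):
    """Simulate the 2-biased-coins model.

    p_heads[s]: probability of heads for the coin of hidden state s (0 or 1)
    p_stay[s] : probability that the selector coin keeps state s next time
    """
    state = rng.randrange(2)  # assumption: the first coin is picked uniformly
    states, outcomes = [], []
    for _ in range(n_tosses):
        states.append(state)
        outcomes.append("H" if rng.random() < p_heads[state] else "T")
        if rng.random() >= p_stay[state]:  # selector coin says "switch"
            state = 1 - state
    return states, "".join(outcomes)


rng = random.Random(0)  # fixed seed so the run is repeatable
states, outcomes = toss_two_biased_coins(
    20, p_heads=[0.9, 0.2], p_stay=[0.7, 0.6], rng=rng)
print(outcomes)  # what the observer hears
print(states)    # what stays behind the curtain
```

Only `outcomes` would be available to the person in front of the barrier; `states` is the hidden sequence the decoding problem tries to recover.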

22 Coin toss example: 3-biased coins model

- There are 3 states, with each state associated with a biased coin.
- We decide which coin to flip in some way (e.g., using other coins).

23 Which model is best?

- Since the states are not observable, the best we can do is select the model that best explains the data.
- Long observation sequences are preferable when selecting the best model.

24 Classification Using HMMs

- Given an observation sequence $V^T$ and a set of possible models, choose the model with the highest posterior probability.
- Bayes formula, with $\theta$ denoting a model:

  $$P(\theta \mid V^T) = \frac{P(V^T \mid \theta)\,P(\theta)}{P(V^T)}$$
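A minimal sketch of this selection rule follows. Because $P(V^T)$ is the same for every candidate model, it is enough to compare $P(V^T \mid \text{model}) \cdot P(\text{model})$. The likelihood is computed with the forward recursion covered later in the deck; the two single-state coin models, their priors and the function names are hypothetical.

```python
def likelihood(obs, pi, a, b):
    """P(obs | model) via the forward recursion (pi = prior over states)."""
    n = len(pi)
    alpha = [pi[j] * b[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        alpha = [b[j][o] * sum(alpha[i] * a[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)


def classify(obs, models, priors):
    """Return the model name maximizing P(obs | model) * P(model)."""
    scores = {name: likelihood(obs, *params) * priors[name]
              for name, params in models.items()}
    return max(scores, key=scores.get), scores


H, T = 0, 1
# Each model is (pi, a, b); both are one-state models of a single coin.
models = {
    "fair":   ([1.0], [[1.0]], [[0.5, 0.5]]),
    "biased": ([1.0], [[1.0]], [[0.9, 0.1]]),
}
priors = {"fair": 0.5, "biased": 0.5}

best, scores = classify([H, H, H, H, T, H, H, H], models, priors)
print(best, scores)
```

A sequence dominated by heads favors the biased model, while a balanced sequence such as H T H T T H T H favors the fair one.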

25 Main Problems in HMMs

- Evaluation
  - Determine the probability $P(V^T)$ that a particular sequence of visible states $V^T$ was generated by a given model (based on dynamic programming).
- Decoding
  - Given a sequence of visible states $V^T$, determine the most likely sequence of hidden states $\omega^T$ that led to those observations (based on dynamic programming).
- Learning
  - Given a set of visible observations, determine $a_{ij}$ and $b_{jk}$ (based on the EM algorithm).

26 Evaluation (i.e., possible # of state sequences)

27 Evaluation (cont’d) (enumerate all possible transitions to determine how good the model is)

28 Example: Evaluation (enumerate all possible transitions to determine how good the model is)
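The enumeration these slides describe can be sketched directly: sum, over every possible hidden state sequence, the probability of that sequence times the probability of the observations given it. With $c$ states and $T$ steps there are $c^T$ sequences, which is why this approach is only usable for tiny problems. The model values and the function name below are hypothetical, and the sketch uses a prior over the first state instead of the deck's absorbing-state convention.

```python
from itertools import product


def brute_force_likelihood(obs, pi, a, b):
    """P(obs) by enumerating all c**T hidden state sequences."""
    n = len(pi)
    total = 0.0
    for path in product(range(n), repeat=len(obs)):
        # Probability of this path together with the observations.
        p = pi[path[0]] * b[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= a[path[t - 1]][path[t]] * b[path[t]][obs[t]]
        total += p
    return total


# Hypothetical 2-state, 2-symbol model.
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.9, 0.1], [0.2, 0.8]]

print(brute_force_likelihood([0, 0, 1, 1], pi, a, b))
```

Each path costs about $2T$ multiplications, so the total work grows as $O(T\,c^T)$. That growth is the motivation for the recursive forward computation.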

29 Computational Complexity

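30 Recursive computation of $P(V^T)$ (HMM Forward)

[Figure: trellis of hidden states $\omega(1), \ldots, \omega(t), \omega(t+1), \ldots, \omega(T)$ with their emitted visible states $v(1), \ldots, v(t), v(t+1), \ldots, v(T)$; states $\omega_i$ and $\omega_j$ are marked.]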

31 Recursive computation of $P(V^T)$ (HMM Forward) (cont’d)

- Using marginalization:
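The marginalization step can be written out as follows. This is a sketch in the usual textbook notation, which is an assumption here: $\alpha_j(t)$ denotes the joint probability of the first $t$ observations and of being in state $\omega_j$ at time $t$, and $b_{j,v(t)}$ is shorthand for the emission probability $b_{jk}$ of the symbol $v_k$ actually observed at time $t$.

$$
\begin{aligned}
\alpha_j(t) &= P\big(v(1), \ldots, v(t),\ \omega(t)=\omega_j\big) \\
&= \sum_{i} P\big(v(1), \ldots, v(t-1),\ \omega(t-1)=\omega_i\big)\, P\big(\omega(t)=\omega_j \mid \omega(t-1)=\omega_i\big)\, P\big(v(t) \mid \omega(t)=\omega_j\big) \\
&= b_{j,v(t)} \sum_{i} \alpha_i(t-1)\, a_{ij}
\end{aligned}
$$

With a known initial state, $\alpha_j(0)$ is 1 for that state and 0 otherwise. With the absorbing-state convention $\omega(T) = \omega_0$, the sequence probability is $P(V^T) = \alpha_0(T)$.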

32 Recursive computation of $P(V^T)$ (HMM Forward) (cont’d)

[Figure: trellis including the absorbing state $\omega_0$.]

33 Recursive computation of $P(V^T)$ (HMM Forward) (cont’d)

- Algorithm fragment: for $j = 0$ to $c$ do ...
- (i.e., corresponds to state $\omega_0 = \omega(T)$)

34 Example

[Tables indexed by the states $\omega_0, \omega_1, \omega_2, \omega_3$; the entries are not preserved in the transcript.]

35 Example (cont’d)

- Similarly for $t = 2, 3, 4$.
- Finally:

[Figure: trellis for $V^T$ starting from the initial state; the last step is labeled with 0.2, 0.2 and 0.8, and the result is shown as 0.00108.]
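The forward recursion and a worked example of this kind can be sketched as below. The parameter tables, the observation sequence and the start state are assumptions, because the slide's tables are lost; they were chosen so that the last step uses the weights 0.2, 0.2 and 0.8 that survive on the slide. State 0 plays the role of the absorbing state $\omega_0$ and is the only state that emits symbol 0.

```python
def forward(obs, a, b, start):
    """Forward algorithm with a known start state at t = 0.

    a[i][j] = P(state j at t+1 | state i at t)
    b[j][k] = P(symbol k | state j)
    Returns alpha, where alpha[t][j] = P(first t symbols, state j at time t).
    """
    n = len(a)
    alpha = [[1.0 if j == start else 0.0 for j in range(n)]]
    for o in obs:
        prev = alpha[-1]
        alpha.append([b[j][o] * sum(prev[i] * a[i][j] for i in range(n))
                      for j in range(n)])
    return alpha


# Assumed parameters (not recovered from the slide).
A = [
    [1.0, 0.0, 0.0, 0.0],   # state 0 is absorbing
    [0.2, 0.3, 0.1, 0.4],
    [0.2, 0.5, 0.2, 0.1],
    [0.8, 0.1, 0.0, 0.1],
]
B = [
    [1.0, 0.0, 0.0, 0.0, 0.0],   # state 0 only emits symbol 0
    [0.0, 0.3, 0.4, 0.1, 0.2],
    [0.0, 0.1, 0.1, 0.7, 0.1],
    [0.0, 0.5, 0.2, 0.1, 0.2],
]
V = [1, 3, 2, 0]                 # assumed observations v1, v3, v2, v0

alpha = forward(V, A, B, start=1)
p = alpha[-1][0]                 # probability mass in the absorbing state at T
for t, row in enumerate(alpha):
    print(t, [round(x, 6) for x in row])
print(p)
```

Under these assumptions the exact recursion gives roughly 0.00107. Rounding the $t = 3$ values to four decimals before the last step gives $0.0024 \cdot 0.2 + 0.0002 \cdot 0.2 + 0.0007 \cdot 0.8 = 0.00108$, which matches the figure on the slide.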

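36 The backward algorithm (HMM backward)

[Figure: trellis of hidden states $\omega(1), \ldots, \omega(t), \omega(t+1), \ldots, \omega(T)$ and visible states $v(1), \ldots, v(t), v(t+1), \ldots, v(T)$, with $\beta_i(t)$ attached to state $\omega_i$ at time $t$ and $\beta_j(t+1)$ attached to state $\omega_j$ at time $t+1$.]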

37 The backward algorithm (HMM backward) (cont’d)

[Figure: the same trellis, with states $\omega_i$ at time $t$ and $\omega_j$ at time $t+1$ marked; the accompanying equation is not preserved in the transcript.]

38 The backward algorithm (HMM backward) (cont’d)
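A backward pass can be sketched as the mirror image of the forward pass. The version below is an assumption about the details, since the slide equations are missing: it takes $\beta_i(t)$ to be the probability of the observations after position $t$ given state $\omega_i$ at position $t$, sets $\beta$ to 1 at the last position, and uses a prior over the first state rather than the deck's absorbing-state convention. The model values are hypothetical.

```python
def backward(obs, a, b):
    """beta[t][i] = P(obs[t+1:] | state i at position t), with 0-based t."""
    n, T = len(a), len(obs)
    beta = [[1.0] * n for _ in range(T)]   # nothing left to explain at the end
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n))
    return beta


def likelihood_from_beta(obs, pi, a, b):
    """P(obs), obtained by closing the backward recursion with the prior."""
    beta = backward(obs, a, b)
    return sum(pi[i] * b[i][obs[0]] * beta[0][i] for i in range(len(pi)))


# Hypothetical 2-state, 2-symbol model.
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.9, 0.1], [0.2, 0.8]]

print(likelihood_from_beta([0, 0, 1, 1], pi, a, b))
```

The value agrees with what the forward pass gives for the same model and sequence, which is a convenient consistency check when implementing both.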

39 Decoding

- We need an optimality criterion to solve this problem; since various criteria could be used, there are several possible ways of solving it.
- Algorithm 1: choose the states $\omega(t)$ which are individually most likely (i.e., maximize the expected number of correct individual states).

40 Decoding – Algorithm 1 (cont’d)
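Choosing the individually most likely state at each step is commonly done by combining forward and backward quantities into a per-step posterior, $\alpha_i(t)\,\beta_i(t) / P(V^T)$. The sketch below assumes that formulation (the slide's equations are missing) together with a prior over the first state; the names and the model values are hypothetical.

```python
def posterior_decode(obs, pi, a, b):
    """Pick, at every step, the state with the highest posterior probability."""
    n, T = len(pi), len(obs)

    # Forward pass: alpha[t][j] = P(obs[:t+1], state j at t).
    alpha = [[pi[j] * b[j][obs[0]] for j in range(n)]]
    for t in range(1, T):
        alpha.append([b[j][obs[t]] * sum(alpha[t - 1][i] * a[i][j] for i in range(n))
                      for j in range(n)])

    # Backward pass: beta[t][i] = P(obs[t+1:] | state i at t).
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]

    total = sum(alpha[-1])                       # P(obs)
    gamma = [[alpha[t][i] * beta[t][i] / total for i in range(n)]
             for t in range(T)]
    path = [max(range(n), key=lambda i: g[i]) for g in gamma]
    return path, gamma


pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.9, 0.1], [0.2, 0.8]]

path, gamma = posterior_decode([0, 0, 1, 1], pi, a, b)
print(path)
```

Each row of `gamma` is a distribution over states, so the choice at one step ignores which transitions the neighboring choices imply.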

41 Decoding – Algorithm 2

- Algorithm 2: at each time step $t$, find the state that has the highest probability $\alpha_i(t)$.
- Uses the forward algorithm with minor changes.

42 Decoding – Algorithm 2 (cont’d)

43 Decoding – Algorithm 2 (cont’d)

44 Decoding – Algorithm 2 (cont’d)

- There is no guarantee that the path is a valid one.
- The path might imply a transition that is not allowed by the model.

[Figure: trellis over $t = 0, 1, 2, 3, 4$ with one transition marked “not allowed!” because $a_{32} = 0$.]
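The failure mode can be reproduced with a short sketch that takes the state with the largest $\alpha_i(t)$ at every step and then checks the resulting path against the transition matrix. The parameters are assumptions (the slide's tables are lost), chosen so that $a_{32} = 0$ as on the slide; the function names are illustrative.

```python
def greedy_decode(obs, a, b, start):
    """At each step keep the state with the largest forward value alpha_j(t)."""
    n = len(a)
    alpha = [1.0 if j == start else 0.0 for j in range(n)]
    path = [start]
    for o in obs:
        alpha = [b[j][o] * sum(alpha[i] * a[i][j] for i in range(n))
                 for j in range(n)]
        path.append(max(range(n), key=lambda j: alpha[j]))
    return path


def forbidden_transitions(path, a):
    """Consecutive pairs in the path whose transition probability is zero."""
    return [(i, j) for i, j in zip(path, path[1:]) if a[i][j] == 0.0]


# Assumed parameters; note A[3][2] == 0.0.
A = [
    [1.0, 0.0, 0.0, 0.0],
    [0.2, 0.3, 0.1, 0.4],
    [0.2, 0.5, 0.2, 0.1],
    [0.8, 0.1, 0.0, 0.1],
]
B = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.3, 0.4, 0.1, 0.2],
    [0.0, 0.1, 0.1, 0.7, 0.1],
    [0.0, 0.5, 0.2, 0.1, 0.2],
]

path = greedy_decode([1, 3, 2, 0], A, B, start=1)
print(path)
print(forbidden_transitions(path, A))
```

With these values the locally best states include a step from state 3 to state 2, a transition the model assigns probability zero, so the decoded path could never have been produced by the model.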

45 Decoding – Algorithm 3

46 Decoding – Algorithm 3 (cont’d)

47 Decoding – Algorithm 3 (cont’d)

48 Decoding – Algorithm 3 (cont’d)
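The content of the Algorithm 3 slides is not preserved here. A common dynamic-programming approach that avoids invalid paths is the Viterbi algorithm, which maximizes the probability of the whole state sequence instead of each state separately. The sketch below assumes that this is the kind of method intended and should not be read as the slides' own algorithm. The parameters are the same assumed values as in the greedy sketch.

```python
def viterbi(obs, a, b, start):
    """Most probable state path from a known start state, with its probability."""
    n = len(a)
    delta = [1.0 if j == start else 0.0 for j in range(n)]
    backpointers = []
    for o in obs:
        new_delta, pointers = [], []
        for j in range(n):
            # Best predecessor of state j at this step.
            best_i = max(range(n), key=lambda i: delta[i] * a[i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] * a[best_i][j] * b[j][o])
        delta = new_delta
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda j: delta[j])
    best_prob = delta[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_prob


A = [
    [1.0, 0.0, 0.0, 0.0],
    [0.2, 0.3, 0.1, 0.4],
    [0.2, 0.5, 0.2, 0.1],
    [0.8, 0.1, 0.0, 0.1],
]
B = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.3, 0.4, 0.1, 0.2],
    [0.0, 0.1, 0.1, 0.7, 0.1],
    [0.0, 0.5, 0.2, 0.1, 0.2],
]

path, prob = viterbi([1, 3, 2, 0], A, B, start=1)
print(path, prob)
```

Because every step extends a complete path through an actual transition, the returned path only uses transitions with non-zero probability whenever its overall probability is non-zero.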

49 Learning

- Use EM:
  - Update the weights iteratively to better explain the observed training sequences.

50 Learning (cont’d)

- Idea

51 Learning (cont’d)

- Define the probability of transitioning from $\omega_i$ to $\omega_j$ at step $t$ given $V^T$ (expectation step).
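In the usual forward-backward formulation this quantity can be written as below. The symbol $\gamma_{ij}(t)$ and the exact indexing are assumptions, since the slide's equation is missing: $\alpha_i(t-1)$ is the forward probability of the first $t-1$ observations ending in $\omega_i$, $\beta_j(t)$ is the backward probability of the observations after time $t$ given $\omega_j$ at time $t$, and $b_{j,v(t)}$ is the probability that $\omega_j$ emits the symbol observed at time $t$.

$$
\gamma_{ij}(t) = P\big(\omega(t-1)=\omega_i,\ \omega(t)=\omega_j \mid V^T\big)
= \frac{\alpha_i(t-1)\; a_{ij}\; b_{j,v(t)}\; \beta_j(t)}{P(V^T)}
$$

The numerator is the joint probability of the whole observation sequence and of making that particular transition, so dividing by $P(V^T)$ turns it into a posterior.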

52 Learning (cont’d)

53 Learning (cont’d) (maximization step)

54 Learning (cont’d) (maximization step)
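One full expectation-maximization update can be sketched as follows. This is a Baum-Welch style step under stated assumptions, not a transcription of the slides: it trains on a single sequence, uses a prior over the first state instead of the absorbing-state convention, and assumes every state keeps non-zero posterior mass so that no denominator is zero. All names and numbers are hypothetical.

```python
def forward(obs, pi, a, b):
    n = len(pi)
    alpha = [[pi[j] * b[j][obs[0]] for j in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([b[j][o] * sum(prev[i] * a[i][j] for i in range(n))
                      for j in range(n)])
    return alpha


def backward(obs, a, b):
    n, T = len(a), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    return beta


def baum_welch_step(obs, pi, a, b):
    """One EM update of (pi, a, b) from a single observation sequence."""
    n, m, T = len(pi), len(b[0]), len(obs)
    alpha, beta = forward(obs, pi, a, b), backward(obs, a, b)
    total = sum(alpha[-1])                      # P(obs | current model)

    # E-step: posteriors of states (gamma) and of transitions (xi).
    gamma = [[alpha[t][i] * beta[t][i] / total for i in range(n)] for t in range(T)]
    xi = [[[alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] / total
            for j in range(n)] for i in range(n)] for t in range(T - 1)]

    # M-step: expected counts, normalized into probabilities.
    new_pi = gamma[0][:]
    new_a = [[sum(x[i][j] for x in xi) / sum(g[i] for g in gamma[:-1])
              for j in range(n)] for i in range(n)]
    new_b = [[sum(g[j] for g, o in zip(gamma, obs) if o == k) / sum(g[j] for g in gamma)
              for k in range(m)] for j in range(n)]
    return new_pi, new_a, new_b


obs = [0, 0, 1, 0, 1, 1, 0, 0]
pi, a, b = [0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.9, 0.1], [0.2, 0.8]]

before = sum(forward(obs, pi, a, b)[-1])
pi2, a2, b2 = baum_welch_step(obs, pi, a, b)
after = sum(forward(obs, pi2, a2, b2)[-1])
print(before, after)
```

Repeating the step until the likelihood stops improving gives the iterative update described on the learning slides; each step should not decrease the likelihood of the training sequence.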

55 Difficulties

- How do we decide on the number of states and the structure of the model?
  - Use domain knowledge; otherwise this is a very hard problem!
- What about the size of the observation sequence?
  - It should be sufficiently long to guarantee that all state transitions will appear a sufficient number of times.
  - A large amount of training data is necessary to learn the HMM parameters.