1
**Machine Learning: Hidden Markov Models**

Doug Downey, adapted from Bryan Pardo, Northwestern University

2
**The Markov Property**

A stochastic process has the Markov property if the conditional probability of future states of the process depends only upon the present state, i.e. what I'm likely to do next depends only on where I am now, NOT on how I got here.

$P(q_t \mid q_{t-1}, \dots, q_1) = P(q_t \mid q_{t-1})$

Which processes have the Markov property?

[Figure: state diagram with states 1, 2, …, K]
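One consequence worth writing out, as a sketch in the slide's own notation: applying the chain rule and then the Markov property factorizes the probability of a whole state sequence into one-step terms.

```latex
\begin{align*}
P(q_1, \dots, q_T)
  &= P(q_1) \prod_{t=2}^{T} P(q_t \mid q_{t-1}, \dots, q_1) && \text{(chain rule)} \\
  &= P(q_1) \prod_{t=2}^{T} P(q_t \mid q_{t-1})             && \text{(Markov property)}
\end{align*}
```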

3
**Markov model for Dow Jones**

4
**The Dishonest Casino**

A casino has two dice:

- Fair die: P(1) = P(2) = … = P(5) = P(6) = 1/6
- Loaded die: P(1) = P(2) = … = P(5) = 1/10; P(6) = 1/2

I think the casino switches back and forth between the fair and the loaded die once every 20 turns, on average.

5
**My dishonest casino model**

This is a hidden Markov model (HMM).

Transition probabilities:

- FAIR → FAIR: 0.95; FAIR → LOADED: 0.05
- LOADED → LOADED: 0.95; LOADED → FAIR: 0.05

Emission probabilities:

- FAIR: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
- LOADED: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10; P(6|L) = 1/2

6
**Elements of a Hidden Markov Model**

- A finite set of states $Q = \{q_1, \dots, q_K\}$
- A set of transition probabilities between states, $A$, where each $a_{ij}$ in $A$ is the probability of going from state $i$ to state $j$
- The probability of starting in each state, $\pi = \{\pi_1, \dots, \pi_K\}$, where each $\pi_k$ in $\pi$ is the probability of starting in state $k$
- A set of emission probabilities, $B$, where each $b_i(o_j)$ in $B$ is the probability of observing output $o_j$ when in state $i$
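As a minimal sketch, the four elements can be written down as plain Python data for the dishonest casino. The variable names (`states`, `A`, `pi`, `B`) and the helper `check_hmm` are illustrative assumptions, not part of the slides; the start probabilities 0.7 and 0.3 are borrowed from the forward-algorithm example later in the deck.

```python
import math

# Q: the finite set of states
states = ["FAIR", "LOADED"]

# A: A[i][j] = probability of going from state i to state j
A = {
    "FAIR":   {"FAIR": 0.95, "LOADED": 0.05},
    "LOADED": {"FAIR": 0.05, "LOADED": 0.95},
}

# pi: probability of starting in each state
# (values taken from the deck's forward-algorithm example)
pi = {"FAIR": 0.7, "LOADED": 0.3}

# B: B[i][o] = probability of observing die face o when in state i
B = {
    "FAIR":   {face: 1 / 6 for face in range(1, 7)},
    "LOADED": {**{face: 1 / 10 for face in range(1, 6)}, 6: 1 / 2},
}


def check_hmm(states, A, pi, B):
    """Hypothetical helper: True if pi and every row of A and B sum to 1."""
    rows = [pi] + [A[s] for s in states] + [B[s] for s in states]
    return all(math.isclose(sum(row.values()), 1.0) for row in rows)
```

Checking that every distribution sums to one is a cheap guard against typos when a model is entered by hand.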

7
**My dishonest casino model**

This is a HIDDEN Markov model because the states are not directly observable. If the fair die were red and the unfair die were blue, then the Markov model would NOT be hidden.

[Figure: FAIR and LOADED states, each with self-transition probability 0.95 and switching probability 0.05]
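To make "hidden" concrete, here is a small sampling sketch: it generates both the state sequence and the rolls, but a player at the table would only ever see the rolls. The function name `sample_casino` is hypothetical, and the start probabilities (0.7 fair, 0.3 loaded) are an assumption borrowed from the forward-algorithm example later in the deck.

```python
import random

STATES = ["FAIR", "LOADED"]
START = {"FAIR": 0.7, "LOADED": 0.3}  # assumed start probabilities
TRANS = {
    "FAIR":   {"FAIR": 0.95, "LOADED": 0.05},
    "LOADED": {"FAIR": 0.05, "LOADED": 0.95},
}
# Emission weights for faces 1..6
EMIT = {
    "FAIR":   [1 / 6] * 6,
    "LOADED": [1 / 10] * 5 + [1 / 2],
}


def sample_casino(n_rolls, seed=None):
    """Return (hidden_states, rolls) sampled from the casino HMM."""
    rng = random.Random(seed)
    state = rng.choices(STATES, weights=[START[s] for s in STATES])[0]
    hidden, rolls = [], []
    for _ in range(n_rolls):
        hidden.append(state)
        # Emit a face from the current die
        rolls.append(rng.choices(range(1, 7), weights=EMIT[state])[0])
        # Move to the next state; depends only on the current state
        state = rng.choices(STATES, weights=[TRANS[state][s] for s in STATES])[0]
    return hidden, rolls


hidden, rolls = sample_casino(20, seed=0)
print(rolls)   # what the player observes
print(hidden)  # what only the casino knows
```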

8
**HMMs are good for…**

- Speech recognition
- Gene sequence matching
- Text processing
  - Part-of-speech tagging
  - Information extraction
- Handwriting recognition

9
**The Three Basic Problems for HMMs**

Given: an observation sequence $O = (o_1 o_2 \dots o_T)$ of events from the output alphabet, and an HMM $\lambda = (A, B, \pi)$…

- Problem 1 (Evaluation): What is $P(O \mid \lambda)$, the probability of the observation sequence given the model?
- Problem 2 (Decoding): What sequence of states $Q = (q_1 q_2 \dots q_T)$ best explains the observations?
- Problem 3 (Learning): How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$?

10
**The Evaluation Problem**

Given an observation sequence $O$ and an HMM $\lambda$, compute $P(O \mid \lambda)$. This helps us pick which model is the best one.

Example: $O = 1,6,6,2,6,3,6,6$

[Figure: two candidate FAIR/LOADED models with transition probabilities 0.95 and 0.05]

11
**Computing $P(O \mid \lambda)$**

Naïve: try every path through the model and sum the probabilities of all possible paths. This can be intractable: $O(N^T)$.

What we do instead: the Forward Algorithm, $O(N^2 T)$.

[Figure: the FAIR/LOADED casino model with transition probabilities 0.95 and 0.05]
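The naïve approach can be sketched directly: enumerate all $N^T$ state paths and add up their joint probabilities with the observations. This is only an illustration on the casino model; the function names are hypothetical, and the start probabilities (0.7 fair, 0.3 loaded) are taken from the forward-algorithm example later in the deck.

```python
from itertools import product

STATES = ("FAIR", "LOADED")
START = {"FAIR": 0.7, "LOADED": 0.3}
TRANS = {
    "FAIR":   {"FAIR": 0.95, "LOADED": 0.05},
    "LOADED": {"FAIR": 0.05, "LOADED": 0.95},
}


def emit(state, face):
    """Probability of rolling `face` with the die named by `state`."""
    if state == "FAIR":
        return 1 / 6
    return 1 / 2 if face == 6 else 1 / 10


def brute_force_likelihood(obs):
    """Sum P(path, obs) over every state path: N**T terms."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        p = START[path[0]] * emit(path[0], obs[0])
        for t in range(1, len(obs)):
            p *= TRANS[path[t - 1]][path[t]] * emit(path[t], obs[t])
        total += p
    return total


print(brute_force_likelihood([1, 6, 6, 2, 6, 3, 6, 6]))  # sums 2**8 = 256 paths
```

With two states the cost doubles for every extra roll, which is why this only works for very short sequences.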

12
**The Forward Algorithm**

13
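**The inductive step**

Computation of $\alpha_t(j)$ by summing all previous values $\alpha_{t-1}(i)$ for all $i$, each weighted by the transition probability $a_{ij}$, and multiplying by the emission probability of the current observation:

$\alpha_t(j) = \left[ \sum_{i} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$

[Figure: trellis fragment; each hidden state $i$ at time $t-1$ carries $\alpha_{t-1}(i)$ along its transition into state $j$ at time $t$, giving $\alpha_t(j)$]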

14
**Forward Algorithm Example**

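Model: FAIR and LOADED states, each with self-transition probability 0.95 and switching probability 0.05.

- Fair die: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
- Loaded die: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10; P(6|L) = 1/2
- Start probabilities: P(fair) = 0.7, P(loaded) = 0.3

Observation sequence = 1,6,6,2

| | $\alpha_1(i)$ | $\alpha_2(i)$ | $\alpha_3(i)$ | $\alpha_4(i)$ |
|---|---|---|---|---|
| State 1 (fair) | $0.7 \cdot 1/6$ | $\alpha_1(1) \cdot 0.95 \cdot 1/6 + \alpha_1(2) \cdot 0.05 \cdot 1/6$ | $\alpha_2(1) \cdot 0.95 \cdot 1/6 + \alpha_2(2) \cdot 0.05 \cdot 1/6$ | $\alpha_3(1) \cdot 0.95 \cdot 1/6 + \alpha_3(2) \cdot 0.05 \cdot 1/6$ |
| State 2 (loaded) | $0.3 \cdot 1/10$ | $\alpha_1(1) \cdot 0.05 \cdot 1/2 + \alpha_1(2) \cdot 0.95 \cdot 1/2$ | $\alpha_2(1) \cdot 0.05 \cdot 1/2 + \alpha_2(2) \cdot 0.95 \cdot 1/2$ | $\alpha_3(1) \cdot 0.05 \cdot 1/10 + \alpha_3(2) \cdot 0.95 \cdot 1/10$ |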
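The trellis for this example can be filled in mechanically. The following is a minimal sketch of the forward recursion, assuming the casino model with self-transition probability 0.95 and switching probability 0.05; the function name `forward` and the dictionary layout are illustrative choices, not something the slides prescribe.

```python
STATES = ("FAIR", "LOADED")
START = {"FAIR": 0.7, "LOADED": 0.3}
TRANS = {
    "FAIR":   {"FAIR": 0.95, "LOADED": 0.05},
    "LOADED": {"FAIR": 0.05, "LOADED": 0.95},
}
EMIT = {
    "FAIR":   {face: 1 / 6 for face in range(1, 7)},
    "LOADED": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def forward(obs):
    """Return the trellis: alpha[t][state] = P(o_1..o_t, q_t = state)."""
    # Initialization: start probability times emission of the first roll
    alpha = [{s: START[s] * EMIT[s][obs[0]] for s in STATES}]
    # Induction: sum over all predecessors, then emit the current roll
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({
            j: sum(prev[i] * TRANS[i][j] for i in STATES) * EMIT[j][o]
            for j in STATES
        })
    return alpha


alpha = forward([1, 6, 6, 2])
likelihood = sum(alpha[-1].values())  # termination: P(O | model)
print(likelihood)
```

Each column needs only the previous one, so the work is $N^2$ multiplications per time step rather than one term per path.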

15
**Markov model for Dow Jones**

16
**Forward trellis for Dow Jones**

17
**The Decoding Problem**

What sequence of states $Q = (q_1 q_2 \dots q_T)$ best explains the observation sequence $O = (o_1 o_2 \dots o_T)$? Helps us find the path through a model.

Example (part-of-speech tagging): The/ART dog/N sat/V quietly/ADV

18
**The Decoding Problem**

What sequence of states $Q = (q_1 q_2 \dots q_T)$ best explains the observation sequence $O = (o_1 o_2 \dots o_T)$?

Viterbi decoding:

- A slight modification of the forward algorithm
- The major difference is the maximization over previous states

Note: the most likely state sequence is not the same as the sequence of most likely states.

19
**The Viterbi Algorithm**

20
**The Forward inductive step**

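Computation of $\alpha_t(j)$: sum over all predecessor states.

[Figure: trellis columns for observations $o_{t-1}$ and $o_t$; the values $\alpha_{t-1}(\cdot)$ in the first column feed $\alpha_t(j)$ in the second]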

21
**The Viterbi inductive step**

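Computation of $v_t(j)$: take the maximum over predecessor states instead of the sum.

$v_t(j) = \max_{i} \left[ v_{t-1}(i)\, a_{ij} \right] b_j(o_t)$

Keep track of who the predecessor was at each step.

[Figure: trellis columns for observations $o_{t-1}$ and $o_t$; the best $v_{t-1}(i)$ feeds $v_t(j)$]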
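A minimal sketch of Viterbi decoding on the casino model follows, with the predecessor bookkeeping made explicit as a list of back-pointer tables. The function name `viterbi` is illustrative, the model assumes self-transition probability 0.95 and switching probability 0.05, and the start probabilities (0.7 fair, 0.3 loaded) come from the deck's forward-algorithm example.

```python
STATES = ("FAIR", "LOADED")
START = {"FAIR": 0.7, "LOADED": 0.3}
TRANS = {
    "FAIR":   {"FAIR": 0.95, "LOADED": 0.05},
    "LOADED": {"FAIR": 0.05, "LOADED": 0.95},
}
EMIT = {
    "FAIR":   {face: 1 / 6 for face in range(1, 7)},
    "LOADED": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def viterbi(obs):
    """Return (best_path, probability of that path together with obs)."""
    v = [{s: START[s] * EMIT[s][obs[0]] for s in STATES}]
    back = []  # back[t-1][j] = best predecessor of state j at time t
    for o in obs[1:]:
        prev, column, pointers = v[-1], {}, {}
        for j in STATES:
            # Maximize over previous states (the forward algorithm sums here)
            best = max(STATES, key=lambda i: prev[i] * TRANS[i][j])
            pointers[j] = best
            column[j] = prev[best] * TRANS[best][j] * EMIT[j][o]
        v.append(column)
        back.append(pointers)
    # Pick the best final state, then follow the back-pointers to the start
    last = max(STATES, key=lambda s: v[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, v[-1][last]


path, prob = viterbi([1, 6, 6, 2])
print(path, prob)
```

Storing one back-pointer per state and time step is what lets the best path be recovered without re-enumerating paths.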

22
**Viterbi for Dow Jones**

23
**The Learning Problem**

Given $O$, how do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$?

In other words: how do we make a hidden Markov model that best models what we observe?

24
**Baum-Welch Local Maximization**

1st step: you determine

- the number of hidden states, $N$
- the emission (observation) alphabet

2nd step: randomly assign values to…

- $A$, the transition probabilities
- $B$, the observation (emission) probabilities
- $\pi$, the starting state probabilities

3rd step: let the machine re-estimate $A$, $B$, $\pi$

25
**Estimation Formulae**
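The formulas on this slide did not survive extraction. For orientation, the standard textbook form of the Baum-Welch re-estimation is sketched below; it may differ in notation from what the slide showed. The backward variable $\beta_t(i)$ and the quantities $\gamma_t(i)$ and $\xi_t(i,j)$ are assumptions introduced here, following the usual convention.

```latex
\begin{align*}
\gamma_t(i) &= \frac{\alpha_t(i)\,\beta_t(i)}{P(O \mid \lambda)}
  && \text{probability of being in state } i \text{ at time } t \\
\xi_t(i,j) &= \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}
  && \text{probability of the transition } i \to j \text{ at time } t \\
\hat{\pi}_i &= \gamma_1(i) \\
\hat{a}_{ij} &= \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \\
\hat{b}_i(k) &= \frac{\sum_{t \,:\, o_t = k} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}
\end{align*}
```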

26
**Learning transitions…**

27
**Math…**

28
**Estimation of starting probs.**

This is the (expected) number of transitions from state $i$ at time $t$.

29
**Estimation Formulae**

30
**Estimation Formulae**

31
**What are we maximizing again?**

32
**The game is…**

- EITHER the current model is at a local maximum and reestimate = current model,
- OR our reestimate will be slightly better and reestimate != current model.

SO we feed in the reestimate as the current model, over and over, until we can't improve any more.
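The loop described here can be sketched in plain Python. This is an illustration under several assumptions, not the deck's own implementation: it uses the standard forward-backward re-estimation, indexes states and symbols from 0, skips the numerical scaling a long sequence would need, and starts from the casino model rather than from random values. All function names are hypothetical.

```python
def forward(obs, A, B, pi):
    """alpha[t][i] = P(o_1..o_t, q_t = i)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(obs, A, B):
    """beta[t][i] = P(o_{t+1}..o_T | q_t = i)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def reestimate(obs, A, B, pi):
    """One re-estimation; also returns the likelihood of the input model."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, A, B, pi), backward(obs, A, B)
    like = sum(alpha[-1])
    gamma = [[alpha[t][i] * beta[t][i] / like for i in range(n)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / like
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_A, new_B, new_pi, like


def baum_welch(obs, A, B, pi, tol=1e-12, max_iter=50):
    """Feed the reestimate back in until the likelihood stops improving."""
    history = []  # history[k] = likelihood of the k-th "current model"
    for _ in range(max_iter):
        new_A, new_B, new_pi, like = reestimate(obs, A, B, pi)
        history.append(like)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break  # no measurable improvement: keep the current model
        A, B, pi = new_A, new_B, new_pi
    return A, B, pi, history


# Rolls 1,6,6,2,6,3,6,6 encoded as symbol indices (face - 1)
obs = [0, 5, 5, 1, 5, 2, 5, 5]
A0 = [[0.95, 0.05], [0.05, 0.95]]
B0 = [[1 / 6] * 6, [0.1] * 5 + [0.5]]
pi0 = [0.7, 0.3]
A1, B1, pi1, history = baum_welch(obs, A0, B0, pi0, max_iter=10)
print(history)  # likelihood of each successive model
```

On a sequence this short the learned model mostly memorizes the eight rolls, which illustrates the mechanics of the loop rather than a useful fit.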

33
**Caveats**

- This is a kind of hill-climbing technique
- Often has serious problems with local maxima
- You don't know when you're done

34
**So…how else could we do this?**

- Standard gradient descent techniques?
- Hill climb?
- Beam search?
- Genetic algorithm?

35
**Back to the fundamental question**

Which processes have the Markov property? What if a hidden state variable is included (as in an HMM)?
