
1
Hidden Markov Models Angelo Dalli Department of Intelligent Computing Systems

2
Overview Definition Simple Example 3 Basic Problems Forward Algorithm Viterbi Algorithm Baum-Welch Algorithm Example Application Conclusion

3
Definition Markov Model H = {A, B, N, P}, where…

4
Elements Set of N states {S_1,…,S_N} M distinct observation symbols V = {v_1,…,v_M} per state – Our “finite grammar” – We assume discrete here, but could be continuous State transition probability matrix A Observation probability distribution B – B = {b_j(k)}, where b_j(k) = P(O_t = v_k | q_t = S_j), 1 ≤ k ≤ M, 1 ≤ j ≤ N

5
Matrix of state transition probabilities A = {a_ij}, where a_ij = P(q_{t+1} = S_j | q_t = S_i), 1 ≤ i, j ≤ N

6
Markov Chain with 5 states

7
Observable vs. Hidden Observable: the output state is completely determined at each instant of time – For example, if the output at time t is the state itself: a 2-state heads/tails coin-toss model Hidden: states must be inferred from observations – In other words, the observation is a probabilistic function of the state

8
Simple Example: Urn and Ball N urns sitting in a room Each one has M distinct colored balls A magic genie selects an urn at random, based on some probability distribution The genie selects a ball randomly from this urn, tells us the color and puts it back He/she then moves on to the next urn based on a second probability distribution, and repeats the process

9
The obvious Markov model here: Each urn is a state The genie’s initial selection is based on the initial state probability, P The probability of selecting a certain color is determined by the observation probability matrix, B The likelihood of the “next” urn is determined by the matrix of transition probabilities, A At the end we have an observation sequence, for example O = {red, blue, green, red, green, magenta}
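The genie’s procedure can be sketched as a short simulation. All probabilities below are invented for illustration (they are not from the slides), and the model is shrunk to 2 urns and 3 colors:

```python
import random

colors = ["red", "green", "blue"]
P = [0.6, 0.4]                      # initial urn probabilities (hypothetical)
A = [[0.7, 0.3],                    # A[i][j] = P(next urn j | current urn i)
     [0.4, 0.6]]
B = [[0.5, 0.3, 0.2],               # B[j][k] = P(color k | urn j)
     [0.1, 0.4, 0.5]]

def sample_sequence(T, rng=random.Random(0)):
    """Simulate the genie: move between urns by A, report ball colors by B."""
    urn = rng.choices([0, 1], weights=P)[0]
    observations = []
    for _ in range(T):
        color = rng.choices([0, 1, 2], weights=B[urn])[0]
        observations.append(colors[color])
        urn = rng.choices([0, 1], weights=A[urn])[0]
    return observations

print(sample_sequence(6))
```

Only the color sequence is returned; the urn (state) sequence stays hidden, which is exactly what makes this a hidden Markov model.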

10
Where’s the genie? If the genie’s location is known at each time instant t, then the model is observable Otherwise, this is a hidden model, and we can only infer the state at time t, given our string of observations and the known probabilities

11
Three Basic Problems for HMMs Problem 1: Given observation sequence O = O_1 O_2 … O_T and Markov model H = {A, B, P}, how do we (efficiently) compute P(O | H)? – Given several model choices, this can be used to determine the most appropriate one

12
Three Basic Problems for HMMs Problem 2: Given observation sequence O = O_1 O_2 … O_T and Markov model H = {A, B, P}, find the optimal state sequence q = q_1 … q_T – An optimality criterion needs to be chosen – The interest is in finding the “correct” state sequence

13
Three Basic Problems for HMMs Problem 3: Given observation sequence O = O_1 O_2 … O_T, estimate the parameters of model H = {A, B, P} that maximize P(O | H) – The observation sequence is used here to train the model, adapting it to best fit the observed phenomenon

14
Problem 1: compute P(O | H) Straightforward (bad) solution For a given state sequence q = {q_1,…,q_T} we have P(O | q, H) = b_{q_1}(O_1) b_{q_2}(O_2) … b_{q_T}(O_T) The probability of sequence q occurring is P(q | H) = p_{q_1} a_{q_1 q_2} … a_{q_{T-1} q_T}

15
Bad solution, continued The joint probability of O and q is the product of the two: P(O, q | H) = P(O | q, H) P(q | H) The probability of O is P(O, q | H) summed over the set of all possible sequences Q: P(O | H) = Σ_{q ∈ Q} P(O | q, H) P(q | H)

16
No good: computation for this direct method is O(2T · N^T) Not reasonable even for small values of N and T We need to find a more efficient way
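To make the N^T blow-up concrete, here is a minimal sketch of the direct method on a toy 2-state, 2-symbol model (all numbers invented for illustration). It literally enumerates every state sequence q and sums P(O, q | H):

```python
from itertools import product

# Hypothetical toy model: 2 states, 2 observation symbols.
P = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
O = [0, 1, 1]  # observation sequence (symbol indices)

N, T = len(P), len(O)
total = 0.0
for q in product(range(N), repeat=T):          # all N^T state sequences
    p = P[q[0]] * B[q[0]][O[0]]                # p_{q1} * b_{q1}(O_1)
    for t in range(1, T):
        p *= A[q[t-1]][q[t]] * B[q[t]][O[t]]   # a_{q(t-1)qt} * b_{qt}(O_t)
    total += p                                 # sum over q of P(O, q | H)

print(total)
```

With N = 5 states and T = 100 frames this loop would visit 5^100 sequences, which is why the forward algorithm is needed.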

17
Problem 1: compute P(O | H) Efficient solution: the forward algorithm

18
The Forward Algorithm Let f_t(i) = P(O_1 … O_t, q_t = S_i | H) Initialization: f_1(i) = p_i b_i(O_1), 1 ≤ i ≤ N Induction: f_{t+1}(j) = [Σ_{i=1}^{N} f_t(i) a_ij] b_j(O_{t+1}), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N

19
Forward Algorithm Finally: P(O | H) = Σ_{i=1}^{N} f_T(i) Requires O(N^2 T) calculations – Much less than the direct method
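The initialization, induction, and termination steps above translate directly into a few lines of code. This sketch reuses a hypothetical toy model (2 states, 2 symbols; the numbers are invented, not from the slides):

```python
# Hypothetical toy model.
P = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

def forward(O):
    """Compute P(O | H) with the forward algorithm in O(N^2 T) time."""
    N = len(P)
    # Initialization: f_1(i) = p_i * b_i(O_1)
    f = [P[i] * B[i][O[0]] for i in range(N)]
    # Induction: f_{t+1}(j) = [sum_i f_t(i) a_ij] * b_j(O_{t+1})
    for obs in O[1:]:
        f = [sum(f[i] * A[i][j] for i in range(N)) * B[j][obs]
             for j in range(N)]
    # Termination: P(O | H) = sum_i f_T(i)
    return sum(f)

print(forward([0, 1, 1]))
```

On this toy model the result agrees exactly with the brute-force sum over all state sequences, while touching only N^2 terms per time step.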

20
Problem 2: Given O and H, find the “optimal” q Of course, this depends on the optimality criterion Several likely candidates: – Maximize the number of correct individual states Does not consider transitions, so it may lead to illegal sequences – Maximize the number of correct duples, triples, etc. – Find the single best state sequence, i.e. maximize P(q | O, H) This is the most common criterion, and it is solved via the Viterbi algorithm

21
Problem 2 solution: Viterbi Algorithm Define: δ_t(i) = max over q_1,…,q_{t-1} of P(q_1 … q_{t-1}, q_t = S_i, O_1 … O_t | H) – The highest probability of a single path at time t ending in state S_i Inductively: δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_j(O_{t+1})

22
Viterbi Algorithm We need to keep track of the argument which maximizes our delta function for each time t and state i – We use an array r_t(i) Initialization: δ_1(i) = p_i b_i(O_1), r_1(i) = 0, 1 ≤ i ≤ N

23
Recursion: δ_t(j) = [max_{1≤i≤N} δ_{t-1}(i) a_ij] b_j(O_t), r_t(j) = argmax_{1≤i≤N} [δ_{t-1}(i) a_ij], 2 ≤ t ≤ T At the end, we have the final probability and the end state: P* = max_{1≤i≤N} δ_T(i), q*_T = argmax_{1≤i≤N} δ_T(i)

24
Backtrack to get the entire path: q*_t = r_{t+1}(q*_{t+1}), t = T-1, T-2, …, 1
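The recursion plus the backtracking pass fit in one short function. The model below is a hypothetical toy (2 states, 2 symbols, invented numbers); `r` holds the backpointer array described above:

```python
# Hypothetical toy model.
P = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

def viterbi(O):
    """Return (best state path, its probability) for observation sequence O."""
    N = len(P)
    delta = [P[i] * B[i][O[0]] for i in range(N)]   # delta_1(i)
    r = []                                          # backpointers r_t(j)
    for obs in O[1:]:
        back, new = [], []
        for j in range(N):
            # argmax over predecessor states i of delta_{t-1}(i) * a_ij
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            back.append(best_i)
            new.append(delta[best_i] * A[best_i][j] * B[j][obs])
        r.append(back)
        delta = new
    # Termination: pick the best final state, then backtrack t = T-1,...,1
    q = [max(range(N), key=lambda i: delta[i])]
    for back in reversed(r):
        q.append(back[q[-1]])
    q.reverse()
    return q, max(delta)

print(viterbi([0, 1, 1]))
```

Note the structural similarity to the forward algorithm: the sum over predecessors is simply replaced by a max, with an extra array to recover the argmax path.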

25
Problem 3: Given O, estimate parameters of H to maximize P(O | H) There is no known way to analytically maximize P(O | H), or to solve for the optimal parameters We can locally maximize P(O | H) with the Baum-Welch algorithm

26
Solution to 3: Baum-Welch Algorithm Quite lengthy and beyond our time frame Suffice to say, it works Other solutions to problem 3 are also used, including general EM approaches (of which Baum-Welch is a special case)

27
Ergodic vs. Left-to-Right Ergodic model: every state can be reached from every other state (a fully connected transition matrix) Left-to-Right model: transitions only flow from lower-indexed to higher-indexed states (a_ij = 0 for j < i), suited to signals that evolve over time

28
Variations on HMMs Null transition – A transition between states that produces no output – For example: to model alternate word pronunciations Tied parameters – Set up an equivalence relation between parameters – For example: between the observation probabilities of 2 states which share the same B – Reduces the size of the model, and makes problem 3 easier State duration density – The inherent probability of staying in state S_i for d iterations is (a_ii)^(d-1) (1 - a_ii) – This may not be appropriate for physical signals, and so an explicit state duration probability density is introduced
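The implicit duration distribution in the last bullet is geometric, and easy to check numerically. Here `a_ii` is a hypothetical self-transition probability, not a value from the slides:

```python
a_ii = 0.7  # hypothetical self-transition probability

def duration_prob(d):
    """P(staying in state S_i for exactly d steps) = a_ii^(d-1) * (1 - a_ii)."""
    return a_ii ** (d - 1) * (1 - a_ii)

# The distribution decays geometrically: short stays are always most likely,
# which is the behavior that explicit duration densities are meant to replace.
print([round(duration_prob(d), 4) for d in range(1, 5)])
```

For many physical signals (e.g. phoneme durations) a unimodal density peaked at some typical length is more realistic than this monotone decay.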

29
Issues with HMM implementation Scaling – A product of very small terms means the machine may not be precise enough, so we scale Multiple observation sequences – In the left-to-right model, a small number of observations is available for each state, requiring several sequences for parameter estimation (problem 3) Initial estimates – Reasonable random estimates are fine for P and A, but B is sensitive to its initial estimate – Again, this is an issue for problem 3
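The underflow behind the scaling issue can also be sidestepped by working in log space. This sketch swaps the scaling-coefficient approach for an equivalent log-sum-exp formulation of the forward algorithm, on a hypothetical toy model (invented numbers):

```python
import math

# Hypothetical toy model.
P = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) by factoring out the max."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_forward(O):
    """Forward algorithm computing log P(O | H) entirely in log space."""
    N = len(P)
    lf = [math.log(P[i]) + math.log(B[i][O[0]]) for i in range(N)]
    for obs in O[1:]:
        lf = [logsumexp([lf[i] + math.log(A[i][j]) for i in range(N)])
              + math.log(B[j][obs]) for j in range(N)]
    return logsumexp(lf)

# A 1000-frame sequence would underflow to 0.0 as a raw product of
# probabilities, but its log-probability stays finite.
print(log_forward([0, 1] * 500))
```

Rabiner-style per-frame scaling coefficients achieve the same stability while staying in the linear domain; log space is simply the more common choice in modern implementations.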

30
Issues with HMM implementation Insufficient training data – For example: not enough occurrences of different events in a given state – Possible solution: reduce the model to a subset for which more data exists, and linearly interpolate between model parameters Interpolation weightings a function of the amount of training data – Alternately, one could impose some lower bound on individual observation probabilities Model choice – Ergodic vs. LTR (or other), continuous vs. discrete observation densities, number of states, etc.

31
Markov Processes Used in Composition Xenakis, Tenney, Hiller, Chadabe (performance), Charles Ames (a student of Hiller), and many others since

32
Example Application Isolated word recognition (Rabiner) – Each word v is modeled as a distinct HMM H_v – Training set of k occurrences per word, O^1,…,O^k Each of which is an observation sequence – We need to: estimate parameters for each H_v that maximize P(O^1,…,O^k | H_v) (i.e. problem 3) Extract features O = (O_1,…,O_T) from an unknown word Calculate P(O | H_v) for all v (problem 1), and find the v which maximizes it

33
Making the Observation Feature extraction: at each frame, cepstral coefficients and their derivatives are taken Vector quantization: the observed frame is mapped to a possible observation (codebook entry) via nearest neighbor – Assuming discrete observation probabilities – Codebook entries are estimated by segmenting the training data and taking the centroid of all frame vectors for each segment, à la k-means clustering
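The nearest-neighbor mapping can be sketched in a few lines. The codebook below is hypothetical (in practice its centroids would come from k-means-style clustering of training frames, and the vectors would be cepstral features rather than 2-D points):

```python
# Hypothetical 3-entry codebook of 2-D feature centroids.
codebook = [
    (0.0, 0.0),   # codebook entry 0
    (1.0, 0.0),   # codebook entry 1
    (0.0, 1.0),   # codebook entry 2
]

def quantize(frame):
    """Map a feature frame to the index of its nearest codebook entry."""
    def dist2(c):
        # squared Euclidean distance; the sqrt is unnecessary for an argmin
        return sum((a - b) ** 2 for a, b in zip(frame, c))
    return min(range(len(codebook)), key=lambda k: dist2(codebook[k]))

print(quantize((0.9, 0.2)))
```

After quantization, every frame is a discrete symbol index, which is what the discrete observation matrix B expects.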

34
Choice of Model and Parameters The left-to-right model is more appropriate – Thus we have P(q_1 = S_1) = 1 Choice of states – two ideas: – Let a state correspond to a phoneme – Let a state correspond to an analysis frame Updating model parameters: – Segment the training data into states based on the current model, using the Viterbi algorithm (problem 2) – Update the A and B probabilities based on the observed data For example: b_j(O_k) is the number of observed vectors nearest to O_k in state j, divided by the total number of observed vectors in state j
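The relative-frequency update for b_j(k) in the last bullet can be sketched directly. The frame-to-state assignments below are made up for illustration; in practice they would come from the Viterbi segmentation step described above:

```python
# Hypothetical Viterbi segmentation: state index and codebook symbol per frame.
state_assignments = [0, 0, 1, 1, 1]
observations      = [2, 0, 1, 1, 2]

def reestimate_b(j, k):
    """b_j(k) = (# frames in state j observing symbol k) / (# frames in state j)."""
    in_state = [o for s, o in zip(state_assignments, observations) if s == j]
    return in_state.count(k) / len(in_state)   # assumes state j is non-empty

print(reestimate_b(1, 1))
```

This is the simple counting version of the update; Baum-Welch replaces the hard Viterbi assignments with soft (probabilistic) state occupancies.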

35
State Duration Density If phoneme segmentation is used, it may be advantageous to determine a state duration density – A variable state length for each phoneme (Figure: the “pyramid of death”)

36
Conclusion Advantages – Has contributed quite a bit to speech recognition – With the algorithms we have described, computation is reasonable – Complex processes can be modeled with low-dimensional data – Works well for time-varying classification Other examples: gesture recognition, formant tracking Limitations – The assumption that successive observations are independent – The first-order assumption: the probability of the state at time t depends only on the state at time t-1 – Models need to be “tailor made” for a specific application – Needs lots of training data, in order to see all observations
