 # Angelo Dalli Department of Intelligent Computing Systems

Hidden Markov Models Angelo Dalli Department of Intelligent Computing Systems

Overview
- Definition
- Simple Example
- 3 Basic Problems
- Forward Algorithm
- Viterbi Algorithm
- Baum-Welch Algorithm
- Example Application
- Conclusion

Definition
Markov Model H = {A, B, N, P}, where…

Elements
- Set of N states {S1,…,SN}
- M distinct observation symbols V = {v1,…,vM} per state
  - Our “finite grammar”
  - We assume discrete here, but could be continuous
- State transition probability matrix A
- Observation probability distribution B
  - $B = \{b_j(k)\}$, where $b_j(k) = P(O_t = v_k \mid q_t = S_j)$, $1 \le k \le M$, $1 \le j \le N$, and $O_t$, $q_t$ represent the observation and state at time t respectively
  - Again, could be a continuous pdf modeled by something like Gaussian mixtures
- Initial state distribution P
  - $P = \{p_i\}$, where $p_i = P(q_1 = S_i)$, $1 \le i \le N$

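Matrix of state transition probabilities
$A = \{a_{ij}\}$, where $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$, $1 \le i, j \le N$, with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$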

Markov Chain with 5 states

Observable vs. Hidden
- Observable: the output state is completely determined at each instant of time
  - For example, if the output at time t is the state itself: 2-state heads/tails coin toss model
- Hidden: states must be inferred from observations
  - In other words, the observation is a probabilistic function of the state

Simple Example: Urn and Ball
- N urns sitting in a room
- Each one has M distinct colored balls
- Magic genie selects an urn at random, based on some probability distribution
- Genie selects a ball randomly from this urn, tells us the color and puts it back
- She/he then moves on to the next urn based on a second probability distribution, and repeats the process

Obvious Markov Model here:
- Each urn is a state
- Genie’s initial selection is based on the initial state probability, P
- Probability of selecting a certain color is determined by the observation probability matrix, B
- The likelihood of the “next” urn is determined by the matrix of transition probabilities, A
- At the end we have an observation sequence, for example O = {red, blue, green, red, green, magenta}
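The genie's procedure can be sketched as a short sampler. This is only an illustration: the function name `sample_hmm`, the two-urn, three-color setup and all the probabilities are made-up assumptions, not values from the slides.

```python
import random

def sample_hmm(P, A, B, colors, T, seed=0):
    """Draw T (urn, color) pairs from an urn-and-ball model.

    P: initial urn probabilities, A: urn-to-urn transition matrix,
    B: per-urn color probabilities. All values are toy assumptions.
    """
    rng = random.Random(seed)
    n_urns = len(P)
    urns, balls = [], []
    # Genie picks the first urn from the initial distribution P.
    urn = rng.choices(range(n_urns), weights=P)[0]
    for _ in range(T):
        # Genie draws a ball from the current urn according to row B[urn].
        ball = rng.choices(range(len(colors)), weights=B[urn])[0]
        urns.append(urn)
        balls.append(colors[ball])
        # Genie moves to the next urn according to row A[urn].
        urn = rng.choices(range(n_urns), weights=A[urn])[0]
    return urns, balls

# Hypothetical toy model: 2 urns, 3 colors.
P = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
colors = ["red", "blue", "green"]

urns, balls = sample_hmm(P, A, B, colors, T=6)
```

Only `balls` would be visible to us in the hidden case; `urns` is the state sequence we would have to infer.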

Where’s genie?
- If the genie's location is known at each time instant t, then the model is observed
- Otherwise, this is a hidden model, and we can only infer the state at time t, given our string of observations and known probabilities

Three Basic Problems for HMMs
Problem 1: Given observation sequence O = O1O2…OT and Markov Model H = {A,B,P}, how do we (efficiently) compute P(O | H)?
- Given several model choices, this can be used to determine the most appropriate one

Three Basic Problems for HMMs
Problem 2: Given observation sequence O = O1O2…OT and Markov Model H = {A,B,P}, find the optimal state sequence q = q1…qT
- The optimality criterion needs to be determined
- The interest is in finding the “correct” state sequence

Three Basic Problems for HMMs
Problem 3: Given observation sequence O = O1O2…OT, estimate parameters for Model H = {A,B,P} that maximize P(O | H)
- The observation sequence is used here to train the model, adapting it to best fit the observed phenomenon

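Problem 1: compute P(O | H)
Straightforward (bad) solution
- For a given state sequence $q = \{q_1, \ldots, q_T\}$ we have $P(O \mid q, H) = \prod_{t=1}^{T} P(O_t \mid q_t, H) = b_{q_1}(O_1)\, b_{q_2}(O_2) \cdots b_{q_T}(O_T)$
- The probability of sequence q occurring is $P(q \mid H) = p_{q_1}\, a_{q_1 q_2} \cdots a_{q_{T-1} q_T}$

- The joint probability of O and q is the product of the two: $P(O, q \mid H) = P(O \mid q, H)\, P(q \mid H)$
- The probability of O is the sum of $P(O, q \mid H)$ over the set of all possible sequences Q: $P(O \mid H) = \sum_{q \in Q} P(O \mid q, H)\, P(q \mid H)$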

No Good
- Computation for this direct method is $O(2T \cdot N^T)$
- Not reasonable even for small values of N and T: N = 5 and T = 100 already give on the order of $10^{72}$ operations
- Need to find an efficient way
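To make the cost of the direct method concrete, here is a minimal sketch that enumerates every state sequence and sums the joint probabilities. The function name `brute_force_likelihood` and the toy parameters are assumptions for illustration; observations are given as symbol indices.

```python
from itertools import product

def brute_force_likelihood(P, A, B, obs):
    """Sum P(O, q | H) over all N**T state sequences q (direct method)."""
    N, T = len(P), len(obs)
    total = 0.0
    for q in product(range(N), repeat=T):
        # P(q | H) * P(O | q, H), built up one time step at a time.
        p = P[q[0]] * B[q[0]][obs[0]]
        for t in range(1, T):
            p *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += p
    return total

# Hypothetical toy model: 2 states, 3 observation symbols.
P = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]

likelihood = brute_force_likelihood(P, A, B, [0, 2])
```

The loop over `product(range(N), repeat=T)` visits N^T sequences, which is exactly why this approach stops being usable almost immediately.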

Problem 1 : Compute P(O | H)
Efficient Solution The forward algorithm

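The Forward Algorithm
Let $f_t(i) = P(O_1 \ldots O_t, q_t = S_i \mid H)$
- Initialization: $f_1(i) = p_i\, b_i(O_1)$, $1 \le i \le N$
- Induction: $f_{t+1}(j) = \left[ \sum_{i=1}^{N} f_t(i)\, a_{ij} \right] b_j(O_{t+1})$, $1 \le t \le T-1$, $1 \le j \le N$

Forward Algorithm
- Finally: $P(O \mid H) = \sum_{i=1}^{N} f_T(i)$
- Requires $O(N^2 T)$ calculations, much less than the direct method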
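A minimal sketch of the forward recursion in plain Python follows. The function name `forward` and the toy parameters are illustrative assumptions, and observations are symbol indices. No scaling is applied, so it is only suitable for short sequences.

```python
def forward(P, A, B, obs):
    """Return P(O | H) using the forward variable f_t(i)."""
    N = len(P)
    # Initialization: f_1(i) = p_i * b_i(O_1)
    f = [P[i] * B[i][obs[0]] for i in range(N)]
    # Induction: f_{t+1}(j) = (sum_i f_t(i) * a_ij) * b_j(O_{t+1})
    for o in obs[1:]:
        f = [sum(f[i] * A[i][j] for i in range(N)) * B[j][o] for j in range(N)]
    # Termination: sum over the final forward variables
    return sum(f)

# Hypothetical toy model: 2 states, 3 observation symbols.
P = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]

likelihood = forward(P, A, B, [0, 2])
```

Each time step costs N^2 multiplications, which gives the N^2·T total quoted on the slide.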

Problem 2: Given O, H, find “optimal” q
Of course, this depends on the optimality criterion. Several likely candidates:
- Maximize the number of correct individual states
  - Does not consider transitions, so it may lead to illegal sequences
- Maximize the number of correct duples, triples, etc.
- Find the single best state sequence, i.e. maximize P(q | O,H)
  - This is the most common criterion, and it is solved via the Viterbi algorithm

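Prob 2 solution: Viterbi Algorithm
Define $\delta_t(i) = \max_{q_1, \ldots, q_{t-1}} P(q_1 \ldots q_{t-1}, q_t = S_i, O_1 \ldots O_t \mid H)$
- Highest probability of a single path at time t ending in state Si

Inductively speaking: $\delta_{t+1}(j) = \left[ \max_{1 \le i \le N} \delta_t(i)\, a_{ij} \right] b_j(O_{t+1})$

Viterbi Algorithm
We need to keep track of the argument which maximizes our delta function for each time t and state i. We use the array $r_t(i)$. Now:
- Initialization: $\delta_1(i) = p_i\, b_i(O_1)$, $r_1(i) = 0$, $1 \le i \le N$
- Recursion: $\delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(O_t)$, $r_t(j) = \arg\max_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right]$, $2 \le t \le T$, $1 \le j \le N$
- At the end, we have the final probability and the end state: $P^* = \max_{1 \le i \le N} \delta_T(i)$, $q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$
- Backtrack to get the entire path: $q_t^* = r_{t+1}(q_{t+1}^*)$, $t = T-1, T-2, \ldots, 1$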
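The initialization, recursion and backtracking steps can be sketched as follows. The function name `viterbi`, the 0-based state indices and the toy parameters are assumptions for illustration; `r` plays the role of the backpointer array.

```python
def viterbi(P, A, B, obs):
    """Return (best path probability, best state sequence)."""
    N, T = len(P), len(obs)
    # Initialization: delta_1(i) = p_i * b_i(O_1); first backpointers unused.
    delta = [P[i] * B[i][obs[0]] for i in range(N)]
    r = [[0] * N]
    for t in range(1, T):
        new_delta, back = [], []
        for j in range(N):
            # Best predecessor i for state j at this time step.
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            back.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][obs[t]])
        delta = new_delta
        r.append(back)
    # Termination: most probable end state.
    last = max(range(N), key=lambda i: delta[i])
    # Backtrack through the stored backpointers.
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(r[t][path[-1]])
    path.reverse()
    return delta[last], path

# Hypothetical toy model: 2 states, 3 observation symbols.
P = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]

prob, path = viterbi(P, A, B, [0, 0, 2, 2])
```

With these toy numbers the best path stays in state 0 for the two symbol-0 observations and switches to state 1 for the two symbol-2 observations.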

Problem 3: Given O, estimate parameters for H to maximize P(O|H)
- No known way to analytically maximize P(O | H), or to solve for the optimal parameters
- Can locally maximize P(O | H) with the Baum-Welch algorithm

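Solution to 3: Baum-Welch Algorithm
- Quite lengthy and beyond our time frame
- Suffice to say, it works
- Baum-Welch is an instance of the EM (expectation-maximization) method; other solutions to problem 3, such as gradient techniques, are also used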
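For readers who want a feel for the procedure, here is a rough sketch of a single re-estimation pass on one discrete observation sequence, using the usual forward and backward variables. It is an illustration under stated assumptions, not the presenter's implementation: the names `forward_backward` and `baum_welch_step` and the toy numbers are made up, no scaling is applied (short sequences only), and all probabilities are assumed strictly positive so no denominator is zero.

```python
def forward_backward(P, A, B, obs):
    """Return forward table f, backward table b, and P(O | H)."""
    N, T = len(P), len(obs)
    f = [[0.0] * N for _ in range(T)]
    b = [[1.0] * N for _ in range(T)]
    for i in range(N):
        f[0][i] = P[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            f[t][j] = sum(f[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            b[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * b[t + 1][j] for j in range(N))
    return f, b, sum(f[T - 1])

def baum_welch_step(P, A, B, obs):
    """One re-estimation pass; returns new (P, A, B) and the old likelihood."""
    N, M, T = len(P), len(B[0]), len(obs)
    f, b, like = forward_backward(P, A, B, obs)
    # gamma[t][i]: probability of being in state i at time t given O.
    gamma = [[f[t][i] * b[t][i] / like for i in range(N)] for t in range(T)]
    new_P = gamma[0][:]
    new_A = [[0.0] * N for _ in range(N)]
    for i in range(N):
        denom = sum(gamma[t][i] for t in range(T - 1))
        for j in range(N):
            # Expected number of i -> j transitions given O.
            num = sum(f[t][i] * A[i][j] * B[j][obs[t + 1]] * b[t + 1][j]
                      for t in range(T - 1)) / like
            new_A[i][j] = num / denom
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T)) for k in range(M)]
             for j in range(N)]
    return new_P, new_A, new_B, like

# Hypothetical toy model and observation sequence (symbol indices).
P = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 0, 2, 2, 1, 0, 2]

P1, A1, B1, like0 = baum_welch_step(P, A, B, obs)
like1 = forward_backward(P1, A1, B1, obs)[2]
```

Repeating the step until the likelihood stops improving gives a local maximum of P(O | H), which is all the method promises.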

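Ergodic vs. Left-to-Right
- Ergodic model: every state can be reached from every other state in a finite number of steps
- Left-to-Right model: the state index never decreases over time, i.e. aij = 0 for j < i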

Variations on HMM
- Null transition
  - Transition between states that produces no output
  - For ex: to model alternate word pronunciations
- Tied parameters
  - Set up an equivalence relation between parameters
  - For ex: between the observation probabilities of 2 states which have the same B
  - Reduces size of model, and makes prob 3 easier
- State duration density
  - Inherent probability of staying in state Si for d iterations is $(a_{ii})^{d-1}(1 - a_{ii})$
  - May not be appropriate for physical signals, and so an explicit state duration probability density is introduced
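As a short worked check of the inherent duration formula (a sketch; $p_i(d)$ is notation introduced here for the probability of exactly d consecutive steps in state $S_i$):

$$p_i(d) = (a_{ii})^{d-1}(1 - a_{ii}), \qquad \sum_{d=1}^{\infty} p_i(d) = (1 - a_{ii}) \sum_{d=1}^{\infty} (a_{ii})^{d-1} = 1$$

$$E[d] = \sum_{d=1}^{\infty} d\,(a_{ii})^{d-1}(1 - a_{ii}) = \frac{1}{1 - a_{ii}}$$

For example, $a_{ii} = 0.9$ gives an expected stay of 10 steps, yet the single most likely duration is always $d = 1$ because $p_i(d)$ decays geometrically. That shape is why an explicit duration density can be a better fit for physical signals.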

Issues with HMM implementation
- Scaling
  - Product of very small terms -> machine may not be precise enough, so we scale
- Multiple observation sequences
  - In a left-to-right model, only a small number of observations are available for each state, requiring several sequences for parameter estimation (prob 3)
- Initial estimate
  - Random or uniform initial estimates are fine for P and A, but B is sensitive to the initial estimate
  - Again, this is an issue for problem 3
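One common way to handle the scaling issue is to normalize the forward variables at every time step and accumulate the logarithms of the scale factors. The sketch below assumes that approach; the function name `forward_scaled` and the toy parameters are illustrative, and the result is log P(O | H) rather than the probability itself.

```python
import math

def forward_scaled(P, A, B, obs):
    """Return log P(O | H), rescaling the forward variables at each step."""
    N = len(P)
    f = [P[i] * B[i][obs[0]] for i in range(N)]
    log_like = 0.0
    for t in range(len(obs)):
        if t > 0:
            f = [sum(f[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]
        # Scale factor c_t: sum of the unnormalized forward variables.
        c = sum(f)
        f = [x / c for x in f]
        # The product of all c_t equals P(O | H), so the logs add up.
        log_like += math.log(c)
    return log_like

# Hypothetical toy model: 2 states, 3 observation symbols.
P = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]

short = forward_scaled(P, A, B, [0, 2])
long_run = forward_scaled(P, A, B, [0, 2] * 1000)
```

For the 2000-symbol sequence the unscaled probability is far below the smallest positive double, while the scaled version still returns a finite log-likelihood.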

Issues with HMM implementation
- Insufficient training data
  - For ex: not enough occurrences of different events in a given state
  - Possible solution: reduce the model to a subset for which more data exists, and linearly interpolate between model parameters
    - Interpolation weightings are a function of the amount of training data
  - Alternately, could impose some lower bound on individual observation probabilities
- Model choice
  - Ergodic vs. LTR (or other), continuous vs. discrete observation densities, number of states, etc.

Markov Processes Used in Composition
- Xenakis
- Tenney
- Hiller
- Chadabe (performance)
- Charles Ames
  - Student of Hiller
- Many others since

Example Application
Isolated word recognition (Rabiner)
- Each word v modeled as a distinct HMM Hv
- Training set of k occurrences per word, O1,…,Ok
  - Each of which is an observation sequence
- Need to:
  - Estimate parameters for each Hv that maximize P(O1,…,Ok | Hv) (i.e. prob 3)
  - Extract features O = (O1,…,OT) from the unknown word
  - Calculate P(O | Hv) for all v (prob 1), and find the v which maximizes it
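The final recognition step, scoring the unknown word against every word model and taking the best, can be sketched as below. Everything concrete here is a made-up assumption: the two-word vocabulary, the two-state left-to-right models, the two-symbol codebook and the names `forward` and `recognize`. A real system would compare log-likelihoods to avoid underflow.

```python
def forward(P, A, B, obs):
    """Return P(O | H) via the forward algorithm (unscaled)."""
    N = len(P)
    f = [P[i] * B[i][obs[0]] for i in range(N)]
    for o in obs[1:]:
        f = [sum(f[i] * A[i][j] for i in range(N)) * B[j][o] for j in range(N)]
    return sum(f)

def recognize(models, obs):
    """Return the word v whose model Hv gives the highest P(O | Hv)."""
    return max(models, key=lambda v: forward(*models[v], obs))

# Hypothetical left-to-right word models over 2 codebook symbols.
left_to_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "yes": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.1, 0.9]]),
    "no": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.9, 0.1]]),
}

word_a = recognize(models, [0, 0, 1, 1])
word_b = recognize(models, [1, 1, 0, 0])
```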

Make Observation
- Feature extraction: at each frame, cepstral coefficients and their derivatives are taken
- Vector quantization: the observed frame is mapped to a possible observation (codebook entry) via nearest neighbor
  - Assuming discrete observation probability
  - Codebook entries are estimated by segmenting training data, and taking the centroid of all frame vectors for each segment, à la k-means clustering
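The nearest-neighbor mapping from a feature frame to a codebook index can be sketched in a few lines. The squared Euclidean distance, the 2-dimensional frames and the three-entry codebook are assumptions for illustration; a real front end would use cepstral vectors and possibly a different distance measure.

```python
def quantize(frame, codebook):
    """Return the index of the codebook entry closest to the frame."""
    def dist2(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: dist2(frame, codebook[k]))

# Hypothetical codebook of 3 centroids in a 2-D feature space.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
frames = [(0.1, -0.2), (0.9, 1.2), (0.2, 1.8)]

# Discrete observation sequence fed to the HMM.
obs = [quantize(frame, codebook) for frame in frames]
```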

Choice of Model and Parameters
- Left-to-Right model more appropriate
  - Thus we have P(q1 = S1) = 1
- Choice of states, two ideas:
  - Let state correspond to phoneme
  - Let state correspond to analysis frame
- Update model parameters:
  - Segment training data into states based on the current model using the Viterbi algorithm (prob 2)
  - Update A, B probabilities based on observed data
  - Ex: bj(Ok) = number of observed vectors nearest to Ok in state j, divided by the total number of observed vectors in state j
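The count-based update described above can be sketched as follows, given a state segmentation (for example from Viterbi) and the matching quantized observations. The function name `reestimate_from_segmentation`, the 0-based indices and the toy data are assumptions; rows with no counts are left as zeros here, which a real system would smooth or floor.

```python
def reestimate_from_segmentation(states, obs, N, M):
    """Estimate A and B by counting along one segmented sequence."""
    trans = [[0] * N for _ in range(N)]
    emit = [[0] * M for _ in range(N)]
    for t, (s, o) in enumerate(zip(states, obs)):
        # Count symbol o observed while in state s.
        emit[s][o] += 1
        # Count the transition to the next state, if there is one.
        if t + 1 < len(states):
            trans[s][states[t + 1]] += 1

    def normalize(row):
        total = sum(row)
        return [c / total for c in row] if total else [0.0] * len(row)

    return [normalize(r) for r in trans], [normalize(r) for r in emit]

# Hypothetical segmentation of 5 frames into 2 states, 2 codebook symbols.
states = [0, 0, 0, 1, 1]
obs = [0, 0, 1, 1, 1]

A_new, B_new = reestimate_from_segmentation(states, obs, N=2, M=2)
```

Alternating this update with a fresh Viterbi segmentation gives a simple training loop for the word models.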

State Duration Density
- If phoneme segmentation is used, it may be advantageous to determine a state duration density
  - Variable state length for each phoneme
- Pyramid of death