# Hidden Markov Models for Automatic Speech Recognition

Dr. Mike Johnson, Marquette University, EECE Dept.


## Overview

- Intro: the problem with sequential data
- Markov chains
- Hidden Markov Models
- Key HMM algorithms: evaluation, alignment, training / parameter estimation
- Examples / applications

## Big Picture View of Statistical Models

From basic Gaussians to HMMs.

## Nonstationary Sequential Data

## Historical Method: Dynamic Time Warping

- DTW is a dynamic path search of an observation sequence against a template.
- Can be solved using dynamic programming.

## Alternative: Sequential Modeling

Use a Markov chain (state machine). (Diagram: states S1, S2, S3 as a state machine, with a data distribution model attached to each state.)

## Markov Chains (Discrete Time and State)

- A Markov chain is a discrete-time, discrete-state Markov process. The probability of the process moving to any new state depends only on the current state; this is called a transition probability.
- Note: since the transition probabilities are fixed, there is also a time-invariance assumption. (Also false in practice, of course, but useful.)

## Graphical Representation

Markov chain parameters include:

- Transition probabilities a_ij
- Initial state probabilities π_1, π_2, π_3

(Diagram: states S1, S2, S3 with self-loops a_11, a_22, a_33 and transitions a_12, a_13, a_21, a_23, a_31, a_32.)

## Example: Weather Patterns

Probability of Rain, Clouds, or Sunshine modeled as a Markov chain with transition matrix A. Note: a matrix of this form (square, non-negative, each row summing to 1) is called a stochastic matrix.

## Two-Step Probabilities

If it's raining today, what's the probability of it raining two days from now? We need the two-step probabilities:

P(rain in 2 days | rain today) = 0.7·0.7 + 0.2·0.4 + 0.1·0.1 = 0.58

These can also be read directly from A².
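The two-step computation above can be checked numerically. The slide only shows the rain-related entries of A (the rain row 0.7/0.2/0.1 and the transitions into rain, 0.7/0.4/0.1); the remaining entries below are assumed illustrative values chosen so each row sums to 1.

```python
import numpy as np

# Transition matrix over {rain, clouds, sun}. The rain row and the rain
# column match the slide's arithmetic; other entries are assumed.
A = np.array([
    [0.7, 0.2, 0.1],   # rain   -> rain, clouds, sun (from the slide)
    [0.4, 0.4, 0.2],   # clouds -> rain is 0.4 per the slide; rest assumed
    [0.1, 0.3, 0.6],   # sun    -> rain is 0.1 per the slide; rest assumed
])

A2 = A @ A                    # two-step transition probabilities
p_rain_in_2_days = A2[0, 0]   # 0.7*0.7 + 0.2*0.4 + 0.1*0.1 = 0.58
print(p_rain_in_2_days)
```

Reading the (rain, rain) entry of A² sums over all intermediate states, exactly the three-term sum on the slide.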

## Steady State

- The N-step probabilities can be obtained from A^N, so A is sufficient to determine the likelihoods of all possible sequences.
- What's the limiting case? Does it matter if it was raining 1000 days ago? For a well-behaved (ergodic) chain it does not: every row of A^1000 is (nearly) the same stationary distribution.
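A quick numerical check of the limiting behavior, using the same assumed completion of the weather matrix (only the rain row comes from the slides):

```python
import numpy as np

# Weather chain: rain row from the slide, other rows assumed illustrative.
A = np.array([[0.7, 0.2, 0.1],
              [0.4, 0.4, 0.2],
              [0.1, 0.3, 0.6]])

# For an ergodic chain, A^N converges as N grows: every row approaches the
# same stationary distribution, so the state 1000 days ago no longer matters.
A1000 = np.linalg.matrix_power(A, 1000)
print(A1000)
```

All three rows of `A1000` agree to machine precision, which is exactly the "does yesterday's weather matter?" question answered in the negative.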

## Probability of a State Sequence

- The probability of any state sequence S = (s_1, …, s_T) is given by: P(S) = π_{s_1} · a_{s_1 s_2} · a_{s_2 s_3} · … · a_{s_{T-1} s_T}
- Training: learn the transition probabilities by counting the state transitions in the training data.
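The counting-based training mentioned above is short enough to sketch directly: each transition probability is the number of observed i→j transitions divided by the number of departures from state i. The sequence below is a hypothetical example, not data from the slides.

```python
from collections import Counter

# Hypothetical observed state sequence: R = rain, C = clouds, S = sun.
seq = list("RRCRRSSR")

# Count each adjacent pair, then normalize by the source-state count.
pair_counts = Counter(zip(seq[:-1], seq[1:]))
from_counts = Counter(seq[:-1])
a = {(i, j): n / from_counts[i] for (i, j), n in pair_counts.items()}

# e.g. a[('R','R')] = 2 of the 4 departures from R stay in R = 0.5
print(a)
```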

## Weather Classification

Using a Markov chain for classification:

- Train one Markov chain model for each class, e.g. a weather transition matrix for each city: Milwaukee, Phoenix, and Miami.
- Given a sequence of state observations, identify the most likely city by choosing the model that gives the highest overall probability.
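The per-city classification rule can be sketched as scoring the observed state sequence under each city's transition matrix and taking the argmax. Both matrices below are assumed illustrative values (the slides do not give per-city numbers); log probabilities are used to avoid underflow on long sequences.

```python
import numpy as np

# Assumed transition matrices over {rain, clouds, sun} for two cities.
models = {
    "Milwaukee": np.array([[0.7, 0.2, 0.1],
                           [0.4, 0.4, 0.2],
                           [0.1, 0.3, 0.6]]),
    "Phoenix":   np.array([[0.2,  0.2, 0.6],
                           [0.1,  0.3, 0.6],
                           [0.05, 0.1, 0.85]]),
}

def log_likelihood(states, A):
    """Sum of log transition probabilities along an observed state sequence."""
    return sum(np.log(A[i, j]) for i, j in zip(states[:-1], states[1:]))

obs = [0, 0, 1, 0, 0]   # a rainy stretch: 0 = rain, 1 = clouds, 2 = sun
best = max(models, key=lambda name: log_likelihood(obs, models[name]))
print(best)   # the rainy sequence fits the rainier matrix better
```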

## Hidden States and HMMs

- What if you can't directly observe the states?
- But there are measurements/observations that relate to the probability of being in each state.
- States hidden from view = Hidden Markov Model.

## General Case HMM

- s_i: state i
- a_ij: P(s_i → s_j)
- o_t: output at time t
- b_j(o_t): P(o_t | s_j)
- Initial probabilities: π_1, π_2, π_3

(Diagram: states with per-state output distributions b_1(o_t), b_2(o_t), b_3(o_t), b_4(o_t).)

## Weather HMM

Extend the weather Markov chain to an HMM: we can't see whether it's raining, cloudy, or sunny, but we can make some observations:

- Humidity H
- Temperature T
- Pressure P

Two questions follow:

- How do we calculate the probability of an observation sequence under a model?
- How do we learn the state transition probabilities for unseen states, and the observation probabilities in each state?

## Observation Models

How do we characterize these observations?

- Discrete/categorical observations: learn the probability mass function directly.
- Continuous observations: assume a parametric model.

Our example: assume a Gaussian distribution. We need to estimate the mean and variance of the humidity, temperature, and pressure in each state (9 means and 9 variances for each city model).
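The Gaussian observation model above amounts to evaluating a normal density per state. A minimal sketch for one feature (humidity); the per-state means and variances are assumed for illustration, and with independent features b_j(o_t) would be the product of such terms over H, T, and P.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Assumed (mean, variance) of humidity [%] in each weather state.
state_params = {"rain": (85.0, 25.0), "clouds": (70.0, 36.0), "sun": (50.0, 49.0)}

# Likelihood of observing humidity = 80% under each hidden state.
likelihoods = {s: gaussian_pdf(80.0, m, v) for s, (m, v) in state_params.items()}
print(likelihoods)   # high humidity is most consistent with "rain"
```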

## HMM Classification

Using an HMM for classification:

- Training: one HMM for each class — a transition matrix plus state means and variances (27 parameters) for each city.
- Classification: given a sequence of observations, evaluate P(O | model) for each city (much harder to compute for an HMM than for a Markov chain) and choose the model that gives the highest overall probability.

## Using HMMs for Speech Recognition

(Diagram: a left-to-right model with start and end states and states S1–S5; self-loops a_22, a_33, a_44; forward transitions a_12, a_23, a_34, a_45; skip transitions a_13, a_24, a_35; output distributions b_2(), b_3(), b_4().)

- States represent the beginning, middle, and end of a phoneme.
- A Gaussian Mixture Model in each state.

## Fundamental HMM Computations

- Evaluation: given a model λ and an observation sequence O = (o_1, o_2, …, o_T), compute P(O | λ).
- Alignment: given λ and O, compute the 'correct' state sequence S = (s_1, s_2, …, s_T), such as S* = argmax_S P(S | O, λ).
- Training: given a group of observation sequences, find an estimate of λ, such as λ_ML = argmax_λ P(O | λ).

## Evaluation: The Forward/Backward Algorithm

- Define α_i(t) = P(o_1 o_2 … o_t, s_t = i | λ)
- Define β_i(t) = P(o_{t+1} o_{t+2} … o_T | s_t = i, λ)

Each of these can be computed efficiently via a dynamic programming recursion, starting at t = 1 (for α) and at t = T (for β). Putting the forward and backward variables together: P(O | λ) = Σ_i α_i(t) β_i(t), for any t.

## Forward Recursion

1. Initialization: α_j(1) = π_j b_j(o_1)
2. Recursion: α_j(t+1) = [ Σ_{i=1..N} α_i(t) a_ij ] b_j(o_{t+1})
3. Termination: P(O | λ) = Σ_{i=1..N} α_i(T)

## Backward Recursion

1. Initialization: β_i(T) = 1
2. Recursion: β_i(t) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_j(t+1)
3. Termination: P(O | λ) = Σ_{i=1..N} π_i b_i(o_1) β_i(1)
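Both recursions can be sketched on a tiny discrete HMM; all parameter values below are assumed for illustration. A useful sanity check is that Σ_i α_i(t) β_i(t) gives the same P(O | λ) at every time step.

```python
import numpy as np

# A small assumed discrete HMM: 2 states, 2 output symbols.
pi = np.array([0.5, 0.5])                  # initial state probabilities
A  = np.array([[0.8, 0.2], [0.3, 0.7]])    # transition probabilities a_ij
B  = np.array([[0.6, 0.4], [0.1, 0.9]])    # output probabilities b_j(k)
O  = [0, 1, 1, 0]                          # observation sequence

N, T = len(pi), len(O)
alpha = np.zeros((T, N))
beta  = np.zeros((T, N))

alpha[0] = pi * B[:, O[0]]                 # forward initialization
for t in range(1, T):                      # forward recursion
    alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]

beta[T - 1] = 1.0                          # backward initialization
for t in range(T - 2, -1, -1):             # backward recursion
    beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])

p_forward = alpha[T - 1].sum()             # termination: P(O | lambda)
print(p_forward)
```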

## Note: Computational Improvement

- Direct computation: P(O | λ) is the sum of the observation probabilities over all N^T possible state sequences. Time complexity: O(T N^T).
- Forward/backward algorithm: each state at each time step sums over all state values from the previous time step. Time complexity: O(T N²).

## State Occupancy Probabilities

From α_i(t) and β_i(t):

- One-state occupancy probability: γ_i(t) = P(s_t = i | O, λ) = α_i(t) β_i(t) / P(O | λ)
- Two-state occupancy probability: ξ_ij(t) = P(s_t = i, s_{t+1} = j | O, λ) = α_i(t) a_ij b_j(o_{t+1}) β_j(t+1) / P(O | λ)

## Alignment: The Viterbi Algorithm

To find the single most likely state sequence S, use the Viterbi dynamic programming algorithm:

1. Initialization: δ_j(1) = π_j b_j(o_1)
2. Recursion: δ_j(t+1) = [ max_i δ_i(t) a_ij ] b_j(o_{t+1}), recording the back-pointer ψ_j(t+1) = argmax_i δ_i(t) a_ij
3. Termination: P* = max_i δ_i(T); backtrack through ψ to recover S*.
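The three Viterbi steps can be sketched on the same tiny assumed HMM used for the forward/backward example (these parameters are illustrative, not from the slides):

```python
import numpy as np

# Small assumed discrete HMM: 2 states, 2 output symbols.
pi = np.array([0.5, 0.5])
A  = np.array([[0.8, 0.2], [0.3, 0.7]])
B  = np.array([[0.6, 0.4], [0.1, 0.9]])
O  = [0, 1, 1, 0]

N, T = len(pi), len(O)
delta = np.zeros((T, N))
psi   = np.zeros((T, N), dtype=int)        # back-pointers

delta[0] = pi * B[:, O[0]]                 # initialization
for t in range(1, T):                      # recursion
    scores = delta[t - 1][:, None] * A     # scores[i, j] = delta_i(t-1) * a_ij
    psi[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0) * B[:, O[t]]

path = [int(delta[T - 1].argmax())]        # termination: best final state
for t in range(T - 1, 0, -1):              # backtrack through psi
    path.append(int(psi[t][path[-1]]))
path.reverse()
print(path)
```

Unlike the forward recursion, the sum over previous states is replaced by a max, and the argmax is stored so the winning path can be read back.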

## Training

We need to learn the parameters of the model given the training data. Possibilities include:

- Maximum a posteriori (MAP)
- Maximum likelihood (ML)
- Minimum error rate

## Expectation Maximization

Expectation-Maximization (EM) can be used for ML estimation of parameters in the presence of hidden variables. Basic iterative process:

1. Compute the state-sequence likelihoods given the current parameters.
2. Estimate new parameter values given the state-sequence likelihoods.

## EM Training: Baum-Welch for Discrete Observations (e.g. VQ-coded)

Basic idea: using the current λ and the forward/backward equations, compute the state occupation probabilities. Then compute new values:

- π̂_i = γ_i(1)
- â_ij = Σ_t ξ_ij(t) / Σ_t γ_i(t)
- b̂_j(k) = Σ_{t : o_t = k} γ_j(t) / Σ_t γ_j(t)
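One full re-estimation pass can be sketched end to end: run forward/backward, form γ and ξ, then apply the update equations. This is a minimal single-iteration sketch on an assumed toy HMM, not a production trainer (no multiple sequences, no log-domain scaling).

```python
import numpy as np

# Assumed toy discrete HMM: 2 states, 2 output symbols.
pi = np.array([0.5, 0.5])
A  = np.array([[0.8, 0.2], [0.3, 0.7]])
B  = np.array([[0.6, 0.4], [0.1, 0.9]])
O  = [0, 1, 1, 0]
N, T, K = len(pi), len(O), B.shape[1]

# Forward/backward pass.
alpha = np.zeros((T, N)); beta = np.zeros((T, N))
alpha[0] = pi * B[:, O[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
pO = alpha[-1].sum()

# Occupancy probabilities.
gamma = alpha * beta / pO                      # gamma_i(t)
xi = np.zeros((T - 1, N, N))                   # xi_ij(t)
for t in range(T - 1):
    xi[t] = alpha[t][:, None] * A * (B[:, O[t + 1]] * beta[t + 1])[None, :] / pO

# Baum-Welch re-estimation.
pi_new = gamma[0]
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
B_new = np.zeros((N, K))
for k in range(K):
    B_new[:, k] = gamma[[t for t in range(T) if O[t] == k]].sum(axis=0)
B_new /= gamma.sum(axis=0)[:, None]
```

A quick invariant to check: the re-estimated π, and each row of the new A and B, still sum to 1, since γ and ξ are proper posteriors.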

## Update Equations for Gaussian Distributions

- Mean: μ̂_j = Σ_t γ_j(t) o_t / Σ_t γ_j(t); variance: σ̂²_j = Σ_t γ_j(t) (o_t − μ̂_j)² / Σ_t γ_j(t)
- GMMs are similar, but need to incorporate mixture-component likelihoods as well as state likelihoods.

## Toy Example: The Genie and the Urns

- There are N urns in a nearby room; each contains many balls of M different colors.
- A genie picks out a sequence of balls from the urns and shows you the result. Can you determine the sequence of urns they came from?
- Model as an HMM with N states and M outputs:
  - The probabilities of picking from each urn are the state transitions.
  - The numbers of different colored balls in each urn make up the output probability mass function for each state.

## Working Out the Genie Example

- There are three baskets of colored balls:
  - Basket one: 10 blue and 10 red
  - Basket two: 15 green, 5 blue, and 5 red
  - Basket three: 10 green and 10 red
- The genie chooses among the baskets at random:
  - 25% chance of picking from basket one or basket two
  - 50% chance of picking from basket three

## Genie Example Diagram

## Two Questions

Assume that the genie reports a sequence of two balls as {blue, red}. Answer two questions:

- What is the probability that a two-ball sequence will be {blue, red}?
- What is the most likely sequence of baskets to produce the sequence {blue, red}?

## Probability of {blue, red} for a Specific Basket Sequence

## Probability of {blue, red}

- What is the total probability of {blue, red}? Sum of the matrix values = 0.074375.
- What is the most likely sequence of baskets visited? Argmax over the matrix values = {Basket 1, Basket 3}, with corresponding maximum likelihood 0.03125.
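Both answers can be reproduced directly from the basket contents and pick probabilities given on the earlier slide, since the basket choices are independent draws:

```python
import numpy as np

# Output probabilities from the basket contents: rows = baskets 1-3,
# columns = (blue, red, green).
Bout = np.array([[0.50, 0.50, 0.00],    # basket 1: 10 blue, 10 red
                 [0.20, 0.20, 0.60],    # basket 2: 5 blue, 5 red, 15 green
                 [0.00, 0.50, 0.50]])   # basket 3: 10 red, 10 green
p = np.array([0.25, 0.25, 0.50])        # basket pick probabilities per draw

BLUE, RED = 0, 1
# joint[i, j] = P(first pick basket i and blue, then basket j and red).
joint = (p * Bout[:, BLUE])[:, None] * (p * Bout[:, RED])[None, :]

total = joint.sum()                     # total P({blue, red})
best_pair = np.unravel_index(joint.argmax(), joint.shape)
print(total, best_pair)                 # 0.074375, baskets (1, 3) 0-indexed
```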

## Viterbi Method

The best path ends in state 3, coming previously from state 1.

## Composite Models

(Diagram: two left-to-right HMMs — states S1–S5 with start and end states, self-loops a_22, a_33, a_44, forward transitions a_12, a_23, a_34, a_45, and skips a_13, a_24, a_35 — joined in sequence.)

- Training data is labeled at the sentence level, and is generally not annotated at the sub-word (HMM model) level.
- We need to be able to form composite models from a sequence of word or phoneme labels.

## Viterbi and Token Passing

(Diagram: a recognition network over word models; Viterbi decoding produces the single best sentence, while token passing produces a word graph of alternatives.)

## HMM Notation

Discrete HMM case: λ = (A, B, π), where A = {a_ij} are the transition probabilities, B = {b_j(k)} the discrete output probabilities, and π = {π_i} the initial state probabilities.

Continuous HMM case: the discrete output probabilities b_j(k) are replaced by densities, e.g. a single Gaussian per state, b_j(o_t) = N(o_t; μ_j, Σ_j).

Multi-mixture, multi-observation case: each state's output density is a mixture, b_j(o_t) = Σ_{m=1..M} c_jm N(o_t; μ_jm, Σ_jm) with mixture weights c_jm summing to 1, and the re-estimation formulas accumulate statistics over all observation sequences.
