1 CS 552/652 Speech Recognition with Hidden Markov Models
Winter 2011
Oregon Health & Science University, Center for Spoken Language Understanding
John-Paul Hosom
Lecture 11, February 14: Gamma, Xi, and the Forward-Backward Algorithm

2 Review: α and β

Define the variable α_t(i), which has the meaning of "the probability of observations o_1 through o_t and being in state i at time t, given our HMM λ":

α_t(i) = P(o_1 o_2 … o_t, q_t = i | λ)

Compute α and P(O | λ) with the following procedure:

Initialization: α_1(i) = π_i b_i(o_1),  1 ≤ i ≤ N

Induction: α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(o_{t+1}),  1 ≤ t ≤ T−1,  1 ≤ j ≤ N

Termination: P(O | λ) = Σ_{i=1}^{N} α_T(i)
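As an added illustration (not part of the original slides), here is a minimal NumPy sketch of the forward procedure. It assumes the transition probabilities a_ij are stored in an N×N matrix A, the initial probabilities π_i in a vector pi, and the observation likelihoods b_j(o_t) are precomputed into a T×N matrix B; these names are illustrative only.

```python
import numpy as np

def forward(pi, A, B):
    """Forward procedure: alpha[t, i] = P(o_1 ... o_t, q_t = i | lambda).

    pi: (N,)   initial state probabilities pi_i
    A:  (N, N) transition probabilities, A[i, j] = a_ij
    B:  (T, N) observation likelihoods, B[t, j] = b_j(o_t)
    Returns (alpha, P(O | lambda)).
    """
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]                    # initialization: alpha_1(i) = pi_i b_i(o_1)
    for t in range(1, T):                   # induction
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    return alpha, alpha[-1].sum()           # termination: P(O|lambda) = sum_i alpha_T(i)
```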

3 Review: α and β

In the same way that we defined α, we can define β. Define the variable β_t(i), which has the meaning of "the probability of observations o_{t+1} through o_T, given that we're in state i at time t, and given our HMM λ":

β_t(i) = P(o_{t+1} o_{t+2} … o_T | q_t = i, λ)

Compute β with the following procedure:

Initialization: β_T(i) = 1,  1 ≤ i ≤ N

Induction: β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j),  t = T−1, T−2, …, 1,  1 ≤ i ≤ N

Termination: P(O | λ) = Σ_{i=1}^{N} π_i b_i(o_1) β_1(i)
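A matching sketch of the backward procedure, under the same assumptions about A and B as the forward sketch above:

```python
import numpy as np

def backward(A, B):
    """Backward procedure: beta[t, i] = P(o_{t+1} ... o_T | q_t = i, lambda)."""
    T, N = B.shape
    beta = np.zeros((T, N))
    beta[-1] = 1.0                              # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):              # induction, t = T-1, ..., 1
        # beta_t(i) = sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    return beta
```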

4 Forward Procedure: Algorithm Example

Example: "hi"
observed features: o_1 = {0.8}, o_2 = {0.8}, o_3 = {0.2}

[HMM for "hi" with states h and ay: a_h,h = 0.3, a_h,ay = 0.7, a_ay,h = 0.0, a_ay,ay = 0.4; b_h({0.8}) = 0.55, b_ay({0.8}) = 0.15, b_h({0.2}) = 0.20, b_ay({0.2}) = 0.65]

α_1(h) = 0.55    α_1(ay) = 0.0
α_2(h)  = [0.55·0.3 + 0.0·0.0] · 0.55 = 0.0908
α_2(ay) = [0.55·0.7 + 0.0·0.4] · 0.15 = 0.0578
α_3(h)  = [0.0908·0.3 + 0.0578·0.0] · 0.20 = 0.0054
α_3(ay) = [0.0908·0.7 + 0.0578·0.4] · 0.65 = 0.0563
Σ_i α_3(i) = 0.0618

5 Backward Procedure: Algorithm Example

What are all β values?

β_3(h) = 1.0    β_3(ay) = 1.0
β_2(h)  = [0.3·0.20·1.0 + 0.7·0.65·1.0] = 0.515
β_2(ay) = [0.0·0.20·1.0 + 0.4·0.65·1.0] = 0.260
β_1(h)  = [0.3·0.55·0.515 + 0.7·0.15·0.260] = 0.1123
β_1(ay) = [0.0·0.55·0.515 + 0.4·0.15·0.260] = 0.0156
β_0(·)  = [1.0·0.55·0.1123 + 0.0·0.15·0.0156] = 0.0618
β_0(·)  = Σ_i α_3(i) = P(O | λ)
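Using the forward() and backward() sketches above with the transition and observation values of this "hi" example, both procedures should give the same P(O | λ) ≈ 0.0618 (the array layouts are the same illustrative ones assumed earlier):

```python
import numpy as np

pi = np.array([1.0, 0.0])                  # states ordered [h, ay]
A  = np.array([[0.3, 0.7],
               [0.0, 0.4]])                # the remaining 0.6 from 'ay' is an exit transition, unused here
B  = np.array([[0.55, 0.15],               # b_j(o_1), o_1 = {0.8}
               [0.55, 0.15],               # b_j(o_2), o_2 = {0.8}
               [0.20, 0.65]])              # b_j(o_3), o_3 = {0.2}

alpha, p_fwd = forward(pi, A, B)           # forward() from the earlier sketch
beta = backward(A, B)                      # backward() from the earlier sketch
p_bwd = np.sum(pi * B[0] * beta[0])        # beta_0 = sum_i pi_i b_i(o_1) beta_1(i)
print(p_fwd, p_bwd)                        # both approximately 0.0618
```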

6 Probability of Gamma

Now we can define γ_t(i), the probability of being in state i at time t given an observation sequence and HMM:

γ_t(i) = P(q_t = i | O, λ)

also, α_t(i) β_t(i) = P(q_t = i, O | λ), so (multiplication rule)

γ_t(i) = P(q_t = i, O | λ) / P(O | λ) = α_t(i) β_t(i) / P(O | λ) = α_t(i) β_t(i) / Σ_{j=1}^{N} α_t(j) β_t(j)

Note: We're writing the denominator this way only to show that it's equivalent to P(O | λ). We won't further modify this term, or actually use the denominator in this form in implementation.

Note: We do need to compute P(O | λ) here. In the Viterbi search this term is constant, so it doesn't affect the maximization operation; but gamma will be used in cases where we want to compute probability values, not just maxima.
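A small sketch of γ computed from the α and β arrays of the earlier sketches; since Σ_j α_t(j) β_t(j) = P(O | λ) for every t, normalizing each row by its own sum is the same as dividing by P(O | λ):

```python
import numpy as np

def gamma_from_alpha_beta(alpha, beta):
    """gamma[t, i] = P(q_t = i | O, lambda) = alpha_t(i) beta_t(i) / P(O | lambda)."""
    g = alpha * beta                             # numerator alpha_t(i) beta_t(i)
    return g / g.sum(axis=1, keepdims=True)      # each row sums to P(O | lambda)
```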

7 Probability of Gamma: Illustration

[Trellis diagram over states X, Y, and Z for observations o_1, o_2, o_3, showing the transition probabilities a_ij into and out of state Y and the observation probabilities b_X(o_t), b_Y(o_t), b_Z(o_t).]

Illustration: what is the probability of being in state Y at time 2?

8 Gamma: Example

Given this 3-state HMM and set of 4 observations, what is the probability of being in state A at time 2?

[HMM "sil-y+eh" with states A, B, C; observations O = { }]

9 Gamma: Example 1. Compute forward probabilities up to time 2

10 Gamma: Example 2. Compute backward probabilities for times 4, 3, 2

11 Gamma: Example 3. Compute γ

12 Xi

We can define one more variable: ξ_t(i,j) is the probability of being in state i at time t, and in state j at time t+1, given the observations and HMM:

ξ_t(i,j) = P(q_t = i, q_{t+1} = j | O, λ)

We can specify ξ as follows:

ξ_t(i,j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O | λ)
         = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)

Note: We're writing the denominator this way only to show that it's equivalent to P(O | λ). We won't further modify this term, or actually use the denominator in this form in implementation.
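A corresponding sketch of ξ, again assuming the illustrative A and B layouts used in the earlier sketches:

```python
import numpy as np

def xi_from_alpha_beta(alpha, beta, A, B):
    """xi[t, i, j] = P(q_t = i, q_{t+1} = j | O, lambda), defined for t = 1 .. T-1."""
    T, N = alpha.shape
    p_obs = alpha[-1].sum()                      # P(O | lambda)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        xi[t] = alpha[t][:, None] * A * (B[t + 1] * beta[t + 1])[None, :]
    return xi / p_obs
```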

13 Xi: Diagram

This diagram illustrates ξ:

[Trellis diagram over states X, Y, and Z across times t−1, t, t+1, t+2, highlighting α_2(X), the transition arc from X at time 2 to Y at time 3 (weighted by the transition probability and b_Y(o_3)), and β_3(Y).]

14 Xi: Example #1

Given the same HMM and observations as in the example for Gamma, what is ξ_2(A,B)?

15 Xi: Example #2

Given this 3-state HMM and set of 4 observations, what is the expected number of transitions from B to C?

[HMM "sil-y+eh" with states A, B, C; observations O = { }]

16 Xi: Example #2

17 The "expected number of transitions from state i to j for O" does not have to be an integer, even if the actual number of transitions for any single O is always an integer. These expected values have the same meaning as the expected value of a variable x with distribution f(x), which is the mean value of x. This mean value does not have to be an integer, even if x only takes on integer values.

From Lecture 3, slide 6: "Expected Values"
• the expected (mean) value of a continuous random variable X with p.d.f. f(x) is E(X) = ∫ x f(x) dx
• example 1 (discrete): E(X) = 2·0.05 + 3·0.10 + … + 9·0.05

18 Xi

We can also specify γ in terms of ξ (up to t = T−1):

γ_t(i) = Σ_{j=1}^{N} ξ_t(i,j)

and finally, using the original definition of γ (slide 6):

Σ_{j=1}^{N} ξ_t(i,j) = Σ_{j=1}^{N} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O | λ) = α_t(i) β_t(i) / P(O | λ) = γ_t(i)

But why do we care??
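In code, the expected counts used on slides 15–17 and in the re-estimation formulas that follow are just sums of these arrays over time; a sketch using the γ and ξ layouts from the sketches above:

```python
import numpy as np

def expected_transition_counts(xi):
    """Expected number of transitions from state i to state j: sum over t = 1..T-1 of xi_t(i, j).
    These values need not be integers."""
    return xi.sum(axis=0)

def expected_transitions_from(gamma):
    """Expected number of transitions out of state i: sum over t = 1..T-1 of gamma_t(i)."""
    return gamma[:-1].sum(axis=0)
```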

19 How Do We Improve Estimates of HMM Parameters?

We can improve estimates of HMM parameters using one case of the Expectation-Maximization procedure, known as the Baum-Welch method or forward-backward algorithm. In this algorithm, we use existing estimates of HMM parameters to compute new estimates of HMM parameters. The new parameters are guaranteed to be the same as, or "better" than, the old ones. The process of iterative improvement is repeated until the model doesn't change.

We can use the following re-estimation formulae:

Formula for updating initial state probabilities:

new π_i = expected frequency (number of times) in state i at time t = 1 = γ_1(i)

20 How Do We Improve Estimates of HMM Parameters?

Formula for observation probabilities, for discrete HMMs:

new b_j(k) = (expected number of times in state j and observing symbol v_k) / (expected number of times in state j)
           = Σ_{t : o_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)

Formula for updating transition probabilities:

new a_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
         = Σ_{t=1}^{T−1} ξ_t(i,j) / Σ_{t=1}^{T−1} γ_t(i)
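A sketch of these discrete-HMM re-estimation formulas for a single observation sequence, using the γ and ξ arrays from the sketches above; obs_symbols and n_symbols are illustrative names for the symbol indices o_t and the codebook size:

```python
import numpy as np

def reestimate_discrete(gamma, xi, obs_symbols, n_symbols):
    """Single-sequence re-estimation of (pi, A, B) for a discrete HMM.

    gamma: (T, N), xi: (T-1, N, N), obs_symbols: length-T integer array of symbol indices.
    """
    pi_new = gamma[0]                                          # expected frequency in state i at t = 1
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # expected i->j transitions / transitions out of i
    T, N = gamma.shape
    B_new = np.zeros((N, n_symbols))
    for t, v in enumerate(obs_symbols):
        B_new[:, v] += gamma[t]                                # expected times in state j observing symbol v
    B_new /= gamma.sum(axis=0)[:, None]                        # expected times in state j
    return pi_new, A_new, B_new
```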

21 How Do We Improve Estimates of HMM Parameters?

Formula for observation probabilities, for continuous HMMs (j = state, k = mixture component!!):

γ_t(j,k) = p(being in state j and component k at time t)
         = p(being in state j at time t) · p(component k | state j, o_t)      (from multiplication rule)
         = γ_t(j) · [ c_jk N(o_t; μ_jk, Σ_jk) / Σ_{m=1}^{M} c_jm N(o_t; μ_jm, Σ_jm) ]

where γ_t(j) is the probability of being in state j at time t (slide 6), c_jk is the mixture weight (Lecture 5, slide 29), and the bracketed term is the relative contribution of component k in this GMM for o_t, i.e. the probability of being in component k, given state j and o_t (observations are assumed independent).
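A sketch of γ_t(j,k) for a GMM observation model, assuming diagonal covariances for simplicity (the lecture's models need not be diagonal); gamma is the state-level γ from the earlier sketch, and the other array names are illustrative:

```python
import numpy as np

def gamma_state_component(gamma, obs, c, means, covs):
    """gamma_jk[t, j, k] = gamma_t(j) * c_jk N(o_t; mu_jk, Sigma_jk) / sum_m c_jm N(o_t; mu_jm, Sigma_jm).

    gamma: (T, N)     state occupancy
    obs:   (T, D)     observation vectors
    c:     (N, K)     mixture weights
    means: (N, K, D)  component means
    covs:  (N, K, D)  diagonal covariances (simplifying assumption)
    """
    T, N = gamma.shape
    K = c.shape[1]
    weighted = np.zeros((T, N, K))
    for j in range(N):
        for k in range(K):
            d = obs - means[j, k]
            log_n = -0.5 * (np.sum(np.log(2.0 * np.pi * covs[j, k]))
                            + np.sum(d * d / covs[j, k], axis=1))
            weighted[:, j, k] = c[j, k] * np.exp(log_n)        # c_jk N(o_t; mu_jk, Sigma_jk)
    rel = weighted / weighted.sum(axis=2, keepdims=True)       # relative contribution of component k
    return gamma[:, :, None] * rel
```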

22 How Do We Improve Estimates of HMM Parameters?

For continuous HMMs:

new μ_jk = Σ_{t=1}^{T} γ_t(j,k) o_t / Σ_{t=1}^{T} γ_t(j,k)
         = expected value of o_t based on existing model

new Σ_jk = Σ_{t=1}^{T} γ_t(j,k) (o_t − new μ_jk)(o_t − new μ_jk)^T / Σ_{t=1}^{T} γ_t(j,k)
         = expected value of covariance matrix based on existing model

(T = transpose, not end time; the denominator is the total probability of being in state j and component k)
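A sketch of these updates (plus the mixture-weight update, which is standard Baum-Welch practice even though it is not written out on this slide), using the γ_t(j,k) array from the previous sketch; full covariance matrices are computed with the NEW means:

```python
import numpy as np

def reestimate_gmm(gamma_jk, obs):
    """Re-estimate mixture weights, means, and covariances from gamma_jk and the observations.

    gamma_jk: (T, N, K) joint state/component occupancy
    obs:      (T, D)    observation vectors
    """
    occ = gamma_jk.sum(axis=0)                                  # (N, K): total prob. of state j, component k
    c_new = occ / occ.sum(axis=1, keepdims=True)                # mixture weights
    means_new = np.einsum('tjk,td->jkd', gamma_jk, obs) / occ[:, :, None]
    T, N, K = gamma_jk.shape
    D = obs.shape[1]
    covs_new = np.zeros((N, K, D, D))
    for j in range(N):
        for k in range(K):
            d = obs - means_new[j, k]                           # deviations from the NEW mean
            covs_new[j, k] = np.einsum('t,td,te->de', gamma_jk[:, j, k], d, d) / occ[j, k]
    return c_new, means_new, covs_new
```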

23 How Do We Improve Estimates of HMM Parameters?

After computing new model parameters, we "maximize" by substituting the new parameter values in place of the old parameter values, and repeating until the model parameters stabilize. This process is guaranteed to converge monotonically to a maximum-likelihood estimate. The next lecture will try to explain why the process converges to a better estimate with each iteration using these formulae.

There may be many local "best" estimates (local maxima in the parameter space); we can't guarantee that the EM process will reach the globally best result.

This is different from Viterbi segmentation, because it utilizes probabilities over the entire sequence, not just the most likely events.

24 Forward-Backward Training: Multiple Observation Sequences

Usually, training is performed on a large number of separate observation sequences, e.g. multiple examples of the word "yes." If we denote individual observation sequences with a superscript, where O^(i) is the i-th observation sequence, then we can consider the set of all K observation sequences used in training:

O = { O^(1), O^(2), …, O^(K) }

We want to maximize

P(O | λ) = Π_{k=1}^{K} P(O^(k) | λ)

The re-estimation formulas, i.e. the formulas for π_i, a_ij, and b_j(k) from slides 19 and 20 (call them [1], [2], and [3]), are based on frequencies of events for a single observation sequence O = {o_1, o_2, …, o_T}.

25 Forward-Backward Training: Multiple Observation Sequences

If we have multiple observation sequences, then we can re-write the re-estimation formulas over all sequences, e.g. for the transition probabilities:

new a_ij = Σ_{k=1}^{K} w_k Σ_{t=1}^{T_k−1} ξ_t^(k)(i,j) / Σ_{k=1}^{K} w_k Σ_{t=1}^{T_k−1} γ_t^(k)(i)    [4]

where w_k is a per-sequence weight. For example, let's say we have two observation sequences, each of length 3, and furthermore, let's pretend that we have reasonable values for the γ and ξ terms of each sequence.

26 Forward-Backward Training: Multiple Observation Sequences

If we look at the transition probabilities computed separately for each sequence O^(1) and O^(2), then each sequence yields its own estimate, a_ij^(1) and a_ij^(2).

One way of computing the re-estimation formula for a_ij is to set the weight w_k to 1.0 for all sequences, and then

new a_ij = Σ_{k=1}^{K} Σ_{t=1}^{T_k−1} ξ_t^(k)(i,j) / Σ_{k=1}^{K} Σ_{t=1}^{T_k−1} γ_t^(k)(i)

Another way of re-estimating is to give each individual sequence equal weight by computing the mean, e.g.

new a_ij = (1/K) Σ_{k=1}^{K} a_ij^(k)    [5]

27 Forward-Backward Training: Multiple Observation Sequences

Rabiner proposes using a weight inversely proportional to the probability of the observation sequence, given the model:

w_k = 1 / P(O^(k) | λ)    [6]

This weighting gives greater weight in the re-estimation to those utterances that don't fit the model well. This is reasonable if one assumes that, in training, the model and the data should always be a good fit. However, that rests on the assumption that from the (known) words in the training set we can obtain the correct phoneme sequences, and this assumption is in many cases not valid. Therefore, it can be safer to use a weight of w_k = 1.0. Also, when dealing with very small values of P(O | λ), small changes in P(O | λ) can yield large changes in the weights.

28 Forward-Backward Training: Multiple Observation Sequences

For the third project, you may implement either equation [4] or [5] (above) when dealing with multiple observation sequences (multiple recordings of the same word, in this case).

As noted on the next slides, implementation of either solution involves the use of "accumulators"… the idea is to add values to the accumulator for each file, and then, when all files have been processed, compute the new model parameters.

For example, for equation [4], the numerator of the accumulator contains the sum (over each file) of Σ_{t=1}^{T_k−1} ξ_t^(k)(i,j), and the denominator contains the sum (over each file) of Σ_{t=1}^{T_k−1} γ_t^(k)(i). For equation [5], the accumulator contains the sum of the individual values of a_ij^(k), and this sum is then divided by the denominator K.
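A sketch of the equation [4] accumulator pattern (with weights w_k = 1), pooling numerators and denominators over all sequences before dividing; xis and gammas are illustrative names for per-file ξ and γ arrays in the layouts used above:

```python
import numpy as np

def reestimate_transitions_pooled(xis, gammas):
    """Equation [4] with w_k = 1: accumulate numerator and denominator over all files, then divide.

    xis:    list of (T_k - 1, N, N) xi arrays, one per observation sequence
    gammas: list of (T_k, N) gamma arrays, one per observation sequence
    """
    num = sum(x.sum(axis=0) for x in xis)             # accumulated expected i -> j transitions
    den = sum(g[:-1].sum(axis=0) for g in gammas)     # accumulated expected transitions out of i
    return num / den[:, None]
```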

29 Forward-Backward Training: Multiple Observation Sequences

Initialize an HMM:
    set transition probabilities to default values
    for each file:
        compute initial state boundaries (e.g. flat start)
        add information to "accumulator" (sum, sum squared, count)
    compute mean, variance for each GMM
    (optional: output initial estimates of model parameters)

[Illustration: File 1, File 2, File 3, each segmented into the phoneme sequence .pau y eh s .pau]

30 Forward-Backward Training: Multiple Observation Sequences

Iteratively improve an HMM:
    for each iteration:
        reset accumulators
        for each file:
            get alpha and beta based on previous model parameters
            add new estimates for this file to accumulators for a_ij and means
        update estimates of a_ij and means
        for each file:
            get alpha and beta (again)
            add new estimates for this file to accumulators for covariance values
        update estimates of covariances
        write current model parameters to output

NOTE: make sure to update the covariance values using the NEW mean values. And make sure that the covariance values are updated using the mean values over ALL files, not each individual file, since the new means are based on ALL observations.
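For comparison only, here is a compact training loop for a discrete HMM over multiple sequences (pooled accumulators, w_k = 1), built from the earlier sketches (forward, backward, xi_from_alpha_beta); the continuous-density, two-pass procedure outlined on this slide differs in that it re-estimates GMM means and covariances, with the covariances updated from the new means over all files:

```python
import numpy as np

def baum_welch_discrete(sequences, pi, A, B, n_iter=10):
    """Sketch of Baum-Welch for a discrete HMM over multiple observation sequences.

    sequences: list of 1-D integer arrays of observation symbols
    pi: (N,), A: (N, N), B: (N, M) with B[j, v] = b_j(v)
    """
    N, M = B.shape
    for _ in range(n_iter):
        pi_acc = np.zeros(N)                               # reset accumulators
        a_num = np.zeros((N, N)); a_den = np.zeros(N)
        b_num = np.zeros((N, M)); b_den = np.zeros(N)
        for o in sequences:
            Bt = B[:, o].T                                 # (T, N) likelihoods b_j(o_t)
            alpha, p = forward(pi, A, Bt)                  # alpha, beta from previous model parameters
            beta = backward(A, Bt)
            g = alpha * beta / p                           # gamma
            xi = xi_from_alpha_beta(alpha, beta, A, Bt)
            pi_acc += g[0]
            a_num += xi.sum(axis=0); a_den += g[:-1].sum(axis=0)
            for t, v in enumerate(o):
                b_num[:, v] += g[t]
            b_den += g.sum(axis=0)
        pi = pi_acc / len(sequences)                       # update model from accumulators
        A = a_num / a_den[:, None]
        B = b_num / b_den[:, None]
    return pi, A, B
```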