
Slide 1. CSE 552/652 Hidden Markov Models for Speech Recognition
Spring 2006, Oregon Health & Science University, OGI School of Science & Engineering
John-Paul Hosom
Lecture Notes for May 10: Gamma, Xi, and the Forward-Backward Algorithm

Slide 2. Review: α and β
Define the variable α_t(i), which has the meaning of "the probability of observations o_1 through o_t and being in state i at time t, given our HMM λ":
α_t(i) = P(o_1 o_2 … o_t, q_t = i | λ)
Compute α and P(O | λ) with the following procedure:
Initialization: α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N
Induction: α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(o_{t+1}), 1 ≤ t ≤ T−1, 1 ≤ j ≤ N
Termination: P(O | λ) = Σ_{i=1}^{N} α_T(i)
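The procedure above can be sketched in a few lines of Python. This is a minimal discrete-observation implementation (variable names are my own); the model numbers are the two-state "hi" example used later in these notes, with observation symbol 0 standing for the feature value 0.8 and symbol 1 for the value 0.2:

```python
# Forward procedure for a discrete-observation HMM.
def forward(pi, A, B, obs):
    """Return alpha[t][i] = P(o_1..o_t, q_t = i | lambda) and P(O | lambda)."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]            # initialization
    for t in range(1, len(obs)):                                  # induction
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return alpha, sum(alpha[-1])                                  # termination

# Two-state "hi" model: states h, ay (numbers from the worked example below).
pi = [1.0, 0.0]
A = [[0.3, 0.7],        # transitions from h
     [0.0, 0.4]]        # transitions from ay (exit probability not modeled)
B = [[0.55, 0.20],      # b_h(0.8), b_h(0.2)
     [0.15, 0.65]]      # b_ay(0.8), b_ay(0.2)
alpha, P = forward(pi, A, B, [0, 0, 1])
```

Because the exit probability from ay is not included, the rows of A here sum to less than 1; the recursion itself is unchanged.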

Slide 3. Review: α and β
In the same way that we defined α, we can define β. The variable β_t(i) has the meaning of "the probability of observations o_{t+1} through o_T, given that we're in state i at time t, and given our HMM λ":
β_t(i) = P(o_{t+1} o_{t+2} … o_T | q_t = i, λ)
Compute β with the following procedure:
Initialization: β_T(i) = 1, 1 ≤ i ≤ N
Induction: β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j), t = T−1, T−2, …, 1
Termination: P(O | λ) = Σ_{i=1}^{N} π_i b_i(o_1) β_1(i)
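A matching Python sketch of the backward procedure, under the same toy setup (two-state "hi" model; observation symbols 0 and 1 stand for feature values 0.8 and 0.2; names are mine):

```python
# Backward procedure for a discrete-observation HMM.
def backward(pi, A, B, obs):
    """Return beta[t][i] = P(o_{t+1}..o_T | q_t = i, lambda) and P(O | lambda)."""
    N, T = len(pi), len(obs)
    beta = [[1.0] * N]                                            # initialization
    for t in range(T - 2, -1, -1):                                # induction
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j] for j in range(N))
                        for i in range(N)])
    P = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(N))  # termination
    return beta, P

pi = [1.0, 0.0]
A = [[0.3, 0.7], [0.0, 0.4]]
B = [[0.55, 0.20], [0.15, 0.65]]
beta, P = backward(pi, A, B, [0, 0, 1])
```

The termination value agrees with the forward procedure's P(O | λ), which is a useful sanity check on any implementation.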

Slide 4. Forward Procedure: Algorithm Example
Example: "hi", a two-state HMM with states h and ay (h → ay), π_h = 1.0, and transition probabilities a_hh = 0.3, a_h,ay = 0.7, a_ay,h = 0.0, a_ay,ay = 0.4.
Observed features: o_1 = {0.8}, o_2 = {0.8}, o_3 = {0.2}, with b_h(0.8) = 0.55, b_ay(0.8) = 0.15, b_h(0.2) = 0.20, b_ay(0.2) = 0.65.
α_1(h) = 0.55, α_1(ay) = 0.0
α_2(h) = [0.55·0.3 + 0.0·0.0] · 0.55 = 0.090750
α_2(ay) = [0.55·0.7 + 0.0·0.4] · 0.15 = 0.057750
α_3(h) = [0.090750·0.3 + 0.057750·0.0] · 0.20 = 0.005445
α_3(ay) = [0.090750·0.7 + 0.057750·0.4] · 0.65 = 0.056306
Σ_i α_3(i) = P(O | λ) = 0.061751

Slide 5. Backward Procedure: Algorithm Example
What are all the β values for the same example?
β_3(h) = 1.0, β_3(ay) = 1.0
β_2(h) = [0.3·0.20·1.0 + 0.7·0.65·1.0] = 0.515
β_2(ay) = [0.0·0.20·1.0 + 0.4·0.65·1.0] = 0.260
β_1(h) = [0.3·0.55·0.515 + 0.7·0.15·0.260] = 0.1123
β_1(ay) = [0.0·0.55·0.515 + 0.4·0.15·0.260] = 0.0156
β_0(·) = [1.0·0.55·0.1123 + 0.0·0.15·0.0156] = 0.061751
β_0(·) = Σ_i α_3(i) = P(O | λ)

Slide 6. Probability of Gamma
Now we can define γ_t(i), the probability of being in state i at time t, given an observation sequence and HMM:
γ_t(i) = P(q_t = i | O, λ) = P(q_t = i, O | λ) / P(O | λ)
Also, by the multiplication rule,
α_t(i) β_t(i) = P(o_1 … o_t, q_t = i | λ) · P(o_{t+1} … o_T | q_t = i, λ) = P(O, q_t = i | λ)
so
γ_t(i) = α_t(i) β_t(i) / P(O | λ) = α_t(i) β_t(i) / Σ_{j=1}^{N} α_t(j) β_t(j)
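As a quick check of this definition, γ can be computed directly from the α and β tables. A small Python sketch on the two-state "hi" example (my notation; observation symbols 0 and 1 stand for feature values 0.8 and 0.2):

```python
# Computing gamma_t(i) = alpha_t(i) * beta_t(i) / sum_j alpha_t(j) * beta_t(j).
pi = [1.0, 0.0]
A = [[0.3, 0.7], [0.0, 0.4]]          # exit probability from ay not modeled
B = [[0.55, 0.20], [0.15, 0.65]]      # rows: h, ay; cols: feature 0.8, feature 0.2
obs = [0, 0, 1]
N, T = len(pi), len(obs)

alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                  for j in range(N)])
beta = [[1.0] * N]
for t in range(T - 2, -1, -1):
    beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j] for j in range(N))
                    for i in range(N)])

gamma = [[alpha[t][i] * beta[t][i] /
          sum(alpha[t][j] * beta[t][j] for j in range(N)) for i in range(N)]
         for t in range(T)]
```

At every time t, γ_t(·) sums to 1 over the states, and since π_ay = 0 the model is certainly in state h at t = 1 (γ_1(h) = 1).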

Slide 7. Probability of Gamma: Illustration
[Trellis diagram over states X, Y, and Z, showing transition probabilities a and observation probabilities b(o_t) at each time step.]
Illustration: what is the probability of being in state Y at time 2?

Slide 8. Gamma: Example
Given this 3-state HMM (states A, B, and C for the triphone sil-y+eh) and a set of 4 observations O = { }, what is the probability of being in state A at time 2?

Slide 9. Gamma: Example
1. Compute forward probabilities up to time 2.

Slide 10. Gamma: Example
2. Compute backward probabilities for times 4, 3, and 2.

Slide 11. Gamma: Example
3. Compute γ.

Slide 12. Xi
We can define one more variable: ξ_t(i,j) is the probability of being in state i at time t, and in state j at time t+1, given the observations and HMM:
ξ_t(i,j) = P(q_t = i, q_{t+1} = j | O, λ)
We can specify ξ as follows:
ξ_t(i,j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O | λ)
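A small Python sketch of ξ on the two-state "hi" example (my notation; observation symbols 0 and 1 stand for feature values 0.8 and 0.2). It also checks the identity Σ_j ξ_t(i,j) = γ_t(i):

```python
# xi_t(i,j) = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j) / P(O | lambda)
pi = [1.0, 0.0]
A = [[0.3, 0.7], [0.0, 0.4]]
B = [[0.55, 0.20], [0.15, 0.65]]
obs = [0, 0, 1]
N, T = len(pi), len(obs)

alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                  for j in range(N)])
beta = [[1.0] * N]
for t in range(T - 2, -1, -1):
    beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j] for j in range(N))
                    for i in range(N)])
P = sum(alpha[-1])

xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / P
        for j in range(N)] for i in range(N)] for t in range(T - 1)]
```

Summing ξ_t(i,j) over both i and j gives 1 at every t, because the model must be in some pair of states at times t and t+1.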

Slide 13. Xi: Diagram
[Trellis diagram over states X, Y, and Z across times t−1 to t+2, highlighting the forward probability into the time-t state, the transition term a·b(o_{t+1}), and the backward probability out of the time-(t+1) state.]
This diagram illustrates ξ_t(i,j).

Slide 14. Xi: Example #1
Given the same HMM and observations as in the example for Gamma, what is ξ_2(A,B)?

Slide 15. Xi: Example #2
Given this 3-state HMM (states A, B, and C for the triphone sil-y+eh) and set of 4 observations O = { }, what is the expected number of transitions from B to C?

Slide 16. Xi: Example #2
[Worked computation shown on the slide.]

Slide 17. The "expected number of transitions from state i to j for O" does not have to be an integer, even though the actual number of transitions for any single O is always an integer. These expected values have the same meaning as the expected value of a random variable x with distribution f(x), which is the mean value of x; this mean value does not have to be an integer, even if x only takes on integer values. From Lecture 3, slide 6 ("Expected Values"): the expected (mean) value of a continuous random variable X with p.d.f. f(x) is E(X) = ∫ x f(x) dx. Example (discrete): E(X) = 2·0.05 + 3·0.10 + … + 9·0.05

Slide 18. Xi
We can also specify γ in terms of ξ:
γ_t(i) = Σ_{j=1}^{N} ξ_t(i,j)
and finally,
Σ_{t=1}^{T−1} γ_t(i) = expected number of transitions from state i in O
Σ_{t=1}^{T−1} ξ_t(i,j) = expected number of transitions from state i to state j in O
But why do we care??
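These expected counts can be verified numerically. A sketch on the two-state "hi" example (my notation) that computes both sums and checks that they agree, and that the expected h→ay transition count is indeed not an integer:

```python
# Expected transition counts from the gamma and xi tables.
pi = [1.0, 0.0]
A = [[0.3, 0.7], [0.0, 0.4]]
B = [[0.55, 0.20], [0.15, 0.65]]
obs = [0, 0, 1]
N, T = len(pi), len(obs)

alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                  for j in range(N)])
beta = [[1.0] * N]
for t in range(T - 2, -1, -1):
    beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j] for j in range(N))
                    for i in range(N)])
P = sum(alpha[-1])
gamma = [[alpha[t][i] * beta[t][i] / P for i in range(N)] for t in range(T)]
xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / P
        for j in range(N)] for i in range(N)] for t in range(T - 1)]

# Sum over t = 1 .. T-1:
expected_from = [sum(gamma[t][i] for t in range(T - 1)) for i in range(N)]
expected_from_to = [[sum(xi[t][i][j] for t in range(T - 1)) for j in range(N)]
                    for i in range(N)]
```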

Slide 19. How Do We Improve Estimates of HMM Parameters?
We can improve estimates of the HMM parameters using one case of the Expectation-Maximization procedure, known as the Baum-Welch method or forward-backward algorithm. In this algorithm, we use existing estimates of the HMM parameters to compute new estimates, and the new parameters are guaranteed to be the same or "better" than the old ones. This process of iterative improvement is repeated until the model doesn't change. We can use the following re-estimation formulae.
Formula for updating the initial state probabilities:
π̄_i = expected frequency in state i at time t = 1 = γ_1(i)

Slide 20. How Do We Improve Estimates of HMM Parameters?
Formula for the observation probabilities, for discrete HMMs:
b̄_j(k) = (expected number of times in state j and observing symbol v_k) / (expected number of times in state j)
= Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
Formula for updating the transition probabilities:
ā_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
= Σ_{t=1}^{T−1} ξ_t(i,j) / Σ_{t=1}^{T−1} γ_t(i)
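The two discrete re-estimation formulas can be sketched as follows, again on the two-state "hi" example (my notation; this sketch does not model the exit transition, so each re-estimated row of A sums to 1):

```python
# One discrete Baum-Welch update of A and B from the gamma and xi tables.
pi = [1.0, 0.0]
A = [[0.3, 0.7], [0.0, 0.4]]
B = [[0.55, 0.20], [0.15, 0.65]]
obs = [0, 0, 1]
N, T, M = len(pi), len(obs), len(B[0])

alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                  for j in range(N)])
beta = [[1.0] * N]
for t in range(T - 2, -1, -1):
    beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j] for j in range(N))
                    for i in range(N)])
P = sum(alpha[-1])
gamma = [[alpha[t][i] * beta[t][i] / P for i in range(N)] for t in range(T)]
xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / P
        for j in range(N)] for i in range(N)] for t in range(T - 1)]

# a_ij = sum_t xi_t(i,j) / sum_t gamma_t(i), sums over t = 1 .. T-1
new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
          sum(gamma[t][i] for t in range(T - 1)) for j in range(N)]
         for i in range(N)]
# b_j(k) = sum over t where o_t = v_k of gamma_t(j), over sum_t gamma_t(j)
new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
          sum(gamma[t][j] for t in range(T)) for k in range(M)]
         for j in range(N)]
```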

Slide 21. How Do We Improve Estimates of HMM Parameters?
Formula for the observation probabilities, for continuous (GMM) HMMs, where j = state and k = mixture component:
c̄_jk = Σ_{t=1}^{T} γ_t(j,k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γ_t(j,k)
where
γ_t(j,k) = [α_t(j) β_t(j) / Σ_{j=1}^{N} α_t(j) β_t(j)] · [c_jk N(o_t; μ_jk, Σ_jk) / Σ_{m=1}^{M} c_jm N(o_t; μ_jm, Σ_jm)]
Here γ_t(j,k) = p(being in state j, with component k accounting for o_t), and Σ_k γ_t(j,k) = γ_t(j) = p(being in state j).

Slide 22. How Do We Improve Estimates of HMM Parameters?
For continuous HMMs:
μ̄_jk = Σ_{t=1}^{T} γ_t(j,k) o_t / Σ_{t=1}^{T} γ_t(j,k)
= expected value of o_t, based on the existing model
Σ̄_jk = Σ_{t=1}^{T} γ_t(j,k) (o_t − μ_jk)(o_t − μ_jk)^T / Σ_{t=1}^{T} γ_t(j,k)
= expected value of the (diagonal of the) covariance matrix, based on the existing model
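As a simplified sketch of the continuous case, assume a single Gaussian per state, so that γ_t(j,k) reduces to γ_t(j), and scalar observations; the mean and variance updates are then just γ-weighted averages (my notation, on the "hi" example's feature values):

```python
# Single-Gaussian mean/variance re-estimation (weighted averages under gamma).
pi = [1.0, 0.0]
A = [[0.3, 0.7], [0.0, 0.4]]
B = [[0.55, 0.20], [0.15, 0.65]]
obs = [0, 0, 1]                     # symbol indices, used only to compute gamma
obs_vals = [0.8, 0.8, 0.2]          # the scalar feature values themselves
N, T = len(pi), len(obs)

alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                  for j in range(N)])
beta = [[1.0] * N]
for t in range(T - 2, -1, -1):
    beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j] for j in range(N))
                    for i in range(N)])
P = sum(alpha[-1])
gamma = [[alpha[t][i] * beta[t][i] / P for i in range(N)] for t in range(T)]

mu = [sum(gamma[t][j] * obs_vals[t] for t in range(T)) /
      sum(gamma[t][j] for t in range(T)) for j in range(N)]
var = [sum(gamma[t][j] * (obs_vals[t] - mu[j]) ** 2 for t in range(T)) /
       sum(gamma[t][j] for t in range(T)) for j in range(N)]
```

Since state h dominates the 0.8-valued frames and ay dominates the 0.2-valued frame, the re-estimated mean for h lands near 0.8 and the mean for ay near 0.2–0.35.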

Slide 23. How Do We Improve Estimates of HMM Parameters?
After computing new model parameters, we "maximize" by substituting the new parameter values in place of the old ones, and repeat until the model parameters stabilize. This process is guaranteed to converge monotonically to a maximum-likelihood estimate. The next lecture will try to explain why the process converges to a better estimate with each iteration using these formulae. There may be many local "best" estimates (local maxima in the parameter space); we can't guarantee that the EM process will reach the globally best result. This is different from Viterbi segmentation, because it uses probabilities over the entire sequence, not just the single most likely state sequence.

Slide 24. Forward-Backward Training: Multiple Observation Sequences
Usually, training is performed on a large number of separate observation sequences, e.g. multiple recordings of the word "yes." If we denote individual observation sequences with a superscript, where O^(i) is the i-th observation sequence, then we can consider the set of all K observation sequences used in training:
O = {O^(1), O^(2), …, O^(K)}   [1]
We want to maximize
P(O | λ) = Π_{k=1}^{K} P(O^(k) | λ)   [2]
The re-estimation formulas are based on frequencies of events for a single observation sequence O = {o_1, o_2, …, o_T}, e.g.
ā_ij = Σ_{t=1}^{T−1} ξ_t(i,j) / Σ_{t=1}^{T−1} γ_t(i)   [3]

Slide 25. Forward-Backward Training: Multiple Observation Sequences
If we have multiple observation sequences, then we can re-write the re-estimation formulas in terms of the individual sequences, e.g.
ā_ij = Σ_{k=1}^{K} w_k Σ_{t=1}^{T_k−1} ξ_t^(k)(i,j) / Σ_{k=1}^{K} w_k Σ_{t=1}^{T_k−1} γ_t^(k)(i)   [4]
where w_k is a weight for the k-th sequence.
For example, let's say we have two observation sequences, each of length 3, and furthermore, let's pretend that the following are reasonable numbers: [example values shown on the slide]

Slide 26. Forward-Backward Training: Multiple Observation Sequences
If we look at the transition probabilities computed separately for each sequence O^(1) and O^(2): [per-sequence estimates shown on the slide]
One way of computing the re-estimation formula for a_ij is to set the weight w_k to 1.0, so that
ā_ij = Σ_{k=1}^{K} Σ_{t=1}^{T_k−1} ξ_t^(k)(i,j) / Σ_{k=1}^{K} Σ_{t=1}^{T_k−1} γ_t^(k)(i)
Another way of re-estimating is to give each individual estimate equal weight by computing the mean, e.g.
ā_ij = (1/K) Σ_{k=1}^{K} [Σ_{t=1}^{T_k−1} ξ_t^(k)(i,j) / Σ_{t=1}^{T_k−1} γ_t^(k)(i)]   [5]
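The difference between the two combinations can be seen with made-up accumulator values (these numbers are purely illustrative, not from the slides):

```python
# Hypothetical accumulator values for one transition probability, two sequences.
# num[k] stands for sum_t xi_t(i,j) in sequence k; den[k] for sum_t gamma_t(i).
num = [0.9, 0.2]
den = [1.5, 1.0]

a_pooled = sum(num) / sum(den)                              # equation [4], w_k = 1.0
a_mean = sum(n / d for n, d in zip(num, den)) / len(num)    # equation [5]
```

The pooled form weights each sequence by its state-occupancy denominator, so sequences that spend more time in state i count more; the mean form treats every sequence equally.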

Slide 27. Forward-Backward Training: Multiple Observation Sequences
Rabiner proposes using a weight inversely proportional to the probability of the observation sequence, given the model:
w_k = 1 / P(O^(k) | λ)   [6]
This weighting gives greater weight in the re-estimation to those utterances that don't fit the model well, which is reasonable if one assumes that the model and the training data should always have a good fit. In our case, however, we assume that from the (known) words in the training set we can obtain the correct phoneme sequences, and this assumption is in many cases not valid: an utterance may fit poorly simply because its assumed phoneme sequence is wrong. Therefore, it can be safer to use a weight of w_k = 1.0. Also, when dealing with very small values of P(O | λ), small changes in P(O | λ) can yield large changes in the weights.

Slide 28. Forward-Backward Training: Multiple Observation Sequences
For the third project, you may implement either equation [4] or [5] (above) when dealing with multiple observation sequences (multiple recordings of the same word, in this case). As noted on the next slides, implementation of either solution involves the use of "accumulators": the idea is to add values to the accumulator for each file, and then, when all files have been processed, compute the new model parameters. For example, for equation [4], the numerator of the accumulator contains the sum (over each file) of Σ_t ξ_t^(k)(i,j), and the denominator contains the sum (over each file) of Σ_t γ_t^(k)(i). For equation [5], the accumulator contains the sum of the individual per-file estimates ā_ij^(k), and this sum is then divided by K.

Slide 29. Forward-Backward Training: Multiple Observation Sequences
Initialize an HMM:
for each file:
    compute initial state boundaries (e.g. flat start)
    add information to "accumulator" (sum, sum squared, count)
compute mean, variance for each GMM
set initial estimates of state parameters from mean, variance
[Diagram: three example files, each segmented as .pau y eh s .pau]

Slide 30. Forward-Backward Training: Multiple Observation Sequences
Iteratively improve an HMM:
for each iteration:
    reset accumulators
    for each file:
        get state parameter info. from previous iteration
        add new state information to accumulators
    compute mean, variance for each GMM
    update estimates of state parameters
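The iterative loop above can be sketched end-to-end as a complete, if toy, discrete Baum-Welch trainer (single observation sequence, made-up model numbers, no flat start or GMMs; names are mine). The key property to observe is that P(O | λ) never decreases across iterations:

```python
# A minimal discrete Baum-Welch training loop; model and data are illustrative.
def forward_backward(pi, A, B, obs):
    N, T = len(pi), len(obs)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    beta = [[1.0] * N]
    for t in range(T - 2, -1, -1):
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j] for j in range(N))
                        for i in range(N)])
    return alpha, beta, sum(alpha[-1])

def reestimate(pi, A, B, obs):
    """One EM pass: returns updated (pi, A, B) and P(O|lambda) under the OLD model."""
    N, T, M = len(pi), len(obs), len(B[0])
    alpha, beta, P = forward_backward(pi, A, B, obs)
    gamma = [[alpha[t][i] * beta[t][i] / P for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / P
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T)) for k in range(M)]
             for j in range(N)]
    return new_pi, new_A, new_B, P

pi, A, B = [0.5, 0.5], [[0.6, 0.4], [0.3, 0.7]], [[0.7, 0.3], [0.2, 0.8]]
obs = [0, 1, 0, 0, 1]
probs = []
for _ in range(5):
    pi, A, B, P = reestimate(pi, A, B, obs)
    probs.append(P)          # likelihood of the model before each update
```

Extending this to the project's setting means wrapping the `reestimate` body in a per-file loop that only adds to numerator/denominator accumulators, then dividing once after all files are processed.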