HMM - Part 2: Review of the last lecture; The EM algorithm; Continuous density HMM


1 HMM - Part 2: Review of the last lecture; The EM algorithm; Continuous density HMM

2 Three Basic Problems for HMMs
Given an observation sequence O = (o_1, o_2, …, o_T) and an HMM λ = (A, B, π):
–Problem 1: How to compute P(O|λ) efficiently? → The forward algorithm
–Problem 2: How to choose an optimal state sequence Q = (q_1, q_2, …, q_T) which best explains the observations? → The Viterbi algorithm
–Problem 3: How to adjust the model parameters λ = (A, B, π) to maximize P(O|λ)? → The Baum-Welch (forward-backward) algorithm (cf. the segmental K-means algorithm, which maximizes P(O, Q*|λ))
Example from the last lecture: P(up, up, up, up, up|λ)?

3 The Forward Algorithm
The forward variable α_t(i) = P(o_1, o_2, …, o_t, q_t = i | λ)
–Probability of o_1, o_2, …, o_t being observed and the state at time t being i, given model λ
The forward algorithm:
–Initialization: α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N
–Induction: α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(o_{t+1}), 1 ≤ t ≤ T−1, 1 ≤ j ≤ N
–Termination: P(O|λ) = Σ_{i=1}^{N} α_T(i)
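To make the recursion concrete, here is a minimal NumPy sketch of the forward algorithm (not part of the original slides); the transition matrix A, emission matrix B, initial distribution pi and the observation sequence are made-up toy values.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward pass: alpha[t, i] = P(o_1..o_t, q_t = i | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction
    return alpha, alpha[-1].sum()                     # termination: P(O|lambda)

# Toy 2-state, 2-symbol model (illustrative numbers only)
A  = np.array([[0.6, 0.4], [0.3, 0.7]])
B  = np.array([[0.7, 0.3], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
alpha, likelihood = forward(A, B, pi, [0, 1, 1])
print(likelihood)
```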

4 The Viterbi Algorithm
1. Initialization: δ_1(i) = π_i b_i(o_1), ψ_1(i) = 0
2. Induction: δ_t(j) = max_i [δ_{t−1}(i) a_ij] b_j(o_t), ψ_t(j) = argmax_i [δ_{t−1}(i) a_ij]
3. Termination: P* = max_i δ_T(i), q*_T = argmax_i δ_T(i)
4. Backtracking: q*_t = ψ_{t+1}(q*_{t+1}), t = T−1, …, 1
Q* = (q*_1, q*_2, …, q*_T) is the best state sequence.
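A matching NumPy sketch of the Viterbi recursion, again using made-up toy parameters; in practice one would work with log probabilities to avoid underflow.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return the best state path Q* and its probability P(O, Q* | lambda)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))             # delta_t(j): best score of a path ending in j at t
    psi = np.zeros((T, N), dtype=int)    # psi_t(j): backpointer to the best predecessor
    delta[0] = pi * B[:, obs[0]]                       # 1. initialization
    for t in range(1, T):                              # 2. induction
        scores = delta[t - 1][:, None] * A             # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]                   # 3. termination
    for t in range(T - 1, 0, -1):                      # 4. backtracking
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()

A  = np.array([[0.6, 0.4], [0.3, 0.7]])
B  = np.array([[0.7, 0.3], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(viterbi(A, B, pi, [0, 1, 1]))
```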

5 The Segmental K-means Algorithm
Assume that we have a training set of observations and an initial estimate of the model parameters.
–Step 1: Segment the training data. The set of training observation sequences is segmented into states, based on the current model, by the Viterbi algorithm.
–Step 2: Re-estimate the model parameters.
–Step 3: Evaluate the model. If the difference between the new and current model scores exceeds a threshold, go back to Step 1; otherwise, return.

6 Segmental K-means vs. Baum-Welch

7 The Backward Algorithm
The backward variable β_t(i) = P(o_{t+1}, o_{t+2}, …, o_T | q_t = i, λ)
–Probability of o_{t+1}, o_{t+2}, …, o_T being observed, given the state at time t being i and model λ
The backward algorithm:
–Initialization: β_T(i) = 1, 1 ≤ i ≤ N
–Induction: β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j), t = T−1, …, 1
cf. the forward variable α_t(i) = P(o_1, …, o_t, q_t = i | λ)
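A corresponding NumPy sketch of the backward recursion (toy numbers as before); the final check uses the identity P(O|λ) = Σ_i π_i b_i(o_1) β_1(i), which should match the forward algorithm's result.

```python
import numpy as np

def backward(A, B, obs):
    """Backward pass: beta[t, i] = P(o_{t+1}..o_T | q_t = i, lambda)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                       # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                       # induction, backwards in time
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

A  = np.array([[0.6, 0.4], [0.3, 0.7]])
B  = np.array([[0.7, 0.3], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 1]
beta = backward(A, B, obs)
print((pi * B[:, obs[0]] * beta[0]).sum())   # equals P(O|lambda) from the forward pass
```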

8 The Forward-Backward Algorithm
Relation between the forward and backward variables:
α_t(i) β_t(i) = P(O, q_t = i | λ), and hence P(O|λ) = Σ_{i=1}^{N} α_t(i) β_t(i) for any t (Huang et al., 2001)

9 The Baum-Welch Algorithm (1/3)
Define two new variables:
γ_t(i) = P(q_t = i | O, λ)
–Probability of being in state i at time t, given O and λ
ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ)
–Probability of being in state i at time t and state j at time t+1, given O and λ

10 The Baum-Welch Algorithm (2/3)
In terms of the forward and backward variables:
γ_t(i) = P(q_t = i | O, λ) = α_t(i) β_t(i) / Σ_{j=1}^{N} α_t(j) β_t(j)
ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O|λ)
Note that γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j) for t < T.
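A NumPy sketch of these E-step quantities, computed from an inline forward and backward pass over the same toy model (made-up numbers); each γ_t sums to 1 over the states and each ξ_t sums to 1 over the state pairs.

```python
import numpy as np

A  = np.array([[0.6, 0.4], [0.3, 0.7]])   # toy model, illustrative numbers only
B  = np.array([[0.7, 0.3], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 1]
T, N = len(obs), len(pi)

# Forward and backward passes
alpha = np.zeros((T, N))
beta = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
beta[-1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
P_O = alpha[-1].sum()                                # P(O|lambda)

# gamma_t(i) = alpha_t(i) beta_t(i) / P(O|lambda)
gamma = alpha * beta / P_O
# xi_t(i, j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O|lambda)
xi = np.array([alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / P_O
               for t in range(T - 1)])
print(gamma.sum(axis=1), xi[0].sum())                # sanity checks: all ones
```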

11 The Baum-Welch Algorithm (3/3)
Re-estimation formulae for π, A, and B:
π̄_i = γ_1(i)
ā_ij = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i)
b̄_j(k) = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
How do you know these re-estimates increase P(O|λ)? → The EM algorithm (following slides).
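These re-estimation formulae translate directly into a small M-step function; this sketch assumes the single-sequence γ and ξ arrays computed as in the previous snippet (for multiple training sequences the numerators and denominators would be accumulated over all sequences).

```python
import numpy as np

def reestimate(gamma, xi, obs, M):
    """One Baum-Welch M-step for a discrete-observation HMM.

    gamma: (T, N) state posteriors, xi: (T-1, N, N) transition posteriors,
    obs: length-T list of symbol indices, M: alphabet size.
    """
    obs = np.asarray(obs)
    pi_new = gamma[0]                                           # pi_i = gamma_1(i)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]    # a_ij = sum_t xi_t(i,j) / sum_t gamma_t(i)
    B_new = np.stack([gamma[obs == k].sum(axis=0) for k in range(M)],
                     axis=1) / gamma.sum(axis=0)[:, None]       # b_j(k) = sum_{t: o_t=v_k} gamma_t(j) / sum_t gamma_t(j)
    return pi_new, A_new, B_new
```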

12 Maximum Likelihood Estimation for HMM
The goal is λ* = argmax_λ log P(O|λ). However, we cannot find the solution directly.
An alternative way is to find a sequence of models λ^(0), λ^(1), λ^(2), … s.t. log P(O|λ^(0)) ≤ log P(O|λ^(1)) ≤ log P(O|λ^(2)) ≤ …

13 Jensen's Inequality and the Q Function
Jensen's inequality: if f is a concave function and X is a r.v., then E[f(X)] ≤ f(E[X]).
Applying it to log P(O|λ) yields the auxiliary Q function Q(λ, λ̄) = Σ_Q P(Q|O, λ) log P(O, Q|λ̄), which is solvable, and it can be proved that Q(λ, λ̄) ≥ Q(λ, λ) implies P(O|λ̄) ≥ P(O|λ).

14 The EM Algorithm
EM: Expectation Maximization
–Why EM?
Simple optimization algorithms for likelihood functions rely on intermediate variables, called latent data. For an HMM, the state sequence is the latent data.
Direct access to the data necessary to estimate the parameters is impossible or difficult. For an HMM, it is almost impossible to estimate (A, B, π) without considering the state sequence.
–Two Major Steps:
E step: compute the expectation of the likelihood by including the latent variables as if they were observed
M step: compute the maximum likelihood estimates of the parameters by maximizing the expected likelihood found in the E step
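The two steps are easiest to see on a model simpler than an HMM. As an illustration only (this example is not in the original slides), here is EM for a 1-D mixture of two Gaussians on synthetic data: the E step computes the posterior responsibilities (the latent data), and the M step re-estimates the parameters from them.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (illustrative only)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.5, 300)])

# Initial guesses for the mixture weights, means and variances
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E step: posterior responsibility of each component for each point
    lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M step: maximum-likelihood estimates given the expected (soft) assignments
    Nk = resp.sum(axis=0)
    w = Nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(w, mu, var)   # should recover roughly (0.4, 0.6), (-2, 3), (1, 2.25)
```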

15 Three Steps for EM
Step 1. Draw a lower bound
–Use Jensen's inequality
Step 2. Find the best lower bound → the auxiliary function
–Let the lower bound touch the objective function at the current guess
Step 3. Maximize the auxiliary function
–Obtain the new guess
–Go to Step 2 until convergence
[Minka 1998]

16 Form an Initial Guess of λ = (A, B, π)
Given the current guess λ′, the goal is to find a new guess λ″ such that log P(O|λ″) ≥ log P(O|λ′).
(Figure: the objective function log P(O|λ) with the current guess marked.)

17 Step 1. Draw a Lower Bound
(Figure: a lower bound function lying everywhere below the objective function.)

18 Step 2. Find the Best Lower Bound
(Figure: among the lower bound functions, the one that touches the objective function at the current guess is the auxiliary function.)

19 Step 3. Maximize the Auxiliary Function
(Figure: the maximizer of the auxiliary function becomes the new guess.)

20 Update the Model
(Figure: the objective function evaluated at the new guess; the model is updated.)

21 Step 2. Find the Best Lower Bound (next iteration)
(Figure: a new auxiliary function touching the objective function at the updated guess.)

22 Step 3. Maximize the Auxiliary Function (next iteration)
(Figure: maximizing the new auxiliary function gives the next guess; the procedure repeats until convergence.)

23 Step 1. Draw a Lower Bound (cont'd)
Apply Jensen's inequality (if f is a concave function and X is a r.v., then E[f(X)] ≤ f(E[X])) to the objective function log P(O|λ) = log Σ_Q P(O, Q|λ), where p(Q) is an arbitrary probability distribution over state sequences. This gives a lower bound function of the objective function, written out below.
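Written out, the lower bound on this slide follows from Jensen's inequality applied to the concave logarithm, for any distribution p(Q) over state sequences:

```latex
\log P(O\mid\lambda)
  = \log \sum_{Q} P(O,Q\mid\lambda)
  = \log \sum_{Q} p(Q)\,\frac{P(O,Q\mid\lambda)}{p(Q)}
  \;\ge\; \sum_{Q} p(Q)\,\log \frac{P(O,Q\mid\lambda)}{p(Q)}.
```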

24 Step 2. Find the Best Lower Bound (cont'd)
–Find the p(Q) that makes the lower bound function touch the objective function at the current guess λ′

25 Step 2. Find the Best Lower Bound (cont'd)
Take the derivative of the lower bound with respect to p(Q), with a Lagrange multiplier for the constraint Σ_Q p(Q) = 1, and set it to zero.

26 Step 2. Find the Best Lower Bound (cont'd)
The solution is p(Q) = P(Q|O, λ′), the posterior probability of the state sequence under the current guess. We can check that with this choice the lower bound equals the objective function at λ′.
Define the Q function: Q(λ′, λ) = Σ_Q P(Q|O, λ′) log P(O, Q|λ).

27 EM for HMM Training
Basic idea
–Assume we have λ and the probability that each Q occurred in the generation of O, i.e., we have in fact observed a complete data pair (O, Q) with frequency proportional to P(O, Q|λ)
–We then find a new λ̄ that maximizes the expectation Σ_Q P(Q|O, λ) log P(O, Q|λ̄)
–It can be guaranteed that log P(O|λ̄) ≥ log P(O|λ)
EM can discover parameters of the model λ that maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, Q|λ).

28 Solution to Problem 3 - The EM Algorithm
The auxiliary function is Q(λ′, λ) = Σ_Q P(Q|O, λ′) log P(O, Q|λ), where
P(O, Q|λ) = π_{q_1} b_{q_1}(o_1) Π_{t=2}^{T} a_{q_{t−1} q_t} b_{q_t}(o_t)
and log P(O, Q|λ) can be expressed as
log P(O, Q|λ) = log π_{q_1} + Σ_{t=2}^{T} log a_{q_{t−1} q_t} + Σ_{t=1}^{T} log b_{q_t}(o_t)

29 Solution to Problem 3 - The EM Algorithm (cont'd)
The auxiliary function can be rewritten as a sum of three terms, each of the form Σ_i w_i log y_i, where the w's (w_i, w_j, w_k in the example) are fixed posterior weights computed from λ′ and the y's (y_i, y_j, y_k) are the parameters π_i, a_ij, b_j(k) to be estimated.

30 Solution to Problem 3 - The EM Algorithm (cont'd)
The auxiliary function is separated into three independent terms, which respectively correspond to π, A, and B
–Maximization of Q(λ′, λ) can therefore be done by maximizing the individual terms separately, subject to probability constraints
–All these terms have the following form: F(y) = Σ_i w_i log y_i, with Σ_i y_i = 1 and w_i ≥ 0

31 Solution to Problem 3 - The EM Algorithm (cont'd)
F(y) = Σ_i w_i log y_i, subject to the constraint Σ_i y_i = 1, is maximized at y_i = w_i / Σ_j w_j
Proof: apply a Lagrange multiplier to the constraint (written out below).
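The omitted algebra of the proof, written out:

```latex
\mathcal{L}(y,\epsilon) = \sum_i w_i \log y_i + \epsilon\Big(\sum_i y_i - 1\Big),\qquad
\frac{\partial \mathcal{L}}{\partial y_i} = \frac{w_i}{y_i} + \epsilon = 0
\;\Rightarrow\; y_i = -\frac{w_i}{\epsilon}
\;\Rightarrow\; y_i = \frac{w_i}{\sum_j w_j}\ \ \text{(using } \textstyle\sum_i y_i = 1\text{)}.
```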

32 Solution to Problem 3 - The EM Algorithm (cont'd)
Applying the result to the π term (w_i ↔ the posterior weight of starting in state i, y_i ↔ π_i) gives π̄_i = γ_1(i).

33 Solution to Problem 3 - The EM Algorithm (cont'd)
Applying the result to the A term (w_j ↔ the expected number of transitions from state i to state j, y_j ↔ a_ij) gives ā_ij = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i).

34 Solution to Problem 3 - The EM Algorithm (cont'd)
Applying the result to the B term (w_k ↔ the expected number of times symbol v_k is emitted in state j, y_k ↔ b_j(k)) gives b̄_j(k) = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j).

35 Solution to Problem 3 - The EM Algorithm (cont'd)
The new model parameter set λ̄ = (Ā, B̄, π̄) can be expressed as:
π̄_i = γ_1(i)
ā_ij = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i)
b̄_j(k) = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)

36 Discrete vs. Continuous Density HMMs
Two major types of HMMs according to the observations:
–Discrete and finite observations: the observations that all distinct states generate are finite in number, i.e., V = {v_1, v_2, v_3, …, v_M}, v_k ∈ R^L
In this case, the observation probability distribution in state j, B = {b_j(k)}, is defined as b_j(k) = P(o_t = v_k | q_t = j), 1 ≤ k ≤ M, 1 ≤ j ≤ N, where o_t is the observation at time t and q_t is the state at time t
→ b_j(k) consists of only M probability values
–Continuous and infinite observations: the observations that all distinct states generate are infinite and continuous, i.e., V = {v | v ∈ R^L}
In this case, the observation probability distribution in state j, B = {b_j(v)}, is defined as b_j(v) = f(o_t = v | q_t = j), 1 ≤ j ≤ N
→ b_j(v) is a continuous probability density function (pdf) and is often a mixture of multivariate Gaussian (normal) distributions

37 Gaussian Distribution
A continuous random variable X is said to have a Gaussian distribution with mean μ and variance σ² (σ > 0) if X has a continuous pdf of the following form:
f(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))

38 Multivariate Gaussian Distribution
If X = (X_1, X_2, X_3, …, X_d) is a d-dimensional random vector with a multivariate Gaussian distribution with mean vector μ and covariance matrix Σ, then the pdf can be expressed as
f(x) = (2π)^{−d/2} |Σ|^{−1/2} exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
If X_1, X_2, X_3, …, X_d are independent random variables, the covariance matrix reduces to a diagonal matrix, i.e., Σ = diag(σ_1², σ_2², …, σ_d²)

39 Multivariate Mixture Gaussian Distribution
A d-dimensional random vector X = (X_1, X_2, X_3, …, X_d) has a multivariate mixture Gaussian distribution if its pdf is a weighted sum of multivariate Gaussians, f(x) = Σ_{k=1}^{M} c_k N(x; μ_k, Σ_k) with Σ_{k=1}^{M} c_k = 1
In a CDHMM, b_j(v) is a continuous probability density function (pdf) and is often a mixture of multivariate Gaussian distributions:
b_j(v) = Σ_{k=1}^{M} c_jk N(v; μ_jk, Σ_jk)
where v is the observation vector, μ_jk is the mean vector of the k-th mixture of the j-th state, and Σ_jk is the covariance matrix of the k-th mixture of the j-th state
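A NumPy sketch of evaluating such a state density b_j(v) for the common diagonal-covariance case; the mixture weights, means and variances below are hypothetical values, not taken from the lecture.

```python
import numpy as np

def gmm_pdf(v, weights, means, variances):
    """b_j(v) = sum_k c_jk N(v; mu_jk, Sigma_jk) with diagonal covariances.

    v: (d,) observation; weights: (K,); means: (K, d); variances: (K, d) diagonal entries.
    """
    diff = v - means                                             # (K, d)
    exponent = -0.5 * np.sum(diff ** 2 / variances, axis=1)      # diagonal Mahalanobis term
    norm = np.sqrt((2 * np.pi) ** means.shape[1] * np.prod(variances, axis=1))
    return float(np.sum(weights * np.exp(exponent) / norm))

# Hypothetical 2-mixture, 3-dimensional state density
weights   = np.array([0.4, 0.6])
means     = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, -1.0]])
variances = np.array([[1.0, 1.0, 1.0], [0.5, 2.0, 1.0]])
print(gmm_pdf(np.array([0.5, 1.0, 0.0]), weights, means, variances))
```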

40 Solution to Problem 3 – The Segmental K-means Algorithm for CDHMM
Assume that we have a training set of observations and an initial estimate of the model parameters.
–Step 1: Segment the training data. The set of training observation sequences is segmented into states, based on the current model, by the Viterbi algorithm.
–Step 2: Re-estimate the model parameters.
–Step 3: Evaluate the model. If the difference between the new and current model scores exceeds a threshold, go back to Step 1; otherwise, return.

41 Solution to Problem 3 – The Segmental K-means Algorithm for CDHMM (cont'd)
(Figure: an example with 3 states and 4 Gaussian mixtures per state. Each observation o_t in the trellis is assigned to one of the states s_1, s_2, s_3 by Viterbi segmentation; the frames of a state are then split by K-means, starting from the global mean, into clusters whose means and covariances give the mixture parameters {μ_11, Σ_11, c_11}, {μ_12, Σ_12, c_12}, {μ_13, Σ_13, c_13}, {μ_14, Σ_14, c_14} for that state.)

42 Solution to Problem 3 – The Baum-Welch Algorithm for CDHMM
Define a new variable γ_t(j, k)
–Probability of being in state j at time t with the k-th mixture component accounting for o_t, given O and λ:
γ_t(j, k) = [α_t(j) β_t(j) / Σ_{i=1}^{N} α_t(i) β_t(i)] · [c_jk N(o_t; μ_jk, Σ_jk) / Σ_{m=1}^{M} c_jm N(o_t; μ_jm, Σ_jm)]
(Observation-independent assumption.)

43 Solution to Problem 3 – The Baum-Welch Algorithm for CDHMM (cont'd)
Re-estimation formulae for c_jk, μ_jk, and Σ_jk are
c̄_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j, m)
μ̄_jk = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)
Σ̄_jk = Σ_{t=1}^{T} γ_t(j, k) (o_t − μ̄_jk)(o_t − μ̄_jk)ᵀ / Σ_{t=1}^{T} γ_t(j, k)
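A NumPy sketch of the corresponding M-step for diagonal-covariance mixtures; `gamma` and `comp_post` are assumed to come from a forward-backward pass over a CDHMM (hypothetical inputs, named here for illustration), with γ_t(j, k) = gamma[t, j] · comp_post[t, j, k].

```python
import numpy as np

def cdhmm_mstep(gamma, comp_post, O):
    """Re-estimate mixture weights, means and diagonal covariances of a CDHMM.

    gamma:     (T, N)    state posteriors gamma_t(j)
    comp_post: (T, N, K) per-state mixture posteriors
    O:         (T, d)    observation vectors
    """
    gamma_jk = gamma[:, :, None] * comp_post                        # gamma_t(j, k), shape (T, N, K)
    denom = gamma_jk.sum(axis=0)                                    # sum_t gamma_t(j, k), shape (N, K)
    c = denom / gamma.sum(axis=0)[:, None]                          # c_jk
    mu = np.einsum('tjk,td->jkd', gamma_jk, O) / denom[:, :, None]  # mean vectors mu_jk
    diff = O[:, None, None, :] - mu[None]                           # (T, N, K, d)
    var = np.einsum('tjk,tjkd->jkd', gamma_jk, diff ** 2) / denom[:, :, None]  # diagonal covariances
    return c, mu, var
```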

44 A Simple Example: The Forward/Backward Procedure
(Figure: a trellis with two states S_1 and S_2 over three time steps, with observations o_1, o_2, o_3.)

45 A Simple Example (cont'd)
With 2 states and 3 observations there are 2³ = 8 possible state sequences q in total (q = (1, 1, 1), (1, 1, 2), …, (2, 2, 2)); P(O|λ) is the sum of P(O, Q|λ) over all of them.
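The brute-force view of this example can be reproduced directly by enumerating all 2³ = 8 paths for a toy 2-state model (made-up numbers, not the lecture's); the sum over paths equals the forward algorithm's P(O|λ).

```python
import numpy as np
from itertools import product

A  = np.array([[0.6, 0.4], [0.3, 0.7]])   # toy model, illustrative numbers only
B  = np.array([[0.7, 0.3], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 1]

total = 0.0
for q in product(range(2), repeat=len(obs)):          # all 8 state paths q = (q1, q2, q3)
    p = pi[q[0]] * B[q[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[q[t - 1], q[t]] * B[q[t], obs[t]]
    print(q, p)                                        # joint probability P(O, Q | lambda)
    total += p
print("P(O|lambda) =", total)
```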

46 A Simple Example (cont'd)