The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010.

What is EM? EM stands for “expectation maximization”. It is a parameter estimation method that falls into the general framework of maximum-likelihood estimation (MLE). The general form was given in (Dempster, Laird, and Rubin, 1977), although the essence of the algorithm had appeared earlier in various forms.

Outline MLE EM –Basic concepts –Main ideas of EM Additional slides: –EM for PM models –An example: Forward-backward algorithm

MLE

What is MLE? Given –A sample X = {X_1, …, X_n} –A vector of parameters θ We define –Likelihood of the data: P(X | θ) –Log-likelihood of the data: L(θ) = log P(X | θ) Given X, find θ_ML = argmax_θ L(θ)

MLE (cont) Often we assume that the X_i are independent and identically distributed (i.i.d.). Depending on the form of P(X | θ), solving this optimization problem can be easy or hard.

An easy case Assuming –A coin has a probability p of being heads, 1-p of being tails. –Observation: we toss the coin N times, and the result is a sequence of Hs and Ts with m Hs. What is the value of p based on MLE, given the observation?

An easy case (cont)** p = m/N
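
For completeness, here is the short derivation behind this result (a sketch, not in the original slides):
\begin{aligned}
L(p) &= \log\big[p^{m}(1-p)^{N-m}\big] = m\log p + (N-m)\log(1-p) \\
\frac{dL}{dp} &= \frac{m}{p} - \frac{N-m}{1-p} = 0 \quad\Longrightarrow\quad \hat{p} = \frac{m}{N}
\end{aligned}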

EM: basic concepts

Basic setting in EM X is a set of data points: observed data. θ is a parameter vector. EM is a method to find θ_ML = argmax_θ L(θ) = argmax_θ log P(X | θ). Calculating P(X | θ) directly is hard. Calculating P(X, Y | θ) is much simpler, where Y is “hidden” data (or “missing” data).

The basic EM strategy Z = (X, Y) –Z: complete data (“augmented data”) –X: observed data (“incomplete” data) –Y: hidden data (“missing” data)

The “missing” data Y Y need not necessarily be missing in the practical sense of the word. It may just be a conceptually convenient technical device to simplify the calculation of P(X | θ). There could be many possible Ys.

Examples of EM

             | HMM              | PCFG           | MT                            | Coin toss
X (observed) | sentences        | sentences      | parallel data                 | head-tail sequences
Y (hidden)   | state sequences  | parse trees    | word alignments               | coin id sequences
θ            | a_ij, b_ijk      | P(A → BC)      | t(f | e), d(a_j | j, l, m), … | p1, p2, λ
Algorithm    | forward-backward | inside-outside | IBM Models                    | N/A

The EM algorithm Consider a set of starting parameters Use these to “estimate” the missing data Use “complete” data to update parameters Repeat until convergence

Highlights General algorithm for missing data problems Requires “specialization” to the problem at hand Examples of EM: –Forward-backward algorithm for HMM –Inside-outside algorithm for PCFG –EM in IBM MT Models

EM: main ideas

Idea #1: find the θ that maximizes the likelihood of the training data: θ_ML = argmax_θ L(θ)

Idea #2: find the θ_t sequence There is no analytical solution → iterative approach: find θ_{t+1} s.t. L(θ_{t+1}) ≥ L(θ_t)

Idea #3: find the θ_{t+1} that maximizes a tight lower bound of the log-likelihood L(θ)

Idea #4: find the θ_{t+1} that maximizes the Q function**: θ_{t+1} = argmax_θ Q(θ, θ_t). Up to a term that does not depend on θ, the Q function is a lower bound of the log-likelihood.

The Q-function** Define the Q-function (a function of θ): Q(θ, θ_t) = E_{Y | X, θ_t}[log P(X, Y | θ)] = Σ_Y P(Y | X, θ_t) log P(X, Y | θ) –θ_t is the current parameter estimate and is a constant (vector). –θ is the normal variable (vector) that we wish to adjust. The Q-function is the expected value of the complete-data log-likelihood log P(X, Y | θ) with respect to Y, given X and θ_t.

Calculating model expectation in MaxEnt

The EM algorithm Start with an initial estimate θ_0. Repeat until convergence: –E-step: calculate Q(θ, θ_t) –M-step: find θ_{t+1} = argmax_θ Q(θ, θ_t)
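
To make the loop concrete, here is a minimal generic EM skeleton in Python; e_step and m_step are hypothetical placeholders for the model-specific computations, not functions from the slides:

def em(x, theta0, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM skeleton: alternate E- and M-steps until the log-likelihood stops improving."""
    theta, prev_ll = theta0, float("-inf")
    for _ in range(max_iter):
        # E-step: expected sufficient statistics of the hidden data Y
        # (and the log-likelihood) given X and the current parameters.
        stats, ll = e_step(x, theta)
        # M-step: parameters that maximize the Q function, i.e. the
        # expected complete-data log-likelihood.
        theta = m_step(stats)
        if ll - prev_ll < tol:       # convergence check on the log-likelihood
            break
        prev_ll = ll
    return theta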

Important classes of EM problems Products of multinomial (PM) models Exponential families Gaussian mixtures …

Summary EM falls into the general framework of maximum-likelihood estimation (MLE). Calculating P(X | θ) directly is hard → introduce Y (“hidden” data) and work with P(X, Y | θ) instead. Start with an initial estimate and repeat the E-step and M-step until convergence. Closed-form solutions exist for some classes of models.

Strengths of EM Numerical stability: in every iteration of the EM algorithm, the likelihood of the observed data never decreases. EM handles parameter constraints gracefully.

Problems with EM Convergence can be very slow on some problems and is intimately related to the amount of missing information. EM is guaranteed to improve the likelihood of the training corpus, which is different from reducing errors directly. It is not guaranteed to reach a global maximum (it can get stuck at local maxima, saddle points, etc.) → the initial estimate is important.

Additional slides

The EM algorithm for PM models

PM models In a product-of-multinomials (PM) model, the probability of a (complete) sample point is a product of parameters raised to their counts: P(x, y | θ) = ∏_j ∏_k θ_{j,k}^{c_{j,k}(x,y)}, where {θ_1, …, θ_s} is a partition of all the parameters and, for any j, Σ_k θ_{j,k} = 1.

HMM is a PM

PCFG Each sample point is a pair (x, y): –x is a sentence –y is a possible parse tree for that sentence.

PCFG is a PM

Q-function for PM

Maximizing the Q function Maximize Q(θ, θ_t) subject to the constraints Σ_k θ_{j,k} = 1 for every j. Use Lagrange multipliers.
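
A sketch of this constrained maximization, assuming the Q function reduces (as it does for PM models) to a weighted sum of log-parameters whose weights c̄_{j,k} are the expected counts:
\begin{aligned}
&\max_{\theta}\ \sum_{j,k}\bar{c}_{j,k}\log\theta_{j,k}
 \quad\text{s.t.}\quad \sum_{k}\theta_{j,k}=1 \ \text{for all } j \\
&\frac{\partial}{\partial\theta_{j,k}}\Big[\sum_{j',k'}\bar{c}_{j',k'}\log\theta_{j',k'}
 -\sum_{j'}\lambda_{j'}\Big(\sum_{k'}\theta_{j',k'}-1\Big)\Big]
 =\frac{\bar{c}_{j,k}}{\theta_{j,k}}-\lambda_{j}=0 \\
&\Longrightarrow\quad \theta_{j,k}=\frac{\bar{c}_{j,k}}{\sum_{k'}\bar{c}_{j,k'}}
\end{aligned}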

Optimal solution θ_{j,k} = c̄_{j,k} / Σ_{k'} c̄_{j,k'}: the expected count divided by a normalization factor.

PCFG example Calculate expected counts Update parameters

EM: An example

Forward-backward algorithm HMM is a PM (product of multinomials) model. The forward-backward algorithm is a special case of the EM algorithm for PM models. X (observed data): each data point is an observation sequence o_1 … o_T. Y (hidden data): a state sequence X_1 … X_{T+1}. θ (parameters): a_ij, b_ijk, π_i.

Expected counts

Expected counts (cont)

The inner loop for the forward-backward algorithm Given an input sequence and the current parameter estimate:
1. Calculate the forward probabilities (base case, then recursive case).
2. Calculate the backward probabilities (base case, then recursive case).
3. Calculate the expected counts.
4. Update the parameters.

HMM An HMM is a tuple (S, Σ, π, A, B): –A set of states S = {s_1, s_2, …, s_N}. –A set of output symbols Σ = {w_1, …, w_M}. –Initial state probabilities π = {π_i}. –State transition probabilities A = {a_ij}. –Symbol emission probabilities B = {b_ijk}. State sequence: X_1 … X_{T+1} Output sequence: o_1 … o_T
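
For concreteness, the numpy sketches below assume this array layout for the HMM parameters (the layout itself is an assumption, chosen to match the a_ij / b_ijk notation above):

import numpy as np
N, M = 3, 5                       # number of states, number of output symbols
pi = np.full(N, 1.0 / N)          # initial state probabilities pi_i
a = np.full((N, N), 1.0 / N)      # transition probabilities a_ij
b = np.full((N, N, M), 1.0 / M)   # arc-emission probabilities b_ijk (emit w_k on arc i -> j)
obs = [0, 3, 2, 4]                # an output sequence o_1 .. o_T as symbol indices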

Forward probability The probability of producing o_1 … o_{t-1} while ending up in state s_i: α_i(t) = P(o_1 … o_{t-1}, X_t = s_i | θ)

Calculating forward probability Initialization: α_i(1) = π_i Induction: α_j(t+1) = Σ_i α_i(t) a_ij b_{ij o_t}
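
A minimal numpy sketch of this forward pass, using the array layout sketched above; the indexing convention alpha[t, i] = α_i(t+1) is an assumption of the sketch, not from the slides:

import numpy as np

def forward(pi, a, b, obs):
    """Forward pass for an arc-emission HMM.
    Returns alpha with alpha[t, i] = P(o_1..o_t, X_{t+1} = s_i)."""
    T, n = len(obs), len(pi)
    alpha = np.zeros((T + 1, n))
    alpha[0] = pi                          # base case: alpha_i(1) = pi_i
    for t, o in enumerate(obs):            # induction over the observations
        # slide recursion: alpha_j(t+1) = sum_i alpha_i(t) a_ij b_{ij o_t}
        alpha[t + 1] = alpha[t] @ (a * b[:, :, o])
    return alpha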

Backward probability The probability of producing the sequence o_t … o_T, given that at time t we are in state s_i: β_i(t) = P(o_t … o_T | X_t = s_i, θ)

Calculating backward probability Initialization: β_i(T+1) = 1 Induction: β_i(t) = Σ_j a_ij b_{ij o_t} β_j(t+1)
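
The matching backward pass, again under the assumed layout; beta[t, i] corresponds to β_i(t+1), with beta[T] as the all-ones base case:

import numpy as np

def backward(a, b, obs):
    """Backward pass for an arc-emission HMM.
    Returns beta with beta[t, i] = P(o_{t+1}..o_T | X_{t+1} = s_i)."""
    T, n = len(obs), a.shape[0]
    beta = np.ones((T + 1, n))             # base case: beta_i(T+1) = 1
    for t in range(T - 1, -1, -1):         # induction, backwards in time
        # slide recursion: beta_i(t) = sum_j a_ij b_{ij o_t} beta_j(t+1)
        beta[t] = (a * b[:, :, obs[t]]) @ beta[t + 1]
    return beta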

Calculating the probability of the observation P(O | θ) = Σ_i α_i(T+1) = Σ_i π_i β_i(1) = Σ_i α_i(t) β_i(t) for any t
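
Putting the two passes together: given the alpha and beta tables from the hypothetical sketches above, P(O | θ) can be read off directly (a usage sketch):

import numpy as np

def observation_prob(alpha, beta):
    """P(O | theta) from the alpha/beta tables of the sketches above."""
    per_t = (alpha * beta).sum(axis=1)     # sum_i alpha_i(t) beta_i(t)
    assert np.allclose(per_t, per_t[0])    # sanity check: identical for every t
    return alpha[-1].sum()                 # = sum_i alpha_i(T+1)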

Estimating parameters The probability of traversing a certain arc at time t, given O (denoted by p_t(i, j) in M&S): p_t(i, j) = P(X_t = s_i, X_{t+1} = s_j | O, θ) = α_i(t) a_ij b_{ij o_t} β_j(t+1) / P(O | θ)

The probability of being in state i at time t, given O: γ_i(t) = P(X_t = s_i | O, θ) = α_i(t) β_i(t) / P(O | θ) = Σ_j p_t(i, j)

Expected counts Sum over the time index: Expected # of transitions from state i to j in O: Σ_{t=1}^{T} p_t(i, j) Expected # of transitions from state i in O: Σ_{t=1}^{T} γ_i(t) = Σ_{t=1}^{T} Σ_j p_t(i, j)
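
A numpy sketch of these expected counts, given the alpha/beta tables produced by the sketches above (same assumed layout and indexing):

import numpy as np

def expected_counts(a, b, obs, alpha, beta):
    """Expected counts for one observation sequence O."""
    T, n = len(obs), alpha.shape[1]
    prob_obs = alpha[-1].sum()                 # P(O | theta)
    xi = np.zeros((T, n, n))                   # xi[t, i, j] ~ p_{t+1}(i, j)
    for t, o in enumerate(obs):
        # prob of traversing arc i -> j (emitting o_{t+1}) at time t+1, given O
        xi[t] = alpha[t][:, None] * a * b[:, :, o] * beta[t + 1][None, :] / prob_obs
    gamma = alpha * beta / prob_obs            # gamma[t, i] ~ gamma_i(t+1)
    exp_trans = xi.sum(axis=0)                 # expected # of i -> j transitions in O
    exp_from = exp_trans.sum(axis=1)           # expected # of transitions from i in O
    return xi, gamma, exp_trans, exp_from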

Update parameters
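
A sketch of the re-estimation step from those expected counts (single-sequence case; for multiple sequences the counts would first be summed across sequences):

import numpy as np

def update_parameters(xi, gamma, obs, n_symbols):
    """M-step: new pi, a, b from the expected counts of one sequence."""
    new_pi = gamma[0]                                   # pi_i = gamma_i(1)
    exp_trans = xi.sum(axis=0)                          # expected i -> j transitions
    new_a = exp_trans / exp_trans.sum(axis=1, keepdims=True)
    new_b = np.zeros(xi.shape[1:] + (n_symbols,))
    for t, o in enumerate(obs):                         # collect emission counts per symbol
        new_b[:, :, o] += xi[t]
    new_b /= exp_trans[:, :, None]
    return new_pi, new_a, new_b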

Final formulae π_i = γ_i(1) a_ij = Σ_{t=1}^{T} p_t(i, j) / Σ_{t=1}^{T} γ_i(t) b_ijk = Σ_{t: o_t = w_k} p_t(i, j) / Σ_{t=1}^{T} p_t(i, j)

Summary The EM algorithm –An iterative approach –L(θ) is non-decreasing at each iteration –Optimal solution in M-step exists for many classes of problems. The EM algorithm for PM models –Simpler formulae –Three special cases Inside-outside algorithm Forward-backward algorithm IBM Models for MT

Relations among the algorithms (from general to specific): the generalized EM ⊇ the EM algorithm ⊇ {EM for PM models, Gaussian mixtures}; EM for PM models ⊇ {inside-outside, forward-backward, IBM models}.

The EM algorithm for PM models (pseudocode outline):
// for each iteration
//   for each training example x_i
//     for each possible y
//       for each parameter: accumulate its expected count for (x_i, y)
//   renormalize the expected counts to get the new parameter values
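
A minimal Python sketch of this loop for a generic PM model with enumerable hidden values; possible_ys and count_fn are hypothetical helpers (they would enumerate the hidden data and return the parameter counts c_{j,k}(x, y)), not part of the original slides:

from collections import defaultdict

def em_for_pm(data, theta, possible_ys, count_fn, n_iter=10):
    """EM for a product-of-multinomials model: theta maps (j, k) to a
    parameter value, with sum_k theta[(j, k)] = 1 for each block j."""
    for _ in range(n_iter):                              # for each iteration
        expected = defaultdict(float)
        for x in data:                                   # for each training example x_i
            weights = {}
            for y in possible_ys(x):                     # for each possible y
                w = 1.0
                for jk, c in count_fn(x, y).items():
                    w *= theta[jk] ** c                  # PM likelihood of (x, y)
                weights[y] = w
            z = sum(weights.values())                    # P(y | x) = weights[y] / z
            for y, w in weights.items():
                for jk, c in count_fn(x, y).items():     # for each parameter
                    expected[jk] += (w / z) * c          # accumulate expected counts
        totals = defaultdict(float)                      # M-step: renormalize within each block j
        for (j, k), c in expected.items():
            totals[j] += c
        theta = {(j, k): c / totals[j] for (j, k), c in expected.items()}
    return theta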