Introduction to EM algorithm


Introduction to EM algorithm
LING 572, Fei Xia, 02/23/06

What is EM? EM stands for "expectation maximization". It is a parameter estimation method that falls into the general framework of maximum-likelihood estimation (MLE). The general form was given in (Dempster, Laird, and Rubin, 1977), although the essence of the algorithm had appeared previously in various forms.

Outline MLE EM: basic concepts

MLE

What is MLE? Given a sample X = {X1, …, Xn} and a vector of parameters θ, we define the likelihood of the data, P(X | θ), and the log-likelihood of the data, L(θ) = log P(X | θ). Given X, find θML = argmaxθ L(θ).

MLE (cont) Often we assume that the Xi are independent and identically distributed (i.i.d.), so the log-likelihood decomposes as L(θ) = Σi log p(Xi | θ). Depending on the form of p(x | θ), solving this optimization problem can be easy or hard.
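
As a small illustration of this decomposition (not from the slides), the Python sketch below computes L(θ) = Σi log p(Xi | θ) for an i.i.d. sample; the function names and the Bernoulli coin model are my own choices.

```python
import math

def log_likelihood(xs, theta, pmf):
    """L(theta) = sum_i log p(x_i | theta) for i.i.d. observations xs."""
    return sum(math.log(pmf(x, theta)) for x in xs)

# Illustrative model: a Bernoulli coin, theta = probability of heads.
def bernoulli(x, theta):
    return theta if x == "H" else 1.0 - theta

print(log_likelihood(["H", "T", "H", "H"], 0.6, bernoulli))
```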

An easy case Assume a coin has probability p of coming up heads and 1-p of coming up tails. Observation: we toss the coin N times; the result is a sequence of Hs and Ts containing m Hs. What is the MLE of p, given this observation?

An easy case (cont) p = m/N. (Maximizing L(p) = m log p + (N - m) log(1 - p): setting dL/dp = m/p - (N - m)/(1 - p) = 0 gives p = m/N.)
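
A quick numeric sanity check of this result, with made-up counts (N = 10, m = 7): evaluating the Bernoulli log-likelihood on a grid should put the maximizer at m/N.

```python
import math

N, m = 10, 7   # hypothetical counts: 10 tosses, 7 heads

def log_lik(p):
    # log P(observations | p) = m log p + (N - m) log(1 - p)
    return m * math.log(p) + (N - m) * math.log(1.0 - p)

# Evaluate L(p) on a grid; the maximizer should be m/N = 0.7.
candidates = [i / 1000.0 for i in range(1, 1000)]
p_hat = max(candidates, key=log_lik)
print(p_hat)   # 0.7
```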

EM: basic concepts

Basic setting in EM X is a set of data points: the observed data. θ is a parameter vector. EM is a method to find θML = argmaxθ P(X | θ) in settings where calculating P(X | θ) directly is hard, but calculating P(X, Y | θ) is much simpler; here Y is "hidden" data (or "missing" data).
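
The two quantities are connected by marginalization: P(X | θ) = ΣY P(X, Y | θ). As an illustrative sketch (not from the slides), the code below uses the two-coin setup that appears in the "Coin toss" column of the examples slide further down (parameters λ, p1, p2; hidden coin ids); the function names are my own.

```python
import itertools

def complete_data_prob(x_seq, y_seq, lam, p1, p2):
    """P(X, Y | theta) for a two-coin mixture: for each toss, the hidden
    y in {1, 2} picks a coin (with prob. lam / 1 - lam), which then lands
    heads with prob. p1 or p2.  Each x is 'H' or 'T'."""
    prob = 1.0
    for x, y in zip(x_seq, y_seq):
        pick = lam if y == 1 else 1.0 - lam
        p_head = p1 if y == 1 else p2
        prob *= pick * (p_head if x == 'H' else 1.0 - p_head)
    return prob

def observed_data_prob(x_seq, lam, p1, p2):
    """P(X | theta) = sum over all hidden coin-id sequences Y of P(X, Y | theta)."""
    return sum(complete_data_prob(x_seq, y_seq, lam, p1, p2)
               for y_seq in itertools.product([1, 2], repeat=len(x_seq)))

print(observed_data_prob("HHT", lam=0.4, p1=0.9, p2=0.3))
```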

The basic EM strategy Z = (X, Y) Z: complete data (“augmented data”) X: observed data (“incomplete” data) Y: hidden data (“missing” data)

The "missing" data Y Y need not be missing in the practical sense of the word. It may just be a conceptually convenient technical device that simplifies the calculation of P(X | θ). There can be many possible choices of Y.

Examples of EM

                 HMM                 PCFG               MT                          Coin toss
X (observed)     sentences           sentences          parallel data               head-tail sequences
Y (hidden)       state sequences     parse trees        word alignment              coin id sequences
θ                aij, bijk           P(A→BC)            t(f|e), d(aj|j, l, m), …    p1, p2, λ
Algorithm        forward-backward    inside-outside     IBM Models                  N/A

The EM algorithm 1. Consider a set of starting parameters. 2. Use these to "estimate" the missing data. 3. Use the "complete" data to update the parameters. 4. Repeat steps 2-3 until convergence.
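
The loop can be written as a generic skeleton. This is only a sketch, assuming the caller supplies an e_step that "estimates" the missing data (expected sufficient statistics), an m_step that re-estimates θ from them, and a log_lik function for monitoring convergence; none of these names come from the slides.

```python
def em(theta0, e_step, m_step, log_lik, max_iter=100, tol=1e-6):
    """Generic EM loop: alternate E- and M-steps until the
    observed-data log-likelihood stops improving."""
    theta = theta0
    prev = log_lik(theta)
    for _ in range(max_iter):
        stats = e_step(theta)      # "estimate" the missing data
        theta = m_step(stats)      # update parameters from the "complete" data
        cur = log_lik(theta)
        if cur - prev < tol:       # EM never decreases the likelihood
            break
        prev = cur
    return theta
```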

EM is a general algorithm for missing-data problems; it requires "specialization" to the problem at hand. Examples of EM: the forward-backward algorithm for HMMs, the inside-outside algorithm for PCFGs, EM in the IBM MT models.

Strengths of EM Numerical stability: every iteration of the EM algorithm increases (more precisely, never decreases) the likelihood of the observed data. EM handles parameter constraints gracefully.

Problems with EM Convergence can be very slow on some problems and is intimately related to the amount of missing information. It is guaranteed to improve the probability of the training corpus, which is different from reducing errors directly. It is not guaranteed to reach the global maximum (it can get stuck at local maxima, saddle points, etc.), so the initial values are important.

Additional slides

Setting for the EM algorithm The problem is simpler to solve for complete data: maximum-likelihood estimates can then be calculated using standard methods. For example, estimates of mixture parameters could be obtained in a straightforward manner if the origin (group) of each observation were known.
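
For concreteness, here is a minimal sketch of those complete-data estimates for a 1-D mixture, assuming the group label of every observation is known; the function name and toy numbers are made up.

```python
def complete_data_mle(xs, groups, k):
    """ML estimates when each observation's group (origin) is known:
    per-group mixing proportion, mean, and variance."""
    pis, mus, vars_ = [], [], []
    for j in range(k):
        members = [x for x, g in zip(xs, groups) if g == j]
        nj = len(members)                      # assumes every group is non-empty
        mu = sum(members) / nj
        pis.append(nj / len(xs))
        mus.append(mu)
        vars_.append(sum((x - mu) ** 2 for x in members) / nj)
    return pis, mus, vars_

print(complete_data_mle([1.0, 1.2, 4.8, 5.1], [0, 0, 1, 1], 2))
```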

EM algorithm for mixtures 1. "Guesstimate" starting parameters. 2. E-step: use Bayes' theorem to calculate group-assignment probabilities. 3. M-step: update the parameters using the estimated assignments. 4. Repeat steps 2 and 3 until the likelihood is stable.
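
A minimal sketch of steps 2-4 for a 1-D Gaussian mixture, assuming the "guesstimated" starting parameters are supplied (the initialization slide below gives one recipe); the function names are mine and the code is illustrative, not the course implementation.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_mixture_1d(xs, pis, mus, vars_, n_iter=50):
    """EM for a 1-D Gaussian mixture, starting from the given
    mixing proportions, means, and variances."""
    n, k = len(xs), len(pis)
    for _ in range(n_iter):
        # E-step: Bayes' theorem gives posterior (fractional) group assignments
        resp = []
        for x in xs:
            joint = [pis[j] * normal_pdf(x, mus[j], vars_[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: re-estimate parameters from the fractional assignments
        for j in range(k):
            nj = sum(r[j] for r in resp)
            pis[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            vars_[j] = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
    return pis, mus, vars_
```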

Filling in the missing data The missing data are the group assignments for each observation. The complete data are generated by assigning observations to groups probabilistically: we will use "fractional" assignments.

Picking starting parameters Mixing proportions: assumed equal. Means: pick one observation as each group's mean. Variances: use the overall variance for every group.
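
A sketch of this recipe in the 1-D mixture setting of the earlier em_mixture_1d example; the function name is hypothetical.

```python
def init_mixture_1d(xs, k):
    """Starting parameters as on the slide: equal mixing proportions,
    one observation per group as its mean, overall variance for all groups."""
    n = len(xs)
    pis = [1.0 / k] * k
    mus = [xs[i * n // k] for i in range(k)]   # pick k observations as group means
    mean = sum(xs) / n
    overall_var = sum((x - mean) ** 2 for x in xs) / n
    vars_ = [overall_var] * k
    return pis, mus, vars_

# Usage (assumes em_mixture_1d from the earlier sketch and some data xs):
# pis, mus, vars_ = init_mixture_1d(xs, 2)
# print(em_mixture_1d(xs, pis, mus, vars_))
```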