Expectation Maximization Algorithm

Expectation Maximization Algorithm Rong Jin

A Mixture Model Problem Apparently, the dataset consists of two modes. How can we automatically identify the two modes?

Gaussian Mixture Model (GMM) Assume that the dataset is generated by a mixture of two Gaussian distributions (Gaussian model 1 and Gaussian model 2). If we knew the membership of each bin, estimating the two Gaussian models would be easy. How can we estimate the two Gaussian models without knowing the memberships of the bins?
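The per-component formulas on this slide are images and do not appear in the transcript. For reference, a two-component mixture of this kind is usually written as (with mixing weights $\pi_1,\pi_2$ introduced here)
\[
p(x) \;=\; \pi_1\,\mathcal{N}(x;\mu_1,\sigma_1^2) \;+\; \pi_2\,\mathcal{N}(x;\mu_2,\sigma_2^2),
\qquad \pi_1+\pi_2=1,
\]
where $\mathcal{N}(x;\mu_k,\sigma_k^2)=\frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\!\big(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\big)$.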

EM Algorithm for GMM Treat the memberships as hidden variables. The EM algorithm for the Gaussian mixture model alternates between the unknown memberships and the unknown Gaussian models, learning these two sets of parameters iteratively.

Start with A Random Guess Randomly assign the memberships to each bin.

Start with A Random Guess Randomly assign the memberships to each bin, then estimate the mean and variance of each Gaussian model.

E-step Fix the two Gaussian models and estimate the posterior membership for each data point.

EM Algorithm for GMM Re-estimate the memberships for each bin

M-Step Fix the memberships and re-estimate the two Gaussian models, with each data point weighted by its posterior.

EM Algorithm for GMM Re-estimate the memberships for each bin, then re-estimate the models.
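The update equations on these slides are likewise images; the sketch below shows what the alternation can look like for a one-dimensional, two-component GMM. All names (em_gmm_2comp, pi, mu, sigma2, the responsibility matrix r) are illustrative choices, not taken from the slides.

```python
import numpy as np

def em_gmm_2comp(x, iters=100, seed=0):
    """EM for a 1-D mixture of two Gaussians (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # start with a random guess: random memberships for each point
    r = rng.uniform(size=(len(x), 2))
    r /= r.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # M-step: re-estimate mixing weights, means, variances (posterior-weighted)
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma2 = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        # E-step: posterior membership of each point under the current models
        lik = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        r = lik / lik.sum(axis=1, keepdims=True)
    return pi, mu, sigma2

# toy data with two modes
x = np.concatenate([np.random.normal(-2.0, 1.0, 300), np.random.normal(3.0, 0.5, 200)])
print(em_gmm_2comp(x))
```

As in the slides, the loop starts from random memberships, re-estimates the models from the posterior-weighted data, and then recomputes the posteriors.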

At the 5-th Iteration The red Gaussian component slowly shifts toward the left end of the x-axis.

At the 10-th Iteration The red Gaussian component continues to shift slowly toward the left end of the x-axis.

At the 20-th Iteration The red Gaussian component makes a more noticeable shift toward the left end of the x-axis.

At the 50-th Iteration The red Gaussian component is close to the desired location.

At the 100-th Iteration The results are almost identical to those at the 50-th iteration.

EM as A Bound Optimization The EM algorithm in fact maximizes the log-likelihood function of the training data: the likelihood of each data point x under the mixture, summed in log form over all points to give the log-likelihood of the training data.
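The likelihood formulas referred to here are not in the transcript; for the two-component GMM above they take the standard form (notation as introduced earlier)
\[
p(x;\theta) \;=\; \sum_{k=1}^{2} \pi_k\,\mathcal{N}(x;\mu_k,\sigma_k^2),
\qquad
\ell(\theta) \;=\; \sum_{i=1}^{n} \log p(x_i;\theta),
\]
with $\theta=(\pi_k,\mu_k,\sigma_k^2)_{k=1,2}$, and EM maximizes $\ell(\theta)$.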

Logarithm Bound Algorithm Start with an initial guess. Come up with a lower bound that touches the objective at the current guess (the touch point). Search for the solution that maximizes the lower bound, and repeat the procedure. The iterates converge to a local optimum (the optimal point).
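Written out (with notation introduced here), the procedure builds at each iteration $t$ a surrogate $Q(\theta;\theta^{(t)})$ satisfying
\[
Q(\theta;\theta^{(t)}) \le \ell(\theta) \ \text{for all } \theta,
\qquad
Q(\theta^{(t)};\theta^{(t)}) = \ell(\theta^{(t)}) \ \text{(the touch point)},
\]
and sets $\theta^{(t+1)}=\arg\max_\theta Q(\theta;\theta^{(t)})$. Then
\[
\ell(\theta^{(t+1)}) \;\ge\; Q(\theta^{(t+1)};\theta^{(t)}) \;\ge\; Q(\theta^{(t)};\theta^{(t)}) \;=\; \ell(\theta^{(t)}),
\]
so the objective never decreases and the iterates converge to a local optimum.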

EM as A Bound Optimization Denote the parameters from the previous iteration and the parameters for the current iteration; compute the difference between the log-likelihoods under the two parameter settings.

Apply the concave property of the logarithm function.

Use the definition of the posterior.
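The equations behind these steps are missing from the transcript; a standard form of the bound they describe, with mixture components indexed by $k$ and $\theta'$ the previous parameters, is
\[
\ell(\theta)-\ell(\theta')
= \sum_i \log \frac{p(x_i;\theta)}{p(x_i;\theta')}
= \sum_i \log \sum_k p(k\mid x_i;\theta')\,\frac{p(x_i,k;\theta)}{p(x_i,k;\theta')}
\;\ge\; \sum_i \sum_k p(k\mid x_i;\theta')\,\log \frac{p(x_i,k;\theta)}{p(x_i,k;\theta')},
\]
where the middle equality uses the definition of the posterior, $p(x_i,k;\theta') = p(k\mid x_i;\theta')\,p(x_i;\theta')$, and the inequality is Jensen's inequality applied to the concave logarithm. The bound equals zero at $\theta=\theta'$ (the touch point), and maximizing it over $\theta$ is the M-step.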

Log-Likelihood of EM Alg. Note the saddle points in the log-likelihood curve.

Maximize GMM Model What is the global optimal solution to the GMM? Maximizing the objective function of the GMM is an ill-posed problem: the likelihood is unbounded, for example when one component collapses onto a single data point and its variance shrinks toward zero.

Identify Hidden Variables For certain learning problems, identifying the hidden variables is not an easy task. Consider a simple translation model: for a pair of English and Chinese sentences, the model assigns a probability to the English sentence given the Chinese one, and the objective is the log-likelihood of the training corpus.
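The slide's model equations are not in the transcript; a simple word-based translation model of this kind (in the spirit of IBM Model 1, with notation chosen here) is
\[
\Pr(\mathbf{e}\mid\mathbf{c}) \;=\; \prod_{i=1}^{m} \sum_{j=1}^{n} \frac{1}{n}\,\Pr(e_i\mid c_j),
\qquad
\mathcal{L} \;=\; \sum_{(\mathbf{e},\mathbf{c})} \log \Pr(\mathbf{e}\mid\mathbf{c}),
\]
for an English sentence $\mathbf{e}=(e_1,\dots,e_m)$ paired with a Chinese sentence $\mathbf{c}=(c_1,\dots,c_n)$.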

Identify Hidden Variables Consider a simple case. Introduce an alignment variable a(i) that maps each English word position i to a Chinese word position, and rewrite the likelihood as a sum over the alignment variables.
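The rewriting itself is shown only as equation images; with the alignment variable $a(i)\in\{1,\dots,n\}$ the product-of-sums form above can be rewritten as a sum over alignments,
\[
\prod_{i=1}^{m}\sum_{j=1}^{n}\frac{1}{n}\Pr(e_i\mid c_j)
\;=\;
\sum_{a(1)=1}^{n}\cdots\sum_{a(m)=1}^{n}\;\prod_{i=1}^{m}\frac{1}{n}\Pr(e_i\mid c_{a(i)}),
\]
which makes the alignment $a$ explicit as the hidden variable that EM will handle.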

EM Algorithm for A Translation Model Introduce an alignment variable for each translation pair. The EM algorithm for the translation model: E-step: compute the posterior for each alignment variable; M-step: estimate the translation probability Pr(e|c). We are lucky here; in general, this step can be extremely difficult and usually requires approximate approaches.
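The E- and M-step formulas are not shown in the transcript; below is a minimal sketch of EM for the word-based model above, where the alignment posterior factorizes per English word. The function name, corpus format, and helper variables are all illustrative choices.

```python
from collections import defaultdict

def em_translation(corpus, iters=10):
    """corpus: list of (english_words, chinese_words) pairs.
    Returns t[(e, c)], an estimate of Pr(e | c) (illustrative sketch)."""
    e_vocab = {e for es, _ in corpus for e in es}
    t = defaultdict(lambda: 1.0 / len(e_vocab))   # uniform initialization
    for _ in range(iters):
        counts = defaultdict(float)   # expected count of (e, c) alignments
        totals = defaultdict(float)   # expected count of c being aligned to
        for es, cs in corpus:
            for e in es:
                # E-step: posterior over which c the word e aligns to
                denom = sum(t[(e, c)] for c in cs)
                for c in cs:
                    p = t[(e, c)] / denom
                    counts[(e, c)] += p
                    totals[c] += p
        # M-step: re-estimate Pr(e | c) from expected counts
        t = defaultdict(float,
                        {(e, c): counts[(e, c)] / totals[c] for (e, c) in counts})
    return dict(t)

pairs = [(["I", "love", "China"], ["wo", "ai", "zhongguo"]),
         (["I", "love", "you"], ["wo", "ai", "ni"])]
print(em_translation(pairs))
```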

Compute Pr(e|c) First compute the expected alignment counts from the posteriors, then normalize them to obtain Pr(e|c).

Bound Optimization for A Translation Model

Iterative Scaling Maximum entropy model. Iterative scaling assumes that all features are non-negative and that the sum of the features is constant across examples.
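The model definition on this slide is an image; the conditional exponential (maximum entropy) model being discussed has the standard form (notation chosen here)
\[
p(y\mid x;\mathbf{w}) \;=\; \frac{\exp\!\big(\sum_j w_{y,j}\,f_j(x)\big)}{\sum_{y'}\exp\!\big(\sum_j w_{y',j}\,f_j(x)\big)},
\]
with one weight vector per class and non-negative features $f_j(x)\ge 0$ summing to a constant $C$ for every $x$.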

Iterative Scaling
Compute the empirical mean of each feature for every class, i.e., for every j and every class y.
Start with w1, w2, …, wc = 0.
Repeat:
  Compute p(y|x) for each training data point (xi, yi) using the w from the previous iteration.
  Compute the mean of each feature for every class using the estimated probabilities, i.e., for every j and every y.
  Compute the update for every j and every y.
  Update w accordingly.
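As a concrete illustration of this loop, here is a sketch of generalized iterative scaling (GIS) for the model above; the slack-feature trick, the function name gis, and the array layout are assumptions made here, not taken from the slides.

```python
import numpy as np

def gis(X, y, num_classes, iters=100):
    """Generalized iterative scaling for a conditional maxent model.
    X: (n, d) array of non-negative features; y: (n,) integer labels."""
    n, d = X.shape
    C = X.sum(axis=1).max()          # GIS constant: max total feature mass
    slack = C - X.sum(axis=1)        # slack feature so every row sums to C
    W = np.zeros((num_classes, d))
    w_slack = np.zeros(num_classes)
    # empirical (observed) feature totals per class
    emp = np.zeros((num_classes, d))
    emp_slack = np.zeros(num_classes)
    for i in range(n):
        emp[y[i]] += X[i]
        emp_slack[y[i]] += slack[i]
    eps = 1e-12
    for _ in range(iters):
        # p(y | x_i) under the current weights
        scores = X @ W.T + slack[:, None] * w_slack
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)
        # expected feature totals per class under the model
        model = P.T @ X
        model_slack = P.T @ slack
        # GIS update: w_{y,j} += (1/C) * log(empirical / expected)
        W += np.log((emp + eps) / (model + eps)) / C
        w_slack += np.log((emp_slack + eps) / (model_slack + eps)) / C
    return W, w_slack

# toy usage
rng = np.random.default_rng(0)
X = rng.random((200, 5)); y = (X[:, 0] > 0.5).astype(int)
W, w0 = gis(X, y, num_classes=2)
```

Each update moves every weight by (1/C) times the log-ratio of the empirical to the expected feature totals, which is the classic GIS step.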

Iterative Scaling

Iterative Scaling Can we use the concave property of the logarithm function? No, we can't, because we need a lower bound.

Iterative Scaling The weights still couple with each other; further decomposition is needed.
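The further decomposition appears on the slides only as equation images; the standard trick at this point (used in improved iterative scaling, notation introduced here) is Jensen's inequality on the convex exponential with $f^{\#}(x)=\sum_j f_j(x)$:
\[
\exp\Big(\sum_j \delta_{y,j}\,f_j(x)\Big)
= \exp\Big(\sum_j \frac{f_j(x)}{f^{\#}(x)}\,\delta_{y,j}\,f^{\#}(x)\Big)
\;\le\; \sum_j \frac{f_j(x)}{f^{\#}(x)}\,\exp\big(\delta_{y,j}\,f^{\#}(x)\big),
\]
which decouples the updates $\delta_{y,j}$ into independent one-dimensional problems, one per weight.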

Iterative Scaling

Iterative Scaling Wait a minute, this cannot be right! What happened?

Logarithm Bound Algorithm Start with an initial guess, come up with a lower bound, and search for the solution that maximizes the lower bound.

Iterative Scaling Where does it go wrong?

Iterative Scaling The expression is not zero when the new parameters equal the old ones, so the bound does not touch the objective at the current solution.

Iterative Scaling Use the definition of the conditional exponential model.

Iterative Scaling

Iterative Scaling

Iterative Scaling How about …? Is this solution unique?

Iterative Scaling How about negative features?

Faster Iterative Scaling The lower bound may not be tight once all the coupling between the weights has been removed. A tighter bound can be derived by not fully decoupling the correlation between the weights, while still reducing the problem to univariate functions.

Faster Iterative Scaling Log-likelihood

Bad News You may feel great after the struggle of the derivation. However, is iterative scaling truly a great idea? Given that there have been so many studies in optimization, we should try out the existing methods.

Comparing Improved Iterative Scaling to Newton’s Method

            Improved iterative scaling   Limited-memory quasi-Newton
Dataset     Iterations   Time (s)        Iterations   Time (s)
Rule        823          42.48           81           1.13
Lex         241          102.18          176          20.02
Summary     626          208.22          69           8.52
Shallow     3216         71053.12        421          2420.30

Dataset     Instances    Features
Rule        29,602       246
Lex         42,509       135,182
Summary     24,044       198,467
Shallow     8,625,782    264,142

Try out the standard numerical methods before you get excited about your algorithm.