Expectation Maximization Algorithm

Presentation transcript:

Expectation Maximization Algorithm Rong Jin

A Mixture Model Problem: The dataset apparently consists of two modes. How can we automatically identify the two modes?

Gaussian Mixture Model (GMM): Assume that the dataset is generated by a mixture of two Gaussian distributions, Gaussian model 1 and Gaussian model 2. If we knew the membership of each bin, estimating the two Gaussian models would be easy. How can we estimate the two Gaussian models without knowing the memberships of the bins?

EM Algorithm for GMM: Let the memberships be hidden variables. The EM algorithm for the Gaussian mixture model has two sets of unknowns, the memberships and the Gaussian models, and it learns these two sets of parameters iteratively.

Start with a Random Guess: Randomly assign the memberships to each bin. Estimate the mean and variance of each Gaussian model.

E-step: Fix the two Gaussian models and estimate the posterior for each data point.

EM Algorithm for GMM: Re-estimate the memberships for each bin.

M-step: Fix the memberships and re-estimate the two Gaussian models, with each data point weighted by its posterior.

EM Algorithm for GMM: Re-estimate the memberships for each bin, then re-estimate the models.
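
The alternation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the presenter's code: it assumes one-dimensional data, exactly two components, and soft memberships. The function names, the `min_var` floor and the choice to record the log-likelihood are all assumptions made for the example.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, n_iter=50, seed=0, min_var=1e-6):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    rng = random.Random(seed)
    # Random guess: a soft membership in component 0 for every data point.
    resp = [rng.random() for _ in data]
    log_liks = []
    params = []
    for _ in range(n_iter):
        # M-step: fix the memberships, re-estimate both Gaussians
        # from the data weighted by the memberships (posteriors).
        params = []
        for k in range(2):
            weights = [r if k == 0 else 1.0 - r for r in resp]
            total = sum(weights)
            mean = sum(w * x for w, x in zip(weights, data)) / total
            var = sum(w * (x - mean) ** 2 for w, x in zip(weights, data)) / total
            # min_var is a hypothetical safeguard against a collapsing variance.
            params.append((total / len(data), mean, max(var, min_var)))
        # E-step: fix the two Gaussians, re-estimate the posterior
        # membership of every data point.
        resp = []
        log_lik = 0.0
        for x in data:
            joint = [p * gaussian_pdf(x, m, v) for p, m, v in params]
            evidence = sum(joint)
            log_lik += math.log(evidence)
            resp.append(joint[0] / evidence)
        log_liks.append(log_lik)
    # params holds (mixing weight, mean, variance) for each component.
    return params, resp, log_liks
```

Recording the log-likelihood after every iteration makes it easy to check that it never decreases, which is the property the bound-optimization view of EM explains.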

At the 5th Iteration: The red Gaussian component slowly shifts toward the left end of the x axis.

At the 10th Iteration: The red Gaussian component still slowly shifts toward the left end of the x axis.

At the 20th Iteration: The red Gaussian component makes a more noticeable shift toward the left end of the x axis.

At the 50th Iteration: The red Gaussian component is close to the desired location.

At the 100th Iteration: The results are almost identical to those for the 50th iteration.

EM as a Bound Optimization: The EM algorithm in fact maximizes the log-likelihood function of the training data. The slide gives the likelihood for a data point x and the log-likelihood of the training data.
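
For the two-component mixture introduced earlier, these two quantities can be written as below. The symbols (mixing weights p_k, means \mu_k, variances \sigma_k^2, parameter set \theta, and n data points x_i) are assumed notation, since the transcript does not preserve the slide's own formulas.

```latex
p(x \mid \theta) = \sum_{k=1}^{2} p_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2),
\qquad
\ell(\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{2} p_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)
```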

Logarithm Bound Algorithm: Start with an initial guess. Come up with a lower bound that touches the log-likelihood at the current guess (the touch point). Search for the solution that maximizes the lower bound. Repeat the procedure until it converges to a local optimum (the optimal point).

EM as a Bound Optimization: Given the parameters from the previous iteration and the parameters for the current iteration, compute the difference in log-likelihood between the two.

Concave property of logarithm function

Definition of posterior
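
The two annotations above most likely label the steps of the standard derivation of the EM lower bound. A sketch of that derivation follows, under assumed notation: \theta' is the parameter from the previous iteration, \theta the current one, and k indexes the hidden membership of data point x_i. The transcript does not preserve the slide's own formulas, so this is a reconstruction of the usual argument rather than a copy of the slide.

```latex
\begin{aligned}
\ell(\theta) - \ell(\theta')
  &= \sum_{i} \log \sum_{k}
     \frac{p(x_i, k \mid \theta')}{p(x_i \mid \theta')}
     \cdot \frac{p(x_i, k \mid \theta)}{p(x_i, k \mid \theta')} \\
  &\ge \sum_{i} \sum_{k}
     \frac{p(x_i, k \mid \theta')}{p(x_i \mid \theta')}
     \log \frac{p(x_i, k \mid \theta)}{p(x_i, k \mid \theta')}
     && \text{(concavity of } \log\text{)} \\
  &= \sum_{i} \sum_{k} p(k \mid x_i, \theta')
     \log \frac{p(x_i, k \mid \theta)}{p(x_i, k \mid \theta')}
     && \text{(definition of posterior)}
\end{aligned}
```

The right-hand side equals zero at \theta = \theta', so the bound touches the log-likelihood at the previous parameters, and maximizing it over \theta cannot decrease the log-likelihood.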

Log-Likelihood of EM Alg. Saddle points

Maximize GMM Model: What is the globally optimal solution to the GMM? Maximizing the objective function of the GMM is an ill-posed problem.
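
One standard reason for this, which the transcript itself does not spell out, is that the likelihood is unbounded: if one component is centred exactly on a single data point and its variance shrinks toward zero, the log-likelihood grows without limit. The sketch below illustrates this on made-up data; the data values, mixing weights and the second component's parameters are all arbitrary choices for the example.

```python
import math


def gmm_log_likelihood(data, weights, means, variances):
    """Log-likelihood of 1-D data under a Gaussian mixture."""
    total = 0.0
    for x in data:
        mix = sum(
            w * math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
            for w, m, v in zip(weights, means, variances)
        )
        total += math.log(mix)
    return total


# Made-up data; component 0 sits exactly on the first point.
data = [-1.2, -0.7, 0.1, 0.4, 2.9, 3.3, 3.8]
shrinking_variances = [1e-2, 1e-4, 1e-6, 1e-8]
log_liks = [
    gmm_log_likelihood(data, [0.5, 0.5], [data[0], 2.0], [var, 4.0])
    for var in shrinking_variances
]
# Each further shrink of the variance raises the log-likelihood again.
for var, ll in zip(shrinking_variances, log_liks):
    print(f"variance={var:g}  log-likelihood={ll:.3f}")
```

In practice this is why implementations often put a floor on the variances or add a prior, so that EM settles on a sensible local optimum rather than chasing such degenerate solutions.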

Identify Hidden Variables: For certain learning problems, identifying the hidden variables is not an easy task. Consider a simple translation model for a pair of English and Chinese sentences, together with the log-likelihood of the training corpus.

Identify Hidden Variables: Consider a simple case. Introduce an alignment variable a(i) and rewrite the model in terms of it.

EM Algorithm for a Translation Model: Introduce an alignment variable for each translation pair. In the EM algorithm for the translation model, the E-step computes the posterior for each alignment variable and the M-step estimates the translation probability Pr(e|c). We are lucky here. In general, this step can be extremely difficult and usually requires approximate approaches.
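
A minimal sketch of these two steps is given below. It assumes a word-based model in the spirit of IBM Model 1, where every English word aligns to exactly one Chinese word in the paired sentence; the slide's exact model is not preserved in the transcript, so this choice is an assumption. The function name and the romanized toy corpus are invented for the example.

```python
from collections import defaultdict


def train_translation_model(pairs, n_iter=10):
    """Estimate Pr(e|c) by EM from (english_words, chinese_words) pairs."""
    english_vocab = {e for es, _ in pairs for e in es}
    # Uniform initial guess for every Pr(e|c).
    t = defaultdict(lambda: 1.0 / len(english_vocab))
    for _ in range(n_iter):
        count = defaultdict(float)  # expected number of times e aligns to c
        total = defaultdict(float)  # expected number of alignments to c
        for es, cs in pairs:
            for e in es:
                # E-step: posterior over which Chinese word e is aligned to.
                norm = sum(t[(e, c)] for c in cs)
                for c in cs:
                    posterior = t[(e, c)] / norm
                    count[(e, c)] += posterior
                    total[c] += posterior
        # M-step: re-estimate Pr(e|c) from the expected counts.
        t = defaultdict(float, {(e, c): count[(e, c)] / total[c] for (e, c) in count})
    return t


# Toy corpus (invented): romanized Chinese tokens stand in for real text.
pairs = [
    (["this", "book"], ["zhe", "shu"]),
    (["this", "house"], ["zhe", "fangzi"]),
    (["book"], ["shu"]),
]
t = train_translation_model(pairs)
```

Because the posterior over each alignment variable factorizes across English words, the E-step here is a simple normalization; that is the sense in which this model is a lucky case.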

Compute Pr(e|c) First compute

Bound Optimization for A Translation Model

Iterative Scaling: Maximum entropy model. Iterative scaling assumes that all features are non-negative and that the sum of the features is constant.

Iterative Scaling: Compute the empirical mean of each feature for every class, i.e., for every j and every class y. Start with w1, w2, …, wc = 0. Then repeat the following steps. Compute p(y|x) for each training data point (xi, yi) using w from the previous iteration. Compute the mean of each feature for every class using the estimated probabilities, i.e., for every j and every y. Compute the weight change for every j and every y, and update w accordingly.
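
The loop above can be sketched as follows. The slide's update formula is not preserved in the transcript, so the sketch assumes the usual generalized iterative scaling update, in which each weight moves by (1/C) times the log of the ratio between the empirical and the estimated feature mean, with C the constant feature sum. The function names and the guard against zero means are assumptions made for the example.

```python
import math


def predict_proba(w, x):
    """p(y|x) under a conditional exponential model with weights w[y][j]."""
    scores = [sum(wj * xj for wj, xj in zip(wy, x)) for wy in w]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def train_gis(xs, ys, n_classes, n_iter=50):
    """Iterative scaling for non-negative features with a constant sum."""
    n, d = len(xs), len(xs[0])
    c = sum(xs[0])  # constant feature sum, assumed equal for every x
    assert all(min(x) >= 0 and abs(sum(x) - c) < 1e-9 for x in xs)
    # Empirical mean of feature j for class y.
    emp = [[0.0] * d for _ in range(n_classes)]
    for x, y in zip(xs, ys):
        for j in range(d):
            emp[y][j] += x[j] / n
    w = [[0.0] * d for _ in range(n_classes)]  # start with all weights at 0
    for _ in range(n_iter):
        # Mean of feature j for class y under the estimated p(y|x).
        est = [[0.0] * d for _ in range(n_classes)]
        for x in xs:
            probs = predict_proba(w, x)
            for y in range(n_classes):
                for j in range(d):
                    est[y][j] += probs[y] * x[j] / n
        # Weight change and update; entries with a zero mean are skipped.
        for y in range(n_classes):
            for j in range(d):
                if emp[y][j] > 0 and est[y][j] > 0:
                    w[y][j] += math.log(emp[y][j] / est[y][j]) / c
    return w
```

Each pass only needs the feature means, so no step size or second derivative is required; the price is the restriction to non-negative features with a constant sum.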

Iterative Scaling

Iterative Scaling: Can we use the concave property of the logarithm function? No, we can't, because we need a lower bound.

Iterative Scaling: The weights are still coupled with each other, so further decomposition is needed.

Iterative Scaling: Wait a minute, this cannot be right! What happened?

Logarithm Bound Algorithm: Start with an initial guess. Come up with a lower bound. Search for the solution that maximizes the lower bound.

Iterative Scaling: Where does it go wrong?

Iterative Scaling: The bound is not zero when the current parameters equal those of the previous iteration.

Iterative Scaling: Definition of conditional exponential model

Iterative Scaling: How about ? Is this solution unique?

Iterative Scaling: How about negative features?

Faster Iterative Scaling: The lower bound may not be tight, given that all the coupling between weights is removed. A tighter bound can be derived by not fully decoupling the correlation between weights. Univariate functions!

Faster Iterative Scaling Log-likelihood

Bad News: You may feel great after the struggle of the derivation. However, is iterative scaling truly a great idea? Given that there have been so many studies in optimization, we should try out existing methods.

Comparing Improved Iterative Scaling to Newton's Method

Dataset    Instances    Features
Rule       29,602       246
Lex        42,509       135,182
Summary    24,044       198,467
Shallow    8,625,782    264,142

           Improved iterative scaling    Limited-memory quasi-Newton method
Dataset    Iterations    Time (s)        Iterations    Time (s)
Rule       823           42.48           81            1.13
Lex        241           102.18          176           20.02
Summary    626           208.22          69            8.52
Shallow    3216          71053.12        421           2420.30

Try out the standard numerical methods before you get excited about your algorithm.