Expectation-Maximization (EM)


Expectation-Maximization (EM). School of Computer Science, 10-601B Introduction to Machine Learning. Matt Gormley. Lecture 24, November 21, 2016. Readings:

Reminders: Final Exam, in-class, Wed., Dec. 7

Outline
Models: Gaussian Naïve Bayes (GNB); Mixture Model (MM); Gaussian Mixture Model (GMM); Gaussian Discriminant Analysis
Hard Expectation-Maximization (EM): Hard EM Algorithm; Example: Mixture Model; Example: Gaussian Mixture Model; K-Means as Hard EM
(Soft) Expectation-Maximization (EM): Soft EM Algorithm
Extra Slides: Why Does EM Work?
Properties of EM: Nonconvexity / Local Optimization; Example: Grammar Induction; Variants of EM

Gaussian Mixture Model

Model 3: Gaussian Naïve Bayes (Recall). Data: Model: product of the prior and the event model. (Graphical model: label Y with feature children X1, X2, ..., XM.)
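The model equations on this slide did not survive the transcript; in standard notation (an assumption on my part, not copied from the slide), the Gaussian Naive Bayes joint is the class prior times a product of per-feature Gaussians:

p(y, x_1, \dots, x_M) = p(y) \prod_{m=1}^{M} p(x_m \mid y),
\qquad p(x_m \mid y = k) = \mathcal{N}\!\left(x_m \mid \mu_{m,k}, \sigma_{m,k}^2\right)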

Mixture-Model Data: Generative Story: Model: Joint: Marginal: (Marginal) Log-likelihood:
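The Data/Model/Joint/Marginal equations were lost in the transcript; a standard finite mixture model (notation assumed here) has the form:

z \sim \mathrm{Categorical}(\pi), \qquad x \mid (z = k) \sim p(x \mid \theta_k)

p(x, z = k) = \pi_k \, p(x \mid \theta_k), \qquad p(x) = \sum_{k=1}^{K} \pi_k \, p(x \mid \theta_k)

\ell(\pi, \theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, p\!\left(x^{(i)} \mid \theta_k\right)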

Learning a Mixture Model. Supervised Learning: the parameters decouple! Unsupervised Learning: parameters are coupled by marginalization. (Graphical models: Z with children X1, X2, ..., XM, shown once with Z observed and once with Z latent.)

Learning a Mixture Model. Supervised Learning: the parameters decouple! Unsupervised Learning: parameters are coupled by marginalization. (Graphical models: Z with children X1, X2, ..., XM, shown once with Z observed and once with Z latent.) Training certainly isn't as simple as the supervised case. In many cases, we could still use some black-box optimization method (e.g. Newton-Raphson) to solve this coupled optimization problem. This lecture is about an even simpler method: EM.

Mixture-Model (graphical model: latent Z with children X1, X2, ..., XM). Data: Generative Story: Model: Joint: Marginal: (Marginal) Log-likelihood:

Gaussian Mixture-Model (graphical model: latent Z with children X1, X2, ..., XM). Data: Generative Story: Model: Joint: Marginal: (Marginal) Log-likelihood:
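As above, the equations were lost in transcription; the standard Gaussian mixture model specializes the component densities to Gaussians (notation assumed):

z \sim \mathrm{Categorical}(\pi), \qquad x \mid (z = k) \sim \mathcal{N}(\mu_k, \Sigma_k)

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad
\ell(\pi, \mu, \Sigma) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x^{(i)} \mid \mu_k, \Sigma_k\right)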

Identifiability. A mixture model induces a multi-modal likelihood. Hence gradient ascent can only find a local maximum. Mixture models are unidentifiable, since we can always switch the hidden labels without affecting the likelihood. Hence we should be careful in trying to interpret the “meaning” of latent variables. © Eric Xing @ CMU, 2006-2011

Hard EM (a.k.a. Viterbi EM)

K-means as Hard EM. Loop: For each point n = 1 to N, compute its cluster label: zn = argmin_k ||xn - μk||^2. For each cluster k = 1:K, recompute its centroid as the mean of the points currently assigned to it: μk = mean{xn : zn = k}. © Eric Xing @ CMU, 2006-2011
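The slide's update equations did not transcribe cleanly, so here is a minimal NumPy sketch of K-means viewed as hard EM (function and variable names are my own, not from the lecture): the hard E-step assigns each point to its nearest centroid, and the M-step recomputes each centroid as the mean of its assigned points.

import numpy as np

def kmeans_hard_em(X, K, n_iters=100, seed=0):
    """K-means as hard EM: alternate hard assignments and centroid updates (a sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)  # init centroids at random points
    z = np.zeros(N, dtype=int)
    for _ in range(n_iters):
        # Hard E-step: assign each point to its nearest centroid.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        z_new = dists.argmin(axis=1)
        # M-step: move each centroid to the mean of its assigned points.
        for k in range(K):
            if np.any(z_new == k):
                mu[k] = X[z_new == k].mean(axis=0)
        if np.array_equal(z_new, z):  # assignments stopped changing: converged
            break
        z = z_new
    return mu, z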

Whiteboard Background: Coordinate Descent algorithm

Hard Expectation-Maximization. Initialize parameters randomly. While not converged: E-Step: set the latent variables to the values that maximize the likelihood, treating the parameters as observed (“hallucinate some data”). M-Step: set the parameters to the values that maximize the likelihood, treating the latent variables as observed (standard Bayes Net training).

Hard EM for Mixture Models. E-Step implementation: a for-loop over the possible values of the latent variable. M-Step implementation: supervised Bayesian Network learning.
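A hedged reconstruction of the two steps in standard notation (not copied from the slide):

\text{E-step: } z^{(i)} \leftarrow \arg\max_k \; p\!\left(x^{(i)}, z = k \mid \theta\right) \text{ for each } i, \qquad
\text{M-step: } \theta \leftarrow \arg\max_\theta \sum_{i=1}^{N} \log p\!\left(x^{(i)}, z^{(i)} \mid \theta\right)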


Gaussian Discriminant Analysis (graphical model: Z with children X1, X2, ..., XM). Data: Generative Story: Model: Joint: Log-likelihood:

Gaussian Discriminant Analysis (continued). Data: Log-likelihood: Maximum Likelihood Estimates: take the derivative of the Lagrangian, set it equal to zero, and solve. Implementation: just counting.
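The closed-form estimates themselves were lost in the transcript; the standard “just counting” solutions, with N_k the number of examples labeled k (notation assumed), are:

\hat{\pi}_k = \frac{N_k}{N}, \qquad
\hat{\mu}_k = \frac{1}{N_k} \sum_{i:\, z^{(i)} = k} x^{(i)}, \qquad
\hat{\Sigma}_k = \frac{1}{N_k} \sum_{i:\, z^{(i)} = k} \left(x^{(i)} - \hat{\mu}_k\right)\left(x^{(i)} - \hat{\mu}_k\right)^{\top}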

Hard EM for GMMs. E-Step implementation: a for-loop over the possible values of the latent variable. M-Step implementation: just counting, as in the supervised setting.


(Soft) EM (the standard EM algorithm)

(Soft) EM for GMM. Start: “guess” the centroid μk and covariance Σk of each of the K clusters. Loop: alternate the E-step and M-step until convergence. © Eric Xing @ CMU, 2006-2011
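The E-step and M-step formulas on this slide were lost in transcription; below is a minimal NumPy/SciPy sketch of soft EM for a GMM under the usual updates (the function name em_gmm and other details are mine, not the lecture's): the E-step computes responsibilities q(z_i = k | x_i), and the M-step re-estimates mixing weights, means, and covariances from those soft counts.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Soft EM for a Gaussian mixture model (a sketch, not the lecture's code)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                                     # mixing weights
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)   # means
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)       # covariances
    for _ in range(n_iters):
        # E-step: responsibilities gamma[i, k] = q(z_i = k | x_i, current params).
        gamma = np.zeros((N, K))
        for k in range(K):
            gamma[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft counts.
        Nk = gamma.sum(axis=0)                 # expected number of points per cluster
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, gamma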

(Soft) Expectation-Maximization. Initialize parameters randomly. While not converged: E-Step: create one training example for each possible value of the latent variables; weight each example according to the model's confidence; treat the parameters as observed (“hallucinate some data”). M-Step: set the parameters to the values that maximize the likelihood; treat the pseudo-counts from above as observed (standard Bayes Net training).

Posterior Inference for Mixture Model
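The formula on this slide was lost; for a mixture model the posterior over the latent component follows from Bayes' rule (standard notation assumed):

p(z = k \mid x, \theta) = \frac{\pi_k \, p(x \mid \theta_k)}{\sum_{j=1}^{K} \pi_j \, p(x \mid \theta_j)}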

(Soft) EM for GMM. Initialize parameters randomly. While not converged: E-Step: create one training example for each possible value of the latent variables; weight each example according to the model's confidence; treat the parameters as observed. M-Step: set the parameters to the values that maximize the likelihood; treat the pseudo-counts from above as observed.

Hard EM vs. Soft EM


Extra Slides Why Does EM Work?

Theory underlying EM. What are we doing? Recall that, according to MLE, we intend to learn the model parameters that maximize the likelihood of the data. But we do not observe z, so computing the marginal likelihood log p(x | θ) = log Σz p(x, z | θ) is difficult! What shall we do? © Eric Xing @ CMU, 2006-2011

Complete & Incomplete Log Likelihoods. Complete log likelihood: let X denote the observable variable(s) and Z denote the latent variable(s). If Z could be observed, we could work directly with the complete log likelihood lc (written out below). Usually, optimizing lc() given both z and x is straightforward (cf. MLE for fully observed models). Recall that in this case the objective for, e.g., MLE decomposes into a sum of factors, and the parameters for each factor can be estimated separately. But given that Z is not observed, lc() is a random quantity and cannot be maximized directly. Incomplete log likelihood: with z unobserved, our objective becomes the log of a marginal probability (written out below). This objective won't decouple. © Eric Xing @ CMU, 2006-2011
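In symbols (a reconstruction consistent with the slide's definitions):

\ell_c(\theta; x, z) = \log p(x, z \mid \theta)
\qquad \text{vs.} \qquad
\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_{z} p(x, z \mid \theta)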

Expected Complete Log Likelihood. For any distribution q(z), define the expected complete log likelihood (written out below). It is a deterministic function of θ, and it is linear in lc(), so it inherits lc()'s factorizability. Does maximizing this surrogate yield a maximizer of the likelihood? Jensen's inequality gives the answer (see below). © Eric Xing @ CMU, 2006-2011
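The definition and the Jensen step, reconstructed in standard notation (not copied verbatim from the slide):

\langle \ell_c(\theta; x, z) \rangle_q = \sum_{z} q(z \mid x) \log p(x, z \mid \theta)

\ell(\theta; x) = \log \sum_{z} q(z \mid x) \frac{p(x, z \mid \theta)}{q(z \mid x)}
\;\ge\; \sum_{z} q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}
\quad \text{(Jensen's inequality)}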

Lower Bounds and Free Energy. For fixed data x, define a functional called the free energy (written out below). The EM algorithm is coordinate ascent on F, alternating an E-step (maximize over q) and an M-step (maximize over θ). © Eric Xing @ CMU, 2006-2011
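The definitions lost from this slide, in the usual notation:

F(q, \theta) = \sum_{z} q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \;\le\; \ell(\theta; x)

\text{E-step: } q^{t+1} = \arg\max_q F(q, \theta^t), \qquad
\text{M-step: } \theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)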

E-step: maximization of expected lc w.r.t. q. Claim: the maximizing q is the posterior distribution over the latent variables given the data and the parameters, qt+1 = p(z | x, θt). Often we need this at test time anyway (e.g. to perform classification). Proof (easy): this setting attains the bound l(θt; x) ≥ F(q, θt). Can also show this result using variational calculus, or the fact that the KL divergence between q(z) and p(z | x, θ) is nonnegative and zero iff they are equal (see below). © Eric Xing @ CMU, 2006-2011
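The identity behind the proof, in standard notation (a reconstruction, not copied from the slide): the free energy differs from the log-likelihood by a KL divergence, so it is maximized over q exactly when q equals the posterior.

F(q, \theta) = \ell(\theta; x) - \mathrm{KL}\!\left(q(z) \,\|\, p(z \mid x, \theta)\right)
\;\Rightarrow\;
\arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)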

E-step ≡ plug in posterior expectation of latent variables. Without loss of generality, assume that p(x, z | θ) is a generalized exponential family distribution (see below). Special case: if the p(X | Z) are GLIMs, the expression simplifies further. The expected complete log likelihood under qt+1 = p(z | x, θt) then depends on the latent variables only through their expected sufficient statistics. © Eric Xing @ CMU, 2006-2011
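A hedged sketch of the exponential-family case (the notation T, η, A is assumed, not taken from the slide): the expected complete log likelihood is linear in the expected sufficient statistics, so the E-step only needs to compute them.

p(x, z \mid \theta) = h(x, z) \exp\!\left(\eta(\theta)^{\top} T(x, z) - A(\theta)\right)
\;\Rightarrow\;
\langle \ell_c(\theta; x, z) \rangle_q = \eta(\theta)^{\top} \, \mathbb{E}_q[T(x, z)] - A(\theta) + \text{const.}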

M-step: maximization of expected lc w.r.t. θ. Note that the free energy breaks into two terms (see below): the first term is the expected complete log likelihood (energy) and the second term, which does not depend on θ, is the entropy. Thus, in the M-step, maximizing with respect to θ for fixed q, we only need to consider the first term. Under the optimal qt+1, this is equivalent to solving a standard MLE of the fully observed model p(x, z | θ), with the sufficient statistics involving z replaced by their expectations w.r.t. p(z | x, θt). © Eric Xing @ CMU, 2006-2011
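The decomposition referred to above, in symbols:

F(q, \theta) = \underbrace{\sum_{z} q(z \mid x) \log p(x, z \mid \theta)}_{\langle \ell_c(\theta;\, x, z) \rangle_q}
\;+\; \underbrace{\left(-\sum_{z} q(z \mid x) \log q(z \mid x)\right)}_{H(q)},
\qquad
\theta^{t+1} = \arg\max_\theta \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}}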

Summary: EM Algorithm. A way of maximizing the likelihood function for latent variable models. Finds the MLE of the parameters when the original (hard) problem can be broken up into two (easy) pieces: (1) estimate some “missing” or “unobserved” data from the observed data and current parameters; (2) using this “complete” data, find the maximum likelihood parameter estimates. Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess (E-step and M-step, written out below). In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making the bound equal to the likelihood. © Eric Xing @ CMU, 2006-2011
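The two updates on this slide, written out (a reconstruction consistent with the preceding slides):

\text{E-step: } q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t), \qquad
\text{M-step: } \theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta) = \arg\max_\theta \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}}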

Properties of EM

Properties of EM. EM is trying to optimize a nonconvex function, but EM is a local optimization algorithm. Typical solution: random restarts. Just like K-means, we run the algorithm many times, each time initializing the parameters randomly, and pick the parameters that give the highest likelihood.
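A hedged sketch of the random-restart recipe described above, reusing the hypothetical em_gmm function from the earlier GMM sketch (the gmm_log_likelihood helper is also my own, not from the lecture): run EM several times from random initializations and keep the run with the highest marginal log-likelihood.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, Sigma):
    """Marginal log-likelihood: sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)."""
    dens = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                     for k in range(len(pi))], axis=1)
    return np.log(dens.sum(axis=1)).sum()

def em_with_restarts(X, K, n_restarts=10):
    best_ll, best_params = -np.inf, None
    for seed in range(n_restarts):
        pi, mu, Sigma, _ = em_gmm(X, K, seed=seed)   # em_gmm: the sketch defined earlier
        ll = gmm_log_likelihood(X, pi, mu, Sigma)
        if ll > best_ll:                             # keep the highest-likelihood run
            best_ll, best_params = ll, (pi, mu, Sigma)
    return best_params, best_ll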

Example: Grammar Induction. Grammar induction is an unsupervised learning problem: we try to recover the syntactic parse for each sentence without any supervision.

Example: Grammar Induction. (Figure: example parses; caption: no semantic interpretation ...)

Example: Grammar Induction. Training Data: sentences only, without parses. Sample 1 (x(1)): time like flies an arrow. Sample 2 (x(2)): real like flies soup. Sample 3 (x(3)): flies with fly their wings. Sample 4 (x(4)): with you time will see. Test Data: sentences with parses, so we can evaluate accuracy.

Example: Grammar Induction. Q: Does likelihood correlate with accuracy on a task we care about? A: Yes, but there is still a wide range of accuracies for a particular likelihood value. (Figure from Gimpel & Smith, NAACL 2012, slides.)

Variants of EM. Generalized EM: replace the M-Step with a single gradient step that improves the likelihood. Monte Carlo EM: approximate the E-Step by sampling. Sparse EM: keep an “active list” of points (updated occasionally) from which we estimate the expected counts in the E-Step. Incremental EM / Stepwise EM: if standard EM is described as a batch algorithm, these are the online equivalents. Etc.

A Report Card for EM. Some good things about EM: no learning rate (step-size) parameter; automatically enforces parameter constraints; very fast for low dimensions; each iteration guaranteed to improve the likelihood. Some bad things about EM: can get stuck in local optima; can be slower than conjugate gradient (especially near convergence); requires an expensive inference step; is a maximum likelihood/MAP method. © Eric Xing @ CMU, 2006-2011