What is it? When would you use it? Why does it work? How do you implement it? Where does it stand in relation to other methods? EM algorithm reading group.

EM algorithm reading group. What is it? When would you use it? Why does it work? How do you implement it? Where does it stand in relation to other methods? Outline: Introduction & Motivation; Theory; Practical; Comparison with other methods.

Expectation Maximization (EM): an iterative method for parameter estimation when you have missing data. It has two steps, Expectation (E) and Maximization (M), and is applicable to a wide range of problems. An old idea (late 1950s), but formalized by Dempster, Laird and Rubin in 1977, and the subject of much investigation since; see the McLachlan & Krishnan book (1997).
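In standard notation (not taken from the slides), with observed data x, latent variables z and parameters \theta, the two steps can be written compactly as:

```latex
% Generic EM iteration
\textbf{E-step:}\quad Q(\theta \mid \theta^{(t)})
    \;=\; \mathbb{E}_{z \sim p(z \mid x,\, \theta^{(t)})}\!\left[\log p(x, z \mid \theta)\right]
\qquad
\textbf{M-step:}\quad \theta^{(t+1)} \;=\; \arg\max_{\theta}\; Q(\theta \mid \theta^{(t)})
```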

Applications of EM (1): Fitting mixture models

Applications of EM (2): Probabilistic Latent Semantic Analysis (pLSA), a technique from the text community. [Figure: two graphical models over documents d, topics z and words w, relating P(w,d) to P(w|z) and P(z|d).]

Applications of EM (3): Learning parts and structure models

Applications of EM (4): Automatic segmentation of layers in video

Motivating example. Data: observations x = {x_1, ..., x_N}. Model: mixture of Gaussians, P(x \mid \theta) = \sum_{c=1}^{C} \pi_c \, \mathcal{N}(x \mid \mu_c, \sigma_c^2). Parameters: \theta = \{\pi_c, \mu_c, \sigma_c\}_{c=1}^{C}. OBJECTIVE: fit a mixture of Gaussians with C = 2 components, keeping the mixing weights and variances fixed, i.e. only estimate the means \mu_1, \mu_2.

Likelihood function. The likelihood is a function of the parameters \theta (with the data held fixed); the probability is a function of the random variable x (with \theta held fixed). Note: this is a different plot from the previous slide.
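For a mixture model such as the motivating example, the quantity viewed as a function of \theta is the (incomplete-data) log-likelihood:

```latex
% Incomplete-data log-likelihood of a C-component Gaussian mixture,
% viewed as a function of \theta = \{\pi_c, \mu_c, \sigma_c\} with x_1,\dots,x_N fixed
\ell(\theta) \;=\; \log p(x_1, \dots, x_N \mid \theta)
            \;=\; \sum_{i=1}^{N} \log \sum_{c=1}^{C} \pi_c \,\mathcal{N}\!\left(x_i \mid \mu_c, \sigma_c^2\right)
```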

Probabilistic model. Imagine the model generating the data: we need to introduce a label z_i \in \{1, ..., C\} for each data point, saying which component produced it. The label is called a latent variable (also called hidden, unobserved, or missing). This simplifies the problem: if we knew the labels, the components would decouple and we could estimate the parameters of each component separately.

Intuition of EM. E-step: compute a distribution over the labels of the points, using the current parameters. M-step: update the parameters using the current guess of the label distribution. The two steps alternate: E, M, E, M, ...
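To make the alternation concrete, here is a minimal Python sketch (function and variable names are illustrative, not from the slides), assuming, as in the motivating example above, that the weights and variances are fixed and only the two means are estimated:

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussian_means(x, mu_init, pi=(0.5, 0.5), sigma=(1.0, 1.0), n_iter=50):
    """EM for a 2-component 1D Gaussian mixture, updating only the means.

    x       : (N,) array of data points
    mu_init : initial guess for the two means
    pi      : fixed mixing weights
    sigma   : fixed component standard deviations
    """
    x = np.asarray(x, dtype=float)
    mu = np.array(mu_init, dtype=float)
    for _ in range(n_iter):
        # E-step: responsibilities r[c, i] = P(z_i = c | x_i, current mu)
        lik = np.stack([pi[c] * norm.pdf(x, mu[c], sigma[c]) for c in range(2)])
        r = lik / lik.sum(axis=0, keepdims=True)
        # M-step: each mean becomes a responsibility-weighted average of the data
        mu = (r * x).sum(axis=1) / r.sum(axis=1)
    return mu

# Usage on synthetic data drawn from two well-separated components
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
print(em_two_gaussian_means(data, mu_init=[0.0, 1.0]))
```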

Theory

Some definitions. Observed data: x = {x_1, ..., x_N}, continuous and i.i.d. Latent variables: z = {z_1, ..., z_N}, discrete with z_i \in \{1, ..., C\}. Iteration index: t. Log-likelihood [incomplete-data log-likelihood (ILL)]: \ell(\theta) = \log p(x \mid \theta). Complete log-likelihood (CLL): \log p(x, z \mid \theta). Expected complete log-likelihood (ECLL): \sum_z q(z) \log p(x, z \mid \theta), the expectation of the CLL under a distribution q(z) over the latent variables.

Use Jensen's inequality to get a lower bound on the log-likelihood, the AUXILIARY FUNCTION: F(q, \theta) = \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)} \le \log p(x \mid \theta).
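Written out in standard form, the bound is one application of Jensen's inequality (next slide) to the log of a sum:

```latex
% For any distribution q(z) over the latent variables:
\log p(x \mid \theta)
  \;=\; \log \sum_{z} p(x, z \mid \theta)
  \;=\; \log \sum_{z} q(z)\,\frac{p(x, z \mid \theta)}{q(z)}
  \;\ge\; \sum_{z} q(z) \log \frac{p(x, z \mid \theta)}{q(z)}
  \;=\; F(q, \theta)
```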

Jensen's Inequality. For a real continuous concave function f, points x_1, ..., x_n and weights \lambda_i \ge 0 with \sum_i \lambda_i = 1: f\left(\sum_i \lambda_i x_i\right) \ge \sum_i \lambda_i f(x_i). Equality holds when all the x_i are the same. Sketch: 1. Definition of concavity: for any x_1, x_2 and \lambda \in [0, 1], f(\lambda x_1 + (1 - \lambda) x_2) \ge \lambda f(x_1) + (1 - \lambda) f(x_2). 2. For points x_1, ..., x_n lying in an interval and weights \lambda_i \ge 0 with \sum_i \lambda_i = 1, the combination \sum_i \lambda_i x_i also lies in that interval, and the general statement follows by induction on n.

Recall the key result: the auxiliary function F(q, \theta) is a LOWER BOUND on the log-likelihood. EM is alternating ascent: alternately improve q (E-step), then \theta (M-step). This is guaranteed to improve (or at least not decrease) the likelihood itself.
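The guarantee follows from the chain below, using the fact that the M-step maximizes F over \theta and that the E-step makes the bound tight (next slide):

```latex
% Monotonicity of EM: a full E/M cycle cannot decrease the log-likelihood
\ell(\theta^{(t+1)})
  \;\ge\; F(q^{(t+1)}, \theta^{(t+1)})   % F is a lower bound for any q
  \;\ge\; F(q^{(t+1)}, \theta^{(t)})     % M-step maximizes F over \theta
  \;=\; \ell(\theta^{(t)})               % E-step makes the bound tight
```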

E-step: choosing the optimal q(z \mid x, \theta). It turns out that q(z) = p(z \mid x, \theta^{(t)}), the posterior over the latent variables under the current parameters, is the best choice.
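The reason, in standard form: the gap between the log-likelihood and the auxiliary function is a KL divergence, which vanishes exactly when q is the posterior:

```latex
% Gap between the log-likelihood and the lower bound
\log p(x \mid \theta) - F(q, \theta)
  \;=\; \sum_{z} q(z) \log \frac{q(z)}{p(z \mid x, \theta)}
  \;=\; \mathrm{KL}\!\left(q(z) \,\|\, p(z \mid x, \theta)\right) \;\ge\; 0,
\qquad \text{with equality iff } q(z) = p(z \mid x, \theta).
```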

E-step: what do we actually compute? An nComponents x nPoints matrix whose columns sum to 1, with entry r_{ci} the responsibility of component c for point i: r_{ci} = p(z_i = c \mid x_i, \theta^{(t)}) = \frac{\pi_c \, \mathcal{N}(x_i \mid \mu_c, \sigma_c^2)}{\sum_{c'} \pi_{c'} \, \mathcal{N}(x_i \mid \mu_{c'}, \sigma_{c'}^2)}. [Table: rows Component 1 and Component 2; columns Point 1, Point 2, ..., Point 6.]
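A small Python sketch of this computation (illustrative names, not from the slides); one common choice, assumed here, is to normalize in log-space so very small densities do not underflow:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def responsibilities(x, pi, mu, sigma):
    """Return the (nComponents, nPoints) responsibility matrix for a 1D GMM."""
    x = np.asarray(x, dtype=float)
    # log of pi_c * N(x_i | mu_c, sigma_c^2), shape (C, N)
    log_joint = np.stack([
        np.log(pi[c]) + norm.logpdf(x, mu[c], sigma[c])
        for c in range(len(pi))
    ])
    # normalize each column: r[c, i] = exp(log_joint[c, i] - logsumexp over components)
    return np.exp(log_joint - logsumexp(log_joint, axis=0, keepdims=True))

r = responsibilities([-2.1, -1.9, 3.0, 3.2], pi=[0.5, 0.5], mu=[-2.0, 3.0], sigma=[1.0, 1.0])
print(r.sum(axis=0))  # each column sums to 1
```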

E-step: Alternative derivation

M-step. The auxiliary function separates into the ECLL and an entropy term: F(q, \theta) = \sum_z q(z) \log p(x, z \mid \theta) - \sum_z q(z) \log q(z) = \text{ECLL} + H(q). Only the ECLL depends on \theta.

M-step (continued). From the previous slide, the M-step maximizes the ECLL over \theta. Recall the definition of the ECLL, with q(z) = p(z \mid x, \theta^{(t)}), i.e. the responsibilities r_{ci}, coming from the E-step: \sum_{i=1}^{N} \sum_{c=1}^{C} r_{ci} \log\big(\pi_c \, \mathcal{N}(x_i \mid \mu_c, \sigma_c^2)\big). Let's see what happens for the Gaussian means in the motivating example.
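For the component means (the parameters being estimated in the motivating example as set up above), setting the derivative of the ECLL with respect to \mu_c to zero gives the familiar closed-form update:

```latex
% M-step for the means: responsibility-weighted average of the data
\frac{\partial}{\partial \mu_c} \sum_{i=1}^{N} \sum_{c'=1}^{C} r_{c'i}\,
      \log\!\big(\pi_{c'}\,\mathcal{N}(x_i \mid \mu_{c'}, \sigma_{c'}^2)\big) \;=\; 0
\quad\Longrightarrow\quad
\mu_c^{(t+1)} \;=\; \frac{\sum_{i=1}^{N} r_{ci}\, x_i}{\sum_{i=1}^{N} r_{ci}}
```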

Practical

Practical issues.
Initialization: mean of the data + random offset; K-means.
Termination: max # iterations; log-likelihood change; parameter change.
Convergence: local maxima; annealed methods (DAEM); birth/death processes (SMEM).
Numerical issues: a component collapsing onto a single point gives infinite likelihood; inject noise into the covariance matrix to prevent blow-up.
Number of components: open problem; minimum description length; Bayesian approach.
A sketch of the initialization and covariance fixes follows below.
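A minimal Python sketch of two of these items (initialization and covariance regularization); the function names, offset scale and regularization constant are illustrative choices, not from the slides:

```python
import numpy as np

def init_means(x, n_components, seed=None, offset_scale=0.1):
    """Initialize component means as the data mean plus a small random offset."""
    rng = np.random.default_rng(seed)
    x = np.atleast_2d(x)
    center = x.mean(axis=0)
    spread = x.std(axis=0)
    return center + offset_scale * spread * rng.standard_normal((n_components, x.shape[1]))

def regularize_covariance(cov, eps=1e-6):
    """Add a small diagonal term so a covariance fitted to (almost) a single
    point stays positive definite and the likelihood cannot blow up."""
    return cov + eps * np.eye(cov.shape[0])

# Usage
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
print(init_means(x, n_components=2, seed=0))
print(regularize_covariance(np.zeros((2, 2))))
```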

Local minima

Robustness of EM

What EM won't do:
Pick the structure of the model (# components, graph structure).
Find the global maximum.
Always have nice closed-form updates (you may have to optimize numerically within the E/M step).
Avoid computational problems (you may need sampling methods for computing the expectations).

Comparison with other methods

Why not use standard optimization methods? In favour of EM: no step size to choose; works directly in the parameter space of the model, so parameter constraints are obeyed; fits naturally into the graphical model framework; supposedly faster.

[Figure slides: convergence comparison of Gradient, Newton and EM methods.]

Acknowledgements. Shameless stealing of figures, equations and explanations from: Frank Dellaert, Michael Jordan, Yair Weiss.