CSE 446: Expectation Maximization (EM) Winter 2012 Daniel Weld Slides adapted from Carlos Guestrin, Dan Klein & Luke Zettlemoyer

Machine Learning: Supervised Learning (parametric vs. non-parametric), Reinforcement Learning, Unsupervised Learning. This week: Fri, K-means & agglomerative clustering; Mon, Expectation Maximization (EM); Wed, Principal Component Analysis (PCA).

K-Means: an iterative clustering algorithm.
– Pick K random points as cluster centers (means)
– Alternate: assign each data instance to the closest mean; then move each mean to the average of its assigned points
– Stop when no point's assignment changes
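
A minimal NumPy sketch of this procedure (illustrative only; the function and variable names are my own, not from the slides):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    # X: (n, d) data matrix; returns (means, assignments)
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # K random points as initial means
    assign = None
    for t in range(iters):
        # Assignment step: re-assign each point to its closest mean
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # no point's assignment changed
        assign = new_assign
        # Update step: move each mean to the average of its assigned points
        for j in range(k):
            if np.any(assign == j):
                means[j] = X[assign == j].mean(axis=0)
    return means, assign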

K-Means as Optimization: consider the total distance to the means, Φ(a, c) = Σ_i dist(x_i, c_{a_i}), over the points x_i, their assignments a_i, and the means c. Two stages each iteration:
– Update assignments: fix the means c, change the assignments a
– Update means: fix the assignments a, change the means c
This is coordinate descent on Φ. Will it converge? Yes: the change from either update can only decrease Φ.

Phase I: Update Assignments (Expectation). For each point, re-assign it to the closest mean. This can only decrease the total distance Φ!

Phase II: Update Means (Maximization) Move each mean to the average of its assigned points: Also can only decrease total distance… (Why?) Fun fact: the point y with minimum squared Euclidean distance to a set of points {x} is their mean

Preview: EM, another iterative clustering algorithm. Pick K random cluster models. Alternate:
– Assign data instances proportionately to the different models
– Revise each cluster model based on its (proportionately) assigned points
Stop when nothing changes.

K-Means Getting Stuck A local optimum: Why doesn’t this work out like the earlier example, with the purple taking over half the blue?

Preference for Equally Sized Clusters

The Evils of “Hard Assignments”? Clusters may overlap. Some clusters may be “wider” than others. Distances can be deceiving!

Probabilistic Clustering: try a probabilistic model! It allows overlaps, clusters of different sizes, etc., and we can tell a generative story for the data (P(X|Y) P(Y) is common). Challenge: we need to estimate the model parameters without labeled Ys. (Table: instances with observed features X1, X2 but unknown labels Y.)

The General GMM assumption. P(Y): there are k components. P(X|Y): each component i generates data from a multivariate Gaussian with mean μ_i and covariance matrix Σ_i. Each data point is sampled from a generative process: (1) choose component i with probability P(y=i); (2) generate a datapoint x ~ N(μ_i, Σ_i).

What Model Should We Use? Depends on X! Here, maybe Gaussian Naïve Bayes: a multinomial over the clusters Y, and a Gaussian over each X_i given Y. (Table: instances with observed features X1, X2 but unknown labels Y.)

Could we make fewer assumptions? What if the X_i co-vary? What if there are multiple peaks? Gaussian Mixture Models! P(Y) is still multinomial, but P(X|Y) is a multivariate Gaussian distribution.

The General GMM assumption: (1) What's a multivariate Gaussian? (2) What's a mixture model?

Review: Gaussians

Learning Gaussian Parameters (given fully-observable data)
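
The formula on this slide was an image; the standard maximum-likelihood estimates for a fully observed Gaussian (a reconstruction of the standard result, not copied from the deck) are:

\hat{\mu} = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{j=1}^{n} (x_j - \hat{\mu})(x_j - \hat{\mu})^\top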

Multivariate Gaussians: P(X = x_j) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp(−½ (x_j − μ)^T Σ^{-1} (x_j − μ)). The covariance matrix Σ captures the degree to which the coordinates x_i vary together; its eigenvalues λ give the variances along the principal axes.

Multivariate Gaussians: Σ ∝ identity matrix (spherical contours).

Multivariate Gaussians: Σ = diagonal matrix, so the X_i are independent, à la Gaussian Naïve Bayes.

Multivariate Gaussians: Σ = arbitrary (positive semidefinite) matrix; its eigenvectors specify a rotation (change of basis) and its eigenvalues specify the relative elongation along each axis.
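
A small illustrative snippet contrasting the three covariance structures (my own example, not from the slides):

import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(2)
spherical = 2.0 * np.eye(2)                      # Sigma proportional to identity: round contours
diagonal  = np.diag([4.0, 0.25])                 # diagonal Sigma: axis-aligned ellipses
R = np.array([[np.cos(0.6), -np.sin(0.6)],
              [np.sin(0.6),  np.cos(0.6)]])
full      = R @ diagonal @ R.T                   # arbitrary PSD Sigma: rotated ellipses

for name, cov in [("spherical", spherical), ("diagonal", diagonal), ("full", full)]:
    vals, vecs = np.linalg.eigh(cov)             # eigenvalues give the relative elongation
    density = multivariate_normal(mu, cov).pdf([1.0, 1.0])
    print(name, "eigenvalues:", np.round(vals, 3), "p([1,1]):", round(float(density), 4))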

The General GMM assumption: (1) What's a multivariate Gaussian? (2) What's a mixture model?

Mixtures of Gaussians (1): the Old Faithful data set (axes: time to eruption vs. duration of last eruption).

Mixtures of Gaussians (1): the Old Faithful data set, fit with a single Gaussian vs. a mixture of two Gaussians.

Mixtures of Gaussians (2): combine simple models into a complex model, p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k), where N(x | μ_k, Σ_k) is the k-th component and π_k is its mixing coefficient (here K = 3).
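
A short sketch of evaluating this mixture density (illustrative code, not from the slides; the helper name gmm_density is my own):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, covs):
    # p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)
    return sum(pi * multivariate_normal(mu, cov).pdf(x)
               for pi, mu, cov in zip(pis, mus, covs))

# K = 3 components in two dimensions
pis  = [0.5, 0.3, 0.2]                                            # mixing coefficients (sum to 1)
mus  = [np.zeros(2), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
covs = [np.eye(2), 2.0 * np.eye(2), np.diag([1.0, 4.0])]
print(gmm_density(np.array([1.0, 1.0]), pis, mus, covs))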

Mixtures of Gaussians (3)

Eliminating Hard Assignments to Clusters: model the data as a mixture of multivariate Gaussians, where π_i is the probability that a point was generated from the i-th Gaussian.

Detour/Review: Supervised MLE for GMM. How do we estimate parameters for Gaussian mixtures with fully supervised data? We have to define an objective and solve an optimization problem. For example, the MLE estimate has a closed-form solution:
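
The closed-form solution itself was shown as an image; the standard supervised MLE for a GMM (my reconstruction of the standard result, not copied from the deck), with n_i = |{j : y_j = i}|, is:

\hat{P}(y=i) = \frac{n_i}{n}, \qquad \hat{\mu}_i = \frac{1}{n_i}\sum_{j: y_j = i} x_j, \qquad \hat{\Sigma}_i = \frac{1}{n_i}\sum_{j: y_j = i} (x_j - \hat{\mu}_i)(x_j - \hat{\mu}_i)^\top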

Compare: a univariate Gaussian, p(x) = N(x | μ, σ²), with a mixture of multivariate Gaussians, p(x) = Σ_{i=1}^{k} π_i N(x | μ_i, Σ_i).

That was easy! But what if some of the data are unobserved? MLE: argmax_θ ∏_j P(y_j, x_j), where θ is all the model parameters (e.g., the class probabilities, means, and variances for naïve Bayes). But we don't know the y_j's! So maximize the marginal likelihood instead: argmax_θ ∏_j P(x_j) = argmax_θ ∏_j Σ_{i=1}^{k} P(y_j = i, x_j).

How do we optimize? Is there a closed form? Maximize the marginal likelihood: argmax_θ ∏_j P(x_j) = argmax_θ ∏_j Σ_{i=1}^{k} P(y_j = i, x_j). This is almost always a hard problem!
– Usually there is no closed-form solution
– Even when log P(X,Y) is well-behaved (concave in θ), log P(X) generally isn't
– For all but the simplest P(X), we will have to do gradient ascent in a big, messy space with lots of local optima

Simple example: learn the means only! Consider 1-D data, a mixture of k = 2 Gaussians, variances fixed to σ = 1, and a uniform distribution over classes. Just estimate μ_1 and μ_2.

Marginal Likelihood for a Mixture of Two Gaussians: a graph of log P(x_1, x_2, …, x_n | μ_1, μ_2) against μ_1 and μ_2. The maximum likelihood is at (μ_1 = -2.13, μ_2 = 1.668). There is a local optimum, very close in value to the global one, at (μ_1 = 2.085, μ_2 = -1.257)*. (*This corresponds to switching the labels of the two classes.)

Learning general mixtures of Gaussians. Marginal likelihood: ∏_j Σ_{i=1}^{k} P(y_j = i) N(x_j | μ_i, Σ_i). We need to differentiate and solve for μ_i, Σ_i, and P(Y = i) for i = 1..k. There is no closed-form solution, the gradient is complex, and there are lots of local optima. Wouldn't it be nice if there were a better way?!

Expectation Maximization

The EM Algorithm: a clever method for maximizing the marginal likelihood, argmax_θ ∏_j P(x_j) = argmax_θ ∏_j Σ_{i=1}^{k} P(y_j = i, x_j). It is a type of ascent method that can be easy to implement (e.g., no line search or learning rates). Alternate between two steps:
– Compute an expectation
– Compute a maximization
Not magic: we are still optimizing a non-convex function with lots of local optima; the computations are just easier (often, significantly so!).

EM: Two Easy Steps. Objective: argmax_θ ∏_j Σ_{i=1}^{k} P(y_j = i, x_j | θ), i.e., argmax_θ Σ_j log Σ_{i=1}^{k} P(y_j = i, x_j | θ). Data: {x_j | j = 1..n}.
E-step: compute expectations to “fill in” the missing y values according to the current parameters θ_t: for all examples j and values i of y, compute P(y_j = i | x_j, θ_t).
M-step: re-estimate the parameters with “weighted” MLE estimates: set θ_{t+1} = argmax_θ Σ_j Σ_{i=1}^{k} P(y_j = i | x_j, θ_t) log P(y_j = i, x_j | θ).
EM is especially useful when the E and M steps have closed-form solutions! (The deck's notation for the parameters is a bit inconsistent; θ is used throughout here.)

Simple example: learn the means only! Consider 1-D data, a mixture of k = 2 Gaussians, variances fixed to σ = 1, and a uniform distribution over classes. Just need to estimate μ_1 and μ_2.

EM for GMMs, learning only the means. Iterate: on the t-th iteration let our estimates be θ_t = {μ_1^(t), μ_2^(t), …, μ_k^(t)}.
E-step: compute the “expected” classes of all datapoints, P(y_j = i | x_j, θ_t) ∝ exp(−(x_j − μ_i^(t))² / (2σ²)) P(y_j = i).
M-step: compute the most likely new μ's given the class expectations, μ_i^(t+1) = Σ_j P(y_j = i | x_j, θ_t) x_j / Σ_j P(y_j = i | x_j, θ_t).
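
A minimal sketch of this means-only EM in NumPy (illustrative code assuming 1-D data, fixed variance, and a uniform class prior; the names are my own):

import numpy as np

def em_means_only(x, k=2, sigma=1.0, iters=50, seed=0):
    # x: 1-D array of data; only the k means are learned
    rng = np.random.default_rng(seed)
    mus = rng.choice(x, size=k, replace=False).astype(float)           # random initial means
    for _ in range(iters):
        # E-step: responsibilities P(y_j = i | x_j, theta_t); the uniform prior cancels
        logw = -((x[:, None] - mus[None, :]) ** 2) / (2 * sigma ** 2)   # (n, k)
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted means
        mus = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    return mus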

E.M. for General GMMs. Iterate: on the t-th iteration let our estimates be θ_t = {μ_1^(t), …, μ_k^(t), Σ_1^(t), …, Σ_k^(t), p_1^(t), …, p_k^(t)}, where p_i^(t) is shorthand for the estimate of P(y = i) on the t-th iteration.
E-step: compute the “expected” classes of all datapoints for each class, P(y_j = i | x_j, θ_t) ∝ p_i^(t) N(x_j | μ_i^(t), Σ_i^(t)): just evaluate a Gaussian at x_j.
M-step: compute the weighted MLE for the parameters given the expected classes above (m = number of training examples):
p_i^(t+1) = (1/m) Σ_j P(y_j = i | x_j, θ_t),
μ_i^(t+1) = Σ_j P(y_j = i | x_j, θ_t) x_j / Σ_j P(y_j = i | x_j, θ_t),
Σ_i^(t+1) = Σ_j P(y_j = i | x_j, θ_t) (x_j − μ_i^(t+1))(x_j − μ_i^(t+1))^T / Σ_j P(y_j = i | x_j, θ_t).
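
A compact NumPy/SciPy sketch of these updates (illustrative only; the initialization details and the small ridge term are my own choices, not from the slides; assumes d >= 2):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=100, seed=0):
    # X: (n, d) data matrix; returns mixing weights, means, covariances
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(n, size=k, replace=False)].astype(float)           # initial means
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])   # initial covariances
    ps = np.full(k, 1.0 / k)                                              # initial mixing weights
    for _ in range(iters):
        # E-step: responsibilities r[j, i] = P(y_j = i | x_j, theta_t)
        r = np.stack([ps[i] * multivariate_normal(mus[i], covs[i]).pdf(X)
                      for i in range(k)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted MLE
        nk = r.sum(axis=0)                       # effective count for each component
        ps = nk / n
        mus = (r.T @ X) / nk[:, None]
        for i in range(k):
            diff = X - mus[i]
            covs[i] = (r[:, i, None] * diff).T @ diff / nk[i] + 1e-6 * np.eye(d)
    return ps, mus, covs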

Gaussian Mixture Example: Start

After first iteration

After 2nd iteration

After 3rd iteration

After 4th iteration

After 5th iteration

After 6th iteration

After 20th iteration

Some Bio Assay data

GMM clustering of the assay data

Resulting Density Estimator

Three classes of assay (each learned with its own mixture model)

What if we do hard assignments? Iterate: on the t-th iteration let our estimates be θ_t = {μ_1^(t), μ_2^(t), …, μ_k^(t)}.
E-step: compute the “expected” classes of all datapoints, but with δ(y_j = i), a hard assignment of each point to its “most likely” (nearest) cluster.
M-step: compute the most likely new μ's given the class assignments, μ_i^(t+1) = Σ_j δ(y_j = i) x_j / Σ_j δ(y_j = i).
This is equivalent to the k-means clustering algorithm!
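
Concretely, the only change relative to the soft E-step is replacing the responsibilities with one-hot indicators (a hypothetical snippet in the spirit of the earlier sketches):

import numpy as np

def hard_e_step(X, mus):
    # delta(y_j = i): one-hot assignment of each point to its nearest mean
    dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
    r = np.zeros_like(dists)
    r[np.arange(len(X)), dists.argmin(axis=1)] = 1.0
    return r   # feeding this into the same M-step recovers k-means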

Let's look at the math behind the magic! We will argue that EM: optimizes a bound on the likelihood, is a type of coordinate ascent, and is guaranteed to converge to an (often local) optimum.

The general learning problem with missing data. x is observed; z (e.g., the class labels y) is missing. Marginal likelihood: l(θ : Data) = Σ_j log P(x_j | θ) = Σ_j log Σ_z P(x_j, z | θ). Objective: find argmax_θ l(θ : Data).

Skipping the gnarly math: EM converges (the E-step doesn't decrease F(θ, Q), and neither does the M-step), and EM is coordinate ascent.

A Key Computation: the E-step. x is observed, z is missing. Compute the probability of the missing data given the current choice of θ, i.e., Q(z | x_j) for each x_j (the probability computed during the classification step); this corresponds to the “classification step” in K-means.

Jensen's inequality. Theorem: log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z); e.g., the binary (two-point) case illustrates this. In fact, it holds for any concave function applied to an expectation (and reverses for convex functions).

Applying Jensen's inequality: use log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z) with P(z) = Q(z | x_j) and f(z) = P(z, x_j | θ) / Q(z | x_j).
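
Written out (a standard derivation, reconstructed here because the slide's equations were an image):

\log P(x_j \mid \theta) \;=\; \log \sum_z Q(z \mid x_j)\,\frac{P(z, x_j \mid \theta)}{Q(z \mid x_j)} \;\ge\; \sum_z Q(z \mid x_j)\,\log \frac{P(z, x_j \mid \theta)}{Q(z \mid x_j)}

Summing over j gives the lower bound F(θ, Q) used in the following slides.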

The M-step. Lower bound: F(θ, Q) = Σ_j Σ_z Q(z | x_j) log [P(z, x_j | θ) / Q(z | x_j)]. Maximization step: θ^(t+1) = argmax_θ F(θ, Q^(t+1)); we are optimizing a lower bound! Use expected counts to do weighted learning: if learning requires Count(x, z), use E_{Q^(t+1)}[Count(x, z)]. (It looks a bit like boosting!)

Convergence of EM. Define the potential function F(θ, Q): the lower bound from Jensen's inequality above. EM is coordinate ascent on F, and thus maximizes a lower bound on the marginal log-likelihood.

M-step can’t decrease F(θ,Q): by definition! We are maximizing F directly, by ignoring a constant!

E-step: it takes more work to show that F(θ, Q) doesn't decrease. KL divergence, KL(Q ∥ P) = Σ_z Q(z) log [Q(z) / P(z)], measures the distance between distributions; KL = 0 if and only if Q = P.
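
The identity behind the next two slides (a standard decomposition, reconstructed here since the slide's equations were images):

F(\theta, Q) \;=\; \sum_j \log P(x_j \mid \theta) \;-\; \sum_j KL\big(Q(\cdot \mid x_j)\,\|\,P(\cdot \mid x_j, \theta)\big)

So for a fixed θ, maximizing F over Q means driving each KL term to zero, i.e., setting Q(z | x_j) = P(z | x_j, θ).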

E-step also doesn't decrease F, Step 1: fix θ to θ^(t) and take a max over Q:

E-step also doesn't decrease F, Step 2: fixing θ to θ^(t), the max over Q yields Q(z | x_j) = P(z | x_j, θ^(t)). Why? The likelihood term is a constant, and the KL term is zero iff its arguments are the same distribution. So the E-step is actually a maximization, a tightening of the bound: it ensures that F(θ^(t), Q^(t+1)) = Σ_j log P(x_j | θ^(t)).

EM is coordinate ascent. M-step: fix Q, maximize F over θ (a lower bound on the marginal log-likelihood). E-step: fix θ, maximize F over Q; this “realigns” F with the likelihood, so that F(θ^(t), Q^(t+1)) = Σ_j log P(x_j | θ^(t)).

What you should know. K-means for clustering: the algorithm, and why it converges (it is coordinate descent on Φ). Know what agglomerative clustering is. EM for mixtures of Gaussians: it is also a coordinate ascent method (on the lower bound F); how to “learn” maximum-likelihood parameters (locally maximum likelihood) in the case of unlabeled data; and its relation to K-means (hard vs. soft clustering, probabilistic model). Remember, E.M. can get stuck in local optima, and empirically it DOES.

Acknowledgements. The K-means & Gaussian mixture models presentation contains material from an excellent tutorial by Andrew Moore. K-means applet: …torial_html/AppletKM.html. Gaussian mixture models applet: …ml.

Solution to #3, Learning: given x_1…x_N, how do we learn the model parameters θ (the transition, emission, and initial-state probabilities) to maximize P(x)? Unfortunately, there is no known way to analytically find a global maximum θ*, i.e., θ* = argmax_θ P(o | θ). But it is possible to find a local maximum: given an initial model θ, we can always find a model θ′ such that P(o | θ′) ≥ P(o | θ).

Chicken & Egg Problem. If we knew the actual sequence of states, it would be easy to learn the transition and emission probabilities; but we can't observe the states, so we don't! If we knew the transition & emission probabilities, it would be easy to estimate the sequence of states (Viterbi); but we don't know them! Slide by Daniel S. Weld

Simplest Version: a mixture of two distributions. Known: the form of the distributions and their variance (= 5). We just need the mean of each distribution. Slide by Daniel S. Weld

Input Looks Like… Slide by Daniel S. Weld

We Want to Predict: the hidden class (“color”) of each point. Slide by Daniel S. Weld

Chicken & Egg: note that coloring the instances would be easy if we knew the Gaussians… Slide by Daniel S. Weld

Chicken & Egg: …and finding the Gaussians would be easy if we knew the coloring. Slide by Daniel S. Weld

Expectation Maximization (EM): pretend we do know the parameters. Initialize randomly: set μ_1 = ?; μ_2 = ?. Slide by Daniel S. Weld

Expectation Maximization (EM): pretend we do know the parameters (initialize randomly). [E step] Compute the probability of each instance having each possible value of the hidden variable. Slide by Daniel S. Weld

Expectation Maximization (EM): pretend we do know the parameters (initialize randomly). [E step] Compute the probability of each instance having each possible value of the hidden variable. [M step] Treating each instance as fractionally having both values, compute the new parameter values. Slide by Daniel S. Weld

ML Mean of a Single Gaussian: μ_ML = argmin_u Σ_i (x_i − u)² = (1/N) Σ_i x_i. Slide by Daniel S. Weld
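
Setting the derivative to zero shows why (a standard derivation, not from the slide):

\frac{d}{du}\sum_i (x_i - u)^2 \;=\; -2\sum_i (x_i - u) \;=\; 0 \;\;\Rightarrow\;\; u = \frac{1}{N}\sum_i x_i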

Expectation Maximization (EM), subsequent iterations: repeat the [E step] (compute the probability of each instance having each possible value of the hidden variable) and the [M step] (treating each instance as fractionally having both values, compute the new parameter values) until convergence. Slide by Daniel S. Weld

EM for HMMs. [E step] Compute the probability of each instance having each possible value of the hidden variable: compute the forward and backward probabilities for the given model parameters and our observations. [M step] Treating each instance as fractionally having both values, compute the new parameter values: re-estimate the model parameters by simple (expected) counting.
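
A compact sketch of the E-step computation for a discrete HMM (illustrative code; the notation A = transition matrix, B = emission matrix, pi = initial distribution is my own assumption, not from the slides):

import numpy as np

def forward_backward(obs, pi, A, B):
    # obs: sequence of observation indices; returns per-step state posteriors (gammas)
    n_states, T = A.shape[0], len(obs)
    alpha = np.zeros((T, n_states))              # forward probabilities
    beta = np.zeros((T, n_states))               # backward probabilities
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)    # P(state at time t | all observations)
    return gamma

For a full M-step one also needs the pairwise posteriors over consecutive states, from which the transition and emission probabilities are re-estimated by normalized expected counts.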