
1 CSE 446: Expectation Maximization (EM) Winter 2012 Daniel Weld Slides adapted from Carlos Guestrin, Dan Klein & Luke Zettlemoyer

2 Machine Learning: Supervised Learning, Unsupervised Learning, Reinforcement Learning; Parametric vs. Non-parametric methods. Schedule: Fri: K-means & Agglomerative Clustering; Mon: Expectation Maximization (EM); Wed: Principal Component Analysis (PCA)

3 K-Means An iterative clustering algorithm –Pick K random points as cluster centers (means) –Alternate: Assign data instances to closest mean Assign each mean to the average of its assigned points –Stop when no points’ assignments change
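A minimal NumPy sketch of this loop (illustrative only; function and variable names are my own, not from the course):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Lloyd's algorithm. X: (n, d) float array of data, k: number of clusters."""
    rng = np.random.default_rng(seed)
    # Pick K random data points as the initial cluster centers (means).
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignments = None
    for _ in range(max_iters):
        # Assignment step: send each point to its closest mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # no point changed cluster -> converged
        assignments = new_assignments
        # Update step: move each mean to the average of its assigned points.
        for i in range(k):
            members = X[assignments == i]
            if len(members) > 0:
                centers[i] = members.mean(axis=0)
    return centers, assignments
```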

4 K-Means as Optimization Consider the total distance from the points to their assigned means: Φ(a, c) = Σ_j ||x_j − c_{a_j}||², where the x_j are the points, a the assignments, and c the means. Two stages each iteration: –Update assignments: fix means c, change assignments a –Update means: fix assignments a, change means c. This is coordinate descent on Φ. Will it converge? –Yes! The change from either update can only decrease Φ.
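The two update rules, written out (a LaTeX reconstruction of the equations that were shown as images on the slide, using the Φ, a, c notation above):

```latex
% Update assignments (means c fixed):
a_j \leftarrow \arg\min_i \; \lVert x_j - c_i \rVert^2
% Update means (assignments a fixed):
c_i \leftarrow \frac{1}{\lvert \{ j : a_j = i \} \rvert} \sum_{j : a_j = i} x_j
```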

5 Phase I: Update Assignments (Expectation) For each point, re-assign it to the closest mean. This can only decrease the total distance Φ!

6 Phase II: Update Means (Maximization) Move each mean to the average of its assigned points: Also can only decrease total distance… (Why?) Fun fact: the point y with minimum squared Euclidean distance to a set of points {x} is their mean
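A one-line check of the "fun fact" above (standard calculus, not from the slides): setting the gradient of the total squared distance to zero gives the mean.

```latex
\nabla_y \sum_j \lVert x_j - y \rVert^2
  = \sum_j -2\,(x_j - y) = 0
\quad\Longrightarrow\quad
y = \frac{1}{n} \sum_{j=1}^{n} x_j
```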

7 Preview: EM Another iterative clustering algorithm. Pick K random cluster models. Alternate: –Assign data instances proportionately to the different models –Revise each cluster model based on its (proportionately) assigned points. Stop when nothing changes.

8 K-Means Getting Stuck A local optimum: Why doesn’t this work out like the earlier example, with the purple taking over half the blue?

9 Preference for Equally Sized Clusters

10 The Evils of “Hard Assignments”? Clusters may overlap Some clusters may be “wider” than others Distances can be deceiving!

11 Probabilistic Clustering Try a probabilistic model! It allows overlaps, clusters of different size, etc. Can tell a generative story for data – P(X|Y) P(Y) is common. Challenge: we need to estimate model parameters without labeled Ys.

  Y    X1     X2
  ??   0.1    2.1
  ??   0.5   -1.1
  ??   0.0    3.0
  ??  -0.1   -2.0
  ??   0.2    1.5
  …    …      …

12 The General GMM assumption P(Y): There are k components. P(X|Y): Each component generates data from a multivariate Gaussian with mean μ_i and covariance matrix Σ_i. Each data point is sampled from a generative process: 1. Choose component i with probability P(y=i) 2. Generate a datapoint ~ N(μ_i, Σ_i)
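A short NumPy sketch of this generative process (my own code and parameter values, for illustration only):

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Sample n points from the GMM generative process on this slide.
    weights: (k,) mixing probabilities P(y=i); means: (k, d); covs: (k, d, d)."""
    rng = np.random.default_rng(seed)
    k = len(weights)
    X, Y = [], []
    for _ in range(n):
        # 1. Choose component i with probability P(y=i).
        i = rng.choice(k, p=weights)
        # 2. Generate the datapoint from N(mu_i, Sigma_i).
        X.append(rng.multivariate_normal(means[i], covs[i]))
        Y.append(i)
    return np.array(X), np.array(Y)

# Example: two 2D components with made-up parameters.
X, Y = sample_gmm(
    n=500,
    weights=np.array([0.3, 0.7]),
    means=np.array([[-2.0, 0.0], [2.0, 1.0]]),
    covs=np.array([np.eye(2), [[1.0, 0.5], [0.5, 1.0]]]),
)
```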

13 What Model Should We Use? Depends on X! Here, maybe Gaussian Naïve Bayes? – Multinomial over clusters Y – Gaussian over each X i given Y YX1X1 X2X2 ??0.12.1 ??0.5-1.1 ??0.03.0 ??-0.1-2.0 ??0.21.5 ………

14 Could we make fewer assumptions? What if the X i co-vary? What if there are multiple peaks? Gaussian Mixture Models! – P(Y) still multinomial – P(X|Y) is a multivariate Gaussian dist’n

15 The General GMM assumption 1. What's a Multivariate Gaussian? 2. What's a Mixture Model?

16 Review: Gaussians

17 Learning Gaussian Parameters (given fully-observable data)

18 Multivariate Gaussians P(X = x_j) = (2π)^(−d/2) |Σ|^(−1/2) exp( −(1/2) (x_j − μ)^T Σ^(−1) (x_j − μ) ). The covariance matrix Σ captures the degree to which the x_i vary together; the eigenvalues λ of Σ give the elongation along each principal axis.
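A quick SciPy check of this density (illustrative mean and covariance values; not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])          # covariance: how the x_i vary together
density = multivariate_normal(mean=mu, cov=Sigma)
print(density.pdf([0.0, 0.0]))          # peak of the bump, at the mean
print(density.pdf([2.0, 1.0]))          # lower density away from the mean

# Eigen-decomposition of Sigma: eigenvectors give the ellipse axes,
# eigenvalues (lambda) give the relative elongation along those axes.
eigvals, eigvecs = np.linalg.eigh(Sigma)
print(eigvals)
```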

19 Multivariate Gaussians Σ proportional to the identity matrix: equal variance in every direction, no correlations between the X_i.

20 Multivariate Gaussians Σ = diagonal matrix: the X_i are independent, as in Gaussian Naïve Bayes.

21 Multivariate Gaussians Σ = arbitrary (positive semidefinite) matrix: the eigenvectors specify a rotation (change of basis) and the eigenvalues specify the relative elongation along each axis.

22 The General GMM assumption 1. What's a Multivariate Gaussian? 2. What's a Mixture Model?

23 Mixtures of Gaussians (1) Old Faithful Data Set Time to Eruption Duration of Last Eruption

24 Mixtures of Gaussians (1) Old Faithful Data Set: single Gaussian vs. mixture of two Gaussians

25 Mixtures of Gaussians (2) Combine simple models into a complex model: p(x) = Σ_{k=1..K} π_k N(x | μ_k, Σ_k), where N(x | μ_k, Σ_k) is the k-th component and π_k is its mixing coefficient (here K=3).
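A small sketch of evaluating this mixture density (the K=3 parameter values below are made up for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# p(x) = sum_k pi_k * N(x | mu_k, Sigma_k), with K = 3 components.
pis    = [0.5, 0.3, 0.2]                       # mixing coefficients, sum to 1
mus    = [np.array([0.0, 0.0]),
          np.array([3.0, 3.0]),
          np.array([-3.0, 2.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.5])]

def mixture_pdf(x):
    """Weighted sum of the component densities at point x."""
    return sum(pi * multivariate_normal(mu, Sigma).pdf(x)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))

print(mixture_pdf([0.0, 0.0]))
```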

26 Mixtures of Gaussians (3)

27 Eliminating Hard Assignments to Clusters Model data as mixture of multivariate Gaussians

28 Eliminating Hard Assignments to Clusters Model data as mixture of multivariate Gaussians

29 Eliminating Hard Assignments to Clusters Model data as mixture of multivariate Gaussians π i = probability point was generated from i th Gaussian

30 Eliminating Hard Assignments to Clusters Model data as mixture of multivariate Gaussians

31 Eliminating Hard Assignments to Clusters Model data as mixture of multivariate Gaussians π i = probability point was generated from i th Gaussian

32 Detour/Review: Supervised MLE for GMM How do we estimate parameters for Gaussian Mixtures with fully supervised data? Have to define objective and solve optimization problem. For example, MLE estimate has closed form solution:
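The closed-form solution itself appeared as an image on the slide; these are the standard supervised-MLE formulas for a GMM, with n_i the number of points labeled with class i out of n total:

```latex
\hat{\pi}_i = \frac{n_i}{n}, \qquad
\hat{\mu}_i = \frac{1}{n_i} \sum_{j : y_j = i} x_j, \qquad
\hat{\Sigma}_i = \frac{1}{n_i} \sum_{j : y_j = i}
    (x_j - \hat{\mu}_i)(x_j - \hat{\mu}_i)^\top
```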

33 Compare: Univariate Gaussian vs. Mixture of Multivariate Gaussians

34 That was easy! But what if some of the data is unobserved? MLE: – argmax_θ ∏_j P(y_j, x_j) – θ: all model parameters, e.g., class probabilities, means, and variances for naïve Bayes. But we don't know the y_j's!!! Maximize the marginal likelihood instead: – argmax_θ ∏_j P(x_j) = argmax_θ ∏_j Σ_{i=1..k} P(y_j=i, x_j)

35 How do we optimize? Closed Form? Maximize the marginal likelihood: – argmax_θ ∏_j P(x_j) = argmax_θ ∏_j Σ_{i=1..k} P(y_j=i, x_j) Almost always a hard problem! – Usually no closed form solution – Even when P(X,Y) is convex, P(X) generally isn't… – For all but the simplest P(X), we will have to do gradient ascent, in a big messy space with lots of local optima…

36 Simple example: learn means only! Consider: 1D data, a mixture of k=2 Gaussians, variances fixed to σ=1, distribution over classes is uniform. Just estimate μ_1 and μ_2.

37 Marginal Likelihood for a Mixture of Two Gaussians Graph of log P(x_1, x_2, …, x_n | μ_1, μ_2) against μ_1 and μ_2. Max likelihood = (μ_1 = -2.13, μ_2 = 1.668). There is a local optimum, very close in value to the global one, at (μ_1 = 2.085, μ_2 = -1.257)*. * corresponds to switching y_1 with y_2.

38 Learning general mixtures of Gaussians Marginal likelihood: ∏_j P(x_j) = ∏_j Σ_{i=1..k} P(y_j=i) N(x_j | μ_i, Σ_i). We need to differentiate and solve for μ_i, Σ_i, and P(Y=i) for i=1..k. There will be no closed form solution, the gradient is complex, and there are lots of local optima. Wouldn't it be nice if there were a better way!?!

39 Expectation Maximization

40 The EM Algorithm A clever method for maximizing the marginal likelihood: – argmax_θ ∏_j P(x_j) = argmax_θ ∏_j Σ_{i=1..k} P(y_j=i, x_j) – A type of gradient ascent that can be easy to implement (e.g., no line search, learning rates, etc.) Alternate between two steps: – Compute an expectation – Compute a maximization Not magic: still optimizing a non-convex function with lots of local optima – The computations are just easier (often, significantly so!)

41 EM: Two Easy Steps Objective: argmax_θ log ∏_j Σ_{i=1..k} P(y_j=i, x_j | θ) = argmax_θ Σ_j log Σ_{i=1..k} P(y_j=i, x_j | θ). Data: {x_j | j=1..n} E-step: Compute expectations to "fill in" the missing y values according to the current parameters θ – for all examples j and values i for y, compute P(y_j=i | x_j, θ). M-step: Re-estimate the parameters with "weighted" MLE estimates – set θ = argmax_θ Σ_j Σ_{i=1..k} P(y_j=i | x_j, θ^(t)) log P(y_j=i, x_j | θ). Especially useful when the E and M steps have closed form solutions!!! (Note: the notation is a bit inconsistent here; the probabilities weighting the log are computed with the current parameters θ^(t) from the E-step, while the argmax is over the new θ.)
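Spelled out with explicit responsibilities r_ji (my notation, paraphrasing the slide; the E-step probabilities are computed under the current parameters θ^(t)):

```latex
\text{E-step:}\quad
  r_{ji} = P(y_j = i \mid x_j, \theta^{(t)})
         = \frac{P(y_j = i, x_j \mid \theta^{(t)})}
                {\sum_{i'=1}^{k} P(y_j = i', x_j \mid \theta^{(t)})}
\qquad
\text{M-step:}\quad
  \theta^{(t+1)} = \arg\max_{\theta} \sum_j \sum_{i=1}^{k}
      r_{ji} \,\log P(y_j = i, x_j \mid \theta)
```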

42 Simple example: learn means only! Consider: 1D data, a mixture of k=2 Gaussians, variances fixed to σ=1, distribution over classes is uniform. Just need to estimate μ_1 and μ_2.

43 EM for GMMs: only learning means Iterate: on the t'th iteration let our estimates be θ_t = { μ_1(t), μ_2(t), …, μ_k(t) }. E-step: compute the "expected" classes of all datapoints. M-step: compute the most likely new μs given the class expectations.
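A minimal NumPy sketch of this means-only case, under the assumptions of the simple example above (1D data, σ=1, uniform class prior); the function name and structure are my own:

```python
import numpy as np

def em_means_only(x, k=2, n_iters=50, seed=0):
    """EM for a 1D mixture of k unit-variance Gaussians with a uniform class
    prior; only the means are learned."""
    rng = np.random.default_rng(seed)
    mus = rng.normal(size=k)                          # random initial means
    for _ in range(n_iters):
        # E-step: responsibility of each class for each point, proportional to
        # exp(-(x - mu_i)^2 / 2) since sigma = 1 and P(y=i) = 1/k.
        logits = -0.5 * (x[:, None] - mus[None, :]) ** 2
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: each mean is the responsibility-weighted average of the data.
        mus = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mus

# Toy usage (synthetic data; values are illustrative):
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])
print(em_means_only(x))   # should land near [-2, 2] (in some order)
```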

44 E.M. for General GMMs Iterate: on the t'th iteration let our estimates be θ_t = { μ_1(t), …, μ_k(t), Σ_1(t), …, Σ_k(t), p_1(t), …, p_k(t) }, where p_i(t) is shorthand for the estimate of P(y=i) on the t'th iteration. E-step: compute the "expected" classes of all datapoints for each class ("just evaluate a Gaussian at x_j"). M-step: compute the weighted MLE for μ, Σ, and p given the expected classes above (m = #training examples).
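A hedged NumPy/SciPy sketch of these updates (my own code, not the course's; it adds a small diagonal regularizer to the covariances for numerical stability and assumes d ≥ 2):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, seed=0):
    """EM for a general Gaussian mixture: learns mixing weights p_i,
    means mu_i, and covariances Sigma_i."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    p = np.full(k, 1.0 / k)                          # P(y=i)
    mu = X[rng.choice(n, size=k, replace=False)]     # init means at data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iters):
        # E-step: responsibility of class i for point j, proportional to
        # p_i * N(x_j | mu_i, Sigma_i)  ("just evaluate a Gaussian at x_j").
        resp = np.column_stack([
            p[i] * multivariate_normal(mu[i], Sigma[i]).pdf(X) for i in range(k)
        ])
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted MLE with the responsibilities as fractional counts.
        Nk = resp.sum(axis=0)                        # effective count per class
        p = Nk / n
        mu = (resp.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mu[i]
            Sigma[i] = (resp[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return p, mu, Sigma
```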

45 Gaussian Mixture Example: Start

46 After first iteration

47 After 2nd iteration

48 After 3rd iteration

49 After 4th iteration

50 After 5th iteration

51 After 6th iteration

52 After 20th iteration

53 Some Bio Assay data

54 GMM clustering of the assay data

55 Resulting Density Estimator

56 Three classes of assay (each learned with its own mixture model)

57 What if we do hard assignments? Iterate: on the t'th iteration let our estimates be θ_t = { μ_1(t), μ_2(t), …, μ_k(t) }. E-step: compute the "expected" classes of all datapoints. M-step: compute the most likely new μs given the class expectations. Here δ represents a hard assignment to the "most likely" or nearest cluster. This is equivalent to the k-means clustering algorithm!!!
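A small sketch of the hard (δ) E-step, assuming equal spherical variances and a uniform prior so that "most likely" means "nearest mean" (my code; `mu` plays the same role as in the EM sketch above):

```python
import numpy as np

def hard_e_step(X, mu):
    """Replace the soft responsibilities with a one-hot (delta) assignment to
    the nearest cluster; plugging this into the M-step recovers k-means."""
    dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)               # delta: nearest cluster
    resp = np.zeros((len(X), len(mu)))
    resp[np.arange(len(X)), assignments] = 1.0       # hard 0/1 "responsibilities"
    return resp
```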

58 Let's look at the math behind the magic! We will argue that EM: Optimizes a bound on the likelihood. Is a type of coordinate ascent. Is guaranteed to converge to an (often local) optimum.

59 The general learning problem with missing data Marginal likelihood: x is observed, z (e.g., the class labels y) is missing: l(θ : Data) = Σ_j log P(x_j | θ) = Σ_j log Σ_z P(x_j, z | θ). Objective: find argmax_θ l(θ : Data).

60 Skipping Gnarly Math EM Converges – the E-step doesn't decrease F(θ, Q) – the M-step doesn't either. EM is Coordinate Ascent.

61 A Key Computation: E-step x is observed, z is missing. Compute the probability of the missing data given the current choice of θ: – Q(z|x_j) for each x_j, e.g., the probability computed during the classification step; this corresponds to the "classification step" in K-means.

62 Jensen's inequality Theorem: – log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z) – e.g., the binary case: log( P(z=0) f(0) + P(z=1) f(1) ) ≥ P(z=0) log f(0) + P(z=1) log f(1) – actually, an analogous bound holds for any concave (or, with the inequality flipped, convex) function applied to an expectation!

63 Applying Jensen's inequality Use: log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z)
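The bound that results (a LaTeX reconstruction of the derivation the slide shows as an image; Q(z|x_j) is any distribution over the missing z):

```latex
\ell(\theta)
  = \sum_j \log \sum_z P(x_j, z \mid \theta)
  = \sum_j \log \sum_z Q(z \mid x_j)\,\frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)}
  \;\ge\; \sum_j \sum_z Q(z \mid x_j)\,\log \frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)}
  \;=\; F(\theta, Q)
```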

64 The M-step Maximization step: θ^(t+1) ← argmax_θ F(θ, Q^(t+1)) = argmax_θ Σ_j Σ_z Q^(t+1)(z|x_j) log P(x_j, z | θ). We are optimizing a lower bound! Use expected counts to do weighted learning: – if learning requires Count(x,z), use E_{Q^(t+1)}[Count(x,z)] – looks a bit like boosting!!!

65 Convergence of EM Define the potential function F(θ, Q): the lower bound from Jensen's inequality. EM is coordinate ascent on F! – Thus, it maximizes a lower bound on the marginal log likelihood.

66 M-step can’t decrease F(θ,Q): by definition! We are maximizing F directly, by ignoring a constant!

67 E-step: more work to show that F(θ, Q) doesn't decrease. KL-divergence measures the distance between distributions; KL = zero if and only if Q = P.
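Rewriting F with the KL-divergence makes the E-step argument immediate (standard identity, stated in the notation above):

```latex
F(\theta, Q) = \sum_j \Big[ \log P(x_j \mid \theta)
   - \mathrm{KL}\big( Q(\cdot \mid x_j) \,\big\Vert\, P(\cdot \mid x_j, \theta) \big) \Big]
   \;\le\; \ell(\theta),
\qquad
\mathrm{KL}(Q \Vert P) = \sum_z Q(z) \log \frac{Q(z)}{P(z)}
```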

68 E-step also doesn't decrease F: Step 1 Fix θ to θ^(t), take a max over Q:

69 E-step also doesn't decrease F: Step 2 Fixing θ to θ^(t), the max over Q yields: – Q(z|x_j) ← P(z|x_j, θ^(t)) – Why? The likelihood term is a constant; the KL term is zero iff the arguments are the same distribution!! – So, the E-step is actually a maximization / tightening of the bound. It ensures that F(θ^(t), Q^(t+1)) = l(θ^(t)).

70 EM is coordinate ascent M-step: fix Q, maximize F over θ (a lower bound on the marginal log likelihood l(θ)). E-step: fix θ, maximize F over Q: this "realigns" F with the likelihood, so that F(θ^(t), Q^(t+1)) = l(θ^(t)).

71 What you should know K-means for clustering: – the algorithm – it converges because it is coordinate descent on the total distance Φ. Know what agglomerative clustering is. EM for mixtures of Gaussians: – also coordinate ascent (on a lower bound of the likelihood) – how to "learn" maximum likelihood parameters (locally maximum likelihood) in the case of unlabeled data – relation to K-means: hard / soft clustering, probabilistic model. Remember, E.M. can get stuck in local optima, – and empirically it DOES.

72 Acknowledgements The K-means & Gaussian mixture models presentation contains material from an excellent tutorial by Andrew Moore: – http://www.autonlab.org/tutorials/ K-means Applet: – http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html Gaussian mixture models Applet: – http://www.neurosci.aist.go.jp/%7Eakaho/MixtureEM.html

73 Solution to #3 - Learning Given x_1 … x_N, how do we learn the parameters θ (e.g., the transition and emission probabilities) to maximize P(x)? Unfortunately, there is no known way to analytically find a global maximum θ*, i.e., a θ* = argmax_θ P(o | θ). But it is possible to find a local maximum: given an initial model θ, we can always find a model θ' such that P(o | θ') ≥ P(o | θ).

74 Chicken & Egg Problem If we knew the actual sequence of states – it would be easy to learn the transition and emission probabilities – but we can't observe the states, so we don't! If we knew the transition & emission probabilities – then it'd be easy to estimate the sequence of states (Viterbi) – but we don't know them! Slide by Daniel S. Weld

75 Simplest Version Mixture of two distributions. Know: form of distribution & variance, %=5. Just need the mean of each distribution. Slide by Daniel S. Weld

76 Input Looks Like (figure: unlabeled 1D data points) Slide by Daniel S. Weld

77 We Want to Predict (figure: which distribution each point came from) Slide by Daniel S. Weld

78 Chicken & Egg Note that coloring the instances would be easy if we knew the Gaussians…. Slide by Daniel S. Weld

79 Chicken & Egg And finding the Gaussians would be easy if we knew the coloring. Slide by Daniel S. Weld

80 Expectation Maximization (EM) Pretend we do know the parameters – initialize randomly: set μ_1 = ?; μ_2 = ? Slide by Daniel S. Weld

81 Expectation Maximization (EM) Pretend we do know the parameters – initialize randomly [E step] Compute the probability of each instance having each possible value of the hidden variable Slide by Daniel S. Weld

82 Expectation Maximization (EM) Pretend we do know the parameters – initialize randomly [E step] Compute the probability of each instance having each possible value of the hidden variable Slide by Daniel S. Weld

83 Expectation Maximization (EM) Pretend we do know the parameters – initialize randomly [E step] Compute the probability of each instance having each possible value of the hidden variable [M step] Treating each instance as fractionally having both values, compute the new parameter values Slide by Daniel S. Weld

84 ML Mean of a Single Gaussian u_ML = argmin_u Σ_i (x_i − u)² Slide by Daniel S. Weld

85 Expectation Maximization (EM) [E step] Compute the probability of each instance having each possible value of the hidden variable [M step] Treating each instance as fractionally having both values, compute the new parameter values Slide by Daniel S. Weld

86 Expectation Maximization (EM) [E step] Compute the probability of each instance having each possible value of the hidden variable Slide by Daniel S. Weld

87 Expectation Maximization (EM) [E step] Compute the probability of each instance having each possible value of the hidden variable [M step] Treating each instance as fractionally having both values, compute the new parameter values Slide by Daniel S. Weld

88 Expectation Maximization (EM) [E step] Compute the probability of each instance having each possible value of the hidden variable [M step] Treating each instance as fractionally having both values, compute the new parameter values Slide by Daniel S. Weld

89 EM for HMMs [E step] Compute the probability of each instance having each possible value of the hidden variable – compute the forward and backward probabilities for the given model parameters and our observations [M step] Treating each instance as fractionally having both values, compute the new parameter values – re-estimate the model parameters by simple counting
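A hedged sketch of one such iteration (Baum-Welch) for a discrete-observation HMM, in my own notation rather than the course's; it omits the scaling needed for long sequences, so it is only illustrative:

```python
import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One EM iteration for a discrete HMM.
    obs: (T,) integer observation sequence, A: (S,S) transition probabilities,
    B: (S,V) emission probabilities, pi: (S,) initial-state probabilities."""
    obs = np.asarray(obs)
    T, S = len(obs), A.shape[0]

    # E-step: forward and backward probabilities under the current parameters.
    alpha = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta = np.zeros((T, S))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    # Fractional state occupancies (gamma) and transition counts (xi).
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :])
    xi /= xi.sum(axis=(1, 2), keepdims=True)

    # M-step: re-estimate the parameters by (fractional) counting.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for v in range(B.shape[1]):
        new_B[:, v] = gamma[obs == v].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi
```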

