Gaussian Mixture Models Like K-Means, GMM clusters have centers. In addition, they have probability distributions that indicate the probability that a point belongs to the cluster. These ellipses show “level sets”: curves along which the probability of belonging to the cluster is equal. Notice that green points still have SOME probability of belonging to the blue cluster, but it’s much lower than for the blue points. This is a more complex model than K-Means: distance from the center can matter more in one direction than another.
GMMs and EM Gaussian Mixture Models (GMMs) are a model similar to a Naïve Bayes model, but with important differences. Expectation-Maximization (EM) is a parameter-estimation algorithm for training GMMs using unlabeled data. To explain these further, we first need to review Gaussian (normal) distributions.
The Normal (aka Gaussian) Distribution A Gaussian with mean μ and standard deviation σ has density f(x) = (1 / (σ √(2π))) · exp(−(x − μ)² / (2σ²)).
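The density formula above can be checked numerically; a minimal sketch (the function name `normal_pdf` is my own, not from the slides):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x: exp(-(x-mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Sanity check: the density integrates to 1 (Riemann sum over a wide interval).
dx = 0.001
xs = np.arange(-10, 10, dx)
area = normal_pdf(xs, mu=0.0, sigma=1.0).sum() * dx
print(round(area, 4))  # → 1.0
```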
Quiz: MLE for Gaussians Based on your statistics knowledge, 1) What is the MLE for μ from a bunch of example X points? 2) What is the MLE for σ from a bunch of example X points?
Answer: MLE for Gaussians μ_MLE = (1/n) Σ_i x_i (the average of the X values). σ_MLE = √((1/n) Σ_i (x_i − μ_MLE)²) (the square root of the average squared deviation from the mean).
Quiz: Deriving the ML estimators How would you derive the MLE equations for Gaussian distributions?
Answer: Deriving the ML estimators Write the log-likelihood of the data, log L(μ, σ) = Σ_i log f(x_i; μ, σ), take its partial derivatives with respect to μ and σ, set each to zero, and solve. The μ equation gives the average of the X values; substituting that into the σ equation gives the average squared deviation.
Quiz: Estimating a Gaussian On the left is a dataset with the following X values: 0, 3, 4, 5, 6, 7, 10 Find the maximum likelihood Gaussian distribution.
Answer: Estimating a Gaussian On the left is a dataset with the following X values: 0, 3, 4, 5, 6, 7, 10 μ = 35/7 = 5, and σ² = 60/7 ≈ 8.57, so σ ≈ 2.93.
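These numbers can be reproduced directly in NumPy; note that the MLE divides by n (ddof=0), not n − 1:

```python
import numpy as np

x = np.array([0, 3, 4, 5, 6, 7, 10], dtype=float)

mu = x.mean()          # MLE mean: 35 / 7 = 5.0
sigma = x.std(ddof=0)  # MLE std: sqrt(60 / 7) ≈ 2.93 (ddof=0 divides by n, matching the MLE)
print(mu, round(sigma, 2))  # → 5.0 2.93
```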
Clustering by fitting K Gaussians Suppose our dataset looks like the one above. It doesn’t really look Gaussian anymore; it looks like it has 3 clusters. Fitting a single Gaussian to this data will still give you an estimate. But that Gaussian will have a low Likelihood value: it will give very low probability to the leftmost and rightmost clusters.
Clustering by fitting K Gaussians What we’d like to do instead is to fit K Gaussians. A model for data that involves multiple Gaussian distributions is called a Gaussian Mixture Model (GMM).
Clustering by fitting K Gaussians Another way of drawing these is with “level sets”: curves that show points with equal probability for each Gaussian (here around μ_red, μ_blue, and μ_green). Wider curves have lower probability than narrower curves. Notice that each point is contained within every Gaussian, but is most tightly bound to the closest Gaussian.
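As a concrete sketch of fitting K Gaussians, here is scikit-learn’s GaussianMixture on synthetic three-cluster 1-D data (the library choice and the dataset are my own, not from the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data with three well-separated clusters (illustrative only).
rng = np.random.default_rng(0)
x = np.concatenate([
    rng.normal(-5.0, 1.0, 200),
    rng.normal(0.0, 1.0, 200),
    rng.normal(5.0, 1.0, 200),
]).reshape(-1, 1)  # scikit-learn expects a 2-D (n_samples, n_features) array

gmm = GaussianMixture(n_components=3, random_state=0).fit(x)
print(np.sort(gmm.means_.ravel()))  # fitted means land roughly at -5, 0, 5
```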
Visualization of EM 1. Initialize the mean and standard deviation of each Gaussian randomly. 2. Repeat until convergence: – Expectation: For each point X and each Gaussian k, find P(X | Gaussian k) – Maximization: Re-estimate each Gaussian’s mean and standard deviation from the points, weighting each point by these probabilities
Gaussian Mixture Model P(X_1, …, X_n) = Π_i Σ_k P(C = k) · N(X_i; μ_k, σ_k). The product over points is the i.i.d. assumption; P(C = k) is the prior for cluster k, and N(X_i; μ_k, σ_k) is the Gaussian for cluster k.
GMMs as Bayes Nets A GMM is a two-node Bayes Net: a hidden Cluster variable (taking values 1, 2, …, K) with an arrow to an observed variable X (a real number).
Formal Description of the Algorithm
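The slide’s equations didn’t survive extraction, so here is a minimal 1-D EM sketch consistent with the steps above (function and variable names are mine; `pi` holds the cluster priors P(C = k)):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, k, iters=100):
    """Fit a 1-D Gaussian Mixture Model with EM; returns (priors, means, stds)."""
    n = len(x)
    # Init: the slides say "randomly"; spread-out quantiles are used here so the
    # sketch is reproducible.
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    sigma = np.full(k, x.std())   # init stds to the overall std
    pi = np.full(k, 1.0 / k)      # init priors uniformly
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(cluster j | x_i).
        dens = pi * normal_pdf(x[:, None], mu[None, :], sigma[None, :])  # (n, k)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate each Gaussian from the responsibility-weighted points.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

# Usage: two clear clusters around -4 and +4.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4.0, 1.0, 300), rng.normal(4.0, 1.0, 300)])
pi, mu, sigma = em_gmm_1d(x, k=2)
print(np.sort(mu))  # fitted means land roughly at -4 and 4
```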
Evaluation metric for GMMs and EM The Likelihood of the data under the model: the probability the GMM assigns to the training points.
Analysis of EM Performance EM is guaranteed to find a local optimum of the Likelihood function. Theorem: After one iteration of EM, the Likelihood of the new GMM >= the Likelihood of the previous GMM. (Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm”. Journal of the Royal Statistical Society, Series B (Methodological) 39 (1): 1–38.)
EM Generality Even though EM was originally invented for GMMs, the same basic algorithm can be used for learning with arbitrary Bayes Nets when some of the training data has missing values. This has made EM one of the most popular unsupervised learning techniques in machine learning.
EM Quiz Three Gaussians g1, g2, g3 and three points a, b, c. Which Gaussian(s) have a nonzero value for f(a)? How about f(c)?
Answer: EM Quiz Which Gaussian(s) have a nonzero f(a)? All Gaussians (g1, g2, and g3) have a nonzero value for f(a). How about f(c)? The same: all Gaussians have a nonzero value for f(c), since a Gaussian assigns nonzero density to every point on the real line.
Quiz: EM vs. K-Means Two cluster centers g1 and g2, and two points a and c. At the end of K-Means, where will cluster center g1 end up – Option 1 or Option 2? At the end of EM, where will cluster center g1 end up – Option 1 or Option 2?
Answer: EM vs. K-Means At the end of K-Means, where will cluster center g1 end up – Option 1 or Option 2? Option 1: K-Means puts the “mean” at the center of all points in the cluster, and point a will be the only point in g1’s cluster. At the end of EM, where will cluster center g1 end up – Option 1 or Option 2? Option 2: EM puts the “mean” at the weighted center of all points in the dataset, where each point is weighted by how likely it is according to the Gaussian. Point a and Point c will both have some likelihood, but Point a’s likelihood will be much higher. So the “mean” for g1 will be very close to Point a, but not all the way at Point a.
How many clusters?
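The slide content here didn’t survive extraction, but a standard answer is to penalize model size, e.g. with the Bayesian Information Criterion (BIC); a sketch using scikit-learn’s built-in `bic` (library and dataset are my own choices, not the slides’):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-5.0, 1.0, 200),
                    rng.normal(5.0, 1.0, 200)]).reshape(-1, 1)

# Likelihood alone always improves as K grows, so BIC adds a penalty
# proportional to (number of parameters) * log(n); lower BIC is better.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(x).bic(x)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
print(best_k)  # typically 2 for this clearly two-cluster data
```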
Quiz Is EM for GMMs Classification or Regression? Generative or Discriminative? Parametric or Nonparametric?
Answer Is EM for GMMs Classification or Regression? Two possible answers: - Classification: the output is a discrete value (cluster label) for each point. - Regression: the output is a real value (probability) for each possible cluster label for each point. Generative or Discriminative? - Normally it’s used with a fixed set of input and output variables. However, GMMs are Bayes Nets that store a full joint distribution. Once it’s trained, a GMM can actually make predictions for any subset of the variables given any other subset. Technically, this is generative. Parametric or Nonparametric? - Parametric: the number of parameters is 3K (a mean, a standard deviation, and a prior for each of the K clusters), which does not change with the number of training data points.
Quiz Is EM for GMMs Supervised or Unsupervised? Online or batch? Closed-form or iterative?
Answer Is EM for GMMs Supervised or Unsupervised? - Unsupervised. Online or batch? - Batch: if you add a new data point, you need to revisit all the training data to recompute the locally-optimal model. Closed-form or iterative? - Iterative: training requires many passes through the data.