Unsupervised Learning

Unsupervised Learning Gaussian Mixture Models Expectation-Maximization (EM)

Gaussian Mixture Models Like K-Means, GMM clusters have centers. In addition, they have probability distributions that indicate the probability that a point belongs to the cluster. These ellipses show “level sets”: lines with equal probability of belonging to the cluster. Notice that green points still have SOME probability of belonging to the blue cluster, but it’s much lower than the blue points. This is a more complex model than K-Means: distance from the center can matter more in one direction than another. (Figure: level-set ellipses in the X1–X2 plane.)

GMMs and EM Gaussian Mixture Models (GMMs) are a model, similar to a Naïve Bayes model but with important differences. Expectation-Maximization (EM) is a parameter-estimation algorithm for training GMMs using unlabeled data. To explain these further, we first need to review Gaussian (normal) distributions.

The Normal (aka Gaussian) Distribution
$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( \frac{-(x-\mu)^2}{2\sigma^2} \right)$
$\mu$: mean, $\sigma^2$: variance
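Where a concrete form helps, here is a minimal sketch of this density in Python; the function name gaussian_pdf and the NumPy-based style are illustrative choices, not from the slides.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Density of a univariate Gaussian with mean mu and variance sigma2."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Example: the standard normal density at x = 0 is about 0.3989
print(gaussian_pdf(0.0, mu=0.0, sigma2=1.0))
```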

Quiz: MLE for Gaussians Based on your statistics knowledge: What is the MLE for μ from a bunch of example X points? What is the MLE for σ from a bunch of example X points?

Answer: MLE for Gaussians Based on your statistics knowledge:
What is the MLE for μ from a bunch of example X points? $\mu = \frac{1}{M} \sum_i X_i$ (the average of the X values)
What is the MLE for σ from a bunch of example X points? $\sigma^2 = \frac{1}{M} \sum_i (X_i - \mu)^2$ (the average squared deviation from the mean)
Note: this is a so-called “biased” estimator for $\sigma^2$; there is also an “unbiased” estimator which basically just uses (M−1) instead of M. We’ll stick to the “biased” one here, but either one is fine.
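These two estimators are easy to compute directly; a minimal sketch (gaussian_mle is an illustrative name):

```python
import numpy as np

def gaussian_mle(xs):
    """Biased maximum-likelihood estimates of the mean and variance of a 1-D sample."""
    xs = np.asarray(xs, dtype=float)
    mu = xs.mean()                    # (1/M) * sum_i X_i
    sigma2 = ((xs - mu) ** 2).mean()  # (1/M) * sum_i (X_i - mu)^2
    return mu, sigma2
```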

Quiz: Deriving the ML estimators How would you derive the MLE equations for Gaussian distributions?

Answer: Deriving the ML estimators How would you derive the MLE equations for Gaussian distributions? Same plan of attack as for MLE estimates of Bayes Nets:
1. Write down the Likelihood function P(D | M).
2. Make the assumption that each data point Xi is independently distributed, so $P(D \mid M) = \prod_i P(X_i \mid M)$.
3. Take the log.
4. Take the partial derivative with respect to μ, set this equal to zero, and solve for μ.
5. Take the partial derivative with respect to σ, set this equal to zero, and solve for σ.
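Filling in the algebra these steps outline (a sketch; the last derivative is taken with respect to σ² rather than σ for convenience, which gives the same result):

```latex
\log P(D \mid M) = \sum_{i=1}^{M} \log f(X_i)
  = -\frac{M}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{M} \frac{(X_i - \mu)^2}{2\sigma^2}

\frac{\partial}{\partial \mu} \log P(D \mid M) = \sum_{i=1}^{M} \frac{X_i - \mu}{\sigma^2} = 0
  \;\Rightarrow\; \mu = \frac{1}{M}\sum_{i} X_i

\frac{\partial}{\partial \sigma^2} \log P(D \mid M)
  = -\frac{M}{2\sigma^2} + \sum_{i=1}^{M} \frac{(X_i - \mu)^2}{2\sigma^4} = 0
  \;\Rightarrow\; \sigma^2 = \frac{1}{M}\sum_{i} (X_i - \mu)^2
```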

Quiz: Estimating a Gaussian On the left is a dataset with the following X values: 0, 3, 4, 5, 6, 7, 10. Find the maximum likelihood Gaussian distribution.
$\mu = \frac{1}{M} \sum_i X_i$
$\sigma^2 = \frac{1}{M} \sum_i (X_i - \mu)^2$

Answer: Estimating a Gaussian On the left is a dataset with the following X values: 0, 3, 4, 5, 6, 7, 10.
$\mu = \frac{1}{M} \sum_i X_i = \frac{1}{7}(0+3+4+5+6+7+10) = 5$
$\sigma^2 = \frac{1}{M} \sum_i (X_i - \mu)^2 = \frac{1}{7}\left[ (0-5)^2 + (3-5)^2 + (4-5)^2 + (5-5)^2 + (6-5)^2 + (7-5)^2 + (10-5)^2 \right] = \frac{1}{7}\left( 5^2 + 2^2 + 1^2 + 0^2 + 1^2 + 2^2 + 5^2 \right) = \frac{1}{7}(25+4+1+0+1+4+25) = \frac{60}{7}$
$f(x) = \frac{1}{\sqrt{2\pi \cdot \frac{60}{7}}} \exp\left( \frac{-(x-5)^2}{2 \cdot \frac{60}{7}} \right)$
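A quick numeric check of that arithmetic (a sketch; ddof=0 gives NumPy’s biased variance, matching the MLE formula above):

```python
import numpy as np

xs = np.array([0, 3, 4, 5, 6, 7, 10], dtype=float)
print(xs.mean())       # 5.0
print(xs.var(ddof=0))  # 60/7, about 8.571
```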

Clustering by fitting K Gaussians Suppose our dataset looks like the one above. It doesn’t really look Gaussian anymore; it looks like it has 3 clusters. Fitting a single Gaussian to this data will still give you an estimate. But that Gaussian will have a low Likelihood value: it will give very low probability to the leftmost and rightmost clusters.

Clustering by fitting K Gaussians What we’d like to do instead is to fit K Gaussians. A model for data that involves multiple Gaussian distributions is called a Gaussian Mixture Model (GMM).

Clustering by fitting K Gaussians (Figure: three Gaussians with means μred, μblue, μgreen.) Another way of drawing these is with “level sets”: curves that show points with equal probability for each Gaussian, with wider curves having lower probability than narrower curves. Notice that each point is contained within every Gaussian, but is most tightly bound to the closest Gaussian.

Expectation-Maximization (EM) EM is “K-Means for GMMs”. It is a parameter estimation algorithm for GMMs that will determine a (locally-optimal) setting for all of the GMM parameters, using a bunch of unlabeled X points.
Input: 1. Data points X1, …, XM 2. A number K
Output: $\mu_1, \sigma_1^2, \ldots, \mu_K, \sigma_K^2$ such that the GMM with those means and variances has a locally-maximum likelihood for the training data set.

Visualization of EM Initialize the mean and standard deviation of each Gaussian randomly. Repeat until convergence: Expectation: For each point X and each Gaussian k, find f(X | Gaussian k)

Visualization of EM Initialize the mean and standard deviation of each Gaussian randomly. Repeat until convergence: Expectation: For each point X and each Gaussian k, find f(X | Gaussian k) Maximization: Estimate new 𝜇 𝑘 and 𝜎 𝑘 parameters for each Gaussian. (Technically, you also need to estimate a third parameter, called πk. More later.)

Gaussian Mixture Model K Gaussian distributions with parameters $\mu_1, \sigma_1^2$ through $\mu_K, \sigma_K^2$. It also involves K additional parameters, called prior probabilities, $\pi_1$ through $\pi_K$. These describe the relative importance of each of the K Gaussian distributions in the full model. The likelihood equation for this model looks like this:
$f(X_1, \ldots, X_M \mid GMM) = \prod_i f(X_i \mid GMM)$ (i.i.d. assumption)
$f(X_i \mid GMM) = \sum_{k=1}^{K} \pi_k \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left( \frac{-(X_i - \mu_k)^2}{2\sigma_k^2} \right)$
(In each term of the sum, $\pi_k$ is the prior and the remaining factor is the Gaussian density.)
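A minimal sketch of that per-point mixture density in Python (gmm_density and its argument names are illustrative):

```python
import numpy as np

def gmm_density(x, pis, mus, sigma2s):
    """f(x | GMM) = sum_k pi_k * N(x; mu_k, sigma2_k) for a 1-D mixture."""
    pis, mus, sigma2s = map(np.asarray, (pis, mus, sigma2s))
    comps = np.exp(-(x - mus) ** 2 / (2 * sigma2s)) / np.sqrt(2 * np.pi * sigma2s)
    return float(np.sum(pis * comps))

# Example: a two-component mixture evaluated at x = 1.0
print(gmm_density(1.0, pis=[0.3, 0.7], mus=[0.0, 5.0], sigma2s=[1.0, 2.0]))
```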

GMMs as Bayes Nets (Network: Cluster → X, where Cluster ∈ {1, 2, …, K} and X is a real number.) GMMs are simple Bayes Nets. Two differences from previous BNs we’ve seen:
1. We’re used to binary variables in BNs. Here, the “Cluster” variable has K possible values (1, 2, …, K) instead of just two (+cluster and −cluster). We used to store P(+a) and P(−a) for the parent variable; now we store $\pi_1$ through $\pi_K$.
2. The “X” variable has infinitely many values (any real number) instead of just (+x and −x). We used to store P(+x | +a) and P(+x | −a). Now we store $\mu_1, \sigma_1^2$ through $\mu_K, \sigma_K^2$, and we say $f(X \mid \text{Cluster} = j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left( \frac{-(X - \mu_j)^2}{2\sigma_j^2} \right)$.

Formal Description of the Algorithm
1. Init: For each k in {1, …, K}, create a random $\pi_k$, $\mu_k$, $\sigma_k^2$.
2. Repeat until all $\pi_k$, $\mu_k$, $\sigma_k^2$ remain the same from one iteration to the next:
Expectation (aka Assignment in K-Means): For each $X_i$, for each k, let
$C[X_i, k] \leftarrow \dfrac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left( \frac{-(X_i - \mu_k)^2}{2\sigma_k^2} \right)}{\sum_{k'} \pi_{k'} \frac{1}{\sqrt{2\pi}\,\sigma_{k'}} \exp\left( \frac{-(X_i - \mu_{k'})^2}{2\sigma_{k'}^2} \right)}$
Maximization (aka Update in K-Means): For each k,
$\pi_k = \frac{1}{M} \sum_{i=1}^{M} C[X_i, k]$
$\mu_k = \dfrac{\sum_{i=1}^{M} C[X_i, k] \cdot X_i}{\sum_{i=1}^{M} C[X_i, k]}$
$\sigma_k^2 = \dfrac{\sum_{i=1}^{M} C[X_i, k] \cdot (X_i - \mu_k)^2}{\sum_{i=1}^{M} C[X_i, k]}$
3. Return (for all values of k) $\pi_k$, $\mu_k$, $\sigma_k^2$.
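Here is a sketch of that procedure in Python for 1-D data. It is not from the original slides: the function name em_gmm_1d, the random initialization from the data, and the fixed iteration cap (standing in for the exact convergence test above) are all illustrative choices.

```python
import numpy as np

def em_gmm_1d(xs, K, n_iters=100, seed=0, eps=1e-9):
    """EM for a 1-D Gaussian mixture model: returns (pis, mus, sigma2s)."""
    rng = np.random.default_rng(seed)
    xs = np.asarray(xs, dtype=float)
    M = len(xs)

    # Init: uniform priors, means drawn from the data, variances set to the data variance.
    pis = np.full(K, 1.0 / K)
    mus = rng.choice(xs, size=K, replace=False)
    sigma2s = np.full(K, xs.var() + eps)

    for _ in range(n_iters):
        # Expectation step: responsibilities C[i, k] proportional to pi_k * N(x_i; mu_k, sigma2_k).
        dens = np.exp(-(xs[:, None] - mus) ** 2 / (2 * sigma2s)) / np.sqrt(2 * np.pi * sigma2s)
        weighted = pis * dens
        C = weighted / (weighted.sum(axis=1, keepdims=True) + eps)

        # Maximization step: re-estimate priors, means, and variances from the responsibilities.
        Nk = C.sum(axis=0)
        pis = Nk / M
        mus = (C * xs[:, None]).sum(axis=0) / (Nk + eps)
        sigma2s = (C * (xs[:, None] - mus) ** 2).sum(axis=0) / (Nk + eps) + eps

    return pis, mus, sigma2s

# Example: data with two well-separated clusters around 0 and 8.
rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(8.0, 1.0, 200)])
print(em_gmm_1d(data, K=2))
```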

Evaluation metric for GMMs and EM LOSS function (or objective function) for EM: EM (locally) maximizes the “marginal” likelihood:
EM(X1, …, XM) = $\underset{\mu_1, \sigma_1^2, \pi_1, \ldots, \mu_K, \sigma_K^2, \pi_K}{\operatorname{argmax}} \; f(X_1, \ldots, X_M \mid \mu_1, \sigma_1^2, \pi_1, \ldots, \mu_K, \sigma_K^2, \pi_K)$
Notice that this is the Likelihood function for just the X variable in our Bayes Net, rather than the Likelihood for (X and Cluster), which is why it is called “marginal likelihood” rather than just “likelihood”.
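In practice it is convenient to monitor the log of this marginal likelihood during training (the log is monotone, so maximizing it is equivalent). A minimal sketch, with gmm_log_likelihood as an illustrative name:

```python
import numpy as np

def gmm_log_likelihood(xs, pis, mus, sigma2s):
    """log f(X_1, ..., X_M | GMM) = sum_i log sum_k pi_k * N(x_i; mu_k, sigma2_k)."""
    xs = np.asarray(xs, dtype=float)
    pis, mus, sigma2s = map(np.asarray, (pis, mus, sigma2s))
    dens = np.exp(-(xs[:, None] - mus) ** 2 / (2 * sigma2s)) / np.sqrt(2 * np.pi * sigma2s)
    return float(np.log((pis * dens).sum(axis=1)).sum())
```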

Analysis of EM Performance EM is guaranteed to find a local optimum of the Likelihood function. Theorem: After one iteration of EM, the Likelihood of the new GMM ≥ the Likelihood of the previous GMM. (Dempster, A.P.; Laird, N.M.; Rubin, D.B. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm”. Journal of the Royal Statistical Society, Series B (Methodological) 39 (1): 1–38. JSTOR 2984875.)

EM Generality Even though EM was originally invented for GMMs, the same basic algorithm can be used for learning with arbitrary Bayes Nets when some of the training data has missing values. This has made EM one of the most popular unsupervised learning techniques in machine learning.

EM Quiz (Figure: points a, b, c and Gaussians g1, g2, g3.) Which Gaussian(s) have a nonzero value for f(a)? How about f(c)?

Answer: EM Quiz (Figure: points a, b, c and Gaussians g1, g2, g3.) Which Gaussian(s) have a nonzero f(a)? All Gaussians (g1, g2, and g3) have a nonzero value for f(a). How about f(c)? Ditto. All Gaussians have a nonzero value for f(c).

Quiz: EM vs. K-Means (Figure: points a and c, cluster centers g1 and g2, and candidate final positions Option 1–4.) At the end of K-Means, where will cluster center g1 end up – Option 1 or Option 2? At the end of EM, where will cluster center g1 end up – Option 1 or Option 2?

Answer: EM vs. K-Means (Figure: points a and c, cluster centers g1 and g2, and candidate final positions Option 1–4.) At the end of K-Means, where will cluster center g1 end up – Option 1 or Option 2? Option 1: K-Means puts the “mean” at the center of all points in the cluster, and point a will be the only point in g1’s cluster. At the end of EM, where will cluster center g1 end up – Option 1 or Option 2? Option 2: EM puts the “mean” at the center of all points in the dataset, where each point is weighted by how likely it is according to the Gaussian. Point a and Point c will both have some likelihood, but Point a’s likelihood will be much higher. So the “mean” for g1 will be very close to Point a, but not all the way at Point a.

How many clusters? We’ve been assuming a fixed K. Here’s a technique to determine this automatically, from data.
New objective function: Minimize $-\sum_{i=1}^{M} \log f(X_i \mid GMM_K) + Cost \cdot K$
Algorithm:
1. Initialize K somehow. Repeat until convergence:
2. Run EM.
3. Remove unnecessary clusters (low π value).
4. Create new random clusters (more or fewer than before, depending on a heuristic estimate of whether there were too many or too few before).
This is slow. But one nice property is that it can overcome some difficulties with local maxima.
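A simpler variant of this idea, not the add/remove heuristic above, is to sweep over candidate values of K and keep the one with the lowest penalized score. A sketch (choose_K is an illustrative name; it expects a GMM-fitting routine and a log-likelihood function like the sketches earlier in this section):

```python
def choose_K(xs, max_K, cost, fit_gmm, log_likelihood):
    """Try K = 1..max_K and keep the K minimizing -log-likelihood + cost * K."""
    best = None
    for K in range(1, max_K + 1):
        params = fit_gmm(xs, K)  # e.g. the em_gmm_1d sketch above, returning (pis, mus, sigma2s)
        score = -log_likelihood(xs, *params) + cost * K
        if best is None or score < best[0]:
            best = (score, K, params)
    return best  # (score, chosen K, fitted parameters)
```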

Quiz Is EM for GMMs Classification or Regression? Generative or Discriminative? Parametric or Nonparametric?

Answer Is EM for GMMs…
Classification or Regression? Two possible answers. Classification: the output is a discrete value (cluster label) for each point. Regression: the output is a real value (probability) for each possible cluster label for each point.
Generative or Discriminative? Normally, it’s used with a fixed set of input and output variables. However, GMMs are Bayes Nets that store a full joint distribution. Once it’s trained, a GMM can actually make predictions for any subset of the variables given any other subset. Technically, this is generative.
Parametric or Nonparametric? Parametric: the number of parameters is 3K, which does not change with the number of training data points.

Quiz Is EM for GMMs Supervised or Unsupervised? Online or batch? Closed-form or iterative?

Answer Is EM for GMMs…
Supervised or Unsupervised? Unsupervised.
Online or batch? Batch: if you add a new data point, you need to revisit all the training data to recompute the locally-optimal model.
Closed-form or iterative? Iterative: training requires many passes through the data.