Unsupervised Learning II: Soft Clustering with Gaussian Mixture Models

Soft Clustering with Gaussian mixture models A “soft” version of K-means clustering Given: Training set S = {x1, ..., xN}, and K. Assumption: Data is generated by sampling from a “mixture” (linear combination) of K Gaussians.

Gaussian Mixture Models: Assumptions. There are K clusters. Each cluster is modeled by a Gaussian distribution with a certain mean and standard deviation (or covariance). [This contrasts with K-means, in which each cluster is modeled only by a mean.] Assume that each data instance was generated by the following procedure (see the sketch below): 1. Select cluster ci with probability P(ci) = πi. 2. Sample a point from ci's Gaussian distribution.
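As a concrete illustration of this two-step generative procedure, here is a minimal NumPy sketch for a one-dimensional mixture with K = 3; the mixing weights, means, and standard deviations below are made-up values for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical parameters for a 1-D mixture with K = 3 clusters
    pi    = np.array([0.5, 0.3, 0.2])    # mixing coefficients P(c_i) = pi_i
    mu    = np.array([-2.0, 0.0, 3.0])   # cluster means
    sigma = np.array([0.5, 1.0, 0.8])    # cluster standard deviations

    def sample_gmm(n):
        # Step 1: select a cluster c_i with probability pi_i
        c = rng.choice(len(pi), size=n, p=pi)
        # Step 2: sample a point from c_i's Gaussian distribution
        return rng.normal(mu[c], sigma[c])

    S = sample_gmm(1000)   # synthetic training set drawn from the mixture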

Mixture of three Gaussians (one-dimensional data)

Clustering via Gaussian mixture models Clusters: Each cluster will correspond to a single Gaussian. Each point x ∈ S will have some probability distribution over the K clusters. Goal: Given the data S, find the Gaussians! (And their probabilities πi.) That is, find parameters {θk} of these K Gaussians such that P(S | {θk}) is maximized. This is called a maximum likelihood method: S is the data, {θk} is the “hypothesis” or “model”, and P(S | {θk}) is the “likelihood”.

General form of one-dimensional (univariate) Gaussian Mixture Model
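A univariate Gaussian mixture with K components has the standard density:

    p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2),
    \qquad
    \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
    \qquad
    \sum_{k=1}^{K} \pi_k = 1,\ \ \pi_k \ge 0.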

Learning a GMM. Simplest case: maximum likelihood for a single univariate Gaussian. Assume training set S has N values generated by a univariate Gaussian distribution. Likelihood function: the probability of the data given the model:
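For a single univariate Gaussian, this likelihood takes the standard form:

    p(S \mid \mu, \sigma) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)
                          = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_n-\mu)^2}{2\sigma^2}\right)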

How to estimate the parameters μ and σ from S? Maximize the likelihood function with respect to μ and σ: we want the μ and σ that maximize the probability of the data. Use the log of the likelihood in order to avoid very small values.

Log-Likelihood version: Find a simplified expression for this.
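Taking logs and simplifying gives:

    \ln p(S \mid \mu, \sigma) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi)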

Now, given the simplified expression for ln p(S | μ, σ), find the maximum likelihood parameters μ and σ. First, maximize with respect to μ by setting the derivative to 0. Result (ML = “Maximum Likelihood”):
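Setting the derivative with respect to μ to zero gives the sample mean:

    \frac{\partial}{\partial\mu}\ln p(S \mid \mu, \sigma) = \frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n - \mu) = 0
    \quad\Longrightarrow\quad
    \mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n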

Now, maximize with respect to σ. It’s actually easier to maximize with respect to σ², again setting the derivative to 0.
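Setting the derivative with respect to σ² to zero gives the sample variance around μ_ML:

    \frac{\partial}{\partial\sigma^2}\ln p(S \mid \mu, \sigma)
    = \frac{1}{2\sigma^4}\sum_{n=1}^{N}(x_n - \mu)^2 - \frac{N}{2\sigma^2} = 0
    \quad\Longrightarrow\quad
    \sigma^2_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{ML})^2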

The resulting distribution is called a generative model because it can generate new data values. We say that θML parameterizes the model. In general, θ is used to denote the (learnable) parameters of a probabilistic model.

Discriminative vs. generative models in machine learning. Discriminative model: creates a discrimination curve (decision boundary) to separate the data into classes. Generative model: gives the parameters θML of a probability distribution that maximizes the likelihood p(S | θ) of the data.

Learning a GMM, more general case: the multivariate Gaussian distribution. Multivariate (D-dimensional) Gaussian:
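The D-dimensional Gaussian density, with mean vector μ and covariance matrix Σ, is:

    \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma)
    = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}
      \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)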

Covariance: cov(xi, xj) = E[(xi − E[xi])(xj − E[xj])]. Variance: var(xi) = cov(xi, xi). Covariance matrix Σ: Σi,j = cov(xi, xj).

Let S be a set of multivariate data points (vectors): S = {x1, ..., xN}. General expression for a finite Gaussian mixture model (below): that is, x has a probability of “membership” in multiple clusters/classes.
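The general finite mixture referred to above is:

    p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma_k),
    \qquad \sum_{k=1}^{K}\pi_k = 1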

Maximum Likelihood for a Multivariate Gaussian Mixture Model. Goal: Given S = {x1, ..., xN} and K, find the Gaussian mixture model (with K multivariate Gaussians) for which S has maximum log-likelihood. Given S, we can maximize the log-likelihood function (below) to find the parameters {πk, μk, Σk}. But there is no closed-form solution (unlike the simple case on the previous slides). In this multivariate case, we can efficiently maximize this function using the “Expectation-Maximization” (EM) algorithm.
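The log-likelihood being maximized is:

    \ln p(S \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
    = \sum_{n=1}^{N} \ln\!\left(\sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \Sigma_k)\right)

The sum inside the logarithm is what prevents a closed-form solution.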

Expectation-Maximization (EM) algorithm. Goal: learn the parameters (means, covariances, and mixing coefficients) for k = 1, …, K. General idea: choose random initial values for the means, covariances, and mixing coefficients (analogous to choosing random initial cluster centers in K-means). Then alternate between the E (expectation) and M (maximization) steps. E step: use the current parameter values to evaluate the posterior probabilities, or “responsibilities”, for each data point (analogous to determining which cluster a point belongs to in K-means). M step: use these probabilities to re-estimate the means, covariances, and mixing coefficients (analogous to moving the cluster centers to the means of their members in K-means). Repeat until the log-likelihood or the parameters θ do not change significantly.

More detailed version of the EM algorithm. Initialization: Let X be the set of training data. Initialize the means μk, covariances Σk, and mixing coefficients πk, and evaluate the initial value of the log-likelihood. E step: evaluate the “responsibilities” rn,k using the current parameter values, where rn,k denotes the “responsibility” of the kth cluster for the nth data point.
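In the standard notation, with the current parameter values, the responsibility of cluster k for point x_n is:

    r_{n,k} = \frac{\pi_k \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \Sigma_k)}
                   {\sum_{j=1}^{K} \pi_j \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \Sigma_j)}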

M step. Re-estimate the parameters θ using the current responsibilities.
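The standard re-estimation formulas, where N_k = \sum_{n} r_{n,k} is the effective number of points assigned to cluster k, are:

    \boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k}\sum_{n=1}^{N} r_{n,k}\,\mathbf{x}_n, \qquad
    \Sigma_k^{\text{new}} = \frac{1}{N_k}\sum_{n=1}^{N} r_{n,k}\,(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})^{\mathsf T}, \qquad
    \pi_k^{\text{new}} = \frac{N_k}{N}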

Recompute log-likelihood: Evaluate the log-likelihood with the new parameters and check for convergence of either the parameters or the log likelihood. If not converged, return to step 2.
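Putting the three steps together, here is a minimal NumPy/SciPy sketch of the loop. It uses random initialization rather than K-means, the names (em_gmm, r, Nk) are my own, and a production implementation would compute densities in log space and regularize the covariances for numerical stability.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        N, D = X.shape

        # Initialization: random means, identity covariances, uniform mixing coefficients
        mu = X[rng.choice(N, size=K, replace=False)]
        Sigma = np.array([np.eye(D) for _ in range(K)])
        pi = np.full(K, 1.0 / K)

        prev_ll = -np.inf
        for _ in range(n_iter):
            # E step: r[n, k] proportional to pi_k * N(x_n | mu_k, Sigma_k)
            dens = np.column_stack([
                pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                for k in range(K)
            ])
            r = dens / dens.sum(axis=1, keepdims=True)

            # M step: re-estimate parameters from the responsibilities
            Nk = r.sum(axis=0)
            mu = (r.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mu[k]
                Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
            pi = Nk / N

            # Convergence check on the log-likelihood (of the parameters used in this E step)
            ll = np.log(dens.sum(axis=1)).sum()
            if abs(ll - prev_ll) < tol:
                break
            prev_ll = ll

        return pi, mu, Sigma, r

In practice one also guards against collapsing components (singular covariances), for example by adding a small multiple of the identity to each Σk.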

EM is much more computationally expensive than K-means. Common practice: use K-means to set the initial parameters, then improve with EM. Initial means: the means of the clusters found by K-means. Initial covariances: the sample covariances of the clusters found by K-means. Initial mixing coefficients: the fractions of data points assigned to the respective clusters.
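As a usage sketch, scikit-learn's GaussianMixture follows this practice: its default init_params="kmeans" initializes the model from a K-means clustering. The data X below is a synthetic placeholder.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.default_rng(0).normal(size=(500, 2))   # placeholder data; replace with real features

    # init_params="kmeans" (the default) derives the initial parameters from a K-means run
    gmm = GaussianMixture(n_components=3, covariance_type="full", init_params="kmeans").fit(X)
    resp = gmm.predict_proba(X)   # soft cluster memberships ("responsibilities")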

One can prove that EM finds local maxima of the log-likelihood function. EM is a very general technique for finding maximum-likelihood solutions for probabilistic models.

Using GMMs for Classification. Assume each cluster corresponds to one of the classes. A new test example x is classified according to the class whose component gives it the highest posterior probability:
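With one Gaussian per class, the standard rule is:

    \hat{c}(\mathbf{x}) = \arg\max_k \; P(c_k \mid \mathbf{x})
                        = \arg\max_k \; \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma_k)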

Case Study: Text classification from labeled and unlabeled documents using EM K. Nigam et al., Machine Learning, 2000 Big problem with text classification: need labeled data. What we have: lots of unlabeled data. Question of this paper: Can unlabeled data be used to increase classification accuracy? In other words: Any information implicit in unlabeled data? Any way to take advantage of this implicit information?

General idea: a version of the EM algorithm. Train a classifier with the small set of available labeled documents. Use this classifier to assign probabilistically-weighted class labels to the unlabeled documents. Then train a new classifier using all the documents, both originally labeled and formerly unlabeled. Iterate.

Representation of Documents A document d is represented by a vector of word frequencies: d = ( f1 , f2 , ... , fN ) for the N words in the vocabulary. Assume there are K classes, c1 , ... , cK (E.g., classes might correspond to different “topics”)

Probabilistic framework. Assumes the data are generated by a Gaussian mixture model with K components. Assumes a one-to-one correspondence between mixture components and classes. “These assumptions rarely hold in real-world text data.”

Probabilistic framework. Let θ = {π1, ..., πK} ∪ {μ1, ..., μK} ∪ {Σ1, ..., ΣK} be the mixture parameters. Assumptions: document di is created by (1) selecting mixture component j according to the mixture weights πj, and (2) having the selected mixture component generate a document with distribution N(di | μj, Σj). Likelihood of document di under model θ:
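Under these assumptions, the likelihood of document d_i has the same mixture form as before:

    p(d_i \mid \theta) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}(d_i \mid \boldsymbol{\mu}_j, \Sigma_j)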

Now, we will apply EM to a Naive Bayes classifier. Recall the Naive Bayes classifier: assume each feature is conditionally independent, given the class cj.
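With word-frequency features f_1, ..., f_N, the standard naive Bayes decision rule is:

    c_{NB}(d) = \arg\max_{c_j} \; P(c_j) \prod_{i=1}^{N} P(f_i \mid c_j)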

To train naïve Bayes from labeled data, estimate the class priors P(cj) and the conditional probabilities P(fi | cj). Note that, under our assumption of a Gaussian mixture model, these are estimates of the parameters θ. Thus, naïve Bayes gives us an initial estimate of our GMM. We’re going to use unlabeled data to improve this estimate.

Applying EM to Naive Bayes. We have a small number of labeled documents Slabeled and a large number of unlabeled documents Sunlabeled. The initial parameters are estimated from the labeled documents Slabeled. Expectation step: the resulting classifier is used to assign probabilistically-weighted class labels to each unlabeled document x ∈ Sunlabeled. Maximization step: re-estimate the parameters using all documents x ∈ Slabeled ∪ Sunlabeled. Repeat until the parameters (or the log-likelihood) have converged.

Data: 20 UseNet newsgroups, web pages (WebKB), newswire articles (Reuters).

Results