Unsupervised Learning II: Soft Clustering with Gaussian Mixture Models


1 Unsupervised Learning II: Soft Clustering with Gaussian Mixture Models

2 Soft Clustering with Gaussian mixture models
A “soft” version of K-means clustering.
Given: Training set S = {x1, ..., xN}, and K.
Assumption: Data is generated by sampling from a “mixture” (linear combination) of K Gaussians.

7 Gaussian Mixture Models Assumptions
K clusters.
Each cluster is modeled by a Gaussian distribution with a certain mean and standard deviation (or covariance). [This contrasts with K-means, in which each cluster is modeled only by a mean.]
Assume that each data instance we have was generated by the following procedure (sketched in code below):
1. Select cluster ci with probability P(ci) = πi.
2. Sample a point from ci's Gaussian distribution.
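As an illustration of this two-step generative procedure, here is a minimal NumPy sketch; the mixture weights, means, and standard deviations below are made-up example values for a 1-D mixture of three Gaussians, not parameters from the lecture.

```python
# Minimal sketch of the two-step generative procedure above.
# The mixture weights, means, and standard deviations are illustrative values.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])      # mixing weights: P(c_k) = pi_k
mu = np.array([-4.0, 0.0, 5.0])     # cluster means
sigma = np.array([1.0, 0.5, 2.0])   # cluster standard deviations

def sample_from_mixture(n):
    # Step 1: select a cluster index for each point with probability pi_k
    k = rng.choice(len(pi), size=n, p=pi)
    # Step 2: sample each point from the selected cluster's Gaussian
    return rng.normal(loc=mu[k], scale=sigma[k])

x = sample_from_mixture(1000)       # 1000 points drawn from the mixture
```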

8 Mixture of three Gaussians (one dimensional data)

12 Clustering via Gaussian mixture models
Clusters: Each cluster will correspond to a single Gaussian. Each point x ∈ S will have some probability distribution over the K clusters.
Goal: Given the data S, find the Gaussians! (And their probabilities πi.)
i.e., find parameters {θK} of these K Gaussians such that P(S | {θK}) is maximized.
This is called a Maximum Likelihood method:
- S is the data.
- {θK} is the “hypothesis” or “model”.
- P(S | {θK}) is the “likelihood”.

13 General form of one-dimensional (univariate) Gaussian Mixture Model
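In standard notation, this is a weighted sum of K univariate Gaussians:

p(x) = Σ_{k=1}^{K} πk N(x | μk, σk²),   where 0 ≤ πk ≤ 1 and Σ_{k=1}^{K} πk = 1.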

17 Maximum Likelihood for Single Univariate Gaussian
Learning a GMM, simplest case: Maximum Likelihood for a single univariate Gaussian.
Assume the training set S has N values generated by a univariate Gaussian distribution.
Likelihood function: the probability of the data given the model (see below).
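Written out, the likelihood of S = {x1, ..., xN} under a single Gaussian with parameters μ and σ is:

p(S | μ, σ) = Π_{n=1}^{N} N(xn | μ, σ²) = Π_{n=1}^{N} (1 / √(2πσ²)) exp(−(xn − μ)² / (2σ²))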

21 How to estimate parameters μ and σ from S?
Maximize the likelihood function with respect to μ and σ. We want the μ and σ that maximize the probability of the data. Use the log of the likelihood in order to avoid very small values.

23 Log-Likelihood version:
Find a simplified expression for this.
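Taking the log of the likelihood above and simplifying:

ln p(S | μ, σ) = Σ_{n=1}^{N} ln N(xn | μ, σ²) = −(1 / (2σ²)) Σ_{n=1}^{N} (xn − μ)² − N ln σ − (N/2) ln(2π)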

29 Now, given the simplified expression for ln p(S | μ, σ), find the maximum likelihood parameters μ and σ. First, maximize with respect to μ: set ∂ ln p(S | μ, σ) / ∂μ = 0. Result: μML = (1/N) Σ_{n=1}^{N} xn   (ML = “Maximum Likelihood”)

39 Now, maximize with respect to σ. It’s actually easier to maximize with respect to σ²: set ∂ ln p(S | μ, σ) / ∂σ² = 0. Result: σ²ML = (1/N) Σ_{n=1}^{N} (xn − μML)²
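These two closed-form results are easy to check numerically; here is a small NumPy sketch on toy data (the generating mean and standard deviation are arbitrary, not from the lecture):

```python
# Quick numerical check of the closed-form ML estimates derived above.
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(loc=2.0, scale=3.0, size=500)   # toy 1-D training set

mu_ml = S.mean()                        # mu_ML = (1/N) * sum_n x_n
sigma2_ml = ((S - mu_ml) ** 2).mean()   # sigma^2_ML = (1/N) * sum_n (x_n - mu_ML)^2
print(mu_ml, sigma2_ml)                 # should be close to 2.0 and 9.0
```

Note that the ML variance divides by N, not N − 1, exactly as in the derivation above.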

40 θML parameterizes the model.
The resulting distribution is called a generative model because it can generate new data values. We say that θML = (μML, σML) parameterizes the model. In general, θ is used to denote the (learnable) parameters of a probabilistic model.

43 Discriminative vs. generative models in machine learning
Discriminative model: creates a discrimination curve (decision boundary) to separate the data into classes.
Generative model: gives the parameters θML of a probability distribution that maximizes the likelihood p(S | θ) of the data.

44 Learning a GMM More general case: Multivariate Gaussian Distribution
Multivariate (D-dimensional) Gaussian:
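In standard form, with mean vector μ and D×D covariance matrix Σ:

N(x | μ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp(−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ))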

45 Covariance: cov(xi, xj) = E[(xi − μi)(xj − μj)]. Variance: var(xi) = cov(xi, xi). Covariance matrix Σ: Σi,j = cov(xi, xj).

48 Let S be a set of multivariate data points (vectors):
S = {x1, ..., xm}.
General expression for a finite Gaussian mixture model (see below). That is, x has a probability of “membership” in multiple clusters/classes.
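In this notation, the finite Gaussian mixture model is:

p(x) = Σ_{k=1}^{K} πk N(x | μk, Σk),   with 0 ≤ πk ≤ 1 and Σ_{k=1}^{K} πk = 1,

and the “membership” probability of x in cluster ck is P(ck | x) = πk N(x | μk, Σk) / p(x).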

53 Maximum Likelihood for Multivariate Gaussian Mixture Model
Goal: Given S = {x1, ..., xN} and K, find the Gaussian mixture model (with K multivariate Gaussians) for which S has maximum log-likelihood.
Log-likelihood function: see below.
Given S, we can maximize this function to find the maximum-likelihood parameters {πk, μk, Σk}.
But there is no closed-form solution (unlike the simple case in the previous slides).
In this multivariate case, we can efficiently maximize this function using the “Expectation-Maximization” (EM) algorithm.
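The log-likelihood being maximized is:

ln p(S | π, μ, Σ) = Σ_{n=1}^{N} ln ( Σ_{k=1}^{K} πk N(xn | μk, Σk) )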

60 Expectation-Maximization (EM) algorithm
Goal: Learn the parameters {πk, μk, Σk} for k = 1, …, K.
General idea:
- Choose random initial values for means, covariances, and mixing coefficients. (Analogous to choosing random initial cluster centers in K-means.)
- Alternate between the E (expectation) and M (maximization) steps:
  - E step: use the current parameter values to evaluate posterior probabilities, or “responsibilities”, for each data point. (Analogous to determining which cluster a point belongs to, in K-means.)
  - M step: use these probabilities to re-estimate the means, covariances, and mixing coefficients. (Analogous to moving the cluster centers to the means of their members, in K-means.)
- Repeat until the log-likelihood or the parameters θ do not change significantly.

63 More detailed version of EM algorithm
Initialization: Let X be the set of training data. Initialize the means μk, covariances Σk, and mixing coefficients πk, and evaluate the initial value of the log-likelihood.
E step: Evaluate the “responsibilities” using the current parameter values (formula below), where rn,k denotes the “responsibility” of the kth cluster for the nth data point.
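The responsibilities are the posterior cluster probabilities under the current parameters:

rn,k = πk N(xn | μk, Σk) / Σ_{j=1}^{K} πj N(xn | μj, Σj)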

64 M step. Re-estimate the parameters θ using the current responsibilities.
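The standard M-step updates, using the responsibilities from the E step, are:

Nk = Σ_{n=1}^{N} rn,k
μk(new) = (1/Nk) Σ_{n=1}^{N} rn,k xn
Σk(new) = (1/Nk) Σ_{n=1}^{N} rn,k (xn − μk(new)) (xn − μk(new))ᵀ
πk(new) = Nk / N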

65 Recompute log-likelihood: Evaluate the log-likelihood with the new parameters
and check for convergence of either the parameters or the log likelihood. If not converged, return to step 2.
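Putting the initialization, E step, M step, and convergence check together, here is a minimal NumPy/SciPy sketch of EM for a GMM. The function and variable names (em_gmm, S, K, r, pi, mu, Sigma) are illustrative choices, not code from the lecture, and no numerical safeguards are included.

```python
# Minimal EM-for-GMM sketch following the steps above.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(S, K, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = S.shape
    # Initialization: random means drawn from the data, identity covariances,
    # uniform mixing coefficients
    mu = S[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    old_ll = -np.inf
    for _ in range(n_iter):
        # E step: responsibilities r[n, k]
        weighted = np.column_stack([
            pi[k] * multivariate_normal.pdf(S, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])                                    # shape (N, K): pi_k * N(x_n | mu_k, Sigma_k)
        r = weighted / weighted.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the responsibilities
        Nk = r.sum(axis=0)                    # effective number of points per cluster
        mu = (r.T @ S) / Nk[:, None]
        for k in range(K):
            diff = S - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N
        # Convergence check on the log-likelihood
        ll = np.log(weighted.sum(axis=1)).sum()
        if abs(ll - old_ll) < tol:
            break
        old_ll = ll
    return pi, mu, Sigma, r
```

In practice the random initialization above would be replaced by the K-means initialization described on the next slide.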

66 EM much more computationally expensive than K-means
Common practice: Use K-means to set initial parameters, then improve with EM.
- Initial means: means of the clusters found by K-means.
- Initial covariances: sample covariances of the clusters found by K-means.
- Initial mixing coefficients: fractions of data points assigned to the respective clusters.

67 One can prove that EM finds local maxima of the log-likelihood function.
EM is a very general technique for finding maximum-likelihood solutions for probabilistic models.

68 Using GMM for Classification
Assume each cluster corresponds to one of the classes. A new test example x is classified according to the rule below.
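A natural rule, given one Gaussian component per class, is to assign x to the class with the highest posterior:

class(x) = argmaxk P(ck | x) = argmaxk πk N(x | μk, Σk)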

69 Case Study: Text classification from labeled and unlabeled documents using EM (K. Nigam et al., Machine Learning, 2000)
Big problem with text classification: we need labeled data. What we have: lots of unlabeled data.
Question of the paper: Can unlabeled data be used to increase classification accuracy? In other words: Is there any information implicit in unlabeled data? Is there any way to take advantage of this implicit information?

70 General idea: A version of EM algorithm
Train a classifier with the small set of available labeled documents. Use this classifier to assign probabilistically-weighted class labels to the unlabeled documents. Then train a new classifier using all the documents, both originally labeled and formerly unlabeled. Iterate.

71 Representation of Documents
A document d is represented by a vector of word frequencies: d = (f1, f2, ..., fN) for the N words in the vocabulary. Assume there are K classes, c1, ..., cK (e.g., classes might correspond to different “topics”).

72 Probabilistic framework
Assumes data are generated by a Gaussian mixture model with K components.
Assumes a one-to-one correspondence between mixture components and classes.
“These assumptions rarely hold in real-world text data.”

73 Probabilistic framework
Let θ = {π1, ..., πK} ∪ {μ1, ..., μK} ∪ {Σ1, ..., ΣK} be the mixture parameters.
Assumptions: Document di is created by (1) selecting mixture component ck according to the mixture weights πk, then (2) this selected mixture component generates a document with distribution N(di | μk, Σk).
Likelihood of document di under model θ: see below.
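Under these assumptions, the likelihood of document di is the mixture density:

p(di | θ) = Σ_{k=1}^{K} πk N(di | μk, Σk)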

74 Now, we will apply EM to a Naive Bayes Classifier
Recall Naive Bayes classifier: Assume each feature is conditionally independent, given cj.
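For a document d = (f1, ..., fN), the resulting classification rule is:

P(cj | d) ∝ P(cj) Π_{i=1}^{N} P(fi | cj),   and d is assigned to the class cj with the highest posterior.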

75 To train naïve Bayes from labeled data, estimate the class priors P(cj) and the conditional probabilities P(fi | cj).
Note that under our assumption of a Gaussian mixture model, these are estimates of the parameters θ. Call these values θ̂. Thus, naïve Bayes gives us an initial estimate (θ̂) of our GMM. We’re going to use unlabeled data to improve this estimate.

76 Applying EM to Naive Bayes
We have a small number of labeled documents Slabeled and a large number of unlabeled documents Sunlabeled.
The initial parameters θ̂ are estimated from the labeled documents Slabeled.
Expectation step: The resulting classifier is used to assign probabilistically-weighted class labels to each unlabeled document x ∈ Sunlabeled.
Maximization step: Re-estimate θ̂ using the values for x ∈ Slabeled ∪ Sunlabeled.
Repeat until θ̂ (or the likelihood) has converged.
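A rough scikit-learn sketch of this loop, using MultinomialNB as the classifier. X_labeled and X_unlabeled are assumed to be dense word-count matrices and y_labeled the class labels; these names and the fixed iteration count are illustrative choices, not details from the paper (for sparse matrices, scipy.sparse.vstack would replace np.vstack).

```python
# Rough sketch of EM with a naive Bayes classifier on labeled + unlabeled text.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_labeled, y_labeled, X_unlabeled, n_iter=10):
    classes = np.unique(y_labeled)      # same order as clf.predict_proba columns
    # Initial estimate: train only on the labeled documents
    clf = MultinomialNB().fit(X_labeled, y_labeled)
    for _ in range(n_iter):
        # E step: probabilistically-weighted class labels for the unlabeled documents
        probs = clf.predict_proba(X_unlabeled)          # shape (n_unlabeled, n_classes)
        # M step: retrain on labeled + unlabeled documents; each unlabeled document
        # appears once per class, weighted by its posterior probability for that class
        X_all = np.vstack([X_labeled] + [X_unlabeled] * len(classes))
        y_all = np.concatenate([y_labeled] +
                               [np.full(X_unlabeled.shape[0], c) for c in classes])
        w_all = np.concatenate([np.ones(X_labeled.shape[0])] +
                               [probs[:, i] for i in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf
```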

81 Data
- 20 UseNet newsgroups
- Web pages (WebKB)
- Newswire articles (Reuters)

82 Results


