CHAPTER 7: Clustering: K-Means and EM (Ch. Eick; modified Alpaydin transparencies and new transparencies added). Last updated: February 25, 2014.


2. k-Means Clustering
Problem: find k reference vectors M = {m_1, ..., m_k} (representatives/centroids) and a 1-of-k coding scheme b_i^t (mapping each x^t in X to one m_i) that minimize the "reconstruction error"
    E(M, b | X) = \sum_t \sum_i b_i^t \, ||x^t - m_i||^2
Both the assignments b_i^t and the centroids M have to be found to minimize E:
Step 1: given M, find the best assignments: b_i^t = 1 if ||x^t - m_i|| = min_j ||x^t - m_j||, and b_i^t = 0 otherwise.
Step 2: given the assignments, update the centroids: m_i = (\sum_t b_i^t x^t) / (\sum_t b_i^t).
Problem: b_i^t depends on M and M depends on b_i^t, so only an iterative solution can be found!
(Significantly different from the book!)
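A minimal NumPy sketch of the two alternating steps above; the initialization scheme (k randomly chosen data points as initial centroids) and the convergence test are assumptions of this sketch, not prescribed by the slide:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal k-means: alternate assignment (Step 1) and centroid update (Step 2)."""
    rng = np.random.default_rng(seed)
    # initialize centroids with k randomly chosen data points (an assumption of this sketch)
    M = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each x^t to its nearest centroid (1-of-k coding b_i^t)
        dist = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)  # n x k distance matrix
        labels = dist.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        new_M = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else M[i]
                          for i in range(k)])
        if np.allclose(new_M, M):   # stop once the centroids no longer move
            break
        M = new_M
    return M, labels

# usage: centroids, labels = k_means(np.random.rand(200, 2), k=3)
```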

3. k-means Clustering
(Alpaydin slide; the slide body was not captured in this transcript.)

4. k-Means Clustering Reformulated
Given: dataset X, k (and a random seed).
Problem: find k vectors M = {m_1, ..., m_k} in R^d and a 1-of-k coding scheme b_i^t (b_i^t: X -> M) that minimize the error
    E(M, b | X) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \, ||x^t - m_i||^2
subject to: b_i^t \in {0, 1} and \sum_{i=1}^{k} b_i^t = 1 for t = 1, ..., N ("coding scheme").

5. k-Means Clustering Generalized
Given: dataset X, k, and a distance function d.
Problem: find k vectors M = {m_1, ..., m_k} in R^d and a 1-of-k coding scheme b_i^t (b_i^t: X -> M) that minimize the error
    E(M, b | X) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \, d(x^t, m_i)
subject to: b_i^t \in {0, 1} and \sum_{i=1}^{k} b_i^t = 1 for t = 1, ..., N ("coding scheme"); the squared Euclidean distance of the previous slide is replaced by the generic distance function d (a sketch of this assignment step follows below).
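A hedged sketch of the generalized assignment step with a pluggable distance function d. The example distance functions are illustrative, not from the slides, and the centroid update shown in the usage note still uses the mean, which is exact only for squared Euclidean distance (other choices of d would call for a different representative, e.g. a medoid):

```python
import numpy as np

def assign(X, M, d):
    """1-of-k coding: b_i^t = 1 for the centroid m_i minimizing d(x^t, m_i)."""
    dist = np.array([[d(x, m) for m in M] for x in X])   # n x k matrix of d(x^t, m_i)
    return dist.argmin(axis=1)

# example distance functions (illustrative assumptions of this sketch)
euclidean = lambda x, m: np.linalg.norm(x - m)
manhattan = lambda x, m: np.abs(x - m).sum()

# usage inside a k-means-style loop:
#   labels = assign(X, M, manhattan)
#   M = np.array([X[labels == i].mean(axis=0) for i in range(len(M))])
```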

6. k-Means Complexity: O(t*k*n*d)
k-Means performs t iterations of E/M-steps; in each iteration the following computations have to be done:
E-step:
1. Find the nearest centroid for each object; because there are n objects and k centroids, n*k distances have to be computed and compared; computing one distance costs O(d), yielding O(k*n*d).
2. Assign objects to clusters / update the clusters: O(n).
M-step: compute the centroid of each cluster: O(n*d).
We obtain: O(t*(k*n*d + n + n*d)) = O(t*k*n*d).
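As an illustrative cost estimate (the numbers are made up for this example, not from the slides): for n = 100,000 objects with d = 50 attributes, k = 10 clusters and t = 20 iterations,

    t * k * n * d = 20 * 10 * 100,000 * 50 = 10^9

elementary distance operations dominate the run time, while the centroid updates contribute only t * n * d = 20 * 100,000 * 50 = 10^8 operations.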

7. Semiparametric Density Estimation
Parametric: assume a single model for p(x | C_i) (Chapters 4 and 5).
Semiparametric: p(x | C_i) is a mixture of densities; multiple possible explanations/prototypes.
Nonparametric: no model; the data speaks for itself (Chapter 8).

8. Mixture Densities
    p(x) = \sum_{i=1}^{k} p(x | G_i) P(G_i)
where the G_i are the components/groups/clusters, P(G_i) are the mixture proportions (priors), and p(x | G_i) are the component densities.
Gaussian mixture: p(x | G_i) ~ N(μ_i, Σ_i), with parameters Φ = {P(G_i), μ_i, Σ_i}_{i=1}^{k} and an unlabeled sample X = {x^t}_t (unsupervised learning).
Idea: use one density function per cluster, or use multiple density functions for a single class.
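A small sketch evaluating such a mixture density p(x) = \sum_i P(G_i) p(x | G_i) with SciPy. The priors and means reuse the illustrative numbers from the EM example on the next transparency; the identity covariances are an assumption of this sketch:

```python
import numpy as np
from scipy.stats import multivariate_normal

# mixture parameters Phi = {P(G_i), mu_i, Sigma_i}; covariances assumed to be identity here
priors = [0.5, 0.4, 0.1]
means  = [np.array([2.0, 3.0]), np.array([-4.0, 1.0]), np.array([0.0, -9.0])]
covs   = [np.eye(2)] * 3

def mixture_pdf(x):
    """p(x) = sum_i P(G_i) * p(x | G_i): a weighted sum of Gaussian component densities."""
    return sum(P * multivariate_normal.pdf(x, mean=m, cov=S)
               for P, m, S in zip(priors, means, covs))

print(mixture_pdf(np.array([1.5, 2.5])))   # density near the first component's mean
```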

9. Expectation-Maximization (EM) Clustering
Assumptions: the dataset X is created by a mixture of k multivariate Gaussians G_1, ..., G_k with priors P(G_1), ..., P(G_k).
Example (2-D dataset):
p(x|G_1) ~ N((2,3), Σ_1), P(G_1) = 0.5
p(x|G_2) ~ N((-4,1), Σ_2), P(G_2) = 0.4
p(x|G_3) ~ N((0,-9), Σ_3), P(G_3) = 0.1
EM's task: given X, reconstruct the k Gaussians and their priors; each Gaussian is a model of one cluster in the dataset.
Forming of clusters: assign x_i to the cluster G_r with the maximum value of p(x_i|G_r) P(G_r); alternatively, use the normalized posterior h_ir = p(x_i|G_r) P(G_r) / \sum_j p(x_i|G_j) P(G_j) as a fractional membership (soft EM).
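A sketch of this cluster-forming rule using the example parameters above (the identity covariances are again an assumption of this sketch): the hard assignment picks argmax_r p(x_i|G_r) P(G_r), while soft EM keeps the normalized posteriors.

```python
import numpy as np
from scipy.stats import multivariate_normal

priors = np.array([0.5, 0.4, 0.1])
means  = [np.array([2.0, 3.0]), np.array([-4.0, 1.0]), np.array([0.0, -9.0])]
covs   = [np.eye(2)] * 3   # covariances are an assumption of this sketch

def memberships(x):
    # unnormalized scores p(x|G_r) * P(G_r)
    scores = np.array([P * multivariate_normal.pdf(x, mean=m, cov=S)
                       for P, m, S in zip(priors, means, covs)])
    hard = scores.argmax()            # hard assignment: maximum-posterior cluster
    soft = scores / scores.sum()      # soft EM: posterior probabilities P(G_r | x)
    return hard, soft

print(memberships(np.array([-3.0, 0.5])))   # should favor the second Gaussian
```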

10. Performance Measure of EM
EM maximizes the log likelihood of a mixture model,
    L(Φ | X) = \sum_t log p(x^t | Φ) = \sum_t log \sum_i P(G_i) p(x^t | G_i),
i.e., we choose the Φ which makes X most likely.
Assume hidden variables z which, when known, make the optimization much simpler: z_i^t indicates whether example x^t belongs to cluster i; z_i^t is not known and has to be estimated; h_i^t denotes its estimator (the probability that x^t belongs to cluster i).
EM maximizes the log likelihood of the sample, but instead of maximizing L(Φ | X) directly, it works with the complete-data log likelihood L_C(Φ | X, Z), expressed in terms of x and z.
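For reference, the expected complete-data log likelihood that the M-step maximizes, written out in the standard mixture-model notation used above (the slide refers to it but does not spell it out):

    Q(Φ | Φ^l) = E[ L_C(Φ | X, Z) | X, Φ^l ] = \sum_t \sum_i h_i^t [ log P(G_i) + log p(x^t | G_i) ],  with  h_i^t = E[ z_i^t | X, Φ^l ].

It can be shown that increasing Q never decreases the incomplete log likelihood L(Φ | X).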

11. E- and M-steps
Iterate the two steps:
1. E-step: estimate z given X and the current Φ^l, i.e., compute h_i^t = E[z_i^t | X, Φ^l] (the assignments of samples to clusters, derived from the current model).
2. M-step: find the new Φ^{l+1} given h, X, and the old Φ^l, i.e., Φ^{l+1} = argmax_Φ Q(Φ | Φ^l).
An increase in Q increases the incomplete likelihood L(Φ | X).
Remark: z_i^t denotes the unknown membership of the t-th example.

12. EM in Gaussian Mixtures
z_i^t = 1 if x^t belongs to G_i and 0 otherwise; assume p(x | G_i) ~ N(μ_i, Σ_i).
E-step: estimate the mixture responsibilities for X = {x^1, ..., x^N} using the current parameters Φ^l:
    h_i^t = P(G_i) p(x^t | G_i) / \sum_j P(G_j) p(x^t | G_j)
M-step: obtain Φ^{l+1}, namely (P(G_i), m_i, S_i)^{l+1} for i = 1, ..., k:
    P(G_i) = (\sum_t h_i^t) / N
    m_i = (\sum_t h_i^t x^t) / (\sum_t h_i^t)
    S_i = (\sum_t h_i^t (x^t - m_i)(x^t - m_i)^T) / (\sum_t h_i^t)
Use the estimated labels h_i^t in place of the unknown labels z_i^t.
Remarks: h_i^t plays the role of b_i^t in k-Means; h_i^t acts as an estimator for the unknown label z_i^t.
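A minimal NumPy/SciPy sketch of these E- and M-steps for a Gaussian mixture; the initialization scheme and the small diagonal regularization of the covariances are assumptions of this sketch, not part of the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0, reg=1e-6):
    """EM for a mixture of k Gaussians: E-step computes h_i^t, M-step re-estimates Phi."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    priors = np.full(k, 1.0 / k)
    means = X[rng.choice(n, size=k, replace=False)]               # init: k random data points
    covs = np.array([np.cov(X.T) + reg * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: h_i^t = P(G_i) p(x^t|G_i) / sum_j P(G_j) p(x^t|G_j)
        h = np.array([priors[i] * multivariate_normal.pdf(X, mean=means[i], cov=covs[i])
                      for i in range(k)]).T                        # n x k responsibilities
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means and covariances from the responsibilities
        Nk = h.sum(axis=0)                                         # effective cluster sizes
        priors = Nk / n
        means = (h.T @ X) / Nk[:, None]
        covs = np.array([(h[:, i, None] * (X - means[i])).T @ (X - means[i]) / Nk[i]
                         + reg * np.eye(d) for i in range(k)])
    return priors, means, covs, h

# usage: priors, means, covs, h = em_gmm(np.random.rand(300, 2), k=3)
```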

13. EM Means Expectation-Maximization
1. E-step (Expectation): determine the objects' memberships in the clusters (from the current model).
2. M-step (Maximization): create the model by maximum likelihood estimation (from the cluster memberships).
Computations in the two steps (k-Means / EM):
1. Compute b_i^t / h_i^t.
2. Compute the centroid / the (prior, mean, covariance matrix) of each cluster.
Remark: k-Means uses a simple form of the EM procedure!

14. Commonalities between K-Means and EM
1. Both start with random clusters and rely on a two-step approach that optimizes the objective function using the EM procedure (see the previous transparency).
2. Both use the same optimization scheme for an objective function f(a_1, ..., a_m, b_1, ..., b_k): optimize the a-values keeping the b-values fixed, then the b-values keeping the a-values fixed, and repeat until convergence. Consequently, both algorithms
   - only find a local optimum of the objective function
   - are sensitive to initialization.
3. Both assume the number of clusters k is known.

15. Differences between K-Means and EM I
1. k-Means is distance-based and relies on 1-NN queries (nearest centroid) to form clusters. EM is density-based/probabilistic; EM usually works with multivariate Gaussians but can be generalized to other probability distributions.
2. k-Means minimizes the squared distance of an object to its cluster prototype (usually the centroid). EM maximizes the log-likelihood of a sample given a model, p(X | Φ); the models are assumed to be mixtures of k Gaussians together with their priors.
3. k-Means is a hard clustering algorithm; EM is a soft clustering algorithm: h_i^t ∈ [0, 1].

16. Differences between K-Means and EM II
4. k-Means cluster models are just k centroids; EM cluster models are k triples of (prior, mean, covariance matrix).
5. EM directly deals with dependencies between attributes through its density estimation approach: the degree to which an object x belongs to a cluster c depends on the product of c's prior and c's Gaussian density, whose exponent is the squared Mahalanobis distance between x and c's mean; therefore, EM clusters do not depend on the units of measurement or on the orientation of the attributes in space.
6. The distance metric can be viewed as an input parameter of k-Means, and generalizations of k-Means have been proposed that use different distance functions. EM implicitly relies on the Mahalanobis distance, which is part of its density estimation approach.
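For reference, a small sketch of the squared Mahalanobis distance that appears in the exponent of each Gaussian component; it rescales and de-correlates the attributes, which is why EM clusters are insensitive to units of measurement (the example covariance matrices are made up for illustration):

```python
import numpy as np

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)^T cov^{-1} (x - mean)."""
    diff = x - mean
    return float(diff @ np.linalg.solve(cov, diff))

# with an identity covariance this reduces to the squared Euclidean distance
x, mean = np.array([1.0, 2.0]), np.array([0.0, 0.0])
print(mahalanobis_sq(x, mean, np.eye(2)))                            # 5.0
print(mahalanobis_sq(x, mean, np.array([[4.0, 0.0], [0.0, 1.0]])))   # 0.25 + 4.0 = 4.25
```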

17. Mixture of Mixtures
In classification, the input comes from a mixture of classes (supervised). If each class is in turn a mixture, e.g. of Gaussians (unsupervised), we have a mixture of mixtures:
    p(x | C_i) = \sum_{j=1}^{k_i} p(x | G_{ij}) P(G_{ij})
    p(x) = \sum_{i=1}^{K} p(x | C_i) P(C_i)

18. Choosing k
- Defined by the application, e.g., image quantization.
- Plot the data (after PCA) and check visually for clusters.
- Incremental (leader-cluster) algorithm: add one cluster at a time until an "elbow" appears in the reconstruction error, log likelihood, or intergroup distances (a sketch follows below).
- Manual check for meaning.
- Run with multiple k values and compare the results.
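A hedged sketch of the "elbow" heuristic mentioned above, using scikit-learn's KMeans (assumed available; its inertia_ attribute is the sum-of-squared-distances reconstruction error) on placeholder data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(500, 2)          # placeholder data; substitute the real dataset

ks = range(1, 11)
errors = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), errors, marker='o')
plt.xlabel('k')
plt.ylabel('reconstruction error (sum of squared distances)')
plt.show()                          # look for the "elbow" where the error stops dropping sharply
```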