Clustering: K-Means. Machine Learning 10-601, Fall 2014. Bhavana Dalvi Mishra, PhD student, LTI, CMU. Slides are based on materials from Prof. Eric Xing, Prof. William Cohen and Prof. Andrew Ng.

Outline
- What is clustering?
- How are similarity measures defined?
- Different clustering algorithms: K-Means, Gaussian Mixture Models, Expectation Maximization
- Advanced topics: how to choose the number of clusters

Clustering

Classification vs. Clustering. Classification: supervision is available; we learn from labeled data where example classifications are given. Clustering: unsupervised learning; we learn from raw (unlabeled) data.

Clustering: the process of grouping a set of objects into clusters with high intra-cluster similarity and low inter-cluster similarity. How many clusters are there? How do we identify them?

Applications of Clustering
- Google News: clusters news stories from different sources about the same event
- Computational biology: group genes that perform the same functions
- Social media analysis: group individuals that have similar political views
- Computer graphics: identify similar objects in pictures

Examples (figures: clusters of people, images, and species)

What is a natural grouping among these objects?

Similarity Measures

What is Similarity? It is hard to define, but we know it when we see it. The real meaning of similarity is a philosophical question; in practice it depends on the representation and the algorithm. For many representations and algorithms it is easier to think in terms of a distance (rather than a similarity) between vectors.

Intuitions behind desirable distance measure properties
- D(A,B) = D(B,A) (Symmetry). Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
- D(A,A) = 0 (Constancy of Self-Similarity). Otherwise you could claim "Alex looks more like Bob than Bob does."
- D(A,B) = 0 iff A = B (Identity of Indiscernibles). Otherwise there are objects in your world that are different, but you cannot tell them apart.
- D(A,B) <= D(A,C) + D(B,C) (Triangle Inequality). Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."

Distance Measures: Minkowski Metric. Suppose two objects x and y both have p features, $x = (x_1, \ldots, x_p)$ and $y = (y_1, \ldots, y_p)$. The Minkowski metric is defined by $d(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^r \right)^{1/r}$. The most common Minkowski metrics are r = 1 (Manhattan distance), r = 2 (Euclidean distance), and r = infinity (the "sup" distance, the largest coordinate-wise difference).

An Example (figure: two points x and y compared under the metrics above)
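As a concrete illustration (not from the original slides; the two points below are arbitrary), a minimal Python sketch of the three common Minkowski metrics:

```python
import numpy as np

def minkowski_distance(x, y, r):
    """Minkowski distance between feature vectors x and y with exponent r."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(r):
        return float(np.max(np.abs(x - y)))        # r = infinity: "sup" distance
    return float(np.sum(np.abs(x - y) ** r) ** (1.0 / r))

x, y = [0.0, 3.0], [4.0, 0.0]
print(minkowski_distance(x, y, 1))       # Manhattan distance: 7.0
print(minkowski_distance(x, y, 2))       # Euclidean distance: 5.0
print(minkowski_distance(x, y, np.inf))  # sup distance: 4.0
```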

Hamming distance. The Manhattan distance is called the Hamming distance when all features are binary. (Figure: gene expression levels under 17 conditions, coded 1 = high, 0 = low.)

Similarity Measures: Correlation Coefficient (figures: expression level of Gene A and Gene B over time, illustrating positively correlated, uncorrelated, and negatively correlated behaviour)

Similarity Measures: Correlation Coefficient. Pearson correlation coefficient: $\rho(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \, \sqrt{\sum_i (y_i - \bar{y})^2}}$. Special case: cosine similarity, $\frac{\sum_i x_i y_i}{\|x\| \, \|y\|}$, which coincides with the Pearson coefficient when the vectors are zero-mean.
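A small Python sketch of both quantities (the function names and example vectors are my own, not from the slides):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def cosine_similarity(x, y):
    """Cosine similarity; coincides with Pearson when x and y are zero-mean."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

a, b = [1.0, 2.0, 4.0, 3.0], [2.0, 4.0, 8.0, 6.0]
print(pearson(a, b), cosine_similarity(a, b))  # both 1.0: b is a scaled copy of a
```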

Clustering Algorithm: K-Means

K-means Clustering: Steps 1-5 (sequence of figures walking the algorithm through initialization, assignment of points to the nearest center, re-estimation of the centers, and repetition until convergence on example data)

K-Means: Algorithm
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Repeat until no object changes its cluster assignment:
   - Decide the cluster memberships of the N objects by assigning each to the nearest cluster centroid.
   - Re-estimate the k cluster centers, assuming the memberships found above are correct.
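A minimal NumPy sketch of this loop (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Plain K-means on an (N, p) data matrix X with k clusters."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 2: random init
    assign = None
    for _ in range(max_iters):                                # step 3: repeat
        # decide memberships: assign each object to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                             # no object changed cluster
        assign = new_assign
        # re-estimate each cluster center as the mean of its current members
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign
```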

K-Means is widely used in practice
- Extremely fast and scalable; used in a variety of applications.
- Can be easily parallelized, with a straightforward Map-Reduce implementation: in the E step, the mapper assigns each data point to the nearest cluster; in the M step, the reducer takes all points assigned to a cluster and re-computes the centroid.
- Sensitive to the starting points or random seed initialization (similar to neural networks); extensions such as K-Means++ try to solve this problem.
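A toy, single-process sketch of that Map-Reduce split (plain Python standing in for an actual Map-Reduce framework; all names are illustrative):

```python
from collections import defaultdict
import numpy as np

def mapper(point, centers):
    """E step: emit (index of the nearest centroid, the point itself)."""
    j = int(np.argmin([np.linalg.norm(point - c) for c in centers]))
    return j, point

def reducer(cluster_id, points):
    """M step: re-compute the centroid from all points assigned to this cluster."""
    return cluster_id, np.mean(points, axis=0)

def one_round(X, centers):
    """One K-means iteration expressed as a map phase followed by a reduce phase."""
    groups = defaultdict(list)
    for x in X:                            # map phase
        j, p = mapper(x, centers)
        groups[j].append(p)
    new_centers = np.array(centers, dtype=float)
    for j, pts in groups.items():          # reduce phase
        _, new_centers[j] = reducer(j, pts)
    return new_centers
```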

Outliers

Clustering Algorithm: Gaussian Mixture Model

Density estimation. An aircraft testing facility measures heat and vibration parameters for every newly built aircraft. Estimate the density function P(x) given the unlabeled data points X_1 to X_n.

Mixture of Gaussians

Mixture Models. A density model p(x) may be multi-modal. We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians). Each mode may correspond to a different sub-population (e.g., male and female).

Gaussian Mixture Models (GMMs). Consider a mixture of K Gaussian components: $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$, where $\pi_k$ is the mixture proportion and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the mixture component. This model can be used for unsupervised clustering. This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.
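A short sketch of evaluating this mixture density (assuming SciPy is available; the parameter values below are arbitrary and only for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, sigmas):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=sigma)
               for pi, mu, sigma in zip(pis, mus, sigmas))

# Two 1-D components with mixture proportions 0.3 and 0.7
print(gmm_density(1.0, pis=[0.3, 0.7], mus=[0.0, 5.0], sigmas=[1.0, 2.0]))
```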

Learning mixture models. In a fully observed i.i.d. setting, the log-likelihood decomposes into a sum of local terms: $\ell(\theta; D) = \sum_n \log p(x_n, z_n \mid \theta) = \sum_n \log p(z_n \mid \theta_z) + \sum_n \log p(x_n \mid z_n, \theta_x)$. With latent variables, all the parameters become coupled together via marginalization: $\ell(\theta; D) = \sum_n \log \sum_{z} p(x_n, z \mid \theta)$.

MLE for GMM. If we are doing MLE for completely observed data (i.e., the z's are known), the data log-likelihood decomposes as above and the maximum-likelihood estimates have closed form, exactly as in Gaussian Naïve Bayes: the mixture proportions are the empirical component frequencies, and each component's mean and covariance are estimated from the points assigned to it.

Learning GMM (z's are unknown)

Expectation Maximization (EM)

Expectation-Maximization (EM). Start: "guess" the mean and covariance of each of the K Gaussians, then loop over the E and M steps below.


Expectation-Maximization (EM). Start: "guess" the centroid and covariance of each of the K clusters, then loop.

The Expectation-Maximization (EM) Algorithm. E Step: guess the values of the Z's, i.e., compute the responsibility of component k for point $x_n$: $\gamma_{nk} = p(z_n = k \mid x_n) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$.

The Expectation-Maximization (EM) Algorithm. M Step: update the parameter estimates using the responsibilities: $N_k = \sum_n \gamma_{nk}$, $\pi_k = N_k / N$, $\mu_k = \frac{1}{N_k} \sum_n \gamma_{nk} x_n$, $\Sigma_k = \frac{1}{N_k} \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^T$.

EM Algorithm for GMM. E Step: guess the values of the Z's (compute the responsibilities). M Step: update the parameter estimates.
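A compact NumPy/SciPy sketch of these two steps (the initialization choices and the small regularization term are my own, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a K-component Gaussian mixture on an (N, p) data matrix X."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N, p = X.shape
    pis = np.full(K, 1.0 / K)                                  # mixture proportions
    mus = X[rng.choice(N, size=K, replace=False)]              # "guess" the means
    sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(p)] * K)    # shared initial covariance
    for _ in range(n_iters):
        # E step: responsibilities gamma[n, k] proportional to pi_k * N(x_n | mu_k, Sigma_k)
        gamma = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
            for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: re-estimate proportions, means, and covariances from the responsibilities
        Nk = gamma.sum(axis=0)
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mus[k]
            sigmas[k] = (gamma[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(p)
    return pis, mus, sigmas, gamma
```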

K-means is a hard version of EM. In the K-means "E step" we do hard assignment: $z_n = \arg\min_k \|x_n - \mu_k\|^2$. In the K-means "M step" we update the means as the weighted sum of the data, but now the weights are 0 or 1: $\mu_k = \frac{\sum_n \mathbb{1}[z_n = k] \, x_n}{\sum_n \mathbb{1}[z_n = k]}$.

Soft vs. Hard EM assignments (figures: soft responsibilities under a GMM vs. hard assignments under K-Means)

Theory underlying EM. What are we doing? Recall that according to MLE, we intend to learn the model parameters that maximize the likelihood of the data. But we do not observe z, so computing $\ell(\theta; D) = \sum_n \log \sum_{z} p(x_n, z \mid \theta)$ is difficult! What shall we do?

Intuition behind the EM algorithm

Jensen's Inequality. For a convex function f(x): $E[f(x)] \ge f(E[x])$. Similarly, for a concave function f(x): $E[f(x)] \le f(E[x])$.
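A quick numerical sanity check of both directions on a random sample (the choices of f and of the sample are arbitrary and only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 2.0, size=100_000)          # a positive random sample

# Concave f(x) = log(x):  E[f(x)] <= f(E[x])
print(np.mean(np.log(x)), "<=", np.log(np.mean(x)))

# Convex  f(x) = x**2:    E[f(x)] >= f(E[x])
print(np.mean(x ** 2), ">=", np.mean(x) ** 2)
```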

Jensen's Inequality: concave f(x)

EM and Jensen's Inequality

Advanced Topics

Convergence. Why should the K-means algorithm ever reach a fixed point, i.e., a state in which the clusters don't change? The goodness measure is the sum of squared distances from each point to its cluster centroid: $J = \sum_{k=1}^{K} \sum_{x_n \in C_k} \|x_n - \mu_k\|^2$. Reassignment monotonically decreases J, since each vector is assigned to the closest centroid, and re-estimating each centroid as the mean of its cluster also decreases J.

Seed Choice. Results can vary based on random seed selection; some seeds can lead to a poor convergence rate or to convergence to sub-optimal clusterings.
- Select good seeds using a heuristic (e.g., a point least similar to the existing means).
- Try out multiple starting points (very important!).
- Initialize with the results of another method.
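One popular version of the "least similar to existing means" heuristic is K-means++-style seeding, mentioned earlier; a minimal sketch (names are illustrative):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """Pick k seeds: each new seed is drawn with probability proportional to
    its squared distance from the nearest seed chosen so far."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```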

How Many Clusters? Sometimes the number of clusters K is given: partition the n documents into a predetermined number of clusters. Otherwise, finding the "right" number of clusters is part of the problem; e.g., for query results the ideal value of K is not known up front, though the UI may impose limits.
- Solve an optimization problem: penalize having lots of clusters (information-theoretic, model-based approaches).
- Trade off between having clearly separable clusters and having too many clusters.
- Agglomerative clustering.
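One simple optimization-style heuristic is the "elbow" method: run K-means for a range of K, compute the within-cluster sum of squares, and look for the point where adding clusters stops helping much. A sketch, reusing the kmeans function from the earlier example (the data matrix X is assumed to be defined):

```python
import numpy as np

def within_cluster_ss(X, centers, assign):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(float(np.sum((X[assign == j] - c) ** 2))
               for j, c in enumerate(centers))

# The objective always shrinks as K grows, so look for the "elbow" where
# the improvement levels off, e.g.:
# for K in range(1, 10):
#     centers, assign = kmeans(X, K)
#     print(K, within_cluster_ss(X, centers, assign))
```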

Summary
- What is clustering?
- Similarity measures
- K-Means clustering algorithm
- Mixture of Gaussians (GMM)
- Expectation Maximization
- Advanced topics: how to seed clustering, how to decide the number of clusters

Thank You! Questions?