
k-means clustering Hongning Wang CS@UVa

Today's lecture
- k-means clustering
  - A typical partitional clustering algorithm
  - Convergence property
- Expectation Maximization algorithm
- Gaussian mixture model

Partitional clustering algorithms
- Partition instances into exactly k non-overlapping clusters
  - Flat (non-hierarchical) clustering structure
  - Users need to specify the number of clusters k
- Task: identify the partition into k clusters that optimizes the chosen partition criterion

Partitional clustering algorithms
- Partition instances into exactly k non-overlapping clusters
- Typical criterion: maximize inter-cluster distance while minimizing intra-cluster distance, e.g.,
  $\max \sum_{i \neq j} d(c_i, c_j) - C \sum_i \sigma_i$
  where $d(c_i, c_j)$ is the inter-cluster distance between centroids and $\sigma_i$ is the intra-cluster spread
- Optimal solution: enumerate every possible partition of size k and return the one optimizing the criterion; unfortunately, this is NP-hard!
- So let's approximate this criterion and optimize it in an alternative way

k-means algorithm
- Input: number of clusters k, instances $\{x_i\}_{i=1}^N$, distance metric $d(x, y)$
- Output: cluster membership assignments $\{z_i\}_{i=1}^N$
1. Initialize k cluster centroids $\{c_k\}_{k=1}^k$ (randomly, if no domain knowledge is available)
2. Repeat until no instance changes its cluster membership:
   - Assign each instance to its nearest cluster centroid (minimize intra-cluster distance): $z_i = \arg\min_k d(c_k, x_i)$
   - Update each cluster centroid from its assigned members (maximize inter-cluster distance): $c_k = \frac{\sum_i \delta(z_i = c_k)\, x_i}{\sum_i \delta(z_i = c_k)}$
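To make the loop concrete, here is a minimal NumPy sketch of the procedure above (not the course's reference code); it assumes Euclidean distance and a dense (N, d) input array, and the function and variable names are illustrative.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means with Euclidean distance; X is an (N, d) array."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct instances at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: put each instance into the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (N, k)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # no membership changed -> converged
        assignments = new_assignments
        # Update step: recompute each centroid as the mean of its members
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:  # keep the old centroid if a cluster goes empty
                centroids[j] = members.mean(axis=0)
    return assignments, centroids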

k-means illustration
[sequence of figure slides omitted: the assignment step partitions the space into a Voronoi diagram around the current centroids; assignments and centroids are updated alternately until convergence]

Complexity analysis
- Decide cluster membership: $O(kn)$
- Compute cluster centroids: $O(n)$
- Assuming k-means stops after $l$ iterations: $O(knl)$ overall
- Don't forget the cost of each distance computation, e.g., $O(V)$ for Euclidean distance over $V$-dimensional vectors

Convergence property
- Why will k-means stop?
- Answer: it is a special version of the Expectation Maximization (EM) algorithm, and EM is guaranteed to converge
- However, it is only guaranteed to converge to a local optimum, since k-means (EM) is a greedy algorithm

Probabilistic interpretation of clustering
- The density model $p(x)$ is multi-modal
- Each mode represents a sub-population, e.g., a unimodal Gaussian for each group
- Mixture model: $p(x) = \sum_z p(x|z)\, p(z)$, where $p(x|z)$ is a unimodal component distribution and $p(z)$ is the mixing proportion
[figure: three unimodal components $p(x|z=1)$, $p(x|z=2)$, $p(x|z=3)$ combining into a multi-modal density]

Probabilistic interpretation of clustering
- If $z$ were known for every $x$, estimating $p(z)$ and $p(x|z)$ would be easy: maximum likelihood estimation
- This is exactly the Naïve Bayes setting
- Mixture model: $p(x) = \sum_z p(x|z)\, p(z)$, with unimodal components $p(x|z=1)$, $p(x|z=2)$, $p(x|z=3)$ and mixing proportions $p(z)$
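For intuition, a small sketch of that "easy" case, assuming Gaussian components and fully observed labels; the function name and array conventions are illustrative, not from the slides.

import numpy as np

def mle_with_known_z(X, z, k):
    """MLE of p(z) and Gaussian p(x|z) when the component label z of each x is observed.
    X: (N, d) array of instances; z: (N,) integer labels in {0, ..., k-1}.
    Assumes each cluster has at least two members."""
    N = len(X)
    priors, means, covs = [], [], []
    for j in range(k):
        members = X[z == j]
        priors.append(len(members) / N)             # p(z=j): relative frequency
        means.append(members.mean(axis=0))          # mean of the Gaussian p(x|z=j)
        covs.append(np.cov(members, rowvar=False))  # covariance of p(x|z=j)
    return np.array(priors), np.array(means), covs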

Probabilistic interpretation of clustering
- But $z$ is unknown for all $x$: estimating $p(z)$ and $p(x|z)$ is generally hard
  $\max_{\alpha, \beta} \sum_i \log \sum_{z_i} p(x_i|z_i, \beta)\, p(z_i|\alpha)$
  (usually a constrained optimization problem)
- Appeal to the Expectation Maximization algorithm
- Mixture model $p(x) = \sum_z p(x|z)\, p(z)$: the unimodal components $p(x|z=1)$, $p(x|z=2)$, $p(x|z=3)$ and the mixing proportions are now all unknown

Recap
- Partitional clustering: partition instances into exactly k non-overlapping clusters; the exact criterion (maximize inter-cluster distance, minimize intra-cluster distance) is NP-hard to optimize, so we approximate it
- k-means: alternately assign each instance to its nearest centroid, $z_i = \arg\min_k d(c_k, x_i)$, and recompute each centroid as the mean of its assigned instances
- Probabilistic interpretation of clustering: fit a multi-modal mixture model $p(x) = \sum_z p(x|z)\, p(z)$, with one unimodal component per sub-population

Introduction to EM
- Parameter estimation with fully observable data: maximum likelihood estimation, i.e., optimize the analytic form of $L(\theta) = \log p(X|\theta)$
- With missing/unobservable data (e.g., cluster membership): data = X (observed) + Z (hidden), and the likelihood becomes $L(\theta) = \log \sum_Z p(X, Z|\theta)$
- In most cases this is intractable to optimize directly, so we approximate it

Background knowledge: Jensen's inequality
- For any convex function $f(x)$ and positive weights $\lambda_i$ with $\sum_i \lambda_i = 1$:
  $f\left(\sum_i \lambda_i x_i\right) \le \sum_i \lambda_i f(x_i)$
- For a concave function such as $\log$, the inequality flips: $f\left(\sum_i \lambda_i x_i\right) \ge \sum_i \lambda_i f(x_i)$
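As a concrete check (an illustrative example, not from the slides), take the concave function $f = \log$ with equal weights $\lambda_1 = \lambda_2 = \tfrac{1}{2}$ and points $x_1 = 1$, $x_2 = 4$:

\[
\log\Big(\tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 4\Big) = \log 2.5 \approx 0.916
\;\ge\;
\tfrac{1}{2}\log 1 + \tfrac{1}{2}\log 4 = \log 2 \approx 0.693,
\]

consistent with the concave form $f(E[x]) \ge E[f(x)]$ used in the EM derivation on the next slide.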

Expectation Maximization
- Maximize the data likelihood by pushing up a lower bound:
  $L(\theta) = \log \sum_Z p(X, Z|\theta) = \log \sum_Z q(Z) \frac{p(X, Z|\theta)}{q(Z)}$
  where $q(Z)$ is a proposal distribution over $Z$
- By Jensen's inequality ($f(E[x]) \ge E[f(x)]$ for the concave $f = \log$):
  $L(\theta) \ge \sum_Z q(Z) \log p(X, Z|\theta) - \sum_Z q(Z) \log q(Z)$
- The lower bound is easier to compute and has many good properties!
- Components we need to tune when optimizing $L(\theta)$: $q(Z)$ and $\theta$!

Intuitive understanding of EM
[figure: the data likelihood $p(X|\theta)$ as a curve over $\theta$, with a lower bound touching it at the current guess; the lower bound is easier to optimize, and improving it is guaranteed to improve the data likelihood]

Expectation Maximization (cont'd)
- Optimize the lower bound w.r.t. $q(Z)$:
  $L(\theta) \ge \sum_Z q(Z) \log p(X, Z|\theta) - \sum_Z q(Z) \log q(Z)$
  $= \sum_Z q(Z) \big[\log p(Z|X, \theta) + \log p(X|\theta)\big] - \sum_Z q(Z) \log q(Z)$
  $= \sum_Z q(Z) \log \frac{p(Z|X, \theta)}{q(Z)} + \log p(X|\theta)$
- The first term is the negative KL-divergence between $q(Z)$ and $p(Z|X, \theta)$, where $KL(P\|Q) = \int P(x) \log \frac{P(x)}{Q(x)} dx$; the second term is constant with respect to $q(Z)$

Expectation Maximization (cont'd)
- Optimize the lower bound w.r.t. $q(Z)$:
  $L(\theta) \ge -KL\big(q(Z)\,\|\,p(Z|X, \theta)\big) + L(\theta)$
- KL-divergence is non-negative, and equals zero iff $q(Z) = p(Z|X, \theta)$
- A step further: when $q(Z) = p(Z|X, \theta)$, we get $L(\theta) \ge L(\theta)$, i.e., the lower bound is tight!
- Other choices of $q(Z)$ cannot achieve this tight bound, but might reduce computational complexity
- Note: the calculation of $q(Z)$ is based on the current $\theta$

Expectation Maximization (cont'd)
- Optimize the lower bound w.r.t. $q(Z)$
- Optimal solution: $q(Z) = p(Z|X, \theta^t)$, the posterior distribution of $Z$ given the current model $\theta^t$
- In k-means this corresponds to assigning instance $x_i$ to its closest cluster centroid: $z_i = \arg\min_k d(c_k, x_i)$

Expectation Maximization (cont'd)
- Optimize the lower bound w.r.t. $\theta$:
  $L(\theta) \ge \sum_Z p(Z|X, \theta^t) \log p(X, Z|\theta) - \sum_Z p(Z|X, \theta^t) \log p(Z|X, \theta^t)$
  (the second term is constant w.r.t. $\theta$)
- $\theta^{t+1} = \arg\max_\theta \sum_Z p(Z|X, \theta^t) \log p(X, Z|\theta) = \arg\max_\theta E_{Z|X, \theta^t}\big[\log p(X, Z|\theta)\big]$, the expectation of the complete-data log-likelihood
- In k-means we do not compute the full expectation, only the most probable configuration, and then update $c_k = \frac{\sum_i \delta(z_i = c_k)\, x_i}{\sum_i \delta(z_i = c_k)}$

Expectation Maximization
- EM iteratively maximizes the likelihood via the "complete-data" log-likelihood $L_c(\theta) = \log p(X, Z|\theta)$
- Starting from an initial guess $\theta^{(0)}$:
  - E-step (the key step): compute the expectation of the complete-data log-likelihood
    $Q(\theta; \theta^t) = E_{Z|X, \theta^t}\big[L_c(\theta)\big] = \sum_Z p(Z|X, \theta^t) \log p(X, Z|\theta)$
  - M-step: compute $\theta^{t+1}$ by maximizing the Q-function
    $\theta^{t+1} = \arg\max_\theta Q(\theta; \theta^t)$
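A generic sketch of this alternation (the helper names e_step, m_step, and log_likelihood are hypothetical; a concrete model such as a GMM would supply them):

def expectation_maximization(X, theta0, e_step, m_step, log_likelihood,
                             max_iter=100, tol=1e-6):
    """Generic EM loop: alternate E- and M-steps until the likelihood stabilizes."""
    theta = theta0
    prev_ll = -float("inf")
    for _ in range(max_iter):
        q = e_step(X, theta)           # E-step: q(Z) = p(Z | X, theta_t)
        theta = m_step(X, q)           # M-step: theta_{t+1} = argmax_theta Q(theta; theta_t)
        ll = log_likelihood(X, theta)  # monitor log p(X | theta); it never decreases
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return theta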

Intuitive understanding of EM
- E-step = computing the lower bound (the Q-function) at the current guess
- M-step = maximizing the lower bound to obtain the next guess
- In k-means: the E-step identifies the cluster memberships $p(z|x, c)$; the M-step updates the centroids $c$ using $p(z|x, c)$
[figure: the data likelihood $p(X|\theta)$ with the lower bound (Q-function) constructed at the current guess and maximized to reach the next guess]

Convergence guarantee: proof of EM
- $\log p(X|\theta) = \log p(Z, X|\theta) - \log p(Z|X, \theta)$
- Taking the expectation of both sides with respect to $p(Z|X, \theta^t)$:
  $\log p(X|\theta) = \sum_Z p(Z|X, \theta^t) \log p(Z, X|\theta) - \sum_Z p(Z|X, \theta^t) \log p(Z|X, \theta)$
  $= Q(\theta; \theta^t) + H(\theta; \theta^t)$, where the second term is a cross-entropy
- Then the change of the log data likelihood between EM iterations is:
  $\log p(X|\theta) - \log p(X|\theta^t) = Q(\theta; \theta^t) + H(\theta; \theta^t) - Q(\theta^t; \theta^t) - H(\theta^t; \theta^t)$
- By Jensen's inequality, $H(\theta; \theta^t) \ge H(\theta^t; \theta^t)$, which means
  $\log p(X|\theta) - \log p(X|\theta^t) \ge Q(\theta; \theta^t) - Q(\theta^t; \theta^t) \ge 0$
  where the last inequality is guaranteed by the M-step

What is not guaranteed
- A global optimum is not guaranteed!
- The likelihood $L(\theta) = \log \sum_Z p(X, Z|\theta)$ is non-convex in most cases
- EM boils down to a greedy algorithm: alternating ascent
- Generalized EM
  - E-step: $q(Z) = \arg\min_{q(Z)} KL\big(q(Z)\,\|\,p(Z|X, \theta^t)\big)$
  - M-step: choose any $\theta$ that improves $Q(\theta; \theta^t)$

k-means vs. Gaussian Mixture
- If we use Euclidean distance in k-means, we have implicitly assumed $p(x|z)$ is Gaussian with identity covariance:
  $p(x|z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x - \mu_z)^T (x - \mu_z)}$
- Gaussian Mixture Model (GMM): $p(x|z) = N(\mu_z, \Sigma_z)$ with multinomial mixing proportions $p(z) = \alpha_z$, i.e.,
  $p(x|z) = \frac{1}{\sqrt{(2\pi)^k |\Sigma_z|}}\, e^{-\frac{1}{2}(x - \mu_z)^T \Sigma_z^{-1} (x - \mu_z)}$
- In k-means we assume equal variance across clusters, so we do not need to estimate it, and we do not model the cluster sizes $\alpha_z$
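A bare-bones EM sketch for a GMM along these lines, assuming SciPy is available; convergence checks and covariance regularization are omitted, and all names are illustrative rather than reference code.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iter=50, seed=0):
    """EM for a Gaussian mixture; X is an (N, d) array with d >= 2."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    alphas = np.full(k, 1.0 / k)                     # mixing proportions p(z)
    means = X[rng.choice(N, size=k, replace=False)]  # initialize means from data
    covs = np.array([np.cov(X, rowvar=False)] * k)   # shared initial covariance
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = p(z=j | x_i, theta_t)
        r = np.column_stack([
            alphas[j] * multivariate_normal.pdf(X, means[j], covs[j])
            for j in range(k)
        ])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing proportions, means, covariances
        Nk = r.sum(axis=0)                           # effective cluster sizes
        alphas = Nk / N
        means = (r.T @ X) / Nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (r[:, j, None] * diff).T @ diff / Nk[j]
    return alphas, means, covs, r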

k-means vs. Gaussian Mixture
- Soft vs. hard posterior assignment
[figure: the same data clustered by GMM (soft assignment, each point carries a posterior over clusters) and by k-means (hard assignment, each point belongs to exactly one cluster)]

k-means in practice
- Extremely fast and scalable; one of the most widely used clustering methods (among the "top 10 data mining algorithms", ICDM 2006)
- Can be easily parallelized, e.g., a Map-Reduce implementation (see the sketch below)
  - Mapper: assign each instance to its closest centroid
  - Reducer: update each centroid based on the assigned cluster membership
- Sensitive to initialization; prone to local optima
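To illustrate the Mapper/Reducer split, a toy single-process emulation of one round; the key/value conventions are illustrative, not any particular framework's API.

import numpy as np
from collections import defaultdict

def map_phase(X, centroids):
    """Mapper: emit (nearest-centroid id, (x, 1)) for each instance x."""
    for x in X:
        j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
        yield j, (x, 1)

def reduce_phase(mapped):
    """Reducer: for each centroid id, sum the vectors and counts, then average."""
    sums, counts = defaultdict(float), defaultdict(int)
    for j, (x, c) in mapped:
        sums[j] = sums[j] + x
        counts[j] += c
    return {j: sums[j] / counts[j] for j in sums}

# One round: new_centroids = reduce_phase(map_phase(X, centroids))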

Better initialization: k-means++
1. Choose the first cluster center uniformly at random from the instances
2. Repeat until all k centers have been found:
   - For each instance, compute $D(x) = \min_k d(x, c_k)$, its distance to the nearest existing center
   - Choose a new cluster center with probability $p(x) \propto D(x)^2$ (a new center should be far away from existing centers)
3. Run k-means with the selected centers as the initialization
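A short sketch of this seeding procedure (illustrative names, Euclidean distance assumed):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding; X is an (N, d) array. Returns k initial centroids."""
    rng = np.random.default_rng(seed)
    # First center: chosen uniformly at random among the instances
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        # D(x)^2: squared distance of each instance to its nearest existing center
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Sample the next center with probability proportional to D(x)^2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)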

How to determine k
- Vary $k$ to optimize a clustering criterion
  - Internal vs. external validation
  - Cross validation
  - Abrupt change in the objective function (the "elbow")
- Model selection criteria that penalize too many clusters, e.g., AIC, BIC
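A possible way to scan over k in code, assuming scikit-learn is available: the within-cluster sum of squares traces the "elbow" curve, while a GMM's BIC gives a penalized model-selection score. The function name is illustrative.

from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def scan_k(X, k_values):
    """For each k, report within-cluster SSE (elbow heuristic) and GMM BIC."""
    results = {}
    for k in k_values:
        sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        bic = GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        results[k] = (sse, bic)
    return results

# Usage: look for the elbow in SSE and the minimum of BIC
# for k, (sse, bic) in scan_k(X, range(2, 11)).items(): print(k, sse, bic)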

What you should know
- The k-means algorithm: an alternating greedy algorithm
- Convergence guarantee of the EM algorithm
- Hard clustering vs. soft clustering: k-means vs. GMM

Today's reading
- Introduction to Information Retrieval, Chapter 16: Flat clustering
  - 16.4 k-means
  - 16.5 Model-based clustering