PARTITIONAL CLUSTERING

PARTITIONAL CLUSTERING
ACM Student Chapter, Heritage Institute of Technology, 17th February 2012
SIGKDD presentation by Megha Nangia, J. M. Mansa, Koustav Mullick

Why do we cluster? Clustering results are used:
As a stand-alone tool to get insight into the data distribution; visualization of clusters may unveil important information.
As a preprocessing step for other algorithms; efficient indexing or compression often relies on clustering.

What is Cluster Analysis? Cluster analysis or clustering is the task of assigning a set of objects to groups (called clusters) so that the objects in the same cluster are more "similar" (in some sense or another) to each other than to those in other clusters. Cluster analysis itself is not one specific algorithm; the general task to be solved is forming groups of similar objects, and it can be achieved by various algorithms.

How do we define "similarity"? Recall that the goal is to group together "similar" data, but what does this mean? There is no single answer; it depends on what we want to find or emphasize in the data, and this is one reason why clustering is an "art". The similarity measure is often more important than the clustering algorithm used, so don't overlook this choice!

Clustering aims to minimize the intra-cluster distance and maximize the inter-cluster distance.

Applications: Clustering is a main task of exploratory data mining, often used to summarize or reduce the size of large data sets. It is a common technique for statistical data analysis used in many fields, including: machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, and web applications such as social network analysis, grouping of shopping items, and search-result grouping.

Requirements of Clustering in Data Mining:
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Ability to deal with noise and outliers
Insensitivity to the order of input records
Handling of high dimensionality
Interpretability and usability

Notion of clustering: how many clusters? The same set of points can reasonably be grouped into two, four, or six clusters.

Clustering Algorithms: Clustering algorithms can be categorized by the model they use. Some of the major families are: hierarchical (connectivity-based) clustering, partitional clustering (K-means or centroid-based clustering), density-based, grid-based, and model-based clustering.

(Figure: example of clustering applied to mammals.)

Partitional Clustering: In statistics and data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells. More generally, a partitional clustering is a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

Partitional Clustering: (Figure: the original points and a partitional clustering of them.)

Hierarchical Clustering: Connectivity based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away. As such, these algorithms connect "objects" to form "clusters" based on their distance. At different distances, different clusters will form, which can be represented using a dendrogram. These algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. A set of nested clusters organized as a hierarchical tree
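A minimal sketch of agglomerative (hierarchical) clustering in Python using SciPy; the toy 2-D points, the single-linkage choice, and the cut into two flat clusters are illustrative assumptions:

```python
# Sketch of agglomerative (hierarchical) clustering with SciPy.
# The toy data points and the "single" linkage choice are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one tight group
                   [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])  # another tight group

# Build the merge hierarchy; Z records which clusters merge at which distance
# and can be drawn as a dendrogram with scipy.cluster.hierarchy.dendrogram.
Z = linkage(points, method="single")   # single link: distance between closest members

# Cut the hierarchy so that exactly two flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```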

Hierarchical Clustering: (Figures: a traditional hierarchical clustering with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram.)

(Figure: hierarchical clustering vs. partitional clustering of the same points.)

Partitioning Algorithms: Partitioning method: construct a partition of n objects into a set of K clusters. Given: a set of objects and the number K. Find: a partition into K clusters that optimizes the chosen partitioning criterion. Effective heuristic methods: the K-means and K-medoids algorithms.

Common choices for similarity / distance measures:
Euclidean distance: $d(x, y) = \sqrt{\sum_{j=1}^{m} (x_j - y_j)^2}$
City block or Manhattan distance: $d(x, y) = \sum_{j=1}^{m} |x_j - y_j|$
Cosine similarity: $\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$
Jaccard similarity: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
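As a rough sketch, these measures can be computed as follows in Python/NumPy; the function names and example vectors are illustrative only:

```python
# Illustrative implementations of the four similarity/distance measures above.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):                       # city block / Manhattan distance
    return np.sum(np.abs(x - y))

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard_similarity(a, b):              # a, b are sets of items
    return len(a & b) / len(a | b)

x, y = np.array([2.0, 6.0]), np.array([3.0, 4.0])
print(euclidean(x, y), manhattan(x, y), cosine_similarity(x, y))
print(jaccard_similarity({"data", "mining"}, {"data", "clustering"}))  # 1/3
```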

K-means Clustering:
A partitional clustering approach.
Each cluster is associated with a centroid (center point).
Each point is assigned to the cluster with the closest centroid.
The number of clusters, K, must be specified.
The basic algorithm is very simple.

K-Means Algorithm:
1. Select K points as the initial centroids.
2. Repeat:
3. Form K clusters by assigning every point to its closest centroid.
4. Re-compute the centroid of each cluster.
5. Until: the centroids do not change.
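A minimal Python/NumPy sketch of the basic algorithm above, assuming Euclidean distance, random selection of the initial centroids, and an iteration cap; the helper name and details are assumptions:

```python
# Minimal K-means sketch following the steps above.
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select K data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign every point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j]          # keep old centroid if a cluster empties
                                  for j in range(k)])
        # Step 5: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids, sep="\n")
```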

Time Complexity Assume computing distance between two instances is O(m) where m is the dimensionality of the vectors. Reassigning clusters: O(kn) distance computations, or O(knm). Computing centroids: Each instance vector gets added once to some centroid: O(nm). Assume these two steps are each done once for I iterations: O(Iknm).

(Figures: K-means clustering, steps 1 to 5 on example data with Euclidean distance; the centroids k1, k2, k3 are updated at each step until they stop moving.)

(Figures: K-means clustering, example 2.)

Importance of Choosing Initial Centroids (figures: two runs of K-means started from different initial centroids).

Two different K-means clusterings of the same original points: an optimal clustering and a sub-optimal clustering (figures).

Solutions to the Initial Centroids Problem:
Multiple runs: helps, but probability is not on your side.
Sample the data and use hierarchical clustering to determine the initial centroids.
Select more than K initial centroids and then choose among them the most widely separated ones.
Postprocessing.
Bisecting K-means: not as susceptible to initialization issues.

Evaluating K-means Clusters: The most common measure is the Sum of Squared Error (SSE). For each point, the error is the distance to the nearest cluster centre; to get the SSE, we square these errors and sum them: $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(m_i, x)^2$, where x is a data point in cluster $C_i$ and $m_i$ is the representative point of cluster $C_i$; one can show that $m_i$ corresponds to the centre (mean) of the cluster. Given two clusterings, we can choose the one with the smaller error. One easy way to reduce SSE is to increase K, the number of clusters, yet a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
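Assuming labels and centroids in the same form as the K-means sketch above, SSE can be computed in a few lines (the helper name is illustrative):

```python
# Sum of Squared Error (SSE): squared distance of every point to the centroid
# of its assigned cluster, summed over all points.
import numpy as np

def sse(X, labels, centroids):
    diffs = X - centroids[labels]     # vector from each point to its own centroid
    return float(np.sum(diffs ** 2))
```

Comparing the SSE of two runs with the same K is a common way to pick the better of two clusterings.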

Strengths:
Relatively efficient: O(ikn), where n is the number of objects, k the number of clusters, and i the number of iterations; normally k, i << n.
Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weaknesses:
Applicable only when a mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable for discovering clusters with non-convex shapes.
May also give rise to empty clusters.

Outliers: objects that do not belong to any cluster or that form clusters of very small cardinality. (Figure: a cluster with surrounding outliers.)

Bisecting K-Means: a variant of K-means that can produce a partitional or hierarchical clustering. Which cluster should be picked for bisection? We can pick the largest cluster, the cluster with the lowest average similarity, or the cluster with the largest SSE.

Bisecting K-Means Algorithm:
1. Initialize the list of clusters.
2. Repeat:
3. Select a cluster from the list of clusters.
4. For i = 1 to number_of_iterations: bisect the selected cluster using the K-means algorithm.
5. Select the two clusters from the bisection having the lowest SSE.
6. Add these two clusters to the list of clusters.
7. Until: the list contains K clusters.
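A rough sketch of bisecting K-means, reusing the kmeans() and sse() helpers sketched earlier; choosing the largest-SSE cluster for bisection and the number of trial bisections are assumptions:

```python
# Bisecting K-means sketch, reusing the kmeans() and sse() helpers above.
import numpy as np

def cluster_sse(c):
    # SSE of a single cluster around its own mean.
    return float(np.sum((c - c.mean(axis=0)) ** 2))

def bisecting_kmeans(X, k, num_trials=5):
    clusters = [X]                                   # start with one cluster holding all points
    while len(clusters) < k:
        # Pick the cluster with the largest SSE for bisection (one of the rules above).
        idx = int(np.argmax([cluster_sse(c) for c in clusters]))
        to_split = clusters.pop(idx)
        best_pair, best_err = None, np.inf
        for trial in range(num_trials):              # several 2-means trials on the chosen cluster
            labels, cents = kmeans(to_split, k=2, seed=trial)
            err = sse(to_split, labels, cents)
            if err < best_err:                       # keep the bisection with the lowest SSE
                best_err = err
                best_pair = [to_split[labels == 0], to_split[labels == 1]]
        clusters.extend(best_pair)
    return clusters
```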

Bisecting K-means: Example

Why does bisecting K-means often work better than regular K-means?
Bisecting K-means tends to produce clusters of relatively uniform size.
Regular K-means tends to produce clusters of widely different sizes.
Bisecting K-means beats regular K-means on entropy measurements.

Limitations of K-means: K-means has problems when clusters are of differing sizes, differing densities, or non-globular shapes, and when the data contains outliers.

(Figures: original points vs. the K-means result for clusters of differing sizes, differing densities, and non-globular shapes.)

Overcoming K-means Limitations: one solution is to use many clusters; K-means then finds parts of the natural clusters, which need to be put back together. (Figure: original points and the resulting K-means clusters.)

(Figures: further examples of overcoming K-means limitations by using many small clusters.)

K-Medoids Algorithm: What is a medoid? A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e., it is the most centrally located point in the cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars). The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm.
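In symbols (notation assumed here), the medoid of a cluster $C$ under a dissimilarity measure $d$ is

$$ m \;=\; \arg\min_{x \in C} \sum_{y \in C} d(x, y). $$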

Partitioning Around Medoids (PAM) algorithm:
1. Initialize: randomly select k of the n data points as the medoids.
2. Associate each data point with its closest medoid.
3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
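A minimal Python/NumPy sketch of PAM as described above, assuming Manhattan distance (as in the demonstration that follows) and a full medoid/non-medoid swap search; the helper names are assumptions:

```python
# Partitioning Around Medoids (PAM) sketch following the steps above.
import numpy as np

def total_cost(X, medoid_idx):
    # Cost = sum over points of the Manhattan distance to the nearest medoid.
    dists = np.abs(X[:, None, :] - X[medoid_idx][None, :, :]).sum(axis=2)
    return dists.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # step 1: random medoids
    best = total_cost(X, medoids)
    improved = True
    while improved:                                             # repeat until no swap lowers the cost
        improved = False
        for i in range(k):                                      # step 3: try every medoid / non-medoid swap
            for o in range(len(X)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o
                cost = total_cost(X, candidate)
                if cost < best:                                 # step 4: keep the lowest-cost configuration
                    best, medoids, improved = cost, candidate, True
    # step 2 (final): associate each point with its closest medoid
    labels = np.abs(X[:, None, :] - X[medoids][None, :, :]).sum(axis=2).argmin(axis=1)
    return medoids, labels, best
```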

Demonstration of PAM: cluster the following set of ten objects into two clusters, i.e. k = 2. Consider the data set of ten objects below:

Point   Coordinate 1   Coordinate 2
X1      2              6
X2      3              4
X3      3              8
X4      4              7
X5      6              2
X6      6              4
X7      7              3
X8      7              4
X9      8              5
X10     7              6

(Figure: scatter plot showing the distribution of the ten data points.)

Step 1: initialize k centres. Let us assume c1 = (3,4) and c2 = (7,4), so c1 and c2 are selected as medoids. Calculate the Manhattan (city block) distance of every other data object to each medoid, so as to associate each object with its nearest medoid:

Object      Cost to c1 = (3,4)   Cost to c2 = (7,4)   Nearest medoid
X1 (2,6)    3                    7                    c1
X3 (3,8)    4                    8                    c1
X4 (4,7)    4                    6                    c1
X5 (6,2)    5                    3                    c2
X6 (6,4)    3                    1                    c2
X7 (7,3)    5                    1                    c2
X9 (8,5)    6                    2                    c2
X10 (7,6)   6                    2                    c2

The clusters then become cluster 1 = {X1, X2, X3, X4} and cluster 2 = {X5, X6, X7, X8, X9, X10}. The total cost involved is 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20.
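Assuming the ten points as listed in the table above, the step-1 assignment and the total cost of 20 can be checked with a few lines of Python:

```python
# Check of the step-1 cost for the PAM demonstration above.
import numpy as np

X = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
              [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]])
medoids = np.array([[3, 4], [7, 4]])                             # c1 = X2, c2 = X8

costs = np.abs(X[:, None, :] - medoids[None, :, :]).sum(axis=2)  # Manhattan distances
print(costs.argmin(axis=1))      # nearest medoid per point: [0 0 0 0 1 1 1 1 1 1]
print(costs.min(axis=1).sum())   # total cost: 20
```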

Clusters after step 1: next, we choose a non-medoid point, swap it with one of the medoids, and re-compute the total cost of the configuration. If the cost improves, we make the point the new medoid, and we proceed similarly until there is no change in the medoids.

Comments on the PAM Algorithm: PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. PAM works well for small data sets but does not scale well to large data sets.

Conclusion: Partitional clustering is a very efficient and easy-to-implement clustering method. Its heuristic algorithms, such as K-means and K-medoids, usually converge quickly to a local optimum. However, partitional clustering also suffers from a number of shortcomings: the performance of the algorithm depends on the initial centroids, so the algorithm gives no guarantee of an optimal solution; choosing poor initial centroids may also lead to the generation of empty clusters; the number of clusters needs to be determined beforehand; and it does not work well with non-globular clusters. Some of these drawbacks can be addressed by other popular clustering approaches, such as hierarchical or density-based clustering. Nevertheless, the importance of partitional clustering cannot be denied.