Presentation on theme: "PARTITIONAL CLUSTERING" — Presentation transcript:
1 PARTITIONAL CLUSTERING
ACM Student Chapter, Heritage Institute of Technology. 17th February, 2012. SIGKDD.
Presentation by Megha Nangia, J. M. Mansa, Koustav Mullick
2 Why do we cluster? Clustering results are used:
- As a stand-alone tool to get insight into the data distribution; visualization of clusters may unveil important information.
- As a preprocessing step for other algorithms; efficient indexing or compression often relies on clustering.
3 What is Cluster Analysis? Cluster analysis, or clustering, is the task of assigning a set of objects into groups (called clusters) so that objects in the same cluster are more "similar" (in some sense or another) to each other than to those in other clusters. Cluster analysis is not one specific algorithm: the general task is forming similar clusters, and it can be achieved by various algorithms.
4 How do we define "similarity"? Recall that the goal is to group together "similar" data, but what does this mean?
- There is no single answer: it depends on what we want to find or emphasize in the data; this is one reason why clustering is an "art".
- The similarity measure is often more important than the clustering algorithm used. Don't overlook this choice!
6 Applications: Clustering is a main task of exploratory data mining, used to reduce the size of large data sets. It is a common technique for statistical data analysis used in many fields, including:
- Machine learning
- Pattern recognition
- Image analysis
- Information retrieval
- Bioinformatics
- Web applications such as social network analysis, grouping of shopping items, and search result grouping
7 Requirements of Clustering in Data Mining:
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Interpretability and usability
8 Notion of clustering: How many clusters? [Figure: the same set of points grouped as two, four, or six clusters.]
9 Clustering Algorithms: Clustering algorithms can be categorized. Some of the major families are:
- Hierarchical (connectivity-based) clustering
- Partitional clustering (e.g. K-means, centroid-based clustering)
- Density-based clustering
- Grid-based clustering
- Model-based clustering
12 Partitional Clustering: In statistics and data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells. More generally, partitional clustering is a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
13 Partitional Clustering: [Figure: a set of original points and one partitional clustering of them.]
14 Hierarchical Clustering: Connectivity-based clustering, also known as hierarchical clustering, is based on the core idea that objects are more related to nearby objects than to objects farther away.
- These algorithms connect "objects" to form "clusters" based on their distance. At different distances, different clusters form, which can be represented using a dendrogram.
- They do not provide a single partitioning of the data set, but instead an extensive hierarchy of clusters that merge with each other at certain distances: a set of nested clusters organized as a hierarchical tree.
15 Hierarchical Clustering: [Figure: traditional and non-traditional hierarchical clusterings with their corresponding dendrograms.]
17 Partitioning Algorithms: Partitioning method: construct a partition of n objects into a set of K clusters.
- Given: a set of objects and the number K.
- Find: a partition into K clusters that optimizes the chosen partitioning criterion.
- Effective heuristic methods: the K-means and K-medoids algorithms.
18 Common choices for similarity/distance measures (sketched in code below):
- Euclidean distance: d(x, y) = sqrt(sum_i (x_i - y_i)^2)
- City block or Manhattan distance: d(x, y) = sum_i |x_i - y_i|
- Cosine similarity: sim(x, y) = (x . y) / (||x|| ||y||)
- Jaccard similarity: J(A, B) = |A intersect B| / |A union B|
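As a rough illustration (not part of the original slides), the four measures can be written as small Python functions; the function names and arguments are hypothetical:

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector norms.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard_similarity(a, b):
    # Ratio of shared elements to all distinct elements of two sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```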
19 K-means Clustering: A partitional clustering approach.
- Each cluster is associated with a centroid (center point).
- Each point is assigned to the cluster with the closest centroid.
- The number of clusters, K, must be specified.
- The basic algorithm is very simple.
20 K-Means Algorithm (see the sketch after this list):
1. Select K points as the initial centroids.
2. Repeat:
3.   Form K clusters by assigning each point to its closest centroid.
4.   Re-compute the centroid of each cluster.
5. Until the centroids don't change.
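A minimal sketch of the algorithm above, assuming NumPy is available (the slides do not prescribe any implementation; the function name and parameters are illustrative):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: select K points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2-3: assign every point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid in this sketch).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```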
21 Time Complexity: Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors.
- Reassigning clusters: O(kn) distance computations, i.e. O(knm).
- Computing centroids: each instance vector is added once to some centroid: O(nm).
- Assume these two steps are each done once per iteration, for I iterations: O(Iknm).
31 Two different K-means clusterings: [Figure: the same original points clustered optimally and sub-optimally, depending on the initial centroids.]
32 Solutions to the Initial Centroids Problem:
- Multiple runs: helps, but probability is not on your side (see the sketch after this list).
- Sample the data and use hierarchical clustering to determine initial centroids.
- Select more than k initial centroids and then choose the most widely separated among them.
- Postprocessing.
- Bisecting K-means: not as susceptible to initialization issues.
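As an illustration of the "multiple runs" remedy, scikit-learn's KMeans (an assumed dependency, not mentioned on the slides) restarts from several random initializations via its n_init parameter and keeps the run with the lowest SSE:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic example data (hypothetical, for demonstration only).
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# n_init=10 runs K-means from 10 initializations and keeps the
# solution with the lowest SSE; init="k-means++" spreads the initial
# centroids apart, another common remedy.
km = KMeans(n_clusters=4, n_init=10, init="k-means++", random_state=0).fit(X)
print(km.inertia_)  # SSE of the best run
```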
33 Evaluating K-means Clusters: The most common measure is the Sum of Squared Error (SSE).
- For each point, the error is its distance to the nearest cluster center. To get the SSE, we square these errors and sum them: SSE = sum_{i=1..K} sum_{x in C_i} dist(m_i, x)^2, where x is a data point in cluster C_i and m_i is the representative point for cluster C_i.
- One can show that m_i corresponds to the center (mean) of the cluster.
- Given two clusterings, we can choose the one with the smallest error.
- One easy way to reduce the SSE is to increase K, the number of clusters, yet a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
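The SSE above can be computed directly; a minimal sketch, assuming NumPy arrays of points, integer cluster labels, and centroids such as those returned by the kmeans sketch earlier:

```python
import numpy as np

def sse(points, labels, centroids):
    # Sum over all points of the squared distance from each point
    # to its cluster's representative point (the mean m_i).
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())
```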
34 Strengths:
- Relatively efficient: O(ikn), where n is the number of objects, k the number of clusters, and i the number of iterations. Normally k, i << n.
- Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weaknesses:
- Applicable only when a mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance.
- Unable to handle noisy data and outliers.
- Not suitable for discovering clusters with non-convex shapes.
- May also give rise to empty clusters.
35 Outliers: Outliers are objects that do not belong to any cluster or form clusters of very small cardinality. [Figure: a cluster with a few outlying points marked as outliers.]
36 Bisecting K-Means: A variant of k-means that can produce a partitional or hierarchical clustering. Which cluster should be picked for bisection?
- The largest cluster, or
- The cluster with the lowest average similarity, or
- The cluster with the largest SSE.
37 Bisecting K-Means Algorithm (see the sketch after this list):
1. Initialize the list of clusters.
2. Repeat:
3.   Select a cluster from the list of clusters.
4.   For i = 1 to number_of_iterations:
5.     Bisect the cluster using the k-means algorithm.
6.   End for.
7.   Select the two clusters from the bisection having the lowest SSE.
8.   Add those two clusters to the list of clusters.
9. Until the list contains k clusters.
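A minimal sketch of bisecting K-means, assuming scikit-learn's KMeans for the 2-means bisection step and using the largest-SSE criterion (one of the three options listed earlier); the function name and defaults are illustrative, not from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, n_trials=5, seed=0):
    """Split the cluster with the largest SSE until k clusters remain."""
    clusters = [X]  # the list of clusters, initially the whole data set
    while len(clusters) < k:
        # Pick the cluster with the largest SSE for bisection.
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        # Bisect it several times with 2-means; keep the best split.
        best_split, best_sse = None, np.inf
        for trial in range(n_trials):
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + trial).fit(target)
            if km.inertia_ < best_sse:
                best_sse = km.inertia_
                best_split = [target[km.labels_ == 0], target[km.labels_ == 1]]
        clusters.extend(best_split)
    return clusters
```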
39 Why does bisecting K-means work better than regular K-means?
- Bisecting K-means tends to produce clusters of relatively uniform size.
- Regular K-means tends to produce clusters of widely different sizes.
- Bisecting K-means beats regular K-means on the entropy measure.
40 Limitations of K-means: K-means has problems when clusters are of differing
- sizes,
- densities, or
- non-globular shapes.
K-means also has problems when the data contains outliers.
41 Limitations of K-means, differing sizes: [Figure: original points vs. K-means with 3 clusters.]
42 Limitations of K-means, differing density: [Figure: original points vs. K-means with 3 clusters.]
43 Limitations of K-means, non-globular shapes: [Figure: original points vs. K-means with 2 clusters.]
44 Overcoming K-means Limitations: One solution is to use many clusters: find parts of clusters, which then need to be put together. [Figure: original points vs. K-means clusters.]
45 Overcoming K-means Limitations: [Figure: original points vs. K-means clusters.]
46 Overcoming K-means Limitations: [Figure: original points vs. K-means clusters.]
47 K-Medoids Algorithm: What is a medoid? A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e. it is the most centrally located point in the cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars). The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm.
48 Partitioning Around Medoids (PAM) algorithm (see the sketch after this list):
1. Initialize: randomly select k of the n data points as the medoids.
2. Associate each data point with the closest medoid.
3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
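A minimal PAM sketch, assuming NumPy and Manhattan distance (the distance used in the demonstration that follows); the brute-force swap search mirrors steps 3 and 4 above but is illustrative, not the slides' code:

```python
import numpy as np

def pam(X, k, max_iter=100, seed=0):
    """Partitioning Around Medoids; returns (medoid_indices, labels)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Manhattan distance matrix.
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    # Step 1: randomly select k of the n data points as medoids.
    medoids = rng.choice(n, size=k, replace=False)
    cost = D[:, medoids].min(axis=1).sum()
    for _ in range(max_iter):
        best = (cost, medoids)
        # Steps 3-4: try swapping every medoid with every non-medoid.
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o
                trial_cost = D[:, trial].min(axis=1).sum()
                if trial_cost < best[0]:
                    best = (trial_cost, trial)
        # Step 5: stop when no swap lowers the total cost.
        if best[0] >= cost:
            break
        cost, medoids = best
    labels = D[:, medoids].argmin(axis=1)  # step 2: closest medoid
    return medoids, labels
```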
49 Demonstration of PAM: Cluster the following set of ten objects into two clusters, i.e. k = 2. Consider a data set of ten objects as follows:
Point  Coordinate 1  Coordinate 2
X1     2             6
X2     3             4
X3     3             8
X4     4             7
X5     6             2
X6     6             4
X7     7             3
X8     7             4
X9     8             5
X10    7             6
51 Step 1: Initialize k centres. Let us assume c1 = (3, 4) and c2 = (7, 4), so c1 and c2 are selected as medoids. Calculate the Manhattan distance from each remaining data object to each medoid, so as to associate every object with its nearest medoid:
Data object  Cost to c1 = (3,4)  Cost to c2 = (7,4)
X1 (2,6)     3                   7
X3 (3,8)     4                   8
X4 (4,7)     4                   6
X5 (6,2)     5                   3
X6 (6,4)     3                   1
X7 (7,3)     5                   1
X9 (8,5)     6                   2
X10 (7,6)    6                   2
52 The clusters then become cluster 1 = {X1, X2, X3, X4} and cluster 2 = {X5, X6, X7, X8, X9, X10}. The total cost involved is 3+4+4+3+1+1+2+2 = 20.
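Using the coordinates reconstructed in the table above, a few lines of NumPy (an assumption; the slides show no code) confirm the assignment and the total cost of 20:

```python
import numpy as np

# The ten-point data set from the demonstration table.
X = np.array([(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
              (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)])
medoids = np.array([1, 7])  # X2 = (3,4) and X8 = (7,4)
# Manhattan distance from every point to each medoid.
D = np.abs(X[:, None, :] - X[medoids][None, :, :]).sum(axis=2)
print(D.min(axis=1).sum())  # total cost: 20
print(D.argmin(axis=1))     # cluster index of each point
```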
53 Cluster after step 1. [Figure: the two clusters after the initial assignment.] Next, we choose a non-medoid point for each medoid, swap it with the medoid, and re-compute the cost. If the cost improves, we make it the new medoid, and proceed similarly until there is no change in the medoids.
54 Comments on the PAM Algorithm:
- PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
- PAM works well for small data sets but does not scale well to large data sets.
55 Conclusion: Partitional clustering is a very efficient and easy-to-implement clustering method. It converges to a local optimum, though not necessarily the global one. Some of the heuristic approaches involve the K-means and K-medoids algorithms. However, partitional clustering also suffers from a number of shortcomings:
- The performance of the algorithm depends on the initial centroids, so the algorithm gives no guarantee of an optimal solution.
- Choosing poor initial centroids may lead to the generation of empty clusters as well.
- The number of clusters needs to be determined beforehand.
- It does not work well with non-globular clusters.
Some of the above drawbacks can be addressed by other popular clustering approaches, such as hierarchical or density-based clustering. Nevertheless, the importance of partitional clustering cannot be denied.