
1 ACM Student Chapter, Heritage Institute of Technology, 17th February 2012. SIGKDD Presentation by Megha Nangia, J. M. Mansa, Koustav Mullick

2 Clustering results are used:
– As a stand-alone tool to get insight into data distribution; visualization of clusters may unveil important information.
– As a preprocessing step for other algorithms; efficient indexing or compression often relies on clustering.

3 Cluster analysis, or clustering, is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more “similar” (in some sense or another) to each other than to those in other clusters. Cluster analysis itself is not one specific algorithm; the general task of forming similar clusters can be achieved by various algorithms.

4 Recall that the goal is to group together “similar” data – but what does this mean?
– There is no single answer; it depends on what we want to find or emphasize in the data, which is one reason why clustering is an “art”.
– The similarity measure is often more important than the clustering algorithm used – don’t overlook this choice!

5 – Minimize intra-cluster distance.
– Maximize inter-cluster distance.

6 Clustering is a main task of exploratory data mining, used to reduce the size of large data sets. It is a common technique for statistical data analysis used in many fields, including:
– Machine learning
– Pattern recognition
– Image analysis
– Information retrieval
– Bioinformatics
– Web applications such as social network analysis, grouping of shopping items, search result grouping, etc.

7 – Scalability
– Ability to deal with different types of attributes
– Discovery of clusters with arbitrary shape
– Ability to deal with noise and outliers
– Insensitivity to the order of input records
– High dimensionality
– Interpretability and usability

8 How many clusters? (Figure: the same set of points shown grouped as two clusters, four clusters, and six clusters.)

9 Clustering algorithms can be categorized into families. Some of the major ones are:
1) Hierarchical or connectivity-based clustering
2) Partitional clustering (e.g. K-means or centroid-based clustering)
3) Density-based clustering
4) Grid-based clustering
5) Model-based clustering

10 (Figure: “Mammals” example.)

11

12 In statistics and data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells. It is a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

13 (Figure: original points and a partitional clustering of them.)

14 Connectivity-based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away. As such, these algorithms connect "objects" to form "clusters" based on their distance. At different distances, different clusters will form, which can be represented using a dendrogram. These algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. The result is a set of nested clusters organized as a hierarchical tree.

15 (Figure: a traditional and a non-traditional hierarchical clustering, each with its corresponding dendrogram.)

16 (Figure: hierarchical clustering vs. partitional clustering.)

17 Partitioning method: construct a partition of n objects into a set of K clusters.
– Given: a set of objects and the number K.
– Find: a partition into K clusters that optimizes the chosen partitioning criterion.
– Effective heuristic methods: the K-means and K-medoids algorithms.

18 Common proximity measures (illustrated in the sketch below):
– Euclidean distance: d(x, y) = sqrt(sum_i (x_i − y_i)^2)
– City block or Manhattan distance: d(x, y) = sum_i |x_i − y_i|
– Cosine similarity: cos(x, y) = (x · y) / (||x|| ||y||)
– Jaccard similarity: J(A, B) = |A ∩ B| / |A ∪ B|
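To make these measures concrete, here is a minimal, illustrative Python sketch (the function names and sample values are our own, not from the slides; Jaccard is shown for sets, its usual setting):

```python
import math

def euclidean(x, y):
    # straight-line distance between two equal-length numeric vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # city-block distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    # cosine of the angle between the two vectors
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def jaccard_similarity(a, b):
    # size of the intersection divided by the size of the union
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(euclidean((2, 6), (3, 4)))                  # ~2.24
print(manhattan((2, 6), (3, 4)))                  # 3
print(cosine_similarity((2, 6), (3, 4)))          # ~0.95
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))   # 0.5
```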

19 Partitional clustering approach:
– Each cluster is associated with a centroid (center point).
– Each point is assigned to the cluster with the closest centroid.
– The number of clusters, K, must be specified.
– The basic algorithm is very simple.

20 1. Select K points as initial centroids.
2. Repeat:
3. Form K clusters by assigning every point to its closest centroid.
4. Re-compute the centroid of each cluster.
5. Until: the centroids don't change.
(Flowchart: START → choose K centroids → form K clusters → recompute centroids → if the centroids changed, loop back; otherwise END.)
A minimal sketch of these steps is given below.
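A minimal Python sketch of this basic algorithm (the random initialization, the max_iter safeguard, and the sample points – reused from the worked example on slide 49 – are illustrative assumptions of our own):

```python
import math
import random

def kmeans(points, k, max_iter=100):
    centroids = random.sample(points, k)          # 1. select K points as initial centroids
    clusters = []
    for _ in range(max_iter):                     # 2. repeat ...
        clusters = [[] for _ in range(k)]
        for p in points:                          # 3. assign every point to its closest centroid
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        new_centroids = [                         # 4. re-compute each centroid as the mean of its cluster
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:            # 5. ... until the centroids don't change
            break
        centroids = new_centroids
    return centroids, clusters

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
centroids, clusters = kmeans(points, k=2)
print(centroids, clusters)
```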

21 Assume computing distance between two instances is O(m) where m is the dimensionality of the vectors. Reassigning clusters: O(kn) distance computations, or O(knm). Computing centroids: Each instance vector gets added once to some centroid: O(nm). Assume these two steps are each done once for I iterations: O(Iknm).
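To get a feel for this bound (an illustrative calculation of our own): with n = 100,000 instances, m = 50 dimensions, k = 10 clusters and I = 20 iterations, O(Iknm) amounts to roughly 20 × 10 × 100,000 × 50 = 10^9 elementary operations – large, but only linear in the number of instances, which is why k-means is regarded as relatively efficient.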

22 Algorithm: k-means, Distance Metric: Euclidean Distance

23

24

25

26 (Figure: demo step with the three centroids k1, k2, k3.)

27

28

29

30

31 (Figure: original points, a sub-optimal clustering, and an optimal clustering.)

32 – Multiple runs: helps, but probability is not on your side.
– Sample the data and use hierarchical clustering to determine initial centroids.
– Select more than k initial centroids and then select among these (e.g. the most widely separated ones).
– Postprocessing.
– Bisecting K-means: not as susceptible to initialization issues.

33 The most common measure is the Sum of Squared Errors (SSE).
– For each point, the error is its distance to the nearest cluster centre; to get the SSE, we square these errors and sum them: SSE = sum over clusters C_i of the sum over points x in C_i of dist(x, m_i)^2, where m_i is the representative point for cluster C_i. One can show that m_i corresponds to the centre (mean) of the cluster.
– Given two clusterings, we can choose the one with the smallest error.
– One easy way to reduce the SSE is to increase K, the number of clusters; even so, a good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K.
A small sketch of this computation follows.
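A minimal sketch of the SSE computation (a helper of our own, with the two clusters from the worked example later in the deck as sample input):

```python
import math

def sse(clusters, centroids):
    # sum, over all points, of the squared Euclidean distance to the cluster's representative point
    return sum(math.dist(x, m) ** 2 for cl, m in zip(clusters, centroids) for x in cl)

clusters = [[(2, 6), (3, 4), (3, 8), (4, 7)],
            [(6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]]
centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) for cl in clusters]   # cluster means
print(sse(clusters, centroids))
```

When k-means is run several times from different initial centroids (cf. slide 32), one typically keeps the run with the lowest SSE.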

34 Strengths:
– Relatively efficient: O(ikn), where n is the number of objects, k the number of clusters, and i the number of iterations. Normally k, i << n.
– Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weaknesses:
– Applicable only when a mean is defined – what about categorical data?
– The number of clusters, k, must be specified in advance.
– Unable to handle noisy data and outliers.
– Not suitable for discovering clusters with non-convex shapes.
– May also give rise to empty clusters.

35 Outliers are objects that do not belong to any cluster, or that form clusters of very small cardinality. (Figure: a cluster and surrounding outliers.)

36 A variant of k-means that can produce a partitional or hierarchical clustering. To choose which cluster to split next, one can pick the largest cluster, the cluster with the lowest average similarity, or the cluster with the largest SSE.

37 1. Initialize the list of clusters.
2. Repeat:
3. Select a cluster from the list of clusters.
4. For i = 1 to number_of_iterations:
5. Bisect the cluster using the k-means algorithm.
6. End for.
7. From these bisections, select the two clusters having the lowest SSE.
8. Add the two clusters from that bisection to the list of clusters.
9. Until: the list contains k clusters.
(Flowchart: START → initialize clusters → select a cluster → bisect it repeatedly, incrementing i → add the two bisected clusters with the lowest SSE to the list → repeat until there are k clusters → END.)
An illustrative sketch of these steps is given below.
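An illustrative sketch of these steps, assuming the kmeans() and sse() helpers from the earlier sketches are in scope; picking the largest cluster to split next is one of the options listed on slide 36:

```python
def bisecting_kmeans(points, k, trials=5):
    clusters = [list(points)]                     # 1. initialize the list with a single cluster
    while len(clusters) < k:                      # 9. until the list contains k clusters
        target = max(clusters, key=len)           # 3. select a cluster (here: the largest one)
        clusters.remove(target)
        best_pair, best_err = None, float("inf")
        for _ in range(trials):                   # 4-6. bisect it several times using 2-means
            cents, pair = kmeans(target, 2)
            err = sse(pair, cents)
            if err < best_err:                    # 7. keep the bisection with the lowest total SSE
                best_pair, best_err = pair, err
        clusters.extend(best_pair)                # 8. add the two resulting clusters to the list
    return clusters

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
print(bisecting_kmeans(points, k=3))
```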

38

39

40 K-means has problems when clusters have differing:
– sizes,
– densities,
– non-globular shapes.
K-means also has problems when the data contains outliers.

41 (Figure: original points vs. the K-means result with 3 clusters.)

42 (Figure: original points vs. the K-means result with 3 clusters.)

43 (Figure: original points vs. the K-means result with 2 clusters.)

44 (Figure: original points vs. K-means clusters.) One solution is to use many clusters: K-means then finds parts of the natural clusters, which need to be put back together afterwards.

45 (Figure: original points vs. K-means clusters.)

46

47 A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e., it is the most centrally located point in the cluster. In contrast to the k-means algorithm, k-medoids chooses actual data points as centres (medoids or exemplars). The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm.

48 1. Initialize: randomly select k of the n data points as the medoids.
2. Associate each data point with the closest medoid.
3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
A minimal sketch of this procedure is given below.
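A minimal sketch of this procedure (an illustrative implementation of our own; the Manhattan distance is chosen because it matches the worked example's total cost of 20 on the next slides):

```python
import random

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    # 2. each point is associated with (and pays the distance to) its closest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = random.sample(points, k)                           # 1. randomly select k medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:                                              # 5. repeat until no change in the medoids
        improved = False
        for i in range(k):                                       # 3. for each medoid m and non-medoid o:
            for o in points:
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]  #    swap m and o ...
                cost = total_cost(points, candidate)             #    ... and compute the total cost
                if cost < best:                                  # 4. keep the lowest-cost configuration
                    medoids, best, improved = candidate, cost, True
    return medoids, best

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
print(pam(points, k=2))
```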

49 Cluster the following set of ten objects into two clusters, i.e. k = 2. Consider a data set of ten objects as follows:

Point   Coordinate 1   Coordinate 2
X1      2              6
X2      3              4
X3      3              8
X4      4              7
X5      6              2
X6      6              4
X7      7              3
X8      7              4
X9      8              5
X10     7              6

50

51 Initialize k centres. Let us assume c1 = (3, 4) and c2 = (7, 4), so c1 and c2 are selected as medoids. Calculate the distance (cost) from each remaining data object to each medoid so as to associate it with its nearest medoid. Using the Manhattan (city-block) distance, which reproduces the stated total cost of 20:

Data object (Xi)   Cost to c1 = (3, 4)   Cost to c2 = (7, 4)   Assigned to
X1 = (2, 6)        3                     7                     c1
X3 = (3, 8)        4                     8                     c1
X4 = (4, 7)        4                     6                     c1
X5 = (6, 2)        5                     3                     c2
X6 = (6, 4)        3                     1                     c2
X7 = (7, 3)        5                     1                     c2
X9 = (8, 5)        6                     2                     c2
X10 = (7, 6)       6                     2                     c2

52 So the clusters become:
Cluster 1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster 2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}
The total cost involved is 20.
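As a quick check (a script of our own, not from the slides), the stated cost of 20 for the medoids c1 = (3, 4) and c2 = (7, 4) can be recomputed with the Manhattan distance:

```python
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
medoids = [(3, 4), (7, 4)]   # c1 and c2

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# every point contributes its distance to the nearer of the two medoids
print(sum(min(manhattan(p, m) for m in medoids) for p in points))   # 20
```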

53 Next, for each medoid we choose a non-medoid point, swap it with the medoid, and re-compute the total cost. If the cost improves, we keep the swap (the point becomes the new medoid), and we proceed in this way until there is no change in the medoids.

54 PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. PAM works well for small data sets but does not scale well to large data sets.

55

