# SEEM4630 2011-2012 Tutorial 4 – Clustering

## Presentation transcript

SEEM4630 2011-2012 Tutorial 4 – Clustering

2 What is Cluster Analysis?

- Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
- A good clustering method will produce high-quality clusters:
  - high intra-class similarity: cohesive within clusters
  - low inter-class similarity: distinctive between clusters

3 Notion of a Cluster can be Ambiguous

How many clusters?

[Figure: panels titled "Two Clusters", "Four Clusters", and "Six Clusters"]

4 K-Means Clustering

- The number of clusters K is fixed.
- Closeness is measured by Euclidean distance, etc.

5 K-Means Clustering: Example

Given:

- Mean of cluster $K_i$: $m_i = (t_{i1} + t_{i2} + \dots + t_{im})/m$, where $t_{i1}, \dots, t_{im}$ are the $m$ points in $K_i$
- Data: {2, 4, 10, 12, 3, 20, 30, 11, 25}
- K = 2

Solution:

- $m_1 = 2$, $m_2 = 4$: $K_1$ = {2, 3} and $K_2$ = {4, 10, 12, 20, 30, 11, 25}
- $m_1 = 2.5$, $m_2 = 16$: $K_1$ = {2, 3, 4} and $K_2$ = {10, 12, 20, 30, 11, 25}
- $m_1 = 3$, $m_2 = 18$: $K_1$ = {2, 3, 4, 10} and $K_2$ = {12, 20, 30, 11, 25}
- $m_1 = 4.75$, $m_2 = 19.6$: $K_1$ = {2, 3, 4, 10, 11, 12} and $K_2$ = {20, 30, 25}
- $m_1 = 7$, $m_2 = 25$: $K_1$ = {2, 3, 4, 10, 11, 12} and $K_2$ = {20, 30, 25}

The assignment in the last step is the same as in the step before it, so the means no longer change and the algorithm stops.
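The iterations in this example can be reproduced with a short script. The following is a minimal sketch in Python, not code from the tutorial: the function name `kmeans_1d` and its parameters are made up for illustration, distance is taken as the absolute difference between a point and a mean, and a tie (such as the point 3, which is equally far from the initial means 2 and 4) is assumed to go to the lower-numbered cluster, which is consistent with the first step of the example.

```python
def kmeans_1d(data, means, max_iter=100):
    """Illustrative 1-D K-means; `means` holds the initial cluster means."""
    means = list(means)
    history = []  # one (means, clusters) pair per assignment step
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for x in data:
            # min() returns the first index on a tie, i.e. the lower-numbered cluster
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        history.append((list(means), clusters))
        # Recompute each mean; an empty cluster keeps its previous mean
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        if new_means == means:  # means unchanged: converged
            break
        means = new_means
    return means, clusters, history


data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
final_means, final_clusters, history = kmeans_1d(data, means=[2, 4])

for step_means, step_clusters in history:
    print(step_means, [sorted(c) for c in step_clusters])
```

Each printed line pairs the means used for an assignment step with the clusters that step produced, so the lines can be compared one by one with the solution steps of the example.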

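6 K-Means Clustering: Evaluation

- Evaluation: Sum of Squared Error (SSE)

  $$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(m_i, x)^2$$

  where $x$ is a data point in cluster $C_i$ and $m_i$ is the centroid of cluster $C_i$.
- Given several clusterings, choose the one with the smallest error.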
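As a rough illustration of how SSE compares two clusterings, the sketch below evaluates the first and the last clustering from the K-means example on the previous slide. The function name `sse` is hypothetical, the centroid of each cluster is taken to be its mean, and the distance is the plain 1-D difference; these are assumptions for the example, not details given in the tutorial.

```python
def sse(clusters):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for points in clusters:
        centroid = sum(points) / len(points)
        total += sum((x - centroid) ** 2 for x in points)
    return total


# First and final clusterings from the K-means example (K = 2)
first = [[2, 3], [4, 10, 12, 20, 30, 11, 25]]
final = [[2, 3, 4, 10, 11, 12], [20, 25, 30]]

sse_first = sse(first)   # 514.5
sse_final = sse(final)   # 150.0
print(sse_first, sse_final)
```

The final clustering has the smaller SSE, so under this criterion it is the one to keep.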

7 Limitations of K-means

- It is hard to determine a good value of K and a good set of initial K centroids.
- K-means has problems when the data contains outliers.
  - Outliers can be handled better by hierarchical clustering and density-based clustering.

8 Hierarchical Clustering

- Produces a set of nested clusters organized as a hierarchical tree
- Can be visualized as a dendrogram
  - A tree-like diagram that records the sequences of merges or splits

9 Strengths of Hierarchical Clustering

- Do not have to assume any particular number of clusters
  - Any desired number of clusters can be obtained by "cutting" the dendrogram at the proper level
- Partition direction
  - Agglomerative: starting with single elements and aggregating them into clusters
  - Divisive: starting with the complete data set and dividing it into partitions

10 Agglomerative Hierarchical Clustering

- Basic algorithm is straightforward

  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4. Merge the two closest clusters
  5. Update the proximity matrix
  6. Until only a single cluster remains

- Key operation is the computation of the proximity of two clusters
  - Different approaches to define the distance between clusters

11 Hierarchical Clustering

- Define Inter-Cluster Similarity
  - Min
  - Max
  - Group Average
  - Distance between Centroids
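The first three definitions can be computed directly from pairwise distances. The sketch below is an illustration only: the function names are made up, and the four distances are taken from the single-link example on the next slide (clusters {I3, I6} and {I2, I5}). Distance between centroids is left out because it needs the point coordinates, which the transcript does not give.

```python
# Pairwise distances between the points of A = {I3, I6} and B = {I2, I5}
dist = {
    ("I3", "I2"): 0.15, ("I3", "I5"): 0.28,
    ("I6", "I2"): 0.25, ("I6", "I5"): 0.39,
}
A = ["I3", "I6"]
B = ["I2", "I5"]


def pair_distances(a, b):
    """All distances between a point of cluster a and a point of cluster b."""
    return [dist[(p, q)] for p in a for q in b]


def min_link(a, b):       # Min (single link): closest pair
    return min(pair_distances(a, b))


def max_link(a, b):       # Max (complete link): farthest pair
    return max(pair_distances(a, b))


def group_average(a, b):  # Group average: mean over all pairs
    d = pair_distances(a, b)
    return sum(d) / len(d)


d_min = min_link(A, B)
d_max = max_link(A, B)
d_avg = group_average(A, B)
print(d_min, d_max, round(d_avg, 4))
```

The same two clusters are 0.15 apart under Min, 0.39 under Max, and about 0.27 under Group Average, which is why the choice of definition can change the order of merges.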

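12 Hierarchical Clustering: Min or Single Link

Euclidean distance matrix:

| | I1 | I2 | I3 | I4 | I5 | I6 |
|---|---|---|---|---|---|---|
| I1 | 0.00 | 0.24 | 0.22 | 0.37 | 0.34 | 0.23 |
| I2 | 0.24 | 0.00 | 0.15 | 0.20 | 0.14 | 0.25 |
| I3 | 0.22 | 0.15 | 0.00 | 0.15 | 0.28 | 0.11 |
| I4 | 0.37 | 0.20 | 0.15 | 0.00 | 0.29 | 0.22 |
| I5 | 0.34 | 0.14 | 0.28 | 0.29 | 0.00 | 0.39 |
| I6 | 0.23 | 0.25 | 0.11 | 0.22 | 0.39 | 0.00 |

After merging I3 and I6:

| | I1 | I2 | {I3, I6} | I4 | I5 |
|---|---|---|---|---|---|
| I1 | 0.00 | 0.24 | 0.22 | 0.37 | 0.34 |
| I2 | 0.24 | 0.00 | 0.15 | 0.20 | 0.14 |
| {I3, I6} | 0.22 | 0.15 | 0.00 | 0.15 | 0.28 |
| I4 | 0.37 | 0.20 | 0.15 | 0.00 | 0.29 |
| I5 | 0.34 | 0.14 | 0.28 | 0.29 | 0.00 |

After merging I2 and I5:

| | I1 | {I2, I5} | {I3, I6} | I4 |
|---|---|---|---|---|
| I1 | 0.00 | 0.24 | 0.22 | 0.37 |
| {I2, I5} | 0.24 | 0.00 | 0.15 | 0.20 |
| {I3, I6} | 0.22 | 0.15 | 0.00 | 0.15 |
| I4 | 0.37 | 0.20 | 0.15 | 0.00 |

After merging {I2, I5} and {I3, I6}:

| | I1 | {I2, I5, I3, I6} | I4 |
|---|---|---|---|
| I1 | 0.00 | 0.22 | 0.37 |
| {I2, I5, I3, I6} | 0.22 | 0.00 | 0.15 |
| I4 | 0.37 | 0.15 | 0.00 |

After merging {I2, I5, I3, I6} and I4:

| | I1 | {I2, I5, I3, I6, I4} |
|---|---|---|
| I1 | 0.00 | 0.22 |
| {I2, I5, I3, I6, I4} | 0.22 | 0.00 |

[Figure: dendrogram with leaf order 3, 6, 2, 5, 4, 1 and axis ticks at 0, 0.05, 0.1, 0.15, 0.2]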
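The merge sequence for this example can be checked with a small script. This is a sketch under stated assumptions rather than the tutorial's own code: the names `single_link` and `agglomerate` are hypothetical, and the cluster distance is recomputed from the original matrix on every pass instead of updating a reduced matrix, which gives the same single-link distances. Two candidate merges are tied at distance 0.15, so the script may take them in a different order from the slide; the merge distances and the final hierarchy are unaffected.

```python
from itertools import combinations

labels = ["I1", "I2", "I3", "I4", "I5", "I6"]
D = [  # Euclidean distance matrix from the slide
    [0.00, 0.24, 0.22, 0.37, 0.34, 0.23],
    [0.24, 0.00, 0.15, 0.20, 0.14, 0.25],
    [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
    [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],
    [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],
    [0.23, 0.25, 0.11, 0.22, 0.39, 0.00],
]


def single_link(a, b):
    """Min (single link): smallest distance between a point of a and a point of b."""
    return min(D[i][j] for i in a for j in b)


def agglomerate(n):
    """Merge the two closest clusters until one remains; return the merge list."""
    clusters = [(i,) for i in range(n)]  # each point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: single_link(*pair))
        merges.append((a, b, single_link(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


merges = agglomerate(len(labels))
heights = [h for _, _, h in merges]

for a, b, h in merges:
    print([labels[i] for i in a], "+", [labels[i] for i in b], "at", h)
```

The list `heights` holds the distance at which each merge happens, which corresponds to the heights of the joins in the dendrogram.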
