
1 SEEM4630 Tutorial 3 – Clustering

2 What is Cluster Analysis?
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. A good clustering method will produce high-quality clusters:
- High intra-class similarity: cohesive within clusters
- Low inter-class similarity: distinctive between clusters

3 Notion of a Cluster can be Ambiguous
How many clusters? The same set of points can reasonably be grouped as two clusters, four clusters, or six clusters. (The slide shows the same scatter plot partitioned in these three different ways.)

4 K-Means Clustering
The number of clusters, K, is fixed in advance. Each object is assigned to the cluster with the nearest mean (centroid), using a distance measure such as Euclidean distance; the centroids are then recomputed, and the assign/update steps repeat until the assignments no longer change.
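A minimal sketch of this loop in Python for one-dimensional data (as in the example on the next slide); the function name and the default choice of the first K points as initial centroids are illustrative assumptions, not from the slides:

```python
def kmeans(points, k, init=None, max_iter=100):
    # Initial centroids: use the given seeds, or default to the first k points.
    centroids = list(init) if init is not None else list(points[:k])
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        # (for 1-D data, Euclidean distance is just the absolute difference).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # assignments stable -> converged
            break
        centroids = new_centroids
    return clusters, centroids
```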

5 K-Means Clustering: Example
Given: the mean of cluster Ki is mi = (ti1 + ti2 + … + tim)/m, where ti1, …, tim are the members of Ki.
Data: {2, 4, 10, 12, 3, 20, 30, 11, 25}, K = 2
Solution (starting with m1 = 2, m2 = 4):
- m1 = 2, m2 = 4: K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25} → m1 = 2.5, m2 = 16
- m1 = 2.5, m2 = 16: K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25} → m1 = 3, m2 = 18
- m1 = 3, m2 = 18: K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25} → m1 = 4.75, m2 = 19.6
- m1 = 4.75, m2 = 19.6: K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25} → m1 = 7, m2 = 25
- m1 = 7, m2 = 25: the assignments no longer change, so the algorithm stops.
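The trace above can be reproduced with the `kmeans` sketch from slide 4, seeding the centroids with 2 and 4 as the slide does:

```python
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, centroids = kmeans(data, k=2, init=[2, 4])
print(sorted(clusters[0]))  # [2, 3, 4, 10, 11, 12]
print(sorted(clusters[1]))  # [20, 25, 30]
print(centroids)            # [7.0, 25.0]
```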

6 K-Means Clustering: Evaluation
Sum of Squared Error (SSE): SSE = Σ_{i=1..K} Σ_{x ∈ Ci} dist(mi, x)^2, where x is a data point in cluster Ci and mi is the centroid of cluster Ci. Given several candidate clusterings, choose the one with the smallest error.
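As a sketch, the SSE of the final clustering from the previous slide can be computed directly (the helper name `sse` is made up):

```python
def sse(clusters, centroids):
    # Sum of squared distances from every point to its own cluster centroid.
    return sum((p - m) ** 2
               for cluster, m in zip(clusters, centroids)
               for p in cluster)

# Final clustering from slide 5: K1 with mean 7 and K2 with mean 25.
print(sse([[2, 3, 4, 10, 11, 12], [20, 25, 30]], [7, 25]))  # 150
```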

7 Limitations of K-means
- It is hard to determine a good value of K.
- The result depends on the choice of the initial K centroids.
- K-means has problems when the data contains outliers; outliers are handled better by hierarchical clustering and density-based clustering.

8 Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.

9 Strengths of Hierarchical Clustering
- You do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.
- Two directions of partitioning:
  - Agglomerative: start with single elements and aggregate them into clusters.
  - Divisive: start with the complete data set and divide it into partitions.

10 Agglomerative Hierarchical Clustering
The basic algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat:
   - Merge the two closest clusters.
   - Update the proximity matrix.
4. Until only a single cluster remains.
The key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters give different algorithms.
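A sketch of this algorithm in Python, with cluster proximity left as a pluggable function (single link by default); the names and the dict-of-dicts distance matrix are illustrative assumptions:

```python
def single_link(c1, c2, dist):
    # Proximity of two clusters = smallest distance between any pair of members.
    return min(dist[a][b] for a in c1 for b in c2)

def agglomerative(points, dist, proximity=single_link):
    # Steps 1-2: the proximity matrix is given in `dist`; each point starts as a cluster.
    clusters = [[p] for p in points]
    merges = []
    # Step 3: repeat until only a single cluster remains.
    while len(clusters) > 1:
        # Find and merge the two closest clusters under the chosen proximity.
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: proximity(clusters[ab[0]], clusters[ab[1]], dist))
        merges.append((clusters[i], clusters[j],
                       proximity(clusters[i], clusters[j], dist)))
        # "Update the proximity matrix": here proximities are recomputed from
        # the merged membership on the next pass instead of being cached.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```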

11 Hierarchical Clustering
Defining inter-cluster similarity:
- Min (single link): the smallest distance between any pair of points from the two clusters
- Max (complete link): the largest distance between any pair of points from the two clusters
- Group average: the average of all pairwise distances between the two clusters
- Distance between centroids: the distance between the cluster means
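These four definitions can be written down directly; any of the first three can be passed as the `proximity` argument of the `agglomerative` sketch above (the centroid version is shown for 1-D values for brevity):

```python
def d_min(c1, c2, dist):   # Min (single link): closest pair of points
    return min(dist[a][b] for a in c1 for b in c2)

def d_max(c1, c2, dist):   # Max (complete link): farthest pair of points
    return max(dist[a][b] for a in c1 for b in c2)

def d_avg(c1, c2, dist):   # Group average: mean of all pairwise distances
    return sum(dist[a][b] for a in c1 for b in c2) / (len(c1) * len(c2))

def d_centroid(c1, c2):    # Distance between centroids (1-D points)
    return abs(sum(c1) / len(c1) - sum(c2) / len(c2))
```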

12 Hierarchical Clustering: Min or Single Link
Euclidean distance matrix between the six points I1–I6:

       I1    I2    I3    I4    I5    I6
  I1  0.00  0.24  0.22  0.37  0.34  0.23
  I2  0.24  0.00  0.15  0.20  0.14  0.25
  I3  0.22  0.15  0.00  0.15  0.28  0.11
  I4  0.37  0.20  0.15  0.00  0.29  0.22
  I5  0.34  0.14  0.28  0.29  0.00  0.39
  I6  0.23  0.25  0.11  0.22  0.39  0.00

Single-link (Min) merges, step by step:
- Closest pair: I3 and I6 at 0.11 → {I3, I6}
- Next: I2 and I5 at 0.14 → {I2, I5}
- Next: {I2, I5} and {I3, I6} at 0.15 → {I2, I5, I3, I6}
- Next: {I2, I5, I3, I6} and I4 at 0.15 → {I2, I5, I3, I6, I4}
- Finally: I1 joins at 0.22
The dendrogram orders the leaves 3, 6, 2, 5, 4, 1, with merge heights 0.11, 0.14, 0.15, 0.15, 0.22.
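Assuming SciPy is available, the merge heights above can be verified with its single-linkage implementation (`linkage` takes the condensed form of the distance matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Euclidean distance matrix for I1..I6 from the slide.
D = np.array([
    [0.00, 0.24, 0.22, 0.37, 0.34, 0.23],
    [0.24, 0.00, 0.15, 0.20, 0.14, 0.25],
    [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
    [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],
    [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],
    [0.23, 0.25, 0.11, 0.22, 0.39, 0.00],
])

Z = linkage(squareform(D), method='single')
print(Z[:, 2])  # merge heights: [0.11 0.14 0.15 0.15 0.22]
```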

