Published by Estefania Auston. Modified about 1 year ago.

1
HW 4 Answers

2
1. Consider the xy coordinates of the 7 points shown in Table 1.
(a) Construct the distance matrix using Euclidean distance and perform single link and complete link hierarchical clustering. Show your results by drawing a dendrogram. The dendrogram should clearly show the order in which the points are merged.
(b) Following (a), compute the cophenetic correlation coefficient for the derived dendrograms.

3
(a) The distance matrix and the single link merge steps.

Step 0 (the distance matrix):

      p1    p2    p3    p4    p5    p6    p7
p1  0.00  0.23  0.22  0.37  0.34  0.24  0.19
p2        0.00  0.14  0.19  0.14  0.24  0.06
p3              0.00  0.16  0.28  0.10  0.17
p4                    0.00  0.28  0.22  0.25
p5                          0.00  0.39  0.15
p6                                0.00  0.26
p7                                      0.00

Step 1 (after merging p2 and p7):

            p1   p2,p7      p3      p4      p5      p6
p1        0.00    0.19    0.22    0.37    0.34    0.24
p2,p7             0.00    0.14    0.19    0.14    0.24
p3                        0.00    0.16    0.28    0.10
p4                                0.00    0.28    0.22
p5                                        0.00    0.39
p6                                                0.00

Step 2 (after merging p3 and p6):

            p1   p2,p7   p3,p6      p4      p5
p1        0.00    0.19    0.22    0.37    0.34
p2,p7             0.00    0.14    0.19    0.14
p3,p6                     0.00    0.16    0.28
p4                                0.00    0.28
p5                                        0.00

Step 3 (merge p5, p2, p7 first):

                  p1  p2,p5,p7     p3,p6        p4
p1              0.00      0.19      0.22      0.37
p2,p5,p7                  0.00      0.14      0.19
p3,p6                               0.00      0.16
p4                                            0.00

Step 4:

                            p1  p2,p3,p5,p6,p7              p4
p1                        0.00            0.19            0.37
p2,p3,p5,p6,p7                            0.00            0.16
p4                                                        0.00

Step 5:

                                  p1  p2,p3,p4,p5,p6,p7
p1                              0.00               0.19
p2,p3,p4,p5,p6,p7                                  0.00
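The merge steps above can be retraced with a short script. This is a minimal sketch of naive single link agglomeration, not the method used to produce the slides; the names `ROWS`, `DIST`, `single_link` and `agglomerate` are illustrative choices, and ties are broken by list order, which happens to reproduce Case 1 (p5 joins {p2, p7} first).

```python
from itertools import combinations

# Upper-triangular distances copied from the Step 0 matrix (points p1..p7).
ROWS = {
    1: [0.23, 0.22, 0.37, 0.34, 0.24, 0.19],
    2: [0.14, 0.19, 0.14, 0.24, 0.06],
    3: [0.16, 0.28, 0.10, 0.17],
    4: [0.28, 0.22, 0.25],
    5: [0.39, 0.15],
    6: [0.26],
}
DIST = {}
for i, row in ROWS.items():
    for offset, d in enumerate(row):
        DIST[frozenset((i, i + 1 + offset))] = d


def single_link(a, b):
    """Single link distance: smallest point-to-point distance between two clusters."""
    return min(DIST[frozenset((p, q))] for p in a for q in b)


def agglomerate(n_points=7):
    """Return the merges as (cluster_a, cluster_b, height), in merge order."""
    clusters = [frozenset([p]) for p in range(1, n_points + 1)]
    merges = []
    while len(clusters) > 1:
        # min() keeps the first minimal pair, so ties are broken by list order.
        a, b = min(combinations(clusters, 2), key=lambda pair: single_link(*pair))
        merges.append((a, b, single_link(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


if __name__ == "__main__":
    for a, b, height in agglomerate():
        print(sorted(a), "+", sorted(b), "at", height)
```

The merge heights it reports match the smallest off-diagonal entry of each step's matrix.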

4
Two possible dendrograms for single link hierarchical clustering:
Case 1: merge p5 with {p2, p7} first.
Case 2: merge {p3, p6} with {p2, p7} first.
[Dendrogram figures not reproduced. Leaf order in both: p2, p7, p5, p3, p6, p4, p1.]

5
(b) The cophenetic distance matrix for single link clustering (Case 1 dendrogram). Each entry is the height at which the two points are first joined in the dendrogram: 0.06 for p2 and p7, 0.10 for p3 and p6, 0.14 once p5 and {p3, p6} have joined {p2, p7}, 0.16 for p4, and 0.19 for p1.

      p1    p2    p3    p4    p5    p6    p7
p1  0.00  0.19  0.19  0.19  0.19  0.19  0.19
p2        0.00  0.14  0.16  0.14  0.14  0.06
p3              0.00  0.16  0.14  0.10  0.14
p4                    0.00  0.16  0.16  0.16
p5                          0.00  0.14  0.14
p6                                0.00  0.14
p7                                      0.00

The cophenetic correlation coefficient is the correlation between these entries and the corresponding entries of the distance matrix in (a).

6
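(a) The dendrogram for complete link clustering. [Dendrogram figure not reproduced.]
(b) The cophenetic distance matrix for complete link clustering. The merge heights are 0.06 for p2 and p7, 0.10 for p3 and p6, 0.15 when p5 joins {p2, p7}, 0.22 when p4 joins {p3, p6}, 0.34 when p1 joins {p2, p5, p7}, and 0.39 for the final merge.

      p1    p2    p3    p4    p5    p6    p7
p1  0.00  0.34  0.39  0.39  0.34  0.39  0.34
p2        0.00  0.39  0.39  0.15  0.39  0.06
p3              0.00  0.22  0.39  0.10  0.39
p4                    0.00  0.39  0.22  0.39
p5                          0.00  0.39  0.15
p6                                0.00  0.39
p7                                      0.00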
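Question 1(b) asks for a single number per dendrogram: the correlation between the original distances and the cophenetic distances. The sketch below is one way to compute it, assuming the distance matrix from part (a) and a plain Pearson correlation over the 21 point pairs. The function names are illustrative, and passing `min` or `max` as the linkage selects single or complete link.

```python
from itertools import combinations
from math import sqrt

# Upper-triangular distances from part (a), points p1..p7.
ROWS = {
    1: [0.23, 0.22, 0.37, 0.34, 0.24, 0.19],
    2: [0.14, 0.19, 0.14, 0.24, 0.06],
    3: [0.16, 0.28, 0.10, 0.17],
    4: [0.28, 0.22, 0.25],
    5: [0.39, 0.15],
    6: [0.26],
}
DIST = {
    frozenset((i, i + 1 + k)): d
    for i, row in ROWS.items()
    for k, d in enumerate(row)
}
PAIRS = list(DIST)


def cophenetic(linkage):
    """Cophenetic distances; linkage is min (single link) or max (complete link)."""
    def between(a, b):
        return linkage(DIST[frozenset((p, q))] for p in a for q in b)

    clusters = [frozenset([p]) for p in range(1, 8)]
    coph = {}
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        height = between(a, b)
        # Points of a and points of b are joined for the first time at this height.
        for p in a:
            for q in b:
                coph[frozenset((p, q))] = height
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return coph


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def cpcc(linkage):
    """Cophenetic correlation coefficient for the chosen linkage."""
    coph = cophenetic(linkage)
    return pearson([DIST[p] for p in PAIRS], [coph[p] for p in PAIRS])


if __name__ == "__main__":
    print("single link CPCC:  ", cpcc(min))
    print("complete link CPCC:", cpcc(max))
```

Because single link and complete link only take minima and maxima of the input distances, every cophenetic entry is one of the original matrix values.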

7
2. Consider the following four faces shown in Figure 2. Again, darkness or number of dots represents density. Lines are used only to distinguish regions and do not represent points.
(a) For each figure, could you use single link to find the patterns represented by the nose, eyes, and mouth? Explain.
(b) For each figure, could you use K-means to find the patterns represented by the nose, eyes, and mouth? Explain.

8
Ans:
(a) Only for (b) and (d). For (b), the points in the nose, eyes, and mouth are much closer together than the points between these areas. For (d), there is only empty space between these regions.
(b) Only for (b) and (d). For (b), K-means would find the nose, eyes, and mouth, but the lower-density points would also be included. For (d), K-means would find the nose, eyes, and mouth straightforwardly as long as the number of clusters was set to 4.

9
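3. Compute the entropy and purity for the confusion matrix in Table 2.
Purity (rows of the table are clusters i, columns are classes j; $p_{ij} = m_{ij}/m_i$ is the fraction of the objects in cluster i that belong to class j):
- The purity of cluster i: $p_i = \max_j p_{ij}$
- The overall purity: $\text{purity} = \sum_{i=1}^{K} \frac{m_i}{m}\, p_i$
Purity (cluster #1):
Purity (cluster #2):
Purity (cluster #3):
Purity (total):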

10
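Entropy
- $p_{ij}$: the probability that a member of cluster i belongs to class j, $p_{ij} = m_{ij}/m_i$
  - $m_{ij}$: the number of objects of class j in cluster i
  - $m_i$: the number of objects in cluster i
- The entropy of cluster i: $e_i = -\sum_{j=1}^{L} p_{ij} \log_2 p_{ij}$
  - L: the number of classes (ground truth, given)
- The entropy of a clustering is the size-weighted sum of the cluster entropies: $e = \sum_{i=1}^{K} \frac{m_i}{m}\, e_i$
  - m: the total number of data points
  - K: the number of clusters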

11
Entropy (cluster #1): Entropy (cluster #2): Entropy (cluster #3): Entropy (total):
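The purity and entropy definitions can be sketched in a few lines. Table 2 is not reproduced in this transcript, so the counts in `confusion` below are hypothetical and serve only to exercise the functions. Rows are clusters and columns are classes, and terms with a zero count are skipped, following the usual convention that 0 log 0 = 0.

```python
from math import log2


def cluster_purity(row):
    """Purity of one cluster: the largest class fraction p_ij in its row."""
    return max(row) / sum(row)


def cluster_entropy(row):
    """Entropy of one cluster: -sum_j p_ij * log2(p_ij), skipping empty classes."""
    m_i = sum(row)
    return -sum((count / m_i) * log2(count / m_i) for count in row if count > 0)


def overall(confusion, per_cluster):
    """Weight each cluster's score by m_i / m and add the results."""
    m = sum(sum(row) for row in confusion)
    return sum(sum(row) / m * per_cluster(row) for row in confusion)


# Hypothetical counts (NOT the values of Table 2): rows = clusters, columns = classes.
confusion = [
    [8, 0, 0],
    [2, 6, 0],
    [1, 1, 2],
]

if __name__ == "__main__":
    for number, row in enumerate(confusion, start=1):
        print(f"cluster #{number}: purity={cluster_purity(row):.3f}, "
              f"entropy={cluster_entropy(row):.3f}")
    print("total purity: ", overall(confusion, cluster_purity))
    print("total entropy:", overall(confusion, cluster_entropy))
```

Replacing `confusion` with the counts from Table 2 yields the per-cluster and total values the question asks for.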

12
4. Using the distance matrix in Table 3, compute the silhouette coefficient for each point, each of the two clusters, and the overall clustering. Cluster 1 contains {P1, P2} and Cluster 2 contains {P3, P4}.

13
Internal Measures: Silhouette Coefficient
The silhouette coefficient combines the ideas of cohesion and separation, and applies to individual points as well as to clusters and clusterings.
For an individual point i:
- Calculate a = the average distance of i to the other points in its cluster (the within-cluster average).
- Calculate b = the minimum, over the other clusters, of the average distance of i to the points in that cluster (the smallest average distance to another cluster).
- The silhouette coefficient of the point is then s = 1 - a/b if a < b (or s = b/a - 1 if a ≥ b, which is not the usual case).
- s is typically between 0 and 1.
- The closer to 1, the better.
The average silhouette width can then be calculated for a cluster or for a whole clustering.

14
Cluster 1: {P1, P2} Cluster 2: {P3, P4}
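The silhouette recipe can be written out directly. Table 3 is not reproduced in this transcript, so the matrix `dist` below is hypothetical and only illustrates the calculation for Cluster 1 = {P1, P2} and Cluster 2 = {P3, P4}. The sketch assumes every cluster has at least two points, since a is undefined for a singleton.

```python
def average_distance(i, members, dist):
    """Average distance from point i to the listed points."""
    return sum(dist[i][j] for j in members) / len(members)


def silhouette(i, labels, dist):
    """Silhouette coefficient of point i (indices are 0-based)."""
    own = labels[i]
    same = [j for j, lab in enumerate(labels) if lab == own and j != i]
    a = average_distance(i, same, dist)
    b = min(
        average_distance(i, [j for j, lab in enumerate(labels) if lab == other], dist)
        for other in set(labels) - {own}
    )
    return 1 - a / b if a < b else b / a - 1


def mean_silhouette(points, labels, dist):
    """Average silhouette over a cluster or over the whole clustering."""
    return sum(silhouette(i, labels, dist) for i in points) / len(points)


# Hypothetical distances (NOT the values of Table 3), order P1, P2, P3, P4.
dist = [
    [0.0, 0.1, 0.4, 0.6],
    [0.1, 0.0, 0.5, 0.7],
    [0.4, 0.5, 0.0, 0.2],
    [0.6, 0.7, 0.2, 0.0],
]
labels = [1, 1, 2, 2]

if __name__ == "__main__":
    for i in range(4):
        print(f"P{i + 1}: s = {silhouette(i, labels, dist):.3f}")
    print("cluster 1:", mean_silhouette([0, 1], labels, dist))
    print("cluster 2:", mean_silhouette([2, 3], labels, dist))
    print("overall:  ", mean_silhouette(range(4), labels, dist))
```

With these made-up numbers, P1 has a = 0.1 and b = (0.4 + 0.6)/2 = 0.5, so s = 1 - 0.1/0.5 = 0.8.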

15
5. Given the set of cluster labels and the similarity matrix shown in Tables 4 and 5, respectively, compute the correlation between the similarity matrix and the ideal similarity matrix, i.e., the matrix whose (i, j)th entry is 1 if objects i and j belong to the same cluster, and 0 otherwise.

16
Ideal similarity matrix:
    1 1 0 0
    1 1 0 0
    0 0 1 1
    0 0 1 1
y = x =

17
y = x =
Note: take the square root when computing σ (the standard deviation is the square root of the variance).
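The correlation in question 5 can be sketched as follows. Table 5 is not reproduced in this transcript, so the `similarity` matrix below is hypothetical. The sketch also assumes that only the entries above the diagonal are correlated (six pairs for four objects), which is a common convention but may differ from how x and y were laid out on the slide.

```python
from math import sqrt


def ideal_similarity(labels):
    """Entry (i, j) is 1 if objects i and j share a cluster label, else 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


def upper_triangle(matrix):
    """Entries above the diagonal, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def correlation(x, y):
    """Pearson correlation: covariance divided by the two standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    # The standard deviation is the square root of the variance.
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (std_x * std_y)


# Cluster labels matching the ideal matrix shown on the slide.
labels = [1, 1, 2, 2]

# Hypothetical similarity matrix (NOT the values of Table 5).
similarity = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.9],
    [0.1, 0.2, 0.9, 1.0],
]

if __name__ == "__main__":
    x = upper_triangle(similarity)
    y = upper_triangle(ideal_similarity(labels))
    print("x =", x)
    print("y =", y)
    print("correlation =", correlation(x, y))
```

A value close to 1 indicates that objects in the same cluster are consistently more similar to each other than to objects in the other cluster.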

18
6. Compute the hierarchical F-measure for the eight objects {p1, p2, p3, p4, p5, p6, p7, p8} and the hierarchical clustering shown in Figure 3. Class A contains points p1, p2, and p3, while p4, p5, p6, p7, and p8 belong to class B.

19
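F-measure (for class i and cluster j, where $m_{ij}$ is the number of objects of class i in cluster j, $m_i$ is the size of class i and $m_j$ is the size of cluster j)
- Precision: $P(i,j) = m_{ij}/m_j$
- Recall: $R(i,j) = m_{ij}/m_i$
- F-measure: $F(i,j) = \dfrac{2\,P(i,j)\,R(i,j)}{P(i,j) + R(i,j)}$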

20
Hierarchical F-measure
$F = \sum_{i} \frac{m_i}{m} \max_{j} F(i,j)$
- The sum runs over the classes i.
- The maximum is taken over all clusters j at all levels of the hierarchy.
- $m_i$ is the number of objects in class i.
- m is the total number of objects.
Class A: {p1, p2, p3}
Class B: {p4, p5, p6, p7, p8}

21
Class B: R(B,1) = 5/5 = 1, P(B,1) = 5/8 = 0.625, F(B,1) = 0.77
Overall Clustering:
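The hierarchical F-measure can be sketched with Python sets. Figure 3 is not reproduced in this transcript, so the list `clusters` below is a hypothetical hierarchy: only its first entry, the root cluster containing all eight points, is implied by the values R(B,1) = 5/5 and P(B,1) = 5/8, and the two child clusters are made up for illustration.

```python
def f_measure(cls, cluster):
    """F-measure of a class (set of points) with respect to one cluster."""
    overlap = len(cls & cluster)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)  # fraction of the cluster that is in the class
    recall = overlap / len(cls)         # fraction of the class that is in the cluster
    return 2 * precision * recall / (precision + recall)


def hierarchical_f(classes, clusters):
    """Sum over classes of (m_i / m) * max over all clusters of F(i, j)."""
    m = sum(len(members) for members in classes.values())
    return sum(
        len(members) / m * max(f_measure(members, cluster) for cluster in clusters)
        for members in classes.values()
    )


classes = {
    "A": {1, 2, 3},
    "B": {4, 5, 6, 7, 8},
}

# Hypothetical hierarchy (NOT Figure 3): the root plus two made-up child clusters.
clusters = [
    {1, 2, 3, 4, 5, 6, 7, 8},  # cluster 1: the root
    {1, 2, 4, 5},
    {3, 6, 7, 8},
]

if __name__ == "__main__":
    print("F(B,1) =", round(f_measure(classes["B"], clusters[0]), 2))
    print("hierarchical F =", hierarchical_f(classes, clusters))
```

Listing every cluster of Figure 3, at every level, in `clusters` gives the overall value for the question.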

22
7. Figure 4 shows a clustering of a two-dimensional point data set with two clusters: The leftmost cluster, whose points are marked by asterisks, is somewhat diffuse, while the rightmost cluster, whose points are marked by circles, is compact. To the right of the compact cluster, there is a single point (marked by an arrow) that belongs to the diffuse cluster, whose center is farther away than that of the compact cluster. Explain why this is possible with EM clustering, but not K-means clustering.

23
Ans: In EM clustering, we compute the probability that a point belongs to a cluster. In turn, this probability depends on both the distance from the cluster center and the spread (variance) of the cluster. Hence, a point that is closer to the centroid of one cluster than another can still have a higher probability with respect to the more distant cluster if that cluster has a higher spread than the closer cluster. K-means only takes into account the distance to the closest cluster when assigning points to clusters. This is equivalent to an EM approach where all clusters are assumed to have the same variance.
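The argument can be checked numerically with a one-dimensional toy example. The means, standard deviations, mixing weights and the test point below are hypothetical, chosen only to mimic a diffuse cluster on the left and a compact cluster on the right. The sketch compares a single assignment step with fixed parameters and is not a full EM or K-means implementation.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian at x."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * sqrt(2 * pi))


def kmeans_assign(x, centers):
    """K-means rule: pick the cluster whose center is closest."""
    return min(centers, key=lambda name: abs(x - centers[name]))


def em_assign(x, params, weights):
    """EM-style rule: pick the cluster with the largest weight * density."""
    return max(params, key=lambda name: weights[name] * gaussian_pdf(x, *params[name]))


# Hypothetical parameters: (mean, standard deviation) for each cluster.
params = {"diffuse": (-6.0, 4.0), "compact": (0.0, 0.5)}
weights = {"diffuse": 0.5, "compact": 0.5}
centers = {name: mean for name, (mean, _) in params.items()}

# A point to the right of the compact cluster, like the one marked by the arrow.
x = 2.5

if __name__ == "__main__":
    print("K-means assigns x to:", kmeans_assign(x, centers))
    print("EM assigns x to:     ", em_assign(x, params, weights))
```

The point lies 2.5 units from the compact center and 8.5 units from the diffuse center, yet it is five standard deviations from the former and only about two from the latter, so the diffuse Gaussian gives it the higher density.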
