Learning the threshold in Hierarchical Agglomerative Clustering


1 Learning the threshold in Hierarchical Agglomerative Clustering
Kristine Daniels, Christophe Giraud-Carrier
Speaker: Ngai Wang Kay

2 Hierarchical clustering
[Figure: dendrograms over objects d1, d2, d3, with a horizontal threshold line showing where the hierarchy is cut into flat clusters.]
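As a concrete illustration (mine, not from the slides), the sketch below builds a dendrogram for three toy points and cuts it at two different thresholds using SciPy; the points and threshold values are made-up assumptions:

```python
# A minimal sketch of cutting a dendrogram at a distance threshold,
# using SciPy's agglomerative clustering. Toy points are assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]])  # d1, d2, d3

# Build the full dendrogram (here with single-link merging).
Z = linkage(points, method="single")

# Each threshold yields one flat clustering: a low cut keeps d3 apart
# from the merged {d1, d2}; a high cut merges everything.
print(fcluster(Z, t=0.5, criterion="distance"))   # [1 1 2]
print(fcluster(Z, t=10.0, criterion="distance"))  # [1 1 1]
```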

3 Distance metric Single-link distance metric – the minimum of the simple distances (e.g. Euclidean distances) between the objects in the two clusters.

4 Distance metric Complete-link distance metric – the maximum of the simple distances between the objects in the two clusters.
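To make the two linkage criteria concrete, here is a minimal sketch (my illustration, with made-up cluster contents) computing both from the matrix of pairwise distances:

```python
# Single-link takes the minimum pairwise Euclidean distance between two
# clusters; complete-link takes the maximum. Cluster contents are toy data.
import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [6.0, 0.0]])

pairwise = cdist(cluster_a, cluster_b)  # all "simple" distances
single_link = pairwise.min()            # 2.0: the closest pair
complete_link = pairwise.max()          # 6.0: the farthest pair
print(single_link, complete_link)
```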

5 Threshold determination
Some applications may just want a set of clusters at a particular threshold instead of a full dendrogram, and a more efficient clustering algorithm may be developed for such a case. However, there are many possible thresholds, so it is hard to determine the one that gives an accurate clustering result (as measured against the correct clusters).

6 Threshold determination
Suppose C1, …, Cn are the correct clusters and H1, …, Hm are the computed clusters. An F-measure is used to determine the accuracy of the computed clusters as follows:

7 Threshold determination
For a correct cluster Ci and a computed cluster Hj, let P(Ci, Hj) = |Ci ∩ Hj| / |Hj| (precision) and R(Ci, Hj) = |Ci ∩ Hj| / |Ci| (recall), so that F(Ci, Hj) = 2 · P(Ci, Hj) · R(Ci, Hj) / (P(Ci, Hj) + R(Ci, Hj)). The overall score is F = Σi (|Ci| / N) · maxj F(Ci, Hj), where N is the dataset size.
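The measure can be computed directly from set overlaps. The sketch below is a plain-Python illustration with made-up toy clusters, assuming the weighted-by-|Ci| definition reconstructed above:

```python
# Clustering F-measure: each correct cluster Ci is matched with the
# computed cluster Hj maximizing F(Ci, Hj); matches are weighted by |Ci|/N.
# Clusters are represented as sets of object ids.
def f_measure(correct, computed):
    n = sum(len(c) for c in correct)  # N, the dataset size
    total = 0.0
    for c in correct:
        best = 0.0
        for h in computed:
            overlap = len(c & h)
            if overlap == 0:
                continue
            precision = overlap / len(h)
            recall = overlap / len(c)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (len(c) / n) * best
    return total

correct = [{0, 1, 2}, {3, 4}]
computed = [{0, 1}, {2, 3, 4}]
print(f_measure(correct, computed))  # 0.8 for this toy split
```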

8 Semi-supervised algorithm
1. Select a random subset S of the dataset.
2. Label the data in S with their correct clusters.
3. Cluster S using the hierarchical algorithm above.
4. Compute the F-measure value for each threshold in the dendrogram.
5. Find the threshold with the highest F-measure value.
6. Cluster the full dataset using this threshold (sketched below).
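A minimal sketch of this procedure, assuming SciPy's agglomerative clustering and reusing the f_measure function sketched earlier; the sampling and tie-breaking details are my assumptions, not the authors' code:

```python
# Learn a distance threshold from a small labeled sample: cluster the
# sample, score every merge height in its dendrogram with the F-measure,
# and return the best-scoring threshold for use on the full dataset.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def learn_threshold(data, labels, sample_size=50, method="complete", seed=None):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=sample_size, replace=False)
    sample, sample_labels = data[idx], labels[idx]

    # Correct clusters come from the labels of the sampled points.
    correct = [set(np.flatnonzero(sample_labels == c))
               for c in np.unique(sample_labels)]

    Z = linkage(sample, method=method)
    best_t, best_f = 0.0, -1.0
    # Each merge height in the dendrogram is a candidate threshold.
    for t in np.unique(Z[:, 2]):
        flat = fcluster(Z, t=t, criterion="distance")
        computed = [set(np.flatnonzero(flat == h)) for h in np.unique(flat)]
        f = f_measure(correct, computed)
        if f > best_f:
            best_t, best_f = t, f
    return best_t

# The learned threshold is then applied to the whole dataset, e.g.:
# fcluster(linkage(data, method="complete"), t=best_t, criterion="distance")
```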

9 Sample set Preliminary experiments show that a sample set of size 50 gives reasonable clustering results. The time complexity of hierarchical clustering is usually O(N²) or higher in simple-distance computations and numerical comparisons, so learning the threshold on such a small sample is a very small cost compared to clustering the whole dataset.
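As a back-of-the-envelope check, assuming the O(N²) cost model and taking the Kr-vs-Kp dataset size (3196) from the experiments below:

```python
# Share of pairwise-distance work spent clustering a 50-point sample,
# relative to the full dataset, under the quadratic cost model.
sample, full = 50, 3196
print(sample**2 / full**2)  # ~0.0002, i.e. about 0.02% of the full cost
```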

10 Experimental results Experiments are conducted with complete-link clustering on various real datasets from the UCI repository. These datasets were originally collected for classification problems; their class labels are used as the cluster labels in these experiments.

11 Experimental results

Dataset            Size   # Classes
Breast-Wisconsin    699       2
Car                1728       4
Diabetes            768       2
Glass               214       6
Hepatitis           155       2
Ionosphere          351       2
Kr-vs-Kp           3196       2
Tic-Tac-Toe         958       2
Vehicle             946       4

12 Experimental results

Dataset            Target threshold   Learned threshold
Breast-Wisconsin        13.17              11.91
Car                      7.35               6.68
Diabetes                 8.84              11.61
Glass                    9.39               8.06
Hepatitis               17.12              14.50
Ionosphere              24.81              24.00
Kr-vs-Kp                50.37
Tic-Tac-Toe              7.52               7.45
Vehicle                 13.09               6.11

13 Experimental results Because of the nature of the data, there may be many good threshold values, so large differences between the target and learned thresholds need not yield large differences between the corresponding F-measure values.

14 Experimental results

Dataset            F-measure (Target / Learned)   # Clusters (Target / Learned)
Breast-Wisconsin          0.97 / 0.97                      2 / 2
Car                       0.90 / 0.64                      2 / 5
Diabetes                  0.71 / 0.65                     13 / 4
Glass                     0.82 / 0.82                     11 / 13
Hepatitis                 0.77 / 0.77                      1 / 2
Ionosphere                0.69 / 0.66
Kr-vs-Kp                  0.67 / 0.67                      1 / 4
Tic-Tac-Toe               0.69 / 0.58
Vehicle                   0.46 / 0.31                      3 / 36

15 Experimental results The Vehicle dataset shows a huge difference in the number of clusters but a moderate difference in the F-measure. The Car dataset suffers a serious loss in the F-measure, but the difference in the number of clusters is small. These anomalies may be explained, in part, by the sparseness of the data, the skewness of the underlying class distributions, and the fact that the cluster labels are derived from the classification labels.

16 Experimental results The Diabetes dataset achieves an F-measure value close to optimal with fewer clusters when using the learned threshold. In summary, the learned threshold achieves clustering results close to the optimal ones at a fraction of the computational cost of clustering the whole dataset.

17 Conclusion Hierarchical clustering does not produce a single clustering result but a dendrogram: a series of nested clusterings indexed by distance thresholds. This leads to the open problem of choosing the preferred threshold. An efficient semi-supervised algorithm is proposed to obtain such a threshold. Experimental results show that the clustering results obtained with the learned threshold are close to optimal.

