Iterative Optimization and Simplification of Hierarchical Clusterings Doug Fisher Department of Computer Science, Vanderbilt University Journal of Artificial Intelligence Research, 4 (1996) 147-179 Presented by: Biyu Liang ('06), Paul Haake ('07)
2 Outline Introduction Fast but Rough Clustering: Hierarchical Sorting Iterative Optimization Methods and Comparison Simplification of Hierarchical Clustering Conclusion
3 Introduction Overview of method: (1) construct an initial clustering inexpensively; (2) iteratively optimize the clustering using some control strategy; (3) simplify the clustering. Goals: find high-quality clusterings without overfitting, with good CPU efficiency.
4 Introduction (continued) Properties of any clustering algorithm: objective function: evaluates the quality of a particular clustering on a set of data. control strategy: specifies how the algorithm searches the space of all possible clusterings, given some objective function. In this paper, the author compares different control strategies using the same objective function.
5 Outline Introduction Fast but Rough Clustering: Hierarchical Sorting Iterative Optimization Methods and Experiments Simplification of Hierarchical Clustering Conclusion
6 Hierarchical Sorting Greedy algorithm to quickly build an initial rough clustering. All three control strategies (discussed later) begin with the clustering generated by hierarchical sorting. By shuffling records around, they improve the clustering.
7 Hierarchical Sorting $CU(C_k) = P(C_k) \sum_i \sum_j \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right]$ Clusters whose data records have similar attribute values have a higher CU score. Objective function = the “partition utility” (PU), the average CU value over all clusters.
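The CU and PU definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming nominal attributes, records stored as tuples, and probabilities estimated by simple frequency counts; the function names are hypothetical and not taken from the paper.

```python
from collections import Counter

def value_probs(records):
    """Map (attribute index, value) -> relative frequency within `records`."""
    counts = Counter((i, v) for rec in records for i, v in enumerate(rec))
    return {key: c / len(records) for key, c in counts.items()}

def category_utility(cluster, all_records):
    """CU(C_k) = P(C_k) * sum_ij [P(A_i=V_ij | C_k)^2 - P(A_i=V_ij)^2]."""
    p_cluster = len(cluster) / len(all_records)
    within = sum(p * p for p in value_probs(cluster).values())
    overall = sum(p * p for p in value_probs(all_records).values())
    return p_cluster * (within - overall)

def partition_utility(partition):
    """PU = average CU over the clusters of a partition (list of clusters)."""
    all_records = [rec for cluster in partition for rec in cluster]
    total = sum(category_utility(c, all_records) for c in partition)
    return total / len(partition)

# Toy data (made up for illustration): two colours, two sizes.
red, blue = ("red", "small"), ("blue", "large")
good = [[red, red], [blue, blue]]   # homogeneous clusters
bad = [[red, blue], [red, blue]]    # mixed clusters
```

On this toy data the homogeneous partition scores higher than the mixed one, which matches the slide's claim that clusters with similar attribute values earn a higher CU.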
8 Hierarchical Sorting Start with an empty clustering and add each data record one at a time For each record being added, there are two choices: Place the record in some existing cluster in the hierarchy Place the record in a new cluster Select the option that yields the highest quality score (PU)
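The placement rule above can be illustrated with a single-level sketch. It assumes nominal attributes stored as tuples and a frequency-based PU; it builds only one flat level rather than a full hierarchy, and all names are hypothetical.

```python
from collections import Counter

def sum_sq_probs(records):
    """Sum over attributes and values of P(A_i = V_ij)^2 within `records`."""
    counts = Counter((i, v) for rec in records for i, v in enumerate(rec))
    n = len(records)
    return sum((c / n) ** 2 for c in counts.values())

def partition_utility(partition):
    """Average category utility over the clusters of a flat partition."""
    all_records = [r for c in partition for r in c]
    base = sum_sq_probs(all_records)
    n = len(all_records)
    cu_total = sum(len(c) / n * (sum_sq_probs(c) - base) for c in partition)
    return cu_total / len(partition)

def sort_record(partition, record):
    """Try the record in every existing cluster and in a new one; keep the best."""
    candidates = []
    for k in range(len(partition)):
        trial = [list(c) for c in partition]
        trial[k].append(record)
        candidates.append(trial)
    # Option 2: open a new singleton cluster for the record.
    candidates.append([list(c) for c in partition] + [[record]])
    return max(candidates, key=partition_utility)

def sort_all(records):
    """Greedy pass: add records one at a time, starting from an empty clustering."""
    partition = []
    for rec in records:
        partition = sort_record(partition, rec)
    return partition

# Toy data (made up for illustration).
red, blue = ("red", "small"), ("blue", "large")
result = sort_all([red, blue, red, blue])
```

In a full hierarchical version the same choice would be repeated inside the chosen cluster's children; that recursion is omitted here to keep the sketch short.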
11 Outline Introduction Fast but Rough Clustering: Hierarchical Sorting Iterative Optimization Methods and Comparison Simplification of Hierarchical Clustering Conclusion
12 Iterative Optimization Methods Important note: The primary goal of clustering in this paper is to obtain a single-level partitioning of optimal quality. Hierarchical clustering is used only as an intermediate means toward that end. To evaluate the quality of a solution, the author therefore applies the objective function only to the first-level partition.
13 Iterative Optimization Methods Reorder-resort (CLUSTER/2): very similar to k-means Iterative redistribution of single observation: reassign each record to a better cluster Iterative hierarchical redistribution: reassign each record or subtree of records to a better cluster
14 Reorder-resort (k-means) k random seeds are selected, and k clusters are grown around these attractors. The centroids of the clusters are picked as new seeds. The process iterates until there is no further improvement in the quality of the generated clustering.
15 Reorder-resort (k-means), cont'd Ordering data to make consecutive observations dissimilar leads to good clusterings. Extract a “dissimilarity” ordering from the hierarchical sorting: consecutive records will tend to be dissimilar.
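One simple way to obtain an ordering in which consecutive records tend to be dissimilar is to interleave records drawn from different clusters. The sketch below assumes a flat list of clusters and a round-robin interleave; it is an illustration of the idea, not necessarily the exact extraction procedure used in the paper, and `dissimilarity_order` is a hypothetical name.

```python
from itertools import zip_longest

def dissimilarity_order(clusters):
    """Interleave clusters round-robin so neighbours come from different clusters."""
    missing = object()  # sentinel marking exhausted clusters
    ordered = []
    for group in zip_longest(*clusters, fillvalue=missing):
        ordered.extend(rec for rec in group if rec is not missing)
    return ordered

# Toy clusters (made up for illustration).
order = dissimilarity_order([["a1", "a2", "a3"], ["b1"], ["c1", "c2"]])
# order is ["a1", "b1", "c1", "a2", "c2", "a3"]
```

The reordered records would then be sorted again from scratch, so that the early records seed distinct clusters.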
16 Iterative Redistribution of Single Observations Repeat until the clustering doesn't change: For every record, remove it from the clustering and resort it beginning at the root
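The loop on this slide can be sketched for a flat partition. The version below assumes nominal attributes, a frequency-based PU, and cluster membership stored as one integer label per record; it moves a record only when the move strictly improves PU, and it does not resort through a hierarchy from the root as the slide describes. All names are hypothetical.

```python
from collections import Counter

def sum_sq_probs(records):
    counts = Counter((i, v) for rec in records for i, v in enumerate(rec))
    n = len(records)
    return sum((c / n) ** 2 for c in counts.values())

def partition_utility(partition):
    all_records = [r for c in partition for r in c]
    base = sum_sq_probs(all_records)
    n = len(all_records)
    return sum(len(c) / n * (sum_sq_probs(c) - base) for c in partition) / len(partition)

def build(records, labels):
    """Group records into clusters according to their labels."""
    clusters = {}
    for rec, lab in zip(records, labels):
        clusters.setdefault(lab, []).append(rec)
    return list(clusters.values())

def redistribute(records, labels, max_passes=100):
    """Repeatedly re-place each record until a full pass changes nothing."""
    labels = list(labels)
    for _ in range(max_passes):
        changed = False
        for idx in range(len(records)):
            best_label = labels[idx]
            best_score = partition_utility(build(records, labels))
            # Candidates: every existing cluster plus a brand-new one.
            for cand in sorted(set(labels)) + [max(labels) + 1]:
                trial = labels[:idx] + [cand] + labels[idx + 1:]
                score = partition_utility(build(records, trial))
                if score > best_score + 1e-12:  # require a strict improvement
                    best_label, best_score = cand, score
            if best_label != labels[idx]:
                labels[idx] = best_label
                changed = True
        if not changed:
            break
    return labels

# Toy data (made up): start from a deliberately mixed assignment.
red, blue = ("red", "small"), ("blue", "large")
final = redistribute([red, red, blue, blue], [0, 1, 0, 1])
```

Because every accepted move strictly raises PU, the loop cannot cycle; `max_passes` is only a safety cap.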
17 Iterative Hierarchical Redistribution Problem: The previous control strategy resorts only one record at a time. Solution: Resort entire subtrees of records at a time.
18 Iterative Hierarchical Redistribution
Hierarchical-Redistribute-Recurse(SiblingSet):
    Repeat until two consecutive clusterings have the same set of siblings:
        For each sibling in SiblingSet:
            Remove the sibling from the hierarchy and resort
        SiblingSet ← remaining siblings
    For each sibling S in SiblingSet:
        call Hierarchical-Redistribute-Recurse(S.children)

Repeat until clustering converges:
    Clustering ← Hierarchical-Redistribute-Recurse(Clustering.root.children)
20 Main findings from the experiments Hierarchical redistribution achieves the highest mean PU scores in most cases Reordering and re-clustering comes closest to hierarchical redistribution’s performance in all cases Single-observation redistribution modestly improves an initial sort, and is substantially worse than the other two optimization methods
21 Outline Introduction Generating Initial Hierarchical Clustering Iterative Optimization Methods and Comparison Simplification of Hierarchical Clustering Conclusion
22 Simplifying Hierarchical Clustering Higher levels of the hierarchy are meaningful, but lower levels are subject to overfitting. Solution: post-process the hierarchy with validation and pruning.
23 Validation Strategy: Find internal nodes that are most predictive on unseen data (a testing set). What does “predictive” mean in this case? When a data record is classified into a cluster, we want to know how accurately that cluster, in turn, can predict the data record's attribute values. In a high-quality clustering, we expect that an unseen data record, classified into some cluster, will have attribute values similar to the attribute values of other data records in the cluster.
24 Validation For each variable A_i: For each data record: Classify the data record through the cluster hierarchy, beginning at the root, and ignoring the value of A_i. At each node, compare the record's A_i value to the node's expected A_i value; keep a counter of correct predictions for each variable at each node.
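The counting step can be sketched as follows. Several details here are assumptions made for illustration: a node's "expected" value of A_i is taken to be the most common value among its training records, and a record is routed to the child whose modal values agree with it on the most non-hidden attributes (a stand-in for classification by the objective function). `Node`, `match_score`, and `validate` are hypothetical names.

```python
from collections import Counter

class Node:
    def __init__(self, records, children=()):
        self.records = records          # training records stored at this node
        self.children = list(children)
        self.correct = Counter()        # attribute index -> correct predictions

    def mode(self, i):
        """Most common training value of attribute i at this node."""
        return Counter(r[i] for r in self.records).most_common(1)[0][0]

def match_score(node, record, hidden):
    """Count attributes (other than the hidden one) where the record matches."""
    return sum(node.mode(i) == v for i, v in enumerate(record) if i != hidden)

def validate(root, test_records):
    """For each attribute, classify each test record while ignoring that attribute."""
    for hidden in range(len(test_records[0])):
        for rec in test_records:
            node = root
            while node is not None:
                if node.mode(hidden) == rec[hidden]:
                    node.correct[hidden] += 1
                # Descend to the best-matching child; stop at a leaf.
                node = max(node.children,
                           key=lambda c: match_score(c, rec, hidden),
                           default=None)

# Toy hierarchy (made up): a root with two homogeneous children.
red, blue = ("red", "small"), ("blue", "large")
left, right = Node([red, red]), Node([blue, blue])
root = Node([red, red, blue, blue], [left, right])
validate(root, [red, blue])
```

In this toy run the two children together make more correct predictions per attribute than the root does, which is the kind of per-node evidence a pruning step could compare.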
25 Validation After processing all variables, for each variable, identify a “frontier” in the hierarchy such that the number of correct predictions of that variable is maximized. If a node lies below the frontier of every variable, then it is pruned.
27 Validation The author's experiments show that the validation method substantially reduces clustering size without diminishing predictive accuracy.
28 Concluding Remarks There are three phases in searching the space of hierarchical clusterings: Inexpensive generation of an initial clustering Iterative optimization for clusterings Post-processing simplification of generated clusterings Experiments found that the new method, hierarchical redistribution optimization, beats the other iterative optimization methods in most cases.
29 Final Exam Question #1 The main idea in this paper is to construct clusterings which satisfy two conditions. Name the conditions: Consistently constructs high-quality clusterings Computationally inexpensive Name the two steps to satisfy the conditions: Generate a tentative clustering inexpensively, using hierarchical sorting Iteratively optimize that initial clustering
30 Final Exam Question #2 Describe the three iterative methods for clustering optimization: Seed Selection, Reordering, and Reclustering (p. 14-15) Iterative Redistribution of Single Observations (p. 16) Iterative Hierarchical Redistribution (p. 17-19)
31 Final Exam Question #3 The cluster is better when the relative CU score is a) big, b) small, c) equal to 0 Which sorting method is better? a) random sorting, b) similarity sorting