A Consensus-Based Clustering Method for Summarizing Diverse Data Categorizations
Hanan G. Ayad and Mohamed S. Kamel
Pattern Analysis and Machine Intelligence Lab, University of Waterloo, Canada
LORNET Theme 4 - Object Mining and Knowledge Discovery

Introduction
We seek to discover the complex categorization structure inherent in a collection of data objects by obtaining a consensus among a set of diverse cluster structures of the collection. We aim to achieve this objective by developing a computationally efficient consensus method. A competitive consensus method has been demonstrated in the recent literature, but it is computationally expensive, which makes it unattractive for large collections of data objects.

Contributions
- Introduction of the idea of cumulative voting for transforming a clustering into a probabilistic representation with respect to a common reference of the ensemble.
- Definition of criteria for estimating an optimal representation of a clustering ensemble with maximum information content.
- Building on the Information Bottleneck principle, an optimally compressed summary of the estimated stochastic data is extracted such that maximum relevant information about the data is preserved.
- The effectiveness of the developed cumulative voting method is demonstrated as follows: diverse cluster structures for a collection of text documents are generated at arbitrary coarse-to-fine resolutions, a consensus solution is obtained, and the result is compared with equally efficient state-of-the-art consensus methods.

Cumulative Voting Method
- A text document is represented by a list of numeric weights corresponding to the words of the corpus vocabulary.
- For a set X of n objects, a clustering Ci assigns each object to one of ki clusters denoted by symbolic cluster labels.
- Multiple clusterings C1 … Cb of the dataset are generated with induced diversity by varying the number of clusters randomly, obtaining k1 … kb clusters, respectively.
- The clustering of the ensemble with maximum information content is selected as the initial reference clustering.
- An iterative voting procedure is then applied. For each clustering Ci:
  - Cumulative voting is applied, whereby each current cluster "votes" for each current reference cluster according to estimated probabilities.
  - The clustering Ci is thereby transformed into a stochastic representation with respect to the reference clustering.
  - The reference clustering is updated to reflect the current estimates based on the clusterings processed so far.

Experimental Results
- Based on the cumulative voting method, three variant algorithms with different properties and weighting schemes were developed: Un-normalized fixed-Reference Cumulative Voting (URCV), fixed-Reference Cumulative Voting (RCV), and Adaptive Cumulative Voting (ACV). RCV and ACV use a normalized weighting scheme; ACV applies an iterative voting procedure, whereas URCV and RCV use a fixed reference.
- The following performance measures are used; they measure the quality of the obtained consensus solution against the human categorization of the data: the Adjusted Rand Index and Normalized Mutual Information (a sketch of this evaluation setup follows this section).
- Comparison is shown with the following consensus algorithms: the Hyper-Graph Partitioning Algorithm (HGPA) and the Meta-Clustering Algorithm (MCLA) (Strehl et al., 2002), and the Quadratic Mutual Information Algorithm (QMI) (Topchy et al., 2005).
- Each generated ensemble consists of 25 clusterings. Boxplots show the distribution of the performance measures over 10 runs.
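The sketch below illustrates the evaluation setup described above under stated assumptions: it builds a diverse ensemble by varying the number of clusters randomly and scores a labeling against reference categories with the Adjusted Rand Index and Normalized Mutual Information. It uses scikit-learn and a synthetic dataset as a stand-in for the weighted word vectors of the documents; the consensus step itself (URCV, RCV, or ACV) is not implemented here and is replaced by a placeholder.

```python
# Illustrative sketch only (assumed setup, not the authors' code): build a
# diverse ensemble by varying k randomly, then score a labeling against the
# reference categories using the two measures named on the poster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the weighted word-vector representation of documents.
X, y_true = make_blobs(n_samples=300, centers=5, random_state=0)

# Ensemble of b = 25 clusterings with the number of clusters drawn at random
# (coarse-to-fine resolutions).
ensemble = []
for _ in range(25):
    k = int(rng.integers(2, 11))
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=int(rng.integers(10**6))).fit_predict(X)
    ensemble.append(labels)

# Placeholder for the consensus step: here simply the first clustering; a real
# run would combine the whole ensemble with URCV, RCV, or ACV.
consensus = ensemble[0]

print("Adjusted Rand Index:          ", adjusted_rand_score(y_true, consensus))
print("Normalized Mutual Information:", normalized_mutual_info_score(y_true, consensus))
```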
The Voting Process
The joint statistics P(C, X) of two categorical random variables, representing the set of categories and the set of objects X, are estimated. An agglomerative information-theoretic algorithm, derived from the Information Bottleneck principle, is developed to extract an optimally compressed summary of the estimated probability distribution such that maximum relevant information about the data is preserved. Based on this summary, each object is assigned to its most likely category. (A minimal sketch of the cumulative-voting idea is given at the end of this transcript.)

Conclusion
Based on the idea of cumulative voting and the Information Bottleneck principle, efficient consensus clustering algorithms were developed that derive a meaningful consensus clustering from diverse clusterings of the data objects. Superior accuracy compared with recent consensus algorithms is obtained, and the computational complexity is linear in the number of data objects.

Further Reading
Hanan G. Ayad and Mohamed S. Kamel. Cumulative Voting Consensus Method for Partitions with a Variable Number of Clusters. IEEE Transactions on Pattern Analysis and Machine Intelligence. To appear.

Fourth Annual Scientific Conference, LORNET Research Network, November 4-7, 2007, Montreal, Canada.
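As referenced in The Voting Process section, the following is a minimal sketch of the cumulative-voting idea under simplifying assumptions; it is illustrative only and is not the authors' exact URCV, RCV, or ACV algorithm (the function name cumulative_voting, the fixed reference, and the uniform weighting are assumptions of this sketch). Each cluster of every ensemble clustering casts a soft vote for the reference clusters in proportion to their overlap, the votes are accumulated per object, and each object is finally assigned to its most likely consensus cluster.

```python
# Minimal illustrative sketch of cumulative voting with a fixed reference
# (an assumption of this sketch; the adaptive variant would also update the
# reference after each clustering is processed).
import numpy as np

def cumulative_voting(ensemble, reference):
    """ensemble: list of 1-D label arrays; reference: 1-D label array of length n."""
    n = len(reference)
    k0 = int(reference.max()) + 1          # number of reference clusters
    votes = np.zeros((n, k0))              # accumulated soft assignments per object

    for labels in ensemble:
        for c in np.unique(labels):
            members = np.where(labels == c)[0]
            # Estimated p(reference cluster | cluster c): overlap fractions.
            p = np.bincount(reference[members], minlength=k0) / len(members)
            votes[members] += p            # cluster c casts its vote for its members

    return votes.argmax(axis=1)            # most likely consensus cluster per object

# Toy example: three clusterings of 6 objects with different numbers of clusters.
ref = np.array([0, 0, 1, 1, 2, 2])
ens = [np.array([0, 0, 1, 1, 1, 1]),
       np.array([0, 0, 1, 1, 2, 3]),
       np.array([1, 1, 0, 0, 2, 2])]
print(cumulative_voting(ens, ref))
```

In the authors' adaptive variant, the reference clustering is updated as each clustering is processed, and the compressed summary is obtained with the agglomerative Information Bottleneck step; the sketch keeps the reference fixed and assigns objects directly by the accumulated votes for brevity.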

