A Consensus-Based Clustering Method for Summarizing Diverse Data Categorizations

A Consensus-Based Clustering Method for Summarizing Diverse Data Categorizations
Hanan G. Ayad and Mohamed S. Kamel
Pattern Analysis and Machine Intelligence Lab, University of Waterloo, Canada
LORNET Theme 4 - Object Mining and Knowledge Discovery

Introduction

We seek to discover the complex categorization structure inherent in a collection of data objects by obtaining a consensus among a set of diverse cluster structures of the collection, and we aim to achieve this with a computationally efficient consensus method. A competitive consensus method has been demonstrated in the recent literature, but it is computationally expensive, which makes it unattractive for large collections of data objects.

Contributions

- Introduction of the idea of cumulative voting for transforming a clustering into a probabilistic representation with respect to a common reference of the ensemble.
- Definition of criteria for estimating an optimal representation of a clustering ensemble, one with maximum information content.
- Building on the Information Bottleneck principle, extraction of an optimally compressed summary of the estimated stochastic data, such that maximum relevant information about the data is preserved.

The effectiveness of the developed cumulative voting method is demonstrated in two ways:
- Diverse cluster structures for a collection of text documents are generated at arbitrary coarse-to-fine resolutions, and a consensus solution is obtained.
- The method is compared with equally efficient state-of-the-art consensus methods.

Cumulative Voting Method

- A text document is represented by a list of numeric weights corresponding to the words of the corpus vocabulary.
- For a set X of n objects, a clustering Ci assigns each object to one of ki clusters, denoted by symbolic labels.
- Multiple clusterings C1, ..., Cb of the dataset are generated with induced diversity, by varying the number of clusters randomly, yielding k1, ..., kb clusters, respectively.
- The clustering of the ensemble with maximum information content is selected as the initial reference clustering.
- An iterative voting procedure is then applied. For each clustering Ci:
  - Cumulative voting is applied, whereby each current cluster "votes" for each current reference cluster according to estimated probabilities.
  - The clustering Ci is thereby transformed into a stochastic representation with respect to the reference clustering.
  - The reference clustering is updated to reflect the current estimates, based on the clusterings processed so far.

Experimental Results

Based on the cumulative voting method, three variant algorithms with different properties and weighting schemes were developed: Un-normalized fixed-Reference Cumulative Voting (URCV), fixed-Reference Cumulative Voting (RCV), and Adaptive Cumulative Voting (ACV). RCV and ACV use a normalized weighting scheme; ACV applies the iterative (adaptive-reference) voting procedure, whereas URCV and RCV use a fixed reference.

The following performance measures are used; both measure the quality of the obtained consensus solution against a human categorization of the data:
- Adjusted Rand Index (ARI)
- Normalized Mutual Information (NMI)

Comparisons with the following consensus algorithms are shown:
- Hyper-Graph Partitioning Algorithm (HGPA) and Meta-Clustering Algorithm (MCLA) (Strehl et al., 2002)
- Quadratic Mutual Information Algorithm (QMI) (Topchy et al., 2005)

Each generated ensemble consists of 25 clusterings, and boxplots show the distribution of the performance measures over 10 runs.
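The poster contains no code, but the experimental setup above is straightforward to mock up. The following minimal Python sketch, assuming scikit-learn, generates a diverse ensemble by randomly varying the number of clusters k, builds a consensus partition, and scores it with ARI and NMI. The co-association consensus used here is a common stand-in baseline, not the URCV/RCV/ACV cumulative voting algorithms themselves, and the synthetic blob dataset is a toy substitute for the text collections used in the poster.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
X, y_true = make_blobs(n_samples=300, centers=5, random_state=0)

# Generate a diverse ensemble by varying k randomly (the poster uses
# b = 25 clusterings per ensemble, at coarse-to-fine resolutions).
b = 25
ensemble = []
for i in range(b):
    k = int(rng.integers(2, 11))
    ensemble.append(KMeans(n_clusters=k, n_init=10, random_state=i).fit_predict(X))

# Stand-in consensus: average co-association matrix, cut with
# average-linkage clustering. This is NOT the cumulative voting method,
# just a baseline with the same inputs and outputs.
n = X.shape[0]
co = np.zeros((n, n))
for labels in ensemble:
    co += labels[:, None] == labels[None, :]
co /= b

# scikit-learn >= 1.2; older versions spell this affinity="precomputed".
consensus = AgglomerativeClustering(
    n_clusters=5, metric="precomputed", linkage="average"
).fit_predict(1.0 - co)

# Evaluate against the reference categorization, as in the poster.
print("ARI:", adjusted_rand_score(y_true, consensus))
print("NMI:", normalized_mutual_info_score(y_true, consensus))
```

Repeating this over several random seeds and collecting the ARI/NMI values reproduces the kind of boxplot-over-10-runs comparison the poster reports.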
The Voting Process

The joint statistics P(C, X) of two categorical random variables, representing the set of categories and the set of objects X, are estimated. An agglomerative information-theoretic algorithm, derived from the information bottleneck principle, is developed to extract an optimally compressed summary of the estimated probability distribution, so that maximum relevant information about the data is preserved. Based on this summary, each object is assigned to its most likely category (a rough sketch of this greedy merging step is given at the end of this transcript).

Conclusion

Based on the idea of cumulative voting and the information bottleneck principle, efficient consensus clustering algorithms were developed that derive a meaningful consensus clustering from diverse clusterings of the data objects. Superior accuracy compared to recent consensus algorithms is obtained, and the computational complexity is linear in the number of data objects.

Further Reading

Hanan G. Ayad and Mohamed S. Kamel. "Cumulative Voting Consensus Method for Partitions with a Variable Number of Clusters." IEEE Transactions on Pattern Analysis and Machine Intelligence. To appear.

Fourth Annual Scientific Conference, LORNET Research Network, November 4-7, 2007, Montreal, Canada.
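As a concrete companion to the voting-process description above, here is a minimal, hypothetical Python sketch of a greedy agglomerative information-bottleneck compression: it repeatedly merges the pair of clusters (rows of the joint P(C, X)) whose merge loses the least relevant information, measured by a weighted Jensen-Shannon divergence. The function names, the toy joint distribution, and the fixed target number of clusters are all illustrative assumptions; the paper's exact algorithm and weighting details are not reproduced.

```python
import numpy as np

def kl(p, q):
    """KL divergence; supports of p are assumed covered by q."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def merge_cost(row_i, row_j):
    """Relevant-information loss from merging two clusters of P(C, X):
    a weighted Jensen-Shannon divergence between their conditionals."""
    pi, pj = row_i.sum(), row_j.sum()
    merged = (row_i + row_j) / (pi + pj)      # p(x | merged cluster)
    return pi * kl(row_i / pi, merged) + pj * kl(row_j / pj, merged)

def agglomerative_ib(P, n_clusters):
    """Greedily compress the rows (clusters) of a joint P(C, X)."""
    rows = [np.asarray(r, dtype=float) for r in P]
    while len(rows) > n_clusters:
        _, i, j = min(
            (merge_cost(rows[i], rows[j]), i, j)
            for i in range(len(rows)) for j in range(i + 1, len(rows))
        )
        rows[i] = rows[i] + rows[j]           # merge the cheapest pair
        del rows[j]
    return np.vstack(rows)

# Toy usage: compress an 8-cluster joint distribution to 3 summary
# clusters, then assign each object to its most likely category.
P = np.random.default_rng(0).random((8, 20))
P /= P.sum()
summary = agglomerative_ib(P, n_clusters=3)
labels = np.argmax(summary, axis=0)           # argmax_c P(c, x)
```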