Learning the threshold in Hierarchical Agglomerative Clustering

Presentation transcript:

Learning the threshold in Hierarchical Agglomerative Clustering. Kristine Daniels, Christophe Giraud-Carrier. Speaker: Ngai Wang Kay.

Hierarchical clustering. [Figure: dendrogram over objects d1, d2, d3; a horizontal line marks the distance threshold at which the tree is cut into clusters.]
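A minimal sketch (not from the slides) of the idea: SciPy's hierarchical-clustering routines build the dendrogram once, and cutting it at a chosen distance threshold gives a flat set of clusters. The toy data and the threshold value 0.5 are made up for illustration.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(10, 2)                         # 10 toy points in 2-D
    Z = linkage(X, method="complete")                 # dendrogram encoded as a merge table
    labels = fcluster(Z, 0.5, criterion="distance")   # cut the tree at threshold 0.5
    print(labels)                                     # cluster id for each of the 10 points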

Distance metric Single-link distance metric – the minimum of the point-to-point distances (e.g., Euclidean distances) between objects, taken one from each of the two clusters.

Distance metric Complete-link distance metric – the maximum of the point-to-point distances between objects, taken one from each of the two clusters.
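For concreteness, a small sketch of these two linkage criteria using plain Euclidean distances (the helper names are illustrative, not from the paper):

    import numpy as np

    def pairwise_dists(A, B):
        # All Euclidean distances between points of cluster A and points of cluster B.
        return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

    def single_link(A, B):
        return pairwise_dists(A, B).min()    # distance of the closest pair

    def complete_link(A, B):
        return pairwise_dists(A, B).max()    # distance of the farthest pair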

Threshold determination Some applications may just want a flat set of clusters for a particular threshold rather than a full dendrogram, and a more efficient clustering algorithm may be developed for such a case. There are many possible thresholds, so it is hard to determine the one that gives an accurate clustering result (as judged by a measure against the correct clusters).

Threshold determination Suppose C1, …, Cn are the correct clusters and H1, …, Hm are the computed clusters. An F-measure, where N denotes the dataset size, is used to measure the accuracy of the computed clusters.
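A commonly used form of the clustering F-measure that fits this description (each correct cluster is matched with the computed cluster that fits it best, weighted by cluster size; this standard formulation is an assumption here, not quoted from the paper) is, in LaTeX notation:

    P(C_i, H_j) = \frac{|C_i \cap H_j|}{|H_j|}, \qquad
    R(C_i, H_j) = \frac{|C_i \cap H_j|}{|C_i|}, \qquad
    F(C_i, H_j) = \frac{2\,P(C_i, H_j)\,R(C_i, H_j)}{P(C_i, H_j) + R(C_i, H_j)}

    F = \sum_{i=1}^{n} \frac{|C_i|}{N}\,\max_{1 \le j \le m} F(C_i, H_j)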

Semi-supervised algorithm
1. Select a random subset S of the dataset.
2. Label the data in S with their correct clusters.
3. Cluster S using the hierarchical algorithm.
4. Compute the F-measure value for each threshold in the resulting dendrogram.
5. Find the threshold with the highest F-measure value.
6. Cluster the whole dataset using this threshold.
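A compact sketch of this procedure, under some assumptions: complete-link clustering via SciPy, the standard clustering F-measure, and a sample size of 50; the function and variable names are mine rather than the authors'.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def f_measure(true_labels, pred_labels):
        # Overall F-measure of the computed clusters against the correct clusters.
        true_labels = np.asarray(true_labels)
        pred_labels = np.asarray(pred_labels)
        N = len(true_labels)
        total = 0.0
        for c in np.unique(true_labels):
            in_c = true_labels == c
            best = 0.0
            for h in np.unique(pred_labels):
                in_h = pred_labels == h
                overlap = np.sum(in_c & in_h)
                if overlap == 0:
                    continue
                p, r = overlap / in_h.sum(), overlap / in_c.sum()
                best = max(best, 2 * p * r / (p + r))
            total += in_c.sum() / N * best
        return total

    def learn_threshold_and_cluster(X, y, sample_size=50, seed=0):
        X, y = np.asarray(X), np.asarray(y)
        # Steps 1-2: draw a small random sample S whose correct labels y are known.
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        # Step 3: cluster the sample hierarchically (complete-link).
        Z = linkage(X[idx], method="complete")
        # Steps 4-5: score every merge distance in the dendrogram and keep the best one.
        candidates = Z[:, 2]
        scores = [f_measure(y[idx], fcluster(Z, t, criterion="distance"))
                  for t in candidates]
        best_t = candidates[int(np.argmax(scores))]
        # Step 6: cluster the whole dataset at the learned threshold.
        return best_t, fcluster(linkage(X, method="complete"), best_t,
                                criterion="distance")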

Sample set Preliminary experiments show that a sample set of size 50 gives reasonable clustering results. The time complexity of hierarchical clustering is usually O(N²) or higher in point-to-point distance computations and numerical comparisons, so learning the threshold on such a small sample adds only a very small cost compared to clustering the whole dataset.

Experimental results Experiments are conducted by complete-link clustering on various real datasets from the UCI repository (http://www.ics.uci.edu/~mlearn/mlrepository.html). These datasets were originally collected for classification tasks; their class labels are used as the correct cluster labels in these experiments.

Experimental results

Dataset            Size   # Classes
Breast-Wisconsin    699   2
Car                1728   4
Diabetes            768
Glass               214
Hepatitis           155
Ionosphere          351
Kr-vs-Kp           3196
Tic-Tac-Toe         958
Vehicle             946

Experimental results

Dataset            Target threshold   Learned threshold
Breast-Wisconsin        13.17              11.91
Car                      7.35               6.68
Diabetes                 8.84              11.61
Glass                    9.39               8.06
Hepatitis               17.12              14.50
Ionosphere              24.81              24.00
Kr-vs-Kp              1605.28              50.37
Tic-Tac-Toe              7.52               7.45
Vehicle                 13.09               6.11

Experimental results Because of the nature of the data, there may be many good threshold values, so large differences between the target and learned thresholds do not necessarily yield large differences between the corresponding F-measure values.

Experimental results

Dataset            F-measure (Target / Learned)   # Clusters (Target / Learned)
Breast-Wisconsin   0.97 / 0.97                    2 / 2
Car                0.90 / 0.64                    2 / 5
Diabetes           0.71 / 0.65                    13 / 4
Glass              0.82 / 0.82                    11 / 13
Hepatitis          0.77 / 0.77                    1 / 2
Ionosphere         0.69 / 0.66
Kr-vs-Kp           0.67 / 0.67                    1 / 4
Tic-Tac-Toe        0.69 / 0.58
Vehicle            0.46 / 0.31                    3 / 36

Experimental results The Vehicle dataset shows a huge difference in the number of clusters but only a moderate difference in the F-measure. The Car dataset suffers a serious loss of F-measure, yet the difference in the number of clusters is small. These anomalies may be explained, in part, by the sparseness of the data, the skewness of the underlying class distributions, and the fact that the cluster labels are derived from classification labels.

Experimental results The Diabetes dataset achieves an F-measure value close to optimal with fewer clusters when using the learned threshold. In summary, the learned threshold yields clustering results close to the optimal ones, and learning it costs only a fraction of the computational cost of clustering the whole dataset.

Conclusion Hierarchical clustering does not produce a single clustering result but a dendrogram, a series of nested clusters indexed by distance thresholds. This leaves open the problem of choosing the preferred threshold. An efficient semi-supervised algorithm is proposed to obtain such a threshold. Experimental results show that the clusterings obtained with the learned threshold are close to optimal.