CS 478 – Tools for Machine Learning and Data Mining Clustering Quality Evaluation.

Quality Metrics
– External metrics: computed from a comparison to actual clustering (or classification) labels.
– Internal metrics: computed solely from the composition of a cluster, with no recourse to external information.

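External Metrics – Setup (I)
Let TC and CC be the target and computed clusterings, respectively. Both cover the same items: the union of the clusters in TC and the union of the clusters in CC each equal the original set of data. Define the following: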

External Metrics – Setup (II)
– a: number of pairs of items that belong to the same cluster in both CC and TC
– b: number of pairs of items that belong to different clusters in both CC and TC
– c: number of pairs of items that belong to the same cluster in CC but different clusters in TC
– d: number of pairs of items that belong to the same cluster in TC but different clusters in CC
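The four pair counts can be obtained by looping over all pairs of items. The following is a minimal sketch, assuming each clustering is given as a list of labels (one label per item, same item order in both lists); the function name `pair_counts` and the toy data are hypothetical and not part of the slides.

```python
from itertools import combinations


def pair_counts(target, computed):
    """Return (a, b, c, d) for two label sequences.

    target[i] and computed[i] are the cluster labels of item i
    in TC and CC, respectively (an assumed input format).
    """
    if len(target) != len(computed):
        raise ValueError("both labelings must cover the same items")
    a = b = c = d = 0
    for i, j in combinations(range(len(target)), 2):
        same_tc = target[i] == target[j]
        same_cc = computed[i] == computed[j]
        if same_tc and same_cc:
            a += 1          # same cluster in both CC and TC
        elif not same_tc and not same_cc:
            b += 1          # different clusters in both CC and TC
        elif same_cc:
            c += 1          # same in CC, different in TC
        else:
            d += 1          # same in TC, different in CC
    return a, b, c, d


# Hypothetical toy data: six items, two target clusters, three computed clusters.
tc = [0, 0, 0, 1, 1, 1]
cc = [0, 0, 1, 1, 2, 2]
print(pair_counts(tc, cc))  # prints (2, 8, 1, 4)
```

The four counts always sum to the total number of pairs, n(n-1)/2, which is a quick sanity check on any implementation.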

F-measure
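One common way to define an F-measure on top of the pair counts a, c and d is to treat pairs placed together in CC as "predicted positives": precision = a / (a + c), recall = a / (a + d), and F is their harmonic mean. The sketch below assumes this pair-counting form, which may differ from the exact formula used in the lecture; the function name `pairwise_f_measure` is hypothetical.

```python
def pairwise_f_measure(a, c, d):
    """Pair-counting F-measure (assumed form, not confirmed by the slides).

    a: pairs together in both CC and TC
    c: pairs together in CC but apart in TC (false positives)
    d: pairs together in TC but apart in CC (false negatives)
    """
    precision = a / (a + c) if (a + c) else 0.0
    recall = a / (a + d) if (a + d) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: a=2, c=1, d=4 gives precision 2/3 and recall 1/3.
print(pairwise_f_measure(2, 1, 4))
```

Note that b (pairs separated in both clusterings) does not enter the F-measure, so the score is not inflated by the typically large number of "true negative" pairs.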

Rand Index
Measure of clustering agreement: how similar are these two ways of partitioning the data?
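In terms of the pair counts defined earlier, the Rand index is the fraction of pairs on which the two clusterings agree: (a + b) / (a + b + c + d). A minimal sketch follows; the function name `rand_index` and the example counts are hypothetical.

```python
def rand_index(a, b, c, d):
    """Rand index: fraction of item pairs on which CC and TC agree.

    a and b count agreements (together in both, apart in both);
    c and d count disagreements.
    """
    total = a + b + c + d
    if total == 0:
        raise ValueError("need at least one pair of items")
    return (a + b) / total


# Hypothetical counts for six items (15 pairs): 10 agreements out of 15.
print(rand_index(2, 8, 1, 4))
```

The value lies in [0, 1], with 1 meaning the two partitions are identical up to a renaming of the clusters.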

Adjusted Rand Index
Extension of the Rand index that attempts to account for items that may have been clustered by chance
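A widely used chance correction is the Hubert and Arabie form: ARI = (Index - ExpectedIndex) / (MaxIndex - ExpectedIndex), where Index is the number of pairs together in both clusterings. The sketch below assumes that form and label-list inputs; the function name `adjusted_rand_index` is hypothetical, and the slides may use a different but equivalent expression.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(target, computed):
    """Adjusted Rand index (Hubert-Arabie form, assumed here)."""
    total_pairs = comb(len(target), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treat as trivially identical
    # Pairs together in both clusterings (the count "a").
    together_both = sum(comb(v, 2) for v in Counter(zip(target, computed)).values())
    # Pairs together in TC (a + d) and in CC (a + c).
    together_tc = sum(comb(v, 2) for v in Counter(target).values())
    together_cc = sum(comb(v, 2) for v in Counter(computed).values())
    expected = together_tc * together_cc / total_pairs
    max_index = (together_tc + together_cc) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings have one cluster
    return (together_both - expected) / (max_index - expected)


# Hypothetical toy data.
tc = [0, 0, 0, 1, 1, 1]
cc = [0, 0, 1, 1, 2, 2]
print(adjusted_rand_index(tc, cc))
```

Unlike the plain Rand index, this score is close to 0 for random labelings and can be negative when agreement is worse than expected by chance.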

Average Entropy
Measure of purity with respect to the target clustering
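A usual way to compute this is to take, for each computed cluster, the entropy of the target labels it contains, and then average those entropies weighted by cluster size. The sketch below assumes that weighted form and base-2 logarithms; both choices, as well as the function name `average_entropy`, are assumptions rather than details taken from the slides.

```python
from collections import Counter, defaultdict
from math import log2


def average_entropy(target, computed):
    """Size-weighted average, over computed clusters, of target-label entropy."""
    members = defaultdict(list)
    for t, c in zip(target, computed):
        members[c].append(t)  # target labels found in each computed cluster
    n = len(target)
    total = 0.0
    for labels in members.values():
        size = len(labels)
        entropy = -sum((k / size) * log2(k / size) for k in Counter(labels).values())
        total += (size / n) * entropy
    return total


# Hypothetical toy data: only the middle computed cluster mixes two targets.
tc = [0, 0, 0, 1, 1, 1]
cc = [0, 0, 1, 1, 2, 2]
print(average_entropy(tc, cc))
```

Lower is better: a value of 0 means every computed cluster is pure, that is, it contains items from a single target cluster only.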

V-measure
Entropy-based. Combines:
– Homogeneity (H): all computed clusters contain only items which are members of the same target cluster/class
– Completeness (C): all data items of a given target cluster/class are in the same computed cluster
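Homogeneity and completeness are commonly defined through conditional entropies: homogeneity = 1 - H(TC | CC) / H(TC), completeness = 1 - H(CC | TC) / H(CC), and V is their harmonic mean. The sketch below assumes these definitions with equal weighting of the two terms; the function and helper names are hypothetical.

```python
from collections import Counter
from math import log


def _entropy(labels):
    n = len(labels)
    return -sum((v / n) * log(v / n) for v in Counter(labels).values())


def _conditional_entropy(labels, given):
    """H(labels | given), both passed as parallel label sequences."""
    n = len(labels)
    joint = Counter(zip(labels, given))
    given_sizes = Counter(given)
    return -sum((v / n) * log(v / given_sizes[g]) for (_, g), v in joint.items())


def v_measure(target, computed):
    """Return (homogeneity, completeness, V) under the assumed definitions."""
    h_target = _entropy(target)
    h_computed = _entropy(computed)
    homogeneity = (1.0 if h_target == 0
                   else 1.0 - _conditional_entropy(target, computed) / h_target)
    completeness = (1.0 if h_computed == 0
                    else 1.0 - _conditional_entropy(computed, target) / h_computed)
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


# Hypothetical toy data.
tc = [0, 0, 0, 1, 1, 1]
cc = [0, 0, 1, 1, 2, 2]
print(v_measure(tc, cc))
```

Splitting a target cluster across several computed clusters lowers completeness, while mixing target clusters inside one computed cluster lowers homogeneity; V penalizes both.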

Internal Metrics
Assume no target clustering
Attempt to capture some intrinsic properties of the clustering
– Distance-based
    Sum of all squares
    Diameter sum or maximum
    Sum of minimum distance between clusters, etc.
– Probability-based
    E.g., KL divergence
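The distance-based metrics can be sketched as follows, assuming numeric points, Euclidean distance, squared distances measured to each cluster's centroid, and diameter defined as the largest pairwise distance within a cluster. These are common choices rather than definitions confirmed by the slides, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import dist


def group_by_cluster(points, labels):
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    return clusters


def sum_of_squares(points, labels):
    """Total squared Euclidean distance of points to their cluster centroid."""
    total = 0.0
    for members in group_by_cluster(points, labels).values():
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total


def diameters(points, labels):
    """Largest within-cluster pairwise distance, per cluster."""
    return {label: max((dist(p, q) for p, q in combinations(members, 2)), default=0.0)
            for label, members in group_by_cluster(points, labels).items()}


def min_intercluster_distance(points, labels):
    """Smallest distance between two points in different clusters."""
    clusters = list(group_by_cluster(points, labels).values())
    return min((dist(p, q)
                for m1, m2 in combinations(clusters, 2)
                for p in m1 for q in m2),
               default=float("inf"))


# Hypothetical toy data: two well-separated clusters in the plane.
pts = [(0, 0), (0, 2), (10, 0), (10, 4)]
lab = [0, 0, 1, 1]
d = diameters(pts, lab)
print(sum_of_squares(pts, lab), sum(d.values()), max(d.values()),
      min_intercluster_distance(pts, lab))
```

Compact clusters give a small sum of squares and small diameters, while well-separated clusters give a large minimum inter-cluster distance; none of these quantities needs a target clustering.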