Cluster Evaluation: metrics that can be used to evaluate the quality of a set of document clusters.


Precision, Recall & FScore  From Zhao and Karypis, 2002.  These metrics are computed for every (class, cluster) pair.  Terms:  class L_r of size n_r  cluster S_i of size n_i  n_ri = the number of documents in S_i that belong to class L_r

Precision  Loosely equated to accuracy.  Roughly answers the question: “How many of the documents in this cluster belong there?”  P(L_r, S_i) = n_ri / n_i

Recall  Roughly answers the question: “Did all of the documents that belong in this cluster make it in?”  R(L_r, S_i) = n_ri / n_r
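
A minimal sketch of these two quantities in Python; the corpus and class labels here are made up purely for illustration:

```python
from collections import Counter

def precision(cluster_labels, label):
    # fraction of documents in cluster S_i that carry class L_r  (n_ri / n_i)
    return Counter(cluster_labels)[label] / len(cluster_labels)

def recall(cluster_labels, label, all_labels):
    # fraction of all documents of class L_r that landed in S_i  (n_ri / n_r)
    return Counter(cluster_labels)[label] / Counter(all_labels)[label]

# hypothetical corpus: 4 sports docs and 2 politics docs
all_labels = ["sports"] * 4 + ["politics"] * 2
cluster = ["sports", "sports", "politics"]  # class labels of the docs in one cluster
print(precision(cluster, "sports"))           # 2 of the 3 docs belong -> 0.666...
print(recall(cluster, "sports", all_labels))  # 2 of the 4 sports docs made it -> 0.5
```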

FScore  The harmonic mean of Precision and Recall, balancing the two metrics.  Calculated with the equation: F(L_r, S_i) = 2 · P(L_r, S_i) · R(L_r, S_i) / (P(L_r, S_i) + R(L_r, S_i))

FScore - Entire Solution  We first calculate a per-class FScore, pairing each class with its best-matching cluster: F(L_r) = max_i F(L_r, S_i)  We then combine these scores into a weighted average over classes: FScore = Σ_r (n_r / n) · F(L_r), where n is the total number of documents.
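
The whole-solution computation can be sketched as follows, taking labels[i] as the true class of document i and clusters[i] as its assigned cluster (the input encoding is an assumption for illustration):

```python
from collections import Counter

def fscore_solution(labels, clusters):
    # labels[i]: true class of doc i; clusters[i]: cluster assigned to doc i
    n = len(labels)
    class_sizes = Counter(labels)            # n_r
    cluster_sizes = Counter(clusters)        # n_i
    joint = Counter(zip(labels, clusters))   # n_ri
    total = 0.0
    for L, n_r in class_sizes.items():
        best = 0.0  # F(L_r) = max over clusters of F(L_r, S_i)
        for S, n_i in cluster_sizes.items():
            n_ri = joint[(L, S)]
            if n_ri:
                p, r = n_ri / n_i, n_ri / n_r
                best = max(best, 2 * p * r / (p + r))
        total += (n_r / n) * best  # weighted average over classes
    return total
```

A perfect clustering, e.g. fscore_solution([0, 0, 1, 1], [0, 0, 1, 1]), scores 1.0.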

FScore Caveats  The Zhao and Karypis paper focused on hierarchical clustering, so the definitions of Precision, Recall, and FScore might not apply as well to “flat” clustering.  The metrics rely on class labels, so they cannot be applied in situations where there is no labeled data.

Possible Modifications  Calculate a per-cluster (not per-class) FScore: F(S_i) = max_r F(L_r, S_i)  Combine these scores into a weighted average over clusters: FScore = Σ_i (n_i / n) · F(S_i)
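
The modification mirrors the per-class version, swapping the roles of classes and clusters; a sketch under the same input assumptions as before:

```python
from collections import Counter

def fscore_per_cluster(labels, clusters):
    # per-cluster variant: F(S_i) = max over classes, weighted by cluster size n_i
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    total = 0.0
    for S, n_i in cluster_sizes.items():
        best = 0.0
        for L, n_r in class_sizes.items():
            n_ri = joint[(L, S)]
            if n_ri:
                p, r = n_ri / n_i, n_ri / n_r
                best = max(best, 2 * p * r / (p + r))
        total += (n_i / n) * best  # weighted average over clusters
    return total
```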

Rand Index  From Yeung et al., 2001.  A measure of partition agreement.  Answers the question: “How similar are these two ways of partitioning the data?”  To evaluate clusters, we compute the Rand Index between the true class labels and the cluster assignments.

Rand Index  a = # pairs of documents in the same S_i and the same L_r  b = # pairs of documents in the same L_r but not the same S_i  c = # pairs of documents in the same S_i but not the same L_r  d = # pairs of documents in neither the same L_r nor the same S_i  Rand Index = (a + d) / (a + b + c + d)
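
The four pair counts translate directly into code; a sketch that enumerates all document pairs (fine for small n, quadratic in general):

```python
from itertools import combinations

def rand_index(labels, clusters):
    # count document pairs by how the two partitions agree on them
    a = b = c = d = 0
    for i, j in combinations(range(len(labels)), 2):
        same_class = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            a += 1
        elif same_class:
            b += 1
        elif same_cluster:
            c += 1
        else:
            d += 1
    return (a + d) / (a + b + c + d)
```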

Adjusted Rand Index  The Rand index has a problem: its expected value for two random partitions is relatively high, while we’d like it to be close to 0.  The Adjusted Rand index fixes the expected value at 0, giving a wider dynamic range; it is probably the better metric.  See Appendix B of Yeung et al., 2001.
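
One way to compute it directly from pair counts; this is a sketch of the standard chance-corrected formula (sklearn.metrics.adjusted_rand_score computes the same quantity), and it is undefined for degenerate partitions where the denominator is zero:

```python
from collections import Counter

def comb2(x):
    # number of unordered pairs among x items
    return x * (x - 1) // 2

def adjusted_rand(labels, clusters):
    n = len(labels)
    sum_ij = sum(comb2(v) for v in Counter(zip(labels, clusters)).values())
    sum_r = sum(comb2(v) for v in Counter(labels).values())    # pairs within classes
    sum_i = sum(comb2(v) for v in Counter(clusters).values())  # pairs within clusters
    expected = sum_r * sum_i / comb2(n)   # expected index under random partitioning
    max_index = (sum_r + sum_i) / 2
    return (sum_ij - expected) / (max_index - expected)
```

A perfect match scores 1.0, and scores can go below 0 for worse-than-chance agreement.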

Rand Index Caveat  It penalizes good but finer-grained clusterings: imagine a “sports” class that produces two clusters, one for ball sports and one for track sports.  To fix this, we could hard-label each cluster and treat all clusters with the same label as one (clustering the clusters).

Problems  The metrics so far depend on class labels.  They also give undeservedly high scores as k approaches n, because almost every instance ends up alone in its own cluster.

Label Entropy  My idea? (I haven’t seen it anywhere else.)  Calculate an entropy value per cluster over the class labels of its members: H(S_i) = -Σ_r (n_ri / n_i) · log(n_ri / n_i)  Combine the per-cluster entropies into a weighted average: Entropy = Σ_i (n_i / n) · H(S_i); lower is better.
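
A sketch of the two steps, using base-2 logs (the choice of base is arbitrary; it only rescales the score):

```python
import math
from collections import Counter

def cluster_entropy(member_labels):
    # H(S_i) over the class labels of the cluster's members
    n_i = len(member_labels)
    return -sum((c / n_i) * math.log2(c / n_i)
                for c in Counter(member_labels).values())

def label_entropy(labels, clusters):
    # weighted average of per-cluster entropies; 0 means every cluster is pure
    n = len(labels)
    members = {}
    for L, S in zip(labels, clusters):
        members.setdefault(S, []).append(L)
    return sum((len(m) / n) * cluster_entropy(m) for m in members.values())
```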

Log Likelihood of Data  Calculate the log likelihood of the data according to the clusterer’s model.  If the clusterer doesn’t have an explicit model, treat the clusters as classes, train a class-conditional model of the data based on those class labelings, and use that model to calculate the log likelihood.
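
When no explicit model is available, the fallback above can be sketched with a one-dimensional Gaussian fit per cluster; the points, the Gaussian choice, and the variance smoothing constant are all illustrative assumptions, not part of the original method:

```python
import math

def gaussian_logpdf(x, mu, var):
    # log density of a 1-D Gaussian
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def log_likelihood(points, clusters):
    # fit a class-conditional Gaussian per cluster, then score all the data
    n = len(points)
    members = {}
    for x, s in zip(points, clusters):
        members.setdefault(s, []).append(x)
    params = {}
    for s, xs in members.items():
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs) + 1e-6  # smooth zero variance
        params[s] = (mu, var, len(xs) / n)  # mean, variance, cluster prior
    return sum(math.log(params[s][2]) + gaussian_logpdf(x, params[s][0], params[s][1])
               for x, s in zip(points, clusters))
```

Tighter, better-separated clusters yield a higher log likelihood than a scrambled assignment of the same points.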