Clustering Evaluation April 29, 2010

Today Cluster Evaluation – Internal: we don't know anything about the desired labels – External: we have some information about the labels

Internal Evaluation Clusters have been identified. How good is the partitioning of the data set that the clustering constructed? Internal measures have the advantage that they can be directly optimized.

Intracluster variability How far is a point from its cluster centroid? Intuition: every point assigned to a cluster should be closer to the center of that cluster than to the center of any other cluster. K-means optimizes this measure.
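As a sketch, intracluster variability can be computed as the within-cluster sum of squared distances, which is exactly the objective k-means minimizes. The function and variable names below are my own, not from the slides:

```python
import math

def intracluster_variability(points, assignments, centroids):
    """Sum of squared distances from each point to its assigned centroid.

    points and centroids are lists of (x, y) tuples; assignments[i] is the
    index of the centroid that point i belongs to.
    """
    total = 0.0
    for (x, y), k in zip(points, assignments):
        cx, cy = centroids[k]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total

# Two tight clusters far apart: the variability is small.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
assignments = [0, 0, 1, 1]
centroids = [(0.5, 0.0), (10.5, 0.0)]
print(intracluster_variability(points, assignments, centroids))  # 1.0
```

Reassigning a point to the wrong (farther) centroid can only increase this total, which matches the intuition above.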

Model Likelihood Intuition: the model that fits the data best represents the best clustering. Requires a probabilistic model. Can be combined with AIC and BIC measures to penalize the number of parameters. GMM style: L = Σ_n log Σ_k w_k N(x_n; μ_k, σ_k²)
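As a hedged sketch of the "GMM style" likelihood, the following computes the log-likelihood of 1-D data under a Gaussian mixture, plus a BIC score that penalizes parameter count. The helper names and toy data are illustrative assumptions, not the lecture's notation:

```python
import math

def gmm_log_likelihood(data, weights, means, variances):
    """Log-likelihood of 1-D data under a Gaussian mixture model."""
    ll = 0.0
    for x in data:
        p = 0.0
        for w, mu, var in zip(weights, means, variances):
            p += w * math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        ll += math.log(p)
    return ll

def bic(log_likelihood, n_params, n_points):
    """BIC = k ln(n) - 2 ln(L); lower is better."""
    return n_params * math.log(n_points) - 2 * log_likelihood

# On clearly bimodal data, the two-component model wins under BIC
# even though it spends more parameters.
data = [-1.1, -0.9, 0.9, 1.1]
ll_one = gmm_log_likelihood(data, [1.0], [0.0], [1.01])
ll_two = gmm_log_likelihood(data, [0.5, 0.5], [-1.0, 1.0], [0.01, 0.01])
print(bic(ll_two, 5, 4) < bic(ll_one, 2, 4))  # True
```

This is the sense in which BIC "limits the number of parameters": a model only earns its extra components if the likelihood gain outweighs the k ln(n) penalty.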

Point similarity vs. Cluster similarity Intuition: two points that are similar should be in the same cluster. Spectral clustering optimizes an objective of this kind.

Internal Cluster Measures Which cluster measure is best? – Centroid distance – Model likelihood – Point distance It depends on the data and the task.

External Cluster Evaluation If you have a little labeled data, unsupervised (clustering) techniques can be evaluated using this knowledge. Assume that, for a subset of the data points, you have class labels. How can we evaluate the success of the clustering?

External Cluster Evaluation Can’t we use Accuracy? – or “Why is this hard?” The number of clusters may not equal the number of classes. It may be difficult to assign a class to a cluster.

Some principles Homogeneity – Each cluster should include members of as few classes as possible. Completeness – Each class should be represented in as few clusters as possible.

Some approaches Purity F-measure – Cluster definitions of precision and recall, where n_ir is the number of points of class i in cluster r: precision(i, r) = n_ir / n_r and recall(i, r) = n_ir / n_i – Combined using the harmonic mean, as in the traditional F-measure
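A minimal sketch of purity and the cluster F-measure, computed from a cluster-by-class count table. The table layout (counts[r][i] = points of class i in cluster r) and function names are my own assumptions:

```python
def purity(counts):
    """Fraction of points in the majority class of their cluster.

    counts[r][i] = number of points of class i in cluster r.
    """
    total = sum(sum(row) for row in counts)
    return sum(max(row) for row in counts) / total

def f_measure(counts):
    """Cluster F-measure: for each class, take the best-matching cluster's
    harmonic mean of precision and recall, weighted by class size."""
    total = sum(sum(row) for row in counts)
    n_classes = len(counts[0])
    class_sizes = [sum(row[i] for row in counts) for i in range(n_classes)]
    score = 0.0
    for i in range(n_classes):
        best = 0.0
        for row in counts:
            if row[i] == 0:
                continue
            p = row[i] / sum(row)          # precision(i, r) = n_ir / n_r
            r = row[i] / class_sizes[i]    # recall(i, r)    = n_ir / n_i
            best = max(best, 2 * p * r / (p + r))
        score += class_sizes[i] / total * best
    return score

# Two clusters, two classes, with one point of each class misplaced.
counts = [[5, 1], [1, 5]]
print(purity(counts))     # 10/12
print(f_measure(counts))  # 5/6
```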

The problem of matching F-measure: 0.6

The problem of matching F-measure: 0.5

V-Measure A conditional-entropy-based measure that explicitly calculates homogeneity and completeness.
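A sketch of V-measure from the same cluster-by-class counts, following the homogeneity/completeness definitions of Rosenberg and Hirschberg (2007); the implementation details below are my assumptions, not taken from the slides:

```python
import math

def _entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def v_measure(counts, beta=1.0):
    """V-measure from counts[r][i] = points of class i in cluster r.

    homogeneity  h = 1 - H(C|K) / H(C)
    completeness c = 1 - H(K|C) / H(K)
    V = (1 + beta) * h * c / (beta * h + c)
    """
    total = sum(sum(row) for row in counts)
    cluster_sizes = [sum(row) for row in counts]
    class_sizes = [sum(row[i] for row in counts) for i in range(len(counts[0]))]
    h_c = _entropy([n / total for n in class_sizes])
    h_k = _entropy([n / total for n in cluster_sizes])
    h_c_given_k = -sum(n / total * math.log(n / nr)
                       for row, nr in zip(counts, cluster_sizes)
                       for n in row if n > 0)
    h_k_given_c = -sum(row[i] / total * math.log(row[i] / class_sizes[i])
                       for row in counts
                       for i in range(len(row)) if row[i] > 0)
    h = 1.0 if h_c == 0 else 1 - h_c_given_k / h_c
    c = 1.0 if h_k == 0 else 1 - h_k_given_c / h_k
    if h + c == 0:
        return 0.0
    return (1 + beta) * h * c / (beta * h + c)

print(v_measure([[5, 0], [0, 5]]))  # 1.0: perfectly homogeneous and complete
```

Unlike purity and F-measure, this does not require matching clusters to classes, which sidesteps the matching problem on the previous slides.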

Contingency Matrix We want to know how much the introduction of clusters improves our information about the class distribution.

Entropy Entropy measures the amount of "information" in a distribution. Wide distributions carry a lot of information; narrow distributions carry very little. Based on Shannon's limit on the number of bits required to transmit samples from a distribution. Calculation of entropy: H(X) = −Σ_x p(x) log p(x)

Example Calculation of Entropy
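As a worked example of the entropy calculation (base-2 logarithms, so the results are in bits):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum p * log2(p), skipping zero terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: the widest two-outcome distribution
print(entropy([1.0]))        # 0.0 bits: a narrow distribution, no uncertainty
print(entropy([0.25] * 4))   # 2.0 bits: uniform over four outcomes
```

This matches the intuition on the previous slide: the flatter (wider) the distribution, the more bits are needed to transmit it.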

Pair-Based Measures Statistics over every pair of items. – SS – same cluster, same class – SD – same cluster, different class – DS – different cluster, same class – DD – different cluster, different class These counts can be arranged in a contingency matrix, similar to the one used when computing accuracy.

Pair-Based Measures Rand: R = (SS + DD) / (SS + SD + DS + DD) Jaccard: J = SS / (SS + SD + DS) Fowlkes-Mallows: FM = SS / √((SS + SD)(SS + DS))
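A sketch computing the SS/SD/DS/DD counts over all item pairs and the three pair-based indices above (function names are illustrative):

```python
import math
from itertools import combinations

def pair_counts(clusters, classes):
    """Count SS, SD, DS, DD over all pairs of items.

    clusters[i] and classes[i] are the cluster and class labels of item i.
    """
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            ss += 1
        elif same_cluster:
            sd += 1
        elif same_class:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def rand_index(ss, sd, ds, dd):
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard(ss, sd, ds, dd):
    return ss / (ss + sd + ds)

def fowlkes_mallows(ss, sd, ds, dd):
    return ss / math.sqrt((ss + sd) * (ss + ds))

# Three items of one class grouped with one item of another.
ss, sd, ds, dd = pair_counts([0, 0, 0, 1], [0, 0, 1, 1])
print(rand_index(ss, sd, ds, dd))  # 0.5
```

Note that the Rand index counts DD pairs as agreement, so it tends to be high on data with many small clusters; Jaccard and Fowlkes-Mallows ignore DD.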

B-cubed Similar to the pair-based counting systems, B-cubed calculates an element-by-element precision and recall.
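A minimal sketch of B-cubed precision and recall, averaging the per-element scores; the names below are my own:

```python
def b_cubed(clusters, classes):
    """B-cubed precision and recall, averaged over all elements.

    For each element e:
      precision(e) = fraction of e's cluster sharing e's class
      recall(e)    = fraction of e's class that lands in e's cluster
    """
    n = len(clusters)
    precision = recall = 0.0
    for i in range(n):
        same_both = sum(1 for j in range(n)
                        if clusters[j] == clusters[i] and classes[j] == classes[i])
        cluster_size = sum(1 for j in range(n) if clusters[j] == clusters[i])
        class_size = sum(1 for j in range(n) if classes[j] == classes[i])
        precision += same_both / cluster_size
        recall += same_both / class_size
    return precision / n, recall / n

# Everything in one cluster: recall is perfect, precision suffers.
print(b_cubed([0, 0, 0, 0], [0, 0, 1, 1]))  # (0.5, 1.0)
```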

External Evaluation Measures There are many choices. Some should almost certainly never be used – Purity – F-measure Others can be used based on task or other preferences – V-Measure – VI – B-Cubed

Presentations Project Presentations – The schedule allows 15 minutes per presentation. – This includes the transition to the next speaker and questions. – Prepare for 10 minutes. Course Evaluations

Thank you