“A Comparison of Document Clustering Techniques” Michael Steinbach, George Karypis and Vipin Kumar (Technical Report, CSE, UMN, 2000) Mahashweta Das


“A Comparison of Document Clustering Techniques” Michael Steinbach, George Karypis and Vipin Kumar (Technical Report, CSE, UMN, 2000) Mahashweta Das CSE 6339 (Dr. Chengkai Li) Feb 9, 2010

Document Clustering
- Clustering: the act of grouping similar objects into sets
- Document clustering: the act of collecting similar documents into bins, where similarity is some function on documents
Uses of Document Clustering
- Browsing a large collection of documents (document organization, automatic topic extraction, fast information retrieval)
- Organizing the results returned by a search engine (efficient web search, automatic generation of a taxonomy of web documents, effective document classifier)
- Improving precision and recall in information retrieval systems

Types of Clustering
Agglomerative Hierarchical Clustering
- Begin with as many clusters as objects; the most similar clusters are successively merged until only one cluster remains
- Superior cluster quality, but O(n^2) complexity
Partitional Clustering
- Begin with k initial centroids and assign all n objects to the closest centroid; recompute the centroid of each cluster and repeat until the centroids no longer change
- Efficient, with O(knt) complexity (t is the number of iterations), but often poor cluster quality

Agglomerative Hierarchical Clustering
- Euclidean distance is the similarity/distance metric

Comparison: Agglomerative Hierarchical Clustering
- Intra-Cluster Similarity Technique (IST): looks at the similarity of all documents in a cluster to their cluster centroid, and merges the pair of clusters whose merge leads to the smallest decrease in similarity
- Centroid Similarity Technique (CST): looks at the cosine similarity between the centroids of the two clusters
- UPGMA: looks at the average pairwise similarity between the documents of the two clusters; performs best
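The merge loop and two of the cluster-similarity rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: documents are assumed to be plain lists of term weights, and the function names (`upgma_similarity`, `cst_similarity`, `agglomerate`) are hypothetical. IST is omitted for brevity.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(cluster):
    """Component-wise mean of the document vectors in a cluster."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def upgma_similarity(c1, c2):
    """UPGMA: average pairwise cosine similarity across the two clusters."""
    return sum(cosine(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def cst_similarity(c1, c2):
    """CST: cosine similarity between the two cluster centroids."""
    return cosine(centroid(c1), centroid(c2))

def agglomerate(docs, k, cluster_sim=upgma_similarity):
    """Start with one cluster per document; merge the most similar pair until k remain."""
    clusters = [[d] for d in docs]
    while len(clusters) > k:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

docs = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
print(agglomerate(docs, 2))
```

Every merge step scans all cluster pairs, which is where the quadratic (or worse) cost of agglomerative methods comes from.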

Partitional Clustering (K-Means)
- Euclidean distance is the similarity/distance metric
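A minimal sketch of basic K-means with Euclidean distance follows. It assumes points are lists of numbers and that the caller supplies the initial centroids (a simplification that keeps the example deterministic; random seeding is the usual choice). The name `kmeans` and its parameters are illustrative only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, centroids, max_iter=100):
    """Assign each point to its closest centroid, recompute centroids, repeat until stable."""
    centroids = [list(c) for c in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old centroid (an assumption of this sketch).
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
final_centroids, final_clusters = kmeans(points, [[0, 0], [10, 10]])
print(final_centroids)
```

Each iteration costs one distance computation per point per centroid, which matches the O(knt) complexity quoted for partitional clustering.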

Vector Space Model and Document Clustering
- Cosine similarity between documents d_1 and d_2
- Cluster centroid vector for a set S of documents in a cluster
- Cosine similarity between a document and a centroid vector
- Cosine similarity between centroid vectors c_1 and c_2
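The four quantities named on this slide have standard vector-space definitions, which can be written as below. The notation is an assumption rather than the slide's own: documents are term-weight vectors, \|\cdot\| is the Euclidean norm, and |S| is the number of documents in the cluster.

```latex
\begin{align*}
\cos(d_1, d_2) &= \frac{d_1 \cdot d_2}{\|d_1\|\,\|d_2\|} \\
c &= \frac{1}{|S|} \sum_{d \in S} d \\
\cos(d, c) &= \frac{d \cdot c}{\|d\|\,\|c\|} \\
\cos(c_1, c_2) &= \frac{c_1 \cdot c_2}{\|c_1\|\,\|c_2\|}
\end{align*}
```

When every document vector is scaled to unit length, the first formula reduces to the dot product d_1 \cdot d_2.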

Cluster Quality Evaluation Measures
Internal Quality Measure
- Cohesiveness of a cluster as a measure of cluster quality
- OVERALL SIMILARITY: based on the pairwise similarity of the documents in a cluster, computed for a set S of documents in a cluster (the higher, the better)
External Quality Measure
- Compares the groups produced by clustering techniques to known classes
- ENTROPY
- F-MEASURE
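Overall similarity can be sketched as the average cosine similarity over all document pairs in a cluster. The version below assumes the average runs over all ordered pairs, including each document paired with itself; other conventions exclude self-pairs. The function name `overall_similarity` is illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def overall_similarity(cluster):
    """Average pairwise cosine similarity within a cluster (self-pairs included)."""
    n = len(cluster)
    return sum(cosine(a, b) for a in cluster for b in cluster) / (n * n)

tight = [[1, 0], [2, 0]]   # same direction: maximally cohesive
loose = [[1, 0], [0, 1]]   # orthogonal: only the self-pairs contribute
print(overall_similarity(tight), overall_similarity(loose))
```

No class labels are needed, which is what makes this an internal measure.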

ENTROPY: External Cluster Quality Measure
- Calculate the class distribution of the data
- p_ij: the “probability” that a member of cluster j belongs to class i
- Entropy of cluster j
- Total entropy
- The lower, the better
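A minimal sketch of the entropy measure, assuming each cluster is given as a list of the known class labels of its members. The entropy of cluster j is taken as -sum_i p_ij log p_ij, and the total entropy as the cluster entropies weighted by cluster size. The natural logarithm and the function names are assumptions of this sketch.

```python
import math
from collections import Counter

def cluster_entropy(labels):
    """Entropy of one cluster, from the class labels of its members."""
    n = len(labels)
    return -sum((count / n) * math.log(count / n)
                for count in Counter(labels).values())

def total_entropy(clusters):
    """Size-weighted sum of the entropies of all clusters."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)

pure = [["sports", "sports"], ["politics", "politics"]]
mixed = [["sports", "politics"], ["sports", "politics"]]
print(total_entropy(pure), total_entropy(mixed))
```

A clustering in which every cluster holds a single class scores zero; an even mix of two classes in every cluster scores log 2.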

F-MEASURE: External Cluster Quality Measure
- Combines the precision and recall ideas from information retrieval
- Precision and recall are computed for cluster j and class i, where n_ij is the number of members of cluster j in class i, n_j is the number of members of cluster j, and n_i is the number of members of class i
- F-measure for the entire clustering
- The higher, the better
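The F-measure can be sketched as follows, assuming the usual definitions: recall(i, j) = n_ij / n_i, precision(i, j) = n_ij / n_j, F(i, j) = 2 * recall * precision / (recall + precision), and an overall score that takes, for each class, its best F value over all clusters, weighted by class size. Clusters are again lists of known class labels; the function name is illustrative.

```python
from collections import Counter

def overall_f_measure(clusters):
    """Class-size-weighted average of the best F(i, j) for each class i."""
    n = sum(len(c) for c in clusters)
    class_sizes = Counter(label for c in clusters for label in c)
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for c in clusters:
            n_ij = c.count(cls)          # members of class i in cluster j
            if n_ij == 0:
                continue
            recall = n_ij / n_i
            precision = n_ij / len(c)    # len(c) is n_j
            best = max(best, 2 * recall * precision / (recall + precision))
        total += n_i / n * best
    return total

perfect = [["a", "a"], ["b", "b"]]
mixed = [["a", "b"], ["a", "b"]]
print(overall_f_measure(perfect), overall_f_measure(mixed))
```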

Bisecting K-Means Clustering
- The algorithm starts with a single cluster of all documents
- Cluster chosen for splitting: the largest cluster, the one with the least overall similarity, or a criterion based on both
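A rough sketch of bisecting K-means under several assumptions: the largest cluster is always the one split, each bisection is tried a few times with random seeds and the best split is kept, and split quality is scored by the summed cosine similarity of documents to their own centroid (a stand-in for overall similarity). All names and parameters (`two_means`, `split_quality`, `trials`, `seed`) are hypothetical.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def two_means(docs, rng, max_iter=20):
    """Bisect docs with basic K-means (k = 2); both halves are always non-empty."""
    centroids = rng.sample(docs, 2)
    halves = (docs[:1], docs[1:])            # fallback split
    for _ in range(max_iter):
        split = ([], [])
        for d in docs:
            side = 0 if cosine(d, centroids[0]) >= cosine(d, centroids[1]) else 1
            split[side].append(d)
        if not split[0] or not split[1]:
            break                            # keep the last valid split
        halves = split
        new_centroids = [centroid(h) for h in halves]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return halves

def split_quality(halves):
    """Summed similarity of each document to its own half's centroid."""
    return sum(cosine(d, centroid(h)) for h in halves for d in h)

def bisecting_kmeans(docs, k, trials=5, seed=0):
    """Start with one cluster of all documents; bisect until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(docs)]
    while len(clusters) < k:
        target = max(clusters, key=len)      # assumed rule: split the largest
        if len(target) < 2:
            break
        clusters.remove(target)
        best = max((two_means(target, rng) for _ in range(trials)),
                   key=split_quality)
        clusters.extend(best)
    return clusters

docs = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
print(bisecting_kmeans(docs, 2))
```

Recording which cluster each split came from would give the divisive hierarchy; this sketch returns only the final flat clusters.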

Bisecting K-Means Example

Bisecting K-Means Example (continued)
[Figure: document cluster hierarchy produced by bisecting K-means clustering]

Observations
- Bisecting K-means is actually divisive hierarchical clustering
- Bisecting K-means has a time complexity linear in the number of documents
- Multiple runs of bisecting K-means do not improve the results
- Bisecting K-means (with or without refinement) is quite consistently better than regular K-means and UPGMA (with or without refinement), on both overall similarity and entropy
- Bisecting K-means produces better document hierarchies
- Refinement: the bisecting K-means and UPGMA algorithms are followed by the basic K-means clustering algorithm, which uses the centroids of the clusters produced by these techniques as its initial centroids

Agglomerative Hierarchical Clustering vs. K-Means/Bisecting K-Means
- Documents share “core” vocabularies
- Two documents can often be nearest neighbors without belonging to the same class, so agglomerative algorithms make mistakes
- “Global properties” help overcome local minima
- Global property: computing the cosine similarity of a document to a cluster centroid is the same as computing the average similarity of the document to all the cluster’s documents
- K-means is better suited to document clustering
- However, UPGMA outperforms a single run of K-means
- The incremental-centroid-update version of K-means has been used
- Hybrid hierarchical K-means performs better than hierarchical clustering
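The global property can be checked numerically. The sketch below assumes all document vectors are scaled to unit length, in which case the dot product of a document with the (unnormalized) cluster centroid equals its average cosine similarity to the cluster's documents. The vectors are made-up illustrative data.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Illustrative data: three cluster documents and one query document.
cluster = [normalize(v) for v in ([2, 1, 0], [1, 1, 1], [0, 3, 1])]
doc = normalize([1, 2, 0])

similarity_to_centroid = dot(doc, centroid(cluster))
average_similarity = sum(dot(doc, x) for x in cluster) / len(cluster)
print(similarity_to_centroid, average_similarity)
```

The two printed values agree up to floating-point rounding, since the dot product is linear in its second argument. This is why a single comparison against a centroid can stand in for comparisons against every document in the cluster.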

Bisecting K-Means vs. K-Means
- Bisecting K-means tends to produce clusters of relatively uniform size
- Regular K-means tends to produce clusters of widely different sizes, which affects the overall cluster quality measure
- Bisecting K-means beats regular K-means on the entropy measure
- Is this explanation/intuition sufficient?
- What is the scope of the algorithm outside document clustering?

Thank You !! ??
