Text Clustering Hongning Wang CS@UVa

Today’s lecture: clustering of text documents
– Problem overview
– Applications
– Distance metrics
– Two basic categories of clustering algorithms
– Evaluation metrics
CS@UVa CS 6501: Text Mining

Clustering vs. classification: assigning documents to their corresponding categories. How do we label them?

Clustering problem in general: discover the “natural structure” of the data.
– What is the criterion?
– How do we identify the clusters?
– How many clusters?

Clustering problem in general. Clustering: the process of grouping a set of objects into clusters of similar objects.
Basic criteria:
– high intra-class similarity
– low inter-class similarity
There is no (or little) supervision signal about the underlying clustering structure, so we need a similarity/distance measure as guidance to form clusters.

What is the “natural grouping”? Clustering is very subjective, so the distance metric is important: the same objects could be grouped by gender, by source of ability, or by costume.

Clustering in text mining: text clustering is a sub-area of data mining research, built on NLP/ML techniques, that serves IR applications: mining (discover knowledge), access (filter information), and organization (add structure/annotations).

Applications of text clustering: organize document collections by automatically identifying hierarchical/topical relations among documents.

Applications of text clustering: grouping search results. Organizing retrieved documents by topic facilitates user browsing (e.g., http://search.carrot2.org/stable/search).

Applications of text clustering: topic modeling, i.e., grouping words into topics (to be discussed separately later).

Distance metric: basic properties
– Positive separation: D(x,y) > 0 for all x ≠ y, and D(x,y) = 0 iff x = y
– Symmetry: D(x,y) = D(y,x)
– Triangle inequality: D(x,y) ≤ D(x,z) + D(z,y)

Typical distance metrics
– Minkowski metric: d(x,y) = (∑_{i=1}^{V} |x_i − y_i|^p)^{1/p}; when p = 2, it is the Euclidean distance
– Cosine metric: d(x,y) = 1 − cosine(x,y); when ‖x‖₂ = ‖y‖₂ = 1, we have 1 − cosine(x,y) = ‖x − y‖²/2
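As a quick illustration (added here, not part of the original slides), a minimal Python sketch of both metrics, including a check of the unit-norm identity above:

```python
import math

def minkowski(x, y, p=2):
    """Minkowski distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def cosine_distance(x, y):
    """Cosine distance: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

x, y = [1.0, 0.0], [0.0, 1.0]  # two unit-norm vectors
# For unit vectors, 1 - cosine(x, y) equals ||x - y||^2 / 2
print(cosine_distance(x, y), minkowski(x, y, 2) ** 2 / 2)
```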

Typical distance metric: edit distance
– Count the minimum number of operations required to transform one string into the other
– Possible operations: insertion, deletion, and replacement
– Can be efficiently solved by dynamic programming
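The dynamic-programming solution can be sketched as follows (an added Python illustration, not from the slides):

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions, and replacements
    needed to turn string s into string t (dynamic programming)."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            replace_cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,                 # deletion
                           dp[i][j - 1] + 1,                 # insertion
                           dp[i - 1][j - 1] + replace_cost)  # replacement
    return dp[m][n]

print(edit_distance("kitten", "sitting"))
```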

Typical distance metric: edit distance, extended to a distance between sentences
– Use word similarity as the cost of replacement: “terrible” -> “bad” has low cost, while “terrible” -> “terrific” has high cost
– Preserve word order in the distance computation
– Word similarity can come from a lexicon or from distributional semantics

Clustering algorithms: partitional clustering algorithms
– Partition the instances into different groups
– Flat structure
– Need to specify the number of clusters in advance

Clustering algorithms: typical partitional clustering algorithms. k-means clustering: partition the data by the closest mean.

Clustering algorithms: typical partitional clustering algorithms
– k-means clustering: partition the data by the closest mean
– Gaussian Mixture Model: also considers the variance within each cluster
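A bare-bones k-means sketch in Python (an added illustration under simplifying assumptions: Euclidean distance, initialization from k data points, fixed iteration cap):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: assign each point to its closest mean, then move
    each mean to the average of its points; repeat until the means
    stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize with k data points
    for _ in range(iters):
        # assignment step: each point goes to its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # update step: recompute each center as its cluster's mean
        new_centers = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else ctr
                       for cl, ctr in zip(clusters, centers)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(points, 2)
```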

Clustering algorithms: hierarchical clustering algorithms
– Create a hierarchical decomposition of objects
– Rich internal structure
– No need to specify the number of clusters
– Can be used to organize objects

Clustering algorithms: typical hierarchical clustering algorithms. Bottom-up agglomerative clustering: start with individual objects as separate clusters, then repeatedly merge the closest pair of clusters. A typical application is gene sequence analysis.
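The bottom-up procedure can be sketched as follows (illustrative Python, not from the slides; it assumes single-link distance and uses the naive cubic-time pairwise search):

```python
def single_link_agglomerative(points, k):
    """Bottom-up agglomerative clustering: start with every point as its
    own cluster, repeatedly merge the closest pair of clusters
    (single-link distance) until only k clusters remain."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-link distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist2(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]
    return clusters

demo = single_link_agglomerative(
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)], 2)
```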

Clustering algorithms: typical hierarchical clustering algorithms. Top-down divisive clustering: start with all data as one cluster, then repeatedly split one of the remaining clusters into two.

Desirable properties of clustering algorithms
– Scalability, both in time and space
– Ability to deal with various types of data
– No (or few) assumptions about the input data
– Minimal requirements for domain knowledge
– Interpretability and usability

Cluster validation: criteria to determine whether the clusters are meaningful
– Internal validation: stability and coherence
– External validation: match with known categories

Internal validation: coherence (inter-cluster vs. intra-cluster similarity).
Davies–Bouldin index: DB = (1/k) ∑_{i=1}^{k} max_{j≠i} (σ_i + σ_j) / d(c_i, c_j)
where k is the total number of clusters, σ_i is the average distance of all elements in cluster i to its centroid, and d(c_i, c_j) is the distance between centroids c_i and c_j.
It evaluates every pair of clusters; we prefer a smaller DB index.
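A direct Python translation of the definition (an added sketch; it assumes Euclidean distance and clusters given as lists of points):

```python
import math

def davies_bouldin(clusters):
    """Davies-Bouldin index over a list of clusters (lists of points):
    DB = (1/k) * sum_i max_{j != i} (sigma_i + sigma_j) / d(c_i, c_j).
    Smaller values indicate tighter, better-separated clusters."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    centroids = [tuple(sum(vals) / len(c) for vals in zip(*c)) for c in clusters]
    # sigma_i: average distance of cluster i's members to its centroid
    sigmas = [sum(dist(p, ctr) for p in c) / len(c)
              for c, ctr in zip(clusters, centroids)]
    k = len(clusters)
    return sum(max((sigmas[i] + sigmas[j]) / dist(centroids[i], centroids[j])
                   for j in range(k) if j != i)
               for i in range(k)) / k

tight = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
print(davies_bouldin(tight))
```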

Internal validation: coherence (inter-cluster vs. intra-cluster similarity).
Dunn index: D = min_{1≤i<j≤k} d(c_i, c_j) / max_{1≤i≤k} σ_i
This is a worst-situation analysis; we prefer a larger D index.
Limitations of such internal metrics:
– No indication of the actual application’s performance
– Biased towards a specific type of clustering algorithm, if that algorithm is designed to optimize a similar metric
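The Dunn index in the same style (an added illustration matching the slide's definition: centroid separation in the numerator, the largest average intra-cluster distance in the denominator, Euclidean distance assumed):

```python
import math

def dunn_index(clusters):
    """Dunn index: min_{i<j} d(c_i, c_j) / max_i sigma_i, i.e. the smallest
    centroid separation over the largest intra-cluster spread.
    Larger is better."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    centroids = [tuple(sum(vals) / len(c) for vals in zip(*c)) for c in clusters]
    sigmas = [sum(dist(p, ctr) for p in c) / len(c)
              for c, ctr in zip(clusters, centroids)]
    k = len(centroids)
    min_separation = min(dist(centroids[i], centroids[j])
                         for i in range(k) for j in range(i + 1, k))
    return min_separation / max(sigmas)

tight = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
print(dunn_index(tight))
```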

External validation: given class labels Ω for each instance (required, and might need extra cost to obtain).
Purity: the fraction of correctly clustered documents in each cluster,
purity(Ω, C) = (1/N) ∑_{i=1}^{k} max_j |c_i ∩ w_j|
where c_i is the set of documents in cluster i and w_j is the set of documents in class j.
In the running example, purity(Ω, C) = (5 + 4 + 3)/17.
Purity is not a good metric if we assign each document to its own cluster, since that trivially maximizes it.
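Purity can be computed directly from cluster assignments and true labels; here is an added Python sketch (the dictionary-based representation is an assumption, not from the slides):

```python
from collections import Counter

def purity(cluster_of, class_of):
    """Purity: sum over clusters of the count of the cluster's most
    frequent true class, divided by the total number of documents.
    Both arguments map a document id to a cluster id / class label."""
    members = {}
    for doc, cl in cluster_of.items():
        members.setdefault(cl, []).append(class_of[doc])
    n = len(cluster_of)
    return sum(Counter(labels).most_common(1)[0][1]
               for labels in members.values()) / n

clusters = {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B'}
labels = {0: 'x', 1: 'x', 2: 'y', 3: 'y', 4: 'y'}
print(purity(clusters, labels))
```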

External validation: given class labels Ω for each instance.
Normalized mutual information (NMI): NMI(Ω, C) = I(Ω, C) / [(H(Ω) + H(C)) / 2]
where I(Ω, C) = ∑_i ∑_j P(w_i ∩ c_j) log [P(w_i ∩ c_j) / (P(w_i) P(c_j))], H(Ω) = −∑_i P(w_i) log P(w_i), and H(C) = −∑_j P(c_j) log P(c_j).
NMI indicates the increase in knowledge about the classes when we know the clustering results; the normalization by entropy penalizes producing too many clusters.
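A small Python sketch of NMI from counts (added illustration; it assumes the clustering and the gold classes are given as parallel lists):

```python
import math
from collections import Counter

def nmi(cluster_ids, class_labels):
    """Normalized mutual information: I(Omega, C) / ((H(Omega) + H(C)) / 2).
    cluster_ids[i] and class_labels[i] both refer to document i."""
    n = len(cluster_ids)
    p_cluster = Counter(cluster_ids)
    p_class = Counter(class_labels)
    joint = Counter(zip(class_labels, cluster_ids))

    mutual_info = sum(
        (nij / n) * math.log((nij / n) / ((p_class[w] / n) * (p_cluster[c] / n)))
        for (w, c), nij in joint.items())
    h_class = -sum((m / n) * math.log(m / n) for m in p_class.values())
    h_cluster = -sum((m / n) * math.log(m / n) for m in p_cluster.values())
    return mutual_info / ((h_class + h_cluster) / 2)

print(nmi([0, 0, 1, 1], ['x', 'x', 'y', 'y']))
```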

External validation: given class labels Ω for each instance.
Rand index. Idea: we want to assign two documents to the same cluster if and only if they are from the same class. Over every pair of documents in the collection:

                           w_i = w_j   w_i ≠ w_j
c_i = c_j (same cluster)       TP          FP
c_i ≠ c_j (diff. cluster)      FN          TN

RI = (TP + TN) / (TP + FP + FN + TN)
Essentially, it is classification accuracy over document pairs.

External validation: Rand index on the running example.
TP = C(5,2) + C(4,2) + C(3,2) + C(2,2) = 10 + 6 + 3 + 1 = 20
TP + FP = C(6,2) + C(6,2) + C(5,2) = 15 + 15 + 10 = 40, so FP = 20

                           w_i = w_j   w_i ≠ w_j
c_i = c_j                      20          20
c_i ≠ c_j                      24          72
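Counting the four pairwise cases directly gives a simple implementation (added Python illustration, quadratic in the number of documents):

```python
from itertools import combinations

def rand_index(cluster_ids, class_labels):
    """Rand index: over all document pairs, the fraction for which the
    clustering and the true classes agree on same vs. different."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(cluster_ids)), 2):
        same_cluster = cluster_ids[i] == cluster_ids[j]
        same_class = class_labels[i] == class_labels[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return (tp + tn) / (tp + fp + fn + tn)

print(rand_index([0, 0, 1, 1], ['x', 'x', 'y', 'y']))
```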

External validation: precision/recall/F-measure. Based on the same pairwise contingency table (TP, FP, FN, TN), we can also define the precision (TP / (TP + FP)), recall (TP / (TP + FN)), and F-measure of clustering quality.

What you should know
– The unsupervised nature of the clustering problem
– The distance metric is essential in determining the clustering results
– Two basic categories of clustering algorithms: partitional clustering and hierarchical clustering
– Clustering evaluation: internal vs. external validation

Today’s reading: Introduction to Information Retrieval, Chapter 16: Flat Clustering (16.2 Problem statement; 16.3 Evaluation of clustering)