1 Text Clustering PengBo Nov 1, 2010

2 Today's Topic
Document clustering: motivations
Clustering algorithms: partitional and hierarchical
Evaluation

3 What's Clustering?
Computer cluster
Data cluster: clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure.
etrex.blogspot.com/2008/05/k-mean-clustering.html

4 What is clustering?
Clustering: the process of grouping a set of objects into classes of similar objects
The commonest form of unsupervised learning
Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
A common and important task that finds many applications in IR and other places
Why cluster documents?
Whole corpus analysis/navigation: better user interface
For improving recall in search applications: better search results
For better navigation of search results: effective "user recall" will be higher
For speeding up vector space retrieval: faster search

5 Clustering: Internal Criterion
How many clusters?
High intra-cluster similarity
Low inter-cluster similarity

6 Issues for clustering
Representation for clustering
Document representation: vector space or language model?
Similarity/distance: cosine similarity or KL divergence?
How many clusters?
Fixed a priori? Completely data driven?
Avoid "trivial" clusters - too large or small
Ideal: semantic similarity. Practical: statistical similarity
We will use cosine similarity, with docs as vectors; a minimal sketch follows below.
For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
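Since cosine similarity between document vectors is used throughout these slides, here is a minimal sketch, assuming plain Python dicts as sparse term-weight vectors (the function name and data layout are illustrative, not from any particular library):

```python
import math

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    # Dot product over the terms the two documents share.
    dot = sum(w * doc_b.get(t, 0.0) for t, w in doc_a.items())
    norm_a = math.sqrt(sum(w * w for w in doc_a.values()))
    norm_b = math.sqrt(sum(w * w for w in doc_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity({"jaguar": 2.0, "car": 1.0}, {"jaguar": 1.0, "animal": 3.0}))
```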

7 Clustering Algorithms
Hard vs. soft
Hard clustering algorithms compute a hard assignment: each document is a member of exactly one cluster.
Soft clustering algorithms compute a soft assignment: a document's assignment is a distribution over all clusters.

8 Clustering Algorithms
Flat algorithms
Create a cluster set without explicit structure
Usually start with a random (partial) partitioning, then refine it iteratively
K-means clustering
Model-based clustering
Hierarchical algorithms
Bottom-up, agglomerative
Top-down, divisive

10 Evaluation

11 Think about it…
Evaluation by high internal criterion scores?
Objective function for high intra-cluster similarity and low inter-cluster similarity
Application
User judgment
Internal judgment

12 Example
[Figure: documents from three classes partitioned into Cluster I, Cluster II, and Cluster III]

13 External criteria for clustering quality
What is the test set? What is the ground truth?
Quality is measured by the ability to discover some or all of the hidden patterns or latent classes in gold-standard data.
Assesses a clustering with respect to ground truth.
Assume documents belong to a set of gold-standard classes, while our clustering algorithm produces K clusters ω1, ω2, …, ωK with n1, …, nK members.
A simple measure: purity, defined as the ratio between the number of documents of the dominant class ci in cluster ωk and the size of cluster ωk.
Here Ω = {ω1, ω2, …, ωK} is the set of clusters and C = {c1, c2, …, cJ} is the set of classes.

14 Purity example
[Figure: the same three clusters as on slide 12, annotated with per-class counts]
Cluster I: purity = 1/6 * max(5, 1, 0) = 5/6
Cluster II: purity = 1/6 * max(1, 4, 1) = 4/6
Cluster III: purity = 1/5 * max(2, 0, 3) = 3/5
Total: purity = 1/17 * (5 + 4 + 3) = 12/17
High purity is easy to achieve when the number of clusters is large - in particular, purity is 1 if each document gets its own cluster.
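A minimal sketch of this purity computation, assuming each cluster is given simply as a list of gold-standard class labels (function name and data layout are illustrative):

```python
def purity(clusters):
    """clusters: list of clusters, each a list of gold-standard class labels.
    purity = (1/N) * sum over clusters of the size of the dominant class."""
    total_docs = sum(len(c) for c in clusters)
    majority_sum = 0
    for cluster in clusters:
        # Count how often each class label occurs in this cluster, keep the max.
        counts = {}
        for label in cluster:
            counts[label] = counts.get(label, 0) + 1
        majority_sum += max(counts.values())
    return majority_sum / total_docs

# The example from the slide: three clusters over 17 documents.
clusters = [
    ["x"] * 5 + ["o"] * 1,                # Cluster I
    ["x"] * 1 + ["o"] * 4 + ["d"] * 1,    # Cluster II
    ["x"] * 2 + ["d"] * 3,                # Cluster III
]
print(purity(clusters))  # 12/17 ≈ 0.706
```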

15 Rand Index
View it as a series of decisions, one for each of the N(N − 1)/2 pairs of documents in the collection.
A true positive (TP) decision assigns two similar documents to the same cluster.
A true negative (TN) decision assigns two dissimilar documents to different clusters.
A false positive (FP) decision assigns two dissimilar documents to the same cluster.
A false negative (FN) decision assigns two similar documents to different clusters.

16 Rand Index
                                     Same cluster in clustering    Different clusters in clustering
Same class in ground truth                      TP                              FN
Different classes in ground truth               FP                              TN
(Each cell counts the number of document pairs.)
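The Rand index is then the fraction of pairwise decisions that are correct, RI = (TP + TN) / (TP + FP + FN + TN); a one-function sketch:

```python
def rand_index(tp, fp, fn, tn):
    """Fraction of the N(N-1)/2 pairwise decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)
```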

17 Rand index Example
[Figure: the same Cluster I, Cluster II, and Cluster III as in the purity example]
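The slide shows only the figure; assuming the same class counts as in the purity example (Cluster I: 5/1/0, Cluster II: 1/4/1, Cluster III: 2/0/3, 17 documents in total), the pair counts can be reconstructed as:

TP + FP = C(6,2) + C(6,2) + C(5,2) = 15 + 15 + 10 = 40
TP = C(5,2) + C(4,2) + C(3,2) + C(2,2) = 10 + 6 + 3 + 1 = 20, so FP = 20
All same-class pairs: C(8,2) + C(5,2) + C(4,2) = 28 + 10 + 6 = 44, so FN = 44 - 20 = 24
All pairs: C(17,2) = 136, so TN = 136 - 40 - 24 = 72
RI = (TP + TN) / 136 = (20 + 72) / 136 ≈ 0.68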

18 K Means Algorithm
Mean: the arithmetic average.
Median: a type of average, described as the number dividing the higher half of a sample, a population, or a probability distribution from the lower half.
Mode: the most frequent value assumed by a random variable.

19 Partitioning Algorithms
Given: a set of documents D and the number K
Find: a partition into K clusters that optimizes the chosen partitioning criterion
Globally optimal: exhaustively enumerate all partitions
Effective heuristic method: the K-means algorithm
Partitioning criterion: residual sum of squares (RSS)
The goodness of cluster k is the sum of squared distances from its points to the centroid: Gk = Σi (di − ck)²  (sum over all di in cluster k)
The overall goodness is G = Σk Gk

20 K-Means
Assume documents are real-valued vectors.
Clusters are based on the centroids (aka the center of gravity or mean) of the points in a cluster ω: μ(ω) = (1/|ω|) Σx∈ω x
Instances are assigned to clusters according to their distance to the cluster centroids, choosing the nearest centroid.

21 K Means Example (K=2)
Pick seeds; reassign clusters; compute centroids; reassign clusters; … converged!
Termination conditions - several possibilities, e.g.:
A fixed number of iterations.
Doc partition unchanged.
Centroid positions don't change.

22 K-Means Algorithm
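The slide shows the algorithm as a figure; a minimal NumPy sketch of the same loop (random seed selection, reassignment, recomputation until the partition stops changing) is given below. It is a sketch under those assumptions, not the slide's own pseudo-code:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means: X is an (n, m) array of document vectors, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Pick k distinct documents as the initial centroids (seeds).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignment = None
    for _ in range(max_iter):
        # Reassignment step: each document goes to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = distances.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break  # Doc partition unchanged: converged.
        assignment = new_assignment
        # Recomputation step: each centroid becomes the mean of its members.
        for j in range(k):
            members = X[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assignment, centroids

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
print(kmeans(X, 2))
```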

23 Convergence
Why does the K-means algorithm converge?
Convergence: a state in which clusters don't change.
Reassignment: RSS decreases monotonically; each vector is assigned to its nearest centroid.
Recomputation: each RSSk decreases monotonically (mk is the number of members in cluster k). For which value a = μ(ωk) is RSSk minimized?
Setting the derivative to zero: Σ −2(x − a) = 0  ⇒  Σ x = mk a  ⇒  a = (1/mk) Σ x
K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm. EM is known to converge. The number of iterations could be large.

24 Convergence = Global Minimum?
There is unfortunately no guarantee that a global minimum in the objective function will be reached; an outlier, for example, can pull K-means into a sub-optimal clustering.

25 Seed Choice
The choice of seeds affects the result.
Some seeds lead to slow convergence, or to convergence to sub-optimal clusterings.
Select seeds with a heuristic (e.g., the doc least similar to any existing mean)
Try out multiple starting points
Initialize with the results of another clustering method (e.g., by sampling)
In the slide's example, starting with B and E as centroids converges to {A,B,C} and {D,E,F}; starting with D and F converges to {A,B,D,E} and {C,F}.

26 How Many Clusters?
How do we determine a suitable K?
Trade off producing more clusters (each internally more coherent) against producing too many clusters (e.g., a higher browsing cost).
For example:
Define the Benefit of a doc as its cosine similarity to its cluster centroid; the Total Benefit is the sum over all docs.
Define a Cost per cluster.
Define the Value of a clustering as Total Benefit minus Total Cost.
Over all possible K, choose the one with the largest Value.

27 Is K-Means Efficient?
Time complexity:
Computing the distance between two docs is O(M), where M is the dimensionality of the vectors.
Reassigning clusters: O(KN) distance computations, i.e. O(KNM).
Computing centroids: each doc gets added once to some centroid: O(NM).
Assume these two steps are each done once for I iterations: O(IKNM).
But what is M? A document is a sparse vector, but a centroid is not.
K-medoids algorithms take the element closest to the center as "the medoid".
Medoids are representative objects of a data set. The term is used in computer science in fuzzy clustering algorithms. Basically the algorithm finds the center of a cluster and takes the element closest to the center as "the medoid", upon which the distances to the other points are computed.

28 Efficiency: Medoid As Cluster Representative
Medoid: use a single document to represent the cluster, e.g., the document closest to the centroid.
One reason this is useful: consider the representative of a large cluster (>1000 documents).
The centroid of this cluster will be a dense vector; the medoid of this cluster will be a sparse vector.
Analogous to mean vs. median, centroid vs. medoid.
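A minimal sketch of choosing a medoid as the member closest to the centroid, using the same dense NumPy representation as the K-means sketch above (the function name is illustrative):

```python
import numpy as np

def medoid(members):
    """Return the row of `members` (an (n, m) array) closest to the cluster centroid."""
    centroid = members.mean(axis=0)                        # dense mean vector
    distances = np.linalg.norm(members - centroid, axis=1)
    return members[distances.argmin()]                     # an actual document of the cluster
```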

29 Hierarchical Clustering Algorithm

30 Hierarchical Agglomerative Clustering (HAC)
Assume we have a similarity function that determines the similarity of two instances.
Greedy algorithm:
Start with every instance as its own cluster.
Repeatedly pick the two most similar clusters and merge them into a new cluster.
Stop when only one cluster remains.
The history of merges forms a binary tree, or hierarchy.
Agglomerative: tending to gather together.
Dendrogram: a tree diagram representing such a hierarchy (literally, a diagram of kinship relations).

31 Dendrogram: Document Example
As clusters agglomerate, docs are likely to fall into a hierarchy of "topics" or concepts.
[Figure: a dendrogram over d1-d5 in which d1 and d2 merge, d4 and d5 merge, and d3 then joins {d4, d5} to form {d3, d4, d5}]

32 HAC Algorithm, pseudo-code
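The pseudo-code appears only as a figure on the slide; a naive O(n³) sketch of the same greedy loop, here with single-link similarity and an arbitrary user-supplied sim function, conveys the idea:

```python
def hac_single_link(docs, sim):
    """Naive agglomerative clustering with single-link similarity.
    docs: list of items; sim(a, b): similarity between two items.
    Returns the merge history as a list of (cluster_a, cluster_b) pairs."""
    clusters = [[d] for d in docs]          # start: every doc is its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the highest single-link similarity.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return merges
```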

33 Hierarchical Clustering algorithms
Agglomerative (bottom-up): start with each document being a single cluster; eventually all documents belong to the same cluster.
Divisive (top-down): start with all documents belonging to the same cluster; eventually each node forms a cluster on its own.
Does not require the number of clusters k to be specified in advance.

34 Key notion: cluster representative
How do we compute which two clusters are closest?
To do this computation efficiently, how should each cluster be represented (cluster representation)?
The representative can be some "typical" or central point of the cluster:
the point inducing the smallest radius to the docs in the cluster, smallest squared distances, etc.
the point that is the "average" of all docs in the cluster: the centroid or center of gravity

35 "Closest pair" of clusters
"Center of gravity": the clusters whose centroids (centers of gravity) are the most cosine-similar
Average-link: average cosine similarity over all pairs of elements
Single-link: similarity of the closest points (the most cosine-similar pair)
Complete-link: similarity of the "furthest" points (the least cosine-similar pair)

36 Single Link Example
Use the maximum similarity of pairs: sim(ci, cj) = max over x in ci, y in cj of sim(x, y)
Can result in "straggly" (long and thin) clusters due to the chaining effect.
Appropriate in some domains, such as clustering islands: "Hawai'i clusters"
After merging ci and cj, the similarity of the resulting cluster to another cluster ck is: sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))

37 Complete Link Example
Use the minimum similarity of pairs: sim(ci, cj) = min over x in ci, y in cj of sim(x, y)
Makes "tighter," spherical clusters that are typically preferable, but is affected by outliers.
After merging ci and cj, the similarity of the resulting cluster to another cluster ck is: sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))

38 Computational Complexity
In the first iteration, HAC computes the similarity of all pairs: O(n²).
In each of the subsequent n-2 merging iterations, it must compute the similarity between the most recently created cluster and all other existing clusters.
The other similarities are unchanged.
To achieve an overall O(n²) performance, computing the similarity to the other clusters must be constant time.
Otherwise the complexity is O(n² log n) or O(n³).

39 Centroid Agglomerative Clustering
Example: n = 6, k = 3, closest pair of centroids
[Figure: six documents d1-d6, with the centroid after the first merge step and after the second step marked]

40 Group Average Agglomerative Clustering
Use the average similarity of all pairs in the merged cluster as the measure of cluster quality.
Can it be computed in constant time? Yes, if the vectors are normalized to unit length and each cluster maintains the sum of its vectors.
A compromise between single and complete link.
Two options:
Averaged across all ordered pairs in the merged cluster
Averaged over all pairs between the two original clusters
Some previous work has used one of these options, some the other. No clear difference in efficacy.
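A sketch of the constant-time computation, assuming unit-length document vectors and a stored sum vector per cluster; this is the standard group-average identity from the IIR readings rather than something spelled out on the slide:

```python
import numpy as np

def group_average_sim(sum_i, n_i, sum_j, n_j):
    """Group-average similarity of the merger of clusters i and j,
    averaged over all ordered pairs of distinct docs in the merged cluster.
    sum_i, sum_j: sums of the (unit-length) vectors in each cluster."""
    s = sum_i + sum_j
    n = n_i + n_j
    # Sum of all pairwise dot products = |s|^2 - n, since each self-similarity is 1.
    return (np.dot(s, s) - n) / (n * (n - 1))
```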

41 Exercise
Consider the agglomerative clustering of n points on a line. Can you avoid the n³ distance/similarity computations? How many computations does your approach need?

42 Efficiency: "Using approximations"
In the standard algorithm, the closest pair of centroids must be found at each step.
Approximation: find a nearly closest pair.
Simplistic example: maintain the closest pair based on distances in a projection onto a random line.

43 Applications in IR

44 Navigating document collections
Table of Contents
1. Science of Cognition
1.a. Motivations
1.a.i. Intellectual Curiosity
1.a.ii. Practical Applications
1.b. History of Cognitive Psychology
2. The Neural Basis of Cognition
2.a. The Nervous System
2.b. Organization of the Brain
2.c. The Visual System
3. Perception and Attention
3.a. Sensory Memory
3.b. Attention and Sensory Information Processing
Index
Aardvark, 15
Blueberry, 200
Capricorn, 1,
Dog,
Egypt, 65
Falafel,
Giraffes, 45-59
(from "How to Read a Book")
The parable is apt: Web pages are a book without a table of contents and without an index; that's where the Yahoo directory and the Googles live.
Information retrieval corresponds to a book's index; document clustering corresponds to a table of contents.

45 Scatter/Gather: Cutting, Karger, and Pedersen
Corpus analysis/navigation: given a corpus, partition it into groups of related docs.
Applied recursively, this can induce a tree of topics.
Allows the user to browse through the corpus to find information.
Crucial need: meaningful labels for topic nodes.

46 For better navigation of search results

47

48 Navigating search results (2)
Cluster documents by the sense of a word.
For search results (say for Jaguar, or NLP), cluster them into groups of related documents.
Can be viewed as a form of word sense disambiguation.
E.g., jaguar may have senses: the car company, the animal, the football team, the video game.

49

50 For speeding up vector space retrieval
In VSM retrieval, we need to find the doc vectors closest to the query vector.
Computing the similarity between the query and every doc in the collection is slow (for some applications).
One optimization: use the inverted index and compute similarity only for docs that contain a query term.
By clustering the docs in the corpus a priori, we can instead compute similarity only over a subset: the cluster the query falls into.
We could instead find natural clusters in the data:
Cluster documents into k clusters.
Retrieve the cluster ci closest to the query.
Rank the documents in ci and return them to the user.
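A minimal sketch of this cluster-pruning idea, reusing the kmeans function from the sketch after slide 22 and dense NumPy vectors (function names and parameters are illustrative):

```python
import numpy as np

def cluster_pruned_search(X, query, k=10, top_n=5, seed=0):
    """Rank only the documents in the cluster whose centroid is closest to the query.
    X: (n, m) matrix of doc vectors; query: length-m vector."""
    assignment, centroids = kmeans(X, k, seed=seed)          # done offline, a priori
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    # Online: pick the cluster whose centroid is most cosine-similar to the query.
    best_cluster = max(range(k), key=lambda j: cos(centroids[j], query))
    candidates = np.flatnonzero(assignment == best_cluster)
    # Rank only the candidate docs and return the top results.
    ranked = sorted(candidates, key=lambda i: cos(X[i], query), reverse=True)
    return ranked[:top_n]
```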

51 Resources
Weka 3 - Data Mining with Open Source Machine Learning Software in Java

52 Summary of This Lecture
Text clustering
Evaluation: purity, NMI, Rand index
Partitional algorithm: K-means (reassignment, recomputation)
Hierarchical algorithm: cluster representation; closeness measure of a cluster pair (single link, complete link, average link, centroid)

53 Thank You! Q&A

54 Readings
[1] IIR, Ch. 16 and Ch. 17.1-4.
[2] B. Florian, E. Martin, and X. Xiaowei, "Frequent term-based text clustering," in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada: ACM, 2002.

55 Cluster Labeling

56 Major issue - labeling
After a clustering algorithm finds clusters, how can they be made useful to the end user?
We need a pithy (concise, to the point) label for each cluster.
In search results, say "Animal" or "Car" in the jaguar example.
In topic trees (Yahoo), we need navigational cues.
Often done by hand, a posteriori.

57 How to Label Clusters
Show titles of typical documents
Titles are easy to scan; authors create them for quick scanning!
But you can only show a few titles, which may not fully represent the cluster.
Show words/phrases prominent in the cluster
More likely to fully represent the cluster.
Use distinguishing words/phrases: differential labeling (think about feature selection).
But harder to scan.

58 Labeling
Common heuristic: list the 5-10 most frequent terms in the centroid vector. Drop stop words; stem.
Differential labeling by frequent terms: within a collection about "Computers", all clusters have the word computer as a frequent term.
Discriminant analysis of centroids.
Perhaps better: a distinctive noun phrase.
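A minimal sketch of the frequent-terms heuristic, assuming the centroid is held as a {term: weight} dict and a stop-word set is available (names are illustrative):

```python
def label_cluster(centroid, stop_words, n_terms=5):
    """Return the n_terms highest-weighted non-stop-word terms of a cluster centroid."""
    candidates = {t: w for t, w in centroid.items() if t not in stop_words}
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return [t for t, _ in ranked[:n_terms]]

centroid = {"jaguar": 0.9, "the": 0.8, "car": 0.6, "engine": 0.4, "of": 0.3, "speed": 0.2}
print(label_cluster(centroid, stop_words={"the", "of"}, n_terms=3))  # ['jaguar', 'car', 'engine']
```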

