
1 Text Clustering PengBo Nov 1, 2010

2 Today’s Topic Document clustering: motivations; clustering algorithms: partitional, hierarchical; evaluation.

3 What’s Clustering?

4 What is clustering? Clustering: the process of grouping a set of objects into classes of similar objects. The most common form of unsupervised learning. Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given. A common and important task that finds many applications in IR and elsewhere.

5 Clustering Internal criterion: high intra-cluster similarity, low inter-cluster similarity. How many clusters?

6 Issues for clustering Document representation for clustering: vector space or language model? Similarity/distance: cosine similarity or KL divergence? How many clusters: fixed a priori, or completely data driven? Avoid “trivial” clusters that are too large or too small.

7 Clustering Algorithms Hard clustering algorithms compute a hard assignment: each document is a member of exactly one cluster. Soft clustering algorithms compute a soft assignment: a document’s assignment is a distribution over all clusters.

8 Clustering Algorithms Flat algorithms create the cluster set without explicit structure; they usually start with a random (partial) partitioning and refine it iteratively (K-means clustering, model-based clustering). Hierarchical algorithms are bottom-up (agglomerative) or top-down (divisive).

10 Evaluation

11 Think about it… Evaluation by high internal criterion scores? That is, an objective function for high intra-cluster similarity and low inter-cluster similarity. Or evaluation by application and user judgment? Internal judgment vs. application/user judgment.

12 Example [figure: labeled documents grouped into Cluster I, Cluster II, Cluster III]

13 External criteria for clustering quality What is the test set, i.e. the ground truth? Assume documents with C gold-standard classes, while our clustering algorithm produces K clusters ω_1, ω_2, …, ω_K with n_i members each. A simple measure: purity, defined as the ratio between the number of documents of the dominant class in cluster ω_k and the size of ω_k. Formally, purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|, where Ω = {ω_1, ω_2, ..., ω_K} is the set of clusters and C = {c_1, c_2, ..., c_J} is the set of classes.

14 Purity example [figure: the three clusters from slide 12] Cluster I: purity = 1/6 * max(5, 1, 0) = 5/6. Cluster II: purity = 1/6 * max(1, 4, 1) = 4/6. Cluster III: purity = 1/5 * max(2, 0, 3) = 3/5. Total: purity = 1/17 * (5 + 4 + 3) = 12/17.
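
To make the arithmetic above concrete, here is a minimal Python sketch (the contingency counts come from the example figure; the dictionary layout is just an illustration, not from the slides):

```python
# Contingency counts from the example: rows = clusters, columns = classes.
clusters = {
    "I":   [5, 1, 0],
    "II":  [1, 4, 1],
    "III": [2, 0, 3],
}

N = sum(sum(counts) for counts in clusters.values())        # 17 documents

# Per-cluster purity: share of the cluster taken by its dominant class.
for name, counts in clusters.items():
    print(name, max(counts) / sum(counts))                  # 5/6, 4/6, 3/5

# Overall purity: (1/N) * sum of dominant-class counts over clusters.
print(sum(max(c) for c in clusters.values()) / N)           # 12/17 ≈ 0.71
```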

15 Rand Index View it as a series of decisions, one for each of the N(N − 1)/2 pairs of documents in the collection. A true positive (TP) decision assigns two similar documents to the same cluster; a true negative (TN) decision assigns two dissimilar documents to different clusters; a false positive (FP) decision assigns two dissimilar documents to the same cluster; a false negative (FN) decision assigns two similar documents to different clusters.

16 Rand Index Contingency table over the pairs of documents: same class in ground truth and same cluster in the clustering → TP; same class, different clusters → FN; different classes, same cluster → FP; different classes, different clusters → TN. The Rand index is the fraction of correct decisions: RI = (TP + TN) / (TP + FP + FN + TN).

17 Rand index Example [figure: the three clusters from slide 12]
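
Continuing with the same example counts, a small sketch of the Rand index computation (the numbers below follow from the contingency table used for purity; they are computed here rather than copied from the slide, whose figures are not in the transcript):

```python
from math import comb

# Rows = clusters I, II, III; columns = classes (same counts as above).
table = [
    [5, 1, 0],
    [1, 4, 1],
    [2, 0, 3],
]

N = sum(sum(row) for row in table)
total_pairs = comb(N, 2)                                     # N(N-1)/2 decisions

same_cluster = sum(comb(sum(row), 2) for row in table)       # TP + FP
same_class = sum(comb(sum(col), 2) for col in zip(*table))   # TP + FN
tp = sum(comb(cell, 2) for row in table for cell in row)

fp, fn = same_cluster - tp, same_class - tp
tn = total_pairs - tp - fp - fn

print(tp, fp, fn, tn)                        # 20 20 24 72
print((tp + tn) / total_pairs)               # Rand index = 92/136 ≈ 0.68
```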

18 K Means Algorithm

19 Partitioning Algorithms Given: a set of documents D and the number K. Find: a partition of D into K clusters that optimizes the chosen partitioning criterion. Globally optimal: exhaustively enumerate all partitions. Effective heuristic method: the K-means algorithm. Partitioning criterion: residual sum of squares (RSS).

20 K-Means Assume documents are real-valued vectors. Clusters are based on centroids (aka the center of gravity or mean) of a cluster ω: μ(ω) = (1/|ω|) Σ_{x∈ω} x. Instances are assigned to clusters according to their distance to the cluster centroids: each instance is assigned to the nearest centroid.

21 K Means Example (K=2) [figure: animation] Pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!

22 K-Means Algorithm
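
The slide's pseudo-code is not in the transcript, so the following is a minimal NumPy sketch of the algorithm described on slides 20–21 (it uses Euclidean distance for simplicity, whereas the lecture's document setting would use cosine similarity on length-normalized vectors; the function name and defaults are illustrative assumptions):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means sketch: X is an (N, M) array of document vectors."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # pick k seeds
    for _ in range(iters):
        # Reassignment: each document goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recomputation: each centroid becomes the mean of its members.
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break                                               # converged
        centroids = new_centroids
    return assign, centroids
```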

23 Convergence Why does the K-means algorithm converge? Convergence means a state in which clusters don’t change. Reassignment: RSS decreases monotonically, since each vector is assigned to its closest centroid. Recomputation: each RSS_k decreases monotonically (m_k is the number of members in cluster k). Which value a = μ(ω_k) minimizes RSS_k? Setting the derivative to zero: Σ −2(x − a) = 0 ⇒ Σ x = Σ a = m_k a ⇒ a = (1/m_k) Σ x, i.e. the centroid.

24 Convergence = Global Minimum? There is unfortunately no guarantee that a global minimum of the objective function will be reached. [figure: a sub-optimal clustering caused by an outlier]

25 Seed Choice The choice of seeds affects the result: some seeds lead to slow convergence, or to convergence to sub-optimal clusterings. Remedies: select seeds with a heuristic (e.g., a doc least similar to any existing mean); try multiple starting points; initialize with the result of another clustering method (e.g., by sampling). [figure: points A–F] In the figure, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.

26 How Many Clusters? How do we determine a suitable K? Trade off producing more clusters (each cluster is internally more coherent) against producing too many clusters (e.g., costly to browse). For example: define the Benefit of a doc as the cosine similarity to its cluster centroid, and the Total Benefit as the sum over all docs; define a Cost for each cluster, and the Total Cost as the sum over clusters; define the Value of a clustering as Total Benefit - Total Cost. Among all possible K, choose the one with the largest Value, as in the sketch below.
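
A sketch of this selection criterion, assuming a hypothetical fixed cost per cluster (the slide leaves the exact cost definition open) and reusing the kmeans() sketch above:

```python
import numpy as np

def clustering_value(X, assign, centroids, cluster_cost=1.0):
    """Value = Total Benefit - Total Cost.

    Benefit of a doc = cosine similarity to its cluster centroid;
    cluster_cost is an assumed per-cluster penalty.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    total_benefit = np.sum(Xn * Cn[assign], axis=1).sum()
    return total_benefit - cluster_cost * len(centroids)

# Pick the K with the largest Value among the candidates:
# best_k = max(range(2, 10), key=lambda k: clustering_value(X, *kmeans(X, k)))
```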

27 Is K-Means Efficient? Time complexity: computing the distance between two docs is O(M), where M is the dimensionality of the vectors. Reassigning clusters: O(KN) distance computations, i.e. O(KNM). Computing centroids: each doc gets added once to some centroid, O(NM). Assume these two steps are each done once for I iterations: O(IKNM). What about M? A document is a sparse vector, but a centroid is not, so distance computations against centroids really do cost O(M). K-medoids algorithms instead use the element closest to the center as “the medoid”.

28 Efficiency: Medoid As Cluster Representative Medoid: use a single document as the cluster representative, e.g. the document closest to the centroid (see the sketch below). One reason this is useful: consider the representative of a very large cluster (>1000 documents). The centroid of this cluster will be a dense vector; the medoid of this cluster will be a sparse vector. The analogy is mean vs. median: centroid vs. medoid.
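
A small sketch of one way to pick a medoid (Euclidean distance to the centroid is assumed here just for illustration):

```python
import numpy as np

def medoid(X_cluster):
    """Return the member document closest to the cluster centroid.

    Unlike the dense centroid, the medoid is an actual (sparse) document.
    """
    centroid = X_cluster.mean(axis=0)
    dists = np.linalg.norm(X_cluster - centroid, axis=1)
    return X_cluster[dists.argmin()]
```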

29 Hierarchical Clustering Algorithm

30 Hierarchical Agglomerative Clustering (HAC) Assume we have a similarity function that determines the similarity of two instances. Greedy algorithm: start with each instance as its own cluster; repeatedly select the two most similar clusters and merge them into a new cluster; stop when only one cluster remains. The merge history forms a binary tree, or hierarchy: a dendrogram.

31 Dendrogram: Document Example As clusters agglomerate, docs are likely to fall into a hierarchy of “topics” or concepts. [figure: dendrogram over d1–d5, with merges {d1,d2}, {d4,d5}, {d3,d4,d5}]

32 HAC Algorithm, pseudo-code
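
The pseudo-code image itself is not in the transcript; below is a naive O(n^3) Python sketch of the greedy procedure from slide 30 (the linkage options anticipate slide 35; this is an illustration under those assumptions, not the lecturer's code):

```python
def hac(sim, linkage="single"):
    """Naive HAC over a precomputed similarity matrix sim (N x N lists).

    Returns the merge history as a list of (cluster_a, cluster_b) tuples.
    """
    clusters = [[i] for i in range(len(sim))]
    merges = []
    while len(clusters) > 1:
        best, pair = float("-inf"), None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairwise = [sim[i][j] for i in clusters[a] for j in clusters[b]]
                if linkage == "single":        # most similar pair of members
                    s = max(pairwise)
                elif linkage == "complete":    # least similar pair of members
                    s = min(pairwise)
                else:                          # "average": mean over all pairs
                    s = sum(pairwise) / len(pairwise)
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        merges.append((tuple(clusters[a]), tuple(clusters[b])))
        clusters[a] += clusters[b]
        del clusters[b]
    return merges
```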

33 Hierarchical Clustering algorithms Agglomerative (bottom-up): start with each document being a single cluster; eventually all documents belong to the same cluster. Divisive (top-down): start with all documents belonging to the same cluster; eventually each node forms a cluster on its own. The number of clusters k does not need to be specified in advance.

34 Key notion: cluster representative How do we compute which two clusters are closest? And to do this efficiently, how do we represent each cluster (the cluster representation)? The representative can be some “typical” or central point in the cluster: the point inducing the smallest radius to the docs in the cluster, the smallest squared distances, etc.; or the point that is the “average” of all docs in the cluster, i.e. the centroid or center of gravity.

35 “Closest pair” of clusters Center of gravity: the clusters whose centroids (centers of gravity) are most cosine-similar. Average-link: the average cosine similarity over all pairs of elements. Single-link: the closest points (similarity of the most cosine-similar pair). Complete-link: the furthest points (similarity of the “furthest”, least cosine-similar pair).
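
If SciPy is available, these four definitions correspond roughly to the method argument of scipy.cluster.hierarchy.linkage; note that SciPy works with distances (Euclidean by default) rather than cosine similarities, so this is only an approximate illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 50)                  # 20 toy "documents", 50 features

Z_single   = linkage(X, method="single")    # closest pair of points
Z_complete = linkage(X, method="complete")  # furthest pair of points
Z_average  = linkage(X, method="average")   # average over all pairs
Z_centroid = linkage(X, method="centroid")  # distance between centroids

labels = fcluster(Z_average, t=3, criterion="maxclust")   # cut into 3 clusters
```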

36 Single Link Example [figure] Single link tends to produce long “chaining” clusters.

37 Complete Link Example [figure] Complete link is affected by outliers.

38 Computational Complexity In the first iteration, HAC computes the similarity of all pairs: O(n^2). In the subsequent n − 2 merging iterations, it computes the similarity between the most recently created cluster and all other existing clusters; all other similarities are unchanged. To achieve overall O(n^2) performance, computing the similarity to the other clusters must take constant time; otherwise the algorithm is O(n^2 log n) or O(n^3).

39 Centroid Agglomerative Clustering Example: n=6, k=3, merge the closest pair of centroids. [figure: documents d1–d6; centroid after first step; centroid after second step]

40 Group Average Agglomerative Clustering Can the average similarity over all pairs in the merged cluster be computed in constant time? Yes, if the vectors are normalized to unit length and we store the sum of vectors for each cluster, as in the sketch below.
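
A minimal sketch of the constant-time trick (assuming unit-length vectors and a stored sum of vectors per cluster; it uses the identity that, for n unit vectors with sum s, the sum of d·d' over all ordered pairs with d ≠ d' equals |s|^2 − n):

```python
import numpy as np

def group_average_sim(sum_i, n_i, sum_j, n_j):
    """Group-average similarity of the merged cluster, without touching pairs.

    sum_i, sum_j: stored sums of the length-normalized vectors in each cluster.
    """
    s = sum_i + sum_j
    n = n_i + n_j
    return (np.dot(s, s) - n) / (n * (n - 1))
```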

41 Exercise Consider agglomerative clustering of n points on a line. Can you avoid n^3 distance/similarity computations? How many computations does your scheme require?

42 Efficiency: “Using approximations” In the standard algorithm, the closest pair of centroids must be found at each step. Approximate algorithm: find a nearly closest pair. Simplistic example: maintain the closest pair based on distances in the projection onto a random line. [figure: random line]

43 Applications in IR

44 Navigating document collections Information Retrieval: like a book index. Document clusters: like a table of contents. Index: Aardvark, 15; Blueberry, 200; Capricorn, 1; Dog; Egypt, 65; Falafel; Giraffes; … Table of Contents: 1. Science of Cognition; 1.a. Motivations; 1.a.i. Intellectual Curiosity; 1.a.ii. Practical Applications; 1.b. History of Cognitive Psychology; 2. The Neural Basis of Cognition; 2.a. The Nervous System; 2.b. Organization of the Brain; 2.c. The Visual System; 3. Perception and Attention; 3.a. Sensory Memory; 3.b. Attention and Sensory Information Processing

45 Scatter/Gather: Cutting, Karger, and Pedersen

46 For better navigation of search results

47

48 Navigating search results (2) Cluster documents according to the sense of a word. For the results of a search (say for Jaguar, or NLP), cluster them into groups of related documents. This can be viewed as a form of word sense disambiguation.

49

50 For speeding up vector space retrieval Retrieval in the VSM requires finding the doc vectors closest to the query vector. Computing the similarity of the query against every doc in the collection is slow (for some applications). One optimization: use the inverted index and only score the docs that contain some term of the query. By clustering the docs in the corpus a priori, we can compute similarities only on a subset: the cluster the query falls into, as in the sketch below.
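
A sketch of this idea, reusing the output of the kmeans() sketch above (brute-force cosine scoring within one cluster; a real system would use an inverted index as the slide notes):

```python
import numpy as np

def cluster_pruned_search(q, X, assign, centroids, topn=10):
    """Score only the docs in the cluster whose centroid is nearest the query."""
    def cos(a, B):
        return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-12)

    best_cluster = cos(q, centroids).argmax()
    candidates = np.where(assign == best_cluster)[0]
    scores = cos(q, X[candidates])
    order = scores.argsort()[::-1][:topn]
    return candidates[order], scores[order]
```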

51 Resources Weka 3: Data Mining with Open Source Machine Learning Software in Java

52 Summary of this lecture Text Clustering. Evaluation: purity, NMI, Rand index. Partitioning algorithm: K-means (reassignment, recomputation). Hierarchical algorithm: cluster representation; closeness measures for a cluster pair (single link, complete link, average link, centroid).

53 Thank You! Q&A

54 Readings [1] IIR Ch Ch [2] F. Beil, M. Ester, and X. Xu, “Frequent term-based text clustering,” in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada: ACM, 2002.

55 Cluster Labeling

56 Major issue - labeling After the clustering algorithm finds clusters, how can they be made useful to the end user? We need a pithy label for each cluster: in search results, say “Animal” or “Car” in the jaguar example; in topic trees (Yahoo), we need navigational cues. Often done by hand, a posteriori.

57 How to Label Clusters Show titles of typical documents: titles are easy to scan, and authors create them for quick scanning! But you can only show a few titles, which may not fully represent the cluster. Show words/phrases prominent in the cluster: more likely to fully represent the cluster; use distinguishing words/phrases (differential labeling, think about feature selection); but harder to scan.

58 Labeling Common heuristic: list the 5-10 most frequent terms in the centroid vector (drop stop words; stem), as in the sketch below. Differential labeling by frequent terms: within a collection on “Computers”, all clusters have the word computer as a frequent term, so use discriminant analysis of the centroids. Perhaps better: a distinctive noun phrase.
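
A sketch of the frequent-terms heuristic (the vocab list mapping term index to term string and the stop-word set are assumptions for illustration):

```python
import numpy as np

def label_cluster(centroid, vocab, stopwords=frozenset(), k=10):
    """Label a cluster with the k highest-weighted non-stop-word terms
    of its centroid vector."""
    order = np.argsort(centroid)[::-1]
    terms = [vocab[i] for i in order if vocab[i] not in stopwords]
    return terms[:k]
```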

