3 Formulations and Approaches
Partitioning Approaches
One possible goal for a clustering algorithm is to partition the document collection into k subsets, or clusters, D_1, ..., D_k so as to minimize the intracluster distance or maximize the intracluster resemblance.
- Bottom-up clustering
- Top-down clustering
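The partitioning goal above can be written as an optimization problem. The following is one possible formalization, not taken from the slide: the symbols \delta (a pairwise distance between documents) and \rho (a pairwise resemblance) are assumptions introduced here for illustration.

```latex
% Assumed notation: \delta(d, d') is a distance, \rho(d, d') a resemblance.
\min_{D_1,\dots,D_k} \; \sum_{i=1}^{k} \; \sum_{d,\,d' \in D_i} \delta(d, d')
\qquad \text{or} \qquad
\max_{D_1,\dots,D_k} \; \sum_{i=1}^{k} \; \sum_{d,\,d' \in D_i} \rho(d, d')
```

Both forms sum only over pairs inside the same cluster, which is what "intracluster" refers to.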
4 Formulations and Approaches
5 Distance based Hierarchical clustering
The tree of a hierarchical clustering can be produced in two ways:
- Bottom-up (agglomerative clustering)
  – start with the individual objects and group the most similar ones
  – at each step, join the clusters with maximum similarity
- Top-down (divisive clustering)
  – start with all the objects and divide them into groups so as to maximize within-group similarity
  – at each step, split the least coherent cluster
6 Three methods in hierarchical clustering
- Single link: similarity of the two most similar members
- Complete link: similarity of the two least similar members
- Group average: average similarity between members
7 Single link Clustering
- Cluster similarity is the similarity of the two most similar members => O(n^2)
- Locally coherent: close objects are in the same cluster
- Chaining effect: the method follows a chain of large similarities without taking the global context into account => low global cluster quality
8 Complete link Clustering
- Cluster similarity is the similarity of the two least similar members => O(n^3)
- The similarity function focuses on global cluster quality and avoids elongated clusters
- a/f or b/e is tighter than a/d (tighter clusters are better than 'straggly' clusters)
9 Group average agglomerative clustering
- Cluster similarity is the average similarity between members
- The complexity of computing average similarity is O(n^2)
- Average similarities are recomputed each time a new group is formed
- A compromise between single link and complete link
10 Comparison
- Single link
  – relatively efficient
  – long, straggly clusters
  – ellipsoidal clusters
  – loosely bound clusters
- Complete link
  – tightly bound clusters
- Group average
  – intermediate between single link and complete link
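The three linkage criteria compared above can be sketched in a few lines of Python. This is a minimal illustration, not an efficient implementation: it works on distances instead of similarities (so "most similar" becomes "closest"), it recomputes all pairwise distances at every merge, and the group-average variant averages over cross-cluster pairs only. The function names `euclidean`, `cluster_distance`, and `agglomerative` are hypothetical.

```python
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters, each a list of point indices."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":    # two most similar (closest) members
        return min(dists)
    if linkage == "complete":  # two least similar (farthest) members
        return max(dists)
    if linkage == "average":   # assumption: average over cross-cluster pairs
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(points, k, linkage="single"):
    """Bottom-up clustering: merge the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(
                clusters[ab[0]], clusters[ab[1]], points, linkage
            ),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(c) for c in clusters]


# Points on a line with slowly growing gaps, to provoke chaining.
points = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,)]
print(agglomerative(points, 2, "single"))
print(agglomerative(points, 2, "complete"))
```

On this toy input, single link chains the first four points together and returns [[0, 1, 2, 3], [4]], while complete link returns the more balanced [[0, 1], [2, 3, 4]].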
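11 Distance based Flat clustering
- k-means
Unlike hierarchical cluster analysis, k-means is a mutually exclusive clustering method: each object belongs to exactly one cluster. The number of clusters is fixed in advance, and the method determines which cluster each object belongs to. It is useful for cluster analysis of large data sets.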
12 Distance based k-means
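A minimal sketch of k-means may help make the procedure concrete. It assumes squared Euclidean distance and, to keep the example deterministic, seeds the centroids with the first k points; real implementations usually choose the seeds more carefully. The function name `kmeans` and its parameters are hypothetical.

```python
def kmeans(points, k, max_iter=100):
    """Hard-assignment k-means on a list of equal-length tuples."""
    # Assumption: seed the centroids with the first k points (deterministic).
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignment == assignment:  # nothing changed: converged
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids


assignment, centroids = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(assignment, centroids)
```

Each point ends up with exactly one cluster label, which is the mutually exclusive property described for k-means.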
13 Geometric Embedding Approaches
- Self-organizing maps
- Multidimensional scaling
- Latent semantic indexing
★ A different form of partition-based clustering is to identify dense regions in space.
14 Geometric Embedding Approaches
Self-organizing maps (SOMs)
- Self-organizing maps are a close cousin of k-means. Unlike k-means, which is concerned only with determining the association between clusters and documents, the SOM algorithm also embeds the clusters in a low-dimensional space right from the beginning and proceeds in a way that places related clusters close together in that space.
15 SOM: Example
SOM computed from over a million documents taken from 80 Usenet newsgroups. Light areas have a high density of documents.
16 Geometric Embedding Approaches
Multidimensional scaling (MDS)
- The goal of MDS is to represent documents as points in a low-dimensional space (often 2D or 3D) such that the Euclidean distance between any pair of points is as close as possible to the distance between them specified by the input.
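One common way to make "as close as possible" precise is a least-squares stress function. The form below is a sketch under assumed notation, not the slide's own formula: d_{ij} is the input distance between documents i and j, \hat{x}_i is the low-dimensional point for document i, and m is the target dimension. Normalized variants of this objective also exist.

```latex
% Assumed notation: d_{ij} = input distance, \hat{x}_i = embedded point in R^m.
\min_{\hat{x}_1,\dots,\hat{x}_n \in \mathbb{R}^m} \;
\sum_{i<j} \bigl( \lVert \hat{x}_i - \hat{x}_j \rVert_2 - d_{ij} \bigr)^2
```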
17 Geometric Embedding Approaches
Latent semantic indexing (LSI)
- The latent semantic indexing (LSI) method is an attempt to solve the synonymy problem while staying within the vector space model framework.
18 Latent semantic indexing (LSI)
[Figure: SVD of the term-document matrix A (terms by documents) into U, D, and V, yielding k-dim vectors; example terms: car, auto.]
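The decomposition behind LSI can be written out as follows. This is a sketch in assumed notation: t is the number of terms, d the number of documents, r the rank of A, and k the number of retained dimensions, with k no larger than r.

```latex
% Full SVD of the term-document matrix (assumed shapes shown as subscripts):
A_{t \times d} = U_{t \times r} \, D_{r \times r} \, V^{\top}_{r \times d}

% Rank-k approximation: keep only the k largest singular values.
A \approx A_k = U_k \, D_k \, V_k^{\top}, \qquad k \le r
```

Rows of U_k then give k-dimensional term vectors and rows of V_k give k-dimensional document vectors, so synonyms such as "car" and "auto" that occur in similar documents can end up close together even if they never co-occur.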
19 EM algorithm
- A soft version of k-means clustering
- Both clusters move towards the centroid of all three objects
- Reach the stable final state
20 EM algorithm (2)
- We want to calculate the probability P(c_j | x_i), where x_i is a vector
- Assume that each cluster has a normal distribution
- Maximum likelihood of the form
21 Procedure of EM
- Expectation step (E): compute h_ij, the expectation of z_ij
- Maximization step (M)
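The two steps can be sketched for a one-dimensional mixture of normal distributions. Several things here are assumptions for illustration and do not come from the slides: z_ij is read as the indicator that object x_i belongs to cluster c_j, so h_ij = P(c_j | x_i); the M step re-estimates means, variances, and cluster priors from the h_ij; the initial means are supplied by the caller; and a small variance floor `min_var` guards against degenerate clusters. The names `normal_pdf` and `em_gmm_1d` are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, means, n_iter=50, min_var=1e-6):
    """EM for a 1-D Gaussian mixture; `means` holds the initial cluster means."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k    # assumption: unit initial variances
    priors = [1.0 / k] * k   # assumption: uniform initial cluster priors
    h = []
    for _ in range(n_iter):
        # E step: h[i][j] = P(c_j | x_i), the expectation of z_ij.
        h = []
        for x in xs:
            weights = [
                priors[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)
            ]
            total = sum(weights)
            h.append([w / total for w in weights])
        # M step: re-estimate the parameters from the soft assignments.
        for j in range(k):
            n_j = sum(row[j] for row in h)
            means[j] = sum(row[j] * x for row, x in zip(h, xs)) / n_j
            var = sum(row[j] * (x - means[j]) ** 2 for row, x in zip(h, xs)) / n_j
            variances[j] = max(min_var, var)
            priors[j] = n_j / len(xs)
    return means, variances, priors, h


xs = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
means, variances, priors, h = em_gmm_1d(xs, means=[0.0, 6.0])
print(means, priors)
```

Unlike k-means, every object contributes to every cluster in proportion to h_ij, which is why EM is described as a soft version of k-means.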