# Similarity/Clustering (Artificial Intelligence Lab, 문홍구, 2006. 1. 17)


Similarity/Clustering, Artificial Intelligence Lab, 문홍구, 2006. 1. 17

2 Content
- What is Clustering
- Clustering Method
  - Distance-based
    - Hierarchical
    - Flat
  - Geometric embedding approach
    - Self-organizing maps
    - Multidimensional scaling
    - Latent semantic indexing

3 Formulations and Approaches
- Partitioning approaches
  - One possible goal for a clustering algorithm is to partition the document collection into k subsets, or clusters, D_1, …, D_k so as to minimize the intracluster distance or maximize the intracluster resemblance.
- Bottom-up clustering
- Top-down clustering
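The partitioning goal can be written down as an objective. The notation here is an assumption, not taken from the slides: δ is a distance, s is a similarity, and μ_i is a representative of cluster D_i (for example its centroid).

$$
\min_{D_1,\dots,D_k}\ \sum_{i=1}^{k}\ \sum_{d \in D_i} \delta(d, \mu_i)
\qquad\text{or}\qquad
\max_{D_1,\dots,D_k}\ \sum_{i=1}^{k}\ \sum_{d \in D_i} s(d, \mu_i)
$$

The first form minimizes intracluster distance; the second maximizes intracluster resemblance.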

4 Formulations and Approaches

5 Distance based
- Hierarchical clustering
  - Produces a tree (hierarchy) of clusters.
- Bottom-up (agglomerative clustering)
  - Start with the individual objects and group the most similar ones.
  - At each step, join the two clusters with maximum similarity.
- Top-down (divisive clustering)
  - Start with all the objects in one cluster and divide them into groups so as to maximize within-group similarity.
  - At each step, split the least coherent cluster.
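The bottom-up procedure can be sketched as a naive loop. This is a minimal illustration, not an efficient implementation: the function names, the use of "most similar pair of members" as the cluster similarity, and stopping at a target number of clusters are all assumptions made for the example.

```python
from itertools import combinations


def cluster_similarity(a, b, sim):
    """Similarity of two clusters, taken here as that of their closest members (an assumption)."""
    return max(sim(x, y) for x in a for y in b)


def agglomerative(points, sim, target_clusters):
    """Start with one cluster per object and repeatedly join the most similar pair."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # Find the pair of clusters (i < j) with maximum similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_similarity(clusters[ij[0]], clusters[ij[1]], sim),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters


# Hypothetical 1-D data; similarity is the negated distance.
print(agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], lambda x, y: -abs(x - y), 2))
```

Recording each merge instead of stopping at a fixed count would give the cluster tree mentioned on the slide.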

6 Three methods in hierarchical clustering
- Single-link
  - Similarity of the two most similar members
- Complete-link
  - Similarity of the two least similar members
- Group average
  - Average similarity between members
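The three cluster-similarity functions can be compared side by side. This is a sketch under assumptions: the function names and the toy similarity are hypothetical, and group average is computed over all pairs in the merged cluster, which is one common definition rather than the only one.

```python
def single_link(a, b, sim):
    """Similarity of the two most similar members across the clusters."""
    return max(sim(x, y) for x in a for y in b)


def complete_link(a, b, sim):
    """Similarity of the two least similar members across the clusters."""
    return min(sim(x, y) for x in a for y in b)


def group_average(a, b, sim):
    """Average similarity over all pairs of members in the merged cluster."""
    merged = a + b
    pairs = [(x, y) for i, x in enumerate(merged) for y in merged[i + 1:]]
    return sum(sim(x, y) for x, y in pairs) / len(pairs)


def toy_sim(x, y):
    """Hypothetical similarity for 1-D points: 1 at distance 0, decaying with distance."""
    return 1.0 / (1.0 + abs(x - y))


a, b = [0.0, 1.0], [2.0, 4.0]
for name, fn in [("single", single_link), ("complete", complete_link), ("average", group_average)]:
    print(name, fn(a, b, toy_sim))
```

On this toy input the group-average value falls between the single-link and complete-link values, which matches the "compromise" description on the later slides.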

7 Single-link clustering
- Cluster similarity is the similarity of the two most similar members => O(n^2)
- Locally coherent
  - Close objects end up in the same cluster
- Chaining effect
  - Merging follows a chain of large similarities without taking the global context into account => low global cluster quality

8 Complete-link clustering
- Cluster similarity is the similarity of the two least similar members => O(n^3)
- The similarity function focuses on global cluster quality
  - Avoids elongated clusters
  - a/f or b/e is tighter than a/d (tighter clusters are better than "straggly" clusters)

9 Group-average agglomerative clustering
- Cluster similarity is the average similarity between members
- The complexity of computing the average similarities is O(n^2)
- Average similarities are recomputed each time a new group is formed
- A compromise between single-link and complete-link

10 Comparison
- Single-link
  - Relatively efficient
  - Long, straggly clusters
    - Ellipsoidal clusters
  - Loosely bound clusters
- Complete-link
  - Tightly bound clusters
- Group average
  - Intermediate between single-link and complete-link

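11 Distance based
- Flat clustering
  - k-means
    - Unlike hierarchical cluster analysis, k-means is a mutually exclusive clustering method: each object belongs to exactly one cluster. The number of clusters is fixed in advance, and the method determines which cluster each object belongs to. It is useful for cluster analysis of large data sets.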
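A minimal k-means sketch illustrates the two properties stated above: the number of clusters is fixed up front (here by the number of initial centroids), and each object is assigned to exactly one cluster. The function name, the fixed iteration count, and the sample points are assumptions for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Each point goes to exactly one cluster: the one with the nearest centroid.
            nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D data with two obvious groups and k = 2 initial centroids.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Each pass costs time proportional to the number of points times the number of clusters, which is part of why the method scales to large collections.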

12 Distance based
- k-means

13 Geometric Embedding Approaches
- Self-organizing maps
- Multidimensional scaling
- Latent semantic indexing

★ A different form of partition-based clustering is to identify dense regions in space.

14 Geometric Embedding Approaches
- Self-organizing maps (SOMs)
  - Self-organizing maps are a close cousin of k-means. Unlike k-means, which is concerned only with determining the association between clusters and documents, the SOM algorithm also embeds the clusters in a low-dimensional space right from the beginning and proceeds in a way that places related clusters close together in that space.
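The difference from k-means can be seen in a small sketch: the cluster nodes sit on a fixed low-dimensional grid (a 1-D line here), and each update also pulls the grid neighbors of the winning node toward the input. This is a simplified illustration under assumptions: the function name and parameters are hypothetical, and the learning rate and neighborhood radius are held fixed for brevity, whereas practical SOMs usually shrink both over time.

```python
import math


def train_som(data, weights, epochs=20, rate=0.5, radius=1):
    """Train a 1-D self-organizing map; node i sits at position i on a line."""
    weights = [list(w) for w in weights]  # copy so the caller's list is untouched
    for _ in range(epochs):
        for x in data:
            # Best-matching unit: the node whose weight vector is closest to x.
            bmu = min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
            for i in range(len(weights)):
                grid_distance = abs(i - bmu)
                if grid_distance <= radius:
                    # Neighbors on the grid move too, but less than the winner.
                    influence = rate / (1 + grid_distance)
                    weights[i] = [w + influence * (xi - w) for w, xi in zip(weights[i], x)]
    return weights


# Hypothetical 2-D inputs and three nodes with arbitrary starting weights.
data = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8)]
print(train_som(data, [(0.5, 0.5), (0.5, 0.6), (0.6, 0.5)]))
```

Because neighbors are updated together, nodes that are adjacent on the grid tend to end up with similar weight vectors, which is what places related clusters close together in the map.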

15 SOM: Example
- SOM computed from over a million documents taken from 80 Usenet newsgroups. Light areas have a high density of documents.

16 Geometric Embedding Approaches
- Multidimensional scaling (MDS)
  - The goal of MDS is to represent documents as points in a low-dimensional space (often 2D or 3D) such that the Euclidean distance between any pair of points is as close as possible to the distance between them specified by the input.
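How well an embedding meets this goal is often measured with a "stress" value. The sketch below uses one common normalized formulation (squared differences between embedded and input distances, divided by the sum of squared input distances); the function name and the formula choice are assumptions, since the slide does not give one.

```python
import math
from itertools import combinations


def stress(target, embedding):
    """Normalized mismatch between input distances and Euclidean distances in the embedding.

    target[i][j] is the input distance between items i and j;
    embedding[i] is the low-dimensional point assigned to item i.
    """
    mismatch = 0.0
    scale = 0.0
    for i, j in combinations(range(len(embedding)), 2):
        embedded = math.dist(embedding[i], embedding[j])
        mismatch += (embedded - target[i][j]) ** 2
        scale += target[i][j] ** 2
    return mismatch / scale


# Hypothetical input: three items whose pairwise distances form a 3-4-5 triangle.
target = [
    [0, 3, 4],
    [3, 0, 5],
    [4, 5, 0],
]
print(stress(target, [(0, 0), (3, 0), (0, 4)]))  # an embedding that matches the input
print(stress(target, [(0, 0), (0, 0), (0, 0)]))  # all points collapsed together
```

An MDS procedure searches for point positions that make this value small, for example by iteratively moving points to reduce it.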

17 Geometric Embedding Approaches
- Latent semantic indexing (LSI)
  - The latent semantic indexing (LSI) method is an attempt to solve the synonymy problem while staying within the vector space model framework.
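The synonymy idea can be illustrated with a truncated SVD of a small term-document matrix. This is a sketch under assumptions: the toy matrix, the function names, and the choice k = 2 are made up for the example, and it relies on NumPy's `numpy.linalg.svd`. "car" and "auto" never appear in the same document, so their raw vectors are orthogonal, but both co-occur with "engine".

```python
import numpy as np


def lsi_term_vectors(A, k):
    """Represent each term (row of A) as a k-dimensional vector via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k]  # scale each kept dimension by its singular value


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


terms = ["car", "auto", "engine", "flower", "petal"]
# Rows are terms, columns are four hypothetical documents.
A = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # auto
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 1],  # petal
], dtype=float)

vectors = lsi_term_vectors(A, k=2)
print("raw cosine(car, auto):   ", cosine(A[0], A[1]))
print("latent cosine(car, auto):", cosine(vectors[0], vectors[1]))
```

In the raw vector space the two terms look unrelated; in the reduced space they point in nearly the same direction because they share context.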

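18 Latent semantic indexing (LSI)
- [Figure: SVD of the term-document matrix A (t terms × d documents) into the factors U, D, and V; terms such as "car" and "auto" and the documents are each represented as k-dim vectors.]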

19 EM algorithm
- A soft version of k-means clustering
- Both clusters move towards the centroid of all three objects
- The clusters then reach a stable final state

20 EM algorithm (2)
- We want to calculate the probability P(c_j | x_i) that cluster c_j generated the vector x_i
- Assume that each cluster c_j has a normal distribution
- Maximum likelihood of the form

21 Procedure of EM
- Expectation step (E)
  - Compute h_ij, the expectation of z_ij
- Maximization step (M)
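One E step followed by one M step can be sketched for a one-dimensional mixture of normal distributions. The details are assumptions, since the slides do not preserve the formulas: h[i][j] is taken to be the posterior probability that cluster j generated x_i, the M step re-estimates each cluster's mean, variance, and prior from those weights, and the variance floor is a practical guard added for the example.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_step(xs, means, variances, priors):
    """Run one E step and one M step; return (h, means, variances, priors)."""
    # E step: h[i][j] is the expected membership of x_i in cluster j.
    h = []
    for x in xs:
        weighted = [p * normal_pdf(x, m, v) for m, v, p in zip(means, variances, priors)]
        total = sum(weighted)
        h.append([w / total for w in weighted])

    # M step: re-estimate each cluster's parameters from the soft memberships.
    new_means, new_variances, new_priors = [], [], []
    for j in range(len(means)):
        weight = sum(row[j] for row in h)
        mean = sum(row[j] * x for row, x in zip(h, xs)) / weight
        var = sum(row[j] * (x - mean) ** 2 for row, x in zip(h, xs)) / weight
        new_means.append(mean)
        new_variances.append(max(var, 1e-6))  # floor to avoid a zero variance
        new_priors.append(weight / len(xs))
    return h, new_means, new_variances, new_priors


# Hypothetical data with two groups, and deliberately rough starting parameters.
xs = [0.0, 0.2, 0.4, 9.6, 9.8, 10.0]
means, variances, priors = [1.0, 8.0], [1.0, 1.0], [0.5, 0.5]
for _ in range(20):
    h, means, variances, priors = em_step(xs, means, variances, priors)
print(means, priors)
```

Unlike k-means, each point contributes to every cluster in proportion to h[i][j], which is what makes the assignment "soft".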

