Clustering.


1 Clustering

2 Introduction

3 Clustering
Summarization of large data: understand the large customer data
Data organization: manage the large customer data
Outlier detection: find unusual customer data

4 Clustering
A preliminary step before classification/association:
Find useful groupings for classes
Mine association rules within a particular cluster

5 Problem Description
Given: a data set of N data items, each with a d-dimensional feature vector
Task: determine a natural, useful partitioning of the data set into a number of clusters (k) and noise

6 Measure of closeness: similarity
Dice’s coefficient
Simple matching coefficient
Cosine coefficient
Jaccard’s coefficient
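As a rough illustration (not part of the original slides), the coefficients above can be computed for two binary feature vectors; the vectors below are hypothetical example data.

import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0, 1])

a = np.sum((x == 1) & (y == 1))   # attributes present in both
b = np.sum((x == 1) & (y == 0))   # present only in x
c = np.sum((x == 0) & (y == 1))   # present only in y
d = np.sum((x == 0) & (y == 0))   # absent in both

dice            = 2 * a / (2 * a + b + c)
jaccard         = a / (a + b + c)
simple_matching = (a + d) / (a + b + c + d)
cosine          = a / np.sqrt((a + b) * (a + c))   # cosine for binary vectors

print(dice, jaccard, simple_matching, cosine)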

7 Measure of closeness: dissimilarity
Distance measures (distance = dissimilarity):
Manhattan distance
Euclidean distance
Minkowski metric
Mahalanobis distance
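A minimal sketch of the distance measures listed above, using NumPy/SciPy; the vectors and the random sample used to estimate the covariance (needed for the Mahalanobis distance) are hypothetical.

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])

manhattan = distance.cityblock(x, y)        # L1 norm
euclidean = distance.euclidean(x, y)        # L2 norm
minkowski = distance.minkowski(x, y, p=3)   # general Lp norm

# Mahalanobis distance uses the inverse covariance of the data set,
# here estimated from a hypothetical random sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
VI = np.linalg.inv(np.cov(X, rowvar=False))
mahalanobis = distance.mahalanobis(x, y, VI)

print(manhattan, euclidean, minkowski, mahalanobis)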

8 Similarity Matrix
Note that dij = dji (i.e., the matrix is symmetric), so we only need the lower-triangle part of the matrix.
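A small sketch of this point with hypothetical 2-D points: SciPy's pdist already stores only the condensed (triangular) form, and the expanded matrix is symmetric.

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 1.0], [1.0, 1.0], [5.0, 4.0], [6.0, 5.0]])
condensed = pdist(X)          # only the n*(n-1)/2 pairwise distances
D = squareform(condensed)     # full symmetric matrix: D[i, j] == D[j, i]
lower = np.tril(D)            # the lower triangle carries all the information
print(np.allclose(D, D.T))    # True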

9 Similarity Matrix - Example
Term-Term Similarity Matrix

10 Similarity Thresholds
A similarity threshold is used to mark pairs that are “sufficiently” similar.
The threshold value is application- and collection-dependent.
Example: using a threshold value of 10 in the previous term-term similarity matrix.
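Sketch of thresholding: the similarity values below are made up, but the threshold of 10 mirrors the value used in the slides' example.

import numpy as np

S = np.array([[ 0, 11,  4,  6],
              [11,  0, 12,  3],
              [ 4, 12,  0,  9],
              [ 6,  3,  9,  0]])          # hypothetical term-term similarities
threshold = 10
adjacency = (S >= threshold).astype(int)  # 1 marks "sufficiently similar" pairs
print(adjacency)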

11 Clustering methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

12 Partitioning methods: K-means
Choose k objects as the initial cluster centers; set i = 0
Loop:
Assign each data point to its nearest centroid
Recompute each cluster center as the mean of its members
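A minimal NumPy sketch of the k-means loop described above; the synthetic data, k, and iteration cap are hypothetical.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # choose k objects as the initial cluster centers
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of its cluster (keep old center if empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.5, size=(20, 2)) for loc in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, k=3)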

13 Partitioning methods
(figure: cluster centroids at iteration I and iteration II)

14 Partitioning Method (Iterative method)
The basic algorithm:
1. Select M cluster representatives (centroids)
2. For i = 1 to N, assign Di to the most similar centroid
3. For j = 1 to M, recalculate the cluster centroid Cj
4. Repeat steps 2 and 3 until there is (little or) no change in the clusters
Example: initial (arbitrary) assignment: C1 = {T1, T2}, C2 = {T3, T4}, C3 = {T5, T6}; compute the cluster centroids from this assignment.

15 Partitioning Method (Iterative method)
Example (continued):
Using a simple similarity measure, compute the new cluster-term similarity matrix.
Compute the new cluster centroids using the original document-term matrix.
The process is repeated until no further changes are made to the clusters.
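The same iterative partitioning can be run with scikit-learn's KMeans; the tiny term vectors below are hypothetical stand-ins for rows of the document-term matrix in the slides' example.

import numpy as np
from sklearn.cluster import KMeans

terms = np.array([[2, 0, 1],   # T1
                  [1, 1, 0],   # T2
                  [0, 3, 1],   # T3
                  [0, 2, 2],   # T4
                  [3, 0, 0],   # T5
                  [0, 1, 3]])  # T6

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(terms)
print(km.labels_)            # cluster assignment for T1..T6
print(km.cluster_centers_)   # final cluster centroids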

16 Clustering methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

17 Hierarchical methods
Group objects into a tree of clusters. Types:
Agglomerative (bottom-up) approach:
Single linkage
Complete linkage
Group linkage
Centroid linkage
Ward’s method
Divisive (top-down) approach:
Use of k-means clustering
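A short sketch of agglomerative clustering with SciPy, where the linkage method selects the merge criterion (single, complete, average, centroid, or Ward); the 2-D points are hypothetical.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [9.0, 0.5]])

Z = linkage(X, method="ward")    # or "single", "complete", "average", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)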

18 Hierarchical methods
(figure: agglomerative merging and divisive splitting of objects a, b, c, d, e, step by step)

19 Hierarchical methods: Ward’s method
At each step, join the cluster pair whose merger minimizes the increase in the total within-group error sum of squares.
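Concretely, a standard way to write the increase minimized at each step (not spelled out on the slide) is, in LaTeX notation:

\Delta\mathrm{ESS}(A, B) = \frac{n_A\, n_B}{n_A + n_B}\, \lVert \bar{x}_A - \bar{x}_B \rVert^2

where n_A and n_B are the sizes of the two clusters and \bar{x}_A, \bar{x}_B their centroids.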

20 (figure-only slide; no transcript text)

21 Clustering methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

22 Graph Representation
The similarity matrix can be visualized as an undirected graph: each item is represented by a node, and an edge connects two items that are similar (a 1 in the thresholded similarity matrix).
(figure: graph over T1-T8)

23 Clustering Algorithms (Graph-based)
Basic clustering techniques try to determine which objects belong to the same class.
Clique method (complete link):
All items within a cluster must be within the similarity threshold of all other items in that cluster
Clusters may overlap
Generally produces small but very tight clusters
Single link method:
Any item in a cluster must be within the similarity threshold of at least one other item in that cluster
Produces larger but weaker clusters
Other methods:
Star method: start with an item and place all related items in that cluster
String method: start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on

24 Clustering Algorithms (Graph-based)
Clique Method:
A clique is a completely connected subgraph of a graph.
In the clique method, each maximal clique in the graph becomes a cluster.
The maximal cliques (and therefore the clusters) in the previous example are:
{T1, T3, T4, T6} {T2, T4, T6} {T2, T6, T8} {T1, T5} {T7}
Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.
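Sketch of the clique method via NetworkX's maximal-clique enumeration. The edge list below is a reconstruction consistent with the slides' example; the exact edges of the original figure are not in the transcript.

import networkx as nx

G = nx.Graph([("T1", "T3"), ("T1", "T4"), ("T1", "T6"), ("T1", "T5"),
              ("T3", "T4"), ("T3", "T6"), ("T4", "T6"),
              ("T2", "T4"), ("T2", "T6"), ("T2", "T8"), ("T6", "T8")])
G.add_node("T7")                       # isolated node

clusters = list(nx.find_cliques(G))    # each maximal clique becomes a cluster
print(clusters)
# e.g. [['T1','T3','T4','T6'], ['T1','T5'], ['T2','T4','T6'], ['T2','T6','T8'], ['T7']]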

25 Clustering Algorithms (Graph-based)
Single Link Method:
1. Select an item not in a cluster and place it in a new cluster
2. Place all other related items in that cluster
3. Repeat step 2 for each item in the cluster until nothing more can be added
4. Repeat steps 1-3 for each item that remains unclustered
In this case the single link method produces only two clusters:
{T1, T3, T4, T5, T6, T2, T8} {T7}
Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
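Sketch: the single link result corresponds to the connected components of the same graph (same reconstructed edge list as above).

import networkx as nx

G = nx.Graph([("T1", "T3"), ("T1", "T4"), ("T1", "T6"), ("T1", "T5"),
              ("T3", "T4"), ("T3", "T6"), ("T4", "T6"),
              ("T2", "T4"), ("T2", "T6"), ("T2", "T8"), ("T6", "T8")])
G.add_node("T7")

clusters = [sorted(c) for c in nx.connected_components(G)]
print(clusters)   # [['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T8'], ['T7']]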

26 Clustering Algorithms (Graph-based)
Star method: {T1, T3, T4, T5, T6} {T2, T8} {T7}
String method: {T1, T3, T4, T2, T6, T8} {T5}

27 Clustering methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

28 Density-based methods
Clusters are density-connected sets.
DBSCAN algorithm
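A short scikit-learn sketch of DBSCAN; the synthetic data and the eps/min_samples settings are hypothetical.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.3, size=(30, 2)),   # dense cluster 1
               rng.normal([4, 4], 0.3, size=(30, 2)),   # dense cluster 2
               [[10.0, 10.0]]])                         # an isolated outlier

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
print(db.labels_)   # cluster id per point; -1 marks noise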

29 Density-based methods
Based on a set of density distribution functions

30 Density-based methods
Based on a set of density distribution functions
(figure: density function)

31 Clustering methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

32 Grid-based methods
Organize the data space as a grid file
Determine clusters as density-connected components of the grid
Approximates the clusters found by DBSCAN
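A rough sketch of the grid-based idea for 2-D data: bin points into cells, keep the dense cells, and merge neighbouring dense cells into clusters. The cell size and density threshold are hypothetical, and this is a simplification rather than any specific published algorithm.

import numpy as np

def grid_cluster(X, cell_size=1.0, min_points=5):
    # assign each point to a grid cell
    cells = {}
    for p in X:
        key = tuple((p // cell_size).astype(int))
        cells.setdefault(key, []).append(p)
    # keep only cells that are dense enough
    dense = {k for k, pts in cells.items() if len(pts) >= min_points}
    # merge dense cells that touch (share a side or corner) into clusters
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            c = stack.pop()
            if c in component:
                continue
            component.add(c)
            seen.add(c)
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (c[0] + dx, c[1] + dy)
                    if nb in dense:
                        stack.append(nb)
        clusters.append(component)
    return clusters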

33 Clustering methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

34 Model-based methods
Optimize the fit between the given data and some mathematical model.
(figure: map unit with an N-dimensional centroid vector)

35 Self-Organizing Map (SOM)
A sample data vector X is chosen at random.
BMU (best matching unit): the map unit whose centroid is closest to X.
Update the centroid vectors using a neighborhood kernel function and a learning rate.
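A minimal NumPy sketch of the SOM training loop described above; the grid size, learning-rate and neighborhood schedules, and data are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # input samples
grid_h, grid_w, dim = 5, 5, 3
W = rng.normal(size=(grid_h, grid_w, dim))    # centroid vector of each map unit
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                              indexing="ij"), axis=-1)

n_steps, lr0, sigma0 = 2000, 0.5, 2.0
for t in range(n_steps):
    x = X[rng.integers(len(X))]               # sample data vector chosen at random
    # BMU: the map unit whose centroid is closest to x
    dists = np.linalg.norm(W - x, axis=-1)
    bmu = np.unravel_index(dists.argmin(), dists.shape)
    # decaying learning rate and Gaussian neighborhood kernel on the map grid
    lr = lr0 * (1 - t / n_steps)
    sigma = sigma0 * (1 - t / n_steps) + 0.1
    grid_dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
    h = np.exp(-grid_dist2 / (2 * sigma ** 2))
    # move each unit's centroid towards x, weighted by the kernel
    W += lr * h[..., None] * (x - W)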

36 Self-Organizing Map
(figure: input sample and output layer, before and after the update)

37 Self-Organizing Map (SOM)

