Clustering
Introduction
Clustering
- Summarization of large data: understand the large customer data
- Data organization: manage the large customer data
- Outlier detection: find unusual customer data
Clustering
- A preprocessing step before classification/association
- Finds useful groupings to serve as classes
- Association rules can then be mined within a particular cluster
Problem Description
Given: a data set of N data items, each with a d-dimensional feature vector
Task: determine a natural, useful partitioning of the data set into a number of clusters (k) plus noise
Measure of closeness: similarity
- Dice's coefficient
- Simple matching coefficient
- Cosine coefficient
- Jaccard's coefficient
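The slide does not reproduce the formulas; for two items represented as binary feature vectors (equivalently, as the sets X and Y of features that are present), the standard definitions are:

\mathrm{Dice}(X,Y) = \frac{2\,|X \cap Y|}{|X| + |Y|} \qquad
\mathrm{Jaccard}(X,Y) = \frac{|X \cap Y|}{|X \cup Y|}

\mathrm{Cosine}(X,Y) = \frac{|X \cap Y|}{\sqrt{|X|\,|Y|}} \qquad
\mathrm{SMC}(X,Y) = \frac{\text{matching features (1--1 and 0--0)}}{d}

All four lie in [0, 1], with 1 meaning the two items are identical.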
Measure of closeness: dissimilarity
Distance measure (distance = dissimilarity):
- Manhattan distance
- Euclidean distance
- Minkowski metric
- Mahalanobis distance
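Again reconstructing the standard formulas (not on the slide), for d-dimensional vectors x and y and data covariance matrix S:

d_{\mathrm{Manhattan}}(x,y) = \sum_{i=1}^{d} |x_i - y_i| \qquad
d_{\mathrm{Euclidean}}(x,y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}

d_{\mathrm{Minkowski}}(x,y) = \Big( \sum_{i=1}^{d} |x_i - y_i|^p \Big)^{1/p} \qquad
d_{\mathrm{Mahalanobis}}(x,y) = \sqrt{(x-y)^{\top} S^{-1} (x-y)}

The Minkowski metric generalizes the first two (p = 1 gives Manhattan, p = 2 gives Euclidean); the Mahalanobis distance rescales Euclidean distance by the data's covariance.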
Similarity Matrix
Note that d_ij = d_ji (i.e., the matrix is symmetric), so we only need the lower-triangle part of the matrix.
Similarity Matrix: Example
(Figure: term-term similarity matrix for items T1-T8.)
Similarity Thresholds
- A similarity threshold is used to mark pairs that are "sufficiently" similar
- The threshold value is application- and collection-dependent
- Example: using a threshold value of 10 on the previous matrix
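A minimal Python sketch of the thresholding step, with an illustrative 4-item similarity matrix (the values and the threshold of 10 here are assumptions, not the slide's example):

import numpy as np

# illustrative symmetric similarity matrix for items T1..T4
sim = np.array([[ 0, 11,  4, 12],
                [11,  0,  9,  3],
                [ 4,  9,  0, 15],
                [12,  3, 15,  0]])
adj = sim >= 10                          # boolean similarity threshold matrix
pairs = np.argwhere(np.tril(adj, k=-1))  # lower triangle suffices: sim is symmetric
print(pairs + 1)                         # -> [[2 1] [4 1] [4 3]]: (T2,T1), (T4,T1), (T4,T3)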
Clustering methods
- Partitioning methods
- Hierarchical methods
- Density-based methods
- Grid-based methods
- Model-based methods
- Graph-based methods
Partitioning methods: K-means
1. Choose k objects as the initial cluster centers; set i = 0
2. Loop:
   - assign each data point to its nearest centroid
   - compute the mean of each cluster as its new center
(Figure: centroid positions after iterations I and II.)
Partitioning Method (iterative method)
The basic algorithm:
1. Select M cluster representatives (centroids)
2. For i = 1 to N, assign D_i to the most similar centroid
3. For j = 1 to M, recalculate the cluster centroid C_j
4. Repeat steps 2 and 3 until there is (little or) no change in the clusters
Example: initial (arbitrary) assignment: C1 = {T1, T2}, C2 = {T3, T4}, C3 = {T5, T6}
(Table: initial cluster centroids.)
Partitioning Method (iterative method)
Example (continued):
- Using the simple similarity measure, compute the new cluster-term similarity matrix
- Compute the new cluster centroids from the original document-term matrix
- Repeat until no further changes are made to the clusters
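A minimal NumPy sketch of the basic algorithm above (steps 1-4), using Euclidean distance in place of the worked example's similarity measure; all names are illustrative:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]   # step 1: initial centroids
    for _ in range(n_iter):
        # step 2: assign each item to its closest centroid
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # step 3: recompute each centroid as the mean of its cluster
        newC = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else C[j]
                         for j in range(k)])
        if np.allclose(newC, C):                       # step 4: stop when nothing changes
            break
        C = newC
    return labels, C

Calling labels, centroids = kmeans(data, k=3) partitions data into three clusters.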
Hierarchical methods
Group objects into a tree of clusters
Types:
- Agglomerative (bottom-up) approach: single-linkage, complete-linkage, group-average linkage, centroid-linkage, Ward's method
- Divisive (top-down) approach: e.g., recursive use of k-means clustering
(Figure: agglomerative clustering merges {a}, {b}, {c}, {d}, {e} bottom-up into {a, b, c, d, e}; divisive clustering splits top-down, reversing the same steps.)
Hierarchical methods: Ward's method
At each step, join the pair of clusters whose merger minimizes the increase in the total within-group error sum of squares.
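A sketch using SciPy's agglomerative implementation; swapping 'ward' for 'single', 'complete', 'average', or 'centroid' gives the other linkages listed above:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # toy 2-d data
Z = linkage(X, method='ward')                      # bottom-up merge tree (dendrogram)
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the tree into 3 clusters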
Graph Representation
The similarity matrix can be visualized as an undirected graph: each item is represented by a node, and an edge connects two items that are similar (a one in the similarity threshold matrix).
(Figure: threshold graph over items T1-T8.)
Clustering Algorithms (Graph-based)
Basic clustering techniques try to determine which objects belong to the same class.
- Clique method (complete link): all items within a cluster must be within the similarity threshold of all other items in that cluster; clusters may overlap; generally produces small but very tight clusters
- Single link method: any item in a cluster must be within the similarity threshold of at least one other item in that cluster; produces larger but weaker clusters
- Other methods:
  - Star method: start with an item and place all related items in that cluster
  - String method: start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on
Clustering Algorithms (Graph-based): Clique Method
A clique is a completely connected subgraph of a graph. In the clique method, each maximal clique in the graph becomes a cluster.
The maximal cliques (and therefore the clusters) in the previous example are:
{T1, T3, T4, T6}, {T2, T4, T6}, {T2, T6, T8}, {T1, T5}, {T7}
Note that, for example, {T1, T3, T4} is also a clique, but it is not maximal.
Clustering Algorithms (Graph-based): Single Link Method
1. Select an item not in a cluster and place it in a new cluster
2. Place all other items related to it in that cluster
3. Repeat step 2 for each item in the cluster until nothing more can be added
4. Repeat steps 1-3 for each item that remains unclustered
In this case the single link method produces only two clusters:
{T1, T3, T4, T5, T6, T2, T8}, {T7}
Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
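Both results above can be reproduced with networkx, assuming the edge set implied by the example's threshold graph:

import networkx as nx

G = nx.Graph()
G.add_nodes_from(f"T{i}" for i in range(1, 9))     # T7 stays isolated
G.add_edges_from([("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
                  ("T3", "T4"), ("T3", "T6"), ("T4", "T6"),
                  ("T2", "T4"), ("T2", "T6"), ("T2", "T8"), ("T6", "T8")])

print(list(nx.find_cliques(G)))          # clique method: maximal cliques (may overlap)
print(list(nx.connected_components(G)))  # single link: components partition the items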
Clustering Algorithms (Graph-based): Star and String Methods
Star method: {T1, T3, T4, T5, T6}, {T2, T8}, {T7}
String method: {T1, T3, T4, T2, T6, T8}, {T5}
Density-based methods
- Clusters are density-connected sets
- DBSCAN algorithm
Density-based methods
Based on a set of density distribution functions
(Figure: density function.)
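A minimal scikit-learn sketch of DBSCAN; the eps and min_samples values are illustrative and must be tuned per data set:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(100, 2))   # toy data
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# labels holds one cluster id per point; -1 marks noise (points in no dense region)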
Grid-based methods
- Organize the data space as a grid file
- Determine clusters as density-connected components of the grid
- Approximate the clusters found by DBSCAN
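A minimal sketch of the grid idea for 2-d data, with illustrative bin and density parameters: bin the points into grid cells, keep the dense cells, and let adjacent dense cells form one cluster (an approximation of DBSCAN's density-connected sets):

import numpy as np
from scipy import ndimage

def grid_cluster(X, bins=20, min_pts=3):
    # organize the data space as a grid: per-cell point counts
    H, xe, ye = np.histogram2d(X[:, 0], X[:, 1], bins=bins)
    dense = H >= min_pts                     # keep only the dense cells
    cell_label, n = ndimage.label(dense)     # connect adjacent dense cells
    # map each point back to the label of its cell (label 0 = noise)
    ix = np.clip(np.digitize(X[:, 0], xe) - 1, 0, bins - 1)
    iy = np.clip(np.digitize(X[:, 1], ye) - 1, 0, bins - 1)
    return cell_label[ix, iy], n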
Model-based methods
Optimize the fit between the given data and some mathematical model.
(Figure: grid of map units, each with an n-dimensional centroid vector.)
Self-Organizing Map (SOM)
- A sample data vector x is randomly chosen
- BMU (best matching unit): the map unit whose centroid is closest to x
- Update the centroid vectors of the BMU and its neighbors toward x:
  m_i(t+1) = m_i(t) + \alpha(t) \, h_{ci}(t) \, [x(t) - m_i(t)]
  where \alpha(t) is the learning rate and h_{ci}(t) is the neighborhood kernel function centered on the BMU c
(Figure: SOM output layer; the BMU and its neighbors move toward the input sample, shown before and after updating.)
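A minimal NumPy sketch of the SOM training loop described above; the grid size, decay schedules, and Gaussian kernel are illustrative choices:

import numpy as np

def train_som(X, grid=(10, 10), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    h, w = grid
    coords = np.array([(i, j) for i in range(h) for j in range(w)], dtype=float)
    M = rng.normal(size=(h * w, X.shape[1]))          # centroid vector per map unit
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                   # sample data vector, chosen randomly
        bmu = ((M - x) ** 2).sum(axis=1).argmin()     # BMU: unit with the closest centroid
        lr = lr0 * (1 - t / n_iter)                   # decaying learning rate
        sigma = sigma0 * (1 - t / n_iter) + 1e-3      # shrinking neighborhood radius
        d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        kernel = np.exp(-d2 / (2 * sigma ** 2))       # Gaussian neighborhood kernel
        M += lr * kernel[:, None] * (x - M)           # pull BMU and neighbors toward x
    return M.reshape(h, w, -1)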