Clustering.


1 Clustering

2 Introduction

3 Clustering
Summarization of large data: understand the large customer data
Data organization: manage the large customer data
Outlier detection: find unusual customer data

4 Clustering
A preliminary step before classification/association:
Find useful groupings for classes
Mine association rules within a particular cluster

5 Problem Description
Given: a data set of N data items, each with a d-dimensional feature vector
Task: determine a natural, useful partitioning of the data set into a number of clusters (k) and noise

6 Measure of closeness: similarity
Dice’s coefficient
Simple matching coefficient
Cosine coefficient
Jaccard’s coefficient
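As a rough illustration (not part of the original slides), the coefficients above can be computed for two binary feature vectors; the vectors below are hypothetical example data.

import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0, 1])

a = np.sum((x == 1) & (y == 1))   # attributes present in both
b = np.sum((x == 1) & (y == 0))   # present only in x
c = np.sum((x == 0) & (y == 1))   # present only in y
d = np.sum((x == 0) & (y == 0))   # absent in both

dice            = 2 * a / (2 * a + b + c)
jaccard         = a / (a + b + c)
simple_matching = (a + d) / (a + b + c + d)
cosine          = a / np.sqrt((a + b) * (a + c))   # cosine for binary vectors

print(dice, jaccard, simple_matching, cosine)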

7 Measure of closeness: dissimilarity
Distance measures (distance = dissimilarity):
Manhattan distance
Euclidean distance
Minkowski metric
Mahalanobis distance
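A minimal sketch of the distance measures listed above, using NumPy/SciPy; the vectors and the random sample used to estimate the covariance (needed for the Mahalanobis distance) are hypothetical.

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])

manhattan = distance.cityblock(x, y)        # L1 norm
euclidean = distance.euclidean(x, y)        # L2 norm
minkowski = distance.minkowski(x, y, p=3)   # general Lp norm

# Mahalanobis distance uses the inverse covariance of the data set,
# here estimated from a hypothetical random sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
VI = np.linalg.inv(np.cov(X, rowvar=False))
mahalanobis = distance.mahalanobis(x, y, VI)

print(manhattan, euclidean, minkowski, mahalanobis)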

8 Similarity Matrix
Note that dij = dji (i.e., the matrix is symmetric), so we only need the lower-triangle part of the matrix.
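A small sketch of this point with hypothetical 2-D points: SciPy's pdist already stores only the condensed (triangular) form, and the expanded matrix is symmetric.

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 1.0], [1.0, 1.0], [5.0, 4.0], [6.0, 5.0]])
condensed = pdist(X)          # only the n*(n-1)/2 pairwise distances
D = squareform(condensed)     # full symmetric matrix: D[i, j] == D[j, i]
lower = np.tril(D)            # the lower triangle carries all the information
print(np.allclose(D, D.T))    # True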

9 Similarity Matrix - Example
Term-Term Similarity Matrix

10 Similarity Thresholds
A similarity threshold is used to mark pairs that are “sufficiently” similar.
The threshold value is application- and collection-dependent.
Example: using a threshold value of 10 in the previous term-term similarity matrix.
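Sketch of thresholding: the similarity values below are made up, but the threshold of 10 mirrors the value used in the slides' example.

import numpy as np

S = np.array([[ 0, 11,  4,  6],
              [11,  0, 12,  3],
              [ 4, 12,  0,  9],
              [ 6,  3,  9,  0]])          # hypothetical term-term similarities
threshold = 10
adjacency = (S >= threshold).astype(int)  # 1 marks "sufficiently similar" pairs
print(adjacency)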

11 Clustering methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

12 Partitioning methods: K-means
Choose k objects as the initial cluster centers; set i = 0
Loop:
Assign each data point to its nearest centroid
Recompute each cluster center as the mean of its members
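A minimal NumPy sketch of the k-means loop described above; the synthetic data, k, and iteration cap are hypothetical.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # choose k objects as the initial cluster centers
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of its cluster (keep old center if empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.5, size=(20, 2)) for loc in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, k=3)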

13 Partitioning methods
(figure: cluster centroids at iteration I and iteration II)

14 Partitioning Method (Iterative method)
The basic algorithm:
1. Select M cluster representatives (centroids)
2. For i = 1 to N, assign Di to the most similar centroid
3. For j = 1 to M, recalculate the cluster centroid Cj
4. Repeat steps 2 and 3 until there is (little or) no change in the clusters
Example: initial (arbitrary) assignment: C1 = {T1, T2}, C2 = {T3, T4}, C3 = {T5, T6}; compute the cluster centroids from this assignment.

15 Partitioning Method (Iterative method)
Example (continued):
Using a simple similarity measure, compute the new cluster-term similarity matrix.
Compute the new cluster centroids using the original document-term matrix.
The process is repeated until no further changes are made to the clusters.
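The same iterative partitioning can be run with scikit-learn's KMeans; the tiny term vectors below are hypothetical stand-ins for rows of the document-term matrix in the slides' example.

import numpy as np
from sklearn.cluster import KMeans

terms = np.array([[2, 0, 1],   # T1
                  [1, 1, 0],   # T2
                  [0, 3, 1],   # T3
                  [0, 2, 2],   # T4
                  [3, 0, 0],   # T5
                  [0, 1, 3]])  # T6

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(terms)
print(km.labels_)            # cluster assignment for T1..T6
print(km.cluster_centers_)   # final cluster centroids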

16 Clustering methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

17 Hierarchical methods
Group objects into a tree of clusters. Types:
Agglomerative (bottom-up) approach:
Single linkage
Complete linkage
Group linkage
Centroid linkage
Ward’s method
Divisive (top-down) approach:
Use of k-means clustering
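A short sketch of agglomerative clustering with SciPy, where the linkage method selects the merge criterion (single, complete, average, centroid, or Ward); the 2-D points are hypothetical.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [9.0, 0.5]])

Z = linkage(X, method="ward")    # or "single", "complete", "average", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)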

18 Hierarchical methods
(figure: agglomerative merging and divisive splitting of objects a, b, c, d, e, step by step)

19 Hierarchical methods: Ward’s method
At each step, join the cluster pair whose merger minimizes the increase in the total within-group error sum of squares.
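Concretely, a standard way to write the increase minimized at each step (not spelled out on the slide) is, in LaTeX notation:

\Delta\mathrm{ESS}(A, B) = \frac{n_A\, n_B}{n_A + n_B}\, \lVert \bar{x}_A - \bar{x}_B \rVert^2

where n_A and n_B are the sizes of the two clusters and \bar{x}_A, \bar{x}_B their centroids.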

20 (figure-only slide; no transcript text)

21 Clustering methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

22 Graph Representation
The similarity matrix can be visualized as an undirected graph: each item is represented by a node, and an edge connects two items that are similar (a 1 in the thresholded similarity matrix).
(figure: graph over T1-T8)

23 Clustering Algorithms (Graph-based)
Basic clustering techniques try to determine which objects belong to the same class.
Clique method (complete link):
All items within a cluster must be within the similarity threshold of all other items in that cluster
Clusters may overlap
Generally produces small but very tight clusters
Single link method:
Any item in a cluster must be within the similarity threshold of at least one other item in that cluster
Produces larger but weaker clusters
Other methods:
Star method: start with an item and place all related items in that cluster
String method: start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on

24 Clustering Algorithms (Graph-based)
Clique Method:
A clique is a completely connected subgraph of a graph.
In the clique method, each maximal clique in the graph becomes a cluster.
The maximal cliques (and therefore the clusters) in the previous example are:
{T1, T3, T4, T6} {T2, T4, T6} {T2, T6, T8} {T1, T5} {T7}
Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.
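Sketch of the clique method via NetworkX's maximal-clique enumeration. The edge list below is a reconstruction consistent with the slides' example; the exact edges of the original figure are not in the transcript.

import networkx as nx

G = nx.Graph([("T1", "T3"), ("T1", "T4"), ("T1", "T6"), ("T1", "T5"),
              ("T3", "T4"), ("T3", "T6"), ("T4", "T6"),
              ("T2", "T4"), ("T2", "T6"), ("T2", "T8"), ("T6", "T8")])
G.add_node("T7")                       # isolated node

clusters = list(nx.find_cliques(G))    # each maximal clique becomes a cluster
print(clusters)
# e.g. [['T1','T3','T4','T6'], ['T1','T5'], ['T2','T4','T6'], ['T2','T6','T8'], ['T7']]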

25 Clustering Algorithms (Graph-based)
Single Link Method:
1. Select an item not in a cluster and place it in a new cluster
2. Place all other related items in that cluster
3. Repeat step 2 for each item in the cluster until nothing more can be added
4. Repeat steps 1-3 for each item that remains unclustered
In this case the single link method produces only two clusters:
{T1, T3, T4, T5, T6, T2, T8} {T7}
Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
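Sketch: the single link result corresponds to the connected components of the same graph (same reconstructed edge list as above).

import networkx as nx

G = nx.Graph([("T1", "T3"), ("T1", "T4"), ("T1", "T6"), ("T1", "T5"),
              ("T3", "T4"), ("T3", "T6"), ("T4", "T6"),
              ("T2", "T4"), ("T2", "T6"), ("T2", "T8"), ("T6", "T8")])
G.add_node("T7")

clusters = [sorted(c) for c in nx.connected_components(G)]
print(clusters)   # [['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T8'], ['T7']]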

26 Clustering Algorithms (Graph-based)
Star method: {T1, T3, T4, T5, T6} {T2, T8} {T7}
String method: {T1, T3, T4, T2, T6, T8} {T5}

27 Clustering methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

28 Density-based methods
Clusters are density-connected sets.
DBSCAN algorithm
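A short scikit-learn sketch of DBSCAN; the synthetic data and the eps/min_samples settings are hypothetical.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.3, size=(30, 2)),   # dense cluster 1
               rng.normal([4, 4], 0.3, size=(30, 2)),   # dense cluster 2
               [[10.0, 10.0]]])                         # an isolated outlier

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
print(db.labels_)   # cluster id per point; -1 marks noise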

29 Density-based methods
Based on a set of density distribution functions

30 Density-based methods
Based on a set of density distribution functions
(figure: density function)

31 Clustering methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

32 Grid-based methods
Organize the data space as a grid file
Determine clusters as density-connected components of the grid
Approximates the clusters found by DBSCAN
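A rough sketch of the grid-based idea for 2-D data: bin points into cells, keep the dense cells, and merge neighbouring dense cells into clusters. The cell size and density threshold are hypothetical, and this is a simplification rather than any specific published algorithm.

import numpy as np

def grid_cluster(X, cell_size=1.0, min_points=5):
    # assign each point to a grid cell
    cells = {}
    for p in X:
        key = tuple((p // cell_size).astype(int))
        cells.setdefault(key, []).append(p)
    # keep only cells that are dense enough
    dense = {k for k, pts in cells.items() if len(pts) >= min_points}
    # merge dense cells that touch (share a side or corner) into clusters
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            c = stack.pop()
            if c in component:
                continue
            component.add(c)
            seen.add(c)
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (c[0] + dx, c[1] + dy)
                    if nb in dense:
                        stack.append(nb)
        clusters.append(component)
    return clusters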

33 Clustering methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Graph-based methods

34 Model-based methods
Optimize the fit between the given data and some mathematical model.
(figure: map unit with an N-dimensional centroid vector)

35 Self-Organizing Map (SOM)
A sample data vector X is chosen at random.
BMU (best matching unit): the map unit whose centroid is closest to X.
Update the centroid vectors using a neighborhood kernel function and a learning rate.
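A minimal NumPy sketch of the SOM training loop described above; the grid size, learning-rate and neighborhood schedules, and data are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # input samples
grid_h, grid_w, dim = 5, 5, 3
W = rng.normal(size=(grid_h, grid_w, dim))    # centroid vector of each map unit
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                              indexing="ij"), axis=-1)

n_steps, lr0, sigma0 = 2000, 0.5, 2.0
for t in range(n_steps):
    x = X[rng.integers(len(X))]               # sample data vector chosen at random
    # BMU: the map unit whose centroid is closest to x
    dists = np.linalg.norm(W - x, axis=-1)
    bmu = np.unravel_index(dists.argmin(), dists.shape)
    # decaying learning rate and Gaussian neighborhood kernel on the map grid
    lr = lr0 * (1 - t / n_steps)
    sigma = sigma0 * (1 - t / n_steps) + 0.1
    grid_dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
    h = np.exp(-grid_dist2 / (2 * sigma ** 2))
    # move each unit's centroid towards x, weighted by the kernel
    W += lr * h[..., None] * (x - W)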

36 Self-Organizing Map
(figure: input sample and output layer, before and after the update)

37 Self-Organizing Map (SOM)

