Clustering. What is Clustering? Unsupervised learning Seeks to organize data into “reasonable” groups Often based on some similarity (or distance) measure.

Clustering

What is Clustering?
Unsupervised learning
Seeks to organize data into “reasonable” groups
Often based on some similarity (or distance) measure defined over data elements
Quantitative characterization may include:
– Centroid / Medoid
– Radius
– Diameter
– Compactness
– Etc.

Clustering Taxonomy
Partitional methods:
– Algorithm produces a single partition or clustering of the data elements
Hierarchical methods:
– Algorithm produces a series of nested partitions, each of which represents a possible clustering of the data elements
Symbolic methods:
– Algorithm produces a hierarchy of concepts

Distance-based Clustering

K-means Overview
Algorithm builds a single k-subset partition
Works with numeric data only
Starts with k random centroids
Iteratively re-assigns data items to clusters, based on their distance to the centroids, until all assignments remain unchanged

K-means Algorithm
1) Pick a number, k, of cluster centers (at random; they do not have to be data items)
2) Assign every item to its nearest cluster center (e.g., using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2 and 3 until convergence (e.g., change in cluster assignments less than a threshold)
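The four steps above can be sketched in plain Python. This is a minimal illustration, not the slides' own code: the function name `kmeans`, the `max_iter` and `seed` parameters, and the choice to draw the initial centers from the data items are all assumptions made for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means sketch on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k initial centers (assumption: sampled from the data items).
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every item to its nearest center (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Step 4: stop once no assignment changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned items.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
```

On this small, well-separated toy data set the two groups are recovered; on harder data the result depends on the initial centers, so several restarts with different seeds are advisable.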

k-Means Example (I): Pick 3 initial cluster centers k1, k2, k3 (randomly). [Figure: data points in the X-Y plane]

k-Means Example (II): Assign each point to the closest cluster center. [Figure: X-Y plane with k1, k2, k3]

k-Means Example (III): Move each cluster center to the mean of each cluster. [Figure: old and new positions of k1, k2, k3]

k-Means Example (IV): Reassign points closest to a different new cluster center. Q: Which points are reassigned? [Figure: X-Y plane with k1, k2, k3]

k-Means Example (V): A: three points (shown with animation). [Figure: X-Y plane with k1, k2, k3]

k-Means Example (VI): Re-compute cluster means. [Figure: X-Y plane with k1, k2, k3]

k-Means Example (VII): Move cluster centers to cluster means, then re-assign. No change. [Figure: X-Y plane with k1, k2, k3]

K-means Discussion
Result can vary significantly depending on initial choice of seeds
Can get trapped in a local minimum
– Example: [Figure: instances and initial cluster centers]
– Remedy: restart with different random seeds
Does not handle outliers well
Does not scale very well

K-means Summary
Advantages
– Simple, understandable
– Items automatically assigned to clusters
Disadvantages
– Must pick number of clusters beforehand
– All items forced into a cluster
– Sensitive to outliers
Time complexity is O(mkn), where m is the number of iterations, k the number of clusters, and n the number of items

K-medoids Overview
Also known as Partitioning Around Medoids (PAM)
Similar to K-means:
– Algorithm builds a single k-subset partition
– Works with numeric data only
– Uses iterative re-assignment of medoids as long as overall clustering quality improves
Difference with K-means:
– Starts with k random medoids and sticks to medoids (i.e., data points)

K-medoids Quality Measures
Clustering quality:
– Sum of all distances from each non-medoid object to the medoid of the cluster it is in (an item is assigned to the cluster represented by the medoid to which it is closest)
Quality impact:
– $C_{jih}$ = cost change for item j associated with swapping medoid i for non-medoid h
– Total impact to clustering quality by medoid change (h replaces i): $TC_{ih} = \sum_j C_{jih}$

K-medoids Algorithm
1) Pick a number, k, of random data items as medoids
2) Calculate the pair (m, n) of medoid/non-medoid with the smallest impact on clustering quality, $TC_{mn}$
3) If $TC_{mn} < 0$, replace m by n and go back to 2
4) Assign every item to its nearest medoid
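The swap loop above can be sketched as follows. This is an illustrative, unoptimized version: the names `total_cost` and `pam`, the `seed` parameter, and the brute-force recomputation of the total cost for every candidate swap are assumptions for the example, not the slides' implementation.

```python
import random

def total_cost(points, medoids, dist):
    # Clustering quality: sum of distances from each item to its nearest medoid.
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k, dist, seed=0):
    """Minimal K-medoids sketch; medoids are indices into `points`."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)       # step 1: k random data items
    while True:
        current = total_cost(points, medoids, dist)
        best_change, best_swap = 0.0, None
        # Step 2: find the (medoid, non-medoid) swap with the most negative impact.
        for i in range(k):
            for n in range(len(points)):
                if n in medoids:
                    continue
                trial = medoids[:i] + [n] + medoids[i + 1:]
                change = total_cost(points, trial, dist) - current
                if change < best_change:
                    best_change, best_swap = change, (i, n)
        if best_swap is None:                          # no swap with impact < 0
            break
        i, n = best_swap                               # step 3: swap and repeat
        medoids[i] = n
    # Step 4: assign every item to its nearest medoid.
    labels = [min(medoids, key=lambda m: dist(p, points[m])) for p in points]
    return medoids, labels

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
medoids, labels = pam(points, k=2, dist=lambda a, b: abs(a - b))
```

Because only a distance function is needed, the same sketch works for any item type for which `dist` is defined. Each accepted swap strictly lowers the total cost, so the loop terminates.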

K-medoids Example (I)
Assume k=2
Select X5 and X9 as medoids
Current clustering: {X1,X2,X5,X6,X7}, {X3,X4,X8,X9,X10}

K-medoids Example (II)
Must try to replace X5 by X1, X2, X3, X4, X6, X7, X8, X10
Must try to replace X9 by X1, X2, X3, X4, X6, X7, X8, X10
Replace X5 by X4: -9
Replace X5 by X6: -5
Replace X5 by X7: 0
Replace X5 by X8: -1
Replace X5 by X10: 1
→ Replace X5 by X3
Replace X9 by X1: -7
Replace X9 by X2: -5
Replace X9 by X3: -7
Replace X9 by X4: -8
Replace X9 by X6: -3
Replace X9 by X7: 5
Replace X9 by X8: -1
Replace X9 by X10: -4

K-medoids Example (III)
X3 and X9 are new medoids
Current clustering: {X1,X2,X3,X4}, {X5,X6,X7,X8,X9,X10}
No change in medoids yields better quality → DONE! (I think)

K-medoids Discussion
As in K-means, user must select the value of k, but the resulting clustering is typically less sensitive to the initial choice of medoids
Handles outliers well
Does not scale well
– CLARA and CLARANS improve on the time complexity of K-medoids by using sampling and neighborhoods

K-medoids Summary
Advantages
– Simple, understandable
– Items automatically assigned to clusters
– Handles outliers
Disadvantages
– Must pick number of clusters beforehand
– High time complexity

Hierarchical Clustering
Two approaches:
– Agglomerative: Each instance is initially its own cluster. The most similar instances/clusters are then progressively combined until all instances are in one cluster. Each level of the hierarchy is a different set/grouping of clusters.
– Divisive: Start with all instances as one cluster and progressively divide until all instances are their own cluster. You can then decide what level of granularity you want to output.

Hierarchical Agglomerative Clustering
1. Assign each data item to its own cluster
2. Compute pairwise distances between clusters
3. Merge the two closest clusters
4. If more than one cluster is left, go to step 2

Cluster Distances
Complete-link
– Maximum pairwise distance between the items of two different clusters
Single-link
– Minimum pairwise distance between the items of two different clusters
Average-link
– Average pairwise distance between the items of two different clusters
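A naive sketch of agglomerative clustering with the three cluster distances above. The names `LINKAGES` and `hac` are hypothetical, and the brute-force recomputation of all pairwise cluster distances at every step is an assumption made for clarity; it is far slower than the best known implementations.

```python
from itertools import combinations

# Each linkage reduces the list of pairwise item distances between two clusters.
LINKAGES = {
    "single": min,                                # minimum pairwise distance
    "complete": max,                              # maximum pairwise distance
    "average": lambda ds: sum(ds) / len(ds),      # average pairwise distance
}

def hac(points, dist, linkage="single"):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    link = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]  # each item starts as its own cluster
    merges = []
    while len(clusters) > 1:
        def cluster_distance(pair):
            a, b = pair
            return link([dist(points[i], points[j])
                         for i in clusters[a] for j in clusters[b]])
        # Find and merge the two closest clusters.
        a, b = min(combinations(range(len(clusters)), 2), key=cluster_distance)
        merges.append((sorted(clusters[a]), sorted(clusters[b]), cluster_distance((a, b))))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]                           # safe because a < b
    return merges

points = [0.0, 1.0, 3.0, 7.0]
abs_dist = lambda x, y: abs(x - y)
single = [d for _, _, d in hac(points, abs_dist, "single")]
complete = [d for _, _, d in hac(points, abs_dist, "complete")]
```

The merge distances are what a dendrogram plots on its vertical axis; cutting the merge history at a distance threshold yields one particular clustering.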

HAC Example Assume single-link

HAC Demo http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html

HAC Discussion
No need to specify number of clusters
– The number of clusters may also be chosen after the dendrogram is built
Still need to know when to stop:
– Too early → clustering too fine
– Too late → clustering too coarse
Trade one parameter (K) for another (distance threshold)?

Picking “the” Threshold
Guessing (sub-optimal)
Looking for “jumps” in the distance function (subjective)
Human examination (expensive, unreasonable)
Semi-supervised learning
– Must-link, cannot-link constraints
– Stated explicitly or implicitly (e.g., through labels)

Simple Solution
Select random sample S of items
Label items in S
Cluster S
Find the threshold value T that maximizes some clustering quality measure on S
Cluster complete dataset up to T
|S| = 50 was shown to give reasonable results

HAC Summary
Best implementation is $O(n^2 \log n)$
All clusterings in one pass
Impact of cluster distance:
– Single link: can lead to long chained clusters where some points are quite far from each other
– Complete link: finds more compact clusters
– Average link: used less because the average has to be re-computed each time

Clustering Evaluation

Quality Metrics
External metrics
– Computed from a comparison to actual clustering (or classification) labels.
Internal metrics
– Computed solely from the composition of a cluster, with no recourse to external information.

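External Metrics – Setup (I)
Let TC and CC be the target and computed clusterings, respectively.
Both partition the same data: the union of the clusters in TC = the union of the clusters in CC = the original set of data.
Define the following: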

External Metrics – Setup (II)
– a: number of pairs of items that belong to the same cluster in both CC and TC
– b: number of pairs of items that belong to different clusters in both CC and TC
– c: number of pairs of items that belong to the same cluster in CC but different clusters in TC
– d: number of pairs of items that belong to the same cluster in TC but different clusters in CC
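The four pair counts can be computed directly from two label lists. In this sketch the function name `pair_counts` and the representation of each clustering as a list of cluster labels (one per item) are assumptions made for illustration.

```python
from itertools import combinations

def pair_counts(target, computed):
    """Count item pairs by agreement between target (TC) and computed (CC) labels."""
    a = b = c = d = 0
    for i, j in combinations(range(len(target)), 2):
        same_tc = target[i] == target[j]
        same_cc = computed[i] == computed[j]
        if same_cc and same_tc:
            a += 1          # same cluster in both CC and TC
        elif not same_cc and not same_tc:
            b += 1          # different clusters in both
        elif same_cc:
            c += 1          # same in CC, different in TC
        else:
            d += 1          # same in TC, different in CC
    return a, b, c, d

target = [0, 0, 0, 1, 1, 1]
computed = [0, 0, 1, 1, 2, 2]
counts = pair_counts(target, computed)   # (2, 8, 1, 4)
```

The four counts always sum to n(n-1)/2, the number of item pairs.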

F-measure
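The slide's formula is not reproduced in this transcript. One common pairwise definition, assumed here, treats a as true positives, c as false positives, and d as false negatives, using the pair counts a, b, c, d defined earlier. The function name `pairwise_f_measure` is hypothetical.

```python
def pairwise_f_measure(a, b, c, d):
    """Pairwise F-measure from pair counts (assumed definition, see lead-in)."""
    if a == 0:
        return 0.0
    precision = a / (a + c)   # of the pairs grouped together in CC, fraction also together in TC
    recall = a / (a + d)      # of the pairs grouped together in TC, fraction also together in CC
    return 2 * precision * recall / (precision + recall)

f = pairwise_f_measure(2, 8, 1, 4)   # precision 2/3, recall 1/3
```

Note that b, the count of pairs separated in both clusterings, does not enter the F-measure under this definition.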

Rand Index Measure of clustering agreement: how similar are these two ways of partitioning the data?

Adjusted Rand Index Extension of the Rand index that attempts to account for items that may have been clustered by chance
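Both indices can be written in terms of the pair counts a, b, c, d defined earlier. The Rand index is the fraction of pairs on which the two clusterings agree; the adjusted form below is the standard chance-corrected version expressed with the same counts. The function names are hypothetical.

```python
def rand_index(a, b, c, d):
    # Fraction of item pairs on which CC and TC agree.
    return (a + b) / (a + b + c + d)

def adjusted_rand_index(a, b, c, d):
    # Chance-corrected agreement; 1.0 for identical partitions, about 0 for random ones.
    denominator = (a + d) * (d + b) + (a + c) * (c + b)
    if denominator == 0:      # degenerate case, e.g. both clusterings are one big cluster
        return 1.0
    return 2 * (a * b - c * d) / denominator

ri = rand_index(2, 8, 1, 4)             # 10 of 15 pairs agree
ari = adjusted_rand_index(2, 8, 1, 4)   # 24 / 99
```

The adjusted value is noticeably lower than the raw Rand index here, because many of the agreeing pairs would be expected to agree by chance alone.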

Average Entropy Measure of purity with respect to the target clustering

V-measure
Entropy-based
Combines:
– Homogeneity (H): all computed clusters contain only items which are members of the same target cluster/class
– Completeness (C): all data items of a given target cluster/class are in the same computed cluster
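A sketch of how homogeneity and completeness can be computed from entropies. It assumes the usual conditional-entropy definitions, homogeneity = 1 - H(target | computed) / H(target) and completeness = 1 - H(computed | target) / H(computed), combined by their harmonic mean; the function names are hypothetical.

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((m / n) * log(m / n) for m in Counter(labels).values())

def conditional_entropy(labels, given):
    # H(labels | given), estimated from label co-occurrence counts.
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum((m / n) * log(m / given_counts[g]) for (g, _), m in joint.items())

def v_measure(target, computed):
    h_target, h_computed = entropy(target), entropy(computed)
    homogeneity = 1.0 if h_target == 0 else 1.0 - conditional_entropy(target, computed) / h_target
    completeness = 1.0 if h_computed == 0 else 1.0 - conditional_entropy(computed, target) / h_computed
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v

# Putting every item in its own cluster is perfectly homogeneous but not complete.
h, c, v = v_measure([0, 0, 1, 1], [0, 1, 2, 3])
```

The example shows why both components are needed: a clustering of singletons scores 1.0 on homogeneity, and only completeness penalizes it.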

Internal Metrics
Assume no target clustering
Attempt to capture some intrinsic properties of the clustering
– Distance-based: sum of squares; diameter sum or maximum; sum of minimum distance between clusters; etc.
– Probability-based: e.g., KL divergence
Just a sample presented here

Compactness
Members of a cluster are all similar and close together
– One measure of compactness of a cluster is the sum of the squared distances of cluster instances to the cluster centroid: $Comp(C) = \sum_{x \in X_c} (c - x)^2$, where c is the centroid of a cluster C, made up of instances $X_c$. A lower value means a more compact cluster.
– The overall compactness of a particular clustering is just the sum of the compactness of the individual clusters
– Numeric way to compare different clusterings by seeking clusterings which minimize the compactness metric
The metric is usually at its lowest (i.e., compactness is best) when there are very many clusters
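A small sketch of the compactness metric, with clusters given as lists of numeric tuples. The function names and the toy data are assumptions for illustration.

```python
def centroid(cluster):
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def compactness(cluster):
    # Sum of squared distances of the cluster's instances to its centroid.
    c = centroid(cluster)
    return sum(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for x in cluster)

def clustering_compactness(clusters):
    # Overall compactness: sum over the individual clusters.
    return sum(compactness(cl) for cl in clusters)

pts = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
one_cluster = clustering_compactness([pts])                   # 206.0
two_clusters = clustering_compactness([pts[:2], pts[2:]])     # 4.0
singletons = clustering_compactness([[p] for p in pts])       # 0.0
```

The value drops all the way to zero when every item is its own cluster, which is why this metric cannot be used on its own to choose the number of clusters.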

Separability (I)
Members of a cluster are sufficiently different from members of another cluster (i.e., cluster dissimilarity)
– One measure of the separability of two clusters is their squared distance $dist_{ij} = (c_i - c_j)^2$, where $c_i$ and $c_j$ are the cluster centroids
– Other distance measures are possible (single link, etc.)
For a clustering, which cluster distances should we compare?

Separability (II)
For each cluster, add in the distance to its closest neighbor cluster
– Goal: maximize separability
Separability is usually maximized when there are very few clusters
– Squared distance amplifies larger distances

Davies-Bouldin Index
Combine compactness and separability into one metric
Example: Davies-Bouldin r index, where $|X_c|$ is the number of elements in the cluster represented by centroid c, and $|C|$ is the total number of clusters in the clustering
Intuition:
– Want small scatter: small compactness with many cluster members
– For each cluster i, $r_i$ is maximized for a close neighbor cluster with high scatter (it measures the worst-case close neighbor; we hope these values are as low as possible)
– Total clustering score r is just the sum of the $r_i$. Lower r is better.
– The Davies-Bouldin score r finds a balance between separability (distance) being large and compactness (scatter) being small
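The slide's formula is not reproduced in this transcript, so the following is only a sketch under stated assumptions: scatter of a cluster = its compactness divided by its size, distance between clusters = squared centroid distance, $r_i$ = the largest value of (scatter_i + scatter_j) / dist_ij over the other clusters j, and r = the sum of the $r_i$ as described in the text. Other formulations average the $r_i$ over the number of clusters or use unsquared distances. All function names are hypothetical.

```python
def centroid(cluster):
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def scatter(cluster):
    # Assumed definition: compactness normalized by cluster size.
    c = centroid(cluster)
    return sum(sq_dist(x, c) for x in cluster) / len(cluster)

def davies_bouldin_r(clusters):
    """Sum over clusters of the worst-case (scatter_i + scatter_j) / dist_ij."""
    cents = [centroid(cl) for cl in clusters]
    scats = [scatter(cl) for cl in clusters]
    total = 0.0
    for i in range(len(clusters)):
        total += max(
            (scats[i] + scats[j]) / sq_dist(cents[i], cents[j])
            for j in range(len(clusters)) if j != i
        )
    return total

tight = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (10.0, 12.0)]]
loose = [[(0.0, 0.0), (10.0, 10.0)], [(2.0, 0.0), (10.0, 12.0)]]
r_tight = davies_bouldin_r(tight)
r_loose = davies_bouldin_r(loose)
```

Under these assumptions the well-separated grouping gets a much lower score than the mixed one, matching the "lower r is better" reading. The sketch assumes at least two clusters with distinct centroids.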

KL Divergence (I) Measure of the difference in probability distributions between clusters How do we compute over a clustering, i.e., how do we obtain KL(CC)?

KL Divergence (II)
Compute KL as follows (one possibility):
1. For each cluster
a) Randomly split cluster in half
b) Compute resulting KL
2. Repeat step 1 N times and compute average
3. Sum averages of all clusters to produce the KL of the clustering

Summary
Two types of evaluation
– External
– Internal
Various metrics have been proposed
Handle with care
– Depends on the application
– Some element of subjectivity

DBSCAN
DBSCAN is a density-based algorithm.
– Density = number of points within a specified radius (Eps)
– A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points that are at the interior of a cluster.
– A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
– A noise point is any point that is not a core point or a border point.

DBSCAN: Core, Border, and Noise Points

DBSCAN Algorithm Eliminate noise points Perform clustering on the remaining points
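A compact sketch of the idea: find core points, grow clusters outward from them, and leave everything unreachable as noise. The function name `dbscan` and the brute-force neighbor search are assumptions for illustration. One convention is also assumed: a point counts itself as a neighbor and is core when it has at least `min_pts` points within `eps`; other texts use a strict "more than" or exclude the point itself.

```python
def dbscan(points, eps, min_pts, dist):
    """Return (labels, kinds); label None marks a noise point."""
    n = len(points)
    # Eps-neighborhoods (brute force; each point is in its own neighborhood).
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] is not None:
            continue
        labels[i] = cluster_id
        frontier = [i]                      # holds core points only
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] is None:
                    labels[q] = cluster_id  # core or border point joins the cluster
                    if core[q]:
                        frontier.append(q)  # only core points keep expanding
        cluster_id += 1
    kinds = ["core" if core[i] else ("border" if labels[i] is not None else "noise")
             for i in range(n)]
    return labels, kinds

points = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0, 50.0]
labels, kinds = dbscan(points, eps=1.0, min_pts=3, dist=lambda a, b: abs(a - b))
```

In the toy run, the isolated point at 50.0 is never reached from a core point and stays labeled as noise, while the two dense runs form separate clusters.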

DBSCAN: Core, Border and Noise Points
[Figure: original points, and point types (core, border and noise) for Eps = 10, MinPts = 4]

When DBSCAN Works Well
[Figure: original points and resulting clusters]
Resistant to noise
Can handle clusters of different shapes and sizes

When DBSCAN Does NOT Work Well
[Figure: original points, and clusterings for (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92)]
Varying densities
High-dimensional data

DBSCAN: Determining EPS and MinPts
Idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance
Noise points have the k-th nearest neighbor at a farther distance
So, plot the sorted distance of every point to its k-th nearest neighbor
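The values that go into such a plot can be computed as below. The function name `sorted_k_distances` is hypothetical and the neighbor search is brute force; plotting the returned list (for instance with a plotting library) is left out.

```python
def sorted_k_distances(points, k, dist):
    """Distance from every point to its k-th nearest neighbor, sorted ascending."""
    k_dists = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(others[k - 1])
    return sorted(k_dists)

points = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0, 50.0]
curve = sorted_k_distances(points, k=2, dist=lambda a, b: abs(a - b))
# curve == [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 38.0]
```

The sharp jump at the end of the curve corresponds to the noise point; a value of Eps just below the jump is a reasonable starting choice, with MinPts set from k.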

Clustering Evaluation

Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
– Accuracy, precision, recall
For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters
But “clusters are in the eye of the beholder”!
Then why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters

Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the “correct” number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types.
– External Index: Used to measure the extent to which cluster labels match externally supplied class labels. Example: entropy
– Internal Index: Used to measure the goodness of a clustering structure without respect to external information. Example: Sum of Squared Error (SSE)
– Relative Index: Used to compare two different clusterings or clusters. Often an external or internal index is used for this function, e.g., SSE or entropy
Sometimes these are referred to as criteria instead of indices
– However, sometimes criterion is the general strategy and index is the numerical measure that implements the criterion.

Measuring Cluster Validity Via Correlation
Two matrices
– Proximity matrix
– “Incidence” matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 if the associated pair of points belongs to different clusters
Compute the correlation between the two matrices
– Since the matrices are symmetric, only the correlation between n(n-1)/2 entries needs to be calculated.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density or contiguity based clusters.
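A sketch of the correlation computation over the n(n-1)/2 distinct pairs. It assumes the proximity matrix holds distances, in which case a good clustering gives a strongly negative correlation (same-cluster pairs have incidence 1 and small distance). The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def incidence_proximity_correlation(points, labels, dist):
    # Only the upper triangle is needed because both matrices are symmetric.
    pairs = list(combinations(range(len(points)), 2))
    proximity = [dist(points[i], points[j]) for i, j in pairs]
    incidence = [1.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]
    return pearson(incidence, proximity)

points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
abs_dist = lambda a, b: abs(a - b)
good = incidence_proximity_correlation(points, [0, 0, 0, 1, 1, 1], abs_dist)
bad = incidence_proximity_correlation(points, [0, 1, 0, 1, 0, 1], abs_dist)
```

With a similarity matrix instead of a distance matrix, the sign flips and a good clustering gives a strongly positive correlation.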

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets. [Figure: two data sets, with Corr = -0.9235 and Corr = -0.5810]

Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually.

Clusters in random data are not so crisp. [Figure: DBSCAN]

Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp. [Figure: K-means]

Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp. [Figure: Complete Link]

Using Similarity Matrix for Cluster Validation
Clusters in more complicated figures aren't well separated. [Figure: DBSCAN]

Internal Measures: SSE
Internal Index: Used to measure the goodness of a clustering structure without respect to external information
– SSE
SSE is good for comparing two clusterings or two clusters (average SSE).
Can also be used to estimate the number of clusters

SSE curve for a more complicated data set. [Figure: SSE of clusters found using K-means]

Framework for Cluster Validity
Need a framework to interpret any measure.
– For example, if our measure of evaluation has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity
– The more “atypical” a clustering result is, the more likely it represents valid structure in the data
– Can compare the values of an index that result from random data or clusterings to those of a clustering result. If the value of the index is unlikely, then the cluster results are valid
– These approaches are more complicated and harder to understand.
For comparing the results of two different sets of cluster analyses, a framework is less necessary.
– However, there is the question of whether the difference between two index values is significant

Statistical Framework for SSE
Example
– Compare SSE of 0.005 against three clusters in random data
– Histogram shows SSE of three clusters in 500 sets of random data points of size 100 distributed over the range 0.2 – 0.8 for x and y values

Statistical Framework for Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets. [Figure: two data sets, with Corr = -0.9235 and Corr = -0.5810]

Internal Measures: Cohesion and Separation
Cluster Cohesion: Measures how closely related the objects in a cluster are
– Example: SSE
Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters
Example: Squared Error
– Cohesion is measured by the within cluster sum of squares (SSE): $WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2$
– Separation is measured by the between cluster sum of squares: $BSS = \sum_i |C_i| (m - m_i)^2$
– Where $|C_i|$ is the size of cluster i, $m_i$ is the mean of cluster i, and m is the overall mean

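Internal Measures: Cohesion and Separation
Example: SSE
– BSS + WSS = constant
[Figure: points on a number line from 1 to 5, with overall mean m and cluster means m1 and m2; the sums are worked out for K=1 cluster and for K=2 clusters]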
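The identity BSS + WSS = constant can be checked numerically. The sketch below uses hypothetical one-dimensional points 1, 2, 4, 5 (chosen for illustration; the figure's actual values are not reproduced in this transcript) and a hypothetical function name `wss_bss`.

```python
def wss_bss(clusters):
    """Within- and between-cluster sums of squares for 1-D clusters."""
    all_points = [x for cl in clusters for x in cl]
    m = sum(all_points) / len(all_points)          # overall mean
    wss = bss = 0.0
    for cl in clusters:
        m_i = sum(cl) / len(cl)                    # cluster mean
        wss += sum((x - m_i) ** 2 for x in cl)     # cohesion
        bss += len(cl) * (m - m_i) ** 2            # separation, weighted by cluster size
    return wss, bss

k1 = wss_bss([[1.0, 2.0, 4.0, 5.0]])               # (10.0, 0.0)
k2 = wss_bss([[1.0, 2.0], [4.0, 5.0]])             # (1.0, 9.0)
```

In both cases the two sums add up to 10: splitting the data moves variance from the within-cluster term to the between-cluster term without changing the total.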

Internal Measures: Cohesion and Separation
A proximity graph based approach can also be used for cohesion and separation.
– Cluster cohesion is the sum of the weights of all links within a cluster.
– Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.
[Figure: cohesion and separation]

Internal Measures: Silhouette Coefficient
The Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings
For an individual point, i
– Calculate a = average distance of i to the points in its cluster
– Calculate b = min (average distance of i to points in another cluster)
– The silhouette coefficient for a point is then given by s = 1 – a/b if a < b (or s = b/a – 1 if a ≥ b, not the usual case)
– Typically between 0 and 1.
– The closer to 1 the better.
Can calculate the average silhouette width for a cluster or a clustering
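A sketch of the per-point computation. It uses the equivalent single expression (b - a) / max(a, b), which reduces to 1 - a/b when a < b and to b/a - 1 when a ≥ b. The function name `silhouette` and the convention of returning 0.0 for degenerate cases are assumptions for illustration.

```python
def silhouette(i, points, labels, dist):
    """Silhouette coefficient of point i; assumes at least two clusters."""
    def mean_dist_to(cluster, skip_self):
        ds = [dist(points[i], points[j]) for j in range(len(points))
              if labels[j] == cluster and not (skip_self and j == i)]
        return sum(ds) / len(ds) if ds else 0.0
    a = mean_dist_to(labels[i], skip_self=True)           # cohesion
    b = min(mean_dist_to(c, skip_self=False)              # separation
            for c in set(labels) if c != labels[i])
    if max(a, b) == 0:
        return 0.0
    return (b - a) / max(a, b)

points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
abs_dist = lambda x, y: abs(x - y)
s0 = silhouette(0, points, labels, abs_dist)              # a = 1.5, b = 11
average = sum(silhouette(i, points, labels, abs_dist) for i in range(len(points))) / len(points)
```

Averaging the per-point values over a cluster or over the whole data set gives the average silhouette width mentioned above.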

External Measures of Cluster Validity: Entropy and Purity

Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
Algorithms for Clustering Data, Jain and Dubes

Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
– For each point, the error is the distance to the nearest cluster center
– To get SSE, we square these errors and sum them: $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$
– x is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$; one can show that $m_i$ corresponds to the center (mean) of the cluster
– Given two clusterings, we can choose the one with the smallest error
– One easy way to reduce SSE is to increase K, the number of clusters
A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
