
1 CLUSTERING AND SEGMENTATION MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining. http://www-users.cs.umn.edu/~kumar/dmbook/

2 What is Cluster Analysis? Unsupervised machine learning: grouping data so that elements in a group are similar (or related) to one another and different from (or unrelated to) elements in other groups. Distance within clusters is minimized; distance between clusters is maximized.
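
To make the within-cluster versus between-cluster idea concrete, here is a small illustrative Python sketch; the two groups of points are made up for this example, not taken from the slides.

```python
# Two hand-made groups of 2-D points: distances within each group are small,
# while the distance between the two group centers is large.
import numpy as np

cluster_a = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])
cluster_b = np.array([[5.0, 5.0], [5.3, 4.8], [4.9, 5.2]])

centroid_a = cluster_a.mean(axis=0)
centroid_b = cluster_b.mean(axis=0)

within_a = np.linalg.norm(cluster_a - centroid_a, axis=1).mean()  # small
within_b = np.linalg.norm(cluster_b - centroid_b, axis=1).mean()  # small
between = np.linalg.norm(centroid_a - centroid_b)                 # large

print(f"average within-cluster distances: {within_a:.2f}, {within_b:.2f}")
print(f"distance between centroids:       {between:.2f}")
```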

3 Applications Understanding: group related documents for browsing, create groups of similar customers, discover which stocks have similar price fluctuations. Summarization: reduce the size of large data sets, since each group of similar items can be treated as a single data point.

4 Even more examples Marketing: discover distinct customer groups for targeted promotions. Insurance: find "good customers" (low claim costs, reliable premium payments). Healthcare: find patients with high-risk behaviors.

5 What cluster analysis is NOT Manual ("supervised") classification: people simply place items into categories. Simple segmentation: dividing students into groups by last name. Main idea: the clusters must come from the data, not from external specifications. Creating the "buckets" beforehand is categorization, not clustering.

6 Two clustering techniques Partitional: non-overlapping subsets (clusters) such that each data object is in exactly one subset. Hierarchical: a set of nested clusters organized as a hierarchical tree; does not require a pre-defined number of clusters.

7 Partitional Clustering Three distinct groups emerge, but some curveballs behave more like splitters, and some splitters look more like fastballs.

8 Hierarchical Clustering This is a dendrogram: a tree diagram used to represent the nested clusters (here, of points p1 through p5).
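
A minimal sketch of building a dendrogram in Python with SciPy; the five 2-D points standing in for p1 through p5 are made up, and Ward linkage is just one common choice.

```python
# Agglomerative (hierarchical) clustering of five made-up points, drawn as a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[0.1, 0.2],   # p1
                   [0.2, 0.1],   # p2
                   [0.9, 0.8],   # p3
                   [0.8, 0.9],   # p4
                   [0.5, 0.5]])  # p5

# linkage() builds the nested cluster tree; "ward" merges the pair of clusters
# that increases total within-cluster variance the least at each step.
tree = linkage(points, method="ward")

dendrogram(tree, labels=["p1", "p2", "p3", "p4", "p5"])
plt.title("Dendrogram of p1-p5")
plt.show()
```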

9 Clusters can be ambiguous How many clusters are there? The same points could reasonably be grouped into 2, 4, or 6 clusters. The difference is the threshold you set: how distinct must a cluster be to be its own cluster? Adapted from Tan, Steinbach, and Kumar. Introduction to Data Mining (2004).

10 K-means (partitional) The K-means algorithm is one method for doing partitional clustering: (1) choose K and select K points as initial centroids; (2) assign all points to clusters based on distance; (3) recompute the centroid of each cluster; (4) did any center change? If yes, go back to step 2; if no, done.
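
The loop described on this slide can be written out directly. Below is a minimal NumPy sketch; the variable names and the purely random choice of initial centroids are my own simplifications (production libraries such as scikit-learn add smarter initialization and multiple restarts).

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Cluster the rows of X into k groups using the basic K-means loop."""
    rng = np.random.default_rng(seed)
    # 1. Select K points from the data as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assign every point to the cluster with the nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Recompute each centroid as the mean of the points assigned to it
        #    (keeping the old centroid if a cluster happens to end up empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # 4. Did the centers change? If not, we are done.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

A typical call would be `labels, centers = k_means(data, k=3)`, where `data` is a NumPy array with one row per observation.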

11 K-Means Demonstration Here is the initial data set

12 K-Means Demonstration Choose K points as initial centroids

13 K-Means Demonstration Assign data points according to distance

14 K-Means Demonstration Recalculate the centroids

15 K-Means Demonstration And re-assign the points

16 K-Means Demonstration And keep doing that until you settle on a final set of clusters

17 Choosing the initial centroids It matters: you need to choose both the right number of centroids and the right initial locations. Bad choices create bad groupings: the clusters won't make sense within the context of the problem, and unrelated data points will end up in the same group.

18 Example of Poor Initialization This may “work” mathematically but the clusters don’t make much sense.

19 Evaluating K-Means Clusters On the previous slide we did it visually, but there is a mathematical test: Sum-of-Squares Error (SSE), based on each point's distance to the nearest cluster center (how close does each point get to the center?). This just means: for each point (x) in a cluster, compute its distance to the cluster center (m), square that distance (so sign isn't an issue), and add them all together.
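
In symbols, SSE is the sum over clusters i of the sum over points x in cluster Ci of dist(m_i, x) squared, where m_i is the center of cluster i. As a short sketch, it can be computed in a few lines; X, labels, and centroids are assumed to come from a K-means run such as the k_means() sketch earlier.

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum over every point of its squared distance to its own cluster's center."""
    diffs = X - centroids[labels]     # vector from each point to the centroid it was assigned to
    return float(np.sum(diffs ** 2))  # square the distances and add them all together
```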

20 Example: Evaluating Clusters Cluster 1 has points at distances 1, 1.3, and 2 from its center; Cluster 2 has points at distances 3, 3.3, and 1.5 from its center. SSE1 = 1² + 1.3² + 2² = 1 + 1.69 + 4 = 6.69. SSE2 = 3² + 3.3² + 1.5² = 9 + 10.89 + 2.25 = 22.14. Considerations: lower individual cluster SSE means a better (more cohesive) cluster, and lower total SSE means a better set of clusters; reducing SSE within a cluster increases cohesion (we want that). Keep in mind that adding more clusters will always reduce SSE.
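
The slide's arithmetic can be checked directly in Python:

```python
# Squared distances of Cluster 1's and Cluster 2's points to their own centers,
# using the distances listed on the slide.
sse_1 = sum(d ** 2 for d in (1.0, 1.3, 2.0))   # ≈ 6.69
sse_2 = sum(d ** 2 for d in (3.0, 3.3, 1.5))   # ≈ 22.14
print(sse_1, sse_2, sse_1 + sse_2)             # total SSE for this clustering ≈ 28.83
```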

21 Choosing the best initial centroids There's no single best way to choose initial centroids, so what do you do? Options include: multiple runs with different starting points; trying the clustering on a sample set of data first and then applying it to your main data set; selecting more centroids than you need to start with and then keeping the ones that are farthest apart (because those are the most distinct); and pre- and post-processing of the data.
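
For reference, the "multiple runs" tactic is built into scikit-learn's KMeans (a hedged example that assumes X is your prepared data matrix): n_init restarts the algorithm with different random initial centroids and keeps the run with the lowest total SSE, and the default init="k-means++" strategy spreads the initial centroids apart, similar in spirit to "choose the ones that are farthest apart."

```python
from sklearn.cluster import KMeans

# 25 restarts from different random initial centroids; the best run (lowest SSE) is kept.
model = KMeans(n_clusters=3, init="random", n_init=25, random_state=0).fit(X)
print(model.inertia_)          # total within-cluster SSE of the best run
print(model.cluster_centers_)  # centroids of the best run
```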

22 Pre-processing: Getting the right centroids "Pre" means getting the data ready for analysis: deal with missing values and address any measurement error. Normalize the data: this reduces the dispersion of the data points by re-computing the distances; the rationale is that it preserves differences while dampening the effect of outliers. Remove outliers: this reduces dispersion by removing atypical data points; the rationale is that they don't represent the population anyway. (Outlier detection is a big field of study in data mining, with applications in fraud detection, discovery of blockbuster drugs in pharmaceuticals, etc.)
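
A minimal sketch of the two pre-processing steps named above, using plain NumPy z-scores; the 3-standard-deviation cutoff for outliers is an illustrative choice, not something from the slides.

```python
import numpy as np

def normalize(X):
    """Rescale each attribute to mean 0 and standard deviation 1 (z-scores)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def remove_outliers(X, z_cutoff=3.0):
    """Drop rows where any attribute is more than z_cutoff standard deviations from its mean."""
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    return X[(z < z_cutoff).all(axis=1)]
```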

23 Post-processing: Getting the right centroids "Post" means interpreting the results of the clustering analysis: remove small clusters (they may be outliers); split "loose" clusters with high SSE that look like they are really two different groups; and merge clusters with relatively low SSE that are "close" together.

24 Limitations of K-Means Clustering K-Means gives unreliable results when clusters vary widely in size, clusters vary widely in density, clusters are not rounded in shape, or the data set has a lot of outliers. In those cases the clusters may never make sense; the data may just not be well-suited for clustering!

25 Similarity between clusters (inter-cluster) The most common measure is the distance between centroids. You can also use SSE: look at the distance between one cluster's points (e.g., Cluster 1's) and the other clusters' centroids. You'd want to maximize that SSE between clusters, since increasing SSE across clusters increases separation (we want that).
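
One way to read the slide's description in code (a sketch that reuses X, labels, and centroids from the earlier K-means sketch): compute the distances between centroids, and the SSE from each cluster's points to the other clusters' centers.

```python
import numpy as np

def centroid_distances(centroids):
    """Pairwise distances between cluster centers (the most common separation measure)."""
    diffs = centroids[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diffs, axis=2)

def between_cluster_sse(X, labels, centroids):
    """Squared distances from each cluster's points to the OTHER clusters' centers.
    Larger values mean the clusters are better separated."""
    total = 0.0
    for j, center in enumerate(centroids):
        other_points = X[labels != j]  # points that belong to other clusters
        total += float(np.sum((other_points - center) ** 2))
    return total
```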

26 Figuring out if our clusters are good "Good" means meaningful, useful, and providing insight. The pitfalls: poor clusters can suggest incorrect or inconclusive associations, and there might be room for improvement that we can't detect. This is somewhat subjective and depends upon the expectations of the analyst.

27 Cluster validity assessment External: do the clusters match predefined labels? (out-of-sample validation). Internal: how well-formed are the clusters? (e.g., SSE, cohesion, separation). Relative: how well does one clustering algorithm compare to another? (e.g., compare SSEs).
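
As a hedged illustration of internal and relative assessment in Python (X is assumed to be the prepared data; the silhouette coefficient is not named on the slide but is a standard internal measure combining cohesion and separation):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

km3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
km4 = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Internal: how well-formed is a single clustering? (higher silhouette = better)
print(silhouette_score(X, km3.labels_))

# Relative: compare two clusterings of the same data, e.g. by total within-cluster SSE
# (scikit-learn exposes this as inertia_). Remember that more clusters always lowers SSE.
print(km3.inertia_, km4.inertia_)
```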

28 The Keys to Successful Clustering We want high cohesion within clusters (minimize differences): low SSE, high correlation. And we want high separation between clusters (maximize differences): high SSE between clusters, low correlation. We need to choose the right number of clusters and the right initial centroids, but there's no easy way to do this; it takes a combination of trial-and-error, knowledge of the problem, and looking at the output. In SAS, cohesion is measured by root mean square standard deviation, and separation is measured by distance to the nearest cluster.
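
Since choosing K comes down to trial-and-error plus looking at the output, one common aid (not named on the slide) is an "elbow" plot: run K-means for several values of K and plot the total SSE, looking for the point where the improvement levels off. A sketch, assuming X is the prepared data:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
sses = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sses, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("total within-cluster SSE")
plt.title("Elbow plot: pick K where the curve starts to flatten")
plt.show()
```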

29 Additional Notes Specific approaches to validating clusters: cohesion and separation (we know about these), and accuracy of attribute value predictions. Other recent developments: distribution-based clustering and density-based clustering.

30 Clustering High-Dimensional Data The curse of dimensionality (e.g., with text documents): as the number of attributes grows, the data becomes sparse, "distance" between points becomes less meaningful, and there are often many "garbage" attributes and many correlated attributes. One way to resolve this is subspace clustering, which looks for clusters whose members share membership in subspaces of the attributes.
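
Subspace clustering itself is beyond a short sketch; a simpler, related workaround that many analysts use (my substitution, not the slide's method) is to project the data onto a few principal components before clustering, so that distances are computed in a lower-dimensional space.

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# X_highdim is assumed to be a wide (many-attribute) data matrix.
X_reduced = PCA(n_components=10).fit_transform(X_highdim)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_reduced)
```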

