CLUSTERING AND SEGMENTATION MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining.

What is Cluster Analysis?
Grouping data so that elements in a group will be:
- Similar (or related) to one another
- Different from (or unrelated to) elements in other groups
In other words: distance within clusters is minimized, and distance between clusters is maximized.

Applications
Understanding:
- Group related documents for browsing
- Create groups of similar customers
- Discover which stocks have similar price fluctuations
Summarization:
- Reduce the size of large data sets: similar groups can be treated as a single data point

Even more examples
- Marketing: discover distinct customer groups for targeted promotions
- Insurance: find "good customers" (low claim costs, reliable premium payments)
- Healthcare: find patients with high-risk behaviors

What cluster analysis is NOT
- Manual ("supervised") classification: people simply place items into categories
- Simple segmentation: dividing students into groups by last name
Main idea: the clusters must come from the data, not from external specifications. Creating the "buckets" beforehand is categorization, not clustering.

Clusters can be ambiguous
How many clusters are there? Two? Four? Six? The difference is the threshold you set: how distinct must a cluster be to be its own cluster?

Two clustering techniques
- Partitional: non-overlapping subsets (clusters) such that each data object is in exactly one subset
- Hierarchical: a set of nested clusters organized as a hierarchical tree

Partitional Clustering
Example: clustering baseball pitches. Three distinct groups emerge, but some curveballs behave more like splitters, and some splitters look more like fastballs.

Hierarchical Clustering
[Figure: nested clusters over points p1–p5, with the corresponding dendrogram]
A dendrogram is a tree diagram used to represent clusters.

K-means (partitional)
The K-means algorithm is one method for doing partitional clustering:
1. Choose K clusters
2. Select K points as initial centroids
3. Assign all points to clusters based on distance
4. Recompute the centroid of each cluster
5. Did the centroids change? If yes, go back to step 3; if no, you're done.
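The loop above can be sketched in a few lines of Python. This is a minimal illustration (the function and parameter names are ours, not from the slides), not a production implementation:

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=42):
    """Minimal K-means sketch: returns (centroids, cluster assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick K distinct points as initial centroids
    for _ in range(max_iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members)
                                           for dim in zip(*members)))
            else:  # keep the old centroid if a cluster goes empty
                new_centroids.append(centroids[c])
        if new_centroids == centroids:  # centroids stopped moving: done
            break
        centroids = new_centroids
    return centroids, labels
```

On a toy data set with two well-separated groups, this settles on the two obvious clusters within a few iterations.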

K-Means Demonstration Here is the initial data set

K-Means Demonstration Choose K points as initial centroids

K-Means Demonstration Assign data points according to distance

K-Means Demonstration And re-assign the points

K-Means Demonstration Recalculate the centroids

K-Means Demonstration And keep doing that until you settle on a final set of clusters

Choosing the initial centroids
It matters! Two things to get right:
- Choosing the right number of centroids
- Choosing the right initial locations
Bad choices create bad groupings: the clusters won't make sense within the context of the problem, and unrelated data points will be included in the same group.

Example of Poor Initialization This may “work” mathematically but the clusters don’t make much sense.

Evaluating K-Means Clusters
On the previous slide, we did it visually, but there is a mathematical test: Sum-of-Squares Error (SSE), based on how close each point gets to its cluster center. This just means:
- In each cluster, compute the distance from each point (x) to the cluster center (m)
- Square that distance (so sign isn't an issue)
- Add them all together
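The three steps above translate directly into code; a short sketch (the helper name is ours, not from the slides):

```python
import math

def cluster_sse(points, centroid):
    """Sum of squared distances from each point in a cluster to its centroid."""
    return sum(math.dist(p, centroid) ** 2 for p in points)

cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
centroid = (4 / 3, 4 / 3)  # the mean of the three points
total = cluster_sse(cluster, centroid)  # 2/9 + 5/9 + 5/9 = 4/3
```

Summing `cluster_sse` over all clusters gives the total SSE for a clustering.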

Example: Evaluating Clusters
[Figure: two clusters with their within-cluster squared distances summed, e.g. SSE1 = 6.69 for Cluster 1]
- Lower individual cluster SSE = a better cluster
- Lower total SSE = a better set of clusters
Considerations:
- More clusters will reduce SSE
- Reducing SSE within a cluster increases cohesion (we want that)

Choosing the best initial centroids
There's no single best way to choose initial centroids. So what do you do?
- Multiple runs
- Use a sample set of data first, then apply the result to your main data set
- Select more centroids to start with, then choose the ones that are farthest apart (because those are the most distinct)
- Pre- and post-processing of the data
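The "select more centroids, then keep the ones farthest apart" idea can be sketched as a greedy pass; the function name and the oversampling parameter below are illustrative assumptions:

```python
import math
import random

def spread_out_init(points, k, n_candidates=10, seed=0):
    """Sample extra candidate centroids, then greedily keep the k farthest apart."""
    rng = random.Random(seed)
    candidates = rng.sample(points, min(n_candidates, len(points)))
    chosen = [candidates.pop(0)]
    while len(chosen) < k:
        # Keep the candidate whose nearest already-chosen centroid is farthest away.
        best = max(candidates,
                   key=lambda p: min(math.dist(p, c) for c in chosen))
        candidates.remove(best)
        chosen.append(best)
    return chosen
```

Feeding these centroids into K-means tends to start the algorithm with well-separated, distinct seeds.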

Pre-processing: Getting the right centroids
"Pre" = get the data ready for analysis
- Normalize the data: reduces the dispersion of data points by re-computing the distance. Rationale: preserves differences while dampening the effect of outliers.
- Remove outliers: reduces the dispersion of data points by removing atypical data. Rationale: they don't represent the population anyway.
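Normalization as described above is often done with z-scores: rescale each attribute to mean 0 and standard deviation 1, so no single attribute dominates the distance calculation. A minimal sketch, assuming simple numeric attributes:

```python
import statistics

def z_normalize(values):
    """Rescale a numeric attribute to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

# Income (dollars) would swamp age (years) in a raw Euclidean distance;
# after normalization the two attributes contribute on the same scale.
incomes = z_normalize([30_000, 45_000, 60_000, 250_000])
ages = z_normalize([25, 32, 41, 58])
```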

Post-processing: Getting the right centroids
"Post" = interpreting the results of the clustering analysis
- Remove small clusters: they may be outliers
- Split loose clusters: those with high SSE that look like they are really two different groups
- Merge clusters: those with relatively low SSE that are "close" together

Limitations of K-Means Clustering
K-Means gives unreliable results when:
- Clusters vary widely in size
- Clusters vary widely in density
- Clusters are not rounded in shape
- The data set has a lot of outliers
In those cases the clusters may never make sense; the data may just not be well-suited for clustering!

Similarity between clusters (inter-cluster)
- Most common: distance between centroids
- Can also use SSE: look at the distance between one cluster's points and the other clusters' centroids
- You'd want to maximize SSE between clusters, because increasing SSE across clusters increases separation (we want that)
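The most common measure above, distance between centroids, is straightforward to compute; a sketch (the helper name is ours, not from the slides):

```python
import math

def centroid_separation(centroids):
    """Pairwise distances between cluster centroids; larger = better separated."""
    return {(i, j): math.dist(centroids[i], centroids[j])
            for i in range(len(centroids))
            for j in range(i + 1, len(centroids))}
```

When comparing clusterings, you'd want these pairwise distances (or the between-cluster SSE) to be large.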

Figuring out if our clusters are good
"Good" means:
- Meaningful
- Useful
- Provides insight
The pitfalls:
- Poor clusters reveal incorrect associations
- Poor clusters reveal inconclusive associations
- There might be room for improvement and we can't tell
This is somewhat subjective and depends upon the expectations of the analyst.

Cluster validity assessment
- External: Do the clusters match predefined labels? (e.g., entropy)
- Internal: How well-formed are the clusters? (e.g., SSE or correlation)
- Relative: How does one clustering algorithm compare to another? (e.g., compare SSEs)

The Keys to Successful Clustering
- We want high cohesion within clusters (minimize differences): low within-cluster SSE, high correlation
- And high separation between clusters (maximize differences): high between-cluster SSE, low correlation
- We need to choose the right number of clusters and the right initial centroids
- But there's no easy way to do this: it takes a combination of trial-and-error, knowledge of the problem, and looking at the output
In SAS, cohesion is measured by root mean square standard deviation, and separation by distance to the nearest cluster.