© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 What is Cluster Analysis? l Finding groups of objects such that the objects in a group will.

Slides:



Advertisements
Similar presentations
SEEM Tutorial 4 – Clustering. 2 What is Cluster Analysis?  Finding groups of objects such that the objects in a group will be similar (or.
Advertisements

Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.
Hierarchical Clustering
Unsupervised Learning
Cluster Analysis: Basic Concepts and Algorithms
1 CSE 980: Data Mining Lecture 16: Hierarchical Clustering.
Hierarchical Clustering. Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree-like diagram that.
Data Mining Cluster Analysis Basics
Hierarchical Clustering, DBSCAN The EM Algorithm
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Cluster Analysis? l Finding groups of objects such that the objects in a group will.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis: Basic Concepts and Algorithms
More on Clustering Hierarchical Clustering to be discussed in Clustering Part2 DBSCAN will be used in programming project.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
unsupervised learning - clustering
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Clustering II.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Cluster Analysis.
Cluster Analysis: Basic Concepts and Algorithms
Cluster Analysis (1).
What is Cluster Analysis?
Cluster Analysis CS240B Lecture notes based on those by © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
What is Cluster Analysis?
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Clustering Ram Akella Lecture 6 February 23, & 280I University of California Berkeley Silicon Valley Center/SC.
What is Cluster Analysis?
DATA MINING LECTURE 8 Clustering The k-means algorithm
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Clustering Basic Concepts and Algorithms 2
Partitional and Hierarchical Based clustering Lecture 22 Based on Slides of Dr. Ikle & chapter 8 of Tan, Steinbach, Kumar.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Adapted from Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar.
Critical Issues with Respect to Clustering Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
CSE5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar 10/30/2007.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Minqi Zhou Minqi Zhou Introduction.
Ch. Eick: Introduction to Hierarchical Clustering and DBSCAN 1 Remaining Lectures in Advanced Clustering and Outlier Detection 2.Advanced Classification.
Jianping Fan Department of Computer Science UNC-Charlotte Density-Based Data Clustering Algorithms: K-Means & Others.
Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Minqi Zhou © Tan,Steinbach, Kumar.
Data Mining Cluster Analysis: Basic Concepts and Algorithms.
Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree like diagram that.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Clustering/Cluster Analysis. What is Cluster Analysis? l Finding groups of objects such that the objects in a group will be similar (or related) to one.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Cluster Analysis This lecture node is modified based on Lecture Notes for Chapter.
DATA MINING: CLUSTER ANALYSIS Instructor: Dr. Chun Yu School of Statistics Jiangxi University of Finance and Economics Fall 2015.
DATA MINING: CLUSTER ANALYSIS (3) Instructor: Dr. Chun Yu School of Statistics Jiangxi University of Finance and Economics Fall 2015.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
CSE4334/5334 Data Mining Clustering. What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar (or related)
Data Mining Cluster Analysis: Basic Concepts and Algorithms.
ΠΑΝΕΠΙΣΤΗΜΙΟ ΙΩΑΝΝΙΝΩΝ ΑΝΟΙΚΤΑ ΑΚΑΔΗΜΑΪΚΑ ΜΑΘΗΜΑΤΑ Εξόρυξη Δεδομένων Ομαδοποίηση (clustering) Διδάσκων: Επίκ. Καθ. Παναγιώτης Τσαπάρας.
Data Mining Classification and Clustering Techniques Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to Data Mining.
Data Mining: Basic Cluster Analysis
Data Mining Cluster Analysis: Basic Concepts and Algorithms
More on Clustering in COSC 4335
CSE 4705 Artificial Intelligence
Hierarchical Clustering: Time and Space requirements
Clustering 28/03/2016 A diák alatti jegyzetszöveget írta: Balogh Tamás Péter.
Data Mining K-means Algorithm
Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Techniques: Basic
Critical Issues with Respect to Clustering
Clustering 23/03/2016 A diák alatti jegyzetszöveget írta: Balogh Tamás Péter.
SEEM4630 Tutorial 3 – Clustering.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Presentation transcript:

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Cluster Analysis? l Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups Inter-cluster distances are maximized Intra-cluster distances are minimized

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Applications of Cluster Analysis l Understanding –Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations l Summarization –Reduce the size of large data sets Clustering precipitation in Australia

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Notion of a Cluster can be Ambiguous How many clusters? Four ClustersTwo Clusters Six Clusters

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Types of Clusterings l A clustering is a set of clusters l Important distinction between hierarchical and partitional sets of clusters l Partitional Clustering –A division data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset l Hierarchical clustering –A set of nested clusters organized as a hierarchical tree

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Partitional Clustering Original Points A Partitional Clustering

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Hierarchical Clustering Traditional Hierarchical Clustering Non-traditional Hierarchical ClusteringNon-traditional Dendrogram Traditional Dendrogram

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Clustering Algorithms l K-means and its variants l Density-based clustering l Hierarchical clustering

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ K-means Clustering l Partitional clustering approach l Each cluster is associated with a centroid (center point) l Each point is assigned to the cluster with the closest centroid l Number of clusters, K, must be specified l The basic algorithm is very simple

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ K-means: Example

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ K-means: Example

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Sub-optimal ClusteringOptimal Clustering Original Points Importance of Choosing Initial Centroids …

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Importance of Choosing Initial Centroids …

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Importance of Choosing Initial Centroids …

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Problems with Selecting Initial Points l If there are K ‘real’ clusters then the chance of selecting one centroid from each cluster is small. –Chance is relatively small when K is large –If clusters are the same size, n, then –For example, if K = 10, then probability = 10!/10 10 = –Sometimes the initial centroids will readjust themselves in ‘right’ way, and sometimes they don’t –Consider an example of five pairs of clusters

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Solutions to Initial Centroids Problem l Multiple runs –Helps, but probability is not on your side l Sample and use hierarchical clustering to determine initial centroids l Select more than k initial centroids and then select among these initial centroids –Select most widely separated l Postprocessing

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Limitations of K-means l K-means has problems when clusters are of differing –Sizes –Densities –Non-globular shapes l K-means has problems when the data contains outliers.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Limitations of K-means: Differing Sizes Original Points K-means (3 Clusters)

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Limitations of K-means: Differing Density Original Points K-means (3 Clusters)

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Limitations of K-means: Non-globular Shapes Original Points K-means (2 Clusters)

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Overcoming K-means Limitations Original PointsK-means Clusters One solution is to use many clusters. Find parts of clusters, but need to put together.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Overcoming K-means Limitations Original PointsK-means Clusters

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Overcoming K-means Limitations Original PointsK-means Clusters

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Clustering Algorithms l K-means and its variants l Density-based clustering l Hierarchical clustering

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Density based clustering

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Density based clustering

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Density-based clustering: DBSCAN l DBSCAN is a density-based algorithm. –Density = number of points within a specified radius (Eps) –A point is a core point if it has more than a specified number of points (MinPts) within Eps  These are points that are at the interior of a cluster –A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point –A noise point is any point that is not a core point or a border point.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ DBSCAN: Core, Border, and Noise Points

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ DBSCAN Algorithm l Eliminate noise points l Perform clustering on the remaining points –Connect all core points with an edge that are less than Eps from each other. –Make each group of connected core points into a separate cluster. –Assign each border point to one of its associated clusters.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ First point is selected All points that are density-reachable from this point form a new cluster Second point is selected Example: DBSCAN Second cluster is formed Third point selected and cluster is formed

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ DBSCAN: Core, Border and Noise Points Original Points Point types: core, border and noise Eps = 10, MinPts = 4

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ When DBSCAN Works Well Original Points Clusters Resistant to Noise Can handle clusters of different shapes and sizes

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ When DBSCAN Does NOT Work Well Original Points (MinPts=4, Eps=9.75). (MinPts=4, Eps=9.92) Varying densities High-dimensional data

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ DBSCAN: Determining EPS and MinPts l Idea is that for points in a cluster, their k th nearest neighbors are at roughly the same distance l Noise points have the k th nearest neighbor at farther distance l So, plot sorted distance of every point to its k th nearest neighbor

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Clustering Algorithms l K-means and its variants l Density-based clustering l Hierarchical clustering

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Hierarchical Clustering l Produces a set of nested clusters organized as a hierarchical tree l Can be visualized as a dendrogram –A tree like diagram that records the sequences of merges or splits

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Strengths of Hierarchical Clustering l Do not have to assume any particular number of clusters –Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level l They may correspond to meaningful taxonomies –Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Hierarchical Clustering l Two main types of hierarchical clustering –Agglomerative:  Start with the points as individual clusters  At each step, merge the closest pair of clusters until only one cluster (or k clusters) left –Divisive:  Start with one, all-inclusive cluster  At each step, split a cluster until each cluster contains a point (or there are k clusters) l Traditional hierarchical algorithms use a similarity or distance matrix –Merge or split one cluster at a time

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Agglomerative Clustering Algorithm l More popular hierarchical clustering technique l Basic algorithm is straightforward 1.Compute the proximity matrix 2.Let each data point be a cluster 3.Repeat 4.Merge the two closest clusters 5.Update the proximity matrix 6.Until only a single cluster remains l Key operation is the computation of the proximity of two clusters –Different approaches to defining the distance between clusters distinguish the different algorithms

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Starting Situation l Start with clusters of individual points and a proximity matrix p1 p3 p5 p4 p2 p1p2p3p4p Proximity Matrix

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Intermediate Situation l After some merging steps, we have some clusters C1 C4 C2 C5 C3 C2C1 C3 C5 C4 C2 C3C4C5 Proximity Matrix

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Intermediate Situation l We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. C1 C4 C2 C5 C3 C2C1 C3 C5 C4 C2 C3C4C5 Proximity Matrix

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ After Merging l The question is “How do we update the proximity matrix?” C1 C4 C2 U C5 C3 ? ? ? ? ? C2 U C5 C1 C3 C4 C2 U C5 C3C4 Proximity Matrix

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ How to Define Inter-Cluster Similarity p1 p3 p5 p4 p2 p1p2p3p4p Similarity? l MIN l MAX l Group Average l Distance Between Centroids l Other methods driven by an objective function Proximity Matrix

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ How to Define Inter-Cluster Similarity p1 p3 p5 p4 p2 p1p2p3p4p Proximity Matrix l MIN l MAX l Group Average l Distance Between Centroids l Other methods driven by an objective function

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ How to Define Inter-Cluster Similarity p1 p3 p5 p4 p2 p1p2p3p4p Proximity Matrix l MIN l MAX l Group Average l Distance Between Centroids l Other methods driven by an objective function

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ How to Define Inter-Cluster Similarity p1 p3 p5 p4 p2 p1p2p3p4p Proximity Matrix l MIN l MAX l Group Average l Distance Between Centroids l Other methods driven by an objective function

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ How to Define Inter-Cluster Similarity p1 p3 p5 p4 p2 p1p2p3p4p Proximity Matrix l MIN l MAX l Group Average l Distance Between Centroids l Other methods driven by an objective function 

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Hierarchical Clustering: Problems and Limitations l Once a decision is made to combine two clusters, it cannot be undone l No objective function is directly minimized l Different schemes have problems with one or more of the following: –Sensitivity to noise and outliers –Difficulty handling different sized clusters and convex shapes –Breaking large clusters

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Cluster Validity l For supervised classification we have a variety of measures to evaluate how good our model is –Accuracy, precision, recall l For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters? l But “clusters are in the eye of the beholder”! l Then why do we want to evaluate them? –To avoid finding patterns in noise –To compare clustering algorithms –To compare two sets of clusters –To compare two clusters

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ l Order the similarity matrix with respect to cluster labels and inspect visually. Using Similarity Matrix for Cluster Validation

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Using Similarity Matrix for Cluster Validation l Clusters in random data are not so crisp DBSCAN

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Using Similarity Matrix for Cluster Validation l Clusters in random data are not so crisp K-means

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Using Similarity Matrix for Cluster Validation DBSCAN

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ “The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.” Algorithms for Clustering Data, Jain and Dubes Final Comment on Cluster Validity