Clustering (2)
- Center-based algorithms
- Fuzzy k-means
- Density-based algorithms (DBSCAN as an example)
- Evaluation of clustering results
Figures and equations from Data Clustering by Gan et al.

Center-based clustering
- Has an objective function that defines how good a solution is; the goal is to minimize the objective function.
- Efficient for large/high-dimensional datasets.
- The clusters are assumed to be convex shaped; the cluster center is representative of the cluster.
- Some model-based clustering methods, e.g. Gaussian mixtures, are center-based clustering.

Center-based clustering
K-means clustering. Let $C_1, C_2, \ldots, C_k$ be $k$ disjoint clusters. The error is defined as the sum of the distances from the points to their cluster centers:

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, \mu(C_i))$$

where $\mu(C_i)$ is the center of cluster $C_i$.

Center-based clustering
The k-means algorithm:
(1) choose k initial centers; (2) assign every point to its nearest center; (3) recompute each center as the mean of the points assigned to it; (4) repeat steps (2) and (3) until the assignments no longer change.
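Below is a minimal runnable sketch of this loop (NumPy; the function and variable names are mine, not from the slides):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's k-means: alternate the assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Step (1): initialize centers with k distinct random points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step (2): assign each point to its nearest center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step (3): move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # step (4): converged to a local minimum
        centers = new_centers
    return labels, centers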

Understanding k-means as an optimization procedure. The objective function is

$$P(W, Q) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij}\, d(x_i, q_j)$$

Minimize $P(W, Q)$ subject to

$$\sum_{j=1}^{k} w_{ij} = 1, \qquad w_{ij} \in \{0, 1\}, \qquad i = 1, \ldots, n$$

where $W = [w_{ij}]$ is the cluster-membership matrix and $Q = \{q_1, \ldots, q_k\}$ is the set of cluster centers.

Center-based clustering
The solution is found by iteratively solving two sub-problems:
(1) fix $Q = \hat{Q}$ and solve the reduced problem $P(W, \hat{Q})$: assign each point to its nearest center;
(2) fix $W = \hat{W}$ and solve the reduced problem $P(\hat{W}, Q)$: set each center to the mean of the points assigned to it.

In terms of optimization, the k-means procedure is greedy.
- Every iteration decreases the value of the objective function; the algorithm converges to a local minimum after a finite number of iterations.
- Results depend on the initialization values.
- The computational complexity is proportional to the size of the dataset, so it is efficient on large data.
- The clusters identified are mostly ball-shaped.
- Works only on numerical data.

Center-based clustering
A variant of k-means that saves computing time: the compare-means algorithm. (There are many such variants.) It is based on the triangle inequality $d(x, m_i) + d(x, m_j) \ge d(m_i, m_j)$, so $d(x, m_j) \ge d(m_i, m_j) - d(x, m_i)$; therefore, if $d(m_i, m_j) \ge 2\,d(x, m_i)$, then $d(x, m_j) \ge d(x, m_i)$. In every iteration, the small number of between-mean distances is computed first. Then, for every x, the distance to its closest known mean is compared with the between-mean distances to find which of the $d(x, m_j)$ really need to be computed.
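A sketch of how this pruning could look in the assignment step (the bookkeeping details vary by implementation; names are mine):

import numpy as np

def compare_means_assign(X, centers):
    """Nearest-center assignment that skips distances ruled out by the
    triangle inequality: if d(m_best, m_j) >= 2*d(x, m_best), then
    m_j cannot be closer to x than m_best."""
    k = len(centers)
    # Between-mean distances, computed once per iteration (k is small).
    mm = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    labels = np.empty(len(X), dtype=int)
    for n, x in enumerate(X):
        best, best_d = 0, np.linalg.norm(x - centers[0])
        for j in range(1, k):
            if mm[best, j] >= 2 * best_d:
                continue  # pruned: no need to compute d(x, m_j)
            d = np.linalg.norm(x - centers[j])
            if d < best_d:
                best, best_d = j, d
        labels[n] = best
    return labels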

Center-based clustering
Automated selection of k? The x-means algorithm, based on AIC/BIC. A family of models at different k is scored by

$$\mathrm{BIC}_j = \hat{l}_j(D) - \frac{p_j}{2} \log n$$

where $\hat{l}_j(D)$ is the log-likelihood of the data given the j-th model and $p_j$ is the number of parameters. We have to assume a model to get the likelihood; the convenient one is Gaussian.

Center-based clustering
Under the identical spherical Gaussian assumption (n is the sample size; k is the number of centroids; $\mu_{(i)}$ is the centroid associated with $x_i$), the pooled variance estimate is

$$\hat{\sigma}^2 = \frac{1}{n - k} \sum_{i} \left\| x_i - \mu_{(i)} \right\|^2$$

and the log-likelihood is

$$\hat{l}(D) = \sum_{i=1}^{n} \left[ \log \frac{n_{(i)}}{n} - \frac{d}{2} \log\!\left(2\pi\hat{\sigma}^2\right) - \frac{\left\| x_i - \mu_{(i)} \right\|^2}{2\hat{\sigma}^2} \right]$$

where $n_{(i)}$ is the size of the cluster containing $x_i$. The number of parameters is $(k - 1) + dk + 1$, where d is the dimension (class probabilities + parameters for the means + the shared variance).
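A sketch of this BIC score for a fitted k-means solution (the variance estimate and parameter count follow the formulas above; the function name and structure are mine):

import numpy as np

def bic_spherical(X, labels, centers):
    """BIC = log-likelihood - (p/2) * log(n) under the identical
    spherical Gaussian model; higher is better."""
    n, d = X.shape
    k = len(centers)
    sq = np.sum((X - centers[labels]) ** 2)
    var = sq / (n - k)                      # pooled spherical variance
    counts = np.bincount(labels, minlength=k)
    # Log-likelihood: mixing-proportion term + Gaussian density terms.
    ll = (np.sum(counts * np.log(np.maximum(counts, 1) / n))
          - 0.5 * n * d * np.log(2 * np.pi * var)
          - 0.5 * sq / var)
    p = (k - 1) + d * k + 1                 # class probs + means + variance
    return ll - 0.5 * p * np.log(n)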

Center-based clustering
K-harmonic means: insensitive to initialization.
K-means error (each point contributes its distance to the closest center only):

$$E_{KM} = \sum_{i=1}^{n} \min_{j} d(x_i, m_j)^2$$

K-harmonic means error (each point contributes the harmonic average of its distances to all k centers):

$$E_{KHM} = \sum_{i=1}^{n} \frac{k}{\sum_{j=1}^{k} \frac{1}{d(x_i, m_j)^2}}$$
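A one-function sketch of the K-harmonic-means objective (my naming; eps guards against division by zero when a point sits exactly on a center):

import numpy as np

def khm_error(X, centers, eps=1e-12):
    """Sum over points of the harmonic mean of squared distances
    to all k centers."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    k = len(centers)
    return float(np.sum(k / np.sum(1.0 / (d2 + eps), axis=1)))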

Center-based clustering
K-modes algorithm for categorical data. Let x be a d-vector with categorical attributes. For a group of x's, the mode is defined as the vector q that minimizes

$$D(X, q) = \sum_{i=1}^{n} d(x_i, q)$$

where the simple matching distance is

$$d(x, q) = \sum_{j=1}^{d} \delta(x_j, q_j), \qquad \delta(a, b) = \begin{cases} 0 & a = b \\ 1 & a \ne b \end{cases}$$

The objective function is similar to the one for the original k-means.
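A small sketch of the matching distance and the mode (helper names are mine; the component-wise most-frequent category minimizes the summed matching distance because the distance decomposes per attribute):

from collections import Counter
import numpy as np

def matching_distance(x, q):
    # Number of attributes on which x and q disagree.
    return int(np.sum(x != q))

def mode_vector(X):
    # Most frequent category in every attribute (column) of X.
    return np.array([Counter(col).most_common(1)[0][0] for col in X.T])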

Center-based clustering
K-prototypes algorithm for mixed-type data. Between any two points, the distance is defined as

$$d(x, y) = \sum_{j=1}^{p} (x_j - y_j)^2 + \gamma \sum_{j=p+1}^{d} \delta(x_j, y_j)$$

with the first sum over the p continuous attributes and the second over the categorical ones. γ is a parameter that balances the continuous and categorical variables. The cost function to minimize is the sum of these distances from every point to the prototype of its cluster.
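A minimal sketch of this mixed distance, assuming the numeric and categorical attributes are passed as separate arrays (my convention, not the book's):

import numpy as np

def kproto_distance(x_num, y_num, x_cat, y_cat, gamma=1.0):
    """Squared Euclidean part on numeric attributes plus gamma times
    the simple matching distance on categorical attributes."""
    return float(np.sum((x_num - y_num) ** 2)
                 + gamma * np.sum(x_cat != y_cat))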

Fuzzy k-means
Soft clustering: an observation can be assigned to multiple clusters. With n samples and c partitions, the fuzzy c-partition matrix (c × n) has entries

$$u_{ij} \in [0, 1], \qquad \sum_{i=1}^{c} u_{ij} = 1 \ \text{ for every sample } j$$

If we take the max for every sample (set its largest membership to 1 and the others to 0), we get back a hard partition.

Fuzzy k-means
The objective function is

$$J_q(U, V) = \sum_{j=1}^{n} \sum_{i=1}^{k} u_{ij}^{\,q} \left\| x_j - v_i \right\|^2, \qquad q > 1$$

q controls the "fuzziness"; $v_i$ is the centroid of cluster i, $u_{ij}$ is the degree of membership of $x_j$ in cluster i, and k is the number of clusters.

Fuzzy k-means
The algorithm alternates between updating the memberships and the centroids until they stabilize (the standard fuzzy k-means update equations):

$$u_{ij} = \left[ \sum_{l=1}^{k} \left( \frac{\| x_j - v_i \|^2}{\| x_j - v_l \|^2} \right)^{\frac{1}{q-1}} \right]^{-1}, \qquad v_i = \frac{\sum_{j=1}^{n} u_{ij}^{\,q}\, x_j}{\sum_{j=1}^{n} u_{ij}^{\,q}}$$
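A compact sketch of these alternating updates (squared Euclidean distances; my naming and initialization, not a reference implementation):

import numpy as np

def fuzzy_kmeans(X, k, q=2.0, n_iter=100, seed=0, eps=1e-12):
    """Soft clustering: U[i, j] is the membership of sample j in cluster i."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distances d2[i, j] from center i to sample j.
        d2 = ((centers[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) + eps
        # Membership update; columns of U sum to 1.
        inv = d2 ** (-1.0 / (q - 1.0))
        U = inv / inv.sum(axis=0, keepdims=True)
        # Centroid update: weighted means with weights u_ij^q.
        W = U ** q
        centers = (W @ X) / W.sum(axis=1, keepdims=True)
    return U, centers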

Density-based algorithms (DBSCAN as an example)
- Capable of finding arbitrarily shaped clusters.
- Clusters are defined as dense regions surrounded by low-density regions.
- Automatically selects the number of clusters.
- Needs only one scan through the original data set.

Density-based algorithms (DBSCAN as an example)
Define the ε-neighborhood of a point x as

$$N_\varepsilon(x) = \{\, y \in D : d(x, y) \le \varepsilon \,\}$$

A point x is "directly density-reachable" from a point y if

$$x \in N_\varepsilon(y) \quad \text{and} \quad |N_\varepsilon(y)| \ge N_{min}$$

This relationship is not symmetric.

Density-based algorithms (DBSCAN as an example)
Point x is "density-reachable" from point y if there is a sequence of points $x, x_1, x_2, \ldots, x_i, y$ in which each point is directly density-reachable from the next one. Points x and y are "density-connected" if there exists a point z such that both x and y are density-reachable from z. (All these relationships are with respect to ε and $N_{min}$.)

Density-based algorithms (DBSCAN as an example)
The definition of "cluster": let D be the dataset. A cluster C with respect to ε and $N_{min}$ satisfies:
(1) if x is in C and y is density-reachable from x, then y is in C;
(2) every pair of points x and y in C is density-connected.
"Noise" is the set of points that do not belong to any cluster.

Density-based algorithms (DBSCAN as an example)
(Figure: example points labeled as core point, border point, and noise.)

Density-based algorithms (DBSCAN as an example)
Algorithm: start with an arbitrary point x and find all points that are density-reachable from x. (If x is a core point, a cluster is found; if x is a border or noise point, no point is density-reachable from it.) Then visit the next unclassified point. Two clusters may be merged if they are close enough; the cluster distance is defined by single linkage:

$$d(C_1, C_2) = \min_{x \in C_1,\, y \in C_2} d(x, y)$$
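A compact sketch of this expansion procedure (brute-force O(n²) neighborhoods for clarity; a label of -1 marks noise; all names are mine):

import numpy as np

def dbscan(X, eps, n_min):
    """Grow clusters from core points via density-reachability."""
    n = len(X)
    # Precompute epsilon-neighborhoods from the full distance matrix.
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < n_min:
            continue  # not a core point; stays noise unless claimed later
        labels[i] = cluster  # grow a new cluster from core point i
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border or core point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= n_min:
                    queue.extend(neighbors[j])  # core point: keep expanding
        cluster += 1
    return labels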

Density-based algorithms (DBSCAN as an example)
(Figure: example of clusters found by DBSCAN.)

Density-based algorithms (DBSCAN as an example)
How to choose ε and $N_{min}$? A heuristic method called the "sorted k-dist graph": let $F_k(x)$ be the distance from x to its k-th nearest neighbor. Sort the $F_k$ values of the entire dataset D in descending order and plot them. Find the $k^*$ beyond which increasing k does not bring much change to the graph, and set $N_{min} = k^*$. Find the first point $z_0$ in the first valley of the graph and set $\varepsilon = F_{k^*}(z_0)$.
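A sketch of computing the values for this graph (brute-force distances; reading off the plateau and the first valley is then done by eye, as on the slide):

import numpy as np

def sorted_kdist(X, k):
    """Distance from every point to its k-th nearest neighbor,
    sorted in descending order (the sorted k-dist graph)."""
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    # Column 0 of the row-sorted matrix is the point itself (distance 0),
    # so column k is the k-th nearest neighbor.
    kdist = np.sort(dist, axis=1)[:, k]
    return np.sort(kdist)[::-1]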

Evaluation of clustering results

Evaluation
External criteria approach: comparing the clustering result (C) with a pre-specified partition (P). All pairs of samples are counted into a 2 × 2 table, with M = a + b + c + d:

                            Same cluster in P    Different cluster in P
  Same cluster in C                 a                      b
  Different cluster in C            c                      d
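From these four counts one can build agreement indices; one common choice (used here for illustration, the slide itself does not name one) is the Rand statistic R = (a + d)/M. A small sketch:

from itertools import combinations

def pair_counts(labels_c, labels_p):
    """a: same in both, b: same in C only, c: same in P only, d: neither."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_c)), 2):
        same_c = labels_c[i] == labels_c[j]
        same_p = labels_p[i] == labels_p[j]
        if same_c and same_p:
            a += 1
        elif same_c:
            b += 1
        elif same_p:
            c += 1
        else:
            d += 1
    return a, b, c, d

def rand_statistic(labels_c, labels_p):
    """Fraction of sample pairs on which C and P agree."""
    a, b, c, d = pair_counts(labels_c, labels_p)
    return (a + d) / (a + b + c + d)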

Evaluation
Monte Carlo methods based on $H_0$ (random generation of partitions), or the bootstrap, are needed to find the significance of such an index.

Evaluation
External criteria: an alternative is to compare the proximity matrix Q with the given partition P. Define a matrix Y based on P:

$$Y(i, j) = \begin{cases} 1 & x_i \text{ and } x_j \text{ belong to the same cluster of } P \\ 0 & \text{otherwise} \end{cases}$$

Evaluation
Internal criteria: evaluate the clustering structure using features of the dataset itself (mostly the proximity matrix of the data). An example for hierarchical clustering:
$P_c$: the cophenetic matrix, whose ij-th element is the proximity level at which the two data points $x_i$ and $x_j$ are first joined into the same cluster.
P: the proximity matrix.

Evaluation
Cophenetic correlation coefficient (CPCC) index: the correlation between the $M = n(n-1)/2$ upper-diagonal entries $d_{ij}$ of P and $c_{ij}$ of $P_c$:

$$\mathrm{CPCC} = \frac{\frac{1}{M} \sum_{i<j} d_{ij} c_{ij} - \mu_P \mu_C}{\sqrt{\left( \frac{1}{M} \sum_{i<j} d_{ij}^2 - \mu_P^2 \right) \left( \frac{1}{M} \sum_{i<j} c_{ij}^2 - \mu_C^2 \right)}}$$

where $\mu_P$ and $\mu_C$ are the means of the respective entries. CPCC is in [-1, 1]; a higher value indicates better agreement between the two matrices.
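In practice this can be computed with SciPy's hierarchical-clustering utilities (scipy.cluster.hierarchy.cophenet returns exactly this correlation when given the linkage and the original condensed distances); the wrapper below is a sketch:

from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

def cpcc(X, method="single"):
    """Cophenetic correlation between the original pairwise distances
    and the cophenetic distances of a hierarchical clustering."""
    d = pdist(X)                       # condensed proximity matrix P
    Z = linkage(d, method=method)      # hierarchical clustering
    coeff, coph_d = cophenet(Z, d)     # CPCC and the cophenetic matrix P_c
    return coeff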

Evaluation
Relative criteria: choose the best result out of a set according to a predefined criterion. Example: the modified Hubert's Γ statistic

$$\Gamma = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} P(i, j)\, Q(i, j)$$

where P is the proximity matrix of the data, Q(i, j) is the distance between the representative centers of the clusters that $x_i$ and $x_j$ belong to, and $M = n(n-1)/2$. A high value indicates compact, well-separated clusters.
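A direct O(n²) sketch of this statistic, assuming Q(i, j) is the distance between the centers of the clusters containing $x_i$ and $x_j$ (function and argument names are mine):

import numpy as np

def modified_hubert_gamma(X, labels, centers):
    """Average over all pairs of proximity times between-center distance."""
    n = len(X)
    m = n * (n - 1) / 2
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p_ij = np.linalg.norm(X[i] - X[j])  # P(i, j)
            q_ij = np.linalg.norm(centers[labels[i]] - centers[labels[j]])  # Q(i, j)
            total += p_ij * q_ij
    return total / m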