Assessment

The schedule graph may help in selecting the best solution. The best solution corresponds to a plateau before a high jump. Solutions with very small clusters, or even singletons, are rather suspicious.
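A minimal sketch of reading the schedule graph: plot the merge distance at each agglomeration step and look for a plateau followed by a high jump. The synthetic data and the use of SciPy's linkage function are assumptions for illustration, not from the lecture.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # blob 1
               rng.normal(5, 0.3, (20, 2))])  # blob 2

Z = linkage(X, method='complete')  # the agglomeration schedule
merge_dist = Z[:, 2]               # merge distance at each step

# The largest jump in merge distance marks the end of the plateau; the
# solution just before that jump is a candidate best number of clusters.
step = int(np.argmax(np.diff(merge_dist)))
n_clusters = len(X) - (step + 1)
print(f"Suggested number of clusters: {n_clusters}")
```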

Outline:
- Standardization
- k-means
- Initial centroids
- Validation example: cluster merit index
- Cluster validation: three approaches
- Relative criteria
- Validity index: Dunn index, Davies-Bouldin (DB) index
- Combination of different distance/diameter methods

Standardization issue. The need for any standardization must be questioned. If the interesting clusters are based on the original features, then any standardization method may distort those clusters. Only when there are grounds to search for clusters in a transformed space should some standardization rule be used.

There is no methodological way to decide except by "trial and error".

An easy standardization method, often followed and frequently achieving good results, is simple division or multiplication by a scale factor. The factor should be properly chosen so that all feature values occupy a suitable interval.
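A small sketch of such scale-factor standardization. Using each feature's range as the factor is one simple choice (an assumption, not necessarily the lecture's); any factor that brings the values into a suitable interval would do.

```python
import numpy as np

X = np.array([[1.0, 2000.0],
              [2.0, 3500.0],
              [3.0, 1500.0]])

factors = X.max(axis=0) - X.min(axis=0)  # assumes no constant feature
X_scaled = X / factors
print(X_scaled)  # each feature now spans an interval of length 1
```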

k-means clustering: cluster centers c_1, c_2, ..., c_k with clusters C_1, C_2, ..., C_k.
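A minimal k-means sketch in this notation: centers c_1..c_k and clusters C_1..C_k. Lloyd's iteration with random initialization; both choices are assumptions, not necessarily the lecture's exact variant.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial c_i
    for _ in range(n_iter):
        # Assignment step: each pattern joins the nearest center's cluster.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster.
        new_centers = np.array([X[labels == i].mean(axis=0)
                                if np.any(labels == i) else centers[i]
                                for i in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels
```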

Initial centroids. Specify which patterns are used as initial centroids:
- Random initialization
- Tree clustering on a reduced number of patterns may be performed for this purpose
- Choose the first k patterns as initial centroids
- Sort the distances between all patterns and choose patterns at constant intervals of these distances as initial centroids
- Adaptive initialization (according to a chosen radius)
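Two of the listed strategies are simple enough to sketch directly; these are straightforward readings of the list above, not any particular library's API.

```python
import numpy as np

def init_first_k(X, k):
    """Choose the first k patterns as initial centroids."""
    return X[:k].copy()

def init_random(X, k, seed=0):
    """Random initialization: k distinct patterns chosen at random."""
    rng = np.random.default_rng(seed)
    return X[rng.choice(len(X), size=k, replace=False)]
```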

k-means example (Sá 2001). Cluster merit index R_i (n patterns in k clusters).

The cluster merit index measures the decrease in overall within-cluster distance when passing from a solution with k clusters to one with k+1 clusters. A high value of the merit index indicates a substantial decrease in overall within-cluster distance.
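A hedged stand-in for the merit index: the exact R_i formula of Sá (2001) is not reproduced in this transcript, so the relative decrease in overall within-cluster distance between k and k+1 clusters is sketched instead.

```python
import numpy as np

def within_cluster_distance(X, labels, centers):
    """Sum of distances from every pattern to its cluster center."""
    return sum(np.linalg.norm(X[labels == i] - c, axis=1).sum()
               for i, c in enumerate(centers))

def merit(W_k, W_k_plus_1):
    """Relative decrease when passing from k clusters to k+1 clusters."""
    return (W_k - W_k_plus_1) / W_k
```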

Cluster merit index: factor 1 has the most important contribution. The values k = 3, 5, 8 are sensible choices; k = 3 is the most attractive.

Cluster validation. The procedure of evaluating the results of a clustering algorithm is known under the term cluster validity. In general terms, there are three approaches to investigating cluster validity.

The first is based on external criteria. This implies that we evaluate the results of a clustering algorithm based on a pre-specified structure, which is imposed on the data set and reflects our intuition about its clustering structure.

Error classification rate: a smaller value indicates a better representation. The data are partitioned according to the known classes L_i.

The second approach is based on internal criteria. We may evaluate the results of a clustering algorithm in terms of quantities that involve the vectors of the data set themselves (e.g., the proximity matrix).

Proximity matrix: the dissimilarity matrix, whose entry (i, j) holds the dissimilarity between patterns i and j.

The basis of the validation methods described above is often statistical testing. A major drawback of techniques based on internal or external criteria and statistical testing is their high computational demand.

The third approach to clustering validity is based on relative criteria. Here the basic idea is to evaluate a clustering structure by comparing it to other clustering schemes produced by the same algorithm but with different parameter values.

There are two criteria proposed for clustering evaluation and the selection of an optimal clustering scheme (Berry and Linoff, 1996):
- Compactness: the members of each cluster should be as close to each other as possible. A common measure of compactness is the variance, which should be minimized.
- Separation: the clusters themselves should be widely spaced.

Distance between two clusters. There are three common approaches to measuring the distance between two different clusters:
- Single linkage: the distance between the closest members of the clusters.
- Complete linkage: the distance between the most distant members.
- Comparison of centroids: the distance between the centers of the clusters.
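Minimal implementations of these three between-cluster distances, for two clusters given as arrays of patterns (one pattern per row):

```python
import numpy as np

def _pairwise(A, B):
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_linkage(A, B):    # closest members of the two clusters
    return _pairwise(A, B).min()

def complete_linkage(A, B):  # most distant members of the two clusters
    return _pairwise(A, B).max()

def centroid_distance(A, B): # distance between the cluster centers
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```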

Relative criteria. This approach does not involve statistical tests. The fundamental idea is to choose the best clustering scheme from a set of defined schemes according to a pre-specified criterion.

Among the clustering schemes C_i, i = 1, ..., k, defined by a specific algorithm for different values of its parameters, choose the one that best fits the data set. The procedure of identifying the best clustering scheme is based on a validity index q.

Having selected a suitable performance index q, we proceed with the following steps:
- Run the clustering algorithm for all values of k between a minimum k_min and a maximum k_max, both defined a priori by the user.
- For each value of k, run the algorithm r times, using different sets of values for the other parameters of the algorithm (e.g., different initial conditions).
- Plot the best value of the index q obtained for each k as a function of k.
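A sketch of these steps, with scikit-learn's k-means standing in for "the algorithm" and the silhouette score standing in for the index q; both stand-ins, and the synthetic data, are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.4, (30, 2)) for m in (0, 4, 8)])

k_min, k_max, r = 2, 8, 10
best_q = []
for k in range(k_min, k_max + 1):
    # r runs per k, each with different initial conditions.
    runs = [silhouette_score(X, KMeans(n_clusters=k, n_init=1,
                                       random_state=seed).fit_predict(X))
            for seed in range(r)]
    best_q.append(max(runs))

plt.plot(range(k_min, k_max + 1), best_q, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('best value of index q')
plt.show()
```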

Based on this plot we may identify the best clustering scheme. There are two approaches for defining the best clustering, depending on the behavior of q with respect to k. If the validity index does not exhibit an increasing or decreasing trend as k increases, we seek the maximum (or minimum) of the plot.

For indices that increase (or decrease) as the number of clusters increases, we search for the value of k at which a significant local change in the value of the index occurs. This change appears as a "knee" in the plot, and it is an indication of the number of clusters underlying the data set. The absence of a knee may be an indication that the data set possesses no clustering structure.

Validity index. The Dunn index, a cluster validity index for k-means clustering proposed in Dunn (1974), attempts to identify "compact and well separated clusters".

Dunn index:

D_k = min_{1<=i<=k} min_{i<j<=k} d(C_i, C_j) / max_{1<=m<=k} diam(C_m)

where d(C_i, C_j) is the distance between clusters C_i and C_j, and diam(C_m) is the diameter of cluster C_m.
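The index read directly as code, with single linkage as the between-cluster distance and the maximum pairwise distance as the diameter (just one of the distance/diameter combinations discussed later):

```python
import numpy as np

def dunn_index(clusters):
    """clusters: list of (n_i, d) arrays, one per cluster (at least one
    cluster must have two or more patterns, so the diameter is nonzero)."""
    def d_single(A, B):  # single linkage: closest members of the two clusters
        return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2).min()
    def diam(A):         # maximum pairwise distance within a cluster
        if len(A) < 2:
            return 0.0
        return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2).max()
    max_diam = max(diam(A) for A in clusters)
    k = len(clusters)
    return min(d_single(clusters[i], clusters[j])
               for i in range(k) for j in range(i + 1, k)) / max_diam
```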

If the data set contains compact and well-separated clusters, the distance between the clusters is expected to be large and the diameter of the clusters is expected to be small. Large values of the index therefore indicate the presence of compact and well-separated clusters.

The index D_k does not exhibit any trend with respect to the number of clusters. Thus, the maximum in the plot of D_k versus the number of clusters k can be an indication of the number of clusters that fits the data.

The drawbacks of the Dunn index are:
- a considerable amount of time is required for its computation;
- it is sensitive to the presence of noise in the data set, since noise is likely to increase the values of diam(C).

The Davies-Bouldin (DB) index (1979):

DB_k = (1/k) sum_{i=1..k} max_{j != i} (diam(C_i) + diam(C_j)) / d(C_i, C_j)

where diam(C_i) is the diameter of cluster C_i and d(C_i, C_j) the distance between clusters C_i and C_j.

Small index values correspond to good clusters: the clusters are compact and their centers are far apart. The DB_k index exhibits no trend with respect to the number of clusters, and thus we seek the minimum value of DB_k in its plot versus the number of clusters.
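A sketch of DB_k as defined above. The average distance to the cluster center stands in for the diameter here; that choice is an assumption, and any of the diameter methods listed below could be used instead.

```python
import numpy as np

def davies_bouldin(clusters):
    """clusters: list of (n_i, d) arrays, one per cluster."""
    centers = [A.mean(axis=0) for A in clusters]
    s = [np.linalg.norm(A - c, axis=1).mean()  # cluster scatter ("diameter")
         for A, c in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst (largest) similarity ratio of cluster i to any other cluster.
        total += max((s[i] + s[j]) / np.linalg.norm(centers[i] - centers[j])
                     for j in range(k) if j != i)
    return total / k  # small values: compact clusters with distant centers
```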

Different methods may be used to calculate the distance between clusters:
- single linkage
- complete linkage
- comparison of centroids
- average linkage

Different methods may be used to calculate the diameter of a cluster:
- maximum distance
- radius
- average distance
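Minimal versions of these three diameter definitions. Whether the radius should be doubled to count as a "diameter" varies between texts; it is left undoubled here.

```python
import numpy as np

def diam_max(A):     # maximum pairwise distance within the cluster
    return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2).max()

def diam_radius(A):  # radius: largest distance to the cluster center
    c = A.mean(axis=0)
    return np.linalg.norm(A - c, axis=1).max()

def diam_avg(A):     # average distance over all s(s-1)/2 pairs
    d = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)
    s = len(A)
    return d.sum() / (s * (s - 1))  # d.sum() counts every pair twice
```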

A fully connected graph with s nodes has s(s-1)/2 edges (this is the number of pairs averaged over in the average-distance diameter).

Combination of different distance/diameter methods. It has been shown that using different distance/diameter methods may produce indices of different scale ranges (Azuaje and Bolshakova, 2002).

Normalization. The subscript i selects the distance method, i in {1, 2, 3, 4}; the subscript j selects the diameter method, j in {1, 2, 3}. sigma(D^ij) or sigma(DB^ij) is the standard deviation of D_k^ij or DB_k^ij across the different values of k.

Normalized indexes: each index is divided by its standard deviation across k, i.e., D_k^ij / sigma(D^ij) and DB_k^ij / sigma(DB^ij).
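A hedged sketch of this normalization, assuming it simply divides each index by its standard deviation across the tested values of k; the exact formula of the cited report may differ in detail.

```python
import numpy as np

def normalize_index(values_over_k):
    """values_over_k: D_k^ij (or DB_k^ij) for k = k_min..k_max."""
    v = np.asarray(values_over_k, dtype=float)
    return v / v.std()
```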

Literature

See also J.P. Marques de Sá, Pattern Recognition, Springer; and the TCD-CS technical report (pdf).

Summary:
- Standardization
- k-means
- Initial centroids
- Validation example: cluster merit index
- Cluster validation: three approaches
- Relative criteria
- Validity index: Dunn index, Davies-Bouldin (DB) index
- Combination of different distance/diameter methods

Next lecture: k-NN, LVQ, SOM.