Different Perspectives at Clustering: The Number-of-Clusters Case B. Mirkin School of Computer Science Birkbeck College, University of London IFCS 2006

Different Perspectives at Number of Clusters: Talk Outline Clustering and K-Means: A discussion Clustering goals and four perspectives Number of clusters in: - Classical statistics perspective - Machine learning perspective - Data Mining perspective (including a simulation study with 8 methods) - Knowledge discovery perspective (including a comparative genomics project)

WHAT IS CLUSTERING; WHAT IS DATA; K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Interpretation Aids. WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering. DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering. DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters. GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability

Example: W. Jevons, updated in Mirkin 1996: Pluto doesn't fit in the two clusters of planets

Example: A Few Clusters. Clustering interface to WEB search engines (Grouper). Query: Israel (after O. Zamir and O. Etzioni 2001).
Cluster 1 (24 sites): Society, religion; Israel and Judaism; Judaica collection.
Cluster 2 (12 sites): Middle East, War, History; The state of Israel; Arabs and Palestinians.
Cluster 3 (31 sites): Economy, Travel; Israel Hotel Association; Electronics in Israel.

Clustering: Main Steps Data collecting Data pre-processing Finding clusters (the only step appreciated in conventional clustering) Interpretation Drawing conclusions

Conventional Clustering: Cluster Algorithms Single Linkage: Nearest Neighbour Ward Agglomeration Conceptual Clustering K-means Kohonen SOM ………………….

K-Means: a generic clustering method. Entities are presented as multidimensional points. 0. Put K hypothetical centroids (seeds). 1. Assign points to the centroids according to the minimum distance rule. 2. Put centroids at the gravity centres of the clusters thus obtained. 3. Iterate 1. and 2. until convergence. (Figure: K = 3 hypothetical centroids among the data points.)

K-Means: a generic clustering method (continued). 0. Put K hypothetical centroids (seeds). 1. Assign points to the centroids according to the minimum distance rule. 2. Put centroids at the gravity centres of the clusters thus obtained. 3. Iterate 1. and 2. until convergence. 4. Output final centroids and clusters. (Figure: final centroids and clusters.)
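
A minimal sketch of the steps above in Python (illustrative only: the random seeding, the convergence test and all variable names are my assumptions, not the author's code):

```python
import numpy as np

def k_means(Y, K, seed=0, max_iter=100):
    """Steps 0-4 from the slide: seed K centroids, apply the minimum distance
    rule, move centroids to the gravity centres, iterate until convergence."""
    rng = np.random.default_rng(seed)
    centroids = Y[rng.choice(len(Y), size=K, replace=False)].astype(float)  # step 0
    labels = np.full(len(Y), -1)
    for _ in range(max_iter):
        # step 1: minimum distance rule
        dists = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # step 3: stop at convergence
            break
        labels = new_labels
        # step 2: centroids at the gravity centres of the obtained clusters
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = Y[labels == k].mean(axis=0)
    return centroids, labels                     # step 4: output

# usage sketch on random data
Y = np.random.default_rng(1).normal(size=(100, 2))
centroids, labels = k_means(Y, K=3)
```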

Advantages of K-Means. Conventional: models typology building; computationally effective; can be incremental, 'on-line'. Unconventional: associates feature salience with feature scales and correlation/association; applicable to mixed scale data.

Drawbacks of K-Means. No advice on: data pre-processing; number of clusters; initial setting. Instability of results. Criterion can be inadequate. Insufficient interpretation aids.

Initial Centroids: Correct Two cluster case

Initial Centroids: Correct (figure: initial and final centroid positions)

Different Initial Centroids

Different Initial Centroids: Wrong, even though in different clusters (figure: initial and final centroid positions)

Two types of goals (with no clear-cut borderline) Engineering goals Data analysis goals

Engineering goals (examples) Devising a market segmentation to minimise the promotion and advertisement expenses Dividing a large scheme into modules to minimise the cost Organisation structure design

Data analysis goals (examples) Recovery of the distribution function Prediction Revealing patterns in data Enhancing knowledge with additional concepts and regularities Each of these is realised in a different perspective on clustering

Clustering Perspectives Classical statistics: Recovery of a multimodal distribution function Machine learning: Prediction Data mining: Revealing patterns in data Knowledge discovery: additional concepts and regularities

Clustering Perspectives at # Clusters Classical statistics: As many as meaningful modes (mixture items) Machine learning: As many as needed for acceptable prediction Data mining: As many as meaningful patterns in data (including incomplete clustering) Knowledge discovery: As many as needed to produce concepts and regularities adequate to the domain

Main Sources for Deriving # Clusters Classical statistics: Model of the world Machine learning: Cost & Accuracy Trade Off Data mining: Data Knowledge discovery: Domain knowledge

Classical Statistics Perspective There must be a model of data generation E.g., Mixture of Gaussians The task: identify all parameters of the model by using observed data E.g., The number of Gaussians and their probabilities, means and covariances

Mixture of 3 Gaussian densities

Classical statistics perspective on K-Means: K-Means is but a maximum likelihood method with spherical Gaussians of the same variance: within a cluster, all variables are independent and Gaussian with the same cluster-independent variance (z-scoring is a must then); the issue of the number of clusters can then be approached with conventional hypothesis-testing approaches
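
As an illustration of this perspective, a short sketch (my example, not from the talk) that fits spherical Gaussian mixtures for several candidate K and compares them with BIC, using scikit-learn's GaussianMixture; the toy data and the candidate range are arbitrary assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# three well-separated spherical Gaussians as toy data
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ((0, 0), (5, 0), (0, 5))])

bic = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, covariance_type='spherical',
                         random_state=0).fit(X)
    bic[k] = gm.bic(X)          # lower BIC = better fit/complexity trade-off

best_k = min(bic, key=bic.get)
print(best_k)                   # typically 3 for these data
```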

Machine learning perspective. Clusters should be of help in learning incrementally generated data. The number should be specified by the trade-off between accuracy and cost. A criterion should guarantee partitioning of the feature space with clearly separated high-density areas. A method should be proven to be consistent with the criterion on the population.

Machine learning on K-Means. The number of clusters: to be specified according to prediction goals. Pre-processing: no advice. An incremental version of K-Means converges to a minimum of the summary within-cluster variance, under conventional assumptions of data generation (MacQueen 1967 is the major reference, though the method can be traced a dozen or two years earlier).

Data mining perspective

Data recovery framework for data mining methods. Types of data: similarity, temporal, entity-to-feature, co-occurrence. Types of model: regression, principal components, clusters. Model: Data = Model_Derived_Data + Residual. Pythagoras: Data² = Model_Derived_Data² + Residual². The better the fit, the better the model.

K-Means as a data recovery method

Representing a partition. Cluster k: centroid c_kv (v a feature); binary 1/0 membership z_ik (i an entity)

Basic equations (analogous to PCA): y_iv = Σ_k c_kv z_ik + e_iv, and the data scatter decomposes as Σ_i Σ_v y_iv² = Σ_k N_k Σ_v c_kv² + Σ_i Σ_v e_iv², where y is a data entry, z a cluster membership, c a cluster centroid, and N_k the cardinality of cluster k; i indexes entities, v features/categories, k clusters

Meaning of Data scatter The sum of contributions of features – the basis for feature pre-processing (dividing by range, not std) Proportional to the summary variance
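
A tiny numeric check of this decomposition (the toy data and cluster labels are my own; the identity itself holds whenever centroids are within-cluster means):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(12, 3))
Y = Y - Y.mean(axis=0)            # centre the data, as in the data recovery model
labels = np.array([0]*4 + [1]*4 + [2]*4)

T = (Y ** 2).sum()                # data scatter
B = 0.0                           # model-derived (explained) part
W = 0.0                           # residual part
for k in range(3):
    Yk = Y[labels == k]
    ck = Yk.mean(axis=0)          # centroid of cluster k
    B += len(Yk) * (ck ** 2).sum()
    W += ((Yk - ck) ** 2).sum()

print(T, B + W)                   # the two numbers coincide up to rounding
```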

Contribution of a feature F to a partition: proportional to the correlation ratio η² if F is quantitative, and to a contingency coefficient between the cluster partition and F if F is nominal: Pearson chi-square (Poisson normalised) or Goodman-Kruskal tau-b (range normalised)

Contribution of a quantitative feature to a partition: proportional to the correlation ratio η²

Contribution of a nominal feature to a partition: proportional to a contingency coefficient: Pearson chi-square (Poisson normalised) or Goodman-Kruskal tau-b (range normalised)
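
A sketch of both contributions described on the two slides above (quantitative feature: correlation ratio; nominal feature: Pearson chi-square in proportion form); the function names and the exact normalisation are my choices, not the author's:

```python
import numpy as np

def correlation_ratio(x, labels):
    """Share of the variance of a quantitative feature x explained by the partition."""
    x, labels = np.asarray(x, float), np.asarray(labels)
    total = ((x - x.mean()) ** 2).sum()
    between = sum((labels == k).sum() * (x[labels == k].mean() - x.mean()) ** 2
                  for k in np.unique(labels))
    return between / total

def pearson_chi_square(nominal, labels):
    """Pearson chi-square (divided by N) between a nominal feature and the partition."""
    nominal, labels = np.asarray(nominal), np.asarray(labels)
    n = len(nominal)
    chi2 = 0.0
    for v in np.unique(nominal):
        for k in np.unique(labels):
            observed = ((nominal == v) & (labels == k)).sum() / n
            expected = (nominal == v).mean() * (labels == k).mean()
            chi2 += (observed - expected) ** 2 / expected
    return chi2
```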

Pythagorean Decomposition of data scatter for interpretation

Contribution-based description of clusters: C. Dickens: FCon = 0; M. Twain: LenD < 28; L. Tolstoy: NumCh > 3 or Direct = 1

Principal Cluster Analysis (Anomalous Pattern) Method: y_iv = c_v z_i + e_iv, where z_i = 1 if i ∈ S and z_i = 0 if i ∉ S. With squared Euclidean distance, the centroid c of S must be anomalous, that is, interesting

Initial setting with Anomalous Single Cluster for iK-Means (figure: 'Tom Sawyer' example)

iK-Means with Anomalous Single Clusters (figure: 'Tom Sawyer' example with clusters labelled 0, 1, 2, 3)

Anomalous clusters + K-Means: after extracting 2 clusters (how can one know that 2 is right?) (figure: final clusters)
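
A hedged sketch of one Anomalous Pattern extraction as described two slides above (data assumed pre-centred at the reference point, so the origin plays the role of the grand mean; the variable names and the exact stopping test are my assumptions):

```python
import numpy as np

def anomalous_pattern(Y):
    """Extract one anomalous cluster from data pre-centred at the reference point."""
    d0 = (Y ** 2).sum(axis=1)            # squared distances to the reference point
    centroid = Y[d0.argmax()].copy()     # start from the farthest entity
    members = None
    while True:
        d_c = ((Y - centroid) ** 2).sum(axis=1)
        new_members = d_c < d0           # minimum distance rule: centroid vs reference
        if members is not None and np.array_equal(new_members, members):
            break
        members = new_members
        centroid = Y[members].mean(axis=0)
    return members, centroid
```

Repeating the extraction on the remaining entities, dropping singleton patterns, and using the surviving centroids as seeds gives both a number of clusters and an initial setting for iK-Means.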

Simulation study of 8 methods (joint work with Mark Chiang). Number-of-clusters methods: Variance based: Hartigan (HK), Calinski & Harabasz (CH), Jump Statistic (JS). Structure based: Silhouette Width (SW). Consensus based: Consensus Distribution area (CD), Consensus Distribution mean (DD). Sequential extraction of APs: Least Square (LS), Least Moduli (LM).

Data generation for the experiment: Gaussian Mixture (6, 7, 9 clusters) with cluster spatial size constant (spherical), k-proportional, or k²-proportional, and cluster spread (distance between centroids) either Large or Small. Spread settings (spherical / PPCA model, k-proportional / k²-proportional): Large: 2 ( ), 10 ( ); Small: 0.2 ( ), 0.5 ( ), 2 ( ).

Evaluation of results: Estimated clustering versus that generated Number of clusters Distance between centroids Similarity between partitions

Distance between estimated centroids and those generated: prime assignment (figure: generated centroids G1(p1), G2(p2), G3(p3) and estimated centroids e1(q1), ..., e5(q5); each estimated centroid is assigned to a nearest generated centroid)

Distance between estimated centroids and those generated: final assignment (figure: the same centroids, with each generated centroid now receiving its estimated centroids, e.g. g1: e2, e1; g2: e4, e3; g3: e5)

Distance between centroids: quadratic and city-block. 1. Assignment: g1(p1): e1(q1), e2(q2); g2(p2): e3(q3), e4(q4); g3(p3): e5(q5). 2. Distancing: d1 = (q1*d(g1,e1) + q2*d(g1,e2))/(q1+q2); d2 = (q3*d(g2,e3) + q4*d(g2,e4))/(q3+q4); d3 = q5*d(g3,e5)/q5

Distance between centroids: quadratic and city-block. 1. Assignment. 2. Distancing. 3. Averaging: p1*d1 + p2*d2 + p3*d3
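
A sketch of the prime-assignment version of this three-step distance, with squared Euclidean distances (swapping in absolute differences gives the city-block variant); the array layout and function name are my assumptions:

```python
import numpy as np

def centroid_distance(G, p, E, q):
    """G: generated centroids, p: their proportions; E: estimated centroids, q: their sizes."""
    d = ((E[:, None, :] - G[None, :, :]) ** 2).sum(axis=2)  # squared distances e_i to g_k
    nearest = d.argmin(axis=1)                              # step 1: assignment
    total = 0.0
    for k in range(len(G)):
        assigned = nearest == k
        if assigned.any():
            # step 2: distancing, weighted by the estimated cluster sizes q
            dk = (q[assigned] * d[assigned, k]).sum() / q[assigned].sum()
            # step 3: averaging with the generated proportions p
            total += p[k] * dk
    return total
```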

Similarity between partitions according to their confusion table Relative distance (Mirkin-Cherny 1970) Tchouprov coefficient (Cramer 1943) Adjusted Rand Index (Arabie-Hubert, 1985) Average Overlap (Mirkin 2005)
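
Of these indices, the Adjusted Rand Index is the one reported in the results below; a routine implementation from the confusion (contingency) table of the two partitions, using the standard formula rather than anything specific to the talk:

```python
from math import comb
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """ARI of two partitions given as label arrays of the same length."""
    a_vals, a_inv = np.unique(labels_a, return_inverse=True)
    b_vals, b_inv = np.unique(labels_b, return_inverse=True)
    table = np.zeros((len(a_vals), len(b_vals)), dtype=int)   # confusion table
    for i, j in zip(a_inv, b_inv):
        table[i, j] += 1
    n = int(table.sum())
    sum_cells = sum(comb(int(x), 2) for x in table.ravel())
    sum_rows = sum(comb(int(x), 2) for x in table.sum(axis=1))
    sum_cols = sum(comb(int(x), 2) for x in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)
```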

Results at 9 clusters, 1000 entities, 20 features generated (table: rows HK, CH, JS, SW, CD, DD, LS, LM; columns: estimated number of clusters, distance between centroids, and Adjusted Rand Index, each at large and small spread; numeric entries not reproduced here)

Knowledge discovery perspective on clustering Conforming to and enhancing domain knowledge: Informal considerations so far Relevant items: Decision trees External validation

A case to generalise: entities with a similarity measure; a clustering interpretation tool developed; a clustering method using a similarity threshold leading to a number of clusters; domain knowledge leading to constraints on the similarity threshold; the best fitting interpretation provides for the best number of clusters

Entities with a similarity measure: 740 Homologous Protein Families (HPFs) in 30 herpes-virus genomes. Homology defined by a protein sequence fragment: a sequence-neighbourhood based similarity measure on HPFs (figure: fragments F1, F2, F3)

Interpretation tool: Mapping to an evolutionary tree over genomes. Different aggregations – different histories (figure: evolutionary tree over the genomes with fragments F1, F2, F3 mapped onto it)

Algorithm: ADDI-S (Mirkin JoC 1987), a data approximation technique. To maximize the contribution to the data scatter: the average within-cluster similarity c multiplied by the cluster's size #S. Algorithm ADDI-S: take S = {j} for an arbitrary j; given S, find c = c(S) and the similarities b(i,S) to S for all entities i in and out of S; check the differences b(i,S) - c/2; if any is inconsistent with the current membership, change the state of a most contributing entity, else stop and output S. The resulting S has a tightness property. Related work: Holzinger (1941) B-coefficient; Arkadiev & Braverman (1964, 1967) Specter; Mirkin (1976, 1987) ADDI family; Ben-Dor, Shamir, Yakhini (1999) CAST

Algorithm: ADDI-S (Mirkin 1987), a data approximation technique. Number of clusters: depends on the similarity shift threshold b (figure: similarities b(ij) shifted to b(ij) – b)
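
A hedged sketch of ADDI-S over a similarity matrix whose entries have already been shifted by the threshold b; the treatment of the diagonal and the tie-breaking are my simplifications of the rule stated on the previous slide:

```python
import numpy as np

def addi_s(A, start):
    """One ADDI-S cluster grown from entity `start` over shifted similarities A."""
    n = A.shape[0]
    in_S = np.zeros(n, dtype=bool)
    in_S[start] = True
    while True:
        size = int(in_S.sum())
        c = A[np.ix_(in_S, in_S)].sum() / (size * size)   # average within-cluster similarity
        b = A[:, in_S].sum(axis=1) / size                 # attractions b(i, S)
        # an entity is inconsistent if it is in S with b(i,S) < c/2,
        # or out of S with b(i,S) > c/2; flip the most contributing one
        gain = np.where(in_S, c / 2 - b, b - c / 2)
        i = int(gain.argmax())
        if gain[i] <= 0 or (in_S[i] and size == 1):
            break
        in_S[i] = not in_S[i]
    return np.flatnonzero(in_S)
```

Running the extraction from different seeds and keeping the distinct resulting clusters reproduces the threshold-dependent number of clusters discussed above.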

Domain knowledge: function is known for some HPFs. 287 pairs of HPFs with known function, of which 86 are SYNONYMOUS (same function). (Figure: similarity density for synonymous versus non-synonymous pairs.) Two threshold values: minimum error; no non-synonymous pairs.

Knowledge enhancing: analyzing the reconstructed contents in 3 family ancestors and HUCA (the root); analyzing differences between the b = .42 and b = .67 cluster reconstructions; analyzing gene arrangement within genomes. Glycoprotein L's HPFs are sequence-dissimilar, but in the genomes they are always followed by a glycolase that is mapped to HUCA (figure: Glyc L followed by Glycolase). Therefore glycoprotein L must be in HUCA too.

Final HPFs and APFs: HPFs with a sequence-based similarity measure; interpretation via parsimonious histories; clustering by ADDI-S using a similarity threshold leading to a number of clusters; domain knowledge: 86 pairs should be in the same clusters and 201 in different clusters, giving 2 suggested similarity thresholds; best fitting: 102 APFs (aggregating 249 HPFs) and 491 singleton HPFs

Whole HPF aggregation methods structure (joint work with R. Camargo, T. Fenner, P. Kellam, G. Loizou)

Conclusion I: Number of clusters? Engineering perspective: defined by cost/effect. Classical statistics perspective: can and should be determined from data with a model. Machine learning perspective: can be specified according to the prediction accuracy to achieve. Data mining perspective: not to pre-specify; only those clusters are of interest that bear interesting patterns. Knowledge discovery perspective: not to pre-specify; those that are best in knowledge enhancing.

Conclusion II: The same holds for every other data analysis concept. Classical statistics perspective: can be determined from data with a model. Machine learning perspective: prediction accuracy to achieve. Data mining perspective: data approximation. Knowledge discovery perspective: knowledge enhancing.

Variance based methods. Hartigan (HK): calculate HT = (W_k/W_{k+1} - 1)(N - k - 1), where N is the number of entities; find the smallest k for which HT is less than a threshold of 10. Calinski and Harabasz (CH): calculate CH = ((T - W_k)/(k - 1))/(W_k/(N - k)), where T is the data scatter; find the k which maximizes CH. Here W_k is, for a given K, the smallest within-cluster summary distance to centroids among those found at different K-Means initializations.
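
Both rules reduce to simple arithmetic once W_K has been computed for a range of K; a sketch with a dict-based layout of my own (not the author's code):

```python
def hartigan_rule(W, N, threshold=10):
    """W: dict K -> W_K. Return the smallest K whose Hartigan index drops below the threshold."""
    for k in sorted(W):
        if k + 1 in W:
            ht = (W[k] / W[k + 1] - 1) * (N - k - 1)
            if ht < threshold:
                return k
    return max(W)

def calinski_harabasz_rule(W, T, N):
    """Return the K maximising CH = ((T - W_K)/(K - 1)) / (W_K/(N - K))."""
    scores = {k: ((T - W[k]) / (k - 1)) / (W[k] / (N - k))
              for k in W if k > 1}
    return max(scores, key=scores.get)
```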

Variance based methods. Jump Statistic (JS): for each entity i, clustering S = {S_1, S_2, ..., S_K} and centroids C = {C_1, C_2, ..., C_K}, calculate d(i, S_k) = (y_i - C_k)^T Γ^(-1) (y_i - C_k) and d_K = (Σ_i d(i, S_k))/(P*N), where P is the number of features, N is the number of rows and Γ is the covariance matrix of y; select a transformation power Y, typically P/2; calculate the jumps JS_K = d_K^(-Y) - d_{K-1}^(-Y), with d_0^(-Y) = 0; find the K which maximizes JS_K.
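
A simplified sketch of the final selection step, assuming the average distortions d_K have already been computed (with Γ taken as the identity for simplicity) and stored in a dict keyed by K; names and layout are mine:

```python
def choose_k_by_jump(distortions, P):
    """distortions: dict K -> average distortion d_K; returns the K with the
    largest jump d_K**(-P/2) - d_(K-1)**(-P/2), taking d_0**(-P/2) = 0."""
    Y = P / 2.0
    transformed = {k: d ** (-Y) for k, d in distortions.items()}
    jumps = {k: transformed[k] - transformed.get(k - 1, 0.0)
             for k in sorted(transformed)}
    return max(jumps, key=jumps.get)
```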

Structure based methods. Silhouette Width (SW): for each entity i, a(i) = average dissimilarity between i and all other entities of the cluster to which i belongs; for each other cluster S_k, d(i, S_k) = average dissimilarity between i and all entities of S_k, and b(i) = min of d(i, S_k) over those S_k; s(i) = [b(i) - a(i)]/max(a(i), b(i)); calculate the average s = Σ_i s(i)/N; find the K maximizing the average s.
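
A direct implementation of the silhouette width as summarised above, from a precomputed dissimilarity matrix D (my helper; at least two clusters are assumed, and singleton entities get s(i) = 0 by convention):

```python
import numpy as np

def average_silhouette(D, labels):
    labels = np.asarray(labels)
    n = len(labels)
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False
        if not same.any():                       # singleton cluster
            continue
        a = D[i, same].mean()                    # a(i): within own cluster
        b = min(D[i, labels == k].mean()         # b(i): nearest other cluster
                for k in np.unique(labels) if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()
```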

Consensus based methods. Consensus distribution area (CD): for each of a number of K-Means initializations, find the connectivity matrix; calculate the consensus matrix; calculate the cumulative distribution function (CDF) of its entries; calculate the area under the CDF, A(K); calculate Δ(K+1), the change in the area under the CDF when moving from K to K+1; find the K which maximizes Δ(K).

Consensus based methods. Consensus distribution mean (DD): avdis(K) = μ_K(1 - μ_K) - σ_K², where μ_K is the mean and σ_K² is the variance of the consensus matrix; davdis(K) = (avdis(K) - avdis(K+1))/avdis(K+1); find the K which maximizes davdis.
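
A sketch of the DD rule above, assuming the consensus matrices (entries in [0,1], one per candidate K, built from repeated K-Means runs) are already available in a dict; names are my own:

```python
import numpy as np

def choose_k_by_dd(consensus):
    """consensus: dict K -> consensus matrix for K clusters."""
    avdis = {}
    for K, M in consensus.items():
        mu = M.mean()
        avdis[K] = mu * (1 - mu) - M.var()
    davdis = {K: (avdis[K] - avdis[K + 1]) / avdis[K + 1]
              for K in avdis if K + 1 in avdis}
    return max(davdis, key=davdis.get)
```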

Sequential cluster extraction. Intelligent K-Means: Anomalous Pattern (initial clusters), removal of singletons, K-Means. Euclidean distance and the within-cluster mean give the Least Square criterion (LS); Manhattan distance and the within-cluster median give the Least Moduli criterion (LM).