K-means properties Pasi Fränti

K-means properties. Pasi Fränti, 11.10.2017. Based on: P. Fränti and S. Sieranoja, "K-means properties on six clustering benchmark datasets", Algorithms, 2017.

Goal of k-means. Input: N points X = {x1, x2, …, xN}. Output: partition P = {p1, p2, …, pN} and k centroids C = {c1, c2, …, ck}. Objective function: SSE = sum-of-squared errors.

Goal of k-means. Input: N points X = {x1, x2, …, xN}. Output: partition P = {p1, p2, …, pN} and k centroids C = {c1, c2, …, ck}. Objective function: SSE. Assumptions: SSE is a suitable objective and k is known.
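
The objective function itself appears only as an image in the slides; written out in the notation above, the standard k-means objective is

    SSE(C, P) = \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2

i.e. the sum of squared distances from each point to the centroid of the cluster it is assigned to.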

K-means algorithm. Notation: X = data set, C = cluster centroids, P = partition.
K-Means(X, C) → (C, P)
REPEAT
  Cprev ← C;
  FOR i=1 TO N DO pi ← FindNearest(xi, C);                (assignment step)
  FOR j=1 TO k DO cj ← average of all xi with pi = j;     (centroid step)
UNTIL C = Cprev
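
A minimal runnable sketch of the same loop in Python with NumPy (function and variable names are illustrative, not taken from the slides):

    import numpy as np

    def kmeans(X, C):
        """Basic k-means. X: (N, D) data array, C: initial (k, D) centroid array."""
        while True:
            C_prev = C.copy()
            # Assignment step: each point goes to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
            P = dists.argmin(axis=1)
            # Centroid step: each centroid becomes the mean of its assigned points
            # (empty clusters keep their previous centroid in this sketch).
            C = np.array([X[P == j].mean(axis=0) if np.any(P == j) else C[j]
                          for j in range(len(C))])
            if np.allclose(C, C_prev):   # stop when the centroids no longer move
                return C, P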

K-means optimization steps. Assignment step: assign each point to its nearest centroid. Centroid step: recompute each centroid as the mean of the points assigned to it.
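
Written out (standard k-means update rules, matching the pseudocode above):

    Assignment step:  p_i = \arg\min_{1 \le j \le k} \lVert x_i - c_j \rVert^2
    Centroid step:    c_j = \frac{1}{|\{i : p_i = j\}|} \sum_{i : p_i = j} x_i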

Problems of k-means: distance between clusters. K-means cannot move centroids between clusters that are far away from each other.

Problems of k-means: dependency on the initial solution. The result after k-means depends strongly on where the initial centroids are placed.

K-means performance: how is it affected by 1. overlap, 2. number of clusters, 3. dimensionality, and 4. unbalance of cluster sizes?

Basic Benchmark

Data set statistics:
Dataset    Varying                Size       Clusters  Per cluster
A          Number of clusters     3000-7500  20-50     150
S          Overlap                5000       15        333
Dim        Dimensions             1024       16        64
G2         Dimensions + overlap   2048       2         1024
Birch      Structure              100,000    100       1000
Unbalance  Balance                6500       8         100-2000

A sets (A1, A2, A3): spherical clusters, with the number of clusters increasing from k=20 (A1) to k=35 (A2) and k=50 (A3). The sets are subsets of each other: A1 ⊂ A2 ⊂ A3. Other parameters are fixed: cluster size = 150, deviation = 1402, overlap = 0.30, dimensionality = 2.

S sets (S1-S4): k=15 Gaussian clusters (a few truncated), with overlap increasing from S1 to S4: 9%, 22%, 41% and 44%. S1 has the least overlap; S4 has strong overlap, but the clusters are still recognizable.

Unbalance: k=8 clusters in two well-separated areas: three dense clusters of 2000 points each (st.dev = 2043) and five sparse clusters of 100 points each (st.dev = 6637). The correct clustering can be obtained by minimizing SSE.

DIM sets: k=16 well-separated clusters in high-dimensional spaces (example: DIM32). The dimensionality varies: 32, 64, 128, 256, 512, 1024.

G2 datasets: k=2 Gaussian clusters. Dataset name: G2-dim-sd (e.g. G2-2-30, G2-2-50, G2-2-70). Centroid 1: [500, 500, ...], centroid 2: [600, 600, ...]. Dimensions: 2, 4, 8, 16, ..., 1024. St.dev.: 10, 20, 30, ..., 100.

Birch datasets. Birch1: clusters on a regular 10x10 grid with constant variance. Birch2: cluster centers placed along a sine curve, y(x) = amplitude * sin(2*pi*frequency*x + phaseshift) + offset, with offset = 43659, amplitude = -37819, phaseshift = 20.8388 and frequency = 0.000004205.
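
As an illustration, the Birch2 cluster centers can be placed with the sine formula above; a small Python sketch (parameter values copied from the slide, but the exact generator used for the original benchmark may differ):

    import math

    OFFSET, AMPLITUDE = 43659, -37819
    PHASESHIFT, FREQUENCY = 20.8388, 0.000004205

    def birch2_center_y(x):
        """y-coordinate of a Birch2 cluster center at horizontal position x."""
        return AMPLITUDE * math.sin(2 * math.pi * FREQUENCY * x + PHASESHIFT) + OFFSET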

Birch2 subsets. B2-sub: obtained by cutting off the last cluster one at a time (N=100,000, k=100; N=99,000, k=99; N=98,000, k=98; ...; N=2,000, k=2; N=1,000, k=1). B2-random: obtained by random subsampling.

Properties

Measured properties: overlap, contrast, intrinsic dimensionality, H-index, distance profiles.

Misclassification probability: count the points from the blue cluster that are closer to the red centroid, and the points from the red cluster that are closer to the blue centroid. Example: points = 2048, incorrect = 20, overlap = 20/2048 ≈ 0.9%.

Overlap measure: count the points in the blue cluster whose nearest red neighbor is closer than their own centroid, and the points in the red cluster whose nearest blue neighbor is closer than their own centroid. Here d1 = distance to the nearest centroid and d2 = distance to the 2nd nearest centroid. Example: points = 2048, evidence = 332, overlap = 332/2048 ≈ 16%.
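
A rough sketch of this overlap count in Python, assuming ground-truth labels and centroids are available (an O(N^2) illustration; the exact rule in the benchmark paper may differ in detail):

    import numpy as np

    def overlap(X, labels, centroids):
        """Fraction of points that have a neighbor from another cluster
        closer than their own centroid (the 'evidence' count above)."""
        evidence = 0
        for i in range(len(X)):
            d_own = np.linalg.norm(X[i] - centroids[labels[i]])
            d_other = min(np.linalg.norm(X[i] - X[j])
                          for j in range(len(X)) if labels[j] != labels[i])
            if d_other < d_own:
                evidence += 1
        return evidence / len(X)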

Contrast

Intrinsic dimensionality: computed from the average and the variance of the pairwise distances. (Figure: example values for the Unbalance, DIM, Birch1, Birch2, S and A data sets.)
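
The formula behind "average of distances" and "variance of distances" is presumably the estimate of Chávez and Navarro (cited in the references), where mu and sigma^2 are the mean and variance of the pairwise distance distribution:

    ID = \frac{\mu^2}{2\sigma^2}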

H-index of hubness values. Example (from the figure): ranked hubness values, Rank 1-7: 4, 2, 2, 2, 2, 2, 0, giving H-index = 2.
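
A small illustrative sketch of how such an H-index can be computed from a list of hubness values (same convention as the bibliometric h-index; not code from the slides):

    def h_index(hubness):
        """Largest h such that at least h points have hubness >= h."""
        h = 0
        for rank, value in enumerate(sorted(hubness, reverse=True), start=1):
            if value >= rank:
                h = rank
            else:
                break
        return h

    print(h_index([4, 2, 2, 2, 2, 2, 0]))   # -> 2, as in the example above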

Distance profiles. Data that contains clusters tends to have two peaks: local distances (distances inside the clusters) and global distances (distances across different clusters). (Figure: distance profiles of A1-A3, S1-S4, Birch1, Birch2, DIM32 and Unbalance.)

Distance profiles of the G2 datasets. (Figure: profiles when the dimension increases, D = 2, 4, 8, 16, 32, ..., 1024, and when the overlap increases, sd = 10, 20, 30, 40, 50, 60, shown for D = 2 and D = 128.)

Summary of the properties

G2 overlap: overlap decreases when the dimensionality increases and increases when the standard deviation increases.

G2 contrast: contrast decreases in both directions, i.e. when the dimensionality increases and when the standard deviation increases.

G2 intrinsic dimensionality: ID increases when the dimensionality increases (provided there is overlap) and when the standard deviation increases; this property shows the most significant change.

G2 H-index: the H-index increases when the dimensionality increases (the most significant change) but shows no change when the standard deviation increases.

Evaluation

Internal measures: sum-of-squared errors (SSE), normalized mean square error (nMSE), approximation ratio (ε).
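
As a rough paraphrase of how these are defined in the accompanying paper (N = number of points, D = dimensionality, SSE_opt = SSE of the ground-truth solution; the exact formulas are on the missing slide image):

    nMSE = \frac{SSE}{N \cdot D}, \qquad \varepsilon = \frac{SSE}{SSE_{opt}}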

External measures: centroid index (CI) [P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 2014]. (Figure: example with CI=4, showing clusters with missing centroids and clusters with too many centroids.)
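
The cluster-level idea behind CI, as described in the cited paper: map each centroid of one solution to its nearest centroid in the other solution and count the centroids that receive no mapping ("orphans"); the symmetric CI is the larger of the two directed counts. An illustrative Python sketch, not the reference implementation:

    import numpy as np

    def ci_directed(C1, C2):
        """Number of centroids in C2 not chosen as nearest neighbor by any centroid in C1."""
        nearest = {int(np.argmin(np.linalg.norm(C2 - c, axis=1))) for c in C1}
        return sum(1 for j in range(len(C2)) if j not in nearest)

    def centroid_index(C1, C2):
        return max(ci_directed(C1, C2), ci_directed(C2, C1))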

External measures: success rate = the proportion of runs that find the correct clustering (CI=0). Example: six runs with CI values 1, 2, 1, 0, 2, 2 give a success rate of 1/6 ≈ 17%.

Results

Summary of results

Dependency on overlap (S datasets). Success rates and CI values as the overlap increases: S1 3% (CI=1.8), S2 11% (CI=1.4), S3 12% (CI=1.3), S4 26% (CI=0.9).

Dependency on overlap G2 datasets

Why does overlap help? (Figure: example runs with overlap = 7% and overlap = 22%; 13 iterations.)

Main observation 1. Overlap: overlap is good!

Dependency on the number of clusters (A datasets). As the number of clusters increases (A1: k=20, A2: k=35, A3: k=50): success rate 1%, 0%, 0%; CI 2.5, 4.5, 6.6; relative CI 13%, 13%, 13%.

Dependency on clusters (k)

Dependency on data size (N)

Main observation 2. Number of clusters: the number of errors (CI) increases linearly with k!

Dependency on dimensions (DIM datasets). As the dimensionality increases (32, 64, 128, 256, 512, 1024), the CI stays at 3.6, 3.5, 3.8, 3.8, 3.9, 3.7 and the success rate is 0% throughout.

Dependency on dimensions (G2 datasets). (Figure: the success rate degrades in some configurations and improves in others as the dimensionality grows.)

Lack of overlap is the cause! Correlation between overlap and success rate: 0.91.

Main observation 3. Dimensionality: no direct effect!

Effect of unbalance (Unbalance dataset). Success rate: 0%; average CI: 3.9. The problem originates from the random initialization.

Main observation 4. Unbalance of cluster sizes: unbalance is bad!

Improving k-means

Better initialization techniques. Simple initializations: random centroids (Random) [Forgy] [MacQueen]; furthest-point heuristic (maxmin) [Gonzalez]. More complex: k-means++ [Vassilvitskii]; Luxburg's method [Luxburg].
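
For illustration, a sketch of the furthest-point (maxmin) heuristic [Gonzalez]: start from one centroid and repeatedly add the data point that is furthest from its nearest already-chosen centroid. Details such as how the first centroid is picked vary between implementations; here it is simply the first data point:

    import numpy as np

    def maxmin_init(X, k):
        """Furthest-point heuristic: greedily pick points far from the chosen centroids."""
        centroids = [X[0]]                    # assumption: start from the first point
        for _ in range(1, k):
            d = np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2)
            d_nearest = d.min(axis=1)         # distance of each point to its nearest centroid
            centroids.append(X[int(np.argmax(d_nearest))])
        return np.array(centroids)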

Initialization techniques Varying N

Initialization techniques Varying k

Repeated k-means (RKM): repeat k-means 100 times from different random initializations and keep the best result. This can increase the chances of success significantly; in principle, running it forever would solve the problem, but it has limitations when k is large.
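
A sketch of repeated k-means in Python: run k-means from a number of random initializations and keep the solution with the smallest SSE. It reuses the kmeans() sketch from earlier in this transcript; sse() is a small helper defined here:

    import numpy as np

    def sse(X, C, P):
        """Sum-of-squared errors of partition P with centroids C."""
        return float(sum(((X[P == j] - C[j]) ** 2).sum() for j in range(len(C))))

    def repeated_kmeans(X, k, repeats=100, seed=0):
        """Keep the best of `repeats` k-means runs from random initial centroids."""
        rng = np.random.default_rng(seed)
        best = None
        for _ in range(repeats):
            C0 = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
            C, P = kmeans(X, C0)
            cost = sse(X, C, P)
            if best is None or cost < best[0]:
                best = (cost, C, P)
        return best[1], best[2]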

A better algorithm: Random Swap (RS) or Genetic Algorithm (GA). Implementations: http://cs.uef.fi/pages/franti/research/rs.txt and cs.uef.fi/pages/franti/research/ga.txt

Overall comparison CI-values

Conclusions

Conclusions: how did k-means perform? 1. Overlap: good! 2. Number of clusters: bad! 3. Dimensionality: no change. 4. Unbalance of cluster sizes: bad!

References
J. MacQueen, "Some methods for classification and analysis of multivariate observations", Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1: Statistics, pp. 281-297, University of California Press, Berkeley, CA, 1967.
S.P. Lloyd, "Least squares quantization in PCM", IEEE Transactions on Information Theory, 28 (2), 129-137, 1982.
E. Forgy, "Cluster analysis of multivariate data: Efficiency vs. interpretability of classification", Biometrics, 21, 768, 1965.
M. Steinbach, L. Ertöz and V. Kumar, "The challenges of clustering high dimensional data", New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition, Springer-Verlag, 2003.
U. von Luxburg, R.C. Williamson and I. Guyon, "Clustering: Science or art?", Journal of Machine Learning Research, 27, 65-79, 2012.
P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, 2000.
P. Fränti and J. Kivijärvi, "Randomized local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.
P. Fränti, O. Virmajoki and V. Hautamäki, "Fast agglomerative clustering using a k-nearest neighbor graph", IEEE Transactions on Pattern Analysis and Machine Intelligence, 28 (11), 1875-1881, 2006.
T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: A new data clustering algorithm and its applications", Data Mining and Knowledge Discovery, 1 (2), 141-182, 1997.
I. Kärkkäinen and P. Fränti, "Dynamic local search algorithm for the clustering problem", Research Report A-2002-6, 2002.
P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, 2006.
P. Fränti, R. Mariescu-Istodor and C. Zhong, "XNN graph", IAPR Joint Int. Workshop on Structural, Syntactic, and Statistical Pattern Recognition, Merida, Mexico, LNCS 10029, 207-217, 2016.
M. Rezaei and P. Fränti, "Set-matching methods for external cluster validity", IEEE Transactions on Knowledge and Data Engineering, 28 (8), 2173-2186, 2016.
E. Chávez and G. Navarro, "A probabilistic spell for the curse of dimensionality", Workshop on Algorithm Engineering and Experimentation, LNCS 2153, 147-160, 2001.
N. Tomašev, M. Radovanović, D. Mladenić and M. Ivanović, "The role of hubness in clustering high-dimensional data", IEEE Transactions on Knowledge and Data Engineering, 26 (3), 739-751, 2014.
D. Steinley, "Local optima in k-means clustering: What you don't know may hurt you", Psychological Methods, 8, 294-304, 2003.
P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: Cluster level similarity measure", Pattern Recognition, 47 (9), 3034-3045, 2014.
T. Gonzalez, "Clustering to minimize the maximum intercluster distance", Theoretical Computer Science, 38 (2-3), 293-306, 1985.