
K-means properties Pasi Fränti


1 K-means properties
Pasi Fränti, 11.10.2017
P. Fränti and S. Sieranoja, "K-means properties on six clustering benchmark datasets", Algorithms, 2017.

2 Goal of k-means
Input: N points X = {x1, x2, …, xN}
Output: partition P = {p1, p2, …, pN} and k centroids C = {c1, c2, …, ck}
Objective function: sum-of-squared errors, $\mathrm{SSE} = \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2$

3 Goal of k-means
Input: N points X = {x1, x2, …, xN}
Output: partition P = {p1, p2, …, pN} and k centroids C = {c1, c2, …, ck}
Objective function: $\mathrm{SSE} = \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2$
Assumptions: SSE is a suitable objective; k is known.

4 K-means algorithm
X = data set, C = cluster centroids, P = partition

K-Means(X, C) → (C, P)
REPEAT
    Cprev ← C;
    FOR i = 1 TO N DO pi ← FindNearest(xi, C);        // assignment step
    FOR j = 1 TO k DO cj ← average of {xi : pi = j};  // centroid step
UNTIL C = Cprev
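A minimal NumPy sketch of this loop (function and variable names are mine, not from the slides):

```python
import numpy as np

def kmeans(X, C):
    """Basic k-means. X is (N, D) data, C is (k, D) initial centroids."""
    while True:
        C_prev = C.copy()
        # Assignment step: nearest centroid for every point
        dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        P = dist.argmin(axis=1)
        # Centroid step: mean of the points assigned to each cluster
        for j in range(len(C)):
            if np.any(P == j):
                C[j] = X[P == j].mean(axis=0)
        if np.allclose(C, C_prev):          # converged: centroids unchanged
            return C, P
```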

5 K-means optimization steps
Assignment step: $p_i \leftarrow \arg\min_{1 \le j \le k} \lVert x_i - c_j \rVert^2$
Centroid step: $c_j \leftarrow \frac{1}{|\{i : p_i = j\}|} \sum_{p_i = j} x_i$

6 Problems of k-means: distance of clusters
K-means cannot move centroids between clusters that are far apart.

7 Problems of k-means: dependency on the initial solution
The result after k-means depends on where the centroids started.

8 K-means performance: how is it affected by?
1. Overlap
2. Number of clusters
3. Dimensionality
4. Unbalance of cluster sizes

9 Basic Benchmark

10 Data set statistics

Dataset    Varying               Size       Clusters  Per cluster
A          Number of clusters    3000-7500  20-50     150
S          Overlap               5000       15        333
Dim        Dimensions            1024       16        64
G2         Dimensions + overlap  2048       2         1024
Birch      Structure             100,000    100       1000
Unbalance  Balance               6500       8         100-2000

11 A sets
A1, A2, A3: spherical clusters; the number of clusters grows from k=20 (A1) to k=35 (A2) to k=50 (A3).
Subsets of each other: A1 ⊂ A2 ⊂ A3.
Other parameters fixed: cluster size = 150, deviation = 1402, overlap = 0.30, dimensionality = 2.

12 S sets
K=15 Gaussian clusters (a few truncated).
Overlap increases from S1 to S4: 9%, 22%, 41%, 44%.
S1 has the least overlap; S4 has strong overlap, but the clusters are still recognizable.

13 Unbalance
K=8: three dense clusters (2000 points each, st.dev = 2043) and five sparse clusters (100 points each, st.dev = 6637).
The areas are well separated.
*Correct clustering can be obtained by minimizing SSE.

14 DIM sets
K=16 well-separated clusters in high-dimensional spaces.
Dimensions vary: 32, 64, 128, 256, 512, 1024 (the figure shows DIM32).

15 G2 datasets
K=2. Dataset name: G2-dim-sd (e.g. G2-2-30, G2-2-50, G2-2-70).
Centroid 1: [500, 500, …]; Centroid 2: [600, 600, …].
Dimensions: 2, 4, 8, 16, …, 1024. St.dev: 10, 20, 30, …
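A sketch of how a G2-style set could be generated from these parameters (the helper name make_g2 is mine; the official generator may differ in details):

```python
import numpy as np

def make_g2(dim, sd, points_per_cluster=1024, seed=0):
    """Two Gaussian clusters: centroid 1 at [500, ...], centroid 2 at [600, ...]."""
    rng = np.random.default_rng(seed)
    c1 = rng.normal(500, sd, size=(points_per_cluster, dim))
    c2 = rng.normal(600, sd, size=(points_per_cluster, dim))
    X = np.vstack([c1, c2])
    labels = np.repeat([0, 1], points_per_cluster)
    return X, labels

X, y = make_g2(dim=2, sd=50)   # G2-2-50
```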

16 Birch
Birch1: regular 10x10 grid of clusters, constant variance.
Birch2: clusters placed along the sine curve y(x) = amplitude · sin(2π · frequency · x + phaseshift) + offset.
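A sketch of the Birch2 placement rule; the slide does not give the parameter values, so the ones below are placeholders only:

```python
import math

# Placeholder values: the actual offset, amplitude, phaseshift and
# frequency of Birch2 are not specified on the slide.
offset, amplitude, phaseshift, frequency = 0.0, 1.0, 0.0, 1.0

def birch2_y(x):
    """Cluster centroids in Birch2 lie along this sine curve."""
    return amplitude * math.sin(2 * math.pi * frequency * x + phaseshift) + offset
```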

17 Birch2 subsets
B2-random: random subsampling.
B2-sub: cutting off the last cluster.
Sizes: N=100,000 (k=100), N=99,000 (k=99), N=98,000 (k=98), …, N=2,000 (k=2), N=1,000 (k=1).

18 Properties

19 Measured properties
Overlap
Contrast
Intrinsic dimensionality
H-index
Distance profiles

20 Misclassification probability
Points from the blue cluster that are closer to the red centroid, and points from the red cluster that are closer to the blue centroid, count as incorrect.
Points = 2048, Incorrect = 20, Overlap = 20 / 2048 ≈ 0.9%.

21 Overlap
Evidence of overlap: points in the blue cluster whose nearest red neighbor is closer than their own centroid, and points in the red cluster whose nearest blue neighbor is closer than their own centroid.
d1 = distance to the nearest centroid; d2 = distance to the 2nd nearest.
Points = 2048, Evidence = 332, Overlap = 332 / 2048 ≈ 16%.
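A sketch of this evidence count, assuming ground-truth labels and centroids are available (names are mine):

```python
import numpy as np

def overlap(X, labels, centroids):
    """Fraction of points whose nearest neighbour from the *other* cluster
    is closer than the point's own centroid."""
    labels = np.asarray(labels)
    evidence = 0
    for i, x in enumerate(X):
        own = np.linalg.norm(x - centroids[labels[i]])
        other = X[labels != labels[i]]
        nearest_other = np.linalg.norm(other - x, axis=1).min()
        if nearest_other < own:
            evidence += 1
    return evidence / len(X)
```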

22 Contrast

23 Intrinsic dimensionality
ID = (average of distances)² / (2 × variance of distances), following Chávez and Navarro (see references).
(Figure: ID values for the Unbalance, DIM, Birch1, Birch2, S and A1-A3 sets, ranging from 0.4 to 8.3.)
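Using the Chávez-Navarro definition cited in the references, ID can be computed from the pairwise distance distribution; a sketch:

```python
import numpy as np
from scipy.spatial.distance import pdist

def intrinsic_dimensionality(X):
    """ID = mean^2 / (2 * variance) of the pairwise distance distribution."""
    d = pdist(X)                      # all pairwise distances
    return d.mean() ** 2 / (2 * d.var())
```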

24 H-index
Hubness value of a point = how many points have it among their k nearest neighbors.
H-index = largest h such that at least h points have hubness ≥ h.
Example: hubness values ranked in decreasing order 4, 2, 2, 2, 2, 2, 0 give H-index = 2.
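A sketch of this computation, assuming hubness is counted over k-nearest-neighbor lists (the value of k and the helper name are my choices):

```python
import numpy as np

def h_index(X, k=5):
    """Largest h such that at least h points appear in the k-NN lists
    of at least h other points."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                       # exclude self-distance
    knn = np.argsort(D, axis=1)[:, :k]                # k nearest neighbours
    hub = np.bincount(knn.ravel(), minlength=len(X))  # hubness counts
    hub = np.sort(hub)[::-1]                          # decreasing order
    return max(h for h in range(len(hub) + 1) if (hub[:h] >= h).all())
```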

25 Distance profiles
Data that contains clusters tends to have two peaks in its distance distribution:
Local distances: distances inside the clusters.
Global distances: distances across different clusters.
(Profiles shown for A1-A3, S1-S4, Birch1, Birch2, DIM32 and Unbalance.)
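A distance profile is simply a histogram of all pairwise distances; a minimal sketch:

```python
import numpy as np
from scipy.spatial.distance import pdist

def distance_profile(X, bins=50):
    """Histogram of all pairwise distances. Clustered data is typically
    bimodal: a local peak (within clusters), a global peak (across them)."""
    counts, edges = np.histogram(pdist(X), bins=bins)
    return counts, edges
```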

26 Distance profiles: G2 datasets
(Profiles shown as the dimension increases: D = 2, 4, 8, 16, 32, …, 1024; and as the overlap increases: sd = 10, 20, 30, 40, 50, 60, at both D=2 and D=128.)

27 Summary of the properties

28 G2 overlap
Overlap decreases as the dimensionality grows and increases as the standard deviation grows.

29 G2 contrast
Contrast decreases both as the dimensionality grows and as the standard deviation grows.

30 G2 intrinsic dimensionality
ID increases with the dimensionality (when there is overlap) and with the standard deviation; this is the most significant effect.

31 G2 H-index
H-index increases with the dimensionality (the most significant effect); the standard deviation causes no change.

32 Evaluation

33 Internal measures
Sum of squared distances (SSE)
Normalized mean square error (nMSE)
Approximation ratio (ε)
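A sketch of these measures, assuming nMSE normalizes SSE by both N and the dimensionality D, and that the approximation ratio ε compares SSE against a reference (best known) SSE; both assumptions are mine:

```python
import numpy as np

def sse(X, C, P):
    """Sum of squared errors of partition P against centroids C."""
    return float(((X - C[P]) ** 2).sum())

def nmse(X, C, P):
    # Assumption: normalized by both N and dimensionality D.
    N, D = X.shape
    return sse(X, C, P) / (N * D)

def approximation_ratio(sse_value, sse_best):
    # Assumption: epsilon = SSE relative to the best known SSE.
    return sse_value / sse_best
```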

34 External measures: centroid index (CI)
P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 2014.
CI counts cluster-level mismatches: regions with missing centroids and regions with too many centroids. The example shows CI=4.
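A sketch of the CI idea: map each centroid to its nearest counterpart in the other solution, count the "orphan" centroids that receive no mapping, and take the maximum over both directions (helper names are mine):

```python
import numpy as np

def ci_one_way(A, B):
    """Map each centroid in A to its nearest centroid in B, then count
    B-centroids that received no mapping (orphans)."""
    nearest = np.linalg.norm(A[:, None] - B[None, :], axis=2).argmin(axis=1)
    return len(B) - len(np.unique(nearest))

def centroid_index(A, B):
    # CI = 0 iff every cluster in one solution matches one in the other.
    return max(ci_one_way(A, B), ci_one_way(B, A))
```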

35 External measures: success rate
Success rate = proportion of repeated runs that reach CI=0.
Example: runs with CI values 1, 2, 1, 0, 2, 2 give a success rate of 17% (1 of 6).

36 Results

37 Summary of results

38 Dependency on overlap: S datasets
Success rates and CI values as overlap increases from S1 to S4:
S1: success 3%, CI=1.8
S2: success 11%, CI=1.4
S3: success 12%, CI=1.3
S4: success 26%, CI=0.9

39 Dependency on overlap G2 datasets

40 Why does overlap help?
(Figure: k-means runs on data with overlap = 7% and overlap = 22%; 13 iterations shown.)

41 Main observation 1 (overlap): Overlap is good!

42 Dependency on clusters (k): A datasets
As the number of clusters increases:
A1 (K=20): success 1%, CI=2.5, relative CI (CI/k) = 13%
A2 (K=35): success 0%, CI=4.5, relative CI = 13%
A3 (K=50): success 0%, CI=6.6, relative CI = 13%

43 Dependency on clusters (k)

44 Dependency on data size (N)

45 Main observation 2 (number of clusters): The error (CI) increases linearly with k!

46 Dependency on dimensions: DIM datasets
As the dimensionality increases:
Dim:  32   64   128  256  512  1024
CI:   3.6  3.5  3.8  3.8  3.9  3.7
Success rate: 0% in every case.

47 Dependency on dimensions: G2 datasets
(Figure: success rate degrades in some G2 configurations and improves in others as the dimensionality grows.)

48 Lack of overlap is the cause!
Correlation between overlap and success rate: 0.91.

49 Main observation 3 (dimensionality): No direct effect!

50 Effect of unbalance: Unbalance dataset
Success: 0%; average CI: 3.9.
The problem originates from the random initialization.

51 Main observation 4 (unbalance of cluster sizes): Unbalance is bad!

52 Improving k-means

53 Better initialization techniques
Simple initializations:
Random centroids (Random) [Forgy; MacQueen]
Furthest-point heuristic (max) [Gonzalez]
More complex:
K-means++ [Vassilvitskii]
Luxburg [Luxburg]
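A sketch of k-means++ seeding, where each new centroid is sampled with probability proportional to the squared distance from the nearest centroid chosen so far (function name is mine):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: D^2-weighted sampling of initial centroids."""
    rng = np.random.default_rng(seed)
    C = [X[rng.integers(len(X))]]            # first centroid: uniform at random
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in C], axis=0)
        C.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(C, dtype=float)
```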

54 Initialization techniques Varying N

55 Initialization techniques Varying k

56 Repeated k-means (RKM)
Repeat k-means 100 times and keep the best result.
Can increase the chances of success significantly.
In principle, running it forever would solve the problem.
Limitations arise when k is large.
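A sketch of RKM, reusing the kmeans and kmeans_pp_init sketches above; it keeps the lowest-SSE result over the repeats (the slides describe repetition with random initialization; seeding with k-means++ here is my choice):

```python
def repeated_kmeans(X, k, repeats=100):
    """Run k-means from many seedings; keep the lowest-SSE result."""
    best = None
    for seed in range(repeats):
        C, P = kmeans(X, kmeans_pp_init(X, k, seed=seed))
        err = ((X - C[P]) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, C, P)
    return best[1], best[2]
```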

57 A better algorithm
Random Swap (RS)
Genetic Algorithm (GA)
cs.uef.fi/pages/franti/research/ga.txt
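A sketch of the Random Swap idea (the randomized local search of Fränti and Kivijärvi, cited below): replace a randomly chosen centroid with a random data point, fine-tune with k-means, and keep the swap only if SSE improves. The original fine-tunes with only a couple of k-means iterations per swap; this sketch runs k-means to convergence for simplicity:

```python
import numpy as np

def random_swap(X, k, trials=5000, seed=0):
    """Randomized local search: trial swaps accepted only if SSE improves."""
    rng = np.random.default_rng(seed)
    C, P = kmeans(X, X[rng.choice(len(X), k, replace=False)].astype(float))
    best = ((X - C[P]) ** 2).sum()
    for _ in range(trials):
        trial = C.copy()
        trial[rng.integers(k)] = X[rng.integers(len(X))]   # the swap
        trial, P2 = kmeans(X, trial)                       # fine-tune
        err = ((X - trial[P2]) ** 2).sum()
        if err < best:                                     # accept if better
            best, C, P = err, trial, P2
    return C, P
```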

58 Overall comparison CI-values

59 Conclusions

60 Conclusions: how did k-means perform?
1. Overlap: Good!
2. Number of clusters: Bad!
3. Dimensionality: No change.
4. Unbalance of cluster sizes: Bad!

61 References
J. MacQueen, "Some methods for classification and analysis of multivariate observations", Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, University of California Press, Berkeley, Calif., 1967.
S.P. Lloyd, "Least squares quantization in PCM", IEEE Trans. on Information Theory, 28 (2), 129-137, 1982.
E. Forgy, "Cluster analysis of multivariate data: Efficiency vs. interpretability of classification", Biometrics, 21, 768, 1965.
M. Steinbach, L. Ertöz and V. Kumar, "The challenges of clustering high dimensional data", New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition, Springer-Verlag, 2003.
U. von Luxburg, R.C. Williamson and I. Guyon, "Clustering: Science or Art?", J. Machine Learning Research, 27, 65-79, 2012.
P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, 2000.
P. Fränti and J. Kivijärvi, "Randomized local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 2000.
P. Fränti, O. Virmajoki and V. Hautamäki, "Fast agglomerative clustering using a k-nearest neighbor graph", IEEE Trans. on Pattern Analysis and Machine Intelligence, 28 (11), November 2006.
T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: A new data clustering algorithm and its applications", Data Mining and Knowledge Discovery, 1 (2), 1997.
I. Kärkkäinen and P. Fränti, "Dynamic local search algorithm for the clustering problem", Research Report A.
P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), May 2006.
P. Fränti, R. Mariescu-Istodor and C. Zhong, "XNN graph", IAPR Joint Int. Workshop on Structural, Syntactic, and Statistical Pattern Recognition, Merida, Mexico, LNCS 10029, November 2016.
M. Rezaei and P. Fränti, "Set-matching methods for external cluster validity", IEEE Trans. on Knowledge and Data Engineering, 28 (8), August 2016.
E. Chávez and G. Navarro, "A probabilistic spell for the curse of dimensionality", Workshop on Algorithm Engineering and Experimentation, LNCS 2153, 2001.
N. Tomašev, M. Radovanović, D. Mladenić and M. Ivanović, "The role of hubness in clustering high-dimensional data", IEEE Trans. on Knowledge and Data Engineering, 26 (3), March 2014.
D. Steinley, "Local optima in k-means clustering: what you don't know may hurt you", Psychological Methods, 8, 294-304, 2003.
P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: Cluster level similarity measure", Pattern Recognition, 47 (9), 2014.
T. Gonzalez, "Clustering to minimize the maximum intercluster distance", Theoretical Computer Science, 38 (2-3), 293-306, 1985.

