
1 Intelligent Database Systems Lab
Advisor: Dr. Hsu  Graduate: Jian-Lin Kuo
Authors: Aristidis Likas, Nikos Vlassis, Jakob J. Verbeek
國立雲林科技大學 National Yunlin University of Science and Technology
The global k-means clustering algorithm
Received 23 March 2001; accepted 4 March 2002

2 Outline
- Motivation
- Objective
- Literature review
- Global k-means algorithm
- Speeding up execution
- Experiments I, II, III, IV
- Conclusions

3 Motivation
The k-means algorithm is a popular clustering method, but it has a serious drawback: its performance depends heavily on the initial starting conditions.

4 Objective
The global k-means clustering algorithm is proposed. It constitutes a deterministic, effective global clustering algorithm for minimizing the clustering error.

5 Literature review
Clustering error
- The most widely used clustering criterion is the sum of squared Euclidean distances:
  E(m_1, …, m_M) = Σ_{i=1..N} Σ_{k=1..M} I(x_i ∈ C_k) ‖x_i − m_k‖²
- where x_i is a data point of the set X, m_k is the center of cluster C_k, and I(X) = 1 if X is true and 0 otherwise.
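The criterion above is straightforward to evaluate: each point contributes the squared distance to its nearest center. A minimal NumPy sketch (array shapes and the function name are my own, not from the paper):

```python
import numpy as np

def clustering_error(X, centers):
    # dists[i, k] = ||x_i - m_k||^2 for every point/center pair
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # each point is charged the squared distance to its closest center
    return dists.min(axis=1).sum()
```

Taking the minimum over centers implements the indicator I(x_i ∈ C_k), since a point belongs to the cluster of its nearest center.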

6 Literature review (cont.)
K-means algorithm
- randomly initializes the cluster centers.
- finds locally optimal solutions with respect to the clustering error.
- its sensitivity to the initial positions of the cluster centers is its main disadvantage.
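The k-means local search described above alternates assignment and re-estimation steps (Lloyd's iterations). A minimal sketch assuming NumPy arrays for the data and initial centers (names are illustrative):

```python
import numpy as np

def kmeans(X, centers, iters=100):
    # Lloyd's algorithm: a local search that monotonically
    # decreases the clustering error from the given start.
    centers = centers.copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # update step: each center moves to the mean of its points
        new = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                        else centers[k] for k in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```

Running this from two different initial `centers` arrays can produce two different local optima, which is exactly the sensitivity the slide points out.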

7 Literature review (cont.)
K-d trees
- a k-d tree is a recursive partitioning of the data space into disjoint subspaces.
- splitting terminates when a terminal node (bucket) contains no more points than the bucket size.
- the k-d tree structure was originally used for speeding up distance-based search operations.
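The bucket-producing recursion can be sketched as below (a median split along the widest dimension; the paper's exact splitting rule may differ, and the function name is my own):

```python
import numpy as np

def kdtree_buckets(X, bucket_size):
    # Recursively split along the widest dimension at the median
    # until each leaf (bucket) holds at most bucket_size points.
    if len(X) <= bucket_size:
        return [X]
    dim = np.argmax(X.max(axis=0) - X.min(axis=0))
    order = np.argsort(X[:, dim])
    mid = len(X) // 2
    return (kdtree_buckets(X[order[:mid]], bucket_size) +
            kdtree_buckets(X[order[mid:]], bucket_size))
```

The mean of each returned bucket is what the later slides use as a candidate cluster-center location.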

8 Global k-means algorithm
- employs the k-means algorithm as a local search procedure.
- proceeds in an incremental way, attempting to optimally add one new cluster center at each stage.

9 Global k-means algorithm (cont.)
To solve a clustering problem with M clusters:
1. Start with one cluster (k=1) and find its optimal position (the centroid of the data set X).
2. To solve the problem with two clusters (k=2), perform N executions of the k-means algorithm:
   a) the first cluster center is always placed at the optimal position for k=1;
   b) the second center at execution n is placed at the position of data point x_n (n = 1…N);
   c) the best solution obtained over the N executions is kept.

10 Global k-means algorithm (cont.)
3. In general, to solve the k-clustering problem we start from the solution of the (k−1)-clustering problem and run k-means N times, once with each data point as the initial position of the k-th center.
4. The best solution obtained from the N runs is taken as the solution of the k-clustering problem.
5. Proceeding in this way, a solution with M clusters is finally obtained.
※ In general, (m_1*(k), …, m_k*(k)) denotes the final solution of the k-clustering problem.
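The incremental procedure in the steps above can be sketched as follows: a direct but unoptimized implementation, with a plain Lloyd loop as the inner local search (function names are illustrative):

```python
import numpy as np

def run_kmeans(X, centers, iters=100):
    # plain Lloyd iterations; returns final centers and clustering error
    centers = centers.copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                        else centers[k] for k in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    err = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1).sum()
    return centers, err

def global_kmeans(X, M):
    centers = X.mean(axis=0, keepdims=True)      # k=1: the centroid is optimal
    for k in range(2, M + 1):
        best_err, best_centers = np.inf, None
        for x in X:                              # try each point as the k-th center
            cand, err = run_kmeans(X, np.vstack([centers, x]))
            if err < best_err:
                best_err, best_centers = err, cand
        centers = best_centers                   # solution of the k-clustering problem
    return centers
```

Each stage performs N full k-means runs, which is what makes the method deterministic but expensive; the next slides address the cost.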

11 Speeding-up execution
The global k-means algorithm is computationally heavy.
There are several options for reducing the load (by examining fewer initial positions):
- the fast global k-means algorithm
- initialization with k-d trees

12 Speeding-up execution (cont.)
The fast global k-means algorithm
1. does not execute the k-means algorithm for each of the N initial states until convergence to obtain the final clustering error.
2. instead, it computes an upper bound E_n ≤ E − b_n on the resulting error.
3. it initializes the new cluster center at the point x_n that minimizes E_n, i.e. maximizes b_n.
4. it then executes the k-means algorithm once to obtain the solution with k clusters.

13 Speeding-up execution (cont.)
where
- b_n = Σ_{j=1..N} max(d_{k−1}^j − ‖x_n − x_j‖², 0), computed for all possible insertion positions x_n,
- E is the clustering error of the (k−1)-clustering problem,
- d_{k−1}^j is the squared distance between x_j and the closest center of the cluster to which it belongs.
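The bound b_n can be computed for all candidate positions at once; a NumPy sketch in which the symbols follow the slide but the function name is my own:

```python
import numpy as np

def best_insertion(X, centers):
    # d_prev[j] = d_{k-1}^j: squared distance of x_j to its closest existing center
    d_prev = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    # pair[n, j] = ||x_n - x_j||^2
    pair = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # b[n] = guaranteed error reduction if the new center is placed at x_n:
    # point j switches to the new center only if that lowers its cost
    b = np.maximum(d_prev[None, :] - pair, 0).sum(axis=1)
    return int(b.argmax()), float(b.max())
```

Only the winning position is then refined with a single k-means run, replacing the N full runs of the exact method.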

14 Speeding-up execution (cont.)
Initialization with k-d trees
- in this paper, a variant of the k-d tree is used.
- the bucket centers are used as candidate insertion locations for the algorithm presented above.

15 Experiment 1
The data sets:
- 10 data sets
- each consisting of 300 data points
- each point drawn from a 15-component Gaussian mixture.
Three methods are compared on the clustering problem with k=15 centers.

16 Experiment 1 (cont.)
- solid line with error bars: the fast global k-means algorithm using the k-d tree (μ=14.9, σ=1.3)
- solid line: the standard k-means algorithm using the k-d tree (μ=24.4, σ=9.8)
- dashed line: the fast global k-means algorithm (μ=15.7, σ=12)
(The horizontal axis varies the number of buckets for the fast global k-means algorithm with the k-d tree, shown by the solid line.)

17 Experiment 1 (cont.)
We can conclude that:
- the fast global k-means approach is better than starting with all centers at the same time, initialized using the k-d tree method.
- the fast global k-means algorithm with the k-d tree does not degrade performance significantly (when the number of buckets is larger than the number of clusters).

18 Experiment 2
The data sets:
- the iris data set: 150 four-dimensional data points
- the synthetic data set: 250 two-dimensional data points
- the image segmentation data set: 210 six-dimensional data points
Three methods are compared on the clustering problem with k=15 centers.

19 Experiment 2 (cont.)
For each data set we conducted the following experiments:
- one run of the global k-means algorithm for M=15
- one run of the fast global k-means algorithm for M=15
- for each k = 1…15, the k-means algorithm was executed N times starting from random initial positions for the k centers (N is the number of data points).

20 Experiment 2 (cont.)

21 Experiment 2 (cont.)
Global k-means algorithm
- better than or equal to the k-means algorithm in all cases.
Fast global k-means algorithm
- provides solutions comparable to the original method while being significantly faster.
- can run much faster still if k-d trees are employed.

22 Experiment 3
The objective is to cluster 16×16-pixel image patches extracted from a set of 37 Brodatz texture images.
- each complete texture image consists of 256×256 pixels.
- 500 patches per texture were extracted by randomly selecting 16×16 windows.
- patches originating from the same texture image are assumed to form an individual cluster.

23 Experiment 3 (cont.)
Brodatz texture images

24 Experiment 3 (cont.)
The data sets:
- 100 data sets.
- each built by randomly selecting k of the 37 textures (k = 2…6).
- then selecting 200 patches for each texture, resulting in 200k patches per data set.

25 Experiment 3 (cont.)
We compared the performance of three algorithms:
- k-means initialized using a uniformly selected random subset of the data.
- fast global k-means.
- fast global k-means with the insertion locations limited to the top 2k nodes of the corresponding k-d tree.
We considered the mean squared clustering error (MSE) of the patches with respect to their closest mean.

26 Experiment 3 (cont.)
Results for the texture segmentation problem using as many clusters as textures.
Results for the texture segmentation problem using twice as many clusters as textures.

27 Experiment 3 (cont.)
It can be observed that:
- the k-means algorithm with random initialization gives the worst results.
- the fast global k-means variant that uses the top 2k nodes of the k-d tree is not only faster than the generic fast global k-means algorithm, but also provides slightly better results.

28 Experiment 4
The purpose is to compare:
- the randomly initialized k-means algorithm
- the 'greedy' algorithm: the fast global k-means algorithm that uses the top 2k nodes of the corresponding k-d tree (as candidate insertion locations).
The data have been drawn from randomly generated Gaussian mixtures.
The number of data points in each data set was 50k (k is the number of sources).

29 Experiment 4 (cont.)
The greedy algorithm is applied first. The randomly initialized k-means algorithm is then applied as many times as fit within the run time of the greedy algorithm.
Three parameters of the sources were varied:
- the number of sources k = {2, 7, 12, 17, 22}
- the dimensionality d = {2, 4, 6, 8} of the data space
- the separation c = {0.5, 0.6, 0.7, 0.8, 0.9, 1.0}

30 Experiment 4 (cont.)
In the tables:
- 'min' is (1 − min/μ) × 100, where min is the minimum clustering error value over the k-means runs
- 'mean' is (1 − mean/μ) × 100
- 'σ' is (σ/μ) × 100
- 'trials' is the number of runs of the k-means algorithm
Bold values in the tables indicate the smallest values.

31 Experiment 4 (cont.)

32 Experiment 4 (cont.)
It is clear from the experiments that:
- the benefit of the greedy method becomes larger when there are more clusters, the separation is larger, and the dimensionality is smaller.
- the greedy algorithm gives better results in almost all cases.
- the number of trials allowed for the random k-means algorithm grows only slowly as the number of clusters increases.

33 Conclusions
The global k-means algorithm
- gives excellent results in terms of the clustering error criterion.
- is independent of any starting conditions.
- compares favorably to the k-means algorithm.
The speed-up techniques reduce the computational load without significantly affecting solution quality.
Future work concerns the use of parallel processing to accelerate the proposed methods.

