Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology): New Unsupervised Clustering Algorithm for Large Datasets


1 New Unsupervised Clustering Algorithm for Large Datasets. Advisor: Dr. Hsu. Graduate: Wen-Hsiang Hu. Authors: William Peter, John Chiochetti, Clare Giardina. Year of publication: 2003. Publisher: ACM Press, New York, NY, USA.

2 Outline: Motivation; Objective; Introduction; Particle-Mesh Heuristic; Rule-Based Agents; Small and Large Dataset Examples; Other Considerations; Conclusions; Personal Opinion; Review

3 Motivation: Hierarchical clustering methods and the K-medoid partitioning method have a large computational cost of O(N^2). The K-means heuristic and Ward's algorithm cannot determine a priori the number of clusters in a dataset. Many of these algorithms do not cluster directly on density.

4 Objective: The particle-mesh heuristic has several important advantages. (1) Speed. (2) All clusters can be determined automatically and without supervision. (3) Clusters can be ranked by density. (4) New data can be clustered incrementally. (5) The clustering is amenable to massively parallel or distributed computation.

5 Introduction: The ACE algorithm is based on (1) clustering data by the particle-mesh heuristic, and (2) using rule-based agents to determine (and rank) the grid points associated with the highest data density.

6 Particle-Mesh Heuristic: Particle-mesh methods are one approach to reducing the computational time of N-body problems, which otherwise cost O(N^2). We will show that these standard particle-mesh techniques are applicable to data mining and clustering. The dataset (assumed to have N points in an n-dimensional space) is weighted to the grid points of a mesh by some suitable weighting scheme.

7 Grid Weighting: Consider the problem in one spatial dimension x with a uniform grid of cell spacing H, where H is the cell size. In nearest-grid-point weighting, each data point is assigned to the grid point nearest to it. The density of data at each grid point is obtained by "weighting" the raw data values to the grid points.
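The nearest-grid-point weighting described on this slide can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `ngp_density`, the grid layout (grid points at `x_min + j*H`), the clamping of out-of-range points, and the division by H as a normalization are all assumptions made for illustration.

```python
import numpy as np

def ngp_density(x, x_min, H, n_grid):
    """Nearest-grid-point weighting on a uniform 1D mesh (illustrative sketch).

    Assumption: grid points sit at x_min + j*H for j = 0..n_grid-1.
    Returns the count of data points assigned to each grid point, divided by H.
    """
    x = np.asarray(x, dtype=float)
    # Index of the nearest grid point for each data point (ties round up).
    j = np.floor((x - x_min) / H + 0.5).astype(int)
    # Assumption: points outside the mesh are clamped to the boundary grid points.
    j = np.clip(j, 0, n_grid - 1)
    counts = np.bincount(j, minlength=n_grid)
    return counts / H

rho = ngp_density([0.1, 0.2, 0.9, 2.4], x_min=0.0, H=1.0, n_grid=4)
```

Each data point is touched once, so building the grid density costs a single pass over the N points.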

8 Rule-Based Agents: A small number of agents are randomly placed on grid points of the mesh. The goal of each agent is to climb the hills of data density. Each agent is given two "rules" of behavior, illustrated here on a one-dimensional grid. (1) An agent residing at a grid point rolls a die to determine whether it should move up to the next grid point or down to the previous one. In n dimensions, this would be a 2n-sided die. (2) If the roll says the agent should move up a grid point, the agent moves there only if it is moving up in density.
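The two agent rules can be sketched in Python for the one-dimensional case. This is a hedged illustration, not the ACE implementation: the names `climb` and `find_hubs`, the fixed number of steps, the strict "greater than" acceptance test, and the handling of grid boundaries are assumptions. With a strict test, an agent on a flat region of equal density stays where it is.

```python
import random

def climb(density, start, n_steps, rng):
    """Move one agent along a 1D grid and return its final grid index."""
    pos = start
    for _ in range(n_steps):
        # Rule 1: roll a two-sided die to choose a direction
        # (this would be a 2n-sided die in n dimensions).
        candidate = pos + rng.choice((-1, 1))
        # Rule 2: accept the move only if it goes up in density.
        # Assumption: moves off the ends of the grid are rejected.
        if 0 <= candidate < len(density) and density[candidate] > density[pos]:
            pos = candidate
    return pos

def find_hubs(density, n_agents, n_steps=200, seed=0):
    """Drop agents at random grid points; return distinct end points ranked by density."""
    rng = random.Random(seed)
    finals = {
        climb(density, rng.randrange(len(density)), n_steps, rng)
        for _ in range(n_agents)
    }
    return sorted(finals, key=lambda j: density[j], reverse=True)

density = [0, 1, 3, 1, 0, 2, 5, 2, 0]  # toy grid density with two hills
hubs = find_hubs(density, n_agents=20)
```

Because agents only ever move uphill, each one ends at a local maximum of the grid density once it has taken enough steps, and sorting the end points gives the ranking by density.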

9 Data points within a user-specified distance threshold of a hub (a grid point of high data density) can be assumed to belong to the cluster with that hub. The position of the hub can also be iteratively recalculated to be the cluster centroid, $\mathbf{x}_c = \frac{1}{N_c} \sum_{i=1}^{N_c} \mathbf{x}_i$, where the $\mathbf{x}_i$ are the positions of the cluster members and $N_c$ is their total number. With the hub now at the data centroid, data points within the threshold distance of the centroid (and not of the original grid-point hub) would belong to this cluster.
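The membership test and the iterative centroid update can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `refine_hub`, the Euclidean distance, the iteration cap, and the convergence check are not taken from the slides.

```python
import numpy as np

def refine_hub(points, hub, threshold, n_iter=10):
    """Iteratively move a hub to the centroid of the points within `threshold` of it.

    points: array of shape (N, n); hub: array of shape (n,).
    Returns the final hub position and a boolean membership mask.
    """
    points = np.asarray(points, dtype=float)
    hub = np.asarray(hub, dtype=float)
    for _ in range(n_iter):
        members = np.linalg.norm(points - hub, axis=1) <= threshold
        if not members.any():
            break  # no points near this hub; leave it where it is
        centroid = points[members].mean(axis=0)
        converged = np.allclose(centroid, hub)
        hub = centroid
        if converged:
            break
    # Membership is measured from the final hub position.
    members = np.linalg.norm(points - hub, axis=1) <= threshold
    return hub, members

pts = [[0, 0], [1, 0], [0, 1], [1, 1], [10, 10]]
hub, members = refine_hub(pts, hub=[0, 0], threshold=2.0)
```

In this toy example the hub starts on the grid point (0, 0) and settles at (0.5, 0.5), the centroid of the four nearby points; the point at (10, 10) stays outside the cluster.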

10 Results with ACE: Small dataset example. Total grid points: Ng = 100. Total data points: N = 160.

11 Large dataset example: Total grid points: Ng = 484. We used a geospatial dataset made up of 10^5 points. Figure 4: Plot of the large spatial dataset distributed between latitudes 37° and 46°, and longitudes 169° and 180°.

12 Other Considerations: Parallel and distributed computation: load balancing is achieved by dividing the spatial mesh into sectors, so that each processor acts only on a certain well-defined region of space. Real-time and incremental clustering: ACE can cluster new data incrementally (without re-clustering).
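One plausible reading of the incremental property is that grid counts are additive, so a new batch of data only needs to be weighted onto the existing mesh. The sketch below illustrates that reading; the class `GridDensity`, its methods, and the one-dimensional nearest-grid-point layout are assumptions for illustration, not the authors' design.

```python
import numpy as np

class GridDensity:
    """Running nearest-grid-point counts on a uniform 1D mesh (illustrative sketch)."""

    def __init__(self, x_min, H, n_grid):
        self.x_min = x_min
        self.H = H
        self.counts = np.zeros(n_grid, dtype=np.int64)

    def add(self, x):
        """Weight a new batch of points onto the mesh without touching old data."""
        x = np.asarray(x, dtype=float)
        j = np.floor((x - self.x_min) / self.H + 0.5).astype(int)
        # Assumption: out-of-range points are clamped to the boundary grid points.
        j = np.clip(j, 0, len(self.counts) - 1)
        self.counts += np.bincount(j, minlength=len(self.counts))

    def merge(self, other):
        """Combine counts computed elsewhere, e.g. by another processor."""
        self.counts += other.counts

grid = GridDensity(x_min=0.0, H=1.0, n_grid=4)
grid.add([0.1, 0.2])   # first batch
grid.add([0.9, 2.4])   # later batch, added incrementally
```

The same additivity is what would let separately computed counts be combined with `merge` in a parallel or distributed setting, though the slide's sector-based load balancing is not modeled here.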

13 Conclusions: ACE efficiently clusters large volumes of multidimensional geospatial data at a cost of O(N). Finally, the ACE algorithm is ideally suited to incremental clustering and massively parallel or distributed computation.

14 Personal Opinion: Generalizing self-organizing maps to handle categorical and hybrid data. We could also consider entropy or another index besides density.

15 Review: ACE algorithm (Particle-Mesh Methods; Rule-Based Agents); Small and Large Dataset Examples; Other Considerations (Parallel and Distributed Computation; Real-Time and Incremental Clustering)

