CLARANS: A Method for Clustering Objects for Spatial Data Mining. IEEE Transactions on Knowledge and Data Engineering, 2002. Raymond T. Ng et al. 22 MAR.


1 CLARANS: A Method for Clustering Objects for Spatial Data Mining. IEEE Transactions on Knowledge and Data Engineering, 2002. Raymond T. Ng et al. 22 MAR 2011, Kwak, Namju

2 Overview

3 Introduction
Spatial data mining
– On spatial databases
– Huge amounts of data (usually terabytes)
– Satellite images, medical equipment, video cameras, etc.
Applications of spatial data mining
– NASA Earth Observing System
– Nat’l Inst. of Justice (crime mapping)
– Dept. of Transportation (traffic data)
– Nat’l Inst. of Health (cancer clusters)
Difficulty
– Spatial data types (point, polygon, etc.)
– Spatial relationships (A is [in front of, at the back of, nearby, etc.] B)
– Spatial autocorrelation (similar objects gather together)

4 Introduction
Key issues for cluster analysis
– Whether there exists a natural notion of similarity among the objects to be clustered (point objects vs. polygon objects)
– Whether clustering a large number of objects can be carried out efficiently
CLARANS (proposed in this paper)
– More efficient than the existing algorithms PAM and CLARA
– Calculates the similarity between two polygons efficiently and effectively, using the separation distance between the isothetic rectangles of the polygons

5 Clustering Algorithms Based on Partitioning
Hierarchical methods
– Agglomerative and divisive
– Successfully applied to many biological applications
– They can never undo what was done previously.
Partitioning methods
– k-means, k-medoid, fuzzy analysis, etc.
– k-medoid
    Robust to the existence of outliers
    Not dependent on the order in which the objects are examined
    Invariant with respect to translations and orthogonal transformations of data points

6 Clustering Algorithms Based on Partitioning

7 PAM (Partitioning Around Medoids)
– Suppose there are two medoids, A and B, and we consider replacing A with a new medoid M.
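The swap considered on this slide can be made concrete with a small sketch. It assumes 2-D points, Euclidean distance, and a cost equal to the sum of distances from each object to its nearest medoid; the function names `total_cost` and `swap_cost` and the sample coordinates are illustrative and do not come from the slides.

```python
import math

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def swap_cost(points, medoids, old, new):
    """Change in total cost if medoid `old` is replaced by `new`.

    A negative value means the swap improves the clustering.
    """
    swapped = [new if m == old else m for m in medoids]
    return total_cost(points, swapped) - total_cost(points, medoids)

# Hypothetical data: two small groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
A, B, M = (0, 1), (10, 10), (0, 0)

# Replacing A with M lowers the cost, so PAM would consider this swap.
delta = swap_cost(points, [A, B], A, M)
print(delta < 0)
```

Recomputing the full cost for each candidate swap is the simplest way to express the idea; it is not meant to reflect how an optimized PAM implementation evaluates swaps.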

8 Clustering Algorithms Based on Partitioning


10 CLARA (Clustering LARge Applications)
– CLARA draws a sample of the data set, applies PAM on the sample, and finds medoids of the sample.
– If the sample is drawn in a sufficiently random way, the medoids of the sample would approximate the medoids of the entire data set.
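A minimal sketch of the sample-then-PAM idea follows. It assumes 2-D points with Euclidean distance, uses a simplified swap loop in place of a full PAM implementation, and keeps the medoid set that scores best on the entire data set. The defaults `num_samples=5` and a sample size of `40 + 2 * k` are illustrative choices, not values stated on the slide.

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k, rng):
    """Simplified PAM: accept improving swaps until none remains."""
    medoids = rng.sample(points, k)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in points:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids

def clara(points, k, num_samples=5, sample_size=None, seed=0):
    """Run PAM on random samples; judge each result on the full data set."""
    rng = random.Random(seed)
    size = min(len(points), sample_size or 40 + 2 * k)
    best_medoids, best_cost = None, math.inf
    for _ in range(num_samples):
        sample = rng.sample(points, size)
        medoids = pam(sample, k, rng)
        cost = total_cost(points, medoids)  # evaluated on all objects
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Hypothetical data: two well-separated 5x5 grids of points.
points = [(x, y) for x in range(5) for y in range(5)]
points += [(100 + x, 100 + y) for x in range(5) for y in range(5)]
medoids, cost = clara(points, k=2)
```

Because the medoids can only come from the sample, a good medoid that was never sampled cannot be returned, which is the limitation the later slides on randomized search address.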

11 A Clustering Algorithm Based on Randomized Search

12 A Clustering Algorithm Based on Randomized Search
CLARANS
PAM is a search for a minimum on the graph G_{n,k}.
– Examining all k(n-k) neighbors of a node is time consuming.
CLARA restricts the search to subgraphs of G_{n,k}.
– Sa is the set of objects in a sample.
– The search is confined within G_{Sa,k}. If the minimum node M is not included in G_{Sa,k}, M will never be found in the search.
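The k(n-k) neighbor count can be checked with a short sketch. It assumes the usual reading of G_{n,k}: a node is a set of k objects chosen as medoids, and two nodes are neighbors when they differ in exactly one object. The helper name `neighbors` is illustrative.

```python
def neighbors(node, objects):
    """All nodes that differ from `node` in exactly one object."""
    node = frozenset(node)
    outside = [o for o in objects if o not in node]
    # Swap each selected object for each unselected one.
    return [(node - {old}) | {new} for old in node for new in outside]

# Hypothetical instance with n = 6 objects and k = 2 medoids.
n, k = 6, 2
node = {0, 1}
nbrs = neighbors(node, range(n))
print(len(nbrs) == k * (n - k))
```

For realistic sizes the count grows quickly; with n = 1000 and k = 10, a single node already has 9,900 neighbors, each of which PAM would have to evaluate.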

13 A Clustering Algorithm Based on Randomized Search
CLARANS
– Like CLARA, it does not check every neighbor of a node.
– Unlike CLARA, each sample is drawn dynamically.
– While CLARA draws a sample of nodes at the beginning of a search, CLARANS draws a sample of neighbors in each step of a search.
– Gives higher quality clusterings.
– Requires a very small number of searches.

14 A Clustering Algorithm Based on Randomized Search
CLARANS
– The higher the value of maxneighbor, the closer CLARANS is to PAM.
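The randomized search can be sketched as below, assuming 2-D points and Euclidean distance. The parameter `maxneighbor` is named on the slide; `numlocal` (the number of restarts) is an assumed name that does not appear in this transcript, and the default values are illustrative only.

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def clarans(points, k, numlocal=2, maxneighbor=50, seed=0):
    """Randomized search over medoid sets; a sketch, not a tuned version."""
    if not 0 < k < len(points):
        raise ValueError("k must satisfy 0 < k < number of points")
    rng = random.Random(seed)
    best_node, best_cost = None, math.inf
    for _ in range(numlocal):
        current = rng.sample(points, k)  # random starting node
        current_cost = total_cost(points, current)
        tried = 0
        while tried < maxneighbor:
            # Draw a random neighbor: swap one medoid for one non-medoid.
            i = rng.randrange(k)
            cand = rng.choice(points)
            if cand in current:
                continue
            neighbor = current[:i] + [cand] + current[i + 1:]
            cost = total_cost(points, neighbor)
            if cost < current_cost:
                # Move to the better node and restart the neighbor count.
                current, current_cost, tried = neighbor, cost, 0
            else:
                tried += 1
        # No improvement in maxneighbor draws: treat as a local minimum.
        if current_cost < best_cost:
            best_node, best_cost = current, current_cost
    return best_node, best_cost

# Hypothetical data: two well-separated 5x5 grids of points.
points = [(x, y) for x in range(5) for y in range(5)]
points += [(100 + x, 100 + y) for x in range(5) for y in range(5)]
medoids, cost = clarans(points, k=2)
```

Raising `maxneighbor` makes each step inspect more of a node's neighbors before declaring a local minimum, which is why a large value behaves more like PAM.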

15 Clustering Convex Polygon Objects
In practice, many of the spatial objects to be clustered are polygonal in nature.
– Shopping malls, parks, etc.
Representative point approximation: centroid
– A typical house of 200 square meters in a rectangular shape vs. a park of 500,000 square meters in an irregular shape
– Clusterings of poor quality
Multiple representative points
– For a large park, two of its representative points may be 5,000 meters apart from each other. There is no guarantee that they will be in the same cluster.

16 Clustering Convex Polygon Objects

17 Clustering Convex Polygon Objects

18 Clustering Convex Polygon Objects
Approximating by the Separation Distance between Isothetic Rectangles
– IR-approximation
– Compute the isothetic rectangles I_A and I_B and calculate the separation distance between them.
– The isothetic rectangle I_A is the smallest rectangle that contains polygon A and whose edges are parallel to the x- or y-axis.
– While the isothetic rectangle may have an area larger than that of a minimum bounding rectangle, it can be obtained easily by finding the minimum and maximum of the x-coordinates and of the y-coordinates of the vertices.
– Trivial amount of time to compute.

19 Clustering Convex Polygon Objects
Approximating by the Separation Distance between Isothetic Rectangles
– The first step, checking for a possible intersection, takes constant time for isothetic rectangles but logarithmic time for polygons.
– The next step, computing the actual separation distance, also takes constant time for isothetic rectangles but logarithmic time for polygons.
– The IR-approximation underestimates the exact separation distance.
– The original polygons do not have to be convex.
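The IR-approximation described on these two slides can be sketched in a few lines. The sketch assumes each polygon is given as a list of (x, y) vertices and that the separation distance between two rectangles is the Euclidean length of the gap between them (zero when they touch or overlap); the function names are illustrative.

```python
import math

def isothetic_rectangle(polygon):
    """Axis-parallel bounding rectangle as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def rect_separation(r1, r2):
    """Separation distance between two axis-parallel rectangles."""
    ax0, ay0, ax1, ay1 = r1
    bx0, by0, bx1, by1 = r2
    # Gap along each axis; zero if the projections overlap.
    dx = max(bx0 - ax1, ax0 - bx1, 0.0)
    dy = max(by0 - ay1, ay0 - by1, 0.0)
    return math.hypot(dx, dy)

def ir_distance(poly_a, poly_b):
    """IR-approximation of the separation distance between two polygons."""
    return rect_separation(isothetic_rectangle(poly_a),
                           isothetic_rectangle(poly_b))

# Hypothetical triangles: the rectangles are 3 apart in x and 4 in y.
A = [(0, 0), (2, 0), (0, 2)]
B = [(5, 6), (7, 6), (7, 9)]
C = [(1, 1), (3, 1), (3, 3)]
d_ab = ir_distance(A, B)  # 5.0
d_ac = ir_distance(A, C)  # 0.0, because the rectangles overlap
```

In the A and B example the rectangle corner (2, 2) lies outside triangle A, so the triangles themselves are farther apart than 5.0, which illustrates why the approximation underestimates the exact separation distance.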

20 Experimental Results
CLARANS vs. PAM
– For large and medium data sets, CLARANS is clearly much more efficient than PAM.
– On small data: data sets with 40, 60, 80, and 100 points in five clusters
– The clusterings produced by both algorithms are of the same quality.

21 Experimental Results
CLARANS vs. CLARA
– Since CLARA is not designed for small data sets, this set of experiments was run on data sets whose number of objects exceeds 100.
– CLARANS is always able to find clusterings of better quality than those found by CLARA.
– However, in some cases, CLARA may take much less time than CLARANS.
– What if they were given the same amount of time?

22 Experimental Results

23 Conclusion
– For small data sets, CLARANS is a few times faster than PAM. The performance gap for larger data sets is even larger.
– When given the same amount of runtime, CLARANS can produce clusterings that are of much better quality than those generated by CLARA.
– IR-approximation is a few times faster than the method that computes the exact separation distance.
– IR-approximation is able to find clusterings that are of quality almost as good as those produced by using the exact separation distance.

