
1 CURE: An Efficient Clustering Algorithm for Large Databases Authors: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim Presentation by: Vuk Malbasa For CIS664 Prof. Vasilis Megalooikonomou

2 Overview –Introduction –Previous approaches –Drawbacks of previous approaches –CURE: approach –Enhancements for large datasets –Conclusions

3 Introduction Clustering problem: given a set of points, partition them into clusters so that points within a cluster are more similar to each other than to points in different clusters. Traditional clustering techniques either favor clusters with spherical shapes and similar sizes, or are fragile in the presence of outliers. CURE is robust to outliers and identifies clusters with non-spherical shapes and wide variance in size. Each cluster is represented by a fixed number of well-scattered points.

4 Introduction CURE is a hierarchical clustering technique, in which each partition is nested into the next partition in the sequence. CURE is an agglomerative algorithm: disjoint clusters are successively merged until the desired number of clusters remains.

5 Previous Approaches At each step of agglomerative clustering, the two clusters merged are those that minimize some distance metric. This distance metric can be: –Distance between the means of the clusters, d_mean –Average distance between all pairs of points across the clusters, d_ave –Maximal distance between points in the clusters, d_max –Minimal distance between points in the clusters, d_min
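The four inter-cluster metrics above can be sketched in NumPy. This is an illustration of the definitions, not code from the paper; the function names simply mirror the slide's notation, and clusters are assumed to be arrays of shape (n, d).

```python
import numpy as np

def d_mean(A, B):
    # Distance between the two cluster means (centroid distance).
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

def _pairwise(A, B):
    # All cross-cluster point-to-point distances, shape (len(A), len(B)).
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def d_ave(A, B):
    # Average distance over all cross-cluster pairs.
    return _pairwise(A, B).mean()

def d_max(A, B):
    # Maximal distance between any pair of points, one from each cluster.
    return _pairwise(A, B).max()

def d_min(A, B):
    # Minimal distance between any pair of points (single-link distance).
    return _pairwise(A, B).min()
```

For singleton clusters all four metrics coincide; they diverge as soon as clusters have internal spread, which is what drives the different failure modes discussed on the next slide.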

6 Drawbacks of previous approaches When clusters vary in size, the d_ave, d_max and d_mean metrics tend to split large clusters into parts. Non-spherical clusters will be split by d_mean. Clusters connected by a chain of outliers will be merged if the d_min metric is used. None of these approaches works well in the presence of non-spherical clusters or outliers.

7 Drawbacks of previous approaches

8 CURE: Approach CURE is positioned between the centroid-based (d_ave) and all-points (d_min) extremes. A constant number of well-scattered points is used to capture the shape and extent of a cluster. These points are shrunk towards the centroid of the cluster by a factor α. The scattered, shrunken points are then used as the representatives of the cluster.
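The shrinking step is a simple linear interpolation toward the centroid. A minimal sketch (the function name and array layout are assumptions, not from the paper):

```python
import numpy as np

def shrink(points, alpha):
    # Move each representative a fraction alpha of the way toward the
    # cluster centroid: r = p + alpha * (centroid - p).
    # alpha = 0 leaves points unchanged; alpha = 1 collapses them all
    # onto the centroid.
    centroid = points.mean(axis=0)
    return points + alpha * (centroid - points)
```

Intermediate values of α interpolate between the all-points behavior (α = 0) and the centroid behavior (α = 1), which is exactly the positioning the slide describes.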

9 CURE: Approach The scattered-points approach alleviates the shortcomings of d_ave and d_min. –Since multiple representatives are used, the splitting of large clusters is avoided. –Multiple representatives allow the discovery of non-spherical clusters. –The shrinking phase affects outliers more than other points, since their distance from the centroid is decreased more than that of regular points.

10 CURE: Approach Initially, every point is in its own cluster, so each cluster is defined by its single point. Clusters are merged until they contain at least c points. The first scattered point in a cluster is the one farthest from the cluster's centroid. Each subsequent scattered point is chosen so that its distance from the previously chosen scattered points is maximal. Once c well-scattered points have been selected, they are shrunk by a factor α (r = p + α*(centroid − p)). After clusters have c representatives, the distance between two clusters is the distance between the closest pair of representatives, one from each cluster. Every time two clusters are merged, their representatives are recalculated.
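The farthest-point selection described above can be sketched as follows. This is an illustrative implementation under the stated rules (farthest-from-centroid first, then maximize the minimal distance to the already-chosen set); the function name and interface are assumptions:

```python
import numpy as np

def scattered_points(points, c):
    # Pick up to c well-scattered representatives from a cluster.
    centroid = points.mean(axis=0)
    # First representative: the point farthest from the centroid.
    chosen = [int(np.argmax(np.linalg.norm(points - centroid, axis=1)))]
    while len(chosen) < min(c, len(points)):
        # For every point, its minimal distance to the chosen set;
        # already-chosen points have distance 0 and are never re-picked.
        dists = np.linalg.norm(
            points[:, None, :] - points[chosen][None, :, :], axis=2)
        chosen.append(int(np.argmax(dists.min(axis=1))))
    return points[chosen]
```

On a square of corner points plus its center, this procedure picks the four corners, which is the intended behavior: the representatives trace the boundary of the cluster rather than its interior.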

11 Enhancements for Large Datasets Random sampling –Filters outliers and allows the dataset to fit into memory Partitioning –First cluster within partitions, then merge the partitions Labeling data on disk –The final labeling phase can be done by nearest-neighbor assignment to the already-chosen cluster representatives Handling outliers –Outliers are partly eliminated and spread out by random sampling; the rest are identified because they belong to small clusters that grow slowly
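The disk-labeling phase amounts to a nearest-representative lookup. A minimal sketch, assuming the representatives and their cluster ids are already in memory (names and array shapes are illustrative, not from the paper):

```python
import numpy as np

def assign_labels(points, reps, rep_cluster):
    # Label each point with the cluster id of its nearest representative.
    # reps: (k, d) array of representative points;
    # rep_cluster: (k,) array mapping each representative to a cluster id.
    dists = np.linalg.norm(points[:, None, :] - reps[None, :, :], axis=2)
    return rep_cluster[dists.argmin(axis=1)]
```

Because only the representatives (not whole clusters) are compared against, each disk-resident point costs O(k) distance computations, which is what makes labeling the full dataset feasible.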

12 Conclusions CURE can identify clusters that are not spherical, e.g. ellipsoidal ones. CURE is robust to outliers. CURE correctly clusters data with large differences in cluster size. Running time for a low-dimensional dataset with s points is O(s^2). Using partitioning and sampling, CURE can be applied to large datasets.

13 Thanks!

14 ?

