1 Parameter Reduction for Density-based Clustering on Large Data Sets Elizabeth Wang

2 Input Parameters of DBSCAN
1. The neighborhood radius, r.
2. The minimum number, k, of neighbors required for a point to be a core point – the seed for expansion.
The clustering quality depends heavily on the input parameters, especially r.
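To make this sensitivity concrete, here is a minimal sketch using scikit-learn's DBSCAN (not the presentation's own implementation); eps plays the role of r and min_samples the role of k, and the synthetic dataset is our choice:

```python
# Hypothetical demo: the same synthetic data clustered with two radii.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)

for eps in (0.3, 1.0):  # eps is the neighborhood radius r
    labels = DBSCAN(eps=eps, min_samples=4).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With a radius that is too small, many points are labeled noise; with one that is too large, distinct clusters merge.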

3 Selection of Parameters Matters: [figure omitted]

4 Content
Introduction
Related Work
Parameter Reduction for Density-based Clustering
–Experiments and Observations
–Determination of the neighborhood radii
–Nonparametric Density-based Clustering
Performance Analysis
Conclusion

5 Introduction
Density-based clustering is widely used in spatial applications such as geographical information analysis, medical applications, sky data observatories, and satellite image analysis.
In density-based clustering, clusters are dense areas of points in the data space, separated by areas of low density (noise).
One problem with density-based clustering is that users typically have minimal domain knowledge with which to determine the input parameters.

6 Our Approach to Solve the Problem
We explore an automatic approach to determine the minimum neighborhood radii based on the data distribution of the dataset.
The algorithm, MINR, is developed from many experiments and observations to determine the minimum neighborhood radii for clusters of different densities.
We combine MINR with the enhanced DBSCAN, e-DBSCAN, into a nonparametric density-based clustering algorithm (NPDBC).

7 Problem with Input Parameters
In density-based clustering, the main input is the minimum neighborhood radius.
A dataset may consist of clusters with the same density or with different densities; the figure shows some possible distributions of X.
When clusters have different densities, it is more difficult to determine the minimum neighborhood radii.

8 Related Work
Clustering methods
Attempts to reduce parameters
Enhanced DBSCAN clustering

9 Clustering Methods
There are mainly two kinds of clustering methods:
–similarity-based partitioning methods
–density-based clustering methods.
A similarity-based partitioning algorithm breaks a dataset into k subsets, called clusters. The major problems with partitioning methods are (see the k-means sketch below):
–k has to be predetermined;
–it is difficult to identify clusters with different sizes;
–they only find convex clusters.
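As a small, hedged illustration of the last two drawbacks, using scikit-learn's k-means as a representative partitioning method (the dataset choice is ours):

```python
# k must be supplied up front, and convex partitions cannot recover the
# two non-convex "moon" shapes: each moon gets split across clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```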

10 Clustering Methods (Cont.)
Density-based clustering methods are used to discover clusters with arbitrary shapes. The most typical algorithm is DBSCAN [1]. DBSCAN is very sensitive to its input parameters, the neighborhood radius (r) and the minimum number of neighbors (MinPts).
Another density-based algorithm, DenClue, uses a grid and is very efficient. It generalizes several other clustering approaches, which, however, results in a large number of input parameters.

11 Attempts To Reduce Parameters
There have been many efforts to make the clustering process parameter-free, such as OPTICS [8], CHAMELEON [7], and TURN* [2].
OPTICS computes an augmented cluster ordering, which represents the density-based clustering structure of the data. This method is used for interactive cluster analysis.
CHAMELEON has been found to be very effective at clustering convex shapes. However, the algorithm cannot handle outliers and needs parameter tuning to work effectively.
TURN* is a brute-force approach. It first decreases the neighborhood radius until it is so small that every data point becomes noise. The radius is then doubled at each step of clustering until a "turn" is found. Even though it takes big steps, its computation time is not promising for large datasets with varied densities.

12 Enhanced DBSCAN (e-DBSCAN)
A point p is an internal point if it has at least k neighbors within its neighborhood. Its neighborhood is called a core.
A point p is an external point if the number of neighbors within its neighborhood is less than k and it is located within a core.
Given k = 4, points 6 and 8 are internal points, while point 5 is an external point.
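A sketch of the internal-point test, using scikit-learn's NearestNeighbors for the range query (the function name is ours; deciding external vs. noise additionally requires the point to fall inside a core, which is not checked here):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def internal_mask(X, r, k=4):
    """True where a point has at least k neighbors within radius r."""
    hoods = NearestNeighbors(radius=r).fit(X).radius_neighbors(
        X, return_distance=False)
    # Each query point is returned as its own neighbor, so subtract one.
    counts = np.array([len(h) - 1 for h in hoods])
    return counts >= k
```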

13 e-DBSCAN (Cont.)
A cluster C is a collection of cores whose centers are density-reachable from each other.
The boundary points of a cluster are the external points within that cluster.
Enhanced DBSCAN (e-DBSCAN) differs from the original DBSCAN in that the boundary points of each cluster are stored as a separate set. The boundary sets are used to merge clusters at a later stage.

14 Steps of e-DBSCAN
1. Pick an arbitrary point x. If it is not an internal point, label it as noise. Otherwise its neighborhood becomes a rudiment cluster C; insert all neighbors of x into the seed store.
2. Retrieve the next point from the seed store. If it is an internal point, merge its neighborhood into cluster C and insert all its neighbors into the seed store; if it is an external point, insert it into the boundary set of C.
3. Go back to step 2 with the next seed until the seed store is empty.
4. Go back to step 1 with the next unclustered point in the dataset.
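The slide gives these steps in prose only, so the following is a sketch under our own naming (e_dbscan, neighbors), not the authors' code; neighbors(i) is assumed to be a range query that excludes i itself:

```python
from collections import deque

def e_dbscan(n_points, k, neighbors):
    """Sketch of steps 1-4. neighbors(i) -> indices within radius r of i."""
    labels = [None] * n_points     # None = unvisited, -1 = noise, else cluster id
    boundaries = {}                # cluster id -> boundary (external) point set
    cid = 0
    for x in range(n_points):      # step 4 is this outer loop
        if labels[x] is not None:
            continue
        nbrs = neighbors(x)
        if len(nbrs) < k:          # step 1: not internal -> label as noise
            labels[x] = -1
            continue
        cid += 1                   # step 1: internal -> rudiment cluster C
        boundaries[cid] = set()
        labels[x] = cid
        seeds = deque(nbrs)        # the seed store
        while seeds:               # steps 2-3: loop until the store is empty
            p = seeds.popleft()
            if labels[p] not in (None, -1):
                continue           # already claimed by a cluster
            labels[p] = cid
            p_nbrs = neighbors(p)
            if len(p_nbrs) >= k:   # internal: merge neighborhood, add seeds
                seeds.extend(p_nbrs)
            else:                  # external: record in the boundary set of C
                boundaries[cid].add(p)
    return labels, boundaries
```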

15 Parameter Reduction
DBSCAN has two input parameters:
–the minimum number of neighbors, k;
–the neighborhood radius, r.
k is the size of the smallest cluster. DBSCAN sets k to 4 [1]; TURN* also treats it as a fixed value [2]. A clustering comparison between k = 4 and k = 7 is shown on the next slide.
Therefore, the only input parameter is the minimum neighborhood radius, r. It depends on the data distribution of the dataset and should differ for clusters of different densities.

16 "k = 4" vs. "k = 7": the radii are matched so that both settings imply the same density threshold, (R₄)² / 4 = (R₇)² / 7, i.e., k / (π·r²) is held constant, giving R₇ = R₄·√(7/4) ≈ 1.32·R₄.

17 Observation 1
Define R as the distance between each point x and its 4th nearest neighbor, and sort the points by R in ascending order.
DS1 is a dataset used by DBSCAN; its size is 200 points. DS2 is reproduced from a dataset used by CHAMELEON.
[Figures: (a) DS1; (b) DS2; (c) R-x graph of DS1; (d) R-x graph of DS2]
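A minimal sketch of how the sorted R-x curve can be computed, using scikit-learn's NearestNeighbors (the function name is ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sorted_4dist(X, kth=4):
    """Ascending R-x curve: each point's distance to its kth nearest neighbor."""
    # Ask for kth+1 neighbors because each point is its own 0th neighbor.
    dists, _ = NearestNeighbors(n_neighbors=kth + 1).fit(X).kneighbors(X)
    return np.sort(dists[:, kth])
```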

18 Observation 2
Given a neighborhood radius r, we calculate the number of neighbors of each point within that radius, denoted K.
Sort the points by K in descending order to get the sorted K-x graph.
When the neighborhood radius is close to the maximum R, the K-x graph shows "knees" very clearly.
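A companion sketch for the K-x curve (same hedges as above; the function name is ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sorted_kx(X, r):
    """Descending K-x curve: per-point neighbor counts within radius r."""
    hoods = NearestNeighbors(radius=r).fit(X).radius_neighbors(
        X, return_distance=False)
    K = np.array([len(h) - 1 for h in hoods])  # exclude the point itself
    return np.sort(K)[::-1]
```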

19 Observation 2 (Cont.)
To find the "knees" in the graph, we calculate the differentials of the graph, the ΔKs. The knees are close to the points with peak differentials.
The number of "knees" equals the number of cluster densities in the dataset. Intuitively, we infer that the points divided by the "knees" belong to clusters of different densities, or to noise.
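A sketch of this knee detection; picking the largest differentials is our simple stand-in for the slide's "peak differentials" rule:

```python
import numpy as np

def find_knees(sorted_K, n_knees):
    """Indices of the n_knees largest jumps in the sorted K-x curve."""
    dK = np.abs(np.diff(sorted_K.astype(float)))  # the differentials, the ΔKs
    return np.sort(np.argsort(dK)[-n_knees:])     # knee positions, in order
```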

20 Observation 3
Sort the dataset DS2 by K, and then partition the sorted dataset into three subsets separated by the two "knees."
The two "knees" are at positions 10000 and 15500, so the three partitions are 0–10000, 10001–15500, and 15501–17500.
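A sketch of the partitioning step (function name is ours):

```python
import numpy as np

def partition_by_knees(K, knees):
    """Split point indices, sorted by K descending, at the knee positions."""
    order = np.argsort(K)[::-1]
    return np.split(order, list(knees))

# For DS2 (17,500 points) with knees at 10000 and 15500 this yields the
# three partitions 0-10000, 10001-15500, and 15501-17500.
```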

21 Determination of the Neighborhood Radii
Based on these experiments, we develop an algorithm, MINR, to automatically determine the minimum neighborhood radii for mining clusters of different densities from the data distribution.
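The slides do not spell out MINR's internals, so the following is only a speculative sketch that chains Observations 1–3, reusing the helpers sketched earlier; the 95th-percentile rule is our assumption standing in for MINR's actual radius rule:

```python
import numpy as np

def minr(R, K, n_densities):
    """R: per-point 4-dist (Obs. 1); K: per-point neighbor counts (Obs. 2)."""
    sorted_K = np.sort(K)[::-1]
    knees = find_knees(sorted_K, n_densities)   # one knee per density level
    parts = partition_by_knees(K, knees)        # the last partition is ~noise
    # One radius per density partition, smallest (densest cluster) first.
    return sorted(float(np.quantile(R[p], 0.95)) for p in parts[:-1])
```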

22 Nonparametric Density-based Clustering
Given a series of radii, start clustering with the enhanced DBSCAN algorithm, e-DBSCAN, using the smallest radius, r = r1; the densest cluster(s) are formed.
Then set r = r2 and process only the still-unclustered points; the next sparser cluster(s) are formed.

23 Nonparametric Density-based Clustering
1. Calculate a series of neighborhood radii for the different cluster densities using MINR.
2. Run the iterative clustering process using e-DBSCAN with those radii.
3. Merge any pair of clusters that share most of the boundary points of either cluster.
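A sketch of the NPDBC driver these steps imply; run_e_dbscan and merge_by_boundaries stand in for an e-DBSCAN pass restricted to still-unclustered points and for the boundary-based merge, and both names are ours:

```python
def npdbc(X, k, minr_radii, run_e_dbscan, merge_by_boundaries):
    labels = {i: None for i in range(len(X))}   # point -> cluster id
    boundaries = {}                             # cluster id -> boundary set
    for r in sorted(minr_radii):                # smallest radius (densest) first
        pending = [i for i, c in labels.items() if c in (None, -1)]
        new_labels, new_bounds = run_e_dbscan(X, pending, r, k)
        labels.update(new_labels)
        boundaries.update(new_bounds)
    # Merge cluster pairs sharing most of the boundary points of either one.
    return merge_by_boundaries(labels, boundaries)
```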

24 Performance Analysis
We compare our nonparametric density-based clustering algorithm (NPDBC) with TURN*.
We show run-time comparisons on the dataset DS2 discussed above. To make the data contain clusters of different densities, we artificially insert more points into some clusters to make them denser than the others. The resulting datasets range in size from 10k to 200k points.

25 Performance Analysis (Cont.)
NPDBC is more efficient than TURN* for large datasets. The reason is that NPDBC computes its parameters once, at the beginning of the clustering process, whereas TURN* tries different neighborhood radii until the first "turn" is found, even in the case of just two different densities.
We compare NPDBC with TURN* only on datasets with two different densities. As the number of distinct densities grows, NPDBC's advantage over TURN* will grow even larger.

26 Conclusion and Future Work
In this work, we explore an automatic approach to determine the neighborhood radius parameter based on the distribution of the dataset. The algorithm, MINR, is developed to determine the minimum neighborhood radii for clusters of different densities.
We developed a nonparametric clustering method (NPDBC) by combining MINR with the enhanced DBSCAN, e-DBSCAN. Experiments show that NPDBC is more efficient and scalable than TURN* for clusters of two different densities.
In future work, we will implement NPDBC using the P-tree, a vertical data structure that provides an efficient, data-mining-ready representation.

27 Thanks!

