
Published by Vernon Brooks. Modified about 1 year ago.

1

2
Clustering
Prof. Navneet Goyal, BITS, Pilani

3
Other Approaches to Clustering
– Density-based methods
  – Based on connectivity and density functions
  – Filter out noise, find clusters of arbitrary shape
– Grid-based methods
  – Quantize the object space into a grid structure

4
Density-Based Clustering Methods
– Major features:
  – Discover clusters of arbitrary shape
  – Handle noise
  – One scan
  – Need density parameters as termination condition
– Several interesting studies:
  – DBSCAN: Ester et al. (KDD’96)
  – OPTICS: Ankerst et al. (SIGMOD’99)
  – DENCLUE: Hinneburg & Keim (KDD’98)
  – CLIQUE: Agrawal et al. (SIGMOD’98)

5
Density-Based Method: DBSCAN
– DBSCAN: Density-Based Spatial Clustering of Applications with Noise
– Clusters are dense regions of objects separated by regions of low density (noise)
– Outliers do not affect the creation of clusters
– Input:
  – MinPts: minimum number of points in any cluster
  – Eps: for each point in a cluster there must be another point in it less than this distance away

6
DBSCAN Density Concepts
– Eps-neighborhood: the points within distance Eps of a point.
– Core point: a point whose Eps-neighborhood is dense enough (contains at least MinPts points).
– Directly density-reachable: a point p is directly density-reachable from a point q if p lies within Eps of q and q is a core point.
– Density-reachable: a point p is density-reachable from a point q if there is a path from q to p in which each point is directly density-reachable from the one before it, so every point on the path except possibly p itself is a core point.

7
Density-Based Method: DBSCAN
– Eps-neighborhood: the points within distance Eps of a point: N_Eps(p) = {q in D | dist(p, q) <= Eps}
– Core point: a point whose Eps-neighborhood is dense enough (at least MinPts points)
– Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if
  1) p belongs to N_Eps(q)
  2) core point condition: |N_Eps(q)| >= MinPts
– Figure: points p and q, with MinPts = 5 and Eps = 1 cm
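
The definitions on this slide translate almost directly into code. The following is a minimal Python sketch, not part of the original slides; the function names and the toy data are hypothetical, and Euclidean distance is assumed for dist(p, q).

```python
import math

def eps_neighborhood(points, p_idx, eps):
    """Return indices of all points q with dist(p, q) <= eps (p itself included)."""
    p = points[p_idx]
    return [i for i, q in enumerate(points) if math.dist(p, q) <= eps]

def is_core(points, p_idx, eps, min_pts):
    """Core point condition: |N_Eps(p)| >= MinPts."""
    return len(eps_neighborhood(points, p_idx, eps)) >= min_pts

def directly_density_reachable(points, p_idx, q_idx, eps, min_pts):
    """p is directly density-reachable from q if p is in N_Eps(q) and q is core."""
    return (p_idx in eps_neighborhood(points, q_idx, eps)
            and is_core(points, q_idx, eps, min_pts))

# Hypothetical toy data: a tight group, one point on its fringe, one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5),
          (0.25, 0.25), (1.2, 0.25), (5.0, 5.0)]
eps, min_pts = 1.0, 5

centre, fringe = 4, 5
print(is_core(points, centre, eps, min_pts))   # the centre point is core
print(is_core(points, fringe, eps, min_pts))   # the fringe point is not
print(directly_density_reachable(points, fringe, centre, eps, min_pts))
print(directly_density_reachable(points, centre, fringe, eps, min_pts))
```

The last two calls illustrate that direct density-reachability is not symmetric: the fringe point is reachable from the core point, but not the other way round, because the fringe point fails the core point condition.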

8
Density-Based Method: DBSCAN
– Density-reachable: a point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p_1, …, p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i for all i in 1, …, n-1
– Every point of the chain except possibly p is therefore a core point
– Figure: points q, p1 and p

9
Density-Based Method: DBSCAN
– Density-connected: a point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts
– Figure: points p, q and o

10
DBSCAN
– Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
– Discovers clusters of arbitrary shape in spatial databases with noise
– Figure: core, border and outlier points, with Eps = 1 cm and MinPts = 5

11
DBSCAN: Core, Border, and Noise Points

12
DBSCAN: The Algorithm
1. Label all points as core, border, or noise points
2. Eliminate noise points
3. Put an edge between all core points that are within ε of each other
4. Make each group of connected core points into a separate cluster
5. Assign each border point to the cluster of one of its associated core points
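
The five steps above can be sketched in a few lines of Python. This is an illustrative brute-force version (quadratic in the number of points), not the slides' own code; the function name `dbscan`, the label -1 for noise, and the toy data are assumptions made for the example.

```python
import math

def dbscan(points, eps, min_pts):
    """Return (labels, core_flags); label -1 marks noise."""
    n = len(points)
    # Eps-neighborhoods (each point is in its own neighborhood).
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    # Step 1: core points satisfy |N_Eps(p)| >= MinPts.
    core = [len(nb) >= min_pts for nb in nbrs]
    # Step 2: everything starts as noise; noise is whatever never gets a label.
    labels = [-1] * n
    cluster = 0
    # Steps 3-4: connected components of core points within eps of each other.
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            u = stack.pop()
            for v in nbrs[u]:
                if core[v] and labels[v] == -1:
                    labels[v] = cluster
                    stack.append(v)
        cluster += 1
    # Step 5: a border point joins the cluster of one core point in its neighborhood.
    for i in range(n):
        if not core[i]:
            for j in nbrs[i]:
                if core[j]:
                    labels[i] = labels[j]
                    break
    return labels, core

# Hypothetical toy data: two square blobs, one border point, one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels, core = dbscan(points, eps=1.5, min_pts=4)
print(labels)
print(core)
```

In this sketch a border point that lies near core points of two different clusters simply joins the first one found, which matches the "one of" wording in step 5.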

13
DBSCAN: Core, Border and Noise Points
– Figure panels: original points; point types (core, border and noise) for Eps = 10, MinPts = 4
– Source of figure: Introduction to Data Mining by Tan et al.

14
When DBSCAN Works Well
– Resistant to noise
– Can handle clusters of different shapes and sizes
– Figure panels: original points; clusters
– Source of figure: Introduction to Data Mining by Tan et al.

15
When DBSCAN Does NOT Work Well
– Varying densities
– High-dimensional data
– Figure panels: original points; result for (MinPts=4, Eps=9.75); result for (MinPts=4, Eps=9.92)
– Source of figure: Introduction to Data Mining by Tan et al.

16
DBSCAN: Determining Eps and MinPts
– The idea is that, for points in a cluster, the k-th nearest neighbors are at roughly the same distance
– Noise points have their k-th nearest neighbor at a farther distance
– So, plot the sorted distance of every point to its k-th nearest neighbor
– Figure annotation: Eps = 10, MinPts = 4
– Source of figure: Introduction to Data Mining by Tan et al.
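
A small Python sketch of the sorted k-dist computation described above follows. It is illustrative only: the function name and the toy data are hypothetical, the search is brute force, and a point is not counted as its own neighbor.

```python
import math

def sorted_k_distances(points, k):
    """Distance from every point to its k-th nearest neighbor, in ascending order."""
    kdist = []
    for i, p in enumerate(points):
        # Distances to all other points, nearest first.
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        kdist.append(d[k - 1])
    return sorted(kdist)

# Hypothetical toy data: two square blobs, one fringe point, one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
kd = sorted_k_distances(points, k=3)
for rank, d in enumerate(kd):
    print(rank, round(d, 3))
```

Plotting `kd` against its rank gives the curve the slide refers to: the values stay nearly flat for points inside clusters and jump sharply for noise, and an Eps just below the jump is a reasonable candidate, with MinPts set from k.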

17
OPTICS: Self Study
– OPTICS: Ordering Points To Identify the Clustering Structure
– DBSCAN is sensitive to the choice of input parameters
– Parameter setting is done empirically
– The problem is more pronounced for high-dimensional data
– Clustering structures in high-dimensional data are generally not characterized by global density parameters like Eps and MinPts
– OPTICS as a solution!

18
OPTICS
– Computes an augmented cluster ordering
– The ordering represents the density-based clustering structure of the data
– Contains information equivalent to the density-based clusterings obtained from a wide range of parameter settings
– The cluster ordering can be used to extract basic clustering information

19
OPTICS
– In DBSCAN, for a constant MinPts, clusters of higher density (lower Eps) are completely contained in density-connected sets obtained with a lower density (higher Eps)
– Extend DBSCAN to process a set of distance parameters Eps at the same time
– For this, the objects need to be processed in a specific order
– This order always selects next an object that is density-reachable wrt. the lowest Eps, so that clusters of higher density are finished first

20
OPTICS
– Two values need to be stored for each object:
  – Core distance
  – Reachability distance
– Core distance of p: the smallest Eps that makes p a core object. If p is not a core object, it is undefined.
– Reachability distance of q wrt. p: the greater of the core distance of p and the Euclidean distance between p and q. If p is not a core object, the reachability distance between p and q is undefined.
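
The two quantities can be sketched in Python as follows. This is an illustration under stated assumptions rather than the OPTICS authors' code: the function names are hypothetical, `None` stands for "undefined", the search is brute force, and a point counts toward its own MinPts (consistent with a neighborhood definition that includes the point itself).

```python
import math

def core_distance(points, p_idx, eps, min_pts):
    """Smallest radius that makes p a core object, or None if p is not core at eps."""
    # Sorted distances from p to every point; the first entry is 0.0 (p itself).
    d = sorted(math.dist(points[p_idx], q) for q in points)
    if len(d) < min_pts or d[min_pts - 1] > eps:
        return None
    return d[min_pts - 1]

def reachability_distance(points, q_idx, p_idx, eps, min_pts):
    """max(core-distance(p), dist(p, q)), or None if p is not a core object."""
    cd = core_distance(points, p_idx, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[p_idx], points[q_idx]))

# Hypothetical toy data on a line: four evenly spaced points and one far away.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
eps, min_pts = 3.0, 3

print(core_distance(points, 0, eps, min_pts))             # third-nearest, counting itself
print(core_distance(points, 4, eps, min_pts))             # isolated point: undefined
print(reachability_distance(points, 1, 0, eps, min_pts))  # limited by the core distance
print(reachability_distance(points, 3, 0, eps, min_pts))  # limited by the actual distance
```

The last two calls show both branches of the maximum: a very close neighbor still gets the core distance as its reachability distance, while a farther neighbor gets its true distance.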

21
OPTICS: Some Extension from DBSCAN
– Index-based: k = number of dimensions, N = 20, p = 75%, M = N(1-p) = 5
  – Complexity: O(kN^2)
– Core distance
– Reachability distance: max(core-distance(o), d(o, p))
– Figure: MinPts = 5, Eps = 3 cm; r(p1, o) = 2.8 cm, r(p2, o) = 4 cm

22
Density-based Clustering Contd…
– Efficiency issues with DBSCAN
– Finding clusters in subspaces
– Modeling density accurately
We now look at:
– Grid-based clustering
  – Partitions the data space into grid cells and forms clusters from cells that are dense enough
  – Efficient approach for low-dimensional data
– Subspace clustering
  – Finds clusters in subsets of all dimensions
  – 2^n - 1 subspaces to be searched (for n dimensions)!
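
The 2^n - 1 count of candidate subspaces is the number of non-empty subsets of n dimensions. A short Python check, with hypothetical names, makes the growth concrete:

```python
from itertools import combinations

def nonempty_subspaces(dims):
    """All non-empty subsets of the given dimensions, smallest subsets first."""
    return [c for r in range(1, len(dims) + 1) for c in combinations(dims, r)]

subspaces = nonempty_subspaces(["x", "y", "z"])
print(subspaces)
print(len(subspaces))                        # 2**3 - 1
print(len(nonempty_subspaces(range(10))))    # 2**10 - 1
```

Even ten attributes already give over a thousand subspaces, which is why exhaustive search is impractical and subspace clustering algorithms prune the search.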

23
Grid-based Clustering
– GRIDCLUS
– STING
– CLIQUE
– WaveCluster

24
Grid-based Clustering
– Significant reduction in time complexity, especially for large data sets
– Number of cells << number of data points
– Instead of clustering the data points, the neighborhoods surrounding the data points are clustered

25
Grid-based Clustering
Steps involved:
1. Creating the grid structure
2. Calculating the cell density for each cell
3. Sorting the cells according to their densities
4. Identifying cluster centers
5. Traversal of neighborhood cells

26
Grid-based Clustering
Algorithm:
1. Define a set of grid cells
2. Assign objects to the appropriate grid cells and compute the density of each cell
3. Eliminate cells having density below a specified threshold
4. Form clusters from contiguous groups of dense cells
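
The four steps of this algorithm can be sketched in Python as below. The sketch makes several assumptions that the slide does not fix: equal-width cells of side `cell_width`, density measured as the raw point count per cell, and two cells treated as contiguous when they touch along a side or a corner (8-adjacency in 2-D). All names and the toy data are hypothetical.

```python
from collections import Counter, deque
from itertools import product

def grid_cluster(points, cell_width, threshold):
    """Map each dense cell (tuple of integer cell indices) to a cluster id."""
    # Steps 1-2: assign every point to its cell and count points per cell.
    counts = Counter(tuple(int(c // cell_width) for c in p) for p in points)
    # Step 3: keep only cells whose count reaches the threshold.
    dense = {cell for cell, n in counts.items() if n >= threshold}
    if not dense:
        return {}
    dim = len(next(iter(dense)))
    # Offsets to all neighboring cells (sides and corners), excluding the cell itself.
    offsets = [o for o in product((-1, 0, 1), repeat=dim) if any(o)]
    # Step 4: flood-fill contiguous groups of dense cells.
    labels = {}
    cluster = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = cluster
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for o in offsets:
                nb = tuple(c + d for c, d in zip(cell, o))
                if nb in dense and nb not in labels:
                    labels[nb] = cluster
                    queue.append(nb)
        cluster += 1
    return labels

# Hypothetical toy data: two adjacent dense cells, one separate dense cell, one sparse cell.
points = [(0.1, 0.1), (0.2, 0.8), (0.9, 0.5),
          (1.1, 0.2), (1.5, 0.5), (1.9, 0.9),
          (5.2, 5.2), (5.5, 5.5), (5.8, 5.1),
          (3.3, 3.3)]
cell_labels = grid_cluster(points, cell_width=1.0, threshold=3)
print(cell_labels)
```

Only occupied cells are ever stored, so memory grows with the number of non-empty cells rather than with the full grid. A point's cluster is the label of its cell, and points in cells below the threshold are left unclustered.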

27
Grid-based Clustering
Defining grid cells:
– Key step
– Equal-width intervals along all dimensions
  – Each cell has the same volume
  – Density of a cell is defined as the number of points in the cell
– Alternatively, an equi-depth approach can be used
  – Equal number of points in each interval
  – Also called equal-frequency discretization
– MAFIA: a subspace clustering algorithm that initially uses equal-width intervals and then combines intervals of similar density
– The definition of the grid has a strong impact on clustering results
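
The difference between equal-width and equi-depth intervals is easiest to see on one skewed attribute. The following Python sketch is illustrative; the function names and data are hypothetical, and the equi-depth edges are taken at simple rank positions without any tie handling.

```python
def equal_width_edges(values, k):
    """k intervals of identical width between the minimum and the maximum."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(k + 1)]

def equal_depth_edges(values, k):
    """k intervals holding roughly the same number of points each."""
    s = sorted(values)
    n = len(s)
    return [s[0]] + [s[(i * n) // k] for i in range(1, k)] + [s[-1]]

# Hypothetical skewed attribute: eight small values and one large outlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width_edges = equal_width_edges(values, 3)
depth_edges = equal_depth_edges(values, 3)
print(width_edges)   # the outlier stretches the intervals
print(depth_edges)   # the intervals follow the data instead
```

With equal-width edges, eight of the nine values fall into the first interval and the middle interval is empty; with equi-depth edges each interval holds three values. This is one concrete way in which the grid definition affects the clustering result.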

28
Grid-based Clustering
Density of grid cells:
– Number of points in the cell divided by the volume of the cell
– Examples of density in this sense:
  – Number of road signs per km
  – Number of tigers in a sq. km
  – Number of molecules of a gas in a cu. cm
– Source of figure: Introduction to Data Mining by Tan et al.

29
Grid-based Clustering
Forming clusters from dense grid cells:
– Relatively straightforward
– In the example on the previous slide: 2 clusters
– Define adjacency
  – 4 or 8 adjacent cells in 2-D?
  – Efficient technique to find adjacent cells (only occupied cells are stored)
– Partially empty cells on the fringe of clusters are not dense and will be discarded
– 4 parts of the larger cluster will be lost if the threshold is 9
– Source of figure: Introduction to Data Mining by Tan et al.

30
Grid-based Clustering
Strengths & limitations:
– A single pass is enough to determine the cell and the count of every cell
– Grid cells are created only where they are non-empty
– Complexity of O(m)
– O(m log m) – grids are rectangular
– Curse of dimensionality
– Grid cells containing just one element
– Source of figure: Introduction to Data Mining by Tan et al.

31
Subspace Clustering
– The clustering algorithms considered so far take all attributes into account
– Subspace clustering considers only a subspace of the data
– Source of figure: Introduction to Data Mining by Tan et al.

32
Subspace Clustering
Source of figure: Introduction to Data Mining by Tan et al.

33
Some Research Directions
– Ensemble clustering
– Parallelizing clustering algorithms to leverage a cluster

34
Ensemble Clustering
– Similar to ensemble classification
– Consensus clustering
– Obtain different clustering solutions and then reconcile them
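
The slide does not say how the clustering solutions are reconciled. One common option, shown here purely as an assumed example, is a co-association matrix: count how often each pair of points shares a cluster across the solutions, then group points that agree in more than half of them. All names, the 0.5 threshold, and the data are hypothetical.

```python
def co_association(labelings):
    """Fraction of labelings in which each pair of points shares a cluster."""
    n = len(labelings[0])
    m = len(labelings)
    return [[sum(lab[i] == lab[j] for lab in labelings) / m for j in range(n)]
            for i in range(n)]

def consensus_clusters(labelings, threshold=0.5):
    """Link points whose co-association exceeds the threshold; return component labels."""
    co = co_association(labelings)
    n = len(co)
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            u = stack.pop()
            for v in range(n):
                if labels[v] == -1 and co[u][v] > threshold:
                    labels[v] = cluster
                    stack.append(v)
        cluster += 1
    return labels

# Three hypothetical clusterings of five points. Cluster ids are arbitrary per
# solution, so the first two agree even though their ids are swapped.
labelings = [[0, 0, 1, 1, 1],
             [1, 1, 0, 0, 0],
             [0, 0, 0, 1, 1]]
consensus = consensus_clusters(labelings)
print(consensus)
```

Working on pairwise agreement avoids having to match cluster ids between solutions, which is the main difficulty compared with ensemble classification, where class labels have a fixed meaning.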

35
Parallelizing Clustering Algorithms
– Parallelize to leverage a cluster
– Two levels of parallelism
  – Node level
  – Core level
– Not necessarily orthogonal
– Hybrid: non-trivial
– Programming environment:
  – MPI
  – OpenMP
