# Clustering Methods Professor: Dr. Mansouri

Professor: Dr. Mansouri
Presented by: Muhammad Abouei & Mohsen Ghahremani Manesh

Clustering Methods: Density-Based Clustering Methods
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- OPTICS (Ordering Points To Identify the Clustering Structure)
- DENCLUE (DENsity-based CLUstEring)
- Grid-based clustering

Density Based Clustering

DBSCAN Concepts ε-neighborhood: the set of points within distance ε (the radius) of a point. MinPts: the minimum number of points required in the ε-neighborhood of a point. (Figure: ε-neighborhood of q, ε-neighborhood of p, MinPts = 5.) ε and MinPts are user-defined parameters.

DBSCAN Concepts Density: the number of points within a specified radius ε. Density(p) = 5

DBSCAN Concepts Core point: a point is a core point if it has at least a specified number of points (MinPts) within ε. Core points lie in the interior of a cluster. (Figure: ε-neighborhood of q, ε-neighborhood of p; with MinPts = 5, p is a core point and q is not.)

DBSCAN Concepts Directly density-reachable: a point p is directly density-reachable (DDR) from a point q w.r.t. ε and MinPts if p belongs to the ε-neighborhood of q and q is a core point. (Figure: MinPts = 4; p is DDR from q, but q is not DDR from p.) DDR is an asymmetric relation.

DBSCAN Concepts Density-reachable: a point p is density-reachable (DR) from a point q w.r.t. ε and MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that p(i+1) is directly density-reachable from pi. Equivalently, p is density-reachable from q if there is a path (chain of points) from q to p consisting of core points (except possibly p itself). (Figure: MinPts = 4; p is DR from q, but q is not DR from p, since p is not core.) DR is an asymmetric relation.

DBSCAN Concepts Density-connectivity: a point p is density-connected (DC) to a point q w.r.t. ε and MinPts if there is a point r such that both p and q are density-reachable from r w.r.t. ε and MinPts. (Figure: MinPts = 4; p and q are density-connected.) DC is a symmetric relation.

DBSCAN Concepts Border point: a border point has fewer than MinPts points within ε, but lies in the ε-neighborhood of a core point. (Figure: MinPts = 5, ε = circle radius.)

DBSCAN Concepts Noise (outlier) point: any point that is neither a core point nor a border point. (Figure: MinPts = 5, ε = circle radius.)

DBSCAN Concepts DBSCAN relies on a density-based notion of cluster.
Cluster: a cluster C is a non-empty set of density-connected points that is maximal w.r.t. density-reachability. Maximality: for all p, q: if q ∈ C and p is density-reachable from q w.r.t. ε and MinPts, then p ∈ C as well. (Figure: MinPts = 3, ε = circle radius.)

DBSCAN Algorithm
1. Arbitrarily select a point p.
2. Retrieve all points density-reachable from p w.r.t. ε and MinPts.
3. If p is a core point, a cluster is formed.
4. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
5. Continue until all points have been processed.
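The steps above can be sketched in plain Python. This is a minimal, unoptimized illustration (no spatial index, so it runs in O(n²); a production implementation such as scikit-learn's `DBSCAN` would be preferred in practice). The function names and the label convention (cluster ids 0, 1, …; −1 for noise) are our own choices, not part of the original algorithm description.

```python
from math import dist

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (its eps-neighborhood)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Naive DBSCAN: returns one label per point (0, 1, ... = cluster, -1 = noise)."""
    labels = [None] * len(points)              # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:           # not core: tentatively noise
            labels[i] = -1
            continue
        cluster += 1                           # i is core: start a new cluster
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                           # expand via density-reachability
            j = seeds.pop()
            if labels[j] == -1:                # former noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:    # j is core too: keep expanding
                seeds.extend(j_neighbors)
    return labels
```

Note how a border point is absorbed into a cluster but never expands it, matching the definition that no point is density-reachable from a border point.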

DBSCAN MinPts = 4

DBSCAN DBSCAN is sensitive to its parameters. MinPts = 4

DBSCAN Core, Border and Noise Points: MinPts = 4, ε = 10. (Figures: original points; point types: core, border, and noise.)

DBSCAN When DBSCAN works well: it is resistant to noise
and can handle clusters of different shapes and sizes. (Figures: original points and resulting clusters.)

DBSCAN When DBSCAN does not work well: varying densities,
high-dimensional data.

DBSCAN Complexity If a spatial index (e.g., kd-tree, R*-tree) is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects. Otherwise, it is O(n²).

OPTICS Core distance: the smallest radius ε′ that makes a point a core object; if p is not a core object, it is undefined. With MinPts = 5, the core distance of p is the distance between p and its 4th-nearest neighbor. (Figure: MinPts = 5, ε = 3 cm.)

OPTICS Reachability distance: the reachability distance of r w.r.t. p is the greater of the core distance of p and the Euclidean distance between p and r. If p is not a core object, the reachability distance between p and r is undefined. In the figure: reachability-distance_ε,MinPts(r, p) = ε′ (the core distance of p), while reachability-distance_ε,MinPts(r′, p) = d(p, r′). MinPts = 5, ε = 3 cm.
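The two distances above can be sketched directly from their definitions. This is an illustrative fragment, not the full OPTICS algorithm; the function names are our own, and `None` is used here to stand for "undefined". The neighbor count includes the point itself, so with MinPts points required, the core distance is the distance to the (MinPts − 1)-th other neighbor.

```python
from math import dist

def core_distance(points, p, eps, min_pts):
    """Smallest radius making points[p] a core object: the distance to its
    min_pts-th nearest point (counting itself), or None if that exceeds eps."""
    dists = sorted(dist(points[p], q) for q in points)  # dists[0] is p itself (0.0)
    d = dists[min_pts - 1]
    return d if d <= eps else None

def reachability_distance(points, r, p, eps, min_pts):
    """Reachability distance of r w.r.t. p: max(core-distance(p), d(p, r)).
    Undefined (None) when p is not a core object."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, dist(points[p], points[r]))
```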

OPTICS (figure-only slides; images not included in this transcript)

OPTICS Color image segmentation using density-based clustering. (Figure.)

DENCLUE (DENsity-based CLUstEring): Major Features
- Solid mathematical foundation
- Good for data sets with large amounts of noise
- Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
- Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
- But needs a large number of parameters

DENCLUE Technical Essence
Uses grid cells, but keeps information only about grid cells that actually contain data points, and manages these cells in a tree-based access structure.

DENCLUE Technical Essence DENCLUE is based on the following concepts:
- Influence function
- Density function
- Density attractors

DENCLUE Influence function: the influence function f^y(x) of a data point y ∈ D (the data space) at a point x is a positive function that decays to zero as x "moves away" from y, i.e., as d(x, y) → ∞. Typical examples are the square-wave influence function, f_square(x, y) = 1 if d(x, y) ≤ σ and 0 otherwise, and the Gaussian influence function, f_Gauss(x, y) = e^(−d(x, y)² / (2σ²)), where σ is a user-defined parameter.

DENCLUE Density function: the density function at x, based on a data set of N points D = {x1, …, xN}, is defined as the sum of the influence functions of all data points at x: f^D(x) = Σ_{i=1..N} f^{xi}(x). The goal of the definition: identify all "significant" local maxima xj*, j = 1, …, m of f^D(x); create a cluster Cj for each xj* and assign to Cj all points of D that lie within the "region of attraction" of xj*.

DENCLUE Example: Density Computation for D = {x1, x2, x3, x4}:
f^D_Gauss(x) = influence(x1) + influence(x2) + influence(x3) + influence(x4) = 0.78. Remark: the density value at y would be larger than the one at x.
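The density computation above can be sketched in a few lines of Python, using the Gaussian influence function. This is a minimal illustration with our own function names; the numeric values in the slide's example depend on the (unshown) coordinates of x1, …, x4 and σ, so a small synthetic data set is used instead.

```python
from math import exp, dist

def gaussian_influence(x, y, sigma):
    """Gaussian influence of data point y at location x: exp(-d(x,y)^2 / (2 sigma^2))."""
    return exp(-dist(x, y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """DENCLUE density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)
```

For example, with D = {(0,0), (1,0)} and σ = 1, the density at (0,0) is 1 + e^(−0.5) ≈ 1.607.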

DENCLUE Density attractors: density attractors are the local maxima of the overall density function f^D(x). Clusters can then be determined mathematically by identifying density attractors: a hill-climbing algorithm guided by the gradient can be used to determine the density attractor of a set of data points.

DENCLUE Density-attracted: a point x is density-attracted to a density attractor x* if there exists a sequence of points x0, x1, …, xk with x0 = x and xk = x* such that the gradient at x(i−1) points in the direction of xi for 0 < i < k.
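The hill climbing toward a density attractor can be sketched as follows. As an illustrative simplification, this uses the fixed-point (mean-shift style) update that applies to the Gaussian kernel, rather than explicit gradient steps; the tolerance and iteration cap are arbitrary choices, not part of the definition above.

```python
from math import exp, dist

def hill_climb(x, data, sigma, tol=1e-6, max_iter=200):
    """Climb from x toward its density attractor under the Gaussian kernel.
    Each step moves x to the influence-weighted mean of the data, which
    follows the gradient direction of the Gaussian density function."""
    for _ in range(max_iter):
        w = [exp(-dist(x, y) ** 2 / (2 * sigma ** 2)) for y in data]
        total = sum(w)
        nxt = tuple(sum(wi * y[k] for wi, y in zip(w, data)) / total
                    for k in range(len(x)))
        if dist(x, nxt) < tol:      # converged: nxt approximates the attractor x*
            return nxt
        x = nxt
    return x
```

Running this from every data point and grouping points that climb to (approximately) the same attractor yields the center-defined clusters described next.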

DENCLUE Center-defined cluster: a center-defined cluster (w.r.t. σ, ε) for a density attractor x* is a subset C ⊆ D with every x ∈ C density-attracted to x* and f^D(x*) ≥ ε. Outlier: a point x ∈ D is called an outlier if it is density-attracted to a local maximum xo* with f^D(xo*) < ε.

DENCLUE Multicenter-defined clusters: a multicenter-defined cluster is a set of center-defined clusters linked by a path of significance.

DENCLUE Arbitrary-shape cluster: an arbitrary-shape cluster (w.r.t. σ, ε) for a set of density attractors X is a subset C ⊆ D such that (1) for every x ∈ C there exists an x* ∈ X with f^D(x*) ≥ ε and x density-attracted to x*, and (2) for every pair x1*, x2* ∈ X there exists a path P from x1* to x2* with f^D(p) ≥ ε for every p ∈ P.

DENCLUE Note that the number of clusters found by DENCLUE varies depending on σ and ε.

DENCLUE DENCLUE is able to detect arbitrarily shaped clusters.
- It deals with noise very satisfactorily.
- The worst-case time complexity of DENCLUE is O(N log N); experimental results indicate that the average time complexity is O(log N).
- It works efficiently with high-dimensional data.
- DENCLUE needs at least three parameters to be determined: σ, ε, and εc.

Grid-based Clustering
- Uses a multi-resolution grid data structure.
- Clustering complexity depends on the number of populated grid cells, not on the number of objects in the dataset.
- Several interesting methods: CS-Tree (Clustering Statistical Tree), STING, WaveCluster.

Grid-based Basic Grid-based Algorithm
1. Define a set of grid cells.
2. Assign objects to the appropriate grid cell and compute the density of each cell.
3. Eliminate cells whose density is below a certain threshold τ.
4. Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function).
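The four steps above can be sketched in plain Python for 2-D points on a uniform grid. This is a minimal illustration with our own naming; it merges 8-connected dense cells by flood fill rather than minimizing an objective function, and labels points in sparse cells as noise (−1).

```python
from collections import defaultdict

def grid_cluster(points, cell_size, tau):
    """Basic grid-based clustering: bin points into cells, drop cells with
    density below tau, and merge adjacent dense cells into clusters."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):          # step 1-2: assign to cells
        cells[(int(x // cell_size), int(y // cell_size))].append(idx)
    dense = {c for c, members in cells.items() if len(members) >= tau}  # step 3
    labels, cluster = {}, 0
    for c in sorted(dense):                        # step 4: flood-fill merge
        if c in labels:
            continue
        labels[c] = cluster
        stack = [c]
        while stack:
            cx, cy = stack.pop()
            for dx in (-1, 0, 1):                  # 8-connected neighborhood
                for dy in (-1, 0, 1):
                    n = (cx + dx, cy + dy)
                    if n in dense and n not in labels:
                        labels[n] = cluster
                        stack.append(n)
        cluster += 1
    # Map cell labels back to points; points in eliminated cells become noise.
    return [labels.get((int(x // cell_size), int(y // cell_size)), -1)
            for x, y in points]
```

Note that per-point work is limited to binning; all clustering decisions are made on cell summaries, which is the source of the complexity advantage described below.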

Grid-based Advantages Fast: no distance computations;
clustering is performed on summaries rather than on individual objects, so the complexity is usually O(number of populated grid cells) rather than O(number of objects); it is also easy to determine which clusters are neighboring.


Questions?