1 Ch. Eick: Introduction to Hierarchical Clustering and DBSCAN
More on Clustering
1. Hierarchical Clustering (to be discussed in Clustering Part 2)
2. DBSCAN (will be used in the programming project)

2 Hierarchical Clustering
- Produces a set of nested clusters organized as a hierarchical tree
- Can be visualized as a dendrogram
  - A tree-like diagram that records the sequence of merges or splits

3 Agglomerative Clustering Algorithm
- The more popular hierarchical clustering technique
- The basic algorithm is straightforward:
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
- The key operation is the computation of the proximity of two clusters
  - Different approaches to defining the distance between clusters distinguish the different algorithms
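The six steps above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration and not code from the slides: the function name `agglomerative` and the `linkage` parameter are assumptions made for the example, Euclidean distance is assumed as the proximity, and instead of updating a cluster-level proximity matrix (step 5) the sketch recomputes cluster proximities from the point-level matrix on each pass.

```python
import math


def agglomerative(points, linkage=min):
    """Naive agglomerative clustering; returns the merges in the order performed.

    `linkage` combines point-to-point distances into a cluster distance:
    `min` gives MIN (single link), `max` gives MAX (complete link).
    """
    # Step 1: proximity matrix of pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Step 2: every point starts as its own cluster (lists of point indices).
    clusters = [[i] for i in range(len(points))]
    merges = []
    # Steps 3-6: repeat until only a single cluster remains.
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster proximity derived from the point proximities.
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        # Step 4: merge the two closest clusters (b is absorbed into a).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


if __name__ == "__main__":
    for left, right, d in agglomerative([(0, 0), (0, 1), (5, 0), (5, 1.5)]):
        print(left, right, round(d, 3))
```

The returned merge list is exactly the information a dendrogram records. Passing `linkage=max` switches the same loop to complete-link behavior.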

4 Starting Situation
- Start with clusters of individual points and a proximity matrix
[Figure: points p1 to p5 and their proximity matrix]

5 Intermediate Situation
- After some merging steps, we have some clusters
[Figure: clusters C1 to C5 and their proximity matrix]

6 Intermediate Situation
- We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1 to C5 and their proximity matrix]

7 After Merging
- The question is "How do we update the proximity matrix?"
[Figure: clusters C1, C3, C4, and C2 ∪ C5; the proximity-matrix entries involving C2 ∪ C5 are marked "?"]

8 How to Define Inter-Cluster Similarity
- MIN
- MAX
- Group Average
- Distance Between Centroids
- Other methods driven by an objective function
  - Ward's Method uses squared error
[Figure: points p1 to p5 and their proximity matrix]

13 Cluster Similarity: Group Average
- The proximity of two clusters is the average of the pairwise proximities between points in the two clusters.
- Average connectivity is needed for scalability, since total proximity favors large clusters.
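Group average, together with the MIN, MAX, and centroid measures listed for inter-cluster similarity, can be written as small functions over two clusters of points. The function names below are illustrative assumptions, and Euclidean distance is assumed as the point-to-point proximity.

```python
import math


def pairwise(c1, c2):
    """All point-to-point distances between cluster c1 and cluster c2."""
    return [math.dist(p, q) for p in c1 for q in c2]


def min_link(c1, c2):
    """MIN: distance of the closest pair across the two clusters."""
    return min(pairwise(c1, c2))


def max_link(c1, c2):
    """MAX: distance of the farthest pair across the two clusters."""
    return max(pairwise(c1, c2))


def group_average(c1, c2):
    """Group average: total pairwise distance divided by |c1| * |c2|."""
    d = pairwise(c1, c2)
    return sum(d) / len(d)


def centroid_distance(c1, c2):
    """Distance between the coordinate-wise means of the two clusters."""
    def centroid(c):
        return tuple(sum(xs) / len(c) for xs in zip(*c))
    return math.dist(centroid(c1), centroid(c2))


if __name__ == "__main__":
    a, b = [(0, 0)], [(3, 0), (3, 4)]
    print(min_link(a, b), max_link(a, b), group_average(a, b))
```

Dividing by the number of pairs is what keeps group average from growing with cluster size, which is the point made in the second bullet.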

14 Density-based Clustering
Density-based clustering algorithms use density-estimation techniques
- to create a density function over the space of the attributes; clusters are then identified as areas in the graph whose density is above a certain threshold (DENCLUE's approach)
- to create a proximity graph that connects objects whose distance is below a certain threshold ε; clustering algorithms then identify contiguous, connected subsets of the graph that are dense (DBSCAN's approach).

15 DBSCAN (http://www2.cs.uh.edu/~ceick/7363/Papers/dbscan.pdf)
- DBSCAN is a density-based algorithm.
  - Density = number of points within a specified radius (Eps)
  - Input parameters: MinPts and Eps
  - A point is a core point if it has at least a specified number of points (MinPts) within Eps
    - These are points in the interior of a cluster
  - A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point
  - A noise point is any point that is neither a core point nor a border point.
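The three point types can be computed directly from these definitions. The sketch below is an illustration rather than code from the slides: the name `classify_points` is hypothetical, Euclidean distance is assumed, and the core test assumes the convention |N_Eps(p)| >= MinPts with the point counted in its own neighborhood, as in the formal definition on slide 24.

```python
import math


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    # Eps-neighborhood of each point; a point is its own neighbor (distance 0).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Assumed convention: core if the neighborhood holds at least min_pts points.
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # not dense itself, but near a core point
        else:
            labels.append("noise")
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (0, 1), (1, 1), (2.5, 1), (10, 10)]
    print(classify_points(pts, eps=1.6, min_pts=4))
```

In this toy data the four corner points of the unit square are core points, (2.5, 1) is a border point because it lies within Eps of the core point (1, 1), and (10, 10) is noise.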

16 DBSCAN: Core, Border, and Noise Points

17 DBSCAN Algorithm (simplified view for teaching)
1. Create a graph whose nodes are the points to be clustered
2. For each core point c, create an edge from c to every point p in the ε-neighborhood of c
3. Set N to the nodes of the graph
4. If N does not contain any core points, terminate
5. Pick a core point c in N
6. Let X be the set of nodes that can be reached from c by going forward
   1. Create a cluster containing X ∪ {c}
   2. N = N \ (X ∪ {c})
7. Continue with step 4
Remarks: points that are not assigned to any cluster are outliers; a more efficient implementation performs steps 2 and 6 in parallel.
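The graph view above translates almost line by line into code. The following is a minimal sketch under stated assumptions, not the slides' implementation: the name `dbscan_graph` is hypothetical, distance is Euclidean, a point counts toward its own neighborhood in the core test, and a border point reachable from two clusters stays with whichever cluster is built first.

```python
import math


def dbscan_graph(points, eps, min_pts):
    """Simplified graph-based DBSCAN; returns (clusters, outliers) as index lists."""
    n = len(points)
    nbrs = [
        [j for j in range(n) if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Assumed convention: core if the point plus its neighbors number >= min_pts.
    is_core = [len(nb) + 1 >= min_pts for nb in nbrs]
    # Steps 1-2: directed edges leave core points only.
    edges = [nbrs[i] if is_core[i] else [] for i in range(n)]
    remaining = set(range(n))  # step 3: N
    clusters = []
    while True:
        cores = [i for i in remaining if is_core[i]]
        if not cores:  # step 4: no core point left in N
            break
        c = min(cores)  # step 5: any core point works; min keeps runs repeatable
        # Step 6: collect everything reachable from c by following edges forward.
        reached, stack = {c}, [c]
        while stack:
            u = stack.pop()
            for v in edges[u]:
                if v in remaining and v not in reached:
                    reached.add(v)
                    stack.append(v)
        clusters.append(sorted(reached))  # step 6.1
        remaining -= reached              # step 6.2
    return clusters, sorted(remaining)    # leftovers are the outliers


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (0, 1), (1, 1), (2.5, 1), (10, 10),
           (20, 20), (21, 20), (20, 21), (21, 21)]
    print(dbscan_graph(pts, eps=1.6, min_pts=4))
```

Because edges start only at core points, a border point can be reached but never extends a cluster, which is why two dense regions joined by a single border point are not merged.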

18 DBSCAN: Core, Border and Noise Points
[Figure panels: Original Points; Point types: core, border and noise (Eps = 10, MinPts = 4)]

19 When DBSCAN Works Well
- Resistant to noise
- Supports outliers
- Can handle clusters of different shapes and sizes
[Figure panels: Original Points; Clusters]

20 When DBSCAN Does NOT Work Well
Problems with:
- Varying densities
- High-dimensional data
[Figure panels: Original Points; (MinPts=4, Eps=9.75); (MinPts=4, Eps=9.12)]

21 Assignment 3 Dataset: Earthquake

22 Assignment 3 Dataset: Complex9
[Figure panels: K-Means in Weka; DBSCAN in Weka]

23 DBSCAN: Determining Eps and MinPts
- The idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance
- Noise points have their k-th nearest neighbor at a farther distance
- So, plot the sorted distance of every point to its k-th nearest neighbor
[Figure: sorted k-distance plot separating core points from non-core points; run DBSCAN with MinPts = 4 and ε = 5]
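The values behind such a plot are easy to compute. The sketch below is a brute-force illustration with a hypothetical function name, `sorted_k_distances`, and Euclidean distance assumed; it returns the sorted k-th nearest-neighbor distances, which can then be handed to any plotting tool.

```python
import math


def sorted_k_distances(points, k):
    """Distance from every point to its k-th nearest neighbor, sorted ascending."""
    k_dist = []
    for i, p in enumerate(points):
        # Distances to all other points; the point itself is excluded.
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dist.append(d[k - 1])
    return sorted(k_dist)


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
    print(sorted_k_distances(pts, k=1))
```

A sharp rise near the end of the sorted list marks the points whose k-th neighbor is far away. A common heuristic, stated here as an assumption rather than a rule from the slides, is to read Eps off the distance just before that rise and to set MinPts to the k that was used.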

24 DBSCAN: A Second Introduction
- Two parameters:
  - Eps: maximum radius of the neighbourhood
  - MinPts: minimum number of points in the Eps-neighbourhood of a point
- N_Eps(p) = {q belongs to D | dist(p, q) <= Eps}
- Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if
  1) p belongs to N_Eps(q)
  2) core point condition: |N_Eps(q)| >= MinPts
[Figure: points p and q; MinPts = 5, Eps = 1 cm]

25 Density-Based Clustering: Background (II)
- Density-reachable:
  - A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i
- Density-connected:
  - A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
[Figure: points p, q, p_1, and o]
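The two relations can be checked mechanically, which also makes their asymmetry visible. The following sketch uses hypothetical helper names and Euclidean distance, and assumes a chain needs at least one step, so a non-core point is not treated as density-reachable from itself.

```python
import math
from collections import deque


def eps_neighborhood(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def is_core(points, i, eps, min_pts):
    """Core point condition: |N_Eps(i)| >= MinPts."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts


def density_reachable_from(points, q, eps, min_pts):
    """Set of indices that are density-reachable from point q."""
    reached, expanded, queue = set(), {q}, deque([q])
    while queue:
        u = queue.popleft()
        if not is_core(points, u, eps, min_pts):
            continue  # only core points can extend a chain
        for v in eps_neighborhood(points, u, eps):
            reached.add(v)  # v is directly density-reachable from u
            if v not in expanded:
                expanded.add(v)
                queue.append(v)
    return reached


def density_connected(points, p, q, eps, min_pts):
    """True if some point o has both p and q density-reachable from it."""
    for o in range(len(points)):
        r = density_reachable_from(points, o, eps, min_pts)
        if p in r and q in r:
            return True
    return False


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (0, 1), (1, 1), (2.5, 1), (10, 10)]
    print(density_reachable_from(pts, 0, 1.6, 4))
    print(density_reachable_from(pts, 4, 1.6, 4))
```

In the sample data the border point with index 4 is density-reachable from core point 0, but nothing is density-reachable from point 4. Density-reachability is therefore not symmetric, which is why the symmetric density-connected relation is introduced to define clusters.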

26 DBSCAN: Density Based Spatial Clustering of Applications with Noise
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Capable of discovering clusters of arbitrary shape in spatial datasets with noise
[Figure labels: Core; Border (density-reachable from a core point); Outlier (not density-reachable from a core point); Eps = 1 cm, MinPts = 5]

27 DBSCAN: The Algorithm
1. Arbitrarily select a point p
2. Retrieve all points density-reachable from p wrt. Eps and MinPts.
3. If p is a core point, a cluster is formed.
4. If p is not a core point, no points are density-reachable from p and DBSCAN visits the next point of the database.
5. Continue the process until all of the points have been processed.
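These five steps correspond to the usual label-based formulation of DBSCAN. What follows is a brute-force sketch, not the authors' implementation: the names `dbscan`, `region`, and `NOISE` are illustrative, distance is Euclidean, points are visited in index order rather than arbitrarily, and each neighborhood query scans all points, which gives the O(n**2) behavior.

```python
import math

NOISE = -1  # label for points that end up in no cluster


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    n = len(points)

    def region(i):
        # Eps-neighborhood of point i, including i itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n  # None means "not yet processed"
    cluster_id = 0
    for p in range(n):  # step 1 (index order stands in for arbitrary choice)
        if labels[p] is not None:
            continue
        seeds = region(p)
        if len(seeds) < min_pts:
            # Step 4: p is not a core point; tentatively noise, may turn border.
            labels[p] = NOISE
            continue
        # Steps 2-3: p is a core point, so grow a cluster from it.
        labels[p] = cluster_id
        queue = [s for s in seeds if s != p]
        while queue:
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster_id  # earlier "noise" is really a border point
            if labels[q] is not None:
                continue  # already assigned to a cluster
            labels[q] = cluster_id
            q_region = region(q)
            if len(q_region) >= min_pts:
                queue.extend(q_region)  # q is core: its neighbors are reachable
        cluster_id += 1
    return labels  # step 5: every point has been processed


if __name__ == "__main__":
    pts = [(2.5, 1), (0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
    print(dbscan(pts, eps=1.6, min_pts=4))
```

The sample data deliberately lists a border point first: it is marked as noise when visited, then relabeled once the cluster grown from a later core point reaches it.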

28 Density-based Clustering: Pros and Cons
+: can (potentially) discover clusters of arbitrary shape
+: not sensitive to outliers and supports outlier detection
+: can handle noise
+/-: medium algorithm complexities: O(n**2), O(n*log(n))
-: finding good density-estimation parameters is frequently difficult; more difficult to use than K-means.
-: usually does not do well in clustering high-dimensional datasets.
-: cluster models are not well understood (yet)

29 DENCLUE: using density functions
DENsity-based CLUstEring by Hinneburg & Keim (KDD '98)
- Major features
  - Solid mathematical foundation
  - Good for data sets with large amounts of noise
  - Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
  - Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
  - But needs a large number of parameters

