Clustering Methods Professor: Dr. Mansouri

Presentation transcript:

Clustering Methods Professor: Dr. Mansouri Presented by: Muhammad Abouei & Mohsen Ghahremani Manesh

Clustering Methods Density-Based Clustering Methods DBSCAN (Density Based Spatial Clustering of Applications with Noise) OPTICS (Ordering Points To Identify the Clustering Structure) DENCLUE (DENsity-based CLUstEring) Grid-based Clustering

Density Based Clustering

DBSCAN Concepts ε-neighborhood: the points within distance ε (the radius) of a point. MinPts: the minimum number of points required in the ε-neighborhood of a point. Both ε and MinPts are user-defined parameters. (Figure: ε-neighborhood of p and ε-neighborhood of q, MinPts = 5.)

DBSCAN Concepts Density : number of points within a specified radius (ε) Density(p)=5

DBSCAN Concepts Core point: a point is a core point if its ε-neighborhood contains at least MinPts points. Core points lie in the interior of a cluster. (Figure, MinPts = 5: p is a core point; q is not a core point.)

DBSCAN Concepts Directly density-reachable: a point p is directly density-reachable (DDR) from a point q w.r.t. ε and MinPts if p belongs to the ε-neighborhood of q and q is a core point. (Figure, MinPts = 4: p is DDR from q, but q is not DDR from p.) DDR is an asymmetric relation.

DBSCAN Concepts Density-reachable: a point p is density-reachable (DR) from a point q w.r.t. ε and MinPts if there is a chain of points P1, …, Pn with P1 = q and Pn = p such that Pi+1 is directly density-reachable from Pi. Equivalently, p is density-reachable from q if there is a path from q to p in which every point, except possibly p itself, is a core point. (Figure, MinPts = 4: p is DR from q, but q is not DR from p because p is not core.) DR is an asymmetric relation.
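The chain definition above can be checked with a breadth-first search that only expands core points. The following is a minimal sketch, not the presenters' code: it assumes Euclidean distance, counts a point as a member of its own ε-neighborhood, and treats a point as core when that neighborhood holds at least MinPts points. The function names and the sample data are hypothetical.

```python
import math
from collections import deque

def neighbors(points, i, eps):
    """Indices of all points within eps of points[i] (i itself included)."""
    return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

def is_core(points, i, eps, min_pts):
    """Assumption: core means at least min_pts points in the eps-neighborhood."""
    return len(neighbors(points, i, eps)) >= min_pts

def density_reachable(points, q, p, eps, min_pts):
    """True if point index p is density-reachable from point index q."""
    if not is_core(points, q, eps, min_pts):
        return False  # a chain must start at a core point
    seen = {q}
    queue = deque([q])  # the queue only ever holds core points
    while queue:
        cur = queue.popleft()
        for nxt in neighbors(points, cur, eps):
            if nxt == p:
                return True  # p is directly density-reachable from a core point
            if nxt not in seen:
                seen.add(nxt)
                if is_core(points, nxt, eps, min_pts):
                    queue.append(nxt)
    return False

# Four points on a line: the two middle points are core, the end points are not.
line = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
print(density_reachable(line, 1, 3, eps=1.0, min_pts=3))  # core -> border
print(density_reachable(line, 3, 1, eps=1.0, min_pts=3))  # border -> core
```

In this example the end point is reachable from a middle point but not the other way round, which illustrates why the relation is asymmetric.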

DBSCAN Concepts Density-connectivity: a point p is density-connected (DC) to a point q w.r.t. ε and MinPts if there is a point r such that both p and q are density-reachable from r w.r.t. ε and MinPts. (Figure, MinPts = 4: p and q are density-connected.) DC is a symmetric relation.

DBSCAN Concepts Border point: a border point has fewer than MinPts points within ε but lies in the ε-neighborhood of a core point. (Figure: MinPts = 5, ε = circle radius.)

DBSCAN Concepts Noise (outlier) point: any point that is neither a core point nor a border point. (Figure: MinPts = 5, ε = circle radius.)
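The three point types can be made concrete with a short classification routine. This is an illustrative sketch under stated assumptions: Euclidean distance, a point counts toward its own ε-neighborhood, and "core" means at least MinPts points in that neighborhood. The name `classify_points` and the sample data are hypothetical.

```python
import math

def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    n = len(points)
    # Brute-force eps-neighborhoods; each point is its own neighbor.
    nbrs = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(nbrs[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs[i]):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels

sample = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 10)]
print(classify_points(sample, eps=1.0, min_pts=3))
```

With these values the two inner points of the line are core, its two end points are border points, and the isolated point is noise.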

DBSCAN Concepts DBSCAN relies on a density-based notion of cluster. Cluster : a cluster C is a non-empty set of density-connected points that is maximal w.r.t. density-reachability. Maximality: For all p, q; if q ∈ C and if p is density-reachable from q w.r.t. ε and MinPts, then also p ∈ C. MinPts = 3 ε = circle radius

DBSCAN Algorithm Arbitrarily select a point p. Retrieve all points density-reachable from p w.r.t. ε and MinPts. If p is a core point, a cluster is formed. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database. Continue the process until all of the points have been processed.
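The procedure described on this slide can be sketched in a few lines. This is a brute-force illustration rather than a reference implementation: it assumes Euclidean distance, counts a point in its own ε-neighborhood, uses "at least MinPts" as the core test, and visits points in list order instead of picking them arbitrarily. The names `dbscan`, `region_query` and `NOISE` are hypothetical.

```python
import math

NOISE = -1  # label for points that end up in no cluster

def dbscan(points, eps, min_pts):
    """Return one cluster id per point (0, 1, ...) or NOISE."""
    n = len(points)
    labels = [None] * n  # None means "not yet processed"

    def region_query(i):
        # Brute-force eps-neighborhood, O(n) per query.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = region_query(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        labels[i] = cluster_id  # i is a core point: start a new cluster
        seeds = list(nbrs)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_nbrs = region_query(j)
            if len(j_nbrs) >= min_pts:
                seeds.extend(j_nbrs)  # j is core, so keep expanding through it
        cluster_id += 1
    return labels

squares = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11),
           (5, 5)]
print(dbscan(squares, eps=1.5, min_pts=3))
```

The sample data holds two well-separated squares and one isolated point, so the sketch reports two clusters and one noise point. Replacing `region_query` with a spatial-index lookup is what brings the overall cost down from quadratic.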

DBSCAN MinPts = 4

DBSCAN DBSCAN is Sensitive to Parameters. MinPts = 4

DBSCAN Core, Border and Noise Points: MinPts = 4, ε = 10. (Figure panels: Original Points; Point types: core, border and noise.)

DBSCAN When DBSCAN works well: Resistant to Noise Can handle clusters of different shapes and sizes Original Points Clusters

DBSCAN When DBSCAN does not work well: Varying densities High-dimensional data

DBSCAN Complexity If a spatial index (e.g., a kd-tree or R*-tree) is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects. Otherwise, it is O(n²).

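OPTICS Core distance: the core distance of a point p is the smallest ε′ that makes p a core object. If p is not a core object w.r.t. ε and MinPts, its core distance is undefined. (Figure, MinPts = 5, ε = 3 cm: the core distance ε′ of p is the distance between p and its 4th nearest neighbor.)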

OPTICS Reachability distance: the reachability distance of r w.r.t. p is the greater of the core distance of p and the Euclidean distance between p and r. If p is not a core object, the reachability distance between p and r is undefined. (Figure, MinPts = 5, ε = 3 cm: reachability-distance ε, MinPts(p, r) = ε′ and reachability-distance ε, MinPts(p, r′) = d(p, r′).)
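The two OPTICS distances translate directly into code. The sketch below is illustrative only: it assumes Euclidean distance, that p is itself an element of the data set, and that MinPts counts p (so with MinPts = 5 the core distance is the distance to the 4th nearest neighbor, as in the figure). Undefined values are represented by `None`; the function names and sample data are hypothetical.

```python
import math

def core_distance(points, p, eps, min_pts):
    """Distance from p to its min_pts-th closest point, counting p itself.

    Returns None (undefined) when p is not a core object within eps.
    Assumes p is an element of points.
    """
    dists = sorted(math.dist(p, q) for q in points)
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]

def reachability_distance(points, p, r, eps, min_pts):
    """Reachability distance of r w.r.t. p, or None if p is not core."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(p, r))

p = (0.0, 0.0)
data = [p, (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -2.0), (0.0, 2.5), (8.0, 8.0)]
print(core_distance(data, p, eps=3.0, min_pts=5))                      # distance to 4th neighbor
print(reachability_distance(data, p, (1.0, 0.0), eps=3.0, min_pts=5))  # inside the core distance
print(reachability_distance(data, p, (0.0, 2.5), eps=3.0, min_pts=5))  # outside the core distance
```

A point closer to p than the core distance gets the core distance as its reachability distance, while a point farther away gets its actual distance, matching the two cases r and r′ on the slide.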

OPTICS Color image segmentation using density-based clustering

DENCLUE Major features DENCLUE (DENsity-based CLUstEring). Solid mathematical foundation. Good for data sets with large amounts of noise. Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets. Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45). But needs a large number of parameters.

DENCLUE Technical Essence Uses grid cells, but only keeps information about the grid cells that actually contain data points, and manages these cells in a tree-based access structure.

DENCLUE Technical Essence DENCLUE is based on the following concepts: Influence function Density function Density attractors.

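DENCLUE Influence function: the influence function f^y(x) of a point y ∈ D (the data space) at a point x is a positive function that decays to zero as x “moves away” from y, i.e. as d(x, y) → ∞. Typical examples are the square wave function, f^y_Square(x) = 1 if d(x, y) ≤ σ and 0 otherwise, and the Gaussian function, f^y_Gauss(x) = exp(−d(x, y)² / (2σ²)), where σ is a user-defined parameter.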

DENCLUE Density function: the density function at x based on a data space of N points, D = {x1, …, xN}, is defined as the sum of the influence functions of all data points at x: f^D(x) = Σ_{i=1..N} f^{xi}(x). The goal of the definition: identify all “significant” local maxima xj*, j = 1, …, m, of f^D(x); then create a cluster Cj for each xj* and assign to Cj all points of D that lie within the “region of attraction” of xj*.

DENCLUE Example: Density Computation D = {x1, x2, x3, x4}. f^D_Gauss(x) = influence(x1) + influence(x2) + influence(x3) + influence(x4) = 0.04 + 0.06 + 0.08 + 0.6 = 0.78. Remark: the density value of y would be larger than the one for x.
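The density computation in this example is a plain sum of per-point influences. The sketch below assumes the Gaussian influence function exp(−d(x, y)² / (2σ²)) with Euclidean distance, which is the usual Gaussian form for DENCLUE; the data set, the value of σ and the function names are hypothetical and do not reproduce the figure's numbers.

```python
import math

def gaussian_influence(x, y, sigma):
    """Influence of data point y at location x (assumed Gaussian form)."""
    d = math.dist(x, y)
    return math.exp(-d * d / (2.0 * sigma * sigma))

def density(x, data, sigma):
    """Density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
sigma = 1.0
print(density((0.0, 0.0), data, sigma))  # location surrounded by three points
print(density((5.0, 5.0), data, sigma))  # location near a single point
```

The location inside the group of three points receives a noticeably larger density than the location next to the lone point, which is the effect the remark about x and y describes.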

DENCLUE Density attractors: density attractors are local maxima of the overall density function f^D(x). Clusters can then be determined mathematically by identifying density attractors. A hill-climbing algorithm guided by the gradient can be used to determine the density attractor of a set of data points.

DENCLUE Density-attracted : A point x is density-attracted to a density attractor x*, if there exists a set of points x0, x1, …, xk such that x0 = x , xk = x* and the gradient of xi-1 is in the direction of xi for 0<i<k.
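The chain x0, x1, …, xk can be generated by gradient-guided hill climbing. The following is a rough sketch, assuming a Gaussian influence function, Euclidean distance and a fixed step length `delta`; the climb stops as soon as a step no longer increases the density. All names, the step size and the sample data are hypothetical choices, not values from the slides.

```python
import math

def density(x, data, sigma):
    """Sum of Gaussian influences of all data points at x."""
    return sum(math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2)) for y in data)

def gradient(x, data, sigma):
    """Gradient of the Gaussian density at x."""
    g = [0.0] * len(x)
    for y in data:
        w = math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2)) / sigma ** 2
        for k in range(len(x)):
            g[k] += (y[k] - x[k]) * w
    return g

def hill_climb(x, data, sigma, delta=0.05, max_iter=1000):
    """Follow the normalized gradient in steps of length delta."""
    x = list(x)
    for _ in range(max_iter):
        g = gradient(x, data, sigma)
        norm = math.sqrt(sum(v * v for v in g))
        if norm == 0.0:
            break  # flat spot: nothing to follow
        nxt = [x[k] + delta * g[k] / norm for k in range(len(x))]
        if density(nxt, data, sigma) <= density(x, data, sigma):
            break  # the step no longer climbs: x approximates the attractor
        x = nxt
    return tuple(x)

def blob(cx, cy):
    """Five points arranged in a small cross around (cx, cy)."""
    return [(cx, cy), (cx + 0.2, cy), (cx - 0.2, cy), (cx, cy + 0.2), (cx, cy - 0.2)]

data = blob(0.0, 0.0) + blob(5.0, 5.0)
a = hill_climb((1.0, 1.0), data, sigma=0.5)
b = hill_climb((4.0, 4.0), data, sigma=0.5)
print(a, b)
```

Points that climb to the same attractor would be assigned to the same cluster. Because the step length is fixed, the result is only accurate to roughly `delta`.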

DENCLUE Center-Defined Cluster: a center-defined cluster (w.r.t. σ, ε) for a density attractor x* is a subset C ⊆ D, with every x ∈ C being density-attracted by x* and f^D(x*) ≥ ε. Outlier: a point x ∈ D is called an outlier if it is density-attracted by a local maximum xo* with f^D(xo*) < ε.

DENCLUE Multicenter defined clusters : Multicenter defined clusters are a set of center-defined clusters linked by a path of significance.

DENCLUE An arbitrary-shape cluster: an arbitrary-shape cluster (w.r.t. σ, ε) for a set of density attractors X is a subset C ⊆ D, where ∀x ∈ C ∃x* ∈ X: f^D(x*) ≥ ε and x is density-attracted to x*, and ∀x1*, x2* ∈ X: ∃ a path P from x1* to x2* with ∀p ∈ P: f^D(p) ≥ ε.

DENCLUE Note that the number of clusters found by DENCLUE varies depending on σ and ε.

DENCLUE DENCLUE is able to detect arbitrarily shaped clusters. The algorithm deals with noise very satisfactorily. The worst-case time complexity of DENCLUE is O(N log₂ N). Experimental results indicate that the average time complexity is O(log₂ N). It works efficiently with high-dimensional data. DENCLUE needs at least 3 parameters to be determined, i.e. σ, ε, εc.

Grid-based Using multi-resolution grid data structure Clustering complexity depends on the number of populated grid cells and not on the number of objects in the dataset Several interesting methods: CS Tree (Clustering Statistical Tree) STING WaveCluster

Grid-based Basic Grid-based Algorithm Define a set of grid cells. Assign objects to the appropriate grid cell and compute the density of each cell. Eliminate cells whose density is below a certain threshold τ. Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function).
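The four steps of the basic algorithm can be sketched for two-dimensional points. This is a simplified illustration under several assumptions: square cells of a fixed size, cell density measured as the number of points in the cell, adjacency defined by the 8 surrounding cells, and no objective function being minimized. The name `grid_cluster` and the sample data are hypothetical.

```python
from collections import defaultdict, deque

def grid_cluster(points, cell_size, tau):
    """Return one cluster id per point, or -1 for points in sparse cells."""
    # Steps 1 and 2: assign every point to its grid cell.
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // cell_size), int(y // cell_size))].append(idx)

    # Step 3: keep only cells whose point count reaches the threshold tau.
    dense = {c for c, members in cells.items() if len(members) >= tau}

    # Step 4: connected components of dense cells (8-neighborhood).
    labels = [-1] * len(points)
    seen = set()
    cluster_id = 0
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for idx in cells[(cx, cy)]:
                labels[idx] = cluster_id
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        cluster_id += 1
    return labels

pts = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.3, 0.4),
       (5.1, 5.1), (5.2, 5.3), (9.5, 0.5)]
print(grid_cluster(pts, cell_size=1.0, tau=2))
```

After the initial assignment the work is done on cells rather than on individual points, so the clustering step scales with the number of populated cells.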

Grid-based Fast: No distance computations, Clustering is performed on summaries and not individual objects; complexity is usually O(no_of_populated_grid_cells) and not O(no_of_objects), Easy to determine which clusters are neighboring.

