An Efficient Approach to Clustering in Large Multimedia Databases with Noise Alexander Hinneburg and Daniel A. Keim.


1 An Efficient Approach to Clustering in Large Multimedia Databases with Noise Alexander Hinneburg and Daniel A. Keim

2 Outline Multimedia data Density-based clustering Influence and density functions Center-defined vs. Arbitrary-shape Comparison with other algorithms Algorithm What can we learn / have we learned?

3 Multimedia Data Examples Images CAD Geographic Molecular biology High-dimensional feature vectors Color histograms Shape descriptors Fourier vectors

4 Density-Based Clustering (loose definition) Clusters are defined by a high density of points, i.e., many points with the same combination of attribute values. Is density irrelevant for other methods? No! Most methods look for dense areas; DENCLUE uses density directly.

5 Density-Based Clustering (stricter definition) Closeness to a dense area is the only criterion for cluster membership. DENCLUE has two variants: arbitrary-shaped clusters (similar to other density-based methods) and center-defined clusters (similar to distance-based methods).

6 Idea Each data point has an influence that extends over a range → influence function. Summing the influence functions of all data points → density function.

7 Influence Functions Examples used in the paper include the square-wave function and the Gaussian function.
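As a concrete illustration, here is a minimal Python sketch of a Gaussian influence function and the density function obtained by summing the influences of all data points. Function names are illustrative, not from the paper:

```python
import math

def gaussian_influence(x, y, sigma):
    """Influence of data point y at location x (Gaussian kernel)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Density function: sum of the influences of all data points at x."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

# Two nearby points and one distant point: density is high near the pair.
data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(density((0.05, 0.0), data, sigma=1.0))
```

With the Gaussian kernel, a point's own influence at its location is 1 and decays smoothly with distance, so dense regions accumulate high density values.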

8 Definitions Density attractor x*: a local maximum of the density function. Density-attracted points: points from which a path to x* exists along which the gradient is continuously positive (for a continuous, differentiable influence function).
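A minimal sketch (assumed names, not the paper's code) of the gradient hill-climbing that locates the density attractor for a starting point, using the gradient of the Gaussian density and a fixed, normalized step:

```python
import math

def grad_density(x, data, sigma):
    """Gradient of the Gaussian density function at x."""
    g = [0.0] * len(x)
    for y in data:
        d2 = sum((a - b) ** 2 for a, b in zip(x, y))
        w = math.exp(-d2 / (2 * sigma ** 2))
        for i in range(len(x)):
            # Each point pulls x toward itself, weighted by its influence.
            g[i] += w * (y[i] - x[i]) / sigma ** 2
    return g

def hill_climb(x, data, sigma, delta, steps=200):
    """Follow the normalized gradient until it (almost) vanishes;
    the end point approximates a density attractor x*."""
    for _ in range(steps):
        g = grad_density(x, data, sigma)
        norm = math.sqrt(sum(c * c for c in g))
        if norm < 1e-9:
            break
        x = tuple(xi + delta * gi / norm for xi, gi in zip(x, g))
    return x
```

Starting from any density-attracted point, this walk ends near the corresponding local maximum of the density function; a fixed step size oscillates around the maximum, so the result is accurate only to about one step width.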

9 Center-Defined Clusters All points that are density attracted to a given density attractor x*. The density function at the maximum must exceed ξ. Points attracted to smaller maxima are considered outliers.

10 Arbitrary-Shape Clusters Merges center-defined clusters if a path between their attractors exists along which the density function continuously exceeds ξ.

11 Examples

12 Noise Invariance The density distribution of the noise is constant → no influence on the number and location of the attractors. Claim: the number of density attractors with and without noise is the same, and the probability that they are identical goes to 1 as the amount of noise grows.

13 Parameter Choices Choice of σ: try different values of σ and determine the largest interval over which the number of clusters stays constant. Choice of ξ: greater than the noise level, smaller than the smallest relevant maximum.
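The σ-selection heuristic can be sketched in one dimension: count the local maxima of the density function on a grid for several values of σ and look for a plateau in the count. This is an illustrative sketch, not the paper's implementation:

```python
import math

def density(x, data, sigma):
    """1-D Gaussian density function."""
    return sum(math.exp(-(x - y) ** 2 / (2 * sigma ** 2)) for y in data)

def count_maxima(data, sigma, grid):
    """Count strict local maxima of the density function on a 1-D grid."""
    f = [density(x, data, sigma) for x in grid]
    return sum(1 for i in range(1, len(f) - 1)
               if f[i] > f[i - 1] and f[i] > f[i + 1])

# Two well-separated groups of points; small sigma resolves every point,
# large sigma merges everything into one attractor.
data = [0.0, 0.2, 0.4, 5.0, 5.3]
grid = [i * 0.01 for i in range(-200, 800)]
for sigma in (0.05, 0.3, 1.0, 3.0):
    print(sigma, count_maxima(data, sigma, grid))
```

An interval of σ values over which the count stays constant (here, the plateau at two clusters) indicates a stable, meaningful clustering.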

14 Comparison with DBSCAN Corresponding setup: a square-wave influence function whose radius σ models the neighborhood radius ε in DBSCAN; the definition of core objects in DBSCAN involves MinPts, which corresponds to ξ. Density-reachable in DBSCAN becomes density-attracted in DENCLUE (!?)

15 Comparison with k-means Corresponding setup: Gaussian influence function, hill-climbing step size δ = σ/2. Claim: σ can be chosen such that k clusters are found, and the DENCLUE result corresponds to a global optimum of k-means.

16 Comparison with Hierarchical Methods Start with a very small σ to get the largest number of clusters. Increasing σ merges clusters, until finally only one density attractor remains.

17 Algorithm Step 1: Construct a map of the data points using hypercubes with edge length 2σ; only populated cubes are saved. Step 2: Determine the density attractors for all points using hill-climbing; keep track of the paths taken and of points close to them.

18 Local Density Function The influence functions of "near" points contribute fully; far-away points are ignored. For the Gaussian influence function the cut-off is chosen as 4σ.
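A minimal sketch of such a local density function with the 4σ cut-off (names are illustrative, not from the paper):

```python
import math

def local_density(x, data, sigma):
    """Local density: only points within 4*sigma of x contribute;
    farther points are ignored, since their Gaussian influence is
    at most exp(-8) and thus negligible."""
    cutoff2 = (4 * sigma) ** 2
    total = 0.0
    for y in data:
        d2 = sum((a - b) ** 2 for a, b in zip(x, y))
        if d2 <= cutoff2:
            total += math.exp(-d2 / (2 * sigma ** 2))
    return total
```

Combined with the cube map, this means each density evaluation touches only the points in nearby cubes rather than the whole data set.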

19 Step 1: Constructing the Map Hypercubes contain: the number of data points, pointers to the data points, and the sum of the data values (for the mean). Populated hypercubes are saved in a B+-tree. Neighboring populated cubes are connected for fast access; connections are limited to highly populated cubes (the threshold is derived from the outlier criterion).
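A minimal sketch of the map construction, using a dictionary keyed by integer cube coordinates as a stand-in for the paper's B+-tree (names and the dict-based storage are assumptions for illustration):

```python
from collections import defaultdict

def cube_key(point, sigma):
    """Integer coordinates of the hypercube (edge length 2*sigma)
    that contains the point."""
    return tuple(int(c // (2 * sigma)) for c in point)

def build_map(data, sigma):
    """Map each populated cube to [count, sum vector, member points];
    empty cubes are simply never materialized."""
    cubes = defaultdict(lambda: [0, None, []])
    for p in data:
        entry = cubes[cube_key(p, sigma)]
        entry[0] += 1
        entry[1] = p if entry[1] is None else tuple(
            a + b for a, b in zip(entry[1], p))
        entry[2].append(p)
    return dict(cubes)
```

Storing the per-cube count and coordinate sum lets the algorithm compute cube means cheaply and decide which cubes are "highly populated" without touching individual points.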

20 Step 2: Clustering Step Uses only highly populated cubes and cubes connected to them. Hill-climbing is based on the local density function and its gradient. Points within σ/2 of a hill-climbing path are attached to the corresponding cluster as well.

21 Time Complexity / Efficiency Worst case, for N data points: O(N log N). Average case (without building the data structure?): O(log N). Explanation: only highly populated areas are considered. Up to 45 times faster than DBSCAN.

22 Application to Molecular Biology Simulation of a small but flexible peptide; each conformation is a point in a 19-dimensional angle space. The pharmaceutical industry is interested in stable conformations. Non-stable conformations make up >50 percent of the data => noise.

23 What can we learn? The algorithm is fast for two reasons. First, an efficient data structure: data points that are close in attribute space are stored together (similar to P-trees: fast access to data based on attribute values). Second, the optimization problem is inherently linear in the search space, whereas the k-medoids problem is quadratic!

24 Why is k-medoids quadratic in the search space? Review: the cost function is the sum of squared distances within each cluster, i.e., the cost associated with each cluster center depends on all other cluster centers! This can be viewed as an influence function that depends on the cluster boundaries.
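The dependence is easy to see in code. In this illustrative sketch (not from the paper), every point's cost contribution requires a minimum over the entire medoid set, so moving one medoid can change which points any other medoid is charged for:

```python
def kmedoids_cost(data, medoids):
    """k-medoids objective: sum of squared distances, each point
    charged to its nearest medoid. Note that each point's assignment
    depends on the full medoid set, coupling all centers together."""
    total = 0.0
    for p in data:
        total += min(sum((a - b) ** 2 for a, b in zip(p, m))
                     for m in medoids)
    return total
```

In DENCLUE, by contrast, each point's influence contribution is fixed once and for all, independent of where the attractors end up.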

25 Cost functions K-medoids DENCLUE
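The slide's formulas did not survive the transcript. Plausible reconstructions, following the standard k-medoids objective and the Gaussian density function from the DENCLUE paper (D is the data set, M the medoid set):

```latex
% k-medoids: minimize the summed squared distance to the nearest medoid
\mathrm{cost}(M) = \sum_{x \in D} \min_{m \in M} d(x, m)^2

% DENCLUE: locally maximize the Gaussian density function
f^{\mathrm{Gauss}}_D(x) = \sum_{x_i \in D} e^{-d(x, x_i)^2 / (2\sigma^2)}
```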

26 Motivating a Gaussian Influence Function Why not use a parabola as the influence function? It has only one minimum (the mean of the data set), so a cut-off is needed. In k-medoids the cut-off depends on the cluster centers. A cut-off independent of the cluster centers? → the Gaussian function!

27 Is DENCLUE only an Approximation to k-medoids? Not necessarily. Minimizing squared distance is a fundamental measure, but not the only one. Why should "influence" depend on the density of points? "Influence" may instead be determined by the system.

28 If DENCLUE is so good, can we still improve it? It needs a special data structure that maps out all of the space (a density-based idea). A distance-based version could look for cluster centers only, which would allow using a promising starting point and defining partitions by proximity.

29 Conclusion The DENCLUE paper contains many fundamentally valuable ideas: an efficient data structure, and an algorithm related to, but much more efficient than, k-medoids.

