An Efficient Approach to Clustering in Large Multimedia Databases with Noise Alexander Hinneburg and Daniel A. Keim.

Outline
- Multimedia data
- Density-based clustering
- Influence and density functions
- Center-defined vs. arbitrary-shape clusters
- Comparison with other algorithms
- Algorithm
- What can we learn / have we learned?

Multimedia Data
- Examples: images, CAD, geographic data, molecular biology
- High-dimensional feature vectors: color histograms, shape descriptors, Fourier vectors

Density-Based Clustering (loose definition)
- Clusters are defined by a high density of points, i.e., many points with the same combination of attribute values
- Is density irrelevant for other methods? No! Most methods look for dense areas
- DENCLUE uses density directly

Density-Based Clustering (stricter definition)
- Closeness to a dense area is the only criterion for cluster membership
- DENCLUE has two variants:
  - Arbitrary-shape clusters: similar to other density-based methods
  - Center-defined clusters: similar to distance-based methods

Idea
- Each data point has an influence that extends over a range → influence function
- Add up all influence functions → density function
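As a concrete sketch (not the authors' code), a Gaussian influence function and the density function it induces can be written as:

```python
import math

def gaussian_influence(x, y, sigma):
    """Gaussian influence of data point y at location x."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Density function: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)
```

The parameter sigma controls how far each point's influence extends; it is the σ that reappears in the parameter-choice and comparison slides below.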

Influence Functions

Definitions
- Density attractor x*: a local maximum of the density function
- Density-attracted points: points from which a path to x* exists along which the gradient is continuously positive (for a continuous and differentiable influence function)
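A minimal hill-climbing sketch of finding a density attractor, assuming a Gaussian influence function (helpers are repeated so the block is self-contained; the step-size and stopping rule here are simplifications, not the paper's exact procedure):

```python
import math

def density(x, data, sigma):
    """Gaussian density: sum of exp(-d(x, y)^2 / (2 sigma^2)) over the data."""
    return sum(math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))
               for y in data)

def gradient(x, data, sigma):
    """Gradient of the Gaussian density function at x."""
    g = [0.0] * len(x)
    for y in data:
        w = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))
        for j in range(len(x)):
            g[j] += (y[j] - x[j]) * w / sigma ** 2
    return g

def climb(x, data, sigma, step=0.1, tol=1e-12, max_iter=10000):
    """Gradient ascent: stop when the density no longer increases,
    returning an approximation of the density attractor x*."""
    for _ in range(max_iter):
        g = gradient(x, data, sigma)
        x_new = tuple(a + step * c for a, c in zip(x, g))
        if density(x_new, data, sigma) <= density(x, data, sigma) + tol:
            return x
        x = x_new
    return x
```

Every point whose climb ends at the same attractor belongs to the same center-defined cluster.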

Center-Defined Clusters
- All points that are density-attracted to a given density attractor x*
- The density function at the maximum must exceed ξ
- Points attracted to smaller maxima are considered outliers

Arbitrary-Shape Clusters
- Merge center-defined clusters if a path exists between them along which the density function continuously exceeds ξ

Examples

Noise Invariance
- The density distribution of the noise is constant → no influence on the number and location of attractors
- Claim: the number of density attractors with and without noise is the same, and the probability that they are identical goes to 1 as the amount of noise grows

Parameter Choices
- Choice of σ: try different values of σ and pick the largest interval in which the number of clusters stays constant
- Choice of ξ: greater than the noise level, smaller than the smallest relevant maxima

Comparison with DBSCAN
- Corresponding setup: a square-wave influence function whose radius σ models the neighborhood radius ε in DBSCAN
- The definition of core objects in DBSCAN involves MinPts → ξ
- Density-reachable in DBSCAN becomes density-attracted in DENCLUE (!?)
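The correspondence can be made concrete with a square-wave influence function (a sketch, not the paper's code): summing it simply counts the points within radius σ, mirroring the size of the ε-neighborhood that DBSCAN compares against MinPts.

```python
def square_wave_influence(x, y, sigma):
    """Square-wave influence: 1 if y lies within radius sigma of x, else 0."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1.0 if d2 <= sigma ** 2 else 0.0

def square_wave_density(x, data, sigma):
    """Summing the square-wave influence counts the neighbors of x,
    like |N_eps(x)| in DBSCAN; comparing it against xi mirrors MinPts."""
    return sum(square_wave_influence(x, y, sigma) for y in data)
```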

Comparison with k-Means
- Corresponding setup: Gaussian influence function; step size for hill-climbing δ = σ/2
- Claim: σ can be chosen such that k clusters are found, and the DENCLUE result corresponds to a global optimum of k-means

Comparison with Hierarchical Methods
- Start with a very small σ to get the largest number of clusters
- Increasing σ merges clusters
- Finally only one density attractor remains

Algorithm
- Step 1: construct a map of the data points
  - Uses hypercubes with edge length 2σ
  - Only populated cubes are saved
- Step 2: determine the density attractors of all points using hill-climbing
  - Keeps track of the paths that have been taken and of points close to them

Local Density Function
- The influence functions of "near" points contribute fully; far-away points are ignored
- For the Gaussian influence function the cut-off is chosen as 4σ
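A sketch of the local density function under the Gaussian influence function: only points within the 4σ cut-off contribute, since the influence of farther points is negligible.

```python
import math

def local_density(x, points, sigma):
    """Local density: only points within the cut-off 4*sigma contribute;
    the Gaussian influence of farther points is negligible and ignored."""
    cutoff2 = (4 * sigma) ** 2
    total = 0.0
    for y in points:
        d2 = sum((a - b) ** 2 for a, b in zip(x, y))
        if d2 <= cutoff2:
            total += math.exp(-d2 / (2 * sigma ** 2))
    return total
```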

Step 1: Constructing the Map
- Each hypercube stores: the number of data points, pointers to the data points, and the sum of the data values (for the mean)
- Populated hypercubes are saved in a B+-tree
- Neighboring populated cubes are connected for fast access
- Limited to highly populated cubes, derived from the outlier criterion
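The cube map of Step 1 can be sketched with a plain dict keyed on cube coordinates (a simplification: the paper stores the populated cubes in a B+-tree):

```python
import math

def cube_key(point, sigma):
    """Index of the hypercube (edge length 2*sigma) containing the point."""
    return tuple(int(math.floor(c / (2 * sigma))) for c in point)

def build_map(data, sigma):
    """Step 1: store only populated cubes; each cube keeps the number of
    points, pointers to the points, and coordinate sums (for the mean)."""
    cubes = {}
    for p in data:
        key = cube_key(p, sigma)
        cube = cubes.setdefault(key, {"count": 0, "points": [], "sums": [0.0] * len(p)})
        cube["count"] += 1
        cube["points"].append(p)
        for j, c in enumerate(p):
            cube["sums"][j] += c
    return cubes
```

Because only populated cubes are materialized, the map stays small even when the embedding space is large and mostly empty.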

Step 2: Clustering Step
- Uses only highly populated cubes and cubes that are connected to them
- Hill-climbing based on the local density function and its gradient
- Points within σ/2 of a hill-climbing path are attached to the cluster as well

Time Complexity / Efficiency
- Worst case, for N data points: O(N log N)
- Average case (without building the data structure?): O(log N)
- Explanation: only highly populated areas are considered
- Up to 45 times faster than DBSCAN

Application to Molecular Biology
- Simulation of a small but flexible peptide; each conformation is a point in a 19-dimensional angle space
- The pharmaceutical industry is interested in stable conformations
- Non-stable conformations make up more than 50 percent of the data → noise

What Can We Learn?
- The algorithm is fast for two reasons:
  - An efficient data structure: data points that are close in attribute space are stored together (similar to P-trees: fast access to data based on attribute values)
  - The optimization problem is inherently linear in the search space (the k-medoids problem is quadratic!)

Why Is k-Medoids Quadratic in the Search Space?
- Review: the cost function is calculated as the sum of squared distances within each cluster
- I.e., the cost associated with each cluster center depends on all other cluster centers!
- This can be viewed as an influence function that depends on the cluster boundaries
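The coupling is visible directly in the cost function (a generic k-medoids-style cost, sketched here with squared distances as on the slide): which medoid a point is charged to depends on all medoids at once, so no single medoid's cost can be evaluated in isolation.

```python
def kmedoids_cost(data, medoids):
    """Sum over all points of the squared distance to the nearest medoid.
    The min over medoids couples every medoid to every other one."""
    cost = 0.0
    for p in data:
        cost += min(sum((a - b) ** 2 for a, b in zip(p, m)) for m in medoids)
    return cost
```

Moving one medoid can silently re-assign points to or from every other medoid, which is what makes the problem quadratic in the search space.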

Cost Functions: k-medoids vs. DENCLUE

Motivating a Gaussian Influence Function
- Why not use a parabola as influence function? It has only one minimum (the mean of the data set)
- We need a cut-off, but the k-medoids cut-off depends on the cluster centers
- A cut-off independent of the cluster centers → the Gaussian function!

Is DENCLUE Only an Approximation to k-Medoids?
- Not necessarily
- Minimizing squared distance is a fundamental measure, but not the only one
- Why should "influence" depend on the density of points? "Influence" may be determined by the system

If DENCLUE Is So Good, Can We Still Improve It?
- It needs a special data structure: density-based methods map out all of space
- A distance-based version could look for cluster centers only
- That allows using a promising starting point and defining partitions by proximity

Conclusion
- The DENCLUE paper contains many fundamentally valuable ideas
- The data structure is efficient
- The algorithm is related to, but much more efficient than, k-medoids