CLARANS: A Method for Clustering Objects for Spatial Data Mining. IEEE Transactions on Knowledge and Data Engineering, 2002. Raymond T. Ng et al. 22 MAR 2011.

Presentation transcript:

CLARANS: A Method for Clustering Objects for Spatial Data Mining. IEEE Transactions on Knowledge and Data Engineering, 2002. Raymond T. Ng et al. Presented 22 MAR 2011 by Kwak, Namju.

Overview

Introduction
Spatial data mining
– Mining on spatial databases, which are typically huge (terabytes): satellite images, medical equipment, video cameras, etc.
Applications of spatial data mining
– NASA Earth Observing System
– National Institute of Justice (crime mapping)
– Department of Transportation (traffic data)
– National Institutes of Health (cancer clusters)
Difficulties
– Spatial data types (point, polygon, etc.)
– Spatial relationships (A is in front of, at the back of, nearby, etc. B)
– Spatial autocorrelation (similar objects tend to gather together)

Introduction
Key issues for cluster analysis
– Whether there exists a natural notion of similarity among the objects to be clustered (point objects vs. polygon objects)
– Whether clustering a large number of objects can be carried out efficiently
CLARANS (proposed in this paper)
– More efficient than the existing algorithms PAM and CLARA
– Calculates the similarity between two polygons efficiently and effectively, using the separation distance between the isothetic rectangles of the polygons

Clustering Algorithms Based on Partitioning
Hierarchical methods
– Agglomerative and divisive
– Successfully applied in many biological applications
– They can never undo what was done in earlier steps.
Partitioning methods
– k-means, k-medoid, fuzzy analysis, etc.
– k-medoid: robust to outliers; independent of the order in which the objects are examined; invariant under translations and orthogonal transformations of the data points

Clustering Algorithms Based on Partitioning (figure slide)

PAM (Partitioning Around Medoids)
– Suppose there are two medoids, A and B, and we consider replacing A with a new medoid M. For every object, PAM computes how the swap changes the distance to its nearest medoid; the swap is performed if the total change is negative, and the process repeats until no swap improves the clustering.
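The swap evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names, the `dist` callback, and the 1-D example points are assumptions made for the example.

```python
# A sketch of PAM's swap evaluation: replacing medoid `a` with candidate `m`
# is worthwhile only if it lowers the total distance of all objects to
# their nearest medoid.

def total_cost(points, medoids, dist):
    """Sum of each point's distance to its closest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def swap_gain(points, medoids, a, m, dist):
    """Change in total cost if medoid `a` is replaced by object `m`.
    A negative gain means the swap improves the clustering."""
    new_medoids = [m if x == a else x for x in medoids]
    return total_cost(points, new_medoids, dist) - total_cost(points, medoids, dist)

if __name__ == "__main__":
    dist = lambda p, q: abs(p - q)
    points = [0, 1, 2, 9, 10, 11]
    medoids = [0, 10]
    # Replacing medoid 0 with 1 centers the left cluster better:
    print(swap_gain(points, medoids, 0, 1, dist))  # -> -1
```

PAM evaluates this gain for all k(n-k) possible swaps per iteration, which is what makes it expensive on large data sets.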

Clustering Algorithms Based on Partitioning (figure slide)


CLARA (Clustering LARge Applications)
– CLARA draws a sample of the data set, applies PAM to the sample, and takes the medoids of the sample.
– If the sample is drawn in a sufficiently random way, the medoids of the sample approximate the medoids of the entire data set.
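The sampling idea can be sketched as below. This is a hedged illustration, not CLARA's actual code: `pam` is assumed to be any routine that returns k medoids for a sample, and the defaults (5 samples of size 40 + 2k) follow the values commonly quoted for CLARA.

```python
import random

def clara(points, k, pam, dist, num_samples=5, sample_size=None):
    """Sketch of CLARA: draw several random samples, run PAM on each,
    and keep the medoid set that is cheapest on the FULL data set."""
    sample_size = sample_size or min(len(points), 40 + 2 * k)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = random.sample(points, sample_size)
        medoids = pam(sample, k, dist)
        # Evaluate the sample's medoids against the whole data set.
        cost = sum(min(dist(p, m) for m in medoids) for p in points)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids
```

Note that the medoids are always scored on the full data set, even though PAM only ever sees a sample; this is what lets the best sample win.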

A Clustering Algorithm Based on Randomized Search (figure slide)

A Clustering Algorithm Based on Randomized Search
CLARANS
– PAM is a search for a minimum on the graph G_n,k, whose nodes are sets of k medoids and whose edges connect nodes differing in exactly one medoid. Examining all k(n-k) neighbors of a node is time consuming.
– CLARA restricts the search to subgraphs of G_n,k: if Sa is the set of objects in a sample, the search is confined within G_Sa,k. If the minimum M is not included in G_Sa,k, M will never be found in the search.

A Clustering Algorithm Based on Randomized Search
CLARANS
– Like CLARA, it does not check all neighbors of a node.
– Unlike CLARA, each sample is drawn dynamically: while CLARA draws a sample of nodes at the beginning of a search, CLARANS draws a sample of neighbors at each step of a search.
– Gives higher-quality clusterings and requires a very small number of searches.

A Clustering Algorithm Based on Randomized Search
CLARANS
– The higher the value of maxneighbor, the closer CLARANS is to PAM.
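The whole procedure can be sketched as follows. This is an illustration under assumptions, not the authors' code; the defaults numlocal = 2 and maxneighbor = max(250, 1.25% of k(n-k)) are the values the paper recommends, and the point/`dist` representation is chosen for the example.

```python
import random

def clarans(points, k, dist, numlocal=2, maxneighbor=None):
    """Sketch of CLARANS: start from a random node (a set of k medoids),
    examine up to `maxneighbor` RANDOM neighbors (swap one medoid for one
    non-medoid), move whenever a neighbor is cheaper, and declare a local
    minimum when no sampled neighbor improves the node. Restart `numlocal`
    times and keep the best local minimum found."""
    n = len(points)
    if maxneighbor is None:
        maxneighbor = max(250, int(0.0125 * k * (n - k)))

    def cost(medoids):
        return sum(min(dist(p, m) for m in medoids) for p in points)

    best, best_cost = None, float("inf")
    for _ in range(numlocal):
        current = random.sample(points, k)
        current_cost = cost(current)
        failed = 0
        while failed < maxneighbor:
            a = random.choice(current)
            m = random.choice([p for p in points if p not in current])
            neighbor = [m if x == a else x for x in current]
            neighbor_cost = cost(neighbor)
            if neighbor_cost < current_cost:
                current, current_cost = neighbor, neighbor_cost
                failed = 0  # restart the neighbor count at the new node
            else:
                failed += 1
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best
```

Setting maxneighbor to k(n-k) would make every neighbor get sampled eventually, recovering PAM-like behavior, which is the trade-off the slide describes.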

Clustering Convex Polygon Objects
In practice, many of the spatial objects to be clustered are polygonal in nature (shopping malls, parks, etc.).
– Single representative point (centroid): compare a typical house of 200 square meters in a rectangular shape with a park of 500,000 square meters in an irregular shape; the centroid alone yields clusterings of poor quality.
– Multiple representative points: for a large park, two of its representative points may be 5,000 meters apart, and there is no guarantee that they will end up in the same cluster.

Clustering Convex Polygon Objects (figure slides)

Clustering Convex Polygon Objects
Approximating by the Separation Distance between Isothetic Rectangles (IR-approximation)
– Compute the isothetic rectangles I_A and I_B and calculate the separation distance between them.
– The isothetic rectangle I_A is the smallest rectangle containing polygon A whose edges are parallel to the x- or y-axis.
– Although the isothetic rectangle has a larger area than a minimum bounding rectangle, it can be obtained simply by finding the minimum and maximum of the x- and y-coordinates of the vertices.
– Trivial amount of time to compute.
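The construction is simple enough to state in a few lines; this is a sketch, and the (x_min, y_min, x_max, y_max) tuple representation is an assumption of the example.

```python
def isothetic_rectangle(polygon):
    """Smallest axis-parallel rectangle containing the polygon, found by
    taking the min/max of the vertex coordinates (O(n) time).
    `polygon` is a list of (x, y) vertex tuples; the result is
    (x_min, y_min, x_max, y_max)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))

# e.g. a triangle:
# isothetic_rectangle([(0, 0), (4, 1), (2, 3)])  -> (0, 0, 4, 3)
```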

Clustering Convex Polygon Objects
Approximating by the Separation Distance between Isothetic Rectangles
– In the first step, where possible intersection is checked, it takes constant time for isothetic rectangles but logarithmic time for (convex) polygons.
– In the next step, where the actual separation distance is computed, it is again constant time for isothetic rectangles but logarithmic time for polygons.
– The approximation underestimates the exact separation distance.
– The original polygons do not have to be convex.
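The constant-time rectangle computation can be sketched as follows; this is an illustration, not the paper's code, with rectangles given as (x_min, y_min, x_max, y_max) tuples (an assumption of this sketch). It folds the intersection check and the distance computation into one expression: an overlapping axis contributes a zero gap.

```python
def separation_distance(r1, r2):
    """Minimum distance between two isothetic rectangles, 0 if they
    intersect. Constant time, matching the slide's claim. Because each
    rectangle contains its polygon, this UNDERESTIMATES the exact
    polygon-to-polygon separation distance."""
    dx = max(r1[0] - r2[2], r2[0] - r1[2], 0)  # horizontal gap (0 if overlapping in x)
    dy = max(r1[1] - r2[3], r2[1] - r1[3], 0)  # vertical gap (0 if overlapping in y)
    return (dx * dx + dy * dy) ** 0.5
```

Underestimation is the safe direction for clustering here: the true polygons can only be at least this far apart, never closer.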

Experimental Results
CLARANS vs. PAM
– For large and medium data sets, CLARANS is clearly much more efficient than PAM.
– On small data sets (40, 60, 80, and 100 points in five clusters), the clusterings produced by the two algorithms are of the same quality.

Experimental Results
CLARANS vs. CLARA
– Since CLARA is not designed for small data sets, this set of experiments was run on data sets with more than 100 objects.
– CLARANS is always able to find clusterings of better quality than those found by CLARA.
– However, in some cases CLARA may take much less time than CLARANS. What if both were given the same amount of time?

Experimental Results (figure slide)

Conclusion
– For small data sets, CLARANS is a few times faster than PAM; the performance gap is even larger for larger data sets.
– Given the same amount of runtime, CLARANS produces clusterings of much better quality than those generated by CLARA.
– IR-approximation is a few times faster than computing the exact separation distance, yet finds clusterings of almost the same quality as those produced using the exact separation distance.