Project Presentation CPSC 695 Prepared By: Priyadarshi Bhattacharya.

Outline of Talk
- Introduction to clustering and its relevance to my research interests.
- Discussion of existing clustering techniques and their shortcomings.
- Introduction to a new Delaunay-based clustering algorithm.
- Experimental results and comparison with other methods.
- Directions for future research.

Clustering – Definition
- Automatic identification of groups of similar objects.
- A method of grouping data such that intracluster similarity is maximized and intercluster similarity is minimized.
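As a rough illustration of this definition, the sketch below compares two groupings of the same four points. It assumes Euclidean distance as the dissimilarity measure (small distance means high similarity); the function names are hypothetical and not taken from the slides.

```python
import math
from itertools import combinations

def mean_intra_distance(clusters):
    """Mean distance between pairs of points that share a cluster."""
    d = [math.dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(d) / len(d)

def mean_inter_distance(clusters):
    """Mean distance between pairs of points that lie in different clusters."""
    d = [math.dist(p, q)
         for a, b in combinations(clusters, 2) for p in a for q in b]
    return sum(d) / len(d)

# Two groupings of the same points: the first keeps nearby points together.
good = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
bad = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]

print(mean_intra_distance(good), mean_inter_distance(good))
print(mean_intra_distance(bad), mean_inter_distance(bad))
```

Under this assumption, a good grouping shows a small intracluster distance and a large intercluster distance; the second grouping reverses that relationship.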

Properties of clustering
- Scalability: clustering performance should degrade no worse than linearly as the data size increases.
- Ability to detect clusters of different shapes.
- Minimal input parameters.
- Robustness to noise.
- Insensitivity to data input order.
- Scalability to higher dimensions.
(Properties adapted, with minor modifications, from “On Data Clustering Analysis: Scalability, Constraints and Validation”.)

Relevance to my research
Identification of high-risk areas in the sea based on incident data from the Maritime Activity and Risk Investigation System (MARIS), maintained primarily by the University of Halifax.
[Diagram labels: Incident Data; Clustering Algorithm; Marine Route Planning; (ESRI Shape File); High-risk areas; Location of SAR Bases]

Existing clustering algorithms
Clustering
- Partitioning: K-Means, K-Medoid
- Hierarchical: BIRCH, CURE, ROCK, CHAMELEON
- Density-based: DBSCAN, TURN*
- Grid-based: WaveCluster (1), CLIQUE
(1) WaveCluster: A novel clustering approach based on wavelet transforms. Applies a multi-resolution grid structure on the data space. For more details, refer to “WaveCluster: a multi-resolution clustering approach for very large spatial databases”, Proc. 24th Conf. on Very Large Databases.

Shortcomings of existing methods
- Require a large number of user-supplied parameters, for example the number of clusters, a threshold to quantify “similarity”, a stopping condition, or the number of nearest neighbors.
- Sensitivity to user-supplied parameters.
- Capability of identifying clusters degrades as noise increases.
- Inability to identify clusters of widely varying shapes and sizes; most methods detect only spherical ones.
- Identifying dense clusters in the presence of sparse ones, clusters connected by multiple bridges, and closely lying dense clusters remains elusive.

CRYSTAL – A new Delaunay-based clustering algorithm
The algorithm has three stages:
- Triangulation phase: Forms the Delaunay Triangulation of the data points and sorts the vertices in order of decreasing average length of adjacent edges.
- Grow cluster phase: Scans the sorted vertex list and grows clusters from the vertices in that order, first encompassing first-order neighbors, then second-order neighbors, and so on. The growth stops when the boundary of the cluster is determined.
- Noise removal phase: The algorithm identifies noise as sparse clusters. These are easily eliminated by removing clusters that are very small or have a very low density.

Description of Stage I
Triangulation phase:
- Triangulation is done in O(n log n) time using the incremental algorithm.
- An auxiliary grid structure (O(n) in size) is used to speed up point location in the Delaunay Triangulation. This considerably reduces the length of the walk in the graph needed to locate the triangle containing the data point.
- The well-known Winged-Edge data structure is used to represent the Delaunay Triangulation because of its efficiency in answering proximity queries.
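The output of this phase, a triangulation plus a vertex list sorted by average adjacent edge length, can be sketched as follows. This is only an illustration: it swaps in a brute-force empty-circumcircle test (far slower than the O(n log n) incremental algorithm named on the slide), it uses neither the auxiliary grid nor the Winged-Edge structure, and it assumes points in general position. All function names are hypothetical.

```python
import math
from itertools import combinations

def circumcircle(a, b, c):
    """Return (center_x, center_y, radius_squared), or None if collinear."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2

def delaunay_bruteforce(points):
    """Keep each triangle whose circumcircle contains no other input point.

    Brute force for illustration only; NOT the incremental algorithm.
    """
    triangles = []
    for i, j, k in combinations(range(len(points)), 3):
        cc = circumcircle(points[i], points[j], points[k])
        if cc is None:
            continue
        ux, uy, r2 = cc
        if all((px - ux) ** 2 + (py - uy) ** 2 >= r2 - 1e-9
               for m, (px, py) in enumerate(points) if m not in (i, j, k)):
            triangles.append((i, j, k))
    return triangles

def adjacency(triangles):
    """Map each vertex index to the set of vertices it shares an edge with."""
    adj = {}
    for tri in triangles:
        for u, v in combinations(tri, 2):
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj

def sorted_vertices(points, triangles):
    """Sort vertices by decreasing average adjacent edge length (as on the slide)."""
    adj = adjacency(triangles)
    avg = {v: sum(math.dist(points[v], points[u]) for u in nbrs) / len(nbrs)
           for v, nbrs in adj.items()}
    return sorted(avg, key=avg.get, reverse=True), avg

pts = [(0, 0), (4, 0), (0, 4), (1, 1)]
tris = delaunay_bruteforce(pts)
order, avg = sorted_vertices(pts, tris)
print(tris, order)
```

In a real implementation the triangulation would come from the incremental algorithm or an existing library; only the sorting step is specific to this phase.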

Description of Stage II
Grow cluster phase:
- A queue is used to maintain, in order, the list of vertices from which the cluster is grown.
- Only vertices that are not boundary points are inserted into the queue.
- To decide whether a point belongs to the cluster, the edge length is compared with the average edge length of the cluster.
- To decide whether a point is on the boundary of a cluster, the average adjacent edge length of the point is compared with the average edge length of the cluster.
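The queue-driven growth described above might look roughly like the sketch below. The slides do not state the exact comparison rule, so the tolerance factor `tol` is a hypothetical parameter, as is the fallback that uses the seed's own average adjacent edge length before the cluster has any edges. The adjacency map is assumed to come from the triangulation phase.

```python
import math
from collections import deque

def grow_clusters(points, adj, order, tol=1.5):
    """Grow clusters breadth-first over Delaunay neighbours.

    adj:   vertex -> set of neighbouring vertices (from the triangulation)
    order: seed vertices in the order they should be scanned
    tol:   hypothetical threshold factor (not specified in the slides)
    """
    def dist(u, v):
        return math.dist(points[u], points[v])

    def avg_adjacent(v):
        return sum(dist(v, u) for u in adj[v]) / len(adj[v])

    label = {}
    clusters = []
    for seed in order:
        if seed in label:
            continue  # a vertex already assigned is never reconsidered
        cid = len(clusters)
        label[seed] = cid
        members = [seed]
        edge_sum, edge_count = 0.0, 0  # running totals for the cluster average
        queue = deque([seed])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u in label:
                    continue
                d = dist(v, u)
                # Assumption: before any edge is accepted, use the seed's average.
                cluster_avg = edge_sum / edge_count if edge_count else avg_adjacent(seed)
                if d > tol * cluster_avg:
                    continue  # edge too long: u does not join through this edge
                label[u] = cid
                members.append(u)
                edge_sum += d
                edge_count += 1
                # Boundary test: only non-boundary vertices are enqueued.
                if avg_adjacent(u) <= tol * (edge_sum / edge_count):
                    queue.append(u)
        clusters.append(members)
    return clusters

# Two unit squares far apart, joined by a few long edges.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 0), (11, 0), (10, 1), (11, 1)]
adj = {0: {1, 2}, 1: {0, 2, 3, 4}, 2: {0, 1, 3}, 3: {1, 2, 4, 6},
       4: {1, 3, 5, 6}, 5: {4, 6, 7}, 6: {3, 4, 5, 7}, 7: {5, 6}}
clusters = grow_clusters(points, adj, order=range(8))
print(clusters)
```

Vertices with long bridging edges (here 1, 3, 4, and 6) join a cluster but are treated as boundary points, so growth does not continue across the gap from them.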

Description of Stage III
Noise removal phase:
- Noise in the data may take the form of isolated data points or may be scattered throughout the data.
- In the former case, a cluster based at such a data point will not be able to grow.
- If the noise is scattered uniformly throughout the data, however, the algorithm identifies it as a single sparse cluster.
- This phase removes noise by eliminating the cluster with the highest average edge length.
- Any trivial clusters (size less than an acceptable number) are also removed in this phase.
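A minimal sketch of this phase follows, assuming clusters are lists of vertex indices and `adj` is the Delaunay adjacency map. The `min_size` threshold is hypothetical (the slides say only "an acceptable number"). Two further choices are assumptions of this sketch: clusters with no internal edges are left to the size filter, and the sparsest cluster is dropped only when more than one cluster has edges.

```python
import math

def remove_noise(points, clusters, adj, min_size=3):
    """Drop the sparsest cluster, then drop trivially small clusters.

    min_size is a hypothetical threshold; the slides do not give a value.
    """
    def internal_avg(cluster):
        inside = set(cluster)
        lengths = [math.dist(points[u], points[v])
                   for u in cluster for v in adj[u] if v in inside and u < v]
        return sum(lengths) / len(lengths) if lengths else None

    averages = [internal_avg(c) for c in clusters]
    scored = [i for i, a in enumerate(averages) if a is not None]
    # Assumption: keep a lone cluster rather than discarding everything.
    sparsest = max(scored, key=lambda i: averages[i]) if len(scored) > 1 else None
    return [c for i, c in enumerate(clusters)
            if i != sparsest and len(c) >= min_size]

def square(x, y, side):
    """Four corner points of an axis-aligned square (test-data helper)."""
    return [(x, y), (x + side, y), (x, y + side), (x + side, y + side)]

# Two dense squares, one sparse square standing in for noise, one isolated point.
points = square(0, 0, 1) + square(10, 0, 1) + square(0, 20, 5) + [(30, 30)]
adj = {12: set()}
for base in (0, 4, 8):
    for u, v in [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]:
        adj.setdefault(base + u, set()).add(base + v)
        adj.setdefault(base + v, set()).add(base + u)
clusters = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12]]
kept = remove_noise(points, clusters, adj)
print(kept)
```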

Complexity Analysis
- The algorithm operates in O(n log n) time.
- The Delaunay Triangulation is generated in O(n log n) time.
- Because a vertex, once assigned to a cluster, is not considered again, the clustering is done in O(n) time.
[Plot: Cluster size (1000) vs. Time consumed (ms)]
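The overall bound can be written out term by term. The sorting term is an addition not stated on the slide; it assumes the vertex ordering from the triangulation phase uses a comparison sort. The linear bound for cluster growth also relies on the fact that a planar triangulation on n vertices has at most 3n - 6 edges, so scanning all neighbor lists costs O(n) in total.

```latex
T(n) \;=\; \underbrace{O(n \log n)}_{\text{triangulation}}
     \;+\; \underbrace{O(n \log n)}_{\text{sorting vertices}}
     \;+\; \underbrace{O(n)}_{\text{cluster growth}}
     \;=\; O(n \log n)
```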

Clustering in action

Experimental Results Comparison with K-Means based approaches

Experimental Results (contd.)
1. Clusters of different shapes
2. Closely lying dense clusters

Experimental Results (contd.)
1. Clusters connected by multiple bridges
2. Clusters of widely varying density

Experimental Results (contd.)
[Comparison table; columns: Data set, K-Means, GEM, CRYSTAL]

Experimental Results (contd.) Results on t7.10k.dat (originally used in “CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling”)

Conclusion & Future Work
CRYSTAL is a fast O(n log n) clustering algorithm that automatically identifies clusters of widely varying shapes, sizes and densities without requiring any input from the user.
Future work will involve:
- Application of the clustering algorithm to the identification of high-risk areas in the sea using the MARIS database.
- Extension of the algorithm to 3D.
- Considering physical constraints in clustering. In GIS, physical constraints such as rivers, highways and mountain ranges can hinder or alter the clustering result.

References
- G. Papari, N. Petkov: Algorithm That Mimics Human Perceptual Grouping of Dot Patterns. Lecture Notes in Computer Science (2005)
- Vladimir Estivill-Castro, Ickjai Lee: AUTOCLUST: Automatic Clustering via Boundary Extraction for Mining Massive Point-Data Sets. Fifth International Conference on Geocomputation (2000)
- Osmar R. Zaiane, Andrew Foss, Chi-Hoon Lee, Weinan Wang: On Data Clustering Analysis: Scalability, Constraints and Validation. Advances in Knowledge Discovery and Data Mining, Springer-Verlag (2002)
- Z.S.H. Chan, N. Kasabov: Efficient global clustering using the Greedy Elimination Method. Electronics Letters (2004)
- Aristidis Likas, Nikos Vlassis, Jakob J. Verbeek: The global k-means clustering algorithm. Pattern Recognition 36(2) (2003)
- Ying Xu, Victor Olman, Dong Xu: Minimum Spanning Trees for Gene Expression Data Clustering. Computational Protein Structure Group, Life Sciences Division, Oak Ridge National Laboratory, USA
- C. Eldershaw, M. Hegland: Cluster Analysis using Triangulation. Computational Techniques and Applications CTAC97, World Scientific, Singapore (1997)
- Mir Abolfazl Mostafavi, Christopher Gold, Maciej Dakowicz: Delete and insert operations in Voronoi/Delaunay methods and applications. Computers & Geosciences (2003)
- Atsuyuki Okabe, Barry Boots, Kokichi Sugihara: Spatial Tessellations: Concepts and Applications of Voronoi Diagrams.

Thank You! All 11 identified by CRYSTAL! Questions?