Garrett Poppe, Liv Nguekap, Adrian Mirabel CSUDH, Computer Science Department.


Overview
• Introduction to the topic
• History and related work
• Problem definition
• Existing approaches to solving the problem
• Description of proposed algorithm
• Problems and solutions
• The trend of the field
• Conclusion

Introduction
• Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed in 1996.[1]
• “It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).”
• “DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature.”[2]
• “In 2014, the algorithm was awarded the test of time award (an award given to algorithms which have received substantial attention in theory and practice) at the leading data mining conference, KDD.”[3]

Motivation Census survey data

Motivation Face recognition (FaceVACS-DBScan)

Motivation Mining Biomedical Images with Density-based Clustering

Motivation Satellite image recognition

Problem
• Best case: O(n log n)
• Worst case: O(n²)
• Current implementations run as a single sequential task
• The algorithm starts at the first point and continues comparing through to the last point
• Requires the user to supply minPts and Eps
• Parallelizing DBSCAN is challenging because it exhibits an inherently sequential data access order.
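To make the sequential access order concrete, here is a minimal single-task DBSCAN sketch in Python. It assumes points are coordinate tuples and uses a brute-force neighbor search, so it sits at the O(n²) end of the range; the names `region_query` and `dbscan` are illustrative, not taken from any library.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (brute force, O(n))."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):          # sequential scan, first point to last
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:      # not a core point (may become a border point later)
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:                      # grow the cluster from its core points
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)
        cluster += 1
    return labels
```

Each cluster is grown to completion before the outer loop moves on, which is the dependency that makes a straightforward parallel split difficult.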

Approaches to solve problem
• PDSDBSCAN uses graph-algorithmic concepts and a tree-based, bottom-up approach to construct the clusters, which yields a better-balanced workload distribution. The algorithm is implemented for both shared and distributed memory.
• CURE uses multiple representative points for each cluster. They are generated by selecting well-scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. This lets CURE adjust well to the geometry of clusters that have non-spherical shapes and wide variances in size.
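The shrinking step described for CURE can be sketched as follows. This assumes the well-scattered representatives have already been chosen, takes the cluster center to be the coordinate-wise mean, and calls the shrink fraction `alpha`; the function names are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of coordinate tuples."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))

def shrink_representatives(representatives, cluster_points, alpha):
    """Move each representative toward the cluster centroid by fraction alpha.

    alpha = 0 leaves the representatives in place; alpha = 1 collapses
    them all onto the centroid.
    """
    c = centroid(cluster_points)
    return [
        tuple(p_k + alpha * (c_k - p_k) for p_k, c_k in zip(p, c))
        for p in representatives
    ]
```

With a cluster at the corners of a 4 × 4 square and `alpha = 0.5`, the representatives (0, 0) and (4, 4) move halfway to the centroid (2, 2), giving (1, 1) and (3, 3).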

Current

Proposed Algorithm (Data Set)

Proposed Algorithm (Map Tasks)

Proposed Algorithm (Map Function)

Proposed Algorithm (Map Function Results)

Proposed Algorithm (Reduce Function)
If MIN_pts = 2:
1. Start at the first cluster table.
2. Visit each cluster within the table.
3. Add all points from each visited table to the first cluster table.
4. When all points are visited, go to the next unvisited cluster table.
5. Repeat from step 1 until all tables are visited.
6. Omit any noise tables (a cluster table with fewer than 2 points).
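The reduce steps above can be sketched as a merge over per-point cluster tables. The layout of the tables is an assumption here: the map phase is taken to emit, for each origin point id, the set of point ids within Eps of it (the origin included). The function name `reduce_cluster_tables` is hypothetical.

```python
def reduce_cluster_tables(tables, min_pts=2):
    """Merge per-origin cluster tables into final clusters.

    tables: dict mapping an origin point id to the set of point ids
            found within Eps of it (assumed to include the origin).
    """
    visited = set()
    clusters = []
    for origin in tables:                       # next unvisited cluster table
        if origin in visited:
            continue
        if len(tables[origin]) < min_pts:       # noise table: omit it
            visited.add(origin)
            continue
        cluster = set()
        pending = [origin]
        while pending:                          # visit each point in the table
            point = pending.pop()
            if point in visited:
                continue
            visited.add(point)
            cluster.add(point)
            # add all points from the visited table to the first cluster table
            pending.extend(tables.get(point, set()) - visited)
        clusters.append(cluster)
    return clusters
```

For example, tables `{1: {1, 2}, 2: {1, 2, 3}, 3: {2, 3}, 4: {4}}` merge into the single cluster `{1, 2, 3}`, and point 4 is dropped as noise.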

Proposed Algorithm (Reduce Function)

Proposed Algorithm (Final Clusters)

Proposed Algorithm (MIN_pts)
Clusters that do not contain the minimum number of points within EPS_min are dropped during the reduce phase.
• If MIN_pts = 4
• Check ptsCntr for each cluster table visited, and add the table's points only if ptsCntr > 4
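A possible reading of the ptsCntr check, sketched under assumptions: each cluster table is a set of point ids keyed by its origin, ptsCntr is the size of that set, and a table qualifies when ptsCntr >= MIN_pts (the usual DBSCAN convention). If the strict "> 4" is intended, only the comparison changes. The function name is hypothetical.

```python
def reduce_with_min_pts(tables, min_pts):
    """Merge cluster tables, expanding only through tables that meet min_pts.

    tables: dict mapping an origin point id to the set of point ids within
            the Eps radius of it (assumed to include the origin).
    """
    assigned = set()
    clusters = []
    for origin, members in tables.items():
        # Tables below the threshold never start a cluster and are dropped.
        if origin in assigned or len(members) < min_pts:
            continue
        cluster = set()
        pending = [origin]
        while pending:
            point = pending.pop()
            if point in cluster or point in assigned:
                continue
            cluster.add(point)
            table = tables.get(point, set())
            pts_cntr = len(table)               # ptsCntr for the visited table
            if pts_cntr >= min_pts:             # assumption: >=, not strict >
                pending.extend(table - cluster)
        assigned |= cluster
        clusters.append(cluster)
    return clusters
```

A point whose own table is below the threshold can still join a cluster when a qualifying table lists it, but its table is not used to pull in further points.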

Proposed Algorithm (MIN_pts)

Anticipated problems and solutions
Dataset is too large for the memory of a single node.
• Split the dataset into portions; during the map phase, the origin point is compared with each split.
• Combine all clusters created from the split dataset during the reduce phase.
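One way the split-and-combine idea might look, as a single-process sketch rather than real MapReduce code. It assumes the dataset is a dict of point id to coordinates, that each map task sees one split plus the origin points, and that the reduce side unions the partial tables per origin. All function names are hypothetical.

```python
from collections import defaultdict
from math import dist

def split_dataset(points, n_splits):
    """Partition {id: coords} into n_splits roughly equal portions."""
    ids = list(points)
    return [{i: points[i] for i in ids[k::n_splits]} for k in range(n_splits)]

def map_split(origins, split, eps):
    """Compare every origin point with one split; emit partial cluster tables."""
    for o_id, o in origins.items():
        yield o_id, {p_id for p_id, p in split.items() if dist(o, p) <= eps}

def combine(partials):
    """Reduce-side union of the partial tables emitted for each origin."""
    tables = defaultdict(set)
    for o_id, members in partials:
        tables[o_id] |= members
    return dict(tables)

points = {1: (0, 0), 2: (0, 1), 3: (5, 5), 4: (5, 6)}
splits = split_dataset(points, 2)
partials = [pair for s in splits for pair in map_split(points, s, eps=1.5)]
tables = combine(partials)   # one full cluster table per origin point
```

Only one split has to be resident per map task; in this sketch the origin points are all held in memory for brevity, whereas a real job would presumably stream them.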

Trends and future research
• Big Data requires parallel processing
• Data collected is outgrowing processing power
• Machine learning and AI can fill the need for analysis of large amounts of data

References
[1] Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). "A density-based algorithm for discovering clusters in large spatial databases with noise". In Simoudis, Evangelos; Han, Jiawei; Fayyad, Usama M. (eds.). Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231.
[2] Most cited data mining articles according to Microsoft Academic Search; DBSCAN is on rank 24, when accessed on 4/18/2010.
[3] "2014 SIGKDD Test of Time Award". ACM SIGKDD.