Published by Toby Weaver. Modified about 1 year ago.

1
MR-DBSCAN: An Efficient Parallel Density-based Clustering Algorithm using MapReduce
Yaobin He, Haoyu Tan, Wuman Luo, Huajian Mao, Di Ma, Shengzhong Feng, Jianping Fan

2
INTRODUCTION
This paper focuses mainly on parallel density-based data clustering in a shared-nothing cluster environment.
Data clustering is an essential data mining technique that reveals macroscopic patterns in data.
Because datasets keep growing, parallel data clustering algorithms are needed.
In this paper, the authors:
- Propose a parallel density-based clustering algorithm and implement it as a 4-stage MapReduce paradigm
- Adopt a quick partitioning strategy for large-scale non-indexed data
- Study the metric for merging among bordering partitions, along with optimizations
- Evaluate on real large-scale datasets (approx. 1.9 billion GPS log records)

3
Introduction Clustering techniques
Pros of DBSCAN
- Divides data into clusters with arbitrary shapes
- Does not require the number of clusters a priori
- Insensitive to the order of the points in the dataset
Cons of DBSCAN
- Datasets are growing so large that they cannot be held on a single machine
- Much higher computational complexity than K-means
=> PARALLELIZE using MapReduce!! (what a simple idea..)

4
Background : DBSCAN
DBSCAN (Martin Ester et al., KDD, 1996)
The key idea of density-based clustering is that, for each point of a cluster, the neighborhood of a given radius (Eps) has to contain at least a minimum number of points (MinPts).
- Directly density-reachable (DDR): p is DDR from o if p ∈ N_Eps(o) and Card(N_Eps(o)) ≥ MinPts.
- Density-reachable (DR): if there is a chain of points {p_i | i = 0, ..., n} such that each p_(i+1) is DDR from p_i, then every t ∈ {p_j | j = i + 1, ..., n} is DR from p_i. (canonical extension)
- Density-connected (DC): if p is DR from o and q is DR from o, then p is DC to q. (symmetric version)
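The definitions above can be illustrated with a small Python sketch. It is not taken from the slides: the function names are hypothetical, it uses a brute-force Euclidean neighborhood search, and it assumes the standard convention that a point belongs to its own Eps-neighborhood.

```python
from math import dist  # Euclidean distance, Python 3.8+


def eps_neighborhood(points, o, eps):
    """Indices of all points within distance eps of points[o] (o itself included)."""
    return [i for i, p in enumerate(points) if dist(points[o], p) <= eps]


def is_core(points, o, eps, min_pts):
    """A point is a core point if its Eps-neighborhood holds at least MinPts points."""
    return len(eps_neighborhood(points, o, eps)) >= min_pts


def directly_density_reachable(points, o, p, eps, min_pts):
    """True if points[p] is DDR from points[o]: p lies in N_Eps(o) and o is a core point."""
    return p in eps_neighborhood(points, o, eps) and is_core(points, o, eps, min_pts)
```

Note that the relation is not symmetric: a border point can be DDR from a core point, but not the other way round.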

5
Background : DBSCAN
Class of point :
- Unclassified
- Core
- Border
- Noise

6
Background : MapReduce
Borrows from functional programming
Users should implement two primary methods:
- Map: (k1, v1) → list(k2, v2)
- Reduce: (k2, list(v2)) → list(k3, v3)
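To make the two signatures concrete, here is a minimal single-process sketch in Python. It only imitates the map, group-by-key, and reduce phases in memory and is not Hadoop code; `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names chosen for this illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every (k1, v1), group values by k2, then apply reduce_fn per key."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):      # Map: (k1, v1) -> list(k2, v2)
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):              # the shuffle phase delivers keys grouped
        output.extend(reduce_fn(k2, groups[k2]))  # Reduce: (k2, list(v2)) -> list(k3, v3)
    return output


def wc_map(line_no, line):
    """Word-count mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def wc_reduce(word, counts):
    """Word-count reducer: sum the counts for one word."""
    return [(word, sum(counts))]
```

For example, `run_mapreduce([(0, "a b"), (1, "b")], wc_map, wc_reduce)` counts each word across both lines.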

7
Background : MapReduce

8
Design And Implementation
Problem Statement
Given a set of d-dimensional points DB = {p1, p2, ..., pn}, a minimal density of clusters defined by Eps and MinPts, and a set of computers CP = {C1, C2, ..., Cn} managed by a MapReduce platform, find the density-based clusters with respect to the given Eps and MinPts values.
Overall Framework

9
Stage 1 : Preprocessing
Summarize the spatial distribution, then generate a grid-based partition.
Main challenges for a partitioning strategy
1) Load balancing
2) Minimized communication
One possible solution is to build an efficient spatial index. However, the authors do not use well-known indexing methods such as R-Tree or KD-Tree, because the recursive iteration needed to build a hierarchical structure is not practical in the MapReduce paradigm.
Instead, the authors use a partitioning algorithm on MapReduce adapted from the grid file.

10
Stage 1 : Preprocessing
Raw data -> Bucket counting -> Compute spatial distribution for each dimension -> Partitioning
- Bucket counting: in the example, 10 buckets created with an interval of 0.1
- Proposed metrics: avg, m
- Table columns: Bucket ID, Count
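The bucket-counting step can be sketched as follows. This is an illustrative Python version, not the authors' MapReduce job: it assumes coordinates already normalized into [0, 1), and `bucket_counts` is a hypothetical name. The slide's metrics (avg, m) are not defined in the transcript, so they are not implemented here.

```python
from collections import Counter


def bucket_counts(points, interval=0.1):
    """Per-dimension histogram: for each dimension, map bucket id -> number of points.

    Assumes every coordinate lies in [0, 1). With interval=0.1 there are
    10 buckets per dimension, with ids 0..9.
    """
    n_buckets = round(1 / interval)
    dims = len(points[0])
    hist = [Counter() for _ in range(dims)]
    for p in points:
        for d, x in enumerate(p):
            # Clamp so a value that rounds up to the top edge still lands in the last bucket.
            bucket = min(int(x * n_buckets), n_buckets - 1)
            hist[d][bucket] += 1
    return hist
```

The per-dimension counts give the spatial distribution from which a grid partition can then be derived.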

11
Stage 1 : Preprocessing
Shape of the partition: the need for access to remote data
For a given Eps and MinPts = 5, if remote data cannot be accessed, the neighborhood of object p1 would contain only 3 points, which is fewer than MinPts, so p1 would not be a core point.
Therefore, to obtain correct clustering results, a “view” over the border of partitions is necessary.
So the shape of the partition is S + halo.
Figure labels: S1 (or Si), S2 (or Si+1), halo, outer halo, inner halo, Eps
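One way to picture "S + halo" is the sketch below. It rests on an assumption that the transcript does not spell out: the inner halo is taken to be the band of width Eps just inside the border of S, and the outer halo the band of width Eps just outside it. The function name `classify_point` and the half-open box convention are likewise hypothetical.

```python
def classify_point(point, lo, hi, eps):
    """Classify a point relative to the axis-aligned partition S = [lo, hi).

    Returns "inner", "inner halo", "outer halo", or "outside".
    Assumption: both halos are bands of width eps along the border of S.
    """
    inside = all(l <= x < h for x, l, h in zip(point, lo, hi))
    if inside:
        # Farther than eps from every face of S: no remote data is needed.
        deep = all(l + eps <= x < h - eps for x, l, h in zip(point, lo, hi))
        return "inner" if deep else "inner halo"
    in_expanded = all(l - eps <= x < h + eps for x, l, h in zip(point, lo, hi))
    # Outside S but within eps of it: replicated into this partition's view.
    return "outer halo" if in_expanded else "outside"
```

Under this reading, a partition processes every point that is not "outside", so that the Eps-neighborhood of each point in S is complete.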

12
Stage 2 : Local DBSCAN
The Local DBSCAN algorithm is very similar to DBSCAN.
The difference: for a non-noise point q on the outer halo, the local node cannot tell whether q is a core point or a border point, because computing nodes run in a shared-nothing environment.
Such points are given the “Onqueue” status and put into the MergeCandidates set (MC).
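The "Onqueue" rule can be sketched as a post-processing step over local labels. This is a simplification for illustration only: the real Local DBSCAN assigns the status during cluster expansion, whereas this hypothetical `mark_merge_candidates` function assumes the local cluster labels and the set of outer-halo point ids are already known. The status name "Classified" is also an assumption.

```python
def mark_merge_candidates(local_labels, outer_halo_ids):
    """Assign a status to each point and collect merge candidates per local cluster.

    local_labels: dict point id -> local cluster id, or None for noise.
    outer_halo_ids: set of point ids lying on the outer halo.
    Returns (status, mc) where mc maps local cluster id -> set of Onqueue point ids.
    """
    status = {}
    mc = {}
    for pid, cid in local_labels.items():
        if cid is None:
            status[pid] = "Noise"
        elif pid in outer_halo_ids:
            # Non-noise point on the outer halo: core/border cannot be decided locally.
            status[pid] = "Onqueue"
            mc.setdefault(cid, set()).add(pid)
        else:
            status[pid] = "Classified"
    return status, mc
```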

13
Stage 3 : Find Merging Mapping
Characteristics of the MC set
- The composition of the MC set
- The completeness of the MC set
Figure annotations: q is not in the halo; q is a core point; more than one neighbor is on the halo; o is a core point or border point on the halo

14
Stage 3 : Find Merging Mapping
Deciding whether clusters in adjacent spaces need to be merged

15
Stage 3 : Find Merging Mapping
Theorem 1: Let MC1(C1, S1) = AP1 ∪ BP1, where AP1 is the set of core points and BP1 is the set of border points w.r.t. space constraint S1. Let MC2(C2, S2) = AP2 ∪ BP2, where AP2 is the set of core points and BP2 is the set of border points w.r.t. space constraint S2. If S1 and S2 are bordering

16
Stage 3 : Find Merging Mapping

17
Stage 4 : Merge
Build Global Mapping -> Merge and Relabel
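The slide gives no detail on how the global mapping is built. A common way to turn pairwise merge decisions into global cluster ids is a union-find (disjoint-set) structure, sketched below. This is a swapped-in standard technique, not necessarily the authors' method, and `build_global_mapping` together with the (partition, local id) key format is hypothetical.

```python
def build_global_mapping(local_clusters, merge_pairs):
    """Map each local cluster key to a global cluster id using union-find.

    local_clusters: iterable of hashable, orderable keys, e.g. (partition, local id).
    merge_pairs: iterable of (key_a, key_b) pairs that must share a global cluster.
    """
    local_clusters = list(local_clusters)
    parent = {c: c for c in local_clusters}

    def find(c):
        # Follow parent links to the root, halving the path on the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in merge_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Deterministic choice: the smaller key becomes the root.
            parent[max(ra, rb)] = min(ra, rb)

    roots = sorted({find(c) for c in local_clusters})
    global_id = {root: i for i, root in enumerate(roots)}
    return {c: global_id[find(c)] for c in local_clusters}
```

Relabeling then amounts to replacing each point's (partition, local id) label with the global id from the returned mapping.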

18
Evaluation
Experiment environment
- 13-node cluster
- Each node has a 3.0GHz i7 950 (quad-core), 8GB RAM, 2TB HDD
- Ubuntu, Hadoop
- Block size : 64MB
Data Set
- Shanghai taxi GPS logs

19
Evaluation
Each location point is normalized into the range [0, 1)
Two DBSCAN configurations
- WL-1 Eps : 0.002, MinPts : 1,000
- WL-2 Eps : , MinPts : ds-4

20
Evaluation
[Figure: WL-1, SPD by node for ds1, ds2, ds3, ds4; (2/12), (4/12), (6/12)]

21
Conclusions
In this paper, we implement an efficient parallel DBSCAN algorithm in a 4-stage MapReduce paradigm.
We analyze and propose a practical data partitioning strategy for large-scale non-indexed spatial data.
We apply our work to a real-world spatial dataset containing over 1.9 billion raw GPS records, and run our experiments on a lab-size 13-node cluster.
The experimental results show good speedup and scale-up performance.
We observe that road-map-based spatial data is highly skewed along the road network. If a main road happens to lie in the replication area after partitioning, computation and data replication increase dramatically.
One direction for future work is to make the partitioning strategy aware of this observation and to minimize the size of the MC sets. The challenge is that performance is still highly restricted by the distribution of the raw spatial data.

© 2016 SlidePlayer.com Inc.

All rights reserved.
