Parallel Density-based Hybrid Clustering


Parallel Density-based Hybrid Clustering Baoying Wang October 18th, 2008 The Sixth Virtual Conference on Genomics and Bioinformatics

Outline
- Introduction
- Sequential DH-Clustering
- The Parallel Clustering Approach
- Experimental Results
- Conclusions

Introduction Clustering techniques partition a data set into groups such that similar items fall into the same group [2]. Data clustering is a common data mining technique. Clustering methods are broadly divided into partitioning methods and hierarchical methods. Hierarchical clustering is more flexible than partitioning clustering, but it is computationally very expensive for large data sets. Scalable parallel computers can be used to speed up hierarchical clustering, and recently there has been increasing interest in parallel implementations of data clustering algorithms. However, most existing parallel approaches have been developed for traditional agglomerative clustering.

Introduction (cont.) In this paper, we propose a parallel algorithm that implements density-based hybrid clustering on MIMD (Multiple Instruction stream, Multiple Data stream) parallel machines using MPI. The DH-Clustering method first clusters a dataset into a set of preliminary attractor trees and then merges the attractor trees until the whole dataset becomes one tree. DH-Clustering is a hybrid method, a combination of partitioning clustering and hierarchical clustering. It is faster than traditional hierarchical clustering but still does not scale well as the data size increases. Experiments demonstrate that our parallel approach speeds up the sequential clustering tremendously while producing comparably good clustering results.

Sequential DH-Clustering The basic idea of DH-Clustering is to partition the data set into clusters in terms of local density attractor trees. Given a data point x, if we follow the steepest density-ascending path, the path finally leads to a local density attractor. All points whose steepest ascending paths lead to the same local attractor form a local attractor tree. If x doesn't have such a path, it is either an attractor itself or a noise point. The local attractor trees are the preliminary clusters; the resulting graph is a collection of local attractor trees with the local attractors as roots. After the local attractor trees (preliminary clusters) are built, the cluster merging process combines the most similar cluster pair level by level based on a similarity measure. When two clusters are merged, their two local attractor trees are combined into a new attractor tree. The merging proceeds recursively until there is one tree left.
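The following is a minimal sketch of how a single point's steepest density-ascending path can be followed, assuming 2-D points, a precomputed density value per point, and a step range that bounds each hop; the names (Point, densestWithinStep, findAttractor) are illustrative and not taken from the paper.

```cpp
#include <cmath>
#include <vector>

// Illustrative 2-D point type; the paper's data may have more dimensions.
struct Point { double x, y; };

// Index of the densest point within the step range of point i
// (returns i itself when no denser neighbor lies within the step).
int densestWithinStep(int i, const std::vector<Point>& pts,
                      const std::vector<double>& density, double step) {
    int best = i;
    for (int j = 0; j < static_cast<int>(pts.size()); ++j) {
        double d = std::hypot(pts[i].x - pts[j].x, pts[i].y - pts[j].y);
        if (d <= step && density[j] > density[best]) best = j;
    }
    return best;
}

// Follow the steepest density-ascending path from point i until it reaches
// a local density attractor (a point that is the densest within its own
// step range).  Densities strictly increase along the path, so it terminates.
int findAttractor(int i, const std::vector<Point>& pts,
                  const std::vector<double>& density, double step) {
    int cur = i;
    for (;;) {
        int next = densestWithinStep(cur, pts, density, step);
        if (next == cur) return cur;   // cur is a local density attractor
        cur = next;
    }
}
```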

Analysis of the DH-Clustering DH-Clustering consists of five major steps: (1) loading data, (2) calculating densities, (3) building local attractor trees, (4) merging, and (5) outputting results. We will focus on the three major computational steps: (2), (3), and (4).

Analysis of the DH-Clustering (cont.) Step 2 calculates the density of each data point. To find the density of a data point, the number of neighbors falling in each equal-interval ring (EINring neighbors) needs to be counted. If the dataset size is n, the time to find the EINring neighbors within one ring is O(n). With the neighborhood divided into m rings, the time to calculate the density of one data point is O(m*n). Therefore, the total time to calculate the densities of all n points is O(m*n²).
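A minimal sketch of the density step, continuing the earlier sketch (same includes and Point type). For brevity it bins each neighbor into its ring in a single pass rather than scanning once per ring as in the complexity analysis, and the way ring counts are weighted into a density value is an assumption; the paper does not give the exact formula.

```cpp
// Densities from EINring neighbor counts: each point's neighborhood of
// radius R is divided into m equal-interval rings, neighbors are counted
// per ring, and inner rings are weighted more heavily (assumed weighting).
std::vector<double> computeDensities(const std::vector<Point>& pts,
                                     int m, double R) {
    const int n = static_cast<int>(pts.size());
    const double ringWidth = R / m;
    std::vector<double> density(n, 0.0);

    for (int i = 0; i < n; ++i) {
        std::vector<int> ringCount(m, 0);
        for (int j = 0; j < n; ++j) {                 // scan all other points
            if (i == j) continue;
            double d = std::hypot(pts[i].x - pts[j].x, pts[i].y - pts[j].y);
            if (d < R) ringCount[static_cast<int>(d / ringWidth)]++;
        }
        for (int r = 0; r < m; ++r)                   // combine ring counts
            density[i] += ringCount[r] * static_cast<double>(m - r) / m;
    }
    return density;
}
```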

Analysis of the DH-Clustering (cont.) Step 3 builds the local attractor trees. The process starts with an arbitrary point and extends the path to the densest point within a specified step range; it continues recursively until the path reaches a peak (density attractor). There are two extreme cases: (1) the step is so large that the whole data set collapses into a single attractor tree; (2) the step is so small that every point is an attractor. In case (1), it takes O(n²) time to build the one attractor tree, and in case (2) the time to build n attractor trees is also O(n²). Therefore, the average time to build the local attractor trees is O(n²).
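Below is a minimal sketch of step 3 built on the helpers from the earlier sketch: every point is linked to the densest point within its step range, so the points form a forest whose roots (points linked to themselves) are the local density attractors. This per-point formulation is a simplifying assumption; the paper describes growing the paths recursively, but both yield the same attractor trees.

```cpp
// parent[i] is the next point on i's steepest ascending path; a point
// that is its own parent is a local density attractor (tree root).
std::vector<int> buildAttractorTrees(const std::vector<Point>& pts,
                                     const std::vector<double>& density,
                                     double step) {
    const int n = static_cast<int>(pts.size());
    std::vector<int> parent(n);
    for (int i = 0; i < n; ++i)
        parent[i] = densestWithinStep(i, pts, density, step);  // O(n) per point, O(n^2) total
    return parent;
}
```

Each preliminary cluster is then the set of points whose findAttractor result is the same root.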

Analysis of the DH-Clustering (cont.) Step 4 is the merging process. It starts with k attractor trees and repeatedly merges the closest pair of attractor trees until there is one tree left. Finding the closest pair among k attractor trees takes O(k²) time, and (k-1) merges are needed to reduce k trees to one. Therefore, the whole merging process takes O((k-1)*k²), i.e., O(k³), time. The cost of the merging process thus depends only on the value of k. In our experiments, k is much smaller than the data size, so the merging process is generally not as expensive as the previous two steps.
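A minimal sketch of the merging loop, continuing the earlier sketches. The similarity measure between attractor trees is not spelled out here, so the Euclidean distance between tree attractors is assumed, and the member lists of the merged trees are omitted.

```cpp
#include <limits>

// One cluster per local attractor tree; only the root (attractor) is kept
// in this sketch.
struct Cluster { Point attractor; };

void mergeClusters(std::vector<Cluster>& clusters) {
    while (clusters.size() > 1) {                 // k-1 merge rounds
        std::size_t a = 0, b = 1;
        double best = std::numeric_limits<double>::max();
        for (std::size_t i = 0; i < clusters.size(); ++i)      // O(k^2) closest-pair search
            for (std::size_t j = i + 1; j < clusters.size(); ++j) {
                double d = std::hypot(clusters[i].attractor.x - clusters[j].attractor.x,
                                      clusters[i].attractor.y - clusters[j].attractor.y);
                if (d < best) { best = d; a = i; b = j; }
            }
        // Combine tree b into tree a (keeping a's attractor as the new root
        // is an assumption) and drop b from the working set.
        clusters.erase(clusters.begin() + b);
    }
}
```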

Parallel DH-Clustering Our parallel algorithm is designed to run on p parallel machines. Our first thought was to divide the dataset into p parts and assign one part to each machine. However, a problem arose during density calculation: if each data part were isolated from the others when densities were calculated, the density values would not be correct, especially when the data portion on each machine covered data from all over the data space. To solve this problem, we load the whole data set onto each machine, but each machine calculates the densities only for the points assigned to it. In this way, we achieve high efficiency without loss of accuracy. The complexity of the density calculation on p machines is reduced from the sequential O(m*n²) to O(m*n²/p).
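A minimal MPI sketch of this partitioning, assuming the full dataset is available on every rank and that the sequential density routine has been refactored into a hypothetical densityRange(pts, lo, hi) helper that scores only the points in [lo, hi) while still scanning all n points as neighbors; loadWholeDataSet is likewise a hypothetical loader.

```cpp
#include <mpi.h>
#include <vector>

struct Point { double x, y; };                       // as in the earlier sketches

std::vector<Point> loadWholeDataSet();               // hypothetical: full data on every rank
std::vector<double> densityRange(const std::vector<Point>& pts,
                                 std::size_t lo, std::size_t hi);  // hypothetical helper

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, p = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    std::vector<Point> pts = loadWholeDataSet();     // whole data set replicated

    // Each machine handles an n/p slice of the points, but the neighbor scan
    // still covers all n points, so every density value matches the
    // sequential result exactly: O(m*n^2/p) work per machine.
    std::size_t n  = pts.size();
    std::size_t lo = rank * n / p;
    std::size_t hi = (rank + 1) * n / p;
    std::vector<double> myDensity = densityRange(pts, lo, hi);

    // ... build local attractor trees over the points in [lo, hi) only ...

    MPI_Finalize();
    return 0;
}
```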

Parallel DH-Clustering (cont.) After the densities are calculated, each machine builds local attractor trees over its own data only. Since the data size on each machine is n/p, the parallel complexity of this step is reduced from the sequential O(n²) to O(n²/p²). The number of local attractor trees built on each individual machine is generally smaller than in the sequential approach, but the combined number can be larger: if the sequential approach produces k attractor trees, then with the data divided among p machines each machine may produce up to k local attractor trees, so the total might be up to p*k. This does not affect the final clustering result; it only adds more layers to the clustering structure, since the merging process starts with more local attractors.

Parallel DH-Clustering (cont.) There are two ways to merge the attractor trees. One way is to let the local attractor trees stay on their own machines and assign one machine to coordinate the merging; the closest tree pair is then chosen by polling all the machines at each merging step. The other way is to collect all local attractor trees onto one machine and carry out the merging there. Our experiments showed that the first approach, although seemingly better load-balanced and computationally efficient, turned out to be very slow because of the communication time, so we adopted the second approach.

Some Technical Issues Since the number of local attractor trees differs from machine to machine and the trees themselves have different sizes, it is very inefficient to send one tree at a time. Instead, each machine packs all of its local attractor trees into one large sending buffer; a special delimiter is inserted between neighboring trees, and each tree begins with its attractor. Another issue is that the sending buffers have different sizes on different machines because the numbers of attractors differ, which makes it hard for the root machine to gather data of varying sizes. To solve this, the root first gathers the size of the sending buffer from each machine and then prepares an array of displacements into the receiving buffer based on these sizes, so the data from each machine goes directly to the position determined by its displacement.
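This two-phase gather maps naturally onto MPI_Gather followed by MPI_Gatherv. The sketch below assumes each machine has already serialized its trees into a flat buffer of doubles with a sentinel value as the delimiter; the exact encoding is not specified in the paper.

```cpp
#include <mpi.h>
#include <vector>

// Gather variable-sized tree buffers from all ranks onto the root.
void gatherTrees(const std::vector<double>& sendBuf, int root, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    // Phase 1: the root gathers the size of each rank's send buffer.
    int mySize = static_cast<int>(sendBuf.size());
    std::vector<int> sizes(p);
    MPI_Gather(&mySize, 1, MPI_INT, sizes.data(), 1, MPI_INT, root, comm);

    // Phase 2: the root builds a displacement array and receives each
    // rank's buffer directly at the position given by its displacement.
    std::vector<int> displs(p, 0);
    std::vector<double> recvBuf;
    if (rank == root) {
        for (int i = 1; i < p; ++i) displs[i] = displs[i - 1] + sizes[i - 1];
        recvBuf.resize(displs[p - 1] + sizes[p - 1]);
    }
    MPI_Gatherv(sendBuf.data(), mySize, MPI_DOUBLE,
                recvBuf.data(), sizes.data(), displs.data(), MPI_DOUBLE,
                root, comm);
    // The root can now split recvBuf at the delimiters and rebuild the
    // attractor trees before the final merging phase.
}
```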

Parallel DH-Clustering (cont.) Once all attractor trees are gathered into the receiving buffer at the root machine, all machines except the root stop. The root machine then dissects the receiving buffer at the delimiters and recovers the attractor trees from the received data. With all the recovered attractor trees, the merging process proceeds exactly as in the sequential algorithm. The figure on the next slide illustrates the parallel DH-Clustering process.
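A minimal sketch of the unpacking step at the root, under the same serialization assumption as the gather sketch above (flat double buffer, sentinel delimiter, attractor stored first within each tree).

```cpp
// Split the gathered buffer at the delimiter sentinel; each resulting
// segment is one serialized attractor tree whose first value is its attractor.
std::vector<std::vector<double>> unpackTrees(const std::vector<double>& recvBuf,
                                             double delimiter) {
    std::vector<std::vector<double>> trees;
    std::vector<double> current;
    for (double v : recvBuf) {
        if (v == delimiter) {                          // end of one tree
            if (!current.empty()) trees.push_back(current);
            current.clear();
        } else {
            current.push_back(v);
        }
    }
    if (!current.empty()) trees.push_back(current);    // last tree, no trailing delimiter
    return trees;
}
```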

[Figure: parallel DH-Clustering workflow. The whole data set is replicated on machines M0, M1, ..., Mp; each machine Mi runs the density calculation for its assigned data part Di, builds local attractor trees, and packs the trees into its sending buffer; the root gathers all buffers into a receiving buffer and runs the merging process.]

Experimental Results We implemented both the sequential DH-Clustering and the parallel DH-Clustering in C++ on BigBen at the Pittsburgh Supercomputing Center; the parallel DH-Clustering uses MPI. BigBen is a Cray XT3 MPP system with 2068 compute nodes, each with two 2.6 GHz AMD Opteron processors and 2 GB of memory. In the experiments, we compared the run times and clustering results of the sequential and parallel DH-Clustering.

Run Time Comparison

Comparison of Clustering Results

        Seq    2M     4M     6M     8M     10M
3K      14     21     34     112    200    264
6K      18     22     39     81     203    233
12K     36     42     86     151    230    294
18K     56     158    218    347    379
24K     72     128    199    288    439
30K     90     98     159    245    410    543

Conclusions In this paper, we presented parallel density-based hybrid clustering (parallel DH-Clustering). The algorithm was implemented on the BigBen supercomputer using MPI. Our experiments show that the parallel DH-Clustering is much faster than the sequential approach, with especially large improvements for big data sets, while the clustering results of the parallel approach remain comparable to those of the sequential one. In the future, we would like to test our method on larger datasets, including real data sets, and to determine experimentally the optimal number of machines for clustering data sets of different sizes. These parameters would be a useful guide for users of parallel clustering algorithms.

Thank you! Questions?