Accelerating K-Means Clustering with Parallel Implementations and GPU Computing
Janki Bhimani, Miriam Leeser, Ningfang Mi

Presentation transcript:

Accelerating K-Means Clustering with Parallel Implementations and GPU Computing
Janki Bhimani, Miriam Leeser, Ningfang Mi
Electrical and Computer Engineering Dept., Northeastern University, Boston, MA
Supported by:

Introduction

Era of Big Data
– Facebook loads terabytes of compressed data per day
– Google processes more than 20 PB of data per day

Handling Big Data
Smart data processing:
– Data classification
– Data clustering
– Data reduction
Fast processing:
– Parallel computing (MPI, OpenMP)
– GPUs

Clustering
Unsupervised classification of data into groups with similar features.
Used to address:
– Feature extraction
– Data compression
– Dimension reduction
Methods:
– Neural networks
– Distribution-based
– Iterative learning

K-means Clustering
One of the most popular centroid-based clustering algorithms: an unsupervised, iterative machine learning algorithm that partitions n observations into k clusters.
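To make the algorithm concrete, here is a minimal sequential sketch of one K-means iteration in C. It is an illustration, not the authors' code: assign each point to its nearest centroid, then recompute each centroid as the mean of its members.

```c
#include <float.h>
#include <stdlib.h>

/* One K-means iteration: assign each of n points (d features each) to its
 * nearest of k centroids, then recompute each centroid as the mean of its
 * members. Returns how many points changed cluster, which the
 * tolerance-based convergence test uses later. */
size_t kmeans_step(const float *pts, size_t n, size_t d, size_t k,
                   float *centroids, int *labels) {
    size_t changed = 0;
    float *sums = calloc(k * d, sizeof *sums);
    size_t *counts = calloc(k, sizeof *counts);

    for (size_t i = 0; i < n; i++) {
        float best = FLT_MAX;
        int best_c = 0;
        for (size_t c = 0; c < k; c++) {          /* nearest centroid */
            float dist = 0.0f;
            for (size_t f = 0; f < d; f++) {
                float diff = pts[i * d + f] - centroids[c * d + f];
                dist += diff * diff;              /* squared Euclidean */
            }
            if (dist < best) { best = dist; best_c = (int)c; }
        }
        if (labels[i] != best_c) { labels[i] = best_c; changed++; }
        counts[best_c]++;
        for (size_t f = 0; f < d; f++)
            sums[best_c * d + f] += pts[i * d + f];
    }
    for (size_t c = 0; c < k; c++)                /* recompute means */
        if (counts[c] > 0)
            for (size_t f = 0; f < d; f++)
                centroids[c * d + f] = sums[c * d + f] / counts[c];

    free(sums);
    free(counts);
    return changed;
}
```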

Contributions
– A K-means implementation whose convergence is driven by the dataset and user input
– A comparison of different styles of parallelism on different platforms:
   – Shared memory: OpenMP
   – Distributed memory: MPI
   – Graphics Processing Unit: CUDA
– Speed-up of the algorithm through parallel initialization

K-means Clustering

Parallel Implementation
Which part should be parallelized? Profiling with gprof shows that 90% of the total time is spent calculating the nearest centroid.
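Because the nearest-centroid search is independent across points, it is the natural target for a parallel-for. A sketch of the assignment step with OpenMP (our illustration, not the paper's exact code):

```c
#include <float.h>
#include <stddef.h>

/* OpenMP sketch of the assignment step: the nearest-centroid search is
 * independent across points, so a parallel-for with a reduction on the
 * "changed" counter covers the 90% hot spot identified by gprof. */
size_t assign_parallel(const float *pts, size_t n, size_t d, size_t k,
                       const float *centroids, int *labels) {
    size_t changed = 0;
    #pragma omp parallel for reduction(+:changed) schedule(static)
    for (size_t i = 0; i < n; i++) {
        float best = FLT_MAX;
        int best_c = 0;
        for (size_t c = 0; c < k; c++) {
            float dist = 0.0f;
            for (size_t f = 0; f < d; f++) {
                float diff = pts[i * d + f] - centroids[c * d + f];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; best_c = (int)c; }
        }
        if (labels[i] != best_c) { labels[i] = best_c; changed++; }
    }
    return changed;
}
```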

Parallel Feature Extraction
Most time-consuming steps (the original slide shows a diagram contrasting the sequential and parallel versions, broken out into calculation and communication time).

Other Major Challenges
Three factors that affect K-means clustering execution time:
– Initializing centroids
– Number of centroids (K)
– Number of iterations (I)

Improved Parallel Initialization
Goal: find a good set of initial centroids.
Our method: exploit parallelism during initialization.
– Each candidate set of means is refined independently on its own thread for 5 iterations on a subset of the dataset
Best quality:
– Minimum intra-cluster distance
– Maximum inter-cluster distance
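A minimal OpenMP sketch of this scheme, assuming the kmeans_step() routine from the earlier sketch and a hypothetical intra_cluster_dist() helper for the quality score:

```c
#include <float.h>
#include <string.h>

/* From the earlier sketch: */
size_t kmeans_step(const float *pts, size_t n, size_t d, size_t k,
                   float *centroids, int *labels);
/* Hypothetical quality helper: total squared distance of each point to its
 * assigned centroid (smaller = tighter clusters). */
float intra_cluster_dist(const float *pts, size_t n, size_t d, size_t k,
                         const float *centroids, const int *labels);

/* Each thread refines its own randomly seeded candidate centroid set for
 * 5 iterations on a subset of the data; the candidate with the minimum
 * intra-cluster distance becomes the starting point for the full run. */
void parallel_init(const float *pts, size_t n_sub, size_t d, size_t k,
                   float *candidates, int n_cand, int *labels_scratch,
                   float *best_centroids) {
    float best_score = FLT_MAX;
    #pragma omp parallel for
    for (int s = 0; s < n_cand; s++) {
        float *cand = &candidates[(size_t)s * k * d];
        int *labels = &labels_scratch[(size_t)s * n_sub];
        for (int it = 0; it < 5; it++)
            kmeans_step(pts, n_sub, d, k, cand, labels);
        float score = intra_cluster_dist(pts, n_sub, d, k, cand, labels);
        #pragma omp critical
        {
            if (score < best_score) {             /* keep the best seeding */
                best_score = score;
                memcpy(best_centroids, cand, k * d * sizeof(float));
            }
        }
    }
}
```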

Drop-out Technique
Goal: determine the proper number of clusters (K).
Method:
– Initially give an upper limit of K as input
– Drop clusters that have no points assigned
(Example from the slides: an upper limit of K = 12 drops out to K = 4.)
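One way to realize drop-out, as a sketch under our assumptions: after an assignment pass, compact away any centroid whose cluster received no points. The counts array is assumed to come from the assignment step.

```c
#include <stddef.h>
#include <string.h>

/* Drop-out sketch: compact away centroids whose clusters received no
 * points, shrinking K toward its natural value. counts[c] is the number
 * of points assigned to cluster c in the last assignment pass. */
size_t drop_empty_clusters(float *centroids, size_t d, size_t k,
                           const size_t *counts) {
    size_t kept = 0;
    for (size_t c = 0; c < k; c++) {
        if (counts[c] == 0)
            continue;                      /* drop: no points assigned */
        if (kept != c)                     /* slide survivor down (no overlap) */
            memcpy(&centroids[kept * d], &centroids[c * d],
                   d * sizeof(float));
        kept++;
    }
    return kept;                           /* the new, smaller K */
}
```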

Convergence
When to stop iterating?
Tolerance: track the number of points changing clusters in a given iteration compared to the prior iteration.
The total number of iterations depends on the input size, contents, and tolerance:
– It does not need to be given as input
– It is decided at runtime
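A sketch of this runtime-decided stopping rule, reusing kmeans_step() from the earlier sketch: iterate until the fraction of points that switched clusters falls below the tolerance.

```c
#include <stddef.h>

size_t kmeans_step(const float *pts, size_t n, size_t d, size_t k,
                   float *centroids, int *labels);   /* earlier sketch */

/* Iterate until the fraction of points that changed cluster drops below
 * the user-supplied tolerance, so the iteration count never has to be
 * given as input. Returns the number of iterations actually run. */
size_t kmeans_until_converged(const float *pts, size_t n, size_t d, size_t k,
                              float *centroids, int *labels, double tol) {
    size_t iters = 0;
    size_t changed;
    do {
        changed = kmeans_step(pts, n, d, k, centroids, labels);
        iters++;
    } while ((double)changed / (double)n >= tol);
    return iters;
}
```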

Parallel Implementation

Three Forms of Parallelism
– Shared memory (OpenMP)
– Distributed memory (MPI: Message Passing Interface)
– Graphics Processing Units (CUDA-C: Compute Unified Device Architecture)
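For the distributed-memory case, a common pattern (a sketch under our assumptions, not necessarily the authors' exact design) is to give each MPI rank a slice of the points and combine per-cluster partial sums with MPI_Allreduce, so every rank updates identical centroids:

```c
#include <mpi.h>
#include <stdlib.h>

/* Each rank owns a slice of the points and computes local per-cluster sums
 * and counts with the assignment loop shown earlier; two MPI_Allreduce
 * calls combine them so every rank updates identical centroids. */
void mpi_update_centroids(const float *local_sums, const int *local_counts,
                          float *centroids, int k, int d) {
    float *sums = malloc((size_t)k * d * sizeof *sums);
    int *counts = malloc((size_t)k * sizeof *counts);

    MPI_Allreduce(local_sums, sums, k * d, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);
    MPI_Allreduce(local_counts, counts, k, MPI_INT, MPI_SUM,
                  MPI_COMM_WORLD);

    for (int c = 0; c < k; c++)                /* identical on every rank */
        if (counts[c] > 0)
            for (int f = 0; f < d; f++)
                centroids[c * d + f] = sums[c * d + f] / counts[c];

    free(sums);
    free(counts);
}
```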

Evaluation

Experiments
Input dataset: 2D color images.
Five features per pixel: the RGB channels (three) and the x and y positions (two).
Setup:
– Compute nodes: dual Intel E-series CPUs with 16 physical and 32 logical cores
– GPU nodes: NVIDIA Tesla K20m with 2496 CUDA cores
We vary the image size, number of clusters, tolerance, and number of parallel processing tasks.
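A small sketch of how such five-dimensional feature vectors might be built from an image; the tightly packed, row-major 8-bit RGB layout is our assumption:

```c
/* Each pixel becomes a five-dimensional feature vector: R, G, B, x, y.
 * img is assumed to be row-major, tightly packed 8-bit RGB of size w x h. */
void build_features(const unsigned char *img, int w, int h, float *features) {
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const unsigned char *px = &img[(y * w + x) * 3];
            float *f = &features[(y * w + x) * 5];
            f[0] = px[0];        /* R */
            f[1] = px[1];        /* G */
            f[2] = px[2];        /* B */
            f[3] = (float)x;     /* column */
            f[4] = (float)y;     /* row */
        }
    }
}
```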

Results
Parallel versions perform better than sequential C.
The multi-threaded OpenMP version outperforms the rest, with a speed-up of 31x for a 300x300-pixel input image.
– The shared memory platform works well for small and medium datasets.
Table (original slide): time for a 300x300-pixel input image, with columns K, Iter., Seq. (s), OpenMP (s), MPI (s), CUDA (s); K after drop-out = 78, speed-up = 30.93 (tolerance value and data rows not preserved in the transcript).

Parallel versions perform better than sequential C.
CUDA performs best for a 1164x1200-pixel input image, with a 30x speed-up.
– The GPU is best when working with large datasets.
Table (original slide): time for a 1164x1200-pixel input image, with columns K, Iter., Seq. (s), OpenMP (s), MPI (s), CUDA (s); K after drop-out = 217, speed-up = 30.26 (tolerance value and data rows not preserved in the transcript).
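On the GPU side, the assignment step maps naturally to one CUDA thread per point. The kernel below is an illustrative sketch with centroids staged in shared memory, not the authors' exact kernel:

```cuda
#include <cfloat>

/* One thread per point; centroids are staged in shared memory since every
 * thread reads all of them. Illustrative sketch of the assignment step. */
__global__ void assign_kernel(const float *pts, int n, int d, int k,
                              const float *centroids, int *labels) {
    extern __shared__ float s_centroids[];        /* k * d floats */
    for (int j = threadIdx.x; j < k * d; j += blockDim.x)
        s_centroids[j] = centroids[j];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float best = FLT_MAX;
    int best_c = 0;
    for (int c = 0; c < k; c++) {
        float dist = 0.0f;
        for (int f = 0; f < d; f++) {
            float diff = pts[i * d + f] - s_centroids[c * d + f];
            dist += diff * diff;
        }
        if (dist < best) { best = dist; best_c = c; }
    }
    labels[i] = best_c;
}

/* Example launch: 256 threads per block, dynamic shared memory for the
 * centroids:
 *   assign_kernel<<<(n + 255) / 256, 256, k * d * sizeof(float)>>>(
 *       d_pts, n, d, k, d_centroids, d_labels);
 */
```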

Tolerance
Sequential computation vs. parallel computation with random sequential initialization (300x300-pixel image, K = 30, 16 OpenMP threads).
As the tolerance decreases, the speed-up compared to sequential C increases.

Parallel Initialization
Parallel computation with random initialization vs. parallel initialization (300x300-pixel and 1164x1200-pixel images, 16 threads; tolerance values not preserved in the transcript).
Parallel initialization gives an additional 1.5x to 2.5x speed-up over the parallel version.

Conclusions and Future Work

– Our K-means implementation tackles the major challenges of K-means
– K-means performance was evaluated across three parallel programming approaches
– Our experimental results show around 35x speed-up in total
– The shared memory platform with OpenMP performs best for smaller images, while a GPU with CUDA-C outperforms the rest for larger images
Future work:
– Investigate multiple GPUs and hybrid approaches: OpenMP-CUDA and MPI-CUDA
– Adapt our implementation to handle larger datasets

Thank You!
Janki Bhimani
Website:
Supported by: