Scalable Data Clustering with GPUs
Student: Andrew D. Pangborn (1)
Advisors: Dr. Muhammad Shaaban (1), Dr. Gregor von Laszewski (2), Dr. James Cavenaugh (3), Dr. Roy Melton (1)
1 - RIT; 2 - Indiana University; 3 - University of Rochester

Many scientific and business applications produce massive data sets. This data must often be classified into meaningful subsets before analysts can draw useful conclusions from it. Data clustering is the broad field of statistical analysis that groups (classifies) similar objects into relatively homogeneous sets called clusters. Data clustering has a history in a wide variety of fields, such as data mining, machine learning, geology, astronomy, and bioinformatics. Myriad techniques have been developed over the past 50 to 60 years to tackle the problem of clustering data. These algorithms typically have computational complexities that increase combinatorially as the dimensionality of the input and the number of classes (clusters) increase. In many fields the amount of data requiring clustering is growing rapidly, yet the performance gains from single CPU cores have declined over the past few years. The computational demands of multivariate clustering grow rapidly, making large data sets very time-consuming to cluster on a single CPU. Fortunately, these techniques lend themselves naturally to large-scale parallel processing. To address these computational demands, graphics processing units, specifically NVIDIA's CUDA framework and Tesla architecture, were investigated as a low-cost, high-performance solution for a number of clustering algorithms. C-means and Expectation Maximization with Gaussian mixture models were implemented using the CUDA framework. The implementations use a hybrid of CUDA, OpenMP, and MPI to scale to many GPUs across multiple nodes in a distributed-memory cluster environment.
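The first of the two algorithms named above, fuzzy c-means, admits a compact sequential sketch of its two update steps; the CUDA implementation parallelizes these same computations across GPU threads. The following is a minimal NumPy sketch for illustration, not the author's code (the function name and fuzzifier default are illustrative):

```python
import numpy as np

def cmeans_step(X, centers, m=2.0, eps=1e-9):
    """One fuzzy c-means iteration: update memberships, then centers.

    X:       (N, D) data vectors
    centers: (C, D) current cluster centers
    m:       fuzzifier (m > 1); m = 2 is a common default
    """
    # Squared Euclidean distance of every vector to every center: (N, C).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + eps
    # Membership of vector k in cluster i; each row sums to 1.
    inv = d2 ** (-1.0 / (m - 1.0))
    u = inv / inv.sum(axis=1, keepdims=True)
    # Center update: membership-weighted mean of the data.
    w = u ** m
    new_centers = (w.T @ X) / w.sum(axis=0)[:, None]
    return u, new_centers
```

In the GPU version described in this work, the distance and membership computations for the N independent vectors are the naturally data-parallel part that maps onto thousands of CUDA threads.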
This framework is envisioned as part of a larger cloud-based workflow service where scientists can apply multiple algorithms and parameter sweeps to their data sets and quickly receive a thorough set of results for further expert analysis. Using a cluster of 128 GPUs, a speedup of 6,300 with 72% parallel efficiency was observed relative to a 2.5 GHz Intel Xeon E5420 CPU.

The paradigm in general-purpose computing has been shifting over the past few years from individual processors getting faster to getting wider (more threads and cores), and this trend will likely continue in upcoming years. On the other side of the spectrum, graphics processing units (GPUs), which were once fixed-function parallel co-processors, are becoming increasingly programmable and therefore suitable for more general-purpose applications.

[Figure: a spectrum from fixed-function hardware to fully programmable processors, placing multi-threaded, multi-core, and many-core designs (e.g., Intel Larrabee, NVIDIA CUDA GPUs) between GPUs and CPUs.]

GPUs provide significantly higher floating-point performance (nearly an order of magnitude) and more memory bandwidth than CPUs. They also offer very high performance/cost and performance/watt ratios, which are becoming increasingly important for data centers due to heat and energy costs. Modern NVIDIA GPUs are many-core systems with multiprocessors capable of general-purpose computing. The Tesla architecture has up to 30 multiprocessors for a total of 240 processing cores, while NVIDIA's new Fermi architecture (launched in March 2010) boasts up to 512 processing cores. Independent tasks within a kernel map to individual multiprocessors. Threads within a multiprocessor map to the processing cores and can communicate via shared memory. These multiprocessors each handle hundreds of concurrent threads executing single-instruction multiple-data (SIMD) operations. Threads can diverge to perform different instructions when necessary, at the cost of throughput.
NVIDIA calls this model single-instruction multiple-thread (SIMT), which is more flexible for writing general-purpose applications than strict SIMD. CUDA is a framework that enables developers to target GPUs with general-purpose algorithms. This research uses C and the CUDA runtime to accelerate two clustering algorithms. CUDA programs consist of code that runs on the host CPU and device kernels that run on the GPU. The host code is written in ANSI C and compiled with gcc. The kernels are written in C for CUDA and compiled by the nvcc compiler to a hardware-independent pseudo-assembly language called PTX. The CUDA driver has a JIT compiler that converts the PTX code to machine code for the target hardware at runtime and handles communication between the host and device.

Multi-Tier Parallelization Hierarchy
- MPI: internode communication; one process per server node.
- OpenMP: intranode communication; one CPU thread per GPU.
- CUDA: kernel grids map to the many-core GPU architecture; thousands of lightweight threads compute fine-grained parallel tasks.

The data sets for clustering contain N independent vectors, which are evenly distributed to the server nodes. If a node has multiple GPUs, its vectors are further divided among multiple host-thread/GPU pairs (one OpenMP thread and CUDA context per GPU). After distributing the work, each host thread invokes a CUDA kernel. Within the GPU, large independent tasks, such as computing the covariance between different pairs of dimensions, are mapped to different multiprocessors. The individual threads then execute fine-grained parallel SIMD operations on the data elements. Results are copied from the GPU back to the host CPU. A reduction occurs across the different threads and server nodes to form a final result. The process repeats for all kernels until the algorithm converges to a final answer.
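The distribute/compute/reduce cycle described above can be sketched with a sequential stand-in, where a plain Python loop takes the place of the MPI, OpenMP, and CUDA layers (all names here are illustrative, not from the actual implementation):

```python
import numpy as np

def partial_mean_cov(chunk):
    """Per-GPU work (the 'kernel'): partial sums needed for mean/covariance."""
    n = chunk.shape[0]
    s = chunk.sum(axis=0)      # sum of vectors
    ss = chunk.T @ chunk       # sum of outer products x x^T
    return n, s, ss

def clustered_mean_cov(X, n_nodes=4):
    """Split N vectors evenly across nodes, compute partials, then reduce.

    Stands in for: MPI scatter -> one OpenMP thread + CUDA kernel per GPU
    -> reduction of partial results across threads and server nodes.
    """
    partials = [partial_mean_cov(chunk) for chunk in np.array_split(X, n_nodes)]
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    mean = s / n
    cov = ss / n - np.outer(mean, mean)   # E[x x^T] - mu mu^T (biased estimate)
    return mean, cov
```

The key property this illustrates is that the per-chunk partial sums combine exactly into the global statistics, which is what lets the work scale across GPUs and nodes with only a small reduction step at the end.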
TeraGrid Resources - Lincoln
The TeraGrid is a high-performance computing infrastructure bringing together resources from 11 sites across the United States. Lincoln is one machine on the TeraGrid, located at the National Center for Supercomputing Applications at the University of Illinois Urbana-Champaign. Lincoln contains 92 NVIDIA Tesla S1070 accelerator units for a total of 368 GPUs with 88,320 cores. This research uses Lincoln as a testing environment for developing parallel data clustering algorithms on multiple GPUs, supported in part by the National Science Foundation under TeraGrid grant number TG-MIP.

References
1. NVIDIA CUDA 3.0 Programming Guide. [Online]. Available:
2. NVIDIA CUDA Architecture Introduction & Overview. [Online]. Available:
3. National Center for Supercomputing Applications, "NCSA Intel 64 Tesla Linux Cluster Lincoln technical summary." [Online]. Available:
4. J. Espenshade, A. Pangborn, G. von Laszewski, D. Roberts, and J. Cavenaugh, "Accelerating partitional algorithms for flow cytometry on GPUs," in Parallel and Distributed Processing with Applications, 2009 IEEE International Symposium on, Aug. 2009, pp. 226–
5. Invitrogen, "Fluorescence tutorials: Intro to flow cytometry." [Online]. Available:

A single GPU provides a speedup of 1.8 to 2.0 orders of magnitude over an optimized C version running on a single core of a comparable modern Intel CPU (2.5 GHz Xeon E5420 quad core). This is a large improvement at a cost far below that of a desktop machine (the GTX 260 currently retails for about $200). The speedup is expected to be even greater with NVIDIA's new Fermi GPUs, which have more than twice the cores, significantly higher memory bandwidth, and more memory-caching capability than the Tesla-based cards. Testing was also performed on the Lincoln supercomputer. The algorithm achieved near-ideal speedup with 32 GPUs, and 72% parallel efficiency with 128 GPUs, for a total speedup over a CPU of 6,300.
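As a sanity check on the reported numbers: parallel efficiency is the aggregate speedup divided by the number of GPUs times the single-GPU speedup. Assuming a single-GPU speedup of about 68x (a value consistent with, but not stated in, the "1.8 to 2.0 orders of magnitude" figure), the 128-GPU result works out to roughly 72%:

```python
# Parallel efficiency = aggregate speedup / (number of GPUs * single-GPU speedup).
single_gpu_speedup = 68.0     # assumed; the source quotes 1.8 to 2.0 orders of magnitude
n_gpus = 128
aggregate_speedup = 6300.0

efficiency = aggregate_speedup / (n_gpus * single_gpu_speedup)
print(f"{efficiency:.0%}")    # -> 72%
```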
This allows the clustering of a typical flow cytometry file in about 5 seconds rather than 10 hours. Queuing time on a shared resource such as Lincoln may make it impractical for individual FCS files, but it is certainly viable for achieving high throughput with a large number of flow cytometry files in a clinical study.

Flow cytometry is a mainstay technology used by biologists and immunologists for analyzing cells suspended in a fluid, applied across a variety of clinical and biological research areas. The conventional analysis of flow cytometry data is sequential bivariate gating (where an operator manually draws bins around groups of cells in 2D plots), but there has been a surge of research in recent years into automated multivariate clustering algorithms for analyzing flow data. Flow cytometry generates large data sets on the order of 10^6 vectors with 24 dimensions and many possible clusters. Data sets of this size and complexity have huge computational requirements, making clustering slow at best, and often simply impractical, on a single CPU. Flow cytometry data was the inspiration for the GPU-based acceleration in this research.

Pictured below is an overview of a typical flow cytometer [5]. A laser hits cells flowing through a tube one at a time (ideally), and the light is measured by various detectors, forming a multidimensional vector for each cell. On the right is a density plot of two dimensions (side light scatter and forward light scatter) from a flow cytometry data file. The CPU/GPU spectrum figure above is based on an Intel Larrabee presentation slide from Supercomputing 09 in Portland, Oregon.
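The second implemented algorithm, Expectation Maximization with Gaussian mixture models, is the kind of automated multivariate clustering applied to flow data. A single EM iteration can be sketched in plain NumPy (an illustrative sketch under simplified assumptions, not the CUDA implementation; the function name is hypothetical):

```python
import numpy as np

def em_gmm_step(X, weights, means, covs):
    """One EM iteration for a Gaussian mixture model.

    X: (N, D) data; weights: (C,); means: (C, D); covs: (C, D, D).
    """
    N, D = X.shape
    C = len(weights)
    # E-step: responsibility of each component for each vector.
    resp = np.empty((N, C))
    for c in range(C):
        diff = X - means[c]
        inv = np.linalg.inv(covs[c])
        norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(covs[c]))
        maha = np.einsum('nd,de,ne->n', diff, inv, diff)   # squared Mahalanobis
        resp[:, c] = weights[c] * norm * np.exp(-0.5 * maha)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixture weights, means, and covariances.
    nk = resp.sum(axis=0)
    weights = nk / N
    means = (resp.T @ X) / nk[:, None]
    covs = np.empty((C, D, D))
    for c in range(C):
        diff = X - means[c]
        covs[c] = (resp[:, c, None] * diff).T @ diff / nk[c] + 1e-6 * np.eye(D)
    return weights, means, covs
```

In the GPU implementation described above, the per-vector E-step work and the per-dimension-pair covariance sums are the pieces mapped onto CUDA multiprocessors, with the reduction performed across threads and nodes.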