Euro-Par, 2006 ICS 2009 A Translation System for Enabling Data Mining Applications on GPUs Wenjing Ma Gagan Agrawal The Ohio State University ICS 2009.

Presentation transcript:


Motivation and Overview
Two popular trends
  – Data-intensive computing
  – GPU programming
Seems like a good match
Can we ease use of GPGPUs?
  – Domain-specific programming tool
  – Can exploit common programming structure
  – Enable good speedups

Context
Many years of work on compiler and runtime support for data-intensive applications
  – Clusters, SMPs, cluster of SMPs
  – FREERIDE and language front-ends
Similar to map-reduce but …
  – Predates it and performs better!
  – Recent work on (cluster of) multi-cores, incorporating RSTM
GPUs
  – C and Matlab front-end
  – Cluster of GPUs, multi-core and GPUs

Outline
Background
GPU Computing
Parallel Data Mining
Challenges of Data Mining on GPU
Architecture of the System
  – Sequential code analysis
  – Generation of CUDA programs
  – Optimization techniques
Experimental Results
  – k-means, EM, PCA
Related and future work

Background - GPU Computing
Many-core architectures/accelerators are becoming more popular
GPUs are inexpensive and fast
CUDA is a high-level language for GPU programming

CUDA Programming
Significant improvement over use of graphics libraries
But:
  – Need detailed knowledge of the GPU architecture and a new language
  – Must specify the grid configuration
  – Deal with memory allocation and movement
  – Explicit management of memory hierarchy

Parallel Data Mining
Common structure of data mining applications (FREERIDE)

/* outer sequential loop */
while () {
    /* reduction loop */
    Foreach (element e) {
        (i, val) = process(e);
        Reduc(i) = Reduc(i) op val;
    }
}
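As a concrete instance of the reduction loop above, here is a minimal sequential C++ sketch. It is only an illustration, not the FREERIDE API: `process` and `reduction_pass` are hypothetical names, the operator `op` is assumed to be addition, and the element-to-index mapping (nearest 1-D center, as in a k-means assignment step) is chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical process(e): maps an element to (index, value).
// Here the index is the nearest 1-D center and the value is the element.
std::pair<std::size_t, double> process(double e, const std::vector<double>& centers) {
    assert(!centers.empty());
    std::size_t best = 0;
    double best_dist = (e - centers[0]) * (e - centers[0]);
    for (std::size_t c = 1; c < centers.size(); ++c) {
        const double d = (e - centers[c]) * (e - centers[c]);
        if (d < best_dist) {
            best_dist = d;
            best = c;
        }
    }
    return {best, e};
}

// One pass of the reduction loop: Reduc(i) = Reduc(i) + val.
std::vector<double> reduction_pass(const std::vector<double>& data,
                                   const std::vector<double>& centers) {
    std::vector<double> reduc(centers.size(), 0.0);
    for (double e : data) {
        const std::pair<std::size_t, double> r = process(e, centers);
        reduc[r.first] += r.second;  // update order does not matter for '+'
    }
    return reduc;
}
```

Because the update is associative and commutative, the elements can be processed in any order, which is what makes this loop a candidate for GPU parallelization.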

Porting on GPUs
High-level parallelization is straightforward
Details of data movement
Impact of thread count on reduction time
Use of shared memory

Architecture of the System
[Diagram; components as labeled on the slide]
User input: variable information, reduction functions, optional functions
Code Analyzer (in LLVM)
Variable Analyzer
Variable access pattern and combination operations
Code Generator
Host program: grid configuration and kernel invocation
Kernel functions
Executable

User Input
A sequential reduction function
Optional functions (initialization function, combination function, …)
Values of each variable or size of array
Variables to be used in the reduction function

Analysis of Sequential Code
Get the access features of each variable
Determine the data to be replicated
Get the operator for global combination
Identify variables for shared memory

Memory Allocation and Copy
Need a copy for each thread
Copy the updates back to host memory after the kernel reduction function returns
[Figure: threads T0, T1, …, T63, repeated across blocks, with per-thread copies of the data]
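The per-thread replication described on this slide can be pictured with a host-side C++ sketch. This is a sequential stand-in, not generated CUDA: `replicated_histogram` is a hypothetical name, the reduction is assumed to be a histogram over non-negative integers, and the strided assignment of elements to threads is an assumption of the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns a flat buffer with one private copy of the reduction array per
// thread. Copy t occupies the index range [t * length, (t + 1) * length).
std::vector<int> replicated_histogram(const std::vector<int>& data,
                                      std::size_t length,
                                      std::size_t num_threads) {
    assert(length > 0 && num_threads > 0);
    // Stands in for the device allocation, zero-initialized.
    std::vector<int> copies(num_threads * length, 0);
    // Sequential stand-in for the kernel: thread t handles elements
    // t, t + num_threads, t + 2 * num_threads, ...
    for (std::size_t t = 0; t < num_threads; ++t) {
        for (std::size_t k = t; k < data.size(); k += num_threads) {
            assert(data[k] >= 0);
            const std::size_t bucket = static_cast<std::size_t>(data[k]) % length;
            copies[t * length + bucket] += 1;  // writes only to thread t's copy
        }
    }
    return copies;  // stands in for the copy back to host memory
}
```

Since each thread writes only to its own slice, no synchronization is needed during the reduction; the price is `num_threads * length` elements of memory and a later combination step.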

Extract Information of Variable Access
[Diagram: variable analyzer]
Inputs: IR from LLVM, argument list, user input
Extract variables to be written
Extract read-only variables
Extract temporary variables

Generating CUDA Code and C++/C Code Invoking the Kernel Function
Memory allocation and copy
Thread grid configuration (block number and thread number)
Global function
Kernel reduction function
Global combination

Global Combination
Assume the updates from each thread are combined by summation or multiplication
An automatically generated global combination function is invoked by one thread
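A minimal C++ sketch of such a combination step follows, assuming the per-thread copies are stored contiguously (copy t at offset `t * length`). The names `CombineOp` and `global_combine` are hypothetical; the single loop nest mirrors the idea of one thread doing all the combining.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class CombineOp { Sum, Product };

// Combines num_threads private copies of a reduction array of the given
// length into one result, element by element.
std::vector<double> global_combine(const std::vector<double>& copies,
                                   std::size_t length,
                                   std::size_t num_threads,
                                   CombineOp op) {
    assert(copies.size() == length * num_threads);
    // Start from the identity of the chosen operator.
    const double identity = (op == CombineOp::Sum) ? 0.0 : 1.0;
    std::vector<double> result(length, identity);
    for (std::size_t t = 0; t < num_threads; ++t) {
        for (std::size_t i = 0; i < length; ++i) {
            const double v = copies[t * length + i];
            if (op == CombineOp::Sum) {
                result[i] += v;
            } else {
                result[i] *= v;
            }
        }
    }
    return result;
}
```

The cost of this step grows with the number of threads, which is one way to see why thread count affects reduction time.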

Kernel Reduction Function
Generated out of the original sequential code
Divide the main loop by block_number and thread_number
Replace the access offsets with appropriate indices
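One way to divide the main loop is sketched below in C++. The slides do not say whether the generated code uses contiguous chunks or a strided assignment; this example assumes contiguous chunks, and `thread_range` is a hypothetical helper that computes the iterations owned by one (block, thread) pair.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Half-open range [first, second) of loop iterations assigned to the thread
// identified by (block_id, thread_id), out of n iterations in total.
std::pair<std::size_t, std::size_t> thread_range(std::size_t n,
                                                 std::size_t block_number,
                                                 std::size_t thread_number,
                                                 std::size_t block_id,
                                                 std::size_t thread_id) {
    assert(block_number > 0 && thread_number > 0);
    assert(block_id < block_number && thread_id < thread_number);
    const std::size_t total = block_number * thread_number;
    // Same flattening as blockIdx.x * blockDim.x + threadIdx.x in CUDA.
    const std::size_t global_id = block_id * thread_number + thread_id;
    const std::size_t chunk = (n + total - 1) / total;  // ceiling division
    const std::size_t start = (global_id * chunk < n) ? global_id * chunk : n;
    const std::size_t end = (start + chunk < n) ? start + chunk : n;
    return {start, end};
}
```

Threads whose chunk would start past the end of the data get an empty range, so the loop body never reads out of bounds.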

Optimizations
Using shared memory
Providing user-specified initialization functions and combination functions
Specifying variables that are allocated once

Dealing with Shared Memory
Size = length * sizeof(type) * thread_info
  – length: size of the array
  – type: char, int, and float
  – thread_info: whether it's copied to each thread
Mark each array as shared until the size exceeds the limit of shared memory
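The size formula and the marking rule can be sketched in C++ as follows. Several details are assumptions of this example rather than statements from the slides: `thread_info` is taken to be the number of threads per block when the array is copied to each thread and 1 otherwise, arrays are visited in the order given, and marking stops at the first array that does not fit. `ArrayInfo`, `shared_size`, and `mark_shared` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct ArrayInfo {
    std::string name;
    std::size_t length;     // number of elements
    std::size_t elem_size;  // sizeof(type) in bytes
    bool per_thread;        // true if each thread needs its own copy
};

// Size = length * sizeof(type) * thread_info
std::size_t shared_size(const ArrayInfo& a, std::size_t threads_per_block) {
    const std::size_t thread_info = a.per_thread ? threads_per_block : 1;
    return a.length * a.elem_size * thread_info;
}

// Marks arrays as shared, in order, until the running total would exceed
// the shared memory limit.
std::vector<bool> mark_shared(const std::vector<ArrayInfo>& arrays,
                              std::size_t threads_per_block,
                              std::size_t limit_bytes) {
    assert(threads_per_block > 0);
    std::vector<bool> shared(arrays.size(), false);
    std::size_t used = 0;
    for (std::size_t k = 0; k < arrays.size(); ++k) {
        const std::size_t size = shared_size(arrays[k], threads_per_block);
        if (used + size > limit_bytes) {
            break;  // limit reached: remaining arrays stay in device memory
        }
        used += size;
        shared[k] = true;
    }
    return shared;
}
```

Because the walk stops at the first array that does not fit, the order in which arrays are considered changes which ones end up in shared memory, which is where a choice of layout strategy comes in.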

Shared Memory Layout Strategies
No sorting
Greedy sorting
Write-first sorting

No Sorting
[Figure: placement of arrays A, B, C, D in shared memory]

Greedy Sorting
[Figure: placement of arrays A, B, C, D in shared memory]

Other Optimizations
Reducing memory allocation and copy overhead
  – Arrays shared by multiple iterations can be allocated and copied only once
User-defined combination function

Applications
K-means clustering
EM clustering
PCA

Experiment Results
Speedup of k-means

Speedup of k-means on GeForce 9800 GX2

Speedup of EM

Speedup of PCA

Related Work
OpenMP to CUDA (Purdue)
Domain-specific operators to CUDA (NEC)
CUDA-lite etc. (Illinois)
Various application studies

Conclusions
Automatic CUDA code generation and optimization is feasible
Restricting to domain / communication style helps
Interesting new compiler optimizations