Automatic translation from CUDA to C++
Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo
31 August 2015

Goals
Running CUDA code on CPUs. Why? Performance portability! This is a major challenge faced today by developers on heterogeneous high-performance computers: we want to write the best possible code using the best possible frameworks and run it on the best possible hardware. After translation, our code should also run well on platforms that do not have an NVIDIA graphics card. Is CUDA an effective data-parallel programming model for more than just GPU architectures?

CUDA Computational Model
Data-parallel model of computation:
– Data are split into a 1D, 2D, or 3D grid of blocks
– Each block can be 1D, 2D, or 3D in shape
– Up to 1024 threads per block (512 on early architectures)
Built-in variables:
– dim3 threadIdx
– dim3 blockIdx
– dim3 blockDim
– dim3 gridDim
[Figure: a grid of blocks on the device, each labelled Block (x, y, 0)]
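To make the built-ins concrete, here is a minimal kernel (our example, not from the slides) that derives a per-thread global index; the names scale, d_data and run_scale are illustrative:

__global__ void scale(float *data, float factor, int n) {
    // Each thread computes a unique global index from the built-ins.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

void run_scale(float *d_data, int n) {
    // 1D grid of 1D blocks, 256 threads per block.
    dim3 threadsPerBlock(256);
    dim3 numBlocks((n + threadsPerBlock.x - 1) / threadsPerBlock.x);
    scale<<<numBlocks, threadsPerBlock>>>(d_data, 2.0f, n);
}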

Mapping
The mapping of the computation onto a CPU is NOT straightforward. The conceptually easiest implementation spawns a CPU thread for every GPU thread specified in the programming model. This is quite inefficient:
– it forfeits the locality benefits expressed by the block structure
– it incurs a large amount of scheduling overhead
A sketch of this naive approach follows.
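A minimal illustration of the naive one-OS-thread-per-CUDA-thread mapping (ours, not from the slides), assuming the kernel body has been turned into a callable fn(tid):

#include <thread>
#include <vector>

// Naive mapping: one std::thread per logical CUDA thread.
// For a single block of blockDim threads this spawns blockDim OS
// threads, which swamps the scheduler for realistic block sizes.
template <typename Kernel>
void run_block_naive(Kernel fn, int blockDim) {
    std::vector<std::thread> threads;
    threads.reserve(blockDim);
    for (int tid = 0; tid < blockDim; ++tid)
        threads.emplace_back(fn, tid);  // fn(tid) plays the kernel body
    for (auto &t : threads)
        t.join();
}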

Mapping
Instead, we translate the CUDA program so that the mapping of programming constructs maintains the locality expressed in the programming model, using existing operating system and hardware features:
– CUDA block execution is asynchronous
– each CPU thread should be scheduled to a single core, for locality
– the ordering semantics imposed by potential barrier synchronization points must be maintained

CUDA construct                  | C++ construct
block (asynchronous)            | std::thread / task (asynchronous)
thread (synchronous, barriers)  | sequential, unrollable for loop (can be vectorized)

A sketch of the block-level mapping follows.
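One way to realize the block-to-thread side of this table (our sketch, assuming a translated kernel that takes a block index, as in the examples below):

#include <atomic>
#include <thread>
#include <vector>

// Block-level mapping: a fixed pool of workers (one per core) pulls
// CUDA blocks off a shared counter. Blocks run asynchronously with
// respect to each other; within a block, the translated kernel runs
// the logical threads as a sequential (vectorizable) loop.
template <typename Kernel>
void launch_translated(Kernel kernel, int numBlocks) {
    std::atomic<int> next{0};
    unsigned cores = std::thread::hardware_concurrency();
    std::vector<std::thread> pool;
    for (unsigned c = 0; c < cores; ++c)
        pool.emplace_back([&] {
            for (int b = next++; b < numBlocks; b = next++)
                kernel(b);  // kernel(blockIdx)
        });
    for (auto &t : pool)
        t.join();
}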

Source-to-source translation
Why not just find a way to compile CUDA directly for x86? Because having the output source code would be nice! We would like to analyze the obtained code for:
– further optimizations
– debugging
Moreover, we don't want to focus only on x86.

Working with ASTs
What is an Abstract Syntax Tree? A tree representation of the abstract syntactic structure of source code written in a programming language.
Why aren't regular-expression tools powerful enough? Because almost all programming languages are (deterministic) context-free languages, a superset of regular languages.
For example, the statement

while (x < 10) { x := x + 1; }

has the AST

while
 ├─ <
 │   ├─ x
 │   └─ 10
 └─ :=
     ├─ x
     └─ +
         ├─ x
         └─ 1

Clang
Clang is a compiler front end for the C, C++, Objective-C and Objective-C++ programming languages. It uses LLVM as its back end. "Clang's AST closely resembles both the written C++ code and the C++ standard." This makes Clang's AST a good fit for refactoring tools. Crucially, Clang handles CUDA syntax!
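For instance, Clang's LibTooling and AST matchers can locate every kernel launch in a translation unit. The following standalone tool is our sketch, not the project's code; the API names follow recent Clang releases and may differ between versions:

#include "clang/AST/ExprCXX.h"
#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/ASTMatchers/ASTMatchers.h"
#include "clang/Tooling/CommonOptionsParser.h"
#include "clang/Tooling/Tooling.h"

using namespace clang;
using namespace clang::ast_matchers;
using namespace clang::tooling;

// Fires once for every kernel launch expression matched in the AST.
class LaunchPrinter : public MatchFinder::MatchCallback {
public:
  void run(const MatchFinder::MatchResult &Result) override {
    if (const auto *Launch =
            Result.Nodes.getNodeAs<CUDAKernelCallExpr>("launch"))
      Launch->dump();  // print the sub-AST of kernel<<<grid, block>>>(...)
  }
};

static llvm::cl::OptionCategory ToolCategory("cuda-launch-dump");

int main(int argc, const char **argv) {
  auto Options = CommonOptionsParser::create(argc, argv, ToolCategory);
  if (!Options) return 1;
  ClangTool Tool(Options->getCompilations(), Options->getSourcePathList());

  LaunchPrinter Printer;
  MatchFinder Finder;
  // cudaKernelCallExpr() matches kernel<<<...>>>(...) call expressions.
  Finder.addMatcher(cudaKernelCallExpr().bind("launch"), &Printer);
  return Tool.run(newFrontendActionFactory(&Finder).get());
}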

Vector addition: host translation

int main() {
    ...
    sum<<<numBlocks, numThreadsPerBlock>>>(in_a, in_b, out_c);
    ...
}

becomes

int main() {
    ...
    for (i = 0; i < numBlocks; i++)
        sum(in_a, in_b, out_c, i, numThreadsPerBlock);
    ...
}

Vector addition: kernel translation

__global__ void sum(int* a, int* b, int* c) {
    ...
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
    ...
}

becomes

void sum(int* a, int* b, int* c, dim3 blockIdx, dim3 blockDim) {
    dim3 threadIdx;
    ...
    // Thread_Loop
    for (threadIdx.x = 0; threadIdx.x < blockDim.x; threadIdx.x++) {
        c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
    }
    ...
}

threadIdx, blockIdx and blockDim are not built-in variables anymore! The loop enumerates the values of the previously implicit threadIdx.

Example: Stencil
Consider applying a 1D stencil to a 1D array of elements:
– each output element is the sum of the input elements within a radius
This pattern is fundamental to many algorithms:
– standard discretization methods, interpolation, convolution, filtering
If the radius is 3, then each output element is the sum of 7 input elements; see the sequential sketch below.
[Figure: a 1D array with a radius-3 window of 7 input elements collapsing into one output element]
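As a plain C++ reference (ours, not from the slides), the computation is simply:

// Sequential 1D stencil: out[i] = sum of in[i-RADIUS .. i+RADIUS].
// Assumes the caller provides RADIUS halo elements on both ends,
// i.e. `in` points just past the left halo.
const int RADIUS = 3;

void stencil_1d_seq(const int *in, int *out, int n) {
    for (int i = 0; i < n; ++i) {
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; ++offset)
            result += in[i + offset];
        out[i] = result;
    }
}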

Sharing Data Between Threads
Terminology: within a block, threads share data via shared memory.
– Declared using __shared__, allocated per block
– Data are not visible to threads in other blocks

Implementing with shared memory
Cache data in shared memory:
– read (blockDim.x + 2 * radius) input elements from global memory to shared memory
– compute blockDim.x output elements
– write blockDim.x output elements to global memory
Each block needs a halo of radius elements at each boundary.
[Figure: blockDim.x output elements, with a halo on the left and a halo on the right]

Stencil kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int global_id = threadIdx.x + blockIdx.x * blockDim.x;
    int local_id = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[local_id] = in[global_id];
    if (threadIdx.x < RADIUS) {
        temp[local_id - RADIUS] = in[global_id - RADIUS];
        temp[local_id + BLOCK_SIZE] = in[global_id + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

Stencil kernel (continued)

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[local_id + offset];

    // Store the result
    out[global_id] = result;
}

Stencil: CUDA kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int global_id = threadIdx.x + blockIdx.x * blockDim.x;
    int local_id = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[local_id] = in[global_id];
    if (threadIdx.x < RADIUS) {
        temp[local_id - RADIUS] = in[global_id - RADIUS];
        temp[local_id + BLOCK_SIZE] = in[global_id + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[local_id + offset];

    // Store the result
    out[global_id] = result;
}

Stencil: kernel translation (1)

void stencil_1d(int *in, int *out, dim3 blockDim, dim3 blockIdx) {
    int temp[BLOCK_SIZE + 2 * RADIUS];
    int global_id;
    int local_id;

    THREAD_LOOP_BEGIN {
        global_id = threadIdx.x + blockIdx.x * blockDim.x;
        local_id = threadIdx.x + RADIUS;
        // Read input elements into shared memory
        temp[local_id] = in[global_id];
        if (threadIdx.x < RADIUS) {
            temp[local_id - RADIUS] = in[global_id - RADIUS];
            temp[local_id + BLOCK_SIZE] = in[global_id + BLOCK_SIZE];
        }
    } THREAD_LOOP_END

    THREAD_LOOP_BEGIN {
        // Apply the stencil
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[local_id + offset];
        // Store the result
        out[global_id] = result;
    } THREAD_LOOP_END
}

A thread loop implicitly introduces a barrier synchronization among logical threads at its boundaries.

Stencil: kernel translation (2)

void stencil_1d(int *in, int *out, dim3 blockDim, dim3 blockIdx) {
    int temp[BLOCK_SIZE + 2 * RADIUS];
    int global_id[BLOCK_SIZE];  // one slot per logical thread
    int local_id[BLOCK_SIZE];

    THREAD_LOOP_BEGIN {
        global_id[tid] = threadIdx.x + blockIdx.x * blockDim.x;
        local_id[tid] = threadIdx.x + RADIUS;
        // Read input elements into shared memory
        temp[local_id[tid]] = in[global_id[tid]];
        if (threadIdx.x < RADIUS) {
            temp[local_id[tid] - RADIUS] = in[global_id[tid] - RADIUS];
            temp[local_id[tid] + BLOCK_SIZE] = in[global_id[tid] + BLOCK_SIZE];
        }
    } THREAD_LOOP_END

    THREAD_LOOP_BEGIN {
        // Apply the stencil
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[local_id[tid] + offset];
        // Store the result
        out[global_id[tid]] = result;
    } THREAD_LOOP_END
}

We create an array of values for each local variable of the former CUDA threads. Statements within thread loops access these arrays by the loop index tid.
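One plausible realization of the thread-loop macros (our sketch; the translator's actual definitions are not shown in the slides) is a plain for loop over the logical thread index, from which threadIdx is recomputed:

// Minimal stand-in for CUDA's dim3 in translated code.
struct dim3 { unsigned x = 1, y = 1, z = 1; };

// Hypothetical expansion of the thread-loop macros for a 1D block:
// running every logical thread to completion before the next loop
// begins is exactly a barrier at the loop boundary.
#define THREAD_LOOP_BEGIN                               \
    for (unsigned tid = 0; tid < blockDim.x; ++tid) {   \
        dim3 threadIdx;                                 \
        threadIdx.x = tid;
#define THREAD_LOOP_END }

Because each thread loop runs the logical threads as one sequential loop, the compiler is free to unroll and vectorize its body.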

Benchmark

Source code used     Time (ms)    Slowdown wrt CUDA
CUDA¹
Translated TBB²
Native sequential³
Native TBB²

1 - Not optimized CUDA code, memory copies included
2 - We wrapped the call to the function in a TBB parallel for
3 - Not optimized, naive implementation

Hardware: Intel® Core™ CPU; NVIDIA® Tesla K40 (Kepler GK110B)
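Footnote 2 refers to a harness along these lines (our sketch, assuming the translated stencil_1d signature from the previous slides and the plain-C++ dim3 stand-in sketched earlier):

#include <tbb/parallel_for.h>

// Run the translated kernel over all blocks with TBB: each loop
// iteration is an independent CUDA block, so TBB may execute them
// in parallel across cores.
void launch_stencil_tbb(int *in, int *out, int numBlocks, int blockSize) {
    tbb::parallel_for(0, numBlocks, [=](int b) {
        dim3 blockDim, blockIdx;
        blockDim.x = blockSize;
        blockIdx.x = b;
        stencil_1d(in, out, blockDim, blockIdx);
    });
}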

Conclusion
One of the most challenging aspects was obtaining the ASTs from CUDA source code. This was solved by including the CUDA headers at compile time; now we can match all CUDA syntax. At the moment the kernel translation is almost complete. Interesting results, but still a good amount of work to do.

Future work
– Handle the CUDA API in the host code (e.g. cudaMallocManaged), as sketched below
– Atomic operations in the kernel
– Adding vectorization pragmas
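Handling host-side API calls could be as simple as redirecting them to portable equivalents. A minimal sketch (ours, not the project's actual approach) for cudaMallocManaged, with stand-in error codes:

#include <cstdlib>

// Minimal stand-ins for the CUDA runtime types used below.
enum cudaError_t { cudaSuccess = 0, cudaErrorMemoryAllocation = 2 };

// On a CPU there is no device/host split, so managed memory
// degenerates to ordinary heap memory.
cudaError_t cudaMallocManaged(void **ptr, std::size_t size) {
    *ptr = std::malloc(size);
    return *ptr ? cudaSuccess : cudaErrorMemoryAllocation;
}

cudaError_t cudaFree(void *ptr) {
    std::free(ptr);
    return cudaSuccess;
}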

Thank you
Code: GitHub
References:
– Felice Pantaleo, "Introduction to GPU programming using CUDA"

(Backup) Example: __syncthreads inside a while statement

while (j > 0) {
    foo();
    __syncthreads();
    bar();
}

becomes

bool newcond[];  // one condition slot per logical thread

thread_loop {
    newcond[tid] = (j[tid] > 0);
}

LABEL:
thread_loop {
    if (newcond[tid]) {
        foo();
    }
}

thread_loop {
    if (newcond[tid]) {
        bar();
    }
}

thread_loop {
    newcond[tid] = (j[tid] > 0);
    if (newcond[tid]) go = true;  // go is assumed reset to false beforehand
}

if (go) goto LABEL;

The loop is fissioned at the barrier: because each thread loop runs all logical threads to completion, the boundary between the foo and bar loops reproduces the __syncthreads() semantics, and the condition is re-evaluated for all threads before the next iteration.