Heterogeneous Computing With GPGPUs
Matthew Piehl

Overview
- Introduction to CUDA
- Project Overview
- Issues Faced
- nvcc
- Implementation
- Performance Metrics
- Conclusions
- Improvements
- Advantages and Disadvantages


Introduction to CUDA

The basic execution model has four steps:
1. Copy data from main memory to GPU memory.
2. The CPU instructs the GPU to start a process (a kernel).
3. The GPU executes it in parallel on many cores.
4. Transfer the results back to main memory.

Hardware: Nvidia Tesla S graphics cards with 4 x 240 thread processors (960 TPs total) and 4 GB of memory per device.
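As a minimal sketch of those four steps (not taken from the original slides; the kernel, array size, and launch configuration here are illustrative), consider scaling an array on the device:

#include <stdio.h>
#include <cuda_runtime.h>

/* Steps 2 and 3: the kernel that every GPU thread executes */
__global__ void scale(float *data, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= factor;
}

int main(void)
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float host[n];
    for (int i = 0; i < n; i++) host[i] = (float)i;

    /* Step 1: copy data from main memory to GPU memory */
    float *dev;
    cudaMalloc((void **)&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    /* Steps 2 and 3: the CPU launches the kernel and the GPU
       runs it on many cores (4 blocks x 256 threads = 1024 threads) */
    scale<<<4, 256>>>(dev, 2.0f);

    /* Step 4: transfer the results back to main memory */
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[3] = %f\n", host[3]); /* expect 6.0 */
    return 0;
}

Each thread computes its own global index from blockIdx, blockDim, and threadIdx; the project kernel below uses the same idiom with threadIdx.x alone.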

The Project

Offload Boolean operations such as the AND and OR operators to the GPU. In theory, performance should improve, since graphics cards specialize in Boolean operations (AND, OR).

CUDA Approach

With PGI not an option, I have to move to Nvidia's compiler, nvcc. nvcc is Nvidia's CUDA compiler driver: it compiles only your kernel (device) code itself and passes the host C code on to gcc. This made for a namespace nightmare with PGI (nvcc -> gcc -> pgCC). Since nvcc only compiles C code, we need an external C function that wraps the CUDA calls: cudaMalloc() and cudaMemcpy() (host to device, then device to host).

/* Allocate memory on the Nvidia device */
cudaMalloc((void**) &Ptrinc, memsize);
cudaMalloc((void**) &Ptrcurr, memsize);
cudaMalloc((void**) &Ptrtmp, memsize);

/* Copy to the allocated memory */
cudaMemcpy(Ptrinc, incoming, memsize, cudaMemcpyHostToDevice);
cudaMemcpy(Ptrcurr, curr, memsize, cudaMemcpyHostToDevice);

/* One block, one thread per machine word; the kernel indexes with threadIdx.x */
offload_OR<<<1, memsize / sizeof(size_t)>>>(Ptrinc, Ptrcurr, Ptrtmp);

__global__ void offload_OR(size_t* inc, size_t* curr, size_t* tmp)
{
    int i = threadIdx.x;
    tmp[i] = inc[i] | curr[i];
}

/* Copy the result back to the host */
cudaMemcpy(newtree, Ptrtmp, memsize, cudaMemcpyDeviceToHost);

/* Free our Nvidia-allocated memory */
cudaFree(Ptrinc);
cudaFree(Ptrtmp);
cudaFree(Ptrcurr);
return newtree;
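A minimal sketch of that external-C-function pattern (the file name offload.cu, the wrapper name gpu_or, and its signature are illustrative assumptions, not from the original slides): the CUDA code lives behind a plain C symbol that the PGI-compiled program can link against.

/* offload.cu -- compiled by nvcc */
#include <cuda_runtime.h>
#include <stddef.h>

__global__ void offload_OR(size_t* inc, size_t* curr, size_t* tmp)
{
    int i = threadIdx.x;
    tmp[i] = inc[i] | curr[i];
}

/* extern "C" disables C++ name mangling, so the symbol gpu_or can be
   resolved by a host program built with a different compiler (pgCC). */
extern "C" void gpu_or(const size_t* incoming, const size_t* curr,
                       size_t* result, size_t nwords)
{
    size_t memsize = nwords * sizeof(size_t);
    size_t *dInc, *dCurr, *dTmp;

    cudaMalloc((void**) &dInc, memsize);
    cudaMalloc((void**) &dCurr, memsize);
    cudaMalloc((void**) &dTmp, memsize);

    cudaMemcpy(dInc, incoming, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(dCurr, curr, memsize, cudaMemcpyHostToDevice);

    offload_OR<<<1, (unsigned int)nwords>>>(dInc, dCurr, dTmp);

    cudaMemcpy(result, dTmp, memsize, cudaMemcpyDeviceToHost);

    cudaFree(dInc);
    cudaFree(dCurr);
    cudaFree(dTmp);
}

The PGI side then needs only the matching declaration, extern "C" void gpu_or(const size_t*, const size_t*, size_t*, size_t);, plus a link step along the lines of pgCC main.cpp offload.o -lcudart (illustrative; the exact flags depend on the installation).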

Performance

Conclusion

Memory latency is causing problems: we spend way too much time moving data to and from the device. The current program uses only limited loop vectorization, while the CUDA implementation dispatches thousands of threads for the 960 thread processors to work on, with very little overhead in thread creation.

Can we get around this issue of memory latency? The current environment isn't enough:
- A heterogeneous cluster would have potential.
- Build the PTrees on the device and don't copy them back (within the 4 GB limit on each device).
- 2-dimensional PTrees can be helpful.

General disadvantages of CUDA:
- Memory latency is a major issue.
- Only 4 GB of device memory.
- Steep learning curve compared with PGI.
- Linking and namespace nightmare when linking many different compilers together.

Several advantages of CUDA:
- Very good when you have to do many computations with limited memory copies, e.g. integral approximation (Monte Carlo simulation, Simpson's rule); see the sketch below.
- Supports thousands of threads.
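To make that first advantage concrete, here is a minimal Monte Carlo estimate of pi (a sketch, not from the original slides; the kernel name, grid sizes, and the deliberately simple per-thread linear congruential generator are illustrative, and production code would use a proper library generator such as cuRAND). The device performs hundreds of millions of arithmetic operations, and the only copy back is one small array of per-thread counts:

/* mc_pi.cu -- compiled by nvcc */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void count_hits(unsigned int* hits, int samplesPerThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int seed = 1234u + 2654435761u * (unsigned int)tid;
    unsigned int inside = 0;
    for (int s = 0; s < samplesPerThread; ++s) {
        seed = 1664525u * seed + 1013904223u;   /* LCG step for x */
        float x = (seed >> 8) / 16777216.0f;    /* map top 24 bits to [0,1) */
        seed = 1664525u * seed + 1013904223u;   /* LCG step for y */
        float y = (seed >> 8) / 16777216.0f;
        if (x * x + y * y <= 1.0f) ++inside;
    }
    hits[tid] = inside;  /* one small result per thread */
}

int main(void)
{
    const int blocks = 256, threads = 256, samples = 4096;
    const int nthreads = blocks * threads;

    unsigned int* dHits;
    cudaMalloc((void**) &dHits, nthreads * sizeof(unsigned int));

    /* All the heavy arithmetic happens on the device... */
    count_hits<<<blocks, threads>>>(dHits, samples);

    /* ...and the only transfer is one array of per-thread counts. */
    unsigned int* hits = (unsigned int*) malloc(nthreads * sizeof(unsigned int));
    cudaMemcpy(hits, dHits, nthreads * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    unsigned long long total = 0;
    for (int i = 0; i < nthreads; ++i) total += hits[i];

    printf("pi ~= %f\n", 4.0 * total / ((double)nthreads * samples));
    free(hits);
    cudaFree(dHits);
    return 0;
}

Roughly 268 million samples are generated on the device, yet the single cudaMemcpy back moves only 256 KB, which is exactly the compute-heavy, copy-light profile where CUDA pays off.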