Multi-GPU and Stream Programming Kishan Wimalawarne

Agenda
Memory
Stream programming
Multi-GPU programming
UVA & GPUDirect

Memory
Page-locked memory (pinned memory)
– Useful for concurrent kernel execution and asynchronous transfers
– Use cudaHostAlloc() and cudaFreeHost() to allocate and free page-locked host memory
Mapped memory
– A block of page-locked host memory can also be mapped into the address space of the device by passing the flag cudaHostAllocMapped to cudaHostAlloc()
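A minimal sketch of allocating and freeing pinned host memory (the element count N and the float element type are assumptions):

// Allocate page-locked (pinned) host memory.
float *h_buf;
size_t bytes = N * sizeof(float);
cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault);
// ... use h_buf for fast, asynchronous host-device transfers ...
cudaFreeHost(h_buf);   // release the pinned memory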

Zero-Copy
Zero-copy enables GPU threads to directly access host memory.
Requires mapped, pinned (non-pageable) memory.
Zero-copy can be used in place of streams: kernel-originated data transfers automatically overlap kernel execution, without the overhead of setting up streams and determining the optimal number of them.
Enable it with cudaSetDeviceFlags(cudaDeviceMapHost).
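A minimal zero-copy sketch (the kernel name and launch configuration are hypothetical; the flag must be set before any CUDA context is created):

cudaSetDeviceFlags(cudaDeviceMapHost);                 // allow mapping host memory
float *h_data, *d_data;
cudaHostAlloc((void **)&h_data, bytes, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_data, h_data, 0); // device-side alias of h_data
kernel<<<blocks, threads>>>(d_data);                   // reads/writes host memory directly
cudaDeviceSynchronize();
cudaFreeHost(h_data);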

Stream Programming

Introduction
Stream programming (pipelining) is a useful parallel pattern.
Data transfer between host and device is a major performance bottleneck in GPU programming.
CUDA provides support for asynchronous data transfers and kernel execution.
A stream is simply a sequence of operations that are performed in order on the device.
Operations in different streams can run concurrently, allowing concurrent execution of kernels.
At most 16 kernels can execute concurrently.

Asynchronous Memory Transfer
Use cudaMemcpyAsync() instead of cudaMemcpy().
cudaMemcpyAsync() is a non-blocking data transfer method; it requires pinned host memory.
cudaError_t cudaMemcpyAsync(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind, cudaStream_t stream)
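For example (a sketch; h_a is a pinned host buffer, and d_a, bytes, and stream0 are assumed to be set up already):

cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream0);
// returns immediately; the copy is ordered behind prior work in stream0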

Stream Structures
cudaStream_t – specifies a stream in a CUDA program
cudaStreamCreate(cudaStream_t *stm) – instantiates a stream
cudaStreamDestroy(cudaStream_t stm) – destroys a stream when it is no longer needed

Streaming example
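A minimal two-stream sketch of the pattern (the kernel scale, the pinned buffer h_a, the device buffer d_a, and the launch configuration are assumptions):

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

// Each half of the data goes through its own stream, so one stream's
// copies can overlap the other stream's kernel execution.
for (int i = 0; i < 2; ++i) {
    int off = i * n / 2;
    size_t half = (n / 2) * sizeof(float);
    cudaMemcpyAsync(d_a + off, h_a + off, half, cudaMemcpyHostToDevice, stream[i]);
    scale<<<blocks, threads, 0, stream[i]>>>(d_a + off, n / 2);
    cudaMemcpyAsync(h_a + off, d_a + off, half, cudaMemcpyDeviceToHost, stream[i]);
}

for (int i = 0; i < 2; ++i) {
    cudaStreamSynchronize(stream[i]);   // drain each pipeline
    cudaStreamDestroy(stream[i]);
}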

Event Processing
Events are used for
– Monitoring device behavior
– Accurate timing of device operations
cudaEvent_t e;
cudaEventCreate(&e);
cudaEventDestroy(e);

Event Processing
cudaEventRecord() records an event in a stream.
cudaEventElapsedTime() computes the elapsed time between two events.
cudaEventSynchronize() blocks until the event has actually been recorded.
cudaEventQuery() checks the status of an event without blocking.
cudaStreamWaitEvent() makes all future work submitted to a stream wait until the event reports completion before beginning execution.
cudaEventCreateWithFlags() creates an event with flags, e.g. cudaEventDefault, cudaEventBlockingSync.
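A sketch of timing a kernel with events (the kernel name, its argument, and the launch configuration are hypothetical):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);               // record in the default stream
kernel<<<blocks, threads>>>(d_a);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              // block until 'stop' has been recorded

float ms;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);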

Stream Synchronization
cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.
cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed.
cudaStreamWaitEvent() takes a stream and an event as parameters and makes all commands added to the given stream after the call delay their execution until the given event has completed.
cudaStreamQuery() provides applications with a way to know if all preceding commands in a stream have completed.
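For example, to poll rather than block (a sketch, assuming a stream s):

if (cudaStreamQuery(s) == cudaSuccess) {
    // all work previously submitted to s has completed
} else {
    cudaStreamSynchronize(s);   // or block until it has
}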

Multi-GPU Programming

Multiple Device Access
cudaSetDevice(devID) – selects a device within the code by its identifier; subsequent CUDA kernels run on the selected GPU.
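A sketch of enumerating the available devices and selecting each in turn:

int count;
cudaGetDeviceCount(&count);      // number of CUDA-capable devices
for (int dev = 0; dev < count; ++dev) {
    cudaSetDevice(dev);          // subsequent CUDA calls target 'dev'
    // ... allocate memory / launch kernels on this device ...
}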

Peer-to-Peer Memory Access
Peer-to-peer memory access
– Only on Tesla or above
– cudaDeviceCanAccessPeer() checks whether peer access is possible; cudaDeviceEnablePeerAccess() enables it
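A sketch of checking and enabling peer access from device 0 to device 1:

int canAccess;
cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can device 0 map device 1's memory?
if (canAccess) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);       // flags argument must be 0
}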

Peer-to-Peer Memory Copy
Using cudaMemcpyPeer() – copies memory between two devices; works on the GeForce GTX 480 and other GPUs as well.
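A sketch of the call (the buffer names d0_buf, on device 0, and d1_buf, on device 1, are assumptions):

// Copy 'bytes' bytes from device 1's buffer into device 0's buffer.
cudaMemcpyPeer(d0_buf, 0, d1_buf, 1, bytes);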

Programming Multiple GPUs
The most efficient way to use multiple GPUs is to create one host thread per GPU and divide the work among them
– e.g. with pthreads
This combines the parallelism of a multi-core processor with that of multiple GPUs.
In each thread, use cudaSetDevice() to specify the device it should run on.

Multiple GPUs
For each computation on a GPU, create a separate host thread and specify the device the CUDA kernel should run on.
Synchronize both the CPU threads and the GPUs.

Multiple GPU Example

// Thread function: each host thread drives one GPU.
// p0 and p1 are host buffers, p2 and p3 are device pointers, and size is
// the buffer size in bytes (declared elsewhere in the original example).
void *GPUprocess(void *id) {
    long tid = (long)id;
    if (tid == 0) {
        cudaSetDevice(tid);                   // run on GPU 0
        cudaMalloc((void **)&p2, size);
        cudaMemcpy(p2, p0, size, cudaMemcpyHostToDevice);
        test<<<grid, block>>>(p2, tid + 2);   // launch config lost in transcript
        cudaMemcpy(p0, p2, size, cudaMemcpyDeviceToHost);
    } else if (tid == 1) {
        cudaSetDevice(tid);                   // run on GPU 1
        cudaMalloc((void **)&p3, size);
        cudaMemcpy(p3, p1, size, cudaMemcpyHostToDevice);
        test<<<grid, block>>>(p3, tid + 2);
        cudaMemcpy(p1, p3, size, cudaMemcpyDeviceToHost);
    }
    return NULL;
}

Multiple GPU Example

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 2

pthread_t thread[NUM_THREADS];
pthread_attr_t attr;

pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for (long t = 0; t < NUM_THREADS; t++) {
    // pass the thread index; GPUprocess uses it as the device ID
    int rc = pthread_create(&thread[t], &attr, GPUprocess, (void *)t);
    if (rc) {
        printf("ERROR; return code from pthread_create() is %d\n", rc);
        exit(-1);
    }
}

Unified Virtual Address Space (UVA)
The host and all devices share a single virtual address space, so the runtime can determine from a pointer value which memory it refers to.
Requires a 64-bit process; on Windows Vista/7 the device must be in TCC mode (Tesla only).

GPUDirect
Builds on UVA; available on Tesla (Fermi) products.
Enables direct data transfers between GPUs, and between GPUs and other devices, without staging through host memory.
