CUDA Misc Mergesort, Pinned Memory, Device Query, Multi GPU

Parallel Mergesort
O(N) runtime, plus memory copy overhead
– Not really worth it compared to the O(N log N) sequential version, but an interesting exercise
(Diagram: regular mergesort)

CUDA Mergesort
Split portion
– Assign each thread to one number in the unsorted array
– Example: 2 blocks, 4 threads per block, giving threads B0T0 B0T1 B0T2 B0T3 B1T0 B1T1 B1T2 B1T3
Merge is split into two phases
– First phase: sort each block by merging into shared memory
(Diagram: pairwise merges of elements within Block 0 and Block 1)
Each thread locates its element with:
index = threadIdx.x + (blockIdx.x * blockDim.x)
e.g. index = 3 + (1 * 4) = 7 for Block 1, Thread 3
Why can't we keep doing this for the whole array? Shared memory and __syncthreads() only work within a single block, so this approach can sort at most one block's worth of elements.

Code to sort blocks

// This version only works for N = THREADS*BLOCKS
__global__ void sortBlocks(int *a)
{
    int i = 2;
    __shared__ int temp[THREADS];
    while (i <= THREADS)
    {
        // One thread per pair of sorted runs of length i/2 performs the merge
        if ((threadIdx.x % i) == 0)
        {
            int index1 = threadIdx.x + (blockIdx.x * blockDim.x);
            int endIndex1 = index1 + i/2;
            int index2 = endIndex1;
            int endIndex2 = index2 + i/2;
            int targetIndex = threadIdx.x;
            int done = 0;
            while (!done)
            {
                if ((index1 == endIndex1) && (index2 < endIndex2))
                    temp[targetIndex++] = a[index2++];
                else if ((index2 == endIndex2) && (index1 < endIndex1))
                    temp[targetIndex++] = a[index1++];
                else if (a[index1] < a[index2])
                    temp[targetIndex++] = a[index1++];
                else
                    temp[targetIndex++] = a[index2++];
                if ((index1 == endIndex1) && (index2 == endIndex2))
                    done = 1;
            }
        }
        __syncthreads();
        // All threads copy the merged runs from shared memory back to global memory
        a[threadIdx.x + (blockIdx.x * blockDim.x)] = temp[threadIdx.x];
        __syncthreads();
        i *= 2;
    }
}

Code for main

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main()
{
    int a[N];
    int *dev_a, *dev_temp;
    cudaMalloc((void **) &dev_a, N*sizeof(int));
    cudaMalloc((void **) &dev_temp, N*sizeof(int));

    // Fill array with random values
    srand(time(NULL));
    for (int i = 0; i < N; i++)
    {
        int num = rand() % 100;
        a[i] = num;
        printf("%d ", a[i]);
    }
    printf("\n");

    // Copy data from host to device, then sort each block
    cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
    sortBlocks<<<BLOCKS, THREADS>>>(dev_a);
    cudaMemcpy(a, dev_a, N*sizeof(int), cudaMemcpyDeviceToHost);
    …

Merging Blocks
We now need to merge the sorted blocks
– For simplicity, 1 thread per block
(Diagram: two sorted blocks, B0 and B1, merged element by element into one sorted run)

Single Step of Parallel Merge

__global__ void mergeBlocks(int *a, int *temp, int sortedsize)
{
    // One thread (one block of the grid) merges two adjacent sorted runs
    int id = blockIdx.x;
    int index1 = id * 2 * sortedsize;
    int endIndex1 = index1 + sortedsize;
    int index2 = endIndex1;
    int endIndex2 = index2 + sortedsize;
    int targetIndex = id * 2 * sortedsize;
    int done = 0;
    while (!done)
    {
        if ((index1 == endIndex1) && (index2 < endIndex2))
            temp[targetIndex++] = a[index2++];
        else if ((index2 == endIndex2) && (index1 < endIndex1))
            temp[targetIndex++] = a[index1++];
        else if (a[index1] < a[index2])
            temp[targetIndex++] = a[index1++];
        else
            temp[targetIndex++] = a[index2++];
        if ((index1 == endIndex1) && (index2 == endIndex2))
            done = 1;
    }
}

– temp = device memory, the same size as a
– sortedsize = length of a sorted "block" (doubles in size from the original block on each pass)

Main code

int blocks = BLOCKS/2;
int sortedsize = THREADS;
while (blocks > 0)
{
    mergeBlocks<<<blocks, 1>>>(dev_a, dev_temp, sortedsize);
    // Copy from device to device
    cudaMemcpy(dev_a, dev_temp, N*sizeof(int), cudaMemcpyDeviceToDevice);
    blocks /= 2;
    sortedsize *= 2;
}
cudaMemcpy(a, dev_a, N*sizeof(int), cudaMemcpyDeviceToHost);

MergeSort
With a bigger array:
#define THREADS 512
#define BLOCKS 2048
#define N (THREADS*BLOCKS)   // 1,048,576 elements
– Our implementation is limited to a power of 2 for the number of blocks and for the number of threads per block
– The slowest part seems to be copying the data back to the host; is there anything we can do about that?

Page-Locked or Pinned Memory
The CUDA runtime offers cudaHostAlloc(), which is similar to malloc
– malloc memory is standard, pageable host memory
– cudaHostAlloc() memory is page-locked host memory, or pinned memory
– The OS guarantees it will never page the memory to disk; it always resides in physical memory
– Copies to the GPU are faster: a transfer from pageable memory is first staged into a pinned buffer, then DMA copies it to the GPU, whereas pinned memory skips the staging copy
– Pinned allocations do take away from total available system memory and may affect system performance

cudaHostAlloc
Instead of malloc use:

int *a;
cudaHostAlloc((void **) &a, size, cudaHostAllocDefault);
…
cudaFreeHost(a);

It won't make much difference on our small mergesort, but a benchmark test with hundreds of copies shows the gap:
– Time using cudaMalloc: ms; MB/s during copy up:
– Time using cudaMalloc: ms; MB/s during copy down:
– Time using cudaHostAlloc: ms; MB/s during copy up:
– Time using cudaHostAlloc: ms; MB/s during copy down:
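
A benchmark like this can be timed with CUDA events; here is a minimal sketch, assuming a 100 MB buffer and 100 copies (an illustrative reconstruction, not the original benchmark code):

#include <stdio.h>
#include <cuda_runtime.h>

#define SIZE (100 * 1024 * 1024)   // assumed test size: 100 MB

int main()
{
    int *host, *dev;
    float ms;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Swap cudaHostAlloc for malloc to benchmark pageable memory instead
    cudaHostAlloc((void **) &host, SIZE, cudaHostAllocDefault);
    cudaMalloc((void **) &dev, SIZE);

    cudaEventRecord(start, 0);
    for (int i = 0; i < 100; i++)          // hundreds of copies, as on the slide
        cudaMemcpy(dev, host, SIZE, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("Copy up: %.1f ms, %.1f MB/s\n",
           ms, 100.0 * (SIZE / (1024.0 * 1024.0)) / (ms / 1000.0));

    cudaFree(dev);
    cudaFreeHost(host);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}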

Zero-Copy Host Memory
Skipping the details, but pinned memory also makes it possible for the GPU to access host memory directly
– Requires some different flags for cudaHostAlloc
– Performance win if the GPU is integrated with the host (its memory is shared with the host anyway)
– Performance loss for data read multiple times, since zero-copy memory is not cached on the GPU
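
For reference, a minimal sketch of what a zero-copy allocation looks like; the flags are the standard mapped-memory CUDA API, but the surrounding program is an illustrative assumption, not from the slides:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void increment(int *data) { data[threadIdx.x]++; }

int main()
{
    int *host_a, *dev_a;
    cudaSetDeviceFlags(cudaDeviceMapHost);              // must precede allocations
    cudaHostAlloc((void **) &host_a, 4*sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **) &dev_a, host_a, 0); // device-visible alias
    for (int i = 0; i < 4; i++) host_a[i] = i;
    increment<<<1, 4>>>(dev_a);        // kernel reads/writes host memory directly
    cudaDeviceSynchronize();           // results visible in host_a, no cudaMemcpy
    for (int i = 0; i < 4; i++) printf("%d ", host_a[i]);
    printf("\n");
    cudaFreeHost(host_a);
    return 0;
}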

Device Query
How do you know if you have integrated graphics?
– Can use deviceQuery to see what devices you have
– cudaGetDeviceCount(&count) stores the number of CUDA-enabled devices in count
– cudaGetDeviceProperties(&prop, i) stores device info in the prop struct for device i

Code

#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    int count;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++)
    {
        cudaGetDeviceProperties(&prop, i);
        printf(" --- General Information for device %d ---\n", i);
        printf("Name: %s\n", prop.name);
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("Clock rate: %d\n", prop.clockRate);
        printf("Device copy overlap: ");
        if (prop.deviceOverlap) printf("Enabled\n");
        else printf("Disabled\n");
        printf("Integrated graphics: ");
        if (prop.integrated) printf("True\n");
        else printf("False\n");
        …
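
A few more cudaDeviceProp fields are worth printing in the same loop, especially when deciding whether zero-copy is worthwhile; these extra lines are an assumed extension, not part of the slide:

printf("Total global memory: %zu bytes\n", prop.totalGlobalMem);
printf("Multiprocessor count: %d\n", prop.multiProcessorCount);
printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("Can map host memory: %s\n", prop.canMapHostMemory ? "Yes" : "No");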

Using Multiple GPUs
Can use cudaSetDevice(deviceNum), but each GPU has to be driven from a separate thread
Fortunately this is not too bad
– Thread implementation varies by OS
– Simple example using pthreads
– Better than fork/exec, since threads share the same memory instead of each getting a copy of the memory space

Thread Sample

/* Need to compile with -pthread */
#include <stdio.h>
#include <pthread.h>

typedef struct argdata {
    int i;
    int return_val;
} arg_data;

void *TaskCode(void *argument)
{
    arg_data *p = (arg_data *) argument;
    int tid = p->i;
    printf("Hello World! It's me, thread %d!\n", tid);
    p->return_val = tid;
    return NULL;
}

int main()
{
    pthread_t thread1, thread2;
    arg_data arg1, arg2;

    /* create two threads */
    arg1.i = 1;
    arg2.i = 2;
    pthread_create(&thread1, NULL, TaskCode, (void *) &arg1);
    pthread_create(&thread2, NULL, TaskCode, (void *) &arg2);

    /* wait for all threads to complete */
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    printf("Done, values in return: %d %d\n", arg1.return_val, arg2.return_val);
    return 0;
}
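
Compiled with, say, gcc -pthread threads.c -o threads (the filename is arbitrary) and run, this prints the two hello messages in whichever order the scheduler picks, followed by "Done, values in return: 1 2".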

Threads with GPU Code

// Using two GPUs to increment an array of 4 integers by 1:
// one GPU increments the first two, the other GPU the next two.
// Don't need to use -pthread with nvcc
#include <stdio.h>
#include <pthread.h>

typedef struct argdata {
    int deviceID;
    int *data;
} arg_data;

__global__ void kernel(int *data)
{
    data[threadIdx.x]++;
}

// Each host thread drives one GPU, which increments 2 integers
void *TaskCode(void *argument)
{
    arg_data *p = (arg_data *) argument;
    int *dev_data;
    cudaSetDevice(p->deviceID);
    cudaMalloc((void **) &dev_data, 2*sizeof(int));
    cudaMemcpy(dev_data, p->data, 2*sizeof(int), cudaMemcpyHostToDevice);
    kernel<<<1, 2>>>(dev_data);
    cudaMemcpy(p->data, dev_data, 2*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
    return NULL;
}

Main

int main()
{
    pthread_t thread1, thread2;
    arg_data arg1, arg2;
    int a[4];
    a[0] = 0; a[1] = 1; a[2] = 2; a[3] = 3;

    arg1.deviceID = 0;
    arg2.deviceID = 1;
    arg1.data = &a[0];   // Address of first 2 ints
    arg2.data = &a[2];   // Address of second 2 ints

    /* create two threads */
    pthread_create(&thread1, NULL, TaskCode, (void *) &arg1);
    pthread_create(&thread2, NULL, TaskCode, (void *) &arg2);

    /* wait for all threads to complete */
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    for (int i = 0; i < 4; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
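
On a machine with at least two CUDA devices, compiling with nvcc (no -pthread needed, as noted above) and running should print "1 2 3 4", since each GPU increments its half of the array by one.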