CUDA: More GA, Events, Atomics

GA Revisited
- Speedup with a more computationally intense evaluation function
- Parallel version of the crossover and mutation operators
  - In this version of CUDA there is no random number generation on the GPU
  - Used rand.cu, a linear congruential (and fairly poor) random number generator
  - New kernel invocation based on POP_SIZE (a rough sketch follows below)
- See the file online for the code
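
The full code is online; as a rough, hypothetical sketch of the idea (not the course's rand.cu or kernel code; the names, LCG constants, and float-array representation of an individual are assumptions), a mutation kernel could give each individual its own linear congruential state and run one thread per individual:

    // Sketch only: per-thread linear congruential generator (LCG), one thread per individual.
    __device__ unsigned int lcg_rand(unsigned int *state)
    {
        // Classic 32-bit LCG; low quality, in the same spirit as rand.cu
        *state = *state * 1664525u + 1013904223u;
        return *state;
    }

    __global__ void mutate_kernel(float *population, unsigned int *rng_states,
                                  int pop_size, int genes_per_individual,
                                  float mutation_rate)
    {
        int ind = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per individual
        if (ind >= pop_size) return;

        unsigned int state = rng_states[ind];
        for (int g = 0; g < genes_per_individual; g++) {
            float r = (lcg_rand(&state) & 0xFFFFFF) / 16777216.0f;            // in [0,1)
            if (r < mutation_rate) {
                float delta = (lcg_rand(&state) & 0xFFFF) / 65536.0f - 0.5f;  // in [-0.5,0.5)
                population[ind * genes_per_individual + g] += delta;
            }
        }
        rng_states[ind] = state;   // keep the state for the next generation
    }

    // Example launch driven by POP_SIZE:
    // mutate_kernel<<<(POP_SIZE + 31) / 32, 32>>>(dev_pop, dev_states, POP_SIZE, GENES, 0.01f);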

Events
Can use CUDA events to time how long something takes:

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    ...
    cudaEventRecord(start, 0);
    // Do some work on the GPU, stream 0
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);    // wait until the stop event has actually completed
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("Time: %3.2f milliseconds\n", elapsedTime);

Texture Memory (just mentioning)
- Requires use of some graphics-like functions for non-graphics use, e.g. tex1Dfetch(textureIn, offset);
- Designed for the OpenGL rendering pipeline
- Read-only memory, but cached on chip
- Good for memory access patterns with spatial locality (e.g. accessing other items near a 2D coordinate)
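
For context, a minimal hypothetical sketch of how tex1Dfetch is used with the legacy texture-reference API of this CUDA generation (deprecated in later releases); the names and sizes here are assumptions, not course code:

    // Sketch only: 1D texture fetches from linear device memory.
    texture<float, 1, cudaReadModeElementType> textureIn;   // file-scope texture reference

    __global__ void scale_kernel(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * tex1Dfetch(textureIn, i);   // read-only, cached fetch
    }

    // Host side (float *dev_in, *dev_out already cudaMalloc'd and dev_in filled):
    //   cudaBindTexture(NULL, textureIn, dev_in, n * sizeof(float));
    //   scale_kernel<<<(n + 255) / 256, 256>>>(dev_out, n);
    //   cudaUnbindTexture(textureIn);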

Atomics
- Requires Compute Capability 1.1 or higher for global-memory atomics (1.2 or higher for the shared-memory atomics used later)
  - Our Tesla supports 1.3
  - GeForce GTX 580 and Tesla S2070 support 2.0
- Atomics are used to manage contention when multiple processors or threads access a shared resource
- Primitives discussed earlier, e.g. an exchange instruction can implement locks/barrier synchronization and keep memory consistent (sketch below)
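
As a rough illustration of the exchange idea (not code from the slides), atomicExch can build a simple spin lock around a critical section; this is a sketch only, since naive per-thread spin locks can deadlock within a warp on these GPUs, which is why only one thread per block contends here:

    // Sketch only: a spin lock built from atomicExch.
    __device__ int lock = 0;              // 0 = free, 1 = held

    __global__ void locked_increment(int *counter)
    {
        if (threadIdx.x == 0) {           // one contender per block avoids intra-warp spinning
            while (atomicExch(&lock, 1) != 0)
                ;                         // spin until we swap in a 1 and get back a 0
            *counter = *counter + 1;      // critical section
            __threadfence();              // make the update visible before releasing
            atomicExch(&lock, 0);         // release the lock
        }
    }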

Example: Histogram (CPU version 1)

    #include <stdio.h>
    #include <string.h>
    #define SIZE 26

    int main()
    {
        char *buffer = "THE QUICK BROWN FOX JUMPED OVER THE LAZY DOGS";
        int histo[SIZE];   // count of letters A-Z in buffer: [0]=A's, [1]=B's, etc.

        for (int i = 0; i < SIZE; i++)
            histo[i] = 0;

        for (int i = 0; i < strlen(buffer); i++)
            if (buffer[i] >= 'A' && buffer[i] <= 'Z')   // skip the spaces
                histo[buffer[i]-'A']++;

        for (int i = 0; i < SIZE; i++)
            printf("%c appears %d times.\n", ('A'+i), histo[i]);

        return 0;
    }

Longer Histogram

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #define STRSIZE (26*20000)   // chars
    #define SIZE 26

    int main()
    {
        char *buffer;
        int histo[SIZE];

        printf("Creating data buffer...\n");
        buffer = (char *)malloc(STRSIZE + 1);   // +1 for the terminating '\0'
        for (int i = 0; i < STRSIZE; i++)
            buffer[i] = 'A' + (i % 26);
        buffer[STRSIZE] = '\0';                 // so strlen() works

        for (int i = 0; i < SIZE; i++)
            histo[i] = 0;

        printf("Counting...\n");
        for (int i = 0; i < strlen(buffer); i++)
            histo[buffer[i]-'A']++;

        for (int i = 0; i < SIZE; i++)
            printf("%c appears %d times.\n", ('A'+i), histo[i]);

        free(buffer);
        return 0;
    }

First Attempt – GPU Version
Have a thread operate on each character: 520,000 chars
- Let's use 32 threads/block, so STRSIZE/32 = 16,250 blocks
- Index into the string is the usual blockIdx.x * blockDim.x + threadIdx.x

    __global__ void histo_kernel(char *dev_buffer, int *dev_histo)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        char c = dev_buffer[tid];
        dev_histo[c-'A']++;
    }

    histo_kernel<<<STRSIZE/32, 32>>>(dev_buffer, dev_histo);
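
The slides show only the kernel; a minimal hypothetical sketch of the host-side setup (error checking omitted; the dev_* names are assumptions, while buffer, histo, SIZE, and STRSIZE are as in the CPU version above):

    char *dev_buffer;
    int  *dev_histo;

    cudaMalloc((void **)&dev_buffer, STRSIZE);
    cudaMalloc((void **)&dev_histo, SIZE * sizeof(int));

    cudaMemcpy(dev_buffer, buffer, STRSIZE, cudaMemcpyHostToDevice);
    cudaMemset(dev_histo, 0, SIZE * sizeof(int));              // zero the device counts

    histo_kernel<<<STRSIZE/32, 32>>>(dev_buffer, dev_histo);   // 16,250 blocks of 32 threads

    cudaMemcpy(histo, dev_histo, SIZE * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_buffer);
    cudaFree(dev_histo);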

FAIL – WHY?
Many threads read, increment, and write the same dev_histo[] entry at the same time; dev_histo[c-'A']++ is a non-atomic read-modify-write, so increments are lost and the counts come out wrong.

Atomics to the Rescue
- Guarantee that only one thread can perform the operation at a time
- Must compile with nvcc -arch=sm_11 or nvcc -arch=sm_12
- atomicAdd, atomicSub, atomicMin, atomicMax, atomicExch, atomicInc
- See the CUDA documentation for the full list of atomic functions

Modified Kernel

    __global__ void histo_kernel(char *dev_buffer, int *dev_histo)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        char c = dev_buffer[tid];
        atomicAdd(&dev_histo[c-'A'], 1);   // the increment is now performed atomically
    }

Shared Memory Atomics
Compute Capability 1.2+ allows atomics in shared memory; same launch as before (STRSIZE/32 blocks of 32 threads).

    __global__ void histo_kernel(char *dev_buffer, int *dev_histo)
    {
        __shared__ int block_histo[SIZE];

        // Zero the per-block histogram (threads 0-25, one bin each)
        if (threadIdx.x < 26)
            block_histo[threadIdx.x] = 0;
        __syncthreads();

        // Add this thread's character to the block histogram in shared memory
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        char c = dev_buffer[tid];
        atomicAdd(&block_histo[c-'A'], 1);
        __syncthreads();

        // Add the block histogram into the global histogram (threads 0-25)
        if (threadIdx.x < 26)
            atomicAdd(&dev_histo[threadIdx.x], block_histo[threadIdx.x]);
    }