CS179: GPU Programming Lecture 11: Lab 5 Recitation.

Today
- Monte-Carlo Integration
- Recap on CUBLAS/CURAND
- Reductions
- Optimizing a reduction

Monte-Carlo Integration
- Integration is a common tool in computational math
  - Often used for finding areas
- Integration is hard on a computer
  - Difficult to do analytically
  - Sometimes analytically impossible: e.g., exp(x^2) has no elementary antiderivative

Monte-Carlo Integration
- Could use a discrete Riemann sum
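For comparison, a minimal host-side sketch of a midpoint Riemann sum (plain C++; the integrand and interval are placeholders for illustration, not anything from the lab):

    #include <cmath>
    #include <cstdio>

    // example integrand from the previous slide: exp(x^2) has no elementary antiderivative
    static double f(double x) { return std::exp(x * x); }

    // midpoint Riemann sum of f over [a, b] with n subintervals
    static double riemann(double a, double b, int n) {
        double dx = (b - a) / n, sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += f(a + (i + 0.5) * dx) * dx;
        return sum;
    }

    int main() {
        std::printf("%f\n", riemann(0.0, 1.0, 1000000));
        return 0;
    }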

Monte-Carlo Integration
- What if there's no predefined function?
- Ex.: area of a union of shapes

Monte-Carlo Integration
- Solution: Monte-Carlo integration
  - Saturate the bounded space with sample points
  - Check whether each point is inside any shape
  - Area ≈ (# of points inside a shape / total # of points) * area of the bounding space
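A minimal host-side sketch of this idea for a union of circles; the circle data, sample count, and rand()-based sampling are purely illustrative, not the lab's setup:

    #include <cstdio>
    #include <cstdlib>

    struct Circle { double x, y, r; };

    // true iff (px, py) lies inside (or on) at least one circle
    static bool inAnyCircle(double px, double py, const Circle *c, int n) {
        for (int i = 0; i < n; i++) {
            double dx = px - c[i].x, dy = py - c[i].y;
            if (dx * dx + dy * dy <= c[i].r * c[i].r)
                return true;
        }
        return false;
    }

    int main() {
        // two overlapping circles inside the unit square
        Circle circles[] = { {0.3, 0.5, 0.2}, {0.6, 0.5, 0.25} };
        const int nPts = 1000000;
        int hits = 0;
        for (int i = 0; i < nPts; i++) {
            double px = std::rand() / (double)RAND_MAX;
            double py = std::rand() / (double)RAND_MAX;
            if (inAnyCircle(px, py, circles, 2))
                hits++;
        }
        // bounding space is the unit square, so its area is 1.0
        std::printf("estimated area: %f\n", (double)hits / nPts * 1.0);
        return 0;
    }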

Monte-Carlo Integration
- Lab 5: given N spheres in a bounded space, find the volume of their union
  - Possible to do analytically...
  - But very difficult! Spheres have random positions, volumes of intersection, etc.
- Makes good use of Monte-Carlo integration
  - Easy to check whether a point is inside any of the spheres
  - Easy to use CURAND to generate lots of points!

Lab 5
- Remember: CURAND has a host API and a device API
  - You will use both!
- volumeCUBLAS: uses the host API with CUBLAS
- volumeCUDA: uses the device API with a reduction kernel

Lab 5: volumeCUBLAS
- Allocate the necessary memory
  - Memory for the points
  - Memory for one flag per point (is the point in any sphere?)
- Use the CURAND host API to generate lots of points
  - Create, seed, generate, destroy
- Use the CheckPointsK kernel to see whether each point is in a sphere
  - You must write this kernel!
- Get the total # of points inside a sphere using cublasDasum
  - cublasDasum(int n, double *src, int stride) returns the sum of absolute values
- Free the allocated memory
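A hedged host-side sketch of this flow, assuming the per-point flags are stored as doubles so cublasDasum can sum them; estimateVolume, the CheckPointsK signature, and the launch configuration are illustrative assumptions, not the lab skeleton's actual interface:

    #include <cuda_runtime.h>
    #include <curand.h>
    #include <cublas.h>   // legacy CUBLAS API, matching the slide's cublasDasum

    // hypothetical declaration; the real kernel is the one you write for the lab
    __global__ void CheckPointsK(const float *pts, const float *spheres,
                                 int nSpheres, int nPts, double *inSphere);

    double estimateVolume(const float *d_spheres, int nSpheres,
                          int nPts, double boxVolume) {
        float  *d_pts;
        double *d_flags;
        cudaMalloc((void **)&d_pts,   3 * nPts * sizeof(float));   // x, y, z per point
        cudaMalloc((void **)&d_flags, nPts * sizeof(double));

        // CURAND host API: create, seed, generate, destroy
        curandGenerator_t gen;
        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
        curandGenerateUniform(gen, d_pts, 3 * nPts);               // uniform in (0, 1]
        curandDestroyGenerator(gen);

        // flag[i] = 1.0 if point i is inside any sphere, else 0.0
        CheckPointsK<<<(nPts + 255) / 256, 256>>>(d_pts, d_spheres,
                                                  nSpheres, nPts, d_flags);

        // legacy CUBLAS: sum of flags = # of points inside some sphere
        cublasInit();
        double inside = cublasDasum(nPts, d_flags, 1);
        cublasShutdown();

        cudaFree(d_pts);
        cudaFree(d_flags);
        return inside / nPts * boxVolume;
    }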

Lab 5: volumeCUDA
- Allocate memory for the data
  - Now we also need memory for curandStates!
- Generate lots of points using the CURAND device API
  - Call the GenerateRandom3K kernel -- but you must fill in the kernel!
- Check whether points are in a sphere
  - Same as volumeCUBLAS
- Use a reduction to sum the vector
  - More on this later...
- Free memory
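A hedged sketch of what GenerateRandom3K might look like with the CURAND device API; the parameter list and launch assumptions are illustrative, since the lab skeleton defines the real signature:

    #include <curand_kernel.h>

    __global__ void GenerateRandom3K(curandState *states, float3 *pts,
                                     int nPts, unsigned long long seed) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= nPts) return;

        // each thread initializes its own state once (seed, sequence id, offset)
        curand_init(seed, idx, 0, &states[idx]);

        // three uniform draws in (0, 1] make one point
        pts[idx] = make_float3(curand_uniform(&states[idx]),
                               curand_uniform(&states[idx]),
                               curand_uniform(&states[idx]));
    }

This assumes one thread per point; if nPts exceeds the total thread count, a grid-stride loop that reuses each thread's state would be needed instead.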

Lab 5 Kernels
- PointInSphere: checks whether a point is inside a given sphere
  - Do this first! Should be easy geometry
- CheckPointsK: checks whether a point is inside any sphere
  - Copy the spheres to shared memory, then iterate through them
  - Remember to make sure the array entry is non-NULL
- GenerateRandom3K: generates lots of float3 points
  - Use the CURAND device API
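A minimal sketch of the PointInSphere geometry, assuming a sphere is passed as a center and radius; the lab's actual sphere representation and function signature may differ:

    // inside (or on) the sphere iff squared distance from the center <= radius^2
    __device__ bool PointInSphere(float3 p, float3 center, float radius) {
        float dx = p.x - center.x;
        float dy = p.y - center.y;
        float dz = p.z - center.z;
        return dx * dx + dy * dy + dz * dz <= radius * radius;
    }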

Reduction
- Iteratively reduces an array via a reduce function (e.g., addition)

Reduction
- Start with size = nPts / 2
- Repeatedly call the reduction kernel, halving size each time
- With the main loop on the host, the device code is very simple...
  - Each thread just adds element i and element i + size
- Alternatively, could build the loop into the device code and call the kernel only once
- Once size == 1, we should have summed up all the elements
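A minimal sketch of this host-loop reduction, assuming nPts is a power of two and the data lives in a double array; the names and launch parameters are illustrative:

    // fold the upper half of the active range onto the lower half
    __global__ void SumReduceK(double *data, int size) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < size)
            data[i] += data[i + size];
    }

    void sumReduce(double *d_data, int nPts) {
        // after the loop, d_data[0] holds the sum of all nPts elements
        for (int size = nPts / 2; size >= 1; size /= 2) {
            int threads = 256;
            int blocks  = (size + threads - 1) / threads;
            SumReduceK<<<blocks, threads>>>(d_data, size);
        }
    }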

Reduction
- Lots of optimizations to make!
  - Avoiding thread divergence
  - Contiguous memory accesses
  - Avoiding shared memory bank conflicts
- More we haven't discussed yet...
  - Unrolling loops
  - Templates
  - And more!

Optimizations
- Avoiding thread divergence
  - Avoid conditionals that send threads within the same warp down different paths
    - if (threadIdx.x % 2 == 0)
  - Instead, group the work by warp
    - if (threadIdx.x / WARP_SIZE == 0)

Optimizations
- Contiguous memory accesses
  - Memory is linear; we can't just swap dimensions
  - Need to address non-sequential accesses...
- Shared memory bank conflicts
  - Also solved by sequential addressing!

Optimizations
- Example in the reduction kernel: reversed loop indexing
  - Before: for (int i = 1; i < max_size; i *= 2) { ... }
  - After:  for (int i = max_size / 2; i > 0; i /= 2) { ... }
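A hedged sketch of a block-level, sequentially addressed shared-memory reduction; the block size is assumed to be a power of two, and the kernel name is illustrative:

    __global__ void BlockSumK(const double *in, double *out, int n) {
        extern __shared__ double sdata[];
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + threadIdx.x;

        sdata[tid] = (i < n) ? in[i] : 0.0;
        __syncthreads();

        // at each step the lower half of the threads add in the upper half:
        // consecutive threads touch consecutive shared-memory words, so active
        // warps don't diverge and there are no bank conflicts
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            out[blockIdx.x] = sdata[0];   // one partial sum per block
    }

It would be launched with the dynamic shared-memory size as the third launch parameter, e.g. BlockSumK<<<blocks, threads, threads * sizeof(double)>>>(d_in, d_out, n), leaving one partial sum per block to be reduced in a second pass.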

Optimizations
- Unrolling loops
  - Basic idea: once the reduction size fits within a single warp (size <= 32), most threads are idle and the remaining iterations are wasted work
  - Unrolling the last iterations of the loop avoids this useless work

Optimizations
- Unrolling loops example:
  - Before:
        for (int i = max_size / 2; i > 0; i /= 2) {
            sdata[tid] += sdata[tid + i];
        }
  - After (stop the loop at the last warp, then unroll it):
        for (int i = max_size / 2; i > 32; i /= 2) {
            sdata[tid] += sdata[tid + i];
        }
        if (tid < 32) {
            sdata[tid] += sdata[tid + 32];
            sdata[tid] += sdata[tid + 16];
            sdata[tid] += sdata[tid + 8];
            // etc...
        }
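The unrolled last warp is often factored into a helper like the hedged sketch below; the volatile qualifier keeps the compiler from caching shared-memory values in registers, and the shared array is assumed to have at least 64 entries. This warp-synchronous style matches the older GPUs these slides target; newer architectures need explicit __syncwarp() or warp shuffle intrinsics instead.

    __device__ void warpReduce(volatile double *sdata, int tid) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }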

Optimizations
- Advanced unrolling: templates
  - Exploit the compiler to handle some conditions at compile time
  - Use templated functions (like in C++)
  - Ex.:
        template <unsigned int blockSize>
        __global__ void kernel(...) {
            if (blockSize >= 512) { /* some reduction code */ }
            else if (blockSize >= 256) { /* some reduction code */ }
            // etc...
        }
  - Then, call the templated kernel on the host:
        kernel<512><<<gridSize, blockSize>>>(...);

Optimizations
- Works well with a switch statement:
        switch (numThreads) {
            case 512: kernel<512><<<gridSize, numThreads>>>(...); break;
            case 256: kernel<256><<<gridSize, numThreads>>>(...); break;
            case 128: kernel<128><<<gridSize, numThreads>>>(...); break;
            // etc...
        }
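Putting the template and switch together, a self-contained hedged sketch; ReduceK, launchReduce, and the double element type are illustrative assumptions, and block sizes are assumed to be powers of two of at least 128:

    template <unsigned int blockSize>
    __global__ void ReduceK(const double *in, double *out, int n) {
        extern __shared__ double sdata[];
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockSize + tid;
        sdata[tid] = (i < n) ? in[i] : 0.0;
        __syncthreads();

        // these branches are resolved at compile time, one instantiation per blockSize
        if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
        if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
        if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }

        if (tid < 32) {
            // warp-synchronous unrolled tail (see the previous slide)
            volatile double *v = sdata;
            v[tid] += v[tid + 32]; v[tid] += v[tid + 16]; v[tid] += v[tid + 8];
            v[tid] += v[tid + 4];  v[tid] += v[tid + 2];  v[tid] += v[tid + 1];
        }

        if (tid == 0) out[blockIdx.x] = sdata[0];   // one partial sum per block
    }

    void launchReduce(const double *d_in, double *d_out, int n,
                      int numThreads, int numBlocks) {
        size_t smem = numThreads * sizeof(double);
        switch (numThreads) {
            case 512: ReduceK<512><<<numBlocks, 512, smem>>>(d_in, d_out, n); break;
            case 256: ReduceK<256><<<numBlocks, 256, smem>>>(d_in, d_out, n); break;
            case 128: ReduceK<128><<<numBlocks, 128, smem>>>(d_in, d_out, n); break;
            // etc...
        }
    }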