Introduction to CUDA Programming


Introduction to CUDA Programming: Scans
Andreas Moshovos, Winter 2009
Based on slides from Wen-mei Hwu (UIUC) and David Kirk (NVIDIA), and on a white paper and slides by Mark Harris (NVIDIA).

Scan / Parallel Prefix Sum

Given an array A = [a0, a1, …, an-1] and a binary associative operator @ with identity I:

    scan(A) = [I, a0, (a0 @ a1), …, (a0 @ a1 @ … @ an-2)]

Example (with @ = +):

    input:  3 1 7 0 4 1 6 3
    output: 0 3 4 11 11 15 16 22

This is the exclusive scan. We'll focus on this.

Given an array A = [a0, a1, …, an-1] and a binary associative operator @ with identity I:

    scan(A) = [a0, (a0 @ a1), …, (a0 @ a1 @ … @ an-1)]

Example (with @ = +):

    input:  3 1 7 0 4 1 6 3
    output: 3 4 11 11 15 16 22 25

This is the inclusive scan.

Applications of Scan

Scan is used as a building block for many parallel algorithms: radix sort, quicksort, string comparison, lexical analysis, run-length encoding, histograms, and more.

See: Guy E. Blelloch. "Prefix Sums and Their Applications". In John H. Reif (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1990. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/scandal/public/papers/CMU-CS-90-190.html

Scan Background

Pre-GPU: first proposed in APL by Iverson (1962); used as a data-parallel primitive on the Connection Machine (1990), and a feature of C* and CM-Lisp. Guy Blelloch used scan as a primitive for various parallel algorithms (Blelloch, 1990, "Prefix Sums and Their Applications").

GPU computing: O(n log n)-work GPU implementation by Daniel Horn (GPU Gems 2), applied to summed-area tables by Hensley et al. (EG05); O(n)-work GPU scan by Sengupta et al. (EDGE06) and Greß et al. (EG06); O(n) work-and-space GPU implementation by Harris et al. (2007), in the NVIDIA CUDA SDK and GPU Gems 3, applied to radix sort, stream compaction, and summed-area tables.

Sequential Algorithm

    void scan(float* output, float* input, int length)
    {
        output[0] = 0;  // since this is a prescan (exclusive scan), not an inclusive scan
        for (int j = 1; j < length; ++j)
            output[j] = input[j-1] + output[j-1];
    }

N additions in total. Use this as a guide: we want the parallel version to be work efficient, i.e., to do a similar amount of work.
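As a sanity check, a minimal driver (mine, not the slides') running this on the example array from the earlier slides reproduces the exclusive-scan output:

    #include <stdio.h>

    int main(void)
    {
        float in[8] = {3, 1, 7, 0, 4, 1, 6, 3};
        float out[8];
        scan(out, in, 8);             // the sequential scan above
        for (int i = 0; i < 8; ++i)
            printf("%g ", out[i]);    // prints: 0 3 4 11 11 15 16 22
        printf("\n");
        return 0;
    }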

Naïve Parallel Algorithm

    for d := 1 to log2(n) do
        forall k in parallel do
            if k >= 2^(d-1) then
                x[k] := x[k - 2^(d-1)] + x[k]

Starting from the shifted input (a prescan shifts the input right by one and inserts the identity):

    input:               3 1 7 0 4 1 6 3
    start:               0 3 1 7 0 4 1 6
    d = 1, 2^(d-1) = 1:  0 3 4 8 7 4 5 7
    d = 2, 2^(d-1) = 2:  0 3 4 11 11 12 12 11
    d = 3, 2^(d-1) = 4:  0 3 4 11 11 15 16 22

Need for Double-Buffering

Conceptually, at each step all threads first read and then all write, but a GPU gives no such ordering guarantees across threads. Solution: use two arrays, input and output, and alternate between them at each step.

    read from: 0 3 4 8 7 4 5 7
    write to:  0 3 4 11 11 12 12 11

Double Buffering

Two arrays A and B in shared memory; the input and the result live in global memory.

    input (global):  3 1 7 0 4 1 6 3
    A (shifted):     0 3 1 7 0 4 1 6
    B (step 1):      0 3 4 8 7 4 5 7
    A (step 2):      0 3 4 11 11 12 12 11
    B (step 3):      0 3 4 11 11 15 16 22
    output (global): 0 3 4 11 11 15 16 22

Each step reads from one array and writes into the other, alternating roles.

Naïve Kernel in CUDA

    __global__ void scan_naive(float *g_odata, float *g_idata, int n)
    {
        extern __shared__ float temp[];  // double buffer: 2*n floats, size set at launch
        int thid = threadIdx.x;
        int pout = 0, pin = 1;

        // load shifted input: element 0 gets the identity (exclusive scan)
        temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;

        for (int dd = 1; dd < n; dd *= 2)
        {
            // swap the roles of the two buffers
            pout = 1 - pout;
            pin  = 1 - pout;
            int basein = pin * n, baseout = pout * n;
            __syncthreads();
            temp[baseout + thid] = temp[basein + thid];
            if (thid >= dd)
                temp[baseout + thid] += temp[basein + thid - dd];
        }
        __syncthreads();
        g_odata[thid] = temp[pout*n + thid];
    }
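A hedged launch sketch for this kernel (the host-side names are my own, not from the slides): one thread per element and 2*n floats of dynamic shared memory for the double buffer:

    int n = 8;
    float h_in[8] = {3, 1, 7, 0, 4, 1, 6, 3};
    float h_out[8];
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // one block, n threads, 2*n floats of shared memory (the two buffers)
    scan_naive<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    // h_out is now {0, 3, 4, 11, 11, 15, 16, 22}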

Analysis of the Naïve Kernel

This scan algorithm executes log2(n) parallel iterations. The steps do n-1, n-2, n-4, …, n/2 adds each, for a total of O(n log n) adds. This scan algorithm is NOT work efficient: the sequential scan does only n adds.

Improving Efficiency

A common parallel-algorithms pattern: balanced trees. Build a balanced binary tree on the input data and sweep it to and from the root. The tree is conceptual, not an actual data structure. For scan: traverse from the leaves to the root building partial sums at internal nodes (the root then holds the sum of all leaves), then traverse from the root back to the leaves building the scan from the partial sums. The algorithm was originally described by Blelloch (1990).

Balanced Tree-Based Scan Algorithm / Up-Sweep

(The original deck steps through the up-sweep in a sequence of figures, not reproduced in the transcript; the up-sweep is traced numerically a few slides below.)

Balanced Tree-Based Scan Algorithm / Down-Sweep

(Again a sequence of figures, not reproduced; the down-sweep is worked through step by step below.)

Up-Sweep Pseudo-Code
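The pseudo-code on this slide was an image that did not survive the transcript; the standard formulation of the up-sweep (Blelloch 1990; GPU Gems 3, ch. 39), operating in place on an array x of n = 2^m elements, is:

    for d := 0 to log2(n) - 1 do
        forall k := 0 to n - 1 by 2^(d+1) in parallel do
            x[k + 2^(d+1) - 1] := x[k + 2^d - 1] + x[k + 2^(d+1) - 1]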

Down-Sweep Pseudo-Code
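The image is likewise missing here; the standard formulation of the down-sweep is:

    x[n - 1] := 0
    for d := log2(n) - 1 downto 0 do
        forall k := 0 to n - 1 by 2^(d+1) in parallel do
            t := x[k + 2^d - 1]
            x[k + 2^d - 1] := x[k + 2^(d+1) - 1]
            x[k + 2^(d+1) - 1] := t + x[k + 2^(d+1) - 1]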

Two Phases

Up-Sweep: essentially a reduction; produces many partial results along the way.
Down-Sweep: propagates the partial results to all relevant elements.

Up-Sweep: Just a Reduction

    input:   1 2 2 5 6 3 8 2 4 1 5 2 7 9 3 5
    step 1:  1 3 2 7 6 9 8 10 4 5 5 7 7 16 3 8
    step 2:  1 3 2 10 6 9 8 19 4 5 5 12 7 16 3 24
    step 3:  1 3 2 10 6 9 8 29 4 5 5 12 7 16 3 36
    step 4:  1 3 2 10 6 9 8 29 4 5 5 12 7 16 3 65

Up-Sweep as a Tree

Now let's see this as a tree. The leaves are the input (1 2 2 5 6 3 8 2 4 1 5 2 7 9 3 5); the first level of internal nodes holds the pair sums (3 7 9 10 5 7 16 8), the next level holds (10 19 12 24), the next (29 36), and the root holds 65. Notice that only some of these nodes are left in our array; the rest were partial results that got overwritten:

    1 3 2 10 6 9 8 29 4 5 5 12 7 16 3 65

Up-Sweep: What's Left

So this is what's left, viewed as a tree (nodes without values don't exist; they were partial results):

    leaves kept:       1 2 6 8 4 5 7 3
    level-1 sums kept: 3 9 5 16
    level-2 sums kept: 10 12
    level-3 sum kept:  29
    root:              65

Down-Sweep

For the second phase we need to think of the edges in reverse, and of the empty nodes as placeholders for partial results. (Same tree as above: leaves 1 2 6 8 4 5 7 3; internal nodes 3 9 5 16, then 10 12, then 29; root 65.)

Now let's view the tree as a collection of subtrees. The root of each subtree, where it is still present, contains the reduction of all subtree elements, i.e., the sum of all elements in that subtree.

Let's focus on the rightmost subtree: the last two array elements, the leaf 3 and the root placeholder next to it.

Before the last step of the down-sweep phase, the rightmost element (shown in yellow on the slide) contains the sum (57) of all elements to the left of the subtree: the pair is (3, 57). The last step takes the following two actions:
3 + 57 = 60 goes into the rightmost element; this is the sum of all elements including 3 but excluding the rightmost one.
3 is overwritten with 57; this is the sum of all elements to the left of 3.

In terms of the array stored in memory, these actions look like this: (…, 3, 57) becomes (…, 57, 60). In the slide figure, the dark arrows represent addition and the red dotted arrow represents a move.

Let's now focus on the rightmost subtree that contains the last four nodes: the leaves 7 and 3, and the internal node 16. This subtree is processed at the step before the one we just discussed.

Before the second-to-last step of the down-sweep phase, the subtree's root (shown in green on the slide) contains the sum (41) of all elements to the left of the subtree: the four positions hold 7, 16, 3, and 41.

The actions taken at this step are:
16 + 41 = 57 is written into the root of the rightmost two-element subtree; as we saw before, this is the sum of all elements to the left of that subtree.
41 replaces 16; this is the sum of all elements to the left of the subtree rooted by 16.

In terms of the array stored in memory, these actions look like this: (7, 16, 3, 41) becomes (7, 41, 3, 57). In the slide figure, the dark arrows represent addition and the red dotted arrow represents a move.

Now let's go one step back, looking at the complete right subtree (shown in green): leaves 4, 5, 7, 3, internal nodes 5 and 16, and root position holding 12.

Before this step, the root position contains the sum (29) of all elements of the left subtree, i.e., everything to the left of this subtree: 4 5 7 3 / 5 16 / 12, with 29 in the root placeholder.

As before, we do two things:
29 + 12 = 41 becomes the root of the rightmost subtree; this is the sum of all elements to the left of that subtree, ready for the next step (which we saw previously).
29 replaces 12, for the same reason: 29 is the sum of all elements to the left of the subtree rooted by what was 12.

Let's try to generalize what happens at every step of the down-sweep phase. Look at step 1: there is only one subtree (shown in purple on the slide), the whole tree, holding the up-swept array 1 3 2 10 6 9 8 29 4 5 5 12 7 16 3 65.

Before we process this tree as described before, the root node must contain the sum of all elements to the left of the tree. There are no such elements, hence the root must be 0.

Now repeat the steps we saw before:
29 + 0 = 29 becomes the root of the right subtree.
29 gets replaced by 0.

In terms of the array stored in memory, these actions look like this:

    before: 1 3 2 10 6 9 8 29 4 5 5 12 7 16 3 0
    after:  1 3 2 10 6 9 8 0 4 5 5 12 7 16 3 29

In the slide figure, the dark arrows represent addition and the red dotted arrow represents a move.

CUDA Implementation: Declarations & Copying to Shared Memory

Two elements per thread:

    __global__ void prescan(float *g_odata, float *g_idata, int n)
    {
        extern __shared__ float temp[];  // n floats, allocated on invocation
        int thid = threadIdx.x;
        int offset = 1;
        temp[2*thid]   = g_idata[2*thid];    // load input into shared memory
        temp[2*thid+1] = g_idata[2*thid+1];

Up-Sweep

        for (int d = n>>1; d > 0; d >>= 1)  // build sum in place up the tree
        {
            __syncthreads();
            if (thid < d)
            {
                int ai = offset*(2*thid+1)-1;
                int bi = offset*(2*thid+2)-1;
                temp[bi] += temp[ai];
            }
            offset *= 2;
        }

Same computation as before, but a different assignment of threads to elements.

Up-Sweep: Who Does What (N = 16)

    offset = 1, d = 8:  thid 0..7 use (ai, bi) = (0,1) (2,3) (4,5) (6,7) (8,9) (10,11) (12,13) (14,15)
    offset = 2, d = 4:  thid 0..3 use (ai, bi) = (1,3) (5,7) (9,11) (13,15)
    offset = 4, d = 2:  thid 0..1 use (ai, bi) = (3,7) (11,15)
    offset = 8, d = 1:  thid 0 uses (ai, bi) = (7,15)

Down-Sweep

        if (thid == 0) { temp[n - 1] = 0; }  // clear the last element

        for (int d = 1; d < n; d *= 2)       // traverse down tree & build scan
        {
            offset >>= 1;
            __syncthreads();
            if (thid < d)
            {
                int ai = offset*(2*thid+1)-1;
                int bi = offset*(2*thid+2)-1;
                float t   = temp[ai];
                temp[ai]  = temp[bi];
                temp[bi] += t;
            }
        }

Down-Sweep: Who Does What (N = 16)

    offset = 8, d = 1:  thid 0 uses (ai, bi) = (7,15)
    offset = 4, d = 2:  thid 0..1 use (ai, bi) = (3,7) (11,15)
    offset = 2, d = 4:  thid 0..3 use (ai, bi) = (1,3) (5,7) (9,11) (13,15)
    offset = 1, d = 8:  thid 0..7 use (ai, bi) = (0,1) (2,3) (4,5) (6,7) (8,9) (10,11) (12,13) (14,15)

Copy to Output

All threads do:

        __syncthreads();
        // write results to global memory
        g_odata[2*thid]   = temp[2*thid];
        g_odata[2*thid+1] = temp[2*thid+1];
    }
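Putting the pieces together, a hedged launch sketch (the host-side names are my own): the kernel as written handles one block of n elements, with n/2 threads each covering two elements and n floats of dynamic shared memory:

    int n = 16;  // power-of-two size, small enough for one block
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    // ... copy the input into d_in ...

    // n/2 threads (two elements each), n floats of shared memory
    prescan<<<1, n/2, n * sizeof(float)>>>(d_out, d_in, n);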

Shared-Memory Bank Conflicts

The current scan implementation has many shared-memory bank conflicts, and these really hurt performance on hardware. Conflicts occur when multiple threads access the same shared-memory bank with different addresses. There is no penalty if all threads access different banks, or if all threads access the exact same address. A conflicting access costs 2*M cycles, where M is the maximum number of threads accessing a single bank.

Loading from Global Memory to Shared

Each thread loads two shared-memory data elements. The original code interleaves the loads:

    temp[2*thid]   = g_idata[2*thid];
    temp[2*thid+1] = g_idata[2*thid+1];

Threads (0, 1, 2, …, 8, 9, 10, …) hit banks (0, 2, 4, …, 0, 2, 4, …): two-way conflicts. It is better to load one element from each half of the array:

    temp[thid]         = g_idata[thid];
    temp[thid + (n/2)] = g_idata[thid + (n/2)];

Bank Conflicts in the Tree Algorithm / Up-Sweep

When we build the sums, each thread reads two shared-memory locations and writes one. In the first iteration, 2 threads access each of 8 banks: threads 0 and 8 access bank 0, threads 1 and 9 access bank 2, and so on, i.e., 2-way bank conflicts throughout. (The slide figures show banks 0-15, with like-colored arrows representing simultaneous memory accesses.)

Bank Conflicts in the Tree Algorithm

The second iteration is even worse: 4-way bank conflicts. For example, threads 0, 4, 8, 12 access bank 1, threads 1, 5, 9, 13 access bank 5, etc.: 4 threads access each of 4 banks. (Again, like-colored arrows in the slide figure represent simultaneous memory accesses.)

Using Padding to Prevent Conflicts

We can use padding to prevent bank conflicts: just add a word of padding every 16 words. (The slide figure shows banks 0-15 with a pad word P after every 16 entries; the pad shifts each subsequent group of entries by one bank.)

Using Padding to Remove Conflicts

After you compute a shared-memory address like this:

    address = 2 * stride * thid;

add padding like this:

    address += (address / 16);  // i.e., address += (address >> 4);

This removes most bank conflicts, though not all of them in the case of deep trees.
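For reference, the GPU Gems 3 chapter wraps this same computation in a macro; a sketch of that convention (the shared array must then be sized n plus one pad word per NUM_BANKS elements):

    #define NUM_BANKS 16
    #define LOG_NUM_BANKS 4
    #define CONFLICT_FREE_OFFSET(n) ((n) >> LOG_NUM_BANKS)

    // inside the kernel, after computing ai and bi as before:
    int ai = offset*(2*thid+1)-1;
    int bi = offset*(2*thid+2)-1;
    ai += CONFLICT_FREE_OFFSET(ai);  // skip one pad word per 16 elements
    bi += CONFLICT_FREE_OFFSET(bi);
    temp[bi] += temp[ai];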

Scan Bank Conflicts (1)

A full binary tree with 64 leaf nodes has multiple 2-way and 4-way bank conflicts. Shared-memory cost for the whole tree, for one 32-thread warp: each s[a] += s[b] costs 6 cycles without conflicts, counting 2 shared-memory reads and one write. With conflicts: 6 * (2+4+4+4+2+1) = 102 cycles, versus 36 cycles (6 * 6) if there were no bank conflicts.

Look at step 1 (stride 1, 2-way conflicts):
1. The first 8 threads read s[a]: 2 cycles.
2. The next 8 threads read s[a]: 2 cycles.
3. The first 8 threads read s[b]: 2 cycles.
4. The next 8 threads read s[b]: 2 cycles.
5. The first 8 threads update s[a]: 2 cycles.
6. The next 8 threads update s[a]: 2 cycles.
Total: 12 cycles, i.e., 6 (the no-bank-conflict time) x 2 (the number of conflict ways).

Look at step 2 (stride 2, 4-way conflicts):
1. Threads 0-3 read s[a]: 2 cycles.
2. Threads 4-7 read s[a]: 2 cycles.
3. Threads 8-11 read s[a]: 2 cycles.
4. Threads 12-15 read s[a]: 2 cycles.
The same pattern repeats for the s[b] reads and the s[a] updates, so the total is 24 cycles, i.e., 6 (the no-bank-conflict time) x 4 (the number of conflict ways).

Hence, over all steps: 6 * (2+4+4+4+2+1) = 102 cycles.

Scan Bank Conflicts (2)

It's much worse with bigger trees. For a full binary tree with 128 leaf nodes (only the last 6 iterations shown on the slide: the root and the 5 levels below it), the cost for the whole tree is 12*2 + 6*(4+8+8+4+2+1) = 186 cycles, versus 48 cycles (12*1 + 6*6) with no bank conflicts. Note that two warps are needed for the first step, hence the 12*2 for that step; after step 1 only one warp is active.

Scan Bank Conflicts (3)

For a full binary tree with 512 leaf nodes (again only the last 6 iterations shown: the root and the 5 levels below it), the cost for the whole tree is 48*2 + 24*4 + 12*8 + 6*(16+16+8+4+2+1) = 570 cycles, versus 120 cycles with no bank conflicts.

Fixing Scan Bank Conflicts

Insert padding every NUM_BANKS elements:

    const int LOG_NUM_BANKS = 4;  // 16 banks
    int tid = threadIdx.x;
    int s = 1;
    // Traversal from leaves up to root
    for (int d = n>>1; d > 0; d >>= 1)
    {
        __syncthreads();
        if (tid < d)
        {
            int a = s*(2*tid);
            int b = s*(2*tid+1);
            a += (a >> LOG_NUM_BANKS);  // skip over the pad words
            b += (b >> LOG_NUM_BANKS);  // skip over the pad words
            shared[a] += shared[b];
        }
        s *= 2;
    }

Fixing Scan Bank Conflicts

For a full binary tree with 64 leaf nodes there are no more bank conflicts. However, there are ~8 cycles of addressing overhead for each s[a] += s[b] (8 cycles/iteration * 6 iterations = 48 extra cycles), so it is just barely worth the overhead on a small tree: 84 cycles, vs. 102 with conflicts, vs. 36 optimal. (The address computations take 2 shifts and 2 adds; at 2 cycles each for a 32-thread warp that is 8 cycles of overhead, plus the 6 cycles for the conflict-free s[a] += s[b], so (6+8)*6 = 84 cycles.)

For a full binary tree with 128 leaf nodes (only the last 6 iterations shown: root and 5 levels below), there are no more bank conflicts. A significant performance win: 106 cycles, vs. 186 with bank conflicts, vs. 48 optimal. (Per address computation: 1 shift and 1 add; at 2 cycles each for a 32-thread warp that is 8 cycles of overhead, plus the 6 cycles for the conflict-free s[a] += s[b], so (6+8)*7 = 98 cycles.)

For a full binary tree with 512 leaf nodes (only the last 6 iterations shown), wait: we still have bank conflicts. Still an improvement: 304 cycles, vs. 570 with bank conflicts, vs. 120 optimal. (Per address computation: 1 shift and 1 add; at 2 cycles each for a 32-thread warp that is 8 cycles of overhead, plus the 6 cycles for the conflict-free s[a] += s[b]. But we still have 2-way bank conflicts on 4 of the 9 tree levels, so (6+8)*5 + (12+8)*4 = 150 cycles over those levels.)

Why Are There Still Bank Conflicts?

First-level padding adds one element every 16, so an address a becomes a + a/16. Recall that threads use a stride s = 2^n: with a = s*(2*tid) and b = s*(2*tid+1), adjacent threads access addresses k*2^n and (k+2)*2^n for consecutive even k. With our padding these become

    k*2^n + (k*2^n)/16
    (k+2)*2^n + ((k+2)*2^n)/16

What happens when n = 7? You are stepping over 16 pad words per stride:

    k*128 + k*8
    (k+2)*128 + (k+2)*8

Using k = 0 as an example: address 0 maps to bank 0, and address 256 + 16 = 272 also maps to bank 0 (272 mod 16 = 0), so the two threads collide.
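A quick host-side check (my own illustration, not from the slides) makes the collision visible by printing the bank of each padded address at stride 128:

    #include <stdio.h>

    int main(void)
    {
        int s = 128;  // stride 2^7: the level where 1-level padding breaks down
        for (int tid = 0; tid < 4; ++tid) {
            int a = s * (2 * tid);
            a += a >> 4;  // 1-level padding: one pad word per 16 elements
            printf("tid %d -> address %d -> bank %d\n", tid, a, a % 16);
        }
        return 0;  // every thread lands in bank 0: a 4-way conflict
    }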

Fixing Scan Bank Conflicts

It's possible to remove all bank conflicts: just do multi-level padding. Example, two-level padding:

    const int LOG_NUM_BANKS = 4;  // 16 banks on G80
    int tid = threadIdx.x;
    int s = 1;
    // Traversal from leaves up to root
    for (int d = n>>1; d > 0; d >>= 1)
    {
        __syncthreads();
        if (tid < d)
        {
            int a = s*(2*tid);
            int b = s*(2*tid+1);
            int offset = (a >> LOG_NUM_BANKS);        // first level
            a += offset + (offset >> LOG_NUM_BANKS);  // second level
            offset = (b >> LOG_NUM_BANKS);            // first level
            b += offset + (offset >> LOG_NUM_BANKS);  // second level
            temp[a] += temp[b];
        }
        s *= 2;
    }

The a and b calculations add both offset and offset >> LOG_NUM_BANKS to the raw addresses.

For a full binary tree with 512 leaf nodes (only the last 6 iterations shown), two-level padding leaves no bank conflicts, but adds an extra cycle of overhead per address calculation. Not worth it: 440 cycles vs. 304 with 1-level padding. With 1-level padding, bank conflicts occur only in warp 0, so the remaining conflict cost is very small, and removing it hurts all the other warps. (Per address computation: 2 shifts and 1 add; at 2 cycles each for a 32-thread warp that is 12 cycles of overhead, plus the 6 cycles for the conflict-free s[a] += s[b], so (6+12)*9 = 162 cycles.)

Large Arrays

So far the array had to be processable by a single block: up to 1024 elements. For larger arrays: divide the array into blocks; scan each block with a block of threads, producing partial scans; scan the per-block partial results; then add the corresponding scanned result back to all elements of its block. See the Scan Large Array example in the SDK; a sketch of this pipeline follows the figure below.

Large Arrays (figure: the block-wise scan pipeline described above)
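A hedged host-side sketch of the three phases (the kernel names scan_block and add_block_offsets, and all buffer names, are my own, not the SDK's):

    // Phase 1: each block scans its B elements and writes its total
    // into d_block_sums[blockIdx.x].
    scan_block<<<numBlocks, B/2, B * sizeof(float)>>>(d_out, d_in, d_block_sums, B);

    // Phase 2: exclusive-scan the per-block totals
    // (assumes numBlocks <= B; recurse otherwise).
    scan_block<<<1, numBlocks/2, numBlocks * sizeof(float)>>>(
        d_block_offsets, d_block_sums, NULL, numBlocks);

    // Phase 3: add each block's scanned total to every element of its block.
    add_block_offsets<<<numBlocks, B>>>(d_out, d_block_offsets);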

Application: Stream Compaction
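The worked example on this slide was a figure; the idea is to keep only the elements that pass a predicate: build a 0/1 flag array, exclusive-scan it to get each survivor's output index, then scatter. A minimal sequential sketch (my own illustration; on the GPU each loop maps to a kernel, with the scan done as above):

    #include <stdlib.h>

    // returns the number of elements kept (those with a nonzero keep-flag)
    int compact(float *out, const float *in, int n)
    {
        int *flag = malloc(n * sizeof(int));
        int *addr = malloc(n * sizeof(int));
        for (int i = 0; i < n; ++i)
            flag[i] = (in[i] > 0.0f);     // example predicate: keep positives
        addr[0] = 0;                      // exclusive scan of the flags
        for (int i = 1; i < n; ++i)
            addr[i] = addr[i-1] + flag[i-1];
        for (int i = 0; i < n; ++i)
            if (flag[i])
                out[addr[i]] = in[i];     // scatter survivors
        int count = addr[n-1] + flag[n-1];
        free(flag); free(addr);
        return count;
    }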

Application: Radix Sort

Application: Radix Sort

For each radix-sort pass, let e(i) = 1 if element i has a '0' in the current bit; call these elements "falses". Let f(i) = the exclusive scan of e, i.e., how many falses appear before position i. How many falses are there in total? Since f is an exclusive scan, f(max) excludes the last element, so we must add e(max) to f(max) to get totalFalses. f(i) can be used directly as the output position for the falses. For the non-falses we must compute positions after all the falses (+ totalFalses), one after the other in their original order (+ i), but ignoring any in-between falses (- f(i)): position = i - f(i) + totalFalses. A sketch built from these formulas follows.
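A sequential sketch of one split pass (my own illustration; on the GPU each loop would be a kernel and f would come from the parallel scan):

    #include <stdlib.h>

    // split: stable partition of keys by bit `bit` (zeros first), using a scan
    void split(unsigned *out, const unsigned *in, int n, int bit)
    {
        int *e = malloc(n * sizeof(int));  // e[i] = 1 if bit is 0 (a "false")
        int *f = malloc(n * sizeof(int));  // f = exclusive scan of e
        for (int i = 0; i < n; ++i)
            e[i] = ((in[i] >> bit) & 1) == 0;
        f[0] = 0;
        for (int i = 1; i < n; ++i)
            f[i] = f[i-1] + e[i-1];
        int totalFalses = f[n-1] + e[n-1];
        for (int i = 0; i < n; ++i) {
            int d = e[i] ? f[i]                     // falses go to position f[i]
                         : i - f[i] + totalFalses;  // trues go after all falses
            out[d] = in[i];
        }
        free(e); free(f);
    }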

Using Streams to Overlap Kernels with Data Transfers

A stream is a queue of ordered CUDA requests. By default all CUDA requests go to the same stream. Create a stream with:

    cudaStreamCreate(cudaStream_t *stream);
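For the code on the next slide, the two streams are created up front (the names streamA and streamB match the ones used below):

    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);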

Overlapping Kernels

    cudaMemcpyAsync(dA, hA, sizeA, cudaMemcpyHostToDevice, streamA);
    cudaMemcpyAsync(dB, hB, sizeB, cudaMemcpyHostToDevice, streamB);
    Kernel<<<100, 512, 0, streamA>>>(dAo, dA, sizeA);
    Kernel<<<100, 512, 0, streamB>>>(dBo, dB, sizeB);
    cudaMemcpyAsync(hAo, dAo, sizeA, cudaMemcpyDeviceToHost, streamA);
    cudaMemcpyAsync(hBo, dBo, sizeB, cudaMemcpyDeviceToHost, streamB);
    cudaThreadSynchronize();

Work queued in different streams may overlap. Note that the host buffers must be page-locked (allocated with cudaMallocHost) for the copies to be truly asynchronous.