
1 Prepared 10/11/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

2 Always measure where your time is going, even if you think you know where it is going. Start coarse and go fine-grained as need be. Keep in mind Amdahl's Law when optimizing any part of your code: don't continue to optimize once a part is only a small fraction of overall execution time. Performance Considerations – Slide 2

3 Performance consideration issues: memory coalescing, shared memory bank conflicts, control-flow divergence, occupancy, and kernel launch overheads. Performance Considerations – Slide 3

4 Off-chip memory is accessed in chunks, even if you read only a single word; if you don't use the whole chunk, bandwidth is wasted. Chunks are aligned to multiples of 32/64/128 bytes, so unaligned accesses cost more. When accessing global memory, peak performance occurs when all threads in a half warp access contiguous memory locations. Performance Considerations – Slide 4
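
As a minimal sketch (kernel and parameter names are hypothetical, not from the slides), an access pattern in which thread i touches word i lets each half warp read one contiguous, aligned chunk:

__global__ void copy(const float *in, float *out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = in[i];   // coalesced: addresses rise by one word per thread across the half warp
}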

5 [Figure: a 4-by-4 matrix M, with elements M0,0 through M3,3, shown in its two-dimensional form and in its linearized row-major layout in global memory.] Performance Considerations – Slide 5

6 [Figure: the same matrix with the access direction taken by the kernel code overlaid. Threads T1 to T4 are shown at time period 1, time period 2, and so on, indicating which elements of the linearized matrix are touched together in each step.] Performance Considerations – Slide 6

7 [Figure: the contrasting access direction. Threads T1 to T4 are again shown at successive time periods against the row-major layout, so the two slides compare which pattern lets a half warp touch contiguous locations.] Performance Considerations – Slide 7

8 [Figure: matrices Md and Nd, each WIDTH by WIDTH. Thread 1 and Thread 2 walk across rows of Md (not coalesced) and down columns of Nd (coalesced).] Performance Considerations – Slide 8
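
A hedged sketch of the two patterns in this figure (variable names assumed; Width is the row length of a row-major matrix and i is the thread's index within the half warp):

// Not coalesced: each thread walks its own row, so in one step the half warp
// touches addresses that are Width words apart.
float a = M[i * Width + k];

// Coalesced: in one step the half warp reads consecutive words of one row,
// because adjacent threads differ by 1 in the column index.
float b = M[k * Width + i];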

9 [Figure: original access pattern versus tiled access pattern for Md and Nd. In the tiled version, a tile is first copied into scratchpad (shared) memory with coalesced accesses, and the multiplication is then performed with the scratchpad values.] Performance Considerations – Slide 9
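
A sketch of the tiled pattern, loosely following the usual tiled matrix multiply (TILE_WIDTH, Md, Nd, Width, row, col, m, tx, ty, and Pvalue are assumed from the surrounding kernel and are not shown on the slide):

__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

// Coalesced copy into scratchpad memory: adjacent threads (tx) load adjacent words.
Mds[ty][tx] = Md[row * Width + (m * TILE_WIDTH + tx)];
Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + col];
__syncthreads();

// Perform the multiplication with the scratchpad values.
for (int k = 0; k < TILE_WIDTH; ++k)
  Pvalue += Mds[ty][k] * Nds[k][tx];
__syncthreads();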

10 Threads 0-15 access 4-byte words at addresses 116-176. Thread 0 is the lowest active thread and accesses address 116, which falls in the 128-byte segment 0-127. [Figure: address line 0-288 with the 128B segment 0-127 and threads t0-t15 marked at addresses 116-176.] Performance Considerations – Slide 10

11 Threads 0-15 access 4-byte words at addresses 116-176. Thread 0 is the lowest active thread and accesses address 116; the 128-byte segment 0-127 is reduced to 64 bytes. [Figure: the 64B segment covering addresses 64-127.] Performance Considerations – Slide 11

12 Threads 0-15 access 4-byte words at addresses 116-176. Thread 0 is the lowest active thread and accesses address 116; the segment is reduced again, to 32 bytes. [Figure: the 32B segment covering addresses 96-127.] Performance Considerations – Slide 12

13 Threads 0-15 access 4-byte words at addresses 116-176. Thread 3 is now the lowest active thread and accesses address 128, which falls in the 128-byte segment 128-255. [Figure: the 128B segment covering addresses 128-255.] Performance Considerations – Slide 13

14 Threads 0-15 access 4-byte words at addresses 116-176. Thread 3 is the lowest active thread and accesses address 128; the 128-byte segment 128-255 is reduced to 64 bytes. [Figure: the 64B segment covering addresses 128-191.] Performance Considerations – Slide 14

15 Performance Considerations – Slide 15

__global__ void foo(int* input, float3* input2)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;

  // Stride 1
  int a = input[i];

  // Stride 2, half the bandwidth is wasted
  int b = input[2 * i];

  // Stride 3, 2/3 of the bandwidth is wasted
  float c = input2[i].x;
}

16 Performance Considerations – Slide 16

struct record {
  int key;
  int value;
  int flag;
};

record *d_records;
cudaMalloc((void**)&d_records, ...);

17 Performance Considerations – Slide 17

struct SoA {
  int *keys;
  int *values;
  int *flags;
};

SoA d_SoA_data;
cudaMalloc((void**)&d_SoA_data.keys, ...);
cudaMalloc((void**)&d_SoA_data.values, ...);
cudaMalloc((void**)&d_SoA_data.flags, ...);

18 Performance Considerations – Slide 18

__global__ void bar(record *AoS_data, SoA SoA_data)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;

  // AoS wastes bandwidth
  int key = AoS_data[i].key;

  // SoA: efficient use of bandwidth
  int key_better = SoA_data.keys[i];
}

19 Structure of arrays is often better than array of structures: a very clear win on regular, stride-1 access patterns. Unpredictable or irregular access patterns are case-by-case. Performance Considerations – Slide 19

20 As we have seen, each SM has 16 KB of shared memory, organized as 16 banks of 32-bit words (Tesla). CUDA uses shared memory as shared storage visible to all threads in a thread block, with read and write access. It is not used explicitly in pixel shader programs; we dislike pixels talking to each other. [Figure: SM block diagram showing the instruction cache, multithreaded instruction buffer, register file, constant cache, shared memory, operand select, and the MAD and SFU units.] Performance Considerations – Slide 20

21 So shared memory is banked. This only matters for threads within a warp; you get full performance with some restrictions. Threads can each access different banks, or can all access the same value. Consecutive words are in different banks. If two or more threads access the same bank but different values, you get bank conflicts. Performance Considerations – Slide 21

22 In a parallel machine, many threads access memory, so memory is divided into banks; this is essential to achieve high bandwidth. Each bank can service one address per cycle, so a memory can service as many simultaneous accesses as it has banks. Multiple simultaneous accesses to a bank result in a bank conflict, and conflicting accesses are serialized. [Figure: shared memory drawn as banks 0 through 15.] Performance Considerations – Slide 22

23 No bank conflicts in either case: linear addressing with stride == 1 (left) or a random 1:1 permutation (right). [Figure: threads 0-15 mapped onto banks 0-15, once in order and once permuted, with no two threads sharing a bank.] Performance Considerations – Slide 23

24 Two-way bank conflicts with linear addressing at stride == 2 (left); eight-way bank conflicts with linear addressing at stride == 8 (right). [Figure: threads 0-15 mapped onto banks 0-15; with stride 2, pairs of threads share a bank, and with stride 8, groups of eight threads (x8) pile onto the same bank.] Performance Considerations – Slide 24
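
For instance (a minimal sketch, array name hypothetical), a stride-2 index gives the two-way conflict on the left, while stride 1 is conflict-free:

__shared__ float buf[256];

// Stride 2: threads 0 and 8 both hit bank 0, threads 1 and 9 both hit bank 2, ...
float x = buf[2 * threadIdx.x];   // two-way bank conflict on G80

// Stride 1: thread i hits bank i % 16 - no conflict.
float y = buf[threadIdx.x];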

25 Each bank has a bandwidth of 32 bits per clock cycle Successive 32-bit words are assigned to successive banks G80 has 16 banks So bank = address % 16 Same as the size of a half-warp No bank conflicts between different half-warps, only within a single half-warp Performance Considerations – Slide 25

26 Shared memory is as fast as registers if there are no bank conflicts The fast case: If all threads of a half-warp access different banks, there is no bank conflict If all threads of a half-warp access the identical address, there is no bank conflict (broadcast) The slow case: Bank conflict: multiple threads in the same half-warp access the same bank Must serialize the accesses Cost = max # of simultaneous accesses to a single bank Performance Considerations – Slide 26

27 To estimate the cost of conflicts, change all shared memory reads to the same value: all broadcasts means no conflicts, which shows how much performance could be gained by eliminating bank conflicts. The same trick doesn't work for shared memory writes, so instead replace the shared memory array indices with threadIdx.x (this can also be done for the reads). Performance Considerations – Slide 27
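
A hedged before/after sketch of that experiment (array and index names hypothetical); the diagnostic variants are temporary edits used only to estimate how much the conflicts cost:

// Original access - may have bank conflicts, depending on idx.
float v = tile[idx];

// Diagnostic for reads: every thread reads the same word (a broadcast never conflicts).
float v_test = tile[0];

// Diagnostic for writes (also works for reads): index by threadIdx.x, i.e. stride 1.
tile[threadIdx.x] = v;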

28 Given:

__shared__ float shared[256];
float foo = shared[baseIndex + s * threadIdx.x];

This is only bank-conflict-free if s shares no common factors with the number of banks: 16 on G80, so s must be odd. Performance Considerations – Slide 28

29 [Figure: thread-to-bank mappings for s = 1 and s = 3. In both cases threads 0-15 land on 16 distinct banks, so neither stride causes conflicts.] Performance Considerations – Slide 29

30 texture and __constant__ memory are read-only. The data resides in global memory, but is read through a different path that includes specialized caches. Performance Considerations – Slide 30

31 Constant memory is data stored in global memory but read through a constant-cache path, declared with the __constant__ qualifier. It can only be read by GPU kernels and is limited to 64 KB. It should be used when all threads in a warp read the same address; accesses serialize otherwise. Throughput: 32 bits per warp per clock per multiprocessor. Performance Considerations – Slide 31
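
A minimal sketch of declaring and filling constant memory (names and table size are hypothetical):

__constant__ float coeffs[16];   // lives in the 64 KB constant space

// Host side: copy the table up once before launching kernels.
float h_coeffs[16] = { 0 };      // fill with real values
cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

// Device side: every thread in a warp reads the same address, so the
// value is broadcast from the constant cache.
__global__ void scale(float *data)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  data[i] *= coeffs[0];
}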

32 Constant memory supports immediate address constants and indexed address constants. Constants are stored in DRAM and cached on chip, with one L1 constant cache per SM. A constant value can be broadcast to all threads in a warp, making this an extremely efficient way of accessing a value that is common to all threads in a block. [Figure: the same SM block diagram, highlighting the constant cache (C$ L1).] Performance Considerations – Slide 32

33 Objectives: to understand the implications of control flow on branch divergence overhead and on SM execution resource utilization; to learn better ways to write code with control flow; and to understand compiler/hardware predication, which is designed to reduce the impact of control flow (there is a cost involved). Performance Considerations – Slide 33

34 Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads) The unit of parallelism in CUDA Warp: a group of threads executed physically in parallel in G80 Block: a group of threads that are executed together and form the unit of resource assignment Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect Performance Considerations – Slide 34

35 Thread blocks are partitioned into warps with instructions issued per 32 threads (warp) Thread IDs within a warp are consecutive and increasing Warp 0 starts with Thread ID 0 Partitioning is always the same Thus you can use this knowledge in control flow The exact size of warps may change from generation to generation Performance Considerations – Slide 35
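
As a small illustration (assuming a one-dimensional thread block), a thread can recover its warp number and its lane within the warp directly from its thread ID:

int lane = threadIdx.x % warpSize;   // position within the warp, 0..31 on G80
int warp = threadIdx.x / warpSize;   // warp 0 holds threads 0..31, warp 1 holds 32..63, ...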

36 However, DO NOT rely on any ordering between warps If there are any dependencies between threads, you must __syncthreads() to get correct results Performance Considerations – Slide 36

37 The main performance concern with branching is divergence: threads within a single warp take different paths (if-else, ...). Different execution paths within a warp are serialized in G80; the control paths taken by the threads in a warp are traversed one at a time until there are no more. Different warps can execute different code with no impact on performance. Performance Considerations – Slide 37

38 A common case to avoid is diverging within a warp, i.e. when the branch condition is a function of the thread ID. Example with divergence:

if (threadIdx.x > 2) {...} else {...}

This creates two different control paths for threads in a block. Branch granularity < warp size: threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp. Performance Considerations – Slide 38

39 A common case to avoid is diverging within a warp, i.e. when the branch condition is a function of the thread ID. Example without divergence:

if (threadIdx.x / WARP_SIZE > 2) {...} else {...}

This also creates two different control paths for threads in a block, but the branch granularity is a whole multiple of warp size: all threads in any given warp follow the same path. Performance Considerations – Slide 39

40 Given an array of values, "reduce" them to a single value in parallel. Examples: sum reduction (sum of all values in the array) and max reduction (maximum of all values in the array). Typical parallel implementation: recursively halve the number of threads, adding two values per thread; this takes log(n) steps for n elements and requires n/2 threads. Performance Considerations – Slide 40

41 Performance Considerations – Slide 41

__global__ void per_thread_sum(int *indices, float *data, float *sums)
{
  // ... assumed setup (not shown on the slide): one row index per thread
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  float sum = 0.0f;

  // number of loop iterations is data dependent
  for (int j = indices[i]; j < indices[i+1]; j++) {
    sum += data[j];
  }
  sums[i] = sum;
}

42 Assume an in-place reduction using shared memory. The original vector is in device global memory, and the shared memory is used to hold a partial sum vector. Each iteration brings the partial sum vector closer to the final sum, and the final solution will be in element 0. Performance Considerations – Slide 42

43 Assume we have already loaded the array into __shared__ float partialSum[]. Performance Considerations – Slide 43

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
  __syncthreads();
  if (t % (2*stride) == 0)
    partialSum[t] += partialSum[t+stride];
}

44 [Figure: reduction tree over array elements 0-11. Iteration 1 forms the pairwise sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 forms 0...3, 4...7, 8...11; iteration 3 forms 0...7, 8...15.] Performance Considerations – Slide 44

45 [Figure: the same reduction tree, annotated with the owning threads. Threads 0, 2, 4, 6, 8, and 10 perform the additions in iteration 1, illustrating that only the even-indexed threads are active.] Performance Considerations – Slide 45

46 In each iteration, two control flow paths will be sequentially traversed for each warp Threads that perform addition and threads that do not Threads that do not perform addition may cost extra cycles depending on the implementation of divergence Performance Considerations – Slide 46

47 No more than half of the threads will be executing at any time; all odd-index threads are disabled right from the beginning! On average, less than ¼ of the threads will be active across all warps over time. After the 5th iteration, entire warps in each block will be disabled: poor resource utilization, but no divergence. This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), during which each active warp has only one thread doing work, until all warps retire. Performance Considerations – Slide 47

48 Assume we have already loaded the array into __shared__ float partialSum[]. BAD: divergence due to interleaved branch decisions. Performance Considerations – Slide 48

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
  __syncthreads();
  if (t % (2*stride) == 0)
    partialSum[t] += partialSum[t+stride];
}

49 Assume we have already loaded the array into __shared__ float partialSum[]. Performance Considerations – Slide 49

unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
  __syncthreads();
  if (t < stride)
    partialSum[t] += partialSum[t+stride];
}

50 [Figure: the improved access pattern. With a 32-element partial sum array, iteration 1 has thread 0 compute 0+16 and thread 15 compute 15+31; the active threads are always a contiguous block starting at thread 0.] Performance Considerations – Slide 50

51 Only the last 5 iterations will have divergence. Entire warps will be shut down as the iterations progress: for a 512-thread block, it takes 4 iterations to shut down all but one warp in each block. This gives better resource utilization and will likely retire warps, and thus blocks, faster. Recall that there are no bank conflicts either. Performance Considerations – Slide 51

52 For the last 6 loop iterations only one warp is active (i.e. tids 0..31). Shared memory reads and writes are SIMD-synchronous within a warp, so we can skip __syncthreads() and unroll the last 6 iterations. Performance Considerations – Slide 52

unsigned int tid = threadIdx.x;
for (unsigned int d = n>>1; d > 32; d >>= 1) {
  __syncthreads();
  if (tid < d)
    shared[tid] += shared[tid + d];
}
__syncthreads();
if (tid < 32) {  // unroll last 6 predicated steps
  shared[tid] += shared[tid + 32];
  shared[tid] += shared[tid + 16];
  shared[tid] += shared[tid +  8];
  shared[tid] += shared[tid +  4];
  shared[tid] += shared[tid +  2];
  shared[tid] += shared[tid +  1];
}

This would not work properly if the warp size decreases; we would need __syncthreads() between each statement. However, having __syncthreads() inside an if statement is problematic.

53 A single thread can drag a whole warp with it for a long time, so know your data patterns. If the data is unpredictable, try to flatten peaks by letting threads work on multiple data items. Performance Considerations – Slide 53
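
One hedged way to let each thread handle multiple items (not taken from the slides; a grid-stride variant of the per_thread_sum example from slide 41, with n the number of rows) is:

__global__ void per_thread_sum_multi(int *indices, float *data, float *sums, int n)
{
  // Each thread processes several rows; a thread that finishes short rows
  // picks up more work, which helps even out the load across the warp.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x;
       i < n;
       i += gridDim.x * blockDim.x)
  {
    float sum = 0.0f;
    for (int j = indices[i]; j < indices[i+1]; j++)
      sum += data[j];
    sums[i] = sum;
  }
}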

54 Predicated execution concept: <p1> LDR r1, r2, 0. If p1 is TRUE, the instruction executes normally; if p1 is FALSE, the instruction is treated as a NOP. Predication example: Performance Considerations – Slide 54

// if (x == 10) c = c + 1;
LDR r5, X
p1 <- r5 eq 10
<p1> LDR r1 <- C
<p1> ADD r1, r1, 1
<p1> STR r1 -> C

55 [Figure: a branch in the control-flow graph, where block A branches to either B or C and both merge at D, shown next to the predicated version in which A, B, C, and D are issued as one straight-line sequence.] Performance Considerations – Slide 55

56 The cost is that extra instructions will be issued each time the code is executed; however, there is no branch divergence. [Figure: the two predicated paths (p1,p2 <- r5 eq 10, followed by the instructions from B and then those from C) are interleaved by the scheduler into a single instruction stream: inst 1 from B, inst 1 from C, inst 2 from B, inst 2 from C, ...] Performance Considerations – Slide 56

57 Comparison instructions set condition codes (CC). Instructions can be predicated to write results only when the CC meets a criterion (CC != 0, CC >= 0, etc.), so the compiler may replace branches with instruction predication. The compiler tries to predict whether a branch condition is likely to produce many divergent warps: if guaranteed not to diverge, it only predicates if there are fewer than 4 instructions; if not guaranteed, only if there are fewer than 7 instructions. Performance Considerations – Slide 57

58 ALL predicated instructions take execution cycles; those with false conditions don't write their output or invoke memory loads and stores. Predication saves branch instructions, so it can be cheaper than serializing divergent paths. Performance Considerations – Slide 58

59 "A Comparison of Full and Partial Predicated Execution Support for ILP Processors," S. A. Mahlke, R. E. Hank, J. E. McCormick, D. I. August, and W. W. Hwu, Proceedings of the 22nd International Symposium on Computer Architecture, June 1995, pp. 138-150. http://www.crhc.uiuc.edu/IMPACT/ftp/conference/isca-95-partial-pred.pdf Also available in Readings in Computer Architecture, edited by Hill, Jouppi, and Sohi, Morgan Kaufmann, 2000. Performance Considerations – Slide 59

60 Recall that the streaming multiprocessor implements zero-overhead warp scheduling. At any time, only one of the warps is executed by an SM. Warps whose next instruction has its inputs ready for consumption are eligible for execution, and eligible warps are selected by a prioritized scheduling policy. All threads in a warp execute the same instruction when selected. Performance Considerations – Slide 60

61 What happens if all warps are stalled? No instruction is issued, so performance is lost. The most common reason for stalling? Waiting on global memory. If your code reads global memory every couple of instructions, you should try to maximize occupancy. What determines occupancy? Register usage per thread and shared memory per thread block. Performance Considerations – Slide 61
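
For instance, suppose (hypothetical numbers for illustration) an SM offers 8,192 registers and a kernel uses 20 registers per thread with 256-thread blocks: one block needs 20 × 256 = 5,120 registers, so only one block fits on the SM no matter how much shared memory remains. Trimming the kernel to 16 registers per thread (4,096 per block) would allow two resident blocks.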

62 Pool of registers and shared memory per streaming multiprocessor: each thread block grabs registers and shared memory. [Figure: the register pool and the shared memory pool, each partly occupied by thread blocks TB 0, TB 1, and TB 2.] Performance Considerations – Slide 62

63 Pool of registers and shared memory per streaming multiprocessor: if one or the other is fully utilized, no more thread blocks can be scheduled. [Figure: the same pools with TB 0 and TB 1 resident; one pool is exhausted, so a third block cannot fit.] Performance Considerations – Slide 63

64 You can only have eight thread blocks per streaming multiprocessor; if they're too small, they can't fill up the SM. You need 128 threads per thread block (GT200) or 192 threads per thread block (GF100). Higher occupancy has diminishing returns for hiding latency. Performance Considerations – Slide 64

65 Performance Considerations – Slide 65

66 Use nvcc -Xptxas -v to get register and shared memory usage, then plug those numbers into the CUDA Occupancy Calculator: http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls Performance Considerations – Slide 66

67 Performance Considerations – Slide 67

68 Performance Considerations – Slide 68

69 Performance Considerations – Slide 69

70 Performance Considerations – Slide 70

71 Pass the option -maxrregcount=X to nvcc. This isn't magic; you won't get occupancy for free. Use it very carefully when you are right on the edge. Performance Considerations – Slide 71
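
A hedged alternative is to constrain a single kernel instead of the whole compilation unit with __launch_bounds__, which gives the compiler a per-kernel register budget hint (kernel name and bounds are made up for illustration):

// At most 256 threads per block; aim for at least 2 resident blocks per SM.
__global__ void __launch_bounds__(256, 2) my_kernel(float *data)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  data[i] *= 2.0f;
}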

72 Kernel launches aren't free: a null kernel launch will take non-trivial time. The actual number changes with hardware generations and driver software, so I can't give you one number. Independent kernel launches are cheaper than dependent kernel launches (a dependent launch requires some read-back to the CPU). If you are launching lots of small grids, you will lose substantial performance due to this effect. Performance Considerations – Slide 72
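
To get a number for your own system, here is a rough sketch (names hypothetical) that times an empty launch with CUDA events; note that it measures the launch plus completion of a do-nothing kernel:

__global__ void null_kernel() { }

float time_null_launch()
{
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, 0);
  null_kernel<<<1, 1>>>();
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}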

73 If you are reading back data to the CPU for control decisions, consider making the decisions on the GPU instead. Even though the GPU is slow at serial tasks, it can do a surprising amount of work before you've used up the kernel launch overhead. Performance Considerations – Slide 73

74 Measure, measure, then measure some more! Once you identify bottlenecks, apply judicious tuning What is most important depends on your program You’ll often have a series of bottlenecks, where each optimization gives a smaller boost than expected Performance Considerations – Slide 74

75 Reading: Chapter 6, "Programming Massively Parallel Processors" by Kirk and Hwu. Based on original material from the University of Illinois at Urbana-Champaign (David Kirk, Wen-mei W. Hwu) and Stanford University (Jared Hoberock, David Tarjan). Revision history: last updated 10/11/2011; previous revision 9/9/2011. Performance Considerations – Slide 75

