Programming with CUDA WS 08/09 Lecture 9 Thu, 20 Nov, 2008.


1 Programming with CUDA WS 08/09 Lecture 9 Thu, 20 Nov, 2008

2 Previously
CUDA Runtime Component
–Common Component
–Device Component
–Host Component: runtime & driver APIs

3 Today
–Memory & instruction optimizations
–Final projects: reminder

4 Instruction Performance

5 Instruction Processing
To execute an instruction on a warp of threads, the SM
–Reads in the instruction operands for each thread
–Executes the instruction on all threads
–Writes the result of each thread

6 Instruction Throughput
Maximized when
–Use of low-throughput instructions is minimized
–Available memory bandwidth is used maximally
–The thread scheduler can overlap compute & memory operations
  Programs have a high arithmetic intensity per memory operation
  Each SM has many active threads

8 Instruction Throughput
Avoid low-throughput instructions
–Be aware of the clock cycles used per instruction
–There are often faster alternatives for math functions, e.g. __sinf instead of sinf
–The size of operands (24-bit vs. 32-bit) also makes a difference

9 Instruction Throughput
Avoid low-throughput instructions
–Integer division and modulo are expensive
  Use bitwise operations (>>, &) instead, e.g. when the divisor is a power of two (see the sketch below)
–Type conversion costs cycles
  char / short => int
  double <=> float
–Define float constants with an f suffix, e.g. 1.0f
–Use float functions, e.g. expf
–Devices of compute capability <= 1.2 demote double to float
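
A minimal illustrative kernel (not from the slides; all names are invented) showing both points: replacing division and modulo by a power of two with shift/mask, and keeping the arithmetic in single precision with the f suffix and expf.

    // Hypothetical example: cheap integer arithmetic and single-precision math.
    __global__ void cheapMathKernel(float *out, const float *in, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) {
            int row = tid >> 4;          // tid / 16 (divisor is a power of two)
            int col = tid & 15;          // tid % 16
            // 1.0f and expf keep the computation in float; 1.0 and exp
            // would involve double on devices that support it.
            out[tid] = expf(in[tid]) + 1.0f / (float)(row + col + 1);
        }
    }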

10 Instruction Throughput
Avoid branching
–Diverging threads in a warp are serialized
–Try to minimize the number of divergent warps
–Loop unrolling by the compiler can be controlled with #pragma unroll (see the sketch below)
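
As a hedged sketch (names invented here), the following kernel shows #pragma unroll on a loop with a fixed trip count, and a branch whose condition is uniform across a warp, so no threads within a warp diverge.

    __global__ void branchAndUnroll(float *data)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;
        // Fixed trip count: the compiler may unroll on its own; the pragma makes it explicit.
        #pragma unroll
        for (int i = 0; i < 8; ++i)
            acc += data[tid] * (float)i;
        // The condition depends only on the warp index (warp size 32 assumed),
        // so all threads of a warp take the same path: no divergence.
        if (((tid / 32) & 1) == 0)
            data[tid] = acc;
        else
            data[tid] = -acc;
    }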

11 Instruction Throughput
Avoid high-latency memory instructions
–An SM takes 4 clock cycles to issue a memory instruction to a warp
–For local/global memory, there is an additional latency of 400 to 600 cycles
    __shared__ float shared;
    __device__ float device;
    shared = device;   // 4 + 4 + [400,600] cycles

12 Instruction Throughput
Avoid high-latency memory instructions
–If local/global memory has to be accessed, surround the access with independent arithmetic instructions (see the sketch below)
  The SM can do math while accessing memory
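
A sketch of this idea (illustrative names, not from the slides): the global load is issued first, independent arithmetic follows, and the loaded value is only used at the end, giving the scheduler work to overlap with the memory latency.

    __global__ void overlapLoadAndMath(const float *gIn, float *gOut, float a, float b)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float loaded = gIn[tid];        // long-latency global memory read issued here
        // Independent arithmetic that does not use 'loaded' can execute
        // while the read is still in flight.
        float poly = a * b + a - b;
        poly = poly * poly + 0.5f * a;
        gOut[tid] = loaded * poly;      // first use of the loaded value
    }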

13 Instruction Throughput
Cost of __syncthreads()
–The instruction itself takes 4 clock cycles per warp
–Additional cycles are spent waiting for all threads to catch up

14 Instruction Throughput
Maximized when
–Use of low-throughput instructions is minimized
–Available memory bandwidth is used maximally
–The thread scheduler can overlap compute & memory operations
  Programs have a high arithmetic intensity per memory operation
  Each SM has many active threads

15 Instruction Throughput
–The effective bandwidth of each memory space (global, local, shared) depends on the memory access pattern
–Device memory has higher latency and lower bandwidth than on-chip memory
  Minimize use of device memory

16 Instruction Throughput
Typical execution pattern (sketched below)
–Each thread loads data from device to shared memory
–Synchronize threads, if necessary
–Each thread processes data in shared memory
–Synchronize threads, if necessary
–Write data from shared back to device memory
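
The pattern above might look like the following minimal kernel, a sketch assuming one element per thread and 256 threads per block; all names are illustrative.

    __global__ void typicalPattern(const float *gIn, float *gOut)
    {
        __shared__ float tile[256];                       // one element per thread

        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = gIn[gid];                     // 1. load from device to shared memory
        __syncthreads();                                  // 2. wait until the whole tile is loaded

        // 3. process data in shared memory (here: average with a neighbour's element)
        float result = 0.5f * (tile[threadIdx.x] + tile[(threadIdx.x + 1) % blockDim.x]);
        __syncthreads();                                  // 4. needed only if the tile is reused

        gOut[gid] = result;                               // 5. write the result back to device memory
    }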

17 Instruction Throughput
Global memory
–High latency, low bandwidth
–Not cached
–The right access patterns are crucial

18 Instruction Throughput
Global memory: alignment
–Supported word sizes: 4, 8, 16 bytes
–  __device__ type device[32];
   type data = device[tid];
  compiles to a single load instruction if
    type has a supported size
    type variables are aligned to sizeof(type): the address of the variable must be a multiple of sizeof(type)

19 Instruction Throughput
Global memory: alignment
–The alignment requirement is automatically fulfilled for built-in types
–For user-defined structures, alignment can be forced:
    struct __align__(8)  { float a,b;   } myStruct8;
    struct __align__(16) { float a,b,c; } myStruct12;

20 Instruction Throughput
Global memory: alignment
–Addresses of global variables are aligned to 256 bytes
–Align structures cleverly:
    struct { float a,b,c,d,e; } myStruct20;
  Five 32-bit load instructions

21 Instruction Throughput
Global memory: alignment
–Addresses of global variables are aligned to 256 bytes
–Align structures cleverly:
    struct __align__(8) { float a,b,c,d,e; } myStruct20;
  Three 64-bit load instructions

22 Instruction Throughput
Global memory: alignment
–Addresses of global variables are aligned to 256 bytes
–Align structures cleverly:
    struct __align__(16) { float a,b,c,d,e; } myStruct20;
  Two 128-bit load instructions

23 Instruction Throughput
Global memory: coalescing
–The size of a memory transaction on global memory can be 32 (compute capability >= 1.2), 64, or 128 bytes
–Bandwidth is used most efficiently when the simultaneous memory accesses of the threads in a half-warp can be coalesced into a single memory transaction
–The coalescing rules vary with compute capability

24 Instruction Throughput
Global memory: coalescing, compute capability <= 1.1
–Global memory accesses by the threads of a half-warp are coalesced if
  Each thread accesses words of size
    4 bytes: one 64-byte memory transaction
    8 bytes: one 128-byte memory transaction
    16 bytes: two 128-byte memory transactions
  All 16 words lie in the same (aligned) segment of global memory
  Threads access the words in sequence (the k-th thread accesses the k-th word)

25 Instruction Throughput
Global memory: coalescing, compute capability <= 1.1
–If any of these conditions is violated by a half-warp, the threads' memory accesses are serialized
–Coalesced access with larger word sizes is slower than coalesced access with smaller word sizes
  Still much more efficient than non-coalesced access
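
For contrast, here is an access that satisfies all of the coalescing conditions above on a <= 1.1 device (assumed names): thread k of each half-warp reads the k-th consecutive 4-byte word of an aligned array, so the 16 reads coalesce into one 64-byte transaction.

    __global__ void coalescedCopy(const float *gIn, float *gOut)
    {
        // Consecutive threads read consecutive, in-sequence 4-byte words.
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        gOut[tid] = gIn[tid];
    }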

26 – 28 (Figure slides: diagrams of coalesced and non-coalesced global memory access patterns)

29 Instruction Throughput
Global memory: coalescing, compute capability >= 1.2
–Global memory accesses by the threads of a half-warp are coalesced if the accessed words lie in the same aligned segment of the required size
  32 bytes for 2-byte words
  64 bytes for 4-byte words
  128 bytes for 8- and 16-byte words
–Any access pattern within the segment is allowed
  Lower compute capability cards restrict the access pattern

30 Instruction Throughput
Global memory: coalescing, compute capability >= 1.2
–If a half-warp addresses words in N different segments, N memory transactions are issued
  Lower compute capability cards would issue 16
–The hardware automatically detects and optimizes for unused words, e.g. if the requested words all lie in the lower or upper half of a 128-byte segment, a 64-byte transaction is issued

31 Instruction Throughput
Global memory: coalescing, compute capability >= 1.2
–Summary of how memory transactions are issued for the threads of a half-warp (see the sketch below):
  Find the memory segment containing the address requested by the lowest-numbered active thread
  Find all other active threads requesting addresses in the same segment
  Reduce the transaction size, if possible
  Carry out the transaction and mark the serviced threads as inactive
  Repeat until all threads are serviced
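
A hedged host-side sketch (plain C, invented helper) of the procedure listed above: it counts how many transactions a half-warp of 16 addresses would need for a given segment size. The transaction-size-reduction step is omitted for brevity.

    #include <stdio.h>

    /* Count memory transactions for 16 addresses (one per active thread). */
    int countTransactions(const unsigned long addr[16], unsigned long segmentSize)
    {
        int serviced[16] = {0};
        int transactions = 0;
        for (int i = 0; i < 16; ++i) {
            if (serviced[i]) continue;
            /* Segment of the lowest-numbered thread not yet serviced. */
            unsigned long segment = addr[i] / segmentSize;
            /* Service every thread whose address falls into the same segment. */
            for (int j = i; j < 16; ++j)
                if (!serviced[j] && addr[j] / segmentSize == segment)
                    serviced[j] = 1;
            ++transactions;              /* one transaction per distinct segment */
        }
        return transactions;
    }

    int main(void)
    {
        unsigned long addr[16];
        for (int t = 0; t < 16; ++t) addr[t] = 4 * t;            /* consecutive 4-byte words */
        printf("%d transaction(s)\n", countTransactions(addr, 64)); /* prints 1 */
        return 0;
    }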

32 (Figure slide: examples of half-warp accesses and the resulting memory transactions on devices of compute capability >= 1.2)

33 Instruction Throughput
Global memory: coalescing
–General pattern, 1D array:
    TYPE* BaseAddress;            // 1D array
    // thread tid reads BaseAddress + tid
  TYPE must meet the size and alignment requirements
  If TYPE is larger than 16 bytes, split it into smaller objects that meet the requirements

34 Instruction Throughput
Global memory: coalescing
–General pattern, 2D array:
    TYPE* BaseAddress;            // 2D array of size width x height
    // thread (tx, ty) reads BaseAddress + ty*width + tx
  The size and alignment requirements still hold

35 Instruction Throughput
Global memory: coalescing
–General patterns: memory coalescing is achieved for all half-warps of a block if
  the width of the block is a multiple of 16
  width (the array width) is a multiple of 16
–Arrays whose width is a multiple of 16 are therefore accessed more efficiently
  It is useful to pad array rows up to a multiple of 16
  This is done automatically by the cuMemAllocPitch / cudaMallocPitch functions (see the sketch below)
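
A sketch of how padding with cudaMallocPitch is typically used (illustrative names; error checking omitted; width and height are assumed to be declared on the host): the runtime chooses a pitch so that each row starts at a suitably aligned address.

    // Host side: allocate a padded width x height array of floats.
    float *dArray;
    size_t pitch;                                    // padded row size in bytes, chosen by the runtime
    cudaMallocPitch((void**)&dArray, &pitch, width * sizeof(float), height);

    // Device side: index rows via the pitch, not via width.
    __global__ void scaleRows(float *dArray, size_t pitch, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            float *row = (float*)((char*)dArray + y * pitch);   // start of row y
            row[x] *= 2.0f;
        }
    }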

36 Instruction Throughput
Local memory
–Used for some per-thread (automatic) variables
–Not cached
–As expensive as global memory
–Since accesses are, by definition, per-thread, they are automatically coalesced

37 Instruction Throughput
Constant memory
–Cached
  A cache miss costs one read from device memory
  Otherwise, only one cache read
–For the threads of a half-warp, the cost of reading the cache is proportional to the number of different addresses read
  It is recommended that all threads of a half-warp read the same address (see the sketch below)
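
A minimal sketch (invented names) of the recommended pattern: a small coefficient table in __constant__ memory that every thread of a half-warp reads at the same address in each loop iteration, so each read is a single cached broadcast.

    __constant__ float cCoeffs[16];                  // cached constant memory

    __global__ void dotWithCoeffs(const float *in, float *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid + 16 > n) return;                    // assumes the input holds at least tid+16 elements
        float acc = 0.0f;
        // In each iteration all threads read the same cCoeffs[i]:
        // one cached read, broadcast to the half-warp.
        for (int i = 0; i < 16; ++i)
            acc += in[tid + i] * cCoeffs[i];
        out[tid] = acc;
    }

    // Host side (sketch): float hCoeffs[16] = {...};
    // cudaMemcpyToSymbol(cCoeffs, hCoeffs, sizeof(hCoeffs));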

38 Instruction Throughput
Texture memory
–Cached
  A cache miss costs one read from device memory
  Otherwise, only one cache read
–The texture cache is optimized for 2D spatial locality
  It is recommended that the threads of a warp read neighboring texture addresses (see the sketch below)
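
A hedged sketch of a 2D texture fetch using the texture reference API of this CUDA generation (names are illustrative; the host-side binding of a cudaArray to the reference, e.g. with cudaBindTextureToArray, is only indicated in the comment).

    // Texture reference for 2D float data; bound on the host before the kernel launch.
    texture<float, 2, cudaReadModeElementType> texRef;

    __global__ void horizontalBlur(float *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            // Neighbouring threads fetch neighbouring texels: good 2D locality for the texture cache.
            float left  = tex2D(texRef, x - 1 + 0.5f, y + 0.5f);
            float mid   = tex2D(texRef, x     + 0.5f, y + 0.5f);
            float right = tex2D(texRef, x + 1 + 0.5f, y + 0.5f);
            out[y * width + x] = (left + mid + right) / 3.0f;
        }
    }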

39 Instruction Throughput
Shared memory
–On-chip
  As fast as registers, provided there are no bank conflicts between threads
–Divided into equally-sized modules, called banks
  If N requests fall into N separate banks, they are processed concurrently
  If N requests fall into the same bank, there is an N-way bank conflict
    The N requests are serialized

40 Instruction Throughput
Shared memory: banks
–Successive 32-bit words are assigned to successive banks
–Bandwidth: 32 bits per bank per 2 clock cycles
–Requests from a warp are split by half-warp
  Threads in different half-warps cannot conflict
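
For comparison with the conflict examples that follow, a conflict-free access (sketch, assuming a 256-float array and 16 banks): with 32-bit words and a stride of one, each thread of a half-warp hits a different bank.

    __shared__ float shared[256];
    // Thread tId of a half-warp reads bank (BaseIndex + tId) mod 16: all 16 banks are distinct.
    float data = shared[BaseIndex + tId];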

41 – 43 (Figure slides: shared memory bank access patterns with and without bank conflicts)

44 Instruction Throughput
Shared memory: bank conflicts
–  __shared__ char shared[32];
   char data = shared[BaseIndex + tId];
–Why?

45 Instruction Throughput
Shared memory: bank conflicts
–  __shared__ char shared[32];
   char data = shared[BaseIndex + tId];
–Multiple array members, e.g. shared[0], shared[1], shared[2], and shared[3], lie in the same bank
–The conflict can be resolved with a stride of four:
    char data = shared[BaseIndex + 4*tId];

46 Instruction Throughput
Shared memory: bank conflicts
–  __shared__ double shared[32];
   double data = shared[BaseIndex + tId];
–Why?

47 Instruction Throughput
Shared memory: bank conflicts
–  __shared__ double shared[32];
   double data = shared[BaseIndex + tId];
–A 2-way bank conflict, because the accesses have a stride of two 32-bit words

48 Instruction Throughput
Shared memory: bank conflicts
–  __shared__ TYPE shared[32];
   TYPE data = shared[BaseIndex + tId];
–What happens for different definitions of TYPE?

49 Instruction Throughput
Shared memory: bank conflicts
–  __shared__ TYPE shared[32];
   TYPE data = shared[BaseIndex + tId];
–With
    struct TYPE { float x, y, z; };
  this compiles to three separate memory reads with no bank conflicts
–Stride of three 32-bit words: an odd stride, so the accesses map to different banks

50 Instruction Throughput
Shared memory: bank conflicts
–  __shared__ TYPE shared[32];
   TYPE data = shared[BaseIndex + tId];
–With
    struct TYPE { float x, y; };
  this compiles to two separate memory reads, each with a 2-way bank conflict
–Stride of two 32-bit words, just as for double

51 Final Projects
Reminder
–Form groups by the next lecture
–Think of project ideas for your group
  You are encouraged to submit several ideas
–For each idea, submit a short text
  describing the problem you want to solve
  explaining why you think it is suited for parallel computation
–Jens and I will assign you one of your suggested topics

52 Final Projects
Reminder
–If some people have not formed groups, Jens and I will assign them to groups randomly
–If you cannot think of any ideas, Jens and I will assign you some
–We will circulate some write-ups of our own ideas; you may choose one of those

53 Final Projects
Time-line
–Thu, 20 Nov (today): write-ups on ideas from Jens & Waqar are circulated
–Tue, 25 Nov: suggest groups and topics
–Thu, 27 Nov: groups and topics assigned
–Tue, 2 Dec: last chance to change groups/topics; groups and topics finalized

54 All for today
Next time
–More on bank conflicts
–Other optimizations

55 See you next week!

