
1 CUDA Performance Considerations Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011

2 Agenda Data Prefetching Loop Unrolling Thread Granularity

3 Data Prefetching Independent instructions between a global memory read and its use can hide memory latency float m = Md[i]; float f = a * b + c * d; float f2 = m * f;

4 Data Prefetching Independent instructions between a global memory read and its use can hide memory latency float m = Md[i]; float f = a * b + c * d; float f2 = m * f; Read global memory

5 Data Prefetching Independent instructions between a global memory read and its use can hide memory latency float m = Md[i]; float f = a * b + c * d; float f2 = m * f; Execute instructions that are not dependent on memory read

6 Data Prefetching Independent instructions between a global memory read and its use can hide memory latency float m = Md[i]; float f = a * b + c * d; float f2 = m * f; Use the global memory value; if enough warps execute the line above, the memory latency is hidden

7 Data Prefetching Prefetching data from global memory can effectively increase the number of independent instructions between a global memory read and its use

8 Data Prefetching Recall tiled matrix multiply: for (/*... */) { // Load current tile into shared memory __syncthreads(); // Accumulate dot product __syncthreads(); }
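A minimal sketch of the tiled kernel outlined above, assuming square matrices whose Width is a multiple of TILE_WIDTH; the names Md, Nd, Pd, Ms, Ns, and TILE_WIDTH follow the course convention, but this is an illustrative reconstruction, not the lecture's exact code.

#define TILE_WIDTH 16

__global__ void MatMulTiled(const float* Md, const float* Nd, float* Pd, int Width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;

    float Pvalue = 0.0f;
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Load current tile into shared memory
        Ms[ty][tx] = Md[row * Width + (m * TILE_WIDTH + tx)];
        Ns[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + col];
        __syncthreads();

        // Accumulate dot product
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Ms[ty][k] * Ns[k][tx];
        __syncthreads();
    }
    Pd[row * Width + col] = Pvalue;
}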

9 Data Prefetching Tiled matrix multiply with prefetch: // Load first tile into registers for (/*... */) { // Deposit registers into shared memory __syncthreads(); // Load next tile into registers // Accumulate dot product __syncthreads(); }

11 Data Prefetching Tiled matrix multiply with prefetch: // Load first tile into registers for (/*... */) { // Deposit registers into shared memory __syncthreads(); // Load next tile into registers // Accumulate dot product __syncthreads(); } Prefetch for next iteration of the loop

12 Data Prefetching Tiled matrix multiply with prefetch: // Load first tile into registers for (/*... */) { // Deposit registers into shared memory __syncthreads(); // Load next tile into registers // Accumulate dot product __syncthreads(); } These instructions executed by enough warps will hide the memory latency of the prefetch
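One way the prefetch version might look, reusing the names and assumptions from the earlier sketch (square matrices, Width a multiple of TILE_WIDTH); illustrative only, not the lecture's exact code. Each iteration first deposits the registers loaded on the previous iteration into shared memory, then issues the next global loads so the dot-product arithmetic can overlap their latency.

__global__ void MatMulTiledPrefetch(const float* Md, const float* Nd, float* Pd, int Width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;

    // Load first tile into registers
    float mReg = Md[row * Width + tx];
    float nReg = Nd[ty * Width + col];

    float Pvalue = 0.0f;
    int numTiles = Width / TILE_WIDTH;
    for (int m = 0; m < numTiles; ++m) {
        // Deposit registers into shared memory
        Ms[ty][tx] = mReg;
        Ns[ty][tx] = nReg;
        __syncthreads();

        // Load next tile into registers (the prefetch)
        if (m + 1 < numTiles) {
            mReg = Md[row * Width + ((m + 1) * TILE_WIDTH + tx)];
            nReg = Nd[((m + 1) * TILE_WIDTH + ty) * Width + col];
        }

        // Accumulate dot product; with enough warps these instructions hide the prefetch latency
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Ms[ty][k] * Ns[k][tx];
        __syncthreads();
    }
    Pd[row * Width + col] = Pvalue;
}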

13 Data Prefetching Cost  Added complexity  More registers – what does this imply?

14 Loop Unrolling for (int k = 0; k < BLOCK_SIZE; ++k) { Pvalue += Ms[ty][k] * Ns[k][tx]; } Instructions per iteration  One floating-point multiply  One floating-point add  What else?

15 Loop Unrolling for (int k = 0; k < BLOCK_SIZE; ++k) { Pvalue += Ms[ty][k] * Ns[k][tx]; } Other instructions per iteration  Update loop counter

16 Loop Unrolling for (int k = 0; k < BLOCK_SIZE; ++k) { Pvalue += Ms[ty][k] * Ns[k][tx]; } Other instructions per iteration  Update loop counter  Branch

17 Loop Unrolling for (int k = 0; k < BLOCK_SIZE; ++k) { Pvalue += Ms[ty][k] * Ns[k][tx]; } Other instructions per iteration  Update loop counter  Branch  Address arithmetic

18 Loop Unrolling for (int k = 0; k < BLOCK_SIZE; ++k) { Pvalue += Ms[ty][k] * Ns[k][tx]; } Instruction Mix  2 floating-point arithmetic instructions  1 loop branch instruction  2 address arithmetic instructions  1 loop counter increment instruction

19 Loop Unrolling Only 1/3 are floating-point calculations But I want my full theoretical 346.5 GFLOPs (G80) Consider loop unrolling

20 Loop Unrolling Pvalue += Ms[ty][0] * Ns[0][tx] + Ms[ty][1] * Ns[1][tx] +... Ms[ty][15] * Ns[15][tx]; // BLOCK_SIZE = 16 No more loop  No loop count update  No branch  Constant indices – no address arithmetic instructions
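Rather than unrolling by hand, the same effect can usually be requested from the compiler: #pragma unroll is an nvcc directive, and with a compile-time trip count the loop can be fully unrolled, removing the branch, the counter update, and the variable indexing. A minimal sketch (same loop as above):

#pragma unroll
for (int k = 0; k < BLOCK_SIZE; ++k) {
    Pvalue += Ms[ty][k] * Ns[k][tx];
}

Whether unrolling actually helps depends on register pressure and occupancy, which is what the matrix-multiply results later in the deck illustrate.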

21 Thread Granularity How much work should one thread do?  Parallel Reduction Reduce two elements?  Matrix multiply Compute one element of Pd?

22 Thread Granularity Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf Matrix Multiply

23 Thread Granularity Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf Matrix Multiply  Both elements of Pd require the same row of Md

24 Thread Granularity Matrix Multiply  Compute both Pd elements in the same thread Reduces global memory access by ¼ Increases number of independent instructions  What is the benefit? New kernel uses more registers and shared memory  What does that imply?
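A sketch of this coarser (1x2) granularity: each thread produces two horizontally adjacent Pd elements, so one Md tile and one Ms row are loaded once and reused for both dot products. Names and indexing follow the earlier sketches and are assumptions, not the lecture's exact kernel; the grid would be launched with half as many blocks in x.

__global__ void MatMulTiled1x2(const float* Md, const float* Nd, float* Pd, int Width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][2 * TILE_WIDTH];   // two Nd tiles side by side

    int tx = threadIdx.x, ty = threadIdx.y;
    int row  = blockIdx.y * TILE_WIDTH + ty;
    int col0 = blockIdx.x * (2 * TILE_WIDTH) + tx;     // first output column
    int col1 = col0 + TILE_WIDTH;                      // second output column

    float Pvalue0 = 0.0f, Pvalue1 = 0.0f;
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // One Md tile serves both output elements
        Ms[ty][tx] = Md[row * Width + (m * TILE_WIDTH + tx)];
        Ns[ty][tx]              = Nd[(m * TILE_WIDTH + ty) * Width + col0];
        Ns[ty][tx + TILE_WIDTH] = Nd[(m * TILE_WIDTH + ty) * Width + col1];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k) {
            float mval = Ms[ty][k];                    // reused for both products
            Pvalue0 += mval * Ns[k][tx];
            Pvalue1 += mval * Ns[k][tx + TILE_WIDTH];
        }
        __syncthreads();
    }
    Pd[row * Width + col0] = Pvalue0;
    Pd[row * Width + col1] = Pvalue1;
}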

25 Matrix Multiply What improves performance?  Prefetching?  Loop unrolling?  Thread granularity? For what inputs?

26 Matrix Multiply Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

27 Matrix Multiply Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf 8x8 Tiles Coarser thread granularity helps Prefetching doesn’t Loop unrolling doesn’t

28 Matrix Multiply Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf 16x16 Tiles Coarser thread granularity helps

29 Matrix Multiply Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf 16x16 Tiles Full loop unrolling can help

30 Matrix Multiply Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf 16x16 Tiles Prefetch helps for 1x1 tiling

31 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 University of Illinois, Urbana-Champaign31 Floating-Point Considerations

32 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 University of Illinois, Urbana-Champaign 32 What is IEEE floating-point format? A floating-point binary number consists of three parts:  sign (S), exponent (E), and mantissa (M).  Each (S, E, M) pattern uniquely identifies a floating-point number. For each bit pattern, its IEEE floating-point value is derived as:  value = (-1)^S * M * 2^E, where 1.0 ≤ M < 10.0 (binary), i.e. 1.0 ≤ M < 2.0 The interpretation of S is simple: S=0 gives a positive number and S=1 a negative number.
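A small host-side sketch that pulls S, E, and M out of a 32-bit float's bit pattern; plain C that also compiles under nvcc, with illustrative names only.

#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = -6.5f;                       // example value: -1.101 (binary) x 2^2
    unsigned int bits;
    memcpy(&bits, &f, sizeof bits);        // reinterpret the 32 bits

    unsigned int S = bits >> 31;           // 1 sign bit
    unsigned int E = (bits >> 23) & 0xFF;  // 8 exponent bits (biased by 127)
    unsigned int M = bits & 0x7FFFFF;      // 23 fraction bits (implied leading 1)

    printf("S=%u  E=%u (unbiased %d)  M=0x%06X\n", S, E, (int)E - 127, M);
    return 0;
}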

33 IEEE 754 Format http://kipirvine.com/asm/workbook/floating_tut.htm Single Precision 1-bit sign, 8-bit exponent (bias 127), 23-bit fraction Double Precision 1-bit sign, 11-bit exponent (bias 1023), 52-bit fraction

34 Mantissa Using -3.154 x 10^5 as an example: the sign is negative, the mantissa is 3.154, and the exponent is 5. The fractional portion of the mantissa is the sum of each digit multiplied by a power of 10: .154 = 1/10 + 5/100 + 4/1000 A binary floating-point number is similar. For example, in the number +11.1011 x 2^3,  the sign is positive, the mantissa is 11.1011, and the exponent is 3.  The fractional portion of the mantissa is the sum of successive powers of 2. In our example: .1011 = 1/2 + 0/4 + 1/8 + 1/16 = 0.6875 Combined with the integer part of 11.1011 (binary 11 = decimal 3), the decimal value of the number is 3.6875. http://kipirvine.com/asm/workbook/floating_tut.htm

35 Normalizing the Mantissa Before a floating-point binary number can be stored correctly, its mantissa must be normalized. The process is basically the same as when normalizing a floating-point decimal number. For example, decimal 1234.567 is normalized as 1.234567 x 10^3 by moving the decimal point so that only one digit appears before the decimal. http://kipirvine.com/asm/workbook/floating_tut.htm

36 The Exponent The exponent is stored as an 8-bit unsigned integer with a bias of 127 (2^(n-1) - 1).  An example: 1.101 x 2^5. The exponent (5) is added to 127 and the sum (132) is binary 10000100. http://kipirvine.com/asm/workbook/floating_tut.htm

37 Creating the IEEE Bit Representation 1.101 x 2^0 is stored as sign = 0 (positive), mantissa = 101 (followed by zeros), and exponent = 01111111 (the exponent value 0 added to 127). The "1" to the left of the binary point is dropped from the mantissa. http://kipirvine.com/asm/workbook/floating_tut.htm
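To check that worked example: 1.101 (binary) x 2^0 is 1.625 decimal, and the fields above assemble into the word 0 01111111 10100000000000000000000, i.e. 0x3FD00000. A quick host-side verification sketch (illustrative, not lecture code):

#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 1.625f;                 // 1.101 (binary) x 2^0
    unsigned int bits;
    memcpy(&bits, &f, sizeof bits);
    printf("0x%08X\n", bits);         // prints 0x3FD00000
    return 0;
}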

38 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 University of Illinois, Urbana-Champaign 38 Arithmetic Instruction Throughput int and float add, shift, min, max and float mul, mad: 4 cycles per warp  int multiply (*) is 32-bit by default and requires multiple cycles per warp  Use __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply Integer divide and modulo are expensive  Compiler will convert literal power-of-2 divides to shifts  Be explicit in cases where the compiler can't tell that the divisor is a power of 2!  Useful trick: foo % n == foo & (n-1) if n is a power of 2
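A sketch of the two integer tricks above; __umul24() is a real CUDA intrinsic, while the kernel, its arguments, and the assumption that n is a power of 2 (and that the grid exactly covers the array) are illustrative.

__global__ void indexTricks(const int* in, int* out, int n /* power of 2 */)
{
    // 24-bit multiply: 4 cycles per warp on G80-class hardware when operands fit in 24 bits
    int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

    // foo % n == foo & (n - 1) when n is a power of 2 -- avoids an expensive modulo
    int wrapped = i & (n - 1);

    out[i] = in[wrapped];
}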

39 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 University of Illinois, Urbana-Champaign 39 Arithmetic Instruction Throughput Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp  These are the versions prefixed with "__"  Examples: __rcp(), __sin(), __exp() Other functions are combinations of the above  y / x == rcp(x) * y == 20 cycles per warp  sqrt(x) == rcp(rsqrt(x)) == 32 cycles per warp

40 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 University of Illinois, Urbana-Champaign 40 Runtime Math Library There are two types of runtime math operations  __func(): direct mapping to hardware ISA Fast but low accuracy (see prog. guide for details) Examples: __sin(x), __exp(x), __pow(x,y)  func(): compile to multiple instructions Slower but higher accuracy (5 ulp, units in the last place, or less) Examples: sin(x), exp(x), pow(x,y) The -use_fast_math compiler option forces every func() to compile to __func()
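A minimal sketch contrasting the two flavors; in single precision the pairs that exist in CUDA are sinf()/__sinf() and expf()/__expf() (the slide's __sin()/__exp() shorthand maps to these). Compiling with -use_fast_math would turn the accurate calls into the fast ones automatically.

__global__ void mathFlavors(const float* x, float* accurate, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]) + expf(x[i]);      // multi-instruction, higher accuracy
        fast[i]     = __sinf(x[i]) + __expf(x[i]);  // special-function hardware, lower accuracy
    }
}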

41 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 University of Illinois, Urbana-Champaign 41 Make your program float-safe! Future hardware will have double precision support  G80 is single-precision only  Double precision will have additional performance cost  Careless use of double or undeclared types may run more slowly on G80+ Important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed  Add ‘f’ specifier on float literals: foo = bar * 0.123; // double assumed foo = bar * 0.123f; // float explicit  Use float version of standard library functions foo = sin(bar); // double assumed foo = sinf(bar); // single precision explicit

42 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 University of Illinois, Urbana-Champaign 42 Deviations from IEEE-754 Addition and Multiplication are IEEE 754 compliant  Maximum 0.5 ulp (units in the last place) error However, often combined into multiply-add (FMAD)  Intermediate result is truncated Division is non-compliant (2 ulp) Not all rounding modes are supported Denormalized numbers are not supported No mechanism to detect floating-point exceptions
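When that truncated intermediate matters, the explicitly rounded intrinsics __fmul_rn() and __fadd_rn() (real CUDA intrinsics) produce individually rounded operations that the compiler does not contract into FMAD. The kernel below is only an illustrative sketch.

__global__ void noFusedMultiplyAdd(const float* a, const float* b, const float* c, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // a[i] * b[i] + c[i] written plainly may be contracted into FMAD (truncated intermediate)
        // __fmul_rn / __fadd_rn keep the IEEE round-to-nearest multiply and add separate
        out[i] = __fadd_rn(__fmul_rn(a[i], b[i]), c[i]);
    }
}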

43 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 University of Illinois, Urbana-Champaign 43 GPU Floating Point Features (G80 vs. SSE vs. IBM Altivec vs. Cell SPE)
Precision: IEEE 754 for all four
Rounding modes for FADD and FMUL: G80 round to nearest and round to zero; SSE all 4 IEEE modes (nearest, zero, inf, -inf); Altivec round to nearest only; Cell SPE round to zero/truncate only
Denormal handling: G80 flush to zero; SSE and Altivec supported (1000's of cycles); Cell SPE flush to zero
NaN support: G80, SSE, and Altivec yes; Cell SPE no
Overflow and Infinity support: G80 yes, only clamps to max norm; SSE and Altivec yes; Cell SPE no, infinity
Flags: G80 no; SSE and Altivec yes; Cell SPE some
Square root: G80 software only; SSE hardware; Altivec and Cell SPE software only
Division: G80 software only; SSE hardware; Altivec and Cell SPE software only
Reciprocal estimate accuracy: G80 24 bit; SSE, Altivec, Cell SPE 12 bit
Reciprocal sqrt estimate accuracy: G80 23 bit; SSE, Altivec, Cell SPE 12 bit
log2(x) and 2^x estimates accuracy: G80 23 bit; SSE no; Altivec 12 bit; Cell SPE no

