Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs by Niall Emmart 1, Justin Luitjens 2, Charles Weems 1 and Cliff Woolley 2 1 University of Massachusetts.

1 Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs by Niall Emmart 1, Justin Luitjens 2, Charles Weems 1 and Cliff Woolley 2 1 University of Massachusetts 2 NVIDIA Corporation

2 Modular Multiplication
Computes A * B mod M, where A, B and M are hundreds to thousands of bits in length.

3 Motivating Problems
Modular multiplication is a key primitive, used especially in cryptographic operations:
– RSA
– Diffie-Hellman
– Digital signature algorithms
– Prime generation
– Factorization

4 Preliminaries
– Number Representation
– GPU is a batched device
– One problem instance per thread
– Separate multiply and Montgomery reduce phases
– Small sizes, at most 512 bits

5 Achieving High Performance
– Algorithms: use fast squaring, use “Almost Montgomery” (a redundant representation)
– Don’t use asymptotically faster algorithms
– Keep everything on chip
– Minimize register usage
– Performance is dominated by the low-level techniques used to sum the product terms
– Good utilization (instructions/cycle)
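As a rough illustration of the "fast squaring" bullet, the sketch below uses the textbook symmetry a[i]*a[j] == a[j]*a[i]: each cross term is computed once and counted twice, so an n-limb square needs n(n+1)/2 limb products instead of n^2. This is a Python model, not the authors' GPU code; the names `to_limbs`, `from_limbs` and `square_limbs` are hypothetical, and the column sums are kept in unbounded Python integers rather than in registers.

```python
MASK32 = 0xFFFFFFFF

def to_limbs(x, n):
    """Split a non-negative integer into n little-endian 32-bit limbs."""
    return [(x >> (32 * i)) & MASK32 for i in range(n)]

def from_limbs(limbs):
    """Recombine little-endian 32-bit limbs into one integer."""
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))

def square_limbs(a):
    """Square an n-limb number; returns (2n result limbs, limb products used)."""
    n = len(a)
    cols = [0] * (2 * n)          # column sums, unbounded in this model
    products = 0
    for i in range(n):
        for j in range(i + 1, n):
            cols[i + j] += 2 * a[i] * a[j]   # cross term computed once, counted twice
            products += 1
        cols[2 * i] += a[i] * a[i]           # diagonal term
        products += 1
    out, carry = [], 0
    for c in cols:                           # resolve carries into 32-bit limbs
        carry += c
        out.append(carry & MASK32)
        carry >>= 32
    return out, products
```

For n = 4 the model uses 10 limb products where a general multiply would use 16.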

6 Multiplication – Product Terms
An n-word by n-word product is computed by summing the n^2 word-by-word product terms.
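A minimal sketch of what "summing terms" means, assuming 32-bit words stored little-endian: the term a[i]*b[j] belongs to column i+j, and carries are resolved once all columns are summed. The helper names (`product_columns`, `resolve`, `value`) are hypothetical, and the unbounded column sums are a modelling convenience; on a GPU the width of the accumulator is exactly the problem the following slides address.

```python
def product_columns(a, b):
    """cols[k] is the sum of all terms a[i]*b[j] with i + j == k."""
    n = len(a)
    cols = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            cols[i + j] += a[i] * b[j]
    return cols

def resolve(cols, bits=32):
    """Propagate carries so that every output word fits in `bits` bits."""
    mask = (1 << bits) - 1
    out, carry = [], 0
    for c in cols:
        carry += c
        out.append(carry & mask)
        carry >>= bits
    out.append(carry)             # top word of the 2n-word product
    return out

def value(words, bits=32):
    """Integer value of a little-endian word list."""
    return sum(w << (bits * i) for i, w in enumerate(words))
```

Usage: `resolve(product_columns(a, b))` gives the 2n-word product of two n-word inputs.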

7 Prior Work – NVIDIA 1.x architectures
Two hardware instructions:
– mad24.lo D, A, B, C
– mad24.hi D, A, B, C
Note, no carry in or carry out
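The following Python model sketches the unsigned behaviour of the two instructions as described in the PTX documentation: both multiply the low 24 bits of A and B into a 48-bit product, then add C to either the low 32 bits (`.lo`) or the high 32 bits (`.hi`) of that product. The function names are hypothetical and only the unsigned case is modelled. The last line of the usage note shows the consequence of having no carry out: an overflowing add simply wraps.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1

def mad24_lo(a, b, c):
    """Low 32 bits of the 24x24-bit product, plus c, wrapped to 32 bits."""
    p = (a & MASK24) * (b & MASK24)      # 48-bit product
    return ((p & MASK32) + c) & MASK32   # the carry out of the add is lost

def mad24_hi(a, b, c):
    """High 32 bits (bits 16..47) of the 24x24-bit product, plus c, wrapped."""
    p = (a & MASK24) * (b & MASK24)
    return ((p >> 16) + c) & MASK32
```

For example, `mad24_lo(3, 5, 7)` is 22, while `mad24_lo(0xFFFFFF, 0xFFFFFF, 0xFFFFFFFF)` wraps to `0xFE000000` with no indication that a carry occurred.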

8 Prior Work – NVIDIA 1.x architectures (cont)
Bernstein et al. [1]:
– a and b sampled into 15 limbs of 14 bits
– column oriented approach
– uses mad24.lo as a 32-bit accumulator
– achieves 461M 210-bit mod mul ops/sec
Emmart and Weems [2]:
– a and b sampled into 22-bit values
– column oriented approach
– uses pairs of mad24.lo / mad24.hi ops as a 48-bit accumulator
– achieves 822K 256-bit mod exp ops/sec
– equivalent to 816M 210-bit mod mul ops/sec
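The limb sizes above can be motivated by a simple bound: with no carry flag available, a whole column of products has to fit in the accumulator. The sketch below checks that bound and runs a column-oriented multiply under it. It is a Python model with hypothetical names (`split`, `max_column_sum`, `multiply_columns`), and the count of 12 limbs for the 22-bit case is an assumption (the smallest count covering 256 bits), not a figure from the slides.

```python
def split(x, limb_bits, count):
    """Sample x into `count` little-endian limbs of `limb_bits` bits each."""
    mask = (1 << limb_bits) - 1
    return [(x >> (limb_bits * i)) & mask for i in range(count)]

def max_column_sum(limb_bits, count):
    """Largest possible sum of the `count` products that can share a column."""
    largest = (1 << limb_bits) - 1
    return count * largest * largest

def multiply_columns(a, b, limb_bits, acc_bits):
    """Column-oriented product; checks that no column overflows the accumulator."""
    n = len(a)
    result = 0
    for k in range(2 * n - 1):
        first, last = max(0, k - n + 1), min(k, n - 1)
        acc = sum(a[i] * b[k - i] for i in range(first, last + 1))
        assert acc < (1 << acc_bits), "column overflowed the accumulator"
        result += acc << (limb_bits * k)   # carries resolved by integer addition
    return result
```

Under these assumptions, 15 limbs of 14 bits fit a 32-bit accumulator, while 22-bit limbs overflow 32 bits and need the 48-bit lo/hi pair.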

9 Prior Work – 2.x and 3.x architectures
Two hardware instructions:
– imad{c}.hi.{cc} D, A, B, C
– imad{c}.lo.{cc} D, A, B, C
Note the carry in and carry out options

10 Prior Work – 2.x and 3.x architectures (cont)
[Diagram: the 4-word by 4-word product A3 A2 A1 A0 × B3 B2 B1 B0, laid out row by row. Each word Bj contributes a row of low halves L(A0 Bj) … L(A3 Bj) and a row of high halves H(A0 Bj) … H(A3 Bj); an ADD step is marked with each of the rows for B1, B2 and B3.]
Uses an accumulator for each column and ripples the carry
2n^2 + n – 1 instructions
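One reading of the diagram that reproduces the 2n^2 + n – 1 count is sketched below: each row j issues n low-half multiply-adds and n high-half multiply-adds with a rippled carry, and every row after the first needs one extra ADD to capture the carry out of its low-half pass. This is a Python model of that interpretation, not the authors' instruction sequence; `row_multiply` and the instruction counter are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def row_multiply(a, b):
    """Row-oriented n-word multiply; returns (2n result words, modelled instruction count)."""
    n = len(a)
    r = [0] * (2 * n)
    count = 0
    for j in range(n):
        carry = 0
        for i in range(n):                   # low halves, carry rippled along the row
            t = r[i + j] + ((a[i] * b[j]) & MASK32) + carry
            r[i + j], carry = t & MASK32, t >> 32
            count += 1
        if j > 0:                            # ADD: capture the carry out of the low pass
            r[n + j] = carry                 # this word is still zero at this point
            count += 1
        carry = 0
        for i in range(n):                   # high halves, one word further left
            t = r[i + j + 1] + ((a[i] * b[j]) >> 32) + carry
            r[i + j + 1], carry = t & MASK32, t >> 32
            count += 1
        assert carry == 0                    # the partial product always fits
    return r, count
```

For n = 4 the model issues 35 instructions, which matches 2n^2 + n – 1.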

11 Prior Work – 2.x and 3.x architectures (cont)
Zheng et al. [3]:
– row oriented approach
– rippled carries
– 3.412B 256-bit mod mul ops/sec
Emmart and Weems [2]:
– same approach
– 3.469B 256-bit mod mul ops/sec
– noted this approach does not work on Maxwell

12 Maxwell – 5.x
Single hardware instruction:
– xmad{.x}{.cc} D, A.{h0|h1}, B.{h0|h1}, C
Note, this instruction also supports carry in and carry out

13 Maxwell – 5.x (cont)
Consider computing A*B where A and B are each 32 bits, using a 16-bit multiplier:
A * B = AL*BL + 2^16 * (AL*BH + AH*BL) + 2^32 * AH*BH
The two products AL*BH and AH*BL are half-word aligned.
On Maxwell, a 32-bit madc.lo.cc and madc.hi.cc are emulated and take 4 and 6 instructions respectively. Thus a row-oriented multiply takes ~10*n^2 instructions!
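The decomposition on this slide can be checked with a few lines of Python. The sketch splits each 32-bit operand into 16-bit halves and rebuilds the 64-bit product from the four 16x16-bit products; `mul32_via_16` is a hypothetical name, and the model makes no attempt to mirror the actual xmad sequence.

```python
def mul32_via_16(a, b):
    """64-bit product of two 32-bit values built from four 16x16-bit products."""
    al, ah = a & 0xFFFF, a >> 16
    bl, bh = b & 0xFFFF, b >> 16
    aligned = (al * bl) + ((ah * bh) << 32)   # AL*BL and AH*BH sit on word boundaries
    half = (al * bh) + (ah * bl)              # AL*BH and AH*BL are offset by 16 bits
    return aligned + (half << 16)
```

Note that the sum of the two half-word aligned products can need 33 bits, which is part of what makes emulating a 32-bit multiply-add with carry expensive.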

14 Maxwell – 5.x (cont)
[Diagram: the 16-bit product terms of A3 A2 A1 A0 × B1 B0, grouped into B0 terms and B1 terms. For each word Bj there are four rows of terms: AiL * BjL, AiH * BjL, AiL * BjH and AiH * BjH, for i = 0 … 3.]
Green terms (AiL * BjL and AiH * BjH) are full-word aligned
Red terms (AiH * BjL and AiL * BjH) are half-word aligned

15 Maxwell – 5.x (cont)
[Diagram: the same B0 and B1 product terms as on the previous slide.]
– Sum the red terms and shift left 16 bits using PRMT
– Add in the green terms
– 4n^2 + 4n – 4 instructions
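The idea of deferring the 16-bit shift can be sketched as follows: all half-word aligned (red) terms are summed in their own accumulator without shifting, all full-word aligned (green) terms are summed in another, and the two are combined with a single shift at the end. This is a Python model with hypothetical names; the accumulators are unbounded integers, and the `<< 16` stands in for the PRMT-based shift across registers that the slide mentions.

```python
def halves(word):
    """Return the (low, high) 16-bit halves of a 32-bit word."""
    return word & 0xFFFF, word >> 16

def multiply_red_green(a, b):
    """Multiply two little-endian lists of 32-bit words using 16-bit products."""
    green = 0   # terms aligned on 32-bit word boundaries
    red = 0     # terms offset by 16 bits, accumulated unshifted
    for i, a_word in enumerate(a):
        al, ah = halves(a_word)
        for j, b_word in enumerate(b):
            bl, bh = halves(b_word)
            shift = 32 * (i + j)
            green += (al * bl) << shift
            green += (ah * bh) << (shift + 32)
            red += (al * bh + ah * bl) << shift
    return green + (red << 16)    # one shift for all red terms, then add
```

The point of the arrangement is that every individual accumulation is word aligned, so the misalignment is paid for once per product rather than once per term.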

16 Montgomery Reduction on Maxwell
MontgomeryReduce(MP X, MP M) {
  MP U[n]=0;
  REPEAT n TIMES
    … use Montgomery’s technique to zero out low 16-bits of X …
    U=U + (X[0]>>16);
    … use Montgomery’s technique to zero out low 16-bits of U …
    X=(X>>32) + (U[0]>>16);
    U=U>>32;
  return X=X+(U<<16);
}
[Diagram: X = X0 X1 X2 … Xn-1 Xn … X2n-1 and U = U0 U1 U2 … Un-1]
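For readers unfamiliar with "Montgomery's technique", the sketch below is the textbook digit-by-digit Montgomery reduction with 16-bit digits: each step adds the multiple of M that zeroes the low 16 bits, then shifts them away. It does not reproduce the two-accumulator X/U arrangement in the pseudocode above, whose elided steps are left as they are. The function name and parameters are hypothetical, M must be odd, and `pow(m, -1, base)` requires Python 3.8 or later.

```python
def montgomery_reduce(x, m, n_digits, digit_bits=16):
    """Return x * R^-1 mod m, with R = 2^(digit_bits * n_digits).

    Assumes m is odd and 0 <= x < m * R.
    """
    base = 1 << digit_bits
    mask = base - 1
    m_inv_neg = (-pow(m, -1, base)) % base     # -m^-1 mod 2^digit_bits
    for _ in range(n_digits):
        q = ((x & mask) * m_inv_neg) & mask    # chosen so x + q*m has a zero low digit
        x = (x + q * m) >> digit_bits          # exact division by 2^digit_bits
    return x - m if x >= m else x              # loop result is below 2m
```

With a 64-bit modulus and `n_digits=4`, the function maps a product a*b (with a, b < m) to a*b*2^-64 mod m, which is the reduce phase that follows the separate multiply phase.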

17 Results –Mod Mul Performance

18 Results – Mod Square Performance

19 Instructions Per Cycle / Utilization

20 Results – per SM per MHz

21 Thank you! Questions?

