Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs
Niall Emmart (1), Justin Luitjens (2), Charles Weems (1) and Cliff Woolley (2)
(1) University of Massachusetts
(2) NVIDIA Corporation

Modular Multiplication
Computes A * B mod M, where A, B and M are hundreds to thousands of bits in length.

Motivating Problems
Modular multiplication is a key primitive, used especially in cryptographic operations:
– RSA
– Diffie-Hellman
– Digital signature algorithms
– Prime generation
– Factorization

Preliminaries
– Number representation
– The GPU is a batched device
– One problem instance per thread
– Separate multiply and Montgomery reduce phases
– Small sizes, at most 512 bits

Achieving High Performance
– Algorithms: use fast squaring, use “Almost Montgomery” (a redundant representation)
– Don’t use asymptotically faster algorithms
– Keep everything on chip
– Minimize register usage
– Performance is dominated by the low level techniques used to sum the product terms
– Good utilization (instructions/cycle)
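To illustrate the "fast squaring" point, here is a minimal Python sketch of the usual symmetry trick: each cross term a[i] * a[j] with i < j is computed once and doubled, so squaring n words needs n(n+1)/2 word multiplies instead of n^2. The function name `square_words` and the use of Python big integers as accumulators are assumptions made for illustration; the slides do not show the kernel's actual squaring code.

```python
WORD_BITS = 32

def square_words(a):
    """Square a little-endian list of 32-bit words.

    Returns (square as an integer, number of word multiplies used).
    """
    n = len(a)
    cross = 0       # terms a[i] * a[j] with i < j, each computed once
    diagonal = 0    # terms a[i] * a[i]
    multiplies = 0
    for i in range(n):
        diagonal += (a[i] * a[i]) << (WORD_BITS * 2 * i)
        multiplies += 1
        for j in range(i + 1, n):
            cross += (a[i] * a[j]) << (WORD_BITS * (i + j))
            multiplies += 1
    # Every cross term appears twice in the full square.
    return 2 * cross + diagonal, multiplies
```

For a 4-word operand this uses 10 word multiplies rather than 16.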

Multiplication – Product Terms
An n-word by n-word product is computed by summing the n^2 product terms A_i * B_j.
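As a reference point for the slides that follow, the sum of product terms can be sketched in Python as a plain schoolbook multiply over 32-bit words. The helper names (`to_words`, `from_words`, `multiply_words`) are hypothetical, and the sketch relies on a full 64-bit intermediate, which the GPU instructions discussed later do not provide directly.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1

def to_words(x, n):
    """Split a non-negative integer into n little-endian 32-bit words."""
    return [(x >> (WORD_BITS * i)) & WORD_MASK for i in range(n)]

def from_words(words):
    """Reassemble little-endian 32-bit words into an integer."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))

def multiply_words(a, b):
    """Schoolbook product: add every term a[i] * b[j] into column i + j."""
    n = len(a)
    result = [0] * (2 * n)
    for j in range(n):
        carry = 0
        for i in range(n):
            # Product term plus the running column value plus carry
            # always fits in 64 bits.
            t = a[i] * b[j] + result[i + j] + carry
            result[i + j] = t & WORD_MASK
            carry = t >> WORD_BITS
        result[j + n] = carry  # this word has not been written yet
    return result
```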

Prior Work – NVIDIA 1.x architectures
Two hardware instructions:
– mad24.lo D, A, B, C
– mad24.hi D, A, B, C
Note: no carry in or carry out.

Prior Work – NVIDIA 1.x architectures (cont)
Bernstein et al. [1]:
– a and b sampled into 15 limbs of 14 bits
– column oriented approach
– uses mad24.lo as a 32-bit accumulator
– achieves 461M 210-bit mod mul ops/sec
Emmart and Weems [2]:
– a and b sampled into 22-bit values
– column oriented approach
– uses pairs of mad24.lo / mad24.hi ops as a 48-bit accumulator
– achieves 822K 256-bit mod exp ops/sec
– equivalent to 816M 210-bit mod mul ops/sec
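The choice of 15 limbs of 14 bits can be checked with a little arithmetic: each limb product is below 2^28, a column holds at most 15 of them, and 15 * 2^28 plus an incoming carry stays below 2^32, so a 32-bit accumulator never overflows even without carry flags. The Python sketch below emulates that column oriented scheme. It is an illustration under stated assumptions, not the code from [1]: `mad24_lo` models the instruction as "low 32 bits of a 24-bit by 24-bit product, plus C", and the limb helpers are hypothetical.

```python
LIMB_BITS = 14
LIMBS = 15
LIMB_MASK = (1 << LIMB_BITS) - 1

def mad24_lo(a, b, c):
    """Assumed model of mad24.lo: 24x24-bit multiply, add c, keep 32 bits."""
    return ((a & 0xFFFFFF) * (b & 0xFFFFFF) + c) & 0xFFFFFFFF

def to_limbs(x):
    """Split a 210-bit integer into 15 little-endian 14-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(LIMBS)]

def from_limbs(limbs):
    """Reassemble little-endian 14-bit limbs into an integer."""
    return sum(v << (LIMB_BITS * i) for i, v in enumerate(limbs))

def column_multiply(a, b):
    """Column oriented product using one 32-bit accumulator per column."""
    out = []
    acc = 0
    for k in range(2 * LIMBS - 1):
        lo = max(0, k - LIMBS + 1)
        hi = min(k, LIMBS - 1)
        for i in range(lo, hi + 1):
            acc = mad24_lo(a[i], b[k - i], acc)  # no carry flag needed
        out.append(acc & LIMB_MASK)   # emit one 14-bit limb of the result
        acc >>= LIMB_BITS             # remainder carries into next column
    out.append(acc)
    return out
```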

Prior Work – 2.x and 3.x architectures
Two hardware instructions:
– imad{c}.hi.{cc} D, A, B, C
– imad{c}.lo.{cc} D, A, B, C
Note the carry in and carry out options.

Prior Work – 2.x and 3.x architectures (cont)
Operands: A = A3 A2 A1 A0 and B = B3 B2 B1 B0. L(x) and H(x) are the low and high words of the product x.
Row B0: L(A0 B0), L(A1 B0), L(A2 B0), L(A3 B0), then H(A0 B0), H(A1 B0), H(A2 B0), H(A3 B0)
ADD row B1: L(A0 B1), L(A1 B1), L(A2 B1), L(A3 B1), then H(A0 B1), H(A1 B1), H(A2 B1), H(A3 B1)
ADD row B2: L(A0 B2), L(A1 B2), L(A2 B2), L(A3 B2), then H(A0 B2), H(A1 B2), H(A2 B2), H(A3 B2)
ADD row B3: L(A0 B3), L(A1 B3), L(A2 B3), L(A3 B3), then H(A0 B3), H(A1 B3), H(A2 B3), H(A3 B3)
Uses an accumulator for each column and ripples the carry.
2n^2 + n – 1 instructions.
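The row oriented scheme can be emulated in Python as two carry chains per row: one adding the low product words and one adding the high product words, shifted up by one column. The helpers `mad_lo_cc` and `mad_hi_cc` are assumed models of the multiply-add instructions with carry in and carry out, written for illustration only. The sketch shows the data flow and does not try to reproduce the exact 2n^2 + n – 1 instruction count.

```python
MASK = 0xFFFFFFFF

def mad_lo_cc(a, b, c, carry_in):
    """Assumed model: low word of a*b, plus c, plus carry in."""
    t = ((a * b) & MASK) + c + carry_in
    return t & MASK, t >> 32

def mad_hi_cc(a, b, c, carry_in):
    """Assumed model: high word of a*b, plus c, plus carry in."""
    t = ((a * b) >> 32) + c + carry_in
    return t & MASK, t >> 32

def row_multiply(a, b):
    """Row oriented product of two little-endian lists of n 32-bit words."""
    n = len(a)
    acc = [0] * (2 * n)
    for j in range(n):
        # Low pass: L(a[i] * b[j]) lands in column i + j.
        carry = 0
        for i in range(n):
            acc[i + j], carry = mad_lo_cc(a[i], b[j], acc[i + j], carry)
        acc[j + n] = carry  # column j + n is still zero at this point
        # High pass: H(a[i] * b[j]) lands in column i + j + 1.
        carry = 0
        for i in range(n):
            acc[i + j + 1], carry = mad_hi_cc(a[i], b[j], acc[i + j + 1], carry)
        # The partial sum fits in n + j + 1 words, so no carry escapes.
        assert carry == 0
    return acc
```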

Prior Work – 2.x and 3.x architectures (cont)
Zheng et al. [3]:
– row oriented approach
– rippled carries
– 3.412B 256-bit mod mul ops/sec
Emmart and Weems [2]:
– same approach
– 3.469B 256-bit mod mul ops/sec
– noted this approach does not work on Maxwell

Maxwell – 5.x
Single hardware instruction:
– xmad{.x}{.cc} D, A.{h0|h1}, B.{h0|h1}, C
Note: this instruction also supports carry in and carry out.

Maxwell – 5.x (cont)
Consider computing A * B where A and B are each 32 bits, using a 16-bit multiplier. With AL, AH and BL, BH the low and high 16 bits of A and B:
A * B = AL*BL + (AH*BH << 32) + ((AL*BH + AH*BL) << 16)
AL*BL and AH*BH are full word aligned; the two products AL*BH and AH*BL are half word aligned.
On Maxwell, a 32-bit madc.lo.cc and madc.hi.cc are emulated and take 4 and 6 instructions respectively. Thus a row oriented multiply takes ~10*n^2 instructions!
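The decomposition of a 32-bit product into four 16-bit by 16-bit products can be checked with a few lines of Python. The function name `mul32_from_halves` is hypothetical; the sketch only verifies the arithmetic identity and says nothing about how XMAD sequences are actually scheduled.

```python
def mul32_from_halves(a, b):
    """Build a 64-bit product from four 16x16-bit partial products."""
    al, ah = a & 0xFFFF, a >> 16
    bl, bh = b & 0xFFFF, b >> 16
    aligned = al * bl + ((ah * bh) << 32)   # full word aligned products
    half_aligned = al * bh + ah * bl        # half word aligned products
    return aligned + (half_aligned << 16)
```

The half word aligned pair is what makes an emulated 32-bit multiply-add expensive: those two products straddle the word boundary and must be shifted before they can be added.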

Maxwell – 5.x (cont)
Operands: A = A3 A2 A1 A0 and B = B1 B0.
B0 terms:
A0L*B0L, A1L*B0L, A2L*B0L, A3L*B0L (full word aligned)
A0H*B0L, A1H*B0L, A2H*B0L, A3H*B0L (half word aligned)
A0L*B0H, A1L*B0H, A2L*B0H, A3L*B0H (half word aligned)
A0H*B0H, A1H*B0H, A2H*B0H, A3H*B0H (full word aligned)
B1 terms:
A0L*B1L, A1L*B1L, A2L*B1L, A3L*B1L (full word aligned)
A0H*B1L, A1H*B1L, A2H*B1L, A3H*B1L (half word aligned)
A0L*B1H, A1L*B1H, A2L*B1H, A3L*B1H (half word aligned)
A0H*B1H, A1H*B1H, A2H*B1H, A3H*B1H (full word aligned)
Green terms are full word aligned. Red terms are half word aligned.

Maxwell – 5.x (cont)
Product terms AiL*BjL, AiH*BjL, AiL*BjH and AiH*BjH, for A = A3 A2 A1 A0 and B = B1 B0 (same diagram as the previous slide).
1. Sum the red (half word aligned) terms and shift left 16 bits using PRMT.
2. Add in the green (full word aligned) terms.
4n^2 + 4n – 4 instructions.
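The grouping described on this slide can be sketched in Python: accumulate all half word aligned terms in one sum, all full word aligned terms in another, then apply a single 16-bit shift and one final addition. Python big integers stand in for the register arrays and carry chains, and the name `multiply_split` is hypothetical; the real kernel's XMAD and PRMT sequences are not shown in the slides and are not reproduced here.

```python
WORD_BITS = 32

def multiply_split(a, b):
    """Multiply little-endian lists of 32-bit words via 16-bit products.

    Returns the product as an integer.
    """
    aligned = 0  # "green" terms: AiL*BjL and AiH*BjH
    half = 0     # "red" terms: AiH*BjL and AiL*BjH, before the 16-bit shift
    for i, ai in enumerate(a):
        al, ah = ai & 0xFFFF, ai >> 16
        for j, bj in enumerate(b):
            bl, bh = bj & 0xFFFF, bj >> 16
            pos = WORD_BITS * (i + j)
            aligned += (al * bl) << pos
            aligned += (ah * bh) << (pos + WORD_BITS)
            half += (ah * bl + al * bh) << pos
    # One shift for the whole red sum, then one multi-word addition.
    return aligned + (half << 16)
```

The point of the grouping is that the shift is paid once per sum rather than once per product term.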

Montgomery Reduction on Maxwell
MontgomeryReduce(MP X, MP M) {
    MP U[n]=0;
    REPEAT n TIMES
        … use Montgomery’s technique to zero out low 16-bits of X …
        U=U + (X[0]>>16);
        … use Montgomery’s technique to zero out low 16-bits of U …
        X=(X>>32) + (U[0]>>16);
        U=U>>32;
    return X=X+(U<<16);
}
Diagram: X = X0 X1 X2 ... Xn-1 Xn ... X2n-1 and U = U0 U1 U2 ... Un-1.
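For comparison, here is the textbook word-serial Montgomery reduction in Python, which clears one full 32-bit word of X per iteration. This is deliberately not the slide's variant: the slide splits the work into 16-bit steps over two accumulators X and U, and its elided steps are not filled in here. The function name, the parameter n (number of 32-bit words in M) and the final conditional subtraction are assumptions of this sketch; an "Almost Montgomery" implementation would relax that last step. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1

def montgomery_reduce(x, m, n):
    """Return x * R^-1 mod m with R = 2^(32 * n).

    Requires m odd and 0 <= x < m * R.
    """
    # m_inv = -m^-1 mod 2^32, so that x0 + (x0 * m_inv) * m == 0 mod 2^32.
    m_inv = (-pow(m, -1, 1 << WORD_BITS)) & WORD_MASK
    for _ in range(n):
        q = ((x & WORD_MASK) * m_inv) & WORD_MASK
        x = (x + q * m) >> WORD_BITS   # low word is zero, so shift it out
    # The loop leaves x < 2m; one conditional subtract finishes the job.
    return x - m if x >= m else x
```

Used as a multiply followed by a separate reduce phase, `montgomery_reduce(a * b, m, n)` yields a * b * R^-1 mod m for operands a, b < m.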

Results – Mod Mul Performance

Results – Mod Square Performance

Instructions Per Cycle / Utilization

Results – per SM per MHz

Thank you! Questions?
