Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs by Niall Emmart 1, Justin Luitjens 2, Charles Weems 1 and Cliff Woolley 2 1 University of Massachusetts, 2 NVIDIA Corporation.

Presentation transcript:

Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs by Niall Emmart 1, Justin Luitjens 2, Charles Weems 1 and Cliff Woolley 2 1 University of Massachusetts 2 NVIDIA Corporation

Modular Multiplication Computes A * B mod M, where A, B and M are hundreds to thousands of bits in length.

Motivating Problems Modular multiplication is a key primitive, used especially in cryptographic operations: RSA, Diffie-Hellman, digital signature algorithms, prime generation, and factorization.

Preliminaries – Number Representation The GPU is a batched device: one problem instance per thread. Separate multiply and Montgomery reduce phases. Small sizes, at most 512 bits.
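A minimal CUDA launch skeleton for this batching model (an illustrative sketch, not the authors' code; LIMBS, mont_mul_reduce and the flat per-instance layout are assumptions made for illustration): each thread owns one independent problem instance, stored as 32-bit limbs.

    #include <cstdint>

    #define LIMBS 16   // 512 bits = 16 x 32-bit words, the largest size handled

    // Device routine assumed to hold the multiply + Montgomery reduce phases.
    __device__ void mont_mul_reduce(uint32_t *r, const uint32_t *a,
                                    const uint32_t *b, const uint32_t *m);

    __global__ void batched_mod_mul(uint32_t *r, const uint32_t *a,
                                    const uint32_t *b, const uint32_t *m,
                                    int count)
    {
        int inst = blockIdx.x * blockDim.x + threadIdx.x;   // one instance per thread
        if (inst >= count) return;

        // Each instance occupies LIMBS consecutive words; all intermediate
        // values stay in registers inside mont_mul_reduce().
        mont_mul_reduce(r + inst * LIMBS, a + inst * LIMBS,
                        b + inst * LIMBS, m + inst * LIMBS);
    }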

Achieving High Performance Algorithms: use fast squaring, use "Almost Montgomery" (a redundant representation), and don't use asymptotically faster algorithms. Keep everything on chip. Minimize register usage. Performance is dominated by the low-level techniques used to sum the product terms. Good utilization (instructions/cycle).
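To make the fast-squaring point concrete, here is a hedged plain-CUDA sketch (not the authors' code; N and the simple 64-bit carry handling are illustrative assumptions). Each off-diagonal term a_i * a_j appears twice in a square, so it is computed once and the partial sum is doubled, roughly halving the multiplications.

    #include <cstdint>

    #define N 8   // number of 32-bit limbs (example size only)

    __device__ void square_words(uint32_t r[2 * N], const uint32_t a[N])
    {
        for (int k = 0; k < 2 * N; k++) r[k] = 0;

        // Off-diagonal terms a[i]*a[j] with i < j, each computed once.
        for (int i = 0; i < N; i++) {
            uint32_t carry = 0;
            for (int j = i + 1; j < N; j++) {
                uint64_t t = (uint64_t)a[i] * a[j] + r[i + j] + carry;
                r[i + j] = (uint32_t)t;
                carry = (uint32_t)(t >> 32);
            }
            r[i + N] = carry;
        }

        // Double the off-diagonal sum (each cross term occurs twice)...
        uint32_t carry = 0;
        for (int k = 0; k < 2 * N; k++) {
            uint64_t t = ((uint64_t)r[k] << 1) + carry;
            r[k] = (uint32_t)t;
            carry = (uint32_t)(t >> 32);
        }

        // ...then add the diagonal squares a[i]^2 at word position 2*i.
        carry = 0;
        for (int i = 0; i < N; i++) {
            uint64_t sq = (uint64_t)a[i] * a[i];
            uint64_t t  = (uint64_t)r[2 * i] + (uint32_t)sq + carry;
            r[2 * i] = (uint32_t)t;
            carry = (uint32_t)(t >> 32);
            t = (uint64_t)r[2 * i + 1] + (uint32_t)(sq >> 32) + carry;
            r[2 * i + 1] = (uint32_t)t;
            carry = (uint32_t)(t >> 32);
        }
    }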

Multiplication – Product Terms An n-word by n-word product is computed by summing the n^2 single-word product terms.
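A baseline way to sum those terms, as a hedged row-oriented sketch in plain CUDA C (not the authors' kernel; N and the 64-bit carry handling are illustrative assumptions):

    #include <cstdint>

    #define N 8   // number of 32-bit words per operand (example size only)

    // Schoolbook product: every pair a[i]*b[j] contributes to word i+j of r.
    __device__ void mul_words(uint32_t r[2 * N], const uint32_t a[N],
                              const uint32_t b[N])
    {
        for (int k = 0; k < 2 * N; k++) r[k] = 0;

        for (int i = 0; i < N; i++) {
            uint32_t carry = 0;
            for (int j = 0; j < N; j++) {
                // product + existing word + carry always fits in 64 bits
                uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
                r[i + j] = (uint32_t)t;
                carry = (uint32_t)(t >> 32);
            }
            r[i + N] = carry;
        }
    }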

Prior Work – NVIDIA 1.x architectures Two hardware instructions: mad24.lo D, A, B, C and mad24.hi D, A, B, C. Note: no carry in or carry out.
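As a concrete illustration (not from the paper), the two instructions can be reached from CUDA C through inline PTX; mad24.lo adds the low 32 bits of the 48-bit 24x24 product to C, while mad24.hi adds bits 16..47 of that product, and neither form carries in or out.

    #include <cstdint>

    __device__ uint32_t mad24_lo(uint32_t a, uint32_t b, uint32_t c)
    {
        uint32_t d;
        asm("mad24.lo.u32 %0, %1, %2, %3;" : "=r"(d) : "r"(a), "r"(b), "r"(c));
        return d;   // low 32 bits of the 48-bit product, plus c
    }

    __device__ uint32_t mad24_hi(uint32_t a, uint32_t b, uint32_t c)
    {
        uint32_t d;
        asm("mad24.hi.u32 %0, %1, %2, %3;" : "=r"(d) : "r"(a), "r"(b), "r"(c));
        return d;   // bits 16..47 of the 48-bit product, plus c
    }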

Prior Work – NVIDIA 1.x architectures (cont) Bernstein et al. [1]: – a and b sampled into 15 limbs of 14 bits – column oriented approach – uses mad24.lo as a 32-bit accumulator – achieves 461M 210-bit mod mul ops/sec. Emmart and Weems [2]: – a and b sampled into 22-bit values – column oriented approach – uses pairs of mad24.lo / mad24.hi ops as a 48-bit accumulator – achieves 822K 256-bit mod exp ops/sec – equivalent to 816M 210-bit mod mul ops/sec

Prior Work – 2.x and 3.x architectures Two hardware instructions: imad{c}.hi.{cc} D, A, B, C and imad{c}.lo.{cc} D, A, B, C. Note the carry-in and carry-out options.

Prior Work – 2.x and 3.x architectures (cont) [Diagram: the 4-word by 4-word product A3 A2 A1 A0 × B3 B2 B1 B0, laid out as the low halves L(Ai Bj) and high halves H(Ai Bj) of each 32×32-bit partial product, aligned into columns.] Uses an accumulator for each column and ripples the carry: 2n^2 + n – 1 instructions.
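For reference, a hedged inline-PTX sketch (not the authors' code) of the carry-chained multiply-add pattern these architectures expose: the 64-bit product a*b is added into the word pair {hi, lo}, with the hardware carry flag rippling from the .lo half into the .hi half instead of being recomputed in software.

    #include <cstdint>

    __device__ void mad_wide_acc(uint32_t &lo, uint32_t &hi,
                                 uint32_t a, uint32_t b)
    {
        uint32_t lo_out, hi_out;
        asm("mad.lo.cc.u32 %0, %2, %3, %4;\n\t"   // lo_out = low(a*b)  + lo, carry out
            "madc.hi.u32   %1, %2, %3, %5;"       // hi_out = high(a*b) + hi + carry in
            : "=r"(lo_out), "=r"(hi_out)
            : "r"(a), "r"(b), "r"(lo), "r"(hi));
        lo = lo_out;
        hi = hi_out;
    }

Chaining such operations per column is what leads to the 2n^2 + n – 1 instruction count quoted above.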

Prior Work – 2.x and 3.x architectures (cont) Zheng et al. [3]: – row oriented approach – rippled carries – 3.412B 256-bit mod mul ops/sec Emmart and Weems [2]: – same approach – 3.469B 256-bit mod mul ops/sec – noted this approach does not work on Maxwell

Maxwell – 5.x A single hardware instruction: xmad{.x}{.cc} D, A.{h0|h1}, B.{h0|h1}, C. Note: this instruction also supports carry in and carry out.
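A rough C model of the operation, inferred from the operand form above (an illustrative assumption, not an official definition): the selected 16-bit half of A is multiplied by the selected 16-bit half of B and added to the full 32-bit C, with .x supplying a carry in and .cc producing a carry out.

    #include <cstdint>

    // Hypothetical model of XMAD D, A.{h0|h1}, B.{h0|h1}, C (the .cc carry out
    // is omitted for brevity; .x is modeled as the carry_in term).
    __device__ uint32_t xmad_model(uint32_t a, bool a_h1,
                                   uint32_t b, bool b_h1,
                                   uint32_t c, uint32_t carry_in)
    {
        uint32_t ah = a_h1 ? (a >> 16) : (a & 0xFFFFu);   // A.h1 or A.h0
        uint32_t bh = b_h1 ? (b >> 16) : (b & 0xFFFFu);   // B.h1 or B.h0
        return ah * bh + c + carry_in;                    // 16x16 multiply + 32-bit add
    }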

Maxwell – 5.x (cont) Consider computing A*B where A and B are each 32 bits, using a 16-bit multiplier: A*B = (AH*BH << 32) + (AH*BL << 16) + (AL*BH << 16) + AL*BL, where the two cross products AH*BL and AL*BH are half-word aligned. On Maxwell, a 32-bit madc.lo.cc and madc.hi.cc are emulated and take 4 and 6 instructions respectively. Thus a row-oriented multiply takes ~10n^2 instructions!

Maxwell – 5.x (cont) [Diagram: the partial products of A3 A2 A1 A0 against B0 and against B1, with each word split into 16-bit halves. The green terms (AiL*BjL and AiH*BjH) are full-word aligned; the red terms (AiH*BjL and AiL*BjH) are half-word aligned.]

Maxwell – 5.x (cont) [Same partial-product layout as the previous slide.] Sum the red (half-word-aligned) terms, shift the sum left 16 bits using PRMT, then add in the green (full-word-aligned) terms: 4n^2 + 4n – 4 instructions.
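The same recipe for a single 32x32-bit product, written as a hedged plain-CUDA sketch (not the authors' kernel; in the real code these additions map onto XMAD carry chains, and plain C with __byte_perm is used here only to make the data flow explicit): sum the red cross products, shift the 33-bit sum left 16 bits with a PRMT-style byte permute, then add the green products.

    #include <cstdint>

    __device__ void mul32x32_halfword(uint32_t a, uint32_t b,
                                      uint32_t &lo, uint32_t &hi)
    {
        uint32_t al = a & 0xFFFFu, ah = a >> 16;
        uint32_t bl = b & 0xFFFFu, bh = b >> 16;

        // Red terms: both cross products are half-word (bit 16) aligned.
        uint32_t red   = al * bh;
        uint32_t t     = red + ah * bl;
        uint32_t carry = (t < red) ? 1u : 0u;     // 33rd bit of the red sum
        red = t;

        // Shift the 33-bit red sum left by 16 using PRMT-style byte permutes.
        uint32_t red_lo = __byte_perm(red, 0, 0x1044);     // (red << 16) low word
        uint32_t red_hi = __byte_perm(red, carry, 0x5432); // (carry:red) >> 16

        // Green terms: full-word aligned at bits 0 and 32.
        lo = red_lo + al * bl;
        hi = red_hi + ah * bh + ((lo < red_lo) ? 1u : 0u);
    }

As a sanity check, a = b = 0xFFFFFFFF yields hi = 0xFFFFFFFE and lo = 0x00000001, matching the full 64-bit square.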

Montgomery Reduction on Maxwell

    MontgomeryReduce(MP X, MP M) {
        MP U[n] = 0;
        REPEAT n TIMES
            … use Montgomery's technique to zero out low 16-bits of X …
            U = U + (X[0] >> 16);
            … use Montgomery's technique to zero out low 16-bits of U …
            X = (X >> 32) + (U[0] >> 16);
            U = U >> 32;
        return X = X + (U << 16);
    }

[Diagram: the 2n-word value X = X0 X1 X2 … Xn-1 Xn … X2n-1 and the n-word accumulator U = U0 U1 U2 … Un-1.]
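The elided "zero out the low 16 bits" step is ordinary Montgomery reduction with a 16-bit digit; a hedged sketch of that one step follows (standard textbook form, not the authors' exact code; m_inv_16 = -M^-1 mod 2^16 is assumed to be precomputed from the odd modulus).

    #include <cstdint>

    // Make the low 16 bits of the multiword value x[0..n-1] zero by adding a
    // suitable multiple of the modulus m; returns the carry out of the top word.
    __device__ uint32_t zero_low16(uint32_t x[], int n, const uint32_t m[],
                                   uint32_t m_inv_16)
    {
        uint32_t q = (x[0] * m_inv_16) & 0xFFFFu;   // digit that cancels x mod 2^16

        uint32_t carry = 0;
        for (int i = 0; i < n; i++) {
            uint64_t t = (uint64_t)q * m[i] + x[i] + carry;
            x[i] = (uint32_t)t;
            carry = (uint32_t)(t >> 32);
        }
        return carry;
    }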

Results – Mod Mul Performance

Results – Mod Square Performance

Instructions Per Cycle / Utilization

Results – per SM per MHz

Thank you! Questions?