Fast Modular Reduction

Presentation transcript:

Fast Modular Reduction
Will Hasenplaugh, Gunnar Gaubatz, Vinodh Gopal
June 27, 2007

Modular Multiplication
Modular multiplication is used in public-key cryptography:
  Diffie-Hellman and RSA
  Prime-field Elliptic Curve Cryptography
Compute AB mod M, where A, B and M are typically hundreds to thousands of bits long.
We present a variant of Barrett's Modular Reduction Algorithm that exploits Karatsuba Multiplication and Modular Folding.
The analysis is software-focused.
We use an abstract processor to compare algorithms fairly:
  The native word size is w bits (a power of 2)
  1-cycle add and m-cycle multiply
We present example data for an 8-bit processor with a 2-cycle multiplier:
  Atmel AVR series, representative of embedded handheld devices
Our algorithm is also applicable to hardware acceleration.
Digital Enterprise Group

Montgomery vs. Barrett
Word-Serial Montgomery
  Pro:
    Regularity
    Interleaved multiply and reduce
    Low-complexity quotient estimation
    Right-to-left computation leads to convenient hardware pipelines
  Con:
    Transformation overhead
    n^2 complexity
Barrett
  Pro:
    No transformation overhead
    Large-digit-based computation
    Allows sub-n^2 multiplication techniques
    Flexible 'off the shelf' hardware
  Con:
    Quotient estimation requires a 'large digit' multiplication
    Left-to-right computation is less convenient for hardware

Barrett vs. Montgomery
Performance of n^2 Barrett approaches ~2/3 of Montgomery.
Quotient estimation for Montgomery is amortized as operands grow.

Karatsuba Multiplication
Recursive multiplication algorithm with O(n^1.585) complexity.
'Schoolbook' multiplication complexity scales as O(n^2), but requires fewer additions per recursion.
N = AB
A = a1·2^n + a0
B = b1·2^n + b0
Schoolbook Multiplication: N = a1·b1·2^(2n) + (a1·b0 + a0·b1)·2^n + a0·b0
Karatsuba Multiplication: N = a1·b1·2^(2n) + [(a1+a0)(b1+b0) - a1·b1 - a0·b0]·2^n + a0·b0
[Diagram: A and B are split into halves (a1, a0) and (b1, b0); the partial products a1·b1, a0·b0 and (a1+a0)(b1+b0) - a0·b0 - a1·b1 are summed to form N = AB]
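The recursion on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `karatsuba`, the `bits` bookkeeping parameter and the `cutoff` at which the recursion falls back to a native multiply are all assumptions made for the example.

```python
def karatsuba(a: int, b: int, bits: int, cutoff: int = 8) -> int:
    """Multiply non-negative integers a and b, each assumed to fit in `bits` bits.

    `cutoff` (hypothetical parameter) is the size at which we stop recursing
    and use a native multiply, standing in for the w-bit hardware multiplier.
    """
    # Base case: operands small enough for a single native multiply.
    # The floor of 3 bits guarantees the recursion always shrinks.
    if bits <= max(cutoff, 3):
        return a * b

    half = bits // 2
    mask = (1 << half) - 1

    # Split A = a1*2^half + a0 and B = b1*2^half + b0.
    a1, a0 = a >> half, a & mask
    b1, b0 = b >> half, b & mask

    hi = karatsuba(a1, b1, bits - half, cutoff)   # a1*b1
    lo = karatsuba(a0, b0, half, cutoff)          # a0*b0
    # The sums a1+a0 and b1+b0 can be one bit longer than the halves.
    mid = karatsuba(a1 + a0, b1 + b0, bits - half + 1, cutoff) - hi - lo

    # N = a1*b1*2^(2*half) + mid*2^half + a0*b0
    return (hi << (2 * half)) + (mid << half) + lo
```

Each level performs three half-size multiplies instead of four, which is where the O(n^1.585) exponent (log2 3) comes from. The extra bit carried by `a1 + a0` is the 'extra' word that the next slides deal with.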

Recursive Karatsuba Decomposition
[Diagram: at successive recursion levels the 'extra' word of a1+a0 is <= 1, <= 2, <= 3]
For k recursions, the 'extra' word is <= log2(k) bits.
Just one extra word on an 8-bit machine is sufficient to handle multiplication of numbers up to 2^258 bits. There are fewer particles in the universe than that. So, we probably won't need to rewrite this code.

Carry Handling
There is considerable overhead in the naïve implementation of Karatsuba.
At a recursion depth of 4, ~20% of the multiplies are with sparsely populated 'extra' words.
We turn sparsely populated multiplies into branches and adds.
N = AB
A = ah·2^n + al
B = bh·2^n + bl
ah and bh are booleans
N = ah·bh·2^(2n) + [ah·bl + bh·al]·2^n + al·bl
[Diagram: start from al·bl; add al if bh = 1; add bl if ah = 1; add 1 if ah & bh = 1; the result is N]
Each recursion is a conveniently sized multiply, so there are no 'extra' words.
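The branch-and-add idea can be illustrated with a short Python sketch. It assumes each operand is an n-bit value plus a single carry bit, as on the slide; the function name `mul_with_carry_bits` is hypothetical.

```python
def mul_with_carry_bits(a: int, b: int, n: int) -> int:
    """Multiply two (n+1)-bit operands using one n-bit multiply plus adds.

    a = ah*2^n + al and b = bh*2^n + bl, where ah and bh are single bits.
    """
    mask = (1 << n) - 1
    ah, al = a >> n, a & mask
    bh, bl = b >> n, b & mask
    assert ah in (0, 1) and bh in (0, 1), "operands must fit in n+1 bits"

    result = al * bl              # the only real multiply
    if bh:                        # bh*al*2^n becomes a conditional add
        result += al << n
    if ah:                        # ah*bl*2^n becomes a conditional add
        result += bl << n
    if ah and bh:                 # ah*bh*2^(2n) is a single bit at position 2n
        result += 1 << (2 * n)
    return result
```

Because the high parts are booleans, the three products that involve them never need the multiplier, which is how every recursion stays a conveniently sized multiply.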

Karatsuba vs. Schoolbook Multiplication

Barrett's Algorithm
A, B and M are n-bit numbers.
We seek R = AB mod M using Barrett's Algorithm, which takes a total of 3 n-bit multiplies.
[Diagram: N = A x B is split into N / 2^n and N mod 2^n; N / 2^n is multiplied by μ to give μN / 2^n, whose upper part ~μN / 2^(2n) is multiplied by M; the product ~μNM / 2^(2n) is subtracted from N to give R]
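The data flow in the diagram can be sketched in Python. The slide does not define μ; the sketch assumes the usual Barrett constant μ = floor(2^(2n) / M), precomputed once per modulus. The function names and the final correction loop are part of the illustration, not taken from the slides.

```python
def barrett_setup(M: int, n: int) -> int:
    """Precompute mu = floor(2^(2n) / M) for an n-bit modulus M (assumed definition)."""
    return (1 << (2 * n)) // M


def barrett_mulmod(A: int, B: int, M: int, n: int, mu: int) -> int:
    """Return A*B mod M using three large multiplies and no division."""
    N = A * B                      # multiply 1: the 2n-bit product
    q = ((N >> n) * mu) >> n       # multiply 2: quotient estimate ~mu*N / 2^(2n)
    R = N - q * M                  # multiply 3: subtract the estimated multiple of M
    # The estimate never exceeds the true quotient, so R >= 0; it may be
    # slightly too small, so a few conditional subtractions finish the job.
    while R >= M:
        R -= M
    return R
```

A typical use precomputes `mu = barrett_setup(M, n)` once and then calls `barrett_mulmod` for every multiplication under the same modulus, for example inside a modular exponentiation loop.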

Barrett vs. Montgomery

Folding
We accelerate the reduction process by partially reducing N (= AB) with an inexpensive method called Folding:
[Diagram: N = A x B is split into N / 2^(3s) and N mod 2^(3s); the high part N / 2^(3s) is multiplied by M' = 2^(3s) mod M, and the product ~NM' / 2^(3s) is added to N mod 2^(3s) to give N']
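A single fold can be written in a few lines of Python. The sketch assumes s = n/2, so that the fold point 2^(3s) matches the 2^(1.5n) shown on the iterative folding slide; the function name `fold` is hypothetical, and M' is recomputed on every call only for brevity (in practice it would be precomputed per modulus).

```python
def fold(N: int, M: int, s: int) -> int:
    """Partially reduce N modulo M by folding its top bits down.

    Returns N' with N' congruent to N (mod M), about 3s+1 bits long
    when N has 4s bits and M has 2s bits.
    """
    shift = 3 * s
    m_prime = pow(2, shift, M)             # M' = 2^(3s) mod M
    high = N >> shift                      # N / 2^(3s): the top s bits
    low = N & ((1 << shift) - 1)           # N mod 2^(3s)
    return low + high * m_prime            # one s-bit by 2s-bit multiply and an add
```

The fold works because 2^(3s) and M' are congruent modulo M, so replacing the high part's weight 2^(3s) by M' leaves the residue unchanged while shortening the number. The result is not fully reduced; a final Barrett step on the shorter N' completes the reduction.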

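Iterative Folding
We can play the same trick again. F times, in fact.
[Diagram: N is split into N / 2^(1.5n) and N mod 2^(1.5n); the high part is multiplied by M(1) and added to the low part to give N(1). N(1) is split at 2^(1.25n); its high part is multiplied by M(2) and added to N(1) mod 2^(1.25n) to give N(2). N(2) is split again at 2^(1.125n).]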
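Repeating the fold at successively lower split points can be sketched as below. The sketch assumes the i-th fold point is 2^(n + n/2^i), which reproduces the 1.5n, 1.25n and 1.125n exponents in the diagram, and that M(i) = 2^(n + n/2^i) mod M by analogy with M' on the previous slide. The function name `iterative_fold` is hypothetical, and n is assumed to be divisible by 2^F.

```python
def iterative_fold(N: int, M: int, n: int, folds: int) -> int:
    """Apply `folds` folding steps to the 2n-bit product N, modulo the n-bit M.

    Returns a value congruent to N (mod M) that is only a little longer
    than n + n/2^folds bits; a final reduction step is still required.
    """
    for i in range(1, folds + 1):
        shift = n + (n >> i)               # 1.5n, 1.25n, 1.125n, ...
        m_i = pow(2, shift, M)             # M(i), precomputable per modulus
        high = N >> shift
        low = N & ((1 << shift) - 1)
        N = low + high * m_i               # N(i)
    return N
```

Each fold multiplies a shorter high part than the one before it, so the multiplies get cheaper while the operand handed to the final Barrett reduction keeps shrinking.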

Iterative Folding (F = 2)

Summary
This Fast Modular Reduction technique is ~2x faster than Montgomery on RSA encryption with 512- to 1024-bit keys.
As security requirements heighten, key sizes will grow to meet them and the asymptotic advantage of Karatsuba will continue to shine.
  We see a ~3x and ~4x advantage, respectively, for 2048- and 4096-bit keys.
  The speedup of a multiplier-bound, w-bit architecture is
Strong encryption on low-power handheld devices is challenging.
  Ex: A 16 MHz 8-bit Atmel AVR computes a 4096-bit RSA operation in almost 4 minutes with Montgomery, but we can do it in 1 minute.