Long Modular Multiplication for Cryptographic Applications. Laszlo Hars, Seagate Research. Workshop on Cryptographic Hardware and Embedded Systems, CHES 2004.

1 Long Modular Multiplication for Cryptographic Applications
Laszlo Hars, Seagate Research
Workshop on Cryptographic Hardware and Embedded Systems, CHES 2004, Boston, MA
Full version of the paper is at: http://www.hars.us/Papers/ModMult.pdf

2 Outline
Background (need, algorithms, complexity…)
Target: occasional PK crypto (smartcard, OSD…)
Optimizations
– Hardware architecture
  ▫ General purpose, supports fast modular reduction
  ▫ Speed: parallel operation: multiply || add / load…
  ▫ Memory: in-place update
– Algorithmic improvements
  ▫ Multiply with short reciprocal (~trial division)
    – Precision: scaling of reciprocals
    – Drop insignificant terms
  ▫ Modulus scaling

3 Modular Multiplication
$a \times b \bmod m$ = remainder of $(a \times b) \div m$
Used in RSA, ECC, ElGamal, Diffie-Hellman, primality tests, BBS-PRNG…
Assume a, b, m are n-digit numbers
▫ m normalized: $\tfrac{1}{2} d^n \le m < d^n$
▫ Digit size (machine word) = 16 bits (8…64)
▫ n = 64 for RSA-1024 (10…256)
Squaring is ~2× faster
Conserve memory: divide-after-multiply needs a double-length product
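To make the notation on this slide concrete, here is a minimal Python sketch of the digit representation and the normalization condition. The helper names (`to_digits`, `from_digits`, `is_normalized`, `modmul_reference`) are illustrative assumptions, not names from the paper, and the 16-bit digit size is simply the slide's default.

```python
D_BITS = 16              # digit size assumed from the slide (machine word)
d = 1 << D_BITS          # digit base d = 2^16

def to_digits(x, n):
    """Split a non-negative integer into n base-d digits, least significant first."""
    return [(x >> (D_BITS * i)) & (d - 1) for i in range(n)]

def from_digits(digits):
    """Inverse of to_digits."""
    return sum(digit << (D_BITS * i) for i, digit in enumerate(digits))

def is_normalized(m, n):
    """Check the slide's condition (1/2) d^n <= m < d^n."""
    return d**n <= 2 * m and m < d**n

def modmul_reference(a, b, m):
    """Reference result: the remainder of (a * b) divided by m."""
    return (a * b) % m
```

With 16-bit digits, a 1024-bit RSA modulus occupies 1024 / 16 = 64 digits, which matches the slide's n = 64.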

4 Modular Multiplication
Interleaved multiplication and division
Barrett multiplication
– Multiply with reciprocal ($[d^{2n} / m]$: extra n digits)
Quisquater's multiplication
– Scaling the modulus for many MS 1-bits (S: extra n digits storage)
Montgomery multiplication
– Number representation: $a \to a \times d^n \bmod m$
– Right-to-left (simple) interleaved division
– Needs pre- and post-processing
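The Barrett and Montgomery entries above can be sketched with Python big integers. This is a textbook-style illustration under stated assumptions, not the optimized digit-serial form the talk is about: the function names are hypothetical, the modulus is assumed to be an n-digit number, and the Montgomery part additionally assumes an odd modulus and Python 3.8+ (for `pow` with a negative exponent).

```python
def barrett_setup(m, n, d):
    """Long reciprocal [d^(2n) / m], computed once per modulus."""
    return d**(2 * n) // m

def barrett_reduce(x, m, n, d, mu):
    """Reduce 0 <= x < d^(2n) modulo an n-digit m using the reciprocal mu."""
    q = ((x // d**(n - 1)) * mu) // d**(n + 1)   # never overestimates [x / m]
    r = x - q * m
    while r >= m:                                # a few corrections at most
        r -= m
    return r

def montgomery_setup(m, n, d):
    """Return R = d^n and -m^(-1) mod R; requires an odd m."""
    R = d**n
    return R, (-pow(m, -1, R)) % R

def redc(T, m, R, m_neg_inv):
    """Montgomery reduction: T * R^(-1) mod m, for 0 <= T < m * R."""
    q = ((T % R) * m_neg_inv) % R
    t = (T + q * m) // R                         # exact division by R
    return t - m if t >= m else t

def montgomery_mulmod(a, b, m, n, d):
    """a * b mod m, with the pre- and post-processing the slide mentions."""
    R, m_neg_inv = montgomery_setup(m, n, d)
    a_m = (a * R) % m                            # pre-processing: a -> a * d^n mod m
    b_m = (b * R) % m
    c_m = redc(a_m * b_m, m, R, m_neg_inv)       # product stays in Montgomery form
    return redc(c_m, m, R, m_neg_inv)            # post-processing: leave the form
```

The conversions into and out of Montgomery form are what the slide calls pre- and post-processing; they are amortized when many multiplications are chained, as in modular exponentiation.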

5 Sub-Quadratic time algorithms
Fast multiplications: complicated algorithms
▫ Pay off only for very long numbers
▫ Karatsuba: $O(n^{\log_2 3})$, faster if n > 10…30
▫ Toom-Cook 3, 4… way: $O(n^{\alpha})$
▫ 3FT (Finite Field Fourier Transform): $O(n \log n \log\log n)$
Division = multiplication with reciprocal
Long reciprocal $[d^{2n} / m]$
– Newton iteration: 0.6…2 multiplication time
Speed-ups for PKC: www.hars.us/Papers/Truncated Products.pdf
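As an illustration of the simplest sub-quadratic method listed above, here is a minimal Karatsuba sketch on Python integers. The 64-bit cut-over threshold is an arbitrary assumption for the example; the slide's break-even point (n > 10…30 digits) depends on the platform.

```python
def karatsuba(x, y, threshold_bits=64):
    """Multiply non-negative integers with three half-size products per level."""
    if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
        return x * y                               # small operands: plain multiply
    h = max(x.bit_length(), y.bit_length()) // 2   # split position in bits
    mask = (1 << h) - 1
    x1, x0 = x >> h, x & mask
    y1, y0 = y >> h, y & mask
    z2 = karatsuba(x1, y1, threshold_bits)         # high * high
    z0 = karatsuba(x0, y0, threshold_bits)         # low * low
    # One extra product recovers the middle term x1*y0 + x0*y1.
    z1 = karatsuba(x1 + x0, y1 + y0, threshold_bits) - z2 - z0
    return (z2 << (2 * h)) + (z1 << h) + z0
```

Replacing four half-size products with three is what gives the $O(n^{\log_2 3})$ bound; the additions and recursion overhead are why it only wins above some operand length.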

6 Quadratic time algorithms
School multiplication: $n^2$ digit products
School division: $k \cdot n^2$ digit operations
– Quotient digits estimated with short divisions
Digit-multiplications || other operations
+ Simple structure
+ No extra storage when interleaved
– Slower
– Quotient digits with trial-and-error
Goal: reduce the number of correction steps

7 Multiply-Accumulate
DSP: multiplication parallel to load / store / add / compare…
Order of the digit-product calculation
▫ Row-order (use input digits sequentially)
    for i = 0 … |a|−1
      for j = 0 … |b|−1
        … a_i b_j …
  – More memory access
▫ Column-order (output digits sequentially)
    for k = 0 … |a|+|b|−2
      for i,j: i+j = k
        … a_i b_j …
  – Longer accumulator (can be split)
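The two loop orders above can be written out on digit lists (least significant digit first). This is a minimal sketch with hypothetical function names; Python's unbounded integers stand in for the accumulator, so the "longer accumulator" of the column-order variant shows up only as the growing value of `acc`.

```python
def mul_row_order(a, b, d):
    """School multiplication, one input digit of a per outer iteration."""
    r = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = r[i + j] + ai * bj + carry     # read-modify-write of r[i + j]
            r[i + j] = t % d
            carry = t // d
        r[i + len(b)] = carry                  # this position is still untouched
    return r

def mul_column_order(a, b, d):
    """School multiplication, one output digit per outer iteration."""
    r = []
    acc = 0                                    # long accumulator
    for k in range(len(a) + len(b) - 1):
        for j in range(max(0, k - len(a) + 1), min(k, len(b) - 1) + 1):
            acc += a[k - j] * b[j]             # all digit products with i + j = k
        r.append(acc % d)                      # each output digit is written once
        acc //= d
    r.append(acc)                              # most significant digit
    return r

def from_digits(digits, d):
    """Helper: rebuild the integer from its base-d digits."""
    return sum(digit * d**i for i, digit in enumerate(digits))
```

Both functions perform the same digit products; the row-order version rewrites every result digit once per row, while the column-order version writes each result digit exactly once, which is the memory-access trade-off the slide points at.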

8 HW Architecture
General purpose µP with enhancements
– Circuit utilization: multi-use
DSP structure: multiplication || others
– Multiplier is large and slow
Long accumulator: split adder / counter
In-accumulator instructions
Quotient-digit correction circuit
Updateable memory
– Circular offset write

9 HW Architecture
16-bit digits || shift-add = 17.5-bit mult
In-accumulator
▫ Shift
▫ Add

10 Quotient Digits
No need to store q
q ← multiplication with short reciprocal µ
– µ is used many times
– µ ← Newton iteration, look-up table…
– All bits - 2 MS digits and 1 bit: error = 0 or 1 (−1)
– More than 1-digit reciprocal: quotient often OK
– Most economical: $\mu = [d^{n+2} / 2m] = \{\mu_1, \mu_0\}$
  scale: ÷2m, making µ exact 2-digit
  Special case $m = \tfrac{1}{2} d^n$: $\mu := d^2 - 1$
– Usable: $\mu = [d^{n+1.5} / m]$, $\mu = [2 d^{n+1} / m]$…
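The 2-digit reciprocal can be tried out with a short experiment. The sketch below assumes a normalized n-digit modulus, a partial remainder R < m·d, and an estimate built from the two most significant digits of R, which is how the LRL4 formula in this deck reads; the function names are illustrative only.

```python
import random

def short_reciprocal(m, n, d):
    """mu = [d^(n+2) / 2m], clamped to d^2 - 1 in the special case m = d^n / 2."""
    return min(d**(n + 2) // (2 * m), d * d - 1)

def estimate_quotient_digit(R, mu, n, d):
    """Estimate [R / m] from the two most significant digits of the (n+1)-digit R."""
    top = R // d**(n - 1)                 # x_n * d + x_(n-1)
    return (2 * mu * top) // d**3

def observed_errors(n=4, d=1 << 16, trials=2000, seed=1):
    """Collect (true quotient - estimate) over random normalized moduli."""
    rng = random.Random(seed)
    errors = set()
    for _ in range(trials):
        m = rng.randrange(d**n // 2, d**n)     # normalized modulus
        R = rng.randrange(0, m * d)            # true quotient is a single digit
        mu = short_reciprocal(m, n, d)
        errors.add(R // m - estimate_quotient_digit(R, mu, n, d))
    return errors
```

Under these assumptions the estimate never exceeds the true quotient digit and falls short by at most one, so `observed_errors()` is a subset of `{0, 1}`, in line with the slide's "error = 0 or 1" remark.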

11 The basic algorithm LRL4
Left-Right-Left (military step) algorithm
    R_{n-1 … n-3} = a_{na-1} b_{nb-1} d + a_{na-1} b_{nb-2} + a_{na-2} b_{nb-1}   // Col 1, 2
    for k = na+nb-4 … n-3                     // Columns to left
        R_{n … n-4} += Σ_{i+j=k} a_i b_j      // Loop-1 to right
        if (overflow) R -= m
        q = (R_{n-1} µ_1 d² + R_{n-1} µ_0 d + R_{n-2} µ_1 d + R_{n-2} µ_0) / d³ · 2
        R = (R − q·m) d                       // Loop-2
    for k = 0 … n-4                           // LS digits to left
        R_{n … k} += Σ_{i+j=k} a_i b_j        // Loop-3 ~ 1
    while (R_n > 0) R -= m                    // fix overflow
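The quotient-estimate-and-correct step at the heart of the algorithm can be illustrated in isolation. The sketch below is not LRL4: it multiplies first and then divides from the most significant digit down (the memory-hungry "divide after multiply" variant), using one estimated quotient digit per step followed by a correction loop. It assumes a normalized n-digit modulus, an input x < m·d^n, and hypothetical function names.

```python
def short_reciprocal(m, n, d):
    """2-digit reciprocal [d^(n+2) / 2m], clamped to d^2 - 1."""
    return min(d**(n + 2) // (2 * m), d * d - 1)

def reduce_left_to_right(x, m, n, d):
    """Return (x mod m, number of correction subtractions) for 0 <= x < m * d^n."""
    mu = short_reciprocal(m, n, d)
    R = x // d**n                          # top part, already below m
    corrections = 0
    for k in range(n - 1, -1, -1):
        R = R * d + (x // d**k) % d        # shift in the next lower digit of x
        q = (2 * mu * (R // d**(n - 1))) // d**3   # estimated quotient digit
        R -= q * m
        while R >= m:                      # quotient correction
            R -= m
            corrections += 1
    return R, corrections
```

For example, with `d = 1 << 16`, `n = 4` and a normalized `m`, calling `reduce_left_to_right(a * b, m, n, d)` for `a, b < m` returns the same remainder as `(a * b) % m`. The interleaved algorithm on the slide avoids storing the full 2n-digit product by adding the digit-product columns into R as the reduction proceeds.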

12 Inner Loops (multiply-add)
Σ_{i+j=k} a_i b_j:
    Q = 0                                    // 50-bit accumulator
    for k = 0 … n-4
        Q = MS(Q) + r_k
        for j = max(0, k+1-na) … min(k+1, nb)
            Q += a_{k-j} b_j
        r_k = D0(Q)
    for i = n-3 … n                          // storing digits
        Q = MS(Q) + r_i
        r_i = D0(Q)
(R − q·m) d:
    c = 0                                    // 1-digit temp store
    Q = 0                                    // 33-bit accumulator
    for k = 0 … n-1
        Q = MS(Q) + c − q·m_k
        c = r_k
        r_k = D0(Q)

13 Improvements
Probability of an overflow < n / d
– When a, b and m are uniform random (?)
DSP SW mod reduction time = $1.0001\,n^2 + 4n$
– Multiply time = 10 additions: $1.00001\,n^2 + 4n$
HW assisted time = $n^2 + 4n$
Variants (Accumulator = $x_n d^3 + x_{n-1} d^2 + \dots$)
– LRL4: $q = [\,2(\mu_1 x_n d^2 + (\mu_1 x_{n-1} + \mu_0 x_n) d + \mu_0 x_{n-1}) / d^3\,]$
– LRL3: $q = [\,2(\mu_1 x_n d + (\mu_1 x_{n-1} + \mu_0 x_n)) / d^2\,]$
? LRL2: $q = [\,(\mu_1 x_n d + \mu_0 x_n) / d^{2+\delta}\,]$, many corrections ε 2
Sequential quotient correction

14 Shorter reciprocal
1 digit → error explosion
1 digit + 2 bits OK: $\mu = \tfrac{1}{2}[2 d^{n+1} / m] = d + \mu_0 + \delta$, with δ = 0 or ½
50-bit accumulator with carry c = 0 or 1: $R = c\,d^3 + x_n d^2 + x_{n-1} d + x_{n-2}$
Estimated quotient digit: $q = [(R + R\delta/d)/d^2 + \mu_0 c + \mu_0 x_n / d] \approx \mu R / d$
Mod reduction time
– SW: $1.25\,n^2 + n$ (mult = 10 adds: $1.025\,n^2 + n$)
– HW: $n^2 + n$
$\mu_1 = 1$
Quotient correction

15 Modulus Scaling
Special m: NO multiplication for quotient digit
– Quotient digit: q = r n +1
– (0F) MS digit of m = $d - 1 = 11\ldots1_2$
– (10) MS 2 digits of m = {1, 0}
Transform m: 1-digit scaling factor S
– mS is n+1-digit
– Last reduction step is with m → n-digit result
Need to store m and mS
Faster than Montgomery: $n^2$ + const
Montgomery with modulus scaling: $n^2$ + const
– LS digit of m = $d - 1 = 11\ldots1_2$ (xF)
– Last reduction step is with m → n-digit result
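The Montgomery variant in the last bullets can be sketched as follows. The assumption here is that the scaled modulus mS is chosen so that its least significant digit is d − 1, which makes the Montgomery quotient digit equal to the lowest digit of the running value, with no multiplication needed to find it. The function names are hypothetical, m is assumed odd, and Python 3.8+ is needed for `pow` with a negative exponent.

```python
def scale_for_montgomery(m, d):
    """Find a 1-digit S with m * S = -1 (mod d), so the LS digit of m*S is d - 1."""
    S = (-pow(m, -1, d)) % d               # requires gcd(m, d) = 1, i.e. odd m
    return S, m * S

def redc_scaled(T, mS, steps, d):
    """Digit-serial Montgomery reduction modulo mS: returns a value = T * d^(-steps) (mod mS)."""
    for _ in range(steps):
        q = T % d                          # quotient digit is just the LS digit
        T = (T + q * mS) // d              # exact: T + q*mS = T - q = 0 (mod d)
    return T

def redc_scaled_mod_m(T, m, mS, steps, d):
    """Final reduction with the original modulus m gives the n-digit result."""
    return redc_scaled(T, mS, steps, d) % m
```

For instance, `S, mS = scale_for_montgomery(m, 1 << 16)` followed by `redc_scaled_mod_m(a * b, m, mS, 5, 1 << 16)` yields `a * b * d^(-5) mod m`. Because m divides mS, working modulo mS throughout and reducing modulo m only at the end is safe, which matches the slide's "last reduction step is with m".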

16 Summary

