Published by Lexi Culmer. Modified over 4 years ago.

1
TIE Extensions for Cryptographic Acceleration
Charles-Henri Gros, Alan Keefer, Ankur Singla

2
Agenda
- Introduction
- Survey of Existing Architectures
- Xtensa+ Crypto Processor
- Rijndael Algorithm (AES final selection)
- RC6, IDEA, and DES
- Performance
- Trade-off Analysis
- Conclusion

3
Introduction
- Commercial networking applications require flexible, high-throughput secure connectivity
- Encryption and decryption algorithms are computation intensive
- Multi-session applications place a significant load on embedded processors
- Embedded systems need performance while optimizing power and area
- Our study: a survey of existing architectures, an analysis of Xtensa as an alternative, and a performance and trade-off analysis for embedded systems

4
Survey of Existing Architectures
- Three categories
  - Specialized Crypto Processors
  - Reconfigurable Architectures
  - Full Hardware Implementation (ASICs/FPGAs)
- High variation in
  - Architecture complexity
  - Performance vs. area trade-off
  - Suitability for embedded applications

5
Specialized Crypto Processors
- A few VLIW architectures, e.g. CryptoManiac
- Instruction combining: instruction words are combined to exploit ILP
- Crypto arithmetic unit(s): multiple XORs, GF multiplication/addition, lookup-table substitution, and permutation
- Coarse configurability of the datapath
- Mostly lacking SIMD support
- Performance is typically 2x to 6x that of general-purpose processors

6
Reconfigurable Architectures
- Numerous reconfigurable processor architectures: PipeRench, MorphoSys, COBRA, and GARP
- Functional units that provide all crypto arithmetic: multiple XORs, GF multiplication/addition, modulo multiplication
- Reconfigurable interconnection network to change functional-unit connectivity dynamically
- VLIW instructions
- Reconfiguration registers
- Suitable for block ciphers
- High variability in performance gain relative to general-purpose processors

7
Full Hardware Implementation
- High-performance implementations targeted at ASICs/FPGAs
  - DES: 12 Gbps on a Virtex-E XCV300E
  - AES: 18 Gbps on an ASIC using a TSMC 0.18 µm process
- Lacking flexibility and crypto modes
- Memory and area efficient
- Typical latency is only in the DMA of data to the hardware unit
- Need an additional processor for the control path

8
Xtensa+ Crypto Architecture
- Custom extensions to the Xtensa processor using the TIE framework
- Addition of a generic key schedule register file and instructions to support all crypto algorithms studied
- Addition of multiple on-chip SRAMs (in addition to 4 Data-RAMs) to the Xtensa processor
  - Currently implemented using the Table construct in TIE
  - Hacked the TIE-compiler-generated Verilog code to instantiate multiple RAM models (implemented using a multi-dimensional array) for viability analysis
- Addition of 4 State Registers and 4 Next State Registers generic to all algorithms studied
- Possible future extensions include multi-session key storage and fast retrieval support

9
AES Overview
- AES (Advanced Encryption Standard) is the standard set to replace DES for both government and private-sector encryption
- Uses a fixed block size of 128 bits, with key sizes of 128, 192, or 256 bits
- Designed to be efficient in both hardware and software across a variety of platforms
- 10, 12, or 14 rounds depending on key size
- A 128-bit round key is used for each round
  - Round keys can be pre-computed and cached for future encryptions
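The relationship between key size, round count, and the amount of round-key storage that caching implies can be sketched as follows. This is a minimal illustration, not part of the original slides: the function names `aes_rounds` and `round_key_bytes` are hypothetical, and the storage figure assumes one 128-bit round key per round plus the initial key that is XORed in before the first round.

```python
# Sketch: AES round count and round-key storage per key size.
# Function names are illustrative assumptions, not from the slides.
BLOCK_BITS = 128


def aes_rounds(key_bits: int) -> int:
    """Number of rounds: Nr = Nk + 6, where Nk is the key length in 32-bit words."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key size must be 128, 192, or 256 bits")
    return key_bits // 32 + 6


def round_key_bytes(key_bits: int) -> int:
    """Bytes needed to cache the expanded key: (Nr + 1) round keys of 128 bits each."""
    return (aes_rounds(key_bits) + 1) * BLOCK_BITS // 8


for bits in (128, 192, 256):
    print(bits, aes_rounds(bits), round_key_bytes(bits))
```

For a 128-bit key this gives 10 rounds and 176 bytes of cached round keys, which is the kind of figure a multi-session key store would need to budget per session.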

10
AES Implementation Abstraction
- Each round consists of a lookup, a byte-level permutation, a finite-field multiplication, and a key XOR
- Lookup and multiplication can be combined into four separate 8x32 lookup tables, so each round is 16 lookups and 16 XORs
- Decryption is essentially the same, but with different tables and a different key schedule

11
TIE Implementation
- Our implementation does all 16 lookups in parallel, requiring 16 SRAMs
- x0, x1, x2, x3 represent the round state (32 bits each); k0, k1, k2, k3 are the current round key; and Tij are the T-boxes, where i is a duplication index and j is the T-box index
- Each table index is the low 8 bits of the shifted word, and all four right-hand sides read the state from before the round
- Each round is then:
  x0 = T00[x0]^T01[x1>>8]^T02[x2>>16]^T03[x3>>24] ^ k0
  x1 = T10[x1]^T11[x2>>8]^T12[x3>>16]^T13[x0>>24] ^ k1
  x2 = T20[x2]^T21[x3>>8]^T22[x0>>16]^T23[x1>>24] ^ k2
  x3 = T30[x3]^T31[x0>>8]^T32[x1>>16]^T33[x2>>24] ^ k3
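The round above can be modelled in software as a rough sketch. The code below is not the authors' TIE code: it is a plain Python model that assumes a little-endian packing of each state column (row 0 in the low byte), builds the S-box and the four T-boxes from first principles, and uses one round-key word per column. The names `build_sbox`, `build_ttables`, and `aes_round` are hypothetical. The slide's 16 tables Tij are four identical copies of these four tables, duplicated so that hardware can read them in parallel; a software model only needs one copy.

```python
# Sketch of one AES encryption round using T-box lookups (software model only).

def xtime(a):
    """Multiply a GF(2^8) element by 2 modulo the AES polynomial 0x11b."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def gmul(a, b):
    """GF(2^8) multiplication by shift-and-add."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a = xtime(a)
        b >>= 1
    return p


def rotl8(v, n):
    return ((v << n) | (v >> (8 - n))) & 0xFF


def rotl32(w, n):
    return ((w << n) | (w >> (32 - n))) & 0xFFFFFFFF


def build_sbox():
    """S-box = multiplicative inverse in GF(2^8) followed by the AES affine map."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gmul(x, y) == 1)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


def build_ttables(sbox):
    """T0 folds the S-box and the MixColumns column (2, 1, 1, 3) into one word;
    T1..T3 are byte rotations of T0."""
    t0 = []
    for b in range(256):
        s = sbox[b]
        s2 = xtime(s)
        s3 = s2 ^ s
        t0.append(s2 | (s << 8) | (s << 16) | (s3 << 24))
    return [t0] + [[rotl32(w, 8 * j) for w in t0] for j in (1, 2, 3)]


def aes_round(x, k, T):
    """One round on four 32-bit state words x with round-key words k.
    All lookups read the old state; the new state is returned as a fresh list."""
    return [
        T[0][x[c] & 0xFF]
        ^ T[1][(x[(c + 1) % 4] >> 8) & 0xFF]
        ^ T[2][(x[(c + 2) % 4] >> 16) & 0xFF]
        ^ T[3][(x[(c + 3) % 4] >> 24) & 0xFF]
        ^ k[c]
        for c in range(4)
    ]


SBOX = build_sbox()
T = build_ttables(SBOX)
print([hex(w) for w in aes_round([0, 0, 0, 0], [0, 0, 0, 0], T)])
```

Returning a new list rather than updating `x` in place mirrors the separate State and Next State registers: every lookup must see the state from before the round. The final AES round omits the column mixing and therefore needs plain S-box lookups, which this sketch does not cover.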

12
Other Ciphers Implemented
- DES (Data Encryption Standard)
  - 64-bit block, 56-bit key, 16 rounds, Feistel network
  - Eight 6x4 S-boxes, XORs, and bit-level permutations
  - Can’t really be done efficiently in software
  - TIE implementation required 1 instruction per round
- IDEA (International Data Encryption Algorithm)
  - 64-bit block, 128-bit key, 8 rounds, iterated, operates on 16-bit numbers
  - 4 multiplications mod 2^16 + 1, 4 additions mod 2^16, 6 XORs
  - Each round is highly sequential, so it is difficult to parallelize
  - TIE implementation required 7 instructions per round
- RC6
  - Same block and key sizes as AES, 20 rounds, iterated
  - Multiplication mod 2^32, XORs, rotations, addition mod 2^32
  - TIE implementation required 2 instructions per round
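The arithmetic primitives named for IDEA and RC6 are the operations a custom instruction would have to absorb. The following is a minimal software sketch of them, not the TIE implementation: `idea_mul`, `idea_add`, `rotl32`, and `rc6_f` are hypothetical helper names, and the sketch relies on the standard conventions that IDEA treats the 16-bit value 0 as 2^16 in its multiplication and that RC6 on 32-bit words rotates its quadratic term left by 5 bits.

```python
# Sketch of the modular primitives used by IDEA and RC6 (illustrative helpers).
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def idea_mul(a, b):
    """Multiply modulo 2^16 + 1, with the 16-bit value 0 standing for 2^16."""
    a = a or 0x10000
    b = b or 0x10000
    # The product is never 0 mod 65537 (a prime), so the result is in 1..65536;
    # masking maps 65536 back to its 16-bit encoding, 0.
    return (a * b % 0x10001) & MASK16


def idea_add(a, b):
    """Addition modulo 2^16."""
    return (a + b) & MASK16


def rotl32(w, n):
    """Rotate a 32-bit word left by n bits (n taken modulo 32)."""
    n &= 31
    return ((w << n) | (w >> (32 - n))) & MASK32


def rc6_f(b):
    """RC6 quadratic function: (b * (2b + 1) mod 2^32) rotated left by 5."""
    return rotl32((b * (2 * b + 1)) & MASK32, 5)


print(idea_mul(0, 0), idea_mul(2, 0x8000), idea_add(0xFFFF, 1), rc6_f(1))
```

In IDEA each multiplication feeds the next operation within the round, which is why the slide calls the round highly sequential; the RC6 quadratic function, by contrast, is applied to two words independently in each round.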

13
AES Performance in Xtensa+
- Performance of the TIE extensions approaches that of non-pipelined ASICs
- Total of 31 run-time instructions per data block
  - 1 initial XOR instruction
  - 1 instruction per round (10 total)
  - 20 cycles for load and store of the 128-bit data block
- Generally an order of magnitude better than pure software
- Also faster than reconfigurable hardware or a specialized VLIW processor

14
Throughput (Mbps)
        Base    VLIW    TIE     ASIC     Reconfig.
AES     43.7    512     984     18000    594
DES     26.5    240     586     15000    53.3
IDEA    28      200     231     2034     1013
RC6     61      368     508     15200    470

15
Cycles per Block
        Base    VLIW    TIE    ASIC
AES     838     90      31     10
DES     690     112     26     16
IDEA    653     112     66     9
RC6     600     140     60     9
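The cycle counts and throughput figures are tied together by throughput = block bits x clock / cycles per block. The sketch below is a consistency check rather than data from the slides: it reads the Base and TIE cycle counts as 838/31 (AES), 690/26 (DES), 653/66 (IDEA), and 600/60 (RC6), reads the TIE throughputs as 984, 586, 231, and 508 Mbps, and assumes one block is processed at a time. The clock frequency it prints is derived, not stated in the presentation, and the names `speedup` and `implied_clock_mhz` are hypothetical.

```python
# Sketch: speedup over the base processor and the clock implied by the TIE figures.
# All numbers are as read from the tables; the derived clock is an inference.
BLOCK_BITS = {"AES": 128, "DES": 64, "IDEA": 64, "RC6": 128}
CYCLES = {  # cycles per block: (base processor, TIE-extended processor)
    "AES": (838, 31),
    "DES": (690, 26),
    "IDEA": (653, 66),
    "RC6": (600, 60),
}
TIE_MBPS = {"AES": 984, "DES": 586, "IDEA": 231, "RC6": 508}


def speedup(cipher):
    """Cycle-count speedup of the TIE version over the base processor."""
    base, tie = CYCLES[cipher]
    return base / tie


def implied_clock_mhz(cipher):
    """Solve throughput = block_bits * f_clk / cycles for f_clk (in MHz)."""
    return TIE_MBPS[cipher] * CYCLES[cipher][1] / BLOCK_BITS[cipher]


for name in CYCLES:
    print(f"{name}: {speedup(name):.1f}x fewer cycles, "
          f"implied clock {implied_clock_mhz(name):.1f} MHz")
```

Under these readings, all four ciphers imply nearly the same TIE clock, which suggests the throughput column was computed from the cycle counts at a single frequency. The cycle-count speedups range from roughly 10x (IDEA, RC6) to roughly 27x (AES, DES).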

16
Design Tradeoffs
- Flexibility
  - Algorithm changes
  - New algorithms
  - New encryption modes
  - Implementation bugs
- Time to market
  - Closer to software development time
  - Can choose which parts to accelerate

17
Power vs. Performance: Mbps/mW
        Base    VLIW    TIE     ASIC     Rec.
AES     0.36    1.15    5.63    30       0.66
DES     0.22    0.54    4.19    59.13    0.08
IDEA    0.23    0.62    2.13    15.8     2.89
RC6     0.51    1.37    4.69    14.12    1.35

18
Conclusion
- Xtensa instructions provide flexibility, performance, and Mbps/mW that all fall somewhere between an ASIC and a VLIW or software-based solution
- Suitable for most embedded applications, such as 802.11i
- Using Xtensa for cryptography is a good choice if:
  - You don’t need absolute throughput
  - You don’t need absolute flexibility
  - You need a control processor anyway
  - The algorithms needed are known ahead of time
