Presentation on theme: "Algorithm Efficiency in Hardware with an Emphasis on Skein"— Presentation transcript:
1 Algorithm Efficiency in Hardware with an Emphasis on Skein By Phil Doughty
2 Outline: Purpose of this Presentation; Full Custom (ASIC) Design; Digital Hardware Implementation Basics; Gates; Arithmetic; Field Programmable Gate Arrays (FPGAs): Layout, How FPGAs are used; Skein Hashing Algorithm.
3 Purpose: Touch upon basic hardware elements. Inform future cryptographers and designers of cryptographic algorithms of the benefits and limitations of hardware. Present Skein as an algorithm with pretty good hardware compatibility.
4 Full Custom (ASIC) Design Image contributed from Dr. Shaaban, CE Dept.
5 Digital Logic Gates: The basic operation block. 1 or more input voltages, and exactly 1 output voltage. Voltage is either High or Low (1 or 0). Technologies: TTL (Bipolar Junction Transistors) and CMOS (Complementary Metal Oxide Semiconductor Field Effect Transistors).
6 Primary Gates: INVERT, AND, OR. INVERT isn’t always necessary, depending on the underlying technology. NAND and NOR: NAND is an AND gate with INVERTed output; NOR is an OR gate with INVERTed output. The schematic is similar to AND and OR, but with a bubble on the output (representing inversion). Either NAND or NOR alone can be used to build any logic.
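The claim that NAND alone can build any logic can be checked with a short sketch. The function names below (`nand`, `inv`, `and_`, `or_`) are illustrative and not from the slides; bits are modeled as the integers 0 and 1.

```python
def nand(a: int, b: int) -> int:
    """NAND of two bits: 0 only when both inputs are 1."""
    return 1 - (a & b)


def inv(a: int) -> int:
    """INVERT built from one NAND with both inputs tied together."""
    return nand(a, a)


def and_(a: int, b: int) -> int:
    """AND built as NAND followed by INVERT."""
    return inv(nand(a, b))


def or_(a: int, b: int) -> int:
    """OR built by inverting both inputs before a NAND (De Morgan)."""
    return nand(inv(a), inv(b))


# Exhaustively compare against Python's bitwise operators.
for a in (0, 1):
    assert inv(a) == 1 - a
    for b in (0, 1):
        assert and_(a, b) == (a & b)
        assert or_(a, b) == (a | b)
```

The same construction works with NOR by swapping the roles of AND and OR.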
7 Inverter: Truth Table (Input A -> Output Y): 0 -> 1; 1 -> 0. Algebraic Notation: Y = A’
8 AND Gate: Truth Table (Input A, Input B -> Output Y): 0, 0 -> 0; 0, 1 -> 0; 1, 0 -> 0; 1, 1 -> 1. Algebraic Notation: Y = AB
9 OR Gate: Truth Table (Input A, Input B -> Output Y): 0, 0 -> 0; 0, 1 -> 1; 1, 0 -> 1; 1, 1 -> 1. Algebraic Notation: Y = A + B
10 XOR Gate: Truth Table (Input A, Input B -> Output Y): 0, 0 -> 0; 0, 1 -> 1; 1, 0 -> 1; 1, 1 -> 0. Algebraic Notation: Y = A ⊕ B
11 XOR Gate (Continued): Can be composed of INVERT, AND, and OR: A ⊕ B = A’B + AB’. But it can be easily implemented in hardware using faster methods.
12 Gate Delay: Gates are not instantaneous. There is a delay between the time an input changes and the time the output changes.
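The identity A ⊕ B = A’B + AB’ can be verified over all four input combinations. This is a minimal sketch; the function name `xor_from_basic_gates` is made up for illustration.

```python
def xor_from_basic_gates(a: int, b: int) -> int:
    """Compute A xor B as A'B + AB' using only INVERT, AND, and OR on bits."""
    not_a = 1 - a
    not_b = 1 - b
    return (not_a & b) | (a & not_b)


# Compare against Python's built-in XOR for every input pair.
for a in (0, 1):
    for b in (0, 1):
        assert xor_from_basic_gates(a, b) == (a ^ b)
```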
14 Addition and Subtraction: The Ripple-Carry Adder is the easiest to analyze. Faster adders are used in industry: Naffziger (Intel Core 2), Carry Look-ahead Adders, etc. The Ripple-Carry Adder uses two components, the Half Adder and the Full Adder. The Full Adder has a third input for Carry-In compared to the Half Adder. Subtraction is just addition of a negative number in 2’s complement notation.
15 Ripple Carry Adder Algorithm: Similar to manual addition. The Least Significant Bits (A0 and B0) are added together to produce a Sum Bit and a Carry Bit (S0 and C1). The next pair of bits (A1 and B1) are added together along with the previous Carry Bit (C1) to produce a Sum Bit and a Carry Bit (S1 and C2). The process repeats for each remaining pair of bits.
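Subtraction as 2’s complement addition can be sketched as follows. The 8-bit default width and the function name `subtract` are assumptions chosen for illustration; the slides do not fix a width here.

```python
def subtract(a: int, b: int, n: int = 8) -> int:
    """Compute (a - b) mod 2**n by adding the 2's complement of b.

    The 2's complement of b is formed by inverting its n bits and adding 1,
    so the same adder hardware serves for both addition and subtraction.
    """
    mask = (1 << n) - 1
    negated_b = ((~b) & mask) + 1
    return (a + negated_b) & mask


assert subtract(5, 3) == 2
# A negative result wraps around: -2 is represented as 254 in 8 bits.
assert subtract(3, 5) == 254
```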
16 Ripple-Carry Adder Components: Half Adder: 1 Gate Delay for both the Sum bit and the Carry bit. Full Adder: 2 Gate Delays for the Sum bit; 3 Gate Delays for the Carry bit; 1 Gate Delay to change the Sum bit if the incoming Carry bit changes; 2 Gate Delays to change the Carry bit if the incoming Carry bit changes.
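A bit-level sketch of a ripple-carry adder built from these two components follows. It models only the logic, not the timing, and it uses a Full Adder at every bit position (including bit 0, with an explicit carry-in C0), which is one common arrangement and an assumption here. All function names are illustrative.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Return (sum, carry) for two input bits."""
    return a ^ b, a & b


def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Return (sum, carry_out) for two input bits plus a carry-in bit."""
    partial_sum, carry_1 = half_adder(a, b)
    total, carry_2 = half_adder(partial_sum, carry_in)
    return total, carry_1 | carry_2


def ripple_carry_add(a: int, b: int, n: int, c0: int = 0) -> tuple[int, int]:
    """Add two n-bit integers bit by bit; return (n-bit sum, final carry)."""
    carry = c0
    result = 0
    for i in range(n):  # least significant bit first
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry


assert ripple_carry_add(5, 3, n=4) == (8, 0)
assert ripple_carry_add(15, 1, n=4) == (0, 1)  # overflow sets the final carry
```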
18 Ripple-Carry Adder Worst Case: The Worst Case Scenario is when C0 is 0, A is all 1’s and B is all 0’s, and then C0 changes to 1. The Carry has to propagate through all of the Full Adder Blocks. For an n-bit Ripple-Carry Adder: 2(n-1) + 1 gate delays to change the final Sum bit; 2n gate delays to change the final Carry bit.
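The two worst-case formulas can be evaluated for the word sizes that matter later in the talk. This sketch only restates the formulas from the slide; the function name is hypothetical.

```python
def ripple_carry_worst_case(n: int) -> tuple[int, int]:
    """Return (gate delays to final Sum bit, gate delays to final Carry bit).

    The carry ripples through n - 1 full adders (2 gate delays each) before
    the last sum bit settles 1 gate delay later; the final carry needs the
    full 2 gate delays from all n full adders.
    """
    sum_delay = 2 * (n - 1) + 1
    carry_delay = 2 * n
    return sum_delay, carry_delay


for width in (8, 64):
    print(width, ripple_carry_worst_case(width))
```

For a 64-bit adder, the word size Skein uses, this gives 127 gate delays for the final Sum bit and 128 for the final Carry bit.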
19 Multiplication: Generic Multiplier: Any two numbers can be multiplied together (A * B = Y); n-bit inputs produce a 2n-bit output. Constant Coefficient Multiplier: Multiplication by a constant (A * 5 = Y); easier to implement; used in Finite Impulse Response (FIR) Filters.
20 Generic Multipliers: O(n²) gate delays for an n-bit Generic Multiplier. Very slow compared to addition. Uses many resources compared to addition.
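One common way a constant coefficient multiplier avoids a generic multiplier is shift-and-add; the slides do not say which technique is meant, so the decomposition below (5 = 4 + 1) is an assumption for illustration. The second check confirms that the product of two n-bit inputs always fits in 2n bits.

```python
def times_five(a: int) -> int:
    """Multiply by the constant 5 using one left shift and one addition."""
    return (a << 2) + a  # 4*a + a


assert times_five(7) == 35

# Largest product of two 8-bit inputs still fits in 16 bits.
n = 8
largest_input = (1 << n) - 1
assert (largest_input * largest_input).bit_length() == 2 * n
```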
21 Optimized 3-bit Generic Multiplier At most 11 gate delays
22 Optimized 8-bit Generic Multiplier At most 53 Gate Delays
23 Division/Modulus: More complex than Multiplication. Can be implemented as a series of subtractions. Sequential logic, which uses Registers and a Clock signal, may be better suited.
24 Shortcuts: Multiplication: If multiplying by a power of 2, shift left by the power. Division: If dividing by a power of 2, shift right by the power. Modulus: If taking a modulus of a power of 2, AND the bits with (modulus – 1).
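Division as a series of subtractions can be sketched as a loop, where each iteration corresponds to one clocked step in a sequential circuit. The function name is illustrative, and real dividers typically use faster shift-and-subtract schemes; this is only the simplest form of the idea.

```python
def divmod_by_subtraction(dividend: int, divisor: int) -> tuple[int, int]:
    """Return (quotient, remainder) for non-negative dividend, positive divisor."""
    if dividend < 0 or divisor <= 0:
        raise ValueError("expects dividend >= 0 and divisor > 0")
    quotient = 0
    remainder = dividend
    while remainder >= divisor:  # one subtraction per step
        remainder -= divisor
        quotient += 1
    return quotient, remainder


assert divmod_by_subtraction(17, 5) == (3, 2)
```

The remainder left in the register at the end is the modulus, so the same hardware yields both results.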
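The three power-of-2 shortcuts, sketched for unsigned integers. The function names are illustrative; `k` is the exponent, so the power of 2 is `2**k`.

```python
def mul_pow2(x: int, k: int) -> int:
    """x * 2**k as a left shift."""
    return x << k


def div_pow2(x: int, k: int) -> int:
    """x // 2**k as a right shift (unsigned x)."""
    return x >> k


def mod_pow2(x: int, k: int) -> int:
    """x % 2**k by ANDing with (2**k - 1), i.e. keeping the low k bits."""
    return x & ((1 << k) - 1)


assert mul_pow2(13, 3) == 13 * 8
assert div_pow2(13, 2) == 13 // 4
assert mod_pow2(13, 3) == 13 % 8
```

In hardware, all three are wire selections or a single level of AND gates, which is why arithmetic mod 2^64 is cheap.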
25 Full Custom: Benefits: Best possible performance. Can be specially designed for low power consumption (embedded systems) or for high speed (PC expansion card). No restrictions on logic. No restrictions on routing. Drawbacks: Expensive to design. Expensive to test. Fabrication takes months.
26 FPGA Image contributed from Dr. Shaaban, CE Dept.
27 What is an FPGA? Field Programmable Gate Array (FPGA): It is an array of gates that can be programmed. A good compromise between General Purpose Processors and Full Custom.
28 Layout of an FPGA: Input and Output (I/O) Blocks interface with the outside world (LED display, switches, buttons, etc.). Logic Blocks usually take 3-4 input signals and generate the desired output signal; data can be registered. Interconnects can be programmed to connect logic blocks and I/O blocks together (Logic -> Logic, I/O -> Logic, Logic -> I/O, I/O -> I/O). There is usually a special Clock network to avoid Clock skew problems. Image contributed from Dr. Łukowiak, CE Dept.
29 How are FPGAs actually used? They use a “programming language”: VHDL (VHSIC Hardware Description Language, where VHSIC stands for Very High Speed Integrated Circuit) or Verilog (a C-like language). Programs are NOT Top-Down like C, BASIC, etc. The programs describe the hardware. Very parallel, with some sequential parts running in parallel.
30 Step 1: Simulation. The programs run through a simulator, which applies the correct input and generates the output. Once the simulator produces the desired output, THE TASK IS NOT OVER YET!
31 Step 2: Synthesis. The Compiler will try to Synthesize the code into the appropriate logic blocks. (The previous Multiplier Schematic was Synthesized from VHDL.) Not all VHDL statements are Synthesizable (while loop, wait statements, etc.). Many times the program has to be adjusted to use only synthesizable commands… back to Simulation.
32 Step 3: Place & Route. The compiler now figures out where to place each logic block, and how the logic blocks are interconnected. Sometimes more hardware is needed than is actually on the specific FPGA device. Options: Buy a bigger FPGA, or redesign the program to reuse more hardware or to route data differently… back to Simulation.
33 Step 4: Download to FPGA. Download the program onto the FPGA. Run the program and make sure the correct results are obtained. If the logic is too complex (the gate delay exceeds the clock period), then the clock frequency may have to be scaled down. If everything works, then done.
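The relationship between gate delay and clock period can be made concrete with a rough estimate. The per-gate delay of 0.5 ns below is a hypothetical figure chosen only for the arithmetic, not a measured value for any device, and the function name is illustrative.

```python
def max_clock_mhz(critical_path_gate_delays: int, gate_delay_ns: float) -> float:
    """Highest clock frequency (MHz) whose period still covers the critical path."""
    period_ns = critical_path_gate_delays * gate_delay_ns
    return 1000.0 / period_ns  # 1000 / (period in ns) gives MHz


# Hypothetical example: a path of 128 gate delays at an assumed 0.5 ns per gate.
print(max_clock_mhz(128, 0.5))
```

Any logic path longer than the clock period produces wrong results, so the longest path sets the frequency for the whole design.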
34 FPGA: Benefits: Better performance than General Purpose Processors, even though the clock frequency may only be in the MHz range. Easier to design than Full Custom. Easier to test than Full Custom. Good for prototyping Full Custom. Drawbacks: Not a Production-Grade piece of hardware. No application uses 100% of everything available on the FPGA. Some FPGAs reset on power loss, and need to be reprogrammed.
35 Skein Hashing Algorithm: Different versions depending on the internal state and output size. For example, Skein-512-1024 has a 512-bit internal state and 1024 output bits. Skein-512 is the default proposal and is the version examined in this presentation. Only 256-, 512-, and 1024-bit internal states are supported. Any output size may be used. Skein-256 and Skein-512 have 72 rounds; Skein-1024 has 80 rounds. Based on the Threefish Block Cipher (introduced alongside Skein). The Threefish Block Cipher has 3 components: MIX, Permute, and Add Subkey. Skein wraps a 512-bit XOR around Threefish to create a UBI block, and UBI blocks are chained together.
36 Threefish Block Cipher: Encryption starts with 8 64-bit Subkey additions. Then there are 4 rounds of MIX and Permute, followed by the next Subkey addition. There are a total of 72 rounds. The Cipher ends with the final Subkey addition (subkey index 18 counting from 0, i.e. the 19th Subkey addition).
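The round structure can be laid out as a schedule to count the Subkey additions. This sketch only enumerates the order of operations described on the slide; it performs no encryption, and the function name and step labels are illustrative.

```python
def threefish_schedule(rounds: int = 72, rounds_per_subkey: int = 4) -> list[str]:
    """List the order of subkey additions and MIX/Permute rounds."""
    steps = []
    for d in range(rounds):
        if d % rounds_per_subkey == 0:
            # A subkey is added before every group of four rounds.
            steps.append(f"subkey {d // rounds_per_subkey}")
        steps.append(f"round {d}")
    # One more subkey addition after the last round.
    steps.append(f"subkey {rounds // rounds_per_subkey}")
    return steps


schedule = threefish_schedule()
subkey_steps = [s for s in schedule if s.startswith("subkey")]
print(len(subkey_steps), subkey_steps[0], subkey_steps[-1])
```

With 72 rounds this yields 72 / 4 + 1 = 19 Subkey additions, numbered 0 through 18.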
37 The MIX Function: One 64-bit addition. One 64-bit rotate. One 64-bit XOR.
38 MIX Function Hardware Analysis: 64-bit Addition: Full Custom (ASIC) isn’t too bad; FPGAs can handle a few of these. Bit Rotation: Simply a wire-mapping. 64-bit XOR: Even easier than Addition; 1 Gate Delay.
39 The Permute Function: 64-bit words are swapped between MIX functions.
40 Permute Function Hardware Analysis: Entirely wire mappings. Not an issue.
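A minimal sketch of MIX on a pair of 64-bit words, following the add, rotate, XOR structure listed above. The rotation amount `r` is left as a parameter because the slides do not list the per-round rotation constants; the names `rotl64` and `mix` are illustrative.

```python
MASK64 = (1 << 64) - 1


def rotl64(x: int, r: int) -> int:
    """Rotate a 64-bit word left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def mix(x0: int, x1: int, r: int) -> tuple[int, int]:
    """One MIX: y0 = x0 + x1 (mod 2**64), y1 = rotl(x1, r) XOR y0."""
    y0 = (x0 + x1) & MASK64
    y1 = rotl64(x1, r) ^ y0
    return y0, y1


assert mix(1, 1, 1) == (2, 0)
# The addition wraps around modulo 2**64.
assert mix(MASK64, 1, 0) == (0, 1)
```

Masking with `MASK64` plays the role of simply dropping the carry out of the top bit in hardware.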
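Because Permute only reorders words, it can be modeled as a fixed index table. The table below is the 8-word permutation as recalled from the Skein specification for Threefish-512; it is not given in the slides, so treat it as an assumption to verify against the specification. The function name is illustrative.

```python
# Assumed Threefish-512 word permutation (verify against the Skein specification).
PERMUTATION_512 = (2, 1, 4, 7, 6, 5, 0, 3)


def permute(words: list[int], pi: tuple[int, ...] = PERMUTATION_512) -> list[int]:
    """Output word i is input word pi[i]; in hardware this is pure wiring."""
    return [words[pi[i]] for i in range(len(pi))]


# Feeding in the indices themselves makes the wiring visible.
assert permute(list(range(8))) == [2, 1, 4, 7, 6, 5, 0, 3]
```

No gates are involved, which is why the slide calls it "not an issue".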
42 Subkey Hardware Analysis: 8 XORs chained together: 8 Gate Delays. Subkey Index mod 9 (and 3): Full Custom (ASIC) can be hard-coded; creative methods must be used in FPGA. Two 64-bit Additions chained together. Additions are taken mod 2^64: our only good news!
43 Subkey Hardware Analysis Continued: Eight 64-bit Additions happen “logically” in parallel. Each of those Eight is really 2 64-bit Additions chained together, as mentioned previously. To actually do this in parallel is a large hardware commitment. To save on hardware, each addition should happen serially using the same Logic Blocks (FPGA). This may require external memory I/O between additions to swap out the addends. VERY SLOW.
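The pieces named above (the XOR chain, the index mod 9 and mod 3, and the chained additions mod 2^64) fit together roughly as in this sketch of the Threefish key schedule as recalled from the Skein specification. The parity constant is passed in as a parameter because its value depends on the Skein version and is not given in the slides; all function names are illustrative.

```python
MASK64 = (1 << 64) - 1


def extend_key(key_words: list[int], parity_constant: int) -> list[int]:
    """Append one extra key word: the constant XORed with all key words."""
    parity = parity_constant
    for k in key_words:  # 8 chained XORs for an 8-word key
        parity ^= k
    return list(key_words) + [parity]


def extend_tweak(t0: int, t1: int) -> list[int]:
    """Append a third tweak word, t0 XOR t1."""
    return [t0, t1, t0 ^ t1]


def subkey(ext_key: list[int], ext_tweak: list[int], s: int) -> list[int]:
    """Build subkey number s (key index mod 9, tweak index mod 3 for 8 words)."""
    nw = len(ext_key) - 1
    words = [ext_key[(s + i) % (nw + 1)] for i in range(nw)]
    words[nw - 3] = (words[nw - 3] + ext_tweak[s % 3]) & MASK64
    words[nw - 2] = (words[nw - 2] + ext_tweak[(s + 1) % 3]) & MASK64
    words[nw - 1] = (words[nw - 1] + s) & MASK64
    return words


def add_subkey(state: list[int], sk: list[int]) -> list[int]:
    """Add the subkey to the state word by word, mod 2**64."""
    return [(v + k) & MASK64 for v, k in zip(state, sk)]
```

For three of the eight words, forming the subkey word and then adding it to the state gives the two chained 64-bit additions the slide refers to. A serial FPGA design would run `add_subkey` one word at a time through a single adder instead of eight in parallel.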
45 UBI Block Hardware Analysis: One 512-bit XOR: OK for Full Custom (ASIC), but a major pain; trouble for FPGA (wire-routing nightmare). Chaining is no big deal: ~640-bit register (512-bit “key”, 128-bit “tweak”).
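The chaining can be sketched as follows: each block is encrypted with the previous chaining value as the key, and the result is XORed with the block itself (the 512-bit XOR, done here as eight 64-bit XORs). The `block_cipher` argument stands in for Threefish, and `toy_cipher` is a deliberately trivial placeholder, not a real cipher. The tweak values are treated as opaque; how they are built is not covered in the slides.

```python
MASK64 = (1 << 64) - 1


def ubi_chain(initial_chain, blocks, tweaks, block_cipher):
    """Chain UBI blocks: next chain value = E(chain, tweak, block) XOR block."""
    chain = list(initial_chain)
    for block, tweak in zip(blocks, tweaks):
        encrypted = block_cipher(chain, tweak, block)
        chain = [e ^ m for e, m in zip(encrypted, block)]
    return chain


def toy_cipher(key, tweak, block):
    """Placeholder for Threefish (NOT secure): adds key word + 1 to each word."""
    return [(b + k + 1) & MASK64 for b, k in zip(block, key)]


result = ubi_chain([0] * 8, [[1] * 8, [2] * 8], [(0, 0), (0, 0)], toy_cipher)
assert result == [4] * 8
```

The `chain` list plus the tweak corresponds to the roughly 640-bit register mentioned on the slide.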
46 FPGA Stats on a Spartan 2 for Skein: Number of Slices: out of 2352 (148%) (*); Number of Slice Flip Flops: 4604 out of 4704 (97%); Number of 4 input LUTs: 6262 out of 4704 (133%) (*); Number of IOs: 62; Number of bonded IOBs: 44 out of 140 (31%); IOB Flip Flops: 4; Number of GCLKs: 2 out of 4 (50%).
47 Changes Necessary to Fit: Complete redesign of the underlying components, specifically Subkey. Minimize routing. More utilization of the external memory module. Buy a bigger FPGA (Spartan 3?).