1 Algorithm Efficiency in Hardware with an Emphasis on Skein. By Phil Doughty
2 Outline: Purpose of this Presentation; Full Custom (ASIC) Design; Digital Hardware Implementation Basics (Gates, Arithmetic); Field Programmable Gate Arrays (FPGAs) (Layout, How FPGAs are used); Skein Hashing Algorithm.
3 Purpose: Touch upon basic hardware elements. Inform future cryptographers and designers of cryptographic algorithms of the benefits and limitations of hardware. Present Skein as an algorithm with pretty good hardware compatibility.
4 Full Custom (ASIC) Design. Image contributed from Dr. Shaaban, CE Dept.
5 Digital Logic Gates: The basic operation block. Each gate has 1 or more input voltages and exactly 1 output voltage. Each voltage is either High or Low (1 or 0). Technologies: TTL (Bipolar Junction Transistors) and CMOS (Complementary Metal Oxide Semiconductor Field Effect Transistors).
6 Primary Gates: INVERT, AND, and OR. INVERT isn’t always necessary, depending on the underlying technology. NAND and NOR: NAND is an AND gate with an INVERTed output, and NOR is an OR gate with an INVERTed output. Their schematics are similar to AND and OR, but with a bubble on the output (representing the inverse). Either one alone can be used to build any logic.
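The claim that NAND alone can build any logic can be checked with a small sketch. The function names below are illustrative choices, not taken from the slides, and bits are modeled as the Python integers 0 and 1.

```python
def nand(a: int, b: int) -> int:
    """NAND of two bits: an AND gate with an inverted output."""
    return 1 - (a & b)


def inv(a: int) -> int:
    """INVERT built from one NAND with both inputs tied together."""
    return nand(a, a)


def and_(a: int, b: int) -> int:
    """AND built from a NAND followed by a NAND-based inverter."""
    return inv(nand(a, b))


def or_(a: int, b: int) -> int:
    """OR built from NAND using De Morgan: A + B = (A'B')'."""
    return nand(inv(a), inv(b))


def check_nand_universal() -> bool:
    """Compare the NAND-only gates against Python's bit operators."""
    for a in (0, 1):
        if inv(a) != 1 - a:
            return False
        for b in (0, 1):
            if and_(a, b) != (a & b) or or_(a, b) != (a | b):
                return False
    return True
```

The same exercise works for NOR by swapping the roles of AND and OR.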
7 Inverter
Truth Table:
Input A  Output Y
0        1
1        0
Algebraic Notation: Y = A’
8 AND Gate
Truth Table:
Input A  Input B  Output Y
0        0        0
0        1        0
1        0        0
1        1        1
Algebraic Notation: Y = AB
9 OR Gate
Truth Table:
Input A  Input B  Output Y
0        0        0
0        1        1
1        0        1
1        1        1
Algebraic Notation: Y = A + B
10 XOR Gate
Truth Table:
Input A  Input B  Output Y
0        0        0
0        1        1
1        0        1
1        1        0
Algebraic Notation: Y = A ⊕ B
11 XOR Gate (Continued): XOR can be composed of INVERT, AND, and OR gates: A ⊕ B = A’B + AB’. But it can easily be implemented in hardware using faster methods.
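The identity A ⊕ B = A’B + AB’ can be confirmed by enumerating all four input combinations. This is a minimal sketch with bits modeled as 0 and 1; the function names are illustrative.

```python
def xor_from_basic_gates(a: int, b: int) -> int:
    """XOR composed only of INVERT, AND, and OR: A'B + AB'."""
    not_a = 1 - a  # INVERT
    not_b = 1 - b  # INVERT
    return (not_a & b) | (a & not_b)  # two ANDs feeding one OR


def xor_truth_table() -> list:
    """Return (A, B, Y) rows for every input combination."""
    return [(a, b, xor_from_basic_gates(a, b)) for a in (0, 1) for b in (0, 1)]
```

Each row of `xor_truth_table()` can be compared with Python's built-in `a ^ b` to confirm the composition.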
12 Gate Delay: Gates are not instantaneous. There is a delay between the time an input changes and the time the output changes.
14 Addition and Subtraction: The Ripple-Carry Adder is the easiest adder to analyze. Faster adders are used in industry: Naffziger (Intel Core 2), Carry Look-ahead Adders, etc. The Ripple-Carry Adder uses two components, the Half Adder and the Full Adder. Compared to the Half Adder, the Full Adder has a third input for Carry-In. Subtraction is just addition of a negative number in 2’s complement notation.
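Subtraction as addition of a 2’s complement negative can be sketched as follows. The fixed `width` parameter and the function names are assumptions made for illustration; hardware would do the same thing with an inverter per bit and a Carry-In of 1.

```python
def twos_complement_negate(value: int, width: int) -> int:
    """Negate a width-bit value: invert every bit, then add 1."""
    mask = (1 << width) - 1
    return ((~value) + 1) & mask


def subtract(a: int, b: int, width: int) -> int:
    """Compute a - b as a + (-b), keeping only the low `width` bits."""
    mask = (1 << width) - 1
    return (a + twos_complement_negate(b, width)) & mask
```

With 8-bit values, `subtract(9, 5, 8)` gives 4, and `subtract(5, 9, 8)` gives 252, which is the 8-bit 2’s complement pattern for -4.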
15 Ripple-Carry Adder Algorithm: Similar to manual addition. The Least Significant Bits (A0 and B0) are added together to produce a Sum Bit and a Carry Bit (S0 and C1). The next pair of bits (A1 and B1) are added together along with the previous Carry Bit (C1) to produce a Sum Bit and a Carry Bit (S1 and C2). The process repeats for the remaining pairs of bits.
16 Ripple-Carry Adder Components. Half Adder: 1 Gate Delay for both the Sum bit and the Carry bit. Full Adder: 2 Gate Delays for the Sum bit and 3 Gate Delays for the Carry bit; 1 Gate Delay to change the Sum bit if the incoming Carry bit changes; 2 Gate Delays to change the Carry bit if the incoming Carry bit changes.
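The Half Adder, Full Adder, and ripple-carry chain can be modeled bit by bit. This is a behavioral sketch only: it assumes every stage is a Full Adder fed by a Carry-In (C0), builds the Full Adder from two Half Adders and an OR gate, and stores bits least significant first. The helper names are illustrative.

```python
def half_adder(a: int, b: int) -> tuple:
    """Return (sum, carry) for two input bits: XOR and AND."""
    return a ^ b, a & b


def full_adder(a: int, b: int, carry_in: int) -> tuple:
    """Return (sum, carry_out) using two half adders and an OR gate."""
    partial_sum, carry_1 = half_adder(a, b)
    total, carry_2 = half_adder(partial_sum, carry_in)
    return total, carry_1 | carry_2


def ripple_carry_add(a_bits: list, b_bits: list, carry_in: int = 0) -> tuple:
    """Add two equal-length bit lists (LSB first); return (sum_bits, carry_out)."""
    carry = carry_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)  # carry ripples to the next stage
        sum_bits.append(s)
    return sum_bits, carry


def to_bits(value: int, width: int) -> list:
    """Convert an integer to a list of bits, LSB first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits: list) -> int:
    """Convert a list of bits (LSB first) back to an integer."""
    return sum(bit << i for i, bit in enumerate(bits))
```

For example, `ripple_carry_add(to_bits(6, 4), to_bits(7, 4))` yields the bits of 13 with a carry-out of 0.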
18 Ripple-Carry Adder Worst Case: The worst case scenario is when C0 is 0, A is all 1’s, B is all 0’s, and then C0 changes to 1. The Carry has to propagate through all of the Full Adder blocks. For an n-bit Ripple-Carry Adder, it takes 2(n-1) + 1 gate delays to change the final Sum bit and 2n gate delays to change the final Carry bit.
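The two worst-case formulas follow from the per-stage figures: a changing Carry-In takes 2 gate delays to pass through each Full Adder and 1 gate delay to update a Sum bit. A small sketch (function names are illustrative) tabulates them for any width:

```python
def final_sum_delay(n: int) -> int:
    """Gate delays until the last Sum bit settles: 2(n-1) + 1."""
    return 2 * (n - 1) + 1


def final_carry_delay(n: int) -> int:
    """Gate delays until the final Carry bit settles: 2n."""
    return 2 * n


def delay_table(widths: list) -> list:
    """Return (n, sum_delay, carry_delay) rows for the given bit widths."""
    return [(n, final_sum_delay(n), final_carry_delay(n)) for n in widths]
```

For the 64-bit words that Skein uses, this gives 127 gate delays for the final Sum bit and 128 for the final Carry bit.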
19 Multiplication. Generic Multiplier: any two numbers can be multiplied together (A * B = Y); n-bit inputs produce a 2n-bit output. Constant Coefficient Multiplier: multiplication by a constant (A * 5 = Y); easier to implement; used in Finite Impulse Response (FIR) Filters.
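One common way to see why a constant coefficient multiplier is easier is the shift-and-add decomposition, shown here for the A * 5 example. The slides do not say which implementation they have in mind, so the decomposition below is an assumption; the second function checks the 2n-bit output width of a generic multiplier.

```python
def times_five(a: int) -> int:
    """A * 5 as one shift and one addition: 5 = 4 + 1, so A*5 = (A << 2) + A."""
    return (a << 2) + a


def max_product_bits(n: int) -> int:
    """Bits needed for the largest product of two n-bit unsigned inputs."""
    largest_input = (1 << n) - 1
    return (largest_input * largest_input).bit_length()
```

The constant version needs a single adder and fixed wiring for the shift, while the generic version needs enough hardware to cover every possible second operand.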
20 Generic Multipliers: O(n^2) gate delays for an n-bit Generic Multiplier. Very slow compared to addition. Uses many resources compared to addition.
21 Optimized 3-bit Generic Multiplier: At most 11 gate delays.
22 Optimized 8-bit Generic Multiplier: At most 53 gate delays.
23 Division/Modulus: More complex than multiplication. Can be implemented as a series of subtractions. Sequential logic, which uses registers and a clock signal, may be better suited.
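Division as a series of subtractions can be sketched as a loop, where each iteration stands in for one clock cycle of a sequential circuit. This is a minimal model for non-negative integers; the function name and the error handling are illustrative choices.

```python
def divide_by_subtraction(dividend: int, divisor: int) -> tuple:
    """Return (quotient, remainder) by repeatedly subtracting the divisor."""
    if dividend < 0 or divisor <= 0:
        raise ValueError("expects dividend >= 0 and divisor > 0")
    quotient = 0
    remainder = dividend  # would be held in a register between clock cycles
    while remainder >= divisor:
        remainder -= divisor
        quotient += 1
    return quotient, remainder
```

The remainder left at the end is the modulus, so the same circuit covers both operations. The number of iterations grows with the quotient, which is one reason division is costly.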
24 Shortcuts. Multiplication: if multiplying by a power of 2, shift left by the power. Division: if dividing by a power of 2, shift right by the power. Modulus: if taking a modulus of a power of 2, AND the bits with (modulus - 1).
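The three shortcuts translate directly into shift and mask operations. The sketch below assumes unsigned (non-negative) values, with `k` as the power, so the multiplier, divisor, or modulus is 2^k. The function names are illustrative.

```python
def multiply_by_power_of_two(x: int, k: int) -> int:
    """x * 2^k as a left shift by k."""
    return x << k


def divide_by_power_of_two(x: int, k: int) -> int:
    """x // 2^k as a right shift by k."""
    return x >> k


def modulo_power_of_two(x: int, k: int) -> int:
    """x mod 2^k as an AND with (2^k - 1)."""
    return x & ((1 << k) - 1)
```

In hardware, the shifts cost no gates at all because they are just a different wiring of the same bits, and the mask simply drops the upper wires.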
25 Full Custom. Benefits: best possible performance; can be specially designed for low power consumption (embedded systems) or for high speed (PC expansion card); no restrictions on logic; no restrictions on routing. Drawbacks: expensive to design; expensive to test; fabrication takes months.
26 FPGA. Image contributed from Dr. Shaaban, CE Dept.
27 What is an FPGA? A Field Programmable Gate Array (FPGA) is an array of gates that can be programmed. It is a good compromise between General Purpose Processors and Full Custom.
28 Layout of an FPGA. Input and Output (I/O) Blocks interface with the outside world (LED displays, switches, buttons, etc.). Logic Blocks usually take 3-4 input signals and generate the desired output signal; data can be registered. Interconnects can be programmed to connect logic blocks and I/O blocks together (Logic -> Logic, I/O -> Logic, Logic -> I/O, I/O -> I/O). There is usually a special Clock network to avoid Clock skew problems. Image contributed from Dr. Łukowiak, CE Dept.
29 How are FPGAs actually used? They use a “programming language”: VHDL (VHSIC Hardware Description Language, where VHSIC stands for Very High Speed Integrated Circuit) or Verilog (a C-like language). Programs are NOT Top-Down like C, BASIC, etc. The programs describe the hardware. They are very parallel, with some sequential parts running in parallel.
30 Step 1: Simulation. The programs run through a simulator, which applies the correct input and generates the output. Once the simulator produces the desired output, THE TASK IS NOT OVER YET!
31 Step 2: Synthesis. The Compiler will try to Synthesize the code into the appropriate logic blocks. (The previous Multiplier Schematic was Synthesized from VHDL.) Not all VHDL statements are Synthesizable (while loops, wait statements, etc.). Many times the program has to be adjusted to use only synthesizable commands... and then it is back to Simulation.
32 Step 3: Place & Route. The compiler now figures out where to place each logic block and how the logic blocks are interconnected. Sometimes more hardware is needed than is actually on the specific FPGA device. The options are to buy a bigger FPGA, or to redesign the program to reuse more hardware or to route data differently... and then it is back to Simulation.
33 Step 4: Download to FPGA. Download the program onto the FPGA. Run the program and make sure the correct results are obtained. If the logic is too complex, the gate delay exceeds the clock period, and the clock frequency may have to be scaled down. If everything works, then done.
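The relationship between gate delay and clock frequency can be sketched with a simple timing check: the total delay along the longest logic path has to fit inside one clock period. The per-gate delay used in the usage note is a hypothetical figure chosen only for illustration, and real tools also account for routing and register delays, which this sketch ignores.

```python
def max_clock_hz(gate_delays_on_path: int, seconds_per_gate: float) -> float:
    """Highest clock frequency at which the path still settles in one period."""
    return 1.0 / (gate_delays_on_path * seconds_per_gate)


def meets_timing(gate_delays_on_path: int, seconds_per_gate: float, clock_hz: float) -> bool:
    """True if the path delay does not exceed the clock period."""
    path_delay = gate_delays_on_path * seconds_per_gate
    clock_period = 1.0 / clock_hz
    return path_delay <= clock_period
```

Taking the 53 gate delays of the 8-bit multiplier and assuming 1 ns per gate, `meets_timing(53, 1e-9, 10e6)` holds at 10 MHz, while `meets_timing(53, 1e-9, 50e6)` fails at 50 MHz, so the clock would have to be scaled down.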
34 FPGA. Benefits: better performance than General Purpose Processors, even though the clock frequency may only be in the MHz range; easier to design than Full Custom; easier to test than Full Custom; good for prototyping Full Custom. Drawbacks: not a Production-Grade piece of hardware; no application uses 100% of everything available on the FPGA; some FPGAs reset on power loss and need to be reprogrammed.
35 Skein Hashing Algorithm. There are different versions depending on the internal state size and the output size: Skein-512-1024, for example, has a 512-bit internal state and 1024 output bits. Skein-512 is the default proposal, and Skein-512 will be examined in this presentation. Only 256-, 512-, and 1024-bit internal states are supported; any output size may be used. Skein-256 and Skein-512 have 72 rounds; Skein-1024 has 80 rounds. Skein is based on the Threefish Block Cipher (introduced alongside Skein). The Threefish Block Cipher has 3 components: MIX, Permute, and Add Subkey. Skein wraps a 512-bit XOR around Threefish to create a UBI block, and UBI blocks are chained together.
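36 Threefish Block Cipher. Encryption starts with a Subkey addition, performed as 8 64-bit additions. Then there are 4 rounds of MIX and Permute, followed by the next Subkey addition, and this pattern repeats. There are a total of 72 rounds. The Cipher ends with Subkey 18 (counting the initial Subkey as Subkey 0), which makes 19 Subkey additions in total.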
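The ordering of operations described on this slide can be listed with a short sketch. It models only the sequence of steps (which Subkey is added when), not the MIX or Permute operations themselves, and the step labels are illustrative names.

```python
def threefish_schedule(num_rounds: int = 72, rounds_per_subkey: int = 4) -> list:
    """List the order of Subkey additions and MIX/Permute rounds."""
    steps = []
    for round_index in range(num_rounds):
        if round_index % rounds_per_subkey == 0:
            # A Subkey is added before every group of 4 rounds.
            steps.append(("add_subkey", round_index // rounds_per_subkey))
        steps.append(("mix_and_permute", round_index))
    # One final Subkey addition after the last round.
    steps.append(("add_subkey", num_rounds // rounds_per_subkey))
    return steps


def count_subkeys(steps: list) -> int:
    """Count the Subkey additions in a schedule."""
    return sum(1 for name, _ in steps if name == "add_subkey")
```

With the default arguments, the schedule starts with Subkey 0, contains 72 MIX/Permute rounds, and ends with Subkey 18.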
42 Subkey Hardware Analysis. 8 XORs chained together: 8 Gate Delays. Subkey Index mod 9 (and 3): Full Custom (ASIC) can be hard-coded, but creative methods must be used in an FPGA. Two 64-bit Additions chained together. Additions are taken mod 2^64: our only good news!
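The operations counted on this slide can be sketched as follows. The layout assumes the Threefish-512 key schedule as described in the Skein specification (8 key words plus one extra parity word, 2 tweak words plus one extra), which the slides do not spell out, so it should be checked against the specification. The parity constant is left as a parameter because its value is not given in the slides, and all function names are illustrative.

```python
MASK64 = (1 << 64) - 1  # reducing mod 2^64 just drops the carry out


def extend_key(key_words: list, parity_const: int) -> list:
    """Append a 9th word: the constant XORed with all 8 key words (8 chained XORs)."""
    extra = parity_const
    for word in key_words:
        extra ^= word
    return list(key_words) + [extra & MASK64]


def extend_tweak(tweak_words: list) -> list:
    """Append a 3rd tweak word: the XOR of the two tweak words."""
    t0, t1 = tweak_words
    return [t0, t1, t0 ^ t1]


def subkey(ext_key: list, ext_tweak: list, s: int) -> list:
    """Build the 8 words of Subkey s (key index mod 9, tweak index mod 3)."""
    words = [ext_key[(s + i) % 9] for i in range(8)]
    words[5] = (words[5] + ext_tweak[s % 3]) & MASK64
    words[6] = (words[6] + ext_tweak[(s + 1) % 3]) & MASK64
    words[7] = (words[7] + s) & MASK64
    return words


def add_subkey(state_words: list, subkey_words: list) -> list:
    """Add the Subkey to the state, word by word, mod 2^64."""
    return [(v + k) & MASK64 for v, k in zip(state_words, subkey_words)]
```

Words 5 to 7 show the two chained additions: one to fold in the tweak or the Subkey number, and a second to add the result to the state. The mod 2^64 reduction costs nothing in a 64-bit adder, since the final carry is simply discarded.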
43 Subkey Hardware Analysis Continued. Eight 64-bit Additions happen “logically” in parallel. Each of those eight is really two 64-bit Additions chained together, as mentioned previously. To actually do this in parallel is a large hardware commitment. To save on hardware, the additions should happen serially using the same Logic Blocks (FPGA). This may require external memory I/O between additions to swap out the addends, which is VERY SLOW.
45 UBI Block Hardware Analysis. One 512-bit XOR: OK for Full Custom (ASIC), but a major pain; trouble for an FPGA, where it is a wire-routing nightmare. Chaining is no big deal: it needs a ~640-bit register (512-bit “key”, 128-bit “tweak”).
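The chaining structure can be sketched as a loop in which the previous output becomes the next block's “key”, and the cipher output is XORed with the message block. The cipher below is a hypothetical stand-in, NOT Threefish, and the tweak is a plain block counter used as a placeholder; the real UBI tweak encoding is defined by the Skein specification and is not reproduced here.

```python
MASK512 = (1 << 512) - 1  # 512-bit values are modeled as Python integers


def toy_block_cipher(key: int, tweak: int, block: int) -> int:
    """Hypothetical stand-in for Threefish-512; offers no security at all."""
    return (block + key + tweak) & MASK512


def ubi_chain(initial_state: int, blocks: list, cipher=toy_block_cipher) -> int:
    """Chain UBI blocks: next state = cipher(state, tweak, block) XOR block."""
    state = initial_state  # the 512-bit "key" register
    for index, block in enumerate(blocks):
        tweak = index  # placeholder for the 128-bit "tweak" register
        state = cipher(state, tweak, block) ^ block  # the 512-bit XOR
    return state
```

The loop makes the register requirement visible: only the current state and tweak have to be held between blocks, which is why chaining itself is cheap.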
46 FPGA Stats on a Spartan 2 for Skein
Number of Slices:            out of 2352        148% (*)
Number of Slice Flip Flops:  4604 out of 4704    97%
Number of 4 input LUTs:      6262 out of 4704   133% (*)
Number of IOs:               62
Number of bonded IOBs:       44 out of 140       31%
IOB Flip Flops:              4
Number of GCLKs:             2 out of 4          50%
47 Changes Necessary to Fit. Complete redesign of the underlying components, specifically Subkey. Minimize routing. More utilization of an external memory module. Buy a bigger FPGA (Spartan 3?).