Random Stuff Centaur Technology Inc. G Glenn Henry Quick Background Our Security Functions Centaur Build Methodology Physical Design Example.

1 Random Stuff
Centaur Technology Inc.
G Glenn Henry
- Quick Background
- Our Security Functions
- Centaur Build Methodology
- Physical Design Example

2 Quick Background
We’re Centaur Technology Inc. (Austin, TX)
- We design x86 processors
- Have been alive for 11 yrs, have shipped processors for 8.5
- We operate independently, but are owned by VIA
- We are a tiny group, but shipping millions of processors/yr
Our processors are software & bus compatible with Intel x86
- But are unique vs. Intel & AMD (re design & target market):
    + lower cost (price)
    + lower power consumption
    + smaller chip footprint
    + unique integrated security features
    - generally, lower performance
This fits some rapidly growing “new” markets for x86
Parent company is VIA Technologies (Taiwan)
- They manufacture, market & sell our processor designs
- They develop all other PC platform chips (including chip sets for Intel & AMD processors), etc.

3 C5J (aka VIA Esther, VIA C7-M)
- 90nm IBM SOI Technology
- 400 MHz–2.0 GHz
- 31.2 mm², 26.2 M transistors
- First Shipped 8/2005
- Lowest Power/MHz: 3.5W @ 1 GHz TDP, 20W @ 2 GHz TDP
- 128KB, 32-way exclusive L2
- 64KB 4-way D-cache, 64KB 4-way I-cache
- P4 instructions (incl SSE2 & SSE3)
- +2-way SMP support
- Exclusive security features
- P-M power mngt features+
- P-M bus and new VIA “V4” bus (400-800 MHz)
- unique nanoBGA package

4 [Figure: die-size comparison]
- 90nm Intel Pentium M (Dothan): 84 mm²
- 90nm VIA C7-M: 31 mm²
- our die cost

5 C5J Die
[Figure: annotated die photo, 6.9 mm]
Labels: 128 KB 32-way L2; 64 KB 4-way L1-D; 64 KB 4-way L1-I; SSE 1, 2 & 3, MMX; x87 FP; ROM; Br pred; I-unit; Fetch, Decode & Translate; DCU; Bus & APIC; Security; PLLs etc; fuses

6 Our Security Strategy
Provide comprehensive set of data security functions
- …That are very secure
- …That are world’s fastest (for a single chip)
These goals require that the functions
- …Be integrated tightly into the processor core
    Processor silicon & implementation is fastest hdw
    Only hdw can be “trusted” (no viruses, etc.)
- …Require no operating system support/involvement
    available via non-privileged x86 instructions
    hardware must manage multi-tasking considerations
Available in all of our processors, for free
- We believe data security should be built into all processors
- It’s easy to do & small (effectively free)
- It’s our hobby

7 Our Security Implementation
C5XL (shipped 1/2003)
- Hardware RNG unit
C5P (shipped 1/2004)
- Hardware RNG: 2 units
- Encryption: Full AES (FIPS-197) standard in hdw; ECB, CBC, CFB, OFB modes in hdw; fastest in world!
C5J (shipped 8/2005)
- Hardware RNG: (can also feed entropy to hardware SHA to get faster high quality)
- Encryption: +CBC/CFB-MAC modes, +CTR mode, +unaligned support, +faster
- Secure Hash: Full SHA-1 & -256 (FIPS-180-1) standard in hdw; fastest in world!
- RSA Hdw Assist (Montgomery multiply)
CN (future)
- xxx
- (faster/better using built-in hdw hash functions)

8 Centaur Hardware RNG
[Figure: RNG block diagram]
- 2 duplicate RNGs in different physical areas (& rotated)
- asynch clocked, adj DC bias
- A, B, or both
- whitener
- 1-of-n bit selector
- 32 byte hardware collection buffer, 1-byte per delivery
- x86 “store-rand” instruction: up to 8-byte delivery per store request, status in EAX
- SSE store bus
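
The slide names a "whitener" stage but does not say which algorithm it uses. As an illustration only, the sketch below assumes the classic von Neumann corrector, a common whitening choice for hardware bit sources; the function name `von_neumann_whiten` is hypothetical and not taken from the deck.

```python
from typing import Iterable, List


def von_neumann_whiten(bits: Iterable[int]) -> List[int]:
    """Von Neumann corrector (assumed here, not confirmed by the slide).

    Reads the raw stream two bits at a time. A pair of equal bits is
    discarded; a pair of unequal bits emits the first bit of the pair.
    For independent input bits with a fixed bias, the output is unbiased,
    at the cost of throwing away most of the input.
    """
    out: List[int] = []
    it = iter(bits)
    # zip(it, it) yields consecutive non-overlapping pairs;
    # a trailing unpaired bit is ignored.
    for first, second in zip(it, it):
        if first != second:
            out.append(first)
    return out


if __name__ == "__main__":
    raw = [0, 1, 1, 0, 1, 1, 0, 0]
    print(von_neumann_whiten(raw))
```

Because at least half of the raw bits are discarded, any whitener of this kind delivers fewer bits per second than the raw source, which is the general trade-off between the "white" and "raw" settings.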

9 RNG “Typical” Performance
“Randomness” too hard to describe here, but here’s some basics…
Key requirements for “truly random” (per Schneier)
- Unbiased statistical distribution: determined by statistics
- Unpredictability: determined by modeling
- Unreproducibility: only hardware need apply
Many statistical tests defined & used (& argued about)
- Collections of many different statistical analyses
    FIPS-140-2: useless (4 tests, broken, 20,000 bit sample!)
    Diehard (18 tests): oriented to software RNGs, 10Mb sample
    NIST (16 tests): we think the best (much overlap with Diehard)
    Ent, etc.: everyone has one, everyone has their favorite
- Individual tests
    entropy: important & widely reported, but it’s not randomness
    chi²: heavily used, especially for huge samples, our favorite
    Maurer, etc.: everyone has their favorite
- Many different evaluation approaches
    threshold value, fixed ranges, probability analysis (p-value)
Much analysis & interpretation needed to make sense here
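
Two of the individual tests named above, entropy and chi², are simple enough to sketch. The code below is a minimal illustration using the textbook definitions (Shannon entropy in bits per byte, and Pearson's chi² statistic against a uniform byte distribution); it is not the testbed software the slides refer to, and the function names are hypothetical.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (max 8.0)."""
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def chi_square_bytes(data: bytes) -> float:
    """Pearson chi-square statistic versus a uniform distribution over 256 values.

    Has 255 degrees of freedom; turning it into a p-value needs a
    chi-square distribution function, which is omitted in this sketch.
    """
    expected = len(data) / 256
    counts = Counter(data)
    return sum((counts.get(v, 0) - expected) ** 2 / expected for v in range(256))


if __name__ == "__main__":
    flat = bytes(range(256)) * 4      # perfectly flat histogram
    stuck = b"\x00" * 256             # a stuck-at-zero source
    print(byte_entropy(flat), chi_square_bytes(flat))
    print(byte_entropy(stuck), chi_square_bytes(stuck))
```

The flat sample is a counting pattern, not random data, yet it scores the maximum 8.0 bits per byte. That is one concrete reading of the slide's remark that entropy is widely reported "but it’s not randomness".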

10 RNG “Typical” Performance
Performance & randomness varies by part; these are “typical”
We have done extensive analysis
- Many terabytes of data
- Massive sample sizes (terabyte)
- Hundreds of chips
- Our own testbed software
- Analysis & report by external group: www.cryptography.com/research/evaluations.html
- Here’s an embarrassingly simple summary…
                                                       Randomness
Setting               Speed (Mb/s)   Entropy (byte)    1 MB sample random? [1]   Max sample size for random [2]
white 8               1.7            7.9999+           Y                         50 MB-10 GB
white 4               3.4            7.999             Y–N                       Y–N 0-10 MB
raw                   28–240         7.5-7.95          N                         –
hashed raw (AES) [3]  150–1,000      7.9999+           Y                         1 TB up
[1] Passes standard test collections: FIPS, NIST, Diehard
[2] “Good” chi² results
[3] Many variations: SHA, random seed size, etc.

11 Centaur AES Encryption Features
Full FIPS-197 implemented in hardware
- Encrypt & decrypt
- 128b, 192b, & 256b keys
- 128b data blocks
Multiple operating modes in hardware
- ECB, CBC, CFB, OFB
- CBC/CFB-MAC & CTR modes
Optional extended key generation in hardware
- For 128b key (both E & D) only
Various “experimentation” options supported
- Round count 1-16, intermediate round results, etc.
Accessed via new application-level x86 instructions
- No OS support needed
- Hardware provides inherent multitasking
US export licenses in place

12 Centaur AES Hardware
[Figure: AES datapath block diagram]
Labels: SSE load bus; SSE store bus; 16-byte blocks; input 0, input 1; out 0, out 1; key; ctrl; round key; Round key generation; Extended Key RAM (16 x 16B); S-box; row-shift; column mix; key add; round; fwd; blk-blk fwd
- block startup + CBC, CFB, OFB, etc.
- block finish + CBC, CFB, OFB, etc.
- shared logic can pipeline 2 blks in ECB
- 0.3 mm² total!
- Everything runs at processor clock speed

13 Centaur AES Performance
AES instruction performance (approx.)
- 128-bit key & block size: usual instruction timing assumptions = data in cache, no interrupts, aligned, key done, etc.
- Approximate clocks w/ 128b extended keys already loaded
    ECB, 1 block:                 17 clocks
    ECB, large block count:       11.8/blk
    CBC/CFB/etc, 1 block:         37
    CBC/etc, large block count:   22.5/blk
- Additional extended key generation/load time (128b key)
    Hardware generated:           38
    Loaded from memory:           53

14 AES Performance
Measured Performance
- P4 = Gladman library AES, C5J = replaced routine with AES inst
- ECB mode (other modes slower, but same advantage over P4)
- Same memory size (512MB), same bus speeds (533 MHz)
- Another example: Gladman reports (his site) using his library (ECB)
data size   2.53-GHz P4   2.0-GHz C5J
8 KB        0.56 Gb/s     21.5 Gb/s
64 KB       0.56          19.5
1 MB        0.56          5.45
10 MB       0.56          5.23
Earlier part:
data size   1.2-GHz C5P
16 Kb       15.2 Gb/s
Chart callout: bus limited
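
The clock counts on slide 13 and the measured rates on slide 14 can be cross-checked with one line of arithmetic: peak rate = clock frequency / clocks per block × 128 bits. The sketch below does this for the 2.0-GHz C5J; the function name is hypothetical, and the result is an upper bound that ignores memory and bus effects.

```python
def peak_throughput_gbps(clock_hz: float, clocks_per_block: float,
                         block_bits: int = 128) -> float:
    """Upper-bound AES throughput in Gb/s for a given clocks-per-block cost."""
    return clock_hz / clocks_per_block * block_bits / 1e9


if __name__ == "__main__":
    clock_hz = 2.0e9  # 2.0-GHz C5J, from the measured-performance table
    ecb = peak_throughput_gbps(clock_hz, 11.8)   # ECB, large block count
    cbc = peak_throughput_gbps(clock_hz, 22.5)   # CBC/etc, large block count
    print(f"ECB: {ecb:.1f} Gb/s, CBC: {cbc:.1f} Gb/s")
    # prints "ECB: 21.7 Gb/s, CBC: 11.4 Gb/s"
```

The ECB bound of about 21.7 Gb/s sits just above the 21.5 Gb/s measured at 8 KB, so the in-cache measurement is close to the instruction-timing limit. The drop to roughly 5 Gb/s at 1 MB and 10 MB is therefore not explained by the AES unit itself.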

15 C5J Montgomery Multiplier Features
Goal: Speed up RSA’s modular exponentiation
- c = m^e mod n is dominated by repeated d = m × y mod(n) ops
- where m, y, n are thousand bits long!
This multiply is “always” done using “Montgomery Multiply” algorithm
- Uses special number space to make d’ = a’ × b’ mod(m) much faster by eliminating divide
- But initial & result values must be transformed to/from Montgomery number space
- In real usage, the transformation overhead is relatively small
Our hardware directly performs “Montgomery Multiply”
- About as fast as an ordinary multiply!
- For up to 32Kb numbers!
New application-level x86 MontMul instruction
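
To make the "special number space" concrete, here is a minimal software sketch of Montgomery multiplication and its use in modular exponentiation. It follows the standard textbook reduction (REDC), not the MontMul instruction's actual interface, which the slides do not describe. All names are hypothetical, and `pow(n, -1, r)` needs Python 3.8 or later.

```python
def mont_mul(a: int, b: int, n: int, r_bits: int, n_neg_inv: int) -> int:
    """Return a * b * R^-1 mod n, where R = 2**r_bits and a, b < n.

    The only "divide" is a shift by r_bits, which is what removes the
    expensive division by n from every multiply.
    """
    mask = (1 << r_bits) - 1
    t = a * b
    u = ((t & mask) * n_neg_inv) & mask   # makes t + u*n divisible by R
    d = (t + u * n) >> r_bits
    return d - n if d >= n else d


def mont_pow(m: int, e: int, n: int) -> int:
    """Compute m**e mod n (n odd) with square-and-multiply in Montgomery space."""
    if n < 3 or n % 2 == 0:
        raise ValueError("modulus must be odd and at least 3")
    r_bits = n.bit_length()
    r = 1 << r_bits
    n_neg_inv = (-pow(n, -1, r)) % r      # -n^-1 mod R, computed once
    m_bar = (m % n) * r % n               # transform m into Montgomery space
    acc = r % n                           # Montgomery form of 1
    for bit in bin(e)[2:]:
        acc = mont_mul(acc, acc, n, r_bits, n_neg_inv)
        if bit == "1":
            acc = mont_mul(acc, m_bar, n, r_bits, n_neg_inv)
    return mont_mul(acc, 1, n, r_bits, n_neg_inv)   # transform back


if __name__ == "__main__":
    print(mont_pow(4, 13, 497), pow(4, 13, 497))
```

The two transformations happen once per exponentiation, while `mont_mul` runs once or twice per exponent bit. This is why the slide can call the transformation overhead relatively small.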

16 Centaur Montgomery Multiplier
[Figure: multiplier datapath block diagram]
Labels: SSE load bus; SSE store bus; 16-byte blocks; temp regs; A[j]; B[i]; M[j]; T[j]; U; two 32 x 32 multipliers; adders (32, 64 and 33-bit operands); T[j-1]Hi (33b); Bits 64:32; Bits 31:0
- 32b x 32b mod(32b) = 4 clks (2 clk pipelined)
- Ucode sequences loads & stores
- Usable with any size data (256 to 32Kb, 128b steps)
- hack of existing multipliers
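
The diagram's labels (A[j], B[i], M[j], T[j], U) match the variables of a word-serial Montgomery multiply built from 32 x 32 products. The sketch below is the textbook CIOS (coarsely integrated operand scanning) formulation using those names. It is an assumption that the hardware and microcode follow this exact sequencing; the slide does not say, and the helper names are hypothetical.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def to_words(x: int, s: int) -> list:
    """Split x into s little-endian 32-bit words."""
    return [(x >> (WORD_BITS * k)) & WORD_MASK for k in range(s)]


def from_words(words: list) -> int:
    """Rebuild an integer from little-endian 32-bit words."""
    return sum(w << (WORD_BITS * k) for k, w in enumerate(words))


def mont_mul_words(a: int, b: int, m: int, s: int) -> int:
    """Return a * b * R^-1 mod m with R = 2**(32*s); m odd, a and b below m."""
    A, B, M = to_words(a, s), to_words(b, s), to_words(m, s)
    m0_neg_inv = (-pow(M[0], -1, 1 << WORD_BITS)) & WORD_MASK
    T = [0] * (s + 2)
    for i in range(s):
        # Multiply step: T += A * B[i], one 32 x 32 product per word.
        carry = 0
        for j in range(s):
            acc = T[j] + A[j] * B[i] + carry
            T[j] = acc & WORD_MASK
            carry = acc >> WORD_BITS
        acc = T[s] + carry
        T[s] = acc & WORD_MASK
        T[s + 1] = acc >> WORD_BITS
        # Reduce step: pick U so the low word cancels, then shift one word.
        U = (T[0] * m0_neg_inv) & WORD_MASK
        carry = (T[0] + U * M[0]) >> WORD_BITS
        for j in range(1, s):
            acc = T[j] + U * M[j] + carry
            T[j - 1] = acc & WORD_MASK
            carry = acc >> WORD_BITS
        acc = T[s] + carry
        T[s - 1] = acc & WORD_MASK
        T[s] = T[s + 1] + (acc >> WORD_BITS)
    result = from_words(T[: s + 1])
    return result - m if result >= m else result


if __name__ == "__main__":
    m = (1 << 127) - 1
    a, b = 0x0123456789ABCDEF0123456789ABCDEF, 0x0FEDCBA9876543210FEDCBA987654321
    print(hex(mont_mul_words(a, b, m, 4)))
```

Each inner-loop iteration is a 32 x 32 multiply plus adds, so an s-word operand costs about 2·s² such steps. That is consistent in spirit with the slide's per-word clock figure, though the mapping to actual cycles is not something this sketch can claim.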

17 Centaur MontMul Performance
Compared to GMP library
- Perform c = m^e mod n (m, e, n chosen randomly)
- An example (speeds vary slightly based on values)
- Note: this is most of RSA time, but not the whole thing
- Same hardware as for AES chart
mod size (bits)   2.53-GHz P4   2.0-GHz C5J
512               340 exp/s     1800 exp/s
1024              50            243
1536              15.6          78
2048              7.1           35

18 Centaur SHA Features
FIPS-180-1 completely implemented in hardware
- SHA-1 (160-bit result)
- SHA-256 (256-bit result)
Instruction timing
- SHA-1: 251 clks
- SHA-256: 262
    where n is the number of 64B blocks to be compressed
Measured performance (Gb/s)
- Same hardware as for AES chart, GPL SHA SW (Devine)
              2.53-GHz P4          2.0-GHz C5J
data size     SHA-1    SHA-256     SHA-1    SHA-256
10 B          0.07     0.04        0.38     0.35
100 B         0.43     0.24        2.41     2.24
1,000 B       0.59     0.33        3.81     3.60
1,000,000 B   0.62     0.34        2.97     bus limited

19 C5J SHA Hardware
[Figure: SHA datapath block diagram]
Labels: SSE load bus; SSE store bus; next 64b data; data scheduler (16 x 32b regs); Function generators; 5-way add; 64 + regs; Initial digest; accumulating digest (160b); Final sha-256 add
- SHA-1: 2 clks/32b rnd (5)
- SHA-256: 3 clks/round

20 Build Process

21 The Centaur Process

22 Centaur Build Methodology
Our challenges!
- Complex logic with lots of architectural interconnections
- 2-GHz & aggressive power/size objectives
- Relatively few designers (30 logic & circuit)
- Strong schedule pressure (must do it fast)
- Industry tools not sufficient (oriented to APR methodology)
Our Basic Approach
- Hundreds of top-level stand-alone “blocks”
    Allows parallel development of “one-person” blocks
    Facilitates fast “build” time (chip assembly, timing, etc.)
    Facilitates use of optimum process for particular logic
- Hook blocks together with top-level routing, clocks, etc.
    Significant “content” added in top-level build
- Full-chip timing with fast iterations
- Fast full-chip build iterations
- Develop our own tools & methodology to accomplish above

23

24 Centaur Chip Physical Build Process

25 C5J Die

26 Underlying Source Statistics
Verilog lines as written (small)
(no behaviorals, no comments, no clocks, no “top” chip)
- APR logic:    112K lines   129K cells
- Stack logic:   41K lines   172K cells
- Note: this is “single instance” as written; much of this gets instantiated multiple times
Schematic “pages” as written (large)
- Primitive (inv, nand2, nor2, etc.)    110
- Standard cells                        712
- Datapath elements                    1308
- Full customs                         1332
                                       ----
                                       3462
Circuit library size       avail   used
- Clock regens              445     277
- Std cell                  547     435
- G datapath elements       493     271
- W datapath elements       248     147
                           ----    ----
                           1733    1130

27 C5J Security Components (metal 1-4 only)
[Figure: layout plot]
Labels: stk; APR (control for all stacks); custom; stk clock repeaters; 7 RC bfrs; global clk meanders; 32b data “bfr” section; decoupling caps
Note: global interconnects not shown

28 C5J Security Components (metal 1-4 only)
[Figure: layout plot]
Labels: 128b-wide AES engine; key RAM; common control logic; RNG buffers; SHA sch & ALU

29 “Fast Build & Timing”
Every 1-5 days: Full-chip “Release”
- APRs synthesized, placed & RCs estimated
- Stacks “cracked”, placed & RCs estimated
- Full-chip timing done with estimated RCs
- Takes < 1 day for full-chip timing report
Every 5-10 days: Full-Chip Physical Build
- APRs routed
- Stacks routed
- Global chip routed
- Global chip layout produced
- APRs, stack & global route RC extraction
    RCs feed back to calibrate estimated RCs
- This goes on continuously, picking up new Releases as needed
Our experience at other companies: much slower

30 Basic “Release” Process

31 RTL Design Rules
APR Blocks
Element instantiation OK
- Registers (req’d: synthesis can’t infer them correctly)
- Clock buffers & distribution (req’d: synthesis clocks are slow!!)
- Occasional logic (this has diminished over time)
- The instantiated elements are really macros
    Auto expanded to right size, number bits, etc. in the flow
Wires & continuous assignment OK
- Including operators like ?, +, < etc.
Nothing else! (no procedural stuff)
- No if/else, no case, no loops, no “always”, no “at”, etc.
- No timing information/control
- Synthesis generates bad logic for these
    Unexpected/superfluous elements, registers where not expected, timing doesn’t work, etc.
Stacks
Component instantiation & wires only!

32 APR RTL Example As Written
    assign idleNS = (T[0] | T[8]) | shaDone_P;
    assign funcNS = (T[1] | T[3] | T[6] | T[10]) & ~shaDone_P;
    assign add1NS = (T[2]) & ~shaDone_P;
    assign add2NS = (T[5]) & ~shaDone_P;
    assign faddNS = (T[4] | T[7] | T[9]) & ~shaDone_P;
    rregs #(5) state (.q   ({idleState, funcState, add1State, add2State, faddState}),
                      .d   ({idleNS, funcNS, add1NS, add2NS, faddNS}),
                      .clk (ph1c));
    ------------------
    sha2cnst sha2cnst (.in   (iteration[5:0]),
                       .ksel (shKSel),
                       .algo (sha1_P),
                       .out  (KsubI));
    ------------------
    wire [6:0] nextIteration;
    assign nextIteration = (shaDone_P | idleState) ? 7'b0000000 :
                           shIterationStall        ? iteration :
                                                     iteration + 1;

33 Stack RTL Example
Datapath Section
    /*------------------- KeyGen XOR --------------------------*/
    wire [31:0] aesKeyGenXorOut2_L;
    zdxor #(32,15) keyg1 (.out (aesKeyGenXorOut2_L),
                          .in0 (aesWord2I_LB),
                          .in1 (aesKeyGenXorOut1_LB));
    zinv #(32,60) kgen2 (aesKeyGenXorOut2_LB, aesKeyGenXorOut2_L);
    wire [31:0] aesKeyGenXorOut2_MB;
    wire [31:0] aesKeyGenXorOut2_M;
    zregi_en #(32,10) keyg2 (.q   (aesKeyGenXorOut2_MB),
                             .d   (aesKeyGenXorOut2_L),
                             .clk (EPH1),
                             .en  (aesDynEn_K));
    zinv #(32,10) keyg2i (aesKeyGenXorOut2_M, aesKeyGenXorOut2_MB);
Buffer Section
    rregsi #(2,20) bf_kk (.qb  (aesKeyMuxSel_M),
                          .d   (aesKeyMuxSel_LB),
                          .clk (evph1));

34 Stack Placement Tool Output (32-bit AES stack)

35 Buffer section added; inter-element routing (m2-6)

36 Global wires added

37 Sample Timing Report “Path”
time      delta     load cap   wire      rise/fall  path (^ rising, v falling) / element
0.875ns   0.050ns   0.2423pF   0.000ns   0.000ns    eeph1aesdp2 ^ / aesdp2/eph1buf_aesdp2/
0.925ns   0.160ns   0.0321pF   0.000ns   0.000ns    aesdp2/eph1 ^ / aesdp2/sc_c0ph1_48/
1.085ns   0.063ns   0.0035pF   0.000ns   0.004ns    aesdp2/keyg2_ph1 ^ / aesdp2/gxregi_x4_10
1.148ns   0.000ns   0.0035pF   0.000ns   0.004ns    aesdp2/aesdp2_dp_aeskeygenxorout2_mb10 v
1.148ns   0.026ns   0.0209pF   0.000ns   0.044ns    aesdp2/aesdp2_dp_keyg2i_stack_bit10_i0 v / aesdp2/ginv_10
1.173ns   0.000ns   0.0209pF   0.000ns   0.045ns    aesdp2/aesdp2_dp_aeskeygenxorout2_m10 ^
1.174ns   0.045ns   0.0336pF   0.000ns   0.031ns    aesdp2/aesdp2_dp_invk_stack_bit10_i0 ^ / aesdp2/gemux3i_19
1.219ns   0.000ns   0.0336pF   0.000ns   0.031ns    aesdp2/aesdp2_dp_key_mb10 v
1.219ns   0.017ns   0.0188pF   0.000ns   0.013ns    aesdp2/aesdp2_dp_kml_stack_bit10_i0 v / aesdp2/ginv_31
1.236ns   0.001ns   0.0188pF   0.001ns   0.014ns    aesdp2/aesdp2_dp_key_m10 ^
1.236ns   0.095ns   0.0170pF   0.000ns   0.029ns    aesdp2/aesdp2_dp_mixcoldec_xorout_stack_bit10_in0 ^ / aesdp2/gxor8_10
1.331ns   0.000ns   0.0170pF   0.000ns   0.030ns    aesdp2/aesdp2_dp_decout_m10 v
1.332ns   0.030ns   0.0089pF   0.000ns   0.017ns    aesdp2/aesdp2_dp_mcmux_stack_bit10_i2 v / aesdp2/gmux3i_10
1.362ns   0.000ns   0.0089pF   0.000ns   0.017ns    aesdp2/aesdp2_dp_mcout_mb10 ^
1.362ns   0.030ns   0.1101pF   0.000ns   0.053ns    aesdp2/aesdp2_dp_invm_stack_bit10_i0 ^ / aesdp2/ginv_31
1.391ns   0.012ns   0.1101pF   0.012ns   0.078ns    aesdp2/aesdp2_dp_mcout_m10 v
1.403ns   0.048ns   0.0249pF   0.000ns   0.030ns    aesdp2/aesdp2_dp_pipemux0_stack_bit10_i1 v / aesdp2/gmux2i_16
1.451ns   0.001ns   0.0249pF   0.001ns   0.032ns    aesdp2/aesdp2_dp_aesword2i_kb10 ^
1.452ns                                             aesdp2/aesdp2_dp_byte1_indx_pb2 ^
Local reg clock-to-next reg input = 1.452 - 1.085 = 367 ps
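
The 367 ps figure can be reproduced two ways from the report: by subtracting arrival times, or by adding up the per-stage deltas from the register clock pin onward. The sketch below does both with the values printed on the slide. Variable and function names are hypothetical, and the explanation of the 1 ps gap is an inference, not something the slide states.

```python
def path_delay_ps(launch_ns: float, arrive_ns: float) -> int:
    """Delay between two arrival times, rounded to whole picoseconds."""
    return round((arrive_ns - launch_ns) * 1000)


# Arrival times from the report: register clock pin and next register input.
LAUNCH_NS = 1.085
ARRIVE_NS = 1.452

# Per-row "delta" values (in ps) for the rows from 1.085 ns through 1.451 ns.
STAGE_DELTAS_PS = [63, 0, 26, 0, 45, 0, 17, 1, 95, 0, 30, 0, 30, 12, 48, 1]

if __name__ == "__main__":
    from_arrivals = path_delay_ps(LAUNCH_NS, ARRIVE_NS)
    from_deltas = sum(STAGE_DELTAS_PS)
    print(from_arrivals, from_deltas)
```

The arrival-time difference gives 367 ps, matching the slide, while the printed deltas add up to 368 ps. The 1 ps gap presumably comes from each printed value being rounded to the nearest picosecond.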

38 Random Circuit Topics
Clocking is very difficult & very critical
- Very aggressive skew goals
    “0” ps clock skew across all top-level blocks
    <20 ps skew worst case within a block
- These are met in our designs ignoring on-chip silicon variations
- Multiple clock domains required (for bus & various power states)
- Many “early”, “late”, etc. versions of the clocks needed
- Clocks must be gated (for power management)
Our clocking methodology is proprietary, but…
- Hand-routed global clock tree (continually changing)
- Our own tools to generate clock shields tuned to surroundings
- Tunable “repeaters” (via fuse & via metal)
- Hand instantiated clock elements within blocks
- Many selectable clocks (xx ps for each reg)
- Auto-generated clock grids within APRs & stacks
- Fuse adjustable PLL characteristics (duty cycle, etc.)
Power/ground distribution critical
- Extensive analysis & “management” required

39 Random Circuit Topics (cont)
Robust circuit design req’d across 12 “corner” models
- 54 formal corners identified, we choose the most critical “12”
- Covers variations in: Temp, V, N xistor, P xistor
- Automated element simulation done across these models
- Full-chip timing is done using 2 of these corners (hi V, lo V)
Extensive use of dynamic logic
- Precharge in phase 1, evaluate in phase 2
- Registers, adders, comparators, arrays, etc.
- Customs, stacks (& APRs)
Two stack-element libraries
- With different bit pitches
Element libraries have several versions of same function
- Usually, at least “Fast/big/hot” & “slow/small/cool”
- Example: C5J has 2 different “vanilla” 32-bit adders
    Fast (dynamic): 180 ps, 37.9 high
    Slow (static): 250 ps, 16.9 high
    Note: 25 total adders in library, instantiated 65 total times

40 Random Circuit Topics (cont)
Several families of registers available
- Differ in function, speed, size & performance
- Std cell, datapath & custom versions
- Each comes in many drive strengths (sizes)
- Many have built-in functions
    muxes, and/or logic, xors, compares, etc.
    These provide speed/size/power improvements vs. separate elements
- Examples using C5J stack elements
[Figure: register element comparisons]
Labels: k-reg 10; static cmp-eq 20; inv 54; k-reg + dynamic cmp-eq 60; x-reg 10; normal reg; fast reg; 1b, 26b, 1b
Timings: 82 ps (data-to-out); 90 + 32 + 17 = 139 ps; 88 ps; 32 ps
Sizes: 5.0, 9.5, 4.6, 3.8, 3.8, 1.4

41

42 C5J Security Component Sizes (mm²)
[Figure: layout plot with block areas; sample scale: 227]
Block areas shown: 0.080, 0.091, 0.014, 0.034, 0.069, 0.046, 0.021
Total = 0.529 mm² + 0.014 for 2 RNG’s (elsewhere) = 0.54 (a few cents, but for this chip it’s really free)

43 C5J Security Component Sizes
[Figure: layout plot with block areas in mm²; scale: 227]
Block areas shown: 0.080, 0.080, 0.080, 0.091, 0.014, 0.034, 0.069, 0.046
Note: We had so much spare room on die that we didn’t spend any effort making this smaller. We estimate at least 30% smaller if we tried hard!
(If we had only known about all this space when we started…)

44 [Figure: annotated layout of a 32-bit AES stack]
- S-box ROM (2 x 256 x 8 bit) x 4 bytes, 200 ps access (dynamic)
- Row-shift muxes (wires to other 32b stacks not visible)
- Column multiply (& key xor) made out of 2-, 3-, 4-, 5-, 6-, 7- & 8-input xors
- Startup, CBC, etc. muxes & registers
- register
- Startup, CBC, etc. muxes & registers
- register
- (extra stuff at bottom for key generation)

