Random Stuff
Centaur Technology Inc.
G. Glenn Henry

- Quick Background
- Our Security Functions
- Centaur Build Methodology
- Physical Design Example


Quick Background

We're Centaur Technology Inc. (Austin, TX)
- We design x86 processors
- Have been alive for 11 yrs, have shipped processors for 8.5
- We operate independently, but are owned by VIA
- We are a tiny group, but shipping millions of processors/yr

Our processors are software & bus compatible with Intel x86
- But are unique vs. Intel & AMD (re design & target market):
    + lower cost (price)
    + lower power consumption
    + smaller chip footprint
    + unique integrated security features
    - generally, lower performance
- This fits some rapidly growing "new" markets for x86

Parent company is VIA Technologies (Taiwan)
- They manufacture, market & sell our processor designs
- They develop all other PC platform chips (including chip sets for Intel & AMD processors), etc.

C5J (aka VIA Esther, VIA C7-M)

- 90nm IBM SOI Technology
- 400 MHz-2.0 GHz
- 31.2 mm^2
- 26.2 M transistors
- First shipped 8/2005
- Lowest power/MHz: 3.5W @ 1 GHz TDP, 20W @ 2 GHz TDP
- 64KB 4-way I-cache
- 64KB 4-way D-cache
- 128KB, 32-way exclusive L2
- P4 instructions (incl SSE2 & SSE3)
- P-M power mngt features+
- P-M bus and new VIA "V4" bus (400-800 MHz)
- +2-way SMP support
- Exclusive security features
- Unique nanoBGA package

[Figure: die-size comparison. 90nm Intel Pentium M (Dothan): 84 mm^2. 90nm VIA C7-M: 31 mm^2. Callout: "our die cost".]

C5J Die

[Figure: die photo, dimension label 6.9 mm. Labeled blocks: 128 KB 32-way L2; 64 KB 4-way L1-D; 64 KB 4-way L1-I; SSE 1, 2 & 3, MMX; x87 FP; ROM; Br pred; I-unit; Fetch, Decode & Translate; DCU; Bus & APIC; Security; PLLs etc; fuses.]

Our Security Strategy

Provide comprehensive set of data security functions
- ...That are very secure
- ...That are world's fastest (for a single chip)

These goals require that the functions
- ...Be integrated tightly into the processor core
    - Processor silicon & implementation is fastest hdw
    - Only hdw can be "trusted" (no viruses, etc.)
- ...Require no operating system support/involvement
    - available via non-privileged x86 instructions
    - hardware must manage multi-tasking considerations

Available in all of our processors, for free
- We believe data security should be built into all processors
- It's easy to do & small (effectively free)
- It's our hobby

Our Security Implementation

C5XL (shipped 1/2003)
- Hardware RNG unit

C5P (shipped 1/2004)
- Hardware RNG: 2 units, fastest in world!
- Encryption: full AES (FIPS-197) standard in hdw; ECB, CBC, CFB, OFB modes in hdw; fastest in world!

C5J (shipped 8/2005)
- Hardware RNG: (can also feed entropy to hardware SHA to get faster high quality)
- Encryption: +CBC/CFB-MAC modes, +CTR mode, +unaligned support, +faster
- RSA Hdw Assist (Montgomery multiply)
- Secure Hash: full SHA-1 & -256 (FIPS-180-1) standard in hdw

CN (future)
- xxx (faster/better using built-in hdw hash functions)

Centaur Hardware RNG

[Block diagram. Labels: 2 duplicate RNGs in different physical areas (& rotated); asynch clocked; adj DC bias; A, B, or both; whitener; 1-of-n bit selector; 1 byte per delivery; 32-byte hardware collection buffer; x86 "store-rand" instruction; up to 8-byte delivery per store request; SSE store bus; status in EAX.]

RNG "Typical" Performance

"Randomness" too hard to describe here, but here's some basics...

Key requirements for "truly random" (per Schneier)
- Unbiased statistical distribution: determined by statistics
- Unpredictability: determined by modeling
- Unreproducibility: only hardware need apply

Many statistical tests defined & used (& argued about)
- Collections of many different statistical analyses
    FIPS-140-2: useless (4 tests, broken, 20,000 bit sample!)
    Diehard (18 tests): oriented to software RNGs, 10Mb sample
    NIST (16 tests): we think the best (much overlap with Diehard)
    Ent, etc.: everyone has one, everyone has their favorite
- Individual tests
    entropy: important & widely reported, but it's not randomness
    chi^2: heavily used, especially for huge samples, our favorite
    Maurer, etc.: everyone has their favorite
- Many different evaluation approaches
    threshold value, fixed ranges, probability analysis (p-value)

Much analysis & interpretation needed to make sense here
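The two individual tests named above are easy to sketch. The following is a minimal illustration, not Centaur's testbed: the function names are made up for this example, and the OS generator stands in for the hardware RNG. It also shows why entropy "is not randomness": a plain byte counter scores a perfect 8 bits/byte and a chi^2 of zero while being completely predictable.

```python
import math
import os
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (8.0 is the maximum)."""
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def chi_square(data: bytes) -> float:
    """Chi-square statistic of byte counts against a uniform distribution."""
    expected = len(data) / 256
    counts = Counter(data)
    return sum((counts.get(v, 0) - expected) ** 2 / expected for v in range(256))


# Stand-in source (assumption): the OS generator, not the hardware RNG.
sample = os.urandom(1 << 20)

# A byte counter: perfectly flat histogram, zero unpredictability.
counter = bytes(range(256)) * 4096

print(f"urandom: entropy={byte_entropy(sample):.4f}  chi2={chi_square(sample):.1f}")
print(f"counter: entropy={byte_entropy(counter):.4f}  chi2={chi_square(counter):.1f}")
```

For a good source the chi^2 value should hover around the degrees of freedom (255 here); values far above or suspiciously far below both call for a closer look, which is where the p-value style of evaluation comes in.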

RNG "Typical" Performance

Performance & randomness vary by part; these are "typical"

We have done extensive analysis
- Many terabytes of data
- Massive sample sizes (terabyte)
- Hundreds of chips
- Our own testbed software
- Analysis & report by external group: www.cryptography.com/research/evaluations.html
- Here's an embarrassingly simple summary...

Setting               Speed (Mb/s)  Entropy (bits/byte)  1 MB sample random? (1)  Max sample size for random (2)
white 8               1.7           7.9999+              Y                        50 MB-10 GB
white 4               3.4           7.999                Y-N                      Y-N 0-10 MB
raw                   28-240        7.5-7.95             N                        -
hashed raw (AES) (3)  150-1,000     7.9999+              Y                        1 TB up

1. Passes standard test collections: FIPS, NIST, Diehard
2. "Good" chi^2 results
3. Many variations: SHA, random seed size, etc.

Centaur AES Encryption Features

Full FIPS-197 implemented in hardware
- Encrypt & decrypt
- 128b, 192b, & 256b keys
- 128b data blocks

Multiple operating modes in hardware
- ECB, CBC, CFB, OFB
- CBC/CFB-MAC & CTR modes

Optional extended key generation in hardware
- For 128b key (both E & D) only

Various "experimentation" options supported
- Round count 1-16, intermediate round results, etc.

Accessed via new application-level x86 instructions
- No OS support needed
- Hardware provides inherent multitasking

US export licenses in place

Centaur AES Hardware

[Block diagram. Labels: SSE load bus; SSE store bus; input 0, input 1; key; ctrl; Extended Key RAM (16 x 16B); round key generation; round key; block startup + CBC, CFB, OFB, etc.; S-box; row-shift; column mix; key add; block finish + CBC, CFB, OFB, etc.; out 0, out 1; round fwd; blk-blk fwd; shared logic.]

- 16-byte blocks
- Can pipeline 2 blks in ECB
- 0.3 mm^2 total!
- Everything runs at processor clock speed

Centaur AES Performance

AES instruction performance (approx.)
- 128-bit key & block size: usual instruction timing assumptions = data in cache, no interrupts, aligned, key done, etc.
- Approximate clocks w/ 128b extended keys already loaded
    ECB, 1 block:                 17 clocks
    ECB, large block count:       11.8/blk
    CBC/CFB/etc, 1 block:         37
    CBC/etc, large block count:   22.5/blk
- Additional extended key generation/load time (128b key)
    Hardware generated:           38
    Loaded from memory:           53
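The clock counts above convert to throughput with one line of arithmetic. A small sketch, assuming the 2.0 GHz top clock quoted for the C5J and the 128-bit block size (the function name is just for illustration):

```python
def aes_throughput_gbps(clock_hz: float, clocks_per_block: float,
                        block_bits: int = 128) -> float:
    """Steady-state throughput in Gb/s for a given per-block clock count."""
    return block_bits * clock_hz / clocks_per_block / 1e9


CLOCK_HZ = 2.0e9  # assumption: C5J running at its 2.0 GHz top speed

ecb = aes_throughput_gbps(CLOCK_HZ, 11.8)  # ECB, large block count
cbc = aes_throughput_gbps(CLOCK_HZ, 22.5)  # CBC/etc, large block count

print(f"ECB: {ecb:.1f} Gb/s, CBC: {cbc:.1f} Gb/s")
```

The ECB figure works out to roughly 21.7 Gb/s, which lines up with the 21.5 Gb/s measured for the 8 KB, in-cache case in the measured-performance slide.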

AES Performance

Measured performance
- P4 = Gladman library AES, C5J = replaced routine with AES inst
- ECB mode (other modes slower, but same advantage over P4)
- Same memory size (512MB), same bus speeds (533 MHz)

data size   2.53-GHz P4   2.0-GHz C5J
8 KB        0.56 Gb/s     21.5 Gb/s
64 KB       0.56          19.5
1 MB        0.56          5.45
10 MB       0.56          5.23

Another example: Gladman reports (his site) using his library (ECB)

data size   1.2-GHz C5P
16 Kb       15.2 Gb/s

Callouts: "bus limited"; "Earlier part" (the 1.2-GHz C5P)

C5J Montgomery Multiplier Features

Goal: Speed up RSA's modular exponentiation
- c = m^e mod n is dominated by repeated d = m * y mod(n) ops
- where m, y, n are thousands of bits long!

This multiply is "always" done using the "Montgomery Multiply" algorithm
- Uses special number space to make d' = a' * b' mod(m) much faster by eliminating the divide
- But initial & result values must be transformed to/from Montgomery number space
- In real usage, the transformation overhead is relatively small

Our hardware directly performs "Montgomery Multiply"
- About as fast as an ordinary multiply!
- For up to 32Kb numbers!

New application-level x86 MontMul instruction
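For readers who have not met the algorithm, here is a minimal software sketch of the textbook Montgomery reduction (REDC) and an exponentiation loop built on it. It is not a model of the hardware, which works on 32-bit words sequenced by microcode; the function names and the choice of R = 2^bits are assumptions made for this illustration.

```python
def montgomery_setup(n: int, bits: int) -> int:
    """Return n' with n * n' == -1 (mod R), where R = 2**bits > n and n is odd."""
    assert n % 2 == 1 and n < (1 << bits)
    r = 1 << bits
    return (-pow(n, -1, r)) % r  # modular inverse needs Python 3.8+


def mont_mul(a: int, b: int, n: int, bits: int, n_prime: int) -> int:
    """Return a * b * R^-1 mod n using only multiplies, masks and shifts."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask   # chosen so that t + m*n is divisible by R
    u = (t + m * n) >> bits             # exact division by R, done as a shift
    return u - n if u >= n else u       # one conditional subtract, no divide


def mont_pow(base: int, exp: int, n: int, bits: int) -> int:
    """Compute base**exp mod n with square-and-multiply in Montgomery space."""
    n_prime = montgomery_setup(n, bits)
    r = 1 << bits
    x = (base * r) % n                  # transform in (the one-time overhead)
    acc = r % n                         # Montgomery form of 1
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, bits, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x, n, bits, n_prime)
    return mont_mul(acc, 1, n, bits, n_prime)  # transform back out


n = (1 << 127) - 1  # an odd modulus for the demo
print(mont_pow(123456789, 65537, n, 128) == pow(123456789, 65537, n))
```

The two `% n` operations in `mont_pow` are the transformation overhead the slide mentions: they happen once per exponentiation, while `mont_mul` runs once or twice per exponent bit.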

Centaur Montgomery Multiplier

[Block diagram. Labels: SSE load bus; SSE store bus; temp regs; 16-byte blocks; A[j], B[i], M[j], T[j], U (32b each); two 32 x 32 multipliers; 64-bit adders; T[j-1]Hi (33b); bits 64:32 (33b) and bits 31:0 of the result.]

- 32b x 32b mod(32b) = 4 clks (2 clk pipelined)
- Ucode sequences loads & stores
- Usable with any size data (256 to 32Kb, 128b steps)
- Hack of existing multipliers

Centaur MontMul Performance

Compared to GMP library
- Perform c = m^e mod n (m, e, n chosen randomly)
- An example (speeds vary slightly based on values)
- Note: this is most of RSA time, but not the whole thing
- Same hardware as for AES chart

mod size (bits)   2.53-GHz P4   2.0-GHz C5J
512               340 exp/s     1800 exp/s
1024              50            243
1536              15.6          78
2048              7.1           35
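The table reads more easily as ratios. A quick sketch of the arithmetic, using only the numbers above (the per-clock normalization by the two quoted clock speeds is an added convenience, not something the slide reports):

```python
# modulus bits -> (exp/s on 2.53-GHz P4 with GMP, exp/s on 2.0-GHz C5J with MontMul)
rates = {512: (340, 1800), 1024: (50, 243), 1536: (15.6, 78), 2048: (7.1, 35)}
P4_GHZ, C5J_GHZ = 2.53, 2.0


def speedup(bits: int) -> float:
    """Wall-clock speedup of the C5J over the P4 for a given modulus size."""
    p4, c5j = rates[bits]
    return c5j / p4


def per_clock_speedup(bits: int) -> float:
    """Same ratio with the clock-frequency difference factored out."""
    return speedup(bits) * P4_GHZ / C5J_GHZ


for bits in rates:
    print(f"{bits:5d} bits: {speedup(bits):.2f}x wall-clock, "
          f"{per_clock_speedup(bits):.2f}x per clock")
```

The wall-clock advantage stays close to 5x across all four modulus sizes.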

Centaur SHA Features

FIPS-180-1 completely implemented in hardware
- SHA-1 (160-bit result)
- SHA-256 (256-bit result)

Instruction timing
- SHA-1: 251 clks
- SHA-256: 262
- where n is the number of 64B blocks to be compressed

Measured performance (Gb/s)
- Same hardware as for AES chart, GPL SHA SW (Devine)

              2.53-GHz P4         2.0-GHz C5J
data size     SHA-1    SHA-256    SHA-1    SHA-256
10 B          0.07     0.04       0.38     0.35
100 B         0.43     0.24       2.41     2.24
1,000 B       0.59     0.33       3.81     3.60
1,000,000 B   0.62     0.34       2.97     bus limited

C5J SHA Hardware

[Block diagram. Labels: SSE load bus; SSE store bus; next 64b data; data scheduler (16 x 32b regs); function generators; 5-way add; initial digest; accumulating digest (160b + 64b regs); final SHA-256 add.]

- SHA-1: 2 clks/32b rnd (5)
- SHA-256: 3 clks/round

Build Process

The Centaur Process

Centaur Build Methodology

Our challenges!
- Complex logic with lots of architectural interconnections
- 2-GHz & aggressive power/size objectives
- Relatively few designers (30 logic & circuit)
- Strong schedule pressure (must do it fast)
- Industry tools not sufficient (oriented to APR methodology)

Our Basic Approach
- Hundreds of top-level stand-alone "blocks"
    Allows parallel development of "one-person" blocks
    Facilitates fast "build" time (chip assembly, timing, etc.)
    Facilitates use of optimum process for particular logic
- Hook blocks together with top-level routing, clocks, etc.
    Significant "content" added in top-level build
- Full-chip timing with fast iterations
- Fast full-chip build iterations
- Develop our own tools & methodology to accomplish above

Centaur Chip Physical Build Process

C5J Die

Underlying Source Statistics

Verilog lines as written (small)
(no behaviorals, no comments, no clocks, no "top" chip)
- APR logic      112K lines   129K cells
- Stack logic     41K lines   172K cells
- Note: this is "single instance" as written; much of this gets instantiated multiple times

Schematic "pages" as written (large)
- Primitive (inv, nand2, nor2, etc.)    110
- Standard cells                        712
- Datapath elements                    1308
- Full customs                         1332
                                      -----
                                       3462

Circuit library size     avail   used
- Clock regens             445    277
- Std cell                 547    435
- G datapath elements      493    271
- W datapath elements      248    147
                         -----  -----
                          1733   1130

C5J Security Components (metal 1-4 only)

[Figure: layout plot. Labels: stk; APR (control for all stacks); custom; stk clock repeaters; 7 RC bfrs; global clk meanders; 32b data "bfr" section; decoupling caps. Note: global interconnects not shown.]

C5J Security Components (metal 1-4 only)

[Figure: layout plot. Labels: 128b-wide AES engine; key RAM; common control logic; RNG buffers; SHA sch & ALU.]

"Fast Build & Timing"

Every 1-5 days: full-chip "Release"
- APRs synthesized, placed & RCs estimated
- Stacks "cracked", placed & RCs estimated
- Full-chip timing done with estimated RCs
- Takes < 1 day for full-chip timing report

Every 5-10 days: full-chip physical build
- APRs routed
- Stacks routed
- Global chip routed
- Global chip layout produced
- APRs, stack & global route RC extraction
    RCs feed back to calibrate estimated RCs
- This goes on continuously, picking up new Releases as needed

Our experience at other companies: much slower

Basic “Release” Process

RTL Design Rules

APR Blocks
- Element instantiation OK
    - Registers (req'd: synthesis can't infer them correctly)
    - Clock buffers & distribution (req'd: synthesis clocks are slow!!)
    - Occasional logic (this has diminished over time)
    - The instantiated elements are really macros
      Auto expanded to right size, number of bits, etc. in the flow
- Wires & continuous assignment OK
    - Including operators like ?, +, < etc.
- Nothing else! (no procedural stuff)
    - No if/else, no case, no loops, no "always", no "at", etc.
    - No timing information/control
    - Synthesis generates bad logic for these
      Unexpected/superfluous elements, registers where not expected, timing doesn't work, etc.

Stacks
- Component instantiation & wires only!

APR RTL Example As Written

    assign idleNS = (T[0] | T[8]) | shaDone_P;
    assign funcNS = (T[1] | T[3] | T[6] | T[10]) & ~shaDone_P;
    assign add1NS = (T[2]) & ~shaDone_P;
    assign add2NS = (T[5]) & ~shaDone_P;
    assign faddNS = (T[4] | T[7] | T[9]) & ~shaDone_P;

    rregs #(5) state (.q   ({idleState, funcState, add1State, add2State, faddState}),
                      .d   ({idleNS, funcNS, add1NS, add2NS, faddNS}),
                      .clk (ph1c));
    ------------------
    sha2cnst sha2cnst (.in   (iteration[5:0]),
                       .ksel (shKSel),
                       .algo (sha1_P),
                       .out  (KsubI));
    ------------------
    wire [6:0] nextIteration;
    assign nextIteration = (shaDone_P | idleState) ? 7'b0000000 :
                           shIterationStall        ? iteration :
                                                     iteration + 1;
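The nested `?:` in the `nextIteration` assignment is a priority chain: reset wins over stall, and stall wins over increment. A small behavioral sketch in Python makes the ordering explicit. The function and argument names are stand-ins for the signals above, and the 7-bit mask reflects the `wire [6:0]` declaration (the width of `iteration` itself is not shown in the slide, so that part is an assumption).

```python
def next_iteration(iteration: int, sha_done: bool, idle_state: bool,
                   stall: bool) -> int:
    """Behavioral mirror of the nextIteration continuous assignment."""
    if sha_done or idle_state:       # (shaDone_P | idleState) ? 7'b0000000
        return 0
    if stall:                        # shIterationStall ? iteration
        return iteration & 0x7F
    return (iteration + 1) & 0x7F    # iteration + 1, truncated to 7 bits


# Walk a few cycles: count up, hold during a stall, reset when done.
value = 0
for sha_done, stall in [(False, False), (False, False), (False, True), (True, False)]:
    value = next_iteration(value, sha_done, idle_state=False, stall=stall)
    print(value)
```

This is exactly the kind of logic the design rules allow in APR blocks: a pure function of wires, with the register instantiated separately.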

Stack RTL Example

Datapath Section

    /*------------------- KeyGen XOR --------------------------*/
    wire [31:0] aesKeyGenXorOut2_L;
    zdxor #(32,15) keyg1 (.out (aesKeyGenXorOut2_L),
                          .in0 (aesWord2I_LB),
                          .in1 (aesKeyGenXorOut1_LB));
    zinv #(32,60) kgen2 (aesKeyGenXorOut2_LB, aesKeyGenXorOut2_L);

    wire [31:0] aesKeyGenXorOut2_MB;
    wire [31:0] aesKeyGenXorOut2_M;
    zregi_en #(32,10) keyg2 (.q   (aesKeyGenXorOut2_MB),
                             .d   (aesKeyGenXorOut2_L),
                             .clk (EPH1),
                             .en  (aesDynEn_K));
    zinv #(32,10) keyg2i (aesKeyGenXorOut2_M, aesKeyGenXorOut2_MB);

Buffer Section

    rregsi #(2,20) bf_kk (.qb  (aesKeyMuxSel_M),
                          .d   (aesKeyMuxSel_LB),
                          .clk (evph1));

Stack Placement Tool Output (32-bit AES stack)

Buffer section added; inter-element routing (m2-6)

Global wires added

Sample Timing Report "Path"

time     path                                                 element                 delta    load cap  wire     rise/fall
0.875ns  eeph1aesdp2 ^                                        aesdp2/eph1buf_aesdp2/  0.050ns  0.2423pF  0.000ns  0.000ns
0.925ns  aesdp2/eph1 ^                                        aesdp2/sc_c0ph1_48/     0.160ns  0.0321pF  0.000ns  0.000ns
1.085ns  aesdp2/keyg2_ph1 ^                                   aesdp2/gxregi_x4_10     0.063ns  0.0035pF  0.000ns  0.004ns
1.148ns  aesdp2/aesdp2_dp_aeskeygenxorout2_mb10 v             -                       0.000ns  0.0035pF  0.000ns  0.004ns
1.148ns  aesdp2/aesdp2_dp_keyg2i_stack_bit10_i0 v             aesdp2/ginv_10          0.026ns  0.0209pF  0.000ns  0.044ns
1.173ns  aesdp2/aesdp2_dp_aeskeygenxorout2_m10 ^              -                       0.000ns  0.0209pF  0.000ns  0.045ns
1.174ns  aesdp2/aesdp2_dp_invk_stack_bit10_i0 ^               aesdp2/gemux3i_19       0.045ns  0.0336pF  0.000ns  0.031ns
1.219ns  aesdp2/aesdp2_dp_key_mb10 v                          -                       0.000ns  0.0336pF  0.000ns  0.031ns
1.219ns  aesdp2/aesdp2_dp_kml_stack_bit10_i0 v                aesdp2/ginv_31          0.017ns  0.0188pF  0.000ns  0.013ns
1.236ns  aesdp2/aesdp2_dp_key_m10 ^                           -                       0.001ns  0.0188pF  0.001ns  0.014ns
1.236ns  aesdp2/aesdp2_dp_mixcoldec_xorout_stack_bit10_in0 ^  aesdp2/gxor8_10         0.095ns  0.0170pF  0.000ns  0.029ns
1.331ns  aesdp2/aesdp2_dp_decout_m10 v                        -                       0.000ns  0.0170pF  0.000ns  0.030ns
1.332ns  aesdp2/aesdp2_dp_mcmux_stack_bit10_i2 v              aesdp2/gmux3i_10        0.030ns  0.0089pF  0.000ns  0.017ns
1.362ns  aesdp2/aesdp2_dp_mcout_mb10 ^                        -                       0.000ns  0.0089pF  0.000ns  0.017ns
1.362ns  aesdp2/aesdp2_dp_invm_stack_bit10_i0 ^               aesdp2/ginv_31          0.030ns  0.1101pF  0.000ns  0.053ns
1.391ns  aesdp2/aesdp2_dp_mcout_m10 v                         -                       0.012ns  0.1101pF  0.012ns  0.078ns
1.403ns  aesdp2/aesdp2_dp_pipemux0_stack_bit10_i1 v           aesdp2/gmux2i_16        0.048ns  0.0249pF  0.000ns  0.030ns
1.451ns  aesdp2/aesdp2_dp_aesword2i_kb10 ^                    -                       0.001ns  0.0249pF  0.001ns  0.032ns
1.452ns  aesdp2/aesdp2_dp_byte1_indx_pb2 ^

(^ = rising, v = falling)

Local reg clock-to-next reg input = 1.452 - 1.085 = 367 ps
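The 367 ps figure can be cross-checked by adding up the per-step deltas from the register's clock pin (1.085 ns) to the final node. A short sketch, with the deltas transcribed in picoseconds; labeling the rows that have no element as "wire" is this example's reading of the report, not something the report states.

```python
# (element, delta in ps) for each step after the register clock pin at 1.085 ns
stages_ps = [
    ("gxregi_x4_10", 63), ("wire", 0),
    ("ginv_10", 26),      ("wire", 0),
    ("gemux3i_19", 45),   ("wire", 0),
    ("ginv_31", 17),      ("wire", 1),
    ("gxor8_10", 95),     ("wire", 0),
    ("gmux3i_10", 30),    ("wire", 0),
    ("ginv_31", 30),      ("wire", 12),
    ("gmux2i_16", 48),    ("wire", 1),
]

total_ps = sum(delta for _, delta in stages_ps)
wire_ps = sum(delta for name, delta in stages_ps if name == "wire")
gate_ps = total_ps - wire_ps
reported_ps = 1452 - 1085  # from the first and last time stamps

print(f"sum of deltas: {total_ps} ps (reported {reported_ps} ps)")
print(f"gates: {gate_ps} ps, wires: {wire_ps} ps")
```

The summed deltas come out 1 ps above the reported difference because each printed delta is rounded to the nearest picosecond. On this path nearly all of the delay is in the elements rather than the wires.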

Random Circuit Topics

Clocking is very difficult & very critical
- Very aggressive skew goals
    "0" ps clock skew across all top-level blocks
    <20 ps skew worst case within a block
- These are met in our designs ignoring on-chip silicon variations
- Multiple clock domains required (for bus & various power states)
- Many "early", "late", etc. versions of the clocks needed
- Clocks must be gated (for power management)

Our clocking methodology is proprietary, but...
- Hand-routed global clock tree (continually changing)
- Our own tools to generate clock shields tuned to surroundings
- Tunable "repeaters" (via fuse & via metal)
- Hand instantiated clock elements within blocks
- Many selectable clocks (xx ps for each reg)
- Auto-generated clock grids within APRs & stacks
- Fuse adjustable PLL characteristics (duty cycle, etc.)

Power/ground distribution critical
- Extensive analysis & "management" required

Random Circuit Topics (cont)

Robust circuit design req'd across 12 "corner" models
- 54 formal corners identified, we choose the most critical "12"
- Covers variations in: Temp, V, N xistor, P xistor
- Automated element simulation done across these models
- Full-chip timing is done using 2 of these corners (hi V, lo V)

Extensive use of dynamic logic
- Precharge in phase 1, evaluate in phase 2
- Registers, adders, comparators, arrays, etc.
- Customs, stacks (& APRs)

Two stack-element libraries
- With different bit pitches

Element libraries have several versions of same function
- Usually, at least "fast/big/hot" & "slow/small/cool"
- Example: C5J has 2 different "vanilla" 32-bit adders
    Fast (dynamic): 180 ps, 37.9 µ high
    Slow (static):  250 ps, 16.9 µ high
    Note: 25 total adders in library, instantiated 65 total times

Random Circuit Topics (cont)

Several families of registers available
- Differ in function, speed, size & performance
- Std cell, datapath & custom versions
- Each comes in many drive strengths (sizes)
- Many have built-in functions
    muxes, and/or logic, xors, compares, etc.
    These provide speed/size/power improvements vs. separate elements
- Examples using C5J stack elements

[Figure: two comparisons with delay and height callouts. (1) k-reg 10 followed by a separate static cmp-eq 20 vs. k-reg + dynamic cmp-eq 60 (1b and 26b cases); callouts include 90 + 32 + 17 = 139 ps, 88 ps, 82 ps (data-to-out), and heights 9.5 µ, 5.0 µ, 4.6 µ, 3.8 µ. (2) normal reg vs. fast reg (x-reg 10 with inv 54); callouts include 32 ps, 3.8 µ, 1.4 µ.]

C5J Security Component Sizes (mm^2)

[Figure: annotated layout with per-component areas 0.080, 0.091, 0.014, 0.034, 0.069, 0.046, 0.021; sample scale 227 µ.]

Total = 0.529 mm^2 + 0.014 for 2 RNGs (elsewhere) = 0.54
(a few cents, but for this chip it's really free)

C5J Security Component Sizes

Note: We had so much spare room on die that we didn't spend any effort making this smaller. We estimate at least 30% smaller if we tried hard!

(If we had only known about all this space when we started...)

[Figure: same annotated layout; area labels 0.080 mm^2, 0.080, 0.080, 0.091, 0.014, 0.034, 0.069, 0.046; sample scale 227 µ.]

[Figure: annotated layout of a 32-bit AES stack, top to bottom]
- S-box ROM (2 x 256 x 8 bit) x 4 bytes; 200 ps access (dynamic)
- Row-shift muxes (wires to other 32b stacks not visible)
- Column multiply (& key xor) made out of 2-, 3-, 4-, 5-, 6-, 7- & 8-input xors
- Startup, CBC, etc. muxes & registers
- --- register ---
- Startup, CBC, etc. muxes & registers
- --- register ---
- (extra stuff at bottom for key generation)
