Presentation is loading. Please wait.

Presentation is loading. Please wait.

High-Performance Arithmetic Challenges: From Architectures to Circuits Ram K. Krishnamurthy, Sanu K. Mathew, Shekhar Borkar Microprocessor Research, Intel.

Similar presentations


Presentation on theme: "High-Performance Arithmetic Challenges: From Architectures to Circuits Ram K. Krishnamurthy, Sanu K. Mathew, Shekhar Borkar Microprocessor Research, Intel."— Presentation transcript:

1 High-Performance Arithmetic Challenges: From Architectures to Circuits Ram K. Krishnamurthy, Sanu K. Mathew, Shekhar Borkar Microprocessor Research, Intel Labs Intel Corporation, Hillsboro, OR, USA ramk@ichips.intel.com Prof. Vojin Oklobdzija ACSEL Lab, Dept. of ECE University of California, Davis, CA, USA vojin@ece.ucdavis.edu Ram K. Krishnamurthy, Sanu K. Mathew, Shekhar Borkar Microprocessor Research, Intel Labs Intel Corporation, Hillsboro, OR, USA ramk@ichips.intel.com Prof. Vojin Oklobdzija ACSEL Lab, Dept. of ECE University of California, Davis, CA, USA vojin@ece.ucdavis.edu Intel Labs 16 th IEEE International Computer Arithmetic Symposium, Santiago, June 18 th 2003

2 2 Motivation Design choices for high-performance circuits SOI vs. Bulk devices: ALU design test-case 64-bit ALUs in PD-SOI and Bulk CMOS Energy-efficient high-performance AGU/ALUs 4GHz Sparse-tree AGU Design 6.5-10GHz Integer ALU Design Summary Motivation Design choices for high-performance circuits SOI vs. Bulk devices: ALU design test-case 64-bit ALUs in PD-SOI and Bulk CMOS Energy-efficient high-performance AGU/ALUs 4GHz Sparse-tree AGU Design 6.5-10GHz Integer ALU Design Summary Outline

3 3 Frequency doubles every generation Performance-critical units ALUs & AGUs Register files, L0 caches High-performance trends Single-cycle latency & throughput

4 64-bit ALUs in 0.18 m PD-SOI/Bulk CMOS: Design & Scaling Trends [S. Mathew et al, ISSCC 2001] [S. Mathew et al, JSSC, Nov 2001]

5 5 High performance devices: Partially depleted Silicon-on-Insulator Pros & Cons vs. bulk CMOS Scaling trends High performance circuit design: Sparse-tree semi-dynamic AGU Single-rail dynamic ALU High performance devices: Partially depleted Silicon-on-Insulator Pros & Cons vs. bulk CMOS Scaling trends High performance circuit design: Sparse-tree semi-dynamic AGU Single-rail dynamic ALU Design choices

6 6 p+ n+ PD-SOI Devices Body of devices not tied to V cc /V ss Body is isolated by buried oxide Floating Body! P-Substrate n+ p+ STI Buried Oxide P type body N type body STI

7 7 Delay = Function of switching history –Capacitive coupling from S/G/D –Impact Ionization, Diode conduction –Transient V bs DC V bs Backgate Buried Oxide n+ n+ Gate Body Potential S D G C box C db C sb C gb Complicates timing analysis History Effect in PD-SOI

8 8 64-bit ALU architecture Ideal test-bed for evaluating process technologies 1200 m Loopback bus Single rail adder core Single rail adder core Sum 2:1 Mux External operands Shift control 5:1 Mux 0.5pF 9:1 Mux Mux control 3:1 Mux Mux control 9:1 Mux External operands Sign control

9 9 High-performance Adders: Kogge Stone Generate all carries: Full-blown binary tree energy-inefficient # Carry-merge stages = log 2 (N) GG=G i +P i G i-1 GP=P i P i-1 Odd input bits Even input bits Sum even Sum odd PG Gen. CM1CM2CM3CM4 CM5 CM1CM2CM3CM4 CM5 PG Gen. 1 2 3 4 5 6 7 XOR

10 10 64-bit Han-Carlson adder core Carry-merge done on even bitslices 50% fewer carry-merge gates vs. Kogge-Stone Extra logic stage generates odd carries Odd bit Even bit CM 0 CM 1 b1b0b2b3 b63b62b61b60 PG generator Odd carry generator Sum XOR Carry-merge0 Carry-merge1 Carry-merge5 3N 2P 2N 2P b59

11 11 Energy-efficient adder core 43% less energy/transition at iso-performance Adder architectureEnergy/transition Kogge-Stone Han-Carlson 120pJ 68pJ

12 12 CSG Han Carlson carry-merge tree Single rail adder core CSG circuit generates dual-rail carry C even 3N 2P 2N 2P 2N 2P Even inputs 2P C odd 2N CM0 PG gen. CM1CM2CM3CM4CM5CM6 3N Odd inputs C even C odd CSG Carry-merge tree Odd carry generator Complementary signal generator Single rail Dual rail

13 13 Complementary signal gen. Domino-compatible Carry/Carry Permits a single-rail carry-merge tree design Not time-borrowable – Penalty absorbed by placing gate at 2 boundary Keeper Carry i Cin i True pull-down path Complementary pull-down path

14 14 Partial sum generator Generates domino-compatible partial sum Placing the gate at 1 boundary mitigates output noise-glitches 1 AiAi Psum i Keeper BiBi AiAi BiBi 1 1 PiPi GiGi

15 15 ALU performance in bulk CMOS 2P 2N XOR 2N 2 Inp. 3N 9:1 Mux 5:1 Mux 3:1 Mux Bus driver 1200 m Bus 1 64b Han-Carlson ALU simulation results ALU delay482ps 0.18 m bulk CMOS, V cc =1.5V Adder core 310ps 2P Sum

16 16 Porting from bulk to PD-SOI SOI favored redesign Bulk design SOI design SOI-optimal design Direct port Design issues: Noise tolerance due to lowered V t Min-delay timing-analysis Motivation for redesign: Reduced SOI stack penalty Deeper stack design Stage reduction Motivation for redesign: Reduced SOI stack penalty Deeper stack design Stage reduction Design choices: Architecture should favor deep stack design Avoid increase in fanouts

17 17 0.18 m Bulk & PD-SOI technologies Equal I OFF at DC V bs SOI I DSAT is 1-2% lower Ioff(nA/ m)Idsat( A/ m) NMOS-Bulk3.31070 NMOS-SOI3.31050 Ioff(nA/ m)Idsat( A/ m) PMOS-Bulk0.7445 PMOS-SOI0.7441

18 18 History effect measurements in 0.18 m PD-SOI

19 19 Direct port of Han-Carlson ALU to PD-SOI Adder core speedup = 14% – [Stasiak et al.,ISSCC 2000] 21% speedup 64b Han-Carlson ALU delay simulations % Delay improvement over bulk Bulk482ps 16% Direct-port to SOI403ps 0.18 m technology, V cc =1.5V

20 20 Speedup analysis Stage type Speedup over bulk from direct port to 0.18 m PD-SOI Static gates Dynamic gates 3:1 TG Mux 5:1 TG Mux 9:1 TG Mux 12-15% 2-9% 20% 23% 35% Diffusion dominated muxes Max. speedup Load dominated gates Speedup decreases

21 21 Motivation for PDSOI-optimal redesign Reduced stack penalty in SOI Deeper stack design Stage reduction ALU is amenable to such a redesign –Not true for all CPU critical paths SOI-optimal ALU architecture –Increasing stack depth must not increase fanouts A novel deep-stack sparse-tree ALU was developed

22 22 Sparse-tree adder core 50% reduced fanouts compared to Han-Carlson 7 gate stages (Two less than Han-Carlson) 2N 2P 4N 2P 3N Mux b1b0b2b3 b63b62b61b60 PG generator 63:6261:60 59:58 3:2 1:0 7:0 15:8 7:6 5:4 23:1631:2439:32 47:40 15:0 31:16 47:32 47:0 31:0 SumGen Int. carry gen. 59:58 57:56 55:54 53:52 51:50 49:48 43:42 41:40 39:38 37:36 35:34 33:32 27:26 25:24 23:22 21:20 19:18 11:10 9:8 7:6 5:4 3:2 17:16 1:0 Fast carry-merge tree

23 23 Intermediate Carry Generator Generates 1 in 4 carries (C 3, C 7, C 19 ….. C 59 ) Non-critical path (ripple carry-select scheme) Fast carry selects bet. the conditional carries 01 P 3:0 G 3:0 P 7:4 G 7:4 P 11:8 G 11:8 2 2 2 2 Carry from Fast CM Chain C3C3 C7C7 C 11 2:1 Mux CM

24 24 Non-critical Sum Generator Non-critical path: ripple carry chain Reduced area, energy consumption, leakage Generate conditional sums for each bit 1 in 4 carry selects appropriate sum PiPi P i+1 P i+2,G i+2 Sum i+1 Sum i+2 Sum i+3 XOR P i+3,G i+3 Sum i Sum i,1 Sum i,0 Carry G i+1 2:1 1 1 0 0 CM CMCM CMCMCMCM XOR

25 25 Sparse-tree adder critical path Fast carry-merge path Critical path Non-critical side-paths Ripple-carry 2N2N2P2P4N4N2P2P3N3N Inv3N 2P 2N 2P 3N Sum generator Intermediate carry generator Fast carry-merge path Sumout Input 2N

26 26 PD-SOI optimal redesign in 0.18 m Deeper stack redesign additional 5% speedup 64b ALU delay simulations Speedup over bulk Bulk482ps- Direct-port SOI403ps16% SOI-optimal redesign380ps21% 0.18 m technology, V cc =1.5V

27 27 Margining for reverse-body bias in PD-SOI 400mV rvs. bias increases rise-delay by 10% Difficult to detect for large circuits 10% Margin required for all max-delay paths Overall PD-SOI speedup reduces to 11%

28 28 Reducing reverse-bias penalty in dynamic SOI gates Point solution for dynamic designs Pre-charging stack node decreases penalty to 2% M1M1 B A Out A B Stack node Body-B Body-A P0 Max-delay margin reduced to 2% Cost 5% increase in clock energy Cost 5% increase in clock energy

29 29 0.18 m ALU performance after margining Maximum PD-SOI speedup reduces to 19% 64b ALU delay simulations Speedup over bulk Speedup after margining Bulk482ps-- Direct-port SOI403ps16% 14% SOI-Optimal redesign 380ps21% 19% 0.18 m technology, V cc =1.5V

30 30 Scaling to 0.13 m technologies Equal SOI & bulk I OFF-DC MOSFET & impact ionization data obtained from 0.13 m bulk measurements SOI parasitic BJT/diode characteristics unchanged from 0.18 m fitting

31 31 Scaling ALU designs to 0.13 m technology Maximum PD-SOI speedup reduces to 16% 64b ALU delay simulations Speedup over bulk Speedup after margining Bulk351ps-- Direct-port SOI312ps11%9% SOI-Optimal redesign 286ps18%16% 0.13 m technology, V cc =1.2V

32 32 SOI vs. bulk Summary 482ps energy-efficient dynamic 64b ALU in 0.18 m bulk –310ps adder core Direct port to 0.18 m SOI 14% speedup SOI optimal redesign 19% speedup Floating body can get reverse-biased –Preconditioning reduces margin from 10% to 2% Scaling to 0.13 m decreases PD-SOI speedup 16% Maximum PD-SOI speedup in 0.13 m falls to 16%

33 33 Goal: Shift the E-D curve High-Performance Low Power Datapath design Delay Energy

34 A 4GHz 130nm Address Generation Unit with 32-bit Sparse-tree Adder Core Intel Labs [S. Mathew et al, VLSI Symp. 2002], [S. Mathew et al, JSSC May 2003]

35 35 AGUs: performance and peak-current limiters High activity thermal hotspot Goal: high-performance energy-efficient design Motivation Execution core 120 o C Cache Processor thermal map AGU Temp ( o C)

36 36 AGU Architecture Single-cycle latency and throughput Effective Address = Base + Index * Scale + (Segment +Displacement) 2-phase address computation Displacement Effective Address 3:2 Compressor Base Index Segment 3b shift 32 32b add 32 + clk clk2 clk3 32

37 37 AGU Operation: Phase 1 Index pre-scaled via 3-bit barrel shifter 3:2 compressor renders partial address: Carry-save format Adder in pre-charge state Displacement Effective Address 3:2 Compressor Base Index Segment 32 32b adder 32 + clk clk2 clk3 32 Carry-Save format Carry-Save format 3b shift 3b shift

38 38 AGU Operation: Phase 2 Carry-save to binary format conversion: 2s complement parallel 32-bit adder Displacement Effective Address Effective Address Base Index Segment 3b shift 32 32b adder 32 + clk clk2 clk3 32 3:2 Compressor

39 39 Kogge-Stone Adder Critical path = PG+5+XOR = 7 gate stages Generate,Propagate fanout of 2,3 Maximum interconnect spans 16b Energy inefficient 12354679810111312141517161819212022232524262729283031 PG Carry-merge gates XOR 0 0

40 40 Sparse-tree Adder Architecture Generate every 4 th carry in parallel Side-path: 4-bit conditional sum generator 73% fewer carry-merge gates energy-efficient

41 41 Non-critical Sum Generator Non-critical path: ripple carry chain Reduced area, energy consumption, leakage Generate conditional sums for each bit Sparse-tree carry selects appropriate sum PiPi P i+1 P i+2,G i+2 Sum i+1 Sum i+2 Sum i+3 XOR P i+3,G i+3 Sum i Sum i,1 Sum i,0 Carry G i+1 2:1 1 1 0 0 CM CMCM CMCMCMCM XOR

42 42 Conditional Carry for Cin=0 Optimized First-level Carry-merge Carry-merge stage reduces to inverter Conditional carry_0 = G i # C#_0 GiGi CMCM 00

43 43 Conditional carry for Cin=1 Optimized First-level Carry-merge P i & G i correlated Conditional carry_1 = P i # PiPi GiGi AiAi BiBi PiPi GiGi C#_1 00001 01100 10100 11110 PiPi CMCM 11

44 44 Optimized Sum Generator PiPi P i+1 P i+2,G i+2 Sum i+1 Sum i+2 Sum i+3 XOR P i+3,G i+3 Sum i Sum i,1 Sum i,0 Carry G i+1 CMCMCMCM Optimized 1st-level carry-merge Optimized non-critical path: 4 stages 2:1 2:12:1 CMCMCMCM XORXORXORXOR

45 45 Adder Core Critical Path Critical path: 7 gate stages same as KS Sparse-tree: single-rail dynamic Exploit non-criticality of sum generator Convert to static logic Semi-dynamic design PG GG 1 GG 7 Static sum generator Single-rail dynamic sparse-tree path AdderInputs clk2 Sum 31 clk3 clk clk GG 27 GG 15 CM0Latch CM1 XOR C 27 Sum 31_0 Sum 31_1 GG 3

46 46 1 st -level Carry-merge: Static Latch Holds state in pre-charge phase Prevents pre-charging of static stages

47 47 Domino-Static Interface Sum=Sum0 during pre-charge Mux output resolves during evaluation clk=0 clk=1

48 48 Sparse-tree Architecture Performance impact: (20% speedup) 33-50% reduced G/P fanouts 80% reduced wiring complexity 30% reduction in maximum interconnect Power impact: (56% reduction) 73% fewer carry-merge gates 50% reduction in average transistor size

49 49 Energy-delay Space 20% speedup over Kogge-Stone 56% worst-case energy reduction Scales with activity factor 0 20 40 60 80 100 140160180200220240260280 Delay (ps) Worst-case Energy (pJ) Dynamic Kogge-Stone Semi-dynamic Sparse-Tree 20% 4GHzDesign 56% 130nm CMOS, 1.2V, 110 o C Simulation

50 50 Semi-dynamic Design Static sum generators : low switching activity 71% lower average energy at 10% activity 0 10 20 30 40 00.10.20.30.40.5 Activity factor Average Energy (pJ) Dynamic DynamicKogge-Stone Semi-dynamicSparse-Tree 71% 71%

51 51 Dual-V t Allocation Exploit non-criticality of sidepaths Use high-V t devices 0% performance penalty 56% reduction in active leakage energy Low-V t Dual-V t Delay Switching Energy Leakage Energy 152ps36pJ0.9pJ152ps 34pJ (-6%) 0.4pJ (-56%) 130nm CMOS, 1.2V, 110 o C Simulation

52 52 Scaling Performance Average transistor size = 3.5 m Reduces impact of increasing leakage 80% reduction in wiring complexity Reduces impact of wire resistance 33% delay scaling, 50% energy reduction130nm100nmDelay Switching Energy Leakage Energy 152ps36pJ0.9pJ 102ps (-33%) 18pJ (-50%) 0.7pJ (-23%)

53 A 6.5GHz, 130nm Single-ended Dynamic ALU Intel Labs [M. Anders et al, ISSCC 2002], [S. Vangal et al, JSSC November 2002]

54 54 32-bit ALU/Scheduler Loop Performance-critical execution core loop

55 55 Han-Carlson ALU Organization Single-rail dynamic 9-stage low-Vt design

56 56 Odd-bits CSG Sum Generation Final carry-merge CSG(dual-rail carry output) pass-transistor sum XOR

57 57 Even-bits CSG Sum Generation Domino-compatible sum Dual-rail sum from single-ended g inputs

58 58 Die Micro-photograph 130nm 6-metal dual-Vt CMOS Scheduler: 210μm x 210μm ALU: 84μm x 336μm Scheduler ALU

59 59 Delay and Power Measurements 6.5GHz at 1.1V, 25ºC Power: 120mW total, 15mW leakage Scalable to 10GHz at 1.7V, 25ºC Design target 25ºC

60 60 Area50% Performance (Delay) 10% Active Leakage 40% Robustnessequal Improvements Over Dual-rail Domino Leakage reduced by eliminating dual-rail logic Robustness not compromised CSG improves both area and performance

61 61 SummarySummary 4GHz AGU in 1.2V, 130nm technology 4GHz AGU in 1.2V, 130nm technology Sparse-tree adder architecture described 20% speedup and 56% energy reduction Semi-dynamic design: Energy scales with switching activity Dual-V t non-critical paths: Low active leakage energy 6.5GHz ALU and scheduler loop at 1.1V, 25ºC 6.5GHz ALU and scheduler loop at 1.1V, 25ºC –Scalable to 10GHz at 1.7V, 25ºC


Download ppt "High-Performance Arithmetic Challenges: From Architectures to Circuits Ram K. Krishnamurthy, Sanu K. Mathew, Shekhar Borkar Microprocessor Research, Intel."

Similar presentations


Ads by Google