Published by Kobe Skaggs. Modified over 2 years ago.

1
High-Performance Arithmetic Challenges: From Architectures to Circuits
Ram K. Krishnamurthy, Sanu K. Mathew, Shekhar Borkar
Microprocessor Research, Intel Labs, Intel Corporation, Hillsboro, OR, USA
Prof. Vojin Oklobdzija
ACSEL Lab, Dept. of ECE, University of California, Davis, CA, USA
16th IEEE International Computer Arithmetic Symposium, Santiago, June 18th, 2003

2
Outline
Motivation
Design choices for high-performance circuits
SOI vs. bulk devices: ALU design test case
  64-bit ALUs in PD-SOI and bulk CMOS
Energy-efficient high-performance AGU/ALUs
  4GHz sparse-tree AGU design
  GHz integer ALU design
Summary

3
High-performance trends
Frequency doubles every generation.
Performance-critical units: ALUs and AGUs; register files and L0 caches.
Single-cycle latency and throughput.

4
64-bit ALUs in 0.18 µm PD-SOI/Bulk CMOS: Design & Scaling Trends
[S. Mathew et al., ISSCC 2001], [S. Mathew et al., JSSC, Nov 2001]

5
Design choices
High-performance devices: partially depleted silicon-on-insulator (PD-SOI)
  Pros and cons vs. bulk CMOS
  Scaling trends
High-performance circuit design:
  Sparse-tree semi-dynamic AGU
  Single-rail dynamic ALU

6
PD-SOI devices
Body of devices is not tied to Vcc/Vss.
Body is isolated by the buried oxide: floating body!
[Figure: PD-SOI cross-section with P-substrate, buried oxide, STI, n+ and p+ diffusions, P-type body and N-type body]

7
History effect in PD-SOI
Delay is a function of switching history:
  Capacitive coupling from S/G/D
  Impact ionization, diode conduction
  Transient Vbs differs from DC Vbs
Complicates timing analysis.
[Figure: PD-SOI device with gate (G), source (S), drain (D), body potential, buried oxide and backgate; capacitances Cbox, Cdb, Csb, Cgb]

8
64-bit ALU architecture
Ideal test-bed for evaluating process technologies.
[Figure: ALU block diagram. Labels: external operands, 9:1 muxes (mux control), 5:1 mux (shift control), 3:1 mux (mux control), sign control, single-rail adder core, sum, 2:1 mux, 0.5pF load, 1200 µm loopback bus]

9
High-performance adders: Kogge-Stone
Generates all carries: a full-blown binary tree, energy-inefficient.
Number of carry-merge stages = log2(N).
Carry-merge: GG = G_i + P_i·G_(i-1), GP = P_i·P_(i-1)
[Figure: odd and even input bits through PG generator, carry-merge stages CM1 to CM5, and XOR producing the odd and even sums]
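The carry-merge recurrence on this slide can be sketched behaviourally in Python. This is only a bit-parallel model, not the authors' circuit: the function name `kogge_stone_add`, the XOR form of propagate, and the returned stage counter are assumptions made for illustration.

```python
def kogge_stone_add(a, b, n=64):
    """Add two n-bit integers (mod 2**n) with a Kogge-Stone carry-merge tree.

    Returns (sum, number_of_carry_merge_stages). Hypothetical helper.
    """
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    g = a & b      # per-bit generate G_i
    p = a ^ b      # per-bit propagate P_i (XOR form, reused for the sum)
    p0 = p
    dist = 1
    stages = 0
    while dist < n:
        # Carry-merge at every bit position in parallel:
        #   GG = G_i + P_i * G_(i-dist),  GP = P_i * P_(i-dist)
        g = (g | (p & (g << dist))) & mask
        p = (p & (p << dist)) & mask
        dist <<= 1
        stages += 1
    # After the last stage, bit i of g is the carry out of bit i,
    # so the carry into bit i is bit i-1 of g.
    carries = (g << 1) & mask
    return (p0 ^ carries) & mask, stages
```

For N = 64 the loop runs for distances 1, 2, 4, 8, 16 and 32, which is the log2(N) stage count quoted on the slide; every bit position merges in every stage, which is where the energy cost comes from.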

10
64-bit Han-Carlson adder core
Carry-merge done on even bitslices.
50% fewer carry-merge gates vs. Kogge-Stone.
Extra logic stage generates odd carries.
[Figure: bitslices b0 to b63 (odd and even bits) through PG generator, carry-merge stages 0 to 5, odd carry generator and sum XOR; stack types 3N, 2P, 2N, 2P]
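A behavioural sketch of the Han-Carlson idea follows, under stated assumptions: bits are indexed from 0, the tree computes group-generate at odd indices (the carries into even bits), and one extra stage derives the remaining carries. The function names and the idealized node counts are illustrative only and are not the authors' gate counts.

```python
def han_carlson_add(a, b, n=64):
    """Return (sum mod 2**n, tree_merges, extra_stage_merges). Hypothetical helper."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    tree = 0
    # Carry-merge 0: each odd bit merges with its even neighbour.
    for i in range(1, n, 2):
        G[i] = g[i] | (p[i] & g[i - 1])
        P[i] = p[i] & p[i - 1]
        tree += 1
    # Carry-merge 1..: Kogge-Stone style tree over the odd positions only.
    dist = 2
    while dist < n:
        new_g, new_p = G[:], P[:]
        for i in range(1, n, 2):
            j = i - dist
            if j >= 0:
                new_g[i] = G[i] | (P[i] & G[j])
                new_p[i] = P[i] & P[j]
                tree += 1
        G, P = new_g, new_p
        dist *= 2
    # Extra stage: remaining (even-index) positions take the carry from below.
    extra = 0
    for i in range(2, n, 2):
        G[i] = g[i] | (p[i] & G[i - 1])
        extra += 1
    total = 0
    for i in range(n):
        carry_in = G[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, tree, extra


def kogge_stone_merge_count(n=64):
    """Idealized carry-merge node count of a full Kogge-Stone tree."""
    count, dist = 0, 1
    while dist < n:
        count += n - dist
        dist *= 2
    return count
```

In this idealized count, the 64-bit Han-Carlson tree uses 161 merge nodes against 321 for Kogge-Stone, which is in line with the roughly 50% reduction quoted on the slide; the 31 nodes of the extra carry stage are counted separately.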

11
Energy-efficient adder core
43% less energy/transition at iso-performance.
  Adder architecture   Energy/transition
  Kogge-Stone          120pJ
  Han-Carlson          68pJ

12
Han-Carlson carry-merge tree with CSG
Single-rail adder core.
CSG (complementary signal generator) circuit generates dual-rail carry.
[Figure: even and odd inputs through PG gen. and carry-merge stages CM0 to CM6 (single rail), odd carry generator, and CSG stages producing dual-rail C_even and C_odd; stack types 3N, 2P, 2N]

13
Complementary signal generator
Generates domino-compatible Carry and its complement (Carry#).
Permits a single-rail carry-merge tree design.
Not time-borrowable: penalty absorbed by placing the gate at the Φ2 boundary.
[Figure: CSG gate with keeper, inputs Carry_i and Cin_i, true pull-down path and complementary pull-down path]

14
Partial sum generator
Generates domino-compatible partial sum.
Placing the gate at the Φ1 boundary mitigates output noise glitches.
[Figure: partial-sum gate with inputs A_i, B_i, P_i, G_i, keeper, Φ1 clock and output Psum_i]

15
ALU performance in bulk CMOS
64b Han-Carlson ALU simulation results (0.18 µm bulk CMOS, Vcc = 1.5V):
  ALU delay    482ps
  Adder core   310ps
[Figure: ALU critical path with 9:1 mux, 5:1 mux, 3:1 mux, adder stages (2P, 2N, 3N, XOR), sum, bus driver and 1200 µm bus]

16
Porting from bulk to PD-SOI
Design issues: noise tolerance due to lowered Vt; min-delay timing analysis.
Motivation for redesign: reduced SOI stack penalty, deeper stack design, stage reduction.
Design choices: architecture should favor deep-stack design; avoid increase in fanouts.
[Figure: flow from bulk design to SOI design by direct port, and to SOI-optimal design by SOI-favored redesign]

17
0.18 µm Bulk & PD-SOI technologies
Equal I_OFF at DC Vbs.
SOI I_DSAT is 1-2% lower.
[Table: Ioff (nA/µm) and Idsat (µA/µm) for NMOS-Bulk, NMOS-SOI, PMOS-Bulk and PMOS-SOI; only the fragment "0.7441" survives, on the PMOS-SOI row]

18
History effect measurements in 0.18 µm PD-SOI

19
Direct port of Han-Carlson ALU to PD-SOI
Adder core speedup = 14%.
  [Stasiak et al., ISSCC 2000]: 21% speedup
64b Han-Carlson ALU delay simulations (0.18 µm technology, Vcc = 1.5V):
  Design               Delay   Delay improvement over bulk
  Bulk                 482ps   -
  Direct port to SOI   403ps   16%

20
Speedup analysis
Speedup over bulk from direct port to 0.18 µm PD-SOI, by stage type:
  Static gates    12-15%
  Dynamic gates   2-9%
  3:1 TG mux      20%
  5:1 TG mux      23%
  9:1 TG mux      35%
Diffusion-dominated muxes: maximum speedup.
Load-dominated gates: speedup decreases.

21
Motivation for PD-SOI-optimal redesign
Reduced stack penalty in SOI: deeper stack design, stage reduction.
ALU is amenable to such a redesign.
  Not true for all CPU critical paths.
SOI-optimal ALU architecture:
  Increasing stack depth must not increase fanouts.
A novel deep-stack sparse-tree ALU was developed.

22
Sparse-tree adder core
50% reduced fanouts compared to Han-Carlson.
7 gate stages (two fewer than Han-Carlson).
[Figure: bitslices b0 to b63 through PG generator, fast carry-merge tree (2-bit groups 1:0 to 63:62, then 7:0, 15:8, up to 47:40, then 15:0, 31:16, 47:32, 31:0, 47:0), intermediate carry generator and sum generator; stack types 2N, 2P, 4N, 2P, 3N, mux]

23
Intermediate carry generator
Generates 1 in 4 carries (C3, C7, C11, ..., C59).
Non-critical path (ripple carry-select scheme).
Fast carry selects between the conditional carries.
[Figure: carry-merge (CM) cells on P/G of groups 3:0, 7:4 and 11:8 with conditional carry inputs 0 and 1; 2:1 muxes controlled by the carry from the fast CM chain produce C3, C7 and C11]

24
Non-critical sum generator
Non-critical path: ripple carry chain.
  Reduced area, energy consumption, leakage.
Generate conditional sums for each bit.
1-in-4 carry selects the appropriate sum.
[Figure: ripple chain of carry-merge (CM) cells on P_i, G_i+1, (P,G)_i+2, (P,G)_i+3 producing conditional sums Sum_i,0 and Sum_i,1 through XORs; 2:1 muxes controlled by Carry select Sum_i to Sum_i+3]
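The conditional-sum scheme on this slide can be sketched as follows. This is a behavioural illustration, assuming 4-bit groups and XOR-form propagate; the function names are hypothetical and the ripple loop stands in for the chain of carry-merge cells.

```python
def conditional_sums(a4, b4, width=4):
    """Return (sum_if_cin0, sum_if_cin1) for one group, using a ripple carry."""
    results = []
    for carry_in in (0, 1):
        carry = carry_in
        total = 0
        for i in range(width):
            ai = (a4 >> i) & 1
            bi = (b4 >> i) & 1
            p = ai ^ bi                 # propagate
            g = ai & bi                 # generate
            total |= (p ^ carry) << i   # sum bit for this assumed carry-in
            carry = g | (p & carry)     # ripple carry-merge
        results.append(total)
    return results[0], results[1]


def select_sum(a4, b4, group_carry_in):
    """2:1 mux: the late-arriving group carry picks one precomputed sum."""
    sum0, sum1 = conditional_sums(a4, b4)
    return sum1 if group_carry_in else sum0
```

Both candidate sums are ready before the group carry arrives, so the slow ripple chain stays off the critical path and only the final 2:1 selection waits on the carry.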

25
Sparse-tree adder critical path
Critical path: fast carry-merge path.
Non-critical side-paths: ripple-carry intermediate carry generator and sum generator.
[Figure: path from input to Sumout through the fast carry-merge stages (2N, 2P, 4N, 2P, 3N), with the intermediate carry generator and sum generator as side-paths]

26
PD-SOI-optimal redesign in 0.18 µm
Deeper-stack redesign: additional 5% speedup.
64b ALU delay simulations (0.18 µm technology, Vcc = 1.5V):
  Design                 Delay   Speedup over bulk
  Bulk                   482ps   -
  Direct-port SOI        403ps   16%
  SOI-optimal redesign   380ps   21%

27
Margining for reverse-body bias in PD-SOI
400mV reverse bias increases rise delay by 10%.
Difficult to detect for large circuits.
10% margin required for all max-delay paths.
Overall PD-SOI speedup reduces to 11%.

28
Reducing reverse-bias penalty in dynamic SOI gates
Point solution for dynamic designs.
Pre-charging the stack node decreases the penalty to 2%.
Max-delay margin reduced to 2%.
Cost: 5% increase in clock energy.
[Figure: dynamic gate with inputs A and B, stack node, Body-A, Body-B, pre-charge device P0, M1 and Out]

29
0.18 µm ALU performance after margining
Maximum PD-SOI speedup reduces to 19%.
64b ALU delay simulations (0.18 µm technology, Vcc = 1.5V):
  Design                 Delay   Speedup over bulk   Speedup after margining
  Bulk                   482ps   -                   -
  Direct-port SOI        403ps   16%                 14%
  SOI-optimal redesign   380ps   21%                 19%

30
Scaling to 0.13 µm technologies
Equal SOI and bulk DC I_OFF.
MOSFET and impact ionization data obtained from 0.13 µm bulk measurements.
SOI parasitic BJT/diode characteristics unchanged from 0.18 µm fitting.

31
Scaling ALU designs to 0.13 µm technology
Maximum PD-SOI speedup reduces to 16%.
64b ALU delay simulations (0.13 µm technology, Vcc = 1.2V):
  Design                 Delay   Speedup over bulk   Speedup after margining
  Bulk                   351ps   -                   -
  Direct-port SOI        312ps   11%                 9%
  SOI-optimal redesign   286ps   18%                 16%
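The speedup columns in the 0.18 µm and 0.13 µm tables can be cross-checked from the quoted delays. The sketch below assumes that speedup means (bulk delay - SOI delay) / bulk delay, truncated to a whole percent, and that the after-margining figure subtracts the 2% max-delay margin in percentage points; both conventions are inferred from the numbers, not stated in the slides.

```python
import math

# Simulated 64b ALU delays in picoseconds, copied from the slide tables.
DELAYS_PS = {
    "0.18um": {"bulk": 482, "direct_port": 403, "soi_optimal": 380},
    "0.13um": {"bulk": 351, "direct_port": 312, "soi_optimal": 286},
}

# Assumption: margining costs 2 percentage points of speedup.
MARGIN_POINTS = 2


def speedup_percent(bulk_ps, soi_ps):
    """Delay improvement over bulk, in percent."""
    return (bulk_ps - soi_ps) / bulk_ps * 100.0


def speedup_table(node):
    """Map design name to (speedup, speedup after margining) in whole percent."""
    delays = DELAYS_PS[node]
    rows = {}
    for design in ("direct_port", "soi_optimal"):
        raw = math.floor(speedup_percent(delays["bulk"], delays[design]))
        rows[design] = (raw, raw - MARGIN_POINTS)
    return rows
```

Under these assumptions, `speedup_table("0.18um")` gives 16%/14% and 21%/19%, and `speedup_table("0.13um")` gives 11%/9% and 18%/16%, matching the tabulated values.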

32
SOI vs. bulk summary
482ps energy-efficient dynamic 64b ALU in 0.18 µm bulk.
  310ps adder core.
Direct port to 0.18 µm SOI: 14% speedup.
SOI-optimal redesign: 19% speedup.
Floating body can get reverse-biased.
  Preconditioning reduces margin from 10% to 2%.
Scaling to 0.13 µm decreases the maximum PD-SOI speedup to 16%.

33
Goal: shift the E-D curve
High-performance, low-power datapath design.
[Figure: energy vs. delay curve]

34
A 4GHz 130nm Address Generation Unit with 32-bit Sparse-tree Adder Core
Intel Labs
[S. Mathew et al., VLSI Symp. 2002], [S. Mathew et al., JSSC, May 2003]

35
Motivation
AGUs: performance and peak-current limiters.
High activity: thermal hotspot.
Goal: high-performance, energy-efficient design.
[Figure: processor thermal map (temperature in °C, with a 120°C label) showing the execution core, AGU and cache]

36
AGU architecture
Single-cycle latency and throughput.
Effective Address = Base + Index * Scale + (Segment + Displacement)
2-phase address computation.
[Figure: Base, Index (3b shift), Segment and Displacement feed a 3:2 compressor followed by a 32b adder that produces the 32-bit effective address; clocks clk, clk2, clk3]

37
AGU operation: phase 1
Index pre-scaled via 3-bit barrel shifter.
3:2 compressor renders the partial address in carry-save format.
Adder in pre-charge state.
[Figure: AGU datapath with the 3b shift and the 3:2 compressor (carry-save output) highlighted]

38
AGU operation: phase 2
Carry-save to binary format conversion: 2's complement parallel 32-bit adder.
[Figure: AGU datapath with the 32b adder and the effective address output highlighted]
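The two-phase address computation can be sketched as a bitwise model. Assumptions made here for illustration: the scale is applied as a left shift of 0 to 3 bits, Segment + Displacement arrives as a single pre-added operand (as the parentheses in the formula suggest), and all arithmetic wraps at 32 bits. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def compress_3to2(x, y, z):
    """3:2 compressor: three operands in, carry-save (sum, carry) pair out."""
    partial_sum = (x ^ y ^ z) & MASK32
    carry = (((x & y) | (y & z) | (x & z)) << 1) & MASK32
    return partial_sum, carry


def effective_address(base, index, scale_shift, seg_plus_disp):
    """Phase 1: pre-scale and compress. Phase 2: one 32-bit carry-propagate add."""
    if not 0 <= scale_shift <= 3:
        raise ValueError("scale_shift assumed to be 0..3")
    scaled_index = (index << scale_shift) & MASK32
    partial_sum, carry = compress_3to2(base & MASK32, scaled_index,
                                       seg_plus_disp & MASK32)
    return (partial_sum + carry) & MASK32
```

The compressor has no carry chain, so phase 1 is fast regardless of operand width; the only carry-propagating step is the single 32-bit addition in phase 2.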

39
Kogge-Stone adder
Critical path = PG + 5 + XOR = 7 gate stages.
Generate, propagate fanout of 2, 3.
Maximum interconnect spans 16b.
Energy-inefficient.
[Figure: 32-bit Kogge-Stone tree with PG, carry-merge gates and XOR rows]
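The stage count on this slide follows from the tree depth, as the short sketch below shows. It assumes one PG stage, log2(N) carry-merge levels and one sum XOR for Kogge-Stone, and exactly one additional carry stage for Han-Carlson; the function names are illustrative.

```python
def carry_merge_levels(n_bits):
    """Number of carry-merge levels, i.e. ceil(log2(n_bits))."""
    return (n_bits - 1).bit_length()


def kogge_stone_stages(n_bits):
    """PG stage + carry-merge levels + sum XOR."""
    return 1 + carry_merge_levels(n_bits) + 1


def han_carlson_stages(n_bits):
    """Assumed: Kogge-Stone depth plus one extra carry stage."""
    return kogge_stone_stages(n_bits) + 1
```

With these assumptions a 32-bit Kogge-Stone adder has 7 stages, as stated here, and a 64-bit Han-Carlson adder has 9, consistent with the earlier statement that the 7-stage 64-bit sparse-tree core is two stages shorter than Han-Carlson.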

40
Sparse-tree adder architecture
Generate every 4th carry in parallel.
Side-path: 4-bit conditional sum generator.
73% fewer carry-merge gates: energy-efficient.
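A behavioural sketch of the sparse-tree organization is shown below. It is an assumption-laden model rather than the published circuit: group generate/propagate and the conditional sums are computed with integer arithmetic as a shortcut, and the every-4th-carry tree is modelled as a log-depth prefix over the 4-bit groups. The function name `sparse_tree_add` is hypothetical.

```python
def sparse_tree_add(a, b, n=32, group=4):
    """Add two n-bit integers (mod 2**n) using one carry per `group` bits."""
    mask = (1 << n) - 1
    gmask = (1 << group) - 1
    a &= mask
    b &= mask
    ngroups = n // group
    gen, prop, sums = [], [], []
    for k in range(ngroups):
        x = (a >> (k * group)) & gmask
        y = (b >> (k * group)) & gmask
        gen.append((x + y) >> group)               # group generate
        prop.append(1 if (x ^ y) == gmask else 0)  # group propagate
        # Side-path: conditional sums for carry-in 0 and carry-in 1.
        sums.append(((x + y) & gmask, (x + y + 1) & gmask))
    # Sparse carry tree: parallel-prefix merge over groups only.
    dist = 1
    while dist < ngroups:
        new_gen, new_prop = gen[:], prop[:]
        for k in range(dist, ngroups):
            new_gen[k] = gen[k] | (prop[k] & gen[k - dist])
            new_prop[k] = prop[k] & prop[k - dist]
        gen, prop = new_gen, new_prop
        dist *= 2
    # gen[k] is now the carry out of group k; it selects the next group's sum.
    result = 0
    for k in range(ngroups):
        carry_in = gen[k - 1] if k > 0 else 0
        sum0, sum1 = sums[k]
        result |= (sum1 if carry_in else sum0) << (k * group)
    return result
```

Only one carry per 4-bit group passes through the merge tree, which is the source of the reduced carry-merge gate count; the per-bit work moves to the conditional-sum side-path.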

41
Non-critical sum generator
Non-critical path: ripple carry chain.
  Reduced area, energy consumption, leakage.
Generate conditional sums for each bit.
Sparse-tree carry selects the appropriate sum.
[Figure: ripple chain of carry-merge (CM) cells on P_i, G_i+1, (P,G)_i+2, (P,G)_i+3 producing conditional sums Sum_i,0 and Sum_i,1 through XORs; 2:1 muxes controlled by Carry select Sum_i to Sum_i+3]

42
Optimized first-level carry-merge: conditional carry for Cin=0
Carry-merge stage reduces to an inverter.
Conditional carry_0 = G_i#
[Figure: CM cell with carry input 0 and input G_i, output C#_0]

43
Optimized first-level carry-merge: conditional carry for Cin=1
P_i and G_i are correlated.
Conditional carry_1 = P_i#
[Figure: CM cell with carry input 1 and inputs P_i, G_i derived from A_i, B_i, output C#_1]
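The two first-level simplifications can be checked exhaustively. The sketch below assumes OR-form propagate (P = A or B), so that G implies P, which is one reading of "P and G correlated"; it works with active-high signals, whereas the "#" suffix in the slides presumably marks active-low versions. The function names are hypothetical.

```python
from itertools import product


def first_level_conditional_carries(a, b):
    """Return (carry if Cin=0, carry if Cin=1) for a single bit position."""
    g = a & b        # generate
    p = a | b        # propagate, OR form (assumed)
    carry_0 = g      # G + P*0 = G: no merge gate needed
    carry_1 = p      # G + P*1 = G + P = P, because G implies P
    return carry_0, carry_1


def verify_first_level():
    """Compare both conditional carries against plain 1-bit addition."""
    for a, b in product((0, 1), repeat=2):
        carry_0, carry_1 = first_level_conditional_carries(a, b)
        if carry_0 != (a + b) >> 1 or carry_1 != (a + b + 1) >> 1:
            return False
    return True
```

Because each conditional carry equals an existing signal, the first carry-merge level needs no merge logic beyond a buffer or inverter, which shortens the non-critical sum path.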

44
Optimized sum generator
Optimized 1st-level carry-merge.
Optimized non-critical path: 4 stages.
[Figure: sum generator with reduced first-level CM cells on P_i, G_i+1, (P,G)_i+2, (P,G)_i+3, XORs producing conditional sums Sum_i,0 and Sum_i,1, and 2:1 muxes controlled by Carry selecting Sum_i to Sum_i+3]

45
Adder core critical path
Critical path: 7 gate stages, same as Kogge-Stone.
Sparse-tree: single-rail dynamic.
Exploit non-criticality of sum generator: convert to static logic.
Semi-dynamic design.
[Figure: adder inputs to Sum31. Single-rail dynamic sparse-tree path with PG, GG1, GG3, GG7, GG15, GG27 (clocks clk, clk2, clk3); static sum generator with CM0, latch, CM1 and XOR producing Sum31_0 and Sum31_1, selected by C27]

46
1st-level carry-merge: static latch
Holds state in pre-charge phase.
Prevents pre-charging of static stages.

47
Domino-static interface
Sum = Sum0 during pre-charge.
Mux output resolves during evaluation.
[Figure: interface circuit states for clk=0 and clk=1]

48
Sparse-tree architecture
Performance impact (20% speedup):
  33-50% reduced G/P fanouts
  80% reduced wiring complexity
  30% reduction in maximum interconnect
Power impact (56% reduction):
  73% fewer carry-merge gates
  50% reduction in average transistor size

49
Energy-delay space
20% speedup over Kogge-Stone.
56% worst-case energy reduction.
  Scales with activity factor.
[Figure: worst-case energy (pJ) vs. delay (ps) for dynamic Kogge-Stone and semi-dynamic sparse-tree, marking the 4GHz design point and the 20% and 56% differences; 130nm CMOS, 1.2V, 110°C, simulation]

50
Semi-dynamic design
Static sum generators: low switching activity.
71% lower average energy at 10% activity.
[Figure: average energy (pJ) vs. activity factor for dynamic Kogge-Stone and semi-dynamic sparse-tree, marking the 71% difference]

51
Dual-Vt allocation
Exploit non-criticality of side-paths: use high-Vt devices.
0% performance penalty.
56% reduction in active leakage energy.
130nm CMOS, 1.2V, 110°C, simulation:
  Design    Delay   Switching energy   Leakage energy
  Low-Vt    152ps   36pJ               0.9pJ
  Dual-Vt   152ps   34pJ (-6%)         0.4pJ (-56%)

52
Scaling performance
Average transistor size = 3.5 µm.
  Reduces impact of increasing leakage.
80% reduction in wiring complexity.
  Reduces impact of wire resistance.
33% delay scaling, 50% energy reduction:
  Technology   Delay          Switching energy   Leakage energy
  130nm        152ps          36pJ               0.9pJ
  100nm        102ps (-33%)   18pJ (-50%)        0.7pJ (-23%)

53
A 6.5GHz, 130nm Single-ended Dynamic ALU
Intel Labs
[M. Anders et al., ISSCC 2002], [S. Vangal et al., JSSC, November 2002]

54
32-bit ALU/scheduler loop
Performance-critical execution core loop.

55
Han-Carlson ALU organization
Single-rail dynamic 9-stage low-Vt design.

56
Odd-bits CSG sum generation
Final carry-merge: CSG (dual-rail carry output).
Pass-transistor sum XOR.

57
Even-bits CSG sum generation
Domino-compatible sum.
Dual-rail sum from single-ended g inputs.

58
Die micro-photograph
130nm 6-metal dual-Vt CMOS.
Scheduler: 210μm x 210μm.
ALU: 84μm x 336μm.
[Figure: die photograph with the scheduler and ALU labelled]

59
Delay and power measurements
6.5GHz at 1.1V, 25°C.
Power: 120mW total, 15mW leakage.
Scalable to 10GHz at 1.7V, 25°C.
[Figure: measured delay and power plot at 25°C with the design target marked]

60
Improvements over dual-rail domino
  Area                  50%
  Performance (delay)   10%
  Active leakage        40%
  Robustness            equal
CSG improves both area and performance.
Leakage reduced by eliminating dual-rail logic.
Robustness not compromised.

61
Summary
4GHz AGU in 1.2V, 130nm technology:
  Sparse-tree adder architecture described: 20% speedup and 56% energy reduction.
  Semi-dynamic design: energy scales with switching activity.
  Dual-Vt non-critical paths: low active leakage energy.
6.5GHz ALU and scheduler loop at 1.1V, 25°C:
  Scalable to 10GHz at 1.7V, 25°C.
