
Published by Angelo Crowther. Modified over 2 years ago.

1

2
2 Explaining The Gap Between ASIC and Custom Power: A Custom Perspective
Andrew Chang, Cadence Design Systems*
William J. Dally, Computer Systems Laboratory, Stanford University
* Work done while author was at Stanford

3
3 Design Tradeoffs: Power vs. Performance
1. Move to More Energy Efficient Operating Point: More Energy Efficient w/ Custom
[Figure: power vs. performance plot with labels 1, 2, and 3; axes: Power, Performance]

4
4 Design Tradeoffs: Power vs. Performance
1. Move to More Energy Efficient Operating Point: More Energy Efficient w/ Custom
2. Trade Performance for Power: Larger Range w/ Custom
[Figure: power vs. performance plot with labels 1, 2, and 3; axes: Power, Performance]

5
5 Design Tradeoffs: Power vs. Performance
1. Move to More Energy Efficient Operating Point: More Energy Efficient w/ Custom
2. Trade Performance for Power: Larger Range w/ Custom
3. Move to Different Power vs. Performance Curve: More Architectural Choice with Custom
[Figure: power vs. performance plot with labels 1, 2, and 3; axes: Power, Performance]

6
6 Dynamic Power Dissipation
$P_{dyn} = \alpha C V_{dd}^2 f = \alpha E_{circuit} f$
Reduce V_dd: static, dynamic, voltage islands, power gating
Reduce α and/or f: clock gating, block enables, bus encoding, glitch identification and elimination
Reduce E_circuit: engineer interconnects, increase circuit efficiency, subthreshold circuit techniques
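A minimal sketch of the dynamic power relation on this slide, showing why lowering V_dd is listed first: power falls with the square of the supply but only linearly with frequency. The activity factor `alpha` is this sketch's assumption for the quantity the slide says to reduce "and/or f", and the operating point is hypothetical, chosen only for illustration.

```python
def dynamic_power(alpha, c_farads, vdd, freq_hz):
    """Switching power in watts: alpha * C * Vdd^2 * f."""
    return alpha * c_farads * vdd ** 2 * freq_hz


# Hypothetical operating point (not taken from the slides).
base = dynamic_power(alpha=0.1, c_farads=1e-9, vdd=1.2, freq_hz=500e6)

# Halving the supply voltage cuts dynamic power to one quarter.
half_vdd = dynamic_power(alpha=0.1, c_farads=1e-9, vdd=0.6, freq_hz=500e6)

# Halving the clock frequency cuts dynamic power to one half.
half_freq = dynamic_power(alpha=0.1, c_farads=1e-9, vdd=1.2, freq_hz=250e6)

print(f"base={base:.4f} W, half Vdd={half_vdd:.4f} W, half f={half_freq:.4f} W")
```

Note that halving f alone leaves the energy per operation unchanged; only the V_dd, activity, and E_circuit reductions lower it.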

7
7 Static Power Dissipation
$P_{static} = V_{dd}(I_{sub} + I_{ox})$
$I_{sub} = K_1 W e^{-V_t/(n V_\theta)} (1 - e^{-V_{gs}/V_\theta})$
$I_{ox} = K_2 W (V_{gs}/t_{ox})^2 e^{-\alpha t_{ox}/V_{gs}}$
With K_1, K_2, n, and α experimentally determined
Reduce V_dd: static, dynamic, voltage islands, power gating
Increase effective V_t: substituting high-threshold devices, transistor stacking, static and active body bias
Reduce effective W: reduce number and size of devices in design
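The leakage expressions above can be sketched as code to show why raising the effective V_t is so effective: subthreshold current depends exponentially on it. This is an illustrative sketch only. The names `v_theta` (thermal voltage) and `alpha` (oxide fitting constant) are this sketch's reading of symbols that are incomplete in the slide text, and every numeric constant below is hypothetical.

```python
import math


def subthreshold_current(k1, w, vt, n, v_theta, v):
    """I_sub = K1 * W * exp(-Vt / (n * V_theta)) * (1 - exp(-V / V_theta))."""
    return k1 * w * math.exp(-vt / (n * v_theta)) * (1.0 - math.exp(-v / v_theta))


def gate_oxide_current(k2, w, v, tox, alpha):
    """I_ox = K2 * W * (V / t_ox)^2 * exp(-alpha * t_ox / V)."""
    return k2 * w * (v / tox) ** 2 * math.exp(-alpha * tox / v)


def static_power(vdd, i_sub, i_ox):
    """P_static = Vdd * (I_sub + I_ox)."""
    return vdd * (i_sub + i_ox)


# Hypothetical fitting constants and device parameters.
common = dict(k1=1e-6, w=1.0, n=1.5, v_theta=0.026, v=1.2)
low_vt = subthreshold_current(vt=0.3, **common)
high_vt = subthreshold_current(vt=0.4, **common)

# A 100 mV increase in Vt reduces I_sub by exp(0.1 / (n * V_theta)).
reduction = low_vt / high_vt
print(f"I_sub reduction for +100 mV Vt: {reduction:.1f}x")

# Hypothetical oxide constants, only to show the pieces combining.
i_ox = gate_oxide_current(k2=1e-12, w=1.0, v=1.2, tox=2.0, alpha=5.0)
print(f"P_static at low Vt: {static_power(1.2, low_vt, i_ox):.3e} W")
```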

8
8 Which Design Is More Efficient?
0.7 µm CMOS, 173MHz chip w/ 460K T’s
0.18 µm CMOS, 10kHz chip w/ 640K T’s

9
9 Which Design Is More Efficient?
0.7 µm CMOS, 173MHz chip w/ 460K T’s: Vdd (typ) = 3.3V, Vdd (min) = 1.1V
0.18 µm CMOS, 10kHz chip w/ 640K T’s: Vdd (max) = 1.8V, Vdd (min) = 0.18V

10
10 Which Design Is More Efficient?
0.7 µm CMOS, 173MHz chip w/ 460K T’s: Vdd (typ) = 3.3V, Vdd (min) = 1.1V, Power = 845mW
0.18 µm CMOS, 10kHz chip w/ 640K T’s: Vdd (max) = 1.8V, Vdd (min) = 0.18V, Power = 1.6mW

11
11 Talk Outline
- Normalized Metric: E_bit
- Effect of Architecture
- ASIC vs. Custom Building Blocks
- Achievable Energy Efficiency
- 16b 1024 FFT Example
- Answer to “Which Design is More Efficient”


13
13 Defining E_bit
$E_{bit} = C_{bit} V_{dd}^2$
$C_{bit} = 4 \times 2\,\mathrm{fF/\mu m} \times W_{min}$
Energy needed to write a 1-bit SRAM cell
Approximates minimum useful capacitance
The ratio of E_bit to the energy for a range of circuits remains largely constant with technology scaling

14
14 Technology Scaling for E_bit
Distances are given in a normalized unit equal to the M1 pitch.
[Table: Technology 0.5 µm vs. 0.18 µm; flattened values: 5818, 5.7, 18 (µm²)]

15
15 Technology Scaling for Nand2
Distances are given in a normalized unit equal to the M1 pitch.
4 units = 2.24 µm, 8 units = 4.48 µm
[Figure: NAND2 cell with inputs A, B and output YN]

16
16 Applying E_bit

Energy          180nm   130nm   90nm   65nm
E_bit (fJ)      3.3     1.4     0.5    0.36

Relative        180nm   130nm   90nm   65nm
E_bit           1       1       1      1
Move 1b 1.5mm   268     367     467    714

Relative entries that span several technology columns, in the order listed:
1b FO4: ~10
1b SP-SRAM: 0.3-7
1b RF: 4-20+
1b DFF: 20-30+, 15-30+, 10-30+
1b Nand2: 11-30 (typ 19), 5-30 (typ 14)
Move 1b 1000 units (M1 pitches): ~100
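As a rough illustration of how the relative numbers on this slide convert back to absolute energy, the sketch below multiplies the "Move 1b 1.5mm" row by the E_bit value for each node. The dictionaries are this sketch's reading of the flattened table, and the helper name `to_femtojoules` is hypothetical.

```python
# E_bit in femtojoules per technology node, as listed on the slide.
E_BIT_FJ = {"180nm": 3.3, "130nm": 1.4, "90nm": 0.5, "65nm": 0.36}

# Energy to move one bit 1.5 mm, in units of E_bit, as listed on the slide.
MOVE_1B_1P5MM_REL = {"180nm": 268, "130nm": 367, "90nm": 467, "65nm": 714}


def to_femtojoules(relative_energy, node):
    """Convert an energy expressed in E_bit units to femtojoules."""
    return relative_energy * E_BIT_FJ[node]


absolute_fj = {
    node: to_femtojoules(rel, node) for node, rel in MOVE_1B_1P5MM_REL.items()
}

for node, energy in absolute_fj.items():
    print(f"{node}: {MOVE_1B_1P5MM_REL[node]} E_bit = {energy:.1f} fJ")
```

In E_bit units the cost of moving a bit a fixed 1.5 mm rises from 268 to 714 across the four nodes, which is the trend the normalized row makes visible.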

17
17 Talk Outline
- Normalized Metric: E_bit
- Effect of Architecture
- ASIC vs. Custom Building Blocks
- Achievable Energy Efficiency
- 16b 1024 FFT Example
- Answer to “Which Design is More Efficient”


19
19 Effect of Architecture
NVIDIA GeForceFX (Design Style: ASIC): 400MHz, 125M Transistors
Intel Pentium-4 (Design Style: Custom): 2600MHz, 55M Transistors

20
20 Effect of Architecture
NVIDIA GeForceFX (Design Style: ASIC): 400MHz, 125M Transistors, ~20 Watts
Intel Pentium-4 (Design Style: Custom): 2600MHz, 55M Transistors, ~60 Watts

21
21 Effect of Architecture
ASIC Architecture: 6x Efficiency
NVIDIA GeForceFX (Design Style: ASIC): 400MHz, 125M Transistors, ~20 Watts: 10 GFlops & 13 GB/s
Intel Pentium-4 (Design Style: Custom): 2600MHz, 55M Transistors, ~60 Watts: 5 GFlops & 5 GB/s

22
22 Custom Circuits: 9x (7x) Efficiency
NVIDIA GeForceFX (Design Style: Custom): 400MHz, 125M Transistors, ~3 Watts: 10 GFlops & 13 GB/s, Vdd = 0.65V
Intel Pentium-4 (Design Style: Custom): 2600MHz, 55M Transistors, ~60 Watts: 5 GFlops & 5 GB/s, Vdd = 1.3V

23
23 Combined Architecture and Circuits
40x+ Improvement, but 1.5 Years vs. 3+ Years
NVIDIA GeForceFX (Design Style: Custom): 400MHz, 125M Transistors, ~3 Watts: 10 GFlops & 13 GB/s, Vdd = 0.65V
Intel Pentium-4 (Design Style: Custom): 2600MHz, 55M Transistors, ~60 Watts: 5 GFlops & 5 GB/s, Vdd = 1.3V
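The efficiency factors quoted on slides 21 to 23 can be checked with a few lines of arithmetic. The sketch below uses GFlops per watt as the efficiency measure, which is this sketch's assumption about how the ratios were formed; the power and throughput figures are the ones on the slides, and the function name is hypothetical.

```python
def gflops_per_watt(gflops, watts):
    """Throughput efficiency in GFlops per watt."""
    return gflops / watts


pentium4 = gflops_per_watt(gflops=5, watts=60)        # custom design, ~60 W
geforce_asic = gflops_per_watt(gflops=10, watts=20)   # ASIC design, ~20 W
geforce_custom = gflops_per_watt(gflops=10, watts=3)  # slide's ~3 W estimate at Vdd = 0.65V

architecture_gain = geforce_asic / pentium4      # architecture alone
circuit_gain = geforce_custom / geforce_asic     # 20 W -> 3 W at equal throughput
combined_gain = geforce_custom / pentium4        # architecture plus circuits

print(f"architecture: {architecture_gain:.1f}x")
print(f"circuits:     {circuit_gain:.1f}x")
print(f"combined:     {combined_gain:.1f}x")
```

Under this measure the architecture gain comes out at 6x and the combined gain at 40x, matching the slide titles; the 20 W to 3 W step is about 6.7x, consistent with the parenthesized 7x.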

24
24 Talk Outline
- Normalized Metric: E_bit
- Effect of Architecture
- ASIC vs. Custom Building Blocks
- Achievable Energy Efficiency
- 16b 1024 FFT Example
- Answer to “Which Design is More Efficient”


26
26 ASIC vs. Custom
ASIC Methods: provide only coarse-grain control (100K+ gates), but require much less effort and historically scale with complexity
Custom Methods: offer fine-grain control (individual transistors & gates), but require large effort and scale poorly with complexity
Exploits Design Structure
Exploits Circuit Techniques

27
27 Custom Methods Emphasize Fine-Grain Manual Control + Custom Library

28
28 Custom Methods Emphasize Fine-Grain Manual Control + Custom Library Operation and Performance Characterized for the Specific Case

29
29 ASIC Methods Substitute Coarse-Grain Control Automation + Generic Library

30
30 ASIC Methods Substitute Coarse-Grain Control Automation + Generic Library Operation and Performance Characterized for the Typical/Generic Case

31
31 ASIC Focus on 100K+ Gates: Lost Opportunities to Exploit Structure
Designs reuse similar basic building blocks
Building blocks are 1-10K gates, not 100K+ gates:
- 64-bit adder: 1K gates
- 64x64 RF: 2K gates
- 64x64 multiplier: 20K gates
Opportunities to exploit these structures are lost when the design is viewed in large chunks

32
32 Different Architectures, Similar Building Blocks
[Figure: two chip floorplans with labeled blocks: EX, RF, SRAM, XCVRS, Bus, Bank 0/1, CLST 0-2, NIF/Router, Memory Switch, Cluster Switch, EMI, LTLB]
1998 “MAP” 64b Microprocessor - 5M T’s (MIT/Stanford)
2002 “Imagine” 32b Stream Processor - 22M T’s (Stanford)

33
33 Significant Structure Exists Within 100K-gates
[Figure: two chip floorplans with labeled blocks: EX, RF, SRAM, XCVRS, Bus, Bank 0/1, CLST 0-2, NIF/Router, Memory Switch, Cluster Switch, EMI, LTLB]
1998 “MAP” 64b Microprocessor - 5M T’s (MIT/Stanford)
2002 “Imagine” 32b Stream Processor - 22M T’s (Stanford)

34
34 Energy of 100K-gate Equivalent
ASIC (N2) = 1400K E_bits (typ)
Custom Logic = 424K E_bits*
SRAM (small) = 1085K E_bits
SRAM (med) = 155K E_bits
SRAM (large) = 50K E_bits
*Based on data extracted from Intel McKinley
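To make the comparison on this slide concrete, the sketch below expresses each entry as a ratio against the ASIC figure. The numbers are the ones listed on the slide (in thousands of E_bits); the variable names are hypothetical.

```python
# Energy of a 100K-gate equivalent, in thousands of E_bits (from the slide).
ENERGY_K_EBITS = {
    "ASIC (N2)": 1400,
    "Custom Logic": 424,
    "SRAM (small)": 1085,
    "SRAM (med)": 155,
    "SRAM (large)": 50,
}

asic_energy = ENERGY_K_EBITS["ASIC (N2)"]

# How many times less energy each style uses than the ASIC NAND2 baseline.
advantage = {name: asic_energy / value for name, value in ENERGY_K_EBITS.items()}

for name, ratio in advantage.items():
    print(f"{name:13s} {ratio:5.2f}x")
```

The custom-logic entry works out to roughly 3.3x less energy than the ASIC baseline.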

35
35 Exploiting Circuit Techniques
Custom circuits more efficient
- Reduced parasitics
- 1.7x circuit techniques and flops
- 1.4x libraries
- 1.4x due to engineering interconnects
Subthreshold Circuits
- Low performance but ultra-low power
- Requires architecture, gates, memories, CAD tools

36
36 Relating Power to Performance
CV/I, I_dsat, t_FO4
$I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$
$t_{FO4} = K_4 [C_{eff} V_{dd} / I_{dsat}]$ ($K_4 \approx 13.5$)

37
37 Relating Power to Performance
Relating V_dd and V_t to t_FO4
$I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$
$t_{FO4} = K_4 [C_{eff} V_{dd} / I_{dsat}]$ ($K_4 \approx 13.5$)

38
38 Relating Power to Performance
Correlation to Reported Foundry Data

Technology   Node     CV/I est (ps)   CV/I reported (ps)   t_FO4 est (ps)
Foundry A    180-nm   3.94            3.70                 53
Foundry A    130-nm   2.55            2.17                 34
Foundry A    90-nm    1.85            2.04                 25
Foundry A    65-nm    1.45            1.00                 20

$I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$
$t_{FO4} = K_4 [C_{eff} V_{dd} / I_{dsat}]$ ($K_4 \approx 13.5$)
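The FO4 estimates in this table follow from the delay relation on the slide. A short sketch, assuming the "CV/I est" column is the bracketed C_eff V_dd / I_dsat term and using K_4 = 13.5 as given; the dictionary and function names are hypothetical.

```python
K4 = 13.5  # scaling constant quoted on the slide

# Estimated CV/I in picoseconds per node, as read from the table.
CV_OVER_I_EST_PS = {"180-nm": 3.94, "130-nm": 2.55, "90-nm": 1.85, "65-nm": 1.45}

# FO4 delay estimates in picoseconds listed in the same table.
T_FO4_LISTED_PS = {"180-nm": 53, "130-nm": 34, "90-nm": 25, "65-nm": 20}


def t_fo4_ps(cv_over_i_ps, k4=K4):
    """t_FO4 = K4 * (C_eff * V_dd / I_dsat), with CV/I given in picoseconds."""
    return k4 * cv_over_i_ps


computed = {node: t_fo4_ps(value) for node, value in CV_OVER_I_EST_PS.items()}

for node, value in computed.items():
    print(f"{node}: computed {value:.1f} ps, listed {T_FO4_LISTED_PS[node]} ps")
```

Each computed value lands within half a picosecond of the listed estimate, which supports reading the last column as K_4 times the estimated CV/I.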

39
39 Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)

Technique                     Power Type   Custom vs. ASIC   Energy   Scope
Circuit Styles and Flops      Dynamic      1.7               0.815    Logic
Libraries + V_dd Scaling      Dynamic      1.4               0.855    Logic
SRAM Circuits                 Dynamic      2                 0.95     SRAM
Interconnect + V_dd Scaling   Dynamic      1.4               0.855    Interconnect

40
40 Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)

Technique               Power Type   Custom vs. ASIC   Energy   Scope
Bit Encoding            Dynamic      1                 0.84     Interconnect
Clock Gating            Dynamic      1                 0.84     Chip
Frequency Scaling       Dynamic      1                 0.5      Chip
Subthreshold Circuits   Dynamic      N/A               0.062    Chip

41
41 Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)

Technique                         Power Type   Custom vs. ASIC   Energy   Scope
V_dd Scaling                      Static       1                 0.79     Chip
MT-CMOS                           Static       1                 0.5      Chip
Stacking and input state vector   Static       1.4               0.7      Chip
Body Bias                         Static       2                 0.5      Chip
Supply Gating                     Static       10                0.1      Chip

Typically only one of the last three (stacking, body bias, supply gating) is applied.

42
42 Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)

Type          130-nm ASIC (Custom)   90-nm ASIC (Custom)
Net Dynamic   45% (32%)              28% (20%)
Net Static    8% (4%)                20% (10%)
Total         53% (36%)              48% (30%)

130nm uP assumes 80% Dynamic and 20% Static
90nm uP assumes 50% Dynamic and 50% Static
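The totals on this slide are the sum of the net dynamic and net static rows. The sketch below checks that, and also back-calculates the fraction of each component that remains after optimization by dividing the net percentage by the assumed starting share. That "implied factor" is this sketch's own derivation, not a figure stated on the slide, and all names are hypothetical.

```python
# Percentages as listed on the slide: (net dynamic, net static).
SCENARIOS = {
    "130-nm": {"dynamic_share": 0.80, "static_share": 0.20,
               "asic": (45, 8), "custom": (32, 4)},
    "90-nm": {"dynamic_share": 0.50, "static_share": 0.50,
              "asic": (28, 20), "custom": (20, 10)},
}


def total_percent(net_dynamic, net_static):
    """Remaining power as a percentage of the unoptimized design."""
    return net_dynamic + net_static


def implied_factor(net_percent, share):
    """Fraction of a component left after optimization (back-calculated)."""
    return net_percent / (100.0 * share)


totals = {}
for node, data in SCENARIOS.items():
    for style in ("asic", "custom"):
        dyn, stat = data[style]
        totals[(node, style)] = total_percent(dyn, stat)
        print(
            f"{node} {style}: total {totals[(node, style)]}%, "
            f"dynamic x{implied_factor(dyn, data['dynamic_share']):.2f}, "
            f"static x{implied_factor(stat, data['static_share']):.2f}"
        )
```

The back-calculated factors come out nearly identical at both nodes, so the difference between the 130-nm and 90-nm totals is driven by the assumed dynamic/static split.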

43
43 Talk Outline
- Normalized Metric: E_bit
- Effect of Architecture
- ASIC vs. Custom Building Blocks
- Achievable Energy Efficiency
- 16b 1024 FFT Example
- Answer to “Which Design is More Efficient”


45
45 16b 1024 point FFT
Generally, k N log N operations (complex multiplies) with pre-computation
Radix-2, Radix-4, etc. implementations
Decimation in time and/or decimation in frequency
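As an illustration of the operation count on this slide, here is a minimal recursive radix-2 decimation-in-time FFT that also counts twiddle-factor multiplies. It is a textbook sketch, not the algorithm used by any of the chips discussed; it computes twiddles on the fly rather than pre-computing them, counts every twiddle multiply including trivial ones, and requires a power-of-two input length.

```python
import cmath


def fft_radix2(x):
    """Radix-2 decimation-in-time FFT.

    Returns (spectrum, number of complex twiddle multiplies performed).
    The input length must be a power of two.
    """
    n = len(x)
    if n == 1:
        return list(x), 0
    # Split into even- and odd-indexed samples and transform each half.
    even, mults_even = fft_radix2(x[0::2])
    odd, mults_odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # One complex multiply per butterfly.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out, mults_even + mults_odd + n // 2


samples = [complex(i % 7, 0) for i in range(1024)]
spectrum, mults = fft_radix2(samples)
print(f"1024-point FFT: {mults} complex multiplies")
```

For this formulation the count is (N/2) log2 N, so k = 1/2 in the slide's k N log N expression; other radices and pre-computation choices change k.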

46
46 Range of Implementations
- MIT FFT (2005): 0.18 µm CMOS, 628K T’s, 10kHz. Architecture and subthreshold circuits, 180mV operation
- Spiffee (1999): 0.7 µm CMOS, 460K T’s, 173MHz. Cached FFT architecture and algorithm, 1.1V operation
- SA-1100 (1999): 0.35 µm CMOS, 2.6M T’s, 74MHz. Commercial embedded processor, custom circuits, 1.5V operation
- Imagine (2003): 0.15 µm CMOS, 22M T’s, 232MHz. Streaming media processor, tiled standard cells, 1.2V operation
- Stratix IS25F627C8 (2005): 0.13 µm CMOS, 3.9K logic elements, 123K memory bits, 24 DSP blocks, 272MHz. Commercial FPGA co-processor
- Intel P4 (2003): 0.13 µm CMOS, 3GHz, SSE. Commercial general purpose processor, custom circuits, 1.5V operation
- TI ‘C6416 (2003): 0.13 µm CMOS, 720MHz. Commercial digital signal processor

47
47 E_bit Energy: 16b 1024 point FFT

Design      Fab (nm)   V_dd (V)   MHz    mW      Cycles
MIT FFT     180        1.8        0.01   1.6     95
Spiffee     700        3.3        173    845     5190
SA-1100     350        2          74     39      31500
Imagine     150        1.5        232    4000    3708
Stratix     130        1.3        275    884     1291
Intel P4    130        1.2        3000   51200   71680
TI 'C6416   130        1.2        720    1200    6526

48
48 E_bit Energy: 16b 1024 point FFT

Design      EDP (rel norm)   E_bit (fJ)   E_fft (nJ)   Normalized to E_bit (1e6)   Energy Ratio
MIT FFT     143              3.3          154          47                          1
Spiffee     1                91           25350        277                         6
SA-1100     283              4.2          16601        3953                        85
Imagine     148              2.2          63931        29726                       637
Stratix     24               1.4          4149         2964                        64
Intel P4    12548            1.4          1E+06        873813                      18591
TI 'C6416   27               1.4          10877        7769                        166
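The energy columns on slides 47 and 48 appear to be linked by two simple relations: energy per FFT is power times cycles divided by clock rate, and the normalized figure is that energy divided by E_bit. The sketch below checks this for five designs. The row values are this sketch's reading of the flattened tables, so treat the column boundaries as an assumption. The MIT FFT row is left out because its power and cycle count as transcribed do not reproduce its listed energy this way, and the Intel P4 row is left out because its energy is listed only as 1E+06.

```python
# design: (power_mW, cycles, clock_MHz, e_bit_fJ, listed_e_fft_nJ)
ROWS = {
    "Spiffee": (845, 5190, 173, 91, 25350),
    "SA-1100": (39, 31500, 74, 4.2, 16601),
    "Imagine": (4000, 3708, 232, 2.2, 63931),
    "Stratix": (884, 1291, 275, 1.4, 4149),
    "TI 'C6416": (1200, 6526, 720, 1.4, 10877),
}


def fft_energy_nj(power_mw, cycles, clock_mhz):
    """Energy per FFT in nJ: mW * cycles / MHz = 1e-3 W * cycles / 1e6 Hz = 1e-9 J."""
    return power_mw * cycles / clock_mhz


def normalized_mega_ebits(e_fft_nj, e_bit_fj):
    """Energy in units of one million E_bits: nJ / fJ = 1e6."""
    return e_fft_nj / e_bit_fj


results = {}
for design, (p_mw, cycles, f_mhz, e_bit, listed) in ROWS.items():
    e_fft = fft_energy_nj(p_mw, cycles, f_mhz)
    results[design] = (e_fft, normalized_mega_ebits(e_fft, e_bit))
    print(f"{design:10s} E_fft = {e_fft:9.0f} nJ (listed {listed}), "
          f"normalized = {results[design][1]:8.0f} x 1e6 E_bit")
```

Normalizing by E_bit is what removes the process and supply-voltage advantage of the newer designs before they are compared.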

49
49 Which Design Is More Efficient?
0.7 µm CMOS, 173MHz chip w/ 460K T’s: Vdd (typ) = 3.3V, Vdd (min) = 1.1V, Power = 845mW
0.18 µm CMOS, 10kHz chip w/ 640K T’s: Vdd (max) = 1.8V, Vdd (min) = 0.18V, Power = 1.6mW

50
50 Which Design Is More Efficient? Depends on the Metric!
0.7 µm CMOS, 173MHz chip w/ 460K T’s: Vdd (typ) = 3.3V, Vdd (min) = 1.1V, Power = 845mW. EDP 143x better
0.18 µm CMOS, 10kHz chip w/ 640K T’s: Vdd (max) = 1.8V, Vdd (min) = 0.18V, Power = 1.6mW. Absolute energy 6x better

51
51 Summary
- The normalized metric E_bit enables meaningful comparisons across designs and technologies
- Custom designers can exploit a wide range of optimizations: enabling architecture with circuits and circuits with architecture
- Custom designs can readily achieve a 3x advantage in energy, with the potential for over 10x
- Selective application of custom techniques, and automated support for performance characterization at specific instead of generic operating points, can enable ASIC designers to begin to bridge this power gap

52
52 Back-Up Slides

53
53 ASICs Rely on General Optimization Techniques
Focus: Improve the Average Case
Partitioning: hypergraph min-cut, ratio cut
Solutions: move-based, geometric & combinatorial forms, clustering
Hypergraph H(V,E), where E = {e1, e2, ...} are the nets
Vertex & edge weights used to encode costs
[Figure: a circuit and its hypergraph, with vertices V1-V5 and nets e1-e8]
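To make the partitioning objective on this slide concrete, the sketch below evaluates the cut cost of a two-way hypergraph partition: a net is cut when its pins land in more than one partition. The netlist, weights, and partitions are hypothetical and are not the ones drawn in the slide's figure; a real min-cut partitioner would search over partitions rather than just score them.

```python
def cut_cost(nets, part_of, weights=None):
    """Sum the weights of nets whose pins span more than one partition.

    nets: mapping of net name to the vertices it connects.
    part_of: mapping of vertex to partition id.
    weights: optional mapping of net name to cost (default 1 per net).
    """
    weights = weights or {}
    return sum(
        weights.get(name, 1)
        for name, pins in nets.items()
        if len({part_of[v] for v in pins}) > 1
    )


# Hypothetical netlist: each net (hyperedge) may connect more than two vertices.
nets = {
    "e1": ("V1", "V2"),
    "e2": ("V1", "V3"),
    "e3": ("V2", "V3", "V4"),
    "e4": ("V4", "V5"),
}

# Two candidate bipartitions of the five vertices.
partition_a = {"V1": 0, "V2": 0, "V3": 0, "V4": 1, "V5": 1}
partition_b = {"V1": 0, "V2": 1, "V3": 0, "V4": 1, "V5": 0}

print("cut A:", cut_cost(nets, partition_a))
print("cut B:", cut_cost(nets, partition_b))
print("cut A, weighted:", cut_cost(nets, partition_a, weights={"e3": 5}))
```

Because the objective only scores cut nets in aggregate, it does not distinguish a regular structure such as a multiplier array from random logic of the same size.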

54
54 Designs with Structure Do Not Exhibit Average Characteristics
64b Multiplier (half-array): Clear Disparity in Resource Usage
[Figure: routing density]
