# Explaining the Gap Between ASIC and Custom Power: A Custom Perspective

## Presentation transcript

2 Explaining the Gap Between ASIC and Custom Power: A Custom Perspective. Andrew Chang (Cadence Design Systems*) and William J. Dally (Computer Systems Laboratory, Stanford University). *Work done while the author was at Stanford.

3 Design Tradeoffs: Power vs. Performance

1. Move to a more energy-efficient operating point (more energy efficient with custom)

[Figure: power vs. performance curves with moves 1, 2, and 3 marked]

4 Design Tradeoffs: Power vs. Performance

1. Move to a more energy-efficient operating point (more energy efficient with custom)
2. Trade performance for power (larger range with custom)

[Figure: power vs. performance curves with moves 1, 2, and 3 marked]

5 Design Tradeoffs: Power vs. Performance

1. Move to a more energy-efficient operating point (more energy efficient with custom)
2. Trade performance for power (larger range with custom)
3. Move to a different power vs. performance curve (more architectural choice with custom)

[Figure: power vs. performance curves with moves 1, 2, and 3 marked]

6 Dynamic Power Dissipation

$$P_{dyn} = \alpha C V_{dd}^2 f = \alpha E_{circuit} f$$

- Reduce $V_{dd}$: static, dynamic, voltage islands, power gating
- Reduce $\alpha$ (the switching activity) and/or $f$: clock gating, block enables, bus encoding, glitch identification and elimination
- Reduce $E_{circuit}$: engineer interconnects, increase circuit efficiency, subthreshold circuit techniques
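The dynamic power relation on this slide can be explored numerically. The sketch below assumes the symbol lost from the transcript is the usual switching-activity factor (called `alpha` here), and every operating-point value is hypothetical, chosen only to show how each lever scales the result.

```python
def dynamic_power(alpha: float, c_farads: float, vdd: float, f_hz: float) -> float:
    """Dynamic power in watts: P_dyn = alpha * C * Vdd^2 * f."""
    return alpha * c_farads * vdd ** 2 * f_hz


# Hypothetical operating point (not taken from the slides).
base = dynamic_power(alpha=0.1, c_farads=1e-9, vdd=1.8, f_hz=100e6)
half_vdd = dynamic_power(alpha=0.1, c_farads=1e-9, vdd=0.9, f_hz=100e6)
half_activity = dynamic_power(alpha=0.05, c_farads=1e-9, vdd=1.8, f_hz=100e6)

print(f"halving Vdd saves      {base / half_vdd:.1f}x")
print(f"halving activity saves {base / half_activity:.1f}x")
```

Because $V_{dd}$ enters quadratically, supply scaling is the strongest of the three levers in this model, while activity and frequency reductions scale power only linearly.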

7 Static Power Dissipation

$$P_{static} = V_{dd}\,(I_{sub} + I_{ox})$$

$$I_{sub} = K_1 W e^{-V_t/(nV_\theta)}\left(1 - e^{-V_{gs}/V_\theta}\right)$$

$$I_{ox} = K_2 W \left(\frac{V_{gs}}{t_{ox}}\right)^2 e^{-\alpha t_{ox}/V_{gs}}$$

with $K_1$, $K_2$, $n$, and $\alpha$ experimentally determined, and $V_\theta$ the thermal voltage.

- Reduce $V_{dd}$: static, dynamic, voltage islands, power gating
- Increase effective $V_t$: substituting high-threshold devices, transistor stacking, static and active body bias
- Reduce effective $W$: reduce the number and size of devices in the design
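A small sketch of the leakage model may help show why raising the effective $V_t$ is so effective. It assumes the two symbols lost from the transcript are the thermal voltage (`v_theta`) and a fitted oxide-tunneling parameter (`alpha`). All numeric values, including `K1`, `n`, and the device sizes, are hypothetical placeholders rather than data from the talk.

```python
import math

V_THETA = 0.0259  # thermal voltage kT/q near room temperature, in volts


def i_sub(k1, w, vt, vgs, n, v_theta=V_THETA):
    """Subthreshold leakage: K1 * W * exp(-Vt/(n*Vtheta)) * (1 - exp(-Vgs/Vtheta))."""
    return k1 * w * math.exp(-vt / (n * v_theta)) * (1.0 - math.exp(-vgs / v_theta))


def i_ox(k2, w, vgs, tox, alpha):
    """Gate-oxide leakage: K2 * W * (Vgs/tox)^2 * exp(-alpha * tox / Vgs)."""
    return k2 * w * (vgs / tox) ** 2 * math.exp(-alpha * tox / vgs)


def static_power(vdd, isub, iox):
    """P_static = Vdd * (I_sub + I_ox)."""
    return vdd * (isub + iox)


# Hypothetical device: compare a low-Vt and a high-Vt transistor.
low_vt = i_sub(k1=1.0, w=1.0, vt=0.30, vgs=1.2, n=1.5)
high_vt = i_sub(k1=1.0, w=1.0, vt=0.40, vgs=1.2, n=1.5)
print(f"100 mV higher Vt cuts I_sub by {low_vt / high_vt:.1f}x")
```

The exponential dependence on $V_t$ is the reason high-threshold substitution, stacking, and body bias appear together under one bullet: each raises the effective threshold.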

8 Which Design Is More Efficient?

- 0.7 µm CMOS, 173 MHz chip with 460K transistors
- 0.18 µm CMOS, 10 kHz chip with 640K transistors

9 Which Design Is More Efficient?

- 0.7 µm CMOS, 173 MHz chip with 460K transistors
  - Vdd (typ) = 3.3 V, Vdd (min) = 1.1 V
- 0.18 µm CMOS, 10 kHz chip with 640K transistors
  - Vdd (max) = 1.8 V, Vdd (min) = 0.18 V

10 Which Design Is More Efficient?

- 0.7 µm CMOS, 173 MHz chip with 460K transistors
  - Vdd (typ) = 3.3 V, Vdd (min) = 1.1 V
  - Power = 845 mW
- 0.18 µm CMOS, 10 kHz chip with 640K transistors
  - Vdd (max) = 1.8 V, Vdd (min) = 0.18 V
  - Power = 1.6 mW

11 Talk Outline

- Normalized Metric: $E_{bit}$
- Effect of Architecture
- ASIC vs. Custom
  - Building Blocks
  - Achievable Energy Efficiency
- 16b 1024 FFT Example
- Answer to "Which Design Is More Efficient?"

13 Defining $E_{bit}$

$$E_{bit} = C_{bit} V_{dd}^2$$

$$C_{bit} = 4 \times 2\,\mathrm{fF/\mu m} \times W_{min}$$

- Energy needed to write a 1-bit SRAM cell
- Approximates the minimum useful capacitance
- The ratio of $E_{bit}$ to the energy for a range of circuits remains largely constant with technology scaling
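As a rough illustration of the definition above, the following sketch evaluates $C_{bit}$ and $E_{bit}$ for a given minimum width and supply. The function names and the example width and voltage are hypothetical; the slides do not state the $W_{min}$ values used for any particular process.

```python
def c_bit_ff(w_min_um: float, factor: float = 4.0, cap_ff_per_um: float = 2.0) -> float:
    """C_bit in fF: 4 * (2 fF/um) * W_min, with W_min in micrometres."""
    return factor * cap_ff_per_um * w_min_um


def e_bit_fj(w_min_um: float, vdd: float) -> float:
    """E_bit in fJ: C_bit * Vdd^2 (fF times V^2 gives fJ)."""
    return c_bit_ff(w_min_um) * vdd ** 2


# Hypothetical device: W_min = 0.2 um at Vdd = 1.5 V.
print(f"C_bit = {c_bit_ff(0.2):.2f} fF, E_bit = {e_bit_fj(0.2, 1.5):.2f} fJ")
```

Since both $W_{min}$ and $V_{dd}$ shrink with each process generation, $E_{bit}$ falls with scaling, which is what makes it usable as a technology-normalized unit.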

14 Technology Scaling for $E_{bit}$

Distances use a normalized unit equal to the M1 pitch.

[Figure: comparison of 0.5 µm and 0.18 µm technologies; surviving labels: 58, 18, 5.7, 18, µm²]

15 Technology Scaling for NAND2

Distances use a normalized unit equal to the M1 pitch: 4 units = 2.24 µm and 8 units = 4.48 µm.

[Figure: NAND2 cell with inputs A and B and output YN]

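16 Applying $E_{bit}$

| Energy | 180 nm | 130 nm | 90 nm | 65 nm |
|---|---|---|---|---|
| $E_{bit}$ (fJ) | 3.3 | 1.4 | 0.5 | 0.36 |

| Operation | Energy relative to $E_{bit}$ (values listed from 180 nm to 65 nm) |
|---|---|
| $E_{bit}$ | 1, 1, 1, 1 |
| 1b FO4 | ~10 |
| 1b SP-SRAM | 0.3-7 |
| 1b RF | 4-20+ |
| 1b DFF | 20-30+, 15-30+, 10-30+ |
| 1b NAND2 | 11-30 (typ 19), 5-30 (typ 14) |
| Move 1b 1000 M1 pitches | ~100 |
| Move 1b 1.5 mm | 268, 367, 467, 714 |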

19 Effect of Architecture

- NVIDIA GeForceFX (design style: ASIC): 400 MHz, 125M transistors
- Intel Pentium-4 (design style: custom): 2600 MHz, 55M transistors

20 Effect of Architecture

- NVIDIA GeForceFX (design style: ASIC): 400 MHz, 125M transistors, ~20 W
- Intel Pentium-4 (design style: custom): 2600 MHz, 55M transistors, ~60 W

21 Effect of Architecture. ASIC Architecture: 6x Efficiency

- NVIDIA GeForceFX (design style: ASIC): 400 MHz, 125M transistors, ~20 W, 10 GFlops and 13 GB/s
- Intel Pentium-4 (design style: custom): 2600 MHz, 55M transistors, ~60 W, 5 GFlops and 5 GB/s

22 Custom Circuits: 9x (7x) Efficiency

- NVIDIA GeForceFX (design style: custom): 400 MHz, 125M transistors, ~3 W, 10 GFlops and 13 GB/s, Vdd = 0.65 V
- Intel Pentium-4 (design style: custom): 2600 MHz, 55M transistors, ~60 W, 5 GFlops and 5 GB/s, Vdd = 1.3 V

23 Combined Architecture and Circuits: 40x+ Improvement, but 1.5 Years vs. 3+ Years

- NVIDIA GeForceFX (design style: custom): 400 MHz, 125M transistors, ~3 W, 10 GFlops and 13 GB/s, Vdd = 0.65 V
- Intel Pentium-4 (design style: custom): 2600 MHz, 55M transistors, ~60 W, 5 GFlops and 5 GB/s, Vdd = 1.3 V
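The efficiency multipliers quoted on these slides can be checked with simple GFlops-per-watt arithmetic. The sketch below uses only the throughput and power figures from the slides (including the ~20 W ASIC figure from the earlier build), and assumes GFlops per watt is the efficiency measure being compared, which the slides do not say explicitly.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Throughput per unit power."""
    return gflops / watts


p4 = gflops_per_watt(5, 60)          # Pentium-4, custom design
fx_asic = gflops_per_watt(10, 20)    # GeForceFX as built (ASIC)
fx_custom = gflops_per_watt(10, 3)   # GeForceFX with custom circuits (projected)

print(f"architecture: {fx_asic / p4:.1f}x")         # 6.0x
print(f"circuits:     {fx_custom / fx_asic:.1f}x")  # 6.7x
print(f"combined:     {fx_custom / p4:.1f}x")       # 40.0x
```

Under this reading, the 6x and 40x figures fall out directly, and the circuit-only gain of about 6.7x corresponds to the "(7x)" on the slide.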

26 ASIC vs. Custom

- ASIC methods: provide only coarse-grain control (100K+ gates), but require much less effort and historically scale with complexity
- Custom methods: offer fine-grain control (individual transistors and gates), but require large effort and scale poorly with complexity
  - Exploit design structure
  - Exploit circuit techniques

27 Custom Methods Emphasize Fine-Grain Manual Control + Custom Library

28 Custom Methods Emphasize Fine-Grain Manual Control + Custom Library Operation and Performance Characterized for the Specific Case

29 ASIC Methods Substitute Coarse-Grain Control Automation + Generic Library

30 ASIC Methods Substitute Coarse-Grain Control Automation + Generic Library Operation and Performance Characterized for the Typical/Generic Case

31 ASIC Focus on 100K+ Gates: Lost Opportunities to Exploit Structure

- Designs reuse similar basic building blocks
- Building blocks are 1-10K gates, not 100K+ gates
  - 64-bit adder: 1K gates
  - 64x64 register file: 2K gates
  - 64x64 multiplier: 20K gates
- Opportunities to exploit these structures are lost when the design is viewed in large chunks

32 Different Architectures, Similar Building Blocks

[Figure: block diagrams of the 1998 "MAP" 64b microprocessor, 5M transistors (MIT/Stanford), and the 2002 "Imagine" 32b stream processor, 22M transistors (Stanford). Labeled blocks include EX, RF, SRAM, XCVRS, Bus, Bank 0, Bank 1, CLST 0-2, NIF/Router, Memory Switch, Cluster Switch, EMI, and LTLB.]

33 Significant Structure Exists Within 100K Gates

[Figure: the same block diagrams of the 1998 "MAP" 64b microprocessor, 5M transistors (MIT/Stanford), and the 2002 "Imagine" 32b stream processor, 22M transistors (Stanford), with the EX, RF, SRAM, XCVRS, and Bus building blocks called out.]

34 Energy of 100K-gate Equivalent

- ASIC (N2) = 1400K $E_{bit}$ (typ)
- Custom logic = 424K $E_{bit}$*
- SRAM (small) = 1085K $E_{bit}$
- SRAM (med) = 155K $E_{bit}$
- SRAM (large) = 50K $E_{bit}$

*Based on data extracted from Intel McKinley

35 Exploiting Circuit Techniques

- Custom circuits are more efficient
  - Reduced parasitics
  - 1.7x from circuit techniques and flops
  - 1.4x from libraries
  - 1.4x from engineering interconnects
- Subthreshold circuits
  - Low performance but ultra-low power
  - Require architecture, gates, memories, and CAD tools
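One way to read the three multipliers on this slide is as independent factors that compound. That is an assumption, since the slides do not say how they combine. The sketch below multiplies them and compares the product with the ASIC-to-custom ratio from the 100K-gate energy slide (1400K vs. 424K $E_{bit}$).

```python
import math

# Per-technique advantages quoted on the slide.
FACTORS = {
    "circuit techniques and flops": 1.7,
    "libraries": 1.4,
    "engineered interconnects": 1.4,
}

# Assumption: the factors are independent and therefore multiply.
combined = math.prod(FACTORS.values())

# Ratio of ASIC (N2) to custom logic energy for a 100K-gate equivalent.
measured = 1400 / 424

print(f"compounded factors: {combined:.2f}x")
print(f"ASIC / custom:      {measured:.2f}x")
```

Both numbers come out near 3.3x, in line with the "3x advantage in energy" claimed in the summary.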

36 Relating Power to Performance: CV/I, $I_{dsat}$, $t_{FO4}$

$$I_{dsat} = K_3 L_{eff}^{-0.5}\, t_{ox}^{-0.8}\, (V_{gs} - V_t)^{1.25}$$

$$t_{FO4} = K_4 \left[\frac{C_{eff} V_{dd}}{I_{dsat}}\right] \quad (K_4 \approx 13.5)$$

37 Relating Power to Performance: Relating $V_{dd}$ and $V_t$ to $t_{FO4}$

$$I_{dsat} = K_3 L_{eff}^{-0.5}\, t_{ox}^{-0.8}\, (V_{gs} - V_t)^{1.25}$$

$$t_{FO4} = K_4 \left[\frac{C_{eff} V_{dd}}{I_{dsat}}\right] \quad (K_4 \approx 13.5)$$

38 Relating Power to Performance: Correlation to Reported Foundry Data

| Technology | Node | CV/I est (ps) | CV/I reported (ps) | $t_{FO4}$ est (ps) |
|---|---|---|---|---|
| Foundry A | 180 nm | 3.94 | 3.70 | 53 |
| Foundry A | 130 nm | 2.55 | 2.17 | 34 |
| Foundry A | 90 nm | 1.85 | 2.04 | 25 |
| Foundry A | 65 nm | 1.45 | 1.00 | 20 |

$$I_{dsat} = K_3 L_{eff}^{-0.5}\, t_{ox}^{-0.8}\, (V_{gs} - V_t)^{1.25}$$

$$t_{FO4} = K_4 \left[\frac{C_{eff} V_{dd}}{I_{dsat}}\right] \quad (K_4 \approx 13.5)$$
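The estimated $t_{FO4}$ column appears to follow directly from the estimated CV/I column and $K_4 \approx 13.5$. The sketch below applies that relation to the four nodes, with the CV/I values as read from the table above, and rounds to whole picoseconds.

```python
K4 = 13.5  # fitting constant quoted on the slide

# Estimated CV/I in picoseconds per technology node, as read from the table.
CV_OVER_I_PS = {"180 nm": 3.94, "130 nm": 2.55, "90 nm": 1.85, "65 nm": 1.45}


def t_fo4_ps(cv_over_i_ps: float, k4: float = K4) -> float:
    """Fan-out-of-4 inverter delay estimate: t_FO4 = K4 * (C_eff * Vdd / I_dsat)."""
    return k4 * cv_over_i_ps


for node, cv_i in CV_OVER_I_PS.items():
    print(f"{node}: t_FO4 ~ {round(t_fo4_ps(cv_i))} ps")
```

The rounded results reproduce the 53, 34, 25, and 20 ps entries in the table.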

39 Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)

| Technique | Power type | Custom vs. ASIC | Energy | Applies to |
|---|---|---|---|---|
| Circuit styles and flops | Dynamic | 1.7 | 0.815 | Logic |
| Libraries + $V_{dd}$ scaling | Dynamic | 1.4 | 0.855 | Logic |
| SRAM circuits | Dynamic | 2 | 0.95 | SRAM |
| Interconnect + $V_{dd}$ scaling | Dynamic | 1.4 | 0.855 | Interconnect |

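40 Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)

| Technique | Power type | Custom vs. ASIC | Energy | Applies to |
|---|---|---|---|---|
| Bit encoding | Dynamic | 1 | 0.84 | Interconnect |
| Clock gating | Dynamic | 1 | 0.84 | Chip |
| Frequency scaling | Dynamic | 1 | 0.5 | Chip |
| Subthreshold circuits | Dynamic | N/A | 0.062 | Chip |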

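41 Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)

| Technique | Power type | Custom vs. ASIC | Energy | Applies to |
|---|---|---|---|---|
| $V_{dd}$ scaling | Static | 1 | 0.79 | Chip |
| MT-CMOS | Static | 1 | 0.5 | Chip |
| Stacking and input state vector | Static | 1.4 | 0.7 | Chip (typically only one of these three is applied) |
| Body bias | Static | 2 | 0.5 | Chip (typically only one of these three is applied) |
| Supply gating | Static | 10 | 0.1 | Chip (typically only one of these three is applied) |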

42 Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)

| Type | 130 nm: ASIC (Custom) | 90 nm: ASIC (Custom) |
|---|---|---|
| Net dynamic | 45% (32%) | 28% (20%) |
| Net static | 8% (4%) | 20% (10%) |
| Total | 53% (36%) | 48% (30%) |

- The 130 nm microprocessor assumes 80% dynamic and 20% static power
- The 90 nm microprocessor assumes 50% dynamic and 50% static power

45 16b 1024-point FFT

- Generally, $k N \log N$ operations (complex multiplies) with pre-computation
- Radix-2, radix-4, etc. implementations
- Decimation in time and/or decimation in frequency
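To make the $N \log N$ scaling concrete for $N = 1024$, the sketch below counts stages and butterflies for radix-2 and radix-4 decompositions. The helper names are hypothetical. It counts butterflies only, so the constant $k$ (the multiplies per butterfly, which depends on the implementation) is left out.

```python
def stage_count(n: int, radix: int) -> int:
    """Number of FFT stages, i.e. log_radix(n); n must be a power of radix."""
    stages, size = 0, 1
    while size < n:
        size *= radix
        stages += 1
    if size != n:
        raise ValueError(f"{n} is not a power of {radix}")
    return stages


def butterfly_count(n: int, radix: int) -> int:
    """Each stage processes n / radix butterflies of the given radix."""
    return (n // radix) * stage_count(n, radix)


N = 1024
for radix in (2, 4):
    print(f"radix-{radix}: {stage_count(N, radix)} stages, "
          f"{butterfly_count(N, radix)} butterflies")
```

For $N = 1024$ this gives 10 stages of 512 radix-2 butterflies, or 5 stages of 256 radix-4 butterflies, which is one reason the cycle counts of different implementations vary so widely.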

46 Range of Implementations

- MIT FFT (2005): 0.18 µm CMOS, 628K transistors, 10 kHz. Architecture and subthreshold circuits, 180 mV operation.
- Spiffee (1999): 0.7 µm CMOS, 460K transistors, 173 MHz. Cached FFT architecture and algorithm, 1.1 V operation.
- SA-1100 (1999): 0.35 µm CMOS, 2.6M transistors, 74 MHz. Commercial embedded processor, custom circuits, 1.5 V operation.
- Imagine (2003): 0.15 µm CMOS, 22M transistors, 232 MHz. Streaming media processor, tiled standard cells, 1.2 V operation.
- Stratix IS25F627C8 (2005): 0.13 µm CMOS, 3.9K logic elements, 123K memory bits, 24 DSP blocks, 272 MHz. Commercial FPGA co-processor.
- Intel P4 (2003): 0.13 µm CMOS, 3 GHz, SSE. Commercial general-purpose processor, custom circuits, 1.5 V operation.
- TI 'C6416 (2003): 0.13 µm CMOS, 720 MHz. Commercial digital signal processor.

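47 $E_{bit}$ Energy: 16b 1024-point FFT

| Design | Fab (nm) | $V_{dd}$ (V) | MHz | mW | Cycles |
|---|---|---|---|---|---|
| MIT FFT | 180 | 1.8 | 0.01 | 1.6 | 95 |
| Spiffee | 700 | 3.3 | 173 | 845 | 5190 |
| SA-1100 | 350 | 2 | 74 | 39 | 31500 |
| Imagine | 150 | 1.5 | 232 | 4000 | 3708 |
| Stratix | 130 | 1.3 | 275 | 884 | 1291 |
| Intel P4 | 130 | 1.2 | 3000 | 51200 | 71680 |
| TI 'C6416 | 130 | 1.2 | 720 | 1200 | 6526 |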
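The energy per FFT for most of these designs appears to be power multiplied by run time, where run time is cycles divided by clock frequency. The sketch below applies that to six rows, with values as read from the table above. The MIT FFT row is left out because its listed power and cycle count do not reproduce its tabulated energy with this simple formula.

```python
# (design, power in mW, cycles per FFT, clock in MHz), as read from the table.
ROWS = [
    ("Spiffee", 845, 5190, 173),
    ("SA-1100", 39, 31500, 74),
    ("Imagine", 4000, 3708, 232),
    ("Stratix", 884, 1291, 275),
    ("Intel P4", 51200, 71680, 3000),
    ("TI 'C6416", 1200, 6526, 720),
]


def energy_per_fft_nj(power_mw: float, cycles: int, clock_mhz: float) -> float:
    """Energy per FFT in nJ: power * (cycles / frequency); mW * cycles / MHz = nJ."""
    return power_mw * cycles / clock_mhz


for name, power_mw, cycles, clock_mhz in ROWS:
    print(f"{name:10s} {energy_per_fft_nj(power_mw, cycles, clock_mhz):12.0f} nJ")
```

For these six designs the results agree with the $E_{fft}$ column on the next slide to within rounding.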

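48 $E_{bit}$ Energy: 16b 1024-point FFT

| Design | EDP (rel norm) | $E_{bit}$ (fJ) | $E_{fft}$ (nJ) | Normalized to $E_{bit}$ (1e6) | Energy ratio |
|---|---|---|---|---|---|
| MIT FFT | 143 | 3.3 | 154 | 47 | 1 |
| Spiffee | 1 | 91 | 25350 | 277 | 6 |
| SA-1100 | 283 | 4.2 | 16601 | 3953 | 85 |
| Imagine | 148 | 2.2 | 63931 | 29726 | 637 |
| Stratix | 24 | 1.4 | 4149 | 2964 | 64 |
| Intel P4 | 12548 | 1.4 | 1E+06 | 873813 | 18591 |
| TI 'C6416 | 27 | 1.4 | 10877 | 7769 | 166 |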
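The normalized column looks like $E_{fft}$ divided by $E_{bit}$, and the energy ratio looks like that quotient divided by the MIT FFT's. The sketch below checks this reading on four rows whose printed values are precise enough to test. It is an interpretation of the table, not a method stated on the slide.

```python
# design -> (E_fft in nJ, E_bit in fJ), as read from the table above.
ROWS = {
    "MIT FFT": (154, 3.3),
    "SA-1100": (16601, 4.2),
    "Stratix": (4149, 1.4),
    "TI 'C6416": (10877, 1.4),
}


def normalized_energy(e_fft_nj: float, e_bit_fj: float) -> float:
    """E_fft / E_bit, in millions of E_bit (nJ / fJ = 1e6)."""
    return e_fft_nj / e_bit_fj


baseline = normalized_energy(*ROWS["MIT FFT"])
for name, (e_fft, e_bit) in ROWS.items():
    norm = normalized_energy(e_fft, e_bit)
    print(f"{name:10s} normalized = {norm:8.0f}e6 E_bit, ratio = {norm / baseline:6.0f}")
```

Normalizing by $E_{bit}$ removes most of the advantage a design gets simply from being built in a newer process at a lower supply voltage, so the remaining ratio reflects architecture and circuit choices.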

50 Which Design Is More Efficient? Depends on the Metric!

- 0.7 µm CMOS, 173 MHz chip with 460K transistors
  - Vdd (typ) = 3.3 V, Vdd (min) = 1.1 V
  - Power = 845 mW
  - EDP 143x better
- 0.18 µm CMOS, 10 kHz chip with 640K transistors
  - Vdd (max) = 1.8 V, Vdd (min) = 0.18 V
  - Power = 1.6 mW
  - Absolute energy 6x better

51 Summary

- A normalized metric, $E_{bit}$, enables meaningful comparisons across designs and technologies.
- Custom designers can exploit a wide range of optimizations, enabling architecture with circuits and circuits with architecture.
- Custom designs can readily achieve a 3x advantage in energy, with the potential for over 10x.
- Selective application of custom techniques, together with automated support for performance characterization at specific rather than generic operating points, can enable ASIC designers to begin to bridge this power gap.

52 Back-Up Slides

53 ASICs Rely on General Optimization Techniques. Focus: Improve the Average Case

- Partitioning: hypergraph min-cut, ratio cut
- Solutions: move-based, geometric and combinatorial forms, clustering

[Figure: a circuit and its hypergraph H(V, E), with vertices V1-V5 and nets E = {e1, e2, ...}. Vertex and edge weights are used to encode costs.]
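The two partitioning objectives named here can be illustrated on a toy hypergraph. The nets and partitions below are hypothetical and are not the ones drawn on the slide. The sketch only evaluates the objectives for a given bipartition; it does not implement a move-based or clustering optimizer.

```python
def cut_size(nets: dict, side: dict) -> int:
    """Number of nets (hyperedges) whose vertices fall on both sides of the partition."""
    return sum(1 for pins in nets.values() if len({side[v] for v in pins}) > 1)


def ratio_cut(nets: dict, side: dict) -> float:
    """Cut size divided by the product of the two partition sizes."""
    size_a = sum(1 for s in side.values() if s == 0)
    size_b = len(side) - size_a
    return cut_size(nets, side) / (size_a * size_b)


# Hypothetical netlist: each net connects a set of cells.
nets = {
    "e1": {"V1", "V2"},
    "e2": {"V2", "V3", "V4"},
    "e3": {"V4", "V5"},
    "e4": {"V1", "V5"},
}

partition_a = {"V1": 0, "V2": 0, "V3": 1, "V4": 1, "V5": 1}
partition_b = {"V1": 0, "V2": 1, "V3": 0, "V4": 1, "V5": 0}

for label, part in (("A", partition_a), ("B", partition_b)):
    print(f"partition {label}: cut = {cut_size(nets, part)}, "
          f"ratio cut = {ratio_cut(nets, part):.3f}")
```

Objectives of this kind treat every net through a generic weight, which is the sense in which such tools optimize the average case rather than the regular structure of a datapath.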

54 Designs with Structure Do Not Exhibit Average Characteristics

[Figure: routing density of a 64b multiplier (half-array), showing a clear disparity in resource usage]
