Jan M. Rabaey Dejan Marković Low Power Design Essentials ©2008 Chapter 5 Optimizing Design Time Architectures, Algorithms and Systems.

1 Jan M. Rabaey Dejan Marković Low Power Design Essentials ©2008 Chapter 5 Optimizing Design Time Architectures, Algorithms and Systems

2 Low Power Design Essentials © Chapter Outline
• The architecture/system trade-off space
• Concurrency improves energy-efficiency
• Exploring alternative topologies
• Removing inefficiency
• The cost of flexibility

3 Low Power Design Essentials © Motivation
• Optimizations at the architecture or system level can enable more effective power minimization at the circuit level (while maintaining performance), such as
  – Enabling a reduction in supply voltage
  – Reducing the effective switching capacitance for a given function (physical capacitance, activity)
  – Reducing the switching rates
  – Reducing leakage
• Optimizations at higher abstraction levels tend to have greater potential impact
  – While circuit techniques may yield improvements in the 10-50% range, architecture and algorithm optimizations have reported orders-of-magnitude power reductions

4 Low Power Design Essentials © Lessons Learned from Circuit Optimization
Circuit optimization is limited in range
Case study: tree adder
• Result of joint (V_DD, V_TH, W) optimization:
  – 65% of energy saved without delay penalty
  – 25% smaller delay without energy cost
• Need higher-level optimizations for larger gain
[Figure: E/E_ref versus D/D_ref; reference point (Ref): min at nominal V_DD, V_TH]
[Ref: D. Markovic, JSSC’04]

5 Low Power Design Essentials © The Design Abstraction Stack
[Figure: abstraction stack with levels System/Application, Software, (Micro-)Architecture, Logic/RT, Circuit, Device; annotation: Chapter 4]
Increasing Return-on-Investment (ROI) at higher levels of the stack

6 Low Power Design Essentials © Expanding the Playing Field
Architecture and system transformations and optimizations reshape the E-D curves
[Figure: three energy (E) versus delay (D) plots: (1) removing inefficiencies, (2) alternative topologies, (3) discrete options]

7 Low Power Design Essentials © Reducing the Supply Voltage (while maintaining performance)
Concurrency: trading off clock frequency versus area to reduce power
Consider the following reference design:
[Figure: datapath of registers and logic blocks (R, F1, R, F2, R) clocked at f_ref]
R: register; F1, F2: combinational logic blocks (adders, ALUs, etc.); C_ref: average switching capacitance
[A. Chandrakasan, JSSC’92]

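8 Low Power Design Essentials © A Parallel Implementation
[Figure: two copies of the R, F1, R, F2, R datapath, each clocked at f_ref/2]
• Running slower reduces the required supply voltage
• Yields a quadratic reduction in power
• The doubled capacitance and the halved clock frequency almost cancel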
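
The power argument on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration based on the usual dynamic-power relation P = C · V² · f; the way the overhead enters the capacitance, the function name `power_ratio`, and the supply ratio of 0.6 are assumptions, not values taken from the slides.

```python
def power_ratio(cap_factor, vdd_ratio, freq_ratio):
    """Dynamic power relative to the reference design, from P = C * V^2 * f."""
    return cap_factor * vdd_ratio ** 2 * freq_ratio

overhead = 0.075   # assumed fractional capacitance overhead of the parallel design
vdd_ratio = 0.6    # assumed supply (relative to reference) that still meets f_ref/2

# Two copies double the capacitance (plus overhead); each runs at half the clock.
parallel = power_ratio(2 * (1 + overhead), vdd_ratio, 0.5)
print(f"P_par / P_ref = {parallel:.3f}")
```

The factor 2 in capacitance and the factor 1/2 in frequency cancel apart from the overhead, so the saving comes almost entirely from the squared supply ratio.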

9 Low Power Design Essentials © Example: 90nm Technology
Assuming ov_par = 7.5%
[Figure: normalized propagation delay t_p versus normalized V_DD]

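10 Low Power Design Essentials © A Pipelined Implementation
[Figure: reference datapath (F1, F2, registers R) with additional pipeline registers inserted, clocked at f_ref]
Shallower logic reduces the required supply voltage (this example assumes equal V_DD for the parallel and pipelined designs)
Assuming ov_pipe = 10%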

11 Low Power Design Essentials © Increasing Use of Concurrency Saturates
• Can combine parallelism and pipelining to drive V_DD down
• But close to the process threshold, the overhead of excessive concurrency starts to dominate
[Figure: power versus concurrency, assuming constant % overhead]

12 Low Power Design Essentials © Increasing Use of Concurrency Saturates
[Figure: power P versus V_DD at fixed throughput, from the nominal design (no concurrency) toward increasing concurrency; the minimum P_min is set by overhead + leakage]
Only option: reduce V_TH as well! But: must consider leakage …

13 Low Power Design Essentials © Mapping into the Energy-Delay Space
[Figure: energy per operation (E_Op) versus delay (= 1/throughput) for the nominal design and N = 2, 3, 4, 5 (increasing level of parallelism), marking a fixed throughput and the optimum energy-delay point]
• For each level of performance, there is an optimum amount of concurrency
• Concurrency is only energy-optimal if the requested throughput is larger than the optimal operation point of the nominal function
[Ref: D. Markovic, JSSC’04] © IEEE 2004
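
The existence of an optimum amount of concurrency at fixed throughput can be reproduced with a toy model. The sketch below is not the model behind the published curves: it assumes an alpha-power-law delay, a capacitance overhead that grows linearly with the number of parallel units, and a constant leakage term per unit. All parameter values (`vth`, `alpha`, `overhead`, `leak`) and function names are hypothetical.

```python
def delay(vdd, vth=0.3, alpha=1.3):
    """Alpha-power-law gate delay, up to a constant factor."""
    return vdd / (vdd - vth) ** alpha

def vdd_for_slack(slack, vref=1.0, vth=0.3, alpha=1.3):
    """Bisect for the supply at which delay is `slack` times the reference delay."""
    target = slack * delay(vref, vth, alpha)
    lo, hi = vth + 1e-6, vref
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if delay(mid, vth, alpha) > target:
            lo = mid   # too slow at this supply: search higher voltages
        else:
            hi = mid   # fast enough: try a lower supply
    return hi

def energy_per_op(n, overhead=0.1, leak=0.05):
    """Normalized energy per operation with n parallel units at fixed throughput."""
    vdd = vdd_for_slack(n)                      # each unit may be n times slower
    dynamic = (1 + overhead * (n - 1)) * vdd ** 2
    leakage = leak * n * vdd                    # n units leak during one period
    return dynamic + leakage

energies = {n: energy_per_op(n) for n in range(1, 9)}
best = min(energies, key=energies.get)
print("energy per op by N:", {n: round(e, 3) for n, e in energies.items()})
print("lowest-energy N:", best)
```

With these made-up parameters the minimum falls at an intermediate N: the supply can only drop toward the threshold, so the V² gain flattens while overhead and leakage keep growing with N.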

14 Low Power Design Essentials © What if the Required Throughput is Below Minimum? (that is, at no concurrency)
Introduce time-multiplexing!
[Figure: reference design with two units A, each clocked at f, versus a time-multiplexed design with one unit A clocked at 2f]
• Absorb unused time slack by increasing clock frequency (and voltage …)
• Again comes with some area and capacitance overhead!

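15 Low Power Design Essentials © Concurrency and Multiplexing Combined
[Figure: energy per operation (E_Op) versus target delay (D_target), data for a 64-b ALU; curves range from time-multiplexed (small area) to parallel (large area), with A = A_ref as the reference]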

16 Low Power Design Essentials © Some Energy-Inspired Design Guidelines
• For maximum performance
  – Maximize use of concurrency at the cost of area
• For given performance
  – Optimal amount of concurrency for minimum energy
• For given energy
  – Least amount of concurrency that meets performance goals
• For minimum energy
  – Solution with minimum overhead (that is, direct mapping between function and architecture)

17 Low Power Design Essentials © Concepts Slowly Embraced in Late 90’s
[Figure: trends in transistors/chip (memory, microprocessor/DSP), normalized processor speed, and computational efficiency [mA/MIP]]
[Ref: R. Subramanyan, Tampere’99]

18 Low Power Design Essentials © Confirmed by Actual Processors … [Courtesy: J. DeVale and B. Black, Intel, ‘05 ]

19 Low Power Design Essentials © And Finally Accepted in the 00’s
[Figure: processor performance (for constant power envelope), single core versus dual/many core; annotations: 10x, 3x]
[Ref: S. Chou, ISSCC’05]

20 Low Power Design Essentials © Fully Accepted in 00’s
• Xilinx Virtex-4
• IBM/Sony Cell Processor
• Intel Montecito
• UCB Pleiades (ARM core with heterogeneous reconfigurable fabric)
• NTT video codec (4 Tensilica cores)
• AMD DualCore
[© Xilinx, Intel, AMD, IBM, NTT]

21 Low Power Design Essentials © The Quest for Concurrency
Amdahl’s Law
[Figure: curves for serial fractions of 0%, 6.7%, and 20%]
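
Amdahl's law itself is short enough to evaluate directly. The sketch below uses the three serial fractions named on the slide; the core counts and the function name `amdahl_speedup` are chosen here purely for illustration.

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Speedup over one core when only the parallel fraction scales with n_cores."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

for serial in (0.0, 0.067, 0.20):
    speedups = [round(amdahl_speedup(serial, n), 1) for n in (1, 4, 16, 64)]
    print(f"serial = {serial:.1%}: {speedups}")
```

With a 20% serial fraction the speedup can never exceed 5x, however many cores are added.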

22 Low Power Design Essentials © The Quest for Concurrency
mm, 100W, 48MB cache, 4B transistors, in 22nm
[Figure: relative single-core performance and relative performance for 12 large cores, 48 medium cores, or 144 small cores]
[Courtesy: S. Borkar, Intel, 2006]

23 Low Power Design Essentials © Manipulating Concurrency Through Transformations
• Loop folding / unfolding
• Algebraic transformations
• Others: loop retiming, loop pipelining, (de)-interleaving, …
[Figure: dataflow graphs of function blocks F before and after transformation]
[Ref: A. Chandrakasan, TCAD’95; D. Markovic, JSSC’07]

24 Low Power Design Essentials © Example: Visualizing MPEG-4 encoder Parallelism [Courtesy: W.M. Hwu, Illinois] Concurrent Compilers to Pick Up the Pace

25 Low Power Design Essentials © Single transformation Combined transformations [Courtesy: W.M. Hwu, Illinois] Impact of Code Transformations

26 Low Power Design Essentials © Choosing Between Alternative Topologies
Multiple computational topologies exist for a single function F
• E.g.: adders, ALUs, multipliers, dividers, goniometric functions
• Each topology comes with its own optimal E-D curve
The absolute optimal E-D curve for function F is obtained as the composite of these curves
• It plots unarguably the best possible implementation of F for a given E or D and technology
[Figure: E-D curves of two topologies F’ and F”, and their composite]

27 Low Power Design Essentials © Adder Example: Static versus Domino
64-bit CLA adders; 130 nm CMOS
• Static adders are low power but slow
• Dynamic adders are the preferred choice for higher performance
• Higher radix orders improve efficiency overall
[Figure: energy [pJ] versus delay [FO4] for Static R2, Domino R2, Domino R4, and Compound Domino R adders; R2: radix 2, R4: radix 4]
[Ref: R. Zlatanovici, ESSCIRC’03]

28 Low Power Design Essentials © Adder Example: Static CLA versus Ling
Conventional CLA
• Higher stack in first stage
• Simple sum precompute
Ling CLA
• Lower stack in first stage
• Complex sum precompute
• Higher speed
[Figure: energy [pJ] versus delay [FO4] for R2 CLA, R2 Ling, R4 CLA, and R4 Ling]
[Ref: R. Zlatanovici, ESSCIRC’03] © IEEE 2003

29 Low Power Design Essentials © Improving Computational Efficiency
Implementations for a given function may be inefficient and can often be replaced with more efficient versions without penalty in energy or delay
Inefficiencies arise from:
• Over-dimensioning or over-design
• Generality of function
• Design methodologies
• Limited design time
• Need for flexibility, re-use and programmability
[Figure: energy (E) versus delay (D)]

30 Low Power Design Essentials © Improving Computational Efficiency
Some simple guidelines:
• Match computation and architecture
  – Dedicated solutions superior by far
• Preserve locality present in algorithm
  – Getting data from far away is expensive
• Exploit signal statistics
  – Correlated data contains fewer transitions than random data
• Energy on demand
  – Only spend energy when truly needed

31 Low Power Design Essentials © Matching Computation and Architecture
• Choice of computational architecture can have a major impact on energy efficiency (see further)
Example: compute y = A∙x² + B∙x + C
[Figure: two alternative datapath structures for the same computation]
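
One way to see how the structure of a computation changes its cost is to count operations for two algebraically equivalent forms of the quadratic on this slide. The sketch below contrasts the expression as written with its factored (Horner) form; whether these are the two structures drawn in the original figure is an assumption, and both function names are invented for the example.

```python
def direct_form(a, b, c, x):
    """y = a*x*x + b*x + c as written: 3 multiplies, 2 adds."""
    return a * x * x + b * x + c

def horner_form(a, b, c, x):
    """Same polynomial factored as (a*x + b)*x + c: 2 multiplies, 2 adds."""
    return (a * x + b) * x + c

print(direct_form(2, 3, 4, 5), horner_form(2, 3, 4, 5))
```

Both forms return the same value; the factored form needs one multiplier fewer but chains its operations serially, which is the kind of energy versus delay choice the slide refers to.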

32 Low Power Design Essentials © Matching Computation and Architecture
• Example: word-length optimization
• Most algorithms (wireless, multimedia) are simulated and developed in floating point
• Programmable architectures operate on a fixed word length (16, 32, 64 bit)
• This creates a substantial amount of switching overhead for many computations
Careful selection of word length leads to substantial power reduction
[Figure: datapath with quantizers]
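
The word-length trade-off can be explored by quantizing a value to different fixed-point precisions and watching the error shrink. This is a minimal sketch, assuming simple round-to-nearest quantization; the helper `quantize` and the test value 0.3 are hypothetical and not part of the design flow described in the slides.

```python
def quantize(x, frac_bits):
    """Round x to a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

value = 0.3
for frac_bits in (4, 8, 12, 16):
    error = abs(quantize(value, frac_bits) - value)
    print(f"{frac_bits:2d} fractional bits: error = {error:.2e}")
```

Each extra bit halves the worst-case error but adds switched capacitance in every adder and multiplier, so the shortest word length that meets the accuracy target is the energy-efficient choice.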

33 Low Power Design Essentials © Design Example: SVD Processor for MIMO
MIMO channel: matrix H, decomposed as H = U ∙ Σ ∙ V†
[Figure: Tx array (x) and Rx array (y) linked by two paths: 1st path, μ1 = 1; 2nd path, μ2 = 0.6]
[Figure: Tx (encoding & modulation, V), channel (V†, σ1 … σ4, U), Rx (U†, demodulation & decoding), with outputs z'1 … z'4]
[Ref: D. Markovic, JSSC’07] © IEEE 2007

34 Low Power Design Essentials © SVD: Optimization Techniques
[Figure: energy versus clock period and area versus clock period, from initial synthesis at max V_DD through word-length optimization, sizing, interleaving + folding, and V_DD scaling toward the target speed; annotated gains: 40%, 20%, 30%, 7x, 36x]
[Ref: D. Markovic, JSSC’07] © IEEE 2007

35 Low Power Design Essentials © Energy-Area-Delay Tradeoff in SVD
Impact of combined optimizations
• Folding, interleaving, sizing, word length, voltage scaling
• 64x area & 16x energy reduction compared to direct mapping
[Figure: energy-area-delay space from the 16b initial synthesis to the final design via word-size optimization, sizing, interleaving (13.8x), folding (2.6x), and V_DD scaling with optimal V_DD, W; other annotations: 40%, 30%, 30%, 20%]
[Ref: D. Markovic, JSSC’07] © IEEE 2007

36 Low Power Design Essentials © Power/Area Optimal 4x4 SVD Chip
• 2.1 GOPS/mW
  – 100 MHz clock
  – 70 GOPS
  – Power = 34 mW
• 20 GOPS/mm²
  – 3.5 mm²
  – 70 GOPS
[Figure: comparison with ISSCC chips, energy efficiency (GOPS/mW) versus area efficiency (GOPS/mm²); SVD chip: 34 mW, 3.5 mm², V_DD = 0.4V]
[Ref: D. Markovic, JSSC’07] © IEEE 2007
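
The two efficiency figures on this slide follow directly from the throughput, power, and area quoted with them. The short check below only redoes that arithmetic; the variable names are chosen here for illustration.

```python
throughput_gops = 70.0   # from the slide
power_mw = 34.0          # from the slide
area_mm2 = 3.5           # from the slide

energy_eff = throughput_gops / power_mw   # GOPS/mW
area_eff = throughput_gops / area_mm2     # GOPS/mm^2
print(f"{energy_eff:.1f} GOPS/mW, {area_eff:.0f} GOPS/mm^2")
```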

37 Low Power Design Essentials © Locality of Reference
Fetching data and instructions from local rather than global resources reduces access cost (interconnect, access energy)
Prime example: memory hierarchy (register files, caches, instruction loop buffers, memory partitioning, distributed memory)
[Figure: μP (with Instr. and PC) fetching from main memory (slow, expensive), versus μP fetching through a cache memory (small, fast, efficient) backed by main memory]

38 Low Power Design Essentials © Locality of Reference
• (Hardware) instruction loop buffer
• On first iteration, code is cached in the loop buffer
• Fetched from the loop buffer on subsequent iterations
• Popular feature in DSPs
[Figure: IMEM, loop buffer, and processor core, with PC and IC + LC]

39 Low Power Design Essentials © Reference code Compiler Loop1 Loop2 Loop1 Loop2’ or Improved temporal locality of data [Ref: H. De Man, ISSCC’05] Software Optimizations Crucial © IEEE 2005

40 Low Power Design Essentials © VGA quality MPEG 4 on 1.6 GHz Pentium M [Ref: H. De Man, ISSCC’05] Software Optimizations – Example © IEEE 2005

41 Low Power Design Essentials © Exploiting Signal Statistics
• Sequential data very often displays temporal correlation
• Temporally uncorrelated data maximizes transitions
• Preserving correlations (= avoiding time sharing) is a good idea
[Figure: counters Cntr1 and Cntr2 driving separate buses, or time-multiplexed onto one shared bus]
[Courtesy: A. Chandrakasan]
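
The effect of time sharing on switching activity can be counted directly. The sketch below compares two counters on dedicated buses with the same two streams interleaved on one shared bus, using the number of bit flips between consecutive words as the activity measure. The counter ranges (one counting up from 0, one counting down from 255) are a hypothetical worst-case choice, not data from the slides.

```python
def transitions(words):
    """Total number of bit flips between consecutive words on a bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

cntr1 = list(range(0, 16))           # counts up: 0, 1, 2, ...
cntr2 = list(range(255, 239, -1))    # hypothetical second counter: 255, 254, ...

dedicated = transitions(cntr1) + transitions(cntr2)
shared = transitions([w for pair in zip(cntr1, cntr2) for w in pair])
print(f"dedicated buses: {dedicated} flips, shared bus: {shared} flips")
```

Each counter on its own bus changes only a few bits per step, while interleaving the two streams destroys that correlation and multiplies the number of transitions.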

42 Low Power Design Essentials © % reduction in signal activity Exploiting Signal Statistics

43 Low Power Design Essentials © The Cost of Flexibility
• Programmable solutions very attractive
  – Shorter time to market
  – High reuse
  – Field updates (reprogramming)
• But come at a large efficiency cost
  – Energy/function and throughput-latency/function substantially higher than dedicated implementation
• How to combine flexibility and efficiency?
  – Simple versus complex processors
  – Stepping away from “completely flexible” to “somewhat dedicated”
  – Concurrency versus clock frequency
  – Novel architectural solutions such as reconfiguration

44 Low Power Design Essentials © D E # Apps 1 N Dedicated Programmable The Cost of Flexibility

45 Low Power Design Essentials © The Cost of Flexibility
[Figure (benchmark 1999): energy efficiency in MOPS/mW (or MIPS/mW) versus flexibility (coverage) for dedicated HW, reconfigurable processor/logic (Pleiades), ASIPs and DSPs (DSP: 3 MOPS/mW), and embedded processors (SA)]
Approximately three orders of magnitude in inefficiency from general-purpose to dedicated!
[Ref: J. Rabaey, Tampere’99]

46 Low Power Design Essentials © The Cost of Flexibility – Evolution
[Figure: power efficiency PE (GOPS/Watt) versus feature size (μm) for 32 bit IPE, GP microprocessor (mpu), IS computing (asip-dsp), muxed data paths, and reconfigurable // computing (cg, fg)]
[Ref: T. Claasen, ISSCC’99; H. De Man, ISSCC’05]

47 Low Power Design Essentials © The Cost of Flexibility – Example
Least-mean-square pilot correlators for CDMA (1.67 MSymbols data rate)
Complexity: 300 Mmult/sec and 360 Mmac/sec
ASIC implementation mW
Architecture comparison – single correlator:
Type                 Power     Area
Commercial DSP       460 mW    1100 mm²
Configurable Proc.   18 mW     5.5 mm²
Dedicated            3 mW      1.5 mm²
[Ref: N. Zhang, PhD’01]
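
The size of the gap in this comparison is easier to read as ratios against the dedicated implementation. The sketch below recomputes them from the power and area figures quoted on the slide; the dictionary layout and variable names are illustrative only.

```python
# (power in mW, area in mm^2), as quoted on the slide
designs = {
    "Commercial DSP": (460.0, 1100.0),
    "Configurable Proc.": (18.0, 5.5),
    "Dedicated": (3.0, 1.5),
}

ref_power, ref_area = designs["Dedicated"]
ratios = {name: (p / ref_power, a / ref_area) for name, (p, a) in designs.items()}
for name, (power_x, area_x) in ratios.items():
    print(f"{name:20s} power x{power_x:6.1f}   area x{area_x:6.1f}")
```

For this single correlator the commercial DSP consumes roughly 150 times the power of the dedicated design, while the configurable processor stays within a factor of six.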

48 Low Power Design Essentials © The Architectural Choices
[Figure: four options: general-purpose processor, application-specific processor, dedicated accelerators (μP with AC1, AC2), and reconfigurable processor (μP with RC1, RC2, RC3, RN); blocks connect to memories (M, M1, M2, M$) over a bus]

49 Low Power Design Essentials © Simple versus Complex Processors?
• Best explored using Energy-Delay curves
• For each proposed architecture and parameter set, determine average energy-delay over a library of benchmark examples
• Modern computer-aided design tools allow for quick synthesis and analysis
  – Leads to fair comparison
• Example: Subliminal Project, University of Michigan
  – Explores processor architecture over the following parameters: depth and number of pipeline stages; memory: Von Neumann or Harvard; ALU width (8/16/32); with or without explicit register file

50 Low Power Design Essentials © Pareto analysis over 19 processors Simple versus Complex Processors [Ref: D. Blaauw, ISCA’05]

51 Low Power Design Essentials © Application-Specific Processors
• Tailor processor to be efficient for a sub-set of applications
  – Memory architecture, interconnect structure, computational units, instructions
• Digital-signal processors are the best-known example
  – Special memory architecture provides locality
  – Datapath optimized for vector multiplication (originally)
• Examples now available in many other areas (graphics, security, control, etc.)

52 Low Power Design Essentials © Example 1: DSPs
• The first type of application-specific processor to become popular
• Initially mostly for performance, but energy benefit now also recognized
• Key properties: dedicated memory architecture (multiple data memories), data path specialized for specific functions such as vector multiplies and FFTs
• Over time: introduction of more and more concurrency (VLIW)
[Figure: DSP datapath with RamX (N × 16), RamY (N × 16), Mult (16 × 16), ALU (40), Acc (40)]

53 Low Power Design Essentials © DSPs Deliver Improving Energy-Efficiency
Energy efficiency of DSPs doubles every 18 months (“Gene’s Law”), but…
[Figure: DSP power dissipation trends, mW/MMACs versus year, Gene’s Law versus DSP power]
[Ref: G. Frantz, TI]
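
A doubling every 18 months compounds quickly, which a one-line model makes concrete. The sketch below simply evaluates that exponential; the function name and the 15-year horizon are chosen for illustration and do not come from the slide.

```python
def efficiency_gain(years, doubling_period_years=1.5):
    """Energy-efficiency improvement factor if efficiency doubles every period."""
    return 2.0 ** (years / doubling_period_years)

for years in (1.5, 3, 6, 15):
    print(f"after {years:>4} years: x{efficiency_gain(years):.0f}")
```

Fifteen years of such scaling would amount to ten doublings, or about three orders of magnitude, equivalent to mW/MMACs dropping by the same factor.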

54 Low Power Design Essentials © Performances of DSP Processors
[Table: DSP processor generations; rows: Techno (nm); # Gates (50K, 500K, 5G, 50G); V_DD (V); GHz; MIPS (5, 40, 5K, 50K); MIPS/W (4, 80, 10K, 1G); mW/MIPS]
[Ref: G. Frantz, TI]

55 Low Power Design Essentials © [Courtesy: C. Rowen, Tensilica’01] Combines spatial and temporal processing Core processor with extendible instruction set Application Specific Instruction Processors (ASIP)

56 Low Power Design Essentials © extra instructions 1700 additional gates No cycle time impact Code size reduction Impact of adding special instructions [Courtesy: C. Rowen, Tensilica’01] Advantage of Application Specific Processors

57 Low Power Design Essentials © Optimizing Energy in Video
Diamond 388VDO Video Processor
ISA extensions to support Context-adaptive Binary Arithmetic Coding (CABAC) in H.264 decoding*
[Figure: top-level block diagram]
                Unaugmented core    ISA-extended core
CABAC cycles    710 Mcycles/sec     13 Mcycles/sec
Energy/sec      164 mJ              5 mJ
Area cost for CABAC ISA extensions: 20 Kgates
* 5Mbps H.264 MP stream with MBAff enabled, at D1 resolution
[Courtesy: C. Rowen, Tensilica’07]

58 Low Power Design Essentials © Hardware Accelerators
Often executed functions are implemented as dedicated modules and executed as co-processors
• Opportunities: network processing, MPEG encode/decode, speech, wireless interfaces
• Advantage: energy efficiency of custom implementation
• Disadvantage: area overhead
Example: computational core of Texas Instruments OMAP 2420 Platform
[Figure: ARM11; TMS320C55 DSP; 2D/3D graphics accelerator; imaging video accelerator; security accelerator (SHA-1, DES, AES, PKA, secure WDT); timers, interrupt controllers; shared memory controller, DMA]
[Ref: OMAP Platform, TI]

59 Low Power Design Essentials © Hardware Accelerators
Example: networking coprocessor
TCP Offload Engine: mm × 3.54 mm, 260K transistors
[Figure: MIPS versus year for GP and APS]
[Courtesy: S. Borkar, Intel’05]

60 Low Power Design Essentials © (Re)configurable Processors
“Programming in space”
• Create dedicated co-processors by reconfiguring the interconnect between dedicated computational modules
• Efficiency of hardwired accelerators, but increased flexibility and reuse (smaller area)
[Figure: μP and configuration bus, with arithmetic modules and configurable logic linked by a configurable interconnect]
[Ref: H. Zhang, JSSC’00]

61 Low Power Design Essentials © Programming in Space versus Time
Example: co-variance matrix computation
Programming in time (code segment on the embedded processor):
    for (i = 1; i <= L; i++)
        for (k = i; k <= L; k++)
            phi[i][k] = phi[i-1][k-1]
                      + in[NP-i] * in[NP-k]
                      - in[NA-1-i] * in[NA-1-k];
Programming in space:
[Figure: embedded processor issuing start/end to a dataflow of AddrGen, MEM: in, MPY, ALU, AddrGen, MEM: phi]

62 Low Power Design Essentials © Example: Reconfigurable Processor for Wireless
VCELP coder for wireless (1 250 nm CMOS)
[Ref: H. Zhang, JSSC’00]

63 Low Power Design Essentials © Results of VCELP Voice Coder
% of VCELP code maps onto reconfigurable datapath
[Figure: VCELP code breakdown and VCELP energy breakdown]
Compared to state-of-art 17mW DSP
[Ref: H. Zhang et al., JSSC’00]

64 Low Power Design Essentials © Example: Sony Virtual Mobile Engine (VME)
• Dynamic reconfigurable vector engine
• Reconfigured on the fly
• One-cycle context switch
• Coarse-grain heterogeneous type
• Native 24-bit data width
• Max clock freq. 166 MHz
• Deployed in portable music and game players
Other examples: ADRES, Cluster, CoolDSP, SiliconHive
[Ref: K. Seno, HotChips’04]

65 Low Power Design Essentials © Remember: Amdahl’s Law Still Holds
• Effectiveness of alternative architectures (ASIP, accelerator, reconfigurable) is determined by the amount of code spawned from the GP
• Mostly effective for repetitive kernels
• The 80%-20% rule typically seems to apply
• Transformations can help to improve effectiveness
• Most important: code development and algorithm selection that encourage concurrency
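
The same reasoning bounds the energy benefit of offloading. The sketch below assumes a fraction of the general-purpose energy is moved to an accelerator that is more efficient by a given factor; the function name, the 80% fraction, and the gain values are illustrative assumptions, not measurements.

```python
def energy_saving(offloaded_fraction, accelerator_gain):
    """Overall energy reduction factor when part of the energy is offloaded."""
    remaining = (1.0 - offloaded_fraction) + offloaded_fraction / accelerator_gain
    return 1.0 / remaining

for gain in (10, 100, 1000):
    print(f"80% offloaded, accelerator x{gain}: overall x{energy_saving(0.8, gain):.2f}")
```

Even an arbitrarily efficient accelerator cannot push the overall saving past 5x while 20% of the energy stays on the general-purpose processor.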

66 Low Power Design Essentials © Bringing It All Together
Domain-specific platforms combine multiple computational concepts to optimize flexibility, performance and energy-efficiency
Example: NXP Nexperia Platform for Multimedia Applications
[Figure: DVP system silicon with a MIPS CPU (PRxxxx, with D$ and I$) and a TriMedia CPU (TM-xxxx, with D$ and I$), each on a PI bus with device I/P blocks, connected to SDRAM through the MMI and the DVP memory bus]
[Ref: Nexperia, NXP]

67 Low Power Design Essentials © A Heterogeneous Platform
Example: a programmable HDTV media processor
Combines a VLIW DSP with configurable media co-processors
[Figure: configurable accelerator for image filtering]
[Ref: Nexperia, NXP]

68 Low Power Design Essentials © OMAP Platform for Wireless
Combines an “enhanced ARM processor”, multiple accelerator processors, I/O modules and a sophisticated interconnect network
[Ref: OMAP, TI]

69 Low Power Design Essentials © Summary and Perspectives
• Architectural and algorithmic optimization can lead to drastic improvements in energy-efficiency
• Concurrency is an effective means to improve throughput at fixed energy or reduce energy for fixed throughput
• Energy-efficient architectures specialize the implementation of often recurring instructions or functions

70 Low Power Design Essentials © References
Theses:
• M. Potkonjak, “Algorithms for high level synthesis: resource utilization based approach,” PhD thesis, UC Berkeley,
• N. Zhang, “Algorithm/Architecture Co-Design for Wireless Communication Systems,” PhD thesis, UC Berkeley,
Articles:
• D. Blaauw, B. Zhai, “Energy Efficient Design for Subthreshold Supply Voltage Operation,” IEEE International Symposium on Circuits and Systems (ISCAS), April 2006
• S. Borkar, “Design challenges of technology scaling,” IEEE Micro, vol. 19, no. 4, pp. 23-29, July-Aug
• A.P. Chandrakasan, S. Sheng, R.W. Brodersen, “Low-power CMOS digital design,” IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp , April
• A. Chandrakasan, M. Potkonjak, J. Rabaey and R. Brodersen, “Optimizing Power using Transformations,” IEEE Transactions on Computer Aided Design, vol. 14, no. 1, pp Jan
• S. Chou, “Integration and innovation in the nanoelectronics era,” Keynote presentation, Digest of Technical Papers Solid-State Circuits Conference (ISSCC05), pp , February
• T. Claasen, “High speed: not the only way to exploit the intrinsic computational power of silicon,” Keynote presentation, Digest of Technical Papers Solid-State Circuits Conference (ISSCC99), pp. 22–25, February
• H. De Man, “Ambient intelligence: gigascale dreams and nanoscale realities,” Keynote presentation, Digest of Technical Papers International Solid-State Circuits Conference (ISSCC '05), pp. 29–35, February
• G. Frantz,
• K. Keutzer, S. Malik, R. Newton, J. Rabaey and A. Sangiovanni-Vincentelli, “System Level Design: Orthogonalization of Concerns and Platform-Based Design,” IEEE Transactions on Computer-Aided Design of Integrated Circuits & Systems, vol. 19, no. 12, pp , Dec. 2000

71 Low Power Design Essentials © References
Articles (cntd):
• T. Kuroda, T. Sakurai, “Overview of low-power ULSI circuit techniques,” IEICE Trans. on Electronics, vol. E78-C, no. 4, pp , April
• D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz, R.W. Brodersen, “Methods for True Energy-Performance Optimization,” IEEE Journal of Solid-State Circuits, vol. 39, no. 8, pp , Aug
• D. Markovic, B. Nikolic, R.W. Brodersen, “Power and Area Minimization for Multidimensional Signal Processing,” IEEE Journal of Solid-State Circuits, vol. 42, no. 4, pp , April
• Nexperia, NXP Semiconductors,
• OMAP, Texas Instruments, ntId=4638
• J. Rabaey, “System-on-a-Chip – A Case for Heterogeneous Architectures,” Invited Presentation, Wireless Technology Seminar, Tampere, May Also in HotChips’2000.
• K. Seno, “A 90nm embedded DRAM single chip LSI with a 3D graphics, H.264 codec engine, and a reconfigurable processor,” HotChips
• R. Subramanyan, “Reconfigurable Digital Communications Systems on a Chip,” Invited Presentation, Wireless Technology Seminar, Tampere, May
• H. Zhang, V. Prabhu, V. George, M. Wan, M. Benes, A. Abnous, J. Rabaey, “A 1V Heterogeneous Reconfigurable Processor IC for Baseband Wireless Applications,” IEEE Journal of Solid-State Circuits, vol. 35, no. 11, pp , Nov (also ISSCC 2000)
• R. Zlatanovici, B. Nikolic, “Power-Performance Optimal 64-bit Carry-Lookahead Adders,” in Proc. European Solid-State Circuits Conf. (ESSCIRC), pp , Sept

