Jan M. Rabaey Dejan Marković Low Power Design Essentials ©2008 Chapter 5 Optimizing Design Time Architectures, Algorithms and Systems.


Chapter Outline
• The architecture/system trade-off space
• Concurrency improves energy efficiency
• Exploring alternative topologies
• Removing inefficiency
• The cost of flexibility

Motivation
• Optimizations at the architecture or system level can enable more effective power minimization at the circuit level (while maintaining performance), such as
  – Enabling a reduction in supply voltage
  – Reducing the effective switching capacitance for a given function (physical capacitance, activity)
  – Reducing the switching rates
  – Reducing leakage
• Optimizations at higher abstraction levels tend to have greater potential impact
  – While circuit techniques may yield improvements in the 10-50% range, architecture and algorithm optimizations have reported orders-of-magnitude power reductions

Lessons Learned from Circuit Optimization
Circuit optimization is limited in range.
• Case study: tree adder
• Result of joint (V_DD, V_TH, W) optimization:
  – 65% of energy saved without delay penalty
  – 25% smaller delay without energy cost
• Higher-level optimizations are needed for larger gains
[Figure: normalized energy E/E_ref versus normalized delay D/D_ref; reference point labeled "min" at nominal V_DD, V_TH; annotation: 25%]
[Ref: D. Markovic, JSSC'04]

The Design Abstraction Stack
[Figure: abstraction levels: System/Application, Software, (Micro-)Architecture, Logic/RT, Circuit, Device; annotation: Chapter 4]
Return on investment (ROI) increases at higher levels of the stack.

Expanding the Playing Field
Architecture and system transformations and optimizations reshape the E-D curves:
(1) Removing inefficiencies
(2) Alternative topologies
(3) Discrete options
[Figure: three energy (E) versus delay (D) plots, one per case]

Reducing the Supply Voltage (while maintaining performance)
Concurrency: trading off clock frequency versus area to reduce power.
Consider the following reference design:
[Figure: reference datapath with registers R and logic blocks F1, F2, clocked at f_ref]
• R: register; F1, F2: combinational logic blocks (adders, ALUs, etc.)
• C_ref: average switching capacitance
[Ref: A. Chandrakasan, JSSC'92]

A Parallel Implementation
[Figure: two copies of the reference datapath (R, F1, F2), each clocked at f_ref/2]
• Running slower reduces the required supply voltage
• Yields a quadratic reduction in power
[Equation not recoverable; surviving annotation: "almost cancels"]
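
The power argument behind the parallel implementation can be sketched numerically. The sketch below assumes the standard switching-power model P = C·V²·f and treats overhead as extra capacitance expressed as a fraction of C_ref. The function name, the overhead convention, and the sample supply ratio of 0.7 are illustrative assumptions, not values taken from the slides.

```python
def parallel_power_ratio(n_units: int, overhead: float, vdd_ratio: float) -> float:
    """Return P_par / P_ref for an n-way parallel datapath.

    Assumed model (not taken from the slides):
      C_par = (n_units + overhead) * C_ref   # n copies plus routing/mux overhead
      f_par = f_ref / n_units                # each copy runs slower
      V_par = vdd_ratio * V_ref              # slower logic tolerates a lower supply
      P     = C * V**2 * f                   # dynamic switching power
    """
    cap_ratio = n_units + overhead
    freq_ratio = 1.0 / n_units
    return cap_ratio * vdd_ratio ** 2 * freq_ratio


# Hypothetical operating point: two units, 7.5% overhead, supply scaled to 0.7.
ratio = parallel_power_ratio(n_units=2, overhead=0.075, vdd_ratio=0.7)
print(f"P_par / P_ref = {ratio:.3f}")
```

In this model the doubled capacitance and the halved frequency nearly cancel, so the saving comes almost entirely from the squared supply ratio.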

Example: 90nm Technology
Assuming ov_par = 7.5%
[Figure: normalized propagation delay t_p versus normalized V_DD]

A Pipelined Implementation
[Figure: reference datapath (F1, F2) with additional pipeline registers R, clocked at f_ref]
• Shallower logic reduces the required supply voltage (this example assumes equal V_DD for the parallel and pipelined designs)
• Assuming ov_pipe = 10%

Increasing Use of Concurrency Saturates
• Parallelism and pipelining can be combined to drive V_DD down
• But close to the process threshold, the overhead of excessive concurrency starts to dominate
[Figure: power versus concurrency, assuming constant % overhead]

Increasing Use of Concurrency Saturates
[Figure: power P versus V_DD at fixed throughput; labels: nominal design (no concurrency), concurrency, P_min, overhead + leakage]
• Only option: reduce V_TH as well!
• But: must consider leakage …
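
The saturation effect can be reproduced with a small model. The sketch below is an illustration under stated assumptions: gate delay follows the alpha-power law V/(V - V_TH)^alpha, each added unit adds a constant fraction of capacitance, and leakage is ignored. The threshold (0.3), alpha (2.0), and overhead (10% per added unit) are hypothetical values, not data from the slides.

```python
def delay(vdd: float, vth: float, alpha: float) -> float:
    """Alpha-power-law gate delay, up to a constant factor (assumed model)."""
    return vdd / (vdd - vth) ** alpha


def supply_for_slowdown(n: int, vref: float, vth: float, alpha: float) -> float:
    """Lowest supply at which the logic is at most n times slower than at vref."""
    target = n * delay(vref, vth, alpha)
    lo, hi = vth + 1e-9, vref          # delay falls monotonically as vdd rises
    for _ in range(100):               # plain bisection
        mid = 0.5 * (lo + hi)
        if delay(mid, vth, alpha) > target:
            lo = mid                   # too slow: a higher supply is needed
        else:
            hi = mid
    return hi


def relative_power(n: int, overhead: float, vref: float = 1.0,
                   vth: float = 0.3, alpha: float = 2.0) -> float:
    """P(n) / P(1) at fixed throughput with n-way concurrency."""
    v = supply_for_slowdown(n, vref, vth, alpha)
    return (1.0 + overhead * (n - 1)) * (v / vref) ** 2


powers = {n: relative_power(n, overhead=0.1) for n in (1, 2, 4, 8, 16, 32, 64)}
best_n = min(powers, key=powers.get)
for n, p in powers.items():
    print(f"N={n:2d}  P/P_ref={p:.3f}")
print("lowest power at N =", best_n)
```

As V_DD approaches the threshold, each extra unit buys less voltage reduction while still adding capacitance, so the power curve turns upward beyond some level of concurrency.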

Mapping into the Energy-Delay Space
[Figure: energy per operation (E_Op) versus delay (= 1/throughput) for the nominal design and N = 2, 3, 4, 5 (increasing level of parallelism); at fixed throughput there is an optimum energy-delay point]
• For each level of performance, there is an optimum amount of concurrency
• Concurrency is only energy-optimal if the requested throughput is larger than the optimal operation point of the nominal function
[Ref: D. Markovic, JSSC'04] © IEEE 2004

What if the Required Throughput is Below Minimum? (that is, at no concurrency)
Introduce time-multiplexing!
[Figure: reference design versus time-multiplexed (time-mux) design; labels: A, f, 2f]
• Absorb unused time slack by increasing clock frequency (and voltage …)
• Again comes with some area and capacitance overhead!

Concurrency and Multiplexing Combined
[Figure: data for a 64-b ALU; energy per operation (E_Op) versus delay target (D_target); labels: Max, A = A_ref, parallelism, time-mux, area from small to large]

Some Energy-Inspired Design Guidelines
• For maximum performance
  – Maximize use of concurrency at the cost of area
• For given performance
  – Optimal amount of concurrency for minimum energy
• For given energy
  – Least amount of concurrency that meets performance goals
• For minimum energy
  – Solution with minimum overhead (that is, direct mapping between function and architecture)

Concepts Slowly Embraced in Late 90's
[Figure: trends for memory and microprocessor/DSP; labels: normalized processor speed, computational efficiency [mA/MIP], transistors/chip]
[Ref: R. Subramanyan, Tampere'99]

Confirmed by Actual Processors …
[Courtesy: J. DeVale and B. Black, Intel, '05]

And Finally Accepted in the 00's
[Figure: processor performance (for constant power envelope), single core versus dual/many core; annotations: 3x, 10x]
[Ref: S. Chou, ISSCC'05]

Fully Accepted in 00's
[Figure: chip examples: Xilinx Virtex-4; IBM/Sony Cell Processor; Intel Montecito; UCB Pleiades (ARM with heterogeneous reconfigurable fabric); NTT video codec (4 Tensilica cores); AMD dual core]
[© Xilinx, Intel, AMD, IBM, NTT]

The Quest for Concurrency
Amdahl's Law
[Figure: curves for serial fraction = 0%, 6.7% and 20%; equation not recoverable]
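
Since the equation did not survive in the transcript, the textbook form of Amdahl's Law is sketched here for the three serial fractions named on the slide. The function name and the chosen core counts are illustrative assumptions.

```python
def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    """Speedup of a workload whose serial part cannot be parallelized.

    Textbook form: speedup = 1 / (s + (1 - s) / N).
    """
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


# Serial fractions taken from the slide; core counts are arbitrary samples.
for serial in (0.0, 0.067, 0.20):
    row = [round(amdahl_speedup(serial, n), 1) for n in (1, 4, 16, 64)]
    print(f"serial = {serial:5.1%}: {row}")
```

With a 20% serial fraction the speedup can never exceed 1/0.2 = 5, however many cores are added.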

The Quest for Concurrency
[Figure: relative performance and single-core performance in 22nm for 12, 48, or 144 cores (large, medium, small); budget: […] mm, 100W, 48MB cache, 4B transistors]
[Courtesy: S. Borkar, Intel, 2006]

Manipulating Concurrency Through Transformations
• Loop folding / unfolding
• Algebraic transformations
• Others: loop retiming, loop pipelining, (de)-interleaving, …
[Figure: dataflow graphs built from function blocks F]
[Ref: A. Chandrakasan, TCAD'95; D. Markovic, JSSC'07]

Concurrent Compilers to Pick Up the Pace
Example: visualizing MPEG-4 encoder parallelism
[Courtesy: W.M. Hwu, Illinois]

Impact of Code Transformations
[Figure: single transformation versus combined transformations]
[Courtesy: W.M. Hwu, Illinois]

Choosing Between Alternative Topologies
• Multiple computational topologies exist for a single function F
  – E.g.: adders, ALUs, multipliers, dividers, goniometric functions
• Each topology comes with its own optimal E-D curve
• The absolute optimal E-D curve for function F is obtained as the composite of these curves
  – It plots the best possible implementation of F for a given E or D and technology
[Figure: E-D curves of two topologies F' and F'']

Adder Example: Static versus Domino
64-bit CLA adders; 130 nm CMOS (R2: radix 2; R4: radix 4)
• Static adders are low power but slow
• Dynamic adders are the preferred choice for higher performance
• Higher radix orders improve efficiency overall
[Figure: energy [pJ] versus delay [FO4]; curves: Static R2, Domino R2, Domino R4, Compound Domino R]
[Ref: R. Zlatanovici, ESSCIRC'03]

Adder Example: Static CLA versus Ling
• Conventional CLA
  – Higher stack in first stage
  – Simple sum precompute
• Ling CLA
  – Lower stack in first stage
  – Complex sum precompute
  – Higher speed
[Figure: energy [pJ] versus delay [FO4]; curves: R2 CLA, R2 Ling, R4 CLA, R4 Ling]
[Ref: R. Zlatanovici, ESSCIRC'03] © IEEE 2003

Improving Computational Efficiency
Implementations of a given function may be inefficient and can often be replaced with more efficient versions without penalty in energy or delay.
[Figure: energy (E) versus delay (D)]
Inefficiencies arise from:
• Over-dimensioning or over-design
• Generality of function
• Design methodologies
• Limited design time
• Need for flexibility, re-use and programmability

Improving Computational Efficiency
Some simple guidelines:
• Match computation and architecture
  – Dedicated solutions are superior by far
• Preserve locality present in the algorithm
  – Getting data from far away is expensive
• Exploit signal statistics
  – Correlated data contains fewer transitions than random data
• Energy on demand
  – Only spend energy when truly needed

Matching Computation and Architecture
• Choice of computational architecture can have a major impact on energy efficiency (see further)
Example: compute y = A∙x^2 + B∙x + C
[Figure: two alternative architectures for this computation]
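
The two architectures on the slide did not survive extraction. As a hedged illustration of how the same function can map onto different computations, the sketch below contrasts a direct evaluation with the Horner form. Horner's rule is swapped in here as one well-known alternative and is not confirmed to be what the slide shows; both function names are hypothetical.

```python
def poly_direct(a: int, b: int, c: int, x: int) -> int:
    """Direct mapping: three multiplications and two additions."""
    return a * x * x + b * x + c


def poly_horner(a: int, b: int, c: int, x: int) -> int:
    """Horner form: two multiplications and two additions."""
    return (a * x + b) * x + c


# Both structures compute the same y; they differ in operation count
# and in how the operations can be scheduled onto hardware.
for x in range(-3, 4):
    print(x, poly_direct(2, -5, 7, x), poly_horner(2, -5, 7, x))
```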

Matching Computation and Architecture
Example: word-length optimization
• Most algorithms (wireless, multimedia) are simulated and developed in floating point
• Programmable architectures operate on a fixed word length (16, 32, 64 bit)
• This creates a substantial amount of switching overhead for many computations
Careful selection of word length leads to substantial power reduction.
[Figure: datapath with quantizers]
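
Word-length selection is usually driven by how much quantization error an algorithm tolerates. The sketch below is a minimal illustration, assuming a round-to-nearest fixed-point quantizer with a chosen number of fractional bits; the function names and the sample signal are hypothetical.

```python
def quantize(value: float, frac_bits: int) -> float:
    """Round value to the nearest multiple of 2**-frac_bits (fixed-point grid)."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def max_error(samples, frac_bits: int) -> float:
    """Worst-case quantization error over a set of samples."""
    return max(abs(s - quantize(s, frac_bits)) for s in samples)


# Hypothetical test signal covering [-1, 1].
samples = [i / 97.0 for i in range(-97, 98)]
for bits in (4, 8, 12):
    print(f"{bits:2d} fractional bits: max error = {max_error(samples, bits):.6f}")
```

Each extra bit halves the error bound, so the shortest word length that still meets the accuracy target avoids switching bits that carry no useful information.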

Design Example: SVD Processor for MIMO
• MIMO channel: matrix H, decomposed as H = U ∙ Σ ∙ V†
[Figure: Tx array (x) and Rx array (y) linked by a 1st path (= 1) and a 2nd path (= 0.6); processing chain: encoding & modulation, V, channel, U†, demodulation & decoding; outputs z'_1 … z'_4]
[Ref: D. Markovic, JSSC'07] © IEEE 2007

SVD: Optimization Techniques
[Figure: energy and area versus clock period, from initial synthesis to target speed; steps: word-length, sizing, interleaving + folding, V_DD scaling (from max V_DD); annotations: 40%, 20%, 30%, 7x, 36x]
[Ref: D. Markovic, JSSC'07] © IEEE 2007

Energy-Area-Delay Tradeoff in SVD
Impact of combined optimizations:
• Folding, interleaving, sizing, word length, voltage scaling
• 64x area and 16x energy reduction compared to direct mapping
[Figure: energy, delay and area axes; path from initial synthesis to final design; labels: 16b design, word-size, sizing, V_DD scaling, Optim. V_DD, W, Interl. 13.8x, Fold 2.6x; annotations: 40%, 30%, 30%, 20%]
[Ref: D. Markovic, JSSC'07] © IEEE 2007

Power/Area Optimal 4x4 SVD Chip
• 2.1 GOPS/mW
  – 100 MHz clock
  – 70 GOPS
  – Power = 34mW
• 20 GOPS/mm^2
  – 3.5mm^2
  – 70 GOPS
• V_DD = 0.4V
[Figure: SVD chip (34mW, 3.5mm^2) compared with ISSCC chips; axes: energy efficiency (GOPS/mW), area efficiency (GOPS/mm^2)]
[Ref: D. Markovic, JSSC'07] © IEEE 2007

Locality of Reference
• Fetching data and instructions from local rather than global resources reduces access cost (interconnect, access energy)
• Prime example: memory hierarchy
  – Register files, caches, instruction loop buffers, memory partitioning, distributed memory
[Figure: µP (Instr., PC) connected directly to main memory (slow, expensive), versus µP with a cache memory (small, fast, efficient) in front of main memory]

Locality of Reference
• (Hardware) instruction loop buffer
  – On first iteration, code is cached in the loop buffer
  – Fetched from the loop buffer on subsequent iterations
  – Popular feature in DSPs
[Figure: IMEM, loop buffer, processor core, PC, IC + LC]
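
The benefit of a loop buffer can be estimated by counting where instruction fetches are served. The sketch below is a simplified model under stated assumptions: a loop body that fits in the buffer is fetched from instruction memory only on the first iteration, and a body that does not fit always goes to instruction memory. The function name and the sizes used are hypothetical.

```python
def fetch_counts(body_len: int, iterations: int, buffer_size: int):
    """Return (imem_fetches, loop_buffer_fetches) for one loop.

    Assumes iterations >= 1 and an all-or-nothing buffer: the body is
    either cached completely or not at all.
    """
    if body_len > buffer_size:
        return body_len * iterations, 0
    return body_len, body_len * (iterations - 1)


# Hypothetical 8-instruction kernel, 100 iterations, 16-entry loop buffer.
imem, buffered = fetch_counts(body_len=8, iterations=100, buffer_size=16)
print("IMEM fetches:", imem, "loop-buffer fetches:", buffered)
```

Since a small buffer costs far less energy per access than a large instruction memory, shifting most fetches to the buffer lowers the energy per executed instruction.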

Software Optimizations Crucial
[Figure: reference code passed through the compiler; Loop1 followed by Loop2 is restructured (Loop1 and Loop2', or an alternative form), improving temporal locality of data]
[Ref: H. De Man, ISSCC'05] © IEEE 2005
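
The slide does not say which transformation is applied, so the sketch below uses loop fusion as one common example of improving temporal locality: each intermediate value is consumed right after it is produced instead of being written to an array and read back later. The function names and the arithmetic are purely illustrative.

```python
def two_pass(data):
    """Loop1 writes a full intermediate array; Loop2 reads it back later."""
    scaled = [2 * x for x in data]        # Loop1
    return [s + 1 for s in scaled]        # Loop2


def fused(data):
    """Fused loop: the intermediate value never leaves the loop body."""
    out = []
    for x in data:
        s = 2 * x                         # produced ...
        out.append(s + 1)                 # ... and consumed immediately
    return out


data = list(range(10))
print(two_pass(data) == fused(data))
```

On real hardware the fused version keeps the intermediate value in a register, which avoids the memory traffic for the temporary array.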

Software Optimizations – Example
VGA quality MPEG 4 on 1.6 GHz Pentium M
[Ref: H. De Man, ISSCC'05] © IEEE 2005

Exploiting Signal Statistics
• Sequential data very often displays temporal correlation
• Temporally uncorrelated data maximizes transitions
• Preserving correlations (= avoiding time sharing) is a good idea
[Figure: two counters (Cntr1, Cntr2) either sharing one multiplexed bus or each driving its own bus; labels: mbus, bus2]
[Courtesy: A. Chandrakasan]
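
The effect of time sharing on switching activity can be checked by counting bit toggles. The sketch below is a minimal illustration, assuming two 8-bit up-counters with hypothetical start values 0 and 170; it compares the toggles on two dedicated buses with those on one shared bus carrying the interleaved values.

```python
def toggles(sequence, width: int = 8) -> int:
    """Total number of bit flips between consecutive words on a bus."""
    mask = (1 << width) - 1
    return sum(bin((a ^ b) & mask).count("1")
               for a, b in zip(sequence, sequence[1:]))


cntr1 = list(range(0, 64))        # counter 1: 0, 1, 2, ...
cntr2 = list(range(170, 234))     # counter 2: 170, 171, 172, ...

# Dedicated buses: each bus sees a slowly changing, correlated sequence.
separate = toggles(cntr1) + toggles(cntr2)

# Shared bus: values alternate between the two counters.
shared = toggles([v for pair in zip(cntr1, cntr2) for v in pair])

print("separate buses:", separate, "toggles")
print("shared bus:    ", shared, "toggles")
```

Interleaving destroys the correlation between consecutive words, so the shared bus switches far more bits for the same data.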

Exploiting Signal Statistics
[Figure: % reduction in signal activity]

The Cost of Flexibility
• Programmable solutions are very attractive
  – Shorter time to market
  – High reuse
  – Field updates (reprogramming)
• But they come at a large efficiency cost
  – Energy/function and throughput-latency/function are substantially higher than for a dedicated implementation
• How to combine flexibility and efficiency?
  – Simple versus complex processors
  – Stepping away from "completely flexible" to "somewhat dedicated"
  – Concurrency versus clock frequency
  – Novel architectural solutions such as reconfiguration

The Cost of Flexibility
[Figure: E-D curves for dedicated and programmable implementations; number of applications (# Apps) from 1 to N]

The Cost of Flexibility
[Figure: energy efficiency in MOPS/mW (or MIPS/mW) versus flexibility (coverage), benchmark 1999; entries: dedicated HW; reconfigurable processor/logic (Pleiades, […] MOPS/mW); ASIPs and DSPs (DSP: 3 MOPS/mW); embedded processors (SA, […] MIPS/mW)]
Approximately three orders of magnitude in inefficiency from general-purpose to dedicated!
[Ref: J. Rabaey, Tampere'99]

The Cost of Flexibility – Evolution
[Figure: power efficiency PE (GOPS/Watt) versus feature size (µm); labels: 32 bit IPE, GP microprocessor, muxed data paths, IS computing, reconfigurable // computing; markers: mpu, asip-dsp, cg, fg]
[Ref: T. Claasen, ISSCC'99; H. De Man, ISSCC'05]

The Cost of Flexibility – Example
Least-mean-square pilot correlators for CDMA (1.67 MSymbols data rate)
Complexity: 300 Mmult/sec and 360 Mmac/sec
ASIC implementation: […] mW
Architecture comparison – single correlator:

Type                  Power     Area
Commercial DSP        460 mW    1100 mm^2
Configurable Proc.    18 mW     5.5 mm^2
Dedicated             3 mW      1.5 mm^2

[Ref: N. Zhang, PhD'01]

The Architectural Choices
[Figure: four architectures, each organized around a bus:
 • General-purpose processor: µP, M
 • Application-specific processor: ALU, M1, M2, M$
 • Dedicated accelerators: µP, M, AC1, AC2
 • Reconfigurable processor: µP, M, RC1, RC2, RC3, RN]

Simple versus Complex Processors?
• Best explored using energy-delay curves
• For each proposed architecture and parameter set, determine the average energy-delay over a library of benchmark examples
• Modern computer-aided design tools allow for quick synthesis and analysis
  – Leads to fair comparison
• Example: Subliminal Project, University of Michigan
  – Explores processor architecture over the following parameters: depth and number of pipeline stages; memory: Von Neumann or Harvard; ALU width (8/16/32); with or without explicit register file

Simple versus Complex Processors
Pareto analysis over 19 processors
[Ref: D. Blaauw, ISCA'05]

Application-Specific Processors
• Tailor the processor to be efficient for a sub-set of applications
  – Memory architecture, interconnect structure, computational units, instructions
• Digital signal processors are the best-known example
  – Special memory architecture provides locality
  – Datapath optimized for vector multiplication (originally)
• Examples are now available in many other areas (graphics, security, control, etc.)

Example 1: DSPs
• The first type of application-specific processor to become popular
• Initially mostly for performance, but the energy benefit is now also recognized
• Key properties: dedicated memory architecture (multiple data memories), data path specialized for specific functions such as vector multiplies and FFTs
• Over time: introduction of more and more concurrency (VLIW)
[Figure: DSP datapath; labels: RamX (N × 16), RamY (N × 16), Mult (16 × 16), ALU (40), Acc (40)]

DSPs Deliver Improving Energy-Efficiency
Energy efficiency of DSPs doubles every 18 months ("Gene's Law"), but…
[Figure: DSP power dissipation trends; mW/MMACs versus year; curves: Gene's Law, DSP power]
[Ref: G. Frantz, TI]
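
Gene's Law as stated on the slide translates into a simple exponential: if efficiency doubles every 18 months, the power needed per MMAC halves over the same period. The sketch below projects that trend from a hypothetical starting value of 100 mW/MMAC; the function name and the starting value are illustrative assumptions, and the slide's "but…" suggests that real devices may depart from the trend.

```python
def power_per_mmac(initial_mw: float, years: float,
                   doubling_period_years: float = 1.5) -> float:
    """Projected mW/MMAC if efficiency doubles every doubling_period_years."""
    return initial_mw * 0.5 ** (years / doubling_period_years)


# Hypothetical starting point of 100 mW/MMAC.
for years in (0.0, 1.5, 3.0, 6.0, 15.0):
    print(f"after {years:4.1f} years: {power_per_mmac(100.0, years):.4f} mW/MMAC")
```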

Performances of DSP Processors
[Table residue, most cell values lost in extraction; rows: DSP Proc (?), Techno (nm), # Gates (50K, 500K, 5G, 50G), V_DD (V), GHz, MIPS ("5405K50K"), MIPS/W ("48010K1G"), mW/MIPS]
[Ref: G. Frantz, TI]

Application Specific Instruction Processors (ASIP)
• Core processor with extendible instruction set
• Combines spatial and temporal processing
[Courtesy: C. Rowen, Tensilica'01]

Advantage of Application Specific Processors
Impact of adding special instructions:
• […] extra instructions
• 1700 additional gates
• No cycle time impact
• Code size reduction
[Courtesy: C. Rowen, Tensilica'01]

Optimizing Energy in Video
Diamond 388VDO Video Processor: ISA extensions to support Context-adaptive Binary Arithmetic Coding (CABAC) in H.264 decoding*

                 Unaugmented core    ISA extended core
CABAC cycles     710 Mcycles/sec     13 Mcycles/sec
Energy/sec       164 mJ              5 mJ

Area cost for CABAC ISA extensions: 20 Kgates
* 5Mbps H.264 MP stream with MBAff enabled, at D1 resolution
[Figure: top-level block diagram]
[Courtesy: C. Rowen, Tensilica'07]

Hardware Accelerators
Often-executed functions are implemented as dedicated modules and executed as co-processors.
• Opportunities: network processing, MPEG encode/decode, speech, wireless interfaces
• Advantage: energy efficiency of custom implementation
• Disadvantage: area overhead
Example: computational core of the Texas Instruments OMAP 2420 Platform
[Figure: ARM11; TMS320C55 DSP; 2D/3D graphics accelerator; imaging video accelerator; security accelerator (SHA-1, DES, AES, PKA, secure WDT); timers, interrupt controllers; shared memory controller, DMA]
[Ref: OMAP Platform, TI]

Hardware Accelerators
Example: networking coprocessor (TCP Offload Engine)
[Figure: chip of […] mm × 3.54 mm, 260K transistors; plot of MIPS versus year; curves: GP, APS]
[Courtesy: S. Borkar, Intel'05]

(Re)configurable Processors
"Programming in space"
• Create dedicated co-processors by reconfiguring the interconnect between dedicated computational modules
• Efficiency of hardwired accelerators, but increased flexibility and reuse (smaller area)
[Figure: µP, configurable logic and arithmetic modules linked by a configurable interconnect and a configuration bus]
[Ref: H. Zhang, JSSC'00]

Programming in Space versus Time
Example: co-variance matrix computation

Programming in time (code segment on the embedded processor):

    for (i = 1; i <= L; i++)
        for (k = i; k <= L; k++)
            phi[i][k] = phi[i-1][k-1]
                        + in[NP-i] * in[NP-k]
                        - in[NA-1-i] * in[NA-1-k];

Programming in space:
[Figure: dataflow from start to end; blocks: AddrGen, MEM: in, MPY, ALU, AddrGen, MEM: phi]
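
The recursion in the code segment can be checked against a direct sum. The sketch below is one interpretation, assuming that phi[i][k] is the sum of in[n-i]*in[n-k] over n = NP … NA-2, which is the windowed sum that makes the recurrence on the slide exact, and that row 0 is computed directly to seed the recursion. The function names, the seeding of row 0, and the test signal are assumptions, not part of the slide.

```python
def covariance_direct(x, L, NP, NA):
    """phi[i][k] = sum over n in [NP, NA-2] of x[n-i] * x[n-k], for k >= i."""
    phi = [[0] * (L + 1) for _ in range(L + 1)]
    for i in range(L + 1):
        for k in range(i, L + 1):
            phi[i][k] = sum(x[n - i] * x[n - k] for n in range(NP, NA - 1))
    return phi


def covariance_recursive(x, L, NP, NA):
    """Same matrix, using the slide's recurrence along each diagonal."""
    phi = [[0] * (L + 1) for _ in range(L + 1)]
    for k in range(L + 1):                       # row 0 seeds the recursion
        phi[0][k] = sum(x[n] * x[n - k] for n in range(NP, NA - 1))
    for i in range(1, L + 1):
        for k in range(i, L + 1):
            phi[i][k] = (phi[i - 1][k - 1]
                         + x[NP - i] * x[NP - k]            # sample entering the window
                         - x[NA - 1 - i] * x[NA - 1 - k])   # sample leaving the window
    return phi


# Hypothetical integer test signal; NP >= L keeps every index non-negative.
signal = [(7 * n * n + 3 * n) % 11 - 5 for n in range(40)]
print(covariance_direct(signal, 4, 10, 40) == covariance_recursive(signal, 4, 10, 40))
```

The recurrence replaces a full inner sum per matrix entry with two multiply-accumulate operations, which matches the multiplier and ALU blocks named in the figure.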

Example: Reconfigurable Processor for Wireless
VCELP coder for wireless (1 250 nm CMOS)
[Ref: H. Zhang, JSSC'00]

Results of VCELP Voice Coder
• […]% of VCELP code maps onto the reconfigurable datapath
• Compared to state-of-the-art 17mW DSP
[Figure: VCELP code breakdown; VCELP energy breakdown]
[Ref: H. Zhang et al., JSSC'00]

Example: Sony Virtual Mobile Engine (VME)
• Dynamically reconfigurable vector engine
• Reconfigured on the fly
• One-cycle context switch
• Coarse-grain heterogeneous type
• Native 24-bit data width
• Max clock frequency 166MHz
• Deployed in portable music and game players
Other examples: ADRES, Cluster, CoolDSP, SiliconHive
[Ref: K. Seno, HotChips'04]

Remember: Amdahl's Law Still Holds
• Effectiveness of alternative architectures (ASIP, accelerator, reconfigurable) is determined by the amount of code spawned from the GP
• Mostly effective for repetitive kernels
• The 80-20% rule typically seems to apply
• Transformations can help to improve effectiveness
• Most important: code development and algorithm selection that encourage concurrency

Bringing It All Together
Domain-specific platforms combine multiple computational concepts to optimize flexibility, performance and energy efficiency.
Example: NXP Nexperia Platform for Multimedia Applications
[Figure: DVP system silicon; MIPS CPU (PRxxxx, D$, I$) and TriMedia CPU (TM-xxxx, D$, I$), each with a PI bus and device I/P blocks; DVP memory bus, MMI, SDRAM]
[Ref: Nexperia, NXP]

A Heterogeneous Platform
Example: a programmable HDTV media processor
• Combines VLIW DSP with configurable media co-processors
[Figure: includes a configurable accelerator for image filtering]
[Ref: Nexperia, NXP]

OMAP Platform for Wireless
Combines an "enhanced ARM processor", multiple accelerator processors, I/O modules and a sophisticated interconnect network
[Ref: OMAP, TI]

Summary and Perspectives
• Architectural and algorithmic optimization can lead to drastic improvements in energy efficiency
• Concurrency is an effective means to improve throughput at fixed energy or to reduce energy for fixed throughput
• Energy-efficient architectures specialize the implementation of often-recurring instructions or functions

References
Theses:
• M. Potkonjak, "Algorithms for high level synthesis: resource utilization based approach," PhD thesis, UC Berkeley, […].
• N. Zhang, "Algorithm/Architecture Co-Design for Wireless Communication Systems," PhD thesis, UC Berkeley, […].
Articles:
• D. Blaauw, B. Zhai, "Energy Efficient Design for Subthreshold Supply Voltage Operation," IEEE International Symposium on Circuits and Systems (ISCAS), April 2006.
• S. Borkar, "Design challenges of technology scaling," IEEE Micro, vol. 19, no. 4, pp. 23-29, July-Aug. […].
• A.P. Chandrakasan, S. Sheng, R.W. Brodersen, "Low-power CMOS digital design," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. […], April […].
• A. Chandrakasan, M. Potkonjak, J. Rabaey and R. Brodersen, "Optimizing Power using Transformations," IEEE Transactions on Computer Aided Design, vol. 14, no. 1, pp. […], Jan. […].
• S. Chou, "Integration and innovation in the nanoelectronics era," Keynote presentation, Digest of Technical Papers, Solid-State Circuits Conference (ISSCC05), pp. […], February […].
• T. Claasen, "High speed: not the only way to exploit the intrinsic computational power of silicon," Keynote presentation, Digest of Technical Papers, Solid-State Circuits Conference (ISSCC99), pp. 22-25, February […].
• H. De Man, "Ambient intelligence: gigascale dreams and nanoscale realities," Keynote presentation, Digest of Technical Papers, International Solid-State Circuits Conference (ISSCC '05), pp. 29-35, February […].
• G. Frantz, […]
• K. Keutzer, S. Malik, R. Newton, J. Rabaey and A. Sangiovanni-Vincentelli, "System Level Design: Orthogonalization of Concerns and Platform-Based Design," IEEE Transactions on Computer-Aided Design of Integrated Circuits & Systems, vol. 19, no. 12, pp. […], Dec. 2000.

References
Articles (cntd):
• T. Kuroda, T. Sakurai, "Overview of low-power ULSI circuit techniques," IEICE Trans. on Electronics, vol. E78-C, no. 4, pp. […], April […].
• D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz, R.W. Brodersen, "Methods for True Energy-Performance Optimization," IEEE Journal of Solid-State Circuits, vol. 39, no. 8, pp. […], Aug. […].
• D. Markovic, B. Nikolic, R.W. Brodersen, "Power and Area Minimization for Multidimensional Signal Processing," IEEE Journal of Solid-State Circuits, vol. 42, no. 4, pp. […], April […].
• Nexperia, NXP Semiconductors, […]
• OMAP, Texas Instruments, ntId=4638
• J. Rabaey, "System-on-a-Chip – A Case for Heterogeneous Architectures," Invited Presentation, Wireless Technology Seminar, Tampere, May […]. Also in HotChips'2000.
• K. Seno, "A 90nm embedded DRAM single chip LSI with a 3D graphics, H.264 codec engine, and a reconfigurable processor," HotChips […].
• R. Subramanyan, "Reconfigurable Digital Communications Systems on a Chip," Invited Presentation, Wireless Technology Seminar, Tampere, May […].
• H. Zhang, V. Prabhu, V. George, M. Wan, M. Benes, A. Abnous, J. Rabaey, "A 1V Heterogeneous Reconfigurable Processor IC for Baseband Wireless Applications," IEEE Journal of Solid-State Circuits, vol. 35, no. 11, pp. […], Nov. […] (also ISSCC 2000).
• R. Zlatanovici, B. Nikolic, "Power-Performance Optimal 64-bit Carry-Lookahead Adders," in Proc. European Solid-State Circuits Conf. (ESSCIRC), pp. […], Sept. […].