Low Power Design, Past and Future

Slides:



Advertisements
Similar presentations
ITRS Roadmap Design + System Drivers Makuhari, December 2007 Worldwide Design ITWG Good morning. Here we present the work that the ITRS Design TWG has.
Advertisements

On The Energy Efficiency of Computation Mihai Budiu CMU CS CALCM Seminar Feb 17, 2004 Note: this version fixes some errors in the ASH performance graphs.
Chapter 8 Interfacing Processors and Peripherals.
Introduced 1982 Used mostly in embedded applications - controllers, point-of- sale systems, terminals, and the like Used in several MS-DOS non-PC- Compatible.
Goals for Today Today we want to learn about the microprocessor, the key component, the brain, of a computer We’ll learn about the function of a microprocessor.
Presenter : Cheng-Ta Wu Kenichiro Anjo, Member, IEEE, Atsushi Okamura, and Masato Motomura IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 39,NO. 5, MAY 2004.
Embedded Systems Design: A Unified Hardware/Software Introduction 1 Chapter 10: IC Technology.
Digital Logical Structures
Feb. 17, 2011 Midterm overview Real life examples of built chips
Subthreshold SRAM Designs for Cryptography Security Computations Adnan Gutub The Second International Conference on Software Engineering and Computer Systems.
Parallelism Lecture notes from MKP and S. Yalamanchili.
Computer Abstractions and Technology
Power Reduction Techniques For Microprocessor Systems
A Survey of Logic Block Architectures For Digital Signal Processing Applications.
Chapter1 Fundamental of Computer Design Dr. Bernard Chen Ph.D. University of Central Arkansas.
Zheming CSCE715.  A wireless sensor network (WSN) ◦ Spatially distributed sensors to monitor physical or environmental conditions, and to cooperatively.
VLSI Trends. A Brief History  1958: First integrated circuit  Flip-flop using two transistors  From Texas Instruments  2011  Intel 10 Core Xeon Westmere-EX.
Embedded Systems Programming
Introduction to Reconfigurable Computing CS61c sp06 Lecture (5/5/06) Hayden So.
Spring 07, Feb 20 ELEC 7770: Advanced VLSI Design (Agrawal) 1 ELEC 7770 Advanced VLSI Design Spring 2007 Reducing Power through Multicore Parallelism Vishwani.
UC Berkeley B. Nikolić Architecture choices MAC Unit Addr Gen  P Prog Mem Embedded Processor (lpArm) Direct Mapped Hardware Embedded FPGA DSP (e.g. TI.
EET 4250: Chapter 1 Performance Measurement, Instruction Count & CPI Acknowledgements: Some slides and lecture notes for this course adapted from Prof.
Parallel Algorithms - Introduction Advanced Algorithms & Data Structures Lecture Theme 11 Prof. Dr. Th. Ottmann Summer Semester 2006.
Low Power Design of Integrated Systems Assoc. Prof. Dimitrios Soudris
Case Study - SRAM & Caches
An Energy-Efficient Reconfigurable Multiprocessor IC for DSP Applications Multiple programmable VLIW processors arranged in a ring topology –Balances its.
Chapter1 Fundamental of Computer Design Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010.
INTRODUCTION TO MICROPROCESSORS
1 VLSI and Computer Architecture Trends ECE 25 Fall 2012.
Power Reduction for FPGA using Multiple Vdd/Vth
Low-Power Wireless Sensor Networks
EET 4250: Chapter 1 Computer Abstractions and Technology Acknowledgements: Some slides and lecture notes for this course adapted from Prof. Mary Jane Irwin.
Sogang University Advanced Computing System Chap 1. Computer Architecture Hyuk-Jun Lee, PhD Dept. of Computer Science and Engineering Sogang University.
1 EE 587 SoC Design & Test Partha Pande School of EECS Washington State University
Operating Systems Lecture 02: Computer System Overview Anda Iamnitchi
C OMPUTER O RGANIZATION AND D ESIGN The Hardware/Software Interface 5 th Edition Chapter 1 Computer Abstractions and Technology Sections 1.5 – 1.11.
ASIP Architecture for Future Wireless Systems: Flexibility and Customization Joseph Cavallaro and Predrag Radosavljevic Rice University Center for Multimedia.
Reminder Lab 0 Xilinx ISE tutorial Research Send me an if interested Looking for those interested in RC with skills in compilers/languages/synthesis,
EE3A1 Computer Hardware and Digital Design
Chapter 1 Computer Abstractions and Technology. Chapter 1 — Computer Abstractions and Technology — 2 The Computer Revolution Progress in computer technology.
VLSI: A Look in the Past, Present and Future Basic building block is the transistor. –Bipolar Junction Transistor (BJT), reliable, less noisy and more.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.
Development of Programmable Architecture for Base-Band Processing S. Leung, A. Postula, Univ. of Queensland, Australia A. Hemani, Royal Institute of Tech.,
DSP Architectures Additional Slides Professor S. Srinivasan Electrical Engineering Department I.I.T.-Madras, Chennai –
© Digital Integrated Circuits 2nd Inverter Digital Integrated Circuits A Design Perspective The Inverter Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.
An Introduction to VLSI (Very Large Scale Integrated) Circuit Design
Patricia Gonzalez Divya Akella VLSI Class Project.
Computer Science and Engineering Power-Performance Considerations of Parallel Computing on Chip Multiprocessors Jian Li and Jose F. Martinez ACM Transactions.
Fundamentals of Programming Languages-II
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
Seok-jae, Lee VLSI Signal Processing Lab. Korea University
CS203 – Advanced Computer Architecture
Computer Organization IS F242. Course Objective It aims at understanding and appreciating the computing system’s functional components, their characteristics,
1 A simple parallel algorithm Adding n numbers in parallel.
Hardware Architecture
History of Computers and Performance David Monismith Jan. 14, 2015 Based on notes from Dr. Bill Siever and from the Patterson and Hennessy Text.
Lynn Choi School of Electrical Engineering
Chapter1 Fundamental of Computer Design
LOW POWER DESIGN METHODS V.ANANDI ASST.PROF,E&C MSRIT,BANGALORE.
Embedded Systems Design
Morgan Kaufmann Publishers
Instructor: Dr. Phillip Jones
CGRA Express: Accelerating Execution using Dynamic Operation Fusion
INTRODUCTION TO MICROPROCESSORS
BIC 10503: COMPUTER ARCHITECTURE
CSV881: Low-Power Design Multicore Design for Low Power
The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.
A High Performance SoC: PkunityTM
Chapter 1 Introduction.
Intel CPU for Desktop PC: Past, Present, Future
Presentation transcript:

Low Power Design, Past and Future Bob Brodersen (UCB), Anantha Chandrakasan (MIT) and Dejan Markovic (UCLA)

Acknowledgements Rahul Rithe (MIT) Mehul Tikekar (MIT) Chengcheng Wang (UCLA) Fang-Li Yuan (UCLA)

Technologies evolved to obtain improved Energy Efficiency 1940-1945 Relays 1945-1955 Vacuum Tubes 1955-1970 Bipolar - ECL, TTL, I2L 1970-1975 PMOS and NMOS Enhancement mode 1975-1980 NMOS Depletion mode 1975-2005 CMOS with Feature size, Voltage and Frequency Scaling (The Golden Age)

The End of (Energy) Scaling is Here Number of transistors increase and will continue to increase with CMOS technology and 3D integration But…. For energy efficiency of logic it is the voltage and frequency that is critical and they aren’t scaling

Now what?? Do we need a fundamentally new device like in the past? Carbon nanotube, Spin transistor?

It really isn’t necessary! Now what?? Do we need a fundamentally new device like in the past? Carbon nanotube, Spin transistor? It really isn’t necessary! Other considerations are becoming more important than the logic device There are other ways to stay on a “Moore’s Law” for energy efficiency

Definitions… Operation = OP = algorithmically interesting operation (i.e. multiply, add, memory access) MOPS = Millions of OP’s per Second Nop = Number of parallel OP’s in each clock cycle

Energy and Power Efficiency MOPS/mW = OP/nJ Power efficiency metric = Energy efficiency metric Energy Efficiency = Number of useful operations Energy required = # of Operations = OP/nJ NanoJoule = OP/Sec = MOPS = Power Efficiency NanoJoule/Sec mW

Design Techniques to reduce energy/power System – Protocols, sleep modes, communication vs. computation Algorithms – Minimize computational complexity and maximize parallelism Architectures – Tradeoff flexibility against energy/power efficiency Circuits – Optimization at the transistor level

Design Techniques to reduce energy/power System – Protocols, sleep modes, communication vs. computation Algorithms – Minimize computational complexity and maximize parallelism Architectures – Tradeoff flexibility against energy/power efficiency Circuits – Optimization at the transistor level

Future Circuit Improvements (other than memory) Functional over wide supply range for use with dynamic voltage scaling and optimized fixed supplies Power down and sleep switches to facilitate dark silicon Tradeoff speed and density for leakage reduction Optimized circuits to support near Vt and below Vt operation with low clock rates However, these improvements will only be effective if developed along with new architectures

The memory problem (a circuit and device issue) DRAM’s are incredibly inefficient (40 nm, 1V) 6 OP/nJ (OP = 16 bit access) A 16 bit adder (40 nm at .5 V) 60,000 OP/nJ (OP= 16 bit add) 1000 times lower efficiency than logic! SRAM’s are better 8 T cell .5 V - 330 OP/nJ (M. Sinangil et. al. JSSC 2009) 10T cell .5V - 590 OP/nJ (S. Clerc et. al. 2012 ICICDT) Example of using more transistors to improve energy efficiency (more will be coming)

Design Techniques to reduce energy/power System – Protocols, sleep modes, communication vs. computation Algorithms – Minimize computational complexity and maximize parallelism Architectures – Tradeoff flexibility against energy/power efficiency Circuits – Optimization at the transistor level

Year 1997-2000 ISSCC Chips (.18m-.25m) Paper Description Chip # 1 1997 10.3 mP - S/390   11 1998 18.1 DSP -Graphics 2 2000 5.2 mP – PPC (SOI) 12 18.2 DSP - Multimedia 3 1999 mP - G5 13 14.6 DSP – Multimedia 4 5.6 mP - G6 14 2002 22.1 Mpeg Decoder 5 5.1 mP - Alpha 15 18.3 6 15.4 mP - P6 16 2001 21.2 Encryption Processor 7 18.4 17 14.5 Hearing Aid Processor 8 mP – PPC 18 4.7 FIR for Disk Read Head 9 18.6 DSP - StrongArm 19 2.1 MPEG Encoder 10 4.2 DSP – Comm 20 7.2 802.11a Baseband DSP’s Microprocessors Dedicated DSP’s

We’ll look at one from each category… What is the key architectural feature which gives the best energy efficiency (15 years ago)? We’ll look at one from each category… WLAN ;; General Purpose DSP Microprocessors Dedicated NEC DSP PPC

Uniprocessor (PPC): MOPS/mW=.13 The only circuitry which supports “useful operations” All the rest is overhead to support the time multiplexing Number of operations each clock cycle = Nop = 2 fclock = 450 MHz (2 way) = 900 MOPS Power = 7 Watts

Programmable DSP: MOPS/mW=7 Same granularity (a datapath), more parallelism 4 Parallel processors (4 ops each) Nop = 16 50 MHz clock => 800 MOPS Power = 110 mW.

802.11a Dedicated Design - WLAN : MOPS/mW=200 ADC/DAC Viterbi Decoder MAC Core Time/Freq Synch FFT DMA PCI AGC FSM Fully parallel mapping of an 802.11a implementation. No time multiplexing. Nop = 500 Clock rate = 80 MHz => 40,000 MOPS Power = 200 mW

Now lets update the plot and look at more recent chips (x.x) 2013 ISSCC Paper SVD (90nm) SDR Processor Video Decoder (9.5) Recognition Processor (9.8) (9.6) Computational Photography Application Processor (9.4) Mobile Processors (Atom, Snapdragon, Exynos, OMAP) DSP Processor (7.5 – 2011) 24 Core Processor (3.6) 100,000 X Fujitsu 4 Core FR550 Intel Sandy Bridge Microprocessors Programmable DSP Dedicated

Lets again look at 3 of them (x.x) 2013 ISSCC Paper SVD (90nm) SDR Processor Video Decoder (9.5) Recognition Processor (9.8) (9.6) Computational Photography Application Processor (9.4) Mobile Processors (Atom, Snapdragon, Exynos, OMAP) DSP Processor (7.5 – 2011) 24 Core Processor (3.6) Fujitsu 4 Core FR550 Intel Sandy Bridge Microprocessors Programmable DSP Dedicated

Microprocessor (Intel Sandy Bridge): MOPS/mW=.86 Number of operations each clock cycle = Nop = 27 fclock = 3 GHz Performance = 81 GOPS Power = 95 Watts

Mobile Processors: MOPS/mW = 200 Mobile Processors (Tegra, Atom, Snapdragon, Exynos, OMAP) A combination of programmable dedicated and general purpose multicores 1/3 of power in internal busses Dark silicon strategy – transistors that aren’t used

802.11a Dedicated Design – Computational Photography : MOPS/mW=55,000 Parallel, programmable implementation of computational photography functions Nop = 5000 Clock rate = 25 MHz => 125,000 MOPS Vdd = .5 V, Power = 2.3 mW At Vdd = .9 V, MOPS/mW =28,000 Rahul Rithe et. al. ISSCC 2013

Is all Parallelism Equal? If you parallelize using an inefficient architecture it still will be inefficient. Lets look at desktop computer multi-cores. Do they really improve energy efficiency? Can software be written for these architectures?

From a white paper from a VERY large semiconductor company By incorporating multiple cores, each core is able to run at a lower frequency, dividing among them the power normally given to a single core. The result is a big performance increase over a single-core processor This fundamental relationship between power and frequency can be effectively used to multiply the number of cores from two to four, and then eight and more, to deliver continuous increases in performance without increasing power usage. Lets look at this…

Not much improvement with with a multi-core architecture and 15 years of technology scaling Intel Pentium MMX (1998) 250nm 1.8 GOPS at 19 Watt, 2 V. and a 450 MHz clock .1MOPS/mW Intel Sandy Bridge Quad Core (2013) 32 nm 81 GOPS at 95 Watts, 1 V. and a 3 GHz clock .9 MOPS/mW Only a ratio of 9 in MOPS/mW

Duplicating inefficient Von Neumann architectures isn’t the solution Power of a Multi-core chip = Ncores* CV2 * f 10X reduction in power requires 170 cores for the same performance To get 10X performance increase requires 1700 Cores! Supply: 1V ->.34 V Clock: 600 MHz->3.5 MHz Mobile DSP, G. Gammie et. al. ISSCC 2011 7.5)

What is the basic problem? The Von Neumann architecture was developed when Hardware was expensive Cost, size and power were not an issue Time sharing the hardware was absolutely necessary – more transistors make this unnecessary and undesirable Requires higher clock rates Memory to store intermediate results and instructions

Another problem is software "With multicore, it's like we are throwing this Hail Mary pass down the field and now we have to run down there as fast as we can to see if we can catch it." Dave Patterson – UC Berkeley (ISSCC Panel) "The industry is in a little bit of a panic about how to program multi-core processors, especially heterogeneous ones. To make effective use of multi-core hardware today you need a PhD in computer science.” Chuck Moore - AMD

Clock rates Dedicated designs with ultra high parallelism can operate in the near sub-threshold regime (.5 V and 25 MHz for 125 GOPS) The quad core Intel Sandy Bridge requires a 3 GHz clock to achieve 81 GOPS BUT it is fully flexible! Is there any other way to get this flexibility

Reconfigurable interconnect can provide application flexibility with high efficiency Hierarchical interconnect with logic blocks Hierarchical interconnect with LUTs Commercial FPGA with embedded blocks Commercial FPGA Microprocessors Mobile Processors Dedicated

And there is a programming model – dataflow diagrams

The Most Important Technology Requirement Continue the increase in the number of transistors (feature scaling, 3D integration, …) Facilitates architectures with ultra high parallelism Allows operation in near sub-threshold Reduces the need for time multiplexing Extra transistors to reduce leakage (power down, sleep modes, cascodes) Improved memory energy efficiency (8T, 10T cells) Allows many efficient specialized blocks that are only active part of the time “Dark Silicon”

Conclusions CMOS technology scaling that improves performance and energy efficiency has ended but CMOS will continue Reconfigurable interconnect is the most promising approach to fully flexible chips More energy efficient memory structures are critical Parallel architectures with the appropriate parallel units will be the key to continuing a “Moore’s Law” for energy and power efficiency