Lecture 15: Power

Power = Voltage × Current
–Voltage is usually a constant (we’ll talk about voltage scaling later)
–Current varies
    Depends on the block (cache vs. ALU vs. decoder …)
    Depends on the application (int vs. FP vs. multimedia)
    Depends on the program phase
Another form:
–$i = C\,\frac{dv}{dt} \Rightarrow v\,i\,dt = C\,v\,dv \Rightarrow E = \tfrac{1}{2}CV^2$ (energy to (dis)charge one capacitor)
–Power = $\sum$ (energy of each capacitor × avg. times (dis)charged) / time to (dis)charge
–$P = \sum_{b \in \text{All Blocks}} \tfrac{1}{2} C_b V^2 \alpha_b / t_c = \tfrac{1}{2} V^2 f \sum_b C_b \alpha_b = \tfrac{1}{2} \alpha C V^2 f$, with $f = 1/t_c$
C = total capacitance, $\alpha$ = average activity factor
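The per-block sum on this slide can be checked numerically. The following is a minimal sketch, not taken from the lecture: the block names, capacitances, and activity factors are made-up illustrative values, and the function names are hypothetical. It also shows that the "average" activity factor has to be capacitance-weighted for the last equality to hold.

```python
# Hypothetical per-block values: (capacitance in farads, activity factor).
BLOCKS = {
    "cache": (2.0e-9, 0.05),
    "alu": (0.5e-9, 0.30),
    "decoder": (0.2e-9, 0.20),
}


def dynamic_power(blocks, vdd, freq_hz):
    """P = 1/2 * V^2 * f * sum_b(C_b * alpha_b), in watts."""
    switched_cap = sum(cap * alpha for cap, alpha in blocks.values())
    return 0.5 * vdd ** 2 * freq_hz * switched_cap


def average_activity(blocks):
    """Capacitance-weighted average activity: sum(C_b * alpha_b) / sum(C_b)."""
    total_cap = sum(cap for cap, _ in blocks.values())
    weighted = sum(cap * alpha for cap, alpha in blocks.values())
    return weighted / total_cap


total_cap = sum(cap for cap, _ in BLOCKS.values())
per_block = dynamic_power(BLOCKS, vdd=1.0, freq_hz=1e9)
lumped = 0.5 * average_activity(BLOCKS) * total_cap * 1.0 ** 2 * 1e9
print(per_block, lumped)  # the two forms agree up to rounding
```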

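We talked about this in Lecture 1
–Two types of static power
    Leakage through the channel (sub-threshold conductance)
    Leakage through the gate/oxide (tunneling)
$P_{static} = P_{sub} + P_{oxide}$
$P_{total} = P_{dynamic} + P_{static} = \tfrac{1}{2}\alpha C V^2 f + K_1 W e^{-V_T / nV_\theta}\left(1 - e^{-V/V_\theta}\right) + K_2 W \left(V/T_{ox}\right)^2 e^{-\alpha T_{ox}/V}$
($V_\theta$ is the thermal voltage; the $\alpha$ in the last exponent is an empirical fitting parameter, not the activity factor)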

$P = \tfrac{1}{2}\alpha C V^2 f$, $f \propto V \Rightarrow P \propto V^3$
To a first order, Perf $\propto f \Rightarrow$ Perf $\propto V$
For a linear decrease in voltage (and therefore performance) … we get a cubic decrease in (dynamic) power consumption
Rule of thumb: for small $\Delta V$/$\Delta f$, 1% performance for every 3% power
[Figure: power vs. voltage, $P \propto V^3$]
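The 1%-for-3% rule of thumb follows from a first-order expansion of the cubic. A short worked version, assuming voltage and frequency are both scaled down by the same small fraction $\epsilon$ (a symbol introduced here for illustration, not from the slide):

```latex
\[
\frac{P'}{P} = \left(\frac{V'}{V}\right)^{3} = (1-\epsilon)^{3}
             = 1 - 3\epsilon + 3\epsilon^{2} - \epsilon^{3}
             \approx 1 - 3\epsilon \qquad (\epsilon \ll 1)
\]
\[
\epsilon = 0.01:\quad \frac{P'}{P} = 0.99^{3} = 0.970299
\]
```

So giving up 1% of frequency (and, to first order, performance) saves about 2.97% of dynamic power. The approximation degrades as $\epsilon$ grows, which is why the rule is stated for small changes only.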

$V_{dd} - V_T > V_{NoiseMargin} \Rightarrow V_{dd}$ cannot be scaled below $V_T + V_{NoiseMargin}$
Noise can cause a transistor to accidentally switch!
Voltage scaling can take the supply voltage down only so far
Below this, we can only use frequency scaling (decrease f, but keep V constant), which provides only linear power reduction ($\tfrac{1}{2}\alpha C V^2 f$ with V fixed)
[Figure: power vs. voltage/frequency, $P \propto V^3$ down to the voltage floor and linear below it; inset shows noise on a signal relative to $V_{dd}$, $V_T$, and Gnd]
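The two regimes on this slide can be written as one piecewise function. This is a rough sketch under stated assumptions: voltage is taken to scale linearly with frequency down to a floor, and `f_floor_rel` (the relative frequency at which $V_{dd}$ reaches $V_T + V_{NoiseMargin}$) is a hypothetical parameter, as is the function name.

```python
def relative_power(f_rel, f_floor_rel):
    """Relative dynamic power at relative frequency f_rel (1.0 = nominal).

    Above the floor, V scales with f, so P ~ V^2 * f ~ f^3.
    Below the floor, V is pinned, so only f drops and P ~ f.
    """
    if f_rel >= f_floor_rel:
        return f_rel ** 3
    return f_floor_rel ** 2 * f_rel


# Illustrative sweep with the voltage floor reached at 60% of nominal frequency.
for f in (1.0, 0.8, 0.6, 0.4, 0.2):
    print(f, relative_power(f, f_floor_rel=0.6))
```

The two branches meet at the floor, so the curve is continuous there, but each further step down in frequency buys much less power than it did in the cubic region.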

Dynamic Voltage/Frequency Scaling
Someone tracks performance demands, idleness, etc.
–“Someone” is typically the OS with hardware support
–… but you could have a hardware-only approach
Under thermal emergencies, the HW takes over regardless of what voltage/frequency the OS asks for
Goal: consume minimum power necessary while still meeting performance demands
Can also do just DVS or DFS
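A toy governor illustrates the policy described on this slide. It is only a sketch: the operating-point table, the `headroom` margin, and the function name are assumptions made for illustration, not any real OS or hardware interface.

```python
# Hypothetical operating points: (relative frequency, supply voltage in volts),
# ordered from lowest to highest power.
OPERATING_POINTS = [(0.6, 0.80), (0.8, 0.95), (1.0, 1.10)]


def pick_operating_point(utilization, current_f, thermal_emergency=False,
                         headroom=0.9):
    """Pick the lowest-power point that still meets the observed demand.

    utilization: fraction of time the CPU was busy, measured at current_f.
    headroom: only load a point up to this fraction of its capacity.
    """
    if thermal_emergency:
        # Hardware override: ignore the request and drop to the lowest point.
        return OPERATING_POINTS[0]
    demand = utilization * current_f  # work rate, in units of nominal frequency
    for f, v in OPERATING_POINTS:
        if f * headroom >= demand:
            return (f, v)
    return OPERATING_POINTS[-1]  # demand exceeds every point: run flat out


print(pick_operating_point(0.3, current_f=1.0))
print(pick_operating_point(0.95, current_f=1.0))
print(pick_operating_point(0.95, current_f=1.0, thermal_emergency=True))
```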

CMOS logic is also called “static” logic:
–If the inputs don’t change, neither do the outputs (or any other intermediate nodes)
Therefore, to reduce dynamic power in CMOS circuits, don’t let the inputs change if you don’t need to!
Clock gate this block? Latch doesn’t grab new value, so its output (block’s input) doesn’t change
[Figure: a CMOS block dissipating power vs. the same block fed by a clock-gated latch]

[Figure: execution units (+, logic, shift, comp, ×) selected by opcode to produce one result, without and with clock-gating logic]
All units consume power, but only one output is useful
Based on opcode, the logic clock-gates all but the one required unit
Note, this logic consumes its own power
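A back-of-the-envelope model of the trade-off on this slide. The per-unit power numbers and the cost of the gating logic are invented relative values, and the names are hypothetical.

```python
# Hypothetical relative dynamic power per execution unit when it is clocked.
UNIT_POWER = {"add": 1.0, "logic": 0.5, "shift": 0.7, "comp": 0.6, "mul": 3.0}
GATING_LOGIC_POWER = 0.2  # the clock-gating logic is not free


def cycle_power(opcode, clock_gated):
    """Relative dynamic power for one cycle executing `opcode`."""
    if not clock_gated:
        # Every unit computes, even though only one result is selected.
        return sum(UNIT_POWER.values())
    # Only the unit the opcode needs is clocked, plus the gating logic itself.
    return UNIT_POWER[opcode] + GATING_LOGIC_POWER


for op in UNIT_POWER:
    print(op, cycle_power(op, clock_gated=False), cycle_power(op, clock_gated=True))
```

Gating pays off only while the gating logic costs less than the units it turns off, which is the point of the slide's closing note.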

To properly clock-gate, you must know one cycle in advance that you’re going to gate (otherwise it’ll be too late, as the clock edge will have already arrived)
[Figure: payload RAM supplying opcode and values to the +, logic, and comp units, with clock-gating logic driven by the opcode; labels “Value E” and “Value L”]

Not all blocks can be easily gated
–may be difficult to know whether gating should be applied ahead of time
    likely true for critical path circuits: e.g., gating select logic probably difficult since bidders not known until last moment
–computation of gating condition may be complex
    value-based (is input zero?)
    multi-value based (are all inputs zero?)
    multi-condition based (are all RS entries not bidding?)

CMOS logic toggles only when input changes
Dynamic logic may consume power regardless
[Figure: CMOS NOR gate vs. N-domino NOR gate, output node X]
If A (or B) equals 1 and does not change, then sequence is: precharge X to 1, evaluate discharges X to 0, precharge X to 1, evaluate …
Gating inputs is not enough; need to ensure CLK is disabled.

Even if gates are not toggling, they continue to leak
[Figure: inverter between $V_{dd}$ and Gnd, shown with input 1 and with input 0; each case marks which transistor is on or off and the subthreshold and gate leakage paths]

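[Figure: transistor stack between V and 0 with the intermediate node near V/2, each device drawn as a resistance R, compared with a single device]
Intermediate node has V > 0
Channel leakage sees higher resistance
Higher $V_{SB}$ increases $V_T$ ($V_B = 0$, $V_S \approx V/2$)
Higher threshold voltage decreases leakage current
Higher resistance increases gate latency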

[Figure: channel leakage vs. less channel leakage with larger $V_{SB}$; $V_B$ and $V_S$ labeled]
WARNING: This is a GROSSLY simplified explanation!!! If you’re interested in low-power circuits and microarchitecture, you should go read up on some real semiconductor/electronics literature.

Manufacture two types of transistors:
–Low-$V_T$ gates: fast, high leakage
–High-$V_T$ gates: slow, low leakage (typically ≈10x less)
–Designer chooses what kind to use
Pro:
–less area than stacking (one high-$V_T$ gate = one low-$V_T$ gate in area, stacking requires multiple gates)
Con:
–Manufacturing process needs to provide two device types

Stacking and higher $V_T$ both slow down the gates
Analyze circuits and…
–apply one or both techniques to gates not on the critical path
–apply to longest path if timing permits (i.e., this circuit is not a frequency limiter)
[Figure: circuit marking the critical path gates and the gates where stacking or high-$V_T$ devices are used]

The amount of leakage depends on the clock-gated inputs to the gate
[Figure: a two-input gate under each input combination with on/off transistors marked: 2 off transistors in parallel; 1 off transistor in leakage path (two of the cases); 2 off transistors in leakage path]

When clock-gating a block
–disable latch clock (as usual)
–load leakage-minimizing input vector (stored elsewhere)
How to determine best input vector for n-input gate?
Can cause spurious transitions that consume more dynamic power
[Figure: clock-gated latch feeding the block]
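One naive answer to the slide's question is to try every input vector and keep the one with the least leakage. The sketch below does that for a 2-input NAND. The leakage table is an invented illustration of the stacking effect (fewer leaking paths and more off transistors in series mean less leakage), not measured data, and both function names are hypothetical.

```python
from itertools import product


def nand2_leakage(a, b):
    """Hypothetical relative leakage of a 2-input NAND for inputs (a, b)."""
    if a == 0 and b == 0:
        return 1.0   # two off NFETs stacked in the leakage path: least leakage
    if a == 1 and b == 1:
        return 10.0  # two off PFETs leaking in parallel: most leakage
    return 4.0       # a single off NFET in the leakage path


def best_sleep_vector(leakage_fn, n_inputs):
    """Exhaustively search all 2^n input vectors for the minimum leakage."""
    vectors = product((0, 1), repeat=n_inputs)
    return min(vectors, key=lambda v: leakage_fn(*v))


print(best_sleep_vector(nand2_leakage, 2))
```

Exhaustive search is fine for a single gate but grows as 2^n, so for a whole block some heuristic or sampling approach would be needed. It also ignores the dynamic power spent driving the block into the sleep vector, which the slide warns about.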

Instead of at the gate-level, choose high-$V_T$ vs. low-$V_T$ at the transistor-level
Can be used if some transitions are more important than others
–“more important” can be speed or power
Combine with setting input sleep vectors
–make the off transistors high-$V_T$ if possible to further reduce leakage
[Figure: a gate built from a mix of high-$V_T$ and low-$V_T$ devices]

If you turn off the power, then the gates can’t leak
[Figure: gate powered from a virtual $V_{dd}$ that is disconnected from the real $V_{dd}$ by an off gating transistor]
This gating transistor is a beast… it needs to be big enough to supply the necessary current when not gated, and it also needs to be low leakage (high-$V_T$ gate)
Gating transistor also called “sleep” transistor

After gating, residual charge in system will continue to leak
[Figure: block between virtual $V_{dd}$ and virtual Gnd, each disconnected by an off sleep transistor]
Both paths cut off now

Sleep transistors are slow high-$V_T$ devices
Depending on size of block covered by sleep transistor, virtual $V_{dd}$/Gnd may have a lot of capacitance to charge/discharge
Moderate R, large C ⇒ large RC (slow)
[Figure: $V_{dd}$ charging virtual $V_{dd}$ through R and C. Timeline: ADD inst ready to execute, ALU asleep, delay to wake up ALU, ADD exec]
Wakeup delay can cause significant performance penalties when units are unavailable
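To get a feel for the wakeup delay, the virtual rail can be treated as a first-order RC circuit charging from 0 V. That is a simplifying assumption, and the R, C, clock, and "awake" threshold values below are illustrative only.

```python
import math


def wakeup_time(r_ohms, c_farads, fraction=0.95):
    """Time for an RC node charging from 0 V to reach `fraction` of Vdd.

    From v(t) = Vdd * (1 - exp(-t / RC)), solved for t.
    """
    return -r_ohms * c_farads * math.log(1.0 - fraction)


def wakeup_cycles(r_ohms, c_farads, clock_hz, fraction=0.95):
    """Wakeup delay rounded up to whole clock cycles."""
    return math.ceil(wakeup_time(r_ohms, c_farads, fraction) * clock_hz)


# Hypothetical sleep transistor: 10 ohm on-resistance, 1 nF virtual rail, 2 GHz clock.
print(wakeup_time(10.0, 1e-9))
print(wakeup_cycles(10.0, 1e-9, 2e9))
```

With these made-up numbers the unit is unavailable for tens of cycles, which is the kind of penalty the timeline on the slide depicts.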

In some situations, we can know far enough ahead of time
[Figure: crude pipeline (fetch, decode, exec) with the FPU]
FP inst decoded! Immediately send wakeup to FPU
Hopefully by the time the fadd makes it to the OOO core, gets scheduled, and makes it to the FPU, the turn-on has completed

In some cases it’s much harder
–pipeline full/stalled (maybe due to D$ miss to main memory)
–power-off front-end units (fetch, decode, etc.)
–miss serviced, back-end starts moving again; front-end starts wake up
–back-end gets starved because front-end wakeup is too slow and can’t refill the pipeline
But it’s hard to start the power-on early because we don’t know when the memory request will be fulfilled (and whether that will cause the back-end to drain)

(Dis)Charging virtual $V_{dd}$/Gnd consumes quite a bit of energy/power
$P = \tfrac{1}{2}\alpha C V^2 f$
Worst-case: charge up as soon as you’re done discharging
[Figure: virtual $V_{dd}$ vs. time: “Go to sleep!”, then “Done discharging, now wakeup!”]
We just wasted $2 \times \tfrac{1}{2} \times C_{Virt\,V_{dd}} \times V_{dd}^2$ joules to discharge and then recharge the virtual $V_{dd}$
And we spent zero cycles fully asleep, so we didn’t save any/much leakage power

Must stay asleep for some time, just to break even!
[Figure: energy consumed vs. time. Without sleeping, leakage energy keeps accumulating. With sleeping: energy to discharge virtual $V_{dd}$/Gnd, zero energy consumed while sleeping, energy to recharge virtual $V_{dd}$/Gnd]
Minimum sleep-time for energy break-even
Too little sleep… ends up costing more energy than doing nothing (extra energy spent)
Sleep interval > break-even length ⇒ energy reduction
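The break-even point can be estimated with the simplified model used on these slides: a fixed energy overhead of $2 \times \tfrac{1}{2} C V_{dd}^2$ to discharge and recharge the virtual rail, and zero leakage while asleep. The function names and numbers below are assumptions for illustration.

```python
def break_even_time(c_virtual, vdd, p_leak):
    """Minimum sleep time (seconds) for the leakage saved to repay the overhead.

    Overhead: 2 * (1/2 * C * Vdd^2) joules to discharge and recharge the rail.
    Saving: p_leak watts for every second spent asleep.
    """
    overhead = 2 * 0.5 * c_virtual * vdd ** 2
    return overhead / p_leak


def energy_saved(sleep_time, c_virtual, vdd, p_leak):
    """Net energy saved by sleeping; negative means sleeping cost more energy."""
    return p_leak * sleep_time - c_virtual * vdd ** 2


# Hypothetical block: 1 nF virtual rail, 1.0 V supply, 100 mW leakage when awake.
t_be = break_even_time(1e-9, 1.0, 0.1)
print(t_be)
print(energy_saved(0.5 * t_be, 1e-9, 1.0, 0.1))  # too little sleep: net loss
print(energy_saved(2.0 * t_be, 1e-9, 1.0, 0.1))  # long enough: net saving
```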

Instantly turning on the sleep transistor to recharge virtual $V_{dd}$ causes very large current spike ($di/dt$)
[Figure: water tank analogy. Flush! The toilet draws $I_{john}$, leaving $I_{shower} - I_{john}$ for the shower: a pressure drop. $I_{john}$ corresponds to the current for recharging virtual $V_{dd}$]
Solution: progressive turn-on; recharge virtual $V_{dd}$ slowly, which limits $I_{john}$ (i.e., $I_{recharge}$) to keep pressure drop (supply noise) under control
Slowing down recharge increases performance penalty when recharge is late

OS power management (OSPM)
–algorithm monitors CPU load over some window of time
–computes target performance point, requests from CPU
–CPU is expected to modify operating voltage/frequency to match OSPM’s request
[Figure: relative power consumption under voltage and frequency scaling vs. frequency scaling only]
OS can choose different power saving states ($C_0$ – $C_n$)
–$C_0$: active state (no power saving)
–$C_i$: higher $i$ ⇒ more power savings, but longer recovery time

$C_0$: Active
$C_1$ (processor-centric measures)
–instruction execution halted, clocks are gated
$C_2$: CPU does not access bus w/o chipset’s consent
–allows bus to be put in low-power mode
$C_3$: CPU disables PLLs (clock generators)
$C_4$: CPU lowers voltage to minimum level while still being able to retain state (e.g., cache contents)
$DC_4$: “Deep” $C_4$ (next slide)
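Because deeper states save more power but take longer to recover from, a policy has to weigh the expected idle time against each state's cost. The sketch below is one plausible selection rule, not the actual OSPM algorithm. Every number in the table (relative power, exit latency, minimum worthwhile residency) is hypothetical.

```python
# (name, relative power, exit latency in us, minimum worthwhile residency in us)
# All values are made up for illustration; states are ordered shallow to deep.
C_STATES = [
    ("C0", 1.00, 0, 0),
    ("C1", 0.50, 1, 2),
    ("C2", 0.30, 10, 20),
    ("C3", 0.15, 50, 150),
    ("C4", 0.05, 200, 600),
]


def choose_c_state(predicted_idle_us, latency_limit_us):
    """Pick the deepest state whose exit latency and residency both fit."""
    best = C_STATES[0]
    for state in C_STATES:
        _, _, exit_latency, min_residency = state
        if exit_latency <= latency_limit_us and min_residency <= predicted_idle_us:
            best = state  # later entries are deeper, so keep the last that fits
    return best[0]


print(choose_c_state(predicted_idle_us=100, latency_limit_us=1000))
print(choose_c_state(predicted_idle_us=1000, latency_limit_us=20))
```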

Upon entering $C_4$, flush L2 cache to main memory
–Don’t do it all at once!
    If $C_4$ period is short, then you waste more power due to flushing
    Can have performance impact on wakeup since cache will be cold
Flush only part of the L2 (1/8 to 1/2) by ways
–once a complete way has been flushed, power gate it with sleep transistors (discussed earlier)
Do this upon each entry into $C_4$ state
When L2 shrunk to 0 bytes, enter $DC_4$
–Greatly reduce voltage since there’s no state to retain
    No need to wake up cache for snoops
    Chipset directs snoop traffic directly to memory
Typically expand cache to minimum of two ways on exit from $DC_4$

Many shared resources
–PLL, power supply, L2 cache
    Can’t (easily) run cores at different clock speeds with a single PLL
    Can’t run cores at different voltages with a single power supply
    Can’t turn off L2 cache just because one core is idle
External interface complications
–OS sees two separate CPUs: one C-state per core
–Platform views the whole processor as a single entity for power management (for $C_2$ state and higher)
OS can request C-states on a per-core basis
Platform sees only a single C-state (the lower of the two)
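The resolution rule in the last line is small enough to state directly. A sketch, with C-states represented as plain integers (0 for $C_0$, 4 for $C_4$) and a hypothetical function name:

```python
def platform_c_state(core_requests):
    """Platform-visible C-state: the shallowest (lowest-numbered) request.

    Shared resources (PLL, supply, L2) can only enter a state that every
    core has agreed to, so one active core keeps the whole package awake.
    """
    return min(core_requests)


print(platform_c_state([4, 0]))  # one core still active
print(platform_c_state([3, 2]))  # both idle, package limited by the shallower
```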

If one core is in deep-sleep, it’s not consuming much power
Idea: use DVFS in reverse to increase voltage/frequency
[Figure: per-core power against the power limit, and relative performance, for both cores in $C_0$ vs. core 0 in $C_0$ with core 1 in $DC_4$]
Deliver more performance when running a single program and not worried about battery life (plugged in to wall)
“Intel Dynamic Acceleration Technology”
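A rough upper bound on the available boost can be derived from the cubic relation between dynamic power and frequency. This sketch ignores static power, thermal limits, and the maximum safe voltage, and it is not a description of how Intel's feature actually decides. The function name and all numbers are illustrative.

```python
def boosted_frequency(power_limit, sleeping_core_power, nominal_core_power=1.0):
    """Relative frequency the active core could reach within the power limit.

    Assumes dynamic power scales as f^3 (V raised along with f), so the
    active core's budget maps back to frequency through a cube root.
    """
    budget = power_limit - sleeping_core_power
    return (budget / nominal_core_power) ** (1.0 / 3.0)


# Package limit sized for two cores at nominal power; one core asleep at 5%.
print(boosted_frequency(power_limit=2.0, sleeping_core_power=0.05))
```

Even with almost twice the power budget, the cube root limits the gain to roughly 25% in frequency under these assumptions. This is the same cubic relationship that makes scaling down so effective, working against us in reverse.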

Pros:
–significant standby leakage reduction
–memory elements retain state
–no transistor sizing/partitioning required
–dynamically tunable $V_T$ at runtime
Cons:
–requires expensive triple-well fabrication process
–body-biasing effect decreases with technology scaling
Earlier body-bias effect from stacked transistors was due to higher source voltage: higher $V_{SB}$ increases $V_T$ ($V_B = 0$, $V_S \approx V/2$)
Provide a way to explicitly bias $V_B$
Set $V_{BBN} < 0$ for this NFET
Since $V_{BBN} < 0$, also called “reverse biasing”
Kao et al., Embedded Tutorial: Subthreshold Leakage Modeling and Reduction Techniques, ICCAD 2002

Super-high $V_T$ for caches (very slow)
Use selective forward-body biasing during access to read/write at a reasonable speed
Very-high $V_T$ devices (very low leakage, slow access speed)
[Figure: $V_{BBN}$ over time: 0, raised to $V_{fwd\text{-}bias}$ at “Access”, back to 0 at “Access Completed”]
$V_{SB} < 0 \Rightarrow V_T$ decreases ⇒ transistors are faster (but consume more power)
A few cache lines go into high leakage mode, but only very briefly (during access). The rest of the time, the cache consumes very little leakage power.

Different blocks have different performance needs
–and this varies in time
Idea: clock different blocks at different speeds
–Apply voltage/frequency scaling to blocks/groups-of-blocks
    e.g., FP units can be slowed down (or maybe even completely turned off) for integer applications
–Block consumes less power when it doesn’t have to operate in max-performance mode
GALS = Globally Asynchronous, Locally Synchronous

[Figure: baseline processor vs. GALS processor]

How to communicate between clock domains?
Timing issues:
–Asynchronous FIFO design [Chelcea and Nowick]
–Producer can clear empty, but it gets cleared on clk2
–Consumer clears the full signal, but it occurs on clk1
Voltage issues:
–[Figure: the $V_{dd1}$ domain swings 0 V to 0.75 V (“0” to “1”), the $V_{dd2}$ domain swings 0 V to 1.5 V; a “1” at 0.75 V arriving in the 1.5 V domain is ambiguous (0 or 1?)]
–FIFO between domains must “speak” both voltages