
Lecture 15: Power

Power = Voltage × Current
–Voltage is usually a constant (we’ll talk about voltage scaling later)
–Current varies
  Depends on the block (cache vs. ALU vs. decoder …)
  Depends on the application (int vs. FP vs. multimedia)
  Depends on the program phase
Another form:
–$i = C\,dv/dt \Rightarrow vi\,dt = Cv\,dv \Rightarrow E = \tfrac{1}{2}CV^2$ (the energy to charge a capacitor to $V$)
–Power = $\sum$ (energy of each capacitor × avg times (dis)charged / time to (dis)charge)
–$P = \sum_{b \in \text{All Blocks}} \tfrac{1}{2} C_b V^2 \alpha_b / t_c = \tfrac{1}{2} V^2 f \sum_b C_b \alpha_b = \tfrac{1}{2} \alpha C V^2 f$, where $f = 1/t_c$
$C$ = total capacitance, $\alpha$ = average activity factor
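The per-block sum and the aggregate form can be checked against each other numerically. The following is a minimal sketch, not part of the original slides; the function name `dynamic_power` and the capacitance and activity values are assumptions chosen only to exercise the formula.

```python
def dynamic_power(blocks, vdd, freq_hz):
    """Return (power_watts, total_capacitance, avg_activity).

    blocks: iterable of (capacitance_farads, activity_factor) pairs, one per block.
    vdd is the supply voltage in volts; freq_hz is the clock frequency.
    """
    blocks = list(blocks)
    total_c = sum(c for c, _ in blocks)
    switched_c = sum(c * a for c, a in blocks)      # sum over blocks of C_b * alpha_b
    avg_alpha = switched_c / total_c if total_c else 0.0
    power = 0.5 * vdd ** 2 * freq_hz * switched_c   # 1/2 * V^2 * f * sum(C_b * alpha_b)
    return power, total_c, avg_alpha


# Hypothetical numbers (not from the lecture).
example_blocks = [(2e-9, 0.10), (1e-9, 0.40)]       # (C_b in farads, alpha_b)
p, c, alpha = dynamic_power(example_blocks, vdd=1.0, freq_hz=2e9)

# The per-block sum agrees with the aggregate form 1/2 * alpha * C * V^2 * f.
assert abs(p - 0.5 * alpha * c * 1.0 ** 2 * 2e9) < 1e-9
```

Here the average activity factor is the capacitance-weighted mean of the per-block factors, which is what makes the two forms equal.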

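We talked about this in Lecture 1
–Two types of static power
  Leakage through the channel (sub-threshold conductance)
  Leakage through the gate/oxide (tunneling)
$P_{static} = P_{sub} + P_{oxide}$
$P_{total} = P_{dynamic} + P_{static} = \tfrac{1}{2} \alpha C V^2 f + K_1 W e^{-V_T/(n V_\theta)} \left(1 - e^{-V/V_\theta}\right) + K_2 W (V/T_{ox})^2 e^{-\alpha T_{ox}/V}$
($V_\theta$ is the thermal voltage; the $\alpha$ in the oxide term is a fitting parameter, not the activity factor)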
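To see how the two static terms respond to threshold voltage and oxide thickness, the expression can be evaluated directly. This is a rough sketch only: the function name, the parameter name `alpha_ox` (used to keep the oxide fitting parameter apart from the activity factor), and every constant below are assumptions, not calibrated device data.

```python
import math


def static_power(width, vdd, v_t, t_ox, k1, k2, n, v_theta, alpha_ox):
    """Return (p_sub, p_oxide) for the sub-threshold and gate-oxide terms."""
    p_sub = (k1 * width * math.exp(-v_t / (n * v_theta))
             * (1.0 - math.exp(-vdd / v_theta)))
    p_oxide = k2 * width * (vdd / t_ox) ** 2 * math.exp(-alpha_ox * t_ox / vdd)
    return p_sub, p_oxide


# Hypothetical constants, for illustration only.
params = dict(width=1.0, vdd=1.0, k1=1.0, k2=1.0, n=1.5,
              v_theta=0.026, alpha_ox=1.0)
low_vt_sub, thin_ox = static_power(v_t=0.3, t_ox=1.0, **params)
high_vt_sub, _ = static_power(v_t=0.4, t_ox=1.0, **params)
_, thick_ox = static_power(v_t=0.3, t_ox=1.5, **params)

assert high_vt_sub < low_vt_sub   # raising V_T cuts sub-threshold leakage
assert thick_ox < thin_ox         # thicker oxide cuts gate leakage
```

The exponential dependence on $V_T$ is why the later slides spend so much effort on raising the threshold voltage of idle or non-critical devices.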

$P = \tfrac{1}{2} \alpha C V^2 f$, $f \propto V \Rightarrow P \propto V^3$
To a first order, Perf $\propto f \Rightarrow$ Perf $\propto V$
[Figure: power vs. voltage curve, $P \propto V^3$]
For a linear decrease in voltage (and therefore performance) … we get a cubic decrease in (dynamic) power consumption
Rule of thumb: for small $\Delta V$/$\Delta f$, 1% performance for every 3% power
http://download.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/vol7iss2_art03.pdf
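The rule of thumb follows from the cubic relationship and can be checked with a few lines. This sketch assumes voltage and frequency are scaled by the same factor and that only dynamic power is counted; the function name is hypothetical.

```python
def relative_dynamic_power(v_scale):
    """Dynamic power relative to nominal when V and f are both scaled by v_scale."""
    return v_scale ** 3


for perf_loss in (0.01, 0.05, 0.20):
    power_saved = 1.0 - relative_dynamic_power(1.0 - perf_loss)
    print(f"{perf_loss:.0%} slower -> {power_saved:.1%} less dynamic power")
```

For a 1% slowdown the saving is about 3%, matching the rule of thumb; for larger steps the ratio drifts below 3:1, which is why the rule is stated only for small changes.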

$V_{dd} - V_T > V_{\text{Noise Margin}} \Rightarrow V_{dd}$ cannot be scaled below $V_T + V_{\text{Noise Margin}}$
Noise can cause a transistor to accidentally switch!
[Figure: power vs. voltage/frequency, $P \propto V^3$; supply noise on $V_{dd}$ relative to $V_T$ and Gnd]
Voltage scaling can take the supply voltage down only so far
Below this, we can only use frequency scaling (decrease $f$, but keep $V$ constant), which provides only linear power reduction ($\tfrac{1}{2} \alpha C V^2 f$)
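The change from cubic to linear savings at the voltage floor can be sketched as a piecewise function. The function name and the floor value used below are assumptions; the model takes $f \propto V$ above the floor and ignores static power.

```python
def relative_power(perf, v_min_ratio):
    """Relative dynamic power at relative performance `perf` (0 < perf <= 1).

    v_min_ratio is (V_T + V_noise_margin) / V_nominal, the lowest allowed
    supply voltage as a fraction of nominal (a hypothetical parameter).
    """
    v = max(perf, v_min_ratio)   # voltage tracks frequency until it hits the floor
    return v ** 2 * perf         # P ~ V^2 * f


# Above the floor both V and f scale: cubic savings.
assert abs(relative_power(0.9, 0.7) - 0.9 ** 3) < 1e-12
# Below the floor only f scales: power falls linearly with performance.
assert abs(relative_power(0.5, 0.7) - 0.7 ** 2 * 0.5) < 1e-12
```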

Dynamic Voltage/Frequency Scaling
Someone tracks performance demands, idleness, etc.
–“Someone” is typically the OS with hardware support
–… but you could have a hardware-only approach
Under thermal emergencies, the HW takes over regardless of what voltage/frequency the OS asks for
Goal: consume the minimum power necessary while still meeting performance demands
Can also do just DVS or DFS
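The policy described here can be sketched as a table lookup with a hardware override. Everything in this example is hypothetical: the operating-point table, the function name, and the idea that demand is expressed in GHz are assumptions made for illustration, not a description of any real governor.

```python
# Hypothetical operating points: (frequency in GHz, voltage in volts), slowest first.
OPERATING_POINTS = [(0.8, 0.80), (1.2, 0.90), (1.6, 1.00), (2.0, 1.10)]


def choose_operating_point(demand_ghz, thermal_emergency=False):
    """Pick the lowest-power point that still meets the performance demand."""
    if thermal_emergency:
        return OPERATING_POINTS[0]       # hardware override ignores the request
    for point in OPERATING_POINTS:
        if point[0] >= demand_ghz:
            return point                 # first (slowest) point that is fast enough
    return OPERATING_POINTS[-1]          # demand exceeds capability: run flat out
```

For example, `choose_operating_point(1.0)` selects the 1.2 GHz point, while the same request during a thermal emergency is forced down to the slowest point.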

CMOS logic is also called “static” logic:
–If the inputs don’t change, neither do the outputs (or any other intermediate nodes)
Therefore, to reduce dynamic power in CMOS circuits, don’t let the inputs change if you don’t need to!
[Figure: a CMOS block dissipates power whenever its inputs change; clock gate this block?]
Latch doesn’t grab new value, so its output (block’s input) doesn’t change

[Figure: functional units (+, logic, shift, comp, ×) all fed by the opcode, producing one result]
All units consume power, but only one output is useful
[Figure: the same units with clock-gating logic driven by the opcode]
Based on opcode, the logic clock-gates all but the one required unit
Note, this logic consumes its own power
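The opcode-driven gating decision can be modelled behaviourally as a one-hot enable per unit. This is a toy sketch: the unit names and function names are hypothetical, and it counts clocked unit-cycles rather than modelling real power (including the gating logic's own power).

```python
UNITS = ("add", "logic", "shift", "comp", "mul")   # hypothetical unit names


def clock_enables(opcode):
    """Return {unit: enabled} with only the unit selected by `opcode` clocked."""
    if opcode not in UNITS:
        raise ValueError(f"unknown opcode: {opcode!r}")
    return {unit: unit == opcode for unit in UNITS}


def clocked_unit_cycles(opcodes, gated):
    """Count unit-cycles that see a clock edge over a sequence of opcodes."""
    if gated:
        return sum(sum(clock_enables(op).values()) for op in opcodes)
    return len(UNITS) * len(opcodes)


program = ["add", "mul", "shift"]
assert clocked_unit_cycles(program, gated=False) == 15   # every unit, every cycle
assert clocked_unit_cycles(program, gated=True) == 3     # one unit per cycle
```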

To clock-gate properly, you must know one cycle ahead of time that you are going to gate (otherwise it’ll be too late, as the clock edge will have already arrived)
[Figure: payload RAM supplies the opcode and operand values (Value E, Value L) to the clock-gating logic and the units (+, logic, comp)]

Not all blocks can be easily gated
–may be difficult to know whether gating should be applied ahead of time
  likely true for critical path circuits: e.g., gating select logic is probably difficult since bidders are not known until the last moment
–computation of gating condition may be complex
  value-based (is input zero?)
  multi-value based (are all inputs zero?)
  multi-condition based (are all RS entries not bidding?)

CMOS logic toggles only when input changes
Dynamic logic may consume power regardless
[Figure: CMOS NOR gate vs. N-Domino NOR gate; pictures from http://6004.csail.mit.edu/6.371/handouts/L11.pdf]
If A (or B) equals 1 and does not change, then the sequence is: precharge X to 1, evaluate discharges X to 0, precharge X to 1, evaluate …
Gating inputs is not enough; need to ensure CLK is disabled.

Even if gates are not toggling, they continue to leak
[Figure: an inverter between $V_{dd}$ and Gnd, shown with input 1 and with input 0; in each case one transistor is on and the other off, with the subthreshold leakage and gate leakage paths marked]

[Figure: stacked transistors; the intermediate node sits at V > 0 (about V/2) rather than 0]
Channel leakage vs. higher resistance
Higher $V_{SB}$ increases $V_T$ ($V_B = 0$, $V_S \approx V/2$)
Higher threshold voltage decreases leakage current
Higher resistance increases gate latency

[Figure: channel leakage vs. less channel leakage with larger $V_{SB}$ (terminals $V_B$ and $V_S$ marked)]
WARNING: This is a GROSSLY simplified explanation!!! If you’re interested in low-power circuits and microarchitecture, you should go read up on some real semiconductor/electronics literature.

Manufacture two types of transistors:
–Low-$V_T$ gates: fast, high leakage
–High-$V_T$ gates: slow, low leakage (typically about 10x less)
–Designer chooses what kind to use
Pro:
–less area than stacking (one high-$V_T$ gate = one low-$V_T$ gate in area, stacking requires multiple gates)
Con:
–Manufacturing process needs to provide two device types

Stacking and higher $V_T$ both slow down the gates
Analyze circuits and…
–apply one or both techniques to gates not on the critical path
–apply to longest path if timing permits (i.e., this circuit is not a frequency limiter)
[Figure: circuit with the critical path gates highlighted; the label “Stack or use high-$V_T$ gates here” marks the other gates]

The amount of leakage depends on the clock-gated inputs to the gate
[Figure: a two-input gate under input vectors 00, 10, 01, and 11, showing which transistors are on or off. Captions: 2 off transistors in parallel; 1 off transistor in leakage path (two cases); 2 off transistors in leakage path]

When clock-gating a block
–disable latch clock (as usual)
–load leakage-minimizing input vector (stored elsewhere)
[Figure: clock-gated latches driving a fixed input vector into the block]
How to determine the best input vector for an n-input gate?
Can cause spurious transitions that consume more dynamic power
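One simple answer to the "best input vector" question, for small gates, is exhaustive search over all $2^n$ vectors. The sketch below assumes a caller-supplied leakage estimate (in practice this would come from circuit simulation); `toy_nand_leakage` is an invented toy model, not measured data.

```python
from itertools import product


def best_sleep_vector(n_inputs, leakage):
    """Exhaustively search all 2^n input vectors for the lowest leakage.

    `leakage` maps a tuple of 0/1 inputs to an estimated leakage value.
    """
    return min(product((0, 1), repeat=n_inputs), key=leakage)


def toy_nand_leakage(inputs):
    """Toy model (an assumption): leakage falls as more devices in the
    series NMOS stack are off, mimicking the stacking effect."""
    off_in_stack = inputs.count(0)
    return {0: 10.0, 1: 4.0}.get(off_in_stack, 1.0)


assert best_sleep_vector(2, toy_nand_leakage) == (0, 0)
```

Exhaustive search grows exponentially with the number of inputs, so for a whole block rather than a single gate some heuristic would be needed; the slides do not say which one.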

Instead of at the gate level, choose high-$V_T$ vs. low-$V_T$ at the transistor level
[Figure: a gate built from a mix of high-$V_T$ devices and low-$V_T$ devices]
Can be used if some transitions are more important than others
–“more important” can be speed or power
Combine with setting input sleep vectors
–make the off transistors high-$V_T$ if possible to further reduce leakage

If you turn off the power, then the gates can’t leak
[Figure: a gate powered from a virtual $V_{dd}$, which a gating transistor connects to the real $V_{dd}$; with the gating transistor off, the leakage path is cut]
This gating transistor is a beast… it needs to be big enough to supply the necessary current when not gated, and it also needs to be low leakage (a high-$V_T$ device)
Gating transistor is also called a “sleep” transistor

After gating, residual charge in the system will continue to leak
[Figure: block between virtual $V_{dd}$ and virtual Gnd, with the sleep transistors to both $V_{dd}$ and Gnd turned off]
Both paths cut off now

Sleep transistors are slow high-$V_T$ devices
Depending on the size of the block covered by the sleep transistor, virtual $V_{dd}$/Gnd may have a lot of capacitance to charge/discharge
[Figure: RC model of $V_{dd}$ charging virtual $V_{dd}$] Moderate R, large C ⇒ large RC (slow)
[Timeline: ADD inst ready to execute while the ALU is asleep; delay to wake up the ALU; then ADD executes]
Wakeup delay can cause significant performance penalties when units are unavailable
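A first-order feel for the wakeup delay comes from the standard RC charging curve, $v(t) = V_{dd}(1 - e^{-t/RC})$. The sketch below is an assumption-laden estimate: the function name, the 95% settling threshold, and the R, C, and frequency values are all hypothetical.

```python
import math


def wakeup_cycles(r_ohms, c_farads, freq_hz, settle_fraction=0.95):
    """Clock cycles for virtual Vdd to reach settle_fraction of Vdd (simple RC model)."""
    t = -r_ohms * c_farads * math.log(1.0 - settle_fraction)   # seconds
    return math.ceil(t * freq_hz)


# Hypothetical values: 1 ohm sleep transistor, 10 nF virtual rail, 2 GHz clock.
cycles = wakeup_cycles(r_ohms=1.0, c_farads=10e-9, freq_hz=2e9)
assert cycles == 60
```

With these made-up numbers the unit is unavailable for roughly 60 cycles, which illustrates why the wakeup request needs to be sent well before the unit is needed.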

In some situations, can know early enough ahead
[Figure: crude pipeline (fetch, decode, …, exec); FP inst decoded!]
Immediately send wakeup to FPU
Hopefully by the time the fadd makes it to the OOO core, gets scheduled, and makes it to the FPU, the turn-on has completed

In some cases it’s much harder
–pipeline full/stalled (maybe due to D$ miss to main memory)
–power-off front-end units (fetch, decode, etc.)
–miss serviced, back-end starts moving again; front-end starts wake up
–back-end gets starved because front-end wakeup is too slow and can’t refill the pipeline
But it’s hard to start the power-on early because we don’t know when the memory request will be fulfilled (and whether that will cause the back-end to drain)

(Dis)charging virtual $V_{dd}$/Gnd consumes quite a bit of energy/power
$P = \tfrac{1}{2} \alpha C V^2 f$
Worst case: charge up as soon as you’re done discharging
[Timeline of virtual $V_{dd}$: go to sleep! … done discharging, now wake up!]
We just wasted $2 \times \tfrac{1}{2} \times C_{\text{Virt}V_{dd}} \times V_{dd}^2$ joules to discharge and then recharge the virtual $V_{dd}$
And we spent zero cycles fully asleep, so we didn’t save any/much leakage power

Must stay asleep for some time, just to break even!
[Figure: energy consumed vs. time. Without sleeping, energy consumed from leakage grows steadily. With sleeping: energy to discharge virtual $V_{dd}$/Gnd, zero energy consumed while sleeping, then energy to recharge virtual $V_{dd}$/Gnd. The crossover marks the minimum sleep time for energy break-even.]
Too little sleep… ends up costing more energy than doing nothing (extra energy spent)
Sleep interval > break-even length ⇒ energy reduction
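The break-even point can be estimated by dividing the sleep/wake energy overhead by the leakage power that sleeping avoids. This is a simplified sketch under stated assumptions: zero energy is consumed while asleep, the overhead is the $2 \times \tfrac{1}{2} C V_{dd}^2$ from the previous slide, and the function names and numbers are hypothetical.

```python
def sleep_overhead_joules(c_virtual, vdd):
    """Energy to discharge and then recharge the virtual rail: 2 * (1/2) * C * V^2."""
    return 2 * 0.5 * c_virtual * vdd ** 2


def break_even_sleep_time(c_virtual, vdd, leakage_power):
    """Seconds of sleep needed before the leakage saved covers the overhead."""
    return sleep_overhead_joules(c_virtual, vdd) / leakage_power


def net_energy_saved(sleep_time, c_virtual, vdd, leakage_power):
    """Positive when sleeping beats staying awake and leaking."""
    return leakage_power * sleep_time - sleep_overhead_joules(c_virtual, vdd)


# Hypothetical block: 10 nF virtual rail, 1.0 V supply, 100 mW of leakage.
t_be = break_even_sleep_time(10e-9, 1.0, 0.1)
assert net_energy_saved(0.5 * t_be, 10e-9, 1.0, 0.1) < 0   # too little sleep
assert net_energy_saved(10 * t_be, 10e-9, 1.0, 0.1) > 0    # long sleep pays off
```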

Instantly turning on the sleep transistor to recharge virtual $V_{dd}$ causes a very large current spike ($di/dt$)
[Analogy figure: a water tank feeds the shower ($I_{shower}$); flushing the toilet draws $I_{john}$, leaving $I_{shower} - I_{john}$ and a pressure drop. $I_{john}$ plays the role of the current for recharging virtual $V_{dd}$.]
Solution: progressive turn-on; recharge virtual $V_{dd}$ slowly, which limits $I_{john}$ (i.e., $I_{recharge}$) to keep pressure drop (supply noise) under control
Slowing down recharge increases the performance penalty when recharge is late

OS power management (OSPM)
–algorithm monitors CPU load over some window of time
–computes target performance point, requests from CPU
–CPU is expected to modify operating voltage/frequency to match OSPM’s request
[Figure: relative power consumption under voltage and frequency scaling vs. frequency scaling only]
OS can choose different power saving states (C0 – Cn)
–C0: active state (no power saving)
–Ci: higher i ⇒ more power savings, but longer recovery time
http://download.intel.com/technology/itj/2006/volume10issue02/vol10_art03.pdf
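The trade-off between deeper states and longer recovery time suggests a simple selection rule: pick the deepest state whose recovery latency is still tolerable. The sketch below is illustrative only; the latency numbers and the function name are invented, and real OS policies also weigh predicted idle duration.

```python
# Hypothetical (state, recovery latency in microseconds), shallowest first.
C_STATES = [("C0", 0), ("C1", 1), ("C2", 10), ("C3", 50), ("C4", 100)]


def deepest_allowed_state(max_latency_us):
    """Return the deepest C-state whose recovery latency fits the budget."""
    allowed = [name for name, latency in C_STATES if latency <= max_latency_us]
    return allowed[-1]   # C0 has zero latency, so the list is never empty for budgets >= 0


assert deepest_allowed_state(0) == "C0"
assert deepest_allowed_state(20) == "C2"
assert deepest_allowed_state(1000) == "C4"
```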

C0: Active
C1 (processor-centric measures)
–instruction execution halted, clocks are gated
C2: CPU does not access bus w/o chipset’s consent
–allows bus to be put in low-power mode
C3: CPU disables PLLs (clock generators)
C4: CPU lowers voltage to minimum level while still being able to retain state (e.g., cache contents)
DC4: “Deep” C4 (next slide)

Upon entering C4, flush L2 cache to main memory
–Don’t do it all at once!
  If C4 period is short, then you waste more power due to flushing
  Can have performance impact on wakeup since cache will be cold
Flush only part of the L2 (1/8 to 1/2) by ways
–once a complete way has been flushed, power gate it with sleep transistors (discussed earlier)
Do this upon each entry into C4 state
When L2 has shrunk to 0 bytes, enter DC4
–Greatly reduce voltage since there’s no state to retain
  No need to wake up cache for snoops
  Chipset directs snoop traffic directly to memory
Typically expand cache to minimum of two ways on exit from DC4
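The gradual shrinking of the L2 can be sketched as a small state machine. This is a toy model under loud assumptions: the class name, the 8-way cache, and the fixed two ways flushed per C4 entry are all hypothetical, and how the cache grows again under load is not covered by the slide, so it is left out.

```python
class ShrinkingL2:
    """Toy model of flushing and power-gating L2 ways on each C4 entry."""

    def __init__(self, total_ways=8, ways_per_entry=2):
        self.total_ways = total_ways
        self.ways_per_entry = ways_per_entry   # assumed fixed flush step
        self.active_ways = total_ways

    def enter_c4(self):
        """Flush and gate a few more ways; return the state actually entered."""
        self.active_ways = max(0, self.active_ways - self.ways_per_entry)
        return "DC4" if self.active_ways == 0 else "C4"

    def exit_to_c0(self):
        """On wakeup, make sure at least two ways are powered."""
        self.active_ways = max(self.active_ways, min(2, self.total_ways))


l2 = ShrinkingL2()
states = []
for _ in range(4):
    states.append(l2.enter_c4())
    l2.exit_to_c0()

assert states == ["C4", "C4", "C4", "DC4"]
assert l2.active_ways == 2   # expanded back to two ways after leaving DC4
```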

Many shared resources
–PLL, power supply, L2 cache
  Can’t (easily) run cores at different clock speeds with a single PLL
  Can’t run cores at different voltages with a single power supply
  Can’t turn off L2 cache just because one core is idle
External interface complications
–OS sees two separate CPUs
  one C-state per core
–Platform views the whole processor as a single entity for power management (for C2 state and higher)
OS can request C-states on a per-core basis
Platform sees only a single C-state (the lower of the two)

If one core is in deep sleep, it’s not consuming much power
Idea: use DVFS in reverse to increase voltage/frequency
[Figure: per-core power against the power limit, and relative performance, for two cases: both cores in C0; core 0 in C0 with core 1 in DC4]
Deliver more performance when running a single program and not worried about battery life (plugged in to wall)
“Intel Dynamic Acceleration Technology”
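An upper bound on the available headroom follows from the cubic power relationship. The sketch below is a back-of-the-envelope estimate, not a description of how Intel's feature sets its operating point: it assumes per-core dynamic power scales as the cube of the V/f scale factor, that sleeping cores consume nothing, and it ignores static power and voltage or thermal ceilings. The function name is hypothetical.

```python
def single_core_boost(n_cores=2, n_active=1):
    """V/f scale factor that keeps total dynamic power at the all-cores-active level.

    Assumes P ~ scale^3 per active core; ignores static power and any
    voltage, frequency, or thermal limits (which cap the boost in practice).
    """
    return (n_cores / n_active) ** (1.0 / 3.0)


boost = single_core_boost()
assert 1.25 < boost < 1.27   # about 26% under these idealized assumptions
```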

Earlier body-bias effect from stacked transistors was due to higher source voltage: higher $V_{SB}$ increases $V_T$ ($V_B = 0$, $V_S \approx V/2$)
Provide a way to explicitly bias $V_B$
–Set $V_{BBN} < 0$ for this NFET
–Since $V_{BBN} < 0$, also called “reverse biasing”
Pros:
–significant standby leakage reduction
–memory elements retain state
–no transistor sizing/partitioning required
–dynamically tunable $V_T$ at runtime
Cons:
–requires expensive triple-well fabrication process
–body-biasing effect decreases with technology scaling
Kao et al., Embedded Tutorial: Subthreshold Leakage Modeling and Reduction Techniques, ICCAD 2002
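The dependence of $V_T$ on $V_{SB}$ is usually written with the textbook body-effect equation, $V_T = V_{T0} + \gamma\left(\sqrt{2\phi_F + V_{SB}} - \sqrt{2\phi_F}\right)$, which is not on the slide. The sketch below evaluates it with invented device parameters ($V_{T0}$, $\gamma$, $\phi_F$ are assumptions), in keeping with the earlier warning that this is a simplified picture.

```python
import math


def threshold_voltage(v_sb, v_t0=0.30, gamma=0.4, phi_f=0.35):
    """Body-effect equation: V_T = V_T0 + gamma * (sqrt(2*phi_F + V_SB) - sqrt(2*phi_F)).

    Default parameters are hypothetical. Requires 2*phi_F + V_SB >= 0.
    """
    return v_t0 + gamma * (math.sqrt(2 * phi_f + v_sb) - math.sqrt(2 * phi_f))


assert abs(threshold_voltage(0.0) - 0.30) < 1e-12   # no bias: nominal V_T
assert threshold_voltage(0.5) > 0.30                # reverse bias raises V_T
assert threshold_voltage(-0.3) < 0.30               # forward bias lowers V_T
```

Reverse bias ($V_{SB} > 0$) raises the threshold and cuts leakage; forward bias ($V_{SB} < 0$) lowers it and speeds the device up at the cost of more leakage.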

Super-high $V_T$ for caches (very slow)
Use selective forward body biasing during access to read/write at a reasonable speed
[Figure: cache lines built from very-high-$V_T$ devices (very low leakage, slow access speed); $V_{BBN}$ is 0 normally, goes to $V_{fwd-bias}$ on the line being accessed, and returns to 0 when the access completes]
$V_{SB} < 0$ ⇒ $V_T$ decreases ⇒ transistors are faster (but consume more power)
A few cache lines go into high leakage mode, but only very briefly (during access). The rest of the time, the cache consumes very little leakage power.

Different blocks have different performance needs
–and this varies in time
Idea: clock different blocks at different speeds
–Apply voltage/frequency scaling to blocks/groups-of-blocks
  e.g., FP units can be slowed down (or maybe even completely turned off) for integer applications
–Block consumes less power when it doesn’t have to operate in max-performance mode
GALS = Globally Asynchronous, Locally Synchronous

[Figure: baseline processor vs. GALS processor, from http://www.ece.cmu.edu/~dianam/conferences/isca02.pdf]

How to communicate between clock domains?
Asynchronous FIFO Design [Chelcea and Nowick]
Timing issues:
–Producer can clear empty, but it gets cleared on clk2
–Consumer clears the full signal, but it occurs on clk1
Voltage issues:
–[Figure: a 0.75 V domain ($V_{dd1}$) drives “1” as 0.75 V into a 1.5 V domain ($V_{dd2}$), where “0” is 0 V and “1” is 1.5 V; is 0.75 V a 0 or a 1?]
FIFO between domains must “speak” both voltages
