Power-Optimal Pipelining in Deep Submicron Technology

Slides:



Advertisements
Similar presentations
A Robust, Fast Pulsed Flip- Flop Design By: Arunprasad Venkatraman Rajesh Garg Sunil Khatri Department of Electrical and Computer Engineering, Texas A.
Advertisements

Keeping Hot Chips Cool Ruchir Puri, Leon Stok, Subhrajit Bhattacharya IBM T.J. Watson Research Center Yorktown Heights, NY Circuits R-US.
Power Reduction Techniques For Microprocessor Systems
© Digital Integrated Circuits 2nd Inverter EE4271 VLSI Design The Inverter Dr. Shiyan Hu Office: EERC 518 Adapted and modified from Digital.
Leakage and Dynamic Glitch Power Minimization Using MIP for V th Assignment and Path Balancing Yuanlin Lu and Vishwani D. Agrawal Auburn University ECE.
Leakage-Biased Domino Circuits for Dynamic Fine- Grain Leakage Reduction Seongmoo Heo and Krste Asanović Massachusetts Institute of Technology Lab for.
Synchronous Digital Design Methodology and Guidelines
CS 7810 Lecture 12 Power-Aware Microarchitecture: Design and Modeling Challenges for Next-Generation Microprocessors D. Brooks et al. IEEE Micro, Nov/Dec.
LOW-LEAKAGE REPEATERS FOR NETWORK-ON-CHIP INTERCONNECTS Arkadiy Morgenshtein, Israel Cidon, Avinoam Kolodny, Ran Ginosar Technion – Israel Institute of.
On-Line Adjustable Buffering for Runtime Power Reduction Andrew B. Kahng Ψ Sherief Reda † Puneet Sharma Ψ Ψ University of California, San Diego † Brown.
1 Introduction Background: CS 3810 or equivalent, based on Hennessy and Patterson’s Computer Organization and Design Text for CS/EE 6810: Hennessy and.
Architectural Power Management for High Leakage Technologies Department of Electrical and Computer Engineering Auburn University, Auburn, AL /15/2011.
Power-Aware Computing 101 CS 771 – Optimizing Compilers Fall 2005 – Lecture 22.
ECE 510 Brendan Crowley Paper Review October 31, 2006.
Lecture 21, Slide 1EECS40, Fall 2004Prof. White Lecture #21 OUTLINE –Sequential logic circuits –Fan-out –Propagation delay –CMOS power consumption Reading:
Low Power Design of Integrated Systems Assoc. Prof. Dimitrios Soudris
Power, Energy and Delay Static CMOS is an attractive design style because of its good noise margins, ideal voltage transfer characteristics, full logic.
6.893: Advanced VLSI Computer Architecture, September 28, 2000, Lecture 4, Slide 1. © Krste Asanovic Krste Asanovic
11 1 Process Variation in Near-threshold Wide SIMD Architectures Sangwon Seo 1, Ronald G. Dreslinski 1, Mark Woh 1, Yongjun Park 1, Chaitali Chakrabarti.
Folklore Confirmed: Compiling for Speed = Compiling for Energy Tomofumi Yuki INRIA, Rennes Sanjay Rajopadhye Colorado State University 1.
1 VLSI Design SMD154 LOW-POWER DESIGN Magnus Eriksson & Simon Olsson.
Power Reduction for FPGA using Multiple Vdd/Vth
Low-Power Wireless Sensor Networks
Dept. of Computer Science, UC Irvine
Sub-threshold Design of Ultra Low Power CMOS Circuits Students: Dmitry Vaysman Alexander Gertsman Supervisors: Prof. Natan Kopeika Prof. Orly Yadid-Pecht.
1 EE 587 SoC Design & Test Partha Pande School of EECS Washington State University
Modern VLSI Design 2e: Chapter 3 Copyright  1998 Prentice Hall PTR Topics n Electrical properties of static combinational gates: –transfer characteristics;
Adiabatic Logic as Low-Power Design Technique Presented by: Muaayad Al-Mosawy Presented to: Dr. Maitham Shams Mar. 02, 2005.
A Robust Pulse-triggered Flip-Flop and Enhanced Scan Cell Design
Guy Lemieux, Mehdi Alimadadi, Samad Sheikhaei, Shahriar Mirabbasi University of British Columbia, Canada Patrick Palmer University of Cambridge, UK SoC.
MicroFix: Exploiting Path-grained Timing Adaptability for Improving Power-Performance Efficiency Guihai Yan, Yinhe Han, Hui Liu, Xiaoyao Liang, Xiaowei.
Leakage reduction techniques Three major leakage current components 1. Gate leakage ; ~ Vdd 4 2. Subthreshold ; ~ Vdd 3 3. P/N junction.
Stochastic Current Prediction Enabled Frequency Actuator for Runtime Resonance Noise Reduction Yiyu Shi*, Jinjun Xiong +, Howard Chen + and Lei He* *Electrical.
경종민 Low-Power Design for Embedded Processor.
Computer Science and Engineering Power-Performance Considerations of Parallel Computing on Chip Multiprocessors Jian Li and Jose F. Martinez ACM Transactions.
-1- UC San Diego / VLSI CAD Laboratory Optimization of Overdrive Signoff Tuck-Boon Chan, Andrew B. Kahng, Jiajia Li and Siddhartha Nath Tuck-Boon Chan,
University of Toronto,Toronto, Ontario, Canada 1 Circuit Research Labs, Intel Corporation, Hillsboro, OR Variations-Aware Low-Power Design with Voltage.
CS203 – Advanced Computer Architecture
COE 360 Principles of VLSI Design Delay. 2 Definitions.
CS203 – Advanced Computer Architecture
The Interconnect Delay Bottleneck.
Frequency response I As the frequency of the processed signals increases, the effects of parasitic capacitance in (BJT/MOS) transistors start to manifest.
Memory Segmentation to Exploit Sleep Mode Operation
ECE Department, University of California, Davis
The Inverter EE4271 VLSI Design Professor Shiyan Hu Office: EERC 518
Alireza Shafaei, Shuang Chen, Yanzhi Wang, and Massoud Pedram
Sequential circuit design with metastability
LOW POWER DESIGN METHODS V.ANANDI ASST.PROF,E&C MSRIT,BANGALORE.
Pipelining and Retiming 1
Hot Chips, Slow Wires, Leaky Transistors
Estimate power saving by clock slowdown for s5378 in 180nm and 32nm CMOS Chao Han ELEC 6270.
Architecture & Organization 1
Circuits and Interconnects In Aggressively Scaled CMOS
DESIGN AND IMPLEMENTATION OF DIGITAL FILTER
Reading: Hambley Ch. 7; Rabaey et al. Sec. 5.2
Microarchitectural Techniques for Power Gating of Execution Units
Fine-Grain CAM-Tag Cache Resizing Using Miss Tags
Low Power Design in VLSI
Lecture 2: Performance Today’s topics: Technology wrap-up
Architecture & Organization 1
Circuit Design Techniques for Low Power DSPs
Yiyu Shi*, Jinjun Xiong+, Howard Chen+ and Lei He*
A High Performance SoC: PkunityTM
FPGA Glitch Power Analysis and Reduction
The Inverter EE4271 VLSI Design Dr. Shiyan Hu Office: EERC 731
Load-Sensitive Flip-Flop Characterization
Reading: Hambley Ch. 7; Rabaey et al. Secs. 5.2, 5.5, 6.2.1
Low Power Digital Design
Parametric Yield Estimation Considering Leakage Variability Rajeev Rao, Anirudh Devgan, David Blaauw, Dennis Sylvester Present by Fengbo Ren Apr. 30.
Frequency response I As the frequency of the processed signals increases, the effects of parasitic capacitance in (BJT/MOS) transistors start to manifest.
Presentation transcript:

Power-Optimal Pipelining in Deep Submicron Technology ISLPED 2004 8/10/2004 Power-Optimal Pipelining in Deep Submicron Technology Seongmoo Heo and Krste Asanović Computer Architecture Group, MIT CSAIL

Traditional Pipelining Goal: Maximum performance Vdd Clk-Q Setup Propagation Delay Clk Traditionally pipelining was used to increase clock frequency for performance. But the time slack obtained from pipelining can also be used to reduce power consumption. Supply voltage scaling is one of the most effective delay-power tradeoff techniques. Our target digital system has highly parallel computation and we assume that it is pipelinable without CPI loss. And the system has fixed throughput requirement and clock frequency is fixed. In this work, we show how power-optimal pipelining varies for different operating regimes in deep submicron technology. We explore the effects of Vdd, Vth, activity factor, and clock gating. Clk Clk

Pipelining as a Low-Power Tool Goal: Low-Power, Fixed Throughput Vdd Clk-Q Setup Propagation Delay Clk Time Slack Traditionally pipelining was used to increase clock frequency for performance. But the time slack obtained from pipelining can also be used to reduce power consumption. Supply voltage scaling is one of the most effective delay-power tradeoff techniques. Our target digital system has highly parallel computation and we assume that it is pipelinable without CPI loss. And the system has fixed throughput requirement and clock frequency is fixed. In this work, we show how power-optimal pipelining varies for different operating regimes in deep submicron technology. We explore the effects of Vdd, Vth, activity factor, and clock gating. Clk Time Slack Clk

Pipelining as a Low-Power Tool Goal: Low-Power, Fixed Throughput Vdd Clk-Q Setup Propagation Delay Clk Time Slack Traditionally pipelining was used to increase clock frequency for performance. But the time slack obtained from pipelining can also be used to reduce power consumption. Supply voltage scaling is one of the most effective delay-power tradeoff techniques. Our target digital system has highly parallel computation and we assume that it is pipelinable without CPI loss. And the system has fixed throughput requirement and clock frequency is fixed. In this work, we show how power-optimal pipelining varies for different operating regimes in deep submicron technology. We explore the effects of Vdd, Vth, activity factor, and clock gating. Clk Traded for Power (supply voltage scaling) Time Slack Clk

Pipelining as a Low-Power Tool * Clock frequency fixed Flip-flop Power Overhead Pipelining Time slack Traditionally pipelining was used to increase clock frequency for performance. But the time slack obtained from pipelining can also be used to reduce power consumption. Supply voltage scaling is one of the most effective delay-power tradeoff techniques. Our target digital system has highly parallel computation and we assume that it is pipelinable without CPI loss. And the system has fixed throughput requirement and clock frequency is fixed. In this work, we show how power-optimal pipelining varies for different operating regimes in deep submicron technology. We explore the effects of Vdd, Vth, activity factor, and clock gating. Delay

Pipelining as a Low-Power Tool * Clock frequency fixed Traditionally pipelining was used to increase clock frequency for performance. But the time slack obtained from pipelining can also be used to reduce power consumption. Supply voltage scaling is one of the most effective delay-power tradeoff techniques. Our target digital system has highly parallel computation and we assume that it is pipelinable without CPI loss. And the system has fixed throughput requirement and clock frequency is fixed. In this work, we show how power-optimal pipelining varies for different operating regimes in deep submicron technology. We explore the effects of Vdd, Vth, activity factor, and clock gating. Power Saving Supply voltage scaling Delay

Power-Optimal Pipelining Power reduction from pipelining limited by power overhead of increased number of flip-flops  Power-Optimal Pipelining Traditionally pipelining was used to increase clock frequency for performance. But the time slack obtained from pipelining can also be used to reduce power consumption. Supply voltage scaling is one of the most effective delay-power tradeoff techniques. Our target digital system has highly parallel computation and we assume that it is pipelinable without CPI loss. And the system has fixed throughput requirement and clock frequency is fixed. In this work, we show how power-optimal pipelining varies for different operating regimes in deep submicron technology. We explore the effects of Vdd, Vth, activity factor, and clock gating.

Power-Optimal Pipelining Power reduction from pipelining limited by power overhead of increased number of flip-flops  Power-Optimal Pipelining Power Traditionally pipelining was used to increase clock frequency for performance. But the time slack obtained from pipelining can also be used to reduce power consumption. Supply voltage scaling is one of the most effective delay-power tradeoff techniques. Our target digital system has highly parallel computation and we assume that it is pipelinable without CPI loss. And the system has fixed throughput requirement and clock frequency is fixed. In this work, we show how power-optimal pipelining varies for different operating regimes in deep submicron technology. We explore the effects of Vdd, Vth, activity factor, and clock gating. Too shallow pipelining Delay

Power-Optimal Pipelining Power reduction from pipelining limited by power overhead of increased number of flip-flops  Power-Optimal Pipelining Power Too deep pipelining Traditionally pipelining was used to increase clock frequency for performance. But the time slack obtained from pipelining can also be used to reduce power consumption. Supply voltage scaling is one of the most effective delay-power tradeoff techniques. Our target digital system has highly parallel computation and we assume that it is pipelinable without CPI loss. And the system has fixed throughput requirement and clock frequency is fixed. In this work, we show how power-optimal pipelining varies for different operating regimes in deep submicron technology. We explore the effects of Vdd, Vth, activity factor, and clock gating. Too shallow pipelining Delay

Power-Optimal Pipelining Power reduction from pipelining limited by power overhead of increased number of flip-flops  Power-Optimal Pipelining Power Too deep pipelining Traditionally pipelining was used to increase clock frequency for performance. But the time slack obtained from pipelining can also be used to reduce power consumption. Supply voltage scaling is one of the most effective delay-power tradeoff techniques. Our target digital system has highly parallel computation and we assume that it is pipelinable without CPI loss. And the system has fixed throughput requirement and clock frequency is fixed. In this work, we show how power-optimal pipelining varies for different operating regimes in deep submicron technology. We explore the effects of Vdd, Vth, activity factor, and clock gating. Too shallow pipelining Optimal pipelining Optimal Power Saving Delay

Contribution Pipelining is an old idea. Research focus has been on performance impact of pipelining. Idea of using pipelining [Chandrakasan ’92] to lower power has not been fully explored in deep submicron technology. Analysis and circuit-level simulation of Power-Optimal Pipelining for different regimes of Vth, activity factor, clock gating Pipelining is an old idea. Research focus has been on performance impact of pipelining. The idea of using pipelining to lower power was proposed long time ago, however it has not been fully explored especially in deep submicron technology. Analysis and circuit-level simulation of power-optimal pipelining for different regimes of threshold voltages, activity factor, and clock-gating in deep submicron technology.

Bottom-to-Top Approach Impact of pipelining on power component Impact of pipelining on total power (with/without clock-gating) Power Total Power (clock-gated) active inactive active Time The figure shows the approach. Impact of pipelining on each power component is studied first and then, Impact of pipelining on total power is explored. Switching Power Component Leakage Power Component Idle Power Component

Bottom-to-Top Approach Impact of pipelining on power component Impact of pipelining on total power (with/without clock-gating) Power Total Power (not clock-gated) active inactive active Time The figure shows the approach. Impact of pipelining on each power component is studied first and then, Impact of pipelining on total power is explored. *Idle power = power consumed when circuit is idle and not clock-gated Switching Power Component Leakage Power Component Idle Power Component

Methodology Target digital system: Fixed throughput, Highly parallel computation, Logic-dominant Test bench BPTM (Berkeley Predictive Technology Model) 70nm process: LVT(0.17/-0.2), MVT(0.19/-0.22), HVT(0.21/-0.24) Hspice simulation at 100°C, Clock = 2 GHz Baseline Clock frequency was set at 2 GHz and no CPI loss was assumed. Logic-dominant stage was assumed rather than wire-dominant one. BPTM 70nm process was used. Test bench consists of PowerPC flip-flops and N stages of FO4 inverters. 24 FO4 inverters were used for the baseline. Simulation was done with Hspice simulation at 100 degree. N FO4 inverters (N = 2 ~ 24) TG flip-flops TG flip-flops One Pipeline Stage

Pipelining and Switching Power: Analytical Trend Optimal Saving O(N2) Flip-flop overhead Switching Power Quadratic reduction of logic switching power O(1/N) Pipelining and Switching Power As N decreases, the number of flip-flops increases and so power overhead does too. In the equation, the ratio of b1 to b0 is approximately the capacitance ratio of flip-flop and FO4 inverter. If N is greater than the ratio, the quadratic term dominates. On the other hand, if N is small, switching power becomes inversely proportional to 1 over N. The graph shows the imaginative switching power scaling curve and power-optimal point.  Vdd2  N2 Optimal FO4 Number of FO4 per stage, N

Pipelining and Leakage Power: Analytical Trend Optimal Saving O(1/N) O(N ) (1<< 2) Flip-flop overhead Superlinear reduction of logic leakage power Leakage Power Optimal FO4 Pipelining and Leakage Power In the equation, the flip-flop overhead is represented by c1 over N. Similar to the switching power case, the ratio of c1 and c0 is approximately the leakage ratio of flip-flop and FO4 inverter. The dependence of leakage on drain voltage is included by the DIBL coefficient. When N is large, leakage power is proportional to N times an exponential of N. The exponential term represents the dependence of leakage current on the drain voltage (from DIBL). In modern deep sub-micron technology, for an appropriate supply voltage range, this term is larger than O(1) but smaller than O(N), therefore leakage power is reduced in a super-linear fashion as N decreases, though less than the quadratic reduction for switching power.  Vdd * e(Vdd)  N DIBL effect Number of FO4 per stage, N

Pipelining and Idle Power: Analytical Trend Clock-gating is not always possible Increased control complexity insufficient setup time of clock enable signal Leakage Power + Flip-flop Switching Power Between leakage power scaling and flip-flop switching power scaling depending on leakage level Pipelining and Idle Power without clock-gating Clock-gating is not always possible due to increased control complexity or insufficient setup time of clock enable signal. Therefore, we look at the power consumption when the stage is idle but not clock-gated. The idle power consists of flip-flop switching power overhead and leakage power.

Pipelining and Idle Power: Analytical Trend Leakage Power Scale Flip-flop Switching Power Scale Idle Power Optimal Saving Optimal Saving O(N) Optimal FO4 Linear reduction of Flip-flop switching power Relative Power O(N ) (1<< 2) O(1/N) Pipelining and Idle Power without clock-gating Clock-gating is not always possible due to increased control complexity or insufficient setup time of clock enable signal. Therefore, we look at the power consumption when the stage is idle but not clock-gated. The idle power consists of flip-flop switching power overhead and leakage power. Optimal FO4  1/N * Vdd2  N O(1/N) Number of FO4 per stage, N Number of FO4 per stage, N

Simulation Results: Power Components Fixed Throughput @ 2 GHz Power Components Switching Power Leakage Power Idle Power Right hand side curve O(N2) O(N ) (1<< 2) O(N) or O(N ) (1<< 2) Saving* 79(HVT)~ 82(LVT)% 70(LVT)~ 75(HVT)% 55(HVT)~ 70(LVT)% N* 6 8 Clock frequency was set at 2 GHz and no CPI loss was assumed. Logic-dominant stage was assumed rather than wire-dominant one. BPTM 70nm process was used. Test bench consists of PowerPC flip-flops and N stages of FO4 inverters. 24 FO4 inverters were used for the baseline. Simulation was done with Hspice simulation at 100 degree. N = Number of FO4 inverters per stage (Not including flip-flop delay) N* = Optimal N Saving* = Optimal power saving by pipelining

Optimal Power Saving Optimal FO4 = 6 Optimal FO4 = 6~8 No Clock Gating *2 GHz *Flip-flop delay not included in optimal FO4 relative power relative power Figure shows the simulated total power when a clock-gating mechanism is present for different activity factors. With a low activity factor, total power curves follow leakage power curves and high Vth leads to more power saving by pipelining. As the activity factor increases, total power curves follow switching power curves and high Vth leads to less power saving by pipelining. activity factor activity factor

Optimal Power Saving Optimal FO4 = 6 Optimal FO4 = 6~8 No Clock Gating Idle Power Leakage Power relative power relative power Switching Power Figure shows the simulated total power when a clock-gating mechanism is present for different activity factors. With a low activity factor, total power curves follow leakage power curves and high Vth leads to more power saving by pipelining. As the activity factor increases, total power curves follow switching power curves and high Vth leads to less power saving by pipelining. Switching Power activity factor activity factor

Optimal Power Saving Optimal FO4 = 6 Optimal FO4 = 6~8 No Clock Gating relative power relative power Figure shows the simulated total power when a clock-gating mechanism is present for different activity factors. With a low activity factor, total power curves follow leakage power curves and high Vth leads to more power saving by pipelining. As the activity factor increases, total power curves follow switching power curves and high Vth leads to less power saving by pipelining. LVT activity factor activity factor

Discussion LVT can be fast and power-efficient enables lower Vdd Flip-flop delay more important than flip-flop power for power-optimal pipelining

Limitation of This Work Effect on optimal logic depth Effect on optimal power saving Super-linear growth of flip-flops   Additional memory Reduced glitches Parasitic wire capacitance

Conclusion Pipelining is an effective low-power tool when used to support voltage scaling in digital system implementing highly parallel computation. Optimal Logic Depth: 6-8 FO4 ~ 8-10 FO4 including flip-flop delay Optimal Power Saving: 55 – 80% It depends on Vth, AF, Clock-Gating Insights: Pipelining is more effective with High AF Pipelining is most effective at saving switching power Pipelining is more effective with lower Vth Except for when leakage power is dominant. Pipelining is more effective with clock-gating reduced flip-flop overhead.

Acknowledgments Thanks to SCALE group members and anonymous reviewers Funded by NSF CAREER award CCR-0093354, NSF ITR award CCR-0219545, and a donation from Intel Corporation.

BACKUP SLIDES