Adaptive Techniques for Leakage Power Management in L2 Cache Peripheral Circuits
Houman Homayoun, Alex Veidenbaum, and Jean-Luc Gaudiot
Dept. of Computer Science, UC Irvine

Outline
- L2 cache power dissipation
- Why cache peripherals?
- Study of a recently proposed static approach to reducing leakage
- Two proposed adaptive techniques to reduce leakage
- Power, performance, and energy-delay results

L2 Cache and Power
- The L2 cache in high-performance processors is large: 2 to 4 MB is common
- It is typically accessed relatively infrequently
- It therefore dissipates most of its power as leakage
- Historically, most of that leakage was in the SRAM cells, and many architectural techniques have been proposed to remedy it
- Today there is also significant leakage in the peripheral circuits of an SRAM (cache), in part because cell design has been heavily optimized
(Pentium M processor die photo, courtesy of intel.com)

Peripherals?!
- Data input/output drivers
- Address input/output drivers
- Row pre-decoder
- Wordline drivers
- Row decoder
- Others: sense amps, bitline pre-chargers, memory cells, decoder logic

Why Peripherals?
- Cells use minimum-sized transistors for area reasons; peripherals use larger, faster, and accordingly leakier transistors to satisfy timing requirements
- Cells use high-Vt transistors, whereas peripherals use typical-threshold-voltage transistors

Leakage Power Components of L2 Cache
- SRAM peripheral circuits dissipate more than 90% of the total leakage power

Leakage Power as a Fraction of Total L2 Power Dissipation
- L2 cache leakage power dominates its dynamic power: above 87% of the total

Circuit Techniques Addressing Leakage in the SRAM Cell
- Gated-Vdd, Gated-Vss
- Voltage scaling (DVFS)
- ABB-MTCMOS
- Forward body biasing (FBB), reverse body biasing (RBB)
- All target the SRAM memory cell

Architectural Techniques
- Way prediction, way caching, phased access: predict or cache recently accessed ways, or read the tag first
- Drowsy cache: keeps cache lines in a low-power state, with data retention
- Cache decay: evict lines not used for a while, then power them down (sketch below)
- Apply DVS, Gated-Vdd, or Gated-Vss to the memory cells, with substantial architectural support
- All target the cache SRAM memory cells
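For illustration, a minimal C sketch of the cache-decay idea listed above: each line carries a small counter that is cleared on access and advanced at a coarse interval; when it saturates, the line is powered down. The names, sizes, and decay interval are hypothetical, not any particular published implementation.

/* Cache decay sketch: per-line saturating inactivity counters. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES       4096
#define DECAY_THRESHOLD    4   /* coarse intervals of inactivity before power-off */

typedef struct {
    bool    valid;
    bool    powered;
    uint8_t decay_counter;
} cache_line_t;

static cache_line_t lines[NUM_LINES];

void on_access(int line)        /* called on every hit to this line */
{
    lines[line].decay_counter = 0;
}

void on_decay_tick(void)        /* called once per coarse interval, e.g. every 100K cycles */
{
    for (int i = 0; i < NUM_LINES; i++) {
        if (!lines[i].valid || !lines[i].powered)
            continue;
        if (++lines[i].decay_counter >= DECAY_THRESHOLD) {
            /* write back if dirty (omitted), then gate the cell supply */
            lines[i].valid   = false;
            lines[i].powered = false;
        }
    }
}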

Static Architectural Techniques: SM
- SM technique (ICCD'07)
- Asserts the sleep signal by default; wakes up the L2 peripherals on an access to the cache
- Keeps the cache in the normal state for J cycles (the turn-on period) before returning it to stand-by mode (SM_J)
- No wakeup penalty during this period
- A larger J leads to lower performance degradation but also lower energy savings
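A minimal cycle-level sketch in C of the SM policy as described above: sleep is asserted by default, an access wakes the peripherals, and the cache returns to stand-by after J cycles without an access. The wakeup penalty, the choice of J, and all identifiers are illustrative assumptions, not the actual circuit control.

/* SM policy sketch (one call per cycle). */
#include <stdbool.h>

#define SM_J            750   /* turn-on period in cycles (one of the studied J values) */
#define WAKEUP_PENALTY    3   /* assumed cycles to restore the peripheral supply */

typedef struct {
    bool sleep;               /* sleep signal to the L2 peripheral circuits */
    int  awake_cycles_left;   /* cycles remaining in the turn-on period     */
    int  wakeup_cycles_left;  /* cycles remaining of the wakeup penalty     */
} sm_state_t;

static const sm_state_t SM_INIT = { .sleep = true };   /* sleep asserted by default */

/* Returns the extra latency, if any, seen by an access in this cycle. */
int sm_tick(sm_state_t *s, bool l2_access)
{
    int access_penalty = 0;

    if (l2_access) {
        if (s->sleep) {                         /* access hits a sleeping cache */
            s->sleep = false;
            s->wakeup_cycles_left = WAKEUP_PENALTY;
        }
        access_penalty = s->wakeup_cycles_left; /* extra latency this access sees */
        s->awake_cycles_left = SM_J;            /* (re)start the turn-on period   */
    }

    if (s->wakeup_cycles_left > 0) {
        s->wakeup_cycles_left--;
    } else if (!s->sleep && s->awake_cycles_left > 0) {
        if (--s->awake_cycles_left == 0)
            s->sleep = true;                    /* J cycles without an access: back to stand-by */
    }
    return access_penalty;
}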

Static Architectural Techniques: IM
- IM technique (ICCD'07)
- Monitors the issue logic and functional units of the processor after an L2 cache miss
- Asserts the sleep signal if the issue logic has not issued any instructions and the functional units have not executed any instructions for K consecutive cycles (K = 10)
- De-asserts the sleep signal M cycles before the miss is serviced
- No performance loss
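A minimal C sketch of the IM policy as described above. K follows the slide (K = 10); the value of M, the miss-return bookkeeping, and all identifiers are assumed for illustration (real hardware would use the known miss-return time).

/* IM policy sketch (one call per cycle). */
#include <stdbool.h>

#define K_IDLE_CYCLES  10   /* consecutive idle cycles before asserting sleep */
#define M_EARLY_WAKEUP  4   /* assumed cycles needed to wake before the miss returns */

typedef struct {
    bool sleep;
    int  idle_count;        /* consecutive cycles with nothing issued or executed */
} im_state_t;

void im_tick(im_state_t *s,
             bool l2_miss_outstanding,
             bool issued_this_cycle,
             bool executed_this_cycle,
             int  cycles_until_miss_serviced)
{
    /* Only consider sleeping while an L2 miss is outstanding. */
    if (!l2_miss_outstanding) {
        s->idle_count = 0;
        s->sleep = false;
        return;
    }

    if (!issued_this_cycle && !executed_this_cycle)
        s->idle_count++;
    else
        s->idle_count = 0;

    /* De-assert sleep M cycles before the miss data comes back, so there is no
     * performance loss; otherwise sleep once the core has been idle K cycles. */
    if (cycles_until_miss_serviced <= M_EARLY_WAKEUP)
        s->sleep = false;
    else if (s->idle_count >= K_IDLE_CYCLES)
        s->sleep = true;
}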

Simulated Processor Architecture
- SimpleScalar 4.0
- SPEC2K benchmarks, compiled with the -O4 flag using the Compaq compiler targeting the Alpha processor
- Fast-forwarded for 3 billion instructions, then fully simulated for 4 billion instructions using the reference data sets

SM Performance Degradation

More Insight on SM and IM
- FLP: fraction of program execution time during which the L2 cache is in low-power mode under IM or SM
- The two techniques benefit different benchmarks

More Insight on SM and IM (Cont.)
- In almost half of the benchmarks the FLP is negligible, so there is no leakage reduction opportunity using IM
- The majority of load instructions are satisfied within the cache hierarchy, so accesses to main memory are extremely infrequent
- The average FLP period is 26.9%

Some Observations
- In some benchmarks the SM and IM techniques are both effective: facerec, gap, perlbmk, and vpr
- IM works well in almost half of the benchmarks but is ineffective in the other half
- SM works well in about one half of the benchmarks, but not the same benchmarks as IM
- An adaptive technique combining IM and SM has the potential to deliver an even greater power reduction

Which Technique Is the Best, and When?
- For the L2 to be idle, either there are few L1 misses, or many L2 misses are waiting for memory
- The miss rate product (MRP) may be a good indicator of this cache behavior

The Adaptive Techniques
- Adaptive Static Mode (ASM)
  - MRP measured only once, during an initial learning period (the first 100M committed instructions)
  - MRP > A → IM (A = 90)
  - MRP ≤ A → SM_J
  - Initial technique → SM_J
- Adaptive Dynamic Mode (ADM)
  - MRP measured continuously over a K-cycle period (K = 10M); choose IM or SM for the next 10M cycles
  - MRP > A → IM (A = 100)
  - A ≥ MRP > B → SM_N (B = 200)
  - Otherwise → SM_P
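A minimal C sketch of the ASM and ADM selection logic above, assuming the MRP value is supplied by hardware miss counters. The thresholds are taken as parameters (the slide lists A = 90 for ASM, and A = 100, B = 200 for ADM); the logic assumes A > B so the middle SM_N band is non-empty, and the enum and function names are hypothetical.

/* Mode selection sketch for ASM and ADM. */
typedef enum { MODE_IM, MODE_SM_J, MODE_SM_N, MODE_SM_P } l2_mode_t;

/* ASM: decided once, after the 100M-instruction learning period.
 * The cache runs in SM_J until the decision is made. */
l2_mode_t asm_select(double mrp, double A)
{
    return (mrp > A) ? MODE_IM : MODE_SM_J;
}

/* ADM: re-evaluated every 10M-cycle interval using the previous interval's MRP. */
l2_mode_t adm_select(double mrp, double A, double B)
{
    if (mrp > A)
        return MODE_IM;     /* core mostly stalled on memory: IM pays off      */
    if (mrp > B)
        return MODE_SM_N;   /* moderately idle L2: SM with turn-on period N    */
    return MODE_SM_P;       /* otherwise: SM with turn-on period P             */
}

In ADM, adm_select would be invoked once per 10M-cycle interval with the MRP measured over the previous interval, and the chosen mode applied for the next interval.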

More Insight on ASM and ADM
- ASM attempts to find the more effective static technique per benchmark by profiling a small subset of the program
- ADM is more complex: it attempts to find the more effective static technique at a finer granularity, re-deciding every 10M-cycle interval based on profiling the previous interval

ASM Results
- ASM_750 makes a good power-performance trade-off, with a 44% FLP and approximately 2% performance loss
(Charts: FLP period and performance loss for J = 100, 200, 500, 750, 1500)

Comparing ASM with IM and SM
- Fraction of the IM and SM contributions for ASM_750
- For most benchmarks, ASM correctly selects the more effective static technique (exception: equake)
- A small subset of a program can be used to identify L2 cache behavior: whether the cache is accessed very infrequently, or is idle because the processor is idle

ASM and SM Performance
- No performance loss: ammp, applu, lucas, mcf, mgrid, swim, and wupwise
- 2X more leakage power reduction and less performance loss compared to the static approaches

ADM Results
- In many benchmarks both IM and SM make a noticeable contribution; ADM is effective in combining them
- In some benchmarks either the IM or the SM contribution is negligible; ADM selects the better static technique

Power Measurement Approach
- CACTI-5
- Peripheral circuits account for 90% of all the leakage power; the leakage power reduction in the low-power mode is 88%
- Total dynamic power: N * E_access / T_exec
  - N is the total number of accesses (obtained from simulation)
  - E_access is the single-access energy from CACTI-5
  - T_exec is the program execution time
- Leakage energy is dissipated on every cycle
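A minimal C sketch of this accounting, assuming per-run inputs from simulation and CACTI-5; all variable names and the example numbers are hypothetical, not measured values.

/* L2 power accounting sketch: dynamic power plus leakage savings. */
#include <stdio.h>

int main(void)
{
    /* Inputs (hypothetical values for one benchmark run). */
    double n_accesses      = 4.2e8;   /* N: total L2 accesses from simulation     */
    double e_access_joules = 1.1e-9;  /* E_access: per-access energy from CACTI-5 */
    double t_exec_seconds  = 2.0;     /* T_exec: program execution time           */
    double p_leak_watts    = 0.9;     /* L2 leakage power from CACTI-5            */
    double flp             = 0.44;    /* fraction of time in low-power mode       */
    double lp_reduction    = 0.88;    /* leakage cut while in low-power mode      */

    /* Total dynamic power: N * E_access / T_exec. */
    double p_dyn = n_accesses * e_access_joules / t_exec_seconds;

    /* Leakage is paid every cycle; the low-power mode removes lp_reduction of it
     * for the FLP fraction of execution time. */
    double p_leak_saved = p_leak_watts * flp * lp_reduction;

    printf("dynamic power      : %.3f W\n", p_dyn);
    printf("leakage power saved: %.3f W\n", p_leak_saved);
    return 0;
}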

Power Results
- Leakage power savings and total energy-delay reduction
- Leakage reduction is 34% using ASM and 52% using ADM
- The overall energy-delay reduction is 29.4% using ASM and 45.5% using ADM
- 2 to 3X more leakage power reduction and less performance loss compared to the static approaches

Conclusion
- Studied the breakdown of leakage among L2 cache components, showing that the peripheral circuits leak considerably
- Studied the recently proposed IM and SM approaches
- Proposed a metric (the cache miss rate product) to identify which benchmarks work well with each static approach
- Proposed two adaptive techniques that select the better static approach dynamically
- Presented power, performance, and energy-delay results: a 2 to 3X improvement over the recently proposed static techniques