Reducing Leakage Power in Peripheral Circuits of L2 Caches
Houman Homayoun and Alex Veidenbaum, Dept. of Computer Science, UC Irvine. ICCD 2007.

L2 Caches and Power
- L2 caches in high-performance processors are large; 2 to 4 MB is common.
- They are typically accessed relatively infrequently, so an L2 cache dissipates most of its power via leakage.
- Historically, much of that leakage was in the SRAM cells, and many architectural techniques have been proposed to remedy it.
- Today there is also significant leakage in the peripheral circuits of an SRAM (cache), in part because the cell design has already been optimized.

The Problem
How can power dissipation be reduced in the peripheral circuits of the L2 cache? We seek an architectural solution with a circuit assist.
Approach:
- Reduce peripheral leakage when the circuits are unused, by applying "sleep transistor" techniques.
- Use architectural techniques to minimize the "wakeup" time, during an L2 miss service, for instance.
- Assume that the SRAM cell design is already optimized and do not attempt to save additional power in the cells.

Miss Rates and Load Frequencies
- SPEC2K benchmarks, 128KB L1 cache.
- 5% average L1 miss rate; loads are 25% of instructions.
- In many benchmarks the L2 is mostly idle.
- In some, the L1 miss rate is high, causing much waiting for data. Are the L2 and CPU idle then?
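The two figures on this slide already bound how rarely the L2 is touched. As a back-of-the-envelope check (considering only the load path, which is our own simplifying assumption):

```python
# Rough L2 access rate implied by the slide's figures (load path only).
load_fraction = 0.25  # loads are 25% of instructions
l1_miss_rate = 0.05   # 5% average L1 miss rate

# Fraction of all instructions that go on to access the L2.
l2_access_rate = load_fraction * l1_miss_rate
print(f"{l2_access_rate:.2%} of instructions access the L2")
# prints: 1.25% of instructions access the L2
```

So only about one instruction in eighty reaches the L2, which is why it is mostly idle in many benchmarks.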

SRAM Leakage Sources
- SRAM cell
- Sense amps
- Multiplexers
- Local and global drivers (including the wordline driver)
- Address decoder

Leakage Energy Breakdown in the L2 Cache
- Large, leakier transistors are used in the peripheral circuits.
- High-Vth, less leaky transistors are used in the memory cells.

Circuit Techniques for Leakage Reduction
- Gated-Vdd, Gated-Vss
- Voltage scaling (DVFS)
- ABB-MTCMOS
- Forward Body Biasing (FBB), RBB
These typically target the cache SRAM cell design, but are also applicable to peripheral circuits.

Architectural Techniques
- Way prediction, way caching, phased access: predict or cache recently accessed ways, or read the tag first.
- Drowsy cache: keeps cache lines in a low-power state, with data retention.
- Cache decay: evict lines not used for a while, then power them down.
- Applying DVS, Gated-Vdd, or Gated-Vss to the memory cell, with much architectural support proposed for doing so.
All of these target the cache SRAM memory cell.

What Else Can Be Done?
Architectural motivation: a load miss in the L2 cache takes a long time to service and prevents dependent instructions from being issued.

When Dependent Instructions Cannot Issue
- After a number of cycles the instruction window is full: ROB, instruction queue, store queue.
- Processor issue stalls and performance is lost.
- At the same time, energy is lost as well!
- This is an opportunity to save energy.

IPC During an L2 Miss
- Measured cumulatively over the L2 miss service time for a program.
- It decreases significantly compared to the program average.

A New Technique: Idle-time Management (IM)
- Assert an L2 sleep signal (SLP) after an L2 cache miss. This puts the L2 peripheral circuits into a low-power state; the L2 cannot be accessed while in this state.
- De-assert SLP when the cache miss completes.
- The same approach can also be applied to the CPU, using SLP to trigger DVFS, for instance. But the L2 idle time is only 200 to 300 clocks, and DVFS currently takes longer than that.
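A minimal sketch of IM's timing, as an illustrative model of our own rather than code from the paper: the cycles with SLP asserted follow from the miss start times and a fixed miss latency, with overlapping miss windows merged.

```python
def im_sleep_cycles(miss_starts, miss_latency=300):
    """Total cycles with the L2 sleep signal (SLP) asserted.

    miss_starts: cycles at which L2 misses begin. SLP is asserted from
    each miss's start until its completion; overlapping miss windows
    are merged. The default miss_latency matches the slide's estimate
    of a 200-300 clock L2 idle time.
    """
    asleep = 0
    slp_until = 0  # cycle up to which SLP is already asserted
    for start in sorted(miss_starts):
        begin = max(start, slp_until)
        end = start + miss_latency
        if end > begin:
            asleep += end - begin
            slp_until = end
    return asleep
```

For two non-overlapping misses at cycles 0 and 500, SLP is asserted for 600 cycles; if the second miss instead starts at cycle 100, the merged window is only 400 cycles.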

A Problem
- Disabling the L2 as soon as the miss is detected prevents the issue of independent instructions, in particular of loads that may hit or miss in the L2.
- This may impact performance significantly: up to a 50% performance loss.
(Figure: per-benchmark performance loss, in percent, across the SPEC2K benchmarks from ammp to wupwise, plus the average.)

What Are Independent Instructions?
- Independent instructions do not depend on a load miss, or on any other miss occurring during the L2 miss service.
- Independent instructions can execute during miss service.

Two Idle Mode Algorithms
Static algorithm (SA):
- Put the L2 in stand-by mode N cycles after a cache miss occurs; enable it again M cycles before the miss is expected to complete.
- Independent instructions execute during the L2 miss service; the L2 can be accessed during the N+M cycles.
- L1 misses are buffered in an L2 buffer during stand-by.
Adaptive algorithm (AA):
- Monitor the issue logic and functional units of the processor after an L2 miss.
- Put the L2 into stand-by mode if no instructions have been issued AND the functional units have not executed any instructions in K cycles.
- The algorithm attempts to detect that there are no more instructions that may access the L2.
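The two policies can be sketched as follows. This is an illustrative model under our own assumptions; the function and parameter names are ours, and the slide does not fix the values of N, M, or K.

```python
def sa_sleep_window(miss_latency, n_after, m_before):
    """Static algorithm (SA): cycles of stand-by for one L2 miss.

    Stand-by begins n_after cycles into the miss and ends m_before
    cycles before the expected completion, leaving the L2 accessible
    for the n_after + m_before cycles at the edges of the miss.
    """
    return max(0, miss_latency - n_after - m_before)


def aa_should_sleep(issued_per_cycle, executed_per_cycle, k):
    """Adaptive algorithm (AA): enter stand-by only when the issue
    logic and the functional units have both been idle for the last
    k cycles. Inputs are per-cycle activity counts, most recent last.
    """
    if len(issued_per_cycle) < k or len(executed_per_cycle) < k:
        return False
    return (all(i == 0 for i in issued_per_cycle[-k:])
            and all(e == 0 for e in executed_per_cycle[-k:]))
```

For example, with a 300-cycle miss, N = 20 and N-side wake-up margin M = 30 leave a 250-cycle stand-by window per miss.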

A Second Leakage Reduction Technique: Stand-By Mode (SM)
- Sometimes the L2 is not accessed much and is mostly idle; in this case it is best to use Stand-By Mode (SM).
- Start the L2 cache in the stand-by, low-power mode.
- "Wake it up" on an L1 cache miss and service the miss; return the L2 to stand-by mode right after the L2 access.
- However, this is likely to cause performance loss: L1 misses are often clustered, and there is a wake-up delay.
- A better solution: keep the L2 awake for J cycles after it was turned on. This increases energy consumption but improves performance.
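The J-cycle keep-awake policy can be sketched like this (again an illustrative model of our own; it ignores the wake-up delay itself):

```python
def sm_awake_cycles(access_cycles, j):
    """Stand-By Mode (SM): total cycles the L2 stays awake.

    The L2 wakes on each L1 miss (access_cycles) and remains awake for
    j cycles afterwards; accesses arriving inside an open window
    extend it rather than paying a fresh wake-up.
    """
    awake = 0
    awake_until = None  # cycle up to which the L2 is currently awake
    for c in sorted(access_cycles):
        if awake_until is None or c > awake_until:
            awake += j                      # fresh wake-up window
        else:
            awake += (c + j) - awake_until  # extend the open window
        awake_until = c + j
    return awake
```

Two clustered misses at cycles 0 and 5 with j = 10 keep the L2 awake for a single 15-cycle window; the same two misses 20 cycles apart cost two full 10-cycle windows. This is the clustering effect the slide uses to justify keeping the L2 awake for J cycles.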

Hardware Support
- Add appropriately sized sleep transistors to the global drivers.
- Add a delayed-access buffer to the L2, which allows L1 misses to be issued and stored in this buffer at the L2 while it is in stand-by.

System Description
L1 I-cache: 128KB, 64 byte/line, 2 cycles
L1 D-cache: 128KB, 64 byte/line, 2 cycles, 2 R/W ports
L2 cache: 4MB, 8-way, 64 byte/line, 20 cycles
Issue: 4-way out of order
Branch predictor: 64KB-entry g-share, 4K-entry BTB
Reorder buffer: 96 entries
Instruction queue: 64 entries (32 INT and 32 FP)
Register file: 128 integer and 128 floating point
Load/store queue: 32-entry load and 32-entry store
Arithmetic units: 4 integer, 4 floating point
Complex units: 2 INT, 2 FP multiply/divide
Pipeline: 15 cycles (some stages are multi-cycle)

Performance Evaluation
- Fraction of the total execution time the L2 cache was active under IM and SM.
- IPC loss due to the L2 not being accessible under IM and SM.

Power-Performance Trade-Off
- IM: 18 to 22% leakage power reduction with a 1% performance loss.
- SM: 25% leakage power reduction with a 2% performance loss.

Conclusions
- Studied the breakdown of leakage among L2 cache components, showing that the peripheral circuits leak considerably, while prior architectural techniques address leakage only in the memory cells.
- Presented an architectural study of what happens after an L2 cache miss occurs.
- Presented two architectural techniques to reduce leakage in the L2 peripheral circuits: IM and SM.
- IM achieves an 18 to 22% average leakage power reduction with a 1% average IPC reduction; SM achieves a 25% average savings with a 2% average IPC reduction.
- The two techniques benefit different benchmarks, which indicates the possibility of adaptively selecting the best technique. This is the subject of our ongoing research.