L1 Data Cache Decomposition for Energy Efficiency
Michael Huang, Joe Renau, Seung-Moon Yoo, Josep Torrellas
University of Illinois at Urbana-Champaign



International Symposium on Low Power Electronics and Design, August

Objective
 Reduce L1 data cache energy consumption
 No performance degradation
 Partition the cache in multiple ways
 Specialization for stack accesses

Outline
 L1 D-Cache decomposition
 Specialized Stack Cache
 Pseudo Set-Associative Cache
 Simulation Environment
 Evaluation
 Conclusions

L1 D-Cache Decomposition
 A Specialized Stack Cache (SSC)
 A Pseudo Set-Associative Cache (PSAC)

Selection
 Selection is done in the decode stage to hide its latency
– Based on the instruction address and opcode
 2-Kbit table to predict the PSAC way
[Figure: the opcode and address steer each access to the PSAC or the SSC]
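The 2-Kbit way-prediction table can be thought of as a small array indexed by instruction address. The sketch below is a minimal illustration, not the paper's implementation; the table geometry (1024 entries of 2 bits, one predicted way each) and the indexing function are assumptions chosen to total 2 Kbit for a 4-way PSAC.

```python
# Hypothetical PC-indexed way predictor for a 4-way PSAC.
# 1024 entries x 2 bits = 2 Kbit; each entry names the way the
# load at that instruction address last hit in.
TABLE_ENTRIES = 1024

class WayPredictor:
    def __init__(self):
        self.table = [0] * TABLE_ENTRIES  # each entry: predicted way 0-3

    def index(self, pc):
        # Drop the 2 alignment bits of the instruction address,
        # keep the next 10 bits as the table index.
        return (pc >> 2) % TABLE_ENTRIES

    def predict(self, pc):
        return self.table[self.index(pc)]

    def update(self, pc, actual_way):
        # Train the entry with the way that actually hit.
        self.table[self.index(pc)] = actual_way

# Usage: a load at 0x4000 hit in way 3; the next time the same
# instruction executes, way 3 is probed first.
wp = WayPredictor()
wp.update(0x4000, 3)
assert wp.predict(0x4000) == 3
```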

Stack Cache
 Small, direct-mapped cache
 Virtually tagged
 Software optimizations:
– Very important to reduce stack cache size
– Avoid thrashing: allocate large structs on the heap
– Easy to implement

SSC: Specialized Stack Cache
Pointers to reduce traffic:
 TOS (top-of-stack): reduces the number of write-backs
 SRB (safe-region-bottom): reduces unnecessary line fills on write misses
– The region between TOS and SRB is “safe” (missing lines are uninitialized)
[Figure: as the stack grows and shrinks, TOS moves while SRB marks the deepest point reached; accesses beyond TOS are infrequent]
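The TOS/SRB filtering can be sketched as two predicates. This is an illustrative model under assumptions not stated on the slide: a downward-growing stack in which live data occupies [TOS, stack base), SRB is the deepest address the stack has reached, and [SRB, TOS) therefore holds only dead, popped data.

```python
# Sketch (assumed downward-growing stack): a cache line lying wholly
# inside the dead region [SRB, TOS) holds no live data, so it needs
# neither a line fill on a write miss nor a write-back on eviction.
def in_dead_region(line_addr, line_size, tos, srb):
    return srb <= line_addr and line_addr + line_size <= tos

def needs_line_fill(line_addr, line_size, tos, srb):
    # A write miss in the dead region can allocate the line without
    # fetching its (uninitialized) contents from L2.
    return not in_dead_region(line_addr, line_size, tos, srb)

def needs_write_back(line_addr, line_size, tos, srb, dirty):
    # A dirty line holding only dead stack data can simply be dropped.
    if not dirty:
        return False
    return not in_dead_region(line_addr, line_size, tos, srb)

# Example: stack top at 0x8000, deepest point reached 0x7000.
assert needs_line_fill(0x7100, 32, tos=0x8000, srb=0x7000) is False
assert needs_line_fill(0x8100, 32, tos=0x8000, srb=0x7000) is True
```

The design point this captures is that both filters are cheap range comparisons against two registers, updated only at function call and return.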

Pseudo Set-Associative Cache
 Partition the cache into 4 ways
 Evaluated activation policies: Sequential, FallBackReg, Phased Cache, FallBackPha, PredictPha
[Figure: each way has its own tag and data arrays]

Sequential (Calder ’96)
[Figure: three-cycle probe sequence: cycle 1, cycle 2, cycle 3]

Fallback-regular (Inoue ’99)
[Figure: two-cycle probe sequence: cycle 1, cycle 2]

Phased Cache (Hasegawa ’95)
[Figure: two-cycle probe sequence: cycle 1, cycle 2]

Fallback-phased (ours)
 Emphasis on energy reduction
[Figure: three-cycle probe sequence: cycle 1, cycle 2, cycle 3]

Predictive Phased (ours)
 Emphasis on performance
[Figure: two-cycle probe sequence: cycle 1, cycle 2]
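The trade-off among these activation policies can be made concrete with a rough cost model. The probe orders below are assumptions based on the common descriptions of these schemes, not the paper's exact figures, and the model only counts cycles and array activations.

```python
# Illustrative cost model for way-activation policies on a 4-way PSAC.
# Each function returns (cycles, tag_arrays_read, data_arrays_read)
# for an access that hits in hit_way (0-based).
WAYS = 4

def sequential(hit_way):
    # Probe one way (tag + data together) per cycle until the hit.
    n = hit_way + 1
    return (n, n, n)

def phased(hit_way):
    # Cycle 1: read all tags in parallel.
    # Cycle 2: read only the one matching data array.
    return (2, WAYS, 1)

def fallback_regular(hit_way, predicted_way):
    # Cycle 1: probe the predicted way; on a mispredict,
    # cycle 2 probes the remaining ways in parallel.
    if predicted_way == hit_way:
        return (1, 1, 1)
    return (2, WAYS, WAYS)

# A correct prediction makes the access as cheap as a direct-mapped probe:
assert fallback_regular(2, 2) == (1, 1, 1)
# Phased trades a fixed extra cycle for reading only one data array:
assert phased(2) == (2, 4, 1)
# Sequential pays a cycle and a full probe per way tried:
assert sequential(3) == (4, 4, 4)
```

This makes the slide's framing visible: fallback-phased combines the predicted-way fast path with a phased fallback to minimize energy, while predictive-phased overlaps prediction with the tag phase to minimize delay.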

Simulation Environment
Baseline configuration:
 Processor: 1 GHz, R10000-like
 L1: 32 KB, 2-way
 L2: 512 KB, 8-way, phased cache
 Memory: 1 Rambus channel
 Energy model: extended CACTI
 Energy is for the data memory hierarchy only

Applications
 Multimedia: Mp3dec (MP3 decoder), Mp3enc (MP3 encoder)
 SPECint: Gzip (data compression), Crafty (chess game), MCF (traffic model)
 Scientific: Bsom (data mining), Blast (protein matching), Treeadd (Olden tree search)

Adding a Stack Cache
For the same size, the Specialized Stack Cache is always better

Pseudo Set-Associative Cache
PredictPha has the best delay and energy-delay product

PSAC: 2-way vs. 4-way
For E*D, the 4-way PSAC is better than the 2-way

Pseudo Set-Associative + Specialized Stack Cache
Combining PSAC and SSC reduces E*D by 44% on average
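For readers less used to the metric, an energy-delay (E*D) reduction is computed as shown below. The input numbers are made up for illustration only; they are not the paper's measurements.

```python
# How an E*D reduction is computed: relative energy and delay of the
# new design are multiplied and compared against the baseline product.
def ed_reduction(e_base, d_base, e_new, d_new):
    """Fractional reduction in the energy-delay product."""
    return 1.0 - (e_new * d_new) / (e_base * d_base)

# Illustrative numbers: ~45% less energy at roughly unchanged delay
# yields an E*D reduction in the mid-40% range.
r = ed_reduction(e_base=1.0, d_base=1.0, e_new=0.55, d_new=1.02)
assert 0.40 < r < 0.48
```

The metric rewards designs that save energy without paying for it in execution time, which is why the no-performance-degradation goal on the first slide matters.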

Area Constrained: Small PSAC + SSC
An SSC plus a small PSAC delivers a cost-effective E*D design

Energy Breakdown

Conclusions
 Stack cache: important for energy efficiency
 SW optimization required for stack caches
 Effective Specialized Stack Cache extensions
 Pseudo Set-Associative Cache:
– 4-way more effective than 2-way
– Predictive Phased PSAC has the lowest E*D
 Effective to combine PSAC and SSC
– E*D reduced by 44% on average

Backup Slides

Cache Energy

Extended CACTI
 New sense amplifier
– 15% bit-line swing for reads
 Full bit-line swing for writes
 Different energy for reads, writes, line fills, and write-backs
 Multiple optimization parameters
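The "different energy per operation" idea amounts to a per-operation energy table that a simulator multiplies by event counts. The sketch below illustrates that accounting structure only; the coefficient values are placeholders, not extended-CACTI output.

```python
# Per-operation energy accounting in the spirit of the extended model.
# Coefficients below are placeholder values (nJ) for illustration:
# reads use reduced (~15%) bit-line swing, writes full swing, and
# line fills / write-backs move a whole line.
ENERGY_NJ = {
    "read": 0.15,
    "write": 0.40,
    "line_fill": 0.90,
    "write_back": 0.85,
}

def total_energy(counts):
    """Sum energy over an event-count mapping, e.g. {'read': 100}."""
    return sum(ENERGY_NJ[op] * n for op, n in counts.items())

# Usage: 100 reads and 10 writes under the placeholder coefficients.
e = total_energy({"read": 100, "write": 10})
assert abs(e - 19.0) < 1e-9
```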

SSC Energy Overhead
 Small energy cost to maintain TOS and SRB
 Registers updated at function call and return
 Registers checked on cache miss

Miss Rate

Overview