Virtual Exclusion: An Architectural Approach to Reducing Leakage Energy in Multiprocessor Systems
Mrinmoy Ghosh   Hsien-Hsin S. Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA

2 Multi-Level Inclusion in Caches
Definition of MLI: a cache line present in the lower-level cache must also be present in the higher-level cache
Use of MLI:
–Facilitates efficient cache coherence implementation
–Shields lower-level caches from snoop requests
Implementing MLI:
–An inclusion ("I") bit in the cache tags
–The higher-level cache gets information about clean evictions
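A minimal sketch of how an inclusion ("I") bit can support this (illustrative Python, not the paper's implementation; class and method names are assumptions): the bit is set when a line is filled into the L1 and cleared when the L1 reports a clean eviction, so a snoop only needs to be forwarded upward when the bit says a copy may exist.

    # Minimal sketch: an L2 that tracks multi-level inclusion with one "I"
    # (inclusion) bit per line. The bit is set on an L1 fill and cleared when
    # the L1 reports a clean eviction, so snoops reach the L1 only when a copy
    # may actually be there.

    class L2Line:
        def __init__(self, tag):
            self.tag = tag
            self.inclusion = False       # "I" bit: line may also be present in L1

    class InclusiveL2:
        def __init__(self):
            self.lines = {}              # tag -> L2Line (associativity omitted)

        def fill_from_memory(self, tag):
            self.lines[tag] = L2Line(tag)

        def on_l1_fill(self, tag):
            self.lines[tag].inclusion = True    # L1 now holds a copy

        def on_l1_clean_eviction(self, tag):
            self.lines[tag].inclusion = False   # L1 told us it dropped the line

        def snoop(self, tag):
            line = self.lines.get(tag)
            if line is None:
                return "miss"                   # inclusion: cannot be in L1 either
            if line.inclusion:
                return "forward to L1"          # L1 may hold a (possibly dirty) copy
            return "serviced by L2"             # L1 is shielded from the snoop

    if __name__ == "__main__":
        l2 = InclusiveL2()
        l2.fill_from_memory(0x40)
        l2.on_l1_fill(0x40)
        print(l2.snoop(0x40))            # forward to L1
        l2.on_l1_clean_eviction(0x40)
        print(l2.snoop(0x40))            # serviced by L2
        print(l2.snoop(0x80))            # miss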

3 IBM Power4 Cache Hierarchy
1.5MB L2 shared by 2 cores, with a 32MB L3
Inclusion maintained between L1 and L2
Inclusion indication can be false
[Diagram: per-core L1 tag/L1$ arrays; inclusion bits in the L2 cache filter snoops arriving from the bus; backed by the Level 3 cache]

4 Another Approach: Piranha CMP (Compaq)
8 cores, each with 64KB I$ + 64KB D$; 1MB shared L2
Aggregate L1 capacity = 1MB = L2 capacity
No inclusion maintained
[Diagram: the L2 controller keeps a duplicate of each L1 tag and state for snooping; per-core L1$, shared L2, bus]

5 Power Implication in MLI Caches
The same active information is kept in both caches
With locality, the L2 is rarely accessed
Caches → larger → deeper; Moore's law → more transistors for insurance?
[Diagram: several L1 tag/L1$ pairs all backed by one large L2 cache]

6 Prior Architectural Art in Saving Cache Leakage
Cache Decay [ISCA-28]: gated Vdd; could lead to more power
Drowsy Cache [ISCA-29][MICRO-35]: could impact access latency
[Circuit sketches: a cell with gated-Vdd control on its bitline/wordline vs. a drowsy cell switched between Vdd (1V) and a low Vdd (0.3V)]
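The trade-off between the two techniques can be summarized in a short, purely illustrative sketch (all constants are placeholders, not figures from the paper): gating Vdd eliminates a line's leakage but loses its contents, so a mistaken shut-off becomes a refetch that can cost more energy overall, while drowsy mode retains the contents at a low voltage at the price of a wake-up delay.

    # Illustrative sketch only: the qualitative trade-off between the two prior
    # techniques. Gated Vdd (Cache Decay) removes a line's leakage but loses the
    # data, so a wrong shut-off turns into a costly refetch; drowsy mode keeps
    # the data at a retention voltage but adds wake-up cycles. Numbers are
    # made-up placeholders, not results from the paper.

    def touch_cost(state, leakage_saved):
        """Return (net_energy, extra_cycles) of accessing a line left in `state`."""
        if state == "active":
            return (0.0, 0)
        if state == "drowsy":
            return (-leakage_saved * 0.7, 1)    # data kept; ~1-cycle wake-up assumed
        if state == "gated":
            return (-leakage_saved + 5.0, 20)   # data lost; refetch energy/latency assumed
        raise ValueError(state)

    if __name__ == "__main__":
        for s in ("active", "drowsy", "gated"):
            print(s, touch_cost(s, leakage_saved=2.0))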

7 Virtual Exclusion

8 Virtual Exclusion: L1 Cache Line Fill
[Diagram: core with its L1 cache (tag + V/D/I bits) above a 2-way L2 cache (tag RAM + data array) with gated-Vdd control, all on a shared bus; line 0xffab is filled into the L1]
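A minimal sketch of the fill step in illustrative Python (class and field names are assumptions, not the paper's implementation): the L2 keeps the line's tag valid, so inclusion still holds logically, while the line's copy in the L2 data array is gated off because the data now lives in the L1.

    # Sketch of the L1 line-fill step: the L2 tag stays valid ("virtual"
    # inclusion), but the duplicate copy in the L2 data array is gated off
    # since the fresh data is now held by the L1.

    ON, GATED, DROWSY = "on", "gated", "drowsy"

    class VEL2Line:
        def __init__(self, tag, data):
            self.tag = tag
            self.data = data
            self.tag_valid = True
            self.data_power = ON      # power state of this line's slice of the data array

    class VirtualExclusionL2:
        def __init__(self):
            self.lines = {}

        def fill_from_memory(self, tag, data):
            self.lines[tag] = VEL2Line(tag, data)

        def on_l1_fill(self, tag):
            line = self.lines[tag]
            line.data_power = GATED   # gate Vdd: the duplicate copy stops leaking
            line.data = None          # its contents are no longer retained
            return "data supplied to L1; L2 keeps the tag only"

    if __name__ == "__main__":
        l2 = VirtualExclusionL2()
        l2.fill_from_memory(0xffab, data=b"\x00" * 64)
        print(l2.on_l1_fill(0xffab))
        print(l2.lines[0xffab].tag_valid, l2.lines[0xffab].data_power)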

9 Virtual Exclusion: L1 Eviction
[Diagram: the evicted L1 line 0xffddeeaa is placed back into the L2 data array; its Drowsy bit is set to 1 and the cell is held at Vdd_low]
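A matching sketch of the eviction step (same illustrative model as above): since the L2's copy was gated off at fill time, the line evicted from the L1 is written back into the L2 data array, which then retains it in drowsy (low-Vdd) mode.

    # Sketch of the L1 eviction step: the only copy of the data moves back into
    # the L2 data array, which is then held in drowsy (low-Vdd, state-retaining)
    # mode rather than at full Vdd.

    ON, GATED, DROWSY = "on", "gated", "drowsy"

    class VEL2Line:
        def __init__(self, tag):
            self.tag = tag
            self.tag_valid = True
            self.data = None
            self.data_power = GATED   # data lives only in the L1 after the fill

    def on_l1_eviction(l2_lines, tag, evicted_data):
        line = l2_lines[tag]
        line.data = evicted_data      # restore the only copy into the L2 data array
        line.data_power = DROWSY      # retain it at low Vdd until it is used again
        return line

    if __name__ == "__main__":
        lines = {0xffddeeaa: VEL2Line(0xffddeeaa)}
        line = on_l1_eviction(lines, 0xffddeeaa, evicted_data=b"\xaa" * 64)
        print(line.data_power, line.data[:4])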

10 Protocol Change - Snoop Forwarding
[Diagram: a snoop request arrives from the shared bus at the 2-way L2; the L2 forwards the snoop up to the L1 cache]
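A sketch of the snoop-handling change (illustrative names; a simplification of the real protocol): a snoop that hits in the L2 tag but whose data copy is gated off cannot be answered from the L2 data array, so it is forwarded up to the L1, which holds the only live copy.

    # Sketch of snoop forwarding: tags in the L2 are still inclusive, so a tag
    # miss means the L1 cannot have the line; a tag hit with gated data must be
    # forwarded to the L1.

    ON, GATED, DROWSY = "on", "gated", "drowsy"

    def handle_snoop(l2_lines, l1_data, tag):
        line = l2_lines.get(tag)
        if line is None or not line.get("tag_valid", False):
            return ("miss", None)                         # inclusive tags shield the L1
        if line["data_power"] == GATED:
            return ("forwarded to L1", l1_data.get(tag))  # only copy is in the L1
        return ("served by L2", line["data"])             # drowsy/active copy is usable

    if __name__ == "__main__":
        l2 = {0x40: {"tag_valid": True, "data_power": GATED, "data": None}}
        l1 = {0x40: b"\xbe\xef"}
        print(handle_snoop(l2, l1, 0x40))   # ('forwarded to L1', b'\xbe\xef')
        print(handle_snoop(l2, l1, 0x80))   # ('miss', None)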

11 Protocol Change - Write Invalidation
[Diagram: an L1 cache write notification reaches the L2 and an invalidation request appears on the shared bus]
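One plausible reading of this slide, sketched below with heavy caveats (an assumption about the mechanism, not a statement of the paper's exact protocol): because the L2 no longer holds the data, an L1 write is reported to the L2, which records that the up-to-date copy is dirty in the L1 and, if the line was shared, places an invalidation request on the bus.

    # Hedged sketch of write invalidation under Virtual Exclusion: the L1
    # notifies the L2 of a write so the L2 tag state stays accurate; for a
    # shared line, an invalidation request is issued on the shared bus.

    def on_l1_write_notification(l2_line, bus):
        l2_line["dirty_in_l1"] = True                        # up-to-date copy lives in L1
        if l2_line["coherence_state"] == "Shared":
            bus.append(("invalidate", l2_line["tag"]))       # write-invalidate other copies
            l2_line["coherence_state"] = "Modified"
        return l2_line

    if __name__ == "__main__":
        bus_msgs = []
        line = {"tag": 0x40, "coherence_state": "Shared", "dirty_in_l1": False}
        on_l1_write_notification(line, bus_msgs)
        print(line, bus_msgs)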

12 Modified Cache Decay

13 Modified Cache Decay for MLI: L2 Line Fill
Decay of the counter continues even if the line is in the L1 cache
[Diagram: line 0xffab is filled from memory into the 2-way L2 (tag + D/C/I bits, decay counter, data array); core with L1 cache above, shared bus and memory below]

14 Modified Cache Decay for MLI: L1 Eviction
Decay of the counter is unaffected by an L1 eviction
[Diagram: a line is evicted from the L1; the corresponding L2 decay counter keeps counting]

15 Modified Cache Decay for MLI: L2 Hit
An access hits line 0xffab in the L2 cache
[Diagram: L2 hit on line 0xffab]
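A sketch of the decay policy described on slides 13-15, assuming (as in the original Cache Decay scheme) that the per-line counter is reset by an L2 hit and that the line is turned off once the counter saturates; the modification shown here is that L1 fills and L1 evictions leave the counter alone, so a line parked in the L1 keeps aging in the L2. The threshold is an arbitrary placeholder.

    # Sketch of Modified Cache Decay for MLI: the per-line counter advances on
    # every decay tick, is reset by an L2 hit, and is deliberately NOT touched
    # by L1 fills or evictions.

    DECAY_THRESHOLD = 8   # placeholder, in units of the coarse decay tick

    class DecayLine:
        def __init__(self, tag):
            self.tag = tag
            self.counter = 0
            self.powered_off = False

        def tick(self):                   # advance the global decay interval
            if not self.powered_off:
                self.counter += 1
                if self.counter >= DECAY_THRESHOLD:
                    self.powered_off = True     # gated Vdd: line decays away

        def on_l2_hit(self):
            self.counter = 0              # reuse observed at the L2 resets the countdown

        def on_l1_fill(self):
            pass                          # counter continues even if the line is in L1

        def on_l1_eviction(self):
            pass                          # decay is unaffected by L1 evictions

    if __name__ == "__main__":
        line = DecayLine(0xffab)
        line.on_l1_fill()
        for _ in range(DECAY_THRESHOLD):
            line.tick()
        print(line.powered_off)           # True: decayed even though the L1 held the line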

16 Hybrid Virtual Exclusion
Observation:
–Cache Decay starts decaying a line even while the L1 has high locality on it
Hybrid Virtual Exclusion:
–Uses Virtual Exclusion while the L1 has high locality
–Starts decaying only after the L1 eviction (a code sketch follows slide 18)

17 Hybrid Virtual Exclusion: L2 Line Fill
L1 & L2 are virtually exclusive: the gated-Vdd control turns off the L2 data copy of line 0xffab
[Diagram: line 0xffab filled from memory through the L2 into the L1; its L2 data copy is gated off]

18 Hybrid Virtual Exclusion: L1 Eviction
Decay starts only after the line is evicted from the L1
[Diagram: line 0xffab is evicted from the L1 back into the L2; its decay counter starts counting]
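A sketch of the hybrid scheme shown on slides 16-18 (illustrative names and threshold): while a line is resident in the L1, its L2 data copy is simply gated off as in Virtual Exclusion and no decay counting happens; once the line is evicted back into the L2 data array, a decay countdown begins and eventually turns the line off if it is not reused.

    # Sketch of Hybrid Virtual Exclusion: Virtual Exclusion while the line is
    # in the L1, Cache-Decay-style countdown only after the L1 eviction.

    ON, GATED, DROWSY = "on", "gated", "drowsy"
    DECAY_THRESHOLD = 8   # placeholder

    class HybridLine:
        def __init__(self, tag):
            self.tag = tag
            self.in_l1 = False
            self.data_power = ON
            self.decay_counter = 0

        def on_l1_fill(self):
            self.in_l1 = True
            self.data_power = GATED       # virtually exclusive: data lives in the L1 only
            self.decay_counter = 0        # no decay while the L1 holds the line

        def on_l1_eviction(self):
            self.in_l1 = False
            self.data_power = DROWSY      # data written back, retained at low Vdd
            self.decay_counter = 0        # decay starts only now

        def tick(self):
            if not self.in_l1 and self.data_power != GATED:
                self.decay_counter += 1
                if self.decay_counter >= DECAY_THRESHOLD:
                    self.data_power = GATED   # unused after eviction: decay it away

    if __name__ == "__main__":
        line = HybridLine(0xffab)
        line.on_l1_fill()
        for _ in range(20):
            line.tick()
        print(line.data_power, line.decay_counter)   # gated 0: VE while in L1, no decay
        line.on_l1_eviction()
        print(line.data_power, line.decay_counter)   # drowsy 0: written back, decay begins
        for _ in range(DECAY_THRESHOLD):
            line.tick()
        print(line.data_power, line.decay_counter)   # gated 8: decayed after the eviction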

19 Experimental Framework
Single processor model: UltraSPARC T1-like (Niagara)
L1 data/instruction cache: 2-way 16KB, 64-byte lines
L2 caches: 8-way 256KB, 512KB
L1 access: 1 cycle
L2 access (shared for multi-core, private for SMP): 10 cycles (normal), 12 cycles (drowsy)
Memory access: 200 cycles
DRAM: 256MB (conservative base)
Energy baseline: drowsy cache scheme
M5 simulator from Michigan, system-level emulation
Power models integrated into M5:
–eCACTI from UC Irvine (leakage + dynamic)
–Micron DRAM datasheet
Configurations: 2P, 4P, & 8P SMP; dual-, quad-, & oct-core multicore
Benchmark workloads:
–SPLASH-2 (run to completion)
–SPEC 2000

20 Leakage Energy Reduction (2-way SMP)

21 Leakage Energy Reduction (Various SMPs)
Average over the SPLASH-2 benchmarks

22 Leakage Energy Reduction (4-way Multi-Core)

23 Leakage Energy Reduction (Various Multi-Cores)
Configuration / SPEC 2000 benchmark mix:
–2-way multicore: bzip, gzip
–4-way multicore: bzip, gzip, crafty, gap
–8-way multicore: 2x (bzip, gzip, crafty, gap)

24 Conclusions
Prior art can violate the multi-level inclusion that cache coherence protocols rely on
Virtual Exclusion:
–Maintains correctness for multi-level inclusion
–Low-overhead architectural approach
–Enhanced Cache Decay to work correctly with MLI
Significant energy savings over a drowsy-cache baseline:
–Symmetric multiprocessors (46% for 8-way, SPLASH-2)
–Multi-core processors (35% for 4-way, SPLASH-2)

Thank You! Georgia Tech ECE MARS Labs

BACKUP

27 Prior Architectural Art in Saving Cache Leakage
Cache Decay [ISCA-28]
–Uses gated Vdd
–Turns off cache lines that have not been used for a while
–Can lead to more power consumption
–Did not consider cache coherence
Drowsy Cache [ISCA-29][MICRO-35]
–Maintains state in a low-leakage drowsy mode
–Has latency implications