FreshCache: Statically and Dynamically Exploiting Dataless Ways Arkaprava Basu, Derek R. Hower, Mark D. Hill, Mike M. Swift.


Last-Level Caches: Area and Energy Hungry
The LLC contributes up to 37% of on-chip power [Sen et al., 2013, UW-TR 1791]. [Figure: Intel Ivy Bridge die photo]

Inefficiencies in LLC
An inclusive LLC wastes energy and area – transistors are devoted to holding stale data. [Figure: LLC + directory above the private caches (L1/L2) of cores C1 and C2; block A is cached with exclusive permission in C1's private cache, which holds A:y, so the LLC's copy A:x is stale.]

Inefficiencies in LLC
The amount of stale data varies across workloads. [Chart: fraction of stale blocks in the LLC per workload, up to 0.7; private-cache : LLC capacity ratio ~ 1:4]

Idea: FreshCache
Static: omit the data portion of a fixed number of ways → reduce area and energy overhead
Dynamic: disable additional data ways at runtime → reduce more energy when possible

Roadmap Motivation and key idea FreshCache: Static + Dynamic Dataless Ways Design and Mechanisms Evaluation Summary

Static Dataless Ways (SDWs) [Figure: set-associative LLC; each way in each set holds a TAG + metadata portion and a data portion]

Static Dataless Ways (SDWs)
The number of dataless ways is fixed at design time; a static dataless way stores only tags and metadata.
✔ Saves both area and static power* ✗ Cannot adapt to workloads
*If blocks with stale data are kept in the SDWs
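As a rough illustration (a toy model, not the FreshCache hardware), an LLC set with dataless ways can be sketched as tag-only ways: a tag hit in a dataless way yields no data, so the block must be sourced from the owning private cache. All class and field names below are hypothetical.

```python
# Toy model of an LLC set with static dataless ways (SDWs).

class Way:
    def __init__(self, has_data):
        self.has_data = has_data   # False for a dataless (tag-only) way
        self.tag = None
        self.state = None          # coherence state, e.g., "Exclusive"
        self.data = None           # never populated in a dataless way

class CacheSet:
    def __init__(self, num_ways=16, num_sdws=2):
        # The last num_sdws ways omit their data array at design time,
        # saving the area and static power of those data transistors.
        self.ways = [Way(has_data=(i < num_ways - num_sdws))
                     for i in range(num_ways)]

    def lookup(self, tag):
        for way in self.ways:
            if way.tag == tag:
                if way.has_data:
                    return way.data                   # ordinary LLC hit
                return "fetch-from-private-cache"     # tag hit, data elsewhere
        return "miss"
```

A lookup that hits a dataless way still resolves coherence (the tag and state are present); only the data payload must come from the private cache that owns the block.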

Dynamic Dataless Ways (DDWs)
The number of dataless ways is adjusted at runtime. [Figure: for workload A, the data portions of some ways are turned off.]

Dynamic Dataless Ways (DDWs)
The number of dataless ways is adjusted at runtime. Cache utilization is lower for workload B.

Dynamic Dataless Ways (DDWs)
The number of dataless ways is adjusted at runtime. [Figure: for workload B, more data ways are turned off.]
✔ Opportunistically saves more energy ✗ No area savings

FreshCache Goals: Best of Both Worlds
Static: save area and energy by omitting transistors at design time
Dynamic: save more energy by turning off transistors when possible
How to trade off performance? Bound it by a Maximum Performance Degradation (MPD), e.g., MPD = 1% or 3%, and minimize energy subject to the MPD.

FreshCache: Static + Dynamic Dataless Ways [Figure: workloads A/B served by static dataless ways plus dynamic dataless ways]

FreshCache: Challenges
1. Put blocks with stale data in dataless ways
2. Determine the number of DDWs at runtime

Roadmap Motivation FreshCache: Static + Dynamic Dataless Ways Mechanisms – LLC Controller → manage dataless ways (1) – DDW Controller → determine the number of DDWs (2) Evaluation Summary

Dataless-Way-Aware LLC Controller (1: keep blocks with stale data in dataless ways)
The coherence state decides whether a cache block is put in a dataless way: a block filled from memory or another socket in Exclusive state goes to an SDW or DDW, since its LLC data will become stale.

Dataless-Way-Aware LLC Controller (1: keep blocks with stale data in dataless ways)
A block filled in Shared state, by contrast, is placed in a conventional data way, since its LLC copy stays fresh.

Dataless-Way-Aware LLC Controller (1: keep blocks with stale data in dataless ways)
A writeback from a private cache carries fresh data, so it may move the block from a dataless way to a conventional way (intra-set block movement).
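The placement policy the slides describe can be sketched as two decisions (function names and string labels here are illustrative, not the controller's actual interface):

```python
# Sketch of a dataless-way-aware LLC placement policy.

def place_on_fill(coherence_state):
    # A block fetched with exclusive permission will be modified in a
    # private cache, making the LLC copy stale: keep only its tag.
    if coherence_state == "Exclusive":
        return "dataless-way"        # SDW or DDW
    return "data-way"                # shared data stays fresh in the LLC

def handle_writeback(current_way):
    # A writeback delivers fresh data; a block resident in a dataless
    # way must migrate to a conventional way within the same set.
    if current_way == "dataless-way":
        return "move-to-data-way"    # intra-set block movement
    return "update-in-place"
```

The key invariant is that dataless ways only ever hold blocks whose up-to-date data lives in a private cache; any event that makes the LLC copy authoritative again (here, a writeback) forces the block into a data way.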

DDW Controller (2: determine the number of DDWs at runtime)
Software specifies the performance vs. energy-savings tradeoff: the MPD value is written to a register, and energy savings are subject to that MPD. [Figure: the DDW controller combines an LLC-miss estimator, fed by an auxiliary tag array (~0.3% overhead, Qureshi'06) and hit counters, with the average memory latency to aggregate estimated LLC misses and bound energy savings by the MPD.]
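The controller's decision can be sketched as a periodic search: use the auxiliary tag array's per-way hit counters to estimate the extra misses from turning off k data ways, convert that to a slowdown estimate, and pick the largest k whose estimated degradation stays within the MPD. The names and the simple linear latency model below are assumptions for illustration, not the paper's exact hardware.

```python
def choose_num_ddws(hits_per_way, baseline_cycles, avg_mem_latency, mpd,
                    max_ddws=14):
    """Pick the largest number of dynamic dataless ways whose estimated
    slowdown stays within the MPD bound.

    hits_per_way[i]: hits that landed in the i-th least-recently-used
    data way over the last interval (from the auxiliary tag array);
    disabling the k LRU-most data ways turns those hits into misses.
    """
    best = 0
    extra_misses = 0
    for k in range(1, max_ddws + 1):
        extra_misses += hits_per_way[k - 1]
        # Linear model: each extra LLC miss costs one memory round trip.
        est_slowdown = extra_misses * avg_mem_latency / baseline_cycles
        if est_slowdown <= mpd:
            best = k
    return best
```

Because the estimated slowdown only grows as more ways are disabled, the search is monotone: the controller can stop raising k as soon as the MPD bound is exceeded.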

Roadmap Motivation FreshCache: Static + Dynamic Dataless Ways Mechanisms Evaluation Summary

Methodology gem5 full-system simulation 8 in-order cores, 3-level cache hierarchy PARSEC and commercial workloads CACTI 6.5 to evaluate area and energy savings Evaluation: – Efficacy of FreshCache in saving energy – Area savings due to FreshCache

Energy Savings: MPD = 1% [Chart: relative energy (LLC + DRAM access) savings in percent, with 2 SDWs (out of 16 ways) + a variable number of DDWs] Avg. 28% energy savings with worst-case performance degradation < 1%.

Energy Savings: MPD = 3% [Chart: relative energy (LLC + DRAM access) savings in percent, with 2 SDWs (out of 16 ways) + a variable number of DDWs; 28% at MPD = 1%, 41% at MPD = 3%] Avg. 41% energy savings with worst-case performance degradation < 3%.

Area Savings With 2 SDWs (out of 16 ways), 8.23% of the LLC area is saved.

Summary LLCs can be energy and area hungry Inclusive LLCs hold substantial stale data FreshCache: – Static Dataless Ways to save area and power – Dynamic Dataless Ways to save further power 28% energy and 8.23% LLC area savings – Worst-case performance degradation < 1%