CANDY: Enabling Coherent DRAM Caches for Multi-node Systems


CANDY: Enabling Coherent DRAM Caches for Multi-node Systems MICRO 2016 Taipei, Taiwan Oct 18, 2016 Chiachen Chou, Georgia Tech Aamer Jaleel, NVIDIA Moinuddin K. Qureshi, Georgia Tech

3D-DRAM Helps Mitigate the Bandwidth Wall 3D-DRAM: High Bandwidth Memory (HBM), adopted in products such as AMD Zen, Intel Xeon Phi, and NVIDIA Pascal. Compared to DDR, 3D-DRAM used as a cache (DRAM Cache) transparently provides 4-8X bandwidth. (Images courtesy: Micron, AMD, Intel, NVIDIA)

DRAM Caches for Multi-Node Systems Prior studies focus on single-node systems. We study DRAM caches for multi-node systems, where each node has its own DRAM$ and off-chip DRAM, connected by a long-latency inter-node network.

Memory-Side Cache (MSC) Memory-Side Cache is implicitly coherent and simple to implement: each node's DRAM$ sits in front of its local memory, behind the long-latency interconnect.

Shortcomings of Memory-Side Cache While implicitly coherent and easy to implement, MSC caches only local data (~1GB DRAM$ vs. ~4MB L3 per node), so an L3 cache miss to remote data incurs the long latency of the inter-node interconnect.

Coherent DRAM Caches (CDC) Coherent DRAM Cache stores both local and remote data, saving the L3 miss latency of remote data, but it needs coherence support.

Potential Performance Improvement 4-node system, each node has a 1GB DRAM$. AVG: 1.3X. Ideal-CDC outperforms Memory-Side Cache by 30%.

Agenda The Need for Coherent DRAM Cache Challenge 1: A Large Coherence Directory Challenge 2: A Slow Request-For-Data Operation Summary

Directory-Based Coherence Protocol Coherence Directory (CDir): a sparse directory* tracking cached data in the system. On a cache miss, the home node accesses the CDir information. *Standalone inclusive directory with recalls
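The footnote's "standalone inclusive directory with recalls" can be illustrated with a toy model: the directory tracks a fixed number of lines, and when it is full, an already-tracked line must be recalled (invalidated at all sharers) to make room for the new one. This is a hedged sketch with illustrative geometry, not the paper's exact structure.

```python
class SparseDirectory:
    """Toy inclusive sparse directory with recalls (illustrative, not the
    paper's exact design): tracks which nodes share each line; when full,
    an existing line is recalled to make room for a new one."""

    def __init__(self, num_entries=4):
        self.entries = {}            # line -> set of sharer node ids
        self.capacity = num_entries
        self.recalls = 0

    def on_miss(self, line, node):
        # Inclusive directory: every cached line must have an entry, so a
        # full directory forces a recall (invalidate the line everywhere).
        if line not in self.entries and len(self.entries) == self.capacity:
            victim = next(iter(self.entries))   # oldest tracked line
            del self.entries[victim]            # recall: drop all sharers
            self.recalls += 1
        self.entries.setdefault(line, set()).add(node)
        return self.entries[line]
```

The recall traffic is one reason the directory must be sized proportionally to the cache it covers, which is exactly the problem the next slide quantifies.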

Large Coherence Directory The coherence directory size must be proportional to the cache size: the on-die CDir for an 8MB L3 is only ~1MB, but a 1GB DRAM$ needs a ~64MB CDir. For a giga-scale DRAM cache, the 64MB coherence directory incurs storage and latency overheads.
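The proportionality is easy to check with back-of-envelope arithmetic: one sparse-directory entry per cached line. The 4B-per-entry figure below is an illustrative assumption chosen because it reproduces the slide's 64MB number for a 1GB cache of 64B lines.

```python
# Rough CDir sizing: one sparse-directory entry per cached line.
# line_bytes=64 and entry_bytes=4 are illustrative assumptions.
def cdir_size_bytes(cache_bytes, line_bytes=64, entry_bytes=4):
    num_lines = cache_bytes // line_bytes
    return num_lines * entry_bytes

MB, GB = 1 << 20, 1 << 30
# A 1GB DRAM$ has 16M lines; at 4B per entry that is 64MB of directory.
```

Under these assumptions, `cdir_size_bytes(1 * GB)` yields 64MB, two orders of magnitude larger than the ~1MB on-die CDir that suffices for a multi-megabyte L3.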

Where to Place the Coherence Directory? Options: 1. SRAM-CDir: place the 64MB CDir on die (SRAM). 2. Embedded-CDir: embed the 64MB CDir in 3D-DRAM. Embedding the CDir avoids the SRAM storage overhead, but incurs a long access latency to the CDir on every DRAM$ miss.

DRAM-cache Coherence Buffer (DCB) The DCB caches recently used CDir entries for future reference in the otherwise-unused on-die L3 CDir (1MB). On a DCB hit, the 64MB embedded CDir in 3D-DRAM need not be accessed; the DCB thus mitigates the latency of accessing the embedded CDir.

Design of the DRAM-cache Coherence Buffer One access to the CDir in 3D-DRAM returns a 64B row containing 16 CDir entries; on a DCB miss, all 16 entries are inserted into the 4-way set-associative DCB (sets S, S+1, S+2, S+3). With this co-optimization of the DCB and the embedded CDir, the hit rate of the DCB is 80%.
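The fill policy above can be sketched as a small set-associative buffer that, on a miss, installs the whole 16-entry row fetched from 3D-DRAM so that lookups to neighboring CDir entries hit. Class and method names, the LRU policy, and the geometry are illustrative assumptions, not the paper's exact implementation.

```python
from collections import OrderedDict

class DCB:
    """Toy sketch of the DRAM-cache Coherence Buffer: a small set-associative
    SRAM cache of CDir entries with per-set LRU (illustrative design)."""

    def __init__(self, num_sets=4096, ways=4):
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set(self, addr):
        return self.sets[addr % len(self.sets)]

    def lookup(self, addr):
        s = self._set(addr)
        if addr in s:
            s.move_to_end(addr)         # mark most-recently used
            return s[addr]
        return None                     # DCB miss

    def insert(self, addr, entry):
        s = self._set(addr)
        s[addr] = entry
        s.move_to_end(addr)
        if len(s) > self.ways:
            s.popitem(last=False)       # evict LRU entry of the set

    def miss_fill(self, addr, row_entries):
        """On a DCB miss, one 3D-DRAM access returns 16 adjacent CDir
        entries (one 64B row); insert all of them so nearby lookups hit."""
        base = addr - (addr % 16)
        for a, e in zip(range(base, base + 16), row_entries):
            self.insert(a, e)
        return self.lookup(addr)
```

Because consecutive CDir addresses map to consecutive sets, one row fill spreads its 16 entries across 16 sets, matching the slide's picture of a fill touching sets S, S+1, S+2, S+3 and beyond.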

Effectiveness of DCB 4-node system, each node has a 1GB DRAM$. DRAM-cache Coherence Buffer (DCB): 21% improvement.

Agenda The Need for Coherent DRAM Cache Challenge 1: A Large Coherence Directory Challenge 2: A Slow Request-For-Data Operation Summary

Slow Request-For-Data (RFD) RFD (fwd-getS) reads the data from a remote cache via the home node's coherence directory. In MSC, the remote read is served by the fast SRAM L3; in Coherent DRAM Cache, Request-For-Data instead incurs a slow 3D-DRAM access, adding extra latency.

Sharing-Aware Bypass Request-For-Data accesses only read-write shared data. With Sharing-Aware Bypass, read-write shared data bypasses the DRAM cache and is stored only in the L3 caches, so an RFD can be served from the fast L3.

Performance Improvement of CANDY DRAM Cache for Multi-Node Systems (CANDY). AVG: 1.25X. CANDY: 25% improvement over Memory-Side Cache (within 5% of Ideal-CDC).

Summary Coherent DRAM Cache faces two key challenges: a large coherence directory and a slow Request-For-Data operation. DRAM Cache for Multi-Node Systems (CANDY) addresses them with a DRAM-cache Coherence Buffer backed by an embedded coherence directory, and with Sharing-Aware Bypass. CANDY outperforms Memory-Side Cache by 25% (within 5% of the Ideal Coherent DRAM Cache).

Thank You CANDY: Enabling Coherent DRAM Caches for Multi-node Systems MICRO 2016 Taipei, Taiwan Oct 18, 2016 Chiachen Chou, Georgia Tech Aamer Jaleel, NVIDIA Moinuddin K. Qureshi, Georgia Tech Computer Architecture and Emerging Technologies Lab, Georgia Tech

Backup Slides

DCB Hit Rate

Operation Breakdown

Inter-node Network Traffic Reduction Compared to MSC, CANDY reduces inter-node network traffic by 65%.

Performance (NUMA-Aware systems)

Sharing-Aware Bypass (1) (1) Detecting read-write shared data. When the home node performs a coherence operation that implies read-write sharing (Invalidate, Request-For-Data, or Flush), it sets the Read-Write Shared (RWS) bit in the CDir entry. Sharing-Aware Bypass thus detects read-write shared data at run-time based on coherence operations.

Sharing-Aware Bypass (2) (2) Enforcing R/W shared data to bypass caches. On a cache miss, the home node attaches a BypL4 bit to the returned data when the CDir entry's RWS bit is set. On an L4 (DRAM$) cache fill or an L3 dirty eviction, the BypL4 bit is checked: if set, the line is not installed in the DRAM$. Sharing-Aware Bypass thus enforces that R/W shared data is stored only in the L3 caches.
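The detection and enforcement steps on these two slides can be sketched together: the home node flips the RWS bit on sharing-revealing coherence operations, and the fill path consults that bit to decide whether the line may enter the DRAM$. Function and field names are illustrative assumptions.

```python
# Hedged sketch of Sharing-Aware Bypass (illustrative names, not the
# paper's exact structures).
RW_SHARING_OPS = {"invalidate", "request_for_data", "flush"}

class CDirEntry:
    def __init__(self):
        self.rws = False   # Read-Write Shared bit, set at run-time

def on_coherence_op(entry, op):
    """Detection: the home node marks a line read-write shared whenever
    it must invalidate, forward, or flush it for another requester."""
    if op in RW_SHARING_OPS:
        entry.rws = True

def fill_target(entry):
    """Enforcement: RWS lines bypass the DRAM cache (L4) and are stored
    only in the L3, so a later Request-For-Data hits fast SRAM."""
    return "L3-only" if entry.rws else "L3+DRAM$"
```

Private and read-only shared lines keep the full DRAM$ capacity benefit, while the (typically small) set of read-write shared lines stays where RFD latency is low.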

Methodology
4-node system; each node: 4 cores, 3.2 GHz, 2-wide OOO; 4MB 16-way shared L3.

              DRAM Cache                  Off-chip DRAM Memory
Capacity      1GB                         16GB
Bus           DDR 3.2GHz, 64-bit          DDR 1.6GHz
Channels      8 channels, 16 banks/ch     2 channels, 8 banks/ch

Evaluation in the Sniper simulator; baseline: Memory-Side Cache. 13 parallel benchmarks from NAS, SPLASH2, PARSEC, and NU-bench.