BEAR: Mitigating Bandwidth Bloat in Gigascale DRAM Caches

1 BEAR: Mitigating Bandwidth Bloat in Gigascale DRAM Caches
ISCA 2015, Portland, OR, June 15, 2015
Chiachen Chou, Georgia Tech; Aamer Jaleel, NVIDIA; Moinuddin K. Qureshi, Georgia Tech

2 3D DRAM Helps Mitigate the Bandwidth Wall
3D DRAM standards such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM) are used in products like Intel Xeon Phi and NVIDIA Pascal. Stacked DRAM provides 4-8X bandwidth but has limited capacity. (Images courtesy of Micron, JEDEC, Intel, NVIDIA.)

3 3D DRAM Is Used as a Cache (DRAM Cache)
In the memory hierarchy (fast to slow): CPU, L1$, L2$, shared L3$, then the DRAM Cache built from 3D DRAM, and finally slow off-chip DRAM. A 1 GB DRAM$ holds 16M cache lines; at 4 B per tag that is 64 MB of tag storage, so the DRAM$ stores its tags in the 3D DRAM itself for scalability.
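As a quick sanity check of the tag-storage arithmetic above, here is a minimal sketch (line and tag sizes taken from the slide; the script itself is only illustrative):

```python
# Back-of-the-envelope check of the tag-storage numbers on the slide.
CACHE_SIZE_BYTES = 1 << 30   # 1 GB DRAM$
LINE_SIZE_BYTES  = 64        # 64 B cache lines
TAG_SIZE_BYTES   = 4         # 4 B tag per line (per the slide)

num_lines = CACHE_SIZE_BYTES // LINE_SIZE_BYTES    # 16M lines
tag_store = num_lines * TAG_SIZE_BYTES             # 64 MB of tags

print(f"{num_lines // 2**20}M lines, {tag_store // 2**20} MB of tag storage")
# -> 16M lines, 64 MB of tag storage: too large for SRAM, hence tags in 3D DRAM.
```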

4 Can the DRAM Cache Provide 4X Bandwidth?
Hits (tag + data) are a good use of the 4X bandwidth, but secondary operations waste it: Miss Detection, Miss Fill, Writeback Detection, and Writeback Fill all consume DRAM$ bandwidth, while off-chip memory provides only 1X. The DRAM$ does not utilize its full bandwidth.

5 Agenda
Introduction
Background: DRAM Cache Designs, Secondary Operations, Bloat Factor
BEAR
Results
Summary

6 The DRAM Cache Has a Narrow Bus
In the Alloy Cache, a 2 KB row buffer holds tag-and-data (TAD) units of 8 B tag + 64 B data, and the DRAM$ accesses tag and data over narrow 16-byte buses [Qureshi and Loh, MICRO'12].
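A minimal sketch of what the narrow bus implies for a single access, assuming the 16-byte bus and the 8 B + 64 B TAD unit from the slide (the beat counts are simple arithmetic, not simulator output):

```python
import math

BUS_WIDTH_BYTES = 16           # narrow DRAM$ bus (per the slide)
TAG_BYTES, DATA_BYTES = 8, 64  # Alloy-style tag-and-data (TAD) unit

def bus_beats(nbytes: int) -> int:
    """Number of 16-byte bus transfers needed to move nbytes."""
    return math.ceil(nbytes / BUS_WIDTH_BYTES)

# A hit must stream tag + data (5 beats) even though only the data (4 beats)
# is useful, so every access pays a premium on the narrow bus.
print(bus_beats(TAG_BYTES + DATA_BYTES))  # -> 5
print(bus_beats(DATA_BYTES))              # -> 4
```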

7 The Cache Requires Maintenance Operations
Only the hit (HIT) is useful work; the secondary operations are Miss Detection (MD), Miss Fill (MF), WB Detection (WD), and WB Fill (WF). A miss on Line X triggers a Miss Fill from memory, and a dirty Line Y evicted from the L3$ triggers WB Detection and WB Fill. DRAM$ bandwidth is used for these secondary operations.

8 Quantifying the Bandwidth Usage
Of all the transfers on the DRAM$ bus (HIT, WF, WD, HIT, MF, MD, HIT in the example), only the HIT transfers are useful. The Bloat Factor, the ratio of total bus traffic to the useful hit traffic, indicates the bandwidth inefficiency.
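A minimal sketch of how such a Bloat Factor could be tallied. The category names follow the slide; the byte counts per operation and the choice of 64 useful bytes per hit as the denominator are assumptions (the latter inferred from the HIT component of 1.25 on the next slide), not the paper's exact accounting:

```python
# Tally DRAM$ bus traffic by category and compute a Bloat Factor as
# (total bytes on the bus) / (useful bytes: 64 B of data per hit).
bytes_on_bus = {"HIT": 0, "MD": 0, "MF": 0, "WD": 0, "WF": 0}

def record(category: str, nbytes: int) -> None:
    bytes_on_bus[category] += nbytes

def bloat_factor(num_hits: int) -> float:
    useful = 64 * num_hits                 # only the hit data is useful work
    return sum(bytes_on_bus.values()) / useful if useful else float("inf")

# Toy trace: 3 hits (80 B tag+data each), one miss (detect 80 B + fill 64 B),
# one L3 writeback (detect 80 B + fill 64 B).
for _ in range(3):
    record("HIT", 80)
record("MD", 80); record("MF", 64)
record("WD", 80); record("WF", 64)
print(f"Bloat Factor = {bloat_factor(num_hits=3):.2f}")  # -> 2.75 for this toy trace
```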

9 Bloat Factor Breakdown
Configuration: 8-core, 8 MB shared L3$, 1 GB DRAM$, 16 GB memory; SPEC2006: 16 rate and 38 mix workloads. [Chart: Bloat Factor breakdown by component, with HIT at 1.25, MD+MF at 0.7, and WD+WF at 0.6.] The baseline has a Bloat Factor of 3.8.

10 Potential Performance Improvement of 22%
Configuration: 8-core, 8 MB shared L3$, 1 GB DRAM$, 16 GB memory; SPEC2006: 16 rate and 38 mix workloads. Eliminating the bandwidth bloat would yield up to 22% higher performance: reducing the Bloat Factor improves performance.

11 Not All Operations Are Created Equal
There are opportunities to remove secondary operations: some exist only to improve cache performance (e.g., inserting a fetched line into the DRAM Cache), while others are needed to ensure correctness. We propose BEAR to exploit these opportunities.

12 BEAR: Bandwidth-Efficient ARchitecture
Agenda
Introduction
Background
BEAR: Bandwidth-Efficient ARchitecture
1. Bandwidth-Efficient Miss Fill
2. Bandwidth-Efficient Writeback Detection
3. Bandwidth-Efficient Miss Detection
Results
Summary

13 Bandwidth-Efficient Miss Fill
When Line X returns from memory on a miss, the fill can be bypassed (thrown away) with probability P = 90% instead of being inserted into the DRAM$. Blind bypassing helps some workloads (around +10%) but hurts others (around -5%) because it degrades the hit rate. How can we enable bypass without hit-rate degradation?

14 BAB Limits the Hit-Rate Loss
Bandwidth-Aware Bypass (BAB) samples two groups of sets: insert sets that never bypass and bypass sets that use probabilistic (90%) bypass. If the sampled insert-set hit rate X minus the bypass-set hit rate Y is below a threshold Δ, the hit-rate loss is small and the cache uses probabilistic bypass; otherwise it does not bypass. Use probabilistic bypass only when the hit-rate loss is small.
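A minimal sketch of that decision logic. The class name, counter handling, default 90% bypass probability, and threshold value are illustrative assumptions; only the X - Y < Δ comparison comes directly from the slide:

```python
import random

class BandwidthAwareBypass:
    """Sketch of BAB: compare hit rates of sampled insert sets (X) and sampled
    bypass sets (Y); bypass fills probabilistically only while X - Y < delta."""

    def __init__(self, bypass_prob: float = 0.9, delta: float = 0.02):
        self.bypass_prob = bypass_prob
        self.delta = delta
        self.insert_hits = self.insert_accesses = 0   # sampled no-bypass sets
        self.bypass_hits = self.bypass_accesses = 0   # sampled bypass sets

    def record_sample(self, in_insert_set: bool, hit: bool) -> None:
        if in_insert_set:
            self.insert_accesses += 1
            self.insert_hits += hit
        else:
            self.bypass_accesses += 1
            self.bypass_hits += hit

    def should_bypass_fill(self) -> bool:
        if not self.insert_accesses or not self.bypass_accesses:
            return False                               # not enough samples yet
        x = self.insert_hits / self.insert_accesses    # insert-set hit rate
        y = self.bypass_hits / self.bypass_accesses    # bypass-set hit rate
        if x - y < self.delta:                         # hit-rate loss is small
            return random.random() < self.bypass_prob  # probabilistic bypass
        return False                                   # loss too large: insert
```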

15 BAB Improves Performance by 5%
[Chart: Bloat Factor breakdown (HIT, MD+MF, WD+WF) with BAB applied.] Hit rate: Alloy 64%, BAB 62%. BAB trades a small hit-rate loss for a 5% performance improvement.

16 What Is Writeback Detection?
When a dirty line Y_new is evicted from the L3$, the DRAM Cache must be probed to check whether the old copy Y_old exists there (WB Detection) before the writeback is installed. How can we remove Writeback Detection?

17 DRAM Cache Presence for WB Detection
DRAM Cache Presence (DCP) adds one bit to each L3$ line alongside the valid (V) and dirty (D) bits. On a dirty eviction of Y_new, if the presence bit says the line exists in the DRAM Cache, only the WB Fill is performed; otherwise the cache falls back to WB Detection + WB Fill. DRAM Cache Presence reduces WB Detection traffic.
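A minimal sketch of the DCP decision on a dirty L3 eviction. The field and method names (and the stub DRAM$ interface) are illustrative; the slide only specifies the extra presence bit and the two outcomes:

```python
from dataclasses import dataclass

@dataclass
class L3Line:
    tag: int
    dirty: bool
    dcp: bool   # DRAM Cache Presence bit: set when the line was filled from the DRAM$

class DramCacheStub:
    """Hypothetical DRAM$ interface, just enough to show the DCP decision."""
    def __init__(self):
        self.lines = {}                       # tag -> line (placeholder contents)
    def probe(self, tag):                     # WB Detection: costs a bus access
        return tag in self.lines
    def write_fill(self, line):               # WB Fill: write the dirty data
        self.lines[line.tag] = line

def handle_dirty_eviction(line: L3Line, dram_cache: DramCacheStub) -> None:
    """On a dirty L3 eviction, the DCP bit lets us skip the WB-Detection probe."""
    if not line.dirty:
        return                                # clean victim: nothing to write back
    if line.dcp:
        dram_cache.write_fill(line)           # only WB Fill, no probe traffic
    elif dram_cache.probe(line.tag):          # WB Detection
        dram_cache.write_fill(line)           # WB Fill
    # else: handling of a writeback miss is policy-dependent (not shown)
```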

18 DCP Improves Performance by 4%
[Chart: Bloat Factor breakdown with BAB + DCP applied.] DCP provides a 4% improvement in addition to BAB.

19 What Is Miss Detection?
When the L3$ misses on Line X, the DRAM Cache must read the tag (and data) to check whether Line X exists there (Miss Detection). Can we detect a miss without using bandwidth?

20 The Neighbor's Tag Comes Free with the Demand Access
A demand access to address X streams its TAD from the 2 KB DRAM row buffer as Tag+Data+Tag (8+64+8 = 80 bytes), so the neighboring TAD's tag arrives for free. Caching it in a Neighboring Tag Cache (NTC) saves later Miss Detection accesses.
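A minimal sketch of the NTC idea under those assumptions; keying the structure by DRAM$ set index and the method names are purely for illustration and not taken from the slide:

```python
class NeighboringTagCache:
    """Small per-bank store of tags learned for free from demand TAD reads."""

    def __init__(self):
        self.tag_of_set = {}   # DRAM$ set index -> tag of the line stored there

    def on_demand_read(self, neighbor_set: int, neighbor_tag: int) -> None:
        # The 80 B burst (8 B tag + 64 B data + 8 B neighbor tag) already
        # carried the adjacent set's tag, so remembering it costs no bandwidth.
        self.tag_of_set[neighbor_set] = neighbor_tag

    def is_known_miss(self, set_index: int, request_tag: int) -> bool:
        """True if a DRAM$ miss can be declared without a Miss Detection access."""
        cached = self.tag_of_set.get(set_index)
        return cached is not None and cached != request_tag
```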

21 NTC Shows a 2% Performance Improvement
[Chart: Bloat Factor breakdown with BAB + DCP + NTC applied.] NTC improves performance by an additional 2%.

22 Agenda
Introduction
Background
BEAR
Results
Summary

23 Methodology
CPU: 8 cores, 3.2 GHz, 2-wide OOO; 8 MB 16-way shared L3 cache.
DRAM Cache (stacked DRAM): 1 GB capacity; DDR 3.2 GHz, 128-bit bus; 4 channels, 16 banks/channel.
Off-chip DRAM: 16 GB capacity; DDR 1.6 GHz, 64-bit bus; 2 channels, 8 banks/channel.
Baseline: Alloy Cache [MICRO'12]. Workloads: SPEC2006 (16 memory-intensive apps), 16 rate and 38 mix workloads.

24 BEAR reduces Bloat Factor by 32%
BEAR improves performance by 11%

25 BW Bloat in Tags-in-SRAM Designs
Tags-in-SRAM (TIS) designs keep the tags in a 64 MB SRAM, paying (1) storage overhead and (2) access latency. Hits, Miss Fills, and WB Fills still move data over the DRAM$ bus, so Tags-in-SRAM also has a bandwidth bloat problem.

26 Tags-in-SRAM Performs Similarly to BEAR
BEAR can be applied to reduce BW bloat in Tags-in-SRAM DRAM$ designs

27 Summary
3D DRAM used as a cache mitigates the memory wall. In DRAM caches, secondary operations slow down delivery of critical data. We propose BEAR, which targets three sources of bandwidth bloat in the DRAM cache:
1. Bandwidth-Efficient Miss Fill
2. Bandwidth-Efficient Writeback Detection
3. Bandwidth-Efficient Miss Detection
Overall, BEAR reduces bandwidth bloat by 32% and improves performance by 11%.

28 Thank You
Computer Architecture and Emerging Technologies Lab, Georgia Tech

29 Backup Slides

30 The Overhead of BEAR Is Negligibly Small
Bandwidth-Aware Bypass: 8 bytes per thread, 64 bytes total.
DRAM Cache Presence: one bit per line in the LLC, 16K bytes total.
Neighboring Tag Register: 44 bytes per bank, 3.2K bytes total.
Total cost: 19.2K bytes.
Overall, BEAR incurs a hardware overhead of 19.2 KB.

31 Comparison to Other DRAM$ Designs
[Chart: performance comparison with Tags-in-DRAM designs; values shown include 28% and 11%.] BEAR outperforms other DRAM$ designs.

