Adaptive GPU Cache Bypassing Yingying Tian *, Sooraj Puthoor†, Joseph L. Greathouse†, Bradford M. Beckmann†, Daniel A. Jiménez * Texas A&M University *, AMD Research†

Outline Background and Motivation Adaptive GPU Cache Bypassing Methodology Evaluation Results Case Study of Programmability Conclusion

Graphics Processing Units (GPUs)
Tremendous throughput
High-performance computing
Target for general-purpose GPU computing
– Programming models: CUDA, OpenCL
– Hardware support: cache hierarchies
– AMD GCN: 16KB L1; NVIDIA Fermi: 16KB/48KB configurable

Inefficiency of GPU Caches
Thousands of concurrent threads
– Low per-thread cache capacity
Characteristics of GPU workloads
– Large data structures
– Useless insertions consume energy
– Useful blocks get replaced, reducing performance
Scratchpad memories filter temporal locality
– Less reuse is caught in caches
– Limits programmability

Memory Characteristics
Zero-reuse blocks: inserted into the cache without being accessed again before eviction
46% (max. 84%) of L1 cache blocks are accessed only once before eviction
Zero-reuse blocks consume energy, pollute the cache, and cause additional replacements

Motivation
GPU caches are inefficient
Increasing cache sizes is impractical
Inserting zero-reuse blocks wastes power without any performance gain
Inserting zero-reuse blocks forces useful blocks to be replaced, reducing performance
Objective: bypass zero-reuse blocks

Outline Background and Motivation Adaptive GPU Cache Bypassing Methodology Evaluation Results Case Study of Programmability Conclusion

Bypass Zero-Reuse Blocks
Static bypassing on resource limits [Jia et al. 2014]
– Degrades performance
Dynamic bypassing
– Adaptively bypasses zero-reuse blocks
– Makes predictions on cache misses
– What information should we use to make predictions? Memory addresses or memory instructions (PCs)?

Memory Address vs. PC
Using memory addresses is impractical due to the storage required
Using memory instructions (PCs) incurs far less overhead
PCs generalize the behavior of SIMD accesses with high accuracy

Adaptive GPU Cache Bypassing
PC-based dynamic bypass predictor, inspired by dead block prediction in CPU caches [Lai et al. 2001][Khan et al. 2010]
– If a PC leads to the last access to one block, the same PC will likely lead to the last access to other blocks
On a cache miss, predict whether the requested block is zero-reuse
– If yes, the requested block bypasses the cache
On a cache hit, verify the previous prediction and make a new prediction

Structure of PC-based Bypass Predictor
Cache accesses carry (PC, addr). Each L1 block's metadata is extended with a hashedPC (7 bits) alongside the tag, LRU bits, and valid bit.
A 128-entry prediction table of 4-bit counters is used to make predictions on accesses and is updated as predictions are verified.
HashedPC: the hash value of the PC of the last instruction that accessed the block
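
To make the organization above concrete, here is a minimal C++ sketch of the per-block metadata and the prediction table, using only the field sizes stated on the slide (7-bit hashed PC, 4-bit counters, 128 entries); the structure names and the hash function are illustrative assumptions, not the paper's implementation.

```cpp
#include <array>
#include <cstdint>

struct L1BlockMetadata {
    uint64_t tag      = 0;
    uint8_t  lruBits  = 0;
    bool     valid    = false;
    uint8_t  hashedPC = 0;   // 7-bit hash of the PC of the last instruction
                             // that accessed this block
};

struct BypassPredictor {
    static constexpr int     kEntries = 128;  // 2^7 entries, indexed by hashedPC
    static constexpr uint8_t kMax     = 15;   // 4-bit counter maximum
    std::array<uint8_t, kEntries> table{};    // one counter per hashed PC

    static uint8_t hashPC(uint64_t pc) {      // illustrative hash, not the paper's
        return static_cast<uint8_t>((pc ^ (pc >> 7) ^ (pc >> 14)) & 0x7F);
    }
};
```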

Example: prediction on a miss. For a cache access (PC, addr) that misses in the L1, the PC is hashed and the corresponding counter val is read from the prediction table; if val >= threshold, the block is predicted zero-reuse and bypasses the L1, with the data returned directly to the compute unit (CU).

Example: verification on a hit. The block was reused, so the earlier prediction was wrong; the prediction-table counter indexed by the block's stored hashedPC is decremented (val = val - 1), and the new accessing PC is recorded.

Example: verification on eviction. If a block is evicted without having been reused, the last-access prediction was correct; the counter indexed by the stored hashedPC is incremented (val = val + 1).
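
Putting the three example slides together, a minimal sketch of the predict/verify flow, building on the structures sketched above; the threshold value and the exact counter-update points are assumptions for illustration.

```cpp
struct BypassPredictorLogic : BypassPredictor {
    static constexpr uint8_t kThreshold = 8;   // assumed threshold

    // On an L1 miss: predict whether the requested block is zero-reuse.
    bool shouldBypass(uint64_t pc) const {
        return table[hashPC(pc)] >= kThreshold;    // high counter => bypass
    }

    // On an L1 hit: the block was reused, so the earlier "last access"
    // prediction was wrong -> decrement that PC's counter, then record the
    // new accessing PC as the latest candidate for the last access.
    void onHit(L1BlockMetadata &blk, uint64_t newPC) {
        if (table[blk.hashedPC] > 0) --table[blk.hashedPC];
        blk.hashedPC = hashPC(newPC);
    }

    // On eviction without reuse: the "last access" prediction was correct
    // -> increment the counter for the stored hashed PC.
    void onEvictWithoutReuse(const L1BlockMetadata &blk) {
        if (table[blk.hashedPC] < kMax) ++table[blk.hashedPC];
    }
};
```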

Misprediction Correction
A bypass misprediction is irreversible and causes additional penalties for accessing lower-level caches
The lower-level access itself is used to help detect mispredictions
Each L2 entry contains an extra bit: BypassBit
– BypassBit == 1 if the requested block bypassed the L1
– Intuition: if the bypass was a misprediction, the block will be re-referenced soon and will hit in the L2 cache
Miscellaneous
– Set dueling
– Sampling
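
A minimal sketch of how the BypassBit could drive correction, assuming (beyond what the slide states) that the bypassing request's hashed PC is also recorded at the L2 so the right predictor counter can be lowered.

```cpp
struct L2BlockMetadata {
    bool    bypassBit      = false;  // set when the block was filled by a request
                                     // that bypassed the L1
    uint8_t bypassHashedPC = 0;      // assumed bookkeeping so the L1 predictor
                                     // can be corrected later
};

// On an L2 fill, remember whether the request bypassed the L1.
void onL2Fill(L2BlockMetadata &blk, bool requestBypassedL1, uint8_t hashedPC) {
    blk.bypassBit      = requestBypassedL1;
    blk.bypassHashedPC = hashedPC;
}

// If a block marked as bypassed is hit again in the L2, the bypass was likely
// a misprediction: lower the responsible counter so similar blocks get cached.
void onL2Hit(L2BlockMetadata &blk, BypassPredictorLogic &pred) {
    if (blk.bypassBit) {
        if (pred.table[blk.bypassHashedPC] > 0) --pred.table[blk.bypassHashedPC];
        blk.bypassBit = false;
    }
}
```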

Outline Background and Motivation Adaptive GPU Cache Bypassing Methodology Evaluation Results Case Study of Programmability Conclusion

Methodology
In-house APU simulator that extends gem5
– Similar to the AMD Graphics Core Next architecture
Benchmarks: Rodinia, AMD APP SDK, OpenDwarfs

Storage Overhead / Evaluated Techniques (slide tables not captured in the transcript)

Outline Background and Motivation Adaptive GPU Cache Bypassing Methodology Evaluation Results Case Study of Programmability Conclusion

Energy Savings
58% of cache fills are prevented with cache bypassing
L1 cache energy is reduced by up to 49% (25% on average)
Dynamic power is reduced by 18%, while leakage power increases by only 2.5%

Performance Improvement
Outperforms a counter-based predictor
Achieves the performance of the 32KB upper bound

Prediction Accuracy
False positives: to-be-reused blocks that were incorrectly bypassed
Coverage: the ratio of bypass predictions to all predictions
The coverage ratio is 58.6% and the false positive rate is 12%
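
One way these statistics might be accumulated during simulation; the denominator used for the false-positive rate is an assumption, since the slide does not specify it.

```cpp
#include <cstdint>

struct AccuracyStats {
    uint64_t totalPredictions  = 0;  // one per prediction made on an L1 miss
    uint64_t bypassPredictions = 0;  // predictions that chose to bypass
    uint64_t falseBypasses     = 0;  // bypassed blocks that turned out to be reused

    // Coverage: ratio of bypass predictions to all predictions (as on the slide).
    double coverage() const {
        return totalPredictions ? double(bypassPredictions) / totalPredictions : 0.0;
    }
    // False-positive rate, here measured against all bypass predictions
    // (the denominator is an assumption; the slide does not specify it).
    double falsePositiveRate() const {
        return bypassPredictions ? double(falseBypasses) / bypassPredictions : 0.0;
    }
};
```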

Outline Background and Motivation Adaptive GPU Cache Bypassing Methodology Evaluation Results Case Study of Programmability Conclusion

Scratchpad Memories vs. Caches
Both store reused data shared within a compute unit
Programmer-managed vs. hardware-controlled
Scratchpad memories filter out temporal locality
Limited programmability
– Requires explicit remapping from the memory address space to the scratchpad address space
To what extent can L1 cache bypassing make up for the performance loss caused by removing scratchpad memories?

Case Study: Needleman-Wunsch (nw)
nw:
– Global optimization algorithm
– Compute-sensitive
– Very little reuse observed in L1 caches
Rewrote nw to remove the use of scratchpad memories: nw-noSPM
– Not simply a replacement of _local_ functions with _global_ functions
– Best-effort rewritten version

Case Study (cont.)
nw-noSPM takes 7x longer than nw
With bypassing, the gap is reduced by 30%
A 16KB L1 cache with bypassing outperforms a 64KB L1 cache
Note that the scratchpad memory is 64KB while the L1 is only 16KB

Conclusion
GPU caches are inefficient
We propose a simple but effective GPU cache bypassing technique
– Improves GPU cache efficiency
– Reduces energy overhead
– Requires far less storage than address-based prediction
Bypassing allows us to move toward more programmable GPUs

Thank you! Questions?

Backup: Are all zero-reuse blocks actually streaming?

GPU Cache Capacity
Caches are useful, but current cache sizes are too small to provide much benefit
The performance benefit of increasing cache sizes is far smaller than the extra area it takes up
Adding more computational resources instead leads to even lower per-thread cache capacity

Power Overhead
Table comparing the 16KB baseline and bypassing: energy (nJ) per tag access, per data access, and per prediction-table access (N/A for the baseline), plus dynamic power (mW) and static power (mW); numeric values were not captured in the transcript.

Comparison with SDBP
SDBP is designed for LLCs, where much of the temporal locality has already been filtered; our technique is designed for GPU L1 caches, where the temporal information is complete.
We use PCs because of the observed characteristics of GPGPU memory accesses: GPU kernels are small and frequently launched, and the interleaving changes frequently.