Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance


Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance
Rachata Ausavarungnirun, Saugata Ghose, Onur Kayiran, Gabriel H. Loh, Chita Das, Mahmut Kandemir, Onur Mutlu

Overview of This Talk
- Problem: A single long-latency thread can stall an entire warp
- Observations:
  - Heterogeneity: some warps have more long-latency threads than others
  - Cache bank queuing worsens GPU stalling significantly
- Our Solution: Memory Divergence Correction (MeDiC)
  - Differentiates warps based on this heterogeneity
  - Prioritizes warps with fewer long-latency threads
- Key Results: 21.8% better performance and 20.1% better energy efficiency compared to the state-of-the-art

Outline
- Background on Inter-warp Heterogeneity
- Our Goal
- Solution: Memory Divergence Correction
- Results

Latency Hiding in GPGPU Execution
[Diagram: GPU core status over time; the core switches among Warps A-D, executing an active warp while others stall, to hide memory latency]

Different Sources of Stalls
- Cache misses

Heterogeneity: Cache Misses
[Diagram: Warps A-C issue a mix of cache hits and misses to main memory; arrow length denotes stall time. A warp stalls until its slowest miss returns, so a warp with fewer misses has reduced stall time]

Different Sources of Stalls
- Cache misses
- Shared cache queuing latency

Queuing at L2 Banks
- 45% of requests stall 20+ cycles at the L2 queue
- Queuing latency gets worse as parallelism increases
[Diagram: request buffers in front of shared L2 cache banks 0 through n, feeding a memory scheduler that issues to DRAM]

Outline
- Background on Inter-warp Heterogeneity
- Our Goal
- Solution: Memory Divergence Correction
- Results

Our Goals
- Improve the performance of GPGPU applications
- Take advantage of warp heterogeneity
- Lower L2 queuing latency
- Eliminate misses from mostly-hit warps
- Simple design

Outline
- Background on Inter-warp Heterogeneity
- Our Goal
- Solution: Memory Divergence Correction
  - Warp-type Identification
  - Warp-type Aware Cache Bypassing
  - Warp-type Aware Cache Insertion
  - Warp-type Aware Memory Scheduling
- Results

Mechanism to Identify Warp-type
Key Observations:
- A warp tends to retain its hit ratio over time (hit ratio = number of hits / number of accesses)
- High intra-warp locality → high hit ratio
- Warps with random access patterns → low hit ratio
- Cache thrashing → additional reduction in hit ratio

Mechanism to Identify Warp-type
Key Observation: A warp tends to retain its hit ratio over time
Mechanism:
- Profile the hit ratio of each warp
- Assign a warp-type based on the profiling information
- Warp-types are reset periodically
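To make the mechanism concrete, here is a minimal sketch of the profiling and classification logic in C++. The names (`WarpType`, `WarpProfile`), the hit-ratio thresholds, and the sampling-period handling are illustrative assumptions, not the paper's exact parameters:

```cpp
#include <cstdint>

// Five warp-types, ordered from highest to lowest scheduling priority.
enum class WarpType { AllHit, MostlyHit, Balanced, MostlyMiss, AllMiss };

// Per-warp profiling state: hit ratio = number of hits / number of accesses.
struct WarpProfile {
    uint32_t hits = 0;      // L2 hits observed in the current sampling period
    uint32_t accesses = 0;  // L2 accesses observed in the current period
    WarpType type = WarpType::Balanced;

    void record_access(bool hit) {
        ++accesses;
        if (hit) ++hits;
    }

    // Re-classify the warp at the end of each sampling period, then reset
    // the counters (the periodic reset from the slide). The 0.7/0.2
    // thresholds are assumptions for illustration, not the paper's values.
    void update_type() {
        if (accesses == 0) return;  // nothing observed this period
        if (hits == accesses)  type = WarpType::AllHit;
        else if (hits == 0)    type = WarpType::AllMiss;
        else {
            double hit_ratio = static_cast<double>(hits) / accesses;
            type = (hit_ratio >= 0.7) ? WarpType::MostlyHit
                 : (hit_ratio <= 0.2) ? WarpType::MostlyMiss
                                      : WarpType::Balanced;
        }
        hits = accesses = 0;
    }
};
```

In hardware, this amounts to a small per-warp table of counters, with `update_type()` applied once per sampling period.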

Warp-types in MeDiC
[Diagram: warps are ranked from higher to lower priority as all-hit, mostly-hit, balanced, mostly-miss, and all-miss; each memory request passes through warp-type identification and bypassing logic into high- or low-priority request buffers in front of the shared L2 cache banks and the memory scheduler, which issues to DRAM; low-priority requests are serviced only when no high-priority requests are waiting]

MeDiC
- Warp-type aware cache bypassing
- Warp-type aware cache insertion policy
- Warp-type aware memory scheduling
[Diagram: the three components placed along the request path from warp-type identification through the shared L2 cache banks and memory scheduler to DRAM]

Warp-type Aware Cache Bypassing
Goal: Only cache accesses that benefit from the cache's lower latency
Our Solution:
- All-miss and mostly-miss warps → bypass the L2
- Other warp-types → allow cache access
Key Benefits:
- All-hit and mostly-hit warps are likely to stall less
- Mostly-miss and all-miss accesses are likely to miss anyway
- Reduces queuing latency for the shared cache
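Reusing the hypothetical `WarpType` enum from the classification sketch above, the bypass decision itself reduces to a single comparison:

```cpp
// Sketch of the bypass decision. Mostly-miss and all-miss warps skip the
// L2 and go straight to DRAM, so they neither pollute the cache nor add
// to the L2 request queues. WarpType is the illustrative enum from above.
bool should_bypass_l2(WarpType type) {
    return type == WarpType::MostlyMiss || type == WarpType::AllMiss;
}
```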

Warps Can Fetch Data for Others
- All-miss and mostly-miss warps sometimes prefetch cache blocks for other warps
  - Blocks with high reuse
  - Addresses shared with all-hit and mostly-hit warps
- Solution: Warp-type aware cache insertion

Warp-type Aware Cache Insertion
Goal: Ensure that cache blocks from all-miss and mostly-miss warps are more likely to be evicted
Our Solution:
- All-miss and mostly-miss → insert at the LRU position
- All-hit, mostly-hit, and balanced → insert at the MRU position
Benefits:
- Blocks from all-hit and mostly-hit warps are less likely to be evicted
- Heavily reused cache blocks from mostly-miss warps are likely to remain in the cache
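A sketch of the insertion decision, again using the hypothetical `WarpType` enum (`InsertPos` is an illustrative name, not from the paper):

```cpp
enum class InsertPos { MRU, LRU };

// Blocks fetched by mostly-miss and all-miss warps enter at the LRU
// position, making them the first eviction candidates unless another
// warp reuses them (and promotes them) before they age out.
InsertPos insert_position(WarpType type) {
    return (type == WarpType::MostlyMiss || type == WarpType::AllMiss)
               ? InsertPos::LRU
               : InsertPos::MRU;
}
```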

Not All Blocks Can Be Cached
- Despite our best efforts, accesses from mostly-hit warps can still miss in the cache
  - Compulsory misses
  - Cache thrashing
- Solution: Warp-type aware memory scheduler

Warp-type Aware Memory Scheduling
Goal: Prioritize mostly-hit warps over mostly-miss warps
Mechanism: Two memory request queues
- High priority → all-hit and mostly-hit
- Low priority → balanced, mostly-miss, and all-miss
Benefits:
- Memory requests from mostly-hit warps are serviced first
- The row buffer hit rate stays high, since only a few requests come from mostly-hit warps
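A minimal sketch of the two-queue idea (`MemRequest` and `WarpTypeAwareScheduler` are illustrative names; a real scheduler would still apply FR-FCFS within each queue to preserve row buffer locality, which the plain FIFO here omits for brevity):

```cpp
#include <deque>
#include <optional>

struct MemRequest { /* address, warp ID, etc. (fields elided) */ };

class WarpTypeAwareScheduler {
    std::deque<MemRequest> high_;  // all-hit and mostly-hit warps
    std::deque<MemRequest> low_;   // balanced, mostly-miss, all-miss warps

public:
    // Route each request by the warp-type of the warp that issued it.
    void enqueue(const MemRequest& req, WarpType type) {
        if (type == WarpType::AllHit || type == WarpType::MostlyHit)
            high_.push_back(req);
        else
            low_.push_back(req);
    }

    // Always drain the high-priority queue before the low-priority one.
    std::optional<MemRequest> next() {
        auto& q = !high_.empty() ? high_ : low_;
        if (q.empty()) return std::nullopt;
        MemRequest req = q.front();
        q.pop_front();
        return req;
    }
};
```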

MeDiC: Putting Everything Together
[Diagram: warp-type aware cache bypassing (mostly-miss and all-miss warps skip the L2), the warp-type aware cache insertion policy, and the warp-type aware memory scheduler combined along the request path through the shared L2 banks to DRAM]

MeDiC: Example
[Diagram: warps A and A' receive high priority and MRU insertion, lowering their stall time; warps B and B' bypass the cache, lowering queuing latency for everyone; cache/memory latency shown per cache hit and miss]

Outline
- Background on Inter-warp Heterogeneity
- Our Goal
- Solution: Memory Divergence Correction
- Results

Methodology
- Modified GPGPU-Sim modeling a GTX 480
  - Models the L2 queue and L2 queuing latency
- Comparison points:
  - FR-FCFS [Rixner+, ISCA'00]: commonly used in GPGPU memory scheduling; prioritizes row-hit requests for better throughput
  - EAF [Seshadri+, PACT'12]: tracks recently evicted blocks to detect high reuse
  - PCAL [Li+, HPCA'15]: uses tokens to limit the number of warps that access the L2 cache, reducing cache thrashing; warps with highly reused accesses get more priority

Results: Performance of MeDiC
[Plot: 21.8% average performance improvement over the state-of-the-art]
MeDiC is effective at identifying warp-types and taking advantage of latency heterogeneity

Results: Energy Efficiency of MeDiC
[Plot: 20.1% average energy efficiency improvement]
The performance improvement outweighs the additional energy from extra cache misses

Other Results in the Paper
- Comparison against PC-based and random cache bypassing policies: MeDiC provides better performance
- Breakdowns of WMS, WByp, and WIP: each component is effective
- Explicitly taking reuse into account: MeDiC is effective at caching highly reused blocks
- Sensitivity analysis of each individual component:
  - Minimal impact on L2 miss rate
  - Minimal impact on row buffer locality

Conclusion
- Problem: A single long-latency thread can stall an entire warp
- Observations:
  - Heterogeneity: some warps have more long-latency threads than others
  - Cache bank queuing worsens GPU stalling significantly
- Our Solution: Memory Divergence Correction (MeDiC)
  - Differentiates warps based on this heterogeneity
  - Prioritizes warps with fewer long-latency threads
- Key Results: 21.8% better performance and 20.1% better energy efficiency compared to the state-of-the-art

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance Rachata Ausavarungnirun, Saugata Ghose, Onur Kayiran, Gabriel H. Loh, Chita Das, Mahmut Kandemir, Onur Mutlu