Presentation transcript:

The Locality Descriptor: A Holistic Cross-Layer Abstraction to Express Data Locality in GPUs
Nandita Vijaykumar, Eiman Ebrahimi, Kevin Hsieh, Phillip B. Gibbons, Onur Mutlu

Executive Summary
Problem: Exploiting data locality in GPUs is a challenging task.
Our proposal: the Locality Descriptor, a flexible, architecture-agnostic software interface that gives hardware access to program semantics and drives data placement, cache management, CTA scheduling, data prefetching, and more.
Performance speedups: 26.6% (up to 46.6%) from cache locality; 53.7% (up to 2.8x) from NUMA locality.

Outline
Why is leveraging data locality challenging?
Designing the Locality Descriptor
Evaluation

Data locality is critical to GPU performance
Two forms of data locality:
- Reuse-based locality (cache locality)
- NUMA locality
[Figure: cores with private L1 caches sharing an L2 and memory, illustrating reuse-based (cache) locality]

Data locality is critical to GPU performance
Two forms of data locality:
- Reuse-based locality (cache locality)
- NUMA locality
[Figure: four NUMA zones (0-3), each with its own cores and memory, illustrating NUMA locality]

The GPU execution and programming models are designed to explicitly express parallelism...
But there is no explicit way to express data locality.
As a result, exploiting data locality in GPUs is a challenging and elusive feat.

A case study in leveraging data locality: histo
CTAs along the Y dimension of the compute grid share the same data.
Type of data locality: inter-CTA.
[Figure: data structure A partitioned into vertical slabs along the X dimension; each slab is shared by one column of CTAs in the compute grid]
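To make the sharing pattern concrete, here is a minimal, hypothetical CUDA sketch (not the actual Parboil histo kernel; the bin partitioning along blockIdx.y is an assumption made only to illustrate why CTAs along the Y dimension read the same slab of A):

```cpp
// Hypothetical histo-like kernel, for illustration only.
__global__ void histo_like(const unsigned int *A, unsigned int *bins,
                           int width, int height, unsigned int bins_per_y_block)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column of A, fixed per grid column
    if (x >= width) return;

    // blockIdx.y only selects which histogram bins this CTA may update.
    unsigned int bin_lo = blockIdx.y * bins_per_y_block;
    unsigned int bin_hi = bin_lo + bins_per_y_block;

    // Every CTA walks the full height of its column slab of A, so all CTAs
    // with the same blockIdx.x (i.e., along the Y dimension of the grid)
    // read exactly the same data: inter-CTA locality.
    for (int y = 0; y < height; ++y) {
        unsigned int v = A[(size_t)y * width + x];
        if (v >= bin_lo && v < bin_hi)
            atomicAdd(&bins[v], 1u);
    }
}
```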

Leveraging cache locality
CTA scheduling is required to leverage inter-CTA cache locality: the CTAs that share a data slab should be scheduled on the same core so the slab stays resident in that core's L1.
CTA scheduling alone is insufficient: we also need other techniques.
[Figure: four cores, each with a private L1; columns of CTAs from the compute grid mapped to cores, each accessing one slab of data structure A]

Leveraging NUMA locality
Exploiting NUMA locality requires both CTA scheduling and data placement: the CTAs that share a data slab must run in the NUMA zone where that slab's memory is placed.
[Figure: four NUMA zones, each with cores and local memory; slabs of data structure A and the corresponding CTA columns mapped to zones]

Today, leveraging data locality is challenging
As a programmer:
- No easy access to architectural techniques: CTA scheduling, cache management, data placement, etc.
- Even when using work-arounds, optimization is tedious and not portable.
As the architect:
- Key program semantics are not available to the hardware: Where to place data? Which CTAs to schedule together?

To make things worse:
There are many different locality types: inter-CTA, inter-warp, intra-thread, ...
Each type requires a different set of architectural techniques:
- Inter-CTA locality requires CTA scheduling + prefetching
- Intra-thread locality requires cache management
- ...

The Locality Descriptor
A hardware-software abstraction to express and exploit data locality.
It connects locality semantics to the underlying hardware techniques: the application describes locality through a new software interface, and hardware gains access to key program semantics to drive data placement, cache management, CTA scheduling, data prefetching, and more.

Goals in designing the Locality Descriptor
1) Supplemental and hint-based
2) Architecture-agnostic interface
3) Flexible and general across locality types (inter-CTA, inter-warp, intra-thread, ...) and architectural techniques (data placement, cache management, CTA scheduling, data prefetching, ...)

Designing the Locality Descriptor
LocalityDescriptor ldesc(X, Y, Z);
The descriptor must convey enough information for hardware to drive data placement, cache management, CTA scheduling, data prefetching, and more.

An overview: the components of the Locality Descriptor
LocalityDescriptor ldesc(A, len, INTER-THREAD, tile, loc);
1) Data structure (A, len)
2) Locality type (INTER-THREAD)
3) Tile semantics (tile)
4) Locality semantics (loc)

Outline
Why is leveraging data locality challenging?
Designing the Locality Descriptor
Evaluation

1. How to choose the basis of the abstraction?
Key idea: use the data structure as the basis for describing data locality.
- Architecture-agnostic
- Each data structure is accessed the same way by all threads
- A new instance is required for each important data structure
LocalityDescriptor ldesc(A);

2. How to communicate with hardware?
The locality type (inter-CTA, inter-warp, intra-thread, ...) drives the architectural mechanisms: the software side exposes an architecture-agnostic interface to the application, while the hardware side applies architecture-specific optimizations (data placement, cache management, CTA scheduling, data prefetching, ...).

2. How to communicate with hardware?
Key idea: use the locality type to drive the underlying architectural techniques.
- The origin of the locality (i.e., the locality type) determines the challenges in exploiting it. E.g., inter-CTA locality requires CTA scheduling because reuse is across threads; intra-thread locality requires cache management to avoid thrashing.
- The locality type is application-specific and known to the programmer.

2. How to communicate with hardware?
Key idea: use the locality type to drive the underlying architectural techniques.
Three fundamental types: INTER-THREAD, INTRA-THREAD, NO-REUSE.
For the histo example, where reuse is across CTAs of the compute grid, the type is INTER-THREAD:
LocalityDescriptor ldesc(A, INTER-THREAD);
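As a hypothetical illustration of attaching the three types to different data structures (B, C, and their assigned types are invented for this sketch, not taken from the paper):

```cpp
// Illustrative only: one descriptor per data structure, each with the
// locality type the programmer knows it exhibits.
LocalityDescriptor ldescA(A, INTER_THREAD);  // reused across CTAs (as in histo)
LocalityDescriptor ldescB(B, INTRA_THREAD);  // reused only within each thread
LocalityDescriptor ldescC(C, NO_REUSE);      // streamed through once, no reuse
```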

Driving underlying architectural techniques
Locality type?
- NO_REUSE: cache bypassing; CTA scheduling (if NUMA); memory placement (if NUMA)
- INTRA_THREAD: CTA scheduling; cache soft pinning; memory placement (if NUMA)
- INTER_THREAD: determined based on more information (see the full decision tree later)

3. How to describe locality?
Key idea: partition the data structure and the compute grid into tiles.
The basic unit of locality: a data tile, a compute tile, and the mapping between them.
LocalityDescriptor ldesc(A, INTER-THREAD, tile);
tile((X_tile, Y_len, 1), (1, GridSize.y, 1), (1, 0, 0));
The three tuples give the data tile dimensions, the compute tile dimensions, and the compute-data map.
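A hedged breakdown of that tile specification for the histo example; using dim3 and a Tile wrapper type is an assumption made for concreteness, while X_tile, Y_len, and GridSize.y come from the slide:

```cpp
// Illustrative decomposition of:
//   tile((X_tile, Y_len, 1), (1, GridSize.y, 1), (1, 0, 0));
dim3 data_tile(X_tile, Y_len, 1);     // data tile: a slab X_tile elements wide,
                                      // spanning the full Y extent of A
dim3 compute_tile(1, GridSize.y, 1);  // compute tile: one CTA along X,
                                      // all CTAs along Y (one grid column)
dim3 compute_data_map(1, 0, 0);       // map: the next compute tile along X uses
                                      // the next data tile along X; no Y/Z movement
Tile tile(data_tile, compute_tile, compute_data_map);  // assumed wrapper type
LocalityDescriptor ldesc(A, INTER_THREAD, tile);
```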

Additional features of the Locality Descriptor
The locality type alone (INTER-THREAD, INTRA-THREAD, NO-REUSE) is insufficient to inform the underlying architectural techniques.
In addition, the descriptor carries locality semantics (loc), which include:
- Sharing type
- Access pattern
e.g., (COACCESSED, REGULAR, X_len)
LocalityDescriptor ldesc(A, INTER-THREAD, tile, loc);
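A hedged reading of that locality-semantics tuple; the LocalitySemantics wrapper name is an assumption, while COACCESSED, REGULAR, and X_len are taken from the slide:

```cpp
// Illustrative decomposition of the locality semantics (loc) argument.
LocalitySemantics loc(
    COACCESSED,   // sharing type: threads of a compute tile co-access the whole data tile
    REGULAR,      // access pattern: regular (strided) accesses
    X_len);       // access stride, assumed here to be the row length of A
LocalityDescriptor ldesc(A, INTER_THREAD, tile, loc);
```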

A decision tree to drive underlying techniques
Locality type?
- NO_REUSE: cache bypassing; CTA scheduling (if NUMA); memory placement (if NUMA)
- INTRA_THREAD: CTA scheduling; cache soft pinning; memory placement (if NUMA)
- INTER_THREAD: sharing type?
  - NEARBY: CTA scheduling; next-line stride prefetching; memory placement (if NUMA)
  - COACCESSED: access pattern?
    - REGULAR: CTA scheduling; guided stride prefetching; memory placement (if NUMA)
    - IRREGULAR: cache hard pinning; CTA scheduling (if NUMA); memory placement (if NUMA)
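A minimal C++ sketch of this decision tree, reconstructing the slide's logic; it is not the hardware implementation, just the selection rules written out:

```cpp
#include <string>
#include <vector>

enum class Type    { INTER_THREAD, INTRA_THREAD, NO_REUSE };
enum class Sharing { NEARBY, COACCESSED };
enum class Pattern { REGULAR, IRREGULAR };

// Returns the techniques the decision tree on this slide selects.
std::vector<std::string> select_techniques(Type t, Sharing s, Pattern p, bool numa) {
    std::vector<std::string> out;
    if (t == Type::NO_REUSE) {
        out.push_back("cache bypassing");
        if (numa) { out.push_back("CTA scheduling"); out.push_back("memory placement"); }
    } else if (t == Type::INTRA_THREAD) {
        out.push_back("CTA scheduling");
        out.push_back("cache soft pinning");
        if (numa) out.push_back("memory placement");
    } else {  // INTER_THREAD: refine by sharing type, then by access pattern
        if (s == Sharing::NEARBY) {
            out.push_back("CTA scheduling");
            out.push_back("next-line stride prefetching");
        } else if (p == Pattern::REGULAR) {   // COACCESSED + REGULAR
            out.push_back("CTA scheduling");
            out.push_back("guided stride prefetching");
        } else {                              // COACCESSED + IRREGULAR
            out.push_back("cache hard pinning");
            if (numa) out.push_back("CTA scheduling");
        }
        if (numa) out.push_back("memory placement");
    }
    return out;
}
```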

Leveraging the Locality Descriptor
LocalityDescriptor ldesc(A, INTER-THREAD, tile, loc);
Architectural techniques: CTA scheduling, prefetching, data placement.
[Figure: data structure A and the CTA compute grid from the histo example]

CTA scheduling
Compute tiles of the grid are grouped into clusters (clusters 0-3), and each cluster is scheduled to one core: compute tile 0 to core 0, compute tile 1 to core 1, and so on. The CTAs that share a data slab therefore run on the same core, and the slab stays resident in that core's L1D.
[Figure: compute tiles 0-3 of the CTA compute grid mapped to clusters 0-3 and scheduled on cores 0-3, each with a private L1D, each accessing one slab of data structure A]
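A minimal sketch of the scheduling idea on this slide, assuming the histo-style compute tile of one grid column per cluster; the functions and the round-robin assignment are illustrative, not the paper's scheduler:

```cpp
// Illustrative only: CTAs of the same compute tile form a cluster, and a
// cluster is dispatched to a single core so its shared data slab stays
// resident in that core's L1D.
struct CTAId { unsigned x, y; };

// With a (1, GridSize.y, 1) compute tile, every CTA in grid column x
// belongs to cluster x.
unsigned cluster_of(const CTAId &cta) { return cta.x; }

// Assumed round-robin assignment of clusters to cores (SMs).
unsigned core_of(unsigned cluster, unsigned num_cores) { return cluster % num_cores; }
```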

Leveraging the Locality Descriptor
LocalityDescriptor ldesc(A, INTER-THREAD, tile, loc);
Architectural techniques: CTA scheduling, prefetching, data placement.
[Figure: data structure A and the CTA compute grid from the histo example]

Data placement
Each data slab of A is placed in the memory of the NUMA zone whose SMs run the cluster of CTAs that accesses it: cluster 0's slab in NUMA zone 0, cluster 1's in zone 1, and so on, so most accesses stay within the local zone.
[Figure: clusters 0-3 of the CTA compute grid (with per-cluster CTA queues) dispatched to SMs 0-3, each served by the memory of NUMA zones 0-3; the corresponding slabs of data structure A placed in the matching zones]
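A hedged sketch of the placement policy the figure implies: pages belonging to a data tile are placed in the memory of the NUMA zone that runs the matching cluster. The page granularity and the modulo mapping are assumptions for illustration:

```cpp
#include <cstddef>

// Illustrative only: co-locate each data tile with the cluster that uses it.
unsigned zone_of_cluster(unsigned cluster, unsigned num_zones) {
    return cluster % num_zones;  // assumed: cluster k is scheduled in zone k % num_zones
}

unsigned zone_of_page(std::size_t page_index, std::size_t pages_per_data_tile,
                      unsigned num_zones) {
    std::size_t data_tile = page_index / pages_per_data_tile;  // slab this page belongs to
    // Data tile k is accessed by cluster k (via the compute-data map),
    // so place its pages in that cluster's NUMA zone.
    return zone_of_cluster(static_cast<unsigned>(data_tile), num_zones);
}
```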

Leveraging the Locality Descriptor
LocalityDescriptor ldesc(A, INTER-THREAD, tile, loc);
Architectural techniques: CTA scheduling, prefetching, data placement.
Even with scheduling and placement, all threads can stall because they wait on the same data; this is what prefetching targets.
[Figure: data structure A and the CTA compute grid from the histo example]
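A minimal sketch of how descriptor-guided prefetching could use the stride declared in the locality semantics; issue_prefetch() is a hypothetical hardware hook, and the depth of four lines is an arbitrary illustrative choice:

```cpp
#include <cstddef>
#include <cstdint>

void issue_prefetch(std::uintptr_t addr);  // hypothetical hook into the L1 prefetcher

// On a demand miss to data structure A, run ahead along the stride the
// Locality Descriptor declared (e.g., X_len elements), so later warps that
// co-access the same data tile find the lines already cached.
void on_demand_miss(std::uintptr_t miss_addr, std::size_t stride_bytes) {
    constexpr int kPrefetchDepth = 4;  // illustrative depth
    for (int k = 1; k <= kPrefetchDepth; ++k)
        issue_prefetch(miss_addr + k * stride_bytes);
}
```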

Outline
Why is leveraging data locality challenging?
The Locality Descriptor
Evaluation

Methodology
Evaluation infrastructure: GPGPU-Sim v3.2.2
Workloads: Parboil, Rodinia, CUDA SDK, Polybench
System parameters:
- Shader core: 1.4 GHz; GTO warp scheduler [50]; 2 schedulers per SM; round-robin CTA scheduler
- SM resources: 32768 registers; 48 KB scratchpad; 32 KB L1, 4-way
- Memory model: FR-FCFS scheduling [59, 60]; 16 banks per channel
- Single-chip system: 15 SMs; 6 memory channels; 768 KB L2, 16-way
- Multi-chip system: 4 NUMA zones; 64 SMs (16 per zone); 32 memory channels; 4 MB L2, 16-way; inter-GPM interconnect: 192 GB/s; DRAM bandwidth: 768 GB/s (192 GB/s per module)

Performance speedup: leveraging cache locality
26.6% (up to 46.6%) performance speedup.
Locality Descriptors are an effective means of leveraging cache locality.
Different locality types require different optimizations; a single optimization is often insufficient.

Performance impact: leveraging NUMA locality
53.7% (up to 2.8x) performance speedup.

Conclusion
Problem: GPU programming models have no explicit abstraction to express data locality; as a result, leveraging data locality is a challenging task.
Our proposal: the Locality Descriptor
- A HW-SW abstraction to explicitly express data locality
- An architecture-agnostic and flexible SW interface to express data locality
- Enables HW to leverage key program semantics to optimize locality
Key results:
- 26.6% (up to 46.6%) performance speedup from leveraging cache locality
- 53.7% (up to 2.8x) performance speedup from leveraging NUMA locality

The Locality Descriptor: A Holistic Cross-Layer Abstraction to Express Data Locality in GPUs
Nandita Vijaykumar, Eiman Ebrahimi, Kevin Hsieh, Phillip B. Gibbons, Onur Mutlu