
1 Combining HW/SW Mechanisms to Improve NUMA Performance of Multi-GPU Systems 
MICRO 2018. Authors: Vinson Young (GT), Aamer Jaleel (NVIDIA), Evgeny Bolotin (NVIDIA), Eiman Ebrahimi (NVIDIA), David Nellans (NVIDIA), Oreste Villa (NVIDIA). Hi, I am Vinson Young. I will be talking about our work, "Combining HW/SW Mechanisms to Improve NUMA Performance of Multi-GPU Systems." This work was done with my co-authors while I was at NVIDIA.

2 Towards Multi-GPUs: The End of Moore's Law
Multi-GPUs are an attractive way to scale performance, and continued GPU performance scaling is important for Deep Learning, Graphics, HPC, and more. Today we are reaching the end of Moore's law, and single-GPU performance may stop scaling. Multi-GPUs are an attractive way to keep scaling GPU performance; for example, the new 16-GPU DGX-2 system packs more compute into one system and enables performance scaling independent of transistor scaling. These GPU systems are used to speed up important workloads such as Deep Learning, Graphics, and HPC.

3 Multi-GPUs are inherently NUMA
[Figure: multi-GPU system with a CPU (DDR) and GPUs connected through an NVSwitch; local HBM at 1 TB/s, inter-GPU NVLink links at 25 GB/s, unified memory across GPUs.]
However, multi-GPUs are NUMA: bandwidth to local memory is high, bandwidth to remote memory is low. This asymmetric bandwidth becomes a bottleneck when there is significant inter-GPU traffic. Today, programmers explicitly avoid sending inter-GPU traffic, but we would like to eliminate this substantial programmer burden. Ideally, we could run unmodified GPU software transparently and still get most of the benefit of multi-GPUs. Prior work, NUMA-GPU [MICRO '17], proposed running unmodified code and instead using the GPU runtime to optimize inter-GPU traffic.

4 Software Techniques for improving NUMA
[Figure: unified-memory multi-GPU with local HBM at 1 TB/s and a 64 GB/s inter-GPU switch; table of data classes: Private Data ✔* (first-touch placement), Rd-shared ✔ (page replication with read-only copies of a shared page), Rd/Wr-shared ?; *NUMA-GPU [MICRO '17].]
There is a recent software solution for NUMA in the GPU runtime. We refer to this prior work as NUMA-GPU [presented at MICRO '17] and use it as our baseline throughout. For private (unshared) data, NUMA-GPU allocates each page on the first node that touches it, so private data generates no inter-GPU traffic. For read-shared data, software page replication lets each GPU service those pages at local memory bandwidth. For read-write-shared data, however, there is no effective solution: page replication is difficult because collapsing the copies on a write is expensive. NUMA-GPU instead caches remote data in the L2, but L2 capacity is small and often insufficient to hold the shared working set. So how do NUMA-GPU and page replication actually perform? NUMA-GPU works, but it has no solution for substantial RW-sharing.
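To make this policy concrete, here is a minimal C++ sketch of a NUMA-GPU-style page policy: first-touch placement for private data, replication for read-only sharing, and an expensive collapse on a write to a shared page. All names (PageState, on_access, etc.) are illustrative assumptions, not the actual NUMA-GPU runtime API.

```cpp
#include <set>

enum class Sharing { Private, ReadShared, ReadWriteShared };

struct PageState {
    int home_gpu = -1;              // set by first-touch placement
    Sharing sharing = Sharing::Private;
    std::set<int> replicas;         // GPUs holding a read-only copy of this 2MB page
};

// Called when 'gpu' touches a 2MB page (illustrative runtime hook).
void on_access(PageState& p, int gpu, bool is_write) {
    if (p.home_gpu < 0) {           // first touch: place the page in that GPU's memory
        p.home_gpu = gpu;
        return;
    }
    if (!is_write && gpu != p.home_gpu) {
        // Read-only sharing: replicate into the reader's local memory so future
        // reads are serviced at local bandwidth.
        p.sharing = Sharing::ReadShared;
        p.replicas.insert(gpu);
        return;
    }
    if (is_write && (!p.replicas.empty() || gpu != p.home_gpu)) {
        // Write to a shared/replicated page: the OS must collapse all copies and
        // shoot down TLBs (expensive), so replication is limited to read-only pages
        // and RW-shared pages remain un-replicated.
        p.replicas.clear();
        p.sharing = Sharing::ReadWriteShared;
    }
}
```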

5 Performance of Today's Techniques
[Chart: performance relative to an ideal multi-GPU (no NUMA effects) for three workload classes: Minimal Sharing (8 workloads), Rd-Only Sharing (3 workloads), and RW Sharing (9 workloads); the RW Sharing class shows a 30% gap.]
How do existing techniques perform? The Y-axis is performance relative to an ideal system with no NUMA effects (i.e., infinite inter-GPU bandwidth); the X-axis groups several classes of workloads. Blue is NUMA-GPU [MICRO '17]; orange is NUMA-GPU with page replication enabled for read-only pages (the current best software solution); yellow is an ideal multi-GPU with no NUMA bottleneck. For workloads with minimal sharing, NUMA-GPU gets nearly ideal performance: more compute resources, more performance. For workloads with read-only sharing, page replication services most data from high-bandwidth local replicas and bridges most of the gap. For workloads with significant read-write sharing, however, prior work is unable to bridge the performance gap. Existing solutions fail to address the NUMA bottleneck for these workloads, so let us examine the extent of the problem and why existing solutions fall short. Configuration: 4 GPUs, 64 SMs/GPU, 2MB pages.

6 Issue: Read+Write sharing behavior
Significant RW sharing → page replication fails. [Chart: fraction of accesses to Private, Rd-Shared, and Rd/Wr-Shared data at OS-page (2MB) granularity.]
The first issue is read-write sharing behavior. The Y-axis is the fraction of accesses to private pages (accessed by one GPU), read-shared pages (pages with multiple GPU readers), and read-write-shared pages (pages with multiple GPU readers and writers). For the poor-performing RW Sharing workloads, a substantial fraction of accesses go to pages that have both readers and writers. Recall that on a write to a replicated page, the OS must collapse the copies and shoot down TLBs, which is very expensive; this limits how many pages can be replicated. GPUs use large pages to improve TLB coverage, and large pages are well known to suffer from false sharing. But sharing depends on the granularity at which you analyze it.

7 Issue: Read+Write sharing behavior
Sharing decreases at finer granularity. [Chart: same sharing breakdown at cacheline (128B) vs. OS-page (2MB) granularity.]
We repeat the sharing analysis at line granularity and see a very different picture: there is much less sharing at the cacheline level. So 2MB pages experience false sharing; what if we replicate at a finer granularity? Would that mitigate the inter-GPU traffic? Well, NUMA-GPU already tried fine-grain caching of remote data in the L2, and it did not work. Why?
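The sharing breakdown on these two slides can be reproduced with a simple trace analysis; the C++ sketch below is illustrative and assumes a trace of (gpu, address, is_write) records. The only thing that changes between the 2MB-page view and the 128B-line view is the block size passed in.

```cpp
#include <cstdint>
#include <set>
#include <unordered_map>
#include <vector>

struct Access { int gpu; uint64_t addr; bool is_write; };
struct BlockInfo { std::set<int> readers, writers; };

// Group accesses by block (2MB page or 128B line, depending on block_bytes).
std::unordered_map<uint64_t, BlockInfo>
classify(const std::vector<Access>& trace, uint64_t block_bytes) {
    std::unordered_map<uint64_t, BlockInfo> blocks;
    for (const auto& a : trace) {
        BlockInfo& b = blocks[a.addr / block_bytes];
        (a.is_write ? b.writers : b.readers).insert(a.gpu);
    }
    return blocks;
}

// RW-shared: more than one GPU touches the block and at least one of them writes it.
bool is_rw_shared(const BlockInfo& b) {
    std::set<int> sharers = b.readers;
    sharers.insert(b.writers.begin(), b.writers.end());
    return sharers.size() > 1 && !b.writers.empty();
}

// classify(trace, 2 * 1024 * 1024) vs. classify(trace, 128) exposes the false-sharing
// effect: many 2MB pages look RW-shared even though most of their 128B lines are
// private or read-only.
```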

8 Fine-grain L2 remote caching—insufficient
[Chart: shared data working set vs. aggregate system L2 capacity.]
NUMA-GPU already does cache remote data in the L2, but that cache is not very big and cannot hold much of the shared data. The shared-data working set is much larger than the L2 size, so we want a much larger, gigabyte-scale cache, yet devoting gigabytes of on-chip area to the L2 is infeasible. This brings us to an interesting insight: what if we could build a much larger cache out of off-chip resources, namely the memory technology itself? That is our approach.

9 CARVE: Caching Remote Data in Video Memory
Overview: Background → CARVE: Caching Remote Data in Video Memory → Cache Design → HW/SW Coherence → Summary.
Instead of being limited by the L2 size, we can use memory technology to build a giga-scale cache, using part of local memory to store remote data. This is our proposal: Caching Remote Data in Video Memory (CARVE).

10 CARVE: Cache Remote Data in Video Memory
Dedicate a small fraction of GPU memory and architect it as a HW-managed "DRAM cache".
[Figure: each GPU carves a 2GB DRAM cache out of its 32GB HBM, leaving 30GB of regular memory; 1 TB/s local HBM, 64 GB/s inter-GPU switch, CPU DDR at 100 GB/s, unified memory with shared pages.]
Remote DRAM Cache (RDC): the DRAM cache stores the contents of remote memory and is consulted on a GPU LLC miss; DRAM-cache misses fetch data from the remote GPU and fill the DRAM cache. For example, take 2GB of the 32GB local memory and use it as a DRAM cache. Because the DRAM cache is built from the same technology as memory, caching local lines would not improve latency or bandwidth, and might instead add serialization latency and extra bandwidth consumption, so we cache only remote data. The line-granularity DRAM cache is architected similarly to Alloy Cache [MICRO '12]. CARVE thus enables caching a shared-data footprint far larger than the on-chip L2 capacity.
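Here is a minimal C++ sketch of the access path this slide describes: on a GPU L2 (LLC) miss, a remote-homed address is looked up in the local Remote DRAM Cache, while local addresses bypass it; misses fetch over the inter-GPU link and fill the cache. The direct-mapped, tag-with-data organization loosely follows the Alloy-Cache-style design mentioned on the slide; the stub functions are assumptions standing in for the real memory system.

```cpp
#include <cstdint>
#include <vector>

constexpr uint64_t LINE_BYTES = 128;

struct RdcEntry { bool valid = false; uint64_t tag = 0; /* 128B data payload omitted */ };

struct RemoteDramCache {
    std::vector<RdcEntry> sets;                    // e.g., 2GB / 128B direct-mapped entries
    explicit RemoteDramCache(size_t n) : sets(n) {}

    // Returns true on a hit; on a miss the entry is filled (write-allocate) and the
    // caller fetches the data from the remote GPU.
    bool access(uint64_t line_addr) {
        RdcEntry& e = sets[line_addr % sets.size()];
        if (e.valid && e.tag == line_addr) return true;   // served at local HBM bandwidth
        e.valid = true;
        e.tag = line_addr;
        return false;
    }
};

// Stubs standing in for the real system (assumptions, not from the paper).
// is_remote: crude 2MB-page interleave across 4 GPUs; real homing follows first-touch placement.
bool is_remote(uint64_t addr, int my_gpu) { return (addr >> 21) % 4 != (uint64_t)my_gpu; }
void fetch_over_nvlink(uint64_t /*addr*/) { /* remote memory access over the 64 GB/s link */ }

void on_llc_miss(RemoteDramCache& rdc, uint64_t addr, int my_gpu) {
    if (!is_remote(addr, my_gpu)) return;          // local data goes straight to HBM;
                                                   // caching it locally would gain nothing
    uint64_t line = addr / LINE_BYTES;
    if (!rdc.access(line))
        fetch_over_nvlink(addr);                   // miss: fetch remotely; entry was filled above
}
```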

11 CARVE Performance (Ideal Coherence)
CARVE has the potential to bridge the performance gap for workloads with heavy read-write sharing: it caches a substantial portion of shared data and reduces the fraction of remote accesses from 40% to 8%. CARVE can thus reduce the NUMA bottleneck. However, we now need a way to keep this giga-scale cache coherent. Configuration: 4 GPUs, 64 SMs/GPU, 2MB pages.

12 CARVE: Caching Remote Data in Video Memory
Overview: Background → CARVE: Caching Remote Data in Video Memory → Cache Design → Coherence → Summary.

13 Conventional GPU Coherence
Conventional GPUs enforce SW coherence by (1) using write-through private caches (i.e., a write-through L1) and (2) flushing ALL private caches at synchronization boundaries.
Attempt: a SW-coherent Remote DRAM Cache (RDC): + write-through DRAM cache (no extra overhead relative to the baseline); - flush the DRAM cache at every synchronization boundary.
Conventionally, GPUs enforce software coherence around explicit synchronization boundaries. They accomplish this by employing write-through private caches and by flushing all private caches at synchronization boundaries so that updated data is re-fetched. We can try the same for our DRAM cache: make it write-through (no extra overhead) and flush it at synchronization boundaries. But then gigabytes of cached data are potentially discarded at every boundary.
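As a rough illustration of why this attempt hurts, here is a small C++ sketch of the sync-boundary discipline extended naively to the RDC; the cache types and sizes are illustrative, not from the paper.

```cpp
// Conventional GPU SW coherence: write-through private caches, flush at sync points.
struct L1Cache  { void flush_all() { /* invalidate ~100s of KB per SM: cheap */ } };
struct RdcCache { void flush_all() { /* invalidate gigabytes of cached remote data: costly */ } };

void on_store() {
    // Stores write through to memory, so no dirty data ever needs a writeback.
}

void on_sync_boundary(L1Cache& l1, RdcCache& rdc) {
    l1.flush_all();   // GPUs already do this at every kernel/synchronization boundary
    rdc.flush_all();  // naive extension: discards gigabytes of locality each time,
                      // which is why SW-coherent CARVE loses its benefit (next slide)
}
```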

14 CARVE Performance with SW Coherence
[Chart: CARVE with SW coherence vs. baseline and ideal.]
So how does a software-coherent Remote DRAM Cache perform? SW-coherent CARVE loses essentially all of its potential performance benefit. These workloads have significant locality between synchronization boundaries, and flushing the entire DRAM cache throws that locality away. We do not want to flush a giga-scale cache frequently, so we need an alternative form of coherence that does not require cache flushes.

15 CARVE: Caching Remote Data in Video Memory
Overview: Background → CARVE: Caching Remote Data in Video Memory → Cache Design → HW Coherence → Summary.
The familiar alternative is hardware coherence, where data is always kept up to date.

16 GPU Hardware Coherence
Leverage the GPU-VI coherence protocol [HPCA '13]: the GPU writes through its private caches and broadcasts invalidates.
Problem: broadcasting invalidates on EVERY store requires high interconnect bandwidth.
Insight: only write-invalidate RW-shared data.
We start from a simple hardware coherence mechanism, GPU-VI, a protocol that writes through the private caches and broadcasts an invalidate on every write so data is always up to date. It works, but broadcasting invalidates on every store increases interconnect traffic. From the earlier analysis we know most accesses are to private data, which needs no coherence actions; we only want to broadcast invalidates on writes to RW-shared data. We can reduce invalidate traffic if we can detect which lines are read-write shared, so we desire a mechanism to dynamically detect RW-shared data.
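A small C++ sketch of the resulting store path follows: writes still write through, but the invalidate broadcast is gated by the sharing state of the line (the tracker itself is on the next slide). The stub functions are assumptions standing in for the memory system and interconnect, not real hardware interfaces.

```cpp
#include <cstdint>

enum class SharingState : uint8_t { Uncached, Private, ReadShared, ReadWriteShared };

// Stand-in stubs (assumptions):
SharingState imst_lookup(uint64_t /*line*/)    { return SharingState::Private; }
void write_through_to_home(uint64_t /*line*/)  { /* write-through, as in baseline GPUs */ }
void broadcast_invalidate(uint64_t /*line*/)   { /* inter-GPU invalidate message */ }

void on_store(uint64_t line_addr) {
    write_through_to_home(line_addr);
    // Baseline GPU-VI would broadcast an invalidate on every store. With the sharing
    // tracker, private and read-only lines (the common case) skip the broadcast.
    if (imst_lookup(line_addr) == SharingState::ReadWriteShared)
        broadcast_invalidate(line_addr);
}
```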

17 Dynamically Tracking RW-Shared Data
Proposal: the GPU maintains an in-memory sharing tracker (IMST) for EVERY line and broadcasts invalidates only for RW-shared lines. Lines are classified into four states (un-cached, private, Rd-shared, Rd/Wr-shared); the 2-bit sharing state is stored in memory and cached alongside the data (in ECC bits), and the GPU monitors reads and writes from the local and remote GPUs to update it.
[State diagram: un-cached, private, Rd-shared, Rd/Wr-shared; a write from a remote GPU moves a line to RW-shared; probabilistic updates on writes track dynamic cacheline behavior.]
On a write from a remote GPU, a line transitions to the RW-shared classification, and subsequent writes to it must broadcast invalidates. However, sharing behavior can change over time, so we probabilistically transition lines back to private or Rd-shared on some writes; these downgrading writes still broadcast invalidates to ensure only one valid version remains. For more details, we refer you to the paper. The IMST tracks RW-shared lines and minimizes broadcast traffic.
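The C++ sketch below reconstructs the IMST transition function from this slide and the backup slide. The handling of the un-cached state, the downgrade target, and the 1/64 probability are simplifications assumed for illustration; see the paper for the exact transitions.

```cpp
#include <cstdint>
#include <cstdlib>

enum class SharingState : uint8_t { Uncached, Private, ReadShared, ReadWriteShared };

// 2-bit per-line state, updated on every read/write observed at the home memory.
SharingState imst_update(SharingState s, bool is_write, bool from_remote_gpu) {
    switch (s) {
    case SharingState::Uncached:
        if (!from_remote_gpu) return SharingState::Private;     // local first touch
        return is_write ? SharingState::ReadWriteShared         // remote write -> RW-shared
                        : SharingState::ReadShared;
    case SharingState::Private:
        if (from_remote_gpu)
            return is_write ? SharingState::ReadWriteShared     // remote write -> RW-shared
                            : SharingState::ReadShared;         // remote read  -> Rd-shared
        return SharingState::Private;
    case SharingState::ReadShared:
        return is_write ? SharingState::ReadWriteShared         // any write -> RW-shared
                        : SharingState::ReadShared;
    case SharingState::ReadWriteShared:
        // Sharing behavior changes over time: occasionally downgrade on a write so a
        // line does not stay RW-shared forever. The downgrading write still broadcasts
        // an invalidate so only one valid copy remains.
        if (is_write && !from_remote_gpu && (std::rand() % 64) == 0)
            return SharingState::Private;
        return SharingState::ReadWriteShared;
    }
    return s;
}
```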

18 Performance with HW Coherence
[Chart: CARVE with HW coherence vs. CARVE with ideal coherence.]
CARVE with simple hardware coherence performs similarly to CARVE with ideal coherence, because the additional invalidates due to coherence are rare: constant RW-sharing at line granularity is uncommon. In total, HW-coherent CARVE performs similar to ideal coherence.

19 CARVE eliminates NUMA effects to achieve near ideal Multi-GPU perf
Overall Performance. [Chart: speedup over a single GPU for NUMA-GPU, CARVE-HWC, and an ideal 4x multi-GPU.]
So what raw speedup do we get relative to a single GPU? An ideal 4x multi-GPU (with infinite inter-GPU bandwidth) achieves strong scaling to almost 4x performance, and CARVE with HW coherence nearly matches that ideal configuration. (The small degradation on the Rd Sharing workloads for CARVE-HWC comes from serializing the RDC check before the remote memory access.) In total, CARVE lets a large fraction of shared data be cached locally and reduces dependence on inter-GPU link bandwidth, which enables CARVE to obtain performance similar to an ideal multi-GPU. There is negligible impact from the loss in memory capacity; please refer to the paper.

20 CARVE: Caching Remote Data in Video Memory
Overview: Background → CARVE: Caching Remote Data in Video Memory → Cache Design → HW Coherence → Summary.

21 Summary
Private Data: 🗴 → ✔ solved in SW (NUMA-GPU first-touch placement)
Rd-shared: 🗴 → ✔ solved in SW (page replication)
Rd/Wr-shared: 🗴 → ✔ solved in HW (CARVE)

22 Combining HW/SW Mechanisms to Improve NUMA Performance of Multi-GPU Systems 
Thank you! MICRO 2018. Authors: Vinson Young (GT), Aamer Jaleel (NVIDIA), Evgeny Bolotin (NVIDIA), Eiman Ebrahimi (NVIDIA), David Nellans (NVIDIA), Oreste Villa (NVIDIA).

23 Backup Slides

24 In-memory Sharing tracking
States: un-cached; private (read-only or read-write line used only by the local GPU); R-shared (read-only line shared by multiple GPUs); RW-shared (read-write line shared by multiple GPUs).
[State diagram transitions: un-cached → private on a local read/write; private → R-shared on a remote read; private → RW-shared on a remote write; R-shared → RW-shared on a write; self-loops on the remaining local/remote reads and writes; probabilistic update on writes moves lines out of RW-shared to track dynamic cacheline behavior.]
We refer you to the paper for how RW-shared lines are detected dynamically. The key idea is to watch for RW-shared lines: only writes to RW-shared lines broadcast invalidates, others do not. However, lines could get stuck in the RW-shared state, which is why the probabilistic update exists. The IMST tracks RW-shared lines and minimizes broadcast traffic.

25 Caching Remote Data in Video Memory (CARVE) Coherence Directory Implementation
Spare ECC bits are used to store a 2-bit in-memory directory per 128B cache line.
[Figure: GPU with HBM at 1 TB/s behind a 64 GB/s switch, CPU DDR at 100 GB/s; the 2-bit in-memory directory lives in the spare ECC bits of each line.]
The RDC caches the in-memory directory entry on fills.
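A back-of-envelope check of the directory footprint, assuming the 32GB of HBM per GPU from the earlier slide and 2 bits per 128B line (the arithmetic is ours, not stated on the slide):

```latex
\frac{32\,\mathrm{GiB}}{128\,\mathrm{B/line}} = 2^{28}\ \text{lines},\qquad
2^{28}\ \text{lines} \times 2\,\tfrac{\mathrm{bits}}{\mathrm{line}} = 2^{29}\ \mathrm{bits} = 64\,\mathrm{MiB}.
```

Because those 2 bits ride in the spare ECC bits of each 128B line, this state consumes no dedicated DRAM capacity.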

26 Caching Remote Data in Video Memory (CARVE)
Overhead for the loss in GPU memory capacity? Recent Unified Memory (UM) support enables placing data in CPU memory instead: when GPU memory is full, data is allocated in CPU memory.
[Figure: GPU with HBM at 1 TB/s (part carved out as the RDC) behind a 64 GB/s switch; CPU DDR at 100 GB/s absorbs the spillover.]

27 Sensitivity to GPU Memory Capacity Loss
[Chart: performance of NUMA-GPU vs. RDC sizes of 0.5GB, 1GB, 2GB, and 4GB.]
A small loss in GPU memory capacity has negligible worst-case performance impact.

28 Software Techniques for improving NUMA
NUMA-GPU techniques: contiguous work scheduling improves data locality; first-touch page placement ensures data is serviced from local memory; SW-coherent remote data caching (in the L2) captures some inter-GPU data. Other SW techniques: page replication, feasible only for read-only copies. NUMA-GPU [MICRO '17].
[Figure: unified-memory multi-GPU with 1 TB/s HBM and a 64 GB/s switch; contiguous work chunks (Work-0 through Work-15) assigned per GPU; a shared page with read-only copies replicated to the other GPUs.]
This is how a transparent multi-GPU optimizes inter-GPU traffic. Traditionally, GPUs distribute work and data everywhere to make the most use of the system, but this destroys data locality. Recent proposals instead (1) improve work scheduling with contiguous work scheduling and (2) improve data placement with first-touch placement, so unshared data is accessed at high bandwidth. (3) For shared data, a small amount of remote data can be cached in the L2 via L2 remote caching. (4) Coarse-grain SW-based page replication lets the runtime make copies of pages; while we could theoretically replicate all pages, the copies must be kept coherent: a write to a replicated page forces the OS to collapse the copies and shoot down TLBs, which are very heavy operations, so page replication is typically limited to read-only pages. NUMA-GPU works well, but it has no solution for substantial RW sharing; let us analyze the extent of this problem.

29 Summary Problem: NUMA effects impede strong scaling of Multi-GPUs
Approach: if we can cache ALL shared data, we can eliminate inter-GPU accesses and the NUMA bottleneck.
Solution: CARVE: Cache Remote Data in Video Memory. Prior SW techniques are effective at improving data locality ("Min Sharing" and "Rd Sharing" workloads); the proposed HW technique of a giga-scale remote cache increases remote caching capacity and improves the "RW Sharing" workloads.
In total, NUMA effects (due to limited inter-GPU bandwidth) are a major impediment to strong scaling of transparent multi-GPUs. We find that if we can cache nearly all shared data, we can solve the NUMA bottleneck: SW techniques reduce the shared-data footprint by improving data locality, and our proposed HW technique, CARVE, provides a giga-scale remote cache that holds nearly all of the shared data. CARVE eliminates NUMA effects on emerging multi-GPU systems and enables strong scaling of multi-GPU performance.

