Multi-GPU System Design with Memory Networks


1 Multi-GPU System Design with Memory Networks
Gwangsun Kim, Minseok Lee, Jiyun Jeong, John Kim
Department of Computer Science, Korea Advanced Institute of Science and Technology (KAIST)

2 Single-GPU Programming Pattern
[Figure: data is copied from host memory to device memory before the kernel runs on the GPU.]
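For concreteness, here is a minimal CUDA sketch of this single-GPU pattern (not from the slides; the scale kernel and sizes are illustrative): allocate device memory, copy the data in, launch the kernel, and copy the result back.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: doubles every element of the array.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_data = new float[n];                 // host memory
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));       // device memory
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);  // data in
    scale<<<(n + 255) / 256, 256>>>(d_data, n);   // kernel launch
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);  // result out

    cudaFree(d_data);
    delete[] h_data;
    return 0;
}
```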

3 Multi-GPU Programming Pattern
Problems:
1. Programming can be challenging.
2. Inter-GPU communication cost is high.
How to place the data across the GPUs' device memories? A. Split it, or B. Duplicate it.
[Figure: host data distributed across multiple GPUs' device memories, either split into chunks or duplicated on every GPU.]
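A rough CUDA sketch of option A (splitting the data) follows; the chunking scheme and the scale kernel are illustrative, and option B (duplication) would instead copy the whole array to every GPU.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

__global__ void scale(float *data, int n) {        // same illustrative kernel as above
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Option A: split the host array into one contiguous chunk per GPU.
void run_split(float *h_data, int n) {
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    if (num_gpus == 0) return;
    int chunk = (n + num_gpus - 1) / num_gpus;
    std::vector<float*> d_chunk(num_gpus, nullptr);

    for (int g = 0; g < num_gpus; ++g) {
        int offset = g * chunk;
        int len = std::min(chunk, n - offset);
        if (len <= 0) break;
        cudaSetDevice(g);                           // every call below targets GPU g
        cudaMalloc(&d_chunk[g], len * sizeof(float));
        cudaMemcpy(d_chunk[g], h_data + offset, len * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<(len + 255) / 256, 256>>>(d_chunk[g], len);
    }
    for (int g = 0; g < num_gpus; ++g) {            // gather results and clean up
        int offset = g * chunk;
        int len = std::min(chunk, n - offset);
        if (len <= 0) break;
        cudaSetDevice(g);
        cudaMemcpy(h_data + offset, d_chunk[g], len * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_chunk[g]);
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h_data(n, 1.0f);
    run_split(h_data.data(), n);
    return 0;
}
```

Even in this short sketch, the data placement decision and the per-GPU bookkeeping are explicit, which is exactly the programming burden the slide points to.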

4 Hybrid Memory Cube (HMC)
[Figure: HMC structure: stacked DRAM layers are partitioned into vaults, each managed by a vault controller on the logic layer; an intra-HMC network connects the vault controllers to I/O ports, which exchange packets over high-speed links.]

5 Memory Network
Memory network for multi-CPU systems [Kim et al., PACT'13].
[Figure: HMCs forward packets between their I/O ports over high-speed links, so the same network that connects a CPU to its HMCs can interconnect multiple GPUs and CPUs.]

6 Related Work
NVLink for the Nvidia Pascal architecture. Drawback: some processor bandwidth is dedicated to NVLink.
SLI (Nvidia) and CrossFire (AMD): graphics only.
Unified virtual addressing from Nvidia: easy access to another GPU's memory, but with restrictions on memory allocation.
[Figure: baseline system with GPU memory, GPUs, IO hub, CPU, PCIe switches, and NVLink.]

7 Contents
Motivation
Related work
Inter-GPU communication: scalable kernel execution (SKE), GPU memory network (GMN) design
CPU-GPU communication: unified memory network (UMN), overlay network architecture
Evaluation
Conclusion

8 GPU Memory Network Advantage
[Figure: with PCIe (15.75 GB/s) between GPUs, each GPU's device memory (288 GB/s) forms a separate physical address space; with a GPU memory network, the GPUs share a unified physical address space and PCIe becomes optional.]

9 Scalable Kernel Execution (SKE)
Executes an unmodified kernel on multiple GPUs; the GPUs need to support partial execution of a kernel.
Prior approaches partition the kernel through source transformation [Kim et al., PPoPP'11] [Lee et al., PACT'13] [Cabezas et al., PACT'14].
[Figure: single-GPU execution of the original kernel vs. multi-GPU execution with SKE, where a virtual GPU accepts the original, unpartitioned kernel and distributes it across the physical GPUs.]

10 Scalable Kernel Execution Implementation
[Figure: the thread blocks of a 1D kernel are divided into contiguous block ranges, one per GPU (block range for GPU 0, GPU 1, GPU 2, ...). The unmodified single-GPU application enqueues the original kernel metadata into a virtual GPU command queue; the SKE runtime appends a block range to that metadata and forwards it to each physical GPU's command queue (GPU0, GPU1, ...).]
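The slides assume hardware support for partial kernel execution; the sketch below is only a software approximation of the same block-range idea, using CUDA managed memory as a stand-in for the unified address space a GPU memory network would provide. The scale_range kernel, sizes, and launch scheme are illustrative, not the paper's mechanism.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// The kernel takes a block offset so each GPU executes only its block range;
// (blockIdx.x + block_offset) reproduces the block ID the original
// single-GPU launch would have used.
__global__ void scale_range(float *data, int n, int block_offset) {
    int i = (blockIdx.x + block_offset) * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 24, threads = 256;
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    if (num_gpus == 0) return 0;

    float *data;
    cudaMallocManaged(&data, n * sizeof(float));   // stands in for a unified address space
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    int total_blocks = (n + threads - 1) / threads;
    int per_gpu = (total_blocks + num_gpus - 1) / num_gpus;
    for (int g = 0; g < num_gpus; ++g) {           // assign a contiguous block range to each GPU
        int first = g * per_gpu;
        int count = std::min(per_gpu, total_blocks - first);
        if (count <= 0) break;
        cudaSetDevice(g);
        scale_range<<<count, threads>>>(data, n, first);
    }
    for (int g = 0; g < num_gpus; ++g) {           // wait for every GPU to finish its range
        cudaSetDevice(g);
        cudaDeviceSynchronize();
    }
    cudaFree(data);
    return 0;
}
```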

11 Memory Address Space Organization
[Figure: in the GPU virtual address space, consecutive cache lines are interleaved at fine granularity across the HMCs, which balances load; pages map to different GPUs' HMCs (page A to GPU X, page B to GPU Y, page C to GPU Z) and may be reached over minimal or non-minimal paths.]
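A small sketch of what cache-line-granularity interleaving across HMCs could look like; the 64 B line size, the HMC count, and the modulo mapping are assumptions for illustration, not the paper's exact address mapping.

```cuda
#include <cstdint>
#include <cstdio>

constexpr uint64_t kLineBytes = 64;   // assumed cache-line granularity
constexpr uint64_t kNumHMCs   = 16;   // e.g., 4 GPUs with 4 local HMCs each

// Which HMC owns a given physical address: consecutive lines go to consecutive HMCs.
uint64_t hmc_of(uint64_t paddr) {
    return (paddr / kLineBytes) % kNumHMCs;
}

// Address within the owning HMC.
uint64_t offset_in_hmc(uint64_t paddr) {
    uint64_t line = paddr / kLineBytes;
    return (line / kNumHMCs) * kLineBytes + (paddr % kLineBytes);
}

int main() {
    for (uint64_t a = 0; a < 4 * kLineBytes; a += kLineBytes)
        std::printf("paddr %llu -> HMC %llu, offset %llu\n",
                    (unsigned long long)a,
                    (unsigned long long)hmc_of(a),
                    (unsigned long long)offset_in_hmc(a));
    return 0;
}
```

The point of such fine-grained interleaving is that consecutive accesses spread evenly over all HMCs instead of hammering the one that holds a whole page.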

12 Multi-GPU Memory Network Topology
Three topologies are compared: the 2D flattened butterfly without concentration (FBFLY) [ISCA'07], the distributor-based flattened butterfly (dFBFLY) [PACT'13], and the proposed sliced flattened butterfly (sFBFLY), which load-balances the GPU channels and removes path diversity among local HMCs.
[Figure: GPU and HMC connectivity for FBFLY, dFBFLY, and sFBFLY.]

13 Contents
Motivation
Related work
Inter-GPU communication: scalable kernel execution (SKE), GPU memory network (GMN) design
CPU-GPU communication: unified memory network (UMN), overlay network architecture
Evaluation
Conclusion

14 Data Transfer Overhead
Problems:
1. CPU-GPU communication bandwidth is low.
2. Data transfer (memory copy) overhead.
[Figure: data must cross a low-bandwidth PCIe link to move from host memory to the GPU's device memory.]
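The copy overhead is easy to observe directly. The sketch below times a single host-to-device cudaMemcpy with CUDA events (the 256 MB transfer size is arbitrary); on a PCIe-attached GPU the reported bandwidth is far below the device-memory bandwidth.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 256ull << 20;          // 256 MB transfer
    float *h_data, *d_data;
    cudaMallocHost((void**)&h_data, bytes);     // pinned host buffer for a fair measurement
    cudaMalloc(&d_data, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // the copy being measured
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("Host-to-device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```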

15 Unified Memory Network
Removes the PCIe bottleneck between CPU and GPUs.
Eliminates memory copies between CPU and GPUs!
[Figure: instead of an IO hub and PCIe switches, the CPU joins the GPU memory network, forming a unified memory network shared by the CPU and all GPUs.]
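There is no exact CUDA analogue of the proposed unified memory network, but mapped pinned host memory gives a feel for the programming model it enables: the GPU dereferences host data directly and no cudaMemcpy is issued (today the accesses still cross PCIe, which is exactly the bottleneck the UMN removes). The kernel and sizes below are illustrative.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_data, *d_alias;

    cudaSetDeviceFlags(cudaDeviceMapHost);      // allow mapping host allocations into the GPU
    // Pinned, mapped host allocation: the GPU can dereference it directly,
    // so no explicit copy is needed (the interconnect still limits bandwidth).
    cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaHostGetDevicePointer((void**)&d_alias, h_data, 0);
    scale<<<(n + 255) / 256, 256>>>(d_alias, n);
    cudaDeviceSynchronize();

    cudaFreeHost(h_data);
    return 0;
}
```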

16 Overlay Network Architecture
CPUs are latency-sensitive; GPUs are bandwidth-sensitive.
[Figure: the overlay network combines off-chip links with an on-chip pass-thru path [PACT'13, FB-DIMM spec.] between the CPU and the GPUs.]

17 Methodology
GPGPU-sim version 3.2; SKE is assumed for the evaluation.
Configuration:
4 HMCs per CPU/GPU
8 bidirectional channels per CPU/GPU/HMC
PCIe BW: GB/s, latency: 600 ns
HMC: 4 GB, 8 layers, 16 vaults, 16 banks/vault, FR-FCFS
1 CPU and 4 GPUs assumed unless otherwise mentioned.
Evaluated systems (abbreviation: configuration):
PCIe: PCIe-based system with memcpy
GMN: GPU memory network-based system with memcpy
UMN: Unified memory network-based system (no copy)

18 SKE Performance with Different Designs
[Figure: results for selected workloads, grouped into compute-intensive and data-intensive (lower is better); up to an 82% reduction is highlighted.]

19 Impact of Removing Path Diversity b/w Local HMCs
[Figure: lower is better; annotated differences of 14% higher, 9% lower, and <1%.]

20 Scalability
[Figure: speedup vs. number of GPUs (higher is better); compute-intensive workloads scale up to 13.5x, while others flatten when the input size is not large enough.]

21 Conclusion
We addressed two critical problems in multi-GPU systems with memory networks.
Inter-GPU communication: improved bandwidth with the GPU memory network; Scalable Kernel Execution → improved programmability.
CPU-GPU communication: unified memory network → eliminates data transfers; overlay network architecture.
Our proposed designs improve both the performance and programmability of multi-GPU systems.

