
1 Multi-GPU System Design with Memory Networks. Gwangsun Kim, Minseok Lee, Jiyun Jeong, John Kim. Department of Computer Science, Korea Advanced Institute of Science and Technology.

2 Single-GPU Programming Pattern: [diagram: data is copied from host memory into the GPU's device memory before the kernel runs]

3 Multi-GPU Programming Pattern: [diagram: data in host memory must be distributed across the device memories of multiple GPUs]
How to place the data? A. Split B. Duplicate
Problems: 1. Programming can be challenging. 2. Inter-GPU communication cost is high.
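The slides contain no code, but a minimal CUDA sketch of the "split" placement helps show why multi-GPU programming without a unified address space is challenging. Everything here (the kernel scaleKernel, the two-GPU split, buffer names) is hypothetical and not from the presentation:

```cuda
// Minimal sketch (not from the slides): manually splitting an input array
// across two GPUs with plain CUDA.
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void runOnTwoGpus(const float *host, int n) {
    int half = n / 2;                        // assume n is even for brevity
    float *dev[2];
    for (int g = 0; g < 2; g++) {
        int count = (g == 0) ? half : n - half;
        cudaSetDevice(g);                    // each GPU gets its own slice
        cudaMalloc(&dev[g], count * sizeof(float));
        cudaMemcpy(dev[g], host + g * half, count * sizeof(float),
                   cudaMemcpyHostToDevice);  // explicit per-GPU copy over PCIe
        scaleKernel<<<(count + 255) / 256, 256>>>(dev[g], count);
    }
    // Copying results back and any access to the other GPU's slice needs
    // further explicit transfers (e.g., cudaMemcpyPeer), which is the
    // inter-GPU communication cost the slide points out.
}
```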

4 Hybrid Memory Cube (HMC): [diagram: DRAM layers stacked on a logic layer; packets arrive through I/O ports on high-speed links and are routed by the intra-HMC network to vault controllers, each managing a vault]

5 Memory Network: [diagram: the HMC's multiple high-speed I/O links allow HMCs to be connected into a memory network]
 Memory network for multi-CPU [Kim et al., PACT'13]

6 Related Work
 NVLink for Nvidia Pascal architecture
– Drawback: some processor bandwidth is dedicated to NVLink.
 SLI (Nvidia) and CrossFire (AMD)
– Graphics only.
 Unified virtual addressing from Nvidia
– Easy access to another GPU's memory
– Restriction in memory allocation.
[diagram: baseline system with GPUs and memory attached to the CPU's IO hub through PCIe switches, compared with GPUs linked by NVLink]

7 Contents
 Motivation
 Related work
 Inter-GPU communication
– Scalable kernel execution (SKE)
– GPU memory network (GMN) design
 CPU-GPU communication
– Unified memory network (UMN)
– Overlay network architecture
 Evaluation
 Conclusion

8 GPU Memory Network Advantage: [diagram: with PCIe (15.75 GB/s), each GPU has a separate physical address space in its own device memory; with a GPU memory network (288 GB/s), the GPUs share a unified physical address space and PCIe becomes optional]

9 Scalable Kernel Execution (SKE)
 Executes an unmodified kernel on multiple GPUs.
 GPUs need to support partial execution of a kernel.
[diagram: prior approaches partition the original kernel through source transformation ([Kim et al., PPoPP'11], [Lee et al., PACT'13], [Cabezas et al., PACT'14]); with SKE, the original single-GPU kernel is submitted to a virtual GPU and executed across multiple GPUs]

10 Scalable Kernel Execution Implementation: [diagram: the unmodified single-GPU application enqueues the original kernel into a virtual GPU command queue; the SKE runtime divides the 1D kernel's thread blocks into block ranges and forwards the original kernel metadata plus a block range to each physical GPU's command queue (GPU 0, GPU 1, ...)]
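SKE does this at the runtime and GPU level without touching kernel source. Purely to illustrate the block-range idea, the following CUDA sketch partitions a 1D grid in software; the kernel, the blockOffset parameter, and the assumption that all buffers are reachable from every GPU (as the memory network's unified address space would provide) are mine, not the paper's mechanism:

```cuda
// Software approximation of block-range partitioning (not the SKE hardware
// mechanism): each GPU launches only its share of the original grid, and an
// offset restores the global block index inside the kernel.
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c,
                       int n, int blockOffset) {
    // Reconstruct the block index this thread would have had in the
    // original single-GPU launch.
    int globalBlock = blockIdx.x + blockOffset;
    int i = globalBlock * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void launchPartitioned(const float *a, const float *b, float *c,
                       int n, int numGpus) {
    int threads = 256;
    int totalBlocks = (n + threads - 1) / threads;
    int blocksPerGpu = (totalBlocks + numGpus - 1) / numGpus;
    for (int g = 0; g < numGpus; g++) {
        int first = g * blocksPerGpu;
        int count = (totalBlocks - first < blocksPerGpu)
                        ? (totalBlocks - first) : blocksPerGpu;
        if (count <= 0) break;
        cudaSetDevice(g);
        // Each GPU executes only its block range of the same kernel.
        vecAdd<<<count, threads>>>(a, b, c, n, first);
    }
}
```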

11 Memory Address Space Organization: [diagram: page-granularity placement maps whole pages to particular GPUs (Page A → GPU X, Page B → GPU Y, Page C → GPU Z, ...); fine-grained interleaving instead spreads consecutive cache lines of the GPU virtual address space across the GPUs' memories for load balance; requests take minimal or non-minimal paths through the network]
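A toy model of fine-grained interleaving, just to make the mapping concrete; the cache-line size and HMC count below are assumptions for illustration, not the paper's exact parameters:

```cuda
// Illustrative cache-line interleaving: consecutive cache lines map to
// different HMCs, so traffic spreads evenly across the memory network.
#include <cstdint>
#include <cstdio>

const uint64_t CACHE_LINE_SIZE = 128;  // bytes per cache line (assumed)
const uint64_t NUM_HMCS = 16;          // e.g., 4 GPUs x 4 HMCs (assumed)

uint64_t homeHmc(uint64_t addr) {
    // The cache-line index, modulo the number of HMCs, selects the cube
    // that owns the line.
    return (addr / CACHE_LINE_SIZE) % NUM_HMCS;
}

int main() {
    for (uint64_t addr = 0; addr < 6 * CACHE_LINE_SIZE; addr += CACHE_LINE_SIZE)
        printf("cache line at 0x%llx -> HMC %llu\n",
               (unsigned long long)addr, (unsigned long long)homeHmc(addr));
    return 0;
}
```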

12 Multi-GPU Memory Network Topology
 Load-balanced GPU channels
 Remove path diversity among local HMCs
[diagram: distributor-based flattened butterfly (dFBFLY) [PACT'13] vs. sliced flattened butterfly (sFBFLY); figure labels: 50%, 43%, 33%, load-balanced]

13 Contents
 Motivation
 Related work
 Inter-GPU communication
– Scalable kernel execution (SKE)
– GPU memory network (GMN) design
 CPU-GPU communication
– Unified memory network (UMN)
– Overlay network architecture
 Evaluation
 Conclusion

14 Data Transfer Overhead: [diagram: data must be copied from host memory to GPU device memory over a low-bandwidth PCIe link]
Problems: 1. CPU-GPU communication BW is low. 2. Data transfer (or memory copy) overhead.
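To give a feel for this overhead, here is a small CUDA sketch that times the staging copy an application pays on a PCIe-attached GPU; the buffer size and kernel are hypothetical, and the measured time depends entirely on the machine:

```cuda
// Minimal sketch: the input must be staged into device memory before the
// kernel can run, and the result copied back afterwards.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void touch(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 1 << 24;                      // ~64 MB of floats (assumed)
    float *host = new float[n], *dev;
    for (int i = 0; i < n; i++) host[i] = 1.0f;
    cudaMalloc(&dev, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Over a ~15.75 GB/s PCIe link this copy alone takes several ms;
    // a unified memory network would let the GPU access the data in place.
    printf("host-to-device copy: %.2f ms\n", ms);

    touch<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev); delete[] host;
    return 0;
}
```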

15 Unified Memory Network
 Remove the PCIe bottleneck between the CPU and GPUs.
 Eliminate memory copy between the CPU and GPUs!
[diagram: the CPU is attached directly to the GPU memory network, replacing the IO hub and PCIe switches]

16 Overlay Network Architecture
 CPUs are latency-sensitive.
 GPUs are bandwidth-sensitive.

17 Methodology
 GPGPU-sim version 3.2
 Assume SKE for evaluation
 Configuration
– 4 HMCs per CPU/GPU
– 8 bidirectional channels per CPU/GPU/HMC
– PCIe BW: 15.75 GB/s, latency: 600 ns
– HMC: 4 GB, 8 layers, 16 vaults, 16 banks/vault, FR-FCFS
– Assume 1 CPU-4 GPU unless otherwise mentioned.
Abbreviation – Configuration:
PCIe – PCIe-based system with memcpy
GMN – GPU memory network-based system with memcpy
UMN – Unified memory network-based system (no copy)

18 SKE Performance with Different Designs: [chart: results for selected compute-intensive and data-intensive workloads; lower is better; the figure highlights an 82% reduction]

19 Impact of Removing Path Diversity between Local HMCs: [chart: lower is better; figure labels: 14% higher, 9% lower, <1% difference]

20 Scalability: [chart: performance vs. number of GPUs; higher is better; figure labels: 13.5x, compute-intensive, input size not large enough]

21 Conclusion
 We addressed two critical problems in multi-GPU systems with memory networks.
 Inter-GPU communication
– Improved bandwidth with the GPU memory network
– Scalable kernel execution improves programmability
 CPU-GPU communication
– Unified memory network eliminates data transfer
– Overlay network architecture
 Our proposed designs improve both the performance and programmability of multi-GPU systems.

