Multi-GPU Programming
Martin Kruliš (v1.1)
Multi-GPU Systems: Connecting Multiple GPUs to Host
- Workload division and management
- Sharing PCI-Express/host memory throughput
- Host architecture example (NUMA):
  [Figure: NUMA host topology – GPUs attached via PCIe to CPU/chipset nodes with local memory, interconnected by QPI]
Multi-GPU Systems: Detection and Selection
- cudaGetDeviceCount(), cudaSetDevice()
- Each device may be queried for its properties individually
  - cudaGetDeviceProperties(), cudaDeviceGetAttribute()
- A stream may be created for each device
  - The streams then determine which device each operation executes on
- Automatic selection of the optimal device
  - cudaChooseDevice(&device, &props)
- Selecting devices by their physical layout
  - cudaDeviceGetByPCIBusId(&device, pciId)
  - cudaDeviceGetPCIBusId()

Note: The devices visible to the application can be restricted by the CUDA_VISIBLE_DEVICES environment variable, which holds a list of integers identifying the permitted devices. The application always enumerates the visible devices as 0…N-1, where N is the number of visible devices. (A basic enumeration-and-selection sketch follows.)
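A minimal sketch of device enumeration and selection with the runtime API; the chosen device index (0 here) is an arbitrary example:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp props;
        cudaGetDeviceProperties(&props, dev);
        printf("Device %d: %s (CC %d.%d)\n",
               dev, props.name, props.major, props.minor);
    }
    cudaSetDevice(0);  // subsequent CUDA calls in this host thread target device 0
    return 0;
}
```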
Workload Division: Task Management
- Task management is similar on GPU and CPU
  - E.g., each task must have a sufficient size
- Static task scheduling
  - Works only in special cases (e.g., all tasks have the same size and all GPUs are identical)
- Dynamic task scheduling
  - Oversubscription – many more tasks than devices
  - Tasks are dispatched to devices as they become available
  - More complex on GPUs, since the copy-work-copy pipeline must be maintained (see the sketch below)
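A minimal dynamic-scheduling sketch, assuming one host thread per GPU pulling task indices from a shared atomic counter; workKernel, the task count, and the fixed task size are hypothetical stand-ins for real per-task work:

```cpp
#include <atomic>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void workKernel(float *data, int n)  // hypothetical stand-in workload
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// One copy-work-copy round for a single task (illustrative only).
static void processTask(float *host, float *buffer, int n, cudaStream_t s)
{
    cudaMemcpyAsync(buffer, host, n * sizeof(float), cudaMemcpyHostToDevice, s);
    workKernel<<<(n + 255) / 256, 256, 0, s>>>(buffer, n);
    cudaMemcpyAsync(host, buffer, n * sizeof(float), cudaMemcpyDeviceToHost, s);
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    const int taskCount = 256;     // oversubscription: many more tasks than GPUs
    const int n = 1 << 20;         // elements per task (arbitrary)
    std::atomic<int> nextTask{0};  // shared counter acts as a trivial task queue

    std::vector<std::thread> workers;
    for (int dev = 0; dev < deviceCount; ++dev) {
        workers.emplace_back([&, dev] {
            cudaSetDevice(dev);    // bind this host thread to one GPU
            cudaStream_t stream;
            cudaStreamCreate(&stream);
            float *host, *buffer;
            cudaMallocHost(&host, n * sizeof(float));
            cudaMalloc(&buffer, n * sizeof(float));
            int task;
            while ((task = nextTask.fetch_add(1)) < taskCount) {
                processTask(host, buffer, n, stream);
                cudaStreamSynchronize(stream);  // wait before reusing the buffers
            }
            cudaFree(buffer);
            cudaFreeHost(host);
            cudaStreamDestroy(stream);
        });
    }
    for (auto &t : workers) t.join();
    return 0;
}
```

Synchronizing the stream after each task keeps the sketch simple; a real pipeline would double-buffer so that transfers overlap kernel execution.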
Peer-to-Peer Transfers
- Copying memory between devices (see the sketch below)
  - Special functions copy memory directly between two devices
  - cudaMemcpyPeer(dst, dstDev, src, srcDev, size)
  - cudaMemcpyPeerAsync(…, stream)
  - The synchronous version is asynchronous with respect to the host, but it is synchronized with other asynchronous operations, so it acts as a barrier on both devices
- Portable memory allocation
  - Page-locked host memory intended for use with multiple GPUs
  - The cudaHostAllocPortable flag must be used
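A sketch of both peer-copy variants plus a portable pinned buffer; the buffer size and device indices are illustrative:

```cpp
#include <cuda_runtime.h>

int main()
{
    const size_t size = 1 << 20;
    float *src, *dst, *pinned;

    cudaSetDevice(0);
    cudaMalloc(&src, size);
    cudaSetDevice(1);
    cudaMalloc(&dst, size);

    // Direct device-to-device copy (src on device 0, dst on device 1).
    cudaMemcpyPeer(dst, 1, src, 0, size);

    // Asynchronous variant, ordered by a stream.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyPeerAsync(dst, 1, src, 0, size, stream);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);

    // Page-locked host buffer usable by all devices.
    cudaHostAlloc(&pinned, size, cudaHostAllocPortable);
    cudaFreeHost(pinned);
    return 0;
}
```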
Peer-to-Peer Memory Access
- Direct inter-GPU data exchange (see the sketch below)
  - Possibly without staging the data in host memory
  - Since CC 2.0 (Tesla devices), 64-bit processes only
  - cudaDeviceCanAccessPeer()
  - cudaDeviceEnablePeerAccess()
- Unified virtual address space
  - Host and device buffers share a single virtual address space
  - The unifiedAddressing device property must be 1
  - cudaPointerGetAttributes()
  - Devices can directly use cudaHostAlloc() pointers
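A sketch of enabling peer access and inspecting a pointer under unified virtual addressing; the device indices are illustrative:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can device 0 access device 1?
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);       // second argument (flags) must be 0
    }

    // With unified addressing, the runtime can tell where a pointer lives.
    cudaSetDevice(1);
    float *buf;
    cudaMalloc(&buf, 1024 * sizeof(float));
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, buf);
    printf("pointer belongs to device %d\n", attr.device);

    // Kernels running on device 0 may now dereference buf directly.
    return 0;
}
```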
Inter-Process Communication (IPC)
- CUDA resources are restricted to the process that created them
  - Device buffer pointers, events, …
- Multi-process sharing may nevertheless be necessary (e.g., when integrating multiple CUDA applications)
- The IPC API allows sharing these resources (see the sketch below)
  - cudaIpcGetMemHandle(), cudaIpcGetEventHandle() return a cudaIpcMemHandle_t or cudaIpcEventHandle_t handle
  - The handle can be transferred to another process via regular IPC mechanisms
  - cudaIpcOpenMemHandle(), cudaIpcOpenEventHandle() open a handle passed on from another process
  - cudaIpcCloseMemHandle()
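A sketch of the memory-handle round trip, split into the exporting and the importing side; the transport of the handle between the processes (pipe, socket, …) is left out:

```cpp
#include <cuda_runtime.h>

// Exporting process: create a device buffer and obtain an IPC handle for it.
cudaIpcMemHandle_t exportBuffer(float **devPtr, size_t size)
{
    cudaMalloc(devPtr, size);
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, *devPtr);
    return handle;  // send the handle to the other process via any IPC channel
}

// Importing process: map the buffer received from the exporting process.
void importBuffer(cudaIpcMemHandle_t handle)
{
    void *devPtr;
    cudaIpcOpenMemHandle(&devPtr, handle, cudaIpcMemLazyEnablePeerAccess);
    // ... launch kernels using devPtr ...
    cudaIpcCloseMemHandle(devPtr);
}
```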
Discussion