Hybrid MPI/CUDA: Scaling Accelerator Code
Blue Waters Undergraduate Petascale Education Program (BWUPEP2011), UIUC, May 29 – June
Why Hybrid CUDA?
- CUDA is fast! (for some problems)
- CUDA on a single card is like OpenMP: it doesn't scale beyond one machine
- MPI alone can only scale so far:
  - excessive power consumption
  - communication overhead
  - a large amount of work remains for each node
- What if you could harness the power of multiple accelerators across multiple MPI processes?
Hybrid Architectures
- Tesla S1050: 1 GPU, connected directly to a node
  - Earlham (as11 & as12)
- Tesla S1070: a server node with 4 GPUs, typically connected via PCI-E to 2 nodes
  - OU has some of these
  - NCSA (192 nodes)
  - NCSA Accelerator Cluster (32 nodes)
[Diagram: a node with its RAM and an attached GPU]
MPI/CUDA Approach
CUDA will be:
- doing the computational heavy lifting
- dictating your algorithm & parallel layout (data parallel)
Therefore:
- design the CUDA portions first
- use MPI to move work to each node
Implementation
- Do as much work as possible on the GPU before bringing data back to the CPU and communicating it
- Sometimes you won't have a choice…
- Debugging tips:
  - Develop/test/debug the one-node version first
  - Then test it with multiple nodes to verify communication

  move data to each node
  while not done:
      copy data to GPU
      do work <<< >>>
      get new state out of GPU
      communicate with others
  aggregate results from all nodes
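The loop above might look like this in code — a rough skeleton only, where `do_work`, `done()`, and `exchange_with_neighbors()` are placeholder names, not real API calls:

```cuda
// Skeleton of the hybrid pattern above; the kernel, the termination test,
// and the neighbor exchange are all placeholders for illustration.
void hybrid_loop(float *data, float *result, int n)
{
    MPI_Bcast(data, n, MPI_FLOAT, 0, MPI_COMM_WORLD);    // move data to each node

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    while (!done()) {
        cudaMemcpy(d_data, data, n * sizeof(float),
                   cudaMemcpyHostToDevice);              // copy data to GPU
        do_work<<<(n + 255) / 256, 256>>>(d_data, n);    // do work
        cudaMemcpy(data, d_data, n * sizeof(float),
                   cudaMemcpyDeviceToHost);              // get new state out of GPU
        exchange_with_neighbors(data, n);                // communicate with others
    }

    cudaFree(d_data);
    MPI_Reduce(data, result, n, MPI_FLOAT, MPI_SUM,
               0, MPI_COMM_WORLD);                       // aggregate results
}
```

Note how all GPU work stays inside the loop and the expensive host-side communication happens only once per iteration, matching the "do as much as possible on the GPU" advice.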
Multi-GPU Programming
- A CPU thread can only have a single active context to communicate with a GPU
- cudaGetDeviceCount(int * count)
- cudaSetDevice(int device)
- Be careful using MPI rank alone: the device count only counts the cards visible from each node
- Use MPI_Get_processor_name() to determine which processes are running where
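One way to put those calls together — a hedged sketch, assuming one MPI process per GPU; the helper name and the local-rank scheme are illustrative, not from the slides:

```cuda
// Sketch: give each MPI process its own GPU by computing a per-node
// "local rank" from hostnames, instead of using the global rank directly.
#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

void select_gpu(void)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Gather every process's hostname so each can see who shares its node
    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(name, &len);

    char *all = (char *)malloc(size * MPI_MAX_PROCESSOR_NAME);
    MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                  all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

    // Local rank = number of lower-ranked processes on the same node
    int local_rank = 0;
    for (int i = 0; i < rank; i++)
        if (strcmp(all + i * MPI_MAX_PROCESSOR_NAME, name) == 0)
            local_rank++;
    free(all);

    int count;
    cudaGetDeviceCount(&count);            // counts only this node's cards
    cudaSetDevice(local_rank % count);     // stay within visible devices
}
```

Calling this once after MPI_Init() avoids the trap of two processes on the same node binding to the same card.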
Compiling
- CUDA needs nvcc; MPI needs mpicc
- Dirty trick: wrap mpicc with nvcc
  - nvcc processes the .cu files and sends the rest to its wrapped compiler
- The kernel, kernel invocation, cudaMalloc, etc. are all best off in a .cu file somewhere
- MPI calls should be in .c files
- There are workarounds, but this is the simplest approach

  nvcc --compiler-bindir mpicc main.c kernel.cu
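A minimal sketch of what the .cu side of that split can look like — the kernel and all names here are illustrative assumptions, not code from the slides:

```cuda
// kernel.cu -- CUDA code lives here; MPI code stays in main.c.
// main.c declares:  void launch_scale(float *data, int n, float factor);
// and calls it like any C function. extern "C" prevents C++ name
// mangling so the C-compiled main.c can link against this wrapper.
#include <cuda_runtime.h>

__global__ void scale(float *d, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= factor;
}

extern "C" void launch_scale(float *host_data, int n, float factor)
{
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, host_data, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n, factor);
    cudaMemcpy(host_data, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```

With this layout, the single nvcc --compiler-bindir mpicc command above compiles both files and links them in one step.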
Executing
- Typically one MPI process per available GPU
- On AC, each node has 4 GPUs, and ppn should correspond to the number of GPUs requested; this requests a total of 8 GPUs on 2 nodes:

  #BSUB -l nodes=2:tesla:cuda3.2:ppn=4

- On Sooner (OU), each node has 2 GPUs available, so ppn should be 2:

  #BSUB -R "select[cuda > 0]"
  #BSUB -R "rusage[cuda=2]"
  #BSUB -l nodes=1:ppn=2
Hybrid CUDA Lab
- We already have Area Under a Curve code for MPI and CUDA independently
- You can write a hybrid code that has each GPU calculate a portion of the area, then use MPI to combine the subtotals into the complete area
- Otherwise, feel free to take any code we've used so far and experiment!
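A hedged sketch of how that hybrid lab could be structured — the integrand f(x) = x², the interval, and all names are assumptions for illustration, not the course's actual Area Under a Curve code:

```cuda
// Sketch of the hybrid lab: each rank integrates its own slice of [a, b]
// on its GPU with the midpoint rectangle rule, then MPI_Reduce combines
// the per-rank subtotals on rank 0.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__device__ float f(float x) { return x * x; }   // assumed integrand

__global__ void areas(float *out, float x0, float dx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = f(x0 + (i + 0.5f) * dx) * dx;   // one rectangle each
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const float a = 0.0f, b = 1.0f;
    const int n = 1 << 20;                  // rectangles per rank
    float dx = (b - a) / ((float)n * size);
    float x0 = a + rank * n * dx;           // this rank's slice starts here

    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));
    areas<<<(n + 255) / 256, 256>>>(d_out, x0, dx, n);

    // Simple host-side sum of this rank's rectangle areas
    float *h = (float *)malloc(n * sizeof(float));
    cudaMemcpy(h, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    float sub = 0.0f;
    for (int i = 0; i < n; i++) sub += h[i];

    float total;
    MPI_Reduce(&sub, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("area = %f\n", total);   // should approach 1/3

    free(h); cudaFree(d_out);
    MPI_Finalize();
    return 0;
}
```

The host-side sum keeps the sketch short; a natural extension for the lab is to do the per-rank reduction on the GPU as well before the single MPI_Reduce.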