
Slide 1: Hybrid MPI/CUDA - Scaling accelerator code
Blue Waters Undergraduate Petascale Education Program (BWUPEP2011), UIUC, May 29 - June 10, 2011

Slide 2: Why Hybrid CUDA?
- CUDA is fast! (for some problems)
- CUDA on a single card is like OpenMP: it doesn't scale beyond one machine
- MPI alone can only scale so far
  - Excessive power consumption
  - Communication overhead
  - A large amount of work remains for each node
- What if you could harness multiple accelerators across multiple MPI processes?

Slide 3: Hybrid Architectures
- Tesla S1050 connected to nodes
  - 1 GPU, connected directly to a node
  - Al-Salam @ Earlham (as11 & as12)
- Tesla S1070
  - A server node with 4 GPUs, typically connected via PCI-E to 2 nodes
  - Sooner @ OU has some of these
  - Lincoln @ NCSA (192 nodes)
  - Accelerator Cluster (AC) @ NCSA (32 nodes)
(Slide diagram: a node with its RAM and an attached GPU)

Slide 4: MPI/CUDA Approach
- CUDA will be:
  - Doing the computational heavy lifting
  - Dictating your algorithm & parallel layout (data parallel)
- Therefore:
  - Design the CUDA portions first
  - Use MPI to move work to each node

Slide 5: Implementation
- Do as much work as possible on the GPU before bringing data back to the CPU and communicating it
  - Sometimes you won't have a choice...
- Debugging tips:
  - Develop/test/debug the one-node version first
  - Then test it with multiple nodes to verify the communication
- Overall structure (a concrete sketch follows below):

    move data to each node
    while not done:
        copy data to GPU
        do work <<<...>>> (kernel launch)
        get new state out of GPU
        communicate with others
    aggregate results from all nodes
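The loop above maps fairly directly onto code. Below is a minimal sketch in MPI + CUDA C (it would live in a .cu file, per the compiling slide); the do_work kernel, the fixed iteration count used as the stopping test, and the MPI_Allreduce standing in for "communicate with others" are all placeholder assumptions, not part of the original slides.

    #include <mpi.h>
    #include <cuda_runtime.h>

    #define MAX_ITERS 100          /* placeholder stopping test */

    __global__ void do_work(float *d_state, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d_state[i] *= 2.0f;    /* stand-in for the real computation */
    }

    void run(float *h_state, int n)
    {
        float *d_state;
        cudaMalloc((void **)&d_state, n * sizeof(float));

        int iter = 0, done = 0;
        while (!done) {
            /* copy data to GPU */
            cudaMemcpy(d_state, h_state, n * sizeof(float), cudaMemcpyHostToDevice);

            /* do work */
            do_work<<<(n + 255) / 256, 256>>>(d_state, n);

            /* get new state out of GPU */
            cudaMemcpy(h_state, d_state, n * sizeof(float), cudaMemcpyDeviceToHost);

            /* communicate with others: stop only when every rank is done */
            int local_done = (++iter >= MAX_ITERS);
            MPI_Allreduce(&local_done, &done, 1, MPI_INT, MPI_LAND, MPI_COMM_WORLD);
        }

        cudaFree(d_state);
    }

The "move data to each node" and "aggregate results from all nodes" steps would happen outside this routine, for example with an MPI scatter before the loop and a reduce afterward.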

Slide 6: Multi-GPU Programming
- A CPU thread can only have a single active context for communicating with a GPU
- cudaGetDeviceCount(int *count)
- cudaSetDevice(int device)
- Be careful using MPI rank alone: the device count only counts the cards visible from each node
- Use MPI_Get_processor_name() to determine which processes are running where (see the sketch below)
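One common pattern is to turn the processor name into a node-local rank and use that to pick a device, as sketched below. The slides only point at cudaGetDeviceCount(), cudaSetDevice(), and MPI_Get_processor_name(); the Allgather-and-count scheme here is an assumption. The routine would be called once right after MPI_Init(), before any CUDA work, from a .cu file.

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>
    #include <cuda_runtime.h>

    /* Give each MPI process its own GPU on whatever node it landed on. */
    void select_gpu(void)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        memset(name, 0, sizeof(name));
        MPI_Get_processor_name(name, &len);

        /* Gather every rank's node name so each process can see who shares its node. */
        char *all = (char *)malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
        MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                      all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

        /* Node-local rank = how many lower-numbered ranks sit on the same node. */
        int local_rank = 0;
        for (int r = 0; r < rank; r++)
            if (strcmp(all + r * MPI_MAX_PROCESSOR_NAME, name) == 0)
                local_rank++;
        free(all);

        /* Map the node-local rank onto the GPUs this node can actually see. */
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        if (ndev > 0)
            cudaSetDevice(local_rank % ndev);
    }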

Slide 7: Compiling
- CUDA needs nvcc; MPI needs mpicc
- Dirty trick: wrap mpicc with nvcc
  - nvcc processes the .cu files and sends the rest to its wrapped host compiler
- The kernel, the kernel invocation, cudaMalloc, etc. are all best off in a .cu file somewhere
- MPI calls should be in .c files
- There are workarounds, but this is the simplest approach:

    nvcc --compiler-bindir mpicc main.c kernel.cu
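As a concrete (and hypothetical) illustration of that split, the sketch below keeps the GPU code in kernel.cu behind an extern "C" launcher, so the C file that mpicc compiles can call it without running into C++ name mangling; main.c keeps the MPI calls. The function names are invented for the example, and both files build with the single nvcc command shown above.

    /* kernel.cu -- compiled by nvcc */
    #include <cuda_runtime.h>

    __global__ void scale(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    /* extern "C" keeps the symbol un-mangled so main.c can link against it. */
    extern "C" void launch_scale(float *h, int n)
    {
        float *d;
        cudaMalloc((void **)&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(d, n);
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d);
    }

    /* main.c -- compiled by mpicc through nvcc's wrapped host compiler */
    #include <mpi.h>

    void launch_scale(float *h, int n);    /* defined in kernel.cu */

    int main(int argc, char **argv)
    {
        int i;
        float data[256];

        MPI_Init(&argc, &argv);
        for (i = 0; i < 256; i++) data[i] = (float)i;
        launch_scale(data, 256);           /* all GPU work hidden behind this call */
        MPI_Finalize();
        return 0;
    }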

Slide 8: Executing
- Typically run one MPI process per available GPU
- On Sooner (OU), each node has 2 GPUs available, so ppn should be 2:

    #BSUB -R "select[cuda > 0]"
    #BSUB -R "rusage[cuda=2]"
    #BSUB -l nodes=1:ppn=2

- On AC, each node has 4 GPUs, and the GPUs used correspond to the number of processors requested, so this requests a total of 8 GPUs on 2 nodes:

    #BSUB -l nodes=2:tesla:cuda3.2:ppn=4

Slide 9: Hybrid CUDA Lab
- We already have Area Under a Curve code for MPI and for CUDA independently
- You can write a hybrid code that has each GPU calculate a portion of the area, then use MPI to combine the subtotals into the complete area (a sketch follows below)
- Otherwise, feel free to take any code we've used so far and experiment!
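Here is one hedged sketch of that hybrid, assuming f(x) = x*x on [0, 1], the midpoint rectangle rule, and a process count that divides the rectangle count evenly; none of those details come from the lab description. Each rank's GPU reduces its rectangles to per-block sums, the host finishes the per-rank subtotal, and MPI_Reduce combines the subtotals.

    /* hybrid_area.cu -- each rank integrates its own sub-interval on its GPU,
     * then MPI_Reduce sums the per-rank subtotals into the complete area. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void partial_area(double a, double dx, long n, double *block_sums)
    {
        __shared__ double cache[256];
        long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
        double x = a + (i + 0.5) * dx;                     /* midpoint of rectangle i */
        cache[threadIdx.x] = (i < n) ? x * x * dx : 0.0;   /* f(x) = x*x */
        __syncthreads();

        /* Reduce within the block. */
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) block_sums[blockIdx.x] = cache[0];
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long   n_total = 1L << 24;       /* total rectangles (assumes size divides this) */
        const double dx      = 1.0 / n_total;  /* integrating over [0, 1] */
        long   n = n_total / size;             /* rectangles handled by this rank */
        double a = rank * n * dx;              /* left edge of this rank's sub-interval */

        int threads = 256;
        int blocks  = (int)((n + threads - 1) / threads);

        double *d_sums, *h_sums = (double *)malloc(blocks * sizeof(double));
        cudaMalloc((void **)&d_sums, blocks * sizeof(double));

        partial_area<<<blocks, threads>>>(a, dx, n, d_sums);
        cudaMemcpy(h_sums, d_sums, blocks * sizeof(double), cudaMemcpyDeviceToHost);

        double my_area = 0.0, area = 0.0;
        for (int b = 0; b < blocks; b++) my_area += h_sums[b];

        /* MPI combines the per-GPU subtotals into the complete area on rank 0. */
        MPI_Reduce(&my_area, &area, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("area under x*x on [0,1] = %f\n", area);

        cudaFree(d_sums);
        free(h_sums);
        MPI_Finalize();
        return 0;
    }

In practice you would combine this with the device-selection sketch from slide 6 so that ranks sharing a node end up on different GPUs.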

