CS 179: GPU Programming
Lecture 14: Inter-process Communication
The Problem
What if we want to use GPUs across a distributed system?
[Image: GPU cluster, CSIRO]
Distributed System
A collection of computers
Each computer has its own local memory!
Communication happens over a network
Communication suddenly becomes harder! (and slower!)
GPUs can't be trivially used between computers
We need some way to communicate between processes…
Message Passing Interface (MPI)
A standard for message passing
Multiple implementations exist
Standard functions that allow easy communication of data between processes
Works on non-networked systems too! (equivalent to memory copying on the local system)
MPI Functions
There are seven basic functions:
MPI_Init – initialize MPI environment
MPI_Finalize – terminate MPI environment
MPI_Comm_size – how many processes we have running
MPI_Comm_rank – the ID of our process
MPI_Isend – send data (non-blocking)
MPI_Irecv – receive data (non-blocking)
MPI_Wait – wait for a request to complete
MPI Functions
Some additional functions:
MPI_Barrier – wait for all processes to reach a certain point
MPI_Bcast – send data to all other processes
MPI_Reduce – receive data from all processes and reduce to a value
MPI_Send – send data (blocking)
MPI_Recv – receive data (blocking)
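As a quick illustration of the collective calls above, here is a minimal sketch (not from the original slides) that broadcasts a parameter from process 0 to everyone and then synchronizes; num_timesteps is just a made-up example parameter.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Hypothetical parameter: only rank 0 knows it at first.
    int num_timesteps = 0;
    if (rank == 0)
        num_timesteps = 1000;

    // MPI_Bcast copies the value from the root (rank 0) to every other process.
    MPI_Bcast(&num_timesteps, 1, MPI_INT, 0, MPI_COMM_WORLD);

    // MPI_Barrier: no process continues past this point until all have reached it.
    MPI_Barrier(MPI_COMM_WORLD);

    printf("rank %d sees num_timesteps = %d\n", rank, num_timesteps);

    MPI_Finalize();
    return 0;
}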
The workings of MPI
Big idea:
We have some process ID
We know how many processes there are
A send request / receive request pair is necessary to communicate data!
MPI Functions – Init
int MPI_Init( int *argc, char ***argv )
Takes in pointers to the command-line parameters (more on this later)
MPI Functions – Size, Rank
int MPI_Comm_size( MPI_Comm comm, int *size )
int MPI_Comm_rank( MPI_Comm comm, int *rank )
Take in a "communicator"
For us, this will be MPI_COMM_WORLD – essentially, our communication group contains all processes
Take in a pointer to where we will store our size (or rank)
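Putting these together, a minimal skeleton (a sketch, not from the slides) looks like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    // Initialize the MPI environment before any other MPI call.
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's ID, 0 .. size-1

    printf("Hello from process %d of %d\n", rank, size);

    // Terminate the MPI environment; no MPI calls are allowed after this.
    MPI_Finalize();
    return 0;
}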
MPI Functions – Send
int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
Non-blocking send
The program can proceed without waiting for the operation to complete!
(Can use MPI_Send for a "blocking" send)
MPI Functions – Send
int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
Takes in…
A pointer to the data we send
Size of the data (number of elements, count)
Type of the data (special MPI designations: MPI_INT, MPI_FLOAT, etc.)
Destination process ID
…
MPI Functions – Send
int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
Takes in… (continued)
A "tag" – a nonnegative integer
Uniquely identifies our operation – used to identify messages when many are sent simultaneously
MPI_ANY_TAG allows all possible incoming messages (on the receive side)
A "communicator" (MPI_COMM_WORLD)
A "request" object
Holds information about our operation
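One practical consequence of the non-blocking semantics (my note, not from the slides): the send buffer must not be modified until the request completes. A fragment-level sketch, assuming MPI is already initialized, this process is rank 0, and rank 1 posts a matching receive; value, request, and status are made-up names:

int value = 42;            // data to send (example value)
MPI_Request request;
MPI_Status status;

// Start the send; this call returns immediately.
MPI_Isend(&value, 1, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD, &request);

// ... other work can happen here, but 'value' must not be overwritten yet ...

// Only after MPI_Wait returns is it safe to reuse the send buffer.
MPI_Wait(&request, &status);
value = 0;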
MPI Functions – Recv
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
Non-blocking receive (similar idea)
(Can use MPI_Recv for a "blocking" receive)
MPI Functions – Recv
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
Takes in…
A buffer (pointer to where received data is placed)
Size of the incoming data (number of elements)
Type of the data (special MPI designations: MPI_INT, MPI_FLOAT, etc.)
Sender's process ID
…
MPI Functions – Recv
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
Takes in… (continued)
A "tag"
Should be the same tag as the one used to send the message! (assuming MPI_ANY_TAG is not in use)
A "communicator" (MPI_COMM_WORLD)
A "request" object
Blocking vs. Non-blocking
MPI_Isend and MPI_Irecv are non-blocking communications
Calling these functions returns immediately
The operation may not be finished!
Should use MPI_Wait to make sure operations are completed
Special "request" objects are used for tracking status
MPI_Send and MPI_Recv are blocking communications
These functions don't return until the operation is complete
Can cause deadlock! (we won't focus on these)
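To see why the blocking calls can deadlock, consider this sketch (my illustration, not the lecture's; rank, big_buffer, recv_buffer, N, and status are assumed to exist). If both processes call MPI_Send first, each send can block waiting for a matching receive that is never posted, especially for large messages:

int other = (rank == 0) ? 1 : 0;
MPI_Send(big_buffer, N, MPI_FLOAT, other, 0, MPI_COMM_WORLD);            // may never return
MPI_Recv(recv_buffer, N, MPI_FLOAT, other, 0, MPI_COMM_WORLD, &status);  // never reached

// The non-blocking version avoids this: start both operations, then wait.
MPI_Request reqs[2];
MPI_Irecv(recv_buffer, N, MPI_FLOAT, other, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(big_buffer,  N, MPI_FLOAT, other, 0, MPI_COMM_WORLD, &reqs[1]);
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);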
MPI Functions – Wait
int MPI_Wait(MPI_Request *request, MPI_Status *status)
Takes in…
A "request" object corresponding to a previous operation
Indicates what we're waiting on
A "status" object
Basically, information about the incoming data
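A small sketch of how the status object can be used after the wait completes (my example, not from the slides); MPI_Get_count, MPI_SOURCE, and MPI_TAG are standard MPI, while request is assumed to come from an MPI_Irecv posted earlier:

MPI_Status status;
MPI_Wait(&request, &status);   // block until the earlier Irecv has finished

int received;
MPI_Get_count(&status, MPI_INT, &received);   // how many MPI_INTs actually arrived
printf("got %d ints from rank %d (tag %d)\n",
       received, status.MPI_SOURCE, status.MPI_TAG);

// Alternative when the status is not needed:
// MPI_Wait(&request, MPI_STATUS_IGNORE);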
MPI Functions – Reduce
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
Takes in…
A "send buffer" (data contributed by every process)
A "receive buffer" (where the final result will be placed)
Number of elements in the send buffer
Can reduce element-wise: array -> array
Type of data (MPI label, as before)
…
MPI Functions – Reduce
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
Takes in… (continued)
The reducing operation (special MPI labels, e.g. MPI_SUM, MPI_MIN)
ID of the process that obtains the result (root)
MPI communicator (as before)
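A minimal sum-reduction sketch (my example, not from the slides), assuming MPI is initialized and rank holds this process's ID: every process contributes one value, and rank 0 receives the total.

int local_value = rank;   // each process contributes something (here, its own rank)
int global_sum = 0;       // only meaningful on the root process afterward

// Element-wise sum across all processes; the result lands on the root (rank 0).
MPI_Reduce(&local_value, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

if (rank == 0)
    printf("sum of ranks = %d\n", global_sum);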
MPI Example
Two processes; sends a number from process 0 to process 1
Note: both processes run this same code!

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, numprocs;
    MPI_Status status;
    MPI_Request request;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int tag = 1234;
    int source = 0;
    int destination = 1;
    int count = 1;
    int send_buffer;
    int recv_buffer;

    if (rank == source) {
        send_buffer = 5678;
        MPI_Isend(&send_buffer, count, MPI_INT, destination, tag,
                  MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
        printf("processor %d sent %d\n", rank, send_buffer);
    }

    if (rank == destination) {
        MPI_Irecv(&recv_buffer, count, MPI_INT, source, tag,
                  MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
        printf("processor %d got %d\n", rank, recv_buffer);
    }

    MPI_Finalize();
    return 0;
}
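To try this example (not covered on the slide itself), the usual workflow is to compile with the MPI compiler wrapper and launch exactly two processes; mpi_example.c is just an assumed file name:

mpicc mpi_example.c -o mpi_example
mpirun -np 2 ./mpi_example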
Solving Problems
Goal: to use GPUs across multiple computers!
Two levels of division:
Divide work between processes (using MPI)
Divide work within a GPU (using CUDA, as usual)
Important note: single-program, multiple-data (SPMD) for our purposes
i.e. running multiple copies of the same program!
Multiple Computers, Multiple GPUs
Can combine MPI, CUDA streams/device selection, and single-GPU CUDA!
MPI/CUDA Example
Suppose we have our "polynomial problem" again
Suppose we have n processes for n computers, each with one GPU
(Suppose the array of data is obtained through process 0)
Every process will run the following:
MPI/CUDA Example (pseudocode)
MPI_Init
declare rank, numberOfProcesses
set rank with MPI_Comm_rank
set numberOfProcesses with MPI_Comm_size
if process rank is 0:
    array <- (obtain our raw data)
    for all processes idx = 0 through numberOfProcesses - 1:
        Send fragment idx of the array to process ID idx using MPI_Isend
        (this fragment will have size (size of array) / numberOfProcesses)
declare local_data
Read into local_data for our process using MPI_Irecv
MPI_Wait for our process's receive request to finish
local_sum <- ... (process local_data with CUDA and obtain a sum)
declare global_sum
set global_sum on process 0 with MPI_Reduce
(use global_sum somehow)
MPI_Finalize
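A rough C sketch of the same flow (my sketch, not the lecture's actual code): gpu_partial_sum() is a stub standing in for whatever CUDA kernel computes the local sum, total_size is a made-up element count, and it is assumed to divide evenly among the processes.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

// Stub standing in for the CUDA side: in the real program this would copy
// 'data' to the GPU, launch a reduction kernel, and copy the result back.
static float gpu_partial_sum(const float *data, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += data[i];
    return s;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, numberOfProcesses;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numberOfProcesses);

    const int total_size = 1 << 20;                         // assumed total element count
    const int local_size = total_size / numberOfProcesses;  // assumed to divide evenly

    float *array = NULL;
    MPI_Request *send_reqs = NULL;

    // Process 0 owns the full array and sends one fragment to every process
    // (including itself, for simplicity).
    if (rank == 0) {
        array = malloc(total_size * sizeof(float));
        for (int i = 0; i < total_size; i++)
            array[i] = 1.0f;                                // placeholder "raw data"
        send_reqs = malloc(numberOfProcesses * sizeof(MPI_Request));
        for (int idx = 0; idx < numberOfProcesses; idx++)
            MPI_Isend(array + idx * local_size, local_size, MPI_FLOAT,
                      idx, 0, MPI_COMM_WORLD, &send_reqs[idx]);
    }

    // Every process (rank 0 included) receives its own fragment.
    float *local_data = malloc(local_size * sizeof(float));
    MPI_Request recv_req;
    MPI_Irecv(local_data, local_size, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &recv_req);
    MPI_Wait(&recv_req, MPI_STATUS_IGNORE);

    // The CUDA part does the local work (stubbed out above).
    float local_sum = gpu_partial_sum(local_data, local_size);

    // Combine the per-process sums onto rank 0.
    float global_sum = 0.0f;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        MPI_Waitall(numberOfProcesses, send_reqs, MPI_STATUSES_IGNORE);
        printf("global sum = %f\n", global_sum);
        free(send_reqs);
        free(array);
    }
    free(local_data);

    MPI_Finalize();
    return 0;
}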
MPI/CUDA – Wave Equation
Recall our calculation:
y_{x,t+1} = 2y_{x,t} - y_{x,t-1} + \left(\frac{c\,\Delta t}{\Delta x}\right)^2 \left(y_{x+1,t} - 2y_{x,t} + y_{x-1,t}\right)
MPI/CUDA – Wave Equation
Big idea: Divide our data array between n processes!
MPI/CUDA – Wave Equation
In reality: We have three regions of data at a time (old, current, new)
MPI/CUDA – Wave Equation
The calculation for timestep t+1 uses the following data:
y_{x,t+1} = 2y_{x,t} - y_{x,t-1} + \left(\frac{c\,\Delta t}{\Delta x}\right)^2 \left(y_{x+1,t} - 2y_{x,t} + y_{x-1,t}\right)
[Figure: the three data regions, labeled t+1, t, and t-1]
MPI/CUDA – Wave Equation
Problem if we're at the boundary of a process!
y_{x,t+1} = 2y_{x,t} - y_{x,t-1} + \left(\frac{c\,\Delta t}{\Delta x}\right)^2 \left(y_{x+1,t} - 2y_{x,t} + y_{x-1,t}\right)
[Figure: rows t+1, t, t-1 with point x at the left edge of this process's region]
Where do we get y_{x-1,t}? (It's outside our process!)
Wave Equation – Simple Solution
After every timestep, each process gives its leftmost and rightmost pieces of "current" data to its neighbor processes!
[Figure: the data array divided among Proc0 through Proc4]
Wave Equation – Simple Solution
Pieces of data to communicate:
[Figure: the boundary values exchanged between neighboring processes, Proc0 through Proc4]
Wave Equation – Simple Solution
Can do this with MPI_Irecv, MPI_Isend, MPI_Wait (see the sketch after this list)
Suppose our process has rank r:
If we're not the rightmost process:
    Send data to process r+1
    Receive data from process r+1
If we're not the leftmost process:
    Send data to process r-1
    Receive data from process r-1
Wait on the requests
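A hedged C sketch of that exchange (my code, not the lecture's). Assumptions: r is this process's rank, size the number of processes, current is the local slice of "current" data with one ghost cell at each end, and local_n is the number of interior points this process owns.

// Layout assumption: current[0] and current[local_n + 1] are ghost cells;
// current[1] .. current[local_n] are the points this process owns.
MPI_Request reqs[4];
int nreqs = 0;

if (r != size - 1) {   // not the rightmost process
    MPI_Isend(&current[local_n],     1, MPI_FLOAT, r + 1, 0, MPI_COMM_WORLD, &reqs[nreqs++]);
    MPI_Irecv(&current[local_n + 1], 1, MPI_FLOAT, r + 1, 0, MPI_COMM_WORLD, &reqs[nreqs++]);
}
if (r != 0) {          // not the leftmost process
    MPI_Isend(&current[1], 1, MPI_FLOAT, r - 1, 0, MPI_COMM_WORLD, &reqs[nreqs++]);
    MPI_Irecv(&current[0], 1, MPI_FLOAT, r - 1, 0, MPI_COMM_WORLD, &reqs[nreqs++]);
}

// Wait for every send/receive that was actually posted.
MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);

The slides wait with MPI_Wait per request; MPI_Waitall is just a convenience that waits on all of them at once.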
Wave Equation – Simple Solution
Boundary conditions:
Use MPI_Comm_rank and MPI_Comm_size
The rank 0 process will set the leftmost condition
The rank (size-1) process will set the rightmost condition
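In code this amounts to something like the following sketch, using the same assumed layout as above; left_boundary_value and right_boundary_value are placeholders for whatever the problem prescribes.

// Only the processes at the ends of the global array touch the physical boundary.
if (r == 0)
    current[1] = left_boundary_value;           // leftmost grid point
if (r == size - 1)
    current[local_n] = right_boundary_value;    // rightmost grid point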
Simple Solution – Problems
Communication can be expensive!
It is expensive to communicate every timestep just to send one value!
Better solution: send some m values every m timesteps!
(With m boundary values on hand, each process can advance m timesteps locally before it needs to exchange data again, at the cost of a bit of redundant computation near the boundaries.)