
1 Non-Blocking Communications

2 Example

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int my_rank, ncpus;
        int left_neighbor, right_neighbor;
        int data_received = -1;
        int tag = 101;
        MPI_Status statSend, statRecv;
        MPI_Request reqSend, reqRecv;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

        left_neighbor  = (my_rank - 1 + ncpus) % ncpus;
        right_neighbor = (my_rank + 1) % ncpus;

        // start communication
        MPI_Isend(&my_rank, 1, MPI_INT, left_neighbor, tag, MPI_COMM_WORLD, &reqSend);
        MPI_Irecv(&data_received, 1, MPI_INT, right_neighbor, tag, MPI_COMM_WORLD, &reqRecv);

        // maybe do something useful here

        // complete communication
        MPI_Wait(&reqSend, &statSend);
        MPI_Wait(&reqRecv, &statRecv);

        printf("Among %d processes, process %d received from right neighbor: %d\n",
               ncpus, my_rank, data_received);

        // clean up
        MPI_Finalize();
        return 0;
    }

Sample run:

    mpirun -np 4 test_shift
    Among 4 processes, process 3 received from right neighbor: 0
    Among 4 processes, process 2 received from right neighbor: 3
    Among 4 processes, process 0 received from right neighbor: 1
    Among 4 processes, process 1 received from right neighbor: 2
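The wrap-around neighbor arithmetic in this example can be checked without running MPI. The following is a minimal sketch in plain C. The helper names ring_left, ring_right, and simulate_shift are hypothetical and are not MPI routines. It reproduces the two modulo formulas and simulates which value each rank would receive.

```c
#include <assert.h>

/* Left neighbor on a ring of ncpus ranks. Adding ncpus keeps the
   operand of % non-negative when rank is 0. */
int ring_left(int rank, int ncpus)
{
    assert(ncpus > 0 && rank >= 0 && rank < ncpus);
    return (rank - 1 + ncpus) % ncpus;
}

/* Right neighbor on the same ring. */
int ring_right(int rank, int ncpus)
{
    assert(ncpus > 0 && rank >= 0 && rank < ncpus);
    return (rank + 1) % ncpus;
}

/* Simulate the shift: every rank sends its own rank number to its left
   neighbor, so received[r] ends up holding the rank of r's right neighbor.
   The caller provides received[] with room for ncpus entries. */
void simulate_shift(int ncpus, int *received)
{
    for (int sender = 0; sender < ncpus; sender++)
        received[ring_left(sender, ncpus)] = sender;
}
```

Calling simulate_shift(4, received) fills received with {1, 2, 3, 0}, which agrees with the sample run for 4 processes.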

3 Semantics

Purpose:
- Mechanism for overlapping communication with useful computation. Communication and computation may proceed concurrently (latency hiding).
- Deadlock avoidance.
- May avoid system buffering and memory-to-memory copying, and so improve performance.

Structure of non-blocking calls:

    Post communication request     (non-blocking call: MPI_Isend, …)
    …
    … // do some useful work
    Complete communication         (MPI_Wait, MPI_Test, …)

4 Semantics

Non-blocking calls: MPI_Isend, MPI_Irecv, etc.
- They return immediately. They merely post a request to the MPI system to initiate the communication.
- The communication is not yet complete when the call returns.
- Until the communication is completed by MPI_Wait, MPI_Test, etc., the send buffer must not be modified and the receive buffer must not be accessed.

[Figure panels: Non-blocking send; Non-blocking receive]

5 Non-blocking Send/Recv

    int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest,
                  int tag, MPI_Comm comm, MPI_Request *request)

    MPI_ISEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
        <type> BUF(*)
        INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

    int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
                  int tag, MPI_Comm comm, MPI_Request *request)

    MPI_IRECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR)
        <type> BUF(*)
        INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR

- These calls post send/receive requests to the MPI system.
- The calls return immediately, but do not touch the memory pointed to by buf until the communication has completed.
- request is a handle to an internal MPI object. Everything about that non-blocking communication goes through the handle.
- MPI_REQUEST_NULL is the null request handle.

    MPI_Request req1, req2;
    double A[10], B[5];
    …
    MPI_Isend(A, 10, MPI_DOUBLE, rank, tag, MPI_COMM_WORLD, &req1);
    MPI_Irecv(B, 5, MPI_DOUBLE, rank, tag, MPI_COMM_WORLD, &req2);

6 Other Non-blocking Sends

Four communication modes, with the same semantics as the blocking sends:
- MPI_ISEND: standard mode
- MPI_IBSEND: buffered mode
- MPI_ISSEND: synchronous mode
- MPI_IRSEND: ready mode

The arguments are identical to those of MPI_Isend:

    int MPI_Ibsend(void *buf, int count, MPI_Datatype datatype, int dest,
                   int tag, MPI_Comm comm, MPI_Request *request)
    int MPI_Issend(void *buf, int count, MPI_Datatype datatype, int dest,
                   int tag, MPI_Comm comm, MPI_Request *request)
    int MPI_Irsend(void *buf, int count, MPI_Datatype datatype, int dest,
                   int tag, MPI_Comm comm, MPI_Request *request)

7 Completion

- Use MPI_Wait or MPI_Test to complete a non-blocking communication.
- Semantics after MPI_Wait returns:
  - For a standard-mode send, the message data has been safely stored away, so the send buffer may be reused.
  - For a receive, the data has arrived in the receive buffer.

8 MPI_Wait

    int MPI_Wait(MPI_Request *request, MPI_Status *status)

    MPI_WAIT(REQUEST, STATUS, IERROR)
        INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

- request is a handle returned from MPI_Isend, MPI_Irecv, etc.
- Blocks until the communication completes (or fails).
- If the request came from MPI_Isend, MPI_Irecv, etc., MPI_Wait deallocates the request object and sets request to MPI_REQUEST_NULL.
- Returns the status information in status:
  - for MPI_Irecv, it holds additional information about the received message;
  - for MPI_Isend, it contains little of use.

    MPI_Request req;
    MPI_Status stat;
    …
    MPI_Irecv(…, &req);
    MPI_Wait(&req, &stat);

9 MPI_Test

    int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)

    MPI_TEST(REQUEST, FLAG, STATUS, IERROR)
        LOGICAL FLAG
        INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

- request: MPI_Request object from MPI_Isend, etc.
- flag: true if the communication is complete, false if not yet.
  - If true, the request object is deallocated and request is set to MPI_REQUEST_NULL.
- status: contains the status information if the communication is complete.
- Does not block; returns immediately.
- Provides a mechanism for overlapping communication and computation:
  - do useful computation; periodically check the communication status; if it is not complete, go back to computation.

10 Properties

- Order: non-overtaking; message order is preserved
  - according to the execution order of the non-blocking calls that initiate the communications.
- Progress: progress is guaranteed.
  - A receive completed by MPI_Wait will eventually return if there is a matching send.
  - A send completed by MPI_Wait will eventually return if there is a matching receive.

    MPI_Comm_rank(comm, &rank);
    if (rank == 0) {
        MPI_Isend(A, 1, MPI_DOUBLE, 1, 99, comm, &req1);
        MPI_Isend(B, 1, MPI_DOUBLE, 1, 99, comm, &req2);
    } else if (rank == 1) {
        MPI_Irecv(A, 1, MPI_DOUBLE, 0, MPI_ANY_TAG, comm, &req1);  // matches the first send
        MPI_Irecv(B, 1, MPI_DOUBLE, 0, 99, comm, &req2);           // matches the second send
    }
    MPI_Wait(&req1, &stat1);
    MPI_Wait(&req2, &stat2);

11 MPI_Wait Variants

These deal with arrays of requests, e.g. MPI_Request req[4];

- MPI_Waitall:

      int MPI_Waitall(int count, MPI_Request *request, MPI_Status *status)

  - Blocks until all active requests in the array complete; returns the status of all communications.
  - Deallocates the request objects and sets them to MPI_REQUEST_NULL.

- MPI_Waitany:

      int MPI_Waitany(int count, MPI_Request *req, int *index, MPI_Status *stat)

  - Blocks until one of the active requests in the array completes; returns its index in the array and the status of the completing request; deallocates that request object.
  - If the array contains no active requests, returns immediately with index=MPI_UNDEFINED.

- MPI_Waitsome:

      int MPI_Waitsome(int incount, MPI_Request *req, int *outcount,
                       int *array_indices, MPI_Status *array_status)

  - Blocks until at least one of the active communications completes; returns the associated indices and statuses of the completed communications; deallocates those objects.
  - If the array contains no active requests, returns immediately with outcount=MPI_UNDEFINED.

Waiting for all requests:

    MPI_Request req[2];
    MPI_Status stat[2];
    …
    MPI_Isend(…, &req[0]);
    MPI_Isend(…, &req[1]);
    MPI_Waitall(2, req, stat);

Waiting for any one request:

    MPI_Request req[2];
    MPI_Status stat;
    int index;
    MPI_Isend(…, &req[0]);
    MPI_Isend(…, &req[1]);
    MPI_Waitany(2, req, &index, &stat);
    …

12 MPI_Test Variants

- MPI_Testall:

      int MPI_Testall(int count, MPI_Request *array_req, int *flag,
                      MPI_Status *array_stat)

  - Returns flag=true if all active requests are complete; returns flag=false otherwise.
  - If true, deallocates the request objects and sets them to MPI_REQUEST_NULL.

- MPI_Testany:

      int MPI_Testany(int count, MPI_Request *array_req, int *index, int *flag,
                      MPI_Status *stat)

  - If one of the active communications completes, returns flag=true together with the index and status of the completing communication; deallocates that object.
  - Returns flag=false, index=MPI_UNDEFINED if none completes.
  - Returns flag=true, index=MPI_UNDEFINED if there are no active requests.

- MPI_Testsome:

      int MPI_Testsome(int incount, MPI_Request *array_req, int *outcount,
                       int *array_indices, MPI_Status *array_stat)

  - Returns in outcount the number of completed active communications, along with their indices and statuses.
  - If none completes, returns outcount=0.
  - If there are no active communications, returns outcount=MPI_UNDEFINED.

13 Persistent Communication

- Structure of non-blocking calls:
  - MPI_Ixxxx allocates an MPI_Request.
  - MPI_Wait or MPI_Test completes the communication and deallocates the request object.
- Often a communication with the same arguments is executed repeatedly,
  - e.g. every time step or every iteration.
- A persistent request can be created that is not deallocated by MPI_Wait. This reduces overhead.

Structure of persistent communication:

    Create persistent request      (MPI_Send_init, MPI_Recv_init)
    Repeat:
        Start communication        (MPI_Start)
        …
        Complete communication     (MPI_Wait, MPI_Test)
    Free persistent request        (MPI_Request_free)

14 Creation

    int MPI_Send_init(void *buf, int count, MPI_Datatype datatype, int dest,
                      int tag, MPI_Comm comm, MPI_Request *req)
    int MPI_Recv_init(void *buf, int count, MPI_Datatype datatype, int source,
                      int tag, MPI_Comm comm, MPI_Request *req)

- MPI_Send_init creates a persistent request object for a standard-mode send; MPI_Recv_init does the same for a receive.
- The request is bound to the arguments buf, count, datatype, dest (or source), tag, comm. These arguments do not change in the communications that follow.
- On creation the request is inactive, i.e. not associated with any active communication. Communication is initiated by MPI_Start.

    MPI_Request req_send, req_recv;
    double A[100], B[100];
    int left_neighbor, right_neighbor, tag=999;
    MPI_Status stat_send, stat_recv;
    …
    MPI_Send_init(A, 100, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_send);
    MPI_Recv_init(B, 100, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_recv);

    MPI_Start(&req_send);
    MPI_Start(&req_recv);
    … // do something else useful
    MPI_Wait(&req_send, &stat_send);
    MPI_Wait(&req_recv, &stat_recv);

    MPI_Request_free(&req_send);
    MPI_Request_free(&req_recv);

15 Start Communication, Free Request

    int MPI_Start(MPI_Request *request)

    MPI_START(REQUEST, IERROR)
        INTEGER REQUEST, IERROR

    int MPI_Request_free(MPI_Request *request)

    MPI_REQUEST_FREE(REQUEST, IERROR)
        INTEGER REQUEST, IERROR

- request is a persistent request created by MPI_Send_init, etc.
- MPI_Start starts the communication on the request object.
- The call returns immediately; it starts a non-blocking communication. Do not access the buffer after this call until the communication has completed.
- Complete the communication with MPI_Wait, MPI_Test, etc.
- MPI_Wait and MPI_Test do not deallocate a persistent request when the communication completes.
- Deallocate the persistent request with MPI_Request_free at the end.

16 Example: Matrix-Vector Multiplication

AX = Y
A: NxN matrix
X, Y: vectors, dimension N

Block form on 3 cpus (each cpu owns one block row of A and the matching block of Y):

    | Y1 |   | A11 A12 A13 |   | X1 |
    | Y2 | = | A21 A22 A23 | * | X2 |
    | Y3 |   | A31 A32 A33 |   | X3 |

    Y1 = A11*X1 + A12*X2 + A13*X3    (cpu 0)
    Y2 = A21*X1 + A22*X2 + A23*X3    (cpu 1)
    Y3 = A31*X1 + A32*X2 + A33*X3    (cpu 2)

Block of X held by each cpu at each step of the shift:

    step    cpu 0    cpu 1    cpu 2
    0       X1       X2       X3
    1       X2       X3       X1
    2       X3       X1       X2

17 Example: Matrix-Vector

Data on cpu 0:
- [A11 A12 A13]: N/3 x N matrix
- X1: vector, length N/3
- Y1: vector, length N/3

Data on cpu 1:
- [A21 A22 A23]: N/3 x N matrix
- X2: vector, length N/3
- Y2: vector, length N/3

Data on cpu 2:
- [A31 A32 A33]: N/3 x N matrix
- X3: vector, length N/3
- Y3: vector, length N/3

Need to communicate: X1, X2, X3.
Upward shift. Number of shifts = ncpus-1.

Assume:

    A[i][j] = i+j
    X[i] = i
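The shift schedule can be checked serially before any MPI code is written. The sketch below is plain C and only simulates the ranks in a loop. The function name shift_matvec and the flat row-major storage of A are assumptions made for this illustration and are not taken from the slides. Each simulated rank multiplies the block it currently holds by the matching column block of its rows of A, then all blocks move up by one rank.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Serial simulation of the upward-shift algorithm.
   ncpus : number of simulated ranks
   n     : global dimension (must be divisible by ncpus)
   A     : n*n matrix, row-major; rank r owns rows r*nx .. r*nx+nx-1
   X     : input vector of length n; rank r starts with block r
   Y     : output vector of length n, receives A*X
   Returns 0 on success, -1 on bad arguments or allocation failure. */
int shift_matvec(int ncpus, int n, const double *A, const double *X, double *Y)
{
    if (ncpus <= 0 || n <= 0 || n % ncpus != 0)
        return -1;
    int nx = n / ncpus;

    /* held[r*nx ..] is the block of X currently stored on rank r */
    double *held = malloc(sizeof(double) * n);
    double *next = malloc(sizeof(double) * n);
    if (held == NULL || next == NULL) {
        free(held);
        free(next);
        return -1;
    }
    memcpy(held, X, sizeof(double) * n);
    memset(Y, 0, sizeof(double) * n);

    for (int count = 0; count < ncpus; count++) {
        for (int r = 0; r < ncpus; r++) {
            int curr_block = (r + count) % ncpus;  /* block now on rank r */
            int sindex = curr_block * nx;          /* its column offset in A */
            for (int i = 0; i < nx; i++) {
                const double *row = A + (size_t)(r * nx + i) * n;
                for (int j = 0; j < nx; j++)
                    Y[r * nx + i] += row[sindex + j] * held[r * nx + j];
            }
        }
        /* shift: rank r receives the block held by its right neighbor */
        for (int r = 0; r < ncpus; r++)
            memcpy(next + r * nx, held + ((r + 1) % ncpus) * nx,
                   sizeof(double) * nx);
        double *tmp = held;
        held = next;
        next = tmp;
    }
    free(held);
    free(next);
    return 0;
}
```

With the assumed input A[i][j] = i+j and X[i] = i, the exact result is Y[i] = i*S1 + S2, where S1 is the sum of j and S2 the sum of j*j over j = 0..N-1. That gives a quick correctness check for any N and number of ranks.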

18 Example (non-blocking comm)

    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>
    #include "dmath.h"   // author's own helper routines; ignore this for now

    #define DIM 1000     // logical A[DIM][DIM], X[DIM], Y[DIM]

    int main(int argc, char **argv)
    {
        int ncpus, my_rank, left_neighbor, right_neighbor, tag=1001;
        int Nx, Ny;   // Ny=DIM, Nx=DIM/ncpus; on each cpu: A[Nx][Ny], X[Nx], Y[Nx]
        MPI_Request req_sr[2];
        MPI_Status stat_sr[2];
        double **A, *X, *Y, *Xt;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

        if (DIM%ncpus != 0) {   // assume DIM is divisible by ncpus
            if (my_rank==0)
                printf("ERROR: grid size cannot be divided by ncpus!\n");
            MPI_Finalize();
            return -1;
        }

        Nx = DIM/ncpus;   // again, on each cpu: A[Nx][Ny] etc.
        Ny = DIM;
        left_neighbor = (my_rank-1 + ncpus)%ncpus;   // top neighbor
        right_neighbor = (my_rank+1)%ncpus;          // bottom neighbor

        A = DMath::newD(Nx, Ny);   // allocate memory; ignore DMath, my own routine
        X = DMath::newD(Nx);
        Xt = DMath::newD(Nx);      // Xt: temporary space for receiving from neighbor
        Y = DMath::newD(Nx);

19 Example

        int i, j;
        for (i = 0; i < Nx; i++) {                 // initialize A, X
            for (j = 0; j < Ny; j++)
                A[i][j] = (my_rank*Nx + i) + j;    // *** important ***: global row index
            X[i] = my_rank*Nx + i;
        }

        int count;                                 // loop counter
        int sindex, curr_block;
        memset(Y, '\0', sizeof(double)*Nx);        // zero out result vector Y first

        for (count = 0; count < ncpus; count++) {
            if (count < ncpus-1) {
                // receive from bottom neighbor
                MPI_Irecv(Xt, Nx, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_sr[0]);
                // send to top neighbor
                MPI_Isend(X, Nx, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_sr[1]);
            }

            // compute on current data
            curr_block = (my_rank + count) % ncpus;   // *** important ***
            sindex = curr_block*Nx;   // starting index: columns A[i][sindex+0 : sindex+Nx-1]
            for (i = 0; i < Nx; i++)
                for (j = 0; j < Nx; j++)
                    Y[i] += A[i][sindex+j]*X[j];      // *** important ***

            // complete comm
            if (count < ncpus-1) {
                MPI_Waitall(2, req_sr, stat_sr);      // data now in Xt
                memcpy(X, Xt, sizeof(double)*Nx);     // *** important ***: copy data from Xt to X
            }
        }   // end of loop over shifts

20 Example

        // clean up, free memory
        DMath::del(A);   // ignore DMath for now
        DMath::del(X);
        DMath::del(Xt);
        DMath::del(Y);

        MPI_Finalize();
        return 0;
    }

21 Example: Persistent Communication

    ...
    MPI_Recv_init(Xt, Nx, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_sr[0]);
    MPI_Send_init(X, Nx, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_sr[1]);

    for (count = 0; count < ncpus; count++) {
        if (count < ncpus-1)
            MPI_Startall(2, req_sr);

        // compute on current data
        curr_block = (my_rank + count) % ncpus;
        sindex = curr_block*Nx;
        for (i = 0; i < Nx; i++)
            for (j = 0; j < Nx; j++)
                Y[i] += A[i][sindex+j]*X[j];

        // complete comm
        if (count < ncpus-1) {
            MPI_Waitall(2, req_sr, stat_sr);      // data now in Xt
            memcpy(X, Xt, sizeof(double)*Nx);     // copy data to X
        }
    }

    // free the persistent requests once, after the loop
    MPI_Request_free(&req_sr[0]);
    MPI_Request_free(&req_sr[1]);
    ...

22 Example: Send-Recv

    ...
    for (count = 0; count < ncpus; count++) {
        // compute on current data
        curr_block = (my_rank + count) % ncpus;
        sindex = curr_block*Nx;
        for (i = 0; i < Nx; i++)
            for (j = 0; j < Nx; j++)
                Y[i] += A[i][sindex+j]*X[j];

        // send-recv; stat_sr is a single MPI_Status in this variant
        if (count < ncpus-1)
            MPI_Sendrecv_replace(X, Nx, MPI_DOUBLE, left_neighbor, tag,
                                 right_neighbor, tag, MPI_COMM_WORLD, &stat_sr);
    }
    ...

23 HWK#2: Matrix Multiplication

C = AB
A, B, C: NxN matrices
P: number of processors

Column-wise decomposition (shown for P = 3):

    [C1 C2 C3] = [A1 A2 A3] * | B11 B12 B13 |
                              | B21 B22 B23 |
                              | B31 B32 B33 |

    C1 = A1*B11 + A2*B21 + A3*B31    (cpu 0)
    C2 = A1*B12 + A2*B22 + A3*B32    (cpu 1)
    C3 = A1*B13 + A2*B23 + A3*B33    (cpu 2)

- A1, A2, A3: Nx(N/P) matrices
- C1, C2, C3: …
- Bij: (N/P)x(N/P) matrices

Input:

    A[i][j] = 2*i + j
    B[i][j] = 2*i - j

24 HWK #2

- Implement the above parallel matrix multiplication (column-wise data decomposition) in C, C++, or Fortran.
- Use non-blocking communication or persistent communication in MPI.
- Test your parallel implementation and make sure the result is correct.
  - The result for matrix C on p CPUs must be identical to that on 1 CPU.
- Use a matrix size of 2048x2048 (double).
- Time the "multiplication section" of your code using the MPI_Wtime() routine for wall-clock time.
- Run your code on 1, 2, 4, 8, 16 CPUs and obtain the wall-clock times: T1, T2, …, T16.
- Compute the parallel speedup factors Sp = T1/Tp, e.g. S8 = T1/T8 for 8 CPUs.
- Plot Sp vs. number of CPUs.
- Turn in:
  - Source code + compiled binary code on either hamlet or radon.
  - Table of wall-clock time vs. number of CPUs.
  - Plot of parallel speedup factors.
  - Write-up of what you have learned from the implementation and timing results.
- Due date: Oct. 11
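The speedup arithmetic is simple enough to script once the timings are in hand. A minimal sketch in plain C follows. The function name speedup_table is hypothetical, and the caller supplies the measured times in an array whose first entry is T1.

```c
#include <assert.h>

/* Fill s[k] = t[0] / t[k], where t[0] is the 1-CPU wall-clock time T1
   and t[k] is the time Tp of the k-th run. All times must be positive. */
void speedup_table(const double *t, int nruns, double *s)
{
    assert(nruns > 0 && t[0] > 0.0);
    for (int k = 0; k < nruns; k++) {
        assert(t[k] > 0.0);
        s[k] = t[0] / t[k];
    }
}
```

For example, made-up timings of 8.0, 4.0, and 2.0 seconds on 1, 2, and 4 CPUs would give speedups of 1, 2, and 4. These numbers are purely illustrative and are not measurements.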

25 Collective Communications

26 Overview

- All processes in a group participate in the communication by calling the same function with matching arguments.
- Types of collective operations:
  - Synchronization: MPI_Barrier
  - Data movement: MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Allgather, MPI_Alltoall
  - Collective computation: MPI_Reduce, MPI_Allreduce, MPI_Scan
- Collective routines are blocking:
  - Completion of the call means the communication buffer can be accessed.
  - It gives no indication of the completion status on other processes.
  - It may or may not have the effect of synchronizing the processes.

27 Overview

- Collective calls can use the same communicators as point-to-point (PtP) communications.
  - MPI guarantees that messages from collective communications will not be confused with PtP communications.
- The key is the group of processes taking part in the communication.
  - If only a sub-group of processes should be involved in a collective communication, create a sub-group/sub-communicator from MPI_COMM_WORLD.

28 Barrier

    int MPI_Barrier(MPI_Comm comm)

    MPI_BARRIER(COMM, IERROR)
        INTEGER COMM, IERROR

- Blocks the calling process until all group members have called it.
- Decreases performance. Refrain from using it explicitly.

    …
    MPI_Barrier(MPI_COMM_WORLD);   // synchronization point
    …

29 Broadcast

    int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root,
                  MPI_Comm comm)

    MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
        <type> BUFFER(*)
        INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR

- Broadcasts a message from the process with rank root to all processes in the group, including itself.
- comm and root must be the same in all processes.
- The amount of data sent must equal the amount of data received, pairwise between each process and the root.
  - For now, this means count and datatype must be the same for all processes; they may differ when generalized datatypes are involved.
