LAB 7 - Parallel Programming Models - MPI Continued

Parallel Programming Models
– Programming Models
– MPI Programming

Fundamental Design Issues
Design of the user-system and hardware-software interface (proposed by Culler et al.)
Functional issues:
– Naming: how are logically shared data referenced?
– Operations: what operations are provided on shared data?
– Ordering: how are accesses to data ordered and coordinated?
Performance issues:
– Replication: how are data replicated to reduce communication?
– Communication cost: latency, bandwidth, overhead
Others:
– Node granularity

Sequential Programming Model
Functional issues:
– Naming: can name any variable in the virtual address space; hardware (and perhaps compilers) does the translation to physical addresses
– Operations: loads and stores
– Ordering: sequential program order

Performance:
– Rely on dependences on a single location (mostly): dependence order
– Compilers and hardware violate other orders without getting caught
– Compiler: reordering and register allocation
– Hardware: out-of-order execution, pipeline bypassing, write buffers
– Transparent replication in caches

SAS Programming Model (figure: two threads/processes on one system read and write a shared variable X held in shared memory; each has its own processor, but memory is shared)

Shared Address Space Programming Model
Naming:
– Any process can name any variable in the shared space
Operations:
– Loads and stores, plus those needed for ordering
Simplest ordering model:
– Within a process/thread: sequential program order
– Across threads: some interleaving (as in time-sharing)
– Additional orders through synchronization
– Again, compilers/hardware can violate orders without getting caught

Synchronization
Mutual exclusion (locks):
– Ensure certain operations on certain data can be performed by only one process at a time
– Like a room that only one person can enter at a time
– No ordering guarantees
Event synchronization:
– Ordering of events to preserve dependences (e.g. producer -> consumer of data)
– 3 main types: point-to-point, global, group
(A small shared-memory locking sketch follows below.)
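
As an illustration only (POSIX threads, not MPI): a minimal shared-memory sketch of mutual exclusion with a lock; the counter, iteration count, and thread count are arbitrary choices:

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

static long counter = 0;                                  /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects counter */

static void* worker(void* arg)
{
    int i;
    (void) arg;   /* unused */
    for (i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* only one thread may be in here at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    int i;
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    printf("counter = %ld\n", counter);   /* always NTHREADS * 100000 with the lock */
    return 0;
}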

MP Programming Model (figure: a process on Node A executes send(Y) and a process on Node B executes receive(Y'); the message carries Y across the network into Y', and each node has its own processor and private memory)

Message Passing Programming Model
Naming:
– Processes can name private data directly
– No shared address space
Operations:
– Explicit communication: "send" and "receive"
– "Send" transfers data from the private address space to another process
– "Receive" copies data from a process into the private address space

Ordering:
– Program order within a process
– Send and receive can provide point-to-point synchronization between processes
– Mutual exclusion inherent
Can construct a global address space:
– Process number + address within that process's address space
– But no direct operations on these names

Message Passing Basics
What is MPI? The Message Passing Interface (MPI) standard is a library that allows you to solve problems in parallel, using message passing to communicate between processes.
Library: it is not a language (like Fortran 90, C or HPF), or even an extension to a language. Instead, it is a library that your native, standard, serial compiler (f77, f90, cc, CC) uses.
Message passing: message passing is sometimes referred to as a paradigm itself. But it is really just a method of passing data between processes that is flexible enough to implement most paradigms (data parallel, work sharing, etc.).
Communicate: this communication may be via a dedicated MPP torus network, or merely an office LAN. To the MPI programmer, it looks much the same.
Processes: these can be 512 PEs on a T3E, or 4 processes on a single workstation.

Basic MPI
In order to do parallel programming, you require some basic functionality, namely the ability to:
– Start processes
– Send messages
– Receive messages
– Synchronize
With these four capabilities you can construct any program. We will look at the basic versions of the MPI routines that implement this. Of course, MPI offers over 125 functions, many of which are more convenient or more efficient for certain tasks. However, with what we learn here we will be able to implement just about any algorithm. Moreover, the vast majority of MPI codes are built using primarily these routines.

Starting Processes on the Cluster On the Clusters, the fundamental control of processes is fairly simple. There is always one process for each PE that your code is running on. At run time, you specify how many PEs you require and then your code is copied to each PE and run simultaneously. In other words, a 512 PE Cluster code has 512 copies of the same code running on it from start to finish. At first the idea that the same code must run on every node seems very limiting. We'll see in a bit that this is not at all the case.

MPI Programming Fundamentals
– Environment
– Point-to-point communication
– Collective communication
– Derived data types
– Group management

MPI Terms
Blocking:
– The return from the procedure indicates that the user is allowed to reuse the resources specified in the call
Non-blocking:
– The procedure may return before the operation completes, and before the user is allowed to reuse the resources specified in the call
Collective:
– All processes in a process group need to invoke the procedure
Message envelope:
– Information used to distinguish messages and selectively receive them

MPI Terms (Cont'd)
Communicator:
– The communication context for a communication operation
– Messages are always received within the context in which they were sent
– Messages sent in different contexts do not interfere
– The default communicator is MPI_COMM_WORLD
Process group:
– The communicator specifies the set of processes that share this communication context
– This process group is ordered, and processes are identified by their rank within the group

MPI Environment Calls MPI_INIT MPI_COMM_SIZE MPI_COMM_RANK MPI_FINALIZE MPI_ABORT

MPI_INIT Usage –int MPI_Init( int* argc_ptr,/* in */ char** argv_ptr[] );/* in */ Description –Initializes MPI –All MPI programs must call this routine once and only once, before any other MPI routine

MPI_COMM_SIZE Usage –int MPI_Comm_size( MPI_Comm comm,/* in */ int* size );/* out */ Description –Returns the number of processes in the group associated with a communicator

MPI_COMM_RANK Usage –int MPI_Comm_rank ( MPI_Comm comm, /* in */ int* rank );/* out */ Description –Returns the rank of the local process in the group associated with a communicator –The rank of the calling process is in the range 0 … size - 1

MPI_FINALIZE Usage –int MPI_Finalize (void); Description –Terminates all MPI processing –Make sure this routine is the last MPI call –All pending communications involving a process must have completed before the process calls MPI_FINALIZE

MPI_ABORT Usage –int MPI_Abort( MPI_Comm comm,/* in */ int errorcode );/* in */ Description –Forces all processes of an MPI job to terminate
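
A minimal sketch of a typical use (not from the slides): every process checks the same condition and the whole job is killed if it fails; the error code value is arbitrary:

#include <stdio.h>
#include "mpi.h"

int main( int argc, char* argv[] )
{
  int rank;

  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  if (argc < 2) {                        /* every rank evaluates the same condition */
    if (rank == 0)
      fprintf(stderr, "usage: %s <input-file>\n", argv[0]);
    MPI_Abort(MPI_COMM_WORLD, 1);        /* kills the whole job, not just this rank */
  }

  /* ... rest of the program ... */
  MPI_Finalize();
  return 0;
}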

Too Simple Program
#include "mpi.h"

int main( int argc, char* argv[] )
{
  int rank;
  int nproc;

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  /* your code goes here */

  MPI_Finalize();
  return 0;
}

Point-To-Point Communication MPI_SEND MPI_RECV MPI_ISEND MPI_IRECV MPI_WAIT MPI_GET_COUNT

4 Communication Modes in MPI Standard mode –It is up to MPI to decide whether outgoing messages will be buffered –Non-local operation –May behave as either buffered or synchronous Buffered (asynchronous) mode –A send operation can be started whether or not a matching receive has been posted –It may complete before a matching receive is posted –Local operation

4 Communication Modes in MPI (Cont’d) Synchronous mode –A send operation can be started whether or not a matching receive was posted –The send will complete successfully only if a matching receive was posted and the receive operation has started to receive the message –The completion of a synchronous send not only indicates that the send buffer can be reused but also indicates that the receiver has reached a certain point in its execution –Non-local operation

4 Communication Modes in MPI (Cont'd) Ready mode –A send operation may be started only if the matching receive is already posted –The completion of the send operation does not depend on the status of a matching receive and merely indicates that the send buffer can be reused –Related to the EAGER_LIMIT setting on IBM SP systems (the corresponding calls for the four modes are sketched below)
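
A minimal sketch (not from the slides) of how the four modes map onto MPI_Send, MPI_Bsend, MPI_Ssend and MPI_Rsend; it assumes exactly two processes, and the tag and buffer sizes are arbitrary choices:

#include <stdlib.h>
#include "mpi.h"

/* Run with exactly 2 processes: rank 0 sends the same integer to rank 1 once in
   each of the four modes; rank 1 posts a matching receive for each message. */
int main( int argc, char* argv[] )
{
  int rank, x = 42, bufsize;
  char* buf;
  MPI_Request request;
  MPI_Status status;

  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  if (rank == 0) {
    MPI_Send( &x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );      /* standard: MPI decides whether to buffer */

    bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;             /* buffered: user supplies the buffer */
    buf = (char*) malloc( bufsize );
    MPI_Buffer_attach( buf, bufsize );
    MPI_Bsend( &x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );      /* completes locally */
    MPI_Buffer_detach( &buf, &bufsize );
    free( buf );

    MPI_Ssend( &x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );      /* synchronous: completes only after the receive has started */

    MPI_Barrier( MPI_COMM_WORLD );                           /* rank 1 has posted its receive before this barrier, */
    MPI_Rsend( &x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );      /* so the ready-mode send is legal */
  } else if (rank == 1) {
    MPI_Recv( &x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status );    /* matches the standard send    */
    MPI_Recv( &x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status );    /* matches the buffered send    */
    MPI_Recv( &x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status );    /* matches the synchronous send */
    MPI_Irecv( &x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request );  /* posted before the barrier    */
    MPI_Barrier( MPI_COMM_WORLD );
    MPI_Wait( &request, &status );                                 /* matches the ready-mode send  */
  }
  MPI_Finalize();
  return 0;
}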

MPI_SEND Usage –int MPI_Send( void* buf, /* in */ int count, /* in */ MPI_Datatype datatype, /* in */ int dest, /* in */ int tag, /* in */ MPI_Comm comm ); /* in */ Description –Performs a blocking standard mode send operation –The message can be received by either MPI_RECV or MPI_IRECV

MPI_RECV Usage –int MPI_Recv( void* buf,/* out */ int count,/* in */ MPI_Datatype datatype,/* in */ int source,/* in */ int tag,/* in */ MPI_Comm comm,/* in */ MPI_Status* status );/* out */ Description –Performs a blocking receive operation –The message received must be less than or equal to the length of the receive buffer –MPI_RECV can receive a message sent by either MPI_SEND or MPI_ISEND

Blocking Operations

Sample Program for Blocking Operations
#include <stdio.h>
#include "mpi.h"
#define TAG 100   /* message tag; any value is fine */

int main( int argc, char* argv[] )
{
  int rank, nproc;
  int isbuf, irbuf;
  MPI_Status status;

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  if(rank == 0) {
    isbuf = 9;
    MPI_Send( &isbuf, 1, MPI_INT, 1, TAG, MPI_COMM_WORLD );

Sample Program for Blocking Operations (Cont'd)
  } else if(rank == 1) {
    MPI_Recv( &irbuf, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &status );
    printf( "%d\n", irbuf );
  }

  MPI_Finalize();
  return 0;
}

MPI_ISEND Usage –int MPI_Isend( void* buf,/* in */ int count,/* in */ MPI_Datatype datatype,/* in */ int dest,/* in */ int tag,/* in */ MPI_Comm comm,/* in */ MPI_Request* request );/* out */ Description –Performs a nonblocking standard mode send operation –The send buffer may not be modified until the request has been completed by MPI_WAIT or MPI_TEST –The message can be received by either MPI_RECV or MPI_IRECV.

MPI_IRECV Usage –int MPI_Irecv( void* buf,/* out */ int count,/* in */ MPI_Datatype datatype,/* in */ int source,/* in */ int tag,/* in */ MPI_Comm comm,/* in */ MPI_Request* request );/* out */

MPI_IRECV (Cont’d) Description –Performs a nonblocking receive operation –Do not access any part of the receive buffer until the receive is complete –The message received must be less than or equal to the length of the receive buffer –MPI_IRECV can receive a message sent by either MPI_SEND or MPI_ISEND

MPI_WAIT Usage –int MPI_Wait( MPI_Request* request,/* inout */ MPI_Status* status );/* out */ Description –Waits for a nonblocking operation to complete –Information on the completed operation is found in status –If wildcards were used by the receive for either the source or tag, the actual source and tag can be retrieved from status->MPI_SOURCE and status->MPI_TAG
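
A related routine is MPI_TEST, which checks a request without blocking. A minimal sketch (not from the slides) of polling a pending receive while other work could be done; run with 2 processes:

#include <stdio.h>
#include "mpi.h"

int main( int argc, char* argv[] )
{
  int rank, value = 0, flag = 0;
  MPI_Request request;
  MPI_Status status;

  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  if (rank == 0) {
    value = 7;
    MPI_Send( &value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );
  } else if (rank == 1) {
    MPI_Irecv( &value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request );
    while (!flag) {
      /* ... useful computation could be done here ... */
      MPI_Test( &request, &flag, &status );   /* returns immediately; flag is set once the receive is complete */
    }
    printf( "received %d from rank %d\n", value, status.MPI_SOURCE );
  }
  MPI_Finalize();
  return 0;
}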

Non-Blocking Operations

MPI_GET_COUNT Usage –int MPI_Get_count( MPI_Status* status,/* in */ MPI_Datatype datatype, /* in */ int* count );/* out */ Description –Returns the number of elements in a message –The datatype argument should match the datatype argument of the call that set the status variable

Sample Program for Non-Blocking Operations
#include <stdio.h>
#include "mpi.h"
#define TAG 100   /* message tag; any value is fine */

int main( int argc, char* argv[] )
{
  int rank, nproc;
  int isbuf, irbuf, count;
  MPI_Request request;
  MPI_Status status;

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  if(rank == 0) {
    isbuf = 9;
    MPI_Isend( &isbuf, 1, MPI_INT, 1, TAG, MPI_COMM_WORLD, &request );
    MPI_Wait( &request, &status );   /* complete the send request before finalizing */

Sample Program for Non-Blocking Operations (Cont'd)
  } else if(rank == 1) {
    MPI_Irecv( &irbuf, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &request );
    MPI_Wait( &request, &status );
    MPI_Get_count( &status, MPI_INT, &count );
    printf( "irbuf = %d source = %d tag = %d count = %d\n",
            irbuf, status.MPI_SOURCE, status.MPI_TAG, count );
  }

  MPI_Finalize();
  return 0;
}

Collective Communication MPI_BCAST MPI_SCATTER MPI_SCATTERV MPI_GATHER MPI_GATHERV MPI_ALLGATHER MPI_ALLGATHERV MPI_ALLTOALL

MPI_BCAST Usage –int MPI_Bcast( void* buffer, /* inout */ int count,/* in */ MPI_Datatype datatype, /* in */ int root, /* in */ MPI_Comm comm);/* in */ Description –Broadcasts a message from root to all processes in communicator –The type signature of count, datatype on any process must be equal to the type signature of count, datatype at the root

Example of MPI_BCAST
#include <stdio.h>
#include "mpi.h"

int main( int argc, char* argv[] )
{
  int rank;
  int ibuf;

  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  if(rank == 0)
    ibuf = 12345;
  else
    ibuf = 0;

  MPI_Bcast( &ibuf, 1, MPI_INT, 0, MPI_COMM_WORLD );
  printf( "ibuf = %d\n", ibuf );

  MPI_Finalize();
  return 0;
}

MPI_SCATTER Usage –int MPI_Scatter( void* sendbuf,/* in */ int sendcount,/* in */ MPI_Datatype sendtype,/* in */ void* recvbuf,/* out */ int recvcount,/* in */ MPI_Datatype recvtype,/* in */ int root,/* in */ MPI_Comm comm);/* in */ Description –Distribute individual messages from root to each process in communicator –Inverse operation to MPI_GATHER

Example of MPI_SCATTER
#include <stdio.h>
#include "mpi.h"

/* assumes the job is run with 3 processes, matching the array sizes */
int main( int argc, char* argv[] )
{
  int i;
  int rank, nproc;
  int isend[3], irecv;

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  if(rank == 0) {
    for(i=0; i<nproc; i++)
      isend[i] = i + 1;
  }

Example of MPI_SCATTER (Cont'd)
  MPI_Scatter( isend, 1, MPI_INT, &irecv, 1, MPI_INT, 0, MPI_COMM_WORLD );
  printf( "irecv = %d\n", irecv );

  MPI_Finalize();
  return 0;
}

MPI_SCATTERV Usage –int MPI_Scatterv( void* sendbuf,/* in */ int* sendcounts,/* in */ int* displs,/* in */ MPI_Datatype sendtype,/* in */ void* recvbuf,/* out */ int recvcount,/* in */ MPI_Datatype recvtype,/* in */ int root,/* in */ MPI_Comm comm);/* in */ Description –Distributes individual messages from root to each process in communicator –Messages can have different sizes and displacements

Example of MPI_SCATTERV
#include <stdio.h>
#include "mpi.h"

/* assumes the job is run with 3 processes, matching the array sizes */
int main( int argc, char* argv[] )
{
  int i;
  int rank, nproc, ircnt;
  int iscnt[3] = {1,2,3}, idisp[3] = {0,1,3};
  int isend[6] = {1,2,2,3,3,3}, irecv[3];

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  ircnt = rank + 1;

Example of MPI_SCATTERV (Cont'd)
  MPI_Scatterv( isend, iscnt, idisp, MPI_INT, irecv, ircnt, MPI_INT,
                0, MPI_COMM_WORLD );
  for(i=0; i<ircnt; i++)
    printf( "irecv = %d\n", irecv[i] );

  MPI_Finalize();
  return 0;
}

MPI_GATHER Usage –int MPI_Gather( void* sendbuf,/* in */ int sendcount,/* in */ MPI_Datatype sendtype,/* in */ void* recvbuf,/* out */ int recvcount,/* in */ MPI_Datatype recvtype,/* in */ int root,/* in */ MPI_Comm comm ); /* in */ Description –Collects individual messages from each process in communicator at the root process and stores them in rank order

Example of MPI_GATHER
#include <stdio.h>
#include "mpi.h"

/* assumes the job is run with 3 processes, matching the array sizes */
int main( int argc, char* argv[] )
{
  int i;
  int rank, nproc;
  int isend, irecv[3];

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  isend = rank + 1;
  MPI_Gather( &isend, 1, MPI_INT, irecv, 1, MPI_INT, 0, MPI_COMM_WORLD );

Example of MPI_GATHER (Cont'd)
  if(rank == 0) {
    for(i=0; i<nproc; i++)
      printf( "irecv = %d\n", irecv[i] );
  }

  MPI_Finalize();
  return 0;
}

MPI_GATHERV Usage –int MPI_Gatherv( void* sendbuf,/* in */ int sendcount,/* in */ MPI_Datatype sendtype,/* in */ void* recvbuf,/* out */ int* recvcounts,/* in */ int* displs,/* in */ MPI_Datatype recvtype,/* in */ int root,/* in */ MPI_Comm comm ); /* in */ Description –Collects individual messages from each process in communicator at the root process and stores them in rank order

Example of MPI_GATHERV
#include <stdio.h>
#include "mpi.h"

/* assumes the job is run with 3 processes, matching the array sizes */
int main( int argc, char* argv[] )
{
  int i;
  int rank, nproc, iscnt;
  int isend[3], irecv[6];
  int ircnt[3] = {1,2,3}, idisp[3] = {0,1,3};

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  for(i=0; i<rank+1; i++)
    isend[i] = rank + 1;
  iscnt = rank + 1;

Example of MPI_GATHERV (Cont'd)
  MPI_Gatherv( isend, iscnt, MPI_INT, irecv, ircnt, idisp, MPI_INT,
               0, MPI_COMM_WORLD );
  if(rank == 0) {
    for(i=0; i<6; i++)
      printf( "irecv = %d\n", irecv[i] );
  }

  MPI_Finalize();
  return 0;
}

MPI_ALLGATHER Usage –int MPI_Allgather( void* sendbuf,/* in */ int sendcount, /* in */ MPI_Datatype sendtype, /* in */ void* recvbuf, /* out */ int recvcount, /* in */ MPI_Datatype recvtype, /* in */ MPI_Comm comm);/* in */ Description –Gathers individual messages from each process in communicator and distributes the resulting message to each process –Similar to MPI_GATHER except that all processes receive the result

Example of MPI_ALLGATHER
#include <stdio.h>
#include "mpi.h"

/* assumes the job is run with 3 processes, matching the array sizes */
int main( int argc, char* argv[] )
{
  int i;
  int rank, nproc;
  int isend, irecv[3];

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  isend = rank + 1;
  MPI_Allgather( &isend, 1, MPI_INT, irecv, 1, MPI_INT, MPI_COMM_WORLD );

  for(i=0; i<nproc; i++)
    printf( "irecv = %d\n", irecv[i] );

  MPI_Finalize();
  return 0;
}

MPI_ALLGATHERV Usage –int MPI_Allgatherv( void* sendbuf,/* in */ int sendcount, /* in */ MPI_Datatype sendtype, /* in */ void* recvbuf, /* out */ int* recvcounts, /* in */ int* displs,/* in */ MPI_Datatype recvtype, /* in */ MPI_Comm comm);/* in */ Description –Collects individual messages from each process in communicator and distributes the resulting message to all processes –Messages can have different sizes and displacements

Example of MPI_ALLGATHERV
#include <stdio.h>
#include "mpi.h"

/* assumes the job is run with 3 processes, matching the array sizes */
int main( int argc, char* argv[] )
{
  int i;
  int rank, nproc, iscnt;
  int isend[3], irecv[6];
  int ircnt[3] = {1,2,3}, idisp[3] = {0,1,3};

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  for(i=0; i<rank+1; i++)
    isend[i] = rank + 1;
  iscnt = rank + 1;

Example of MPI_ALLGATHERV (Cont'd)
  MPI_Allgatherv( isend, iscnt, MPI_INT, irecv, ircnt, idisp, MPI_INT,
                  MPI_COMM_WORLD );
  for(i=0; i<6; i++)
    printf( "irecv = %d\n", irecv[i] );

  MPI_Finalize();
  return 0;
}

MPI_ALLTOALL Usage –int MPI_Alltoall( void* sendbuf, /* in */ int sendcount, /* in */ MPI_Datatype sendtype, /* in */ void* recvbuf, /* out */ int recvcount, /* in */ MPI_Datatype recvtype, /* in */ MPI_Comm comm); /* in */ Description –Sends a distinct message from each process to every other process –The j-th block of data sent from process i is received by process j and placed in the i-th block of the buffer recvbuf

Example of MPI_ALLTOALL
#include <stdio.h>
#include "mpi.h"

/* assumes the job is run with 3 processes, matching the array sizes */
int main( int argc, char* argv[] )
{
  int i;
  int rank, nproc;
  int isend[3], irecv[3];

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  for(i=0; i<nproc; i++)
    isend[i] = i + nproc * rank;

Example of MPI_ALLTOALL (Cont'd)
  MPI_Alltoall( isend, 1, MPI_INT, irecv, 1, MPI_INT, MPI_COMM_WORLD );
  for(i=0; i<nproc; i++)
    printf( "irecv = %d\n", irecv[i] );

  MPI_Finalize();
  return 0;
}

MPI_ALLTOALLV Usage –int MPI_Alltoallv( void* sendbuf, /* in */ int* sendcounts, /* in */ int* sdispls, /* in */ MPI_Datatype sendtype, /* in */ void* recvbuf, /* out */ int* recvcounts, /* in */ int* rdispls, /* in */ MPI_Datatype recvtype, /* in */ MPI_Comm comm); /* in */ Description –Sends a distinct message from each process to every process –Messages can have different sizes and displacements

MPI_ALLTOALLV (Cont'd) Description (Cont'd) –The type signature associated with sendcounts[j], sendtype at process i must be equal to the type signature associated with recvcounts[i], recvtype at process j

Example of MPI_ALLTOALLV
#include <stdio.h>
#include "mpi.h"

/* assumes the job is run with 3 processes, matching the array sizes */
int main( int argc, char* argv[] )
{
  int i;
  int rank, nproc;
  int isend[6] = {1,2,2,3,3,3}, irecv[9];
  int iscnt[3] = {1,2,3}, isdsp[3] = {0,1,3}, ircnt[3], irdsp[3];

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  for(i=0; i<6; i++)
    isend[i] = isend[i] + nproc * rank;

Example of MPI_ALLTOALLV (Cont'd)
  for(i=0; i<nproc; i++) {
    ircnt[i] = rank + 1;
    irdsp[i] = i * (rank + 1);
  }

  MPI_Alltoallv( isend, iscnt, isdsp, MPI_INT, irecv, ircnt, irdsp, MPI_INT,
                 MPI_COMM_WORLD );
  for(i=0; i<iscnt[rank] * nproc; i++)
    printf( "irecv = %d\n", irecv[i] );

  MPI_Finalize();
  return 0;
}

MPI_REDUCE Usage –int MPI_Reduce( void* sendbuf, /* in */ void* recvbuf, /* out */ int count, /* in */ MPI_Datatype datatype, /* in */ MPI_Op op, /* in */ int root, /* in */ MPI_Comm comm); /* in */ Description –Applies a reduction operation to the vector sendbuf over the set of processes specified by communicator and places the result in recvbuf on root

MPI_REDUCE (Cont’d) Description (Cont’d) –Both the input and output buffers have the same number of elements with the same type –Users may define their own operations or use the predefined operations provided by MPI Predefined operations –MPI_SUM, MPI_PROD –MPI_MAX, MPI_MIN –MPI_MAXLOC, MPI_MINLOC –MPI_LAND, MPI_LOR, MPI_LXOR –MPI_BAND, MPI_BOR, MPI_BXOR

Example of MPI_REDUCE
#include <stdio.h>
#include "mpi.h"

int main( int argc, char* argv[] )
{
  int rank, nproc;
  int isend, irecv;

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  isend = rank + 1;
  MPI_Reduce( &isend, &irecv, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD );

  if(rank == 0)
    printf( "irecv = %d\n", irecv );

  MPI_Finalize();
  return 0;
}

MPI_REDUCE for scalars

MPI_REDUCE for arrays

MPI_ALLREDUCE Usage –int MPI_Allreduce( void* sendbuf, /* in */ void* recvbuf, /* out */ int count, /* in */ MPI_Datatype datatype, /* in */ MPI_Op op, /* in */ MPI_Comm comm);/* in */ Description –Applies a reduction operation to the vector sendbuf over the set of processes specified by communicator and places the result in recvbuf on all the processes in communicator –This routine is similar to MPI_REDUCE except that the result is returned to the receive buffer of all the group members

Example of MPI_ALLREDUCE
#include <stdio.h>
#include "mpi.h"

int main( int argc, char* argv[] )
{
  int rank, nproc;
  int isend, irecv;

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  isend = rank + 1;
  MPI_Allreduce( &isend, &irecv, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD );
  printf( "irecv = %d\n", irecv );

  MPI_Finalize();
  return 0;
}

MPI_SCAN Usage –int MPI_Scan( void* sendbuf, /* in */ void* recvbuf, /* out */ int count, /* in */ MPI_Datatype datatype, /* in */ MPI_Op op, /* in */ MPI_Comm comm);/* in */ Description –Performs a parallel prefix reduction on data distributed across a group –The operation returns, in the receive buffer of the process with rank i, the reduction of the values in the send buffers of processes with ranks 0…i

Example of MPI_SCAN
#include <stdio.h>
#include "mpi.h"

int main( int argc, char* argv[] )
{
  int rank, nproc;
  int isend, irecv;

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  isend = rank + 1;
  MPI_Scan( &isend, &irecv, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD );
  printf( "irecv = %d\n", irecv );

  MPI_Finalize();
  return 0;
}

MPI_REDUCE_SCATTER Usage –int MPI_Reduce_scatter( void* sendbuf,/* in */ void* recvbuf,/* out */ int* recvcounts,/* in */ MPI_Datatype datatype, /* in */ MPI_Op op,/* in */ MPI_Comm comm); /* in */ Description –Applies a reduction operation to the vector sendbuf over the set of processes specified by communicator and scatters the result according to the values in recvcounts –Functionally equivalent to MPI_REDUCE with count equal to the sum of recvcounts(i), followed by MPI_SCATTERV with sendcounts equal to recvcounts

Example of MPI_REDUCE_SCATTER
#include <stdio.h>
#include "mpi.h"

/* assumes the job is run with 3 processes, matching the array sizes */
int main( int argc, char* argv[] )
{
  int i;
  int rank, nproc;
  int isend[6], irecv[3];
  int ircnt[3] = {1,2,3};

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  for(i=0; i<6; i++)
    isend[i] = 1 + rank * 10;

Example of MPI_REDUCE_SCATTER (Cont'd)
  MPI_Reduce_scatter( isend, irecv, ircnt, MPI_INT, MPI_SUM, MPI_COMM_WORLD );
  for(i=0; i<ircnt[rank]; i++)
    printf( "irecv = %d\n", irecv[i] );

  MPI_Finalize();
  return 0;
}

MPI_OP_CREATE Usage –int MPI_Op_create( MPI_User_function* function, /* in */ int commute, /* in */ MPI_Op* op);/* out */ Description –Binds a user-defined reduction operation to an op handle –MPI_REDUCE, MPI_ALLREDUCE, MPI_REDUCE_SCATTER, MPI_SCAN –If commute is true, then the operation must be both commutative and associative. –MPI_User_function typedef void MPI_User_function(void* invec, void* inoutvec, int* len, MPI_Datatype* datatype);

Example of MPI_OP_CREATE
#include "mpi.h"

typedef struct { float re, im; } complex_t;   /* simple complex type used in this example */
void my_sum( void* cin, void* cinout, int* len, MPI_Datatype* type );

int main( int argc, char* argv[] )
{
  int rank, nproc;
  MPI_Op isum;
  MPI_Datatype ctype;
  complex_t c[2], csum[2];

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  c[0].re = (float) rank;  c[0].im = 0.0f;
  c[1].re = 0.0f;          c[1].im = (float) rank;

  MPI_Type_contiguous( 2, MPI_FLOAT, &ctype );   /* datatype matching complex_t */
  MPI_Type_commit( &ctype );
  MPI_Op_create( my_sum, 1 /* commutative */, &isum );

  MPI_Reduce( c, csum, 2, ctype, isum, 0, MPI_COMM_WORLD );

  MPI_Finalize();
  return 0;
}

void my_sum( void* cin, void* cinout, int* len, MPI_Datatype* type )
{
  int i;
  complex_t* in    = (complex_t*) cin;
  complex_t* inout = (complex_t*) cinout;
  for(i=0; i<*len; i++) {
    inout[i].re = inout[i].re + in[i].re;
    inout[i].im = inout[i].im + in[i].im;
  }
}

MPI_BARRIER Usage –int MPI_Barrier(MPI_Comm comm);/* in */ Description –Blocks each process in communicator until all processes have called it
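
A common use of a barrier (not covered in the slides) is to line processes up before and after a region so that MPI_WTIME measures the slowest process; a minimal sketch:

#include <stdio.h>
#include "mpi.h"

int main( int argc, char* argv[] )
{
  int rank;
  double t0, t1;

  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  MPI_Barrier( MPI_COMM_WORLD );     /* everyone starts the timed region together */
  t0 = MPI_Wtime();
  /* ... the work to be timed would go here ... */
  MPI_Barrier( MPI_COMM_WORLD );     /* wait for the slowest process */
  t1 = MPI_Wtime();

  if (rank == 0)
    printf( "elapsed time: %f seconds\n", t1 - t0 );

  MPI_Finalize();
  return 0;
}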

Hello World: C Code
The easiest way to see exactly how a parallel code is put together and run is to write the classic "Hello World" program in parallel. In this case it simply means that every PE will say hello to us. Let's take a look at the code to do this.

Hello World C Code:
#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv)
{
    int my_PE_num;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num);
    printf("Hello from %d.\n", my_PE_num);
    MPI_Finalize();
    return 0;
}

Hello World: Fortran Code
      program shifter
      include 'mpif.h'
      integer my_pe_num, errcode

      call MPI_INIT(errcode)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_pe_num, errcode)
      print *, 'Hello from ', my_pe_num, '.'
      call MPI_FINALIZE(errcode)
      end

Output Hello from 5. Hello from 3. Hello from 1. Hello from 2. Hello from 7. Hello from 0. Hello from 6. Hello from 4. There are two issues here that may not have been expected. The most obvious is that the output might seem out of order. The response to that is "what order were you expecting?" Remember, the code was started on all nodes practically simultaneously. There was no reason to expect one node to finish before another. Indeed, if we rerun the code we will probably get a different order. Sometimes it may seem that there is a very repeatable order. But, one important rule of parallel computing is don't assume that there is any particular order to events unless there is something to guarantee it. Later on we will see how we could force a particular order on this output.

Master and Slaves PEs The much more common case is to have a single PE that is used for some sort of coordination purpose, while the other PEs run code that is the same, although the data will be different. This is how one would implement a master/slave or host/node paradigm: if (my_PE_num == 0) MasterCodeRoutine(); else SlaveCodeRoutine(); Of course, our Hello World code above is the trivial case of EveryBodyRunThisRoutine, and consequently the only difference is in the output, as each copy actually uses its own PE number. A fuller skeleton of the master/slave structure is sketched below.
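
A minimal runnable sketch of that structure; MasterCodeRoutine and SlaveCodeRoutine are placeholder names, as above:

#include <stdio.h>
#include "mpi.h"

/* Placeholder routines: in a real code the master would hand out work and
   collect results, and the slaves would do the actual computation. */
void MasterCodeRoutine(int nproc)      { printf("master coordinating %d processes\n", nproc); }
void SlaveCodeRoutine(int my_PE_num)   { printf("slave %d working\n", my_PE_num); }

int main(int argc, char** argv)
{
    int my_PE_num, nproc;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    if (my_PE_num == 0)
        MasterCodeRoutine(nproc);
    else
        SlaveCodeRoutine(my_PE_num);

    MPI_Finalize();
    return 0;
}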

MPI_COMM_WORLD In the Hello World program, we see that the first parameter in MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num) is MPI_COMM_WORLD. MPI_COMM_WORLD is known as the "communicator" and can be found in many of the MPI routines. In general, it is used so that one can divide up the PEs into subsets for various algorithmic purposes. For example, if we had an array that we wished to find the determinant of distributed across the PEs, we might wish to define some subset of the PEs that holds a certain column of the array so that we could address only those PEs conveniently. However, this is a convenience that can often be dispensed with. As such, one will often see the value MPI_COMM_WORLD used anywhere that a communicator is required. This is simply the global set that states we don't really care to deal with any particular subset here.
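
For completeness, here is a hedged sketch of how such a subset could be created with MPI_Comm_split, a routine not otherwise covered in these notes; splitting by even and odd rank is an arbitrary choice for illustration:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv)
{
    int world_rank, color, sub_rank;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    color = world_rank % 2;                /* 0 = even ranks, 1 = odd ranks */
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);
    MPI_Comm_rank(sub_comm, &sub_rank);    /* rank within the new sub-communicator */

    printf("world rank %d has rank %d in sub-communicator %d\n",
           world_rank, sub_rank, color);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}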

Compiling and Running
Well, now that we may have some idea how the above code will perform, let's compile it and run it to see if it meets our expectations. We compile using a normal ANSI C or Fortran 90 compiler (C++ is also available).
While logged in to the cluster master node:
For C codes: cc -lmpi hello.c
For Fortran codes: f90 -lmpi hello.f90
We now have an executable. To run on the Cluster we must tell the machine how many copies we wish to run. You can choose any number. We'll try 8:
Some Clusters use: mpprun -n8 a.out
Others use: prun -n8 a.out
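
On many current clusters the MPI installation instead provides compiler wrappers and a launcher, so the equivalent steps would look something like mpicc hello.c (or mpif90 hello.f90) followed by mpirun -np 8 a.out or mpiexec -n 8 a.out; the exact names and flags vary by site, so check your system's documentation.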

Where Will The Output Go? The second issue, although you may have taken it for granted, is "where will the output go?". This is another question that MPI dodges because it is so implementation dependent. On the Cluster, the I/O is structured in about the simplest way possible. All PEs can read and write (files as well as console I/O) through the standard channels. This is very convenient, and in our case results in all of the "standard output" going back to your terminal window on the Cluster. In general, it can be much more complex. For instance, suppose you were running this on a cluster of 8 workstations. Would the output go to eight separate consoles? Or, in a more typical situation, suppose you wished to write results out to a file: With the workstations, you would probably end up with eight separate files on eight separate disks. With the Cluster, they can all access the same file simultaneously. There are some good reasons why you would want to exercise some constraint even on the Cluster. 512 PEs accessing the same file would be extremely inefficient.

Sending and Receiving Messages Hello world might be illustrative, but we haven't really done any message passing yet. Let's write the simplest possible message passing program. It will run on 2 PEs and will send a simple message (the number 42) from PE 1 to PE 0. PE 0 will then print this out.

Sending a Message Sending a message is a simple procedure. In our case the routine will look like this in C (the standard man pages are in C, so you should get used to seeing this format): MPI_Send( &numbertosend, 1, MPI_INT, 0, 10, MPI_COMM_WORLD)

Sending a Message (Cont'd)
Let's look at the parameters individually:
&numbertosend: a pointer to whatever we wish to send. In this case it is simply an integer. It could be anything from a character string to a column of an array or a structure. It is even possible to pack several different data types in one message.
1: the number of items we wish to send. If we were sending a vector of 10 ints, we would point to the first one in the above parameter and set this to the size of the array.
MPI_INT: the type of object we are sending. Possible values are: MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_UNSIGNED_CHAR, MPI_UNSIGNED_SHORT, MPI_UNSIGNED, MPI_UNSIGNED_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE, MPI_BYTE, MPI_PACKED. Most of these are obvious in use. MPI_BYTE sends raw bytes (on a heterogeneous workstation cluster this suppresses any data conversion). MPI_PACKED can be used to pack multiple data types in one message, but it does require a few additional routines we won't go into (those of you familiar with PVM will recognize this).
0: destination of the message. In this case PE 0.
10: message tag. All messages have a tag attached to them that can be useful for sorting messages. For example, one could give high-priority control messages a different tag than data messages. When receiving, the program would check for messages that use the control tag first. We just picked 10 at random.
MPI_COMM_WORLD: we don't really care about any subsets of PEs here, so we just chose this "default".

Receiving a Message
Receiving a message is equally simple. In our case it will look like:
MPI_Recv( &numbertoreceive, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status)
&numbertoreceive: a pointer to the variable that will receive the item. In our case it is simply an integer that has some undefined value until now.
1: number of items to receive. Just 1 here.
MPI_INT: datatype. Better be an int, since that's what we sent.
MPI_ANY_SOURCE: the node to receive from. We could use 1 here since the message is coming from there, but we'll illustrate the "wild card" method of receiving a message from anywhere.
MPI_ANY_TAG: we could use a value of 10 here to filter out any other messages (there aren't any) but, again, this is a convenient place to show how to receive any tag.
MPI_COMM_WORLD: just using the default set of all PEs.
&status: a structure that receives the status data, which includes the source and tag of the message.

Send and Receive C Code
#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv)
{
    int my_PE_num, numbertoreceive, numbertosend = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num);

    if (my_PE_num == 0) {
        MPI_Recv(&numbertoreceive, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        printf("Number received is: %d\n", numbertoreceive);
    }
    else
        MPI_Send(&numbertosend, 1, MPI_INT, 0, 10, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

Send and Receive Fortran Code
      program shifter
      implicit none
      include 'mpif.h'
      integer my_pe_num, errcode, numbertoreceive, numbertosend
      integer status(MPI_STATUS_SIZE)

      call MPI_INIT(errcode)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_pe_num, errcode)
      numbertosend = 42

      if (my_PE_num.EQ.0) then
         call MPI_Recv( numbertoreceive, 1, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status, errcode)
         print *, 'Number received is: ', numbertoreceive
      endif

      if (my_PE_num.EQ.1) then
         call MPI_Send( numbertosend, 1, MPI_INTEGER, 0, 10, MPI_COMM_WORLD, errcode)
      endif

      call MPI_FINALIZE(errcode)
      end

Non-Blocking Receives All of the receives that we will use are blocking. This means that they will wait until a message matching their requirements for source and tag has been received. It is possible to use non-blocking communications: a receive will then return immediately, and it is up to the code to determine when the data has actually arrived, using additional routines. In most cases the additional coding is not worth it in terms of performance and code robustness, but for certain algorithms it can be useful to keep in mind.

Communication Modes
There are four possible modes (with slightly differently named MPI_xSEND routines) for buffering and sending messages in MPI. We use the standard mode here, and you may find this sufficient for the majority of your needs. However, the other modes can allow for substantial optimization in the right circumstances:
Standard mode: the send will usually not block even if a receive for that message has not occurred. The exception is if there are resource limitations (buffer space).
Buffered mode: similar to the above, but will never block (it just returns an error instead).
Synchronous mode: will only return when a matching receive has started.
Ready mode: will only work if the matching receive is already waiting.

Synchronization We are going to write one more code, which will employ the remaining tool that we need for general parallel programming: synchronization. Many algorithms require that you be able to get all of the nodes into some controlled state before proceeding to the next stage. This is usually done with a synchronization point that requires all of the nodes (or at least some specified subset) to reach a certain point before proceeding. Sometimes the manner in which messages block will achieve this same result implicitly, but it is often necessary to do it explicitly, and debugging is often greatly aided by the insertion of synchronization points which are later removed for the sake of efficiency. Our code will perform the rather pointless operation of having PE 0 send a number to the other 3 PEs and have them multiply that number by their own PE number. They will then print the results out (in order; remember the Hello World program?) and send them back to PE 0, which will print out the sum.

Synchronization: C Code
#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv)
{
    int my_PE_num, numbertoreceive, numbertosend = 4, index, result = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num);

    if (my_PE_num == 0)
        for (index = 1; index < 4; index++)
            MPI_Send(&numbertosend, 1, MPI_INT, index, 10, MPI_COMM_WORLD);
    else {
        MPI_Recv(&numbertoreceive, 1, MPI_INT, 0, 10, MPI_COMM_WORLD, &status);
        result = numbertoreceive * my_PE_num;
    }

    for (index = 1; index < 4; index++) {
        MPI_Barrier(MPI_COMM_WORLD);
        if (index == my_PE_num)
            printf("PE %d's result is %d.\n", my_PE_num, result);
    }

    if (my_PE_num == 0) {
        for (index = 1; index < 4; index++) {
            MPI_Recv(&numbertoreceive, 1, MPI_INT, index, 10, MPI_COMM_WORLD, &status);
            result += numbertoreceive;
        }
        printf("Total is %d.\n", result);
    }
    else
        MPI_Send(&result, 1, MPI_INT, 0, 10, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

Synchronization: Fortran Code
      program shifter
      implicit none
      include 'mpif.h'
      integer my_pe_num, errcode, numbertoreceive, numbertosend
      integer index, result
      integer status(MPI_STATUS_SIZE)

      call MPI_INIT(errcode)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_pe_num, errcode)
      numbertosend = 4
      result = 0

      if (my_PE_num.EQ.0) then
         do index=1,3
            call MPI_Send( numbertosend, 1, MPI_INTEGER, index, 10, MPI_COMM_WORLD, errcode)
         enddo
      else
         call MPI_Recv( numbertoreceive, 1, MPI_INTEGER, 0, 10, MPI_COMM_WORLD, status, errcode)
         result = numbertoreceive * my_PE_num
      endif

      do index=1,3
         call MPI_Barrier(MPI_COMM_WORLD, errcode)
         if (my_PE_num.EQ.index) then
            print *, 'PE ', my_PE_num, 's result is ', result, '.'
         endif
      enddo

      if (my_PE_num.EQ.0) then
         do index=1,3
            call MPI_Recv( numbertoreceive, 1, MPI_INTEGER, index, 10, MPI_COMM_WORLD, status, errcode)
            result = result + numbertoreceive
         enddo
         print *, 'Total is ', result, '.'
      else
         call MPI_Send( result, 1, MPI_INTEGER, 0, 10, MPI_COMM_WORLD, errcode)
      endif

      call MPI_FINALIZE(errcode)
      end

Results of "Synchronization"
The output you get when running this code with 4 PEs (what will happen if you run it with more or fewer?) is the following:
PE 1's result is 4.
PE 2's result is 8.
PE 3's result is 12.
Total is 24.

Analysis of "Synchronization" The best way to make sure that you understand what is happening in the code above is to look at things from the perspective of each PE in turn. THIS IS THE WAY TO DEBUG ANY MESSAGE-PASSING (or MIMD) CODE. Follow from the top to the bottom of the code as PE 0, and do likewise for PE 1. See exactly where one PE is dependent on another to proceed. Look at each PE's progress as though it is 100 times faster or slower than the other nodes. Would this affect the final program flow? It shouldn't, unless you made assumptions that are not always valid.

Reduction
MPI_Reduce: Reduces values on all processes to a single value.
Synopsis:
#include "mpi.h"
int MPI_Reduce( void* sendbuf, void* recvbuf, int count,
                MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm );

Reduction Cont'd
Input parameters:
sendbuf: address of send buffer
count: number of elements in send buffer (integer)
datatype: data type of elements of send buffer (handle)
op: reduce operation (handle)
root: rank of root process (integer)
comm: communicator (handle)
Output parameter:
recvbuf: address of receive buffer (choice, significant only at root)
Algorithm: This implementation currently uses a simple tree algorithm.

Finding Pi Our last example will find the value of pi by integrating 4/(1 + x²) from 0 to 1; the exact value of this integral is π. The master process (0) will query for a number of intervals to use, and then broadcast this number to all of the other processors. Each processor will then sum the midpoint-rule contributions of every numprocs-th interval, starting with interval rank+1. Finally, the sums computed by each processor are added together using a new type of MPI operation, a reduction.

Finding Pi
      program FindPI
      implicit none
      include 'mpif.h'
      integer n, my_pe_num, numprocs, index, errcode
      real mypi, pi, h, sum, x

      call MPI_Init(errcode)
      call MPI_Comm_size(MPI_COMM_WORLD, numprocs, errcode)
      call MPI_Comm_rank(MPI_COMM_WORLD, my_pe_num, errcode)

      if (my_pe_num.EQ.0) then
         print *, 'How many intervals?:'
         read *, n
      endif

      call MPI_Bcast(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, errcode)

      h = 1.0 / n
      sum = 0.0
      do index = my_pe_num+1, n, numprocs
         x = h * (index - 0.5)
         sum = sum + 4.0 / (1.0 + x*x)
      enddo
      mypi = h * sum

      call MPI_Reduce(mypi, pi, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, errcode)

      if (my_pe_num.EQ.0) then
         print *, 'pi is approximately ', pi
         print *, 'Error is ', abs(pi - 3.14159265358979)
      endif

      call MPI_Finalize(errcode)
      end

Do Not Make Any Assumptions Do not make any assumptions about the mechanics of the actual message-passing. Remember that MPI is designed to operate not only on fast MPP networks, but also on Internet-size meta-computers. As such, the order and timing of messages may be considerably skewed. MPI makes only one guarantee: two messages sent from one process to another process will arrive in that relative order. However, a message sent later from another process may arrive before, or between, those two messages.

References
There is a wide variety of material available on the Web, some of which is intended to be used as hardcopy manuals and tutorials. You may wish to start at one of the MPI home pages, from which you can find a lot of useful information without traveling too far. To learn the syntax of MPI calls, access the index for the Message Passing Interface Standard.
Books:
Parallel Programming with MPI. Peter S. Pacheco. San Francisco: Morgan Kaufmann Publishers, Inc.
PVM: A Users' Guide and Tutorial for Networked Parallel Computing. Al Geist, Adam Beguelin, Jack Dongarra, et al. MIT Press.
Using MPI: Portable Parallel Programming with the Message-Passing Interface. William Gropp, Ewing Lusk, Anthony Skjellum. MIT Press, 1996.