1 High Performance Computing: Concepts, Methods & Means MPI: The Message Passing Interface
Prof. Daniel S. Katz Department of Electrical and Computer Engineering Louisiana State University February 22nd, 2007

2 Topics
Introduction
MPI Standard
MPI-1.x Model and Basic Calls
Point-to-point Communication
Collective Communication
Advanced MPI-1.x Highlights
MPI-2 Highlights
Summary

3 Topics Introduction MPI Standard MPI-1.x Model and Basic Calls
Point-to-point Communication Collective Communication Advanced MPI-1.x Highlights MPI-2 Highlights Summary

4 Opening Remarks
Context: distributed memory parallel computers
We have communicating sequential processes, each with its own memory, and no access to another process's memory
A fairly common scenario from the mid 1980s (Intel Hypercube) to today
Processes interact (exchange data, synchronize) through message passing
Initially, each computer vendor had its own library and calls
First standardization was PVM
Started in 1989, first public release in 1991
Worked well on distributed machines
A library, not an API
Next was MPI

5 What you'll Need to Know
What a standard API is
How to build and run an MPI-1.x program
Basic MPI functions
4 basic environment functions
Including the idea of communicators
Basic point-to-point functions
Blocking and non-blocking
Deadlock and how to avoid it
Datatypes
Basic collective functions
The advanced MPI-1.x material may be required for the problem set
The MPI-2 highlights are just for information

6 Topics Introduction MPI Standard MPI-1.x Model and Basic Calls
Point-to-point Communication Collective Communication Advanced MPI-1.x Highlights MPI-2 Highlights Summary

7 MPI Standard
In the early 1990s, a group of people representing both vendors and users got together and decided to create a standard interface to message passing calls
In the context of distributed memory parallel computers (MPPs; there weren't really clusters yet)
MPI-1 was the result
"Just" an API
FORTRAN77 and C bindings
Reference implementation (mpich) also developed
Vendors also kept their own internals (behind the API)
Vendor interfaces faded away over about 2 years

8 MPI Standard, Since Then
MPI-1.1
Fixed bugs, clarified issues
MPI-2
Included MPI-1.2, which fixed more bugs and clarified more issues
Extended MPI without new functionality: new datatype constructors, language interoperability
New functionality: one-sided communication, MPI I/O, dynamic processes
FORTRAN90 and C++ bindings
Best MPI reference: the MPI Standard, on-line at the MPI Forum web site

9 Topics Introduction MPI Standard MPI-1.x Model and Basic Calls
Point-to-point Communication Collective Communication Advanced MPI-1.x Highlights MPI-2 Highlights Summary

10 Building an MPI Executable
Not specified in the standard
Two normal options, dependent on implementation
Library version: cc -Iheaderdir -Llibdir mpicode.c -lmpi
The user knows where the header file and library are, and tells the compiler
Wrapper version: mpicc -o executable mpicode.c
Does the same thing, but hides the details from the user
You can do either one, but don't try to do both!
On Celeritas (celeritas.cct.lsu.edu), the latter is easier

11 MPI Model
Some number of processes are started somewhere
Again, the standard doesn't talk about this
Implementation and interface vary
Usually, some sort of mpirun command starts some number of copies of an executable according to a mapping
Example: mpirun -np 2 ./a.out
Runs two copies of ./a.out, placed where the system specifies
Most production supercomputing resources wrap the mpirun command with higher-level scripts that interact with scheduling systems such as PBS / LoadLeveler for efficient resource management and multi-user support
Sample PBS / LoadLeveler job submission scripts:

PBS file:
#!/bin/bash
#PBS -l walltime=120:00:00,nodes=8:ppn=4
cd /home/cdekate/S1_L2_Demos/adc/
pwd
date
mpirun -np 32 -machinefile $PBS_NODEFILE ./padcirc

LoadLeveler file:
#!/bin/bash
# @ job_type = parallel
# @ job_name = SIMID
# @ wall_clock_limit = 120:00:00
# @ node = 8
# @ total_tasks = 32
# @ initialdir = /scratch/cdekate/
# @ executable = /usr/bin/poe
# @ arguments = /scratch/cdekate/padcirc
# @ queue

12 MPI Communicators
A communicator is an internal object
MPI provides functions to interact with it
The default communicator is MPI_COMM_WORLD
All processes are members of it
It has a size (the number of processes)
Each process has a rank within it
Can think of it as an ordered list of processes
Additional communicators can co-exist
A process can belong to more than one communicator
Within a communicator, each process has a unique rank

13 A Sample MPI program (Fortran)
...
INCLUDE 'mpif.h'
CALL MPI_INIT(IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, SIZE, IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
...
CALL MPI_FINALIZE(IERR)

14 A Sample MPI program (C)
...
#include <mpi.h>
...
err = MPI_Init(&Argc, &Argv);
err = MPI_Comm_size(MPI_COMM_WORLD, &size);
err = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
...
err = MPI_Finalize();

15 A Sample MPI program Mandatory in any MPI code
Defines MPI-related parameters ... #include <mpi.h> err = MPI_Init(&Argc,&Argv); err = MPI_Comm_size(MPI_COMM_WORLD, &size); err = MPI_Comm_rank(MPI_COMM_WORLD, &rank); err = MPI_Finalize();

16 A Sample MPI program Must be called in any MPI code by all processes once and only once before any other MPI calls ... #include <mpi.h> err = MPI_Init(&Argc,&Argv); err = MPI_Comm_size(MPI_COMM_WORLD, &size); err = MPI_Comm_rank(MPI_COMM_WORLD, &rank); err = MPI_Finalize();

17 A Sample MPI program Must be called in any MPI code by all processes once and only once after all other MPI calls ... #include <mpi.h> err = MPI_Init(&Argc,&Argv); err = MPI_Comm_size(MPI_COMM_WORLD, &size); err = MPI_Comm_rank(MPI_COMM_WORLD, &rank); err = MPI_Finalize();

18 A Sample MPI program Returns the number of processes (size) in the communicator (MPI_COMM_WORLD) ... #include <mpi.h> err = MPI_Init(&Argc,&Argv); err = MPI_Comm_size(MPI_COMM_WORLD, &size); err = MPI_Comm_rank(MPI_COMM_WORLD, &rank); err = MPI_Finalize();

19 A Sample MPI program Returns the rank of this process (rank) in the communicator (MPI_COMM_WORLD) Has unique return value per process ... #include <mpi.h> err = MPI_Init(&Argc,&Argv); err = MPI_Comm_size(MPI_COMM_WORLD, &size); err = MPI_Comm_rank(MPI_COMM_WORLD, &rank); err = MPI_Finalize();

20 A Sample MPI program
Argument and variable types: err, Argc, size, and rank are int; Argv is char**
...
#include <mpi.h>
err = MPI_Init(&Argc, &Argv);
err = MPI_Comm_size(MPI_COMM_WORLD, &size);
err = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
err = MPI_Finalize();

21 A Complete MPI Example
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
  int err, size, rank;
  err = MPI_Init(&argc, &argv);              /* Initialize MPI */
  if (err != MPI_SUCCESS) {
    printf("MPI initialization failed!\n");
    exit(1);
  }
  err = MPI_Comm_size(MPI_COMM_WORLD, &size);
  err = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {                           /* root process */
    printf("I am the root\n");
  } else {
    printf("I am not the root\n");
    printf("My rank is %d\n", rank);
  }
  err = MPI_Finalize();
  exit(0);
}

Output (with 3 processes):
I am not the root
My rank is 2
I am the root
I am not the root
My rank is 1

22 Topics Introduction MPI Standard MPI-1.x Model and Basic Calls
Point-to-point Communication Collective Communication Advanced MPI-1.x Highlights MPI-2 Highlights Summary

23 Point-to-point Communication
How two processes interact
Most flexible communication in MPI
Two basic varieties: blocking and nonblocking
Two basic functions: send and receive
With these two functions, and the four functions we already know, you can do everything in MPI
But there's probably a better way to do a lot of things, using other functions

24 Basic concept (buffered)
[Diagram: buffered message passing between Process 0 and Process 1, crossing from user mode into kernel mode. Step 1: the send call copies data from sendbuf to a system buffer (sysbuf) on the sending side, and the send subroutine returns. Step 2: the data is sent to the sysbuf at the receiving end. Step 3: the receive call copies data from the receiving sysbuf into recvbuf, and the receive subroutine returns.]

25 Blocking (buffered)
[Same buffered-transfer diagram as the previous slide: sendbuf to sysbuf on the sender, sysbuf to recvbuf on the receiver.]
Calls do not return until the data transfer is done
Send doesn't return until sendbuf can be reused
Receive doesn't return until recvbuf can be used

26 Nonblocking (buffered)
[Same buffered-transfer diagram as the previous slides.]
Calls return after the data transfer is started
Faster than blocking calls
Could cause problems if sendbuf or recvbuf is changed during the call
Need a new call (wait) to know if the nonblocking call is done

27 Datatypes
Basic datatypes
You can also define your own (derived datatypes), such as an array of ints of size 100, or more complex examples, such as a struct or an array of structs

MPI datatype          C datatype
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI_BYTE
MPI_PACKED

MPI datatype          Fortran datatype
MPI_INTEGER           INTEGER
MPI_DOUBLE_PRECISION  DOUBLE PRECISION
MPI_COMPLEX           COMPLEX
MPI_LOGICAL           LOGICAL
MPI_CHARACTER         CHARACTER(1)
MPI_BYTE
MPI_PACKED
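As a hedged illustration of the "array of ints of size 100" case above (this example is not from the original slides): a derived datatype can be built with MPI_Type_contiguous and must be committed before use. The send/receive calls follow the syntax shown on the next slide, and the partner ranks are arbitrary choices.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int i, rank, data[100];
  MPI_Status status;
  MPI_Datatype block_t;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Describe "an array of 100 ints" as a single derived type */
  MPI_Type_contiguous(100, MPI_INT, &block_t);
  MPI_Type_commit(&block_t);                 /* must commit before use */

  if (rank == 0) {
    for (i = 0; i < 100; i++) data[i] = i;
    MPI_Send(data, 1, block_t, 1, 0, MPI_COMM_WORLD);   /* send one block */
  } else if (rank == 1) {
    MPI_Recv(data, 1, block_t, 0, 0, MPI_COMM_WORLD, &status);
    printf("received, data[99] = %d\n", data[99]);
  }

  MPI_Type_free(&block_t);                   /* release the type */
  MPI_Finalize();
  return 0;
}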

28 Point-to-Point Syntax
Blocking:
err = MPI_Send(sendbuf, count, datatype, destination, tag, comm);
err = MPI_Recv(recvbuf, count, datatype, source, tag, comm, &status);
Nonblocking:
err = MPI_Isend(sendbuf, count, datatype, destination, tag, comm, &req);
err = MPI_Irecv(recvbuf, count, datatype, source, tag, comm, &req);
err = MPI_Wait(&req, &status);
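Putting the blocking calls together, here is a minimal sketch (not from the original slides); the tag value 99 and the single-int message are arbitrary choices, and it assumes at least two processes:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, value = 0, tag = 99;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    value = 42;
    /* blocking send of one int from rank 0 to rank 1 */
    MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
  } else if (rank == 1) {
    /* blocking receive from rank 0 */
    MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
    printf("rank 1 received %d\n", value);
  }

  MPI_Finalize();
  return 0;
}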

29 Deadlock
Something to avoid
A situation where the dependencies between processors are cyclic
One processor is waiting for a message from another processor, but that processor is waiting for a message from the first, so nothing happens
Until your time in the queue runs out and your job is killed
MPI does not have timeouts

30 Deadlock Example
if (rank == 0) {
  err = MPI_Send(sendbuf, count, datatype, 1, tag, comm);
  err = MPI_Recv(recvbuf, count, datatype, 1, tag, comm, &status);
} else {
  err = MPI_Send(sendbuf, count, datatype, 0, tag, comm);
  err = MPI_Recv(recvbuf, count, datatype, 0, tag, comm, &status);
}
If the message sizes are small enough, this should work because of system buffers
If the messages are too large, or system buffering is not used, this will hang

31 Deadlock Example Solutions
Reorder so one side receives first:
if (rank == 0) {
  err = MPI_Send(sendbuf, count, datatype, 1, tag, comm);
  err = MPI_Recv(recvbuf, count, datatype, 1, tag, comm, &status);
} else {
  err = MPI_Recv(recvbuf, count, datatype, 0, tag, comm, &status);
  err = MPI_Send(sendbuf, count, datatype, 0, tag, comm);
}
or use nonblocking sends:
if (rank == 0) {
  err = MPI_Isend(sendbuf, count, datatype, 1, tag, comm, &req);
  err = MPI_Recv(recvbuf, count, datatype, 1, tag, comm, &status);
  err = MPI_Wait(&req, &status);
} else {
  err = MPI_Isend(sendbuf, count, datatype, 0, tag, comm, &req);
  err = MPI_Recv(recvbuf, count, datatype, 0, tag, comm, &status);
  err = MPI_Wait(&req, &status);
}

32 Topics Introduction MPI Standard MPI-1.x Model and Basic Calls
Point-to-point Communication Collective Communication Advanced MPI-1.x Highlights MPI-2 Highlights Summary

33 Collective Communication
How a group of processes interact
"group" here means the processes in a communicator
A group can be as small as one or two processes
One process communicating with itself isn't interesting
Two processes communicating are probably better handled through point-to-point communication
Most efficient communication in MPI
All collective communication is blocking

34 Collective Communication Types
Three types of collective communication:
Synchronization (example: barrier)
Data movement (examples: broadcast, gather)
Reduction, i.e. computation (example: reduce)
All of these could also be done with point-to-point communications
Collective operations give better performance and better productivity

35 Synchronization: Barrier
Called by all processes in a communicator
Each calling process blocks until all processes have made the call
err = MPI_Barrier(comm);
My opinions:
Barriers are not needed in MPI-1.x codes unless based on external events
Examples: signals from outside, I/O
Barriers are a performance bottleneck
One-sided communication in MPI-2 codes is different
Barriers can help in printf-style debugging
Barriers may be needed for accurate timing (see the sketch below)
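As a hedged sketch of the "accurate timing" point (not from the original slides): a barrier is often placed before starting the clock so every process enters the timed region together; do_work() is a hypothetical placeholder for the computation being measured.

#include <stdio.h>
#include <mpi.h>

void do_work(void) { /* ... application work ... */ }

int main(int argc, char *argv[])
{
  int rank;
  double t0, t1;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Barrier(MPI_COMM_WORLD);   /* line everyone up before timing */
  t0 = MPI_Wtime();

  do_work();

  MPI_Barrier(MPI_COMM_WORLD);   /* wait for the slowest process */
  t1 = MPI_Wtime();

  if (rank == 0)
    printf("elapsed: %f seconds\n", t1 - t0);

  MPI_Finalize();
  return 0;
}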

36 Data Movement: Broadcast
Send data from one process, called the root (P0 here), to all other processes in the communicator
Called by all processes in the communicator, with the same arguments
err = MPI_Bcast(buf, count, datatype, root, comm);

37 Broadcast Example
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
  int err, rank, data[100];
  err = MPI_Init(&argc, &argv);              /* Initialize MPI */
  err = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {                           /* root process */
    /* fill all of the data array here, with data[99] as xyz */
  }
  err = MPI_Bcast(data, 100, MPI_INT, 0, MPI_COMM_WORLD);
  if (rank == 2)
    printf("My data[99] is %d\n", data[99]);
  err = MPI_Finalize();
  exit(0);
}

Output (with 3 processes):
My data[99] is xyz

38 Data Movement: Gather
Collect data from all processes in a communicator to one process, called the root (P0 here)
Called by all processes in the communicator, with the same arguments
err = MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm);
As if all n processes called
MPI_Send(sendbuf, sendcount, sendtype, root, …);
and the root made n calls (i = 0, …, n-1)
MPI_Recv(recvbuf + i*recvcount*extent(recvtype), recvcount, recvtype, i, …);

39 Gather Example
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
  int err, size, rank, data[100], *rbuf;
  MPI_Init(&argc, &argv);                    /* Initialize MPI */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  /* all processes fill their data array here */
  /* make data[0] = xyz on process 1 */
  if (rank == 0) {                           /* root process */
    rbuf = malloc(size*100*sizeof(int));
  }
  MPI_Gather(data, 100, MPI_INT, rbuf, 100, MPI_INT, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("rbuf[100] is %d\n", rbuf[100]);
  MPI_Finalize();
  exit(0);
}

Output (with >1 processes):
rbuf[100] is xyz

40 Reduction: Reduce
Similar to gather: collect data from all processes in a communicator to one process, called the root (P0 here)
But the data is operated upon
Called by all processes in the communicator, with the same arguments
err = MPI_Reduce(sendbuf, recvbuf, count, datatype, operation, root, comm);

41 Reduction: Reduce Operations
Predefined operations: MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND, MPI_LOR, MPI_BOR, MPI_LXOR, MPI_BXOR, MPI_MAXLOC, MPI_MINLOC
MAXLOC and MINLOC are tricky: read the spec
You can also define your own operation (see the sketch below)
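A hedged sketch of defining your own operation (not from the original slides): MPI_Op_create registers a user function with the signature MPI requires; the element-wise maximum of absolute values here is an arbitrary choice for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* User-defined reduction: combine invec into inoutvec, element by element */
void absmax(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype)
{
  int i;
  int *in = (int *)invec, *inout = (int *)inoutvec;
  for (i = 0; i < *len; i++) {
    int a = abs(in[i]), b = abs(inout[i]);
    inout[i] = (a > b) ? a : b;
  }
}

int main(int argc, char *argv[])
{
  int rank, data, result;
  MPI_Op op;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  data = (rank % 2) ? -rank : rank;          /* arbitrary test values */

  MPI_Op_create(absmax, 1, &op);             /* 1 = the operation commutes */
  MPI_Reduce(&data, &result, 1, MPI_INT, op, 0, MPI_COMM_WORLD);
  if (rank == 0) printf("abs-max = %d\n", result);

  MPI_Op_free(&op);
  MPI_Finalize();
  return 0;
}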

42 Reduce Example
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
  int err, rank, data, datasum;
  MPI_Init(&argc, &argv);                    /* Initialize MPI */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  data = rank*3 + 1;                         /* {1, 4, 7} */
  MPI_Reduce(&data, &datasum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("datasum is %d\n", datasum);
  MPI_Finalize();
  exit(0);
}

Output (with 3 processes):
datasum is 12

43 Collective Operations

44 Topics Introduction MPI Standard MPI-1.x Model and Basic Calls
Point-to-point Communication Collective Communication Advanced MPI-1.x Highlights MPI-2 Highlights Summary

45 Communicators and Groups
A group is an MPI term related to communicators
MPI functions exist to define a group from a communicator or vice versa, to split communicators, and to work with groups
You can build communicators to fit your application, not just use MPI_COMM_WORLD
Communicators can be built to match a logical process topology, such as Cartesian
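A hedged sketch of building communicators to fit the application (not from the original slides): MPI_Comm_split divides MPI_COMM_WORLD by a color value; the even/odd split here is an arbitrary choice.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int world_rank, sub_rank, color;
  MPI_Comm subcomm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* Processes with the same color land in the same new communicator;
     the key (world_rank) determines the rank ordering inside it. */
  color = world_rank % 2;                    /* arbitrary split */
  MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &subcomm);

  MPI_Comm_rank(subcomm, &sub_rank);
  printf("world rank %d -> group %d, rank %d\n", world_rank, color, sub_rank);

  MPI_Comm_free(&subcomm);
  MPI_Finalize();
  return 0;
}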

46 Communication Modes
MPI_Send uses standard mode
May be buffered, depending on message size
Details are implementation dependent
Other modes exist:
Buffered: requires message buffering (MPI_Bsend)
Synchronous: forbids message buffering (MPI_Ssend)
Ready: Recv must be posted before Send (MPI_Rsend)
There is only one MPI_Recv
This is independent of blocking/nonblocking
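As a hedged illustration of buffered mode (not from the original slides): MPI_Bsend requires the user to attach buffer space first, sized at least the message size plus MPI_BSEND_OVERHEAD; the single-int message and rank pairing are arbitrary.

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, msg = 7, bufsize;
  char *buf;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    /* Buffered mode: the user supplies the buffering explicitly */
    bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
    buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);

    MPI_Bsend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

    MPI_Buffer_detach(&buf, &bufsize);       /* blocks until the buffer is free */
    free(buf);
  } else if (rank == 1) {
    /* still only one MPI_Recv, regardless of the send mode */
    MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
  }

  MPI_Finalize();
  return 0;
}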

47 More Collective Communication
What if the data to be communicated is not the same size in each process?
Varying ("v") versions exist: MPI_Gatherv, MPI_Scatterv, MPI_Allgatherv
Additional arguments include information about the data on each process
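A hedged sketch of MPI_Gatherv (not from the original slides): each process contributes rank+1 ints, an arbitrary choice, and the root supplies the per-process counts and displacements.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, size, i, mycount, *sendbuf;
  int *recvcounts = NULL, *displs = NULL, *recvbuf = NULL;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  mycount = rank + 1;                        /* different amount per process */
  sendbuf = malloc(mycount * sizeof(int));
  for (i = 0; i < mycount; i++) sendbuf[i] = rank;

  if (rank == 0) {                           /* root sets up counts/displacements */
    recvcounts = malloc(size * sizeof(int));
    displs     = malloc(size * sizeof(int));
    for (i = 0; i < size; i++) {
      recvcounts[i] = i + 1;                 /* how much each rank sends */
      displs[i] = (i == 0) ? 0 : displs[i-1] + recvcounts[i-1];
    }
    recvbuf = malloc((displs[size-1] + recvcounts[size-1]) * sizeof(int));
  }

  MPI_Gatherv(sendbuf, mycount, MPI_INT,
              recvbuf, recvcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("gathered %d ints total\n", displs[size-1] + recvcounts[size-1]);

  MPI_Finalize();
  return 0;
}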

48 Persistent Communication
Used when a communication with the same argument list is repeatedly executed within the inner loop of a parallel computation
Bind the list of communication arguments to a persistent communication request once, and then repeatedly use the request to initiate and complete messages
Reduces the overhead for communication between the process and the communication controller, but not the overhead for communication between one communication controller and another
It is not necessary that messages sent with a persistent request be received by a receive operation with a persistent request, or vice versa
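A hedged sketch of a persistent request reused across loop iterations (not from the original slides); the iteration count and the rank 0 to rank 1 pairing are arbitrary.

#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, iter, data = 0;
  MPI_Request req;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank < 2) {                            /* only ranks 0 and 1 take part */
    /* Bind the argument list once, outside the loop */
    if (rank == 0)
      MPI_Send_init(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    else
      MPI_Recv_init(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

    for (iter = 0; iter < 10; iter++) {
      if (rank == 0) data = iter;
      MPI_Start(&req);                       /* initiate the communication */
      /* ... overlap computation here ... */
      MPI_Wait(&req, &status);               /* complete it; req stays valid */
    }

    MPI_Request_free(&req);                  /* release the persistent request */
  }

  MPI_Finalize();
  return 0;
}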

49 Topics Introduction MPI Standard MPI-1.x Model and Basic Calls
Point-to-point Communication Collective Communication Advanced MPI-1.x Highlights MPI-2 Highlights Summary

50 MPI-2 Status
All vendors have complete MPI-1 implementations, and have for years
Free implementations (MPICH, LAM) support heterogeneous workstation networks
MPI-2 implementations are being undertaken by all vendors
Fujitsu and NEC have complete MPI-2 implementations
Other vendors generally have all but dynamic process management
MPICH-2 is complete
Open MPI (a new MPI built from LAM and other MPI projects) is becoming complete

51 MPI-2 Dynamic Processes
MPI-2 supports dynamic processes
An application can start, and later more processes can be added to it, through a complex process
Intracommunicators: everything we have talked about so far
Intercommunicators: allow multiple sets of processes to work together

52 One-sided communication
One process can put data into another process's memory, or get data from another process's memory, without the explicit involvement of the other process
If the hardware supports it, this allows the other process to compute while the communication is happening
Separates data transfer and synchronization
Barriers become essential
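A hedged sketch of MPI-2 one-sided transfer with fence synchronization (not from the original slides); it assumes at least two processes, and the window of a single int is an arbitrary choice.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, local = -1, value;
  MPI_Win win;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Every process exposes one int ("local") as a window */
  MPI_Win_create(&local, (MPI_Aint)sizeof(int), sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  MPI_Win_fence(0, win);                     /* open the access epoch */
  if (rank == 0) {
    value = 42;
    /* put one int into rank 1's window at displacement 0 */
    MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
  }
  MPI_Win_fence(0, win);                     /* close the epoch: data is visible */

  if (rank == 1) printf("rank 1 now holds %d\n", local);

  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}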

53 Parallel I/O (1)
Old ways to do I/O:
Process 0 does all the I/O to a single file and broadcasts/scatters/gathers to/from the other processes
All processes do their own I/O to separate files
All tasks read from the same file
All tasks write to the same file, using seeks to get to the right place
One task at a time appends to a single file, using barriers to prevent overlapping writes

54 Parallel I/O (2)
The new way is to use a parallel I/O library, such as MPI I/O
Multiple tasks can simultaneously read or write to a single file (possibly on a parallel file system) using the MPI I/O API
A parallel file system usually looks like a single file system, but has multiple I/O servers to permit high bandwidth from multiple processes
MPI I/O is part of MPI-2
Allows single or collective operations to/from contiguous or non-contiguous regions/data using MPI datatypes, including derived datatypes, blocking or nonblocking
Sound familiar? Writing is like sending a message, and reading is like receiving one
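A hedged sketch of the MPI I/O style described above (not from the original slides): each process writes its own block of a shared file at a rank-dependent offset; the file name out.dat is hypothetical.

#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, i, data[100];
  MPI_File fh;
  MPI_Offset offset;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  for (i = 0; i < 100; i++) data[i] = rank;  /* each rank's block */

  /* All processes open the same file collectively */
  MPI_File_open(MPI_COMM_WORLD, "out.dat",
                MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

  /* Collective write: like "sending a message" to the file at an offset */
  offset = (MPI_Offset)rank * 100 * sizeof(int);
  MPI_File_write_at_all(fh, offset, data, 100, MPI_INT, &status);

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}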

55 Parallel I/O (3)
Uses high-level access
Given complete access information, an implementation can perform optimizations such as:
Data sieving: read large chunks and extract what is really needed
Collective I/O: merge requests of different processes into larger requests
Improved prefetching and caching

56 Topics Introduction MPI Standard MPI-1.x Model and Basic Calls
Point-to-point Communication Collective Communication Advanced MPI-1.x Highlights MPI-2 Highlights Summary

57 Summary – Material for the Test
MPI standard: slides 4, 7
Compile and Run an MPI Program: slides 10, 11
Environment functions: slides 12, 14
Point-to-point functions: slides 27, 28
Blocking vs. nonblocking: slides 25, 26
Deadlock: slides 29-31
Basic collective functions: slides 33, 34, 36, 38, 40, 41, 43


