
Programming Using The Message Passing Interface


1 Programming Using The Message Passing Interface
Dagoberto A. R. Justo, PPGMAp UFRGS (based on Rudnei's notes)

2 1. Introduction
Part of this introduction is based on the notes of Prof. Rudnei Dias da Cunha. The Message Passing Interface (MPI) is a standard that defines the semantics of communication between two or more processes. 1994: MPI 1.0 (60 people + 40 institutions + companies). 1996/1998: MPI 2.0.

3 For efficiency, the semantics of the interface were defined so that:
copying of user data into data internal to the MPI functions is avoided whenever possible; overlapping communication with computation is allowed, when possible using a co-processor responsible for the communication

4 There are different implementations of the standard, among which we can cite:
MPICH, LAM, CHIMP, and vendor-specific implementations (SGI, IBM, etc.)

5 Programming approach with MPI
p distinct programs running on p processors (not scalable in general). Single program, multiple data (SPMD): easier to implement and reason about. Often a special process, called the master, is designated to perform special duties such as I/O. Most current MPI implementations do not support parallel I/O to the same data set; MPI 2.0 does, though. SPMD programs can be loosely synchronous or asynchronous.

6 Topics The building blocks -- send/receive operations
MPI: the message passing interface Topologies and embedding into fixed networks Overlapping communication and computation Collective communication and computation operations Groups and communicators

7 The building blocks
These are paired operations:
  call send   (sendbuf, nelems, dest)
  call receive(recvbuf, nelems, source)
where:
  sendbuf: buffer holding the data to be sent
  recvbuf: buffer that contains the received data
  nelems:  number of items to be sent or received
  dest:    identifier of the process that receives the data
  source:  identifier of the process that sent the data

8 Send/Receives Are Not Trivial To Implement
Consider two programs (shown as MIMD but could be SPMD):
  P0:  3: a=100   4: send(a,1,1)      5: a=0
  P1:                 4: receive(a,1,0)   5: print *, a
What does this program mean? Probably, 100 is sent from P0 to P1. What happens if the send is executed and returns, posting the send, but does not actually send a until after statement 5 executes? 0 gets sent to P1, which is wrong. That is, the semantics of this send operation are that the value at the time of the call is sent to P1. There has to be some kind of coordination here, often called handshaking; an example is given on the next slide.

9 Handshaking For Blocking Non-Buffered Communications
(Diagram: three sender/receiver timelines for the request-to-send / ok-to-send / data handshake. Sender comes first: idling at the sender. Sender and receiver come at about the same time: idling is minimized. Receiver comes first: idling at the receiver. Legend: executing, idling, communicating.)

10 Approaches available for interaction
The interactions are synchronous All processes/pairs rendezvous for the interaction at the same time This is difficult to specify and control The interactions are asynchronous The interactions are typically one-way but can be two-way It is very hard to reason about this kind of interaction The interactions are loosely synchronous Each process executes its part of the two-way interaction at different times and may (nonblocking) or may not (blocking) continue execution even though the partner has not executed its part Support is given in the nonblocking cases to check whether the interaction has taken place

11 Blocking Communication -- Non-Buffered Send/Receive
Advantage: the send/receive procedures return only when the data is safe to use; that is, the send procedure returns after the data to be sent has been read and saved or sent, so the data in the send buffer can be changed without modifying the message sent. Disadvantages: there are potentially large idling overheads, particularly when the send/receive processes are not coordinated just right, and deadlocks are possible.

12 The Possibility Of Deadlocks
Consider two programs: P0 and P1 block one another because P0 issues a send and does not return from its send procedure until P1 issues a receive and has received the data, and P1 issues a send and does not return from its send procedure until P0 issues a receive and has received the data. Neither P0 nor P1 ever reaches its receive procedure. This can be corrected by reversing the order of the send/receive procedures on one of the processes. That is an easy solution for two processes/programs, but what do you do if you have a ring of communicating processes?
  P0:  3: send(a,1,1)      4: receive(b,1,1)
  P1:  3: send(a,1,0)      4: receive(b,1,0)

13 Blocking Buffered Transfer Protocols
(Diagram: sender and receiver timelines for buffered transfers, showing the data being moved between buffers; legend: executing, filling the send buffer/emptying the receive buffer, transferring data between buffers.) In the presence of communication hardware, there are buffers at both the send and receive ends. In the absence of communication hardware, the sender interrupts the receiver and deposits the data in a buffer at the receiving end.

14 An Example Of The Limitation Caused By Finite Buffer Sizes
Consider two programs: What happens when P1 is slow at consuming the data, or just slow anyway? Eventually, either P0 or P1 might run out of buffer space, particularly if the data a is large. They either block or crash, neither of which may be very satisfactory.
  P0:  2: do i=1, 100   3: a=f(i)            4: send(a,1,1)   5: end do
  P1:  2: do i=1, 100   3: receive(a, 1, 0)  4: f(i)=a        5: end do

15 Deadlocks In Buffered Send/Receive Operations
Consider two programs: Although buffered, the procedures still block until it is safe to go on; in this case, the receives block until the data is received. Because both block, neither process P0 nor P1 executes its send, and thus deadlock occurs. Again, the solution is to reverse the order of the send/receive on either P0 or P1.
  P0:  2: receive(a,1,1)   3: send(b,1,1)
  P1:  2: receive(a,1,0)   3: send(b,1,0)

16 Non-Blocking Message Passing Operations
Blocking operations have a substantial overhead in order to ensure data integrity and safety. Non-blocking operations instead place the responsibility for safety on the programmer, so the overheads are controlled by the programmer. The cost is that the library must provide procedures to find out whether the messages have been sent and received; if not, the library must provide a mechanism for the programmer to wait for completion of the message passing. Non-blocking operations can also use buffers, to further reduce the time the data is unsafe to use.

17 Diagrams Of Non-Blocking Non-Buffered Send/Receive Operations
(Diagram: sender and receiver timelines for non-blocking non-buffered transfers, without and with communication hardware. In both cases the data being sent is unsafe to update, and the data being received is unsafe to read, until the transfer completes. Legend: executing, communicating.)

18 MPI Message Passing Interface Finally!!!

19 MPI: The Message Passing Interface
Over 125 procedures; only 6 are needed for full functionality:
  MPI_Init       initializes MPI
  MPI_Finalize   terminates MPI
  MPI_Comm_size  determines the number of processes
  MPI_Comm_rank  determines the process identifier
  MPI_Send       sends a message
  MPI_Recv       receives a message
Most of the procedures are type overloaded; this is legal in C but, in the way it is done, illegal in standard Fortran. Many Fortran 90/95 compilers provide a flag to turn the error diagnostics into warnings.
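As an illustration of these six calls, here is a minimal C sketch (not from the slides) in which process 0 sends one integer to process 1:
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[]) {
      int np, myid, value = 0;
      MPI_Status status;
      MPI_Init(&argc, &argv);                    /* start MPI */
      MPI_Comm_size(MPI_COMM_WORLD, &np);        /* how many processes */
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);      /* who am I */
      if (myid == 0 && np > 1) {
          value = 100;
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (myid == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
          printf("process 1 received %d\n", value);
      }
      MPI_Finalize();                            /* shut down MPI */
      return 0;
  }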

20 Compiling and running programs in parallel
To compile a program that uses MPI: mpif90 programa.f90 -o programa. Use the commands mpicc, mpif77, or mpif90 to generate an executable. To run a program with multiple processes: mpirun -np 4 programa. Use the mpirun command; the number of processes is given by the -np argument. MPI_Init accepts the argc and argv arguments; they represent all command-line arguments not consumed by mpirun.

21 Starting And Terminating MPI
program Init
  implicit none
  integer :: ier
  call MPI_Init( ier )
  call MPI_Finalize( ier )
end program
All other MPI procedure references must appear after MPI_Init and before MPI_Finalize in each process. Neither procedure can be called more than once.

22 MPI Include Files, Datatypes, and Constants
Contain predefined definitions for MPI datatypes and MPI constants:
  Fortran 77:    include "mpif.h"
  Fortran 90/95: include "mpif90.h"
  C:             #include <mpi.h>
Examples of MPI constants:
  MPI_SUCCESS:    the value returned by an MPI procedure to indicate that it completed successfully
  MPI_COMM_WORLD: the default communicator, representing all available processes
program Init
  implicit none
  include "mpif.h"
  integer :: ier
  call MPI_Init( ier )
  if (ier /= MPI_SUCCESS) stop
  call MPI_Finalize( ier )
end program

23 Interfaces For MPI_Init And MPI_Finalize
The Fortran interfaces are:
interface
  subroutine MPI_Init( error )
    integer, intent(out) :: error
  end subroutine
  subroutine MPI_Finalize( error )
    integer, intent(out) :: error
  end subroutine
end interface
The C interfaces are:
int MPI_Init( int *argc, char ***argv )
int MPI_Finalize()

24 Clock
MPI_WTIME measures time, returning the wall-clock time (WCT) in seconds since some moment in the past. MPI_WTICK returns the resolution of the clock used by MPI_WTIME, expressed in seconds.
DOUBLE PRECISION :: TI, TF, tick
...
tick = MPI_WTICK()
TI = MPI_WTIME()
TF = MPI_WTIME()
print *, "Time (s): ", TF-TI
print *, "Resolution (s): ", tick
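A corresponding C sketch (hypothetical variable names, arbitrary timed region) that times a piece of code with MPI_Wtime and reports the clock resolution with MPI_Wtick:
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[]) {
      double t0, t1, tick;
      MPI_Init(&argc, &argv);
      tick = MPI_Wtick();                /* clock resolution in seconds */
      t0 = MPI_Wtime();
      /* ... work to be timed ... */
      t1 = MPI_Wtime();
      printf("elapsed (s): %g, resolution (s): %g\n", t1 - t0, tick);
      MPI_Finalize();
      return 0;
  }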

25 Communicators
Communicator: a collection of processes that are allowed to communicate with one another; MPI's concept of a communication domain. In Fortran 90 this information is represented by integer variables; in C, by the type MPI_Comm. It is used in all MPI procedures except MPI_Init and MPI_Finalize, and specifies the communication domain for the procedure. MPI_COMM_WORLD is the communicator representing all processes made available via the mpirun command; it is the default communicator.

26 Communicators
(Diagram: MPI_COMM_WORLD contains processes 0 to 5, partitioned into two groups A and B.) The highlighted processes are those of rank 0 and 1 (or 1 and 0) in group B!

27 Getting Information: MPI_Comm_size and MPI_Comm_rank
Fortran:
subroutine MPI_Comm_size( comm, size, error )
  integer, intent(in)  :: comm
  integer, intent(out) :: size, error
end subroutine
subroutine MPI_Comm_rank( comm, rank, error )
  integer, intent(in)  :: comm
  integer, intent(out) :: rank, error
end subroutine
C:
int MPI_Comm_size( MPI_Comm comm, int *size )
int MPI_Comm_rank( MPI_Comm comm, int *rank )
size: the number of processes in the communicator group
rank: the identifier of the calling process within the communicator group

28 Hello World Program
program hello
  implicit none
  include "mpif.h"
  integer :: np, myid, ier
  call MPI_Init( ier )
  call MPI_Comm_size( MPI_COMM_WORLD, np, ier )
  call MPI_Comm_rank( MPI_COMM_WORLD, myid, ier )
  print *, "I am process ", myid, " of a total of ", np
  call MPI_Finalize( ier )
end program

29 Hello World Program
#include <stdio.h>
#include <mpi.h>
int main( int argc, char *argv[] ) {
  int npes, myid;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &npes);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  printf("I am process %d of a total of %d\n", myid, npes);
  MPI_Finalize();
  return 0;
}

30 Send And Receive: Fortran
subroutine MPI_Send( buf, count, datatype, dest, tag, comm, error )
  <datatype>, intent(in)  :: buf
  integer   , intent(in)  :: count, datatype, dest, tag, comm
  integer   , intent(out) :: error
subroutine MPI_Recv( buf, count, datatype, source, tag, comm, status, error )
  <datatype>, intent(out) :: buf
  integer   , intent(in)  :: count, datatype, source, tag, comm
  integer   , intent(out) :: status(MPI_STATUS_SIZE), error
MPI_Send sends the data in buf; the message goes to process dest with identifier tag. MPI_Recv receives the data and places it in buf; the message comes from process source with identifier tag. The state of the receive operation is kept in status.

31 Send and Receive
The message sent has length count and consists of elements of type datatype starting at the address given by buf. The data types are specified by MPI constants that correspond to data types in the host language:
  MPI datatype (Fortran)   Fortran datatype
  MPI_CHARACTER            character(1)
  MPI_INTEGER              integer
  MPI_REAL                 real
  MPI_COMPLEX              complex
  MPI_DOUBLE_PRECISION     double precision
  MPI_LOGICAL              logical
  MPI datatype (C)         C datatype
  MPI_CHAR                 signed char
  MPI_SHORT                signed short int
  MPI_INT                  signed int
  MPI_LONG                 signed long int
  MPI_FLOAT                float
  MPI_DOUBLE               double

32 Send And Receive: C
int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm )
int MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status )
The status of the receive operation is maintained in the structure status. Note: many of the examples in Kumar's book leave out the parameter status -- this is erroneous and will lead to segmentation faults.

33 The C Structure Status And The Fortran Array Status
The C structure is:
typedef struct {
  int MPI_SOURCE;
  int MPI_TAG;
  int MPI_ERROR;
} MPI_Status;
It allows the program to find out the source of the received message and its tag (useful when the MPI_Recv arguments for source and tag are the wildcards MPI_ANY_SOURCE and MPI_ANY_TAG), as well as the error flag. A function, MPI_Get_count, is provided to find the actual length of the received message: it uses the status structure returned by MPI_Recv to query the length of the message. In Fortran, this structure is implemented as an integer array of size MPI_STATUS_SIZE.
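A small sketch (assuming at least two processes; the message length of 5 is arbitrary) of how the status fields and MPI_Get_count can be used on the receiving side:
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[]) {
      int myid, np, n, data[5] = {1, 2, 3, 4, 5};
      MPI_Status status;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);
      MPI_Comm_size(MPI_COMM_WORLD, &np);
      if (myid == 1 && np > 1) {
          MPI_Send(data, 5, MPI_INT, 0, 7, MPI_COMM_WORLD);
      } else if (myid == 0 && np > 1) {
          int recvbuf[5];
          /* accept a message from any source with any tag */
          MPI_Recv(recvbuf, 5, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                   MPI_COMM_WORLD, &status);
          MPI_Get_count(&status, MPI_INT, &n);   /* actual number of ints received */
          printf("got %d ints from rank %d with tag %d\n",
                 n, status.MPI_SOURCE, status.MPI_TAG);
      }
      MPI_Finalize();
      return 0;
  }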

34 Avoiding Deadlocks
MPI does not specify whether buffering is used or not. HPCERC's MPICH and Myrinet implementations use buffering. The following code deadlocks on a non-buffered implementation; thus, it is not portable. It is the programmer's responsibility not to create such deadlocks.
integer :: a(10), b(10), myid, ier
integer :: status(MPI_STATUS_SIZE)
call MPI_Comm_rank(MPI_COMM_WORLD, myid, ier)
if (myid == 0) then
  call MPI_Send( a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD, ier)
  call MPI_Send( b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD, ier)
elseif (myid == 1) then
  call MPI_Recv( b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, status, ier)
  call MPI_Recv( a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, status, ier)
endif
(Without buffering, process 0 blocks in its first send while process 1 blocks waiting for the second message first.)

35 Sending/Receiving Messages In a Circular Communication World
Suppose you have your processes arranged in a ring of p processes: process k sends to process (k+1) mod p and receives from process (k-1) mod p. The following code is non-portable and may deadlock: every process blocks in its send, waiting for a receive that is never reached.
integer :: a(10), b(10), myid, npes, ier
integer :: status(MPI_STATUS_SIZE)
call MPI_Comm_size(MPI_COMM_WORLD, npes, ier)
call MPI_Comm_rank(MPI_COMM_WORLD, myid, ier)
call MPI_Send( a, 10, MPI_INT, mod(myid+1,npes), 1, MPI_COMM_WORLD, ier)
call MPI_Recv( b, 10, MPI_INT, mod(myid-1+npes,npes), 1, MPI_COMM_WORLD, status, ier)

36 Deadlock-Free Code For Circular Message Distribution
The following code is deadlock-free whether or not buffering is implemented: odd-ranked processes send first and then receive, while even-ranked processes receive first and then send.
integer :: a(10), b(10), np, myid, ier
integer :: status(MPI_STATUS_SIZE)
integer :: COMM = MPI_COMM_WORLD
call MPI_Comm_size(COMM, np, ier)
call MPI_Comm_rank(COMM, myid, ier)
if (mod(myid,2) == 1) then
  call MPI_Send( a, 10, MPI_INT, mod(myid+1   ,np), 1, COMM, ier)
  call MPI_Recv( b, 10, MPI_INT, mod(myid-1+np,np), 1, COMM, status, ier)
else
  call MPI_Recv( b, 10, MPI_INT, mod(myid-1+np,np), 1, COMM, status, ier)
  call MPI_Send( a, 10, MPI_INT, mod(myid+1   ,np), 1, COMM, ier)
endif

37 Sending And Receiving Messages Simultaneously
To avoid the deadlock issues, MPI provides a joint send/receive procedure. In Fortran:
subroutine MPI_Sendrecv( sendbuf, sendcount, sendtype, dest, sendtag, &
                         recvbuf, recvcount, recvtype, source, recvtag, &
                         comm, status, ier )
  <datatype>, intent(in)  :: sendbuf
  <datatype>, intent(out) :: recvbuf
  integer   , intent(in)  :: sendcount, sendtype, dest, sendtag, comm
  integer   , intent(in)  :: recvcount, recvtype, source, recvtag
  integer   , intent(out) :: status(MPI_STATUS_SIZE), ier

38 C Interface To Sendrecv
In C:
int MPI_Sendrecv( void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  int dest, int sendtag,
                  void *recvbuf, int recvcount, MPI_Datatype recvtype,
                  int source, int recvtag,
                  MPI_Comm comm, MPI_Status *status )
However, its send and receive buffers MUST be disjoint. The procedure MPI_Sendrecv_replace allows the send and receive buffer to be the same; in this case, the counts and datatypes must be the same.

39 Safe Version Using MPI_Sendrecv
Using the MPI_Sendrecv procedure:
integer :: a(10), b(10), np, myid, ier
integer :: status(MPI_STATUS_SIZE)
call MPI_Comm_size(MPI_COMM_WORLD, np, ier)
call MPI_Comm_rank(MPI_COMM_WORLD, myid, ier)
call MPI_Sendrecv( a, 10, MPI_INT, mod(myid+1,np), 1, &
                   b, 10, MPI_INT, mod(myid-1+np,np), 1, &
                   MPI_COMM_WORLD, status, ier )

40 Exercise
1) Write a (simplified) parallel version of the card game Blackjack. Simplified rules: one process is the dealer and deals the cards; the other processes are the players. The deck is a list of 52 numbers counting from 1 to 13 (each number repeated 4 times); ignore suits for this exercise. To simplify, there is no betting. The dealer shuffles the cards and deals 2 cards to each player. Each player decides whether it wants ONE more card or not (simplify: for example, if the sum of the cards in hand is less than 14, ask for one more card). All players add up their cards and report their points to the dealer, which decides who the winner is. The dealer keeps track of every player's points. The dealer collects all the cards and sorts them (serially) to check that none is missing. After n rounds, the dealer announces the overall winner.

41 An Example: Odd-Even Sort
The example is the parallel algorithm described in Sections and 9.1.3. It is called an even-odd sort and turns out to be a cost-optimal algorithm for sorting, very analogous to the cost-optimal addition algorithm. The algorithm works as follows: it sorts n elements on p processors (the implementation below has been simplified to the case that p divides n); it distributes the n elements evenly amongst the p processors and proceeds through p phases, having first sorted the n/p elements on each process; the processors are ordered and numbered in a horizontal line from processor 0 to processor p-1.

42 Even-Odd Sort Continued
It then proceeds through p phases from 0 to p – 1 For each phase number i: If i is an even (odd) phase number, the even (odd) numbered processes send their numbers to the processor to the right and the odd (even) number processors send their numbers to the left The left of an even-odd pair selects in order the smaller elements up to n/p elements The right of an even-odd pair selects in order the larger elements up to n/p elements If on any phase the communication is with a process to the right of the last one or to the left of the first one, that process does nothing (no communication and no sorting)

43 Pictorial View For The Even-Odd Parallel Sorting Algorithm
(Diagram: four processes with initial data P0: 3 7, P1: 1 8, P2: 2 5, P3: 4 6 and their (oddrank, evenrank) partners P0: (N,1), P1: (2,0), P2: (1,3), P3: (N,2). In phase 0 the even pairs exchange and compare-split, giving P0: 1 3, P1: 7 8, P2: 2 4, P3: 5 6; the table also shows the send/receive partners and the keepsmall flag for phase 1.)

44 Pictorial View For The Even-Odd Parallel Sorting Algorithm Continued
(Diagram: the same four processes after phase 1, holding P0: 1 3, P1: 2 4, P2: 7 8, P3: 5 6, together with the send/receive partners and keepsmall flags for phases 2 and 3.)

45 Corrected Even-Odd Sort Code
#include <stdio.h>
#include <stdlib.h>   /* added: needed for malloc, qsort, atoi, random */
#include <mpi.h>

CompareSplit( int nlocal, int *elmnts, int *relmnts, int *wspace, int keepsmall ) {
  int i, j, k;
  for (i=0; i<nlocal; i++)
    wspace[i] = elmnts[i];
  if (keepsmall) {            /* keep the nlocal smaller elements */
    for (i=j=k=0; k<nlocal; k++) {
      if (j == nlocal || (i < nlocal && wspace[i] < relmnts[j]))
        elmnts[k] = wspace[i++];
      else
        elmnts[k] = relmnts[j++];
    }
  } else {                    /* keep the nlocal larger elements */
    for (i=j=k=nlocal-1; k>=0; k--) {
      if (j == -1 || (i >= 0 && wspace[i] >= relmnts[j]))
        elmnts[k] = wspace[i--];
      else
        elmnts[k] = relmnts[j--];
    }
  }
}

46 Corrected Even-Odd Sort Code
int IncOrder( const void *e1, const void *e2 ) {
  return (*((int *)e1) - *((int *)e2));
}

int main( int argc, char *argv[] ) {
  int n, npes, myrank, nlocal, *elmnts, *relmnts, oddrank, evenrank, *wspace, i;
  MPI_Status status;

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &npes );
  MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
  n = atoi(argv[1]);
  nlocal = n/npes;

  elmnts  = (int *)malloc(nlocal*sizeof(int));
  relmnts = (int *)malloc(nlocal*sizeof(int));
  wspace  = (int *)malloc(nlocal*sizeof(int));

  srandom(myrank);                     /* fill the local array with random keys */
  for (i=0; i<nlocal; i++)
    elmnts[i] = random();
  qsort(elmnts, nlocal, sizeof(int), IncOrder);   /* local sort first */

47 Corrected Even-Odd Sort Code
if (myrank%2 == 0) {
    oddrank  = myrank - 1;
    evenrank = myrank + 1;
  } else {
    oddrank  = myrank + 1;
    evenrank = myrank - 1;
  }
  /* ranks off the ends of the line do not participate */
  if (oddrank == -1 || oddrank == npes)
    oddrank = MPI_PROC_NULL;
  if (evenrank == -1 || evenrank == npes)
    evenrank = MPI_PROC_NULL;

  for (i=0; i<npes; i++) {              /* the npes compare-split phases */
    if (i%2 == 1)                       /* odd phase */
      MPI_Sendrecv(elmnts, nlocal, MPI_INT, oddrank, 1,
                   relmnts, nlocal, MPI_INT, oddrank, 1, MPI_COMM_WORLD, &status);
    else                                /* even phase */
      MPI_Sendrecv(elmnts, nlocal, MPI_INT, evenrank, 1,
                   relmnts, nlocal, MPI_INT, evenrank, 1, MPI_COMM_WORLD, &status);
    if (status.MPI_SOURCE != MPI_PROC_NULL)
      CompareSplit(nlocal, elmnts, relmnts, wspace, myrank < status.MPI_SOURCE);
  }

  free(elmnts); free(relmnts); free(wspace);
  MPI_Finalize();
}

48 6.4. Topologies And Embedding
MPI views the processes in the MPI_COMM_WORLD communicator as a 1-D topology by default. However, you can change this topology by creating a new communicator, say 2-D, 3-D, or a hypercube (see the next slide for examples). To create a Cartesian arrangement of processes, use MPI_Cart_create, which has the following interface:
int MPI_Cart_create( MPI_Comm comm_old, int ndims, int *dims, int *periods,
                     int reorder, MPI_Comm *comm_cart )

49 Different Ways To Map Processes Into A 2-D Grid
Row-major mapping Column-major mapping Space-filling curve mapping -- the dotted line specifies the order of the mapping Hypercube mapping

50 The Most Appropriate Mappings
We need to select a mapping that is most consistent with the way the processes are expected to communicate For communication that is along the axis of a 2-D grid, the row or column major mappings on the previous slide are the ones of choice, provided the underlying network is a switch or a 2-D grid with this arrangement of physical processors The last choice on the previous slide is a good choice if the tasks communicate naturally as if on a hypercube and the underlying network is a 2-D grid of this shape or a network switch The approach taken by MPI is to have the user specify the virtual communication arrangement and the library then tries to optimize the assignment to physical processor, because only MPI knows the physical processor network topology It means your program is portable and does not have any hard-wired assumptions about the actual network

51 Creating And Using Cartesian Topologies
Takes a group of processes in a communication world comm_old and creates a new virtual topology comm_cart:
int MPI_Cart_create( MPI_Comm comm_old, int ndims, int *dims, int *periods,
                     int reorder, MPI_Comm *comm_cart )
All processes that want to use this new world must call this function. The shape and topology of the new world are specified by ndims, dims, and periods: ndims is the number of dimensions in the new topology; dims gives the extent along each dimension; periods indicates, where an element is true, that the corresponding dimension wraps around. reorder indicates whether the rank numbers of the processes must be preserved between the old and new world; if true, ranks may be reordered, which allows the MPI system to find the most efficient assignment of processes to processors.
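A small sketch of this call (the 2 x 2 grid is an arbitrary choice; it assumes the program is run with at least 4 processes):
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[]) {
      int dims[2] = {2, 2};        /* 2 x 2 process grid (assumed) */
      int periods[2] = {1, 1};     /* wrap around in both dimensions */
      int myrank, coords[2];
      MPI_Comm comm_2d;
      MPI_Init(&argc, &argv);
      MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm_2d);
      if (comm_2d != MPI_COMM_NULL) {   /* extra processes get MPI_COMM_NULL */
          MPI_Comm_rank(comm_2d, &myrank);
          MPI_Cart_coords(comm_2d, myrank, 2, coords);
          printf("rank %d sits at (%d,%d)\n", myrank, coords[0], coords[1]);
          MPI_Comm_free(&comm_2d);
      }
      MPI_Finalize();
      return 0;
  }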

52 Process Naming
Use the MPI procedures MPI_Cart_rank and MPI_Cart_coords to convert between a process identified by a single rank and its coordinates in the 2-D grid. Interfaces:
int MPI_Cart_rank( MPI_Comm comm_cart, int *coords, int *rank )
int MPI_Cart_coords( MPI_Comm comm_cart, int rank, int maxdims, int *coords )
Semantics: MPI_Cart_rank: given the coordinates of a process in the new Cartesian world, return its rank in this new world. MPI_Cart_coords: given the rank of a process in the new world, return the Cartesian coordinates of this process in an array of size maxdims.

53 The MPI Procedure MPI_Cart_shift
Use the MPI procedure MPI_Cart_shift:
int MPI_Cart_shift( MPI_Comm comm_cart, int dir, int s_step,
                    int *rank_source, int *rank_dest )
Given the direction and length of the shift, it returns the rank of the process that will be the source of data coming to the calling process (myrank) and the destination of the data sent by the calling process. dir is the coordinate direction for the shift and s_step is the length of the shift. The source and destination are based on a wraparound shift if the periods entry in MPI_Cart_create indicates wrapping in the specified dimension. MPI_PROC_NULL is returned for sources and destinations that are not part of the current process topology, e.g. end-off shifts.
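Continuing the hypothetical 2-D grid sketch above, the neighbours of the calling process could be obtained like this (the up/down naming is just a convention):
  #include <mpi.h>

  /* Sketch: find the four neighbours of the calling process in a periodic
     2-D Cartesian communicator comm_2d, assumed created with MPI_Cart_create. */
  void grid_neighbours(MPI_Comm comm_2d,
                       int *left, int *right, int *up, int *down) {
      /* dimension 0, shift by +1: source is the left neighbour,
         destination is the right neighbour */
      MPI_Cart_shift(comm_2d, 0, 1, left, right);
      /* dimension 1, shift by +1: source is the "up" neighbour,
         destination is the "down" neighbour */
      MPI_Cart_shift(comm_2d, 1, 1, up, down);
  }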

54 Example: The Algorithm For Cannon's Matrix-Matrix Multiplication
Description (assume p processors, matrices of size n×n, √p divides n):
Consider the processors laid out in a √p × √p wraparound mesh
Partition the matrices A and B into p submatrices of size (n/√p) × (n/√p)
Scatter the p submatrices of A and B onto the processors in the natural way
Align the 2nd through √p-th rows of submatrices of A by circularly shifting the j-th row j positions to the left
Align the 2nd through √p-th columns of submatrices of B by circularly shifting the j-th column j positions up

55 Cannon's Algorithm Continued
Multiply the submatrices of A and B on each processor (s,t) using matrix multiplication and store the result in C_s,t
Repeat the following 3 steps √p - 1 times:
Shift the submatrices of A left one position circularly in each row
Shift the submatrices of B up one position circularly in each column
Multiply the submatrices of A and B on each processor (s,t) using matrix multiplication and accumulate into C_s,t
The product C appears as submatrices on each process; that is, process (s,t) has the submatrix C_s,t
Check: C_2,3 = A_2,1 B_1,3 + A_2,2 B_2,3 + A_2,3 B_3,3 + A_2,0 B_0,3 = A_2,0 B_0,3 + A_2,1 B_1,3 + A_2,2 B_2,3 + A_2,3 B_3,3

56 Diagrams--Cannon's Parallel Matrix-Matrix Multiplication For 4×4 Matrices
(Diagram: the 4×4 grids of submatrices A_i,j and B_i,j. Left: initial distribution of A and the communication for the first alignment, where row j is circularly shifted left by j positions. Right: initial distribution of B and the communication for the first alignment, where column j is circularly shifted up by j positions.)

57 Diagrams--Cannon's Parallel Matrix-Matrix Multiplication -- Part 2
(Diagram: the submatrix products held on each process.)
After the initial alignment (a submatrix multiplication and accumulation occurs before the first communication):
  A0,0·B0,0  A0,1·B1,1  A0,2·B2,2  A0,3·B3,3
  A1,1·B1,0  A1,2·B2,1  A1,3·B3,2  A1,0·B0,3
  A2,2·B2,0  A2,3·B3,1  A2,0·B0,2  A2,1·B1,3
  A3,3·B3,0  A3,0·B0,1  A3,1·B1,2  A3,2·B2,3
After the first shift (all left or up circular shifts of length 1):
  A0,1·B1,0  A0,2·B2,1  A0,3·B3,2  A0,0·B0,3
  A1,2·B2,0  A1,3·B3,1  A1,0·B0,2  A1,1·B1,3
  A2,3·B3,0  A2,0·B0,1  A2,1·B1,2  A2,2·B2,3
  A3,0·B0,0  A3,1·B1,1  A3,2·B2,2  A3,3·B3,3

58 Diagrams--Cannon's Parallel Matrix-Matrix Multiplication -- Part 3
After the second shift (a submatrix multiplication and accumulation occurs before the next communication):
  A0,2·B2,0  A0,3·B3,1  A0,0·B0,2  A0,1·B1,3
  A1,3·B3,0  A1,0·B0,1  A1,1·B1,2  A1,2·B2,3
  A2,0·B0,0  A2,1·B1,1  A2,2·B2,2  A2,3·B3,3
  A3,1·B1,0  A3,2·B2,1  A3,3·B3,2  A3,0·B0,3
After the third shift (a final submatrix multiplication and accumulation occurs after this shift):
  A0,3·B3,0  A0,0·B0,1  A0,1·B1,2  A0,2·B2,3
  A1,0·B0,0  A1,1·B1,1  A1,2·B2,2  A1,3·B3,3
  A2,1·B1,0  A2,2·B2,1  A2,3·B3,2  A2,0·B0,3
  A3,2·B2,0  A3,3·B3,1  A3,0·B0,2  A3,1·B1,3

59 Cannon's Code
MatrixMatrixMultiply( int n, double *a, double *b, double *c, MPI_Comm comm ) {
  int i, nlocal, npes, dims[2], periods[2], my2drank, mycoords[2],
      uprank, downrank, leftrank, rightrank, coords[2], shiftsource, shiftdest;
  MPI_Status status;
  MPI_Comm comm_2d;

  MPI_Comm_size( comm, &npes );
  dims[0] = dims[1] = sqrt(npes);          /* square process grid */
  periods[0] = periods[1] = 1;             /* wraparound in both dimensions */
  MPI_Cart_create( comm, 2, dims, periods, 1, &comm_2d );
  MPI_Comm_rank( comm_2d, &my2drank );
  MPI_Cart_coords( comm_2d, my2drank, 2, mycoords );
  MPI_Cart_shift( comm_2d, 0, -1, &rightrank, &leftrank );
  MPI_Cart_shift( comm_2d, 1, -1, &downrank, &uprank );
  nlocal = n/dims[0];

60
  /* Initial a and b matrix shift -- *a and *b are the submatrices on myrank */
  MPI_Cart_shift( comm_2d, 0, -mycoords[0], &shiftsource, &shiftdest );
  MPI_Sendrecv_replace( a, nlocal*nlocal, MPI_DOUBLE, shiftdest, 1,
                        shiftsource, 1, comm_2d, &status );
  MPI_Cart_shift( comm_2d, 1, -mycoords[1], &shiftsource, &shiftdest );
  MPI_Sendrecv_replace( b, nlocal*nlocal, MPI_DOUBLE, shiftdest, 1,
                        shiftsource, 1, comm_2d, &status );

  /* Assume c is initially 0 */
  for (i=0; i<dims[0]; i++) {
    MatrixMultiply( nlocal, a, b, c );     /* c = c + a*b */
    /* Shift a left by one */
    MPI_Sendrecv_replace( a, nlocal*nlocal, MPI_DOUBLE, leftrank, 1,
                          rightrank, 1, comm_2d, &status );
    /* Shift b up by one */
    MPI_Sendrecv_replace( b, nlocal*nlocal, MPI_DOUBLE, uprank, 1,
                          downrank, 1, comm_2d, &status );
  }

61
  /* Restore the original distribution of a and b */
  MPI_Cart_shift( comm_2d, 0, +mycoords[0], &shiftsource, &shiftdest );
  MPI_Sendrecv_replace( a, nlocal*nlocal, MPI_DOUBLE, shiftdest, 1,
                        shiftsource, 1, comm_2d, &status );
  MPI_Cart_shift( comm_2d, 1, +mycoords[1], &shiftsource, &shiftdest );
  MPI_Sendrecv_replace( b, nlocal*nlocal, MPI_DOUBLE, shiftdest, 1,
                        shiftsource, 1, comm_2d, &status );
  MPI_Comm_free( &comm_2d );
}

/* Perform serial matrix-matrix multiplication c = c + a*b */
MatrixMultiply( int n, double *a, double *b, double *c ) {
  int i, j, k;
  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      for (k=0; k<n; k++)
        c[i*n+j] += a[i*n+k]*b[k*n+j];
}

62 Overlapping Communication And Computation
Motivation: notice that Cannon's code is written so that while computing, no communication is occurring, and while communicating, the blocking Sendrecv_replace waits until all communication completes and thus does not allow any computation to occur. A and B are never changed (while C is), but their submatrices may reside in different places. The computation time behaves like O((n/√p)³) or O(n³/p^1.5); the communication time behaves like O((n/√p)²·√p) or O(n²/√p). This suggests that, as long as we arrange for the computation and the communication of submatrices of A and B to use different copies of those submatrices, we can overlap computation and communication.

63 Non-Blocking Communication Operations MPI_Irecv and MPI_Isend
MPI_Isend: starts a send, but returns before the data is free to be overwritten
MPI_Irecv: starts a receive, but returns before the data is ready to be read
Interfaces:
subroutine MPI_Isend( buf, count, type, dest, tag, comm, request, ier )
  <datatype>, intent(in)  :: buf
  integer   , intent(in)  :: count, type, dest, tag, comm
  integer   , intent(out) :: request, ier
end subroutine
subroutine MPI_Irecv( buf, count, type, source, tag, comm, request, ier )
  <datatype>, intent(out) :: buf
  integer   , intent(in)  :: count, type, source, tag, comm
  integer   , intent(out) :: request, ier
end subroutine

64 Interfaces Continued
int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request )
int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int source,
               int tag, MPI_Comm comm, MPI_Request *request )
A status parameter is not required by MPI_Irecv. The request parameter (an integer in Fortran) is used by MPI_Test and MPI_Wait: MPI_Test tests whether the operation specified by the request is complete and the data is safe to use; MPI_Wait does not return until the operation specified by the request is complete and its data is safe to use. Non-blocking communication can be matched with blocking communication, e.g. a non-blocking send can match a blocking receive.

65 MPI_Test and MPI_Wait The interfaces are:
status: is returned by MPI_Wait and MPI_Test instead of by the receive operation
flag: is set true if the operation is complete and false otherwise
subroutine MPI_Test( request, flag, status, ier )
  integer, intent(in)  :: request
  logical, intent(out) :: flag
  integer, intent(out) :: status(MPI_STATUS_SIZE), ier
end subroutine
subroutine MPI_Wait( request, status, ier )
  integer, intent(in)  :: request
  integer, intent(out) :: status(MPI_STATUS_SIZE), ier
end subroutine
int MPI_Test( MPI_Request *request, int *flag, MPI_Status *status )
int MPI_Wait( MPI_Request *request, MPI_Status *status )
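A minimal C sketch (assuming at least two processes) of posting a non-blocking receive and waiting for it; rank 1 sends one integer to rank 0:
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[]) {
      int myid, np, value = 42, recvd = 0;
      MPI_Request req;
      MPI_Status status;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);
      MPI_Comm_size(MPI_COMM_WORLD, &np);
      if (myid == 0 && np > 1) {
          MPI_Irecv(&recvd, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
          /* ... other work could be done here while the message is in flight ... */
          MPI_Wait(&req, &status);          /* now recvd is safe to read */
          printf("rank 0 received %d\n", recvd);
      } else if (myid == 1) {
          MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
      }
      MPI_Finalize();
      return 0;
  }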

66 Avoiding Deadlocks
Non-blocking communications avoid most of the deadlock situations presented earlier.
Deadlocks (blocking receives posted in the wrong order):
integer :: a(10), b(10), myid, ier
integer :: status(MPI_STATUS_SIZE)
integer :: COMM = MPI_COMM_WORLD
call MPI_Comm_rank(COMM, myid, ier)
if (myid == 0) then
  call MPI_Send(a, 10, MPI_INT, 1, 1, COMM, ier)
  call MPI_Send(b, 10, MPI_INT, 1, 2, COMM, ier)
elseif (myid == 1) then
  call MPI_Recv(b, 10, MPI_INT, 0, 2, COMM, status, ier)
  call MPI_Recv(a, 10, MPI_INT, 0, 1, COMM, status, ier)
endif
Deadlock-free (non-blocking receives can be posted in any order):
integer :: a(10), b(10), myid, ier
integer :: status(MPI_STATUS_SIZE)
integer :: request(2)
integer :: COMM = MPI_COMM_WORLD
call MPI_Comm_rank(COMM, myid, ier)
if (myid == 0) then
  call MPI_Send(a, 10, MPI_INT, 1, 1, COMM, ier)
  call MPI_Send(b, 10, MPI_INT, 1, 2, COMM, ier)
elseif (myid == 1) then
  call MPI_Irecv(b, 10, MPI_INT, 0, 2, COMM, request(1), ier)
  call MPI_Irecv(a, 10, MPI_INT, 0, 1, COMM, request(2), ier)
endif

67 Cannon's Matrix-Matrix Multiplication Using Overlapping Communication
Differences between the overlapping and non-overlapping code are in the double-buffered handling of A and B:
MatrixMatrixMultiply_NonBlocking( int n, double *a, double *b, double *c, MPI_Comm comm ) {
  int i, j, nlocal, npes, dims[2], periods[2], myrank, my2drank, mycoords[2],
      uprank, downrank, leftrank, rightrank, coords[2], shiftsource, shiftdest;
  double *a_buffers[2], *b_buffers[2];
  MPI_Status status;
  MPI_Comm comm_2d;
  MPI_Request reqs[4];

  MPI_Comm_size( MPI_COMM_WORLD, &npes );
  MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
  dims[0] = dims[1] = sqrt(npes);
  periods[0] = periods[1] = 1;
  MPI_Cart_create( comm, 2, dims, periods, 1, &comm_2d );
  MPI_Comm_rank( comm_2d, &my2drank );
  MPI_Cart_coords( comm_2d, my2drank, 2, mycoords );
  MPI_Cart_shift( comm_2d, 0, -1, &rightrank, &leftrank );
  MPI_Cart_shift( comm_2d, 1, -1, &downrank, &uprank );
  nlocal = n/dims[0];

68 Cannon's Matrix-Matrix Multiplication Using Overlapping Communication Cont'd
Differences between the overlapping and non-overlapping code (buffer setup):
  /* Set up a_buffers and b_buffers so communication and computation use different copies */
  a_buffers[0] = a;
  a_buffers[1] = (double *)malloc(nlocal*nlocal*sizeof(double));
  b_buffers[0] = b;
  b_buffers[1] = (double *)malloc(nlocal*nlocal*sizeof(double));

  /* Initial A and B matrix shift -- *a and *b are the submatrices on myrank */
  MPI_Cart_shift( comm_2d, 0, -mycoords[0], &shiftsource, &shiftdest );
  MPI_Sendrecv_replace( a_buffers[0], nlocal*nlocal, MPI_DOUBLE, shiftdest, 1,
                        shiftsource, 1, comm_2d, &status );
  MPI_Cart_shift( comm_2d, 1, -mycoords[1], &shiftsource, &shiftdest );
  MPI_Sendrecv_replace( b_buffers[0], nlocal*nlocal, MPI_DOUBLE, shiftdest, 1,
                        shiftsource, 1, comm_2d, &status );

69 Cannon's Matrix-Matrix Multiplication Using Overlapping Communication Cont'd
Differences between the overlapping and non-overlapping code (main loop):
  /* Assume c is initially 0 */
  for (i=0; i<dims[0]; i++) {
    /* Shift a left by one and b up by one, into the other buffer */
    MPI_Isend( a_buffers[i%2], nlocal*nlocal, MPI_DOUBLE, leftrank, 1, comm_2d, &reqs[0] );
    MPI_Isend( b_buffers[i%2], nlocal*nlocal, MPI_DOUBLE, uprank, 1, comm_2d, &reqs[1] );
    MPI_Irecv( a_buffers[(i+1)%2], nlocal*nlocal, MPI_DOUBLE, rightrank, 1, comm_2d, &reqs[2] );
    MPI_Irecv( b_buffers[(i+1)%2], nlocal*nlocal, MPI_DOUBLE, downrank, 1, comm_2d, &reqs[3] );
    /* c = c + a*b, overlapped with the communication above */
    MatrixMultiply( nlocal, a_buffers[i%2], b_buffers[i%2], c );
    for (j=0; j<4; j++)
      MPI_Wait( &reqs[j], &status );
  }

70 Collective Communication and Computation Operations

71 In C
int MPI_Barrier( MPI_Comm comm )
int MPI_Bcast( void *buf, int count, MPI_Datatype datatype, int source, MPI_Comm comm )
int MPI_Reduce( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int target, MPI_Comm comm )
int MPI_Allreduce( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )
int MPI_Scan( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )
int MPI_Gather( void *sendbuf, int sendcount, MPI_Datatype sdatatype, void *recvbuf, int recvcount, MPI_Datatype rdatatype, int target, MPI_Comm comm )
int MPI_Allgather( void *sendbuf, int sendcount, MPI_Datatype sdatatype, void *recvbuf, int recvcount, MPI_Datatype rdatatype, MPI_Comm comm )

72
int MPI_Scatter( void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int source, MPI_Comm comm )
int MPI_Scatterv( void *sendbuf, int *sendcounts, int *displs, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int source, MPI_Comm comm )
int MPI_Gatherv( void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int *recvcounts, int *displs, MPI_Datatype recvdatatype, int target, MPI_Comm comm )
int MPI_Allgatherv( void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int *recvcounts, int *displs, MPI_Datatype recvdatatype, MPI_Comm comm )

73
int MPI_Alltoall( void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, MPI_Comm comm )
int MPI_Alltoallv( void *sendbuf, int *sendcounts, int *sdispls, MPI_Datatype senddatatype, void *recvbuf, int *recvcounts, int *rdispls, MPI_Datatype recvdatatype, MPI_Comm comm )

74 Collective Communication And Computation Operations
The barrier synchronization: all processes must reach this barrier before proceeding.
subroutine MPI_Barrier( comm, ier )
  integer, intent(in)  :: comm
  integer, intent(out) :: ier
The broadcast procedure: all processes execute this procedure; count and datatype must be the same on all processes; the data in buf on process source is sent to all processes and stored in buf.
subroutine MPI_Bcast( buf, count, datatype, source, comm, ier )
  <datatype>, intent(inout) :: buf
  integer   , intent(in)    :: count, datatype, source, comm
  integer   , intent(out)   :: ier
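For instance, a C sketch in which rank 0 sets a parameter (imagine it was read from a file) and broadcasts it to everyone:
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[]) {
      int myid, n = 0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);
      if (myid == 0)
          n = 1000;                     /* e.g. a problem size known only to the root */
      MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* every rank now has n */
      printf("rank %d: n = %d\n", myid, n);
      MPI_Finalize();
      return 0;
  }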

75 MPI_Reduce: All-to-one reduction
subroutine MPI_Reduce( sendbuf, recvbuf, count, dtype, op, target, comm, ier )
  <datatype>, intent(in)  :: sendbuf
  <datatype>, intent(out) :: recvbuf
  integer   , intent(in)  :: count, dtype, op, target, comm
  integer   , intent(out) :: ier
Combines the data in sendbuf from the different processes, using the operator op, into the location recvbuf. recvbuf is only used and changed on the process with rank target; the processes that do not have rank target must still provide recvbuf. If count > 1, the operation is performed element-wise on corresponding components of sendbuf and combined into the corresponding elements of recvbuf.
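A minimal sketch of a global sum with MPI_Reduce (each rank contributes its own rank number, purely for illustration; the total arrives only on rank 0):
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[]) {
      int myid, np, localval, total = 0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);
      MPI_Comm_size(MPI_COMM_WORLD, &np);
      localval = myid;                 /* each process contributes its rank */
      MPI_Reduce(&localval, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
      if (myid == 0)
          printf("sum of ranks 0..%d = %d\n", np - 1, total);
      MPI_Finalize();
      return 0;
  }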

76 Predefined Reduction Operators
  Operation    Meaning                               Datatypes
  MPI_MAX      Maximum                               C integers and floating point
  MPI_MIN      Minimum                               C integers and floating point
  MPI_SUM      Sum                                   C integers and floating point
  MPI_PROD     Product                               C integers and floating point
  MPI_LAND     Logical AND                           C integers
  MPI_BAND     Bit-wise AND                          C integers and byte
  MPI_LOR      Logical OR                            C integers
  MPI_BOR      Bit-wise OR                           C integers and byte
  MPI_LXOR     Logical exclusive OR                  C integers
  MPI_BXOR     Bit-wise exclusive OR                 C integers and byte
  MPI_MAXLOC   Max of 1st value, min of 2nd value    Data-pairs
  MPI_MINLOC   Min of 1st value, min of 2nd value    Data-pairs

77 MPI_MINLOC and MPI_MAXLOC Reduction Operations
  First value (Value):     15  17  11  12  17  11
  Second value (Process):   0   1   2   3   4   5
  Returned results:
  MinLoc(Value, Process) = (11, 2)
  MaxLoc(Value, Process) = (17, 1)
For MINLOC: 11 is the minimum value over all processes and 2 is the lowest-ranked process holding that minimum. For MAXLOC: 17 is the maximum value over all processes and 1 is the lowest-ranked process holding that maximum.

78 MPI_MINLOC and MPI_MAXLOC Reduction Operations
For these operations, a pair of values is compared and a pair is generated. For example, for MPI_MAXLOC, the generated pair consists of the maximum of the first values and the minimum of the second values. Thus, recvbuf must be a structure of two components, of potentially different types; the paired types are indicated by the following recvdatatype values:
  MPI datatype          C datatypes
  MPI_2INT              pair of ints
  MPI_SHORT_INT         short and int
  MPI_LONG_INT          long and int
  MPI_LONG_DOUBLE_INT   long double and int
  MPI_FLOAT_INT         float and int
  MPI_DOUBLE_INT        double and int
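A small C sketch of finding the global minimum of a local value together with the rank that owns it, using MPI_DOUBLE_INT and MPI_MINLOC (the local values here are made up from the rank just for illustration):
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[]) {
      int myid;
      struct { double val; int rank; } in, out;   /* layout matching MPI_DOUBLE_INT */
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);
      in.val  = 100.0 - myid;       /* some local value */
      in.rank = myid;               /* who owns it */
      MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, 0, MPI_COMM_WORLD);
      if (myid == 0)
          printf("global minimum %g is on rank %d\n", out.val, out.rank);
      MPI_Finalize();
      return 0;
  }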

79 The All-Reduce Operation
Returns the reduced result to all processes. The interface is:
subroutine MPI_Allreduce( sendbuf, recvbuf, count, dtype, op, comm, ier )
  <datatype>, intent(in)  :: sendbuf
  <datatype>, intent(out) :: recvbuf
  integer   , intent(in)  :: count, dtype, op, comm
  integer   , intent(out) :: ier
Note: there is no target argument, as it is unnecessary; all processes receive the reduced result.

80 The Prefix Operations
Computes the partial reductions of a set of values on each process and returns a different partial reduction to each process. The partial reduction for process k is the reduction of the values on process k and on all lower-ranked processes. The supported reduction operations are those of MPI_Reduce.
subroutine MPI_Scan( sendbuf, recvbuf, count, dtype, op, comm, ier )
  <datatype>, intent(in)  :: sendbuf
  <datatype>, intent(out) :: recvbuf
  integer   , intent(in)  :: count, dtype, op, comm
  integer   , intent(out) :: ier
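A sketch of an inclusive prefix sum with MPI_Scan: with every process contributing the value 1, rank k receives k+1:
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[]) {
      int myid, one = 1, prefix = 0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);
      /* inclusive prefix sum over ranks 0..myid */
      MPI_Scan(&one, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
      printf("rank %d: prefix sum = %d\n", myid, prefix);
      MPI_Finalize();
      return 0;
  }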

81 The Gather Operation
Each process sends its data to the target process, and the target process appends the data into one variable in process rank order. Even the target process "sends" data to the receive buffer. The sendcount and sendtype arguments must be the same on all processes. recvcount is the number of elements received from each process; thus, the recvcount and recvtype arguments must match the sendcount and sendtype arguments.
subroutine MPI_Gather( sendbuf, sendcount, sendtype, &
                       recvbuf, recvcount, recvtype, &
                       target, comm, ier )
  <datatype>, intent(in)  :: sendbuf
  <datatype>, intent(out) :: recvbuf
  integer   , intent(in)  :: sendcount, sendtype, target
  integer   , intent(in)  :: recvcount, recvtype, comm
  integer   , intent(out) :: ier

82 The Allgather Operation
The MPI_Allgather operation gathers the results to all processes; that is, recvbuf on all processes contains the same data in the same order. Note: there is no target argument.
subroutine MPI_Allgather( sendbuf, sendcount, sendtype, &
                          recvbuf, recvcount, recvtype, &
                          comm, ier )
  <datatype>, intent(in)  :: sendbuf
  <datatype>, intent(out) :: recvbuf
  integer   , intent(in)  :: sendcount, sendtype
  integer   , intent(in)  :: recvcount, recvtype, comm
  integer   , intent(out) :: ier

83 Vector Variants Of MPI_Gather and MPI_Allgather -- MPI_Gatherv and MPI_Allgatherv
These operations allow the sizes of the data sent by each process to be different The size of the data received is specified in an array of sizes, recvcounts The k-th element specifies the number of elements sent from process k The location in the receive buffer recvbuf for the data from process k is specified by array displs The k-th element displs[k] specifies the position in recvbuf where the first element of the data from process k is placed

84 Interfaces For MPI_Gatherv and MPI_Allgatherv
subroutine MPI_Gatherv( sendbuf, sendcount, sendtype, &
                        recvbuf, recvcounts, displs, recvtype, &
                        target, comm, ier )
  <datatype>, intent(in)  :: sendbuf
  <datatype>, intent(out) :: recvbuf
  integer   , intent(in)  :: sendcount, sendtype, target
  integer   , intent(in)  :: recvcounts(*), displs(*), recvtype, comm
  integer   , intent(out) :: ier
subroutine MPI_Allgatherv( sendbuf, sendcount, sendtype, &
                           recvbuf, recvcounts, displs, recvtype, &
                           comm, ier )
  <datatype>, intent(in)  :: sendbuf
  <datatype>, intent(out) :: recvbuf
  integer   , intent(in)  :: sendcount, sendtype
  integer   , intent(in)  :: recvcounts(*), displs(*), recvtype, comm
  integer   , intent(out) :: ier
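A C sketch (contrived sizes: rank k contributes k+1 integers) showing how the recvcounts and displs arrays for MPI_Gatherv are typically built on the target:
  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char *argv[]) {
      int myid, np, i, nlocal;
      int *sendbuf, *recvcounts = NULL, *displs = NULL, *recvbuf = NULL;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);
      MPI_Comm_size(MPI_COMM_WORLD, &np);
      nlocal = myid + 1;                              /* rank k sends k+1 items */
      sendbuf = malloc(nlocal * sizeof(int));
      for (i = 0; i < nlocal; i++) sendbuf[i] = myid;
      if (myid == 0) {
          recvcounts = malloc(np * sizeof(int));
          displs     = malloc(np * sizeof(int));
          for (i = 0; i < np; i++) recvcounts[i] = i + 1;
          displs[0] = 0;                              /* data from rank k starts at displs[k] */
          for (i = 1; i < np; i++) displs[i] = displs[i-1] + recvcounts[i-1];
          recvbuf = malloc((displs[np-1] + recvcounts[np-1]) * sizeof(int));
      }
      MPI_Gatherv(sendbuf, nlocal, MPI_INT,
                  recvbuf, recvcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);
      if (myid == 0)
          printf("gathered %d items in total\n", displs[np-1] + recvcounts[np-1]);
      MPI_Finalize();
      return 0;
  }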

85 The Scatter Operation
Sends a different part of sendbuf on the source process to each of the other processes. The data sent to process k starts k*sendcount elements from the beginning of sendbuf. As with the gather operation, the same argument values must be provided on all processes for sendcount, sendtype, recvcount, recvtype, source, and comm.
subroutine MPI_Scatter( sendbuf, sendcount, sendtype, &
                        recvbuf, recvcount, recvtype, &
                        source, comm, ier )
  <datatype>, intent(in)  :: sendbuf
  <datatype>, intent(out) :: recvbuf
  integer   , intent(in)  :: sendcount, sendtype, source
  integer   , intent(in)  :: recvcount, recvtype, comm
  integer   , intent(out) :: ier

86 The MPI_Scatterv Operation
Just like MPI_Gatherv, it allows messages of different sizes to be sent to each process from the source process's sendbuf. The data sent to process k has size sendcounts[k] and starts at location displs[k] in sendbuf. Overlapping regions of sendbuf are allowed by setting sendcounts and displs appropriately.
subroutine MPI_Scatterv( sendbuf, sendcounts, displs, sendtype, &
                         recvbuf, recvcount, recvtype, &
                         source, comm, ier )
  <datatype>, intent(in)  :: sendbuf
  <datatype>, intent(out) :: recvbuf
  integer   , intent(in)  :: sendcounts(*), displs(*), sendtype, source
  integer   , intent(in)  :: recvcount, recvtype, comm
  integer   , intent(out) :: ier

87 The All-To-All Personalized Operation
Each process sends a different portion (scatters) of its data to each process in its world (including itself). The sent portions in sendbuf must be contiguous, of the same type sendtype, and of the same length sendcount. The received portions must also be of the same type (recvtype = sendtype) and length (recvcount = sendcount) as the sent portions. The received portions are placed in the receive buffer recvbuf in process rank order.
subroutine MPI_Alltoall( sendbuf, sendcount, sendtype, &
                         recvbuf, recvcount, recvtype, &
                         comm, ier )
  <datatype>, intent(in)  :: sendbuf
  <datatype>, intent(out) :: recvbuf
  integer   , intent(in)  :: sendcount, sendtype
  integer   , intent(in)  :: recvcount, recvtype, comm
  integer   , intent(out) :: ier

88 All-To-All Personalized Vector Variant
Like the other vector variants, each process sends portions of different sizes to each process in the world. The size of the contiguous portion to be sent to process k is sendcounts[k], and the portion starts at location sendbuf[sdispls[k]]. The portion received from process k starts at recvbuf[rdispls[k]] and has size recvcounts[k]. As with MPI_Scatterv, the messages can be overlapping sections of sendbuf.
subroutine MPI_Alltoallv( sendbuf, sendcounts, sdispls, sendtype, &
                          recvbuf, recvcounts, rdispls, recvtype, &
                          comm, ier )
  <datatype>, intent(in)  :: sendbuf
  <datatype>, intent(out) :: recvbuf
  integer   , intent(in)  :: sendcounts(*), sdispls(*), sendtype
  integer   , intent(in)  :: recvcounts(*), rdispls(*), recvtype, comm
  integer   , intent(out) :: ier

89 One-Dimensional Row-Wise Matrix-Vector Multiplication
Consider multiplying an nn matrix A by a vector b of length n to produce a vector x Consider performing this computation on p processors and each processor computes a portion of size n/p of x Assume A, b and x have been partitioned by rows into chunks of n/p consecutive rows on each processor This is called the row-wise matrix-vector multiplication The following code uses MPI_Allgather to gather all of b onto each processor and perform the computation of the appropriate portion of x

90 Row-Wise Matrix-Vector Multiplication Code
RowMatrixVectorMultiply( int n, double *a, double *b, double *x, MPI_Comm comm ) {
  int i, j, nlocal, npes, myrank;
  double *fb;

  /* Get information about the communicator */
  MPI_Comm_size( comm, &npes );
  MPI_Comm_rank( comm, &myrank );

  /* Allocate the memory that will store the entire b */
  fb = (double *)malloc(n*sizeof(double));
  nlocal = n/npes;

  /* Gather the entire vector b on each processor using MPI_Allgather */
  MPI_Allgather( b, nlocal, MPI_DOUBLE, fb, nlocal, MPI_DOUBLE, comm );

  /* Perform the mat-vec multiplication involving the locally stored submatrix */
  for (i=0; i<nlocal; i++) {
    x[i] = 0.0;
    for (j=0; j<n; j++)
      x[i] += a[i*n+j]*fb[j];
  }
  free(fb);
}

91 One-Dimensional Column-Wise Matrix-Vector Multiplication
Consider performing this computation on p processors, with each processor computing a partial sum of the result vector x. Assume A has been partitioned by columns into chunks of n/p consecutive columns on each processor, b is partitioned into chunks of n/p consecutive rows, and all of x is on each processor. This is called the column-wise matrix-vector multiplication. The following code uses MPI_Reduce to add all the partial sums representing x and then distributes x like b onto each processor using MPI_Scatter.

92 Column-Wise Matrix-Vector Multiplication Code -- Part 1
ColMatrixVectorMultiply( int n, double *a, double *b, double *x, MPI_Comm comm ) {
  int i, j, nlocal, npes, myrank;
  double *px, *fx;

  /* Get information about the communicator */
  MPI_Comm_size( comm, &npes );
  MPI_Comm_rank( comm, &myrank );
  nlocal = n/npes;

  /* Allocate the memory that will store the intermediate results */
  px = (double *)malloc(n*sizeof(double));
  fx = (double *)malloc(n*sizeof(double));

93 Column-Wise Matrix-Vector Multiplication Code -- Part 2
  /* Perform the partial dot products corresponding to the local elements of A */
  for (i=0; i<n; i++) {
    px[i] = 0.0;
    for (j=0; j<nlocal; j++)
      px[i] += a[i*nlocal+j]*b[j];
  }

  /* Sum up the results on process 0 by performing an element-wise reduction */
  MPI_Reduce( px, fx, n, MPI_DOUBLE, MPI_SUM, 0, comm );

  /* Redistribute fx in the same way that x is distributed */
  MPI_Scatter( fx, nlocal, MPI_DOUBLE, x, nlocal, MPI_DOUBLE, 0, comm );
  free(px); free(fx);
}

94 Comparison Of Row-Wise And Column-Wise Matrix-Vector Multiplication
Both perform the same computational work, O(n^2/p) floating point operations. They perform different communications: row-wise, p*ts + 0.5*n*tw for the MPI_Allgather; column-wise, (p+1)*ts*log p + n*tw*(1+log p) for the MPI_Reduce and MPI_Scatter. Column-wise is therefore more expensive. But if you have to perform both A*b and A^T*b (or b^T*A), then the column-wise algorithm is used for the second product rather than distributing A both by columns and by rows, which is very expensive.

95 Single-Source Shortest Path
Compute the lengths of the shortest paths from a particular vertex source of a graph of n vertices to all other vertices of the graph, using Dijkstra's single-source shortest-path algorithm in parallel. Assume p processors and that p divides n. Assume wgt is the weighted adjacency matrix of the graph. The serial algorithm is:
Initialize the weighted distance from source to each other vertex from the weighted adjacency matrix
Repeat the following three steps until all vertices are visited:
Find an unvisited vertex connected to source with minimum weighted distance and smallest index
Mark this vertex as visited
Update the distance from source to all other vertices, if the distance from source to the found vertex plus the distance from the found vertex to a particular other vertex is smaller

96 The Parallel Distribution
For the parallel algorithm, distribute the data as follows: The vertices are divided evenly and consecutively amongst the processors, with firstvtx and lastvtx being the vertex numbers on each processor The weighted adjacency matrix wgt is distributed by rows with the rows for vertices firstvtx to lastvtx When there is no connection, the entry is MAXINT Other variables in the parallel algorithm are: lengths[j] is the current weighted distance from vertex source to vertex j -- when the computation completes, it is the length of the shortest path from vertex source to vertex j marker[j]=0 indicates vertex j has been visited and its minimum distance from source has been computed

97 Single-Source Shortest Path Parallel Algorithm -- Part 1
SingleSource( int n, int source, int *wgt, int *lengths, MPI_Comm comm ) {
  int i, j, nlocal, npes, myrank, firstvtx, lastvtx, u, udist;
  int *marker, lminpair[2], gminpair[2];

  /* Get information about the communicator */
  MPI_Comm_size( comm, &npes );
  MPI_Comm_rank( comm, &myrank );
  nlocal   = n/npes;
  firstvtx = myrank*nlocal;
  lastvtx  = firstvtx + nlocal - 1;

  /* Set the initial distances from the source to all other vertices */
  for (j=0; j<nlocal; j++)
    lengths[j] = wgt[source*nlocal+j];

  /* Set marker[j]=1 to indicate that the distance to vertex j has not yet been found */
  marker = (int *)malloc(nlocal*sizeof(int));
  for (j=0; j<nlocal; j++)
    marker[j] = 1;

98 Single-Source Shortest Path Parallel Algorithm -- Part 2
  /* The process that owns the source vertex marks it as being visited */
  if (source >= firstvtx && source <= lastvtx)
    marker[source-firstvtx] = 0;

  /* The main loop of Dijkstra's algorithm */
  for (i=1; i<n; i++) {   /* repeat for all non-source vertices */
    /* Step 1: Find the local vertex that is closest to the source */
    lminpair[0] = MAXINT;
    lminpair[1] = -1;
    for (j=0; j<nlocal; j++) {
      if (marker[j] && lengths[j] < lminpair[0]) {
        lminpair[0] = lengths[j];
        lminpair[1] = firstvtx + j;
      }
    }

99 Single-Source Shortest Path Parallel Algorithm -- Part 3
    /* Step 2: Compute the global minimum vertex and mark it as visited */
    MPI_Allreduce( lminpair, gminpair, 1, MPI_2INT, MPI_MINLOC, comm );
    udist = gminpair[0];
    u     = gminpair[1];

    /* The owner of u marks the vertex as visited */
    if (u == lminpair[1])
      marker[u-firstvtx] = 0;

    /* Step 3: Update the distances, now that the shortest distance from source to u is known */
    for (j=0; j<nlocal; j++) {
      if (marker[j] && udist + wgt[u*nlocal+j] < lengths[j])
        lengths[j] = udist + wgt[u*nlocal+j];
    }
  }
  free(marker);
}

100 Avoiding Load Imbalances
One subtle property of this algorithm is that the processor loads will become imbalanced The reason is that when there are ties in terms of the minimum distance, the algorithm always picks the vertex of least number This will tend to reduce the number of vertices on the earliest ranked processors first, thus causing them to have fewer tests for the minimum value in the weighted adjacency matrix wgt The way to avoid this is to distribute the vertices cyclically (the rows of wgt also) This implies that the vertex of minimum path length and also minimum vertex number is unlikely to be on the earlier processors
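As a sketch of the cyclic alternative (one reasonable mapping, not the slides' code): vertex j is assigned to process j mod p, with local index j / p, so ties in the minimum are spread across processes rather than concentrated on the lowest ranks:
  /* Hypothetical helpers for a cyclic distribution of n vertices over p processes */
  static int owner_of(int j, int p)       { return j % p; }   /* which process holds vertex j */
  static int local_index_of(int j, int p) { return j / p; }   /* its index in that process's arrays */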

101 The Parallel Sample Sort Algorithm
Sort a sequence A of n elements on p processors in increasing order. Assume p divides n and that the elements of A are distributed evenly over the p processors. This algorithm, called sample sort, is an improvement on the bucket sort algorithm, where the elements are assumed to be uniformly distributed over an interval. Steps (as in the serial sample sort algorithm):
Sort the elements that are on each processor
Select a sample of size p(p-1) from the elements of A by selecting p-1 equally spaced elements on each processor
Gather these samples to all processors
Sort the samples and select every p-th element from the sorted sample, using these elements as the p bucket splitters
Distribute the elements of A resident on each processor to the buckets assigned to each processor
Sort the elements in the buckets on each processor

102 Parallel Sample-Sort Algorithm -- Part 1
int *SampleSort( int n, int *elmnts, int *nsorted, MPI_Comm comm ) {
  int i, j, nlocal, npes, myrank, *sorted_elmnts, *splitters, *allpicks;
  int *scounts, *sdispls, *rcounts, *rdispls;

  /* Get information about the communicator */
  MPI_Comm_size( comm, &npes );
  MPI_Comm_rank( comm, &myrank );
  nlocal = n/npes;

  /* Allocate memory for the arrays that will store the splitters */
  splitters = (int *)malloc(npes*sizeof(int));
  allpicks  = (int *)malloc(npes*(npes-1)*sizeof(int));

  /* Sort the local array */
  qsort( elmnts, nlocal, sizeof(int), IncOrder );

  /* Select npes-1 equally spaced local elements */
  for (i=1; i<npes; i++)
    splitters[i-1] = elmnts[i*nlocal/npes];

  /* Gather the samples from all the processors */
  MPI_Allgather( splitters, npes-1, MPI_INT, allpicks, npes-1, MPI_INT, comm );

103 Parallel Sample-Sort Algorithm -- Part 2
  /* Sort the samples gathered from all processors */
  qsort( allpicks, npes*(npes-1), sizeof(int), IncOrder );

  /* Select the splitters */
  for (i=1; i<npes; i++)
    splitters[i-1] = allpicks[i*npes];
  splitters[npes-1] = MAXINT;

  /* Compute the number of elements that belong to each bucket */
  scounts = (int *)malloc(npes*sizeof(int));
  for (i=0; i<npes; i++)
    scounts[i] = 0;
  for (i=j=0; i<nlocal; i++) {
    if (elmnts[i] < splitters[j])
      scounts[j]++;
    else
      scounts[++j]++;
  }

104 Parallel Sample-Sort Algorithm -- Part 3
  /* Determine the starting location of each bucket's elements in elmnts */
  sdispls = (int *)malloc(npes*sizeof(int));
  sdispls[0] = 0;
  for (i=1; i<npes; i++)
    sdispls[i] = sdispls[i-1] + scounts[i-1];

  /* Perform an all-to-all to inform all processes of the number of elements */
  rcounts = (int *)malloc(npes*sizeof(int));
  MPI_Alltoall( scounts, 1, MPI_INT, rcounts, 1, MPI_INT, comm );

  /* Based on rcounts, determine where in the local array the data from each
     processor will be stored. This array will store the received elements as
     well as the final sorted sequence */
  rdispls = (int *)malloc(npes*sizeof(int));
  rdispls[0] = 0;
  for (i=1; i<npes; i++)
    rdispls[i] = rdispls[i-1] + rcounts[i-1];

  *nsorted = rdispls[npes-1] + rcounts[npes-1];
  sorted_elmnts = (int *)malloc((*nsorted)*sizeof(int));

105 Parallel Sample-Sort Algorithm -- Part 4
  /* Each process sends and receives the corresponding elements, using   */
  /* MPI_Alltoallv.  The arrays scounts and sdispls specify the number   */
  /* of elements to be sent and where these elements are stored,         */
  /* respectively.  The arrays rcounts and rdispls specify the number of */
  /* elements to be received and where these elements will be stored,    */
  /* respectively                                                        */
  MPI_Alltoallv( elmnts, scounts, sdispls, MPI_INT,
                 sorted_elmnts, rcounts, rdispls, MPI_INT, comm );

  /* Perform the final local sort */
  qsort( sorted_elmnts, *nsorted, sizeof(int), IncOrder );

  free(splitters); free(allpicks);
  free(scounts); free(sdispls); free(rcounts); free(rdispls);

  return sorted_elmnts;
}
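A minimal driver of the following form could exercise this routine; the total element count and the random fill are purely illustrative, and the sketch assumes the SampleSort shown above (returning a pointer to the sorted block and its length through nsorted):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int *SampleSort( int n, int *elmnts, int *nsorted, MPI_Comm comm );

int main( int argc, char *argv[] )
{
    int n = 1024;            /* total number of elements (assumed) */
    int nsorted, myrank, npes, i;
    int *elmnts, *sorted;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
    MPI_Comm_size( MPI_COMM_WORLD, &npes );

    /* Each process fills its n/npes local elements with random keys */
    elmnts = (int *)malloc( (n/npes)*sizeof(int) );
    srand( myrank + 1 );
    for (i = 0; i < n/npes; i++)
        elmnts[i] = rand();

    sorted = SampleSort( n, elmnts, &nsorted, MPI_COMM_WORLD );
    printf( "rank %d holds %d sorted elements\n", myrank, nsorted );

    free( elmnts ); free( sorted );
    MPI_Finalize();
    return 0;
}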

106 Groups And Communicators
Sometimes you want to distribute data only to a particular subset of the processors
MPI provides a general procedure, MPI_Comm_split, that partitions the processors of a communicator into subsets to support this kind of distribution (for an example, see the code on the next slide)
To keep only some dimensions of a Cartesian mesh, use MPI_Cart_sub instead
MPI_Comm_split is called by every processor in the original communicator; the color and key arguments determine, respectively, which partition the processor joins and its rank order within the new communicator
The interface for MPI_Comm_split is:
int MPI_Comm_split( MPI_Comm comm, int color, int key, MPI_Comm *newcomm )

107 Picture Of The Partition
[Figure: a table of processes with their color and key values, showing how MPI_Comm_split maps each original rank 0..7 to a new rank within its partition]
Accomplished with a single call to MPI_Comm_split, executed on each processor, of the form:
MPI_Comm_split( comm, color[myrank], key[myrank], &newcomm );
where the arrays color and key hold the values shown in the figure
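A small self-contained sketch of this pattern; the even/odd color choice here is illustrative, not the particular partition in the figure:

#include <mpi.h>
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int myrank, color, key, newrank;
    MPI_Comm newcomm;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &myrank );

    color = myrank % 2;   /* partition: even ranks vs. odd ranks */
    key   = myrank;       /* order within each partition by old rank */
    MPI_Comm_split( MPI_COMM_WORLD, color, key, &newcomm );

    MPI_Comm_rank( newcomm, &newrank );
    printf( "world rank %d -> color %d, new rank %d\n", myrank, color, newrank );

    MPI_Comm_free( &newcomm );
    MPI_Finalize();
    return 0;
}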

108 Splitting Cartesian Topologies
Use the MPI procedure MPI_Cart_sub to partition a Cartesian topology along its dimensions
The array keep_dims indicates which dimensions are retained in the new topology; keep_dims[j] = 1 means dimension j is kept
Example: suppose you have a 2×4×7 processor grid
A call to MPI_Cart_sub with keep_dims = (T,F,T) gives each processor a new communicator describing a 2×7 grid of processors
It actually produces 4 such communicators, on disjoint sets of processors -- the processors with coordinates (*,k,*) all share the same communicator for a given value of k
The interface is:
int MPI_Cart_sub( MPI_Comm comm_cart, int *keep_dims, MPI_Comm *comm_subcart )

109 A Figure Showing Splitting Cartesian Meshes
[Figure: a 2×4×7 processor mesh split with keep_dims[] = {true,false,true} into 4 groups of size 2×1×7, one group per coordinate along the dropped dimension]
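A small sketch of the call sequence for this example; the process count (2×4×7 = 56) and the non-periodic dimensions are assumptions of the sketch, not part of the slides:

#include <mpi.h>
#include <stdio.h>

int main( int argc, char *argv[] )
{
    /* Assumes the program is run with 2*4*7 = 56 processes */
    int dims[3] = {2, 4, 7}, periods[3] = {0, 0, 0};
    int keep_dims[3] = {1, 0, 1};   /* keep the first and third dimensions */
    int subsize;
    MPI_Comm comm_cart, comm_sub;

    MPI_Init( &argc, &argv );
    MPI_Cart_create( MPI_COMM_WORLD, 3, dims, periods, 0, &comm_cart );
    MPI_Cart_sub( comm_cart, keep_dims, &comm_sub );

    MPI_Comm_size( comm_sub, &subsize );
    printf( "my sub-communicator has %d processes\n", subsize );  /* prints 14 */

    MPI_Comm_free( &comm_sub );
    MPI_Comm_free( &comm_cart );
    MPI_Finalize();
    return 0;
}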

110 Two-Dimensional Matrix-Vector Multiplication Algorithm
Perform the computation x = Ab on p processors arranged in a square 2-d grid of size √p×√p, where:
√p divides n
A is an n×n matrix distributed in blocks of size n/√p × n/√p
b is initially stored in equal-sized pieces on the processors in the first column, and the result x ends up on these first-column processors, partitioned in the same way
Steps of the algorithm:
Create a 2-d wraparound processor grid of size √p×√p
Distribute the subsections of b: first from the processors in the first column to the diagonal processors of the corresponding process rows, then broadcast each diagonal block to all processors in its column
Perform the submatrix product Ab on each processor
Perform a sum reduction of the partial results of x along each row to the processor in the first column of that row
This final step gives the desired result x

111 Two-Dimensional Matrix-Vector Multiplication -- Part 1
void MatrixVectorMultiply_2D( int n, double *a, double *b, double *x, MPI_Comm comm )
{
  int ROW = 0, COL = 1;
  int i, j, nlocal, npes, dims[2], periods[2], keep_dims[2];
  int myrank, my2drank, mycoords[2], other_rank, coords[2];
  double *px;
  MPI_Status status;
  MPI_Comm comm_2d, comm_row, comm_col;

  /* Get information about the communicator */
  MPI_Comm_size( comm, &npes );
  MPI_Comm_rank( comm, &myrank );

  /* Compute the size of the square grid (npes is assumed to be a perfect square) */
  dims[ROW] = dims[COL] = (int)sqrt((double)npes);   /* requires <math.h> */
  nlocal = n/dims[ROW];

  /* Allocate memory for the array that will hold the partial dot-products */
  px = (double *)malloc( nlocal*sizeof(double) );

112 Two-Dimensional Matrix-Vector Multiplication -- Part 2
  /* Set up the Cartesian topology and get the rank and coordinates */
  /* of the process in this topology */
  periods[ROW] = periods[COL] = 1;   /* Periods specify wraparound */
  MPI_Cart_create( comm, 2, dims, periods, 1, &comm_2d );
  MPI_Comm_rank( comm_2d, &my2drank );                /* Get my rank in comm_2d */
  MPI_Cart_coords( comm_2d, my2drank, 2, mycoords );

  /* Create the row-based sub-topology */
  keep_dims[ROW] = 0;
  keep_dims[COL] = 1;
  MPI_Cart_sub( comm_2d, keep_dims, &comm_row );

  /* Create the column-based sub-topology */
  keep_dims[ROW] = 1;
  keep_dims[COL] = 0;
  MPI_Cart_sub( comm_2d, keep_dims, &comm_col );

113 Two-Dimensional Matrix-Vector Multiplication -- Part 3
  /* Redistribute the b vector */
  /* Step 1: The processors along the 0th column send their data to the */
  /* diagonal processors */
  if (mycoords[COL] == 0 && mycoords[ROW] != 0) {   /* 0th col and not diag */
    coords[ROW] = mycoords[ROW];
    coords[COL] = mycoords[ROW];
    MPI_Cart_rank( comm_2d, coords, &other_rank );  /* Rank of diagonal process */
    MPI_Send( b, nlocal, MPI_DOUBLE, other_rank, 1, comm_2d );
  }
  if (mycoords[COL] == mycoords[ROW] && mycoords[ROW] != 0) {
    coords[ROW] = mycoords[ROW];
    coords[COL] = 0;
    MPI_Cart_rank( comm_2d, coords, &other_rank );  /* Rank of 0th-column sender */
    MPI_Recv( b, nlocal, MPI_DOUBLE, other_rank, 1, comm_2d, &status );
  }

114 Two-Dimensional Matrix-Vector Multiplication -- Part 4
  /* Step 2: The diagonal processors perform a column-wise broadcast */
  coords[0] = mycoords[COL];
  MPI_Cart_rank( comm_col, coords, &other_rank );   /* Rank of the diagonal process in this column */
  MPI_Bcast( b, nlocal, MPI_DOUBLE, other_rank, comm_col );

  /* The main computational loop */
  for (i=0; i<nlocal; i++) {
    px[i] = 0.0;
    for (j=0; j<nlocal; j++)
      px[i] += a[i*nlocal+j]*b[j];
  }

115 Two-Dimensional Matrix-Vector Multiplication -- Part 5
  /* Perform the sum-reduction along the rows to add up the partial */
  /* dot products */
  coords[0] = 0;
  MPI_Cart_rank( comm_row, coords, &other_rank );   /* Rank of the first-column process in this row */
  MPI_Reduce( px, x, nlocal, MPI_DOUBLE, MPI_SUM, other_rank, comm_row );

  MPI_Comm_free( &comm_2d );
  MPI_Comm_free( &comm_row );
  MPI_Comm_free( &comm_col );
  free(px);
}
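A minimal driver sketch, assuming the MatrixVectorMultiply_2D shown above, 16 processes (a 4×4 grid), and n = 16, so each process holds a 4×4 block of A; b is filled on every process for simplicity even though only the first-column copies are actually consumed:

#include <mpi.h>
#include <stdlib.h>

void MatrixVectorMultiply_2D( int n, double *a, double *b, double *x, MPI_Comm comm );

int main( int argc, char *argv[] )
{
    /* Illustrative only: run with 16 processes, n = 16, nlocal = n/sqrt(p) = 4 */
    int n = 16, nlocal = 4, i;
    double *a, *b, *x;

    MPI_Init( &argc, &argv );

    a = (double *)malloc( nlocal*nlocal*sizeof(double) );
    b = (double *)malloc( nlocal*sizeof(double) );
    x = (double *)malloc( nlocal*sizeof(double) );

    for (i = 0; i < nlocal*nlocal; i++) a[i] = 1.0;   /* A = all ones */
    for (i = 0; i < nlocal; i++)        b[i] = 1.0;   /* only first-column copies of b are read */

    MatrixVectorMultiply_2D( n, a, b, x, MPI_COMM_WORLD );
    /* On the first-column processes, x now holds their piece of A*b (every entry 16.0) */

    free(a); free(b); free(x);
    MPI_Finalize();
    return 0;
}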

116 Calling CLAPACK Routine -- Declarations
#include <stdio.h>
#define ALL_DATA_ROW     25
#define ALL_DATA_COLUMN   4
#define PARTITION_FACTOR  7
#define bound             5
#include <f2c.h>
#include <clapack.h>

/* Prototype (source in /usr/local/CLAPACK/SRC) -- unnecessary if clapack.h is used */
/* Subroutine */ int ssyev_(char *jobz, char *uplo, integer *n, real *a,
                            integer *lda, real *w, real *work, integer *lwork,
                            integer *info);

int main()
{
  /* Type integer in f2c.h is defined as long int */
  /* Type real in f2c.h is defined as float */
  long int DATA_ROW = ALL_DATA_COLUMN, lwork = 102, ntau = 3, lda = 4, info;
  float test[4][4], W[DATA_ROW], work[(32+2)*DATA_ROW];
  int j, result;
  char vec = 'V', uplo = 'U';

117 Calling CLAPACK Routine -- Code
  test[0][0] = 0.0; test[0][1] = 0.0; test[0][2] = 0.0; test[0][3] = 0.0;
  test[1][0] = 0.0; test[1][1] = 1.0; test[1][2] = 0.0; test[1][3] = 0.0;
  test[2][0] = 0.0; test[2][1] = 0.0; test[2][2] = 2.0; test[2][3] = 0.0;
  test[3][0] = 0.0; test[3][1] = 0.0; test[3][2] = 0.0; test[3][3] = 3.0;

  /* Call the symmetric matrix eigensolver ssyev in single precision */
  result = ssyev_(&vec, &uplo, &ntau, &test[1][1], &lda, &W[1], work, &lwork, &info);

  printf("\nresult = %d, ", result);
  printf("\ninfo = %ld, ", info);       /* info is a long int (f2c integer) */
  printf("\nW = %e %e %e , ", W[1], W[2], W[3]);

  return ( 0 );
} /* end main */

118 C Makefile For CLAPACK
# Sample makefile for calling CLAPACK procedures
GCC = gcc
MPICC = mpicc
CFLAGS = -O3 -I/usr/local/src/CLAPACK/BLAS/SRC
LFLAGS = /usr/local/src/CLAPACK/lapack_LINUX.a \
         -L/usr/local/lib/atlas \
         -L/usr/local/src/CLAPACK -L/usr/local/lib/GNU \
         -L/usr/local/src/CLAPACK/F2CLIBS \
         -lcblaswr -lcblas -latlas -lF77 -lI77 -lF77 -lm -lc

proj1: proj1.c
	$(GCC) $(CFLAGS) -o proj1 proj1.c $(LFLAGS)

119 Calling LAPACK Routine -- Declarations
program main
  implicit none
  integer, parameter :: ALL_DATA_ROW = 25, ALL_DATA_COLUMN = 4
  integer, parameter :: PARTITION_FACTOR = 7
  integer, parameter :: TAU = 3, bound = 5
  interface
    subroutine ssyev(jobz, uplo, n, a, lda, w, work, lwork, info)
      implicit none
      integer, parameter :: WP = kind(0.0e0)
      character(1), intent(in) :: jobz, uplo
      integer, intent(in) :: n, lda, lwork
      integer, intent(out) :: info
      real(WP), dimension(lda,*), intent(inout) :: a
      real(WP), dimension(n), intent(out) :: w
      real(WP), dimension(lwork), intent(out) :: work
    end subroutine ssyev
  end interface

120 Calling LAPACK Routine -- Code
  integer, parameter :: DATA_ROW = ALL_DATA_COLUMN
  integer :: lwork = 102, ntau = 3
  integer :: lda = 4, info
  real :: test(0:3,0:3), W(DATA_ROW), work((32+2)*DATA_ROW)
  character(1) :: vec = 'V', uplo = 'U'

  test(0,0) = 0.0; test(0,1) = 0.0; test(0,2) = 0.0; test(0,3) = 0.0
  test(1,0) = 0.0; test(1,1) = 1.0; test(1,2) = 0.0; test(1,3) = 0.0
  test(2,0) = 0.0; test(2,1) = 0.0; test(2,2) = 2.0; test(2,3) = 0.0
  test(3,0) = 0.0; test(3,1) = 0.0; test(3,2) = 0.0; test(3,3) = 3.0

  ! Call the symmetric matrix eigensolver ssyev in single precision
  call ssyev(vec, uplo, ntau, test(1,1), lda, W, work, lwork, info)

  print *, "info = ", info, ','
  print *, "W = ", W(1), ', ', W(2), ', ', W(3)
end program main

121 Fortran Makefile For LAPACK
FFLAGS = -O2 -fast

# This LAPACK has been built with g77 and not the PGI compiler.  It appears to work.
#LFLAGS = /usr/local/LAPACK/lapack_LINUX.a -L/usr/local/lib/atlas \
#         -L/usr/local/src/CLAPACK -L/usr/local/lib/GNU -L/usr/local/src/CLAPACK/F2CLIBS \
#         -lcblaswr -lcblas -latlas -lF77 -lI77 -lF77 -lm -lc

# This is the PGI LAPACK library and it works.  However, if the atlas library is not linked
# with it, it is very slow (by a factor of 2-3 in dgemm, for example).
LFLAGS = /usr/pgi/linux86/lib/liblapack.a -L/usr/local/lib/atlas -lf77blas -lcblas -latlas

proj1: proj1.f
	$(FORTRAN) $(FFLAGS) -o proj1 proj1.f $(LFLAGS)

