1 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.

2 2.2 Message-Passing Programming using User-level Message-Passing Libraries Programming by using a normal high-level language such as C, augmented with message-passing library calls that perform direct process-to-process message passing. 1. A method of creating separate processes for execution on different computers. 2. A method of sending and receiving messages.

3 2.3 Multiple program, multiple data (MPMD) model [Diagram: separate source files, each compiled to suit its processor, giving one executable per processor from processor 0 to processor p - 1.]

4 2.4 Multiple Program Multiple Data (MPMD) Model with dynamic process creation [Diagram: process 1 calls spawn(); after some time, process 2 starts execution.] Processes started from within master process - dynamic process creation. Potentially very flexible but incurs overhead of dynamically starting processes. PVM used this form. MPI-2 has dynamic process creation although we will not use it.

5 2.5 Single Program Multiple Data (SPMD) model. [Diagram: one source file compiled to suit each processor, producing an executable on each of processor 0 through processor p - 1.] Different processes merged into one program. Control (IF) statements select different parts for each processor to execute. All executables started together - static process creation.

6 2.6 Single Program Multiple Data (SPMD) model. [Diagram: same compilation arrangement as the previous slide, with the single source containing:] if (processID == 0) { … // do this code } else if (processID == 1) { … // do this code } else … Typically coded for one master and a set of identical slave processes rather than each process different. MPI uses this form.

7 2.7 Advantages/disadvantages of MPMD and SPMD MPMD with dynamic process creation –Flexible – can start up processes on demand during execution, for example when searching through a search space. –Has process start-up overhead SPMD (with static process creation) –Easier to code –Just one program to write –Collective message passing routines in each process the same (see later) –Efficient process execution

8 2.8 Basic “point-to-point” Send and Receive Routines Passing a message between processes using send() and recv() library calls (generic syntax; actual formats later): [Diagram: process 1 executes send(&x, 2); process 2 executes recv(&y, 1); the data moves from x in process 1 to y in process 2.]

9 2.9 Synchronous Message Passing Routines that actually return when message transfer completed. Synchronous send routine Waits until complete message has been accepted by the receiving process before returning. Synchronous receive routine Waits until the message it is expecting arrives. Synchronous routines intrinsically perform two actions: They transfer data and they synchronize processes. Neither can proceed until the message has been passed from the source to the destination. So no message buffer storage is needed.

10 2.10 Synchronous send() and recv() using 3-way protocol [Diagram (a), when send() occurs before recv(): process 1 issues a request to send and suspends; when process 2 reaches recv() it returns an acknowledgment, the message is transferred, and both processes continue. Diagram (b), when recv() occurs before send(): process 2 suspends in recv(); when process 1 reaches send(), the request to send, acknowledgment, and message are exchanged, and both processes continue.]

11 2.11 Asynchronous Message Passing Blocking - has been used to describe routines that do not return until the transfer is completed. –The routines are “blocked” from continuing. Non-blocking - has been used to describe routines that return whether or not the message had been received. Usually require local storage for messages. –In general, they do not synchronize processes but allow processes to move forward sooner. Must be used with care. In that sense, the terms synchronous and blocking were synonymous.

12 2.12 MPI Definitions of Blocking and Non-Blocking Locally Blocking - return after their local actions complete, though the message transfer may not have been completed. Non-blocking - return immediately. Assumes that data storage used for transfer is not modified by subsequent statements prior to being used for transfer, and it is left to the programmer to ensure this. These terms may have different interpretations in other systems.

13 2.13 How message-passing routines return before message transfer completed [Diagram: process 1's send() places the message in a message buffer and the process continues; process 2's recv() later reads the message from that buffer.] A message buffer is needed between source and destination to hold the message.

14 Message Buffer For a receive routine, the message has to have been received if we want the message. If recv() is reached before send(), the message buffer will be empty and recv() waits for the message. For a send routine, once the local actions have been completed and the message is safely on its way, the process can continue with subsequent work. In this way, using such send routines can decrease the overall execution time. In practice, buffers can only be of finite length and a point could be reached when the send routine is held up because all the available buffer space has been exhausted. It may be necessary to know at some point if the message has actually been received, which will require additional message passing. 2.14

15 2.15 Message Tag Used to differentiate between different types of messages being sent. Message tag is carried within message. If special type matching is not required, a wild card message tag is used, so that the recv() will match with any send().

16 2.16 Message Tag Example To send a message, x, with message tag 5 from a source process, 1, to a destination process, 2, and assign to y: [Diagram: process 1 executes send(&x, 2, 5); process 2 executes recv(&y, 1, 5), which waits for a message from process 1 with a tag of 5; the data moves from x to y.]
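
Anticipating the MPI syntax covered later in these slides, the same exchange might be sketched as below; myrank and the MPI_Status variable are assumptions carried over from the later example slides, and the tag value 5 matches the diagram. Using MPI_ANY_TAG in the receive would act as the wild-card tag mentioned on the previous slide.

    /* Sketch only: tag 5 is carried with the message; the receive matches on source and tag. */
    int x, y;
    MPI_Status status;
    if (myrank == 1)
        MPI_Send(&x, 1, MPI_INT, 2, 5, MPI_COMM_WORLD);           /* to process 2, tag 5 */
    else if (myrank == 2)
        MPI_Recv(&y, 1, MPI_INT, 1, 5, MPI_COMM_WORLD, &status);  /* from process 1, tag 5 only */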

17 2.17 “Group” message passing routines Have routines that send message(s) to a group of processes or receive message(s) from a group of processes Higher efficiency than separate point-to-point routines although not absolutely necessary.

18 2.18 Broadcast Sending same message to all processes concerned with problem. Multicast - sending same message to defined group of processes. [Diagram, SPMD (MPI) form: every process, 0 through p - 1, executes bcast(); the contents of the root's buffer (buf) are copied into the data area of every process.] Broadcast action does not occur until all the processes have executed their broadcast routine, and the broadcast operation will have the effect of synchronizing the processes.

19 2.19 Scatter Sending each element of an array in root process to a separate process. Contents of ith location of array sent to ith process. Can send more than one element. [Diagram, SPMD (MPI) form: every process, 0 through p - 1, executes scatter(); element i of the root's buffer (buf) is delivered to the data area of process i.]

20 2.20 Gather Having one process collect individual values from set of processes. [Diagram, SPMD (MPI) form: every process, 0 through p - 1, executes gather(); the data value from each process is placed into successive locations of the root's buffer (buf).]

21 2.21 Reduce Gather operation combined with specified arithmetic/logical operation to produce a single value. Example: values could be gathered and then added together by root. [Diagram, SPMD (MPI) form: every process, 0 through p - 1, executes reduce(); the data values are combined (here with +) into the root's buffer (buf).]

22 2.22 Software Tools for Clusters Late 1980’s - Parallel Virtual Machine (PVM) developed. Became very popular. Mid 1990’s - Message-Passing Interface (MPI) standard defined. Both based upon the message-passing parallel programming model. Both provide a set of user-level libraries for message passing. Use with sequential programming languages (C, C++, ...).

23 2.23 PVM (Parallel Virtual Machine) Perhaps first widely adopted attempt at using a workstation cluster as a multicomputer platform, developed by Oak Ridge National Laboratories. Available at no charge. Programmer decomposes problem into separate programs (usually master and group of identical slave programs). Programs compiled to execute on specific types of computers. Set of computers used on a problem first must be defined prior to executing the programs (in a hostfile).

24 2.24 Message routing between computers done by PVM daemon processes installed by PVM on computers that form the virtual machine. [Diagram: three workstations, each running a PVM daemon and an application program (executable); messages between the application programs are sent through the network via the daemons.] The MPI implementation we use is similar. Can have more than one process running on each computer.

25 2.25 MPI (Message Passing Interface) Message passing library standard developed by group of academics and industrial partners to foster more widespread use and portability. Defines routines, not implementation. Several free implementations exist.

26 2.26 MPI Process Creation and Execution Purposely not defined - Will depend upon implementation. Only static process creation supported in MPI version 1. All processes must be defined prior to execution and started together. Originally SPMD model of computation. MPMD also possible with static creation

27 2.27 Using SPMD Computational Model

main (int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    .
    .
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find process rank */
    if (myrank == 0)
        master();
    else
        slave();
    .
    .
    MPI_Finalize();
}

where master() and slave() are to be executed by master process and slave process, respectively.

28 2.28 Communicators Defines scope of a communication operation. Processes have ranks associated with communicator. Initially, all processes enrolled in a “universe” called MPI_COMM_WORLD, and each process is given a unique rank, a number from 0 to p - 1, with p processes. Other communicators can be established for groups of processes. Two types of communicator: –Intracommunicators for communication within a defined group –Intercommunicators for communication between defined groups

29 2.29 Reasoning for Communicators Provides a solution to unsafe message passing, –Message tags alone are not sufficient. Enables basic error checking of message passing code by allowing programmer to define communication domains. –Messages cannot be sent to destinations outside defined communication domain

30 2.30 Unsafe message passing - Example [Diagram: process 0 calls send(…,1,…) intended for the matching recv(…,0,…) in process 1, while a library routine lib() called by both processes also performs its own send/recv between them. (a) Intended behavior: the user's messages match each other and the library's messages match each other. (b) Possible behavior: a user's send is matched by the library's receive (or vice versa), so a message reaches the wrong destination.]

31 2.31 MPI Blocking Routines Return when “locally complete” - when location used to hold message can be used again or altered without affecting message being sent. Blocking send will send message and return - does not mean that message has been received, just that process free to move on without adversely affecting message.

32 2.32 Parameters of blocking send

MPI_Send(buf, count, datatype, dest, tag, comm)

buf - address of send buffer
count - number of items to send
datatype - datatype of each item
dest - rank of destination process
tag - message tag
comm - communicator

33 2.33 Parameters of blocking receive

MPI_Recv(buf, count, datatype, src, tag, comm, status)

buf - address of receive buffer
count - maximum number of items to receive
datatype - datatype of each item
src - rank of source process
tag - message tag
comm - communicator
status - status after operation

34 2.34 Example To send an integer x from process 0 to process 1:

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */
if (myrank == 0) {
    int x;
    MPI_Send(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);
} else if (myrank == 1) {
    int x;
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}
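
The status argument filled in by MPI_Recv() can be queried afterwards; a minimal sketch (field and routine names are standard MPI, the surrounding variables are those of the example above):

    /* Inspect the completed receive: actual source, tag, and item count. */
    int count;
    MPI_Get_count(&status, MPI_INT, &count);      /* number of MPI_INT items received */
    printf("source=%d tag=%d count=%d\n", status.MPI_SOURCE, status.MPI_TAG, count);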

35 2.35 MPI Nonblocking Routines Nonblocking send - MPI_Isend() - will return “immediately” even before source location is safe to be altered. Nonblocking receive - MPI_Irecv() - will return even if no message to accept.

36 2.36 Nonblocking Routine Formats

MPI_Isend(buf, count, datatype, dest, tag, comm, request)
MPI_Irecv(buf, count, datatype, source, tag, comm, request)

Completion detected by MPI_Wait() and MPI_Test(). MPI_Wait() waits until operation completed and returns then. MPI_Test() returns with flag set indicating whether operation completed at that time. Need to know whether particular operation completed: determined by accessing request parameter.

37 2.37 Example To send an integer x from process 0 to process 1 and allow process 0 to continue:

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */
if (myrank == 0) {
    int x;
    MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req1);
    compute();
    MPI_Wait(&req1, &status);
} else if (myrank == 1) {
    int x;
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}
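
As an alternative to blocking in MPI_Wait(), process 0 could poll with MPI_Test() and keep working; a sketch under the same assumptions as the slide above (do_other_work() is a hypothetical placeholder for useful computation):

    /* Poll for completion of the nonblocking send instead of waiting. */
    int flag = 0;
    MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req1);
    while (!flag) {
        do_other_work();                    /* hypothetical: overlap computation */
        MPI_Test(&req1, &flag, &status);    /* flag becomes nonzero when locally complete */
    }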

38 2.38 Send Communication Modes Standard Mode Send - Not assumed that corresponding receive routine has started. Amount of buffering not defined by MPI. If buffering provided, send could complete before receive reached. Buffered Mode - Send may start and return before a matching receive. Necessary to specify buffer space via routine MPI_Buffer_attach(). Synchronous Mode - Send and receive can start before each other but can only complete together. Ready Mode - Send can only start if matching receive already reached, otherwise error. Use with care.
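
A minimal sketch of buffered mode, assuming a single int is sent with tag msgtag to process 1 (buffer size and variable names are illustrative):

    /* Buffered-mode send: the programmer supplies the buffer space explicitly. */
    int x = 42;
    int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;      /* room for one small message */
    char *buffer = (char *)malloc(bufsize);
    MPI_Buffer_attach(buffer, bufsize);
    MPI_Bsend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);
    MPI_Buffer_detach(&buffer, &bufsize);                /* waits for buffered sends to complete */
    free(buffer);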

39 2.39 Parameters of synchronous send (same as blocking send)

MPI_Ssend(buf, count, datatype, dest, tag, comm)

buf - address of send buffer
count - number of items to send
datatype - datatype of each item
dest - rank of destination process
tag - message tag
comm - communicator
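
A hedged sketch pairing a synchronous-mode send with an ordinary receive (myrank, msgtag, and status are assumed set up as in the earlier examples):

    /* Synchronous mode: MPI_Ssend() does not complete until the matching
       receive has started, so the two processes synchronize at this exchange. */
    if (myrank == 0)
        MPI_Ssend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);
    else if (myrank == 1)
        MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);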

40 2.40 Collective Communication Involves set of processes, defined by an intra-communicator. Message tags not present. Principal collective operations:

MPI_Bcast() - Broadcast from root to all other processes
MPI_Gather() - Gather values for group of processes
MPI_Scatter() - Scatters buffer in parts to group of processes
MPI_Alltoall() - Sends data from all processes to all processes
MPI_Reduce() - Combine values on all processes to single value
MPI_Reduce_scatter() - Combine values and scatter results
MPI_Scan() - Compute prefix reductions of data on processes
MPI_Barrier() - A means of synchronizing processes by stopping each one until they all have reached a specific “barrier” call
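
Of the routines above, MPI_Alltoall() is not shown again later; a minimal sketch, assuming every process exchanges one int with every other process:

    /* All-to-all sketch: element j of sendbuf goes to process j;
       the element received from process j lands in recvbuf[j]. */
    int p, me, j;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    int *sendbuf = (int *)malloc(p * sizeof(int));
    int *recvbuf = (int *)malloc(p * sizeof(int));
    for (j = 0; j < p; j++) sendbuf[j] = me;
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);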

41 2.41 Barrier: Block process until all processes have called it. MPI_Barrier(comm), where comm is the communicator.

42 2.42 Broadcast message from root process to all processes in comm, including itself.

MPI_Bcast(*buf, count, datatype, root, comm)

Parameters:
*buf - message buffer (loaded)
count - number of entries in buffer
datatype - data type of buffer
root - rank of root
comm - communicator
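
A minimal usage sketch (every process, including the root, makes the same call; myrank as before):

    /* After the call, every process in MPI_COMM_WORLD holds the same 10 integers. */
    int data[10], i;
    if (myrank == 0)
        for (i = 0; i < 10; i++) data[i] = i;     /* root loads the buffer */
    MPI_Bcast(data, 10, MPI_INT, 0, MPI_COMM_WORLD);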

43 2.43 Gather values for group of processes

MPI_Gather(*sendbuf, sendcount, sendtype, *recvbuf, recvcount, recvtype, root, comm)

Parameters:
*sendbuf - send buffer
sendcount - number of send buffer elements
sendtype - data type of send elements
*recvbuf - receive buffer (loaded)
recvcount - number of elements for each receive
recvtype - data type of receive elements
root - rank of receiving process
comm - communicator

44 2.44 Example To gather items from group of processes into process 0, using dynamically allocated memory in root process:

int data[10];                                     /* data to be gathered from processes */
int *buf;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);           /* find rank */
if (myrank == 0) {
    MPI_Comm_size(MPI_COMM_WORLD, &grp_size);     /* find group size */
    buf = (int *)malloc(grp_size*10*sizeof(int)); /* allocate memory */
}
MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT, 0, MPI_COMM_WORLD);
/* recvcount is the number of items received from each process (10), not the total */

MPI_Gather() gathers from all processes, including root.

45 2.45 Scatter a buffer from root in parts to group of processes

MPI_Scatter(*sendbuf, sendcount, sendtype, *recvbuf, recvcount, recvtype, root, comm)

Parameters:
*sendbuf - send buffer
sendcount - number of elements sent (each process)
sendtype - data type of elements
*recvbuf - receive buffer (loaded)
recvcount - number of recv buffer elements
recvtype - type of recv elements
root - root process rank
comm - communicator
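
A minimal sketch scattering 10 elements to each process (buffer names are illustrative; myrank and grp_size found as in the gather example above):

    /* Root holds grp_size*10 ints; each process, root included, receives its block of 10. */
    int recvbuf[10];
    int *sendbuf = NULL;
    if (myrank == 0)
        sendbuf = (int *)malloc(grp_size * 10 * sizeof(int));  /* root fills this array */
    MPI_Scatter(sendbuf, 10, MPI_INT, recvbuf, 10, MPI_INT, 0, MPI_COMM_WORLD);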

46 2.46 Combine values on all processes to single value

MPI_Reduce(*sendbuf, *recvbuf, count, datatype, op, root, comm)

Parameters:
*sendbuf - send buffer address
*recvbuf - receive buffer address
count - number of send buffer elements
datatype - data type of send elements
op - reduce operation. Several operations, including:
    MPI_MAX - maximum
    MPI_MIN - minimum
    MPI_SUM - sum
    MPI_PROD - product
root - root process rank for result
comm - communicator
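
A minimal sketch summing one value from each process onto the root (myrank as before; myvalue is an illustrative local contribution):

    /* Each process contributes myvalue; only the root's 'total' is defined afterwards. */
    int myvalue = myrank;
    int total = 0;
    MPI_Reduce(&myvalue, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myrank == 0) printf("sum of ranks = %d\n", total);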

47 2.47 Hello World

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello MPI! Process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

48 2.48 Sample MPI program

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAXSIZE 1000

int main(int argc, char *argv[])
{
    int myid, numprocs;
    int data[MAXSIZE], i, x, low, high, myresult = 0, result;
    char fn[255];
    FILE *fp;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {                       /* Open input file and initialize data */
        strcpy(fn, getenv("HOME"));
        strcat(fn, "/MPI/rand_data.txt");
        if ((fp = fopen(fn, "r")) == NULL) {
            printf("Can't open the input file: %s\n\n", fn);
            exit(1);
        }
        for (i = 0; i < MAXSIZE; i++)
            fscanf(fp, "%d", &data[i]);
    }
    MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);   /* broadcast data */
    x = MAXSIZE/numprocs;                  /* Add my portion of data */
    low = myid * x;
    high = low + x;
    for (i = low; i < high; i++)
        myresult += data[i];
    printf("I got %d from %d\n", myresult, myid);
    /* Compute global sum */
    MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) printf("The sum is %d.\n", result);
    MPI_Finalize();
    return 0;
}

49 2.49 MPI Groups, Communicators Collective communication may need to be performed by subsets of the processes in the computation. MPI provides routines for: –defining new process groups from subsets of existing process groups like MPI_COMM_WORLD –creating a new communicator for a new process group –performing collective communication within that process group

50 Groups and communicators 2.50

51 2.51 Facts about groups and communicators
Group:
–ordered set of processes
–each process in group has a unique integer id called its rank within that group
–process can belong to more than one group; rank is always relative to a group
–groups are “opaque objects”: use only MPI-provided routines for manipulating groups
Communicators:
–all communication must specify a communicator
–from the programming viewpoint, groups and communicators are equivalent
–communicators are also “opaque objects”
Groups and communicators are dynamic objects and can be created and destroyed during the execution of the program.

52 2.52 Typical usage
1. Extract handle of global group from MPI_COMM_WORLD using MPI_Comm_group
2. Form new group as a subset of global group using MPI_Group_incl or MPI_Group_excl
3. Create new communicator for new group using MPI_Comm_create
4. Determine new rank in new communicator using MPI_Comm_rank
5. Conduct communications using any MPI message passing routine
6. When finished, free up new communicator and group (optional) using MPI_Comm_free and MPI_Group_free

53 2.53

main(int argc, char **argv)
{
    int me, count, count2;
    void *send_buf, *recv_buf, *send_buf2, *recv_buf2;
    MPI_Group MPI_GROUP_WORLD, grprem;
    MPI_Comm commslave;
    static int ranks[] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_group(MPI_COMM_WORLD, &MPI_GROUP_WORLD);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Group_excl(MPI_GROUP_WORLD, 1, ranks, &grprem);   /* exclude process 0 */
    MPI_Comm_create(MPI_COMM_WORLD, grprem, &commslave);
    if (me != 0) {
        /* compute on slave */
        MPI_Reduce(send_buf, recv_buf, count, MPI_INT, MPI_SUM, 1, commslave);
    }
    /* zero falls through immediately to this reduce, others do later... */
    MPI_Reduce(send_buf2, recv_buf2, count2, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Comm_free(&commslave);
    MPI_Group_free(&MPI_GROUP_WORLD);
    MPI_Group_free(&grprem);
    MPI_Finalize();
}

54 2.54

#include "mpi.h"
#include <stdio.h>
#define NPROCS 8

int main(int argc, char *argv[])
{
    int rank, new_rank, sendbuf, recvbuf, numtasks,
        ranks1[4] = {0,1,2,3}, ranks2[4] = {4,5,6,7};
    MPI_Group orig_group, new_group;
    MPI_Comm new_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    sendbuf = rank;

    /* Extract the original group handle */
    MPI_Comm_group(MPI_COMM_WORLD, &orig_group);

    /* Divide tasks into two distinct groups based upon rank */
    if (rank < NPROCS/2) {
        MPI_Group_incl(orig_group, NPROCS/2, ranks1, &new_group);
    } else {
        MPI_Group_incl(orig_group, NPROCS/2, ranks2, &new_group);
    }

    /* Create new communicator and then perform collective communications */
    MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_comm);
    MPI_Allreduce(&sendbuf, &recvbuf, 1, MPI_INT, MPI_SUM, new_comm);
    MPI_Group_rank(new_group, &new_rank);
    printf("rank= %d newrank= %d recvbuf= %d\n", rank, new_rank, recvbuf);
    MPI_Finalize();
    return 0;
}

Sample output:
rank= 7 newrank= 3 recvbuf= 22
rank= 0 newrank= 0 recvbuf= 6
rank= 1 newrank= 1 recvbuf= 6
rank= 2 newrank= 2 recvbuf= 6
rank= 6 newrank= 2 recvbuf= 22
rank= 3 newrank= 3 recvbuf= 6
rank= 4 newrank= 0 recvbuf= 22
rank= 5 newrank= 1 recvbuf= 22

55 2.55 Evaluating Programs Empirically - Measuring Execution Time To measure the execution time between point L1 and point L2 in the code, we might have a construction such as:

    .
L1: time(&t1);                        /* start timer */
    .
L2: time(&t2);                        /* stop timer */
    .
    elapsed_time = difftime(t2, t1);  /* time = t2 - t1 */
    printf("Elapsed time = %5.2f seconds", elapsed_time);

56 2.56 MPI provides the routine MPI_Wtime() for returning time (in seconds):

double start_time, end_time, exe_time;
start_time = MPI_Wtime();
    .
    .
end_time = MPI_Wtime();
exe_time = end_time - start_time;

57 2.57 Average/least/most execution time spent by individual process

int myrank, numprocs;
double mytime, maxtime, mintime, avgtime;   /* variables used for gathering timing statistics */

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Barrier(MPI_COMM_WORLD);                /* synchronize all processes */
mytime = MPI_Wtime();                       /* get time just before work section */
work();
mytime = MPI_Wtime() - mytime;              /* get time just after work section */
/* compute max, min, and average timing statistics */
MPI_Reduce(&mytime, &maxtime, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
MPI_Reduce(&mytime, &mintime, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
MPI_Reduce(&mytime, &avgtime, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (myrank == 0) {
    avgtime /= numprocs;
    printf("Min: %lf Max: %lf Avg: %lf\n", mintime, maxtime, avgtime);
}

58 2.58 Compiling/Executing MPI Programs Minor differences in the command lines required depending upon MPI implementation. For the assignments, we will use MPICH-2. Generally, a file needs to be present that lists all the computers to be used. MPI then uses those computers listed.

59 2.59 Set MPICH2 Path Add the following at the end of your ~/.bashrc file and source it with the command source ~/.bashrc (or log in again).

#----------------------------------------------------------------------
# MPICH2 setup
export PATH=/opt/MPICH2/bin:$PATH
export MANPATH=/opt/MPICH2/bin:$MANPATH
#----------------------------------------------------------------------

Some logging and visualization help: you can link with the libraries -llmpe -lmpe to enable logging and the MPE environment. Then run the program as usual and a log file will be produced. The log file can be visualized using the jumpshot program that comes bundled with MPICH2.

60 2.60 Defining the Computers to Use Generally, need to create a file containing the list of machines to be used. Sample machines file (or hostfile):

athena.cs.siu.edu
oscarnode1.cs.siu.edu
……….
oscarnode8.cs.siu.edu

In MPICH, if just using one computer, do not need this file.

61 2.61 MPICH Commands Two basic commands: mpicc, a script to compile MPI programs; mpiexec (or mpirun), the command to execute an MPI program.

62 2.62 Compiling/executing (SPMD) MPI program For MPICH, at a command line: To start MPI: nothing special. To compile MPI programs: for C, mpicc -o prog prog.c; for C++, mpiCC -o prog prog.cpp. To execute MPI program: mpiexec -n no_procs prog, where no_procs is a positive integer giving the number of processes.

63 2.63 Executing MPICH program on multiple computers Create a file called say “machines” containing the list of machines:

athena.cs.siu.edu
oscarnode1.cs.siu.edu
……….
oscarnode8.cs.siu.edu

Establish network environments:
mpdboot -n 9 -f machines
mpdtrace
mpdallexit

64 2.64 mpirun -machinefile machines -np 4 prog would run prog with four processes. Each process would execute on one of the machines in the list. MPI would cycle through the list of machines giving processes to machines. (Can also specify number of processes on a particular machine by adding that number after the machine name.) The “MPI standard” command mpiexec is now the replacement for mpirun, although mpirun still exists.

65 2.65 Reference Tutorial Materials http://www-unix.mcs.anl.gov/mpi/tutorial/index.html http://www.cs.utexas.edu/users/pingali/CS378/2008sp/lectureschedule.html

