
1 Parallel Programming – Process-Based Communication Operations David Monismith CS599 Based upon notes from Introduction to Parallel Programming, Second Edition by Grama, Gupta, Karypis, and Kumar and from CS550

2 Last time We reviewed the Scan pattern. We will continue with OpenMP scheduling operations later in the course. For now, we are going to move on to MPI so we can make use of multi-process programming.

3 Interprocess Communication Communication between processes is often necessary. Communication may occur sporadically from one process to another. It may also occur in well-defined patterns, some of which are used collectively (by all processes). Collective patterns are frequently used in parallel algorithms.

4 Send and Receive (Abstract operations) Point-to-point (i.e., process-to-process) communication occurs as send and receive operations. send – send data from this process to the process identified by rank. – Example: send(myMessage, rank) receive – receive data in this process from the process identified by rank. – Example: receive(receivedMessage, rank)

5 MPI Message Passing Send and receive are implemented concretely in MPI by the MPI_Send and MPI_Recv functions. MPI, the Message Passing Interface, allows for interprocess communication (IPC) between running processes, even those running the same source code.

6 Using MPI Processes use MPI by including the header with #include "mpi.h" or #include <mpi.h>, depending upon the system and MPI stack. MPI is started in a program using: MPI_Init(&argc, &argv); and ended with: MPI_Finalize(); These functions act almost like curly braces that start and end the parallel portion of the program.
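
Below is a minimal sketch of the MPI Hello World referenced on the following slides, assuming the standard MPI C API; the file and executable names (e.g., prog1.exe) are placeholders.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);                /* start of the parallel portion */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's identifier     */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes     */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                        /* end of the parallel portion   */
    return 0;
}

Compile it with an MPI wrapper compiler such as mpicc, then launch it with mpirun or ibrun as shown on the next slides.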

7 Using MPI on LittleFe Anything between the MPI_Init and MPI_Finalize calls runs in as many processes as are requested by "mpirun" at the command line. For example, on LittleFe: mpirun -np 12 -machinefile machines-openmpi prog1.exe runs 12 processes using the executable code from prog1.exe.

8 Try running MPI Hello World on LittleFe1 or LittleFe2

9 Using MPI on Stampede On Stampede, one specifies the number of tasks in a batch script using the -n option. Example: #SBATCH -n 32 specifies 32 tasks (MPI processes, one per CPU core). After all options have been specified, an MPI program is started in the script using ibrun. Example: ibrun prog1.exe
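
For reference, a minimal sketch of such a batch script is shown below; only the task count and the ibrun line come from this slide, and any site-specific options (queue, time limit, job name) are intentionally omitted.

#!/bin/bash
# Request 32 tasks (MPI processes, one per CPU core)
#SBATCH -n 32
# Launch the MPI executable across the allocated tasks
ibrun prog1.exe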

10 Identifying Processes in MPI The MPI_Comm_rank and MPI_Comm_size functions get the rank (process identifier) and number of processes (the value 12 after -np, and the value 32 on the previous slides). These were previously reviewed in class.

11 MPI Message Passing Messages are passed in MPI using MPI_Send and MPI_Recv. MPI_Send - sends a message of a given size and type to the process with a specific rank. MPI_Recv - receives a message of up to a maximum size and a given type from the process with a specific rank. MPI_COMM_WORLD - the "world" in which the processes exist; this predefined constant names the communicator containing all of the program's processes.

12 Sending and Receiving Messages MPI_Send and MPI_Recv have the following parameters:
MPI_Send(pointer to message, message size, message type, rank of process to send to, message tag or id, MPI_COMM_WORLD)
MPI_Recv(pointer to variable used to receive, maximum receive size, message type, rank of process to receive from, message tag or id, MPI_COMM_WORLD, MPI_STATUS_IGNORE)
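
As a concrete illustration, here is a sketch of a matched pair of calls, assuming the job runs with at least two processes; the variable names and the tag value 0 are arbitrary choices.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* buffer, count, type, destination rank, tag, communicator */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* buffer, max count, type, source rank, tag, communicator, status */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}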

13 MPI Types MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE. Many other types exist. These types are analogous to C primitive types. See the MPI Reference Manual for more examples.

14 Blocking I/O MPI_Send and MPI_Recv are blocking I/O operations. With blocking I/O, when a message is sent, the sending process waits until it has acknowledgement that the message has been received before it continues processing. Similarly, when a message is requested (a receive function is called), the program waits until the message has been received before continuing.

15 Blocking I/O Example

  Process 1                          Process 2
+--------------+    1. send msg    +--------------+
| MPI_Send     | ----------------> | MPI_Recv     |
| wait for ack |                   | wait for msg |
| ack received | <---------------- | ack receipt  |
| 3b. continue |    2. send ack    | 3a. continue |
+--------------+                   +--------------+

16 Before we continue… Try #1 from worksheet 6, and DON'T PANIC!!! Most functional MPI programs can be implemented with only 6 functions:
– MPI_Init
– MPI_Finalize
– MPI_Send
– MPI_Recv
– MPI_Comm_rank
– MPI_Comm_size

17 Why are Send and Receive Important? MPI is not the only framework in which send and receive operations are used. Send and receive exist in Java, Android Services, iOS, Web Services (e.g., GET and POST), etc. It is likely that you have used these operations before and that you will use them again.

18 Collective Message Patterns We will investigate commonly used collective message communication patterns. Collective means that the functions representing these patterns must be called in ALL processes. These include:
– Broadcast
– Reduction
– All-to-all
– Scatter
– Gather
– Scan
– And more
Communication patterns on simple interconnect networks will also be covered for linear arrays, meshes, and hypercubes.

19 One to All Broadcast Send identical data from one process to all other processes, or a subset thereof. Initially, only the root process has the data (size m). After the operation completes, there are p copies of the data, where p is the number of processes to which the data was broadcast. Implemented by MPI_Bcast.
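
A sketch of a one-to-all broadcast using MPI_Bcast follows; the array contents and length are arbitrary, and rank 0 is assumed to be the root.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, data[4] = {0, 0, 0, 0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                  /* initially only the root has the data */
        data[0] = 1; data[1] = 2; data[2] = 3; data[3] = 4;
    }

    /* buffer, count, type, root rank, communicator */
    MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Process %d now has data[0] = %d\n", rank, data[0]);
    MPI_Finalize();
    return 0;
}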

20 All-to-One Reduction Each of the p processes starts with a buffer B of size m. Data from all processes is combined using an associative operator such as +, *, min, max, etc. The data is accumulated at a single process into one buffer B_reduce of size m. Element i of B_reduce is the sum, product, minimum, maximum, etc., of the ith elements of all of the original buffers B. This reduction is implemented by MPI_Reduce.
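
A sketch of an all-to-one reduction with MPI_Reduce follows; the buffer names B and B_reduce mirror the slide, the buffer contents are arbitrary, and rank 0 is assumed to be the destination.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, i, B[3], B_reduce[3];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 3; i++)
        B[i] = rank + i;              /* each process fills its own buffer B */

    /* send buffer, recv buffer, count, type, operator, root, communicator */
    MPI_Reduce(B, B_reduce, 3, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)                    /* only the root holds the reduced data */
        printf("Element 0 summed across all processes: %d\n", B_reduce[0]);

    MPI_Finalize();
    return 0;
}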

21 Broadcasting On a ring or linear array, the naïve way to broadcast is to send p – 1 messages from the source to the other p – 1 processes. After the first message is sent, recursive doubling can be used to spread the message: the message can be sent from both the original source and the first destination to two additional processes. Repeating this doubling reduces the number of steps required to broadcast to log(p). Note that on a linear array, the initial message must be sent to the farthest node; thereafter the distances are halved.

22 Mesh Communication on a mesh can be regarded as an extension of the linear array. A 2-D mesh of p nodes consists of sqrt(p) linear arrays of sqrt(p) nodes each. Therefore, the message can first be sent from the root to the other sqrt(p) - 1 nodes in its own linear array (its row). From there, messages may be sent in parallel along the remaining sqrt(p) - 1 nodes of each column's linear array. A similar process can be carried out on a hypercube of size 2^d, since it can be modeled as a d-dimensional mesh with 2 nodes per dimension. Therefore, on a hypercube, a broadcast may be carried out in d steps.

23 Hypercube Broadcast Algorithm

one_to_all_bc(d, my_id, X)
    mask = 2^d - 1                       // Set all d bits of mask to 1
    for i = d - 1 downto 0               // Loop over dimensions, highest first
        mask = mask XOR 2^i              // Set bit i of mask to 0
        if (my_id AND mask) == 0         // If the lower i bits of my_id are 0
            if (my_id AND 2^i) == 0
                dest = my_id XOR 2^i
                send X to dest
            else
                source = my_id XOR 2^i
                recv X from source
            endif
        endif
    endfor

24 All-to-All Broadcast and Reduction In an all-to-all broadcast, every one of the p processes simultaneously initiates a broadcast. Each process sends the same message of size m to every other process, but different processes may broadcast different messages. This is useful in matrix multiplication and matrix-vector multiplication. Naïve implementations may take p times as long as the one-to-all broadcast. It is possible to implement the all-to-all algorithm in such a manner as to take advantage of the interconnect network, so that all messages traversing the same path at the same time are concatenated. The dual operation of such a broadcast is an all-to-all reduction, in which every node is the destination of an all-to-one reduction. These operations are implemented via the MPI_Allgather (all-to-all broadcast) and MPI_Reduce_scatter (all-to-all reduction) operations.
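
A sketch of an all-to-all broadcast using MPI_Allgather is shown below; each process contributes a single int, and the contribution values are arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size, mine, *all;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    mine = rank * 10;                        /* this process's contribution */
    all = (int *) malloc(size * sizeof(int));

    /* send buffer, send count, send type,
       recv buffer, recv count per process, recv type, communicator */
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    printf("Process %d sees the contribution of process %d: %d\n",
           rank, size - 1, all[size - 1]);

    free(all);
    MPI_Finalize();
    return 0;
}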

25 Ring All-to-All Broadcast Consider a ring topology. All links can be kept busy until the all-to-all broadcast is complete. An algorithm for such a broadcast follows below.

all_to_all_ring_bc(myId, myMsg, p, result)
    left = (myId - 1) % p
    right = (myId + 1) % p
    result = myMsg
    msg = result
    for i = 1 to p - 1
        send msg to right
        recv msg from left
        result = concat(result, msg)
    endfor

26 Ring All-to-All Reduce Algorithm

all_to_all_ring_reduce(myId, myMsg, p, result)
    left = (myId - 1) % p
    right = (myId + 1) % p
    recvVal = 0
    for i = 1 to p - 1
        j = (myId + i) % p            // index of the partial result forwarded this step
        temp = myMsg[j] + recvVal
        send temp to left
        recv recvVal from right
    endfor
    result = myMsg[myId] + recvVal

27 Mesh and Hypercube Implementations Mesh implementations can be constructed by expanding upon the linear array and ring algorithms, carrying them out in two phases (along one dimension, then the other). The hypercube algorithm is a generalization of the mesh algorithm to log(p) dimensions. It is important to realize that such implementations exist to take advantage of the interconnect networks on large-scale systems.

28 Scatter and Gather Scatter and gather are personalized operations. Scatter – a single node sends a unique message of size m to every other node (one-to-many personalized communication). Gather – a single node collects a unique message from each node. Implemented using MPI_Scatter and MPI_Gather, respectively.
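
A sketch combining MPI_Scatter and MPI_Gather follows; the root rank (0), the buffer contents, and the doubling step are arbitrary choices used only for illustration.

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size, i, mine, *sendBuf = NULL, *recvBuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                  /* only the root needs the full buffers */
        sendBuf = (int *) malloc(size * sizeof(int));
        recvBuf = (int *) malloc(size * sizeof(int));
        for (i = 0; i < size; i++)
            sendBuf[i] = i;
    }

    /* the root sends one unique int to each process ...                     */
    MPI_Scatter(sendBuf, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
    mine *= 2;                        /* ... each process works on its piece  */
    /* ... and the root collects one int back from each process              */
    MPI_Gather(&mine, 1, MPI_INT, recvBuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Gathered value from process %d: %d\n", size - 1, recvBuf[size - 1]);
        free(sendBuf);
        free(recvBuf);
    }

    MPI_Finalize();
    return 0;
}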

29 MPI Operations
One to All – MPI_Bcast
All-to-one – MPI_Reduce
All-to-all Broadcast – MPI_Allgather
All-to-all Reduction – MPI_Reduce_scatter
All-reduce – MPI_Allreduce
Gather – MPI_Gather, MPI_Gatherv
Scatter – MPI_Scatter, MPI_Scatterv
All-to-all personalized – MPI_Alltoall
Scan – MPI_Scan

30 Next Time: All-to-All Personalized Communication Also called total exchange. Used in FFT, matrix transpose, sample sort, and parallel database join operations. Different algorithms exist for:
– Linear Array
– Mesh
– Hypercube

