1 Collective Operations Dr. Stephen Tse firstname.lastname@example.org 908-872-2108 Lesson 12
2 Collective Communication
A collective communication is
– a communication pattern that involves all the processes in a communicator;
– it usually involves more than two processes.
Different collective communication operations:
– Broadcast
– Gather and Scatter
– Allgather
– Alltoall
3 Consider the Following Arrangement
      ----> Data
  0 | A0 A1 A2 A3 A4 ...
  1 |
  2 |
  : |
  n |
    V
Processes
4 Broadcast
A broadcast is a collective communication in which a single process sends the same data to every process in the communicator.
      ----> Data
  | A0    =======>    A0
  |        bcast      A0
  |                   A0
  V
Processes
5 Matrix-Vector Product
If $A = (a_{ij})$ is an $m \times n$ matrix and $x = (x_0, x_1, \ldots, x_{n-1})^T$ is an n-dimensional vector, then the matrix-vector product is $y = Ax$.
[Figure: y = A x, with A, x, and y distributed among Process 0, Process 1, Process 2, and Process 3]
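Written out componentwise, the product defined above takes the following form. This is the standard identity, stated here for reference, with the same symbols as the slide:

```latex
y_i = \sum_{j=0}^{n-1} a_{ij}\, x_j , \qquad i = 0, 1, \ldots, m-1
```

Each $y_i$ is the dot product of row $i$ of $A$ with $x$, which is why a process that owns some rows of $A$ needs all of $x$.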
6 A Gather
A collective communication in which a root process receives data from every other process.
In order to form the dot product of each row of A with x, we need to gather all of x onto each process.
[Figure: Process 0 holds x_0, Process 1 holds x_1, Process 2 holds x_2, Process 3 holds x_3; the pieces are gathered into the full vector x]
7 A Scatter
A collective communication in which a fixed root process sends a distinct collection of data to every other process.
Scatter the rows of A across the processes.
[Figure: the rows of A are handed out one per process; Process 0 receives a_00 a_01 a_02 a_03, and Process 1, Process 2, and Process 3 receive the following rows]
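The two steps above (scatter the rows of A, gather all of x onto every process) can be sketched as a plain-Python simulation. This is not MPI code: each "process" is just a list index, and the helper names `scatter_rows`, `gather_to_all`, and `local_matvec` are hypothetical names chosen for illustration. The sketch assumes the number of processes divides the number of rows.

```python
def scatter_rows(A, p):
    """Split the rows of A into p equal consecutive blocks, one per process."""
    m = len(A)
    assert m % p == 0, "this sketch assumes p divides the number of rows"
    k = m // p
    return [A[r * k:(r + 1) * k] for r in range(p)]


def gather_to_all(x_parts):
    """Concatenate the pieces of x in rank order and give every process a copy."""
    full = [v for part in x_parts for v in part]
    return [list(full) for _ in x_parts]


def local_matvec(rows, x):
    """Dot product of each locally held row with the full vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x_parts = [[1], [0], [2], [1]]   # process r initially holds only x_r
p = 4

local_rows = scatter_rows(A, p)          # process r gets row r of A
x_on_each = gather_to_all(x_parts)       # every process gets all of x
y_parts = [local_matvec(local_rows[r], x_on_each[r]) for r in range(p)]
print(y_parts)
```

Each entry of `y_parts` is the piece of y computed by one process, so y itself stays distributed in the same way as the rows of A.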
8 Gather and Scatter
      ----> Data
  | A0 A1 A2 A3 A4                 A0
  |                   ====>        A1
  |                  Scatter       A2
  |                   <====        A3
  |                  Gather        A4
  V
Processes
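The diagram shows that scatter and gather undo each other. A minimal pure-Python sketch of that round trip follows; it is a simulation, not an MPI call, and `scatter` and `gather` are hypothetical helper names. It assumes the data length is a multiple of the number of processes.

```python
def scatter(send_data, p):
    """Root cuts send_data into p equal parts; part r goes to process r."""
    assert len(send_data) % p == 0, "sketch assumes equal-sized parts"
    k = len(send_data) // p
    return [send_data[r * k:(r + 1) * k] for r in range(p)]


def gather(local_parts):
    """Root concatenates the parts it receives, in rank order."""
    return [v for part in local_parts for v in part]


root_data = ["A0", "A1", "A2", "A3", "A4"]
parts = scatter(root_data, 5)      # one item lands on each of 5 processes
restored = gather(parts)           # root gets the original array back
print(parts)
print(restored)
```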
9 Allgather
      ----> Data
  | A0        A0 B0 C0 D0 E0
  | B0        A0 B0 C0 D0 E0
  | C0        A0 B0 C0 D0 E0
  | D0        A0 B0 C0 D0 E0
  | E0        A0 B0 C0 D0 E0
  V
Processes
Simultaneously gather all of x onto each process, i.e. gather a distributed array to every process.
It gathers the contents of each process's send_data into each process's recv_data.
After the function returns, all the processes in the communicator will have the gathered array stored in the memory referenced by recv_data.
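The allgather pattern in the diagram can be simulated in a few lines of plain Python. This is an illustrative sketch rather than an MPI call: `allgather` is a hypothetical helper, and `send_data[r]` stands for the block held by process r.

```python
def allgather(send_data):
    """Return recv_data, where recv_data[r] is what process r holds afterwards.

    Every process receives the blocks of all processes, concatenated in
    rank order.
    """
    gathered = [v for block in send_data for v in block]
    return [list(gathered) for _ in send_data]


send_data = [["A0"], ["B0"], ["C0"], ["D0"], ["E0"]]
recv_data = allgather(send_data)
for rank, row in enumerate(recv_data):
    print(rank, row)
```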
10 Alltoall (transpose)
      ----> Data
  | A0 A1 A2 A3 A4        A0 B0 C0 D0 E0
  | B0 B1 B2 B3 B4        A1 B1 C1 D1 E1
  | C0 C1 C2 C3 C4        A2 B2 C2 D2 E2
  | D0 D1 D2 D3 D4        A3 B3 C3 D3 E3
  | E0 E1 E2 E3 E4        A4 B4 C4 D4 E4
  V
Processes
The heart of the redistribution of the keys is each process's sending of its original local keys to the appropriate process.
This is a collective communication operation in which each process sends a distinct collection of data to every other process.
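Seen as a table with one row per process, alltoall is a transpose. The following pure-Python sketch reproduces the diagram; it is a simulation, and `alltoall` is a hypothetical helper in which `send[i][j]` is the block that process i sends to process j.

```python
def alltoall(send):
    """recv[j][i] is the block that process i sent to process j."""
    p = len(send)
    return [[send[i][j] for i in range(p)] for j in range(p)]


# Process A holds A0..A4, process B holds B0..B4, and so on.
send = [[f"{name}{k}" for k in range(5)] for name in "ABCDE"]
recv = alltoall(send)
for row in recv:
    print(" ".join(row))
```

Applying `alltoall` twice returns the original layout, as expected of a transpose.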
11 Tree-Structured Communication
To improve the code, we should focus on the distribution of the input data. How can we divide the work more evenly among the processes?
Think of the processes as forming a tree, with process 0 as the root:
– During the 1st stage of the data distribution: 0 sends the data to 1.
– During the 2nd stage: 0 sends the data to 2, while 1 sends the data to 3.
– During the 3rd stage: 0 sends to 4, while 1 sends to 5, 2 sends to 6, and 3 sends to 7.
So we reduce the input distribution loop from 7 stages to 3 stages.
In general, if we have p processes, this procedure allows us to distribute the input data in ⌈log2(p)⌉ stages, i.e. the smallest whole number greater than or equal to log2(p), which is also called the ceiling of that number.
(See the processes configuration tree)
12 Data Distribution Stages
[Figure: process tree rooted at 0. Stage 1: 0 -> 1. Stage 2: 0 -> 2, 1 -> 3. Stage 3: 0 -> 4, 1 -> 5, 2 -> 6, 3 -> 7]
1. This distribution reduces the original p-1 stages to ⌈log2(p)⌉ stages.
2. If p=7, it reduces the number of stages needed to complete the data distribution from 6 to 3, a reduction of 50%.
3. There is no canonical choice of ordering.
4. We have to know the topology of the system in order to choose a better scheme.
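The staged distribution described above can be generated programmatically. The sketch below only computes the send schedule and performs no communication; `tree_stages` is a hypothetical helper. It uses the ordering from the slides (in each stage, every process s that already has the data sends to s + the number of current holders), which is one of several possible orderings.

```python
def tree_stages(p):
    """Return the stages of a tree-structured distribution among p processes.

    Each stage is a list of (sender, receiver) pairs that can run at the
    same time. Processes 0..have-1 hold the data before each stage.
    """
    stages = []
    have = 1  # only process 0 holds the data initially
    while have < p:
        stage = [(s, s + have) for s in range(have) if s + have < p]
        stages.append(stage)
        have *= 2
    return stages


for number, stage in enumerate(tree_stages(8), start=1):
    print("stage", number, stage)
print("stages for p=7:", len(tree_stages(7)))
```

The number of stages returned equals the ceiling of log2(p), since the set of processes holding the data doubles in every stage.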
13 Reduce the Burden of the Final Sum
In the final summation phase, process 0 always gets a disproportionate amount of work, i.e. computing the global sum of the results from all the other processes.
To accelerate the final phase, we can use the tree concept in reverse to reduce the load on process 0. Distribute the work as:
– Stage 1:
  1. 4 sends to 0; 5 sends to 1; 6 sends to 2; 7 sends to 3.
  2. 0 adds the integral from 4 to its own; 1 adds the integral from 5; 2 adds the integral from 6; 3 adds the integral from 7.
– Stage 2:
  1. 2 sends to 0; 3 sends to 1.
  2. 0 adds the integral from 2; 1 adds the integral from 3.
– Stage 3:
  1. 1 sends to 0.
  2. 0 adds the integral from 1.
(See the reverse tree processes configuration)
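The three stages above can be reproduced with a short simulation. This is a sketch under stated assumptions: `tree_sum` is a hypothetical helper, `values[r]` stands for the local integral on process r, and the number of processes is assumed to be a power of two, as in the 8-process example.

```python
def tree_sum(values):
    """Sum values onto process 0 using the reverse tree.

    Returns (total on process 0, list of stages), where each stage is a list
    of (sender, receiver) pairs.
    """
    vals = list(values)
    p = len(vals)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    stages = []
    half = p // 2
    while half >= 1:
        # Upper half sends to the lower half; receivers add to their own value.
        stage = [(r + half, r) for r in range(half)]
        for sender, receiver in stage:
            vals[receiver] += vals[sender]
        stages.append(stage)
        half //= 2
    return vals[0], stages


total, stages = tree_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(total)
for number, stage in enumerate(stages, start=1):
    print("stage", number, stage)
```

Process 0 performs only 3 additions here instead of the 7 it would perform if every process sent its result directly to it.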
15 Reduction Operations
The "global sum" calculation is an example of a general class of collective communication operations called reduction operations.
In a global reduction operation, all the processes in a communicator contribute data, and all those data are combined using a binary operation.
Typical operations are addition, max, min, logical and, etc.
16 Simple Reduce
      ----> Data
  | A0 A1 A2        A0+B0+C0  A1+B1+C1  A2+B2+C2
  | B0 B1 B2
  | C0 C1 C2
  V
Process
17 Allreduce
In the simple reduce function, only process 0 obtains the global sum; all the other processes return 0.
If we want to use the result for subsequent calculations, we would like each process to return the same correct result.
The obvious approach is to follow a call to MPI_Reduce with a call to MPI_Bcast. MPI_Allreduce combines the two steps in a single call.
18 Every Process Has the Same Result
      ----> Data            Results
  | A0 A1 A2        A0+B0+C0  A1+B1+C1  A2+B2+C2
  | B0 B1 B2        A0+B0+C0  A1+B1+C1  A2+B2+C2
  | C0 C1 C2        A0+B0+C0  A1+B1+C1  A2+B2+C2
  V
Process
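The "reduce, then broadcast" approach can be illustrated with a pure-Python simulation of the diagram. None of this is the MPI API: `reduce_to_root`, `bcast`, and `allreduce` are hypothetical helpers, and `send[r]` stands for the array contributed by process r.

```python
from functools import reduce as fold
import operator


def reduce_to_root(send, op=operator.add):
    """Combine the arrays elementwise; only the root holds this result."""
    return [fold(op, column) for column in zip(*send)]


def bcast(value, p):
    """Give each of the p processes its own copy of the root's value."""
    return [list(value) for _ in range(p)]


def allreduce(send, op=operator.add):
    """Reduce onto the root, then broadcast the result to every process."""
    return bcast(reduce_to_root(send, op), len(send))


send = [[1, 2, 3],        # A0 A1 A2
        [10, 20, 30],     # B0 B1 B2
        [100, 200, 300]]  # C0 C1 C2
result = allreduce(send)
print(result)
print(allreduce(send, op=max))
```

Passing a different binary operation, such as `max`, gives the other reductions listed on the Reduction Operations slide.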
19 Implementation in MPI - MPI_Gather
1. MPI_Gather( sendbuffer, sendcount, sendtype, recvbuffer, recvcount, recvtype, root rank, comm )
Remarks:
1. All processes in "comm", including the root, send "sendbuffer" to the root's "recvbuffer".
2. The root collects these "sendbuffer" contents and puts them in rank order in "recvbuffer".
3. "recvbuffer" is ignored in all processes except the root.
4. Its inverse operation is MPI_Scatter().
20 Implementation in MPI - MPI_Scatter
2. MPI_Scatter( sendbuffer, sendcount, sendtype, recvbuffer, recvcount, recvtype, root rank, comm )
Remarks:
1. The root sends the contents of "sendbuffer" to all processes, including the root itself.
2. The parts are assigned in rank order: the "i-th" part goes to process "i", which stores it in its "recvbuffer".
3. The root cuts its message into "n" equal parts and then sends them to the "n" processes.
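21 Implementation in MPI - MPI_Gatherv
3. MPI_Gatherv( sendbuffer, sendcount, sendtype, recvbuffer, recvcounts /* integer array of receive counts */, displacement /* integer array for displacement */, recvtype, root rank, comm )
Remarks:
1. This is a more general and more flexible function than MPI_Gather.
2. It allows a varying count of data from each process.
3. The variation is described by two integer arrays with one entry per process: "recvcounts" gives the number of elements received from each process, and "displacement" gives the position in "recvbuffer" where each process's block is placed.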
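How the count and displacement arrays place the incoming blocks can be shown with a small simulation. This is a plain-Python sketch of the placement rule only, not the MPI call itself; `gatherv` is a hypothetical helper, and the `recv_size` and `fill` parameters are assumptions added so that unused slots in the receive buffer are visible.

```python
def gatherv(send_blocks, recvcounts, displs, recv_size, fill=None):
    """Build the root's receive buffer from variable-sized blocks.

    send_blocks[r] is the data sent by process r, recvcounts[r] is its
    length, and displs[r] is the index in the receive buffer where it starts.
    """
    recvbuf = [fill] * recv_size
    for r, block in enumerate(send_blocks):
        assert len(block) == recvcounts[r], "count must match the block size"
        recvbuf[displs[r]:displs[r] + recvcounts[r]] = block
    return recvbuf


send_blocks = [[1], [2, 2], [3, 3, 3]]   # process r sends r+1 elements
recvcounts = [1, 2, 3]

packed = gatherv(send_blocks, recvcounts, displs=[0, 1, 3], recv_size=6)
spaced = gatherv(send_blocks, recvcounts, displs=[0, 2, 5], recv_size=8)
print(packed)
print(spaced)
```

The second call shows that the displacements need not pack the blocks tightly: gaps between blocks are simply left untouched.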
22 Implementation in MPI - MPI_Allgather
4. MPI_Allgather( sendbuffer, sendcount, sendtype, recvbuffer, recvcount, recvtype, comm )
Remarks:
1. This operation is similar to an all-to-all operation.
2. Instead of specifying a "root", every process sends its data to all other processes.
3. The block of data sent from the "j-th" process is received by every process and is placed in the "j-th" block of the buffer "recvbuf".
23 Implementation in MPI - MPI_Allgatherv
5. MPI_Allgatherv( sendbuffer, sendcount, sendtype, recvbuffer, recvcounts, displacement, recvtype, comm )
Remarks:
(1) This is an operation similar to an all-to-all operation.
(2) Instead of specifying a "root", every process sends its data to all other processes.
(3) The block of data sent from the "j-th" process is received by every process and is placed in the "j-th" block of the buffer "recvbuf".
(4) But the blocks from different processes need not be the same size.
24 Implementation in MPI - MPI_Alltoall
6. MPI_Alltoall( sendbuffer, sendcount, sendtype, recvbuffer, recvcount, recvtype, comm )
Remarks:
(1) This is an all-to-all operation.
(2) The "j-th" block sent from process "i" is placed in process "j"'s "i-th" location of the "recv" buffer.
25 Implementation in MPI - MPI_Alltoallv
7. MPI_Alltoallv( sendbuffer, sendcounts, s-displacement, sendtype, recvbuffer, recvcounts, r-displacement, recvtype, comm )
Remarks:
(1) This is an all-to-all operation.
(2) The "j-th" block sent from process "i" is placed in process "j"'s "i-th" location of the "recv" buffer.
(3) The counts and displacements are integer arrays with one entry per process, so the blocks need not be the same size.
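The variable-sized all-to-all exchange can be sketched in plain Python as follows. This is a simulation, not the MPI call: `alltoallv` is a hypothetical helper, `sendbufs[i]` is the flat send buffer of process i, and `sendcounts[i][j]` and `sdispls[i][j]` describe the block that process i sends to process j. As a simplifying assumption, each receiver stores the incoming blocks back to back in sender order, so no receive displacements are needed.

```python
def alltoallv(sendbufs, sendcounts, sdispls):
    """Return recvbufs, where recvbufs[j] holds the blocks sent to process j.

    The block from process i occupies the i-th position in process j's
    receive buffer (blocks are packed contiguously in this sketch).
    """
    p = len(sendbufs)
    recvbufs = []
    for j in range(p):
        buf = []
        for i in range(p):
            start = sdispls[i][j]
            buf.extend(sendbufs[i][start:start + sendcounts[i][j]])
        recvbufs.append(buf)
    return recvbufs


# Item names encode sender and receiver: "01a" goes from process 0 to process 1.
sendbufs = [["00", "01a", "01b"],
            ["10a", "10b", "12"],
            ["21", "22a", "22b"]]
sendcounts = [[1, 2, 0],
              [2, 0, 1],
              [0, 1, 2]]
sdispls = [[0, 1, 3],
           [0, 2, 2],
           [0, 0, 1]]

recvbufs = alltoallv(sendbufs, sendcounts, sdispls)
for rank, buf in enumerate(recvbufs):
    print(rank, buf)
```

A count of zero means that nothing is sent between that pair of processes, which is how this pattern redistributes uneven amounts of data such as the keys mentioned on the Alltoall slide.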