12e.1 More on Parallel Computing UNC-Wilmington, C. Ferner, 2007 Mar 21, 2007.


1 12e.1 More on Parallel Computing UNC-Wilmington, C. Ferner, 2007 Mar 21, 2007

2 12e.2 Block Mapping (Review)
    blksz = (int)ceil((float)N / P);
    for (i = lb + my_rank * blksz; i < min(N, lb + (my_rank + 1) * blksz); i++) {
        ...
    }
(lb is the lower bound of the original loop)

3 12e.3 Example
    for (i = 1; i < N; i++) {
        for (j = 0; j < N; j++) {
            a[i][j] += f(a[i-1][j]);
        }
    }

4 12e.4 Example
[Figure: grid of (i,j) index pairs, with j increasing across and i increasing down]
    0,0      0,1      0,2      0,3      ...  0,N-1
    1,0      1,1      1,2      1,3      ...  1,N-1
    2,0      2,1      2,2      2,3      ...  2,N-1
    ...
    N-1,0    N-1,1    N-1,2    N-1,3    ...  N-1,N-1

5 12e.5 Example If we mapped iterations of the i loop to processors, the dependencies would cross processor boundaries. Therefore, interprocessor communication would be required.

6 12e.6 Example
[Figure: the same grid, with each row (i value) assigned to one processor]
    PE 0 :  0,0      0,1      0,2      0,3      ...  0,N-1
    PE 1 :  1,0      1,1      1,2      1,3      ...  1,N-1
    PE 2 :  2,0      2,1      2,2      2,3      ...  2,N-1
    ...
    PE P :  N-1,0    N-1,1    N-1,2    N-1,3    ...  N-1,N-1

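7 12e.7 Example A better solution would be to map iterations of the j loop to processors. Element a[i][j] depends only on a[i-1][j], which is in the same column, so no dependence crosses a processor boundary.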

8 12e.8 Example
[Figure: the same grid of (i,j) index pairs, with the columns (j values) divided among PE 0, PE 1, PE 2, and PE 3]

9 12e.9 Example
    for (i = 1; i < N; i++) {
        for (j = my_rank * blksz; j < min(N, (my_rank + 1) * blksz); j++) {
            a[i][j] += f(a[i-1][j]);
        }
    }
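The column mapping can be checked without a parallel machine. The sketch below runs each rank's share of the j loop one after another and compares the result with the original sequential loop nest. The array size, the body of f, and the function names are all hypothetical choices made for illustration, and the integer expression (N + P - 1) / P is used as an equivalent of ceil((float)N / P).

```c
#include <assert.h>

enum { N = 8 };  /* hypothetical problem size for this sketch */

/* Hypothetical stand-in for the slides' f(); any pure function works. */
static double f(double x) { return 0.5 * x + 1.0; }

static int min_int(int a, int b) { return a < b ? a : b; }

/* Reference: the original sequential loop nest. */
static void update_sequential(double a[N][N]) {
    for (int i = 1; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] += f(a[i - 1][j]);
}

/* The work of one rank: every row, but only its own block of columns. */
static void update_columns_for_rank(double a[N][N], int my_rank, int P) {
    assert(P > 0 && my_rank >= 0 && my_rank < P);
    int blksz = (N + P - 1) / P;  /* integer form of ceil(N / P) */
    for (int i = 1; i < N; i++)
        for (int j = my_rank * blksz; j < min_int(N, (my_rank + 1) * blksz); j++)
            a[i][j] += f(a[i - 1][j]);
}

/* Returns 1 if running all P ranks (in reverse order) matches the
   sequential result, 0 otherwise. */
int column_mapping_matches_sequential(int P) {
    double seq[N][N], par[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            seq[i][j] = par[i][j] = (double)(i * N + j);

    update_sequential(seq);
    for (int rank = P - 1; rank >= 0; rank--)
        update_columns_for_rank(par, rank, P);

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (seq[i][j] != par[i][j])
                return 0;
    return 1;
}
```

The ranks are run in reverse order on purpose: because each column is independent of the others, the order in which the ranks execute does not change the result.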

10 12e.10 Block Mapping (Review)
    blksz = (int)ceil((float)N / P);
    for (i = lb + my_rank * blksz; i < min(N, lb + (my_rank + 1) * blksz); i++) {
        ...
    }
(lb is the lower bound of the original loop)

11 12e.11 Block Mapping

12 12e.12 Block Mapping The problem is that block mapping can lead to a load imbalance. Example: let N=26, P=6, and lb=0. Then blksz = ceiling(26/6) = 5.

13 12e.13 Block Mapping Processors 0-4 have 5 iterations of work. Processor 5 has 1 iteration.
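The per-processor iteration counts under block mapping can be computed directly from the loop bounds. The following is a minimal sketch assuming lb = 0; the function names block_count and block_imbalance are hypothetical, and (N + P - 1) / P is used as an integer equivalent of ceil((float)N / P).

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Iterations of a loop over [0, N) that block mapping assigns to my_rank. */
int block_count(int N, int P, int my_rank) {
    assert(N >= 0 && P > 0 && my_rank >= 0 && my_rank < P);
    int blksz = (N + P - 1) / P;                  /* integer form of ceil(N / P) */
    int begin = min_int(N, my_rank * blksz);      /* clamp so trailing ranks get 0 */
    int end   = min_int(N, (my_rank + 1) * blksz);
    return end - begin;
}

/* Difference between the most and least loaded ranks. */
int block_imbalance(int N, int P) {
    int lo = block_count(N, P, 0);
    int hi = lo;
    for (int r = 1; r < P; r++) {
        int c = block_count(N, P, r);
        if (c < lo) lo = c;
        if (c > hi) hi = c;
    }
    return hi - lo;
}
```

For N = 26 and P = 6, block_count returns 5 for ranks 0 through 4 and 1 for rank 5, so block_imbalance returns 4.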

14 12e.14 Cyclic Mapping An alternative to block mapping is cyclic mapping. Iterations are assigned to processors in round-robin fashion. This leads to a better load balance.

15 12e.15 Cyclic Mapping With N=26 and P=6, processors 0-1 have 5 iterations of work. Processors 2-5 have only 4, but that is only 1 iteration fewer!

16 12e.16 Cyclic Mapping
    for (i = lb + my_rank; i < N; i += P) {
        ...
    }
(lb is the lower bound of the original loop)
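To see the load balance this loop produces, the iterations given to each rank can simply be counted. This is a minimal sketch assuming lb = 0; the function names are hypothetical, and the parameters N = 26 and P = 6 mentioned afterward are borrowed from the block mapping example as an assumption, so they need not match the slide's figure.

```c
#include <assert.h>

/* Iterations of a loop over [0, N) that cyclic mapping assigns to my_rank. */
int cyclic_count(int N, int P, int my_rank) {
    assert(N >= 0 && P > 0 && my_rank >= 0 && my_rank < P);
    int count = 0;
    for (int i = my_rank; i < N; i += P)  /* same stride as the slide's loop */
        count++;
    return count;
}

/* Difference between the most and least loaded ranks. */
int cyclic_imbalance(int N, int P) {
    int lo = cyclic_count(N, P, 0);
    int hi = lo;
    for (int r = 1; r < P; r++) {
        int c = cyclic_count(N, P, r);
        if (c < lo) lo = c;
        if (c > hi) hi = c;
    }
    return hi - lo;
}
```

With N = 26 and P = 6, ranks 0 and 1 receive 5 iterations and ranks 2 through 5 receive 4, so the imbalance is 1.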

17 12e.17 Cyclic Mapping Conceptually, this is an easier mapping to implement than block mapping. It leads to better load balancing. However, it can (and often does) lead to more communication. Suppose that each iteration in the above example is dependent on the previous iteration.

18 12e.18 Cyclic Mapping A message is sent from iteration 0 to 1, from 1 to 2, from 2 to 3, from 3 to 4, from 4 to 5, from 5 to 6,...

19 12e.19 Block Mapping With block mapping (N=26, P=6, blksz=5), messages are sent only from iteration 4 to 5, from 9 to 10, from 14 to 15, from 19 to 20, and from 24 to 25.

20 12e.20 Block vs Cyclic Block mapping increases the granularity and reduces overall communication (O(P)). However, it can lead to load imbalances (O(N/P)). Cyclic mapping decreases granularity and increases overall communication (O(N)). However, it improves load balance (O(1)). Block-Cyclic is a combination of the two
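The communication costs quoted above can be illustrated by counting messages for a loop in which iteration i+1 depends on iteration i. The sketch below assumes lb = 0 and that a message is needed exactly when two consecutive iterations have different owners; the function names are hypothetical.

```c
#include <assert.h>

/* Owner of iteration i under block mapping with blksz = ceil(N / P). */
static int block_owner(int i, int N, int P) {
    int blksz = (N + P - 1) / P;
    return i / blksz;
}

/* Owner of iteration i under cyclic mapping. */
static int cyclic_owner(int i, int P) {
    return i % P;
}

/* Messages needed under block mapping when iteration i+1 depends on i. */
int count_messages_block(int N, int P) {
    assert(N > 0 && P > 0);
    int messages = 0;
    for (int i = 0; i + 1 < N; i++)
        if (block_owner(i, N, P) != block_owner(i + 1, N, P))
            messages++;
    return messages;
}

/* Messages needed under cyclic mapping for the same dependence chain. */
int count_messages_cyclic(int N, int P) {
    assert(N > 0 && P > 0);
    int messages = 0;
    for (int i = 0; i + 1 < N; i++)
        if (cyclic_owner(i, P) != cyclic_owner(i + 1, P))
            messages++;
    return messages;
}
```

Taking N = 26 and P = 6 as assumed parameters, block mapping needs 5 messages (one per block boundary, P - 1 in total) while cyclic mapping needs 25 (one per consecutive pair, N - 1 in total).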

21 12e.21 Block-Cyclic Mapping Block-cyclic with N=26, P=6, and blksz=2. The load imbalance will be <= blksz.

22 12e.22 Block-Cyclic Mapping (N, P, and blksz are given)
    nLayers = (int)ceil(((float)N)/(blksz*P));
    for (layer = 0; layer < nLayers; layer++) {
        beginBlk = layer*blksz*P;
        for (i = beginBlk + mypid*blksz; i < min(N, beginBlk + (mypid + 1)*blksz); i++) {
            ...
        }
    }
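The block-cyclic loop structure can be exercised by counting how many iterations each process receives. This is a minimal sketch assuming lb = 0 and that each layer spans blksz*P iterations, so that layer number "layer" begins at iteration layer*blksz*P; the function names are hypothetical, and integer arithmetic replaces the ceil call.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Iterations of a loop over [0, N) that block-cyclic mapping gives to mypid. */
int block_cyclic_count(int N, int P, int blksz, int mypid) {
    assert(N >= 0 && P > 0 && blksz > 0 && mypid >= 0 && mypid < P);
    int span = blksz * P;                 /* iterations covered by one layer */
    int nLayers = (N + span - 1) / span;  /* integer form of ceil(N / span) */
    int count = 0;
    for (int layer = 0; layer < nLayers; layer++) {
        int beginBlk = layer * span;
        for (int i = beginBlk + mypid * blksz;
             i < min_int(N, beginBlk + (mypid + 1) * blksz); i++) {
            count++;
        }
    }
    return count;
}

/* Difference between the most and least loaded processes. */
int block_cyclic_imbalance(int N, int P, int blksz) {
    int lo = block_cyclic_count(N, P, blksz, 0);
    int hi = lo;
    for (int p = 1; p < P; p++) {
        int c = block_cyclic_count(N, P, blksz, p);
        if (c < lo) lo = c;
        if (c > hi) hi = c;
    }
    return hi - lo;
}
```

For N = 26, P = 6, and blksz = 2, process 0 receives 6 iterations and processes 1 through 5 receive 4 each, so the imbalance is 2, which does not exceed blksz.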

23 12e.23 Block vs Cyclic Block-Cyclic is in between Block and Cyclic in terms of granularity, communication, and load balancing. Block and Cyclic are special cases of Block-Cyclic:
– Block = Block-Cyclic with blksz = ceiling(N/P)
– Cyclic = Block-Cyclic with blksz = 1
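The two special cases can be checked by comparing which process owns each iteration under the three mappings. The sketch below assumes lb = 0 and expresses each mapping as an owner function, which is an illustrative reformulation of the loops in the slides; all function names are hypothetical.

```c
#include <assert.h>

/* Owner of iteration i under block-cyclic mapping. */
static int block_cyclic_owner(int i, int P, int blksz) {
    return (i / blksz) % P;
}

/* Owner of iteration i under block mapping with blksz = ceil(N / P). */
static int block_owner(int i, int N, int P) {
    return i / ((N + P - 1) / P);
}

/* Owner of iteration i under cyclic mapping. */
static int cyclic_owner(int i, int P) {
    return i % P;
}

/* Returns 1 if, for every iteration in [0, N), block-cyclic with
   blksz = ceil(N / P) agrees with block mapping and block-cyclic with
   blksz = 1 agrees with cyclic mapping; returns 0 otherwise. */
int special_cases_hold(int N, int P) {
    assert(N > 0 && P > 0);
    int blk = (N + P - 1) / P;  /* integer form of ceil(N / P) */
    for (int i = 0; i < N; i++) {
        if (block_cyclic_owner(i, P, blk) != block_owner(i, N, P))
            return 0;
        if (block_cyclic_owner(i, P, 1) != cyclic_owner(i, P))
            return 0;
    }
    return 1;
}
```

With blksz = ceiling(N/P) a single layer already covers all N iterations, so the wrap-around never takes effect and the mapping reduces to block mapping.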

