1 Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs – Nikolaos Drosinos and Nectarios Koziris, National Technical University of Athens, Computing Systems Laboratory, {ndros,nkoziris}@cslab.ece.ntua.gr, www.cslab.ece.ntua.gr (EuroPVM/MPI 2003, 2/10/2003)

2 Overview: Introduction; Pure MPI Model; Hybrid MPI-OpenMP Models (Hyperplane Scheduling, Fine-grain Model, Coarse-grain Model); Experimental Results; Conclusions – Future Work

3 Introduction – Motivation: SMP clusters; hybrid programming models; mostly fine-grain MPI-OpenMP paradigms; mostly DOALL parallelization

4 Introduction – Contribution: three programming models for the parallelization of nested loop algorithms (pure MPI, fine-grain hybrid MPI-OpenMP, coarse-grain hybrid MPI-OpenMP), plus an advanced hyperplane scheduling that minimizes synchronization and overlaps computation with communication

5 Introduction – Algorithmic model: perfectly nested loops with constant flow data dependencies:
FOR j_0 = min_0 TO max_0 DO
  ...
  FOR j_{n-1} = min_{n-1} TO max_{n-1} DO
    Computation(j_0, ..., j_{n-1});
  ENDFOR
  ...
ENDFOR

6 Introduction – Target architecture: SMP clusters

7 Overview: Introduction; Pure MPI Model; Hybrid MPI-OpenMP Models (Hyperplane Scheduling, Fine-grain Model, Coarse-grain Model); Experimental Results; Conclusions – Future Work

8 Pure MPI Model: a tiling transformation groups iterations into atomic execution units (tiles); pipelined execution; overlapping of computation with communication; no distinction between inter-node and intra-node communication

9 Pure MPI Model – Example:
FOR j_1 = 0 TO 9 DO
  FOR j_2 = 0 TO 7 DO
    A[j_1, j_2] := A[j_1 - 1, j_2] + A[j_1, j_2 - 1];
  ENDFOR
ENDFOR
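
For reference, a minimal, self-contained C version of this recurrence (an illustration, not code from the presentation) is given below; the iteration space is shifted to 1..10 x 1..8 so that row and column 0 hold boundary values, and the iterations are grouped into TI x TJ tiles as described on slide 8. The tile sizes are arbitrary assumptions.

#include <stdio.h>

#define N1 10
#define N2 8
#define TI 2          /* illustrative tile sizes, not taken from the slides */
#define TJ 4

int main(void)
{
    static double A[N1 + 1][N2 + 1];

    /* Boundary values in row 0 and column 0; the interior is computed below. */
    for (int j1 = 0; j1 <= N1; j1++) A[j1][0] = 1.0;
    for (int j2 = 0; j2 <= N2; j2++) A[0][j2] = 1.0;

    /* Tiled execution: tiles are visited in lexicographic order, which
     * respects the (1,0) and (0,1) flow dependencies of the recurrence. */
    for (int t1 = 1; t1 <= N1; t1 += TI)
        for (int t2 = 1; t2 <= N2; t2 += TJ)
            for (int j1 = t1; j1 < t1 + TI && j1 <= N1; j1++)
                for (int j2 = t2; j2 < t2 + TJ && j2 <= N2; j2++)
                    A[j1][j2] = A[j1 - 1][j2] + A[j1][j2 - 1];

    printf("A[%d][%d] = %g\n", N1, N2, A[N1][N2]);
    return 0;
}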

10–11 Pure MPI Model – [figures: pipelined execution of the example with 4 MPI processes, one per CPU (CPU0, CPU1) on each of the two SMP nodes (NODE0, NODE1)]

12 Pure MPI Model – pseudocode executed by each MPI process nod:
tile_0 = nod_0;
...
tile_{n-2} = nod_{n-2};
FOR tile_{n-1} = 0 TO … DO
  Pack(snd_buf, tile_{n-1} - 1, nod);
  MPI_Isend(snd_buf, dest(nod));
  MPI_Irecv(recv_buf, src(nod));
  Compute(tile);
  MPI_Waitall;
  Unpack(recv_buf, tile_{n-1} + 1, nod);
ENDFOR
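
As a concrete illustration of this pattern, here is a compilable C/MPI skeleton (an assumption built around the pseudocode above, not the authors' code): it shows how the nonblocking calls and the request handles that MPI_Waitall actually needs fit together for a one-dimensional pipeline between rank-1 and rank+1. Pack(), Unpack() and Compute() are left as stubs, and NTILES and BUF_LEN are illustrative.

#include <mpi.h>
#include <stdlib.h>

#define NTILES 16          /* illustrative number of tiles along the last dimension */
#define BUF_LEN 1024       /* illustrative boundary-buffer length */

static void Pack(double *buf, int tile, int rank)   { (void)buf; (void)tile; (void)rank; }
static void Unpack(double *buf, int tile, int rank) { (void)buf; (void)tile; (void)rank; }
static void Compute(int tile, int rank)             { (void)tile; (void)rank; }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *snd_buf = malloc(BUF_LEN * sizeof *snd_buf);
    double *rcv_buf = malloc(BUF_LEN * sizeof *rcv_buf);
    int prev = rank - 1, next = rank + 1;

    for (int tile = 0; tile < NTILES; tile++) {
        MPI_Request req[2];
        int nreq = 0;

        /* Send the boundary produced by the previous tile downstream and
         * post the receive of the next tile's halo; both overlap with Compute(). */
        if (next < size) {
            Pack(snd_buf, tile - 1, rank);
            MPI_Isend(snd_buf, BUF_LEN, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &req[nreq++]);
        }
        if (prev >= 0)
            MPI_Irecv(rcv_buf, BUF_LEN, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &req[nreq++]);

        Compute(tile, rank);

        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
        Unpack(rcv_buf, tile + 1, rank);
    }

    free(snd_buf);
    free(rcv_buf);
    MPI_Finalize();
    return 0;
}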

13 Overview: Introduction; Pure MPI Model; Hybrid MPI-OpenMP Models (Hyperplane Scheduling, Fine-grain Model, Coarse-grain Model); Experimental Results; Conclusions – Future Work

14 Hyperplane Scheduling: implements coarse-grain parallelism while respecting inter-tile data dependencies; tiles are organized into data-independent subsets (groups); tiles of the same group can be executed concurrently by multiple threads; barrier synchronization between threads

15–16 Hyperplane Scheduling – [figures: tile groups (hyperplanes) executed by 2 MPI processes (NODE0, NODE1) with 2 OpenMP threads per node (CPU0, CPU1)]

17 Hyperplane Scheduling – pseudocode:
#pragma omp parallel
{
  group_0 = nod_0;
  ...
  group_{n-2} = nod_{n-2};
  tile_0 = nod_0 * m_0 + th_0;
  ...
  tile_{n-2} = nod_{n-2} * m_{n-2} + th_{n-2};
  FOR(group_{n-1}) {
    tile_{n-1} = group_{n-1} - …;
    if (0 <= tile_{n-1} <= …)
      compute(tile);
    #pragma omp barrier
  }
}
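
To make the grouping idea concrete, here is a minimal, self-contained OpenMP-only C sketch of hyperplane (wavefront) scheduling for the 2D example (an illustration under assumed tile and problem sizes, not the authors' scheduler): tiles with the same t1 + t2 form one group, carry no dependences on each other, and are executed concurrently; the implicit barrier at the end of each parallel loop separates successive groups.

#include <stdio.h>

#define N1 512
#define N2 512
#define TI 64              /* illustrative tile sizes */
#define TJ 64
#define NT1 (N1 / TI)      /* number of tiles along j_1 */
#define NT2 (N2 / TJ)      /* number of tiles along j_2 */

static double A[N1 + 1][N2 + 1];

int main(void)
{
    for (int j1 = 0; j1 <= N1; j1++) A[j1][0] = 1.0;
    for (int j2 = 0; j2 <= N2; j2++) A[0][j2] = 1.0;

    /* Group g contains the tiles (t1, t2) with t1 + t2 == g; a tile depends
     * only on tiles of group g-1, so all tiles of a group are independent. */
    for (int g = 0; g <= NT1 + NT2 - 2; g++) {
        #pragma omp parallel for schedule(static)
        for (int t1 = 0; t1 < NT1; t1++) {
            int t2 = g - t1;
            if (t2 < 0 || t2 >= NT2)
                continue;                      /* tile outside the iteration space */
            for (int j1 = t1 * TI + 1; j1 <= (t1 + 1) * TI; j1++)
                for (int j2 = t2 * TJ + 1; j2 <= (t2 + 1) * TJ; j2++)
                    A[j1][j2] = A[j1 - 1][j2] + A[j1][j2 - 1];
        }
        /* implicit barrier of the parallel loop: group g finishes before g+1 starts */
    }
    printf("A[%d][%d] = %g\n", N1, N2, A[N1][N2]);
    return 0;
}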

18 Overview: Introduction; Pure MPI Model; Hybrid MPI-OpenMP Models (Hyperplane Scheduling, Fine-grain Model, Coarse-grain Model); Experimental Results; Conclusions – Future Work

19 Fine-grain Model: incremental parallelization of the computationally intensive parts; relatively straightforward to derive from the pure MPI code; threads are (re)spawned at each computation phase; inter-node communication stays outside of the multi-threaded part; thread synchronization through the implicit barrier of the omp parallel directive

20 Fine-grain Model – pseudocode:
FOR(group_{n-1}) {
  Pack(snd_buf, tile_{n-1} - 1, nod);
  MPI_Isend(snd_buf, dest(nod));
  MPI_Irecv(recv_buf, src(nod));
  #pragma omp parallel
  {
    thread_id = omp_get_thread_num();
    if (valid(tile, thread_id, group_{n-1}))
      Compute(tile);
  }
  MPI_Waitall;
  Unpack(recv_buf, tile_{n-1} + 1, nod);
}
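
A compilable hybrid skeleton of this scheme is sketched below (an assumption, not the authors' code). It differs from the pure MPI skeleton of slide 12 only in that the compute step runs inside a freshly spawned parallel region, whose implicit barrier provides the intra-node synchronization; all MPI calls remain outside it, so MPI_Init() without any thread-support request is sufficient. Pack(), Unpack(), Compute() and valid() are stubs; NGROUPS and BUF_LEN are illustrative.

#include <mpi.h>
#include <omp.h>

#define NGROUPS 16
#define BUF_LEN 1024

static void Pack(double *b, int g, int r)   { (void)b; (void)g; (void)r; }
static void Unpack(double *b, int g, int r) { (void)b; (void)g; (void)r; }
static void Compute(int g, int tid, int r)  { (void)g; (void)tid; (void)r; }
static int  valid(int g, int tid, int r)    { (void)g; (void)tid; (void)r; return 1; }

static double snd_buf[BUF_LEN], rcv_buf[BUF_LEN];

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                /* only the initial thread calls MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int prev = rank - 1, next = rank + 1;

    for (int group = 0; group < NGROUPS; group++) {
        MPI_Request req[2];
        int nreq = 0;

        Pack(snd_buf, group - 1, rank);
        if (next < size)
            MPI_Isend(snd_buf, BUF_LEN, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &req[nreq++]);
        if (prev >= 0)
            MPI_Irecv(rcv_buf, BUF_LEN, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &req[nreq++]);

        /* Threads are (re)spawned here for every tile group. */
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            if (valid(group, tid, rank))
                Compute(group, tid, rank);
        }   /* implicit barrier: all threads have finished the group */

        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
        Unpack(rcv_buf, group + 1, rank);
    }

    MPI_Finalize();
    return 0;
}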

21 Overview: Introduction; Pure MPI Model; Hybrid MPI-OpenMP Models (Hyperplane Scheduling, Fine-grain Model, Coarse-grain Model); Experimental Results; Conclusions – Future Work

22 Coarse-grain Model: SPMD paradigm; requires more programming effort; threads are spawned only once; inter-node communication inside the multi-threaded part (requires MPI_THREAD_MULTIPLE); thread synchronization through an explicit barrier (omp barrier directive)

23 Coarse-grain Model – pseudocode:
#pragma omp parallel
{
  thread_id = omp_get_thread_num();
  FOR(group_{n-1}) {
    #pragma omp master
    {
      Pack(snd_buf, tile_{n-1} - 1, nod);
      MPI_Isend(snd_buf, dest(nod));
      MPI_Irecv(recv_buf, src(nod));
    }
    if (valid(tile, thread_id, group_{n-1}))
      Compute(tile);
    #pragma omp master
    {
      MPI_Waitall;
      Unpack(recv_buf, tile_{n-1} + 1, nod);
    }
    #pragma omp barrier
  }
}
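
A compilable skeleton of the coarse-grain scheme is sketched below (an assumption, not the authors' code): the parallel region is entered once, the master thread issues the MPI calls from inside it, and the explicit barrier at the end of each group orders the master's Unpack against the next group's computation. Following the slides, MPI_THREAD_MULTIPLE is requested; since only the master thread communicates in this sketch, MPI_THREAD_FUNNELED would also suffice. Stubs and sizes are illustrative.

#include <mpi.h>
#include <omp.h>

#define NGROUPS 16
#define BUF_LEN 1024

static void Pack(double *b, int g, int r)   { (void)b; (void)g; (void)r; }
static void Unpack(double *b, int g, int r) { (void)b; (void)g; (void)r; }
static void Compute(int g, int tid, int r)  { (void)g; (void)tid; (void)r; }
static int  valid(int g, int tid, int r)    { (void)g; (void)tid; (void)r; return 1; }

static double snd_buf[BUF_LEN], rcv_buf[BUF_LEN];

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int prev = rank - 1, next = rank + 1;

    #pragma omp parallel                    /* threads are spawned only once */
    {
        int tid = omp_get_thread_num();

        for (int group = 0; group < NGROUPS; group++) {
            MPI_Request req[2];             /* thread-private; only the master uses them */
            int nreq = 0;

            #pragma omp master
            {
                Pack(snd_buf, group - 1, rank);
                if (next < size)
                    MPI_Isend(snd_buf, BUF_LEN, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &req[nreq++]);
                if (prev >= 0)
                    MPI_Irecv(rcv_buf, BUF_LEN, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &req[nreq++]);
            }   /* no implied barrier: the other threads proceed straight to Compute */

            if (valid(group, tid, rank))
                Compute(group, tid, rank);

            #pragma omp master
            {
                MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
                Unpack(rcv_buf, group + 1, rank);   /* touches only next-group data */
            }
            #pragma omp barrier             /* explicit: unpacked data visible before the next group */
        }
    }

    MPI_Finalize();
    return 0;
}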

24 Summary – Fine-grain vs Coarse-grain:
  Fine-grain: threads are re-spawned for every group; Coarse-grain: threads are spawned only once.
  Fine-grain: inter-node MPI communication outside of the multi-threaded region; Coarse-grain: inter-node MPI communication inside the multi-threaded region, assumed by the master thread.
  Fine-grain: intra-node synchronization through the implicit barrier of omp parallel; Coarse-grain: intra-node synchronization through an explicit OpenMP barrier.

25 Overview: Introduction; Pure MPI Model; Hybrid MPI-OpenMP Models (Hyperplane Scheduling, Fine-grain Model, Coarse-grain Model); Experimental Results; Conclusions – Future Work

26 Experimental Results – setup: 8-node SMP Linux cluster (800 MHz Pentium III, 128 MB RAM, kernel 2.4.20); MPICH v1.2.5 (--with-device=ch_p4, --with-comm=shared); Intel C++ compiler 7.0 (-O3 -mcpu=pentiumpro -static); Fast Ethernet interconnect; ADI micro-kernel benchmark (3D)

27 Alternating Direction Implicit (ADI): unitary data dependencies; 3D iteration space (X × Y × Z)
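
The ADI sweeps themselves are not reproduced in this transcript; purely as a stand-in for reasoning about the tiling and pipelining above, a 3D recurrence with the same unitary flow dependencies (1,0,0), (0,1,0) and (0,0,1) could look as follows (sizes, boundary values and coefficients are arbitrary assumptions):

#include <stdio.h>

#define X 32
#define Y 32
#define Z 64

static double A[X][Y][Z];

int main(void)
{
    /* Boundary planes x = 0, y = 0, z = 0; the interior is computed below. */
    for (int x = 0; x < X; x++)
        for (int y = 0; y < Y; y++)
            for (int z = 0; z < Z; z++)
                A[x][y][z] = (x == 0 || y == 0 || z == 0) ? 1.0 : 0.0;

    /* Each point depends on its three "previous" neighbours, one per dimension. */
    for (int x = 1; x < X; x++)
        for (int y = 1; y < Y; y++)
            for (int z = 1; z < Z; z++)
                A[x][y][z] = (A[x - 1][y][z] + A[x][y - 1][z] + A[x][y][z - 1]) / 3.0;

    printf("A[%d][%d][%d] = %g\n", X - 1, Y - 1, Z - 1, A[X - 1][Y - 1][Z - 1]);
    return 0;
}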

28 ADI – 4 nodes [results chart]

29 ADI – 4 nodes [results charts for the X < Y and X > Y cases]

30 ADI X=512 Y=512 Z=8192 – 4 nodes [results chart]

31 ADI X=128 Y=512 Z=8192 – 4 nodes [results chart]

32 ADI X=512 Y=128 Z=8192 – 4 nodes [results chart]

33 ADI – 2 nodes [results chart]

34 ADI – 2 nodes [results charts for the X < Y and X > Y cases]

35 ADI X=128 Y=512 Z=8192 – 2 nodes [results chart]

36 ADI X=256 Y=512 Z=8192 – 2 nodes [results chart]

37 ADI X=512 Y=512 Z=8192 – 2 nodes [results chart]

38 ADI X=512 Y=256 Z=8192 – 2 nodes [results chart]

39 ADI X=512 Y=128 Z=8192 – 2 nodes [results chart]

40 ADI X=128 Y=512 Z=8192 – 2 nodes [computation vs. communication breakdown chart]

41 ADI X=512 Y=128 Z=8192 – 2 nodes [computation vs. communication breakdown chart]

42 Overview: Introduction; Pure MPI Model; Hybrid MPI-OpenMP Models (Hyperplane Scheduling, Fine-grain Model, Coarse-grain Model); Experimental Results; Conclusions – Future Work

43 Conclusions: nested loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm; hybrid models can be competitive with the pure MPI paradigm; the coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated; programming efficiently in OpenMP is not easier than programming efficiently in MPI

44 Future Work: application of the methodology to real applications and benchmarks; work balancing for the coarse-grain model; performance evaluation on advanced interconnection networks (SCI, Myrinet); generalization as a compiler technique

45 Questions? http://www.cslab.ece.ntua.gr/~ndros

