
1 Delivering High Performance to Parallel Applications Using Advanced Scheduling. Nikolaos Drosinos, Georgios Goumas, Maria Athanasaki and Nectarios Koziris. National Technical University of Athens, Computing Systems Laboratory.

2 Overview: Introduction; Background; Code Generation (Computation/Data Distribution, Communication Schemes, Summary); Experimental Results; Conclusions – Future Work

3 Introduction. Motivation: a lot of theoretical work has been done on arbitrary tiling, but actual experimental results are missing, and there is no complete method to generate code for non-rectangular tiles.

4 Introduction. Contribution: a complete end-to-end SPMD code generation method for arbitrarily tiled iteration spaces; simulation of blocking and non-blocking communication primitives; experimental evaluation of the proposed scheduling scheme.

5 Overview: Introduction; Background; Code Generation (Computation/Data Distribution, Communication Schemes, Summary); Experimental Results; Conclusions – Future Work

6 Background. Algorithmic model:
FOR j_1 = min_1 TO max_1 DO
  ...
    FOR j_n = min_n TO max_n DO
      Computation(j_1, ..., j_n);
    ENDFOR
  ...
ENDFOR
Perfectly nested loops; constant flow data dependencies (D).

7 Background. Tiling: a popular loop transformation that groups iterations into atomic units; enhances locality on uniprocessors; enables coarse-grain parallelism on distributed memory systems. Valid tiling matrix H (condition shown as a figure on the slide).
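The validity condition itself only appears as a figure on the slide. The standard legality requirement for a tiling, given here as a hedged reconstruction rather than a quote from the slide, is that no dependence points backwards across a tile boundary:

% Standard tiling legality condition (hedged reconstruction; the slide's own
% formula did not survive the transcript). D holds the dependence vectors as columns.
H D \ge 0 \quad \text{(every entry of } H D \text{ is non-negative)}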

8 Tiling Transformation. Example:
FOR j_1 = 0 TO 11 DO
  FOR j_2 = 0 TO 8 DO
    A[j_1, j_2] := A[j_1 - 1, j_2] + A[j_1 - 1, j_2 - 1];
  ENDFOR
ENDFOR
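For concreteness, the following is a minimal C sketch, not the paper's generated code, of what this example loop looks like after the 3x3 rectangular tiling of the next slide. The boundary guard is an assumption added to keep the sketch self-contained; the slide's loop implicitly reads values outside the iteration space.

/* Hedged sketch: the 12x9 example loop after a 3x3 rectangular tiling.
 * Tile loops (t1, t2) enumerate tiles; point loops (j1, j2) sweep the
 * iterations inside each tile.  Illustrative only. */
#include <stdio.h>

#define N1 12
#define N2 9
#define T  3                     /* tile size in both dimensions, P = diag(3,3) */

static double A[N1][N2];

int main(void)
{
    for (int t1 = 0; t1 < N1; t1 += T)            /* tile loop over j1 */
        for (int t2 = 0; t2 < N2; t2 += T)        /* tile loop over j2 */
            for (int j1 = t1; j1 < t1 + T; j1++)  /* points inside the tile */
                for (int j2 = t2; j2 < t2 + T; j2++)
                    if (j1 > 0 && j2 > 0)         /* assumed boundary guard */
                        A[j1][j2] = A[j1 - 1][j2] + A[j1 - 1][j2 - 1];

    printf("A[11][8] = %f\n", A[11][8]);
    return 0;
}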

9 Rectangular Tiling Transformation:
P = | 3  0 |        H = | 1/3   0  |
    | 0  3 |            |  0   1/3 |

10 Non-rectangular Tiling Transformation:
P = | 3  3 |        H = | 1/3  -1/3 |
    | 0  3 |            |  0    1/3 |
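For readers unfamiliar with the notation, the standard relationship between the two matrices, assumed here rather than spelled out in the transcript, is that P holds the tile edge vectors as columns and H = P^{-1}; the tiling transformation then maps an iteration point to its tile index and its offset inside the tile:

% Standard tiling transformation (hedged; not quoted from the slides).
r(j) = \begin{bmatrix} \lfloor H j \rfloor \\ j - P \lfloor H j \rfloor \end{bmatrix}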

11 Why Non-rectangular Tiling? It reduces communication and enables more efficient scheduling schemes. In the slide's example, rectangular tiling requires 8 communication points and 6 time steps, while non-rectangular tiling requires 6 communication points and 5 time steps.

12 Overview: Introduction; Background; Code Generation (Computation/Data Distribution, Communication Schemes, Summary); Experimental Results; Conclusions – Future Work

13 Computation Distribution. Tiles along the longest dimension are mapped to the same processor because this reduces the number of processors required, simplifies message passing, and reduces total execution time when computation is overlapped with communication (a small mapping sketch follows).
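A minimal sketch of this mapping, assuming a 3D tile space whose longest dimension is dimension 0: all tiles that differ only in their coordinate along that dimension share an owner, so the owner is a function of the remaining tile coordinates. The function name and the row-major linearization of the processor grid are illustrative assumptions, not the paper's scheme.

/* Hedged sketch of the computation distribution along the longest dimension. */
#include <stdio.h>

static int owner(int t1, int t2, int n2)
{
    /* The coordinate along the longest dimension (t0) is ignored, so a whole
     * chain of tiles along that dimension maps to the same processor. */
    return t1 * n2 + t2;
}

int main(void)
{
    int n0 = 4, n1 = 2, n2 = 3;                  /* illustrative tile counts */
    for (int t0 = 0; t0 < n0; t0++)
        for (int t1 = 0; t1 < n1; t1++)
            for (int t2 = 0; t2 < n2; t2++)
                printf("tile (%d,%d,%d) -> processor %d\n",
                       t0, t1, t2, owner(t1, t2, n2));
    return 0;
}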

14 Computation Distribution (figure: chains of tiles mapped to processors P1, P2, P3)

15 Data Distribution. Computer-owns rule: each processor owns the data it computes. The method handles an arbitrary convex iteration space and arbitrary tiling, and yields rectangular local iteration and data spaces.

16-18 Data Distribution (figures)

19 Overview: Introduction; Background; Code Generation (Computation/Data Distribution, Communication Schemes, Summary); Experimental Results; Conclusions – Future Work

20 Communication Schemes: with whom do I communicate? (figure with processors P1-P3)

21 Communication Schemes: with whom do I communicate? (figure with processors P1-P3)

22 Communication Schemes: what do I send? (figure)

23 Blocking Scheme (figure: tile space on axes j1, j2 with processors P1-P3; execution takes 12 time steps)

24 Non-blocking Scheme (figure: tile space on axes j1, j2 with processors P1-P3; execution takes 6 time steps)

25 Overview: Introduction; Background; Code Generation (Computation/Data Distribution, Communication Schemes, Summary); Experimental Results; Conclusions – Future Work

26 Code Generation Summary. Advanced Scheduling = Suitable Tiling + Non-blocking Communication Scheme. Flow: Sequential Code -> Dependence Analysis and Tiling -> Tiling Transformation -> Sequential Tiled Code -> Parallelization (Computation Distribution, Data Distribution, Communication Primitives) -> Parallel SPMD Code.

27 Code Summary – Blocking Scheme (code listing shown as a figure; see the sketch below)
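The generated listing itself did not survive the transcript. The following is a minimal sketch, assuming a 1D chain of processors and plain MPI point-to-point calls, of the blocking per-tile pattern from slide 23: receive the tile's input boundary, compute, then send, so communication and computation never overlap. The kernel name compute_tile, the tile count and the boundary size are illustrative assumptions.

/* Hedged sketch of the blocking scheme (not the paper's generated code):
 * receive -> compute -> send, serialized per tile. */
#include <mpi.h>

#define TILES    8            /* tiles owned by this processor (assumed) */
#define BOUNDARY 1024         /* boundary elements exchanged per tile (assumed) */

extern void compute_tile(int t, double *in, double *out);   /* assumed kernel */

void run_blocking(int rank, int nprocs)
{
    double recv_buf[BOUNDARY], send_buf[BOUNDARY];
    int prev = rank - 1, next = rank + 1;

    for (int t = 0; t < TILES; t++) {
        /* 1. Wait for the boundary data produced by the previous processor. */
        if (prev >= 0)
            MPI_Recv(recv_buf, BOUNDARY, MPI_DOUBLE, prev, t,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* 2. Compute the current tile. */
        compute_tile(t, recv_buf, send_buf);

        /* 3. Send this tile's boundary to the next processor. */
        if (next < nprocs)
            MPI_Send(send_buf, BOUNDARY, MPI_DOUBLE, next, t, MPI_COMM_WORLD);
    }
}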

28 Code Summary – Non-blocking Scheme (code listing shown as a figure; see the sketch below)
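Again only the slide title survives. The sketch below, under the same illustrative assumptions, shows the non-blocking pattern from slide 24: transfers are started with MPI_Irecv/MPI_Isend and completed only when their data is needed, so while tile t is computed the previous tile's boundary is still being sent and the next tile's input is already arriving.

/* Hedged sketch of the non-blocking scheme (not the paper's generated code). */
#include <mpi.h>

#define TILES    8            /* tiles owned by this processor (assumed) */
#define BOUNDARY 1024         /* boundary elements exchanged per tile (assumed) */

extern void compute_tile(int t, double *in, double *out);   /* assumed kernel */

void run_nonblocking(int rank, int nprocs)
{
    double recv_buf[2][BOUNDARY], send_buf[2][BOUNDARY];     /* double buffering */
    MPI_Request rreq = MPI_REQUEST_NULL, sreq = MPI_REQUEST_NULL;
    int prev = rank - 1, next = rank + 1;

    /* Pre-post the receive for the first tile. */
    if (prev >= 0)
        MPI_Irecv(recv_buf[0], BOUNDARY, MPI_DOUBLE, prev, 0,
                  MPI_COMM_WORLD, &rreq);

    for (int t = 0; t < TILES; t++) {
        int cur = t % 2, nxt = (t + 1) % 2;

        /* 1. Make sure this tile's input boundary has arrived. */
        if (prev >= 0)
            MPI_Wait(&rreq, MPI_STATUS_IGNORE);

        /* 2. Pre-post the receive for the next tile; it overlaps the compute. */
        if (prev >= 0 && t + 1 < TILES)
            MPI_Irecv(recv_buf[nxt], BOUNDARY, MPI_DOUBLE, prev, t + 1,
                      MPI_COMM_WORLD, &rreq);

        /* 3. Compute the current tile while the transfers progress. */
        compute_tile(t, recv_buf[cur], send_buf[cur]);

        /* 4. Retire the previous send, then start sending this tile's boundary. */
        if (sreq != MPI_REQUEST_NULL)
            MPI_Wait(&sreq, MPI_STATUS_IGNORE);
        if (next < nprocs)
            MPI_Isend(send_buf[cur], BOUNDARY, MPI_DOUBLE, next, t,
                      MPI_COMM_WORLD, &sreq);
    }

    /* Drain the last outstanding send before returning. */
    if (sreq != MPI_REQUEST_NULL)
        MPI_Wait(&sreq, MPI_STATUS_IGNORE);
}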

29 Overview: Introduction; Background; Code Generation (Computation/Data Distribution, Communication Schemes, Summary); Experimental Results; Conclusions – Future Work

30 Experimental Results: 8-node SMP Linux cluster (800 MHz PIII, 128 MB RAM); MPICH (--with-device=p4, --with-comm=shared); g++ compiler (-O3); FastEthernet interconnection; two 3D micro-kernel benchmarks: Gauss Successive Over-Relaxation (SOR) and Texture Smoothing Code (TSC); simulation of communication schemes.

31 SOR: iteration space M x N x N; dependence matrix and the rectangular and non-rectangular tiling matrices shown as figures.

32-33 SOR: performance results (figures)

34 TSC: iteration space T x N x N; dependence matrix and the rectangular and non-rectangular tiling matrices shown as figures.

35-36 TSC: performance results (figures)

37 Overview: Introduction; Background; Code Generation (Computation/Data Distribution, Communication Schemes, Summary); Experimental Results; Conclusions – Future Work

38 Conclusions: automatic code generation for arbitrarily tiled spaces can be efficient; high performance can be achieved by means of a suitable tiling transformation and by overlapping computation with communication.

39 Future Work: application of the methodology to imperfectly nested loops and non-constant dependencies; investigation of hybrid programming models (MPI+OpenMP); performance evaluation on advanced interconnection networks (SCI, Myrinet).

40 Questions?

