
1
Delivering High Performance to Parallel Applications Using Advanced Scheduling
Nikolaos Drosinos, Georgios Goumas, Maria Athanasaki and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory

2
5/9/2003 Parallel Computing
Overview
- Introduction
- Background
- Code Generation
  - Computation/Data Distribution
  - Communication Schemes
  - Summary
- Experimental Results
- Conclusions – Future Work

3
Introduction
Motivation:
- A lot of theoretical work has been done on arbitrary tiling, but there are no actual experimental results
- There is no complete method to generate code for non-rectangular tiles

4
Introduction
Contribution:
- Complete end-to-end SPMD code generation method for arbitrarily tiled iteration spaces
- Simulation of blocking and non-blocking communication primitives
- Experimental evaluation of the proposed scheduling scheme

5
Overview (outline slide repeated)

6
Background
Algorithmic model:
  FOR j1 = min1 TO max1 DO
    ...
      FOR jn = minn TO maxn DO
        Computation(j1, ..., jn);
      ENDFOR
    ...
  ENDFOR
- Perfectly nested loops
- Constant flow data dependencies (D)

7
Background
Tiling:
- Popular loop transformation
- Groups iterations into atomic units
- Enhances locality on uniprocessors
- Enables coarse-grain parallelism on distributed memory systems
- Valid tiling matrix H: nonsingular, with H·D ≥ 0 (the standard validity condition: no dependence crosses tile boundaries in the negative direction)

8
Tiling Transformation
Example:
  FOR j1 = 0 TO 11 DO
    FOR j2 = 0 TO 8 DO
      A[j1, j2] := A[j1-1, j2] + A[j1-1, j2-1];
    ENDFOR
  ENDFOR
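The example loop above can be transcribed directly into runnable Python. This is an illustration only: the dictionary stands in for the array, and reads outside the iteration space return an assumed boundary value of 1 (the slides do not specify the boundary):

```python
# Runnable transcription of the example loop. Assumption (not from the
# slides): reads outside the iteration space return the boundary value 1.
def example(n1=12, n2=9):
    A = {}

    def read(i, j):
        return A.get((i, j), 1)  # out-of-space reads use the boundary value

    for j1 in range(n1):
        for j2 in range(n2):
            # the slide's recurrence: A[j1,j2] := A[j1-1,j2] + A[j1-1,j2-1]
            A[j1, j2] = read(j1 - 1, j2) + read(j1 - 1, j2 - 1)
    return A
```

Each point depends on two points in the previous j1 row, which is exactly the constant flow dependence matrix D = [(1,0), (1,1)] of the algorithmic model.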

9
Rectangular Tiling Transformation
  P = [ 3 0 ; 0 3 ],   H = P^-1 = [ 1/3 0 ; 0 1/3 ]

10
Non-rectangular Tiling Transformation
  P = [ 3 3 ; 0 3 ],   H = P^-1 = [ 1/3 -1/3 ; 0 1/3 ]
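In both cases a tile is identified by floor(H·j) for an iteration point j. A minimal sketch of that mapping, using exact rational arithmetic so the floor is not perturbed by floating-point error (the matrices are the two from the slides; the sample points are my own):

```python
from fractions import Fraction as F
import math

def tile_of(H, j):
    """Tile coordinates of iteration point j: floor(H @ j), exact."""
    return tuple(
        math.floor(sum(H[r][c] * j[c] for c in range(len(j))))
        for r in range(len(H))
    )

# Rectangular tiling: P = [3 0; 0 3], H = P^-1
H_rect = [[F(1, 3), F(0)], [F(0), F(1, 3)]]
# Non-rectangular tiling: P = [3 3; 0 3], H = P^-1
H_skew = [[F(1, 3), F(-1, 3)], [F(0), F(1, 3)]]

print(tile_of(H_rect, (7, 4)))  # -> (2, 1)
print(tile_of(H_skew, (4, 4)))  # -> (0, 1)
```

Note how the skewed H assigns (4, 4) to tile (0, 1): the off-diagonal -1/3 is what shears the tile shape along j1.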

11
Why Non-rectangular Tiling?
- Reduces communication: 8 communication points with rectangular tiles vs. 6 with non-rectangular tiles in the example
- Enables more efficient scheduling schemes: 6 time steps vs. 5 time steps

12
Overview (outline slide repeated)

13
Computation Distribution
We map tiles along the longest dimension to the same processor because this:
- reduces the number of processors required
- simplifies message-passing
- reduces total execution time when computation is overlapped with communication
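One way to express this mapping rule is to drop the longest-dimension coordinate from the tile index, so that all tiles along that dimension share a processor. The helper below is a hypothetical illustration of the rule, not the paper's generated code:

```python
# Hypothetical sketch of the computation-distribution rule: tiles that
# differ only in the longest dimension belong to the same processor, so
# the processor id is the tile index with that dimension removed.
def owner(tile, longest_dim):
    return tuple(c for i, c in enumerate(tile) if i != longest_dim)

# With dimension 0 as the longest, tiles (0, 2) and (5, 2) map to the
# same processor:
print(owner((0, 2), 0), owner((5, 2), 0))  # -> (2,) (2,)
```

Because each processor then executes a whole chain of dependent tiles locally, the messages it exchanges are only with the owners of neighbouring chains.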

14
Computation Distribution (figure: tile space partitioned among processors P1, P2, P3)

15
Data Distribution
- Computer-owns rule: each processor owns the data it computes
- Arbitrary convex iteration space, arbitrary tiling
- Rectangular local iteration and data spaces

16
5/9/2003Parallel Computing Data Distribution

17
5/9/2003Parallel Computing Data Distribution

18
5/9/2003Parallel Computing Data Distribution

19
Overview (outline slide repeated)

20
Communication Schemes
With whom do I communicate? (figure: processors P1, P2, P3)

21
Communication Schemes
With whom do I communicate? (second figure: processors P1, P2, P3)

22
Communication Schemes
What do I send?

23
Blocking Scheme
12 time steps (figure: tile execution of P1, P2, P3 over the j1, j2 axes)

24
Non-blocking Scheme
6 time steps (figure: tile execution of P1, P2, P3 over the j1, j2 axes)
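The two counts are consistent with a toy cost model of a wavefront pipeline. Assuming 3 processors with 4 tiles each and unit compute and communication costs (my assumptions chosen to reproduce the slides' figures, not stated in the deck): blocking pays compute plus communication at every step, while the non-blocking scheme overlaps them and pays only the larger of the two:

```python
# Toy cost model (illustrative assumptions: 3 processors, 4 tiles per
# processor, unit compute and communication cost per tile).
def steps(procs, tiles_per_proc, overlap):
    pipeline = procs + tiles_per_proc - 1  # wavefront pipeline depth
    # blocking: compute then communicate (1 + 1 units per step);
    # non-blocking: overlapped, max(compute, comm) = 1 unit per step
    per_step = 1 if overlap else 2
    return per_step * pipeline

print(steps(3, 4, overlap=False))  # -> 12 (blocking)
print(steps(3, 4, overlap=True))   # -> 6  (non-blocking)
```

This is the essence of the "advanced scheduling" argument: overlapping hides communication behind the computation of the next tile, halving the schedule length in this model.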

25
Overview (outline slide repeated)

26
Code Generation Summary
Advanced Scheduling = Suitable Tiling + Non-blocking Communication Scheme
Flow: Sequential Code → Dependence Analysis → Tiling Transformation → Sequential Tiled Code → Parallelization (Computation Distribution, Data Distribution, Communication Primitives) → Parallel SPMD Code

27
Code Summary – Blocking Scheme (figure: generated code skeleton)

28
Code Summary – Non-blocking Scheme (figure: generated code skeleton)

29
Overview (outline slide repeated)

30
Experimental Results
- 8-node SMP Linux cluster (800 MHz Pentium III, 128 MB RAM)
- MPICH (configured with --with-device=p4, --with-comm=shared)
- g++ compiler (-O3)
- FastEthernet interconnect
- 2 micro-kernel benchmarks (3D): Gauss Successive Over-Relaxation (SOR), Texture Smoothing Code (TSC)
- Simulation of communication schemes

31
SOR
- Iteration space: M × N × N
- Dependence matrix, rectangular tiling, and non-rectangular tiling: shown as figures on the slide
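The slide's matrices were lost in transcription. As context for the benchmark, a generic SOR sweep over an N × N grid repeated for M outer steps looks like the kernel below; this is my own illustrative stencil with a standard 4-point average, not necessarily the paper's exact formulation:

```python
# Illustrative SOR micro-kernel: M sweeps over an N x N grid, in-place
# Gauss-Seidel-style update with relaxation factor omega.
# (Assumed stencil: 4-point neighbour average; not taken from the slides.)
def sor(A, M, omega=1.0):
    N = len(A)
    for _ in range(M):
        for i in range(1, N - 1):
            for j in range(1, N - 1):
                avg = 0.25 * (A[i - 1][j] + A[i + 1][j]
                              + A[i][j - 1] + A[i][j + 1])
                A[i][j] = (1 - omega) * A[i][j] + omega * avg
    return A
```

The in-place update is what produces the flow dependencies along i, j, and the outer sweep index, giving the 3-D iteration space M × N × N that the tiling transformation operates on.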

32
5/9/2003Parallel Computing SOR

33
5/9/2003Parallel Computing SOR

34
TSC
- Iteration space: T × N × N
- Dependence matrix, rectangular tiling, and non-rectangular tiling: shown as figures on the slide

35
5/9/2003Parallel Computing TSC

36
5/9/2003Parallel Computing TSC

37
Overview (outline slide repeated)

38
Conclusions
- Automatic code generation for arbitrarily tiled iteration spaces can be efficient
- High performance can be achieved by means of:
  - a suitable tiling transformation
  - overlapping computation with communication

39
Future Work
- Application of the methodology to imperfectly nested loops and non-constant dependencies
- Investigation of hybrid programming models (MPI + OpenMP)
- Performance evaluation on advanced interconnection networks (SCI, Myrinet)

40
Questions?
