
1 Adaptive Cyclic Scheduling of Nested Loops
Florina M. Ciorba, Theodore Andronikos and George Papakonstantinou
National Technical University of Athens, Computing Systems Laboratory
cflorina@cslab.ece.ntua.gr, www.cslab.ece.ntua.gr

2 Outline (HERCMA'05, September 24, 2005)
Introduction
Definitions and notations
Adaptive cyclic scheduling
ACS for homogeneous systems
ACS for heterogeneous systems
Conclusions
Future work

3 Introduction
Motivation: much work has been done on parallelizing loops with dependencies, but very little work exists on explicitly minimizing the communication incurred by certain dependence vectors.

4 Introduction
Contribution:
- Enhancing data locality for loops with dependencies
- Reducing the communication cost by mapping iterations tied by certain dependence vectors to the same processor
- Applicability to both homogeneous and heterogeneous systems, regardless of their interconnection network

5 Outline
Introduction
Definitions and notations (current section)
Adaptive cyclic scheduling
ACS for homogeneous systems
ACS for heterogeneous systems
Conclusions
Future work

6 Definitions and notations
Algorithmic model:
FOR (i1 = l1; i1 <= u1; i1++)
  FOR (i2 = l2; i2 <= u2; i2++)
    ...
    FOR (in = ln; in <= un; in++)
      Loop Body
    ENDFOR
    ...
  ENDFOR
ENDFOR
- Perfectly nested loops
- Constant flow data dependencies
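For concreteness, here is a minimal Python sketch (not from the slides; bounds and dependence vectors are assumed purely for illustration) of a 2-dimensional instance of this model, a perfectly nested loop whose body reads values produced at fixed offsets, i.e. constant flow dependencies:

```python
# Hypothetical 2-D instance of the algorithmic model: a perfectly nested
# loop with constant flow dependencies d1 = (1, 1) and d2 = (0, 1).
U1, U2 = 5, 5  # upper bounds, assumed for illustration

# Boundary (pre-computed) points are initialised to 1.
A = [[1] * (U2 + 1) for _ in range(U1 + 1)]

for i1 in range(1, U1 + 1):
    for i2 in range(1, U2 + 1):
        # Iteration (i1, i2) depends on (i1-1, i2-1) and (i1, i2-1),
        # i.e. on iterations at constant offsets d1 and d2.
        A[i1][i2] = A[i1 - 1][i2 - 1] + A[i1][i2 - 1]
```

Because the offsets are constant across the whole index space, the dependence pattern is identical at every iteration point, which is what the scheduling analysis below relies on.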

7 Definitions and notations
J: the index space of an n-dimensional loop
ECT: earliest computation time of an iteration point
Rk: the set of points (called a region) of J with ECT k
R0: contains the boundary (pre-computed) points
Con(d1, ..., dq): a cone, the convex subspace formed by q of the m+1 dependence vectors of the problem
Trivial cones: cones defined by dependence vectors and at least one unitary axis vector
Non-trivial cones: cones defined exclusively by dependence vectors
Cone vectors: those dependence vectors di (i <= q) that define the hyperplane in a cone
Chain of computations: a sequence of iterations executed by the same processor
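The ECT of every point, and hence the regions Rk, can be obtained with a single lexicographic sweep of the index space. The sketch below is illustrative only: the dependence vectors are the five from the next slide, and the terminal point is an assumed value, not one given in the slides.

```python
# Hypothetical sketch: compute the earliest computation time (ECT) of
# every point of a 2-D index space under constant dependence vectors.
# R_0 holds the boundary (pre-computed) points; region R_k holds the
# points whose dependencies are all satisfied after step k-1.
deps = [(1, 7), (2, 4), (3, 2), (4, 4), (6, 1)]  # vectors from the slides
U = (12, 12)  # terminal point, assumed for illustration

ect = {}
for i in range(U[0] + 1):
    for j in range(U[1] + 1):
        # Predecessors of (i, j) that lie inside the index space.
        preds = [(i - di, j - dj) for (di, dj) in deps
                 if i - di >= 0 and j - dj >= 0]
        # A point with no in-space predecessor is a boundary point (ECT 0).
        ect[(i, j)] = 0 if not preds else 1 + max(ect[p] for p in preds)

# Region R_k is the set of points with ECT exactly k.
R = {}
for p, k in ect.items():
    R.setdefault(k, set()).add(p)
```

The lexicographic loop order guarantees every predecessor's ECT is already known when a point is visited, since all dependence components are positive here.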

8 Definitions and notations
Index space of a loop with d1 = (1,7), d2 = (2,4), d3 = (3,2), d4 = (4,4) and d5 = (6,1)
The cone vectors are d1, d2, d3 and d5
The first three regions and a few chains of computations are shown

9 Definitions and notations
dc: the communication vector (one of the cone vectors)
j = p + λ·dc: the family of lines of J formed by dc
Cr: a chain formed by dc; |Cr| is the number of iteration points of Cr
r: a natural number indicating the relative offset between chain C(0,0) and chain Cr
C: the set of Cr chains; |C| is the number of Cr chains
|CM|: the cardinality of the maximal chain
Dr_in: the volume of "incoming" data for Cr
Dr_out: the volume of "outgoing" data for Cr
Dr_in + Dr_out: the total communication associated with Cr
#P: the number of available processors
m: the number of dependence vectors, excluding dc
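The chains Cr can be enumerated by walking each point back along dc to the boundary point p that starts its line j = p + λ·dc. A minimal sketch, using the dc = (3, 2) of the next slide and an assumed terminal point:

```python
# Hypothetical sketch: partition a 2-D index space into the chains
# formed along the communication vector d_c. Each chain collects the
# points j = p + lambda * d_c sharing the same starting point p.
dc = (3, 2)   # communication vector from the slides
U = (12, 12)  # terminal point, assumed for illustration

chains = {}
for i in range(U[0] + 1):
    for j in range(U[1] + 1):
        # Walk backwards along d_c to the chain's starting point p.
        p = (i, j)
        while p[0] - dc[0] >= 0 and p[1] - dc[1] >= 0:
            p = (p[0] - dc[0], p[1] - dc[1])
        chains.setdefault(p, []).append((i, j))

# |C_M|: the cardinality of the maximal chain.
max_len = max(len(c) for c in chains.values())
```

All points of one chain are tied by dc, so mapping the whole chain to one processor eliminates the communication dc would otherwise incur.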

10 Definitions and notations
The communication vector is dc = d3 = (3,2)
Chains are formed along dc

11 Outline
Introduction
Definitions and notations
Adaptive cyclic scheduling (current section)
ACS for homogeneous systems
ACS for heterogeneous systems
Conclusions
Future work

12 Adaptive cyclic scheduling (ACS)
Assumptions:
- All points of a chain Cr are mapped to the same processor
- Each chain is mapped to a different processor:
  a) homogeneous case: in round-robin fashion, load balanced
  b) heterogeneous case: according to the available computational power of each processor
- #P is arbitrarily chosen and fixed
- The master-slave model is used in both cases

13 Adaptive cyclic scheduling (ACS)
The ACS algorithm
INPUT: an n-dimensional nested loop with terminal point U.
Master:
(1) Determine the cone vectors.
(2) Compute the cones.
(3) Use QuickHull to find the optimal hyperplane.
(4) Choose dc.
(5) Form and count the chains.
(6) Compute the relative offsets between C(0,0) and the m dependence vectors.
(7) Divide #P so as to best cover the relative offsets below as well as above dc. If no dependence vector exists below (or above) dc, then choose the offset closest to #P above (or below) dc, and use the remaining processors below (or above) dc.
(8) Assign chains to slaves:
  a) (homogeneous systems) in cyclic fashion;
  b) (heterogeneous systems) according to their available computational power, i.e. longer/more chains are mapped to faster processors, shorter/fewer chains to slower processors.
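Steps (7)-(8a) can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the function name, the chain labels, and the split value are all assumptions, chosen to reproduce the 5-slave example on slide 15 (three slaves below dc, two above).

```python
# Hypothetical sketch of master steps (7)-(8a): split #P slaves into
# two groups and assign chains cyclically, one group for the chains
# below d_c and the other for the chains above it.
def assign_chains(chains_below, chains_above, num_procs, split):
    """Map each chain to a slave id (0-based) in round-robin fashion.

    split -- how many of the num_procs slaves serve chains below d_c;
             the remaining num_procs - split serve chains above d_c.
    """
    mapping = {}
    for k, chain in enumerate(chains_below):
        mapping[chain] = k % split                      # slaves 0..split-1
    for k, chain in enumerate(chains_above):
        mapping[chain] = split + k % (num_procs - split)  # the rest
    return mapping

# 5 slaves as on slide 15: three take chains below d_c, two above.
m = assign_chains(['b0', 'b1', 'b2', 'b3'], ['a0', 'a1', 'a2'], 5, 3)
```

Cyclic assignment within each group keeps the load balanced while ensuring that neighbouring chains on the same side of dc land on a fixed, repeating set of processors.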

14 Adaptive cyclic scheduling (ACS)
The ACS algorithm (continued)
Slave:
(1) Send a request for work to the master (communicating the available computational power if in a heterogeneous system).
(2) Wait for the reply; store all chains and sort the points by the region they belong to.
(3) Compute points region by region, along the optimal hyperplane. Communicate only when needed points have not been computed locally.
OUTPUT: (Slave) When no more points remain in memory, notify the master and terminate. (Master) Once all slaves have sent their notification, collect the results and terminate.

15 Adaptive cyclic scheduling (ACS)
ACS for homogeneous systems
d1 = (1,3), d2 = (2,2), d3 = (4,1); dc = d2
C(0,0) communicates with C(0,0)+d1 = C(0,2) and C(0,0)+d3 = C(3,0)
The relative offsets are r = 2 and r = 3
5 slaves, 1 master
S1, S2, S3 are cyclically assigned the chains below dc
S4, S5 are cyclically assigned the chains above dc

16 Adaptive cyclic scheduling (ACS)
ACS for heterogeneous systems
Assumption: every process running on a heterogeneous computer takes an equal share of its computing resources
Notation:
ACPi: the available computational power of slave i
VCPi: the virtual computational power of slave i
Qi: the number of running processes in the queue of slave i
ACP: the total available computational power of the heterogeneous system
ACPi = VCPi / Qi and ACP = Σ ACPi
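The two formulas above transcribe directly into code. The numeric values below are assumptions made purely for illustration; only the formulas ACPi = VCPi / Qi and ACP = Σ ACPi come from the slide.

```python
# Direct transcription of the slide's formulas ACP_i = VCP_i / Q_i and
# ACP = sum(ACP_i). The sample values are assumptions, not slide data.
vcp = [4.0, 2.0, 1.0]  # virtual computational power of each slave
q = [2, 1, 1]          # running processes in each slave's queue

acp = [v / n for v, n in zip(vcp, q)]  # available power per slave
total_acp = sum(acp)                   # total power of the system
```

Note how a fast machine with a loaded queue (VCP 4 shared by 2 processes) offers no more available power than an unloaded machine half its speed.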

17 Adaptive cyclic scheduling (ACS)
ACS for heterogeneous systems (oversimplified example)
d1 = (1,3), d2 = (2,2), d3 = (4,1); dc = d2
C(0,0) communicates with C(0,0)+d1 = C(0,2) and C(0,0)+d3 = C(3,0)
The relative offsets are r = 2 and r = 3
5 slaves, 1 master
S3 has the lowest ACP, so S3 is assigned 4 chains
S1, S2, S4, S5 are assigned 5 chains each
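One way to realise step (8b), giving each slave a chain count proportional to its share of the total ACP, is sketched below. The allocation rule (floor of the exact share, leftovers to the fastest slaves) is an assumption of ours; the slides only state the proportionality. The test values reproduce the slide's example: 24 chains, 5 slaves, the slowest getting 4 chains and the others 5 each.

```python
# Hypothetical sketch of master step (8b): assign each slave a number
# of chains proportional to its share of the total available power,
# so slower slaves receive fewer chains.
def proportional_shares(acp, num_chains):
    total = sum(acp)
    # Floor of the exact proportional share for every slave ...
    shares = [int(num_chains * a / total) for a in acp]
    # ... then hand any leftover chains to the fastest slaves first.
    leftover = num_chains - sum(shares)
    for i in sorted(range(len(acp)), key=lambda i: -acp[i])[:leftover]:
        shares[i] += 1
    return shares
```

Any rule that rounds the exact shares to integers summing to the chain count would do; this one simply biases the remainder toward the fastest slaves.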

18 Adaptive cyclic scheduling (ACS)
Advantages:
- It zeroes the communication cost imposed by as many dependence vectors as possible
- #P is divided into two groups of processors, used in the areas above and below dc respectively, such that chains above dc are cyclically mapped to one group and chains below dc to the other
- This way the communication cost is additionally zeroed along one dependence vector in the area above dc, and along another dependence vector in the area below dc
- Suitable for both homogeneous systems (an arbitrary chain is mapped to an arbitrary processor) and heterogeneous systems (longer/more chains are mapped to faster processors, shorter/fewer chains to slower processors)

19 Outline
Introduction
Definitions and notations
Adaptive cyclic scheduling
ACS for homogeneous systems
ACS for heterogeneous systems
Conclusions (current section)
Future work

20 Conclusions
The total communication cost can be significantly reduced if the communication incurred by certain dependence vectors is eliminated
Preliminary simulations show that the adaptive cyclic mapping outperforms other mapping schemes (e.g. cyclic mapping) by enhancing data locality

21 Outline
Introduction
Definitions and notations
Adaptive cyclic scheduling
ACS for homogeneous systems
ACS for heterogeneous systems
Conclusions
Future work (current section)

22 Future work
Simulate the algorithm on various architectures (such as shared-memory systems, SMPs and MPP systems) and on real-life test cases

23 Thank you. Questions?

24 Selected references
[12] I. Drositis, T. Andronikos, M. Kalathas, G. Papakonstantinou, and N. Koziris, "Optimal loop parallelization in n-dimensional index spaces", in Proc. of the 2002 Int'l Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA'02)
[13] F.M. Ciorba, T. Andronikos, D. Kamenopoulos, P. Theodoropoulos, and G. Papakonstantinou, "Simple code generation for special UDLs", in Proc. of the 1st Balkan Conference in Informatics (BCI'03)
[14] N. Manjikian and T. Abdelrahman, "Exploiting Wavefront Parallelism on Large-Scale Shared-Memory Multiprocessors", IEEE Trans. on Parallel and Distributed Systems, vol. 12, no. 3, pp. 259-271, 2001
[15] G. Papakonstantinou, T. Andronikos, and I. Drositis, "On the parallelization of UET/UET-UCT loops", Journal of Neural Parallel & Scientific Computations, 2001

