
1 Self-Adapting Scheduling for Tasks with Dependencies in Stochastic Environments Ioannis Riakiotakis, Florina M. Ciorba, Theodore Andronikos and George Papakonstantinou National Technical University of Athens, Greece cflorina@cslab.ece.ntua.gr www.cslab.ece.ntua.gr 5th International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks (HeteroPar'06)

2 Talk outline
The problem
Inter-slave synchronization mechanism
Overview of the Distributed Trapezoid Self-Scheduling algorithm
Self-Adapting Scheduling
Stochastic environment modeling
Results
Conclusions & future work

3 The problem we study
Scheduling problems with task dependencies
Target systems: stochastic systems, i.e., real non-dedicated heterogeneous systems with fluctuating load
Approach: adaptive dynamic load balancing

4 Algorithmic model & notations

for (i1 = l1; i1 <= u1; i1++) {
    ...
    for (in = ln; in <= un; in++) {
        S1(I);
        ...
        Sk(I);
    }
    ...
}

Perfectly nested loops
Constant flow data dependencies
General program statements within the loop body
J – index space of an n-dimensional uniform dependence loop
The set of dependence vectors

5 More notations
P1, ..., Pm – slaves
VPk – virtual computing power of slave Pk
Σ_{k=1..m} VPk – total virtual computing power of the system
qk – number of processes/jobs in the run-queue of slave Pk, reflecting its total load
Ak – available computing power of slave Pk
Σ_{k=1..m} Ak – total available computing power of the system

6 Synchronization mechanism (1)
Ci – chunk size at the i-th scheduling step
Vi – projection of Ci along the scheduling dimension uc
u1 – called the synchronization dimension, denoted us; synchronization points are introduced along us
u2 – called the scheduling dimension, denoted uc; chunks are formed along uc
(IPDPS'06)

7 Synchronization mechanism (2)
SPj – synchronization point
M – the number of SPs along the synchronization dimension us
H – the interval between two SPs; H is the same for every chunk
SCi,j – set of iterations of chunk Ci between SPj-1 and SPj
Current slave – the slave assigned chunk Ci
Previous slave – the slave assigned chunk Ci-1

8 Synchronization mechanism (3)
Ci-1 is assigned to Pk-1, Ci to Pk, and Ci+1 to Pk+1.
When Pk reaches SPj+1, it sends to Pk+1 only the data Pk+1 requires (i.e., the iterations imposed by the existing dependence vectors).
Afterwards, Pk receives from Pk-1 the data required for the current computation.
[Figure: chunks Ci-1, Ci, Ci+1 on slaves Pk-1, Pk, Pk+1 across SPj, SPj+1, SPj+2, showing the communication sets, the subchunks SCi,j+1 and SCi-1,j+1, and the points computed at moments t and t+1.]
Slaves do not reach an SP at the same time ―> execution proceeds in a wavefront fashion.
H should be chosen so as to keep the communication/computation ratio < 1.

9 Overview of the Distributed Trapezoid Self-Scheduling (DTSS) Algorithm
Divides the scheduling dimension into decreasing chunks.
First chunk: F = |uc| / (2 × Σ_{k=1..m} Ak), where:
  |uc| – the size of the scheduling dimension
  Σ_{k=1..m} Ak – the total available computational power of the system
Last chunk: L = 1
N = (2 × |uc|) / (F + L) – the number of scheduling steps
D = (F − L) / (N − 1) – chunk decrement
Ci = Ak × [F − D × (Sk-1 + (Ak − 1)/2)], where Sk-1 = A1 + … + Ak-1
DTSS selects chunk sizes based on:
  the virtual computational power of a processor, VPk
  the number of processes in the run-queue of each processor, qk
(IPDPS'06)

10 Self-Adapting Scheduling – SAS (1)
SAS is a self-scheduling scheme.
It is NOT a decreasing-chunk algorithm.
Built upon the master-slave model.
Each chunk size is computed based on:
  the history of computation times of previous chunks on the particular slave
  the history of jobs in the run-queue of the particular slave
  the current number of jobs in the run-queue of the particular slave
Targeted at stochastic systems.

11 Self-Adapting Scheduling – SAS (2)
Terminology used with SAS:
Vk1, …, Vkj – the sizes of the first j chunks assigned to Pk, and tk1, …, tkj – their computation times
Vk1 > … > Vkj, and Vk1 = 1
qk1, …, qkj – the number of jobs in the run-queue of Pk when it was assigned its first j chunks
– the average time per iteration for the first j chunks of Pk
– the estimated computation time for executing its (j+1)-th chunk
tRef – the execution time of the first chunk of the problem (reference time); all processors are expected to compute their chunks within tRef

12 SAS – description (1)
Master side:
Initialization:
(M.a) Register the slaves; store each reported VPk and initial load qk1.
(M.b) Sort the slaves in decreasing order of their VPk, considering VP1 = 1; assign the first chunk, Vk1, to each slave.
While there are unassigned iterations do:
(M.1) Receive a request from Pk and store its reported qkj+1 and tkj.
(M.2) Determine the size of the next chunk. If Pk took longer than tRef to compute its j-th chunk, its (j+1)-th chunk is decreased, and vice versa.

13 SAS – description (2)
Slave side:
Initialization: Register with the master; report VPk and the initial load qk1.
(S.1) Send a request for work to the master; report the current load (qkj+1) and the time (tkj) spent completing the previous chunk.
(S.2) Wait for reply/work. If there is no more work, terminate. Otherwise, receive the size of the next chunk (Vkj+1) and compute it.
(S.3) Exchange data at SPs as described on slide 7.
(S.4) Measure the completion time tkj+1 for chunk Vkj+1.
(S.5) Go to (S.1).

14 Stochastic Environment Modeling (1)
Existing real non-dedicated systems have fluctuating load.
Fluctuating load is non-deterministic ―> stochastic process modeling.
The inter-arrival time of incoming foreign jobs is assumed to be exponentially distributed (λ – arrival rate).
The lifetime of incoming foreign jobs is assumed to be exponentially distributed (μ – service rate).

15 Stochastic Environment Modeling (2)
[Figure: timeline of system-load fluctuation showing parallel-job chunks and foreign jobs against tRef and 2×tRef, with qkj = 0+1, 1+1 and 2+1. Three regimes: slow fluctuation (inter-arrival time > tRef), medium (inter-arrival time ~ tRef, i.e., λ ~ μ and inter-arrival time ~ service time), fast (inter-arrival time < tRef).]

16 Implementation and testing setup
The algorithms are implemented in C and C++.
MPI is used for master-slave and inter-slave communication.
The heterogeneous system consists of 7 dual-node machines (12+2 processors):
  3 Intel Pentium III machines, 1266 MHz with 1 GB RAM (called zealots), assumed to have VPk = 1
  4 Intel Pentium III machines, 800 MHz with 256 MB RAM (called kids), assumed to have VPk = 0.5 (one of them is the master)
The interconnection network is Fast Ethernet at 100 Mbit/s.
One real-life application: Floyd-Steinberg error-dithering computation.
The synchronization interval is H = 100.

17 Results with fast, medium and slow load fluctuations

18 Conclusions
Loops with dependencies can now be dynamically scheduled on stochastic systems.
Adaptive load-balancing algorithms efficiently compensate for the system's heterogeneity and foreign load fluctuations for loops with dependencies.

19 Future work
Establish a model for predicting the optimal synchronization interval H and minimizing the communication.
Model the foreign load with other probabilistic distributions and analyze the relationship between distribution type and performance.

20 Thank you! Questions?

