An Algorithm for Automatically Obtaining Distributed and Fault Tolerant Static Schedules Alain Girault - Hamoudi Kalla - Yves Sorel - Mihaela Sighireanu.


1 An Algorithm for Automatically Obtaining Distributed and Fault Tolerant Static Schedules. Alain Girault, Hamoudi Kalla, Yves Sorel, Mihaela Sighireanu. San Francisco, USA, June 23, 2003. POP ART team & OSTRE team.

2 Outline: Introduction; Modeling distributed real-time systems; Problem: how to introduce fault-tolerance?; The proposed solution for fault-tolerance; Principles and example; Simulations; Conclusion and future work.

3 1. Introduction. Flow of the methodology (block diagram): a high-level program is compiled into a model of the algorithm; this model, together with the architecture specification, distribution constraints, execution times, real-time constraints, and failure specification, is given to a fault-tolerant distribution and scheduling heuristic, which produces a fault-tolerant distributed static schedule; a code generator then produces the fault-tolerant distributed code.

4 2. Modeling distributed real-time systems. (a) Algorithm model: a data-flow graph where I1 and I2 are input operations, A, B and C are computation operations, and O is the output operation. (b) Architecture model: a graph where P1, P2 and P3 are processors and m1, m2 and m3 are communication links.
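Both models can be captured with plain adjacency structures. Below is a minimal Python sketch; the data-flow edges are not stated explicitly on this slide and are inferred from the candidate-list updates in the later example slides, so treat them as an assumption:

```python
# Algorithm model: operation -> list of successor operations.
# Edges are inferred from the example slides (an assumption).
algorithm = {
    "I1": ["B"],   # input operation
    "I2": ["A"],   # input operation
    "A":  ["C"],   # computation
    "B":  ["C"],   # computation
    "C":  ["O"],   # computation
    "O":  [],      # output operation
}

# Architecture model: link -> the pair of processors it connects
# (a ring topology, as the figures suggest).
architecture = {
    "m1": ("P1", "P2"),
    "m2": ("P1", "P3"),
    "m3": ("P2", "P3"),
}

# Predecessors, derived from the successor lists.
preds = {o: [q for q, ss in algorithm.items() if o in ss] for o in algorithm}

# The heuristic's initial candidate list: operations with no predecessors.
inputs = sorted(o for o in algorithm if not preds[o])
```

With this encoding, the initial candidate list comes out as the two input operations, matching the example that starts on slide 14.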

5 3. Problem: how to introduce fault-tolerance? Find a distributed schedule of the algorithm on the architecture which is tolerant to processor failures.

6 4. The proposed solution for fault-tolerance. Solution: a list scheduling heuristic which uses the active software replication of operations and communications. Assumptions: processors are assumed to be fail-silent; the system must tolerate a number of processor failures Npf >= 1.

7 4. The proposed solution for fault-tolerance. Principles (1): each operation and each communication is replicated Npf+1 times, on different processors and links of the architecture graph.

8 4. The proposed solution for fault-tolerance: Principles (2) (figure).

9 4. The proposed solution for fault-tolerance: Principles (3) (figure).

10 4. The proposed solution for fault-tolerance. Principles (4): the schedule pressure σ is used as a cost function to select the best processor p for each operation o.
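The definition of σ was lost with the slide's figure. The sketch below is an assumption borrowed from the SynDEx-style formulation (pressure = earliest start time + latest start time from the end - current critical-path length) and only illustrates the shape of the computation:

```python
def schedule_pressure(earliest_start: float,
                      latest_start_from_end: float,
                      critical_path: float) -> float:
    """Hypothetical schedule-pressure cost: how much scheduling the
    operation on this processor would lengthen the critical path."""
    return earliest_start + latest_start_from_end - critical_path

# Selecting the best processor = the one with the smallest pressure.
candidates = {"P1": schedule_pressure(2.0, 6.0, 8.0),
              "P2": schedule_pressure(5.0, 6.0, 8.0)}
best = min(candidates, key=candidates.get)
```

Here the made-up numbers give pressures 0.0 on P1 and 3.0 on P2, so P1 would be kept.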

11 5. Heuristic
1. O_cand := { o | o is an input operation }; O_sched := ∅;
2. While O_cand ≠ ∅ do
   a. Compute the schedule pressure σ for each operation o of O_cand on each processor p, and keep the smallest Npf+1 results;
   b. Select the best candidate operation o_best, the one with the greatest schedule pressure σ(o_best, p);
   c. Schedule o_best on each of the Npf+1 processors p computed at step a; the communications implied by this schedule are replicated Npf+1 times and scheduled on parallel links;
   d. Try to minimise the start time of o_best on each processor p computed at step a by replicating its predecessors on p [Ahmad et al.];
   e. Update the list of candidate operations: O_cand := (O_cand - {o_best}) ∪ { o | o ∈ succs(o_best) and preds(o) ⊆ O_sched ∪ {o_best} }; O_sched := O_sched ∪ {o_best};
end while;
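The loop above can be sketched in Python. This is a simplification under stated assumptions: the schedule-pressure function is passed in as a black box, step d (replicating predecessors) and the scheduling of the implied communications are omitted, and step b is read as "greatest among the kept pressures":

```python
def ft_list_schedule(succs, pressure, processors, npf):
    """Sketch of the fault-tolerant list-scheduling heuristic.
    succs:    operation -> list of successor operations
    pressure: (operation, processor) -> schedule pressure (a cost)
    Returns operation -> the Npf+1 processors it is replicated on."""
    preds = {o: [q for q, ss in succs.items() if o in ss] for o in succs}
    cand = {o for o in succs if not preds[o]}   # step 1: input operations
    sched, placement = set(), {}
    while cand:                                 # step 2
        # step a: for each candidate, keep the Npf+1 cheapest processors
        kept = {o: sorted(processors, key=lambda p: pressure(o, p))[: npf + 1]
                for o in cand}
        # step b: candidate with the greatest (kept) schedule pressure
        o_best = max(cand, key=lambda o: max(pressure(o, p) for p in kept[o]))
        # step c: replicate o_best on those Npf+1 processors
        # (replication of the implied communications is omitted here)
        placement[o_best] = kept[o_best]
        # step e: update the candidate and scheduled lists
        cand.discard(o_best)
        sched.add(o_best)
        cand |= {o for o in succs[o_best] if set(preds[o]) <= sched}
    return placement
```

With Npf = 1, every operation ends up on two distinct processors, so the failure of any single fail-silent processor leaves at least one replica of each operation alive.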

12 5. Example. Architecture graph: processors P1, P2 and P3, connected by links m1, m2 and m3. Algorithm graph: inputs I1 and I2, computations A, B and C, output O. Failures: Npf = 1 (number of fail-silent processor failures that the system must tolerate).

13 5. Heuristic (repeat of slide 11).

14 5. Example, step 1: initially O_cand = { I1, I2 } and O_sched = { } (Npf = 1).

15 5. Example, step 2 (first iteration): schedule I1 on P1 and P2; O_cand goes from { I1, I2 } to { I2, B }, O_sched from { } to { I1 }.

16 5. Example, step 2 (second iteration): schedule I2 on P1 and P2; O_cand goes from { I2, B } to { A, B }, O_sched from { I1 } to { I1, I2 }.

17 5. Example, step 2 (third iteration): O_sched = { I1, I2 }, O_cand = { A, B }.

18 5. Heuristic (repeat of slide 11).

19 5. Example, step 2.a (third iteration): compute the schedule pressure of each candidate on each processor, σ(A, {P1, P2, P3}) = {7, 10, 9} and σ(B, {P1, P2, P3}) = {9, 6, 8}, and keep the smallest Npf+1 = 2 values: σ(A, {P1, P3}) = {7, 9} and σ(B, {P2, P3}) = {6, 8}.

20 5. Heuristic (repeat of slide 11).

21 5. Example, step 2.b (third iteration): among the kept values σ(A, {P1, P3}) = {7, 9} and σ(B, {P2, P3}) = {6, 8}, A has the greatest schedule pressure, so o_best = A.
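Steps 2.a and 2.b of this iteration can be checked numerically (the pressure values are copied from the example slides):

```python
# Schedule pressures from the example slides; Npf = 1, so keep
# the Npf+1 = 2 smallest values per candidate (step 2.a).
pressures = {
    "A": {"P1": 7, "P2": 10, "P3": 9},
    "B": {"P1": 9, "P2": 6,  "P3": 8},
}
npf = 1
kept = {o: sorted(c.items(), key=lambda kv: kv[1])[: npf + 1]
        for o, c in pressures.items()}

# Step 2.b: select the candidate with the greatest kept pressure.
o_best = max(kept, key=lambda o: max(v for _, v in kept[o]))
```

A keeps {P1: 7, P3: 9} and B keeps {P2: 6, P3: 8}; since 9 > 8, A is selected, and on the next slide A is indeed scheduled on P1 and P3.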

22 5. Heuristic (repeat of slide 11).

23 5. Example, step 2.c (third iteration): schedule A on P1 and P3, the processors kept at step 2.a (σ(A, {P1, P3}) = {7, 9}).

24 5. Heuristic (repeat of slide 11).

25 5. Example, step 2.d (third iteration): to minimise the start time of A, its predecessor I2 is replicated on P3.

26 5. Heuristic (repeat of slide 11).

27 5. Example, step 2.e (third iteration): update the lists: O_sched goes from { I1, I2 } to { I1, I2, A }, O_cand from { A, B } to { B }.

28 6. Simulations.
Aim: compare the proposed heuristic with the HBP heuristic [Hashimoto et al. 2002].
Assumptions: architecture with fully connected processors; number of fail-silent processor failures Npf = 1.
Simulation parameters: communication-to-computation ratio (CCR), defined as the average communication time divided by the average computation time, with CCR = 0.1, 0.5, 1, 2, 5 and 10; number of operations N = 10, 20, ..., 80.
Comparison parameter: Overhead = (length(HTBR or HBP) - length(HTBR without fault-tolerance)) / length(HTBR without fault-tolerance) x 100%.
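The comparison parameter is a plain relative-increase percentage; a one-liner makes the definition unambiguous (the schedule lengths below are made up for illustration):

```python
def overhead_percent(ft_length: float, base_length: float) -> float:
    """Overhead (%) of a fault-tolerant schedule over the schedule
    produced by HTBR without fault-tolerance, as defined on the slide."""
    return (ft_length - base_length) / base_length * 100.0

# Hypothetical lengths: a fault-tolerant schedule of length 150 against
# a non-fault-tolerant one of length 100 gives a 50% overhead.
example = overhead_percent(150.0, 100.0)
```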

29 Impact of the number of operations (plots: no processor failure; one processor fails).

30 Impact of the communication-to-computation ratio (plots: no processor failure; one processor fails).

31 7. Conclusion and future work.
Result: a new scheduling heuristic based on the active replication strategy; it produces a static distributed schedule of a given algorithm on a given distributed architecture, tolerant to Npf processor failures.
Future work: tolerate both processor and communication-link failures; maximise the system's reliability.
