
An efficient active replication scheme that tolerates failures in distributed embedded real-time systems. Alain Girault, Hamoudi Kalla and Yves Sorel.


1 An efficient active replication scheme that tolerates failures in distributed embedded real-time systems
Alain Girault, Hamoudi Kalla and Yves Sorel
Pop Art team and Aoste team
Toulouse, August 22-27, 2004

2 Outline
- Introduction
- Models and problem
- The proposed fault-tolerant method for tolerating:
  - Processor failures
  - Communication link failures
  - Both processor and communication link failures
- Example
- Conclusion and future work

3 1. Introduction
[Figure: design flow with these elements: high-level program; compiler; algorithm model; architecture model; fault model, execution times, and embedding, real-time and distribution constraints; distribution and scheduling fault-tolerant heuristic; fault-tolerant distributed static schedule; code generator; fault-tolerant distributed embedded code.]

4 2. Models: Algorithm (1)
[Figure: algorithm graph with operations I1, I2, A, B, C and O.]
- I1 and I2: input operations (sensors)
- O: output operation (actuator)
- A, B and C: computation operations
- Edges: data dependencies
[Figure: cyclic execution of the algorithm over time (execution i, execution i+1).]

5 2. Models: Hardware architecture with point-to-point links (2)
[Figure: architecture graph with processors P1, P2 and P3, each containing mem, op, com1 and com2, connected by links L12, L13 and L23.]
- Pi: processors
- mem: memory
- op: operator
- comi: communicators
- (comi, Lij, comj): point-to-point links

6 2. Models: Execution characteristics (3)
a. To each processor Pi we associate a list of pairs (oj, duration), where duration is the worst-case execution time (WCET) of operation oj on processor Pi.
[Figure: algorithm graph and architecture graph; the processors are annotated with the lists [(A,5), (B,2), …], [(A,4), (B,1), …] and [(I1,∞), (A,5), (B,2), …].]

7 2. Models: Execution characteristics (4)
b. To each link Li we associate a list of pairs (dj, duration), where duration is the worst-case transmission time (WCTT) of data dependence dj on link Li.
[Figure: algorithm graph and architecture graph; in addition to the processor lists [(A,5), (B,2), …], [(A,4), (B,1), …] and [(I1,∞), (A,5), (B,2), …], the links are annotated with the lists [(A→C,2), …] and [(A→C,3), …].]
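The WCET and WCTT lists of slides 6 and 7 can be held in two lookup tables. The sketch below is only an illustration: the durations come from the slides, but which processor or link carries which list, the link names, the helper names `can_execute` and `fastest_processor`, and the reading of ∞ as "cannot be executed on that processor" are all assumptions.

```python
import math

# Hypothetical WCET table. The durations appear on the slide; the mapping of
# each list to a particular processor is an assumption of this sketch.
WCET = {
    "P1": {"A": 5, "B": 2},
    "P2": {"A": 4, "B": 1},
    "P3": {"I1": math.inf, "A": 5, "B": 2},
}

# Hypothetical WCTT table keyed by (producer, consumer). Link names assumed.
WCTT = {
    "L12": {("A", "C"): 2},
    "L13": {("A", "C"): 3},
}


def can_execute(op, proc):
    """True if op has a finite WCET on proc (unlisted operations count as no)."""
    return math.isfinite(WCET[proc].get(op, math.inf))


def fastest_processor(op):
    """Return the (processor, WCET) pair with the smallest finite WCET for op."""
    options = [(p, times[op]) for p, times in WCET.items() if can_execute(op, p)]
    return min(options, key=lambda pair: pair[1])
```

A scheduling heuristic would consult such tables when comparing candidate placements; here `fastest_processor("A")` simply picks the processor with the smallest listed duration for A.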

8 2. Models: Faults (5)
a. Software is assumed to be reliable.
b. Only hardware faults of processors and links are considered.
c. Sensors and actuators are assumed to be reliable.
d. Processors are assumed to be fail-silent.
[Figure: two copies of the architecture graph (P1, P2, P3; L12, L13, L23), one illustrating processor failures and one illustrating link failures.]

9 Problem
Find a distributed schedule of the algorithm graph on the architecture graph that tolerates processor and communication link failures and satisfies the real-time, embedding and distribution constraints.
[Figure: the algorithm graph (I1, I2, A, B, C, O) and the architecture graph (P1, P2, P3; L12, L13, L23) are combined by distribution and scheduling into a distributed schedule.]

10 Constraints and objective
1. Real-time constraint
- Minimize the run time (latency) of the algorithm graph.
2. Distribution and embedding constraints
- Hardware resources (memory, …) must be minimized to match the cost, power and volume constraints of embedded applications.
3. Fault-tolerance objective
- Tolerate at most P processor failures.
- Tolerate at most L point-to-point communication link failures.

11 Outline
- Introduction
- Models and problem
- The proposed fault-tolerant method for tolerating:
  - Processor failures
  - Communication link failures
  - Both processor and communication link failures
- Example
- Conclusion and future work

12 3. The proposed fault-tolerant method
Principles:
- It is a software solution that can efficiently deal with hardware failures.
- It uses a software-based replication technique to mask errors that arise from hardware faults.
- It can tolerate a fixed number of arbitrary processor and communication link failures.

13 3. The proposed fault-tolerant method (1)
Principle:
1. We use active software redundancy for both operations and communications.
[Figure: the algorithm graph (I1, I2, A, B, C, O) and the architecture graph (P1, P2, P3; L12, L13, L23) are combined by distribution and scheduling into a distributed schedule.]

14 3. The proposed fault-tolerant method (2)
Principle:
[Figure: methodology flow with these elements: algorithm graph (Alg); P processor faults and L link faults; graph transformation; new Alg* with redundancies and exclusion relations; architecture graph (Arc); execution times and embedding, real-time and distribution constraints; distribution and scheduling fault-tolerant heuristic; static fault-tolerant distributed real-time schedule; prediction of timing behavior.]

15 3. The proposed fault-tolerant method (3)
Algorithm graph (Alg) transformation
Processor faults (at most P faults):
a. Initial graph: operations A and B linked by one data dependence (data).
b. Final graph: P+1 replicas of A (A1 … Ai) and P+1 replicas of B (… Bj), linked by the data dependence.
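A minimal sketch of the processor-fault transformation, assuming the graph is given as a list of (producer, consumer) pairs. The function name `replicate_operations` and the choice to link every producer replica to every consumer replica are assumptions made for illustration; the slide itself only states that each operation gets P+1 replicas.

```python
def replicate_operations(edges, p):
    """Replace every operation by p + 1 replicas.

    edges: list of (producer, consumer) data dependences.
    Returns (replicas, new_edges). Linking every producer replica to every
    consumer replica is an assumption of this sketch.
    """
    operations = sorted({op for edge in edges for op in edge})
    replicas = {op: [f"{op}{k}" for k in range(1, p + 2)] for op in operations}
    new_edges = [
        (src, dst)
        for producer, consumer in edges
        for src in replicas[producer]
        for dst in replicas[consumer]
    ]
    return replicas, new_edges


replicas, new_edges = replicate_operations([("A", "B")], p=1)
print(replicas)  # {'A': ['A1', 'A2'], 'B': ['B1', 'B2']}
```

With P = 1 this yields two replicas per operation, so the loss of any single processor still leaves one replica of A and one of B, provided the replicas are placed on distinct processors.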

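16 3. The proposed fault-tolerant method (4)
Algorithm graph (Alg) transformation
Link faults (at most L faults):
a. Initial graph: one replica of A and one replica of B, linked by one data dependence (data).
b. Final graph: still one replica of A and one replica of B, with L+1 replicas of the data dependence.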
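The link-fault case can be sketched in the same style: operations are left alone and each data dependence is copied L+1 times. The function name `replicate_dependences` and the (producer, consumer, copy index) encoding are hypothetical.

```python
def replicate_dependences(edges, link_faults):
    """Copy every data dependence link_faults + 1 times.

    edges: list of (producer, consumer) pairs.
    Returns a list of (producer, consumer, copy_index) triples; the copy
    index is only a label invented for this sketch.
    """
    return [
        (producer, consumer, copy)
        for producer, consumer in edges
        for copy in range(1, link_faults + 2)
    ]


print(replicate_dependences([("A", "B")], link_faults=1))
# [('A', 'B', 1), ('A', 'B', 2)]
```

Each copy would then have to travel over a different route for the redundancy to be of any use against link failures.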

17 3. The proposed fault-tolerant method (5)
Processor and link faults (P = 1 and L = 1):
a. Initial graph: operations A and B linked by one data dependence (data).
b. Operation redundancy: two replicas of each operation (A1, A2 and B1, B2).
c. Data-dependence redundancy: three replicas of the data dependence between A1, A2 and each Bi.

18 3. The proposed fault-tolerant method (6)
Processor and link faults (P = 1 and L = 1):
c. Data-dependence redundancy: three replicas of the data dependence between A1, A2 and each Bi.
d. Data-dependence distribution: the three replicas are distributed between A1, A2 and Bi.

19 3. The proposed fault-tolerant method (7)
Processor and link faults (P = 1 and L = 1):
d. Data-dependence distribution: the three replicas are distributed between A1, A2 and Bi.
Final transformation: [Figure: resulting graph with A1, A2 and Bi.]


21 3. The proposed fault-tolerant method (7)
Processor and link faults (P = 1 and L = 1):
d. Data-dependence distribution: the three replicas are distributed between A1, A2 and Bi.
Final transformation: [Figure: resulting graph with A1, A2, Bi and the routing operation R1.]
R1 is a routing operation; its duration is zero.

22 3. The proposed fault-tolerant method (8)
Processor and link faults (P = 1 and L = 1):
[Figure: two variants of the transformed graph, labelled Case 1 and Case 2, built from A1, A2, R1 and the replicas B1 and B2.]

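23 3. The proposed fault-tolerant method (9)
Processor and link faults (P = 1 and L = 1):
a. Initial graph: A and B linked by a data dependence labelled x.
b. Final graph: A1, A2, R1, B1 and B2, with data dependences labelled y, y1 and y2.
[Figure: the slide relates x, y and y1 + y2; the exact relation is not recoverable from the transcript.]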

24 3. The proposed fault-tolerant method (10)
Processor and link faults (P ≥ 0 and L ≥ 0):
a. Initial graph: operations A and B linked by one data dependence (data).
b. Final graph: P+1 replicas of each operation (A1 … Ai, … Bj) and L routing operations (R1 … Rk).
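For the general case, a short sketch can list where each replica of the consumer gets its input from: the P+1 replicas of the producer plus the L routing operations, that is, P+L+1 sources. The function name `inputs_per_consumer_replica` is hypothetical, and which replica of A feeds each routing operation is deliberately not modelled because the transcript does not say.

```python
def inputs_per_consumer_replica(p, link_faults, producer="A", consumer="B"):
    """Map each consumer replica to the sources that send it the data.

    Sources are the p + 1 producer replicas followed by the link_faults
    routing operations R1..RL. How the routing operations are fed is left
    out of this sketch.
    """
    producers = [f"{producer}{i}" for i in range(1, p + 2)]
    routers = [f"R{k}" for k in range(1, link_faults + 1)]
    return {
        f"{consumer}{j}": producers + routers
        for j in range(1, p + 2)
    }


sources = inputs_per_consumer_replica(p=1, link_faults=1)
print(sources["B1"])  # ['A1', 'A2', 'R1']
```

With P = 1 and L = 1 each replica of B has three sources, so up to two of them may be lost to processor or link failures and one copy of the data still arrives.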

25 Our methodology
[Figure: methodology flow with these elements: algorithm graph (Alg), shown as A and B linked by data; P processor faults and L link faults; graph transformation; new Alg with redundancies and exclusion relations, shown with P+1 replicas of each operation and L routing operations R; architecture graph (Arc); execution times and embedding, real-time and distribution constraints; distribution and scheduling fault-tolerant heuristic; static fault-tolerant distributed real-time schedule; prediction of timing behavior.]

26 Outline
- Introduction
- Models and problem
- The proposed fault-tolerant method for tolerating:
  - Processor failures
  - Communication link failures
  - Both processor and communication link failures
- Example
- Conclusion and future work

27 4. Example
- B1 will receive its input data P+L+1 times (P = 1, L = 1); as soon as it receives the first input, B1 is executed, and it ignores the later inputs.
- start time(B1) = min(end of communication [A1, A2, R])
[Figure: a transformed algorithm graph (A1, A2, R, B1), an architecture graph (processors P1 to P4; links L12, L14, L23, L34), and a temporary schedule over time with one column per processor and per link (P1, P2, P3, P4, L12, L23, L34, L14, L24).]
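The start-time rule of this example can be sketched as a minimum over the arrival times of the P+L+1 copies. The function name `start_time`, the arrival times, and the use of `None` for a copy that never arrives because of a failure are assumptions of this illustration.

```python
def start_time(end_of_communication):
    """Earliest arrival among the copies of the input data.

    end_of_communication maps a source (A1, A2, R, ...) to the time at
    which its copy arrives, or None if the copy is lost to a failure.
    """
    arrivals = [t for t in end_of_communication.values() if t is not None]
    if not arrivals:
        raise ValueError("no copy of the input data arrived")
    return min(arrivals)


# Made-up arrival times, for illustration only.
print(start_time({"A1": 7, "A2": 5, "R": 9}))     # 5
print(start_time({"A1": 7, "A2": None, "R": 9}))  # 7
```

The second call shows the masking effect: when the copy from A2 is lost, B1 still starts, only later.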

28 5. Implementation
a. Replicas of each operation are scheduled on distinct processors.
b. Replicas of each communication are scheduled on disjoint paths.
c. Operations and communications are time-triggered.
[Figure: algorithm sub-graph (A and B linked by data, replicated as A1, A2 and B1, B2) and architecture graph (P1, P2, P3; L12, L13, L23).]
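Rules a and b can be checked mechanically on a candidate placement. The sketch below assumes a placement is a list of processor names and a path is a list of link names, and it reads "disjoint" as "sharing no link"; the function names and the sample placements are hypothetical.

```python
def on_distinct_processors(placement):
    """True if no two replicas of an operation share a processor."""
    return len(set(placement)) == len(placement)


def link_disjoint(paths):
    """True if no link is used by more than one path (or twice by one path)."""
    seen = set()
    for path in paths:
        for link in path:
            if link in seen:
                return False
            seen.add(link)
    return True


# Replicas A1 and A2 on P1 and P2: allowed.
print(on_distinct_processors(["P1", "P2"]))       # True
# Two copies of a communication, one direct and one routed: allowed.
print(link_disjoint([["L12"], ["L13", "L23"]]))   # True
# Both copies use L12: a single link failure would lose both.
print(link_disjoint([["L12"], ["L12", "L23"]]))   # False
```

These checks capture why the rules matter: a shared processor or a shared link would be a single point of failure for all copies that depend on it.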

29 5. Implementation
a. Replicas of each operation are scheduled on distinct processors.
b. Replicas of each communication are scheduled on disjoint paths.
c. Operations and communications are time-triggered.
- start time(B1) = [Best_end_com([A1, A2, R]), Worst_end_com([A1, A2, R])]
[Figure: temporary schedule over time with one column per processor and per link (P1, P2, P3, P4, L12, L23, L34, L14, L24), showing A1, A2, R, B1 and the data transfers.]
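The start-time interval can be sketched as a pair of bounds over the scheduled arrival times. Reading Best_end_com as the earliest arrival and Worst_end_com as the latest one (the case where every earlier copy was lost) is an interpretation, not something the slide spells out; the function name and the times are hypothetical.

```python
def start_time_interval(end_of_communication):
    """Return (best, worst) bounds on the start time of the consumer.

    best:  earliest scheduled arrival among the copies of the data.
    worst: latest scheduled arrival, reached when all earlier copies fail.
    Both readings are assumptions of this sketch.
    """
    times = list(end_of_communication.values())
    return min(times), max(times)


# Made-up arrival times, for illustration only.
print(start_time_interval({"A1": 7, "A2": 5, "R": 9}))  # (5, 9)
```

Under this reading, a static schedule can reserve the slot for B1 from the lower bound while still guaranteeing a start no later than the upper bound.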

30 5. Implementation – Example « link faults »
[Figure: SynDEx* distributes and schedules the algorithm graph (I1, I2, A, B, C, O) on the architecture graph (P1, P2, P3; L12, L13, L23), giving the final schedule for L = 1.]
*SynDEx is a system-level CAD software tool for optimizing the implementation of real-time embedded applications on multicomponent architectures.

31 5. Implementation – Example « processor faults »
[Figure: SynDEx distributes and schedules the algorithm graph (I1, I2, A, B, C, O) on the architecture graph (P1, P2, P3; L12, L13, L23), giving the final schedule for P = 1.]

32 Conclusion and future work
Result:
- A new method, based on graph transformation, that introduces fault tolerance in building distributed real-time systems.
- It uses a software-based replication technique to tolerate a fixed number of arbitrary processor and communication link failures.
Future work:
- Benchmarks
- A case study, for instance with the Cycab automatic vehicle
- Taking sensor failures into account
