Tolerating Communication and Processor Failures in Distributed Real-Time Systems Hamoudi Kalla, Alain Girault and Yves Sorel Grenoble, November 13, 2003.


1 Tolerating Communication and Processor Failures in Distributed Real-Time Systems Hamoudi Kalla, Alain Girault and Yves Sorel Grenoble, November 13, 2003

2 Outline: Introduction; Modeling distributed real-time systems; The fault model; Related work; Processor fault tolerance; Communication fault tolerance; Conclusion and future work

3 Introduction. A compiler turns the high-level program into a model of the algorithm. The fault-tolerant distribution and scheduling heuristic takes this model together with the architecture specification, distribution constraints, execution times, real-time constraints and failure specification, and produces a fault-tolerant distributed static schedule. A code generator then turns the schedule into fault-tolerant distributed code.

4 Modeling distributed real-time systems. a. Algorithm model: « I1 and I2 » are input operations (sensors), « O » is the output operation (actuator), and « A, B and C » are computation operations. [Figure: algorithm graph with operations I1, I2, A, B, C and O]

5 Modeling distributed real-time systems. b. Architecture model: « P1, P2 and P3 » are processors; « B1 and B2 » are communication buses. A processor comprises a computation unit, memory, co-processor, … [Figure: processors P1, P2 and P3 connected by buses B1 and B2]
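A minimal Python sketch of the two models, assuming plain dictionaries and sets as the representation. The operation, processor and bus names come from slides 4 and 5. The data-dependency edges and the processor-to-bus wiring did not survive in the transcript, so the ones below are hypothetical, as are the helper names `topological_order` and `shared_buses`.

```python
# Algorithm model: a data-flow graph, operation -> list of successor operations.
# NOTE: these edges are hypothetical; the slide figure was lost in extraction.
ALGORITHM = {
    "I1": ["A"],   # input operation (sensor)
    "I2": ["B"],   # input operation (sensor)
    "A": ["C"],    # computation operations
    "B": ["C"],
    "C": ["O"],
    "O": [],       # output operation (actuator)
}

# Architecture model: processor -> set of buses it is attached to.
# NOTE: full connectivity is an assumption.
ARCHITECTURE = {
    "P1": {"B1", "B2"},
    "P2": {"B1", "B2"},
    "P3": {"B1", "B2"},
}


def topological_order(graph):
    """Return the operations in an order that respects data dependencies."""
    indegree = {op: 0 for op in graph}
    for successors in graph.values():
        for succ in successors:
            indegree[succ] += 1
    ready = sorted(op for op, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        op = ready.pop(0)
        order.append(op)
        for succ in graph[op]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)
    if len(order) != len(graph):
        raise ValueError("the algorithm graph contains a cycle")
    return order


def shared_buses(architecture, src, dst):
    """Buses over which processor `src` can send data to processor `dst`."""
    return architecture[src] & architecture[dst]
```

A static scheduler would walk the operations in such an order and, for each data dependency between operations placed on different processors, pick one of the shared buses for the communication.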

6 The Fault Model. 1. Tolerating a fixed number of fail-silent processor faults. 2. Tolerating a fixed number of fail-silent bus faults, both complete and partial. [Figure: three copies of the architecture (P1, P2, P3, B1, B2) illustrating complete bus faults, partial bus faults and processor faults]

7 Problem: find a distributed schedule of the algorithm on the architecture that tolerates processor and communication failures. [Figure: the algorithm graph (I1, I2, A, B, C, O) scheduled onto the architecture (P1, P2, P3, B1, B2)]

8 Related Work (1). 1. Time-Triggered Architecture (TTA): active replication of operations and communications. (20 years = 100 master's theses and 25 doctoral theses) 2. Forward Error Correction (FEC): passive or active replication of operations and active replication of communications.

9 Related Work (2). 1. Time-Triggered Architecture (TTA). Processor fault tolerance: k replicas (copies) of each operation are actively allocated to separate processors. Communication fault tolerance: k’ replicas (copies) of each communication are actively allocated to separate buses.

10 Related Work (3). 2. Forward Error Correction (FEC). Processor fault tolerance: k replicas (copies) of each operation are actively or passively allocated to separate processors. Communication fault tolerance: first, each communication is encoded with the FEC code into k’ messages carrying redundant information; next, the k’ messages are actively allocated to separate buses.
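To illustrate coding one communication into k’ messages with redundant information, here is a sketch that uses a single XOR parity fragment, the simplest erasure code. The slides do not say which FEC code is meant, so this choice and the names `fec_encode` and `fec_decode` are assumptions. With k’ - 1 data fragments plus one parity fragment, each sent on a separate bus, the receiver can rebuild the data when any one message is lost.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def fec_encode(data, k_prime):
    """Split `data` into k_prime - 1 equal fragments and append one XOR parity fragment.

    Returns (messages, original_length); the length is needed to strip padding.
    """
    if k_prime < 2:
        raise ValueError("need at least one data fragment and one parity fragment")
    n_data = k_prime - 1
    size = -(-len(data) // n_data)              # ceiling division
    padded = data.ljust(size * n_data, b"\0")   # pad so all fragments are equal
    fragments = [padded[i * size:(i + 1) * size] for i in range(n_data)]
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity], len(data)


def fec_decode(messages, original_length):
    """Rebuild the data from `messages`, where at most one entry is None (lost)."""
    missing = [i for i, m in enumerate(messages) if m is None]
    if len(missing) > 1:
        raise ValueError("single parity cannot recover more than one lost message")
    messages = list(messages)
    if missing:
        received = [m for m in messages if m is not None]
        # XOR of all surviving messages equals the lost one.
        messages[missing[0]] = reduce(xor_bytes, received)
    return b"".join(messages[:-1])[:original_length]
```

Stronger codes (for example Reed-Solomon) tolerate more than one lost message at the price of more redundant data; the single-parity version is used here only because it fits in a few lines.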

11 Outline: Introduction; Modeling distributed real-time systems; The fault model; Related work; Processor fault tolerance; Communication fault tolerance; Conclusion and future work

12 Processor fault tolerance. Use active software replication of operations: each operation is replicated on k different processors to tolerate k processor failures.
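A small sketch of active replication of operations, assuming fail-silent processors. The round-robin placement stands in for the scheduling heuristic, which the slides do not describe; the function names and the `replicas` parameter are hypothetical. The check only verifies that every operation keeps at least one replica on a live processor.

```python
from itertools import combinations


def replicate_operations(operations, processors, replicas):
    """Place each operation on `replicas` distinct processors (round-robin placeholder)."""
    if replicas > len(processors):
        raise ValueError("not enough processors for the requested replication")
    placement = {}
    for index, op in enumerate(operations):
        placement[op] = {
            processors[(index + r) % len(processors)] for r in range(replicas)
        }
    return placement


def survives(placement, failed):
    """True if every operation keeps at least one replica on a live processor."""
    return all(hosts - set(failed) for hosts in placement.values())


def tolerated_failures(placement, processors):
    """Largest f such that the placement survives every set of f failed processors."""
    f = 0
    for count in range(1, len(processors) + 1):
        subsets = combinations(processors, count)
        if all(survives(placement, subset) for subset in subsets):
            f = count
        else:
            break
    return f
```

Calling `tolerated_failures(replicate_operations(["A", "B", "C"], ["P1", "P2", "P3"], 2), ["P1", "P2", "P3"])` reports how many simultaneous fail-silent processor failures that placement withstands, which makes it easy to compare replication levels.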

13 Communication fault tolerance (1). a. Use passive software replication of communications, which needs a « watchdog timer ». b. Split each data communication into k messages (data fragmentation).

14 Communication fault tolerance (2). a. Use passive software replication of communications, which needs a « watchdog timer ».
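The slides give no protocol details, so the following is only a sketch of how passive replication with a watchdog timer could behave: the message is sent on one bus, and a backup copy is sent on the next bus only if nothing has arrived when the watchdog expires. The `Bus` class, the fallback order and the discrete time units are all assumptions made for the illustration.

```python
class Bus:
    """A simulated bus: delivers after `latency` time units, or never if failed."""

    def __init__(self, name, latency, failed=False):
        self.name = name
        self.latency = latency
        self.failed = failed

    def transfer(self, message):
        """Return (delay, message), or None when the bus is fail-silent."""
        if self.failed:
            return None
        return self.latency, message


def send_with_watchdog(message, buses, timeout):
    """Try the buses in order, moving on each time the watchdog expires.

    Returns (bus_name, arrival_time); raises RuntimeError if every bus times out.
    """
    clock = 0
    for bus in buses:
        result = bus.transfer(message)
        if result is not None and result[0] <= timeout:
            return bus.name, clock + result[0]
        # Nothing received before the deadline: the watchdog fires after
        # `timeout` time units and the backup copy goes out on the next bus.
        clock += timeout
    raise RuntimeError("message lost: all buses timed out")
```

In this model a failed first bus delays delivery by one full timeout, which shows why the watchdog value matters for the real-time constraints.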

15 Communication fault tolerance (3). b. Split each data communication into k messages (data fragmentation).
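Data fragmentation can be sketched as follows. The slides only state that each data communication is split into k messages, so the numbering of fragments, the round-robin assignment to buses and all function names here are assumptions.

```python
def fragment(data, k):
    """Split `data` into k numbered fragments of near-equal size."""
    if k < 1:
        raise ValueError("k must be at least 1")
    size, extra = divmod(len(data), k)
    fragments, start = [], 0
    for index in range(k):
        # The first `extra` fragments carry one additional byte.
        end = start + size + (1 if index < extra else 0)
        fragments.append((index, data[start:end]))
        start = end
    return fragments


def dispatch(fragments, buses):
    """Assign fragment i to bus i modulo the number of buses."""
    return {index: buses[index % len(buses)] for index, _ in fragments}


def missing_fragments(received, k):
    """Indices of the fragments that have not arrived."""
    seen = {index for index, _ in received}
    return [i for i in range(k) if i not in seen]


def reassemble(received, k):
    """Rebuild the data once all k fragments are present."""
    if missing_fragments(received, k):
        raise ValueError("cannot reassemble: fragments missing")
    return b"".join(payload for _, payload in sorted(received))
```

Because fragments are numbered, a receiver can tell exactly which ones are missing, so under these assumptions only those fragments would need to be sent again on another bus.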

16 Communication fault tolerance (3). Why data fragmentation of communications? 1. It makes it possible to distinguish between complete and partial communication faults.

17 Communication fault tolerance (4). Why data fragmentation of communications? 2. It enables rapid recovery from processor and bus failures.

18 Recovery from failures (1). 1. Processor fault

19 Recovery from failures (2). 2. Partial bus fault

20 Recovery from failures (3). 3. Complete bus fault

21 Example (1)

22 Example (2)

23 Conclusion and future work. Result: a new method to tolerate both communication and processor failures in distributed real-time systems, which may reduce the load and the overhead of recovery from failures. Future work: implementation of the proposed method in the SynDEx tool; simulations.

24 Questions?

