Tolerating Communication and Processor Failures in Distributed Real-Time Systems Hamoudi Kalla, Alain Girault and Yves Sorel Grenoble, November 13, 2003.

Presentation transcript:

Tolerating Communication and Processor Failures in Distributed Real-Time Systems Hamoudi Kalla, Alain Girault and Yves Sorel Grenoble, November 13, 2003

2 Outline: Introduction; Modeling distributed real-time systems; The fault model; Related work; Processor fault tolerance; Communication fault tolerance; Conclusion and future work.

3 Introduction [Diagram: a compiler turns the high-level program into a model of the algorithm. The fault-tolerant distribution and scheduling heuristic takes this model together with the architecture specification, distribution constraints, execution times, real-time constraints and failure specification, and produces a fault-tolerant distributed static schedule, from which the code generator produces fault-tolerant distributed code.]

4 Modeling distributed real-time systems a. Algorithm model: « I1 and I2 » are input operations (sensors); « O » is the output operation (actuator); « A, B and C » are computation operations. [Figure: data-flow graph over I1, I2, A, B, C and O]

5 Modeling distributed real-time systems b. Architecture model: « P1, P2 and P3 » are processors; « B1 and B2 » are communication buses. A processor comprises a computation unit, memory, a co-processor, … [Figure: processors P1, P2 and P3 connected by buses B1 and B2]
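The two models on slides 4 and 5 can be written down as plain data structures. The following is only a sketch: the operation and component names come from the slides, but the data-dependency edges and the assumption that both buses connect all three processors are hypothetical, since the original figures are not part of this transcript.

```python
from dataclasses import dataclass

# Operations named on slide 4, with their kind.
OPERATIONS = {
    "I1": "input", "I2": "input",
    "A": "computation", "B": "computation", "C": "computation",
    "O": "output",
}

# HYPOTHETICAL data dependencies (producer, consumer); the real edges
# were in a figure that did not survive in the transcript.
EDGES = [("I1", "A"), ("I2", "A"), ("A", "B"), ("A", "C"), ("B", "O"), ("C", "O")]


@dataclass
class Architecture:
    processors: tuple
    buses: dict  # bus name -> set of processors attached to that bus


# ASSUMPTION: every bus is attached to every processor.
ARCH = Architecture(
    processors=("P1", "P2", "P3"),
    buses={"B1": {"P1", "P2", "P3"}, "B2": {"P1", "P2", "P3"}},
)


def predecessors(op):
    """Operations whose results `op` consumes."""
    return sorted(src for src, dst in EDGES if dst == op)


def shared_buses(arch, p, q):
    """Buses over which processors p and q can exchange data."""
    return sorted(b for b, attached in arch.buses.items()
                  if p in attached and q in attached)
```

A scheduler would consult `predecessors` to order operations and `shared_buses` to choose a route for each communication.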

6 The Fault Model 1. Tolerate a fixed number of fail-silent processors. 2. Tolerate a fixed number of fail-silent buses, with both complete and partial faults. [Figure: three copies of the P1, P2, P3 / B1, B2 architecture illustrating complete bus faults, partial bus faults and processor faults]

7 Problem: Find a distributed schedule of the algorithm on the architecture that tolerates processor and communication failures. [Figure: the algorithm graph (I1, I2, A, B, C, O) scheduled onto the architecture (P1, P2, P3, B1, B2)]

8 Related Work (1) 1. Time-Triggered Architecture (TTA): active replication of operations and communications. (20 years = 100 master's theses and 25 doctoral theses) 2. Forward Error Correction (FEC): passive or active replication of operations and active replication of communications.

9 Related Work (2) 1. Time-Triggered Architecture (TTA): Processor fault tolerance: k replicas (copies) of each operation are actively allocated to separate processors. Communication fault tolerance: k’ replicas (copies) of each communication are actively allocated to separate buses.

10 Related Work (3) 2. Forward Error Correction (FEC): Processor fault tolerance: k replicas (copies) of each operation are actively or passively allocated to separate processors. Communication fault tolerance: first, each communication is encoded by the FEC code into k’ messages carrying redundant information; next, the k’ messages are actively allocated to separate buses.
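To make the FEC idea on this slide concrete, here is a minimal sketch. The slide does not say which FEC code is used, so the example swaps in the simplest one, a single XOR parity message: the data is split into k’ - 1 fragments plus one parity message, and any one lost message can be rebuilt. The function names are hypothetical, and real FEC codes (for example Reed-Solomon) can recover from more than one loss.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def fec_encode(data: bytes, n_messages: int) -> list:
    """Encode data as n_messages - 1 fragments plus one XOR parity message."""
    if n_messages < 2:
        raise ValueError("need at least one fragment and one parity message")
    n_frag = n_messages - 1
    size = -(-len(data) // n_frag)  # ceiling division
    padded = data.ljust(n_frag * size, b"\0")  # zero-pad the last fragment
    fragments = [padded[i * size:(i + 1) * size] for i in range(n_frag)]
    return fragments + [reduce(xor_bytes, fragments)]


def fec_decode(received: list, length: int) -> bytes:
    """Rebuild the data; `received` holds None for a message that was lost."""
    missing = [i for i, m in enumerate(received) if m is None]
    if len(missing) > 1:
        raise ValueError("single parity recovers at most one lost message")
    messages = list(received)
    if missing:
        present = [m for m in messages if m is not None]
        # XOR of all surviving messages equals the missing one.
        messages[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity message and the padding.
    return b"".join(messages[:-1])[:length]
```

Each of the encoded messages would be sent on a separate bus, so the loss of any single bus leaves enough messages to decode.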

11 Outline: Introduction; Modeling distributed real-time systems; The fault model; Related work; Processor fault tolerance; Communication fault tolerance; Conclusion and future work.

12 Processor fault tolerance: Use active software replication of operations, where each operation is replicated on k different processors to tolerate k processor failures.
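Active replication can be illustrated with a small placement sketch. It is not the authors' scheduling heuristic: it uses a plain round-robin assignment, and the function names are hypothetical. It places k + 1 copies of each operation, the minimum needed for one copy to survive k fail-silent processor failures; whether the slide's "k" already counts the original copy is not stated.

```python
def place_replicas(operations, processors, max_failures):
    """Assign each operation to max_failures + 1 distinct processors.

    Round-robin placement, chosen only for illustration.
    """
    replicas = max_failures + 1
    if replicas > len(processors):
        raise ValueError("not enough processors for the requested tolerance")
    n = len(processors)
    return {
        op: [processors[(index + r) % n] for r in range(replicas)]
        for index, op in enumerate(operations)
    }


def survives(placement, failed):
    """True if every operation still has a replica on a live processor."""
    return all(any(p not in failed for p in procs)
               for procs in placement.values())
```

With three processors and `max_failures=1`, every operation runs on two distinct processors, so any single fail-silent processor can be lost without losing an operation.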

13 Communication fault tolerance (1) a. Use passive software replication of communications, which needs a « watchdog timer ». b. Split each data communication into k messages (data fragmentation).

14 Communication fault tolerance (2) a. Use passive software replication of communications, which needs a « watchdog timer ».
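The role of the watchdog timer in passive replication can be sketched as a small timing model. This is an illustration under stated assumptions, not the authors' protocol: buses are tried in a fixed order, a failed bus is fail-silent, and the receiver waits a full `timeout` on a silent bus before the backup bus is used. The function name and parameters are hypothetical.

```python
def deliver(buses, failed_buses, timeout, transfer_time):
    """Return (bus that delivered, total elapsed time).

    Only one bus carries the message at a time (passive replication).
    When the watchdog expires on a silent bus, the next bus takes over.
    """
    if transfer_time > timeout:
        raise ValueError("watchdog would expire before a healthy transfer ends")
    elapsed = 0
    for bus in buses:
        if bus not in failed_buses:
            return bus, elapsed + transfer_time
        elapsed += timeout  # watchdog fires, fall back to the next bus
    raise RuntimeError("all buses failed")
```

In the failure-free case the message costs a single transfer, which is the advantage over active replication; the price is the watchdog delay added for each failed bus.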

15 Communication fault tolerance (3) b. Split each data communication into k messages (data fragmentation).
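Data fragmentation can be sketched as follows. The slides do not say how fragments are mapped to buses, so the round-robin mapping and all function names here are assumptions made for illustration.

```python
def fragment(data: bytes, k: int) -> list:
    """Split data into k consecutive fragments (the last ones may be shorter)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    size = -(-len(data) // k)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(k)]


def assign_to_buses(fragments, buses):
    """ASSUMED round-robin mapping: list of (bus, fragment index, fragment)."""
    return [(buses[i % len(buses)], i, frag)
            for i, frag in enumerate(fragments)]


def lost_on(schedule, failed_bus):
    """Indices of the fragments that were routed over the failed bus."""
    return [i for bus, i, _ in schedule if bus == failed_bus]


def reassemble(received):
    """Rebuild the data from (fragment index, fragment) pairs in any order."""
    return b"".join(frag for _, frag in sorted(received))
```

Under this mapping, a bus failure affects only the fragments listed by `lost_on`, so only those would need to be sent again over another bus rather than the whole communication.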

16 Communication fault tolerance (3) Why data fragmentation of communications? 1. It makes it possible to distinguish between complete and partial communication faults.

17 Communication fault tolerance (4) Why data fragmentation of communications? 2. It enables rapid recovery from processor and bus failures.

18 Recovery from failures (1) 1. Processor fault

19 Recovery from failures (2) 2. Partial bus fault

20 Recovery from failures (3) 3. Complete bus fault

21 Example (1)

22 Example (2)

23 Conclusion and future work. Result: a new method to tolerate both communication and processor failures in distributed real-time systems, which may reduce the load and the overhead of recovery from failures. Future work: implementation of the proposed method in the SynDEx tool; simulations.

24 Questions ?