Ying Zhang&Krishnendu Chakrabarty Presenter Kasım Sert

Fault Recovery Based on Checkpointing for Hard Real-Time Embedded Systems
Ying Zhang & Krishnendu Chakrabarty
Presenter: Kasım Sert, Spring 2007, Boğaziçi University

Introduction
Checkpointing the system state and restoring it after a fault is one way to build a robust, fault-tolerant system. Safety-critical embedded systems often require fault-tolerant computing techniques because of their harsh operating environments. The correctness of these systems depends not only on the result of a computation but also on the time at which that result is produced.
Fault tolerance is achieved by online fault detection combined with checkpointing and rollback.

Introduction
Checkpointing:
(-) Increases task execution time, so a task that would complete on time without checkpointing may miss its deadline.
(+) If the checkpoints are chosen carefully, a task can still complete on time even when faults occur.
Why is fault tolerance needed?
Rapid increase in processor speeds.
Correctness depends on the timely completion of the underlying tasks.
Lower processor voltages => lower noise margins.

Checkpointing in real-time systems
Offline checkpointing schemes: the checkpointing interval is known before task execution.
Online checkpointing schemes: the checkpointing interval can be adapted to fault occurrences.
However, the intervals are generally derived under probabilistic or simplifying assumptions, e.g. a Poisson process or equidistant checkpoints.

Off-line Checkpointing Analysis
Γ = {τ1, τ2, …, τn} is a set of n periodic real-time tasks, where τi = (Ti, Di, Ei): Ti is the period of τi, Di is its deadline (Di ≤ Ti), and Ei is its execution time.
Proposed approaches:
1) Tolerate k faults for each job, termed job-oriented fault tolerance.
2) Tolerate k faults within a hyper-period (defined as the least common multiple of all the task periods), termed hyper-period-oriented fault tolerance.
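As a small illustration of the task model above, the following sketch computes the hyper-period of a task set and the number of jobs each task releases within it. The task values are hypothetical and integer time units are assumed; neither comes from the presentation.

```python
from functools import reduce
from math import gcd

def hyper_period(periods):
    """Least common multiple of integer task periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)

# Hypothetical task set: each task is (Ti, Di, Ei) with Di <= Ti.
tasks = [(4, 4, 1), (6, 5, 2), (12, 10, 3)]

H = hyper_period([T for T, _, _ in tasks])
# Number of jobs each task releases within one hyper-period.
jobs_per_task = [H // T for T, _, _ in tasks]
```

Under hyper-period-oriented fault tolerance, the k faults are shared among all of these jobs, whereas job-oriented fault tolerance budgets k faults for each job separately.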

Job-Oriented Fault Tolerance
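1) Compute the response time Ri of τi iteratively: Ri(k+1) = Ei + Σ ⌈Ri(k) / Th⌉ · Eh, summed over all tasks τh with higher priority than τi, where Th is the period and Eh is the execution time of τh.
2) The iteration terminates either when Ri(k+1) = Ri(k) or when Ri(k+1) > Di. In the former case, τi is schedulable; in the latter case, τi is not schedulable.
Under faulty conditions, the additional time due to checkpointing and recovery should be incorporated.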
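The iteration on this slide can be sketched as follows. This is a minimal fault-free version that assumes the standard fixed-priority response-time recurrence, a starting value of Ei, and integer time units; the function name and the task values are hypothetical.

```python
def response_time(i, tasks):
    """Response time of tasks[i], or None if it is not schedulable.

    tasks: list of (T, D, E) tuples sorted by priority, highest first.
    """
    _, D_i, E_i = tasks[i]
    R = E_i  # assumed starting value of the iteration
    while True:
        # Interference from higher-priority tasks; -(-a // b) is an
        # exact integer ceiling of a / b.
        R_next = E_i + sum(-(-R // T_h) * E_h for T_h, _, E_h in tasks[:i])
        if R_next > D_i:
            return None      # deadline exceeded: not schedulable
        if R_next == R:
            return R         # converged: schedulable
        R = R_next

# Hypothetical task sets, highest priority first.
ok_tasks = [(4, 4, 1), (6, 5, 2), (12, 10, 3)]
bad_tasks = [(4, 4, 2), (6, 5, 3)]
ok_result = [response_time(i, ok_tasks) for i in range(len(ok_tasks))]
bad_result = response_time(1, bad_tasks)
```

To account for faults, Ei and Eh would be replaced by execution times that include checkpointing and recovery overhead.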

Hyperperiod Oriented Fault Tolerance
1) Start from the highest-priority task and calculate the minimum number of checkpoints mi needed to make it schedulable.
2) Calculate the response time Ri.
3) If Ri ≤ Di, move to the next task; otherwise, keep reducing Ri by adding more checkpoints.
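The loop described on this slide can be sketched as below. The fault-cost model is an assumption made for illustration and is not taken from the presentation: checkpoints are equidistant, each costs c time units, each fault re-executes at most one segment of length E / (m + 1), and every job is pessimistically budgeted for all k faults. All function names and values are hypothetical, and integer E, T, D, c are assumed.

```python
from fractions import Fraction
from math import ceil

def wcet_with_checkpoints(E, m, k, c):
    # Assumed model: m checkpoints of cost c each, plus k rollbacks that
    # each repeat at most one segment of length E / (m + 1).
    return E + m * c + k * Fraction(E, m + 1)

def response_time(W_i, D_i, higher):
    # higher: (T_h, W_h) pairs of the already-processed, higher-priority tasks.
    R = W_i
    while True:
        R_next = W_i + sum(ceil(R / T_h) * W_h for T_h, W_h in higher)
        if R_next > D_i:
            return None
        if R_next == R:
            return R
        R = R_next

def assign_checkpoints(tasks, k, c, max_m=20):
    """tasks: (T, D, E) sorted highest priority first.

    Returns the checkpoint count chosen for each task, or None if some
    task cannot meet its deadline with at most max_m checkpoints.
    """
    higher, counts = [], []
    for T, D, E in tasks:
        for m in range(max_m + 1):        # smallest m that meets the deadline
            W = wcet_with_checkpoints(E, m, k, c)
            if response_time(W, D, higher) is not None:
                break
        else:
            return None                   # no m <= max_m works
        higher.append((T, W))
        counts.append(m)
    return counts

counts = assign_checkpoints([(10, 10, 2), (20, 20, 7)], k=1, c=1)
```

In this sketch checkpoints are only ever added to the task currently being examined, so earlier results stay valid; `max_m` is an arbitrary cap that keeps the search finite.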

Comment on the Algorithm
Two key issues are not addressed by the algorithm:
1) Checkpoints are added to the highest-priority task. However, the added checkpoints may belong to a task handled in an earlier iteration; in that case, the calculations made after that task may no longer be valid.
2) Adding more checkpoints does not always improve the overall execution time.

Solving the Deficiencies of Algorithms
To solve the first problem described above, all lower-priority tasks need to be re-examined.
To solve the second:
1) Analysis of a bound based on checkpointing tradeoffs.
2) Analysis of a bound based on timing constraints.
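To illustrate why such bounds exist, the sketch below uses an assumed cost model that is not taken from the presentation: with m equidistant checkpoints of cost c each and k faults, the worst-case execution time is E + m·c + k·E / (m + 1). This expression is convex in m, with its continuous minimum at m = sqrt(k·E / c) − 1, so adding checkpoints beyond that point only lengthens execution. The deadline-based bound simply requires the checkpoint overhead to fit into the slack D − E. All names and numbers are hypothetical.

```python
from math import floor, sqrt

def exec_time(E, m, k, c):
    # Assumed model: checkpoint overhead plus k re-executed segments.
    return E + m * c + k * E / (m + 1)

def best_checkpoint_count(E, k, c):
    # Continuous minimiser of exec_time; because the function is convex,
    # the best integer is one of its two integer neighbours.
    m_star = sqrt(k * E / c) - 1
    lower = max(0, floor(m_star))
    return min((lower, lower + 1), key=lambda m: exec_time(E, m, k, c))

def max_checkpoints_by_deadline(E, D, c):
    # Checkpoint overhead alone must fit into the slack D - E.
    return (D - E) // c

m_best = best_checkpoint_count(E=100, k=2, c=2)
m_cap = max_checkpoints_by_deadline(E=100, D=130, c=2)
```

With these numbers, zero checkpoints gives an execution time of 300, nine checkpoints give 138, and twenty checkpoints give about 149.5, so more checkpoints help only up to a point.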

Proposed Solution Algorithm

Experimental Results
The performance of the proposed fault-tolerant schemes (JFT, HFT) is compared with the rate-monotonic (RM) scheme. It is assumed that RM simply re-executes a job when a fault occurs.

Experimental Results-JFT vs. RM
Re-execution takes extra time. Thus, even though the total task utilization is below 1, RM is not schedulable in the presence of faults.

Experimental Results-HFT vs. RM

Thank You for Listening
Any Questions?
