1 Feedback Based Real-Time Fault Tolerance: Issues and Possible Solutions
Xue Liu, Hui Ding, Kihwal Lee, Marco Caccamo, Lui Sha



2 Major Issues in Software Reliability
Software is becoming more and more complex
–More features → larger code size
–Rapid evolution → introduction of new code
E.g. Apache: 0.8 MLOC (1998), 10 MLOC (2002), 27 MLOC (2004)
E.g. Windows XP: 40-50 MLOC
Gray’s estimate: 1 bug / KLOC

3 Growing Software Complexity
Leads to faults: poorly managed or maintained systems; software bugs and errors.
Managed by human operators
–Shortage of skilled operators due to the growing complexity
–Costly
–To err is human
Sources of computing system downtime (cited from Candea, Stanford ’03):
Category          Source of downtime (percentage)
Hardware          20%
Software          40%
Human operators   40%
Complexity adds difficulty to management and breeds bugs.
- Control the complexity in computer systems!
- Build systems that are robust against software bugs

4 Feedback Control: Reflection
Feedback control has a successful track record in controlling electro/mechanical systems.
Observation 1: Computing systems have been crucial in the success of feedback control
–Digital designs & implementations etc.
Observation 2: Feedback control has appealing properties
–Tolerance of errors (model/sensing/actuation etc.) in the physical process
–Utilizes runtime feedback for error correction
Reflection: Can feedback control help to solve the fault tolerance problem in computing systems?
(Diagram labels: Computing Systems, Feedback Control, Fault tolerance)

5 Idea 1: Feedback Control of Software Execution
–Mechanical systems: Sense (feedback) -> Control (error correction) -> Actuation
–Software systems: Sense (feedback) -> Control (error correction) -> Execution
Idea 2: Using Simplicity to Control Complexity
–A simple and reliable core gives acceptable performance.
–The system under complex control software remains in states that are recoverable by the simple core (achieving fault tolerance).
Q: Feedback control can help to tolerate errors in mechanical systems; can it also help to tolerate software errors?
Targeted applications: real-time control systems
(Diagram labels: Feedback Control; Tolerant of Errors in Mechanical Systems; Tolerant of Errors in Software Systems)

6 A Typical Feedback Control Loop for Mechanical Systems
Sense: measure the system output and identify whether an error exists
Control: decision
Actuation: execution
(Block diagram: Reference Input, comparison with the fed-back output (-), Controller (decision), Actuator (execution), Mechanical System (Plant), Sensor (sensing/error identification))
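The sense/control/actuate cycle above can be sketched in a few lines. This is a minimal illustration, not anything from the slides: the toy plant (an integrator), the proportional gain, and the `actuator_scale` parameter that models an imperfect actuator are all assumptions chosen to show how feedback absorbs an actuation error.

```python
def run_feedback_loop(reference, steps=50, gain=0.5, actuator_scale=1.0):
    """Drive a toy integrator plant toward `reference` with proportional feedback.

    `actuator_scale` is a hypothetical knob: 1.0 is a perfect actuator,
    0.8 means the actuator delivers only 80% of the commanded change.
    """
    output = 0.0
    history = []
    for _ in range(steps):
        measured = output                      # Sense: read the system output
        error = reference - measured           # Compare with the reference input
        command = gain * error                 # Control: decide on a correction
        output += actuator_scale * command     # Actuation: plant integrates the command
        history.append(output)
    return history


ideal = run_feedback_loop(reference=1.0)
degraded = run_feedback_loop(reference=1.0, actuator_scale=0.8)
print(f"ideal final output:    {ideal[-1]:.6f}")
print(f"degraded final output: {degraded[-1]:.6f}")
```

Even with the degraded actuator the loop still converges to the reference, only more slowly, because each iteration re-measures the remaining error instead of trusting the previous command.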

7 Related Work – Simplex Architecture
–A simple, reliable core: the simple high assurance control subsystem (HAC)
–Diversity in the form of 2 alternatives (HAC, HPC); the HPC is the complex high performance control subsystem
–Feedback control of the software execution: Sense (feedback) -> Decision (control/error correction) -> Execution (actuation)
(Data flow block diagram: Plant, Decision, HAC, HPC)

8 Drawbacks of Simplex
P1: The analytically redundant high assurance controller (HAC) runs in parallel with the complex controller (HPC)
–Lowers system performance, increases operating costs
–Limits the application of Simplex to safety-critical domains only
P2: HAC and HPC must run at the same period
Our new proposal: On-demand Real-Time Guard (ORTGA)
–HAC only runs when a fault occurs!
Design Goals of ORTGA
1. Functionality similar to Simplex
2. Much less resource usage
3. Flexibility

9 ORTGA Architecture: Key Ideas
(1) Reduce the resource usage of Simplex
Solution: “On-demand” execution of HAC.
–Only when the control under HPC is detected as faulty is the HAC switched in to take over the plant
(2) Flexibility
Solution: the periods of HAC and HPC are multiples of a common subperiod, so HAC and HPC can have different periods.
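The two key ideas can be combined in a small dispatcher sketch. Everything here is an illustrative assumption rather than the ORTGA implementation: the function name `ortga_schedule`, the tick-based timeline, and the choice to release the HAC immediately at the tick where the fault is detected.

```python
def ortga_schedule(ticks, hpc_multiple, hac_multiple, fault_tick=None):
    """Return the controller released at each subperiod tick (or None).

    hpc_multiple / hac_multiple: controller periods expressed as integer
    multiples of the common subperiod (hypothetical parameters).
    fault_tick: tick at which the HPC is detected as faulty, if any.
    """
    schedule = []
    for tick in range(ticks):
        faulty = fault_tick is not None and tick >= fault_tick
        if not faulty:
            # Normal operation: only the HPC runs; the HAC uses no CPU time.
            released = "HPC" if tick % hpc_multiple == 0 else None
        else:
            # On-demand recovery: the HAC takes over at its own period.
            released = "HAC" if (tick - fault_tick) % hac_multiple == 0 else None
        schedule.append(released)
    return schedule


# HPC period = 2 subperiods, HAC period = 3 subperiods, fault detected at tick 4.
print(ortga_schedule(8, hpc_multiple=2, hac_multiple=3, fault_tick=4))
```

Because both periods are counted in subperiod ticks, the two controllers do not need to share a period, and in the fault-free case the HAC never appears in the schedule.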

10 Background: Maximum Stability Region
The largest region of the state space in which the system is still stable under the current controller
(Figure labels: State Constraints; Lyapunov Functions; Stability Region; Maximum Stability Region (Recovery Region))

11 How to determine the Maximum Stability Region?
In the operation of a plant, there is a set of state constraints representing safety, physical device limitations, environmental and other operational requirements. They can be represented as a normalized polytope, $C^T X \le 1$, in the N-dimensional state space.
We must be able to
–take the control away from a faulty …
(Figure: Operation Constraints and Admissible States; labels: State constraints, Admissible States)
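A state is admissible when every row of the normalized polytope inequality holds, that is, when each constraint vector c_i satisfies c_i^T x <= 1. The sketch below checks this for a two-dimensional state; the constraint values (a position bound of 0.5 and a velocity bound of 2.0) are made-up numbers for illustration only.

```python
def is_admissible(state, constraints):
    """Return True if c^T x <= 1 holds for every constraint row c."""
    return all(
        sum(c_k * x_k for c_k, x_k in zip(row, state)) <= 1.0
        for row in constraints
    )


# Hypothetical limits: |position| <= 0.5 and |velocity| <= 2.0,
# each written as two normalized half-space constraints.
CONSTRAINTS = [
    [2.0, 0.0],    #  position / 0.5 <= 1
    [-2.0, 0.0],   # -position / 0.5 <= 1
    [0.0, 0.5],    #  velocity / 2.0 <= 1
    [0.0, -0.5],   # -velocity / 2.0 <= 1
]

print(is_admissible([0.3, 1.0], CONSTRAINTS))   # inside both limits
print(is_admissible([0.6, 0.0], CONSTRAINTS))   # position limit violated
```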

12 Maximum Stability Region
A stability region is closed with respect to the operations of the simple controller. It is bounded by a Lyapunov function inside the polytope. The maximum recovery region can be found using LMIs (linear matrix inequalities).
(Figure: State Constraints and the switching rule (Lyapunov function); labels: State constraints, Recovery Region, Lyapunov function)
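One common way to realize such a region, assumed here rather than stated on the slide, is a quadratic Lyapunov function V(x) = x^T P x with the recovery region {x : x^T P x <= 1}. That ellipsoid lies inside the polytope exactly when c^T P^{-1} c <= 1 for every constraint row c. The sketch below checks both conditions for a 2-D state; the matrix `P` and the constraint rows are hypothetical values, and no LMI solver is involved.

```python
def quadratic_form(matrix, vector):
    """Return vector^T * matrix * vector for a square nested-list matrix."""
    n = len(vector)
    return sum(vector[i] * matrix[i][j] * vector[j]
               for i in range(n) for j in range(n))


def inverse_2x2(matrix):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def in_recovery_region(state, p_matrix):
    """True if the state lies inside the ellipsoid x^T P x <= 1."""
    return quadratic_form(p_matrix, state) <= 1.0


def ellipsoid_inside_polytope(p_matrix, constraints):
    """True if {x : x^T P x <= 1} satisfies c^T x <= 1 for every row c."""
    p_inverse = inverse_2x2(p_matrix)
    return all(quadratic_form(p_inverse, row) <= 1.0 for row in constraints)


# Hypothetical Lyapunov matrix and the same style of normalized constraints.
P = [[6.25, 0.0], [0.0, 0.5]]
ROWS = [[2.0, 0.0], [-2.0, 0.0], [0.0, 0.5], [0.0, -0.5]]

print(ellipsoid_inside_polytope(P, ROWS))
print(in_recovery_region([0.2, 0.5], P))
print(in_recovery_region([0.3, 1.0], P))
```

Note that the last state is admissible with respect to the polytope but lies outside the ellipsoid, which is why the recovery region, not the raw constraints, serves as the switching rule.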

13 Research Issues of ORTGA
How to detect faults in HPC
–Timing faults:
  Application-level support: a monitor detects missed heartbeat messages
  OS support: the scheduler detects task deadline misses
–Other faults: a wide range of traditional fault detection techniques can be used.
When to recover if a fault in HPC is detected?
–Recover early? Too early: false alarms
–Recover late? Too late: could not recover in time
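The application-level heartbeat check can be sketched as follows. The class name, the integer-millisecond timestamps, and the `max_missed` tolerance are assumptions for illustration; timestamps are passed in explicitly so the logic does not depend on a real clock.

```python
class HeartbeatMonitor:
    """Counts heartbeats that the HPC should have sent but did not."""

    def __init__(self, period_ms, max_missed):
        self.period_ms = period_ms      # expected spacing between heartbeats
        self.max_missed = max_missed    # misses tolerated before suspecting a fault
        self.last_beat_ms = 0

    def beat(self, now_ms):
        """Record a heartbeat received at time now_ms."""
        self.last_beat_ms = now_ms

    def missed_count(self, now_ms):
        """Number of whole periods elapsed since the last heartbeat."""
        return max(0, (now_ms - self.last_beat_ms) // self.period_ms)

    def fault_suspected(self, now_ms):
        """True once more heartbeats are missing than the tolerance allows."""
        return self.missed_count(now_ms) > self.max_missed


monitor = HeartbeatMonitor(period_ms=10, max_missed=1)
monitor.beat(10)
monitor.beat(20)
print(monitor.fault_suspected(35))   # one heartbeat late: tolerated
print(monitor.fault_suspected(45))   # two heartbeats missing: suspect a fault
```

Tolerating a small number of misses is one way to limit false alarms, at the price of a later detection time.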

14 When to recover
Why not recover too early?
–Control tasks have been shown to tolerate several deadline misses
–Sometimes the system just has some delay (overload, communication delay, etc.)
–These are not “real” faults
–Try to minimize recoveries due to false alarms
Why not recover too late?
–If you recover too late, there is no time to make the system stable!

15 Right Time To Recover (RTTR)
An example of a “desirable” late but timely recovery (under RM)
Assumption: the fault is detected at t=2.0, before its task deadline D=8
Observation: sometimes, a late but timely recovery makes the system more schedulable
Find the RTTR instead of minimizing the MTTR!

16 A possible solution to determine RTTR
Idea
–Recover as late as possible,
–But not too late
If the state under HPC is about to leave the HAC-established stability region, recover! Otherwise, wait (maybe HPC is still OK).
(Figure: When to recover? Heartbeats HB1 (t_1) and HB2 (t_2) arrive; the monitor finds HB3 missing (t_3); a prediction of the plant state is compared with the stability region S of the controlled plant; other labels: t_s, t_r, Recovered Threads)
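The "as late as possible, but not too late" rule can be sketched as a prediction check. The prediction model below (straight-line extrapolation of the state over the time a recovery would take) and the diagonal quadratic region are simplifying assumptions, not the method from the slides; all names and numbers are hypothetical.

```python
def in_stability_region(state, weights):
    """True if sum(w_k * x_k^2) <= 1, a diagonal quadratic region (assumed shape)."""
    return sum(w * x * x for w, x in zip(weights, state)) <= 1.0


def should_recover_now(state, rate, recovery_delay, weights):
    """Recover only if waiting would put the predicted state outside the region.

    state:          latest plant state received from the HPC
    rate:           estimated rate of change of each state variable
    recovery_delay: time needed before the HAC is actually in control
    """
    predicted = [x + r * recovery_delay for x, r in zip(state, rate)]
    return not in_stability_region(predicted, weights)


WEIGHTS = [6.25, 0.5]
state, rate = [0.2, 0.5], [0.5, 0.0]

print(should_recover_now(state, rate, recovery_delay=0.1, weights=WEIGHTS))  # still safe: wait
print(should_recover_now(state, rate, recovery_delay=0.5, weights=WEIGHTS))  # would exit: recover
```

A monitor that has found a heartbeat missing could call this check each subperiod and defer the switch to the HAC for as long as it returns False.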

17 Performance Gain of ORTGA
Reduce resource usage: on-demand execution of HAC
HPC’s timing parameters: {C_p, T_p}
HAC’s timing parameters: {C_a, T_a}
A total savings of:
Relative saving:
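The slide's own formulas are not present in this transcript. One plausible reading, offered only as an assumption, treats C as worst-case execution time and T as period, so that each controller's processor utilization is C/T. If Simplex runs both controllers all the time while ORTGA runs only the HPC in the fault-free case, the utilizations and the resulting saving would be:

```latex
% Assumed interpretation, not taken from the slide:
% C = execution time, T = period, U = processor utilization.
\begin{align*}
U_{\text{Simplex}} &= \frac{C_p}{T_p} + \frac{C_a}{T_a}, &
U_{\text{ORTGA}} &= \frac{C_p}{T_p} \quad \text{(fault-free operation)} \\[4pt]
\text{Total saving} &= U_{\text{Simplex}} - U_{\text{ORTGA}} = \frac{C_a}{T_a}, &
\text{Relative saving} &= \frac{C_a / T_a}{C_p / T_p + C_a / T_a}
\end{align*}
```

Under this reading, when both controllers share a period, as Simplex requires, the relative saving reduces to C_a / (C_p + C_a).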

18 Ongoing Work: A Proof-of-Concept System
Double Inverted Pendulum System
- Double Quanser inverted pendulum with custom-made tracks
- PC/104-sized, i486-compatible system
- Customized Linux 2.6 kernel and root image in flash memory
- ORTGA middleware layer

19 Conclusions
Feedback Based Real-Time Fault Tolerance
–Leverage feedback control of software execution
ORTGA Architecture
–On-demand execution of the reliable core (HAC) only when a fault occurs
–Significantly reduces resource usage
Issues and possible solutions
–How to detect faults
–When to recover to maintain system stability
–How to find the RTTR (instead of minimizing the MTTR)

20 Backup Slides

21 Software Fault Model in RT Control Systems
Timing fault: a task misses its deadlines
Capability abuse:
–Corrupting others’ code or data
–Unauthorized acquisition of process/resource management capability
Semantic fault: incorrect results that can lead to:
–Poor control performance
–Instability in the plant
Fault type         Countermeasure
Timing fault       GRMS
Semantic fault     Analytic redundancy (simple & complex controllers)
Capability abuse   Privilege management

