Software Fault Tolerance (SWFT) Microreboots Dependable Embedded Systems & SW Group Prof. Neeraj Suri Abdelmajid Khelil.

1 Software Fault Tolerance (SWFT) Microreboots Dependable Embedded Systems & SW Group Prof. Neeraj Suri Abdelmajid Khelil Dept. of Computer Science TU Darmstadt, Germany

2 Overview of Recovery Mechanisms  Approaches to recover once a failure is encountered  Checkpointing  Log-based recovery  Recovery blocks  NVP, NCP  Reactive microreboots (minimize down-time!) [Candea et al. 2004]  Approaches to prevent failures  Rejuvenation (maximize up-time!)  Proactive microreboots

3 Outline of Today's Lecture  Objectives  Why reboot?  System/fault model  Crash-only design  Architecture  What to microreboot  Failure tree, recovery tree, reboot tree  Tree structuring  Microreboots vs. rejuvenation © DEEDS Group SWFT WS '07

4 Objectives  All software can fail  Well-managed servers today achieve availability of 99.9% to 99%, i.e. roughly 8 to 80 hours of downtime per year  Cost per hour of downtime: $200,000 for an Internet service like Amazon; $6,000,000 for a stock brokerage firm  Improve availability by reducing time-to-recover: rapidly and effectively recover from failure  Hypothesis: recovery performance would be a more fruitful pursuit, and more important for society, than traditional performance  Recovery itself can fail too  Intelligent retry of recovery is needed  Approach here: fine-grained, component-level restarts = MICROREBOOTS

5 Improving the Availability  Availability A = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to recover. For SW: MTTF is days to months, MTTR is minutes to hours  Unavailability U = MTTR / (MTTF + MTTR) ≈ MTTR / MTTF if MTTF >> MTTR  A tenfold decrease of MTTR is just as valuable as a tenfold increase in MTTF!  Design for fast recovery!
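The availability formulas above can be checked numerically. The following is a minimal sketch; the baseline values (one failure per 1000 hours, one hour to recover) are illustrative assumptions, not figures from the lecture.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def availability(mttf: float, mttr: float) -> float:
    """A = MTTF / (MTTF + MTTR); both arguments use the same time unit."""
    return mttf / (mttf + mttr)


def unavailability(mttf: float, mttr: float) -> float:
    """U = MTTR / (MTTF + MTTR), close to MTTR / MTTF when MTTF >> MTTR."""
    return mttr / (mttf + mttr)


def downtime_hours_per_year(avail: float) -> float:
    """Expected yearly downtime for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical baseline: MTTF = 1000 h, MTTR = 1 h.
baseline = unavailability(mttf=1000.0, mttr=1.0)
faster_recovery = unavailability(mttf=1000.0, mttr=0.1)   # MTTR cut tenfold
rarer_failures = unavailability(mttf=10000.0, mttr=1.0)   # MTTF raised tenfold

print(f"baseline        U = {baseline:.6f}")
print(f"MTTR / 10       U = {faster_recovery:.6f}")
print(f"MTTF * 10       U = {rarer_failures:.6f}")
print(f"99.9% available -> {downtime_hours_per_year(0.999):.2f} h down per year")
```

Both improvements give the same unavailability, since 0.1 / 1000.1 equals 1 / 10001. An availability of 99.9% corresponds to about 8.76 hours of downtime per year.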

6 Design Goals  Fast and correct component recovery  Strongly localized recovery with minimal impact on other parts of the system  Fast and correct reintegration of recovered components

7 Why Reboot?  A reboot is a recovery mechanism that (for properly designed software):  (a) unequivocally returns the recovered system to its start state, which is the best understood and best tested state of all  (b) provides a high-confidence way to reclaim resources that are stale or leaked, such as memory and file descriptors  (c) is easy to understand and employ, which makes reboots easy to implement, debug, and automate  A microreboot is a low-cost form of reboot, i.e. one applied at the level of individual fine-grained SW components  Examples:  Killing and restarting a deadlocked thread in database systems  Internet portals routinely kill and restart their web servers  A major search engine periodically performs rolling reboots of all other assigned nodes

8 System/Fault Model  Internet/enterprise services  Large scale  Built from many heterogeneous components  Stringent high-availability requirements  Workloads consist of large numbers of relatively short tasks  Subject to rapid, perpetual evolution  Microrebootable SW (crash-only SW model)  SW failures  Primarily transient/intermittent  Can typically be resolved by rebooting

9 The Crash-Only SW  Candea: "There should be only one way to stop or recover a system: by crashing it"  Why not a "clean shutdown"? Recovery code runs rarely, and whenever it does run, it must work perfectly. Performance reasons; high recovery time  The crashing approach forces recovery code to be exercised regularly as part of normal operation: sacrifices performance, crashes safely, recovers quickly  Crash-only SW components: components must be prepared to be deactivated suddenly  Stop = crash: "power-off switch"  Start = recover: "power-on switch" (switches are external to the component)  Fault model enforcement  Turn unknown faults into crashes

10 Properties of Crash-Only SW  Intra-component properties  State segregation: data recovery/process recovery. To ensure recovery correctness, we must prevent microreboots from inducing inconsistency in application state. Persistent state is managed by dedicated state stores  State stores are crash-only  Abstractions and guarantees provided by the state store match the application requirements  Extra-component properties  Decoupling: components are modules with externally enforced boundaries that provide strong fault containment  All components use timeout-based communication, and all resources are leased rather than permanently allocated  Retryable requests: all requests carry a time-to-live and an indication of whether they are idempotent (RetryAfter(t) exception)
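The retryable-request property can be sketched as follows. This is an illustration under assumptions: only the `RetryAfter(t)` name comes from the slide, while the `Request` fields, the `call_with_retry` helper and its `send` callback are hypothetical.

```python
import time
from dataclasses import dataclass


class RetryAfter(Exception):
    """Raised by a component that is being microrebooted: retry in delay_s seconds."""

    def __init__(self, delay_s: float):
        super().__init__(f"retry after {delay_s} s")
        self.delay_s = delay_s


@dataclass
class Request:
    payload: str
    ttl_s: float        # time-to-live of the request
    idempotent: bool    # safe to re-issue transparently?


def call_with_retry(request, send, sleep=time.sleep, clock=time.monotonic):
    """Re-issue an idempotent request until it succeeds or its TTL expires."""
    deadline = clock() + request.ttl_s
    while True:
        try:
            return send(request)
        except RetryAfter as exc:
            if not request.idempotent:
                raise  # the caller must decide; a blind retry could duplicate effects
            if clock() + exc.delay_s > deadline:
                raise TimeoutError("request time-to-live expired") from exc
            sleep(exc.delay_s)
```

A non-idempotent request is surfaced to the caller instead of being retried, because re-executing it could apply its side effects twice.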

11 Recursive Recovery and Microreboots  Whenever a failure occurs, the system is rebooted so that it returns to its safest, best-understood state  Instead of restarting the entire system, only the component that needs a restart is rebooted (microreboot)  When this reboot fails to cure the failure, recursively larger subsystems are rebooted  The recursion stops only when the failure is cured or a human intervenes  A microreboot is low-cost because it is applied to an individual fine-grained software component  Thus one part can be recovered without affecting other parts of the system  Microreboots are faster than a full reboot

12 Execution Architecture  Component (COM): e.g. thread, process, Java bean  Service (SVC): e.g. OS, application server, Java VM

13 Augmenting the Execution Infrastructure  Monitoring agent  Supervises the health of the system  Reports interesting changes to the recovery manager  Recovery manager  Decides which unit (COM or SVC) should be recovered (r-unit)  Recovery agents  Execute the decisions of the recovery manager  Need a map to navigate the system

14 Monitoring System Health  Three-layered monitoring: platform-level, application-level, and end-to-end monitoring  Platform-level monitoring  Exploits generic knowledge about COM behaviour, e.g. a Java VM can inspect the application's use of synchronization primitives  Application-level monitoring  Application progress counters, e.g. the number of participants, to measure the commit progress  Behaviour monitoring: deviations from pre-agreed behaviour, e.g. a periodic heartbeat is expected from a node but not received  End-to-end monitoring  Exploits the application's end-user-visible interface, e.g. perform an SQL query to check the database system

15 Monitoring System Health (2)  Platform-level checks are the least expensive and are performed more often than application-level ones  The monitoring agents convey monitor readings to the recovery manager  Some monitors may be faulty or provide incomplete information
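Behaviour monitoring by missing heartbeats can be sketched as below. The `HeartbeatMonitor` class, its method names and the timeout value are assumptions made for illustration; the lecture does not prescribe an interface.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Flags components whose periodic heartbeat has not arrived in time."""

    timeout_s: float
    last_seen: dict = field(default_factory=dict)

    def beat(self, component: str, now: float) -> None:
        # Record the arrival time of a heartbeat from `component`.
        self.last_seen[component] = now

    def suspects(self, now: float) -> list:
        # A component is suspect if its last heartbeat is older than the timeout.
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=2.0)
monitor.beat("ses", now=0.0)
monitor.beat("str", now=3.0)
print(monitor.suspects(now=4.0))  # only "ses" has been silent for more than 2 s
```

Since monitors may be faulty or incomplete, a report of this kind is a hint for the recovery manager, not proof of a failure.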

16 Fault Propagation Map and Recovery Map  Maintained by the recovery manager  Fault propagation map (f-map)  Dynamic graph  Captures the currently known paths along which faults propagate  Recovery map (r-map)  Dynamic tree  Constructed based on the f-map  A cycle in the f-map should be considered a single recovery unit
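One way to read "a cycle in the f-map is a single recovery unit" is to group mutually reachable nodes, i.e. the strongly connected components of the graph. The sketch below follows that reading; the adjacency-dict representation and the node names are hypothetical.

```python
def reachable(graph, start):
    """Nodes reachable from `start` by following at least one edge."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def recovery_units(f_map):
    """Group nodes that can contaminate each other into one recovery unit."""
    nodes = set(f_map) | {n for targets in f_map.values() for n in targets}
    reach = {n: reachable(f_map, n) for n in nodes}
    units = set()
    for n in nodes:
        # m joins n's unit when faults can flow from n to m and back again.
        units.add(frozenset({n} | {m for m in reach[n] if n in reach[m]}))
    return units


# Hypothetical f-map: faults in A reach B; B and C contaminate each other.
f_map = {"A": ["B"], "B": ["C"], "C": ["B"]}
for unit in sorted(recovery_units(f_map), key=sorted):
    print(sorted(unit))
```

The quadratic reachability computation is chosen for readability; a linear-time strongly-connected-components algorithm would give the same grouping on large maps.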

17 The Recovery Process (1)  Recovery group of a given r-unit = the set of nodes in the r-map that are reachable from the r-unit, i.e. nodes that can be contaminated by a fault in the r-unit  If the recovery manager decides to recover an r-unit, it actually recovers the r-unit's entire group  An r-unit r is an object with two methods: pre-recovery() and post-recovery()  Recursive recovery of r: [Figure: pseudo-code of the recursive recovery procedure]

18 The Recovery Process (2)  If the failure persists after recover(r), recovery is propagated in the reverse direction of failure propagation  Invoke recovery(upstream of r)  Recursion until the failure is eliminated (can lead to a full reboot) or a failure is detected that requires human intervention  pre-recovery() and post-recovery() define a per-component recovery:  Checkpoint for stateful COM  Log-based rollback for transactional COM  Methods are empty for crash-only COM  Hybrid strategies are possible: rollback for pre-recovery and reboot for post-recovery
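The recursive recovery described above can be sketched as follows. This is an illustration under assumptions: "upstream of r" is modelled as the parent in a tree, the recovery group as the subtree of r, and the `restart` and `failure_persists` callbacks are hypothetical stand-ins for the recovery agents and monitors.

```python
class RUnit:
    """A recoverable unit with per-component recovery hooks."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def pre_recovery(self):
        """E.g. take a checkpoint; empty for crash-only components."""

    def post_recovery(self):
        """E.g. roll back from a log; empty for crash-only components."""

    def group(self):
        """This unit plus every unit below it (its recovery group)."""
        members = [self]
        for child in self.children:
            members.extend(child.group())
        return members


def recover(unit, restart, failure_persists):
    """Recover `unit`'s group; escalate upstream while the failure persists."""
    members = unit.group()
    for member in members:
        member.pre_recovery()
    for member in members:
        restart(member)
    for member in members:
        member.post_recovery()
    if not failure_persists():
        return unit  # the level at which the failure was cured
    if unit.parent is None:
        raise RuntimeError("failure persists after a full reboot; human intervention needed")
    return recover(unit.parent, restart, failure_persists)
```

For example, with a chain `system -> fedrcom -> fedr`, calling `recover(fedr, ...)` first restarts only `fedr`; if the monitors still report a failure, it restarts `fedrcom` together with `fedr`, and finally the whole system.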

19 Case Study: Satellite Ground Station  [Figure labels: XML messaging bus; proxy between high-level language and low-level radio commands; satellite tracker; radio tuner; satellite estimator; failure monitor (MON); recovery manager and agents (REC)]

20 Component MTTF and MTTR  [Tables: per-component MTTF; per-component MTTR in seconds]

21 Reboot Tree (r-map): Definition - A, B and C: r-units - R_BC and R_ABC: recovery agents - B and C must be microrebooted together - If A is microrebooted, B and C must be microrebooted too
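The reboot tree above can be modelled as a small data structure. In this sketch the tree shape (R_ABC above A and R_BC, R_BC above B and C) is inferred from the two rules on the slide, and the function names are hypothetical.

```python
# Recovery agents are internal nodes; r-units are leaves.
REBOOT_TREE = {
    "R_ABC": ["A", "R_BC"],
    "R_BC": ["B", "C"],
}


def restarted_by(tree, node):
    """All r-units (leaves) restarted when `node` is rebooted."""
    if node not in tree:
        return {node}  # a leaf restarts only itself
    units = set()
    for child in tree[node]:
        units |= restarted_by(tree, child)
    return units


def lowest_agent(tree, unit):
    """The recovery agent directly above `unit`."""
    for agent, children in tree.items():
        if unit in children:
            return agent
    raise KeyError(unit)


def microreboot(tree, unit):
    """R-units that restart together with `unit`."""
    return restarted_by(tree, lowest_agent(tree, unit))


print(sorted(microreboot(REBOOT_TREE, "B")))  # B and C restart together
print(sorted(microreboot(REBOOT_TREE, "A")))  # restarting A takes B and C along
```

Changing the nesting of the dictionary is all it takes to try out a different tree structure, which is what the tree transformations in the following slides do.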

22 Depth Augmentation: Addition of New Nodes  [Figure: tree I, "full reboot" (reboot everything when something goes wrong), versus tree II, "microreboot"]  Time to detect the failed component and to recover the system, in seconds: MTTR-II < MTTR-I

23 Subtree Depth Augmentation (1)  The COM fedrcom has  a high MTTR (20.93 seconds), due to HW negotiation  a low MTTF: it crashes often  The COM fedrcom consists of 2 subcomponents  pbcom: maps a serial port to a TCP socket. Very stable, but takes a long time to recover (MTTR = 21.24 sec)  fedr: the front-end driver-radio that connects to pbcom over TCP. Unstable, but recovers very quickly (MTTR = 5.76 sec)

24 Subtree Depth Augmentation (2)  MTTF-II = MTTF-III  MTTR-II > MTTR-III

25 Dependent Failures  The COMs ses and str exhibit correlated failures  Due to functional dependencies: synchronization  When either is rebooted, the other will inevitably have to be rebooted  REC will restart ses, then be told that there is another failure in str

26 Dependent Failures: Group Consolidation

27 Imperfect Failure Detection  There are two mistakes that an imperfect REC can make:  "Guess-too-low": the REC suggests a microreboot at node n, when in fact a microreboot at one of n's ancestors is the minimum needed to fix the problem  The time spent rebooting only n is wasted  "Guess-too-high": the REC suggests a reboot at a level higher than minimally necessary  Recovery time is potentially greater than necessary  A wrong guess is costly when the MTTRs of the COMs differ greatly, so structure the reboot tree accordingly:  Keep the low-MTTR components low in the tree  Promote high-MTTR components toward the top
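The cost of the two mistakes can be made concrete with a small calculation. This sketch assumes that the REC escalates one tree level at a time after each failed attempt; the restart times are hypothetical and are not the measured values from the case study.

```python
def recovery_time(restart_times, guess, needed):
    """Total time until the failure is cured.

    restart_times[i] is the time to reboot the subtree rooted i levels above
    the failed component; `guess` is the level the REC tries first and
    `needed` is the lowest level whose reboot actually cures the failure.
    """
    if guess >= needed:
        # Correct guess or guess-too-high: a single, possibly oversized, reboot.
        return restart_times[guess]
    # Guess-too-low: every attempt below `needed` is wasted before escalating.
    return sum(restart_times[guess:needed + 1])


# Hypothetical restart times in seconds for levels 0 (leaf) to 2 (full reboot).
times = [6.0, 21.0, 25.0]

print(recovery_time(times, guess=1, needed=1))  # perfect guess
print(recovery_time(times, guess=0, needed=1))  # guess-too-low
print(recovery_time(times, guess=2, needed=1))  # guess-too-high
```

With these numbers a guess-too-low adds the wasted leaf reboot on top of the necessary one, while a guess-too-high pays the difference between the two levels. The larger the gap between the restart times, the more a wrong guess costs, which motivates keeping low-MTTR components low in the tree.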

28 Promoting High-MTTR Nodes - Promotion is one-sided group consolidation - If the failure behaviour is symmetrically correlated, as for ses and str, full consolidation is recommended

29 MTTR

30 Summary of Transformations (1)

31 Summary of Transformations (2)

32 Full Rejuvenation vs. Microreboot  [Figure label: false positives]

33 Conclusions  Microreboots  Restart fine-grained components “with a clean slate”  Take only a fraction of the time needed for a full system reboot  Need to know which component to rejuvenate: the structure of the reboot tree can improve reboot time  They provide a simple recovery technique  For Internet services  Which can be supported entirely in middleware  Which requires no changes to apps or a priori knowledge of app semantics  Low cost  Use microreboots aggressively, even when their necessity is less than certain (e.g. proactively)  Reduces recovery time  Reduces time spent detecting/diagnosing failures

34 Literature  George Candea, James Cutler, Armando Fox: “Improving Availability with Recursive Microreboots: A Soft-State System Case Study”, Performance Evaluation Journal, vol. 56, nos. 1-3, March 2004
