Fabián E. Bustamante, Winter 2006 Recovery Oriented Computing Embracing Failure A. B. Brown and D. A. Patterson, Embracing failure: a case for recovery-


1 Fabián E. Bustamante, Winter 2006
Recovery Oriented Computing: Embracing Failure
A. B. Brown and D. A. Patterson, Embracing failure: a case for recovery-oriented computing (ROC), HPTS, 2001
A little of …
A. B. Brown and D. A. Patterson, Undo for operators: Building an undoable e-mail store, USENIX ATC 2003 (Best paper)

2 CS 395/495 Autonomic Computing Systems, EECS, Northwestern University
Availability and today’s apps
Availability is the most important metric for modern computer systems
Availability used to be a solved problem
–Expensive fault-tolerant server
–Vendor-supplied high-availability database system
–All behind a well-firewalled box
Today’s apps are quite different
–Distributed, heterogeneous environment
–Conglomeration of interconnected systems: databases, application servers, middleware, web servers
So: 65% of surveyed sites suffered a customer-visible outage at least once in a 6-month period; 25% suffered three or more in the same period

3 Problem with assumptions
Basic model
–Hardware and software can be built w/ negligible failure rates
–Failure modes of systems can be predicted and tolerated
–Maintenance and repair are error-free procedures
More realistically
–Hardware and software failures are inevitable
–Human failures are inevitable
–Unanticipated failures are inevitable
Your only option: get used to it and embrace failure, i.e. Recovery Oriented Computing (ROC)

4 HW & SW failures are inevitable
Software: functionality is king. A constant race to offer new functionality → sloppy people & buggy code
Hardware: razor-thin margins mean no $ for high-quality, fault-tolerant hardware → commodity, failure-prone hardware
Scale only multiplies the problem!

5 Human failures are inevitable
Large systems rely on human beings for
–Maintenance and repair
–Software configuration and upgrading
–Performance tuning
–Diagnosing and fixing failures
Human beings make mistakes
–At a rate of % under stress
–Share of failures attributed to human error: 70% in electronic systems, 20-53% in missile systems, 60-70% in aircraft failures, 50% in VAX systems, 42% in Tandem systems, …
But modern systems do not take into account the possibility of human failure

6 Unanticipated failures are inevitable
Could you solve this w/ good engineering?
–Not really
Perrow’s work on high-risk technology
–Large servers: complex, reasonably tightly coupled systems performing complex tasks under human guidance … prone to “normal accidents”
–Accidents that arise from the multiple, unexpected, hidden interactions of smaller failures and the recovery systems designed to handle them

7 Recovery Oriented Computing
Focus on repair instead of avoiding failures
Recovery needs to be a first-class part of the system
It must
–Ensure problems are detected fast (for containment)
–Provide assistance in diagnosing their root cause
–Provide repair mechanisms that are trustworthy
–Tolerate errors during recovery
–Complement fault tolerance rather than replace it (redundancy is thus still necessary)
–Automatically track the health of all components, so it should include fault-injection mechanisms
–…
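The last two requirements (health tracking and fault injection) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `Component`, `check_and_recover`, and the use of a restart as the repair action are hypothetical names and choices, not something taken from the slides or the ROC papers.

```python
class Component:
    """Hypothetical service component with a fault-injection switch."""

    def __init__(self, name):
        self.name = name
        self.fault_injected = False
        self.restarts = 0

    def ping(self):
        # Health probe: fails for as long as a fault is injected.
        if self.fault_injected:
            raise RuntimeError(f"{self.name}: injected fault")
        return "ok"

    def restart(self):
        # Stand-in for a real repair action; clears the injected fault.
        self.fault_injected = False
        self.restarts += 1


def check_and_recover(components):
    """Probe every component, repair those that fail, return their names."""
    failed = []
    for component in components:
        try:
            component.ping()
        except RuntimeError:
            failed.append(component.name)
            component.restart()
    return failed


db, web = Component("db"), Component("web")
web.fault_injected = True                    # deliberately break one component
first_pass = check_and_recover([db, web])    # detects and repairs "web"
second_pass = check_and_recover([db, web])   # nothing left to repair
```

Injecting the fault on purpose exercises the detection and repair path before a real failure does, which is one way to build trust in the recovery mechanism.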

8 Undoable store
You have undo for Office, but not for admins?!
Operator undo incorporates three steps
–Rewind: the system is physically rolled back to before the damage
–Repair: admins are not constrained in what repairs they can do
–Replay: the system is logically brought back, incorporating the repair
Two challenges in the 3Rs model
–Timeline management: record the system timeline so that you can edit it during repair and re-execute it during replay
–Keep the system consistent from an external observer’s point of view (even ‘after’ repair)
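The three steps can be sketched on a toy key-value store. This is only an illustration under stated assumptions: `UndoableStore`, its snapshot scheme, and modelling the repair as a function that edits the logged timeline are all hypothetical and far simpler than the system described in the paper.

```python
from copy import deepcopy


class UndoableStore:
    """Toy key-value store supporting rewind, repair, and replay."""

    def __init__(self):
        self.state = {}
        self.timeline = []    # logged operations, in arrival order
        self.snapshots = {}   # timeline position -> copy of the state

    def checkpoint(self):
        # Remember the state at the current timeline position.
        pos = len(self.timeline)
        self.snapshots[pos] = deepcopy(self.state)
        return pos

    def _apply(self, op):
        kind, key, value = op
        if kind == "put":
            self.state[key] = value
        elif kind == "delete":
            self.state.pop(key, None)

    def put(self, key, value):
        op = ("put", key, value)
        self.timeline.append(op)
        self._apply(op)

    def delete(self, key):
        op = ("delete", key, None)
        self.timeline.append(op)
        self._apply(op)

    def undo(self, pos, repair):
        # Rewind: physically restore the state saved at `pos`.
        self.state = deepcopy(self.snapshots[pos])
        # Repair: let the caller edit the operations logged after `pos`.
        repaired = repair(self.timeline[pos:])
        self.timeline[pos:] = repaired
        # Replay: re-execute the edited timeline on the restored state.
        for op in repaired:
            self._apply(op)


store = UndoableStore()
store.put("a", 1)
cp = store.checkpoint()
store.put("b", 2)
store.delete("a")      # the mistake we want to undo
store.put("c", 3)

# Drop the erroneous delete, keep every other logged operation.
store.undo(cp, lambda ops: [op for op in ops if op != ("delete", "a", None)])
```

After the call, the store holds `a`, `b`, and `c`: the work done after the mistake is preserved because it is replayed rather than discarded.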

9 Undo system architecture
[Architecture diagram; only its labels survive]
Components: User, Undo Proxy, Service App, Time travel storage, Timeline log, Undo Manager, Control UI
Connections: Control, Verbs
Annotations: “To be able to roll-back the system”; “Service specific”; “In part to make the undo manager generic”
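One way to read the split between a service-specific part and a generic undo manager is sketched below. The names `Verb`, `UndoManager`, and `MailProxy`, the mail-delivery example, and the dictionary standing in for the service's storage are all assumptions for illustration; the diagram itself does not specify any of these interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Verb:
    """Service-neutral record of one user request."""
    name: str
    args: Tuple


class UndoManager:
    """Generic part: logs and replays verbs without interpreting them."""

    def __init__(self):
        self.timeline: List[Verb] = []

    def record(self, verb: Verb) -> None:
        self.timeline.append(verb)

    def replay(self, execute: Callable[[Verb], None]) -> None:
        for verb in self.timeline:
            execute(verb)


class MailProxy:
    """Service-specific part: turns requests into verbs and executes them."""

    def __init__(self, manager: UndoManager, mailboxes: Dict[str, List[str]]):
        self.manager = manager
        self.mailboxes = mailboxes

    def deliver(self, user: str, message: str) -> None:
        verb = Verb("deliver", (user, message))
        self.manager.record(verb)   # log first, then apply to the service
        self.execute(verb)

    def execute(self, verb: Verb) -> None:
        if verb.name != "deliver":
            raise ValueError(f"unknown verb: {verb.name}")
        user, message = verb.args
        self.mailboxes.setdefault(user, []).append(message)


mailboxes: Dict[str, List[str]] = {}
manager = UndoManager()
proxy = MailProxy(manager, mailboxes)
proxy.deliver("alice", "hello")
proxy.deliver("bob", "status report")

mailboxes.clear()               # stands in for rolling the storage back
manager.replay(proxy.execute)   # the generic manager re-drives the service
```

Because the manager only stores and hands back `Verb` objects, supporting another service in this sketch would mean writing a different proxy, not a different manager.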

