Recursive Restartability in a Networked Ground Station (RRINGS)
Rushabh Doshi and Rakesh Gowda
Computer Science Department, Stanford University

Fall 2001 - CS444A

Introduction
- Hypothesis:
  - In conjunction with fault detection, enabling the ground station (GS) for Recursive Restartability (RR) will increase system availability.
- Approach:
  - Verify the applicability of RR to a single GS node.
  - Design a framework for enabling RR in new and existing GS modules and systems.
  - Integrate with the Fault Detection (FD) component.

Current State of the Art
- The restart scalpel is a novel approach (Candea, Fox).
- Sledgehammer restarts:
  - MS Cluster Server (formerly Wolfpack) uses clustering and application-level restarts to achieve higher availability.
  - An unnamed Internet portal performs prophylactic restarts of Apache.
- However, none of the above uses an RR scalpel.
- We are developing RR scalpel techniques.

Program Flow
- Wait for a fault message from the Fault Detector.
- Consult an oracle to determine what to restart.
- Restart those components.
- A decision tree is the oracle:
  - Construct the decision tree.
  - Capture restart dependency information.
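
The program flow above can be sketched as a small loop. This is only an illustration under stated assumptions: the function name `run_recovery_loop`, the use of a `queue.Queue` as the channel from the fault detector, the `None` stop sentinel, and the dictionary used as a stand-in oracle are all hypothetical and not taken from the RRINGS implementation.

```python
import queue


def run_recovery_loop(fault_queue, oracle, restart):
    """Wait for faults, ask the oracle what to restart, then restart it.

    fault_queue: queue.Queue of failed component names (None stops the loop).
    oracle:      callable mapping a failed component to the names to restart.
    restart:     callable that restarts one component by name.
    Returns a log of (failed component, restarted components) pairs.
    """
    log = []
    while True:
        failed = fault_queue.get()      # blocks until the detector reports a fault
        if failed is None:              # hypothetical sentinel used to end the sketch
            break
        targets = list(oracle(failed))  # the decision tree would be consulted here
        for name in targets:
            restart(name)
        log.append((failed, targets))
    return log


# Stand-in oracle: a plain lookup table instead of a real decision tree.
# The grouping of fedr with pbcom here is invented for the example.
HYPOTHETICAL_ORACLE = {"fedr": ["fedr", "pbcom"]}


def lookup_oracle(failed):
    return HYPOTHETICAL_ORACLE.get(failed, [failed])
```

In a real system the `restart` callable would launch processes; passing it in as a parameter keeps the loop itself easy to exercise with a stub.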

RR Tree
- The RR tree captures restart dependency information.
- Parents must be able to restart their children.

[Figure: RR tree with nodes Pipeline, ise, istr, istu, pbcom, fedr]

Legend:
- ise: IServiceEstimator
- istr: IserviceTracker
- istu: IServiceTuner
- fedr: FedRadio
- pbcom: PipelineByteCOMPort
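
One way to picture an RR tree in code is a node type whose restart set is its whole subtree, so that restarting a parent also restarts its children. The sketch below is an assumption-laden illustration: the class `RRNode`, its methods, and in particular the flat tree shape (Pipeline as the root with the five components as direct children) are hypothetical, since the transcript does not preserve the actual layout of the figure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RRNode:
    """A node in a restart tree: restarting it restarts its entire subtree."""
    name: str
    children: List["RRNode"] = field(default_factory=list)

    def find(self, name: str) -> Optional["RRNode"]:
        """Depth-first search for the node with the given name."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

    def restart_set(self) -> List[str]:
        """Names restarted when this node is restarted (node first, then children)."""
        names = [self.name]
        for child in self.children:
            names.extend(child.restart_set())
        return names


# Hypothetical layout: the real parent/child structure is not recoverable here.
tree = RRNode("Pipeline", [RRNode(n) for n in ("ise", "istr", "istu", "pbcom", "fedr")])
```

With this layout, `tree.find("fedr").restart_set()` covers only `fedr`, while `tree.restart_set()` covers every component.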

From RR Trees to Decision Trees
- Components have different restart times.
- Components have different failure rates.
- Use this information to augment the decision tree:
  - Preserve dependencies.
  - Reduce MTTR.
  - Move slower components up, push faster components down.
  - Capture historical information: groups of components that fail together.
  - Move high-failure components to single nodes.
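
A rough way to see why restructuring can reduce MTTR is to compare the expected recovery time of two groupings. The model below is a simplifying assumption, not the authors' analysis: a fault in any component restarts its whole group, a group restart takes as long as its slowest member, and each component's share of all faults is known. The function name, the component names, and every number are hypothetical.

```python
def expected_recovery_time(groups, restart_time, failure_share):
    """Expected time to recover from one fault under a simple grouping model.

    groups:        list of lists of component names restarted together.
    restart_time:  seconds each component needs to restart on its own.
    failure_share: fraction of all faults attributed to each component.
    """
    total = 0.0
    for group in groups:
        # Assumption: members restart in parallel, so the slowest one dominates.
        group_time = max(restart_time[name] for name in group)
        for name in group:
            total += failure_share[name] * group_time
    return total


# Hypothetical figures: a slow but reliable part and a fast but flaky part.
restart_time = {"slow_part": 20.0, "fast_part": 5.0}
failure_share = {"slow_part": 0.1, "fast_part": 0.9}

together = expected_recovery_time([["slow_part", "fast_part"]], restart_time, failure_share)
split = expected_recovery_time([["slow_part"], ["fast_part"]], restart_time, failure_share)
```

Under these made-up numbers, giving the high-failure component its own node means most faults pay only its short restart time, so `split` comes out well below `together`.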

Restructuring Helps!
- Sample restart times for different components

Making the Decision
- Algorithm:
- On a fault, restart the node and its children.
  - It may not be possible to kill the node.
  - The restart may not solve the problem.
- If this does not fix the problem:
  - Retry a constant number of times.
  - Go up one level.
  - Repeat.
- Log all faults and restarts.
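
The retry-then-escalate policy above can be sketched as follows. The names `recover`, `parent_of`, and `try_restart`, the default of three retries, and the shape of the log entries are assumptions made for illustration; the slides only say that the retry count is a constant and that all faults and restarts are logged.

```python
def recover(failed, parent_of, try_restart, max_retries=3):
    """Restart a node, retrying and then escalating to its parent on failure.

    failed:      name of the node where the fault was reported.
    parent_of:   dict mapping each node name to its parent (root maps to None).
    try_restart: callable that restarts a node and its children and returns
                 True if the fault is gone afterwards.
    Returns (node that fixed the fault or None, log of attempts).
    """
    log = []
    node = failed
    while node is not None:
        for attempt in range(1, max_retries + 1):
            fixed = try_restart(node)
            log.append((node, attempt, fixed))   # record every restart attempt
            if fixed:
                return node, log
        node = parent_of.get(node)               # go up one level and repeat
    return None, log                             # even restarting the root did not help
```

A usage sketch with a hypothetical three-level tree, where only restarting the middle node clears the fault:

```python
parents = {"leaf": "group", "group": "root", "root": None}
fixed_at, attempts = recover("leaf", parents, lambda node: node == "group")
```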

Kill-Restart Mechanism
- Need for a softer kill:
  - Not all components may be misbehaving.
  - Give components a chance to free resources.
- If the soft kill fails, follow with a hard kill:
  - kill -9 (SIGKILL) on Linux.
- Restart is implemented as a Java Runtime.exec(…) call.
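
The soft-then-hard kill sequence can be illustrated with Python's standard `subprocess` module. This is a sketch of the general technique, not the RRINGS code, which the slides describe as Java: the function names and the two-second grace period are assumptions. On POSIX systems `Popen.terminate()` sends SIGTERM and `Popen.kill()` sends SIGKILL, the equivalent of `kill -9`.

```python
import subprocess
import sys


def stop_process(proc, grace_seconds=2.0):
    """Try a soft kill first; fall back to a hard kill if the process lingers.

    Returns "soft" if the process exited within the grace period, else "hard".
    """
    proc.terminate()                      # soft kill: lets the process clean up
    try:
        proc.wait(timeout=grace_seconds)
        return "soft"
    except subprocess.TimeoutExpired:
        proc.kill()                       # hard kill: cannot be caught or ignored
        proc.wait()                       # reap the process
        return "hard"


def restart_process(proc, argv, grace_seconds=2.0):
    """Stop a running process and launch a fresh copy from its command line."""
    how = stop_process(proc, grace_seconds)
    return how, subprocess.Popen(argv)


# Hypothetical component: a child process that just sleeps.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(60)"]
```

Calling `restart_process(subprocess.Popen(SLEEPER), SLEEPER)` stops the old child and returns how it was stopped together with the new process handle.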

Designing a System for RR
- The goal is to decrease MTTR.
- Decompose components into smaller pieces.
- Advantages:
  - Fault isolation.
  - Slow-restart pieces can move up (and fast-restart pieces down).
  - Significantly decreases MTTR.
  - Example: fedr and pbcom.
- Disadvantages:
  - Some components may not be decomposable.
  - IPC can make things difficult, since the pieces were together for a reason (coordination aspect).
- State management.

State Management
- Stateful components need to resynchronize after a restart.
- Resynchronization complexity is a function of system design.
- GS resynchronization:
  - All components keep soft state.
  - "Hard state" lives in the control GUI, which we are not modeling here.
- Future GS resynchronization:
  - Protect the system goal state in "safe" stable storage.
  - Components refresh from this stable storage.
  - Details not yet defined.

Results
- Increased reliability in the GS through RR.
- Developed a framework for enabling new GS modules.
- Future work:
  - Develop protected stable storage techniques.
  - Extend the framework for a multi-component GS.
  - Extend the framework to a federated Virtual GS.