Safetynet: Improving The Availability Of Shared Memory Multiprocessors With Global Checkpoint/Recovery D. Sorin M. Martin M. Hill D. Wood Presented by.


1 SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery
D. Sorin, M. Martin, M. Hill, D. Wood
Presented by Akin Olugbade, 03/05/2010

2 Motivation
- Rising processor speeds and shrinking process technology make chips more susceptible to errors
- Systems need high availability
- Shared memory multiprocessors account for a large share of Internet servers
- Crashing or rebooting the system is an undesirable way to deal with errors

3 SafetyNet Design
- Create globally consistent checkpoints that the system can recover to when an error is detected
- Save the architected state: processor registers, memory state, and coherence state
- Validate that a checkpoint is fault free
- On an error, recover to the most recent validated checkpoint
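
The validate-then-recover rule on this slide can be illustrated with a small sketch. It is a toy model, not the hardware mechanism from the paper: the class name `ValidationCoordinator`, the component names, and the use of checkpoint 0 as the initial trusted state are all assumptions made for illustration. It only shows that the recovery point is the newest checkpoint that every component has declared fault free.

```python
class ValidationCoordinator:
    """Toy model: tracks the newest checkpoint each component has validated."""

    def __init__(self, components):
        # Assumption: checkpoint 0 stands for the initial, trusted state.
        self.validated = {name: 0 for name in components}

    def report_validated(self, component, checkpoint):
        # A component only ever moves its validated checkpoint forward.
        self.validated[component] = max(self.validated[component], checkpoint)

    def recovery_point(self):
        # A checkpoint counts as validated only if every component agrees,
        # so the recovery point is the minimum over all components.
        return min(self.validated.values())


# Hypothetical components and checkpoint numbers, for illustration only.
coordinator = ValidationCoordinator(["cpu0", "cpu1", "memory"])
coordinator.report_validated("cpu0", 3)
coordinator.report_validated("cpu1", 2)
coordinator.report_validated("memory", 4)
print(coordinator.recovery_point())
```

In this model the slowest component bounds the recovery point, so here the system would roll back to checkpoint 2 even though other components have validated later checkpoints.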

4 SafetyNet Design
- Logging space reduced
  - A change to a given register, memory block, or coherence permission is logged only once per checkpoint interval
- Point of Atomicity
  - The requestor does not advance its recovery point until all outstanding requests have completed
  - Consistent logical time ensures global consistency of checkpoints
- Validation
  - All components must agree that a checkpoint is a valid, fault-free point before it is validated
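
The log-once-per-interval idea can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the paper's hardware design: the class `LoggedMemory`, its method names, and the convention that checkpoint k means "the state at the start of interval k" are hypothetical. Each block's old value is logged only on the first store in an interval, and recovery undoes the log in reverse order.

```python
_MISSING = object()  # marks "block did not exist before this store"


class LoggedMemory:
    """Toy memory that logs a block's old value once per checkpoint interval."""

    def __init__(self):
        self.data = {}          # block address -> current value
        self.current_ckpt = 1   # interval currently executing
        self.last_logged = {}   # address -> interval of its latest log entry
        self.log = []           # (interval, address, old_value), oldest first

    def store(self, addr, value):
        # Log the old value only on the first store to addr in this interval.
        if self.last_logged.get(addr) != self.current_ckpt:
            old = self.data.get(addr, _MISSING)
            self.log.append((self.current_ckpt, addr, old))
            self.last_logged[addr] = self.current_ckpt
        self.data[addr] = value

    def new_checkpoint(self):
        # Starting a new interval is cheap: no state is copied here.
        self.current_ckpt += 1

    def recover(self, ckpt):
        # Undo every logged change from interval ckpt onward, newest first,
        # which restores the state as it was at the start of interval ckpt.
        for interval, addr, old in reversed(self.log):
            if interval >= ckpt:
                if old is _MISSING:
                    self.data.pop(addr, None)
                else:
                    self.data[addr] = old
        self.log = [entry for entry in self.log if entry[0] < ckpt]
        self.last_logged = {a: c for a, c in self.last_logged.items() if c < ckpt}
        self.current_ckpt = ckpt


mem = LoggedMemory()
mem.store(0x10, 1)      # interval 1: first store to 0x10 is logged
mem.store(0x10, 2)      # interval 1: same block, not logged again
mem.new_checkpoint()    # interval 2 begins
mem.store(0x10, 3)      # logged once for interval 2
mem.store(0x20, 7)      # logged once for interval 2
mem.recover(2)          # roll back to the start of interval 2
print(mem.data)
```

Because repeated stores to the same block within an interval add no log entries, a longer interval tends to need less log space per store, which is the trade-off the slide points at.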

5 Logical Time

6 Evaluation

7

8 Conclusion
+ The checkpoint/recovery system can be independent of the error detection mechanism
+ Negligible performance overhead in the error-free common case
+ Storage and bandwidth overhead can be greatly reduced by lengthening the checkpoint interval

9 Questions
- Does the validation latency matter in the case of output commit?
- How do we deal with stores when the CLB fills up?
- Is SafetyNet suitable for mission-critical situations?
- If validation is fast enough, would we want to shorten the checkpoint interval?

