SafetyNet Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,


1 SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery
Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood
March 31st, 2006

2 Target: systems where availability is crucial
- SMP commercial servers: application services, database management systems
Motivation:
- Increase in performance => decrease in feature size => decrease in reliability
- Cost of a fault-tolerant solution: important

3 Approach and Challenges
Decouple:
- Local fault detection: ECC, timeouts, etc.
- Lightweight, global fault recovery: SafetyNet
Challenges for lightweight recovery schemes:
- Amount of storage (checkpoints, logs)
- Maintaining a consistent global recovery point
- Advancing the global recovery point

4 SafetyNet: High-Level View
Maintain per-processor checkpoints:
- One globally validated recovery point
- Multiple coordinated checkpoints pending validation
- Each checkpoint identified by a global logical timestamp
Fault detected => recover state to the (global) recovery point

5 Solutions: Storage
Checkpoint the architectural state:
- Registers: shadow registers or cached copies
  - Copied once at the beginning of each checkpoint
- Memory and caches: Checkpoint Log Buffers (CLBs)
  - Incrementally log stores and ownership changes
  - Log only the first update per block per checkpoint
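The "log only the first update per block per checkpoint" rule can be illustrated with a small software model. This is a minimal sketch, not SafetyNet's hardware design: the class `CheckpointLogBuffer`, its fields, and the tuple layout of a log entry are all hypothetical names chosen for illustration, and ownership changes are left out.

```python
class CheckpointLogBuffer:
    """Toy per-node undo log (hypothetical model, not the actual hardware)."""

    def __init__(self):
        self.memory = {}      # block address -> current value
        self.block_cn = {}    # block address -> checkpoint number of its last logged update
        self.log = []         # entries: (checkpoint_number, address, old_value)
        self.current_cn = 1   # checkpoint interval currently executing

    def store(self, addr, value):
        # Save the old value only on the first update to this block
        # within the current checkpoint interval.
        if self.block_cn.get(addr) != self.current_cn:
            self.log.append((self.current_cn, addr, self.memory.get(addr)))
            self.block_cn[addr] = self.current_cn
        self.memory[addr] = value

    def begin_checkpoint(self):
        # Start the next checkpoint interval.
        self.current_cn += 1


clb = CheckpointLogBuffer()
clb.store(0x40, 1)      # first update in interval 1: old value (None) is logged
clb.store(0x40, 2)      # same interval: nothing new is logged
clb.begin_checkpoint()
clb.store(0x40, 3)      # first update in interval 2: old value 2 is logged
```

In this model the log grows with the number of distinct blocks touched per interval rather than with the number of stores, which is the storage saving the slide points to.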

6 Solution: Global Coherence
Logical time base:
- All nodes agree on the checkpoint interval in which each coherence transaction occurs
- Loosely synchronous checkpoint clock
- Maintain a per-block checkpoint number (CN)
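One way to see why a loosely synchronous checkpoint clock can still act as a logical time base is to check that a message is never received in an earlier checkpoint interval than the one in which it was sent. The sketch below assumes this holds when the clock skew between two nodes is smaller than the message latency between them; the function names, the cycle-based time model, and the parameter values are illustrative assumptions, not taken from the slides.

```python
def checkpoint_number(global_time, skew, interval):
    """Checkpoint number seen by a node whose checkpoint clock lags by `skew` cycles."""
    return (global_time - skew) // interval


def message_respects_logical_time(send_time, latency, sender_skew, receiver_skew, interval):
    """True if the receiver's checkpoint number is not older than the sender's."""
    sent_cn = checkpoint_number(send_time, sender_skew, interval)
    received_cn = checkpoint_number(send_time + latency, receiver_skew, interval)
    return received_cn >= sent_cn


# Skew (5 cycles) below latency (10 cycles): ordering holds for every send time tried.
ok_with_small_skew = all(
    message_respects_logical_time(t, latency=10, sender_skew=0, receiver_skew=5, interval=100)
    for t in range(1000)
)

# Skew (5 cycles) above latency (2 cycles): a message sent just after a
# checkpoint boundary arrives while the receiver is still in the previous interval.
ok_with_large_skew = message_respects_logical_time(
    101, latency=2, sender_skew=0, receiver_skew=5, interval=100
)
```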

7 Solution: Global Recovery Point
Checkpoint validation:
- All nodes agree that execution up to that point is error-free
- Broadcast the new recovery point checkpoint number
Restart:
- Drain the interconnection network
- Discard in-progress coherence state
- Processors: restore the register checkpoint
- Memory: undo the actions recorded in the Checkpoint Log Buffers (CLBs)
- Caches: undo the actions recorded in the CLBs
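Validation and restart can be sketched in a few lines. The example below is a simplified model under stated assumptions: each node reports the highest checkpoint number it considers error-free, a log entry `(cn, addr, old_value)` holds the value a block had before its first update in interval `cn`, and a recovery point `rp` denotes the state at the end of interval `rp`. The function names and data layout are hypothetical, and draining the network and restoring registers are omitted.

```python
def advance_recovery_point(validated_by_node):
    """The recovery point can advance only as far as every node has validated."""
    return min(validated_by_node.values())


def recover(memory, log, recovery_point):
    """Undo logged updates newer than the recovery point, newest first.

    Returns the log entries that remain after the undone entries are discarded.
    """
    for cn, addr, old_value in reversed(log):
        if cn > recovery_point:
            if old_value is None:
                memory.pop(addr, None)    # block did not exist at the recovery point
            else:
                memory[addr] = old_value  # restore the pre-update value
    return [entry for entry in log if entry[0] <= recovery_point]


memory = {0x40: 3, 0x80: 7}
log = [(1, 0x40, None), (2, 0x40, 2), (3, 0x80, None)]

# node1 has only validated checkpoint 1, so the global recovery point is 1.
rp = advance_recovery_point({"node0": 2, "node1": 1, "node2": 3})
log = recover(memory, log, rp)
```

After recovery, `memory` holds only the value block `0x40` had at the end of interval 1, and the entries for intervals 2 and 3 are gone from the log.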

8 Evaluation: Performance Impact

9 Evaluation: Sensitivity

10 Evaluation: Sensitivity (Cont)

11 Questions
- Why is having a coordinated checkpoint important?
- Why broadcast the recovery point checkpoint number twice: when advancing the recovery point and when triggering recovery?
- Why a sequentially consistent model? Is the scheme valid for processor consistency?
- Is this a good idea? Has it caught on?

