Checkpointing and Recovery. Purpose Consider a long running application –Regularly checkpoint the application Expensive task –In case of failure, restore.


Checkpointing and Recovery

Purpose
Consider a long-running application
– Regularly checkpoint the application
  Expensive task
– In case of failure, restore to the previous checkpoint
What happens in the case of a distributed application?
– One (or more) processes fail
– Restoration to the previous checkpoint should be done consistently

Examples

What to Save?
Depends on the application
– Could be as simple as just the program counter information
– Could be the state of the entire process, including messages received, etc.

Stable Storage
Checkpoints must survive failure of processes (including failure during a disk write)
– A simple approach for stable storage
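The slide does not say which approach it has in mind. As one possible illustration, the sketch below keeps a checkpoint readable even if the process dies in the middle of a write: it writes to a temporary file, forces the data to disk, and then atomically renames the file over the previous checkpoint. The function names `save_checkpoint` and `load_checkpoint` and the JSON encoding are assumptions made for this example, and the sketch does not cover disk failure or syncing the containing directory.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` so that `path` always holds a complete checkpoint.

    Hypothetical helper: the new checkpoint is written to a temporary file
    first, so a crash during the write leaves the old checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to the storage device
        os.replace(tmp_path, path)  # atomic swap of old and new checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Read back the most recent complete checkpoint."""
    with open(path) as f:
        return json.load(f)
```

A process would call `save_checkpoint("app.ckpt", state)` periodically and `load_checkpoint("app.ckpt")` on restart.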

Approaches
Asynchronous
– The local checkpoints at different processes are taken independently
Synchronous
– The local checkpoints at different processes are coordinated
– They may not be taken at the same time

Asynchronous Checkpointing
Problem
– Domino effect
[Figure label: Failed process]

Other Issues with Asynchronous Checkpointing
– Useless checkpoints
– Need for garbage collection
– Recovery requires significant coordination

Asynchronous Checkpointing (Continued)
– Identify the dependencies between different checkpoint intervals
– This information is stored along with the checkpoints in stable storage
– When a process is repaired, it requests this information from the others to determine the need for rollback

Two Examples of Asynchronous Checkpointing
– Bhargava and Lian
– Wang et al.

Algorithm by Bhargava et al.
Draw an edge from $c_{i,x}$ to $c_{j,y}$ if either
– $i = j$ and $y = x+1$, or
– $i \neq j$ and a message $m$ is sent from $I_{i,x}$ and received in $I_{j,y}$
where $I_{i,x}$ is the interval between $c_{i,x-1}$ and $c_{i,x}$.
The rollback recovery line is used for recovery as well as garbage collection.
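The slide gives the edge rule but not how the recovery line is read off the graph. The sketch below uses the reachability procedure commonly described for rollback-dependency graphs, which is an assumption here rather than something the slide states: treat each process's current (volatile) state as its last node, mark the last node of every failed process, mark everything reachable from those nodes, and take the last unmarked checkpoint of each process. The function name `recovery_line` and the tuple encoding of messages are hypothetical.

```python
from collections import defaultdict, deque


def recovery_line(num_nodes, messages, failed):
    """Compute a recovery line from a rollback-dependency graph.

    num_nodes[i]: number of nodes of process i, numbered 0..n-1, where
                  node 0 is the initial checkpoint and node n-1 stands
                  for the current volatile state (an assumption).
    messages:     tuples (i, x, j, y) meaning a message was sent in
                  interval I(i, x) and received in interval I(j, y).
    failed:       indices of the failed processes.
    Returns {process: index of the checkpoint to restart from}.
    """
    edges = defaultdict(set)
    # Rule 1: edge from c(i, x) to c(i, x+1) within each process.
    for i, n in enumerate(num_nodes):
        for x in range(n - 1):
            edges[(i, x)].add((i, x + 1))
    # Rule 2: edge from c(i, x) to c(j, y) for each message.
    for i, x, j, y in messages:
        edges[(i, x)].add((j, y))

    # Mark the volatile state of each failed process, then everything
    # reachable from it: those checkpoints depend on lost work.
    marked = set()
    queue = deque((i, num_nodes[i] - 1) for i in failed)
    while queue:
        node = queue.popleft()
        if node in marked:
            continue
        marked.add(node)
        queue.extend(edges[node] - marked)

    # The last unmarked node of each process forms the recovery line.
    line = {}
    for i, n in enumerate(num_nodes):
        unmarked = [x for x in range(n) if (i, x) not in marked]
        line[i] = max(unmarked) if unmarked else None
    return line
```

For example, with two processes of three nodes each and one message sent by process 0 after its checkpoint 1 and received by process 1 before its checkpoint 1, `recovery_line([3, 3], [(0, 2, 1, 1)], [0])` sends process 0 back to checkpoint 1 and process 1 back to checkpoint 0, because process 1's checkpoint 1 records a message whose send has been undone.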

Algorithm by Wang et al.
Difference
– If a message sent from $I_{i,x}$ is received in $I_{j,y}$, then draw an edge from $c_{i,x-1}$ to $c_{j,y}$
The recovery line obtained is similar to that of Bhargava and Lian.
Advantage
– The number of useful checkpoints is at most $N(N+1)/2$
This can be shown by counting the checkpoints that are ahead of the recovery line.

Coordinated Checkpointing
Using diffusing computation
– How can we use diffusing computation to obtain a consistent snapshot?

Algorithm by Tamir and Sequin
Blocking checkpoint
– A coordinator decides when a checkpoint is taken
– The coordinator sends a request message to all processes
– Each process
  Stops executing
  Flushes the channels
  Takes a tentative checkpoint
  Replies to the coordinator
– When all processes have sent replies, the coordinator asks them to change the tentative checkpoint to a permanent checkpoint
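The two phases of the blocking protocol can be sketched as a small in-memory simulation. This is only an illustration under simplifying assumptions: the `Process` class, its method names, and the integer `state` are hypothetical, message passing is replaced by direct method calls, channel flushing is left as a comment, and resuming execution after the commit is an assumption the slide does not spell out.

```python
class Process:
    """Hypothetical stand-in for one application process."""

    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.running = True
        self.tentative = None   # checkpoint awaiting confirmation
        self.permanent = None   # last committed checkpoint

    def on_request(self):
        """Phase 1: stop, flush, take a tentative checkpoint, reply."""
        self.running = False            # stop executing
        # Flushing the channels would happen here (omitted in this sketch).
        self.tentative = self.state     # tentative checkpoint
        return True                     # reply to the coordinator

    def on_commit(self):
        """Phase 2: promote the tentative checkpoint to permanent."""
        self.permanent = self.tentative
        self.tentative = None
        self.running = True             # assumption: resume after commit


def coordinated_checkpoint(processes):
    """Coordinator: request checkpoints, then commit once all have replied."""
    replies = [p.on_request() for p in processes]
    if not all(replies):
        return False
    for p in processes:
        p.on_commit()
    return True
```

Calling `coordinated_checkpoint([Process("p0", 5), Process("p1", 7)])` leaves each process with its current state stored as its permanent checkpoint. Until the commit arrives, each process holds both its previous permanent checkpoint and the new tentative one.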

Algorithm by Tamir and Sequin
– How many checkpoints need to be stored per process?

Checkpointing in Timed Systems
– What if clocks are perfectly synchronized?

Checkpointing in Timed Systems
What if clocks are loosely synchronized?
– The maximum clock drift is known
All processes take a checkpoint at a fixed (local) time
– After the checkpoint, a process does not send any messages for twice the maximum clock drift
– The set of local checkpoints is guaranteed to be consistent

Minimal Checkpoint Coordination
Approach by Koo and Toueg
– Require processes to take a checkpoint only if they have to

Logging Protocols
– Pessimistic
– Optimistic
– Causal