Checkpointing and Recovery

1 Checkpointing and Recovery

2 Purpose
Consider a long-running application
– Regularly checkpoint the application
  – Expensive task
– In case of failure, restore to the previous checkpoint
What happens in the case of a distributed application?
– One (or more) processes fail
– Restoration to the previous checkpoint should be done consistently

3 Examples

4 What to Save?
Depends on the application
– Could be as simple as just program counter information
– Could be the state of the entire process, including messages received, etc.

5 Stable Storage
Checkpoints must survive failure of processes (including failure during a disk write)
– A simple approach for stable storage
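The slide does not spell out its simple approach. One common technique, shown here only as a minimal sketch, is to write the checkpoint to a temporary file, force it to disk, and then atomically rename it over the previous checkpoint, so a crash during the write leaves the old checkpoint intact. The function names `save_checkpoint` and `load_checkpoint`, the use of `pickle`, and the file layout are assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write `state` to `path` so that a crash mid-write keeps the old file.

    Hypothetical helper: the new checkpoint goes to a temporary file in the
    same directory, is flushed to disk, and then replaces the old one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to the storage device
        # os.replace is atomic on POSIX file systems: readers see either
        # the old checkpoint or the new one, never a partial file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Read back the most recent complete checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

This sketch does not sync the containing directory and keeps only one copy; a stricter stable-storage scheme would also protect against media failure, for example by keeping two copies on separate disks.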

6 Approaches
Asynchronous
– The local checkpoints at different processes are taken independently
Synchronous
– The local checkpoints at different processes are coordinated
– They need not be taken at the same time

7 Asynchronous Checkpointing
Problem
– Domino effect
[Figure label: Failed process]

8 Other Issues with Asynchronous Checkpointing
– Useless checkpoints
– Need for garbage collection
– Recovery requires significant coordination

9 Asynchronous Checkpointing (Continued)
– Identify dependencies between different checkpoint intervals
– This information is stored along with the checkpoints in stable storage
– When a process recovers, it requests this information from the others to determine the need for rollback

10 Two Examples of Asynchronous Checkpointing
– Bhargava and Lian
– Wang et al.

11 Algorithm by Bhargava and Lian
Draw an edge from c_{i,x} to c_{j,y} if either
– i = j and y = x+1
– i ≠ j and a message m is sent from I_{i,x} and received in I_{j,y}
where I_{i,x} is the interval between c_{i,x-1} and c_{i,x}
The rollback recovery line is used for recovery as well as garbage collection
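A minimal sketch of how such a dependency graph could be used to find a recovery line, under stated assumptions: each process's current (volatile) state is modelled as its highest-numbered checkpoint, every checkpoint reachable from a failed process's volatile state is marked as unusable, and each process rolls back to its last unmarked checkpoint. The function name `recovery_line` and the tuple encoding of messages are hypothetical, and messages are assumed to be received in intervals numbered 1 or higher, so checkpoint 0 always survives.

```python
from collections import defaultdict


def recovery_line(last_index, messages, failed):
    """Return {process: checkpoint index to roll back to}.

    last_index[i] : index of process i's volatile state, treated as a
                    final virtual checkpoint (assumption of this sketch).
    messages      : tuples (i, x, j, y) meaning a message was sent in
                    interval x of process i and received in interval y
                    of process j, with y >= 1.
    failed        : set of processes whose volatile state is lost.
    """
    edges = defaultdict(set)
    # Same process: checkpoint x leads to checkpoint x + 1.
    for i, last in last_index.items():
        for x in range(last):
            edges[(i, x)].add((i, x + 1))
    # Different processes: one edge per message between intervals.
    for i, x, j, y in messages:
        if i != j:
            edges[(i, x)].add((j, y))

    # Mark everything reachable from the failed processes' volatile states.
    marked = set()
    stack = [(i, last_index[i]) for i in failed]
    while stack:
        node = stack.pop()
        if node not in marked:
            marked.add(node)
            stack.extend(edges[node])

    # The last unmarked checkpoint of each process forms the recovery line.
    return {
        i: max(x for x in range(last + 1) if (i, x) not in marked)
        for i, last in last_index.items()
    }
```

With two processes and the message pattern `[(1, 3, 0, 2), (0, 2, 1, 2), (1, 2, 0, 1), (0, 1, 1, 1)]`, a failure of process 1 marks every checkpoint except the initial ones, which is the domino effect; a failure of process 0 only costs process 0 its volatile state.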

12 Algorithm by Wang et al.
Difference
– If a message sent from I_{i,x} is received in I_{j,y}, then draw an edge from c_{i,x-1} to c_{j,y}
The recovery line obtained is similar to that of Bhargava and Lian
Advantage
– The number of useful checkpoints is at most N(N+1)/2
  – This can be shown by bounding the number of checkpoints that are ahead of the recovery line

13 Coordinated Checkpointing
Using diffusing computation
– How can we use diffusing computation to obtain a consistent snapshot?

14 Algorithm by Tamir and Sequin
Blocking checkpoint
– A coordinator decides when a checkpoint is taken
– The coordinator sends a request message to all processes
– Each process
  – Stops executing
  – Flushes the channels
  – Takes a tentative checkpoint
  – Replies to the coordinator
– When all processes have replied, the coordinator asks them to make the tentative checkpoint permanent
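The blocking protocol can be sketched as a small single-threaded simulation. This is an illustration under assumptions, not the authors' implementation: the `Process` class, its integer `state`, and the `inbox` list standing in for in-transit messages are all hypothetical, and flushing a channel is modelled as applying every pending message before the tentative checkpoint is taken.

```python
class Process:
    """Hypothetical process whose state is a single integer balance."""

    def __init__(self, pid, state):
        self.pid = pid
        self.state = state
        self.inbox = []        # messages still in transit to this process
        self.running = True
        self.tentative = None
        self.permanent = None

    def send(self, other, amount):
        assert self.running, "a stopped process must not send"
        self.state -= amount
        other.inbox.append(amount)

    def on_request(self):
        self.running = False            # stop executing
        for amount in self.inbox:       # flush incoming channels
            self.state += amount
        self.inbox.clear()
        self.tentative = self.state     # take a tentative checkpoint
        return True                     # reply to the coordinator

    def on_commit(self):
        self.permanent = self.tentative  # tentative becomes permanent
        self.tentative = None
        self.running = True              # resume execution


def coordinated_checkpoint(processes):
    """Coordinator: request phase, then commit once everyone has replied."""
    replies = [p.on_request() for p in processes]
    if all(replies):
        for p in processes:
            p.on_commit()
    return {p.pid: p.permanent for p in processes}
```

Because no process sends while the protocol runs, the flushed channels are empty at checkpoint time, and a transfer that was in transit when the request arrived is reflected in the receiver's checkpoint rather than lost.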

15 Algorithm by Tamir and Sequin
How many checkpoints need to be stored per process?

16 Checkpointing in Timed Systems
What if clocks are perfectly synchronized?

17 Checkpointing in Timed Systems
What if clocks are loosely synchronized?
– Suppose the maximum clock drift is known
All processes take a checkpoint at a fixed (local) time
– After the checkpoint, a process does not send any messages for twice the maximum clock drift
– The set of local checkpoints is guaranteed to be consistent
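A small check of why a silent period of twice the drift is enough, as a sketch under assumptions: each local clock is taken to differ from real time by at most `eps` (one possible reading of "maximum clock drift"), the silent period is measured in real time, and message delay is non-negative. The function names and the `offsets` list are hypothetical. A set of checkpoints is inconsistent only if some message is sent after the sender's checkpoint yet received before the receiver's, so it suffices that nobody can send until the slowest clock has reached the checkpoint time.

```python
def checkpoint_real_time(local_time, offset):
    """Real time at which a clock reading real_time + offset shows local_time."""
    return local_time - offset


def checkpoints_consistent(local_time, offsets, silence):
    """True if no message can be sent after the sender's checkpoint and
    still arrive before some receiver's checkpoint.

    offsets : per-process clock offsets from real time (assumption).
    silence : how long each process stays silent after its checkpoint.
    """
    latest = max(checkpoint_real_time(local_time, o) for o in offsets)
    return all(
        checkpoint_real_time(local_time, o) + silence >= latest
        for o in offsets
    )
```

With `eps = 0.5` and offsets `[-0.5, 0.0, 0.5]`, a silent period of `2 * eps` passes the check, while a silent period of only `eps` fails: the process with the fastest clock could send a message that reaches the process with the slowest clock before that process has taken its checkpoint.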

18 Minimal Checkpoint Coordination
Approach by Koo and Toueg
– Require processes to take a checkpoint only if they have to
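The slide gives no detail on how "only if they have to" is decided. One simplified reading, sketched here as an assumption rather than as the published protocol, is that the initiator asks only those processes from which it has received a message since its last checkpoint, and each of those asks its own senders in turn. The function name `processes_to_checkpoint` and the `received_from` mapping are hypothetical.

```python
def processes_to_checkpoint(initiator, received_from):
    """Return the set of processes that must checkpoint with `initiator`.

    received_from[p] : processes that p has received messages from since
                       p's last checkpoint (assumed bookkeeping).
    """
    required = {initiator}
    frontier = [initiator]
    while frontier:
        p = frontier.pop()
        for q in received_from.get(p, ()):
            if q not in required:
                # q's sends are recorded in p's new checkpoint, so q must
                # checkpoint too or those messages would become orphans.
                required.add(q)
                frontier.append(q)
    return required
```

For example, if A has received from B, B from C, and D from A, then a checkpoint initiated by A involves A, B, and C, while D is left alone.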

19 Logging Protocols
– Pessimistic
– Optimistic
– Causal

