Prepared by Ertuğrul Kuzan

Effective and Concurrent Checkpointing and Recovery in Distributed Systems Prepared by Ertuğrul Kuzan

Topic
An effective application-transparent checkpointing/rollback scheme for multiple processes that communicate via message passing in a distributed system.

Reviewed Issues
Checkpointing
Rollback
Independent checkpointing
Consistent checkpointing
Rollback
Minimal rollback
Global rollback

Definitions
Checkpoint: a snapshot of a process's state, saved by the process to stable storage.
Rollback: retrieving a selected checkpoint and resuming execution from it when a failure occurs.

Checkpointing and rollback recovery
An important technique for tolerating transient faults such as transient hardware failures and transaction aborts.
In a distributed system where processes communicate by message passing, individual process states may become dependent on one another due to inter-process communication.

Rollback propagation (Example)
The rollback of one process may result in an avalanche of rollbacks of the other processes.
If the rollback of a process Pi to its retrieved checkpoint undoes the sending of a message to another process Pj, then Pj must also roll back to undo the receiving of that message.
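The propagation rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Message` record, the function name `propagate_rollback`, the use of a single logical clock for all events, and the presence of an initial checkpoint at time 0 for every process are all hypothetical and do not come from the slides.

```python
from collections import namedtuple

# Hypothetical message record: who sent it, who received it, and when
# (logical times on one shared clock, which is an assumption).
Message = namedtuple("Message", "sender receiver sent_at received_at")


def propagate_rollback(failed, checkpoints, messages):
    """Return {process: checkpoint_time} for every process that rolls back.

    checkpoints maps each process to a sorted list of checkpoint times and
    is assumed to contain an initial checkpoint at time 0.
    """
    # The failed process restarts from its latest checkpoint.
    restore = {failed: checkpoints[failed][-1]}
    changed = True
    while changed:
        changed = False
        for m in messages:
            if m.sender not in restore:
                continue  # the sender did not roll back
            if m.sent_at <= restore[m.sender]:
                continue  # the send happened before the restored checkpoint
            # The send is undone. Has the receive been undone as well?
            current = restore.get(m.receiver, float("inf"))
            if m.received_at > current:
                continue  # yes, the receiver is already far enough back
            # No: the receiver must return to a checkpoint taken before
            # it received the message.
            earlier = [c for c in checkpoints[m.receiver] if c < m.received_at]
            restore[m.receiver] = max(earlier)
            changed = True
    return restore


checkpoints = {"P1": [0, 5], "P2": [0, 4], "P3": [0, 2]}
# P1 fails after sending one message to P2 at time 6.
result = propagate_rollback("P1", checkpoints, [Message("P1", "P2", 6, 7)])
```

Each update moves a process strictly further back in time, so the loop terminates. With a cycle of messages among the processes, the same function reproduces a cascade in which the failed process itself is pushed back to an earlier checkpoint.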

Checkpointing Schemes (Plans)
Two main types:
Independent checkpointing
Consistent checkpointing

Independent Checkpointing
No collaboration between processes on taking checkpoints.
Disadvantages:
These schemes suffer from the domino effect (unbounded cascading of rollbacks across processes).
A process may need to keep all the checkpoints it has taken since program initialisation.

Consistent Checkpointing
Saves only two checkpoints for each process.
When a process fails, all the processes need only roll back to their latest checkpoints (if necessary).
Disadvantages:
Usually requires a higher overhead in control messages.
Less concurrency in process execution can be achieved.

Proposed Checkpointing Scheme
An asynchronously co-ordinated checkpointing scheme that captures the essence of both independent and consistent checkpointing.
Takes extra checkpoints as needed to maintain a set of consistent checkpoints and to avoid the domino effect.
Equipped with an effective global recovery line (GRL) determination mechanism to clean up the checkpoints to which a process will never roll back (a process can discard all the checkpoints taken before the GRL).
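The clean-up step that the GRL enables can be illustrated as follows. The slides do not describe how the GRL is determined, so this sketch simply takes it as an input; the function name `discard_before_grl` and the representation of checkpoints as sorted lists of times are assumptions.

```python
def discard_before_grl(checkpoints, grl):
    """Keep only the checkpoints at or after the global recovery line.

    checkpoints: {process: sorted list of checkpoint times}
    grl:         {process: checkpoint time on the recovery line}
    Both structures are hypothetical stand-ins for the real bookkeeping.
    """
    return {
        process: [c for c in taken if c >= grl[process]]
        for process, taken in checkpoints.items()
    }


stored = {"P1": [0, 3, 7], "P2": [0, 5]}
kept = discard_before_grl(stored, {"P1": 3, "P2": 5})
```

No process will ever roll back past the GRL, so the checkpoints dropped here can never be needed again. This is what bounds the storage that independent checkpointing would otherwise keep growing.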

Advantages of the proposed scheme
By taking checkpoints with respect to the frequency of message communications, rollback propagation can be significantly reduced.
Does not cause a higher overhead in control messages.
Avoids the domino effect.

System under consideration
Consists of a set of processors connected by a local area network.
The failures considered in this method are transient: a process recovers from a failure and resumes its execution.

Recovery Manager (RM)
A process responsible for taking checkpoints and performing the rollback task.
Also responsible for discarding orphan messages: messages that a sender process sent before rolling back to a checkpoint that was taken before those messages were sent.
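The orphan-message definition can be expressed as a small filter. This is a sketch under assumptions: the tuple layout `(sender, sent_at, payload)`, the `restored` map and the function name `discard_orphans` are hypothetical, and the slides do not say how the RM actually identifies orphans.

```python
def discard_orphans(inbox, restored):
    """Drop messages whose sending was undone by the sender's rollback.

    inbox:    list of (sender, sent_at, payload) tuples (assumed layout)
    restored: {sender: time of the checkpoint the sender rolled back to}
    """
    kept = []
    for sender, sent_at, payload in inbox:
        # A message is an orphan if its sender rolled back to a checkpoint
        # taken before the message was sent.
        if sender in restored and sent_at > restored[sender]:
            continue
        kept.append((sender, sent_at, payload))
    return kept


inbox = [("P1", 4, "a"), ("P1", 6, "b"), ("P2", 9, "c")]
remaining = discard_orphans(inbox, {"P1": 5})
```

Here the message sent by P1 at time 6 is dropped because P1 went back to its checkpoint at time 5, while messages from processes that did not roll back are left alone.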

Coordinator process
Determines the checkpoints to which application processes should roll back in case of process failure.

Checkpointing scheme
Two component strategies in the proposed checkpointing scheme:
Unforced checkpointing
Forced checkpointing
Each process takes checkpoints independently with respect to the frequency of message sending.

Unforced checkpointing
The more frequently an application process sends messages to other processes, the more checkpoints the process should take in order to reduce the total rollback distance.

Unforced checkpointing (2)
Instead of using the number of messages that process i (Pi) has sent since its last checkpoint, the number of distinct processes (NS) to which Pi has sent messages since its last checkpoint is used.
This is because NS represents the number of processes that may need to roll back with Pi when Pi fails, and therefore better reflects the rollback propagation that may result from Pi's rollback.
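A minimal sketch of an NS-driven checkpoint trigger is shown below. The slides do not give the exact rule, so the fixed threshold `ns_threshold`, the class name `Process` and the choice to checkpoint immediately after the send that reaches the threshold are all assumptions made for illustration.

```python
class Process:
    """Toy process that checkpoints based on NS, the number of distinct
    destinations it has sent to since its last checkpoint."""

    def __init__(self, pid, ns_threshold):
        self.pid = pid
        self.ns_threshold = ns_threshold  # assumed tunable, not from the slides
        self.destinations = set()         # distinct receivers since last checkpoint
        self.checkpoints_taken = 0

    @property
    def ns(self):
        return len(self.destinations)

    def take_checkpoint(self):
        # Saving the state to stable storage is omitted in this sketch.
        self.checkpoints_taken += 1
        self.destinations.clear()

    def send(self, destination, payload):
        # Actual transmission is omitted; only the bookkeeping is shown.
        self.destinations.add(destination)
        if self.ns >= self.ns_threshold:
            self.take_checkpoint()


p = Process("P1", ns_threshold=2)
p.send("P2", "m1")
p.send("P2", "m2")  # same destination, NS stays at 1
p.send("P3", "m3")  # second distinct destination, checkpoint is taken
```

Counting distinct destinations rather than messages means that a burst of messages to a single peer does not trigger extra checkpoints, since only that one peer could be dragged into a rollback.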

Forced checkpointing
Unforced checkpointing alone cannot ensure checkpoint consistency.
To avoid checkpoint inconsistency, processes need to take checkpoints in addition to those taken using the unforced checkpointing strategy.
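The slides do not state when a forced checkpoint is triggered. As an illustration only, the sketch below uses a well-known index-based technique from communication-induced checkpointing: every message carries the sender's checkpoint index, and a receiver that sees a higher index than its own takes a checkpoint before delivering the message. This is a swapped-in technique, not necessarily the rule used by the proposed scheme, and the class and method names are hypothetical.

```python
class IndexedProcess:
    """Toy process using index-based forced checkpoints (an assumed rule)."""

    def __init__(self, pid):
        self.pid = pid
        self.index = 0   # sequence number of the latest checkpoint
        self.forced = 0  # how many forced checkpoints were taken

    def checkpoint(self, index=None):
        # An ordinary checkpoint advances the index by one; a forced one
        # jumps to the index carried by the incoming message.
        self.index = self.index + 1 if index is None else index

    def send(self, payload):
        # Piggyback the current checkpoint index on every message.
        return (self.index, payload)

    def receive(self, message):
        sender_index, payload = message
        if sender_index > self.index:
            # Checkpoint BEFORE delivery, so that the saved state does not
            # record a receive whose send lies after the sender's
            # checkpoint with the same index.
            self.checkpoint(sender_index)
            self.forced += 1
        return payload


p1, p2 = IndexedProcess("P1"), IndexedProcess("P2")
p1.checkpoint()                        # P1 moves to index 1
delivered = p2.receive(p1.send("x"))   # P2 is forced to checkpoint first
```

Under this rule, checkpoints with the same index across all processes form a consistent set, which is the property that unforced checkpointing alone cannot provide.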

Rollback Scheme
Two rollback schemes can be incorporated into the checkpointing scheme:
Global rollback
Minimal rollback

Global rollback
Does not impose a complex communication structure for control messages, but may require irrelevant processes to roll back.
Given a consistent global state, the execution of one or multiple checkpointing and global rollback instances terminates with a consistent global state.

Minimal rollback
Requires only relevant processes to roll back, and therefore causes minimum rollback propagation.
Given a consistent global state, the execution of one or multiple checkpointing and minimal rollback instances terminates with a consistent global state.
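The difference between the two rollback schemes can be sketched by comparing which processes each one rolls back. This is a simplified illustration, not the protocol from the slides: it assumes that each process records the set of processes it has sent to since its last checkpoint, and that every such send is undone when the sender rolls back. The function names and the `sent_since_checkpoint` map are hypothetical.

```python
def global_rollback_set(processes):
    """Global rollback: every process rolls back, relevant or not."""
    return set(processes)


def minimal_rollback_set(failed, sent_since_checkpoint):
    """Minimal rollback: only processes reachable from the failed one
    through messages sent since the senders' last checkpoints.

    sent_since_checkpoint: {process: set of receivers} (assumed bookkeeping)
    """
    affected = {failed}
    pending = [failed]
    while pending:
        process = pending.pop()
        for receiver in sent_since_checkpoint.get(process, ()):
            if receiver not in affected:
                affected.add(receiver)
                pending.append(receiver)
    return affected


sent = {"P1": {"P2"}, "P2": {"P3"}, "P4": {"P1"}}
everyone = global_rollback_set(["P1", "P2", "P3", "P4"])
relevant = minimal_rollback_set("P1", sent)
```

In this example P4 only sent a message to P1 and received nothing from it, so minimal rollback leaves P4 alone while global rollback restarts it anyway. Finding the relevant set is what requires the extra exchange of dependency information.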

Experimental Study
The proposed method is compared with Silva's method in terms of:
Rollback distance
Number of checkpoints taken throughout the simulation

Rollback distance graph
[Figure: rollback distance, with a curve labelled "Our algorithm"]

Number of checkpoints taken graph
[Figure: number of checkpoints taken, with a curve labelled "Our algorithm"]

Conclusion
The proposed method:
Reduces rollback propagation by dynamically changing the checkpoint interval.
Eliminates the domino effect by taking a forced checkpoint whenever necessary.
The checkpointing operation does not block either process execution or message communication.

Conclusion (2)
The checkpointing operation does not impose a complicated communication structure for control messages.
The simulation results indicate that the proposed scheme can effectively reduce rollback propagation.

References
C. J. Hou, K. S. Tsoi and C. C. Han, 'Effective and Concurrent Checkpointing and Recovery in Distributed Systems', Computer and Digital Techniques, IEE Proceedings, Volume 144, Issue 5, Sept. 1997, Page(s):
J. L. Kim and T. Park, 'An efficient protocol for checkpointing recovery in distributed systems', IEEE Trans. Parallel and Distrib. Syst., Aug. 1993, 4, pp
