Prepared by Ertuğrul Kuzan

Effective and Concurrent Checkpointing and Recovery in Distributed Systems. Prepared by Ertuğrul Kuzan

Topic: An effective, application-transparent checkpointing/rollback scheme for multiple processes that communicate via message passing in a distributed system. 2 of 26

Reviewed Issues: Checkpointing (independent checkpointing, consistent checkpointing). Rollback (minimal rollback, global rollback). 3 of 26

Definitions: Checkpoint: a snapshot of a process's state, saved by the process to stable storage. Rollback: retrieving a selected checkpoint and resuming execution from it when a failure occurs. 4 of 26

Checkpointing and rollback recovery: An important technique for tolerating transient faults, such as transient hardware failures and transaction aborts. In a distributed system where processes communicate by message passing, individual process states may become dependent on one another through inter-process communication. 5 of 26

Rollback propagation (Example): The rollback of one process may result in an avalanche of rollbacks of other processes. If the rollback of a process Pi to its retrieved checkpoint undoes the sending of a message to another process Pj, then Pj must also roll back to undo the receiving of that message. 6 of 26
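The propagation rule on this slide can be sketched as a small fixed-point computation. This is an illustrative sketch, not the algorithm from the paper. The `Message` record, the `propagate_rollback` function, the use of per-process integer timestamps, and the assumption that every process has an initial checkpoint at time 0 are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_time: int   # local time at the sender when the message was sent
    receiver: str
    recv_time: int   # local time at the receiver when it was received


def propagate_rollback(failed, checkpoints, messages):
    """Return {process: checkpoint time it restarts from} after `failed` fails.

    Assumptions (hypothetical, for illustration): each process has its own
    local clock, event times and checkpoint times never coincide, and every
    process has an initial checkpoint at time 0.
    """
    # The failed process restarts from its latest checkpoint.
    restart = {failed: max(checkpoints[failed])}
    changed = True
    while changed:
        changed = False
        for m in messages:
            sender_restart = restart.get(m.sender)
            # Skip messages whose send is not undone by a rollback.
            if sender_restart is None or m.send_time <= sender_restart:
                continue
            receiver_restart = restart.get(m.receiver, float("inf"))
            # Skip if the receive has already been undone.
            if m.recv_time > receiver_restart:
                continue
            # The receiver must restart from a checkpoint taken before the receive.
            restart[m.receiver] = max(
                c for c in checkpoints[m.receiver] if c < m.recv_time
            )
            changed = True
    return restart


checkpoints = {"P1": [0, 5], "P2": [0, 4], "P3": [0, 6]}
messages = [
    Message("P1", 7, "P2", 3),  # sent after P1's last checkpoint
    Message("P2", 2, "P3", 8),  # undone once P2 falls back to time 0
]
result = propagate_rollback("P1", checkpoints, messages)
```

In this example the failure of P1 forces P2 back past its checkpoint at time 4 to its initial checkpoint, which in turn forces P3 back to its checkpoint at time 6. That cascade is the avalanche described above.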

Checkpointing Schemes (Plans): Two main types: independent checkpointing and consistent checkpointing. 7 of 26

Independent Checkpointing: No collaboration between processes on taking checkpoints. Disadvantages: schemes of this type suffer from the domino effect (unbounded cascading of rollbacks of other processes), and a process may need to keep all the checkpoints that have been taken since program initialisation. 8 of 26

Consistent Checkpointing: Saves only two checkpoints for each process. When a process fails, all processes need only roll back to their latest checkpoints (if necessary). Disadvantages: usually requires a higher overhead in control messages, and less concurrency in process execution can be achieved. 9 of 26

Proposed Checkpointing Scheme: An asynchronously co-ordinated checkpointing scheme that captures the essence of both independent and consistent checkpointing. Takes extra checkpoints as needed to maintain a set of consistent checkpoints and to avoid the domino effect. Equipped with an effective global recovery line (GRL) determination mechanism to discard the checkpoints to which a process will never roll back (a process can discard all the checkpoints taken before the GRL). 10 of 26
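The garbage-collection step enabled by the GRL can be illustrated as follows. This is a minimal sketch that assumes the GRL has already been determined and is given as one checkpoint time per process. The slides do not describe how the GRL is computed, and the function name `prune_checkpoints` and the data layout are hypothetical.

```python
def prune_checkpoints(checkpoints, grl):
    """Discard every checkpoint taken before the global recovery line.

    checkpoints: {process: list of checkpoint times, oldest first}
    grl:         {process: checkpoint time on the recovery line} (assumed given)
    """
    return {
        process: [c for c in times if c >= grl[process]]
        for process, times in checkpoints.items()
    }


stored = {"P1": [0, 5, 9], "P2": [0, 4], "P3": [0, 6, 7, 11]}
recovery_line = {"P1": 5, "P2": 4, "P3": 7}
kept = prune_checkpoints(stored, recovery_line)
```

Because no process will ever roll back past the recovery line, the discarded checkpoints are never needed. This avoids the unbounded storage growth that independent checkpointing can suffer from.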

Advantages of the proposed scheme: By taking checkpoints with respect to the frequency of message communications, rollback propagation can be significantly reduced. It does not cause a higher overhead in control messages. It avoids the domino effect. 11 of 26

System under consideration: Consists of a set of processors connected by a local area network. The failures considered in this method are transient; that is, once a process recovers from a failure and resumes its execution 12 of 26

Recovery Manager (RM): A process responsible for taking checkpoints and performing the rollback task. It is also responsible for discarding orphan messages, that is, messages whose sending is undone because the sender rolls back to a checkpoint taken before they were sent. 13 of 26
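The orphan-message definition above can be made concrete with a short filter. This is a sketch under simplifying assumptions. Messages are plain `(sender, send_time, payload)` tuples with sender-local timestamps, and `split_orphans` is a hypothetical helper, not the RM interface from the paper.

```python
def split_orphans(messages, sender, restart_time):
    """Separate messages into (orphans, kept) after `sender` rolls back.

    A message is an orphan if it came from `sender` and was sent after the
    checkpoint (at `restart_time`) that the sender has rolled back to.
    """
    orphans, kept = [], []
    for message in messages:
        msg_sender, send_time, _payload = message
        if msg_sender == sender and send_time > restart_time:
            orphans.append(message)
        else:
            kept.append(message)
    return orphans, kept


inbox = [("P1", 3, "a"), ("P1", 8, "b"), ("P2", 9, "c")]
orphans, kept = split_orphans(inbox, sender="P1", restart_time=5)
```

Here only the message sent by P1 at time 8 is discarded. The earlier message from P1 and the message from P2 are unaffected by P1's rollback to its checkpoint at time 5.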

Coordinator process: Determines the checkpoints to which application processes should roll back in case of a process failure. 14 of 26

Checkpointing scheme: The proposed checkpointing scheme has two component strategies: unforced checkpointing and forced checkpointing. Each process takes checkpoints independently, with respect to the frequency of message sending. 15 of 26

Unforced checkpointing: The more frequently an application process sends messages to other processes, the more checkpoints the process should take in order to reduce the total rollback distance. Example: [figure] 16 of 26

Unforced checkpointing (2): Instead of using the number of messages that process i (Pi) has sent since its last checkpoint, the scheme uses NS, the number of distinct processes to which Pi has sent messages since its last checkpoint. This is because NS represents the number of processes that may need to roll back with Pi when Pi fails, and therefore better reflects the rollback propagation that may result from Pi's rollback. 17 of 26
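One way to picture an NS-driven policy is the sketch below. The slides state that NS is used but not the exact trigger rule, so the fixed `ns_threshold` and the class `UnforcedCheckpointPolicy` are assumptions introduced only for illustration.

```python
class UnforcedCheckpointPolicy:
    """Take a checkpoint once NS reaches a threshold (assumed rule)."""

    def __init__(self, ns_threshold):
        self.ns_threshold = ns_threshold   # hypothetical tuning parameter
        self.recipients = set()            # distinct destinations since last checkpoint
        self.checkpoints_taken = 0

    @property
    def ns(self):
        # NS: number of distinct processes messaged since the last checkpoint.
        return len(self.recipients)

    def take_checkpoint(self):
        self.checkpoints_taken += 1
        self.recipients.clear()            # NS restarts from zero

    def on_send(self, destination):
        """Record a send; return True if it triggered a checkpoint."""
        self.recipients.add(destination)
        if self.ns >= self.ns_threshold:
            self.take_checkpoint()
            return True
        return False


policy = UnforcedCheckpointPolicy(ns_threshold=2)
triggered = [policy.on_send(d) for d in ["P2", "P2", "P3", "P4"]]
```

Sending twice to P2 leaves NS at 1, so no checkpoint is taken. The first send to a second distinct process raises NS to 2 and triggers one. Counting distinct recipients rather than messages ties the checkpoint rate to how many processes could be dragged into a rollback.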

Forced checkpointing: Unforced checkpointing alone cannot ensure checkpoint consistency. To avoid checkpoint inconsistency, processes need to take additional checkpoints beyond those taken under the unforced checkpointing strategy. 18 of 26
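The slides do not spell out when a forced checkpoint is taken. As a stand-in, the sketch below uses a well-known index-based rule from communication-induced checkpointing. Each message carries the sender's checkpoint index, and a receiver that is behind takes a forced checkpoint before delivering the message. This is a substitute technique chosen for illustration, not necessarily the rule used in the paper, and the `Process` class is hypothetical.

```python
class Process:
    """Index-based forced checkpointing (illustrative substitute rule)."""

    def __init__(self, name):
        self.name = name
        self.index = 0       # index of this process's latest checkpoint
        self.forced = 0      # number of forced checkpoints taken
        self.delivered = []

    def take_checkpoint(self):
        # Unforced (locally decided) checkpoint.
        self.index += 1

    def send(self, payload):
        # Piggyback the current checkpoint index on the message.
        return (self.index, payload)

    def receive(self, message):
        sender_index, payload = message
        if sender_index > self.index:
            # Forced checkpoint BEFORE delivery, so that no checkpoint with a
            # lower index records the receipt of a message sent after a
            # higher-indexed checkpoint.
            self.index = sender_index
            self.forced += 1
        self.delivered.append(payload)


p1, p2 = Process("P1"), Process("P2")
p1.take_checkpoint()            # P1 is now at index 1
p2.receive(p1.send("m1"))       # P2 is behind, so it takes a forced checkpoint
p2.receive(p1.send("m2"))       # same index, no further checkpoint needed
```

Under this rule, checkpoints with the same index across processes form a consistent set, which is the property forced checkpoints are meant to protect.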

Rollback Scheme: Two rollback schemes can be incorporated into the checkpointing scheme: global rollback and minimal rollback. 19 of 26

Global rollback: Does not impose a complex communication structure for control messages, but may require irrelevant processes to roll back. Given a consistent global state, the execution of one or more checkpointing and global rollback instances terminates with a consistent global state. 20 of 26

Minimal rollback: Requires only relevant processes to roll back, and therefore causes minimal rollback propagation. Given a consistent global state, the execution of one or more checkpointing and minimal rollback instances terminates with a consistent global state. 21 of 26
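The difference between the two rollback schemes can be sketched by comparing which processes each one rolls back. This is a simplified model. The map `sent_since_checkpoint` (each process mapped to the processes it has messaged since its last checkpoint) and both function names are hypothetical, and the model ignores where a receive falls relative to the receiver's own checkpoints.

```python
from collections import deque


def global_rollback_set(processes):
    """Global rollback: every process rolls back, relevant or not."""
    return set(processes)


def minimal_rollback_set(failed, sent_since_checkpoint):
    """Minimal rollback: only processes reachable through undone sends."""
    affected = {failed}
    to_visit = deque([failed])
    while to_visit:
        current = to_visit.popleft()
        # Every process that received a now-undone message must roll back too.
        for receiver in sent_since_checkpoint.get(current, ()):
            if receiver not in affected:
                affected.add(receiver)
                to_visit.append(receiver)
    return affected


processes = ["P1", "P2", "P3", "P4"]
sent_since_checkpoint = {"P1": {"P2"}, "P2": {"P3"}, "P4": {"P1"}}
everyone = global_rollback_set(processes)
relevant = minimal_rollback_set("P1", sent_since_checkpoint)
```

In this example P4 only sent a message to P1 and received nothing from the rolled-back processes, so minimal rollback leaves it running while global rollback would roll it back as well. The price of the smaller set is the extra bookkeeping needed to track dependencies.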

Experimental Study: The proposed method is compared with Silva's method in terms of rollback distance and the number of checkpoints taken throughout the simulation. 22 of 26

[Figure: rollback distance graph; labelled series: "Our algorithm"] 23 of 26

[Figure: number of checkpoints taken graph; labelled series: "Our algorithm"] 24 of 26

Conclusion: The proposed method reduces rollback propagation by dynamically changing the checkpoint interval, and eliminates the domino effect by taking a forced checkpoint whenever necessary. The checkpointing operation does not block either process execution or message communication. 25 of 26

Conclusion (2): The checkpointing operation does not impose a complicated communication structure for control messages. The simulation results indicate that the proposed scheme can effectively reduce rollback propagation. 26 of 26

References: C. J. Hou, K. S. Tsoi and C. C. Han, "Effective and Concurrent Checkpointing and Recovery in Distributed Systems", IEE Proceedings - Computers and Digital Techniques, vol. 144, no. 5, Sept. 1997, pp. 304-316. J. L. Kim and T. Park, "An efficient protocol for checkpointing recovery in distributed systems", IEEE Trans. Parallel and Distrib. Syst., vol. 4, Aug. 1993, pp. 955-960.