12. Recovery Study Meeting M1 Yuuki Horita 2004/5/14

Contents
- Introduction
- Recovery
- Checkpointing
- Difficulty of checkpointing
- Synchronous checkpointing / recovery
- (Asynchronous checkpointing / recovery)

Introduction
- Long computations in distributed environments face a high failure rate: host failures (there are many hosts) and network failures.
- A single failure may disturb the entire computation, forcing a restart from the beginning at high cost.
- Why not reuse the previous computation? ⇒ Recovery

Recovery is not easy
Suppose that a parallel computation is running on distributed resources:

    for (i = 0; i < MAXITER; i++) {
        local_compute();            // compute at each host
        global_state_exchange();    // communicate with neighbors
    }

- Process states need to be saved periodically.
- Usually other processes also have to restore to a previous state.
- Both of these add overhead.
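
For concreteness, here is a minimal C sketch (not from the original slides) of where periodic checkpointing would sit in such a loop; save_state(), restore_latest_state() and failure_detected() are hypothetical helpers standing in for whatever the runtime actually provides.

    #include <stdbool.h>

    #define MAXITER       1000
    #define CKPT_INTERVAL 50          /* take a checkpoint every 50 iterations */

    void local_compute(void);             /* compute at each host */
    void global_state_exchange(void);     /* communicate with neighbors */
    void save_state(int iter);            /* write process state to stable storage */
    int  restore_latest_state(void);      /* return iteration of the latest checkpoint */
    bool failure_detected(void);

    void run(void)
    {
        for (int i = 0; i < MAXITER; i++) {
            local_compute();
            global_state_exchange();

            if (i % CKPT_INTERVAL == 0)
                save_state(i);            /* the periodic-saving overhead */

            if (failure_detected())
                i = restore_latest_state();  /* roll back instead of restarting from 0 */
        }
    }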

Recovery

Back/Forward Error Recovery
- Forward-error recovery
  - Possible only when errors can be removed, enabling processes to move forward.
  - Ex) redundancy, voting
- Backward-error recovery
  - The general approach: restore a previous error-free state.
  - Ex) checkpointing

Backward-error recovery
- Operation-based approach: record all modifications of a process' state.
- State-based approach: record the complete state at certain points.
[Diagram: Recovery → {forward-error, backward-error}; backward-error → {operation-based, state-based}]

State-based approach: Terminology
- Checkpointing: the process of saving state.
- Checkpoint: the recovery point at which checkpointing occurs.
- Rolling back: the process of restoring a process to a prior state.

Checkpointing

Problems of naïve checkpointing
- Orphan messages and the domino effect
  - Orphan message: a message whose receipt is recorded but whose send is undone by a rollback, making the state inconsistent.
  - Domino effect: a single rollback inducing further rollbacks at other processes.
- Lost messages
- Livelocks

Orphan message and Domino Effect
[Space-time diagram: processes X, Y, Z with checkpoints x1–x3, y1–y2, z1–z2. After a rollback, Y has not yet sent a message that X has already received: an orphan message. Undoing it forces further rollbacks: the domino effect.]

Lost messages
[Space-time diagram: processes X, Y, Z with checkpoints x1–x3, y1–y2, z1–z2. After Y rolls back, a message that X has already sent can never be received by Y: a lost message.]

Livelocks
[Space-time diagram: processes X and Y with checkpoints x1, y1 and messages m1, m2, n1, n2; repeated rollbacks triggered by re-arriving messages can keep forcing further rollbacks, so recovery never terminates.]

Consistency of Checkpoints
- Strongly consistent set of checkpoints: no messages penetrate the set in either direction (no in-transit and no orphan messages).
- Consistent set of checkpoints: no messages penetrate the set backward (no orphan messages); lost messages still need to be dealt with.
[Diagram: processes X, Y, Z; the set {x1, y1, z1} is strongly consistent, while {x2, y2, z2} is only consistent.]
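
As an illustration (not from the slides), the two conditions can be checked from per-channel message counts recorded in each process' checkpoint; the sent/rcvd arrays below are assumed bookkeeping, written as a minimal C sketch.

    #include <stdbool.h>

    #define N 3   /* number of processes */

    /* sent[i][j]: messages i's checkpoint records having sent to j.
       rcvd[i][j]: messages i's checkpoint records having received from j. */

    /* Consistent: no orphan messages, i.e. nothing recorded as received
       that the sender's checkpoint does not record as sent. */
    bool is_consistent(int sent[N][N], int rcvd[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (rcvd[j][i] > sent[i][j])
                    return false;
        return true;
    }

    /* Strongly consistent: additionally, no messages are in transit
       across the set (sent and received counts match exactly). */
    bool is_strongly_consistent(int sent[N][N], int rcvd[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (rcvd[j][i] != sent[i][j])
                    return false;
        return true;
    }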

Checkpoint/Recovery Algorithms
- Synchronous: with global synchronization at checkpointing.
- Asynchronous: without global synchronization at checkpointing.

Preliminary (assumptions)
- Goal: to make a consistent global checkpoint.
- Assumptions:
  - Communication channels are FIFO.
  - The network is never partitioned.
  - End-to-end protocols cope with message loss due to rollback recovery and communication failures.
  - No failure occurs during the execution of the algorithm.
~ Synchronous Checkpoint ~

Preliminary (two types of checkpoint)
- Tentative checkpoint: a temporary checkpoint, a candidate for a permanent checkpoint.
- Permanent checkpoint: a local checkpoint at a process, part of a consistent global checkpoint.
~ Synchronous Checkpoint ~

Checkpoint Algorithm
Algorithm:
1. An initiating process (the single process that invokes the algorithm) takes a tentative checkpoint.
2. It requests all processes to take tentative checkpoints.
3. It waits to hear from every process whether its tentative checkpoint succeeded.
4. If all processes succeeded, it decides that all tentative checkpoints should be made permanent; otherwise they should be discarded.
5. It informs all processes of the decision.
6. The processes that receive the decision act accordingly.
Supplement: once a process has taken a tentative checkpoint, it must not send messages until it is informed of the initiator's decision.
~ Synchronous Checkpoint ~
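
The initiator's side of the algorithm above might be outlined in C as follows. This is an illustrative sketch only: take_tentative_checkpoint(), broadcast_request(), recv_reply() and broadcast_decision() are hypothetical placeholders for the actual transport and storage layer, not part of the original slides.

    #include <stdbool.h>

    enum decision { COMMIT, DISCARD };

    bool take_tentative_checkpoint(void);        /* save local state tentatively */
    void broadcast_request(void);                /* "take a tentative checkpoint" */
    bool recv_reply(int p);                      /* did process p succeed? */
    void broadcast_decision(enum decision d);    /* commit or discard */
    void make_permanent(void);
    void discard_tentative(void);

    void initiator_checkpoint(int nprocs)
    {
        bool all_ok = take_tentative_checkpoint();       /* step 1 */
        broadcast_request();                             /* step 2 */

        for (int p = 0; p < nprocs; p++)                 /* step 3 */
            all_ok = recv_reply(p) && all_ok;

        enum decision d = all_ok ? COMMIT : DISCARD;     /* step 4 */
        broadcast_decision(d);                           /* step 5 */

        if (d == COMMIT) make_permanent();               /* step 6, locally */
        else             discard_tentative();
        /* Between step 1 and the decision, no application messages may be
           sent (the "supplement" on the slide). */
    }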

Diagram of the Checkpoint Algorithm
[Space-time diagram: the initiator takes a tentative checkpoint, requests every process to take a tentative checkpoint, collects OK replies, and decides to commit; the tentative checkpoints become permanent and form a consistent global checkpoint. Some of the checkpoints taken are unnecessary.]
~ Synchronous Checkpoint ~

Optimized Algorithm
Each message is labeled in order of sending.
Labeling scheme:
- ⊥ : the smallest label; ⊤ : the largest label.
- last_label_rcvd_X[Y]: the label of the last message that X received from Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists.
- first_label_sent_X[Y]: the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists.
- ckpt_cohort_X: the set of all processes that may have to take checkpoints when X decides to take a checkpoint.
- Checkpoint requests need to be sent only to the processes included in ckpt_cohort.
[Diagram: X and Y with checkpoints and labeled messages x2, x3, y1, y2.]
~ Synchronous Checkpoint ~

Optimized Algorithm
- ckpt_cohort_X = { Y | last_label_rcvd_X[Y] > ⊥ }
- Y takes a tentative checkpoint only if last_label_rcvd_X[Y] >= first_label_sent_Y[X] > ⊥
[Diagram: X and Y with their checkpoints, showing last_label_rcvd_X[Y] and first_label_sent_Y[X].]
~ Synchronous Checkpoint ~
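
A small C sketch of this label bookkeeping (illustrative only; BOTTOM stands for ⊥ and the array layout is an assumption, not from the slides):

    #include <stdbool.h>

    #define N      4
    #define BOTTOM 0          /* ⊥ : smaller than every real label (labels start at 1) */

    int last_label_rcvd[N][N];   /* last_label_rcvd[X][Y], since X's last checkpoint */
    int first_label_sent[N][N];  /* first_label_sent[X][Y], since X's last checkpoint */

    /* Y ∈ ckpt_cohort_X iff X has received something from Y since X's last checkpoint. */
    bool in_ckpt_cohort(int X, int Y)
    {
        return last_label_rcvd[X][Y] > BOTTOM;
    }

    /* When X's request arrives carrying last_label_rcvd_X[Y], process Y needs a
       tentative checkpoint only if last_label_rcvd_X[Y] >= first_label_sent_Y[X] > ⊥. */
    bool must_take_tentative(int Y, int X, int last_label_rcvd_X_of_Y)
    {
        return first_label_sent[Y][X] > BOTTOM &&
               last_label_rcvd_X_of_Y >= first_label_sent[Y][X];
    }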

Optimized Algorithm
Algorithm:
1. An initiating process takes a tentative checkpoint.
2. It requests every p ∈ ckpt_cohort to take a tentative checkpoint (this request carries the sender's last_label_rcvd[receiver]).
3. Processes that receive the request and need to take a checkpoint repeat steps 1 and 2 themselves; otherwise they return OK messages.
4. Each requester waits for OK from every p ∈ its ckpt_cohort.
5. If the initiator learns that all processes have succeeded, it decides that all tentative checkpoints should be made permanent; otherwise they should be discarded.
6. It informs every p ∈ ckpt_cohort of the decision.
7. Processes that receive the decision act accordingly.
~ Synchronous Checkpoint ~

Diagram of the Optimized Algorithm
[Space-time diagram: four processes A, B, C, D exchanging labeled messages (ab1, ac1, bd1, dc1, ...). The initiator takes a tentative checkpoint and sends requests only to its ckpt_cohort_X = { Y | last_label_rcvd_X[Y] > ⊥ }; each receiver applies the test last_label_rcvd_X[Y] >= first_label_sent_Y[X] > ⊥ (e.g. 2 >= 1 > 0 and 2 >= 2 > 0 hold, 2 >= 0 > 0 does not), returns OK, and the initiator decides to commit, making the checkpoints permanent.]
~ Synchronous Checkpoint ~

Correctness
A set of permanent checkpoints taken by this algorithm is consistent because:
- No process sends messages between taking a tentative checkpoint and receiving the decision.
- The new checkpoints record no message received from processes that do not take a checkpoint.
- The set of tentative checkpoints is either made permanent in its entirety or discarded in its entirety.
~ Synchronous Checkpoint ~

Recovery Algorithm
Labeling scheme:
- ⊥ : the smallest label; ⊤ : the largest label.
- last_label_rcvd_X[Y]: the label of the last message that X received from Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists.
- first_label_sent_X[Y]: the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists.
- roll_cohort_X: the set of all processes that may have to roll back to their latest checkpoint when process X rolls back.
- last_label_sent_X[Y]: the label of the last message that X sent to Y before X took its latest permanent checkpoint; ⊤ if no such message exists.
~ Synchronous Recovery ~

Recovery Algorithm
- roll_cohort_X = { Y | X can send messages to Y }
- Y will restart from its permanent checkpoint only if last_label_rcvd_Y[X] > last_label_sent_X[Y]
~ Synchronous Recovery ~

Recovery Algorithm
Algorithm:
1. An initiator requests every p ∈ roll_cohort to prepare to roll back (this request carries the sender's last_label_sent[receiver]).
2. Processes that receive the request and need to roll back repeat step 1 themselves; otherwise they return OK messages.
3. Each requester waits for OK from every p ∈ its roll_cohort.
4. If the initiator learns that all p ∈ roll_cohort have succeeded, it decides to roll back; otherwise it decides not to.
5. It informs every p ∈ roll_cohort of the decision.
6. Processes that receive the decision act accordingly.
~ Synchronous Recovery ~
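
The rollback test applied at a receiving process might be sketched in C as follows; TOP stands for ⊤ and the array names and layout are illustrative assumptions, not from the original slides.

    #include <stdbool.h>
    #include <limits.h>

    #define N   4
    #define TOP INT_MAX     /* ⊤ : larger than every real label */

    int last_label_rcvd[N][N];   /* last_label_rcvd[Y][X], since Y's last checkpoint */
    int last_label_sent[N][N];   /* last_label_sent[X][Y], before X's latest permanent
                                    checkpoint (TOP if no such message exists) */

    /* When X rolls back and sends "prepare to roll back" carrying
       last_label_sent_X[Y], process Y must also roll back only if it has
       received from X a message that X's checkpoint does not record sending. */
    bool must_roll_back(int Y, int X, int last_label_sent_X_of_Y)
    {
        return last_label_rcvd[Y][X] > last_label_sent_X_of_Y;
    }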

Diagram of Synchronous Recovery
[Space-time diagram: four processes A, B, C, D with permanent checkpoints and labeled messages. The initiator sends "request to roll back" to its roll_cohort_X = { Y | X can send messages to Y }; each receiver applies the test last_label_rcvd_Y[X] > last_label_sent_X[Y] (e.g. 2 > 1 holds, while 0 > 1 and 0 > ⊤ do not), replies OK, and the initiator decides to roll back.]
~ Synchronous Recovery ~

Drawbacks of the Synchronous Approach
- Additional messages are exchanged.
- Synchronization delay.
- Unnecessary extra load on the system when failures rarely occur.

Asynchronous Checkpoint
Characteristics:
- Each process takes checkpoints independently.
  - There is no guarantee that a set of local checkpoints is consistent, so the recovery algorithm has to search for a consistent set of checkpoints.
- No additional messages, no synchronization delay, and a lighter load during normal execution.

Preliminary (assumptions)
- Goal: to find the latest consistent set of checkpoints.
- Assumptions:
  - Communication channels are FIFO.
  - Communication channels are reliable.
  - The underlying computation is event-driven.
~ Asynchronous Checkpoint / Recovery ~

Preliminary (two types of log)
- On receipt of a message, the event is saved in memory (volatile log).
- The volatile log is periodically flushed to disk (stable log) ⇔ this corresponds to a checkpoint.
- Volatile log: quick access, but lost if the corresponding processor fails.
- Stable log: slow access, but not lost even if processors fail.
~ Asynchronous Checkpoint / Recovery ~
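
A minimal C sketch of the two logs, under the assumption that each logged receipt records per-channel message counts (the struct and names are illustrative, not from the slides):

    #include <string.h>

    #define N        4
    #define MAX_EVTS 1024

    struct event {               /* logged at every message receipt */
        int from;                /* sender of the received message */
        int rcvd_from[N];        /* receives so far, per channel */
        int sent_to[N];          /* sends so far, per channel */
    };

    struct log { struct event evts[MAX_EVTS]; int n; };

    struct log volatile_log;     /* in memory: fast, lost on processor failure */
    struct log stable_log;       /* on disk: slow, survives failures (the checkpoint) */

    /* Flushing the volatile log to the stable log corresponds to taking a checkpoint. */
    void flush_to_stable(void)
    {
        memcpy(&stable_log, &volatile_log, sizeof stable_log);  /* stands in for a disk write */
    }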

Preliminary (definitions)
- CkPt_i: the checkpoint (stable log) that processor i rolls back to when a failure occurs.
- RCVD_{i←j}(CkPt_i / e): the number of messages received by processor i from processor j, according to the information stored in checkpoint CkPt_i or event e.
- SENT_{i→j}(CkPt_i / e): the number of messages sent by processor i to processor j, according to the information stored in checkpoint CkPt_i or event e.
~ Asynchronous Checkpoint / Recovery ~

Recovery Algorithm
Algorithm:
1. When a process crashes, it recovers to its latest checkpoint CkPt.
2. It broadcasts a message saying that it has failed; the other processes receive this message and roll back to their latest event.
3. Each process sends SENT(CkPt) to its neighboring processes.
4. Each process waits for the SENT(CkPt) messages from every neighbor.
5. On receiving SENT_{j→i}(CkPt_j) from j, if process i notices that RCVD_{i←j}(CkPt_i) > SENT_{j→i}(CkPt_j), it rolls back to the latest event e such that RCVD_{i←j}(e) = SENT_{j→i}(CkPt_j).
6. Steps 3, 4, and 5 are repeated N times (N is the number of processes).
~ Asynchronous Checkpoint / Recovery ~
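
Step 5 at a single processor might look like the following C sketch. It is illustrative only: the per-event RCVD/SENT counters and the history layout are assumed bookkeeping, and the network exchange of SENT values is elided.

    #define N 3

    struct pstate {
        int rcvd_from[N];        /* RCVD_{i<-j} recorded in this checkpoint/event */
        int sent_to[N];          /* SENT_{i->j} recorded in this checkpoint/event */
    };

    /* Processor i's current restore point plus its logged events, oldest first. */
    struct history { struct pstate current; struct pstate events[64]; int n_events; };

    /* On receiving SENT_{j->i}(CkPt_j) from neighbor j, roll back to the
       latest event e with RCVD_{i<-j}(e) = SENT_{j->i}(CkPt_j). */
    void on_sent_message(struct history *h, int j, int sent_j_to_i)
    {
        if (h->current.rcvd_from[j] <= sent_j_to_i)
            return;                           /* no orphan messages on channel j -> i */

        for (int k = h->n_events - 1; k >= 0; k--) {   /* search latest such event */
            if (h->events[k].rcvd_from[j] == sent_j_to_i) {
                h->current = h->events[k];    /* roll back to event e */
                h->n_events = k + 1;          /* discard the later, undone events */
                return;
            }
        }
    }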

Asynchronous Recovery (example)
[Space-time diagram: processes X, Y, Z with checkpoints x1, y1, z1 and events Ex0–Ex3, Ey0–Ey3, Ez0–Ez2. After a failure, each pair of processes checks RCVD_{i←j}(CkPt_i) <= SENT_{j→i}(CkPt_j); violated inequalities (e.g. 3 <= 2 and 2 <= 1) force the corresponding processes to roll back to earlier events until all inequalities hold.]