Rollback-Retry Techniques & Checkpointing Protocols


Software Faults are Soft
Most hardware faults are soft, that is, transient. Retry techniques such as checksums with retransmission are the standard way to deal with them. Conjecture (J. Gray, 1985): software faults are also soft. If a software operation fails and the application is restarted from a quiescent state, the operation will usually not fail the second time.

Bohrbugs vs. Heisenbugs
Consider a software system that has gone through structured design, design review, quality assurance, alpha testing, and so on. After all these phases, most of the bugs that always fail on retry (Bohrbugs) are gone. For the remaining bugs (Heisenbugs), retry techniques work because the state of the system is no longer what it was a minute earlier. During the operational phase of a software system, 1 out of 150 software faults is a Bohrbug (1985).

The importance of checkpointing
A checkpoint is the quiescent state from which a computation can be restarted after a failure. Checkpointing saves the state of a process to stable storage. What does it mean to checkpoint a distributed application?
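For a single process, the save-and-restart idea can be sketched as follows. This is a minimal illustration, not part of the original slides: the names save_checkpoint, load_checkpoint and run_with_retry are hypothetical, and a local file written atomically stands in for stable storage.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically rename it.

    Assumption: a local file approximates stable storage.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to the device before renaming
    os.replace(tmp_path, path)  # a reader sees the old or the new checkpoint


def load_checkpoint(path):
    """Read back the last saved state."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run_with_retry(operation, path, max_retries=3):
    """Restart the operation from the checkpoint each time it fails."""
    last_error = None
    for _ in range(max_retries):
        state = load_checkpoint(path)  # back to the quiescent state
        try:
            return operation(state)
        except Exception as exc:  # a Heisenbug may not show up on retry
            last_error = exc
    raise RuntimeError("operation kept failing, possibly a Bohrbug") from last_error
```

An operation that fails only transiently succeeds on a later attempt, while one that fails on every attempt exhausts the retries and is reported.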

Consistent System States
A global state of a message-passing system consists of:
–the individual states of all processes
–the states of the communication channels
A consistent global state is a global state in which, if a process's state reflects a message receipt, then the state of the corresponding sender reflects sending that message.
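The definition can be checked mechanically. The sketch below assumes a hypothetical representation in which each local state records the identifiers of the messages it has sent and received, with a message identified by a (sender, sequence number) pair.

```python
from dataclasses import dataclass, field


@dataclass
class LocalState:
    # Message ids are (sender_pid, sequence_number) pairs (an assumption
    # made for this sketch).
    sent: set = field(default_factory=set)
    received: set = field(default_factory=set)


def is_consistent(global_state):
    """global_state maps pid -> LocalState.

    Consistent iff every receipt recorded by some process is matched by
    a send recorded in the sender's state.
    """
    for state in global_state.values():
        for sender, seq in state.received:
            if (sender, seq) not in global_state[sender].sent:
                return False  # receipt recorded without the matching send
    return True
```

A message that is recorded as sent but not yet received is simply in transit on a channel and does not make the state inconsistent.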

Consistent System States (2)
Intuitively, a consistent global state is one that may occur during a failure-free, correct execution of a distributed computation. The goal of a rollback-retry protocol is to bring the system back into a consistent state. Problem: a process may take local checkpoints that cannot belong to any consistent global checkpoint.

Consistent Global Checkpoint
A local checkpoint is a local state saved onto stable storage. A global checkpoint is a set of local checkpoints, one for each process. A global checkpoint is consistent if no local checkpoint in it happens-before another (i.e., no message is recorded as received without also being recorded as sent). A local checkpoint that does not belong to any consistent global checkpoint is useless.

Checkpoint and Communication Patterns

Z-Cycles and Z-Paths
A Z-path (zigzag path) is a special sequence of messages that connects two checkpoints. Let $\rightarrow$ denote Lamport's happens-before relation. Let $c_{i,x}$ denote the $x$-th checkpoint of process $P_i$. Define the execution portion between two consecutive checkpoints on the same process to be the checkpoint interval (starting with the earlier checkpoint). Let $\mathit{send}_i$ and $\mathit{deliver}_i$ be the communication events of process $P_i$.

Definition of Z-Path
Given two checkpoints $c_{i,x}$ and $c_{j,y}$, a Z-path exists between $c_{i,x}$ and $c_{j,y}$ if and only if one of the following two conditions holds:
1. $x < y$ and $i = j$; or
2. There exists a sequence of messages $[m_0, m_1, \ldots, m_n]$, $n \ge 0$, such that:
–$c_{i,x} \rightarrow \mathit{send}_i(m_0)$;
–$\forall l < n$, either $\mathit{deliver}_k(m_l)$ and $\mathit{send}_k(m_{l+1})$ are in the same checkpoint interval, or $\mathit{deliver}_k(m_l) \rightarrow \mathit{send}_k(m_{l+1})$; and
–$\mathit{deliver}_j(m_n) \rightarrow c_{j,y}$

Z-Cycles and Z-Paths (2)
A Z-cycle is a Z-path that begins and ends with the same checkpoint.
–In the figure, $[m_5, m_4, m_3]$ is a Z-cycle that starts and ends at checkpoint $c_{2,2}$.
–$c_{2,2}$ is involved in a Z-cycle. It is useless (i.e., it cannot belong to any consistent global checkpoint).
–$[m_1, m_2]$ and $[m_3, m_4]$ are Z-paths between $c_{0,1}$ and $c_{2,2}$.
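The Z-path definition reduces to a reachability search once each message is tagged with the checkpoint intervals in which it was sent and delivered. The sketch below rests on assumptions that are not in the slides: interval x of a process is the one that starts at its x-th checkpoint, and the Msg record and function names are hypothetical. Under that encoding, "same interval, or deliver happens-before send" becomes send_interval >= deliver_interval on the relaying process.

```python
from collections import deque
from typing import NamedTuple


class Msg(NamedTuple):
    sender: int
    send_interval: int     # interval x starts at checkpoint c_{sender,x}
    receiver: int
    deliver_interval: int  # interval in which the receiver delivered it


def zpath_via_messages(msgs, src, dst):
    """Condition 2 of the definition for src = (i, x) and dst = (j, y)."""
    (i, x), (j, y) = src, dst
    # m_0 must be sent by P_i after checkpoint c_{i,x}.
    frontier = deque(m for m in msgs if m.sender == i and m.send_interval >= x)
    seen = set(frontier)
    while frontier:
        m = frontier.popleft()
        # The last message must be delivered by P_j before checkpoint c_{j,y}.
        if m.receiver == j and m.deliver_interval < y:
            return True
        for nxt in msgs:
            relayed = (nxt.sender == m.receiver
                       and nxt.send_interval >= m.deliver_interval)
            if relayed and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False


def z_path_exists(msgs, src, dst):
    """Either condition 1 (same process, earlier checkpoint) or condition 2."""
    (i, x), (j, y) = src, dst
    return (i == j and x < y) or zpath_via_messages(msgs, src, dst)


def on_z_cycle(msgs, ckpt):
    """A checkpoint on a Z-cycle cannot belong to any consistent global checkpoint."""
    return zpath_via_messages(msgs, ckpt, ckpt)
```

For example, with two processes, a message sent by P_0 and delivered by P_1 before c_{1,1}, together with a message sent by P_1 after c_{1,1} and delivered by P_0 in the interval in which P_0 sent the first one, puts c_{1,1} on a Z-cycle.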

Rollback Propagation and The Domino Effect
Upon a failure of one or more processes, the dependencies induced by messages may force some of the processes that did not fail to roll back.
–This is commonly called rollback propagation.
–If the processes have to roll back to the beginning of the computation, this is called the domino effect.
Figure: failure of $P_2$ causes rollback to the beginning of the computation.
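Rollback propagation can be sketched as a fixed-point computation over the recorded messages. The encoding is an assumption made for illustration: interval x of a process starts at its x-th checkpoint, a process with n checkpoints (numbered 0 to n-1) is at "line n" while it still runs, and the names Msg and recovery_line are hypothetical.

```python
from typing import NamedTuple


class Msg(NamedTuple):
    sender: int
    send_interval: int     # interval x starts at checkpoint c_{sender,x}
    receiver: int
    deliver_interval: int


def recovery_line(num_ckpts, failed, msgs):
    """Return pid -> index of the checkpoint each process restarts from.

    num_ckpts maps pid -> number of checkpoints taken, including the
    initial one. A surviving process that never rolls back keeps the
    value num_ckpts[pid], which stands for its current volatile state.
    """
    line = {p: (n - 1 if p in failed else n) for p, n in num_ckpts.items()}
    changed = True
    while changed:
        changed = False
        for m in msgs:
            send_undone = m.send_interval >= line[m.sender]
            receipt_kept = m.deliver_interval < line[m.receiver]
            if send_undone and receipt_kept:
                # Orphan message: the receiver must roll back to the
                # checkpoint that precedes the delivery.
                line[m.receiver] = m.deliver_interval
                changed = True
    return line
```

When every rollback uncovers a new orphan message, the loop walks both processes back to checkpoint 0, which is the domino effect described above.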

Classification of Checkpoint-based Protocols
1. Uncoordinated checkpointing: each process takes its checkpoints independently.
2. Coordinated checkpointing: processes coordinate their checkpoints in order to save a system-wide consistent state.
3. Communication-induced checkpointing: forces each process to take checkpoints based on information piggybacked on the application messages it receives from other processes.

Communication-induced Checkpointing
Balances between uncoordinated and coordinated checkpointing:
–Allows processes to take some checkpoints independently. These checkpoints are called local checkpoints.
–Guarantees the eventual progress of the recovery line by forcing processes to take additional checkpoints, called forced checkpoints.

Communication-induced Checkpointing (2)
Communication-induced checkpointing piggybacks protocol-related information on each application message.
–In contrast with coordinated checkpointing, no special coordination messages are exchanged.
The receiver of each application message uses the piggybacked information to determine whether it has to take a forced checkpoint to advance the global recovery line. The forced checkpoint must be taken before the application may process the contents of the message.
–This causes high latency and overhead.
–The number of forced checkpoints needs to be reduced.

Protocols
–CBR (Checkpoint before receive)
–NRAS (No receive after send)
–FDAS (Fixed dependency after send)

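CBR (Checkpoint before receive)
Checkpoint and communication pattern, in the case of two processes:
–If a Z-cycle exists, then an SR sequence (a send followed by a receive) lies inside at least one checkpoint interval included in the Z-cycle.
–Therefore, a sufficient condition for preventing Z-cycles is to prevent the SR pattern in every interval: if no interval contains an SR pattern, no Z-cycle exists.
CBR meets this condition by taking a checkpoint before every receive.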
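The sufficient condition can be illustrated on a single process's event trace. In this sketch the trace encoding is an assumption: a list of "S" (send), "R" (receive) and "C" (checkpoint) events, with hypothetical helper names.

```python
def has_sr_pattern(trace):
    """True if some checkpoint interval contains a send followed by a receive."""
    sent_in_interval = False
    for event in trace:
        if event == "C":
            sent_in_interval = False  # a checkpoint starts a new interval
        elif event == "S":
            sent_in_interval = True
        elif event == "R" and sent_in_interval:
            return True
    return False


def apply_cbr(trace):
    """Insert a forced checkpoint before every receive, as CBR does."""
    out = []
    for event in trace:
        if event == "R":
            out.append("C")
        out.append(event)
    return out
```

Whatever the input trace, the output of apply_cbr contains no SR pattern, at the cost of one forced checkpoint per received message.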

NRAS (No receive after send)
upon the receipt of message m:
    IF after_first_send THEN Take_CKPT()
    deliver(m)
when sending message m:
    after_first_send := TRUE
    send(m)
Take_CKPT():
    CKPT  // save the checkpoint to disk
    after_first_send := FALSE
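The NRAS handlers translate almost directly into a small class. This is a sketch under assumptions: the network and the disk are replaced by in-memory lists and a counter, and the class and attribute names are hypothetical.

```python
class NRASProcess:
    """One process running NRAS (no receive after send)."""

    def __init__(self):
        self.after_first_send = False
        self.forced_checkpoints = 0
        self.delivered = []  # messages handed to the application
        self.outbox = []     # stand-in for the network

    def take_ckpt(self):
        # Stand-in for saving the process state to stable storage.
        self.forced_checkpoints += 1
        self.after_first_send = False

    def on_send(self, m):
        self.after_first_send = True
        self.outbox.append(m)

    def on_receive(self, m):
        # A receive after a send in the same interval is not allowed,
        # so close the interval with a forced checkpoint first.
        if self.after_first_send:
            self.take_ckpt()
        self.delivered.append(m)
```

Compared with a checkpoint before every receive, NRAS forces a checkpoint only for the first receive that follows a send in an interval.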

FDAS (Fixed dependency after send)
$P_i$ sends message m (see the handler "when sending message m"): it attaches to m a timestamp D, which is the current vector clock (VC) $D_i$ of the sending process at the time of sending. $P_i$ takes a local checkpoint (see the function Take_CKPT()): it increments its own entry of the local VC (Di[i]++). Every checkpoint taken must be counted, whether it is local or forced by the algorithm.

upon the receipt of message m:
    IF (after_first_send ∧ ∃k: m.D[k] > Di[k]) THEN Take_CKPT()
    for each k: Di[k] := max(Di[k], m.D[k])  // propagate the transitive knowledge
    deliver(m)
when sending message m:
    after_first_send := TRUE
    m.D := Di
    send(m)
Take_CKPT():
    CKPT  // save the checkpoint to disk
    after_first_send := FALSE
    Di[i] := Di[i] + 1
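The FDAS handlers can be sketched in the same way. The assumptions here are not taken from the slides: messages are (payload, vector) pairs returned by send instead of going over a network, all vector entries start at 0, and the class and attribute names are hypothetical.

```python
class FDASProcess:
    """One process running FDAS (fixed dependency after send)."""

    def __init__(self, pid, n_processes):
        self.pid = pid
        self.D = [0] * n_processes  # dependency vector (vector clock)
        self.after_first_send = False
        self.forced_checkpoints = 0
        self.delivered = []

    def take_ckpt(self, forced=False):
        # Stand-in for saving the state to stable storage.
        if forced:
            self.forced_checkpoints += 1
        self.after_first_send = False
        self.D[self.pid] += 1  # every checkpoint, local or forced, is counted

    def send(self, payload):
        self.after_first_send = True
        return (payload, list(self.D))  # piggyback a copy of the vector

    def receive(self, message):
        payload, piggybacked = message
        brings_new_dependency = any(
            theirs > mine for theirs, mine in zip(piggybacked, self.D)
        )
        # After the first send of an interval the vector must stay fixed,
        # so a message that would change it forces a checkpoint first.
        if self.after_first_send and brings_new_dependency:
            self.take_ckpt(forced=True)
        # Propagate the transitive knowledge.
        self.D = [max(mine, theirs) for mine, theirs in zip(self.D, piggybacked)]
        self.delivered.append(payload)
```

A message that carries no new dependency is delivered without a forced checkpoint even after a send, which is where FDAS saves checkpoints relative to NRAS.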