Software Faults are Soft Most hardware faults are soft: most hardware faults are trainsient. Retries like checksum and retransmissions are standard way to deal with hardware faults. Conjecture (J. Gray 85). Also software faults are soft. If some software operation fails and the application is restarted from a quiescent state, the operation will usually not fail the second time
Bohrbugs vs. Heisenbugs Considering a software systems which has gone through structured design, design review, quality assurance, alpha test etc. After all these phases most of the bugs that always fail on retry are gone (Bohrbugs) For the other bugs (heisenbugs) retrying techniques will work as the system model have changed since one minute ago!!! During the operational phase of a software system, 1 out of 150 software faults is a bohrbug (1985).
The importance of checkpointing Checkpointing is the quiscent state from which a computation can be restarted after a failure Checkpointing save the state of a process into stable storage What does it means checkpointing a distributed application?
Consistent System States A global state of a message-passing system consists of: –individual states of all processes –the states of communication channels A consistent global state is a global state in which if a processs state reflects a message receipt, then the state of the corresponding sender reflects sending that message
Consistent System States (2) Intuitively, a consistent global state is one that may occur during a failure-free, correct execution of a distributed computation. The goal of a rollback-retry protocol is to bring the system into a consistent state Problem taking local ckpt that cannot belong to any consistent global checkpoint
Consistent Global Checkpoint A local scheckpoint is a local state saved onto stable storagr A global checkpoint is a set of local checkpoints one for each process A global checkpoint is consistent is no local checkpoint happens-before the other (i.e., there are no missing messages) A local checkpoint that does not belong to any global checkpoint is useless.
Checkpoint and Communication Patterns
Z-Cycles and Z-Paths A Z-path (zigzag path) is a special sequence of messages that connects two checkpoints. Let denote Lamports happen-before relation. Let c i,x denote the x th checkpoint of process P i. Define the execution portion between two consecutive checkpoints on the same process to be the checkpoint interval (starting with the earlier checkpoint). Let send i and deliver i be the communication events by process P i.
Definition of Z-Path Given two checkpoints c i,x and c j,y, a Z-path exists between c i,x and c j,y if and only if one of the following two conditions holds: 1.x < y and i = j; or 2. There exists a sequence of messages [m 0, m 1,…, m n ], n 0, such that: c i,x send i (m 0 ); l < n, either deliver k (m l ) and send k (m l+1 ) are in the same checkpoint interval, or deliver k (m l ) send k (m l+1 ); and deliver j (m n ) c j,y
Z-Cycles and Z-Paths (2) Z-cycle is a Z-path that begins and ends with the same checkpoint. –Above, [m 5, m 4, m 3 ] is a Z-cycle that start and ends at checkpoint c 2,2. c 2,2 is involved in a z-cycle. It is useless (i.e., it cannot belong to any consistent global checkpoint) [m 1, m 2 ] and [m 3, m 4 ] are Z-paths between c 0,1 and c 2,2
Rollback Propagation and The Domino Effect Upon a failure of one or more processes, the dependencies induced by messages may force some of the processes that did not fail to roll back. –This is commonly called rollback propagation. –If the processes have to roll back to the beginning of the computation, this is called the domino effect. Failure of P 2 causes rollback to the beginning of the computation
Classification of Checkpoint- based Protocols 1.Uncoordinated checkpointing – each process takes its checkpoints independently 2.Coordinated checkpointing – processes coordinate their checkpoints in order to save a system-wide consistent state 3.Communication-induced checkpointing – forces each process to take checkpoints based on information piggybacked on the application messages it receives from other processes.
Communication-induced Checkpointing Balances between uncoordinated and coordinated checkpointing –Allows processes to take some checkpoints independently. These checkpoints are called local checkpoints –Guarantees the eventual progress of the recovery line by forcing processes to take additional checkpoints, called forced checkpoints.
Communication-induced Checkpointing (2) Communication-induced checkpointing piggybacks protocol-related information on each application message. –In contrast with coordinated checkpointing, no special coordination messages are exchanged. The receiver of each application message uses the piggybacked information to determine if it has to take a forced checkpoint to advance the global recovery line. The forced checkpoint must be taken before the application may process the contents of the message. –high latency and overhead –need to reduce the number of forced checkpoints
Protocols CBR (Checkpoint before receive) NRAS (No receive after send) FDAS (Fixed dependency after send)
CBR Checkpoint and communication pattern Checkpoint before receive The case of two processes esite Z-cicle sequenza SR dentro almeno 1 intervallo compreso nello Z-cicle Quindi una condizione sufficiente per prevenire Z-cicle è Prevenire pattern SR in tutti gli intervalli non esistono Z-cicle
NRAS upon the receive of message m IF after_first_send THEN Take_CKPT() ELSE deliver(m) when sending message m after_first_send=TRUE Take_CKPT() CKPT// salva chek point su disco after_first_send=FALSE
FDAS Pi invia msg m (vedi handler: when sending message m): carica su m un timestamp D che è il VC corrente (Di) del processo mittente allatto dellinvio. Pi prende un ckpt locale (vedi funzione: Take_CKPT()): incremento VC locale (Di[i]++). Devo tener conto dei ckpt (locali o forzati dallalgoritmo) che prendo.
upon the receive of message m IF (after_first_send k: m.D[k]>Di[k])THEN Take_CKPT() for each K Di[k]:=max (Di[k], m.D[k]) //propago la conoscenza transitiva deliver(m) when sending message m after_first_send:=TRUE send (m.D) Take_CKPT() CKPT// salva chek point su disco after_first_send=FALSE Di[i]:=Di[i]+1