Chkpnt_3 Slide 1 ECE 442, SPR 2004 Design of Reliable Systems and Networks ECE 442 Checkpointing & Recovery (III) Ravi K. Iyer Center for Reliable and High-Performance Computing Department of Electrical and Computer Engineering and Coordinated Science Laboratory University of Illinois at Urbana-Champaign iyer@crhc.uiuc.edu http://www.crhc.uiuc.edu/DEPEND

Chkpnt_3 Slide 2 ECE 442, SPR 2004 Outline: Asynchronous checkpointing and recovery. Examples: –Checkpointing in distributed database systems –Micro-checkpointing: checkpointing of multithreaded processes –Checkpoint and restart in the IRIX operating system (SGI)

Chkpnt_3 Slide 3 ECE 442, SPR 2004 Asynchronous Checkpointing and Recovery Checkpoints at each process are taken independently without any synchronization among the processors. There is no guarantee that a set of local checkpoints taken will be a consistent set of checkpoints. The recovery algorithm must search for the most recent consistent set of checkpoints before it initiates recovery. [Figure: timelines of processes X, Y, and Z against time, with checkpoints x1 to x3, y1 to y3, and z1 to z2, marking an inconsistent recovery line and the most recent consistent recovery line.]

Chkpnt_3 Slide 4 ECE 442, SPR 2004 Asynchronous Checkpointing and Recovery (cont.) All incoming messages are logged at each process. –This minimizes the amount of computation to undo during a rollback. –The messages received after setting the recovery point can be processed again. Message logging –Pessimistic: An incoming message is logged before it is processed. This slows down the computation, even when there are no failures. –Optimistic: Processors continue to perform the computation, and the messages received are stored in volatile storage and logged at certain intervals. Messages that are not logged (stored on stable storage) can be lost in the event of a rollback. This does not slow down the underlying computation.
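
The difference between the two logging policies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `StableLog` is a hypothetical in-memory stand-in for stable storage, and the class names, the `flush_every` parameter, and the `crash` method are invented for the example rather than taken from the lecture.

```python
class StableLog:
    """Hypothetical stand-in for stable storage; a real system would write to disk."""

    def __init__(self):
        self.entries = []

    def append(self, msg):
        self.entries.append(msg)


class PessimisticReceiver:
    """Logs every incoming message to stable storage before processing it."""

    def __init__(self):
        self.log = StableLog()
        self.processed = []

    def receive(self, msg):
        self.log.append(msg)        # log first (this is the cost paid on every message)
        self.processed.append(msg)  # then process


class OptimisticReceiver:
    """Processes immediately; buffers messages in volatile memory and logs in batches."""

    def __init__(self, flush_every=3):
        self.log = StableLog()
        self.volatile = []
        self.processed = []
        self.flush_every = flush_every  # assumed logging interval, in messages

    def receive(self, msg):
        self.processed.append(msg)
        self.volatile.append(msg)
        if len(self.volatile) >= self.flush_every:
            self.flush()

    def flush(self):
        for msg in self.volatile:
            self.log.append(msg)
        self.volatile.clear()

    def crash(self):
        """Simulate a failure: return the messages that never reached stable storage."""
        lost = list(self.volatile)
        self.volatile.clear()
        return lost
```

With `flush_every=3`, an optimistic receiver that has handled four messages and then crashes has only the first three in its log; the fourth is exactly the kind of message that cannot be replayed during recovery.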

Chkpnt_3 Slide 5 ECE 442, SPR 2004 Optimistic Message Logging Messages are not necessarily logged before being processed. Unlogged messages are not available during recovery. States in other processes that causally depend upon lost messages are called orphan states. Processes that have orphan states must roll back. Dependencies are tracked through state intervals: –A process consists of a sequence of state intervals. –Receipt of a message starts a new state interval. –Outgoing messages depend upon the current state interval of the sending process. [Figure: timelines of processes X and Y showing state intervals 85, 86, and 43.]

Chkpnt_3 Slide 6 ECE 442, SPR 2004 Optimistic Message Logging (cont.) Each process keeps a dependency vector: –One entry per process in the system. –Entry for process j specifies latest state interval in process j on which the process is dependent. Dependency vector piggybacked on outgoing messages. Receivers update their own dependency vector from piggybacked vector. Causal dependencies propagated through piggybacked vector.

Chkpnt_3 Slide 7 ECE 442, SPR 2004 Piggybacked Dependency Vector The example shows the dependency vector being updated as time progresses. The dependency vector of Z after receipt of m3 shows that Z is dependent upon state 5 of X and state 11 of Y. [Figure: timelines of processes X, Y, and Z with messages m1, m2, and m3. State intervals: X 4 and 5, Y 10 and 11, Z 3 and 4. Dependency vectors in (X, Y, Z) order: (5, -, -), (5, 11, -), and (5, 11, 4).]
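
A minimal Python sketch of the piggybacking rule, replaying the X, Y, Z example from the slide. The `Process` class and its `send` and `receive` methods are illustrative names, not part of the lecture, and it is assumed here that m1 reaches X from outside the three processes with no vector attached.

```python
class Process:
    """Tracks a dependency vector with one entry per process (None means no dependency)."""

    def __init__(self, name, all_names, start_interval):
        self.name = name
        self.dep = {n: None for n in all_names}
        self.dep[name] = start_interval  # own entry holds the current state interval

    def send(self):
        """Return the copy of the dependency vector piggybacked on an outgoing message."""
        return dict(self.dep)

    def receive(self, piggybacked=None):
        """Receipt of a message starts a new state interval, then merges the vector."""
        self.dep[self.name] += 1
        for other, interval in (piggybacked or {}).items():
            if other == self.name or interval is None:
                continue
            if self.dep[other] is None or interval > self.dep[other]:
                self.dep[other] = interval


names = ["X", "Y", "Z"]
x = Process("X", names, 4)
y = Process("Y", names, 10)
z = Process("Z", names, 3)

x.receive()        # m1 arrives at X: X enters state interval 5
m2 = x.send()
y.receive(m2)      # m2 arrives at Y: Y enters state interval 11
m3 = y.send()
z.receive(m3)      # m3 arrives at Z: Z enters state interval 4
```

After the three receipts, Z's vector carries X's interval 5 even though Z never heard from X directly; the dependency travelled through Y's piggybacked vector.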

Chkpnt_3 Slide 8 ECE 442, SPR 2004 Recovery X fails; if X has not logged m1 to disk at the time of failure, then m1 is unrecoverable. There is no guarantee that state 5 of X can be recreated exactly as before. All states dependent on state 5 of X are orphan states. When X recovers, it broadcasts to the other processes that it can recreate its state up to state 4. The other processes check their dependency vectors and roll back if they are dependent on a state interval of X greater than 4. [Figure: the same X, Y, Z timelines with messages m1, m2, and m3 and dependency vectors (5, -, -), (5, 11, -), and (5, 11, 4); the failure of X is marked.]
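
The orphan check each process performs on receiving X's recovery broadcast reduces to one comparison. The sketch below is illustrative: the function name `must_roll_back` and the dictionary representation of a dependency vector are assumptions made for the example.

```python
def must_roll_back(dep_vector, failed, recoverable_up_to):
    """Return True if this process depends on a state interval of the failed
    process that is later than the last interval the failed process can recreate."""
    seen = dep_vector.get(failed)
    return seen is not None and seen > recoverable_up_to


# Dependency vectors from the slide's example, keyed by process name.
y_vector = {"X": 5, "Y": 11, "Z": None}
z_vector = {"X": 5, "Y": 11, "Z": 4}

# X announces that it can recreate its state only up to interval 4.
y_is_orphan = must_roll_back(y_vector, "X", 4)
z_is_orphan = must_roll_back(z_vector, "X", 4)
```

Both Y and Z depend on interval 5 of X, so both are orphans and must roll back; a process whose entry for X is 4 or lower, or empty, keeps running.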

Chkpnt_3 Slide 9 ECE 442, SPR 2004 Asynchronous Checkpoint and Recovery Algorithm: An Example Communication channels are reliable. Messages are delivered in the order in which they were sent. Each process keeps track of the number of messages that were –sent to other processes –received from other processes. A process, upon restarting after a failure, broadcasts a message that it has failed. All processes determine orphan messages by comparing the numbers of messages sent and received. A process rolls back to a state where the number of messages it has received is not greater than the number of messages sent to it (according to the states of the other processes).

Chkpnt_3 Slide 10 ECE 442, SPR 2004 Asynchronous Checkpoint and Recovery Algorithm: An Example (cont.) If Y rolls back to state e_y1, then Y has sent only one message to X. X has received two messages from Y thus far, so X must roll back to a state preceding e_x1 (to be consistent with Y’s state). For similar reasons, Z must also roll back. [Figure: timelines of processes X, Y, and Z against time, with events e_x0 to e_x2, e_y0 to e_y3, and e_z0 to e_z3; a failure is marked with an X.]
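
The counting rule can be sketched as a fixed-point search in Python. The history below is hypothetical and does not reproduce the slide's figure: each state records cumulative counts of messages sent to and received from each peer, state 0 of every process is assumed to have all counts at zero, and the names `state` and `find_recovery_line` are invented for the example.

```python
def state(sent=None, recv=None):
    """One recorded state: cumulative messages sent to / received from each peer."""
    return {"sent": sent or {}, "recv": recv or {}}


def find_recovery_line(histories, start):
    """Roll processes back until no process has received more messages from a
    peer than that peer, in its own chosen state, has sent to it."""
    line = dict(start)
    changed = True
    while changed:
        changed = False
        for p, states in histories.items():
            for q in histories:
                if p == q:
                    continue
                sent_by_q = histories[q][line[q]]["sent"].get(p, 0)
                # State 0 has zero receive counts, so this loop stops at index 0.
                while states[line[p]]["recv"].get(q, 0) > sent_by_q:
                    line[p] -= 1
                    changed = True
    return line


# Hypothetical history: Y sends two messages to X; X then sends one to Z.
histories = {
    "X": [state(), state(recv={"Y": 1}), state(recv={"Y": 2}, sent={"Z": 1})],
    "Y": [state(), state(sent={"X": 1}), state(sent={"X": 2})],
    "Z": [state(), state(recv={"X": 1})],
}

# Y has failed and restarts from its state 1; X and Z start from their latest states.
recovery_line = find_recovery_line(histories, {"X": 2, "Y": 1, "Z": 1})
```

In this run, X has received two messages from Y but Y's restored state has sent only one, so X rolls back one state; that undoes X's message to Z, which in turn forces Z back to its initial state.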

Chkpnt_3 Slide 11 ECE 442, SPR 2004 Checkpointing in Distributed Database Systems (DDBS) In a DDBS, a set of data objects is partitioned among several sites. Checkpoints should be taken with minimal interference with normal operations. Sites take local checkpoints recording the state of the local database. It is desirable that the checkpoints are consistent. Consistent checkpointing requires that –the state updates of a transaction (the basic unit of user activity, which may be carried out at many different sites) are included in all the checkpoints completely or not at all –all the sites synchronize. Transactions may have to be blocked while checkpointing is in progress, thereby interfering with normal operations.

Chkpnt_3 Slide 12 ECE 442, SPR 2004 Checkpointing in DDBS: Issues –How the sites agree on which transactions' updates are to be included in their checkpoints. –How each site can take a local checkpoint in a non-blocking fashion.

Chkpnt_3 Slide 13 ECE 442, SPR 2004 Checkpointing in a DDBS: Assumptions The basic unit of user activity is a transaction. Transactions follow some concurrency control protocol (i.e., the database system maintains the consistency of the database). No two transactions have the same timestamp. –Lamport’s logical clocks are used to assign a timestamp to each transaction. Site failures are detectable either by network protocols or by timeout mechanisms. Network partitioning never occurs. The checkpoint algorithm is initiated by a special process, the checkpoint coordinator (CC). The CC takes a consistent checkpoint with the help of processes called checkpoint subordinates (CS) running at every site.

Chkpnt_3 Slide 14 ECE 442, SPR 2004 Checkpointing in a DDBS: The Algorithm Phase One At the checkpoint coordinator (CC) site: –CC broadcasts a Checkpoint_Request with its local timestamp LC_cc –LCPN_cc := LC_cc (LCPN is the local checkpoint number) –CONVERT_cc := false –CC waits for replies (LCPNs, local checkpoint numbers) from all subordinate sites. At all the checkpoint subordinate (CS) sites: –Site m updates its local clock: LC_m := MAX(LC_m, LC_cc + 1) –LCPN_m := LC_m –Site m sends LCPN_m to CC –CONVERT_m := false –Site m marks all transactions with a timestamp not greater than LCPN_m as before-checkpoint transactions (BCPTs) and the rest of the transactions as temporary after-checkpoint transactions (ACPTs)

Chkpnt_3 Slide 15 ECE 442, SPR 2004 Checkpointing in a DDBS: The Algorithm (cont.) Note: All updates by ACPTs are stored in the buffers of the ACPTs. If an ACPT commits, the data objects it updated are maintained as committed temporary versions (CTVs). If another transaction wishes to use an object for which a CTV exists: –For a read, the data stored in the CTV is returned. –For a write (update), another version of the object is created.

Chkpnt_3 Slide 16 ECE 442, SPR 2004 Checkpointing in a DDBS: The Algorithm (cont.) Phase Two At the checkpoint coordinator site: –Once all replies have been received, the coordinator broadcasts the global checkpoint number (GCPN): GCPN := MAX(LCPN_1, LCPN_2, ..., LCPN_n). At all the checkpoint subordinate sites: –Site m marks all temporary ACPTs with a timestamp not greater than GCPN as BCPTs. The updates of the newly converted BCPTs are also included in the checkpoint. The updates due to the remaining ACPTs will be flushed to the database after the current checkpoint is completed. –CONVERT_m := true, indicating that GCPN is known and all BCPTs have been identified.

Chkpnt_3 Slide 17 ECE 442, SPR 2004 Checkpointing in a DDBS: The Algorithm (cont.) –When all the BCPTs terminate, site m takes a local checkpoint by saving the state of the data objects. –When the local checkpoint is taken, the database is updated with the committed temporary versions, and the committed temporary versions are deleted. Note: If site m receives a new “initiate transaction” message for a transaction with a timestamp not greater than GCPN, and all BCPTs have already been identified, then site m rejects the “initiate transaction” message.
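
The timestamp bookkeeping of the two phases can be sketched in Python. This is a simplified, single-threaded illustration: the `Site` class, its method names, and the sample clocks and transaction timestamps are hypothetical, and message passing, CTV handling, and the actual saving of data objects are left out.

```python
class Site:
    """A checkpoint subordinate site with a logical clock and known transactions."""

    def __init__(self, name, clock, txn_timestamps):
        self.name = name
        self.clock = clock                # LC_m
        self.txns = list(txn_timestamps)  # timestamps of transactions at this site
        self.lcpn = None                  # LCPN_m
        self.convert = False              # CONVERT_m
        self.bcpt = set()                 # before-checkpoint transactions
        self.acpt = set()                 # temporary after-checkpoint transactions

    def phase_one(self, lc_cc):
        """Handle Checkpoint_Request: pick LCPN_m and split transactions."""
        self.clock = max(self.clock, lc_cc + 1)
        self.lcpn = self.clock
        self.convert = False
        self.bcpt = {t for t in self.txns if t <= self.lcpn}
        self.acpt = {t for t in self.txns if t > self.lcpn}
        return self.lcpn  # reply sent to the coordinator

    def phase_two(self, gcpn):
        """Handle the GCPN broadcast: convert temporary ACPTs up to GCPN into BCPTs."""
        promoted = {t for t in self.acpt if t <= gcpn}
        self.bcpt |= promoted
        self.acpt -= promoted
        self.convert = True

    def accepts_new_transaction(self, timestamp, gcpn):
        """Reject late transactions once all BCPTs have been identified."""
        return not (self.convert and timestamp <= gcpn)


# Coordinator side, with an assumed local timestamp LC_cc = 10.
site_a = Site("A", clock=8, txn_timestamps=[5, 9, 12])
site_b = Site("B", clock=14, txn_timestamps=[11, 13, 16])
replies = [site.phase_one(10) for site in (site_a, site_b)]
gcpn = max(replies)
for site in (site_a, site_b):
    site.phase_two(gcpn)
```

Site A initially treats transaction 12 as after-checkpoint because its own LCPN is 11, but the global maximum of 14 pulls it into the checkpoint; that conversion is what makes every site include the same set of transactions.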

Chkpnt_3 Slide 18 ECE 442, SPR 2004 Micro-checkpointing: Checkpointing of Multithreaded Processes, An Example of ARMOR State Checkpointing

Chkpnt_3 Slide 19 ECE 442, SPR 2004 ARMOR Architecture An ARMOR is a multithreaded process composed of replaceable, basic building blocks called elements. –An element is a depository of replaceable functions within the ARMOR. –A building block typically provides an elementary detection/recovery service. An ARMOR supports –a unified interface to invoke services provided by elements –static and dynamic customization of services. [Figure: an ARMOR drawn as an ARMOR interface with elements behind it.]

Chkpnt_3 Slide 20 ECE 442, SPR 2004 Example ARMOR Configuration [Figure: an ARMOR (ARMOR interface plus elements) configured from a repository of elements. Element labels in the figure: progress indicator, HB, checksum, checkpoint, text-segment signature, range-check, assertion check, control flow signature, and data dependency checking.]

Chkpnt_3 Slide 21 ECE 442, SPR 2004 Processing Within a Thread Each incoming message is processed in its own thread. Elements can only access private data (and payload fields in a message). State changes are only made during operation processing. [Figure: a message carrying operations OP_A, OP_B, and OP_C with payload fields, processed by elements E1 to E4.]

Chkpnt_3 Slide 22 ECE 442, SPR 2004 Concept of Micro-checkpointing A single checkpoint buffer is maintained per multithreaded ARMOR process. The element state is checkpointed after each operation. Checkpoints are committed to stable storage after processing a message. There is no need to do process-wide checkpoints of stacks, heap, etc. The existing locking policy of element data removes the need to suspend all threads. Overhead is reduced in comparison with process-wide checkpointing.
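
A rough Python sketch of the idea, not the ARMOR implementation: the `Element` and `Armor` classes, the counter-style element state, and the in-memory `stable` dictionary standing in for stable storage are all assumptions made for illustration.

```python
import copy
import threading


class Element:
    """An element with private state protected by its own lock."""

    def __init__(self, name):
        self.name = name
        self.state = {}
        self.lock = threading.Lock()

    def run(self, op, payload):
        """Process one operation and return a snapshot of the resulting state."""
        with self.lock:
            self.state[op] = self.state.get(op, 0) + payload
            return copy.deepcopy(self.state)


class Armor:
    """Holds one checkpoint buffer for the whole multithreaded process."""

    def __init__(self, elements):
        self.elements = {e.name: e for e in elements}
        self.buffer = {}                     # single per-process checkpoint buffer
        self.buffer_lock = threading.Lock()
        self.stable = {}                     # hypothetical stand-in for stable storage

    def process_message(self, operations):
        """Run each (element, operation, payload) step, micro-checkpointing as we go."""
        for element_name, op, payload in operations:
            snapshot = self.elements[element_name].run(op, payload)
            with self.buffer_lock:
                self.buffer[element_name] = snapshot  # checkpoint after each operation
        with self.buffer_lock:
            self.stable = copy.deepcopy(self.buffer)  # commit once per message


armor = Armor([Element("E1"), Element("E2")])
armor.process_message([("E1", "OP_A", 1), ("E2", "OP_B", 2), ("E1", "OP_A", 3)])
```

Only the state of the elements touched by a message is copied, and only the per-element lock is held while copying, which is why no thread suspension or stack and heap dump is involved in this sketch.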

Chkpnt_3 Slide 23 ECE 442, SPR 2004 IRIX Operating System (SGI) Checkpoint and Restart A facility for saving running process(es) and, at some other time, restarting the saved process(es) from the point already reached, without starting all over again. A checkpoint image is saved in a set of disk files and can comprise –A set of processes (one or more), e.g., $ cpr -c ckptSep7 -p 1234 where cpr -c is the checkpoint command, ckptSep7 is the statefile name, and the -p option specifies a process ID –All processes in the process group (a set of processes that constitute a logical job) –All processes in a process session (a set of processes started from the same physical or logical terminal) –All processes in an IRIX array session (a set of related processes running on different nodes in an array). The array service daemon supports checkpointing across the nodes. To restart a set of processes, the cpr command is used with the option -r: $ cpr -r ckptSep7 –If the restart involves more than one process, all restarts must succeed before any process can run; otherwise all restarts fail.

Chkpnt_3 Slide 24 ECE 442, SPR 2004 IRIX Operating System (SGI) Checkpointable & Non-Checkpointable Objects Checkpointable objects (objects that are checkpoint safe): –Process set ID –User memory (data, text, stack) –Kernel execution state (e.g., signal mask, scheduling information, current and root directory) –System calls –Undelivered and queued signals –List of open files and devices –Pipeline setup and shared memory. Non-checkpointable objects (objects that are not checkpoint safe): –Network socket connections –X terminals and X11 client sessions –Graphics state –File pointers to mounted CD-ROM(s)

Chkpnt_3 Slide 25 ECE 442, SPR 2004 IRIX Operating System (SGI) Application Handling of Non-Checkpointable Objects To handle non-checkpointable objects (e.g., network sockets, file pointers to mounted CD-ROM(s)), an application needs to: –Add an event handler to catch the signals SIGCKPT and SIGRESTART –Run signal handlers to disconnect any open socket (or close open cdFiles and unmount the CD-ROM) before the checkpoint, and to reconnect the socket (or mount the CD-ROM and reopen the cdFiles) after the restart. Two functions are provided for applications to add cpr event handlers: –atcheckpoint(my_cpt_handler()) adds the application’s checkpoint handling function to the list of functions that get called upon receipt of SIGCKPT –atrestart(my_callback()) registers the application’s callback function for execution upon receipt of SIGRESTART.
