Copyright 2004 Koren & Krishna ECE655/Ckpt Part.11.1 Fall 2006 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Fault Tolerant Computing.


ECE 655 Checkpointing II

Checkpoint Placement - Notations
- Checkpoint placement is a tradeoff between cost and benefit, aimed at minimizing the expected execution time of a long job
- Cost: the time to store a checkpoint, which can be large
- $t_x$: execution time without checkpoints
- $t_c$: average time to take a checkpoint
- $N$: decision variable, the number of checkpoints placed uniformly in the job, chosen to minimize the total execution time $T_{tot}(N)$
- $\tau_x = t_x / N$: time between consecutive checkpoints
- Failures occur at rate $\lambda$
- Failures are transient and go away after a mean lifetime $t_f$
- The system then rolls back to the latest checkpoint
- Checkpoints are kept in secure memory, uncorrupted by failures

Checkpoint Placement - Analytical Model
- $t_l$: total time lost for every transient failure
- $t_f$: time the system is down
- If the failure occurs during checkpointing: probability $p_c = t_c / (t_c + \tau_x)$, lost time $\tau_x + t_c/2$
- If the failure occurs during execution: probability $p_x = \tau_x / (t_c + \tau_x)$, lost time $\tau_x / 2$
- $t_l = t_f + p_c (\tau_x + t_c/2) + p_x \tau_x / 2 = t_f + (t_c + \tau_x)/2$
- This result is intuitive: $(t_c + \tau_x)/2$ is half the interval $t_c + \tau_x$
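The simplification of the two probability-weighted terms can be checked by putting them over a common denominator. A short worked step, with $p_c = t_c/(t_c + \tau_x)$ and $p_x = \tau_x/(t_c + \tau_x)$ as defined in the model:

```latex
\begin{align*}
p_c\left(\tau_x + \frac{t_c}{2}\right) + p_x\,\frac{\tau_x}{2}
  &= \frac{t_c\left(\tau_x + t_c/2\right) + \tau_x^2/2}{t_c + \tau_x} \\
  &= \frac{t_c^2 + 2\,t_c\,\tau_x + \tau_x^2}{2\,(t_c + \tau_x)}
   = \frac{t_c + \tau_x}{2}
\end{align*}
```

Adding the down time $t_f$ gives $t_l = t_f + (t_c + \tau_x)/2$.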

Optimal Checkpoint Placement
- Assume $\lambda$ is sufficiently small that the probability of a failure during rollback is negligible
- The expected number of failures during the total execution time $t_x + N t_c$ is $\lambda (t_x + N t_c)$
- Total time taken: $T_{tot}(N) = t_x + N t_c + \lambda [t_x + N t_c][t_f + (t_c + t_x/N)/2]$
- Select $N$ so as to minimize $T_{tot}(N)$
- $\partial T_{tot}(N) / \partial N = t_c + \lambda t_c (t_c/2 + t_f) - (\lambda t_x^2)/(2N^2)$
- Setting the derivative to zero, we obtain $N_{opt} = t_x \sqrt{\lambda / [t_c (2 + \lambda t_c + 2 \lambda t_f)]}$

Optimal Value of N
- $N$ must be a whole number: whichever of the floor or the ceiling of $N_{opt}$ minimizes $T_{tot}(N)$
- Optimal inter-checkpoint interval: $\tau_{opt} = t_x / N_{opt}$
- Exercise: relax the assumption that the probability of additional failures during the recovery process is negligible
- Uniform placement is optimal if the checkpointing cost is constant throughout the execution
- If the checkpoint size, and hence the checkpointing time, varies greatly from one part of the execution to another, the optimal time between checkpoints is not constant
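The placement rule can be tried numerically. The following is a minimal sketch, not the authors' code: the function names `total_time` and `optimal_n` and the parameter values in the usage note are hypothetical. It evaluates $T_{tot}(N)$, computes the real-valued optimum from the closed form, and then picks the better of its floor and ceiling.

```python
import math


def total_time(n, t_x, t_c, t_f, lam):
    """Expected total execution time T_tot(N) for N uniformly placed checkpoints."""
    tau_x = t_x / n                           # time between consecutive checkpoints
    lost_per_failure = t_f + (t_c + tau_x) / 2
    busy = t_x + n * t_c                      # execution plus checkpointing time
    return busy + lam * busy * lost_per_failure


def optimal_n(t_x, t_c, t_f, lam):
    """Return (real-valued optimum, best whole number of checkpoints)."""
    n_real = t_x * math.sqrt(lam / (t_c * (2 + lam * t_c + 2 * lam * t_f)))
    # N must be a whole number of at least 1: compare floor and ceiling.
    candidates = {max(1, math.floor(n_real)), max(1, math.ceil(n_real))}
    best = min(candidates, key=lambda n: total_time(n, t_x, t_c, t_f, lam))
    return n_real, best
```

For example, `optimal_n(10000.0, 10.0, 5.0, 1e-3)` (illustrative values only) returns a real-valued optimum slightly above 70, so the whole-number choice is 70 or 71, whichever gives the smaller `total_time`.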

Optimal Checkpoint Placement - An Instruction Level Model
- The probability of a fault during an instruction execution depends on the functional units used and on its execution time
- Decision variable $M$: number of instructions between consecutive checkpoints
- Objective: minimize $W$, the time spent per instruction
- The instruction set is partitioned into $N$ subsets of similar instructions
- For a type $i$ instruction: execution time $T_i$, failure rate $\lambda_i$, frequency $f_i$ ($\sum_i f_i = 1$)
- $s$ ($1 - s$): fraction of permanent (transient) faults
- $\mu_i$: "repair" rate of a transient failure in a type $i$ instruction

Instruction Level Model - Notations
- Possible events during an instruction execution:
  - $H_c$: the instruction completes successfully when first executed, with probability $P_c$
  - $H_{RB}$: the instruction fails, the failure is identified, the program is rolled back to the last checkpoint, and the instruction completes, with probability $P_{RB}$
  - $H_{PF}$: the program rollback fails, the program fails, and the program is reloaded and restarted, with probability $P_{PF}$
- $P_c^i$, $P_{RB}^i$, $P_{PF}^i$: conditional probabilities for a type $i$ instruction
- These conditional probabilities are calculated and then averaged over the instruction types, weighted by the frequencies $f_i$: $P_c = \sum_i f_i P_c^i$, and likewise for $P_{RB}$ and $P_{PF}$

Instruction Level Model - Further Notations
- For a system with failure rate $\lambda$ and repair rate $\mu$, the model uses:
  - the probability of no faults in the time interval $(0, t)$
  - the probability of a transition from the fault-free state at time 0 to the fault-free state at time $t$
  - for $0 \le t_1 \le t$, the probability of a transition from the fault-free state at time 0 to the fault-free state at time $t$ with at least one fault during $(0, t_1)$
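For reference, these three probabilities have standard closed forms if one assumes exponentially distributed times to failure and to repair, that is, a two-state Markov model. That assumption and the shorthand $p_{00}(t)$ are introduced here for illustration and are not stated in the slide:

```latex
\begin{align*}
\Pr\{\text{no fault in } (0,t)\} &= e^{-\lambda t} \\
p_{00}(t) &= \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu}\,e^{-(\lambda+\mu)t} \\
\Pr\{\text{fault-free at } t,\ \text{at least one fault in } (0,t_1)\}
  &= p_{00}(t) - e^{-\lambda t_1}\,p_{00}(t - t_1), \qquad 0 \le t_1 \le t
\end{align*}
```

The last line subtracts, from all ways of being fault-free at $t$, those in which no fault occurred before $t_1$; by the Markov property the latter factor into $e^{-\lambda t_1}$ and $p_{00}(t - t_1)$.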

Instruction Level Model - Probabilities
- $M$: number of instructions between checkpoints
- $m$: number of instructions between the failing instruction and the last checkpoint; $m = 1, \ldots, M$ with probability $1/M$ each
- $P_{RB}^{i,m}$: conditional probability of a successful rollback, given a type $i$ instruction and $m$ instructions to the last checkpoint
- $\tau_1$: setup time needed to initiate a program rollback, including the time needed to load the information saved in the last checkpoint

Instruction Level Model - Calculating W
- $\tau$: mean time to successfully execute an instruction
- $T_s$: time taken for checkpointing
- Time spent per instruction: $W = \tau + T_s / M$
- As $M$ increases, the first term increases and the second decreases
- $T = \sum_i f_i T_i$: mean execution time of a fault-free instruction
- $\tau_2$: average time required for diagnosis and repair
- $L$: average number of instructions per program
- $\tau$ includes $W$ as a term
- $\tau$ is substituted into the expression for $W$, which is then solved for $W$

Optimal Value of M
- Solving for $W$ gives an expression in the form of a fraction
- The optimal value of $M$, the one that minimizes $W$, is found iteratively
- The initial value of $M$ is obtained by substituting 1 for the denominator and 0 for $\tau_2$, taking the derivative with respect to $M$, and setting it equal to 0

CARER: Cache-Aided Rollback Error Recovery
- Reducing the checkpointing time allows more frequent checkpoints, which reduces the penalty of a rollback upon failure
- The CARER scheme reduces the time required to take a checkpoint by marking the process footprint in main memory and cache as part of the checkpointed state
- This assumes that the memory and cache are far less prone to failure than the processor
- Checkpointing consists of storing the processor's registers in main memory; the checkpoint includes the process's footprint in main memory plus any cache lines marked as being part of the checkpoint

Checkpoint Bit For Each Cache Line
- This requires a hardware modification: an extra checkpoint bit associated with each cache line
- When this bit is 1, the corresponding line is unmodifiable: the line is part of the latest checkpoint and may not be updated without forcing a checkpoint to be taken immediately
- When the bit is 0, the processor is free to modify the line
- The process's footprint in main memory and the marked lines in the cache do double duty as both memory and part of the checkpoint, which leaves less freedom in deciding when checkpoints have to be taken

Forced Checkpointing
- Checkpointing is forced when:
  - a line marked unmodifiable is to be updated
  - anything in main memory is to be updated
  - an I/O instruction is executed or an external interrupt occurs
- Taking a checkpoint involves:
  - (a) saving the processor registers in memory
  - (b) setting to 1 the checkpoint bit associated with each valid cache line

Roll Back - Cost
- Rolling back to the previous checkpoint is very simple: restore the registers and mark as invalid all cache lines whose checkpoint bit is 0
- The cost:
  - a checkpoint bit for every cache line
  - every write-back of a cache line into main memory involves taking a checkpoint
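The checkpoint-bit rules can be illustrated with a toy software model. This is only a sketch under stated assumptions: the class name `CarerSketch` and its methods are hypothetical, real CARER is a hardware mechanism, and the write-back of the old value before an unmodifiable line is overwritten is an assumption made here so that the checkpointed value survives in main memory.

```python
class CarerSketch:
    """Toy model of a processor with a checkpoint bit on every cache line."""

    def __init__(self):
        self.registers = {}
        self.saved_registers = {}   # register copy stored in main memory
        self.memory = {}            # main memory, part of the checkpointed state
        self.cache = {}             # address -> [value, checkpoint_bit]
        self.checkpoints_taken = 0

    def take_checkpoint(self):
        # (a) save the registers, (b) set the bit on every valid cache line.
        self.saved_registers = dict(self.registers)
        for line in self.cache.values():
            line[1] = 1
        self.checkpoints_taken += 1

    def write(self, address, value):
        line = self.cache.get(address)
        if line is not None and line[1] == 1:
            # Updating an unmodifiable line forces a checkpoint.
            # Assumption: the old value is written back to memory first.
            self.memory[address] = line[0]
            self.take_checkpoint()
        self.cache[address] = [value, 0]   # modified since the last checkpoint

    def read(self, address):
        line = self.cache.get(address)
        return line[0] if line is not None else self.memory.get(address)

    def roll_back(self):
        # Restore the registers and invalidate every line whose bit is 0.
        self.registers = dict(self.saved_registers)
        self.cache = {a: l for a, l in self.cache.items() if l[1] == 1}
```

Writes to lines with the bit cleared cost nothing extra; only the first write to a marked line after a checkpoint triggers a new one, which is the loss of freedom in choosing checkpoint times that the scheme trades for cheap checkpoints.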

Checkpointing in Distributed Systems
- Distributed system: processors and associated memories connected by a network
- Each processor may have local disks
- There can also be a network file system accessible by all processors
- Processes are connected by directional channels: point-to-point connections from one process to another
- Assume channels are error-free and deliver messages in the order in which the channel received them

Deterministic and Non-deterministic Events
- A non-deterministic event: its occurrence cannot be predicted based on prior state(s) of the system
- A deterministic event can be predicted
- Process execution is a sequence of deterministic events, interrupted now and then by some non-deterministic events
- Example: a program controlling a pressure valve of a chemical reactor, running an endless loop that takes frequent inputs from pressure sensors and then makes control decisions
- The value of an input is a non-deterministic event: it cannot be predicted based on the results of prior processing

Piecewise Deterministic Process
- However, once the input is available, the rest is predictable (assuming no failures)
- A process execution can be regarded as piecewise deterministic:
  - it consists of time-slices, each of which begins with some non-deterministic event
  - given information about the non-deterministic event and the state of the process at the beginning of that time-slice, we can predict every event that happens during the time-slice
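Piecewise determinism is what makes log-based recovery possible: if every non-deterministic input is logged, the state can be rebuilt from a checkpoint by replaying the log. A minimal sketch follows, using the pressure-valve example; the `step` rule, the thresholds, and all names are hypothetical.

```python
import random


def step(state, reading):
    """Deterministic time-slice: the new state depends only on state and input."""
    change = 1 if reading < 50 else -1
    valve = max(0, min(100, state["valve"] + change))
    return {"valve": valve, "count": state["count"] + 1}


def run_live(state, slices, rng, log):
    """Run with real (non-deterministic) sensor inputs, logging each one."""
    for _ in range(slices):
        reading = rng.randint(0, 100)   # non-deterministic event
        log.append(reading)             # record it before acting on it
        state = step(state, reading)
    return state


def replay(checkpoint, log):
    """Rebuild the state after a failure from a checkpoint and the input log."""
    state = dict(checkpoint)
    for reading in log:
        state = step(state, reading)
    return state
```

Running `run_live` from a checkpoint and then calling `replay` with the same checkpoint and the recorded log yields the same final state, since `step` has no other source of non-determinism.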

Process/Channel/System State
- The state of a process has the obvious meaning
- The state of a channel at time t is the set of messages carried by it up to time t (and the order of their receipt)
- The state of the distributed system is the aggregate of the states of the individual processes and channels
- The state is said to be consistent if, for every message delivery, there is a corresponding message-sending event
- A state that violates this, in which a message has been delivered that had not yet been sent, violates causality; such a message is called an orphan
- The converse does not make a state inconsistent: a system state may reflect the sending of a message but not its receipt

Consistent and Inconsistent States
- Two processes, P and Q, each take two checkpoints; message m is sent by P to Q
- Sets of checkpoints representing consistent system states:
  - {P_1, Q_1}: neither checkpoint has any information about m
  - {P_2, Q_2}: P_2 indicates that m was sent; Q_2 indicates that it was received
  - {P_2, Q_1}: P_2 indicates that m was sent; Q_1 has no record of receiving m

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Inconsistent States  In contrast, the set {P_1, Q_2} is an inconsistent state; P_1 has no record of m being sent,while Q_2 records that m was received, i.e., m is an orphan message  The sets of checkpoints that represent a consistent system state are said to form a recovery line - we can roll the system back to them and restart from there:  {P_1, Q_1}: Rolling back P to P_1 undoes the sending of m and rolling back Q to Q_1 means that Q does not have any record of m  Restarting from these checkpoints, P will again send out m, which Q will receive

Inconsistent States - Cont.
- {P_2, Q_1}: rolling back P to P_2 means that it will not retransmit m; however, rolling back Q to Q_1 means that Q has no record of ever having received m
- The recovery process has to be able to play back m to Q; this can be done by adding m to the checkpoint of P, or by keeping a separate message log that records everything received by Q
- Sometimes checkpoints can be useless: they will never form part of a recovery line, so taking them is a waste of time

Useless Checkpoints
- Q_2 is a useless checkpoint
- Q_2 records the receipt of m1, but not the sending of m2
- {P_1, Q_2} cannot be consistent (otherwise m1 would become an orphan); similarly, {P_2, Q_2} cannot be consistent (otherwise m2 would become an orphan)

The Domino Effect
- If checkpoints are not coordinated, either directly (by message passing) or indirectly (by synchronized clocks), a single failure can cause a domino effect
- When P suffers a transient failure, it rolls back to checkpoint P_3
- Since message f was sent after P_3, Q has to roll back (otherwise Q would have a message that was never sent: an orphan message)
- P will then roll back to P_2, since Q sent a message e to P
- This continues until all processes have rolled back to their starting positions
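The cascade can be reproduced with a small rollback-propagation loop. This is a sketch under simplifying assumptions made here: a single global clock, a checkpoint at time 0 for every process, and hypothetical checkpoint and message times arranged in a zigzag like the one described.

```python
import math


def recovery_line(checkpoints, messages, failed):
    """Roll processes back until no orphan message remains.

    checkpoints: {process: sorted checkpoint times, starting with 0}
    messages:    list of (sender, send_time, receiver, recv_time), times > 0
    failed:      the process that fails and restarts from its last checkpoint
    Returns {process: checkpoint time}; math.inf means "not rolled back".
    """
    cut = {p: math.inf for p in checkpoints}
    cut[failed] = checkpoints[failed][-1]
    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            # Orphan: receipt is kept, but the sending has been undone.
            if sent > cut[sender] and received <= cut[receiver]:
                cut[receiver] = max(c for c in checkpoints[receiver] if c < received)
                changed = True
    return cut


# Hypothetical uncoordinated checkpoints and zigzag messages.
ckpts = {"P": [0, 40, 80], "Q": [0, 20, 60]}
zigzag = [("P", 10, "Q", 15), ("Q", 30, "P", 35), ("P", 50, "Q", 55),
          ("Q", 70, "P", 75), ("P", 90, "Q", 100)]
```

With the zigzag pattern, a failure of P drags both processes back to time 0. If only the last message is present, the rollback stops at the latest checkpoints, P at 80 and Q at 60. Each update strictly lowers one process's checkpoint, so the loop always terminates.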

Lost Message
- Messages can be lost due to rollback:
  - Suppose Q rolls back to Q_1 after receiving message x from P
  - The record of having received x is lost
  - If P does not roll back to P_2, it is as if P had sent a message that was never received by Q
- Lost messages do not violate causality; they are similar to messages lost due to network problems
- They can be handled by retransmission
- However, if Q sent an ACK of x to P before rolling back, then that ACK will be an orphan message unless P rolls back to P_2
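Retransmission of lost messages can be sketched with a sender-side log. The sequence numbers and the class names `LoggingSender` and `Receiver` are assumptions introduced for this illustration; the slide only states that lost messages are handled by retransmission.

```python
class LoggingSender:
    """Keeps every sent payload so that it can be retransmitted later."""

    def __init__(self):
        self.log = []

    def send(self, payload):
        self.log.append(payload)
        return len(self.log) - 1, payload      # (sequence number, payload)

    def retransmit_from(self, next_expected):
        # All (sequence number, payload) pairs the receiver has no record of.
        return list(enumerate(self.log))[next_expected:]


class Receiver:
    def __init__(self):
        self.received = []

    def deliver(self, seq, payload):
        # Accept only the next in-order message; duplicates are dropped.
        if seq == len(self.received):
            self.received.append(payload)

    def checkpoint(self):
        return list(self.received)

    def roll_back(self, saved):
        self.received = list(saved)
```

After a rollback the receiver reports how many messages its restored state records, for example `len(q.received)`, and the sender replays everything from that sequence number onward. The sender does not have to roll back for this to work.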

Example of Livelock
- Livelock is another problem that can arise in distributed checkpointed systems
- Q sends P a message q1, and P sends Q a message p1
- Then P fails at the point shown, before receiving q1. To prevent p1 from being orphaned, Q must roll back to Q_1
- In the meantime, P recovers, rolls back to P_2, sends another copy of p1, and then receives the copy of q1 that was sent before all the rollbacks began
- However, since Q has rolled back, this copy of q1 is now orphaned, and so P has to repeat its rollback
- This in turn orphans the second copy of p1 as well, forcing Q to also repeat its rollback
- This dance of the rollbacks can continue indefinitely