
1 B. Prabhakaran 1 Recovery Failure of a site/node in a distributed system causes inconsistencies in the state of the system. Recovery: bringing back the failed node in step with other nodes in the system. Failures: Process failure: Deadlocks, protection violation, erroneous user input, etc. System failure: Failure of processor/system. System failure can have full/partial amnesia. It can be a pause failure (system restarts at the same state it was in before the crash) or a complete halt. Secondary storage failure: data inaccessible. Communication failure: network inaccessible.

2 B. Prabhakaran 2 Fault-to-Recovery [Figure: faults (manufacturing, design, external, fatigue) lead to an erroneous system state, which leads to system failure.]

3 B. Prabhakaran 3 Backward & Forward Recovery Forward Recovery: Assess the damage that faults could cause, remove that damage (the errors), and help processes continue. Forward assessment is difficult, so this approach is generally hard to apply. Backward Recovery: Used when forward assessment is not possible. Restore processes to a previous error-free state. Rolling back states is expensive. It does not prevent the same fault from occurring again (i.e., the system can loop on fault + recovery). Some actions are unrecoverable: printouts, cash dispensed at ATMs.

4 B. Prabhakaran 4 Recovery System Model For Backward Recovery A single system with secondary and stable storage Stable storage does not lose information on failures Stable storage used for logs and recovery points Stable storage assumed to be more secure than secondary storage. Data on secondary storage assumed to be archived periodically.

5 B. Prabhakaran 5 Approaches Operation-based Approach Maintaining logs: all modifications to the state of a process are recorded in sufficient detail so that a previous state can be restored by reversing all changes made to the state. E.g., commit in database transactions: if a transaction is committed by all nodes, its changes are permanent. If it does not commit, the effects of the transaction are undone. Updating-in-place: Every write (update) results in a log record of (1) object name, (2) old object state, (3) new object state. Operations: A do operation performs the update and writes the log record. An undo operation uses the log to remove the effect of a do. A redo operation uses the log to repeat a do. Write-ahead log: Avoids the problem of a crash after the update but before logging. Write the (undo & redo) log records before the update.
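
The do/undo/redo operations and the write-ahead rule can be illustrated with a minimal Python sketch. The names `LogRecord`, `LoggedStore`, and `rollback` are hypothetical, the in-memory dict and list only stand in for secondary and stable storage, and `None` is assumed to mean "object did not exist".

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class LogRecord:
    name: str   # (1) object name
    old: Any    # (2) old object state, used by undo
    new: Any    # (3) new object state, used by redo


@dataclass
class LoggedStore:
    data: dict = field(default_factory=dict)  # stands in for secondary storage
    log: list = field(default_factory=list)   # stands in for the log on stable storage

    def do(self, name, value):
        # Write-ahead: append the undo/redo record before updating in place.
        self.log.append(LogRecord(name, self.data.get(name), value))
        self.data[name] = value

    def undo(self, record):
        # Remove the effect of a do by restoring the old state.
        if record.old is None:   # assumption: None means the object was absent
            self.data.pop(record.name, None)
        else:
            self.data[record.name] = record.old

    def redo(self, record):
        # Repeat a do by reapplying the new state.
        self.data[record.name] = record.new

    def rollback(self, keep):
        # Undo all logged updates after the first `keep` records, newest first.
        while len(self.log) > keep:
            self.undo(self.log.pop())
```

Because the record is appended before the update, a crash between the two steps leaves a log entry that can be undone or redone, rather than an update with no trace.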

6 B. Prabhakaran 6 Approaches State-based Approach Establish a recovery point where the process state is saved. Recovery is done by restoring the process state saved at the recovery point, called a checkpoint. This process is called rollback. The process of saving is called checkpointing or taking a checkpoint. Rollback is normally done to the most recent checkpoint, hence many checkpoints are taken over the execution of a process. The shadow pages technique can be used for checkpointing: the page containing the object to be updated is duplicated and maintained as a checkpoint in stable storage. The actual update is done on the page in secondary storage. The copy in stable storage is used for rollback.
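
As a rough illustration of checkpointing and rollback, the sketch below keeps deep copies of a process state in a list that stands in for stable storage. The class name `CheckpointedProcess` and its methods are assumptions made for this example, not part of the slides.

```python
import copy


class CheckpointedProcess:
    def __init__(self, state):
        self.state = state
        self.checkpoints = []  # stands in for stable storage

    def checkpoint(self):
        # Save a private copy so later updates do not alter the checkpoint.
        self.checkpoints.append(copy.deepcopy(self.state))
        return len(self.checkpoints) - 1

    def rollback(self):
        # Restore the most recent checkpoint.
        if not self.checkpoints:
            raise RuntimeError("no checkpoint to roll back to")
        self.state = copy.deepcopy(self.checkpoints[-1])
```

Copying on both save and restore keeps the stored checkpoint untouched, which mirrors the shadow-page idea of updating one copy while the other is kept for rollback.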

7 B. Prabhakaran 7 Recovery in Concurrent Systems Distributed system state involves message exchanges. In distributed systems, rolling back one process can cause the rollback of other processes. Orphan messages & the Domino effect: Assume Y fails after sending m. X has a record of m at x3 but Y has no record, so m is an orphan message. Y rolls back to y2 -> X should go to x2. If Z rolls back, X and Y have to go to x1 and y1 -> Domino effect: the rollback of one process causes one or more other processes to roll back. [Figure: timelines of X, Y, Z with checkpoints x1, x2, x3, y1, y2, z1, z2 and message m.]

8 B. Prabhakaran 8 Lost Messages If Y fails after receiving m, it will roll back to y1. X will roll back to x1. m will be a lost message, as X has recorded it as sent and Y has no record of receiving it. [Figure: timelines of X and Y with checkpoints x1, y1, message m, and a failure mark.]

9 B. Prabhakaran 9 Livelocks Y crashes before receiving n1. Y rolls back to y1 -> X rolls back to x1. Y recovers, receives n1 and sends m2. X recovers and sends n2 but has no record of sending n1. Hence, Y is forced to roll back a second time. X also rolls back, as it has received m2 but Y has no record of m2. The above sequence can repeat indefinitely, causing a livelock. [Figure: two timelines of X and Y with checkpoints x1, y1 and messages m1, n1, m2, n2, showing the failure and the 2nd rollback.]

10 B. Prabhakaran 10 Consistent Checkpoints Overcoming the domino effect and livelocks: checkpoints should not have messages in transit. Strongly consistent checkpoints: no messages are exchanged between any pair of processes in the set, or between a process in the set and one outside it, during the interval spanned by the checkpoints. {x1,y1,z1} is a strongly consistent checkpoint. [Figure: timelines of X, Y, Z with checkpoints x1, x2, x3, y1, y2, z1, z2 and message m.]
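
One way to check a candidate set of checkpoints is to compare what each checkpoint records as sent and as received. The sketch below is an assumption-laden illustration: the function `classify_messages` and the per-checkpoint `sent`/`received` sets of message ids are hypothetical, not a structure defined in the slides.

```python
def classify_messages(checkpoints):
    """checkpoints maps a process name to {"sent": set, "received": set},
    the message ids recorded in that process's chosen checkpoint."""
    sent = set().union(*(c["sent"] for c in checkpoints.values()))
    received = set().union(*(c["received"] for c in checkpoints.values()))
    return {
        "orphans": received - sent,  # receipt recorded, sending not recorded
        "lost": sent - received,     # sending recorded, receipt not recorded
    }


def is_strongly_consistent(checkpoints):
    result = classify_messages(checkpoints)
    return not result["orphans"] and not result["lost"]
```

With X holding a record of receiving m while Y has no record of sending it, `classify_messages` reports m as an orphan; swapping the roles reports it as lost.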

11 B. Prabhakaran 11 Synchronous Approach Checkpointing: First phase: An initiating process, Pi, takes a tentative checkpoint. Pi requests all other processes to take tentative checkpoints. Every process reports whether it was able to take a checkpoint. A process can fail to take a checkpoint due to the nature of the application, e.g., lack of log space or unrecoverable transactions. Second phase: If all processes took checkpoints, Pi decides to make the checkpoints permanent. Otherwise, the checkpoints are to be discarded. Pi conveys to all the processes its decision on whether the checkpoints are to be made permanent or discarded.
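
The two phases can be sketched as a small in-process simulation. This is only an illustration under simplifying assumptions: there is no real messaging or failure handling, and the names `Process`, `take_tentative`, `make_permanent`, `discard`, and `synchronous_checkpoint` are hypothetical.

```python
class Process:
    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint  # e.g. False when log space runs out
        self.state = 0
        self.tentative = None
        self.permanent = None

    def take_tentative(self):
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        return True

    def make_permanent(self):
        self.permanent, self.tentative = self.tentative, None

    def discard(self):
        self.tentative = None


def synchronous_checkpoint(initiator, others):
    participants = [initiator, *others]
    # Phase 1: every process (initiator included) tries a tentative checkpoint.
    replies = [p.take_tentative() for p in participants]
    decision = all(replies)
    # Phase 2: the initiator's decision is applied at every process.
    for p in participants:
        if decision:
            p.make_permanent()
        else:
            p.discard()
    return decision
```

A single "no" in phase 1 discards every tentative checkpoint, so the previous permanent checkpoints remain the recovery line.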

12 B. Prabhakaran 12 Assumptions: Synchronous Appr. Processes communicate by exchanging messages through communication channels Channels are FIFO in nature. End-to-end protocols (e.g. TCP) are assumed to cope with message loss due to rollback recovery and communication failure. Communication failures do not partition the network. A process is not allowed to send messages between phase 1 and 2.

13 B. Prabhakaran 13 Synchronous Approach... Optimization: Taking a checkpoint is expensive, and the algorithm discussed may take unnecessary checkpoints. [Figure: timelines of W, X, Y, Z with checkpoints x1, x2, x3, y1, y2, y3, z1, z2, z3, w2, w3 and an "Initiate checkpointing" marker.]

14 B. Prabhakaran 14 Synchronous Approach... Optimization: Taking a checkpoint is expensive, and the algorithm discussed may take unnecessary checkpoints. [Figure: timelines of W, X, Y, Z with checkpoints x2, x3, y2, y3, z2, z3, w2, w3 and an "Initiate checkpointing" marker.]

15 B. Prabhakaran 15 Checkpointing Optimization Each process uses monotonically increasing labels in its outgoing messages. Notation: L: largest label. S: smallest label. Let m be the last message X received from Y after X’s last permanent checkpoint. last_label_recdx[Y] = m.l if m exists; otherwise it is set to S. Let m be the first message X sent to Y after checkpointing at X (permanent or temporary). first_label_sentx[Y] = m.l if m exists; otherwise it is set to S. With a checkpointing request to Y, X sends last_label_recdx[Y]. Y takes a temporary checkpoint iff last_label_recdx[Y] >= first_label_senty[X] > S, i.e., X has received one or more messages that Y sent after Y's last checkpoint, and hence Y should take a checkpoint. ckpt_cohortx = {Y | last_label_recdx[Y] > S}, i.e., the set of all processes from which X has received messages after its checkpoint.
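
The label test and the cohort definition can be written as two small Python functions. This is a sketch under assumptions: labels are positive integers, the smallest label S is represented by 0, and the function names are hypothetical.

```python
SMALLEST = 0  # stands in for S; assumption: real labels are positive integers


def must_checkpoint(last_label_recd_x_from_y, first_label_sent_y_to_x):
    # Y checkpoints iff X has received a message that Y sent after
    # Y's last checkpoint.
    return last_label_recd_x_from_y >= first_label_sent_y_to_x > SMALLEST


def ckpt_cohort(last_label_recd_x):
    # Processes from which X has received messages since its checkpoint.
    return {y for y, label in last_label_recd_x.items() if label > SMALLEST}
```

For example, if X last received label 5 from Y and Y's first label sent since its checkpoint is 3, Y must checkpoint; if Y has sent nothing since its checkpoint, its value stays at S and the test fails.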

16 B. Prabhakaran 16 Checkpointing Optimization
Initial state at all processes p:
    first_label_sentp[q] := S
    OK_to_take_ckptp := “yes” if p is willing; “no” otherwise
At initiator Pi:
    for all p in ckpt_cohortpi do
        send Take_a_tentative_ckpt(Pi, last_label_recdpi[p]) message
    if all processes replied “yes” then
        for all p in ckpt_cohortpi do send Make_tentative_ckpt_permanent
    else send Undo_tentative_ckpt
At all processes p:
    Upon receiving Take_a_tentative_ckpt message from q do
        if OK_to_take_ckptp = “yes” AND last_label_recdq[p] >= first_label_sentp[q] > S
            take a tentative checkpoint

17 B. Prabhakaran 17 Checkpointing Optimization...
At all processes p:
            take a tentative checkpoint
            for all processes r in ckpt_cohortp do
                send Take_a_tentative_ckpt(p, last_label_recdp[r]) message
            if all processes r replied “yes” then OK_to_take_ckptp := “yes”
            else OK_to_take_ckptp := “no”
        send (p, OK_to_take_ckptp) to q
    Upon receiving Make_tentative_ckpt_permanent message do
        make tentative checkpoint permanent
        for all processes r in ckpt_cohortp do send Make_tentative_ckpt_permanent message
    Upon receiving Undo_tentative_ckpt message do
        undo tentative checkpoint
        for all processes r in ckpt_cohortp do send Undo_tentative_ckpt message

18 B. Prabhakaran 18 Synchronous Rollback Rolling back: First phase: Pi initiates a rollback by asking whether all processes are willing to roll back to the previous checkpoint. Any process may say no if it is involved in another recovery process. Second phase: Pi conveys the decision on agreement to all others. [Figure: timelines of X, Y, Z with checkpoints x1, x2, y1, y2, z1, z2 and a failure mark.]

19 B. Prabhakaran 19 Rollback Optimization Additional notation: last_label_sentx[Y] = m.l, where m is the last message X sent to Y before X's latest permanent checkpoint, if m exists. Otherwise, it is set to S. When X requests Y to restart from the permanent checkpoint, it sends last_label_sentx[Y] along with its request. Y will restart from its permanent checkpoint only if last_label_recdy[X] > last_label_sentx[Y]. roll_cohortx = {Y | X can send messages to Y}
Algorithm:
Initial state at all processes p:
    resume_executionp := true
    for all processes q do last_label_recdp[q] := L
    willing_to_rollp := “yes” if p is willing to roll back; “no” otherwise
At initiator process Pi:
    for all p in roll_cohortpi do
        send Prepare_to_rollback(Pi, last_label_sentPi[p]) message

20 B. Prabhakaran 20 Rollback Optimization...
At initiator process Pi...
    if all processes reply “yes” then
        for all p in roll_cohortpi do send Roll_back message
    else
        for all p in roll_cohortpi do send Donot_roll_back message
At all processes p:
    Upon receiving Prepare_to_rollback(q, last_label_sentq[p]) message from q do
        if willing_to_rollp AND last_label_recdp[q] > last_label_sentq[p] AND (resume_executionp)
            resume_executionp := false
            for all r in roll_cohortp do
                send Prepare_to_rollback(p, last_label_sentp[r]) message
            if all r in roll_cohortp replied “yes” then willing_to_rollp := “yes”
            else willing_to_rollp := “no”
        send (p, willing_to_rollp) message to q

21 B. Prabhakaran 21 Rollback Optimization...
At all processes p:
    Upon receiving Roll_back message AND if resume_executionp = false do
        restart from p’s permanent checkpoint
        for all r in roll_cohortp do send Roll_back message
    Upon receiving Donot_roll_back message do
        resume execution
        for all r in roll_cohortp do send Donot_roll_back message
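
The restart test at the heart of the rollback optimization can be sketched as follows. Only the label comparison is modelled; the two-phase message exchange is left out. The function names and the dict-based inputs are assumptions for illustration.

```python
def must_restart(last_label_recd_y_from_x, last_label_sent_x_to_y):
    # Y restarts only if it received a message that X, after rolling back,
    # no longer has a record of sending.
    return last_label_recd_y_from_x > last_label_sent_x_to_y


def processes_to_restart(last_label_sent_x, last_label_recd_from_x):
    """last_label_sent_x[Y]: the label X sends to Y with its request.
    last_label_recd_from_x[Y]: Y's own record of the last label received from X."""
    return {
        y
        for y, sent in last_label_sent_x.items()
        if must_restart(last_label_recd_from_x[y], sent)
    }
```

A process whose last received label equals the last label X recorded as sent has seen nothing that X's rollback undoes, so it can keep running.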

22 B. Prabhakaran 22 Rollback Optimization... X rolls back to x1. Y and Z roll back to y1 and z1. [Figure: timelines of X, Y, Z with checkpoints x1, y1, z1, a failure mark, and message labels (3), (4), (2), (0), (3), (0).]

23 B. Prabhakaran 23 Rollback Optimization... Neither Y nor Z rolls back. X rolls back to x1. Message 3 will be handled by retransmission in the network protocol (e.g., TCP). [Figure: timelines of X, Y, Z with checkpoints x1, y1, z1, a failure mark, and message labels (3), (4), (0).]

24 B. Prabhakaran 24 Asynchronous Approach Disadvantages of Synchronous Approach: Additional message exchanges for taking checkpoints. Delays normal execution, as messages cannot be exchanged during checkpointing. Unnecessary overhead if no failures occur between checkpoints. Asynchronous approach: independent checkpoints at each processor. Identify a consistent set of checkpoints if needed, for rollbacks. E.g., {x3,y3,z2} is not consistent; {x2,y2,z2} is consistent and is used for rollback. [Figure: timelines of X, Y, Z with checkpoints x1, x2, x3, y1, y2, y3, z1, z2.]

25 B. Prabhakaran 25 Asynchronous Approach... Assumption: 2 types of logging. Volatile log: takes less time, but contents are lost on failure. Periodically flushed to the stable log. Stable log: may take more time, but contents are not lost. Logging: tuple {s, m, msgs_sent}, where s is the process state, m is the message received, and msgs_sent is the set of messages sent during the event. Event logging is initiated on message receipt. Notations & data structures: RCVDi<-j(CkPti): number of messages received by processor Pi from Pj as per checkpoint CkPti. SENTi->j(CkPti): number of messages sent by processor Pi to Pj as per checkpoint CkPti. Basic Idea: Each processor keeps track of the number of messages sent to and received from other processors.

26 B. Prabhakaran 26 Asynchronous Approach... Basic Idea... The existence of orphan messages is identified by comparing the number of messages sent and received. If the number of received messages > the number of sent messages -> orphans are present -> the receiving process needs to roll back.
Algorithm: A recovering processor broadcasts a message to all processors.
    if Pi is the recovering processor, CkPti := latest stable log
    else CkPti := latest event that took place in i
    for k := 1 to N do (N is the total number of processors in the system)
        for each neighboring processor j do
            send ROLLBACK(i, SENTi->j(CkPti)) message
        wait for a ROLLBACK message from every neighbor

27 B. Prabhakaran 27 Asynchronous Approach... Algorithm...
        for every ROLLBACK(j, c) message received from a neighbor j, i does the following:
            if RCVDi<-j(CkPti) > c then /* orphans present */
                find the latest event e such that RCVDi<-j(e) = c;
                CkPti := e
    end for k
The algorithm has N iterations. During the kth (k != 1) iteration, Pi, based on the CkPti determined in the (k-1)th iteration, computes SENTi->j(CkPti) for each neighbor j. This value is sent in a ROLLBACK message (in the kth iteration). At the end of each iteration, at least 1 processor will roll back to its final recovery point.
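
The iteration can be simulated in a few lines of Python. This is a simplified sketch, not the slides' exact protocol: all processors are treated as neighbors, rounds are executed in lockstep instead of by message passing, and the event-log layout and the name `recover` are assumptions. Each log entry holds cumulative sent and received counts, and entry 0 is assumed to have all counts at zero.

```python
def recover(events, start):
    """events[p][k] = (sent_counts, rcvd_counts): cumulative message counts
    per peer after the k-th logged event of p (entry 0 has no messages).
    start[p] = index p begins with: the latest stable-log entry for the
    recovering processor, the latest event for everyone else."""
    ckpt = dict(start)
    procs = list(events)
    for _ in range(len(procs)):  # N iterations
        # Each i announces SENT_{i->j}(CkPt_i) to every other processor j.
        announced = {
            (i, j): events[i][ckpt[i]][0].get(j, 0)
            for i in procs
            for j in procs
            if i != j
        }
        for (i, j), c in announced.items():
            if events[j][ckpt[j]][1].get(i, 0) > c:  # orphans present at j
                # Latest event of j with RCVD_{j<-i} <= c. Counts never
                # decrease, so this equals the "= c" event when one exists.
                ckpt[j] = max(
                    k
                    for k in range(ckpt[j] + 1)
                    if events[j][k][1].get(i, 0) <= c
                )
    return ckpt
```

For instance, suppose X sends a message to Y, Y replies, and Y then fails with only its initial state in the stable log. X has received one message that Y no longer remembers sending, so X moves back to the event just before that receipt.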

28 B. Prabhakaran 28 Asynch. Approach Example Y fails and restarts from y1. CkPtx is ex3 and CkPtz is ez2. 1st iteration: Y sends RollBack(Y,2) to X & RollBack(Y,1) to Z. X sends RollBack(X,1) to Y & RollBack(X,0) to Z. Z sends RollBack(Z,0) to X & RollBack(Z,1) to Y. Discussion: RCVDx<-y(CkPtx) > 2 (the count in Y’s RollBack message), so CkPtx is set to ex2 to satisfy the equality constraint. RCVDz<-y(CkPtz) > 1 (the count in Y’s message), so CkPtz is set to ez1. [Figure: timelines of X, Y, Z with checkpoints x1, y1, z1, events ex1, ex2, ex3, ey1, ey2, ey3, ez1, ez2, and a failure mark.]

29 B. Prabhakaran 29 Asynch. Approach Example.. Discussion... At Y, RCVDy<-x and RCVDy<-z satisfy the constraints, so CkPty is unchanged at y1. 2nd iteration: Y sends RollBack(Y,2) to X & RollBack(Y,1) to Z. X sends RollBack(X,0) to Z & RollBack(X,1) to Y. Z sends RollBack(Z,1) to Y & RollBack(Z,0) to X. Checkpoint y1 is the same as ey2. {ex2, y1/ey2, ez1} are identified as the consistent checkpoints to roll back to.

30 B. Prabhakaran 30 Distributed Databases Checkpointing objectives in distributed database systems (DDBS): Checkpointing should interfere minimally with normal operations. A DDBS may update different objects at different sites, so local checkpointing at each site is better. For faster recovery, checkpoints should be consistent (a desirable property). Activity in a DDBS is in terms of transactions. So in a DDBS, a consistent checkpoint should either include the updates of a transaction completely or not include them at all. Issues in identifying checkpoints: How sites agree on which transactions are to be included. Taking checkpoints without interference.

31 B. Prabhakaran 31 DDBS Checkpointing Assumptions: The basic unit of activity is the transaction. Transactions follow some concurrency control protocol. Lamport’s logical clocks are used for time-stamping transactions. Failures are detected by network protocols or timeouts. Network partitioning never occurs. Basic Idea: All sites agree on a Global Checkpoint Number (GCPN). Transactions with timestamps <= GCPN are included in the checkpoint. They are called BCPTs: Before Checkpoint Transactions. Timestamps of After Checkpoint Transactions (ACPTs) are > GCPN. Each site maintains multiple versions of data items being updated by ACPTs in volatile storage -> no interference during checkpointing.

32 B. Prabhakaran 32 DDBS Checkpointing... Data Structures: LC: local clock as per Lamport’s logical clock. LCPN (local checkpoint number): determined locally for the current checkpoint. Algorithm: initiated by a checkpoint coordinator (CC). The CC uses checkpoint subordinates (CSs). Phase 1 at the CC: The CC broadcasts a Checkpoint_Request message with its local timestamp LCcc. LCPNcc := LCcc. CONVERTcc := false. Wait for replies from the CSs.

33 B. Prabhakaran 33 DDBS Checkpointing... Phase 1 at CSs: On receiving a Checkpoint_Request message, a site m updates its local clock as LCm := MAX(LCm, LCcc+1). LCPNm := LCm. m reports LCPNm to the CC. CONVERTm := false. m marks all the transactions with timestamps <= LCPNm as BCPTs and the rest as temporary-ACPTs. All updates of temporary-ACPTs are stored in the buffers of the ACPTs. If a temporary-ACPT commits, its updates are not flushed to the database but are maintained as committed temporary versions (CTVs). Other transactions access CTVs for reads. For writes, another version of the CTV is created.

34 B. Prabhakaran 34 DDBS Checkpointing... Phase 2 at CC: Once all CSs’ replies are received -> GCPN := Max(LCPN1,.., LCPNn). Broadcast GCPN. Phase 2 at the CSs: On receiving GCPN, m marks as BCPTs all temporary-ACPTs that satisfy the condition LCPNm < transaction timestamp <= GCPN. Updates of these converted BCPTs are included in the checkpoint. CONVERTm := true (i.e., GCPN & BCPTs identified). When all BCPTs terminate and CONVERTm = true, m takes a local checkpoint by saving the state of the data objects. After local checkpointing, the database is updated with the CTVs and the CTVs are deleted.
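
The clock update, the choice of GCPN, and the resulting classification of transactions can be sketched as plain functions. The function names are hypothetical, timestamps are assumed to be integers, and the coordinator's own LCPN is assumed to be included when taking the maximum.

```python
def subordinate_phase1(local_clock, coordinator_clock):
    # LCm := MAX(LCm, LCcc + 1); the result is also the site's LCPN.
    return max(local_clock, coordinator_clock + 1)


def coordinator_phase2(lcpns):
    # GCPN is the largest local checkpoint number reported.
    return max(lcpns)


def classify(timestamp, lcpn, gcpn):
    if timestamp <= lcpn:
        return "BCPT"            # marked in phase 1
    if timestamp <= gcpn:
        return "converted BCPT"  # temporary-ACPT converted in phase 2
    return "ACPT"
```

With a coordinator clock of 10 and site clocks of 7 and 15, the sites report LCPNs of 11 and 15, so GCPN is 15. At the first site, transactions stamped 9, 13, and 20 come out as a BCPT, a converted BCPT, and an ACPT respectively.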

