Fault Tolerance.

System reliability: Fault-Intolerance vs. Fault-Tolerance
The fault-intolerance (or fault-avoidance) approach improves system reliability by removing the sources of failure (i.e., hardware and software faults) before normal operation begins. The fault-tolerance approach expects faults to be present during system operation, but employs design techniques that ensure the continued correct execution of the computing process.

Approaches to fault-tolerance
(a) Mask failures: the system continues to provide its specified function(s) in the presence of failures.
- Example: voting protocols
(b) Well-defined failure behavior: the system exhibits a well-defined behavior in the presence of failures.
- It may or may not perform its specified function(s), but it facilitates actions suitable for fault recovery.
- Example: commit protocols. A transaction on a database is made visible only if it succeeds and commits; if it fails, the transaction is undone.
Redundancy: the method for achieving fault tolerance (multiple copies of hardware, processes, data, etc.)
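As an illustration of masking failures by voting, here is a minimal sketch. It assumes several redundant replicas compute the same result and a voter compares their outputs; the function name `majority_vote` is hypothetical and does not come from the slides.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas, or None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority is required so that a faulty minority is masked.
    return value if count > len(outputs) / 2 else None


# One of three replicas returns a wrong result; the voter masks it.
print(majority_vote([42, 42, 7]))
```

When no value reaches a strict majority, the sketch returns `None` rather than guessing, which corresponds to the failure not being maskable with the available redundancy.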

Issues

Process deaths:
- All resources allocated to a process must be recovered when the process dies.
- The kernel and the remaining processes can notify other cooperating processes.
- Client-server systems: the client (server) process needs to be informed that the corresponding server (client) process died.
Machine failure:
- All processes running on that machine die.
- Client-server systems: it is difficult to distinguish between a process failure and a machine failure.
- Issue: detection by processes of other machines.
Network failure:
- The network may be partitioned into subnets; machines in different subnets cannot communicate.
- It is difficult for a process to distinguish between a machine failure and a communication link failure.

Atomic actions

System activity: a sequence of primitive or atomic actions.
Atomic action:
- Machine level: an uninterruptible instruction
- Process level: a group of instructions that accomplish a task
- System level: a group of cooperating processes performing a task (global atomicity)
Example: two processes, P1 and P2, share a memory location 'x' and both modify 'x'. The statements between Lock(x) and Unlock(x) form the atomic action.

    Process P1          Process P2
    ...                 ...
    Lock(x);            Lock(x);
    x := x + z;         x := x + y;
    Unlock(x);          Unlock(x);
    ...                 ...

Figure labels: atomic action, failure, successful exit.
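The Lock(x)/Unlock(x) example above can be tried out with a small sketch. It uses two Python threads in place of the processes P1 and P2 and a `threading.Lock` in place of Lock/Unlock; the names `add_to_x` and `run` and the repeat count are assumptions made for the illustration.

```python
import threading

x = 0
x_lock = threading.Lock()


def add_to_x(amount, repeats):
    """Repeatedly perform the atomic action 'x := x + amount'."""
    global x
    for _ in range(repeats):
        with x_lock:  # Lock(x) on entry, Unlock(x) on exit
            x = x + amount


def run(z=1, y=2, repeats=10000):
    """Run P1 (adds z) and P2 (adds y) concurrently and return the final x."""
    global x
    x = 0
    p1 = threading.Thread(target=add_to_x, args=(z, repeats))
    p2 = threading.Thread(target=add_to_x, args=(y, repeats))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    return x


print(run())
```

Because each read-modify-write of `x` happens while the lock is held, no update is lost and the result is always `(z + y) * repeats`.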

Committing

Transaction: a sequence of actions treated as an atomic action to preserve consistency (e.g., access to a database).
Commit a transaction: an unconditional guarantee that the transaction will complete successfully (even in the presence of failures).
Abort a transaction: an unconditional guarantee to back out of the transaction, i.e., that all effects of the transaction have been removed.
- Events that may cause a transaction to abort: deadlocks, timeouts, protection violations
Commit protocols:
- Enforce global atomicity (involving several cooperating distributed processes).
- Ensure that all sites either commit or abort the transaction unanimously, even in the presence of multiple and repetitive failures.

The two-phase commit protocol
Assumptions:
- One process is the coordinator; the others are "cohorts" (at different sites).
- Stable storage is available at each site.
- The write-ahead log protocol is used.
Coordinator:
- Initialization: send a start-transaction message to all cohorts.
- Phase 1: send a commit-request message, requesting all cohorts to commit; wait for replies from the cohorts.
- Phase 2: if all cohorts sent agreed and the coordinator agrees, write a commit record into the log and send a commit message to the cohorts; otherwise send an abort message to the cohorts.
- Wait for acknowledgments from the cohorts. If an acknowledgment from a cohort is not received within the specified period, resend the commit/abort message to that cohort.
- When all acknowledgments have been received, write a complete record to the log.
Cohorts:
- If the transaction at the cohort is successful, write the undo and redo log to stable storage and return an agreed message; otherwise return an abort message.
- If commit is received, release all resources and locks held for the transaction and send an acknowledgment.
- If abort is received, undo the transaction using the undo log record, release resources and locks, and send an acknowledgment.
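The message flow of the protocol above can be sketched in a few lines. This is a simplified, failure-free simulation: messages are plain method calls, lists stand in for the logs on stable storage, and timeouts and resends are not modeled. The class `Cohort` and the function `two_phase_commit` are hypothetical names chosen for the sketch.

```python
class Cohort:
    def __init__(self, name, will_succeed=True):
        self.name = name
        self.will_succeed = will_succeed
        self.log = []            # stands in for the log on stable storage
        self.state = "active"

    def on_commit_request(self):
        if self.will_succeed:
            # Write the undo/redo log before replying (write-ahead).
            self.log.append("undo/redo")
            return "agreed"
        return "abort"

    def on_decision(self, decision):
        if decision == "abort" and "undo/redo" in self.log:
            self.log.append("undone")   # undo using the undo log record
        self.state = "committed" if decision == "commit" else "aborted"
        return "ack"                    # resources and locks released here


def two_phase_commit(cohorts, coordinator_agrees=True):
    coordinator_log = []
    # Phase 1: request all cohorts to commit and collect their replies.
    votes = [c.on_commit_request() for c in cohorts]
    # Phase 2: commit only if every cohort and the coordinator agree.
    if coordinator_agrees and all(v == "agreed" for v in votes):
        coordinator_log.append("commit")
        decision = "commit"
    else:
        decision = "abort"
    acks = [c.on_decision(decision) for c in cohorts]
    if all(a == "ack" for a in acks):
        coordinator_log.append("complete")
    return decision, coordinator_log


print(two_phase_commit([Cohort("a"), Cohort("b")]))
print(two_phase_commit([Cohort("a"), Cohort("b", will_succeed=False)]))
```

A single cohort that cannot complete its part is enough to make every site abort, which is the unanimity the commit protocol is meant to enforce.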

2PC Recovery Protocols: Additional Cases

These cases arise because writing a log record and sending a message are not a single atomic action.
- The coordinator site fails after writing the "begin_commit" log record and before sending the "prepare" command: treat it as a failure in the WAIT state and send the "prepare" command.
- A participant site fails after writing the "ready" record in the log but before "vote-commit" is sent: treat it as a failure in the READY state; alternatively, it can send "vote-commit" upon recovery.
- A participant site fails after writing the "abort" record in the log but before "vote-abort" is sent: nothing needs to be done upon recovery.

2PC Recovery Protocols: Additional Case (see book)

- The coordinator site fails after logging its final decision record but before sending its decision to the participants: the coordinator treats it as a failure in the COMMIT or ABORT state; the participants treat it as a timeout in the READY state.
- A participant site fails after writing the "abort" or "commit" record in the log but before the acknowledgment is sent: the participant treats it as a failure in the COMMIT or ABORT state; the coordinator handles it by timeout in the COMMIT or ABORT state.

Problem With 2PC

Blocking:
- READY implies that the participant waits for the coordinator.
- If the coordinator fails, the site is blocked until recovery.
- Blocking reduces availability.
Independent recovery is not possible. However, it is known that independent recovery protocols exist only for single-site failures; no independent recovery protocol exists that is resilient to multiple-site failures. So we search for such protocols: 3PC.

Three-Phase Commit

3PC is non-blocking.
A commit protocol is non-blocking iff it is synchronous within one state transition and its state transition diagram contains
- no state that is "adjacent" to both a commit and an abort state, and
- no non-committable state that is "adjacent" to a commit state.
Adjacent: it is possible to go from one state to the other with a single state transition.
Committable: all sites have voted to commit the transaction (e.g., the COMMIT state).
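The two structural conditions above can be checked mechanically on a state transition table. The sketch below is only an illustration: the transition tables `TWO_PC` and `THREE_PC` are assumed participant diagrams (the slides' figure is not reproduced here), treating PRE-COMMIT as committable is likewise an assumption, and the synchrony requirement is not checked. The function name `blocking_states` is hypothetical.

```python
def blocking_states(transitions, committable, commit="COMMIT", abort="ABORT"):
    """Return the states that violate either structural condition."""
    bad = set()
    for state, successors in transitions.items():
        # Condition 1: no state adjacent to both a commit and an abort state.
        if commit in successors and abort in successors:
            bad.add(state)
        # Condition 2: no non-committable state adjacent to a commit state.
        if commit in successors and state not in committable:
            bad.add(state)
    return bad


# Assumed participant state diagrams (state -> directly reachable states).
TWO_PC = {
    "INITIAL": {"READY", "ABORT"},
    "READY": {"COMMIT", "ABORT"},
    "COMMIT": set(),
    "ABORT": set(),
}
THREE_PC = {
    "INITIAL": {"READY", "ABORT"},
    "READY": {"PRE-COMMIT", "ABORT"},
    "PRE-COMMIT": {"COMMIT"},
    "COMMIT": set(),
    "ABORT": set(),
}

print(blocking_states(TWO_PC, committable={"COMMIT"}))
print(blocking_states(THREE_PC, committable={"PRE-COMMIT", "COMMIT"}))
```

Under these assumptions the READY state of 2PC is flagged, because it is adjacent to both COMMIT and ABORT, while the extra PRE-COMMIT state removes the violation in 3PC.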

State Transitions in 3PC
[Figure: 3PC state transition diagrams. Coordinator states: INITIAL, WAIT, PRE-COMMIT, ABORT, COMMIT. Participant states: INITIAL, READY, PRE-COMMIT, ABORT, COMMIT. Transition labels: Commit command, Prepare, Vote-commit, Vote-abort, Global-abort, Prepare-to-commit, Ready-to-commit, Global commit, Ack.]

Communication Structure
[Figure: 3PC communication structure between the coordinator (C) and the participants (P) across Phase 1, Phase 2 and Phase 3. Message labels: ready?, yes/no, pre-commit/pre-abort?, yes/no, commit/abort, ack.]

Recovery

Computer system recovery: restore the system to a normal operational state.
Process recovery:
- Reclaim the resources allocated to the process, undo modifications made to databases, and restart the process, or
- Restart the process from the point of failure and resume execution.
Distributed process recovery (cooperating processes): undo the effects of the interactions of the failed process with other cooperating processes.
Replication (hardware components, processes, data): the main method for increasing system availability.
System: a set of hardware and software components designed to provide a specified service (i.e., meet a set of requirements).

Recovery (cont.)

System failure: the system does not meet its requirements, i.e., it does not perform its services as specified.
Erroneous system state: a state that could lead to a system failure by a sequence of valid state transitions.
Error: the part of the system state that differs from its intended value.
Fault: an anomalous physical condition, e.g., design errors, manufacturing problems, damage, external disturbances.
- An error is a manifestation of a fault.
- An error could lead to a system failure.

Classification of failures
Process failure:
- Behavior: the process causes the system state to deviate from its specification (e.g., incorrect computation, process stops executing).
- Errors causing process failure: protection violation, deadlocks, timeout, wrong user input, etc.
- Recovery: abort the process or restart it from a prior state.
System failure:
- Behavior: the processor fails to execute.
- Cause: software errors or hardware faults (CPU/memory/bus/… failure).
- Recovery: the system is stopped and restarted in a correct state.
- Assumption: fail-stop processors, i.e., the system stops execution and its internal state is lost.
Secondary storage failure:
- Behavior: stored data cannot be accessed.
- Errors causing failure: parity error, head crash, etc.
- Recovery/design strategies: reconstruct the content from an archive plus a log of activities; design a mirrored disk system.
Communication medium failure:
- Behavior: a site cannot communicate with another operational site.
- Errors/faults: failure of switching nodes or communication links.
- Recovery/design strategies: rerouting, error-resistant communication protocols.

Backward and Forward Error Recovery
Failure recovery: restore an erroneous state to an error-free state.
Approaches to failure recovery:
- Forward-error recovery: remove the errors in the process/system state (if the errors can be completely assessed) and continue forward execution of the process/system.
- Backward-error recovery: restore the process/system to a previous error-free state and restart from there.
Comparison: forward vs. backward error recovery
Backward-error recovery
(+) Simple to implement
(+) Can be used as a general recovery mechanism
(-) Performance penalty
(-) No guarantee that the fault will not occur again
(-) Some components cannot be recovered
Forward-error recovery
(+) Less overhead
(-) Limited use, i.e., only when the impact of the faults is understood
(-) Cannot be used as a general mechanism for error recovery

Backward-Error Recovery: Basic approach
Principle: restore the process/system to a known, error-free "recovery point" or "checkpoint".
Approaches:
(1) Operation-based approach
(2) State-based approach
System model: CPU, main memory (MM), secondary storage, and stable storage.
- Stable storage: storage that maintains information in the event of system failure; it stores logs and recovery points.
- An object is brought into MM to be accessed and is written back if modified.

(1) The Operation-based Approach
Principle: record all changes made to the state of the process (an "audit trail" or "log") so that the process can be returned to a previous state.
Example: a transaction-based environment in which transactions update a database.
- Updates can be committed or undone on a per-transaction basis.
- A commit indicates that the transaction on the object was successful and that its changes are permanent.
(1.a) Updating-in-place
- Principle: every update (write) operation on an object creates a log record in stable storage that can be used to "undo" and "redo" the operation.
- Log content: object name, old object state, new object state.
- Implementation of a recoverable update operation:
  - Do operation: update the object and write a log record.
  - Undo operation: log(old) -> object (undoes the action performed by a do).
  - Redo operation: log(new) -> object (redoes the action performed by a do).
  - Display operation: display the log record (optional).
- Problem: a "do" cannot be recovered if the system crashes after the object is written but before the log record is written.
(1.b) The write-ahead log protocol
- Principle: write the log record before updating the object.
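The do/undo/redo operations and the write-ahead rule can be sketched as follows. This is a minimal in-memory illustration: a Python list stands in for the log on stable storage, and the class name `RecoverableStore` is hypothetical. It also assumes `None` is never stored as an object value, since `None` is used to mean "object did not exist".

```python
class RecoverableStore:
    def __init__(self):
        self.objects = {}   # current object states
        self.log = []       # stands in for the log on stable storage

    def do(self, name, new_value):
        old_value = self.objects.get(name)
        # Write-ahead: the log record is written before the object is updated.
        self.log.append((name, old_value, new_value))
        self.objects[name] = new_value
        return len(self.log) - 1   # index of the log record

    def undo(self, index):
        # log(old) -> object
        name, old_value, _ = self.log[index]
        if old_value is None:
            self.objects.pop(name, None)
        else:
            self.objects[name] = old_value

    def redo(self, index):
        # log(new) -> object
        name, _, new_value = self.log[index]
        self.objects[name] = new_value


store = RecoverableStore()
first = store.do("x", 1)
second = store.do("x", 2)
store.undo(second)
print(store.objects["x"])
```

Each log record holds the object name, the old state and the new state, which is exactly what undo and redo need.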

(2) State-based Approach
Principle: establish frequent "recovery points" or "checkpoints" that save the entire state of the process.
Actions:
- "Checkpointing" or "taking a checkpoint": saving the process state.
- "Rolling back" a process: restoring the process to a prior state.
Note: a process should be rolled back to the most recent recovery point to minimize the overhead and the delay in completing the process.
Shadow pages: a special case of the state-based approach.
- Only a part of the system state is saved, to minimize recovery overhead.
- When an object is modified, the page containing the object is first copied to stable storage (the shadow page).
- If the process commits successfully, the shadow page is discarded and the modified page becomes part of the database.
- If the process fails, the shadow page is used and the modified page is discarded.
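Checkpointing and rolling back can be sketched as below. The sketch assumes the whole process state fits in one Python object and uses a list in place of stable storage; the class name `CheckpointedProcess` is hypothetical.

```python
import copy


class CheckpointedProcess:
    def __init__(self, state):
        self.state = state
        self.checkpoints = []   # stands in for stable storage

    def checkpoint(self):
        # Taking a checkpoint: save a copy of the entire process state.
        self.checkpoints.append(copy.deepcopy(self.state))

    def roll_back(self):
        # Rolling back: restore the most recent recovery point.
        if not self.checkpoints:
            raise RuntimeError("no recovery point available")
        self.state = copy.deepcopy(self.checkpoints[-1])


p = CheckpointedProcess({"n": 0})
p.checkpoint()
p.state["n"] = 5
p.checkpoint()
p.state["n"] = 9      # work done after the last checkpoint
p.roll_back()
print(p.state)
```

Only the work done since the most recent checkpoint is lost, which is why rolling back to the latest recovery point keeps the delay small.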

Recovery in concurrent systems
Issue: if one of a set of cooperating processes fails and has to be rolled back to a recovery point, all processes it has communicated with since that recovery point have to be rolled back as well.
Conclusion: in concurrent and/or distributed systems, all cooperating processes have to establish recovery points.
Orphan messages and the domino effect:
- Case 1: failure of X after x3: no impact on Y or Z.
- Case 2: failure of Y after sending message 'm': Y is rolled back to y2, 'm' becomes an orphan message, and X is rolled back to x2.
- Case 3: failure of Z after z2: Y has to roll back to y1, X has to roll back to x1, and Z has to roll back to z1 (domino effect).
[Figure: time lines of processes X, Y and Z with recovery points x1, x2, x3, y1, y2, z1, z2 and message 'm'.]

Problem of livelock

Livelock: a case where a single failure can cause an infinite number of rollbacks.
- Process Y fails before receiving message 'n1' sent by X.
- Y is rolled back to y1, where there is no record of sending message 'm1'; this causes X to roll back to x1.
- When Y restarts, it sends out 'm2' and receives 'n1' (delayed).
- When X restarts from x1, it sends out 'n2' and receives 'm2'.
- Y has to roll back again, since there is no record of 'n1' being sent.
- This causes X to be rolled back again, since it has received 'm2' and there is no record of sending 'm2' in Y.
- The above sequence can repeat indefinitely.
[Figure: (a) time lines of X and Y with recovery points x1 and y1, messages 'm1' and 'n1', and the failure of Y; (b) the second rollback, with messages 'm2', 'n2' and the delayed 'n1'.]

Consistent set of checkpoints
Checkpointing in distributed systems requires that all processes (sites) that interact with one another establish periodic checkpoints.
- Each site saves its local state: a local checkpoint.
- The local checkpoints, one from each site, collectively form a global checkpoint.
- The domino effect is caused by orphan messages, which in turn are caused by rollbacks.
Strongly consistent set of checkpoints:
- A set of local checkpoints (one for each process in the set) such that no information flow takes place (i.e., no orphan messages) during the interval spanned by the checkpoints.
Consistent set of checkpoints:
- Similar to a consistent global state.
- Each message that is recorded as received in a checkpoint (state) should also be recorded as sent in another checkpoint (state).
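The consistency condition (every message recorded as received must also be recorded as sent) can be checked with a short sketch. It assumes each process has its own local clock, that a checkpoint is identified by the local time at which it was taken, and that each message is described by a tuple `(sender, sent_at, receiver, received_at)`. These representations and the function names are assumptions made for the illustration.

```python
def orphan_messages(checkpoints, messages):
    """Return messages recorded as received but not recorded as sent."""
    orphans = []
    for sender, sent_at, receiver, received_at in messages:
        recorded_as_received = received_at <= checkpoints[receiver]
        recorded_as_sent = sent_at <= checkpoints[sender]
        if recorded_as_received and not recorded_as_sent:
            orphans.append((sender, sent_at, receiver, received_at))
    return orphans


def is_consistent(checkpoints, messages):
    """A set of local checkpoints is consistent if it has no orphan message."""
    return not orphan_messages(checkpoints, messages)


# Y sends a message at its local time 5; X receives it at its local time 6.
messages = [("Y", 5, "X", 6)]
print(is_consistent({"X": 7, "Y": 4}, messages))   # receive recorded, send not
print(is_consistent({"X": 7, "Y": 5}, messages))   # both recorded
```

In the first call the checkpoint of X already contains the receive event while the checkpoint of Y predates the send, so the message is an orphan and the set is not consistent.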

Static Voting Scheme

System model:
- File replicas are kept at different sites.
- File lock rule: either one writer and no readers, or multiple readers and no writer.
- Every file is associated with a version number that gives the number of times the file has been updated.
- Version numbers are stored on stable storage. Every successful write updates the version number.
Basic idea:
- Every replica is assigned a certain number of votes. This number is stored on stable storage.
- A read or write operation is permitted if a certain number of votes, called the read quorum or write quorum, is collected by the requesting process.
Voting algorithm:
- Let a site i issue a read or write request for a file.
- Site i issues a Lock_Request to its local lock manager.

Static Voting: Voting Algorithm (cont.)

- When the lock request is granted, i sends a Vote_Request message to all the sites.
- When a site j receives a Vote_Request message, it issues a Lock_Request to its lock manager. If the lock request is granted, it returns the version number of its replica (VNj) and the number of votes assigned to the replica (Vj) to site i.
- Site i decides whether it has the quorum, based on the replies received within a timeout period, as follows:
  - For read requests, Vr = sum of Vk for k in P, where P is the set of sites from which replies were received.
  - For write requests, Vw = sum of Vk for k in Q, where M = max{VNj : j in P} and Q = {j in P : VNj = M}.
  - Only the votes of the current (version) replicas are counted in deciding the write quorum.
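The vote counts Vr and Vw defined above can be computed as in the following sketch. It assumes the replies received within the timeout are given as a dictionary mapping each site in P to its pair (VNj, Vj); the function name `collected_votes` and this data layout are hypothetical.

```python
def collected_votes(replies):
    """Return (Vr, Vw, M) for replies of the form site -> (version, votes)."""
    if not replies:
        return 0, 0, None
    # Vr: votes of all sites in P.
    read_votes = sum(votes for _, votes in replies.values())
    # M: the highest version number among the replies.
    current = max(version for version, _ in replies.values())
    # Vw: votes of the sites in Q, i.e. those holding the current version.
    write_votes = sum(votes for version, votes in replies.values()
                      if version == current)
    return read_votes, write_votes, current


replies = {"a": (3, 2), "b": (3, 1), "c": (2, 1)}
print(collected_votes(replies))
```

Site c holds an older version, so its vote counts toward the read total but not toward the write total.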

Static Voting: Voting Algorithm (cont.), Vote Assignment

- If i is not successful in getting the quorum, it issues a Release_Lock to its lock manager and to all sites that gave their votes.
- If i is successful in collecting the quorum, it checks whether its copy of the file is current (VNi = M). If not, it obtains the current copy.
- If the request is a read, i reads the local copy. If it is a write, i updates the local copy and VN.
- i sends all updates and VNi to all sites in Q, i.e., it updates only the current replicas.
- i sends a Release_Lock request to its lock manager as well as to those in P.
- All sites, on receiving the updates, perform them. On receiving Release_Lock, they release the lock.
Vote assignment:
- Let v be the total number of votes assigned to all copies.
- The read and write quorums, r and w, are selected such that r + w > v and w > v/2.

Static Voting: Vote Assignment (cont.)

- The above values are chosen so that there is a non-null intersection between every read quorum and every write quorum, i.e., at least one current copy in any read quorum gathered.
- The write quorum is high enough to disallow simultaneous writes on two distinct subsets of replicas.
- The scheme can be modified to collect write quorums from non-current replicas.
- Another modification: obsolete replicas are updated.
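The two constraints on the quorums, r + w > v and w > v/2, are easy to check for a candidate vote assignment. The sketch below assumes the votes are given as a list with one entry per replica; the function name `valid_quorums` is hypothetical.

```python
def valid_quorums(votes, r, w):
    """Check r + w > v and w > v/2, where v is the total number of votes."""
    v = sum(votes)
    return r + w > v and w > v / 2


# Three replicas with one vote each: v = 3.
print(valid_quorums([1, 1, 1], r=2, w=2))   # both constraints hold
print(valid_quorums([1, 1, 1], r=1, w=2))   # r + w = 3 is not greater than v
```

With r = 1 and w = 2 a reader could collect its single vote from the one replica that missed the latest write, which is exactly the case the first constraint rules out.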

Majority Approach: Notation

- Version number, VNi: the version number of a replica at site i is an integer that counts the number of successful updates to the replica at i. Initially set to 0.
- Number of replicas updated, RUi: the number of replicas that participated in the most recent update. Initially set to the total number of replicas.
- Distinguished sites list, DSi: a variable at i that stores the IDs of one or more sites. DSi depends on RUi:
  - RUi is even: DSi identifies the replica that is greater (as per the linear ordering) than all the other replicas that participated in the most recent update at i.
  - RUi is odd: DSi is nil.
  - RUi = 3: DSi lists the 3 replicas that participated in the most recent update, from which a majority is needed to allow access to the data.

Majority-based: Protocol
Site i receives an update and executes the following protocol:
- i issues a Lock_Request to its local lock manager.
- Lock granted: i issues a Vote_Request to all the sites.
- Site j receives the request: j issues a Lock_Request to its local lock manager.
- Lock granted: j sends the values of VNj, RUj, and DSj to i.
- Based on the responses received, i decides whether it belongs to the distinguished partition.
- i does not belong to the distinguished partition: i issues Release_Lock to its local lock manager and Abort to the other sites (which then issue Release_Lock to their local lock managers).
- i belongs to the distinguished partition: i performs the update on its local copy (the current copy is obtained before the update if the local copy is not current). i sends a commit message to the participating sites, with the missing updates and the values of VN, RU, and DS, and issues a Release_Lock request to its local lock manager.
