Published by Patrick Robbins. Modified over 9 years ago.
1
Chapter 19: Recovery and Fault Tolerance
Copyright © 2008
2
Operating Systems, by Dhananjay Dhamdhere. Copyright © 2008
Introduction
Faults, Failures, and Recovery
Byzantine Faults and Agreement Protocols
Recovery
Fault Tolerance Techniques
Resiliency
3
Faults, Failures, and Recovery
A fault may damage the state of a system
  – Error: a part of the system state that is erroneous
Failure: unexpected behavior or situation
4
Faults, Failures, and Recovery (continued)
Recovery: for reliable operation, the system is restored to a consistent state and its operation is resumed
  – Recovery is performed when a failure is noticed
5
Classes of Faults
Fault model: properties that determine the kinds of errors/failures that might result from a fault
Classes of faults:
  – System fault (system crash)
      Amnesia and partial amnesia faults
      A fail-stop fault brings a system to a halt
  – Process fault
      Byzantine faults: malicious or arbitrary actions
  – Storage fault
      Amnesia faults
  – Communication fault
      Nonamnesia faults
6
Overview of Recovery Techniques
For non-Byzantine faults, recovery involves restoring the system or application to a consistent state
  – Involves reexecuting some actions
7
Overview of Recovery Techniques (continued)
Recovery approaches are classified into:
  – Backward recovery: resetting the state of the entity affected by a fault to a prior state and resuming its operation
      Involves reexecution of some actions
  – Forward recovery: repairing the erroneous state of a system so that the system can continue its operation
      Repair cost depends on the nature of the computation
      May involve a certain amount of reexecution
Backward recovery is simpler to implement
  – But it requires a practical method of producing a consistent state recording of a system
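A minimal sketch of backward recovery, assuming a toy single-process entity whose whole state fits in a dictionary. The class `CheckpointedCounter` and its method names are hypothetical and are not taken from the book; the point is only that rolling back resets the state to a prior recording and leaves some actions to be reexecuted.

```python
import copy


class CheckpointedCounter:
    """Toy entity (hypothetical) whose state can be recorded and rolled back."""

    def __init__(self):
        self.state = {"total": 0}
        self._checkpoint = copy.deepcopy(self.state)
        self._pending = []  # actions performed since the last checkpoint

    def apply(self, amount):
        self.state["total"] += amount
        self._pending.append(amount)

    def take_checkpoint(self):
        # Record the current state; earlier actions need no reexecution.
        self._checkpoint = copy.deepcopy(self.state)
        self._pending.clear()

    def roll_back(self):
        # Backward recovery: reset to the prior state and report the
        # actions that were lost and must be reexecuted.
        self.state = copy.deepcopy(self._checkpoint)
        lost, self._pending = self._pending, []
        return lost
```

After `roll_back()`, the caller reexecutes the returned actions with `apply()` to bring the entity back to where it was before the fault.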
8
Byzantine Faults and Agreement Protocols
Byzantine faults have been studied only in the restricted context of agreement between processes
  – Byzantine generals problem: attack or retreat
Agreement protocols are used for:
  – Byzantine agreement problem
  – Consensus problem
  – Interactive consistency problem
Impossibility result: a group of three processes (m = 3) containing one faulty process (k = 1) cannot reach agreement
  – In general, agreement among m processes containing k faulty processes is possible if m > 3k
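The bound m > 3k can be checked with a few lines of arithmetic. The function names below are hypothetical helpers written for this note; they encode only the inequality stated on the slide, not an agreement protocol.

```python
def agreement_possible(m: int, k: int) -> bool:
    """True if m processes, k of them faulty, satisfy m > 3k."""
    return m > 3 * k


def max_tolerable_faults(m: int) -> int:
    """Largest k for which m > 3k still holds (assumes m >= 1)."""
    return (m - 1) // 3
```

For example, three processes tolerate no Byzantine fault, while four processes tolerate one.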
9
Recovery
A recovery scheme consists of two components:
  – Checkpointing algorithm
      Decides when to take a checkpoint for a process
  – Recovery algorithm
      Uses checkpoints to roll back processes such that the new process states are mutually consistent
10
Recovery (continued)
The state of a process cannot be recovered in isolation
  – Must restore a state of the computation S' in which the states of all pairs of processes are mutually consistent
The goal of the recovery algorithm is to decide:
  – Whether a process P_i should be rolled back
  – The checkpoint to which P_i should be rolled back
Checkpointing may be asynchronous or synchronous
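One common way to test mutual consistency is to look for orphan messages: a message that one checkpoint records as received while the sender's checkpoint does not record it as sent. The sketch below assumes that definition and a hypothetical `Checkpoint` record in which each message is a `(sender, receiver, message_id)` tuple; the slides do not prescribe this representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical checkpoint record for one process."""
    process: str
    sent: frozenset = frozenset()      # messages sent before the checkpoint
    received: frozenset = frozenset()  # messages received before the checkpoint


def orphan_messages(checkpoints):
    """Return messages recorded as received but not recorded as sent."""
    by_process = {c.process: c for c in checkpoints}
    orphans = []
    for c in checkpoints:
        for msg in c.received:
            sender = msg[0]
            if sender in by_process and msg not in by_process[sender].sent:
                orphans.append(msg)
    return sorted(orphans)


def mutually_consistent(checkpoints):
    """A set of checkpoints is usable for recovery if it has no orphans."""
    return not orphan_messages(checkpoints)
```

If the check fails, a recovery algorithm would roll the receiving process back to an earlier checkpoint and test again.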
11
Fault Tolerance Techniques
The basic principle in fault tolerance is to ensure that a fault either does not cause an error
  – Or that the error can be removed easily
Two facets of the tolerance of system faults that follow the fail-stop model:
  – Fault tolerance for replicated data
  – Fault tolerance for distributed data
12
Logs, Forward Recovery, and Backward Recovery
A log is a record of actions or activities in a process
  – Do logs (also called redo logs)
      Used to implement forward recovery
  – Undo logs
      Used to implement backward recovery
A write-ahead logging principle is used
A log could be an operation log or a value log
  – Example: an intentions list (for atomic actions) is a value log used as a redo log
Value logs provide idempotency
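The idempotency of value logs can be illustrated with a small sketch. It assumes a key-value store held in a dictionary and two hypothetical replay functions: one replays a value log (each record holds the new value), the other replays an operation log (each record holds an increment).

```python
def redo_values(store, value_log):
    """Replay a value log: each record is (key, new_value)."""
    for key, new_value in value_log:
        store[key] = new_value
    return store


def redo_operations(store, operation_log):
    """Replay an operation log: each record is (key, increment)."""
    for key, delta in operation_log:
        store[key] = store.get(key, 0) + delta
    return store
```

Replaying the value log twice leaves the store unchanged after the first replay, so a crash during recovery is harmless. Replaying the operation log twice applies each increment twice, which is why operation logs need extra bookkeeping to avoid repeated reexecution.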
13
Handling Replicated Data
Availability of data D is provided through replication
  – At least one copy of D would be accessible from any node despite anticipated faults in the system
If D may be modified, it is essential to use rules to ensure correctness of data access and updates:
  1. Many processes can concurrently read D
  2. Only one process can write into D at any time
  3. Reading and writing cannot be performed concurrently
  4. A process reading D must see the latest value of D
14
Handling Replicated Data (continued)
Quorum: number of copies of D that must be accessed to perform a specific operation on D
Quorum algorithms enforce Rules 1–4 by specifying a read quorum Q_r and a write quorum Q_w
  – 2 × Q_w > n and Q_r + Q_w > n, where n is the number of copies of D
  – Two kinds of locks are used on D: read locks and write locks
  – If a system is required to tolerate faults in up to k nodes, we could choose:
      Q_r = k + 1
      Q_w = n − k
      n > 2 × k
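The quorum choice above can be verified numerically. The helper `choose_quorums` is a hypothetical name; it assumes n copies of D, one per node, and simply applies the formulas from the slide and checks both quorum conditions.

```python
def choose_quorums(n: int, k: int):
    """Return (Q_r, Q_w) for n copies tolerating faults in up to k nodes."""
    if n <= 2 * k:
        raise ValueError("need n > 2 * k to tolerate k faulty nodes")
    q_r = k + 1
    q_w = n - k
    # Two write quorums always overlap, so only one writer proceeds at a time.
    assert 2 * q_w > n
    # Every read quorum overlaps every write quorum, so a reader sees the
    # latest value and cannot run concurrently with a writer.
    assert q_r + q_w > n
    return q_r, q_w
```

With n = 5 and k = 2 this gives Q_r = 3 and Q_w = 3; with n = 4 and k = 2 the condition n > 2 × k fails and no valid choice exists.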
15
Handling Distributed Data
Distributed transaction: a facility for manipulating files located in different nodes of a distributed system in a mutually consistent manner
  – Also called a multisite transaction
  – Each node contains a transaction manager
  – The originating node contains a transaction coordinator
The coordinator implements the all-or-nothing property of transactions with the two-phase commit (2PC) protocol
  – Depending on responses from participating nodes, it decides whether to commit or abort the transaction
16
Two-Phase Commit Protocol
Phase 1
1. Actions of the transaction coordinator:
   a. Write a ‘Prepare T_i’ record in the log
   b. Set a time-out and send a ‘Prepare T_i’ message to all nodes participating in the transaction
2. Actions of a participating node:
   a. If it is ready to commit, write updates in stable storage and a ‘Prepared T_i’ record in the log, and send a ‘Prepared T_i’ message to the coordinator
   b. Otherwise, write an ‘Abandoned T_i’ record in the log and send an ‘Abandoned T_i’ message to the coordinator
17
Two-Phase Commit Protocol
Phase 2
1. Actions of the transaction coordinator:
   a. If it receives a ‘Prepared’ reply from all nodes before the time-out occurs, write a ‘Commit T_i’ record in the log and send ‘Commit T_i’ messages to all nodes
   b. Otherwise, write an ‘Abort T_i’ record in the log and send ‘Abort T_i’ messages to all nodes
   c. Wait for an acknowledgment from each node and write a ‘Complete T_i’ record in the log
2. Actions of a participating node, depending on the coordinator’s message:
   a. Write a ‘Commit T_i’ record in the log and perform commit processing
   b. Write an ‘Abort T_i’ record in the log and abandon the updates of T_i
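The two phases can be sketched as a single-process simulation. Everything here is an assumption made for illustration: the `Participant` class, the `two_phase_commit` function, in-memory lists standing in for logs on stable storage, and a `None` return value standing in for a reply that does not arrive before the time-out. Real implementations exchange messages over a network and must also handle coordinator failure, which this sketch ignores.

```python
class Participant:
    """Hypothetical participating node with an in-memory log."""

    def __init__(self, name, ready=True, responds=True):
        self.name = name
        self.ready = ready
        self.responds = responds
        self.log = []

    def prepare(self, t):
        if not self.responds:
            return None  # models a reply that misses the time-out
        if self.ready:
            self.log.append(f"Prepared {t}")
            return "Prepared"
        self.log.append(f"Abandoned {t}")
        return "Abandoned"

    def decide(self, t, decision):
        if not self.responds:
            return None  # no acknowledgment from an unreachable node
        self.log.append(f"{decision} {t}")
        return "Ack"


def two_phase_commit(t, participants):
    """Run both phases; return the decision and the coordinator's log."""
    log = [f"Prepare {t}"]
    # Phase 1: collect replies to the 'Prepare' message.
    replies = [p.prepare(t) for p in participants]
    # Phase 2: commit only if every node replied 'Prepared' in time.
    decision = "Commit" if all(r == "Prepared" for r in replies) else "Abort"
    log.append(f"{decision} {t}")
    acks = [p.decide(t, decision) for p in participants]
    if all(a == "Ack" for a in acks):
        log.append(f"Complete {t}")
    return decision, log
```

Note that the coordinator writes its ‘Commit’ or ‘Abort’ record before sending the decision, in line with the write-ahead logging principle.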
18
Resiliency
Resiliency techniques focus on minimizing the cost of reexecution when faults occur
  – Basis for resiliency: failures in a distributed system are partial, so some parts of a distributed computation may survive a failure
For example, distributed transactions use resiliency techniques:
  – Nested transactions
  – Tentative commits
19
Nested Transaction
A nested transaction T_ik is a part of an atomic transaction T_i
  – It commits only if its parent transaction T_i commits
  – It is implemented as follows:
      When T_ik completes, it is said to have reached a tentative commit
      When T_i wishes to commit, it checks whether all nested transactions have reached a tentative commit and can participate in commit processing
  – It is implemented using a 2PC protocol
20
Nested Transaction (continued)
Resiliency is implemented as follows:
  – If a nested transaction T_ik does not respond to a ‘Prepare’ message, the coordinator can retry T_ik in the same node or in some other node
      Suppose T_ik had reached a tentative commit and its node had failed when the ‘Prepare’ message was sent
      If the failed node recovers and the coordinator retries T_ik in it, the results of T_ik computed before the failure can be used
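The saving from a tentative commit can be sketched as follows. The sketch assumes that the results of a tentatively committed nested transaction are kept in storage that survives the node failure, modeled here by a plain dictionary; `run_nested` is a hypothetical helper, not an interface from the book.

```python
def run_nested(name, compute, tentative_results):
    """Run nested transaction `name`, reusing a surviving tentative result."""
    if name in tentative_results:
        # The node recovered with the tentative commit intact: no reexecution.
        return tentative_results[name]
    result = compute()
    tentative_results[name] = result  # the transaction reaches tentative commit
    return result
```

A retry in the recovered node returns the stored result without calling `compute` again. A retry in some other node would start with an empty store and reexecute the nested transaction.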
21
Summary
Recovery and fault tolerance are two approaches to the reliability of a computer system
  – Generically called recovery
A third recovery approach is resiliency
A fault causes an error in the state of the system, which leads to a failure
  – A fail-stop fault brings the system to a halt
  – An amnesia fault makes it lose a part of its state
  – A Byzantine fault makes it behave in an unpredictable manner
22
Summary (continued)
Recovery from non-Byzantine faults can be performed by using two approaches:
  – Backward recovery and forward recovery
Fault tolerance is implemented by maintaining logs
  – E.g., undo or do logs
  – Logs are used to implement atomic transactions
The two-phase commit (2PC) protocol is used to implement distributed transactions
Nested transactions are a resiliency technique
  – Used when a transaction involves data in many nodes