Published by Patrick Robbins
Modified over 4 years ago
Chapter 19 Recovery and Fault Tolerance Copyright © 2008
Operating Systems, by Dhananjay Dhamdhere. Copyright © 2008

Introduction
–Faults, Failures, and Recovery
–Byzantine Faults and Agreement Protocols
–Recovery
–Fault Tolerance Techniques
–Resiliency

Faults, Failures, and Recovery
A fault may damage the state of a system
–Error: a part of the system state that is erroneous
–Failure: unexpected behavior or situation

Faults, Failures, and Recovery (continued)
Recovery: for reliable operation, the system is restored to a consistent state and its operation is resumed
–A recovery is performed when a failure is noticed

Classes of Faults
Fault model: properties that determine the kinds of errors/failures that might result from a fault
Classes of faults:
–System fault: system crash
  Amnesia and partial amnesia faults
  A fail-stop fault brings a system to a halt
–Process fault
  Byzantine faults: malicious or arbitrary actions
–Storage fault: amnesia faults
–Communication fault: nonamnesia faults

Overview of Recovery Techniques
For non-Byzantine faults, recovery involves restoring the system or application to a consistent state
–Involves reexecuting some actions

Overview of Recovery Techniques (continued)
Recovery approaches are classified into:
–Backward recovery: resetting the state of the entity affected by a fault to a prior state and resuming its operation
  Involves reexecution of some actions
–Forward recovery: repairing the erroneous state of a system so that the system can continue its operation
  Repair cost depends on the nature of the computation
  May involve a certain amount of reexecution
Backward recovery is simpler to implement
–But it requires a practical method of producing a consistent state recording of a system

Byzantine Faults and Agreement Protocols
Byzantine faults have been studied only in the restricted context of agreement between processes
–Byzantine generals problem: attack or retreat
Agreement protocols are used for:
–Byzantine agreement problem
–Consensus problem
–Interactive consistency problem
Impossibility result: a group of three processes (m = 3) containing one faulty process (k = 1) cannot reach agreement
–In general, with m processes of which k are faulty, agreement is possible only if m > 3k
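
The bound on this slide can be checked numerically. The following is a minimal sketch, assuming the m > 3k condition stated above; the function names are hypothetical and are not taken from the slides.

```python
def agreement_possible(m: int, k: int) -> bool:
    """Return True if m processes, k of them Byzantine-faulty, satisfy m > 3k."""
    if m < 1 or k < 0:
        raise ValueError("need m >= 1 and k >= 0")
    return m > 3 * k


def max_tolerable_faults(m: int) -> int:
    """Largest k for which m > 3k still holds."""
    if m < 1:
        raise ValueError("need m >= 1")
    return (m - 1) // 3
```

For example, `agreement_possible(3, 1)` is false, which matches the impossibility result, while adding a fourth process makes one fault tolerable.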

Recovery
A recovery scheme consists of two components:
–Checkpointing algorithm
  Decides when to take a checkpoint for a process
–Recovery algorithm
  Uses checkpoints to roll back processes such that the new process states are mutually consistent

Recovery (continued)
The state of a process cannot be recovered in isolation
–Must restore a state of the computation S’ in which the states of all pairs of processes are mutually consistent
The goal of the recovery algorithm is to decide:
–Whether a process P_i should be rolled back
–The checkpoint to which P_i should be rolled back
Asynchronous or synchronous checkpointing
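
To illustrate why a process cannot be rolled back in isolation, here is a minimal sketch of a consistency check over a set of checkpoints. It assumes that "mutually consistent" means no checkpoint records the receipt of a message that no checkpoint records as sent (an orphan message); the `Checkpoint` type and the message ids are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Checkpoint:
    process: str
    # ids of messages this process had sent when the checkpoint was taken
    sent: frozenset = field(default_factory=frozenset)
    # ids of messages this process had received when the checkpoint was taken
    received: frozenset = field(default_factory=frozenset)


def mutually_consistent(checkpoints) -> bool:
    """True if every recorded receive has a matching recorded send."""
    all_sent = set()
    for cp in checkpoints:
        all_sent |= cp.sent
    return all(cp.received <= all_sent for cp in checkpoints)
```

If P1 is rolled back to a checkpoint taken before it sent a message that P2's checkpoint records as received, the check fails, so P2 must be rolled back as well.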

Fault Tolerance Techniques
The basic principle in fault tolerance is to ensure that a fault either does not cause an error
–Or that the error can be removed easily
Two facets of the tolerance of system faults that follow the fail-stop model:
–Fault tolerance for replicated data
–Fault tolerance for distributed data

Logs, Forward Recovery, and Backward Recovery
A log is a record of actions or activities in a process
–Do logs (also called redo logs)
  Used to implement forward recovery
–Undo logs
  Used to implement backward recovery
A write-ahead logging principle is used
A log could be an operation log or a value log
–Example: an intentions list (for atomic actions) is a value log used as a redo log
Value logs provide idempotency
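
The idempotency of value logs can be shown with a small sketch. It assumes a redo log holds (key, new value) entries and an undo log holds (key, old value) entries; the function names and the dictionary-based state are illustrative assumptions, not the book's implementation.

```python
def redo_value_log(state: dict, log: list) -> dict:
    """Forward recovery: reapply each (key, new_value) entry in log order."""
    new_state = dict(state)
    for key, value in log:
        new_state[key] = value
    return new_state


def undo_value_log(state: dict, log: list) -> dict:
    """Backward recovery: restore each (key, old_value) entry, newest first."""
    new_state = dict(state)
    for key, old_value in reversed(log):
        new_state[key] = old_value
    return new_state
```

Because each entry stores a value rather than an operation such as "add 5", replaying the same log after a second crash yields the same state.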

Handling Replicated Data
Availability of data D is provided through replication
–At least one copy of D would be accessible from any node despite anticipated faults in the system
If D may be modified, it is essential to use rules to ensure correctness of data access and updates:
1. Many processes can concurrently read D
2. Only one process can write into D at any time
3. Reading and writing can’t be performed concurrently
4. A process reading D must see the latest value of D

Handling Replicated Data (continued)
Quorum: the number of copies of D that must be accessed to perform a specific operation on D
Quorum algorithms enforce Rules 1–4 by specifying a read quorum Q_r and a write quorum Q_w, where n is the number of copies of D
–2 × Q_w > n and Q_r + Q_w > n
–Two kinds of locks are used on D: read and write locks
–If a system is required to tolerate faults in up to k nodes, we could choose:
  Q_r = k + 1
  Q_w = n − k
  n > 2 × k
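
The quorum choice above can be sketched as follows. The code assumes n is the number of copies of D and k the number of node faults to tolerate; the function names are hypothetical.

```python
def quorums_valid(n: int, q_r: int, q_w: int) -> bool:
    """Check the two quorum conditions: 2 * Q_w > n and Q_r + Q_w > n."""
    return 2 * q_w > n and q_r + q_w > n


def choose_quorums(n: int, k: int):
    """Return (Q_r, Q_w) = (k + 1, n - k); requires n > 2 * k."""
    if k < 0 or n <= 2 * k:
        raise ValueError("need k >= 0 and n > 2 * k")
    return k + 1, n - k
```

With n = 5 and k = 2 this gives Q_r = 3 and Q_w = 3: any two write quorums overlap, and any read quorum overlaps every write quorum, so a reader always reaches at least one copy holding the latest value.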

Handling Distributed Data
Distributed transaction: a facility for manipulating files located in different nodes of a distributed system in a mutually consistent manner
–Also called a multisite transaction
–Each node contains a transaction manager
–The originating node contains a transaction coordinator
The coordinator implements the all-or-nothing property of transactions with the two-phase commit (2PC) protocol
–Depending on responses from participating nodes, it decides whether to commit or abort the transaction

Two-Phase Commit Protocol
Phase 1
1. Actions of the transaction coordinator:
  a. Write a ‘Prepare T_i’ record in the log
  b. Set a time-out and send a ‘Prepare T_i’ message to all nodes participating in the transaction
2. Actions of a participating node:
  a. If it is ready to commit, write updates in stable storage and a ‘Prepared T_i’ record in the log, and send a ‘Prepared T_i’ message to the coordinator
  b. Otherwise, write an ‘Abandoned T_i’ record in the log and send an ‘Abandoned T_i’ message to the coordinator

Two-Phase Commit Protocol
Phase 2
1. Actions of the transaction coordinator:
  a. If it receives a ‘Prepared’ reply from all nodes before the time-out occurs, write a ‘Commit T_i’ record in the log and send ‘Commit T_i’ messages to all nodes
  b. Otherwise, write an ‘Abort T_i’ record in the log and send ‘Abort T_i’ messages to all nodes
  c. Wait for an acknowledgment from each node and write a ‘Complete T_i’ record in the log
2. Actions of a participating node, depending on the coordinator’s message:
  a. Write a ‘Commit T_i’ record in the log and perform commit processing
  b. Write an ‘Abort T_i’ record in the log and abandon the updates of T_i
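
The two phases can be traced with a small single-process simulation. This is only a sketch under stated assumptions: messages are replaced by a dictionary of votes, a vote of `None` stands for a node whose reply did not arrive before the time-out, such a node is assumed to stay unreachable (so it never acknowledges), and the function name `two_phase_commit` is hypothetical.

```python
def two_phase_commit(tid: str, votes: dict):
    """Simulate 2PC for transaction tid.

    votes maps node name -> 'Prepared', 'Abandoned', or None (no reply in time).
    Returns (decision, coordinator_log, participant_logs).
    """
    # Phase 1: coordinator logs 'Prepare', participants log their replies.
    coordinator_log = [f"Prepare {tid}"]
    participant_logs = {
        node: ([] if vote is None else [f"{vote} {tid}"])
        for node, vote in votes.items()
    }

    # Phase 2: commit only if every node replied 'Prepared' before the time-out.
    all_prepared = all(vote == "Prepared" for vote in votes.values())
    decision = "Commit" if all_prepared else "Abort"
    coordinator_log.append(f"{decision} {tid}")

    acks = 0
    for node, vote in votes.items():
        if vote is not None:  # assumption: a silent node stays unreachable
            participant_logs[node].append(f"{decision} {tid}")
            acks += 1

    # 'Complete' is logged only once every node has acknowledged.
    if acks == len(votes):
        coordinator_log.append(f"Complete {tid}")
    return decision, coordinator_log, participant_logs
```

In a real system the coordinator would keep resending its decision to an unreachable node until it acknowledges; the sketch simply leaves the ‘Complete’ record unwritten in that case.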

Resiliency
Resiliency techniques focus on minimizing the cost of reexecution when faults occur
–Basis for resiliency: failures in a distributed system are partial, so some parts of a distributed computation may survive a failure
For example, distributed transactions use resiliency techniques:
–Nested transactions
–Tentative commits

Nested Transaction
A nested transaction T_ik is a part of an atomic transaction T_i
–It commits only if its parent transaction T_i commits
–It is implemented as follows:
  When T_ik completes, it is said to have reached a tentative commit
  When T_i wishes to commit, it checks whether all nested transactions have reached a tentative commit and can participate in commit processing
–It is implemented using a 2PC protocol

Nested Transaction (continued)
Resiliency is implemented as follows:
–If a nested transaction T_ik does not respond to a ‘Prepare’ message, the coordinator can retry T_ik in the same node or in some other node
–If T_ik had reached a tentative commit, its node had failed when the ‘Prepare’ message was sent, and the coordinator retries T_ik in that node after it recovers, the results of T_ik computed before the failure can be used
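
The reuse of tentatively committed results on a retry can be sketched as below. The `Node` class, its `tentative` dictionary (standing in for storage that survives a node failure), and the `run` method are all illustrative assumptions rather than anything defined in the slides.

```python
class Node:
    """A node that remembers the results of tentatively committed subtransactions."""

    def __init__(self):
        self.tentative = {}   # stand-in for stable storage that survives a crash
        self.executions = 0   # counts how often a subtransaction was really run

    def run(self, sub_id, compute):
        """Run subtransaction sub_id, or reuse its saved tentative-commit result."""
        if sub_id in self.tentative:
            return self.tentative[sub_id]
        self.executions += 1
        result = compute()
        self.tentative[sub_id] = result  # the subtransaction reaches tentative commit
        return result
```

When the coordinator retries a subtransaction in a recovered node, the second call to `run` returns the saved result without reexecution, which is the cost saving that resiliency aims for.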

Summary
Recovery and fault tolerance are two approaches to reliability of a computer system
–Generically called recovery
A third recovery approach is resiliency
A fault causes an error in the state of the system, which leads to a failure
–A fail-stop fault brings the system to a halt
–An amnesia fault makes it lose a part of its state
–A Byzantine fault makes it behave in an unpredictable manner

Summary (continued)
Recovery from non-Byzantine faults can be performed by using two approaches:
–Backward recovery and forward recovery
Fault tolerance is implemented by maintaining logs
–E.g., undo or do logs
–Logs are used to implement atomic transactions
  The two-phase commit protocol (2PC protocol) is used
Nested transactions are a resiliency technique
–Used when a transaction involves data in many nodes