Chapter 19 Recovery and Fault Tolerance Copyright © 2008.


Operating Systems, by Dhananjay Dhamdhere

Introduction
Faults, Failures, and Recovery
Byzantine Faults and Agreement Protocols
Recovery
Fault Tolerance Techniques
Resiliency

Faults, Failures, and Recovery
A fault may damage the state of a system
  – Error: a part of the system state that is erroneous
  – Failure: unexpected behavior or situation

Faults, Failures, and Recovery (continued)
Recovery: for reliable operation, the system is restored to a consistent state and its operation is resumed
  – Recovery is performed when a failure is noticed

Classes of Faults
Fault model: properties that determine the kinds of errors or failures that might result from a fault
Classes of faults:
  – System fault: system crash
      Amnesia and partial amnesia faults
      A fail-stop fault brings a system to a halt
  – Process fault
      Byzantine faults: malicious or arbitrary actions
  – Storage fault: amnesia faults
  – Communication fault: nonamnesia faults

Overview of Recovery Techniques
For non-Byzantine faults, recovery involves restoring the system or application to a consistent state
  – Involves reexecuting some actions

Overview of Recovery Techniques (continued)
Recovery approaches are classified into:
  – Backward recovery: resetting the state of the entity affected by a fault to a prior state and resuming its operation
      Involves reexecution of some actions
  – Forward recovery: repairing the erroneous state of a system so that the system can continue its operation
      Repair cost depends on the nature of the computation
      May involve a certain amount of reexecution
Backward recovery is simpler to implement
  – But it requires a practical method of producing a consistent state recording of a system

Byzantine Faults and Agreement Protocols
Byzantine faults have been studied only in the restricted context of agreement between processes
  – Byzantine generals problem: attack or retreat
Agreement protocols are used for:
  – Byzantine agreement problem
  – Consensus problem
  – Interactive consistency problem
Impossibility result: a group of three processes (m = 3) containing one faulty process (k = 1) cannot reach agreement
  – For m > 3, agreement is possible if m > 3k
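
The bound on the slide can be checked with a few lines of Python. This is only a sketch of the arithmetic, not an agreement protocol; the function names `agreement_possible` and `max_faulty` are hypothetical and do not come from the book.

```python
def agreement_possible(m: int, k: int) -> bool:
    """Return True if m processes with k Byzantine-faulty ones satisfy m > 3k."""
    return m > 3 * k


def max_faulty(m: int) -> int:
    """Largest k for which m > 3k still holds (derived from the same bound)."""
    return max((m - 1) // 3, 0)


if __name__ == "__main__":
    for m in (3, 4, 6, 7):
        print(m, "processes tolerate at most", max_faulty(m), "faulty process(es)")
```

With m = 3 and k = 1 the check fails, which matches the impossibility result; four processes are the smallest group that can tolerate one faulty process under this bound.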

Recovery
A recovery scheme consists of two components:
  – Checkpointing algorithm
      Decides when to take a checkpoint for a process
  – Recovery algorithm
      Uses checkpoints to roll back processes such that the new process states are mutually consistent

Recovery (continued)
The state of a process cannot be recovered in isolation
  – Recovery must restore a state of the computation S' in which the states of all pairs of processes are mutually consistent
The goal of a recovery algorithm is to decide:
  – Whether a process Pi should be rolled back
  – The checkpoint to which Pi should be rolled back
Checkpointing may be asynchronous or synchronous
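
The slides do not spell out what "mutually consistent" means. A common reading, assumed here, is that no checkpoint records the receipt of a message whose sending is missing from the sender's checkpoint (an orphan message). The sketch below tests that condition for one pair of checkpoints; the dictionary layout and the names `orphans` and `mutually_consistent` are hypothetical.

```python
# A checkpoint is modeled as a dict:
#   {"proc": process name, "sent": set of messages, "received": set of messages}
# A message is a tuple (sender, receiver, sequence_number).

def orphans(receiver_cp, sender_cp):
    """Messages the receiver recorded from this sender that the sender never recorded sending."""
    return {
        msg for msg in receiver_cp["received"]
        if msg[0] == sender_cp["proc"] and msg not in sender_cp["sent"]
    }


def mutually_consistent(cp_a, cp_b):
    """Two checkpoints are consistent if neither contains an orphan message from the other."""
    return not orphans(cp_a, cp_b) and not orphans(cp_b, cp_a)


if __name__ == "__main__":
    msg = ("P1", "P2", 1)
    p1_late = {"proc": "P1", "sent": {msg}, "received": set()}
    p1_early = {"proc": "P1", "sent": set(), "received": set()}
    p2_late = {"proc": "P2", "sent": set(), "received": {msg}}
    print(mutually_consistent(p1_late, p2_late))   # send and receive both recorded
    print(mutually_consistent(p1_early, p2_late))  # receive recorded without the send
```

Under this assumption, rolling P1 back to `p1_early` would force P2 to be rolled back as well, which is why the state of a process cannot be recovered in isolation.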

Fault Tolerance Techniques
The basic principle in fault tolerance is to ensure that a fault either does not cause an error
  – Or that the error can be removed easily
Two facets of the tolerance of system faults that follow the fail-stop model:
  – Fault tolerance for replicated data
  – Fault tolerance for distributed data

Logs, Forward Recovery, and Backward Recovery
A log is a record of actions or activities in a process
  – Do logs (also called redo logs)
      Used to implement forward recovery
  – Undo logs
      Used to implement backward recovery
A write-ahead logging principle is used
A log could be an operation log or a value log
  – Example: an intentions list (for atomic actions) is a value log used as a redo log
Value logs provide idempotency
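
The following sketch illustrates value logs used as undo and redo logs, and why replaying a value log is idempotent. It keeps everything in memory and assumes every key already exists in the store; the names `logged_write`, `apply_undo_log`, and `apply_redo_log` are hypothetical.

```python
def logged_write(store, key, value, undo_log, redo_log):
    """Write-ahead: record the old and new values before updating the store."""
    undo_log.append((key, store[key]))  # value log entry for backward recovery
    redo_log.append((key, value))       # value log entry for forward recovery
    store[key] = value


def apply_undo_log(store, undo_log):
    """Backward recovery: restore old values, most recent update first."""
    for key, old_value in reversed(undo_log):
        store[key] = old_value
    return store


def apply_redo_log(store, redo_log):
    """Forward recovery: reinstall new values in their original order."""
    for key, new_value in redo_log:
        store[key] = new_value
    return store


if __name__ == "__main__":
    store, undo_log, redo_log = {"x": 1, "y": 2}, [], []
    logged_write(store, "x", 10, undo_log, redo_log)
    logged_write(store, "x", 20, undo_log, redo_log)
    logged_write(store, "y", 5, undo_log, redo_log)
    print(apply_undo_log(dict(store), undo_log))   # state before the updates
    once = apply_redo_log({"x": 1, "y": 2}, redo_log)
    twice = apply_redo_log(dict(once), redo_log)
    print(once == twice)                           # replaying again changes nothing
```

An operation log entry such as "add 10 to x" would not have this property: replaying it twice after a crash would apply the increment twice.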

Handling Replicated Data
Availability of data D is provided through replication
  – At least one copy of D would be accessible from any node despite anticipated faults in the system
If D may be modified, it is essential to use rules to ensure correctness of data access and updates:
  1. Many processes can concurrently read D
  2. Only one process can write into D at any time
  3. Reading and writing can’t be performed concurrently
  4. A process reading D must see the latest value of D

Handling Replicated Data (continued)
Quorum: number of copies of D that must be accessed to perform a specific operation on D
Quorum algorithms enforce Rules 1–4 by specifying a read quorum Qr and a write quorum Qw
  – 2 × Qw > n and Qr + Qw > n
  – Two kinds of locks are used on D: read locks and write locks
  – If a system is required to tolerate faults in up to k nodes, we could choose:
      Qr = k + 1
      Qw = n − k
      n > 2 × k
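
The quorum conditions can be checked mechanically. The sketch below takes n to be the total number of copies of D, which is an assumption since the slide does not define n, and it does not model the read and write locks. The function names `quorums_valid` and `choose_quorums` are hypothetical.

```python
def quorums_valid(n: int, q_r: int, q_w: int) -> bool:
    """Check the two conditions from the slide: 2*Qw > n and Qr + Qw > n."""
    return 2 * q_w > n and q_r + q_w > n


def choose_quorums(n: int, k: int) -> tuple:
    """Pick Qr = k + 1 and Qw = n - k to tolerate faults in up to k nodes."""
    if not n > 2 * k:
        raise ValueError("need n > 2k to tolerate k faulty nodes")
    return k + 1, n - k


if __name__ == "__main__":
    n, k = 5, 2
    q_r, q_w = choose_quorums(n, k)
    print(q_r, q_w, quorums_valid(n, q_r, q_w))
```

With Qr = k + 1 and Qw = n − k, the sum Qr + Qw is n + 1, so every read quorum overlaps every write quorum. The condition 2 × Qw > n reduces to n > 2k, so any two write quorums also overlap.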

Handling Distributed Data
Distributed transaction: a facility for manipulating files located in different nodes of a distributed system in a mutually consistent manner
  – Also called a multisite transaction
  – Each node contains a transaction manager
  – The originating node contains a transaction coordinator
The coordinator implements the all-or-nothing property of transactions with the two-phase commit (2PC) protocol
  – Depending on responses from participating nodes, it decides whether to commit or abort the transaction

Two-Phase Commit Protocol
Phase 1
  1. Actions of the transaction coordinator:
      a. Write a ‘Prepare Ti’ record in the log
      b. Set a time-out and send a ‘Prepare Ti’ message to all nodes participating in the transaction
  2. Actions of a participating node:
      a. If it is ready to commit, write updates in stable storage and a ‘Prepared Ti’ record in the log, and send a ‘Prepared Ti’ message to the coordinator
      b. Otherwise, write an ‘Abandoned Ti’ record in the log and send an ‘Abandoned Ti’ message to the coordinator

Two-Phase Commit Protocol
Phase 2
  1. Actions of the transaction coordinator:
      a. If it receives a ‘Prepared’ reply from all nodes before the time-out occurs, write a ‘Commit Ti’ record in the log and send ‘Commit Ti’ messages to all nodes
      b. Otherwise, write an ‘Abort Ti’ record in the log and send ‘Abort Ti’ messages to all nodes
      c. Wait for an acknowledgment from each node and write a ‘Complete Ti’ record in the log
  2. Actions of a participating node, depending on the coordinator’s message:
      a. Write a ‘Commit Ti’ record in the log and perform commit processing
      b. Write an ‘Abort Ti’ record in the log and abandon the updates of Ti
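
The two phases can be traced with a small in-memory simulation. This is a sketch under simplifying assumptions: logs are Python lists rather than stable storage, messages are method calls, and a missed time-out is modeled by a participant whose `prepare` returns `None`. The class and function names are hypothetical.

```python
class Participant:
    def __init__(self, name, ready=True, responds=True):
        self.name = name
        self.ready = ready        # whether the node is ready to commit
        self.responds = responds  # False models a reply that misses the time-out
        self.log = []

    def prepare(self, tid):
        """Phase 1, participant side."""
        if not self.responds:
            return None
        if self.ready:
            # Updates are assumed to be written to stable storage at this point.
            self.log.append(f"Prepared {tid}")
            return "Prepared"
        self.log.append(f"Abandoned {tid}")
        return "Abandoned"

    def decide(self, tid, decision):
        """Phase 2, participant side: log the decision and acknowledge it."""
        self.log.append(f"{decision} {tid}")
        return "Ack"


def two_phase_commit(tid, participants):
    """Coordinator side of both phases; returns the decision and the coordinator log."""
    log = [f"Prepare {tid}"]                                  # Phase 1, step 1a
    replies = [p.prepare(tid) for p in participants]          # Phase 1, step 1b
    decision = "Commit" if all(r == "Prepared" for r in replies) else "Abort"
    log.append(f"{decision} {tid}")                           # Phase 2, step 1a or 1b
    acks = [p.decide(tid, decision) for p in participants]
    if all(a == "Ack" for a in acks):
        log.append(f"Complete {tid}")                         # Phase 2, step 1c
    return decision, log


if __name__ == "__main__":
    print(two_phase_commit("T1", [Participant("A"), Participant("B")]))
    print(two_phase_commit("T2", [Participant("A"), Participant("B", ready=False)]))
```

A single ‘Abandoned’ reply or a single missing reply is enough to abort the transaction at every node, which is how the coordinator enforces the all-or-nothing property.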

Resiliency
Resiliency techniques focus on minimizing the cost of reexecution when faults occur
  – Basis for resiliency: failures in a distributed system are partial, so some parts of a distributed computation may survive a failure
For example, distributed transactions use these resiliency techniques:
  – Nested transactions
  – Tentative commits

Nested Transaction
A nested transaction Tik is a part of an atomic transaction Ti
  – It commits only if its parent transaction Ti commits
  – It is implemented as follows:
      When Tik completes, it is said to have reached a tentative commit
      When Ti wishes to commit, it checks whether all nested transactions have reached a tentative commit and can participate in commit processing
  – Commit processing is implemented using a 2PC protocol

Nested Transaction (continued)
Resiliency is implemented as follows:
  – If a nested transaction Tik does not respond to a ‘Prepare’ message, the coordinator can retry Tik in the same node or in some other node
  – Suppose Tik had reached a tentative commit and its node had failed when the ‘Prepare’ message was sent
      If the failed node recovers and the coordinator retries Tik in it, the results of Tik computed before the failure can be used
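
The retry rule can be sketched as follows. The sketch assumes that results of a tentatively committed nested transaction survive a node failure (for example because they were written to stable storage), which the slides imply but do not state. The `Node` class and `prepare_with_retry` function are hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.tentative = {}   # results of tentative commits, assumed to survive a crash
        self.executions = 0   # counts how often work was actually executed here

    def run(self, sub_id, work):
        """Execute a nested transaction unless its earlier results are still available."""
        if sub_id in self.tentative:
            return self.tentative[sub_id]      # reuse results computed before the failure
        self.executions += 1
        self.tentative[sub_id] = work()
        return self.tentative[sub_id]

    def prepare(self, sub_id):
        """Reply to 'Prepare' only if the node is up and the tentative commit exists."""
        return "Prepared" if self.up and sub_id in self.tentative else None


def prepare_with_retry(sub_id, work, nodes):
    """Coordinator side: retry the nested transaction on each available node in turn."""
    for node in nodes:
        if not node.up:
            continue
        node.run(sub_id, work)
        if node.prepare(sub_id) == "Prepared":
            return node
    return None


if __name__ == "__main__":
    n1 = Node("N1")
    n1.run("T1.1", lambda: 42)      # tentative commit reached
    n1.up = False                   # node fails before 'Prepare' arrives
    print(n1.prepare("T1.1"))       # no reply
    n1.up = True                    # node recovers
    print(prepare_with_retry("T1.1", lambda: 42, [n1]).name, n1.executions)
```

If the original node stays down, the same function falls through to the next node in the list and reexecutes the nested transaction there, which costs one extra execution instead of aborting the whole parent transaction.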

Summary
Recovery and fault tolerance are two approaches to reliability of a computer system
  – Generically called recovery
A third recovery approach is resiliency
A fault causes an error in the state of the system, which leads to a failure
  – A fail-stop fault brings the system to a halt
  – An amnesia fault makes it lose a part of its state
  – A Byzantine fault makes it behave in an unpredictable manner

Summary (continued)
Recovery from non-Byzantine faults can be performed by using two approaches:
  – Backward recovery and forward recovery
Fault tolerance is implemented by maintaining logs
  – E.g., undo or do logs
  – Logs are used to implement atomic transactions
The two-phase commit protocol (2PC protocol) is used
Nested transactions are a resiliency technique
  – Used when a transaction involves data in many nodes