Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved. 0-13-239227-5 Chapter 8 Fault.


Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved Chapter 8 Fault Tolerance (2) DISTRIBUTED SYSTEMS (dDist)

Distributed Commit (1/2)
- Given a process group and an operation
  - The operation might or might not be committable at all processes
- Either everybody commits or everybody aborts
  - Consistency, validity, termination

Distributed Commit (2/2)
- Can we not just do this with Virtual Synchrony?
  - Coordinator multicasts vote request
  - All processes respond to request
  - Coordinator multicasts vote result: COMMIT iff all vote COMMIT
- This handles some error cases
- But what if a participant B crashes after voting COMMIT but before the COMMIT result is broadcast, and then comes back to life?

Two-Phase Commit
1) Commit →
2) Vote-request →
3) Vote-commit ←
4) Global-commit →
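The global decision at the end of this message exchange follows the rule "COMMIT iff all vote COMMIT". A minimal sketch of that rule, assuming votes arrive as a dictionary keyed by participant name (the names `Vote`, `Decision`, and `decide` are illustrative and do not come from the slides):

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "VOTE_COMMIT"
    ABORT = "VOTE_ABORT"


class Decision(Enum):
    COMMIT = "GLOBAL_COMMIT"
    ABORT = "GLOBAL_ABORT"


def decide(votes, expected):
    """Return GLOBAL_COMMIT only if every expected participant voted COMMIT.

    votes:    dict mapping participant name -> Vote (a missing key means
              that no vote arrived from that participant)
    expected: iterable with the names of all participants in the group
    """
    for participant in expected:
        if votes.get(participant) is not Vote.COMMIT:
            # A missing vote is treated like an ABORT vote.
            return Decision.ABORT
    return Decision.COMMIT
```

Treating a missing vote as ABORT is the conservative choice: the silent participant may have voted ABORT, so committing could be wrong.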

Two-Phase Commit
Figure: (a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant. (Legend: each transition is labeled with its input event and its output event.)

Two-Phase Commit
- 2PC detects crashes via timeouts
- 2PC handles crashes by logging state to permanent storage, turning crash errors into reset errors
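Turning a crash into a reset means a restarted process can read its log and resume in the state it had reached. A minimal sketch for a participant, assuming the log is a list of record names such as `VOTE_COMMIT` or `GLOBAL_ABORT` (the record names and the helper `recover_participant_state` are assumptions made for illustration):

```python
# Assumed mapping from the last log record to the 2PC participant state.
RECORD_TO_STATE = {
    "INIT": "INIT",
    "VOTE_COMMIT": "READY",     # voted COMMIT, decision still unknown
    "VOTE_ABORT": "ABORT",
    "GLOBAL_COMMIT": "COMMIT",
    "GLOBAL_ABORT": "ABORT",
}


def recover_participant_state(log_records):
    """Replay the log in order and return the state the participant was in."""
    state = "INIT"  # an empty log means nothing had happened yet
    for record in log_records:
        # Unknown records are ignored rather than guessed at.
        state = RECORD_TO_STATE.get(record, state)
    return state
```

A participant that recovers into READY is in exactly the situation the later slides discuss: it has voted COMMIT and must learn the global decision before it can proceed.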

Coordinator Perspective
- Blocks in WAIT
  - A participant may have failed
  - That participant might have voted ABORT, in which case a GLOBAL COMMIT would be wrong and irreversible
  - So, on TIMEOUT, the coordinator must do a GLOBAL ABORT

Coordinator Perspective
Figure: Outline of the steps taken by the coordinator in a two-phase commit protocol.
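The figure itself did not survive in this transcript. As a rough sketch of the coordinator steps described on these slides (log, multicast the vote request, collect votes, abort on timeout), assuming hypothetical `send` and `wait_for_vote` callbacks and a plain list in place of permanent storage:

```python
def run_coordinator(participants, send, wait_for_vote, log):
    """Run one 2PC round from the coordinator's side and return the decision.

    send(p, msg)      -- hypothetical transport callback
    wait_for_vote(p)  -- returns "VOTE_COMMIT", "VOTE_ABORT", or None on timeout
    log               -- list standing in for permanent storage
    """
    log.append("START_2PC")
    for p in participants:
        send(p, "VOTE_REQUEST")

    decision = "GLOBAL_COMMIT"
    for p in participants:
        vote = wait_for_vote(p)
        if vote != "VOTE_COMMIT":
            # Either a timeout (None) or an explicit VOTE_ABORT:
            # committing could be wrong, so abort globally.
            decision = "GLOBAL_ABORT"
            break

    log.append(decision)  # record the decision before announcing it
    for p in participants:
        send(p, decision)
    return decision
```

Writing the decision to the log before multicasting it lets a coordinator that crashes mid-broadcast resend the same decision after it restarts.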


Participant Perspective
- Blocks in READY
  - Coordinator may have failed
- What to do?
  - Some participants may already have committed…
  - Perhaps another participant knows what to do…?

Participant Perspective
Figure: Actions taken by a participant P when residing in state READY and having contacted another participant Q.
- Q in COMMIT: we know that the coordinator managed to start the commit, so P commits
- Q in ABORT: at least one participant aborted and the coordinator noticed, so P aborts
- Q in INIT: Q did not even receive the vote-request, so no one has committed yet, and P aborts
- Q in READY: P contacts another participant. What if all are in READY? If that is still the case after a timeout that allows all messages in transit to arrive, P blocks
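The per-state actions can be sketched as a lookup over the peers P manages to contact. This is a minimal illustration; the function name `resolve_ready` and the `"BLOCK"` result are assumptions, not names from the slides:

```python
# Action for P (in READY) given the state reported by a contacted peer Q.
ACTION_BY_Q_STATE = {
    "COMMIT": "COMMIT",  # the coordinator had already decided to commit
    "ABORT": "ABORT",    # the coordinator had already decided to abort
    "INIT": "ABORT",     # Q never voted, so no one can have committed
}


def resolve_ready(peer_states):
    """Return P's action given the states of the peers it could contact."""
    for state in peer_states:
        if state in ACTION_BY_Q_STATE:
            return ACTION_BY_Q_STATE[state]
    # Every reachable peer is in READY (or none answered): P cannot decide.
    return "BLOCK"
```

For example, `resolve_ready(["READY", "INIT"])` yields `"ABORT"`, while a list containing only `"READY"` yields `"BLOCK"`.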

Two-Phase Commit
Figure: (a) The steps taken by a participant process in 2PC.
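The figure content is missing from this transcript. A hedged sketch of the participant side, pieced together from the behaviour described on the surrounding slides (vote, log, wait for the decision, ask peers on timeout), could look as follows. All callback names are hypothetical, and returning `"BLOCKED"` stands in for "keep waiting":

```python
def run_participant(can_commit, wait_for, send_to_coordinator, ask_peers, log):
    """Run one 2PC round from a participant's side.

    can_commit           -- whether this participant is able to commit locally
    wait_for(kind)       -- returns the message of that kind, or None on timeout
    send_to_coordinator  -- hypothetical transport callback
    ask_peers()          -- returns a decision learned from peers, or None
    log                  -- list standing in for permanent storage
    """
    log.append("INIT")
    if wait_for("VOTE_REQUEST") is None:
        # The coordinator never asked: it is safe to abort unilaterally.
        log.append("VOTE_ABORT")
        return "ABORT"

    if not can_commit:
        log.append("VOTE_ABORT")
        send_to_coordinator("VOTE_ABORT")
        return "ABORT"

    log.append("VOTE_COMMIT")  # log the vote before sending it
    send_to_coordinator("VOTE_COMMIT")

    decision = wait_for("DECISION")
    if decision is None:
        # Coordinator timed out while we are in READY: ask the others.
        decision = ask_peers()
        if decision is None:
            return "BLOCKED"

    log.append(decision)
    return "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"
```

The asymmetry is deliberate: before voting, a timeout allows a unilateral abort, whereas after voting COMMIT the participant can no longer decide alone.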

All READY (1/2)
- Why do we block when all live participants are in the READY state?

All READY (2/2)
- Same view, but different decisions, so Yellow needs to wait for Blue or Green to come up again and inspect their log files!

Two-Phase Commit
- Two-Phase Commit has the problem that if the coordinator and one participant crash at a bad time, the entire system freezes until one of them is up again
- Getting a server up and running again typically involves human (a.k.a. very slow) intervention

Three-Phase Commit
- Three-Phase Commit enhances Two-Phase Commit in that it is non-blocking in many more cases
- As long as the live participants can make a majority decision, they can continue on their own
- If there are many participants, this makes it very unlikely that 3PC blocks

Three-Phase Commit
Figure: (a) The finite state machine for the coordinator in 3PC. (b) The finite state machine for a participant.


On timeout:
- IF anyone in ABORT → ABORT
- ELIF anyone in COMMIT → COMMIT
- ELIF anyone in INIT → ABORT
- ELSE elect a new coordinator among the live processes

New coordinator: go to WAIT, and from there go to ABORT or PRECOMMIT
- ABORT: if a majority of participants are in READY
- PRECOMMIT: if a majority are in PRECOMMIT
- If no majority, then block
- If anyone is in PRECOMMIT, the original coordinator's vote is counted as PRECOMMIT, since the original coordinator must have been in PRECOMMIT

More Non-Blocking
- It follows from the decision rules that the live agents can always make decisions on their own, unless no true majority for READY or PRECOMMIT can be found
- True majority: a majority among all processes, both dead and live
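The timeout rules together with the true-majority definition can be sketched as one decision function. This is an illustration under stated assumptions: `n_total` counts every process including the crashed original coordinator, `live_states` lists only the live processes, and the name `recover_3pc` is made up for this sketch.

```python
def recover_3pc(live_states, n_total):
    """Decide the next step after a timeout in 3PC.

    live_states -- states of the live processes (the crashed original
                   coordinator is assumed NOT to be in this list)
    n_total     -- number of all processes, dead and live (assumption:
                   this count includes the original coordinator)
    Returns "ABORT", "COMMIT", "PRECOMMIT", or "BLOCK".
    """
    if "ABORT" in live_states:
        return "ABORT"
    if "COMMIT" in live_states:
        return "COMMIT"
    if "INIT" in live_states:
        return "ABORT"

    # Only READY and PRECOMMIT remain: the new coordinator counts votes.
    ready = live_states.count("READY")
    precommit = live_states.count("PRECOMMIT")
    if precommit > 0:
        # Someone saw PRECOMMIT, so the original coordinator must have
        # been in PRECOMMIT; count its vote as well.
        precommit += 1

    # True majority: strictly more than half of ALL processes.
    if 2 * ready > n_total:
        return "ABORT"
    if 2 * precommit > n_total:
        return "PRECOMMIT"
    return "BLOCK"
```

With five processes, three live ones in READY give ABORT; two in PRECOMMIT and one in READY give PRECOMMIT, because the original coordinator's vote brings the PRECOMMIT count to three.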

Correctness (1/4)
Let P and Q be any two processes which both acted as coordinator at some point.
THEOREM: It can never happen that P is in ABORT and Q is in COMMIT.
Proof:
1. When P went to ABORT there was a true majority in READY
2. When Q went to COMMIT there was a true majority in PRECOMMIT
3. These two configurations are mutually exclusive
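The counting fact that step 3 appears to rely on is that two true majorities over the same n processes must share at least one process. A short derivation, where A (the processes in READY), B (the processes in PRECOMMIT), and n (the number of all processes) are symbols introduced here for illustration:

```latex
|A| > \tfrac{n}{2},\quad |B| > \tfrac{n}{2}
\;\Longrightarrow\;
|A \cap B| \;=\; |A| + |B| - |A \cup B|
\;>\; \tfrac{n}{2} + \tfrac{n}{2} - n \;=\; 0 .
```

Since a process is in exactly one state at a time, no process can be in both A and B, so the two majorities cannot hold in the same configuration. This only covers a single snapshot; ruling out that the two majorities arise at different times needs an additional argument about how states evolve, which the slide does not spell out.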

Correctness (2/4)
- By construction: if there is a process in ABORT, then there is a coordinator in ABORT

Correctness (3/4)
- By construction: if there is a process in COMMIT, then there is a coordinator in COMMIT

Correctness (4/4)
Let P and Q be any two processes.
COROLLARY: It can never happen that P is in ABORT and Q is in COMMIT.

Summary
- Looked at Distributed Commit
  - 2PC: blocking, has a bad state
  - 3PC: less blocking, but not widely used in practice