Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved. 0-13-239227-5 Chapter 8 Fault.


1 Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved. 0-13-239227-5 Chapter 8 Fault Tolerance (2) DISTRIBUTED SYSTEMS (dDist) 2014

2 Distributed Commit (1/3)
Given a process group and an operation
– The operation might or might not be committable at all processes
Goal:
– If committable at all processes, commit at all processes
– Either everybody eventually commits or everybody eventually aborts
  Even servers which crash and come back to life
Consistency, validity, termination

3 Distributed Commit (2/3)
Can we not just do this with Virtual Synchrony?
– Coordinator multicasts vote request
– All processes respond to request
– Coordinator multicasts vote result
  COMMIT iff all vote COMMIT
This handles some error cases
But what if a participant B crashes after it votes COMMIT but before the COMMIT result is broadcast, and then comes back to life?
We have to bring it back to a consistent state, even if others crash as it wakes up, and so on…

4 Distributed Commit (3/3)
We want to tolerate these errors:
– Transient crash-silent errors
  We have timeouts to detect crashes
  Transient: crashed servers come back to life
– And must then make the right decision
– Messages can be dropped
  Even if we secure communication against omission errors and crash errors, say by implementing virtual synchrony, a server might be out of the view when a message is sent and then come back to life

5 Two-Phase Commit
1) Commit →
2) Vote-request →
3) Vote-commit ←
4) Global-commit →

6 Two-Phase Commit
Figure 8-18. (a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant.
[Figure: panels COORDINATOR and PARTICIPANT; legend: input event, output event]
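
The two state machines named in the caption can be written down as transition tables. What follows is a minimal sketch, assuming the transitions of the textbook's Figure 8-18, since the figure itself is not reproduced in this transcript. The names `COORDINATOR_FSM`, `participant_step` and the `can_commit` flag are illustrative, and the `ACK` output is an assumption taken from the textbook figure rather than from the slide text.

```python
# Coordinator: (state, input event) -> (output event, next state).
# "Vote-commit" here stands for "every participant voted commit";
# "Vote-abort" stands for "at least one participant voted abort".
COORDINATOR_FSM = {
    ("INIT", "Commit"): ("Vote-request", "WAIT"),
    ("WAIT", "Vote-abort"): ("Global-abort", "ABORT"),
    ("WAIT", "Vote-commit"): ("Global-commit", "COMMIT"),
}


def participant_step(state, event, can_commit=True):
    """Return (output event, next state) for one participant transition.

    can_commit models the participant's local decision and is only
    consulted when a Vote-request arrives in INIT.
    """
    if state == "INIT" and event == "Vote-request":
        if can_commit:
            return ("Vote-commit", "READY")
        return ("Vote-abort", "ABORT")
    if state == "READY" and event == "Global-commit":
        return ("ACK", "COMMIT")
    if state == "READY" and event == "Global-abort":
        return ("ACK", "ABORT")
    raise ValueError(f"no transition from {state} on {event}")
```

Keeping the machine as data makes it easy to see that READY is the only participant state whose exit depends entirely on a message from the coordinator.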

7 Two-Phase Commit
2PC detects crashes via timeouts
2PC handles crashes by logging state to permanent storage, turning crash errors into omission errors
– It is possible to execute code on an arrow and then fall back to the state before the arrow
– Then the code can get executed again, and again, and again, …
– More on this later…
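
The logging idea can be illustrated with a small append-only state log. This is a minimal sketch, assuming a plain text file stands in for "permanent storage"; `StateLog`, `record` and `last_state` are hypothetical names, not part of any protocol specification. It ignores torn writes and log compaction.

```python
import os


class StateLog:
    """Append-only log of protocol states, one state name per line."""

    def __init__(self, path):
        self.path = path

    def record(self, state):
        # Force the new state to disk before the caller moves on.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(state + "\n")
            f.flush()
            os.fsync(f.fileno())

    def last_state(self, default="INIT"):
        # After a crash, resume from the last state that reached disk.
        if not os.path.exists(self.path):
            return default
        with open(self.path, encoding="utf-8") as f:
            lines = [line.strip() for line in f if line.strip()]
        return lines[-1] if lines else default
```

If a process crashes after running the code on an arrow but before `record` completes, `last_state` returns the state before the arrow and the same code runs again on recovery, which is why that code should be safe to repeat.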

8 Coordinator Perspective
Blocks in WAIT
– Participant may have failed
– That participant might vote ABORT, in which case a GLOBAL COMMIT would be wrong and irreversible
– So, must do a GLOBAL ABORT
[Figure labels: TIMEOUT, COORDINATOR]
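
The coordinator's rule for leaving WAIT can be sketched as a single function. This is a minimal sketch under stated assumptions: `coordinator_decision`, the `Vote` enum and the `votes_received` mapping are illustrative names, and a participant whose vote has not arrived by the timeout is simply absent from the mapping.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "VOTE_COMMIT"
    ABORT = "VOTE_ABORT"


def coordinator_decision(participants, votes_received):
    """Decide once WAIT ends, either by a full set of votes or by timeout.

    votes_received maps participant id -> Vote for every vote that
    arrived in time. A missing vote is treated like an ABORT vote,
    because the silent participant might have voted ABORT.
    """
    for p in participants:
        if votes_received.get(p) is not Vote.COMMIT:
            return "GLOBAL_ABORT"
    return "GLOBAL_COMMIT"
```

Aborting on a missing vote is always safe here, because no participant can have committed before the coordinator sends GLOBAL_COMMIT.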

9 Coordinator Perspective
Figure 8-20. Outline of the steps taken by the coordinator in a two-phase commit protocol. ...
[Figure label: COORDINATOR]

11 Participant Perspective
Blocks in READY
– Coordinator may have failed
What to do?
– Some participants may already have committed…
– Perhaps another participant knows what to do…?
[Figure label: PARTICIPANT]

12 Participant Perspective
Figure 8-19. Actions taken by a participant P when residing in state READY and having contacted another participant Q.
After timeout allowing all messages in transit to arrive:
– We know that coordinator managed to start commit
– At least one participant aborted and coordinator noticed
– Q did not even receive vote-request, so no one committed yet
– What if all in READY?
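
The case analysis for a participant P blocked in READY can be sketched as a lookup over the states of the peers it manages to contact. This is a minimal sketch, assuming the four cases of the textbook's Figure 8-19 (Q in COMMIT, ABORT, INIT or READY); the function name `resolve_ready` and the `"BLOCK"` result are illustrative.

```python
def resolve_ready(peer_states):
    """Decide what P does in READY after asking other participants.

    peer_states is the list of states reported by the participants
    that P could reach.
    """
    for state in peer_states:
        if state == "COMMIT":
            # The coordinator managed to start the commit.
            return "COMMIT"
        if state == "ABORT":
            # At least one participant aborted and the coordinator noticed.
            return "ABORT"
        if state == "INIT":
            # Q never received the vote request, so no one has committed.
            return "ABORT"
    # Every reachable peer is in READY (or nobody answered): keep waiting.
    return "BLOCK"
```

The final branch is the blocking case of 2PC: when every reachable participant is in READY, none of them has enough information to decide.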

13 Two-Phase Commit
Figure 8-21. (a) The steps taken by a participant process in 2PC.
[Figure label: PARTICIPANT]

14 All READY (1/2)
Why do we block when all live participants are in the READY state?
[Figure labels: PARTICIPANT, COORDINATOR]

15 All READY (2/2)
Same view, but different decisions, so Yellow needs to wait for Blue or Green to come up again and inspect their log files!
[Figure labels: PARTICIPANT, COORDINATOR]

16 Two-Phase Commit
Two-Phase Commit has the problem that if the coordinator and one participant crash at a bad time, the entire system freezes until one of them is up again
Getting a server up and running again typically involves human (a.k.a. very slow) intervention

17 Three-Phase Commit
Three-Phase Commit enhances Two-Phase Commit in that it is non-blocking in many more cases
As long as the live participants can make a majority decision, they can continue on their own
– Majority among all, not only the live ones
If there are many participants, this makes it very unlikely that 3PC blocks

18 Three-Phase Commit
Figure 8-22. (a) The finite state machine for the coordinator in 3PC. (b) The finite state machine for a participant.
[Figure labels: TIMEOUT, PARTICIPANT, COORDINATOR]

20 On timeout (maybe after coming alive again):
IF anyone else in ABORT → ABORT
ELIF anyone else in COMMIT → COMMIT
ELIF anyone else in INIT → ABORT
ELSE everyone else in READY or PRECOMMIT:
  If a majority of participants is in READY goto ABORT
  If a majority is in PRECOMMIT goto PRECOMMIT
  If no majority, then block until more come back to life
[Figure labels: PARTICIPANT, COORDINATOR]

21 If anyone is in PRECOMMIT, then the original coordinator's vote is set to be PRECOMMIT, as the original coordinator must be in PRECOMMIT
[Figure labels: PARTICIPANT, COORDINATOR]

23 More Non-Blocking
It follows from the decision rules that the live agents can always make decisions on their own, unless no true majority for READY or PRECOMMIT can be found among the live participants
True majority: majority among all processes, both dead and live

24 Termination
PROPERTY: If all servers eventually are alive at the same time, then all servers eventually end up in ABORT or COMMIT
Proof sketch:
1. If any live server is in ABORT, all the remaining unresolved go to ABORT eventually
2. If any live server is in COMMIT, all the remaining unresolved go to COMMIT eventually
3. Otherwise, there will be a true majority in READY or PRECOMMIT, and when all are alive it can be seen which, and then someone goes to ABORT or COMMIT

25 Correctness: An Easy Case
If the coordinator goes to ABORT without using the red box, i.e., via a normal protocol flow or via the timeout, then no participant will ever reach PRECOMMIT and therefore no participant can ever reach COMMIT
(So since they all eventually reach COMMIT or ABORT, they will all eventually reach ABORT)

26 If anyone is in PRECOMMIT, then the original coordinator's vote is set to be PRECOMMIT, as the original coordinator must be in PRECOMMIT
On timeout (maybe after coming alive again):
IF anyone else in ABORT → ABORT
ELIF anyone else in COMMIT → COMMIT
ELIF anyone else in INIT → ABORT
ELSE everyone else in READY or PRECOMMIT:
  If a majority of participants is in READY goto ABORT
  If a majority is in PRECOMMIT goto COMMIT
  If no majority, then block until more come back to life
[Figure labels: PARTICIPANT, COORDINATOR]
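
The timeout rule on this slide can be sketched as a pure function over the states of the live processes. This is a minimal sketch under stated assumptions: `decide_on_timeout`, `live_states` and `total` are illustrative names; `live_states` includes the caller's own state (READY or PRECOMMIT); majorities are counted against `total`, the number of all processes, dead or alive; and the rule about the original coordinator's vote is left out for brevity.

```python
def decide_on_timeout(live_states, total):
    """Return "ABORT", "COMMIT" or "BLOCK" for a process that timed out.

    live_states: states of all currently reachable processes, caller included.
    total: number of all processes, both dead and live.
    """
    # LIGHT decisions: follow anyone who has already decided.
    if "ABORT" in live_states:
        return "ABORT"
    if "COMMIT" in live_states:
        return "COMMIT"
    if "INIT" in live_states:
        return "ABORT"
    # HEAVY decisions: need a true majority among all processes.
    ready = live_states.count("READY")
    precommit = live_states.count("PRECOMMIT")
    if 2 * ready > total:
        return "ABORT"
    if 2 * precommit > total:
        return "COMMIT"
    return "BLOCK"
```

With five processes, three live ones in READY are enough to abort, while two in READY and one in PRECOMMIT must block until more processes come back.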

27 Correctness: Another Easy Case
If the coordinator goes to COMMIT without using the red box, i.e., via a normal protocol flow, then all participants are in PRECOMMIT or COMMIT and therefore no participant will ever reach ABORT
(So since they all eventually reach COMMIT or ABORT, they will all eventually reach COMMIT)

29 Correctness (1/3)
Now for the case where the coordinator crashes in WAIT or PRECOMMIT
Two ways a server can make a decision:
– HEAVY decision: taken after seeing a true majority for READY or PRECOMMIT
– LIGHT decision:
  Going to ABORT because someone else is in ABORT
  Going to COMMIT because someone else is in COMMIT

30 Correctness (2/3)
Let P and Q be any two processes which made HEAVY decisions
PROPERTY: It can never happen that P is in ABORT and Q is in COMMIT
Proof sketch:
1. When P went to ABORT there was a true majority in READY
2. When Q went to COMMIT there was a true majority in PRECOMMIT
3. These two configurations are mutually exclusive, as COORDINATOR is down so no participant moves between READY and PRECOMMIT anymore (when coming alive, wait long enough for all messages to arrive)
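
Step 3 of the proof sketch rests on a counting argument, spelled out here as a worked inequality. The symbols are introduced only for this illustration: $n$ is the number of all processes (dead and live), $r$ the number in READY and $p$ the number in PRECOMMIT in the frozen configuration. Each process is in at most one state, so $r + p \le n$, while two conflicting HEAVY decisions would need both true majorities at once:

```latex
\[
  r > \tfrac{n}{2} \quad\text{and}\quad p > \tfrac{n}{2}
  \;\Longrightarrow\;
  r + p > n ,
  \qquad\text{contradicting}\qquad r + p \le n .
\]
```

For example, with $n = 5$ each true majority needs at least 3 processes, and $3 + 3 = 6 > 5$.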

31 Correctness (3/3)
Let P and Q be any two processes
COROLLARY: It can never happen that P is in ABORT and Q is in COMMIT

32 Summary
Looked at Distributed Commit
Distributed commit
– 2PC: blocking, has a bad state
– 3PC: less blocking, but not widely used in practice

