Outline Announcements Fault Tolerance.


1 Outline Announcements Fault Tolerance

2 COP 5611 - Operating Systems
Announcements
Class evaluation at the beginning of next class
  Please come on time so that we still have enough time to cover the material we need to cover
Discussions
  Homework #4
  Quiz #2
Decisions
  Final exam: open book or closed book?
  Lab 2: extension?
  Quiz #3: a week from today
November 27, 2018 COP 5611 - Operating Systems

3 COP 5611 - Operating Systems
Motivations
A system is fault-tolerant if
  It can mask failures: it continues to perform its specified function in the event of a failure, mainly through redundancy
  Or it exhibits a well-defined failure behavior in the event of a failure
    Example: distributed commit, where either all sites commit a particular operation or none of them does

4 Fault Tolerance Through Redundancy
The key approach to fault tolerance is redundancy
Three kinds of redundancy
  Information redundancy
  Time redundancy
  Physical redundancy
A system can have
  Multiple processes
  Multiple hardware components
  Multiple copies of data

5 Failure Resilient Processes
A process is resilient if it masks failures and guarantees progress despite a certain number of system failures
Backup processes
  In this approach, each resilient process is implemented by a primary process and one or more backup processes
  The state of the primary process is saved at intervals
  If the primary terminates, one of the backup processes becomes active and takes over
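The primary/backup scheme above can be sketched as follows. This is a minimal illustration, not the course's implementation: the class `ResilientCounter`, its `checkpoint_every` parameter, and the in-memory "checkpoint" standing in for saved state are all assumptions made for the example.

```python
import copy


class ResilientCounter:
    """Toy resilient process: a primary whose state is checkpointed for a backup."""

    def __init__(self, checkpoint_every=3):
        self.state = {"count": 0}                    # live state of the primary
        self.checkpoint = copy.deepcopy(self.state)  # last saved state (hypothetical store)
        self.checkpoint_every = checkpoint_every
        self.ops_since_checkpoint = 0

    def increment(self):
        self.state["count"] += 1
        self.ops_since_checkpoint += 1
        # Save the primary's state at intervals, here every few operations.
        if self.ops_since_checkpoint == self.checkpoint_every:
            self.checkpoint = copy.deepcopy(self.state)
            self.ops_since_checkpoint = 0

    def fail_over(self):
        """The primary terminated: a backup resumes from the last checkpoint."""
        self.state = copy.deepcopy(self.checkpoint)
        self.ops_since_checkpoint = 0


p = ResilientCounter(checkpoint_every=3)
for _ in range(5):
    p.increment()
p.fail_over()
print(p.state["count"])  # 3: the two operations after the last checkpoint are lost
```

The sketch makes the trade-off visible: a longer checkpoint interval costs less during normal operation but loses more work when the backup takes over.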

6 Failure Resilient Processes – cont.
Replicated execution
  Several processes execute the same program concurrently
  It can increase reliability and availability
  It requires that all requests be processed at all processes in the same order
  Nonidempotent operations need special handling, because executing them more than once changes the result
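A small sketch of why ordering and nonidempotent operations matter in replicated execution. The `Replica` class, the request identifiers, and the bank-balance example are hypothetical; deduplication by request id is one common way to handle retransmitted nonidempotent requests, not necessarily the one intended in the lecture.

```python
class Replica:
    """One copy of a replicated process that applies deposits to a balance."""

    def __init__(self):
        self.balance = 0
        self.applied = set()  # ids of requests that were already executed

    def apply(self, request_id, amount):
        # A deposit is nonidempotent: running it twice changes the result,
        # so a duplicate delivery of the same request is skipped.
        if request_id in self.applied:
            return
        self.applied.add(request_id)
        self.balance += amount


# Every replica sees the same requests in the same order.
# Request 2 is delivered twice to model a retransmission.
ordered_requests = [(1, 100), (2, -30), (2, -30), (3, 5)]

replicas = [Replica() for _ in range(3)]
for replica in replicas:
    for request_id, amount in ordered_requests:
        replica.apply(request_id, amount)

balances = [replica.balance for replica in replicas]
print(balances)  # [75, 75, 75]
```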

7 COP 5611 - Operating Systems
Distributed Commit
The distributed commit problem is to ensure that an operation is performed by every member of a process group or by none at all
  This is referred to as global atomicity
Commit protocols
  Given that each site has a recovery strategy at the local level, commit protocols ensure that all the sites either commit or abort the transaction unanimously, even in the presence of multiple and repeated failures

8 One-phase Commit Protocol
One site is designated as the coordinator
The coordinator tells all the other processes whether or not to locally perform the operation in question
This scheme, however, is not fault tolerant

9 Two-Phase Commit Protocol
In this protocol, one of the processes acts as the coordinator
Other processes are referred to as cohorts
  Cohorts are assumed to be executing at different sites
Stable storage is available at each site
The write-ahead log protocol is used
The protocol has two phases
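The slides that spell out the two phases did not survive in this transcript, so the following is only a sketch of the failure-free path under stated assumptions: the message names `AGREED`, `COMMIT`, `ABORT`, and `ACK`, the class `TwoPCCohort`, and the Python lists standing in for stable storage are illustrative. The record names `UNDO`, `REDO`, `COMMIT`, and `COMPLETE` follow the log records mentioned elsewhere in the lecture.

```python
class TwoPCCohort:
    """A cohort with a list standing in for its stable-storage log."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []
        self.state = "working"

    def on_commit_request(self):
        # Phase I: write-ahead logging, so the log is written before replying.
        if self.can_commit:
            self.log += ["UNDO", "REDO"]
            self.state = "agreed"
            return "AGREED"
        self.state = "aborted"
        return "ABORT"

    def on_decision(self, decision):
        # Phase II: apply the coordinator's decision and acknowledge it.
        self.state = "committed" if decision == "COMMIT" else "aborted"
        return "ACK"


def two_phase_commit(cohorts, coordinator_log):
    # Phase I: ask every cohort whether it can commit.
    votes = [c.on_commit_request() for c in cohorts]
    decision = "COMMIT" if all(v == "AGREED" for v in votes) else "ABORT"
    if decision == "COMMIT":
        coordinator_log.append("COMMIT")  # logged before the decision is sent
    # Phase II: send the decision and wait for acknowledgments.
    acks = [c.on_decision(decision) for c in cohorts]
    if decision == "COMMIT" and len(acks) == len(cohorts):
        coordinator_log.append("COMPLETE")
    # Logging on the abort path is not described in the text and is left out.
    return decision


log = []
cohorts = [TwoPCCohort("a"), TwoPCCohort("b")]
print(two_phase_commit(cohorts, log), log)  # COMMIT ['COMMIT', 'COMPLETE']
```

Timeouts and crashes are deliberately not modeled here; a single cohort constructed with `can_commit=False` is enough to make the whole group abort.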

10 Two-Phase Commit Protocol – cont.

11 Two-Phase Commit Protocol – cont.

12 Two-Phase Commit Protocol – cont.
Coordinator

13 Two-Phase Commit Protocol – cont.
Site failure handling
Suppose the coordinator crashes before having written the COMMIT record
  On recovery, the coordinator broadcasts an ABORT message to all the cohorts
Suppose the coordinator crashes after writing the COMMIT record but before writing the COMPLETE record
  On recovery, the coordinator broadcasts a COMMIT message
Suppose the coordinator crashes after writing the COMPLETE record
  On recovery, there is nothing to be done for the transaction
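The three coordinator recovery cases above reduce to a lookup on the coordinator's log. A minimal sketch, assuming the log is available as a list of record names; the function name `coordinator_recovery_action` is hypothetical.

```python
def coordinator_recovery_action(log):
    """Return the message to broadcast on recovery, or None if nothing is needed."""
    if "COMPLETE" in log:
        return None       # the transaction already finished
    if "COMMIT" in log:
        return "COMMIT"   # the decision was made but may not have reached every cohort
    return "ABORT"        # no decision was logged, so it is safe to abort


print(coordinator_recovery_action([]))                      # ABORT
print(coordinator_recovery_action(["COMMIT"]))              # COMMIT
print(coordinator_recovery_action(["COMMIT", "COMPLETE"]))  # None
```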

14 Two-Phase Commit Protocol – cont.
Site failure handling, continued
If a cohort crashes in Phase I, the coordinator aborts the transaction because it does not receive a reply from the crashed cohort
If a cohort crashes in Phase II (after writing its UNDO and REDO logs)
  On recovery, the cohort checks with the coordinator whether to abort or to commit the transaction

15 Two-Phase Commit Protocol – cont.
Limitation
It is a blocking protocol
  Whenever the coordinator fails, cohort sites have to wait for its recovery
  This is undesirable because these sites may be holding locks on resources
It cannot be used if transactions must be resilient to site failures
This motivates non-blocking commit protocols

16 Non-blocking Commit Protocols
To be non-blocking in the event of site failures
  Operational sites should agree on the outcome of the transaction by examining their local states
  Failed sites, upon recovery, must reach the same conclusion regarding the outcome of the transaction as the operational sites do
Independent recovery refers to the situation in which recovering sites can decide the final outcome of the transaction based solely on their local state

17 Three-Phase Commit Protocol – cont.

18 Three-Phase Commit Protocol for Single Site Failure

19 Three-Phase Commit Protocol – cont.
Phase I
  Identical to Phase I of the two-phase commit protocol, except in the event of a site failure
  If a cohort fails, the coordinator times out waiting for the Agreed message, aborts the transaction, and sends Abort messages to all the cohorts
Phase II
  The coordinator sends a Prepare message to all the cohorts if all the cohorts have sent Agreed messages in Phase I
  Otherwise, it sends an Abort message

20 Three-Phase Commit Protocol – cont.
Phase III
  On receiving acknowledgments to the Prepare messages from all the cohorts, the coordinator sends a Commit message to all the cohorts
  On receiving a Commit message, a cohort commits the transaction
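The three phases described above can be sketched from the coordinator's point of view. This is a simplified illustration under stated assumptions: the class `Cohort3PC`, its state names, and the use of `None` to model a timed-out reply are hypothetical, and failures during Phases II and III are not modeled.

```python
class Cohort3PC:
    """A cohort that either agrees, refuses, or has failed (no reply)."""

    def __init__(self, agrees=True, alive=True):
        self.agrees = agrees
        self.alive = alive
        self.state = "initial"

    def on_commit_request(self):
        if not self.alive:
            return None          # models the coordinator timing out
        if self.agrees:
            self.state = "waiting"
            return "Agreed"
        self.state = "aborted"
        return "Abort"

    def on_prepare(self):
        self.state = "prepared"
        return "Ack"

    def on_commit(self):
        self.state = "committed"

    def on_abort(self):
        if self.alive:
            self.state = "aborted"


def three_phase_commit(cohorts):
    # Phase I: as in two-phase commit, collect Agreed messages.
    replies = [c.on_commit_request() for c in cohorts]
    if any(reply != "Agreed" for reply in replies):
        for c in cohorts:
            c.on_abort()
        return "Abort"
    # Phase II: every cohort agreed, so send Prepare and collect acknowledgments.
    acks = [c.on_prepare() for c in cohorts]
    # Phase III: all acknowledgments received, so send Commit.
    if all(ack == "Ack" for ack in acks):
        for c in cohorts:
            c.on_commit()
    return "Commit"


print(three_phase_commit([Cohort3PC(), Cohort3PC()]))             # Commit
print(three_phase_commit([Cohort3PC(), Cohort3PC(alive=False)]))  # Abort
```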

21 Three-Phase Commit Protocol – cont.
Theoretical results
  Rules 1 and 2 are sufficient for designing commit protocols resilient to a single site failure during a transaction
  There exists no protocol using independent recovery that is resilient to arbitrary failures by two sites
  There exists no protocol resilient to network partitioning when messages are lost
  There exists no protocol resilient to multiple network partitioning

22 COP 5611 - Operating Systems
Voting Protocols
Distributed commit protocols are resilient to single site failures
  But they are not resilient to multiple site failures, communication failures, and network partitioning
Voting protocols are more fault tolerant
  They allow data accesses under network failures, multiple site failures, and message losses without compromising the integrity of the data
  The basic idea is that each replica is assigned some number of votes and a majority of votes must be collected before a process can access a replica

23 COP 5611 - Operating Systems
Static Voting
System model
  The replicas of files are stored at different sites
  Every file access operation requires that an appropriate lock be obtained
    The lock rule allows either “one writer and no readers” or “multiple readers and no writers”
  Every file is associated with a version number
    It indicates the number of times the file has been updated
    Version numbers are stored on stable storage
    Every write operation updates the version number
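The lock rule and the version number in this system model can be illustrated with a small single-site sketch. The class `ReplicaFile` and its method names are hypothetical, and a real implementation would need synchronization and stable storage rather than plain attributes.

```python
class ReplicaFile:
    """One replica with a readers/writer lock rule and a version number."""

    def __init__(self):
        self.readers = 0
        self.writer = False
        self.version = 0   # number of times the file has been updated
        self.data = ""

    def acquire_read(self):
        # Multiple readers are allowed as long as there is no writer.
        if self.writer:
            return False
        self.readers += 1
        return True

    def release_read(self):
        self.readers -= 1

    def acquire_write(self):
        # One writer is allowed only when there are no readers and no other writer.
        if self.writer or self.readers > 0:
            return False
        self.writer = True
        return True

    def write(self, data):
        if not self.writer:
            raise RuntimeError("write lock not held")
        self.data = data
        self.version += 1  # every write updates the version number

    def release_write(self):
        self.writer = False


f = ReplicaFile()
print(f.acquire_read(), f.acquire_read())  # True True
print(f.acquire_write())                   # False: readers are present
f.release_read()
f.release_read()
print(f.acquire_write())                   # True
f.write("v1")
f.release_write()
print(f.version)                           # 1
```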

24 COP 5611 - Operating Systems
Static Voting – cont.
Basic idea
  Every replica is assigned a certain number of votes
    This information is stored on stable storage
  A read or write operation is permitted if a certain number of votes, called the read quorum or the write quorum, is collected by the requesting process
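A sketch of quorum collection for the basic idea above. The slides with the exact algorithm did not survive in this transcript, so the details here are assumptions: the quorum conditions `r + w > v` and `2 * w > v` are the usual ones for weighted voting, and the replica names, vote counts, and helper functions are hypothetical.

```python
def valid_quorums(total_votes, r, w):
    """Usual weighted-voting conditions: reads overlap writes, and writes overlap writes."""
    return r + w > total_votes and 2 * w > total_votes


def collect(replicas, reachable, quorum):
    """Return the reachable replicas if their votes reach the quorum, else None."""
    group = [rep for name, rep in replicas.items() if name in reachable]
    votes = sum(rep["votes"] for rep in group)
    return group if votes >= quorum else None


def read(replicas, reachable, r):
    group = collect(replicas, reachable, r)
    if group is None:
        return None
    # The replica with the highest version number holds the current data.
    return max(group, key=lambda rep: rep["version"])["data"]


def write(replicas, reachable, w, data):
    group = collect(replicas, reachable, w)
    if group is None:
        return False
    new_version = max(rep["version"] for rep in group) + 1
    for rep in group:
        rep["version"] = new_version
        rep["data"] = data
    return True


replicas = {
    "A": {"votes": 2, "version": 0, "data": ""},
    "B": {"votes": 1, "version": 0, "data": ""},
    "C": {"votes": 1, "version": 0, "data": ""},
}
R, W = 2, 3                                # total votes v = 4
print(valid_quorums(4, R, W))              # True
print(write(replicas, {"A", "B"}, W, "x")) # True: 3 votes collected
print(read(replicas, {"C"}, R))            # None: only 1 vote reachable
print(read(replicas, {"B", "C"}, R))       # x: B carries the newest version
```

Because any read quorum overlaps any write quorum, the last read succeeds even though replica C itself is stale.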

25 COP 5611 - Operating Systems
Static Voting – cont.

26 COP 5611 - Operating Systems
Static Voting – cont.

27 COP 5611 - Operating Systems
Static Voting – cont.

28 COP 5611 - Operating Systems
Vote Assignment

29 Vote Assignment Examples

30 Reliable Communication
In a system using replicated data, it is important that data managers behave identically
  The data managers are required to have an identical view of the events
Atomic broadcast

31 COP 5611 - Operating Systems
Summary
Fault tolerance means masking failures or behaving in a well-defined way when failures occur
The key approach to failure masking is redundancy
  Failure resilient processes
Distributed commit protocols guarantee global atomicity
  Either all sites commit an operation or none of them does

