Chapter 8 Fault Tolerance. Outline Introductions –Concepts –Failure models –Redundancy Process resilience –Groups and failure masking –Distributed agreement.

1 Chapter 8 Fault Tolerance

2 Outline Introductions –Concepts –Failure models –Redundancy Process resilience –Groups and failure masking –Distributed agreement

3 Nondistributed vs. Distributed How do they fail? –Can a nondistributed system have partial failures? The key goal of fault tolerance is to allow a system to continue to function after a partial failure (while it is being repaired). –Automatically recover –Perform at an acceptable level –Can you make a nondistributed system fault-tolerant?

4 Basic Concepts What is a “failure”? –The system cannot meet its promises. An error is a part of the system state that might lead to a failure. A fault is the cause of an error, such as a short circuit. Faults can be: –Transient –Intermittent –Permanent A system is called k-fault tolerant if it can provide its services in the presence of k faults.

5 Basic Concepts Dependable systems: Availability –Is it working? Can it be used immediately? Reliability –Does it run without failure throughout a time period? –Is it the same as availability? How available is something that is down for 1 ms every hour? How reliable is it? Safety –The consequence of failure. –E.g., what happens when there are no signals? Maintainability –How easy is it to repair? This can affect availability.
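The "down for 1 ms every hour" question can be worked through numerically. The sketch below assumes exactly one 1 ms outage per hour; the function names are illustrative, not from the slides. It suggests why such a system is highly available yet not very reliable over intervals longer than an hour.

```python
# Assumption: exactly one outage of 1 ms in every 3600 s period.
SECONDS_PER_HOUR = 3600.0
DOWNTIME_S = 0.001


def availability(downtime_s: float, period_s: float) -> float:
    """Fraction of time the system is usable."""
    return 1.0 - downtime_s / period_s


def longest_failure_free_run(downtime_s: float, period_s: float) -> float:
    """Longest interval without any failure, given one outage per period."""
    return period_s - downtime_s


a = availability(DOWNTIME_S, SECONDS_PER_HOUR)
run = longest_failure_free_run(DOWNTIME_S, SECONDS_PER_HOUR)

print(f"availability: {a:.7%}")
print(f"longest failure-free run: {run:.3f} s")
# Any task that needs more than about one hour of uninterrupted service
# will see a failure, even though availability exceeds 99.9999%.
```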

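6 Failure Models Crash failures: a server halts, but was working correctly until then. A crash that other processes can reliably detect is called fail-stop. Arbitrary (Byzantine) failures cover everything else, including malicious or subverted servers. Which kinds of failures are easiest to handle? Which are hardest?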

7 Redundancy You can mask failures by being redundant. –Redundant information. How? –Time redundancy. How? –Physical redundancy. How? Is nature redundant?
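As one possible answer to the "How?" questions, the sketch below illustrates information redundancy with an even-parity bit and time redundancy with a simple retry loop. Both helpers and the flaky operation are hypothetical examples chosen for illustration; physical redundancy (extra hardware or processes) is not shown here.

```python
def add_parity(bits):
    """Information redundancy: append one even-parity bit to the data bits."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """A word with even parity has an even number of 1 bits."""
    return sum(word) % 2 == 0


def retry(action, attempts=3):
    """Time redundancy: repeat an action until it succeeds or attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except RuntimeError as exc:  # treat RuntimeError as a transient fault
            last_error = exc
    raise last_error


# A single flipped bit is detected (though not corrected) by the parity check.
word = add_parity([1, 0, 1, 1])
corrupted = word.copy()
corrupted[0] ^= 1

# A hypothetical operation that fails twice with a transient fault, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient fault")
    return "done"


result = retry(flaky, attempts=3)
```

Retrying only helps against transient or intermittent faults; a permanent fault will fail on every attempt.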

8 Triple Modular Redundancy Signals pass through three devices, and a voter takes the majority of their outputs. How many fail-stop faults can this tolerate? How many response failures (wrong values)?
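A minimal sketch of the majority voter at the heart of triple modular redundancy follows. The function name and the use of None for "no majority" are assumptions made for illustration.

```python
from collections import Counter


def vote(a, b, c):
    """Return the value produced by at least two of three modules, else None."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None


# One module returns a wrong value: the other two outvote it.
masked = vote(42, 42, 17)

# Two modules return different wrong values: no majority exists.
unmasked = vote(42, 17, 99)
```

With three modules the voter masks one wrong value per stage; two simultaneous wrong values can no longer be outvoted.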

9 Outline Introductions –Concepts –Failure models –Redundancy Process resilience –Groups and failure masking –Distributed agreement

10 Process resilience Basic issue: Protect yourself against faulty processes by replicating and distributing computations in a group. –A group is a single abstraction. –When a message is sent to the group, all members receive it. –If one member fails, others can take over. Group dynamics –Groups can be created and destroyed. –A process can join and leave a group. –A process can belong to several groups at the same time.

11 Process resilience Flat groups: Good for fault tolerance, as information is exchanged directly among all group members; however, they may impose more overhead because control is completely distributed (hard to implement). Hierarchical groups: All communication goes through a single coordinator ⇒ neither very fault tolerant nor scalable, but relatively easy to implement.

12 Process resilience: Flat Groups versus Hierarchical Groups

13 Process Groups Group membership can be managed –by a group server or –in a distributed way. Joining and leaving have to be synchronous with message delivery. –As soon as a process joins, it should receive all messages. –As soon as it leaves, it must stop receiving them. How do we know when a member has left?

14 Groups and Failure Masking Replicate processes: replace a single (vulnerable) process with a (fault-tolerant) group. –Primary-based replication -> hierarchical group –Replicated-write (active or quorum-based) replication -> flat group How much replication is needed? Terminology: when a group can mask any k concurrent member failures, it is said to be k-fault tolerant (k is called the degree of fault tolerance).

15 Groups and Failure Masking How large does a k-fault tolerant group need to be? –Assume crash/performance failure semantics ⇒ a total of k + 1 members are needed to survive k member failures. –Assume arbitrary failure semantics, and group output defined by voting ⇒ a total of 2k+1 members are needed to survive k member failures. Assumption: all members are identical, and process all input in the same order ⇒ only then are we sure that they do exactly the same thing. –Atomic multicast
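The two sizing rules on this slide can be written as a small helper. This is only a sketch: the function name and the model labels "crash" and "arbitrary" are illustrative, and the 2k + 1 case assumes the group output is decided by voting, as stated above.

```python
def min_group_size(k: int, failure_model: str) -> int:
    """Smallest group that masks k concurrent member failures."""
    if k < 0:
        raise ValueError("k must be non-negative")
    if failure_model == "crash":
        # One surviving member is enough to keep providing the service.
        return k + 1
    if failure_model == "arbitrary":
        # The k + 1 correct members must outvote the k faulty ones.
        return 2 * k + 1
    raise ValueError(f"unknown failure model: {failure_model!r}")


for k in range(4):
    print(k, min_group_size(k, "crash"), min_group_size(k, "arbitrary"))
```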

16 Distributed Agreement Problem: Non-faulty group members should reach agreement on the same value. System assumptions: 1. Synchronous vs. asynchronous: do processes run in lock step or not? 2. Is communication delay bounded? 3. Is message delivery ordered? 4. Unicast or multicast? –The problem is quite hard, and agreement is not always possible.

17 When is Agreement Possible

18 Byzantine Agreement Problem Goal: reach agreement in the presence of Byzantine failures. Two Generals' Problem: Two generals want to attack a city from different sides. They will only succeed if both attack at the same time. They can communicate only by messengers sent on horseback. –Faults: traitorous lieutenants, etc. How do they reach agreement? Observation: Assuming arbitrary failure semantics, we need 3k + 1 group members to survive the attacks of k faulty members. Why? –We are trying to reach a majority vote among the group of loyalists in the presence of k traitors ⇒ we need 2k + 1 loyalists, which with the k traitors gives 3k + 1 members in total. –k faulty ones can mislead

19 Agreement in Faulty Systems 1. Each process sends its value to the others. 2. The results are collected into vectors. 3. The vectors are redistributed. 4. If any position has a majority, that is the value. Faulty processes generate arbitrary values. Outcome: each process has a vector V[N] such that V[i] = vi if process i is non-faulty, and V[i] is undefined if process i is faulty.

20 Agreement in Faulty Systems 1. Each process sends its value to the others. 2. The results are collected into vectors. 3. The vectors are redistributed. 4. If any position has a majority, that is the value. Faulty processes generate arbitrary values. What is the outcome?

21 Agreement in Faulty Systems Two correct, one faulty. Remarks: the assumption is that nodes are either Byzantine or cooperative. Agreement is not possible when message delay is unbounded, even with a single failure. What is the outcome?
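The four steps listed on the preceding slides can be simulated in a few lines. This is a simplified sketch, not a full protocol: it assumes synchronous rounds and reliable delivery, and it models a faulty process as one that sends a different made-up value in every message. The function name and the "junk" values are illustrative.

```python
from collections import Counter
from itertools import count


def byzantine_agreement(values, faulty):
    """values: {pid: initial value}; faulty: set of pids that send arbitrary data.

    Returns {pid: result vector} for every non-faulty process, where a
    position without a majority is reported as None (undefined).
    """
    pids = sorted(values)
    junk = count()

    def arbitrary():
        # Every message from a faulty process carries a different bogus value.
        return f"junk{next(junk)}"

    # Steps 1-2: each process sends its value; each receiver builds a vector.
    vectors = {
        receiver: {
            sender: arbitrary()
            if sender in faulty and sender != receiver
            else values[sender]
            for sender in pids
        }
        for receiver in pids
    }

    decisions = {}
    for receiver in pids:
        if receiver in faulty:
            continue
        # Step 3: the receiver gets a vector from every other process.
        received = [
            {p: arbitrary() for p in pids} if sender in faulty else vectors[sender]
            for sender in pids
            if sender != receiver
        ]
        # Step 4: take the majority value per position, if there is one.
        result = {}
        for p in pids:
            value, n = Counter(vec[p] for vec in received).most_common(1)[0]
            result[p] = value if n > len(received) / 2 else None
        decisions[receiver] = result
    return decisions


four = byzantine_agreement({1: 1, 2: 2, 3: 3, 4: 4}, faulty={3})
three = byzantine_agreement({1: 1, 2: 2, 3: 3}, faulty={3})
```

With four processes and one faulty (3k + 1 for k = 1), every correct process ends up with the same vector, with only the faulty position undefined. With two correct processes and one faulty, no position reaches a majority, so no agreement is reached.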

22 Failure Detection Active pinging or passively waiting for messages, combined with a timeout. –Setting timeouts properly is very difficult and application dependent. How can process failures be distinguished from network failures? Consider failure notification throughout the system: –Gossiping (i.e., proactively disseminating a failure detection)
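A minimal sketch of the passive, timeout-based approach follows. The class name, its methods, and the injectable clock are assumptions made for illustration; a real detector would also need to receive heartbeats over the network.

```python
import time


class HeartbeatDetector:
    """Suspects a process once no heartbeat has arrived within timeout_s."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the detector can be tested without sleeping
        self.last_seen = {}

    def heartbeat(self, pid):
        """Record that a heartbeat from pid has just arrived."""
        self.last_seen[pid] = self.clock()

    def suspected(self):
        """Return the processes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return {
            pid for pid, seen in self.last_seen.items() if now - seen > self.timeout_s
        }


# Usage with a fake clock: "b" stops sending heartbeats after t = 0.
now = [0.0]
detector = HeartbeatDetector(timeout_s=2.0, clock=lambda: now[0])
detector.heartbeat("a")
detector.heartbeat("b")
now[0] = 1.5
detector.heartbeat("a")
now[0] = 3.0
late = detector.suspected()
```

The detector can only suspect a process: a missing heartbeat looks the same whether the process crashed or the network dropped or delayed the message, which is why choosing the timeout is hard.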

