Dependability
Dependability is the ability to avoid service failures that are more frequent or more severe than desired. It is an important goal of distributed systems.
Requirements for dependable systems:
- Availability (readiness for use): the probability that the system is available to perform its functions at any given moment
  - 99.999% availability (five 9s) means about 5 minutes of downtime per year
- Reliability (continuity of service delivery): the ability of the system to run continuously without failure
  - Down for 1 ms every hour: over 99.9999% availability, but highly unreliable
  - Down for two weeks every year: high reliability, but only 96% availability
- Safety (very low probability of catastrophes): when a system temporarily fails to operate correctly, nothing catastrophic happens
- Maintainability: how easily a failed system can be repaired
- Security: covered in Chapter 9
Example systems with such requirements: control systems for airplanes and nuclear power plants.
Dependability is the trustworthiness of a computing system, which allows reliance to be justifiably placed on the service it delivers. A dependable system is worthy of confidence: users can be confident about relying on its service. The attributes above are a way to assess the dependability of a system.
Failures are caused by faults and need to be prevented. Fault tolerance means that a system can provide its services even in the presence of faults.

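The availability figures on this slide can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names are illustrative, and a year is assumed to have 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: leap years ignored


def availability(downtime: float, period: float) -> float:
    """Fraction of time the system is up; both arguments use the same unit."""
    return 1.0 - downtime / period


def yearly_downtime_minutes(avail: float) -> float:
    """Expected downtime per year, in minutes, for a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


# Five 9s: roughly 5 minutes of downtime per year.
five_nines_downtime = yearly_downtime_minutes(0.99999)

# Down for 1 ms every hour (3600 s): very high availability,
# yet the system fails about 8760 times per year, so it is unreliable.
ms_per_hour = availability(0.001, 3600.0)

# Down for two weeks out of 52: one long outage, about 96% availability.
two_weeks = availability(2, 52)

print(f"{five_nines_downtime:.2f} min/year")
print(f"{ms_per_hour:.7%}")
print(f"{two_weeks:.1%}")
```

The last two cases show why availability and reliability are separate attributes: the first system is almost always available but fails constantly, while the second rarely fails but is unavailable for a long stretch.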
Failures and Faults
Building a dependable system comes down to preventing failures.
- A failure of a system occurs when the system cannot meet its promises
- Failures are caused by faults
A fault is an anomalous condition. There are three categories of faults:
- Transient faults: occur once and then disappear (e.g., wireless communication being interrupted by external interference)
- Intermittent faults: reoccur irregularly (e.g., a loose contact on a connector)
- Permanent faults: persist until the faulty component is replaced (e.g., software bugs)

Types of Failures
- Fail-stop: the server stops in a way that lets clients tell that it has halted
- Fail-silent: clients do not know that the server has halted
- State transition failure: execution of a component brings it into a wrong state
- Arbitrary failures: also known as Byzantine failures

Fault Tolerance
In a single-machine system, a failure is almost always total.
- All components are affected and the entire system may be brought down (e.g., OS crash, disk failures)
Partial failures are possible in distributed systems.
- When one component fails, it may affect some components while leaving other components unaffected
Fault tolerance means that a system can provide its services even in the presence of faults.
Fault tolerance requires:
- preventing faults and failures from affecting other components of the system
- automatically recovering from partial failures
A distributed system consists of multiple independent nodes, so Prob(failure) = Prob(at least one component fails).

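To make the last point concrete, here is a minimal sketch of how the probability that at least one component fails grows with the number of nodes. It assumes, for illustration only, that every node fails independently with the same probability p; the function name is hypothetical.

```python
def prob_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n components fails.

    Assumption: failures are independent and each component
    fails with the same probability p.
    """
    return 1.0 - (1.0 - p) ** n


for n in (1, 10, 100):
    print(n, round(prob_any_failure(0.01, n), 3))
```

Under these assumptions, even a small per-node failure probability makes some failure likely in a large system, which is why partial failures have to be tolerated rather than treated as exceptional.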
Failure Masking
Failure masking is a fault tolerance technique that hides the occurrence of failures from other processes.
The most common approach to failure masking is redundancy. There are three types of redundancy:
- Information redundancy: add extra bits to allow recovery from garbled bits
- Time redundancy: repeat an action if needed
- Physical redundancy: add extra equipment or processes so that the system can tolerate the loss or malfunctioning of some components

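Time redundancy is the easiest of the three to sketch in code: repeat the action until it succeeds or a retry budget is exhausted. The helper names below (`with_retries`, `make_flaky`) are hypothetical and only serve the illustration.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(action: Callable[[], T], attempts: int = 3) -> T:
    """Run action up to `attempts` times; re-raise the last error if all fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Exception | None = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as exc:  # a failed attempt: try again
            last_error = exc
    assert last_error is not None
    raise last_error


def make_flaky(failures: int) -> Callable[[], str]:
    """Build an action that fails `failures` times, then succeeds (a transient fault)."""
    remaining = {"n": failures}

    def action() -> str:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise ConnectionError("transient fault")
        return "ok"

    return action


print(with_retries(make_flaky(2), attempts=3))
```

Retrying masks transient and some intermittent faults; it does not help with permanent faults, where every repetition fails in the same way.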
An Example of Physical Redundancy
Triple modular redundancy: the effect of a single component failing is completely masked.

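The core of triple modular redundancy is a majority voter over three replicated components. The sketch below shows a single stage with a single voter; the names `vote` and `tmr` are illustrative, and replicating the voter itself is not shown.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


def tmr(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Feed the same input to every replica and vote on the results."""
    return vote([replica(x) for replica in replicas])


def good(x: int) -> int:
    return x * x


def broken(x: int) -> int:
    return -1  # a faulty replica that always produces a wrong result


print(tmr([good, broken, good], 4))
```

With three replicas, one faulty output is outvoted by the two correct ones. If two replicas fail in different ways, the voter finds no majority and reports an error instead of a value.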
Process Resilience
Protection against process failures can be achieved by organizing several identical processes into a group.
- Flat group: all processes are equal; the processes make decisions collectively
  - No single point of failure, but decision making is more complicated
- Hierarchical group: a single coordinator makes all decisions
  - Decision making is simpler, but the coordinator is a single point of failure
The group is transparent to its users: the whole group is dealt with as a single process.

Fault Tolerance in Process Groups
Having a group of identical processes allows us to mask one or more faulty processes in that group.
A group of replicated processes is said to be k fault tolerant if it can survive k faults and still meet its specifications.
- With crash failures, k+1 processes are sufficient to survive k faults
- With Byzantine failures, processes may produce erroneous, random, or malicious results; 2k+1 processes are required to survive k faults (the group output is defined by voting)
  - Byzantine processes keep running even though they are faulty
Assumption: all requests arrive at all members of the group in the same order (this requires atomic multicast). Only then can we be sure that all members do exactly the same thing.

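The group sizes stated on this slide, and the voting that defines the group output, can be sketched as follows. The function names are illustrative, and the sketch assumes every member has already processed the same requests in the same order.

```python
from collections import Counter
from typing import Hashable, Sequence


def min_group_size(k: int, byzantine: bool) -> int:
    """Smallest group that survives k faults, per the rules on this slide."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # Crash failures: one survivor is enough. Byzantine failures:
    # the k faulty replies must be outvoted by k+1 correct ones.
    return 2 * k + 1 if byzantine else k + 1


def group_output(replies: Sequence[Hashable]) -> Hashable:
    """Group output by voting: the reply given by a strict majority."""
    value, count = Counter(replies).most_common(1)[0]
    if 2 * count <= len(replies):
        raise ValueError("no majority")
    return value


k = 2
replies = [7, 7, 7, 0, 1]  # 2k+1 = 5 replies, k = 2 of them wrong
print(min_group_size(k, byzantine=False), min_group_size(k, byzantine=True))
print(group_output(replies))
```

With 2k+1 members, the k+1 correct replies always form a majority, no matter what the k faulty members return.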
Agreement in Faulty Systems
The goal of distributed agreement algorithms is to have all the nonfaulty processes reach consensus on some issue within a finite number of steps.
Q1: Can consensus be reached with nonfaulty processes and an unreliable communication channel?
A: No. Two nonfaulty processes can never reach agreement in the presence of an unreliable channel.
Q2: Can consensus be reached with faulty (Byzantine) processes and a reliable channel?
A: It depends.
Two-army problem: two blue armies must agree to attack simultaneously in order to defeat the white army.
- Each blue army coordinates with the other through a messenger
- The messenger can be captured by the white army
- Can the two blue armies reach agreement?

Conditions for Consensus
[Table: circumstances under which consensus can be reached, organized by process behavior (synchronous or asynchronous), communication delay (bounded or unbounded), message order (unordered or ordered), and message transmission (unicast or multicast).]
Assume processes may be faulty and communication is reliable.
A system is synchronous iff the processes operate in lock-step mode (i.e., there is a constant c ≥ 1 such that if any process has taken c+1 steps, every other process has taken at least one step).

Byzantine Agreement Problem
Byzantine agreement problem: Can N generals reach consensus about each other's troop strengths when the communication channel is perfect but some of the generals are traitors and will lie to prevent agreement?
Formally, there are N processes, and each process i provides a value v_i to the others. The goal is to let each process construct a vector V of length N such that if process i is nonfaulty, V[i] = v_i. Otherwise V[i] is undefined.
Assume processes are synchronous, messages are unicast while preserving ordering, and communication delay is bounded. With k faulty processes, agreement can be achieved if there are 2k+1 nonfaulty processes, i.e., 3k+1 processes in total [Lamport et al., 1982].
In Lamport's paper, the Byzantine generals problem requires two conditions to be met:
1) all loyal lieutenants obey the same order
2) if the commanding general is loyal, then every loyal lieutenant obeys the order he sends

Byzantine Agreement Problem: An Example
The Byzantine agreement problem for 3 nonfaulty processes and 1 faulty process, with v_i = i. Consensus is reached among the nonfaulty processes.
(a) Each process sends its value to the others.
(b) The vectors that each process assembles based on (a).
(c) The vectors that each process receives after each process passes its vector from (b) to every other process.

Byzantine Agreement Problem: Another Example
The Byzantine agreement problem for 2 nonfaulty processes and 1 faulty process. The algorithm fails to produce agreement.

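Both examples can be reproduced with a small simulation of steps (a) to (c). This is a sketch under stated assumptions, not the algorithm as given in the paper: process i holds v_i = i, a faulty process sends a different made-up value in every message (modeled by a counter starting at 1000), and each nonfaulty process finally takes, per vector element, the strict majority over the vectors it received, or UNKNOWN if there is none. The function name is hypothetical.

```python
from collections import Counter
from itertools import count

UNKNOWN = None  # marks a vector element with no majority


def byzantine_agreement(n: int, faulty: set[int]) -> dict[int, dict[int, object]]:
    """Simulate the vector exchange for processes 1..n with v_i = i.

    Returns the result vector computed by each nonfaulty process.
    Assumption: every message from a faulty process carries a fresh,
    distinct bogus value (a real traitor may send anything).
    """
    procs = range(1, n + 1)
    lies = count(1000)

    # Steps (a) and (b): vectors[p][q] is the value p received from q.
    vectors = {
        p: {q: (next(lies) if q in faulty and q != p else q) for q in procs}
        for p in procs
    }

    results = {}
    for p in procs:
        if p in faulty:
            continue
        # Step (c): p receives the vector of every other process;
        # a faulty sender passes on an entirely made-up vector.
        received = [
            {i: next(lies) for i in procs} if q in faulty else vectors[q]
            for q in procs
            if q != p
        ]
        # Final step: per element, keep the strict majority or UNKNOWN.
        result = {}
        for i in procs:
            value, votes = Counter(vec[i] for vec in received).most_common(1)[0]
            result[i] = value if 2 * votes > len(received) else UNKNOWN
        results[p] = result
    return results


print(byzantine_agreement(4, faulty={3}))  # 3 nonfaulty, 1 faulty
print(byzantine_agreement(3, faulty={3}))  # 2 nonfaulty, 1 faulty
```

With four processes, every nonfaulty process ends up with the same vector, correct in the positions of the nonfaulty processes and UNKNOWN for the faulty one. With three processes, each nonfaulty process receives only two vectors, so a single lie is enough to prevent any majority and every element stays UNKNOWN.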