Understanding Fault-Tolerant Distributed Systems  A. Mok 2018

1 Understanding Fault-Tolerant Distributed Systems  A. Mok 2018
CS 386C

2 Dependable Real-Time System Architecture
Application stack
Real-time services
Real-time scheduler

3 Basic Concepts
Terms: service, server, the "depends on" relation
Failure semantics
Failure masking: hierarchical and group
Hardware architecture issues
Software architecture issues

4 Basic architectural building blocks
Service: a collection of operations that can be invoked by users. The operations can only be performed by a server, which hides the state representation and operation implementation details. Servers can be implemented in hardware or software.
Examples:
IBM 4381 processor service: all operations in a 4381 manual
DB2 database service: the set of query and update operations

5 The “depends on” relation
We say that server s depends on server t if s relies on the correctness of t's behavior to correctly provide its own service.
Graphic representation: an arc from server s (the user, or client) to server t (the server, or resource).
System-level description: levels of abstraction.

6 Failure Classification
Service specification: for all operations, specify the state transitions and timing requirements.
Correct service: meets the state transition and timing specs.
Failure: an erroneous state transition or timing in response to a request.
Crash failure: no state transition/response to all subsequent requests.
Omission failure: no state transition/response to a particular request; distinguishable from a crash failure only by timeout.
Timing failure: either an omission, or the correct state transition occurs too early or too late; also called a performance failure.
Arbitrary failure: either a timing failure or a bad state transition; also known as a Byzantine failure.

7 Failure Classification
Failure semantics specifies the behavior the system is forced to assume when some component(s) fail. The failure classes above nest into a hierarchy, from the most restrictive to the most permissive semantics:
Crash ⊆ Omission ⊆ Timing ⊆ Arbitrary (Byzantine)
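As a rough illustration (not part of the original slides), the containment ordering can be captured with a small enum; the class names and the subsumes helper below are invented for this sketch.

```python
from enum import IntEnum

class FailureClass(IntEnum):
    """Failure classes ordered by severity; each class subsumes the ones below it."""
    CRASH = 1      # no state transition/response to all subsequent requests
    OMISSION = 2   # no state transition/response to a particular request
    TIMING = 3     # omission, or correct response too early or too late (performance)
    ARBITRARY = 4  # timing failure or an erroneous state transition (Byzantine)

def subsumes(weaker: FailureClass, stronger: FailureClass) -> bool:
    """True if the behaviors allowed by `stronger` include those allowed by `weaker`."""
    return weaker <= stronger

assert subsumes(FailureClass.CRASH, FailureClass.OMISSION)      # a crash is a special omission
assert subsumes(FailureClass.OMISSION, FailureClass.ARBITRARY)  # which is a special arbitrary failure
```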

8 Examples of Failures
Crash failure:
"Clean" operating system shutdown due to a power outage
Omission failure:
Message loss on a bus
Inability to read a bad disk sector
Timing failure:
Early scheduling of an event due to a fast clock
Late scheduling of an event due to a slow clock
Arbitrary failure:
A "plus" procedure returns 5 for plus(2,2)
A search procedure finds a key that was never inserted
The contents of a message are altered en route

9 Server Failure Semantics
When programming the recovery actions a client takes when a server it depends on fails, it is important to know what failure behavior that server can exhibit.
Example: client s and server r connected by a network n.
d = maximum time to transport a message
p = maximum time needed to receive and process a message
If n and r can suffer only omission failures, then if no reply arrives at s within 2(d + p) time units, no reply will ever arrive; hence s can assume "no answer" to its message.
If n and r can exhibit performance failures, then s must keep local data so that it can discard late replies to "old" messages.
It is the responsibility of a server implementer to ensure that the specified failure semantics is properly implemented; e.g., a traffic light control system needs to ensure a "flashing red or no light" failure semantics.
[Figure: client s, network n, server r]
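A minimal sketch of the two client-side rules above, assuming hypothetical bounds D and P and caller-supplied send_request/poll_reply callables; none of these names or values come from the course material.

```python
import time

D = 0.050  # assumed bound d: max message transport time (seconds), value invented
P = 0.010  # assumed bound p: max receive-and-process time (seconds), value invented
REPLY_DEADLINE = 2 * (D + P)  # request out, processing, reply back

def call_assuming_omission_failures(send_request, poll_reply):
    """If n and r have only omission failure semantics, no reply within 2(d + p)
    means no reply will ever arrive, so 'no answer' is a safe conclusion."""
    send_request()
    deadline = time.monotonic() + REPLY_DEADLINE
    while time.monotonic() < deadline:
        reply = poll_reply()
        if reply is not None:
            return reply
        time.sleep(0.001)
    return None  # assume "no answer" to this message

# Under performance failures a reply may still arrive late, so the client keeps
# local data (here, the set of outstanding request ids) to discard stale replies.
outstanding = set()

def accept_reply(request_id, reply):
    if request_id not in outstanding:
        return None  # late reply to an "old" message: discard it
    outstanding.discard(request_id)
    return reply
```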

10 Failure Semantics Enforcement
Usually, a failure semantics is implemented only to the extent of satisfying the required degree of plausibility (i.e., with sufficiently high probability rather than absolutely). Examples:
Use error-detecting codes to implement networks with performance failure semantics.
Use error-detecting codes, circuit switching and a real-time executive to implement networks with omission failure semantics.
Use lock-step duplication and a highly reliable comparator to implement crash failure semantics for CPUs.
In general, the stronger the desired failure semantics, the more expensive it is to implement; e.g., it is cheaper to build a memory without error-detecting code (arbitrary failure semantics) than to build one with error-detecting code (omission failure semantics).
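As one hedged illustration of the first two bullets, an error-detecting code (here CRC-32) lets a receiver demote arbitrary message corruption to an omission failure by discarding frames whose checksum does not verify; the framing format is invented for this example, not a real protocol.

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def deliver(framed: bytes):
    """Verify the checksum; drop the frame (omission) instead of delivering bad data."""
    payload, checksum = framed[:-4], framed[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        return None  # corrupt frame discarded: the user sees an omission, not arbitrary data
    return payload

good = frame(b"commit txn 42")
assert deliver(good) == b"commit txn 42"
corrupted = b"x" + good[1:]
assert deliver(corrupted) is None  # arbitrary corruption demoted to an omission
```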

11 Failure Masking: Hierarchical Masking
Suppose server u depends on server r. If server u can provide its service by taking advantage of r's failure semantics, then u is said to mask r's failure to its own (u's) clients.
The typical sequence of events: attempt to mask the failure; if masking is impossible, recover a consistent state and propagate the failure upwards.
Examples (sketched below):
Retry I/O on the same disk, assuming omission failures on the bus (time redundancy).
Retry I/O on a backup disk, assuming a crash failure of the disk being addressed (space redundancy).
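A minimal sketch of hierarchical masking for the disk example; DiskError, read_sector and the retry limit are hypothetical names and values, not a real driver API.

```python
class DiskError(Exception):
    """Raised when a read cannot be completed (hypothetical error type)."""

def read_with_masking(primary, backup, sector, retries=3):
    # Time redundancy: retry on the same disk, assuming omission failures on the bus.
    for _ in range(retries):
        try:
            return primary.read_sector(sector)
        except DiskError:
            continue
    # Space redundancy: retry on the backup disk, assuming the addressed disk crashed.
    try:
        return backup.read_sector(sector)
    except DiskError:
        # Masking failed: recover a consistent state and propagate the failure upwards.
        raise DiskError(f"sector {sector} unreadable on both primary and backup")
```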

12 Failure Masking: Group Masking
Implement service S by a group of redundant, independent servers so as to ensure continuity of service. A group is said to mask the failure of a member m if the group response stays correct despite m's failure.
Group response = f(member responses). Examples of group response functions:
Response of the fastest member
Response of the "primary" member
Majority vote of the member responses
A server group able to mask any k concurrent member failures from its clients is said to be k-fault-tolerant (k = 1: single-fault tolerant; k > 1: multiple-fault tolerant). A majority-vote sketch follows below.
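A small sketch of the majority-vote group response function; with 2k + 1 members it masks up to k concurrent arbitrary member failures (the function name is invented).

```python
from collections import Counter

def majority_response(member_responses):
    """Return the value reported by a strict majority of members, or None if there is none."""
    value, count = Counter(member_responses).most_common(1)[0]
    return value if count > len(member_responses) // 2 else None

# Three members (k = 1): one faulty response is masked.
assert majority_response([42, 42, 7]) == 42
# No majority: the group cannot mask the failure.
assert majority_response([1, 2, 3]) is None
```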

13 Group Masking Enforcement
The complexity of the group management mechanism is a function of the failure semantics assumed for the group members.
Example: assume memory components with read-omission failure semantics.
Simply "or"-ing the read outputs is sufficient to mask a member failure.
Cheaper and faster group management; more expensive members (stores with error-correcting code).
[Figure: duplexed memories M and M' sharing the read and write lines]

14 Group Masking Enforcement
The complexity of the group management mechanism is a function of the failure semantics assumed for the group members.
Example: assume memory components with arbitrary read failure semantics.
A majority voter is needed to mask the failure of a member.
More expensive and slower group management; cheaper members (no error-correcting circuitry).
[Figure: triplexed memories M, M' and M" feeding a voter on the read path]
A sketch contrasting the two schemes follows below.
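A sketch contrasting the two enforcement schemes; the member objects and their read method are hypothetical. A member with read-omission semantics returns either the correct value or None, while a member with arbitrary read semantics may return anything.

```python
from collections import Counter

def group_read_omission(members, addr):
    """Read-omission members: simply 'or' the outputs; the first non-None answer is correct."""
    for m in members:
        value = m.read(addr)
        if value is not None:
            return value
    return None  # all members omitted

def group_read_arbitrary(members, addr):
    """Arbitrary-failure members: a majority voter is needed to mask a bad member."""
    votes = Counter(m.read(addr) for m in members)
    value, count = votes.most_common(1)[0]
    return value if count > len(members) // 2 else None
```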

15 Key Architectural Issues
Strong server failure semantics: expensive. Weak failure semantics: cheap.
Redundancy management for members with strong failure semantics: cheap. Redundancy management for members with weak failure semantics: expensive.
This implies that the amount of failure detection, recovery and masking redundancy must be balanced across the various levels of abstraction to obtain the best overall cost/performance/dependability result.
Example: error detection for memory systems at the lower levels usually decreases overall system cost; error correction for communication systems at the lower levels may be overkill for non-real-time applications (known as the end-to-end argument in networking).

16 Hardware Architectural Issues
Replaceable hardware unit: a set of hardware servers packaged together so that the set is a physical unit of failure, replacement and growth. It may be field replaceable (by field engineers) or customer replaceable.
Goals:
Allow physical removal/insertion without disruption to higher-level software servers (removals and insertions are masked).
If masking is not possible or too expensive, ensure a "nice" failure semantics such as crash or omission.
Coarse-granularity architecture: a replaceable unit includes several elementary servers, e.g., CPU, memory, I/O controller.
Fine-granularity architecture: elementary hardware servers are themselves replaceable units.
Question: how are the replaceable units grouped and connected?

17 Coarse Granularity Example: Tandem Non-Stop

18 Coarse Granularity Example: DEC VAX cluster

19 Coarse Granularity Example: IBM Extended Recovery Facility
IBM XRF architecture: [figure]

20 Fine Granularity Example: Stratus

21 Fine Granularity Example: Sequoia

22 O.S. Hardware Failure Semantics?
What failure semantics for hardware replaceable units is usually assumed by operating system software?
CPU: crash
Bus: omission
Memory: read omission
Disk: read/write omission
I/O controller: crash
Network: omission or performance failure

23 What failure detection mechanisms are used to implement the specified failure semantics of a hardware replaceable unit? Examples:
Crash failure semantics for processors implemented by duplication and comparison in Stratus, Sequoia and the Tandem CLX.
Crash failure semantics approximated by using error-detecting codes in the IBM 370 and the Tandem TXP and VLX.

24

25 At what level of abstraction are a hardware replaceable unit's failures masked?
Masking at the hardware level (e.g., Stratus): redundancy at the hardware level. Duplexing CPU servers with crash failure semantics provides single-fault tolerance and increases the mean time between failures for the CPU service.
Masking at the operating system level (e.g., Tandem process groups): redundancy at the O.S. level. Hierarchical masking hides a single CPU failure from higher-level software servers by restarting a process that ran on a failed CPU in a manner transparent to those servers.
Masking at the application server level (e.g., IBM XRF, AAS): redundancy at the application level. Group masking hides CPU failures from users by using a group of redundant software servers running on distinct hardware hosts and maintaining a global service state.

26 Software Architecture Issues
Software servers are analogous to hardware replaceable units (units of failure, replacement and growth).
Goals:
Allow removal/insertion without disrupting higher-level users.
If masking is impossible or not economical, ensure a "nice" failure semantics (which allows higher-level users, possibly human, to apply simple masking techniques such as "log in and try again").

27 What failure semantics is specified for software servers?
If the service state is persistent (e.g., an ATM), servers are typically required to implement omission (atomic transaction, at-most-once) failure semantics.
If the service state is not persistent (e.g., network topology management, virtual circuit management, a low-level I/O controller), then crash failure semantics is sufficient.
To implement atomic transaction or crash failure semantics, the operations implemented by the servers are assumed to be functionally correct: a deposit of $100 must not credit the customer's account with $1000.

28 How are software server failures masked?
Functional redundancy (e.g., N-version programming, recovery blocks)
Use of software server groups (e.g., IBM XRF, Tandem)
The use of software server groups raises a number of issues that are not well understood:
How do clients address service requests to server groups?
What group-to-group communication protocols are needed?

29 Timed Asynchronous System Model Assumptions
Network topology: every process is known to every other process; communication is by messages; automatic routing is assumed.
Synchrony: service times have known upper bounds; local clocks have bounded drift with known rates.
Failure semantics: processes have crash or performance failures; message delivery has omission or performance failures.
Message buffering: message buffers are finite; buffer overflow does not block the sender; FIFO message delivery is not assumed.
These assumptions are summarized in the sketch below; after that, let us look at a real-life protocol implementation.
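One way to picture the quantitative assumptions is as a parameter record; the field names and numbers below are invented for illustration and are not part of the model's definition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TimedAsyncModel:
    max_msg_delay: float     # known upper bound on message transport time (seconds)
    max_service_time: float  # known upper bound on local processing steps (seconds)
    max_drift_rate: float    # known bound on local clock drift (seconds per second)

# Hypothetical parameter values for a LAN-like setting.
MODEL = TimedAsyncModel(max_msg_delay=0.050, max_service_time=0.010, max_drift_rate=1e-5)

def delivery_timeout(model):
    """A message not handled within this bound indicates an omission or performance failure."""
    return (model.max_msg_delay + model.max_service_time) * (1 + model.max_drift_rate)
```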

30 The Tandem Process-Pair Communication Protocol
Goal: to achieve at-most-once semantics in the presence of process and communication performance failures.
C sends a request to S; S assigns it a unique serial number.
The client-server session number 0 is replicated in (C, C'); the current serial number SN1 is replicated in (S, S').
The current message counter records the id of the transaction request. Since many requests from different clients may be processed simultaneously, each request has a unique id.

31 Normal Protocol Execution (1)
S: if SN(message) = my session message counter then
  increment the current serial number to SN2
  log locally the fact that SN1 was returned for request 0
  increment the session message counter
  checkpoint (session message counter, log, new state) to S'
else
  return the result saved for request 0
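A rough Python rendering of the server-side check above, under the slide's assumptions; the class, field and checkpoint names are invented, and the real protocol waits for the ack from S' before replying to C (noted only in a comment here).

```python
class PrimaryServer:
    """Sketch of S in the process-pair protocol (names invented for illustration)."""

    def __init__(self, backup):
        self.session_counter = 0  # id of the next new request in this session
        self.serial_number = 1    # current serial number SN1, replicated in (S, S')
        self.result_log = {}      # request id -> serial number that was returned
        self.backup = backup      # stands for S'

    def handle(self, request_id, apply_request):
        if request_id == self.session_counter:
            # New request: log the result, advance state, checkpoint to S', then reply.
            self.result_log[request_id] = self.serial_number  # SN1 returned for request 0
            self.serial_number += 1                           # SN1 -> SN2
            self.session_counter += 1
            new_state = apply_request()
            # Real protocol: wait for the ack of this checkpoint from S' before replying.
            self.backup.checkpoint(self.session_counter, dict(self.result_log), new_state)
            return self.result_log[request_id]
        # Retransmission of an already-processed request: return the saved result.
        return self.result_log[request_id]
```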

32 Normal Protocol Execution (2)
S' updates the session message count, records that SN1 was returned for request 0, adopts the new current serial number, and sends an ack.
S, after receiving the ack confirming that S' is now in sync, sends the response for request 0 to C.

33 Normal Protocol Execution (3)
C updates its state with SN1, then checkpoints (session counter, new state) to C'.
C' updates its session counter and state, then sends an ack to C.

34 Example 1: S crashes before checkpointing to S'
C resends its requests to the (new) primary S.
S starts a new backup S', initializes its state to (0, SN1), and interprets the requests as before.

35 Example 2: S' crashes before sending ack to S
S creates a new backup S', initializes it, and resends the checkpoint message to S'.

36 Example 3: C crashes after sending request to S
C' becomes primary, starts a new back-up, and resends the request to S.
S performs the check on the session number: if SN(message) = my session message counter then ... else return the result saved for request 0.

37 Issues raised by the use of software server groups:
State synchronization: how should group members maintain consistency of their local states (including time) despite member failures and joins, and communication failures?
Group membership: how should group members reach agreement on which members are working correctly?
Service availability: how is it automatically ensured that the required number of members is maintained for each server group despite operating system, server and communication failures?

38 Example of Software Group Masking
If all working processors agree on the group state, then the service S can be made available in spite of two concurrent processor failures (see the sketch below):
a fails: b and c agree that a failed; b becomes primary, c becomes back-up.
b fails: c agrees (trivially) that b failed; c becomes primary.
a and b fail: c agrees that a and b failed; c becomes primary.
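A minimal sketch of the agreed-ranking rule above, using the slide's members a, b and c (the function name is invented).

```python
RANKING = ["a", "b", "c"]  # agreed rank order: primary, first back-up, second back-up

def choose_primary(agreed_working):
    """Given the set of members all working processors agree are correct,
    the highest-ranked surviving member becomes primary."""
    for member in RANKING:
        if member in agreed_working:
            return member
    return None  # no member left: the service is unavailable

assert choose_primary({"b", "c"}) == "b"  # a fails: b primary, c back-up
assert choose_primary({"c"}) == "c"       # a and b fail: c primary
```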

39 Problem of Disagreement on Service State
Disagreement on the service state can cause unavailability when a failure occurs: if a fails, then S becomes unavailable even though enough hardware for running it exists.

40 Problem of Disagreement on Time
Disagreement on time can prevent a performance failure from being detected. If clocks assumed to be synchronized within 10 milliseconds are used to detect excessive message delays, then clocks that are actually out of sync (say, by 10 minutes) can let a late system response go unnoticed: a message m arriving 10 minutes late will still cause B to think that m took only 30 milliseconds for the trip, which could be within the network's performance bounds. The arithmetic is worked below.
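The slide's numbers worked through in a few lines; the variable names are invented.

```python
ASSUMED_PRECISION = 0.010  # clocks assumed synchronized within 10 ms
ACTUAL_SKEW = 600.0        # B's clock actually lags A's by 10 minutes
REAL_TRANSIT = 600.030     # m really takes 10 minutes and 30 ms to arrive

send_time_on_A = 0.0                                  # timestamp A puts in message m
arrival_real_time = send_time_on_A + REAL_TRANSIT
arrival_time_on_B = arrival_real_time - ACTUAL_SKEW   # read from B's lagging clock

apparent_delay = arrival_time_on_B - send_time_on_A
print(f"{apparent_delay:.3f}")  # 0.030 -> B sees a 30 ms trip and misses the performance failure
```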

41 Problem of Disagreement on Group Membership
Disagreement on membership can create confusion: out-of-date membership information causes unavailability.

42 Server Group Synchronization Strategy
Close state synchronization: each server interprets each request. The result is sent by voting if members have arbitrary failure semantics; otherwise the result can be sent by all members or by the highest-ranking member.
Loose state synchronization: members are ranked (primary, first back-up, second back-up, ...). The primary maintains the current service state and sends results; back-ups log requests (and possibly results) and periodically purge the log. Applicable only when members have performance failure semantics; a sketch follows below.
One solution is to use atomic broadcast with tight clock synchronization.
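A minimal sketch of loose state synchronization under the assumptions above; the Member class and its methods are invented for illustration.

```python
class Member:
    """One member of a ranked server group (rank 0 = primary, 1 = first back-up, ...)."""

    def __init__(self, rank):
        self.rank = rank
        self.state = {}        # current service state, maintained only by the primary
        self.request_log = []  # back-ups log requests so they can catch up if promoted

    def handle(self, request):
        key, value = request
        if self.rank == 0:
            self.state[key] = value       # primary applies the request
            return "ok"                   # and sends the result to the client
        self.request_log.append(request)  # back-up only logs the request
        return None

    def purge_log(self, applied_up_to):
        """Periodically discard log entries the primary has confirmed as applied."""
        self.request_log = self.request_log[applied_up_to:]
```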

43 Requirements for Atomic Broadcast and Membership
Safety properties, e.g.:
All group members agree on the group membership.
All group members agree on what messages were broadcast and the order in which they were sent.
Timeliness properties, e.g.:
There is a bound on the time required to complete an atomic broadcast.
There are bounds on the time needed to detect server failures and server reactivations.

44 Service Availability Issues
What server group availability strategy would ensure the required availability objective? (It prescribes how many members each server group should have.)
What mechanism should be used to automatically enforce a given availability strategy?
Direct mutual surveillance among group members
A general service availability manager

45 Q & A

