2/23/2009, CS5090: Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. Fred B. Schneider. Presenter: Aly Farahat.


1 Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. Fred B. Schneider. Presenter: Aly Farahat

2 Contents
Introduction
State Machines
Fault-Tolerance
Agreement & Order
– Logical Clocks
– Synchronized Clocks
– Server Side Ordering
Faulty Output Devices
Faulty Clients
Using Time to Make Requests
Reconfiguration
– Managing Reconfiguration
– Integrating Repaired Replicas

3 Client/Server Model

4 Fault Types
Fail-Stop Faults: a faulty component enters a predefined state that lets other components detect the failure, then halts
Byzantine Faults: a faulty component exhibits arbitrary, possibly malicious, behavior
Q: Why do we need logic for programs?

5 Fault Tolerance
Based on the concept of replication
t-tolerant: the system delivers correct service as long as no more than t components fail
Identical replicas of the server:
– t+1 for fail-stop faults
– 2t+1 for Byzantine faults
Q: What kind of fault tolerance is this? What types of faults can it tolerate?
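
The replica counts on this slide can be written as a small helper. This is a minimal sketch; the function name `replicas_needed` and the string labels for the fault models are illustrative assumptions, not part of the slides.

```python
def replicas_needed(t: int, fault_model: str) -> int:
    """Minimum number of replicas for a t-tolerant service (per the slide)."""
    if t < 0:
        raise ValueError("t must be non-negative")
    if fault_model == "fail-stop":
        # Any single surviving replica produces correct output.
        return t + 1
    if fault_model == "byzantine":
        # Correct replicas must outnumber the t faulty ones: t+1 out of 2t+1.
        return 2 * t + 1
    raise ValueError(f"unknown fault model: {fault_model}")
```

For example, tolerating two faults would need three replicas under fail-stop faults and five under Byzantine faults.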

6 Replication Scheme

7 State Machine Model
Each server replica is an identical state machine
State machines are request-driven and cannot progress on their own
A client issues a request to the state machine
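
As an illustration of a request-driven, deterministic state machine, here is a minimal sketch of a memory-like machine with read and write requests. The class name `MemoryMachine`, the request tuple layout, and the helper `run` are assumptions made for this example only.

```python
class MemoryMachine:
    """A deterministic state machine: state changes only when a request arrives."""

    def __init__(self):
        self.store = {}

    def apply(self, request):
        # A request is a hypothetical (operation, location, value) tuple.
        op, location, value = request
        if op == "write":
            self.store[location] = value
            return "ok"
        if op == "read":
            return self.store.get(location)
        raise ValueError(f"unknown operation: {op}")


def run(requests):
    """Feed a request sequence to a fresh replica; return final state and outputs."""
    machine = MemoryMachine()
    outputs = [machine.apply(r) for r in requests]
    return machine.store, outputs
```

Two replicas that start in the same state and apply the same requests in the same order end in the same state and produce the same outputs. Swapping two writes to the same location can change the result, which is why replicas must agree on the order of requests.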

8 State Machine Behavior with Respect to Clients
O1: Requests issued by a single client should be processed in the order in which they were issued
O2: If request r1 could have caused request r2 (r2 is causally related to r1), r1 should be processed before r2

9 Example
Q: Find the analogy between the state machine in this context and the FSM used in sequential circuit synthesis

10 Agreement and Order
Coordination is necessary to assure O1 and O2
Agreement: all replicas agree upon the value of the request they should process
Order: all replicas should process requests in the same order (agree on the order of requests)
Stable Request: a request whose value and order are agreed among the replicas

11 Agreement
IC1: All nonfaulty processors agree on the same value
IC2: If the transmitter is nonfaulty, all nonfaulty processors use its value as the one on which they agree
Q: How can faulty processors be determined, assuming a Byzantine fault model?

12 Order and Stability
Order: all replicas process the requests in the same order
Stability: a property of a request, meaning that it is in the correct order
Protocols:
– Logical Clocks
– Synchronized Clocks
– Server Side Identification
Q: Suggest a scenario for an out-of-order request reception

13 Logical Clocks

14 Stability Test
r is stable at a replica once the replica has received, from every client, a request r’ with T(r) < T(r’) (T returns the logical clock value appended to a request)
Because message delays may be unbounded, agreement is impossible in the case of Byzantine faults
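
The logical-clock stability test can be sketched as follows. This is an illustrative sketch only: the class `LogicalClockReplica`, its method names, and the tuple layout are assumptions, and ties between equal timestamps are broken here by client name purely for the example.

```python
class LogicalClockReplica:
    """Buffers requests and releases them once they pass the stability test."""

    def __init__(self, clients):
        # Largest timestamp received so far from each client (None = nothing yet).
        self.latest = {c: None for c in clients}
        self.pending = []  # (timestamp, client, payload)

    def receive(self, timestamp, client, payload):
        prev = self.latest[client]
        self.latest[client] = timestamp if prev is None else max(prev, timestamp)
        self.pending.append((timestamp, client, payload))

    def is_stable(self, timestamp):
        # Stable once every client has sent a request with a larger timestamp.
        return all(ts is not None and ts > timestamp for ts in self.latest.values())

    def pop_stable(self):
        # Stable requests are handed over in timestamp order.
        ready = sorted(r for r in self.pending if self.is_stable(r[0]))
        self.pending = [r for r in self.pending if not self.is_stable(r[0])]
        return ready
```

A request stays buffered until every client, including its own sender, has been heard from with a larger timestamp, which is why a silent client blocks progress under this test.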

15 Synchronized Real-Time Clocks
Each processor has a real-time clock synchronized with all other processors' clocks
Upper bounds on request delays guarantee order in the case of Byzantine failures

16 Stability Test
1- The replica waits long enough to guarantee that no request with a smaller identifier can still arrive. Disadvantage: the replica has to wait
2- The replica checks that it has received a request with a larger identifier from every client
In practice the disjunction of both tests is used
Q: How are Byzantine failures handled in this case?
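
The disjunction of the two tests might look like the sketch below. It assumes that a request's identifier is the real-time clock value at which it was issued and that `delta` is a known upper bound on request delay; both the function name and its parameters are hypothetical.

```python
def is_stable_realtime(uid, now, delta, latest_uid_by_client):
    """Return True if either real-time stability test passes.

    uid: clock value attached to the request (assumed to be its identifier)
    now: current reading of the replica's synchronized clock
    delta: assumed upper bound on request delay
    latest_uid_by_client: largest identifier received so far from each client
    """
    # Test 1: enough time has passed that no smaller identifier can still arrive.
    waited_long_enough = uid < now - delta
    # Test 2: every client has already sent a request with a larger identifier.
    overtaken = bool(latest_uid_by_client) and all(
        u > uid for u in latest_uid_by_client.values()
    )
    return waited_long_enough or overtaken
```

The second test lets a replica proceed early when clients are active, while the first bounds the wait when some client is silent.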

17 Replica-Generated Identifiers
Advantage: not all processors need to communicate
Phase 1: each replica proposes a candidate unique identifier for the received request; the request is then said to be seen
Phase 2: all replicas agree upon the request's identifier; the request is then said to be accepted

18 Requirements for Stability Agreement
Stability Test: an accepted request r is stable at a replica once every request r’ that the replica has seen but not yet accepted has a candidate identifier strictly greater than the identifier of r

19 Generating Unique Identifiers
Q: What is the significance of the i/N term?
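
The formula on this slide did not survive in the transcript. The sketch below assumes the candidate identifier has the form max(⌊SEEN⌋, ⌊ACCEPT⌋) + 1 + i/N for replica i out of N, where SEEN is the largest candidate proposed so far and ACCEPT the largest identifier accepted so far. That form, the class `IdReplica`, and its method names are assumptions for illustration and should be checked against the paper.

```python
from fractions import Fraction
from math import floor


class IdReplica:
    """Proposes candidate identifiers and applies the seen/accepted stability test."""

    def __init__(self, index, n):
        self.index, self.n = index, n
        self.seen = Fraction(0)    # largest candidate proposed so far
        self.accept = Fraction(0)  # largest identifier accepted so far
        self.candidates = {}       # request -> candidate id (seen, not yet accepted)
        self.accepted = {}         # request -> agreed id

    def propose(self, request):
        # Phase 1 (assumed formula): the i/N fraction keeps candidates from
        # different replicas distinct.
        cuid = max(floor(self.seen), floor(self.accept)) + 1 + Fraction(self.index, self.n)
        self.seen = max(self.seen, cuid)
        self.candidates[request] = cuid
        return cuid

    def accept_uid(self, request, uid):
        # Phase 2: record the identifier the replicas agreed on.
        self.candidates.pop(request, None)
        self.accepted[request] = uid
        self.accept = max(self.accept, uid)

    def is_stable(self, request):
        # Stable once every seen-but-unaccepted request has a larger candidate.
        uid = self.accepted[request]
        return all(c > uid for c in self.candidates.values())
```

Under this assumed form, replicas 0 and 1 of 2 would propose 1 and 3/2 for their first request, so no two replicas ever propose the same candidate.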

20 Tolerating Faulty Output Devices
Outputs used outside the system
– Replicate output devices
– Replicate voters
Outputs used inside the system
– Outputs go back to clients
– Each client has a voter inside it
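
A voter of the kind mentioned here can be sketched as a function that accepts an output only when enough replicas report the same value. The name `vote` and the threshold of t+1 matching outputs (appropriate when up to t of 2t+1 replicas may be Byzantine) are assumptions for this example.

```python
from collections import Counter


def vote(outputs, t):
    """Return the value reported by at least t+1 replicas, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # With at most t faulty replicas, t+1 matching outputs include a correct one.
    return value if count >= t + 1 else None
```

With fail-stop replicas the same function can be called with t = 0, since any output that arrives is correct.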

21 Tolerating Faulty Clients
Replication
– Server state machine modification
– Voter inside the state machine
Requests having the same content but different identifiers
Requests having different content and identifiers
Q: How is a voter failure inside the server handled?

22 Defensive Programming
Replicas are not always possible
– Lack of hardware
– Application semantics do not allow replication
Defensive Programming: additional requirements on state machines to prevent some possibly destructive actions from a faulty client
Examples:
– Memory partitioning and prevention of shared access
– Bounded-time use of shared resources, by using scheduled requests on the server side

23 Timed Requests
Pro: No need to transmit requests
Con: Does not have parameters
Default Request: executes on time at the server unless the client sends a different request

24 Reconfiguration

25 C, O and S
A configuration is a triple
– C: the set of operational clients
– O: the set of operational output devices
– S: the set of operational state machine replicas
C and O are needed by the state machine replicas
S is needed by the agreement protocol

26 Configurators
Each configurator manages a single object in C, O or S
Detects failures and repairs of this object
Configurators are themselves clients
They issue reconfiguration requests to the state machine replicas
State machines use application-dependent mechanisms for failure detection

27 Note
The next slides are adapted from a presentation of the same paper by Leon Traille of Georgia Tech

28 Integrating a Repaired Object
e[r_i]: the state that a non-faulty system element e should be in after processing requests r_0 through r_i
An element joining the configuration immediately after request r_join must be in state e[r_join] before it can participate
Fail-stop failures
– output device: e[r_join] is likely to be a small amount of setup information that can be provided by state variables of sm_i
– client: e[r_join] is frequently based on previous sensor values and can be determined by information from other clients
– state machine replica: the information for e[r_join] is stored in state variables and pending requests at sm_i
Byzantine failures
– the information must be obtained from t + 1 replicas instead of just one

29 Integration with Logical Clocks
Integrating element e by state machine replica sm_i at request r_join
Fail-stop processors
If e is a client or an output device, then sm_i sends any relevant portion of its state variables to e before sending any output produced by requests with unique identifiers larger than the one on r_join
If e is a state machine replica sm_new, then sm_i
1) sends the values of its state variables and copies of any pending requests to sm_new
2) sends to sm_new every subsequent request r received from each client c such that uid(r) < uid(r_c), where r_c is the first request sm_new received directly from c after being restarted
Byzantine failures
– Because information from sm_i might be incorrect, t + 1 copies of identical state information and t + 1 copies of relayed messages must be obtained

30 Integration with Real-Time Clocks
Integrating element e by state machine replica sm_i at request r_join
Fail-stop processors
If e is a client or an output device, then sm_i sends relevant portions of its state variables to e before sending any output produced by requests with unique identifiers larger than the one on r_join
If e is a state machine replica sm_new, then sm_i
1) sends the values of its state variables and copies of any pending requests to sm_new
2) sends to sm_new every request received during the next interval of duration Δ
Byzantine failures
– Because information from sm_i might be incorrect, t + 1 copies of identical state information and t + 1 copies of relayed messages must be obtained

31 Stability Test During Restart
Relaying of messages breaks the stability tests
A request r may be received directly from client c, but later a request r’, also from c, is relayed by sm_i with uid(r) > uid(r’)
Solution: consider requests from c as stable only after no more relayed requests from c can arrive
Stability Test During Restart: a request r received directly from a client c by a restarting state machine replica sm_new is stable only after the last request from c relayed by another processor has been received by sm_new
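
The restart rule can be sketched as an extra gate in front of the ordinary stability test. How sm_new learns that the last relayed request from a client has arrived is not stated on the slide; the sketch assumes a hypothetical per-client notification, `relay_finished`, and the class name `RestartingReplica` is likewise an assumption.

```python
class RestartingReplica:
    """Withholds stability for a client's requests until relaying for it has ended."""

    def __init__(self, clients):
        # Clients for which relayed requests may still arrive.
        self.awaiting_relay = set(clients)

    def relay_finished(self, client):
        # Hypothetical signal: the last relayed request from this client has arrived.
        self.awaiting_relay.discard(client)

    def is_stable(self, client, passes_normal_test):
        # A directly received request is stable only if the ordinary test passes
        # AND no relayed request from the same client can still arrive.
        return passes_normal_test and client not in self.awaiting_relay
```

Until the gate opens for client c, a relayed request with a smaller identifier could still arrive and would have to be processed first.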

32 Thank you!

