
1 CMPT 401 Summer 2007 Dr. Alexandra Fedorova Lecture IX: Coordination And Agreement

2 2 CMPT 401 Summer 2007 © A. Fedorova A Replicated Service (diagram: clients connect over the network to a master server and a slave server; W = write, R = read; writes go to the master, which performs data replication to the slave)

3 3 CMPT 401 Summer 2007 © A. Fedorova A Need For Coordination And Agreement (diagram: the same clients, network, master and slave servers) Must coordinate election of a new master Must agree on a new master

4 4 CMPT 401 Summer 2007 © A. Fedorova Roadmap Today we will discuss protocols for coordination and agreement This is a difficult problem because of failures and the lack of a bound on message delay We will begin with a strong set of assumptions (few failures), and then we will relax those assumptions We will look at several problems requiring coordination and agreement: distributed mutual exclusion, elections We will finally learn that in an asynchronous distributed system it is impossible to guarantee consensus

5 5 CMPT 401 Summer 2007 © A. Fedorova Distributed Mutual Exclusion (DMTX) Similar to a local mutual exclusion problem Processes in a distributed system share a resource Only one process can access a resource at a time Examples: –File sharing –Sharing a bank account –Updating a shared database

6 6 CMPT 401 Summer 2007 © A. Fedorova Assumptions and Requirements An asynchronous system Processes do not fail Message delivery is reliable (exactly once) Protocol requirements: Safety: At most one process may execute in the critical section at a time Liveness: Requests to enter and exit the critical section eventually succeed Fairness: Requests to enter the critical section are granted in the order in which they were received

7 7 CMPT 401 Summer 2007 © A. Fedorova Evaluation Criteria of DMTX Algorithms Bandwidth consumed –proportional to the number of messages sent in each entry and exit operation Client delay –delay incurred by a process at each entry and exit operation System throughput –the rate at which processes can access the critical section (number of accesses per unit of time)

8 8 CMPT 401 Summer 2007 © A. Fedorova DMTX Algorithms We will consider the following algorithms: –Central server algorithm –Ring-based algorithm –An algorithm based on voting

9 9 CMPT 401 Summer 2007 © A. Fedorova The Central Server Algorithm

10 10 CMPT 401 Summer 2007 © A. Fedorova The Central Server Algorithm Performance: –Entering a critical section takes two messages (a request message followed by a grant message) –System throughput is limited by the synchronization delay at the server: the time between the release message to the server and the grant message to the next client Fault tolerance: –Does not tolerate failures –What if the client holding the token fails?
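The central-server protocol is simple enough to capture in a few lines. Below is a minimal sketch in Python (the class and method names are illustrative, and the actual network send is stubbed out): the server keeps the token and a FIFO queue of waiting clients, which also gives the fairness property from slide 6.

```python
import queue

class LockServer:
    """Sketch of the central-server DMTX algorithm (illustrative, not from the slides)."""
    def __init__(self):
        self.holder = None             # client currently holding the token
        self.waiting = queue.Queue()   # FIFO queue preserves request order (fairness)

    def on_request(self, client):
        if self.holder is None:
            self.holder = client
            self.send_grant(client)    # entry costs two messages: request + grant
        else:
            self.waiting.put(client)   # token busy: queue the request

    def on_release(self, client):
        assert client == self.holder
        if not self.waiting.empty():
            nxt = self.waiting.get()
            self.holder = nxt
            self.send_grant(nxt)       # synchronization delay: release -> grant
        else:
            self.holder = None

    def send_grant(self, client):
        print(f"grant -> {client}")    # stand-in for a real network send
```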

11 11 CMPT 401 Summer 2007 © A. Fedorova A Ring-Based Algorithm

12 12 CMPT 401 Summer 2007 © A. Fedorova A Ring-Based Algorithm (cont) Processes are arranged in a ring There is a communication channel from process p i to process p (i+1) mod N They continuously pass the mutual exclusion token around the ring A process that does not need to enter the critical section (CS) passes the token along A process that needs to enter the CS retains the token; once it exits the CS, it resumes passing the token along No fault tolerance Excessive bandwidth consumption
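A compact sketch of a ring node's token-handling logic, in Python (the transport callback send and the node numbering are assumptions for illustration); it shows why the token circulates even when nobody wants the critical section, which is the source of the excessive bandwidth use noted above.

```python
class RingNode:
    """Sketch of one node in the ring-based algorithm (illustrative only)."""
    def __init__(self, node_id, n_nodes):
        self.node_id = node_id
        self.next_id = (node_id + 1) % n_nodes  # channel from p_i to p_(i+1) mod N
        self.wants_cs = False                   # set to True when the node needs the CS

    def on_token(self, send):
        # send(dest_id) is an assumed transport primitive
        if self.wants_cs:
            self.enter_critical_section()       # retain the token while in the CS
            self.wants_cs = False
        send(self.next_id)                      # pass the token along either way

    def enter_critical_section(self):
        print(f"node {self.node_id} in CS")
```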

13 13 CMPT 401 Summer 2007 © A. Fedorova Maekawa’s Voting Algorithm To enter a critical section a process must receive permission from a subset of its peers Processes are organized into voting sets A process is a member of M voting sets All voting sets are of equal size (for fairness)

14 14 CMPT 401 Summer 2007 © A. Fedorova Maekawa’s Voting Algorithm (diagram: overlapping voting sets containing p 1, p 2, p 3, p 4 ) Intersection of voting sets guarantees mutual exclusion To avoid deadlock, requests to enter the critical section must be ordered
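One common way to build voting sets that always intersect is to arrange the N processes in a √N × √N grid and give each process its row plus its column as its voting set. The Python below is a sketch of that construction (the grid layout is an illustrative choice, not taken from the slides; Maekawa's optimal sets are smaller, roughly √N each):

```python
import math

def grid_voting_sets(n):
    """Build pairwise-intersecting voting sets from a sqrt(n) x sqrt(n) grid.
    Assumes n is a perfect square; each set is a process's row plus its column,
    so any two sets share at least one process, which gives mutual exclusion."""
    k = int(math.isqrt(n))
    assert k * k == n
    sets = []
    for p in range(n):
        r, c = divmod(p, k)
        row = {r * k + j for j in range(k)}
        col = {i * k + c for i in range(k)}
        sets.append(row | col)
    return sets

sets = grid_voting_sets(9)
# every pair of voting sets overlaps -- the property the diagram illustrates
assert all(sets[i] & sets[j] for i in range(9) for j in range(9))
```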

15 15 CMPT 401 Summer 2007 © A. Fedorova Elections Election algorithms are used when a unique process must be chosen to play a particular role: –Master in a master-slave replication system –Central server in the DMTX protocol We will look at the bully election algorithm The bully algorithm tolerates failstop failures But it works only in a synchronous system with reliable messaging

16 16 CMPT 401 Summer 2007 © A. Fedorova The Bully Election Algorithm All processes are assigned identifiers The system always elects as coordinator the process with the highest identifier: –Each process must know all processes with higher identifiers than its own Three types of messages: –election – a process begins an election –answer – a process acknowledges the election message –coordinator – an announcement of the identity of the elected process

17 17 CMPT 401 Summer 2007 © A. Fedorova The Bully Election Algorithm (cont.) Initiation of election: –Process p 1 detects that the existing coordinator p 4 has crashed and initiates the election –p 1 sends an election message to all processes with higher identifiers than itself (diagram: p 1 sends election to p 2, p 3, and p 4 )

18 18 CMPT 401 Summer 2007 © A. Fedorova The Bully Election Algorithm (cont.) What happens if there are no further crashes: –p 2 and p 3 receive the election message from p 1, send back the answer message to p 1, and begin their own elections –p 3 sends answer to p 2 –p 3 receives no answer message from p 4, so after a timeout it elects itself as the leader (knowing it has the highest ID among the live processes) (diagram: election, answer, and coordinator messages among p 1 …p 4 )

19 19 CMPT 401 Summer 2007 © A. Fedorova The Bully Election Algorithm (cont.) What happens if p 3 also crashes after sending the answer message but before sending the coordinator message? In that case, p 2 will time out while waiting for the coordinator message and will start a new election (diagram: p 2 restarts the election after the timeout)

20 20 CMPT 401 Summer 2007 © A. Fedorova The Bully Election Algorithm (summary) The algorithm does not require a central server Does not require knowing identities of all the processes Does require knowing identities of processes with higher IDs Survives crashes Assumes a synchronous system (relies on timeouts)
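The decision logic of a single bully-algorithm participant fits in a short sketch. The Python below is illustrative: send is an assumed message-sending callback, and the timeout machinery (the synchrony assumption) is left to the caller.

```python
class BullyNode:
    """Sketch of one bully-algorithm participant (illustrative, not from the slides)."""
    def __init__(self, my_id, all_ids, send):
        self.my_id = my_id
        self.peers = [i for i in all_ids if i != my_id]
        self.higher = [i for i in self.peers if i > my_id]  # only higher IDs matter
        self.send = send                                    # assumed transport callback

    def start_election(self):
        if not self.higher:
            self.announce_coordinator()          # highest ID elects itself
            return
        for peer in self.higher:
            self.send(peer, ("election", self.my_id))
        # If no ("answer", ...) arrives before a timeout, the caller treats all
        # higher processes as crashed and calls announce_coordinator().

    def on_election(self, from_id):
        self.send(from_id, ("answer", self.my_id))  # acknowledge the lower process
        self.start_election()                       # and run our own election

    def announce_coordinator(self):
        for peer in self.peers:
            self.send(peer, ("coordinator", self.my_id))
```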

21 21 CMPT 401 Summer 2007 © A. Fedorova Consensus in Asynchronous Systems With Failures The algorithms we’ve covered have limitations: –Either they tolerate only limited failures (failstop) –Or they assume a synchronous system Consensus cannot be guaranteed in an asynchronous system with failures Next we will see why…

22 22 CMPT 401 Summer 2007 © A. Fedorova Consensus All processes agree on the same value (or set of values) When do you need consensus? –Leader (master) election –Mutual exclusion –Transactions involving multiple parties (banking) We will look at several variants of the consensus problem –Consensus –Byzantine generals –Interactive consistency

23 23 CMPT 401 Summer 2007 © A. Fedorova System Model There is a set of processes P i There is a set of values {v 0, …, v N-1 } proposed by the processes Each process P i decides on d i d i belongs to the set {v 0, …, v N-1 } Assumptions: –Synchronous system (for now) –Failstop failures –Byzantine failures –Reliable channels

24 24 CMPT 401 Summer 2007 © A. Fedorova Consensus (diagram: Step 1, Propose – P 1, P 2, P 3 propose v 1, v 2, v 3 ; the consensus algorithm runs; Step 2, Decide – P 1, P 2, P 3 decide d 1, d 2, d 3 ) Courtesy of Jeff Chase, Duke University

25 25 CMPT 401 Summer 2007 © A. Fedorova Consensus (C) P i selects d i from {v 0, …, v N-1 }. All P i select the same v k (make the same decision) d i = v k Courtesy of Jeff Chase, Duke University

26 26 CMPT 401 Summer 2007 © A. Fedorova Conditions for Consensus Termination: All correct processes eventually decide. Agreement: All correct processes select the same d i. Integrity: If all correct processes propose the same v, then d i = v

27 27 CMPT 401 Summer 2007 © A. Fedorova Byzantine Generals Problem (BG) Two types of generals: the commander (leader) and the subordinates (lieutenants) The commander proposes an action v leader Subordinates must agree: each correct subordinate decides d i = v leader Courtesy of Jeff Chase, Duke University

28 28 CMPT 401 Summer 2007 © A. Fedorova Conditions for Consensus Termination: All correct processes eventually decide. Agreement: All correct processes select the same d i. Integrity: If the commander is correct then all correct processes decide on the value that the commander proposed

29 29 CMPT 401 Summer 2007 © A. Fedorova Interactive Consistency (IC) Each P i proposes a value v i P i decides on d i = [v 0, …, v N-1 ], a vector reflecting the values proposed by all correct participants All P i must decide on the same vector d i = [v 0, …, v N-1 ]

30 30 CMPT 401 Summer 2007 © A. Fedorova Conditions for Consensus Termination: All correct processes eventually decide. Agreement: The decision vector of all correct processes is the same Integrity: If P i is correct then all correct processes decide on v i as the ith component of their vector

31 31 CMPT 401 Summer 2007 © A. Fedorova Equivalence of IC and BG We will show that BG is equivalent to IC If there is a solution to one, there is a solution to the other Notation: –BG i (j, v) returns the decision value of p i when the commander p j proposed v –IC i (v 1, v 2, …, v N )[j] returns the jth value in the decision vector of p i in the solution to IC, where {v 1, v 2, …, v N } are the values that the processes proposed Our goal is to find a solution to IC given a solution to BG

32 32 CMPT 401 Summer 2007 © A. Fedorova Equivalence of IC and BG We run the BG problem N times Each time a different process p j acts as the commander and proposes its value v j –Recall that in IC each process proposes a value After each run of the BG problem we record BG i (j, v j ) for all i – that is, what each process decided when p j proposed v j –Similarity with IC: we record what each p i decided for vector position j We need to record decisions for N vector positions, so we run the problem N times

33 33 CMPT 401 Summer 2007 © A. Fedorova Equivalence of IC and BG (diagram: the decision vectors are filled in run by run) Initialization: every process starts with an empty decision vector [?, ?, ?] Run #1: p 0 proposes v 0 ; we record d 0 in position 0 of every process’s vector Run #2: p 1 proposes v 1 ; we record d 1 in position 1 Run #3: p 2 proposes v 2 ; we record d 2 in position 2
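The reduction on this slide can be written directly: run the BG algorithm once per commander and use run j to fill position j of every process's vector. A minimal Python sketch, where bg(j, v) is an assumed black box returning the list of per-process BG decisions when commander j proposes v:

```python
def ic_from_bg(bg, proposals):
    """Derive an Interactive Consistency solution from a BG solution (sketch).
    bg(j, v) is assumed to return a list where element i equals BG_i(j, v)."""
    n = len(proposals)
    vectors = [[None] * n for _ in range(n)]  # one decision vector per process
    for j in range(n):                        # run BG once per commander p_j
        decisions = bg(j, proposals[j])
        for i in range(n):
            vectors[i][j] = decisions[i]      # fill column j of every vector
    return vectors                            # vectors[i] is IC_i(v_1, ..., v_N)
```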

34 34 CMPT 401 Summer 2007 © A. Fedorova Consensus in a Synchronous System Without Failures Each process p i proposes a decision value v i All proposed v i are sent around, such that each process knows all proposed v i Once all processes have received all proposed v’s, they apply to them the same function, such as minimum(v 1, v 2, …, v N ) Each process p i sets d i = minimum(v 1, v 2, …, v N ) Consensus is reached What if processes fail? Can the other processes still reach agreement?
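Without failures the whole protocol is just: exchange proposals, then apply the same deterministic function everywhere. A sketch in Python (the exchange step itself is assumed to have already happened, so the function simply takes the collected proposals):

```python
def decide(proposals):
    """Sketch of failure-free synchronous consensus: every process ends up with
    the same set of proposals and applies the same deterministic function."""
    return min(proposals)   # each p_i sets d_i = minimum(v_1, ..., v_N)

# all processes compute the same decision from the same proposals
assert decide([7, 3, 9]) == 3
```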

35 35 CMPT 401 Summer 2007 © A. Fedorova Consensus in a Synchronous System With Failstop Failures We assume that at most f out of N processes fail To reach consensus despite f failures, we must extend the algorithm to take f+1 rounds In round 1: each process p i sends its proposed v i to all other processes and receives v’s from the other processes In each subsequent round: process p i sends the v’s that it has not sent before and receives new v’s The algorithm terminates after f+1 rounds Let’s see why it works…
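A sketch of the f+1-round algorithm from one process's point of view. Here exchange(values) is an assumed round primitive that broadcasts the given values and returns the set of values received in that round; the synchrony assumption is what makes such rounds well-defined even when up to f processes crash.

```python
def synchronous_consensus(my_value, f, exchange):
    """Sketch of the f+1-round failstop consensus algorithm (illustrative)."""
    known = {my_value}            # every value seen so far
    to_send = {my_value}          # values not yet forwarded to the others
    for _ in range(f + 1):        # f+1 rounds tolerate up to f crashes
        received = exchange(to_send)
        to_send = received - known   # forward only newly learned values next round
        known |= received
    return min(known)             # same deterministic choice at every correct process
```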

36 36 CMPT 401 Summer 2007 © A. Fedorova Consensus in a Synchronous System With Failstop Failures: Proof We prove it by contradiction Suppose some correct process p i possesses a value that another correct process p j does not possess This can only happen if some other process p k sent that value to p i but crashed before sending it to p j That crash must have happened in round f+1 (the last round); otherwise p i would have forwarded the value to p j in a later round But how did p j miss that value in every previous round? In each of those rounds, some process must have received the value and then crashed before passing it to p j That implies f+1 crashes – one in each of the f+1 rounds This is a contradiction: we assumed at most f failures

37 37 CMPT 401 Summer 2007 © A. Fedorova Consensus in a Synchronous System: Discussion Can this algorithm withstand other types of failures – omission failures, Byzantine failures? With omission failures, processes separated by a network partition can each agree on a separate value Let us look at consensus in the presence of Byzantine failures

38 38 CMPT 401 Summer 2007 © A. Fedorova Consensus in a Synchronous System With Byzantine Failures Byzantine failure: a process can forward to another process an arbitrary value v Byzantine generals: the commander says to one lieutenant that v = A, and to another lieutenant that v = B We will show that consensus is impossible with only 3 generals, one of them faulty Pease et al. generalized this: consensus is impossible with N ≤ 3f, where N is the number of generals and f the number of faulty ones

39 39 CMPT 401 Summer 2007 © A. Fedorova BG: Impossibility With Three Generals Scenario 1 (p 3 faulty): the commander p 1 sends 1:v to both lieutenants; p 3 relays 3:1:u to p 2 Scenario 2 (commander faulty): p 1 sends 1:w to p 2 and 1:x to p 3 ; p 2 relays 2:1:w to p 3 and p 3 relays 3:1:x to p 2 (“3:1:u” means “3 says 1 says u”; faulty processes are shown shaded in the original figure) In Scenario 1, p 2 must decide v (by the integrity condition) But p 2 cannot distinguish Scenario 1 from Scenario 2, so it will decide w in Scenario 2 By symmetry, p 3 will decide x in Scenario 2 p 2 and p 3 will have reached different decisions

40 40 CMPT 401 Summer 2007 © A. Fedorova Solution With Four Byzantine Generals We can reach consensus if there are 4 generals and at most 1 is faulty Intuition: use the majority rule – a correct process that hears conflicting reports asks “who is telling the truth?” and lets the majority decide

41 41 CMPT 401 Summer 2007 © A. Fedorova Solution With Four Byzantine Generals (diagrams: one run with a faulty lieutenant p 3, one run with a faulty commander p 1 ; faulty processes are shown shaded) Round 1: the commander sends v to all other generals Round 2: all generals exchange the values they received from the commander The decision is made based on the majority

42 42 CMPT 401 Summer 2007 © A. Fedorova Solution With Four Byzantine Generals Case 1: lieutenant p 3 is faulty The commander sends 1:v to p 2, p 3, and p 4 ; p 3 relays incorrect values (u to p 2, w to p 4 ) p 2 receives {v, v, u} and decides v p 4 receives {v, v, w} and decides v

43 43 CMPT 401 Summer 2007 © A. Fedorova Solution With Four Byzantine Generals Case 2: the commander p 1 is faulty and sends different values: u to p 2, w to p 3, and v to p 4 p 2 receives {u, w, v} and decides NULL p 3 receives {w, u, v} and decides NULL p 4 receives {u, v, w} and decides NULL The result generalizes to systems with N ≥ 3f + 1 (N is the number of processes, f is the number of faulty processes)
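The lieutenants' decision rule in the four-general case is a straightforward majority vote over the three values each one holds (the commander's value plus the two relayed values). A sketch in Python, with the function name chosen for illustration:

```python
from collections import Counter

def lieutenant_decision(reports):
    """Majority decision for 4 generals with at most 1 Byzantine fault (sketch).
    reports = [value from the commander, value relayed by each other lieutenant]."""
    value, count = Counter(reports).most_common(1)[0]
    return value if count >= 2 else None   # no majority -> decide NULL

assert lieutenant_decision(["v", "v", "u"]) == "v"   # faulty lieutenant case
assert lieutenant_decision(["u", "w", "v"]) is None  # faulty commander case
```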

44 44 CMPT 401 Summer 2007 © A. Fedorova Consensus in an Asynchronous System In the algorithms we’ve looked at, consensus is reached over several rounds of communication The systems were synchronous, so each round always terminated If a process had not received a message from another process in a given round, it could assume that the process was faulty In an asynchronous system this assumption cannot be made! Fischer, Lynch, Paterson (1985): No consensus can be guaranteed in an asynchronous communication system in the presence of even a single failure. Intuition: a “failed” process may just be slow, and can rise from the dead at exactly the wrong time.

45 45 CMPT 401 Summer 2007 © A. Fedorova Consensus in Practice Real distributed systems are by and large asynchronous How do they operate if consensus cannot be guaranteed? Fault masking: assume that failed processes always recover, and define a way to reintegrate them into the group. –If you haven’t heard from a process, just keep waiting… –A round terminates when every expected message is received. Failure detectors: construct a failure detector that can determine if a process has failed. –A round terminates when every expected message is received, or the failure detector reports that its sender has failed.

46 46 CMPT 401 Summer 2007 © A. Fedorova Fault Masking In a distributed system, a recovered node’s state must also be consistent with the states of other nodes. –Transaction processing systems record state to persistent storage, so they can recover after crash and continue as normal –What if a node has crashed before important state has been recorded on disk? A functioning node may need to respond to a peer’s recovery. –rebuild the state of the recovering node, and/or –discard local state, and/or –abort/restart operations/interactions in progress e.g., two-phase commit protocol

47 47 CMPT 401 Summer 2007 © A. Fedorova Failure Detectors First problem: how to detect that a member has failed? –pings, timeouts, beacons, heartbeats –recovery notifications Is the failure detector accurate? – Does it accurately detect failures? Is the failure detector live? – Are there bounds on failure detection time? In an asynchronous system, it is impossible for a failure detector to be both accurate and live

48 48 CMPT 401 Summer 2007 © A. Fedorova Failure Detectors in Real Systems Use a failure detector that is live but not accurate. –Assume bounded processing delays and delivery times. –Timeout with multiple retries detects failure accurately with high probability. Tune it to observed latencies. –If a “failed” site turns out to be alive, then restore it or kill it (fencing, fail-silent). What do we assume about communication failures? –How much pinging is enough? –Tune parameters for your system – can you predict how your system will behave under pressure? –That’s why distributed system engineers often participate in multi-day support calls… What about network partitions? –Processes form two independent groups and reach consensus independently. Rely on a quorum.
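A timeout-based detector of the kind described above is easy to sketch: it is live (every crashed peer is eventually suspected) but not accurate (a slow peer or a slow network can be suspected by mistake). The class below is an illustrative Python sketch, not taken from the lecture.

```python
import time

class TimeoutFailureDetector:
    """Live-but-not-accurate failure detector based on heartbeats (sketch)."""
    def __init__(self, timeout):
        self.timeout = timeout       # tuned to observed latencies
        self.last_heard = {}         # peer -> time of last heartbeat

    def on_heartbeat(self, peer):
        self.last_heard[peer] = time.monotonic()

    def suspected(self, peer):
        # a peer silent for longer than `timeout` is suspected, possibly wrongly
        last = self.last_heard.get(peer)
        return last is None or time.monotonic() - last > self.timeout
```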

49 49 CMPT 401 Summer 2007 © A. Fedorova Summary Coordination and agreement are essential in real distributed systems Real distributed systems are asynchronous Consensus cannot be guaranteed in an asynchronous distributed system with failures Nevertheless, people still build useful distributed systems that rely on consensus Fault recovery and masking are used as mechanisms for helping processes reach consensus Popular fault masking and recovery techniques are transactions and replication – the topics of the next few lectures

