CMPT 401 Summer 2007 Dr. Alexandra Fedorova Lecture IX: Coordination And Agreement.

Presentation transcript:

CMPT 401 Summer 2007 Dr. Alexandra Fedorova Lecture IX: Coordination And Agreement

2. A Replicated Service
[Figure: clients connect over a network to a group of servers (one master, several slaves). Labels: W write, W data replication, R read.]

3. A Need For Coordination And Agreement
[Figure: clients, network, servers (master and slaves).]
Must coordinate election of a new master
Must agree on a new master

4. Roadmap
Today we will discuss protocols for coordination and agreement
This is a difficult problem because of failures and the lack of a bound on message delay
We will begin with a strong set of assumptions (assume few failures), and then we will relax those assumptions
We will look at several problems requiring communication and agreement: distributed mutual exclusion, election
Finally, we will learn that in an asynchronous distributed system consensus cannot be guaranteed if processes may fail

5. Distributed Mutual Exclusion (DMTX)
Similar to a local mutual exclusion problem
Processes in a distributed system share a resource
Only one process can access the resource at a time
Examples:
  – File sharing
  – Sharing a bank account
  – Updating a shared database

6. Assumptions and Requirements
Assumptions:
  – An asynchronous system
  – Processes do not fail
  – Message delivery is reliable (exactly once)
Protocol requirements:
  – Safety: At most one process may execute in the critical section at a time
  – Liveness: Requests to enter and exit the critical section eventually succeed
  – Fairness: Requests to enter the critical section are granted in the order in which they were received

7. Evaluation Criteria of DMTX Algorithms
Bandwidth consumed
  – proportional to the number of messages sent in each entry and exit operation
Client delay
  – delay incurred by a process at each entry and exit operation
System throughput
  – the rate at which processes can access the critical section (number of accesses per unit of time)

8. DMTX Algorithms
We will consider the following algorithms:
  – Central server algorithm
  – Ring-based algorithm
  – An algorithm based on voting

9. The Central Server Algorithm

10. The Central Server Algorithm
Performance:
  – Entering a critical section takes two messages (a request message followed by a grant message)
  – System throughput is limited by the synchronization delay at the server (the time between the release message to the server and the grant message to the next client)
Fault tolerance:
  – Does not tolerate failures
  – What if the client holding the token fails?
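The request/grant/release exchange described above can be sketched as a small in-memory model. This is an illustrative sketch only: the class name `CentralLockServer` and its methods are hypothetical, messages are modelled as method calls, and failures are ignored (as the slide notes, the algorithm does not tolerate them).

```python
from collections import deque


class CentralLockServer:
    """Hypothetical model: grants one token at a time, queueing other requests FIFO."""

    def __init__(self):
        self.holder = None       # client currently holding the token
        self.waiting = deque()   # clients waiting for the token, in arrival order

    def request(self, client):
        """Handle a request message; return True if the grant is sent immediately."""
        if self.holder is None:
            self.holder = client
            return True
        self.waiting.append(client)
        return False

    def release(self, client):
        """Handle a release message; return the next client granted the token, if any."""
        if client != self.holder:
            raise ValueError("only the token holder may release")
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder
```

Serving the queue in arrival order is what gives fairness here; the synchronization delay is the release message plus the grant message to the next client.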

11. A Ring-Based Algorithm

12. A Ring-Based Algorithm (cont.)
Processes are arranged in a ring
There is a communication channel from process p_i to process p_((i+1) mod N)
They continuously pass the mutual exclusion token around the ring
A process that does not need to enter the critical section (CS) passes the token along
A process that needs to enter the CS retains the token; once it exits the CS, it passes the token on
No fault tolerance
Excessive bandwidth consumption (the token circulates even when no process wants the CS)
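A minimal simulation of the token circulation, assuming no failures and a fixed set of processes that each want the critical section once. The function name `circulate_token` and its parameters are hypothetical, chosen for illustration.

```python
def circulate_token(n, wants_cs, start=0):
    """Simulate a token moving around a ring of n processes (ids 0..n-1).

    wants_cs: ids of processes that want to enter the critical section once.
    Returns (order in which processes entered the CS, number of token hops).
    """
    pending = set(wants_cs)
    if any(p < 0 or p >= n for p in pending):
        raise ValueError("process ids must be in range(n)")
    entries, hops, holder = [], 0, start
    while pending:
        if holder in pending:
            # The holder keeps the token while it is in the critical section.
            entries.append(holder)
            pending.remove(holder)
        else:
            # Not interested: pass the token to the next process on the ring.
            holder = (holder + 1) % n
            hops += 1
    return entries, hops
```

Note that entry order follows ring position rather than request time, and that in the real algorithm the token keeps moving even when `pending` is empty, which is the source of the bandwidth cost.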

13. Maekawa’s Voting Algorithm
To enter a critical section a process must receive permission from a subset of its peers
Processes are organized in voting sets
Each process is a member of M voting sets
All voting sets are of equal size (for fairness)

14. Maekawa’s Voting Algorithm
[Figure: processes p1, p2, p3, p4.]
Intersection of voting sets guarantees mutual exclusion
To avoid deadlock, requests to enter the critical section must be ordered
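One simple way to obtain voting sets with the intersection property is a grid layout. This construction is an assumption for illustration, not taken from the slides (and it is not Maekawa's optimal construction): place N = k*k processes in a k-by-k grid and let each process's voting set be its row plus its column. The helper names are hypothetical.

```python
from itertools import combinations


def grid_voting_sets(k):
    """Voting set of process r*k + c is the union of grid row r and grid column c."""
    sets = {}
    for r in range(k):
        for c in range(k):
            row = {r * k + j for j in range(k)}
            col = {i * k + c for i in range(k)}
            sets[r * k + c] = row | col
    return sets


def all_pairs_intersect(sets):
    """Check the property that guarantees mutual exclusion: any two sets overlap."""
    return all(sets[a] & sets[b] for a, b in combinations(sets, 2))
```

With this layout every voting set has size 2k - 1 and every process belongs to 2k - 1 sets, so the equal-size requirement holds as well. Any two sets intersect because one set's row always crosses the other's column.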

15. Elections
Election algorithms are used when a unique process must be chosen to play a particular role:
  – Master in a master-slave replication system
  – Central server in the DMTX protocol
We will look at the bully election algorithm
The bully algorithm tolerates failstop failures
But it works only in a synchronous system with reliable messaging

16. The Bully Election Algorithm
All processes are assigned identifiers
The system always elects a coordinator with the highest identifier:
  – Each process must know all processes with higher identifiers than its own
Three types of messages:
  – election: a process begins an election
  – answer: a process acknowledges the election message
  – coordinator: an announcement of the identity of the elected process

17. The Bully Election Algorithm (cont.)
Initiation of election:
  – Process p_1 detects that the existing coordinator p_4 has crashed and initiates the election
  – p_1 sends an election message to all processes with a higher identifier than its own
[Figure: p_1 sends election messages to p_2, p_3, p_4.]

18. The Bully Election Algorithm (cont.)
What happens if there are no further crashes:
  – p_2 and p_3 receive the election message from p_1, send back the answer message to p_1, and begin their own elections
  – p_3 sends answer to p_2
  – p_3 receives no answer message from p_4, so after a timeout it elects itself as the leader (knowing it has the highest ID among live processes)
[Figure: election, answer, and coordinator messages among p_1, p_2, p_3, p_4.]

19. The Bully Election Algorithm (cont.)
What happens if p_3 also crashes after sending the answer message but before sending the coordinator message?
In that case, p_2 will time out while waiting for the coordinator message and will start a new election
[Figure: election and answer messages among p_1, p_2, p_3, p_4; p_2 starts a new election.]

20. The Bully Election Algorithm (summary)
The algorithm does not require a central server
Does not require knowing identities of all the processes
Does require knowing identities of processes with higher IDs
Survives crashes
Assumes a synchronous system (relies on timeouts)
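The message flow of the bully algorithm can be sketched as a small simulation. This is a simplified model under stated assumptions: the set of live processes is fixed for the duration of the election (so the case where a process crashes mid-election is not modelled), and a missing answer stands in for a timeout. The function name `bully_election` and the tuple format of the log are hypothetical.

```python
def bully_election(all_ids, alive, initiator):
    """Return (coordinator, message_log) for one election started by `initiator`.

    A missing answer is treated as a crash, which is only sound under the
    synchronous-system assumption (in practice a timeout plays this role).
    Log entries are (message_type, sender, receiver).
    """
    if initiator not in alive:
        raise ValueError("the initiator must be alive")
    log = []
    coordinator = None
    pending = [initiator]   # processes that still have to run their own election
    started = set()
    while pending:
        p = pending.pop(0)
        if p in started:
            continue
        started.add(p)
        got_answer = False
        for q in sorted(i for i in all_ids if i > p):
            log.append(("election", p, q))
            if q in alive:
                # A live higher process answers and begins its own election.
                log.append(("answer", q, p))
                pending.append(q)
                got_answer = True
        if not got_answer:
            # Nobody with a higher id is alive: p announces itself.
            coordinator = p
            for q in sorted(i for i in alive if i < p):
                log.append(("coordinator", p, q))
    return coordinator, log
```

For the scenario on the slides, `bully_election([1, 2, 3, 4], {1, 2, 3}, 1)` elects process 3, and the log contains the coordinator announcements from 3 to 1 and 2.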

21. Consensus in Asynchronous Systems With Failures
The algorithms we’ve covered have limitations:
  – Either tolerate only limited failures (failstop)
  – Or assume a synchronous system
Consensus cannot be guaranteed in an asynchronous system with failures
Next we will see why…

22. Consensus
All processes agree on the same value (or set of values)
When do you need consensus?
  – Leader (master) election
  – Mutual exclusion
  – Transaction involving multiple parties (banking)
We will look at several variants of the consensus problem:
  – Consensus
  – Byzantine generals
  – Interactive consistency

23. System Model
There is a set of processes P_i
There is a set of values {v_0, …, v_(N-1)} proposed by the processes
Each process P_i decides on d_i
d_i belongs to the set {v_0, …, v_(N-1)}
Assumptions:
  – Synchronous system (for now)
  – Failstop failures
  – Byzantine failures
  – Reliable channels

24. Consensus
Step 1: Propose. P_1, P_2, P_3 submit their values v_1, v_2, v_3 to the consensus algorithm.
Step 2: Decide. P_1, P_2, P_3 obtain their decisions d_1, d_2, d_3.
Courtesy of Jeff Chase, Duke University

25. Consensus (C)
P_i selects d_i from {v_0, …, v_(N-1)}.
All P_i select the same v_k (make the same decision): d_i = v_k
Courtesy of Jeff Chase, Duke University

26. Conditions for Consensus
Termination: All correct processes eventually decide.
Agreement: All correct processes select the same d_i.
Integrity: If all correct processes propose the same v, then d_i = v for every correct process P_i

27. Byzantine Generals Problem (BG)
Two types of generals: a commander (leader) and subordinates (lieutenants)
The commander proposes an action v_leader.
Subordinates must agree: d_i = d_j = v_leader
Courtesy of Jeff Chase, Duke University

28. Conditions for Consensus
Termination: All correct processes eventually decide.
Agreement: All correct processes select the same d_i.
Integrity: If the commander is correct, then all correct processes decide on the value that the commander proposed

29. Interactive Consistency (IC)
Each P_i proposes a value v_i
P_i selects d_i = [v_0, …, v_(N-1)], a vector reflecting the values proposed by all correct participants.
All P_i must decide on the same vector d_i = [v_0, …, v_(N-1)]

30. Conditions for Consensus
Termination: All correct processes eventually decide.
Agreement: The decision vector of all correct processes is the same
Integrity: If P_i is correct, then all correct processes decide on v_i as the i-th component of their vector

31. Equivalence of IC and BG
We will show that BG is equivalent to IC: if there is a solution to one, there is a solution to the other
Notation:
  – BG_i(j, v) returns the decision value of p_i when the commander p_j proposed v
  – IC_i(v_1, v_2, …, v_N)[j] returns the j-th value in the decision vector of p_i in the solution to IC, where {v_1, v_2, …, v_N} are the values that the processes proposed
Our goal is to find a solution to IC given a solution to BG

32. Equivalence of IC and BG
We run the BG problem N times
In run j the commander is p_j, and it proposes its value v_j
  – Recall that in IC each process proposes a value
After each run of the BG problem we record BG_i(j, v_j) for all i, that is, what each process decided when p_j proposed v_j
  – Similarity with IC: we record what each p_i decided for vector position j
We need to record decisions for N vector positions, so we run the problem N times

33. Equivalence of IC and BG
Decision vector held by every process p after each step (? = not yet recorded):
  Initialization: empty decision vectors                  [ ?,   ?,   ?   ]
  Run #1: P_0 proposes v_0; we record d_0 for all p       [ d_0, ?,   ?   ]
  Run #2: P_1 proposes v_1; we record d_1 for all p       [ d_0, d_1, ?   ]
  Run #3: P_2 proposes v_2; we record d_2 for all p       [ d_0, d_1, d_2 ]
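The reduction from IC to BG can be written down in a few lines, treating the BG solution as a black box. In this sketch `bg(i, j, v)` is a hypothetical stand-in for BG_i(j, v), and `honest_bg` is a trivial failure-free placeholder used only to exercise the construction; neither is a real Byzantine agreement implementation.

```python
def ic_from_bg(proposals, bg):
    """Build every process's IC decision vector from N runs of a BG solution.

    proposals: list [v_0, ..., v_(N-1)], where proposals[j] is proposed by p_j.
    bg(i, j, v): decision of p_i in the run where commander p_j proposes v.
    Returns a list whose i-th element is the decision vector of p_i.
    """
    n = len(proposals)
    # Run j fills position j of every process's vector.
    return [[bg(i, j, proposals[j]) for j in range(n)] for i in range(n)]


def honest_bg(i, j, v):
    """Placeholder BG 'solution' for a run with no faulty process."""
    return v
```

Agreement in BG for each run j gives agreement on vector position j, so all correct processes end up with the same vector.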

34. Consensus in a Synchronous System Without Failures
Each process p_i proposes a decision value v_i
All proposed v_i are sent around, such that each process knows all proposed v_i
Once all processes receive all proposed v’s, they apply the same function to them, such as minimum(v_1, v_2, …, v_N)
Each process p_i sets d_i = minimum(v_1, v_2, …, v_N)
The consensus is reached
What if processes fail? Can other processes still reach an agreement?
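As a minimal sketch of the failure-free case, the exchange can be collapsed into one step, assuming reliable channels and no crashes so that every process ends up with the full set of proposals. The function name is hypothetical.

```python
def consensus_no_failures(proposals):
    """proposals: dict mapping process name -> proposed value.

    Returns a dict mapping process name -> decision value.
    """
    # All-to-all exchange: with reliable channels and no failures,
    # every process learns every proposed value.
    known = {p: set(proposals.values()) for p in proposals}
    # Every process applies the same deterministic function (here: minimum).
    return {p: min(values) for p, values in known.items()}
```

Any deterministic function of the full set would do; minimum is the example the slide uses.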

35. Consensus in a Synchronous System With Failstop Failures
We assume that at most f out of N processes fail
To reach a consensus despite f failures, we must extend the algorithm to take f+1 rounds
In round 1, each process p_i sends its proposed v_i to all other processes and receives v’s from other processes
In each subsequent round, process p_i sends the v’s that it has not sent before and receives new v’s
The algorithm terminates after f+1 rounds
Let’s see why it works…

36. Consensus in a Synchronous System With Failstop Failures: Proof
We will prove this by contradiction
Suppose that at the end some correct process p_i possesses a value that another correct process p_j does not possess
This can only happen if some other process p_k sent that value to p_i but crashed before sending it to p_j
That crash must have happened in round f+1 (the last round); had p_i received the value earlier, it would have sent it to p_j by round f+1
But why did p_j not receive that value in any of the previous rounds? Only if in every previous round there was a crash: some process sent the value to some other processes, but crashed before sending it to p_j
This implies that there must have been f+1 crashes
This is a contradiction: we assumed at most f failures
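The f+1 round algorithm and the role of the extra round can be checked with a small simulation. This is a sketch under assumptions: rounds are lock-step, a crash is described by a hypothetical `crashes` map giving the round in which a process crashes and the subset of recipients it still reaches in that round, and the decision function is minimum.

```python
def flood_consensus(proposals, f, crashes=None):
    """Simulate f+1 rounds of value flooding with failstop crashes.

    proposals: dict process -> proposed value.
    crashes:   dict process -> (round, recipients reached before crashing).
    Returns dict of surviving process -> decision (minimum of known values).
    """
    crashes = crashes or {}
    procs = sorted(proposals)
    known = {p: {proposals[p]} for p in procs}
    sent = {p: set() for p in procs}
    crashed = set()
    for rnd in range(1, f + 2):               # rounds 1 .. f+1
        inbox = {p: set() for p in procs}
        for p in procs:
            if p in crashed:
                continue
            new = known[p] - sent[p]          # only values not sent before
            if p in crashes and crashes[p][0] == rnd:
                recipients = crashes[p][1]    # partial send, then crash
                crashed.add(p)
            else:
                recipients = set(procs) - {p}
            for q in recipients:
                inbox[q] |= new
            sent[p] |= new
        for p in procs:
            if p not in crashed:
                known[p] |= inbox[p]
    return {p: min(known[p]) for p in procs if p not in crashed}
```

In the example where p1 proposes 1 and crashes in round 1 after reaching only p2, a single round leaves p2 and p3 with different minima, while two rounds (f = 1) let p2 forward the value so both decide 1.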

37. Consensus in a Synchronous System: Discussion
Can this algorithm withstand other types of failures, such as omission failures or byzantine failures?
Let us look at consensus in the presence of byzantine failures
Processes separated by a network partition: each group can agree on a separate value

38. Consensus in a Synchronous System With Byzantine Failures
Byzantine failure: a process can forward to another process an arbitrary value v
Byzantine generals: the commander says to one lieutenant that v = A, and says to another lieutenant that v = B
We will show that consensus is impossible with only 3 generals if one of them is faulty
Pease et al. generalized this: consensus is impossible with N ≤ 3f, where N is the number of generals and f is the number of faulty generals

39. BG: Impossibility With Three Generals
Notation: “3:1:u” means “3 says 1 says u”; faulty processes are shown shaded in the original figure
Scenario 1 messages: p_1 (commander) sends 1:v; p_2 sends 2:1:v; p_3 sends 3:1:u
Scenario 2 messages: p_1 (commander) sends 1:w to p_2 and 1:x to p_3; p_2 sends 2:1:w; p_3 sends 3:1:x
Scenario 1: p_2 must decide v (by the integrity condition)
But p_2 cannot distinguish between Scenario 1 and Scenario 2, so it will decide w in Scenario 2
By symmetry, p_3 will decide x in Scenario 2
p_2 and p_3 will have reached different decisions

40. Solution With Four Byzantine Generals
We can reach consensus if there are 4 generals and at most 1 is faulty
Intuition: use the majority rule
[Figure: a correct process asks “Who is telling the truth?” Majority rules!]

41. Solution With Four Byzantine Generals
Round 1: The commander sends its value to all other generals
Round 2: All generals exchange the values that they received from the commander
The decision is made based on majority
Left figure: p_1 (commander) sends 1:v to p_2, p_3, p_4; p_2 relays 2:1:v; p_4 relays 4:1:v; p_3 relays 3:1:u to p_2 and 3:1:w to p_4
Right figure: p_1 (commander) sends 1:u to p_2, 1:w to p_3, 1:v to p_4; p_2 relays 2:1:u; p_3 relays 3:1:w; p_4 relays 4:1:v
Faulty processes are shown shaded in the original figures

42. Solution With Four Byzantine Generals
Messages: p_1 (commander) sends 1:v to all; p_2 relays 2:1:v; p_4 relays 4:1:v; p_3 relays 3:1:u to p_2 and 3:1:w to p_4
p_2 receives: {v, v, u}. Decides v
p_4 receives: {v, v, w}. Decides v

43. Solution With Four Byzantine Generals
Messages: p_1 (commander) sends 1:u to p_2, 1:w to p_3, 1:v to p_4; p_2 relays 2:1:u; p_3 relays 3:1:w; p_4 relays 4:1:v
p_2 receives: {u, w, v}. Decides NULL
p_4 receives: {u, v, w}. Decides NULL
p_3 receives: {w, u, v}. Decides NULL
The result generalizes to systems with N ≥ 3f + 1 (N is the number of processes, f is the number of faulty processes)
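The decision step used in the four-generals examples is a strict-majority vote over the value received from the commander and the values relayed by the other lieutenants. A minimal sketch follows; the function names are hypothetical, and Python's `None` stands in for the NULL decision.

```python
from collections import Counter


def majority(values):
    """Return the value held by a strict majority of `values`, or None if there is none."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


def lieutenant_decision(from_commander, relayed):
    """Decide from the commander's value plus the values relayed by the other lieutenants."""
    return majority([from_commander] + list(relayed))
```

With the values from the slides, `lieutenant_decision("v", ["v", "u"])` yields `"v"`, while three distinct values yield `None`, so all correct lieutenants still agree (on NULL) when the commander is faulty.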

44. Consensus in an Asynchronous System
In the algorithms we’ve looked at, consensus has been reached by using several rounds of communication
The systems were synchronous, so each round always terminated
If a process has not received a message from another process in a given round, it could assume that the process is faulty
In an asynchronous system this assumption cannot be made!
Fischer-Lynch-Paterson (1985): No consensus can be guaranteed in an asynchronous communication system in the presence of any failures (even a single crashed process).
Intuition: a “failed” process may just be slow, and can rise from the dead at exactly the wrong time.

45. Consensus in Practice
Real distributed systems are by and large asynchronous
How do they operate if consensus cannot be reached?
Fault masking: assume that failed processes always recover, and define a way to reintegrate them into the group.
  – If you haven’t heard from a process, just keep waiting…
  – A round terminates when every expected message is received.
Failure detectors: construct a failure detector that can determine if a process has failed.
  – A round terminates when every expected message is received, or the failure detector reports that its sender has failed.

46. Fault Masking
In a distributed system, a recovered node’s state must also be consistent with the states of other nodes.
  – Transaction processing systems record state to persistent storage, so they can recover after a crash and continue as normal
  – What if a node has crashed before important state has been recorded on disk?
A functioning node may need to respond to a peer’s recovery:
  – rebuild the state of the recovering node, and/or
  – discard local state, and/or
  – abort/restart operations/interactions in progress, e.g., two-phase commit protocol

47. Failure Detectors
First problem: how to detect that a member has failed?
  – pings, timeouts, beacons, heartbeats
  – recovery notifications
Is the failure detector accurate?
  – Does it accurately detect failures?
Is the failure detector live?
  – Are there bounds on failure detection time?
In an asynchronous system, it is impossible for a failure detector to be both accurate and live

48. Failure Detectors in Real Systems
Use a failure detector that is live but not accurate.
  – Assume bounded processing delays and delivery times.
  – Timeout with multiple retries detects failure accurately with high probability. Tune it to observed latencies.
  – If a “failed” site turns out to be alive, then restore it or kill it (fencing, fail-silent).
What do we assume about communication failures?
  – How much pinging is enough?
  – Tune parameters for your system: can you predict how your system will behave under pressure?
  – That’s why distributed system engineers often participate in multi-day support calls…
What about network partitions?
  – Processes form two independent groups, reach consensus independently. Rely on quorum.
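A "timeout with multiple retries" detector can be sketched as follows. This is an illustrative sketch: `ping` is a hypothetical caller-supplied function that sends one probe and returns whether a reply arrived within the given timeout, and the default values for `retries` and `timeout` are placeholders to be tuned to observed latencies.

```python
def suspect(ping, retries=3, timeout=0.5):
    """Return True if the peer is suspected to have failed.

    ping(timeout) -> bool is assumed to send one probe and report whether a
    reply arrived within `timeout` seconds. The detector is live (it answers
    after at most retries * timeout) but not accurate: a slow peer can be
    suspected even though it is alive.
    """
    for _ in range(retries):
        if ping(timeout):
            return False    # any reply clears the suspicion
    return True             # every probe timed out
```

A peer that is wrongly suspected must then be handled as described above, either by restoring it or by fencing it off.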

49. Summary
Coordination and agreement are essential in real distributed systems
Real distributed systems are asynchronous
Consensus cannot be guaranteed in an asynchronous distributed system with failures
Nevertheless, people still build useful distributed systems that rely on consensus
Fault recovery and masking are used as mechanisms for helping processes reach consensus
Popular fault masking and recovery techniques are transactions and replication, the topics of the next few lectures