The consensus problem in distributed systems

These slides are based on Professor Ken Birman’s slides

State machine replication
A system can be regarded as a deterministic state machine. The state machine has a current state; it performs a step by taking a command as input and producing an output and a new state. To make the system reliable, it needs to be replicated, so a replicated system can be represented as a collection of state machines. State machine replication requires that every machine execute the same commands in the same order to keep the state consistent across all nodes in the replicated system. Agreeing on which command to execute next is a consensus problem.
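The idea can be illustrated with a minimal sketch. The `Counter` machine and its `add`/`mul` commands are hypothetical and chosen only for illustration: replicas that apply the same command log end in the same state, while a replica that applies the commands in a different order may diverge.

```python
from dataclasses import dataclass


@dataclass
class Counter:
    """A toy deterministic state machine whose state is one integer."""
    state: int = 0

    def step(self, command):
        """Apply one (op, arg) command, update the state, return the output."""
        op, arg = command
        if op == "add":
            self.state += arg
        elif op == "mul":
            self.state *= arg
        else:
            raise ValueError(f"unknown command: {op}")
        return self.state


def run(log):
    """Start a fresh replica and apply every command in the log, in order."""
    machine = Counter()
    for command in log:
        machine.step(command)
    return machine.state


log = [("add", 2), ("mul", 5), ("add", 1)]
replica_states = [run(log) for _ in range(3)]        # same log on 3 replicas
reordered_state = run([("mul", 5), ("add", 2), ("add", 1)])
print(replica_states, reordered_state)
```

The three replicas agree because they consume the same log; the reordered run ends in a different state, which is why the replicas must first agree on the next command.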

The consensus problem
There are N nodes in the system.
Each node starts with an input in {0,1}.
The network is asynchronous but reliable: messages can take arbitrarily long to be delivered, but they are not lost.
Nodes operate at arbitrary speeds, may fail by stopping (crash failure), and may restart.
At most one node fails.
Goal: all nodes decide the same value v, where v was the input of some node.
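The goal can be written as a small checker. This is only a sketch: the function name `satisfies_consensus` and the dictionary representation are assumptions made for illustration. It checks agreement (all decided nodes chose the same value) and validity (the value was some node's input); it says nothing about termination.

```python
def satisfies_consensus(inputs, decisions):
    """Check agreement and validity for one finished run.

    inputs:    dict mapping node name -> input value (0 or 1)
    decisions: dict mapping node name -> decided value; nodes that have
               not decided are simply absent
    """
    decided_values = set(decisions.values())
    agreement = len(decided_values) <= 1                 # at most one value
    validity = decided_values <= set(inputs.values())    # value was an input
    return agreement and validity


inputs = {"a": 0, "b": 1, "c": 1}
print(satisfies_consensus(inputs, {"a": 1, "b": 1, "c": 1}))
print(satisfies_consensus(inputs, {"a": 0, "b": 1}))
```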

Fault-tolerant consensus protocol
1. Collect votes from all N nodes.
2. Wait for a majority of the nodes to respond, then tell everyone the outcome (choose the output value).
3. Nodes “decide” (i.e. they accept the outcome).
There is a problem if a message is delayed or a node restarts after a failure.
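One way the problem can show up is sketched below. The `decide` helper, the node names, and the tie-breaking rule are assumptions for illustration, not part of the protocol described on the slide. The outcome depends on which majority of votes happens to arrive first, so a collector that restarts and gathers votes again may compute a different outcome from the one it announced earlier.

```python
from collections import Counter


def decide(votes_received, n):
    """Return the majority value once more than n/2 votes have arrived.

    Returns None while a majority of the n nodes has not responded yet.
    Ties are broken toward 0; that rule is an assumption of this sketch.
    """
    if len(votes_received) <= n // 2:
        return None
    counts = Counter(votes_received.values())
    return 1 if counts[1] > counts[0] else 0


inputs = {"a": 0, "b": 1, "c": 1, "d": 0, "e": 1}
n = len(inputs)

# First attempt: the votes from c and e are delayed.
outcome_first = decide({k: inputs[k] for k in ("a", "b", "d")}, n)
# After a restart: the votes from a and d are delayed instead.
outcome_second = decide({k: inputs[k] for k in ("b", "c", "e")}, n)
print(outcome_first, outcome_second)
```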

FLP Impossibility of Consensus
A surprising result: in an asynchronous model where even a single node might crash, no deterministic fault-tolerant distributed algorithm solves the consensus problem.
Fischer, Lynch, and Paterson prove that no such algorithm can guarantee termination in the presence of crash faults.
This is true even if no crash actually occurs.
The proof constructs infinite, non-terminating runs.

Intuition of FLP
A system tries to agree on which command to execute next.
1. Node p’s messages are delayed in transmission.
2. p is regarded as failed.
3. Since the system is fault-tolerant, it should adapt and move on to reach a decision, just as it would if p had crashed.
4. Before the decision is finally reached, p’s messages arrive, so p has to be included in the decision making.
We are back at the beginning (step 1). This takes time, and no real progress occurs between steps 1 and 4.

Overview of FLP
Each node p has a state:
  program counter, registers, stack, local variables
  input register xp: initially either 0 or 1
  output register yp: initially b (undecided)
Configuration = global state:
  the collection of all nodes’ states
  the state of the global message buffer
A node’s state changes when it consumes a message.
A configuration C is bivalent if, from C, the final chosen value could be either 0 or 1.
A configuration C is univalent if, from C, the final chosen value can only be 0 (0-valent) or can only be 1 (1-valent).
Bivalent means that the outcome is not yet determined.
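The definitions of bivalent and univalent can be made concrete on a toy example. The transition graph, the configuration names (`C0`, `A`, `B`, `D0`, `D1`), and the helper functions below are all hypothetical; a real protocol has a vastly larger configuration space. The sketch classifies a configuration by the set of decision values reachable from it.

```python
def reachable_decisions(config, successors, decision):
    """Return the set of decision values reachable from a configuration."""
    seen, stack, found = set(), [config], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in decision:          # a configuration where a node decided
            found.add(decision[current])
        stack.extend(successors.get(current, []))
    return found


def valency(config, successors, decision):
    """Classify a configuration as bivalent, 0-valent, or 1-valent."""
    values = reachable_decisions(config, successors, decision)
    if values == {0, 1}:
        return "bivalent"
    if values == {0}:
        return "0-valent"
    if values == {1}:
        return "1-valent"
    return "no decision reachable"


# Toy graph: each edge is one step (a node consuming one message).
successors = {"C0": ["A", "B"], "A": ["D0"], "B": ["D1"]}
decision = {"D0": 0, "D1": 1}
for name in ("C0", "A", "B"):
    print(name, valency(name, successors, decision))
```

In this toy graph, configuration `A` is already 0-valent although no node has decided in it yet: the outcome is fixed one step before the decision is actually made.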

From an initially bivalent configuration, there is an execution that leads to a decision, say “0”.
At some step of this execution, the configuration switches from bivalent to univalent when one of the nodes receives a message m.
The proof studies the executions in which m is delayed.
It shows that, if the protocol is fault-tolerant, there must be a run that leads to the other univalent state.
It then shows that m can be delivered in this run without a decision being made (i.e. the system is back in a bivalent configuration).

The meaning of “impossibility”
In formal proofs, an algorithm is totally correct if it is safe and it always terminates.
FLP proves that any fault-tolerant algorithm solving consensus has runs that never terminate.
These runs are extremely unlikely (“probability zero”).
Nevertheless, their existence means that a totally correct solution to the consensus problem is impossible.
“Impossible” here means that consensus is not always possible, not that it can never be reached.

Paxos Algorithm
A distributed consensus algorithm.
Key assumptions:
  There are n nodes, and the set of nodes is known a priori.
  Nodes suffer crash failures and can restart after a failure.
  The network might be very slow.
Guarantees safety:
  Only a single value is chosen.
  Only a proposed value can be chosen.
  A process never learns that a value has been chosen unless it actually has been.
Cannot guarantee liveness.

An overview of Paxos
Nodes make proposals.
Each proposal is associated with a version number.
A proposal only needs to be sent to a majority of the nodes.
A proposal accepted by a majority of the nodes is passed; its value becomes the consensus value.
A node always accepts the proposal with the larger version number.

Details of Paxos
3 roles: proposer, acceptor, learner
2 phases:
  Phase 1: prepare request
  Phase 2 (if the proposer gets positive replies from a majority of the nodes): accept request

Phase 1: (prepare request)
A proposer chooses a new proposal version number n and sends a prepare request (“prepare”, n) to a majority of acceptors:
  Can I make a proposal with number n?
  If yes, do you suggest a value for my proposal?

When an acceptor receives a prepare request (“prepare”, n) where n is greater than the version number of any prepare request it has already responded to, the acceptor sends out (“ack”, n, n’, v’) or (“ack”, n, -, -).
The response is a promise not to accept any proposal with a version number less than n.
The response suggests the value v’ of the highest-numbered proposal (numbered n’) that the acceptor has accepted, if any; otherwise it carries -.
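The acceptor's side of phase 1 might be sketched as follows. The class layout and field names are assumptions for illustration, and `None` stands in for the “-” placeholder in the ack message. Stale prepare requests are simply ignored here; replying with a rejection would be an equally valid design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Acceptor:
    promised: int = -1                   # highest prepare number answered
    accepted_n: Optional[int] = None     # number of the last accepted proposal
    accepted_v: Optional[str] = None     # value of the last accepted proposal

    def on_prepare(self, n):
        """Handle ("prepare", n); return an ack tuple or None (ignored)."""
        if n <= self.promised:
            return None                  # already answered an equal or higher n
        self.promised = n                # promise: accept nothing below n
        return ("ack", n, self.accepted_n, self.accepted_v)


fresh = Acceptor()
first_ack = fresh.on_prepare(5)          # nothing accepted yet
stale_ack = fresh.on_prepare(3)          # lower than the promise of 5

busy = Acceptor(promised=2, accepted_n=2, accepted_v="x")
ack_with_value = busy.on_prepare(4)      # reports the accepted proposal
print(first_ack, stale_ack, ack_with_value)
```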

Phase 2: (accept request)
If the proposer receives responses from a majority of the acceptors, it can issue an accept request (“accept”, n, v) with version number n and value v:
  n is the number that appeared in the prepare request.
  v is the value of the highest-numbered proposal among the responses, if any response carried one; otherwise the proposer may use its own value.
If an acceptor receives an accept request (“accept”, n, v), it accepts the proposal unless it has already responded to a prepare request with a version number greater than n.
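Both halves of phase 2 can be sketched in a few lines. The names `choose_value` and `Acceptor`, and the use of `None` for the “-” placeholder, are assumptions for illustration; this is a simplified single-process sketch, not a networked implementation.

```python
def choose_value(acks, own_value):
    """Pick the value for the accept request from the phase 1 responses.

    acks: list of ("ack", n, n_prev, v_prev) tuples; n_prev and v_prev are
    None when the acceptor had not accepted anything yet.
    """
    reported = [(n_prev, v_prev) for (_, _, n_prev, v_prev) in acks
                if n_prev is not None]
    if not reported:
        return own_value                 # no constraint: use the own value
    return max(reported, key=lambda pair: pair[0])[1]


class Acceptor:
    def __init__(self):
        self.promised = -1               # highest prepare number answered
        self.accepted = None             # (n, v) of the last accepted proposal

    def on_prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True
        return False

    def on_accept(self, n, v):
        """Accept unless a prepare with a number greater than n was answered."""
        if n < self.promised:
            return False
        self.accepted = (n, v)
        return True


acks = [("ack", 7, None, None), ("ack", 7, 3, "x"), ("ack", 7, 5, "y")]
value = choose_value(acks, own_value="mine")

acceptor = Acceptor()
acceptor.on_prepare(7)
rejected = acceptor.on_accept(5, "z")    # 5 < 7: refused
accepted = acceptor.on_accept(7, value)  # matches the promise: accepted
print(value, rejected, accepted, acceptor.accepted)
```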

Learning a Chosen Value
When an acceptor accepts a proposal, it tells all learners (“accept”, n, v). The scheme can be optimised to reduce the number of messages
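A learner can recognise a chosen value by counting which acceptors have reported the same proposal. The sketch below is a minimal illustration; the `Learner` class and the `on_accepted` method name are hypothetical, and acceptors are identified by plain strings.

```python
from collections import defaultdict


class Learner:
    def __init__(self, n_acceptors):
        self.majority = n_acceptors // 2 + 1
        self.accepted_by = defaultdict(set)   # (n, v) -> acceptor ids
        self.chosen = None

    def on_accepted(self, acceptor_id, n, v):
        """Record an ("accept", n, v) notification from one acceptor."""
        self.accepted_by[(n, v)].add(acceptor_id)
        if (self.chosen is None
                and len(self.accepted_by[(n, v)]) >= self.majority):
            self.chosen = v                   # a majority accepted (n, v)
        return self.chosen


learner = Learner(n_acceptors=5)
after_one = learner.on_accepted("a1", 7, "y")
after_two = learner.on_accepted("a2", 7, "y")
after_dup = learner.on_accepted("a2", 7, "y")    # duplicate, not recounted
after_three = learner.on_accepted("a3", 7, "y")
print(after_one, after_two, after_dup, after_three)
```

Storing acceptor ids in a set keeps retransmitted notifications from being counted twice.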

An example
Node 1 is the proposer.
Nodes 1 – n are the acceptors.
Nodes 1 – n are the learners.

Safety
A value is chosen when a majority of the acceptors accept it. If v in proposal (v, n) is chosen, the value in every accepted proposal (v’, n’) with n’ > n must satisfy v’ = v.
In its response to the prepare request, an acceptor informs the proposer of the value that it has accepted.
The proposer uses the accepted value with the largest version number as its own proposed value in the accept request.

Proof Sketch
Let (v, n) be the earliest proposal that is accepted. If no other proposals are made, safety holds (i.e. only one value is chosen).
Let (v’, n’) be the earliest proposal accepted after (v, n).
As a proposal needs a majority of the nodes to respond, at least one node must have responded to both (v, n) and (v’, n’).
That node must have suggested value v in its response to the prepare request for (v’, n’).
The proposer of (v’, n’) must therefore set the value to v in its accept request.
Hence v’ = v must hold.
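The step that some node must have responded to both proposals rests on the fact that any two majorities of the same node set overlap. A brute-force check for a small, hypothetical five-node system (the node names and the `majorities` helper are illustrative only):

```python
from itertools import combinations


def majorities(nodes):
    """Enumerate every subset of nodes that contains more than half of them."""
    need = len(nodes) // 2 + 1
    return [set(q)
            for size in range(need, len(nodes) + 1)
            for q in combinations(nodes, size)]


nodes = ["a", "b", "c", "d", "e"]
quorums = majorities(nodes)
# Every pair of majorities shares at least one node.
all_intersect = all(q1 & q2 for q1, q2 in combinations(quorums, 2))
print(len(quorums), all_intersect)
```

The general argument is a counting one: two sets that each contain more than half of the nodes cannot be disjoint.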

Liveness
Per FLP, Paxos cannot guarantee liveness. For example:
1. Proposer p completes phase 1 for a proposal number n1.
2. Another proposer q then completes phase 1 for a proposal number n2 > n1.
3. Proposer p’s phase 2 accept requests for the proposal numbered n1 are ignored because the acceptors have all promised not to accept any new proposal numbered less than n2.
4. Proposer p then begins and completes phase 1 for a new proposal number n3 > n2, causing proposer q’s phase 2 accept requests to be ignored.
And so on.
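The duelling-proposers scenario can be replayed with a small simulation. This is a sketch under a deliberately adversarial schedule, which is an assumption of the example: each proposer's phase 1 always reaches the acceptors before the rival's phase 2 does. The `Acceptor` class and the `duel` function are illustrative names.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1               # highest prepare number answered
        self.accepted = None             # (n, v) of the last accepted proposal

    def on_prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True
        return False

    def on_accept(self, n, v):
        if n < self.promised:
            return False                 # a higher prepare was answered
        self.accepted = (n, v)
        return True


def duel(rounds, n_acceptors=3):
    """Alternate proposers p and q; return how many proposals got a majority."""
    acceptors = [Acceptor() for _ in range(n_acceptors)]
    majority = n_acceptors // 2 + 1
    successes = 0
    pending = None                       # (proposer, n) awaiting phase 2
    for number in range(1, rounds + 1):
        proposer = "p" if number % 2 == 1 else "q"
        # The current proposer completes phase 1 with a higher number ...
        for acceptor in acceptors:
            acceptor.on_prepare(number)
        # ... and only then do the rival's older accept requests arrive.
        if pending is not None:
            old_proposer, old_n = pending
            votes = sum(a.on_accept(old_n, old_proposer) for a in acceptors)
            if votes >= majority:
                successes += 1
        pending = (proposer, number)
    return successes


print(duel(rounds=10))
```

Under this schedule no proposal is ever accepted, however many rounds are run. Safety is not violated, since nothing is chosen at all; only progress is lost.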

The lack of liveness can be addressed if there is only one proposer in the system.
Use virtual synchrony to ensure that everyone agrees on the membership of the group.
Everyone knows which node is responsible for issuing proposals.
A failed or slow node is removed from the group.
If the failed or slow node is the one responsible for issuing proposals, a new node is appointed to carry out the task.

Paxos in real life
The replication services of some modern file systems use Paxos.
Google BigTable
Many MS products, e.g. SQL Server clusters

Review questions
Understand the meaning of univalent and bivalent in the context of solving the consensus problem in a distributed system. Can a system be in a univalent state if no node has decided? What causes a system to enter a univalent state?
Understand how the FLP impossibility theorem affects real system design.
Give a scenario in which the Paxos algorithm cannot terminate.
What are the safety conditions of the Paxos algorithm?

Review questions
In Paxos, what are the pros and cons of having a single acceptor?
How does the Paxos algorithm guarantee that only the consensus value is propagated?
In the Paxos algorithm, when a proposer knows that some acceptors have accepted a value from other proposers, can the proposer simply accept the value without running the second phase of the algorithm or executing the algorithm again with a new version number? Explain your answer.
Assume that a membership service that implements virtual synchrony is available. Explain how an implementation of the Paxos algorithm can use the membership service to ensure that the consensus algorithm terminates.

Further readings
Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (April 1985).
Leslie Lamport. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998).
Leslie Lamport. Paxos Made Simple. ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001), 51-58.
