2. Motivation
- Centralized service: a coordination kernel
- Maintains configuration information, naming, distributed synchronization, and group services
- Avoids synchronization problems and races
- File-system based API
- Manipulates small data nodes: znodes
- State is a hierarchy of znodes

3. Visualizing Paxos
[Figure: a proposer, a coordinator, three Acceptors (R1, R2, R3), and learners]
- The proposer requests that the Paxos system accept some command.
- Paxos is like a “postal system”: it thinks about the letter for a while (replicating the data and picking a delivery order).
- Once these are “decided”, the learners can execute the command.

4. Overview of roles of processes
- Client: issues a request to the distributed system and waits for a response. For instance, a write request on a file in a distributed file server.
- Acceptors: act as the fault-tolerant "memory" of the protocol. Acceptors are collected into groups called Quorums. Any message sent to an Acceptor must be sent to a Quorum of Acceptors. Any message received from an Acceptor is ignored unless a copy is received from each Acceptor in a Quorum.

5. Paxos Assumptions
- Processors operate at arbitrary speed.
- Processors may experience failures.
- Processors with stable storage may re-join the protocol after failures (crash-recovery fault tolerance).
- Processors do not collude, lie, or otherwise attempt to subvert the protocol, i.e. Byzantine failures don't occur. See Byzantine Paxos for a solution that tolerates failures from arbitrary or malicious behavior of the processes.
- In general, a consensus algorithm can make progress using 2F+1 processors despite the simultaneous failure of any F processors.

6. Paxos Network
- Processors can send messages to any other processor.
- Messages are sent asynchronously and may take arbitrarily long to deliver.
- Messages may be lost, reordered, or duplicated.
- Messages are delivered without corruption, i.e. Byzantine network failures don't occur. See Byzantine Paxos for a solution.

7. Number of Processors
- In general, a consensus algorithm can make progress using 2F+1 processors despite the simultaneous failure of any F processors.
- However, using reconfiguration, a protocol may be employed which survives any number of total failures as long as no more than F fail simultaneously.
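
The 2F+1 rule can be checked with a little arithmetic. The sketch below assumes majority quorums (the slides speak of Quorums without fixing their size); the function names are illustrative only.

```python
def cluster_size(f: int) -> int:
    """Processors needed to keep making progress with f simultaneous crash failures."""
    return 2 * f + 1


def majority_quorum(n: int) -> int:
    """Smallest group size such that any two groups of that size overlap."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Largest f such that a majority quorum of the n processors is still alive."""
    return (n - 1) // 2


if __name__ == "__main__":
    for f in range(1, 4):
        n = cluster_size(f)
        print(f"F={f}: processors={n}, quorum={majority_quorum(n)}")
```

Under this assumption an even-sized cluster buys nothing: four processors tolerate the same single failure as three.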

8. Overview of roles of processes
- Proposer: advocates a client request, attempting to convince the Acceptors to agree on it.
- Learners: act as the replication factor for the protocol. Once a Client request has been agreed on by the Acceptors, the Learner may take action (i.e. execute the request and send a response to the client). To improve availability of processing, additional Learners can be added.
- Leader: Paxos requires a distinguished Proposer (called the leader) to make progress. Many processes may believe they are leaders, but the protocol only guarantees progress if one of them is eventually chosen. If two processes believe they are leaders, they may stall the protocol by continuously proposing conflicting updates. However, the safety properties are still preserved in that case.

9. Proposal Number & Agreed Value
- Each attempt to define an agreed value v is performed with proposals which may or may not be accepted by Acceptors.
- Each proposal is uniquely numbered for a given Proposer.

10. Basic Paxos
- Each instance of the Basic Paxos protocol decides on a single output value.
- The protocol proceeds over several rounds.
- A successful round has two phases:
  1) Prepare / Promise
  2) Accept Request / Accepted

12. Prepare-Promise: Prepare
- A Proposer (the leader) creates a proposal identified with a number N.
- This number must be greater than any previous proposal number used by this Proposer.
- Then, it sends a Prepare message containing this proposal to a Quorum of Acceptors.
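
One common way to obtain proposal numbers that grow for each Proposer and never collide across Proposers is to pair a local counter with the Proposer's id. This is an assumption layered on the slide, not something it prescribes; `ProposalNumberGenerator` and its fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

ProposalNumber = Tuple[int, int]  # (counter, proposer_id), compared lexicographically


@dataclass
class ProposalNumberGenerator:
    proposer_id: int
    counter: int = 0

    def next(self, highest_seen: Optional[ProposalNumber] = None) -> ProposalNumber:
        # Jump past any number learned from a rejection, so the retry is not
        # rejected again for being too low.
        if highest_seen is not None:
            self.counter = max(self.counter, highest_seen[0])
        self.counter += 1
        return (self.counter, self.proposer_id)


if __name__ == "__main__":
    p1 = ProposalNumberGenerator(proposer_id=1)
    p2 = ProposalNumberGenerator(proposer_id=2)
    print(p1.next(), p2.next(), p1.next(highest_seen=(7, 2)))
```

Because tuples compare element by element, two Proposers that pick the same counter are still ordered by id.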

14. Prepare-Promise: Promise
- If the proposal's number N is higher than any previous proposal number received from any Proposer by the Acceptor, then the Acceptor must return a promise to ignore all future proposals having a number less than N.
- If the Acceptor accepted a proposal at some point in the past, it must include the previous proposal number and previous value in its response to the Proposer.
- Otherwise, the Acceptor can ignore the received proposal. It does not have to answer in this case for Paxos to work. However, for the sake of optimization, sending a denial (Nack) response would tell the Proposer that it can stop its attempt to create consensus with proposal N.
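
The Promise rule can be sketched as a small state machine on the Acceptor. This is a minimal in-memory illustration: proposal numbers are plain integers, the class and field names are hypothetical, and a real Acceptor would write its state to stable storage before replying.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class Acceptor:
    promised_n: int = -1               # highest proposal number seen in a Prepare
    accepted_n: Optional[int] = None   # number of the last accepted proposal, if any
    accepted_value: Any = None         # value of the last accepted proposal, if any

    def on_prepare(self, n: int) -> Tuple[str, Optional[int], Any]:
        if n > self.promised_n:
            # Promise to ignore every proposal numbered below n, and report
            # any proposal accepted earlier.
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_value)
        # Optional optimization from the slide: tell the Proposer to give up on n.
        return ("nack", None, None)


if __name__ == "__main__":
    acceptor = Acceptor()
    print(acceptor.on_prepare(5))
    print(acceptor.on_prepare(3))
```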

16. Accept Request
- If a Proposer receives enough promises from a Quorum of Acceptors, it needs to set a value to its proposal.
- If any Acceptors had previously accepted any proposal, then they'll have sent their values to the Proposer, who now must set the value of its proposal to the value associated with the highest proposal number reported by the Acceptors.
- If none of the Acceptors had accepted a proposal up to this point, then the Proposer may choose any value for its proposal.
- The Proposer sends an Accept Request message to a Quorum of Acceptors with the chosen value for its proposal.
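
The value-selection rule above fits in a few lines. In this sketch each promise is modeled as a pair of (previously accepted number, previously accepted value), with `None` meaning nothing was accepted; the `Promise` alias and `choose_value` are illustrative names, not part of any library.

```python
from typing import Any, List, Optional, Tuple

Promise = Tuple[Optional[int], Any]  # (accepted proposal number or None, accepted value)


def choose_value(promises: List[Promise], own_value: Any) -> Any:
    """Pick the value a Proposer must send in its Accept Request."""
    accepted = [(n, v) for n, v in promises if n is not None]
    if not accepted:
        # No Acceptor in the Quorum has accepted anything: free choice.
        return own_value
    # Otherwise adopt the value attached to the highest reported proposal number.
    return max(accepted, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    print(choose_value([(None, None), (None, None)], "mine"))
    print(choose_value([(None, None), (3, "a"), (5, "b")], "mine"))
```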

18. Accepted
- If an Acceptor receives an Accept Request message for a proposal N, it must accept it if and only if it has not already promised to consider only proposals with an identifier greater than N.
- In this case, it should register the corresponding value v and send an Accepted message to the Proposer and every Learner. Otherwise, it can ignore the Accept Request.
- Rounds fail when multiple Proposers send conflicting Prepare messages, or when the Proposer does not receive a Quorum of responses (Promise or Accepted). In these cases, another round must be started with a higher proposal number.
- Notice that when Acceptors accept a request, they also acknowledge the leadership of the Proposer. Hence, Paxos can be used to select a leader in a cluster of nodes.
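
The Accepted phase can be sketched with an Acceptor that applies the rule above and a Learner that waits for a Quorum of matching Accepted messages. Everything here is an in-memory illustration under stated assumptions: integer proposal numbers, hashable values, and hypothetical class names.

```python
from collections import defaultdict


class Acceptor:
    def __init__(self):
        self.promised_n = -1      # highest proposal number promised so far
        self.accepted_n = None
        self.accepted_value = None

    def on_accept_request(self, n, value):
        # Accept unless a promise was already made to a higher-numbered proposal.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n, value)
        return None  # ignoring the request is allowed


class Learner:
    def __init__(self, quorum_size):
        self.quorum_size = quorum_size
        self.votes = defaultdict(set)  # (n, value) -> ids of Acceptors that accepted it
        self.decided = None

    def on_accepted(self, acceptor_id, n, value):
        self.votes[(n, value)].add(acceptor_id)
        if self.decided is None and len(self.votes[(n, value)]) >= self.quorum_size:
            self.decided = value
        return self.decided


if __name__ == "__main__":
    acceptors = [Acceptor() for _ in range(3)]
    acceptors[2].promised_n = 7  # this one already promised a newer proposal
    learner = Learner(quorum_size=2)
    for i, acceptor in enumerate(acceptors):
        reply = acceptor.on_accept_request(5, "cmd")
        if reply is not None:
            learner.on_accepted(i, reply[1], reply[2])
    print(learner.decided)
```

Two of the three Acceptors accept proposal 5, which is enough for the Learner to treat the value as decided even though the third Acceptor ignores the request.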

21. A Paxos for every occasion
- Multi-Paxos: avoids the Prepare and Promise phase
- Cheap Paxos: tolerates F failures with F+1 main processors and F auxiliary processors
- Fast Paxos: reduces end-to-end message delays
- Generalized Paxos: exploits commutativity
- Byzantine Paxos

22. What is ZooKeeper?
- A highly available, scalable, distributed service for configuration, consensus, group membership, leader election, naming, and coordination
- These kinds of services are difficult to implement reliably:
  - brittle in the presence of change
  - difficult to manage
  - different implementations lead to management complexity when the applications are deployed

23. ZooKeeper Properties
- File API without partial reads/writes
- Simple wait-free data objects organized hierarchically as in file systems
- Per-client guarantee of FIFO execution of requests
- Linearizability for all requests that change the ZooKeeper state
- Built using ZAB, a totally ordered broadcast protocol (based on Paxos)
- 2F+1 servers can tolerate F crash failures

24. Any Guarantees?
- Clients will never detect old data: a client never sees a view of the system older than one it has already seen.
- Clients will get notified of a change to data they are watching within a bounded period of time.
- All requests from a client will be processed in order.
- All results received by a client will be consistent with results received by all other clients.

25. ZooKeeper Servers
1) All servers store a copy of the data on disk
2) A leader is elected at startup
3) Followers service clients, all updates go through the leader
4) Update responses are sent when a majority of servers have persisted the change

26. ZooKeeper Service
[Figure: ZooKeeper Service, showing a Leader, Servers, and Clients]
- All servers store a copy of the data, logs, and snapshots on disk and use an in-memory database
- A leader is elected at startup
- Followers service clients, all updates go through the leader
- Update responses are sent when a majority of servers have persisted the change

27. Protocol Guarantees
1) Sequential Consistency - Updates from a client will be applied in the order that they were sent.
2) Atomicity - Updates either succeed or fail. No partial results.
3) Single System Image - A client will see the same view of the service regardless of the server that it connects to.
4) Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
5) Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain bound. Either system changes will be seen by a client within this bound, or the client will detect a service outage.

28. ZAB algorithm
http://research.yahoo.com/files/ladis08.pdf
- ZooKeeper is based on the ZAB algorithm (ZAB: ZooKeeper Atomic Broadcast)
- ZAB consists of two modes:
  - Recovery: runs when the service starts or after a leader failure. It ends when a leader emerges and a quorum of servers have synchronized their state with the leader.
  - Broadcast: the leader is the server that executes a broadcast by initiating the broadcast protocol. Once a leader has synchronized with a quorum of followers, it begins to broadcast messages.

29. ZAB broadcast
- The leader broadcasts a proposal for a message to be delivered.
- Before proposing a message, the leader assigns a monotonically increasing unique id, called the zxid.
- Because ZAB preserves causal ordering, the delivered messages will also be ordered by their zxids.
- Broadcasting consists of putting the proposal with the message attached into the outgoing queue for each follower.
- When a follower receives a proposal, it writes it to disk and sends an acknowledgement to the leader as soon as the proposal is on the disk media.
- When a leader receives ACKs from a quorum, the leader will broadcast a COMMIT and deliver the message locally. Followers deliver the message when they receive the COMMIT from the leader.
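
The broadcast steps above can be sketched as a toy, single-process simulation. Several simplifications are assumptions rather than ZAB itself: the zxid is a plain counter, the outgoing queues and disk are replaced by direct method calls and an in-memory list, the leader counts its own vote toward a majority quorum, and the `up` flag is a stand-in for an unreachable follower.

```python
class Follower:
    def __init__(self):
        self.up = True
        self.log = []        # stands in for the on-disk proposal log
        self.delivered = []

    def on_proposal(self, zxid, message):
        self.log.append((zxid, message))  # "write to disk" before acknowledging
        return zxid                       # the ACK

    def on_commit(self, zxid):
        for logged_zxid, message in self.log:
            if logged_zxid == zxid:
                self.delivered.append((logged_zxid, message))


class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.next_zxid = 1
        self.delivered = []

    def broadcast(self, message):
        zxid = self.next_zxid             # monotonically increasing id
        self.next_zxid += 1
        acked = [f for f in self.followers
                 if f.up and f.on_proposal(zxid, message) == zxid]
        ensemble = len(self.followers) + 1
        if len(acked) + 1 < ensemble // 2 + 1:
            return None                   # no quorum of ACKs, nothing is committed
        self.delivered.append((zxid, message))  # deliver locally
        for f in acked:
            f.on_commit(zxid)             # COMMIT
        return zxid


if __name__ == "__main__":
    followers = [Follower() for _ in range(4)]
    followers[3].up = False
    leader = Leader(followers)
    print(leader.broadcast("create /a"), leader.broadcast("set /a"))
    print(followers[0].delivered)
```

With five servers and one follower down, four votes exceed the quorum of three, so both messages commit and are delivered in zxid order.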

30. ZAB Leader Election
1) UDP based
2) Server with the highest logged transaction gets nominated
3) Election doesn't have to be absolutely correct, just very likely correct

31. ZAB Leader Election
[Figure: five servers, each shown with its election state, one row per server]
lastZxid  vote  voteZxid
22        1     22
22        2     22
23        3     23
21        4     21
21        5     21
1) Each server initially nominates itself
2) Servers poll each other to get their votes

32. ZAB Leader Election
[Figure: the same five servers after voting, one row per server]
lastZxid  vote  voteZxid
22        3     22
22        3     22
23        3     23
21        3     21
21        3     21
1) Each server initially nominates itself
2) Servers poll each other to get their votes
3) and vote for the one with the highest zxid if there isn't a winner
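
The three election steps can be sketched as a loop over a vote table. This is a simplification under stated assumptions: every server can reach every other, a "winner" is taken to mean a candidate holding a majority of votes, and ties on zxid are broken by the higher server id. None of these details is stated on the slides, and `run_election` is a hypothetical name.

```python
from collections import Counter


def run_election(last_zxid):
    """last_zxid maps server id -> last logged zxid; returns the elected server id."""
    votes = {sid: sid for sid in last_zxid}      # 1) each server nominates itself
    quorum = len(last_zxid) // 2 + 1
    while True:
        # 2) servers poll each other and tally the votes
        candidate, count = Counter(votes.values()).most_common(1)[0]
        if count >= quorum:
            return candidate
        # 3) no winner yet: everyone votes for the candidate with the highest zxid
        best = max(set(votes.values()), key=lambda c: (last_zxid[c], c))
        votes = {sid: best for sid in votes}


if __name__ == "__main__":
    # lastZxid values taken from the five servers in the figure
    print(run_election({1: 22, 2: 22, 3: 23, 4: 21, 5: 21}))
```

With the figure's values, the server whose lastZxid is 23 collects every vote, matching the `vote: 3` entries on the slide.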

33. Difference: Paxos vs. ZAB
Paxos:
- Tolerates message losses and reordering
- Quorums
- If a proposer believes it is a leader, it uses a higher number to take over leadership from another leader
ZAB:
- Uses TCP
- No Quorums needed
- New leader cannot take over leadership until all of the followers agree on the leader

34. Paxos references
- Schneider, Fred (1990). "Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial". ACM Computing Surveys 22: 299.
- Lamport, Leslie. "The Part-Time Parliament".
- Lamport, Leslie. "Paxos Made Simple".