Dynamic Reconfiguration of Apache Zookeeper


1 Dynamic Reconfiguration of Apache Zookeeper
Alex Shraer (Yahoo! Research). In collaboration with: Benjamin Reed (Yahoo!), Dahlia Malkhi (Microsoft Research), Flavio Junqueira (Yahoo!). USENIX ATC 2012, Hadoop Summit 2012

2 Talk Overview
Short intro to Zookeeper
What is a configuration?
Why change it dynamically?
How is Zookeeper currently reconfigured?
Our solution: server side and client side

3 Modern Distributed Computing
Lots of servers, lots of processes, high volumes of data, highly complex software systems … and mere mortal developers

4 Coordination is important

5 Coordination primitives
Semaphores, queues, locks, leader election, group membership, barriers, configuration

6 Coordination is difficult even if we have just 2 replicas…
[Diagram: two replicas both start with x = 0 and receive the same two operations, x++ and x = x * 2, but apply them in different orders and end up with different values (2 vs. 1).]

7 Common approach: use a coordination service
Leading coordination services: Google's Chubby; Apache Zookeeper, used by Yahoo!, LinkedIn, Twitter, Facebook, VMware, UBS, Goldman Sachs, Netflix, Box, Cloudera, MapR, Nicira, …

8 Zookeeper data model A tree of data nodes (znodes)
Hierarchical namespace (like in a file system). Znode = <data, version, creation flags, children>
[Diagram: an example tree rooted at /, with branches such as /services/workers (worker1, worker2), /locks (x-1, x-2), /apps, and /users.]

9 ZooKeeper: API
create: with optional sequential and ephemeral flags
setData: can be conditional on the current version
getData: can optionally set a "watch" on the znode
getChildren
exists
delete
(see the code sketch below)
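
For concreteness, a minimal sketch of these calls using the ZooKeeper Java client library; the connection string, paths, and data are made up for illustration, and error handling is omitted.

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ApiDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) ensemble; the Watcher receives session and znode events.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000,
                event -> System.out.println("event: " + event));

        // create: a regular znode and an ephemeral child that disappears with the session
        zk.create("/apps", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/apps/worker1", "host1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // getData with a watch; the Stat carries the znode's current version
        Stat stat = new Stat();
        byte[] data = zk.getData("/apps/worker1", true, stat);
        System.out.println(new String(data) + " @ version " + stat.getVersion());

        // setData conditional on the version we just read (fails if someone raced us)
        zk.setData("/apps/worker1", "host1:updated".getBytes(), stat.getVersion());

        // exists, getChildren, delete
        if (zk.exists("/apps", false) != null) {
            System.out.println(zk.getChildren("/apps", false));
        }
        zk.delete("/apps/worker1", -1);   // -1 means "any version" (unconditional delete)
        zk.close();
    }
}
```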

10 Example: Distributed Lock / Leader election
1- create "/lock", ephemeral; 2- if success, return with lock; 3- getData /lock, watch=true; 4- wait for /lock to go away, then goto 1
[Diagram: client B holds the lock (the ephemeral znode /lock); clients A and C are waiting.]

11 Example: Distributed Lock / Leader election
1- create "/lock", ephemeral; 2- if success, return with lock; 3- getData /lock, watch=true; 4- wait for /lock to go away, then goto 1
[Diagram: B's session ends and /lock disappears; clients A and C both receive a watch notification.]

12 Example: Distributed Lock / Leader election
1- create "/lock", ephemeral; 2- if success, return with lock; 3- getData /lock, watch=true; 4- wait for /lock to go away, then goto 1
[Diagram: client C wins the race and now holds /lock; client A goes back to waiting.]
(a code sketch of this recipe follows)
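
A rough Java sketch of the recipe above, assuming an already-connected ZooKeeper handle; the class and path names are illustrative. Note that every waiter watches the same znode, which causes the herd effect discussed on the next slide.

```java
import org.apache.zookeeper.*;
import java.util.concurrent.CountDownLatch;

// Naive lock / leader election: every contender watches the same /lock znode.
public class NaiveLock {
    private final ZooKeeper zk;
    public NaiveLock(ZooKeeper zk) { this.zk = zk; }

    public void acquire() throws KeeperException, InterruptedException {
        while (true) {
            try {
                // 1- create "/lock", ephemeral; 2- if success, return with lock
                zk.create("/lock", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return;
            } catch (KeeperException.NodeExistsException someoneElseHasIt) {
                CountDownLatch gone = new CountDownLatch(1);
                try {
                    // 3- getData /lock with watch=true
                    zk.getData("/lock", event -> {
                        if (event.getType() == Watcher.Event.EventType.NodeDeleted) gone.countDown();
                    }, null);
                    gone.await();   // 4- wait for /lock to go away, then goto 1 (everyone wakes up at once)
                } catch (KeeperException.NoNodeException alreadyGone) {
                    // /lock vanished between the failed create and getData; just retry
                }
            }
        }
    }
}
```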

13 Problem? Herd effect
A large number of clients wake up simultaneously, causing a load spike
Some client may never get the lock

14 Example: Distributed Lock / Leader election
1- z = create "/C-", sequential, ephemeral; 2- getChildren "/"; 3- if z has the lowest id, return with lock; 4- getData prev_node, watch=true; 5- wait for prev_node to go away, then goto 2
[Diagram: client A holds /C-1, client B holds /C-2, …, client Z holds /C-n; each client watches only its predecessor.]

15 Example: Distributed Lock / Leader election
1- z = create "/C-", sequential, ephemeral; 2- getChildren "/"; 3- if z has the lowest id, return with lock; 4- getData prev_node, watch=true; 5- wait for prev_node to go away, then goto 2
[Diagram: client A goes away and its znode /C-1 is removed.]

16 Example: Distributed Lock / Leader election
1- z = create "/C-", sequential, ephemeral; 2- getChildren "/"; 3- if z has the lowest id, return with lock; 4- getData prev_node, watch=true; 5- wait for prev_node to go away, then goto 2
[Diagram: only client B, which was watching /C-1, receives a notification; B now has the lowest id and takes the lock.]
(a code sketch of this herd-free recipe follows)
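
A Java sketch of the herd-free recipe above, again assuming a connected handle; names are illustrative, and real deployments would typically keep the sequential znodes under a dedicated parent rather than the root.

```java
import org.apache.zookeeper.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Herd-free lock: each client creates /C-<seq> and watches only its immediate predecessor.
public class SequentialLock {
    private final ZooKeeper zk;
    public SequentialLock(ZooKeeper zk) { this.zk = zk; }

    public void acquire() throws KeeperException, InterruptedException {
        // 1- z = create "/C-", sequential, ephemeral (the server appends a growing counter)
        String z = zk.create("/C-", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                             CreateMode.EPHEMERAL_SEQUENTIAL);
        String myNode = z.substring(z.lastIndexOf('/') + 1);
        while (true) {
            // 2- getChildren "/" (ignore unrelated znodes such as /zookeeper)
            List<String> children = new ArrayList<>(zk.getChildren("/", false));
            children.removeIf(c -> !c.startsWith("C-"));
            Collections.sort(children);
            // 3- if z has the lowest id, return with lock
            if (children.get(0).equals(myNode)) return;
            // 4- watch only the node just before mine, so only one client wakes up
            String prev = children.get(children.indexOf(myNode) - 1);
            CountDownLatch gone = new CountDownLatch(1);
            if (zk.exists("/" + prev, event -> gone.countDown()) != null) {
                gone.await();   // 5- wait for prev_node to go away, then goto 2
            }
        }
    }
}
```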

17 Zookeeper - distributed and replicated
All servers store a copy of the data (in memory). A leader is elected at startup. Reads are served by followers; all updates go through the leader. An update is acked when a quorum of servers have persisted the change (on disk). Zookeeper uses ZAB, its own atomic broadcast protocol, which borrows a lot from Paxos but is conceptually different.
[Diagram: a ZooKeeper service consisting of a leader and several follower servers, each serving a set of clients.]

18 ZooKeeper: Overview
The leader atomically broadcasts updates to the followers.
[Diagram: an ensemble (replicated system) with one leader and several followers; each client app uses the ZooKeeper client library to maintain a session with one of the followers.]

19 ZooKeeper: Overview
The leader atomically broadcasts updates to the followers.
[Diagram: one client issues setData(/x, 5) through its follower and gets "ok" once the update is committed, so /x changes from 0 to 5 on the servers. A second client that calls sync before getData(/x) reads 5, while a client reading from a lagging follower without sync may still see 0.]

20 FIFO order of client’s commands
A client can invoke many operations asynchronously. All commands have a blocking variant, rarely used in practice. Zookeeper guarantees that a prefix of these operations completes, in invocation order.
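
A small illustration of the asynchronous API (hypothetical path and data): the callbacks for operations issued by one session are delivered in the order the operations were invoked.

```java
import org.apache.zookeeper.*;

// FIFO client order: async operations issued by one session complete in invocation order.
public class FifoDemo {
    public static void demo(ZooKeeper zk) {
        // Fire off several updates without waiting; their callbacks arrive in this same order.
        for (int i = 1; i <= 3; i++) {
            final int n = i;
            zk.setData("/x", Integer.toString(n).getBytes(), -1,
                (rc, path, ctx, stat) ->
                    System.out.println("update " + n + " completed, result code " + rc),
                null);
        }
        // Blocking variants (e.g., setData without a callback) exist too, but as the slide
        // notes they are rarely used in practice.
    }
}
```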

21 Zookeeper is a Primary/Backup system
An important subclass of State-Machine Replication. Many (most?) Primary/Backup systems work as follows: the primary speculatively executes operations without waiting for previous ops to commit, and sends idempotent state updates to the backups. Each such update "makes sense" only in the context of the updates that precede it: the primary speculatively executes an op and sends it out, but it will only appear in a backup's log after the ops it depends on. In general State Machine Replication (Paxos), a backup's log may not preserve this order. Primary order: each primary commits a consecutive segment of the log; the ops of each leader appear in the log in the order it proposed them, and ops by different leaders cannot be interleaved in the log. This is why we don't need to stop operations: ops that follow a reconfiguration can be executed speculatively, and they will only persist if the reconfiguration is successful.

22 Configuration of a Distributed Replicated System
Membership
Role of each server, e.g., deciding on changes (acceptor/follower) or learning the changes (learner/observer)
Quorum System spec. Zookeeper supports majorities or hierarchical quorums (server votes have different weights); hierarchical quorums are useful, for example, in a multi-datacenter setting or when some servers are more powerful or better connected than others
Network addresses & ports: a Zookeeper server uses 2 ports to talk with other servers and 1 port to talk with clients (see the example config below)
Timeouts, directory paths, etc.
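
For concreteness, a (static) ZooKeeper server configuration file looks roughly like this; hostnames and paths are placeholders. The two extra ports on each server line are the quorum (peer) port and the leader-election port, and clientPort is the single port clients connect to.

```
# zoo.cfg -- static configuration, loaded at boot
tickTime=2000
dataDir=/var/lib/zookeeper
# the one port clients connect to
clientPort=2181
initLimit=5
syncLimit=2
# server.<id>=<host>:<quorum port>:<leader election port>
# (the two ports used to talk with the other servers)
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

Weighted and hierarchical quorums are configured with additional weight.N and group.N entries.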

23 Dynamic Membership Changes
Necessary in every long-lived system! Examples:
Cloud computing: adapt to changing load, don't pre-allocate!
Failures: replacing failed nodes with healthy ones
Upgrades: replacing out-of-date nodes with up-to-date ones
Free up storage space: decreasing the number of replicas
Moving nodes: within the network or the data center
Increase resilience by changing the set of servers; for example, replication keeps working as long as more than #servers/2 servers operate (see the small calculation below)
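
A trivial calculation, sketched in Java for concreteness, of how the majority requirement in the last bullet translates into fault tolerance for different ensemble sizes.

```java
// Majority quorums: an ensemble of n servers needs floor(n/2)+1 servers up,
// so it tolerates floor((n-1)/2) failures.
public class QuorumMath {
    public static void main(String[] args) {
        for (int n = 3; n <= 7; n += 2) {
            int quorum = n / 2 + 1;
            int tolerated = (n - 1) / 2;
            System.out.printf("n=%d  quorum=%d  tolerated failures=%d%n", n, quorum, tolerated);
        }
        // Growing from 3 to 5 servers raises the tolerated failures from 1 to 2,
        // which is exactly the kind of resilience change that reconfiguration enables.
    }
}
```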

24 Other Dynamic Configuration Changes
Changing server addresses/ports
Changing server roles: turning observers (learners) into followers (acceptors) or vice versa
Changing the Quorum System, e.g., if a new powerful & well-connected server is added

25 Reconfiguring Zookeeper
Not supported: all config settings are static, loaded during boot
Zookeeper users have repeatedly asked for reconfiguration since 2008
Several attempts were found incorrect and rejected
In practice, admins do this manually, hoping that nothing breaks

26 Manual Reconfiguration
Bring the service down, change the configuration files, bring it back up
Wrong reconfiguration has caused split-brain & inconsistency in production; configuration errors are a primary cause of failures in production systems [Yin et al., SOSP'11]
Questions about manual reconfig are asked several times each week. ZK expert: "Here are the 8 simple steps you need to follow to add 2 servers to a cluster of 3 servers"
Admins prefer to over-provision rather than reconfigure [LinkedIn 2012]; this doesn't help with many reconfiguration use-cases, wastes resources, adds management overhead, and can hurt Zookeeper throughput (as we show)

27 Hazards of Manual Reconfiguration
Goal: add servers D and E to the ensemble {A, B, C}
Approach: change the configuration files and restart all servers
[Diagram: during the rolling restart, some servers are already running with {A, B, C, D, E} while others still run with {A, B, C}; a quorum of the new configuration can form without the servers holding the latest state, and previously committed operations are lost!]

28 Just use a coordination service!
Zookeeper is the coordination service; we don't want to deploy another system to coordinate it! Who would reconfigure that system? GFS has 3 levels of coordination services, and more system components mean more management overhead
Instead, use Zookeeper to reconfigure itself!
Other systems store configuration information in Zookeeper; can we do the same? Only if there are no failures
To understand why not, we need to see how Zookeeper failure recovery works

29 Recovery in Zookeeper
[Diagram: the leader fails after proposing setData(/x, 5); some operations reached only a subset of the servers A–E.]
Leader failure activates leader election & recovery. It is impossible to know whether a partially replicated operation was committed, so the new leader A must assume that it was.

30 This doesn’t work for reconfigurations!
[Diagram: the servers run with {A, B, C, D, E}; a reconfiguration to remove C, D, E and add F is attempted by writing setData(/zookeeper/config, {A, B, F}), and the servers end up disagreeing about which configuration is in effect.]
Must persist the decision to reconfigure in the old config before activating the new config!
Once such a decision is reached, must not allow further ops to be committed in the old config

31 Our Solution
Correct and fully automatic
No external services or additional components; minimal changes to Zookeeper
Usually unnoticeable to clients: operations are paused only in rare circumstances, and clients work with a single configuration
Rebalances clients across the servers of the new configuration
Reconfigures immediately, using speculative reconfiguration: the reconfiguration (and the commands that follow it) is speculatively sent out by the primary, just like all other updates

32 Principles of Reconfiguration
A reconfiguration C1 -> C2 should do the following:
1. Consensus decision in the current config (C1) on a new config (C2)
2. Transfer of control from C1 to C2: suspend C1, collect a snapshot S of the completed/potentially completed ops in C1, and transfer S to a quorum of C2
3. A consensus decision in C2 to activate C2 with the "start state" S

33 Easy in a Primary/Backup system!
1. A decision on the next config is just like a decision on any other operation
2a. Instead of suspending C1, we guarantee that further ops can only be committed in C2 unless the reconfiguration fails (primary order!)
2b. The primary is the only one executing & proposing ops, so S = the primary's log
2c. State is transferred ahead of time; here we only make sure that the transfer is complete
3. If the leader stays the same, there is no need to run full 2-round consensus: it knows that nothing was proposed in the new config, so the 1st phase of consensus is unnecessary

34 Failure-Free Flow

35 Protocol Features
After a reconfiguration is proposed, the leader schedules & executes operations as usual; the leader of the new configuration is responsible for committing them
If the leader of the old config is in the new config and "able to lead", it remains the leader; otherwise, the old leader nominates a new leader (saving leader-election time)
We support multiple concurrent reconfigurations, activating only the "last" config rather than intermediate ones (in the paper, not in production). For example, if we are reconfiguring to remove a node and meanwhile detect another failure, we don't activate the intermediate config and move directly to a config that excludes both servers

36 Recovery – Discovering Decisions
[Diagram: reconfiguration from {A, B, C} to {A, D, E}, i.e., replace B and C with D and E; C is recovering and needs to learn what has been decided.]
C must: 1) discover possible decisions in {A, B, C} (find out about {A, D, E}); 2) discover a possible activation decision in {A, D, E}
If {A, D, E} is active, C mustn't attempt to transfer state; otherwise, C should transfer state & activate {A, D, E}

37 Recovery – Challenges of Configuration Overlap
[Diagram: reconfiguration from {A, B, C} to {A, B, D}, i.e., replace C with D; the servers have different views of which configuration is current.]
Only B can be the leader of {A, B, C}, but it won't lead it, so no leader will be elected in {A, B, C}
D will not connect to B (it doesn't know who B is), so B will not be able to lead {A, B, D}

38 Recovering Operations submitted while reconfig is pending
The old leader may execute and submit such operations; the new leader is responsible for committing them once they are persisted on a quorum of the NEW config
However, because of the way recovery in ZAB works, the protocol is easier to implement if such operations are persisted on a quorum of the old config as well, in addition to the new config, before being committed
[Diagram: the leader commits some operations that no other server knows about; in ZAB, C can't fetch history from E or D, so C must already have those operations.]

39 The “client side” of reconfiguration
When the system changes, clients need to stay connected; the usual solution is a directory service (e.g., DNS)
Re-balancing load during reconfiguration is also important!
Goal: a uniform number of clients per server, with minimal client migration; migration should be proportional to the change in membership
[Diagram: three servers, each with 10 clients.]

40 Consistent Hashing?
Client and server ids are randomly mapped to a circular m-bit space (0 follows 2^m − 1); each client then connects to the server that immediately follows it on the circle
To improve load balancing, each server is hashed k times (usually k on the order of log(#servers))
Load balancing is uniform w.h.p., depending on #servers and k, but in Zookeeper #servers < 10
[Chart: 9 servers, 1000 clients, each client initially connected to a random server; load imbalance after a sequence of reconfigurations (add 3, remove 1, remove 1, remove 2, add 1), shown for k = 5 and k = 20.]
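
A minimal sketch of the consistent-hashing scheme the slide evaluates (and finds insufficient for ZooKeeper's small ensembles); the hash function and class names are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent hashing: each server is hashed k times onto a ring; a client connects to the
// first server at or after its own hash. With few servers (ZooKeeper: < 10), the load is
// far from uniform unless k is large, which is why the talk uses a different scheme.
public class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();

    public ConsistentHashRing(Iterable<String> servers, int k) {
        for (String s : servers)
            for (int i = 0; i < k; i++)
                ring.put(hash(s + "#" + i), s);
    }

    public String serverFor(String clientId) {
        SortedMap<Long, String> tail = ring.tailMap(hash(clientId));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

For example, new ConsistentHashRing(java.util.Arrays.asList("A", "B", "C"), 20).serverFor("client-42") picks the server that client-42 should use.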

41 Our approach - Probabilistic Load Balancing
Example 1: going from 3 servers to 5, each client moves to a random new server with probability 1 − 3/5 = 0.4; in expectation, 40% of the clients move off of each old server (10 clients per server become 6, and each new server picks up 6)
Example 2: shrinking from S = {A, B, C, D, E} (6 clients each) to S' = {D, E, F}: clients connected to D and E don't move; each client of A, B, C moves to D or E with probability |S∩S'|·(|S|−|S'|) / (|S'|·|S\S'|) = 2·(5−3)/(3·3) = 4/9, and otherwise to F. In expectation, 8 clients move from A, B, C to D and E (probabilities 4/18 and 4/18) and 10 move to F (probability 10/18), leaving 10 clients on each of D, E, F

42 Probabilistic Load Balancing
When moving from config S to S', we require for each server i: expected #clients connected to i in S' (10 in the last example) = #clients connected to i in S + #clients moving to i from other servers in S − #clients moving from i to other servers in S'. Solving for the probabilities Pr gives case-specific values.
Input: each client answers locally. Question 1: are there more servers now or fewer? Question 2: is my server being removed?
Output: 1) disconnect from or stay connected to my server; 2) if disconnecting, Pr(connect to one of the old servers) and Pr(connect to a newly added server). (A sketch of the example cases follows.)
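
A sketch, in Java, of the client-side decision for the two example cases on the previous slide; the paper's full case analysis is more general, and all names here are illustrative.

```java
import java.util.*;

// Client-side rebalancing sketch for the two example cases: servers only added,
// or my server removed in a shrink that may also add servers.
public class Rebalance {
    private final Random rnd = new Random();

    /** Returns the server this client should now be connected to. */
    public String onConfigChange(String myServer, Set<String> oldS, Set<String> newS) {
        List<String> added = new ArrayList<>(newS);
        added.removeAll(oldS);

        if (newS.containsAll(oldS)) {
            // Example 1: servers were only added. Move with probability 1 - |S|/|S'|,
            // always to a newly added server, so each server ends up with ~1/|S'| of the clients.
            double pMove = 1.0 - (double) oldS.size() / newS.size();
            if (rnd.nextDouble() < pMove)
                return added.get(rnd.nextInt(added.size()));
            return myServer;
        }

        if (newS.contains(myServer)) {
            return myServer;   // Example 2: my server survived the shrink, stay put
        }

        // Example 2: my server was removed. Reconnect to a surviving old server with total
        // probability |S∩S'|(|S|-|S'|)/(|S'|·|S\S'|), otherwise to a newly added server.
        List<String> surviving = new ArrayList<>(oldS);
        surviving.retainAll(newS);
        double pSurviving = surviving.isEmpty() ? 0.0
            : (double) surviving.size() * (oldS.size() - newS.size())
              / (newS.size() * (oldS.size() - surviving.size()));
        if (!surviving.isEmpty() && rnd.nextDouble() < pSurviving)
            return surviving.get(rnd.nextInt(surviving.size()));
        return added.get(rnd.nextInt(added.size()));
    }
}
```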

43 Implementation
Implemented in Zookeeper (Java & C), integration ongoing; the feature has been requested since 2008 and is expected in an upcoming release (July 2012)
3 new Zookeeper API calls: reconfig, getConfig, updateServerList
Dynamic changes to: membership, quorum system, server roles, addresses & ports
Reconfiguration modes: incremental (e.g., add servers E and D, remove server B), non-incremental (e.g., new config = {A, C, D, E}), blind or conditioned (e.g., reconfig only if the current config is #5)
Subscriptions to config changes: a client can invoke client-side re-balancing upon a change
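
For orientation, a sketch of how these calls are driven from the Java client in the API that eventually shipped with ZooKeeper 3.5, where the client method is named reconfigure and, in current releases, lives on ZooKeeperAdmin; reconfiguration must also be enabled and authorized on the ensemble. The hostnames, server ids, and connection strings below are made up.

```java
import org.apache.zookeeper.admin.ZooKeeperAdmin;
import org.apache.zookeeper.data.Stat;

public class ReconfigDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeperAdmin admin = new ZooKeeperAdmin("zk1:2181,zk2:2181,zk3:2181", 30000,
                event -> { });

        // Incremental mode: add servers 4 and 5, remove server 2. Dynamic server strings
        // carry the role and client port as well:
        // server.<id>=<host>:<quorum port>:<election port>:<role>;<client port>
        String joining = "server.4=zk4.example.com:2888:3888:participant;2181,"
                       + "server.5=zk5.example.com:2888:3888:participant;2181";
        String leaving = "2";
        // Last argument: -1 for a "blind" reconfig, or a config version to make it conditional.
        byte[] newConfig = admin.reconfigure(joining, leaving, null, -1, new Stat());
        System.out.println("new config: " + new String(newConfig));

        // Clients can read (and watch) the current configuration...
        Stat stat = new Stat();
        System.out.println("current config: " + new String(admin.getConfig(true, stat)));

        // ...and re-point their connection string, which triggers client-side rebalancing.
        admin.updateServerList("zk1:2181,zk3:2181,zk4:2181,zk5:2181");
        admin.close();
    }
}
```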

44 Evaluation
[Chart: Zookeeper performance during a sequence of reconfigurations: remove, add, remove-leader, add, remove, add, remove 4 servers, remove 2 servers.]

45 Summary
Design and implementation of reconfiguration for Apache Zookeeper, being contributed to the Zookeeper codebase
Much simpler than the state of the art, using properties already provided by Zookeeper
Many nice features: doesn't limit concurrency; reconfigures immediately; preserves primary order; doesn't stop client ops (Zookeeper is used by online systems, so any delay must be avoided); clients work with a single configuration at a time; no external services; includes client-side rebalancing

46 Questions?

