VAXclusters: A Closely Coupled Distributed System
Landon Cox, February 10, 2017
Tight and loose coupling
Characteristics of a tightly-coupled system:
- Close proximity of processors
- High-bandwidth communication via shared memory
- Single copy of the operating system
Characteristics of a loosely-coupled system:
- Physically separated processors
- Low-bandwidth, message-based communication
- Independent operating systems
Tight and loose coupling
Tightly-coupled systems can provide great performance. What are the disadvantages of a tightly-coupled system?
- Scaling gets extremely expensive (e.g., supercomputer)
- Relatively hard to extend (add components)
- A failed component can often bring down the entire system
"Closely-coupled" VAXclusters tried to resolve this tension:
- Want extensibility (should be easy to add components over time)
- Want availability (i.e., fault tolerance)
- Should be relatively affordable
- Performance should be acceptable
Tight and loose coupling
Performance bottleneck for loose coupling: communication between processors.
- Processes in tightly-coupled systems use shared memory
- Processes in a loosely-coupled system use messages
[Figure: two processes sharing a single physical memory vs. two processes, each with its own physical memory, exchanging messages]
Tight and loose coupling
What makes message passing so much slower?
- The interconnect (network versus memory bus)
- Need to copy data into and out of various address spaces
[Figure: the same shared-memory vs. message-passing diagram as above]
Close coupling: so we have to make communication fast
System Communication Architecture (SCA)
- Computer Interconnect (CI) for message passing
- The CI Port provided hardware support
Types of messages that SCA supported:
- Messages (small, with ordered, reliable delivery)
- Datagrams (small, with unordered, unreliable delivery)
- Block transfers (large, with ordered, reliable delivery)
CI Port interface
Data structures through which software and the CI Port communicate:
- CI Port registers
- 7 queues (4 command queues, a response queue, and free queues)
[Figure: the port driver places commands into, and takes responses from, queues in physical memory shared with the CI Port; free queues hold recycled buffers]
CI Port interface
Block-transfer commands include pointers into the src/dst address spaces (page tables), identifying contiguous regions for data to be copied. Hosts know this info because it was exchanged via messages: messages/datagrams act as the control plane, block transfers as the data plane.
How does this reduce copying?
- Don't have to copy data into individual messages
- Datagrams/messages include the payload in the command
- CI Ports can reach into memory themselves
How else does this improve performance?
- Far fewer interrupts for the OS to handle
- Instead of interrupting every 576 bytes for a new packet, the CI Port can copy data into the dst address space without interruption
- Only interrupt when the transfer is complete
Storage
Single, networked storage interface:
- Disks were distributed across the cluster
- Single namespace for all files
Advantages of a unified file-system namespace:
- Makes it easier to add new nodes
- Makes sharing files easier
- Can log in anywhere and get to your data
Storage
If we want to allow concurrent access to files, we need synchronization primitives; otherwise, processes can't coordinate their activities. This was a problem for single-host file systems too, but there the OS kernel synchronized access.
Why is synchronization harder in a cluster?
- A cluster is a distributed system
- Nodes can fail, messages can be lost, etc.
Storage
When synchronizing access to in-memory data, locks are also in memory. No magic: locks and objects are all in the same address space.
Why not use file content itself as the basis for syncing (i.e., write "lock=owned" to file lock.txt)?
- Would require very strong consistency guarantees, equivalent to memory accesses
- Horrendously slow, and potentially have to pay the penalty at all times
Implementing locking
First have to agree on cluster membership: if we don't all agree on who is around, it's going to be really hard to agree on anything else.
How does a cluster agree on its membership?
- Each node has a connection manager
- Each connection manager has a copy of the membership
- Use a quorum voting scheme (see the sketch below)
Consensus is a really hard problem:
- Nodes can fail and come back online arbitrarily
- Messages can be lost or slow
- Impossible to distinguish between failures and slow performance
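A minimal sketch of the quorum idea, assuming a simple strict-majority rule (the class and method names are illustrative, not the connection manager's actual interface):

import java.util.Set;

class QuorumVote {
    // A proposed new membership takes effect only if a strict majority of the
    // last agreed-upon membership votes for it. Two disjoint partitions can
    // never both hold strict majorities, so at most one side can proceed.
    static boolean hasQuorum(Set<String> lastKnownMembers, Set<String> votesFor) {
        return votesFor.size() > lastKnownMembers.size() / 2;
    }
}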
Locking interface
Users define locks and their names.
How are locks named?
- Locks form a hierarchical namespace
- Maps nicely onto the file-system namespace
What kinds of modes can locks be in?
- Exclusive access, protected read
- Concurrent read, concurrent write, null, etc. (see the sketch below)
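As a sketch, the six lock modes of the VMS distributed lock manager and their compatibility can be written as a matrix. The matrix below follows the standard DLM convention (null is compatible with everything, exclusive only with null); the slide names the modes but does not spell the matrix out:

enum LockMode {
    NL,  // null: placeholder, conflicts with nothing
    CR,  // concurrent read
    CW,  // concurrent write
    PR,  // protected read: shared readers, no writers
    PW,  // protected write: one writer, concurrent readers allowed
    EX;  // exclusive access

    private static final boolean[][] COMPAT = {
            //        NL     CR     CW     PR     PW     EX
            /*NL*/ { true,  true,  true,  true,  true,  true  },
            /*CR*/ { true,  true,  true,  true,  true,  false },
            /*CW*/ { true,  true,  true,  false, false, false },
            /*PR*/ { true,  true,  false, true,  false, false },
            /*PW*/ { true,  true,  false, false, false, false },
            /*EX*/ { true,  false, false, false, false, false },
    };

    // A request in mode `this` can be granted while another holder has the
    // resource in mode `other` only if the two modes are compatible.
    boolean compatibleWith(LockMode other) {
        return COMPAT[ordinal()][other.ordinal()];
    }
}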
Locks thus far
Lock anytime shared data is read/written:
- Ensures correctness
- Only one thread can read/write at a time
Would like more concurrency. How, without exposing violated invariants?
- Allow multiple threads to read (as long as none are writing)
Reader-writer interface
- readerStart (called when a thread begins reading)
- readerFinish (called when a thread is finished reading)
- writerStart (called when a thread begins writing)
- writerFinish (called when a thread is finished writing)
Invariants:
- If no threads are between writerStart/writerFinish, many threads may be between readerStart/readerFinish
- Only 1 thread may be between writerStart/writerFinish
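A minimal monitor-style implementation of this four-call interface (a sketch: it has no fairness policy, so a steady stream of readers can starve writers):

class ReaderWriter {
    private int readers = 0;         // threads between readerStart/readerFinish
    private boolean writing = false; // true while a thread is between writerStart/writerFinish

    public synchronized void readerStart() throws InterruptedException {
        while (writing) wait();      // readers wait out any active writer
        readers++;
    }

    public synchronized void readerFinish() {
        if (--readers == 0) notifyAll(); // last reader out wakes waiting writers
    }

    public synchronized void writerStart() throws InterruptedException {
        while (writing || readers > 0) wait(); // writers need exclusive access
        writing = true;
    }

    public synchronized void writerFinish() {
        writing = false;
        notifyAll();                 // wake both waiting readers and writers
    }
}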
Reader-writer interface vs. locks
The R-W interface looks a lot like locking: *Start ~ lock, *Finish ~ unlock.
Standard terminology:
- The four functions are called "reader-writer locks"
- A thread between readerStart/readerFinish holds a "read lock"
- A thread between writerStart/writerFinish holds a "write lock"
Pros/cons of R-W vs. standard locks?
- Gain concurrency at the cost of complexity
- Must know how data is being accessed in the critical section
Back to VAXclusters
Hierarchical locks:
- Allow coarse-grained mutual exclusion (tree roots)
- Allow fine-grained concurrency (tree leaves)
Who maintains the locking state (i.e., the queues)?
- Each lock has a master node
- The first node to request the lock is the master
How do I find a lock's master node?
- Through the resource directory
- The resource directory is replicated at several nodes
- If you don't find a lock master in the directory, you're it! Have to update the directory to reflect your status (see the sketch below)
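A sketch of that lookup logic under simplifying assumptions (a single in-memory directory rather than a replicated one; the class and method names are invented for illustration):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ResourceDirectory {
    // lock name -> node that masters the lock's queues
    private final Map<String, String> masters = new ConcurrentHashMap<>();

    // Which node holds the directory entry for this lock: hash the root of
    // the lock's hierarchical name across the current membership.
    static String directoryNodeFor(String lockRootName, List<String> members) {
        return members.get(Math.floorMod(lockRootName.hashCode(), members.size()));
    }

    // The first node to request the lock becomes its master; later
    // requesters are told who the master is.
    String findOrBecomeMaster(String lockName, String requestingNode) {
        return masters.computeIfAbsent(lockName, name -> requestingNode);
    }
}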
Handling failure
Connection manager to lock managers: "We are in transition, please de-allocate locks."
What does a lock manager do?
- Releases all non-local locks
- Re-acquires all locks owned before the transition (creates new directory nodes and re-distributes masters)
What does this guarantee about the state of data?
- Not much
- Can leave data in an inconsistent state
- No guarantee that the previous lock holder will get it again
Influence of VAXclusters
Many of the concepts are relevant today:
- Distributed locking
- Consensus and failure detection
- High availability using cheap hardware
For example: Google, Facebook, and every other cloud service, e.g., the infrastructure that supports MapReduce jobs.
How can things fall apart
From easier to harder to handle:
- Machines can get slow
- Machines can crash and reboot
- Machines can crash and die
- Machines can become partitioned
- Machines can behave arbitrarily
Step 1: don't lose data if machines crash and reboot.
Step 2: don't lose data if machines crash and die.
What has to happen if machines are not guaranteed to restart after a crash? Transactions have to commit at > 1 machine.
Paxos
"The Part-Time Parliament," ACM Transactions on Computer Systems (TOCS). Submitted: 1990. Accepted: 1998. Introduced: the Paxos protocol.
Butler W. Lampson
Butler Lampson is a Technical Fellow at Microsoft Corporation and an Adjunct Professor at MIT. … He was one of the designers of the SDS 940 time-sharing system, the Alto personal distributed computing system, the Xerox 9700 laser printer, two-phase commit protocols, the Autonet LAN, the SPKI system for network security, the Microsoft Tablet PC software, the Microsoft Palladium high-assurance stack, and several programming languages. He received the ACM Software Systems Award in 1984 for his work on the Alto, the IEEE Computer Pioneer Award in 1996 and von Neumann Medal in 2001, the Turing Award in 1992, and the NAE's Draper Prize in 2004.
Barbara Liskov
MIT professor; 2008 Turing Award.
"Viewstamped Replication," PODC '88. Very similar to Raft.
State machines
At any moment, a machine exists in a "state."
What is a state? Think of it as a set of named variables and their values.
State machines
Clients can ask a machine about its current state.
[Figure: a machine with states 1-6; a client asks "What is your state?" and it answers "My state is 2."]
State machines
"Actions" change the machine's state.
What is an action? A command that updates named variables' values.
State machines
Is an action's effect deterministic? For our purposes, yes: given a state and an action, we can determine the next state with 100% certainty.
State machines
Is the effect of a sequence of actions deterministic? Yes: given a state and a sequence of actions, we can be 100% certain of the end state.
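A tiny, illustrative state machine that makes both properties concrete (the types here are invented for the example): the state is a set of named variables, each action deterministically updates one of them, and replaying the same action sequence from the same start state always produces the same end state.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StateMachine {
    // The "state": named variables and their values.
    private final Map<String, Integer> state = new HashMap<>();

    // An "action": a command that updates one named variable's value.
    record Action(String variable, int value) {}

    void apply(Action a) {
        state.put(a.variable(), a.value());
    }

    // Applying a sequence of actions in order is also deterministic.
    void applyAll(List<Action> actions) {
        actions.forEach(this::apply);
    }

    Map<String, Integer> snapshot() {
        return Map.copyOf(state);
    }
}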
Replicated state machines
Each state machine should compute the same state, even if some fail.
[Figure: several clients ask the replicated machines "What is the state?"]
Replicated state machines
What has to be true of the actions that clients submit? They must be applied in the same order at every replica.
[Figure: clients submit "Apply action a," "Apply action b," and "Apply action c" to the replicas]
State machines
How should a machine make sure it applies actions in the same order across reboots? Store them in a log!
[Figure: a log of actions, applied in order]
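A sketch of the idea, assuming a simple one-action-per-line file format (the class name and file layout are illustrative): append each action to the log before applying it, and replay the log in order after a reboot.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

class ActionLog {
    private final Path log;

    ActionLog(Path log) { this.log = log; }

    // Persist the action before applying it, so a crash can never leave an
    // applied action missing from the log.
    void append(String action) throws IOException {
        Files.write(log, List.of(action), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // On reboot, re-apply every logged action in its original order to
    // reconstruct exactly the pre-crash state.
    List<String> replay() throws IOException {
        return Files.exists(log) ? Files.readAllLines(log) : List.of();
    }
}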
Replicated state machines
Once we have a leader, it begins to service client requests.
[Figure: every machine records Leader=L; a client sends "Apply action a" to the leader, and each machine appends the action to its log]
Replicated state machines
Common approach:
- Take a simple, general service
- Implement it using consensus
- Allow more complex services to use the simple one
Examples:
- Chubby (Google's distributed locking service)
- Zookeeper (Yahoo!'s clone of Chubby)
Zookeeper
Hierarchical file system: a tree of named nodes.
- Directories (contain data, pointers)
- Files (contain data)
Clients can:
- Create/delete nodes
- Read/write files
- Receive notifications
Used for coordination.
[Figure: a tree rooted at /, with /app1, /app2, /app1/config, /app1/online/, and /app1/online/S1 … /app1/online/Sn]
Zookeeper
Two kinds of nodes:
- Persistent: exist until deleted (e.g., service-wide config data)
- Ephemeral: exist until the client departs (e.g., group membership)
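A minimal sketch with the standard org.apache.zookeeper client (the paths are illustrative, echoing the tree above): a persistent node for config next to an ephemeral node that ZooKeeper deletes automatically when its creator's session ends, which is what makes group membership self-cleaning.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

class NodeKinds {
    static void register(ZooKeeper zk, String server) throws Exception {
        // Persistent: survives the creating client's session.
        zk.create("/app1/config", "cfg".getBytes(),
                Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Ephemeral: removed by ZK when this client's session ends.
        zk.create("/app1/online/" + server, new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }
}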
Zookeeper
Automated sequence numbers: node names can be sequenced. This will prove very useful.
- Clients create nodes /app/foo-X
- ZK chooses an order: /app/foo-1, /app/foo-2, /app/foo-…
Zookeeper API
- Create: creates a node in the tree
- Delete: deletes a node in the tree
- Exists: tests if a node exists at a location
- Get children: lists the children of a node
- Get data: reads data from a node
- Set data: writes data to a node
- Sync: waits for data to commit
Can embed information in the hierarchical namespace; can store extra details in an individual node's data.
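A short tour of these calls with the standard Java client (paths and data are illustrative; error handling elided):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkApiTour {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {});

        // Create: add a node carrying some data.
        zk.create("/app1/config", "v1".getBytes(),
                Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Exists: returns a Stat, or null if the node is absent.
        Stat stat = zk.exists("/app1/config", false);

        // Get children / get data / set data.
        System.out.println(zk.getChildren("/app1", false));
        byte[] data = zk.getData("/app1/config", false, stat);
        zk.setData("/app1/config", "v2".getBytes(), stat.getVersion());

        // Delete: version -1 skips the version check.
        zk.delete("/app1/config", -1);
        zk.close();
    }
}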
Zookeeper implementation
A client connects to one server (any server will do).
[Figure: one leader and several followers; each client connects to a single server]
Zookeeper guarantees
- Sequential consistency: updates from a client are applied in the order sent
- Atomicity: no partial results; updates succeed or fail
- Single system image: a client sees the same view, regardless of server
- Reliability: applied updates persist until overwritten by another update
- Timeliness: a client's view is guaranteed to be up-to-date within a time bound
Zookeeper does not provide strong consistency: client reads are not guaranteed to contain all other clients' updates. Zookeeper's consistency guarantees:
- Read my writes (see your own updates)
- Consistent prefix (see a snapshot of the state)
- Monotonic reads (never go backward in time)
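Because each client's reads are served by whichever server it is connected to, a read can miss other clients' recent updates. A client that needs a fresher view can issue sync before reading; here is a sketch using the client's asynchronous sync call (the helper name is invented):

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;

class SyncThenRead {
    // Hypothetical helper: force our server to catch up with the leader,
    // then read, so the read reflects all updates committed before the sync.
    static byte[] readLatest(ZooKeeper zk, String path) throws Exception {
        CountDownLatch done = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> done.countDown(), null);
        done.await();
        return zk.getData(path, false, null);
    }
}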
Example use
Want to crawl and index the web:
- Would like multiple machines to participate
- Want URLs explored at most once
Solution:
- Maintain an ordered queue of URLs
- Crawling machines assign themselves a URL to explore
- Crawling machines may add new URLs to the queue
Requires a producer-consumer queue.
// Class simulating workers adding URLs to the queue
public class CreateQueue {
    private class QueueAddWorker extends ConnectionWatcher implements Runnable {
        private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z");
        private Random r = new Random();
        private String name;

        public QueueAddWorker(String name) {
            this.name = name;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    this.connect("localhost");
                    // PERSISTENT_SEQUENTIAL: ZK appends a unique, increasing
                    // sequence number to the name "/queue/q-"
                    String added = zk.create("/queue/q-", null,
                            Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
                    this.close();
                    Thread.sleep(r.nextInt(1000) + 50);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) {
        CreateQueue cQ = new CreateQueue();
        Thread addWorker1 = new Thread(cQ.new QueueAddWorker("worker1"));
        Thread addWorker2 = new Thread(cQ.new QueueAddWorker("worker2"));
        Thread addWorker3 = new Thread(cQ.new QueueAddWorker("worker3"));
        addWorker1.start();
        addWorker2.start();
        addWorker3.start();
    }
}

Nodes' names are sequenced, e.g., /queue/q-1, /queue/q-2, … Now we can use watches to be notified when new nodes are added.
public class PullQueue extends ConnectionWatcher {
    private class PullWatcher implements Watcher {
        private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z");
        private Random r = new Random();
        private String name;
        private List<String> children;

        public PullWatcher(String name) throws Exception {
            this.name = name;
            children = zk.getChildren("/queue", this);
        }

        @Override
        public void process(WatchedEvent event) {
            try {
                if (event.getType().equals(EventType.NodeChildrenChanged)) {
                    children = zk.getChildren("/queue", this); // get the children and renew watch
                    Collections.sort(children); // we are getting an unsorted list
                    for (String child : children) {
                        if (zk.exists("/queue_lock/" + child, false) == null) {
                            try {
                                zk.create("/queue_lock/" + child,
                                        dfrm.format(new Date()).getBytes(),
                                        Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                                // PROCESS QUEUE ENTRY
                                zk.delete("/queue/" + child, -1);
                            } catch (Exception ignore) {
                                // even though we check the existence of a lock,
                                // it could have been created in the meantime,
                                // making create fail. We catch and ignore it.
                            }
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        PullQueue pQ = new PullQueue();
        try {
            pQ.connect("localhost");
            pQ.new PullWatcher(args[0]);
            Thread.sleep(Long.MAX_VALUE);
        } finally {
            pQ.zk.close();
        }
    }
}

The getChildren("/queue", this) call registers the object as a watcher for the node /queue.
When /queue changes, we check if any children were added; the getChildren call in process also renews the watch.
Start to iterate through the new queue entries. What is the problem with workers concurrently processing new entries? Work would be duplicated, as all workers process all new entries.
How do we prevent duplicate work? Have workers lock entries that they want to process.
The exists/create lines try to create a lock file for the entry. What happens if workers concurrently try to create the same file? Only one will succeed; the failed worker will catch an exception.
Why is it possible for only one create to succeed? ZK returns to the client on commit, and a new node must reach a majority of servers to commit.
Are workers reading the children of /queue_lock guaranteed to see all locks? No, a worker is only guaranteed to see the locks it created.
What happens if a worker fails immediately after creating the lock file? The lock file is ephemeral and disappears when its creator stops responding.
After processing a queue entry, we delete it and try another.
Leader election on top of ZK
Zookeeper elects a leader internally:
- Uses Paxos, but could use Raft too
- Internal ZK leader election isn't exposed to services
- Services can only manipulate ZK nodes
But many services need to elect leaders, e.g., a storage service that uses two-phase commit. It is easy to implement leader election on top of ZK.
Leader election on top of ZK
To volunteer to be a leader:
- Create node z = "/Election/n_" with the sequence and ephemeral flags
- Let C be "/Election/"'s children, and i be z's sequence number
- Watch "/Election/n_j", where j is the smallest sequence number < i
Can two servers create the same "/Election/n_" node? No, ZK ensures that each file has a unique sequence number.
Which server is the leader? The one that created "/Election/n_j".
How will we know when the leader fails? When node "/Election/n_j" is deleted.
When a server is notified that a child of "/Election/" was deleted:
- Let C be the new set of children of "/Election/"
- If z is the smallest node in C, then the volunteer is the leader
- Otherwise, keep watching for changes in the smallest n_j
Can two servers ever think that they are the leader? Something to work out on your own …
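A sketch of this recipe with the standard Java client, following the znode layout from the slides (/Election/n_); re-election and error handling are elided, and the watcher body is left as a comment:

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

class LeaderElection {
    private final ZooKeeper zk;
    private String me; // our child name under /Election, e.g. a node like n_...

    LeaderElection(ZooKeeper zk) { this.zk = zk; }

    // Volunteer: returns true if we are currently the leader.
    boolean volunteer() throws Exception {
        // Sequence flag: ZK appends a unique, increasing number to "n_".
        // Ephemeral flag: the node vanishes if we die, which triggers watches.
        String path = zk.create("/Election/n_", new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        me = path.substring(path.lastIndexOf('/') + 1);

        List<String> children = zk.getChildren("/Election", false);
        Collections.sort(children);
        if (children.get(0).equals(me)) {
            return true; // smallest sequence number: we are the leader
        }

        // Watch n_j, the smallest sequence < ours (the current leader).
        String watched = children.get(0);
        zk.exists("/Election/" + watched, event -> {
            // n_j was deleted: recompute C and check if we are now the leader.
        });
        return false;
    }
}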