VAXclusters: A Closely Coupled Distributed System
Landon Cox, February 10, 2017
Tight and loose coupling
Characteristics of a tightly-coupled system:
- Close proximity of processors
- High-bandwidth communication via shared memory
- Single copy of the operating system
Characteristics of a loosely-coupled system:
- Physically separated processors
- Low-bandwidth, message-based communication
- Independent operating systems
Tight and loose coupling
Tightly-coupled systems can provide great performance. What are the disadvantages of a tightly-coupled system?
- Scaling gets extremely expensive (e.g., supercomputer)
- Relatively hard to extend (add components)
- A failed component can often bring down the entire system
"Closely-coupled" VAXclusters tried to resolve this tension:
- Want extensibility (should be easy to add components over time)
- Want availability (i.e., fault tolerance)
- Should be relatively affordable
- Performance should be acceptable
Tight and loose coupling
Performance bottleneck for loose coupling: communication between processors.
- Processes in tightly-coupled systems use shared memory
- Processes in a loosely-coupled system use messages
[Figure: two processes sharing a single physical memory vs. two processes, each with its own physical memory, exchanging messages]
Tight and loose coupling
What makes message passing so much slower?
- The interconnect (network versus memory bus)
- Need to copy data into and out of various address spaces
[Figure: the same shared-memory vs. message-passing diagram as above]
Close coupling: so we have to make communication fast
System Communication Architecture (SCA)
- Computer Interconnect (CI) for message passing
- The CI Port provided hardware support
Types of messages that SCA supported:
- Messages (small, with ordered, reliable delivery)
- Datagrams (small, with unordered, unreliable delivery)
- Block transfers (large, with ordered, reliable delivery)
CI Port interface
Data structures through which software and the CI Port communicate:
- CI Port registers
- 7 queues (4 command queues, a response queue, and free queues)
[Figure: the port driver places commands into, and takes responses from, queues in physical memory shared with the CI Port; free queues hold recycled buffers]
CI Port interface
Block-transfer commands include pointers into the src/dst address spaces (page tables), identifying contiguous regions for data to be copied. Hosts know this info because it was exchanged via messages: messages/datagrams act as the control plane, block transfers as the data plane.
How does this reduce copying?
- Don't have to copy data into individual messages
- Datagrams/messages include the payload in the command
- CI Ports can reach into memory themselves
How else does this improve performance?
- Far fewer interrupts for the OS to handle
- Instead of interrupting every 576 bytes for a new packet, the CI Port can copy data into the dst address space without interruption
- Only interrupt when the transfer is complete
Storage
Single, networked storage interface:
- Disks were distributed across the cluster
- Single namespace for all files
Advantages of a unified file-system namespace:
- Makes it easier to add new nodes
- Makes sharing files easier
- Can log in anywhere and get to your data
Storage
If we want to allow concurrent access to files, we need synchronization primitives; otherwise, processes can't coordinate their activities. This was a problem for single-host file systems too, but there the OS kernel synchronized access.
Why is synchronization harder in a cluster?
- A cluster is a distributed system
- Nodes can fail, messages can be lost, etc.
Storage
When synchronizing access to in-memory data, locks are also in memory. No magic: locks and objects are all in the same address space.
Why not use file content itself as the basis for syncing (i.e., write "lock=owned" to file lock.txt)?
- Would require very strong consistency guarantees, equivalent to memory accesses
- Horrendously slow, and potentially have to pay the penalty at all times
Implementing locking
First have to agree on cluster membership: if we don't all agree on who is around, it's going to be really hard to agree on anything else.
How does a cluster agree on its membership?
- Each node has a connection manager
- Each connection manager has a copy of the membership
- Use a quorum voting scheme (see the sketch below)
Consensus is a really hard problem:
- Nodes can fail and come back online arbitrarily
- Messages can be lost or slow
- Impossible to distinguish between failures and slow performance
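A minimal sketch of the quorum idea, assuming a simple strict-majority rule (the class and method names are illustrative, not the connection manager's actual interface):

import java.util.Set;

class QuorumVote {
    // A proposed new membership takes effect only if a strict majority of the
    // last agreed-upon membership votes for it. Two disjoint partitions can
    // never both hold strict majorities, so at most one side can proceed.
    static boolean hasQuorum(Set<String> lastKnownMembers, Set<String> votesFor) {
        return votesFor.size() > lastKnownMembers.size() / 2;
    }
}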
Locking interface
Users define locks and their names.
How are locks named?
- Locks form a hierarchical namespace
- Maps nicely onto the file-system namespace
What kinds of modes can locks be in?
- Exclusive access, protected read
- Concurrent read, concurrent write, null, etc. (see the sketch below)
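As a sketch, the six lock modes of the VMS distributed lock manager and their compatibility can be written as a matrix. The matrix below follows the standard DLM convention (null is compatible with everything, exclusive only with null); the slide names the modes but does not spell the matrix out:

enum LockMode {
    NL,  // null: placeholder, conflicts with nothing
    CR,  // concurrent read
    CW,  // concurrent write
    PR,  // protected read: shared readers, no writers
    PW,  // protected write: one writer, concurrent readers allowed
    EX;  // exclusive access

    private static final boolean[][] COMPAT = {
            //        NL     CR     CW     PR     PW     EX
            /*NL*/ { true,  true,  true,  true,  true,  true  },
            /*CR*/ { true,  true,  true,  true,  true,  false },
            /*CW*/ { true,  true,  true,  false, false, false },
            /*PR*/ { true,  true,  false, true,  false, false },
            /*PW*/ { true,  true,  false, false, false, false },
            /*EX*/ { true,  false, false, false, false, false },
    };

    // A request in mode `this` can be granted while another holder has the
    // resource in mode `other` only if the two modes are compatible.
    boolean compatibleWith(LockMode other) {
        return COMPAT[ordinal()][other.ordinal()];
    }
}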
Locks thus far
Lock anytime shared data is read/written:
- Ensures correctness
- Only one thread can read/write at a time
Would like more concurrency. How, without exposing violated invariants?
- Allow multiple threads to read (as long as none are writing)
Reader-writer interface
- readerStart (called when a thread begins reading)
- readerFinish (called when a thread is finished reading)
- writerStart (called when a thread begins writing)
- writerFinish (called when a thread is finished writing)
Invariants:
- If no threads are between writerStart/writerFinish, many threads may be between readerStart/readerFinish
- Only 1 thread may be between writerStart/writerFinish
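A minimal monitor-style implementation of this four-call interface (a sketch: it has no fairness policy, so a steady stream of readers can starve writers):

class ReaderWriter {
    private int readers = 0;         // threads between readerStart/readerFinish
    private boolean writing = false; // true while a thread is between writerStart/writerFinish

    public synchronized void readerStart() throws InterruptedException {
        while (writing) wait();      // readers wait out any active writer
        readers++;
    }

    public synchronized void readerFinish() {
        if (--readers == 0) notifyAll(); // last reader out wakes waiting writers
    }

    public synchronized void writerStart() throws InterruptedException {
        while (writing || readers > 0) wait(); // writers need exclusive access
        writing = true;
    }

    public synchronized void writerFinish() {
        writing = false;
        notifyAll();                 // wake both waiting readers and writers
    }
}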
Reader-writer interface vs. locks
The R-W interface looks a lot like locking: *Start ~ lock, *Finish ~ unlock.
Standard terminology:
- The four functions are called "reader-writer locks"
- A thread between readerStart/readerFinish holds a "read lock"
- A thread between writerStart/writerFinish holds a "write lock"
Pros/cons of R-W vs. standard locks?
- Gain concurrency at the cost of complexity
- Must know how data is being accessed in the critical section
Back to VAXclusters
Hierarchical locks:
- Allow coarse-grained mutual exclusion (tree roots)
- Allow fine-grained concurrency (tree leaves)
Who maintains the locking state (i.e., the queues)?
- Each lock has a master node
- The first node to request the lock is the master
How do I find a lock's master node?
- Through the resource directory
- The resource directory is replicated at several nodes
- If you don't find a lock master in the directory, you're it! Have to update the directory to reflect your status (see the sketch below)
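A sketch of that lookup logic under simplifying assumptions (a single in-memory directory rather than a replicated one; the class and method names are invented for illustration):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ResourceDirectory {
    // lock name -> node that masters the lock's queues
    private final Map<String, String> masters = new ConcurrentHashMap<>();

    // Which node holds the directory entry for this lock: hash the root of
    // the lock's hierarchical name across the current membership.
    static String directoryNodeFor(String lockRootName, List<String> members) {
        return members.get(Math.floorMod(lockRootName.hashCode(), members.size()));
    }

    // The first node to request the lock becomes its master; later
    // requesters are told who the master is.
    String findOrBecomeMaster(String lockName, String requestingNode) {
        return masters.computeIfAbsent(lockName, name -> requestingNode);
    }
}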
Handling failure
Connection manager to lock managers: "We are in transition, please de-allocate locks."
What does a lock manager do?
- Releases all non-local locks
- Re-acquires all locks owned before the transition (creates new directory nodes and re-distributes masters)
What does this guarantee about the state of data?
- Not much
- Can leave data in an inconsistent state
- No guarantee that the previous lock holder will get it again
Influence of VAXclusters
Many of the concepts are relevant today:
- Distributed locking
- Consensus and failure detection
- High availability using cheap hardware
For example: Google, Facebook, and every other cloud service, e.g., the infrastructure that supports MapReduce jobs.
How can things fall apart
From easier to harder to handle:
- Machines can get slow
- Machines can crash and reboot
- Machines can crash and die
- Machines can become partitioned
- Machines can behave arbitrarily
Step 1: don't lose data if machines crash and reboot.
Step 2: don't lose data if machines crash and die.
What has to happen if machines are not guaranteed to restart after a crash? Transactions have to commit at > 1 machine.
Paxos
"The Part-Time Parliament," ACM Transactions on Computer Systems (TOCS). Submitted: 1990. Accepted: 1998. Introduced: the Paxos protocol.
Butler W. Lampson
Butler Lampson is a Technical Fellow at Microsoft Corporation and an Adjunct Professor at MIT. … He was one of the designers of the SDS 940 time-sharing system, the Alto personal distributed computing system, the Xerox 9700 laser printer, two-phase commit protocols, the Autonet LAN, the SPKI system for network security, the Microsoft Tablet PC software, the Microsoft Palladium high-assurance stack, and several programming languages. He received the ACM Software Systems Award in 1984 for his work on the Alto, the IEEE Computer Pioneer Award in 1996 and von Neumann Medal in 2001, the Turing Award in 1992, and the NAE's Draper Prize in 2004.
Barbara Liskov
MIT professor; 2008 Turing Award.
"Viewstamped Replication," PODC '88. Very similar to Raft.
State machines
At any moment, a machine exists in a "state."
What is a state? Think of it as a set of named variables and their values.
State machines
Clients can ask a machine about its current state.
[Figure: a machine with states 1-6; a client asks "What is your state?" and it answers "My state is 2."]
State machines
"Actions" change the machine's state.
What is an action? A command that updates named variables' values.
State machines
Is an action's effect deterministic? For our purposes, yes: given a state and an action, we can determine the next state with 100% certainty.
State machines
Is the effect of a sequence of actions deterministic? Yes: given a state and a sequence of actions, we can be 100% certain of the end state.
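A tiny, illustrative state machine that makes both properties concrete (the types here are invented for the example): the state is a set of named variables, each action deterministically updates one of them, and replaying the same action sequence from the same start state always produces the same end state.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StateMachine {
    // The "state": named variables and their values.
    private final Map<String, Integer> state = new HashMap<>();

    // An "action": a command that updates one named variable's value.
    record Action(String variable, int value) {}

    void apply(Action a) {
        state.put(a.variable(), a.value());
    }

    // Applying a sequence of actions in order is also deterministic.
    void applyAll(List<Action> actions) {
        actions.forEach(this::apply);
    }

    Map<String, Integer> snapshot() {
        return Map.copyOf(state);
    }
}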
Replicated state machines
Each state machine should compute the same state, even if some fail.
[Figure: several clients ask the replicated machines "What is the state?"]
Replicated state machines
What has to be true of the actions that clients submit? They must be applied in the same order at every replica.
[Figure: clients submit "Apply action a," "Apply action b," and "Apply action c" to the replicas]
State machines
How should a machine make sure it applies actions in the same order across reboots? Store them in a log!
[Figure: a log of actions, applied in order]
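A sketch of the idea, assuming a simple one-action-per-line file format (the class name and file layout are illustrative): append each action to the log before applying it, and replay the log in order after a reboot.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

class ActionLog {
    private final Path log;

    ActionLog(Path log) { this.log = log; }

    // Persist the action before applying it, so a crash can never leave an
    // applied action missing from the log.
    void append(String action) throws IOException {
        Files.write(log, List.of(action), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // On reboot, re-apply every logged action in its original order to
    // reconstruct exactly the pre-crash state.
    List<String> replay() throws IOException {
        return Files.exists(log) ? Files.readAllLines(log) : List.of();
    }
}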
Replicated state machines
Once we have a leader, it begins to service client requests.
[Figure: every machine records Leader=L; a client sends "Apply action a" to the leader, and each machine appends the action to its log]
Replicated state machines
Common approach:
- Take a simple, general service
- Implement it using consensus
- Allow more complex services to use the simple one
Examples:
- Chubby (Google's distributed locking service)
- Zookeeper (Yahoo!'s clone of Chubby)
Zookeeper
Hierarchical file system: a tree of named nodes.
- Directories (contain data, pointers)
- Files (contain data)
Clients can:
- Create/delete nodes
- Read/write files
- Receive notifications
Used for coordination.
[Figure: a tree rooted at /, with /app1, /app2, /app1/config, /app1/online/, and /app1/online/S1 … /app1/online/Sn]
Zookeeper
Two kinds of nodes:
- Persistent: exist until deleted (e.g., service-wide config data)
- Ephemeral: exist until the client departs (e.g., group membership)
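A minimal sketch with the standard org.apache.zookeeper client (the paths are illustrative, echoing the tree above): a persistent node for config next to an ephemeral node that ZooKeeper deletes automatically when its creator's session ends, which is what makes group membership self-cleaning.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

class NodeKinds {
    static void register(ZooKeeper zk, String server) throws Exception {
        // Persistent: survives the creating client's session.
        zk.create("/app1/config", "cfg".getBytes(),
                Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Ephemeral: removed by ZK when this client's session ends.
        zk.create("/app1/online/" + server, new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }
}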
Zookeeper
Automated sequence numbers: node names can be sequenced. This will prove very useful.
- Clients create nodes /app/foo-X
- ZK chooses an order: /app/foo-1, /app/foo-2, /app/foo-…
Zookeeper API
- Create: creates a node in the tree
- Delete: deletes a node in the tree
- Exists: tests if a node exists at a location
- Get children: lists the children of a node
- Get data: reads data from a node
- Set data: writes data to a node
- Sync: waits for data to commit
Can embed information in the hierarchical namespace; can store extra details in an individual node's data.
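A short tour of these calls with the standard Java client (paths and data are illustrative; error handling elided):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkApiTour {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {});

        // Create: add a node carrying some data.
        zk.create("/app1/config", "v1".getBytes(),
                Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Exists: returns a Stat, or null if the node is absent.
        Stat stat = zk.exists("/app1/config", false);

        // Get children / get data / set data.
        System.out.println(zk.getChildren("/app1", false));
        byte[] data = zk.getData("/app1/config", false, stat);
        zk.setData("/app1/config", "v2".getBytes(), stat.getVersion());

        // Delete: version -1 skips the version check.
        zk.delete("/app1/config", -1);
        zk.close();
    }
}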
Zookeeper implementation
A client connects to one server (any server will do).
[Figure: one leader and several followers; each client connects to a single server]
Zookeeper guarantees
- Sequential consistency: updates from a client are applied in the order sent
- Atomicity: no partial results; updates succeed or fail
- Single system image: a client sees the same view, regardless of server
- Reliability: applied updates persist until overwritten by another update
- Timeliness: a client's view is guaranteed to be up-to-date within a time bound
Zookeeper does not provide strong consistency: client reads are not guaranteed to contain all other clients' updates. Zookeeper's consistency guarantees:
- Read my writes (see your own updates)
- Consistent prefix (see a snapshot of the state)
- Monotonic reads (never go backward in time)
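Because each client's reads are served by whichever server it is connected to, a read can miss other clients' recent updates. A client that needs a fresher view can issue sync before reading; here is a sketch using the client's asynchronous sync call (the helper name is invented):

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;

class SyncThenRead {
    // Hypothetical helper: force our server to catch up with the leader,
    // then read, so the read reflects all updates committed before the sync.
    static byte[] readLatest(ZooKeeper zk, String path) throws Exception {
        CountDownLatch done = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> done.countDown(), null);
        done.await();
        return zk.getData(path, false, null);
    }
}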
Example use
Want to crawl and index the web:
- Would like multiple machines to participate
- Want URLs explored at most once
Solution:
- Maintain an ordered queue of URLs
- Crawling machines assign themselves a URL to explore
- Crawling machines may add new URLs to the queue
Requires a producer-consumer queue.
// Class simulating workers adding URLs to the queue
public class CreateQueue {
    private class QueueAddWorker extends ConnectionWatcher implements Runnable {
        private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z");
        private Random r = new Random();
        private String name;

        public QueueAddWorker(String name) {
            this.name = name;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    this.connect("localhost");
                    // PERSISTENT_SEQUENTIAL: ZK appends a unique, increasing
                    // sequence number to the name "/queue/q-"
                    String added = zk.create("/queue/q-", null,
                            Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
                    this.close();
                    Thread.sleep(r.nextInt(1000) + 50);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) {
        CreateQueue cQ = new CreateQueue();
        Thread addWorker1 = new Thread(cQ.new QueueAddWorker("worker1"));
        Thread addWorker2 = new Thread(cQ.new QueueAddWorker("worker2"));
        Thread addWorker3 = new Thread(cQ.new QueueAddWorker("worker3"));
        addWorker1.start();
        addWorker2.start();
        addWorker3.start();
    }
}

Nodes' names are sequenced, e.g., /queue/q-1, /queue/q-2, … Now we can use watches to be notified when new nodes are added.
public class PullQueue extends ConnectionWatcher {
    private class PullWatcher implements Watcher {
        private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z");
        private Random r = new Random();
        private String name;
        private List<String> children;

        public PullWatcher(String name) throws Exception {
            this.name = name;
            children = zk.getChildren("/queue", this);
        }

        @Override
        public void process(WatchedEvent event) {
            try {
                if (event.getType().equals(EventType.NodeChildrenChanged)) {
                    children = zk.getChildren("/queue", this); // get the children and renew watch
                    Collections.sort(children); // we are getting an unsorted list
                    for (String child : children) {
                        if (zk.exists("/queue_lock/" + child, false) == null) {
                            try {
                                zk.create("/queue_lock/" + child,
                                        dfrm.format(new Date()).getBytes(),
                                        Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                                // PROCESS QUEUE ENTRY
                                zk.delete("/queue/" + child, -1);
                            } catch (Exception ignore) {
                                // even though we check the existence of a lock,
                                // it could have been created in the meantime,
                                // making create fail. We catch and ignore it.
                            }
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        PullQueue pQ = new PullQueue();
        try {
            pQ.connect("localhost");
            pQ.new PullWatcher(args[0]);
            Thread.sleep(Long.MAX_VALUE);
        } finally {
            pQ.zk.close();
        }
    }
}

The getChildren("/queue", this) call registers the object as a watcher for the node /queue.
When /queue changes, we check if any children were added; the getChildren call in process also renews the watch.
Start to iterate through the new queue entries. What is the problem with workers concurrently processing new entries? Work would be duplicated, as all workers process all new entries.
How do we prevent duplicate work? Have workers lock entries that they want to process.
The exists/create lines try to create a lock file for the entry. What happens if workers concurrently try to create the same file? Only one will succeed; the failed worker will catch an exception.
Why is it possible for only one create to succeed? ZK returns to the client on commit, and a new node must reach a majority of servers to commit.
Are workers reading the children of /queue_lock guaranteed to see all locks? No, a worker is only guaranteed to see the locks it created.
What happens if a worker fails immediately after creating the lock file? The lock file is ephemeral and disappears when its creator stops responding.
After processing a queue entry, we delete it and try another.
Leader election on top of ZK
Zookeeper elects a leader internally:
- Uses Paxos, but could use Raft too
- Internal ZK leader election isn't exposed to services
- Services can only manipulate ZK nodes
But many services need to elect leaders, e.g., a storage service that uses two-phase commit. It is easy to implement leader election on top of ZK.
Leader election on top of ZK
To volunteer to be a leader:
- Create node z = "/Election/n_" with the sequence and ephemeral flags
- Let C be "/Election/"'s children, and i be z's sequence number
- Watch "/Election/n_j", where j is the smallest sequence number < i
Can two servers create the same "/Election/n_" node? No, ZK ensures that each file has a unique sequence number.
Which server is the leader? The one that created "/Election/n_j".
How will we know when the leader fails? When node "/Election/n_j" is deleted.
When a server is notified that a child of "/Election/" was deleted:
- Let C be the new set of children of "/Election/"
- If z is the smallest node in C, then the volunteer is the leader
- Otherwise, keep watching for changes in the smallest n_j
Can two servers ever think that they are the leader? Something to work out on your own …
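A sketch of this recipe with the standard Java client, following the znode layout from the slides (/Election/n_); re-election and error handling are elided, and the watcher body is left as a comment:

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

class LeaderElection {
    private final ZooKeeper zk;
    private String me; // our child name under /Election, e.g. a node like n_...

    LeaderElection(ZooKeeper zk) { this.zk = zk; }

    // Volunteer: returns true if we are currently the leader.
    boolean volunteer() throws Exception {
        // Sequence flag: ZK appends a unique, increasing number to "n_".
        // Ephemeral flag: the node vanishes if we die, which triggers watches.
        String path = zk.create("/Election/n_", new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        me = path.substring(path.lastIndexOf('/') + 1);

        List<String> children = zk.getChildren("/Election", false);
        Collections.sort(children);
        if (children.get(0).equals(me)) {
            return true; // smallest sequence number: we are the leader
        }

        // Watch n_j, the smallest sequence < ours (the current leader).
        String watched = children.get(0);
        zk.exists("/Election/" + watched, event -> {
            // n_j was deleted: recompute C and check if we are now the leader.
        });
        return false;
    }
}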