ORDERING AND DURABILITY IN ISIS 2 Ken Birman 1 Cornell University.


Isis 2 System 2  Core functionality: groups of objects  … fault-tolerance, speed (parallelism), coordination  Intended for use in very large-scale settings  The local object instance functions as a gateway  Read-only operations performed on local state  Update operations update all the replicas  [Figure: myGroup, with labels “join myGroup”, state transfer, update]

Terminology we’ve used 3  Process group: A term for a collection of programs that are all running (perhaps on different machines, perhaps on the same machine) and that use Isis 2  Each process group has a name (you pick it)  You can have multiple groups in one application  Message: Data encoded to be sent between programs  State transfer: Data to initialize a new group member  Update: Any action that changes the shared data  Lookup: Any action that only queries the data  Multicast: A message sent to every group member

Example: Cloud-Hosted Service 4  [Figure: a standard Web-Services method invocation arrives at some service (A, B, C, D); a distributed request (SafeSend) updates group “state”; then the response is returned]

Multicast properties 5  In the figure, “SafeSend” is a “multicast”  A message that can be sent to a whole group  What properties do these multicasts need to keep the group members consistent?  In Isis 2 we focus on  Ordering properties: relative to group membership changes, and relative to other multicasts  Durability guarantees: what happens if a crash occurs?

Key idea: View ordering 6  In Isis 2, new View upcalls are synchronized relative to message delivery

Membership changes 7  When a group gains or loses a member, the Isis 2 Oracle sequences the new view relative to other multicasts. Thus any multicast is delivered in the same view, from the perspective of all recipients.  Also, if a multicast is sent to the group in some view, it reaches all members of the group (of course if some crash, they might not process the message)  State transfers occur after every multicast has been delivered in the prior view and before any are delivered in the new view Group View is synchronized relative to multicasts

Message Ordering 8  The basic idea of Isis 2 is to deliver all multicasts in the same order at all group members receiving them  This keeps the data consistent and allows you to implement “state machine” algorithms: group members perform any desired actions in the same state and in the same order  But we offer various implementations of multicast and if you use them very wisely, some are faster than others. The caveat is that the fast versions can only be used in certain situations, which we’ll discuss.

A multicast arrives in a group… 9  What information is “the same” for all recipients?  If they call g.GetView(), or remembered properties of the most recently delivered view, all see same view  Also, everyone got the message  And the requested ordering was enforced by Isis 2  What aspects might differ, for different receivers?  Each has its own “rank” in the membership list, obtained by calling v.GetMyRank() or v.GetRankOf(who)

What about failures? 10  What if a failure happens just as a multicast is being sent?

Delayed delivery 11  In Isis 2, a multicast send will often delay (in the platform) for a little while before delivery occurs  As a result, the sender does not know that the group view will be the same when the message is delivered  [Figure caption: This multicast might have been “sent” in the prior view, when r, s and t weren’t yet members!]

How can we know for sure? 12  Suppose the sender of a Query needs to know how many members processed the query, e.g. to notice that some reply is missing due to a failure. What can it do to know?  One option is to have the receivers include View information (such as how many members were in the View, what rank each replying member had) in the Reply()  The sender is also a receiver, so another approach is for the sender to wait for its own multicast or Query to be delivered and then make note of the View

How do we know who sent a message? 13  You can just include the sender’s Address in the arguments to the message  Cool Isis 2 fact:  After you see a View notifying you that some member has failed or voluntarily left the group, you will never receive additional multicasts from that sender!  If a process leaves a group but then tries to send in it, Isis 2 throws an exception in that sender.

No messages from the dead 14  In the Isis 2 system, you never receive messages from the deceased  Isis 2 watches for “late” messages that came from a process which is already considered to have died  It actively blocks such messages and won’t deliver them  Thus if you reconfigure after a failure, and reassign roles, you can’t get a kind of split-brain effect due to late delivery of a message
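The rule on this slide can be modeled in a few lines. The sketch below is a standalone Python model, not Isis 2 code: the message layout and the `deliverable` helper are hypothetical, and it only illustrates the idea of discarding late messages whose sender is no longer in the current view.

```python
def deliverable(message, view_members):
    """Return True only if the sender is still a member of the current view."""
    return message["sender"] in view_members

# Assumed scenario: p and q have already been reported as failed.
view = {"r", "s", "t"}
inbox = [
    {"sender": "p", "body": "late update"},        # arrives after p's failure was reported
    {"sender": "r", "body": "new leader update"},
]

# Late messages from departed members are filtered out before delivery.
delivered = [m["body"] for m in inbox if deliverable(m, view)]
print(delivered)
```

Because the late message from p is never delivered, a role that was reassigned after the failure cannot be contradicted by a stale sender.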

Ordering Properties 15  The most important form of message ordering is “total order”  Obtained by using g.OrderedSend or g.SafeSend  They both provide the same ordering guarantee. They have different durability properties  Everyone receives these in the same order  [Figure: multicasts A and B; everyone receives A first and B second]
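To make “total order” concrete, here is a small Python model. It uses a central sequencer, which is one textbook way to obtain a total order; the slides do not say how Isis 2 implements OrderedSend or SafeSend, so the `Sequencer` class and `deliver_in_order` helper are illustrative assumptions only.

```python
import itertools

class Sequencer:
    """Hands out consecutive sequence numbers (assumed design, not the Isis 2 protocol)."""
    def __init__(self):
        self._next = itertools.count()

    def stamp(self, msg):
        return (next(self._next), msg)

def deliver_in_order(stamped_arrivals):
    """Each member sorts by sequence number before delivering."""
    return [msg for _, msg in sorted(stamped_arrivals)]

seq = Sequencer()
a = seq.stamp("A")
b = seq.stamp("B")

# The network may hand the messages to members in different arrival orders.
at_p = deliver_in_order([a, b])
at_r = deliver_in_order([b, a])
print(at_p, at_r)
```

Both members end up delivering A first and B second, whatever the arrival order was.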

Weaker ordering 16  Some applications want the lowest possible message latency  OrderedSend will usually achieve the lowest delay, but not always (slower case: when multiple group members are calling OrderedSend concurrently)  SafeSend uses a much slower approach  For the very best speed, protocols guaranteed to be faster are available: Send and RawSend

A FIFO Ordering situation 17  Suppose one process sends all the multicasts that update some variable in a group. What ordering is really needed?  In this group, only the oldest living member sends multicasts  FIFO suffices!  [Figure: timeline (Time: 0 to 70) for processes p, q, r, s, t. We say that p is the leader; it has rank 0. After p and q fail, r is the leader; it has rank 0 in the new view]

A FIFO Ordering Situation 18  In this group we really only need to deliver messages in the order the leader sent them  For this purpose, the Send primitive is ideal  Send respects the FIFO order its sender used  Guaranteed to be extremely fast  RawSend: Send, but with no effort to guarantee reliability. Respects FIFO order… unless message is lost

What if two senders use Send? 19  When different senders use Send, the ordering will depend on when the messages showed up!  Different members might see different orderings  Example: r sees A B  … but p sees B A  [Figure: multicasts A and B]
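The point of this slide is that two delivery orders can both be FIFO-correct and still disagree with each other. The Python check below is a sketch with hypothetical names (`respects_fifo`, the message ids); it models the property only, not the Send protocol.

```python
def respects_fifo(delivery, sent_by):
    """delivery: message ids in the order one member delivered them.
    sent_by: sender -> message ids in the order that sender sent them."""
    position = {m: i for i, m in enumerate(delivery)}
    for msgs in sent_by.values():
        idx = [position[m] for m in msgs if m in position]
        if idx != sorted(idx):      # some sender's messages were reordered
            return False
    return True

sent_by = {"p": ["A1", "A2"], "q": ["B1"]}
seen_by_r = ["A1", "B1", "A2"]
seen_by_p = ["B1", "A1", "A2"]

print(respects_fifo(seen_by_r, sent_by), respects_fifo(seen_by_p, sent_by))
print(seen_by_r == seen_by_p)
```

Both members respect each sender's order, yet they interleave the two senders differently, which is exactly the situation where a total-order primitive is needed instead.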

When is FIFO good enough? 20  Suppose our group manages a collection of data items  Each item has its own leader and only the leader sends updates for that item  Consistency: It suffices to apply updates in the order they were sent. g.Send() will do this!  But beware…  Multicasts from different senders can interleave in unpredictable ways

When would you use RawSend? 21  This primitive doesn’t guarantee reliability  We use it when reporting data from real-time sensors  We want the data delivered in order (new data replaces older data). RawSend is still FIFO ordered  But if data is lost, there is no point “wasting time” in the platform retransmitting it.
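A receiver of sensor data sent this way typically just keeps the newest reading. The Python sketch below assumes the application tags each reading with its own sequence number; that tag and the `LatestReading` class are hypothetical and not part of the RawSend API.

```python
class LatestReading:
    """Keeps only the newest sensor value; a lost reading is simply skipped."""
    def __init__(self):
        self.seq = -1
        self.value = None

    def on_receive(self, seq, value):
        if seq > self.seq:              # newer data replaces older data
            self.seq, self.value = seq, value
            return True
        return False                    # stale reading, ignored

sensor = LatestReading()
for seq, value in [(0, 20.1), (1, 20.4), (3, 21.0)]:   # reading 2 was lost in transit
    sensor.on_receive(seq, value)
print(sensor.seq, sensor.value)
```

Nothing waits for reading 2: by the time a retransmission could arrive, reading 3 has already superseded it.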

What about Query ordering? 22  Each kind of multicast has an associated Query
Multicast      Matching Query
RawSend        RawQuery
Send           Query
OrderedSend    OrderedQuery
SafeSend       SafeQuery

CausalSend 23  Included mostly for academic reasons, but not used very often in Isis 2  Intended for situation in which the leader role moves around for each data item  First p is in charge, then q is the leader for a while, then r, then back to p…  CausalSend will respect the FIFO order “with moving leaders”. But we don’t recommend using it.

CausalSend picture: B is “after” A 24  [Figure: multicasts A and B]

Causality idea 25  If B “might have been caused by A”, then B is causally ordered after A (we write A → B)  CausalSend tracks these causality dependencies and makes sure that if A → B, then B will be delivered after A  But the Isis 2 implementation of CausalSend is slow and this is why it isn’t used very often
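One standard way to track “A → B” is with vector clocks. The Python sketch below shows that textbook delivery rule; the slides do not describe how CausalSend is implemented inside Isis 2, so treat the functions and the message layout here as assumptions made for illustration.

```python
def can_deliver(msg_clock, sender, local_clock):
    """Textbook causal-delivery test: the message must be the next one from its
    sender, and everything it depends on must already have been delivered here."""
    for proc, count in msg_clock.items():
        seen = local_clock.get(proc, 0)
        if proc == sender:
            if count != seen + 1:
                return False
        elif count > seen:
            return False
    return True

def deliver_causally(arrivals):
    """Buffer messages that arrive too early and deliver them once they are safe."""
    local, pending, delivered = {}, list(arrivals), []
    progress = True
    while pending and progress:
        progress = False
        for m in list(pending):
            name, sender, clock = m
            if can_deliver(clock, sender, local):
                local[sender] = clock[sender]
                delivered.append(name)
                pending.remove(m)
                progress = True
    return delivered

A = ("A", "p", {"p": 1})
B = ("B", "q", {"p": 1, "q": 1})   # q sent B after it had delivered A, so A -> B

order = deliver_causally([B, A])   # B reaches this member first
print(order)
```

Even though B arrives first, it is held back until A has been delivered.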

Durability 26  Exactly what happens in the event of a failure?

Durability 27  A durability guarantee is the property that information will survive a failure  There are several cases to think about  What if the sender of a multicast fails but someone received the multicast?  What if the sender and every receiver (so far) fails?  What if a whole group fails, but later restarts?  What if the group is managing a replicated database or files that aren’t even on the same computers?

Soft State in the Cloud 28  Many Isis 2 applications run in cloud settings, and the cloud favors “soft state”  After a node crashes, the entire VM is reloaded  Thus any local state (even local files) is restored to its original state! All local data vanishes  We say that a group manages “hard state” if the group members can fail and yet their state lives on  In the cloud a hard-state node costs more $$$

Two cases thus arise 29  Durability for soft-state scenarios  Here the entire state “lives in the group members”  They might have files, but the files won’t be preserved if those members crash and later restart, even on the same nodes.  Very common in today’s cloud  Durability for hard-state cases  Here the state really is outside the group

Multicast durability 30  Isis 2 offers all-or-nothing delivery guarantees  Either every group member receives your multicast, or no group member receives it, even if the sender fails. As we saw, if a sender fails, its messages will be delivered before Isis 2 reports the failure  But this statement didn’t explain what happens when a receiver crashes “instantly”

Two options: Optimistic/Pessimistic 31  Optimistic case (Send, CausalSend, OrderedSend):  Messages are delivered instantly on arrival (low delay)  But if the sender and all receivers with copies fail, an optimistic message is lost forever even though it might have been delivered to some processes right before they crashed  An optimistic protocol always looks like it was all-or-nothing, but if you could see the details, you might see that in fact, it was delivered, but then “forgotten”

Optimistic delivery 32  Consider messages B and C  B was delivered to r,s and t. But it didn’t reach p and q because of a network failure.  C was delivered by p and q but never reached r,s,t  But notice that p and q both crashed  In a soft-state case, no evidence survived (unless they talked to someone outside the group – an external client, for example)  In effect, the surviving portion of the system is consistent  [Figure: multicasts A, B and C]

Optimistic delivery is fastest 33  We deliver messages as soon as they arrive  But the price of this speed (which is a big benefit) is that these two “bad cases” can arise.  Nobody can tell when these things happen, unless p or q talked to an external client  … which leads to the idea of g.Flush(k)

How does Flush(k) work? 34  g.Flush(k) pauses until k group members definitely have all the prior optimistic multicasts.  g.Flush() waits for all members, but this is slow  Normally k=2 or k=3 is fine…  By calling g.Flush(2) or g.Flush(3) before talking to an external client, we can be sure the client never observes these bad cases!

With g.Flush(k)… 35  … those stray delivery events can still occur, but we know that no external observer notices them!  If g.Flush(3) is called prior to talking to the observer, then until there are 3 or more copies of the message, the Flush waits.  In our example the crash would have occurred while we were waiting for g.Flush() to finish  If a tree falls in a forest… If a message is delivered but every process that saw it crashes, the effect is the same as if the message wasn’t delivered!


When to call g.Flush(k) 37  Use this primitive  When working with optimistic multicast protocols like Send, OrderedSend  Call it prior to interacting with something outside of the group, like an external client who issued a request  With g.Flush after g.OrderedSend, we get the guarantee that the group won’t forget the update. Without g.Flush, an unlikely failure sequence could cause a problem (sender+first recipients all die).
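The waiting condition behind g.Flush(k) can be pictured with a tiny Python model. This is not the Isis 2 implementation: `FlushModel` and its methods are hypothetical, and the model only tracks how many members hold a copy of each earlier optimistic multicast.

```python
class FlushModel:
    """Tracks which members are known to hold each optimistic multicast."""
    def __init__(self):
        self.copies = {}            # message id -> set of members holding a copy

    def record_copy(self, msg_id, member):
        self.copies.setdefault(msg_id, set()).add(member)

    def flush_ready(self, k):
        """True once every earlier multicast is held by at least k members."""
        return all(len(holders) >= k for holders in self.copies.values())

model = FlushModel()
model.record_copy("C", "p")
model.record_copy("C", "q")
ready_before = model.flush_ready(3)   # only p and q have C so far

model.record_copy("C", "r")
ready_after = model.flush_ready(3)    # a third copy exists now
print(ready_before, ready_after)
```

In the crash scenario from the earlier slide, a Flush(3) would still be waiting when p and q fail, so no reply would have gone out to the external client.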

Pessimistic Delivery 38  SafeSend is much more pessimistic  This protocol is a kind of 2-phase commit  Gives the message to recipients, and they hold it (Two cases: In-memory logging, or on-disk logging)  When all have confirmed receipt, then delivery is authorized  No g.Flush(): it wouldn’t ever need to wait

Where’s the durable state? 39  SafeSend raises a question of where the state lives  For our optimistic protocols, state lives in the group  But Isis 2 can also support two more cases State lives in a checkpoint that will be reloaded if the whole group shuts down and restarts State lives in a database or in files external to the group SafeSend with disk logging aims at this second case

Should I always use SafeSend? 40  The SafeSend protocol is very costly and scales poorly, so it isn’t a great choice in the cloud  Also, using it correctly is a bit tricky  Better rule of thumb: use g.OrderedSend+g.Flush

Sidebar: Paxos family of protocols 41  Experts in this area will know about Leslie Lamport’s famous Paxos protocol (Wikipedia has a nice writeup)  It provides ordered, durable “actions”  These are often updates to a replicated database  SafeSend is the Isis 2 name for Paxos  You don’t really need to learn about Paxos to understand how SafeSend works, but I’ll include some comments aimed at people who do know about Paxos in this lecture, simply because that work is so famous.

How Paxos works 42  Paxos is basically a kind of 2-phase commit  In the first phase a leader proposes some action (for us, a multicast)  A quorum of group members (the acceptors) need to vote in favor of the proposed ordering for the message, and they need to first save it in a durable place (usually a log that lives on the disk)  In the second phase, delivery occurs (in Paxos: the learners are informed about the new event)

Paxos has a notion similar to Flush(k) 43  In Paxos you can specify the number of “acceptors” that must have a copy of a message before it can be delivered.  In Isis 2 the same setting is available through g.SetSafeSendThreshold(k)  SafeSend is a true implementation of Paxos if this number is more than half the group members.  With a smaller k, like k=2 or k=3, in a big group, SafeSend starts to act exactly like g.OrderedSend()+g.Flush(k)
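The “more than half” condition is simple arithmetic, sketched here in Python. The helper names are hypothetical; only the rule itself (a threshold k behaves as a true Paxos quorum when k exceeds half the group) comes from the slide.

```python
def is_majority_quorum(k, group_size):
    """True when k acceptors are more than half of the group."""
    return 2 * k > group_size

def min_majority(group_size):
    """Smallest threshold that is still a majority."""
    return group_size // 2 + 1

print(is_majority_quorum(3, 5))    # 3 of 5 is a majority
print(is_majority_quorum(3, 32))   # 3 of 32 is not
print(min_majority(32))
```

So with k=3 in a 32-member group, SafeSend is operating in the regime the slide compares to g.OrderedSend()+g.Flush(3).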

Isis 2: Send vs. SafeSend 44  [Figure: Send compared with SafeSend]  Send scales best, but SafeSend with modern disks (RAM-like performance) and small numbers of acceptors isn’t terrible.

Jitter: how “steady” are latencies? 45  [Figure: variance from mean, 32-member case]  The “spread” of latencies is much better (tighter) with Send: the 2-phase SafeSend protocol is sensitive to scheduling delays

Flush delay as function of shard size 46 Flush is fairly fast if we only wait for acks from 3-5 members, but is slow if we wait for acks from all members. After we saw this graph, we changed Isis 2 to let users set the threshold.

Several ways to make data durable 47  Putting our insights to work…

Checkpointing 48  Any group can be made durable using a checkpointing file  Call g.Persistent(filename)  Checkpoint will periodically be saved, or you can force the creation of checkpoints at times convenient to you  Entire group shares a single checkpoint file and it would normally live in the global file system. It should not live in any sort of soft-state file system!  On restart from a total shutdown, checkpoint is reloaded and the group recovers to its old state

External databases 49  If a group is being used to replicate something like a set of external MySQL databases, recovering the group state just isn’t good enough  We also need to make sure the MySQL replicas are in identical states after a recovery  This is the case where we use SafeSend with the disk logging option enabled

What is the disklogger? 50  The disklogger is a special form of logged checkpoint, similar to the one used for g.Persistent()  But whereas normally there is just one durability log, this log is replicated with one copy per acceptor  Messages delivered by SafeSend are appended to this log during phase one  When an acceptor restarts, its log is scanned and “replayed”. Isis 2 will garbage collect a message once all the learners have seen it

Example: Cloud-Hosted Service 51  [Figure: a standard Web-Services method invocation arrives at some service (A, B, C, D); a distributed request (SafeSend) updates group “state” and an external DB; then the response is returned]  Use the Isis 2 version of Paxos to replicate an external database

Example: Cloud-Hosted Service 52  [Figure: a standard Web-Services method invocation arrives at some service (A, B, C, D); a distributed request (Send, then g.Flush()) updates an in-memory collection; then the response is returned]  Cheaper multicast+Flush suffices with in-memory replicas or other situations with soft state, like files local to the replicas on VMs that will be reloaded if a crash occurs

Check your understanding 53  Suppose we use SafeSend as shown in the figure, with 4 group members, and all are acceptors  You send 1 message. How many disk writes occur?  At least 4 (one per log) and perhaps 8 (the database may have a log too). Also, database needs to be updated!
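The count in this exercise can be written out as a short calculation. The Python helper below is hypothetical and encodes only the slide's own reasoning: one log write per acceptor, plus possibly one database-log write per replica. It leaves out the database update itself, which the slide mentions separately.

```python
def disk_writes(acceptors, database_logs=False):
    """Log writes for one SafeSend message under the slide's assumptions."""
    safesend_log_writes = acceptors                      # one DiskLogger append per acceptor
    db_log_writes = acceptors if database_logs else 0    # the database may keep its own log
    return safesend_log_writes + db_log_writes

print(disk_writes(4))                       # lower bound from the slide
print(disk_writes(4, database_logs=True))   # upper figure from the slide
```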

g.SetDurabilityMethod 54  Recovery with an external database is a pain!

SetDurabilityMethod 55  You must tell SafeSend to use the DiskLogger durability method  When you do this, SafeSend has an extremely strong guarantee: it won’t ever forget messages, until it is explicitly told to do so by your code  This yields a version suitable for use when replicating a database

Recovering a database replica 56  After restarting a failed database replica, SafeSend with the DiskLogger durability method will replay all messages that it knows about  Your job is to make sure all of these updates have been applied to the database, exactly once  After that you tell SafeSend it can safely garbage collect these messages, and it does so when every group member has told it that the message is safe to garbage collect (at that point, it truncates the disk log)
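The “exactly once” requirement usually comes down to remembering which message ids have already been applied. The Python sketch below models that idea with an in-memory stand-in for the database; the `Replica` class, the message layout, and the `replay` helper are all hypothetical, and none of this is the DiskLogger API.

```python
class Replica:
    """Stand-in for a database replica that remembers which updates it applied."""
    def __init__(self):
        self.rows = {}
        self.applied = set()    # in a real database this would be stored durably with the data

    def apply(self, msg_id, key, delta):
        if msg_id in self.applied:      # replayed duplicate: skip it
            return False
        self.rows[key] = self.rows.get(key, 0) + delta
        self.applied.add(msg_id)
        return True

def replay(replica, log):
    """Apply every logged update at most once; return ids that are now safe to collect."""
    safe = []
    for msg_id, key, delta in log:
        replica.apply(msg_id, key, delta)
        safe.append(msg_id)
    return safe

replica = Replica()
replica.apply(1, "Harry", 5)                 # applied before the crash

log = [(1, "Harry", 5), (2, "Harry", 3)]     # what gets replayed after restart
safe_to_collect = replay(replica, log)
print(replica.rows, safe_to_collect)
```

Update 1 is not applied a second time, update 2 is applied once, and both ids can then be reported as safe to garbage collect.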

Why not always use SafeSend? 57  SafeSend is harder to use  Must write code to handle replay of the log after recovery.  And SafeSend is also slower  Many people who assume Paxos is lightweight are surprised that all Paxos systems have high costs  Paxos is really a kind of durable database – a database of messages!

Durability Summary 58  To recap:  If your application maintains data purely inside the members of the group, or purely in memory, you can use the standard “optimistic” methods  Call g.Flush(k) if worried about the tree-in-the-forest case  Use checkpointing to a log (g.Persistent()) to make the group state survive complete shutdowns  But switch to SafeSend for the strongest durability requirements. You’ll need to enable the DiskLogger durability method, and to write code to handle restarts and to tell SafeSend when it can garbage collect the log.

Making Checkpoints 59  How does one make a checkpoint?

State transfer 60  In general, group members manage data (state)  When s and t join in this example, they don’t have the current state for the group. They obtain it via a state transfer: the white arrow.  In this example, p “writes down” its state (a checkpoint)  Then s and t “load” the state (they read the checkpoint) White Arrow is a state transfer

Making a checkpoint 61  You can save any state you wish  You can call SendChkpt as many times as needed
    int istuff;
    double dstuff;
    g.MakeChkpt += (Isis.ChkptMaker)delegate(View nv) {
        g.SendChkpt(istuff);   // Checkpoint a single integer
        g.SendChkpt(dstuff);   // Checkpoint a single floating point value
        g.EndOfChkpt();        // Finished making the checkpoint
    };
    g.LoadChkpt += (loadichkpt)delegate(int what) {
        IsisSystem.WriteLine(name + ": Got integer checkpoint: istuff=" + what);
        istuff = what;
    };
    g.LoadChkpt += (loaddchkpt)delegate(double what) {
        IsisSystem.WriteLine(name + ": Got double checkpoint: dstuff=" + what);
        dstuff = what;
    };

Steps 62  The MakeChkpt method is called from time to time in your program.  You can control exactly when this will happen  That updates the log files  Later, after restart, the LoadChkpt method(s) will be called to reload the saved state

To make a group persistent, store it in a global file system 63  It will be loaded into the NEXT instance that runs
    int istuff;
    double dstuff;
    g.MakeChkpt += (Isis.ChkptMaker)delegate(View nv) {
        g.SendChkpt(istuff);   // Checkpoint a single integer
        g.SendChkpt(dstuff);   // Checkpoint a single floating point value
        g.EndOfChkpt();        // Finished making the checkpoint
    };
    g.LoadChkpt += (loadichkpt)delegate(int what) {
        IsisSystem.WriteLine(name + ": Got integer checkpoint: istuff=" + what);
        istuff = what;
    };
    g.LoadChkpt += (loaddchkpt)delegate(double what) {
        IsisSystem.WriteLine(name + ": Got floating point checkpoint: dstuff=" + what);
        dstuff = what;
    };
Note: You must also call myGroup.Persistent(gname); This tells Isis 2 to keep checkpoints in a file (in this case with the same name as the group). There are also ways to control when the checkpoint will be made

Why did we register two loaders? 64  Isis 2 is polymorphic  Each method can be defined many times with different type signatures  As events occur, upcalls are done to the ones that match  In our examples we had just one argument to SendChkpt(), but we could have given many: g.SendChkpt(x, y, z,....);  Any data type is allowed but you must register user-defined types with Isis first
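The matching of upcalls to type signatures can be imitated in a few lines of Python. This is a model of the idea only, with a hypothetical `register` decorator and `deliver` function; it is not how Isis 2 dispatches internally.

```python
loaders = {}    # type signature (tuple of types) -> handler
state = {}

def register(*types):
    """Register a handler for one argument-type signature."""
    def wrap(fn):
        loaders[types] = fn
        return fn
    return wrap

@register(int)
def load_int(what):
    state["istuff"] = what

@register(float)
def load_float(what):
    state["dstuff"] = what

def deliver(*args):
    """Upcall the handler whose signature matches the argument types."""
    signature = tuple(type(a) for a in args)
    loaders[signature](*args)

deliver(7)      # matches the int loader
deliver(2.5)    # matches the float loader
print(state)
```

Each checkpointed value finds its own loader purely by type, which is why the C# example registers one LoadChkpt handler for the integer and another for the double.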

State transfer uses checkpoints! 65  If the checkpoint methods are defined, Isis 2 will ask for a checkpoint just as a new member joins  The old member makes the checkpoint  The new member loads it  This initializes the joining member  [Figure: myGroup, with labels state transfer, update]

Persistent or just State Transfer? 66  Can we tell what a checkpoint will be used for? Can we do “per use” checkpoints?

What are checkpoints used for? 67  When you define a checkpoint create/load method, that automatically enables state transfer for joining members  With g.Persistent(), checkpoints play two roles: in addition to state transfer, they are also logged into a recovery log file that will be reread after recovery from a total shutdown

What if the group state is large? 68  State transfer could be s..l..o..w.. And while it happens, the group freezes up!  [Figure: multicasts A and B]

What if the state is very large? 69  Really large states can be slow to transfer. While they are being sent, the group itself might hiccup  Best solution? Pre-transfer that huge state, perhaps using the highly efficient “Isis OOB” tool  Out of band transfer is minimally disruptive and faster too because the Isis 2 system optimizes heavily for this  But perhaps a few updates might occur after the pretransfer and before the member is added.  So you can include an argument to Join that tells how big the pre-transfer was, or what “time” it was made. Then the checkpoint only needs to include the delta!

Pretransfer 70  In this picture we send data to r, s and t “out of band”  Isis 2 has a tool for that, the OOB file transfer tool. Ideal for big copying  When they join, we send just the residual delta…

Enabling this feature 71  Instead of calling g.Join(), call g.Join(offset)  Offset tells the group how much of the state you have.  It shows up in the View argument to the make checkpoint method  Offset 0 means “send the whole state”  Example: pretransfer included updates 0… 12345. So you call g.Join(12345). The state transfer contains just updates 12346-12348…
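The delta computation can be sketched as follows. This Python model assumes updates are numbered from 1 and that the offset is the number of the last update the joiner already holds, so that offset 0 means “send everything”; the `make_delta` helper and the numbering are assumptions for illustration, not the Isis 2 checkpoint API.

```python
def make_delta(updates, offset):
    """Return only the updates the joining member does not yet have.
    updates: list of (update_number, payload), numbered from 1 (assumed)."""
    return [(n, payload) for n, payload in updates if n > offset]

updates = [(n, f"update-{n}") for n in range(1, 9)]   # the group has seen updates 1..8

delta = make_delta(updates, 5)        # joiner pre-transferred updates up to 5
full = make_delta(updates, 0)         # offset 0: send the whole state
print([n for n, _ in delta], len(full))
```

The member making the checkpoint would read the offset from the View argument and send only this residual tail.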

When does State Transfer occur? 72  What happens in an application that experiences many “events” all at the same time?

Isis 2 has a strong consistency model: a new form of virtual synchrony. 73  Virtual synchrony is a “consistency” model:  Membership epochs: begin when a new configuration is installed and reported by delivery of a new “view” and associated state  Protocols run “during” a single epoch: rather than overcome failure, we reconfigure when a failure occurs  [Figure: three panels: non-replicated reference execution (A=3, B=7, B = B-A, A=A+1), synchronous execution, virtually synchronous execution]

What Isis 2 ensures is that... 74  State transfer “seems” to occur at the instant when a new view is delivered (all prior multicasts have already been performed)  This means that the member preparing the state has the correct values for state variables needed by joining member!  It is “safe” to send this state  If desired, there is a way for you to specify which member will send state to each joining process

How do Queries handle failure? 75

Queries when failures occur… 76
    Group g = new Group("myGroup");
    Dictionary<string,double> Values = new Dictionary<string,double>();
    g.ViewHandlers += delegate(View v) {
        Console.Title = "myGroup members: " + v.members;
    };
    g.Handlers[UPDATE] += delegate(string s, double v) {
        Values[s] = v;
    };
    g.Handlers[LOOKUP] += delegate(string s) {
        g.Reply(Values[s]);
    };
    g.Join();
    g.Send(UPDATE, "Harry", 20.75);
    List<double> resultlist = new List<double>();
    nr = g.Query(ALL, LOOKUP, "Harry", EOL, resultlist);
 First sets up group  Join makes this entity a member. State transfer isn’t shown  Then can multicast, query. Runtime callbacks to the “delegates” as events arrive  Easy to request security (g.SetSecure), persistence  “Consistency” model dictates the ordering seen for event upcalls and the assumptions user can make

This example used g.Reply 77  Also available:  g.AbortReply() – throws exception in the Query caller  g.NullReply() – Member doesn’t contribute any value but the caller won’t wait for it (useful with ALL)  g.NoReply() – A risky option: like NullReply but no message of any kind is sent to the caller  Query can also specify an Isis “Timeout”  new Timeout(delay_ms, action)  Action is: TO_NULLREPLY, TO_FAILURE, TO_ABORT

How can a caller sense missing replies? 78  The caller is told how many replies it got  If you expected 3 but got 2, either someone failed, or they used g.NullReply() to “opt out”  But when you issue the Query you won’t know who is going to be in the group at the time of delivery!  This is why it often makes sense for replies to specify that “this is reply R of N” (R=rank, N=size of view)
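The “reply R of N” idea lets the caller work out exactly which ranks are missing. The Python sketch below assumes each reply carries (rank, view size, value); that tuple layout and the `missing_ranks` helper are hypothetical, chosen only to illustrate the bookkeeping.

```python
def missing_ranks(replies):
    """replies: list of (rank, view_size, value) tuples from one query.
    Returns the ranks that did not answer, or None if nothing was received."""
    if not replies:
        return None
    view_size = replies[0][1]           # every reply reports the same view
    got = {rank for rank, _, _ in replies}
    return sorted(set(range(view_size)) - got)

# Three members were in the view at delivery time, but only two replied.
replies = [(0, 3, 20.75), (2, 3, 20.75)]
print(missing_ranks(replies))
```

Here the caller learns that the member with rank 1 either failed or opted out, without needing to know the membership at the moment the query was issued.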

Lecture Summary 79  Isis 2 gives you control over  How durable multicasts and group data will be  How strongly ordered they will be  Whether to wait until a multicast has reached k of the destinations before you talk to external observers  Using these forms of control, you can program exactly the behavior you need in a given setting
