A BRIEF INTRODUCTION TO HIGH ASSURANCE CLOUD COMPUTING WITH ISIS2 Ken Birman 1 Cornell University.

An introduction to the lecturer About the Lecturer 2

Ken Birman 3  Researcher in high assurance computing since joining Cornell in 1982 (PhD U.C. Berkeley). Currently Cornell’s N. Rama Rao Professor of Computer Science.  ACM Fellow, winner of the IEEE Tsutomu Kanai Award  Built the distributed software infrastructure used for a decade by the New York Stock Exchange, and still used in the French Air Traffic Control System, the US Navy AEGIS and several other mission-critical systems.  Contact information at http://www.cs.cornell.edu/ken

Introducing terminology Informal description of goals Segment I: The Cloud Landscape 4

High Assurance and the Cloud 5  Cloud Computing: The new universal standard  A technology for federating network services  Easy to share data, deeply integrated with web pages  Supports a wide range of media types  But the cloud can’t offer high assurance today!  A wave of sensitive applications is approaching (areas like mHealth, Smart power grid, eBanking, Smart cars...)  They need strong guarantees... what can we do to help?

How does today’s cloud work? 6  Client platform: browsers and “apps”, which are programs that exploit a stripped-down browser API  Internet transports the data  Data centers run “web services” that produce the pages we see, stream videos, etc

Each step embodies weaknesses 7  The client system is vulnerable to loss of connectivity, compromise by downloaded code and infection by viruses and worms.  The Internet layer is potentially unreliable  The mapping of domain names to IP addresses is very complex (consequence of cloud need to “steer” traffic)  Network reliability is much lower than it needs to be  Much too easy to snoop on traffic or attack connections  The Web Services infrastructure can fail or reconfigure abruptly, forcing the client to reconnect

Recipe for high assurance 8  Design a system to fail only in “safe ways”  Nobody gets hurt, but perhaps the system reports that it has gone offline  Then do everything practical to enhance reliability, consistency, security, other needed properties  Today: Focus on the web services running on the cloud data center

Tradeoffs in “cloud space” 9  The properties we need are in tension!  Snappy response: Every 100ms matters  Elasticity: Load varies suddenly and dramatically, service replication levels need to vary accordingly  Consistency: If distinct service replicas “talk” to multiple clients about something, they don’t say contradictory things.  Fault-tolerance: If a replica crashes, the cloud self-heals  Attack-tolerance: The service is very hard to attack.  Security: Authenticated clients are limited to performing authorized actions in accordance with a policy  Privacy: I can control who uses my data and how Required Often weak or lacking

Today’s cloud: As fast as possible 10  In the race to offer the fastest possible services to the largest possible number of clients today’s cloud often gives up on other assurance properties  In some sense the cloud is insecure and inconsistent by design! ... but does it have to be that way?

Tomorrow: A high assurance cloud! 11  A single system needs to tell multiple kinds of assurance stories and not all in the same way  An mHealth application:  Needs to reassure the user that it is trustworthy  Needs to help the developer make the right choices  Must implement complex protocols correctly  Must be a good citizen on the cloud data center

A few slides each on some challenging problems Each needs the cloud... but each needs some form of strong assurance guarantee too Segment II: Examples 12

Example 1: Power grid 13  Today’s power grid has serious issues  Wasteful: As much as 15% of power is lost just moving it around, and a great deal of “renewable” energy (solar, wind, tides) is lost because of poor integration with the standard grid  Rigid: Ideally, the grid should “adapt” and move parcels of power much as the Internet moves packets.  Dumb: even when it is obvious that we could optimize behavior, the grid uses old, inefficient techniques  Goal: A “smart” power grid!

How a small power grid operates 14  Power flows “like water”  Path of least resistance  Governed by Kirchhoff’s laws  Power enters at every generator, exits at every load  Hierarchical structure:  Primary “power busses”  Secondary smaller local feeds [Figure: 10-generator, 39-bus New England system]

Technology to enable a smart grid 15  We’ll need to monitor power loads, frequency, current in real-time, reliably and securely  Use this data to estimate the state of the grid and to predict its evolution over time  Use those predictions to plan control actions: increase/decrease generation, borrow “reactive” power from neighboring regions, adapt pricing, etc  Ultimately the grid will become a new kind of network. But must also be safe, efficient, and secure against both mishaps and even attack!

Even mundane problems can hurt 16  California: Repeated episodes of market manipulation aimed at increasing profits for companies such as Enron that speculate on pricing  Multi-state and multi-national rolling outages  Causes turmoil for air traffic, ground traffic, telephone outages  Will “smartness” also make grid more fragile?  Risk of CyberAttacks?

Control of the smart power grid 18  Suppose that a cloud control system speaks with “two voices”  In physical infrastructure settings, consequences can be very costly “Switch on the 50KV Canadian bus” “Canadian 50KV bus going offline” Bang!

Power grid summary 19  To make it smart we need to monitor at a massive scale and use that to initiate control actions  But for this to be safe, we need more than fast response and elasticity  We also need security (so that attackers can’t take the grid down) ... and consistency (as we just saw) ... and fault-tolerance (since power systems often experience failures of various kinds)

Example 2: mHealth 20  A term for everything outside the doctor’s office (but might be linked to electronic health records)  Goal is to make your life better and healthier  Encourage activity  Discourage poor nutrition choices  Help patients with chronic conditions manage their complex medical devices and medications  Offer caregivers a window into health so that the patient can maintain independence

Integrated glucose monitor and Insulin pump receives instructions wirelessly Motion sensor, fall-detector Cloud Infrastructure Home healthcare application Healthcare provider monitors large numbers of remote patients Medication station tracks, dispenses pills What properties are needed in remote medical care systems? 21

Durability... scalability... fast response  Need: Strong consistency and durability for data Cloud Infrastructure Mrs. Marsh has been dizzy. Her stomach is upset and she hasn’t been eating well, yet her blood sugars are high. Let’s stop the oral diabetes medication and increase her insulin, but we’ll need to monitor closely for a week Patient Records DB 22

What do these terms mean? 23  Consistency: Even if accessed by multiple users concurrently, the data looks like a single database  This sounds like it should obviously be true, but when the data is spread over multiple computers, if they don’t coordinate their actions, consistency can easily be violated  For example, perhaps machine 1 shows updates machine 2 never saw. Perhaps machine 3 sees all the updates but has the order confused. Each of these cases can cause serious inconsistencies.
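
The two failure cases on this slide (a replica that misses an update, and a replica that sees the updates in the wrong order) can be made concrete with a small Python sketch. It is purely illustrative and independent of Isis 2; the function `apply_updates`, the operations `set` and `scale`, and the key `alarm_threshold` are hypothetical names chosen for the example.

```python
# Illustrative only: three uncoordinated replicas of a tiny key-value store.
def apply_updates(updates):
    """Apply (op, key, value) updates in the order given; return the final state."""
    state = {}
    for op, key, value in updates:
        if op == "set":
            state[key] = value
        elif op == "scale":
            # Multiply the current value; a missing key counts as 0.
            state[key] = state.get(key, 0) * value
    return state

updates = [("set", "alarm_threshold", 10), ("scale", "alarm_threshold", 2)]

replica_1 = apply_updates(updates)                  # sees set, then scale
replica_2 = apply_updates(list(reversed(updates)))  # sees scale, then set
replica_3 = apply_updates(updates[:1])              # never saw the second update

print(replica_1, replica_2, replica_3)
```

Replica 1 ends with a threshold of 20 while the other two end with 10, so a client's answer depends on which replica it happens to reach. Delivering the same updates in the same order to every replica removes the disagreement.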

What do these terms mean? 24  Durability: Even if system components crash and then recover later, data will not be lost.  Updates confuse things: before the update occurs, clearly it isn’t durable  After the update is finished, it must have durable effect  Question to pose: exactly when did it need to be durable? Usual answer: If the effect of an update survives a crash, then the update itself should also survive the crash

Scalability 25  As we make the system larger, performance remains good  It needs to be able to support large numbers of clients and run on large numbers of cloud computing systems  Fast response: Queries shouldn’t delay for long. Updates should have rapid effect on the data.

Guarantees versus “best effort” 26  Today’s cloud systems work well in all of these ways but without providing strong guarantees except in certain very specialized cases, like Google’s new “Spanner” database  Our challenge: can normal people who aren’t on the Google Spanner development team also create trustworthy cloud computing solutions?

mHealth summary 27  The needs of the system vary depending on what part of the system we focus on  In our example, some aspects need durability in the sense of a logged database update, while others might accept durability through in-memory replication  This illustrates one of many such tradeoffs  If we had more time we could identify a number of additional issues of this kind

How The Cloud Was Built 28  It is very hard to create software to run in cloud computing systems  Everything must be automated  You must follow many rules and use many packages  So open source “tools” have become popular  Examples: Hadoop (a version of MapReduce), Zookeeper, Graphlab, Pregel, Vowpal Wabbit, global file systems like GFS, etc.  In this short class we will focus on process group tools and will use Isis 2 as our main example.

An obsession with speed... 29  At very large scale, either a thing is extremely fast, or unacceptably slow  So everything we do must be shaped by speed!  High assurance is not an option if the solution would be dramatically slower  For example, the cloud computing community avoids databases. They founded the NoSQL movement (storage, but not as strong as a SQL database) for this reason.  Similarly we must have speed in mind at all times!

To understand speed, understand the limiting factors This forces us to think about critical paths Concept: Critical paths 30

What limits responsiveness? 31  Top priority: delay until a client receives a reply  Critical path traces actions that contribute to this delay Update the monitoring and alarms criteria for Mrs. Marsh as follows… Confirmed Response delay seen by end-user would include Internet latencies Service response delay Service instance

Critical path with complex services? 32  When we replicate information but want to be sure the data won’t be lost, critical path extends into the replicas Update the monitoring and alarms criteria for Mrs. Marsh as follows… Confirmed Response delay seen by end-user would include Internet latencies Service response delay Service instance Critical path

Why do critical paths matter? 33  When we build complex systems it is hard to imagine how they will behave when we run them  By thinking about the critical performance-limiting paths, we can focus our attention on specific elements and not think about the whole system  By avoiding delays on the critical path, we bring benefits to the whole system!
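
As a rough illustration of the critical-path idea, the delay seen by the end user can be written as a sum of the steps on that path. The Python sketch below is an assumption-laden model, not a measurement: the function name `response_delay_ms` and all of the numbers are hypothetical, and it assumes replicas are updated in parallel so that waiting for them costs as much as the slowest one.

```python
def response_delay_ms(internet_ms, service_ms, replica_ack_ms=()):
    """Delay seen by the client along the critical path, in milliseconds.

    internet_ms    -- round-trip Internet latency
    service_ms     -- time the service instance spends on the request
    replica_ack_ms -- acknowledgement delays of replicas the service waits for
                      before replying (empty if it replies immediately)
    """
    # Replicas work in parallel, so only the slowest awaited replica counts.
    replica_wait = max(replica_ack_ms, default=0)
    return internet_ms + service_ms + replica_wait

# Reply without waiting for replicas: the path ends at the service instance.
print(response_delay_ms(40, 5))
# Reply only after three replicas confirm: the path extends into the replicas.
print(response_delay_ms(40, 5, [2, 3, 30]))
```

In this model a single slow replica (30 ms) dominates the added delay, which is why shortening or avoiding waits on the critical path pays off for the whole system.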

There are many critical applications 34  Cloud-hosted system to control transportation (think of Google’s smart cars)  The cars have autonomy but they depend on data from the cloud and would have a much harder challenge if that data couldn’t be trusted  Banking systems  Today’s online banking systems are growing, but as that happens, more and more security issues arise  Process control  Chemical refineries, manufacturing plants,...

And they come with similar stories 35  In each case we can identify properties that are  Absolutely needed for a cloud deployment  Absolutely needed for safety  And beyond that we might have other assurance properties that a particular use case doesn’t need  The challenge will be to analyze each application, and then to translate its needs into cloud solutions

We’ll drill down on the tradeoffs between durability and consistency Many cloud systems believe that consistency isn’t possible: CAP theorem Yet consistency underlies so many other guarantees Virtual synchrony model Segment III: Consistency 36

We’re going to drill down… 37  … on data and service replication  Replication is at the center of cloud computing:  With many replicas a service can handle many clients  And those replicas need as much of the critical data to be local as possible  So replication is a key technology. It even underlies security: we need to replicate the policy database and certificates that identify principals (clients, servers, etc)

Consistency for replication 38  There are many ways to replicate information  But it becomes tricky if the data or even the service evolves over time.  Replication of changing data can leave a confusing mess if a request encounters stale versions.  In some situations these errors can harm the client.  In others, they could cause security violations.

What do we mean by consistency? A consistent distributed system will often have many components, but users observe behavior indistinguishable from that of a single-component reference system. Our power system example illustrated a form of inconsistency 39 “Switch on the 50KV Canadian bus” “Canadian 50KV bus going offline” Bang!

Theory of Consistency 40  There are some famous impossibility results  Fischer, Lynch and Paterson: FLP theorem proves that any correct fault-tolerant protocol strong enough to solve “consensus” (a form of agreement) can also wedge in the event of certain sequences of failures. But those sequences turn out to be very rare.  Brewer’s CAP theorem posits that you can only have two from {Consistency, Availability and Partition Tolerance}. But the proof holds only for a service running in a WAN, not for one in a single data center.

Relate consistency to speed? 41  How costly is strong consistency?  The cloud computing community debates this topic!  It is a very contemporary question  We usually pose the question in connection to replicating data.  Strongly consistent data means “guaranteed to be correct and current”. Can cloud systems afford strong consistency?  Weakly consistent data means “best effort but can have mistakes.” Facebook, eBay, Google all use weak consistency

We will learn more about these topics 42  In today’s lecture we won’t “drill down”  But in lecture 4 we will look more closely at these theoretical questions  Mathematics is a valuable tool for cloud computing  By mapping computing ideas onto mathematics we can reason more rigorously  Yet we will also find that some of the existing theory has limitations of its own

How does consistency look to the end user? What is it like to program with a powerful high assurance library like Isis 2 ? Segment IV: Isis 2 43

Isis 2 System 44  A prebuilt technology that automates many of the hard tasks involved in replicating services and the data on which they depend  Targets cloud computing settings  Available in open-source from isis2.codeplex.com  Intended to be easy to use…  … but still at an early stage of development

Isis 2 System 45  C# library (but callable from any .NET language) offering replication techniques for cloud computing developers  Based on a model that fuses virtual synchrony and state machine replication models  Research challenges center on creating protocols that function well despite cloud “events”:  Elasticity (sudden scale changes)  Potentially heavy loads  High node failure rates  Concurrent (multithreaded) apps  Long scheduling delays, resource contention  Bursts of message loss  Need for very rapid response times  Community skeptical of “assurance properties”

Isis 2 makes developer’s life easier 46 Benefits of Using Formal model:  Formal model permits us to achieve correctness  Isis 2 is too complex to use formal methods as a development tool, but does facilitate debugging (model checking)  Think of Isis 2 as a collection of modules, each with rigorously stated properties Importance of Sound Engineering:  Isis 2 implementation needs to be fast, lean, easy to use  Developer must see it as easier to use Isis 2 than to build from scratch  Seek great performance under “cloudy conditions”  Forced to anticipate many styles of use

Isis 2 makes developer’s life easier 48

    Group g = new Group("myGroup");
    Dictionary<string,double> Values = new Dictionary<string,double>();

    // Called each time the group membership (the "view") changes
    g.ViewHandlers += delegate(View v) {
        Console.Title = "myGroup members: " + v.members;
    };

    // Called when an UPDATE multicast is delivered
    g.Handlers[UPDATE] += delegate(string s, double v) {
        Values[s] = v;
    };

    // Called when a LOOKUP query is delivered; each member replies
    g.Handlers[LOOKUP] += delegate(string s) {
        g.Reply(Values[s]);
    };

    g.Join();

    g.Send(UPDATE, "Harry", 20.75);

    List<double> resultlist = new List<double>();
    nr = g.Query(ALL, LOOKUP, "Harry", EOL, resultlist);

 First sets up group  Join makes this entity a member. State transfer isn’t shown  Then can multicast, query. Runtime callbacks to the “delegates” as events arrive  Easy to request security (g.SetSecure), persistence  “Consistency” model dictates the ordering seen for event upcalls and the assumptions user can make

Concept: A “multi-query” 52  Our lookup is  Multicast to the group  All members respond  A chance for parallelism  Each can do part of the job: e.g. search 1/n-th of a database  Reduces response delays: with n replicas we get an n times speedup! [Figure: a front end multicasts the query “Lookup ‘Harry’ in the Ithaca phone directory” to the group and collects the replies, “Names with Harry in them: ...”]
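
The division of work in a multi-query can be sketched in a few lines of Python. This is not Isis 2 code: `member_search` and `multi_query` are hypothetical names, the directory is made-up data, and threads stand in for group members on separate machines (so the sketch shows how the job is split, not a real n-times speedup).

```python
from concurrent.futures import ThreadPoolExecutor

def member_search(rank, n, directory, needle):
    """Member number `rank` (0..n-1) scans only its own 1/n-th of the directory."""
    return [name for name in directory[rank::n] if needle in name]

def multi_query(directory, needle, n):
    """Front end: hand the query to all n members, then merge their replies."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        replies = list(pool.map(
            lambda rank: member_search(rank, n, directory, needle), range(n)))
    return sorted(name for reply in replies for name in reply)

directory = ["Harry Potter", "Sally Jones", "Harriet Smith", "Bob Harry", "Ann Lee"]
print(multi_query(directory, "Harry", n=3))
```

Each member derives its slice from its rank and the group size, so no extra coordination is needed as long as all members agree on the membership. That agreement is exactly what the view mechanism provides.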

Our example was overly simple 53  It didn’t show the “state transfer” code  Corresponds to the “white arrows” in the time-line figure  In Isis 2 we have a way to make checkpoints  State transfer: Some active member makes a checkpoint, and the joiner loads the state from it.  The code looks like other operations in our example  Checkpoints can also be used to save group state during periods when all members are inactive
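
The slides do not show the Isis 2 checkpoint calls, so the following Python sketch only models the idea: an existing member snapshots its state, the joiner loads it, and from then on both receive the same updates. The class `Member` and its methods are hypothetical and are not the Isis 2 API.

```python
import copy

class Member:
    """A group member holding replicated state in memory (illustrative model)."""

    def __init__(self):
        self.values = {}

    def on_update(self, key, value):
        # Handler run when an update multicast is delivered.
        self.values[key] = value

    def make_checkpoint(self):
        # Snapshot of the current state; a deep copy so later updates
        # to this member do not leak into the checkpoint.
        return copy.deepcopy(self.values)

    def load_checkpoint(self, checkpoint):
        # Run by a joining member before it starts receiving updates.
        self.values = copy.deepcopy(checkpoint)

old = Member()
old.on_update("Harry", 20.75)

joiner = Member()
joiner.load_checkpoint(old.make_checkpoint())   # state transfer

# After the join, every update is delivered to both members.
for member in (old, joiner):
    member.on_update("Sally", 3.5)

print(old.values == joiner.values)
```

The sketch sidesteps the hard part: in a real system the checkpoint must be taken at a well-defined point relative to concurrent updates, so that the joiner neither misses an update nor applies one twice.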

Adding security: Just one line! 54

    Group g = new Group("myGroup");
    Dictionary<string,double> Values = new Dictionary<string,double>();

    g.ViewHandlers += delegate(View v) {
        Console.Title = "myGroup members: " + v.members;
    };

    g.Handlers[UPDATE] += delegate(string s, double v) {
        Values[s] = v;
    };

    g.Handlers[LOOKUP] += delegate(string s) {
        g.Reply(Values[s]);
    };

    // The one added line: request a secure group before joining
    g.SetSecure(myKey);

    g.Join();

    g.Send(UPDATE, "Harry", 20.75);

    List<double> resultlist = new List<double>();
    nr = g.Query(ALL, LOOKUP, "Harry", EOL, resultlist);

 First sets up group  Join makes this entity a member. State transfer isn’t shown  Then can multicast, query. Runtime callbacks to the “delegates” as events arrive  Easy to request security, persistence, tunnelling on TCP...  “Consistency” model dictates the ordering seen for event upcalls and the assumptions user can make

Some uses for process groups  To replicate data maintained by the members in memory  To replicate actions taken on an external service such as a replicated database  To ensure that all replicas are configured the same way  To coordinate the processing of requests and load-balance  To offer a way to parallelize processing by having each group member do part of the work  Fault-tolerance via a backup scheme 55

Isis 2 Summary 56  A library that you can invoke from a normal program written in a normal way  It does the work of creating groups and sending multicasts and ensuring that the consistency model will be enforced  The developer just tells it what to do.  She thinks about a parallel distributed application.  Virtual synchrony eliminates many hard problems

Why not build it yourself from scratch? 57  These systems are complex, especially if you want to run on platforms like EC2  By using Isis 2 you “inherit” 30 years of research on how to make it work [Figure: Isis 2 architecture. Boxes shown: Isis 2 user object; Isis 2 library; Group instances and multicast protocols (Send, CausalSend, OrderedSend, SafeSend, Query, ...); Message Library; “Wrapped” locks; Bounded Buffers; Flow Control; Membership Oracle; Large Group Layer; TCP tunnels (overlay); Dr. Multicast; Platform Security; Reliable Sending; Fragmentation; Group Security; Sense Runtime Environment; Self-stabilizing Bootstrap Protocol; Socket Mgt/Send/Rcv; Oracle Membership (group membership, report suspected failures, views); other group members.] SafeSend and Send are two of the protocol components hosted over what we call the large-scale properties sandbox. The sandbox addresses issues like flow control, security, etc. All protocols share and benefit from those properties. The sandbox itself is mostly composed of “convergent” protocols that use probabilistic methods.

Why focus on Isis 2 ? 58  This is a good question to ask  In fact we could focus on any of a number of other technologies, including other multicast products  Such as Spread, JGroups, C-Ensemble...  But Isis 2 is open source and specifically designed for cloud settings. (Also, Ken built it!)  So since our class is short, we will look at Isis 2 examples

Can Isis2 applications achieve the kinds of scalable performance and elasticity required in large cloud deployments? Segment V: Performance 59

Revisit our notion of consistency 60  Let’s look again at our mHealth example  We want the best possible performance but we also want to be sure that the application is “safe” for this kind of use  We need consistency, yet also need snappy response and elasticity, especially in the monitoring component  After all, it continuously monitors huge numbers of patients.  What limits scalability?

Speed of updates 61  Isis 2 offers many ways to do updates  RawSend, Send, CausalSend, OrderedSend, SafeSend  Each has different consistency / durability guarantees  As a developer, you’ll want to use the fastest option that is still safe in your setting ... Hence will need to understand how each works ... and how fast each solution will be  Today we’ll just look at this superficially

Example: Speed of updates 62  Isis 2 offers several ways to do updates (we will visit them more carefully later)  They have big performance implications  But speed can have more than one definition!

Isis 2: Send vs. in-memory SafeSend 63 Send scales best, but SafeSend with in-memory (rather than disk) logging and small numbers of acceptors isn’t terrible.

Latency vs. ops/second 64  Latency: Delay before external user sees action  Ops/second: total throughput  For most purposes systems “like” Isis 2 offer basic performance of about 1000 ops/second  But by grouping requests into batches of ~50 requests per multicast, services that can support ~50,000 ops/second are feasible  Building them is challenging, but we won’t focus on that engineering topic in these lectures
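
The arithmetic behind the batching claim is simple: if the system sustains about 1000 multicasts per second and each multicast carries about 50 requests, throughput is roughly 1000 × 50 = 50,000 requests per second. The Python sketch below restates that calculation and shows one way to group requests; the helper names `throughput_ops_per_sec` and `batch` are hypothetical and unrelated to Isis 2.

```python
def throughput_ops_per_sec(multicasts_per_sec, requests_per_batch=1):
    """Requests served per second when each multicast carries one batch."""
    return multicasts_per_sec * requests_per_batch

def batch(requests, size):
    """Split a list of requests into consecutive batches of at most `size`."""
    return [requests[i:i + size] for i in range(0, len(requests), size)]

print(throughput_ops_per_sec(1000))        # unbatched
print(throughput_ops_per_sec(1000, 50))    # ~50 requests per multicast

pending = list(range(120))                 # 120 hypothetical queued requests
print([len(b) for b in batch(pending, 50)])
```

Batching raises throughput but not responsiveness: a request may wait for its batch to fill, so its latency can go up even as ops/second improves. This is one reason the slide treats the two measures separately.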

Jitter: how “steady” are latencies? Cornell (Birman): No distribution restrictions. 65 The “spread” of latencies is much better (tighter) with Send: the 2-phase SafeSend protocol is sensitive to scheduling delays

Flush delay as a function of shard size 66 Flush is fairly fast if we only wait for acks from 3-5 members, but slow if we wait for all members. Isis 2 lets the developer set the threshold.
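
Why the threshold matters can be seen from a simple model: if acknowledgements arrive independently, waiting for k of them takes as long as the k-th fastest member, whereas waiting for all of them takes as long as the slowest. The Python sketch below uses made-up delays and a hypothetical helper `flush_delay_ms`; it is not how Isis 2 implements Flush.

```python
def flush_delay_ms(ack_delays_ms, threshold):
    """Time until `threshold` acks have arrived: the threshold-th smallest delay."""
    if not 1 <= threshold <= len(ack_delays_ms):
        raise ValueError("threshold must be between 1 and the number of members")
    return sorted(ack_delays_ms)[threshold - 1]

# Hypothetical ack delays (ms) for a 7-member shard; two members are
# temporarily slow, e.g. because of scheduling delays.
acks = [1.2, 0.9, 1.5, 1.1, 40.0, 1.3, 95.0]

print(flush_delay_ms(acks, 3))           # wait for 3 acks
print(flush_delay_ms(acks, len(acks)))   # wait for every member
```

With these numbers, waiting for 3 acks completes in 1.2 ms, while waiting for all 7 takes 95 ms because the slowest member sets the pace. The price of a low threshold is a weaker guarantee about how many members hold the update when Flush returns.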

So I want Send+Flush, right? 67  The problem is that the different solutions offer different guarantees  The fastest solutions have weaker guarantees  Using them safely involves understanding these properties in order to decide whether they are good enough for the desired purpose  But there are subtle issues we don’t have time to discuss in today’s lecture. We will revisit tomorrow.

Raw speed isn’t the whole story! 68  When building a system such as this we need to look at performance but also at steady behavior  Here’s an example of a problem we ran into when doing the experiments I just showed you  As we’ll see, Isis 2 had an instability. We think we’ve fixed it but it illustrates an important point

The experiment we did 69  We made a timeline picture from left to right  One node (the bottom one) sends multicasts  The others log the time of receipt  We graphed the delay, sorted from slowest (top) to fastest (bottom) delays  Here’s what we saw

Debugging: Stabilization bug Birman: DARPA MRC Kickoff, Washington, Nov 3-4 2011 70

As the application ran, it slowed down! 71  At first the system was fast: even the slowest nodes at the top had short delays  But within a few multicasts they slowed down  Then something “resets” them and they speed up  We tracked it down to a problem with garbage collection in our system  Modifying that protocol helped smooth things out

Debugging: Stabilization bug fixed 72

Debugging : 358-node run slowdown

358-node run slowdown: Zoom in

358-node run slowdown: Filter

Summary of insights from example? 76  Tools like Isis 2 enable us to build cloud-scale replication based services with strong guarantees  But today, at least, they demand a lot from the developer, who needs to really understand the choices and their implications  As Isis 2 evolves, this problem will be reduced: the system will eventually automate many decisions, including picking the right update primitives for you

We’ve scratched the surface but there is much more to be explored Cornell’s high assurance researchers are creating solutions for tomorrow’s demanding applications Segment VI: Conclusions 77

Key take-away points 78  Cloud computing, today, isn’t very friendly to high assurance applications  This is a problem because those applications are increasingly forced to migrate to the cloud for reasons of cost, scalability or just because the cloud is the dominant paradigm today  But we can already use tools like Isis 2 to solve these problems and as they become easier to work with, the community able to build these solutions will grow

Key take-away points 79  With Isis 2 we can easily create programs that run on cloud platforms like EC2 or even Android mobile  They form into groups and coordinate or replicate data or actions via group primitives  The concept is powerful and easily visualized  But tuning and doing sophisticated fault-tolerance remains challenging.  In the remaining lectures we will explore these issues

The last word...  The word on the street is that cloud computing will rule but that the cloud can’t do high assurance  But the word in the hallways at Cornell differs!  We see Isis 2 as our proof-by-demonstration that it can be done  Even so, the engineering challenge remains enormous 80

Learning more 81  Stay in the class. We’ll show you how!  Download the Isis 2 system from isis2.codeplex.com  You can access the user’s manual  The code itself (currently v2.xxx, a very stable release)  And we maintain a discussion and issues board there

Learning more 82  My textbook covers this topic in depth: “Guide to Reliable Distributed Systems: Building High-Assurance Applications and Cloud-Hosted Services”. Ken Birman. Springer Verlag, February 2012  A paper focused entirely on today’s topic is: Overcoming CAP with Consistent Soft-State Replication. Kenneth P. Birman, D. Freedman, Q. Huang and Patrick Dowell. IEEE Computer Magazine (special issue on “The Growing Impact of the CAP Theorem”). Volume 45. pp. 50-58. February 2012. You can download a copy from: http://www.cs.cornell.edu/projects/quicksilver/pubs.html
