CS5412: THE CLOUD UNDER ATTACK! Ken Birman 1 Lecture XXIV

For all its virtues, the cloud is risky! 2  Categories of concerns  Client platform inadequacies, code download, browser insecurities  Internet outages, routing problems, vulnerability to DDoS  Cloud platform might be operated by an untrustworthy third party, could shift resources without warning, could abruptly change pricing or go out of business  Provider might develop its own scalability problems  Consolidation creates monoculture threats  Cloud security model is very narrow and might not cover important usage cases

But the cloud is also good in some ways 3  With a private server, DDoS attacks often succeed  In contrast, it can be hard to DDoS a cloud  The DDoS operator spends real money and won’t want to waste the cash  Because the cloud is hard to DDoS, it emerges as a very good response to DDoS worries

More good news 4  Diversity can compensate for monoculture worries  Elasticity is a unique capability not seen in other settings  Ability to host and compute on massive data sets is very valuable  Obviously, only of value if the task is suited to this style of massive parallelism, but many tasks do fit the model ... the list goes on

So the cloud is tempting 5  And cheaper, too!  What’s not to love?  Imagine that you work for a large company that is healthy and has managed its own story in its own way  Now the cloud suddenly offers absolutely unique opportunities that we can’t access in any other way  Should you recommend that your boss drink the potion?

But how can anyone trust the cloud? 6  The cloud seems so risky that it makes no sense at all to trust it in any way!  Yet we seem to trust it in many ways  This puts the fate of your company in the hands of third parties!

The concept of “good enough” 7  We’ve seen that there really isn’t any foolproof way to build a computer, put a large, complex program on it, and then run it with confidence  We also know that with effort, many kinds of systems really start to work very well  When is a “pretty good” solution good enough?

How they do it in avionics 8  FAA and NASA have a process that is used for things like fly-by-wire control software  Requires very stringent proofs  The program must be certified on particular hardware, even specific versions of chips  Any change of any kind triggers a recertification task  Very costly: a controller 100 lines long may generate 1000 pages of documentation!

How most production software is built 9  Generally, the company develops a good specification  Code is created in teams, with frequent code review and much unit testing  Then code is passed to a “red team” that uses the code, attacks it, tries to find issues  Cycle continues until adequate assurance is reached and the initial release can take place  Subsequently must track and fix bugs, repeat Q/A, do periodic patch releases  Wise to rebuild entire solution every 5 years or so

How cloud software is built 10  Not enough time for proper Q/A  So much of the cloud was built in a huge hurry  Even today, race for features often doesn’t leave time for proper testing  Early versions have been rough, insecure, fault-prone  Over time, slow improvement  Seems to shift a lot of emphasis to patches and upgrades  Many cloud systems auto-upgrade frequently

Legacy code 11  Not all code fits the “rebuilt periodically” model  Many major technologies were important in their day but now live on in isolated settings  They work… do something important for some organization… and so nobody touches them  These legacy systems are often minimally maintained but over time the amount of legacy code can become substantial: a tangle of vital, rarely touched code and systems

The parable of Y2K 12  Once upon a time many, many systems depended on 2-digit clocks  They only kept 2 digits for the years, like a credit card that expires 05/13  Thus when we reached 01/00 it looked like time travel 100 years into the past  Experiments made it clear that many systems crashed when this happened… and nobody had any idea how to find the “bad apples” in the barrel
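To make the arithmetic concrete, here is a tiny, purely hypothetical Python illustration (not taken from any real legacy system) of why elapsed-time calculations on 2-digit years went wrong at the century boundary:

```python
# Hypothetical sketch: legacy-style subtraction on 2-digit years.

def years_elapsed(start_yy, end_yy):
    # A legacy system that stores only the last two digits of the year.
    return end_yy - start_yy

print(years_elapsed(95, 99))   # 1995 -> 1999: 4, looks fine
print(years_elapsed(99, 0))    # 1999 -> 2000: -99, i.e. "time travel" into the past
```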

So how did things work out? 13  Initial cost estimates were terrifying  Tens or hundreds of billions of dollars to scan the hundreds of millions of lines of code that do important things  Lack of people to even do the work  Code in baffling, ancient languages like COBOL  Disaster loomed…  Infosys rode to the rescue!

Infosys in the pre-Y2K period 14  A small Indian software company that was known mostly for its work on the Paris Airport luggage transportation system  A very complex system, which Infosys succeeded in building at a fraction of the standard cost and with far fewer bugs or delays than France had ever seen  Company had a few hundred employees  Founded in 1981 with $600!

Infosys was an unusual company 15  Founders were all very socially pro-active and very concerned about the situation of India’s poor  Extremely high ethical standard: A decision to never pay bribes or in any way rig the outcome of business decisions  When many company executives were paying themselves big bonuses, the founders reinvested

1987: A big event 16  Infosys got a toehold in the United States when it landed its first US corporate client  A company named Data Basics Corporation  The Infosys “angle”?  Hire smart kids from all over India  Offer them additional training at a corporate campus in Mysore  Form them into a highly qualified workforce

Financial angle? 17  In the early days, Infosys was paying highly qualified employees $5,000/year  In the US highly qualified technology workers were earning $125,000/year in that time period  Skill sets weren’t so different…  Today the gap is a little smaller, but not hugely so

How Y2K helped 18  Companies like Infosys tackled the Y2K challenge for “pennies on the dollar” relative to estimates  A company facing a $50M bill to review all the corporate code base saw it shrink to perhaps $1M  And Infosys often finished these tasks early  …. January 1, 2000 arrived and the world didn’t end. Instead the world of outsourcing began!  A few minor issues occurred, but nothing horrible

Lessons one learns 19  Cheaper isn’t necessarily inferior!  In fact over time, cheaper but “good enough” wins  This is a very important lesson that old companies miss  Early adopters often accept risks ... risks that can be managed  And those good-enough solutions sometimes catch up later  Bad stuff (lots of it) lurks deep within the cool new stuff that we all love

Fast forward to  Today cloud computing has a similar look and feel  It works really well for the things we use it to do today How often does an iPhone service malfunction? Pretty often, actually, but not often enough to bother anyone The cloud is fast, scalable, has amazing capabilities, and yes, it has a wide variety of issues  Is the cloud really worse than what came before it?  Given that the cloud evolved from what came earlier, is this even a sensible question?  When has any technology ever been “assured”?

Life with technology is about tradeoffs 21  Clearly, we err if we use a technology in a dangerous or inappropriate way  Liability laws need to be improved: they let software companies escape pretty much all responsibility  Yet gross negligence is still a threat to those who build things that will play critical roles and yet fail to take adequate steps to achieve assurance

Another parable: Real-time multicast 22  The community that builds real-time systems favors proofs that the system is guaranteed to satisfy its timing bounds and objectives  The community that does things like data replication in the cloud tends to favor speed  We want the system to be fast  Guarantees are great unless they slow the system down

Can a guarantee slow a system down?  Suppose we want to implement broadcast protocols that make direct use of temporal information  Examples:  Broadcast that is delivered at same time by all correct processes (plus or minus the clock skew)  Distributed shared memory that is updated within a known maximum delay  Group of processes that can perform periodic actions

A real-time broadcast [Figure: space-time diagram of processes p0 … p5, with time marks t, t+a, t+b] The message is sent at time t by p0. Later both p0 and p1 fail. But the message is still delivered atomically, after a bounded delay, and within a bounded interval of time (at non-faulty processes)

A real-time distributed shared memory [Figure: space-time diagram of processes p0 … p5; p0 executes “set x=3” at time t, and every process later holds x=3] At time t p0 updates a variable in a distributed shared memory. All correct processes observe the new value after a bounded delay, and within a bounded interval of time.

Periodic process group: Marzullo [Figure: space-time diagram of processes p0 … p5 taking actions at regular intervals] Periodically, all members of a group take some action. Idea is to accomplish this with minimal communication

The CASD protocols  Also known as the “Δ-T” protocols  Developed by Cristian and others at IBM; it was intended for use in the (ultimately failed) FAA project  Goal is to implement a timed atomic broadcast tolerant of Byzantine failures

Basic idea of the CASD protocols  Assumes use of clock synchronization  Sender timestamps message  Recipients forward the message using a flooding technique (each echoes the message to others)  Wait until all correct processors have a copy, then deliver in unison (up to limits of the clock skew)
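As a rough illustration of that recipe, here is a minimal single-machine Python sketch. It assumes synchronized clocks with skew at most EPSILON, a worst-case per-hop delay DELTA_HOP, and at most F failures per run; the names and the simple delay formula are illustrative assumptions, not the exact constants from the CASD papers.

```python
# Minimal sketch of the CASD flooding-and-timed-delivery idea (illustrative only).
import threading
import time

EPSILON = 0.010      # assumed worst-case clock skew (seconds)
DELTA_HOP = 0.050    # assumed worst-case one-hop message delay (seconds)
F = 1                # assumed maximum number of failures during one run

# Wait long enough for the flood to get past F failed relays, plus the clock skew,
# so that every correct process has a copy before anyone delivers.
DELIVERY_DELAY = (F + 1) * DELTA_HOP + EPSILON

class CasdSketchProcess:
    def __init__(self, pid):
        self.pid = pid
        self.peers = []        # other CasdSketchProcess objects
        self.seen = set()      # message ids already relayed (dedup for flooding)

    def broadcast(self, msg_id, payload):
        # Sender timestamps the message with its (assumed synchronized) clock.
        self.receive(msg_id, payload, timestamp=time.time())

    def receive(self, msg_id, payload, timestamp):
        if msg_id in self.seen:
            return
        self.seen.add(msg_id)
        for peer in self.peers:                 # flooding: echo to every peer
            peer.receive(msg_id, payload, timestamp)
        # Deliver at (sender timestamp + worst-case delay) on the local clock,
        # so all correct processes deliver within the clock skew of one another.
        delay = max(0.0, timestamp + DELIVERY_DELAY - time.time())
        threading.Timer(delay, self.deliver, args=(msg_id, payload)).start()

    def deliver(self, msg_id, payload):
        print(f"p{self.pid} delivers {msg_id} = {payload!r}")

if __name__ == "__main__":
    procs = [CasdSketchProcess(i) for i in range(3)]
    for p in procs:
        p.peers = [q for q in procs if q is not p]
    procs[0].broadcast("m1", "hello")
    time.sleep(1.0)    # give the delivery timers time to fire
```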

CASD picture [Figure: space-time diagram of processes p0 … p5, with time marks t, t+a, t+b] p0 and p1 fail. Messages are lost when echoed by p2 and p3

Idea of CASD  Assume known limits on the number of processes that fail during the protocol and the number of messages lost  Using these and the temporal assumptions, deduce the worst-case scenario  Now we know that if we wait long enough, all (or no) correct processes will have the message  Then schedule delivery using the original time plus a delay computed from the worst-case assumptions
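Written as a formula, the delivery rule has roughly this shape (an illustrative bound under the assumptions just listed, not the exact CASD constant, which also accounts for omission failures and processing time): each correct process delivers a message timestamped t_send at local clock time t_send + Δ, where

```latex
% Illustrative shape of the CASD delivery delay, assuming clock skew at most
% \varepsilon, per-hop delay at most \delta, and at most f failures during a run.
\[
  \text{deliver}(m)\ \text{at local time}\ t_{\text{send}}(m) + \Delta,
  \qquad
  \Delta \;\gtrsim\; (f+1)\,\delta \;+\; \varepsilon
\]
```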

The problems with CASD  In the usual case, nothing goes wrong, hence the delay can be very conservative  Even if things do go wrong, is it right to assume that if a message needs between 0 and δ ms to make one hop, it needs [0, n*δ] to make n hops?  How realistic is it to bound the number of failures expected during a run?

CASD in a more typical run [Figure: space-time diagram of processes p0 … p5 over the interval t to t+b]

... leading developers to employ more aggressive parameter settings [Figure: space-time diagram of processes p0 … p5 over the interval t to t+b]

CASD with over-aggressive parameter settings starts to “malfunction” [Figure: space-time diagram of processes p0 … p5; all processes look “incorrect” (red) from time to time]

CASD “mile high”  When run “slowly” protocol is like a real-time version of abcast  When run “quickly” protocol starts to give probabilistic behavior:  If I am correct (and there is no way to know!) then I am guaranteed the properties of the protocol, but if not, I may deliver the wrong messages

How to repair CASD in this case?  Gopal and Toueg developed an extension, but it slows the basic CASD protocol down, so it wouldn’t be useful in the case where we want speed and also real-time guarantees  Can argue that the best we can hope to do is to superimpose a process group mechanism over CASD (Verissimo and Almeida are looking at this).

Why worry?  CASD can be used to implement a distributed shared memory (“delta-common storage”)  But when this is done, the memory consistency properties will be those of the CASD protocol itself  If CASD protocol delivers different sets of messages to different processes, memory will become inconsistent

Why worry?  In fact, we have seen that CASD can do just this, if the parameters are set aggressively  Moreover, the problem is not detectable either by “technically faulty” processes or “correct” ones  Thus, DSM can become inconsistent and we lack any obvious way to get it back into a consistent state

Using CASD in real environments  Once we build the CASD mechanism how would we use it?  Could implement a shared memory  Or could use it to implement a real-time state machine replication scheme for processes  US air traffic project adopted latter approach  But stumbled on many complexities…

Using CASD in real environments  Pipelined computation  Transformed computation

Issues?  Could be quite slow if we use conservative parameter settings  But with aggressive settings, either process could be deemed “faulty” by the protocol  If so, it might become inconsistent: the protocol guarantees don’t apply  No obvious mechanism to reconcile states within the pair  Method was used by IBM in a failed effort to build a new US Air Traffic Control system

A comparison 42  Virtually synchronous Send is fault-tolerant, very robust, and very fast, but doesn’t guarantee real-time delivery of messages  CASD is fault-tolerant and very robust, but rather slow. It does, however, guarantee real-time delivery  CASD is “better” if our application requires absolute confidence that real-time deadlines will be achieved... but only if those deadlines are “slow”

Which is better for real-time uses? 43  Virtually synchronous Send or CASD?  CASD may need seconds before it can deliver, but comes with a very strong proof that it will do so correctly  Send will deliver within milliseconds unless strange scheduling delays impact a node. But in practice the delay limit is probably ~45 seconds; beyond this, the node will be declared to have crashed

Generalizing to the whole cloud 44  Massive scale  And most of the time it gives incredibly fast responses: sub-100ms is a typical goal  But sometimes we experience a long delay or a failure

Traditional view of real-time control favored the CASD view of assurances 45  In this strongly assured model, the assumption was that we need to prove our claims and guarantee that the system will meet its goals  And like CASD this leads to slow systems  And to CAP and similar concerns

Back to our puzzle 46  So can the cloud do high assurance?  Presumably not if we want CASD kinds of proofs  But if we are willing to “overwhelm” delays with redundancy, why shouldn’t we be able to do well?  Suppose that we connect our user to two cloud nodes and they perform read-only tasks in parallel  Client takes first answer, but either would be fine  We get snappier response but no real “guarantee”
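A minimal sketch of that “two nodes, first answer wins” idea in Python, assuming fetch() and the replica URLs are placeholders for whatever read-only service the client actually talks to:

```python
# Redundant read: issue the same read to two replicas and take the first answer.
import concurrent.futures
import urllib.request

def fetch(url, timeout=2.0):
    # Placeholder read-only request; either replica's answer would be acceptable.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

def redundant_read(replica_urls, timeout=2.0):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replica_urls))
    futures = [pool.submit(fetch, url, timeout) for url in replica_urls]
    try:
        for fut in concurrent.futures.as_completed(futures):
            try:
                return fut.result()       # first successful answer wins
            except Exception:
                continue                  # that replica failed; wait for another
        raise RuntimeError("no replica answered")
    finally:
        pool.shutdown(wait=False)         # don't wait around for the slower replica

# Hypothetical usage (the URLs are placeholders):
# data = redundant_read(["https://replica-a.example/read",
#                        "https://replica-b.example/read"])
```

The point of the sketch is the shape of the tradeoff: we pay for an extra request in exchange for a snappier response, and a slow or failed node is simply outrun by redundancy rather than waited on.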

A vision: “Good enough assurance” 47  Build applications to protect themselves against rare but extreme problems (e.g. a medical device might warn that it has lost connectivity)  This is needed anyhow: hardware can fail…  So: start with “fail safe” technology  Now make our cloud solution as reliable as we can without worrying about proofs  We want speed and consistency but are ok with rare crashes that might be noticed by the user
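As one illustration of the “fail safe” stance, here is a small hypothetical Python watchdog sketch: the application assumes the cloud link is down unless heartbeats keep arriving, and drops into a conservative local mode rather than trusting a silent service. The interval, grace factor, and callback name are all assumptions, not part of any real device.

```python
# Hypothetical fail-safe watchdog for an application that depends on a cloud link.
import threading
import time

HEARTBEAT_INTERVAL = 5.0   # assumed: the cloud service is expected to ping every 5 s
GRACE_FACTOR = 3           # tolerate a couple of missed heartbeats before reacting

class FailSafeWatchdog:
    def __init__(self, on_connectivity_lost):
        self._on_lost = on_connectivity_lost
        self._lock = threading.Lock()
        self._deadline = time.monotonic() + GRACE_FACTOR * HEARTBEAT_INTERVAL
        self._lost = False
        threading.Thread(target=self._watch, daemon=True).start()

    def heartbeat(self):
        # Call this whenever any message arrives from the cloud service.
        with self._lock:
            self._deadline = time.monotonic() + GRACE_FACTOR * HEARTBEAT_INTERVAL
            self._lost = False

    def _watch(self):
        while True:
            time.sleep(HEARTBEAT_INTERVAL)
            with self._lock:
                expired = (not self._lost) and time.monotonic() > self._deadline
                if expired:
                    self._lost = True
            if expired:
                # Fail safe: warn and switch to conservative local behavior.
                self._on_lost()

# Hypothetical usage:
# dog = FailSafeWatchdog(lambda: print("LOST CLOUD LINK: entering safe local mode"))
# ... call dog.heartbeat() from the network receive path ...
```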

Will this do? 48  Probably not for some purposes… but some things just don’t belong under computer control  For most purposes, this sort of solution might balance the benefits of the cloud with the kinds of guarantees we know how to provide  Use redundancy to compensate for delays, insecurity, failures of individual nodes

How the cloud is like Infosys 49  The cloud brings huge advantages  Lower cost… much better scalability  And it also brings problems  Today’s cloud is inconsistent by design, not very secure…  But why should we assume tomorrow’s cloud won’t be better? The cloud seems to be winning!  Our job: find ways to make the cloud safely do more  This task seems completely feasible!

Summary: Should we trust the cloud? 50  We’ve identified a tension centering on priorities  If your top priority is assurance properties, you may be forced to sacrifice scalability and performance in ways that leave you with a useless solution  If your top priorities center on scale and performance, and you then layer in other characteristics, it may be feasible to keep the cloud properties and get a good enough version of the assurance properties  These tradeoffs are central to cloud computing!  But as in the other examples, the cloud could win even if, in some ways, it isn’t the “best” or “most perfect” solution