
1 HOW THE CLOUD WORKS Ken Birman 1 Cornell University

2 Consider Facebook 2  Popular social networking site (currently blocked in China), with > 1B users, growing steadily  Main user interface: very visual, with many photos, icons, ways to comment on things, “like/dislike”  Rapid streams of updates

3 Facebook Page  Page itself was downloaded from Facebook.com  They operate many data centers, all can serve any user  Page was full of URLs  Each URL triggered a further download; many fetched photos or videos. (The user’s wall is a continuously scrolling information feed with data from her friends, news, etc.)

4 Facebook image fetching architecture 4 [Diagram: the user’s local cache talks to Akamai’s cloud and to Facebook Edge sites (each with an Edge Cache); Edge sites forward misses to the Facebook Resizer tier (with its Resizer Cache), which is backed by Haystack storage]

5 The system... 5  Operates globally, on every continent  Has hundreds of Facebook Edge sites  Dozens of Resizer locations just in the USA, many more elsewhere  A “few” Haystack systems for each continent  Close relationships with Akamai, other network providers

6 Things to notice 6  The cloud isn’t “about” one big system  We see multiple systems that talk easily to one another, all using web-page standards (XML, HTML, MPLS...)  They play complementary roles  Facebook deals with Akamai for caching of certain kinds of very popular images, AdNet and others for advertising, etc  And within the Facebook cloud are many, many interconnected subsystems

7 Why so many caches? 7  To answer, need to start by understanding what a cache does  A cache is a collection of web pages and images, fetched previously and then retained  A request for an image already in the cache will be satisfied from the cache  Goal is to spread the work of filling the Facebook web page so widely that no single element could get overloaded and become a bottleneck

8 But why so many layers? 8  Akamai.com is a web company dedicated to caching for rapid delivery of data (mostly images)  If Facebook uses Akamai, why would Facebook ever need its own caches?  Do the caches “talk to each other”? Should they?  To understand the cloud we should try to understand the answers to questions like these

9 Memcached: A popular cache 9  Stands for “In-Memory Caching Daemon”  A simple, very common caching tool  Each machine maintains its own (single) cache

    function get_foo(foo_id)
        foo = memcached_get("foo:" . foo_id)
        return foo if defined foo
        foo = fetch_foo_from_database(foo_id)
        memcached_set("foo:" . foo_id, foo)
        return foo
    end
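For concreteness, the same read-through pattern as a runnable Python sketch. This assumes a memcached daemon on localhost:11211 and the pymemcache client library; fetch_user_from_database is a hypothetical stand-in for the real (slow) data source.

    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))           # one memcached daemon, one local cache

    def fetch_user_from_database(user_id):
        # Hypothetical placeholder for the authoritative, slow lookup.
        return ("user-record-%s" % user_id).encode()

    def get_user(user_id):
        key = "user:%s" % user_id
        value = cache.get(key)                     # hit: served straight from memory
        if value is None:                          # miss: go to the database...
            value = fetch_user_from_database(user_id)
            cache.set(key, value, expire=300)      # ...and retain the result for later requests
        return value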

10 Should we use Memcached everywhere? 10  Cached data can be stale (old/incorrect)  A cache is not automatically updated when data changes; you need to invalidate or update the entry yourself (see the sketch below)  And you may have no way to know that the data changed  When a cache gets full, we must evict less popular content. What policy will be used? (Memcached: LRU)  When applications (on one machine) share Memcached, they need to agree on the naming rule they will use for content  Otherwise we could end up with many cached copies of Angelina Jolie and Brad Pitt, “filling up” the limited cache space
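The staleness problem is easiest to see from the writer’s side: nothing refreshes the cache automatically, so whoever changes the database must invalidate or overwrite the cached entry. A minimal sketch under the same assumptions as the previous example; write_user_to_database is hypothetical.

    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def write_user_to_database(user_id, new_record):
        pass                                        # placeholder for the real, authoritative write

    def update_user(user_id, new_record):
        write_user_to_database(user_id, new_record) # change the authoritative copy first
        cache.delete("user:%s" % user_id)           # then invalidate (or re-set) the cached entry;
                                                    # skip this step and readers get stale data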

11 Fun with Memcached 11  There are systems built over Memcached that have become very important  The Berkeley Spark system is a good example  Spark is roughly Memcached plus a nice rule for naming what the cache contains  The Spark approach focuses on in-memory caching on behalf of the popular MapReduce/Hadoop computing tool

12 MapReduce/Hadoop 12  Used when searching or indexing very large data sets constructed from collections of web data  For example, all the web pages in the Internet  Or all the friending relationships in all of Facebook  Idea is to spread the data over many machines, then run highly parallel computations on unchanging data  The actual computing tends to be simple programs  Map step: spreads computing out. Reduce step: combines intermediary results, so the final result aggregates exactly one copy of each intermediary output  Often iterative: the second step depends on the output of the first step
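The canonical example is word counting. The sketch below only simulates the two phases in plain Python on one machine; a real Hadoop job would spread the map calls across many machines and shuffle the intermediary (word, 1) pairs to the reducers.

    from collections import defaultdict

    def map_phase(document):
        # "Spread computing out": emit an intermediary (word, 1) pair per word.
        return [(word, 1) for word in document.split()]

    def reduce_phase(pairs):
        # "Combine intermediary results": exactly one output per distinct key.
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    documents = ["the cloud is big", "the cloud is fast"]
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    print(reduce_phase(intermediate))   # {'the': 2, 'cloud': 2, 'is': 2, 'big': 1, 'fast': 1}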

13 Spark: Memcached for MapReduce 13  The Spark developers reasoned that if MapReduce uses files for the intermediary results, file I/O would be a performance-limiting cost  They confirmed this using experiments  It also turned out that many steps recompute nearly the identical thing (for example, counting words in a file)  Memcached can help... if  MapReduce can find the precomputed results  and if we “place” tasks to run where those precomputed results are likely to be found

14 Spark “naming convention” 14  Key idea in Spark: rather than name intermediate results using URLs or file names, they use the “function that produced the result”  Represented in a functional programming notation based on the Microsoft LINQ syntax  In effect: “This file contains f(g(X))”. Since the underlying data is unchanging (e.g., a file), “X” has the same meaning at all times; thus f(g(X)) has a fixed meaning too  By cleverly keeping subresults likely to be reused, Spark obtained huge speedups, often 1000x or more!
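In today’s PySpark API the reuse shows up as an explicit cache() (or persist()) call on an intermediate result: Spark records the lineage, i.e. the chain of functions that produced it, and later actions reuse the in-memory copy instead of recomputing. A minimal sketch, not the original notation from the slide; the input path and app name are hypothetical.

    from pyspark import SparkContext

    sc = SparkContext(appName="spark-cache-sketch")
    words = sc.textFile("hdfs:///data/pages.txt").flatMap(lambda line: line.split())
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    counts.cache()                                         # keep this intermediate result in memory

    top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])  # first action: computes and caches
    total = counts.count()                                 # second action: served from the cached copy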

15 Spark has become very important 15  While the idea is simple, providing a full range of control over what is in the cache, when it is searched, and when things are evicted is complex  Spark functionality was initially very limited but has become extremely comprehensive  There is a user community of tens of thousands  In the cloud, when things take off, they “go viral”!  … even software systems

16 Key insight 16  In-memory caching can have huge and very important performance implications  Caching, in general, is of vital importance in the cloud, where many computations run at high load and data rates and immense scale  But not every cache works equally well!

17 Back to Facebook 17  Seeing how important Spark became for MapReduce, we can ask questions about the Facebook caching strategy  Are these caches doing a good job? What hit rate do they achieve?  Should certain caches “focus” on retaining only certain kinds of data, while other caches specialize in other kinds of data? When a Facebook component encounters a photo or video, can we “predict” the likely value of caching a copy?

18 Using Memcached in a pool of machines 18  Facebook often has hundreds or thousands of machines in one spot, each can run Memcached  They asked: why not share the cache data?  Leads to a distributed cache structure  They built one using ideas from research experiences  A distributed hash table offers a simple way to share data in a large collection of caches

19 How it works 19  We all know how a HashMap or HashTable works in a language like Java or C# or C++  You take the object you want to save and compute a HashCode for it. This is an integer and will look “random” but is deterministic for any single object  For example, it could be the XOR of the bytes in a file  Hashcodes are designed to spread data very evenly over the range of possible integers
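As a concrete illustration, here is the XOR-of-the-bytes idea from the slide next to a more realistic choice. This is a hedged sketch: the XOR version is deterministic but can only produce 256 distinct values, which is why practical systems prefer a standard hash that spreads evenly over the integer range.

    import hashlib

    def xor_hashcode(data: bytes) -> int:
        # Deterministic for any single object, but only 256 possible values.
        h = 0
        for b in data:
            h ^= b
        return h

    def spread_hashcode(data: bytes) -> int:
        # A hash that actually spreads data evenly over a large integer range.
        return int.from_bytes(hashlib.md5(data).digest()[:4], "big")

    photo = b"contents of some photo file"
    print(xor_hashcode(photo), spread_hashcode(photo))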

20 Network communication  It is easy for a program on biscuit.cs.cornell.edu to send a message to a program on “jam.cs.cornell.edu”  Each program sets up a “network socket”  Each machine has an IP address; you can look them up, and programs can do that too via a simple Java utility  Pick a “port number” (this part is a bit of a hack)  Build the message (must be in binary format)  The Java utility libraries provide support for these requests 20
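In Python rather than Java, the whole recipe fits in a few lines. A minimal sketch: the host name comes from the slide, and the port number is made up, since, as the slide says, picking one is a bit of a hack.

    import socket

    peer_ip = socket.gethostbyname("jam.cs.cornell.edu")   # look up the machine's IP address
    port = 5005                                            # agreed-on port number (hypothetical)

    with socket.create_connection((peer_ip, port)) as conn:
        conn.sendall(b"hello from biscuit")                # the message must be in binary format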

21 Distributed Hash Tables  It is easy for a program on biscuit.cs.cornell.edu to send a message to a program on “jam.cs.cornell.edu” ... so, given a key and a value:
    1. Hash the key
    2. Find the server that “owns” the hashed value
    3. Store the key,value pair in a “local” HashMap there
 To get a value, ask the right server to look up the key 21
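The three steps translate almost line for line into code. A minimal in-process sketch: each server’s “local HashMap” is just a Python dict standing in for a remote Memcached instance, and the server addresses are the ones shown in the diagram a few slides below.

    import hashlib

    servers = ["123.45.66.781", "123.45.66.782", "123.45.66.783", "123.45.66.784"]
    tables = {s: {} for s in servers}              # each server's "local" HashMap

    def owner(key: str) -> str:
        h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")   # 1. hash the key
        return servers[h % len(servers)]                                    # 2. find the owning server

    def dht_put(key, value):
        tables[owner(key)][key] = value            # 3. store the key,value pair "there"

    def dht_get(key):
        return tables[owner(key)].get(key)         # ask the right server to look up the key

    dht_put("ken", 2110)
    print(dht_get("ken"))                          # 2110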

22 List of machines 22  There are several ways to track the machines in a network. Facebook just maintains a table  In each FB data center there is a big table of all machines currently in use  Every machine has a private copy of this table, and if a machine crashes or joins, the table is quickly updated (within seconds)  Can we turn our table of machines into a form of HashMap?

23 From a table of machines to a DHT 23  Take the healthy machines  Compute the HashCode for each using its name, or ID  These are integers in the range [Int.MinValue, Int.MaxValue]  Rule: an object with HashCode h(O) will be placed on the K machines whose HashCodes are closest to h(O)
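A hedged sketch of that placement rule: hash each healthy machine’s name, hash the object, and keep the K machines whose codes are numerically closest. The machine names and the value of K are made up for illustration.

    import hashlib

    def hash32(name: str) -> int:
        # Signed 32-bit hash, i.e. a value in [Int.MinValue, Int.MaxValue].
        return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big", signed=True)

    def replicas(object_id: str, machines: list, k: int = 3) -> list:
        target = hash32(object_id)
        # The K machines whose hash codes are closest to the object's hash code.
        return sorted(machines, key=lambda m: abs(hash32(m) - target))[:k]

    print(replicas("photo-42.jpg", ["node-a", "node-b", "node-c", "node-d", "node-e"]))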

24 Side remark about tracking members 24  This Facebook approach uses a “group” of machines and a “view” of the group  We will make use of this in later lectures too  But it is not the only way! Many DHTs track just log(N) of the members and build a routing scheme that takes log(N) “hops” to find an object (See: Chord, Pastry...)  The FB approach is an example of a 1-hop DHT  Cloud systems always favor 1-hop solutions if feasible

25 Distributed Hash Tables 25 [Diagram: dht.Put(“ken”, 2110) and dht.Get(“ken”) both hash the key, “ken”.hashcode()%N = 77; the four machines 123.45.66.781–784 have IP hash codes 98, 77, 13, and 175, so both requests are routed to 123.45.66.782, which keeps the (“ken”, 2110) pair in its local hashmap]

26 Facebook image “stack”  We decided to study the effectiveness of caching in the FB image stack, jointly with Facebook researchers  This stack’s role is to serve images (photos, videos) for FB’s hundreds of millions of active users  About 80B large binary objects (“blobs”) / day  FB has a huge number of big and small data centers
    “Point of presence” or PoP: some FB-owned equipment, normally near the user
    Akamai: a company FB contracts with that caches images
    FB resizer service: caches but also resizes images
    Haystack: inside the data centers, holds the actual pictures (a massive file system) 26

27 What we instrumented in the FB stack  Think of Facebook as a giant distributed HashMap  Key: photo URL (id, size, hints about where to find it...)  Value: the blob itself 27

28 Facebook traffic for a week  Client activity varies daily... and different photos have very different popularity statistics 28

29 Facebook cache effectiveness  Existing caches are very effective... but different layers are more effective for images with different popularity ranks 29

30 Facebook cache effectiveness  Each layer should “specialize” in different content.  Photo age strongly predicts effectiveness of caching 30

31 Hypothetical changes to caching?  We looked at the idea of having Facebook caches collaborate at national scale…  … and also at how to vary caching based on the “busyness” of the client 31

32 Social networking effect?  Hypothesis: caching will work best for photos posted by famous people with zillions of followers  Actual finding: not really 32

33 Locality?  Hypothesis: FB probably serves photos from close to where you are sitting  Finding: Not really...  … just the same, if the photo exists, it finds it quickly 33

34 Can one conclude anything?  By learning what patterns of access arise, and how effective it is to cache given kinds of data at various layers, we can customize cache strategies  Each layer can look at an image and ask “should I keep a cached copy of this, or not?”  Smart decisions make Facebook more effective! 34

35 Strategy varies by layer  Browser should cache less popular content but not bother to cache the very popular stuff  Akamai/PoP layer should cache the most popular images, etc...  We also discovered that some layers should “cooperatively” cache even over huge distances  Our study showed that if this were done in the resizer layer, cache hit rates could rise by 35%! 35

36 … many research questions arise 36  Can we design much better caching solutions?  Are there periods with bursts of failures? What causes them and what can be done?  How much of the data in a typical cache gets reused? Are there items dropped from cache that should have been retained?

37 Overall picture in cloud computing  Facebook example illustrates a style of working  Identify high-value problems that matter to the community because of the popularity of the service, the cost of operating it, the speed achieved, etc  Ask how best to solve those problems, ideally using experiments to gain insight  Then build better solutions 37

38 Learning More? 38  We have a paper with more details and data in the 2013 ACM Symposium on Operating Systems Principles (SOSP)  The first author is Qi Huang, a Chinese student who created the famous PPLive system, was studying for his PhD at WUST, and then came to Cornell to visit  Qi will eventually earn two PhD degrees! One awarded by Cornell, one by WUST after he finishes  An amazing and talented cloud computing researcher

39 More about caching 39  Clearly, caching is central to modern cloud computing systems!  But the limitation that we are caching static data is worrying  MapReduce/Hadoop uses purely static data  FB images and videos are static data too  But “general” cloud computing will have very dynamic, rapidly changing kinds of data

40 Coherent Cache 40  We say that a cache is coherent if it is always a perfect real-time replica of the “true” data  The true object could be in a database or file system  Or we could dispense with the true object and use only the in-memory versions. In this case the cache isn’t really a cache but is actually an in-memory replication scheme  In the cloud, file system access is too slow!  So we should learn more about coherent caching

41 What could a coherent cache hold? 41  A standard cache just has “objects” from some data structure  A coherent cache could hold the entire data structure!  A web graph with pages and links  A graph of social network relationships, Twitter feeds and followers, etc  Objects on a Beijing street, so that a self-driving car can safely drive to a parking area, park itself, and drive back later to pick you up

42 Coherent data replication 42  Clearly we will need to spread our data widely: “Partitioning” is required  We partition data in space (like with a DHT)  Also in time (e.g. the version of the database at time T, T+1, …)  Sometimes hierarchically (e.g. users from the US North East, US Central, US North West…)  Famous paper by Jim Gray & others: “The Dangers of Replication and a Solution”  Shows that good partitioning functions are critically needed  Without great partitioning, replication slows a system down!
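To make “partitioning in space, time, and hierarchy” concrete, here is a toy partition function. Every field name and shard count is hypothetical; the only point is that a shard key can combine a region (hierarchy), a version number (time), and a hash bucket (space, DHT-style).

    import hashlib

    def shard_for(user_id: str, region: str, version: int, shards_per_region: int = 8):
        bucket = int.from_bytes(hashlib.md5(user_id.encode()).digest()[:4], "big") % shards_per_region
        return (region,     # hierarchical split, e.g. "us-northeast"
                version,    # temporal split: the database as of version T
                bucket)     # spatial split, exactly as in a DHT

    print(shard_for("ken", "us-northeast", version=17))   # ('us-northeast', 17, <bucket>)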

43 Aspects of coherent replication 43  A partitioning method  Many servers, small subsets for each partition (“shard” has become the common term)  Synchronization mechanisms for conflicting actions  A method for updating the data  A way to spread the “read only” work over the replicas  Shard membership tracking  Handling of faults that cause whole shard to crash

44 Does Facebook have coherency? 44  Experiments reveal the answer: “no”  Create a Facebook account and put many images on it  Then form networks with many friends  Now update your Facebook profile image a few times  Your friends may see multiple different images of you on their “wall” for a long period of time!  This reveals that it takes a long time (hours) for old data to clear from the FB cache hierarchy!

45 Inconsistencies in the cloud

46 In fact this is common in today’s cloud 46  We studied many major cloud-provider systems  Some guarantee coherency for some purposes, but the majority are at best weakly coherent  When data changes, they need a long time to reflect the updates  They cache data heavily and don’t update it promptly

47 CAP Theorem 47  Proposed by Berkeley Professor Eric Brewer  “You can have just 2 of Consistency, Availability, and Partition (fault) tolerance”  He argues that consistency is the guarantee to relax  We will look at this more closely later  Many people adopt his CAP-based views  This justifies non-coherent caching  But their systems can’t solve problems in ways guaranteed to be safe

48 High assurance will need more! 48  Remember that we are interested in the question of how one could create a high assurance cloud!  Such a cloud needs to make promises  If a car drives on a street it must not run over people  If a power system reconfigures it must not explode the power generators  A doctor who uses a hospital computing system needs correct and current data

49 So… we must look at coherent caching 49  In fact we will focus on “data replication”  Data that should be in memory, for speed  With a few shard members holding the information  But with guarantees: if you compute using it, the data is current  Why not a full database?  Our coherent replication methods would live in structures like the Facebook infrastructure: big, complex  We need to build these in ways optimized to the uses  Hence databases might be elements but we can’t just hand the whole question to a database system

50 Summary 50  We looked at the architecture of a typical very large cloud computing system (Facebook)  We saw that it uses caching extremely aggressively  Caching is fairly effective, but could improve  Coherent caching needed in high assurance systems but seems to be a much harder challenge

