3
- Over 300 million active users
- >200 billion monthly page views
- 3.9 trillion feed actions processed per day
- 100 million search queries per day
- Over 1 million developers in 180 countries
- #2 site on the Internet (time on site)
- More than 232 photos…
- 2 billion pieces of content per week
- 6 billion minutes per day
Notes: Exciting and also humbling to be able to serve 300M users and so many operations. If Facebook were a country, we would be the third largest after China and India. The data users upload daily would exceed 350 feature-length movies: an average of 500GB of structured data per day, in addition to about 2TB of photos and videos. As we will see during the talk, this kind of load is highly non-trivial. The challenge is to build systems that can support this scale and keep the site running 24 hours a day, each and every day.
4 Growth Rate
- 300M active users in 2009 (people log in more than once a month :)
Notes: It took over four years to reach 100 million users, a level we achieved in September of 2008, and only one more year to reach 300 million. Technology adoption is accelerating: radio took 38 years to reach 50M users, TV took 13 years, computers took 4 years; Facebook took only 3 years to reach 50M active users. Why is this important? The rate of growth matters because you have very little time to change things. You just re-built something yesterday and yet, it no longer works today! Designing for exponential growth is really hard: it is hard to predict what will go wrong or where the next set of bottlenecks will be.
5 Social Networks
Notes: So what makes the work at Facebook challenging?
6 The social graph links everything
- People are only one dimension of the social graph
- Social applications link people to many types of data: photos, videos, music, blog posts, groups, events, organizations, and even other applications
7 Scaling Social Networks
- Much harder than typical websites, where typically only 1-2% of users are online: easy to cache the data, and partitioning and scaling are relatively easy
- What do you do when everything is interconnected?
Notes: Facebook and social networks in general have a unique problem. The data is so connected that the only reasonable way to store it is essentially over a uniformly distributed set of data providers. It's difficult to segment our data in any meaningful way to reside on the same disks without duplicating it everywhere. It's also so frequently accessed that we simply can't hit the database for each access.
8 [Consequence of a Click -- II]
(diagram: dozens of objects fanning out from a single click, each carrying a name, status, privacy setting, and a profile photo or video thumbnail)
Notes: When I click on a friend's photo, for example, a lot of things happen. The data is retrieved from the database and checked in real time for privacy, visibility, and other rules. Each time you click on a photo, status, name, friends, or friends of friends, this process takes place.
9 System Architecture
10 Architecture
- Load Balancer (assigns a web server)
- Web Server (PHP assembles data)
- Memcache (fast, simple)
- Database (slow, persistent)
Notes: Kind of inaccurate: memcache and the database are on the side, and there are lots of other services.
11 Memcache
- Simple in-memory hash table
- Supports get/set, delete, multiget, multiset
- Not a write-through cache
- Pros and cons: the database shield! Low latency and very high request rates, but can be easy to corrupt and is inefficient for very small items
Notes: In many ways, this is the heart or core of the site: 120 million queries per second, the equivalent of typing out over 50 volumes of the Encyclopedia Britannica in a tenth of a second.
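Because memcache is not a write-through cache, the application treats it as a look-aside cache: reads fall back to the database and fill the cache, and writes invalidate rather than update. A minimal sketch, with the cache and database modeled as plain dicts (all names here are illustrative, not Facebook's actual client API):

```python
# Look-aside cache sketch (hypothetical helper names; the cache and
# database are stand-in dicts, not real memcache/MySQL clients).
cache = {}
database = {"user:1": {"name": "Alice", "status": "hi"}}

def cache_get(key):
    return cache.get(key)

def cache_set(key, value):
    cache[key] = value

def cache_delete(key):
    cache.pop(key, None)

def read(key):
    """Read path: try memcache first, fall back to the DB and fill."""
    value = cache_get(key)
    if value is None:
        value = database.get(key)   # slow, persistent tier
        if value is not None:
            cache_set(key, value)   # demand-fill, not write-through
    return value

def write(key, value):
    """Write path: update the DB, then invalidate (not update) the cache."""
    database[key] = value
    cache_delete(key)               # next read repopulates from the DB
```

Invalidating instead of updating on writes keeps the cache from serving data that a concurrent DB write has already superseded.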
16 Network Incast
(diagram: PHP client issuing requests through a switch to several memcache servers)
- Implement flow control on the client over multiple UDP connections
- Aggressive timeouts
- Blowing up memcache past a threshold
17 Memcache Clustering
- Many small objects per server
- Many servers per large object
Notes: If objects are small, round trips dominate, so you want objects clustered. If objects are large, transfer time dominates, so you want objects distributed. In a web application you will almost always be dealing with small objects, and you can get into a situation where adding machines doesn't help scaling.
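The clustered-vs-distributed tradeoff can be made concrete with a toy cost model. The constants, and the simplifying assumption that each extra server contacted costs one serial round trip while transfers parallelize, are illustrative, not measurements:

```python
# Toy cost model for memcache clustering (assumed, illustrative numbers).
def fetch_time(n_keys, obj_bytes, n_servers, rtt=0.0005, bw=125_000_000):
    """Estimated seconds to fetch n_keys objects of obj_bytes each.

    rtt: per-server round-trip cost (s), paid once per server contacted;
    bw:  bytes/sec per network link, parallel across servers.
    """
    round_trips = n_servers * rtt                       # per-server request cost
    transfer = (n_keys * obj_bytes) / (bw * n_servers)  # parallel links
    return round_trips + transfer

# 1000 tiny objects: round trips dominate, so clustering on one server wins.
small_clustered = fetch_time(1000, 100, n_servers=1)
small_spread = fetch_time(1000, 100, n_servers=10)

# One 10MB object striped across servers: transfer dominates, spreading wins.
large_clustered = fetch_time(1, 10_000_000, n_servers=1)
large_spread = fetch_time(1, 10_000_000, n_servers=10)
```

With these numbers the small-object fetch is faster clustered and the large-object fetch is faster spread, which is exactly the slide's point.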
21 Memcache Pool Optimization
- Currently a manual process
- Replication for obvious hot data sets
- Interesting problem: optimize the allocation based on access patterns
22 Vertical Partitioning of Object Types
(diagram: a general pool with wide fanout across shards 1, 2, 3, … n, alongside Specialized Replica 1 and Specialized Replica 2, each holding shards 1 and 2)
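From the client's point of view, vertical partitioning is a two-level lookup: pick a pool by object type, then a shard within the pool by key hash. A sketch under assumed pool names and shard counts (the real layout and hashing scheme are not described on the slide):

```python
# Two-level routing sketch (pool names and shard counts are made up).
POOLS = {
    "general": {"shards": 8},   # wide fanout over shards 1..n
    "profile": {"shards": 2},   # specialized replica
    "feed":    {"shards": 2},   # specialized replica
}

def route(obj_type, key):
    """Return (pool name, shard index) for a lookup.

    Unknown object types fall through to the general pool. Python's
    built-in hash() stands in for a real (e.g. consistent) hash.
    """
    pool = obj_type if obj_type in POOLS else "general"
    shard = hash(key) % POOLS[pool]["shards"]
    return pool, shard
```

Production systems would use consistent hashing here so that adding a shard does not remap every key; `hash() % n` is only the simplest stand-in.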
23 MySQL
- MySQL has played a role from the beginning
- Thousands of MySQL servers in two datacenters
Notes: Today, our user database cluster is a large pool of independent MySQL servers. We have chosen to use a shared-nothing architecture both to achieve scalability and fault isolation. Think battleship vs. an army of foot soldiers.
24 MySQL Usage
- Pretty solid transactional persistent store
- Logical migration of data is difficult
- Logical-physical DB mapping
- Rarely use advanced query features
- Performance: database resources are precious; web tier CPU is relatively cheap
- Distributed data: no joins!
- Sound administrative model
25 MySQL is better because it is Open Source
- We can enhance or extend the database as we see fit, when we see fit
- Facebook extended MySQL to support distributed cache invalidation for memcache:
  INSERT table_foo (a,b,c) VALUES (1,2,3) MEMCACHE_DIRTY key1,key2,...
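What the MEMCACHE_DIRTY clause accomplishes can be sketched client-side in plain Python: apply the write, then invalidate the listed cache keys. (The actual extension runs inside the MySQL server, so the invalidation also covers writes that never pass through this application code; the names and stand-in dicts below are illustrative only.)

```python
# Client-side sketch of MEMCACHE_DIRTY semantics (illustrative names;
# the table and cache are stand-in Python objects, not real servers).
cache = {"key1": "stale", "key2": "stale", "other": "ok"}
table_foo = []

def insert_memcache_dirty(row, dirty_keys):
    """Apply the write, then invalidate the listed cache keys."""
    table_foo.append(row)        # stands in for the SQL INSERT
    for key in dirty_keys:
        cache.pop(key, None)     # stands in for the memcache delete

insert_memcache_dirty((1, 2, 3), ["key1", "key2"])
```

Pushing this into the database is what makes the invalidation "distributed": replicas replaying the statement can dirty the same keys in their local datacenter's cache.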
26 Scaling across datacenters
(diagram: on the West Coast, SF web and memcache plus SC web, memcache, and MySQL; on the East Coast, VA web, memcache, and MySQL; the coasts are linked by MySQL replication, with a memcache proxy on each side)
27 Other Interesting Issues
- Application-level batching and parallelization
- Super-hot data items
- Cache-key versioning with continuous availability
28 Photos
30 Photos: Scale
- 20 billion photos x 4 sizes stored = 80 billion images
- Laid end to end, they would wrap around the world more than 10 times!
- Over 40M new photos per day
- 600K photos served per second
31 Photos Scaling: The Easy Wins
- Upload tier: handles uploads, scales images, stores on NFS
- Serving tier: images served from NFS via HTTP
- However... file systems are not good at supporting large numbers of files; metadata is too large to fit in memory, causing too many I/Os for each file read; limited by I/O, not storage density
- Easy wins: CDN; Cachr (HTTP server + caching); NFS file handle cache
Notes: About 10 I/Os per read, reduced to 3 with the easy wins.
32 Photos: Haystack
- Overlay file system
- Index in memory
- One I/O per read
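The "one I/O per read" property comes from keeping all photo metadata in RAM: photos are appended to one large file, and an in-memory index maps each photo id to an (offset, size) pair, so a read is a single seek plus read with no filesystem metadata lookups. A toy sketch (illustrative only, not Haystack's actual on-disk format; an in-memory buffer stands in for the volume file):

```python
# Toy Haystack-style store: append-only volume + in-memory index.
import io

class TinyHaystack:
    def __init__(self):
        self.volume = io.BytesIO()  # stands in for one large on-disk file
        self.index = {}             # photo_id -> (offset, size), kept in RAM

    def put(self, photo_id, data):
        offset = self.volume.seek(0, io.SEEK_END)  # append-only write
        self.volume.write(data)
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        offset, size = self.index[photo_id]  # no disk I/O for metadata
        self.volume.seek(offset)             # one seek...
        return self.volume.read(size)        # ...one read

store = TinyHaystack()
store.put("p1", b"JPEG-bytes-1")
store.put("p2", b"JPEG-bytes-2")
```

Contrast with NFS-backed serving, where resolving a path to an inode could itself cost several I/Os before the first byte of the image is read.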
33 Data Warehousing
34 Data: How much?
- 200GB per day in March 2008
- 2+TB (compressed) raw data per day in April 2009
- 4+TB (compressed) raw data per day today
35 The Data Age
- Free or low cost of user services
- Consumer behavior is hard to predict
- Data and analysis are critical
- More data beats better algorithms
36 Deficiencies of existing technologies
- Analysis/storage on proprietary systems is too expensive
- Closed systems are hard to extend
37 Hadoop & Hive
38 Hadoop
- Superior availability/scalability/manageability despite lower single-node performance
- Open system
- Scalable costs
- Cons: programmability and metadata. Map-reduce is hard to program (users know SQL/bash/python/perl), and data needs to be published in well-known schemas
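The "map-reduce is hard to program" point is easy to see with a toy example: even a simple grouped count takes hand-written map, shuffle/sort, and reduce phases, where in Hive it is roughly one line of SQL (`SELECT page, COUNT(*) FROM clicks GROUP BY page`). A single-process sketch of those phases:

```python
# Toy map-reduce for a grouped count, run in one process for illustration.
from itertools import groupby
from operator import itemgetter

clicks = ["home", "photos", "home", "profile", "home"]  # toy input

def map_phase(records):
    """Emit a (key, 1) pair per input record."""
    return [(page, 1) for page in records]

def reduce_phase(pairs):
    """Group pairs by key and sum the counts (sort stands in for shuffle)."""
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(clicks))
```

In a real Hadoop job the same map and reduce logic is split across many machines and the framework does the shuffle, but the programmer still writes both phases by hand, which is exactly the overhead Hive removes.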
39 Hive
- A system for managing and querying structured data, built on top of Hadoop
- Components: Map-Reduce for execution, HDFS for storage, metadata in an RDBMS
41 Hive: Sample Applications
- Reporting, e.g., daily/weekly aggregations of impression/click counts, measures of user engagement
- Ad hoc analysis, e.g., how many group admins, broken down by state/country
- Machine learning (assembling training data), e.g., ad optimization: user engagement as a function of user attributes
- Lots more
42 Hive: Server Infrastructure
- 4800 cores; storage capacity of 5.5 petabytes; 12 TB per node
- Two-level network topology: 1 Gbit/sec from node to rack switch, 4 Gbit/sec from rack switch to the top level
43 Hive & Hadoop: Usage Stats
- 4 TB of compressed new data added per day
- 135 TB of compressed data scanned per day
- 7500+ Hive jobs per day
- 80K compute-hours per day
- 200 people run jobs on Hadoop/Hive
- Analysts (non-engineers) use Hadoop through Hive
- 95% of jobs are Hive jobs
44 Hive: Technical Overview
45 Hive: Open and Extensible
- Query your own formats and types with your own Serializers/Deserializers
- Extend the SQL functionality through user-defined functions
- Do any non-SQL transformations through the TRANSFORM operator, which sends data from Hive to any user program/script
46 Hive: Smarter Execution Plans
- Map-side joins
- Predicate pushdown
- Partition pruning
- Hash-based aggregations
- Parallel execution of operator trees
- Intelligent scheduling
47 Hive: Possible Future Optimizations
- Pipelining?
- Finer operator control (controlling sorts)
- Cost-based optimizations?
- HBase
48 Spikes: The Username Launch
Notes: The numbers presented earlier should give you a sense of the scale Facebook operates at and the challenges it throws up. Facebook requires a giant infrastructure and also a very diverse array of components. The cool thing is that for every class you're into, we have the opportunity for you to dive into that field and be right at the forefront. And luckily, Facebook is also a place that is really a continuation of our studies: broad and challenging.
49 System Design
- The database tier cannot handle the load
- Dedicated memcache tier for assigned usernames: a miss => the name is available, avoiding database hits altogether
- Blacklists: bucketize, cache in the local tier
- Timeouts
Notes: Performance of the availability checker was one of the most critical parts of the system. This wasn't a big issue when we were going down the auction path, but it became critical once we decided to do first-come, first-served. Generating suggestions adds more load (maybe 6-10 or more checks per page load). Usual caching doesn't help a lot, since there will be a lot of misses and the DB tier cannot handle the load. Constant refreshing was a concern, so we added a countdown to incentivize users to just watch; upon countdown completion, we don't auto-refresh but show a continue button. We also disabled the chat bar, used a lite include stack, and avoided pulling in your set of pages unnecessarily: tradeoffs between UX (extra clicks) and performance.
50 Username Memcache Tier
(diagram: PHP client in front of username memcache nodes UN0, UN1, …, UN7)
- Parallel pool in each data center
- 8 nodes per pool
- Writes replicated to all nodes
- Reads can go to any node (hashed by uid)
Notes: A replicated memcache tier. The key gets hashed to a number 0-7 (based on uid). But these are not DB-backed keys, so sets need to be replicated.
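The tier's behavior can be sketched in a few lines: writes are replicated to all 8 nodes (since the keys are not DB-backed, no node can fill a miss from the database), reads are routed by hashing the requesting uid, and a miss means the username is still available. The function names and dict-based nodes below are illustrative, not the actual client:

```python
# Sketch of the replicated username tier (illustrative names only).
N_NODES = 8
nodes = [dict() for _ in range(N_NODES)]   # stand-ins for memcache nodes

def claim(username, fbid):
    """Replicate the set to every node: keys are not DB-backed,
    so a node that missed the write could never be backfilled."""
    for node in nodes:
        node[username] = fbid

def is_available(username, uid):
    """Route the read by uid; a miss => the username is free."""
    node = nodes[uid % N_NODES]
    return username not in node

claim("zuck", 4)
```

Routing reads by uid rather than by username spreads each user's burst of availability checks across the pool while every node can answer any query.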
51 Write Optimization
- Hashout store: a distributed key-value store (MySQL-backed)
- Lockless (optimistic) concurrency control
Notes: The hashout store stores the mapping from username to <alias fbid> (which has some more data). Optimistic concurrency means no locks are obtained, and since conflict rates are low, it is a win.
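Optimistic concurrency control typically works by versioning: a writer notes the version it read and commits only if that version is still current, retrying on conflict instead of ever holding a lock. A minimal sketch of the idea (illustrative only, not the actual hashout implementation):

```python
# Lockless (optimistic) concurrency sketch over a key-value store.
store = {}   # key -> (version, value)

def kv_read(key):
    """Return (version, value); version 0 means the key is absent."""
    return store.get(key, (0, None))

def compare_and_set(key, expected_version, value):
    """Commit only if nobody wrote in between; no locks are taken.

    Returns True on success; on False the caller re-reads and retries.
    """
    current_version, _ = store.get(key, (0, None))
    if current_version != expected_version:
        return False                 # conflict detected
    store[key] = (current_version + 1, value)
    return True
```

This is a win precisely when conflicts are rare, as the slide notes: the common case pays no locking cost, and the rare conflict costs one retry.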
52 Fault Tolerance
- Memcache nodes can go down: always check another node on a miss; replay from a log file (Scribe)
- Memcache sets are not guaranteed to succeed: self-correcting code writes to memcache again if we detect the inconsistency during DB writes
Notes: One of the issues we worried about with non-DB-backed keys, with memcache as the system of record, was that memcache nodes can go down. Replicated memcache pools allow some redundancy: on a miss, the replicated pool infrastructure lets us check another node. We replay Scribe logs to recover from more than one failure (this actually happened!). Memcache sets that fail mean a bad user experience: the write will not ultimately succeed, since the database will complain, but this is still not ideal. So if we find that a username is taken during the DB-write stage, we call set again to write that username to the username memcache tier, preventing future users from hitting the same problem.
53 Nuclear Options
- Newsfeed: reduce the number of stories; turn off scrolling and highlights
- Profile: make the info tab the default
- Chat: reduce the buddy-list refresh rate; turn it off!
Notes: Even though we designed the system to be highly performant and tested it under a lot of load, there was a chance that if a really, really large number of people showed up at launch time, our system might not be able to handle it. So we thought about ways to reduce the load caused by other features on the site to provide extra capacity for the username launch. We affectionately called these the "nuclear options," since they were a last resort for handling the enormous load.
54 How much load?
- 200K in 3 minutes
- 1M in 1 hour
- 50M in the first month
- Prepared for over 10x!
Notes: The actual load was high, but not anywhere near what we were prepared for. This shows how careful design, planning, and testing can lead to a successful launch.
56 Some interesting problems
- Graph models and languages: low-latency, fast access; slightly more expressive queries; consistency and staleness can be a bit loose; analysis over large data sets; privacy as part of the model
- Fat data pipes: push enormous volumes of data to several third-party applications (e.g., the entire newsfeed to search partners); controllable QoS
57 Some interesting problems (contd.)
- Search relevance
- Storage systems
- Middle-tier (cache) optimization
- Application data access language