TAO: Facebook's Distributed Data Store for the Social Graph
Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dimitri Petrov, Lovro Puzar, Yee Jiun Song, Venkat Venkataramani
Presenter: Chang Dong

Motivation
From David's guest lecture:
– Social graph stored in MySQL databases
– Memcache used as a (scalable) look-aside cache
This is great, but can we do even better? Some challenges with this design:
– Inefficient edge lists: a key-value cache is not a good fit for the edge lists in a graph; the entire list always has to be fetched
– Distributed control logic: cache control logic runs on clients that don't communicate with each other, which means more failure modes and makes "thundering herds" difficult to avoid (recall leases)
– Expensive read-after-write consistency: in the original design, writes always have to go to the 'master'. Can we write to caches directly, without inter-regional communication?

Goals for TAO
Provide a data store with a graph abstraction (vertices and edges), not keys/values
Optimize heavily for reads
– More than two orders of magnitude more reads than writes!
Explicitly favor efficiency and availability over consistency
– Slightly stale data is often okay (for Facebook)
– Communication between data centers in different regions is expensive

Thinking about related objects
We can represent related objects as a labeled, directed graph. Entities are typically represented as nodes; relationships are typically edges.
– Nodes all have IDs, and possibly other properties
– Edges typically have values, and possibly IDs and other properties

TAO's data model
Facebook's data model is exactly like that!
– Focuses on people, actions, and relationships
– These are represented as vertices and edges in a graph
Example: Alice visits a landmark with Bob
– Alice 'checks in' with her mobile phone
– Alice 'tags' Bob to indicate that he is with her
– Cathy adds a comment
– David 'likes' the comment

TAO's data model and API
TAO "objects" (vertices):
– 64-bit integer ID (id)
– Object type (otype)
– Data, in the form of key-value pairs
TAO "associations" (edges):
– Source object ID (id1)
– Association type (atype)
– Destination object ID (id2)
– 32-bit timestamp
– Data, in the form of key-value pairs
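To make this layout concrete, here is a minimal sketch of the two record types in Python. The field names follow the slide (id, otype, id1, atype, id2, time, key-value data); the class names and types are illustrative, not Facebook's actual code.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TaoObject:
    id: int                                   # 64-bit object ID (the shard ID is embedded in it)
    otype: str                                # object type, e.g. "user", "checkin", "comment"
    data: Dict[str, str] = field(default_factory=dict)   # key-value payload

@dataclass
class TaoAssociation:
    id1: int                                  # source object ID
    atype: str                                # association type, e.g. "comments", "liked_by"
    id2: int                                  # destination object ID
    time: int                                 # 32-bit creation timestamp
    data: Dict[str, str] = field(default_factory=dict)   # key-value payload
```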

Example: Encoding in TAO (slide shows the check-in example drawn as a graph, with inverse edge types and key-value data attached to the nodes and edges)
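As a rough illustration of that encoding, the check-in example from the earlier slide might be laid out as objects and associations like this, reusing the TaoObject/TaoAssociation sketch above. All IDs and type names here are made up.

```python
# All IDs and association-type names below are invented for illustration.
alice    = TaoObject(id=105, otype="user", data={"name": "Alice"})
bob      = TaoObject(id=244, otype="user", data={"name": "Bob"})
david    = TaoObject(id=410, otype="user", data={"name": "David"})
landmark = TaoObject(id=534, otype="location", data={"name": "Golden Gate Bridge"})
checkin  = TaoObject(id=632, otype="checkin")
comment  = TaoObject(id=771, otype="comment", data={"text": "Wish we were there!"})

edges = [
    TaoAssociation(632, "checkin_at", 534, time=1),    # check-in -> landmark
    TaoAssociation(534, "checkins",   632, time=1),    # inverse edge type
    TaoAssociation(632, "tagged",     244, time=2),    # Bob is tagged in the check-in
    TaoAssociation(244, "tagged_at",  632, time=2),    # inverse edge type
    TaoAssociation(632, "comments",   771, time=3),    # Cathy's comment on the check-in
    TaoAssociation(771, "liked_by",   410, time=4),    # David likes the comment
    TaoAssociation(410, "likes",      771, time=4),    # inverse edge type
]
```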

Association queries in TAO
TAO is not a general graph database
– It has a few specific (Facebook-relevant) queries 'baked into it'
– Common query: given an object and an association type, return an association list (all the outgoing edges of that type)
  Example: find all the comments for a given check-in
– Optimized based on knowledge of Facebook's workload
  Example: most queries focus on the newest items (posts, etc.)
  There is creation-time locality, so we can optimize for that!
Queries on association lists:
– assoc_get(id1, atype, id2set, t_low, t_high)
– assoc_count(id1, atype)
– assoc_range(id1, atype, pos, limit)   ("cursor"-style pagination)
– assoc_time_range(id1, atype, high, low, limit)
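A hypothetical usage sketch of those four calls, with `tao` standing in for an assumed client handle (not a real library) and the IDs taken from the encoding example above:

```python
def example_queries(tao):
    # Which of these specific "liked_by" edges exist, within a time window?
    likes = tao.assoc_get(id1=771, atype="liked_by", id2set={410, 105},
                          t_low=0, t_high=2**31 - 1)

    # How many users liked comment 771?
    n_likes = tao.assoc_count(id1=771, atype="liked_by")

    # The 10 newest comments on check-in 632; `pos` acts as a cursor into the
    # creation-time-ordered association list.
    newest = tao.assoc_range(id1=632, atype="comments", pos=0, limit=10)

    # Comments on check-in 632 created within [low, high], newest first.
    window = tao.assoc_time_range(id1=632, atype="comments",
                                  high=2**31 - 1, low=0, limit=10)
    return likes, n_likes, newest, window
```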

TAO's storage layer
Objects and associations are stored in MySQL. But what about scalability?
– Facebook's graph is far too large for any single MySQL DB!
Solution: data is divided into logical shards
– Each object ID contains a shard ID
– Associations are stored in the shard of their source object
– Shards are small enough to fit into a single MySQL instance!
– A common trick for achieving scalability
– What is the 'price to pay' for sharding?
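A sketch of what the sharding scheme could look like, assuming the shard ID sits in the low bits of the 64-bit object ID and that many logical shards map onto far fewer MySQL instances; the exact encoding and shard count are assumptions, not taken from the slide.

```python
NUM_SHARDS = 4096                 # assumed fixed number of logical shards

def shard_of_object(obj_id: int) -> int:
    # Assumption: the shard ID occupies the low bits of the 64-bit object ID.
    return obj_id & (NUM_SHARDS - 1)

def shard_of_association(id1: int, atype: str, id2: int) -> int:
    # Associations live with their *source* object, so "all outgoing edges
    # of id1 with this type" touches exactly one shard.
    return shard_of_object(id1)

def mysql_instance_for(shard: int, num_db_servers: int) -> int:
    # Many logical shards map onto far fewer MySQL instances; keeping shards
    # small makes it possible to move them between instances to rebalance.
    return shard % num_db_servers
```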

Caching in TAO (1/2)
Problem: hitting MySQL is very expensive
– But most of the requests are read requests anyway!
– Let's try to serve these from a cache
TAO's cache is organized into tiers
– A tier consists of multiple cache servers (the number can vary)
– Sharding is used again here: each server in a tier is responsible for a certain subset of the objects and associations
– Together, the servers in a tier can serve any request!
– Clients talk directly to the appropriate cache server, which avoids bottlenecks
– In-memory cache for objects, associations, and association counts (!)
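One plausible way a client could pick the right server within a tier is sketched below; the slide only says each server handles a subset of the objects and associations, so the hashing scheme here is an assumption.

```python
import hashlib

def cache_server_for(shard: int, tier: list) -> str:
    # Hash the shard ID so keys spread evenly over the tier and the mapping
    # stays stable as long as the tier membership does not change.
    h = int(hashlib.md5(str(shard).encode()).hexdigest(), 16)
    return tier[h % len(tier)]

tier = ["cache01:11211", "cache02:11211", "cache03:11211"]
shard = 632 & 4095                       # shard ID from the low bits, as in the sharding sketch above
print(cache_server_for(shard, tier))     # the client talks to this server directly
```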

Caching in TAO (2/2)
How does the cache work?
– New entries are filled on demand
– When the cache is full, the least recently used (LRU) object is evicted
– The cache is "smart": if it knows that an object had zero associations of some type, it knows how to answer a range query
What about write requests?
– They need to go to the database (write-through)
– But what if we're writing a bidirectional edge? The inverse may be stored in a different shard, so we need to contact that shard too!
– What if a failure happens while we're writing such an edge?
  You might think that there are transactions and atomicity, but in fact TAO simply leaves the 'hanging edges' in place
  An asynchronous repair job takes care of them eventually
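A sketch of that best-effort write path for a bidirectional association; `db_write` and the repair queue are hypothetical stand-ins, not TAO code.

```python
def db_write(shard: int, assoc: tuple) -> None:
    print(f"write to shard {shard}: {assoc}")        # placeholder for a MySQL write

repair_queue = []                                     # placeholder for the async repair job's queue

def add_bidirectional_assoc(id1, atype, id2, inv_atype, time, num_shards=4096):
    db_write(id1 % num_shards, (id1, atype, id2, time))          # forward edge, shard of id1
    try:
        db_write(id2 % num_shards, (id2, inv_atype, id1, time))  # inverse edge, possibly another shard
    except Exception:
        # Best effort only: leave the hanging forward edge in place;
        # the asynchronous repair job will eventually add the inverse edge.
        repair_queue.append((id1, atype, id2, inv_atype, time))

add_bidirectional_assoc(632, "tagged", 244, "tagged_at", time=2)
```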

Leaders and followers
How many machines should be in a tier?
– Too many is problematic: more prone to hot spots, etc.
Solution: add another level of hierarchy
– Each shard can have multiple cache tiers: one leader and multiple followers
– The leader talks directly to the MySQL database
– Followers talk to the leader
– Clients can only interact with followers
– The leader can protect the database from 'thundering herds'
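The read path through a follower might look roughly like this, with plain dictionaries standing in for the caches and a stubbed-out database read:

```python
follower_cache: dict = {}
leader_cache: dict = {}

def mysql_read(key):
    return f"row-for-{key}"                  # placeholder for the real database read

def leader_fetch(key):
    if key not in leader_cache:
        leader_cache[key] = mysql_read(key)  # only the leader ever hits MySQL
    return leader_cache[key]

def follower_read(key):
    if key in follower_cache:
        return follower_cache[key]           # cache hit, served entirely in the follower tier
    value = leader_fetch(key)                # miss: ask the leader, never the database
    follower_cache[key] = value
    return value

print(follower_read(("obj", 632)))
```

Because every miss funnels through the leader, concurrent misses for the same hot key can be coalesced there, which is what shields the database from 'thundering herds'.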

Scaling geographically
Facebook is a global service. Does this design work at global scale?
No, the laws of physics are in the way!
– Long propagation delays, e.g., between Asia and the U.S.
– What tricks do we know that could help with this?

Scaling geographically
Idea: divide data centers into regions; keep one full replica of the data in each region
– What could be a problem with this approach?
– Consistency!
– Solution: one region has the 'master' database; other regions forward their writes to the master
– Database replication makes sure that the 'slave' databases eventually learn of all writes; invalidation messages are sent as well, just like with the leaders and followers
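A sketch of the resulting write path, with hypothetical helpers: reads stay local, writes from a slave region are forwarded to the master region, and the local replica catches up asynchronously.

```python
MASTER_REGION = "us-west"                              # assumed region names

def master_db_write(key, value):
    print(f"master DB write: {key} = {value}")         # placeholder

def forward_to_master(region, key, value):
    print(f"forwarding write for {key} to master region {region}")   # placeholder RPC

def regional_write(local_region, key, value):
    if local_region == MASTER_REGION:
        master_db_write(key, value)
    else:
        forward_to_master(MASTER_REGION, key, value)
        # The local slave DB learns of this write via MySQL replication,
        # and invalidation/refill messages then update the local cache tiers.

regional_write("asia-east", ("obj", 632), {"text": "hello"})
```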

Handling failures
What if the master database fails?
– Another region's database can be promoted to be the master
– But what about writes that were in progress during the switch?
– What would be the 'database answer' to this?
– TAO's approach: accept weaker consistency during the failover rather than block writes (see the Summary slide)

Production deployment at Facebook
Impressive performance
– Handles 1 billion reads/sec and 1 million writes/sec!
Reads dominate massively
– Only 0.2% of requests involve a write
Most edge queries have zero results
– 45% of assoc_count calls return 0...
– ...but there is a heavy tail: 1% return more than 500,000!
Cache hit rate is very high
– Overall, 96.4%!

Summary
The data model really does matter!
– Key-value pairs are nice and generic, but you can sometimes get better performance by telling the storage system more about the kind of data you are storing in it (this enables optimizations!)
Several useful scaling techniques
– "Sharding" of databases and cache tiers (not invented at Facebook, but put to great use)
– Primary-backup replication to scale geographically
Interesting perspective on consistency
– On the one hand, quite a bit of complexity and hard work to do well in the common case (truly "best effort")
– But also a willingness to accept eventual consistency (or worse!) during failures, or when the cost would be high