PNUTS: Yahoo!’s Hosted Data Serving Platform. Yahoo! Research, presented by Liyan & Fang.

Presentation transcript:

PNUTS: Yahoo!’s Hosted Data Serving Platform. Yahoo! Research, presented by Liyan & Fang

Social network websites
[Diagram: a social graph of users (Brian, Sonja, Jimi, Brandon, Kurt). Brian asks "What are my friends up to?" and sees status updates from Sonja and Brandon.]

What does a web application need?
– Scalability: architectural scalability, and the ability to scale during periods of rapid growth with minimal operational effort
– Response time and geographic scope: fast response times for geographically distributed users
– High availability and fault tolerance: the ability to read, and even write, data during failures
– Relaxed consistency guarantees: eventual consistency, i.e. update one replica first and then propagate the update to the others

What do we need from our DBMS?
Web applications need:
– scalability, and the ability to scale linearly
– geographic scope
– high availability
Web applications typically have:
– simplified query needs: no joins or aggregations
– relaxed consistency needs: applications can tolerate stale or reordered data

What is PNUTS?

[Diagram: a table of records partitioned and replicated across geographically distributed regions.]
CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
  …
)
– Parallel database
– Geographic replication
– Indexes and views
– Structured, flexible schema
– Hosted, managed infrastructure

Query model
Per-record operations: get, set, delete
Multi-record operations: multiget, scan, getrange
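To make the query model concrete, here is a minimal sketch of what these calls might look like against a PNUTS-style client; the class and method names (PnutsClient, get, set, multiget, and so on) are illustrative assumptions backed by an in-memory table, not the actual Yahoo! API.

```python
# Illustrative sketch of a PNUTS-style client (assumed names, in-memory table).
# The methods map one-to-one onto the per-record and multi-record operations above.

class PnutsClient:
    def __init__(self):
        self._table = {}                      # stand-in for a remote ordered table

    # --- per-record operations ---
    def get(self, key):
        return self._table.get(key)

    def set(self, key, value):
        self._table[key] = value

    def delete(self, key):
        self._table.pop(key, None)

    # --- multi-record operations ---
    def multiget(self, keys):
        return {k: self._table.get(k) for k in keys}

    def scan(self):
        yield from sorted(self._table.items())

    def getrange(self, low, high):
        return [(k, v) for k, v in sorted(self._table.items()) if low <= k < high]


client = PnutsClient()
client.set("user:brian", {"status": "at work"})
client.set("user:sonja", {"status": "on vacation"})
print(client.get("user:brian"))
print(client.multiget(["user:brian", "user:sonja"]))
print(client.getrange("user:a", "user:m"))
```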

Detailed architecture
Data-path components: clients, a REST API, routers, a tablet controller, storage units, and the message broker.
Data tables are horizontally partitioned into groups of records called tablets.
– Storage units: store tablets; they respond to get() and scan() requests by retrieving and returning the matching records, and to set() requests by applying the update. Before an update can be committed, it must first be written to the message broker.
– Routers: determine which storage unit is responsible for a given record. To read or write a record on behalf of a client, the router first determines which tablet contains the record and then which storage unit holds that tablet.
– Tablet controller: decides when to move a tablet between storage units for load balancing or recovery, and when a large tablet must be split; the routers' interval mapping is only a copy that is updated from the controller.
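A small sketch of the router's two-step lookup described above, under the assumption that the interval mapping is a sorted list of tablet boundaries; the names and representation are illustrative, not PNUTS internals.

```python
# Sketch of the router's two-step lookup (assumed names and representation).
# Step 1: find which tablet contains the record, by binary search over the
# sorted tablet boundaries. Step 2: find which storage unit holds that tablet.
import bisect

TABLET_BOUNDARIES = ["g", "p"]                     # splits MIN..MAX into 3 tablets
TABLET_TO_SU = {0: "su-2", 1: "su-0", 2: "su-1"}   # tablet index -> storage unit

def locate(key):
    tablet = bisect.bisect_right(TABLET_BOUNDARIES, key)   # step 1
    return tablet, TABLET_TO_SU[tablet]                    # step 2

print(locate("apple"))   # (0, 'su-2')
print(locate("mango"))   # (1, 'su-0')
print(locate("zebra"))   # (2, 'su-1')
```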

Detailed architecture: multiple regions
The same data-path components (clients, REST API, routers, tablet controller, storage units) exist in the local region and in remote regions, connected by the Yahoo! Message Broker (YMB).
Record-level mastering: mastership is assigned on a record-by-record basis, and different records in the same table can be mastered in different clusters. Over one week, 85 percent of the writes to a given record originated in the same datacenter.
A record's master publishes its updates to a single broker, so updates are delivered to the replicas in commit order. YMB takes multiple steps to ensure messages are not lost before they are applied to the database, and messages published to one YMB cluster are relayed to other YMB clusters for delivery to their local subscribers.

Query processing

Accessing data
[Diagram: (1) the client sends "get key k" to a router; (2) the router forwards the request to the storage unit (SU) holding k's tablet; (3) the SU returns the record for key k; (4) the router returns it to the client.]

Bulk read
Scatter/gather engine: a component of the router. It receives a multi-record request (e.g. a multiget for {k1, k2, …, kn}), splits it into multiple individual requests for single records or single-tablet scans, and initiates those requests in parallel against the storage units.
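A rough sketch of how such a scatter/gather step might split a multiget and issue the per-record gets in parallel; the thread-pool approach and all names here are assumptions for illustration.

```python
# Sketch of a scatter/gather step for multiget (assumed names). The router
# splits the request into per-record gets, issues them in parallel, and
# gathers the results as the storage units respond.
from concurrent.futures import ThreadPoolExecutor

def single_get(storage_units, route, key):
    su = route(key)                       # which storage unit owns this key's tablet
    return key, storage_units[su].get(key)

def multiget(storage_units, route, keys):
    with ThreadPoolExecutor(max_workers=len(keys)) as pool:
        futures = [pool.submit(single_get, storage_units, route, k) for k in keys]
        return dict(f.result() for f in futures)

# Toy storage units and routing function.
storage_units = {"SU1": {"k1": "v1", "k2": "v2"}, "SU2": {"k3": "v3"}}
route = lambda k: "SU2" if k == "k3" else "SU1"
print(multiget(storage_units, route, ["k1", "k2", "k3"]))
```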

Range queries
The router maintains an interval mapping from tablet key ranges to storage units:
– MIN-Cantaloupe: SU1
– Cantaloupe-Lime: SU3
– Lime-Strawberry: SU2
– Strawberry-MAX: SU1
[Diagram: records such as Apple, Avocado, Banana, Blueberry, Cantaloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Pear, Strawberry, Tomato, and Watermelon spread across storage units 1-3.]
A range query such as "Grapefruit…Pear?" is split along tablet boundaries into "Grapefruit…Lime?" (served from the Cantaloupe-Lime tablet) and "Lime…Pear?" (served from the Lime-Strawberry tablet).
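A small sketch of how the interval mapping above could be used to split a range scan across the tablets it touches; the tuple representation is an assumption, but the mapping itself mirrors the slide.

```python
# Sketch: resolve a range scan to the tablets (and storage units) it spans,
# using the interval mapping above. The tuple representation is an assumption.
TABLETS = [                      # (low inclusive, high exclusive, storage unit)
    ("",           "Cantaloupe", "SU1"),   # MIN-Cantaloupe
    ("Cantaloupe", "Lime",       "SU3"),
    ("Lime",       "Strawberry", "SU2"),
    ("Strawberry", "\uffff",     "SU1"),   # Strawberry-MAX
]

def tablets_for_range(low, high):
    """Return the tablets a scan over [low, high) must visit, in key order."""
    return [(lo, hi, su) for lo, hi, su in TABLETS if lo < high and low < hi]

# "Grapefruit…Pear?" touches two tablets, so two storage units are scanned in turn.
print(tablets_for_range("Grapefruit", "Pear"))
# [('Cantaloupe', 'Lime', 'SU3'), ('Lime', 'Strawberry', 'SU2')]
```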

Updates
[Diagram: the numbered update path (steps 1-8). The client's "write key k" request goes through a router to the storage unit holding the record's master copy; that SU publishes the write to the message brokers and a sequence number for key k is assigned; SUCCESS then flows back through the router to the client, and the brokers asynchronously deliver the update to the other replicas.]
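A minimal sketch of the commit rule behind this path, assuming toy MessageBroker and StorageUnit classes: the storage unit treats the broker's acceptance of the update as the commit point, and only then applies the write and bumps the record's sequence number.

```python
# Sketch of the commit rule on the update path (assumed toy classes). The
# master storage unit publishes the write to the message broker first; only
# once the broker has accepted it does the SU apply the write locally, bump
# the record's sequence number, and report SUCCESS. Replicas in other regions
# apply the update later, when the broker delivers it.
class MessageBroker:
    def __init__(self):
        self.log = []
    def publish(self, topic, msg):
        self.log.append((topic, msg))        # durable in the real system
        return True                          # acknowledged

class StorageUnit:
    def __init__(self, broker):
        self.broker = broker
        self.records, self.seq = {}, {}
    def write(self, key, value):
        if not self.broker.publish(f"tablet-of-{key}", (key, value)):
            return None                      # cannot commit without the broker
        self.records[key] = value
        self.seq[key] = self.seq.get(key, 0) + 1
        return self.seq[key]                 # sequence number returned with SUCCESS

master_su = StorageUnit(MessageBroker())
print(master_su.write("k", "v1"))            # 1
print(master_su.write("k", "v2"))            # 2
```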

Asynchronous replication and consistency

Asynchronous replication

Consistency model
Goal: make it easier for applications to reason about updates and cope with asynchrony. What happens to a record with primary key "Brian"?
[Timeline: the record is inserted, updated several times, and eventually deleted, producing versions v.1 through v.8 within generation 1.]
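One way to picture the per-record metadata behind this timeline is a (generation, version) pair carried with each record; the representation below is an assumption for illustration, not the on-disk format.

```python
# Sketch of the per-record metadata implied by the timeline (assumed
# representation): a delete followed by a re-insert starts a new generation,
# and every update within a generation bumps the version number.
from dataclasses import dataclass

@dataclass
class RecordVersion:
    generation: int
    version: int

    def newer_than(self, other):
        return (self.generation, self.version) > (other.generation, other.version)

v6 = RecordVersion(generation=1, version=6)
v8 = RecordVersion(generation=1, version=8)
print(v8.newer_than(v6))   # True
```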

Consistency model: read-any
[Timeline: versions v.1 through v.8 of the record; v.8 is the current version and the earlier ones are stale. The read may return any of them.]
Read-any: returns a possibly stale version of the record. For example, in a social networking application, when displaying a friend's status it is not absolutely essential to show the most up-to-date value, so read-any can be used.

Consistency model: read up-to-date
[Timeline: versions v.1 through v.8; "read up-to-date" always returns the current version, v.8.]

Consistency model: read-critical
[Timeline: versions v.1 through v.8; a "read ≥ v.6" call may return v.6 or any newer version.]
Read-critical(required version): returns a version of the record that is the same as, or strictly newer than, the required version. For example, a user who writes a record and then reads it back wants a version that definitely reflects his changes.

Consistency model: write
[Timeline: versions v.1 through v.8; a write applies to the current version and produces the next version.]

Consistency model: test-and-set-write
[Timeline: the current version is v.8, so a "write if = v.7" request fails with ERROR.]
Test-and-set-write(required version): performs the requested write to the record if and only if the present version of the record is the same as the required version. This call can be used to implement transactions that first read a record and then write it based on what was read, e.g. incrementing the value of a counter.
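A sketch of the counter example mentioned above, assuming hypothetical client calls get_with_version and test_and_set_write; on a version mismatch the increment simply re-reads and retries.

```python
# Sketch of the counter example (assumed client calls). The write succeeds
# only if the record has not changed since it was read, so concurrent
# increments are never lost; on a version mismatch we re-read and retry.
class ToyClient:
    def __init__(self):
        self.store = {"hits": (1, 0)}                # key -> (version, value)
    def get_with_version(self, key):
        return self.store[key]
    def test_and_set_write(self, key, value, required_version):
        version, _ = self.store[key]
        if version != required_version:
            return False                             # the slide's ERROR case
        self.store[key] = (version + 1, value)
        return True

def increment_counter(client, key, retries=10):
    for _ in range(retries):
        version, value = client.get_with_version(key)
        if client.test_and_set_write(key, value + 1, required_version=version):
            return value + 1
    raise RuntimeError("too much write contention on " + key)

print(increment_counter(ToyClient(), "hits"))        # 1
```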

Record and Tablet Mastership
Data in PNUTS is replicated across sites. A hidden field in each record stores which copy is the master copy:
– updates can be submitted to any copy
– they are forwarded to the master and applied in the order received by the master
The record also contains the origin of its last few updates:
– mastership can be changed by the current master, based on this information
– a mastership change is simply a record update
Tablet mastership:
– required to ensure primary key consistency
– can be different from record mastership
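A sketch of the kind of decision the current master might make from the origins of the last few updates; the 80% threshold and all names are assumptions, not the actual PNUTS policy.

```python
# Sketch of a record-mastership check (assumed threshold and names). The
# record carries the origin region of its last few updates; if most recent
# writes come from another region, the current master hands mastership over,
# which is itself just an ordinary update to the record's hidden master field.
from collections import Counter

def maybe_move_master(current_master, recent_origins, threshold=0.8):
    region, count = Counter(recent_origins).most_common(1)[0]
    if region != current_master and count / len(recent_origins) >= threshold:
        return region                # reassign mastership to this region
    return current_master            # keep mastership where it is

print(maybe_move_master("west", ["east"] * 5 + ["west"]))   # 'east'
print(maybe_move_master("west", ["east", "west", "west"]))  # 'west'
```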

Other Features
Per-record transactions.
Copying a tablet (e.g. for failure recovery):
– request the copy
– publish a checkpoint message
– get a copy of the tablet as of when the checkpoint is received
– apply later updates
Tablet split:
– has to be coordinated across all copies
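A minimal sketch of the catch-up logic in the tablet-copy steps above, assuming the checkpoint has already been published and the snapshot taken; updates queued after the checkpoint are replayed on top of the snapshot.

```python
# Sketch of the catch-up step when copying a tablet (assumed representation):
# start from the snapshot taken as of the checkpoint, then replay the updates
# that were published to the tablet's topic after the checkpoint message.
def copy_tablet(snapshot_as_of_checkpoint, updates_after_checkpoint):
    replica = dict(snapshot_as_of_checkpoint)      # copy as of the checkpoint
    for key, value in updates_after_checkpoint:    # apply later updates in order
        replica[key] = value
    return replica

# An update that arrived after the checkpoint ("b" -> 3) wins over the snapshot.
print(copy_tablet({"a": 1, "b": 2}, [("b", 3), ("c", 4)]))   # {'a': 1, 'b': 3, 'c': 4}
```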

Query Processing
Range scans can span tablets:
– only one tablet is scanned at a time
– the client may not need all results at once
– a continuation object is returned to the client to indicate where the range scan should continue
Notification:
– one pub-sub topic per tablet
– the client knows about tables, not about tablets, so it is automatically subscribed to all tablets, even as tablets are added or removed
– the usual pub-sub problem of undelivered notifications is handled in the usual way
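A client-side sketch of driving a range scan with a continuation object; the getrange signature and the toy server that scans a couple of keys per call are assumptions for illustration.

```python
# Sketch of a client paging through a range scan with a continuation object
# (assumed names). The toy server scans at most one "tablet" (two keys) per
# call and hands back a token saying where to resume; None means done.
class ToyScanClient:
    def __init__(self, items, tablet_size=2):
        self.items = sorted(items)
        self.tablet_size = tablet_size
    def getrange(self, low, high, continuation=None):
        start = continuation if continuation is not None else low
        batch = [(k, v) for k, v in self.items if start <= k < high][: self.tablet_size]
        if not batch:
            return [], None
        next_key = batch[-1][0] + "\0"               # resume just after the last key
        more = any(next_key <= k < high for k, _ in self.items)
        return batch, (next_key if more else None)

def scan_range(client, low, high):
    continuation, results = None, []
    while True:
        batch, continuation = client.getrange(low, high, continuation)
        results.extend(batch)
        if continuation is None:                     # range exhausted
            return results

client = ToyScanClient([("apple", 1), ("kiwi", 2), ("mango", 3), ("pear", 4), ("plum", 5)])
print(scan_range(client, "a", "z"))
```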

Experiments

Experimental setup
The production version supports both hash tables and ordered tables.
Database:
– 3 regions: 2 on the west coast, 1 on the east coast
– 1 KB records, 128 tablets per region
– each process had 100 client threads, for 300 clients in total across the system
Workload:
– requests/second
– 0-50% writes
– 80% locality

Inserts
Inserts (hash tables):
– 75.6 ms per insert in West 1 (the tablet master)
– ms per insert into the non-master West 2
– ms per insert into the non-master East
Inserts (ordered tables):
– 33 ms per insert in West 1
– ms per insert into the non-master West 2
– ms per insert into the non-master East

[Graph: average latency versus request rate, with 10% writes by default.]
Latency first decreases and then increases with increasing load. The high latency at low request rates resulted from an anomaly in the HTTP client library we used, which closed TCP connections between requests at low request rates, requiring an expensive TCP setup for each call. As the proportion of reads increases, the average latency decreases.

Scalability

Size of range scans

Thanks!