Presentation transcript:

IBM Almaden Research Center © 2011 IBM Corporation 1 Spinnaker Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore Jun Rao Eugene Shekita Sandeep Tata (IBM Almaden Research Center)

Slide 2: Outline
- Motivation and Background
- Spinnaker
- Existing Data Stores
- Experiments
- Summary

Slide 3: Motivation
- Growing interest in "scale-out structured storage"
  – Examples: BigTable, Dynamo, PNUTS
  – Many open-source examples: HBase, Hypertable, Voldemort, Cassandra
- The sharded-replicated-MySQL approach is messy
- Start with a fairly simple node architecture that scales:
  – Focus on: commodity components, fault tolerance and high availability, easy elasticity and scalability
  – Give up: relational data model, SQL APIs, complex queries (joins, secondary indexes, ACID transactions)

Slide 4: Outline (Motivation and Background, Spinnaker, Existing Data Stores, Experiments, Summary)

Slide 5: Data Model
- Familiar tables, rows, and columns, but more flexible
  – No upfront schema: new columns can be added at any time
  – Columns can vary from row to row
- Example rows from the slide's figure (row key, followed by column name: column value pairs):
  – k127: type: capacitor, farads: 12mf, cost: $1.05
  – k187: type: resistor, ohms: 8k, cost: $.25
  – k217: ...
  – The figure also shows an extra column, label: banded, illustrating that columns can differ across rows
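The flexible column model on this slide can be pictured as a map from row key to that row's own set of column name/value pairs. The sketch below is a minimal illustration using the rows from the slide, not Spinnaker's actual storage format.

```python
# Minimal sketch of the schema-less row model: each row key maps to its own
# column name -> value pairs, and different rows may carry different columns.
table = {
    "k127": {"type": "capacitor", "farads": "12mf", "cost": "$1.05"},
    "k187": {"type": "resistor", "ohms": "8k", "cost": "$.25"},
}

# A new column can be added to any single row at any time,
# without declaring it up front or touching other rows.
table["k127"]["label"] = "banded"

print(sorted(table["k127"]))  # includes 'label'
print(sorted(table["k187"]))  # does not
```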

Slide 6: Basic API
- insert(key, colName, colValue)
- delete(key, colName)
- get(key, colName)
- test_and_set(key, colName, colValue, timestamp)
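A sketch of what these four calls might look like against a toy, single-node stand-in. The versioning behaviour of test_and_set (apply the write only if the stored version still matches the caller's timestamp) is an assumption inferred from the call's signature, not a statement of Spinnaker's exact semantics.

```python
class KVStoreSketch:
    """Toy, single-node stand-in for the slide's four-call API."""

    def __init__(self):
        self.rows = {}      # key -> {colName: (colValue, timestamp)}
        self.clock = 0      # monotonically increasing version counter

    def insert(self, key, col_name, col_value):
        self.clock += 1
        self.rows.setdefault(key, {})[col_name] = (col_value, self.clock)

    def delete(self, key, col_name):
        self.rows.get(key, {}).pop(col_name, None)

    def get(self, key, col_name):
        # Returns (value, timestamp) or None if the column is absent.
        return self.rows.get(key, {}).get(col_name)

    def test_and_set(self, key, col_name, col_value, timestamp):
        # Assumed semantics: apply the write only if the column's current
        # version still equals the timestamp the caller last read.
        cell = self.rows.get(key, {}).get(col_name)
        if cell is not None and cell[1] != timestamp:
            return False
        self.insert(key, col_name, col_value)
        return True
```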

Slide 7: Spinnaker: Overview
- Data is partitioned into key ranges
- Chained declustering: each key range is stored on its home node and replicated on the next two nodes in the ring (see the sketch after this list)
- The replicas of every partition form a cohort
- Multi-Paxos is executed within each cohort
- Timeline consistency
- Zookeeper provides cluster coordination
- Example layout over five nodes:
  – Node A key ranges: [0,199], [800,999], [600,799]
  – Node B key ranges: [200,399], [0,199], [800,999]
  – Node C key ranges: [400,599], [200,399], [0,199]
  – Node D key ranges: [600,799], [400,599], [200,399]
  – Node E key ranges: [800,999], [600,799], [400,599]
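A sketch of how a key could be mapped to its cohort under the five-node layout on the slide: the key range determines a home node, and chained declustering places the two extra replicas on the next nodes in the ring. The function and constant names are illustrative, not part of Spinnaker.

```python
NODES = ["A", "B", "C", "D", "E"]
RANGE_SIZE = 200            # key ranges [0,199], [200,399], ... as on the slide

def cohort_for_key(key, replication=3):
    """Return the nodes holding replicas of the key's partition.

    Chained declustering, matching the slide's layout: partition i lives on
    node i and on the next (replication - 1) nodes around the ring.
    """
    partition = key // RANGE_SIZE
    return [NODES[(partition + j) % len(NODES)] for j in range(replication)]

print(cohort_for_key(150))   # partition [0,199]   -> ['A', 'B', 'C']
print(cohort_for_key(950))   # partition [800,999] -> ['E', 'A', 'B']
```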

Slide 8: Single Node Architecture
- Components shown in the figure: Commit Queue, Memtables, SSTables, Local Logging and Recovery, Replication and Remote Recovery
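A sketch tying the named components together along a generic LSM-style local write path: records are appended to a write-ahead log and held in the commit queue, committed entries move to the memtable, and memtables are periodically flushed to immutable SSTables. This is an outline under the slide's component names and assumed behaviour, not Spinnaker's actual code.

```python
from collections import deque

class SpinnakerNodeSketch:
    """Illustrative skeleton of the components named on the slide."""

    def __init__(self):
        self.wal = []                 # local logging (write-ahead log)
        self.commit_queue = deque()   # logged but not yet committed writes
        self.memtable = {}            # in-memory view of committed writes
        self.sstables = []            # immutable runs (here: dict snapshots)

    def log_and_queue(self, lsn, key, value):
        # Force the log record before acknowledging anything upstream.
        self.wal.append((lsn, key, value))
        self.commit_queue.append((lsn, key, value))

    def apply_committed(self, up_to_lsn):
        # Move everything with LSN <= up_to_lsn from the queue to the memtable.
        while self.commit_queue and self.commit_queue[0][0] <= up_to_lsn:
            _, key, value = self.commit_queue.popleft()
            self.memtable[key] = value

    def flush_memtable(self):
        # Periodically turn the memtable into an immutable SSTable.
        self.sstables.append(dict(self.memtable))
        self.memtable.clear()
```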

Slide 9: Replication Protocol
- Phase 1: Leader election
- Phase 2: In steady state, updates are accepted using Multi-Paxos

Slide 10: Multi-Paxos Replication Protocol
- The client sends insert X to the cohort leader
- The leader logs X and proposes it to the cohort followers
- The followers log X and ACK the leader
- Once a majority of the cohort has logged X, the leader ACKs the client; the commit is propagated to the followers asynchronously, after which all nodes have the latest version
- Clients can read the latest version at the leader and older versions at the followers
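A sketch of the leader-side commit rule described on this slide: a write is acknowledged to the client once the leader plus a majority of the cohort have logged it, and the commit point is pushed to followers asynchronously. Networking, failures, and leader election are elided; the class and method names are illustrative.

```python
class FollowerStub:
    """Minimal follower stand-in: logs writes, applies them only on commit."""
    def __init__(self):
        self.pending, self.applied = {}, {}
    def log(self, lsn, write):
        self.pending[lsn] = write       # durably logged, not yet visible
        return True                     # ack the leader
    def commit(self, up_to_lsn):
        for lsn in sorted(l for l in list(self.pending) if l <= up_to_lsn):
            self.applied[lsn] = self.pending.pop(lsn)

class CohortLeaderSketch:
    """Illustrative leader-side logic for one cohort of size 3."""
    def __init__(self, followers):
        self.followers = followers
        self.next_lsn = 0
        self.log = {}                   # lsn -> write, forced to the leader's WAL
        self.committed_lsn = -1

    def handle_write(self, write):
        lsn, self.next_lsn = self.next_lsn, self.next_lsn + 1
        self.log[lsn] = write           # leader logs locally ...
        acks = 1                        # ... and its own log write counts
        for f in self.followers:
            if f.log(lsn, write):       # propose X to each follower
                acks += 1
        if acks >= 2:                   # majority of the 3-node cohort
            self.committed_lsn = lsn
            return "ack to client"      # client can read X at the leader now
        return "fail: no quorum"

    def send_async_commit(self):
        # Periodically tell followers how far they may apply to their memtables.
        for f in self.followers:
            f.commit(self.committed_lsn)

leader = CohortLeaderSketch([FollowerStub(), FollowerStub()])
print(leader.handle_write({"key": "k127", "col": "cost", "value": "$1.10"}))
```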

Slide 11: Multi-Paxos Replication Protocol: Details
1. The client sends write X to the leader.
2. The leader acquires an LSN for X, writes the log record to its WAL and Commit Queue, and proposes X to the followers.
3. Each follower writes X to its WAL and Commit Queue and sends an ack to the leader, but does not apply X to its memtables yet.
4. The leader updates its Commit Queue, applies X to its memtables, and sends an ack to the client. The client can now read the latest value at the leader; X is not in the followers' memtables yet, so reads at the followers still see an older value.
5. Later, the leader sends an asynchronous commit message for LSN = Y (Y >= X). Each follower processes everything in its Commit Queue up to Y and applies it to its memtables; reads at the followers now see every update up to LSN = Y.
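A small worked trace of the read visibility this slide describes, under the assumption that each node serves reads straight from its memtable: after the leader acks the client, a read at the leader sees X, a read at a follower still sees the old value, and only after the asynchronous commit for some LSN Y >= X does the follower's view catch up. The key, column, and LSN values are chosen purely for illustration.

```python
# Worked trace of read visibility across the write path on the slide.
leader_memtable = {"k127:cost": "$1.05"}
follower_memtable = {"k127:cost": "$1.05"}
follower_commit_queue = {}          # lsn -> (cell, value), logged but not applied

# Steps 2-4: leader acquires LSN X, both nodes log it, leader applies and acks.
lsn_x = 8
follower_commit_queue[lsn_x] = ("k127:cost", "$1.10")   # logged, not yet visible
leader_memtable["k127:cost"] = "$1.10"

print(leader_memtable["k127:cost"])     # "$1.10" - leader already serves X
print(follower_memtable["k127:cost"])   # "$1.05" - follower still serves old value

# Step 5: asynchronous commit message for LSN Y >= X arrives at the follower.
lsn_y = 9
for lsn in sorted(l for l in list(follower_commit_queue) if l <= lsn_y):
    cell, value = follower_commit_queue.pop(lsn)
    follower_memtable[cell] = value

print(follower_memtable["k127:cost"])   # "$1.10" - follower caught up to LSN Y
```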

Slide 12: Recovery
- Each node maintains a shared log for all the partitions it manages
- If a follower fails and rejoins:
  – The leader ships log records to catch the follower up
  – Once up to date, the follower rejoins the cohort
- If a leader fails:
  – An election chooses a new leader
  – The new leader re-proposes all uncommitted messages
  – If there is a quorum, the cohort opens up for new updates
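A sketch of the leader-failover step on this slide: after the election, the new leader re-proposes every log entry above the last known commit point and, only if a quorum of the cohort acknowledges, reopens the cohort for new writes. The log layout and election itself are simplified away; the function name and parameters are illustrative.

```python
def recover_as_new_leader(log, last_committed_lsn, followers):
    """Illustrative post-election recovery for one cohort.

    log: dict of lsn -> write, recovered from the new leader's shared log.
    followers: objects exposing log(lsn, write) -> bool, as in earlier sketches.
    """
    quorum = (1 + len(followers)) // 2 + 1      # majority of the whole cohort

    # Re-propose every entry that is not known to be committed.
    for lsn in sorted(l for l in log if l > last_committed_lsn):
        acks = 1                                # the new leader holds the entry
        for f in followers:
            if f.log(lsn, log[lsn]):
                acks += 1
        if acks < quorum:
            return False                        # no quorum: stay unavailable
        last_committed_lsn = lsn

    return True                                 # cohort can accept new updates
```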

Slide 13: Guarantees
- Timeline consistency
- Available for reads and writes as long as 2 of the 3 nodes in a cohort are alive
- Writes cost 1 disk force and 2 message latencies
- Performance is close to that of an eventually consistent store (Cassandra)

Slide 14: Outline (Motivation and Background, Spinnaker, Existing Data Stores, Experiments, Summary)

Slide 15: BigTable (Google)
- A table is partitioned into "tablets" that are assigned to TabletServers
- Logs and SSTables are written to GFS; no update in place
- GFS manages replication
- Components shown in the figure: Master, Chubby, TabletServers (each with a Memtable), GFS (holding the logs and SSTables for each TabletServer)

Slide 16: Advantages vs. BigTable/HBase
- Logging to a DFS has drawbacks:
  – Forcing a page to disk may require a trip to the GFS master
  – Contention from multiple write requests on the DFS can cause poor performance
- DFS-level replication is less network efficient:
  – Both log records and SSTables are shipped
- DFS consistency does not allow trading it off for performance and availability:
  – No warm standby in case of failure; a large amount of state needs to be recovered
  – All reads and writes are at the same consistency level and must be handled by the owning TabletServer

Slide 17: Dynamo (Amazon)
- Always available, eventually consistent
- Does not use a DFS: database-level replication (BDB/MySQL) on local storage, with no single point of failure
- Nodes coordinate through a gossip protocol
- Anti-entropy measures: hinted handoff, read repair, Merkle trees

Slide 18: Advantages vs. Dynamo/Cassandra
- Spinnaker can support ACID operations
  – Dynamo requires conflict detection and resolution, and does not support transactions
- Timeline consistency is easier to reason about
- Almost the same performance

Slide 19: PNUTS (Yahoo)
- Data is partitioned and replicated in files/MySQL
- Notion of primary and secondary replicas
- Timeline consistency; support for multi-datacenter replication
- The primary writes to local storage and to the Yahoo! Message Broker (YMB); YMB delivers updates to the secondaries
- Components shown in the figure: Router, Tablet Controller, Yahoo! Message Broker, storage units (files/MySQL)

Slide 20: Advantages vs. PNUTS
- Spinnaker does not depend on a reliable messaging system
  – The Yahoo! Message Broker itself needs to solve replication, fault tolerance, and scaling
  – Hedwig, a new open-source project from Yahoo and others, could solve this
- More efficient replication
  – In PNUTS, messages must be sent over the network to the message broker and then resent from there to the secondary nodes

Slide 21: Spinnaker Downsides
- Research prototype
- Complexity
  – BigTable and PNUTS offload the complexity of replication to the DFS and YMB, respectively
  – Spinnaker's code is complicated by the replication protocol; Zookeeper helps
- Single datacenter only
- Failure models
  – Block/file corruptions: a DFS handles these better
  – Need to add checksums and additional recovery options

Slide 22: Outline (Motivation and Background, Spinnaker, Existing Data Stores, Experiments, Summary)

Slide 23: Unavailability Window on Failure: Spinnaker vs. HBase
- HBase recovery takes much longer; its unavailability window depends on the amount of data in the logs
- Spinnaker recovers quickly; its unavailability window depends only on the asynchronous commit period

Slide 24: Write Performance: Spinnaker vs. Cassandra
- Quorum writes used in Cassandra (R=2, W=2)
- For a similar level of consistency and availability, Spinnaker's write performance is similar (within 10% to 15%)

Slide 25: Write Performance with SSD Logs: Spinnaker vs. Cassandra

Slide 26: Read Performance: Spinnaker vs. Cassandra
- Quorum reads used in Cassandra (R=2, W=2)
- For a similar level of consistency and availability, Spinnaker's read performance is 1.5x to 3x better

Slide 27: Scaling Reads to 80 Nodes on Amazon EC2

Slide 28: Outline (Motivation and Background, Spinnaker, Existing Data Stores, Experiments, Summary)

Slide 29: Summary
- It is possible to build a scalable, consistent datastore with good availability and performance in a single datacenter, without relying on a DFS or a pub-sub system
- A consensus protocol can be used for replication with good performance: roughly 10% slower writes and faster reads compared to Cassandra
- Services like Zookeeper make implementing a system that uses many instances of consensus much simpler than previously possible

Slide 30: Related Work
- Database replication: sharding + 2PC; middleware-based replication (Postgres-R, Ganymed, etc.)
- Bill Bolosky et al., "Paxos Replicated State Machines as the Basis of a High-Performance Data Store", NSDI 2011
- John Ousterhout et al., "The Case for RAMCloud", CACM 2011
- Curino et al., "Relational Cloud: The Case for a Database Service", CIDR 2011
- SQL Azure, Microsoft

Slide 31: Backup Slides

Slide 32: Eventual Consistency Example
- Apps can see inconsistent data if they are not careful about the choice of R and W
  – An app might not see its own writes, or successive reads might see a row's state jump back and forth in time
  – Example from the slide's figure: starting from the initial state [x=0, y=0], updates x=1 and y=1 are applied to columns x and y on different nodes; one replica moves through [x=0, y=0] -> [x=1, y=0] -> [x=1, y=1] while another moves through [x=0, y=0] -> [x=0, y=1] -> [x=1, y=1], so reads can observe the inconsistent intermediate states before the replicas converge on the consistent state
- To ensure durability and strong consistency: use quorum reads and writes (N=3, R=2, W=2)
- For higher read performance and timeline consistency: stick to the same replicas within a session and use (N=3, R=1, W=1)
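A sketch of the quorum arithmetic behind the slide's R/W choices: with N=3 replicas, R + W > N forces every read quorum to overlap every write quorum, so quorum reads and writes (R=2, W=2) always include at least one replica holding the latest write, while R=1, W=1 can return stale data unless the client sticks to the same replica within a session. The function name is illustrative.

```python
from itertools import combinations

def read_can_miss_write(n, r, w):
    """True if some read quorum shares no replica with some write quorum."""
    replicas = range(n)
    return any(not set(rq) & set(wq)
               for rq in combinations(replicas, r)
               for wq in combinations(replicas, w))

print(read_can_miss_write(3, 2, 2))   # False: R + W > N, every read sees the write
print(read_can_miss_write(3, 1, 1))   # True:  a read may hit a stale replica
```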