Robustness in the Salus scalable block store. Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin.

Presentation transcript:

Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin University of Texas at Austin

Salus' goal
– Usage: provide remote disks to users (e.g., Amazon EBS)
– Scalability: thousands of machines
– Robustness: tolerate disk/memory corruption, CPU errors, …
– Good performance

Scalability and robustness
[Figure: operating-system-level checks vs. distributed-protocol-level checks]
BigTable: 1 corruption per 5 TB of data?

Challenge: parallelism vs. consistency
[Figure: clients exchange infrequent metadata with the metadata server and transfer data in parallel to the storage servers]
– Data is replicated for durability and availability
– State-of-the-art architecture: GFS/Bigtable, HDFS/HBase, WAS, …

Challenges
– Write in parallel and in order
– Eliminate single points of failure
  – Prevent a single node from corrupting data
  – Read safely from one node
– Do not increase replication cost

Write in parallel and in order
[Figure: clients, metadata server, data servers]

Write in parallel and in order
If write 2 is committed but write 1 is not, the result is not allowed for a block store.
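To make the invariant concrete, here is a minimal Python sketch (my illustration, not from the paper) of the block-store ordering rule: the writes that survive a failure must form a prefix of the issued sequence, so write 2 may never be durable while write 1 is lost.

```python
def committed_is_prefix(issued, committed):
    """Check the block-store ordering invariant.

    issued:    write IDs in the order the client issued them.
    committed: set of write IDs that survived (e.g., after a crash).

    The surviving writes must form a prefix of the issued sequence:
    once the first missing write is found, no later write may be
    committed either.
    """
    gap_seen = False
    for write_id in issued:
        if write_id in committed:
            if gap_seen:
                return False  # a later write survived while an earlier one was lost
        else:
            gap_seen = True
    return True


# Write 2 committed while write 1 is lost violates the invariant.
assert committed_is_prefix([1, 2], {1, 2})
assert committed_is_prefix([1, 2], {1})
assert not committed_is_prefix([1, 2], {2})
```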

Prevent a single node from corrupting data
[Figure: clients, metadata server, data servers]

Prevent a single node from corrupting data
– Tasks of computation nodes: data forwarding, garbage collection, etc.
– Examples of computation nodes: tablet server (Bigtable), region server (HBase), … (WAS)

Read safely from one node
Read is executed on one node:
– Maximize parallelism
– Minimize latency
But what if that node experiences corruption?

Do not increase replication cost
– Industrial systems: write to f+1 nodes and read from one node
– BFT systems: write to 3f+1 nodes and read from 2f+1 nodes

Salus' approach
– Start from a scalable architecture (Bigtable/HBase)
– Ensure the robustness techniques do not hurt scalability

Salus' key ideas
– Pipelined commit: guarantee ordering despite parallel writes
– Active storage: prevent a computation node from corrupting data
– End-to-end verification: read safely from one node

Salus' key ideas
[Figure: clients and metadata server, annotated with the three techniques: pipelined commit, active storage, end-to-end verification]

Pipelined commit
Goal: barrier semantics
– A request can be marked as a barrier.
– All previous requests must be executed before it.
Naïve solution: the client blocks at each barrier, which loses parallelism.
This is a weaker version of a distributed transaction; the well-known solution is two-phase commit (2PC).
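The following toy Python model (my own sketch; the class and field names are assumptions, not the Salus implementation) shows the intuition behind pipelined commit: batches prepare in parallel, but each batch leader commits only after the previous batch has committed, which preserves barrier ordering without ever blocking the client.

```python
from dataclasses import dataclass


@dataclass
class BatchLeader:
    """Leader for one batch of writes in a simplified pipelined commit."""
    batch_id: int
    writes: list
    prev: "BatchLeader | None" = None   # leader of the preceding batch
    prepared: bool = False
    committed: bool = False

    def prepare(self):
        # Phase 1: make the batch durable on its replicas (modeled as a
        # flag). Prepares of different batches run in parallel, so the
        # client never blocks at a barrier.
        self.prepared = True

    def try_commit(self):
        # Phase 2: commit only after the previous batch has committed,
        # so commits are applied strictly in batch order.
        if self.prepared and (self.prev is None or self.prev.committed):
            self.committed = True
        return self.committed


# Three batches prepare in parallel, then commit strictly in order.
b1 = BatchLeader(1, ["w1", "w2"])
b2 = BatchLeader(2, ["w3"], prev=b1)
b3 = BatchLeader(3, ["w4", "w5"], prev=b2)
for b in (b3, b2, b1):            # prepares may finish in any order
    b.prepare()
assert not b2.try_commit()        # cannot commit before batch 1 commits
b1.try_commit()
assert b2.try_commit() and b3.try_commit()
```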

Pipelined commit – 2PC
[Figure: prepare phase — client, servers, the previous leader, and the leaders of batch i and batch i+1, with prepared/committed states]

Pipelined commit – 2PC
[Figure: commit phase — after batch i-1 commits, the leaders commit batch i and then batch i+1]

Pipelined commit – challenge
Is 2PC slow?
– Additional network messages? Disk is the bottleneck.
– Additional disk write? Let's eliminate that.
– Challenge: after recovery, a write may be found prepared but not committed. Should it be committed? Both cases are possible.
Salus' solution: ask other nodes.
– (For batch 2:) Has anyone committed batch 3 or larger? If not, is batch 1 committed?
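Below is a loose Python interpretation of that recovery query (the function and the `Peer` helper are hypothetical, and this simplifies the actual protocol): to decide the fate of a prepared batch, the recovering node asks the surviving nodes for the highest batch they have committed.

```python
class Peer:
    """A surviving replica that can report its highest committed batch."""
    def __init__(self, max_committed):
        self._max = max_committed

    def max_committed_batch(self):
        return self._max


def should_commit_prepared_batch(batch_id, peers):
    """Recovery decision for a batch found prepared but not committed.

    Mirrors the slide's example for batch 2: "has anyone committed 3 or
    larger? if not, is 1 committed?"  If a later batch committed
    anywhere, this batch must have committed too (commits are ordered);
    otherwise commit it only if its predecessor is known committed.
    """
    highest = max(p.max_committed_batch() for p in peers)
    if highest > batch_id:
        return True
    return highest >= batch_id - 1


# One peer saw batch 3 commit, so prepared batch 2 must be committed.
assert should_commit_prepared_batch(2, [Peer(1), Peer(3)])
# No peer committed anything yet, so batch 2 is rolled back.
assert not should_commit_prepared_batch(2, [Peer(0), Peer(0)])
```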

Active storage
Goal: a single node cannot corrupt data.
Well-known solution: replication
– Problem: replication cost vs. availability
Salus' solution: use f+1 replicas
– Require unanimous consent of the whole quorum
– If one replica fails, replace the whole quorum

Active storage
[Figure: a computation node in front of storage nodes]

Active storage
[Figure: computation nodes in front of storage nodes]
– Unanimous consent: all updates must be agreed upon by the f+1 computation nodes.
– Additional benefit: reduces network bandwidth usage.
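A minimal sketch of the unanimous-consent rule (Python, with assumed names; not the HBase-based implementation): an update is applied only if all f+1 computation replicas report the same digest for it; any disagreement or missing vote blocks the write and triggers replacement of the whole quorum.

```python
def apply_update(update_digests, quorum_size):
    """Unanimous consent across f+1 computation replicas (sketch).

    update_digests maps replica_id -> the digest that replica computed
    for the update (None if it did not respond).  Apply only if every
    replica in the quorum responded and all digests match; otherwise we
    cannot tell which replica is faulty, so the whole quorum is replaced.
    """
    votes = list(update_digests.values())
    unanimous = (
        len(votes) == quorum_size
        and all(v is not None for v in votes)
        and len(set(votes)) == 1
    )
    return "apply" if unanimous else "replace_quorum"


# With f = 1 the quorum has 2 computation nodes; one divergent digest
# blocks the write and forces replacement of the whole quorum.
assert apply_update({"r1": "abc", "r2": "abc"}, quorum_size=2) == "apply"
assert apply_update({"r1": "abc", "r2": "xyz"}, quorum_size=2) == "replace_quorum"
```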

Active storage
[Figure: computation nodes in front of storage nodes]
What if one computation node fails?
– Problem: we may not know which one is faulty.
– Replace the whole quorum.
– The new quorum must agree on the states.

Active storage
Does it provide BFT with f+1 replication? No.
During recovery, it may accept stale states if:
– the client fails;
– at least one storage node provides stale states; and
– all other storage nodes are unavailable.
2f+1 replicas can eliminate this case. Is it worth adding f replicas to eliminate it?

End-to-end verification
Goal: read safely from one node
– The client should be able to verify the reply.
– If the reply is corrupted, the client retries on another node.
Well-known solution: Merkle tree
– Problem: scalability
Salus' solution:
– Single writer
– Distribute the tree among servers

End-to-end verification
[Figure: the Merkle tree is split across servers 1–4; the client maintains the top tree]
The client does not need to store anything persistently: it can rebuild the top tree from the servers.
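To illustrate the split, here is a simplified Python sketch (not the Salus data structures): each server keeps a Merkle root over its own blocks, the client's "top tree" is reduced here to the collection of those per-server roots, and every read is checked against the stored root, with the client retrying on another replica if verification fails.

```python
import hashlib


def digest(*parts):
    m = hashlib.sha256()
    for p in parts:
        m.update(p if isinstance(p, bytes) else p.encode())
    return m.hexdigest()


def subtree_root(blocks):
    """Server side: Merkle root over this server's blocks (leaves folded
    pairwise; a flat simplification of a real tree)."""
    level = [digest(b) for b in blocks]
    while len(level) > 1:
        level = [digest(level[i], level[i + 1] if i + 1 < len(level) else level[i])
                 for i in range(0, len(level), 2)]
    return level[0]


# Client side: the "top tree" here is just the per-server subtree roots.
servers = {
    "server1": [b"block0", b"block1"],
    "server2": [b"block2", b"block3"],
}
top_tree = {name: subtree_root(blocks) for name, blocks in servers.items()}


def verify_read(server_name, returned_blocks):
    """Recompute the subtree root from the reply and compare it with the
    trusted entry in the client's top tree; on mismatch the client would
    retry the read on another replica."""
    return subtree_root(returned_blocks) == top_tree[server_name]


assert verify_read("server1", [b"block0", b"block1"])
assert not verify_read("server1", [b"block0", b"corrupted"])
```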

Recovery
– Pipelined commit: how to ensure write order after recovery?
– Active storage: how to agree on the current states?
– End-to-end verification: how to rebuild the Merkle tree if the client recovers?

Discussion – why HBase?
It's a popular architecture:
– Bigtable: Google
– HBase: Facebook, Yahoo, …
– Windows Azure Storage: Microsoft
It's open source.
Why two layers? Necessary if the storage layer is append-only.
Why an append-only storage layer?
– Better random-write performance
– Easy to scale

Discussion – multiple writers?

Lessons
Strong checking makes debugging easier.

Evaluation

Challenge: combining robustness and scalability
– Scalable systems: GFS/Bigtable, HDFS/HBase, WAS, Spanner, FDS, …
– Strong protections: end-to-end checks, BFT, Depot, …
Combining them is challenging.