State Machine Replication: through transparent distributed protocols, or through a shared log

Presentation transcript:

State Machine Replication: through transparent distributed protocols, or through a shared log

Tango: Distributed Data Structures over a Shared Log Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran Michael Wei, John D. Davis, Sriram Rao, Tao Zou, Aviad Zuck Microsoft Research Presented by Faria Kalim

Motivation
Distributed data, but centralized metadata: the metadata usually lives in in-memory data structures that require transactional access.
Alternatives?
- Conventional databases: support transactions, but not scalability
- Coordination services (e.g., ZooKeeper): support only a limited API
- Customized protocols: implemented from scratch for each service

Problem Statement
How do we build a highly available metadata service that provides whatever data abstractions an application wants? Key insight: a shared log is a powerful and versatile abstraction.
Contribution: Tango, a system for building highly available metadata services. A Tango object is a class of in-memory data structures built over a durable, fault-tolerant shared log.

The Remote Shared Log
Clients append to the tail of the log and read from anywhere in it. The shared log API:
- O = append(V): append value V, returning its log offset O
- V = read(O): read the value at offset O
- trim(O): mark offset O as unused, allowing garbage collection
- O = check(): return the current tail offset of the log
The log imposes a total ordering on all appends, and it is fast and scalable.
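
A minimal sketch of this API as a Java interface (the names and types here are illustrative, not taken from any CORFU codebase):

    import java.nio.ByteBuffer;

    /** Illustrative shared-log API, mirroring the four calls above. */
    interface SharedLog {
        long append(ByteBuffer value);   // O = append(V): durably append, return offset
        ByteBuffer read(long offset);    // V = read(O): read the entry at an offset
        void trim(long offset);          // trim(O): mark an offset as garbage-collectable
        long check();                    // O = check(): return the current tail offset
    }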

CORFU: Clusters of Raw Flash Units
[Architecture diagram: each application links a CORFU client library; clients read and append log entries directly on the flash units, obtaining tail offsets from a sequencer.]

The Sequencer
The sequencer hands out the next tail offset of the log, so appending clients do not contend. It is an optimization: it is not required for safety or liveness, and it is fast.
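
Conceptually the sequencer is just a counter served over the network; a minimal sketch (illustrative, not the CORFU implementation):

    import java.util.concurrent.atomic.AtomicLong;

    /** Illustrative sequencer: hands out log offsets so appenders don't contend. */
    class Sequencer {
        private final AtomicLong tail = new AtomicLong(0);

        /** Reserve the next free log offset for an appending client. */
        long nextToken() {
            return tail.getAndIncrement();
        }
    }

Because the sequencer only reserves offsets and stores no data, losing or restarting it cannot lose writes; clients can instead probe the log for its current tail, which is why it is not needed for safety or liveness.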

Chain Replication in CORFU
- Resolves contention
- Provides consistency
[Diagram: a client writes each log position to replicas A and B in chain order.]
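
In CORFU the chain is client-driven: the appending client writes the entry to each replica in a fixed order, and the entry becomes readable only once the whole chain has it. A rough sketch under that assumption (the Replica type is hypothetical):

    import java.nio.ByteBuffer;
    import java.util.List;

    /** Hypothetical storage-node interface; each offset is write-once. */
    interface Replica {
        void write(long offset, ByteBuffer value);  // rejects a second write to an offset
        ByteBuffer read(long offset);
    }

    class ChainWriter {
        /** Client-driven chain write: visit the replicas in a fixed order. */
        void chainWrite(List<Replica> chain, long offset, ByteBuffer value) {
            for (Replica r : chain) {
                r.write(offset, value);  // on failure, a repair protocol takes over
            }
            // Readers consult the tail of the chain, so an entry is visible
            // only after every replica has acknowledged it.
        }
    }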

Tango Architecture
Applications call into a local Tango runtime, which reads from and appends to the shared log.
A Tango object = a view (an in-memory data structure) + a history (its ordered updates in the shared log).
Properties: persistence, elasticity, availability, atomicity, isolation.

Tango Objects
- Easy to use
- Easy to build
- Scalable and fast (via CORFU)

Tango Objects: Easy to Use
Linearizability for single operations: each operation by a client is visible to all other clients as soon as it completes.

    currowner = ownermap.get("ledger");
    if (….)
        ledger.add(item);

Tango Objects: Easy to Use
Serializable transactions: the execution of a set of operations over multiple items is equivalent to some serial execution (a total ordering) of the transactions, even while other applications are making updates.

    TR.BeginTX();
    currowner = ownermap.get("ledger");
    if (….)
        ledger.add(item);
    status = TR.EndTX();
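
In the paper's design, EndTX appends a speculative commit record carrying the transaction's read set; every client then replays the log and applies the same deterministic conflict check, committing iff nothing the transaction read was modified between the read and the commit record. A simplified sketch of that check (types and versioning are illustrative):

    import java.util.List;
    import java.util.Map;

    /** Illustrative log update: its position in the log and the object it modifies. */
    record Update(long position, long objectId) {}

    class CommitCheck {
        /**
         * Decide a commit record at log position commitPos.
         * readSet maps object ID -> log position at which the tx read that object.
         */
        static boolean commits(Map<Long, Long> readSet, long commitPos,
                               List<Update> updates) {
            for (Update u : updates) {
                Long readPos = readSet.get(u.objectId());
                if (readPos != null && u.position() > readPos
                        && u.position() < commitPos) {
                    return false;  // conflicting update: someone changed what we read
                }
            }
            return true;  // deterministic, so every client reaches the same decision
        }
    }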

Tango Objects: Easy to Build
- API between the runtime and the object: an upcall to apply log entries, plus query and update helpers
- API between the object and the application: mutators and accessors
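
The paper illustrates this pattern with a simple register object; below is a rough Java transliteration of that pattern. The TangoRuntime interface and method names are illustrative, not an actual API:

    import java.nio.ByteBuffer;

    /** Illustrative runtime-to-object interface. */
    interface TangoRuntime {
        void updateHelper(long oid, byte[] entry);  // append an update for object oid
        void queryHelper(long oid);                 // replay new log entries for oid
    }

    /** A minimal Tango object: a single replicated integer register. */
    class TangoRegister {
        private final long oid;              // identifies this object's updates in the log
        private final TangoRuntime runtime;
        private int state;                   // the in-memory view

        TangoRegister(long oid, TangoRuntime runtime) {
            this.oid = oid;
            this.runtime = runtime;
        }

        /** Upcall: the runtime replays each log entry through this method. */
        void apply(byte[] entry) {
            state = ByteBuffer.wrap(entry).getInt();
        }

        /** Mutator: never touches state directly; it only appends to the log. */
        void writeRegister(int newState) {
            runtime.updateHelper(oid, ByteBuffer.allocate(4).putInt(newState).array());
        }

        /** Accessor: syncs the view up to the log tail, then reads locally. */
        int readRegister() {
            runtime.queryHelper(oid);
            return state;
        }
    }

Note the division of labor: the object mutates its view only inside apply(), so the view is always a deterministic function of the log prefix it has replayed, which is what yields persistence and consistency essentially for free.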

The Stream Abstraction
A stream is a readable, appendable subsequence of the shared log: each Tango object replays only its own stream, while a single append can belong to multiple streams (e.g., a transaction commit that touches several objects).

Streams Stored with Backpointers
Each log entry carries backpointers to the previous entries of the streams it belongs to, so a client can traverse one stream without scanning the entire log.
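
A minimal sketch of reading one stream by following backpointers, reusing the SharedLog interface sketched earlier (the entry layout and helper names are assumptions, not the paper's wire format):

    import java.nio.ByteBuffer;
    import java.util.*;
    import java.util.function.Function;

    /** Illustrative entry: payload plus one backpointer per stream it belongs to. */
    record StreamEntry(byte[] payload, Map<Integer, Long> backpointers) {
        long backpointerFor(int streamId) {
            return backpointers.getOrDefault(streamId, -1L);  // -1: start of stream
        }
    }

    class StreamReader {
        /** Walk a stream backwards from its last entry, then return it oldest-first. */
        static List<StreamEntry> readStream(SharedLog log, int streamId, long lastOffset,
                                            Function<ByteBuffer, StreamEntry> decode) {
            List<StreamEntry> entries = new ArrayList<>();
            long offset = lastOffset;
            while (offset >= 0) {
                StreamEntry e = decode.apply(log.read(offset));
                entries.add(e);
                offset = e.backpointerFor(streamId);  // jump over unrelated entries
            }
            Collections.reverse(entries);             // replay order: oldest first
            return entries;
        }
    }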

Evaluation: Single-Object Linearizability

Evaluation: Transactions on a Fully Replicated TangoMap

Evaluation: Scalability

Takeaways
Pros:
- A durable, iterable total order (i.e., a shared log) is a unifying abstraction for distributed systems, subsuming the roles of many distributed protocols.
- It is possible to impose a total order at speeds exceeding the I/O capacity of any single machine.
- A total order is useful even when individual nodes consume only a subsequence of it.
Cons:
- The evaluation never runs without the sequencer: how much would performance drop?
- How affordable is the SSD cluster?

Backup Slides

Conclusion
- Tango allows users to easily build highly available, persistent, and strongly consistent metadata services.
- It provides data structures backed by a shared log; these are easy to use and easy to build.
- The shared log provides consistency, persistence, elasticity, atomicity, and isolation.

Evaluation Setup
- 36 8-core machines in 2 racks; 20 Gbps between the top-of-rack switches, 1 Gbps per node.
- Half the nodes (evenly divided across the racks) are equipped with two Intel X25-V SSDs each, forming an 18-node CORFU deployment.
- The CORFU sequencer runs on a powerful, 32-core machine in a separate rack.
- The other 18 nodes serve as clients, running the applications and benchmarks that operate on Tango objects.

Tango Objects: Scalable and Fast
- CORFU is a decentralized shared log.
- Reads scale linearly with the number of flash drives.
- Appends reach 600K/s, limited by the sequencer's speed.

Use Cases
- Replicate state
- Index state

Other Use Cases
- Partitioning state
- Sharing state

Code