Orbe: Scalable Causal Consistency Using Dependency Matrices & Physical Clocks
Jiaqing Du, EPFL
Sameh Elnikety, Microsoft Research
Amitabha Roy, EPFL
Willy Zwaenepoel, EPFL

Key-Value Data Store API
Read operation – value = get(key)
Write operation – put(key, value)
Read transaction – (value1, value2, …) = mget(key1, key2, …)
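As a rough illustration of this API, here is a minimal in-memory stand-in; the class and key names are made up for the example and this is not Orbe's actual client library.

```python
# Minimal in-memory stand-in for the key-value API above; names are
# illustrative only. In Orbe the data is partitioned and replicated, and
# mget reads all keys from one causally consistent snapshot.
class KVStore:
    def __init__(self):
        self._data = {}

    def get(self, key):
        # Read operation: value = get(key)
        return self._data.get(key)

    def put(self, key, value):
        # Write operation: put(key, value)
        self._data[key] = value

    def mget(self, *keys):
        # Read transaction: returns one value per key.
        return [self._data.get(k) for k in keys]

store = KVStore()
store.put("photo:alice", "beach.jpg")
store.put("comment:alice", "Great weather!")
photo, comment = store.mget("photo:alice", "comment:alice")
```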

Partitioning
Divide the data set into several partitions. A server manages each partition.
(Figure: partitions 1 through N, each managed by a server)

Inside a Data Center
Data set is partitioned.
(Figure: an application tier of clients on top of a data tier holding partitions 1 through N)

Geo-Replication
Data close to end users
Tolerates disasters
(Figure: the data store replicated across data centers A through F)

Scalable Causal Consistency in Orbe
Partitioned and replicated data store
Parallel asynchronous update propagation
Efficient implementation of causal consistency
(Figure: replicas A and B, each divided into partitions 1 through N)

Consistency Models
Strong consistency
– Total order on propagated updates
– High update latency, no partition tolerance
Causal consistency
– Propagated updates are partially ordered
– Low update latency, partition tolerance
Eventual consistency
– No order among propagated updates
– Low update latency, partition tolerance

Causal Consistency (1/3)
If A depends on B, then A appears after B.
(Figure: Alice uploads a photo and then the comment "Great weather!"; both updates are propagated to the other replica.)

Causal Consistency (2/3)
If A depends on B, then A appears after B.
(Figure: Alice uploads a photo; Bob adds the comment "Nice photo!"; both updates are propagated.)

Causal Consistency (3/3)
Partitioned and replicated data stores
(Figure: a client performs Read(A), Read(B), and then Write(C, A+B) at replica A; C is propagated to replica B.)
How to guarantee that A and B appear first?

Existing Solutions
Version vectors
– Only work for purely replicated systems
COPS [Lloyd’11]
– Explicit dependency tracking at the client side
– Overhead is high under many workloads
Our work
– Extends version vectors to dependency matrices
– Employs physical clocks for read-only transactions
– Keeps dependency metadata small and bounded

Outline
DM protocol
DM-Clock protocol
Evaluation
Conclusions

Dependency Matrix (DM)
Represents dependencies of a state or a client session
One integer per server
An integer represents all dependencies from a partition
(Figure: a DM over the partitions of replicas A and B, e.g. depending on the first 9 updates of one partition and the first 5 updates of another)
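A small sketch of how such a matrix might be represented, assuming N partitions and M replicas; the indexing convention and the example entries are assumptions for illustration, not taken from the paper.

```python
# Sketch of a dependency matrix for N partitions and M replicas.
# dm[p][r] = k means: depends on the first k updates that originated at
# partition p of replica r. The entries below are illustrative.
N_PARTITIONS = 3
N_REPLICAS = 2
REPLICA_A, REPLICA_B = 0, 1

def empty_dm():
    return [[0] * N_REPLICAS for _ in range(N_PARTITIONS)]

dm = empty_dm()
dm[0][REPLICA_A] = 9   # depends on the first 9 updates of partition 1 at replica A
dm[1][REPLICA_B] = 5   # depends on the first 5 updates of partition 2 at replica B
```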

DM Protocol: Data Structures
Client side
– Dependency matrix (DM) of the client session
Server side (e.g. partition 1 of replica A)
– Version vector (VV) of the partition
– Per item: source replica id (RID), update timestamp (UT), and the DM attached when the item was written
(Figure: the client's DM, the partition's VV, and two stored items, item A with rid = A, ut = 2 and item B with rid = B, ut = 5, each carrying its dm)
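A sketch of these data structures in code, under the assumption that replicas and partitions are identified by small integer indices; the field names follow the slide, but the concrete layout is not from the paper.

```python
# Sketch of the DM protocol's per-client and per-partition state.
# Field names follow the slide; the concrete layout is an assumption.
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Item:
    value: Any
    rid: int              # source replica id of this version
    ut: int               # update timestamp assigned by the source partition
    dm: List[List[int]]   # dependency matrix attached when the item was written

@dataclass
class PartitionState:
    vv: List[int]         # version vector: one entry per replica of this partition
    store: Dict[str, Item] = field(default_factory=dict)

@dataclass
class ClientSession:
    dm: List[List[int]]   # dependency matrix of the client session
```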

DM Protocol: Read and Write
Read item
– Client → server
– Includes the read item in the client DM
Write item
– Client → server
– Associates the client DM with the updated item
– Resets the client DM (transitivity of causality)
– Includes the updated item in the client DM
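A rough sketch of the client-side bookkeeping these rules imply; the indexing convention and helper names are assumptions, not the paper's pseudocode.

```python
# Sketch of the client-side DM updates on read and write.
# client_dm[p][r] covers the first k updates of partition p at replica r.

def after_read(client_dm, partition_id, item):
    # Include the read item in the client DM: the session now depends on
    # the first item.ut updates of partition_id at the item's source replica.
    row = client_dm[partition_id]
    row[item.rid] = max(row[item.rid], item.ut)
    return item.value

def after_write(client_dm, partition_id, replica_id, new_ut):
    # The server stored the update together with the client DM, so by
    # transitivity the new version already stands for those dependencies.
    # Reset the DM and keep only the update that was just written.
    for row in client_dm:
        for r in range(len(row)):
            row[r] = 0
    client_dm[partition_id][replica_id] = new_ut
```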

Example: Read and Write
(Figure: a client at replica A reads a photo from partition 1, whose VV is [7, 0], and receives (v, rid = A, ut = 4). It then writes a comment, together with its DM, to partition 2, which assigns ut = 1 and advances its VV from [0, 0] to [1, 0]. Partition 3 remains at VV = [0, 0].)

DM Protocol: Update Propagation
Propagate an update
– Server → server
– Asynchronous propagation
– Compares the update's DM with the VVs of local partitions
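A sketch of the dependency check the receiving partition might perform; get_vv(p) stands for obtaining the version vector of local partition p (for other partitions this corresponds to a dependency check message) and is an assumed helper, not an Orbe API.

```python
# Sketch of the dependency check before applying a propagated update.
# update_dm is the DM attached to the update; get_vv(p) returns the version
# vector of local partition p.

def can_apply(update_dm, get_vv):
    for p, row in enumerate(update_dm):
        vv = get_vv(p)
        if any(vv[r] < row[r] for r in range(len(row))):
            return False   # some dependency has not been installed locally yet
    return True            # safe to apply and advance this partition's VV
```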

Example: Update Propagation
(Figure: partition 2 of replica A sends replicate(comment, ut = 1, dm) to partition 2 of replica B. The dm records, among other entries, a dependency on the first 4 updates of partition 1 at replica A, so partition 2 of replica B first issues a dependency check against local partition 1's VV before applying the comment.)

Complete and Nearest Dependencies
Transitivity of causality
– If B depends on A and C depends on B, then C depends on A.
Tracking nearest dependencies
– Reduces dependency metadata size
– Does not affect correctness
(Figure: A: write Photo, B: write Comment 1, C: write Comment 2. The complete dependencies of C include both A and B; its nearest dependency is just B.)

DM Protocol: Benefits
Keeps dependency metadata small and bounded
– Only tracks nearest dependencies, by resetting the client DM after each update
– Number of elements in a DM is fixed
– Utilizes sparse matrix encoding
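One plausible sparse encoding, shown only as a sketch since the slides do not give Orbe's wire format: transmit the mostly-zero DM as (partition, replica, value) triples.

```python
# Sketch of a sparse encoding for a mostly-zero dependency matrix:
# transmit only the non-zero entries as (partition, replica, value) triples.

def encode_sparse(dm):
    return [(p, r, v)
            for p, row in enumerate(dm)
            for r, v in enumerate(row)
            if v != 0]

def decode_sparse(entries, n_partitions, n_replicas):
    dm = [[0] * n_replicas for _ in range(n_partitions)]
    for p, r, v in entries:
        dm[p][r] = v
    return dm
```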

Outline
DM protocol
DM-Clock protocol
Evaluation
Conclusions

Read Transaction on Causal Snapshot
(Figure: at replica A, Bob sets his album to "Public", uploads a photo, changes the album to "Only close friends!", and uploads another photo. At replica B, Mom reads the album and then the photos in two separate reads, which motivates read-only transactions that observe a single causal snapshot.)

DM-Clock Protocol (1/2)
Provides causally consistent read-only transactions
Requires loosely synchronized clocks (NTP)
Data structures: timestamps from physical clocks
(Figure: in addition to the DM-protocol state, each item carries a physical update timestamp, e.g. item A with rid = A, ut = 2, put = 27, and the client additionally keeps a timestamp PDT.)

DM-Clock Protocol (2/2)
Still tracks nearest dependencies
Read-only transaction
– Obtains a snapshot timestamp from the local physical clock
– Reads the latest versions created "before" the snapshot time
(Figure: the snapshot timestamp defines a cut of the causal relationship graph over versions A0, B0, B2, C0, C1, D3, E0.)
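A simplified sketch of such a snapshot read, using an item's physical update timestamp (put) to pick versions; versions_of is an assumed helper, and the actual protocol's clock-skew handling and blocking details are omitted.

```python
# Simplified sketch of a causally consistent read-only transaction using a
# physical-clock snapshot. versions_of(key) is an assumed helper returning
# (put, value) pairs, where put is the physical update timestamp assigned
# when the version was written.
import time

def read_only_txn(keys, versions_of):
    snapshot = time.time()   # snapshot timestamp from the local physical clock
    result = {}
    for key in keys:
        visible = [(put, value) for put, value in versions_of(key) if put <= snapshot]
        # Latest version created "before" the snapshot time, if any.
        result[key] = max(visible, key=lambda pv: pv[0])[1] if visible else None
    return result
```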

Outline
DM protocol
DM-Clock protocol
Evaluation
Conclusions

Evaluation
Orbe
– A partitioned and replicated key-value store
– Implements the DM and DM-Clock protocols
Experiment setup
– A local cluster of 16 servers
– 120 ms update latency

Evaluation Questions
1. Does Orbe scale out?
2. How does Orbe compare to eventual consistency?
3. How does Orbe compare to COPS?

Throughput over Number of Partitions
Workload: each client accesses two partitions.
Result: Orbe scales out as the number of partitions increases.

Throughput over Varied Workloads
Workload: each client accesses three partitions.
Result: Orbe incurs relatively small overhead for tracking dependencies under many workloads.

Orbe Metadata Percentage
Result: dependency metadata is relatively small and bounded.

Orbe Dependency Check Messages
Result: the number of dependency check messages is relatively small and bounded.

Orbe & COPS: Throughput over Client Inter-Operation Delays
Workload: each client accesses three partitions.

Orbe & COPS: Number of Dependencies per Update
Result: Orbe tracks only the nearest dependencies, even while supporting read-only transactions.

In the Paper
Protocols
– Conflict detection
– Causal snapshot for read transactions
– Garbage collection
Fault tolerance and recovery
Dependency cleaning optimization
More experimental results
– Micro-benchmarks and latency distribution
– Benefits of dependency cleaning

Conclusions
Orbe provides scalable causal consistency
– Partitioned and replicated data store
DM protocol
– Dependency matrices
DM-Clock protocol
– Dependency matrices + physical clocks
– Causally consistent read-only transactions
Performance
– Scales out, low overhead, compared against eventual consistency and COPS