Calvin: Fast Distributed Transactions for Partitioned Database Systems


Calvin: Fast Distributed Transactions for Partitioned Database Systems. Based on the SIGMOD'12 paper by Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. Presented by K. V. Mahesh and Abhishek Gupta, under the guidance of Prof. S. Sudarshan.

Outline
- Motivation
- Deterministic Database Systems
- Calvin: System Architecture (Sequencer, Scheduler)
- Calvin with Disk-Based Storage
- Checkpointing
- Performance Evaluation
- Conclusion

Motivation
- Distributed storage systems achieve high data access throughput through partitioning and replication
- Examples: BigTable, PNUTS, Dynamo, MongoDB, Megastore
- What about consistency? What about scalability?
- It does not come for free: something has to be sacrificed
- Three major types of tradeoffs

Tradeoffs for scalability
1) Sacrifice ACID for scalability
- Drops ACID guarantees; avoids impediments like two-phase commit and 2PL
- Examples: BigTable, PNUTS
2) Reduce transaction flexibility for scalability
- Transactions are completely isolated to a single "partition"; transactions spanning multiple partitions are either not supported or require agreement protocols
- Example: VoltDB
3) Trade cost for scalability
- Use high-end hardware; achieves high throughput with traditional techniques, but lacks shared-nothing horizontal scalability
- Example: Oracle's top TPC-C results

Distributed Transactions in a Traditional Distributed Database
- An agreement protocol ensures atomicity and durability (example: two-phase commit)
- Locks are held until the end of the agreement protocol to ensure isolation
- Problems: long transaction duration, multiple round-trip messages, agreement-protocol overhead that can exceed the actual transaction time, and distributed deadlock

Can we avoid this agreement protocol? In a nutshell, distributed transactions are costly because of the agreement protocol. Can we avoid it? Answer: yes, with deterministic databases.

Deterministic Database Approach
- Provides a transaction scheduling layer
- A sequencer decides the global execution order of transactions before they actually execute
- All replicas execute transactions in this same order
- All "hard" work is done before acquiring locks and beginning transaction execution

Deterministic Database System (Alexander Thomson et al., 2010)
- What events may cause a distributed transaction to fail?
- Nondeterministic: node failure, rollback due to deadlock
- Deterministic: logical errors

Deterministic Database System (2)
- If a non-deterministic failure occurs (a node crashes in one replica) while another replica is executing the same transaction in parallel:
- Run the transaction using the live replica and commit it
- The failed node is recovered later
(Diagram: Replica 1 and Replica 2 each hold nodes A-D; a node in Replica 1 crashes while transaction T1 executes on both replicas)

Deterministic Database System (3)
- But we need to ensure that "every replica goes through the same sequence of database states"
- To guarantee this across all replicas: use synchronous replication of transaction inputs, and change the concurrency-control scheme so that transactions execute in exactly the same order on every replica
- Note that this method would not work in a traditional (nondeterministic) database
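A minimal sketch (hypothetical code, not from the paper) of why agreeing on the transaction input order is enough: replicas that apply the same deterministic transactions in the same order pass through the same sequence of states.

# Replicas that apply the same deterministic transaction log in the same
# order end in the same final state.
def apply_log(initial_state, transaction_log):
    state = dict(initial_state)
    for txn in transaction_log:
        txn(state)                  # each txn is a deterministic function of the state
    return state

def transfer(src, dst, amount):
    def txn(state):
        state[src] -= amount
        state[dst] += amount
    return txn

log = [transfer("A", "B", 10), transfer("B", "C", 5)]
replica1 = apply_log({"A": 100, "B": 0, "C": 0}, log)
replica2 = apply_log({"A": 100, "B": 0, "C": 0}, log)
assert replica1 == replica2         # same inputs, same order => same states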

Deterministic Database System (4)
- What about deterministic failures?
- Each node waits for a single one-way message from every other node that could deterministically cause the transaction to abort
- It commits once it has received all of these messages, "so there is no need for an agreement protocol"
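A tiny illustration (hypothetical, not the paper's code) of committing without an agreement protocol: a node commits once every peer that could deterministically abort has sent its single one-way OK message, with no prepare/vote round trips.

# Commit as soon as all expected one-way "I will not deterministically abort"
# messages have arrived; there is no voting phase and no second round trip.
def can_commit(expected_peers, received_ok_messages):
    return set(expected_peers) <= set(received_ok_messages)

print(can_commit({"node2", "node3"}, {"node2"}))            # False: still waiting
print(can_commit({"node2", "node3"}, {"node2", "node3"}))   # True: commit locally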

Calvin: System Architecture
- A scalable transactional layer above any storage system that provides a CRUD interface (create/insert, read, update, delete)
- Sequencing layer: batches transaction inputs into a global order that all replicas follow; handles replication and logging
- Scheduling layer: handles concurrency control; has a pool of transaction execution threads
- Storage layer: handles physical data layout; transactions access data through the CRUD interface
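A rough sketch (all names hypothetical) of the kind of CRUD interface Calvin's transactional layers assume from the underlying storage layer; any key-value store exposing these operations would do.

# Hypothetical in-memory storage layer exposing a CRUD interface.
class CrudStorage:
    def __init__(self):
        self._data = {}

    def create(self, key, value):       # insert
        self._data[key] = value

    def read(self, key):
        return self._data.get(key)

    def update(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

store = CrudStorage()
store.create("order:1", {"item": "book", "qty": 2})
store.update("order:1", {"item": "book", "qty": 3})
print(store.read("order:1"))
store.delete("order:1")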

Architecture

Sequencer
- Distributed across all nodes: no single point of failure, high scalability
- Uses a 10 ms batch epoch: batches the transaction inputs, determines their execution sequence, and dispatches them to the schedulers
- Transactional inputs are replicated, either asynchronously or Paxos-based
- Sends every scheduler: the sequencer's node id, the epoch number, and the transactional inputs collected
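A simplified single-process sketch (hypothetical; replication and dispatch are stubbed) of the sequencer loop: collect requests for a 10 ms epoch, replicate the batch, then send the schedulers the sequencer's node id, the epoch number, and the collected inputs.

import time
from collections import deque

EPOCH_MS = 10

def run_sequencer(node_id, incoming, replicate, dispatch, epochs=3):
    epoch = 0
    for _ in range(epochs):
        deadline = time.time() + EPOCH_MS / 1000.0
        batch = []
        while time.time() < deadline:           # collect requests for one epoch
            if incoming:
                batch.append(incoming.popleft())
        replicate(node_id, epoch, batch)        # async or Paxos-based in Calvin
        dispatch({"sequencer": node_id, "epoch": epoch, "txns": batch})
        epoch += 1

requests = deque(["T1", "T2", "T3"])
run_sequencer(
    node_id="seq-1",
    incoming=requests,
    replicate=lambda nid, ep, batch: None,      # stub for replication-group traffic
    dispatch=lambda msg: print("to schedulers:", msg),
)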

Asynchronous Replication of Transaction Inputs
- Replication group: all replicas of a particular partition
- All requests are forwarded to the master replica; the master's sequencer forwards each batch to the slave replicas in its replication group
- Extremely low latency before a transaction is executed, but a high cost to handle failures
(Diagram from the presentation by Xinpan: a master sequencer forwarding an epoch of transactions T1-T4 to the other sequencers in its replication group)

Paxos-based Replication of Transaction Inputs
(Diagram from the presentation by Xinpan: sequencers in a replication group agree on the epoch's transaction batch T1-T4 via Paxos)

Paxos-based Replication of Transaction Inputs (2)
(Diagram from the presentation by Xinpan: after agreement, every sequencer in the replication group holds the same epoch batch T1-T4)

Sequencer Architecture
(Diagram from the presentation by Xinpan, showing the sequencer architecture across Partition 1, Partition 2, and Partition 3)

Scheduler
- Transactions are executed concurrently by a pool of execution threads
- Orchestrates transaction execution using a deterministic locking scheme

Deterministic Locking Protocol
- The lock manager is distributed across the scheduling layer; each node's scheduler locks only the co-located data items
- Resembles strict two-phase locking, but with additional invariants:
- Every transaction must declare all of its lock requests before transaction execution starts
- Locks are granted in the transactions' global order: if transactions A and B both need an exclusive lock on the same data item and A precedes B in the global order, then A must make its lock request before B

Deterministic Locking Protocol (2)
- Implemented by serializing lock requests in a single thread
- The lock manager must grant locks in the order they were requested
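A minimal sketch (hypothetical) of this idea: a single thread enqueues lock requests in the global transaction order, and each lock is granted strictly in request order, so lock acquisition can never contradict the global order.

from collections import defaultdict, deque

class DeterministicLockManager:
    def __init__(self):
        self.queues = defaultdict(deque)        # data item -> FIFO of waiting txns

    def request_all(self, txn_id, keys):
        # A transaction declares its full lock set up front, before executing.
        for key in keys:
            self.queues[key].append(txn_id)

    def holds_all(self, txn_id, keys):
        # A txn may run only once it is at the head of every queue it needs.
        return all(self.queues[key][0] == txn_id for key in keys)

    def release_all(self, txn_id, keys):
        for key in keys:
            self.queues[key].popleft()

lm = DeterministicLockManager()
lm.request_all("A", ["x"])                      # A precedes B in the global order,
lm.request_all("B", ["x"])                      # so A's request is enqueued first
print(lm.holds_all("A", ["x"]), lm.holds_all("B", ["x"]))   # True False
lm.release_all("A", ["x"])
print(lm.holds_all("B", ["x"]))                             # True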

Transaction Execution Phases
1) Analyze all read/write sets, identifying passive participants (read-only) and active participants (those that perform writes)
2) Perform local reads
3) Serve remote reads: send data needed by remote participants
4) Collect remote read results: receive data from remote participants
5) Execute the transaction logic and apply local writes

Example: T1: A = A + B; C = C + B, with A on partition P1, B on P2, and C on P3
- Phase 1: P1 and P3 are active participants (they write A and C); P2 is a passive participant (it only supplies B)
- Phase 2: each partition reads its local data items
- Phase 3: P1 sends A to P3, P2 sends B to both P1 and P3, and P3 sends C to P1
- Phase 4: the active participants collect the remote data items
- Phase 5: P1 and P3 execute the transaction logic and perform only their local writes
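The same example as a rough single-process sketch (hypothetical; inter-partition messaging is simulated with plain dictionaries):

# T1: A = A + B; C = C + B, with A on P1, B on P2, C on P3.
partitions = {"P1": {"A": 10}, "P2": {"B": 5}, "P3": {"C": 1}}
read_set, write_set = {"A": "P1", "B": "P2", "C": "P3"}, {"A": "P1", "C": "P3"}

# Phase 1: analyze read/write sets; writers are active, pure readers are passive.
active = set(write_set.values())                            # {"P1", "P3"}
passive = set(read_set.values()) - active                   # {"P2"}

# Phase 2: every participant performs its local reads.
local_reads = {p: dict(partitions[p]) for p in active | passive}

# Phases 3-4: participants send their local reads; active ones collect them all.
collected = {p: {k: v for reads in local_reads.values() for k, v in reads.items()}
             for p in active}

# Phase 5: each active participant runs the full logic but applies only local writes.
for p in active:
    vals = collected[p]
    result = {"A": vals["A"] + vals["B"], "C": vals["C"] + vals["B"]}
    for key, val in result.items():
        if write_set[key] == p:
            partitions[p][key] = val

print(partitions)   # {'P1': {'A': 15}, 'P2': {'B': 5}, 'P3': {'C': 6}}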

Dependent Transactions
- Example: X <- Read(emp_tbl where salary > 1000); Update(X)
- Transactions that need to perform reads to determine their complete read/write sets
- Optimistic Lock Location Prediction (OLLP): can be implemented by modifying the client transaction code
- Execute a "reconnaissance query" that performs the necessary reads to discover the full read/write set; the actual transaction is added to the global sequence with this information
- Problem: the records read may have changed in the meantime
- Solution: the process is restarted, deterministically, across all nodes
- For most applications the read/write set does not change frequently
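A simplified sketch (hypothetical code) of Optimistic Lock Location Prediction: a reconnaissance pass guesses the read/write set, the transaction is sequenced with that guess, and execution deterministically restarts if the guess turns out stale.

def reconnaissance(db):
    # Cheap read-only pass to discover which records the txn will touch.
    return {emp for emp, rec in db.items() if rec["salary"] > 1000}

def run_dependent_txn(db, update):
    while True:
        predicted = reconnaissance(db)     # read/write set handed to the sequencer
        # ... the actual transaction is added to the global sequence with `predicted` ...
        actual = {emp for emp, rec in db.items() if rec["salary"] > 1000}
        if actual != predicted:            # records changed in between:
            continue                       # deterministically restart on all nodes
        for emp in actual:
            update(db[emp])
        return actual

db = {"e1": {"salary": 1500}, "e2": {"salary": 900}}
touched = run_dependent_txn(db, lambda rec: rec.update(bonus=100))
print(touched, db)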

Calvin with Disk-Based Storage
- Deterministic execution works well for memory-resident databases
- Traditional databases only guarantee equivalence to some serial order, whereas a deterministic database must respect the single order chosen
- If a transaction accesses data items on disk: high contention footprint (locks are held for a longer duration), low throughput

Calvin with Disk-Based Storage (2)
When the sequencer receives a transaction that may cause a disk stall:
- Approach 1: use a "reconnaissance query"
- Approach 2: send a prefetch (warm-up) request to the relevant storage components and add an artificial delay equal to the estimated I/O latency, so the transaction finds all its data items in memory
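A rough sketch (hypothetical; the latency estimate is assumed) of Approach 2: the sequencer issues a prefetch for any cold keys and delays the transaction by the estimated I/O latency so that execution finds everything in memory.

import time

ESTIMATED_IO_LATENCY_S = 0.005             # assumed estimate; getting this right is the hard part

def sequence_with_prefetch(txn, in_memory_keys, prefetch, schedule):
    cold_keys = [k for k in txn["read_set"] if k not in in_memory_keys]
    if cold_keys:
        prefetch(cold_keys)                # warm up the relevant storage components
        time.sleep(ESTIMATED_IO_LATENCY_S) # artificial delay before sequencing
    schedule(txn)

sequence_with_prefetch(
    {"id": "T1", "read_set": ["x", "y"]},
    in_memory_keys={"x"},
    prefetch=lambda keys: print("prefetch", keys),
    schedule=lambda txn: print("sequenced", txn["id"]),
)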

Checkpointing
- Fault tolerance is easy to ensure: active replication allows instant failover to another replica
- Only the transactional input is logged, avoiding physical REDO logging; replaying the transactional input is sufficient to recover
- Transaction-consistent checkpointing is needed so that the transactional input can be replayed from a consistent state

Checkpointing Modes
Three modes are supported:
- Naïve synchronous checkpointing
- Zig-Zag algorithm
- Asynchronous snapshot mode (the storage layer must support multiversioning)

Naïve Synchronous Mode
Process (done periodically): 1) stop one replica, 2) checkpoint it, 3) replay the delayed transactions
- The replica's unavailability period is not seen by the client
- Problem: the replica may fall behind the other replicas, which is problematic if it is called into action due to a failure at another replica; significant time is needed to catch back up

Zig-Zag Algorithm
- A variant of Zig-Zag is used in Calvin
- Stores two copies of each record along with two additional bits per record
- Captures a snapshot with respect to a virtual point of consistency (a pre-specified point in the global serial order)

Modified Zig-Zag Algorithm
- Transactions preceding the virtual point use the "before" version; transactions appearing after the virtual point use the "after" version
- Once all transactions preceding the point have finished executing, the "before" versions are immutable
- An asynchronous checkpointing thread then checkpoints the "before" versions, after which they are discarded
- Incurs only moderate overhead

Modified Zig-Zag Algorithm (2)
(Diagram: transactions ordered before the checkpoint CP write the current/"before" versions; transactions following CP write the later/"after" versions; the "before" versions are discarded after checkpointing is complete)

Modified Zig-Zag Algorithm (3)
- Checkpointing requires no quiescing of the database
- The reduction in throughput during checkpointing is due to CPU cost and a small amount of latch contention
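A much-simplified sketch (hypothetical; real Zig-Zag uses per-record copies and bits rather than whole-table copies) of the before/after-version idea around the virtual point of consistency.

class ZigZagStore:
    def __init__(self, data):
        self.before = dict(data)           # versions as of the virtual point
        self.after = dict(data)            # versions seen by later transactions

    def write_after_point(self, key, value):
        self.after[key] = value            # "before" copies stay immutable

    def checkpoint(self):
        snapshot = dict(self.before)       # async thread dumps the "before" versions
        self.before = dict(self.after)     # then the old "before" copies are discarded
        return snapshot

store = ZigZagStore({"A": 1, "B": 2})
store.write_after_point("A", 99)           # transaction ordered after the virtual point
print(store.checkpoint())                  # {'A': 1, 'B': 2}: consistent pre-point state
print(store.after)                         # {'A': 99, 'B': 2}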

Asynchronous Snapshot Mode
- Supported by storage layers that have a full multiversioning scheme, where read queries need not acquire locks
- The checkpointing scheme is just a "SELECT *" query over the versioned data, whose result is logged to disk

Performance Evaluation
- Two benchmarks: the TPC-C benchmark (New Order transaction) and a microbenchmark
- Machines used: Amazon EC2 instances with 7 GB of memory and 8 virtual cores

TPC-C Benchmark Results
- Scales linearly
- Throughput comes very close to that of the TPC-C world-record holder (Oracle): about 5000 transactions per second per node in clusters larger than 10 nodes

Microbenchmark Results
- The microbenchmark shares characteristics of the TPC-C New Order transaction
- Contention index: the fraction of the total "hot" records updated when a transaction executes at a particular machine
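One illustrative reading of this definition (an assumption about the exact formula, not stated on the slide): if each transaction updates one hot record at a participating machine and that machine holds 1,000 hot records, the contention index is 0.001; a contention index of 1 would mean every transaction updates the same single hot record.

# Illustrative (assumed) arithmetic for the contention index.
hot_records_per_machine = 1000
hot_updates_per_txn_at_machine = 1
contention_index = hot_updates_per_txn_at_machine / hot_records_per_machine
print(contention_index)   # 0.001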

Microbenchmark Results (2)
- Sharp drop in per-node throughput when going from one machine to two machines, due to the additional CPU work done for each multi-partition transaction

Microbenchmark Results (3)
- As machines are added, slow machines and execution progress skew appear
- The sensitivity of throughput to execution progress skew depends on the number of machines and the level of contention

Handling High Contention: Evaluation

Conclusions
- A deterministic database arranges "everything" at the beginning
- Instead of trying to optimize distributed commit protocols, deterministic databases step back and ask: why not eliminate them entirely?

EXTRA SLIDES

Disk I/O Latency Prediction
Two challenges with this approach:
- How to accurately predict disk latencies, so that transactions are delayed for the appropriate amount of time?
- How to track which keys are in memory in order to determine when prefetching is necessary? (done by the sequencing layer)

Disk I/O Latency Prediction (2)
The time taken to read disk-resident data depends on:
- the variable physical distance the head and spindle must move
- previously queued disk I/O operations
- network latency for remote reads
- failover from media failures, etc.
Perfect prediction is not possible, yet disk I/O latency estimation is crucial under high contention in the system

Disk I/O Latency Prediction (3)
- If overestimated: the contention cost due to disk access is minimized, but overall transaction latency increases and memory can become overloaded
- If underestimated: the transaction stalls during execution until fetching is done, giving a high contention footprint and reduced throughput
- Tradeoffs are necessary; exhaustive exploration of them is future work

Globally Tracking Hot Records
- The sequencer must track which data is currently in memory across the entire system, to determine which transactions to delay while read sets are warmed up
- Solutions:
- Keep a global list of hot keys at every sequencer
- Delay all transactions at every sequencer until adequate time for prefetching has passed
- Let the scheduler track hot local data across replicas (works only for single-partition transactions)