
High Availability for Database Systems in Cloud Computing Environments Ashraf Aboulnaga University of Waterloo

Acknowledgments University of Waterloo: Prof. Kenneth Salem, Umar Farooq Minhas, Rui Liu (post-doctoral fellow). University of British Columbia: Prof. Andrew Warfield, Shriram Rajagopalan, Brendan Cully.

What is Cloud Computing? Trend toward consolidation in server-side computing Cloud computing: massive, shared, hosted clusters of servers

What is Cloud Computing? Cloud provider makes available massive clusters of servers running applications that provide services to users. Users access the provided services through the Internet or an intranet.

Why Cloud Computing? For the cloud provider: Consolidation leads to economies of scale. Cheaper hardware, software, networking, power. Easier to invest in administration, availability, security, etc. Enables extreme levels of scalability, making it possible to support 100's of millions of users or requests, and terabytes of data. For the cloud user: Capital expenses for computing become operational expenses. Lower cost and higher quality of service. Elasticity: grow and shrink your computing resources as you need, and pay only for what you use.

Database Systems in the Cloud Using cloud technologies plus SQL database systems to build a scalable highly available database service in the cloud Better deployment of database systems in the cloud Better support for database systems by cloud technologies

Why Database Systems? Databases are important! A narrow interface to the user (SQL) Transactions offer well defined semantics for data access and update Well defined internal structures (e.g., buffer pool) and query execution operators (e.g., hash joins) Accurate performance models for query execution time and resource consumption

Outline Introduction RemusDB: Database high availability using virtualization Umar Farooq Minhas, Shriram Rajagopalan, Brendan Cully, Ashraf Aboulnaga, Kenneth Salem, and Andrew Warfield. “RemusDB: Transparent High Availability for Database Systems,” In Proceedings of the VLDB Endowment (PVLDB), 2011. (Best Paper Award) DBECS: Database high availability (and scalability) using eventually consistent cloud storage Conclusion

Virtual Machines [Diagram: applications (App 1 - App 5) run on guest operating systems inside Virtual Machine 1 and Virtual Machine 2, each with virtual CPU, memory, and network; the Virtual Machine Monitor (VMM) maps them onto the physical machine's CPUs, memory, and network.]

Capabilities of VMMs Strong isolation between VMs Controlling the allocation of resources (CPU, memory, etc.) to VMs Pause, checkpoint, resume VMs Live VM migration between physical servers … and more

High Availability A database system is highly available if it remains accessible to its users in the face of hardware failures High availability (HA) is becoming a requirement for almost all database applications, not just mission critical ones Key issues: Maintaining database consistency in the face of failure Minimizing the impact on performance during normal operation and after a failure Reducing the complexity and administrative overhead of HA

Active/Standby Replication [Diagram: primary server and backup server, each running a DBMS over its own copy of the database (DB); database changes (the transaction log) flow from primary to backup.] A copy of the database is stored on two servers, a primary and a backup. The primary server (active) accepts user requests and performs database updates. Changes to the database are propagated to the backup server (standby) by propagating the transaction log. Upon failure, the backup server takes over as primary.

Active/Standby Replication Active/standby replication is complex to implement in the DBMS, and complex to administer Propagating the transaction log Atomic handover from primary to backup on failure Redirecting client requests to backup after failure Minimizing effect on performance (e.g., warming the buffer pool of the backup) Our approach: Implement active/standby replication at the virtual machine layer Push the complexity out of the DBMS High availability as a service: Any DBMS can be made highly available with little or no code changes Low performance overhead

Transparent HA for DBMS [Diagram: a DBMS runs in a VM over its database on the primary server; changes to VM state flow to an identical VM over its own copy of the database on the backup server.] RemusDB: efficient and transparent active/standby high availability for DBMS, implemented in the virtualization layer. Propagates all changes in VM state from primary to backup. High availability with no code changes to the DBMS. Completely transparent failover from primary to backup. Failover to a warmed up backup server.

Remus and VM Checkpointing RemusDB is based on Remus, a high availability solution that is now part of the Xen virtual machine monitor Remus maintains a replica of a running VM on a separate physical machine Periodically replicates state changes from the primary VM to the backup VM using whole machine checkpointing Checkpointing is based on extensions to live VM migration Provides transparent failover with only seconds of downtime

Remus Checkpoints Remus divides time into epochs (~25ms) and performs a checkpoint at the end of each epoch: 1. Suspend primary VM 2. Copy all state changes to a buffer in Domain 0 3. Resume primary VM 4. Send asynchronous message to backup containing state changes 5. Backup VM applies state changes [Diagram: periodic checkpoints (changes to VM state) flow from Domain 0 of the primary server to Domain 0 of the backup server, both running the Xen VMM.]
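The checkpoint cycle can be summarized in a short sketch. This is only illustrative: the VM and link objects, and calls like suspend(), copy_dirty_state(), and send_async(), are hypothetical stand-ins, not the actual Xen/Remus interfaces.

```python
import time

EPOCH_MS = 25  # approximate Remus epoch length from the slide

def checkpoint_loop(primary_vm, backup_link):
    """Illustrative Remus-style checkpoint loop (hypothetical API, not real Xen/Remus code)."""
    while True:
        time.sleep(EPOCH_MS / 1000.0)            # let the primary VM run for one epoch
        primary_vm.suspend()                      # 1. suspend primary VM
        delta = primary_vm.copy_dirty_state()     # 2. copy state changes to a buffer in Domain 0
        primary_vm.resume()                       # 3. resume primary VM immediately
        backup_link.send_async(delta)             # 4. asynchronously ship the checkpoint to the backup
        # 5. the backup VM applies the state changes on receipt; network packets
        #    buffered during the epoch are released once the checkpoint is acknowledged
```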

Remus Checkpoints After a failure, the backup resumes execution from the latest checkpoint. Any work done by the primary during the current, not yet checkpointed epoch will be lost (unsafe). Remus provides a consistent view of execution to clients: any network packets sent during an epoch are buffered until the next checkpoint, which guarantees that a client will see results only if they are based on safe execution. The same principle is also applied to disk writes.

Remus and DB Workloads [Figure: with Remus protection, query responses from the primary server to the DBMS client are delayed by network buffering in addition to processing; protected response time is up to 32% higher than unprotected.] RemusDB implements optimizations to reduce the overhead of protection for database workloads. It incurs ≤ 3% overhead and recovers from failures in ≤ 3 seconds.

RemusDB Remus optimized for protecting database workloads. Memory optimizations (send less data, protect less memory): database workloads tend to modify a lot of memory in each epoch (buffer pool, working memory for queries, etc.), so reduce checkpointing overhead with asynchronous checkpoint compression and disk read tracking. Network optimization: some database workloads are sensitive to the network latency added by buffering network packets, so exploit the semantics of database transactions to avoid buffering (commit protection).

Async Checkpoint Compression Database workloads typically involve a large set of frequently changing pages of memory (e.g., buffer pool pages) Results in a large amount of replication traffic The DBMS often changes only a small part of the pages Data that is replicated contains redundancy Reduce replication traffic by only sending the changes to the memory pages (and send them compressed)

Async Checkpoint Compression [Diagram: in Domain 0 of the protected VM's server, the dirty pages of epoch i are compared against an LRU cache of dirty pages from epochs 1 … i-1; the delta is computed, compressed, and sent to the backup.]
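A minimal sketch of the delta-plus-compression idea, assuming an LRU cache of the page versions most recently sent to the backup; the page size, cache capacity, and use of zlib are illustrative assumptions.

```python
import zlib
from collections import OrderedDict

PAGE_SIZE = 4096      # assumed page size
CACHE_PAGES = 8192    # assumed LRU cache capacity

lru = OrderedDict()   # page number -> contents last sent to the backup

def encode_dirty_page(page_no: int, new_contents: bytes) -> bytes:
    """XOR-delta a dirty page against its cached prior version, then compress.
    Pages with no cached prior version are sent whole (still compressed)."""
    old = lru.pop(page_no, None)
    if old is None:
        payload = new_contents
    else:
        # A page that changed only slightly produces a mostly-zero delta,
        # which compresses very well.
        payload = bytes(a ^ b for a, b in zip(new_contents, old))
    lru[page_no] = new_contents          # remember this version for the next epoch
    if len(lru) > CACHE_PAGES:
        lru.popitem(last=False)          # evict the least recently used page version
    return zlib.compress(payload)
```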

Disk Read Tracking [Diagram: active VM and standby VM, each with a DBMS buffer pool (BP) over its own copy of the database on disk; pages read from disk appear in the stream of changes to VM state.] The DBMS loads pages from disk into its buffer pool (BP). These pages are clean to the DBMS, but dirty to Remus, so Remus synchronizes them in every checkpoint. Synchronizing clean BP pages is unnecessary: they can be read from disk at the backup.

Disk Read Tracking Track the memory pages into which disk reads are placed Do not mark these pages as dirty until they are actually modified Add an annotation to the replication stream indicating the disk sectors to read to reconstruct these pages
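A sketch of the bookkeeping behind disk read tracking; the callback names and record format are hypothetical, but they illustrate how page contents in the replication stream can be replaced by a "reconstruct from these disk sectors" annotation.

```python
# page number -> (start sector, number of sectors) for pages whose contents
# came straight from disk and have not been modified since
read_tracked = {}

def on_disk_read_completed(page_no: int, sector: int, count: int) -> None:
    """A disk read filled this memory page; remember where it came from."""
    read_tracked[page_no] = (sector, count)

def on_page_modified(page_no: int) -> None:
    """The guest wrote to the page, so it is genuinely dirty again."""
    read_tracked.pop(page_no, None)

def replication_record(page_no: int, contents: bytes):
    """Emit either the full page or a cheap 'reconstruct from disk' annotation."""
    if page_no in read_tracked:
        sector, count = read_tracked[page_no]
        # The backup re-reads its own copy of the disk instead of receiving the page.
        return ("READ_FROM_DISK", page_no, sector, count)
    return ("PAGE_CONTENTS", page_no, contents)
```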

Network Optimization Remus buffers every outgoing network packet Ensures clients never see results of unsafe execution But increases round trip latency by 2-3 orders of magnitude Largest source of overhead for many database workloads Unnecessarily conservative for database systems Database systems provide transactions with clear consistency and durability semantics Remus’s TCP-level per-checkpoint transactions are redundant Provide an interface to allow a DBMS to decide which packets are protected (i.e., buffered until the next checkpoint) and which are unprotected Implemented as a new setsockopt() option in Linux
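The slides say the protection switch is exposed as a new setsockopt() option in Linux but do not name it; the level and option numbers below are made up purely to show how a per-connection toggle might look.

```python
import socket

# Hypothetical values: the actual option added to Linux is not specified in the slides.
SOL_REMUS = 0x5245           # made-up socket option level
REMUS_PROTECT = 1            # made-up option: 1 = buffer packets until checkpoint, 0 = send immediately

def set_protection(conn: socket.socket, protected: bool) -> None:
    """Toggle Remus network buffering for a single client connection."""
    conn.setsockopt(SOL_REMUS, REMUS_PROTECT, 1 if protected else 0)
```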

Commit Protection The DBMS only protects transaction control packets (BEGIN TRANSACTION, COMMIT, ABORT); other packets are unprotected. After failover, a recovery handler runs in the DBMS at the backup and aborts all in-flight transactions whose client connection was in unprotected mode. Not transparent to the DBMS: requires minor modifications to the client connection layer (103 LoC for PostgreSQL, 85 LoC for MySQL). Transaction safety is guaranteed.
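Building on the hypothetical set_protection() call in the previous sketch, this sketch shows how a DBMS client-connection layer might apply commit protection: only responses to transaction control statements are buffered until the next checkpoint, and a recovery handler aborts unprotected in-flight transactions after failover. All names are illustrative, not the PostgreSQL/MySQL patches from the paper.

```python
TRANSACTION_CONTROL = ("BEGIN", "COMMIT", "ABORT")

def send_response(conn, statement: str, response: bytes) -> None:
    """Buffer (protect) only the responses to transaction control statements."""
    protect = statement.strip().upper().startswith(TRANSACTION_CONTROL)
    set_protection(conn, protect)     # hypothetical toggle from the previous sketch
    conn.sendall(response)

def failover_recovery_handler(in_flight_transactions) -> None:
    """Runs in the DBMS at the backup after failover."""
    for txn in in_flight_transactions:
        if not txn.connection_protected:
            txn.abort()               # unprotected work may be unsafe, so abort it
```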

Experimental Setup [Diagram: MySQL / PostgreSQL in an active VM on the primary server and in a standby VM on the backup server, each over its own copy of the database, both on Xen 4.0; the servers are connected by Gigabit Ethernet; workloads are TPC-C / TPC-H.]

Failover [Figure: TPC-C on MySQL; throughput over time as the primary server fails and the backup takes over.]

Benefits of RemusDB High availability for any DBMS with no code changes Or with very little code changes if we use commit protection “High availability as a service” Automatic and fully transparent failover to a warmed up system Next steps Reprotection after a failure One server as the backup for multiple primary servers Administration of RemusDB failover

Outline Introduction RemusDB: Database high availability using virtualization DBECS: Database high availability (and scalability) using eventually consistent cloud storage (Under submission) Conclusion

Cloud Storage Many cloud storage systems Amazon S3 HBase Cassandra …and more Scalable, distributed, fault tolerant Support simple read and write operations write(key, value) value = read(key) Atomicity only for single-row operations No atomic multi-row reads or writes Interface much simpler than SQL (“NoSQL”)

Databases Over Cloud Storage [Diagram: a DBMS and its application run over a cloud storage system that spans two data centers.] Goal: a scalable, elastic, highly available, multi-tenant database service that supports SQL and ACID transactions. The cloud storage system provides scalability, elasticity, and availability; the DBMS provides SQL and ACID transactions.

DBECS: Databases on Eventually Consistent Stores [Diagram: MySQL with InnoDB on top of a Cassandra I/O layer, which runs over Cassandra.] Can replace MySQL with another DBMS. Need Cassandra since we want eventual consistency.

Why Cassandra? Relaxing consistency reduces the write latency of Cassandra and makes it partition tolerant Cassandra stores semi-structured rows that belong to column families Rows are accessed by a key Rows are replicated and distributed by hashing keys Multi-master replication for each row Enables Cassandra to run in multiple data centers Also gives us partition tolerance DBECS leverages this for disaster tolerance

Why Cassandra? Client controls the consistency vs. latency trade-off for each read and write operation write(1)/read(1) – fast but not necessarily consistent write(ALL)/read(ALL) – consistent but may be slow We posit that database systems can control this trade-off quite well Client decides the serialization order of updates Important for consistency in DBECS Scalable, elastic, highly available Like many other cloud storage systems!

Consistency vs. Latency value = read(1, key, column) Send read request to all replicas of the row (based on key) Return first response received to client Returns quickly but may return stale data value = read(ALL, key, column) Wait until all replicas respond and return latest version to client Consistent but as slow as the slowest replica write(1) vs. write(ALL) Send write request to all replicas Client provides a timestamp for each write Other consistency levels are supported
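For concreteness, here is how the per-operation consistency choice looks with today's DataStax Python driver (the system described here predates CQL and used Cassandra's Thrift interface); the contact points, keyspace, and table are illustrative.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["10.0.0.1", "10.0.0.2"]).connect("demo")   # hypothetical cluster and keyspace

# write(1): fast, acknowledged by a single replica, not necessarily consistent or durable.
fast_write = SimpleStatement(
    "UPDATE blocks SET value = %s WHERE key = %s",
    consistency_level=ConsistencyLevel.ONE)
session.execute(fast_write, (b"\x00" * 16384, "dbms1:block42"))

# read(ALL): consistent, but as slow as the slowest replica.
strong_read = SimpleStatement(
    "SELECT value FROM blocks WHERE key = %s",
    consistency_level=ConsistencyLevel.ALL)
row = session.execute(strong_read, ("dbms1:block42",)).one()
```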

Consistency vs. Latency [Scenario: 1- write(k, v1) and 2- write(k, v2) are issued, then 3- v = read(k) is issued from a different data center; the row is replicated across Data Centers 1, 2, and 3.] Which v is returned to the read()? write(1)/read(1): possibly v1, and eventually v2. write(ALL)/read(1): guaranteed to return v2 if successful. write(1)/read(ALL): guaranteed to return v2 if successful.

Consistency vs. Latency Experiment on Amazon EC2 – Yahoo! Cloud Serving Benchmark (YCSB) – 4 Cassandra Nodes Same EC2 Availability Zone

Consistency vs. Latency Two EC2 Availability Zones Same EC2 Geographic Region

Consistency vs. Latency Two EC2 Regions (US East and US West)

Databases Over Cassandra Make Cassandra look like a disk to the DBMS tenants Databases stored in one column (in one column family) key = DBMS id + disk block id value = contents of disk block Cassandra I/O layer maps DBMS reads and writes to Cassandra reads and writes Which consistency level to use? write(1)/read(1): Fast but may return stale data and provides no durability guarantees. Not good for a DBMS. write(ALL)/read(1): Returns no stale data and guarantees durability but writes are slow.
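A sketch of the Cassandra I/O layer's block-to-row mapping described above (key = DBMS id + disk block id, value = block contents). The store object and its read()/write() methods are hypothetical stand-ins for a Cassandra client.

```python
def block_key(dbms_id: str, block_no: int) -> str:
    """Row key for one database block: DBMS id + disk block id."""
    return f"{dbms_id}:{block_no}"

class CassandraIO:
    """Presents a block device to the DBMS and maps block I/O to row I/O."""
    def __init__(self, store, dbms_id: str, block_size: int = 16384):
        self.store = store          # hypothetical Cassandra client
        self.dbms_id = dbms_id
        self.block_size = block_size

    def read_block(self, block_no: int) -> bytes:
        return self.store.read(block_key(self.dbms_id, block_no), column="data")

    def write_block(self, block_no: int, data: bytes) -> None:
        assert len(data) == self.block_size
        self.store.write(block_key(self.dbms_id, block_no), column="data", value=data)
```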

Goal of DBECS Achieve the performance of write(1)/read(1) while maintaining consistency, durability, and availability Optimistic I/O Use write(1)/read(1) and detect stale data Client-controlled synchronization Make database updates safe in the face of failures

Optimistic I/O Key observation: with write(1)/read(1), most reads will not return stale data Single writer for each database block Reads unlikely to come soon after writes because of DBMS buffer pool Cassandra sends writes to all replicas Network topology means that first replica to acknowledge a write will likely be the first to acknowledge a read So use write(1)/read(1), detect stale data, and recover from it

Optimistic I/O Detecting stale data: Cassandra I/O stores a version number with each database block and remembers the current version of each block; it checks the version number returned by read(1) against the current version number. Recovering from stale data: use read(ALL), or retry the read(1); when read(1) detects stale data, Cassandra brings the stale replicas up to date (read repair).

Implementation Cassandra I/O maintains a version list containing the version number of the latest write for each accessed block Initially, this list is empty, so use read(ALL) for all blocks As blocks get used, their latest version numbers get added to the version list. Use read(1) and check returned version number to detect stale reads. If version list fills up, evict old blocks Use read(ALL) for any block not on the version list Optimization: modified Cassandra so that read requests notify Cassandra I/O of oldest and newest version numbers of retrieved blocks If oldest == newest, all replicas of the block are consistent Can use read(1) without checking returned version numbers
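A sketch of the optimistic read path with a bounded version list, assuming a hypothetical store that exposes read_one() and read_all() returning (version, data); the read(ALL) fallback and eviction behavior follow the rules on this slide.

```python
from collections import OrderedDict

class OptimisticReader:
    """Optimistic I/O: read(1) plus version checks, falling back to read(ALL)."""
    def __init__(self, store, capacity: int = 100_000):
        self.store = store                 # hypothetical client: read_one()/read_all() -> (version, data)
        self.capacity = capacity
        self.versions = OrderedDict()      # block key -> latest version seen or written

    def note_write(self, key: str, version: int) -> None:
        """Called by the write path so reads know the expected version."""
        self._remember(key, version)

    def read(self, key: str) -> bytes:
        expected = self.versions.get(key)
        if expected is None:
            version, data = self.store.read_all(key)      # unknown block: strongly consistent read
        else:
            version, data = self.store.read_one(key)      # optimistic fast path
            if version < expected:                        # a stale replica answered first
                version, data = self.store.read_all(key)  # recover (read repair also kicks in)
        self._remember(key, version)
        return data

    def _remember(self, key: str, version: int) -> None:
        self.versions[key] = version
        self.versions.move_to_end(key)
        while len(self.versions) > self.capacity:
            self.versions.popitem(last=False)             # evicted blocks revert to read(ALL)
```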

Dealing with Failures With write(1), data is not safe With read(ALL), will block if one replica is down Naive solution: use write(ALL)/read(QUORUM) Key observation: Transaction semantics tell us when writes must be safe and when the DBMS can tolerate unsafe writes Write Ahead Logging tells us when data needs to be safe (write log before data and flush log on commit) Database systems explicitly synchronize data at the necessary points, for example, by using fsync() Need an fsync() for Cassandra Can abort transactions if unsafe writes are lost

Client-Controlled Sync Added new type of write in Cassandra: write(CSYNC) Like write(1), but stores the key of the written row in a sync_pending list Cassandra client can issue a CSync() call to synchronize all rows in the sync_pending list To protect against failure, Cassandra I/O Uses write(CSYNC) instead of write(1) Calls CSync() whenever the DBMS calls fsync() Data or log pages are made safe only when needed Time between write() and CSync() enables latency hiding Uses read(QUORUM)
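A sketch of client-controlled synchronization as described above: ordinary block writes use the fast write(CSYNC), and the DBMS's fsync() is turned into a CSync() over the rows written since the last sync. The store methods are hypothetical stand-ins for the modified Cassandra client.

```python
class CSyncWriter:
    """Writes are acknowledged fast; durability is forced only where the DBMS fsync()s."""
    def __init__(self, store):
        self.store = store           # hypothetical client with write_csync() and csync()
        self.sync_pending = set()    # row keys written since the last CSync()

    def write_block(self, key: str, value: bytes, timestamp: int) -> None:
        self.store.write_csync(key, value, timestamp)   # like write(1), but tracked for later sync
        self.sync_pending.add(key)

    def fsync(self) -> None:
        """Called exactly where the DBMS would fsync() its log or data files."""
        if self.sync_pending:
            self.store.csync(self.sync_pending)          # make all pending rows safe
            self.sync_pending.clear()
```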

Examples of Failures Loss of a Cassandra node Handled by Cassandra Completely transparent to DBMS Loss of the primary data center (Disaster Recovery) Cassandra needs to be running in multiple data centers Restart the DBMS in a backup data center Log-based recovery brings the database up to date in a transactionally consistent way

Throughput [Figure: MySQL and Cassandra in Amazon EC2; TPC-C workload, 6 Cassandra nodes.]

Scalability Adding DBMS Tenants and Proportionally Increasing the Number of Cassandra Nodes

High Availability [Figure: failure of a Cassandra node in the primary data center.]

High Availability [Figure: failure of a Cassandra node in the secondary data center.]

Disaster Recovery [Figure: loss of the primary data center.]

Benefits of DBECS Scalable and elastic storage capacity and bandwidth. Scalable in the number of database tenants. Highly available and disaster tolerant storage tier. SQL and ACID transactions for the tenants. An interesting point in the spectrum of answers to the question: "Can consistency scale?" Missing from DBECS (next steps): Always-up hosted DBMS tenants (DBECS enables a DBMS tenant to use standard log-based recovery, but the tenant incurs significant down time). Scaling of individual hosted DBMS tenants.

Conclusion High availability (and scalability) for database systems can be provided by the cloud infrastructure Taking advantage of the well-known characteristics of database systems can greatly enhance the solutions RemusDB: Efficient and transparent database high availability in the virtualization layer DBECS: Scalability and high availability for database clients built on the Cassandra cloud storage system