Durability and Crash Recovery for Distributed In-Memory Storage Ryan Stutsman, Asaf Cidon, Ankita Kejriwal, Ali Mashtizadeh, Aravind Narayanan, Diego Ongaro, Stephen M. Rumble, John Ousterhout, and Mendel Rosenblum

RAMCloud Overview
RAMCloud: general-purpose storage in RAM.
Low latency: 5 µs remote access.
Large scale: 10,000 nodes, 100 TB to 1 PB.
Enables applications that access data intensively: 100 to 1000x more data.
Simplifies development of large-scale applications by eliminating issues that kill developer productivity: partitioning, parallelism/pipelining, locality, consistency.
Key problem: RAM's lack of durability.

Thesis
Large-scale RAM-based storage must be durable and available for applications to gain its full benefits, and it can be made so cost-effectively using commodity hardware.
Solution:
Fast (1-2 s) crash recovery for availability.
Restoration of durability after crashes.
Survival of massive failures (with loss of availability).

Durability Requirements
Minimal impact on normal-case performance.
Remain available during typical failures.
Remain durable during massive/correlated failures.
No undetected data loss.
Low cost, low energy; use commodity datacenter hardware available in a few years.
Goal: the performance of RAM-based storage with the durability of traditional datacenter storage.

Outline
Master Recovery
Backup Recovery
Massive Failures/Cluster Restart
Distributed Log: the key structure underlying RAMCloud

RAMCloud Architecture
Up to 100,000 application servers (application plus client library) connect through the datacenter network to up to 10,000 storage servers, each running a Master and a Backup, plus a Coordinator.
Masters expose RAM as a key-value store; Backups store data from other Masters.

Normal Operation
A write arrives at a Master, which appends it to its in-memory log and forwards it to the buffers of several Backups.
Backups buffer the update: no synchronous disk write, but buffers must be flushed on power loss.
Backups write buffered segment replicas to disk in bulk, in the background.
The log structure is pervasive: even the Master's RAM is a log, maintained by a log cleaner.
A hashtable maps each key to its location in the log.
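The write path above can be summarized in a short sketch. Everything here (the Master and Backup classes, the triplication constant) is an illustrative stand-in rather than RAMCloud's actual code; it only shows the ordering: append to the in-memory log, update the hashtable, push the entry to the backups' RAM buffers, and acknowledge without touching disk.

    # Sketch of the normal-case write path; Master/Backup here are hypothetical
    # stand-ins, not RAMCloud classes.

    class Backup:
        def __init__(self):
            self.buffer = []   # replica of the open segment, held in RAM
            self.disk = []     # closed segment replicas on disk

        def append(self, entry):
            self.buffer.append(entry)        # no synchronous disk write

        def flush(self):                     # background bulk write (or on power loss)
            self.disk.extend(self.buffer)
            self.buffer.clear()

    class Master:
        REPLICAS = 3                         # assumed replication factor

        def __init__(self, backups):
            self.log = []                    # in-memory log: the only full copy
            self.hashtable = {}              # key -> index of newest entry in the log
            self.backups = backups[:Master.REPLICAS]

        def write(self, key, value):
            entry = (key, value)
            self.log.append(entry)                   # 1. append to the RAM log
            self.hashtable[key] = len(self.log) - 1  # 2. key -> location
            for b in self.backups:                   # 3. replicate to backup buffers
                b.append(entry)
            return "OK"                              # 4. ack without touching disk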

Normal Operation Summary
Cost-effective: one copy in RAM, backup copies on disk/flash, so durability is nearly free.
Fast: RAM access times for all accesses; avoids synchronous disk writes.
Drawbacks: requires non-volatile buffers on backups; data is unavailable while a master is crashed.

Fast Crash Recovery
What is left when a Master crashes? The log data stored on disk on its backups.
What must be done to restart servicing requests? Replay the log data into RAM and reconstruct the hashtable.
Goal: recover fast, 64 GB in 1-2 seconds.
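As a rough sketch of the replay step (illustrative names and entry layout, not the real implementation), recovery walks the segment replicas read back from the backups and rebuilds the hashtable, keeping only the newest version of each object; because versions decide which copy wins, segments can be replayed in any order.

    # Sketch of replay: rebuild RAM state from segment replicas, assuming each
    # log entry is a (key, version, value) tuple. Tombstones/deletes are omitted.

    def replay(segments):
        memory_log = []
        hashtable = {}                   # key -> index into memory_log
        newest = {}                      # key -> highest version seen so far
        for segment in segments:         # segments may be replayed in any order
            for key, version, value in segment:
                if version >= newest.get(key, -1):
                    newest[key] = version
                    memory_log.append((key, version, value))
                    hashtable[key] = len(memory_log) - 1
        return memory_log, hashtable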

Recovery Bottlenecks
One machine cannot recover 64 GB in 1-2 seconds: a single recovery master is limited by its NIC (~1 GB/s) and the crashed master's backups by aggregate disk bandwidth (~500 MB/s), so either path alone takes on the order of a minute or more.
Key to fast recovery: use the scale of the system.

Distributed Recovery
Scatter segments across all backups and read them in parallel: solves the disk bandwidth bottleneck.
Recover to many recovery masters at once: solves the NIC and memory bandwidth bottlenecks.

Results
60 recovery masters and 120 SSDs (two 270 MB/s SSDs on each of 60 machines), giving ~31 GB/s of aggregate read bandwidth; 35 GB recovered.

Key Issues and Solutions
Maximizing concurrency: data parallelism, heavy pipelining, out-of-order replay.
Balancing work evenly: balanced partitions (the crashed master is divided up at recovery time) and segment scattering (random, but biased to balance disk read time; see the sketch below).
Avoiding centralization: finding segment replicas, detecting incomplete logs.
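A sketch of the "random, but biased" scattering idea: pick a few candidate backups at random and place the replica on the one whose disk would finish reading its assigned replicas soonest. The BackupStats fields and the candidate count are assumptions for illustration, not RAMCloud's actual bookkeeping.

    from dataclasses import dataclass
    import random

    # Sketch of randomized-but-biased segment scattering.

    SEGMENT_BYTES = 8 * 1024 * 1024
    CANDIDATES = 5

    @dataclass
    class BackupStats:
        disk_mb_per_s: float      # this backup's disk read bandwidth
        assigned_bytes: int = 0   # replica bytes it would have to read at recovery

    def choose_backup(backups):
        candidates = random.sample(backups, min(CANDIDATES, len(backups)))
        # Bias: prefer the candidate whose disk would finish its reads soonest,
        # keeping per-disk recovery read times roughly balanced.
        def projected_read_seconds(b):
            return (b.assigned_bytes + SEGMENT_BYTES) / (b.disk_mb_per_s * 1e6)
        best = min(candidates, key=projected_read_seconds)
        best.assigned_bytes += SEGMENT_BYTES
        return best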

Backup Recovery
Failures result in the loss of segment replicas; replicas must be recreated to maintain the desired redundancy.
Simple: the master uses the same replication procedure as in normal operation.
Efficient: it re-replicates from its in-memory copy.
Concurrent: the work is divided up among all the hosts.
(Diagram: the master's segments 1-3 each have replicas scattered across backups such as B17, B91, B31, B1, B72, B68, B12.)
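A sketch of the master-side reaction to a backup failure, with hypothetical names (handle_backup_failure, pick_new_backup, store_replica): each master independently re-creates the replicas it lost, pushing data from its in-memory copy so no disk reads are needed.

    # Sketch of a master re-creating replicas after a backup failure; the names
    # and object shapes are illustrative assumptions, not RAMCloud's interfaces.

    def handle_backup_failure(master, failed_backup_id, pick_new_backup):
        # Every master runs this for its own segments, spreading the work
        # across all hosts in the cluster.
        for segment in master.segments:
            surviving = [b for b in segment.replicas if b != failed_backup_id]
            if len(surviving) < len(segment.replicas):     # a replica was lost
                new_backup = pick_new_backup(exclude=surviving)
                # Same path as normal replication, fed from the master's RAM
                # copy of the segment, so no disk read is required.
                new_backup.store_replica(segment.id, segment.data)
                segment.replicas = surviving + [new_backup.id]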

Expected Backup Recovery Time
A backup could hold 192 GB of data (masters triplicating segments; with 64 GB masters, 3 x 64 GB = 192 GB per backup).
Aggregate network bandwidth: 1000 GB/s. Aggregate disk bandwidth: 200 GB/s.
~200 ms if backups have buffer space (192 GB / 1000 GB/s); ~1,000 ms if not (192 GB / 200 GB/s).
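A quick back-of-the-envelope check of those two bounds, using only the figures on this slide:

    # Back-of-the-envelope bounds for re-replicating one failed backup's data.
    data_gb = 192            # replica data lost with one backup
    net_gb_per_s = 1000      # aggregate network bandwidth
    disk_gb_per_s = 200      # aggregate disk bandwidth

    print(data_gb / net_gb_per_s)    # ~0.2 s if backups can buffer in RAM
    print(data_gb / disk_gb_per_s)   # ~1.0 s if the data must hit disk first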

Massive Failures/Cluster Restart
Unavailability is acceptable during massive/complete outages; the goal is to remain durable.
Simplicity is the goal: massive failures are expected to be rare, and the shortest possible restart time is a non-goal (within reason).
Reuse master/backup recovery: make them work correctly under massive failure and treat a massive failure as a series of single failures.

Massive Failures/Cluster Restart
Replicas must be readmitted to the cluster after a mass failure: all master data is lost and only the replicas on the backups' disks persist.
Start recovery when a complete log becomes available; determine whether recovery can proceed as backups rejoin.

Key Issues
Replica garbage collection: there is a dilemma when a replica rejoins the cluster. It may be needed for a recovery, or it may be redundant because the master has already created new replicas.

Fault-tolerant Distributed Log
The distributed log provides fast durability in the normal case, restores availability quickly during crashes, and gives a clear strategy for maintaining durability.
So far we have seen how the log operates on a good day; the following slides cover its behavior under failures.
(Diagram: the distributed log underlies fast master recovery, backup recovery, and survival of massive failures.)

Potential Problems
Detecting incomplete logs: has a complete log been found for recovery?
Replica consistency: how are the inconsistencies caused by failures handled?

Finding Replicas
Log segments are created and destroyed rapidly across the cluster (100,000s per second), so centralized log information is too expensive: there is no centralized list of the segments that comprise a log and no centralized map of replica locations.
Instead, the coordinator broadcasts on a master crash, contacting each backup explicitly. It has to contact all backups anyway, since the master's segments are scattered across all of them, and it collects a list of all available replicas (see the sketch below).
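A sketch of that broadcast, with an illustrative RPC name (list_replicas is an assumption): the coordinator queries every backup for the crashed master's replicas and builds a map from segment id to the backups that hold one.

    # Sketch of the coordinator's broadcast on a master crash; list_replicas is
    # a hypothetical RPC, not the actual interface.

    def find_replicas(coordinator, crashed_master_id):
        replica_map = {}     # segment id -> [(backup, is_open), ...]
        for backup in coordinator.backups:                 # contact every backup
            for segment_id, is_open in backup.list_replicas(crashed_master_id):
                replica_map.setdefault(segment_id, []).append((backup, is_open))
        return replica_map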

Detecting Incomplete Logs
Problem: ensure that a complete log is found during recovery. What if all replicas for some segment are missing?
Solution: make the log self-describing. Add a "log digest" to each replica when it is allocated, listing all segments of the log. This reduces the problem to finding the up-to-date log digest (see the sketch below).
(Diagram: segments S7, S19, S20, with digest [S7, S19, S20].)
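A sketch of the completeness check, reusing the replica map gathered by the broadcast above (names are illustrative): recovery may proceed only if every segment listed in the head segment's digest has at least one surviving replica.

    # Sketch of the completeness check; head_digest is the segment-id list from
    # the log's head segment, replica_map comes from the broadcast sketched earlier.

    def log_is_complete(head_digest, replica_map):
        missing = [seg for seg in head_digest if seg not in replica_map]
        # If any listed segment has no surviving replica, recovery cannot start;
        # it must wait for more backups to rejoin.
        return not missing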

Choosing the Right Digest
Solution: make the most recent segment of the log identifiable (the "head"). Segments are allocated in an "open" status and marked "closed" when the head segment fills up; only digests from the head segment are used on recovery.
Challenge: safe transition between log heads.

Preventing Lost Digests
Problem: a crash during the head transition can leave the log with no head, striking when segment 19 is no longer the head but segment 20 has not yet been created.
Solution: open the new log head before closing the old one.
(Timeline diagram: segment 19 with digest [7, 19] and segment 20 with digest [7, 19, 20], showing the open/closed ordering over time.)

Inconsistent Digests
Problem: a crash during the transition can leave two log digests (two open segments).
Solution: do not put data in the new head until the old head is closed. If the new log head contains no data, either digest is safe to use (see the sketch below).
(Timeline diagram: segment 19 [7, 19] and segment 20 [7, 19, 20], with the crash falling between opening segment 20 and closing segment 19.)
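A minimal sketch of the head-transition ordering described on the last few slides; the Segment and Log classes are simplified stand-ins, not RAMCloud code. The key point is the order of the three steps and why a crash between them is safe.

    # Simplified stand-ins for the log-head transition.

    class Segment:
        def __init__(self, seg_id, digest):
            self.id = seg_id
            self.digest = digest       # ids of all segments in the log
            self.status = "open"
            self.entries = []

    class Log:
        def __init__(self):
            self.segments = [Segment(0, [0])]
            self.head = self.segments[0]

        def roll_over_head(self):
            old_head = self.head
            new_id = old_head.id + 1
            # 1. Open the new head first; its digest already lists itself.
            new_head = Segment(new_id, [s.id for s in self.segments] + [new_id])
            self.segments.append(new_head)
            # 2. Only then close the old head. A crash between steps 1 and 2
            #    leaves two open digests, but the new head holds no data yet,
            #    so recovery may safely use either one.
            old_head.status = "closed"
            # 3. Only now may new writes go into the new head.
            self.head = new_head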

Replica Consistency
Replicas can become inconsistent due to failures during the creation or re-creation of replicas, and recovery must work even with only one replica available for each segment.
Inconsistencies cannot be prevented, so instead make them detectable and prevent recovery from using them.
Open segments present special challenges.

Failures on Closed Segments
Problem: a master might be recovered from a partial replica whose re-creation was interrupted.
Solution: recreate replicas atomically. The backup considers an atomic replica invalid until it is closed; until then it will not be persisted or used in recovery (see the sketch below).
(Timeline diagram: segment 19 on the master and its re-created replicas on backups.)
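A sketch of the backup-side rule, with illustrative class and flag names: a re-created ("atomic") replica is not persisted or offered for recovery until it has been closed.

    # Sketch of backup-side handling of re-created ("atomic") replicas.

    class ReplicaOnBackup:
        def __init__(self, atomic):
            self.data = bytearray()
            self.atomic = atomic       # True when this replica is being re-created
            self.closed = False

        def append(self, chunk):
            self.data += chunk

        def close(self):
            self.closed = True

        def usable_for_recovery(self):
            # A partially re-created replica stays invisible until it is closed.
            return self.closed or not self.atomic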

Failures on Open Segments
The closed-segment solution does not work here. Problem: the segment in the master's RAM is still changing, so a lost replica may be useful for recovery one moment and must not be used a moment later.
Rare: expect about 3 masters affected every 30 minutes or longer.
(Timeline diagram: segment 19 on the master and its replicas on backups, with a write starting and ending.)

Failures on Open Segments
Solution: make the valid replicas distinguishable by closing them, and notify the coordinator to prevent use of the lost open replicas.
(Timeline diagram: on a failure during a write, the master opens segment 20 and closes segment 19, then updates the minimum head segment id on the coordinator to 20; see the sketch below.)
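A sketch of the recovery-time check implied by this slide (names are illustrative): a replica of an open segment is usable only if its id is at least the minimum head segment id recorded at the coordinator.

    # Sketch of the recovery-time check for replicas of open segments;
    # min_head_id is the minimum head segment id recorded at the coordinator.

    def replica_usable(replica, min_head_id):
        if replica.closed:
            return True                # closed replicas are complete by definition
        # An open replica older than the coordinator's minimum head id may have
        # been abandoned after a backup failure and must not be used.
        return replica.segment_id >= min_head_id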

Fault-tolerant Distributed Log
Detecting incomplete logs: a broadcast finds the replicas, the log digest provides a catalog of segments, and the head of the log ensures the up-to-date digest is found.
Replica consistency: atomic replica recreation handles most segments, and a centralized minimum head id invalidates lost replicas.
No centralization is needed during normal operation, even during transitions between segments.

What's done
Normal operation [Fall 2009]
Low latency: 5 µs remote access [Fall 2010]
Master recovery: 1-2 s restoration of availability after failures [SOSP 2011]
Backup recovery: durability safely maintained via re-replication [Feb 2012]

What's left
Massive failures/cluster restart: replica garbage collection from backups, master recovery on the coordinator, recoveries which can wait for lost replicas [Spring 2012]
Experiment with longer-lived RAMCloud instances; work out kinks [Apr 2012]
Measure/understand complex interactions, e.g. multiple recoveries, the impact of recovery on normal operation, storage leaks, etc. [May-June 2012]
Explore strategies for surviving permanent backup failures [Summer 2012]

Summary
RAM's volatility makes it difficult to use for storage, yet applications expect large-scale storage to be durable; RAMCloud's durability and availability mechanisms solve this.
RAMCloud provides general-purpose storage in RAM with low latency at large scale, enabling new applications that access more data and simplifying the development of large-scale applications.
