RAMCloud Design Review: Recovery
Ryan Stutsman
April 1, 2010

Overview
- Master Recovery
  - 2-Phase
  - Partitioned
- Failures
  - Backups
  - Rack/Switch
  - Power
  - Datacenter

Implications of a Single Copy in Memory
- Problem: unavailability
  - If a master crashes, its data is unavailable until it is read back from the disks on its backups
  - Read 64 GB from one disk? ~10 minutes
- Use scale to get low-latency recovery
  - Lots of disk heads, NICs, CPUs
  - Our goal: recover in 1-2 seconds
    - Is this good enough?
  - Applications just see "hiccups"
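
As a rough check of the single-disk figure (assuming ~100 MB/s of sequential read bandwidth, consistent with the disk numbers used later in these slides):

```python
GB = 1e9
master_dram = 64 * GB         # data held in one crashed master's DRAM
disk_read_bw = 100e6          # bytes/sec, assumed sequential read bandwidth of one disk

single_disk_time = master_dram / disk_read_bw
print(f"reading 64 GB from one disk: {single_disk_time / 60:.1f} minutes")  # ~10.7 minutes
```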

Fast Recovery
- Problem: disk bandwidth is the bottleneck for recovery
- Idea: leverage many spindles to recover quickly
  - Data is broadly scattered throughout the backups
    - Not just great write throughput
    - Take advantage of the read throughput too
- Reincarnate masters
  - With the same tables
  - With the same indexes
  - Preserves locality

Fast Recovery: The Problem
- After a crash, all backups read their disks in parallel
  (64 GB / 1000 disks / 100 MB/sec per disk = 0.6 sec, great!)
- Collect all backup data on a single replacement master
  (64 GB / 10 Gbit/sec ~ 60 sec: too slow!)
- Problem: the network into the replacement master is now the bottleneck!
[Diagram: many backups stream data to one replacement master; 0.6 s for the disk reads, ~60 s for the network transfer]
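
A sketch of the two sides of this bottleneck; the 1000-disk count is an assumption in line with the 1000-host cluster used later in these slides:

```python
GB = 1e9
data = 64 * GB                # crashed master's data, scattered across the cluster

# Step 1: every backup reads its share from disk in parallel
num_disks = 1000              # assumed spindle count across the cluster
disk_bw = 100e6               # bytes/sec per disk
read_time = data / (num_disks * disk_bw)

# Step 2: funnel everything into one replacement master over its NIC
nic_bw = 10e9 / 8             # 10 Gbit/sec expressed in bytes/sec
transfer_time = data / nic_bw

print(f"parallel disk reads: {read_time:.2f} s")          # ~0.64 s
print(f"network into one master: {transfer_time:.0f} s")  # ~51 s, on the order of 60 s
```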

2-Phase Recovery
- Idea: the data is already in memory on the backups; the replacement master just needs to know where it is
[Diagram: Master A and Backups E, F, G, H]

2-Phase Recovery
- Phase #1: recover location info (< 1 s)
[Diagram: Backups E-H, a free host, and Master A]

2-Phase Recovery
- Phase #1: recover location info (< 1 s)
  - Read all data into the memories of the backups

2-Phase Recovery
- Phase #1: recover location info (< 1 s)
  - Read all data into the memories of the backups
  - Send only the location info to the replacement master

2-Phase Recovery
- Phase #2: proxy & recover full data (~60 s)
  - The system resumes operation:
    - Fetch objects on demand from the backups
    - 1 extra round trip on the first read of an object
    - Writes run at full speed

2-Phase Recovery
- Phase #2: proxy & recover full data (~60 s)
  - Transfer data from the backups in between servicing requests

2-Phase Recovery
- Performance returns to normal after Phase #2 completes
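
A minimal sketch of the Phase #2 read path (proxy to a backup on the first touch of an object, then serve from local DRAM). The structures and method names are hypothetical illustrations, not RAMCloud's actual interfaces:

```python
class RecoveringObject:
    """Per-object state on the replacement master after Phase #1."""
    def __init__(self, backup):
        self.backup_location = backup   # which backup holds the bytes in memory
        self.value = None               # filled in lazily during Phase #2

def read(master, table_id, key):
    """Serve a read during Phase #2: proxy to a backup on first touch."""
    obj = master.hashtable[(table_id, key)]
    if obj.value is None:                                      # not yet transferred
        obj.value = obj.backup_location.fetch(table_id, key)   # 1 extra round trip
    return obj.value                                           # later reads hit local DRAM
```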

2-Phase Recovery: Thoughts
- Recovers locality by recovering whole machines
- Need to talk to all hosts
  - Because backup data for a single master is spread across all machines
  - How bad is this?
- Doesn't deal with heterogeneity
  - The machine is the unit of recovery
  - Can only recover a machine onto one with more capacity
- Bi-modal utilization
  - Must retain a pool of empty hosts

2-Phase Recovery: Problem
- Hashtable inserts become the new bottleneck
  - Phase #1 must place all objects in the hashtable
  - A master can hold 64 million 1 KB objects
  - The hashtable can sustain about 10 million inserts/s
  - 6.4 s is over our budget (see the sketch below)
  - Can use additional cores, but objects could be even smaller
- Unsure of a way to recover the exact master quickly
  - Constrained by both CPU and NIC
  - Recovery to a single host is a bottleneck
- Another way?
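
The insert-rate arithmetic behind these bullets, as a quick sketch using the slide's figures:

```python
objects = 64e6          # 64 GB of 1 KB objects on the crashed master
insert_rate = 10e6      # hashtable inserts/sec one master can sustain

print(f"rebuild hashtable on one master: {objects / insert_rate:.1f} s")  # 6.4 s > 1-2 s budget

# Extra cores help, but smaller objects mean more inserts for the same 64 GB:
for cores in (2, 4):
    print(f"{cores} cores, 512 B objects: {2 * objects / (cores * insert_rate):.1f} s")
```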

Partitioned Recovery
- Idea: leverage many hosts to overcome the bottleneck
  - The problem is that machines are large, so divide each one into partitions
  - Recover each partition to a different master
  - A partition is just like a machine
    - Contains any number of tables, table fragments, indexes, etc.
[Diagram: Master A divided into partitions; its data is on Backups E-H; Masters B and C will each recover some partitions]
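
A sketch of how a crashed master's tablets might be carved into ~640 MB partitions and spread across recovery masters; the greedy packing policy and the names below are illustrative assumptions, not the actual design:

```python
from collections import namedtuple

Tablet = namedtuple("Tablet", ["table_id", "bytes"])    # hypothetical tablet descriptor

def assign_partitions(tablets, recovery_masters, target_bytes=640e6):
    """Greedily pack a dead master's tablets into ~640 MB partitions,
    then hand each partition to a distinct recovery master."""
    partitions, current, size = [], [], 0
    for tablet in sorted(tablets, key=lambda t: t.bytes, reverse=True):
        if current and size + tablet.bytes > target_bytes:
            partitions.append(current)
            current, size = [], 0
        current.append(tablet)
        size += tablet.bytes
    if current:
        partitions.append(current)
    # one partition per recovery master (wrap around if masters are scarce)
    return [(recovery_masters[i % len(recovery_masters)], part)
            for i, part in enumerate(partitions)]
```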

Partitioned Recovery
- Load the data from disks
[Diagram: Backups E-H read the dead master's data from disk; Masters B and C stand by]

Partitioned Recovery
- Reconstitute partitions on many hosts
  - 64 GB / 100 partitions = 640 MB per partition
  - 640 MB / 10 Gbit/s ~ 0.6 s for full recovery
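
The same arithmetic as a sketch, showing that both the network transfer and the hashtable rebuild divide across the 100 recovery masters:

```python
GB, MB = 1e9, 1e6
data = 64 * GB
partitions = 100

per_partition = data / partitions          # 640 MB handed to each recovery master
nic_bw = 10e9 / 8                          # 10 Gbit/s per recovery master, in bytes/sec
transfer = per_partition / nic_bw          # ~0.5 s; the slide's 0.6 s leaves some headroom

objects_per_partition = 64e6 / partitions  # the hashtable work divides up too
rebuild = objects_per_partition / 10e6     # at 10 M inserts/s

print(f"per-partition data: {per_partition / MB:.0f} MB")
print(f"network transfer per recovery master: {transfer:.2f} s")
print(f"hashtable rebuild per recovery master: {rebuild:.3f} s")
```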

Partitioned Recovery: Thoughts
- It works: meets the availability goals
  - Recovery time can be tuned by adjusting partition size
- Helps with heterogeneity
  - The unit of recovery is no longer a machine
- Increases host/partition-related metadata
  - Partitions are still large enough to provide application locality
- Need to talk to all hosts
- No hot spares needed

Partitioned Recovery: Thoughts
- Does not recover locality
  - But no worse than 2-Phase
  - The partitioned approach recovers as fast as Phase #1
  - Can restore locality as fast as Phase #2
[Diagram: Phase #1 corresponds to partitioned full recovery (0.6 s); Phase #2 corresponds to reshuffling partitions to restore locality (10 s)]

Failures: Backups
- On a backup failure, the coordinator broadcasts the failure
  - More information during the coordinator discussion
- All masters check their live data
- If objects were replicated on that backup, rewrite them elsewhere (from RAM)
- No recovery of backups themselves
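
A minimal sketch of the master-side reaction, assuming hypothetical segment and backup objects (the names below are not RAMCloud's real API):

```python
def handle_backup_failure(master, failed_backup):
    """React to the coordinator's broadcast that a backup has died.

    The master still holds every live object in DRAM, so lost replicas are
    rewritten to other backups without reading any disk.
    """
    for segment in master.segments:                          # in-memory log segments
        if failed_backup in segment.replicas:
            segment.replicas.remove(failed_backup)
            new_backup = master.choose_backup(exclude=segment.replicas)
            new_backup.write(segment.data)                   # push the copy from RAM
            segment.replicas.add(new_backup)
```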

Failures: Racks/Switches
- Careful selection of backup locations
  - Write the backups for an object to racks other than
    - each other's racks
    - the master's rack
  - This changes as masters recover
    - Recovered data can move between racks
  - Masters fix the placement on recovery
- Rack failures are handled the same as machine failures
  - Consider all the machines in the rack dead
- Question: what is the minimum RAMCloud that can sustain an entire rack failure and still meet the recovery goal? (see the placement sketch below)
  - 100 partitions recover a single machine in 0.6 s
  - 50 dead machines * 100 partitions each = 5000 partitions, so ~5000 machines are needed to hold 0.6 s
  - Don't pack storage servers into racks; mix them with app servers
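
A minimal sketch of the rack-aware placement rule described above; the host and rack structures are hypothetical:

```python
import random

def choose_backups(hosts, master, replicas=3):
    """Pick backup hosts so that no replica shares a rack with the master
    or with another replica of the same segment."""
    used_racks = {master.rack}
    chosen = []
    candidates = [h for h in hosts if h is not master]
    random.shuffle(candidates)                  # spread load across the cluster
    for host in candidates:
        if host.rack in used_racks:
            continue
        chosen.append(host)
        used_racks.add(host.rack)
        if len(chosen) == replicas:
            return chosen
    raise RuntimeError("not enough distinct racks to satisfy the placement constraint")
```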

Failures: Power
- Problem: objects are buffered temporarily in RAM on the backups
  - Even after the put has returned as successful to the application
- Solution: all hosts have on-board battery backup
- Flush all dirty objects to disk on a power fluctuation
  - Any battery should be easily sufficient for this
  - Each master has r active buffers on backups
  - Buffers are 8 MB; with 3 disk copies
    - 3 * 8 MB takes 0.24 s to flush (see the sketch below)
- Question: is there a cost-effective way to get 10-20 s of power?
- No battery?
  - Deal with lower consistency
  - Synchronous writes
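
The flush-time arithmetic, assuming ~100 MB/s of disk write bandwidth (the figure used elsewhere in these slides):

```python
buffer_size = 8e6        # bytes, one active backup buffer
copies = 3               # disk copies kept per master
disk_write_bw = 100e6    # bytes/sec, assumed per-disk write bandwidth

flush_time = copies * buffer_size / disk_write_bw
print(f"dirty buffers flushed on power loss: {flush_time:.2f} s")   # 0.24 s
```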

Failures: Datacenter
- Durability is guaranteed by the disks, but there is no availability
  - Modulo nuclear attacks
- No cross-datacenter replication in version 1
  - Latency can't be reconciled with consistency
  - Aggregate write bandwidth of a 1000-host RAMCloud:
    - 100 MB/s * 1000 ~ 1 Tbit/s
- Application-level replication will do much better
  - The application can batch writes
  - The application understands its consistency needs
- Is this something we need to support?

Summary
- Use scale in two ways to achieve availability
  - Scatter reads to overcome the disk bottleneck
  - Scatter rebuilding to overcome the CPU and network bottlenecks
- Scale drives lower latency
- Remaining issue: how do we get the information we need for recovery?
  - Every master recovery involves all backups

Discussion