Presentation on theme: "Fast Crash Recovery in RAMCloud" — Presentation transcript:

1 Fast Crash Recovery in RAMCloud

2 Motivation
The role of DRAM has been increasing
– Facebook used 150 TB of DRAM for 200 TB of disk storage
However, there are limitations
– DRAM is typically used as a cache
– Applications must worry about consistency and cache misses

3 RAMCloud
Keeps all data in RAM at all times
– Designed to scale to thousands of servers
– Hosts terabytes of data
– Provides low-latency (5-10 µs) access for small reads
Design goals
– High durability and availability, without compromising performance

4 Alternatives
Triple replication in RAM
– 3x cost and energy
– All replicas are lost on power failure
RAMCloud instead keeps one copy in RAM
– Plus two copies on disk
To achieve good availability
– Fast crash recovery (64 GB in 1-2 seconds)

5 RAMCloud Basics
Thousands of off-the-shelf servers
– Each with 64 GB of RAM
– With Infiniband NICs
Remote access below 10 µs

6 Data Model
Key-value store
Tables of objects
– Each object: a 64-bit ID, a byte array of up to 1 MB, and a 64-bit version number
No atomic updates across multiple objects
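The data model above is simple enough to sketch directly. The following is a minimal illustration (not RAMCloud's actual C++ implementation): a table maps 64-bit IDs to objects, and every write bumps the object's version number.

```python
from dataclasses import dataclass

MAX_BLOB = 1 << 20  # objects carry a variable-length byte array of up to 1 MB


@dataclass
class RamObject:
    object_id: int  # 64-bit identifier, unique within its table
    version: int    # 64-bit version number, incremented on every write
    data: bytes     # opaque byte array (<= 1 MB)


class Table:
    """One table: a map from object ID to object."""

    def __init__(self):
        self.objects = {}

    def write(self, object_id: int, data: bytes) -> int:
        assert len(data) <= MAX_BLOB
        old = self.objects.get(object_id)
        version = old.version + 1 if old else 1
        self.objects[object_id] = RamObject(object_id, version, data)
        return version

    def read(self, object_id: int) -> RamObject:
        return self.objects[object_id]
```

Note that updates touch exactly one object at a time; there is deliberately no multi-object transaction in this model.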

7 System Structure
A large number of storage servers; each server hosts
– A master, which manages objects in local DRAM and services requests
– A backup, which stores copies of objects from other masters on durable storage
A coordinator
– Manages configuration information and object locations
– Not involved in most requests

8 RAMCloud Cluster Architecture
(Diagram: clients talk to masters; masters replicate to backups; a coordinator oversees the cluster)

9 More on the Coordinator
Maps objects to servers in units of tablets
– A tablet holds a consecutive key range within a single table
For locality reasons
– Small tables are stored on a single server
– Large tables are split across servers
Clients cache tablet mappings so they can access servers directly
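The tablet mapping above amounts to a sorted-range lookup. This sketch (names and structure are illustrative, not the coordinator's real data structures) shows how a key is resolved to the server owning its tablet:

```python
import bisect


class Coordinator:
    """Illustrative tablet map: each tablet covers a consecutive key
    range within one table and lives on exactly one server."""

    def __init__(self):
        # table_id -> sorted list of (start_key, end_key, server)
        self.tablets = {}

    def add_tablet(self, table_id, start_key, end_key, server):
        bisect.insort(self.tablets.setdefault(table_id, []),
                      (start_key, end_key, server))

    def locate(self, table_id, key):
        """Binary-search the table's tablets for the range containing key."""
        ranges = self.tablets[table_id]
        i = bisect.bisect_right(ranges, (key, float("inf"), "")) - 1
        start, end, server = ranges[i]
        assert start <= key <= end
        return server
```

A small table is registered as a single tablet on one server; a large table is registered as several tablets on different servers, and the same `locate` call works for both.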

10 Log-structured Storage
Logging approach
– Each master logs data in its memory
– Log entries are forwarded to backup servers
– Backup servers buffer log entries in battery-backed memory
– Writes complete once all backup servers acknowledge
– A backup server flushes its buffer to disk when full
8 MB segments are the unit of logging, buffering, and I/O
Each server can handle 300K 100-byte writes/sec
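The write path just described can be sketched as follows. This is a simplified model, not RAMCloud's implementation: the master appends to its in-memory log, forwards the entry to its backups, and considers the write complete only after every backup acknowledges; each backup buffers entries and flushes a full 8 MB segment to disk in one large write.

```python
SEGMENT_BYTES = 8 * 1024 * 1024  # 8 MB segments for logging, buffering, and I/O


class Backup:
    """Buffers log entries in (battery-backed) memory and flushes a
    whole segment to disk once the buffer fills."""

    def __init__(self):
        self.buffered_bytes = 0
        self.flushed_segments = 0

    def append(self, entry: bytes) -> bool:
        self.buffered_bytes += len(entry)
        if self.buffered_bytes >= SEGMENT_BYTES:
            self.flushed_segments += 1   # one large sequential disk write
            self.buffered_bytes = 0
        return True                      # acknowledgment back to the master


class Master:
    def __init__(self, backups):
        self.log = []                    # the data itself lives in DRAM
        self.backups = backups

    def write(self, entry: bytes):
        self.log.append(entry)
        acks = [b.append(entry) for b in self.backups]
        assert all(acks)                 # complete only after all backups ack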

11 Recovery
When a server crashes, its DRAM contents must be reconstructed
A recovery time of 1-2 seconds is considered good enough

12 Using Scale
Simple triple-replication approach
– Recovery speed is limited by three disks
– 3.5 minutes to read 64 GB of data
Scatter the data over 1,000 disks instead
– Takes 0.6 seconds to read 64 GB
– But a centralized recovery master becomes the bottleneck
– At 10 Gbps, transferring 64 GB to that one master takes about 1 minute
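The numbers on this slide follow from simple bandwidth arithmetic. Assuming roughly 100 MB/s of sequential read bandwidth per disk (an assumption consistent with the slide's figures, not stated in it):

```python
DISK_BW = 100e6  # assumed per-disk read bandwidth, ~100 MB/s
DATA = 64e9      # 64 GB of lost DRAM contents

# Reading the backup copies from just 3 disks:
t_three_disks = DATA / (3 * DISK_BW)
print(f"{t_three_disks / 60:.1f} minutes")  # 3.6 minutes (slide: ~3.5)

# The same data scattered across 1,000 disks:
t_thousand = DATA / (1000 * DISK_BW)
print(f"{t_thousand:.2f} seconds")          # 0.64 seconds (slide: ~0.6)

# But one recovery master behind a 10 Gbps NIC is the new bottleneck:
t_network = DATA / (10e9 / 8)
print(f"{t_network / 60:.1f} minutes")      # 0.9 minutes (slide: ~1 minute)
```

Scattering fixes the disk bottleneck only to expose the network bottleneck at the single recovery master, which motivates the next slide.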

13 RAMCloud
Uses 100 recovery masters
– Cuts the recovery time down to about 1 second

14 Scattering Log Segments
Ideally uniform, but several details must be handled
– Avoid correlated failures
– Account for heterogeneous hardware
– Coordinate machines so buffers on individual machines do not overflow
– Account for server membership changes due to failures
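One way to picture these constraints is randomized placement with rejection. The sketch below is hypothetical (RAMCloud's actual placement logic is more sophisticated): candidates are drawn uniformly at random, but a candidate is rejected if it would correlate failures (same rack as the master or as another replica) or overflow that server's buffers.

```python
import random
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    rack: int
    capacity: int   # how many more segment replicas its buffers can accept
    buffered: int = 0


def choose_backups(servers, master_rack, replicas=2):
    """Pick backup servers for one segment's replicas roughly uniformly
    at random, rejecting picks that violate the placement constraints."""
    chosen, used_racks = [], {master_rack}
    while len(chosen) < replicas:
        s = random.choice(servers)
        if s.rack in used_racks:        # avoid correlated failures
            continue
        if s.buffered >= s.capacity:    # don't overflow this server's buffers
            continue
        s.buffered += 1
        used_racks.add(s.rack)
        chosen.append(s)
    return chosen
```

Heterogeneity and membership churn would show up here as per-server capacities that differ and as a `servers` list that changes between calls.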

15 Failure Detection
Periodic pings to random servers
– 99% chance of detecting a failed server within 5 rounds
Recovery proceeds in three phases
– Setup
– Replay
– Cleanup
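The 99%-within-5-rounds figure is consistent with a simple model (an assumption here, not spelled out on the slide): if each of the n-1 live servers pings one uniformly random other server per round, the chance that nobody pings a given dead server in a round is about 1/e, so five rounds miss it with probability about e^-5.

```python
n = 1000  # assumed cluster size, for illustration

# Probability that no live server pings the dead one in a single round:
p_miss_round = (1 - 1 / (n - 1)) ** (n - 1)   # approaches 1/e ~ 0.368

# Probability the failure is detected within 5 independent rounds:
p_detect_5 = 1 - p_miss_round ** 5
print(f"{p_detect_5:.4f}")                    # 0.9933, i.e. > 99%
```

The result is nearly independent of n, which is why a fixed round count gives a fixed detection guarantee as the cluster grows.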

16 Setup
Coordinator finds log segment replicas
– By querying all backup servers
Detecting incomplete logs
– Logs are self-describing
Starting partition recoveries
– Each master periodically uploads a "will" to the coordinator, to be carried out in the event of its demise
– On a crash, the coordinator carries out the will accordingly

17 Replay
Parallel recovery
Six stages of pipelining
– At segment granularity
– Same ordering of operations on segments to avoid pipeline stalls
Only the primary replicas are involved in recovery
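The ordering point can be illustrated with chained generators. The stage names below are placeholders, not the paper's actual six stages; the property being shown is only that every stage consumes and emits segments in the same fixed order, so no downstream stage stalls waiting for a segment an earlier stage reordered.

```python
# Placeholder stage names; the real pipeline's six stages are not listed
# on this slide.
STAGES = ["read", "verify", "partition", "transmit", "insert", "log"]


def stage(name, upstream):
    """One pipeline stage: works at segment granularity and preserves
    the arrival order of segments."""
    for seg_id in upstream:
        yield seg_id  # real per-segment work for this stage would go here


def replay(segment_ids):
    stream = iter(segment_ids)
    for name in STAGES:       # chain the stages into one pipeline
        stream = stage(name, stream)
    return list(stream)
```

Since generators pull lazily, a segment can be in stage k while the next segment is still in stage k-1, which is the pipelining the slide refers to.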

18 Cleanup
Bring the new master online
Free up segments from the previous crash

19 Consistency
Exactly-once semantics
– Implementation not yet complete
ZooKeeper handles coordinator failures
– A distributed configuration service
– With its own replication

20 Additional Failure Modes
Current focus
– Recovering DRAM contents after a single master failure
Failed backup server
– Need to know which segments were lost from that server
– Re-replicate those segments across the remaining disks

21 Multiple Failures
Multiple servers fail simultaneously
– Recover each failure independently
– Some recoveries will involve secondary replicas
Based on projection
– With 5,000 servers, recovering 40 masters within a rack takes about 2 seconds
Little can be done when many racks lose power at once

22 Cold Start
Complete power outage
– Backups contact the coordinator as they reboot
– Need a quorum of backups before starting to reconstruct masters
Current implementation does not perform cold starts

23 Evaluation
60-node cluster
Each node
– 16 GB RAM, 1 disk
– Infiniband (25 Gbps)
– User-level applications can talk to the NICs, bypassing the kernel

24 Results
Can recover lost data at 22 GB/s
A crashed server with 35 GB of data
– Can be recovered in 1.6 seconds
Recovery time stays nearly flat from 1 to 20 recovery masters, each talking to 6 disks
Going to 60 recovery masters adds only 10 ms of recovery time
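The 1.6-second figure is just the lost data divided by the aggregate recovery bandwidth:

```python
aggregate_bw = 22.0  # GB/s, measured aggregate recovery bandwidth
lost_data = 35.0     # GB stored on the crashed server

recovery_time = lost_data / aggregate_bw
print(f"{recovery_time:.1f} s")  # 1.6 s, matching the measured recovery time
```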

25 Results
Fast recovery significantly reduces the risk of data loss
– Assuming a recovery time of 1 second
– The risk of data loss for a 100K-node cluster is 10^-5 in one year
– A 10x improvement in recovery time improves reliability by 1,000x
– Assumes independent failures

26 Theoretical Recovery Speed Limit
Hard to go faster than a few hundred milliseconds
– 150 ms to detect the failure
– 100 ms to contact every backup
– 100 ms to read a single segment from disk
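Summing the unavoidable steps above gives the floor the slide alludes to:

```python
detect_ms = 150      # detect the failure
contact_ms = 100     # contact every backup
first_read_ms = 100  # read a single segment from disk

floor_ms = detect_ms + contact_ms + first_read_ms
print(floor_ms)  # 350: recovery cannot go much below a few hundred ms
```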

27 Risks
Scalability study based on a small cluster
May mistake performance glitches for failures
– Triggering unnecessary recoveries
Access patterns can change dynamically
– May lead to unbalanced load
