Free Recovery: A Step Towards Self-Managing State
Andy Huang and Armando Fox, Stanford University
ROC Retreat, Lake Tahoe, CA, January 2004

Persistent hash tables
[Figure: a three-tier Internet service (frontends, app servers, database) connected over a LAN, with a persistent hash table storing key/value pairs.]
Example hash table contents: key = Yahoo! user ID, value = user profile; key = ISBN, value = Amazon catalog metadata.

Two state management challenges

Failure handling:
- Consistency requirements → node recovery is costly and requires reliable failure detection.
- Relaxing internal consistency → fast, non-intrusive ("free") recovery.

System evolution:
- Large data sets → repartitioning is costly and requires good resource provisioning.
- Free recovery → automatic, online repartitioning.

DStore: an easy-to-manage, cluster-based persistent hash table for Internet services.

DStore architecture
[Figure: app servers link in Dlib and talk to bricks over a LAN.]
- Dlib: exposes the hash table API and acts as the "coordinator" for distributed operations.
- Brick: stores data, writing synchronously to disk.
(A sketch of this division of labor follows.)
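To make the Dlib/Brick split concrete, here is a minimal sketch of the interface a Dlib-style client library might expose. The class and method names follow the slide's vocabulary, but the signatures and the timestamp scheme are illustrative assumptions, not DStore's actual code.

    # Illustrative sketch only: names follow the slide, not DStore's real API.
    class Brick:
        """Storage node: keeps key -> (timestamp, value) and syncs writes to disk."""
        def __init__(self):
            self.store = {}

        def write(self, key, ts, value):
            self.store[key] = (ts, value)   # a real brick would also fsync here

        def read(self, key):
            return self.store.get(key)      # (timestamp, value) or None

    class Dlib:
        """Client-side library: exposes the hash table API and coordinates bricks."""
        def __init__(self, bricks):
            self.bricks = bricks

        def put(self, key, value):
            quorum_put(self.bricks, key, value)    # single-phase quorum write (sketched below)

        def get(self, key):
            return quorum_get(self.bricks, key)    # quorum read (sketched below)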

Focusing on recovery
Technique 1: Quorums.
- Write: send to all bricks and wait for a majority; read: read from a majority.
- It is OK if some bricks' data differs; a failure just means missing some writes, so the system tolerates brick inconsistency.
Technique 2: Single-phase writes.
- Two-phase commit: a failure between phases complicates the protocol, the second phase depends on a particular set of bricks, and the protocol relies on reliable failure detection.
- A single-phase quorum write can be completed by any majority of bricks, so no request relies on specific bricks and any brick can fail at any time.
Together, these techniques give simple, non-intrusive recovery. (A sketch of the quorum read/write path follows.)
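Below is a minimal sketch of the single-phase quorum operations described on this slide, assuming the Brick objects from the previous sketch: a write is sent to all bricks and succeeds once a majority acknowledges it, and a read collects replies and returns the value with the newest timestamp. The timestamp scheme and error handling are assumptions for illustration.

    import time

    def quorum_put(bricks, key, value):
        """Single-phase quorum write: send to all bricks, wait for a majority."""
        ts = time.time()                       # writer-assigned timestamp
        acks = 0
        for b in bricks:
            try:
                b.write(key, ts, value)
                acks += 1
            except Exception:
                pass                           # a failed brick simply misses this write
        if acks < len(bricks) // 2 + 1:
            raise IOError("write did not reach a majority of bricks")

    def quorum_get(bricks, key):
        """Quorum read: collect a majority of replies, return the newest value."""
        majority = len(bricks) // 2 + 1
        replies = []
        for b in bricks:
            try:
                r = b.read(key)
            except Exception:
                continue
            if r is not None:
                replies.append(r)
            if len(replies) >= majority:
                break
        if not replies:
            return None
        ts, value = max(replies, key=lambda r: r[0])   # newest timestamp wins
        return value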

Considering consistency
[Figure: Dlib Dl1 writes x=1 but fails after reaching only one of bricks B1-B3; Dlib Dl2 then reads x and gets 0 from a majority.]
- A Dlib failure can cause a partial write, violating the quorum property.
- If timestamps differ on a read, read-repair restores the majority invariant (a delayed commit).

Considering consistency (continued)
[Figure: Dl2's read observes the partially written x=1 and writes it back to bricks B1-B3.]
- A write-in-progress cookie can be used to detect partial writes and commit/abort them on the next read.
- An individual client's view of DStore is consistent with that of a single centralized server (as in Bayou).

Benchmark: Free recovery
[Graphs: throughput over time as a brick is killed and then recovers, under worst-case behavior (100% cache hit rate) and expected behavior (85% cache hit rate).]
Result: recovery is fast and non-intrusive.

Benchmark: Automatic failure detection
[Graphs: an injected fail-stutter fault is detected by Pinpoint under a modest policy (anomaly threshold = 8) and an aggressive policy (anomaly threshold = 5).]
Because false positives have a low cost, detection can be aggressive. (A sketch of such a threshold-based reboot policy follows.)
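Since a false positive only costs a cheap reboot, the detection policy can be tuned aggressively. The sketch below assumes a per-brick anomaly score produced by a Pinpoint-style monitor; get_anomaly_score and reboot_brick are hypothetical hooks, not real DStore functions.

    ANOMALY_THRESHOLD = 5   # aggressive policy; 8 corresponds to the modest policy

    def failure_detection_pass(bricks, get_anomaly_score, reboot_brick):
        """Reboot any brick whose anomaly score crosses the threshold."""
        for brick in bricks:
            if get_anomaly_score(brick) >= ANOMALY_THRESHOLD:
                # Free recovery makes false positives cheap, so just reboot.
                reboot_brick(brick)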

Online repartitioning
1. Take a brick offline.
2. Copy its data to a new brick.
3. Bring both bricks online.
To the rest of the system, it appears as if the brick simply failed and recovered. (A sketch of these steps follows.)
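The three steps above might look like the following sketch; take_offline, copy_data, and bring_online are hypothetical helpers standing in for DStore internals, and the offline period is handled exactly like a brick failure.

    def repartition(old_brick, new_brick, take_offline, copy_data, bring_online):
        take_offline(old_brick)            # 1. to the rest of the system, a failure
        copy_data(old_brick, new_brick)    # 2. copy data to the new brick
        bring_online(old_brick)            # 3. both bricks come online, as if recovered
        bring_online(new_brick)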

Benchmark: Automatic online repartitioning
[Graphs: scaling from 3 to 6 bricks under evenly distributed load, and from 6 to 12 bricks with a hotspot in the 01 partition, compared against a naive brick-selection policy.]
Results: brick selection is effective, and repartitioning is non-intrusive.

Next up for free recovery
- Perform online checkpoints: take the checkpointing brick offline, just like failure + recovery.
- See whether free recovery can simplify online data reconstruction after hard failures.
- Any other state management challenges you can think of?

Summary
DStore = Decoupled Storage: managed like a stateless Web farm.
Free recovery rests on two techniques:
- Quorums [spatial decoupling]. Cost: extra overprovisioning. Gain: fast, non-intrusive recovery.
- Single-phase operations [temporal decoupling]. Cost: temporarily violates the "majority" invariant. Gain: any brick can fail at any time.
Failure handling becomes fast and non-intrusive. Mechanism: simple reboot. Policy: aggressively reboot anomalous bricks.
System evolution becomes "plug-and-play." Mechanism: automatic, online repartitioning. Policy: dynamically add and remove nodes based on predicted load.

DStore: an easy-to-manage, cluster-based persistent hash table for Internet services.

ACID properties
- Atomicity: a put replaces the existing value and is atomic (multi-operation transactions and partial updates are not supported).
- Consistency: Jane's view of the hash table is consistent with that of a single centralized server (Bayou-style session guarantees):
  - Read your writes: Jane sees her own updates.
  - Monotonic reads: Jane won't read a value older than one she has read before.
  - Writes follow reads: Jane's writes are ordered after any writes (by any user) that Jane has read.
  - Monotonic writes: Jane's own writes are totally ordered.
- Isolation: there are no multi-operation transactions to isolate.
- Durability: updates are synced to disk on multiple servers.
(A sketch of client-side tracking for these session guarantees follows.)
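One common way to enforce the per-client guarantees listed above is for the client to remember the newest timestamp it has read or written for each key and reject older replies. The sketch below is an assumption about how this could be done, not a description of DStore's actual client; store_put and store_get are hypothetical quorum operations that expose timestamps.

    class SessionClient:
        """Tracks the newest timestamp seen per key for read-your-writes and monotonic reads."""
        def __init__(self, store_put, store_get):
            self.store_put = store_put          # (key, value) -> timestamp
            self.store_get = store_get          # key -> (timestamp, value)
            self.last_seen = {}

        def put(self, key, value):
            ts = self.store_put(key, value)
            self.last_seen[key] = max(ts, self.last_seen.get(key, ts))

        def get(self, key, retries=3):
            for _ in range(retries):
                ts, value = self.store_get(key)
                if ts >= self.last_seen.get(key, ts):
                    self.last_seen[key] = ts
                    return value                # never older than what we have already seen
            raise IOError("could not read a sufficiently fresh value")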

Summary (detailed)
- Quorums = spatial decoupling (between nodes). Gain: fast, non-intrusive recovery. Cost: overprovisioning for quorum replication.
- Single-phase operations = temporal decoupling. Gain: any brick can fail at any time. Cost: temporary violation of the quorum majority invariant.
- Free recovery addresses the challenges:
  - Handling failures: fail anytime, recover quickly and non-intrusively.
  - System evolution: plug-and-play nodes via automatic, online repartitioning.
  - Failure detection: aggressive (low false-positive cost).
  - Resource provisioning: dynamic (low repartitioning cost).
- Resulting system: can be managed like a stateless Web farm.

Algorithm: Wavering reads
[Figure: client C1 writes x=1 to replicas R1-R3 but fails before completion; client C2's subsequent reads return different values.]
- No two-phase commit: it complicates recovery and introduces coupling.
- C1 attempts a write but fails before completion.
- The quorum property is violated: reading a majority no longer guarantees that the latest value is returned.
- Result: wavering reads.

Algorithm: Read writeback
[Figure: C2 reads the partially written x=1 from R1-R3 and writes it back to a majority.]
- Idea: commit a partial write when it is first read.
- Commit point: reads before it return x=0; reads after it return x=1.
- Proven linearizable under the fail-stop model.
(A sketch of read writeback follows.)
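A minimal sketch of read writeback, assuming the Brick objects from the earlier sketches: the reader takes the newest timestamped value it sees and writes it back to any brick holding something older, which is what commits the partial write at read time. For simplicity this sketch contacts all bricks rather than just a majority.

    def read_with_writeback(bricks, key):
        """Quorum read that repairs partial writes (read writeback / read-repair)."""
        replies = {b: b.read(key) for b in bricks}          # brick -> (ts, value) or None
        present = [r for r in replies.values() if r is not None]
        if not present:
            return None
        newest_ts, newest_value = max(present, key=lambda r: r[0])
        for b, r in replies.items():
            if r is None or r[0] < newest_ts:
                b.write(key, newest_ts, newest_value)       # "delayed commit" of the partial write
        return newest_value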

Algorithm: Crash recovery
[Figure: C1's partial write of x=1 lingers on one replica; a later read by C2 commits it long after the write was issued.]
- Fail-stop is not an accurate model: it implies that the client that generated the request fails permanently.
- With writeback, the commit point occurs at some time in the future.
- A writer expects its request to succeed or fail, not to remain "in progress."

Algorithm: Write in-progress
[Figure: C1 records a cookie around its write of x=1 to R1-R3; C2's next read notices the unmatched cookie and reads all replicas.]
- Requirement: a partial write must be committed or aborted on the next read.
- Record "write in-progress" on the client:
  - On submit: write a "start" cookie.
  - On return: write an "end" cookie.
  - On read: if a "start" cookie has no matching "end," read from all bricks.
(A sketch of the cookie protocol follows.)
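The cookie protocol above could look like the following sketch, reusing quorum_put, quorum_get, and read_with_writeback from the earlier sketches. Keeping the cookie log in an in-memory dict is an illustrative assumption; the point is only that an unmatched "start" cookie forces the next read to consult all bricks and commit or abort the partial write.

    class CookieClient:
        """Client that records a write-in-progress cookie around each put."""
        def __init__(self, bricks):
            self.bricks = bricks
            self.cookies = {}                   # key -> "start" or "end"

        def put(self, key, value):
            self.cookies[key] = "start"         # on submit: write "start" cookie
            quorum_put(self.bricks, key, value)
            self.cookies[key] = "end"           # on return: write "end" cookie

        def get(self, key):
            if self.cookies.get(key) == "start":
                # "start" with no matching "end": a write may be in progress,
                # so read all bricks and commit/abort it via writeback.
                value = read_with_writeback(self.bricks, key)
                self.cookies[key] = "end"
                return value
            return quorum_get(self.bricks, key)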

Focusing on recovery (backup)
- Technique 1: Quorums. Write to at least a majority and read from a majority; a failure just means missing a few writes, so recovery is simple and non-intrusive.
- Technique 2: Decouple in time (i.e., between requests) using single-phase operations. Lazy read-repair handles Dlib failures, no request relies on a specific set of replicas, and it is safe for any node to fail at any time.