Free Recovery: A Step Towards Self-Managing State
Andy Huang and Armando Fox, Stanford University
ROC Retreat, Lake Tahoe, CA, January 2004. © 2004 Andy Huang.


Slide 1: Free Recovery: A Step Towards Self-Managing State
Andy Huang and Armando Fox, Stanford University

Slide 2: Persistent hash tables
[Diagram: frontends and app servers connected to a DB over a LAN, with the hash table in the storage tier]
Example key/value pairs:
- Key: Yahoo! user ID; Value: user profile
- Key: ISBN; Value: Amazon catalog metadata
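
As a rough illustration of the interface such a hash table exposes, here is a minimal in-memory sketch of a get/put client; the class and method names are assumptions, and real DStore replicates each operation across bricks over a LAN rather than using a local dict.

```python
# Minimal in-memory stand-in for the hash table API a DStore-style
# client exposes. Class and method names are illustrative only.

class HashTableClient:
    def __init__(self):
        self._table = {}

    def put(self, key, value):
        # In DStore this would be a single-phase quorum write to the bricks.
        self._table[key] = value

    def get(self, key):
        # In DStore this would be a majority quorum read.
        return self._table.get(key)

client = HashTableClient()
client.put("user:jane77", {"name": "Jane", "theme": "dark"})   # Yahoo!-style user profile
client.put("isbn:0201633612", {"title": "Design Patterns"})    # Amazon-style catalog metadata
print(client.get("user:jane77"))
```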

Slide 3: Two state management challenges
Failure handling:
- Consistency requirements → node recovery is costly and requires reliable failure detection
- Relaxing internal consistency → fast, non-intrusive ("free") recovery
System evolution:
- Large data sets → repartitioning is costly, so it demands good resource provisioning up front
- Free recovery → automatic, online repartitioning
DStore: an easy-to-manage cluster-based persistent hash table for Internet services

Slide 4: DStore architecture
[Diagram: app servers, each linked with a Dlib, connected to bricks over a LAN]
- Dlib: exposes the hash table API and acts as the "coordinator" for distributed operations
- Brick: stores data by writing synchronously to disk
DStore: an easy-to-manage cluster-based persistent hash table for Internet services
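
A minimal sketch of the brick's synchronous write path, assuming an append-only log that is forced to disk before the brick acknowledges; the log format, file path, and names are invented for illustration.

```python
import json
import os

# Sketch of a brick's put handler: the update is durable on disk before
# the brick acknowledges. Log format and class name are hypothetical.

class Brick:
    def __init__(self, log_path):
        self._log = open(log_path, "ab")

    def put(self, key, value, ts):
        record = json.dumps({"k": key, "v": value, "ts": ts}) + "\n"
        self._log.write(record.encode())
        self._log.flush()
        os.fsync(self._log.fileno())   # synchronous write: no ack until durable
        return "ack"

brick = Brick("/tmp/brick0.log")
print(brick.put("user:jane77", {"theme": "dark"}, ts=1))
```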

Slide 5: Focusing on recovery
Technique 1: Quorums (tolerant of brick inconsistency)
- Write: send to all bricks and wait for a majority to acknowledge
- Read: read from a majority
- It's OK if some bricks' data differs; a brick failure just means missing some writes
Technique 2: Single-phase writes (no request relies on specific bricks)
- 2PC: a failure between phases complicates the protocol, the 2nd phase depends on a particular set of bricks, and the protocol relies on reliable failure detection
- A single-phase quorum write can be completed by any majority of bricks, so any brick can fail at any time
Result: simple, non-intrusive recovery (sketched below)
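
Here is a self-contained sketch of single-phase quorum writes and majority reads, with in-process stubs standing in for networked bricks; timestamps order versions, and all names are illustrative.

```python
# Single-phase quorum write/read over in-process brick stubs. In real
# DStore the Dlib sends these requests over a LAN.

class BrickStub:
    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, ts, value):
        cur = self.store.get(key)
        if cur is None or ts > cur[0]:   # keep only the newest version
            self.store[key] = (ts, value)
        return True                      # ack

    def read(self, key):
        return self.store.get(key)

def quorum_write(bricks, key, ts, value):
    # Send to all; succeed once a majority acks. A brick that is down
    # simply doesn't count toward the majority -- no second phase needed.
    acks = sum(1 for b in bricks if b.write(key, ts, value))
    return acks > len(bricks) // 2

def quorum_read(bricks, key):
    # Read from any majority; return the value with the highest timestamp.
    replies = [b.read(key) for b in bricks[: len(bricks) // 2 + 1]]
    replies = [r for r in replies if r is not None]
    if not replies:
        return None
    return max(replies, key=lambda r: r[0])[1]

bricks = [BrickStub(), BrickStub(), BrickStub()]
quorum_write(bricks, "x", ts=1, value=0)
print(quorum_read(bricks, "x"))  # -> 0
```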

Slide 6: Considering consistency
[Diagram: Dlib 1 issues write(1) on x = 0 across bricks B1, B2, B3 but fails partway through; Dlib 2's subsequent read returns 0 or 1 depending on which bricks it contacts]
- A Dlib failure can cause a partial write, violating the quorum property
- If timestamps differ on a read, read-repair restores the majority invariant (a delayed commit)

Slide 7: Considering consistency
[Diagram: Dlib 2 reads x across bricks B1, B2, B3, detects Dlib 1's partial write(1), and writes the value back]
- A write-in-progress cookie can be used to detect partial writes and commit/abort them on the next read
- An individual client's view of DStore is consistent with that of a single centralized server (the Bayou session guarantees)

Slide 8: Benchmark: Free recovery
[Plots: throughput over time, annotated where a brick is killed and where recovery occurs; worst-case behavior at a 100% cache hit rate vs. expected behavior at an 85% cache hit rate]
Result: recovery is fast and non-intrusive

Slide 9: Benchmark: Automatic failure detection
[Plots: a modest policy (anomaly threshold = 8) vs. an aggressive policy (anomaly threshold = 5), with fail-stutter behavior marked]
- False positives: low cost
- Fail-stutter: detected by Pinpoint
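
What such a threshold policy might look like in code is sketched below; the per-brick anomaly counts would come from a statistical monitor such as Pinpoint, and the class, method names, and scoring scheme are all assumptions.

```python
from collections import defaultdict

# Hypothetical threshold-based reboot policy: because recovery is free,
# a false positive only costs a cheap reboot, so an aggressive (low)
# threshold is reasonable.

class RebootPolicy:
    def __init__(self, threshold=5):      # 5 = aggressive, 8 = modest
        self.threshold = threshold
        self.counts = defaultdict(int)    # brick -> anomaly count

    def report_anomaly(self, brick_id):
        self.counts[brick_id] += 1
        if self.counts[brick_id] >= self.threshold:
            self.counts[brick_id] = 0
            return f"reboot {brick_id}"   # non-intrusive thanks to free recovery
        return None

policy = RebootPolicy(threshold=5)
action = None
for _ in range(5):
    action = policy.report_anomaly("brick3")
print(action)  # -> reboot brick3
```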

Slide 10: Online repartitioning
1. Take the brick offline
2. Copy its data to a new brick
3. Bring both bricks online
[Diagram: the key space, labeled with bit-string partitions such as 01, being split between the old and new bricks]
To the rest of the system, it appears as if the brick just failed and recovered.
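
Assuming keys map to bricks by their longest matching bit-prefix (the 01-style labels in the diagram), the repartitioning step can be sketched as below; the partition map, function names, and prefix scheme are illustrative guesses.

```python
# Hypothetical prefix-based partition map: splitting partition "1"
# appends one bit, giving "10" (old brick) and "11" (new brick).

def brick_for(key_bits, pmap):
    # Longest matching bit-prefix wins (assumed scheme).
    prefix = max((p for p in pmap if key_bits.startswith(p)), key=len)
    return pmap[prefix]

def split_partition(pmap, prefix, new_brick):
    old_brick = pmap.pop(prefix)    # 1. take the brick offline
    # 2. copy the partition's data to new_brick (elided); quorum reads
    #    mask the offline brick, as if it had failed and recovered
    pmap[prefix + "0"] = old_brick  # 3. bring both bricks online,
    pmap[prefix + "1"] = new_brick  #    each owning half the key space

pmap = {"0": "brickA", "1": "brickB"}
split_partition(pmap, "1", "brickC")
print(brick_for("101", pmap))  # -> brickB (prefix "10")
print(brick_for("110", pmap))  # -> brickC (prefix "11")
```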

Slide 11: Benchmark: Automatic online repartitioning
[Plots: growing from 3 to 6 bricks under evenly-distributed load, and from 6 to 12 bricks with a hotspot in the 01 partition, compared against a naive brick-selection policy]
- Brick selection: effective
- Repartitioning: non-intrusive

Slide 12: Next up for free recovery
- Perform online checkpoints
  - Take the checkpointing brick offline
  - Just like failure + recovery
- See whether free recovery can simplify online data reconstruction after hard failures
- Any other state management challenges you can think of?

Slide 13: Summary
Free recovery in DStore (Decoupled Storage), managed like a stateless Web farm:
- Quorums [spatial decoupling]
  - Cost: extra overprovisioning
  - Gain: fast, non-intrusive recovery
- Single-phase ops [temporal decoupling]
  - Cost: temporarily violates the "majority" invariant
  - Gain: any brick can fail at any time
Failure handling → fast and non-intrusive
- Mechanism: simple reboot
- Policy: aggressively reboot anomalous bricks
System evolution → "plug-and-play"
- Mechanism: automatic, online repartitioning
- Policy: dynamically add and remove nodes based on predicted load

Slide 14: DStore: an easy-to-manage cluster-based persistent hash table for Internet services
Contact: andy.huang@stanford.edu

Slide 15: ACID properties
- Atomicity: a put replaces the existing value and is atomic (multi-operation transactions and partial updates are not supported)
- Consistency: Jane's view of the hash table is consistent with that of a single centralized server (the Bayou session guarantees):
  - Read your writes: Jane sees her own updates
  - Monotonic reads: Jane won't read a value older than one she's read before
  - Writes follow reads: Jane's writes are ordered after any writes (by any user) that Jane has read
  - Monotonic writes: Jane's own writes are totally ordered
- Isolation: no multi-operation transactions to isolate
- Durability: updates are synced to disk on multiple servers
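
One common way to enforce the first two session guarantees, used by Bayou-style systems, is for the client to track the highest timestamp it has seen; whether DStore does exactly this is an assumption, and all names below are invented.

```python
# Sketch: read-your-writes and monotonic reads via a per-session
# timestamp floor. FakeStore is an in-memory stand-in for DStore.

class FakeStore:
    def __init__(self):
        self.data, self.clock = {}, 0

    def put(self, key, value):
        self.clock += 1
        self.data[key] = (self.clock, value)
        return self.clock                 # the write's timestamp

    def get(self, key):
        return self.data[key]             # (timestamp, value)

class Session:
    def __init__(self, store):
        self.store = store
        self.floor = 0                    # highest timestamp seen so far

    def put(self, key, value):
        # Read-your-writes: remember the timestamp of our own update.
        self.floor = max(self.floor, self.store.put(key, value))

    def get(self, key):
        ts, value = self.store.get(key)
        if ts < self.floor:
            # A real client would retry against a different majority.
            raise RuntimeError("stale replica; retry elsewhere")
        self.floor = max(self.floor, ts)  # monotonic reads
        return value

s = Session(FakeStore())
s.put("x", 42)
print(s.get("x"))  # -> 42, never older than the session's own write
```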

Slide 16: Summary
- Quorums = spatial decoupling (between nodes)
  - Gain: fast, non-intrusive recovery
  - Cost: overprovisioning for quorum replication
- Single-phase operations = temporal decoupling
  - Gain: any brick can fail at any time
  - Cost: temporary violation of the quorum majority invariant
- Free recovery addresses the challenges:
  - Handling failures → fail anytime; recover quickly and non-intrusively
  - System evolution → plug-and-play nodes via automatic, online repartitioning
  - Failure detection → aggressive (low false-positive cost)
  - Resource provisioning → dynamic (low repartitioning cost)
- Resulting system: can be managed like a stateless Web farm

Slide 17: Algorithm: Wavering reads
- No two-phase commit (it complicates recovery and introduces coupling)
- C1 attempts a write but fails before completion
- The quorum property is violated: reading a majority no longer guarantees the latest value is returned
- Result: wavering reads
[Diagram: C1's write(1) on x = 0 reaches only replica R1; C2's successive majority reads over R1, R2, R3 return 1, then 0]

Slide 18: Algorithm: Read writeback
- Idea: commit a partial write when it is first read
- Commit point: reads before it return x = 0; reads after it return x = 1
- Proven linearizable under the fail-stop model
[Diagram: C2 reads, sees C1's partial write(1) on R1, and writes the new value back to R2 and R3]
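
A self-contained sketch of read-writeback: the reader contacts a majority, takes the version with the highest timestamp, and writes it back to any stale replicas it saw, committing the partial write. Replica stubs are plain dicts and all names are illustrative.

```python
# Each replica stub maps key -> (timestamp, value). C1's partial
# write(1) at ts=2 reached only replica 0; the others still hold (1, 0).
replicas = [{"x": (2, 1)}, {"x": (1, 0)}, {"x": (1, 0)}]

def read_with_writeback(replicas, key):
    majority = replicas[: len(replicas) // 2 + 1]
    replies = [(r, r.get(key)) for r in majority]
    ts, value = max((v for _, v in replies if v), key=lambda v: v[0])
    for rep, v in replies:
        if v is None or v[0] < ts:
            rep[key] = (ts, value)   # writeback: the delayed commit point
    return value

print(read_with_writeback(replicas, "x"))  # -> 1
print(replicas[1]["x"])  # -> (2, 1): the majority invariant is restored
```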

Slide 19: Algorithm: Crash recovery
- Fail-stop is not an accurate model: it implies that the client that generated the request fails permanently
- With writeback, the commit point occurs at some point in the future
- A writer expects its request to succeed or fail, not to remain "in-progress"
[Diagram: C1 crashes during write(1), recovers, and reads 0, even though writeback may later commit the 1]

Slide 20: Algorithm: Write in-progress
- Requirement: a partial write must be committed or aborted on the next read
- Record "write in-progress" on the client (sketched below):
  - On submit: write a "start" cookie
  - On return: write an "end" cookie
  - On read: if a "start" cookie has no matching "end," read from all replicas
[Diagram: C1 crashes mid-write(1); its unmatched start cookie forces the next read to contact R1, R2, and R3 and resolve the partial write]
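
The client side of the cookie protocol might look like the sketch below; the cookie table stands in for the client's stable storage, the quorum stub is trivial, and every name is an assumption.

```python
# Hypothetical write-in-progress cookie protocol: bracket each write
# with start/end markers so a crash mid-write is detected on the next read.

class QuorumStub:
    """Single-node stand-in with the calls the client needs."""
    def __init__(self): self.data = {}
    def write(self, key, ts, value): self.data[key] = (ts, value)
    def read_majority(self, key): return self.data.get(key, (0, None))[1]
    def read_all_and_resolve(self, key):
        # A real implementation reads every replica and commits or
        # aborts the partial write; the stub just returns the value.
        return self.data.get(key, (0, None))[1]

class CookieClient:
    def __init__(self, quorum):
        self.quorum = quorum
        self.cookies = {}   # key -> "start" | "end" (stands in for local disk)

    def put(self, key, ts, value):
        self.cookies[key] = "start"    # on submit
        self.quorum.write(key, ts, value)
        self.cookies[key] = "end"      # on return

    def get(self, key):
        if self.cookies.get(key) == "start":
            # Unmatched start cookie: the write may be partial, so read
            # all replicas and commit/abort before answering.
            value = self.quorum.read_all_and_resolve(key)
            self.cookies[key] = "end"
            return value
        return self.quorum.read_majority(key)

c = CookieClient(QuorumStub())
c.put("x", 1, 42)
print(c.get("x"))  # -> 42 via the fast majority path
```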

Slide 21: Focusing on recovery
- Technique 1: Quorums
  - Write to ≥ a majority; read from a majority
  - Failure = missing a few writes
  - Simple, non-intrusive recovery
- Technique 2: Decouple in time (i.e., between requests) using single-phase operations
  - Lazy read-repair handles Dlib failures
  - No request relies on a specific set of replicas
  - Safe for any node to fail at any time
[Diagram: a Dlib coordinating a set of bricks]

