Slide 1: DStore: An Easy-to-Manage Persistent State Store
Andy Huang and Armando Fox, Stanford University
ROC Retreat, June 2004. © 2004 Andy Huang

Slide 2: Outline
- Project overview
- Consistency guarantees
- Failure detection
- Benchmarks
- Next steps and bigger picture

Slide 3: Background: Scalable CHTs
[Diagram: frontends and app servers talking to a storage cluster over a LAN]
- Cluster hash tables (CHTs) are an underlying storage layer for single-key-lookup data
- Examples:
  - Yahoo! user profiles
  - Amazon catalog metadata
  - Inktomi: wordID → docID list, docID → document metadata
  - DDS/Ninja: atomic compare-and-swap
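
To make the CHT abstraction concrete, here is a minimal Python sketch of hash partitioning for single-key-lookup data; the names (SimpleCHT, replica_groups) are hypothetical, not from the talk:

    import hashlib

    class SimpleCHT:
        """Each key hashes to one replica group of bricks, so a lookup
        touches a single group regardless of cluster size."""
        def __init__(self, replica_groups):
            self.replica_groups = replica_groups  # e.g., [[b0, b1, b2], [b3, b4, b5]]

        def owners(self, key):
            digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
            return self.replica_groups[digest % len(self.replica_groups)]

Reads and writes for a key go only to owners(key), which is what lets a CHT scale by adding groups.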

Slide 4: DStore: An easy-to-manage CHT
Goal: manage state like stateless frontends.

Challenges:
- Capacity planning: high scaling costs necessitate accurate load prediction
- Failure detection: fast detection is at odds with accurate detection
- Cheap recovery: must have a predictably fast and predictably small impact on availability/performance

Benefits:
- Our online repartitioning algorithm lowers scaling cost, so reactive scaling can adjust capacity to match current load
- Cheap recovery lowers the cost of acting on a false positive, so effective failure detection is not contingent on accuracy

Slide 5: Cheap recovery: Principles and costs
Trade storage and consistency for cheap recovery.

Techniques:
- Single-phase writes: no locking and no transactional logging
- Quorums: no recovery code to freeze writes and copy missed updates
  - Write: send to all bricks, wait for a majority
  - Read: read from a majority

Costs:
- Sacrifice some consistency: well-defined guarantees that provide consistent ordering
- Higher replication factor: 2N+1 bricks to tolerate N failures (vs. N+1 in read-one/write-all, ROWA)
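
As a rough sketch of the quorum technique above (illustrative Python, not DStore's code; Brick and QuorumStore are hypothetical names), a write goes to all bricks and succeeds once a majority acknowledges, and a read consults a majority and takes the newest timestamp:

    import time

    class Brick:
        """One replica: stores a (value, timestamp) pair per key."""
        def __init__(self):
            self.store = {}

        def write(self, key, value, ts):
            self.store[key] = (value, ts)
            return True

        def read(self, key):
            return self.store.get(key)  # (value, ts) or None

    class QuorumStore:
        """2N+1 bricks tolerate N failures; a majority is N+1."""
        def __init__(self, bricks):
            self.bricks = bricks
            self.majority = len(bricks) // 2 + 1

        def put(self, key, value):
            """Single-phase write: send to all, wait for a majority of acks."""
            ts = time.time()
            acks = 0
            for brick in self.bricks:
                try:
                    if brick.write(key, value, ts):
                        acks += 1
                except Exception:
                    pass  # an unreachable brick simply doesn't ack
            return acks >= self.majority

        def get(self, key):
            """Read from a majority; the newest timestamp wins."""
            replies = []
            for brick in self.bricks:
                try:
                    entry = brick.read(key)
                    if entry is not None:
                        replies.append(entry)
                except Exception:
                    pass
                if len(replies) >= self.majority:
                    break
            if not replies:
                return None
            value, _ = max(replies, key=lambda vt: vt[1])
            return value

Note there is no prepare phase and no log: a crashed writer leaves at worst a partial write, which the read path cleans up (slides 10 and 11).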

Slide 6: Nothing new under the sun, but…

Technique           | Prior work                                                   | DStore
CHT                 | Scalable performance                                         | Ease of management
Quorums             | Availability during network partitions and Byzantine faults | Availability during failures and recovery
Relaxed consistency | Availability and performance while nodes are unavailable    |
Result              | High availability and performance (end goal)                | Cheap recovery (but that's just the start…)

Slide 7: Cheap recovery simplifies state management

Challenge             | Prior work                                                          | DStore
Failure detection     | Difficult to make fast and accurate                                 | Effective even if it is not highly accurate
Online repartitioning | Relatively new area [Aqueduct]                                      | Duration and impact are predictably small
Capacity planning     | Predict future load                                                 | Scale reactively based on current load
Data reconstruction   | [RAID]                                                              | [Future work]
Result                | State management is costly (administration- and availability-wise) | Manage state with techniques used for stateless frontends

Slide 8: Outline
✓ Project overview
- Consistency guarantees
- Failure detection
- Benchmarks
- Next steps and bigger picture

Slide 9: Consistency guarantees
Usage model:
1. A client issues a request
2. The request is forwarded to a random Dlib
3. The Dlib issues a quorum read/write on the bricks
Assumption: clients share data, but otherwise act independently.

Guarantee: for a key k, DStore enforces a global order of operations that is consistent with the order seen by individual clients.

Client C1 issues w1(k, v_new) to replace the current hash table entry (k, v_old):
- w1 returns SUCCESS: subsequent reads return v_new
- w1 returns FAIL: subsequent reads return v_old
- w1 returns UNKNOWN (due to Dlib failure): two cases, covered on the next two slides
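
The three return values can be summarized in a small sketch (hypothetical and illustrative; the slides don't give the Dlib's actual decision logic):

    from enum import Enum

    class WriteResult(Enum):
        SUCCESS = 1  # a majority acknowledged: subsequent reads return v_new
        FAIL = 2     # the write was not applied: subsequent reads return v_old
        UNKNOWN = 3  # the Dlib failed mid-write: the write may be partial

    def classify_write(acks, num_bricks, dlib_failed):
        """Map the outcome of a single-phase quorum write onto DStore's
        three return values (assumed mapping, for illustration only)."""
        if dlib_failed:
            return WriteResult.UNKNOWN
        majority = num_bricks // 2 + 1
        return WriteResult.SUCCESS if acks >= majority else WriteResult.FAIL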

Slide 10: Case 1: Another user U2 performs a read
[Diagram: U1's write w1(k1, v_new) reaches only a minority of bricks before the Dlib fails; U2's reads r(k1) may return v_old or v_new, and a repair write completes the partial write]
- A Dlib failure can cause a partial write, violating the quorum property
- If the timestamps in a read quorum differ, read-repair writes the newest value back to a majority, restoring the majority invariant (a delayed commit)
- U2's read r(k1) returns either:
  - v_old, in which case no user has read v_new, or
  - v_new, in which case no user will later read v_old
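
A minimal sketch of read-repair, building on the QuorumStore sketch above (assumed mechanics, not DStore's implementation):

    def get_with_read_repair(store, key):
        """Quorum read that repairs stale replicas: if timestamps in the
        read quorum differ, write the newest value back (delayed commit)."""
        replies = []  # (brick, value, ts) triples
        for brick in store.bricks:
            try:
                entry = brick.read(key)
                if entry is not None:
                    replies.append((brick, entry[0], entry[1]))
            except Exception:
                pass
            if len(replies) >= store.majority:
                break
        if not replies:
            return None
        _, newest_value, newest_ts = max(replies, key=lambda r: r[2])
        for brick, _, ts in replies:
            if ts != newest_ts:
                brick.write(key, newest_value, newest_ts)  # restore the majority invariant
        return newest_value

Because the repair happens on the read path, the cost of completing a partial write is paid lazily, and only for keys that are actually read.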

Slide 11: Case 2: U1 performs a read
[Diagram: after U1's partial write w1(k1, v_new), U1's own read r(k1) triggers a write that finishes the commit]
- A write-in-progress cookie can be used to detect partial writes and commit or abort them on the next read
- On U1's read r(k1), the write is immediately committed or aborted, so all future readers see either v_old or v_new
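
One way to realize the cookie, sketched on top of the earlier code (heavily assumed: the slides don't say where the cookie is stored, so this version keeps it in the client-side library):

    class ClientSession:
        """Write-in-progress cookie sketch: remember keys whose write
        outcome is UNKNOWN and resolve them on the next read."""
        def __init__(self, store):
            self.store = store
            self.pending = set()  # keys with a possibly-partial write

        def put(self, key, value):
            self.pending.add(key)
            ok = self.store.put(key, value)  # single-phase quorum write
            if ok:
                self.pending.discard(key)    # definite SUCCESS: cookie not needed
            return ok

        def get(self, key):
            if key in self.pending:
                self.pending.discard(key)
                # The earlier write may be partial: a read-repair read either
                # commits it (the newest value reaches a majority) or leaves
                # the old value in force, effectively aborting it.
                return get_with_read_repair(self.store, key)
            return self.store.get(key)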

Slide 12: Consistency guarantees
Client C1 issues w1(k, v_new) to replace the current hash table entry (k, v_old):
- w1 returns SUCCESS: subsequent reads return v_new
- w1 returns FAIL: subsequent reads return v_old
- w1 returns UNKNOWN (due to Dlib failure):
  - If U1 reads: w1 is immediately committed or aborted
  - If U2 reads: if v_old is returned, no user has read v_new; if v_new is returned, no user will later read v_old

Slide 13: Versus sequential consistency
[Diagram: U1 issues w1(k1, v_new), which is left partial, then w2(k2, v_new); U2 reads r1(k2) = v_new, then r2(k1) = v_old, and later r3(k1) = v_new]
- The result of w2 is seen before the result of w1
- Sequential consistency requires both atomicity and consistent ordering
- An UNKNOWN result means writes are not atomic, so DStore provides consistent ordering rather than full sequential consistency

Slide 14: Two-phase commit vs. single-phase writes

Property     | Two-phase commit                                     | Single-phase writes
Consistency  | Sequential consistency                               | Consistent ordering
Recovery     | Read log to complete in-progress transactions        | No special-case recovery
Availability | Locking may cause requests to block during failures  | No locking
Performance  | 2 synchronous log writes, 2 round trips              | 1 synchronous update, 1 round trip
Other costs  | None                                                 | Read-repair (spreads out the cost of 2PC to make the common case faster); write-in-progress cookie (spreads out the responsibility of 2PC)

Slide 15: Recovery behavior
[Benchmark graph: recovery is predictably fast with a predictably small impact, even with bricks run at 100% capacity; in practice, systems typically run at 60-70% of maximum utilization]

Slide 16: Application-generic failure detection
- Bricks send operating statistics (CPU load, requests processed, etc.) to a beacon listener
- Statistical techniques (median absolute deviation, the Tarzan algorithm) flag anomalies
- When a brick's anomaly score exceeds a threshold, the brick is rebooted
- Simple detection techniques "work" because the resolution mechanism is cheap
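
A minimal sketch of the median-absolute-deviation check (the MAD statistic is standard; the threshold value and the exact statistic DStore compares are assumptions, since the slide says only "> threshold"):

    import statistics

    def mad_scores(stat_by_brick):
        """Score each brick by its distance from the cluster median, in
        units of the median absolute deviation (MAD)."""
        values = list(stat_by_brick.values())
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values) or 1e-9  # guard /0
        return {brick: abs(v - med) / mad for brick, v in stat_by_brick.items()}

    THRESHOLD = 5.0  # hypothetical value

    def monitor(stat_by_brick, reboot):
        """Reboot any brick whose anomaly score crosses the threshold;
        cheap recovery makes acting on false positives affordable."""
        for brick, score in mad_scores(stat_by_brick).items():
            if score > THRESHOLD:
                reboot(brick)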

Slide 17: Failure detection and repartitioning behavior
[Benchmark graph: aggressive failure detection plus online repartitioning under a fail-stutter fault, showing low scaling cost and a low cost of acting on false positives]

Slide 18: Bigger picture: What is "self-managing"?
- Indicator: a sign of system health (here, brick performance)
- Monitoring: tests for potential problems
- Treatment: a low-impact resolution mechanism (here, reboot)

Slide 19: Bigger picture: What is "self-managing"?
Indicators: brick performance, system load, disk failures

Slide 20: Bigger picture: What is "self-managing"?
- Indicators and treatments: brick performance → reboot; system load → repartition; disk failures → reconstruction
- Key: low-cost mechanisms, simple detection mechanisms and policies, constant "recovery"

Slide 21: Nothing new under the sun, but… (repeats slide 6)

Slide 22: Cheap recovery simplifies state management (repeats slide 7)

Slide 23: Two-phase commit vs. single-phase writes (repeats slide 14)

Slide 24: Bigger picture
- Brick performance → reboot
- System load → repartition
- Disk failures → reconstruct

Slide 25: Big picture
- Use simple metrics to trigger scaling:
  - Brick load
  - Cache hit rate
- Online data reconstruction
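
As an illustration of simple metrics driving reactive scaling, here is a tiny hypothetical trigger (the two metrics come from the slide; the function and thresholds are illustrative, not from the talk):

    def should_scale_out(avg_brick_load, cache_hit_rate,
                         load_limit=0.7, hit_floor=0.9):
        """Fire when bricks are too loaded or the cache stops absorbing reads.
        When it fires, a new brick is brought up and online repartitioning
        moves part of the key space onto it."""
        return avg_brick_load > load_limit or cache_hit_rate < hit_floor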

Slide 26: Simple, aggressive failure detection
- Bricks send operating statistics
  - CPU load, average queue delay, number of requests processed, etc.
- Statistical methods
  - Median absolute deviation: compares one brick's behavior with the current behavior of the rest of the bricks
  - Tarzan: incorporates past behavior of each brick and detects anomalies in the patterns of its operating statistics
- Why these techniques are effective
  - They are not the "best" failure detection mechanisms
  - Their parameters are not highly tuned
  - Simple, application-generic techniques "work" because of the low cost of acting on false positives

