
1 Glacier: Highly durable, decentralized storage despite massive correlated failures
  Andreas Haeberlen, Alan Mislove, Peter Druschel
  Rice University, Houston, TX
  2nd Symposium on Networked Systems Design & Implementation (NSDI), Boston, MA, May 2-4, 2005
  © 2005 Andreas Haeberlen, Rice University

2 Introduction
  Many distributed applications require storage
  Cooperative storage: aggregate storage on participating nodes
  Advantages:
    Resilient
    Highly scalable
  Examples: Farsite, PAST, OceanStore
  Structured overlay network

3 Motivation
  Common assumption: high node diversity implies failure independence
  Unrealistic!
    Node population may have low diversity (e.g. OS)
    Worms can cause large-scale correlated Byzantine failures
  Reactive systems are too slow to prevent data loss

4 Related Work
  Phoenix, OceanStore use introspection:
    Build failure model
    Store data on nodes with low correlation
  Limitations:
    Model must reflect all possible correlations
    Even small inaccuracies may lead to data loss
    Users have an incentive to report incorrect data

5 Our Approach: Glacier
  Create massive redundancy to ensure that data survives any correlated failure with high probability
  Assumption: magnitude of the failure can be bounded by a fraction f_max
  Challenges:
    Minimize storage and bandwidth requirements
    Withstand attacks, Byzantine failures

6 Glacier: Insertion
  When a new object is inserted:
    1. Apply erasure code
    2. Attach manifest with hashes of fragments
    3. Send each fragment to a different node
  No remote delete operation, but lifetime of objects can be limited
  Storage is lease-based; reclaims unused storage
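
The three insertion steps can be sketched roughly as follows. This is only an illustration under stated assumptions: a single XOR parity fragment stands in for the erasure code (the slides do not name the code Glacier uses), SHA-256 stands in for the manifest hash, and the function names and `node-i` labels are hypothetical.

```python
import hashlib
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, k):
    """Split data into k equal-size fragments plus one XOR parity fragment.

    Stand-in for a real erasure code: any k of the k+1 fragments suffice.
    """
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    frags = [padded[i * size:(i + 1) * size] for i in range(k)]
    return frags + [reduce(xor_bytes, frags)]


def make_manifest(fragments):
    """Manifest: one hash per fragment (SHA-256 is an assumption)."""
    return [hashlib.sha256(f).hexdigest() for f in fragments]


def verify(index, fragment, manifest):
    """Check a fragment received from an untrusted node against the manifest."""
    return hashlib.sha256(fragment).hexdigest() == manifest[index]


def recover_missing(fragments):
    """Rebuild the single missing fragment (marked None) by XOR-ing the rest."""
    return reduce(xor_bytes, [f for f in fragments if f is not None])


# Step 1: encode; step 2: attach manifest; step 3: one fragment per node.
fragments = encode(b"hello glacier", 4)
manifest = make_manifest(fragments)
placement = {f"node-{i}": frag for i, frag in enumerate(fragments)}
```

Because every fragment can be checked against the manifest, a node that returns corrupted data is detected rather than trusted.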

7 Glacier: Maintenance
  Nodes with distance store similar fragments
  Periodic maintenance:
    Ask a peer node for its list of fragments
    Compare with local list, recover any missing fragments
  Fragments remain on their nodes during offline periods
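
One maintenance round, as described above, amounts to a set comparison. The sketch below assumes that both lists are keyed by object identifier and that a hypothetical `recover` callback rebuilds the local fragment from other nodes; neither detail is given in the slides.

```python
def maintenance_round(local_store, peer_listing, recover):
    """Compare a peer's fragment list with the local one and fill the gaps.

    local_store:  dict mapping object key -> locally stored fragment
    peer_listing: iterable of object keys the peer holds fragments for
    recover:      hypothetical callback that rebuilds the local fragment
    Returns the keys that were missing locally and have been recovered.
    """
    missing = sorted(set(peer_listing) - set(local_store))
    for key in missing:
        local_store[key] = recover(key)
    return missing


local_store = {"obj-a": b"fragment-a"}
recovered = maintenance_round(
    local_store,
    ["obj-a", "obj-b", "obj-c"],
    recover=lambda key: b"rebuilt-" + key.encode(),
)
```

Nothing happens for keys that both sides already hold, so a round between two healthy nodes only costs the exchange of the list.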

8 Glacier: Recovery
  During a failure, some fragments are damaged or lost
  Communication may not be possible
  Unaffected nodes do not take any special action:
    Failed nodes are eventually repaired
    Maintenance gradually restores lost fragments
  [Figure: timeline for fragments 1-6 showing insert, correlated failure, T_fail, and offline period]

9 Glacier: Durability
  Example configuration: 48 fragments, any 5 sufficient for recovery
  Bad news: storage overhead 9.6x
  Good news: survives 60% correlated failure with P = 0.999999 (single object)

  f_max  Durability  Code  Fragments  Storage
  0.30   0.9999      3     13          4.33
  0.50   0.99999     4     29          7.25
  0.60   0.999999    5     48          9.60
  0.70   0.999999    5     68         13.60
  0.85   0.999999    5     149        29.80

  More storage, higher durability
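
The relationship between fragment count, failure fraction, and durability can be explored with a small calculation. The sketch below assumes a simple binomial model in which each fragment is lost independently with probability f_max and an object survives if at least r fragments remain (r being the "Code" column); the slides do not spell out the model, and the function names are hypothetical.

```python
from math import comb


def loss_probability(n, r, f):
    """Probability that fewer than r of n fragments survive.

    Assumes each fragment is lost independently with probability f.
    """
    survive = 1.0 - f
    return sum(comb(n, k) * survive**k * f**(n - k) for k in range(r))


def min_fragments(r, f, max_loss):
    """Smallest n for which the loss probability is at most max_loss."""
    n = r
    while loss_probability(n, r, f) > max_loss:
        n += 1
    return n


# Example configuration from the slide: any 5 fragments suffice, f_max = 0.6,
# target durability 0.999999 (loss probability at most 1e-6).
n = min_fragments(r=5, f=0.6, max_loss=1e-6)
storage_overhead = n / 5
```

Under this model the search arrives at 48 fragments for the example configuration, matching the 48/5 = 9.6 storage overhead quoted on the slide.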

10 Aggregation
  If objects are small:
    Huge number of fragments
    High overhead for storage, management
  Solution: aggregate objects before storing them in Glacier
  Challenges:
    Untrusted environment
    Aggregates must be self-authenticating
  [Figure: layering diagrams, App / Glacier and App / Aggreg. / Glacier]

11 Aggregation: Links
  Mapping from objects to aggregates is crucial!
    Need durability
    Need authentication
  Solution: link aggregates
  Result: DAG
    Can recover mapping by traversing the DAG
    DAG forms a hash tree; easy to authenticate
    Top-level pointer is kept in Glacier itself
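
The idea of recovering the object-to-aggregate mapping by walking a hash-linked DAG can be sketched as follows. The aggregate layout (a dict with `objects` and `links`), the use of SHA-256 over a JSON encoding as the aggregate identifier, and the `fetch` callback are all assumptions made for illustration, not the format Glacier uses.

```python
import hashlib
import json


def aggregate_id(aggregate):
    """Self-authenticating identifier: hash of the aggregate's contents."""
    encoded = json.dumps(aggregate, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def make_aggregate(objects, links):
    """An aggregate holds objects plus the ids of earlier aggregates."""
    return {"objects": dict(objects), "links": sorted(links)}


def recover_mapping(head_id, fetch):
    """Traverse the DAG from the top-level pointer and rebuild
    the object -> aggregate mapping, verifying every aggregate's hash."""
    mapping, stack, seen = {}, [head_id], set()
    while stack:
        agg_id = stack.pop()
        if agg_id in seen:
            continue
        seen.add(agg_id)
        aggregate = fetch(agg_id)
        if aggregate_id(aggregate) != agg_id:
            raise ValueError("aggregate failed authentication: " + agg_id)
        for name in aggregate["objects"]:
            mapping.setdefault(name, agg_id)
        stack.extend(aggregate["links"])
    return mapping


# Build a three-aggregate DAG; the newest aggregate links to both older ones.
store = {}
a1 = make_aggregate({"msg1": "hello"}, [])
id1 = aggregate_id(a1)
a2 = make_aggregate({"msg2": "world"}, [id1])
id2 = aggregate_id(a2)
a3 = make_aggregate({"msg3": "!"}, [id1, id2])
id3 = aggregate_id(a3)
store.update({id1: a1, id2: a2, id3: a3})

mapping = recover_mapping(id3, store.__getitem__)
```

Since each link is a hash of the aggregate it points to, only the top-level pointer has to be trusted; everything reachable from it is checked during the traversal.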

12 Evaluation
  Two sets of experiments:
    Trace-driven simulations (scalability, churn, ...)
    Actual deployment: ePOST
  ePOST: a cooperative, serverless e-mail system
    In production use: initially 17 users, 20 nodes
    Based on FreePastry, PAST, Scribe, POST
    Added Glacier for durability
  Glacier configuration in ePOST:
    48 fragments, 0.2 encoding
    f_max = 0.6, P = 0.999999
  140 days of practical experience (incl. some failures)

13 Evaluation: Storage
  Inherent storage overhead: 48/5 = 9.6
  17 GB of on-disk storage for 1.3 GB of data
  Actual storage overhead on disk: about 12.6

14 Evaluation: Network load
  During stable periods, traffic is comparable to PAST
  In the ePOST experiment, a misconfiguration caused frequent traffic spikes
    Long off-line periods were mistaken for failures

15 Evaluation: Recovery
  Experiment: created a 'clone' of the ePOST ring with only 13 of the 31 nodes (a 58% failure!)
  Started recovery process on a freshly installed node:
    User entered e-mail address and date of last use
    Glacier located head of aggregate tree, recovered it
  System was again ready for use; no data loss

16 Conclusions
  Large-scale correlated failures are a realistic threat to distributed storage systems
  Glacier provides hard durability guarantees with minimal assumptions about the failure model
  Glacier transforms abundant but unreliable disk space into reliable storage
  Bandwidth cost is low
  Thank you!

17 Glacier is available!
  Download: www.epostmail.org
    Serverless, secure e-mail
    Easy to set up
    Uses Glacier for durability
