POND: THE OCEANSTORE PROTOTYPE S. Rhea, P. Eaton, D. Geels, H. Weatherspoon, J. Kubiatowicz U. C. Berkeley.


1 POND: THE OCEANSTORE PROTOTYPE S. Rhea, P. Eaton, D. Geels, H. Weatherspoon, J. Kubiatowicz U. C. Berkeley

2 Key Ideas Versioning file system Location-independent routing –Uses hashes instead of addresses –Mapping is done through Tapestry Byzantine update commitment –By nodes holding primary copies (inner ring) –Proactive threshold signatures allow inner ring membership updates

3 Key Ideas Push-based update of other copies –Through an overlay multicast network –Copies are not permanent Continuous archiving in erasure-coded form –Very reliable –Very slow access

4 Motivation Find a better solution for long-term management of data Enabling trends: –Near universal connectivity through high-bandwidth links –Very fast increase of disk storage capacity per unit cost

5 OceanStore Internet-scale cooperative file system Will provide –High durability –Universal accessibility Will use a two-tiered storage system Stores data objects

6 Two-tiered organization Upper tier – Powerful, well connected hosts –Serialize changes and archive results Lower tier – Less powerful hosts Can be user workstations –Provide storage resources

7 Two-tiered organization Archive Primary replica (in inner ring) Secondary replica

8 Basic requirements OceanStore should 1. Let information be accessed from any location 2. Balance the tension between privacy and information sharing 3. Offer an easily understandable and usable model of data consistency 4. Guarantee data integrity

9 First basic assumption Infrastructure cannot be trusted, except in aggregate –Hosts and routers can fail arbitrarily –Must consider Passive failures: host snooping, … Active failures: host injecting malicious messages, …

10 Second basic assumption Infrastructure is continuously changing –Performance of communication paths varies –Resources enter and leave the network without warning –System should Be self-organizing and self-repairing Aim to be self-tuning

11 The challenge Build a system that provides –An expressive user interface –High data availability –High data durability –High data privacy and integrity atop an untrusted and ever-changing base More ambitious than FARSITE

12 The data model OceanStore data object –Similar to a traditional file –Ordered sequence of read-only versions Versioning –Simplifies consistency issues –Allows recovery of previous versions Identical blocks are shared among versions

13 Data object implementation (I) Each data object has an AGUID (Active Globally-Unique Identifier) –Secure hash of application-level name and public key of owner Each version has a VGUID (Version GUID) –BGUID of root block of a version Each block has a BGUID (Block GUID) –Secure hash of block contents
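The three identifiers on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SHA-1 as the secure hash, the length-prefixed encoding, and the helper names `guid`, `aguid`, `bguid`, `vguid` and `root_block` are all hypothetical and are not taken from the slides. The root block is modeled simply as the concatenation of its children's BGUIDs.

```python
import hashlib


def guid(*parts: bytes) -> str:
    """Hash byte strings into a hex identifier (SHA-1 is an assumption)."""
    h = hashlib.sha1()
    for p in parts:
        # Length-prefix each part so (b"ab", b"c") and (b"a", b"bc") differ.
        h.update(len(p).to_bytes(8, "big"))
        h.update(p)
    return h.hexdigest()


def bguid(block: bytes) -> str:
    """Block GUID: secure hash of the block contents."""
    return guid(block)


def aguid(name: str, owner_key: bytes) -> str:
    """Active GUID: secure hash of the application-level name and owner key."""
    return guid(name.encode("utf-8"), owner_key)


def root_block(data_blocks: list) -> bytes:
    """Hypothetical root block: just the BGUIDs of the blocks it points to."""
    return "".join(bguid(b) for b in data_blocks).encode("ascii")


def vguid(data_blocks: list) -> str:
    """Version GUID: the BGUID of the version's root block."""
    return bguid(root_block(data_blocks))


version_i = [b"hello ", b"world"]
version_i_plus_1 = [b"hello ", b"there"]  # only the second block changed
```

Because BGUIDs depend only on contents, the unchanged block `b"hello "` has the same BGUID in both versions and can be stored once, while the two versions still get different VGUIDs.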

14 A data object [Figure labels: AGUID; VGUID i; VGUID i+1; M; M; root block; indirect blocks; data blocks; COW (copy-on-write)]

15 Data object implementation AGUID, VGUID and BGUID are location-transparent –OceanStore relies on a lower-level service to map GUIDs into addresses

16 Application-level consistency (I) Updating an object means creating a new version Updates are –Atomic –Represented as an array of potential actions each guarded by a predicate

17 Application-level consistency (II) Actions can be –Appending data –Replacing bytes at a specific address Predicates can be –Checking the latest version number of the object –Verifying values of bytes at a specific address
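The predicate-and-action model can be sketched as follows. This is a minimal illustration, not Pond's API: the names `Version`, `apply_update`, `version_is`, `bytes_equal`, `append` and `replace` are hypothetical, and the sketch assumes that the actions of the first predicate that holds are applied and that the update aborts if no predicate holds.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Version:
    number: int
    data: bytes


Predicate = Callable[[Version], bool]
Action = Callable[[bytes], bytes]


def version_is(n: int) -> Predicate:
    """Predicate: the latest version number equals n."""
    return lambda v: v.number == n


def bytes_equal(offset: int, expected: bytes) -> Predicate:
    """Predicate: the bytes at the given offset match the expected value."""
    return lambda v: v.data[offset:offset + len(expected)] == expected


def append(extra: bytes) -> Action:
    """Action: append data to the object."""
    return lambda d: d + extra


def replace(offset: int, new: bytes) -> Action:
    """Action: overwrite bytes starting at the given offset."""
    return lambda d: d[:offset] + new + d[offset + len(new):]


def apply_update(versions: List[Version],
                 update: List[Tuple[Predicate, List[Action]]]) -> bool:
    """Apply the actions of the first predicate that holds, or abort."""
    latest = versions[-1]
    for predicate, actions in update:
        if predicate(latest):
            data = latest.data
            for act in actions:
                data = act(data)
            # The new version is built completely before it becomes visible,
            # and older versions are never modified.
            versions.append(Version(latest.number + 1, data))
            return True
    return False  # no predicate held: the update aborts, nothing changes


versions = [Version(0, b"abc")]
first = apply_update(versions, [(version_is(0), [append(b"def")])])
stale = apply_update(versions, [(version_is(0), [append(b"x")])])
third = apply_update(versions, [(bytes_equal(0, b"abc"), [replace(0, b"XYZ")])])
```

The second update is guarded by a stale version number, so it aborts and leaves the version list untouched; this is how a client can get compare-and-swap behavior out of the model.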

19 Application-level consistency (III) Predicate and action model –Allows multiple levels of consistency to be implemented Atomic transactions satisfying ACID properties for database applications Weaker consistency for mailboxes

20 A footnote ACID properties of atomic transactions mean that atomic transactions –Are atomic –Bring the database from one consistent state to another consistent state –Isolate their partial results until the transaction is completed –Guarantee the durability of the final result

21 Virtualization through Tapestry OceanStore messages are addressed with a GUID Tapestry forwards these messages to a host containing a resource with that GUID –Fully decentralized service Hosts can –Join Tapestry by supplying their GUID –Publish the GUIDs of the resources they have

22 Replication and consistency (I) Each object has a single primary replica Primary replica –Serializes and applies all updates –Creates a certificate (heartbeat) mapping the AGUID of the object to the VGUID of its latest version –Controls access to the object – …

23 Replication and consistency (II) Heartbeat contains –An AGUID –A VGUID –A timestamp –A version sequence number Getting the most recent version of object means getting its most recent heartbeat
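The four heartbeat fields and the "most recent version" lookup can be sketched as a small record type. This is a sketch under assumptions: the class and function names are hypothetical, the signature that makes a heartbeat a certificate is omitted, and "most recent" is taken to mean the highest version sequence number.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Heartbeat:
    aguid: str        # identifies the data object
    vguid: str        # identifies the object's latest version
    timestamp: float  # when the heartbeat was issued
    sequence: int     # version sequence number


def latest_heartbeat(heartbeats: Iterable[Heartbeat],
                     aguid: str) -> Optional[Heartbeat]:
    """Return the heartbeat with the highest sequence number for this object."""
    candidates = [h for h in heartbeats if h.aguid == aguid]
    return max(candidates, key=lambda h: h.sequence, default=None)


seen = [
    Heartbeat("obj-A", "v-1", 100.0, 1),
    Heartbeat("obj-B", "v-9", 105.0, 9),
    Heartbeat("obj-A", "v-2", 110.0, 2),
]
current = latest_heartbeat(seen, "obj-A")
```

A reader that holds `current` knows which VGUID to fetch; a real client would also verify the heartbeat's signature before trusting it.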

24 The inner ring Small set of co-operating servers that manage primary replicas Implement a Byzantine fault-tolerant protocol to –Agree on all updates to an object –Digitally sign the result

25 Archival storage Stores object versions that are not frequently accessed Uses erasure codes –Each block Partitioned into m fragments Encoded into n > m fragments –Any subset of m fragments suffices to reconstitute the block
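The "any m of n fragments" property can be demonstrated with the simplest possible erasure code: m data fragments plus one XOR parity fragment, so n = m + 1. This is a toy stand-in chosen for brevity, not the code the prototype uses; a production archive would use a stronger code (for example a Reed-Solomon variant) that tolerates more than one lost fragment. The function names are hypothetical.

```python
from functools import reduce
from typing import Dict, List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(block: bytes, m: int) -> List[bytes]:
    """Split a block into m data fragments and add one XOR parity fragment."""
    frag_len = -(-len(block) // m)               # ceiling division
    padded = block.ljust(frag_len * m, b"\0")    # zero-pad to a multiple of m
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(m)]
    parity = reduce(xor_bytes, frags)
    return frags + [parity]                      # n = m + 1 fragments


def decode(available: Dict[int, bytes], m: int, length: int) -> bytes:
    """Rebuild the block from any m of the m + 1 fragments (index m = parity)."""
    if len(available) < m:
        raise ValueError("need at least m fragments")
    frags = dict(available)
    missing = [i for i in range(m) if i not in frags]
    if missing:
        # Exactly one data fragment is gone, so the parity is present and the
        # XOR of all surviving fragments equals the missing one.
        frags[missing[0]] = reduce(xor_bytes, frags.values())
    return b"".join(frags[i] for i in range(m))[:length]


block = b"OceanStore archival block"
fragments = encode(block, m=4)
recovered = [
    decode({i: f for i, f in enumerate(fragments) if i != lost}, 4, len(block))
    for lost in range(len(fragments))
]
```

Each entry of `recovered` is decoded with a different fragment withheld, so the list shows that every subset of m = 4 fragments reconstitutes the block.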

26 Caching of data objects Retrieving data from archive is slow OceanStore also maintains copies of whole blocks –Secondary replicas Heartbeats always come from the primary replica Updates of secondary replicas are done through a dissemination tree

27 Path of an OceanStore update Application Archive Primary replica in inner ring Secondary replica

28 Updating primary replicas (I) Use a Byzantine fault-tolerant protocol –Tolerates up to f failures in a system made up of 3f + 1 hosts Protocol authenticates messages with symmetric-key message authentication codes (MACs) –Faster than using public keys –Complicates the Byzantine agreement protocol
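The 3f + 1 bound is easy to turn into arithmetic. The helper names below are hypothetical; the sketch only restates the slide's relation between ring size and tolerated failures.

```python
def min_hosts(f: int) -> int:
    """Smallest inner ring that tolerates f Byzantine failures."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_faults(n: int) -> int:
    """Largest f such that a ring of n hosts satisfies n >= 3f + 1."""
    if n < 1:
        raise ValueError("need at least one host")
    return (n - 1) // 3


ring_sizes = {n: max_faults(n) for n in (4, 6, 7, 10)}
```

For example, a ring of 4 hosts tolerates 1 failure, and growing it to 6 hosts still tolerates only 1; the next improvement comes at 7 hosts.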

29 Updating primary replicas (II) Solution was to use –Symmetric keys for all communications within the inner ring –Public keys to communicate with all other machines

30 Proactive threshold signatures (listen to lecture)

31 Prototype software architecture Network (Java NBIO) Tapestry Byzantine agreement Inner ring Archive Dissemination tree/replicas Client interface Application

32 The prototype Written in Java

33 Conclusion

