OceanStore: In Search of Global-Scale, Persistent Storage
John Kubiatowicz, UC Berkeley

OceanStore Context: Ubiquitous Computing
Computing everywhere:
–Desktop, laptop, palmtop
–Cars, cellphones
–Shoes? Clothing? Walls?
Connectivity everywhere:
–Rapid growth of bandwidth in the interior of the net
–Broadband to the home and office
–Wireless technologies such as CDMA, satellite, laser

Utility-based Infrastructure?
[Diagram: federation of providers – Pac Bell, Sprint, IBM, AT&T, Canadian OceanStore]
–Data service provided by a federation of companies
–Crosses administrative domains
–Metric: a mole of bytes (6×10²³)

OceanStore Assumptions
Untrusted Infrastructure:
–The OceanStore is comprised of untrusted components
–Only ciphertext within the infrastructure
Responsible Party:
–Some organization (i.e., a service provider) guarantees that your data is consistent and durable
–Not trusted with the content of data, merely its integrity
Mostly Well-Connected:
–Data producers and consumers are connected to a high-bandwidth network most of the time
–Exploit multicast for quicker consistency when possible
Promiscuous Caching:
–Data may be cached anywhere, anytime

Key Observation: Want Automatic Maintenance
Can’t possibly manage billions of servers by hand!
The system should automatically:
–Adapt to failure
–Repair itself
–Incorporate new elements
Introspective Computing / Autonomic Computing
Can data be accessible for 1000 years?
–New servers added from time to time
–Old servers removed from time to time
–Everything just works

Outline
Motivation
Assumptions of the OceanStore
Specific technologies and approaches:
–Routing and data location
–Naming
–Conflict resolution on encrypted data
–Replication and deep archival storage
–Introspection for optimization and repair
Conclusion

Basic Structure: Irregular Mesh of “Pools”

Bringing Order to this Chaos
How do you find information?
–Must be scalable and provide maximum flexibility
How do you name information?
–Must provide global uniqueness
How do you ensure consistency?
–Must scale and handle intermittent connectivity
–Must prevent unauthorized updates of information
How do you protect information?
–Must preserve privacy
–Must provide deep archival storage (continuous repair)
How do you tune performance?
–Locality is very important
Throughout all of this: how do you maintain it???

Location and Routing

Locality, Locality, Locality
One of the defining principles:
“The ability to exploit local resources over remote ones whenever possible”
“-Centric” approach:
–Client-centric, server-centric, data source-centric
Requirements:
–Find data quickly, wherever it might reside
–Locate nearby objects without global communication
–Permit rapid object migration
–Verifiable: can’t be sidetracked
Locality yields: performance, availability, reliability

Enabling Technology: DOLR (Decentralized Object Location and Routing)
[Diagram: Tapestry overlay locating object replicas by GUID (GUID1, GUID2)]

Stability under Changes
Unstable, unreliable, untrusted nodes are the common case!
–The network never fully stabilizes
–What is the half-life of a routing node?
–Must provide stable routing in these circumstances
Redundancy and adaptation are fundamental:
–Make use of alternative paths when possible
–Incrementally remove faulty nodes
–Route around network faults
–Continuously tune neighbor links

The Tapestry DOLR
Routing to objects, not locations!
–A replacement for IP?
–A very powerful abstraction
Built as an overlay network, but not fundamentally so
–Randomized prefix routing + a distributed object location index
–Routing nodes have links to nearby neighbors
–Additional state tracks objects
Massively parallel insert (SPAA 2002)
–Construction of nearest-neighbor mesh links
–log₂ n message complexity for a new node
–New nodes are integrated, faulty ones removed
–Objects are kept available during this process
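To make the prefix-routing idea concrete, here is a minimal Java sketch under simplifying assumptions (hex-string node IDs, a flat neighbor list); it is an illustration, not Tapestry's actual routing-table code. Each hop forwards to a neighbor that shares a longer prefix with the destination GUID than the current node does, so a route resolves roughly one digit per hop.

```java
import java.util.List;
import java.util.Optional;

public class PrefixRouting {
    // Number of leading hex digits two IDs share.
    static int sharedPrefix(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    // Choose the next hop: any neighbor that resolves at least one more digit
    // of the destination GUID than the current node does. Real Tapestry keeps
    // a routing table with one nearby candidate per (prefix length, digit) slot.
    static Optional<String> nextHop(String selfId, String destGuid, List<String> neighbors) {
        int have = sharedPrefix(selfId, destGuid);
        return neighbors.stream()
                .filter(n -> sharedPrefix(n, destGuid) > have)
                .findFirst();
    }

    public static void main(String[] args) {
        List<String> neighbors = List.of("4a1f", "437e", "43fa");
        // From node 4227 toward GUID 43fe: 437e and 43fa both resolve another digit.
        System.out.println(nextHop("4227", "43fe", neighbors));  // Optional[437e]
    }
}
```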

OceanStore Naming

Model of Data
Ubiquitous object access from anywhere
–Undifferentiated “bag of bits”
Versioned objects
–Every update generates a new version
–Can always go back in time (“time travel”)
Each version is read-only
–Can have a permanent name (SHA-1 hash)
–Much easier to repair
An object is a signed mapping between a permanent name and the latest version
–Write access control/integrity involves managing these mappings
[Diagram: comet analogy – updates form the head, older versions trail behind]

Secure Hashing
Read-only data: the GUID is a hash over the actual information
–Uniqueness and unforgeability: the data is what it is!
–Verification: check the hash over the data
Changeable data: the GUID is a combined hash over a human-readable name + a public key
–Uniqueness: GUID space selected by the public key
–Unforgeability: the public key is indelibly bound to the GUID
–Verification: check signatures with the public key
[Diagram: DATA → SHA-1 → 160-bit GUID]
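A minimal Java sketch of the two GUID flavors described above, using standard JDK crypto classes; the class and method names are illustrative, not the OceanStore API. Read-only data is named by the SHA-1 of its bytes, while a changeable object is named by hashing a human-readable name together with the owner's public key.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.PublicKey;
import java.util.Arrays;

public class Guids {
    // Content GUID for read-only data: hash of the bytes themselves.
    // Anyone holding the data can recompute and verify the GUID.
    static byte[] contentGuid(byte[] data) throws Exception {
        return MessageDigest.getInstance("SHA-1").digest(data);   // 160-bit GUID
    }

    // Self-certifying GUID for changeable data: hash of (name, public key).
    // The key is indelibly bound to the GUID, so only the key holder can
    // sign valid mappings from this GUID to its latest version.
    static byte[] objectGuid(String humanName, PublicKey owner) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        md.update(humanName.getBytes(StandardCharsets.UTF_8));
        md.update(owner.getEncoded());
        return md.digest();
    }

    // Verifying read-only data is just recomputing the hash and comparing.
    static boolean verify(byte[] data, byte[] claimedGuid) throws Exception {
        return Arrays.equals(contentGuid(data), claimedGuid);
    }
}
```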

Secure Naming
Naming hierarchy:
–Users map from names to GUIDs via a hierarchy of OceanStore objects (à la SDSI)
–Requires a set of “root keys” to be acquired by the user out of band
[Diagram: out-of-band “root link” → Foo → Bar → Baz → Myfile]
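As an illustration of SDSI-style resolution (a sketch only; the Directory and Store interfaces are hypothetical, not OceanStore code), a client starts from a root GUID obtained out of band and walks one directory object per path component:

```java
public class NameResolver {
    // Hypothetical directory object: a verified mapping from child names to GUIDs.
    interface Directory {
        String lookup(String childName);            // returns child GUID, or null
    }

    // Hypothetical fetch of a directory object by GUID (would verify signatures).
    interface Store {
        Directory fetchDirectory(String guid);
    }

    // Resolve "/Foo/Bar/Baz/Myfile" by walking directory objects from the root GUID.
    static String resolve(Store store, String rootGuid, String path) {
        String current = rootGuid;
        for (String component : path.split("/")) {
            if (component.isEmpty()) continue;      // skip leading slash
            Directory dir = store.fetchDirectory(current);
            current = dir.lookup(component);
            if (current == null)
                throw new IllegalArgumentException("no such name: " + component);
        }
        return current;                             // GUID of the named object
    }
}
```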

The Write Path

The Path of an OceanStore Update
[Diagram: clients send updates to the inner-ring servers; results flow down multicast trees to second-tier caches]

OceanStore Consistency via Conflict Resolution
Consistency is a form of optimistic concurrency
–An update packet contains a series of predicate-action pairs which operate on encrypted data
–Each predicate is tried in turn:
  If none match, the update is aborted
  Otherwise, the action of the first true predicate is applied
The Inner Ring must securely:
–Pick a serial order of updates
–Apply them
–Sign the result (threshold signature)
–Disseminate results to active users
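The predicate-action structure can be sketched as follows. This is illustrative only: it operates on plaintext state for clarity, whereas OceanStore's predicates are designed so the ring can evaluate them over encrypted data, and all names here are made up rather than taken from Pond.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

public class ConflictResolution {
    // One guarded action: "if the object looks like this, transform it like that".
    record PredicateAction<S>(Predicate<S> predicate, Function<S, S> action) {}

    // Try predicates in order; apply the action of the first true predicate.
    // Returns the new state, or empty if the update must be aborted.
    static <S> Optional<S> applyUpdate(S state, List<PredicateAction<S>> update) {
        for (PredicateAction<S> pa : update) {
            if (pa.predicate().test(state)) {
                return Optional.of(pa.action().apply(state));
            }
        }
        return Optional.empty();   // no predicate matched: abort the update
    }

    public static void main(String[] args) {
        // Example: append to a log only if it still ends with the expected entry.
        var update = List.of(
            new PredicateAction<String>(s -> s.endsWith("v3"), s -> s + ",v4"));
        System.out.println(applyUpdate("v1,v2,v3", update)); // Optional[v1,v2,v3,v4]
        System.out.println(applyUpdate("v1,v2", update));    // Optional.empty (aborted)
    }
}
```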

Automatic Maintenance
Byzantine commitment for the inner ring:
–Tolerates up to 1/3 malicious servers in the inner ring
–Continuous refresh of the set of inner-ring servers
  Proactive threshold signatures
  Use of Tapestry means the membership of the inner ring is unknown to clients
The secondary tier self-organizes into an overlay dissemination tree
–Use of Tapestry routing to suggest placement of replicas in the infrastructure
–Automatic choice between update vs. invalidate
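A small sketch of the fault-tolerance arithmetic behind the inner ring, under the usual BFT assumption of n = 3f + 1 servers: a result backed by 2f + 1 ring members is safe even if f of them are malicious. As the slide notes, the real system signs results with a proactive threshold signature, so a client verifies one combined signature rather than counting signers as this toy does.

```java
import java.util.Set;

public class InnerRingQuorum {
    // An inner ring of n = 3f + 1 servers tolerates up to f Byzantine members.
    static int maxFaulty(int ringSize) {
        return (ringSize - 1) / 3;
    }

    // A result is trustworthy once at least 2f + 1 distinct ring members vouch
    // for it: at least f + 1 of those are honest, so honest servers agree on it.
    static boolean committed(Set<String> signers, Set<String> ringMembers) {
        int f = maxFaulty(ringMembers.size());
        long valid = signers.stream().filter(ringMembers::contains).count();
        return valid >= 2L * f + 1;
    }

    public static void main(String[] args) {
        Set<String> ring = Set.of("A", "B", "C", "D");               // n = 4, f = 1
        System.out.println(committed(Set.of("A", "B", "C"), ring));  // true  (3 >= 2f+1)
        System.out.println(committed(Set.of("A", "B"), ring));       // false (2 <  2f+1)
    }
}
```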

Self-Organizing Soft-State Replication
Simple algorithms for placing replicas on nodes in the interior
–Intuition: the locality properties of Tapestry help select positions for replicas
–Tapestry helps associate parents and children to build the multicast tree
Preliminary results show that this is effective

Deep Archival Storage

Two Types of OceanStore Data
Active data: “floating replicas”
–A per-object virtual server
–Logging for updates/conflict resolution
–Interaction with other replicas for consistency
–May appear and disappear like bubbles
Archival data: OceanStore’s stable store
–m-of-n coding, like a hologram:
  Data is coded into n fragments, any m of which are sufficient to reconstruct (e.g., m=16, n=64)
  Coding overhead is proportional to n/m (e.g., 4)
  The other parameter, the rate, is 1/overhead
–Fragments are cryptographically self-verifying
Most data in the OceanStore is archival!

Archival Dissemination of Fragments

Fraction of Blocks Lost per Year (FBLPY)
Exploit the law of large numbers for durability!
With a 6-month repair interval, FBLPY:
–Replication: 0.03
–Fragmentation:
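To see why fragmentation exploits the law of large numbers, the sketch below compares, for the same 4x storage overhead, the loss probability of four whole replicas against a 16-of-64 erasure-coded block. The per-server failure probability p per repair interval is an assumed illustrative value, not a number from the talk.

```java
public class Durability {
    // Probability that a replicated block (r full copies) is lost:
    // all r copies must fail within one repair interval.
    static double replicationLoss(double p, int r) {
        return Math.pow(p, r);
    }

    // Probability that an m-of-n erasure-coded block is lost:
    // more than (n - m) of the n fragments must fail.
    static double fragmentationLoss(double p, int m, int n) {
        double loss = 0.0;
        for (int k = n - m + 1; k <= n; k++) {
            loss += binomial(n, k) * Math.pow(p, k) * Math.pow(1 - p, n - k);
        }
        return loss;
    }

    static double binomial(int n, int k) {
        double c = 1.0;
        for (int i = 1; i <= k; i++) c *= (double) (n - k + i) / i;
        return c;
    }

    public static void main(String[] args) {
        double p = 0.1;   // assumed per-server failure probability per repair interval
        // Same 4x overhead in both cases: 4 replicas vs. 16-of-64 coding.
        System.out.printf("replication (4 copies):   %.2e%n", replicationLoss(p, 4));
        System.out.printf("fragmentation (16-of-64): %.2e%n", fragmentationLoss(p, 16, 64));
    }
}
```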

The Dissemination Process: Achieving Failure Independence
[Diagram: introspection – Network Monitoring and Human Input feed the Model Builder; the resulting model drives the Set Creator, which probes candidate servers and hands the Inner Ring a dissemination set for the fragments]

Automatic Maintenance
Continuous entropy suppression – i.e., repair!
–Erasure coding gives flexibility in the timing of repair
Data is continuously transferred from physical medium to physical medium
–No “tapes decaying in a basement”
Actual repair:
–Recombine fragments, then send out copies again
–The DOLR permits an efficient heartbeat mechanism
This permits the infrastructure to notice:
–Servers going away for a while
–Or going away forever!
–A continuous sweep through the data
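One way such a continuous sweep could be organized is sketched below; the interfaces, thresholds, and method names are assumptions for illustration rather than OceanStore's implementation. The sweep counts live fragments per block (using heartbeats over the DOLR) and recombines and re-disseminates when redundancy drops too low.

```java
import java.util.List;

public class RepairSweep {
    // Hypothetical view of one archived block and its fragment locations.
    interface ArchivedBlock {
        String guid();
        List<String> fragmentHosts();              // servers holding fragments
        void reconstructAndRedisseminate();        // recombine m fragments, re-code, re-spread
    }

    // Hypothetical heartbeat facility built on the DOLR.
    interface Heartbeat {
        boolean isAlive(String host);
    }

    static final int M = 16;                 // fragments needed to reconstruct
    static final int REPAIR_THRESHOLD = 32;  // repair below this many live fragments (assumed policy)

    // One pass of the continuous sweep over all archival data.
    static void sweep(List<ArchivedBlock> blocks, Heartbeat hb) {
        for (ArchivedBlock block : blocks) {
            long live = block.fragmentHosts().stream().filter(hb::isAlive).count();
            if (live < M) {
                System.err.println("block " + block.guid() + " unrecoverable");
            } else if (live < REPAIR_THRESHOLD) {
                block.reconstructAndRedisseminate();   // apply "energy" to reduce entropy
            }
        }
    }
}
```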

Introspective Tuning

On the use of Redundancy
Question: can we use Moore’s-law gains for something other than just raw performance?
–Growth in computational performance
–Growth in network bandwidth
–Growth in storage capacity
Physical systems are unreliable and untrusted
–Can we use multiple faulty elements instead of one?
–Can we devote resources to monitoring and analysis?
–Can we devote resources to repairing systems?
The complexity of systems is growing rapidly
–We can no longer debug systems entirely
–How do we handle this?

The Biological Inspiration
Biological systems are built from (extremely) faulty components, yet:
–They operate with a variety of component failures ⇒ redundancy of function and representation
–They have stable behavior ⇒ negative feedback
–They are self-tuning ⇒ optimization of the common case
Introspective Computing:
–Components for computing
–Components for monitoring and model building
–Components for continuous adaptation
[Diagram: Compute → Monitor → Adapt cycle]

The Thermodynamic Analogy
A system such as OceanStore has a variety of latent order:
–Connections between elements
–Mathematical structure (erasure coding, etc.)
–Distributions peaked about some desired behavior
This permits “stability through statistics”
–Exploit the behavior of aggregates
Subject to entropy:
–Servers fail, attacks happen, the system changes
Requires continuous repair:
–Apply energy (i.e., through servers) to reduce entropy

Introspective Optimization
Adaptation of the routing substrate
–Optimization of the Tapestry mesh
–Fault-tolerant routing mechanisms
–Adaptation of the second-tier multicast tree
Monitoring of access patterns:
–Clustering algorithms to discover object relationships
–Time-series analysis of user and data motion
Observations of system behavior
–Extraction of failure correlations
Continuous testing and repair of information
–A slow sweep through all information to make sure there are sufficient erasure-coded fragments
–Continuously reevaluate risk and redistribute data

PondStore [Java]: event-driven state-machine model
Included components:
–Initial floating-replica design
  Conflict resolution and Byzantine agreement
–Routing facility (Tapestry)
  Bloom-filter location algorithm
  Plaxton-based locate-and-route data structures
–Introspective gathering of tacit info and adaptation
  Language for introspective handler construction
  Clustering, prefetching, adaptation of network routing
–Initial archival facilities
  Interleaved Reed-Solomon codes for fragmentation
  Methods for signing and validating fragments
Target applications:
–Unix file-system interface under Linux (“legacy apps”)
–Application proxy for web caches
–Streaming multimedia applications

We have Things Running!
–Latest: up to 7 MB/sec
–Still a ways to go, but working

Update Latency
–Cryptography is in the critical path (not surprising!)
–New metric: avoid hashes (like “avoid copies”)

OceanStore Goes Global!
OceanStore components running “globally”:
–Australia, Georgia, Washington, Texas, Boston
–Able to run the Andrew file-system benchmark with the inner ring spread throughout the US
–Interface: NFS on OceanStore
Word on the street: it was easy to do
–The components were debugged locally
–Easily set up remotely
I am currently talking with people in:
–England, Maryland, Minnesota, ….
–The PlanetLab testbed will give us access to much more

Reality: Web Caching through OceanStore

Other Apps
Better file-system support
–NFS (working – reimplementation in progress)
–Windows Installable File System (soon) through OceanStore
IMAP and POP proxies
–Let normal mail clients access mailboxes in OceanStore
Palm Pilot synchronization
–Palm database as an OceanStore DB

OceanStore Conclusions
OceanStore: everyone’s data, one big utility
–A global utility model for persistent data storage
OceanStore assumptions:
–Untrusted infrastructure with a responsible party
–Mostly connected, with conflict resolution
–Continuous on-line optimization
OceanStore properties:
–Provides security, privacy, and integrity
–Provides extreme durability
–Lower maintenance cost through redundancy, continuous adaptation, self-diagnosis, and repair
–A large-scale system has good statistical properties

For more info:
–OceanStore vision paper (ASPLOS 2000): “OceanStore: An Architecture for Global-Scale Persistent Storage”
–Tapestry algorithms paper (SPAA 2002): “Distributed Object Location in a Dynamic Network”
–Bloom filters for probabilistic routing (INFOCOM 2002): “Probabilistic Location and Routing”
–OceanStore web site: