OceanStore: An Architecture for Global-Scale Persistent Storage. John Kubiatowicz et al., ASPLOS 2000.


OceanStore. Global-scale information storage. Mobile access to information in a uniform and highly available way. Servers are untrusted. Caches data anywhere, anytime. Monitors usage patterns. Target scale: 10^10 users, each with 10,000 files.

OceanStore system overview [figure from Rhea et al. 2003].

Main Goals Untrusted infrastructure Nomadic data

Example applications: groupware and PIM, digital libraries, scientific data repositories. Personal information management tools: calendars, email, contact lists.

Example applications: groupware and PIM, digital libraries, scientific data repositories. Challenges: scaling, consistency, migration, network failures.

Storage Organization. An OceanStore data object is roughly a file: an ordered sequence of read-only versions. Every version of every object is kept forever, so the store can double as a backup. An object contains metadata, data, and references to previous versions.

Storage Organization. A stream of objects is identified by an AGUID, an active globally-unique identifier: a cryptographically secure hash of an application-specific name and the owner's public key. This prevents namespace collisions.
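
A minimal sketch of deriving an AGUID, assuming SHA-256 from the standard java.security API; the concatenation order and encoding below are illustrative rather than the system's exact format:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.PublicKey;

public final class AGuid {
    // AGUID: a secure hash over the application-specific name and the
    // owner's public key, so two owners cannot collide on the same name.
    public static byte[] derive(String appName, PublicKey ownerKey) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(appName.getBytes(StandardCharsets.UTF_8));
        md.update(ownerKey.getEncoded());
        return md.digest();
    }
}
```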

Storage Organization. Each version of a data object is stored in a B-tree-like data structure. Each block has a BGUID, a cryptographically secure hash of the block content; each version has a VGUID. Two versions may share blocks.
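
A sketch of the content-addressed naming, again assuming SHA-256; identical blocks hash to the same BGUID, which is what lets two versions share blocks, and a version's VGUID can be thought of as the hash of its root (here simplified to a hash over the ordered block list):

```java
import java.security.MessageDigest;
import java.util.List;

public final class VersionNames {
    // BGUID: a secure hash of the block's content, so a block shared
    // between two versions automatically gets the same name in both.
    static byte[] bguid(byte[] block) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(block);
    }

    // VGUID sketch: a hash over the ordered BGUIDs of a version; in the
    // real system it is the GUID of the top block of the B-tree-like
    // structure that indexes the version.
    static byte[] vguid(List<byte[]> bguids) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (byte[] id : bguids) md.update(id);
        return md.digest();
    }
}
```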

Storage Organization [figure from Rhea et al. 2003].

Access Control. Restricting readers: a symmetric encryption key is distributed to allowed readers. Restricting writers: an ACL; writes are signed and checked against it. The ACL for an object is chosen with a signed certificate.
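
A sketch of the two mechanisms, assuming AES for the readers' symmetric key and RSA signatures for writers; key distribution, cipher modes, and certificate handling are elided:

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

public final class AccessControl {
    // Readers: data is stored encrypted, so only principals holding the
    // symmetric key can make sense of what they read.
    // Illustrative only: a real deployment would pick an explicit cipher mode and IV.
    static byte[] encryptForReaders(byte[] plaintext, SecretKey readerKey) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.ENCRYPT_MODE, readerKey);
        return c.doFinal(plaintext);
    }

    // Writers: every update is signed; replicas verify the signature and
    // then check the signer against the object's ACL before applying it.
    static byte[] signUpdate(byte[] update, PrivateKey writerKey) throws Exception {
        Signature s = Signature.getInstance("SHA256withRSA");
        s.initSign(writerKey);
        s.update(update);
        return s.sign();
    }

    static boolean verifyUpdate(byte[] update, byte[] sig, PublicKey writerPub) throws Exception {
        Signature s = Signature.getInstance("SHA256withRSA");
        s.initVerify(writerPub);
        s.update(update);
        return s.verify(sig);
    }
}
```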

Location and Routing. Attenuated Bloom filters (see the Bloom filter appendix slides). [Figure: querying for an object whose GUID hashes to the bit pattern 11010.]
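
A sketch of how an attenuated Bloom filter on one outgoing link might be queried, assuming level i of the array is the union of the Bloom filters of all nodes within i hops along that link, and that bitPositions are the positions produced by the filter's hash functions for the queried GUID (both naming and structure here are illustrative):

```java
import java.util.BitSet;

public final class AttenuatedBloomFilter {
    // levels[i] summarizes the objects stored within i hops along one
    // outgoing link; levels[0] covers the immediate neighbor itself.
    private final BitSet[] levels;

    AttenuatedBloomFilter(BitSet[] levels) { this.levels = levels; }

    // Return the shallowest level whose filter has every queried bit set
    // (a probable hit at that distance), or -1 if no level matches.
    int probableDistance(int[] bitPositions) {
        for (int i = 0; i < levels.length; i++) {
            boolean allSet = true;
            for (int pos : bitPositions) {
                if (!levels[i].get(pos)) { allSet = false; break; }
            }
            if (allSet) return i;
        }
        return -1;  // fall back to the deterministic Plaxton-style routing
    }
}
```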

Location and Routing Plaxton-like trees

Updating data. All data is encrypted. A set of predicates is evaluated in order, and the actions of the earliest true predicate are applied. The update is logged whether it commits or aborts. Predicates: compare-version, compare-block, compare-size, search. Actions: replace-block, insert-block, delete-block, append.

Application-Specific Consistency. An update is the operation of adding a new version to the head of a version stream. Updates are applied atomically and are represented as an array of potential actions, each guarded by a predicate.

Application-Specific Consistency. Example actions: replacing some bytes, appending new data to an object, truncating an object. Example predicates: checking the latest version number, comparing bytes.

Application-Specific Consistency. To implement ACID semantics: check for readers; if none, update. To append to a mailbox: no checking is needed. There are no explicit locks or leases.

Application-Specific Consistency. Predicates for reads, for example: can't read anything older than 30 seconds; can only read data from a specific time frame.
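
A minimal sketch of the guarded-update model from the preceding slides, using hypothetical DataObject, Predicate, and Action interfaces; the real system's predicates and actions operate on blocks and versions rather than this simplified surface:

```java
import java.util.List;

// Hypothetical minimal interfaces, for illustration only.
interface DataObject {
    long latestVersion();
    void appendBytes(byte[] data);   // creates a new version at the head
}

interface Predicate { boolean holds(DataObject obj); }
interface Action    { void apply(DataObject obj); }

final class GuardedUpdate {
    // An update is an ordered list of (predicate, actions) pairs; the
    // actions of the earliest true predicate are applied atomically as
    // one new version, and the outcome (commit or abort) is logged.
    static final class Clause {
        final Predicate guard; final List<Action> actions;
        Clause(Predicate g, List<Action> a) { guard = g; actions = a; }
    }

    static boolean apply(List<Clause> clauses, DataObject obj) {
        for (Clause c : clauses) {
            if (c.guard.holds(obj)) {
                for (Action a : c.actions) a.apply(obj);
                return true;   // commit
            }
        }
        return false;          // abort: no predicate held
    }
}
```

Under this model, the ACID-style example is a clause whose guard checks that no new version has appeared since the client read the object, while the mailbox append is a clause whose guard is always true.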

Replication and Consistency. A data object is a sequence of read-only versions, consisting of read-only blocks named by BGUIDs, so replicating them raises no consistency issues. The mapping from AGUID to the latest VGUID may change, however; it uses primary-copy replication.

Serializing updates. A small primary tier of replicas runs a Byzantine agreement protocol. A secondary tier of replicas optimistically propagates the update using an epidemic protocol. The ordering from the primary tier is multicast to the secondary replicas.
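
A sketch of how a secondary replica might merge the two paths the slide describes, assuming the primary tier's agreement assigns each committed update a dense sequence number and that gossiped updates carry a content-derived identifier (names and structure here are illustrative, not Pond's actual interfaces):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class SecondaryReplica {
    // Tentative updates heard via the epidemic (gossip) path, keyed by a
    // content-derived id; they are not applied to the committed state yet.
    private final Map<String, byte[]> tentative = new HashMap<>();
    // Total order announced by the primary tier: sequence number -> update id.
    private final TreeMap<Long, String> committedOrder = new TreeMap<>();
    private long nextToApply = 0;

    // Epidemic path: remember the update body so it is already local
    // when its position in the committed order arrives.
    void onGossip(String updateId, byte[] body) {
        tentative.putIfAbsent(updateId, body);
    }

    // Dissemination path: the primary tier's Byzantine agreement has
    // assigned this update a sequence number; apply strictly in order.
    void onCommit(long seq, String updateId, byte[] body) {
        tentative.putIfAbsent(updateId, body);
        committedOrder.put(seq, updateId);
        while (committedOrder.containsKey(nextToApply)) {
            apply(tentative.remove(committedOrder.remove(nextToApply)));
            nextToApply++;
        }
    }

    private void apply(byte[] body) { /* fold the update into replica state */ }
}
```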

The Full Update Path

Deep Archival Storage. Data is fragmented, and each fragment is an object. Erasure coding is used to increase reliability (see the erasure codes appendix slide).

Introspection: a feedback loop of computation, observation, and optimization. Uses: cluster recognition, replica management, and others.

Software Architecture. Java atop the Staged Event-Driven Architecture (SEDA). Each subsystem is implemented as a stage with its own state and thread pool; stages communicate through events. About 50,000 semicolons, written by five graduate students and many undergraduate interns.
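
A minimal sketch of the staged structure described above: each stage owns its queue, thread pool, and private state, and other stages interact with it only by enqueueing events (the real SEDA framework has a much richer stage and scheduling API than this):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

abstract class Stage<E> {
    private final BlockingQueue<E> events = new LinkedBlockingQueue<>();
    private final ExecutorService pool;
    private final int threads;

    Stage(int threads) {
        this.threads = threads;
        this.pool = Executors.newFixedThreadPool(threads);
    }

    // Spawn the stage's event-handling loops once the subclass is built.
    void start() {
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    while (true) handle(events.take());   // per-stage event loop
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    // Other stages communicate with this one only by enqueueing events.
    void enqueue(E event) { events.add(event); }

    // Stage-private state lives in the subclass; handle() is the only
    // code that touches it, driven by the stage's own thread pool.
    protected abstract void handle(E event);
}
```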

Software Architecture

Language Choice. Java: speed of development. Strongly typed, garbage collected, reduced debugging time, support for events. Multithreaded Java code is easy to port: ported to Windows 2000 in one week.

Language Choice. Problems with Java: unpredictability introduced by garbage collection. Every thread in the system is halted while the garbage collector runs; any ongoing process stalls for ~100 milliseconds. This may add several seconds to requests that travel across machines.

Experimental Setup. Two test beds. A local cluster of 42 machines at Berkeley, each with GHz-class Pentium III CPUs, 1.5 GB PC133 SDRAM, two 36 GB hard drives in RAID 0, a gigabit Ethernet adaptor, and Linux SMP.

Experimental Setup. PlanetLab: ~100 nodes across ~40 sites, 1.2 GHz Pentium III, 1 GB RAM; ~1000 virtual nodes.

Storage Overhead. For a 32-choose-16 erasure encoding (any 16 of 32 fragments reconstruct the data): 2.7x for data > 8 kB. For a 64-choose-16 erasure encoding: 4.8x for data > 8 kB.
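
For context, an encoding in which any m of n fragments reconstruct the data cannot use less than n/m times the original storage, so the measured factors sit a little above that floor (the gap being block and metadata overhead), assuming "32 choose 16" and "64 choose 16" mean any 16 of 32 (respectively 64) fragments suffice:

```latex
\text{storage overhead} \;\ge\; \frac{n}{m}:
\qquad \frac{32}{16} = 2.0 \;(\text{measured } 2.7\times),
\qquad \frac{64}{16} = 4.0 \;(\text{measured } 4.8\times)
```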

The Latency Benchmark. A single client submits updates of various sizes to a four-node inner ring. Metric: the time from just before the request is signed until the signature over the result is checked. 40 MB of data is written over 1000 updates, with 100 ms between updates.

The Latency Benchmark. Update latency (ms) was measured for 512-bit and 1024-bit keys with 4 kB and 2 MB update sizes, reported as 5%, median, and 95% times. Latency breakdown by phase (ms): Check 0.3, Serialize 6.1, Apply 1.5, Archive 4.5, Sign 77.8.

The Throughput Microbenchmark. A number of clients submit updates of various sizes to disjoint objects on a four-node inner ring. The clients create their objects, synchronize themselves, and then update their objects as many times as possible for 100 seconds.

The Throughput Microbenchmark

Archive Retrieval Performance. Populate the archive by submitting updates of various sizes to a four-node inner ring, then delete all copies of the data in its reconstructed form. A single client submits reads.

Archive Retrieval Performance. Throughput: 1.19 MB/s (PlanetLab), 2.59 MB/s (local cluster). Latency: ~30-70 milliseconds.

The Stream Benchmark. Ran 500 virtual nodes on PlanetLab, with the inner ring in the SF Bay Area and replicas clustered in the 7 largest PlanetLab sites. Updates are streamed to all replicas: one writer (the content creator) repeatedly appends to the data object, and the others read new versions as they arrive. Measures network resource consumption.

The Stream Benchmark

The Tag Benchmark. Measures the latency of token passing. OceanStore is 2.2 times slower than TCP/IP.

The Andrew Benchmark. A file system benchmark: 4.6x slower than NFS in read-intensive phases and 7.3x slower in write-intensive phases.

Bloom Filters [Koloniari and Pitoura]. Compact data structures for a probabilistic representation of a set, appropriate for answering membership queries.

Bloom Filters (cont'd). Query for b: check the bits at positions H_1(b), H_2(b), ..., H_4(b); if all of them are set, b is probably in the set.
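
A minimal Bloom filter sketch matching the description above; double hashing stands in for the k independent hash functions H_1 ... H_k (an illustrative choice, not the specific functions used in OceanStore):

```java
import java.util.BitSet;

final class BloomFilter {
    private final BitSet bits;
    private final int m;   // number of bits in the filter
    private final int k;   // number of hash functions

    BloomFilter(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(Object b, int i) {
        int h1 = b.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(Object b) {
        for (int i = 0; i < k; i++) bits.set(position(b, i));
    }

    // Membership query: if any bit is clear, b is definitely absent;
    // if all are set, b is probably present (false positives are possible,
    // false negatives are not).
    boolean mightContain(Object b) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(b, i))) return false;
        }
        return true;
    }
}
```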

Pair-Wise Reconciliation [Kang et al. 2003]. [Figure: sites A, B, and C each independently write x, producing divergent versions V0 through V5 that must be reconciled pair-wise.]

Hash History Reconciliation. [Figure: each version V_i is summarized by a hash H_i = hash(V_i); sites A, B, and C exchange their hash histories to reconcile.]

Erasure Codes [Mitzenmacher]. [Figure: a message of n blocks is run through an encoding algorithm to produce cn encoded blocks; after transmission, receiving any n of them lets the decoding algorithm reconstruct the original message.]
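
The archive uses far more powerful m-of-n codes than this, but as the simplest concrete instance of the erasure idea in the figure, here is a sketch of single-parity XOR coding: n data fragments plus one parity fragment, tolerating the loss of any one fragment:

```java
final class XorParity {
    // Encode: n data fragments plus one parity fragment (the XOR of all
    // data fragments). Any single missing fragment can then be rebuilt.
    static byte[][] encode(byte[][] data) {
        int len = data[0].length;
        byte[] parity = new byte[len];
        for (byte[] frag : data)
            for (int i = 0; i < len; i++) parity[i] ^= frag[i];
        byte[][] out = new byte[data.length + 1][];
        System.arraycopy(data, 0, out, 0, data.length);
        out[data.length] = parity;
        return out;
    }

    // Decode: XOR of the surviving fragments reconstructs the single
    // missing one (null marks the missing fragment).
    static byte[] reconstructMissing(byte[][] fragments) {
        int len = 0;
        for (byte[] f : fragments) if (f != null) len = f.length;
        byte[] rebuilt = new byte[len];
        for (byte[] f : fragments) {
            if (f == null) continue;
            for (int i = 0; i < len; i++) rebuilt[i] ^= f[i];
        }
        return rebuilt;
    }
}
```

A true m-of-n code, such as the 16-of-32 encoding measured earlier, tolerates the loss of any n - m fragments, at the cost of arithmetic over a finite field rather than plain XOR.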