
Slide 1: OceanStore: Global-Scale Persistent Storage
Ying Lu, CSCE496/896, Spring 2011

Slide 2: Credits
Many slides are from John Kubiatowicz, University of California at Berkeley; I have modified them and added new slides.

Slide 3: Motivation
Personal Information Management is the killer app
– Not corporate processing, but management, analysis, aggregation, dissemination, and filtering for the individual
– Automated extraction and organization of daily activities to assist people
Information technology as a utility
– Continuous service delivery, on a planetary scale, on top of a highly dynamic information base

Slide 4: OceanStore Context: Ubiquitous Computing
Computing everywhere:
– Desktops, laptops, palmtops, cars, cellphones
– Shoes? Clothing? Walls?
Connectivity everywhere:
– Rapid growth of bandwidth in the interior of the net
– Broadband to the home and office
– Wireless technologies such as CDMA, satellite, laser
Rise of the thin-client metaphor:
– Services provided by the interior of the network
– Incredibly thin clients on the leaves (MEMS devices: sensors + CPU + wireless net in 1 mm^3)
Mobile society: people move and devices are disposable

Slide 5: What do we need for personal information management?

Slide 6: Questions about information
Where is persistent information stored?
– The 20th-century tie between location and content is outdated
How is it protected?
– Can a disgruntled employee of your ISP sell your secrets?
– Can't trust anyone (how paranoid are you?)
Can we make it indestructible?
– Want our data to survive "the big one"!
– Highly resistant to hackers (denial of service)
– Wide-scale disaster recovery
Is it hard to manage?
– The worst failures are human-related
– Want automatic (introspective) diagnosis and repair

Slide 7: First Observation: Want Utility Infrastructure
Mark Weiser of Xerox PARC: transparent computing is the ultimate goal
– Computers should disappear into the background
In the storage context:
– Don't want to worry about backup or obsolescence
– Need lots of resources to make data secure and highly available, BUT don't want to own them
– Outsourcing of storage is already very popular
Pay a monthly fee and your "data is out there"
– Simple payment interface: one bill from one company

Slide 8: Second Observation: Need Wide-Scale Deployment
Many components with geographic separation
– System not disabled by natural disasters
– Can adapt to changes in demand and regional outages
Wide-scale use and sharing also requires wide-scale deployment
– Bandwidth is increasing rapidly, but latency is bounded by the speed of light
Handling many people with the same system leads to economies of scale

Slide 9: OceanStore: Everyone's Data, One Big Utility
"The data is just out there"
Separate information from location
– Locality is only an optimization (an important one!)
– Wide-scale coding and replication for durability
All information is globally identified
– Unique identifiers are hashes over names and keys
– Single uniform lookup interface
– No centralized namespace required

Slide 10: Amusing Back-of-the-Envelope Calculation (courtesy Bill Bolosky, Microsoft)
How many files in the OceanStore?
– Assume 10^10 people in the world
– Say 10,000 files per person (very conservative?)
– So 10^14 files in OceanStore!
– If files were 1 GB each (not likely), that is roughly a mole of bytes!
Truly impressive number of elements... but small relative to physical constants
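
For concreteness, a quick check of the slide's arithmetic (the population, per-person file count, and 1 GB file size are the slide's own assumptions):

```python
# Back-of-the-envelope check of the figures above (order-of-magnitude only).
people = 10**10
files_per_person = 10**4
files = people * files_per_person
print(f"{files:.1e} files")                            # 1.0e+14 files
bytes_total = files * 10**9                            # if every file were 1 GB
print(f"{bytes_total / 6.022e23:.2f} moles of bytes")  # ~0.17: within an order of
                                                       # magnitude of a mole
```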

Slide 11: Utility-Based Infrastructure
Service provided by a confederation of companies
– Monthly fee paid to one service provider
– Companies buy and sell capacity from each other
[Diagram: providers such as Pac Bell, Sprint, IBM, AT&T, and a Canadian OceanStore exchanging capacity]

Slide 12: Outline
Motivation
Properties of the OceanStore
Specific technologies and approaches:
– Naming and data location
– Conflict resolution on encrypted data
– Replication and deep archival storage
– Introspective computing for optimization and repair
– Economic models
Conclusion

Slide 13: Ubiquitous Devices → Ubiquitous Storage
Consumers of data move, change from one device to another, work in cafes, cars, airplanes, the office, etc.
Properties REQUIRED for an OceanStore storage substrate:
– Strong security: data encrypted in the infrastructure; resistance to monitoring and denial-of-service attacks
– Coherence: too much data for naïve users to keep coherent "by hand"
– Automatic replica management and optimization: huge quantities of data cannot be managed manually
– Simple and automatic recovery from disasters: the probability of failure increases with the size of the system
– Utility model: a world-scale system requires cooperation across administrative boundaries

Slide 14: OceanStore Technologies I: Naming and Data Location
Requirements:
– System-level names should help to authenticate data
– Route to nearby data without global communication
– Don't inhibit rapid relocation of data
OceanStore approach: two-level search with embedded routing
– The underlying namespace is flat and built from secure cryptographic hashes (160-bit SHA-1)
– The search process combines a quick, probabilistic search with a slower, guaranteed search
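
As a concrete illustration of the flat, hash-based namespace, here is a minimal sketch; the exact inputs OceanStore hashes into a GUID (here, an owner key plus an object name) are an assumption for illustration:

```python
import hashlib

def make_guid(owner_public_key: bytes, name: str) -> bytes:
    """160-bit GUID: a secure hash over a name and the owner's key, so the
    identifier itself helps authenticate the data it names."""
    return hashlib.sha1(owner_public_key + name.encode()).digest()  # 20 bytes

guid = make_guid(b"...owner key bytes...", "/home/ying/notes.txt")
print(guid.hex(), len(guid) * 8, "bits")   # flat namespace, no central authority
```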

Slide 15: Universal Location Facility
Global object resolution takes a 160-bit unique identifier (GUID) and returns the nearest object that matches.
[Diagram: a universal name resolves to an object OID; the object's root structure links the active data in a floating replica (with commit logs and a checkpoint OID), an update OID, and erasure-coded archival copies or snapshots, each with its own version OID]

Slide 16: Routing
Two-tiered approach
Fast probabilistic routing algorithm
– Entities that are accessed frequently are likely to reside close to where they are being used (ensured by introspection)
Slower, guaranteed hierarchical routing method
Self-optimizing

Slide 17: Probabilistic Routing Algorithm
Bloom filter on each node; attenuated Bloom filter on each directed edge.
Self-optimizing on the depth of the attenuated Bloom filter array.
[Diagram: nodes n1 through n4 with per-node Bloom filters and per-edge attenuated Bloom filters; a query for object X (bit pattern 11010) follows the edge whose filter array reports a match at the shallowest depth]
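
To make these structures concrete, here is a simplified sketch (filter size, hash count, and table layout are assumptions, not the deployed design): each node keeps a Bloom filter of its local objects, each outgoing edge keeps an attenuated Bloom filter, i.e. an array of filters where level d summarizes objects reachable d+1 hops away, and a query follows the neighbor that claims a match at the shallowest depth:

```python
import hashlib

M = 64  # bits per filter (toy size; real filters are much larger)

def bit_positions(guid: bytes, k: int = 3):
    """k hash positions in [0, M) for a GUID."""
    return {int.from_bytes(hashlib.sha1(guid + bytes([i])).digest()[:4], "big") % M
            for i in range(k)}

def add(bloom: int, guid: bytes) -> int:
    for b in bit_positions(guid):
        bloom |= 1 << b
    return bloom

def maybe_contains(bloom: int, guid: bytes) -> bool:
    return all(bloom >> b & 1 for b in bit_positions(guid))

def probabilistic_next_hop(edges: dict, guid: bytes):
    """edges maps neighbor -> attenuated filter (a list of Bloom filters, where
    filters[d] summarizes objects d+1 hops through that neighbor). Pick the
    neighbor whose filters claim a match at the shallowest depth."""
    best = None
    for neighbor, filters in edges.items():
        for depth, f in enumerate(filters):
            if maybe_contains(f, guid):
                if best is None or depth < best[0]:
                    best = (depth, neighbor)
                break
    return best[1] if best else None  # None: fall back to guaranteed hierarchical routing
```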

Slide 18: Hierarchical Routing Algorithm
Based on the Plaxton scheme
Every server in the system is assigned a random node-ID
Object's root
– Each object is mapped to a single node whose node-ID matches the object's GUID in the most bits (starting from the least significant)
Information about the GUID (such as location) is stored at its root
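
A small sketch of the root-selection rule just described, assuming hex-string node IDs (the tie-breaking and surrogate-routing details of the real scheme are omitted):

```python
def suffix_match_len(a: str, b: str) -> int:
    """Number of trailing digits a and b share (least significant first)."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def root_for(guid: str, node_ids: list) -> str:
    """The object's root: the node whose ID shares the longest suffix with the
    GUID (ties are broken deterministically in the real scheme)."""
    return max(node_ids, key=lambda nid: suffix_match_len(guid, nid))

nodes = ["0325", "4698", "1D98", "7598", "B4F8"]   # hypothetical node IDs
print(root_for("9598", nodes))                     # -> "7598" (shares suffix "598")
```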

Slide 19: Constructing the Plaxton Mesh
[Diagram: nodes (e.g. 0324, 5724, 1624, 1324, 3714, 2344, 9834, 7144, ...) are linked level by level, each level matching one more trailing digit of the node-ID]

Slide 20: Basic Plaxton Mesh
Incremental suffix-based routing
[Diagram: a message destined for GUID 0x43FE hops through nodes whose IDs match the GUID in one more trailing hex digit at each step, e.g. ...E → ..FE → .3FE → 43FE]
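
A sketch of one hop of this incremental suffix-based routing; the routing-table layout and the surrogate-routing fallback are simplified assumptions:

```python
def shared_suffix(a: str, b: str) -> int:
    """Trailing digits a and b have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def next_hop(node_id: str, guid: str, routing_table):
    """routing_table[level][digit] names a neighbor that shares `level` trailing
    digits with this node and has `digit` in the next position."""
    level = shared_suffix(node_id, guid)       # digits of the GUID already resolved
    if level == len(guid):
        return None                            # this node is the object's root
    wanted = guid[-1 - level]                  # the next digit to resolve
    return routing_table[level].get(wanted)    # real systems use surrogate routing on a miss

table = {3: {"4": "43FE"}}                     # hypothetical level-3 entry for digit '4'
print(next_hop("93FE", "43FE", table))         # -> "43FE"
```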

Slide 21: Use of the Plaxton Mesh: Randomization and Locality

Slide 22: OceanStore Enhancements of the Plaxton Mesh
Documents have multiple roots (salted hash of the GUID)
Each node has multiple neighbor links
Searches proceed along multiple paths
– Tradeoff between reliability, performance, and bandwidth
Dynamic node insertion and deletion algorithms
– Continuous repair and incremental optimization of links
Self-healing, self-optimizing, self-configuring
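
A minimal sketch of the salted-hash idea, assuming the salts are simply small well-known integers appended to the GUID before hashing (the real salting scheme may differ):

```python
import hashlib

def root_guids(guid: bytes, n_roots: int = 4):
    """Derive several root identifiers from one GUID, so a document has multiple
    independent roots and a query can proceed along multiple paths."""
    return [hashlib.sha1(guid + bytes([salt])).digest() for salt in range(n_roots)]

for r in root_guids(bytes.fromhex("43fe" * 10)):
    print(r.hex()[:8], "...")   # each salted hash lands at a different root node
```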

Slide 23: OceanStore Technologies II: Rapid Update in an Untrusted Infrastructure
Requirements:
– A scalable coherence mechanism that can operate directly on encrypted data without revealing information
– Handle Byzantine failures
– Rapid dissemination of committed information
OceanStore approach:
– Operations-based interface using conflict resolution
– Modeled after Xerox Bayou: update packets include predicate/action pairs which operate on encrypted data
– The user signs updates and the principal party signs commits
– Committed data is multicast to clients
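
A structural sketch of such an operations-based update; the field names are assumptions for illustration, not the actual OceanStore/Bayou wire format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PredicateActionPair:
    predicate: bytes   # a check expressed over encrypted blocks (e.g. version or block compare)
    action: bytes      # the ciphertext change to apply if the predicate holds

@dataclass
class Update:
    object_guid: bytes
    pairs: List[PredicateActionPair]   # tried in order; the first satisfied predicate is applied
    user_signature: bytes              # added by the writer
    commit_signature: bytes = b""      # added once the primary tier orders and commits it
```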

Slide 24: Update Model
Concurrent updates without wide-area locking
– Conflict resolution
Update serialization: a master replica?
Role of the primary tier of replicas
– All updates are submitted to the primary tier of replicas, which chooses a final total order by following a Byzantine agreement protocol
A secondary tier of replicas
– The result of the updates is multicast down the dissemination tree to all the secondary replicas

Slide 25: Agreement
Need agreement in distributed systems:
– Leader election, commit, synchronization
Distributed agreement algorithm: all non-faulty processes achieve consensus in a finite number of steps
Perfect processes, faulty channels: the two-army problem
Faulty processes, perfect channels: the Byzantine generals problem

Slide 26: Two-Army Problem

Slide 27: Possible Consensus
Agreement is possible in a synchronous distributed system [e.g., Lamport et al.]
– Messages can be guaranteed to be delivered within a known, finite time
– Byzantine generals problem
A synchronous distributed system can distinguish a slow process from a crashed one

Slide 28: Byzantine Generals Problem
[Diagram]

Slide 29: Byzantine Generals: Example (1)
The Byzantine generals problem for 3 loyal generals and 1 traitor.
a) The generals announce the time to launch the attack (by messages marked with their IDs).
b) The vectors that each general assembles based on (a).
c) The vectors that each general receives, where every general passes his vector from (b) to every other general.

Slide 30: Byzantine Generals: Example (2)
The same as the previous slide, except now with 2 loyal generals and one traitor.

Slide 31: Byzantine Generals
Given three processes, if one fails, consensus is impossible
Given N processes, if F processes fail, consensus is impossible if N ≤ 3F
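
A small simulation of the vector exchange from slides 29-31 (a sketch of that example, not the full OM(m) algorithm; the traitor's lying strategy is an arbitrary assumption):

```python
from collections import Counter

def exchange(values, traitors):
    """Each general announces a value, assembles the vector of what it heard,
    forwards that vector, and takes a per-column majority over the vectors it
    received from the other generals."""
    n = len(values)
    def lie(sender, receiver, col):            # a traitor's (arbitrary) message
        return 100 * sender + 10 * receiver + col
    # Round 1: heard[i][j] = what general i heard general j announce.
    heard = [[values[j] if j not in traitors else lie(j, i, j)
              for j in range(n)] for i in range(n)]
    # Round 2: vectors[i][k] = the vector general i received from general k.
    vectors = [[heard[k] if k not in traitors else [lie(k, i, c) for c in range(n)]
                for k in range(n)] for i in range(n)]
    decisions = {}
    for i in (g for g in range(n) if g not in traitors):
        majority = []
        for col in range(n):
            counts = Counter(vectors[i][k][col] for k in range(n) if k != i)
            value, freq = counts.most_common(1)[0]
            majority.append(value if freq > (n - 1) // 2 else "UNKNOWN")
        decisions[i] = majority
    return decisions

# N = 4, F = 1 (N > 3F): loyal generals 0, 1, 3 compute identical vectors.
print(exchange([1, 2, 3, 4], traitors={2}))
# N = 3, F = 1 (N <= 3F): the loyal generals can no longer recover each other's values.
print(exchange([1, 2, 3], traitors={2}))
```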

Slide 32: Tentative Updates: Epidemic Dissemination

Slide 33: Committed Updates: Multicast Dissemination

Slide 34: Data Coding Model
Two distinct forms of data: active and archival
Active data in floating replicas
– The latest version of the object
Archival data in erasure-coded fragments
– A permanent, read-only version of the object
– During commit, the previous version is erasure-coded and spread over hundreds or thousands of nodes
– Advantage: any 1/2 (or 1/4, depending on the coding rate) of the fragments regenerates the data
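
A toy sketch of rate-1/2 erasure coding in the spirit of this slide, using polynomial interpolation over a small prime field. Real deep-archival codes (e.g. Reed-Solomon or Tornado codes over binary fields) are far more efficient; the point here is only that any k of the n fragments regenerate the data:

```python
P = 257  # toy prime field; data symbols are single bytes

def _lagrange_eval(points, x0):
    """Evaluate the polynomial interpolating `points` at x0, mod P."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (x0 - xm) % P
                den = den * (xj - xm) % P
        total = (total + yj * num * pow(den, P - 2, P)) % P
    return total

def encode(data, n):
    """Treat data[i] as the polynomial's value at x = i+1; add parity points up to x = n."""
    k = len(data)
    points = list(enumerate(data, start=1))                    # systematic fragments
    parity = [(x, _lagrange_eval(points, x)) for x in range(k + 1, n + 1)]
    return points + parity                                     # n fragments total

def decode(fragments, k):
    """Any k surviving fragments reconstruct the original k data bytes."""
    surviving = fragments[:k]
    return bytes(_lagrange_eval(surviving, x) for x in range(1, k + 1))

frags = encode(b"data", n=8)        # k = 4 data fragments, rate 1/2 -> 8 fragments
print(decode(frags[3:7], k=4))      # any 4 of the 8 regenerate b"data"
```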

Slide 35: Floating Replica and Deep Archival Coding
[Diagram: each floating replica holds a full copy of the object (versions such as Ver1: 0x34243, Ver2: 0x49873, ...) plus conflict-resolution logs; archival versions are spread across the infrastructure as erasure-coded fragments]

Slide 36: Proactive Self-Maintenance
Continuous testing and repair of information
– Slow sweep through all information to make sure there are sufficient erasure-coded fragments
– Continuously reevaluate risk and redistribute data
– Slow sweep and repair of metadata/search trees
Continuous online self-testing of HW and SW
– Detects flaky, failing, or buggy components via:
  – Fault injection: triggering hardware and software error-handling paths to verify their integrity/existence
  – Stress testing: pushing HW/SW components past normal operating parameters
  – Scrubbing: periodic restoration of potentially "decaying" hardware or software state
– Automates preventive maintenance
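
A sketch of what the slow fragment sweep might look like; the hook functions (locate_fragments, regenerate, place_fragment) are hypothetical names standing in for infrastructure services, not OceanStore APIs:

```python
def repair_sweep(catalog, locate_fragments, regenerate, place_fragment):
    """catalog: {guid: (n_total, k_needed)} for every archived object.
    Walk all objects, count reachable erasure-coded fragments, and rebuild any
    that have eroded below their target redundancy."""
    for guid, (n_total, k_needed) in catalog.items():
        fragments = locate_fragments(guid)
        if len(fragments) < k_needed:
            continue                          # data unrecoverable here; flag for deeper recovery
        if len(fragments) < n_total:          # durability eroding: rebuild the missing share
            for frag in regenerate(guid, fragments, n_total - len(fragments)):
                place_fragment(guid, frag)    # push to a new, well-separated node
```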

Slide 37: OceanStore Technologies IV: Introspective Optimization
Requirements:
– Do a reasonable job on a global-scale optimization problem: take advantage of locality whenever possible; be sensitive to limited storage and bandwidth at endpoints
– Repair data structures, increase redundancy
– Stability in a chaotic environment → active feedback
OceanStore approach:
– Introspective monitoring and analysis of relationships to cluster information by relatedness
– Time-series analysis of user and data motion
– Rearrangement and replication in response to monitoring:
  – Clustered prefetching: fetch related objects
  – Proactive prefetching: get data there before it is needed
  – Rearrangement in response to overload and attack

Slide 38: Example: Client Introspection
Client observer and optimizer components
– Greedy agents working on behalf of the client
– Watches client activity and combines it with historical info
– Performs clustering and time-series analysis
– Forwards results to the infrastructure (privacy issues!)
– Monitors the state of the network to adapt behaviour
Typical actions:
– Cluster related files together
– Prefetch files that will be needed soon
– Create/destroy floating replicas
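
A small sketch of an observer/optimizer pair of this kind (the data structures and policy are assumptions, not OceanStore's implementation): the observer records which objects the client touches close together in time, and the optimizer treats those co-accessed objects as a cluster to prefetch:

```python
from collections import defaultdict

class Introspection:
    def __init__(self, window=5):
        self.recent = []                        # sliding window of recently accessed GUIDs
        self.window = window
        self.related = defaultdict(set)         # guid -> objects co-accessed with it

    def observe(self, guid):
        """Observer: record that `guid` was accessed near the other recent accesses."""
        for other in self.recent:
            if other != guid:
                self.related[guid].add(other)
                self.related[other].add(guid)
        self.recent = (self.recent + [guid])[-self.window:]

    def prefetch_candidates(self, guid):
        """Optimizer: cluster-mates worth fetching when `guid` is read."""
        return self.related[guid]

intro = Introspection()
for g in ["paper.tex", "refs.bib", "fig1.eps", "paper.tex", "refs.bib"]:
    intro.observe(g)
print(intro.prefetch_candidates("paper.tex"))   # {'refs.bib', 'fig1.eps'}
```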

Slide 39: OceanStore Conclusion
The time is now for a universal data utility
– Ubiquitous computing and connectivity are (almost) here!
– A confederation of utility providers is the right model
OceanStore holds all data, everywhere
– Local storage is a cache on global storage
– Provides security in an untrusted infrastructure
Exploits economies of scale to:
– Provide high availability and extreme survivability
– Lower maintenance cost: self-diagnosis and repair
Insensitivity to technology changes: just unplug one set of servers and plug in others


