Making the Archive Real — presentation transcript

1 Making the Archive Real
My name is HW, and this is joint work with CW and JK at UC Berkeley. I am going to talk primarily about the OceanStore archival layer, except where it is appropriate to show its relationship to other components, such as the inner ring or the secondary replicas. Hakim Weatherspoon, University of California, Berkeley. ROC/OceanStore Retreat, Lake Tahoe. Tuesday, June 11, 2002.

2 Archival Storage
Every version of every object is archived, at least in principle. How long does it take to write? To read? Can the archive run inline, or is it best to batch? This talk is really a war story about how we fight for that principle. In OceanStore, every version of every object is archived (at least in principle), which means the archival process has to add essentially no overhead, or only minimal overhead, to the base update process. Specifically, I am going to talk about how long it takes to write to and read from the archive, where we can improve performance, and where the bottlenecks are. In standing by our principle, can we archive every update inline, or is it best to batch archival requests, and what are the tradeoffs and consequences?

3 Outline
Archival process context. Archival modes. Archival performance. Future directions.

4 Path of an Update
As an example, imagine a very popular file, such as a popular MP3, that has been cached and replicated throughout the infrastructure. The purple bubbles represent the cached and replicated file, and the orange bubbles represent a Byzantine ring responsible for committing changes to the file. Suppose now that the author wants to add a new verse and modify the file. She submits her changes to the infrastructure; they go to an inner ring, which performs a Byzantine agreement on the changes and multicasts them down to each replica. This situation should be of concern: each replica can respond to a request for the file, so you need some way to verify the results returned, and each replica can come and go as it pleases, so again, where is the persistent storage? Verify data. Persistent storage.

5 Archival Dissemination Built into Update
Erasure codes provide redundancy without the overhead of strict replication: they produce n fragments, any m of which are sufficient to reconstruct the data, with m < n. The rate is r = m/n, and the storage overhead is 1/r. For example, m = 16 and n = 64 increases storage by a factor of 4. We have pictured building this redundancy step into the update process: again the modification to the MP3 file is submitted to the network, a Byzantine agreement is performed, and right away fragments from the erasure code are disseminated. Note that the fragments are much smaller than the actual data. To keep the system durable, you could imagine a repair process that goes around and replaces lost redundancy (fragments). Now let's say that you want to keep the system durable at the storage cost of four replicas; how can you structure the redundancy to comply with this constraint? Assume a repair process. A small sketch of the rate arithmetic follows.
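The following is a minimal sketch of that arithmetic in plain Python, not the prototype's coder (a Reed-Solomon-style code would do the actual fragmenting); it only shows how the rate determines the storage cost.

```python
# Each of the n fragments is roughly 1/m the size of the block, so storing
# all n of them costs n/m = 1/r times the original data.
def storage_overhead(m: int, n: int) -> float:
    assert 0 < m < n
    rate = m / n
    return 1.0 / rate

print(storage_overhead(16, 64))   # the example above: rate 1/4, so 4x storage
print(storage_overhead(1, 4))     # replication on four servers is the m = 1, n = 4 case
```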

6 Durability
Fraction of Blocks Lost Per Year (FBLPY)*: r = 1/4, erasure-encoded block (e.g., m = 16, n = 64). Increasing the number of fragments increases the durability of a block at the same storage cost and repair time. The n = 4 fragment case is equivalent to replication on four servers. (* Erasure Coding vs. Replication, H. Weatherspoon and J. Kubiatowicz, in Proc. of IPTPS 2002.) You can compute the durability of a system that spends only a factor of four in storage cost on each data item. We have graphed the probability of losing a block versus the repair time in months. The top line is the case of four replicas, and each successive line increases the total number of fragments while keeping the storage cost the same. Note that the y-axis is logarithmic, meaning that as you increase the number of fragments, the chance of losing a block goes down exponentially. The key liability of fragments is that each fragment must be either whole or completely erased: the algorithm cannot handle corrupt fragments, because you might have to attempt up to n-choose-m combinations to reconstruct the data correctly. Also, since the infrastructure stores the fragments, you need some way to verify the fragments that are returned. Figure 3 shows the resulting Fraction of Blocks Lost Per Year (FBLPY) for a variety of epoch lengths and numbers of fragments. For a given frequency of repair and number of fragments, this graph shows the fraction of all encoded blocks that are lost. To generate this graph, we construct the probability that a fragment will survive until the next epoch from the distribution of disk lifetimes in [26], using a method similar to computing the residual average lifetime of a randomly selected disk [6]. The probability that a block can be reconstructed is the sum of all cases in which at least m fragments survive. The FBLPY is the inverse of the mean of the geometric distribution produced from the block survival probability (i.e., 1/MTTF of a block). Assumptions: we assume that the archive consists of a collection of independently failing disks and that failed disks are immediately replaced by new, blank ones. During dissemination, each archival fragment for a given block is placed on a unique, randomly selected disk. We assume the global sweep and repair process described above.
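To make the "at least m of n survive" argument concrete, here is a simplified sketch: it assumes each fragment independently survives a repair epoch with some probability p (a made-up number below), whereas the paper derives that probability from measured disk-lifetime distributions.

```python
from math import comb

def block_survival(p: float, m: int, n: int) -> float:
    """Probability that at least m of n independently surviving fragments remain."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

p = 0.95                                   # hypothetical per-fragment, per-epoch survival
for m, n in [(1, 4), (16, 64)]:            # four replicas vs. a rate-1/4 code
    print(m, n, 1 - block_survival(p, m, n))   # probability of losing the block
```

Even at the same 4x storage cost, the rate-1/4 code's loss probability comes out many orders of magnitude smaller than the four-replica case, which is the exponential improvement the log-scale graph shows.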

7 Naming and Verification Algorithm
Use a cryptographically secure hash algorithm to detect corrupted fragments. Verification tree: with n fragments, store log(n) + 1 hashes with each fragment, for a total of n·(log(n) + 1) hashes. The top hash is the block GUID (B-GUID), so fragments and blocks are self-verifying. In the n = 4 example on the slide, H1 through H4 are the hashes of fragments F1 through F4, Hd is the hash of the data, H12 = hash(H1, H2), H34 = hash(H3, H4), H14 = hash(H12, H34), and the B-GUID is the hash over H14 and Hd. Each encoded fragment is stored with its sibling hashes: Fragment 1 carries H2, H34, Hd and the F1 fragment data; Fragment 2 carries H1, H34, Hd and F2; Fragment 3 carries H4, H12, Hd and F3; Fragment 4 carries H3, H12, Hd and F4; and the data block carries H14 and the data.
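A small sketch of this n = 4 tree and the per-fragment check, assuming SHA-1 as the hash and simple concatenation before hashing (both are stand-ins, not a claim about the prototype):

```python
import hashlib

def H(*parts: bytes) -> bytes:
    h = hashlib.sha1()
    for p in parts:
        h.update(p)
    return h.digest()

def block_guid(fragments, data):
    """Build the n = 4 verification tree from the slide and return the B-GUID."""
    h1, h2, h3, h4 = (H(f) for f in fragments)
    h12, h34 = H(h1, h2), H(h3, h4)
    h14 = H(h12, h34)
    return H(h14, H(data))              # top hash over the fragment-tree root and Hd

def verify_fragment1(f1, stored_hashes, expected_bguid):
    """Check fragment 1 against the hashes shipped with it: (H2, H34, Hd)."""
    h2, h34, hd = stored_hashes
    h14 = H(H(H(f1), h2), h34)          # recompute H1 -> H12 -> H14
    return H(h14, hd) == expected_bguid

frags, data = [b"F1", b"F2", b"F3", b"F4"], b"data"
bguid = block_guid(frags, data)
stored = (H(b"F2"), H(H(b"F3"), H(b"F4")), H(data))    # H2, H34, Hd
print(verify_fragment1(b"F1", stored, bguid))           # True
```

Verifying a fragment needs only the hashes shipped with it plus the B-GUID obtained out of band, at a cost of a handful of hash computations (proportional to log n).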

8 Enabling Technology
Here we have a person who wants to listen to this MP3. They submit a GUID to the infrastructure and get back fragments that can be verified and used to reconstruct the block.

9 Complex Objects I
[Figure: a data block d is the unit of coding; its encoded fragments are the unit of archival storage; the verification tree yields the GUID of d.] So far we have only shown how to make one block verifiable. Now we show how to make an object of arbitrary size verifiable.

10 Complex Objects II
[Figure: a version is a B-tree of blocks. The top block, named by the VGUID, points through indirect blocks (branching factor M) to data blocks d1 through d9; each block is the unit of coding, its encoded fragments are the unit of archival storage, and a verification tree gives the GUID of each block, e.g. the GUID of d1.] Now we show how to make more complex objects of arbitrary size verifiable. By asking the DOLR to reconstruct the top block, we can get every block in the picture. If we were a library, we would stop there: versions are read-only.
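Here is a sketch of the idea under simplifying assumptions (SHA-1 as the hash, a small illustrative branching factor, indirect blocks represented as raw concatenations of child GUIDs); it is not the prototype's on-disk format.

```python
import hashlib

BRANCH = 3          # illustrative branching factor, not the prototype's M

def guid(block: bytes) -> bytes:
    return hashlib.sha1(block).digest()

def build_version(data_blocks):
    """Return (vguid, {guid: block}) for a simple hash tree over the blocks."""
    store, level = {}, []
    for d in data_blocks:
        g = guid(d); store[g] = d; level.append(g)
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), BRANCH):
            indirect = b"".join(level[i:i + BRANCH])   # an indirect block lists child GUIDs
            g = guid(indirect); store[g] = indirect; parents.append(g)
        level = parents
    return level[0], store

vguid, store = build_version([b"d%d" % i for i in range(1, 10)])
print(vguid.hex())
```

Because every parent records the GUIDs of its children, fetching the top block by its VGUID and re-hashing on the way down verifies the whole version.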

11 Complex Objects III
AGUID = hash{name + keys}. Storage is free/cheap, so we inherently keep all versions. [Figure: two versions, VGUIDi and VGUIDi+1; the newer version's B-tree shares the unchanged data blocks d1 through d7 with the old one via copy-on-write, adds new blocks d'8 and d'9, and keeps a backpointer to VGUIDi.] We can create a versioning file system where each version is tied to one AGUID (i.e., hash{name + keys}). By adding a backpointer and copy-on-write capability, we get a trail of versions, with the AGUID tying the VGUIDs together. The pointer to the head of the data is tricky; this limitation motivates the need for heartbeats (a signed map from AGUID to VGUID). Some way of verifying the integrity of the head pointer is required for any mutable system such as a file system.

12 Mutable Data
We need mutable data for a real system: some entity in the network maintains the AGUID-to-VGUID mapping. Byzantine commitment provides integrity: it verifies client privileges, creates a serial order, and atomically applies each update. In the versioning system, each version is inherently read-only. The end result is complex objects with mutability: a trail of versions, with an AGUID tying the VGUIDs together. The pointer to the head of the data is tricky; this limitation motivates the need for heartbeats (a signed map from AGUID to VGUID).
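As a sketch of what such a heartbeat might carry, the snippet below binds an AGUID to the latest VGUID with a sequence number and timestamp; an HMAC with a made-up key stands in for the inner ring's real (threshold) signature, and the field layout is illustrative.

```python
import hashlib, hmac, time

RING_KEY = b"stand-in for the inner ring's signing key"   # hypothetical

def make_heartbeat(aguid: bytes, vguid: bytes, seq: int) -> dict:
    msg = aguid + vguid + seq.to_bytes(8, "big") + int(time.time()).to_bytes(8, "big")
    sig = hmac.new(RING_KEY, msg, hashlib.sha1).digest()   # stand-in for a real signature
    return {"aguid": aguid, "vguid": vguid, "seq": seq, "msg": msg, "sig": sig}

def check_heartbeat(hb: dict) -> bool:
    """A client accepts the freshest heartbeat with a valid signature as the
    current AGUID -> VGUID mapping (the head pointer of the object)."""
    expected = hmac.new(RING_KEY, hb["msg"], hashlib.sha1).digest()
    return hmac.compare_digest(hb["sig"], expected)

hb = make_heartbeat(b"aguid-of-file", b"vguid-of-latest-version", seq=42)
print(check_heartbeat(hb))    # True
```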

13 Archiver I
Archiver server architecture: requests to archive objects are received through the network layers, and the consistency mechanisms decide to archive an object. [Figure: server architecture with the operating system, Java virtual machine, and thread scheduler at the bottom; asynchronous network and disk layers; and a dispatch component routing events among the consistency, location & routing, archiver, and introspection modules.] Look, we have built this, using the principles of event-driven programming.

14 Archiver Control Flow
[Figure: control flow through the archiver's stages. The consistency stage(s) issue a GenerateFragsChkptReq and a GenerateFragsReq; the GenerateChkpt and GenerateFrags stages answer with GenerateFragsChkptResp and GenerateFragsResp; a DisseminateFragsReq then drives the Disseminator stage, which sends the fragments to the storage servers and returns a DisseminateFragsResp.]
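The sketch below mimics that staged, event-driven flow with a single dispatch queue; the checkpoint path is omitted, the "coder" is a placeholder, and none of this is the Java staged-event framework the prototype actually runs on.

```python
from collections import deque

def generate_frags(block):
    frags = [block[i::4] for i in range(4)]              # placeholder "coder", not erasure coding
    return [("DisseminateFragsReq", frags)]

def disseminate(frags):
    print("sending %d fragments to storage servers" % len(frags))
    return [("DisseminateFragsResp", len(frags))]

def finished(count):
    print("archived: %d fragments acknowledged" % count)
    return []

ROUTES = {                                               # event type -> handling stage
    "GenerateFragsReq": generate_frags,
    "DisseminateFragsReq": disseminate,
    "DisseminateFragsResp": finished,
}

def dispatch(first_event):
    pending = deque([first_event])                       # one shared dispatch queue
    while pending:
        kind, payload = pending.popleft()
        pending.extend(ROUTES[kind](payload))            # a stage may emit further events

# The consistency stage would enqueue this once an update commits:
dispatch(("GenerateFragsReq", b"bytes of the new version"))
```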

15 Archival Modes
The two disks run in software RAID 0 (striping) mode using md raidtools-0.90. We used BerkeleyDB when reading and writing blocks to disk, and we simulated a number of independent event streams, each either reading or writing.

16 Performance
Performance of the archival layer: the performance of an OceanStore server in archiving objects. We analyze the operations of archiving data (this includes signing updates in a BFT protocol), with m = 16, n = 32. Experiment environment: OceanStore servers were analyzed on a 42-node cluster. Each machine in the cluster is an IBM xSeries 330 1U rackmount PC with two 1.0 GHz Pentium III CPUs, 1.5 GB ECC PC133 SDRAM, and two 36 GB IBM UltraStar 36LZX hard drives. The machines use a single Intel PRO/1000 XF gigabit Ethernet adaptor to connect to a Packet Engines gigabit switch, and they run a Linux SMP kernel. The two disks run in software RAID 0 (striping) mode using md raidtools-0.90. We used BerkeleyDB when reading and writing blocks to disk, and we simulated a number of independent event streams, each either reading or writing.

17 Performance: Throughput I
Data throughput: no archive, 5 MB/s; delayed, 3 MB/s; inlined, 2.5 MB/s.

18 Performance: Throughput II
Operations throughput (small writes): no archive, 55 ops/s; delayed, 8 ops/s; inlined, 50 ops/s.

19 Performance: Latency I
Archive only: y-intercept 3 ms, slope 0.3 s/MB. Inlined archive = no archive + archive only. (The slope in ms/kB is numerically the same in s/MB: ms/kB × 1 s/1000 ms × 1000 kB/1 MB.)
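Turning those graph readings into arithmetic (treating the inlined line as "no archive + archive only" is the model stated above, not a separate measurement):

```python
def archive_only_ms(size_mb: float) -> float:
    return 3.0 + 0.3 * 1000.0 * size_mb        # 3 ms intercept + 0.3 s/MB (= 0.3 ms/kB)

def inlined_ms(no_archive_ms: float, size_mb: float) -> float:
    return no_archive_ms + archive_only_ms(size_mb)

print(archive_only_ms(1.0))                    # ~303 ms of archive work for a 1 MB write
```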

20 Performance: Latency II
No archive: low variance. Delayed archive: high variance.

21 Performance: Latency III
Inlined archive: low variance. Delayed archive: high variance.

22 Performance: Read Latency
Low median, high variance: a queuing effect.

23 Future Directions
Caching for performance. Automatic replica placement. Automatic repair. Low failure correlation dissemination. [Picture of heartbeats and repair.]

24 Caching
Automatic replica placement: replicas are soft state and can be constructed and destroyed as necessary. Prefetching: reconstruct replicas from fragments in advance of use. Performance vs. durability.

25 Efficient Repair
Global: global sweep and repair is not efficient. We want detection and notification of node removal from the system; a global sweep is not as effective as distributed mechanisms. Distributed: exploit the DOLR's distributed information and locality properties for efficient detection and then reconstruction of fragments. [Figure: a ring of L1 heartbeats among neighboring node IDs (e.g., 3274, 4577, 5544, AE87), with L1, L2, and L3 links.] The system should automatically adapt to failure, repair itself, and incorporate new elements. Can we guarantee that data is available for 1000 years? New servers are added from time to time, old servers are removed from time to time, and everything just works. With many components and geographic separation, the system is not disabled by natural disasters; it can adapt to changes in demand and regional outages, gaining stability through statistics.

26 Low Failure Correlation Dissemination
Model Builder: models failure correlation from various sources. Set Creator: queries random nodes to build dissemination sets, i.e., sets of storage servers that fail with low correlation. Disseminator: sends fragments to the members of a set. [Figure: introspection pipeline. Human input and network monitoring feed the Model Builder; its model, together with probes of candidate nodes, feeds the Set Creator, which hands a set to the Disseminator; the Disseminator sends the fragments to the storage servers.] A small sketch of this pipeline follows.
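The sketch below stands in for that pipeline: a hand-written pairwise-correlation table plays the role of the Model Builder's output, a greedy Set Creator picks low-correlation servers, and the Disseminator just calls a send function. Server names, the threshold, and the greedy rule are all illustrative.

```python
def create_set(candidates, correlation, n, threshold=0.2):
    """Greedily pick n servers whose pairwise failure correlation stays under threshold."""
    chosen = []
    for server in candidates:
        if all(correlation.get(frozenset((server, c)), 0.0) < threshold for c in chosen):
            chosen.append(server)
        if len(chosen) == n:
            return chosen
    raise RuntimeError("not enough low-correlation servers")

def disseminate(fragments, servers, send):
    for frag, server in zip(fragments, servers):
        send(server, frag)

# Hypothetical model: servers A1 and A2 sit in the same machine room.
corr = {frozenset(("A1", "A2")): 0.9, frozenset(("A1", "B1")): 0.05,
        frozenset(("A2", "B1")): 0.05, frozenset(("A1", "C1")): 0.05,
        frozenset(("A2", "C1")): 0.05, frozenset(("B1", "C1")): 0.05}
servers = create_set(["A1", "A2", "B1", "C1"], corr, n=3)
disseminate([b"f1", b"f2", b"f3"], servers, send=lambda s, f: print("send", f, "to", s))
```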

27 Conclusion
A storage-efficient, self-verifying mechanism: erasure codes are good, performance is good, and improvements are forthcoming. Self-verifying data assists in secure read-only data, secure caching infrastructures, and continuous adaptation and repair. For more information:

