
Slide 1: Providing High Reliability in a Minimum Redundancy Archival Storage System
Deepavali Bhagwat, Kristal Pollack, Darrell D. E. Long, Ethan L. Miller (Storage Systems Research Center, University of California, Santa Cruz)
Thomas Schwarz (Computer Engineering Department, Santa Clara University)
Jehan-François Pâris (Department of Computer Science, University of Houston, Texas)

Slide 2: Introduction
- Archival data is forecast to increase ten-fold from 2007 to 2010 (J. McKnight, T. Asaro, and B. Babineau, "Digital Archiving: End-User Survey and Market Forecast 2006-2010," The Enterprise Strategy Group, Jan. 2006)
- Data compression techniques are used to reduce storage costs
- Deep Store, an archival storage system:
  - Uses interfile and intrafile compression techniques
  - Uses chunking
- Compression hurts reliability: loss of a shared chunk causes disproportionate data loss
- Our solution: reinvest the saved storage space to improve reliability through selective replication of chunks
- Our results: better reliability than mirrored Lempel-Ziv compressed files, using only about half of the storage space

Slide 3: Deep Store: An Overview
- Whole-file hashing (content-addressable storage)
- Delta compression
- Chunk-based compression (a sketch of this process follows below):
  - Files are broken into variable-length chunks using a sliding-window technique
  - A chunk identifier (digest) is used to look for identical chunks
  - Only unique chunks are stored
- [Diagram: a fixed-size window slides over the file; the window fingerprint determines chunk boundaries, producing variable-size chunks, each named by a chunk ID (content address)]
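The slide names the sliding-window technique but not its parameters, so the following is a minimal sketch of content-defined chunking. It assumes a SHA-1-based window fingerprint, a 48-byte window, and a boundary mask giving roughly 8 KiB average chunks; these values are illustrative, not Deep Store's actual choices, and a production system would use a rolling hash such as a Rabin fingerprint rather than rehashing every window.

```python
import hashlib

# Illustrative parameters (assumptions, not Deep Store's actual values)
WINDOW_SIZE = 48          # bytes covered by the sliding window
BOUNDARY_MASK = 0x1FFF    # expected average chunk size of about 8 KiB
MIN_CHUNK = 2 * 1024
MAX_CHUNK = 64 * 1024

def window_fingerprint(window: bytes) -> int:
    """Fingerprint of the current window (a real system would use a
    rolling hash such as a Rabin fingerprint for efficiency)."""
    return int.from_bytes(hashlib.sha1(window).digest()[:8], "big")

def chunk_file(data: bytes):
    """Split data into variable-length chunks; a chunk boundary is declared
    whenever the window fingerprint matches the boundary mask."""
    chunks, start = [], 0
    for i in range(len(data)):
        if i - start < MIN_CHUNK:
            continue
        window = data[max(0, i - WINDOW_SIZE):i]
        at_boundary = (window_fingerprint(window) & BOUNDARY_MASK) == 0
        if at_boundary or i - start >= MAX_CHUNK:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])  # final chunk
    # Each chunk is named by a digest of its contents (its content address)
    return [(hashlib.sha1(c).hexdigest(), c) for c in chunks]
```

Because boundaries depend on content rather than fixed offsets, an insertion early in a file shifts only nearby chunk boundaries, so most of the file's chunks still match chunks already stored in the archive.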

Slide 4: Effects of Compression on Reliability
- Chunk-based compression introduces interfile dependencies
- Loss of a shared chunk causes a disproportionate amount of data loss (a toy example follows below)
- [Diagram: several files sharing chunks, illustrating interfile dependencies]
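As a toy illustration of how small a stored chunk can be relative to the data that depends on it (the numbers here are made up, not taken from the experiment):

```python
# Hypothetical example: one 4 KB chunk shared by many deduplicated files.
chunk_size = 4 * 1024                     # bytes actually stored once
dependent_file_sizes = [120_000] * 500    # 500 files of 120 KB each reference it

data_at_risk = sum(dependent_file_sizes)  # bytes unreadable if that chunk is lost
print(chunk_size, data_at_risk)           # 4096 vs. 60,000,000: disproportionate loss
```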

Slide 5: Effects of Compression on Reliability (continued)
- Simple experiment to show the effects of interfile dependencies:
  - 9.8 GB of data from several websites, The Internet Archive
  - Compressed using chunking to 1.83 GB (5.62 GB using gzip)
  - Chunks were mirrored and distributed evenly onto 179 devices of 20 MB each

Slide 6: Compression and Reliability
- Chunking:
  - Minimizes redundancy and gives excellent compression ratios
  - Introduces interfile dependencies, which are detrimental to reliability
- Hence, reintroduce redundancy through selective replication of chunks
- Some chunks are more important than others. How important?
  - The amount of data depending on a chunk (byte count)
  - The number of files depending on a chunk (reference count)
- Selective replication strategy: the weight of a chunk (w) determines the number of replicas for that chunk (k); we use a heuristic function to calculate k

Slide 7: Heuristic Function
- k: number of replicas
- w: weight of a chunk
- a: base level of replication, independent of w
- b: boosts the number of replicas for chunks with high weight
- Every chunk is mirrored
- k_max: maximum number of replicas; as the number of replicas increases, the reliability gained from each additional replica diminishes
- k is rounded off to the nearest integer (a hedged sketch of such a function follows below)
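The slide lists the parameters of the heuristic, but the formula itself is not captured in this transcript, so the following is only a plausible sketch. It assumes k grows logarithmically with w, consistent with the note that each additional replica buys less reliability, is rounded to the nearest integer, never drops below 2 since every chunk is mirrored, and is capped at k_max; the exact functional form used by the authors may differ.

```python
import math

def replica_count(w: float, a: float, b: float, k_max: int) -> int:
    """Hedged sketch of the replica-count heuristic.

    Assumption: k = a + b * log2(w), rounded to the nearest integer,
    clamped to the range [2, k_max] (2 because every chunk is mirrored).
    """
    k = a + b * math.log2(max(w, 1.0))   # assumed functional form
    return max(2, min(int(round(k)), k_max))

# Example with a = 0, b = 0.55, k_max = 4 (parameter values taken from the
# robustness plots later in the deck):
print(replica_count(w=1, a=0, b=0.55, k_max=4))       # lightly shared chunk -> 2
print(replica_count(w=10_000, a=0, b=0.55, k_max=4))  # heavily shared chunk -> 4
```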

Slide 8: Distribution of Chunks
- An archival system receives files in batches
- Files are stored onto a disk until the disk is full
- For every file:
  - Chunks are extracted and compressed
  - Unique chunks are stored
- A non-unique chunk is stored again only if:
  - The present disk does not contain the chunk, and
  - For this chunk, k < k_max
- At the end of the batch, all chunks are revisited and replicas are made for the appropriate chunks
- A chunk is not proactively replicated; instead, wait for a chunk's replica to arrive as a chunk of a future file, which reduces inter-device dependencies for a file (a sketch of this policy follows below)
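A hedged sketch of this placement policy, assuming a simple device model and an in-memory index of which device holds which chunk; the Device class, the files and placement structures, and replica_target are illustrative names, not Deep Store's interfaces.

```python
from collections import defaultdict

class Device:
    """Minimal model of a storage device (capacity in bytes)."""
    def __init__(self, capacity):
        self.capacity, self.used, self.chunks = capacity, 0, set()

    def has_room(self, size):
        return self.used + size <= self.capacity

    def store(self, chunk_id, size):
        self.chunks.add(chunk_id)
        self.used += size

def store_batch(files, devices, k_max, replica_target):
    """Sketch of the placement policy described on the slide.

    `files` maps a file name to a list of (chunk_id, size) pairs, and
    `replica_target(chunk_id)` returns the heuristic k for that chunk.
    Assumes enough devices exist to hold the batch; this is an
    illustration, not Deep Store's code.
    """
    copies = defaultdict(int)   # chunk_id -> number of devices holding a copy
    sizes = {}                  # chunk_id -> chunk size
    current = 0                 # index of the disk currently being filled

    for chunk_list in files.values():
        for chunk_id, size in chunk_list:
            sizes[chunk_id] = size
            if not devices[current].has_room(size):
                current += 1    # current disk is full, move to the next
            dev = devices[current]
            # Store the chunk only if this disk lacks it and fewer than
            # k_max copies exist anywhere (new chunks always qualify).
            if chunk_id not in dev.chunks and copies[chunk_id] < k_max:
                dev.store(chunk_id, size)
                copies[chunk_id] += 1

    # End of batch: revisit chunks and create any replicas still owed.
    for chunk_id, have in copies.items():
        want = min(replica_target(chunk_id), k_max)
        for dev in devices:
            if have >= want:
                break
            if chunk_id not in dev.chunks and dev.has_room(sizes[chunk_id]):
                dev.store(chunk_id, sizes[chunk_id])
                have += 1
```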

Slide 9: Experimental Setup
- We measure robustness: the fraction of the data available given a certain percentage of unavailable storage devices (a sketch of this metric follows below)
- We use replication to introduce redundancy; future work will investigate erasure codes
- Data sets:
  - HTML, PDF, and image files from The Internet Archive (9.8 GB)
  - HTML, image (JPG and TIFF), PDF, and Microsoft Word files from The Santa Cruz Sentinel (40.22 GB)
- We compare the robustness and storage space utilization of archives that use:
  - Chunking with selective redundancy, and
  - Lempel-Ziv compression with mirroring
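The transcript does not spell out how robustness was evaluated, so this is an illustrative sketch under one plausible interpretation: fail a fixed fraction of devices at random, count a file's bytes as available only if every chunk it depends on survives on at least one remaining device, and average over trials. The data structures and the random failure model are assumptions, not the authors' exact procedure.

```python
import random

def robustness(files, placement, num_devices, frac_unavailable, trials=100):
    """Estimate the fraction of data still available when a fraction of
    devices is unavailable.

    `files` maps a file name to (size, list_of_chunk_ids); `placement` maps
    a chunk_id to the set of device indices holding a copy of that chunk.
    """
    total = sum(size for size, _ in files.values())
    failed_count = int(frac_unavailable * num_devices)
    available_fractions = []
    for _ in range(trials):
        failed = set(random.sample(range(num_devices), failed_count))
        surviving_bytes = 0
        for size, chunk_ids in files.values():
            # A file survives only if every chunk has a copy on a live device.
            if all(placement[c] - failed for c in chunk_ids):
                surviving_bytes += size
        available_fractions.append(surviving_bytes / total)
    return sum(available_fractions) / trials
```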

Slide 10: Details of the Experimental Data

Slide 11: Weight of a Chunk
- When using dependent data (byte count) as the heuristic: w = D/d, where
  - D: sum of the sizes of all files depending on the chunk
  - d: average size of a chunk
- When using the number of files (reference count) as the heuristic: w = F, where
  - F: number of files depending on the chunk
- (A sketch computing both weights follows below)
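A minimal sketch that computes both weights from a file-to-chunks index; the index layout is assumed for illustration, but the formulas follow the slide directly.

```python
def chunk_weights(files, avg_chunk_size, use_byte_count=True):
    """Compute chunk weights as defined on the slide.

    `files` maps a file name to (file_size, list_of_chunk_ids); this index
    structure is an assumption made for the sketch.

    Byte-count heuristic:      w = D / d  (D = total size of dependent files,
                                           d = average chunk size)
    Reference-count heuristic: w = F      (F = number of dependent files)
    """
    dependent_bytes = {}   # chunk_id -> D
    dependent_files = {}   # chunk_id -> F
    for file_size, chunk_ids in files.values():
        for chunk_id in set(chunk_ids):
            dependent_bytes[chunk_id] = dependent_bytes.get(chunk_id, 0) + file_size
            dependent_files[chunk_id] = dependent_files.get(chunk_id, 0) + 1
    if use_byte_count:
        return {c: D / avg_chunk_size for c, D in dependent_bytes.items()}
    return dict(dependent_files)
```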

Slide 12: Robustness, effect of varying a (w = F, b = 1, k_max = 4), The Internet Archive

Slide 13: Robustness, effect of varying a (w = D/d, b = 0.4, k_max = 4), The Internet Archive

Slide 14: Robustness, effect of limiting k (w = D/d, b = 0.55, a = 0), The Internet Archive

Slide 15: Robustness, effect of varying b (w = D/d, a = 0, k_max = 4), The Internet Archive

Slide 16: Robustness, effect of varying b (w = D/d, a = 0, k_max = 4), The Sentinel

Slide 17: Choice of a Heuristic
- The choice of heuristic depends on the corpus:
  - If file size is indicative of file importance, choose w = D/d
  - If a file's importance is independent of its size, choose w = F
- Use the same metric to measure robustness

Slide 18: Future Work
- Study the reliability of Deep Store:
  - With a recovery model in place
  - When using delta compression
- Use different redundancy mechanisms, such as erasure codes
- Data placement in conjunction with hardware statistics

Slide 19: Related Work
- Many archival systems use content-addressable storage:
  - EMC's Centera
  - Variable-length chunks: LBFS
  - Fixed-size chunks: Venti
- OceanStore aims to provide continuous access to persistent data; it uses automatic replication for high reliability and erasure codes for high availability
- FARSITE, a distributed file system, replicates metadata; replication was chosen to avoid the overhead of data reconstruction incurred when using erasure codes
- PASIS and Glacier use aggressive replication as protection against data loss
- LOCKSS provides long-term access to digital data; it uses a peer-to-peer audit and repair protocol to preserve the integrity of, and long-term access to, document collections

Slide 20: Conclusion
- Chunking gives excellent compression ratios but introduces interfile dependencies that adversely affect system reliability
- Selective replication of chunks using heuristics gives:
  - Better robustness than mirrored LZ-compressed files
  - Significantly higher storage space efficiency, using only about half of the space needed by mirrored LZ-compressed files
- We use simple replication; our results will only improve with other forms of redundancy

Slide 21: Providing High Reliability in a Minimum Redundancy Archival Storage System
Deepavali Bhagwat, Kristal Pollack, Darrell D. E. Long, Ethan L. Miller (Storage Systems Research Center, University of California, Santa Cruz)
Thomas Schwarz (Computer Engineering Department, Santa Clara University)
Jehan-François Pâris (Department of Computer Science, University of Houston, Texas)

