Presentation is loading. Please wait.

Presentation is loading. Please wait.

1© Copyright 2012 EMC Corporation. All rights reserved. Delta Compressed and Deduplicated Storage Using Stream-Informed Locality Philip Shilane, Grant.

Similar presentations


Presentation on theme: "1© Copyright 2012 EMC Corporation. All rights reserved. Delta Compressed and Deduplicated Storage Using Stream-Informed Locality Philip Shilane, Grant."— Presentation transcript:

1 1© Copyright 2012 EMC Corporation. All rights reserved. Delta Compressed and Deduplicated Storage Using Stream-Informed Locality Philip Shilane, Grant Wallace, Mark Huang, & Windsor Hsu Backup Recovery Systems Division EMC Corporation

2 2© Copyright 2012 EMC Corporation. All rights reserved. Motivation and Approach  Improve storage compression –Decrease price per GB –Decrease data center space –Decrease power –Decrease management  Combine deduplication and delta compression –Remove identical data regions –Compress with similar data regions

3 3© Copyright 2012 EMC Corporation. All rights reserved. Previous Work on Similarity Indexing  Version information [Burns’97, MacDonald’00]  Similarity index in memory [Aronovich’09, Kulkarni’04]  Similarity index on-disk [You’11]  Stream-informed delta locality for WAN replication [Shilane’12] –Low WAN throughput –Did not store delta compressed data

4 4© Copyright 2012 EMC Corporation. All rights reserved. Contributions 1.First deduplicated and delta compressed storage implementation using stream- informed locality 2.Quantify throughput and suggest improvement areas 3.Explore new complexities related to data integrity and cleaning 4.Report combination of deduplication and delta compression across chunk sizes

5 5© Copyright 2012 EMC Corporation. All rights reserved. FP cache hit on 1,2,3,5, & 6 Stream-informed Deduplication backup.tar Content Defined Chunks Chunk 1 Chunk 2 Chunk 3 Chunk 4 Chunk 5 Chunk 6 Fp 1Fp 2Fp 3Fp 4Fp 5 Fp 6 Secure Fingerprint of Chunks backup.tar Content Defined Chunks Chunk 1 Chunk 2 Chunk 3 Chunk 4’ Chunk 5 Chunk 6 Fp 1Fp 2Fp 3Fp 4’Fp 5 Fp 6 Secure Fingerprint of Chunks 1 week later Fp 1Fp 2Fp 3 Fp 4Fp 5Fp 6 Fingerprint Cache Chunks stored together on disk Fp Index … Fp 6 -> C1 Fp 1 -> C1 Container C1 Meta Data Data Fp 1 … 6 Chunks 1 … 6

6 6© Copyright 2012 EMC Corporation. All rights reserved. Stream-informed Deduplication and Delta Compression backup.tar Content Defined Chunks Chunk 1 Chunk 2 Chunk 3 Chunk 4 Chunk 5 Chunk 6 Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4 Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 Secure Fingerprint and Sketch of Chunks backup.tar Content Defined Chunks Chunk 1 Chunk 2 Chunk 3 Chunk 4’ Chunk 5 Chunk 6 Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4’ Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 Secure Fingerprint and Sketch of Chunks 1 week later Fp 1…Fp 6 Sk 1…Sk 6 Fingerprint and Sketch Cache FP cache hit on 1,2,3,5, & 6 Sketch cache hit on 4 Delta encode Chunks stored together on disk Fp Index … Fp 6 -> C1 Fp 1 -> C1 Container C1 Meta Data Data Fp 1 … 6 Sk 1 … 6 Chunks 1 … 6

7 7© Copyright 2012 EMC Corporation. All rights reserved. Deduplication and Delta Compression Chunk Maximal Value 1 Maximal Value 2 Maximal Value 3 Maximal Value 4 Chunk (similar to earlier chunk) Regions of difference super_feature = Rabin_fp(feature 1 …feature 4 ) sketch is one or more super_features Sketches based on Broder’97 fpsk fp Chunk fp (duplicate of earlier chunk) Store fp and differences fp

8 8© Copyright 2012 EMC Corporation. All rights reserved. Backup Datasets DatasetTypeBackup PolicyTBMonths Workstations16 desktops Weekly full Daily incremental 2.34 Email MS Exchange server Daily full2.55 Source Code Version control repository Weekly full Daily incremental 4.56 System Logs Server’s /var directory Weekly full Daily incremental 5.34

9 9© Copyright 2012 EMC Corporation. All rights reserved. Compression Results Delta adds 1.4 – 3.5X compression improvement over deduplication and LZ DatasetDeduplicationDeltaLZ Total Compression Delta Improvement Workstations5.04.21.633.63.5 Email4.92.62.126.82.1 Source Code16.73.62.5150.31.4 System Logs25.23.31.8149.71.5

10 10© Copyright 2012 EMC Corporation. All rights reserved. Throughput  Delta compression requires extra computation and I/O  Compare to deduplicated storage as baseline  Throughput: 74% on first full backup Throughput: 53% on later full backups fp - = Read Calculate differenceStore delta

11 11© Copyright 2012 EMC Corporation. All rights reserved. Throughput Stages Single-stream timing for each stage Dataset Sketch MB/s Lookup MB/s Encode Mb/s HDD MB/s SSD MB/s Workstations471,528945400 Email491,44169180 Source Code30 312160 System Logs3070502160 Read Alternatives Aggregate throughput is higher due to: deduplication, 2/3 are delta encoded, multi-threading and asynchronous reads across multiple disks

12 12© Copyright 2012 EMC Corporation. All rights reserved. Indirection Complexities  Writing duplicates causes unintended read paths –Unpredictable read back times  Multi-level delta increases compression and complexity –We implemented 1-level delta catchcat… catchcatcatchcatcatch cat + ch catch Read back cat catch + er catcher

13 13© Copyright 2012 EMC Corporation. All rights reserved. Indirection Complexities  End-to-end validity checks are slow because of remote references –Reconstructing a delta chunk requires reading the base  Incorrect garbage collection can cause loops and data loss catchcat… catch cat catch cat + ch catch catch + er catcher Verify catcher

14 14© Copyright 2012 EMC Corporation. All rights reserved. Garbage Collection  Cleaning deleted chunks in a log structured file system –Reference counts –Mark-and-sweep  Copying live chunks forward changes data locality, which impacts delta compression applebananacat C1 bananaapplezebrawhalezebra C2 applebanana C3 zebra

15 15© Copyright 2012 EMC Corporation. All rights reserved. Garbage Collection Impact on Delta Compression GC changes data locality but has only a small impact on delta compression

16 16© Copyright 2012 EMC Corporation. All rights reserved. Compression vs. Chunk Size Delta adds significant compression beyond deduplication and LZ Delta helps maintain high total compression as chunk size varies Similar results on all datasets

17 17© Copyright 2012 EMC Corporation. All rights reserved. Summary and Future Work  Deduplication and delta compression prototype –Stream-informed locality replaces sketch indexes and improves write path throughput –Adds 1.4X – 3.5X compression  Studied throughput –Throughput 50% of underlying deduplication system –Areas for improvement: SSD  Garbage collection and data integrity –Remote reference complexity –Affects speed and validity  Delta helps maintain overall compression across a broad range of chunk sizes

18 18© Copyright 2012 EMC Corporation. All rights reserved. Questions?

19

20 20© Copyright 2012 EMC Corporation. All rights reserved. Compression Results DatasetDeduplicationDeltaLZ Total Compression Delta Improvement Workstations5.04.21.633.63.5 Email4.92.62.126.82.1 Source Code16.73.62.5150.31.4 System Logs25.23.31.8149.71.5 DatasetDeduplicationDeltaLZ Total Compression Workstations5.0NA1.99.6 Email4.9NA2.612.7 Source Code16.7NA6.4106.8 System Logs25.2NA4.0100.8


Download ppt "1© Copyright 2012 EMC Corporation. All rights reserved. Delta Compressed and Deduplicated Storage Using Stream-Informed Locality Philip Shilane, Grant."

Similar presentations


Ads by Google