RevDedup: A Reverse Deduplication Storage System Optimized for Reads to Latest Backups Chun-Ho Ng, Patrick P. C. Lee The Chinese University of Hong Kong.

Slides:



Advertisements
Similar presentations
Henry C. H. Chen and Patrick P. C. Lee
Advertisements

Yuchong Hu1, Henry C. H. Chen1, Patrick P. C. Lee1, Yang Tang2
File Systems.
1 STAIR Codes: A General Family of Erasure Codes for Tolerating Device and Sector Failures in Practical Storage Systems Mingqiang Li and Patrick P. C.
1© Copyright 2012 EMC Corporation. All rights reserved. Delta Compressed and Deduplicated Storage Using Stream-Informed Locality Philip Shilane, Grant.
Tradeoffs in Scalable Data Routing for Deduplication Clusters FAST '11 Wei Dong From Princeton University Fred Douglis, Kai Li, Hugo Patterson, Sazzala.
Low-Cost Data Deduplication for Virtual Machine Backup in Cloud Storage Wei Zhang, Tao Yang, Gautham Narayanasamy University of California at Santa Barbara.
1 Live Deduplication Storage of Virtual Machine Images in an Open-Source Cloud Chun-Ho Ng, Mingcao Ma, Tsz-Yeung Wong, Patrick P. C. Lee, John C. S. Lui.
G Robert Grimm New York University Sprite LFS or Let’s Log Everything.
File System Implementation CSCI 444/544 Operating Systems Fall 2008.
CS 333 Introduction to Operating Systems Class 18 - File System Performance Jonathan Walpole Computer Science Portland State University.
1 Outline File Systems Implementation How disks work How to organize data (files) on disks Data structures Placement of files on disk.
Chapter 12: File System Implementation
G Robert Grimm New York University Sprite LFS or Let’s Log Everything.
File System Implementation
THE DESIGN AND IMPLEMENTATION OF A LOG-STRUCTURED FILE SYSTEM M. Rosenblum and J. K. Ousterhout University of California, Berkeley.
Multi-level Selective Deduplication for VM Snapshots in Cloud Storage Wei Zhang*, Hong Tang †, Hao Jiang †, Tao Yang*, Xiaogang Li †, Yue Zeng † * University.
DEDUPLICATION IN YAFFS KARTHIK NARAYAN PAVITHRA SESHADRIVIJAYAKRISHNAN.
Data Deduplication in Virtualized Environments Marc Crespi, ExaGrid Systems
An Evaluation of Using Deduplication in Swappers Weiyan Wang, Chen Zeng.
1 Convergent Dispersal: Toward Storage-Efficient Security in a Cloud-of-Clouds Mingqiang Li 1, Chuan Qin 1, Patrick P. C. Lee 1, Jin Li 2 1 The Chinese.
Objectives Learn what a file system does
Understanding the Benefits and Costs of Deduplication Mahmoud Abaza, and Joel Gibson School of Computing and Information Systems, Athabasca University.
File Systems and Disk Management. File system Interface between applications and the mass storage/devices Provide abstraction for the mass storage and.
1 File Systems Chapter Files 6.2 Directories 6.3 File system implementation 6.4 Example file systems.
Min Xu1, Yunfeng Zhu2, Patrick P. C. Lee1, Yinlong Xu2
DATA DEDUPLICATION By: Lily Contreras April 15, 2010.
Demystifying Deduplication. Global SMB Event Marketing 2 APPROACH: What is deduplication? Eliminate redundant data Start with the backup environment as.
THE DESIGN AND IMPLEMENTATION OF A LOG-STRUCTURED FILE SYSTEM M. Rosenblum and J. K. Ousterhout University of California, Berkeley.
File System Implementation Chapter 12. File system Organization Application programs Application programs Logical file system Logical file system manages.
Improving Content Addressable Storage For Databases Conference on Reliable Awesome Projects (no acronyms please) Advanced Operating Systems (CS736) Brandon.
Ji-Yong Shin Cornell University In collaboration with Mahesh Balakrishnan (MSR SVC), Tudor Marian (Google), Lakshmi Ganesh (UT Austin), and Hakim Weatherspoon.
Chapter 8 – Main Memory (Pgs ). Overview  Everything to do with memory is complicated by the fact that more than 1 program can be in memory.
The Design and Implementation of Log-Structure File System M. Rosenblum and J. Ousterhout.
1 Shared Files Sharing files among team members A shared file appearing simultaneously in different directories Share file by link File system becomes.
1 CloudVS: Enabling Version Control for Virtual Machines in an Open- Source Cloud under Commodity Settings Chung-Pan Tang, Tsz-Yeung Wong, Patrick P. C.
Silberschatz, Galvin and Gagne  Operating System Concepts Chapter 12: File System Implementation File System Structure File System Implementation.
Module 4.0: File Systems File is a contiguous logical address space.
CS 153 Design of Operating Systems Spring 2015 Lecture 21: File Systems.
Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage Jeremy C. W. Chan*, Qian Ding*, Patrick P.
Improving Disk Throughput in Data-Intensive Servers Enrique V. Carrera and Ricardo Bianchini Department of Computer Science Rutgers University.
Fast File System 2/17/2006. Introduction Paper talked about changes to old BSD 4.2 File System (FS) Motivation - Applications require greater throughput.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition File System Implementation.
CS333 Intro to Operating Systems Jonathan Walpole.
Lecture 10 Page 1 CS 111 Summer 2013 File Systems Control Structures A file is a named collection of information Primary roles of file system: – To store.
Fragmentation in in-line deduplication backup systems 1. Reducing Impact of Data Fragmentation Caused By In-Line Deduplication. Michal Kaczmarczyk, Marcin.
CS 540 Database Management Systems
File Systems.  Issues for OS  Organize files  Directories structure  File types based on different accesses  Sequential, indexed sequential, indexed.
Deploying disk deduplication for Hyper-v 3.0 Žigmund Maťašovský.
Silberschatz, Galvin and Gagne ©2013 Operating System Concepts – 9 th Edition Chapter 12: File System Implementation.
Elastic Parity Logging for SSD RAID Arrays Yongkun Li*, Helen Chan #, Patrick P. C. Lee #, Yinlong Xu* *University of Science and Technology of China #
File Systems and Disk Management
Jonathan Walpole Computer Science Portland State University
Rekeying for Encrypted Deduplication Storage
Information Leakage in Encrypted Deduplication via Frequency Analysis
A Simulation Analysis of Reliability in Primary Storage Deduplication
Demystifying Deduplication
Filesystems.
HashKV: Enabling Efficient Updates in KV Storage via Hashing
File Systems and Disk Management
File Systems and Disk Management
File Systems and Disk Management
File Systems and Disk Management
File Systems and Disk Management
Chapter 14: File-System Implementation
File Systems and Disk Management
File Systems and Disk Management
Jingwei Li*, Patrick P. C. Lee #, Yanjing Ren*, and Xiaosong Zhang*
Fan Ni Xing Lin Song Jiang
The Design and Implementation of a Log-Structured File System
Presentation transcript:

RevDedup: A Reverse Deduplication Storage System Optimized for Reads to Latest Backups Chun-Ho Ng, Patrick P. C. Lee The Chinese University of Hong Kong APSYS’13 1

Virtual Machine Backup 2 VM Backup Storage Snapshot Backup  Virtual machines (VMs) widely deployed Reduce hardware footprints  Regular backups for VM disk images critical for disaster recovery Disk-based backup  Storage explosion! Many VM images Many versions per VM Virtual Machine Environment

Deduplication  Disk prices drop, but data grows much faster 1.8 trillion GB new data in the wild in 2011 [IDC, 2011]  Deduplication removes content redundancy to improve storage efficiency Store one copy of chunks with same content Effective for VMs [Jin, SYSTOR’09]  Two types: Inline: redundancy removed on write path Out-of-order: redundancy removed after data stored Extra I/O cost 3

Fragmentation  Fragmentation in inline deduplication 4 ABCDEFGH VM 1 ABCD’EF’GH VM 2 AB’CD’E’ F’’ GH’ VM 3 D’F’ABCDEFGHE’F’’H’B’ VM 3 VM 2 VM 1  Latest backup  most fragmented

Related Work  Optimized for reads in dedup systems: Selective rewrites [Kaczmarczyk, SYSTOR’12; Nam, MASCOTS’12] Rewrite duplicate data to mitigate fragmentation Container capping and prefetching [Lillibridge, FAST’13]  Their designs are based on inline deduplication  remove duplicates from new data  Another open issue: empirical performance? 5

Our Work  Goal: mitigate fragmentation for latest backups Rationale: latest data more likely to read  Contributions: Hybrid approach of inline and out-of-order deduplication Prototype and evaluation on real VM dataset low memory usage high storage efficiency GB/s of read/write throughput 6 RevDedup: a deduplication system optimizing reads for latest VM backups

Key Idea  Two-phase deduplication  Coarse-grained global deduplication Inline deduplication to different VMs Removes duplicates on the write path  Fine-grained reverse deduplication Out-of-order deduplication to different versions of the same VM Remove duplicates from old backup versions Shift fragmentation to old data 7

Preliminaries  Definitions: Backup: a single snapshot of a VM disk image Versions: different snapshots of the same VM  Deduplication basics: Chunking: divide data into chunks Fixed-size chunking; simple and effective for VM image dedup Fingerprinting: compute crypto hashes for chunks Comparison: two chunks with same fingerprints are the same (assuming no hash collisions) 8

Global Deduplication  Inline deduplication on large fixed-size units called segments Several MB of size  Global: eliminate duplicates in different levels: In the same version In different versions of the same VM In different version of different VMs  Small memory usage for indexing 1PB data, 8MB segment size, 32 byte entry  memory usage for indexing 4GB 9

Global Deduplication 10  Seek time of locating segments is reduced compared to reading segment contents  Still effective in dedup efficiency: Idea: changes confined in small regions  Challenge: dedup efficiency not as good as conventional inline approaches Mitigate fragmentation by amortizing disk seeks on reading segments

Reverse Deduplication  Finer-grained deduplication to further remove duplicates  Blocks: sub-segment of KBs (e.g., 4KB for FS blocks)  Compare two versions of the same VM; remove duplicate blocks in the earlier version  Accessing earlier blocks follow references to future versions 11

Reverse Deduplication  VM 3 is sequential  VM 1 is the most fragmented 12 Shift fragmentation to earlier versions by eliminating earlier blocks D’F’AB C DEFGHE’F’’H’B’ VM 3 VM 2 VM 1

Shared Segments  Block reference counts  Add 1 for all blocks in a segment when referenced by a VM version  Minus 1 when a block refers to a future version in reverse deduplication  Remove blocks when reference counts dropped to zero 13

Shared Segments: Example 14 VMA 1 BDEFHACG VMB 1 BDEFHACG VMB 2 E’F’D’BACGH’ VMA 2 D’BACEHG F BDAC BACEFHGE’F’GH’ VMA 1 VMB 1 VMA 2 VMB 2 Typo in the paper: Block G of segment EFGH should have refcnt = 2 instead of BDAC

Removal of Duplicate Blocks  Block punching fallocate() to release space to native file system Generate free-space fragmentation  Segment compaction Copy all remaining blocks to another segment and delete old one Incur I/O overhead  Threshold-based block removal mechanism Few blocks removed  block punching Many blocks removed  segment compaction 15

Evaluation  Client-server implementation RevDedup server mounted on ext4  Dataset 160 VMs x 12 versions for students to work on assignments over 12 weeks total 1920 versions  Testbed RAID-0 disk array with 8 SATA disks 1.37GB/s raw write 1.27GB/s raw read  Evaluate storage efficiency and backup/restore throughput 16 Backup Virtual Machine Environment RevDedup Restore Storage

Space Usage  RevDedup matches deduplication savings of conventional deduplication (96.8% for 128KB) 17

Backup Throughput  RevDedup achieves 3-5x speedup of raw writes 18

Restore Throughput  RevDedup shifts fragmentation to earlier versions 19

Summary of Results  RevDedup achieves: similar storage savings as conventional deduplication high backup throughput high restore throughput for latest versions by shifting fragmentation to earlier versions  Tech. report shows additional evaluation results Different rebuild thresholds Tracing indirect references 20

Open Issues  How to make effective tradeoff? Fixed size vs. variable size chunking Backup throughput vs. restore throughput Storage efficiency vs. throughput  What about different workloads? Unstructured data Primary storage data  Any new applications? 21

Conclusions  Advocate the importance of read optimization in deduplication system  Propose RevDedup for VM image backups Low memory usage High storage efficiency High backup performance High restore performance on latest backups  Source code: 22