
1 Low-Cost Data Deduplication for Virtual Machine Backup in Cloud Storage
Wei Zhang, Tao Yang, Gautham Narayanasamy (University of California at Santa Barbara), Hong Tang (Alibaba Inc.)
USENIX HotStorage '13

2 Motivation
- Virtual machines in the cloud can use frequent backup to improve service reliability
  - Used in Alibaba's Aliyun, the largest public cloud service in China
- High storage demand & a large amount of duplicate content
  - Daily backup workload at Aliyun: hundreds of TB
  - Number of VMs per cluster: tens of thousands
- Goal: an inexpensive solution

3 Architecture Considerations
- An external, dedicated backup storage system
  - High network traffic for transferring undeduplicated data
  - Expensive
- A decentralized backup system co-hosted with the cloud services, with full deduplication
  - Lower cost & network traffic

4 Requirements
- Non-dedicated resources
  - Co-hosted with existing cloud services
  - Resource friendly: small memory footprint and low CPU usage
- Compute and back up tens of thousands of VMs within a few hours each day, during periods of light cloud workload

5 Focus and Related Work
Previous work:
- Inline chunk-based deduplication
  - High cost for fingerprint lookup
  - Fingerprint comparison sped up with approximation (e.g., subsampling, Bloom filters, stateless routing)
Focus of this paper:
- Not inline: shortens the overall backup time of many VM images rather than the latency of individual requests
- Not offline: multi-stage parallel backup with small storage overhead and limited computing resources
- Work in progress

6 Key Ideas
- Separation of duplicate detection from data backup, unlike inline deduplication
- Buffered data redistribution during parallel duplicate detection
- Stage 1: Collect fingerprints in parallel
- Stage 2: Detect duplicates in parallel
- Stage 3: Perform the actual VM backup in parallel
(A sketch of the three-stage flow follows.)
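At a high level, the pipeline can be pictured as a driver that runs the three stages with a barrier between them. This is a minimal Python sketch; the machine methods and the barrier helper are hypothetical names standing in for the paper's mechanisms, and the loops stand in for parallel execution across machines:

    def backup_all_vms(machines):
        # Stage 1: each machine scans its local VMs' dirty segments and
        # routes dedup requests to the machines owning those fingerprints.
        for m in machines:
            m.accumulate_dedup_requests()
        barrier(machines)
        # Stage 2: each machine compares the received requests against its
        # partitions of the global fingerprint index, one partition at a
        # time, and sends a dedup summary back for every requesting VM.
        for m in machines:
            m.compare_fingerprints()
        barrier(machines)
        # Stage 3: each machine re-reads its dirty segments and backs up
        # only the blocks its summaries mark as new.
        for m in machines:
            m.backup_nonduplicates()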

7 VM Snapshot Representation
- Data blocks are variable-sized
- Segments are fixed-size
(A sketch of this two-level layout follows.)
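One plausible way to model the two-level layout in code; the field names are illustrative, not the paper's on-disk format. The 2MB segment size and ~4KB average block size come from the evaluation slides:

    from dataclasses import dataclass, field
    from typing import List

    SEGMENT_SIZE = 2 * 1024 * 1024  # fixed-size segments (2MB in the evaluation)

    @dataclass
    class Block:
        fingerprint: bytes   # content hash of one variable-sized data block
        length: int          # block sizes vary (avg. 4KB in the evaluation)

    @dataclass
    class Segment:
        dirty: bool                                   # only dirty segments are scanned
        blocks: List[Block] = field(default_factory=list)

    @dataclass
    class Snapshot:
        vm_id: str
        segments: List[Segment] = field(default_factory=list)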

8 Stage 1: Deduplication Request Accumulation
- Scan dirty data blocks
- Exchange & accumulate dedup requests
- Map data from a VM-based to a fingerprint-based distribution
(A sketch of the fingerprint routing follows.)
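The redistribution step hinges on mapping each fingerprint to the machine (and local index partition) that owns it, and batching requests per destination. A minimal sketch under assumed names; Q and the request tuple layout are illustrative, and it reuses the Snapshot model sketched above:

    P = 100   # machines in the cluster (from the evaluation setting)
    Q = 10    # index partitions per machine (illustrative)

    def owner_of(fp: bytes) -> tuple[int, int]:
        # Fingerprints are already uniform content hashes, so their leading
        # bytes can pick a machine and a local index partition directly.
        h = int.from_bytes(fp[:8], "big")
        return h % P, (h // P) % Q

    def stage1_scan(snapshot, send_buffers):
        # Scan dirty segments only and emit one dedup request per block.
        for seg_no, seg in enumerate(snapshot.segments):
            if not seg.dirty:
                continue
            for blk_no, blk in enumerate(seg.blocks):
                machine, part = owner_of(blk.fingerprint)
                # Buffered redistribution: requests are batched per target
                # machine and flushed over the network when a buffer fills.
                send_buffers[machine].append(
                    (part, blk.fingerprint, snapshot.vm_id, seg_no, blk_no))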

9 Stage 2: Fingerprint Comparison and Summary Output
- Load the global index and the dedup requests one partition at a time
- Compare fingerprints in parallel
- Output dedup summaries, mapping results from the fingerprint-based back to the VM-based distribution
(A sketch follows.)
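A sketch of the per-partition comparison loop; load_index_partition, store_index_partition, and new_block_reference are hypothetical helpers standing in for the index storage layer, and Q is the per-machine partition count assumed above:

    def stage2_compare(recv_requests, summary_buffers):
        # Processing one partition at a time keeps only a single slice of
        # the global fingerprint index memory-resident.
        for part_id in range(Q):
            index = load_index_partition(part_id)   # fingerprint -> block reference
            for fp, vm_id, seg_no, blk_no in recv_requests[part_id]:
                ref = index.get(fp)
                if ref is None:
                    # First occurrence: record it as a new block.
                    ref = new_block_reference(vm_id, seg_no, blk_no)
                    index[fp] = ref
                    is_dup = False
                else:
                    is_dup = True
                # Summaries travel back from the fingerprint-based to the
                # VM-based distribution, buffered per VM before disk writes.
                summary_buffers[vm_id].append((seg_no, blk_no, is_dup, ref))
            store_index_partition(part_id, index)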

10 Stage 3: Non-Duplicate Data Backup
- Load dedup summaries
- Read dirty segments
- Write out non-duplicate data blocks
(A sketch follows.)
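The final stage re-reads each dirty segment sequentially and consults the summary for every block; read_segment and the backup store API are assumed names, not the paper's interfaces:

    def stage3_backup(snapshot, summaries, store):
        for seg_no, seg in enumerate(snapshot.segments):
            if not seg.dirty:
                continue
            # One sequential read per 2MB dirty segment (dirty data is
            # read a second time here; see the conclusions slide).
            data = read_segment(snapshot.vm_id, seg_no)
            offset = 0
            for blk_no, blk in enumerate(seg.blocks):
                is_dup, ref = summaries[(seg_no, blk_no)]
                if is_dup:
                    # Duplicate: record a reference instead of the bytes.
                    store.add_reference(snapshot.vm_id, seg_no, blk_no, ref)
                else:
                    store.write_block(blk.fingerprint,
                                      data[offset:offset + blk.length])
                offset += blk.length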

11 Memory Usage per Machine at Different Stages
Stage 1: Request accumulation
- 1 I/O buffer to read dirty segments
- p network send and p receive buffers for p machines
- q dedup request buffers for local disk writes of q partitions
Stage 2: Fingerprint comparison
- Space for hosting 1 index partition and its corresponding requests
- p network send and p receive buffers; v local summary buffers for disk writes
Stage 3: Non-duplicate backup
- An I/O buffer to read dirty segments and write non-duplicates
- The dedup summary for the dirty segments being processed
(A back-of-envelope budget calculation follows.)
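Translating the buffer accounting above into arithmetic makes the small footprint concrete. All sizes below are illustrative parameters chosen for the example, not the paper's measurements:

    def stage_memory_bytes(p, q, v, io_buf, net_buf, req_buf, sum_buf, index_part):
        # Per-machine memory for each stage, following the slide's accounting.
        stage1 = io_buf + 2 * p * net_buf + q * req_buf
        stage2 = index_part + 2 * p * net_buf + v * sum_buf
        stage3 = io_buf   # plus the summary of the segment being processed
        return stage1, stage2, stage3

    # Example: p=100 machines, q=10 partitions, v=25 VMs, 64KB network,
    # request, and summary buffers, a 2MB I/O buffer, an 8MB index partition.
    for stage, size in enumerate(stage_memory_bytes(
            100, 10, 25, 2 << 20, 64 << 10, 64 << 10, 64 << 10, 8 << 20), 1):
        print(f"stage {stage}: {size / (1 << 20):.1f} MB")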

12 Issues with Incidental Redundancy
- Two VM blocks with the same fingerprint may be created in parallel on different machines
  - Both are identified as new blocks and stored
  - All remaining occurrences are detected as duplicates and logged
- The inconsistency is repaired periodically during index updates (sketched below)
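One illustrative way the periodic repair could work during an index update; this is a guess at the mechanism the slide names, not the paper's exact procedure:

    def merge_new_entries(index, new_entries, reclaim_log):
        # Two machines may have stored the same new block concurrently.
        # When merging new entries into the global index, keep the first
        # reference per fingerprint and log the extra physical copy so a
        # later pass can reclaim its space.
        for fp, ref in new_entries:
            if fp in index:
                reclaim_log.append(ref)   # redundant copy to be reclaimed
            else:
                index[fp] = ref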

13 Snapshot Deletion
- Mark-and-sweep: a block can be deleted once its reference count drops to zero
- The procedure mirrors the deduplication stages:
  - Scan the metadata and accumulate block reference pointers
  - Compute the reference count of each index entry, partition by partition
  - Log deletion instructions
- A compact operation runs periodically when the deletion log grows too large
(A sketch of the per-partition sweep follows.)
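A sketch of the per-partition reference counting; fingerprints_in_partition and load_index_partition are assumed helpers over the snapshot metadata and index store:

    from collections import Counter

    def sweep_partition(part_id, snapshot_metadata, deletion_log):
        # Mark: scan snapshot metadata and count references to each
        # fingerprint that falls into this index partition.
        refs = Counter()
        for snap in snapshot_metadata:
            for fp in snap.fingerprints_in_partition(part_id):
                refs[fp] += 1
        # Sweep: log a deletion instruction for every indexed block that
        # no snapshot references; a periodic compact pass applies the log.
        index = load_index_partition(part_id)
        for fp, ref in index.items():
            if refs[fp] == 0:
                deletion_log.append(ref)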

14 Evaluation
Evaluated on a cluster of dual quad-core Intel Nehalem 2.4GHz E5530 machines with 24GB of memory each.
Test data from the Alibaba Aliyun cloud:
- 41TB; 10 snapshots per VM
- Segment size: 2MB; average block size: 4KB
Evaluation objectives:
1) Analyze the deduplication throughput and effectiveness for a large number of VMs.
2) Examine the impact of buffering during metadata exchange.

15 Data Characteristics
- Each VM uses 40GB of storage space on average
- OS and user data disks each take ~50% of the space
- OS data: 7 mainstream OS releases
  - Debian, Ubuntu, Red Hat, CentOS, Windows 2003 32-bit, Windows 2003 64-bit, and Windows 2008 64-bit
- User data: from 1323 VM users

16 Setting & Resource Usage per Machine
- Setting: p = 100 machines, 25 VMs per machine
- Disk:
  - 8GB of metadata
  - 10 ms local disk seek cost
  - 50MB/s of I/O per machine: under 16.7% of local I/O bandwidth
- Memory usage: ~35MB
- CPU: single-threaded execution per machine, 10-13% of a single core

17 Parallel Time When Memory Limit Varies

18 Performance When 35MB of Memory Is Used per Machine
Option 1: unoptimized data redistribution.

19 Conclusions
- Low-cost multi-stage parallel deduplication for the simultaneous backup of many VM images
  - Co-hosted with other cloud services
  - Tradeoff: not optimized for individual backup requests; dirty data is read twice
  - Work in progress
- Evaluation
  - Backup throughput of about 8.76GB per second on 100 machines for 2500 VMs
  - Resource friendly to the existing cluster services

20 Questions?

