
1 New Protocols for Remote File Synchronization Based on Erasure Codes Utku Irmak, Svilen Mihaylov, Torsten Suel (Polytechnic University)

2 Outline  Introduction and Common Applications  Problem Formalization  Contributions  An Approach Based on Erasure Codes A Simple Multi-Round Protocol An Efficient Single-Round Protocol A Practical Protocol Based on Erasure Codes  Implementation Overview  Preliminary Results  Conclusions

3 Introduction  Remote File Synchronization Problem: How to update an outdated version of a file over a network with a minimal amount of communication  When the versions are very similar, the total data transmitted should be significantly smaller than the file size (Figure: Machine A holds the current version, Machine B holds the outdated version)

4 Common Applications  Synchronization of User Files Synchronization between different machines that may only be connected over a slow network (e.g., home and work machines) Both rsync and unison are widely used tools  Web and FTP Site Mirroring Significant similarities between successive versions, including sites distributing new versions of software rsync is widely used

5 Common Applications  Content Distribution Networks File synchronization is a natural approach for updating content replicated at the network edge  Web Access over Slow Links A user revisiting a webpage may already have a previous version in the browser cache It would be desirable to avoid retransmitting the entire page This idea is implemented in rproxy, which uses the rsync algorithm

6 Problem Formalization  We have two files (strings) over some alphabet: f_new (the current file) and f_old (the outdated file)  We have two machines: C (the client) and S (the server), connected by a communication link  C only has a copy of f_old, and S only has a copy of f_new  Goal: Design a protocol between the parties that results in C holding a copy of f_new while minimizing the total communication cost

7 Problem Formalization  The communication cost should depend on the degree of similarity between the two files The Hamming distance The edit distance The edit distance with block moves  We focus mainly on the edit distance with block moves. We assume that each block move operation adds 3 to the distance, while other operations add 1
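
As a small illustration (this example pair of strings is ours, not from the slides): if f_new is obtained from f_old by moving one block to the front and substituting one character, the distance is at most 3 + 1 = 4:

```latex
f_{old} = \texttt{AAAABBBBCCCC}, \qquad f_{new} = \texttt{BBBBAAAACCCX}
\quad\Longrightarrow\quad
d(f_{old}, f_{new}) \le \underbrace{3}_{\text{block move}} + \underbrace{1}_{\text{substitution}} = 4
```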

8 Problem Formalization  We focus on single-round protocols between client and server Single-round protocols can be more easily integrated into existing tools currently relying on rsync Multiple rounds are undesirable in many scenarios involving small files or large latencies Multi-round protocols can introduce other complications due to state that may have to be kept at the server for best performance

9 Assumptions  The collection consists of unstructured files  We are not concerned with issues of consistency in between synchronization steps  A simple two-party scenario where it is known which files need to be updated and which is the current version

10 Contributions  We describe a new approach to single-round file synchronization based on erasure codes  We derive a protocol that communicates at most O(k lg(n) lg(n/k)) bits on files with edit distance with block moves of at most k  We derive another practical algorithm and optimized implementation that achieves very promising improvements over rsync

11 Outline  Introduction and Common Applications  Problem Formalization  Contributions  An Approach Based on Erasure Codes A Simple Multi-Round Protocol An Efficient Single-Round Protocol A Practical Protocol Based on Erasure Codes  Implementation Overview  Preliminary Results  Conclusions

12 A Simple Multi-Round Protocol  Runs in a number of rounds  In the first round, the server partitions the file into blocks of size b_max and sends a hash (MD5) for each block  The client attempts to match the received hashes against all possible alignments in the outdated file  The client responds with a bit vector telling the server which of the hashes were matched  The server repeats the process, with halved block size, for the blocks whose hashes did not find a match  Once block size b_min is reached, the server sends all the unmatched blocks
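
A minimal sketch of one round of this protocol (our own illustration in Python; MD5, the brute-force alignment scan, and all names are just for exposition, and a real implementation would use a rolling checksum to avoid hashing every alignment):

```python
import hashlib

def server_round(f_new: bytes, block_size: int) -> list[bytes]:
    """Server: partition f_new into blocks and send one MD5 hash per block."""
    return [hashlib.md5(f_new[i:i + block_size]).digest()
            for i in range(0, len(f_new), block_size)]

def client_round(f_old: bytes, hashes: list[bytes], block_size: int) -> list[bool]:
    """Client: try to match each received hash against every alignment (byte
    offset) of f_old, and reply with a bit vector of matched/unmatched."""
    seen = set()
    for start in range(max(len(f_old) - block_size, 0) + 1):
        seen.add(hashlib.md5(f_old[start:start + block_size]).digest())
    return [h in seen for h in hashes]
```

The server would then halve the block size and repeat only for the blocks whose bit is unset, down to b_min.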

13 A Simple Multi-Round Protocol

14  Given two files with edit distance with block moves of at most k, if we choose b_max = the largest power of 2 not exceeding n/k, b_min = lg(n), hash size = 4 lg(n) bits  Lemma: If we partition f_new into some number of blocks, then at most k of these blocks do not occur in f_old On each level, at most k hashes do not find a match  The algorithm transmits at most O(k lg(n) lg(n/k)) bits and correctly updates the file with probability at least 1 - 1/n
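
A rough accounting of the bound under these parameter choices (our sketch of the argument, not the paper's proof):

```latex
\begin{aligned}
\text{levels} &\approx \lg\frac{b_{\max}}{b_{\min}} \;\le\; \lg\frac{n}{k} \\
\text{hashes per level} &\le 2k \quad \text{(only the children of unmatched blocks are re-hashed)} \\
\text{bits per hash} &= 4\lg n \\
\text{total} &\le \lg\frac{n}{k} \cdot 2k \cdot 4\lg n \;=\; O\!\left(k \lg n \, \lg\frac{n}{k}\right)
\end{aligned}
```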

15 Outline  Introduction and Common Applications  Problem Formalization  Contributions  An Approach Based on Erasure Codes A Simple Multi-Round Protocol An Efficient Single-Round Protocol A Practical Protocol Based on Erasure Codes  Implementation Overview  Preliminary Results  Conclusions

16 An Efficient Single-Round Protocol  First, we define the complete multi-round algorithm, which sends hashes for all blocks  Second, we briefly describe systematic erasure codes

17 Erasure Code  Erasure Code: Given k source data items of size s, they are encoded into n > k items of the same size s  If at most n-k of the encoded items are lost, the source data can still be recovered (any k encoded items suffice)  A systematic erasure code is one where the encoded items consist of the k source items plus n-k additional items (Figure by Luigi Rizzo)
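
A minimal sketch of a systematic erasure code, in the Reed-Solomon style, over a prime field (our illustration; the field, item values, and function names are arbitrary, and real implementations work over GF(2^8) on fixed-size items):

```python
# Systematic erasure code sketch: the k source items are the values of a
# degree-(k-1) polynomial at points 0..k-1, and the n-k parity items are its
# values at points k..n-1.  Any k surviving values re-interpolate the
# polynomial and recover every item.
P = 2**31 - 1  # a prime, so modular inverses exist

def _interpolate(points, x):
    """Evaluate at x the unique polynomial through the given (xi, yi) points (Lagrange)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(source, n):
    """Return n encoded items; the first k are the source items themselves (systematic)."""
    k = len(source)
    pts = list(enumerate(source))
    return source + [_interpolate(pts, x) for x in range(k, n)]

def decode(received, k):
    """received: list of length n with None for lost items; needs >= k survivors."""
    survivors = [(x, y) for x, y in enumerate(received) if y is not None][:k]
    return [_interpolate(survivors, x) for x in range(k)]

# Example: k=4 source items, n=7 encoded items; any n-k=3 losses are recoverable.
src = [10, 20, 30, 40]
enc = encode(src, 7)
enc[1] = enc[4] = enc[6] = None          # lose 3 of the 7 items
assert decode(enc, len(src)) == src
```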

18 An Efficient Single-Round Protocol  Any hash value sent in the complete multi-round algorithm that would not be sent in the simple multi-round algorithm is not transmitted

19 An Efficient Single-Round Protocol  Any hash value that would be sent by the simple multi-round algorithm is also not sent to the client, but is instead considered lost

20 An Efficient Single-Round Protocol  On each level there can be at most 2k lost blocks  The client can recreate the entire level of hashes by using the 2k erasure hashes to recover the lost ones
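
A toy illustration of this per-level recovery, simplified to a single XOR parity that can repair one missing hash (the actual protocol sends 2k erasure hashes from a code that tolerates up to 2k losses; hash values here are just small integers):

```python
from functools import reduce

def recover_level(known_hashes, parity):
    """known_hashes: per-block hash values the client could compute itself by
    matching against f_old, with None where no match was found (a 'lost' hash).
    parity: XOR of all block hashes on this level, sent by the server.
    One XOR parity value repairs exactly one missing hash."""
    missing = [i for i, h in enumerate(known_hashes) if h is None]
    assert len(missing) <= 1, "a single XOR parity repairs at most one erasure"
    if missing:
        xor_known = reduce(lambda a, b: a ^ b,
                           (h for h in known_hashes if h is not None), 0)
        known_hashes[missing[0]] = parity ^ xor_known
    return known_hashes

# Example: the client could not compute the hash of block 2.
level = [0x3a, 0x91, None, 0x47]
parity = 0x3a ^ 0x91 ^ 0x55 ^ 0x47   # server computed this over the true hashes
assert recover_level(level, parity) == [0x3a, 0x91, 0x55, 0x47]
```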

21 An Efficient Single-Round Protocol  Theorem: Given a bound k on the edit distance between f_old and f_new, the erasure-based file synchronization algorithm correctly updates f_old to f_new with probability at least 1-1/n, using a single message of O(k lg(n) lg(n/k)) bits  We note that there are highly efficient single-message protocols for estimating the file distance k  Another property of the protocol is that by broadcasting a single message, the current version can be communicated to several clients that have different outdated versions

22 Outline  Introduction and Common Applications  Problem Formalization  Contributions  An Approach Based on Erasure Codes A Simple Multi-Round Protocol An Efficient Single-Round Protocol A Practical Protocol Based on Erasure Codes  Implementation Overview  Preliminary Results  Conclusions

23 A Practical Protocol Based on Erasure Codes  The previous protocol has two main shortcomings: The protocol requires us to estimate an upper bound on the file distance k; an underestimate would make recovery at the client impossible More importantly, the algorithm does not support compression of unmatched literals  To address these problems we design another erasure-based algorithm that works better in practice

24 A Practical Protocol Based on Erasure Codes  The hashes are sent from the client to the server  For level i, m_i erasure hashes are sent  The server identifies the common blocks and then sends the unmatched literals in compressed form
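
A rough sketch of the server side of this step, assuming the server already holds one hash per fixed-size block of f_old (rsync-style matching; the token format, and the use of plain zlib instead of the paper's delta compressor, are ours for illustration):

```python
import hashlib, zlib

def server_delta(f_new: bytes, client_hashes: dict[bytes, int], b: int) -> bytes:
    """Scan f_new; wherever a block's hash matches one the client announced,
    emit a COPY token (client block index), otherwise buffer the bytes as a
    literal.  The resulting token stream is sent in compressed form."""
    tokens, literal, i = [], bytearray(), 0
    while i < len(f_new):
        h = hashlib.md5(f_new[i:i + b]).digest()
        if h in client_hashes:
            if literal:
                tokens.append(b"L" + len(literal).to_bytes(4, "big") + bytes(literal))
                literal.clear()
            tokens.append(b"C" + client_hashes[h].to_bytes(4, "big"))
            i += b
        else:
            literal.append(f_new[i])
            i += 1
    if literal:
        tokens.append(b"L" + len(literal).to_bytes(4, "big") + bytes(literal))
    return zlib.compress(b"".join(tokens))
```

In practice a rolling checksum is used to filter candidate offsets before computing a strong hash, exactly as in rsync.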

25 Implementation Overview  We included three additional optimizations over rsync: 1) We replace the gzip algorithm used for transmission of the unmatched literals and match tokens with an optimized delta compressor; the server now transmits the resulting delta and a bit vector that allows the client to create the same reference file

26 Implementation Overview 2) We make a better choice of the number of bits per hash:  We assume some upper bound on the probability of a collision, say 1/2^d for some d, and then use lg(n)+lg(y)+d bits per hash  n is the file size  y is the total number of hashes sent from client to server 3) We integrate decomposable hashes:  This technique allows the hash of a child block to be computed from the hashes of its parent and sibling, halving the number of erasure hashes transmitted
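
Two small sketches of these optimizations (the additive hash and all parameter values are illustrative toys; the actual decomposable hash construction used in the implementation may differ):

```python
import math

# --- Decomposable hash (toy): an additive hash over bytes, so that
#     hash(parent) = hash(left child) + hash(right child)  (mod M).
#     One child's hash can then be derived from its parent and sibling instead
#     of being transmitted, which is where the halving comes from.
M = 2**61 - 1

def add_hash(block: bytes) -> int:
    return sum(block) % M            # toy additive hash; a real one mixes positions too

def sibling_hash(parent_h: int, left_h: int) -> int:
    return (parent_h - left_h) % M   # right child's hash derived, not transmitted

parent = b"hello world!"
left, right = parent[:6], parent[6:]
assert sibling_hash(add_hash(parent), add_hash(left)) == add_hash(right)

# --- Bits per hash: with file size n, y hashes sent, and collision probability
#     target 1/2^d, use lg(n) + lg(y) + d bits per hash (the slide's rule).
def bits_per_hash(n: int, y: int, d: int) -> int:
    return math.ceil(math.log2(n) + math.log2(y) + d)

print(bits_per_hash(n=10**7, y=2000, d=20))   # ~55 bits for these illustrative values
```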

27 Preliminary Results  For the experiments we used the gcc and emacs datasets, consisting of versions 2.7.0 and 2.7.1 of gcc and versions 19.28 and 19.29 of emacs

28 Conclusions  We have described a new approach to remote file synchronization based on erasure codes  Using this approach, we derived a single-round protocol that is feasible and communication-efficient with respect to a common file distance measure

