Improved File Synchronization Techniques for Maintaining Large Replicated Collections over Slow Networks
Torsten Suel, CIS Department, Polytechnic University


Joint work with Patrick Noel (1) and Dimitre Trendafilov (2).
Current affiliations: (1) National Bank of Haiti, (2) Google Inc.; work performed while at Polytechnic University.

File Synchronization Problem
- client wants to update an outdated file
- server has the new file but does not know the old file
- goal: update without sending the entire new file (exploiting similarity)
- (diagram: client sends a request; server sends back an update; server holds f_new, client holds f_old)
- if the server has a copy of both files, this is a local problem at the server (delta compression)
- rsync: file synchronization tool, part of Linux
- note: files are unstructured (the record/page-oriented case is different)

Main Application: Mirroring/Replication
- to mirror web and ftp servers (e.g., downloads: tucows, linux)
- to synchronize data between PCs (e.g., at work and at home)
- our motivation: maintaining web collections for mining
- our motivation: efficient web page subscription
- currently done using the rsync protocol and tool
- can we significantly improve performance?
- recall: the server has no knowledge of the old version
- (diagram: server holds the new version of the collection or data; mirror holds the old version)

General Background and Motivation
- network links are getting faster and faster, but many clients are still connected by fairly slow links: 56K and cell modems, USB, cable, DSL
- bandwidth is also a problem in P2P systems (large data)
- how can we deal with such slow links in ways that are transparent to the user?
  - at the application layer (e.g., http, mail)
  - application-independent, at the IP or TCP layer
- applications: web access, handheld synchronization, software updates, server mirroring, link compression

Setup: Data Transmission over Slow Networks
- sender needs to send a data set to the receiver
- the data set may be “similar” to data the receiver already holds
- the sender may or may not know the data at the receiver:
  - sender may know the data at the receiver
  - sender may know something about the data at the receiver
  - sender may initially know nothing
- (diagram: sender holds the data plus some knowledge about the data at the receiver)

Two Standard Techniques
- caching: “avoid sending the same object again”
  - widely used and important in many scenarios
  - done on the basis of whole objects
  - only works if objects are completely unchanged
  - how about objects that are slightly changed?
- compression: “remove redundancy in transmitted data”
  - avoid repeated substrings in the data (Lempel-Ziv approach)
  - often limited to within one object
  - can be extended to the history of past transmissions
  - often not feasible to extend to long histories and large data
  - overhead: the sender may not be able to store the total history
  - robustness: the receiver's data may get corrupted
  - does not apply when the sender has never seen the data at the receiver

Types of Techniques:
- file synchronization
- delta compression
- database reconciliation (for structured data or blocks/pages)
- substring caching techniques (Spring/Wetherall, LBFS, VB caching)
Major Applications:
- data mirroring (web servers, accounts, collections)
- web access acceleration (NetZero, AOL 9.0, Sprint Vision)
- handheld synchronization (Minsky/Trachtenberg/Zippel)
- link compression (Spring/Wetherall, pivia)
- content distribution networks
- software updates (cellphone software upgrades: DoOnGo, others)

File Synchronization: State of the Art
- the rsync algorithm
- performance of rsync
- theoretical results
  - file similarity measures
  - theoretical bounds (Orlitsky 1984, Cormode et al., Orlitsky/Viswanathan)
- improved practical algorithms
  - Cormode et al. (SODA 2000)
  - Orlitsky/Viswanathan (ISIT 2001)
  - Langford (2001)

The rsync Algorithm
- client splits f_old into blocks of size b
- client computes a hash value for each block and sends the hashes to the server
- server stores the received hashes in a dictionary
- server transmits f_new to the client, but replaces any b-byte window that hashes to a value in the dictionary by a reference
- (diagram: client sends hashes; server returns the encoded file)
A sketch of this encoding loop follows.
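A minimal Python sketch of the encoding step, assuming a single strong hash per window (the function names here are ours, not rsync's); real rsync pairs a cheap rolling checksum with a stronger hash, as the next slide notes, and gzip-compresses the literals, so this O(n*b) version is purely illustrative:

```python
import hashlib

def block_hashes(f_old: bytes, b: int) -> dict:
    # client: hash every aligned b-byte block of the old file
    return {hashlib.md5(f_old[i:i + b]).digest(): i
            for i in range(0, len(f_old) - b + 1, b)}

def encode(f_new: bytes, old_hashes: dict, b: int) -> list:
    # server: slide a b-byte window over the new file; emit a block
    # reference on a hash match, otherwise emit one literal byte
    out, i = [], 0
    while i < len(f_new):
        window = f_new[i:i + b]
        off = old_hashes.get(hashlib.md5(window).digest()) if len(window) == b else None
        if off is not None:
            out.append(("copy", off, b))        # copy b bytes of f_old at off
            i += b                              # jump past the matched window
        else:
            out.append(("literal", f_new[i:i + 1]))
            i += 1                              # no match: shift by one byte
    return out
```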

The rsync Algorithm (cont.)
- simple, widely used, single roundtrip
- optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
- choice of block size is problematic (default: max{700, sqrt(n)})
- not good in theory: granularity of changes, sqrt(n lg n) lower bound
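The rolling hash is what makes checking every window affordable: it admits a constant-time update as the window slides one byte. A sketch in the style of rsync's weak checksum, assuming the standard two-16-bit-sums construction (treat the constants and packing as illustrative, not rsync's exact code):

```python
M = 1 << 16   # both sums are kept to 16 bits

def weak_hash(block: bytes):
    # a = plain sum of the bytes; b = position-weighted sum
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, blen):
    # update both sums in O(1) when the window slides one byte right
    a = (a - out_byte + in_byte) % M
    b = (b - blen * out_byte + a) % M
    return a, b

# sanity check: rolling agrees with hashing the shifted window directly
data, blen = b"abcdefgh", 4
a0, b0 = weak_hash(data[0:blen])
assert roll(a0, b0, data[0], data[blen], blen) == weak_hash(data[1:blen + 1])
```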

Some performance numbers for rsync
- rsync vs. gzip, vcdiff, xdelta, zdelta
- files: gnu and emacs versions (also used by Hunt/Vo/Tichy)
- (table of compressed sizes in KB on the gcc and emacs data sets, comparing gzip, xdelta, vcdiff, zdelta, and rsync; the slightly outdated numbers did not survive in this transcript)
- bottom line: a factor of 3-5 gap between rsync and delta compression!

Formal File Similarity Measures
- Hamming distance: inserting one character gives a large distance
- Levenshtein distance: insertion, deletion, and change of single characters
- edit distance with block moves: also allows blocks of data to be moved around
- edit distance with block copies: also allows copies, and deletions of repeated blocks
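For concreteness, here is the classic dynamic program for the Levenshtein measure; no comparably simple program computes the block-move or block-copy measures, which is part of what makes synchronization under those measures interesting:

```python
def levenshtein(s: str, t: str) -> int:
    # classic DP over insertions, deletions, and single-character changes
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # change a into b
        prev = cur
    return prev[-1]

# one inserted character: Levenshtein distance 1, Hamming distance large
assert levenshtein("abcdef", "xabcdef") == 1
```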

Theoretical Results: Some Known Bounds
- suppose files of length n at distance k
- lower bound Omega(k lg n) bits (even for Hamming distance)
- a general framework gives asymptotically optimal upper bounds for many cases (Orlitsky 1994)
- one message is not optimal, but three messages are
- impractical: decoding takes exponential time!
- recent practical algorithms with provable bounds:
  - Orlitsky/Viswanathan 2001, Cormode et al. 2000, Langford 2001
  - a lg n factor from optimal for measures without block copies
  - not fully implemented and optimized

Recursive approach to improve on rsync
- how should rsync choose the block size? rsync has cost at least O(sqrt(n lg n))
- idea: recursive splitting of blocks (Orlitsky/Viswanathan, Langford, Cormode et al.)
- upper bound O(k lg^2(n)) (does not work for block copies)
- this paper:
  - optimized practical multi-round protocols
  - several new techniques with significant benefits
- ongoing work:
  - show optimal upper bounds
  - new approaches for single- and two-round protocols
  - applications

Our Contribution: zsync
- server sends hash values to the client (unlike in rsync)
- client checks if it has a matching block; if yes, “the client understands the hash”
- server recursively splits blocks that do not find a match
New techniques in zsync:
- use of an optimized delta compressor: zdelta
- decomposable hash functions: after sending the parent hash, only one child hash needs to be sent
- continuation hashes to extend matches to the left and right
- optimized match verification
- several other minor optimizations

Our Framework: zsync
- multiple roundtrips (starting with a 32K block size, going down to 8 bytes)
- server sends hashes; client confirms and verifies
- both parties maintain a “map” of the new file
- at the end, the server sends the unresolved parts (delta-compressed)
A sketch of the recursion appears below.
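A sketch of the server-side recursion under a simplifying assumption: `client_has` is a hypothetical callback standing in for a network roundtrip ("does the client hold a block with this hash?"). The real protocol batches all hashes of one level into a single message and uses weak hashes plus verification rather than one MD5 per query:

```python
import hashlib

def md5(block: bytes) -> bytes:
    return hashlib.md5(block).digest()

def sync_level(pieces, client_has, min_size):
    # pieces: (offset, block) ranges of f_new that are still unresolved
    literals = []
    for off, blk in pieces:
        if client_has(md5(blk)):
            continue                        # client reconstructs this range
        if len(blk) <= min_size:
            literals.append((off, blk))     # too small to split: send literally
        else:
            mid = len(blk) // 2             # split the unmatched block, recurse
            literals += sync_level([(off, blk[:mid]), (off + mid, blk[mid:])],
                                   client_has, min_size)
    return literals
```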

Techniques and Ideas: decomposable and continuation hashes
- decomposable hash functions: after sending the parent hash, we only need to send one child hash
  - (diagram: a block's hash h(parent) splits into h(left) and h(right))
  - decomposable: h(right) = f(h(parent), h(left))
- continuation hashes: to extend hashes to the left or right using fewer bits
  - standard hashes are expensive: they are compared against all shifts in f_old
  - continuation hashes are compared against only one position: “is this a continuation of an adjacent larger hash?”
  - might give improved asymptotic bounds
A sketch of a decomposable hash follows.
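One standard construction with the decomposability property is a polynomial hash, sketched below. The slides do not specify zsync's actual hash family, so this only illustrates the identity h(right) = f(h(parent), h(left)); the modulus and base are arbitrary choices for the sketch:

```python
P = (1 << 61) - 1   # a Mersenne prime modulus (arbitrary choice here)
B = 257             # base of the polynomial hash

def poly_hash(block: bytes) -> int:
    # h(x_0 .. x_{n-1}) = sum of x_i * B^(n-1-i)  mod P
    h = 0
    for x in block:
        h = (h * B + x) % P
    return h

def right_child_hash(h_parent: int, h_left: int, right_len: int) -> int:
    # concatenation rule: h(parent) = h(left) * B^len(right) + h(right) (mod P),
    # so the receiver derives h(right) from h(parent) and h(left)
    return (h_parent - h_left * pow(B, right_len, P)) % P

# sanity check: only one child hash ever needs to be transmitted
parent = b"some block of data"
left, right = parent[:9], parent[9:]
assert right_child_hash(poly_hash(parent), poly_hash(left), len(right)) == poly_hash(right)
```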

Techniques and Ideas: optimized match verification
- server sends a fairly weak hash to find likely matches
- client sends one verification hash for several likely matches
- closely related to group testing techniques and binary search with errors
Example exchange:
- server: “here's a hash; look for something that matches”
- client: “how about this one? here are 2 additional bits”
- server: “looks good, but I need a few more bits to be sure”
- client: “OK, but to save bits, I will send a joint hash for 8 matches to check if all of them are correct”
- server: “apparently not all 8 are correct; let's back off”
- client: “OK, how about this joint hash for two matches?”
A sketch of the group-testing idea follows.
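A local simulation of the group-testing idea (function names are ours; in the protocol, each joint-hash comparison costs part of a message and the two sides sit across the network). `candidates[i]` is the client's block proposed as a match for `references[i]`:

```python
import hashlib

def joint_hash(blocks) -> bytes:
    # one verification hash covering a whole group of candidate matches
    h = hashlib.md5()
    for blk in blocks:
        h.update(blk)
    return h.digest()

def verify_matches(candidates, references):
    # verify whole groups with one joint hash; split a group in half
    # whenever its hash disagrees (binary search for the false matches)
    confirmed = []
    stack = [(candidates, references)]
    while stack:
        cand, ref = stack.pop()
        if joint_hash(cand) == joint_hash(ref):
            confirmed += cand                 # whole group verified at once
        elif len(cand) == 1:
            pass                              # a false match from the weak hash
        else:
            mid = len(cand) // 2              # back off: split and retest
            stack.append((cand[:mid], ref[:mid]))
            stack.append((cand[mid:], ref[mid:]))
    return confirmed
```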

An example (figure not preserved in the transcript).

Experimental evaluation with decomposable hashes
- gcc data set; initial block size 32K, variable minimum block size

Basic results for emacs data set

Continuation hashes

Optimized Match Verification

Best Numbers
- the best numbers require > 50 roundtrips, but we get within 1-2% of them with about 10 roundtrips
- note: these are not per-file roundtrips!

Results for Web Repository Synchronization
- updating web pages to the new version, for search/mining or subscription

Current and Future Work
- building a zsync tool and library
- improving asymptotic bounds (continuation hashes seem to do the trick)
- new approaches to file synchronization
- applications
- substring caching, record-based data, and other techniques