Bandwidth and latency optimizations Jinyang Li w/ speculator slides from Ed Nightingale.

What we’ve learnt so far Programming tools Consistency Fault tolerance Security Today: performance boosting techniques –Caching –Leases –Group commit –Compression –Speculative execution

Performance metrics Throughput –Measures the achievable rate (ops/sec) –Limited by the bottleneck resource 10Mbps link: max ~150 ops/sec for writing 8KB blocks –Improve it by using less of the bottleneck resource Latency –Measures the time to complete a single client operation –Improve it by pipelining multiple operations
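The slide's bottleneck arithmetic, spelled out (a sketch that assumes 10 Mbps = 10^7 bits/s, 8 KB = 8192 bytes, and ignores protocol overhead):

```python
link_bps = 10_000_000          # 10 Mbps bottleneck link
block_bits = 8 * 1024 * 8      # one 8 KB write on the wire, in bits
ops_per_sec = link_bps / block_bits
print(f"{ops_per_sec:.0f} ops/sec")   # ~153, the slide's "max ~150 ops/sec"
```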

Caching (in NFS) NFS clients cache file content and directory name mappings Caching saves network bandwidth and improves latency [figure: client checks freshness with GETATTR, then serves READ data from its cache]

Leases (not in NFS) Leases eliminate the latency of freshness checks, at the cost of keeping extra state at the server [figure: C1 sends READ fh1; server replies LEASE fh1 + data and records fh1: C1; a later WRITE fh1 causes the server to send INVAL fh1 to C1 before replying OK]

Group commit (in NFS) Group commit reduces the latency of a sequence of writes [figure: client streams WRITEs, then issues a single COMMIT]

Two cool tricks Further optimization for b/w and latency is necessary for wide area –Wide area network challenges Low bandwidth (10~100Mbps) High latency (10~100ms) Promising solutions: –Compression (LBFS) –Speculative execution (Speculator)

Low Bandwidth File System Goal: avoid redundant data transfer between clients and the server Why isn’t caching enough? –A file with duplicate content → duplicate cache blocks –Two files that share content → duplicate cache blocks –A file that’s modified → previous cache is useless

LBFS insights: name by content hash Traditional cache naming: (fh#, offset) LBFS naming: SHA-1(cached block) Same contents have the same name –Two identical files share cached blocks Cached blocks keep the same names despite file changes
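A minimal sketch of the naming idea (the dict-based cache and helper names are illustrative, not LBFS's actual data structures):

```python
import hashlib

def block_name(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()   # name a block by its content

cache = {}                                   # content hash -> block data

def cache_put(data: bytes):
    cache[block_name(data)] = data           # identical content: same slot

cache_put(b"hello world" * 700)              # ~8 KB block from file A
cache_put(b"hello world" * 700)              # same content in file B: no new entry
assert len(cache) == 1
```

Because the name is derived from the content, neither duplicate files nor renames create new cache entries, and a block keeps its name even after the file containing it changes elsewhere.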

Naming granularity Name each file by its SHA-1 hash –It’s rare for two files to be exactly identical –No cache reuse across file modifications Cut a file into 8KB blocks, name each [ x*8K, (x+1)*8K ) range by its hash –If block boundaries misalign, two almost identical files could share no common block –If block boundaries misalign, a new file could share no common block with its old version

Align boundaries across different files Idea: determine boundary based on the actual content –If two boundaries have the same 48-byte content, they probably correspond to the same position in a contiguous region of identical content

Align boundaries across different files [figure: two files chunked by content; the shared region yields chunks with identical hashes, e.g. ab9f..0a and 87e6b..f5]

LBFS content-based chunking Examine every sliding window of 48 bytes Compute a 2-byte Rabin fingerprint f of the 48-byte window If the low 13 bits of f equal a chosen value v, the window’s end is a breakpoint (expected chunk size: 2^13 = 8KB)
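A sketch of the chunking loop. For clarity each 48-byte window is hashed from scratch with SHA-1 standing in for the fingerprint; a real implementation uses a rolling Rabin fingerprint updated in O(1) per byte. The function name and breakpoint value are illustrative:

```python
import hashlib

WINDOW = 48                 # bytes per sliding window
MASK = (1 << 13) - 1        # test the low 13 bits -> expected chunk size ~8 KB

def breakpoints(data: bytes):
    """Return positions where a chunk ends, chosen purely by content."""
    bps = []
    for i in range(WINDOW, len(data) + 1):
        # Hash the 48 bytes ending at position i; take 2 bytes of it
        # as the stand-in fingerprint.
        f = int.from_bytes(hashlib.sha1(data[i - WINDOW:i]).digest()[:2], "big")
        if f & MASK == 0:   # breakpoint value v = 0, an arbitrary choice
            bps.append(i)
    return bps
```

Because a boundary depends only on the 48 bytes before it, inserting data near the front of a file shifts the later breakpoints without destroying them, so old and new versions still share most chunks.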

LBFS chunking Two files that share a region of identical content of length x have ~x identical fingerprints there, and ~x/8K aligned breakpoints [figure: both files produce the same fingerprints f1 f2 f3 f4 over the shared content]

Why Rabin fingerprints? Why not use the low 13 bits of every 2-byte sliding window as the breakpoint test? –Real data is not random, which yields extremely variable chunk sizes A Rabin fingerprint computes an effectively random 2-byte value from 48 bytes of data

Rabin fingerprint is fast Treat the 48-byte window D as a 48-digit radix-256 number:
f47 = fingerprint of D[0..47] = (D[0]*256^47 + D[1]*256^46 + ... + 256*D[46] + D[47]) % q
f48 = fingerprint of D[1..48] = ((f47 - D[0]*256^47) * 256 + D[48]) % q
A new fingerprint is computed from the old fingerprint and the newly shifted-in byte
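The incremental update rule can be checked numerically with ordinary modular arithmetic (a real Rabin fingerprint uses polynomial arithmetic over GF(2), but the identity has the same shape; the modulus Q here is an arbitrary choice for the sketch):

```python
Q = 8191                     # arbitrary modulus for the sketch
W = 48                       # window size in bytes

def fingerprint(window: bytes) -> int:
    """Radix-256 value of the window, mod Q, computed from scratch."""
    f = 0
    for b in window:
        f = (f * 256 + b) % Q
    return f

data = bytes(range(200)) * 3
f_old = fingerprint(data[0:W])
# Slide the window by one byte: remove D[0]'s contribution, shift, add D[48].
f_new = ((f_old - data[0] * pow(256, W - 1, Q)) * 256 + data[W]) % Q
assert f_new == fingerprint(data[1:W + 1])   # matches the from-scratch value
```

The update costs one multiply, one subtract, and one add per byte, instead of rehashing all 48 bytes at every position.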

LBFS reads [figure: file not in cache; client sends GETHASH, server replies (h1, size1, h2, size2, h3, size3); client asks for the missing chunks with READ(h1, size1) and READ(h2, size2), then reconstructs the file as h1, h2, h3] Only saves b/w by reusing common cached chunks across different files or different versions of the same file

LBFS writes [figure: client sends MKTMPFILE(fd) and CONDWRITE(fd, h1, size1, h2, size2, h3, size3); server creates tmp file fd and replies HASHNOTFOUND(h1, h2); client sends TMPWRITE(fd, h1), TMPWRITE(fd, h2), COMMITTMP(fd, target_fhandle); server constructs the tmp file from h1, h2, h3 and copies its content to the target file] Transferring only the missing chunks saves b/w whenever different files or different versions of the same file share identical content
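The write-side exchange can be sketched with the server's chunk database as a dict. The message names (CONDWRITE, TMPWRITE) follow the slide; the Python API and `upload` helper are illustrative:

```python
import hashlib

server_chunks = {}                 # server's chunk database: hash -> data

def upload(chunks):
    """Client: transfer only the chunks the server is missing (CONDWRITE idea)."""
    hashes = [hashlib.sha1(c).hexdigest() for c in chunks]
    missing = set(h for h in hashes if h not in server_chunks)  # HASHNOTFOUND reply
    sent = 0
    for h, c in zip(hashes, chunks):
        if h in missing:
            server_chunks[h] = c   # TMPWRITE of one missing chunk
            sent += 1
            missing.discard(h)     # don't resend duplicates within one file
    return sent

assert upload([b"a" * 8192, b"b" * 8192]) == 2   # cold server: both chunks move
assert upload([b"a" * 8192, b"c" * 8192]) == 1   # only the new chunk moves
```

The hash list alone lets the server rebuild the whole file, so a mostly-unchanged file costs only the hashes plus the genuinely new chunks.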

LBFS evaluations In practice, there is a lot of content overlap among different files and different versions of the same file –Saving a Word document –Recompiling after a header change –Different versions of a software package LBFS results in ~1/10 the b/w use

Speculative Execution in a Distributed File System Nightingale et al. SOSP’05

How to reduce latency in FS? What are potentially “wasteful” latencies? Freshness check –Client issues GETATTR before reading from cache –Incurs an extra RTT per read –Why wasteful? Most GETATTRs confirm the cache is fresh Commit ordering –Client waits for the commit of modification X to finish before starting modification Y –No pipelining of modifications X & Y –Why wasteful? Most commits succeed!

Key Idea: Speculate on RPC responses 1) Checkpoint 2) Speculate instead of blocking on the server 3) Correct? Yes: discard ckpt. No: restore process & re-execute Guarantees without blocking I/O! [figure: client issues RPC requests and keeps executing speculatively while the responses are in flight]

Conditions for useful speculation Operations are highly predictable Checkpoints are cheaper than network I/O –~52 µs for a small process Computers have resources to spare –Need memory and CPU cycles for speculation

Implementing Speculation [figure: timeline – 1) system call, 2) create speculation; the process is checkpointed and a Spec entry is added to the undo log]

Speculation Success [figure: timeline – 1) system call, 2) create speculation, 3) commit speculation; the checkpoint is discarded from the undo log]

Speculation Failure [figure: timeline – 1) system call, 2) create speculation, 3) fail speculation; the process is restored from the checkpoint in the undo log]
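The three-step cycle on these slides can be modeled in a few lines. Process state is a plain dict and the function names are illustrative; the real Speculator checkpoints whole processes inside the kernel:

```python
import copy

def speculate(state, predicted, apply_op, get_actual):
    checkpoint = copy.deepcopy(state)      # 1) checkpoint before guessing
    apply_op(state, predicted)             # 2) keep executing on the guess
    actual = get_actual()                  # the real response finally arrives
    if actual == predicted:
        return state                       # 3a) success: discard checkpoint
    apply_op(checkpoint, actual)           # 3b) failure: restore & re-execute
    return checkpoint

op = lambda st, v: st["lines"].append(v)
ok = speculate({"lines": []}, "fresh", op, lambda: "fresh")
assert ok["lines"] == ["fresh"]            # speculation committed
bad = speculate({"lines": []}, "fresh", op, lambda: "stale")
assert bad["lines"] == ["stale"]           # rolled back, re-run with real result
```

The win is that `get_actual()` (the RPC round trip) overlaps with useful work instead of blocking before step 2.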

Ensuring Correctness Speculative processes hit barriers when they need to affect external state –Cannot roll back an external output Three ways to ensure correct execution –Block –Buffer –Propagate speculations (dependencies) Need to examine syscall interface to decide how to handle each syscall

Handling system calls Block calls that externalize state –Allow read-only calls (e.g. getpid) –Allow calls that modify only task state (e.g. dup2) File system calls – need to dig deeper –Mark file systems that support Speculator [examples: getpid → call sys_getpid(); reboot → block until specs resolved; mkdir → allow only if the fs supports Speculator]

Output Commits [figure: timeline – 1) sys_stat and 2) sys_mkdir create Spec (stat) and Spec (mkdir) entries in the undo log; 3) committing the speculations releases the “stat worked” and “mkdir worked” outputs]

Multi-Process Speculation Processes often cooperate –Example: “make” forks children to compile, link, etc. –Would block if speculation limited to one task Allow kernel objects to have speculative state –Examples: inodes, signals, pipes, Unix sockets, etc. –Propagate dependencies among objects –Objects rolled back to prior states when specs fail

Multi-Process Speculation [figure: two processes (pid 8000 and pid 8001) and a shared inode 3456 each carry a checkpoint plus undo-log entries (chown, write) linked to Spec 1 and Spec 2]

Multi-Process Speculation What’s handled: –DFS objects, RAMFS, Ext3, Pipes & FIFOs –Unix Sockets, Signals, Fork & Exit What’s not handled (i.e. block) –System V IPC –Multi-process write-shared memory

Example: NFSv3 Linux [figure: Client 1: Modify B → Write → Commit; Client 2: Open B → Getattr; each operation completes at the server before the next begins]

Example: SpecNFS [figure: the same workload with speculation – Client 1’s Modify B and Client 2’s Open B/Getattr proceed speculatively, and the Write+Commit is sent asynchronously]

Problem: Mutating Operations Client 1: 1. cat foo > bar Client 2: 2. cat bar bar depends on speculative execution of “cat foo” If bar’s state could be speculative, what does Client 2 view in bar?

Solution: Mutating Operations Server determines speculation success/failure –State at the server is never speculative Clients tell the server the hypothesis each speculation is based on –The list of speculations an operation depends on Server reports failed speculations Server performs in-order processing of messages

Server checks a speculation’s status [figure: Client 1 runs cat foo > bar and sends Write+Commit tagged “foo v=1”; the server checks whether foo indeed has version 1 and fails the speculation if not]
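The server-side check can be sketched as a version-hypothesis test; the dict and function names are illustrative stand-ins for the server's authoritative metadata:

```python
server_versions = {"foo": 1}     # authoritative state: never speculative

def failed_specs(hypotheses):
    """Return the dependencies whose version hypothesis turned out stale."""
    return [name for name, v in hypotheses.items()
            if server_versions.get(name) != v]

assert failed_specs({"foo": 1}) == []        # hypothesis holds: commit
server_versions["foo"] = 2                   # another client modified foo
assert failed_specs({"foo": 1}) == ["foo"]   # stale: fail the speculation
```

Because only the server decides, a client's speculative state never leaks: a failed check rolls the client back and re-executes against the real version.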

Group Commit Previously sequential ops are now concurrent Sync ops are usually committed to disk Speculator makes group commit possible [figure: client sends a batch of writes and a single commit to the server]

Putting it Together: SpecNFS Apply Speculator to an existing file system Modified NFSv3 in Linux 2.4 kernel –Same RPCs issued (but many now asynchronous) –SpecNFS has same consistency, safety as NFS –Getattr, lookup, access speculate if data in cache –Create, mkdir, commit, etc. always speculate

Putting it Together: BlueFS Design a new file system for Speculator –Single copy semantics –Synchronous I/O Each file, directory, etc. has version number –Incremented on each mutating op (e.g. on write) –Checked prior to all operations. –Many ops speculate and check version async

Apache Benchmark SpecNFS up to 14 times faster

Rollback cost is small Even with all files out of date, SpecNFS is up to 11x faster

What we’ve learnt today Traditional performance boosting techniques –Caching –Group commit –Leases Two new techniques –Content-based hashing and chunking –Speculative execution