Improving File System Synchrony


Improving File System Synchrony
CS 614 Lecture – Fall 2007 – Tuesday, October 16
By Jonathan Winter

Introduction
Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony

Motivation
- File system I/O is a major performance bottleneck.
  - On-chip computation and cache accesses are very fast.
  - Even main memory accesses are comparatively quick.
  - Distributed computing further exacerbates the problem.
- Durability and fault tolerance are also key concerns.
  - File systems depend on mechanical disks.
  - Disks are a common source of crashes, data loss, and system incoherence.
  - Distributed file systems further complicate reliability issues.
- Typically, performance and durability must be traded off.
- Ease of use is also important: synchronous I/O semantics make programming easier.

Outline
- Overview of the Local File System I/O Scenario
- Overview of the Distributed File System Scenario
- A User-Centric View of Synchronous I/O
- Similarities and Differences in Problem Domains
- Details of the Speculator Infrastructure
- Implementation of External Synchrony
- Benchmark Descriptions
- Performance Results
- Conclusions

Local File System I/O
- Traditional file systems come in two flavors.
- Synchronous file systems provide durability guarantees by blocking.
  - OS crashes and power failures will not cause data loss.
  - File modifications are ordered, providing determinism.
  - Blocking and sequential execution for ordering reduce performance.
- Asynchronous file systems don't block on modifications.
  - Commit can occur long after completion.
  - Users can view output that is later invalidated by a crash.
  - Synchronization can be enforced through explicit commands (fsync).
  - fsync does not protect against data loss on a typical desktop OS.
  - Performance is higher through buffering and group commit.
- ext3 is a standard journaling Linux local file system.
  - It can be configured in asynchronous, synchronous, and durable modes.
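The explicit-synchronization escape hatch mentioned above looks like the following in practice; this is a minimal Python sketch (the file path is an arbitrary placeholder), not code from either paper.

```python
import os

def durable_write(path, data):
    """Write data and flush it toward stable storage before returning,
    approximating synchronous file system semantics."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        # fsync flushes OS buffers, but as the slide notes it may stop
        # at the drive's volatile write cache on a typical desktop OS.
        os.fsync(fd)
    finally:
        os.close(fd)

durable_write("/tmp/xsync_demo.txt", b"committed before return\n")
```

Because the process blocks inside fsync until the flush completes, every call like this serializes computation behind disk latency, which is exactly the cost the papers set out to remove.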

Distributed File Systems
- Distributed file systems typically use synchronous I/O.
  - Provides a straightforward abstraction of a single namespace.
  - Enables cache coherence and durability.
- Synchronous messages have long latencies over the network.
  - Weaker consistency (close-to-open) is used for speed.
  - Common systems include AFS and NFS.
- Earlier research by the authors created the Blue File System.
  - Provides single-copy semantics.
  - The distributed nature of the network is transparent.

User-Centric View of Synchrony
- Synchronous I/O assumes an application-centric view.
  - Durability of the file system is guaranteed for the application.
  - The system views the application as an external entity.
  - Application state must be kept consistent.
  - The application must not see uncommitted results.
  - The application must block on distributed file I/O.
- The user-centric view considers application state as internal.
  - Only observable output must be synchronous.
  - The kernel and applications are both internal state.
  - The internal implementation can run asynchronously.
  - Only external output, to the screen or network, must be synchronous.
  - Execution of internal components can be speculative.
  - Results of speculative execution cannot be observed from outside.

Similarities Between Scenarios
- Both the local and distributed file system solutions require buffering output to the user and the external environment.
  - The asynchrony of the implementation is hidden from the user.
  - Durability must be preserved in the presence of faults.
  - Speculative execution must not be seen until commit.
- Both speculation and external synchrony require dependence tracking for uncommitted process and kernel state.
  - Tracking allows speculative execution to roll back misspeculations.
  - Tracking determines which data should not yet be user-visible.
- The asynchronous implementation allows computation, IPC, I/O messages, network communication, and disk writes to overlap.
  - This is the major source of the systems' performance improvement.

Differences Between Scenarios
- Speculative execution in distributed file systems requires checkpointing and recovery on misspeculation.
- External synchrony does not speculate; it just allows internal state to run ahead of the output to the user.
- Speculative execution must block in some situations.
- Checkpointing challenges limit the kinds of supported IPC.
  - Shared memory was not implemented in the distributed setting.
  - External synchrony conservatively assumes all readers of shared memory inherit dependencies.

Details of the Speculator Infrastructure
- The major bottlenecks in original NFS are the blocking of processes and the serialization of network traffic.
- Speculation allows computation and I/O to proceed concurrently.

Conditions for Success of Speculations
File systems were chosen as the first target for Speculator because:
- Results of speculative operations are highly predictable.
  - Clients cache data, and concurrent updates are rare.
  - Speculating that cached data is valid succeeds most of the time.
- Network I/O is much slower than checkpointing.
  - Checkpointing is low overhead, and a lot of speculative work can be completed in the time it takes to verify the cached data.
- Computers have spare resources available for speculation.
  - Processors are idle for significant portions of the time.
  - Extra memory is available for checkpoints.
  - These spare resources can be used to speed up I/O throughput.

Speculation Interface
- Speculator requires modifications to system calls to allow speculative distributed I/O and to propagate dependencies.
- The interface is designed to encapsulate implementation details.
- Speculator provides three calls:
  - create_speculation
  - commit_speculation
  - fail_speculation
- Speculator doesn't worry about the details of speculative hypotheses.
- The distributed file system is oblivious to checkpointing and recovery.
- This partitioning of responsibilities allows easy modification of the internal implementation and expansion of IPC support.
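A toy model may make the division of labor concrete. The three call names below follow the slide; everything inside the class is illustrative user-level Python, not Speculator's actual kernel implementation, and the "state snapshot" stands in for a real checkpoint.

```python
class Speculator:
    """Toy model of the three-call interface: callers name speculations,
    Speculator owns the checkpoints, and neither side needs to know the
    other's internals."""

    def __init__(self):
        self._checkpoints = {}   # speculation id -> saved state
        self._next_id = 0

    def create_speculation(self, state_snapshot):
        sid = self._next_id
        self._next_id += 1
        self._checkpoints[sid] = state_snapshot
        return sid

    def commit_speculation(self, sid):
        # Hypothesis confirmed: the checkpoint is no longer needed.
        del self._checkpoints[sid]

    def fail_speculation(self, sid):
        # Misspeculation: hand the checkpoint back so the caller can
        # roll the process back to it.
        return self._checkpoints.pop(sid)
```

The file system only decides when to create, commit, or fail a speculation; how state is saved and restored stays hidden behind the interface, which is the encapsulation the slide describes.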

Speculation Implementation
- Checkpointing is performed by executing a copy-on-write fork.
  - The state of open file descriptors must be saved and pending signals copied.
  - The forked child runs only if the speculation fails and is discarded otherwise.
  - If the speculation fails, the child is given the identity of the original process.
- Two data structures are added to the kernel to track speculative state.
  - A speculation structure, created by create_speculation, tracks the set of kernel objects that depend on the new speculation.
  - The undo log is an ordered list of speculative operations, with the information needed to undo them.
- Multiple speculations can be outstanding for the same process, with multiple speculation structures and checkpoints.
  - If a previous speculation was read-only, checkpoints are shared.
  - A new checkpoint is taken every 500 ms to cap recovery time.
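The fork-as-checkpoint idea can be sketched in a few lines of POSIX-only Python. This is a simplification under stated assumptions: the real system gives the child the identity of the original process on failure, while here the child merely reports which path ran.

```python
import os

def run_speculatively(speculative_work):
    """Checkpoint via copy-on-write fork: the forked child holds the
    pre-speculation state and runs only if the speculation fails; on
    commit it is discarded. Returns 0 on commit, 1 on failure."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:                               # child: the checkpoint
        os.close(write_end)
        verdict = os.read(read_end, 1)         # wait for the verdict
        os._exit(1 if verdict == b"F" else 0)  # "run" only on failure
    os.close(read_end)
    ok = speculative_work()                    # parent runs ahead
    os.write(write_end, b"C" if ok else b"F")  # commit or fail
    os.close(write_end)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

Because fork is copy-on-write, the checkpoint costs little until pages diverge, which is why the slide can claim checkpointing is far cheaper than a network round trip.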

Ensuring Correct Speculative Execution
- Two invariants must hold for correct execution.
  - Speculative state must not be visible to the user or to external devices: output to the screen, network, and other interfaces must be buffered.
  - Processes cannot view speculative state unless they are registered as dependent upon that state: non-speculative processes must block or become speculative when viewing speculative state.
- Blocking can always be used to ensure correctness.
- System calls that do not modify state, or that modify only private state, can be performed speculatively without modification.
- Speculation flags are set in file system superblocks and on read and write system calls to indicate dependency relationships.

Multi-Process Speculation
- To extend the amount of possible speculative work, a speculative process can perform inter-process communication.
- Dependencies must propagate from a process P to an object X when P modifies X and P depends on a speculation that X does not.
- Typically, propagation is bi-directional between objects.
- commit_speculation deletes the associated speculation structure and removes the related undo log entries.
- fail_speculation atomically performs rollback.
- The undo log, undo entries, and speculations are generic: undo log entries point to type-specific state and functions that implement type-specific rollback for different forms of IPC.
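The generic undo log can be sketched as an ordered list of entries, each carrying type-specific saved state and a type-specific undo function; commit discards entries and failure replays them in reverse. The dict-backed example is purely illustrative.

```python
class UndoLog:
    """Miniature undo log: record speculative operations in order,
    discard them on commit, replay them in reverse on rollback."""

    def __init__(self):
        self._entries = []

    def record(self, undo_fn, saved_state):
        # Each entry pairs type-specific state with the function that
        # knows how to undo that kind of operation.
        self._entries.append((undo_fn, saved_state))

    def commit(self):
        self._entries.clear()       # nothing left to undo

    def rollback(self):
        while self._entries:        # undo newest-first
            undo_fn, saved = self._entries.pop()
            undo_fn(saved)

# Example: a speculative write to a dict, then a misspeculation.
store = {"a": 1}
log = UndoLog()
log.record(lambda old: store.update(old), {"a": store["a"]})
store["a"] = 99                     # speculative modification
log.rollback()                      # restore the pre-speculation value
```

Reverse-order replay matters when later entries depend on earlier ones: undoing newest-first returns every object to its pre-speculation state.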

Causal Dependency Propagation

Forms of Supported IPC
- Distributed file system objects: cached copies are used speculatively, then deleted and re-fetched if stale.
- Local memory file system: RAMFS was modified.
- ext3 was modified to allow speculation on the local disk file system.
  - Speculative data is never written to disk; calling fdatasync blocks the process.
  - Processes can observe speculative metadata in ext3 superblocks, bitmaps, and group descriptors, and metadata can be written to disk.
  - The ext3 journal was modified to separate speculative and non-speculative data in compound transactions.
- Pipes and FIFOs are handled like local file systems.

Forms of Supported IPC (continued)
- Unix sockets propagate dependencies bi-directionally.
- Signals are challenging because an exited process cannot restart; signaling processes are checkpointed and managed with a queue.
- During fork, the child inherits all dependencies of the parent.
- Exiting processes are not deallocated until all their dependencies are resolved.
- Other forms of IPC are not supported: for System V IPC, futexes, and shared memory, processes block to ensure proper behavior.

Using Speculation in the File System
- For read operations, a cached version of the file is required.
  - Speculation assumes the file has not been modified; RPCs are changed from synchronous to asynchronous.
- The server, which has full knowledge, manages mutating operations.
  - The server lets other processes see speculatively changed files only if the cached version matches the server's version.
  - The server must process messages in the same order that clients observe them.
  - The server never stores speculative data.
- Clients group-commit multiple operations with one disk write.
- NFS was modified to support Speculator (keeping close-to-open consistency).
- The Blue File System was modified to show that speculation can enable strong consistency and safety as well as good performance.

External Synchrony
- Goal: provide the reliability and ease of use of synchronous I/O with the performance of asynchronous I/O.
- The implementation, called xsyncfs, is built on top of ext3.
- File system transactions complete in a non-blocking manner, but their output is not allowed to be externalized.
- All output is buffered in the OS and released once all the disk transactions it depends on have committed.
- Processes with commit dependencies propagate output restrictions when interacting with other processes through IPC.
- xsyncfs uses output-triggered commits to balance throughput and latency.
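The discipline above can be captured in a toy model: writes return immediately, output that depends on an uncommitted transaction is buffered, and a disk commit releases it. All names here are illustrative; this is a sketch of the idea, not xsyncfs itself.

```python
class ExternallySyncFS:
    """Toy externally synchronous file system: internal state runs
    ahead asynchronously, external output waits for the commit."""

    def __init__(self):
        self.uncommitted = []   # modifications in the active transaction
        self.buffered = []      # output held back by commit dependencies
        self.released = []      # output actually shown to the user

    def write(self, data):
        self.uncommitted.append(data)    # non-blocking: no disk wait

    def emit(self, text):
        if self.uncommitted:
            self.buffered.append(text)   # depends on uncommitted state
        else:
            self.released.append(text)   # safe to externalize now

    def commit(self):
        # The disk commit finishes: dependent output may be released.
        self.uncommitted.clear()
        self.released.extend(self.buffered)
        self.buffered.clear()
```

To a user, the released output is indistinguishable from a synchronous file system's, yet the process never blocked on the write, which is the core of the external synchrony argument.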

Example of External Synchrony

External Synchrony Design Overview
- Synchrony is defined by externally observable behavior.
  - I/O is externally synchronous if its output cannot be distinguished from output that could have been produced by synchronous I/O.
  - This requires the values of external outputs to be the same.
  - Outputs must occur in the same causal order, as defined by Lamport's happens-before relation.
  - Disk commits are considered external output.
- The file system does all the same processing as for synchronous I/O, but it need not commit the modification to disk before returning.
- Two optimizations improve performance.
  - Group commit is used (commits are atomic).
  - External output is buffered while processes continue execution; output is guaranteed to be committed every 5 seconds.

External Synchrony Implementation
- xsyncfs leverages the Speculator infrastructure for output buffering and dependency tracking of uncommitted state.
  - The checkpointing and rollback features are unneeded and are disabled.
- Speculator tracks commit dependencies between processes and uncommitted file system transactions.
  - Processes that interact with a dependent process are marked as dependent on the same set of uncommitted transactions.
  - Many-to-many relationships between objects are tracked in undo logs.
- ext3 operates in journaled mode.
  - Multiple modifications are grouped into compound transactions.
  - A single transaction is active at any time and is committed atomically; likewise, only one transaction can be committing at a time.

External Synchrony Data Structures

Some Additional Issues
- External synchrony must be augmented to support explicit synchronization operations such as sync and fdatasync.
  - A commit dependency is created between the calling process and the active transaction, creating a visible event that forces a commit.
- xsyncfs does not require application modification.
  - Programmers can write the same code as for synchronous I/O; explicit synchronization is not needed.
  - Programmers don't need to add group commit to their code.
- Hand-tuned code can provide benefits when programmers have specialized information; however, xsyncfs has global information about external output, which it can use to optimize commit throughput.

Evaluation Methodology
- All experiments were run on Pentium 4 processors.
- RedHat Enterprise Linux release 3 (kernel 2.4.21) was used.
- Speculative execution was evaluated in two scenarios: the first has no network delay, and the second assumes a 30 ms round trip.
- Packets were routed through the NISTnet network emulator.

Durability Experiment
- The goal was to confirm that ext3 does not guarantee durability.
- The test continuously writes to the local file system and sends a UDP message after each write completes.
- Power is cut during the experiment, and the file system state is compared with the message log.
- ext3 did not provide durability when mounted asynchronously or synchronously, even when fsync commands were issued after writes.
- The problem is that modifications are written only to the hard drive's cache, not to the platter, unless write barriers are employed.
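The write-then-announce loop of this experiment can be sketched as follows; the path and destination address are placeholders, and in the real test the UDP receiver runs on a second machine so its log survives the power cut.

```python
import os
import socket

def logged_write(path, seqno, sock, dest):
    """One step of the durability test: append a record, fsync it, then
    externalize a UDP message saying write `seqno` completed. After a
    power cut, the receiver's log is compared with the on-disk state:
    any announced record missing from disk is a durability violation."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, b"record %d\n" % seqno)
        # fsync returns once the OS cache is flushed, but the data may
        # still sit in the drive's volatile cache, not on the platter.
        os.fsync(fd)
    finally:
        os.close(fd)
    sock.sendto(b"%d" % seqno, dest)   # write seqno is now externalized
```

Because the UDP message is sent only after fsync returns, a file system that truly honored fsync could never show a logged sequence number without the matching record on disk; the experiment found ext3 does.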

Workload Descriptions
PostMark Benchmark
- Performs hundreds or thousands of transactions consisting of file reads, writes, creates, and deletes, then removes all the files.
- Replicates the small-file workloads of electronic mail, netnews, and web-based commerce.
- A good test of file system throughput, since there is little output or computation.
Apache Build Benchmark
- Untars the Apache 2.0.48 source tree, runs configure in an object directory, runs make, and then removes all the files.
- The file system must balance throughput and latency, since there is a lot of screen output interleaved with disk I/O and computation.

Workload Descriptions (continued)
MySQL Benchmark
- Runs the OSDL TPC-C benchmark with MySQL 5.0.16 and the InnoDB storage engine.
- Used to see how xsyncfs performs when the application implements its own group commit strategy.
- Both MySQL and the TPC-C client are multi-threaded, so this also measures xsyncfs's support for shared memory.
SPECweb99 Benchmark
- Provides a network-intensive workload with 50 clients, saturating the server.
- The high level of network traffic challenges xsyncfs because the messages externalize state.

PostMark File System Benchmark Results

Apache Build Benchmark Results

MySQL and SPECweb99 Results

Average Latency of HTTP Requests
- xsyncfs adds less than 33 ms of delay to a request, below the commonly cited 50 ms perception threshold.
- xsyncfs performs significantly better on large request sizes.

Benefit of Output-Triggered Commits
- The goal is to assess the speedup of this lazy approach to commits.
- Output-triggered commits allow grouping but can cost latency.
- Output-triggered commits perform better on all benchmarks except SPECweb99, where there is so much traffic that both policies behave similarly.

Conclusions
- Speed need not be sacrificed for durability and ease of use.
- Both papers succeed in developing a system that achieves performance near that of an asynchronous implementation with the fault tolerance and simplicity of the synchronous abstraction.
- The key insight is the user-centric view abstraction.
- The Speculator infrastructure provides powerful functionality through dependency tracking and checkpointing/rollback.
- The papers focus on using the system to speed up local and distributed file systems, but many other applications are possible.
- Order-of-magnitude speedups are achieved.
- Simple ideas that, surprisingly, took until 2005 to be developed. Why didn't I think of this for my research? :)