Flexible, Wide-Area Storage for Distributed Systems Using Semantic Cues
Jeremy Stribling
Thesis Defense, August 6, 2009
Including material previously published in: "Flexible, Wide-Area Storage for Distributed Systems with WheelFS," Jeremy Stribling, Yair Sovran, Irene Zhang, Xavid Pretzer, Jinyang Li, M. Frans Kaashoek, and Robert Morris, NSDI, April 2009.

Storing Data All Over the World
[Figure: PlanetLab nodes spread around the world]
– Apps store data on widely-spread resources: testbeds, Grids, data centers, etc.
– Yet there's no universal storage layer
– WheelFS: a file system for wide-area apps

Wide-Area Applications
[Figure: data centers in the US, Europe, and Asia]
– Site failure makes the whole service unavailable
– Many users are far away
– Data, not just app logic, needs to be shared across sites

Current Network FSes Don't Solve the Problem
[Figure: a single NFS server at one data center holds the file, serving clients in the US, Europe, and Asia]
Same problems as before:
– Not fault-tolerant
– Far away from some sites
More fundamental concerns:
– Reads and writes flow through one node
– Tries to act like a local FS by hiding failures with long timeouts
Apps want to distribute storage across sites, and do not necessarily want the storage to act like a local FS.

Wide-Area Storage Example: Facebook
[Figure: data centers on the West Coast and East Coast; 700,000 new users/day]
– Long latency → updates take a while to show up on the other coast
– After an update, the user must go back to the same coast
Storage requirement: control over consistency

Wide-Area Storage Example: Gmail
[Figure: data centers in the US and Europe]
– Primary copy of a user's data stored near the user
– But data is replicated to survive failures
Storage requirements: control over placement and durability

Wide-Area Storage Example: CoralCDN
Storage requirements: distributed serving of popular files and control over consistency
– A network of proxies fetches pages only once
– Data stored by one site must be readable from others, though the data can be out of date

Apps Handle the Wide Area Differently
– Facebook wants consistency for some data (customized MySQL/memcached)
– Google stores data near the consumer (Gmail's storage layer)
– CoralCDN prefers low delay to strong consistency (Coral sloppy DHT)
→ Each app builds its own storage layer

Opportunity: General-Purpose Wide-Area Storage
– Apps need control of wide-area tradeoffs:
  – Availability vs. consistency
  – Fast writes vs. durable writes
  – Few writes vs. many reads
– Need a common, familiar API: a file system (easy to program, reuse existing apps)
– No existing DFS allows such control

Solution: Semantic Cues
– Small set of app-specified controls
– Correspond to wide-area challenges:
  – EventualConsistency: relax consistency
  – RepLevel=N: control number of replicas
  – Site=site: control data placement
– Allow apps to specify on a per-file basis, e.g., /fs/.EventualConsistency/file
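Because cues are ordinary path components, an application selects them with unmodified file calls. A minimal sketch, assuming a WheelFS mount at /fs as on the slide; the function names are invented for illustration:

```c
#include <fcntl.h>
#include <unistd.h>

/* Open the same file with and without the EventualConsistency cue.
 * The cue component in the path changes how WheelFS handles the open;
 * the application code is otherwise unchanged. */
int open_with_cue(void) {
    return open("/fs/.EventualConsistency/file", O_RDONLY);
}

int open_default(void) {
    return open("/fs/file", O_RDONLY);  /* default: strict consistency */
}
```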

Contribution: WheelFS
– Wide-area file system
– Apps embed cues directly in pathnames
– Many apps can reuse existing software
– Multi-platform prototype with several apps

File Systems 101
Basic FS operations:
– Name resolution: hierarchical name → flat id (i.e., an inumber)
– Data operations: read/write file data
– Namespace operations: add/remove files or dirs
Examples: open("/dir1/file1", …) → id: 1235; read(1235, …); write(1235, …); mkdir("/dir2", …)
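A minimal sketch of these operations as an application sees them through the POSIX interface; the paths come from the slide, and the flat id mentioned in the comment is resolved internally by the file system:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    /* Name resolution: the hierarchical name is resolved to a flat id
     * (an inumber, e.g. 1235) inside the file system; the application
     * just gets back a file descriptor. */
    int fd = open("/dir1/file1", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Data operations: read and write file contents. */
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof buf);
    if (n > 0)
        (void) write(fd, buf, (size_t) n);

    /* Namespace operation: add a directory. */
    mkdir("/dir2", 0755);

    close(fd);
    return 0;
}
```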

File Systems 101
[Figure: directory tree / → dir1/, dir2/ → file1, file2, file3, with each name mapped to a flat id, e.g., "/dir1/" → 246 and "file1" → 135]
– Directories map names to IDs
– The file system uses IDs to find the location of a file on the local hard disk
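The name-to-ID mapping the figure depicts can be pictured as a table of (name, flat id) pairs; this layout is purely illustrative and is not WheelFS's actual directory format:

```c
#include <stdint.h>
#include <string.h>

/* A directory pictured as a table of (name, flat id) entries,
 * e.g. { "file1" -> 135 } inside the directory with id 246. */
struct dir_entry {
    char     name[256];
    uint64_t id;          /* flat id (inumber) of the child */
};

struct directory {
    uint64_t         id;          /* this directory's own flat id */
    size_t           n_entries;
    struct dir_entry entries[64];
};

/* Resolve one path component: return the child's id, or 0 if absent. */
static uint64_t dir_lookup(const struct directory *d, const char *name) {
    for (size_t i = 0; i < d->n_entries; i++)
        if (strcmp(d->entries[i].name, name) == 0)
            return d->entries[i].id;
    return 0;
}
```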

Distributing a FS Across Nodes
[Figure: the same directories and files, now spread across several nodes]
– Files and directories are spread across nodes
– Must locate files/dirs using IDs and a list of the other nodes

WheelFS Design Overview
[Figure: a distributed application runs over WheelFS client software (mounted via FUSE) on client nodes; data is stored in WheelFS on storage nodes; a WheelFS configuration service (Paxos + RSM) coordinates them]
Files and directories are spread across storage nodes

WheelFS Default Operation
– Files have a primary and two replicas; a file's primary is the closest storage node
– Clients can cache files, with a lease-based invalidation protocol
– Strict close-to-open consistency: all operations serialized through the primary
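One way to picture the lease-based invalidation mentioned above, under the assumption that the primary hands out an expiry time with each cached copy; the structure and names are illustrative, and the actual protocol is described in the WheelFS paper:

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Illustrative lease on a cached file: the primary grants an expiry
 * time along with the data; until then the client may read its cached
 * copy without contacting the primary again. */
struct cached_copy {
    uint64_t file_id;
    time_t   lease_expires;   /* granted by the file's primary */
};

/* Serve from cache only while the lease is still valid; otherwise the
 * client must go back to the primary (which serializes all operations
 * to provide close-to-open consistency). */
static bool cache_usable(const struct cached_copy *c) {
    return time(NULL) < c->lease_expires;
}
```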

WheelFS Design: Creation
[Figure: creating "/file" assigns it flat id 562 and adds the entry "/file" → 562 to the root directory alongside "/dir1/" → 246 and "/dir2/" → 357]
– Directories map names to flat file IDs
– Creating a file creates the directory entry
– By default, a node is the primary for data it creates

WheelFS Design: Open
[Figure: to read "/file", the client finds "/file" → 562 in the root directory, asks the configuration service which node is responsible for id 562, and sends that node the read]
The configuration service partitions the ID space among nodes consistently
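From a client's point of view, partitioning the ID space can be pictured as a table of contiguous ID ranges, each owned by a node; the table layout and names below are invented for illustration, and the real table is maintained by the Paxos-replicated configuration service:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative partition of the flat ID space into contiguous ranges,
 * each assigned to a responsible node (replicas elided). */
struct id_range {
    uint64_t    start, end;   /* inclusive range of file IDs */
    const char *node;         /* address of the node owning the range */
};

/* Find the node responsible for a given file ID, e.g. 562. */
static const char *lookup_node(const struct id_range *table, size_t n,
                               uint64_t file_id) {
    for (size_t i = 0; i < n; i++)
        if (file_id >= table[i].start && file_id <= table[i].end)
            return table[i].node;
    return NULL;   /* no node currently covers this ID */
}
```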

Enforcing Close-to-Open Consistency
[Figure: a client tries to read id 562 and another tries to write the file while only a backup is reachable]
– By default, failing to reach the primary blocks the operation, to offer close-to-open consistency in the face of partitions
– Eventually, the configuration service decides to promote a backup to be the primary

Wide-Area Challenges
– Transient failures are common → availability vs. consistency
– High latency → fast writes vs. durable writes
– Low wide-area bandwidth → few writes vs. many reads
Only applications can make these tradeoffs

Semantic Cues Give Apps Control
– Apps want to control consistency, data placement… How?
– Embed cues in path names → flexible and minimal interface change
– Example: /wfs/cache/a/b/foo becomes /wfs/cache/a/b/.cue/foo, e.g., /wfs/cache/a/b/.EventualConsistency/foo

Semantic Cue Details
– Cues can apply to directory subtrees: a cue applies recursively over an entire subtree of files, e.g., /wfs/cache/.EventualConsistency/a/b/foo
– Multiple cues can be in effect at once, e.g., /wfs/cache/.EventualConsistency/.RepLevel=2/a/b/foo (both cues apply to the entire subtree)
– Assume the developer applies cues sensibly
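One plausible way a client could separate cue components from ordinary path components while walking a pathname, so that a cue placed high in a path stays in effect for the whole subtree below it; this is an illustration, not WheelFS's actual parser:

```c
#include <stdio.h>
#include <string.h>

/* Walk a WheelFS-style path, treating components that begin with '.'
 * as cues.  Cues accumulate as the walk proceeds, so a cue such as
 * .EventualConsistency applies to every component after it, as in
 * /wfs/cache/.EventualConsistency/.RepLevel=2/a/b/foo. */
static void walk_path(const char *path) {
    char copy[1024];
    strncpy(copy, path, sizeof copy - 1);
    copy[sizeof copy - 1] = '\0';

    for (char *tok = strtok(copy, "/"); tok != NULL; tok = strtok(NULL, "/")) {
        if (tok[0] == '.')
            printf("cue in effect from here down: %s\n", tok);
        else
            printf("ordinary component: %s\n", tok);
    }
}

int main(void) {
    walk_path("/wfs/cache/.EventualConsistency/.RepLevel=2/a/b/foo");
    return 0;
}
```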

A Few WheelFS Cues (designed to match wide-area challenges)
– RepLevel= (permanent), durability: how many replicas of this file should be maintained
– Site= (permanent), data placement: hint for which group of nodes should store a file
– HotSpot (transient), large reads: this file will be read simultaneously by many nodes, so use p2p caching
– EventualConsistency (transient/permanent), consistency: control whether reads must see fresh data, and whether writes must be serialized

Eventual Consistency: Reads
[Figure: the client reads a cached or backup copy (version v2) when the primary cannot be reached]
– Read the latest version of the file you can find quickly
– Within a given time limit (.MaxTime=)

Eventual Consistency: Writes
[Figure: a client writes the file at a backup, creating a new version v3 while the primary still holds v2]
– Write to the primary or to any backup of the file (creating a new version at the backup)
– Reconciling divergent replicas: a background process merges them, with no application involvement
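A hedged sketch of the client-side fallback these two slides describe: try the primary within the .MaxTime budget, and otherwise read from (or write to) a backup or cached copy. All helper functions are invented placeholders, not the WheelFS client's real interface:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Under .EventualConsistency with a .MaxTime budget, try the primary
 * first; if it does not respond in time, fall back to a backup or a
 * locally cached copy.  Reads may then be stale, and writes create a
 * divergent version that a background process merges later. */
struct file_version { uint64_t version; };

/* Placeholders: pretend the primary is currently unreachable. */
static bool primary_read(uint64_t id, int max_time_ms, struct file_version *v)
{ (void) id; (void) max_time_ms; (void) v; return false; }
static bool primary_write(uint64_t id, int max_time_ms, const void *data)
{ (void) id; (void) max_time_ms; (void) data; return false; }

/* Placeholders: a backup (or cached copy) is reachable. */
static bool backup_read(uint64_t id, struct file_version *v)
{ (void) id; v->version = 2; return true; }
static bool backup_write(uint64_t id, const void *data)
{ (void) id; (void) data; return true; }   /* creates a new version there */

static bool ec_read(uint64_t id, int max_time_ms, struct file_version *v) {
    if (primary_read(id, max_time_ms, v))
        return true;               /* fresh data, within the time budget */
    return backup_read(id, v);     /* possibly stale, but available */
}

static bool ec_write(uint64_t id, int max_time_ms, const void *data) {
    if (primary_write(id, max_time_ms, data))
        return true;
    return backup_write(id, data); /* divergent replicas merged later */
}

int main(void) {
    struct file_version v;
    if (ec_read(562, 200, &v))
        printf("read version %llu\n", (unsigned long long) v.version);
    return ec_write(562, 200, "new contents") ? 0 : 1;
}
```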

HotSpot: Client-to-Client Reads
[Figure: client-to-client chunk fetching among nodes A, B, and C]
– Get the list of nodes with cached copies
– Fetch chunks from other nodes in parallel
– Add yourself to the list of nodes with cached copies
– Use Vivaldi network coordinates to find nearby copies
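A simplified sketch of choosing a nearby cached copy, standing in for the Vivaldi-based selection the slide mentions; the structures and distances are illustrative only:

```c
#include <stddef.h>
#include <stdio.h>

/* Given the list of nodes that already cache a file, pick a nearby one
 * (WheelFS estimates nearness with Vivaldi network coordinates) and
 * fetch chunks from peers rather than from the primary.  This is a
 * stand-in, not the real client-to-client protocol. */
struct peer {
    const char *addr;
    double      distance;   /* e.g. Vivaldi coordinate distance */
};

/* Pick the cached copy with the smallest estimated network distance. */
static const struct peer *nearest_peer(const struct peer *peers, size_t n) {
    const struct peer *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (best == NULL || peers[i].distance < best->distance)
            best = &peers[i];
    return best;
}

int main(void) {
    struct peer cached[] = { { "nodeA", 42.0 }, { "nodeB", 7.5 }, { "nodeC", 19.0 } };
    const struct peer *p = nearest_peer(cached, 3);
    if (p != NULL)
        printf("fetch chunks from %s first\n", p->addr);
    return 0;
}
```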

Example Use of Cues: Cooperative Web Cache (CWC)
[Figure: Apache caching proxies at several sites share a WheelFS cache directory]
One-line change in the Apache config file: cache under /wfs/cache/$URL
Proxy logic:
  if $url exists in cache dir: read $url from WheelFS
  else: get page from web server; store page in WheelFS
Blocks under failure with the default strong consistency
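The slide's pseudocode, written out as a small sketch that uses ordinary file I/O on the shared WheelFS cache directory; the cache-file naming and the origin fetch are placeholders, not the deployment's actual code:

```c
#include <stdio.h>
#include <stdlib.h>

/* Stub standing in for the proxy's normal upstream HTTP request. */
static char *fetch_from_origin(const char *url, long *len) {
    (void) url;
    *len = 0;
    return NULL;   /* placeholder: a real proxy fetches the page here */
}

/* Return the page body, serving from WheelFS when the cache file exists. */
static char *serve_page(const char *url, const char *cache_path, long *len) {
    FILE *f = fopen(cache_path, "rb");
    if (f != NULL) {                          /* $url exists in cache dir */
        fseek(f, 0, SEEK_END);
        *len = ftell(f);
        rewind(f);
        char *buf = (*len > 0) ? malloc((size_t) *len) : NULL;
        if (buf != NULL)
            fread(buf, 1, (size_t) *len, f);  /* read $url from WheelFS */
        fclose(f);
        return buf;
    }

    char *page = fetch_from_origin(url, len); /* get page from web server */
    if (page != NULL) {
        FILE *out = fopen(cache_path, "wb");  /* store page in WheelFS */
        if (out != NULL) {
            fwrite(page, 1, (size_t) *len, out);
            fclose(out);
        }
    }
    return page;
}

int main(void) {
    long len = 0;
    char *body = serve_page("http://example.com/index.html",
                            "/wfs/cache/example.com_index.html", &len);
    free(body);
    return 0;
}
```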

Example Use of Cues: CWC
Cache dir: /wfs/cache/, with cues added:
– .EventualConsistency: read a cached file, or write the file data to any backup, even when the corresponding primary cannot be contacted. The Apache proxy handles potentially stale files well, since the freshness of cached web pages can be determined from saved HTTP headers.
– .HotSpot: tells WheelFS to read data from the nearest client cache it can find.
– .MaxTime=200: reads only block for 200 ms; after that, fall back to the origin web server.

WheelFS Implementation
– Runs on Linux, MacOS, and FreeBSD
– User-level file system using FUSE
– 25K lines of C++
– Unix ACL support
– Vivaldi network coordinates

Applications Evaluation
App: cues used (lines of code/configuration written or changed)
– Cooperative Web Cache: .EventualConsistency, .MaxTime, .HotSpot (1)
– All-Pairs-Pings: .EventualConsistency, .MaxTime, .HotSpot, .WholeFile (13)
– Distributed mail: .EventualConsistency, .Site, .RepLevel, .RepSites, .KeepTogether (4)
– File distribution: .WholeFile, .HotSpot (N/A)
– Distributed make: .EventualConsistency (for objects), .Strict (for source), .MaxTime (10)

Performance Questions
Recall that WheelFS is a wide-area file system that spreads the data load across many nodes, aims to support many wide-area apps, and gives apps control over wide-area tradeoffs using cues.
1. Does WheelFS distribute app storage load more effectively than a single-server DFS?
2. Can WheelFS apps achieve performance comparable to apps with specialized storage?
3. Do semantic cues improve application performance?

Storage Load Distribution Evaluation
– Up to 250 PlanetLab nodes; each client reads 10 files at random
– NFS: N clients, one non-PlanetLab NFS server at MIT storing 10N 1 MB files
– WheelFS: N clients, N WheelFS nodes storing 10 1 MB files each
Hypothesis: WheelFS clients will experience faster reads than NFS clients as the number of clients grows.

WheelFS Spreads Load More Evenly than NFS on PlanetLab
[Plot: WheelFS on PlanetLab vs. a dedicated NFS server at MIT; an annotation marks where the working set of files exceeds the NFS server's buffer cache]

File Distribution Evaluation
– 15 nodes at 5 wide-area sites on Emulab
– All nodes download 50 MB at the same time
– Direct transfer time for one node is 73 secs
– Use the .HotSpot cue; compare against BitTorrent
Hypothesis: WheelFS will achieve performance comparable to BitTorrent's, which uses a specialized data layer.

WheelFS HotSpot Cue Gets Files Faster than BitTorrent
– WheelFS's median download time is 33% better than BitTorrent's
– Both do far better than the median direct transfer time of 892 seconds

CWC Evaluation
– 40 PlanetLab nodes as Web proxies, 40 PlanetLab nodes as clients
– Web server: 400 Kbps link, 100 unique 41 KB pages
– Each client downloads random pages (same workload as in the CoralCDN paper)
– CoralCDN vs. WheelFS + Apache
Hypothesis: WheelFS will achieve performance comparable to CoralCDN's, which uses a specialized data layer.

WheelFS Achieves the Same Rate as CoralCDN
– CoralCDN ramps up more quickly due to special optimizations, but WheelFS soon achieves similar performance
– Total reqs/unique page: > 32,000
– Origin reqs/unique page: 1.5 (CoralCDN) vs. 2.6 (WheelFS)

CWC Failure Evaluation
– 15 proxies at 5 wide-area sites on Emulab, 1 client per site
– Each minute, one site goes offline for 30 secs (the data primaries at that site become unavailable)
– Eventual vs. strict consistency
Hypothesis: WheelFS using eventual consistency will achieve better performance during failures than WheelFS with strict consistency.

EC Improves Performance Under Failures
EventualConsistency allows nodes to use a cached version when the primary is unavailable

WheelFS Status
– Source available online
– Public PlanetLab deployment: PlanetLab users can mount shared storage, usable by apps or for binary/configuration distribution

Related File Systems
– Single-server FS: NFS, AFS, SFS
– Cluster FS: Farsite, GFS, xFS, Ceph
– Wide-area FS: Shark, CFS, JetFile
– Grid: LegionFS, GridFTP, IBP, Rooter
WheelFS gives applications control over wide-area tradeoffs

Storage Systems with Configurable Consistency
– PNUTS [VLDB '08]: Yahoo!'s distributed, wide-area database
– PADS [NSDI '09]: flexible toolkit for creating new storage layers
WheelFS offers a broad range of controls in the context of a single file system

Conclusion
– Storage must let apps control data behavior
– A small set of semantic cues allows control over placement, durability, large reads, and consistency
– WheelFS: a wide-area file system with semantic cues that allows quick prototyping of distributed apps