1 JTE HPC/FS Pastis: a peer-to-peer file system for persistent large-scale storage Jean-Michel Busca, Fabio Picconi, Pierre Sens LIP6, Université Paris 6.

Presentation transcript:

1 JTE HPC/FS Pastis: a peer-to-peer file system for persistent large-scale storage
Jean-Michel Busca, Fabio Picconi, Pierre Sens
LIP6, Université Paris 6 – CNRS, Paris, France
INRIA, Rocquencourt, France

2 Outline
1. DHT-based file systems
2. Pastis
3. Performance evaluation

3 Distributed file systems

Scalability (number of nodes) \ Architecture   Client-server   P2P
LAN (100)                                      NFS             -
Organization (10,000)                          AFS             FARSITE, Pangaea
Internet ( )                                   -               Ivy *, Oceanstore *, Pastis *

* uses a Distributed Hash Table (DHT) to store data

4 Distributed Hash Tables

5 DHTs: overlay network
[Figure: nodes located in South America, North America, Australia, Asia and Europe are mapped onto a logical address space.]
- high latency, low bandwidth between logical neighbors

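6 Insertion of blocks in DHT
[Figure: nodes placed on the logical address space by node ID (04F2, 895D, AC78, C52A, E25A, ...); keys k = 8958 and k = 8959 are shown on the same address space, next to node 895D.]
- put(8959,block) delivers the block to the root of key 8959
- copies of the block are also stored on replica nodes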

7 Insertion of blocks in DHT
[Figure: same address space and node IDs as on the previous slide, with the block stored on the root of key 8959 and on its replicas.]
- get(8959,block) is routed toward key 8959 and retrieves the stored block
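The put/get behavior on the two slides above can be sketched in a few lines of Python. This is only an illustration, not the Past/Pastry implementation: it assumes a 16-bit circular identifier space, picks the root as the node whose ID is numerically closest to the key, and places the replicas on the next-closest nodes. The class name ToyDHT, the replica count and the node list (only the node IDs that are fully legible on the slide) are assumptions made for the example.

```python
ID_SPACE = 1 << 16  # assumed 16-bit identifier space (4 hex digits, as on the slide)


def ring_distance(a: int, b: int) -> int:
    """Distance between two identifiers on the circular address space."""
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)


class ToyDHT:
    """In-memory stand-in for a DHT: one dict per node, no real routing."""

    def __init__(self, node_ids, replicas=3):
        self.stores = {nid: {} for nid in node_ids}
        self.replicas = replicas  # root included (assumed value)

    def replica_set(self, key):
        # Nodes ranked by closeness to the key; the first one is the root.
        ranked = sorted(self.stores, key=lambda nid: (ring_distance(nid, key), nid))
        return ranked[: self.replicas]

    def root(self, key):
        return self.replica_set(key)[0]

    def put(self, key, block):
        # Store the block on the root and on the other replica nodes.
        for nid in self.replica_set(key):
            self.stores[nid][key] = block

    def get(self, key):
        # Any node of the replica set that holds the block can serve it.
        for nid in self.replica_set(key):
            if key in self.stores[nid]:
                return self.stores[nid][key]
        raise KeyError(hex(key))


dht = ToyDHT([0x04F2, 0x895D, 0xAC78, 0xC52A, 0xE25A])
dht.put(0x8959, b"block")
print(hex(dht.root(0x8959)), dht.get(0x8959))
```

Because get() tries every node of the replica set, the block stays readable in this sketch even if the root loses its copy.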

8 P2P File systems architecture
FS layer (Ivy / Pastis), interface: open(), read(), write(), close(), etc.
- files and directories
- read-write access semantics
- security and access control
DHT layer (DHash / Past), interface: put(key, block), block = get(key)
- block store (DHT)
- message routing
Properties:
- scalability
- fault-tolerance
- self-organization

9 DHT-based file systems
Ivy [OSDI’02]
- log-based, one log per user
- fast writes, slow reads
- limited to small number of users
[Figure: User A’s log, User B’s log and User C’s log, each made of DHT objects.]
Oceanstore [FAST’03]
- updates serialized by primary replicas
- partially centralized system
- BFT agreement protocol requires well-connected primary replicas
[Figure: primary replicas and secondary replicas.]

10 Pastis

11 Pastis design
Design goals
- simple
- completely decentralized
- scalable (network size and number of users)
Layers
- FS: Pastis
- DHT: Past (storage) over Pastry (routing), interface: put(key, block), block = get(key)

12 Pastis data structures
Data structures similar to the Unix file system
- inodes are stored in modifiable DHT blocks (UCBs)
- file contents are stored in immutable DHT blocks (CHBs)
[Figure: the file inode (a UCB holding metadata and block addresses, stored under the inode key) points to CHB1 and CHB2, which hold the file contents; the UCB, CHB1 and CHB2 each map to a replica set in the DHT address space.]

13 Pastis data structures (cont.)
- directories contain (file name, inode key) entries
- use indirect blocks for large files
[Figure: the directory inode (UCB: metadata, block addresses) points to a CHB listing "file1, key1", "file2, key2", ...; key1 leads to the file1 inode (UCB: metadata, block addresses), which points to CHBs holding file contents, either directly or through a CHB indirect block; CHBs labeled "old contents" are also shown.]
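A rough Python sketch of these Unix-like structures follows, assuming the DHT is a plain dict. The tiny block size, the number of direct addresses per inode, the JSON encoding of directories and indirect blocks, and all function names are hypothetical choices for the example; the slides do not specify them.

```python
import hashlib
import json

BLOCK_SIZE = 4   # tiny on purpose, so a short file spans several blocks
MAX_DIRECT = 2   # hypothetical number of direct block addresses per inode


def put_chb(dht, data: bytes) -> str:
    """Store an immutable block under the hash of its contents."""
    key = hashlib.sha1(data).hexdigest()
    dht[key] = data
    return key


def write_file(dht, inode_key, contents: bytes):
    """Split contents into CHBs and store an inode that points to them."""
    keys = [put_chb(dht, contents[i:i + BLOCK_SIZE])
            for i in range(0, len(contents), BLOCK_SIZE)]
    inode = {"size": len(contents), "direct": keys[:MAX_DIRECT], "indirect": None}
    if len(keys) > MAX_DIRECT:
        # Large file: remaining addresses go into an indirect block (itself a CHB).
        inode["indirect"] = put_chb(dht, json.dumps(keys[MAX_DIRECT:]).encode())
    dht[inode_key] = inode  # stands in for inserting the inode UCB


def read_file(dht, inode_key) -> bytes:
    inode = dht[inode_key]
    keys = list(inode["direct"])
    if inode["indirect"] is not None:
        keys += json.loads(dht[inode["indirect"]].decode())
    return b"".join(dht[k] for k in keys)


def lookup(dht, dir_inode_key, name) -> str:
    """A directory is a file whose contents map file names to inode keys."""
    return json.loads(read_file(dht, dir_inode_key).decode())[name]


dht = {}
write_file(dht, "key1", b"hello pastis!")
write_file(dht, "dirkey", json.dumps({"file1": "key1"}).encode())
print(read_file(dht, lookup(dht, "dirkey", "file1")))
```

Here the inode keys ("key1", "dirkey") are arbitrary strings; in Pastis the inode key is derived from a public key, as the User Certificate Block slide describes.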

14 Content Hash Block
block key = Hash( block contents )
- block contents determine block key
- can detect if block is modified
- block is immutable
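The self-verifying property of a Content Hash Block can be illustrated as below. The slide does not name the hash function; SHA-1 is an assumption here, and the dict standing in for the DHT and the function names are hypothetical.

```python
import hashlib


def chb_key(contents: bytes) -> str:
    """block key = Hash(block contents); SHA-1 is assumed for the example."""
    return hashlib.sha1(contents).hexdigest()


def put_chb(store: dict, contents: bytes) -> str:
    key = chb_key(contents)
    store[key] = contents
    return key


def get_chb(store: dict, key: str) -> bytes:
    """Fetch a block and reject it if its contents no longer match its key."""
    contents = store[key]
    if chb_key(contents) != key:
        raise ValueError("block was modified: contents do not hash to the key")
    return contents


store = {}
key = put_chb(store, b"file contents")
print(get_chb(store, key))
```

Since any change to the contents changes the key, a "modified" block is simply a different block stored under a different key, which is why CHBs are immutable.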

15 User Certificate Blocks
Keys
- (KB_pub, KB_priv) associated to each block
- (KU_pub, KU_priv) associated to each user
Certificate
- grants write access to a given user (identified by KU_pub)
- issued by the file owner
- expiration date allows access revocation
Authentication
- verify signature of certificate using the storage key (KB_pub)
- verify signature of UCB using KU_pub
UCB layout
- KB_pub
- inode contents
- timestamp
- certificate: KU_pub, expiration date, sign(KB_priv)
- sign(KU_priv)
block key = Hash( KB_pub )
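The two-step authentication above can be sketched as follows. To stay self-contained the example uses textbook RSA over fixed Mersenne primes, which is a toy and not secure; the slides do not say which signature scheme, hash function or field encoding Pastis uses, so those choices, the set of fields covered by each signature, and every name below are assumptions. The code needs Python 3.8+ for the modular inverse via pow().

```python
import hashlib
from dataclasses import dataclass

E = 65537  # public exponent shared by all toy keys


def make_keypair(p: int, q: int):
    """Toy RSA key pair: returns (public modulus, private exponent)."""
    return p * q, pow(E, -1, (p - 1) * (q - 1))


def digest(data: bytes, n: int) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n


def sign(data: bytes, pub: int, priv: int) -> int:
    return pow(digest(data, pub), priv, pub)


def verify(data: bytes, sig: int, pub: int) -> bool:
    return pow(sig, E, pub) == digest(data, pub)


@dataclass
class Certificate:
    ku_pub: int    # user allowed to write
    expires: int   # expiration date
    sig: int = 0   # sign(KB_priv), issued by the file owner

    def body(self) -> bytes:
        return f"{self.ku_pub}|{self.expires}".encode()


@dataclass
class UCB:
    kb_pub: int
    inode: bytes
    timestamp: int
    cert: Certificate
    sig: int = 0   # sign(KU_priv), produced by the writer

    def body(self) -> bytes:
        return f"{self.kb_pub}|{self.timestamp}|".encode() + self.inode


def block_key(kb_pub: int) -> str:
    return hashlib.sha1(str(kb_pub).encode()).hexdigest()  # block key = Hash(KB_pub)


def check_ucb(key: str, ucb: UCB, now: int) -> bool:
    return (key == block_key(ucb.kb_pub)
            and now <= ucb.cert.expires
            and verify(ucb.cert.body(), ucb.cert.sig, ucb.kb_pub)   # certificate vs KB_pub
            and verify(ucb.body(), ucb.sig, ucb.cert.ku_pub))       # UCB vs KU_pub


kb_pub, kb_priv = make_keypair(2**61 - 1, 2**89 - 1)     # block (storage) key pair
ku_pub, ku_priv = make_keypair(2**107 - 1, 2**127 - 1)   # user key pair
cert = Certificate(ku_pub, expires=1000)
cert.sig = sign(cert.body(), kb_pub, kb_priv)
ucb = UCB(kb_pub, b"inode v1", timestamp=10, cert=cert)
ucb.sig = sign(ucb.body(), ku_pub, ku_priv)
print(check_ucb(block_key(kb_pub), ucb, now=500))
```

In this sketch the expiration check is what gives revocation: once `now` passes `expires`, the same UCB is rejected and the owner must issue a fresh certificate for the user to keep writing.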

16 Pastis – Update handling
File update
- insert the new file contents (CHBs)
- reinsert the file inode (UCB)
- replace data blocks
[Figure, initial state: the directory inode (UCB1) points to CHB1, the directory contents (entries ..., file3, ...); the file inode (UCB2) points to CHB3, the file contents "foo".]

17 Pastis – Update handling
Step shown: insert new CHB into the DHT
[Figure: same structure as the initial state (directory inode UCB1 with CHB1, file inode UCB2 with CHB3 "foo"), plus CHB4 holding the new file contents "foo bar".]

18 Pastis – Update handling
Step shown: update file inode to point to new CHB
[Figure: the file inode (UCB2) now refers to CHB4, the new file contents "foo bar"; CHB3 "foo", the directory inode (UCB1) and CHB1 are unchanged.]

19 Pastis – Update handling
Step shown: reinsert inode UCB into the DHT
[Figure: the updated file inode (UCB2), pointing to CHB4 "foo bar", is stored back in the DHT; the directory inode (UCB1) and CHB1 are unchanged.]
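The update sequence shown on these slides can be sketched as below, again with a dict standing in for the DHT. The inode layout (a dict with a "blocks" list), the use of SHA-1 and the function names are assumptions for illustration; signatures and replication are left out.

```python
import hashlib


def chb_key(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()  # assumed hash function


def update_file(dht: dict, inode_key: str, new_contents: bytes) -> str:
    # Step 1: insert the new file contents into the DHT as a new CHB.
    new_key = chb_key(new_contents)
    dht[new_key] = new_contents
    # Step 2: update a local copy of the file inode so it points to the new CHB.
    inode = dict(dht[inode_key])
    inode["blocks"] = [new_key]
    # Step 3: reinsert the inode UCB into the DHT under its unchanged key.
    dht[inode_key] = inode
    return new_key


# Initial state from the slides: file inode UCB2 points to CHB3 ("foo").
dht = {}
chb3 = chb_key(b"foo")
dht[chb3] = b"foo"
dht["UCB2"] = {"blocks": [chb3]}
dht["UCB1"] = {"entries": {"file3": "UCB2"}}  # directory inode, simplified

chb4 = update_file(dht, "UCB2", b"foo bar")
print(dht["UCB2"]["blocks"] == [chb4])
```

Note that the directory is never touched: the file keeps the same inode key, so only the file's own UCB is rewritten, and the old CHB simply stops being referenced.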

20 Pastis – Consistency
Strict consistency → too expensive, requires too many network accesses
Close-to-open consistency
- open(): returns the latest version of the file committed by close()
- between open() and close(): the user sees only their own updates
- defer writes until the file is closed
[Figure: timeline of clients A and B with open, read ‘1’, write ‘2’, read ‘2’ and close events. The write is cached until close (CHBs and inode UCB are stored in a local buffer). On close, write ‘2’ is sent to the network (CHBs and UCB are inserted into the DHT). A “close-to-open” path makes updates visible: B retrieves the inode from the DHT.]
Still quite expensive: an open requires retrieving the most up-to-date inode replica
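A minimal sketch of close-to-open semantics, under strong simplifications: the DHT is a dict that always returns the latest committed version, and a file is a single value rather than an inode UCB plus CHBs. The class and method names are hypothetical.

```python
class CloseToOpenFile:
    """Client-side handle that defers writes until close()."""

    def __init__(self, dht: dict, key: str):
        self.dht = dht
        self.key = key
        self.contents = None
        self.dirty = False

    def open(self):
        # open() fetches the latest version committed by a close().
        self.contents = self.dht[self.key]
        self.dirty = False

    def read(self) -> bytes:
        # Between open() and close() the client only sees its own updates.
        return self.contents

    def write(self, data: bytes):
        # The write is cached in a local buffer, nothing is sent yet.
        self.contents = data
        self.dirty = True

    def close(self):
        # Deferred writes are sent to the network on close.
        if self.dirty:
            self.dht[self.key] = self.contents
            self.dirty = False


dht = {"f": b"1"}
a, b = CloseToOpenFile(dht, "f"), CloseToOpenFile(dht, "f")
a.open(); b.open()
a.write(b"2")
print(a.read(), b.read(), dht["f"])   # A sees its own write; B and the DHT do not
a.close()
b.close(); b.open()
print(b.read())                       # visible to B after A's close and B's next open
```

The cost noted on the slide comes from open(): in a real DHT there are several inode replicas, and finding the most up-to-date one requires contacting more than one of them.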

21 Pastis – Consistency
Read-your-writes consistency
- relaxation of the close-to-open model
- read() must reflect previous local writes only
- writes from other clients may or may not be visible
[Figure: timeline of clients A and B with open, read ‘1’, write ‘2’, read ‘2’ and close events. A read must reflect the client's own previous writes; A’s read may not reflect B’s writes.]
An open does not require retrieving the most up-to-date inode replica: it only needs to fetch one inode replica that is not older than those accessed previously

22 Evaluation

23 Evaluation
Prototype
- programmed in Java
- client interface: NFS, Fuse
- test program: Andrew Benchmark
  - Phase 1: create subdirectories
  - Phase 2: copy files
  - Phase 3: read file attributes
  - Phase 4: read file contents
  - Phase 5: make command
Emulation
- LAN with one DHT node per machine
- Dummynet router emulates WAN latencies
Simulation
- discrete event simulator: LS 3
- simulates overlay network latency

24 Pastis performance with concurrent clients
Configuration
- 16 DHT nodes
- 100 ms constant inter-node latency (Dummynet)
- 4 replicas per object
- close-to-open consistency
[Figure: normalized execution time [sec.] for concurrent clients (each running an independent benchmark), every user reading and writing to the FS.]
Ivy’s read overhead increases rapidly with the number of users (the client must retrieve the records of more logs)

25 Pastis consistency models
Configuration
- 16 DHT nodes
- 100 ms constant inter-node latency (Dummynet)
- 4 replicas per object
[Figure: execution time [sec.] per Andrew Benchmark phase (dirs, write, attr., read, make) for Pastis (close-to-open), Pastis (read-your-writes) and NFSv3.]

Performance penalty compared to NFS (close-to-open)
System                 Penalty
Pastis                 1.9
Ivy [OSDI’02]          2.0 – 3.0
Oceanstore [FAST’03]   2.55

26 Evaluation: consistency models
Configuration: N = 32768, sphere topology, max. latency: 300 ms, k = 16
[Figure: CTO (close-to-open) and RYW (read-your-writes) with 10% stale UCB replicas.]

27 Conclusion
Pastis
- simple
- completely decentralized (cf. Oceanstore)
- scalable number of users (cf. Ivy)
- good performance thanks to:
  - PAST-Pastry’s locality properties
  - relaxed consistency models (close-to-open, read-your-writes)
Future work
- explore new consistency models
- flexible replica location
- evaluation in a wide-area testbed (Planetlab)

28 Links
Pastis :
Pastry, Past :
LS 3 :

29 Questions?

30 Pastis design
[Figure: the Pastis FS runs on top of the Past / Pastry overlay, which spans the Internet; labels: blocks distribution, root, replication.]