© 2012 Whamcloud, Inc. Distributed Namespace Status Phase I - Remote Directories Wang Di Whamcloud, Inc.

DNE Phase I - Remote Directory

Subdirectories on a remote metadata target
Scales the MDT namespace, like OSTs can today
Dedicated performance for users/jobs
All MDTs can use any/all OSTs to create objects

(Diagram: root and its files on MDT0, with remote directory dir1 and its files on MDT1, and remote directory dir2 and its files on MDT2)

Lustre User Group 2012

Remote Directory Implementation

Remote directory creation by administrator only
–Remote directory creation is a synchronous disk operation
  lfs mkdir -i {mdtidx} /path/to/remote_dir
Files/subdirs created in a remote directory stay on the same MDT
–Local operations (create, unlink, open, close) run at maximum performance
–Limits RPCs that need to communicate with multiple MDTs
–Simplifies implementation for initial deployment

Remote Directory Limitations

A failed/disabled MDT affects all of its subtrees
–Accessing a failed/disabled MDT returns EIO
–Disabling MDT0 makes the whole namespace inaccessible
Remote directories can only be created on MDT0
–Otherwise, failure of one MDT could isolate other MDTs
Rename or link across MDTs returns -EXDEV
Deliberate limitation of complexity
–Limits testing, recovery, and failure scenarios for the initial deployment
–Restrictions will be relaxed as experience is gained, or via an override
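The -EXDEV return is the same errno a rename across mount points produces, so userspace tools can fall back to copy-and-unlink, just as mv does. A minimal illustrative sketch (not Lustre code, and the function name is hypothetical):

```python
import errno
import os
import shutil

def move_across_mdts(src: str, dst: str) -> None:
    """Move a file, falling back to copy-and-unlink when rename()
    fails with EXDEV (e.g. a cross-MDT rename under DNE Phase I)."""
    try:
        os.rename(src, dst)          # fast path: same MDT/filesystem
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
        shutil.copy2(src, dst)       # slow path: copy data and metadata
        os.unlink(src)               # then remove the source
```

The fallback costs a full data copy, which is why keeping a whole subtree on one MDT (the Phase I design) keeps renames fast in the common case.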

Enable DNE on a New/Existing Filesystem

MDT disk format must use the ldiskfs dir_data feature
–Default for any 2.x-formatted filesystem
–Allows storing remote directory entry pointers
–Enable on 1.x filesystems: tune2fs -O dir_data /dev/mdt0
Upgrade clients, MGS, MDS, OSS to Lustre 2.4+
–Not required to enable DNE when upgrading to Lustre 2.4+
–Once DNE is enabled, downgrading to an older Lustre is difficult: it requires copying/deleting all files not on MDT0
Add new MDTs to a running filesystem
–Clients without DNE support are evicted at this point
–New MDTs are only used once a remote directory entry is created
  mkfs.lustre --reformat --mgsnode={mgsnode} --mdt --index=N /dev/{mdtN}
  mount -t lustre /dev/{mdtN} /mnt/{mdtN}

DNE Phase II - Shard/Stripe Directory

Hash a single directory across multiple MDTs
Multiple servers active for one directory's entries/inodes
Improves performance for large directories

(Diagram: a striped directory with master shard dir.0 on MDT0 and slave shards dir.1, dir.2, dir.3 on MDT1-MDT3; entries such as ace, bee, bob, car, cat, dale, dog are hashed to shards)
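The key idea is that every client can compute an entry's shard from its name alone, so lookups go straight to the right MDT with no central index. A sketch of the routing, assuming CRC32 as the hash (Lustre's actual directory hash functions differ):

```python
import zlib

NUM_SHARDS = 4  # assumption: one shard per MDT, MDT0..MDT3

def shard_for(name: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick the directory shard (and thus the MDT) holding `name`.
    Illustrative only: not Lustre's real hash function."""
    return zlib.crc32(name.encode()) % num_shards
```

Because the mapping is deterministic, creates, lookups, and unlinks for different names spread across all MDTs, which is where the large-directory speedup comes from.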

Lustre File IDentifier (FID)

Unique cluster-wide identifier for each file/directory
–Introduced in Lustre 2.0
–Three components form the object address: {f_seq, f_oid, f_ver}
–A large sequence range is allocated to each server
–Sequences are large, so FIDs are never re-used
FID Location Database (FLDB) maps FID -> server
–FLDB is known to all clients and servers
–Kept small because there are only a few sequence ranges
–The sequence is looked up in the FLDB to find the MDT/OST index
Object Index (OI) maps FID -> inode on each server
–OI maps a FID to a local inode number

(FID layout: Sequence # is 64 bits; Object # and Version are 32 bits each)
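The two-level mapping (FLDB for FID -> server, then OI for FID -> inode) can be sketched as follows; the data structures and sequence-range values are illustrative assumptions, not Lustre's actual formats:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fid:
    """Sketch of a Lustre FID: {f_seq, f_oid, f_ver} with a 64-bit
    sequence and 32-bit object id and version."""
    f_seq: int   # sequence, allocated in large ranges per server
    f_oid: int   # object id within the sequence
    f_ver: int   # version, normally 0

# FLDB sketch: (seq_start, seq_end) ranges -> server index.
# A few wide ranges keep the table small on every client and server.
FLDB = [
    (0x200000000, 0x240000000, 0),  # hypothetical range owned by MDT0
    (0x240000000, 0x280000000, 1),  # hypothetical range owned by MDT1
]

def fid_to_server(fid: Fid) -> int:
    """Look up the FID's sequence in the FLDB to find its MDT."""
    for start, end, server in FLDB:
        if start <= fid.f_seq < end:
            return server
    raise LookupError(f"no server owns sequence {fid.f_seq:#x}")
```

Only the owning server then consults its local OI to translate the FID into an inode number.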

DNE Master and Slave MDTs

Client does a filename lookup in the parent directory
–The root directory lives on MDT0
Client maps the FID to the Master MDT via the FLDB
–If a request involves only one MDT, it behaves the same as the current single-MDT case
Some operations need to access Slave MDTs
–These are called cross-MDT operations
–The Master MDT forwards update(s) to the other MDT(s) to finish the request
–Create/unlink of a remote directory are the only cross-MDT operations today

(Diagram: the client consults the FLDB to get the Master MDT for the operation, sends the request to the Master MDT, which forwards updates to Slave MDT2 before replying)

DNE Operation: Create Remote Directory

(Diagram slide)

Create Resend between MDTs

Master MDT checks the RPC XID against its last_rcvd file
–Determines whether the operation was committed to disk or not
–Committed: the Master MDT reconstructs the RPC reply from the last_rcvd entry
–Uncommitted: the Master MDT redoes the creation
Resends the same directory-creation RPC to the Slave MDT using the same FID
Slave MDT checks whether the remote directory was already created
–Looks up the FID requested by the Master in its local OI
–Creates a new subdirectory with that FID if missing
–Returns success to the Master
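Because the resend reuses the same FID, the Slave-side create is idempotent: replaying a possibly-lost RPC is always safe. A sketch of that check, with a plain dict standing in for the Slave's Object Index (names are illustrative, not Lustre's internal API):

```python
def slave_create(oi: dict, fid: tuple, parent: str, name: str) -> str:
    """Create a remote directory for `fid`, or do nothing if a
    previous (possibly lost) RPC already created it."""
    if fid in oi:                    # FID already mapped to an inode:
        return "already-created"     # the earlier attempt succeeded
    oi[fid] = (parent, name)         # create the directory under this FID
    return "created"
```

The Master can therefore blindly resend after a failure; the second delivery of the same create is a no-op on the Slave.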

DNE Operation: Unlink Remote Directory

(Diagram slide)

Unlink Resend between MDTs

Master MDT checks the RPC XID against its last_rcvd file
–Determines whether the operation was committed to disk or not
–Committed: the Master MDT reconstructs the RPC reply from the last_rcvd entry
–Uncommitted: the Master unlinks, deletes the name, adds a destroy log record, etc.
If the Slave MDT fails during this process
–The llog sync thread on the Master MDT will resend the destroy to the Slave MDT
–Directory unlinks are idempotent, so they can safely be retried

Remote Directory Entry

The FID is packed into the name entry
Each remote entry has a local agent inode
The real object (inode) on the remote MDT is found via its OI
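Packing the FID into the name entry means a lookup can route to the remote MDT without dereferencing a local inode. A sketch of such an entry encoding, using the 64/32/32-bit FID field widths from the slide above; the record layout itself is an assumption, not Lustre's on-disk format:

```python
import struct

def pack_remote_entry(name: str, f_seq: int, f_oid: int, f_ver: int) -> bytes:
    """Encode a directory entry as name length + name + packed FID."""
    fid = struct.pack(">QII", f_seq, f_oid, f_ver)  # 64 + 32 + 32 bits
    return struct.pack(">H", len(name)) + name.encode() + fid

def unpack_remote_entry(rec: bytes):
    """Decode the entry back into (name, (f_seq, f_oid, f_ver))."""
    (nlen,) = struct.unpack_from(">H", rec)
    name = rec[2:2 + nlen].decode()
    f_seq, f_oid, f_ver = struct.unpack_from(">QII", rec, 2 + nlen)
    return name, (f_seq, f_oid, f_ver)
```

Storing extra data alongside the name is exactly what the ldiskfs dir_data feature (required for DNE, as noted earlier) enables.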

MDT Disk Layout

Two directories (AGENT and REMOTE) are added
AGENT
–Each remote entry has a local agent inode
–Agent inodes are located under /AGENT/MDTn, one for each remote MDT
REMOTE
–Remote directories on a Slave MDT are created under /REMOTE
Keeps the local disk filesystem consistent
Allows efficient checking of cross-MDT links by LFSCK

DNE High Availability

Active-Active MDT failover is available with DNE
–Allows multiple MDTs to be exported from one MDS
–Ensures the filesystem remains available in the face of an MDS node failure
–Prevents isolation of large parts of the filesystem

(Diagram: MDS1 serves MDT1 and MDS2 serves MDT2; on failure, the surviving MDS takes over the other's MDT)

Internal Architecture

(Diagram slide)

Early Test Results

Testing done on LLNL Hyperion
–100 clients, 8 mount points
–Separate directory per mount point
–One stripe per file

Thank You

Wang Di
Whamcloud, Inc.