1 GPFS: A Shared-Disk File System for Large Computing Clusters
Frank Schmuck & Roger Haskin
IBM Almaden Research Center

2 Introduction
- Machines are getting more powerful, but we can always find bigger problems to solve
- Faster networks → machines can form clusters, promising a way to solve big problems
- GPFS (General Parallel File System) mimics the semantics of a POSIX file system running on a single machine
- Runs on 6 of the top 10 supercomputers

3 Introduction
- Web server workloads: multiple nodes access multiple files
- Supercomputer workloads: a single node can access a file stored on multiple nodes, and multiple nodes can access the same file stored on multiple nodes
- Need to access files and metadata in parallel
- Need to perform administrative functions in parallel

4 GPFS Overview
- All nodes access shared disks over a switching fabric (e.g., a storage area network)

5 General Large File System Issues
- Data striping and allocation, prefetch, and write-behind
- Large directory support
- Logging and recovery

6 Data Striping and Prefetch
- Striping implemented at the file system level: better control, fault tolerance, load balancing
- GPFS recognizes sequential, reverse-sequential, and various strided access patterns, and prefetches data accordingly
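
A minimal sketch of how file-system-level striping can map a file offset to a disk, assuming simple round-robin placement (the function name and the exact placement policy are illustrative, not the GPFS algorithm):

```python
BLOCK_SIZE = 256 * 1024  # full block size from the allocation slide (256 KB)

def block_location(file_offset, num_disks, first_disk=0):
    """Map a byte offset in a file to (disk index, logical block index),
    assuming simple round-robin striping across num_disks disks.
    first_disk lets different files start on different disks."""
    block_index = file_offset // BLOCK_SIZE           # which block of the file
    disk = (first_disk + block_index) % num_disks     # round-robin placement
    return disk, block_index

# Sequential access touches every disk in turn, so once a sequential pattern
# is recognized, prefetch requests can be issued to all disks in parallel.
for offset in range(0, 8 * BLOCK_SIZE, BLOCK_SIZE):
    print(offset, block_location(offset, num_disks=4))
```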

7 Allocation
- Large files are stored in 256 KB blocks; small files are stored in 8 KB subblocks
- Need to watch out for disks of different sizes:
  - Maximizing space utilization → larger disks receive more I/O requests and become a bottleneck
  - Maximizing parallel performance → space on larger disks goes under-utilized
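
A small worked example of the block/subblock split, assuming the 256 KB and 8 KB sizes from the slide and assuming that the tail of a file can be stored in subblocks (the function name is illustrative):

```python
BLOCK_SIZE = 256 * 1024
SUBBLOCK_SIZE = 8 * 1024                              # 32 subblocks per block

def space_used(file_size):
    """Bytes actually allocated: full 256 KB blocks for all but the tail,
    then just enough 8 KB subblocks for the remainder, so small files
    waste at most one subblock of space."""
    full_blocks, tail = divmod(file_size, BLOCK_SIZE)
    tail_subblocks = -(-tail // SUBBLOCK_SIZE)        # ceiling division
    return full_blocks * BLOCK_SIZE + tail_subblocks * SUBBLOCK_SIZE

print(space_used(5 * 1024))      # 5 KB file  -> one 8 KB subblock
print(space_used(300 * 1024))    # 300 KB file -> one full block + 6 subblocks
```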

8 Large Directory Support
- GPFS uses extensible hashing to support very large directories
- [Diagram: directory blocks of hashed entries such as 0100 | file1, 1001 | file2, 0011 | dir1, 1110 | file2_hardlink, with empty slots]
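
A minimal sketch of directory lookup with extensible hashing, assuming the low-order bits of a filename hash select the directory block; the class and helper names are illustrative, not the GPFS on-disk format, and this toy version rehashes everything on a split, whereas real extensible hashing splits only the overflowing block lazily:

```python
import hashlib

def name_hash(name: str) -> int:
    """Stable hash of a directory entry name."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest()[:4], "big")

class ExtensibleDirectory:
    """Toy extensible-hash directory: 2**depth blocks, each entry placed by
    the low-order `depth` bits of its name hash."""
    def __init__(self, block_capacity=2):
        self.depth = 0
        self.blocks = [dict()]                 # one block initially
        self.block_capacity = block_capacity

    def _block_for(self, name):
        return self.blocks[name_hash(name) & ((1 << self.depth) - 1)]

    def insert(self, name, inode):
        if len(self._block_for(name)) >= self.block_capacity:
            self._split()
        self._block_for(name)[name] = inode

    def _split(self):
        """Double the directory and rehash entries into 2**(depth+1) blocks."""
        self.depth += 1
        old = self.blocks
        self.blocks = [dict() for _ in range(1 << self.depth)]
        for block in old:
            for name, inode in block.items():
                self._block_for(name)[name] = inode

    def lookup(self, name):
        return self._block_for(name).get(name)

d = ExtensibleDirectory()
for i in range(8):
    d.insert(f"file{i}", inode=100 + i)
print(d.depth, d.lookup("file3"))
```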

9 Logging and Recovery
- In a large file system, there is no time to run fsck
- Metadata updates use journaling with a write-ahead log; data are not logged
- Each node has a separate log that can be read by all nodes, so any node can perform recovery on behalf of a failed node
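
A minimal sketch of write-ahead logging for metadata, using in-memory stand-ins for the shared disks; the record format and replay logic are illustrative only:

```python
import json

class MetadataJournal:
    """Toy write-ahead log: a metadata update is appended to the log (and
    would be forced to disk) before the metadata block itself is updated in
    place. Because every node's log lives on the shared disks, any surviving
    node can replay a failed node's log."""
    def __init__(self):
        self.log = []          # stand-in for this node's on-disk log
        self.metadata = {}     # stand-in for shared on-disk metadata blocks

    def update(self, key, value):
        record = {"key": key, "value": value}
        self.log.append(json.dumps(record))   # 1. append log record first
        self.metadata[key] = value            # 2. then update metadata in place

    @staticmethod
    def recover(log, metadata):
        """Replay the committed records from a failed node's log."""
        for line in log:
            record = json.loads(line)
            metadata[record["key"]] = record["value"]
        return metadata
```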

10 Managing Parallelism and Consistency in a Cluster

11 Distributed Locking vs. Centralized Management
- Goal: reading and writing in parallel from all nodes in the cluster
- Constraint: POSIX semantics require synchronizing access to data and metadata from multiple nodes
- If two processes on two nodes access the same file → a read on one node sees either all or none of the data written by a concurrent write

12 Distributed Locking vs. Centralized Management
- Two approaches to locking:
  - Distributed: consult with all other nodes before acquiring locks → greater parallelism
  - Centralized: consult with a designated node → better for frequently updated metadata

13 Lock Granularity
- Too small → high overhead
- Too large → many contending lock requests

14 The GPFS Distributed Lock Manager
- A centralized global lock manager on one node, plus a local lock manager on each node
- The global lock manager hands out lock tokens (the right to grant locks locally) to the local lock managers
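
A minimal sketch of the token idea, assuming a single global token-manager object and hypothetical class and method names; conflict resolution and fault tolerance are omitted:

```python
class GlobalTokenManager:
    """Toy token server: grants the token for a resource to one node at a
    time. While a node holds the token it can grant and release local locks
    on that resource without any further messages to the token manager."""
    def __init__(self):
        self.holders = {}                  # resource -> node holding its token

    def acquire(self, node, resource):
        holder = self.holders.get(resource)
        if holder is None or holder == node:
            self.holders[resource] = node
            return True                    # token granted
        # In GPFS the requester would ask the holder to relinquish the token;
        # here we simply report the conflict.
        return False

class LocalLockManager:
    def __init__(self, node, token_mgr):
        self.node, self.token_mgr = node, token_mgr
        self.tokens = set()

    def lock(self, resource):
        """Locks on a resource whose token we already hold are purely local."""
        if resource in self.tokens:
            return True
        if self.token_mgr.acquire(self.node, resource):
            self.tokens.add(resource)
            return True
        return False
```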

15 Parallel Data Access
- How can multiple nodes write to the same file?
- Byte-range locking synchronizes reads and writes while allowing concurrent writes to different parts of the same file

16 Byte-Range Tokens
- The first write request from a node acquires a token for the whole file → efficient for non-concurrent writes
- A second write request to the same file from another node revokes part of the byte-range token held by the first node
- Knowing the reference pattern helps predict how to break up the byte ranges

17 Byte-Range Tokens
- Byte ranges are rounded to block boundaries so that two nodes cannot modify the same block
- False sharing: a shared block is moved back and forth between nodes because each keeps updating a different part of it
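
A minimal sketch of byte-range token negotiation, assuming a whole-file token on first access, revocation of only the requested range from the previous holder, and rounding to 256 KB block boundaries; the data structures and helper names (ByteRangeTokens, subtract) are illustrative, not the GPFS protocol:

```python
BLOCK_SIZE = 256 * 1024

def round_to_blocks(start, end):
    """Round a byte range outward to block boundaries so that no block ends
    up split between two writers."""
    return (start // BLOCK_SIZE) * BLOCK_SIZE, -(-end // BLOCK_SIZE) * BLOCK_SIZE

def subtract(ranges, cut):
    """Remove the interval `cut` from a list of (start, end) intervals."""
    cs, ce = cut
    out = []
    for s, e in ranges:
        if e <= cs or s >= ce:
            out.append((s, e))            # no overlap, keep as-is
        else:
            if s < cs: out.append((s, cs))
            if ce < e: out.append((ce, e))
    return out

class ByteRangeTokens:
    """Toy token table: node -> list of (start, end) ranges it may write."""
    def __init__(self):
        self.tokens = {}

    def request(self, node, start, end):
        start, end = round_to_blocks(start, end)
        if not self.tokens:
            # First writer gets the whole file, so its later writes need no
            # further token traffic.
            self.tokens[node] = [(0, float("inf"))]
            return self.tokens[node]
        # A second writer forces other holders to give up only the requested range.
        for other in self.tokens:
            if other != node:
                self.tokens[other] = subtract(self.tokens[other], (start, end))
        self.tokens.setdefault(node, []).append((start, end))
        return self.tokens[node]

tokens = ByteRangeTokens()
print(tokens.request("node1", 0, 100))                 # whole-file token
print(tokens.request("node2", 1_000_000, 2_000_000))   # node1 keeps the rest
```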

18 Synchronizing Access to File Metadata
- Multiple nodes writing to the same file means concurrent updates to the inode and indirect blocks
- Synchronizing every such update would be very expensive

19 Synchronizing Access to File Metadata
- GPFS uses a shared write lock on the inode; updates are merged by keeping the largest file size and the latest time stamp
- How do multiple nodes append to the same file concurrently? One node, elected dynamically, is responsible for updating the inode
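
A minimal sketch of merging inode updates from several writers under a shared write lock, assuming only the file size and modification time need reconciling; the field and function names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class InodeUpdate:
    size: int       # file size this node observed after its writes
    mtime: float    # modification time this node set

def merge_inode_updates(updates):
    """Size and timestamp updates commute: keeping the largest size and the
    latest mtime gives the same result as serializing the writers, so no
    exclusive inode lock is needed for ordinary concurrent writes."""
    return InodeUpdate(
        size=max(u.size for u in updates),
        mtime=max(u.mtime for u in updates),
    )

# Example: three nodes wrote to different byte ranges of the same file.
print(merge_inode_updates([
    InodeUpdate(size=1_048_576, mtime=100.0),
    InodeUpdate(size=2_097_152, mtime=101.5),
    InodeUpdate(size=1_572_864, mtime=101.0),
]))  # -> size 2 MB, mtime 101.5
```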

20 Allocation Maps
- Need 32 bits per block because a 256 KB block can be divided into 32 of the 8 KB subblocks
- The map is divided into n separately lockable regions; each region tracks 1/n of the blocks on every disk, so a node allocating from one region can still stripe across all disks
- Minimizes lock conflicts
- One node maintains the free-space statistics, which are updated periodically
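
A minimal sketch of splitting an allocation map into n lockable regions where each region covers 1/n of the blocks on every disk; the interleaved assignment is assumed for illustration:

```python
def region_of(disk, block, n_regions):
    """Assign block `block` of disk `disk` to a region by interleaving, so
    every region holds roughly 1/n of the blocks on *every* disk and a node
    allocating only from its own region can still stripe across all disks."""
    return block % n_regions

# Example: with 4 regions, region 0 owns blocks 0, 4, 8, ... of every disk.
n_regions, n_disks, blocks_per_disk = 4, 3, 12
region0 = [(d, b) for d in range(n_disks) for b in range(blocks_per_disk)
           if region_of(d, b, n_regions) == 0]
print(region0)
```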

21 Other File System Metadata
- Centralized management coordinates other metadata updates (e.g., the quota manager)

22 Token Manager Scaling
- File size is unbounded, so the number of byte-range tokens is also unbounded and can use up the entire memory
- The token manager needs to monitor and prevent unbounded growth: revoke tokens as necessary and reuse tokens freed by deleted files
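
A minimal sketch of bounding token-manager memory, assuming, purely for illustration, a least-recently-used revocation policy when a token count limit is reached (the slide only says tokens are revoked "as necessary"; the policy and names here are not GPFS's):

```python
from collections import OrderedDict

class BoundedTokenTable:
    """Toy token table capped at `limit` entries. When full, the least
    recently used token is revoked (its holder would be asked to give it up)
    to make room; tokens for deleted files are freed and their slots reused."""
    def __init__(self, limit):
        self.limit = limit
        self.tokens = OrderedDict()        # token id -> holder node

    def grant(self, token_id, node):
        if token_id in self.tokens:
            self.tokens.move_to_end(token_id)
        elif len(self.tokens) >= self.limit:
            victim, holder = self.tokens.popitem(last=False)
            print(f"revoke {victim} from {holder}")   # would message the holder
        self.tokens[token_id] = node

    def file_deleted(self, token_ids):
        for t in token_ids:
            self.tokens.pop(t, None)       # freed tokens can be reused
```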

23 Fault Tolerance
- Node failures
- Communication failures
- Disk failures

24 Node Failures
- Periodic heartbeat messages detect node failures
- Log recovery is run from surviving nodes
- The token manager releases tokens held by the failed node
- Other nodes can resend committed updates

25 Communication Failures
- After a network partition, continued operation on both sides could corrupt the file system
- The file system therefore remains accessible only to the group containing a majority of the nodes in the cluster
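
A minimal sketch of the majority (quorum) rule, assuming each partition knows the total cluster size; the function name is illustrative:

```python
def may_continue(nodes_in_my_group, cluster_size):
    """Only the partition holding a strict majority of the cluster may keep
    the file system accessible; any minority partition must stop, so two
    partitions can never both write and corrupt the file system."""
    return nodes_in_my_group > cluster_size // 2

print(may_continue(5, 8))   # True: 5 of 8 is a majority
print(may_continue(4, 8))   # False: exactly half is not a majority
```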

26 Disk Failures
- Dual-attached RAID controllers
- Files can be replicated

27 Scalable Online System Utilities
- Adding, deleting, and replacing disks
- Rebalancing the file system content
- Defragmentation, quota-check, fsck
- A file system manager coordinates administrative activities

28 Experiences
- Skewing of workloads: small management overhead can affect parallel applications in significant ways
- If one node slows down by 1%, a tightly synchronized parallel job on a 512-node cluster slows by 1%, which is the same as leaving about 5 nodes completely idle (512 × 0.01 ≈ 5)
- Need dedicated administrative nodes

29 Experiences
- Even the rarest failures can happen: data loss in a RAID, a bad batch of disk drives

30 Related Work
- Storage area network file systems: centralized metadata server
- SGI's XFS file system: not a clustered file system
- Frangipani, Global File System: do not support multiple accesses to the same file

31 Summary and Conclusions
- GPFS uses distributed locking and recovery
- Uses RAID and replication for reliability
- Scales up to the largest supercomputers in the world
- Provides fault tolerance and system management functions

