1 GPFS: A Shared-Disk File System for Large Computing Clusters
Frank Schmuck & Roger Haskin
IBM Almaden Research Center

2 Introduction
- Machines are getting more powerful, but we can always find bigger problems to solve
- Faster networks → machines can form clusters, promising a way to solve big problems
- GPFS (General Parallel File System) mimics the semantics of a POSIX file system running on a single machine
- Runs on 6 of the top 10 supercomputers

3 Introduction
- Web server workloads: multiple nodes access multiple files
- Supercomputer workloads: a single node can access a file stored on multiple nodes, and multiple nodes can access the same file stored on multiple nodes
- Need to access files and metadata in parallel
- Need to perform administrative functions in parallel

4 GPFS Overview
- All nodes access shared disks over a switching fabric (e.g., a storage area network)

5 General Large File System Issues
- Data striping and allocation, prefetch, and write-behind
- Large directory support
- Logging and recovery

6 Data Striping and Prefetch
- Striping implemented at the file system level: better control, fault tolerance, load balancing
- GPFS recognizes sequential, reverse-sequential, and various strided access patterns, and prefetches data accordingly
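
A minimal sketch of how file-system-level striping can map a file offset to a disk, assuming simple round-robin placement (the function name and the exact placement policy are illustrative, not the GPFS algorithm):

```python
BLOCK_SIZE = 256 * 1024  # full block size from the allocation slide (256 KB)

def block_location(file_offset, num_disks, first_disk=0):
    """Map a byte offset in a file to (disk index, logical block index),
    assuming simple round-robin striping across num_disks disks.
    first_disk lets different files start on different disks."""
    block_index = file_offset // BLOCK_SIZE           # which block of the file
    disk = (first_disk + block_index) % num_disks     # round-robin placement
    return disk, block_index

# Sequential access touches every disk in turn, so once a sequential pattern
# is recognized, prefetch requests can be issued to all disks in parallel.
for offset in range(0, 8 * BLOCK_SIZE, BLOCK_SIZE):
    print(offset, block_location(offset, num_disks=4))
```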

7 Allocation
- Large files are stored in 256 KB blocks; small files are stored in 8 KB subblocks
- Need to watch out for disks of different sizes:
  - Maximizing space utilization → larger disks receive more I/O requests and become a bottleneck
  - Maximizing parallel performance → space on larger disks goes under-utilized
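
A small worked example of the block/subblock split, assuming the 256 KB and 8 KB sizes from the slide and assuming that the tail of a file can be stored in subblocks (the function name is illustrative):

```python
BLOCK_SIZE = 256 * 1024
SUBBLOCK_SIZE = 8 * 1024                              # 32 subblocks per block

def space_used(file_size):
    """Bytes actually allocated: full 256 KB blocks for all but the tail,
    then just enough 8 KB subblocks for the remainder, so small files
    waste at most one subblock of space."""
    full_blocks, tail = divmod(file_size, BLOCK_SIZE)
    tail_subblocks = -(-tail // SUBBLOCK_SIZE)        # ceiling division
    return full_blocks * BLOCK_SIZE + tail_subblocks * SUBBLOCK_SIZE

print(space_used(5 * 1024))      # 5 KB file  -> one 8 KB subblock
print(space_used(300 * 1024))    # 300 KB file -> one full block + 6 subblocks
```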

8 Large Directory Support
- GPFS uses extensible hashing to support very large directories
- [Diagram: directory blocks of hashed entries such as 0100 | file1, 1001 | file2, 0011 | dir1, 1110 | file2_hardlink, with empty slots]
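
A minimal sketch of directory lookup with extensible hashing, assuming the low-order bits of a filename hash select the directory block; the class and helper names are illustrative, not the GPFS on-disk format, and this toy version rehashes everything on a split, whereas real extensible hashing splits only the overflowing block lazily:

```python
import hashlib

def name_hash(name: str) -> int:
    """Stable hash of a directory entry name."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest()[:4], "big")

class ExtensibleDirectory:
    """Toy extensible-hash directory: 2**depth blocks, each entry placed by
    the low-order `depth` bits of its name hash."""
    def __init__(self, block_capacity=2):
        self.depth = 0
        self.blocks = [dict()]                 # one block initially
        self.block_capacity = block_capacity

    def _block_for(self, name):
        return self.blocks[name_hash(name) & ((1 << self.depth) - 1)]

    def insert(self, name, inode):
        if len(self._block_for(name)) >= self.block_capacity:
            self._split()
        self._block_for(name)[name] = inode

    def _split(self):
        """Double the directory and rehash entries into 2**(depth+1) blocks."""
        self.depth += 1
        old = self.blocks
        self.blocks = [dict() for _ in range(1 << self.depth)]
        for block in old:
            for name, inode in block.items():
                self._block_for(name)[name] = inode

    def lookup(self, name):
        return self._block_for(name).get(name)

d = ExtensibleDirectory()
for i in range(8):
    d.insert(f"file{i}", inode=100 + i)
print(d.depth, d.lookup("file3"))
```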

9 Logging and Recovery
- In a large file system, there is no time to run fsck
- Metadata updates use journaling with a write-ahead log; data are not logged
- Each node has a separate log that can be read by all nodes, so any node can perform recovery on behalf of a failed node
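
A minimal sketch of write-ahead logging for metadata, using in-memory stand-ins for the shared disks; the record format and replay logic are illustrative only:

```python
import json

class MetadataJournal:
    """Toy write-ahead log: a metadata update is appended to the log (and
    would be forced to disk) before the metadata block itself is updated in
    place. Because every node's log lives on the shared disks, any surviving
    node can replay a failed node's log."""
    def __init__(self):
        self.log = []          # stand-in for this node's on-disk log
        self.metadata = {}     # stand-in for shared on-disk metadata blocks

    def update(self, key, value):
        record = {"key": key, "value": value}
        self.log.append(json.dumps(record))   # 1. append log record first
        self.metadata[key] = value            # 2. then update metadata in place

    @staticmethod
    def recover(log, metadata):
        """Replay the committed records from a failed node's log."""
        for line in log:
            record = json.loads(line)
            metadata[record["key"]] = record["value"]
        return metadata
```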

10 Managing Parallelism and Consistency in a Cluster

11 Distributed Locking vs. Centralized Management
- Goal: reading and writing in parallel from all nodes in the cluster
- Constraint: POSIX semantics require synchronizing access to data and metadata from multiple nodes
- If two processes on two nodes access the same file → a read on one node sees either all or none of the data written by a concurrent write

12 Distributed Locking vs. Centralized Management
- Two approaches to locking:
  - Distributed: consult with all other nodes before acquiring locks → greater parallelism
  - Centralized: consult with a designated node → better for frequently updated metadata

13 Lock Granularity
- Too small → high overhead
- Too large → many contending lock requests

14 The GPFS Distributed Lock Manager
- A centralized global lock manager on one node, plus a local lock manager on each node
- The global lock manager hands out lock tokens (the right to grant locks locally) to the local lock managers
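
A minimal sketch of the token idea, assuming a single global token-manager object and hypothetical class and method names; conflict resolution and fault tolerance are omitted:

```python
class GlobalTokenManager:
    """Toy token server: grants the token for a resource to one node at a
    time. While a node holds the token it can grant and release local locks
    on that resource without any further messages to the token manager."""
    def __init__(self):
        self.holders = {}                  # resource -> node holding its token

    def acquire(self, node, resource):
        holder = self.holders.get(resource)
        if holder is None or holder == node:
            self.holders[resource] = node
            return True                    # token granted
        # In GPFS the requester would ask the holder to relinquish the token;
        # here we simply report the conflict.
        return False

class LocalLockManager:
    def __init__(self, node, token_mgr):
        self.node, self.token_mgr = node, token_mgr
        self.tokens = set()

    def lock(self, resource):
        """Locks on a resource whose token we already hold are purely local."""
        if resource in self.tokens:
            return True
        if self.token_mgr.acquire(self.node, resource):
            self.tokens.add(resource)
            return True
        return False
```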

15 Parallel Data Access
- How can multiple nodes write to the same file?
- Byte-range locking synchronizes reads and writes while allowing concurrent writes to different parts of the same file

16 Byte-Range Tokens
- The first write request from a node acquires a token for the whole file → efficient for non-concurrent writes
- A second write request to the same file from another node revokes part of the byte-range token held by the first node
- Knowing the reference pattern helps predict how to break up the byte ranges

17 Byte-Range Tokens
- Byte ranges are rounded to block boundaries so that two nodes cannot modify the same block
- False sharing: a shared block is moved back and forth between nodes because each keeps updating a different part of it
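
A minimal sketch of byte-range token negotiation, assuming a whole-file token on first access, revocation of only the requested range from the previous holder, and rounding to 256 KB block boundaries; the data structures and helper names (ByteRangeTokens, subtract) are illustrative, not the GPFS protocol:

```python
BLOCK_SIZE = 256 * 1024

def round_to_blocks(start, end):
    """Round a byte range outward to block boundaries so that no block ends
    up split between two writers."""
    return (start // BLOCK_SIZE) * BLOCK_SIZE, -(-end // BLOCK_SIZE) * BLOCK_SIZE

def subtract(ranges, cut):
    """Remove the interval `cut` from a list of (start, end) intervals."""
    cs, ce = cut
    out = []
    for s, e in ranges:
        if e <= cs or s >= ce:
            out.append((s, e))            # no overlap, keep as-is
        else:
            if s < cs: out.append((s, cs))
            if ce < e: out.append((ce, e))
    return out

class ByteRangeTokens:
    """Toy token table: node -> list of (start, end) ranges it may write."""
    def __init__(self):
        self.tokens = {}

    def request(self, node, start, end):
        start, end = round_to_blocks(start, end)
        if not self.tokens:
            # First writer gets the whole file, so its later writes need no
            # further token traffic.
            self.tokens[node] = [(0, float("inf"))]
            return self.tokens[node]
        # A second writer forces other holders to give up only the requested range.
        for other in self.tokens:
            if other != node:
                self.tokens[other] = subtract(self.tokens[other], (start, end))
        self.tokens.setdefault(node, []).append((start, end))
        return self.tokens[node]

tokens = ByteRangeTokens()
print(tokens.request("node1", 0, 100))                 # whole-file token
print(tokens.request("node2", 1_000_000, 2_000_000))   # node1 keeps the rest
```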

18 Synchronizing Access to File Metadata
- Multiple nodes writing to the same file means concurrent updates to the inode and indirect blocks
- Synchronizing every such update would be very expensive

19 Synchronizing Access to File Metadata
- GPFS uses a shared write lock on the inode; updates are merged by keeping the largest file size and the latest time stamp
- How do multiple nodes append to the same file concurrently? One node, elected dynamically, is responsible for updating the inode
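
A minimal sketch of merging inode updates from several writers under a shared write lock, assuming only the file size and modification time need reconciling; the field and function names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class InodeUpdate:
    size: int       # file size this node observed after its writes
    mtime: float    # modification time this node set

def merge_inode_updates(updates):
    """Size and timestamp updates commute: keeping the largest size and the
    latest mtime gives the same result as serializing the writers, so no
    exclusive inode lock is needed for ordinary concurrent writes."""
    return InodeUpdate(
        size=max(u.size for u in updates),
        mtime=max(u.mtime for u in updates),
    )

# Example: three nodes wrote to different byte ranges of the same file.
print(merge_inode_updates([
    InodeUpdate(size=1_048_576, mtime=100.0),
    InodeUpdate(size=2_097_152, mtime=101.5),
    InodeUpdate(size=1_572_864, mtime=101.0),
]))  # -> size 2 MB, mtime 101.5
```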

20 Allocation Maps
- Need 32 bits per block because a 256 KB block can be divided into 32 of the 8 KB subblocks
- The map is divided into n separately lockable regions; each region tracks 1/n of the blocks on every disk, so a node allocating from one region can still stripe across all disks
- Minimizes lock conflicts
- One node maintains the free-space statistics, which are updated periodically
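
A minimal sketch of splitting an allocation map into n lockable regions where each region covers 1/n of the blocks on every disk; the interleaved assignment is assumed for illustration:

```python
def region_of(disk, block, n_regions):
    """Assign block `block` of disk `disk` to a region by interleaving, so
    every region holds roughly 1/n of the blocks on *every* disk and a node
    allocating only from its own region can still stripe across all disks."""
    return block % n_regions

# Example: with 4 regions, region 0 owns blocks 0, 4, 8, ... of every disk.
n_regions, n_disks, blocks_per_disk = 4, 3, 12
region0 = [(d, b) for d in range(n_disks) for b in range(blocks_per_disk)
           if region_of(d, b, n_regions) == 0]
print(region0)
```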

21 Other File System Metadata
- Centralized management coordinates other metadata updates (e.g., the quota manager)

22 Token Manager Scaling
- File size is unbounded, so the number of byte-range tokens is also unbounded and can use up the entire memory
- The token manager needs to monitor and prevent unbounded growth: revoke tokens as necessary and reuse tokens freed by deleted files
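
A minimal sketch of bounding token-manager memory, assuming, purely for illustration, a least-recently-used revocation policy when a token count limit is reached (the slide only says tokens are revoked "as necessary"; the policy and names here are not GPFS's):

```python
from collections import OrderedDict

class BoundedTokenTable:
    """Toy token table capped at `limit` entries. When full, the least
    recently used token is revoked (its holder would be asked to give it up)
    to make room; tokens for deleted files are freed and their slots reused."""
    def __init__(self, limit):
        self.limit = limit
        self.tokens = OrderedDict()        # token id -> holder node

    def grant(self, token_id, node):
        if token_id in self.tokens:
            self.tokens.move_to_end(token_id)
        elif len(self.tokens) >= self.limit:
            victim, holder = self.tokens.popitem(last=False)
            print(f"revoke {victim} from {holder}")   # would message the holder
        self.tokens[token_id] = node

    def file_deleted(self, token_ids):
        for t in token_ids:
            self.tokens.pop(t, None)       # freed tokens can be reused
```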

23 Fault Tolerance
- Node failures
- Communication failures
- Disk failures

24 Node Failures
- Periodic heartbeat messages detect node failures
- Log recovery is run from surviving nodes
- The token manager releases tokens held by the failed node
- Other nodes can resend committed updates

25 Communication Failures
- After a network partition, continued operation on both sides could corrupt the file system
- The file system therefore remains accessible only to the group containing a majority of the nodes in the cluster
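
A minimal sketch of the majority (quorum) rule, assuming each partition knows the total cluster size; the function name is illustrative:

```python
def may_continue(nodes_in_my_group, cluster_size):
    """Only the partition holding a strict majority of the cluster may keep
    the file system accessible; any minority partition must stop, so two
    partitions can never both write and corrupt the file system."""
    return nodes_in_my_group > cluster_size // 2

print(may_continue(5, 8))   # True: 5 of 8 is a majority
print(may_continue(4, 8))   # False: exactly half is not a majority
```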

26 Disk Failures
- Dual-attached RAID controllers
- Files can be replicated

27 Scalable Online System Utilities
- Adding, deleting, and replacing disks
- Rebalancing the file system content
- Defragmentation, quota-check, fsck
- A file system manager coordinates administrative activities

28 Experiences
- Skewing of workloads: small management overhead can affect parallel applications in significant ways
- If one node slows down by 1%, a tightly synchronized parallel job on a 512-node cluster slows by 1%, which is the same as leaving about 5 nodes completely idle (512 × 0.01 ≈ 5)
- Need dedicated administrative nodes

29 Experiences
- Even the rarest failures can happen: data loss in a RAID, a bad batch of disk drives

30 Related Work
- Storage area network file systems: centralized metadata server
- SGI's XFS file system: not a clustered file system
- Frangipani, Global File System: do not support multiple accesses to the same file

31 Summary and Conclusions
- GPFS uses distributed locking and recovery
- Uses RAID and replication for reliability
- Scales up to the largest supercomputers in the world
- Provides fault tolerance and system management functions

