
1  Architectural and Design Issues in the General Parallel File System
IBM Research Lab in Haifa
Benny Mandler - mandler@il.ibm.com
May 12, 2002

2  Agenda
- What is GPFS? - a file system for deep computing
- GPFS uses
- General architecture
- How does GPFS meet its challenges - architectural issues:
  - performance
  - scalability
  - high availability
  - concurrency control

3  Scalable Parallel Computing (What is GPFS?)
RS/6000 SP Scalable Parallel Computer:
- 1-512 nodes connected by a high-speed switch
- 1-16 CPUs per node (Power2 or PowerPC)
- >1 TB disk per node
- 500 MB/s full duplex per switch port
Scalable parallel computing enables I/O-intensive applications:
- Deep computing - simulation, seismic analysis, data mining
- Server consolidation - aggregating file and web servers onto a centrally managed machine
- Streaming video and audio for multimedia presentations
- Scalable object store for large digital libraries, web servers, databases, ...

4 HRLHRL  High Performance - multiple GB/s to/from a single file  concurrent reads and writes, parallel data access - within a file and across files Support fully parallel access both to file data and metadata  client caching enabled by distributed locking  wide striping, large data blocks, prefetch  Scalability  scales up to 512 nodes (N-Way SMP). Storage nodes, file system nodes, disks, adapters...  High Availability  fault-tolerance via logging, replication, RAID support  survives node and disk failures  Uniform access via shared disks - Single image file system  High capacity multiple TB per file system, 100s of GB per file.  Standards compliant (X/Open 4.0 "POSIX") with minor exceptions GPFS addresses SP I/O requirements What is GPFS?

5  GPFS vs. local and distributed file systems on the SP2
Native AIX File System (JFS):
- No file sharing - an application can only access files on its own node
- Applications must do their own data partitioning
DCE Distributed File System (successor to AFS):
- Application nodes (DCE clients) share files on a server node
- The switch is used as a fast LAN
- Coarse-grained (file- or segment-level) parallelism
- The server node is a performance and capacity bottleneck
GPFS Parallel File System:
- GPFS file systems are striped across multiple disks on multiple storage nodes
- Independent GPFS instances run on each application node
- GPFS instances use storage nodes as "block servers" - all instances can access all disks

6 HRLHRL  Video on Demand for new "borough" of Tokyo  Applications: movies, news, karaoke, education...  Video distribution via hybrid fiber/coax  Trial "live" since June '96  Currently 500 subscribers  6 Mbit/sec MPEG video streams  100 simultaneous viewers (75 MB/sec)  200 hours of video on line (700 GB)  12-node SP-2 (7 distribution, 5 storage) Tokyo Video on Demand Trial

7  Engineering Design (GPFS uses)
- Major aircraft manufacturer
- Uses CATIA for large designs and Elfini for structural modeling and analysis
- The SP is used for modeling/analysis
- GPFS is used to store CATIA designs and structural modeling data
- GPFS allows all nodes to share designs and models

8 HRLHRL  File systems consist of one or more shared disks ? Individual disk can contain data, metadata, or both ? Disks are designated to failure group ? Data and metadata are striped to balance load and maximize parallelism  Recoverable Virtual Shared Disk for accessing disk storage ? Disks are physically attached to SP nodes ? VSD allows clients to access disks over the SP switch ? VSD client looks like disk device driver on client node ? VSD server executes I/O requests on storage node. ? VSD supports JBOD or RAID volumes, fencing, multi- pathing (where physical hardware permits)  GPFS only assumes a conventional block I/O interface Shared Disks - Virtual Shared Disk architecture General architecture

9 HRLHRL  Implications of Shared Disk Model ? All data and metadata on globally accessible disks (VSD) ? All access to permanent data through disk I/O interface ? Distributed protocols, e.g., distributed locking, coordinate disk access from multiple nodes ? Fine-grained locking allows parallel access by multiple clients ? Logging and Shadowing restore consistency after node failures  Implications of Large Scale ? Support up to 4096 disks of up to 1 TB each (4 Petabytes) The largest system in production is 75 TB ? Failure detection and recovery protocols to handle node failures ? Replication and/or RAID protect against disk / storage node failure ? On-line dynamic reconfiguration (add, delete, replace disks and nodes; rebalance file system) GPFS Architecture Overview General architecture

10 HRLHRL  Three types of nodes: file system, storage, and manager ? Each node can perform any of these functions ? File system nodes  run user programs, read/write data to/from storage nodes  implement virtual file system interface  cooperate with manager nodes to perform metadata operations ? Manager nodes (one per “file system”)  global lock manager  recovery manager  global allocation manager  quota manager  file metadata manager  admin services fail over ? Storage nodes  implement block I/O interface  shared access from file system and manager nodes  interact with manager nodes for recovery (e.g. fencing)  file data and metadata striped across multiple disks on multiple storage nodes GPFS Architecture - Node Roles General architecture

11  GPFS Software Structure (General architecture) - diagram slide

12 HRLHRL  Large block size allows efficient use of disk bandwidth  Fragments reduce space overhead for small files  No designated "mirror", no fixed placement function:  Flexible replication (e.g., replicate only metadata, or only important files)  Dynamic reconfiguration: data can migrate block-by-block  Multi level indirect blocks ?Each disk address: list of pointers to replicas ?Each pointer: disk id + sector no. Disk Data Structures: Files General architecture

13 HRLHRL  Conventional file systems store data in small blocks to pack data more densely  GPFS uses large blocks (256KB default) to optimize disk transfer speed Large File Block Size Performance

14  Parallelism and consistency
- Distributed locking - acquire the appropriate lock for every operation; used for updates to user data
- Centralized management - conflicting operations are forwarded to a designated node; used for file metadata
- Distributed locking + centralized hints - used for space allocation
- Central coordinator - used for configuration changes
- The effect of slowdowns is additional I/O activity rather than token-server overload

15 HRLHRL  GPFS allows parallel applications on multiple nodes to access non- overlapping ranges of a single file with no conflict  Global locking serializes access to overlapping ranges of a file  Global locking based on "tokens" which convey access rights to an object (e.g. a file) or subset of an object (e.g. a byte range)  Tokens can be held across file system operations, enabling coherent data caching in clients  Cached data discarded or written to disk when token is revoked  Performance optimizations: required/desired ranges, metanode, data shipping, special token modes for file size operations Parallel File Access From Multiple Nodes Performance

16 HRLHRL  GPFS stripes successive blocks across successive disks  Disk I/O for sequential reads and writes is done in parallel  GPFS measures application "think time",disk throughput, and cache state to automatically determine optimal parallelism  Prefetch algorithms now recognize strided and reverse sequential access  Accepts hints  Write-behind policy Application reads at 15 MB/sec Each disk reads at 5 MB/sec Three I/Os executed in parallel Deep Prefetch for High Throughput Performance

17 HRLHRL  Hardware: Power2 wide nodes, SSA disks  Experiment: sequential read/write from large number of GPFS nodes to varying number of storage nodes  Result: throughput increases nearly linearly with number of storage nodes  Bottlenecks: ? microchannel limits node throughput to 50MB/s ? system throughput limited by available storage nodes GPFS Throughput Scaling for Non-cached Files Scalability

18 HRLHRL  Segmented Block Allocation MAP:  Each segment contains bits representing blocks on all disks  Each segment is a separately lockable unit  Minimizes contention for allocation map when writing files on multiple nodes  Allocation manager service provides hints which segments to try Similar: inode allocation map Disk Data Structures: Allocation map Scalability

19 HRLHRL  Problem: detect/fix file system inconsistencies after a failure of one or more nodes ? All updates that may leave inconsistencies if uncompleted are logged ? Write-ahead logging policy: log record is forced to disk before dirty metadata is written ? Redo log: replaying all log records at recovery time restores file system consistency  Logged updates: ? I/O to replicated data ? directory operations (create, delete, move,...) ? allocation map changes  Other techniques: ? ordered writes ? shadowing High Availability - Logging and Recovery High Availability

20 HRLHRL  Application node failure: ? force-on-steal policy ensures that all changes visible to other nodes have been written to disk and will not be lost ? all potential inconsistencies are protected by a token and are logged ? file system manager runs log recovery on behalf of the failed node after successful log recovery tokens held by the failed node are released ? actions taken: restore metadata being updated by the failed node to a consistent state, release resources held by the failed node  File system manager failure: ? new node is appointed to take over ? new file system manager restores volatile state by querying other nodes ? New file system manager may have to undo or finish a partially completed configuration change (e.g., add/delete disk)  Storage node failure: ? Dual-attached disk: use alternate path (VSD) ? Single attached disk: treat as disk failure Node Failure Recovery High Availability

21 HRLHRL  When a disk failure is detected ? The node that detects the failure informs the file system manager ? File system manager updates the configuration data to mark the failed disk as "down" (quorum algorithm)  While a disk is down ? Read one / write all available copies ? "Missing update" bit set in the inode of modified files  When/if disk recovers ? File system manager searches inode file for missing update bits ? All data & metadata of files with missing updates are copied back to the recovering disk (one file at a time, normal locking protocol) ? Until missing update recovery is complete, data on the recovering disk is treated as write-only  Unrecoverable disk failure ? Failed disk is deleted from configuration or replaced by a new one ? New replicas are created on the replacement or on other disks Handling Disk Failures

22  Cache Management
- The total cache is divided into a general pool (clock list, merge, re-map) and per-block-size pools (each with its own clock list)
- Per-pool statistics track sequential vs. random and optimal vs. total usage
- Pools are balanced dynamically according to usage patterns
- Avoid fragmentation - internal and external
- Unified steal
- Periodic re-balancing
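The sketch below shows a clock-list ("second chance") steal loop of the kind the per-pool clock lists above suggest; the structure and policy details are illustrative, not GPFS's buffer manager.

```c
/* Clock-based buffer steal within one block-size pool (illustrative). */
#include <stdbool.h>
#include <stddef.h>

typedef struct cache_buf {
    struct cache_buf *next;     /* circular clock list within one pool */
    bool              referenced;
    bool              dirty;
    void             *data;
} cache_buf_t;

typedef struct {
    cache_buf_t *clock_hand;    /* current position in the circular list */
    size_t       block_size;    /* every buffer in this pool has this size */
} cache_pool_t;

void write_back(cache_buf_t *buf);  /* hypothetical write-behind hook */

/* Advance the clock hand until a victim is found: recently referenced
 * buffers get a second chance, dirty ones are written back first. */
cache_buf_t *pool_steal(cache_pool_t *pool)
{
    for (;;) {
        cache_buf_t *buf = pool->clock_hand;
        pool->clock_hand = buf->next;

        if (buf->referenced) {
            buf->referenced = false;   /* second chance */
            continue;
        }
        if (buf->dirty)
            write_back(buf);
        return buf;                    /* victim: reuse this buffer */
    }
}
```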

23  Epilogue
- Used on six of the ten most powerful supercomputers in the world, including the largest (ASCI White)
- Installed at several hundred customer sites, on clusters ranging from a few nodes with less than a TB of disk up to 512 nodes with 140 TB of disk in 2 file systems
- IP rich - ~20 filed patents
- State of the art: TeraSort
  - world record of 17 minutes
  - using a 488-node SP: 432 file system nodes and 56 storage nodes (604e, 332 MHz)
  - 6 TB of total disk space
- References
  - GPFS home page: http://www.haifa.il.ibm.com/projects/storage/gpfs.html
  - FAST 2002: http://www.usenix.org/events/fast/schmuck.html
  - TeraSort: http://www.almaden.ibm.com/cs/gpfs-spsort.html
  - Tiger Shark: http://www.research.ibm.com/journal/rd/422/haskin.html

