GPFS: A Shared-Disk File System for Large Computing Clusters Frank Schmuck & Roger Haskin IBM Almaden Research Center

Introduction Machines are getting more powerful, but we can always find bigger problems to solve. Faster networks let machines form clusters that promise to solve those big problems. GPFS (General Parallel File System) mimics the semantics of a POSIX file system running on a single machine and runs on 6 of the 10 most powerful supercomputers.

Introduction Web server workloads: multiple nodes access multiple files. Supercomputer workloads: a single node can access a file striped across multiple disks, and multiple nodes can access the same striped file. Need to access files and metadata in parallel, and to perform administrative functions in parallel.

GPFS Overview Nodes access shared disks through a switching fabric (e.g., a storage area network).

General Large File System Issues Data striping and allocation, prefetch, and write-behind Large directory support Logging and recovery

Data Striping and Prefetch Striping is implemented at the file system level for better control, fault tolerance, and load balancing. GPFS recognizes sequential, reverse sequential, and various strided access patterns and prefetches data accordingly.
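
As a rough illustration (not GPFS internals; the class name, history length, and prediction rule below are assumptions), a prefetcher can detect a constant stride from recent block accesses and predict the next block to fetch:

```python
# A sketch of stride detection for prefetching; class name, history length,
# and prediction rule are illustrative, not GPFS internals.

class AccessPatternDetector:
    """Guess the next block to prefetch from recently accessed block numbers."""

    def __init__(self, history=4):
        self.history = history
        self.recent = []                  # most recent block numbers accessed

    def record(self, block_no):
        self.recent.append(block_no)
        self.recent = self.recent[-self.history:]

    def predict_next(self):
        """Return a block number worth prefetching, or None if no pattern is seen."""
        if len(self.recent) < 2:
            return None
        strides = [b - a for a, b in zip(self.recent, self.recent[1:])]
        if len(set(strides)) == 1 and strides[0] != 0:
            # Covers sequential (+1), reverse sequential (-1), and strided access.
            return self.recent[-1] + strides[0]
        return None

detector = AccessPatternDetector()
for blk in (10, 12, 14, 16):              # strided reads, stride 2
    detector.record(blk)
print(detector.predict_next())            # -> 18, a candidate for prefetch
```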

Allocation Large files are stored in 256 KB blocks; small files (and the tails of large files) are stored in 8 KB subblocks. Need to watch out for disks of different sizes: maximizing space utilization sends more I/O requests to the larger disks, making them a bottleneck, while maximizing parallel performance leaves space on the larger disks underutilized.
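
A quick sketch of the block/subblock arithmetic implied above, using the 256 KB and 8 KB sizes from this slide; the helper function is hypothetical:

```python
# Block/subblock arithmetic from the slide: 256 KB blocks, 8 KB subblocks
# (32 subblocks per block). The helper name is hypothetical.

BLOCK = 256 * 1024
SUBBLOCK = 8 * 1024

def allocation_for(file_size):
    """Return (full_blocks, subblocks) needed to store file_size bytes."""
    full_blocks, tail = divmod(file_size, BLOCK)
    subblocks = -(-tail // SUBBLOCK) if tail else 0   # ceiling division
    return full_blocks, subblocks

print(allocation_for(50 * 1024))      # small file  -> (0, 7): seven 8 KB subblocks
print(allocation_for(1024 * 1024))    # large file  -> (4, 0): four full blocks
```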

Large Directory Support GPFS uses extensible hashing to support very large directories. [Diagram: directory blocks selected by hash-bit prefixes such as 0100 and 0011, holding entries like file1, file2, a directory, and a hardlink to file2, alongside still-empty blocks.]
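
A minimal sketch of the extensible-hashing idea, assuming the low-order bits of a filename's hash select the directory block; the hash function and layout are illustrative, not GPFS's on-disk format:

```python
# A toy extensible-hash directory: the low-order bits of a name's hash pick
# the directory block, and doubling the directory (depth + 1) only requires
# splitting one block at a time. Hash function and layout are illustrative.

def dir_block_for(name, depth):
    """Index of the directory block holding 'name' when there are 2**depth blocks."""
    return hash(name) & ((1 << depth) - 1)

print(dir_block_for("file1", depth=2))   # one of blocks 0..3
print(dir_block_for("file1", depth=3))   # one of blocks 0..7 after a split
```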

Logging and Recovery In a large file system there is no time to run fsck. GPFS uses journaling with a write-ahead log for metadata; data are not logged. Each node has a separate log that can be read by all nodes, so any node can perform recovery on behalf of a failed node.
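
A minimal write-ahead-logging sketch of this scheme; the class, record format, and in-memory log stand-in are assumptions:

```python
# A write-ahead-logging sketch: a metadata update is appended to the per-node
# log before the metadata block itself is written, so a surviving node can
# replay the log of a failed node. Record format and names are illustrative.

import json

class MetadataLog:
    def __init__(self):
        self.records = []                  # stands in for the on-disk log

    def append(self, record):
        # Write-ahead rule: force the log record to disk first...
        self.records.append(json.dumps(record))
        # ...only then may the metadata block be updated in place.

    def replay(self, apply_fn):
        """Recovery: redo logged updates on behalf of a failed node."""
        for raw in self.records:
            apply_fn(json.loads(raw))

log = MetadataLog()
log.append({"inode": 42, "op": "set_size", "value": 4096})
log.replay(lambda rec: print("redo:", rec))
```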

Managing Parallelism and Consistency in a Cluster

Distributed Locking vs. Centralized Management Goal: reading and writing in parallel from all nodes in the cluster. Constraint: POSIX semantics, which require synchronizing access to data and metadata from multiple nodes. If two processes on two nodes access the same file, a read on one node must see either all or none of the data written by a concurrent write.

Distributed Locking vs. Centralized Management Two approaches to locking: Distributed, where a node coordinates with the other nodes to acquire a lock before each operation, giving greater parallelism; and Centralized, where conflicting operations are forwarded to a designated node, which is better for frequently updated metadata.

Lock Granularity Too small: high overhead. Too large: many contending lock requests.

The GPFS Distributed Lock Manager A centralized global lock manager on one node works with local lock managers on each node. The global lock manager hands out lock tokens (the right to grant locks locally) to the local lock managers.
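
A minimal sketch of the token protocol described above: the global manager hands a token to the first requester, later lock operations on that object are served locally, and a conflicting request forces a revocation. Class names and message handling are illustrative:

```python
# Token-based distributed locking sketch: once a node holds a token it can
# grant further locks on that object locally without more messages.
# Names and message handling are illustrative, not the GPFS implementation.

class GlobalTokenManager:
    def __init__(self):
        self.holder = {}                      # object -> Node holding its token

    def acquire(self, node, obj):
        owner = self.holder.get(obj)
        if owner is not None and owner is not node:
            # Ask the current holder to give the token up
            # (a real system would first flush dirty data covered by it).
            owner.tokens.discard(obj)
            print(f"revoked token for {obj} from {owner.name}")
        self.holder[obj] = node

class Node:
    def __init__(self, name, gtm):
        self.name, self.gtm = name, gtm
        self.tokens = set()

    def lock(self, obj):
        if obj not in self.tokens:            # first access: one message to the GTM
            self.gtm.acquire(self, obj)
            self.tokens.add(obj)
        # Subsequent lock/unlock calls on obj are purely local.

gtm = GlobalTokenManager()
a, b = Node("A", gtm), Node("B", gtm)
a.lock("fileX")     # A obtains the token from the global manager
a.lock("fileX")     # served locally, no message traffic
b.lock("fileX")     # token is revoked from A and handed to B
```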

Parallel Data Access How to write to the same file from multiple nodes? Byte-range locking to synchronize reads and writes Allows concurrent writes to different parts of the same file

Byte-Range Tokens The first write request from one node acquires a token for the whole file, which is efficient for non-concurrent writes. A second write request to the same file from another node revokes part of the byte-range token held by the first node. Knowing the reference pattern helps predict how to break up the byte ranges.

Byte-Range Tokens Byte ranges are rounded to block boundaries so that two nodes cannot modify the same block, avoiding false sharing: a shared block being frequently shipped between nodes due to updates.
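
A minimal sketch of byte-range token negotiation with block rounding, reusing the 256 KB block size from an earlier slide; the data structures are illustrative:

```python
# Byte-range token negotiation with block rounding: the first writer holds a
# token for the whole file; a second writer takes only the range it needs,
# rounded outward to 256 KB block boundaries. Data structures are illustrative.

BLOCK = 256 * 1024
INF = float("inf")

def round_range(start, end):
    """Round a byte range outward to block boundaries."""
    lo = (start // BLOCK) * BLOCK
    hi = INF if end == INF else -(-end // BLOCK) * BLOCK
    return lo, hi

def split_token(held, needed):
    """Shrink the held token (lo, hi) so it no longer covers 'needed';
    return the ranges the original holder keeps."""
    nlo, nhi = round_range(*needed)
    hlo, hhi = held
    kept = []
    if hlo < nlo:
        kept.append((hlo, nlo))
    if nhi < hhi:
        kept.append((nhi, hhi))
    return kept

# Node A wrote first and holds (0, inf); node B now writes bytes 1 MB..2 MB.
print(split_token((0, INF), (1 * 1024 * 1024, 2 * 1024 * 1024)))
# -> A keeps [(0, 1048576), (2097152, inf)]; B receives the middle range.
```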

Synchronizing Access to File Metadata Multiple nodes writing to the same file cause concurrent updates to the inode and indirect blocks; synchronizing every such update would be very expensive.

Synchronizing Access to File Metadata GPFS uses a shared write lock on the inode and merges the updates by taking the largest file size and the latest timestamp. How do multiple nodes append to the same file concurrently? One node (the metanode) is responsible for updating the inode; it is elected dynamically.
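
A minimal sketch of merging inode attributes updated under the shared write lock: take the largest size and the latest mtime across writers. The field names are assumptions:

```python
# Merging inode attributes updated under a shared write lock: take the
# largest file size and the latest mtime across writers. Field names are
# illustrative.

def merge_inode_updates(updates):
    """updates: per-node dicts with 'size' and 'mtime' for the same inode."""
    return {
        "size": max(u["size"] for u in updates),
        "mtime": max(u["mtime"] for u in updates),
    }

print(merge_inode_updates([
    {"size": 4096, "mtime": 1000.0},     # node 1 wrote near the start of the file
    {"size": 65536, "mtime": 1002.5},    # node 2 later extended the file
]))
# -> {'size': 65536, 'mtime': 1002.5}
```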

Allocation Maps Need 32 bits per block because of the 32 subblocks per block. The map is divided into n separate lockable regions, each tracking 1/n of the blocks on every disk, so allocations remain striped across all disks while lock conflicts are minimized. One node maintains the free space statistics, which are periodically updated.
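
A minimal sketch of the regioned allocation map: each of n lockable regions covers 1/n of the blocks on every disk, so one region suffices to stripe an allocation across all disks. The mapping function is illustrative:

```python
# Regioned allocation map: each of n lockable regions covers 1/n of the
# blocks on *every* disk, so one region suffices to stripe an allocation
# across all disks. The mapping function below is illustrative.

def region_of(disk, block_on_disk, n_regions):
    """Assign each (disk, block) pair to one of n_regions map regions."""
    return block_on_disk % n_regions

n_regions, n_disks = 4, 3
for r in range(n_regions):
    blocks = [(d, b) for d in range(n_disks) for b in range(8)
              if region_of(d, b, n_regions) == r]
    print(f"region {r} covers blocks on all {n_disks} disks: {blocks[:3]} ...")
```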

Other File System Metadata Centralized management coordinates the remaining metadata updates, e.g., the quota manager.

Token Manager Scaling File size is unbounded, so the number of byte-range tokens is also unbounded and can use up the entire memory. The token manager needs to monitor and prevent unbounded growth: it revokes tokens as necessary and reuses tokens freed by deleted files.
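
A minimal sketch of bounding the token manager's memory with LRU revocation; the limit and eviction policy are assumptions, not the actual GPFS policy:

```python
# Bounding token-manager memory: tokens are kept in LRU order and the oldest
# are revoked once a limit is exceeded. Limit and policy are assumptions.

from collections import OrderedDict

class TokenCache:
    def __init__(self, limit):
        self.limit = limit
        self.tokens = OrderedDict()        # token -> holding node, LRU order

    def grant(self, token, node):
        self.tokens[token] = node
        self.tokens.move_to_end(token)     # mark as most recently used
        while len(self.tokens) > self.limit:
            victim, holder = self.tokens.popitem(last=False)
            print(f"revoking {victim} from {holder}")   # keeps growth bounded

cache = TokenCache(limit=2)
cache.grant("byte-range(fileA)", "node1")
cache.grant("byte-range(fileB)", "node2")
cache.grant("byte-range(fileC)", "node3")   # evicts the fileA token
```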

Fault Tolerance Node failures Communication failures Disk failures

Node Failures Periodic heartbeat messages detect node failures. A surviving node runs log recovery on behalf of the failed node, the token manager releases the tokens held by the failed node, and other nodes can then resend committed updates.

Communication Failures A network partition: continued operation on both sides could result in a corrupted file system, so the file system remains accessible only to the group containing a majority of the nodes in the cluster.
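
The majority rule amounts to a one-line check; a minimal sketch:

```python
# Quorum check: only the partition holding a strict majority of the cluster
# may keep the file system mounted.

def may_continue(group_size, cluster_size):
    return group_size > cluster_size // 2

print(may_continue(5, 8))   # True: the 5-node side keeps running
print(may_continue(4, 8))   # False: a 4/4 split has no majority; both sides stop
```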

Disk Failures Dual-attached RAID controllers; files can also be replicated.

Scalable Online System Utilities Adding, deleting, and replacing disks; rebalancing the file system content; defragmentation, quota-check, and fsck. A file system manager coordinates these administrative activities.

Experiences Workload skew: a small management overhead can affect parallel applications in significant ways. If one node slows down by 1%, every node waiting on it loses 1%, which is the same as leaving about 5 nodes completely idle in a 512-node cluster. Dedicated administrative nodes are needed.
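
The worked arithmetic behind the 1% claim, assuming a bulk-synchronous job in which every step waits for the slowest node:

```python
# Worked arithmetic for the 1% claim: one slow node stretches every
# synchronized step by 1%, wasting about 1% of the cluster's compute.
nodes, slowdown = 512, 0.01
print(nodes * slowdown)     # ~5.1 node-equivalents left idle
```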

Experiences Even the rarest failures can happen: data loss in a RAID, a bad batch of disk drives.

Related Work Storage-area-network file systems rely on a centralized metadata server. SGI's XFS is not a clustered file system. Frangipani and the Global File System do not support fine-grained concurrent access to the same file from multiple nodes.

Summary and Conclusions GPFS uses distributed locking and recovery, uses RAID and replication for reliability, can scale up to the largest supercomputers in the world, and provides fault tolerance and system management functions.