The Google File System (GFS)

The Google File System (GFS)
CS60002: Distributed Systems
Antonio Bruto da Costa, Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur

Design Constraints
Motivating application: Google's workloads.
Component failures are the norm: thousands of inexpensive servers and clients; bugs, human errors, and failures of memory, disks, connectors, networking, and power supplies. Constant monitoring, error detection, fault tolerance, and automatic recovery are integral to the system.
Files are huge by traditional standards: multi-GB files are common, and each file contains many application objects such as web documents; billions of objects overall.

Design Constraints (contd.)
Most modifications are appends; random writes are practically nonexistent. Many files are written once and then read sequentially.
Types of reads: data-analysis programs reading large repositories, large streaming reads (read once), and small random reads (in the forward direction).
Sustained bandwidth matters more than latency.

Interface Design
Familiar file system interface: create, delete, open, close, read, and write. Files are organized hierarchically in directories and identified by path names.
Additional operations: snapshot (for cloning files or directories) and record append, which lets multiple clients append to the same file concurrently with guaranteed atomicity but without locking.
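To make this surface concrete, here is a minimal sketch of what such an interface could look like to an application; the class and method names are illustrative, not the actual GFS client library.

```python
# Hypothetical sketch of the client-facing interface described above.
# Names and signatures are illustrative, not the real GFS client library.
class GFSClient:
    def create(self, path: str) -> None: ...
    def delete(self, path: str) -> None: ...
    def open(self, path: str) -> "FileHandle": ...
    def close(self, handle: "FileHandle") -> None: ...
    def read(self, handle: "FileHandle", offset: int, length: int) -> bytes: ...
    def write(self, handle: "FileHandle", offset: int, data: bytes) -> None: ...

    # Additional GFS operations beyond the standard interface:
    def snapshot(self, src_path: str, dst_path: str) -> None:
        """Cheaply clone a file or directory tree."""

    def record_append(self, handle: "FileHandle", data: bytes) -> int:
        """Atomically append `data` at least once; returns the offset GFS
        chose for the record (the client does not pick the offset)."""
```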

Architectural Design
A GFS cluster: a single master and multiple chunkservers per master, accessed by multiple clients, all running on commodity Linux machines.
A file is represented as fixed-size chunks, each labeled with a 64-bit unique and immutable global ID, stored on chunkservers' local disks as Linux files, and replicated across chunkservers for reliability.

GFS Architecture (architecture diagram slide)

Master Server
Maintains all metadata: namespace, access control information, and file-to-chunk mappings.
Handles chunk lease management, garbage collection, and chunk migration.
Sends heartbeat messages to chunkservers to give instructions and collect state information.

GFS Clients
Client code is linked into applications; it communicates with the master and the chunkservers on behalf of the application.
Clients consult the master for metadata and access data directly from chunkservers.
No caching of file data at clients or chunkservers: streaming reads and append writes do not benefit from caching.

Single-Master Design
The master can use global knowledge to make efficient decisions about chunk placement and replication.
Clients ask the master which chunkservers to contact; this information is cached at the client for some time.
A client typically asks for multiple chunk locations in a single request, and the master proactively includes locations of chunks immediately following those requested, reducing future interactions at almost no extra cost.
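A small sketch of the client-side location cache described above, assuming a hypothetical `master_rpc` callable standing in for the master RPC; the TTL and prefetch count are illustrative parameters, not values from the paper.

```python
import time

class ChunkLocationCache:
    """Client-side cache of chunk locations. `master_rpc(filename,
    first_chunk_index, count)` is a hypothetical helper that returns
    {chunk_index: [chunkserver, ...]}."""

    def __init__(self, master_rpc, ttl_seconds=60.0, prefetch=4):
        self.master_rpc = master_rpc
        self.ttl = ttl_seconds        # how long cached locations stay fresh
        self.prefetch = prefetch      # ask for a few chunks beyond the one needed
        self._cache = {}              # (filename, chunk_index) -> (expires_at, replicas)

    def locate(self, filename, chunk_index):
        key = (filename, chunk_index)
        entry = self._cache.get(key)
        if entry and entry[0] > time.time():
            return entry[1]           # cache hit: no master interaction needed
        # Cache miss: one request covers several consecutive chunks, and the
        # master may proactively include locations for chunks that follow.
        reply = self.master_rpc(filename, chunk_index, self.prefetch)
        expires = time.time() + self.ttl
        for idx, replicas in reply.items():
            self._cache[(filename, idx)] = (expires, replicas)
        return self._cache[key][1]
```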

Chunk Size
64 MB, much larger than a typical file system block size.
Fewer chunk-location requests to the master; good for applications that mostly read and write large files sequentially.
Reduced network overhead per chunk access, since a client can keep a persistent TCP connection to the chunkserver for some time.
Fewer metadata entries, all of which can be kept in memory at the master.
Downside: some potential problems with internal fragmentation.
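Because chunks have a fixed 64 MB size, translating a byte range within a file into chunk indices is simple arithmetic, as in this small illustration:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size

def byte_range_to_chunks(offset, length):
    """Map a byte range within a file to the chunk indices it spans,
    plus the offset inside the first chunk."""
    first_chunk = offset // CHUNK_SIZE
    last_chunk = (offset + length - 1) // CHUNK_SIZE
    offset_in_first = offset % CHUNK_SIZE
    return first_chunk, last_chunk, offset_in_first

# Example: a 2 MB read starting at offset 63 MB spans chunks 0 and 1.
print(byte_range_to_chunks(63 * 1024 * 1024, 2 * 1024 * 1024))  # (0, 1, 66060288)
```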

Metadata
Three major types: file and chunk namespaces, file-to-chunk mappings, and the locations of each chunk's replicas.
All metadata is kept in memory; the first two types are also kept persistent.
In-memory metadata enables quick global scans, used for garbage collection and for reorganizing chunks after failures, balancing load, and improving disk space utilization.
About 64 bytes of metadata per 64 MB of data.
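The "64 bytes per 64 MB" figure implies the master's in-memory metadata stays tiny relative to the data it manages; a back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the "64 bytes of metadata per 64 MB chunk" figure.
METADATA_PER_CHUNK = 64            # bytes of master metadata per chunk (from the slide)
CHUNK_SIZE = 64 * 1024 * 1024      # 64 MB

def master_metadata_bytes(total_data_bytes):
    chunks = total_data_bytes // CHUNK_SIZE
    return chunks * METADATA_PER_CHUNK

# Managing 1 PiB of file data needs roughly 1 GiB of chunk metadata in RAM:
one_pib = 1024 ** 5
print(master_metadata_bytes(one_pib) / 1024 ** 3, "GiB")  # 1.0 GiB
```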

Chunk Locations
No persistent state at the master: it polls chunkservers at startup and uses heartbeat messages to monitor them thereafter.
On-demand approach vs. persistent coordination: on-demand wins when changes (failures) are frequent.
Chunkservers join and leave the cluster, change names, fail, or restart; such changes happen often in a cluster with hundreds of servers.

Operation Logs
Metadata updates are logged, e.g., as <old value, new value> pairs, and the log is replicated on remote machines.
Files and chunks, as well as their versions, are also identified by the logical times at which they are created.
The master takes global snapshots (checkpoints) of its state to truncate the log.
Recovery: load the latest checkpoint, then replay the subsequent log files.
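A minimal sketch of the recovery rule "latest checkpoint plus subsequent log records", assuming hypothetical `load_checkpoint` and `read_log_records` helpers:

```python
def recover_master_state(load_checkpoint, read_log_records):
    """Rebuild master metadata: start from the latest checkpoint, then
    replay every operation logged after that checkpoint, in order.
    `load_checkpoint` returns (state, checkpoint_sequence_number);
    `read_log_records` yields (sequence_number, apply_fn) pairs in log order.
    Both are hypothetical helpers for this sketch."""
    state, checkpointed_up_to = load_checkpoint()
    for seq, apply_fn in read_log_records():
        if seq <= checkpointed_up_to:
            continue              # already reflected in the checkpoint
        apply_fn(state)           # re-apply the logged metadata mutation
    return state
```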

System Interactions
The master grants a chunk lease to one replica, called the primary; the primary picks a serial order for all mutations to the chunk.
A mutation is an operation that changes the contents or the metadata of a chunk (a write or an append); the replica holding the lease determines the order of updates applied at all replicas.
Leases have 60-second timeouts but can be extended indefinitely by the primary as long as the chunk is being mutated; extension requests are piggybacked on heartbeat messages.
After a lease expires, the master can grant a new lease for the chunk.
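A rough sketch of lease bookkeeping under the 60-second timeout described above; the class and method names are hypothetical, and in GFS extensions arrive piggybacked on heartbeat replies rather than as standalone calls.

```python
import time

LEASE_SECONDS = 60.0   # lease duration from the slide

class LeaseTable:
    """Minimal sketch of per-chunk lease bookkeeping at the master.
    Chunk and server identifiers are opaque strings here."""

    def __init__(self):
        self._leases = {}   # chunk_id -> (primary_server, expires_at)

    def grant(self, chunk_id, server):
        primary, expires = self._leases.get(chunk_id, (None, 0.0))
        if expires > time.time():
            return primary      # an unexpired lease is still in force
        self._leases[chunk_id] = (server, time.time() + LEASE_SECONDS)
        return server

    def extend_from_heartbeat(self, server, chunk_ids):
        """Renew every lease this server still holds as primary;
        in GFS these extension requests ride on heartbeat messages."""
        now = time.time()
        for chunk_id in chunk_ids:
            primary, expires = self._leases.get(chunk_id, (None, 0.0))
            if primary == server and expires > now:
                self._leases[chunk_id] = (server, now + LEASE_SECONDS)
```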

Write Process (diagram slide)

Consistency Model
Changes to the namespace (i.e., metadata) are atomic: they are handled by the single master, which uses the operation log to define a global total order on namespace-changing operations.
Data changes are more complicated.
A file region is consistent if all clients see the same data, regardless of which replicas they read from.
A file region is defined after a data mutation if it is consistent and all clients see the entire mutation.

Consistency Model (contd.)
Serial writes guarantee that the region is defined and consistent.
But writes may be interleaved with, or overwritten by, concurrent operations from other clients, leaving the region consistent but not defined.
Record append completes at least once, at an offset of GFS's choosing.
Applications must cope with these semantics, including possible duplicates.
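One common way for applications to cope with at-least-once appends is to have writers tag each record with a unique identifier and have readers drop duplicates; the sketch below shows the reader side, with the (record_id, payload) format being an assumption of this example rather than anything GFS mandates.

```python
def deduplicate_records(records):
    """Reader-side filtering for at-least-once record append: an append may
    be applied more than once, so each record carries a writer-assigned
    unique id and readers skip ids they have already seen.
    `records` is an iterable of (record_id, payload) pairs."""
    seen = set()
    for record_id, payload in records:
        if record_id in seen:
            continue          # duplicate produced by a retried append
        seen.add(record_id)
        yield payload
```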

Data Flow
Control and data flows are separated to avoid network bottlenecks.
Control flows from the client to the primary and then to all secondaries.
Data is pushed linearly along a chain of the chunkservers holding replicas: each machine forwards the data to the closest machine that has not yet received it.
Transfers are pipelined over a switched network with full-duplex links.
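A sketch of how such a forwarding chain could be ordered so that each hop sends to the closest machine that has not yet received the data; this is a centralized simplification, and `distance` is a hypothetical network-distance estimate (e.g., derived from rack topology or IP addresses).

```python
def forwarding_chain(client, replicas, distance):
    """Order replicas into a forwarding chain: starting from the client,
    each hop sends to the closest machine that has not yet received the
    data. `distance(a, b)` is a hypothetical network-distance estimate."""
    chain, current, remaining = [], client, set(replicas)
    while remaining:
        nxt = min(remaining, key=lambda server: distance(current, server))
        chain.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return chain
```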

Master Operations
Namespace management with locking
Replica creation and re-replication
Garbage collection
Stale chunk detection

Locking Operations
Many master operations can be time consuming, so GFS allows multiple master operations to run concurrently and uses locks over regions of the namespace.
GFS has no per-directory data structure and does not allow hard or symbolic links; the namespace is represented as a lookup table mapping full pathnames to metadata, with a lock per path.
To access /d1/d2/leaf, an operation needs to lock /d1, /d1/d2, and /d1/d2/leaf.
Files in the same directory can be modified concurrently: each thread acquires a read lock on the directory and a write lock on its own file.
Locks are acquired in a total order (first by level, then lexicographically within a level) to prevent deadlocks.
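A small sketch of this locking scheme: compute the lock set for a path, then merge and order lock sets so every operation acquires locks in the same total order. Function names are illustrative.

```python
def locks_needed(path, write=True):
    """Lock set for one namespace operation: read locks on all ancestor
    paths, plus a read or write lock on the full path itself."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    target = "/" + "/".join(parts)
    return [(p, "read") for p in ancestors] + [(target, "write" if write else "read")]

def deadlock_free_order(lock_sets):
    """Merge lock sets and sort them in a consistent total order
    (first by depth in the tree, then lexicographically)."""
    merged = {}
    for locks in lock_sets:
        for path, mode in locks:
            # a write lock subsumes a read lock on the same path
            if merged.get(path) != "write":
                merged[path] = mode
    return sorted(merged.items(), key=lambda kv: (kv[0].count("/"), kv[0]))

# Example: creating /d1/d2/leaf needs read locks on /d1 and /d1/d2
# and a write lock on /d1/d2/leaf.
print(deadlock_free_order([locks_needed("/d1/d2/leaf")]))
```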

Replica Placement
Goals: maximize data reliability and availability, and maximize network bandwidth utilization.
Chunk replicas must be spread across machines and racks: different machines guard against disk or machine failures, and different racks guard against network switch failures while utilizing aggregate network bandwidth for reads; the cost is that write traffic has to flow across racks.
Higher re-replication priority goes to chunks with lower replication factors, given the limited resources spent on replication.

Replica Creation and Re-replication
Creation of new (empty) replicas: place them on servers with below-average disk utilization, limit recent creations on the same server, and spread replicas across racks.
Re-replication happens when a server becomes unavailable or a replica is corrupted; chunks are re-replicated in priority order, favoring chunks that are down to a single replica and chunks in live use by running applications.
The master is involved only in picking a high-priority chunk; it then instructs a suitable chunkserver to clone the chunk directly from a valid replica.
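A rough sketch of how re-replication priorities like these could be expressed; the `Chunk` record and its fields are assumptions of this example, not GFS data structures.

```python
from collections import namedtuple

# Hypothetical per-chunk bookkeeping for this sketch.
Chunk = namedtuple("Chunk", "chunk_id goal_replicas live_replicas blocking_client")

def rereplication_priority(chunk):
    """Priority key for picking which chunk to re-replicate next
    (higher tuple sorts first)."""
    missing = chunk.goal_replicas - chunk.live_replicas
    only_one_left = 1 if chunk.live_replicas == 1 else 0
    blocks_app = 1 if chunk.blocking_client else 0   # a running application waits on it
    return (only_one_left, blocks_app, missing)

def pick_next_chunk(chunks):
    # The master only picks the highest-priority chunk and tells a suitable
    # chunkserver to clone it directly from a valid replica.
    return max(chunks, key=rereplication_priority)
```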

Garbage Collection
Deleted files are only marked as hidden for three days; after that they are garbage collected and the master deletes their metadata.
Each chunkserver identifies the set of deleted chunks during the regular heartbeat exchange and then deletes those replicas itself.
Garbage collection is combined with other background operations (regular namespace scans, handshakes with servers) and done in batches, so the cost is amortized.
This is simpler than eager deletion, which must handle unfinished replica creation and lost deletion messages, and it provides a safety net against accidental, irreversible deletions.
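A minimal sketch of lazy deletion with the three-day grace period, using hypothetical in-memory structures on the master; in GFS the file is renamed to a hidden name and reclaimed during the regular namespace scan, and chunkservers learn of orphaned replicas via heartbeats.

```python
import time

GRACE_PERIOD = 3 * 24 * 3600   # deleted files linger for three days

class Namespace:
    """Minimal sketch of lazy deletion on the master. A 'delete' only marks
    the file as hidden with a timestamp; a periodic scan reclaims metadata
    once the grace period has passed."""

    def __init__(self):
        self.files = {}     # path -> list of chunk ids
        self.hidden = {}    # path -> (deletion_time, list of chunk ids)

    def delete(self, path):
        self.hidden[path] = (time.time(), self.files.pop(path))

    def garbage_collect(self):
        """Run during regular background scans of the namespace."""
        now = time.time()
        orphaned_chunks = []
        for path, (deleted_at, chunks) in list(self.hidden.items()):
            if now - deleted_at >= GRACE_PERIOD:
                del self.hidden[path]          # metadata is removed only now
                orphaned_chunks.extend(chunks)
        # Chunkservers learn which of their replicas are orphaned through the
        # regular heartbeat exchange and then delete them locally.
        return orphaned_chunks
```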

Fault Tolerance and Diagnosis
Fast recovery: the master and chunkservers are designed to restore their state and start in seconds regardless of how they terminated; there is no distinction between normal and abnormal termination.
Chunk replication.
Master replication: the master's state, operation log, and checkpoints are replicated on multiple machines; a mutation is committed only after its log record has been flushed to all replicas.
Infrastructure outside GFS starts a new master on failure; clients use a canonical name, which is a DNS alias, to reach the current master.
Shadow masters provide read-only access when the primary master is down.

Fault Tolerance: Data Integrity
Each chunk is divided into 64 KB blocks, each with a 32-bit checksum; every chunkserver uses these checksums to detect corruption of its stored data.
Checksums are verified at read and write time. On a mismatch, the chunkserver returns an error to the requestor and reports the corruption to the master; the master creates a new replica and then instructs the chunkserver to delete its corrupted replica.
Checksumming adds little performance overhead to read and record-append operations.
Chunkservers also employ background scans to verify rarely used chunks.
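A sketch of per-block checksum verification on the read path; CRC-32 is an illustrative choice here, since the slide only specifies a 32-bit checksum per 64 KB block.

```python
import zlib

BLOCK_SIZE = 64 * 1024   # 64 KB checksum blocks

def block_checksums(chunk_data):
    """Compute a 32-bit checksum per 64 KB block (CRC-32 is an assumption
    of this sketch; the slide only says '32-bit checksum')."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_read(chunk_data, stored_checksums, offset, length):
    """Verify only the blocks that overlap the requested byte range,
    as a chunkserver would before returning data to a client."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for i in range(first, last + 1):
        block = chunk_data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != stored_checksums[i]:
            raise IOError(f"checksum mismatch in block {i}; report to master")
    return chunk_data[offset:offset + length]
```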

GFS: Summary
A success: used actively by Google to support its search service and other applications.
Availability and recoverability on cheap hardware; high throughput by decoupling control and data; support for massive data sets and concurrent appends.
The semantics are not transparent to applications: they must verify file contents to cope with inconsistent regions and repeated appends (at-least-once semantics).
Performance is not good for every kind of application.