1 The Google File System by S. Ghemawat, H. Gobioff, and S-T. Leung CSCI 485 lecture by Shahram Ghandeharizadeh Computer Science Department University of Southern California

2 Primary Functionality of Google

3 Search content on the web in browsing mode.
Open world assumption: If your search with Google does not return results, it does not mean that the referenced content is non-existent. It only means that Google did not know about it when the search was issued.
 Google may retrieve results if the search is issued again.
 Do not index/find me: Google provides tags to enable an information provider to prevent Google from indexing its pages.
 No one gets angry if Google does not retrieve information known to exist on the Internet.
How is this different from financial applications?

4 Functionality
IR:
 Search content on the web in browsing mode.
 Open world assumption: If your search with Google does not return results, it does not mean that the referenced content is non-existent. It only means that Google did not know about it when the search was issued.
 Google may retrieve results if the search is issued again.
 Do not index/find me: Google provides tags to enable an information provider to prevent Google from indexing its pages.
 No one gets angry if Google does not retrieve information known to exist on the Internet.
DB:
 Query based: Looking for a needle in a haystack.
 Closed world assumption: A data item that is not known does not exist.
 A query must retrieve correct results 100% of the time!
 If a customer insists the bank cannot find his or her account because the customer has used Google’s “do not find me” tags, the customer is kicked out!
 Customers become angry if the system retrieves incorrect data.

5 Key Observation
IR:
 Okay to return either no or incorrect results.
 Acceptable for a user search to observe stale data.
DB:
 Not okay to return incorrect results.
 A transaction must observe consistent data.
 SQL front end.

6 Big Picture
A shared-nothing architecture consisting of thousands of nodes!
 A node is an off-the-shelf, commodity PC.
Software stack: Google File System, Google’s Bigtable Data Model, Google’s Map/Reduce Framework, Yahoo’s Pig Latin, …

7 Big Picture
Shared-nothing architecture consisting of thousands of nodes!
 A node is an off-the-shelf, commodity PC.
Software stack: Google File System, Google’s Bigtable Data Model, Google’s Map/Reduce Framework, Yahoo’s Pig Latin, …
Divide & Conquer

8 Big Picture
Source code for Pig and Hadoop is available for free download.
Software stack: Google File System, Google’s Bigtable Data Model, Google’s Map/Reduce Framework, Yahoo’s Pig Latin, …
Open-source counterparts: Hadoop, Pig.

9 Data Shipping
Client retrieves data from the node.
Client performs computation locally.
Limitation: dumb servers; consumes the limited network bandwidth.

10 Function Shipping
Client ships the function f(x) to the node for processing.
Relevant data (the output of f(x)) is sent to the client.
Function f(x) should produce less data than the original data stored in the database.
Minimizes demand for the network bandwidth.
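The trade-off above can be made concrete with a small sketch. This is illustrative code, not GFS code; the function names and the use of element count as a proxy for bytes on the wire are assumptions for the example.

```python
# Hypothetical sketch: data shipping vs. function shipping, comparing how
# much crosses the network (element count used as a proxy for bytes).

def data_shipping(node_data, f):
    """Ship all data to the client, which computes f locally."""
    transferred = len(node_data)          # the entire dataset crosses the network
    result = f(node_data)                 # computed at the client
    return result, transferred

def function_shipping(node_data, f):
    """Ship f to the node; only f's (smaller) output crosses the network."""
    result = f(node_data)                 # computed at the node
    transferred = 1                       # only the aggregate result is sent back
    return result, transferred

data = list(range(1_000_000))             # records stored at the node
total = lambda xs: sum(xs)

r1, sent1 = data_shipping(data, total)
r2, sent2 = function_shipping(data, total)
assert r1 == r2 and sent2 < sent1         # same answer, far less network traffic
```

The sketch shows why f(x) must be reductive: if f returned as much data as it read, function shipping would save nothing.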

11 Google
An application (configured with a GFS client) may run on the same PC as the one hosting a chunkserver. Requirements:
 Machine resources are not overwhelmed.
 The lower reliability is acceptable.

12 References
 Pig Latin: Olston et al. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008.
 Map/Reduce: Dean and Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, Vol. 51, No. 1, January 2008.
 Bigtable: Chang et al. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006.
 GFS: Ghemawat et al. The Google File System. SOSP 2003.

13 Overview: GFS
A highly available, distributed file system for inexpensive commodity PCs.
 Supports node failures as the norm rather than the exception.
 Stores and retrieves multi-GB files.
 Assumes files are append-only (instead of updates that modify a certain piece of existing data).
 Atomic append operation to enable multiple clients to append to a file with minimal synchronization.
 Relaxed consistency model to simplify the file system and enhance performance.

14 Google File System: Assumptions

15 Google File System: Assumptions (Cont…)

16 GFS: Interfaces
Create, delete, open, close, read, and write files.
Snapshot a file:
 Create a copy of the file.
Record append operation:
 Allows multiple clients to append data to the same file concurrently, while guaranteeing the atomicity of each individual client’s append.

17 GFS: Architecture
One master.
Multiple chunkservers.
A file is partitioned into fixed-size chunks.
Each chunk has a globally unique 64-bit chunk handle.
Each chunk is replicated on several chunkservers.
 Degree of replication is application specific; default is 3.
Software:
 Master maintains all file system metadata: namespace, access control info, mapping from files to chunks, current location of chunks.
 GFS client caches metadata about the file system.
 Client receives data from the chunkserver directly.
 Client and chunkserver do not cache file data.

18 GFS: Architecture
One master.
Multiple chunkservers.
A file is partitioned into fixed-size (64 MB) chunks.
Each chunk has a globally unique 64-bit chunk handle.
Each chunk is replicated on several chunkservers.
 Degree of replication is application specific; default is 3.
Software:
 Master maintains all file system metadata: namespace, access control info, mapping from files to chunks, current location of chunks.
 GFS client caches metadata about the file system.
 Client receives data from the chunkserver directly.
 Client chooses one of the replicas.
 Client and chunkserver do not cache file data.

19 GFS: Architecture
One master.
Multiple chunkservers.
A file is partitioned into fixed-size (64 MB) chunks.
Each chunk has a globally unique 64-bit chunk handle.
Each chunk is replicated on several chunkservers.
 Degree of replication is application specific; default is 3.
Software:
 Master maintains all file system metadata: namespace, access control info, mapping from files to chunks, current location of chunks.
 GFS client caches metadata about the file system.
 Client receives data from the chunkserver directly.
 Client chooses one of the replicas.
 Client and chunkserver do not cache file data.
On the 64 MB chunk size:
 Unix allocates space lazily!
 Many small logical files are stored in one file.
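The fixed 64 MB chunk size makes translating a file byte offset to a chunk purely arithmetic, which is what lets the client cache chunk locations and contact chunkservers directly. A minimal sketch (function name is illustrative, not from GFS):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size used by GFS

def locate(byte_offset):
    """Translate a file byte offset into (chunk index, offset within chunk)."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

# Byte 150,000,000 of a file falls inside its third chunk (index 2).
idx, off = locate(150_000_000)
assert idx == 2
assert off == 150_000_000 - 2 * CHUNK_SIZE
```

Given the chunk index, the client asks the master (or its cache) for the chunk handle and replica locations, then reads from a replica of its choice.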

20 GFS Master
One master simplifies software design.
Master monitors availability of chunkservers using heart-beat messages.
One master is a single point of failure:
 Master does not store chunk location information persistently: when the master is started, it asks each chunkserver about its chunks (and again whenever a chunkserver joins).
Master metadata:
 File and chunk namespaces,
 Mapping from files to chunks,
 Location of each chunk’s replicas.

21 Mutation = Update
A mutation is an operation that changes the contents of either metadata (delete or create a file) or a chunk (append a record).
Content mutation:
 Performed on all of a chunk’s replicas.
 Master grants a chunk lease to one of the replicas, the primary.
 Primary picks a serial order for all mutations to the chunk.
Lease:
 Granted by master, typically 60 seconds.
 Primary may request extensions.
 If master loses communication with a primary, it can safely grant a new lease to another replica after the current lease expires.
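The safety rule above — a new primary only after the old lease has expired — can be sketched as follows. This is a minimal model with assumed names (ChunkLease, grant), not the master's actual data structures; explicit timestamps stand in for the master's clock.

```python
LEASE_SECONDS = 60  # typical lease duration, per the paper

class ChunkLease:
    """Toy model of the master's per-chunk lease state."""
    def __init__(self):
        self.primary = None
        self.expires_at = 0.0

    def grant(self, replica, now):
        # The current primary may extend its own lease at any time, but a
        # different replica can be made primary only after expiry.
        if now < self.expires_at and replica != self.primary:
            raise RuntimeError("lease still held; wait for expiry")
        self.primary = replica
        self.expires_at = now + LEASE_SECONDS

lease = ChunkLease()
lease.grant("replica-A", now=0.0)
lease.grant("replica-A", now=30.0)      # primary requests an extension: allowed
try:
    lease.grant("replica-B", now=45.0)  # old lease unexpired: refused
except RuntimeError:
    pass
lease.grant("replica-B", now=91.0)      # safe once the prior lease has lapsed
```

Because the master never grants overlapping leases, at most one replica can be ordering mutations for a chunk at any instant, even if the master cannot reach the old primary.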

22 Master & Logging
Master stores 3 types of metadata:
1. The file and chunk namespaces,
2. Mapping from files to chunks,
3. Locations of each chunk’s replicas.
First two types are kept persistent by:
 Logging mutations (updates) to an operation log stored on the master’s local disk,
 Replicating the operation log on multiple machines.
What is required to support logging?

23 Master & Logging
Master stores 3 types of metadata:
1. The file and chunk namespaces,
2. Mapping from files to chunks,
3. Locations of each chunk’s replicas.
First two types are kept persistent by:
 Logging mutations (updates) to an operation log stored on the master’s local disk,
 Replicating the operation log on multiple machines.
What is required to support logging?
 Uniquely identify transactions and data items.
 Checkpointing.

24 Master & Logging
Master stores 3 types of metadata:
1. The file and chunk namespaces,
2. Mapping from files to chunks,
3. Locations of each chunk’s replicas.
First two types are kept persistent by:
 Logging mutations (updates) to an operation log stored on the master’s local disk,
 Replicating the operation log on multiple machines.
 Files and chunks, as well as their versions, are uniquely identified by the logical times at which they were created.
 GFS responds to a client operation only after flushing the log record to disk both locally and remotely.
 After a failure, during the recovery phase, the master recovers its file system state by replaying the operation log.
 Checkpoints are fuzzy.
 GFS maintains a few older checkpoints and log files, deleting the prior ones.
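Replay-based recovery can be sketched in a few lines. This is an illustrative model, not the master's real code: the namespace is a plain dict, the log is a list of (operation, path, value) tuples, and both names are assumptions for the example.

```python
# Sketch of checkpoint-plus-log recovery: start from the latest checkpoint,
# then re-apply every mutation logged after it, in order.

def recover(checkpoint, log):
    """Rebuild namespace metadata after a master failure."""
    namespace = dict(checkpoint)          # state as of the checkpoint
    for op, path, value in log:           # replay mutations in log order
        if op == "create":
            namespace[path] = value
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

checkpoint = {"/d1/fileA": "chunks:[c1,c2]"}
log = [("create", "/d1/fileB", "chunks:[c3]"),
       ("delete", "/d1/fileA", None)]
assert recover(checkpoint, log) == {"/d1/fileB": "chunks:[c3]"}
```

Checkpointing matters because replay time grows with log length; a fresh checkpoint lets the master truncate the log and keep startup fast.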

25 Master & Locking
Namespace management:
 GFS represents its namespace as a lookup table mapping full pathnames to metadata.
 /d1/d2/…/dn/fileA consists of the following pathnames:
 /d1
 /d1/d2
 …
 /d1/d2/…/dn
 /d1/d2/…/dn/fileA

26 Master & Locking
Namespace management:
 GFS represents its namespace as a lookup table mapping full pathnames to metadata.
 Each node in the namespace tree has an associated read-write lock.
 Each master operation acquires a set of locks before it can perform its read/mutation operation:
 Typically, an operation involving /d1/d2/…/dn/fileA will acquire read locks on /d1, /d1/d2, …, /d1/d2/…/dn and either a read or write lock on /d1/d2/…/dn/fileA.
 A read lock is the same as a Shared lock.
 A write lock is the same as an eXclusive lock.
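The lock-set rule above is simple enough to sketch directly: read locks on every proper prefix of the pathname, and a read or write lock on the full pathname itself. The function name is illustrative, not from GFS.

```python
# Sketch: compute the locks a master operation on `pathname` must acquire.

def lock_set(pathname, write=False):
    parts = pathname.strip("/").split("/")
    # Every prefix of the path: /d1, /d1/d2, ..., up to the full pathname.
    prefixes = ["/" + "/".join(parts[:i + 1]) for i in range(len(parts))]
    locks = [(p, "read") for p in prefixes[:-1]]          # ancestors: read locks
    locks.append((prefixes[-1], "write" if write else "read"))  # target itself
    return locks

assert lock_set("/d1/d2/fileA", write=True) == [
    ("/d1", "read"), ("/d1/d2", "read"), ("/d1/d2/fileA", "write")]
```

Note there is no per-directory "inode" to lock: because the namespace is a flat lookup table, the prefix read locks are what prevent a directory from being deleted or renamed while an operation inside it is in flight.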

27 Example
Operation 1:
 Copy directory /home/user to /save/user.
Operation 2:
 Create /home/user/foo.

28 Example
Operation 1:
 Copy directory /home/user to /save/user.
Operation 2:
 Create /home/user/foo.
Could they have used IS and IX locks?
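Why do these two operations serialize? In the paper's snapshot example, the copy takes write locks on /home/user and /save/user (plus read locks on /home and /save), while the create takes a read lock on /home/user and a write lock on /home/user/foo; the two lock sets clash on /home/user. A small sketch under those assumptions (the lock sets are hard-coded, and the conflict test is illustrative):

```python
# Sketch: two lock sets conflict if some shared path carries a write lock.

def conflicts(locks_a, locks_b):
    a = dict(locks_a)
    return any(path in a and "write" in (mode, a[path])
               for path, mode in locks_b)

copy_op   = [("/home", "read"), ("/save", "read"),
             ("/home/user", "write"), ("/save/user", "write")]
create_op = [("/home", "read"), ("/home/user", "read"),
             ("/home/user/foo", "write")]

assert conflicts(copy_op, create_op)   # both touch /home/user; one holds a write
```

Two operations whose shared paths are all read-locked would run concurrently, which is exactly how creates in the same directory avoid serializing against each other.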

29 Atomic Record Appends
Background:
 With traditional writes, a client specifies the offset at which data is to be written.
 GFS cannot serialize concurrent writes to the same region.
With record append, the client specifies only the data. GFS appends the record to the file at least once atomically at an offset of GFS’s choosing and returns that offset to the client.
What does “atomically” mean?

30 Atomic Record Appends
Background:
 With traditional writes, a client specifies the offset at which data is to be written.
 GFS cannot serialize concurrent writes to the same region.
With record append, the client specifies only the data. GFS appends the record to the file at least once atomically at an offset of GFS’s choosing and returns that offset to the client.
What does “atomically” mean?
 The record is written as one sequence of bytes.
Does GFS write the record partially?

31 Atomic Record Appends
Background:
 With traditional writes, a client specifies the offset at which data is to be written.
 GFS cannot serialize concurrent writes to the same region.
With record append, the client specifies only the data. GFS appends the record to the file at least once atomically at an offset of GFS’s choosing and returns that offset to the client.
What does “atomically” mean?
 The record is written as one sequence of bytes.
Does GFS write the record partially?
 Yes, a record might be written partially to a file replica.

32 Atomic Record Appends
How?
 Discuss how regular chunk mutations are supported.

33 Updates

34 Atomic Record Appends: How?
Client:
 Pushes data to all replicas of the last chunk of the file.
 Sends its write request to the primary.
Primary appends the data to its replica and tells the secondaries to write the data at the exact offset where it has written. If all secondaries succeed, the primary replies success to the client.
If a record append fails at any replica, the primary reports an error and the client retries the operation.
 One or more of the replicas may have succeeded fully (or written partially) → replicas of the same chunk may contain different data, including duplicates of the same record.
 GFS does not guarantee that all replicas are bytewise identical. GFS guarantees that the record is written at the same offset at least once in its entirety (atomic unit).
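The retry behavior above is what produces duplicates: a replica that already applied the append sees it again on retry. A toy model of this (illustrative only; replicas are plain lists, and a single simulated failure stands in for a crashed or unreachable secondary):

```python
# Sketch of at-least-once record append: on any replica failure the client
# retries the whole operation, so replicas may diverge and hold duplicates,
# but every replica ends up containing the record in its entirety.

def append_with_retry(replicas, record, fail_once=True):
    while True:
        failed = False
        for r in replicas:
            r.append(record)          # each attempt writes the whole record
            if fail_once:
                fail_once = False     # simulate one replica reporting an error
                failed = True
                break                 # primary reports failure; client retries
        if not failed:
            return                    # all replicas succeeded

r1, r2 = [], []
append_with_retry([r1, r2], b"rec")
assert b"rec" in r1 and b"rec" in r2  # record present on every replica...
assert r1.count(b"rec") == 2          # ...and duplicated where the retry re-ran
```

Applications cope with this by writing self-validating, self-identifying records: checksums detect partial writes, and unique record IDs let readers discard duplicates.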

35 Summary
File namespace mutations are managed by requiring the Master to implement ACID properties: locking guarantees atomicity, consistency, and isolation; the operation log provides durability.
The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations:
 A file region is consistent if all clients will always see the same data, regardless of which replicas they read from.
 A file region is defined after a mutation if it is consistent and clients will see what the mutation writes in its entirety.
 When a mutation succeeds without interference from concurrent writers, the affected region is defined and, by implication, consistent.
 Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may consist of mingled fragments from multiple mutations.

