Presentation on theme: "File and Storage Systems: The Google File System"— Presentation transcript:

1 File and Storage Systems: The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Google)

Good morning everyone. Today I will be talking about the paper on the Google File System, written by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung and presented at SOSP 2003 (the ACM SIGOPS Symposium on Operating Systems Principles). Presented by Uma Murthy, CS5204 Fall 2004.

2 Key Ideas (1)
A new file system for Google, built to satisfy the demands of its application workloads and technological environment.

The key idea of this paper is the design and implementation of a new file system for Google to meet the demands of its technological environment and data-intensive application workloads. We will see what these terms actually mean and how they drive the design of the file system.

3 Key Ideas (2)
The file system's architecture and design meet several goals, some of which are shared with other distributed file systems:
- Scalability: satisfied by the cluster architecture.
- Fast recovery: metadata plays a vital role here.
- High aggregate throughput: aggregate throughput is the total of the data rates of all transfers the system can sustain simultaneously. GFS achieves high aggregate throughput by separating data flow from control flow.
- Availability: achieved through replication and a relaxed consistency model (mutations are atomic; newly arriving data does not have to be applied everywhere at the same instant and can propagate gradually, but each mutation must be atomic).
- Fault tolerance: met through replication and data integrity checks.

4 Technological Environment
- Clusters of a few thousand commodity Linux machines (versus a small number of high-end servers)
- Multiple clusters distributed across different geographical locations
- Reliability provided at the software level by replicating services and automatically detecting and handling failures

Before getting into the details of the paper, it is important to understand the background and motivation behind building such a file system. Each Google query served uses hundreds of megabytes of data and tens of millions of CPU cycles; providing this kind of computing power would normally require supercomputer-class infrastructure. Google instead uses clusters of inexpensive commodity Linux machines (rather than a few high-end servers), and multiple clusters are geographically distributed worldwide. Because the clusters are built from inexpensive commodity PCs, they are highly susceptible to failure, so reliability is provided in software through replication, error detection and handling, and fast recovery. This gives us an idea of the technological environment at Google.

5 Google Tasks
Tasks: crawling, searching. Data stored: web documents, indices.
Components involved in serving a query: Google Web Server, Ad Server, Spell Checker, Index Servers, Document Servers.

To get an idea of the application workloads at Google, we first need to understand the tasks it performs. Google performs two major tasks: crawling the web, and searching (serving queries). Crawling the web means gathering web documents, storing them, and keeping them up to date. A typical Google search works like this: when we type a query, the request is routed to a Google Web Server (GWS), usually the one geographically nearest to us, and the search then proceeds in two phases. In the first phase, index servers search an inverted index that maps query words to the documents containing them; the output is a list of document ids ranked by relevance to the query. In the second phase, document servers take each document id and produce the document title, the URL, and a query-specific summary (keyword snippet). The GWS also performs the ancillary tasks of spell checking and retrieving related ads from an ad server, and finally a result page is returned. From these tasks we can see two major kinds of data being stored: web documents (and ad documents), and indices of query words, ads, and words for the spell checker. (Figure: Serving a Google Query)

6 A New File System for Google
- Component failures are the norm rather than the exception (clusters of inexpensive commodity machines)
- File sizes: multi-GB files, so I/O operations and block size must be revisited
- Workloads: reads are sequential and random; writes are dominated by appending new data rather than overwriting existing data
- Benefit of increased flexibility, since the applications and the file system API are co-designed (relaxed consistency model)

Having seen the technological environment and application workloads, we can now understand the motivations behind the design of the Google File System. First, component failures are the norm rather than the exception, because of the clusters of commodity Linux machines. File sizes are huge, typically multiple gigabytes, since web documents and indices are being stored, so I/O operations and the file system block size need to be revisited. Large sequential reads might occur when indices are built from documents, random reads while serving queries, and large appends while updating web documents and the various indices; we can only speculate about what these operations correspond to, as the paper gives no clear information on this. Finally, since Google develops both the applications and the file system, there is a lot of room for flexibility: they can be built to suit each other and to communicate with each other easily.

7 Architecture
(Figure: GFS architecture. An application talks to a GFS client, which exchanges control messages with the single GFS master (file namespace, e.g. /foo/bar -> chunk 2ef0) and exchanges chunk data directly with GFS chunkservers, each running on top of a Linux file system. Requests carry (file name, chunk index) to the master and (chunk handle, byte range) to chunkservers.)

The GFS architecture consists of a single master, several chunkservers, and several clients accessing them; each of these is a commodity Linux machine running a user-level process. Files are divided into fixed-size blocks called chunks, of 64 MB each. Each chunk is identified by a globally unique and immutable 64-bit chunk handle. This chunk size is much larger than typical file system block sizes, for several reasons. First, the data stored are large files (web documents and indices). Second, a large chunk reduces client-master interactions: once a client has located a chunk it usually performs many operations on it, which reduces network overhead and allows a persistent TCP connection to the chunkserver. Finally, a large chunk means the master stores less metadata. Multiple replicas of each chunk are stored on chunkservers (three by default, but configurable). The master stores most of the file system's metadata; we will see exactly what this is on the next slide. Having a single master simplifies the design and gives the master the global knowledge needed for decisions about chunk placement, replication, and garbage collection, but it also drives other design decisions such as minimizing client-master interaction and keeping the metadata small.

To understand the interactions between the components, consider a typical read:
1. Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file, then sends the master a request containing the file name and chunk index.
2. The master replies with the chunk handle and the locations of the chunk's replicas.
3. The client caches this information for future use.
4. Using the chunk handle and a byte range, the client sends a request to a chunkserver holding a replica, usually the nearest one, which replies with the chunk data.

Two things are worth noting: client-master interaction is restricted to fetching the chunk handle and replica locations, so no file data flows between the master and the client; and because the client caches chunk information, it need not contact the master again for subsequent requests. This also shows the rationale behind the large chunk size.
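As a rough illustration of this read path, here is a minimal client sketch. It is not Google's actual client library; the class and method names (GFSClientSketch, lookup, read_chunk) are invented, and only the 64 MB chunk size comes from the paper.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size, as in the paper


class GFSClientSketch:
    """Illustrative-only sketch of the GFS read path (all names invented)."""

    def __init__(self, master):
        self.master = master
        self.cache = {}  # (filename, chunk index) -> (chunk handle, replica locations)

    def read(self, filename, offset, length):
        # 1. Translate (file name, byte offset) into a chunk index.
        #    (Assumes the read stays within one chunk, for simplicity.)
        chunk_index = offset // CHUNK_SIZE

        # 2. Ask the master only if the mapping is not already cached.
        key = (filename, chunk_index)
        if key not in self.cache:
            self.cache[key] = self.master.lookup(filename, chunk_index)
        chunk_handle, locations = self.cache[key]

        # 3. Read the byte range directly from a (nearby) chunkserver replica;
        #    no file data flows through the master.
        chunkserver = locations[0]
        return chunkserver.read_chunk(chunk_handle, offset % CHUNK_SIZE, length)
```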

8 Metadata
Master:
- File and chunk namespaces
- Mapping from files to chunks
- Location of each chunk's replicas (not persistent)
- Operation log: serves as a logical timeline that defines the order of concurrent operations; replicated on multiple remote machines; the master checkpoints its state when the log grows beyond a certain size
Chunkserver:
- Checksums for each 64 KB block of user data
- Chunk version number

The master stores most of the file system's metadata: the file and chunk namespaces, the mapping from files to chunks, and the location of each chunk's replicas. The master does not store replica locations persistently; instead it polls the chunkservers for this information at startup. This makes sense because a chunkserver has the final word on which chunks it does or does not hold, and errors on a chunkserver may cause a replica to go bad or be disabled. The master also keeps an operation log containing a historical record of all metadata changes; the log serves as a logical timeline that defines the order of concurrent operations and is replicated on multiple remote machines. The master checkpoints its state whenever the log grows beyond a certain size, so that after a failure it can recover by loading the latest checkpoint and replaying only the limited number of log records after it. Since all of this metadata is stored in the master's memory, master operations are fast, and the master can easily and efficiently scan its entire state periodically in the background to implement chunk garbage collection, re-replication after chunkserver failures, and chunk migration to balance load and disk space utilization. Other metadata, such as checksums (for data integrity, one per 64 KB block) and chunk version numbers, is stored on the chunkservers.
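The shape of this metadata can be sketched with a few dataclasses. This is purely illustrative; the class and field names are made up, and only the facts above (non-persistent replica locations, the operation log, checkpointing) are taken from the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkInfo:
    handle: int                 # globally unique, immutable 64-bit chunk handle
    version: int = 0            # chunk version number (also recorded by chunkservers)
    # Replica locations are NOT persisted; the master rebuilds them by polling
    # chunkservers at startup.
    locations: list = field(default_factory=list)


@dataclass
class MasterMetadata:
    """Rough sketch of the master's in-memory state (field names invented)."""
    namespace: dict = field(default_factory=dict)        # full pathname -> file info
    file_to_chunks: dict = field(default_factory=dict)   # pathname -> list of ChunkInfo
    operation_log: list = field(default_factory=list)    # replayed after the last checkpoint

    def log_mutation(self, record):
        # Every metadata change is appended to the operation log (and, in the real
        # system, flushed to remote replicas) before being applied, so the log
        # defines the order of concurrent operations.
        self.operation_log.append(record)
```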

9 Leases and Mutation Order
- Leases are used to maintain a consistent mutation order
- A lease, when granted, has a timeout that may be extended
- The master grants a chunk lease to a primary replica
- The global mutation order is defined by the lease-grant order chosen by the master and, within a lease, by the serial numbers assigned by the primary

Replication plays an important role in satisfying many of the file system's goals, and one part of replica management is updating replicas. To understand how replicas are updated, we need to look at how the client, master, and chunkservers interact to implement data mutations. A mutation is an operation that changes the contents or metadata of a chunk, for example a write or an append. Leases are used to maintain a consistent mutation order across replicas: the master grants a chunk lease to one of the replicas, called the primary, and the primary picks a serial order for all mutations to the chunk, which all replicas follow. The global mutation order is therefore defined first by the lease-grant order chosen by the master and, within a lease, by the serial numbers assigned by the primary. The lease mechanism minimizes management overhead at the master: a lease has an initial timeout that can be extended on request, and lease extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers.
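A toy sketch of how a primary might serialize mutations under a lease. The 60-second initial timeout is the value given in the paper; everything else (class and method names, error handling) is invented for illustration.

```python
import time

LEASE_TIMEOUT = 60.0  # initial lease timeout in seconds (value from the paper)


class PrimarySketch:
    """Illustrative-only sketch of a primary replica serializing mutations."""

    def __init__(self):
        self.lease_expiry = 0.0
        self.next_serial = 0

    def grant_lease(self):
        # Conceptually invoked by the master; in the real system extensions are
        # piggybacked on HeartBeat messages.
        self.lease_expiry = time.time() + LEASE_TIMEOUT

    def order_mutation(self, mutation):
        if time.time() >= self.lease_expiry:
            raise RuntimeError("lease expired; client must re-ask the master")
        # The primary assigns consecutive serial numbers to every mutation it
        # receives (possibly from many clients); all replicas apply mutations
        # in this order, giving a single global mutation order per chunk.
        serial = self.next_serial
        self.next_serial += 1
        return serial, mutation
```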

10 Write Control and Data Flow
1. Request lease information
2. Reply with the identity of the primary and the secondaries
3. Push data to all replicas
4. Send the write request to the primary
5. The primary forwards the write request to all secondaries
6. The secondaries reply after completing the operation
7. The primary replies to the client (including any errors)
(Figure: write control and data flow among the client, the master, the primary replica, and secondary replicas A and B, with control messages and data messages shown separately.)

This slide describes the interactions between the master, client, and chunkservers in the control flow of a write.
1. The client asks the master which chunkserver holds the current lease for the chunk and where the other replicas are. If no one holds a lease, the master grants one to a replica.
2. The master replies with the identities of the primary and the secondary replicas, and the client caches this information for future use.
3. The client pushes the data to all replicas, which store it in an LRU buffer cache until it is used or ages out. By decoupling the data flow from the control flow, performance is improved: the expensive data flow can be scheduled based on network topology, regardless of which chunkserver is the primary.
4. Once all replicas have acknowledged receiving the data, the client sends a write request to the primary, identifying the data it pushed earlier. The primary assigns consecutive serial numbers to all mutations it receives, possibly from many clients, which provides the necessary serialization, and it applies the mutations in serial-number order to its own local state.
5. The primary forwards the write request to all secondary replicas, which apply the mutations in the same serial order assigned by the primary.
6. The secondaries reply to the primary to indicate that they have completed the operation.
7. The primary replies to the client. Any errors encountered are reported to the client; in case of errors the write may have succeeded at the primary and failed at an arbitrary subset of the secondaries, and the client handles this by retrying the failed mutation.
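The seven steps can be condensed into a hypothetical client-side routine. None of these method names (find_lease_holder, push_data, apply_write) are real GFS APIs; they simply mirror the numbered steps above.

```python
def write_sketch(master, filename, chunk_index, data):
    """Hypothetical outline of the seven-step write (illustrative names only)."""
    # 1-2. Ask the master who holds the lease; the reply names the primary and
    #      the secondary replicas (the client would cache this).
    primary, secondaries = master.find_lease_holder(filename, chunk_index)

    # 3. Push the data to all replicas in any order; each chunkserver keeps it
    #    in an LRU buffer cache until it is used or aged out.
    data_id = None
    for replica in [primary] + secondaries:
        data_id = replica.push_data(data)

    # 4. Once every replica has acknowledged the data, send the write request
    #    to the primary, which assigns it a serial number and applies it locally.
    serial = primary.apply_write(data_id)

    # 5-6. The primary forwards the request to the secondaries, which apply the
    #      mutation in the same serial order and acknowledge completion.
    failed = [s for s in secondaries if not s.apply_write(data_id, serial=serial)]

    # 7. The primary replies to the client; on error the write may have
    #    succeeded at only some replicas, so the client retries the mutation.
    if failed:
        raise RuntimeError("mutation failed at some secondaries; client retries")
```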

11 Replica Placement
- Multi-level distribution of replicas
- The chunk replica placement policy maximizes data reliability and availability and maximizes network bandwidth utilization
- Place replicas across different machines and across different racks

Now that we have seen how data mutation works, let us look at replication. The master makes placement decisions, creates new chunks and hence their replicas, and coordinates various system-wide activities to keep chunks fully replicated, to balance load across all chunkservers, and to reclaim unused storage. A GFS cluster is highly distributed: it has hundreds of chunkservers spread across many machine racks, accessed in turn by many clients that are also spread across racks, and communication between machines on different racks may cross one or more network switches. This multi-level distribution presents several challenges for distributing data in a scalable, reliable, and available manner. The chunk placement policy serves two purposes: maximizing data reliability and availability, and maximizing network bandwidth utilization. It is therefore important to place replicas not only on different machines but also on different racks, to guard against both machine and rack failures. This also allows reads of a chunk to exploit the aggregate bandwidth of multiple racks.
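One way to picture the rack-aware policy is a toy placement function like the one below. The preference for distinct racks and machines, and the default of three replicas, come from the presentation; the concrete selection logic is only a sketch, not the paper's actual policy.

```python
import random


def place_replicas(chunkservers_by_rack, num_replicas=3):
    """Toy rack-aware placement: spread replicas across racks first,
    then across remaining machines (illustrative only)."""
    placement = []
    racks = list(chunkservers_by_rack)
    random.shuffle(racks)
    # One replica per rack first, to survive rack failures and to exploit
    # the aggregate read bandwidth of several racks.
    for rack in racks:
        if len(placement) == num_replicas:
            return placement
        placement.append(random.choice(chunkservers_by_rack[rack]))
    # Fewer racks than replicas: fill up with distinct remaining machines.
    remaining = [cs for servers in chunkservers_by_rack.values()
                 for cs in servers if cs not in placement]
    random.shuffle(remaining)
    placement.extend(remaining[:num_replicas - len(placement)])
    return placement


# Example: two racks and three replicas -> both racks get at least one copy.
print(place_replicas({"rack-a": ["cs1", "cs2"], "rack-b": ["cs3", "cs4"]}))
```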

12 Creation, Re-replication, Rebalancing
- The master creates a chunk and chooses the placement of its initial replicas
- Re-replication takes place as soon as the number of replicas falls below a user-specified level; the master picks the highest-priority chunk and "clones" it
- The master rebalances replicas periodically for better disk space and load balancing

Chunk replicas are created for three reasons: chunk creation, re-replication, and rebalancing. When the master creates a chunk, it chooses where to place the initial replicas based on (1) chunkserver disk utilization, to equalize disk utilization across chunkservers over time; (2) limiting the number of recent creations on each chunkserver, to avoid the heavy write traffic that follows, since chunks are created on demand by writes; and (3) spreading the replicas of a chunk across racks, as described earlier. The master re-replicates a chunk as soon as its number of replicas falls below a user-specified goal, which can happen because a chunkserver becomes unavailable, a chunkserver reports that one of its replicas is corrupt, or the replication goal is increased. Re-replication is performed by cloning chunks in priority order, where priority depends on factors such as how many replicas the chunk has lost and whether it belongs to a live file or a recently deleted one (a sketch of such a priority function follows below). Finally, the master periodically rebalances replicas: it examines the current replica distribution and moves replicas for better disk space and load balancing.
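As referenced above, a toy priority function might look like this. The factors (how many replicas were lost, live vs. deleted file, whether the chunk is blocking client progress) are those described in the paper; the numeric weights are invented for illustration.

```python
def clone_priority(missing_replicas, is_live_file, blocking_client):
    """Toy scoring of which chunk to re-replicate first (weights invented)."""
    score = missing_replicas * 10          # chunks that lost more replicas come first
    if is_live_file:
        score += 5                         # live files beat recently deleted ones
    if blocking_client:
        score += 20                        # chunks blocking client progress get a boost
    return score


# The master would "clone" the highest-scoring chunk first.
chunks = [
    {"id": "c1", "missing": 2, "live": True,  "blocking": False},
    {"id": "c2", "missing": 1, "live": False, "blocking": False},
]
chunks.sort(key=lambda c: clone_priority(c["missing"], c["live"], c["blocking"]),
            reverse=True)
print([c["id"] for c in chunks])   # ['c1', 'c2']
```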

13 Garbage Collection
- Storage is reclaimed lazily for deleted files, at both the file and chunk levels
- Deleted files are first renamed to a hidden name that includes a timestamp
- The master scans for and removes any hidden files that have existed for more than a configurable period; the in-memory metadata for removed files is erased
- Orphaned chunks are identified during a scan of the chunk namespace and their metadata is erased; chunkservers are notified via HeartBeat messages and are then free to delete the replicas

The master performs garbage collection and reclaims storage lazily for deleted files and chunks. When an application deletes a file, the file is first renamed to a hidden name that includes a timestamp. After a configurable period, the master's regular scan removes these hidden files along with their metadata; until then, the file can still be restored. This mechanism is simple, easy to implement, and guards against accidental deletion, although it is harder to fine-tune when storage is tight. The master identifies orphaned chunks during its periodic scan of the chunk namespace and erases their metadata; it notifies chunkservers of the orphaned chunks in the replies to their regular HeartBeat messages, and the chunkservers are then free to delete those replicas.
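A minimal sketch of the lazy-deletion idea, using a plain dict as the namespace. The hidden-name convention shown here is invented; the retention interval is configurable in GFS, and the three-day figure used below is the default value reported in the paper.

```python
import time

HIDDEN_RETENTION = 3 * 24 * 3600  # three days (the paper's default; configurable)


def delete_file(namespace, path):
    """Lazy deletion sketch: the file is only renamed, not removed."""
    hidden_name = f".deleted.{path}.{int(time.time())}"
    namespace[hidden_name] = namespace.pop(path)  # still restorable by renaming back


def garbage_collect(namespace, now=None):
    """During its regular namespace scan, the master removes hidden files that
    have been deleted for longer than the retention interval."""
    now = now or time.time()
    for name in list(namespace):
        if name.startswith(".deleted."):
            deleted_at = int(name.rsplit(".", 1)[-1])
            if now - deleted_at > HIDDEN_RETENTION:
                # In-memory metadata is erased; the file's chunks become orphaned
                # and chunkservers learn of this via HeartBeat replies.
                del namespace[name]
```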

14 Stale Replica Deletion
- Chunk replicas become stale if a chunkserver fails and misses mutations to the chunk
- The chunk version number is used to distinguish up-to-date replicas from stale ones
- Stale replicas are removed during the master's regular garbage collection

A replica becomes stale if its chunkserver fails and misses mutations while it is down. Stale replicas are detected with the help of chunk version numbers: whenever the master grants a new lease on a chunk, it increases the chunk's version number and informs the up-to-date replicas, and both the master and the replicas record the new version number. If a replica is unavailable because its chunkserver has failed, its version number does not advance, so when the chunkserver restarts and reports its chunks, the master detects the stale replica and removes it during its regular garbage collection. If the master ever sees a version number higher than the one in its own records, it assumes it failed while granting the lease and adopts the higher version. Chunk version numbers are also used in client interactions and in the cloning operation, so that stale replicas are not read from.
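A sketch of this version-number bookkeeping, with invented names and plain dicts standing in for the master's tables.

```python
def grant_lease_and_bump_version(master_versions, replica_versions,
                                 chunk, up_to_date_replicas):
    """Sketch of stale-replica detection via chunk version numbers (names invented)."""
    # Whenever the master grants a new lease it increases the chunk version and
    # tells the up-to-date replicas; a replica that is down misses the update.
    master_versions[chunk] += 1
    for replica in up_to_date_replicas:
        replica_versions[(chunk, replica)] = master_versions[chunk]


def is_stale(master_versions, chunk, reported_version):
    # On restart a chunkserver reports its chunks and versions; anything older
    # than the master's record is stale and will be garbage collected.  If the
    # report is *newer*, the master assumes it failed while granting the lease
    # and adopts the higher version instead.
    if reported_version > master_versions[chunk]:
        master_versions[chunk] = reported_version
        return False
    return reported_version < master_versions[chunk]
```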

15 Master Replication
- The master's state, its operation log, and its checkpoints are replicated on multiple machines
- Since metadata is maintained in memory, a failed master process can restart almost instantly
- When the master's machine or disk fails, monitoring infrastructure outside GFS starts a new master process elsewhere
- Shadow masters provide read-only access to the file system when the primary master is down

The master's state, its operation log, and its checkpoints are replicated on multiple machines. Since the metadata is kept in memory, when the master process fails it can restart almost instantly. When the master's machine or disk fails, a monitoring infrastructure outside GFS starts a new master process elsewhere. In addition, shadow masters provide read-only access to the file system when the primary master is down, so that read operations on files, and hence searching, can continue without hiccups; the shadows lag the primary slightly. Only one master process is in charge of mutations and background activities at any time.

16 Microbenchmarks
- Aggregate read rate is 75% of the theoretical limit
- Aggregate write rate is 50% of the theoretical limit
- For record appends, performance is limited by the network bandwidth of the chunkservers that store the last chunk of the file, independent of the number of clients

The performance of a GFS cluster was measured on a test setup consisting of one master, two master replicas, 16 chunkservers, and 16 clients. All machines had dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch. The 19 GFS server machines were connected to one switch and the clients to another, and the two switches were connected by a 1 Gbps link.
Reads: N clients read simultaneously from the file system; each client reads a randomly selected 4 MB region from a 320 GB file set, repeated 256 times so that each client reads 1 GB of data. The aggregate read rate reached about 94 MB/s, roughly 75% of the theoretical limit; efficiency drops as more clients read simultaneously from the same chunkserver.
Writes: N clients write simultaneously to N distinct files, each client writing 1 GB of data in a series of 1 MB writes. The aggregate write rate reached about 35 MB/s, roughly 50% of the theoretical limit. The authors attribute this to the same effect as for reads (multiple clients hitting the same chunkservers) and to their network stack interacting poorly with the pipelining scheme used to push data to replicas; since this does not significantly affect aggregate write rates in practice, it does not bother Google.
Record appends: N clients append simultaneously to a single file, so performance is limited by the network bandwidth of the chunkservers storing the last chunk of the file, independent of the number of clients. This does not matter much in practice, because typically many clients append to many different files, and a client can make progress on one file while the chunkservers holding another file are busy. A back-of-the-envelope check of the theoretical limits follows below.
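As referenced above, a quick check of those percentages, assuming (as the benchmark setup suggests) that reads are capped by the 1 Gbps inter-switch link and that writes must push each byte to three of the 16 chunkservers, each behind a 12.5 MB/s (100 Mbps) link. This is my own arithmetic, not a figure from the presentation.

```python
# Back-of-the-envelope theoretical limits for the microbenchmark setup.
read_limit = 1000 / 8            # 1 Gbps inter-switch link -> 125 MB/s
per_machine = 100 / 8            # 100 Mbps NIC -> 12.5 MB/s
write_limit = 16 * per_machine / 3   # each byte lands on 3 of 16 chunkservers -> ~67 MB/s

print(94 / read_limit)    # ~0.75: measured 94 MB/s is about 75% of the read limit
print(35 / write_limit)   # ~0.52: measured 35 MB/s is about 50% of the write limit
```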

17 Real World Clusters (1)
- Storage: file data is replicated three times
- Metadata: chunkservers store checksums and chunk version numbers; the master stores roughly 100 bytes per file

The second set of measurements examines two real-world clusters that are representative of several other clusters at Google. Cluster A is regularly used for research and development by over a hundred engineers; cluster B is primarily used for production data processing, where most processes run automatically and require only occasional human intervention. The table in the paper shows the storage characteristics of the clusters: each cluster has hundreds of chunkservers, and since file data is replicated three times, clusters A and B store 18 TB and 52 TB of file data, respectively. The chunkservers store tens of GB of metadata in aggregate, most of it checksums and chunk version numbers, while the master stores far less, approximately 100 bytes per file on average. Thus the size of a cluster is not limited by the size of the master's memory.

18 Real World Clusters (2)
- Reads and writes: read rates are much higher than write rates
- Master load: a few hundred operations per second
- Recovery time: chunks containing 600 GB of data were restored in 23.2 minutes, an effective replication rate of 440 MB/s; 266 under-replicated chunks were re-replicated within 2 minutes

The table in the paper shows the read and write rates of the two clusters. There are significantly more reads than writes because both clusters were in the middle of heavy read activity. The rate of operations sent to the master was around 200 to 500 operations per second, a rate the master can easily keep up with, so the master is not a bottleneck for the system. Two experiments were performed to test recovery in GFS clusters. In the first, a chunkserver whose chunks held 600 GB of data was killed; to limit the impact on running applications, the cluster was limited to 91 concurrent clonings, and all chunks were restored in 23.2 minutes at an effective replication rate of 440 MB/s. In the second experiment, two chunkservers holding 660 GB of data were killed; this failure left 266 chunks with only a single replica, so those chunks were cloned at high priority and restored within 2 minutes. Recovery is therefore fast. To recap, we have seen that Google has built a new file system that meets its goals of scalability, fast recovery, availability, high aggregate throughput, and fault tolerance.
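A quick sanity check of the quoted recovery rate (my own arithmetic, taking 1 GB as 1000 MB):

```python
# 600 GB restored in 23.2 minutes:
print(600 * 1000 / (23.2 * 60))   # ~431 MB/s, in line with the quoted ~440 MB/s
```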

19 Evaluation (1)
- A new file system with an effective design, built using established techniques
- The file system is specific to Google's applications, but some ideas may apply to data-processing tasks of similar scale
- The whole file system depends on a single master, since the global view of the system exists only there; if the master goes down, the entire system halts for some time
- The lack of detail about the actual size of the file system leaves room for speculation

Effective: we know the file system works because we use Google every day and there have been hardly any widely reported failures. Established techniques: the cluster architecture, replication, snapshots using copy-on-write, journaling (the operation log) for recovery, and checksums for data integrity. Newer ideas: the particular cluster organization (a single master, many chunkservers, many clients), chunks, and separating data flow from control flow. There is no record of the number of racks, machines, or processors; since GFS is used in a commercial system, this information may be proprietary.

20 Evaluation (2) Relation to other papers read
- Follows the BASE data semantics (Basically Available, Soft state, Eventual consistency) described in the "Cluster-Based Scalable Network Services" paper
- Shows the importance of metadata and metadata update operations, which play a key role in recovery
- Uses a journaling technique (the operation log) for metadata updates

The paper was presented at the 19th ACM Symposium on Operating Systems Principles in October 2003, at the same time as the Nooks paper.

21 Questions?

