
1 Google File System. Paper by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. Presented by Vijay Reddy Mara and Radhika Malladi.

2 Overview: Introduction, Design Overview, System Interactions, Master Operations, Fault Tolerance and Diagnosis, Measurements, Experiences, Conclusion.

3 Introduction 1. GFS was designed to meet the demands of Google's data processing workloads; chunk data is stored as ordinary Linux files on commodity machines. 2. Component failures are the norm, so the system must provide constant monitoring, error detection, fault tolerance, and automatic recovery. 3. Files are huge: the system stores a few million files, each typically 100 MB or larger. 4. Most files are mutated by appending new data rather than overwriting existing data. 5. Co-designing the applications and the file system API increases flexibility.

4 Design Overview 1. Assumptions: The system is built from many inexpensive commodity components that often fail. It stores a modest number of large files, each typically 100 MB or larger. The workloads consist primarily of two kinds of reads: i. large streaming reads and ii. small random reads. The workloads also have many large sequential writes that append data to files; small writes at arbitrary positions are supported but need not be efficient. The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. High sustained bandwidth is more important than low latency.

5 GFS Semantics Standard operations: create, delete, open, close, read, and write. GFS-specific operations: atomic record append and snapshot.
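To make the two groups of operations concrete, here is a minimal sketch of what a client-facing interface could look like; the class and method names are illustrative, not the actual GFS client library:

```python
class GFSClient:
    """Hypothetical client interface; names are illustrative only."""

    # Standard file operations:
    def create(self, path: str) -> None: ...
    def delete(self, path: str) -> None: ...
    def open(self, path: str) -> int: ...
    def close(self, handle: int) -> None: ...
    def read(self, handle: int, offset: int, length: int) -> bytes: ...
    def write(self, handle: int, offset: int, data: bytes) -> None: ...

    # GFS-specific operations:
    def record_append(self, handle: int, data: bytes) -> int:
        """Append atomically at an offset GFS chooses; return that offset."""

    def snapshot(self, source: str, target: str) -> None:
        """Make a low-cost copy of a file or directory tree."""
```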

6 GFS Architecture A single master, multiple chunk servers, and multiple clients.

7 Master maintains all file system metadata: access control information, the mapping from files to chunks, and the current locations of chunks. It communicates periodically with each chunk server (via HeartBeat messages) to give it instructions and to track which chunks it holds.

8 Chunk servers: files are divided into fixed-size chunks; each chunk is identified by an immutable 64-bit chunk handle assigned by the master; chunks are stored on local disks as Linux files. Clients: ask the master which chunk server holds the current lease, then perform read and write operations directly against the chunk servers.
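A minimal sketch of the state these two slides describe, with illustrative field names (the paper does not specify the master's structures at this level):

```python
import secrets
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024       # fixed chunk size (64 MB, next slide)

@dataclass
class ChunkInfo:
    handle: int                      # immutable 64-bit id assigned by master
    version: int = 1                 # bumped when a new lease is granted
    locations: list = field(default_factory=list)  # chunk servers w/ replicas

class MasterMetadata:
    """Sketch of the master's in-memory state."""
    def __init__(self):
        self.namespace = {}          # full pathname -> access control info
        self.file_chunks = {}        # pathname -> ordered chunk handles
        self.chunks = {}             # chunk handle -> ChunkInfo

    def new_chunk_handle(self) -> int:
        # 64-bit handle; uniqueness is assumed for the sketch
        return secrets.randbits(64)
```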

9 3. Architecture: [Figure: GFS architecture, showing clients, the single master, and chunk servers.]

10 5. Chunk Size: 64 MB, much larger than a typical file system block size. Advantages of a large chunk size: it reduces the clients' need to interact with the master, since one chunk-location request covers many operations, and a client can perform many operations on a given chunk over one connection. Disadvantage of a large chunk size (even with lazy space allocation): a small file occupies a single chunk, and if many clients access the same small file simultaneously, the chunk servers storing it become hot spots.
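One reason the large chunk size reduces master traffic is that clients translate byte offsets into chunk indices themselves, so a single location request covers a wide range of a file; a sketch:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB

def chunk_index(byte_offset: int) -> int:
    """Translate a byte offset within a file into a chunk index.
    The client computes this locally, so one master round trip can
    resolve operations spanning many megabytes of the same chunk."""
    return byte_offset // CHUNK_SIZE

assert chunk_index(0) == 0
assert chunk_index(CHUNK_SIZE - 1) == 0
assert chunk_index(200 * 1024 * 1024) == 3   # 200 MB falls in the 4th chunk
```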

11 6. Metadata The master stores three types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas.

12 The master stores metadata in in-memory data structures, which makes master operations fast; in return, the capacity of the whole system (the total number of chunks) is limited by the master's memory. Chunk locations: the master does not keep a persistent record of which chunk servers hold a replica of a given chunk; it polls chunk servers at startup and stays current through HeartBeat messages. Operation log: a historical record of critical metadata changes; a change is invisible to clients until its log record is durable, and the log is replicated on multiple remote machines.
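A sketch of the flush-before-acknowledge behavior of the operation log, using a hypothetical JSON-lines record format:

```python
import json
import os

class OperationLog:
    """Sketch of an append-only operation log: a mutation is made
    visible to clients only after its record is durable."""
    def __init__(self, path: str):
        self.f = open(path, "ab")

    def record(self, mutation: dict) -> None:
        self.f.write(json.dumps(mutation).encode() + b"\n")
        self.f.flush()
        os.fsync(self.f.fileno())   # durable before acknowledging the client

def replay(path: str, apply) -> None:
    """On restart, the master replays the log to rebuild its state."""
    with open(path, "rb") as f:
        for line in f:
            apply(json.loads(line))
```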

13 Consistency model The state of a file region after a data mutation depends on whether the mutation succeeds or fails and whether there are concurrent mutations. A file region is consistent if all clients see the same data, regardless of which replica they read from. A region is defined after a mutation if it is consistent and clients see what the mutation wrote in its entirety. A region is consistent but undefined when concurrent mutations all succeed: every client sees the same data, but it may not reflect what any single mutation has written.

14 System Interactions How the client, master, and chunk servers interact to implement data mutations, atomic record append, and snapshot.

15 1. Leases and mutation order Mutation: an operation that changes the contents or metadata of a chunk, such as a write or an append. Lease: the mechanism that maintains a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, called the primary. The primary picks a serial order for all mutations to the chunk, and all replicas apply mutations in that order.
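A sketch of a chunk lease, assuming the 60-second initial timeout mentioned in the paper; the class itself is illustrative:

```python
import time

LEASE_SECONDS = 60   # initial lease timeout from the paper

class ChunkLease:
    """Sketch: the master grants a lease to one replica (the primary);
    the primary then serializes all mutations on the chunk."""
    def __init__(self, primary: str):
        self.primary = primary
        self.expires = time.monotonic() + LEASE_SECONDS
        self.next_serial = 0

    def valid(self) -> bool:
        return time.monotonic() < self.expires

    def assign_serial(self) -> int:
        # Every replica applies mutations in this serial order, so all
        # replicas end up with the same mutation sequence.
        serial = self.next_serial
        self.next_serial += 1
        return serial
```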

16 Atomic record appends [Figure: replicas A, B, and C each hold records x, y, and z. A record append of record 1 succeeds on A and B but leaves garbage (g) on C at offset 4; the client retries, and record 1 lands at offset 5 on all three replicas, leaving duplicates on A and B and a garbage record on C. The record is appended atomically at least once; replicas may contain padding or duplicates from failed attempts.]
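A sketch of the primary's side of record append. The quarter-chunk record limit and the pad-then-retry behavior follow the paper; the exception name and the assumption that replicas are in sync are simplifications for the sketch:

```python
CHUNK_SIZE = 64 * 1024 * 1024
MAX_RECORD = CHUNK_SIZE // 4          # records are capped at 1/4 of a chunk

class RetryOnNextChunk(Exception):
    """Hypothetical signal telling the client to retry on a fresh chunk."""

def record_append(replicas: list, record: bytes) -> int:
    """The primary picks one offset (its current end of chunk), and the
    record is written at that same offset on every replica (bytearrays)."""
    assert len(record) <= MAX_RECORD
    offset = len(replicas[0])                    # replicas[0] is the primary
    if offset + len(record) > CHUNK_SIZE:
        for r in replicas:                       # pad the rest of the chunk
            r.extend(b"\x00" * (CHUNK_SIZE - len(r)))
        raise RetryOnNextChunk                   # client retries on next chunk
    for r in replicas:                           # assumed in sync for clarity
        r.extend(record)
    return offset                                # offset returned to client
```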

17 Snapshot Makes a copy of a file or a directory tree almost instantaneously while minimizing interruptions to ongoing mutations. It is implemented with copy-on-write: the master revokes outstanding leases, logs the operation, and duplicates the metadata; the affected chunks are copied lazily, on the first write after the snapshot.
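A reference-counting sketch of copy-on-write snapshots; the `clone` parameter stands in for the step where chunk servers copy a chunk locally:

```python
class SnapshotState:
    """Sketch: a snapshot duplicates only metadata and bumps reference
    counts; a shared chunk is copied on the first write that touches it."""
    def __init__(self):
        self.file_chunks = {}   # path -> list of chunk handles
        self.refcount = {}      # chunk handle -> reference count

    def snapshot(self, src: str, dst: str) -> None:
        chunks = self.file_chunks[src]
        self.file_chunks[dst] = list(chunks)       # metadata copy only
        for h in chunks:
            self.refcount[h] = self.refcount.get(h, 1) + 1

    def before_write(self, path: str, idx: int, clone) -> int:
        h = self.file_chunks[path][idx]
        if self.refcount.get(h, 1) > 1:            # chunk is shared: copy it
            new_h = clone(h)                       # chunk servers copy locally
            self.refcount[h] -= 1
            self.refcount[new_h] = 1
            self.file_chunks[path][idx] = new_h
            h = new_h
        return h                                   # writes now go to the copy
```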

18 Here the lease is with primary chunk server C. [Figure: the client and the master, with one primary chunk server and two secondary chunk servers; each chunk server holding chunk C creates a local copy C' before the mutation proceeds.]
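For context, the paper's write control flow among these participants can be sketched as below; the method names on the `master` and chunk server objects are hypothetical:

```python
def client_write(master, chunk_handle, offset, data):
    """Sketch of the write control flow (paper's steps in comments)."""
    primary, secondaries = master.lease_holder(chunk_handle)  # steps 1-2
    for server in [primary] + secondaries:                    # step 3:
        server.push_data(data)           # data flows to all replicas first
    serial = primary.apply(offset, data)       # step 4: primary orders it
    for server in secondaries:                 # step 5: same serial order
        server.apply(offset, data, serial)
    # steps 6-7: secondaries ack the primary; primary replies to the client
```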

19 Master Operation The master executes all namespace operations. It also manages chunk replicas throughout the system.

20 1. Namespace management and locking For an operation on /d1/d2/d3/.../dn/leaf, the master acquires read locks on each prefix directory (/d1, /d1/d2, ..., /d1/d2/d3/.../dn) and a read or write lock on the full pathname /d1/d2/d3/.../dn/leaf.

21 Example: /home/user/foo is the file to be created while /home/user is snapshotted to /save/user. The snapshot acquires read locks on /home and /save and write locks on /home/user and /save/user. The file creation acquires read locks on /home and /home/user and a write lock on /home/user/foo. The two operations conflict on /home/user (read lock versus write lock), so they are properly serialized.
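A sketch of computing the lock set for an operation; acquiring locks in a consistent (here, sorted) order prevents deadlock, as the paper notes:

```python
def locks_needed(path: str, write: bool):
    """Read locks on every proper prefix of the path, plus a read or
    write lock on the full pathname, in a consistent total order."""
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    full = "/" + "/".join(parts)
    return sorted([(p, "read") for p in prefixes] +
                  [(full, "write" if write else "read")])

# The file creation needs a read lock on /home/user, while the snapshot
# takes a write lock on /home/user, so the two serialize correctly:
print(locks_needed("/home/user/foo", write=True))
# [('/home', 'read'), ('/home/user', 'read'), ('/home/user/foo', 'write')]
```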

22 2. Replica Placement Replica placement serves two purposes: 1. maximize data reliability and availability, and 2. maximize network bandwidth utilization. Replicas are spread not only across machines but also across racks, which ensures that some replicas of a chunk survive even if an entire rack is damaged or offline.

23 3. Creation, Re-replication, Rebalancing Chunk replicas are created for three reasons: chunk creation, re-replication, and rebalancing (a placement sketch follows the rebalancing slide). 1. Chunk creation: place new replicas on chunk servers with below-average disk space utilization; limit the number of recent creations on each chunk server; spread replicas of a chunk across racks.

24 2. Re-replication: triggered when the number of available replicas falls below the goal, e.g. a chunk server becomes unavailable, a chunk server reports that a replica may be corrupted, one of its disks is disabled because of errors, or the replication goal is increased. Chunks are re-replicated in priority order: the master picks the highest-priority chunks and clones them. Each chunk server limits the amount of bandwidth it spends on cloning.

25 3. Rebalancing: the master periodically moves replicas for better disk space and load balancing, and it generally prefers to remove replicas from chunk servers with below-average free space.
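A sketch of the creation-time placement policy from slide 23; the server records, the recent-creation threshold, and the greedy rack spreading are illustrative choices, not the paper's actual heuristics:

```python
def pick_creation_targets(servers, num_replicas=3):
    """Prefer chunk servers with below-average disk utilization, skip
    servers with too many recent creations, and spread across racks.
    `servers` is a list of hypothetical dict records."""
    avg_util = sum(s["disk_util"] for s in servers) / len(servers)
    candidates = sorted(
        (s for s in servers
         if s["disk_util"] <= avg_util and s["recent_creations"] < 10),
        key=lambda s: s["disk_util"])
    chosen, racks = [], set()
    for s in candidates:
        if s["rack"] in racks:        # sketch: insist on distinct racks
            continue
        chosen.append(s)
        racks.add(s["rack"])
        if len(chosen) == num_replicas:
            break
    return chosen
```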

26 4. Garbage collection Mechanism: when a file is deleted, the master logs the deletion immediately but does not reclaim resources right away. The file is renamed to a hidden name that includes the deletion timestamp. During the master's regular namespace scan, hidden files older than a grace period are removed and their in-memory metadata erased; chunk servers later delete replicas of chunks the master no longer knows about.
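A sketch of the rename-then-scan mechanism, assuming the three-day default grace period from the paper and a hypothetical hidden-name scheme:

```python
import time

GRACE_SECONDS = 3 * 24 * 3600   # default grace period: three days

def delete_file(namespace, path):
    """Deletion just renames the file to a hidden, timestamped name."""
    hidden = f"/.deleted/{path.strip('/').replace('/', '_')}.{int(time.time())}"
    namespace[hidden] = namespace.pop(path)
    return hidden

def master_scan(namespace, now=None):
    """Regular namespace scan: hidden files older than the grace period
    are erased; their chunks become orphans and are reclaimed when
    chunk servers report replicas the master no longer knows."""
    now = now or time.time()
    for name in list(namespace):
        if name.startswith("/.deleted/"):
            ts = int(name.rsplit(".", 1)[1])    # timestamp from hidden name
            if now - ts > GRACE_SECONDS:
                del namespace[name]             # metadata erased
```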

27 Garbage collection Advantages: 1. simple and reliable; 2. done in batches, so the cost is amortized; 3. the delay in reclaiming storage provides a safety net against accidental, irreversible deletion. Disadvantage: the delay hinders users' efforts to fine-tune usage when storage is tight.

28 5. Stale Replica Detection A replica becomes stale when its chunk server fails and misses mutations to the chunk while it is down. The master maintains a chunk version number for each chunk; whenever it grants a new lease on the chunk, it increases the version number and informs the up-to-date replicas. The master and chunk servers verify the version number when performing an operation, so clients always access up-to-date data; stale replicas are removed during regular garbage collection.
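A sketch of the version-number protocol; the field names are illustrative:

```python
def grant_lease(chunk: dict) -> None:
    """Bumping the chunk version number on each new lease lets the
    master and chunk servers detect replicas that missed mutations."""
    chunk["version"] += 1
    for replica in chunk["replicas"]:
        replica["version"] = chunk["version"]   # live replicas follow along

def is_stale(master_version: int, replica_version: int) -> bool:
    # A replica whose server was down during the lease grant still
    # carries the old version; it is skipped and later collected.
    return replica_version < master_version
```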

29 Fault Tolerance And Diagnosis 1. High Availability: Fast recovery: the master and chunk servers are designed to restore their state and start in seconds. Chunk replication: each chunk is replicated on multiple chunk servers on different racks; the default is three replicas. Master replication: the master's state is replicated for reliability, and shadow masters provide read-only access to the file system even when the primary master is down.

30 2. Data Integrity Each chunk server uses checksumming to detect corruption of stored data: chunks are broken into 64 KB blocks, each with a 32-bit checksum. Reads: the chunk server verifies the checksums of the blocks that overlap the requested range before returning any data; if a block doesn't match, an error is reported to the requester. Reads aligned at checksum block boundaries are cheaper to verify. Writes: checksum computation is optimized for writes that append to the end of a chunk (next slide).

31 Appends: new checksums are computed for the checksum blocks filled by the append, and the last partial block's checksum is updated incrementally, so corruption is still detected. Overwrites: the chunk server must first read and verify the first and last blocks of the range being overwritten, then perform the write, and finally compute and record the new checksums. During idle periods, chunk servers scan and verify inactive chunks to detect corruption in rarely read data; the master replaces a corrupted replica with a new uncorrupted replica.
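A sketch of the read-side verification, with CRC-32 standing in for GFS's (unspecified) 32-bit checksum:

```python
import zlib

BLOCK = 64 * 1024   # each 64 KB block carries its own 32-bit checksum

def checksums(chunk: bytes) -> list:
    """Compute one checksum per 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def read_verified(chunk: bytes, sums: list, offset: int, length: int) -> bytes:
    """Verify every checksum block overlapping the requested range
    before returning any data to the requester."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk[b * BLOCK:(b + 1) * BLOCK]) != sums[b]:
            raise IOError(f"corrupt checksum block {b}")  # report to master
    return chunk[offset:offset + length]
```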

32 Experiences Operational and technical issues: Data corruption caused by a Linux kernel/driver problem; addressed with checksums and a modification of the kernel. In Linux 2.2 the cost of fsync() was proportional to the size of the whole file rather than the modified portion; migrating to Linux 2.4 fixed this. A single reader-writer lock on the address space blocked threads during mmap() activity; replacing mmap() with pread() avoided the contention.
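For illustration, `os.pread` in Python exposes the same positioned-read semantics: it reads at an explicit offset without touching a shared file position, which is what made pread() friendlier than mmap() under that single address-space lock:

```python
import os

fd = os.open("/etc/hostname", os.O_RDONLY)  # any readable file works here
data = os.pread(fd, 64, 0)   # read up to 64 bytes at offset 0, no seek needed
os.close(fd)
print(data)
```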

33 Conclusion 1. GFS provides fault tolerance by constant monitoring, replication of crucial data, and fast, automatic recovery; chunk replication tolerates chunk server failures, and checksumming detects data corruption. 2. The GFS design delivers high aggregate throughput to many concurrent readers and writers performing a variety of tasks.

