2Consistency Model Why Append Only? Overwriting existing data is not state safe.We cannot read data while it is being modified.A customized ("Atomized") append is implemented by the system that allows for concurrent read/write, write/write, and read/write/write events.
3Consistency Model File namespace mutation (update) atomic File Region Consistent if all clients see same dataRegion – defined after file data mutation (all clients see writes in entirety, no interference from writes)Undefined but Consistent - concurrent successful mutations – all clients see same data, but not reflect what any one mutation has written, fragments of updatesInconsistent – if failed mutation (retries)
4ConsistencyRelaxed consistency can be accommodated – relying on appends instead of overwritesAppending more efficient/resilient to failure than random writesCheckpointing allows restart incrementally and no processing of incomplete successfully written data
5Namespace Management and Locking Master ops can take time, e.g. revoking leasesallow multiple ops at same time, use locks over regions for serializationGFS does not have per directory data structure listing all filesInstead lookup table mapping full pathnames to metadataEach name in tree has R/W lockIf accessing: /d1/d2/ ../dn/leaf, R lock on /d1, /d1/d2, etc., W lock on /d1/d2 …/leaf
6Locking Allows concurrent mutations in same directory R lock on directory name prevents directory from being deleted, renamed or snapshottedW locks on file names serialize attempts to create file with same name twiceR/W objects allocated lazily, delete when not in useLocks acquired in total order (by level in tree) prevents deadlocks
7Fault Tolerance Fast Recovery Chunk Replication Master/chunkservers restore state and start in seconds regardless of how terminatedAbnormal or normalChunk Replication
8Data Integrity Checksumming to detect corruption of stored data Impractical to compare replicas across chunkservers to detect corruptionDivergent replicas may be legalChunk divided into 64KB blocks, each with 32 bit checksumsChecksums stored in memory and persistently with logging
9Data Integrity Before read, checksum If problem, return error to requestor and reports to masterRequestor reads from replica, master clones chunk from other replica, delete bad replicaMost reads span multiple blocks, checksum small part of itChecksum lookups done without I/OChecksum computation optimized for appendsIf partial corrupted, will detect with next readDuring idle, chunkservers scan and verify inactive chunks
10Garbage Collection Lazy at both file and chunk levels When delete file, file renamed to hidden name including delete timestampDuring regular scan of file namespacehidden files removed if existed > 3 daysUntil then can be undeletedWhen removed, in-memory metadata erasedOrphaned chunks identified and erasedWith HeartBeat message, chunkserver/master exchange info about files, master tells chunkserver about files it can delete, chunkserver free to delete
11Garbage Collection Easy in GFS All chunks in file-to-chunk mappings of masterAll chunk replicas are Linux files under designated directories on each chunkserverEverything else garbage
12ConclusionsGFS – qualities essential for large-scale data processing on commodity hardwareComponent failures the norm rather than exceptionOptimize for huge files appended toFault tolerance by constant monitoring, replication, fast/automatic recoveryHigh aggregate throughputSeparate file system controlLarge file size
13GFS In the WildGoogle currently has multiple GFS clusters deployed for different purposes.The largest currently implemented systems have over 1000 storage nodes and over 300 TB of disk storage.These clusters are heavily accessed by hundreds of clients on distinct machines.Has Google made any adjustments?