1 THE GOOGLE FILE SYSTEM
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Google)
Presented by Jaehyun Han

2 OUTLINE
• Introduction
• Design Overview
• System Interactions
• Master Operation
• Fault Tolerance and Diagnosis
• Measurements
• Conclusions

3 INTRODUCTION
• Motivation
◦ Rapidly growing demands of Google's data processing workloads
• Design choices
◦ Component failures are the norm, not the exception
◦ Files are huge by traditional standards
◦ Most files are mutated by appending rather than overwriting

4 DESIGN OVERVIEW

5 ASSUMPTIONS
• Built from inexpensive commodity components that often fail
• Stores a modest number of large files, typically 100 MB or larger
• Workload: large streaming reads and small random reads
• Many large, sequential writes that append data to files
• Atomicity with minimal synchronization overhead is essential for concurrent appends
• High sustained bandwidth is more important than low latency

6 INTERFACE
• Files are organized hierarchically in directories and identified by pathnames
• GFS does not implement a standard API such as POSIX
• Usual operations: create, delete, open, close, read, write
• GFS-specific operations
◦ snapshot – makes a copy of a file or a directory tree at low cost
◦ record append – atomically appends data to a file
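
The paper does not publish concrete client bindings, so the following is only a rough sketch of what a GFS-style client interface might look like in Python; every name and signature here is hypothetical.

    # Hypothetical sketch of a GFS-style client API; names and signatures are
    # illustrative, not the actual Google-internal interface.
    from abc import ABC, abstractmethod

    class GFSClient(ABC):
        @abstractmethod
        def create(self, path: str) -> None: ...          # create a new file
        @abstractmethod
        def delete(self, path: str) -> None: ...          # schedule the file for lazy reclamation
        @abstractmethod
        def read(self, path: str, offset: int, length: int) -> bytes: ...
        @abstractmethod
        def write(self, path: str, offset: int, data: bytes) -> None: ...
        @abstractmethod
        def record_append(self, path: str, data: bytes) -> int:
            """Append atomically; returns the offset GFS chose for the record."""
        @abstractmethod
        def snapshot(self, src_path: str, dst_path: str) -> None: ...  # copy-on-write copy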

7 ARCHITECTURE
(figure: GFS architecture diagram; source: http://en.wikipedia.org/wiki/Google_File_System)

8 SINGLE MASTER

9 CHUNK SIZE
• 64 MB chunk size – much larger than typical file system block sizes
• Advantages
◦ Reduces client–master interaction
◦ Reduces network overhead
◦ Reduces the size of metadata – less than 64 bytes per chunk
• Disadvantages
◦ A small file can become a hot spot when many clients access it
◦ Not a major issue in practice
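
To illustrate why a large chunk size reduces client–master interaction: the client can translate a (filename, byte offset) pair into a chunk index entirely locally and then ask the master only for that chunk's locations. A minimal sketch (the 64 MB constant is from the paper; the helper itself is illustrative):

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in the paper

    def to_chunk_coordinates(byte_offset: int) -> tuple[int, int]:
        """Translate a file byte offset into (chunk index, offset within chunk)."""
        return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

    # A 1 GB sequential read touches only 16 chunks, so the client needs at most
    # 16 chunk-location lookups from the master (often fewer, thanks to caching).
    assert to_chunk_coordinates(1 * 1024**3 - 1)[0] == 15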

10 METADATA
• Three major types of metadata
◦ File and chunk namespaces
◦ Mapping from files to chunks
◦ Locations of each chunk's replicas
• All metadata is kept in the master's memory
◦ The master does not store chunk locations persistently; it polls chunkservers at startup
◦ Master operations are therefore fast and efficient
◦ Less than 64 bytes of metadata per chunk
• Operation log
◦ Contains a historical record of critical metadata changes
◦ The master recovers by replaying the operation log
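
A minimal sketch of how the master's in-memory structures and operation-log replay could fit together, assuming a simplified record format; the field names are illustrative, not taken from the paper:

    # Minimal sketch of the master's in-memory metadata and log replay.
    from dataclasses import dataclass, field

    @dataclass
    class MasterState:
        namespace: set = field(default_factory=set)         # full pathnames
        file_to_chunks: dict = field(default_factory=dict)  # path -> [chunk handles]
        chunk_locations: dict = field(default_factory=dict) # handle -> [chunkservers], NOT logged

        def apply(self, record: dict) -> None:
            """Apply one operation-log record to in-memory state."""
            if record["op"] == "create":
                self.namespace.add(record["path"])
                self.file_to_chunks[record["path"]] = []
            elif record["op"] == "add_chunk":
                self.file_to_chunks[record["path"]].append(record["handle"])

        def recover(self, log_records: list) -> None:
            """Rebuild state by replaying the log; chunk locations are instead
            re-learned by polling chunkservers at startup."""
            for record in log_records:
                self.apply(record)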

11 CONSISTENCY MODEL
• Guarantees by GFS
◦ File namespace mutations (e.g., file creation) are atomic
• Consistent: all clients see the same data regardless of which replica they read from
• Defined: consistent, and clients see what the mutation wrote in its entirety
• Inconsistent: different clients may see different data at different times

12 SYSTEM INTERACTIONS

13 LEASES AND MUTATION ORDER
• Leases
◦ Ensure a consistent and defined mutation order across replicas
◦ Minimize management load on the master
• Primary chunkserver
◦ The master grants a lease to one replica, which becomes the primary
◦ The primary assigns serial numbers to all mutations of the chunk
◦ All replicas apply mutations in that order

14 LEASES AND MUTATION ORDER
(figure: control and data flow during a write)
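
A sketch of how a primary might impose a single mutation order, assuming the data has already been pushed to all replicas; the class and method names are hypothetical:

    # Sketch of mutation serialization by the primary replica.
    import itertools

    class PrimaryReplica:
        def __init__(self, secondaries):
            self.secondaries = secondaries
            self.serial = itertools.count(1)  # serial numbers define the mutation order

        def apply_mutation(self, mutation) -> bool:
            n = next(self.serial)
            self.apply_locally(n, mutation)
            # Forward the chosen order; every replica applies mutations in serial order.
            return all(s.apply(n, mutation) for s in self.secondaries)

        def apply_locally(self, n, mutation):
            pass  # apply to the local chunk data in serial-number order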

15 DATA FLOW
• Decouple control flow from data flow
◦ Fully utilize each machine's network bandwidth
• Each machine forwards the data to the closest machine that has not yet received it
◦ Avoids network bottlenecks and high-latency links
• Data transfer is pipelined
◦ Minimizes latency
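
The paper estimates the ideal elapsed time for pushing B bytes through a chain of R replicas as B/T + R·L, where T is the per-link throughput and L the per-hop latency. A quick check with the paper's example numbers:

    def pipelined_transfer_seconds(num_bytes: int, num_replicas: int,
                                   throughput_bytes_per_s: float,
                                   hop_latency_s: float) -> float:
        """Ideal elapsed time to push B bytes through a replica chain: B/T + R*L."""
        return num_bytes / throughput_bytes_per_s + num_replicas * hop_latency_s

    # With 100 Mbps links (12.5 MB/s) and ~1 ms per hop, 1 MB reaches 3 replicas
    # in roughly 80 ms, matching the paper's estimate.
    print(pipelined_transfer_seconds(1_000_000, 3, 12.5e6, 1e-3))  # ~0.083 s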

16 ATOMIC RECORD APPENDS
• record append: an atomic append operation
◦ The client specifies only the data; GFS chooses the offset
◦ Allows many concurrent writers without extra locking
◦ The data is appended to every replica
• Failed appends may leave inconsistent regions between successful records
◦ Readers detect and skip such regions using checksums embedded in the records
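
A sketch of the boundary case the primary has to handle: if a record does not fit in the current chunk, the chunk is padded and the client retries on the next chunk (the paper limits records to a quarter of the chunk size to bound such padding). The function below is illustrative:

    # Sketch of the primary's record-append decision; names are illustrative.
    CHUNK_SIZE = 64 * 1024 * 1024
    MAX_RECORD = CHUNK_SIZE // 4

    def record_append(chunk_used: int, record: bytes):
        """Return ('append', offset) or ('retry_next_chunk', padding_needed)."""
        if len(record) > MAX_RECORD:
            raise ValueError("record too large for a single append")
        if chunk_used + len(record) > CHUNK_SIZE:
            # Pad the current chunk on all replicas and ask the client to retry
            # the append on the next chunk.
            return "retry_next_chunk", CHUNK_SIZE - chunk_used
        return "append", chunk_used  # GFS, not the client, picks the offset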

17 SNAPSHOT
• Makes a copy of a file or a directory tree at low cost
• Uses standard copy-on-write techniques: the master duplicates the metadata, and chunk data is copied lazily on the first subsequent write
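
A rough sketch of copy-on-write via reference counts, loosely following the paper's description (snapshots only bump counters; the first write to a shared chunk triggers a clone); the data structures are simplified:

    # Sketch of copy-on-write chunk handling via reference counts.
    class ChunkStore:
        def __init__(self):
            self.refcount = {}       # chunk handle -> number of files referencing it
            self.next_handle = 0

        def snapshot(self, chunk_handles):
            """A snapshot only bumps reference counts; no chunk data is copied."""
            for h in chunk_handles:
                self.refcount[h] = self.refcount.get(h, 1) + 1
            return list(chunk_handles)

        def write(self, handle):
            """On the first write after a snapshot, the shared chunk is cloned."""
            if self.refcount.get(handle, 1) > 1:
                self.refcount[handle] -= 1
                new_handle = self.allocate_copy_of(handle)
                self.refcount[new_handle] = 1
                return new_handle
            return handle

        def allocate_copy_of(self, handle):
            self.next_handle += 1
            return self.next_handle  # in GFS the copy is made locally on the chunkserver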

18 MASTER OPERATION

19 NAMESPACE MANAGEMENT AND LOCKING
• Namespace
◦ A lookup table mapping full pathnames to metadata
• Locking
◦ To serialize properly while keeping many operations active, each operation takes locks over regions of the namespace (see the sketch below)
◦ Allows concurrent mutations in the same directory
◦ Deadlock is prevented by acquiring locks in a consistent total order
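
A sketch of the deadlock-avoidance rule, assuming one lock object per pathname: acquire read locks on all ancestors and a lock on the target, always in a consistent total order (first by depth, then lexicographically). The lock bookkeeping is simplified:

    # Sketch of namespace locking in a consistent total order.
    import threading
    from collections import defaultdict

    locks = defaultdict(threading.RLock)   # one lock per pathname region

    def ancestors(path: str):
        parts = path.strip("/").split("/")
        return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

    def lock_order(paths):
        """Sort first by depth, then lexicographically, to prevent deadlock."""
        return sorted(set(paths), key=lambda p: (p.count("/"), p))

    def acquire_for_mutation(target: str):
        """Lock every ancestor directory, then the target pathname; the caller
        releases in reverse order when the operation completes."""
        needed = ancestors(target) + [target]
        acquired = []
        for p in lock_order(needed):
            locks[p].acquire()   # a real implementation distinguishes read and write locks
            acquired.append(p)
        return acquired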

20 REPLICA PLACEMENT
• Goals: maximize data reliability and availability, maximize network bandwidth utilization
• Spread replicas across machines
• Spread chunk replicas across racks, so the data survives the loss of an entire rack
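
A simplified rack-aware placement sketch; the real policy also considers disk utilization and recent creation counts, which are omitted here:

    # Sketch of rack-aware replica placement (distinct racks first, then distinct machines).
    import random
    from collections import defaultdict

    def place_replicas(servers, num_replicas=3):
        """servers: list of (server_id, rack_id); returns chosen server ids."""
        by_rack = defaultdict(list)
        for server, rack in servers:
            by_rack[rack].append(server)
        racks = list(by_rack)
        random.shuffle(racks)
        chosen = []
        for rack in racks:                  # one replica per rack while racks remain
            chosen.append(random.choice(by_rack[rack]))
            if len(chosen) == num_replicas:
                return chosen
        # Fewer racks than replicas: fall back to distinct machines.
        remaining = [s for s, _ in servers if s not in chosen]
        chosen += random.sample(remaining, num_replicas - len(chosen))
        return chosen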

21 CREATION, RE-REPLICATION, REBALANCING
• Creation
◦ Triggered on demand when writers need new chunks
• Re-replication
◦ Triggered when the number of available replicas falls below a user-specified goal
• Rebalancing
◦ The master periodically moves replicas for better disk space usage and load balancing

22 GARBAGE COLLECTION
• Lazy reclamation
◦ The deletion is logged immediately
◦ The file is renamed to a hidden name carrying the deletion timestamp
◦ The hidden file is removed during a later namespace scan, three days after deletion
◦ Until then it can be undeleted by renaming it back to a normal name
• Regular scans
◦ HeartBeat messages exchanged with each chunkserver compare chunk sets
◦ Orphaned chunks are identified and their metadata erased
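
A sketch of the lazy-deletion path (rename to a hidden, timestamped name; reclaim after the grace period); the hidden-name format is illustrative:

    # Sketch of lazy file deletion with a hidden rename and a 3-day grace period.
    import time

    GRACE_PERIOD_S = 3 * 24 * 3600
    namespace = {}   # pathname -> file metadata

    def delete(path: str) -> str:
        """Log the deletion, then rename to a hidden name with a timestamp."""
        hidden = f"{path}.deleted.{int(time.time())}"
        namespace[hidden] = namespace.pop(path)
        return hidden

    def undelete(hidden: str, original: str) -> None:
        namespace[original] = namespace.pop(hidden)

    def namespace_scan(now: float) -> None:
        """Reclaim hidden files whose grace period has expired."""
        for name in list(namespace):
            if ".deleted." in name and now - int(name.rsplit(".", 1)[1]) > GRACE_PERIOD_S:
                del namespace[name]   # its chunks become orphaned and are GC'd later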

23 STALE REPLICA DETECTION
• The master maintains a version number for each chunk
• Replicas that missed mutations while their chunkserver was down report an old version and are detected as stale
• Stale replicas are removed during regular garbage collection
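
A sketch of version-number bookkeeping for stale-replica detection; the shape of the chunkserver report is assumed:

    # Sketch of stale-replica detection via chunk version numbers.
    master_versions = {}     # chunk handle -> latest version the master knows

    def grant_lease(handle: int) -> int:
        """Bump the chunk version whenever a new lease is granted."""
        master_versions[handle] = master_versions.get(handle, 0) + 1
        return master_versions[handle]

    def find_stale(report: dict) -> list:
        """report: {chunk handle: version} from one chunkserver.
        Replicas with an old version are stale and scheduled for garbage collection."""
        return [h for h, v in report.items() if v < master_versions.get(h, 0)]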

24 FAULT TOLERANCE AND DIAGNOSIS

25 HIGH AVAILABILITY
• Fast recovery
◦ The master and chunkservers restore their state and restart in seconds
• Chunk replication
◦ Different replication levels can be set for different parts of the file namespace
◦ The master clones existing replicas as chunkservers go offline or report replicas corrupted, as detected through checksum verification

26 HIGH AVAILABILITY
• Master replication
◦ The operation log and checkpoints are replicated on multiple machines
◦ If the master machine or its disk fails, monitoring infrastructure outside GFS starts a new master process elsewhere
• Shadow masters
◦ Provide read-only access even while the primary master is down

27 DATA INTEGRITY
• Checksums detect corruption
◦ One checksum per 64 KB block in each chunk
◦ Kept in memory and stored persistently with logging
• Read
◦ The chunkserver verifies the checksum of each block before returning data
• Write (record append)
◦ Incrementally update the checksum of the last partial block
◦ Compute new checksums for any new blocks

28 DATA INTEGRITY
• Write (overwrite)
◦ Read and verify the first and last blocks of the range being overwritten, then perform the write
◦ Compute and record the new checksums
• During idle periods
◦ Chunkservers scan and verify inactive chunks to catch corruption in rarely read data
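
A sketch of read-time checksum verification over 64 KB blocks; CRC32 stands in here for whatever checksum function GFS actually used:

    # Sketch of per-block checksum verification on the read path.
    import zlib

    BLOCK_SIZE = 64 * 1024

    def block_checksums(chunk_data: bytes) -> list:
        return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
                for i in range(0, len(chunk_data), BLOCK_SIZE)]

    def verified_read(chunk_data: bytes, checksums: list, offset: int, length: int) -> bytes:
        """Verify every block overlapping the requested range before returning data."""
        first, last = offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE
        for b in range(first, last + 1):
            block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
            if zlib.crc32(block) != checksums[b]:
                raise IOError(f"checksum mismatch in block {b}; report to master")
        return chunk_data[offset:offset + length]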

29 MEASUREMENTS

30 MICRO-BENCHMARKS
• GFS cluster
◦ 1 master with 2 master replicas
◦ 16 chunkservers
◦ 16 clients
• Machine specification
◦ 1.4 GHz Pentium III, 2 GB RAM
◦ 100 Mbps full-duplex Ethernet
◦ Server machines connected to one switch, client machines to another
◦ The two switches are connected by a 1 Gbps link

31 MICRO-BENCHMARKS
Figure 3: Aggregate throughputs. Top curves show theoretical limits imposed by the network topology; bottom curves show measured throughputs. Error bars show 95% confidence intervals, which are illegible in some cases because of low variance in the measurements.

32 REAL WORLD CLUSTERS
Table 2: Characteristics of two GFS clusters

33 REAL WORLD CLUSTERS
Table 3: Performance metrics for two GFS clusters

34 REAL WORLD CLUSTERS
• Recovery experiments in cluster B
◦ Killed a single chunkserver containing 15,000 chunks (600 GB of data)
  - All chunks were restored in 23.2 minutes, an effective replication rate of about 440 MB/s
◦ Killed two chunkservers, each with roughly 16,000 chunks (660 GB of data)
  - 266 chunks were left with only a single replica
  - These were re-replicated at higher priority and restored within 2 minutes
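
The quoted replication rate is consistent with the raw numbers; a quick sanity check, assuming binary GB/MB units:

    # Quick check of the reported replication rate: 600 GB restored in 23.2 minutes.
    restored_bytes = 600 * 1024**3
    elapsed_s = 23.2 * 60
    print(restored_bytes / elapsed_s / 1024**2)  # ~441 MB/s, matching the ~440 MB/s figure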

35 CONCLUSIONS
• GFS demonstrates the qualities essential for supporting large-scale data processing workloads
◦ Treats component failure as the norm
◦ Optimizes for huge files
• Fault tolerance is provided by
◦ Constant monitoring
◦ Replicating crucial data
◦ Fast and automatic recovery
◦ Checksumming to detect data corruption
• Delivers high aggregate throughput to a variety of tasks

