
1 Large Scale Sharing The Google File System PAST: Storage Management & Caching – Presented by Chi H. Ho

2 Introduction A next step beyond traditional network file systems. How large? GFS: > 1,000 storage nodes, > 300 TB of disk storage, hundreds of client machines. PAST: Internet-scale.

3 The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

4 Goals Performance, scalability, reliability, availability. Highly tuned for Google's back-end file service. Workloads: multiple-producer/single-consumer, many-way merging.

5 Assumptions H/W: inexpensive components that often fail. Files: a modest number of large files. Reads/Writes: two kinds: large streaming (the common case, optimized for) and small random (supported, but need not be efficient). Concurrency: hundreds of concurrent appends. Performance: high sustained bandwidth is more important than low latency.

6 Interface Usual operations: create, delete, open, close, read, and write. GFS-specific: snapshot creates a copy of a file or a directory tree at low cost; record append allows concurrent appends to be performed atomically.

7 Architecture

8 [Architecture diagram: the master, chunkservers, and clients all run as user-level processes on commodity Linux machines.]

9 Architecture (Files) Files are divided into fixed-size chunks, each replicated at multiple (default 3) chunkservers as a Linux file. Each chunk is identified by an immutable, globally unique chunk handle assigned by the master at chunk-creation time. Clients read/write chunk data specified by a chunk handle and byte range.
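For concreteness, a minimal Python sketch (not GFS's actual client library) of the client-side arithmetic behind such a request; the function name and 64 MB constant are illustrative assumptions based on the slide:

```python
# Minimal sketch of the client-side arithmetic behind a GFS read request.
# RPC plumbing is omitted; only the (chunk index, byte range) translation is shown.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size chunks

def to_chunk_request(offset, length):
    """Translate a (byte offset, length) within a file into the
    (chunk index, byte range within that chunk) the client asks for."""
    chunk_index = offset // CHUNK_SIZE
    start_in_chunk = offset % CHUNK_SIZE
    # A read that crosses a chunk boundary is split into per-chunk requests.
    length_in_chunk = min(length, CHUNK_SIZE - start_in_chunk)
    return chunk_index, (start_in_chunk, length_in_chunk)

# Example: a 1 MB read starting at byte 200,000,000 lands in chunk 2.
print(to_chunk_request(200_000_000, 1_048_576))   # -> (2, (65782272, 1048576))
```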

10 Architecture (Master) Maintains metadata: namespace, access control, file → chunk mapping, chunks' locations. Controls system-wide activities: chunk lease management, garbage collection, chunk migration. Exchanges heartbeat messages with chunkservers.

11 Architecture (Client) Interacts with the master for metadata; communicates directly with chunkservers for data.

12 Architecture (Notes) No data cache is needed: Why? Client: ??? Chunkservers: ???

13 Architecture (Notes) No data cache is needed: Why? Client: most applications stream through huge files or have working sets too large to be cached. Chunkservers: chunks are stored as local files, so the Linux buffer cache already keeps frequently accessed data in memory.

14 Single Master Bottleneck? Single point of failure?

15 Single Master Bottleneck? Clients never read/write data through the master; they only ask the master for chunk locations, and they prefetch locations for multiple chunks and cache them. Single point of failure? The master's state is replicated on multiple machines, mutations of the master's state are atomic, and "shadow" masters temporarily serve reads while the primary master is down.
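A minimal sketch of the client-side location cache with prefetching that keeps master traffic low; LocationCache, lookup_batch, and FakeMaster are illustrative names, not GFS's real client API:

```python
# Sketch of a client-side chunk-location cache with prefetching, to show why
# the master is not a bottleneck. The master interface (lookup_batch) is a
# hypothetical stand-in for the real master RPC.
PREFETCH = 8  # ask for locations of several consecutive chunks per request

class LocationCache:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # (filename, chunk_index) -> (chunk_handle, replica_locations)

    def locate(self, filename, chunk_index):
        key = (filename, chunk_index)
        if key not in self.cache:
            # One master round trip covers PREFETCH chunks, so subsequent
            # sequential reads hit the cache and never touch the master.
            for idx, handle, replicas in self.master.lookup_batch(
                    filename, chunk_index, PREFETCH):
                self.cache[(filename, idx)] = (handle, replicas)
        return self.cache[key]

class FakeMaster:
    """Stand-in for the master RPC, for illustration only."""
    def lookup_batch(self, filename, start, count):
        for idx in range(start, start + count):
            yield idx, f"handle-{filename}-{idx}", ["cs1", "cs2", "cs3"]

cache = LocationCache(FakeMaster())
print(cache.locate("/logs/part-0", 0))   # first call: one master round trip
print(cache.locate("/logs/part-0", 3))   # served from the prefetched cache
```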

16 Chunk Size Large: 64 MB. Advantages: reduces client-master interaction; reduces network overhead (persistent TCP connections to chunkservers); reduces the amount of metadata, so it can be kept in the master's memory. Disadvantages: small files (few chunks) may become hot spots. Solutions: give small files more replicas; let clients read from other clients.
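A back-of-the-envelope sketch of the in-memory-metadata argument; the ~64 bytes of metadata per chunk is the paper's estimate and the 300 TB figure comes from the introduction slide, both treated here as assumptions:

```python
# Rough arithmetic for why 64 MB chunks keep metadata small enough for memory.
import math

CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB
METADATA_PER_CHUNK = 64                # bytes per chunk (paper's estimate)

def chunks_for(file_size):
    return max(1, math.ceil(file_size / CHUNK_SIZE))

total_storage = 300 * 1024**4          # ~300 TB, the scale quoted earlier
total_chunks = total_storage // CHUNK_SIZE
print(total_chunks)                                  # -> 4915200 (~4.9 million chunks)
print(total_chunks * METADATA_PER_CHUNK / 1024**2)   # -> 300.0 (~300 MB of metadata)
```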

17 Metadata Three major types, all kept in the master's memory: file and chunk namespaces, the file-to-chunk mapping, and the locations of each chunk's replicas. Persistence: namespaces and the mapping are made persistent via an operation log stored on multiple machines; chunk locations are not persisted, but polled when the master starts and when chunkservers join, and updated via heartbeat messages.

18 Operation Log At the heart of GFS: the only persistent record of metadata, and the logical timeline that orders concurrent operations. Operations are committed atomically. The master's state is recovered by replaying the operations in the log.
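A minimal sketch of this log-then-apply discipline and replay-based recovery; the JSON record format and field names are illustrative assumptions, not GFS's on-disk format:

```python
# Sketch: a mutation is durably logged before it is applied to in-memory
# state, and recovery replays the log in order.
import json, os

class MetadataStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.namespace = {}          # in-memory metadata, e.g. file -> chunk list

    def apply(self, op):
        if op["type"] == "create":
            self.namespace[op["file"]] = []
        elif op["type"] == "add_chunk":
            self.namespace[op["file"]].append(op["handle"])

    def commit(self, op):
        # 1. Append to the local log and flush; a real master would also wait
        #    for remote log replicas to acknowledge before responding.
        with open(self.log_path, "a") as log:
            log.write(json.dumps(op) + "\n")
            log.flush()
        # 2. Only then apply to in-memory state, so the log totally orders ops.
        self.apply(op)

    def recover(self):
        # Rebuild master state by replaying the log in order.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as log:
            for line in log:
                self.apply(json.loads(line))
```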

19 Consistency Metadata: solely controlled by the master. Data: consistent after successful mutations: the same order of mutations is applied on all replicas, and stale replicas (missing some mutations) are detected and eliminated. Defined: consistent, and clients see what the mutation wrote in its entirety. Consistent: all clients see the same data, regardless of which replica they read.

20 Leases and Mutation Order Lease: a chunk-based access control mechanism granted by the master. Global mutation order = lease grant order + serial numbers chosen by the primary (lease holder) within a lease. Illustration of a mutation:
1. The client asks the master for the chunk's lease holder and the locations of the primary and secondary replicas.
2. The master locates the lease, or grants one if none exists, and replies; the client caches the locations.
3. The client pushes the data to all replicas; each replica stores the data in an LRU buffer and acknowledges.
4. After all replicas acknowledge, the client sends the write request to the primary.
5. The primary assigns a serial number to the request and forwards the write request to the secondaries.
6. The secondaries report the request completed; the primary replies to the client (possibly with errors).
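A minimal sketch of how a primary serializes concurrent writes with serial numbers; the Primary/Secondary classes are stand-ins for chunkserver replicas, not GFS's actual RPC interface:

```python
# Sketch: the lease holder (primary) assigns monotonically increasing serial
# numbers, and every replica applies mutations in that order.
class Secondary:
    def __init__(self):
        self.applied = []
    def apply(self, serial, mutation):
        self.applied.append((serial, mutation))
        return True

class Primary:
    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.next_serial = 0
        self.applied = []            # this replica's state

    def write(self, mutation):
        # The serial number chosen here defines the global order for this chunk.
        serial = self.next_serial
        self.next_serial += 1
        self.applied.append((serial, mutation))
        errors = []
        for s in self.secondaries:
            if not s.apply(serial, mutation):   # forward in serial-number order
                errors.append(s)
        # On any secondary failure the client sees an error and retries, which
        # can leave replicas consistent but with duplicate records.
        return ("ok", serial) if not errors else ("error", serial)

primary = Primary([Secondary(), Secondary()])
print(primary.write(b"record-A"))   # -> ('ok', 0)
print(primary.write(b"record-B"))   # -> ('ok', 1)
```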

21 Special Ops Revisited Atomic Record Appends: the primary chooses the offset; upon failure, pad the failed replica(s), then retry. Guarantee: the record is appended to the file at least once atomically. Snapshot: copy-on-write; used to make a copy of a file or directory tree quickly.
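A minimal sketch of the padding-and-retry rule for record append; the bytearray chunk stands in for a chunk replica, and the one-quarter-of-chunk-size limit follows the paper:

```python
# Sketch of the record-append rule: if the record does not fit in the current
# chunk, the chunk is padded and the client retries on the next chunk, so the
# record lands atomically at an offset chosen by the system, at least once.
CHUNK_SIZE = 64 * 1024 * 1024
MAX_APPEND = CHUNK_SIZE // 4     # records are limited to 1/4 of the chunk size

def record_append(chunk, record):
    """chunk is a mutable bytearray standing in for the current chunk replica.
    Returns the offset of the record, or None if the caller must retry on a
    freshly allocated chunk."""
    assert len(record) <= MAX_APPEND
    if len(chunk) + len(record) > CHUNK_SIZE:
        # Pad the remainder so all replicas agree the chunk is full.
        chunk.extend(b"\x00" * (CHUNK_SIZE - len(chunk)))
        return None                      # retry on the next chunk
    offset = len(chunk)
    chunk.extend(record)                 # same offset on every replica
    return offset
```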

22 Master Operations Namespace management and locking, to support concurrent master operations. Replica placement, to avoid correlated failures and to exploit network bandwidth. Creation, re-replication, and rebalancing, to improve disk utilization, load balancing, and fault tolerance. Garbage collection: lazy deletion is simple, efficient, and supports undelete. Stale replica detection, to eliminate obsolete replicas so they can be garbage collected.

23 Fault Tolerance Sum Up Master fails? Chunkservers fail? Disks corrupted? Network noise?

24 Micro-benchmarks Configuration: 1 master (with 2 master replicas), 16 chunkservers, 16 clients. Each machine: dual 1.4 GHz PIII, 2 GB memory, 2 x 80 GB 5400 rpm disks, full-duplex 100 Mbps NIC; all connected via a single switch (1 Gbps).

25 Micro-benchmark Tests and Results N clients read simultaneously and randomly from a 320 GB file set; each client reads 1 GB, in 4 MB reads. N clients write simultaneously to N distinct files; each client writes 1 GB, in 1 MB writes. N clients append simultaneously to one file.

26 Real World Clusters Cluster A: R&D for over 100 engineers. Typical task: initiated by a human user and runs up to several hours; it reads MBs to TBs of data, processes it, and writes the results back. Cluster B: production data processing. Tasks are long-lasting and continuously generate and process multi-TB data sets, with only occasional human intervention.

27 Real World Measurements The table shows sustained high throughput and a light workload on the master. Recovery: a full recovery of a chunkserver takes 23.2 minutes; prioritized recovery to a state that can tolerate one more failure takes 2 minutes.

28 Workload Breakdown

29 Conclusion The design is narrowly tailored to Google's applications. Most of the challenges are in the implementation: more a development effort than a research contribution. However, GFS is a complete, deployed solution. Any opinions/comments?

30 Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility Antony Rowstron, Peter Druschel

31 What is PAST? An Internet-based, P2P global storage utility. An archival storage and content distribution utility, not a general-purpose file system. Nodes form a self-organizing overlay network and may contribute storage. Files are inserted and retrieved, identified by a fileID and, optionally, a decryption key. Files are immutable. PAST does not include a lookup service; it is built on top of one, such as Pastry.

32 Goals Strong persistence, High availability, Scalability, Security.

33 Background – Pastry A P2P routing substrate. Given (fileID, msg), route msg to the node whose nodeID is numerically closest to fileID. Routing cost: ⌈log_{2^b} N⌉ steps. Eventual delivery is guaranteed unless ⌊l/2⌋ nodes with adjacent nodeIDs fail simultaneously. Per-node routing state: (2^b − 1) · ⌈log_{2^b} N⌉ + 2l entries mapping nodeID → IP address. Node recovery takes O(log_{2^b} N) messages.
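To make the formulas concrete, a small sketch that evaluates them; b = 4 and l = 32 match the experimental setup later in the talk, while N = 100,000 is an illustrative value:

```python
# Worked example of the Pastry cost formulas on the slide.
import math

def pastry_costs(N, b, l):
    hops = math.ceil(math.log(N, 2 ** b))                  # routing steps
    table = (2 ** b - 1) * math.ceil(math.log(N, 2 ** b))  # routing table entries
    return hops, table + 2 * l                             # plus the leaf set

print(pastry_costs(100_000, b=4, l=32))   # -> (5, 139): 5 hops, 139 nodeID->IP entries
```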

34 Pastry – A closer look… Routing: forward a message with fileID to a node whose nodeID shares more digits (a longer prefix) with fileID than the current node's; if no such node is found, forward to a node with an equally long match that is numerically closer to fileID. Other nice properties: fault resilient, self-organizing, scalable, efficient. (Figure parameters: b = 2, l = 8.)
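A minimal sketch of this prefix-routing rule; IDs are hex strings (b = 4) and known_nodes stands in for the union of routing table and leaf set, which simplifies Pastry's real data structures:

```python
# Sketch of Pastry's next-hop choice: longer shared prefix with the fileID,
# else an equal prefix but numerically closer nodeID.
def shared_prefix_len(a, b):
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def next_hop(self_id, file_id, known_nodes):
    own = shared_prefix_len(self_id, file_id)
    longer = [n for n in known_nodes if shared_prefix_len(n, file_id) > own]
    if longer:
        return max(longer, key=lambda n: shared_prefix_len(n, file_id))
    equal = [n for n in known_nodes if shared_prefix_len(n, file_id) == own]
    # Fall back to a node numerically closer to file_id than we are, if any.
    closer = [n for n in equal
              if abs(int(n, 16) - int(file_id, 16)) < abs(int(self_id, 16) - int(file_id, 16))]
    return min(closer, key=lambda n: abs(int(n, 16) - int(file_id, 16))) if closer else None

print(next_hop("a1f3", "a1c9", ["b012", "a1c0", "a200"]))   # -> 'a1c0'
```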

35 PAST Operations Insert: fileID := SHA-1(filename, public key, salt) => unique; a file certificate is issued and the client's quota is charged. Lookup: based on fileID; a node storing the file returns its contents and certificate. Reclaim: the client issues a reclaim certificate for authentication; the client's quota is credited, double-checked via a reclaim receipt.
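A sketch of the fileID computation; the paper specifies the three SHA-1 inputs, while the length-prefixed serialization below is an assumption for illustration:

```python
# Sketch: fileID = SHA-1 over file name, owner's public key, and a random salt.
import hashlib, os

def make_file_id(filename: str, public_key: bytes, salt: bytes = None) -> str:
    salt = salt if salt is not None else os.urandom(20)
    h = hashlib.sha1()
    for field in (filename.encode(), public_key, salt):
        h.update(len(field).to_bytes(4, "big"))   # length-prefix to avoid ambiguity
        h.update(field)
    return h.hexdigest()                          # 160-bit fileID

file_id = make_file_id("/music/track01.ogg", b"-----BEGIN PUBLIC KEY-----...")
print(file_id)   # the k nodes with nodeIDs closest to this value store the replicas
```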

36 Security Overview Each node and each user hold a smartcard. Security model: Infeasible to break the cryptosystems. Most nodes are well-behaved. Smartcards can’t be controlled by an attacker. From smartcard, various certificates and receipts are generated to ensure security: file certificates, reclaim certificates, reclaim receipts, etc.

37 Storage Management Assumptions: storage capacities of nodes differ by no more than two orders of magnitude; advertised capacity is the basis for admitting nodes. Two conflicting responsibilities: balance free storage under stress, and keep k copies of each file at the k nodes whose nodeIDs are closest to its fileID.

38 I) Load Balancing What causes load imbalance? Differences in: the number of files per node (due to the distribution of nodeIDs and fileIDs), the size distribution of inserted files, and the storage capacity of nodes. What does the solution aim for? Blur these differences by redistributing data: replica diversion on a local scale (relocate a replica among leaf-set nodes), and file diversion on a global scale (relocate all replicas by choosing a different fileID).

39 Storing a Replica (flowchart) Notation: S_D = size of file D, F_N = free space at node N, t_pri = primary threshold, t_div = diversion threshold. When node N receives file D:
- If S_D / F_N ≤ t_pri: N stores D, issues a store receipt, and forwards D to the other k−1 nodes.
- Otherwise, replica diversion: choose a diversion node N' = the node with the most free storage among N's leaf set whose nodeID is not among the k closest to the fileID and which does not already hold a diverted replica.
- If no such N' exists, or S_D / F_N' > t_div: fall back to file diversion.
- Otherwise: store D at N'; N keeps a pointer to N', and the (k+1)-st closest node also keeps a pointer to N'.
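A minimal sketch of that decision, using the slide's notation; the node objects (free_space, has_diverted_replica) are simple stand-ins rather than PAST's real node state:

```python
# Sketch of the store / replica-diversion / file-diversion decision.
def handle_insert(node, file_size, leaf_set, k_closest, t_pri=0.1, t_div=0.05):
    # Accept locally if the file is small relative to this node's free space.
    if file_size / node.free_space <= t_pri:
        return ("store_locally", node)
    # Replica diversion: the leaf-set node with the most free space that is
    # not itself among the k closest and holds no diverted replica yet.
    candidates = [x for x in leaf_set
                  if x not in k_closest and not x.has_diverted_replica]
    best = max(candidates, key=lambda x: x.free_space, default=None)
    if best is None or file_size / best.free_space > t_div:
        return ("file_diversion", None)       # retried under a different fileID
    return ("divert_to", best)                # node keeps a pointer to `best`
```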

40 II) Maintaining k Replicas Problem: nodes join and leave. On joining: install a pointer to the replaced node (similar to replica diversion), then gradually migrate the replicas back as a background job. On leaving: each affected node picks a new k-th closest node, updates its leaf set, and forwards replicas to it. Notes: under extreme conditions, the leaf set is "expanded" to 2l; it is impossible to maintain k replicas if the total storage keeps decreasing.

41 Optimizations Storage: file encoding, e.g., Reed-Solomon encoding: m replicas per file → m checksum blocks for n files. Performance: caching. Goals: reduce client-access latencies, maximize query throughput, and balance the query load. Algorithm: GreedyDual-Size (GD-S). On a hit: H_d = c(d) / s(d). Eviction: evict the file v with minimum H_v, then subtract H_v from the remaining H values.
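A minimal sketch of GD-S as stated above; the running offset L is the standard equivalent of subtracting H_v from every remaining entry, and the cost function defaulting to 1 is an assumption that makes the policy favor small files:

```python
# Sketch of GreedyDual-Size: each cached file d has H_d = c(d)/s(d); eviction
# removes the minimum-H file and ages the rest via the offset L.
class GDSCache:
    def __init__(self, capacity_bytes, cost=lambda d, size: 1.0):
        self.capacity = capacity_bytes
        self.cost = cost                 # c(d); constant 1.0 favors small files
        self.used = 0
        self.entries = {}                # file_id -> (H value, size)
        self.L = 0.0                     # H of the most recent eviction

    def access(self, file_id, size):
        if file_id in self.entries:      # hit: refresh H relative to current L
            _, size = self.entries[file_id]
            self.entries[file_id] = (self.L + self.cost(file_id, size) / size, size)
            return True
        while self.used + size > self.capacity and self.entries:
            victim = min(self.entries, key=lambda f: self.entries[f][0])
            self.L = self.entries[victim][0]   # same effect as subtracting H_v
            self.used -= self.entries[victim][1]
            del self.entries[victim]
        if size <= self.capacity:        # insert the newly fetched file
            self.entries[file_id] = (self.L + self.cost(file_id, size) / size, size)
            self.used += size
        return False
```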

42 Experiments – Setup Workload 1: 8 web proxy logs from NLANR: 4 million entries referencing 1,863,055 unique URLs, 18.7 GB of content; mean = 10,517 bytes, median = 1,312 bytes, max = 138 MB, min = 0 bytes. Workload 2: combined file name and size information from several file systems: 2,027,908 files, 166.6 GB; mean = 88,233 bytes, median = 4,578 bytes, max = 2.7 GB, min = 0 bytes. System: k = 5, b = 4, N = 2,250 nodes. Space contribution: drawn from 4 normal distributions (see figure).

43 Experiment 0 Disable replica and file diversions: t_pri = 1, t_div = 0; reject upon the first failure. Results: file insertions failed = 51.1%, storage utilization = 60.8%.

44 Storage Contribution & Leaf Set Size Experiment: Workload 1, t_pri = 0.1, t_div = 0.05. Results: insertion failures and storage utilization (see figures); more leaves => better; d2 best.

45 Sensitivity of Replica Diversion Parameter t_pri Experiment: Workload 1, l = 32, t_div = 0.05, t_pri varied. Results: the successful-insertion rate and storage utilization both vary with t_pri (see figure).

46 Sensitivity of File Diversion Parameter t_div Experiment: Workload 1, l = 32, t_pri = 0.1, t_div varied. Results: the successful-insertion rate and storage utilization both vary with t_div (see figure); t_pri = 0.1 and t_div = 0.05 yield the best results.

47 Diversions File diversions are negligible as long as storage utilization is below 83%. The diversion overhead is acceptable.

48 Insertion Failures w/ Respect to File Size Workload 1: t_pri = 0.1, t_div = 0.05. Workload 2: t_pri = 0.1, t_div = 0.05.

49 Experiments – Caching Replica diversions increase as utilization approaches 99%, but caching remains effective because most files are small.

50 Conclusion PAST achieves its goals. But: it is application-specific, and hard to deploy: what is the incentive for nodes to contribute storage? Additional comments?

