Gearing for Exabyte Storage with the Hadoop Distributed Filesystem
Edward Bortnikov, Amir Langer, Artyom Sharov
Scale, Scale, Scale
- HDFS storage is growing all the time: anticipating 1 XB (exabyte) Hadoop grids, ~30K dense (36 TB) nodes
- Harsh reality: a single system of 5K nodes is hard to build; 10K is impossible to build
LADIS workshop, 2014
Why Is Scaling So Hard?
- Look into architectural bottlenecks: are they hard to dissolve?
- Example: job scheduling was centralized in Hadoop's early days, distributed since Hadoop 2.0 (YARN)
- This talk: the HDFS Namenode bottleneck
How HDFS Works
[Architecture diagram: a Client issues FS API calls, metadata (to the Namenode) and data (to the Datanodes). The Namenode (NN) holds a memory-speed FS tree, block map, and edit log; Datanodes (DNs) hold blocks B1..B4 and send block reports to the NN. The NN is the bottleneck.]
Quick Math
- Typical setting for MR I/O parallelism: small files (file:block ratio = 1:1), small blocks (block size = 64 MB = 2^26 B)
- 1 XB = 2^60 bytes, hence 2^34 blocks and 2^34 files
- Inode data = 188 B, block data = 136 B
- Overall, 5+ TB of metadata in RAM
- Requires super-high-end hardware; unimaginable for a 64-bit JVM (GC explodes)
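The arithmetic above can be checked directly. A minimal sketch using the slide's own figures (188 B per inode, 136 B per block; all other names are illustrative):

```python
# Back-of-the-envelope metadata sizing for a 1 XB namespace.
BLOCK_SIZE = 2**26            # small blocks: 64 MB
TOTAL_BYTES = 2**60           # 1 XB
INODE_BYTES = 188             # per-inode metadata, from the slide
BLOCK_META_BYTES = 136        # per-block metadata, from the slide

blocks = TOTAL_BYTES // BLOCK_SIZE      # 2**34 blocks
files = blocks                          # file:block ratio = 1:1
metadata = files * INODE_BYTES + blocks * BLOCK_META_BYTES

print(blocks == 2**34)          # True
print(metadata / 2**40)         # 5.0625, i.e. the "5+ TB" of RAM on the slide
```

The 324 bytes of metadata per 64 MB block is what drives the total: 324 / 64 = 5.0625 TB per exabyte stored.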
Optimizing the Centralized NN
- Reduce the use of Java references (HDFS-6658): saves 20% of the block data
- Off-heap data storage (HDFS-7244): most of the block data lives outside the JVM; off-heap data is managed via a slab allocator; negligible penalty for accessing non-Java memory
- Exploit entropy in file and directory names: huge redundancy in the text
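One way to exploit that redundancy, sketched here as a hypothetical helper (this is not the actual Namenode code), is to intern the UTF-8 bytes of each path component so that every inode with the same name shares a single stored copy:

```python
# Hypothetical name-interning sketch: directory trees are full of repeated
# component names (e.g. "part-00000"), so storing each distinct name once
# removes most of the textual redundancy the slide mentions.
class NameInterner:
    def __init__(self):
        self._pool = {}

    def intern(self, name: str) -> bytes:
        """Return a canonical byte string shared by all users of `name`."""
        b = name.encode("utf-8")
        return self._pool.setdefault(b, b)

interner = NameInterner()
a = interner.intern("part-00000")
b = interner.intern("part-00000")
print(a is b)   # True: both inodes reference one byte string
```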
One Process, Two Services
- Filesystem vs block management: they compete for the RAM and the CPU
- Filesystem vs block metadata
- Filesystem calls vs {block reports, replication}
- Grossly varying access patterns: filesystem data has huge locality, while block data is accessed uniformly (reports)
We Can Gain from a Split
- Scalability: easier to scale the services independently, on separate hardware
- Usability: a standalone block management API is attractive for applications (e.g., an object store, HDFS-7240)
The Pros
- Block management: easy to scale horizontally without bound (flat ID space); can be physically co-located with datanodes
- Filesystem management: easy to scale vertically (cold storage, HDFS-5389); de facto infinite scalability; almost always memory speed
The Cons
- Extra latency: backward compatibility of the API requires an extra network hop (can be optimized)
- Management complexity: separate service lifecycles; new failure/recovery scenarios (can be mitigated)
(Re-)Design Principles
- Correctness, scalability, performance
- API and protocol compatibility
- Simple recovery
- Complete design in HDFS-5477
Block Management as a Service
[Architecture diagram: the FS Manager serves the FS API (metadata) and talks to the Block Manager over an internal NN/BM API; the Block Manager receives block reports from datanodes DN1..DN10 (the workers) and drives replication; clients reach the datanodes directly via the FS API (data). External and internal APIs/protocols are distinguished.]
Splitting the State
[Diagram: the same FS Manager / Block Manager / DN1..DN10 layout, now with a shared Edit Log persisting the state of both services.]
Scaling Out the Block Management
[Diagram: a partitioned Block Manager (BM1..BM5) in front of datanodes DN1..DN10, alongside the FS Manager and the Edit Log; each BM partition owns a block pool, and a file's blocks form a block collection.]
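The flat block ID space is what makes this horizontal split straightforward. A minimal sketch of the routing idea (the modulo scheme and BM count are illustrative, not the HDFS-5477 design):

```python
from collections import Counter

NUM_BMS = 5  # BM1..BM5, as in the diagram

def bm_for_block(block_id: int) -> int:
    """Route a block to its owning Block Manager partition."""
    return block_id % NUM_BMS

# Because block IDs carry no hierarchy, any uniform hash spreads the load:
# each BM ends up owning ~1/NUM_BMS of the block pool.
counts = Counter(bm_for_block(b) for b in range(100_000))
print(sorted(counts.values()))  # [20000, 20000, 20000, 20000, 20000]
```

Contrast this with the namespace side, where hierarchy and locality make such a trivial partitioning impossible.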
Consistency of Global State
- State = inode data + block data; multiple scenarios modify both
- The Big Central Lock of the good old times is impossible to maintain: it cripples performance when spanning RPCs
- Fine-grained distributed locks? Only the path to the modified inode is locked; all top-level directories are held in shared mode
[Diagram: two concurrent operations, "add block to /d1/f2" and "add block to /d2/f3", lock disjoint paths under the root: no real contention.]
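The locking discipline above can be sketched as a pure function computing which locks a mutation takes (a hypothetical helper, meant only to show why the two "add block" calls in the diagram never contend):

```python
def locks_for_mutation(path: str):
    """Shared locks on every ancestor directory, exclusive on the target."""
    parts = path.strip("/").split("/")
    shared = ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return shared, path

s1, x1 = locks_for_mutation("/d1/f2")   # shared: /, /d1   exclusive: /d1/f2
s2, x2 = locks_for_mutation("/d2/f3")   # shared: /, /d2   exclusive: /d2/f3

# The exclusive locks touch disjoint inodes, and the shared lock on "/"
# is compatible with itself, so both operations proceed in parallel.
print(x1 != x2 and x1 not in s2 and x2 not in s1)  # True
```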
Fine-Grained Locks Scale
[Chart: latency (msec) vs throughput (transactions/sec) under a mixed workload of 3 reads (getBlockLocations()) to 1 write (createFile()); separate curves for fine-grained locks (FL) and the global lock (GL), reads and writes.]
Fine-Grained Locks - Challenges
- Impede progress upon spurious delays
- Might lead to deadlocks (flows starting concurrently at the FSM and the BM)
- Problematic to maintain upon failures
- Do we really need them?
Pushing the Envelope
- Actually, we don't really need atomicity! Some transient state discrepancies can be tolerated for a while
- Example: orphaned blocks can emerge upon partially complete API calls
- No worries, no data loss! They can be collected lazily in the background
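The lazy collection idea can be sketched as a background sweep over the two halves of the split state (the dict-based block map and inode table here are simplified stand-ins, not the real data structures):

```python
def collect_orphans(block_map: dict, inodes: dict) -> set:
    """Background sweep: any block no inode references is an orphan.

    block_map: block_id -> replica locations (the BM side)
    inodes:    path -> list of block_ids (the FSM side)
    """
    referenced = {b for block_ids in inodes.values() for b in block_ids}
    return set(block_map) - referenced

# Block 4 was allocated by a partially complete API call and never
# attached to a file; it is transient garbage, not data loss.
inodes = {"/d1/f1": [1, 2], "/d1/f2": [3]}
block_map = {1: "dn1", 2: "dn2", 3: "dn3", 4: "dn4"}
print(collect_orphans(block_map, inodes))  # {4}
```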
Distributed Locks Eliminated
- No locks held across RPCs
- Guaranteeing serializability: all updates start at the BM side; generation timestamps break ties
- Temporary state gaps resolved in the background: timestamps used to reconcile
- More details in HDFS-5477
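A minimal sketch of timestamp-based reconciliation (the record layout is illustrative; HDFS-5477 has the actual design): when two views of a block's metadata diverge during a temporary state gap, the update carrying the higher generation stamp wins.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockRecord:
    block_id: int
    gen_stamp: int   # monotonically increasing generation timestamp
    length: int

def reconcile(a: BlockRecord, b: BlockRecord) -> BlockRecord:
    """Resolve a temporary state gap: the higher generation stamp wins."""
    assert a.block_id == b.block_id
    return a if a.gen_stamp >= b.gen_stamp else b

stale = BlockRecord(block_id=7, gen_stamp=12, length=2**20)
fresh = BlockRecord(block_id=7, gen_stamp=13, length=2**26)
print(reconcile(stale, fresh).gen_stamp)  # 13
```

Because the stamps totally order conflicting updates to the same block, the background pass converges without any lock being held across RPCs.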
Beyond the Scope …
- Scaling the network connections
- Asynchronous dataflow architecture versus lock-based concurrency control
- Multi-tier bootstrap and recovery
Summary
- The HDFS Namenode is a major scalability hurdle
- Many low-hanging optimizations, but the centralized architecture is inherently limited
- Distributed block management as a service is key for future scalability
- Prototype implementation at Yahoo
Backup
Bootstrap and Recovery
- The common log simplifies things: one peer (the FSM or the BM) enters read-only mode when the other is not available
- HA is similar to bootstrap, but failover is faster
- Drawback: the BM is not designed to operate in the FSM's absence
Supporting NSM Federation
[Diagram: multiple namespace managers NSM1 (/usr), NSM2 (/project), NSM3 (/backup) sharing the partitioned block managers BM1..BM5 and datanodes DN1..DN10.]