Gearing for Exabyte Storage with Hadoop Distributed Filesystem
Edward Bortnikov, Amir Langer, Artyom Sharov
LADIS workshop, 2014

Scale, Scale, Scale
- HDFS storage is growing all the time
- Anticipating 1 XB Hadoop grids: ~30K dense (36 TB) nodes
- The harsh reality: a single system of 5K nodes is hard to build, and 10K is impossible to build

Why is Scaling So Hard?
- Look into the architectural bottlenecks: are they hard to dissolve?
- Example: job scheduling, centralized in Hadoop's early days, distributed since Hadoop 2.0 (YARN)
- This talk: the HDFS Namenode bottleneck

How HDFS Works
[Architecture diagram: a client sends FS API metadata calls to the single Namenode (NN), which keeps the FS tree, block map, and edit log at memory speed, and FS API data calls directly to the Datanodes (DNs) holding blocks B1-B4; the DNs send block reports to the NN. The NN is the bottleneck.]
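To make the split between the two call paths concrete, here is a minimal read sketch against the standard org.apache.hadoop.fs client API (the file path is hypothetical): the metadata calls are answered by the Namenode, while the open stream pulls bytes directly from the Datanodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path p = new Path("/data/example.log");              // hypothetical path
        FileStatus st = fs.getFileStatus(p);                  // metadata: served by the Namenode
        BlockLocation[] locs =
            fs.getFileBlockLocations(st, 0, st.getLen());     // metadata: block-map lookup on the NN

        for (BlockLocation loc : locs)
            System.out.println(String.join(",", loc.getHosts())); // Datanodes holding each block

        try (FSDataInputStream in = fs.open(p)) {             // open() goes to the NN once...
            byte[] buf = new byte[8192];
            int n = in.read(buf);                              // ...the bytes stream from the DNs
            System.out.println("read " + n + " bytes");
        }
    }
}
```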

Quick Math
- Typical setting for MR I/O parallelism: small files (file:block ratio = 1:1), small blocks (block size = 64 MB = 2^26 B)
- 1 XB = 2^60 bytes => 2^34 blocks, 2^34 files
- Inode data = 188 B, block data = 136 B => overall, 5+ TB of metadata in RAM
- Requires super-high-end hardware; unimaginable for a 64-bit JVM (GC explodes)
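A back-of-the-envelope check of these numbers (a sketch only; the 188 B and 136 B per-record sizes are the figures quoted on the slide):

```java
public class NamenodeFootprint {
    public static void main(String[] args) {
        long totalBytes = 1L << 60;                          // 1 XB
        long blockSize  = 1L << 26;                          // 64 MB blocks
        long blocks = totalBytes / blockSize;                // 2^34 blocks
        long files  = blocks;                                // 1:1 file-to-block ratio
        long metadataBytes = files * 188L + blocks * 136L;   // inode + block records
        System.out.printf("%.2f TiB of namenode metadata%n",
                          metadataBytes / Math.pow(1024, 4));   // ~5.06 TiB
    }
}
```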

Optimizing the Centralized NN
- Reduce the use of Java references (HDFS-6658): saves 20% of the block data
- Off-heap data storage (HDFS-7244): most of the block data lives outside the JVM, managed via a slab allocator, with a negligible penalty for accessing non-Java memory (sketched below)
- Exploit entropy in file and directory names: huge redundancy in the text
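The off-heap idea can be illustrated with direct ByteBuffers. This is only a hand-rolled sketch of the concept, not the HDFS-7244 slab allocator itself; the record layout and field sizes are assumptions.

```java
import java.nio.ByteBuffer;

/** Minimal fixed-size off-heap slab of block records: (blockId, genStamp, length). */
public class OffHeapBlockSlab {
    private static final int RECORD_BYTES = 24;   // three longs per record (assumed layout)
    private final ByteBuffer slab;

    public OffHeapBlockSlab(int capacity) {
        // Direct buffers live outside the Java heap, so the GC never scans them.
        this.slab = ByteBuffer.allocateDirect(capacity * RECORD_BYTES);
    }

    public void put(int slot, long blockId, long genStamp, long length) {
        int off = slot * RECORD_BYTES;
        slab.putLong(off, blockId).putLong(off + 8, genStamp).putLong(off + 16, length);
    }

    public long blockId(int slot) { return slab.getLong(slot * RECORD_BYTES); }
}
```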

One Process, Two Services
- Filesystem vs block management: the two compete for RAM and CPU
- Filesystem metadata vs block metadata; filesystem calls vs {block reports, replication}
- Grossly varying access patterns: filesystem data has huge locality, while block data is accessed uniformly (reports)

We Can Gain from a Split
- Scalability: easier to scale the two services independently, on separate hardware
- Usability: a standalone block management API is attractive for applications (e.g., an object store, HDFS-7240)

The Pros
- Block management: easy to infinitely scale horizontally (flat space); can be physically co-located with the datanodes
- Filesystem management: easy to scale vertically (cold storage, HDFS-5389), giving de-facto infinite scalability while remaining almost always at memory speed

The Cons
- Extra latency: API backward compatibility requires an extra network hop (can be optimized)
- Management complexity: separate service lifecycles, new failure/recovery scenarios (can be mitigated)

(Re-)Design Principles
- Correctness, scalability, performance
- API and protocol compatibility
- Simple recovery
- Complete design in HDFS-5477

Block Management as a Service
[Architecture diagram: clients keep using the external FS API, sending metadata calls to the FS Manager and data calls to the datanodes (DN1-DN10); the FS Manager talks to a separate Block Manager over an internal NN/BM API, and the Block Manager receives block reports from the datanodes and drives replication through its workers.]
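A rough sketch of what such an internal NN/BM interface could look like from the FS Manager's side. The interface and method names below are hypothetical, not the actual HDFS-5477 protocol.

```java
import java.util.List;

/** Hypothetical internal block-management service, as seen from the FS Manager. */
public interface BlockManagerService {

    /** Allocate a new block in the given pool and pick target datanodes. */
    BlockLease allocateBlock(String blockPoolId, int replication);

    /** Look up the current locations of a block for a client read. */
    List<String> getBlockLocations(long blockId);

    /** Mark a block deletable; physical deletion happens lazily. */
    void deleteBlock(long blockId);

    /** Datanode-side entry point: full or incremental block report. */
    void processBlockReport(String datanodeId, List<ReportedBlock> blocks);

    record BlockLease(long blockId, long generationStamp, List<String> targetDatanodes) {}
    record ReportedBlock(long blockId, long generationStamp, long length) {}
}
```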

Splitting the State
[Diagram: the same split architecture, now showing where the state lives; the namespace state stays with the FS Manager, the block state moves to the Block Manager (fed by datanode block reports), and the two share a common edit log. Clients still use the external FS API for metadata and data while the services communicate over the internal NN/BM API.]

Scaling Out the Block Management
[Diagram: the Block Manager is partitioned into several instances (BM1-BM5) above the datanodes (DN1-DN10); each partition manages a block pool, a file's blocks form a block collection, and the edit log is shared as before.]
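One simple way to spread blocks across Block Manager partitions is a stable assignment by block ID. This is an illustrative sketch only; the slides do not specify the actual partitioning scheme.

```java
/** Illustrative only: route a block to one of N Block Manager partitions by its ID. */
public final class BlockPartitioner {
    private final int partitions;

    public BlockPartitioner(int partitions) { this.partitions = partitions; }

    /** Stable mapping: the same block ID always lands on the same BM instance. */
    public int partitionFor(long blockId) {
        return Math.floorMod(Long.hashCode(blockId), partitions);
    }

    public static void main(String[] args) {
        BlockPartitioner p = new BlockPartitioner(5);      // BM1..BM5
        System.out.println("block 1099511627776 -> BM" + (p.partitionFor(1099511627776L) + 1));
    }
}
```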

Consistency of Global State
- State = inode data + block data; multiple scenarios modify both
- Big central lock in the good old times: impossible to maintain, cripples performance when spanning RPCs
- Fine-grained distributed locks? Only the path to the modified inode is locked, with all top-level directories held in shared mode (sketched below)
[Diagram: two concurrent operations, "add block to /d1/f2" and "add block to /d2/f3", lock disjoint paths under the root; no real contention!]
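A toy illustration of path-level locking with java.util.concurrent read/write locks: shared locks on the ancestors, an exclusive lock on the modified inode. The HDFS-5477 design is distributed and more involved, so treat this purely as a sketch of the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Toy path-level locking: shared locks on ancestors, exclusive lock on the target inode. */
public class PathLocks {
    private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(String path) {
        return locks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
    }

    /** Runs an update such as "add block to /d1/f2" while holding the path's locks. */
    public void withPathLocked(String path, Runnable update) {
        List<String> ancestors = ancestorsOf(path);              // "/", "/d1" for "/d1/f2"
        ancestors.forEach(a -> lockFor(a).readLock().lock());    // shared mode, root first
        lockFor(path).writeLock().lock();                        // exclusive on the modified inode
        try {
            update.run();
        } finally {
            lockFor(path).writeLock().unlock();
            for (int i = ancestors.size() - 1; i >= 0; i--)
                lockFor(ancestors.get(i)).readLock().unlock();
        }
    }

    private static List<String> ancestorsOf(String path) {
        List<String> out = new ArrayList<>(List.of("/"));
        int idx = 0;
        while ((idx = path.indexOf('/', idx + 1)) > 0) out.add(path.substring(0, idx));
        return out;
    }
}
```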

Fine-Grained Locks Scale
[Chart: latency (msec) versus throughput (transactions/sec) under a mixed workload of 3 reads (getBlockLocations()) per 1 write (createFile()), comparing global-lock (GL) and fine-grained-lock (FL) reads and writes.]

Fine-Grained Locks - Challenges
- Impede progress upon spurious delays
- Might lead to deadlocks (flows starting concurrently at the FSM and the BM)
- Problematic to maintain upon failures
- Do we really need them?
[Same path-locking example as on the previous slide.]

Pushing the Envelope
- Actually, we don't really need atomicity!
- Some transient state discrepancies can be tolerated for a while
- Example: orphaned blocks can emerge from partially completed API calls; no worries, there is no data loss, and they can be collected lazily in the background (see the sketch below)
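As an illustration of lazy cleanup, here is a sketch of a background collector that deletes blocks no inode references anymore. The two service views are hypothetical interfaces, not the actual HDFS-5477 mechanism.

```java
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch of a lazy orphan-block collector over hypothetical service views. */
public class OrphanBlockCollector {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(BlockView blockManager, NamespaceView fsManager) {
        scheduler.scheduleWithFixedDelay(() -> {
            for (long blockId : blockManager.allBlockIds()) {
                // A block that no inode references is an orphan left behind by a
                // partially completed operation; removing it loses no user data.
                if (!fsManager.isReferenced(blockId)) {
                    blockManager.scheduleDeletion(blockId);
                }
            }
        }, 10, 10, TimeUnit.MINUTES);
    }

    /** Hypothetical read-only views of the two services. */
    public interface BlockView { Set<Long> allBlockIds(); void scheduleDeletion(long blockId); }
    public interface NamespaceView { boolean isReferenced(long blockId); }
}
```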

Distributed Locks Eliminated
- No locks held across RPCs
- Guaranteeing serializability: all updates start at the BM side, and generation timestamps break ties (see the sketch below)
- Temporary state gaps are resolved in the background, with the timestamps used to reconcile
- More details in HDFS-5477
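A hedged sketch of tie-breaking with generation timestamps: when two records describe the same block, the one carrying the higher generation stamp wins. The record layout here is an assumption, not the actual on-disk or wire format.

```java
/** Sketch: reconcile two views of the same block by generation timestamp. */
public record BlockRecord(long blockId, long generationStamp, long length) {

    /** The record with the higher generation stamp is the authoritative one. */
    public static BlockRecord reconcile(BlockRecord a, BlockRecord b) {
        if (a.blockId() != b.blockId())
            throw new IllegalArgumentException("records describe different blocks");
        return a.generationStamp() >= b.generationStamp() ? a : b;
    }

    public static void main(String[] args) {
        BlockRecord stale = new BlockRecord(42L, 1001L, 1L << 20);
        BlockRecord fresh = new BlockRecord(42L, 1002L, 1L << 26);
        System.out.println(reconcile(stale, fresh));   // keeps the generation-1002 record
    }
}
```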

Beyond the Scope…
- Scaling the network connections
- Asynchronous dataflow architecture versus lock-based concurrency control
- Multi-tier bootstrap and recovery

Summary
- The HDFS namenode is a major scalability hurdle
- There are many low-hanging optimizations, but the centralized architecture is inherently limited
- Distributed block-management-as-a-service is key for future scalability
- Prototype implementation at Yahoo

Backup

Bootstrap and Recovery
- The common log simplifies things
- One peer (the FSM or the BM) enters read-only mode when the other is not available
- HA is similar to bootstrap, but failover is faster
- Drawback: the BM is not designed to operate in the FSM's absence

Supporting NSM Federation
[Diagram: three federated namespace managers, NSM1 (/usr), NSM2 (/project), and NSM3 (/backup), share the partitioned Block Managers (BM1-BM5), which in turn serve the datanodes (DN1-DN10).]