HDFS (Hadoop Distributed File System) 2011-10-10 Taejoong Chung, MMLAB.


Contents
Introduction
–Hadoop Distributed File System?
–Assumption & Goals
Mechanism
–Structure
–Data Management
–Maintenance
Pros and Cons

HDFS
Hadoop Distributed File System
–Started from ‘Nutch’ (an open-source search engine project) in 2005
–Java based, Apache top-level project
–Designed to store massive amounts of data at low cost
Characteristics
–User-level distributed file system
–Fault-tolerant
–Can be deployed on low-cost hardware

Assumption & Goals
1) Protection against failure
–Detection of faults and quick, automatic recovery
–Both hardware and software failures are considered
2) Streaming data access
–Batch processing rather than interactive use
–High throughput of data access rather than low latency

Assumption & Goals - contd
3) Large data sets
–A typical file in HDFS is gigabytes to terabytes in size
–High aggregate data bandwidth, scaling to hundreds of nodes
4) Simple coherency model
–Write-once-read-many access
–Once a file is created, it is not allowed to be modified

Assumption & Goals - contd
5) Moving computation to the data
–Provides interfaces for applications to move themselves closer to where the data is located
6) Portability
–Easily portable from one platform to another
–Java based

Structure
Master / Slave architecture
NameNode (Master)
–Manages the file system namespace
–Regulates access to files by clients
–Does not contain any data files
–Unique (a single NameNode per cluster)
DataNode (Slave)
–Actual repository of the data
–Multiple nodes are required

Conceptual Diagram
[Figure labels: Namespace (headquarters): directory service. DataNode: contains multiple blocks of data. Block: a piece of data.]

Operation
A file is split into multiple blocks, and each block is replicated across the DataNodes
–A file is cut into multiple blocks of 64 MB each (default)
–Each block is replicated over the DataNodes (number of replicas: 3 by default)
Scheme
–Aims to maximize ‘tolerance’
–Local tolerance: inside the rack
–Global tolerance: outside the rack
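
The block arithmetic above can be sketched in a few lines of Python. This is only an illustration under the defaults quoted on the slide (64 MB blocks, 3 replicas); the function names `split_into_blocks` and `raw_storage` are hypothetical and not part of any Hadoop API.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size quoted on the slide
REPLICATION = 3                # default number of replicas quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs, one per block of the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def raw_storage(file_size, replication=REPLICATION):
    """Total bytes stored across the cluster once every block is replicated."""
    return file_size * replication
```

For a 200 MB file this gives three full 64 MB blocks plus one 8 MB block, and 600 MB of raw storage across the cluster.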

Example: Rack Awareness
[Figure labels: "Command to save files from NameNode"; DataNodes in Rack 1, Rack 2, and Rack 3. Local tolerance: in the same rack. Global tolerance: outside of the rack.]
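
One way to picture rack awareness is a placement routine that puts at least one replica in the writer's rack (local tolerance) and at least one in a different rack (global tolerance). The sketch below is a simplified illustration, not the actual HDFS placement policy, which may differ; `place_replicas` and the topology layout are assumptions made for this example.

```python
def place_replicas(topology, local_rack, replication=3):
    """Pick DataNodes for one block.

    topology: dict mapping rack name -> list of DataNode names (hypothetical layout).
    Returns up to `replication` node names: the first from the local rack,
    the second from another rack, the rest from whatever nodes remain.
    """
    if local_rack not in topology:
        raise ValueError(f"unknown rack: {local_rack}")

    local_nodes = list(topology[local_rack])
    remote_nodes = [
        node
        for rack, nodes in topology.items()
        if rack != local_rack
        for node in nodes
    ]

    chosen = []
    if local_nodes and len(chosen) < replication:
        chosen.append(local_nodes[0])    # local tolerance
    if remote_nodes and len(chosen) < replication:
        chosen.append(remote_nodes[0])   # global tolerance

    # Fill any remaining slots, preferring the local rack.
    for node in local_nodes[1:] + remote_nodes[1:]:
        if len(chosen) >= replication:
            break
        chosen.append(node)
    return chosen
```

With this layout a single DataNode failure leaves two replicas, and the loss of the whole local rack still leaves one.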

Data Maintenance
Each DataNode sends ‘Heartbeat’ and ‘Blockreport’ messages to the NameNode
–Blockreport: a list of all blocks on a DataNode
–Heartbeat: a kind of ‘ping’ (I’m alive!); receipt of a Heartbeat implies that the DataNode is functioning properly
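
A minimal sketch of how a master could track liveness from heartbeats is shown below. The class name `HeartbeatMonitor`, the timeout value, and the use of explicit timestamps are assumptions for illustration; the slide does not state the real interval or timeout.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per DataNode (illustrative only)."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # DataNode name -> time of last heartbeat

    def record_heartbeat(self, datanode, now):
        """Called whenever a heartbeat arrives from `datanode` at time `now`."""
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

Passing the current time in as an argument, rather than reading a clock inside the class, keeps the example deterministic.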

Data Management
NameNode manages all metadata
–EditLog: records every transaction (change to file system metadata) on the NameNode
–FsImage (File System Image): the file system namespace, including the mapping of files to blocks
–Key metadata is kept in memory
–Which blocks are stored on which DataNodes is learned from the messages the DataNodes send, and is kept here as well
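
The relationship between the two structures can be sketched as "image plus replayed log": the current namespace is the last FsImage with every EditLog transaction applied in order. The record format below (tuples such as `("create", path)`) is a hypothetical simplification, not the on-disk format HDFS uses.

```python
def apply_edit(namespace, edit):
    """Apply one logged transaction to the in-memory namespace."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = []               # new file with no blocks yet
    elif op == "add_block":
        namespace[path].append(args[0])    # record a new block ID for the file
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def load_namespace(fsimage, editlog):
    """Rebuild the namespace from a checkpoint (fsimage) and the edits after it."""
    # Copy so the checkpoint itself is not modified.
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in editlog:
        apply_edit(namespace, edit)
    return namespace
```

For example, starting from an image containing only `/a` and replaying "create `/b`, add a block to `/b`, delete `/a`" yields a namespace containing only `/b`.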

Data Integrity (1)
Safemode
–On startup, the NameNode receives Heartbeat and Blockreport messages from the DataNodes
–Each block has a specified minimum number of replicas; blocks found below this threshold are re-replicated
–Replication of data blocks does not occur during this period
–This happens at every NameNode startup
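
The Safemode bookkeeping can be illustrated as counting how many blocks have reached their minimum replica count while reports arrive. In this sketch the exit threshold (`threshold=0.999`), the default values, and all function names are assumptions for illustration; the slide does not give the real values.

```python
def safely_replicated_fraction(reported, min_replicas):
    """reported: dict mapping block ID -> number of replicas reported so far."""
    if not reported:
        return 1.0
    safe = sum(1 for count in reported.values() if count >= min_replicas)
    return safe / len(reported)


def can_leave_safemode(reported, min_replicas=1, threshold=0.999):
    """True once enough blocks have reached the minimum replica count."""
    return safely_replicated_fraction(reported, min_replicas) >= threshold


def under_replicated(reported, target_replicas=3):
    """Blocks that still need re-replication once Safemode is over."""
    return sorted(
        block for block, count in reported.items() if count < target_replicas
    )
```

The separation matters: during Safemode the NameNode only counts, and the list from `under_replicated` is acted on afterwards.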

Data Integrity (2)
Data fetched from a DataNode could be corrupted
–Checksum algorithms are implemented
Operation
① When a client creates an HDFS file, it also computes and stores a checksum of the contents
② When a client retrieves a file, it also downloads the checksum
③ By comparing the downloaded checksum with a checksum recomputed from the received file, the client can verify the content
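
The three steps above can be sketched with a CRC32 checksum from Python's standard `zlib` module. The choice of CRC32 over the whole payload and the function names are assumptions for illustration; the slide does not say which algorithm or granularity HDFS uses.

```python
import zlib


def write_with_checksum(data: bytes):
    """Step 1: on write, compute a checksum and store it next to the data."""
    return data, zlib.crc32(data)


def verify(data: bytes, stored_checksum: int) -> bool:
    """Steps 2 and 3: recompute the checksum on read and compare."""
    return zlib.crc32(data) == stored_checksum
```

If `verify` returns False, a client could retry the read from another DataNode that holds a replica of the same block.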

Robustness
Data disk failure, heartbeats and re-replication
–From Heartbeat messages, the NameNode can check the liveness of each DataNode
Cluster rebalancing
–If a DataNode holds much more data than the others, blocks are redistributed
Data integrity
–Checksum
Metadata disk failure
–Multiple copies of FsImage and EditLog are kept
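
Cluster rebalancing can be pictured as repeatedly moving one block from the fullest DataNode to the emptiest until the spread is small enough. This greedy sketch counts blocks rather than disk usage and ignores replica and rack constraints; `plan_rebalance` and its `tolerance` parameter are hypothetical and do not describe the real balancer.

```python
def plan_rebalance(block_counts, tolerance=1):
    """Return a list of (source, destination) single-block moves.

    block_counts: dict mapping DataNode name -> number of blocks it holds.
    Stops when the fullest and emptiest nodes differ by at most `tolerance`.
    """
    if tolerance < 1:
        # With tolerance 0 an uneven total could never be balanced.
        raise ValueError("tolerance must be at least 1")
    counts = dict(block_counts)  # work on a copy
    moves = []
    while counts:
        source = max(counts, key=counts.get)
        destination = min(counts, key=counts.get)
        if counts[source] - counts[destination] <= tolerance:
            break
        counts[source] -= 1
        counts[destination] += 1
        moves.append((source, destination))
    return moves
```

For three nodes holding 10, 2, and 3 blocks, the plan consists of five moves, all out of the fullest node, leaving 5 blocks on each.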

Pros and Cons
Pros
–Powerful fault-tolerance mechanism
–Easy to deploy
–Free
Cons
–Single point of failure: the NameNode
–Not an optimized solution: the same degree of replication is applied to every block
–Not that fast

Download & More Information
Official site
–http://hadoop.apache.org/
–Last build: March 2011
Korean Dev.
–http://www.hadoop.co.kr/
–Last materials uploaded: October 2011
