HDFS (Hadoop Distributed File System) 2011-10-10 Taejoong Chung, MMLAB.


1 HDFS (Hadoop Distributed File System) 2011-10-10 Taejoong Chung, MMLAB

2 Contents: Introduction –Hadoop Distributed File System? –Assumptions & Goals; Mechanism –Structure –Data Management –Maintenance; Pros and Cons

3 HDFS: Hadoop Distributed File System –Started from ‘Nutch’ (an open-source search engine project) in 2005 –Java-based, Apache top-level project –Stores massive amounts of data at low cost. Characteristics –User-level distributed file system –Fault-tolerant –Can be deployed on low-cost hardware

4 Assumptions & Goals 1) Protection against failure: detection of faults and quick, automatic recovery; considers both hardware and software failures 2) Streaming data access: batch processing rather than interactive use; high throughput of data access rather than low latency

5 Assumptions & Goals (contd.) 3) Large data sets: a typical file in HDFS is gigabytes to terabytes in size; high aggregate data bandwidth, scaling to hundreds of nodes 4) Simple coherency model: write-once-read-many access; once a file is created, it cannot be modified

6 Assumptions & Goals (contd.) 5) Moving computation to the data: provides interfaces for applications to move themselves closer to where the data is located 6) Portability: easily portable from one platform to another; Java-based
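
The idea of moving computation to the data can be illustrated with a small scheduling sketch. This is not HDFS code: the function name, the node names, and the "first free node" fallback are assumptions made only for this example.

```python
def pick_node(block_locations, free_nodes):
    """Choose where to run a task that reads one block.

    block_locations: set of DataNode names that hold a replica of the block.
    free_nodes: list of node names that can currently run a task.
    Returns (node, is_local); node is None when no node is free.
    """
    # Prefer a node that already stores the block, so no data crosses the network.
    for node in free_nodes:
        if node in block_locations:
            return node, True
    # Otherwise fall back to any free node; the block must then be read remotely.
    if free_nodes:
        return free_nodes[0], False
    return None, False


print(pick_node({"dn2", "dn3"}, ["dn1", "dn3"]))
```

In the call above the task lands on dn3, because dn3 is both free and already holds a replica of the block.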

7 Structure: Master/Slave architecture. NameNode (Master) –Manages the file system namespace –Regulates access to files by clients –Does not store any file data –Unique (one per cluster). DataNode (Slave) –The actual data repository –Multiple nodes are required

8 Conceptual Diagram [Figure: the NameNode holds the namespace (headquarters, directory service); each DataNode contains multiple blocks of data; a block is a piece of data]

9 Operation: A file is stored as multiple blocks, each replicated across the DataNodes –A file is cut into multiple blocks of 64 MB each (default) –Each block is replicated across the DataNodes (number of replicas: 3 by default). Scheme –Aims to maximize fault tolerance –Local tolerance: inside the rack –Global tolerance: outside the rack
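
The block arithmetic on this slide can be worked through in a few lines. The sketch below uses only the two defaults given on the slide (64 MB blocks, 3 replicas); the function names are made up for illustration.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size, as stated on the slide
REPLICATION = 3                # default number of replicas, as stated on the slide


def block_sizes(file_size, block_size=BLOCK_SIZE):
    """Return the sizes in bytes of the blocks a file of file_size bytes is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover bytes
    return sizes


def raw_storage(file_size, replication=REPLICATION):
    """Total bytes stored across the cluster once every block is replicated."""
    return file_size * replication


MB = 1024 * 1024
sizes = block_sizes(200 * MB)
print(len(sizes), "blocks; last block is", sizes[-1] // MB, "MB")
print("raw storage:", raw_storage(200 * MB) // MB, "MB")
```

For a 200 MB file this gives four blocks (three of 64 MB and one of 8 MB) and 600 MB of raw storage across the cluster.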

10 Example: Rack Awareness [Figure: command to save files, from the NameNode to DataNodes in Rack 1, Rack 2, and Rack 3; local tolerance: replicas in the same rack; global tolerance: replicas outside the rack]
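
One way to read the local/global tolerance scheme is sketched below: two replicas go to different DataNodes in the writer's rack and the third goes to a DataNode in another rack. This is an assumption about what the diagram shows, not the exact HDFS placement policy, and the topology and names are hypothetical.

```python
import random


def place_replicas(topology, writer_rack, rng=random):
    """Pick three DataNodes for one block.

    topology: dict mapping rack name -> list of DataNode names (hypothetical layout).
    Two replicas stay in writer_rack (local tolerance) and one goes to a
    different rack (global tolerance).
    """
    local_nodes = topology[writer_rack]
    remote_racks = [r for r in topology if r != writer_rack and topology[r]]
    if len(local_nodes) < 2 or not remote_racks:
        raise ValueError("need two local nodes and at least one non-empty remote rack")
    local = rng.sample(local_nodes, 2)        # two distinct nodes in the same rack
    remote_rack = rng.choice(remote_racks)    # any other rack
    remote = rng.choice(topology[remote_rack])
    return local + [remote]


topology = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5"],
    "rack3": ["dn6"],
}
print(place_replicas(topology, "rack1"))
```

With this layout the block survives the loss of a single DataNode without leaving the rack, and survives the loss of the whole rack through the remote copy.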

11 Data Maintenance: Each DataNode periodically sends Heartbeat and Blockreport messages to the NameNode –Blockreport: a list of all blocks on a DataNode –Heartbeat: a kind of ‘ping’ (I’m alive!); receipt of a Heartbeat implies that the DataNode is functioning properly
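
Heartbeat-based liveness checking can be sketched as a table of last-seen timestamps. The class name and the 30-second timeout are assumptions for this example, not HDFS defaults.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each DataNode (illustrative only)."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def heartbeat(self, node, now):
        # Called whenever a heartbeat from `node` arrives at time `now` (seconds).
        self.last_seen[node] = now

    def dead_nodes(self, now):
        # A node is presumed dead once it has been silent for longer than the timeout.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout_seconds=30)  # hypothetical timeout
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=25)   # dn1 keeps reporting, dn2 goes silent
print(monitor.dead_nodes(now=40))
```

Here dn2 is reported as dead at time 40 because its last heartbeat is 40 seconds old, while dn1 was heard from 15 seconds ago.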

12 Data Management: The NameNode manages all metadata –EditLog: a transaction log in which the NameNode records every change to the file system metadata –FsImage (File System Image): stores the entire file system namespace, including the mapping of blocks to files –Key metadata is kept in memory, along with the block locations reported by the DataNodes
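
The relationship between the FsImage and the EditLog can be illustrated as "snapshot plus replayed transactions". In the sketch below the namespace is a plain dict from path to block IDs and each edit is a tuple; both formats are invented for the example and do not reflect the real on-disk layout.

```python
def apply_edit(namespace, edit):
    """Apply one logged transaction to an in-memory namespace (path -> block IDs)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = list(edit[2])
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)
    else:
        raise ValueError("unknown operation: " + op)


def checkpoint(fsimage, editlog):
    """Replay the edit log on top of the image; return (new image, emptied log)."""
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in editlog:
        apply_edit(namespace, edit)
    return namespace, []


fsimage = {"/a": ["blk_1"]}
editlog = [
    ("create", "/b", ["blk_2", "blk_3"]),
    ("rename", "/a", "/c"),
    ("delete", "/b"),
]
new_image, new_log = checkpoint(fsimage, editlog)
print(new_image, new_log)
```

After the checkpoint the new image already contains every logged change, so the log can start again from empty.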

13 Data Integrity (1) Safemode –On startup, the NameNode enters Safemode and receives Heartbeat and Blockreport messages from the DataNodes –Each block has a specified minimum number of replicas; blocks under this threshold are re-replicated once Safemode ends –No replication of data blocks occurs during Safemode –This happens at every NameNode startup
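
The Safemode check can be sketched as a count of how many blocks have reported enough replicas. The function names, the block IDs, and the threshold values below are assumptions chosen for illustration.

```python
def can_leave_safemode(reported_replicas, min_replicas, threshold):
    """Return True once enough blocks have reported at least min_replicas copies.

    reported_replicas: dict block ID -> number of replicas reported so far.
    threshold: required fraction of safely replicated blocks (hypothetical value).
    """
    if not reported_replicas:
        return True
    safe = sum(1 for count in reported_replicas.values() if count >= min_replicas)
    return safe / len(reported_replicas) >= threshold


def under_replicated(reported_replicas, target_replicas):
    """Blocks that need re-replication after Safemode ends."""
    return sorted(b for b, count in reported_replicas.items() if count < target_replicas)


counts = {"blk_1": 3, "blk_2": 1, "blk_3": 0, "blk_4": 2}
print(can_leave_safemode(counts, min_replicas=1, threshold=0.999))
print(under_replicated(counts, target_replicas=3))
```

In this example three of the four blocks have at least one reported replica, so a threshold of 0.999 keeps the NameNode in Safemode until blk_3 is reported.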

14 Data Integrity (2) Data fetched from a DataNode could be corrupted –Checksums are used to detect this. Operation ① When a client creates an HDFS file, it also computes and stores a checksum ② When a client retrieves a file, it also downloads the checksum ③ By comparing the downloaded checksum with a checksum recomputed from the received file, the client can verify the content
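
The three checksum steps can be sketched with CRC32 from the Python standard library. The choice of CRC32 and the 512-byte chunk size are assumptions for this example; the slide does not name the algorithm.

```python
import zlib

CHUNK_SIZE = 512  # bytes per checksummed chunk (assumed for illustration)


def chunk_checksums(data, chunk_size=CHUNK_SIZE):
    """Step 1: compute one CRC32 per chunk when the file is written."""
    return [zlib.crc32(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]


def verify(data, expected, chunk_size=CHUNK_SIZE):
    """Steps 2 and 3: recompute checksums on the received data and compare."""
    return chunk_checksums(data, chunk_size) == expected


original = bytes(range(256)) * 5          # 1280 bytes, so three chunks
stored = chunk_checksums(original)

corrupted = bytearray(original)
corrupted[600] ^= 0xFF                    # flip one byte in the second chunk

print(verify(original, stored))
print(verify(bytes(corrupted), stored))
```

The intact copy passes verification and the copy with one flipped byte fails, at which point a client would fetch the block from another replica.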

15 Robustness: Data disk failure, heartbeats and re-replication –From Heartbeat messages, the NameNode can check the liveness of each DataNode. Cluster rebalancing –If a DataNode holds much more data than the others, blocks are redistributed. Data integrity –Checksums. Metadata disk failure –Multiple copies of the FsImage and EditLog are kept
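
Cluster rebalancing needs a rule for deciding which DataNodes hold "much more" data than the others. One possible rule, sketched below, compares each node's disk utilization with the cluster average; the function name and the 10-point threshold are assumptions for this example.

```python
def rebalance_candidates(utilization, threshold=0.10):
    """Split nodes into over- and under-utilized relative to the cluster average.

    utilization: dict DataNode name -> fraction of disk used (0.0 to 1.0).
    threshold: allowed distance from the average (hypothetical value).
    """
    average = sum(utilization.values()) / len(utilization)
    over = sorted(n for n, u in utilization.items() if u > average + threshold)
    under = sorted(n for n, u in utilization.items() if u < average - threshold)
    return over, under


usage = {"dn1": 0.90, "dn2": 0.55, "dn3": 0.40}
over, under = rebalance_candidates(usage)
print("move blocks from", over, "to", under)
```

With these numbers the average utilization is about 0.62, so dn1 is a source of blocks, dn3 is a destination, and dn2 is left alone.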

16 Pros and Cons: Pros –Powerful fault-tolerance mechanism –Easy to deploy –Free. Cons –Single point of failure: the NameNode –Not an optimized solution: the same degree of replication for every block –Not that fast

17 Download & More Information: Official site –http://hadoop.apache.org/ –Last build: March 2011. Korean Dev. –http://www.hadoop.co.kr/ –Last uploaded materials: October 2011

18 QnA

