The Hadoop Distributed File System
CSS534: Parallel Programming in Grid and Cloud
Saranya Krishnan, Gayathri Palanisami
Outline
Introduction
Architecture: NameNode, DataNodes, HDFS Client, Image and Journal, CheckpointNode, BackupNode, Snapshots
File I/O Operations: File Read and Write, Block Placement, Replication Management, Balancer
Application: Wiki PageRank using Hadoop
Conclusion
Introduction: HDFS
The Hadoop Distributed File System (HDFS) is the file system component of Hadoop. It is designed (1) to store very large data sets reliably and (2) to stream those data sets at high bandwidth to user applications. Both goals are achieved by replicating file content on multiple machines (DataNodes).
Architecture: NameNode and DataNodes
Architecture: NameNode – DataNodes Communication
Architecture: DataNode – Failure Recovery
NameNode – Failure Recovery: Image, Journal, and Checkpoint
Image: the inode data and the list of blocks belonging to each file make up the metadata, called the image.
Checkpoint: the persistent record of the image, stored in the local host's native file system.
Journal: the modification log of the image, which the NameNode also stores in the local host's native file system.
NameNode – Failure Recovery
CheckpointNode: when the journal grows too long, the CheckpointNode combines the existing checkpoint and the journal to create a new checkpoint and an empty journal.
BackupNode: maintains an up-to-date image of the file system namespace that is always synchronized with the state of the NameNode. If the NameNode fails, the BackupNode's image in memory and the checkpoint on disk together record the latest namespace state.
Snapshots: the snapshot mechanism lets administrators persistently save the current state of the file system (both data and metadata). If a file system upgrade results in data loss or corruption, the upgrade can be rolled back, returning HDFS to the namespace and storage state as they were at the time of the snapshot.
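To make the checkpointing idea concrete, here is a minimal sketch (not HDFS source code; all names such as `apply_edit` and the dict-based image are hypothetical) of how a CheckpointNode could fold a journal of namespace edits into the existing checkpoint to produce a new checkpoint and an empty journal:

```python
def apply_edit(image, edit):
    """Apply one journal record to the in-memory image (a dict of path -> metadata)."""
    op, path, payload = edit
    if op == "create":
        image[path] = payload          # record the new inode metadata
    elif op == "delete":
        image.pop(path, None)          # drop the inode if present
    elif op == "update":
        image[path].update(payload)    # merge changed metadata fields
    return image

def make_checkpoint(checkpoint, journal):
    """Combine the existing checkpoint with the journal: replay every edit
    in order, then return the new checkpoint and an empty journal."""
    image = dict(checkpoint)           # start from the persisted image
    for edit in journal:
        apply_edit(image, edit)
    return image, []
```

The key design point the slide describes is that replaying the (short) new journal against the latest checkpoint is much faster at NameNode restart than replaying a journal that has grown unboundedly.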
File I/O Operations and Replica Management: Rack Awareness
File I/O Operations and Replica Management: File Read and Write
Application: Wiki PageRank with Hadoop
The Plan
• Parse the big Wiki XML articles in Hadoop job 1.
• Calculate the new PageRank in Hadoop job 2.
• Map the rank and page in Hadoop job 3.
Hadoop Job 1
In the map phase, extract each article's name and its outgoing links. In the reduce phase, collect for each wiki page its links to other pages. Store the page, its initial rank, and its outgoing links.
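A minimal sketch of job 1's logic, written as plain Python functions rather than real Hadoop Mapper/Reducer classes; the `<title>` and `[[Target]]`/`[[Target|label]]` patterns are simplifying assumptions about the wiki XML, and the initial rank of 1.0 is also an assumption (the slides do not state one):

```python
import re

def job1_map(article_xml):
    """Map: emit (article title, outgoing wiki links) from one <page> record."""
    title = re.search(r"<title>(.*?)</title>", article_xml).group(1)
    # Wiki links look like [[Target]] or [[Target|label]]; keep only the target.
    links = [m.split("|")[0] for m in re.findall(r"\[\[(.*?)\]\]", article_xml)]
    return title, links

def job1_reduce(title, links, initial_rank=1.0):
    """Reduce: store the page with its initial rank and its outgoing links."""
    return title, initial_rank, links
```

In a real Hadoop job these would be a `Mapper` and `Reducer` operating on XML-splitting input formats; the sketch only shows the per-record transformation.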
Hadoop Job 2
In the map phase, map each outgoing link to the page along with its rank and total number of outgoing links. In the reduce phase, calculate the new PageRank for each page. Store the page, its new rank, and its outgoing links. Repeat these steps for more accurate results, since PageRank converges over successive iterations.
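The rank update in job 2 can be sketched as follows. This is an illustrative map/reduce pair, not Hadoop API code; the damping factor 0.85 is the conventional PageRank choice, assumed here because the slides do not state one:

```python
def job2_map(page, rank, outlinks):
    """Map: for each outgoing link, emit (target, share of this page's rank);
    also re-emit the page's own link list so the reducer can rebuild the record."""
    pairs = [(page, ("links", outlinks))]
    for target in outlinks:
        pairs.append((target, ("share", rank / len(outlinks))))
    return pairs

def job2_reduce(page, values, damping=0.85):
    """Reduce: sum the incoming rank shares and apply
    rank = (1 - d) + d * sum(shares)."""
    outlinks, total = [], 0.0
    for kind, value in values:
        if kind == "links":
            outlinks = value       # the page's own outgoing links, passed through
        else:
            total += value         # a rank share from some page linking here
    return page, (1 - damping) + damping * total, outlinks
```

Running this map/reduce pair repeatedly over the whole page set is exactly the iteration the slide describes: each pass redistributes rank along the link graph.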
Hadoop Job 3
Store the rank and page, ordered by rank. See the top 10 pages!
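Job 3's ordering step amounts to inverting the (page, rank) records and sorting by rank; in Hadoop this is typically done by emitting rank as the key and letting the shuffle sort, but the effect is captured by this small sketch (function name and `top=10` default are assumptions):

```python
def job3_top_pages(page_ranks, top=10):
    """Sort (page, rank) pairs by rank, highest first, and keep the top pages."""
    return sorted(page_ranks, key=lambda pr: pr[1], reverse=True)[:top]
```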
Conclusion: Hadoop Advantages and Disadvantages
Advantages: scalable, cost-effective, flexible, fast, resilient to failure.
Disadvantages: security concerns, vulnerable by nature, not fit for small data, potential scalability issues.
References
[1] Apache Hadoop. http://hadoop.apache.org/
[2] HDFS Architecture Guide. http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
[3] The Hadoop Distributed File System. http://dl.acm.org/citation.cfm?id=1914427
[4] http://www.cse.buffalo.edu/faculty/bina/presentations/mapreduceJan19-2010.pdf
[5] Wiki PageRank with Hadoop. http://blog.xebia.com/wiki-pagerank-with-hadoop/
THANK YOU ☺