Presentation is loading. Please wait.

Presentation is loading. Please wait.

CSS534: Parallel Programming in Grid and Cloud

Similar presentations


Presentation on theme: "CSS534: Parallel Programming in Grid and Cloud"— Presentation transcript:

1 CSS534: Parallel Programming in Grid and Cloud
Saranya Krishnan, Gayathri Palanisami The Hadoop Distributed File System

2 Outline Introduction Architecture File I/O Operations Application
NameNode, DataNodes, HDFS Client, Image and Journal, CheckpointNode, BackupNode, Snapshots File I/O Operations File Read and Write, Block Placement, Replication Management, Balancer Application Wiki PageRank using Hadoop Conclusion

3 It is designed to store very large data sets
Introduction HDFS The Hadoop Distributed File System (HDFS) - file system component of Hadoop. It is designed to store very large data sets (1) reliably, and to stream those data sets at high bandwidth to user applications. (2) These are achieved by replicating file content on multiple machines (DataNodes).

4 NameNode and DataNodes
Architecture NameNode and DataNodes

5 Name Node – Data Nodes Communication
Architecture Name Node – Data Nodes Communication

6 Data Node – Failure Recovery
Architecture Data Node – Failure Recovery

7 NameNode - Failure Recovery
Image, Journal and CheckPoint Image: The inode data and list of blocks belonging to the files comprises the metadata which is called Image. CheckPoint: The persistent record of the image is stored in the local host file system which is called CheckPoint. Journal: The NameNode stores the modification log of Image in the local host native file system which is called Journal.

8 NameNode - Failure Recovery
CheckPointNode When journal becomes too long, checkpointNode combines the existing checkpoint and journal to create a new checkpoint and an empty journal. BackupNode BackupNode maintains up-to-date image of the file system namespace that is always synchronized with the state of the NameNode. If the NameNode fails, the BackupNode’s image in memory and the checkpoint on disk is a record of the latest namespace state. Snapshots The snapshot mechanism lets administrators persistently save the current state of the file system(both data and metadata). If the file system upgrade results in data loss or corruption, it is possible to rollback the upgrade and return HDFS to the namespace and storage state as they were at the time of the snapshot.

9 File I/O Operations and Replica Management
Rack Awareness

10 File I/O Operations and Replica Management

11 File I/O Operations and Replica Management

12 File I/O Operations and Replica Management
File Read and Write

13 File I/O Operations and Replica Management

14 File I/O Operations and Replica Management

15 File I/O Operations and Replica Management

16 Wiki PageRank with Hadoop
Applications Wiki PageRank with Hadoop The Plan • Parse the big Wiki xml articles in Hadoop job 1. • Calculate new PageRank in Hadoop job 2. • Map the rank and page in Hadoop job 3.

17 Hadoop Job 1 In the Hadoop mapping phase, get the article's name and its outgoing links. In the Hadoop reduce phase, get for each WikiPage the links to other pages. Store the page, initial rank and outgoing links.

18 Hadoop Job 2 In the mapping phase, map each outgoing link to the page with its rank and total outgoing links. In the reduce phase, calculate the new PageRank for the pages. Store the page, new rank and outgoing links. Repeat these steps for more accurate results.

19 Hadoop Job 3 Store the rank and page (ordered based on rank). See the top 10 pages!

20 Hadoop Advantages and Disadvantages
Conclusion Hadoop Advantages and Disadvantages Scalable Cost Effective Flexible Fast Resilient to failure Security Concerns Vulnerable by nature Not fit for small data Potential scalability issues

21 References [1]  Apache Hadoop. [2] [3] The Hadoop Distributed File System. [4] educeJan pdf [5]

22 THANK YOU ☺


Download ppt "CSS534: Parallel Programming in Grid and Cloud"

Similar presentations


Ads by Google