CSS534: Parallel Programming in Grid and Cloud


The Hadoop Distributed File System
Saranya Krishnan, Gayathri Palanisami

Outline
Introduction
Architecture: NameNode, DataNodes, HDFS Client, Image and Journal, CheckpointNode, BackupNode, Snapshots
File I/O Operations: File Read and Write, Block Placement, Replication Management, Balancer
Application: Wiki PageRank using Hadoop
Conclusion

Introduction: HDFS
The Hadoop Distributed File System (HDFS) is the file system component of Hadoop. It is designed to (1) store very large data sets reliably, and (2) stream those data sets at high bandwidth to user applications. Both goals are achieved by replicating file content on multiple machines (DataNodes).

Architecture: NameNode and DataNodes

Architecture: NameNode – DataNodes Communication

Architecture: DataNode – Failure Recovery

NameNode - Failure Recovery: Image, Journal and Checkpoint
Image: the metadata, comprising the inode data and the list of blocks belonging to each file.
Checkpoint: the persistent record of the image, stored in the local host's native file system.
Journal: the modification log of the image, which the NameNode also stores in the local host's native file system.

NameNode - Failure Recovery
CheckpointNode: when the journal grows too long, the CheckpointNode combines the existing checkpoint and journal to create a new checkpoint and an empty journal.
BackupNode: maintains an up-to-date image of the file system namespace that is always synchronized with the state of the NameNode. If the NameNode fails, the BackupNode's image in memory and the checkpoint on disk record the latest namespace state.
Snapshots: the snapshot mechanism lets administrators persistently save the current state of the file system (both data and metadata). If a file system upgrade results in data loss or corruption, it is possible to roll back the upgrade and return HDFS to the namespace and storage state as they were at the time of the snapshot.
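The checkpointing step described above can be sketched in plain Python. This is a hypothetical data model (a dict mapping paths to block lists, a journal of logged operations), not the real HDFS classes; it only illustrates how replaying the journal over the image yields a new checkpoint and an empty journal.

```python
def create_checkpoint(image, journal):
    """Apply each logged namespace modification to a copy of the image,
    producing a new checkpoint and an empty journal."""
    checkpoint = dict(image)  # persistent record of the namespace image
    for op, path, value in journal:
        if op == "create":
            checkpoint[path] = value      # e.g. inode data + block list
        elif op == "delete":
            checkpoint.pop(path, None)
    return checkpoint, []  # new checkpoint, empty journal

# Example: one file exists; the journal records one create and one delete.
image = {"/a.txt": ["blk_1"]}
journal = [("create", "/b.txt", ["blk_2"]), ("delete", "/a.txt", None)]
new_checkpoint, new_journal = create_checkpoint(image, journal)
```

After a restart, the NameNode only needs the latest checkpoint plus whatever journal entries were logged after it, which is why keeping the journal short matters.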

File I/O Operations and Replica Management: Rack Awareness
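Rack awareness drives HDFS's default block placement: the first replica goes on the writer's node, the second on a node in a different rack, and the third on a different node in the same rack as the second. The sketch below is a simplified stand-in (hypothetical helper, no load or availability checks), not the real `BlockPlacementPolicyDefault`.

```python
import random

def place_replicas(writer_node, racks, replication=3):
    """Simplified sketch of HDFS's default placement for 3 replicas:
    1st on the writer's node, 2nd in a different rack, 3rd on another
    node in the same rack as the 2nd."""
    rack_of = {n: r for r, nodes in racks.items() for n in nodes}
    placement = [writer_node]                       # 1st replica: local
    other_racks = [r for r in racks if r != rack_of[writer_node]]
    second = random.choice(racks[random.choice(other_racks)])
    placement.append(second)                        # 2nd replica: off-rack
    third = random.choice([n for n in racks[rack_of[second]] if n != second])
    placement.append(third)                         # 3rd: same rack as 2nd
    return placement

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas("n1", racks)
```

This policy trades a little read locality for write-pipeline efficiency: two of the three replicas share a rack, so only one copy crosses the inter-rack link, while a whole-rack failure still leaves a live replica.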

File I/O Operations and Replica Management: File Read and Write

Applications: Wiki PageRank with Hadoop
The Plan
• Parse the big Wiki XML articles in Hadoop job 1.
• Calculate the new PageRank in Hadoop job 2.
• Map the rank and page in Hadoop job 3.

Hadoop Job 1 In the map phase, extract each article's name and its outgoing links. In the reduce phase, collect for each wiki page the links to other pages. Store the page, its initial rank and its outgoing links.
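Job 1 can be sketched in plain Python (a real job would use the Java MapReduce API or Hadoop Streaming; the function names and simplified markup here are hypothetical): the map phase pulls the title and the `[[link]]` targets out of each article, and the reduce phase stores each page with an initial rank of 1.0 and its outgoing links.

```python
import re

def job1_map(article_xml):
    """Map: emit (article name, list of outgoing links) per article."""
    title = re.search(r"<title>(.*?)</title>", article_xml).group(1)
    links = re.findall(r"\[\[(.*?)\]\]", article_xml)
    yield title, links

def job1_reduce(mapped):
    """Reduce: store page -> (initial rank, outgoing links)."""
    return {title: (1.0, links) for title, links in mapped}

pages = ["<title>A</title>[[B]][[C]]", "<title>B</title>[[A]]"]
table = job1_reduce(kv for p in pages for kv in job1_map(p))
```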

Hadoop Job 2 In the map phase, map each outgoing link to the page along with its rank and total number of outgoing links. In the reduce phase, calculate the new PageRank for each page. Store the page, its new rank and its outgoing links. Repeat these steps for more accurate results.
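The rank update in Job 2 can be sketched like this (plain Python with hypothetical helper names; a standard damping factor of 0.85 is assumed, which the slide does not state): the map phase emits each page's rank divided evenly among its outgoing links, and the reduce phase sums the incoming shares into a new rank.

```python
from collections import defaultdict

def job2_map(table):
    """Map: for each outgoing link, emit (target, rank share)."""
    for page, (rank, links) in table.items():
        for target in links:
            yield target, rank / len(links)

def job2_reduce(table, shares, damping=0.85):
    """Reduce: sum incoming shares into each page's new PageRank."""
    summed = defaultdict(float)
    for target, share in shares:
        summed[target] += share
    return {page: ((1 - damping) + damping * summed[page], links)
            for page, (rank, links) in table.items()}

table = {"A": (1.0, ["B", "C"]), "B": (1.0, ["A"]), "C": (1.0, ["A"])}
for _ in range(10):   # repeat for more accurate results
    table = job2_reduce(table, job2_map(table))
```

Each pass is one Hadoop job; "repeat these steps" on the slide corresponds to the loop, with each iteration reading the previous job's output.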

Hadoop Job 3 Store the rank and page, ordered by rank. See the top 10 pages!
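Job 3 reduces to a sort. In a real Hadoop job the ordering falls out of the shuffle phase by emitting the rank as the key; this hypothetical plain-Python stand-in just sorts the table descending and takes the top 10.

```python
def job3_top(table, n=10):
    """Order pages by rank, descending, and keep the top n."""
    ranked = sorted(table.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(page, rank) for page, (rank, _links) in ranked[:n]]

table = {"A": (1.46, ["B", "C"]), "B": (0.77, ["A"]), "C": (0.77, ["A"])}
top = job3_top(table)
```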

Conclusion: Hadoop Advantages and Disadvantages
Advantages: scalable, cost-effective, flexible, fast, resilient to failure.
Disadvantages: security concerns, vulnerable by nature, not a fit for small data, potential scalability issues.

References
[1] Apache Hadoop. http://hadoop.apache.org/
[2] http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
[3] The Hadoop Distributed File System. http://dl.acm.org/citation.cfm?id=1914427
[4] http://www.cse.buffalo.edu/faculty/bina/presentations/mapreduceJan19-2010.pdf
[5] http://blog.xebia.com/wiki-pagerank-with-hadoop/

THANK YOU ☺