Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc

2 What's Hadoop
A framework for running applications on large clusters of commodity hardware
– Scale: petabytes of data on thousands of nodes
Includes
– Storage: HDFS
– Processing: MapReduce
Supports the Map/Reduce programming model
Requirements
– Economy: use clusters of commodity computers
– Easy to use: users need not deal with the complexity of distributed computing
– Reliable: handles node failures automatically

3 Open source
Apache project, implemented in Java
Apache Top Level Project
– Core (15 committers): HDFS, MapReduce
Community of contributors is growing
– Though mostly Yahoo! for HDFS and MapReduce
– You can contribute too!

4 Hadoop Characteristics
Commodity HW
– Add inexpensive servers
– Storage servers and their disks are not assumed to be highly reliable and available
– Uses replication across servers to deal with unreliable storage/servers
Metadata-data separation: simple design
– Namenode maintains metadata
– Datanodes manage storage
Slightly restricted file semantics
– Focus is mostly sequential access
– Single writers
– No file locking features
Support for moving computation close to data
– Servers have two purposes: data storage and computation
– Single ‘storage + compute’ cluster vs. separate clusters

5 Hadoop Architecture
[Diagram: input data is stored across the Hadoop cluster as replicated DFS blocks (e.g. Block 1, Block 2, Block 3), processed by Map tasks, and combined by Reduce to produce the results]

6 HDFS Data Model
– Data is organized into files and directories
– Files are divided into uniform-sized blocks and distributed across cluster nodes
– Blocks are replicated to handle hardware failure
– The filesystem keeps checksums of data for corruption detection and recovery
– HDFS exposes block placement so that computation can be migrated to data
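The slicing-and-checksumming idea above can be sketched in a few lines of Python. This is a toy simulation, not HDFS code (HDFS is Java, uses much larger blocks, on the order of 64 MB in this era, and CRC-based checksums; `split_into_blocks`, the 4-byte block size, and the MD5 checksum here are purely illustrative):

```python
import hashlib

BLOCK_SIZE = 4  # bytes; tiny for illustration only

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Divide a byte string into uniform-sized blocks, keeping a
    checksum per block so corruption can be detected on read."""
    blocks = []
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        blocks.append({"data": chunk,
                       "checksum": hashlib.md5(chunk).hexdigest()})
    return blocks

blocks = split_into_blocks(b"hello hdfs")
# 10 bytes with a 4-byte block size yield 3 blocks; the last is shorter
```

Each block would then be replicated to several datanodes, and a reader recomputes the checksum to detect a corrupt replica before falling back to another one.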

7 HDFS Data Model
NameNode metadata: (filename, replicationFactor, block-ids, …)
– /users/user1/data/part-0, r:2, {1,3}, …
– /users/user1/data/part-1, r:3, {2,4,5}, …
The blocks themselves live on the Datanodes.
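The slide's metadata table amounts to two mappings, which can be modeled as plain Python dictionaries (a toy sketch; the names `namespace`, `block_locations`, and the `dn*` datanode ids are illustrative, not HDFS classes or fields):

```python
# Path -> replication factor and ordered block ids (the slide's table)
namespace = {
    "/users/user1/data/part-0": {"replication": 2, "blocks": [1, 3]},
    "/users/user1/data/part-1": {"replication": 3, "blocks": [2, 4, 5]},
}

# Block id -> datanodes currently holding a replica; in HDFS this map
# is rebuilt at runtime from Datanode block reports, not persisted
block_locations = {
    1: ["dn1", "dn3"],
    3: ["dn2", "dn5"],
    2: ["dn1", "dn2", "dn4"],
    4: ["dn2", "dn3", "dn5"],
    5: ["dn1", "dn4", "dn5"],
}

def locate(path):
    """Return, per block of `path`, the datanodes a client can read from."""
    return [block_locations[b] for b in namespace[path]["blocks"]]
```

Note that each entry of `namespace` for part-0 has two locations per block and part-1 has three, matching their per-file replication factors.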

8 HDFS Architecture
Master-slave architecture
DFS master "Namenode"
– Manages the filesystem namespace
– Maintains the file name to block list + block location mapping
– Manages block allocation/replication
– Checkpoints the namespace and journals namespace changes for reliability
– Controls access to the namespace
DFS slaves "Datanodes" handle block storage
– Store blocks using the underlying OS's files
– Clients access blocks directly from datanodes
– Periodically send block reports to the Namenode
– Periodically check block integrity

9 HDFS Architecture
[Diagram: clients send metadata ops to the Namenode, which holds entries such as (name, #replicas, …): /users/foo/data, 3, …; clients then read and write blocks directly from/to Datanodes spread across Rack 1 and Rack 2, while the Namenode directs block ops and replication between Datanodes]

10 Block Placement And Replication
A file's replication factor can be set per file (default 3)
Block placement is rack aware
– Guarantees placement on two racks
– 1st replica is on the local node; 2nd/3rd replicas are on a remote rack
– Avoids hot spots: balances I/O traffic
Writes are pipelined to the block replicas
– Minimizes bandwidth usage
– Overlaps disk writes and network writes
Reads are served from the nearest replica
Block under-replication & over-replication is detected by the Namenode
A balancer application rebalances blocks to even out Datanode utilization
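The default placement policy above can be sketched as a small Python function (a toy simulation, not the real Hadoop API; `place_replicas` and the topology format are illustrative assumptions):

```python
import random

def place_replicas(writer, topology, replication=3):
    """Sketch of rack-aware placement: 1st replica on the writer's own
    node, 2nd and 3rd on two different nodes of one remote rack, so the
    block is guaranteed to land on two racks.
    `topology` maps rack name -> list of node names."""
    rack_of = {node: rack
               for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]
    remote_rack = random.choice([r for r in topology if r != local_rack])
    remote_nodes = random.sample(topology[remote_rack], 2)
    return ([writer] + remote_nodes)[:replication]
```

Putting the second and third replicas on the *same* remote rack (rather than three different racks) keeps the write pipeline's cross-rack traffic down to a single rack-to-rack hop while still surviving the loss of either rack.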

11 HDFS Future Work: Scalability
– Scale cluster size
– Scale number of clients
– Scale namespace size (total number of files, amount of data)
Possible solutions
– Multiple namenodes
– Read-only secondary namenode
– Separate cluster management and namespace management
– Dynamically partition the namespace
– Mounting

12 Map/Reduce
Map/Reduce is a programming model for efficient distributed computing
It works like a Unix pipeline:
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
A simple model, but good for a lot of applications
– Log processing
– Web index building
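The Input | Map | Shuffle & Sort | Reduce | Output pipeline can be mimicked in a few lines of single-machine Python (a toy driver to show the data flow, not Hadoop's Java API; `run_mapreduce` is an illustrative name):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    # Map: each input record produces a list of (key, value) pairs
    mapped = [pair for rec in records for pair in mapper(rec)]
    # Shuffle & sort: bring all pairs with the same key together
    mapped.sort(key=itemgetter(0))
    # Reduce: one call per key, with all of that key's values
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(mapped, key=itemgetter(0))]
```

What Hadoop adds on top of this skeleton is exactly the hard part: the map and reduce calls run in parallel across thousands of nodes, and the shuffle & sort moves the intermediate pairs over the network.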

13 Word Count Dataflow

14 Word Count Example
Mapper
– Input: value: lines of the input text
– Output: key: word, value: 1
Reducer
– Input: key: word, value: set of counts
– Output: key: word, value: sum
Launching program
– Defines the job
– Submits the job to the cluster
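The mapper/reducer pair described above can be sketched as a self-contained, single-machine Python simulation (illustrative only; the real example would be written against Hadoop's Java Mapper/Reducer API):

```python
from collections import defaultdict

def mapper(line):
    # Input: a line of text; output: one (word, 1) pair per word
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    # Input: a word and all the counts emitted for it; output: the sum
    return (word, sum(counts))

def word_count(lines):
    groups = defaultdict(list)
    for line in lines:                  # map phase
        for word, n in mapper(line):
            groups[word].append(n)      # shuffle: group values by key
    return dict(reducer(w, c) for w, c in groups.items())  # reduce phase

word_count(["a b a", "b a"])  # -> {"a": 3, "b": 2}
```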

15 Map/Reduce Features
Fine-grained Map and Reduce tasks
– Improved load balancing
– Faster recovery from failed tasks
Automatic re-execution on failure
– In a large cluster, some nodes are always slow or flaky
– The framework re-executes failed tasks
Locality optimizations
– With large data, bandwidth to data is a problem
– Map-Reduce + HDFS is a very effective solution
– Map-Reduce queries HDFS for the locations of the input data
– Map tasks are scheduled close to their inputs when possible

16 Documentation
Hadoop Wiki
– Introduction
– Getting Started
– Map/Reduce Overview
– DFS
Javadoc

17 Questions? Thank you!