• Open source software framework designed for storage and processing of large-scale data on clusters of commodity hardware
• Created by Doug Cutting and Mike Cafarella
• Cutting named the program after his son’s toy elephant.

• Data-intensive text processing
• Assembly of large genomes
• Graph mining
• Machine learning and data mining
• Large-scale social network analysis

• Hadoop Common: contains libraries and other modules
• Hadoop Distributed File System (HDFS)
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: a programming model for large-scale data processing

What considerations led to its design

• What were the limitations of earlier large-scale computing?
• What requirements should an alternative approach have?
• How does Hadoop address those requirements?

• Historically, computation was processor-bound
◦ Data volumes were relatively small
◦ Complicated computations were performed on that data
• Advances in computer technology have historically centered on improving the power of a single machine

• Moore’s Law
◦ The number of transistors on a dense integrated circuit doubles every two years
• Single-core computing can’t scale with current computing needs

• Power consumption limits the speed increase we get from transistor density

• Distributed systems allow developers to use multiple machines for a single task

• Programming on a distributed system is much more complex
◦ Synchronizing data exchanges
◦ Managing finite bandwidth
◦ Controlling computation timing

“You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done.” – Leslie Lamport
• Distributed systems must be designed with the expectation of failure

• Traditionally, systems are divided into Data Nodes and Compute Nodes
• At compute time, data is copied to the Compute Nodes
• This is fine for relatively small amounts of data
• Modern systems deal with far more data than was gathered in the past

• Facebook
◦ 500 TB per day
• Yahoo
◦ Over 170 PB
• eBay
◦ Over 6 PB
• Getting the data to the processors becomes the bottleneck

• Must support partial failure
• Must be scalable

• Failure of a single component must not cause the failure of the entire system, only a degradation of application performance
• Failure should not result in the loss of any data

• If a component fails, it should be able to recover without restarting the entire system
• Component failure or recovery during a job must not affect the final output

• Increasing resources should increase load capacity
• Increasing the load on the system should result in a graceful decline in performance for all jobs
◦ Not system failure

• Based on work done by Google in the early 2000s
◦ “The Google File System” in 2003
◦ “MapReduce: Simplified Data Processing on Large Clusters” in 2004
• The core idea was to distribute the data as it is initially stored
◦ Each node can then perform computation on the data it stores without moving the data for the initial processing

• Applications are written in a high-level programming language
◦ No network programming or temporal dependency
• Nodes should communicate as little as possible
◦ A “shared nothing” architecture
• Data is spread among the machines in advance
◦ Perform computation where the data is already stored as often as possible

• When data is loaded onto the system, it is divided into blocks
◦ Typically 64 MB or 128 MB
• Tasks are divided into two phases
◦ Map tasks, which are done on small portions of data where the data is stored
◦ Reduce tasks, which combine data to produce the final output
• A master program allocates work to individual nodes
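
To make this concrete, here is a minimal word-count job driver written against Hadoop’s Java MapReduce API. It is a sketch: the class names (WordCountDriver, WordCountMapper, WordCountReducer) and the command-line input/output paths are illustrative, and the mapper and reducer it wires up are sketched under the Map and Reduce slides further below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map tasks run where the input blocks are stored; reduce tasks combine their output
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The framework splits the input into blocks and schedules one map task per split
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each input block becomes one map task, scheduled where possible on a node that already holds the block.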

• Failures are detected by the master program, which reassigns the work to a different node
• Restarting a task does not affect the nodes working on other portions of the data
• If a failed node restarts, it is added back to the system and assigned new tasks
• The master can redundantly execute the same task to avoid slow-running nodes

HDFS

• Responsible for storing data on the cluster
• Data files are split into blocks and distributed across the nodes in the cluster
• Each block is replicated multiple times

• HDFS is a file system written in Java, based on Google’s GFS
• It provides redundant storage for massive amounts of data

• HDFS works best with a smaller number of large files
◦ Millions, as opposed to billions, of files
◦ Typically 100 MB or more per file
• Files in HDFS are write-once
• Optimized for streaming reads of large files, not random reads
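
A small sketch of the write-once model using Hadoop’s Java FileSystem API; the path and file contents are hypothetical, and the configuration is assumed to point at an HDFS cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnce {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml if present
        FileSystem fs = FileSystem.get(conf);            // connects to the configured default file system
        Path path = new Path("/user/demo/example.txt");  // hypothetical path

        // create() writes a new file; HDFS files are written once and then read many times
        try (FSDataOutputStream out = fs.create(path, /* overwrite = */ false)) {
            out.writeBytes("hello hdfs\n");
        }
    }
}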

• Files are split into blocks
• Blocks are spread across many machines at load time
◦ Different blocks from the same file will be stored on different machines
• Blocks are replicated across multiple machines
• The NameNode keeps track of which blocks make up a file and where they are stored

• Default replication is 3-fold

• When a client wants to retrieve data, it
◦ Communicates with the NameNode to determine which blocks make up a file and on which DataNodes those blocks are stored
◦ Then communicates directly with the DataNodes to read the data
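
A rough sketch of that read path in Java, assuming the hypothetical file from the write example above: through the FileSystem API the client asks the NameNode where the blocks live, then streams the bytes, which come directly from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/user/demo/example.txt");   // hypothetical path

        // Ask the NameNode which DataNodes hold each block of the file
        FileStatus status = fs.getFileStatus(path);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block hosts: " + String.join(",", block.getHosts()));
        }

        // The actual bytes are streamed directly from the DataNodes
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) > 0) {
                System.out.write(buffer, 0, n);
            }
        }
    }
}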

Distributing computation across nodes

• MapReduce is a method for distributing computation across multiple nodes
• Each node processes the data that is stored at that node
• Consists of two main phases
◦ Map
◦ Reduce

• Automatic parallelization and distribution
• Fault tolerance
• Provides a clean abstraction for programmers to use

• The Mapper reads data as key/value pairs
◦ The key is often discarded
• It outputs zero or more key/value pairs
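
A minimal Mapper sketch for the word-count job wired up in the driver earlier; it ignores the input key (the byte offset of the line) and emits a (word, 1) pair for every word on the line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The key (byte offset) is discarded; only the line of text matters here
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // zero or more output pairs per input record
        }
    }
}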

• Output from the mapper is sorted by key (the shuffle and sort phase)
• All values with the same key are guaranteed to go to the same machine

• The Reducer is called once for each unique key
• It gets a list of all values associated with that key as input
• The Reducer outputs zero or more final key/value pairs
◦ Usually just one output per input key
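
The matching Reducer sketch: it receives each unique word together with all of the 1s emitted for it and writes a single (word, total) pair. Together with the mapper and driver above, this is the classic word-count example.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();         // all values for this key arrive at the same reducer
        }
        total.set(sum);
        context.write(word, total);     // usually one final pair per input key
    }
}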

What parts actually make up a Hadoop cluster

• NameNode
◦ Holds the metadata for HDFS
• Secondary NameNode
◦ Performs housekeeping functions for the NameNode
• DataNode
◦ Stores the actual HDFS data blocks
• JobTracker
◦ Manages MapReduce jobs
• TaskTracker
◦ Monitors individual Map and Reduce tasks

• The NameNode stores the HDFS file system information in an fsimage file
• Updates to the file system (add/remove blocks) do not change the fsimage file
◦ They are instead written to a log file
• When starting, the NameNode loads the fsimage file and then applies the changes in the log file

• The Secondary NameNode is NOT a backup for the NameNode
• It periodically reads the log file and applies the changes to the fsimage file, bringing it up to date
• This allows the NameNode to restart faster when required

• JobTracker
◦ Determines the execution plan for the job
◦ Assigns individual tasks
• TaskTracker
◦ Keeps track of the performance of an individual mapper or reducer

Other available tools

• MapReduce is very powerful, but can be awkward to master
• These tools allow programmers who are familiar with other programming styles to take advantage of the power of MapReduce

• Hive
◦ Hadoop processing with SQL
• Pig
◦ Hadoop processing with scripting
• Cascading
◦ Pipe-and-filter processing model
• HBase
◦ Database model built on top of Hadoop
• Flume
◦ Designed for large-scale data movement