Data Intensive Computing: MapReduce and Hadoop Distributed File System


Data Intensive Computing: MapReduce and Hadoop Distributed File System Mukaddim Pathan Research Scientist, CSIRO ICT Centre

Lecture 1 Data Intensive Computing MapReduce Basics (Google) Hadoop Distributed File System (HDFS) Apache Hadoop Working Details

The Data Age! The volume of stored data is exploding: roughly 0.18 zettabytes in 2006 and an estimated 1.8 zettabytes by 2011 (1 zettabyte = 1 billion terabytes). Multi-dimensional sources and types of data: stock exchanges, Facebook, Flickr, Picasa, the Internet Archive, the Large Hadron Collider, sensor networks and biological applications; scientific applications and personal digital content creation. Storage capacities and access speeds have increased over the years, yet problems remain with I/O read performance, hardware failure, and combining data from multiple sources. Data storage and analysis therefore require a scalable information processing framework to handle vast amounts of data.

Data Management and Processing Data intensive computing is concerned with the production, manipulation and analysis of data ranging from hundreds of megabytes (MB) to petabytes (PB) and beyond. It relies on a range of supporting parallel and distributed computing technologies to deal with the challenges of data representation, reliable shared storage, efficient algorithms and scalable infrastructure for analysis.

Challenges Ahead Challenges with data intensive computing: scalable algorithms that can search and process massive datasets; new metadata management technologies that can scale to handle complex, heterogeneous and distributed data sources; support for accessing in-memory multi-terabyte data structures; high performance, highly reliable petascale distributed file systems; techniques for data reduction and rapid processing; software mobility to move computation to where the data is located; hybrid interconnects with support for multi-gigabyte data streams; and flexible, high performance software integration techniques. Hadoop to the rescue! A family of related projects, best known for MapReduce and the Hadoop Distributed File System (HDFS).

Why Hadoop? Drivers: 500M+ unique users per month and billions of interesting events per day make data analysis key. Need massive scalability: PBs of storage, millions of files, thousands of nodes. Need to do this cost effectively: use commodity hardware, share resources among multiple projects, and provide scale when needed. Need reliable infrastructure: it must be able to deal with failures – hardware, software, networking – because failure is expected rather than exceptional, and failures must be transparent to applications since it is very expensive to build reliability into each application. The Hadoop infrastructure provides these capabilities.

Introduction to Hadoop Apache Hadoop Open Source – Apache Foundation project Yahoo! is Apache Platinum Sponsor History Started in 2005 by Doug Cutting Yahoo! became the primary contributor in 2006 Yahoo! scaled it from 20 node clusters to 4000 node clusters today Deployed large scale science clusters in 2007 Began running major production jobs in Q1 2008 Portable Written in Java Runs on commodity hardware Linux, Mac OS/X, Windows, and Solaris

MapReduce Basics A programming model and its associated implementation for parallel processing of large data sets. It was developed within Google as a mechanism for processing large amounts of raw data, e.g. crawled documents or web request logs. It is capable of efficiently distributing the processing of terabytes of data across thousands of processing nodes. This distribution implies parallel computing, since the same computation is performed on each CPU but with a different dataset (or a different segment of a large dataset). The implementation's run-time library takes care of parallelism, fault tolerance, data distribution, load balancing, etc. It is complementary to an RDBMS, but differs in many ways (data size, access, update, structure, integrity and scale). Features: fault tolerance, locality, task granularity, backup tasks, skipping bad records and so on.

Operational Details Map: the map function, written by a user of the MapReduce library, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key k and passes them to the reduce function. Reduce: the reduce function, also written by the user, accepts an intermediate key k and a set of values for that key, and merges these values together to form a possibly smaller set of values.

Example Consider the problem of counting the number of occurrences of each word in a large document

See the Example Graphically Counting word frequency in a document: each Map call emits a <word,1> pair per occurrence; the MapReduce runtime system groups them into <word,1,1,1>; Reduce sums the list to produce <word,3>.
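
A minimal sketch of the map and reduce functions for this word-count example, written against the classic Hadoop Java API (org.apache.hadoop.mapred, as used in Hadoop 0.18-0.20); the class names are illustrative assumptions, not taken from the slides:
--------------------------------------------------
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Map: emit <word, 1> for every token in the input line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reduce: the runtime has grouped the 1s by word; sum them to get the count.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}
--------------------------------------------------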

MapReduce Overview The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits or shards. The input shards can be processed in parallel on different machines (nodes in a Cluster) Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R) The number of partitions (R) and the partitioning function are specified by the user

MapReduce: Execution Overview (1) Create splits of the input files and start a set of tasks running copies of the program. (2) One task is the Master; the others are workers. The Master assigns Map and Reduce tasks to idle workers. (3) A worker assigned a Map task reads data from the task's corresponding split, parsing key/value pairs from the input and passing them to Map(). (4) Periodic local writes, partitioned into R regions by the partitioning function; the locations of the buffered pairs are passed back to the Master. (5) A Reduce worker is notified of the locations of intermediate key/value pairs by the Master and uses RPC to get the data from the corresponding Map workers. Upon getting all the data, it performs a sort on the intermediate keys. (6) The Reduce worker iterates over the sorted intermediate data, passing data grouped by key to Reduce(); the output of Reduce() is appended to the worker's output file. (7) Upon completion of all Reduce tasks, the Master returns control to the user program.

More About the Master The Master owns several data structures to keep track of execution. It stores the state (idle, in-progress, completed) of each task, and the identity of the worker machines for in-progress and completed tasks. The Master keeps track of the locations of intermediate file regions, serving as a conduit from Map tasks to the corresponding Reduce tasks: it stores the locations and sizes of the R intermediate file regions produced by each map task. Updates are received from Map tasks as they are completed by their assigned Map workers, and the corresponding information is pushed incrementally to workers with in-progress Reduce tasks.

Hadoop MapReduce Conceptually: mapreduce fm fr l = map (reducePerKey fr) (group (map fm l)), where reducePerKey fr (k, v_list) = (k, foldl (fr k) [] v_list). In Hadoop, fm and fr are function objects (classes). The class for fm implements the Mapper interface: map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter). The class for fr implements the Reducer interface: reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter). Hadoop takes the generated class files and manages running them.

Handling Failure Worker failure: to detect failure, the master pings every worker periodically. If no response is received from a worker within a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Completed map tasks are re-executed because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. Master failure: periodic checkpoints are written to handle master failure; if the master task dies, a new copy can be started from the last checkpointed state.

Data Locality Network bandwidth is a scarce and valuable resource and should be consumed wisely. The distributed file system replicates data across different nodes. The Master takes these locations into account when scheduling Map tasks, trying to place them on the nodes holding the data; otherwise, Map tasks are scheduled to reside "near" a replica of the data (e.g., on a worker machine that is on the same network switch). When running large MapReduce operations, most input data is read locally and consumes no network bandwidth. Data locality works well with a Hadoop-specific distributed file system; integrating a Cloud-based file system incurs extra cost and loses data locality.

Task Granularity Finely granular tasks: many more map tasks than machines Better dynamic load balancing Minimizes time for fault recovery Can pipeline the shuffling/grouping while maps are still running Typically 200k Map tasks, 5k Reduce tasks for 2k hosts For M map tasks and R reduce tasks there are O(M+R) scheduling decisions and O(M*R) states

Load Balancing Built-in dynamic load balancing. One other problem that can slow computations is the existence of stragglers: machines that run slowly because of hardware defects, contention for resources with other applications, and so on. When an overall MapReduce operation passes a point deemed to be "nearly complete," the Master schedules backup tasks for all of the currently in-progress tasks. When a particular task is completed, whether it is the "original" or a back-up, its value is used. This strategy costs little more overall, but can result in big performance gains.

Refinements Partitioning function: MapReduce users specify the number of reduce tasks/output files (R). Data is partitioned across these tasks using a partitioning function on the intermediate key; the default is "hash(key) mod R", which results in well balanced partitions. A special partitioning function can also be used, such as "hash(Hostname(urlkey)) mod R", to send all URLs (output keys) from the same host to the same output file. Ordering guarantees: within a given partition, the intermediate key/value pairs are processed in increasing key order. This ordering guarantee makes it easy to generate a sorted output file per partition and allows users to have sorted output and efficient lookups by key.
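
The "hash(Hostname(urlkey)) mod R" idea can be expressed as a custom partitioner. The sketch below is a hypothetical example against the classic org.apache.hadoop.mapred API; the class name and the URL-parsing details are assumptions, not part of the original slides. It would be registered on a job with conf.setPartitionerClass(HostPartitioner.class) together with conf.setNumReduceTasks(R).
--------------------------------------------------
import java.net.URI;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes all URLs from the same host to the same reduce task/output file.
public class HostPartitioner implements Partitioner<Text, Text> {

  public void configure(JobConf job) {
    // No per-job configuration needed for this partitioner.
  }

  public int getPartition(Text urlKey, Text value, int numReduceTasks) {
    String host;
    try {
      host = new URI(urlKey.toString()).getHost();  // hostname part of the URL
    } catch (Exception e) {
      host = null;
    }
    if (host == null) {
      host = urlKey.toString();                     // fall back to the raw key
    }
    // Non-negative hash modulo R, i.e. hash(Hostname(urlkey)) mod R.
    return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
--------------------------------------------------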

Refinements (Cont'd) Combiner function: there can be significant repetition in the intermediate keys produced by each map task, and the reduce function is commutative and associative (as in word count). While a reduce task could perform all of the aggregation, a combiner function can perform partial merging of the Map output locally before it is sent over the network. The combiner function is executed on each machine that performs a map task. The program logic of the combiner and the reduce function is often the same, except for how the output is handled, i.e. whether it is written to an intermediate file or to the final output file. Input/Output types: multiple input/output formats are supported, and a user can add support for a new input/output type by providing an implementation of the reader/writer interface.
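
A small sketch of how a combiner is wired into a job configuration, assuming the WordCount.Map and WordCount.Reduce classes from the earlier word-count sketch; names and structure are illustrative only. Because word count's reduce is a commutative, associative sum, the same class can double as the combiner.
--------------------------------------------------
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class CombinerConfig {
  // Configures a word-count style job with a combiner.
  public static void configure(JobConf conf) {
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(WordCount.Map.class);
    // The reduce function (a sum) is commutative and associative, so the same
    // class can run as a combiner on each map node, merging <word,1> pairs
    // locally before they are sent over the network.
    conf.setCombinerClass(WordCount.Reduce.class);
    conf.setReducerClass(WordCount.Reduce.class);
  }
}
--------------------------------------------------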

Refinements (Cont'd) Skipping bad records: MapReduce provides a mode for skipping records that are diagnosed to cause Map() crashes. Each worker process installs a signal handler that catches segmentation violations and bus errors, which are tracked by the master. When the master notices more than one failure on a particular record, it indicates that the record should be skipped during re-execution. Local execution/debugging: debugging is not straightforward because of MapReduce's distributed computation, so there is an alternative implementation of the MapReduce library that runs sequentially on one node (the local machine). Users can then use any debugging or testing tools they find useful.

Refinements (Cont'd) Status information: the master contains an internal HTTP server that produces status pages showing how many tasks have been completed, how many are in progress, bytes of input, bytes of intermediate data, bytes of output, and processing rates. The status pages contain links to the standard error and standard output files generated by each task. A user can monitor progress, predict computation time and accelerate the computation by adding more hosts. Counters: a facility to count occurrences of various events. To use this facility, user code creates a named counter object and then increments the counter appropriately in the Map and/or Reduce function.
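
A sketch of counters in the classic Hadoop API, where named counters are created from an enum and incremented through the Reporter; the mapper, enum and record-validity rule below are hypothetical examples, not from the slides. Per-worker counts are aggregated by the master and shown on the job status pages.
--------------------------------------------------
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CountingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  // Named counters are grouped under this enum; the framework aggregates the totals.
  enum RecordCounters { VALID, MALFORMED }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    if (line.trim().isEmpty()) {
      reporter.incrCounter(RecordCounters.MALFORMED, 1);  // count and skip bad input
      return;
    }
    reporter.incrCounter(RecordCounters.VALID, 1);
    output.collect(new Text(line), new LongWritable(1));
  }
}
--------------------------------------------------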

Performance Evaluation Tests were run on a cluster of 1800 machines, each with 4 GB of memory, dual-processor 2 GHz Xeons with Hyperthreading, dual 160 GB IDE disks and gigabit Ethernet, with a bisection bandwidth of approximately 100 Gbps. Two benchmarks: MR_Grep scans 10^10 100-byte records to extract records matching a rare pattern (the pattern occurs in 92K records); the input is split into approximately 64 MB pieces (M = 15000) and the entire output is placed in one file (R = 1). MR_Sort sorts 10^10 100-byte records (modeled after the TeraSort benchmark).

Data Transfer Locality optimization helps 1800 machines read 1 TB of data at peak of ~31 GB/s Without this, rack switches would limit to 10 GB/s Start-up overhead is significant for short jobs

Results Backup tasks reduce job completion time significantly, and the system deals well with failures. Three configurations were compared: normal execution, execution with no backup tasks, and execution with 200 processes killed mid-run.

MapReduce Applications Text tokenization (alert systems), indexing, and search; data mining, statistical modeling, and machine learning. Healthcare – parsing, cleaning and reconciling extremely large amounts of data. Biosciences – drug discovery, meta-genomics, bioassay activities. Cost-effective mash-ups – retrieving and analyzing biomedical knowledge. Computational biology – parallelizing bioinformatics algorithms for SNP discovery, genotyping and personal genomics, e.g. CloudBurst. Emergency response – real-time monitoring/forecasting for operational decision support, and so on (see: http://wiki.apache.org/hadoop/PoweredBy). MapReduce inapplicability: database management – it does not provide traditional DBMS features; database implementation – lack of schema, low data integrity; normalization poses problems for MapReduce due to non-local reads; and applications that need to read and write data many times are a poor fit.

Hadoop Distributed File System (HDFS) A distributed file system designed to run on commodity hardware. HDFS was originally built as infrastructure for the Apache Nutch web search engine project, with the aim of achieving fault tolerance, the ability to run on low-cost hardware, and the handling of large datasets. It is now an Apache Hadoop subproject. It shares similarities with existing distributed file systems and supports traditional hierarchical file organization, provides reliable data replication, and is accessible via a Web interface and shell commands. Benefits: fault tolerance, high throughput, streaming data access, robustness and handling of large data sets.

Assumptions and Goals Hardware failures: detection of faults and quick, automatic recovery. Streaming data access: designed for batch processing rather than interactive use by users. Large data sets: applications that run on HDFS have large data sets, typically gigabytes to terabytes in size. Simple coherency model: applications need a write-once, read-many-times access model for files. Computation migration: computation is moved closer to where the data is located. Portability: easily portable between heterogeneous hardware and software platforms.

HDFS Concepts Blocks: a disk block is the minimum amount of data that a disk can read or write; an HDFS block is a unit of 64 MB by default. Files in HDFS are broken into block-sized chunks and stored as independent units. A file smaller than a single block does not occupy the full block's worth of underlying storage. Benefits: filesystem abstraction, use of multiple disks for large files, and a good fit for replication. Copy a file from the local filesystem to HDFS: % hadoop fs -copyFromLocal …/a.txt hdfs://localhost/…/a.txt List the blocks that make up each file in the filesystem: % hadoop fsck / -files -blocks
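
The same copy can be done programmatically through the FileSystem Java API; the sketch below is a rough equivalent of "hadoop fs -copyFromLocal", with the class name and paths chosen only for illustration.
--------------------------------------------------
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name (e.g. hdfs://localhost/) from the Hadoop config files.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path local = new Path(args[0]);   // e.g. /tmp/a.txt on the local filesystem
    Path remote = new Path(args[1]);  // e.g. /user/hadoop/a.txt in HDFS

    // The file is split into block-sized chunks (64 MB by default) and
    // replicated by HDFS according to the configured replication factor.
    fs.copyFromLocalFile(local, remote);
    System.out.println("Copied " + local + " to " + fs.getUri() + remote);
  }
}
--------------------------------------------------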

HDFS Concepts (Cont'd) NameNodes and DataNodes HDFS follows a master/slave architecture. An HDFS cluster consists of a single NameNode and a number of DataNodes, one per cluster node. The NameNode is a master server that manages the file system namespace and regulates access to files by clients; DataNodes manage the storage attached to the cluster nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks that are stored in DataNodes.

HDFS Concepts (Cont'd) - NameNode Metadata The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories, and also determines the mapping of blocks to DataNodes. Metadata in memory: the entire metadata is kept in main memory; there is no demand paging of metadata. Types of metadata: the list of files, the list of blocks for each file, the list of DataNodes for each block, and file attributes, e.g. creation time and replication factor. A transaction log records file creations, file deletions, etc.

HDFS Concepts (Cont'd) - DataNode A block server: it stores data in the local file system (e.g. ext3), stores the metadata of a block (e.g. CRC), and serves data and metadata to clients. A DataNode is responsible for serving read and write requests from the file system's clients, as well as performing block creation, deletion and replication upon instruction from the NameNode. Block report: it periodically sends a report of all existing blocks to the NameNode. It also facilitates pipelining of data by forwarding data to other specified DataNodes.

HDFS Concepts (Cont'd) File system namespace: a user or an application can create directories and store files inside these directories. Any change to the filesystem namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. Data replication: each file is stored as a sequence of blocks of the same size, except the last one. The block size and replication factor are configurable per file at file creation time and can be changed later. Block replication occurs based on periodic Heartbeat and Blockreport messages from each of the DataNodes in the cluster.

HDFS Architecture

HDFS Interfaces There are many interfaces for interacting with HDFS; the simplest is the command line. Two properties are set in the HDFS configuration: the default Hadoop filesystem, fs.default.name (e.g. hdfs://localhost/), which is used to determine the host (localhost) and port (8020) of the HDFS NameNode; and the replication factor, dfs.replication, whose default is 3 (set it to 1 to disable replication on a single-datanode setup). Other HDFS interfaces: HTTP, a read-only interface for retrieving directory listings and data over HTTP; and FTP, which permits the use of the FTP protocol to interact with HDFS.
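
A minimal sketch of reading a file back from HDFS through the Java API, showing where the two properties above fit in; the hard-coded values and the class name are assumptions for illustration only (normally these come from the configuration files rather than code).
--------------------------------------------------
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Normally set in the HDFS configuration files; set here only to show the knobs.
    conf.set("fs.default.name", "hdfs://localhost:8020/");
    conf.set("dfs.replication", "1");   // single-datanode setup: replication disabled

    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path(args[0]));   // e.g. /user/hadoop/a.txt
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);   // print the file contents line by line
    }
    reader.close();
  }
}
--------------------------------------------------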

Data Pipelining Client retrieves a list of DataNodes on which to place replicas of a block Client writes block to the first DataNode The first DataNode forwards the data to the next DataNode in the Pipeline When all replicas are written, the Client moves on to write the next block in file

Replication in HDFS Replica placement: critical to improving data reliability, availability and network bandwidth utilization. The policy is rack-aware, since rack failure is far less likely than node failure. With the default replication factor (3), one replica is put on a node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack. Thus one third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third are evenly distributed across the remaining racks; the benefit is reduced inter-rack write traffic. Replica selection: a read request is satisfied from a replica that is near the reading application, which minimizes global bandwidth consumption and read latency. If HDFS spans multiple data centers, a replica in the local data center is preferred over any remote replica.

Replication in HDFS (Cont'd) Safemode: on startup, the NameNode enters Safemode and waits for Heartbeat and Blockreport messages from the DataNodes. Each block has a specified number of replicas; once the NameNode knows that the required number of replicas of each block has been reported, it exits Safemode and, if required, replicates data blocks to other DataNodes. Filesystem metadata: a DataNode does not create all files in the same directory, but uses a heuristic to determine the optimal number of files per directory. On startup, a DataNode scans its local filesystem, generates a list of all HDFS data blocks and sends this report to the NameNode. Checkpoint: an image of the entire file system namespace and file Blockmap is kept in memory; a checkpoint is created at startup and then periodically.

Communication Protocol All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine and uses the ClientProtocol; DataNodes talk to the NameNode using the DataNode protocol. A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DataNode protocol. The NameNode never initiates an RPC; it only responds to RPC requests issued by DataNodes or clients.

Robustness The primary objective of HDFS is to store data reliably even during failures. Three common types of failures: NameNode failures, DataNode failures and network partitions. Data disk failure: Heartbeat messages track the health of DataNodes, and the NameNode performs the necessary re-replication on DataNode unavailability, replica corruption or disk fault. Cluster rebalancing: data is automatically moved between DataNodes if the free space on a DataNode falls below a threshold or during sudden high demand. Data integrity: checksum checking on HDFS files during file creation and retrieval. Metadata disk failure: manual intervention – no automatic recovery, restart or failover.

How Hadoop Runs a MapReduce Job The client submits the MapReduce job; the JobTracker coordinates the job run; TaskTrackers run the tasks that the job has been split into; and HDFS is used for file storage.

Streaming and Pipes Hadoop Streaming is an API to MapReduce for writing non-Java map and reduce functions; Hadoop and the user program communicate using standard I/O streams. Hadoop Pipes is the C++ interface to MapReduce; it uses sockets as the channel for communicating with the process running the C++ Map or Reduce function.

Progress and Status Updates Operations constituting progress: reading an input record, writing an output record, setting the status description, incrementing a counter, and calling the progress() method.
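
A sketch of how a task can report progress and status explicitly when a single record takes a long time to process; the mapper, the "expensive analysis" and all names below are hypothetical, and only the Reporter calls (progress, setStatus) reflect the mechanism described above.
--------------------------------------------------
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SlowMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Hypothetical expensive per-record work that may take a long time.
    String result = expensiveAnalysis(value.toString(), reporter);
    reporter.setStatus("processed record at offset " + key.get());
    output.collect(new Text(Long.toString(key.get())), new Text(result));
  }

  private String expensiveAnalysis(String record, Reporter reporter) {
    StringBuilder out = new StringBuilder();
    for (String token : record.split("\\s+")) {
      out.append(token.length()).append(' ');
      // Tell the framework the task is alive so it is not marked as failed
      // for timing out while no records are being emitted.
      reporter.progress();
    }
    return out.toString().trim();
  }
}
--------------------------------------------------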

Hadoop Failures Task failure: a map or reduce task throws a runtime exception; for streaming tasks, streaming processes exiting with a non-zero exit code are considered failed. Tasks can also be killed and re-scheduled. Tasktracker failure: a crash or slow execution causes the tasktracker to send heartbeats to the jobtracker infrequently (or stop sending them altogether). A tasktracker can also be blacklisted by the jobtracker if it fails a significant number of tasks, higher than the average task failure rate. Jobtracker failure: a single point of failure with no built-in mechanism to deal with it; one solution is to run multiple jobtrackers or keep a backup jobtracker.

Checkpointing in Hadoop

Job Scheduling in Hadoop Hadoop started with FIFO scheduling and now comes with a choice of schedulers. The fair scheduler aims to give every user a fair share of the cluster capacity over time. Jobs are placed in pools, and by default each user gets their own pool. It supports preemption, reassigning capacity from over-capacity pools to under-capacity pools. The capacity scheduler takes a slightly different approach to multi-user scheduling: a cluster is made up of a number of queues, which may be hierarchical, and each queue has an allocated capacity. Within each queue, jobs are scheduled using FIFO scheduling, with priorities.

Shuffle and Sort The input to each reducer is sorted by key. The shuffle is the process of performing this sort and sending map outputs to the reducers as input. The Map side: each map task has a memory buffer, with data partitioning, in-memory sorting and use of the combiner function. The Reduce side: reduce tasks copy map outputs in parallel; once all outputs are copied, they are merged and sorted before the reduce task executes.

Summary – Lecture 1 MapReduce makes it easy to parallelize and distribute computations and to make them fault tolerant. Locality optimization is key – the target is to reduce the amount of data sent across the network. Redundant execution can be used to reduce the impact of slow machines and to handle machine failures and data loss. HDFS is the file-system abstraction: block-based abstraction simplifies the storage subsystem; HDFS blocks are larger than disk blocks to minimize the cost of seeks; and blocks fit well with replication for providing fault tolerance and availability.

Lecture 2 Hadoop Cluster Setup MapReduce Advancements Aneka Case Study Hadoop Lab Notes

Setting Up a Hadoop Cluster A two-level network topology is typical for a Hadoop cluster. Get an Apache Hadoop distribution and install Hadoop; you may also use an installation script. Configure JAVA_HOME, the NameNode, DataNode, JobTracker and TaskTracker, HADOOP_LOG_DIR and HADOOP_HEAPSIZE. You can also have a single-node setup for simple operations.

Single Node Setup: Configurations Files to configure:
hadoop-env.sh: Open the file <HADOOP_INSTALL>/conf/hadoop-env.sh in the editor of your choice and set the JAVA_HOME environment variable to the Sun JDK/JRE 1.5.0 directory.
-------------------------------------------------------------------
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
-------------------------------------------------------------------
hadoop-site.xml: Any site-specific configuration of Hadoop is configured in <HADOOP_INSTALL>/conf/hadoop-site.xml. Here we will configure the directory where Hadoop will store its data files, the ports it listens to, etc. You can leave the settings below as they are, with the exception of the hadoop.tmp.dir variable, which you have to change to the directory of your choice, for example /usr/local/hadoop-datastore/hadoop-${user.name}.
--------------------------------------------------------------------
<property>
  <name>hadoop.tmp.dir</name>
  <value>/your/path/to/hadoop/tmp/dir/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
--------------------------------------------------------------------

Starting the Single Node Cluster Formatting the name node: The first step in starting up your Hadoop installation is formatting the Hadoop file system, which is implemented on top of the local file system of your "cluster". You only need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem; this will cause all your data to be erased. Run the command: hadoop@ubuntu:~$ <HADOOP_INSTALL>/hadoop/bin/hadoop namenode -format Starting the cluster: This will start up a Namenode, Datanode, Jobtracker and a Tasktracker. Run the command: hadoop@ubuntu:~$ <HADOOP_INSTALL>/bin/start-all.sh Stopping the cluster: To stop all the daemons running on your machine, run: hadoop@ubuntu:~$ <HADOOP_INSTALL>/bin/stop-all.sh

Multi-Node Setup Now we will modify the Hadoop configuration to make one Ubuntu box the master (which will also act as a slave) and the other Ubuntu box a slave. The best way to do this is to install, configure and test a "local" Hadoop setup on each of the two Ubuntu boxes, and in a second step to "merge" these two single-node clusters into one multi-node cluster in which one Ubuntu box becomes the designated master (but also acts as a slave with regard to data storage and processing), and the other box becomes only a slave. The master node will run the "master" daemons for each layer: the namenode for the HDFS storage layer, and the jobtracker for the MapReduce processing layer. Both machines will run the "slave" daemons: a datanode for the HDFS layer, and a tasktracker for the MapReduce processing layer. We will call the designated master machine just the master from now on and the slave-only machine the slave. Both machines must be able to reach each other over the network. Shut down each single-node cluster with <HADOOP_INSTALL>/bin/stop-all.sh before continuing if you haven't done so already.

Multi-Node Setup: Configurations Files to configure:
conf/masters (master only): The conf/masters file defines the master nodes of our multi-node cluster. In our case, this is just the master machine. On master, update <HADOOP_INSTALL>/conf/masters so that it looks like this:
----------------------
master
---------------------
conf/slaves (master only): The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data. On master, update <HADOOP_INSTALL>/conf/slaves so that it looks like this:
------------------
master
slave
-------------------
If you have additional slave nodes, just add them to the conf/slaves file, one per line.

Multi-Node Setup: Configurations (Cont'd) conf/hadoop-site.xml (all machines): Assuming you configured conf/hadoop-site.xml on each machine as described in the single-node cluster tutorial, you will only have to change a few variables. Important: you have to change conf/hadoop-site.xml on ALL machines as follows. First, change the fs.default.name variable, which specifies the NameNode (the HDFS master) host and port. In our case, this is the master machine.
------------------------------------------
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system. . .</description>
</property>
---------------------------------------
Second, change the mapred.job.tracker variable, which specifies the JobTracker (MapReduce master) host and port. Again, this is the master in our case.
-------------------------------------------------------
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs at . . .</description>
</property>
-------------------------------------------------

Multi-Node Setup: Configurations (Cont'd) Third, we change the dfs.replication variable, which specifies the default block replication. It defines how many machines a single file should be replicated to before it becomes available. If you set this to a value higher than the number of slave nodes you have available, you will start seeing a lot of errors in the log files.
---------------------------------
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication. . .</description>
</property>
----------------------------------
Additional settings: in conf/hadoop-site.xml you can change the mapred.local.dir variable, which determines where temporary MapReduce data is written. It may also be a list of directories.

Starting the Multi-node Cluster Formatting the namenode: Before we start our new multi-node cluster, we have to format Hadoop's distributed filesystem (HDFS) for the namenode. You only need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop namenode; this will cause all your data in the HDFS filesystem to be erased. To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable on the namenode), run the following command from the master:
--------------------------------------------
bin/hadoop namenode -format
---------------------------------------------
Starting the multi-node cluster: Starting the cluster is done in two steps. First, the HDFS daemons are started: the namenode daemon is started on master, and datanode daemons are started on all slaves (here: master and slave). Second, the MapReduce daemons are started: the jobtracker is started on master, and tasktracker daemons are started on all slaves (here: master and slave).

Starting the multi-node cluster (Cont'd) HDFS daemons: Run the command <HADOOP_INSTALL>/bin/start-dfs.sh on the machine you want the namenode to run on. This will bring up HDFS with the namenode running on the machine you ran the previous command on, and datanodes on the machines listed in the conf/slaves file. In our case, we will run bin/start-dfs.sh on master:
-------------------------
bin/start-dfs.sh
---------------------------
On slave, you can examine the success or failure of this command by inspecting the log file <HADOOP_INSTALL>/logs/hadoop-hadoop-datanode-slave.log. At this point, the following Java processes should run on master:
-----------------------------------
hadoop@master:/usr/local/hadoop$ jps
14799 NameNode
15314 Jps
14880 DataNode
14977 SecondaryNameNode
------------------------------------

Starting the multi-node cluster (Cont'd) and the following Java processes should run on slave:
--------------------------------------
hadoop@slave:/usr/local/hadoop$ jps
15183 DataNode
15616 Jps
---------------------------------------
MapReduce daemons: Run the command <HADOOP_INSTALL>/bin/start-mapred.sh on the machine you want the jobtracker to run on. This will bring up the MapReduce cluster with the jobtracker running on the machine you ran the previous command on, and tasktrackers on the machines listed in the conf/slaves file. In our case, we will run bin/start-mapred.sh on master:
-------------------------------------
bin/start-mapred.sh
-------------------------------------
On slave, you can examine the success or failure of this command by inspecting the log file <HADOOP_INSTALL>/logs/hadoop-hadoop-tasktracker-slave.log.

Starting the multi-node cluster (Cont'd) At this point, the following Java processes should run on master:
----------------------------------------------------
hadoop@master:/usr/local/hadoop$ jps
16017 Jps
14799 NameNode
15686 TaskTracker
14880 DataNode
15596 JobTracker
14977 SecondaryNameNode
----------------------------------------------------
And the following Java processes should run on slave:
---------------------------------------
hadoop@slave:/usr/local/hadoop$ jps
15183 DataNode
15897 TaskTracker
16284 Jps
-------------------------------------------

Stopping the Multi-Node Cluster First, the MapReduce daemons are stopped: the jobtracker is stopped on master, and tasktracker daemons are stopped on all slaves (here: master and slave). Second, the HDFS daemons are stopped: the namenode daemon is stopped on master, and datanode daemons are stopped on all slaves (here: master and slave). MapReduce daemons: Run the command <HADOOP_INSTALL>/bin/stop-mapred.sh on the jobtracker machine. This will shut down the MapReduce cluster by stopping the jobtracker daemon running on the machine you ran the previous command on, and the tasktrackers on the machines listed in the conf/slaves file. In our case, we will run bin/stop-mapred.sh on master:
-------------------------------
bin/stop-mapred.sh
-------------------------------
At this point, the following Java processes should run on master:
--------------------------------------
hadoop@master:/usr/local/hadoop$ jps
14799 NameNode
18386 Jps
14880 DataNode
14977 SecondaryNameNode
--------------------------------------------

Stopping the Multi-Node Cluster (Cont'd) And the following Java processes should run on slave:
-------------------------------
hadoop@slave:/usr/local/hadoop$ jps
15183 DataNode
18636 Jps
--------------------------------
HDFS daemons: Run the command <HADOOP_INSTALL>/bin/stop-dfs.sh on the namenode machine. This will shut down HDFS by stopping the namenode daemon running on the machine you ran the previous command on, and the datanodes on the machines listed in the conf/slaves file. In our case, we will run bin/stop-dfs.sh on master:
---------------------------------
bin/stop-dfs.sh
---------------------------------
At this point, only the following Java processes should run on master:
------------------------------
hadoop@master:/usr/local/hadoop$ jps
18670 Jps
------------------------------

Stopping the Multi-Node Cluster (Cont'd) And the following Java processes should run on slave:
--------------------------------
hadoop@slave:/usr/local/hadoop$ jps
18894 Jps
--------------------------------

Running a MapReduce job We will now run your first Hadoop MapReduce job. We will use the WordCount example job which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. Download example input data: The Notebooks of Leonardo Da Vinci Download the ebook as plain text file in us-ascii encoding and store the uncompressed file in a temporary directory of choice, for example /tmp/gutenberg. Restart the Hadoop cluster Restart your Hadoop cluster if it's not running already. ------------------------- hadoop@ubuntu:~$ <HADOOP_INSTALL>/bin/start-all.sh Copy local data file to HDFS Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop's HDFS ----------------------------- hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/source destination

Running a MapReduce Job (Cont'd) Run the MapReduce job: Now we actually run the WordCount example job. This command will read all the files in the HDFS "destination" directory, process them, and store the result in the HDFS directory "output".
-----------------------------------------
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-*-examples.jar wordcount destination output
-----------------------------------------
You can check whether the result was successfully stored in the HDFS directory "output". Retrieve the job result from HDFS: To inspect the file, you can copy it from HDFS to the local file system.
-------------------------------------
hadoop@ubuntu:/usr/local/hadoop$ mkdir /tmp/output
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyToLocal output/part-00000 /tmp/output
----------------------------------------
Alternatively, you can read the file directly from HDFS without copying it to the local file system by using the command:
---------------------------------------------
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat output/part-00000

Hadoop Web Interfaces MapReduce Job Tracker Web Interface The job tracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the local machine's Hadoop log files (the machine on which the web UI is running on). By default, it's available at http://localhost:50030/ Task Tracker Web Interface The task tracker web UI shows you running and non-running tasks. It also gives access to the local machine's Hadoop log files. By default, it's available at http://localhost:50060/ HDFS Name Node Web Interface The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine's Hadoop log files. By default, it's available at http://localhost:50070/

Writing An Hadoop MapReduce Program Even though the Hadoop framework is written in Java, programs for Hadoop need not be coded in Java; they can also be developed in other languages like Python or C++ (the latter since version 0.14.1). Creating a launching program for your application: • The launching program configures: – the Mapper and Reducer to use – the output key and value types (input types are inferred from the InputFormat) – the locations of your input and output • The launching program then submits the job and typically waits for it to complete. A Map/Reduce job may specify how its input is to be read by specifying an InputFormat to be used, and how its output is to be written by specifying an OutputFormat to be used.
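
A minimal sketch of such a launching program (driver) using the classic JobConf-based API; it reuses the WordCount.Map and WordCount.Reduce classes from the earlier word-count sketch, and the class name and command-line argument layout are assumptions for illustration.
--------------------------------------------------
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    // Output key/value types (input types are inferred from the InputFormat).
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // The Mapper and Reducer to use (from the earlier word-count sketch).
    conf.setMapperClass(WordCount.Map.class);
    conf.setCombinerClass(WordCount.Reduce.class);
    conf.setReducerClass(WordCount.Reduce.class);

    // How the input is read and the output is written.
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Input and output locations in HDFS, taken from the command line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submit the job and wait for it to complete.
    JobClient.runJob(conf);
  }
}
--------------------------------------------------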

Using HDFS hadoop dfs [-ls <path>] [-du <path>] [-cp <src> <dst>] [-rm <path>] [-put <localsrc> <dst>] [-copyFromLocal <localsrc> <dst>] [-moveFromLocal <localsrc> <dst>] [-get [-crc] <src> <localdst>] [-cat <src>] [-copyToLocal [-crc] <src> <localdst>] [-moveToLocal [-crc] <src> <localdst>] [-mkdir <path>] [-touchz <path>] [-test -[ezd] <path>] [-stat [format] <path>] [-help [cmd]]

Using HDFS (Cont'd) File system reformatting is easy: hadoop namenode -format Basically, most commands look similar: hadoop <command> <options> If you just type hadoop you get a list of all possible commands (including undocumented ones).

Who uses Hadoop? Amazon/A9 Facebook Google IBM Joost Last.fm New York Times PowerSet Veoh Yahoo!

Hadoop @ Yahoo! Source: Eric Baldeschwieler, VP Hadoop Development, Yahoo!

Hadoop is critical to Yahoo’s business When you visit yahoo, you are interacting with data processed with Hadoop! Content Optimization Search Index Machine Learning (e.g. Spam filters) Ads Optimization Content Feed Processing

Tremendous Impact on Productivity Makes Developers & Scientists more productive Key computations solved in days and not months Projects move from research to production in days Easy to learn, even our rocket scientists use it! The major factors You don’t need to find new hardware to experiment You can work with all your data! Production and research based on same framework No need for R&D to do IT (it just works)

Search & Advertising Sciences Hadoop Applications: Search Assist™ The database for Search Assist™ is built using Hadoop: 3 years of log data, 20 steps of MapReduce.
                   Before Hadoop    After Hadoop
Time               26 days          20 minutes
Language           C++              Python
Development Time   2-3 weeks        2-3 days

Largest Hadoop Clusters in the Universe 25,000+ nodes (~200,000 cores), in clusters of up to 4,000 nodes. Four tiers of clusters: Development, Testing and QA (~10%); Proof of Concept and ad-hoc work (~10%), which runs the latest version of Hadoop (currently 0.20); Science and Research (~60%), which runs more stable versions; and Production (~20%), currently on Hadoop 0.18.3.

Large Hadoop-Based Applications
Webmap: in 2008, ~70 hours runtime, ~300 TB shuffled, ~200 TB output on 1480 nodes; in 2009, ~73 hours runtime, ~490 TB shuffled, ~280 TB output on 2500 nodes.
Sort benchmarks (Jim Gray contest): 1 terabyte sorted in 209 seconds on 900 nodes (2008) and in 62 seconds on 1500 nodes (2009); 1 petabyte sorted in 16.25 hours on 3700 nodes (2009).
Largest cluster: 2000 nodes with 6 PB raw disk, 16 TB of RAM and 16K CPUs (2008); 4000 nodes with 16 PB raw disk, 64 TB of RAM and 32K CPUs (40% faster CPUs too) in 2009.

Hadoop at Facebook Production cluster: 4800 cores, 600 machines, 16 GB per machine (April 2009); 8000 cores, 1000 machines, 32 GB per machine (July 2009); 4 SATA disks of 1 TB each per machine; 2-level network hierarchy with 40 machines per rack; total cluster size of 2 PB, projected to be 12 PB in Q3 2009. Test cluster: 800 cores, 16 GB each.

Backend Data Warehousing 3 TB of compressed log data is generated per day. All of this data is stored and processed by the Hadoop cluster, consisting of over 600 machines. A summary of the log data is then copied to Oracle and MySQL for easy access and further analysis. The pipeline: Web Servers → Scribe Servers → Network Storage → Hadoop Cluster → Oracle RAC / MySQL.

Notable MapReduce Advancements GPU-based MapReduce variants Extend MapReduce to be used as a Cloud service Perform real-time distributed stream processing Use MapReduce for quality-controlled decision making, such as weather forecasting, defense, emergency response system, commerce/finance trading and transport/vehicle/dam flow control Overcome the performance bottleneck generated from a single data collection point and minimize the time taken during data validation process Ensure improved data management and better performance

MapReduce and GPUs The MapReduce programming framework is a natural fit for GPUs: using the framework saves significant programming effort, and there is plenty of parallelism for executing independent threads, along with multithreading and abstraction of low-level details. However, there are constraints: MapReduce is typically applied to large batch-processing applications; efficient reductions on GPUs are difficult due to dependencies; writing intermediate results to local disks creates a performance issue; each cluster node running MapReduce typically has a 2-4 core CPU with no GPU attached; and moving data to the cloud incurs extra cost.

Amazon Elastic MapReduce Upload the data, as well as the map and reduce executables, to Amazon S3, and then send a request to Elastic MapReduce to start a job flow. Elastic MapReduce starts a Hadoop cluster, which loads any specified bootstrap actions and then runs Hadoop on each node. Hadoop executes the job flow by downloading data from Amazon S3 to the core and task nodes; alternatively, the data is loaded dynamically at run time by map tasks. Hadoop processes the data and then uploads the results from the cluster to Amazon S3. When the job flow completes, the processed data can be retrieved from Amazon S3.

MapReduce Online A variant of MapReduce in which intermediate data is pipelined between the Map and Reduce functions, while preserving the programming interfaces and fault tolerance. Hadoop Online Prototype: http://code.google.com/p/hop/ Task pipelining: a map task pushes data to reducers as it is produced, using two threads: one runs the map function and the other periodically sends output from an in-memory buffer to the reducers. If a reduce task has not yet been scheduled, the map output is written to disk as in regular MapReduce. Job pipelining: a job's Reduce output is sent as input to the next job's Map; however, the Reduce of the previous job and the Map of the next job are not overlapped. Reduction is performed on whatever Map output is available, producing snapshots, with online aggregation of snapshots and continuous query pipelining.

Case Study on Aneka: Cloud Application Platform (CAP) using MapReduce A lightweight container hosts multiple services, with all programming models available from within the same container. An SDK contains APIs for the multiple programming models and tools, and a runtime environment manages application execution. Suitable for developing large-scale enterprise Cloud applications and for Cloud-enabling legacy applications. Patent (PCT).

Aneka: components The programming/deployment model: an Aneka User Agent submits work units through a Client Agent, over the internet, to the Aneka Manager (scheduler), which dispatches them to Aneka Worker Services (executors) in the Aneka enterprise Cloud. On the client side, a task implements the ITask interface, e.g. public class DumbTask : ITask { … public void Execute() { … } }, and tasks are submitted in a loop: for (int i = 0; i < n; i++) { … DumbTask task = new DumbTask(); app.SubmitExecution(task); }

MapReduce Programming Model in Aneka Users implement the Map and Reduce functions through the Mapper and Reducer abstractions (the units of execution); the middleware provides the logic for the MapReduceScheduler (scheduling), the MapReduceExecutor (execution) and the MapReduceManager (the client component coordinating execution for end users) on top of the Aneka infrastructure.

Aneka MapReduce Infrastructure

Aneka MapReduce Abstractions Object Model

Aneka MapReduce Scheduling Service

Aneka MapReduce Execution Service

Aneka MapReduce Data File Format

Aneka MapReduce The MapReduce model in Aneka is function-based: the user defines the functions operating on the data, configures the middleware with these functions and provides the data; the middleware automatically generates the tasks required to execute the functions on the data. Sample MapReduce applications: word counter and estimation of Pi.

Application 1 - GoFront (a unit of the China Southern Railway Group) Application: locomotive design CAD rendering of raw locomotive design files (using Autodesk Maya). Case 1: a single 4-core server, using Maya graphical mode directly. Case 2: the Aneka Maya Renderer on GoFront's private Aneka Cloud over the LAN, running Maya batch mode on demand; Aneka utilizes 30 desktops to decrease rendering time from days to hours (the chart compares time in hours for the single server vs the Aneka Cloud).

Application 2 - TitanStrike online gaming portal The portal tracks gamer profiles, player statistics and team play across multiple games by parsing logs from the game servers. Case 1: a single server means huge overload, with a single GameController/scheduler performing centralized log parsing and controlling the execution of all the matches. Case 2: the Aneka Enterprise Cloud means scalability, with an Aneka-based GameController whose local scheduler interacts with Aneka and distributes the log-parsing load across TitanStrike's private Aneka Cloud on the LAN (running game plugins on demand).

Hadoop Lab – Notable Points Pre-installed Hadoop on XE: 16 processors over 2 nodes; a script is used to load the Hadoop module and run the application. HDFS is installed on top of the Lustre file system, so there is no data locality since there is no concept of a local disk, and the namenode and datanode run on the same node. The Hadoop installation on XE is experimental, so be prepared for some surprises in performance and during program execution! The main focus of the lab will be to modify the typical MapReduce word-count implementation to perform some more advanced operations.

Thanks for Your Attention! Hadoop and MapReduce resources: Hadoop: The Definitive Guide, Tom White, O'Reilly | Yahoo! Press, 2009. Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI'04, http://labs.google.com/papers/mapreduce.html "Yahoo! Launches World's Largest Hadoop Production Application", 19 February 2008, http://developer.yahoo.com/blogs/hadoop/posts/2008/02/yahoo-worlds-largest-production-hadoop/ Tutorials and user guides at: http://hadoop.apache.org/ HDFS: http://hadoop.apache.org/core/docs/current/hdfs_design.html Hadoop API: http://hadoop.apache.org/core/docs/current/api/ Contact: Mukaddim.Pathan@csiro.au