Hadoop: The Definitive Guide Chap. 2 MapReduce


Hadoop: The Definitive Guide Chap. 2 MapReduce
  Kisung Kim

MapReduce
  A programming model for parallel data processing
  Hadoop can run MapReduce programs written in various languages, e.g. Java, Ruby, Python, and C++
  In this chapter:
    Introduce MapReduce programming using a simple example
    Introduce some of the MapReduce API
    Explain the data flow of MapReduce

Example: Analysis of a Weather Dataset
  Data from the NCDC (National Climatic Data Center)
    A large volume of log data collected by weather sensors, e.g. temperature
  Data format
    Line-oriented ASCII format
    Each record has many elements; we focus on the temperature element
    Data files are organized by date and weather station
    There is a directory for each year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year
  Query: What is the highest recorded global temperature for each year in the dataset?
  Sample records (the year and temperature fields are embedded in each fixed-width line):
    0067011990999991950051507004...9999999N9+00001+99999999999...
    0043011990999991950051512004...9999999N9+00221+99999999999...
    0043011990999991950051518004...9999999N9-00111+99999999999...
    0043012650999991949032412004...0500001N9+01111+99999999999...
    0043012650999991949032418004...0500001N9+00781+99999999999...

Analyzing the Data with Unix Tools
  To provide a performance baseline
  Use awk for processing line-oriented data
  A complete run for the century took 42 minutes on a single EC2 High-CPU Extra Large instance

How Can We Parallelize This Work?
  To speed up the processing, we need to run parts of the program in parallel
  Dividing the work
    Process different years in different processes
    It is important to divide the work evenly
    Split the input into fixed-size chunks
  Combining the results
    With the fixed-size chunk approach, the combination step is more delicate
  But we are still limited by the processing capacity of a single machine
    Some datasets grow beyond the capacity of a single machine
  To use multiple machines, we need to deal with a variety of complex problems
    Coordination: who runs the overall job?
    Reliability: how do we deal with failed processes?
  Hadoop takes care of these issues

Hadoop MapReduce
  To use MapReduce, we need to express our query as a MapReduce job
  A MapReduce job consists of:
    A map function
    A reduce function
  Each function takes key-value pairs as input and produces key-value pairs as output
  The types of the input and output pairs are chosen by the programmer (see the general form below)
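For reference, the general form of the two functions, in the notation the book uses later when discussing MapReduce types, can be written as:

    map:    (K1, V1)       -> list(K2, V2)
    reduce: (K2, list(V2)) -> list(K3, V3)

In the weather example, K2 is the year (Text) and V2 is the temperature (IntWritable).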

MapReduce Design of the NCDC Example: Map Phase
  Text input format of the dataset files
    Key: offset of the line within the file (not needed here)
    Value: each line of the files
  Pull out the year and the temperature
    In this example, the map phase is simply a data preparation phase
    Drop bad records (filtering)
  (Figure: input file -> map input (key, value) pairs -> map output (key, value) pairs; a worked example follows below)
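Concretely, for the sample records shown earlier, the map input and output look roughly like this (the byte offsets are illustrative; temperatures are in tenths of a degree Celsius):

    (0,   0067...1950...+0000...) -> (1950, 0)
    (106, 0043...1950...+0022...) -> (1950, 22)
    (212, 0043...1950...-0011...) -> (1950, -11)
    (318, 0043...1949...+0111...) -> (1949, 111)
    (424, 0043...1949...+0078...) -> (1949, 78)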

MapReduce Design of the NCDC Example: Sort, Group, and Reduce
  The output from the map function is processed by the MapReduce framework
    The framework sorts and groups the key-value pairs by key
  The reduce function iterates through each list of values and picks the maximum (see the worked example below)
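Continuing the worked example, the sort/group step and the reduce function produce:

    map output:         (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)
    sorted and grouped: (1949, [111, 78]), (1950, [0, 22, -11])
    reduce output:      (1949, 111), (1950, 22)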

Java Implementation: Map
  Map function: an implementation of the Mapper interface
  Mapper interface
    Generic type with four type parameters: input key, input value, output key, and output value types
  Hadoop provides its own set of basic types, optimized for network serialization, in the org.apache.hadoop.io package
    e.g. LongWritable (Java Long), Text (Java String), IntWritable (Java Integer)
  OutputCollector: used to write the output (a sketch of the mapper is shown below)
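A sketch of the mapper along these lines, using the old org.apache.hadoop.mapred API; the class name MaxTemperatureMapper and the fixed-width field offsets follow the book's NCDC example and should be treated as illustrative:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MaxTemperatureMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final int MISSING = 9999;  // sentinel for a missing reading

      public void map(LongWritable key, Text value,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        String line = value.toString();
        String year = line.substring(15, 19);          // year field
        int airTemperature;
        if (line.charAt(87) == '+') {                  // parseInt rejects a leading plus sign
          airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
          airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
          output.collect(new Text(year), new IntWritable(airTemperature));  // emit (year, temperature)
        }
      }
    }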

Java Implementation: Reduce
  Reduce function: an implementation of the Reducer interface
  Reducer interface
    Generic type with four type parameters: input key, input value, output key, and output value types
  The input types of the reduce function must match the output types of the map function (a sketch is shown below)
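A matching sketch of the reducer, again with the old API; the class name MaxTemperatureReducer mirrors the book's example:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class MaxTemperatureReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int maxValue = Integer.MIN_VALUE;                   // running maximum for this year
        while (values.hasNext()) {
          maxValue = Math.max(maxValue, values.next().get());
        }
        output.collect(key, new IntWritable(maxValue));     // emit (year, max temperature)
      }
    }

Note that the input types here (Text, IntWritable) match the output types of the mapper, as the slide requires.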

Java Implementation: Main
  Construct a JobConf object
    The specification of the job; controls how the job is run
    Pass a class to the JobConf constructor: Hadoop uses it to locate the relevant JAR file and distribute it around the cluster
  Specify input and output paths
    addInputPath(), setOutputPath()
    If the output directory exists before running the job, Hadoop will complain and not run the job
  Specify map and reduce types
    setMapperClass(), setReducerClass()
  Set output types
    setOutputKeyClass(), setOutputValueClass()
    setMapOutputKeyClass(), setMapOutputValueClass() if the map output types differ from the final output types
  Input type
    Here we use the default, TextInputFormat
  runJob()
    Submits the job (a driver sketch is shown below)
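A sketch of the driver that ties these calls together, following the book's MaxTemperature example with the old API:

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MaxTemperature {

      public static void main(String[] args) throws IOException {
        if (args.length != 2) {
          System.err.println("Usage: MaxTemperature <input path> <output path>");
          System.exit(-1);
        }

        JobConf conf = new JobConf(MaxTemperature.class);  // class used to locate the job JAR
        conf.setJobName("Max temperature");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setReducerClass(MaxTemperatureReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);   // submit the job and wait for completion
      }
    }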

Run the Job
  Install Hadoop in standalone mode (Appendix A in the book)
  Run using the local filesystem with a local job runner
  Set HADOOP_CLASSPATH to the path of the compiled application classes
  (The original slides show the job's output log and the resulting per-year maximum temperatures)
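In standalone mode the job is launched roughly as follows (paths are placeholders, not taken from the original slides): compile the classes, point HADOOP_CLASSPATH at the directory containing them, e.g. export HADOOP_CLASSPATH=build/classes, and invoke the driver with hadoop MaxTemperature <input path> <output path>. Hadoop then prints the job's progress log and writes the per-year maxima to the output directory.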

Java Implementation: The New API in Hadoop 0.20.0
  Favors abstract classes over interfaces
    Easier to evolve; Mapper and Reducer are now abstract classes
    Lives in the org.apache.hadoop.mapreduce package
  Makes extensive use of context objects to allow user code to communicate with the MapReduce system
  Supports both a "push" and a "pull" style of iteration
  Configuration has been unified
    Job configuration is done through a Configuration object
  Job control is performed through the Job class, rather than JobClient
  At the time of writing, not all of the MapReduce libraries in Hadoop had been ported to the new API, so the book uses the old API
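As an illustration of the new-API style, here is a hedged sketch (not the book's listing) of the same mapper rewritten against org.apache.hadoop.mapreduce; the quality filtering from the old-API sketch is omitted for brevity:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;   // new API: an abstract class, not an interface

    // The OutputCollector and Reporter parameters are replaced by a single Context object.
    public class MaxTemperatureMapperNewApi
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        // Bad-record filtering omitted here; see the old-API sketch above.
        int airTemperature = Integer.parseInt(line.substring(87, 92).replace("+", ""));
        context.write(new Text(year), new IntWritable(airTemperature));
      }
    }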

Data Flow for Large Inputs
  To scale out, we need to store the data in a distributed filesystem, HDFS (Chap. 3)
  A MapReduce job is divided into map tasks and reduce tasks
  Two types of nodes:
    Jobtracker
      Coordinates all the jobs on the system by scheduling tasks to run on tasktrackers
      If a task fails, the jobtracker can reschedule it on a different tasktracker
    Tasktracker
      Runs tasks and sends progress reports to the jobtracker
  Hadoop divides the input into fixed-size pieces called input splits
    Hadoop creates one map task for each split
    Each map task runs the user-defined map function for each record in the split

Data Flow for Large Inputs
  Size of splits
    Smaller splits are better for load balancing: a faster machine will be able to process more splits
    But if splits are too small, the overhead of managing them dominates the total execution time
    For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default
  Data locality optimization
    Run the map task on a node where the input data resides in HDFS
    This is why the split size equals the block size: it is the largest input size that is guaranteed to be stored on a single node
    If a split spanned two blocks, it would be unlikely that any HDFS node stored both blocks
  Map tasks write their output to local disk (not to HDFS)
    Map output is intermediate output; once the job is complete it can be thrown away
    So storing it in HDFS, with replication, would be overkill
    If the node running the map task fails, Hadoop automatically reruns the map task on another node

Data Flow for Large Inputs
  Reduce tasks don't have the advantage of data locality
    The input to a single reduce task is normally the output from all mappers
    The output of the reduce is stored in HDFS for reliability
  The number of reduce tasks is not governed by the size of the input, but is specified independently
  When there are multiple reducers, the map tasks partition their output: one partition for each reduce task
    The records for any given key are all in a single partition
    Partitioning can be controlled by a user-defined partitioning function (see the sketch below)
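A minimal sketch, with the old API, of how a user-defined partitioning function looks; YearPartitioner is a hypothetical name, and its hash-by-key behaviour is what Hadoop's default HashPartitioner already does:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Sends each key (year) to a reducer chosen by hashing, so all records
    // for one year end up in the same partition.
    public class YearPartitioner implements Partitioner<Text, IntWritable> {

      public void configure(JobConf job) { }   // no configuration needed

      public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

In the driver, the reducer count and the partitioner would be wired up with conf.setNumReduceTasks(2) and conf.setPartitionerClass(YearPartitioner.class).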

Combiner Function
  Minimizes the data transferred between map and reduce tasks
  The combiner function is run on the map output
  Hadoop does not guarantee how many times it will call the combiner function for a particular map output record
    It is purely an optimization: the number of calls (even zero) must not affect the output of the reducers
  Taking a maximum is safe to combine because it can be computed piecewise:
    max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
  Running the job as a distributed MapReduce job on a 10-node EC2 cluster of High-CPU Extra Large Instances took 6 minutes (a one-line driver change for the combiner is sketched below)
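Because max is commutative and associative, the reducer class itself can double as the combiner. A sketch of the change, as an excerpt from the driver shown earlier (assuming the MaxTemperatureReducer class from the earlier sketch):

    // Excerpt from the MaxTemperature driver's main():
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setCombinerClass(MaxTemperatureReducer.class);  // run the reduce logic on each map's local output
    conf.setReducerClass(MaxTemperatureReducer.class);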

Hadoop Streaming
  An API for writing MapReduce programs in other languages (Ruby, Python, ...)
  Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program: it reads map and reduce input from standard input and writes output to standard output
  (The original slides show the map and reduce functions written in Ruby)