Beyond Mapper and Reducer


Beyond Mapper and Reducer: Partitioner, Combiner, Hadoop Parameters and more
Rozemary Scarlat
September 13, 2011

Data flow with multiple reducers

Partitioner

- The map tasks partition their output, each creating one partition for each reduce task.
- There can be many keys per partition, but all records for a given key are in a single partition.
- Default partitioner: HashPartitioner, which hashes a record's key to determine which partition the record belongs in.
- Another partitioner: TotalOrderPartitioner, which creates a total order by reading split points from an externally generated source.

The partitioning can be controlled by a user-defined partitioning function. Don't forget to set the partitioner class:

job.setPartitionerClass(OurPartitioner.class);

Useful information about partitioners:
- Hadoop book: Total Sort (p. 237) and Multiple Outputs (p. 244)
- http://chasebradford.wordpress.com/2010/12/12/reusable-total-order-sorting-in-hadoop/
- http://philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/ (note: uses the old API!)

Partitioner example
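The example code itself did not survive the transcript; a minimal sketch of a user-defined partitioner, assuming a hypothetical job with Text keys and IntWritable values, could look like this:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: route each record by the first letter of its key,
// so all keys sharing a first letter land in the same partition (and reducer).
public class OurPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        char first = s.isEmpty() ? 'a' : Character.toLowerCase(s.charAt(0));
        return first % numPartitions;  // char is unsigned, so this is never negative
    }
}

Wiring it in is the job.setPartitionerClass(OurPartitioner.class) call shown on the previous slide.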

Combiner

- The combiner receives as input all data emitted by the mapper instances on a given node.
- Its output is sent to the reducers (instead of the mappers' output).
- Hadoop does not guarantee how many times it will call the combiner for a particular map output record, so calling the combiner 0, 1, or many times must result in the same reducer output.
- Generally, the combiner is called as the sort/merge result is written to disk.
- The combiner must:
  - be side-effect free
  - have the same input and output key types and the same input and output value types
- You may want to change the example code a little to answer a different query.

Combiner example
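The slide's code is likewise missing from the transcript; a minimal sketch, assuming the classic word count job, where summing counts is associative and commutative, so the reducer class can double as the combiner:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts emitted for each word. Running this 0, 1, or many times
// between map and reduce leaves the final output unchanged, which is exactly
// the property a combiner must have.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

In the driver, job.setCombinerClass(SumReducer.class) enables it; note that the input and output key/value types match, as required above.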

Parameters and more

- Cluster-level parameters (e.g. HDFS block size)
- Job-specific parameters (e.g. number of reducers, map output buffer size)
  - Configurable, and important for job performance
  - Map-side / reduce-side / task-environment parameters: Tables 6-1, 6-2, and 6-5 in the book
  - Full list of MapReduce parameters with their default values: http://hadoop.apache.org/common/docs/current/mapred-default.html
- User-defined parameters
  - Used to pass information from the driver (main) to the mapper/reducer; see the sketch below
  - Help to make your mapper/reducer more generic
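A minimal sketch of a user-defined parameter, assuming a hypothetical property name my.filter.min.temperature; the driver stores it in the job's Configuration, and the mapper reads it back in setup():

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Driver side (before job submission):
//   conf.setInt("my.filter.min.temperature", -100);   // hypothetical property
public class FilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private int minTemp;

    @Override
    protected void setup(Context context) {
        // Runs once per task; default to 0 if the driver did not set the property.
        minTemp = context.getConfiguration().getInt("my.filter.min.temperature", 0);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use minTemp to decide whether to parse and emit this record ...
    }
}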

Identity Mapper/Reducer

- There are also built-in parameters, managed by Hadoop, that cannot be changed but can be read. For example, the path to the current input (useful when joining datasets) is read with:

FileSplit split = (FileSplit) context.getInputSplit();
String inputFile = split.getPath().toString();

- Counters: built-in (Table 8-1 in the book) and user-defined (e.g. to count the number of missing records and the distribution of temperature quality codes in the NCDC weather data set); see the sketch below
- MapReduce types: you already know some (e.g. setMapOutputKeyClass()), but there are more; see Table 7-1 in the book
- Identity Mapper/Reducer: no processing of the data (output == input)
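A minimal sketch of a user-defined counter, assuming a hypothetical enum inside the mapper; Hadoop aggregates each counter across all tasks and reports the totals with the job's final status:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class QualityMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Hypothetical counters: each enum constant becomes one named counter.
    enum RecordQuality { MISSING, MALFORMED }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.getLength() == 0) {
            // Count the bad record instead of emitting anything for it.
            context.getCounter(RecordQuality.MISSING).increment(1);
            return;
        }
        // ... parse the record, incrementing MALFORMED on parse errors ...
    }
}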

Why do we need map/reduce functions without any logic in them?

- Most often for sorting
- More generally, when you only want to use the basic functionality provided by Hadoop (e.g. sorting/grouping)
- More on sorting: page 237 in the book

MapReduce library classes: ready-made mappers/reducers for commonly used functions (e.g. InverseMapper, which swaps keys and values); see Table 8-2 in the book.

Implementing the Tool interface adds support for generic command-line options:
- The handling of standard command-line options is done by ToolRunner.run(Tool, String[]); the application only handles its custom arguments (see the driver sketch below).
- The most used generic command-line options are -conf <configuration file> and -D <property=value>.
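A minimal driver sketch tying this slide together: it implements Tool and sets no mapper or reducer, so the identity defaults apply and the framework's sort does all the work. The class and job names are illustrative:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SortDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects any -conf file or -D overrides
        // parsed by ToolRunner before run() is called.
        Job job = Job.getInstance(getConf(), "sort");
        job.setJarByClass(SortDriver.class);
        // No setMapperClass/setReducerClass: the defaults are the identity
        // Mapper and Reducer, so records pass through and get sorted/grouped.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options, then hands the rest to run().
        System.exit(ToolRunner.run(new SortDriver(), args));
    }
}

Note: Job.getInstance is the Hadoop 2.x idiom; on older releases the equivalent is new Job(getConf(), "sort").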

How to determine the number of splits?

- If a file is large enough and splittable, it will be split into multiple pieces (split size = block size).
- If a file is non-splittable, it gets only one split.
- If a file is small (smaller than a block), there is one split per file, unless...

CombineFileInputFormat
- Merges multiple small files into one split, which is then processed by one mapper (see the driver sketch below)
- Saves mapper slots and reduces the overhead

Other options to handle small files?
hadoop fs -getmerge src dest
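A minimal driver sketch for the small-files case, assuming a Hadoop version that ships CombineTextInputFormat (a concrete, text-oriented subclass of CombineFileInputFormat); the job name is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small files");
        // Pack many small files into fewer, larger splits.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at roughly one HDFS block (128 MB here).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024);
        // ... set mapper, input/output paths, and submit as usual ...
    }
}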