
1 CS350 - MAPREDUCE USING HADOOP Spring 2012

2 PARALLELIZATION: BASIC IDEA
Parallelization is “easy” if processing can be cleanly split into n units.

3 PARALLELIZATION: BASIC IDEA

4 THE PROBLEM
Google faced the problem of analyzing huge data sets (on the order of petabytes), e.g. PageRank or web access logs.
The algorithm to process the data can be reasonably simple, but to finish in an acceptable amount of time the task must be split and forwarded to potentially thousands of machines.
Programmers were forced to develop software that:
 Splits the data
 Forwards data and code to the participating nodes
 Checks node state to react to errors
 Retrieves and organizes the results
This is tedious, error-prone, and time-consuming... and it had to be done for each problem.

5 THE SOLUTION: MAPREDUCE
MapReduce is an abstraction for organizing parallelizable tasks. The algorithm has to be adapted to fit MapReduce's two main steps:
 Map: data processing (with an intermediate collecting/grouping/distribution step)
 Reduce: data collection and digesting
The MapReduce architecture provides:
 Automatic parallelization & distribution
 Fault tolerance
 I/O scheduling
 Monitoring & status updates

6 LIST PROCESSING
Conceptually, MapReduce programs transform lists of input data elements into lists of output data elements. A MapReduce program does this twice, using two different list-processing idioms:
 Map
 Reduce
These terms are taken from list-processing languages such as Lisp, Scheme, and ML.

7 MAPPING LISTS
A list of data elements is provided, one at a time, to a function called the Mapper, which transforms each element individually into an output data element.

8 REDUCING LISTS
Reducing lets you aggregate values. A reducer function receives an iterator of input values from an input list and combines these values, returning a single output value.
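As a plain-Java illustration of these two idioms (standard streams library, not Hadoop; the integer list is arbitrary example data), a minimal sketch might look like this:

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class ListIdioms {
        public static void main(String[] args) {
            List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);

            // Map: transform each element of the input list individually.
            List<Integer> squares = input.stream()
                                         .map(x -> x * x)
                                         .collect(Collectors.toList());

            // Reduce: combine all values of a list into a single output value.
            int sum = squares.stream().reduce(0, Integer::sum);

            System.out.println(squares + " summed: " + sum);  // [1, 4, 9, 16, 25] summed: 55
        }
    }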

9 MAPPING IN MAPREDUCE (KEYS AND VALUES)
In MapReduce, no value stands on its own. Every value has a key associated with it; keys identify related values. For example, a log of time-coded speedometer readings from multiple cars could be keyed by license-plate number:
AAA-123 65mph, 12:00pm
ZZZ-789 50mph, 12:02pm
AAA-123 40mph, 12:05pm
CCC-456 25mph, 12:15pm
...
The mapping and reducing functions receive not just values, but (key, value) pairs. The output of each of these functions is the same: both a key and a value must be emitted to the next list in the data flow.

10 REDUCING IN MAPREDUCE (KEYS DIVIDE THE REDUCE SPACE)
In MapReduce, the output values are not usually all reduced together. All of the values with the same key are presented to a single reducer together. This happens independently of any reduce operations occurring on other lists of values with different keys attached.

11 MAPREDUCE DATA FLOW

12 EXAMPLE: WORD COUNT
A simple MapReduce program can be written to determine how many times different words appear in a set of files. For example, if we had the files:
 foo.txt: Sweet, this is the foo file
 bar.txt: This is the bar file
We would expect the output to be:
sweet 1
this 2
is 2
the 2
foo 1
bar 1
file 2

13 WORD COUNT IN MAPREDUCE (2)
The high-level structure would look like this:

mapper (filename, file-contents):
  for each word in file-contents:
    emit (word, 1)

reducer (word, values):
  sum = 0
  for each value in values:
    sum = sum + value
  emit (word, sum)

14 WORD COUNT IN MAPREDUCE

15 WORD COUNT SOURCE
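The full Java listing on this slide was an image, so it is missing from the transcript. A minimal sketch of what such a mapper and reducer might look like, written against the old org.apache.hadoop.mapred API whose OutputCollector/Reporter terminology the later slides use (class names are illustrative, not the original slide's code):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

        // Mapper: for every word in the input line, emit (word, 1).
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    output.collect(word, ONE);
                }
            }
        }

        // Reducer: sum the counts collected for each word, emit (word, total).
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }
    }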

16 WORD COUNT DRIVER
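This slide's listing was also an image. A hedged sketch of a typical old-API driver, assuming the WordCount class from the previous sketch and taking the input and output paths as command-line arguments:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            // Types of the (key, value) pairs the job emits.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(WordCount.Map.class);
            conf.setReducerClass(WordCount.Reduce.class);

            // How records are read from and written to HDFS (slides 18-26).
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);  // submit the job and wait for it to finish
        }
    }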

17 A CLOSER LOOK

18 INPUT FILES
This is where the data for a MapReduce task is initially stored. The input files typically reside in HDFS. The format of these files can be:
 Line-based log files
 Binary format files
 Multi-line input records
It is typical for these input files to be very large -- tens of gigabytes or more.

19 INPUT FORMAT
How these input files are split up and read is defined by the InputFormat. An InputFormat is a class that provides the following functionality:
 Selects the files or other objects that should be used for input
 Defines the InputSplits that break a file into tasks
 Provides a factory for RecordReader objects that read the file
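In the driver sketch above, the InputFormat is chosen with setInputFormat(). As a usage example (assuming the same JobConf), swapping in another of the old API's built-in formats changes how records are produced:

    // TextInputFormat (the default) emits (byte offset, line of text) pairs.
    // KeyValueTextInputFormat splits each line at the first tab into (key, value).
    // SequenceFileInputFormat reads Hadoop's binary SequenceFile container.
    conf.setInputFormat(org.apache.hadoop.mapred.KeyValueTextInputFormat.class);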

20 INPUT SPLITS
An InputSplit describes the unit of work that comprises a single map task in a MapReduce program.
 A MapReduce program applied to a data set (collectively referred to as a Job) is made up of several, possibly several hundred, tasks.
By processing a file in chunks, we allow several map tasks to operate on a single file in parallel.
 The various blocks that make up the file may be spread across several different nodes in the cluster.
 The individual blocks are thus all processed locally, instead of needing to be transferred from one node to another.
Tasks are assigned to the nodes in the system based on where the input file chunks physically reside.
 An individual node may have several dozen tasks assigned to it.
 The node will begin working on the tasks, attempting to perform as many in parallel as it can.

21 RECORD READER
The InputSplit defines a slice of work but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader is invoked repeatedly on the input until the entire InputSplit has been consumed.
 Each invocation of the RecordReader leads to another call to the map() method of the Mapper.
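A simplified sketch of the loop the framework runs for one InputSplit (roughly what the old API's MapRunner does internally); it shows how each successful next() call on the RecordReader feeds exactly one map() call:

    import java.io.IOException;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class MapLoopSketch {
        public static <K1, V1, K2, V2> void run(RecordReader<K1, V1> reader,
                                                Mapper<K1, V1, K2, V2> mapper,
                                                OutputCollector<K2, V2> output,
                                                Reporter reporter) throws IOException {
            K1 key = reader.createKey();      // reusable key object
            V1 value = reader.createValue();  // reusable value object
            while (reader.next(key, value)) { // false once the split is consumed
                mapper.map(key, value, output, reporter);
            }
            reader.close();
        }
    }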

22 MAPPER
Given a key and a value, the map() method emits (key, value) pair(s), which are forwarded to the Reducers. The individual mappers are intentionally not provided with a mechanism to communicate with one another in any way.
 This allows the reliability of each map task to be governed solely by the reliability of the local machine.
The map() method receives two parameters in addition to the key and the value:
 The OutputCollector object has a method named collect() which will forward a (key, value) pair to the reduce phase of the job.
 The Reporter object provides information about the current task.

23 PARTITION & SHUFFLE
After the first map tasks have completed, the nodes may still be performing several more map tasks each, but they also begin exchanging the intermediate outputs of the map tasks with the reducers that require them.
 This process of moving map outputs to the reducers is known as shuffling.
A different subset of the intermediate key space is assigned to each reduce node; these subsets (known as "partitions") are the inputs to the reduce tasks. Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together, regardless of which mapper produced them. Therefore, the map nodes must all agree on where to send the different pieces of the intermediate data.
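That agreement is implemented by a partitioner. A sketch, equivalent in spirit to Hadoop's default HashPartitioner and written against the old org.apache.hadoop.mapred API (the class name is illustrative):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class HashPartitionerSketch implements Partitioner<Text, IntWritable> {
        public void configure(JobConf job) { }  // no per-job configuration needed

        // Every mapper evaluates the same deterministic function, so all
        // (key, value) pairs with the same key land in the same partition.
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }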

24 REDUCER
Sort
 Each reduce task is responsible for reducing the values associated with several intermediate keys.
 The set of intermediate keys on a single node is automatically sorted by Hadoop before being presented to the Reducer.
Reduce
 A Reducer instance is created for each reduce task. This is an instance of user-provided code that performs the second important phase of job-specific work.
 For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called once. It receives the key as well as an iterator over all the values associated with that key; the iterator returns the values in an undefined order.
 The Reducer also receives OutputCollector and Reporter objects as parameters; they are used in the same manner as in the map() method.

25 OUTPUT FORMAT
The (key, value) pairs provided to the OutputCollector are then written to output files. The way they are written is governed by the OutputFormat.
 Each Reducer writes a separate file in a common output directory.
 The output directory is set by the FileOutputFormat.setOutputPath() method.

26 RECORD WRITER
The OutputFormat class is a factory for RecordWriter objects; these are used to write the individual records to the files, as directed by the OutputFormat. The output files written by the Reducers are then left in HDFS for use by:
 Another MapReduce job
 A separate program
 Human inspection

27 ADDITIONAL FUNCTIONALITY

28 FAULT TOLERANCE
One of the primary reasons to use Hadoop to run your jobs is its high degree of fault tolerance.
Map worker failure
 Map tasks completed or in progress at the worker are reset to idle.
 Reduce workers are notified when a task is rescheduled on another worker.
Reduce worker failure
 Only in-progress tasks are reset to idle.
Master failure
 The MapReduce task is aborted and the client is notified.
Should we have task identities?

29 EXAMPLE: INVERTED INDEX
An inverted index maps each word to the list of documents that contain it. Thus, if the word "cat" appears in documents A and B, but not C, then the line:
 cat A, B
should appear in the output. If the word "baseball" appears in documents B and C, then the line:
 baseball B, C
should appear in the output as well.

30 INVERTED INDEX CODE Using Eclipse and Hadoop
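The code on this slide was shown as a screenshot. A hedged sketch of what an inverted-index mapper and reducer might look like, in the same old-API style as the word-count example; deriving the document name from the task's FileSplit via Reporter.getInputSplit() is one common approach, and all class names are illustrative:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class InvertedIndex {

        // Mapper: emit (word, document-name) for every word in the line.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, Text> output,
                            Reporter reporter) throws IOException {
                String doc = ((FileSplit) reporter.getInputSplit()).getPath().getName();
                Text docName = new Text(doc);
                Text word = new Text();
                StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    output.collect(word, docName);
                }
            }
        }

        // Reducer: list each distinct document in which the word was seen.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, Text, Text, Text> {
            public void reduce(Text key, Iterator<Text> values,
                               OutputCollector<Text, Text> output,
                               Reporter reporter) throws IOException {
                Set<String> seen = new HashSet<String>();
                StringBuilder list = new StringBuilder();
                while (values.hasNext()) {
                    String doc = values.next().toString();
                    if (seen.add(doc)) {                       // skip duplicate documents
                        if (list.length() > 0) list.append(", ");
                        list.append(doc);
                    }
                }
                output.collect(key, new Text(list.toString()));
            }
        }
    }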

31 REFERENCES
Yahoo! Hadoop tutorial
 http://developer.yahoo.com/hadoop/tutorial/index.html
Processing of massive data: MapReduce
 http://lsd.ls.fi.upm.es/lsd/nuevas-tendencias-en-sistemas-distribuidos/IntroToMapReduce.pdf
Hadoop webpage
 http://hadoop.apache.org/common/docs/current/
CS-350 Concurrency in the Cloud (for the masses)

