
1 Google MapReduce Framework A Summary of: MapReduce & Hadoop API Slides prepared by Peter Erickson (poe9514@rit.edu)

2 What is MapReduce?
- Massively parallel processing of very large data sets (larger than 1 TB).
- Parallelize the computation over hundreds or thousands of CPUs.
- Ensure fault tolerance.
- Do all of this through an easy-to-use, abstract, reusable framework.

3 What is MapReduce?
- A simple algorithm that is easily parallelized and efficiently handles large data sets:
  MAP – transform the input data into (key, value) pairs, then
  REDUCE – perform a reduction across the n maps (nodes).
- A simple, clean abstraction for programmers.

4 Implementation
- Programmers need only implement two functions:
  map( Object key, Object value ) -> Map
  reduce( Object key, List values ) -> Map
- Let's look at an example to better understand these functions.

5 Word Count Example Program
- Given a document of words, count the occurrences of each word, e.g. (the:42), (is:10), (a:23), (computer:2), etc.
- How would you perform this task sequentially? Build a map from each word to the number of times it occurs in the document.
- How would you perform this task in parallel? Split the document amongst n processors, map words in the same way on each node, then reduce across all nodes (see the sequential sketch below for comparison).
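
For reference, here is what the sequential version might look like in plain Java, with no Hadoop involved (an illustrative sketch; class and variable names are not from the slides):

    import java.util.HashMap;
    import java.util.Map;

    // Sequential word count, for comparison with the MapReduce version below.
    public class SequentialWordCount {
        public static Map<String, Integer> count(String document) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            // split the document on whitespace and tally each word
            for (String word : document.split("\\s+")) {
                if (word.isEmpty()) continue;
                Integer old = counts.get(word);
                counts.put(word, old == null ? 1 : old + 1);
            }
            return counts;
        }

        public static void main(String[] args) {
            System.out.println(count("the cat sat on the mat"));
            // prints something like {the=2, cat=1, sat=1, on=1, mat=1}
        }
    }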

6 Word Count: Map function
Map is performed on each node, using a subset of the overall data.

// input_key: document name
// input_value: document contents
map( String input_key, String input_value ) {
    // iterate over all words; split("\\s+") breaks the string on whitespace
    for ( String word : input_value.split( "\\s+" ) ) {
        // insert "1" into the output map for this word (there will be collisions)
        output.put( word, "1" );
    }
}

7 Word Count: Map function

// input_key: document name
// input_value: document contents
map( String input_key, String input_value ) {
    // iterate over all words; split("\\s+") breaks the string on whitespace
    for ( String word : input_value.split( "\\s+" ) ) {
        // insert "1" into the output map for this word (there will be collisions)
        output.put( word, "1" );
    }
}

- Collisions in the map are IMPORTANT.
- Values with the same key are passed to the Reduce Function.
- The Reduce Function decides how to merge these collisions.

8 Reduction Phase
- Values in the map are merged in some way using the Reduce Function.
- The Reduce Function can add, subtract, multiply, divide, take the average, ignore all the data, or do anything else the programmer chooses.
- reduce() is passed a key and a list of all the values that share that key.
- reduce() is expected to return a new Map with the reduced values.

9 Reduction Phase
- For example, if reduce() is passed:
  Key (word): "the"
  Values: { 1, 1, 1, 1, 1 }
- it would form a new summed (key, value) pair in the output map:
  Key (word): "the"
  Value: 5

10 Reduction Phase
Reduction can happen on any number of nodes before forming a final map:

  Node 1: ["the", {1, 1, 1}]   Node 2: ["the", {1, 1, 1, 1}]   Node 3: ["the", {1}]
               ------------------ REDUCE ------------------
          ["the", {7}]                                ["the", {1}]
               ------------------ REDUCE ------------------
                       Final result: ["the", {8}]
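
In Hadoop (introduced on slide 12), this kind of partial, per-node reduction before the final reduce is done by a Combiner. When the reduce function is associative and commutative (as summation is), the reducer class can usually be reused as the combiner. A minimal configuration sketch, assuming hypothetical WordCountMapper and WordCountReducer classes (these names are not from the slides):

    // partial, per-node reduce before data crosses the network
    JobConf conf = new JobConf(WordCount.class);
    conf.setMapperClass(WordCountMapper.class);
    conf.setCombinerClass(WordCountReducer.class);  // per-node partial sums
    conf.setReducerClass(WordCountReducer.class);   // final reduce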

11 Word Count: Reduce Function

// key: a word
// values: a list of collided values for this word
reduce( String key, List<Integer> values ) {
    // sum the counts for this word
    int total = 0;
    // iterate over all integers in the collided value list
    for ( int count : values ) {
        total += count;
    }
    output.put( key, total );
}
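
To see how the framework glues the two functions together, here is a tiny single-machine simulation (plain Java, illustrative only): map every word to a 1, group the emitted pairs by key (the "shuffle"), then apply the reduce logic to each group.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Single-machine simulation of the word-count map/reduce pair above.
    // A real MapReduce framework distributes these steps across nodes.
    public class WordCountSimulation {
        public static void main(String[] args) {
            String document = "the cat sat on the mat the end";

            // MAP: emit (word, 1) for every word
            // SHUFFLE: group the emitted values by key
            Map<String, List<Integer>> grouped = new HashMap<String, List<Integer>>();
            for (String word : document.split("\\s+")) {
                List<Integer> list = grouped.get(word);
                if (list == null) {
                    list = new ArrayList<Integer>();
                    grouped.put(word, list);
                }
                list.add(1);
            }

            // REDUCE: sum the list of 1s for each word
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
                int total = 0;
                for (int count : entry.getValue()) {
                    total += count;
                }
                counts.put(entry.getKey(), total);
            }

            System.out.println(counts);  // e.g. {the=3, cat=1, sat=1, on=1, mat=1, end=1}
        }
    }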

12 Hadoop: Java MapReduce Implementation
- Hadoop is an open source project run by the Apache Software Foundation.
- Provides an API to write MapReduce programs easily and efficiently in Java.
- Installed on the Thug (Paranoia) cluster at RIT.
- Used in conjunction with a distributed file system modeled on the Google File System (HDFS).
- Hadoop has several frameworks for the different parts of the MapReduce paradigm.

13 Hadoop API
A Summary of:
- Input Formats
- Record Readers
- Mapping Function
- Reducing Function
- Output Formats
- General Program Layout

14 Hadoop Input Formats
- Most Hadoop programs read their input from a file.
- The data from a Hadoop input file must be parsed into (key, value) pairs, even before the map() function runs.
- The key and value types read from the input file are separate from the key and value types used by map and reduce.
- A very simple input format: TextInputFormat
  Key: the byte offset of the line within the input file
  Value: a line of text from the input file
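
As a worked example (illustrative, assuming Unix newlines): a file containing the two lines "hello world" and "goodbye" would be handed to the mappers as the pairs (0, "hello world") and (12, "goodbye"), where each key is the byte offset at which the line starts.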

15 Hadoop Data Types
- Special data types are used in Hadoop for serialization between nodes and for data comparison:
  IntWritable – integer type
  LongWritable – long integer type
  Text – string type
  many, many more
- The WordCount program uses the TextInputFormat, as we see in its map() signature:

public void map( LongWritable key, Text value,
                 OutputCollector<Text, IntWritable> output,
                 Reporter reporter ) throws IOException {

16 Word Count Example Program

public void map( LongWritable key, Text value,
                 OutputCollector<Text, IntWritable> output,
                 Reporter reporter ) throws IOException {

- key = byte offset of the current line in the input file
- value = the text of that line
- output = OutputCollector, a special Map-like sink that allows collisions; results are written to it
- reporter = special Hadoop object for reporting progress; it doesn't need to be used except in more advanced programs
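
Putting slides 6 and 15-16 together, a complete mapper in the old org.apache.hadoop.mapred API could look like the sketch below (a sketch following the standard WordCount pattern; the class name is an assumption, not from the slides):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch of a full WordCount mapper; the slides show only the signature.
    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map( LongWritable key, Text value,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter ) throws IOException {
            // split the line on whitespace and emit (word, 1) for each word
            for (String token : value.toString().split("\\s+")) {
                if (token.length() == 0) continue;
                word.set(token);
                output.collect(word, ONE);
            }
        }
    }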

17 Hadoop Input Format
- The InputFormat object of the program reads data on a single node from a file or other source.
- The InputFormat is then asked to partition the data into smaller sets for each node to process.
- Let's look at a new sample program that does not read any input from a file: sum all prime numbers between 1 and 1000.

18 Prime Number Example
- We could make a file that lists all numbers from 1 to 1000, but this is unnecessary: we can write our own InputFormat class, PrimeInputFormat, to generate these numbers and divide them amongst the nodes.
- The InputFormat should generate the numbers 1 to 1000, then split them into n groups.
- InputFormats generate or read (key, value) pairs, so what should we use? There is no need for a key; we only have values (the numbers), so we use a dummy key.

19 InputFormat Interface

// interface to be used with key type K and value type V
public interface InputFormat<K, V> {

    // validates the input for the specified job (can be ignored)
    public void validateInput( JobConf job ) throws IOException;

    // returns an array of "InputSplit" objects that are sent to each node;
    // the number of splits to be made is given by numSplits
    public InputSplit[] getSplits( JobConf job, int numSplits ) throws IOException;

    // returns an "iterator" of sorts for a node to extract (key, value) pairs
    // from an InputSplit
    public RecordReader<K, V> getRecordReader( InputSplit split, JobConf job,
                                               Reporter reporter ) throws IOException;
}

20 InputSplit Interface

public interface InputSplit extends Writable {
    // get the total bytes of data in this input split
    public long getLength() throws IOException;

    // get the hostnames of where the splits are (can be ignored)
    public String[] getLocations() throws IOException;
}

// represents an object that can be serialized/deserialized
public interface Writable {
    // read in the fields of this object from the DataInput object
    public void readFields( DataInput in ) throws IOException;

    // write the fields of this object to the DataOutput object
    public void write( DataOutput out ) throws IOException;
}

21 Prime Number InputSplit: RangeSplit
An InputSplit for our program need only hold the range of numbers for each node (min, max).

public class RangeSplit implements InputSplit {
    int min, max;

    public RangeSplit() { super(); }

    public long getLength() { return ( long )( max - min ); }

    public String[] getLocations() { return new String[]{}; }

    public void write( DataOutput out ) throws IOException {
        out.writeInt( min );
        out.writeInt( max );
    }

    public void readFields( DataInput in ) throws IOException {
        min = in.readInt();
        max = in.readInt();
    }
}

22 PrimeInputFormat.getSplits()
Create RangeSplit objects for our program:

public InputSplit[] getSplits( JobConf job, int numSplits ) throws IOException {
    RangeSplit[] splits = new RangeSplit[ numSplits ];
    // for simplicity's sake, we're going to assume 1000 is evenly divisible
    // by numSplits, but this may not always be the case
    int len = 1000 / numSplits;
    for ( int i = 0; i < numSplits; i++ ) {
        splits[ i ] = new RangeSplit();
        splits[ i ].min = ( i * len ) + 1;
        splits[ i ].max = ( i + 1 ) * len;
    } // for
    return splits;
} // getSplits

23 Record Reader
- An InputSplit for our program holds the range of numbers for each node (min, max), i.e. for 4 nodes: (1, 250), (251, 500), (501, 750), (751, 1000).
- A RecordReader is then responsible for generating (key, value) pairs from an InputSplit.
- Our RecordReader will iterate from min to max on each node.
- One RecordReader is used per Mapper.

24 RecordReader Interface

// the record reader is responsible for iterating over (key, value) pairs in
// an input split
public interface RecordReader<K, V> {

    // create an empty key/value for the record reader to fill in (for generics)
    public K createKey();
    public V createValue();

    // get the position of the iterator
    public long getPos() throws IOException;

    // get the progress of the iterator (0.0 to 1.0)
    public float getProgress() throws IOException;

    // populate key and value with the next tuple from the InputSplit;
    // returns false once there are no more tuples
    public boolean next( K key, V value ) throws IOException;

    // marks the end of use of the RecordReader
    public void close() throws IOException;
} // RecordReader

25 PrimeInputFormat.getRecordReader()

// the record reader is responsible for iterating over (key, value) pairs in
// an input split
public RecordReader<Text, IntWritable> getRecordReader( InputSplit split, JobConf conf,
        Reporter reporter ) throws IOException {

    final RangeSplit range = ( RangeSplit )split;

    // return a new anonymous inner class
    return new RecordReader<Text, IntWritable>() {
        int pos = 0;

        public Text createKey() { return new Text(); }

        public IntWritable createValue() { return new IntWritable(); }

        public long getPos() { return pos; }

        public float getProgress() {
            return ( float )pos / ( float )( range.max - range.min );
        }
        // continued...

26 PrimeInputFormat.getRecordReader() (continued)

        // populate the next (key, value) pair and advance the position;
        // return false once the whole range has been produced
        public boolean next( Text key, IntWritable value ) throws IOException {
            // get the number at this position
            int val = range.min + pos;
            if ( val > range.max ) {
                return false;
            }
            // dummy key value
            key.set( "key" );
            // set the number for the value
            value.set( val );
            // increment the position
            pos++;
            return true;
        }

        // close the RecordReader
        public void close() { }
    };
}

27 Prime Number Mapper
- Our program's Mapper now reads in a dummy key and a number. What should our new map output data types be?
  BooleanWritable = prime / not prime
  IntWritable = the number
- The Reducer can then add together all values with a "true" boolean key and ignore all "false" values.

public static class PrimeMapper extends MapReduceBase
        implements Mapper<Text, IntWritable, BooleanWritable, IntWritable>
        //                ^   mapper input  ^ ^      mapper output        ^

28 Prime Sum Mapper

public static class PrimeMapper extends MapReduceBase
        implements Mapper<Text, IntWritable, BooleanWritable, IntWritable> {

    public void map( Text key, IntWritable value,
                     OutputCollector<BooleanWritable, IntWritable> output,
                     Reporter reporter ) throws IOException {
        // check if the number is prime - choose your favorite prime number
        // testing algorithm
        if ( isPrime( value.get() ) ) {
            output.collect( new BooleanWritable( true ), value );
        } else {
            output.collect( new BooleanWritable( false ), value );
        }
    }
} // PrimeMapper
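
The slides leave isPrime() to the reader; a minimal trial-division sketch (illustrative only, any primality test would do) could be:

    // Trial-division primality test; sufficient for numbers up to 1000.
    // (Helper assumed by the mapper above, not defined in the original slides.)
    private static boolean isPrime( int n ) {
        if ( n < 2 ) return false;
        for ( int i = 2; i * i <= n; i++ ) {
            if ( n % i == 0 ) return false;
        }
        return true;
    }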

29 Prime Number Reducer
- The Reducer will take multiple (boolean, int) pairs and reduce them to a single (text, int) pair: ("Sum", int).

public static class PrimeReducer extends MapReduceBase
        implements Reducer<BooleanWritable, IntWritable, Text, IntWritable>
        //                ^       reducer input        ^ ^ reducer output  ^

30 Prime Sum Reducer

public static class PrimeReducer extends MapReduceBase
        implements Reducer<BooleanWritable, IntWritable, Text, IntWritable> {

    public void reduce( BooleanWritable key, Iterator<IntWritable> values,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter ) throws IOException {
        // ignore the "false" values
        if ( key.get() ) {
            // sum the values and write to the output collector
            int sum = 0;
            while ( values.hasNext() ) {
                sum += values.next().get();
            }
            output.collect( new Text( "Sum" ), new IntWritable( sum ) );
        }
    }
} // PrimeReducer

31 Output Formats
- Hadoop uses an OutputFormat just like an InputFormat.
- Easiest to use: TextOutputFormat
  Prints one key per line to an output file (by default as "key<TAB>value").
  The file is written to the distributed file system (HDFS).
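
For the prime-sum job, the final output file would therefore contain a single line of the form (assuming TextOutputFormat's default tab separator; the value shown here is a placeholder):

    Sum	<sum of the primes from 1 to 1000>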

32 Program Layout
Our example Prime Sum program:

public static void main( String[] args ) throws IOException {
    JobConf job = new JobConf( PrimeSum.class );
    job.setJobName( "primesum" );
    job.setOutputPath( new Path( "primesum-output" ) );
    job.setOutputKeyClass( Text.class );
    job.setOutputValueClass( IntWritable.class );
    job.setMapperClass( PrimeMapper.class );
    job.setReducerClass( PrimeReducer.class );
    job.setInputFormat( PrimeInputFormat.class );
    job.setOutputFormat( TextOutputFormat.class );

33 Program Layout
Last but not least:

    // run the job
    JobClient.runJob( job );
} // main
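
To submit the job, one would typically package the classes into a jar and launch it with the hadoop command, e.g. (the jar and class names here are assumptions): hadoop jar primesum.jar PrimeSum, then read the result from the output directory with hadoop fs -cat primesum-output/part-00000.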

