Creating Map-Reduce Programs Using Hadoop. Presentation Overview Recall Hadoop Overview of the map-reduce paradigm Elaboration on the WordCount example.

Creating Map-Reduce Programs Using Hadoop

Presentation Overview Recall Hadoop Overview of the map-reduce paradigm Elaboration on the WordCount example components of Hadoop that make WordCount possible Major new example: N-Gram Generator step-by-step assembly of this map-reduce job Design questions to ask when creating your own Hadoop jobs

Recall why Hadoop rocks. Hadoop is: free and open source; high quality, like all Apache Foundation projects; cross-platform (pure Java); fault-tolerant; highly scalable; equipped with bindings for non-Java programming languages; applicable to many computational problems.

Map-Reduce System Overview JobTracker – Makes scheduling decisions TaskTracker – Manages tasks for a given node Task process Runs an individual map or reduce fragment for a given job Forks from the TaskTracker

Map-Reduce System Overview. Processes communicate by a custom RPC implementation: easy to change/extend; defined as Java interfaces; server objects implement the interface; client proxy objects are automatically created. All messages originate at the client (e.g., Task to TaskTracker), which prevents cycles and therefore deadlocks.

Process Flow Diagram

Application Overview Launching Program Creates a JobConf to define a job. Submits JobConf to JobTracker and waits for completion. Mapper Is given a stream of key1,value1 pairs Generates a stream of key2, value2 pairs Reducer Is given a key2 and a stream of value2’s Generates a stream of key3, value3 pairs
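The mapper and reducer contracts above can be sketched without any Hadoop classes. The following is a minimal plain-Java illustration of the data flow (map each record to key/value pairs, group the values by key, reduce each group), not Hadoop's implementation; the class and method names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration of map -> group-by-key -> reduce.
// No Hadoop classes are used; every name here is hypothetical.
class MiniMapReduce {

    // "Map" step: one (offset, line) record yields zero or more (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // "Reduce" step: one key plus all of its values yields one total.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: maps every line, groups values by key in sorted order, then reduces.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1; // approximate byte offset of the next line
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

The TreeMap stands in for the framework's sort-and-group phase, so keys reach reduce() in sorted order, as they do in a real job.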

Job Launch Process: Client. The client program creates a JobConf. Identify classes implementing the Mapper and Reducer interfaces: JobConf.setMapperClass(); JobConf.setReducerClass(). Specify input and output formats: JobConf.setInputFormat(TextInputFormat.class); JobConf.setOutputFormat(TextOutputFormat.class). Other options too: JobConf.setNumReduceTasks(), JobConf.setOutputFormat(), and many, many more (Facade pattern).

An onslaught of terminology. We'll explain these terms, each of which plays a role in any non-trivial map/reduce job: InputFormat, OutputFormat, FileInputFormat, ...; JobClient and JobConf; JobTracker and TaskTracker; TaskRunner, MapTaskRunner, MapRunner, ...; InputSplit, RecordReader, LineRecordReader, ...; Writable, WritableComparable, IntWritable, ...

InputFormat and OutputFormat The application also chooses input and output formats, which define how the persistent data is read and written. These are interfaces and can be defined by the application. InputFormat Splits the input to determine the input to each map task. Defines a RecordReader that reads key, value pairs that are passed to the map task OutputFormat Given the key, value pairs and a filename, writes the reduce task output to persistent store.

Example
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);

        // Types of the (key, value) pairs the job writes.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // The reducer doubles as a combiner because summing is associative.
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Blocks until the job finishes.
        JobClient.runJob(conf);
    }

Job Launch Process: JobClient. Pass the JobConf to JobClient.runJob() or JobClient.submitJob(). runJob() blocks: it waits until the job finishes. submitJob() does not block: poll for status to make running decisions, or avoid polling with JobConf.setJobEndNotificationURI(). JobClient: determines the proper division of input into InputSplits; sends job data to the master JobTracker server.

Job Launch Process: JobTracker JobTracker: Inserts jar and JobConf (serialized to XML) in shared location Posts a JobInProgress to its run queue

Job Launch Process: TaskTracker TaskTrackers running on slave nodes periodically query JobTracker for work Retrieve job-specific jar and config Launch task in separate instance of Java main() is provided by Hadoop

Job Launch Process: Task TaskTracker.Child.main(): Sets up the child TaskInProgress attempt Reads XML configuration Connects back to necessary MapReduce components via RPC Uses TaskRunner to launch user process

Job Launch Process: TaskRunner TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper Task knows ahead of time which InputSplits it should be mapping Calls Mapper once for each record retrieved from the InputSplit Running the Reducer is much the same

Creating the Mapper You provide the instance of Mapper Should extend MapReduceBase Implement interface Mapper One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress Exists in separate process from all other instances of Mapper – no data sharing!

Mapper. Function to implement – map(): void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter). Emit (k2, v2) with output.collect(k2, v2).

Example
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits (word, 1) for every whitespace-separated token in the line.
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

What is Writable? Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc. All values are instances of Writable All keys are instances of WritableComparable
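To make the idea of a "box" class concrete, here is a hedged sketch of a type that serializes itself the way a Writable does, through write(DataOutput) and readFields(DataInput). The interface SketchWritable is a stand-in defined here so the example compiles without Hadoop, and NGramCount is a hypothetical class, not part of the N-Gram code discussed later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for the Writable contract: serialize to a DataOutput,
// repopulate from a DataInput. Defined locally; not the Hadoop interface.
interface SketchWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical boxed (n-gram, count) pair. Implementing Comparable mirrors
// the extra requirement that keys (WritableComparable) can be sorted.
class NGramCount implements SketchWritable, Comparable<NGramCount> {
    String ngram = "";
    int count;

    NGramCount() { }

    NGramCount(String ngram, int count) {
        this.ngram = ngram;
        this.count = count;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(ngram);
        out.writeInt(count);
    }

    public void readFields(DataInput in) throws IOException {
        ngram = in.readUTF();
        count = in.readInt();
    }

    public int compareTo(NGramCount other) {
        return ngram.compareTo(other.ngram);
    }

    // Round-trips a value through bytes, as happens between map and reduce.
    static NGramCount roundTrip(NGramCount original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            NGramCount copy = new NGramCount();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The no-argument constructor matters: a framework that deserializes values typically creates an empty instance first and then calls readFields() on it.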

Reading data. Data sets are specified by InputFormats. An InputFormat: defines the input data (e.g., a directory); identifies partitions of the data that form an InputSplit; acts as a factory for RecordReader objects to extract (k, v) records from the input source.

FileInputFormat and friends. TextInputFormat – treats each ‘\n’-terminated line of a file as a value. KeyValueTextInputFormat – maps ‘\n’-terminated text lines of “k SEP v”. SequenceFileInputFormat – binary file of (k, v) pairs with some additional metadata. SequenceFileAsTextInputFormat – same, but maps (k.toString(), v.toString()).

Filtering File Inputs FileInputFormat will read all files out of a specified directory and send them to the mapper Delegates filtering this file list to a method subclasses may override e.g., Create your own “xyzFileInputFormat” to read *.xyz from directory list

Record Readers. Without a RecordReader, Hadoop would be forced to divide input on byte boundaries. Each InputFormat provides its own RecordReader implementation. Provides capability multiplexing. LineRecordReader – reads a line from a text file. KeyValueLineRecordReader – used by KeyValueTextInputFormat.

Input Split Size FileInputFormat will divide large files into chunks Exact size controlled by mapred.min.split.size RecordReaders receive file, offset, and length of chunk Custom InputFormat implementations may override split size – e.g., “NeverChunkFile”
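As a rough illustration of how a large file is divided into (offset, length) chunks, consider the simplified sketch below. It takes a single split-size parameter; Hadoop's actual calculation also takes the block size and the configured minimum into account, so treat this as an assumption-laden model rather than the framework's algorithm. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of dividing a file into chunks. Each element of the
// result is a two-element array: {offset, length}. Names are hypothetical.
class SplitSketch {

    static List<long[]> computeSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            // The last chunk may be shorter than splitSize.
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }
}
```

A chunk boundary computed this way can fall in the middle of a record, which is why the RecordReader, not the split, decides where each record really begins and ends.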

Sending Data To Reducers Map function receives OutputCollector object OutputCollector.collect() takes (k, v) elements Any (WritableComparable, Writable) can be used

WritableComparator. Compares WritableComparable data. Will call WritableComparable.compareTo(). Can provide a fast path for serialized data. Explicitly stated in JobConf setup: JobConf.setOutputValueGroupingComparator().
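The "fast path for serialized data" means comparing keys directly in their byte form, without first deserializing them into objects. A minimal sketch of that idea for 4-byte big-endian integers follows; the class and its methods are hypothetical and only model the technique.

```java
import java.nio.ByteBuffer;

// Compares two serialized 4-byte big-endian integers without creating
// any wrapper objects. All names are hypothetical.
class RawIntComparatorSketch {

    // Decodes a big-endian int starting at the given position.
    static int readInt(byte[] bytes, int start) {
        return ((bytes[start] & 0xff) << 24)
             | ((bytes[start + 1] & 0xff) << 16)
             | ((bytes[start + 2] & 0xff) << 8)
             | (bytes[start + 3] & 0xff);
    }

    // Negative, zero, or positive, like Comparable.compareTo().
    static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    // Helper for the example: serializes an int in big-endian order.
    static byte[] toBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }
}
```

During a sort over millions of keys, skipping object creation for every comparison is where the saving comes from.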

Sending Data To The Client. The Reporter object sent to the Mapper allows simple asynchronous feedback: incrCounter(Enum key, long amount); setStatus(String msg). It also allows self-identification of input: InputSplit getInputSplit().

Partitioner. int getPartition(key, val, numPartitions): outputs the partition number for a given key. One partition == values sent to one Reduce task. HashPartitioner is used by default; it uses key.hashCode() to return the partition number. JobConf sets the Partitioner implementation.
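Hash partitioning can be sketched in a few lines. The version below masks off the sign bit before taking the remainder, so that keys with negative hash codes still land in a valid partition; the class name is hypothetical and the value argument is omitted because it plays no part in the calculation.

```java
// Maps a key to a partition number in the range [0, numPartitions).
// The class name is hypothetical.
class HashPartitionSketch {

    static int getPartition(Object key, int numPartitions) {
        // Clearing the sign bit keeps the result non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Because the result depends only on the key, every occurrence of a given key is routed to the same reduce task, which is what makes per-key aggregation possible.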

Reducer. reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter). Keys and values sent to one partition all go to the same reduce task. Calls are sorted by key: “earlier” keys are reduced and output before “later” keys.

Example
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        // Sums all counts emitted for one word and writes (word, total).
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

OutputFormat Analogous to InputFormat TextOutputFormat – Writes “key val\n” strings to output file SequenceFileOutputFormat – Uses a binary format to pack (k, v) pairs NullOutputFormat – Discards output

Presentation Overview Recall Hadoop Overview of the map-reduce paradigm Elaboration on the WordCount example components of Hadoop that make WordCount possible Major new example: N-Gram Generator step-by-step assembly of this map-reduce job Design questions to ask when creating your own Hadoop jobs

Major example: N-Gram Generation. N-gram analysis is a common natural language processing technique (used by Google, etc.). An N-gram is a subsequence of N items in a given sequence (e.g., a subsequence of words in a given text). Example 3-grams (from Google) with corresponding occurrences: ceramics collectables collectibles (55); ceramics collected by (52); ceramics collectibles cooking (45).

Understanding the process Someone wise said, “A week of writing code saves an hour of research.” Before embarking on developing a Hadoop job, walk through the process step by step manually and understand the flow and manipulation of data. Once you can comfortably (and deterministically!) do it mentally, begin writing code.

Requirements. Input: a beginning word/phrase; the n-gram size (bigram, trigram, n-gram); the minimum number of occurrences (frequency); whether letter case matters. Output: all possible n-grams that occur sufficiently frequently.

High-level view of data flow Given: one or more files containing regular text. Look for the desired startword. If seen, take the next N-1 words and add the group to the database. Similarly to word count, find the number of occurrences of each N-gram. Remove those N-grams that do not occur frequently enough for our liking.
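The "look for the startword, then take the next N-1 words" step can be tried out on a single paragraph before involving Hadoop at all. The sketch below is a plain-Java illustration under some simplifying assumptions: the startword is a single word, words are separated by whitespace, and punctuation is left attached to words. The class and method names are hypothetical and are not taken from the hadoop-ngram code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Finds every N-gram in a paragraph that begins with the given start word.
// Names are hypothetical; punctuation handling is deliberately omitted.
class NGramFinder {

    static List<String> find(String paragraph, String startWord, int n, boolean caseSensitive) {
        List<String> ngrams = new ArrayList<>();
        String trimmed = paragraph.trim();
        if (trimmed.isEmpty()) {
            return ngrams;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            boolean match = caseSensitive
                    ? words[i].equals(startWord)
                    : words[i].equalsIgnoreCase(startWord);
            // Skip matches too close to the end to form a complete N-gram.
            if (match && i + n <= words.length) {
                ngrams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
            }
        }
        return ngrams;
    }
}
```

Counting how often each returned string occurs is then the same problem WordCount already solves.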

Follow along The N-grams implementation exists and is ready for your perusal. Grab it: if you use Git revision control: git clone git://git.qnan.org/pmw/hadoop-ngram to get the files with your browser, go to: http://www.qnan.org/~pmw/software/hadoop-ngram We used Project Gutenberg ebooks as input.

Follow along Start Hadoop bin/start-all.sh Grab the NGram code and build it: Type “ant” and all will be built Look at the README to see how to run it. Load some text files into your HDFS good source: http://www.gutenberg.org Run it yourself (or see me do it) before we proceed.

Can we just use WordCount? We have the WordCount example that does a similar thing. But there are differences: We don't want to count the number of times our startword appears; we want to capture the subsequent words too. A more subtle problem is that wordcount maps one line at a time. That's a problem if we want 3-grams with startword of “pillows” in the book containing this: The guests stayed in the guest bedroom; the pillows were delightfully soft and had a faint scent of mint. Still, WordCount is a good foundation for our code.

Steps we must perform. Read our text in paragraphs rather than in discrete lines: RecordReader, InputFormat. Develop the mapper and reducer classes: first mapper: find the startword, get the next N-1 words, and return them; first reducer: sum the number of occurrences of each N-gram; second mapper: no action; second reducer: discard N-grams that are too rare. Driver program.

A new RecordReader. Ours must implement RecordReader and contain certain functions: createKey(), createValue(), getPos(), getProgress(), next(). Hadoop offers a LineRecordReader but no support for paragraphs, so we'll need a ParagraphRecordReader. Use the Delegation pattern instead of extending LineRecordReader; we couldn't extend it because it has private elements. Create a new next() function.

    public synchronized boolean next(LongWritable key, Text value) throws IOException {
        Text linevalue = new Text();
        boolean appended, gotsomething;
        boolean retval;
        byte[] space = {' '};
        value.clear();
        gotsomething = false;
        do {
            appended = false;
            // lrr is the LineRecordReader we delegate to.
            retval = lrr.next(key, linevalue);
            if (retval) {
                // A non-empty line extends the paragraph; an empty line ends it.
                if (linevalue.toString().length() > 0) {
                    byte[] rawline = linevalue.getBytes();
                    int rawlinelen = linevalue.getLength();
                    value.append(rawline, 0, rawlinelen);
                    value.append(space, 0, 1);
                    appended = true;
                }
                gotsomething = true;
            }
        } while (appended);
        //System.out.println("ParagraphRecordReader::next() returns "+gotsomething+" after setting value to: ["+value.toString()+"]");
        return gotsomething;
    }
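The same delegation idea can be exercised outside Hadoop. In the sketch below a BufferedReader plays the role of the wrapped line reader, and next() joins consecutive non-empty lines into one paragraph, stopping at a blank line. This is an illustrative model with hypothetical names, not the ParagraphRecordReader from the project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Wraps a line-oriented reader and hands back whole paragraphs.
// Names are hypothetical.
class ParagraphReaderSketch {
    private final BufferedReader lines; // the delegate that reads single lines

    ParagraphReaderSketch(BufferedReader lines) {
        this.lines = lines;
    }

    // Returns the next paragraph (possibly empty), or null at end of input.
    String next() throws IOException {
        StringBuilder paragraph = new StringBuilder();
        boolean gotSomething = false;
        String line;
        while ((line = lines.readLine()) != null) {
            gotSomething = true;
            if (line.isEmpty()) {
                break; // a blank line terminates the paragraph
            }
            paragraph.append(line).append(' ');
        }
        return gotSomething ? paragraph.toString() : null;
    }

    // Convenience for the example: all non-empty paragraphs in a string.
    static List<String> readAll(String text) {
        List<String> paragraphs = new ArrayList<>();
        ParagraphReaderSketch reader = new ParagraphReaderSketch(new BufferedReader(new StringReader(text)));
        try {
            String p;
            while ((p = reader.next()) != null) {
                if (!p.trim().isEmpty()) {
                    paragraphs.add(p.trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return paragraphs;
    }
}
```

With this grouping, a phrase that straddles a line break, such as "the pillows were delightfully soft" in the earlier example, arrives at the mapper in one piece.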

A new InputFormat. Given to the JobTracker during execution. The getRecordReader method is the reason we need our own InputFormat: it must return our ParagraphRecordReader.

    public class ParagraphInputFormat extends FileInputFormat<LongWritable, Text>
            implements JobConfigurable {
        private CompressionCodecFactory compressionCodecs = null;

        public void configure(JobConf conf) {
            compressionCodecs = new CompressionCodecFactory(conf);
        }

        // Only files with no compression codec are treated as splittable.
        protected boolean isSplitable(FileSystem fs, Path file) {
            return compressionCodecs.getCodec(file) == null;
        }

        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit genericSplit, JobConf job, Reporter reporter)
                throws IOException {
            reporter.setStatus(genericSplit.toString());
            return new ParagraphRecordReader(job, (FileSplit) genericSplit);
        }
    }

First stage: “Find” Mapper Define the startword at startup Each time map is called we parse an entire paragraph and output matching N-Grams Tell Reporter how far done we are to track progress Output like WordCount output.collect(ngram, new IntWritable(1)); This last part is important... next slide explains.

Importance of “output.collect()”. Remember Hadoop's data type model: map: (K1, V1) → list(K2, V2). This means that for every single (K1, V1) tuple, the map stage can output zero, one, two, or any other number of tuples, and they don't have to match the input at all. Example: output.collect(ngram, new IntWritable(1)); output.collect(new Text("good-ol'-" + ngram), new IntWritable(0));

Find Mapper
Our mapper must have a configure() method. We can pass primitives through JobConf:
    public void configure(JobConf conf) {
        desiredPhrase = conf.get("mapper.desired-phrase");
        Nvalue = conf.getInt("mapper.N-value", 3);                       // defaults to 3-grams
        caseSensitive = conf.getBoolean("mapper.case-sensitive", false); // defaults to ignoring case
    }

“Find” Reducer Like WordCount example Sum all the numbers matching our N-Gram Output

Second stage: “Prune” Mapper Parse line from previous output and divide into Key/Value pairs “Prune” Reducer This way we can sort our elements by frequency If this N-Gram occurs fewer times than our minimum, trim it out
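The prune stage can be modeled in plain Java as well. The sketch below assumes that each line of the Find job's output has the form "n-gram, TAB, count" (the tab separator is an assumption about the text output layout) and keeps only the entries that meet the minimum frequency. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Parses "ngram<TAB>count" lines and drops N-grams that are too rare.
// Names are hypothetical; the tab separator is an assumption.
class PruneSketch {

    static Map<String, Integer> prune(List<String> lines, int minFreq) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // malformed line: no key/value separator
            }
            String ngram = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            if (count >= minFreq) {
                kept.put(ngram, count);
            }
        }
        return kept;
    }
}
```

Splitting on the last tab rather than the first keeps the parse correct even if an N-gram itself happens to contain a tab character.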

Piping data between M/R jobs. How does the “Find” map/reduce job pass its results to the “Prune” map/reduce job? I create a temporary file within HDFS. This temporary file is used as the output of Find and the input of Prune. At the end, I delete the temporary file.

Counters. The N-Gram generator has one programmer-defined counter: the number of partial/incomplete N-grams. These occur when a paragraph ends before we can read N-1 subsequent words. We can add as many counters as we want.
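An enum-keyed counter of this kind can be sketched as follows. The incr() method is loosely modeled on the incrCounter(Enum key, long amount) call mentioned earlier, but the class, the enum, and the scan() helper are all hypothetical and do not come from the project.

```java
import java.util.EnumMap;
import java.util.Map;

// Tallies complete and incomplete N-grams for one paragraph.
// All names are hypothetical.
class CounterSketch {
    enum NGramCounter { COMPLETE, INCOMPLETE }

    private final Map<NGramCounter, Long> counts = new EnumMap<>(NGramCounter.class);

    void incr(NGramCounter counter, long amount) {
        counts.merge(counter, amount, Long::sum);
    }

    long get(NGramCounter counter) {
        return counts.getOrDefault(counter, 0L);
    }

    // Counts start-word matches, separating those followed by at least
    // N-1 more words from those cut short by the end of the paragraph.
    static CounterSketch scan(String paragraph, String startWord, int n) {
        CounterSketch counters = new CounterSketch();
        String trimmed = paragraph.trim();
        if (trimmed.isEmpty()) {
            return counters;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (!words[i].equalsIgnoreCase(startWord)) {
                continue;
            }
            if (i + n <= words.length) {
                counters.incr(NGramCounter.COMPLETE, 1);
            } else {
                counters.incr(NGramCounter.INCOMPLETE, 1);
            }
        }
        return counters;
    }
}
```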

JobConf
We need to set everything up: two jobs executing in series (Find and Prune). The user inputs parameters: the starting N-gram word/phrase, the N-gram size, and the minimum frequency for pruning.
    JobConf ngram_find_conf = new JobConf(getConf(), NGram.class),
            ngram_prune_conf = new JobConf(getConf(), NGram.class);

Find JobConf
Now we can plug everything in. Also pass the input parameters, and point to our input and output files.
    ngram_find_conf.setJobName("ngram-find");
    ngram_find_conf.setInputFormat(ParagraphInputFormat.class);
    ngram_find_conf.setOutputKeyClass(Text.class);
    ngram_find_conf.setOutputValueClass(IntWritable.class);
    ngram_find_conf.setMapperClass(FindJob_MapClass.class);
    ngram_find_conf.setReducerClass(FindJob_ReduceClass.class);
    // Parameters read back by the mapper's configure() method.
    ngram_find_conf.set("mapper.desired-phrase", other_args.get(2));
    ngram_find_conf.setInt("mapper.N-value", Integer.parseInt(other_args.get(3)));
    ngram_find_conf.setBoolean("mapper.case-sensitive", caseSensitive);
    // Find writes to a temporary directory that Prune reads later.
    FileInputFormat.setInputPaths(ngram_find_conf, other_args.get(0));
    FileOutputFormat.setOutputPath(ngram_find_conf, tempDir);

Prune JobConf
Perform setup as before. We need to point our inputs to the outputs of the previous job.
    ngram_prune_conf.setJobName("ngram-prune");
    ngram_prune_conf.setInt("reducer.min-freq", min_freq);
    ngram_prune_conf.setOutputKeyClass(Text.class);
    ngram_prune_conf.setOutputValueClass(IntWritable.class);
    ngram_prune_conf.setMapperClass(PruneJob_MapClass.class);
    ngram_prune_conf.setReducerClass(PruneJob_ReduceClass.class);
    // Input is the temporary directory written by the Find job.
    FileInputFormat.setInputPaths(ngram_prune_conf, tempDir);
    FileOutputFormat.setOutputPath(ngram_prune_conf, new Path(other_args.get(1)));

Execute Jobs
Run as a blocking process with runJob. Batch processing is done in series.
    JobClient.runJob(ngram_find_conf);   // returns only after Find has finished
    JobClient.runJob(ngram_prune_conf);

Design questions to ask. From where will my input come? InputFormat. How is my input structured? RecordReader. (There are already several common InputFormats and RecordReaders. Don't reinvent the wheel.) Mapper and Reducer classes: do Key (WritableComparable) and Value (Writable) classes exist?

Design questions to ask Do I need to count anything while job is in progress? Where is my output going? Executor class What information do my map/reduce classes need? Must I block, waiting for job completion? Set FileFormat?
