
1 MapReduce Programming Yue-Shan Chang

2 [Figure: MapReduce execution overview]
The User Program (1) forks a Master and workers; the Master (2) assigns map and reduce tasks.
Map workers (3) read the input splits (split 0 … split 4) and (4) write intermediate results to local disk.
Reduce workers (5) remote-read the intermediate data and (6) write the output files (output file 0, output file 1).
Input files → Map phase → Intermediate files (on local disk) → Reduce phase → Output files

3 MapReduce Program Structure
class MapReduce {
  class Mapper ... {
    // map code
  }
  class Reducer ... {
    // reduce code
  }
  main() {
    // main-program configuration section
    JobConf conf = new JobConf(MR.class);
    // other configuration-parameter code
  }
}

4 package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

5   public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

6 MapReduce Job

7 Handled parts

8 Configuration of a Job
JobConf object
– JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution.
– JobConf typically specifies the Mapper, combiner (if any), Partitioner, Reducer, InputFormat and OutputFormat implementations to be used.
– Indicates the set of input files (setInputPaths(JobConf, Path...) / addInputPath(JobConf, Path), or setInputPaths(JobConf, String) / addInputPaths(JobConf, String)) and where the output files should be written (setOutputPath(Path)).
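Taken together, a typical configuration might look like the following sketch. MyJob, MyMapper, and MyReducer are hypothetical classes standing in for your own; HashPartitioner is spelled out even though it is already the framework default.

```java
// Inside a hypothetical driver's main():
JobConf conf = new JobConf(MyJob.class);

conf.setMapperClass(MyMapper.class);             // the Mapper implementation
conf.setCombinerClass(MyReducer.class);          // combiner (optional)
conf.setPartitionerClass(HashPartitioner.class); // explicit here, but also the default
conf.setReducerClass(MyReducer.class);           // the Reducer implementation
conf.setInputFormat(TextInputFormat.class);      // InputFormat
conf.setOutputFormat(TextOutputFormat.class);    // OutputFormat

// Input files and output directory ("in" and "out" are made-up paths):
FileInputFormat.setInputPaths(conf, new Path("in"));
FileOutputFormat.setOutputPath(conf, new Path("out"));
```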

9 Configuration of a Job

10 Input Splitting
An input split will normally be a contiguous group of records from a single input file.
– If the number of requested map tasks is larger than the number of files, or if the individual files are larger than the suggested fragment size, there may be multiple input splits constructed from each input file.
The user has considerable control over the number of input splits.

11 Specifying Input Formats
The Hadoop framework provides a large variety of input formats:
– KeyValueTextInputFormat: Key/value pairs, one per line.
– TextInputFormat: The key is the byte offset of the line, and the value is the line itself.
– NLineInputFormat: Similar to KeyValueTextInputFormat, but the splits are based on N lines of input rather than Y bytes of input.
– MultiFileInputFormat: An abstract class that lets the user implement an input format that aggregates multiple files into one split.
– SequenceFileInputFormat: The input file is a Hadoop sequence file, containing serialized key/value pairs.
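Selecting one of these formats is a single JobConf call. A sketch using KeyValueTextInputFormat; the separator property name below is the one used by older Hadoop releases, so check your version:

```java
// Parse each input line into a key and a value instead of offset/line:
conf.setInputFormat(KeyValueTextInputFormat.class);

// The key/value separator defaults to a tab; a comma is assumed here.
conf.set("key.value.separator.in.input.line", ",");
```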

12 Specifying Input Formats

13 Setting the Output Parameters The framework requires that the output parameters be configured, even if the job will not produce any output. The framework will collect the output from the specified tasks and place them into the configured output directory.
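A minimal sketch of the required output parameters (the output path is a made-up example):

```java
// Required even if the job will produce no records:
conf.setOutputKeyClass(Text.class);           // output key type
conf.setOutputValueClass(IntWritable.class);  // output value type
conf.setOutputFormat(TextOutputFormat.class); // output file format

// The configured output directory the framework writes into:
FileOutputFormat.setOutputPath(conf, new Path("out"));
```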

14 Setting the Output Parameters

15 A Simple Map Function: IdentityMapper

16 A Simple Reduce Function: IdentityReducer
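The two identity classes can be combined into a pass-through job, a common way to use the framework purely for its sort-and-group machinery; a sketch:

```java
// IdentityMapper and IdentityReducer (org.apache.hadoop.mapred.lib)
// forward records unchanged, so the net effect of the job is simply
// to sort and group the input by key.
conf.setMapperClass(IdentityMapper.class);
conf.setReducerClass(IdentityReducer.class);
```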


18 Configuring the Reduce Phase
The user must supply the framework with five pieces of information:
– The number of reduce tasks; if zero, no reduce phase is run.
– The class supplying the reduce method.
– The input key and value types for the reduce task; by default, the same as the reduce output.
– The output key and value types for the reduce task.
– The output file type for the reduce task output.
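Each of the five pieces maps onto one JobConf call; a sketch, where MyReducer and the chosen types are illustrative assumptions:

```java
conf.setNumReduceTasks(2);                      // 1. number of reduce tasks (0 = no reduce phase)
conf.setReducerClass(MyReducer.class);          // 2. class supplying reduce()
conf.setMapOutputKeyClass(Text.class);          // 3. reduce input key type...
conf.setMapOutputValueClass(IntWritable.class); //    ...and value type (default: same as output)
conf.setOutputKeyClass(Text.class);             // 4. reduce output key/value types
conf.setOutputValueClass(IntWritable.class);
conf.setOutputFormat(TextOutputFormat.class);   // 5. output file type
```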

19 How Many Maps?
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per node; it is best if the maps take at least a minute to execute.
setNumMapTasks(int) can be used to give the framework a hint.

20 Reducer
Reducer reduces a set of intermediate values which share a key to a smaller set of values.
Reducer has 3 primary phases: shuffle, sort and reduce.
Shuffle
– Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
Sort
– The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.
– The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.

21 How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (no. of nodes * mapred.tasktracker.reduce.tasks.maximum).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish.
With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.
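The two rules of thumb work out as follows; the node and slot counts are made-up examples:

```java
public class ReduceCount {
    // Rule of thumb from the slide:
    // reduces = factor * (no. of nodes * mapred.tasktracker.reduce.tasks.maximum)
    static int numReduces(double factor, int nodes, int slotsPerNode) {
        return (int) (factor * nodes * slotsPerNode);
    }

    public static void main(String[] args) {
        // A hypothetical 10-node cluster with 2 reduce slots per node:
        System.out.println(numReduces(0.95, 10, 2)); // 19: all launch in one wave
        System.out.println(numReduces(1.75, 10, 2)); // 35: run in roughly two waves
    }
}
```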

22 How Many Reduces?
Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
Reducer NONE
– It is legal to set the number of reduce-tasks to zero if no reduction is desired.
– In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path).
– The framework does not sort the map-outputs before writing them out to the FileSystem.

23 Reporter Reporter is a facility for Map/Reduce applications to report progress, set application-level status messages and update Counters. Reporter Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive.
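A sketch of a map() body that uses its Reporter; the counter group and counter names are made up for illustration:

```java
// Inside a long-running Mapper: the Reporter keeps the task from being
// declared dead and maintains application-level counters.
public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
    reporter.setStatus("processing offset " + key);    // status message
    reporter.incrCounter("MyCounters", "records", 1);  // custom counter
    reporter.progress();                               // "still alive" heartbeat
    // ... actual record processing goes here ...
}
```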

24 JobTracker
JobTracker is the central location for submitting and tracking MR jobs in a network environment.
JobClient is the primary interface by which a user job interacts with the JobTracker.
– It provides facilities to submit jobs, track their progress, access component-tasks' reports and logs, get the Map/Reduce cluster's status information, and so on.

25 Job Submission and Monitoring
The job submission process involves:
– Checking the input and output specifications of the job.
– Computing the InputSplit values for the job.
– Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
– Copying the job's jar and configuration to the Map/Reduce system directory on the FileSystem.
– Submitting the job to the JobTracker and optionally monitoring its status.
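For the optional monitoring, a sketch that submits asynchronously through JobClient instead of the blocking JobClient.runJob(conf):

```java
// In a driver method declared to throw Exception:
JobClient client = new JobClient(conf);
RunningJob job = client.submitJob(conf);   // returns immediately
while (!job.isComplete()) {
    // Poll overall progress of the two phases:
    System.out.printf("map %.0f%% reduce %.0f%%%n",
            job.mapProgress() * 100, job.reduceProgress() * 100);
    Thread.sleep(5000);
}
System.out.println(job.isSuccessful() ? "done" : "failed");
```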

26 MapReduce Details for Multimachine Clusters

27 Introduction
Why?
– datasets that can't fit on a single machine,
– time constraints that are impossible to satisfy with a small number of machines,
– the need to rapidly scale the computing power applied to a problem due to varying input set sizes.

28 Requirements for Successful MapReduce Jobs
Mapper
– Ingests and processes the input records, sending forward the records that can be passed to the reduce task or to the final output directly.
Reducer
– Accepts the key and value groups that passed through the mapper, and generates the final output.
The job must be configured with the location and type of the input data, the mapper class to use, the number of reduce tasks required, and the reducer class and I/O types.

29 Requirements for Successful MapReduce Jobs The TaskTracker service will actually run your map and reduce tasks, and the JobTracker service will distribute the tasks and their input split to the various trackers. The cluster must be configured with the nodes that will run the TaskTrackers, and with the number of TaskTrackers to run per node.

30 Requirements for Successful MapReduce Jobs
There are three levels of configuration to address when setting up MapReduce on your cluster:
– configure the machines,
– the Hadoop MapReduce framework,
– the jobs themselves.

31 Launching MapReduce Jobs
Launch the preceding example from the command line:
> bin/hadoop [-libjars jar1.jar,jar2.jar,jar3.jar] jar myjar.jar MyClass


33 MapReduce-Specific Configuration for Each Machine in a Cluster
Install any standard JARs that your application uses.
It is probable that your applications will have a runtime environment that is deployed from a configuration management application, which you will also need to deploy to each machine.
The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks.
The conf/slaves file should have the set of machines to serve as TaskTracker nodes.

34 DistributedCache
DistributedCache distributes application-specific, large, read-only files efficiently.
It is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications.
The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
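A sketch of both sides of the cache; the file path is hypothetical:

```java
// At job-setup time, register a file to be distributed to every node:
DistributedCache.addCacheFile(new URI("/data/lookup.dat"), conf);

// Inside a task (e.g. in configure()), locate the local copies:
Path[] cached = DistributedCache.getLocalCacheFiles(conf);
```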

35 Adding Resources to the Task Classpath
Methods
– JobConf.setJar(String jar): Sets the user JAR for the MapReduce job.
– JobConf.setJarByClass(Class cls): Determines the JAR that contains the class cls and calls JobConf.setJar(jar) with that JAR.
– DistributedCache.addArchiveToClassPath(Path archive, Configuration conf): Adds an archive path to the current set of classpath entries.

36 Configuring the Hadoop Core Cluster Information
Setting the Default File System URI
You can also use the JobConf object to set the default file system:
– conf.set("fs.default.name", "hdfs://NamenodeHostname:PORT");

37 Configuring the Hadoop Core Cluster Information Setting the JobTracker Location use the JobConf object to set the JobTracker information: – conf.set( "mapred.job.tracker", "JobtrackerHostname:PORT");
