
1 Lecture 6 Books: “Hadoop in Action” by Chuck Lam, “An Introduction to Parallel Programming” by Peter Pacheco

2 Writing basic MapReduce programs (Chapter 4)
- Patent data as an example data set to process with Hadoop
- Skeleton of a MapReduce program
- Basic MapReduce programs to count statistics
- Hadoop's Streaming API for writing MapReduce programs using scripting languages
- Combiner to improve performance

3 Introduction The MapReduce programming model is unlike most programming models you may have learned. It’ll take some time and practice to gain familiarity. To help develop your proficiency, we go through many example programs in the next couple chapters. These examples will illustrate various MapReduce programming techniques. By applying MapReduce in multiple ways you’ll start to develop an intuition and a habit of “MapReduce thinking.”

4 Getting the patent data set To do anything meaningful with Hadoop we need data. Many of our examples will use two patent data sets, both of which are available from the National Bureau of Economic Research (NBER) at http://www.nber.org/patents/. The data sets were originally compiled for the paper "The NBER Patent Citation Data File: Lessons, Insights and Methodological Tools." We use the citation data set cite75_99.txt and the patent description data set apat63_99.txt.

5 The patent citation data The patent citation data set contains citations from U.S. patents issued between 1975 and 1999. It has more than 16 million rows and the first few lines resemble the following:

    "CITING","CITED"
    3858241,956203
    3858241,1324234
    3858241,3398406
    3858241,3557384
    3858241,3634889
    3858242,1515701
    3858242,3319261
    3858242,3668705
    3858242,3707004
    ...

6 The patent description data This data set has more than 2.9 million records. As in many real-world data sets, it has many missing values.

    "PATENT","GYEAR","GDATE","APPYEAR","COUNTRY","POSTATE","ASSIGNEE","ASSCODE","CLAIMS","NCLASS","CAT","SUBCAT","CMADE","CRECEIVE","RATIOCIT","GENERAL","ORIGINAL","FWDAPLAG","BCKGTLAG","SELFCTUB","SELFCTLB","SECDUPBD","SECDLWBD"
    3070801,1963,1096,,"BE","",,1,,269,6,69,,1,,0,,,,,,,
    3070802,1963,1096,,"US","TX",,1,,2,6,63,,0,,,,,,,,,
    3070803,1963,1096,,"US","IL",,1,,2,6,63,,9,,0.3704,,,,,,,
    3070804,1963,1096,,"US","OH",,1,,2,6,63,,3,,0.6667,,,,,,,
    3070805,1963,1096,,"US","CA",,1,,2,6,63,,1,,0,,,,,,,

7 A partial view of the patent citation data set as a graph. Each patent is shown as a vertex (node), and each citation is a directed edge (arrow).

8 Constructing the basic template of a MapReduce program Most MapReduce programs are short and are written as variations on a template. When writing a new MapReduce program, you generally take an existing MapReduce program and modify it until it does what you want. In this section, we write our first MapReduce program and explain its different parts. This program can serve as a template for future MapReduce programs. Our first program will take the patent citation data and invert it. For each patent, we want to find and group the patents that cite it.

9 Template for MapReduce

10
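Slides 9 and 10 show the template only as a code-listing image, which does not survive in this transcript. Below is a minimal sketch of such a template under the old org.apache.hadoop.mapred API, inverting the citation data as described on slide 8. The class name MyJob matches the convention used on the next slide, but the listing itself is a reconstruction for illustration, not the book's verbatim code.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJob extends Configured implements Tool {

        // Mapper: for each "citing,cited" record, emit (cited, citing),
        // so the output is keyed by the cited patent.
        public static class MapClass extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {

            public void map(Text key, Text value,
                            OutputCollector<Text, Text> output,
                            Reporter reporter) throws IOException {
                output.collect(value, key);
            }
        }

        // Reducer: concatenate all citing patents of one cited patent.
        public static class Reduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

            public void reduce(Text key, Iterator<Text> values,
                               OutputCollector<Text, Text> output,
                               Reporter reporter) throws IOException {
                StringBuilder csv = new StringBuilder();
                while (values.hasNext()) {
                    if (csv.length() > 0) csv.append(",");
                    csv.append(values.next().toString());
                }
                output.collect(key, new Text(csv.toString()));
            }
        }

        public int run(String[] args) throws Exception {
            JobConf job = new JobConf(getConf(), MyJob.class);

            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setJobName("MyJob");
            job.setMapperClass(MapClass.class);
            job.setReducerClass(Reduce.class);

            // KeyValueTextInputFormat splits each input line into a key and
            // a value at the separator character, here the comma.
            job.setInputFormat(KeyValueTextInputFormat.class);
            job.setOutputFormat(TextOutputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.set("key.value.separator.in.input.line", ",");

            JobClient.runJob(job);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new MyJob(), args);
            System.exit(res);
        }
    }

Packaged into a jar, it would be run with something like bin/hadoop jar myjob.jar MyJob input/cite75_99.txt output, where myjob.jar is a hypothetical jar name.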

11 Explanation Our convention is that a single class, called MyJob in this case, completely defines each MapReduce job. Hadoop requires the Mapper and the Reducer to be their own static classes. These classes are quite small, and our template includes them as inner classes to the MyJob class. The advantage is that everything fits in one file, simplifying code management. But keep in mind that these inner classes are independent and don't interact much with the MyJob class. Various nodes with different JVMs clone and run the Mapper and the Reducer during job execution, whereas the rest of the job class is executed only at the client machine.

12 Skeleton The signatures for the Mapper class and the Reducer class are

    public static class MapClass extends MapReduceBase
        implements Mapper<K1, V1, K2, V2> {

        public void map(K1 key, V1 value,
                        OutputCollector<K2, V2> output,
                        Reporter reporter) throws IOException {
        }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<K2, V2, K3, V3> {

        public void reduce(K2 key, Iterator<V2> values,
                           OutputCollector<K3, V3> output,
                           Reporter reporter) throws IOException {
        }
    }

The key/value pairs generated by the mapper are output via the collect() method of the OutputCollector object. Somewhere in your map() method you need to call

    output.collect((K2) k, (V2) v);

13 Counting things Much of what the layperson thinks of as statistics is counting, and many basic Hadoop jobs involve counting. We've already seen the word count example in chapter 1. For the patent citation data, we may want the number of citations a patent has received. This too is counting. The desired output would look like this:

    1        2
    10000    1
    100000   1
    1000006  1
    1000007  1
    1000011  1
    1000017  1
    1000026  1
    1000033  2
    1000043  1
    1000044  2

14 Count Citations We already have a program for getting the inverted citation index. We can modify that program to output the count instead of the list of citing patents. We need the modification only in the Reducer. If we choose to output the count as an IntWritable, we need to specify IntWritable in three places in the Reducer code, all corresponding to the output value type we called V3 in our previous notation.
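A minimal sketch of such a modified Reducer, written as a drop-in replacement for the Reduce inner class of the template sketched above; it illustrates the change the slide describes rather than reproducing the book's listing. It also needs an import of org.apache.hadoop.io.IntWritable, and the driver's setOutputValueClass() would change to IntWritable.class.

    // Replaces the Reduce inner class of the MyJob sketch; counts the citing
    // patents instead of concatenating them.
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, Text, Text, IntWritable> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int count = 0;
            while (values.hasNext()) {
                values.next();
                count++;
            }
            output.collect(key, new IntWritable(count));
        }
    }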

15 Using static variables The map() method has one extra line for setting citationCount, which handles the type conversion. The reason for defining citationCount and uno in the class rather than inside the method is purely one of efficiency. The map() method will be called as many times as there are records (in a split, for each JVM). Reducing the number of objects created inside the map() method can increase performance and reduce garbage collection. As we pass citationCount and uno to output.collect(), we're relying on the output.collect() method's contract not to modify those two objects.
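The mapper this slide refers to is not reproduced in the transcript; the sketch below illustrates the pattern it describes, with uno and citationCount held as class members so they are created once per task rather than once per record. The field names follow the slide; the rest is a reconstruction that assumes the input key and value are both Text, with the value holding a numeric count, and that org.apache.hadoop.io.IntWritable is imported.

    public static class MapClass extends MapReduceBase
        implements Mapper<Text, Text, IntWritable, IntWritable> {

        // Allocated once per task and reused for every record, instead of
        // creating new Writable objects inside map().
        private final static IntWritable uno = new IntWritable(1);
        private IntWritable citationCount = new IntWritable();

        public void map(Text key, Text value,
                        OutputCollector<IntWritable, IntWritable> output,
                        Reporter reporter) throws IOException {
            // The "extra line": convert the textual count into an IntWritable
            // before emitting it as the output key.
            citationCount.set(Integer.parseInt(value.toString()));
            output.collect(citationCount, uno);
        }
    }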

16 Adapting for Hadoop’s API changes One of the main design goals driving toward Hadoop’s major 1.0 release is a stable and extensible MapReduce API. As of this writing, version 0.20 is the latest release and is considered a bridge between the older API (that we use throughout this book) and this upcoming stable API. The 0.20 release supports the future API while maintaining backward-compatibility with the old one by marking it as deprecated. Future releases after 0.20 will stop supporting the older API.

17 Mapper and Reducer Recall the signatures under the original API:

    public static class MapClass extends MapReduceBase
        implements Mapper<K1, V1, K2, V2> {

        public void map(K1 key, V1 value,
                        OutputCollector<K2, V2> output,
                        Reporter reporter) throws IOException {
        }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<K2, V2, K3, V3> {

        public void reduce(K2 key, Iterator<V2> values,
                           OutputCollector<K3, V3> output,
                           Reporter reporter) throws IOException {
        }
    }

18 New API The new API has simplified them somewhat:

    public static class MapClass extends Mapper<K1, V1, K2, V2> {

        public void map(K1 key, V1 value, Context context)
                throws IOException, InterruptedException {
        }
    }

    public static class Reduce extends Reducer<K2, V2, K3, V3> {

        public void reduce(K2 key, Iterable<V2> values, Context context)
                throws IOException, InterruptedException {
        }
    }
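As an illustration of how these signatures are used (not a listing from the book), here is a sketch of the citation-inverting Mapper and Reducer expressed against the new org.apache.hadoop.mapreduce API; the class name InvertCitationsNewApi is made up for the example, and context.write() takes the place of output.collect().

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class InvertCitationsNewApi {

        public static class MapClass extends Mapper<Text, Text, Text, Text> {
            @Override
            public void map(Text key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Emit (cited, citing): swap key and value.
                context.write(value, key);
            }
        }

        public static class Reduce extends Reducer<Text, Text, Text, Text> {
            @Override
            public void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                // Concatenate all citing patents into one comma-separated value.
                StringBuilder csv = new StringBuilder();
                for (Text v : values) {
                    if (csv.length() > 0) {
                        csv.append(",");
                    }
                    csv.append(v.toString());
                }
                context.write(key, new Text(csv.toString()));
            }
        }
    }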

19 Streaming in Hadoop We have been using Java to write all our Hadoop programs. Hadoop supports other languages via a generic API called Streaming. In practice, Streaming is most useful for writing simple, short MapReduce programs that are more rapidly developed in a scripting language that can take advantage of non-Java libraries. Hadoop Streaming also enables existing Unix commands to be used as mappers and reducers. If you're familiar with using Unix commands, such as wc, cut, or uniq, for data processing, you can apply them to large data sets using Hadoop Streaming. The overall data flow in Hadoop Streaming, expressed in pseudo-code using Unix command-line notation, is

    cat [input_file] | [mapper] | sort | [reducer] > [output_file]

20 Streaming with Unix commands In the first example, let's get a list of cited patents in cite75_99.txt.

    bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar
        ➥ -input input/cite75_99.txt
        ➥ -output output
        ➥ -mapper 'cut -f 2 -d,'
        ➥ -reducer 'uniq'

That's it! It's a one-line command. Let's see what each part of the command does.

21 Example After getting the list of cited patents, we may want to know how many there are. Again we can use Streaming to quickly get a count, using the Unix command wc -l.

    bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar
        ➥ -input output
        ➥ -output output_a
        ➥ -mapper 'wc -l'
        ➥ -D mapred.reduce.tasks=0

Here we use wc -l as the mapper to count the number of records in each split. Hadoop Streaming (since version 0.19.0) supports the GenericOptionsParser. The -D argument is used for specifying configuration properties; setting mapred.reduce.tasks=0 makes this a map-only job, so each mapper's output is written directly to the output directory.

22 Streaming with scripts We can use any executable script that processes a line-oriented data stream from STDIN and outputs to STDOUT with Hadoop Streaming. For example, the Python script in listing 4.4 (RandomSample.py) randomly samples data from STDIN. For those who don't know Python, the program has a for loop that reads STDIN one line at a time.

    #!/usr/bin/env python
    import sys, random

    for line in sys.stdin:
        if (random.randint(1, 100) <= int(sys.argv[1])):
            print line.strip()

23 Running scripts Running RandomSample.py using Streaming is like running Unix commands using Streaming, the difference being that Unix commands are already available on all nodes in the cluster, whereas RandomSample.py is not. Hadoop Streaming supports a -file option to package your executable file as part of the job submission. Our command to execute RandomSample.py is

    bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar
        ➥ -input input/cite75_99.txt
        ➥ -output output
        ➥ -mapper 'RandomSample.py 10'
        ➥ -file RandomSample.py
        ➥ -D mapred.reduce.tasks=1

24 AttributeMax.py: Python script to find the maximum value of an attribute

    #!/usr/bin/env python
    import sys

    index = int(sys.argv[1])
    max = 0
    for line in sys.stdin:
        fields = line.strip().split(",")
        if fields[index].isdigit():
            val = int(fields[index])
            if (val > max):
                max = val
    else:
        # The else clause of the for loop runs after the input is exhausted,
        # printing the maximum this mapper has seen.
        print max

25 Running script Given the parsimonious output of the mapper, we can use the default IdentityReducer to record the (sorted) output of the mappers.

    bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar
        ➥ -input input/apat63_99.txt
        ➥ -output output
        ➥ -mapper 'AttributeMax.py 8'
        ➥ -file playground/AttributeMax.py
        ➥ -D mapred.reduce.tasks=1

The mapper is 'AttributeMax.py 8'. It outputs the maximum of the ninth column (index 8) of its split.

26 Streaming with key/value pairs At this point you may wonder what happened to the key/value pair way of encoding records. Our discussion of Streaming so far has treated each record as an atomic unit rather than as a key and a value. This topic is left for self-study!

27 Streaming with the Aggregate package Hadoop includes a library package called Aggregate that simplifies obtaining aggregate statistics of a data set. This package can simplify the writing of Java statistics collectors, especially when used with Streaming. The Aggregate package under Streaming functions as a reducer that computes aggregate statistics. You only have to provide a mapper that processes records and sends out a specially formatted output. Each line of the mapper's output looks like

    function:key\tvalue

The output string starts with the name of a value aggregator function (from the set of predefined functions available in the Aggregate package), followed by a colon and then a key and a value separated by a tab.

28 Aggregator functions

29 Example For example, we may want to know whether more countries are participating in the U.S. patent system over time. We can examine this by looking at the number of countries with patents granted each year. We use a straightforward wrapper of UniqValueCount in listing 4.10 and apply it to the year and country columns of apat63_99.txt (column indexes 1 and 4, respectively).

    bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar
        ➥ -input input/apat63_99.txt
        ➥ -output output
        ➥ -file UniqueCount.py
        ➥ -mapper 'UniqueCount.py 1 4'
        ➥ -reducer aggregate

30 UniqueCount.py: a wrapper around the UniqValueCount function

    #!/usr/bin/env python
    import sys

    index1 = int(sys.argv[1])
    index2 = int(sys.argv[2])

    for line in sys.stdin:
        fields = line.split(",")
        print "UniqValueCount:" + fields[index1] + "\t" + fields[index2]

31 Self Study!
- 4.6 Improving performance with combiners: see how combiners can be used (a minimal sketch follows below).
- 4.7 Exercising what you've learned: try to implement some of the exercises on this topic. Some of them will help you solve Task 2.
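As a pointer for the combiner item above: in the old API a combiner is registered in the driver with setCombinerClass(), and when the reduce operation is commutative and associative with matching input and output types (a sum, for instance), an existing reducer class can often serve as the combiner. The fragment below is a hypothetical illustration of where that call goes, using Hadoop's stock LongSumReducer as an example of such a class; it is not code from the book.

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.LongSumReducer;

    public class CombinerExample {
        // Called on a JobConf that already has its mapper, input format and
        // input/output paths configured, as in the MyJob.run() sketch earlier.
        public static void configure(JobConf job) {
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            job.setCombinerClass(LongSumReducer.class); // local pre-aggregation on the map side
            job.setReducerClass(LongSumReducer.class);  // final aggregation in the reduce phase
        }
    }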

32 THANK YOU THE END

