1 Lecture 17 (Hadoop: Getting Started)
CSE 482: Big Data Analysis

2 Hadoop Programming Model
To write a Hadoop program, we need to decompose the computational problem into a set of map and reduce tasks.
Map task: performs local computation at each node.
Reduce task: aggregates the partial results generated by the mapper nodes to obtain the "global" solution.
Input and output of map and reduce tasks are key-value pairs.

3 Hadoop Programming Model
Mapper: transforms its block of input data into a form that can be aggregated by the reducers.
  Input: (key1, value1) pairs
  Output: (key2, value2) pairs
Reducer: aggregates the list of values associated with each key and writes it to the output.
  Input: (key2, list of value2) pairs
    key2 is the same type as the output key of the map task
    Input records are sorted by key2
  Output: (key3, value3) pairs

4 Hadoop Programming Model
When your client program submits a job to Hadoop:
Hadoop divides the input data into smaller fixed-size blocks (default size: 64 MB) called input splits.
Hadoop creates a map task for each input split.
Hadoop does its best to run each map task on the node where its input split resides on HDFS, but if that node is already busy with another task, the jobtracker will look for a different node to execute the map task.
Each input split contains a set of records, and the mapper processes one record at a time.
By default, each record corresponds to a line in the input file (you can override this with your own InputFormat for mappers; see the sketch below).
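Hadoop also ships with alternative InputFormats. A minimal sketch of swapping one in at job-configuration time (only the setInputFormatClass call is the point here; the rest mirrors the main-program template later in this lecture):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "input format example");
        // Replace the default TextInputFormat (byte offset -> line) with
        // KeyValueTextInputFormat, which splits each line at the first tab
        // into a (Text, Text) key-value pair for the mapper.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}

Writing your own InputFormat subclass and passing it to setInputFormatClass works the same way.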

5 Example 1: Distributed Word Count
Count the frequency of each term in a large collection of documents.
Input files (example): "This is a cat!", "cat is ok", "walk the dog"
Output of Hadoop program: (a, 6) (and, 2) (be, 3) (cat, 4) (dog, 2)

6 Distributed Word Count Example
Mapper input (byte offset, line): (0, This is a cat!) (14, cat is ok) (24, walk the dog)
Mapper output (key-value pairs): (this, 1) (is, 1) (a, 1) (cat, 1) ...
Sorting, partitioning, shuffling
Reducer input (key, list of values pairs): (a, [1, 1, 1, 1, 1, 1]) (and, [1, 1]) (be, [1, 1, 1]) (cat, [1, 1, 1, 1]) (dog, [1, 1])
Reducer output (written to part-r-00000): (a, 6) (and, 2) (be, 3) (cat, 4) (dog, 2)

7 Distributed Word Count Example
Mapper input: a set of key-value pairs
  Key: byte offset within the input file
  Value: a line of the input file
Mapper function: parses each line into a set of tokens (terms)
Mapper output: a set of key-value pairs
  Key: word/term
  Value: 1
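A minimal sketch of this mapper in Java (the class name TokenizerMapper is illustrative; as in the basic template later in this lecture, the class is assumed to be nested inside the main class):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Called once per record: key = byte offset, value = one line of input
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);  // emit (term, 1)
        }
    }
}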

8 Distributed Word Count Example
Reducer input: a set of key-value pairs
  Key: word/term (same as the key of the mapper output)
  Value: a list of counts [1, 1, 1, ...] collected from mapper outputs with the same key
Reducer function: sums up the list of counts for each word/term
Reducer output: a set of key-value pairs
  Key: (sorted) word/term
  Value: frequency of the word
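A matching sketch of the reducer (again with an illustrative class name, nested inside the main class):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    // Called once per key: values holds every count emitted for this term
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();  // add up the 1's
        }
        result.set(sum);
        context.write(key, result);  // emit (term, frequency)
    }
}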

9 Example 2: Weather Data
The National Climatic Data Center provides hourly weather data from stations around the world. Data for each station is stored in a separate file.
Each line has a fixed column width and contains information such as the weather station identifier, latitude, longitude, elevation, observation date and time, as well as measurements of air temperature, visibility distance, atmospheric pressure, etc., for the given station. The two fields we need are the year and the temperature.
Our objective is to find the maximum global temperature for each year.

10 Weather Data Example
Example: weather data from the National Climatic Data Center (NCDC)
The lines of each file are presented to the map function as key-value pairs. Default: the keys are the byte offsets of the lines within the file.

11 Weather Data Example
The mapper processes each record (line) separately and writes (emits) its output as key-value pairs; here, the key is the year and the value is the temperature parsed from the record.
The mapper's output is processed by the Hadoop framework before being sent to the reducer: the framework sorts and groups the key-value pairs by key and shuffles them to the reducers, so the reduce function sees (year, list of temperatures) pairs as its input.

12 Calculating Maximum Temperature
The pipeline is the same as in the word count example:
Mapper input: (byte offset, line of the weather file) key-value pairs
Mapper output: (year, temperature) key-value pairs
Sorting, partitioning, shuffling: collects all temperatures observed for the same year
Reducer input: (year, list of temperatures) pairs
Reducer output (part-r-00000): (year, maximum temperature) pairs
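A sketch of the mapper and reducer for this example, nested inside the main class as in the templates that follow (the substring positions used to pull out the year and temperature are illustrative placeholders, not the exact NCDC column layout):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public static class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // hypothetical fixed-width offsets; consult the NCDC format for the real ones
        String year = line.substring(15, 19);
        int temperature = Integer.parseInt(line.substring(87, 92).trim());
        context.write(new Text(year), new IntWritable(temperature));  // emit (year, temp)
    }
}

public static class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable val : values) {
            max = Math.max(max, val.get());  // keep the largest temperature for this year
        }
        context.write(key, new IntWritable(max));  // emit (year, max temperature)
    }
}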

13 Another Example: Sorting
Mapper input: (0, This) (4, is) (6, a) (7, cat) ...
Mapper output: (this, 1) (is, 1) (a, 1) (cat, 1) ...
Sorting, partitioning, shuffling
Reducer input: (a, [1, 1, 1, 1, 1, 1]) (and, [1, 1]) (be, [1, 1, 1]) (cat, [1, 1, 1, 1]) (dog, [1, 1])
Reducer output (part-r-00000): (a, -) (and, -) (be, -) (cat, -) (dog, -)
You should use only one reducer: each reducer's output file is sorted by key, so a single reducer yields one globally sorted output file.

14 Combiners If you have a billion records, the mappers may generate more than a billion key-value pairs Combiners help by performing a local reduce on each key before sending the output to the reducers Combiner function performs the same transformation to the intermediate results as the reduce function Note: The combiner function may be applied zero, one, or more times Not all reduce operations can be refactored into combiners: max ([list1, list2]) = max( [max(list1), max(list2)] ) avg ([list1, list2])  avg( [avg(list1), avg(list2)] )

15 Combiner (A local reducer)
Mapper input: (0, This is a cat!) (14, cat is ok) (24, walk the dog)
Mapper output: (this, 1) (is, 1) (a, 1) (cat, 1) ...
Combiner (runs reduce() on each mapper's local output): (this, 1) (is, 2) (a, 1) (cat, 2) (ok, 1) ...
Sorting, partitioning, shuffling
Reducer input: (a, [1, 1, 1, 1, 1, 1]) (and, [1, 1]) (be, [1, 1, 1]) (cat, [1, 1, 1, 1]) (dog, [1, 1])
Reducer output (part-r-00000): (a, 6) (and, 2) (be, 3) (cat, 4) (dog, 2)
The combiner performs a local reduce before sending the output to a reducer.

16 Sorting, partitioning, shuffling
Partitioner: when there are multiple reducers, we need to determine which reducer each key-value pair is sent to.
The default partitioning approach is to hash the key.
Example with two reducers:
  Reducer 0 input: (a, [1, 1, 1, 1, 1, 1]) (be, [1, 1, 1]) (cat, [1, 1, 1, 1]) (frog, [1, 1])
  Reducer 0 output (part-r-00000): (a, 6) (be, 3) (cat, 4) (frog, 2)
  Reducer 1 input: (and, [1, 1, 1, 1]) (big, [1, 1, 1]) (can, [1, 1, 1, 1, 1]) (dog, [1, 1, 1])
  Reducer 1 output (part-r-00001): (and, 4) (big, 3) (can, 5) (dog, 3)

17 In-Class Exercise Given a data set of stock prices, where each line corresponds to a trading day and contains prices of different stocks, calculate the correlation between every pair of stocks What are key-value pairs for mappers and reducers? Given a data set of user profile (gender, marital status, state, occupation, etc) and a class attribute (buy/not buy), calculate the entropy of each attribute with respect to the class

18 Hadoop Programming
The Hadoop framework is written in Java, and there are built-in Java libraries to support it.
To use the libraries, the Java program should import them:
  import org.apache.hadoop.*;
But we can also use other programming languages such as Python, C++, etc. to interact with Hadoop; this will be discussed later, in the lecture on Hadoop streaming.

19 Java Programming
import java.io.*;  // specify the Java libraries used

public class myExample {  // name of the main class
    public static void main(String[] args) throws Exception {
        int i = 1, j = 2;
        System.out.println("Hello World!");
        System.out.println(i + "+" + j + "=" + (i + j));
    }
}

Save the source file as myExample.java (same filename as the name of the class).
To compile: javac myExample.java
To execute: java myExample

20 Basic Template for Hadoop Program
import org.apache.hadoop.*;  // specify the Hadoop libraries used
import java.util.*;  // specify the Java libraries used

public class mainClass {  // name of the main class
    public static class MapperClass extends Mapper<Types> { ... }
    public static class ReducerClass extends Reducer<Types> { ... }
    public static void main(String[] args) throws Exception { ... }
}

You can also define the Mapper and Reducer outside the main class in separate files (you'll need to remove the keyword static from the class declarations).

21 Template for Main Program
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();                // specifies job configuration
    Job job = Job.getInstance(conf, "program name");         // create a job object
    job.setJarByClass(mainClass.class);                      // class containing the main program
    job.setMapperClass(MapperClass.class);                   // set name of mapper class
    job.setReducerClass(ReducerClass.class);                 // set name of reducer class
    job.setMapOutputKeyClass(<ClassType>);                   // type of output key of mapper
    job.setMapOutputValueClass(<ClassType>);                 // type of output value of mapper
    job.setOutputKeyClass(<ClassType>);                      // type of output key of reducer
    job.setOutputValueClass(<ClassType>);                    // type of output value of reducer
    job.setNumReduceTasks(1);                                // set number of reducers
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input data directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);        // run the job
}

The Configuration class is defined in the org.apache.hadoop.conf package.
The Job class is defined in the org.apache.hadoop.mapreduce package.
The FileInputFormat class is defined in the org.apache.hadoop.mapreduce.lib.input package.
The FileOutputFormat class is defined in the org.apache.hadoop.mapreduce.lib.output package.

22 Template for Hadoop with Partitioner
public class mainClass {  // name of the main class
    public static class MapperClass extends Mapper<Types> { ... }
    public static class ReducerClass extends Reducer<Types> { ... }
    public static class PartitionerClass extends Partitioner<Types> { ... }
    public static void main(String[] args) throws Exception { ... }
}

You can define your own Partitioner class.

23 Partitioner
Default: HashPartitioner, which partitions the mapper output by hashing the output key.

public class HashPartitioner<K, V> extends Partitioner<K, V> {
    public int getPartition(K key, V value, int numReduceTasks) {
        // mask off the sign bit so the hash is non-negative,
        // then map it onto one of the reducers
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

You can create your own Partitioner class that implements the getPartition function.
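For example, a custom partitioner could route words to reducers by their first letter instead of by hash. A sketch (the class name and the a-m/n-z rule are illustrative; for the word count job the key and value types are Text and IntWritable):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 1 || key.getLength() == 0) {
            return 0;  // only one reducer, or nothing to inspect
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        // words starting with a-m go to reducer 0, the rest to reducer 1
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}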

24 Template for Hadoop with Partitioner
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();                // specifies job configuration
    Job job = Job.getInstance(conf, "program name");         // create a job object
    job.setJarByClass(mainClass.class);                      // class containing the main program
    job.setMapperClass(MapperClass.class);                   // set name of mapper class
    job.setReducerClass(ReducerClass.class);                 // set name of reducer class
    job.setPartitionerClass(PartitionerClass.class);         // set name of partitioner class
    job.setCombinerClass(ReducerClass.class);                // set name of combiner class
    job.setMapOutputKeyClass(<ClassType>);                   // type of output key of mapper
    job.setMapOutputValueClass(<ClassType>);                 // type of output value of mapper
    job.setOutputKeyClass(<ClassType>);                      // type of output key of reducer
    job.setOutputValueClass(<ClassType>);                    // type of output value of reducer
    job.setNumReduceTasks(1);                                // set number of reducers
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input data directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);        // run the job
}

Note: here the combiner class is set to the same class as the reducer class (see the combiner slide for when this is valid).

25 Compiling a Hadoop Java Program
First, locate where the Java libraries are installed.
  For hadoop2.cse.msu.edu: /soft/linux/jdk1.8.0_65-x64
  For AWS: /usr/lib/jvm/java openjdk.x86_64
Then define the environment variables for the bash shell on AWS; a sketch follows.
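A plausible sketch of what these definitions might look like (the JAVA_HOME value is a placeholder; use the actual JDK directory located above; the tools.jar entry is what lets hadoop invoke the Java compiler in the next step):

export JAVA_HOME=/usr/lib/jvm/java-openjdk.x86_64   # placeholder; use the real JDK directory
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar  # compiler classes for hadoop com.sun.tools.javac.Main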

26 Compiling a Hadoop Java Program
To compile the Hadoop Java program:
  hadoop com.sun.tools.javac.Main <filename>.java
Output of compilation: a collection of *.class files.
The *.class files include the main class of the program, the mapper class, the reducer class, and the partitioner class (if defined).
You need to archive them into a single (jar) file. To create the jar file, type the following:
  jar cf <jarfilename>.jar *.class

27 Executing Hadoop Program
First, you must upload the input data to HDFS.
Execute the program by typing the following:
  hadoop jar <jarfile> <classfile> <arguments>
  Example: hadoop jar myHadoop.jar mainClass inputDir outputDir
Download the results from HDFS:
  If you have only 1 reducer: hadoop fs -copyToLocal outputDir/part-r-00000 results.txt
  If you have more than 1 reducer: hadoop fs -getmerge outputDir results.txt

28 Your First Hadoop Program on AWS
Launch the AWS EMR cluster instance.
Open an SSH connection to the master node.
Download the file AWS.zip from the class website. It contains:
  env.sh (which contains the environment variables)
  WordCount.java
  hadoop.txt (the input data)
Upload the data file to HDFS.
Compile WordCount.java and create the jar file.
Run the Hadoop program.
Get the output result.

29 Step 1: Create EMR Cluster Instance

30 Step 1: Create EMR Cluster Instance

31 Step 1: Create EMR Cluster Instance

32 Step 1: Create EMR Cluster Instance

33 Step 1: Create EMR Cluster Instance

34 Step 2: Open SSH Connection
Master node

35 Step 2: Open SSH Connection

36 Step 3: Download AWS.zip and Unpack

37 Step 4: Upload Data File to HDFS
Run env.sh to set environment variables
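A sketch of the commands for this step (the HDFS directory name "input" is illustrative):

source env.sh                    # set the environment variables in the current shell
hadoop fs -mkdir input           # create a directory on HDFS
hadoop fs -put hadoop.txt input  # upload the data file to HDFS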

38 Step 5: Compile and Create JAR file
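Concretely, using the commands from slide 26 (the jar name wc.jar is illustrative):

hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar *.class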

39 Step 6: Run Hadoop Program
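Assuming the main class in WordCount.java is named WordCount, and using the illustrative jar and directory names from the previous steps:

hadoop jar wc.jar WordCount input output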

40 Step 7: Get Output Result
Copy the results from HDFS to the local filesystem; you can then use sftp to transfer them to another machine.
Visualize the frequencies of the first 10 terms (ordered alphabetically).
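With the illustrative names from the previous steps and a single reducer:

hadoop fs -copyToLocal output/part-r-00000 results.txt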

