
Big Data Analytics with R and Hadoop, Chapter 4: Using Hadoop Streaming with R. Department of Computer Science, SE Lab. Amarmend. 2015-4-9.




1 Big Data Analytics with R and Hadoop. Chapter 4: Using Hadoop Streaming with R. Department of Computer Science, SE Lab. Amarmend. 2015-4-9

2 Contents
Understanding the basics of Hadoop streaming
Understanding how to run Hadoop streaming with R
Exploring the HadoopStreaming R package

3 Understanding the basics of Hadoop streaming
Hadoop streaming is a Hadoop utility for running MapReduce jobs with executable scripts, such as a Mapper and a Reducer. It is similar to the pipe operation in Linux:
1. The text input file is printed to the stream (stdin).
2. stdin is provided as input to the Mapper.
3. The output (stdout) of the Mapper is provided as input to the Reducer.
4. The Reducer writes its output to an HDFS directory.
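The pipe analogy above can be sketched entirely in a local shell, with awk standing in for the Mapper and Reducer. This is a hypothetical word count, not the chapter's example; sort plays the role of Hadoop's shuffle, which groups Mapper output by key before the Reducer sees it:

```shell
# Local simulation of the streaming data flow:
# stdin -> mapper (emit key<TAB>1) -> sort (group by key) -> reducer (sum per key)
printf 'a b a\nb a\n' \
  | tr ' ' '\n' \
  | awk '{ print $1 "\t1" }' \
  | sort \
  | awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }' \
  | sort
```

Hadoop streaming works the same way, except that each stage runs distributed across the cluster and the intermediate sort is performed by the framework.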

4 Understanding the basics of Hadoop streaming
The main advantage of the Hadoop streaming utility is that it allows both Java and non-Java MapReduce jobs to be executed over Hadoop clusters. Hadoop streaming supports Perl, Python, PHP, R, and C++, among others. To use it, translate the application logic into Mapper and Reducer sections with key and value output elements. A streaming job has three main components: Mapper, Reducer, and Driver.

5 Understanding the basics of Hadoop streaming
Hadoop streaming components (figure)

6 Understanding the basics of Hadoop streaming
Now assume we have implemented our Mapper and Reducer as code_mapper.R and code_reducer.R.
Format of the Hadoop streaming command (the generic options must come before the streaming options):
bin/hadoop command [genericOptions] [streamingOptions]

7 Understanding how to run Hadoop streaming with R
Understanding a MapReduce application
Gujarat Technological University - http://www.gtuniversity.com - offers programs in Medical, Hotel Management, Architecture, Pharmacy, MBA, and MCA.
Purpose: identify which fields visitors are interested in, broken down geographically.
Input dataset: the extracted Google Analytics dataset contains four data columns.

8 Understanding how to run Hadoop streaming with R
Understanding a MapReduce application
date: the date of the visit, in YYYY/MM/DD format.
country: the country of the visitor.
city: the city of the visitor.
pagePath: the URL of a page on the website.

9 Understanding how to run Hadoop streaming with R
Understanding how to code a MapReduce application
A MapReduce application consists of:
Mapper code
Reducer code
Mapper code: this R script, named ga-mapper.R, takes care of the Map phase of a MapReduce job. The Mapper extracts a (key, value) pair and passes it to the Reducer to be grouped/aggregated. city is the key and pagePath is the value.

10 Understanding how to run Hadoop streaming with R
Understanding how to code a MapReduce application
ga-mapper.R (R script):
input <- file("stdin", open = "r")
while (length(currentLine <- readLines(input, n = 1)) > 0) {
  # Split the CSV line into its four columns: date, country, city, pagePath
  fields <- unlist(strsplit(currentLine, ","))
  city <- as.character(fields[3])
  pagepath <- as.character(fields[4])
  # Emit city<TAB>pagePath as the (key, value) pair on stdout
  cat(paste(city, pagepath, sep = "\t"), "\n", sep = "")
}
close(input)

11 Understanding how to run Hadoop streaming with R
Understanding how to code a MapReduce application
ga-reducer.R (R script):
input <- file("stdin", open = "r")
city.key <- NA
page.value <- 0.0
while (length(currentLine <- readLines(input, n = 1)) > 0) {
  fields <- strsplit(currentLine, "\t")
  key <- fields[[1]][1]
  value <- as.character(fields[[1]][2])
  if (is.na(city.key)) {
    city.key <- key
    page.value <- value
  } else if (city.key == key) {
    # Same city as the previous line: accumulate its pagePath values
    page.value <- c(page.value, value)
  } else {
    # New city: emit the aggregated values for the previous city
    cat(city.key, "\t", paste(page.value, collapse = ","), "\n", sep = "")
    city.key <- key
    page.value <- value
  }
}
# Emit the final group
cat(city.key, "\t", paste(page.value, collapse = ","), "\n", sep = "")
close(input)
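Because streaming only wires stdin and stdout together, the two scripts can be smoke-tested locally with a pipe before submitting to Hadoop. This sketch assumes Rscript is on the PATH and a sample ga-data.csv file exists in the working directory:

```shell
# Emulate the streaming job locally: sort plays the role of Hadoop's
# shuffle, which delivers Mapper output to the Reducer grouped by key.
cat ga-data.csv | Rscript ga-mapper.R | sort | Rscript ga-reducer.R
```

If this pipeline produces the expected city-to-pagePath aggregation, the same scripts should behave correctly inside the streaming job.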

12 Understanding how to run Hadoop streaming with R
Executing a Hadoop streaming job from the command prompt
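A full invocation for this example might look as follows. The streaming jar version and the HDFS input/output paths are assumptions for illustration; adjust them to match your installation:

```shell
bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar \
  -input /ga/gadata.csv \
  -output /ga/output \
  -file ga-mapper.R -mapper "Rscript ga-mapper.R" \
  -file ga-reducer.R -reducer "Rscript ga-reducer.R"
```

The -file options ship the two R scripts to every task node so the -mapper and -reducer commands can find them.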

13 Understanding how to run Hadoop streaming with R
Executing the Hadoop streaming job from R or an RStudio console
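From R or an RStudio console, the same command line can be dispatched with system(). The command string below is the hypothetical invocation from the previous slide, with assumed jar and HDFS paths:

```r
# Build the streaming command as a single string and run it from R
cmd <- paste(
  "bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar",
  "-input /ga/gadata.csv -output /ga/output",
  "-file ga-mapper.R -mapper \"Rscript ga-mapper.R\"",
  "-file ga-reducer.R -reducer \"Rscript ga-reducer.R\""
)
system(cmd)
```

system() blocks until the job submission returns, so the console shows the same progress output as the command prompt would.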

14 Understanding how to run Hadoop streaming with R
Exploring an output from the command prompt
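The Reducer output lands in the job's HDFS output directory as part files, which can be inspected with the HDFS shell. The /ga/output path below is an assumption carried over from the example invocation:

```shell
# List the files the job wrote, then print the Reducer output
bin/hadoop fs -ls /ga/output
bin/hadoop fs -cat /ga/output/part-*
```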

15 Understanding how to run Hadoop streaming with R
Exploring an output from R or an RStudio console
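From R, the same part files can be pulled back into the session for inspection, again assuming the /ga/output path from the example invocation:

```r
# Capture the HDFS output as a character vector, one element per line
lines <- system("bin/hadoop fs -cat /ga/output/part-*", intern = TRUE)
# Split each line back into the city key and its comma-separated pagePaths
records <- strsplit(lines, "\t")
head(records)
```

intern = TRUE makes system() return the command's stdout to R instead of printing it, so the result can be parsed further.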

16 Understanding how to run Hadoop streaming with R
Monitoring the Hadoop MapReduce job
Even a small syntax error causes the MapReduce job to fail. Jobs can be monitored from the Hadoop administration page, or from the command line:
$ bin/hadoop job -history /output/location

17 Exploring the HadoopStreaming R package
The hsTableReader function is designed for reading data in table format.
hsTableReader(con, cols, chunkSize=3, skip, sep, carryMemLimit, carryMaxRows)
str <- "key1\t1.91\nkey1\t2.1\nkey1\t20.2\nkey1\t3.2\nkey2\t1.2\nkey2\t10\nkey3\t2.5\nkey3\t2.1\nkey4\t1.2\n"
cols <- list(key='', val=0)
con <- textConnection(str, open = "r")
hsTableReader(con, cols, chunkSize=3)

18 Exploring the HadoopStreaming R package
The hsKeyValReader function is designed for reading data available in key-value pair format.
hsKeyValReader(con, chunkSize, skip, sep)
printkeyval <- function(k, v) {
  cat('A chunk:\n')
  cat(paste(k, v, sep=': '), sep='\n')
}
str <- "key1\tval1\nkey2\tval2\nkey3\tval3\n"
con <- textConnection(str, open = "r")
hsKeyValReader(con, chunkSize=1, FUN=printkeyval)

19 Exploring the HadoopStreaming R package
The hsLineReader function is designed for reading an entire line as a string, without performing any data-parsing operation.
hsLineReader(file="", chunkSize=2, skip="")
str <- "This is HadoopStreaming!!\nhere are,\nexamples for chunk dataset!!\nin R\n"
con <- textConnection(str, open = "r")
hsLineReader(con, chunkSize=2)

20 Thank you

