Big Data Analytics with R and Hadoop, Chapter 4: Using Hadoop Streaming with R. Computer Science Department, SE Lab, Amarmend. 2015-4-9.

Contents
Understanding the basics of Hadoop streaming
Understanding how to run Hadoop streaming with R
Exploring the HadoopStreaming R package

Understanding the basics of Hadoop streaming
Hadoop streaming is a Hadoop utility for running a Hadoop MapReduce job with executable scripts, such as a Mapper and a Reducer. It works much like a pipe operation in Linux:
1. The text input file is printed on the stream (stdin).
2. The stream is provided as input to the Mapper.
3. The output (stdout) of the Mapper is provided as input to the Reducer.
4. The Reducer writes its output to an HDFS directory.
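Because the data flow is literally a chain of stdin/stdout hand-offs, a streaming job can be rehearsed locally with a plain Linux pipeline before touching the cluster. A minimal sketch, assuming hypothetical script and file names (mapper.R, reducer.R, input.txt):

cat input.txt | Rscript mapper.R | sort | Rscript reducer.R > output.txt

Here sort stands in for Hadoop's shuffle-and-sort phase, which is what guarantees that the Reducer sees all values for a given key contiguously.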

Understanding the basics of Hadoop streaming
The main advantage of the Hadoop streaming utility is that it allows MapReduce jobs programmed in Java as well as non-Java languages to be executed over Hadoop clusters; Hadoop streaming supports languages such as Perl, Python, PHP, R, and C++. To use it, you translate the application logic into Mapper and Reducer sections that emit key and value output elements. A streaming application has three main components: the Mapper, the Reducer, and the Driver.

Understanding the basics of Hadoop streaming
[Figure: Hadoop streaming components]

Understanding the basics of Hadoop streaming
Now, assume we have implemented our Mapper and Reducer as code_mapper.R and code_reducer.R. The general format of the Hadoop streaming command is:
bin/hadoop command [genericOptions] [streamingOptions]
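A concrete instance of this format might look like the sketch below; the streaming jar location and the HDFS input/output paths are assumptions that depend on the Hadoop version and installation:

bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar \
    -input  /user/hduser/input \
    -output /user/hduser/output \
    -file code_mapper.R  -mapper  "Rscript code_mapper.R" \
    -file code_reducer.R -reducer "Rscript code_reducer.R"

The -file options ship the scripts to every cluster node; -mapper and -reducer tell the streaming jar how to execute them.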

Understanding how to run Hadoop streaming with R
Understanding a MapReduce application
The example analyzes website traffic for Gujarat Technological University, which spans fields such as Medical, Hotel Management, Architecture, Pharmacy, MBA, and MCA.
Purpose: identify the fields that visitors are interested in, broken down geographically.
Input dataset: the extracted Google Analytics dataset contains four data columns.

Understanding how to run Hadoop streaming with R
Understanding a MapReduce application
date: the date of the visit, in the form YYYY/MM/DD
country: the country of the visitor
city: the city of the visitor
pagePath: the URL of a page of the website
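For illustration, a single record in this four-column CSV layout would look like the following line (a made-up example; the actual dataset's values will differ):

2013/02/15,India,Ahmedabad,/gtu-mba-admissions.html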

Understanding how to run Hadoop streaming with R
Understanding how to code a MapReduce application
The MapReduce application has two parts: the Mapper code and the Reducer code.
Mapper code: this R script, named ga-mapper.R, takes care of the Map phase of the MapReduce job. The Mapper extracts a (key, value) pair from each record and passes it to the Reducer to be grouped/aggregated: city is the key and pagePath is the value.

Understanding how to run Hadoop streaming with R
Understanding how to code a MapReduce application
ga-mapper.R (R script):

input <- file("stdin", "r")                      # read records from standard input
while (length(currentLine <- readLines(input, n = 1)) > 0) {
  fields <- unlist(strsplit(currentLine, ","))   # split the CSV record
  city <- as.character(fields[3])                # key: city
  pagepath <- as.character(fields[4])            # value: pagePath
  writeLines(paste(city, pagepath, sep = "\t"))  # emit key<TAB>value on stdout
}
close(input)

Understanding how to run Hadoop streaming with R
Understanding how to code a MapReduce application
ga-reducer.R (R script; the tail of the code is truncated on the slide, so the emit-on-key-change branch and the final flush below are a reconstruction). Hadoop sorts the Mapper output by key before the Reduce phase, so all lines for one city arrive contiguously and the Reducer only has to watch for key changes:

input <- file("stdin", open = "r")
city.key <- NA
page.value <- 0.0
while (length(currentLine <- readLines(input, n = 1)) > 0) {
  fields <- strsplit(currentLine, "\t")
  key <- fields[[1]][1]
  value <- as.character(fields[[1]][2])
  if (is.na(city.key)) {               # first record: start the first group
    city.key <- key
    page.value <- value
  } else if (city.key == key) {        # same city: accumulate its pagePaths
    page.value <- c(page.value, value)
  } else {                             # key changed: emit the finished group
    writeLines(paste(city.key, paste(page.value, collapse = ","), sep = "\t"))
    city.key <- key
    page.value <- value
  }
}
if (!is.na(city.key)) {                # flush the last group
  writeLines(paste(city.key, paste(page.value, collapse = ","), sep = "\t"))
}
close(input)

Understanding how to run Hadoop streaming with R
Executing a Hadoop streaming job from the command prompt
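Instantiating the command format shown earlier with this chapter's scripts might look as follows (the jar path and HDFS paths remain assumptions):

bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar \
    -input  /ga/gadata.csv \
    -output /ga/output \
    -file ga-mapper.R  -mapper  "Rscript ga-mapper.R" \
    -file ga-reducer.R -reducer "Rscript ga-reducer.R"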

Understanding how to run Hadoop streaming with R
Executing the Hadoop streaming job from R or an RStudio console
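Since the submission is just a shell command, the same job can be launched from R or RStudio with system(). A minimal sketch under the same path assumptions as above:

# Build the streaming command as a string, then hand it to the shell
hadoop.cmd <- paste(
  "bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar",
  "-input /ga/gadata.csv -output /ga/output",
  "-file ga-mapper.R -mapper 'Rscript ga-mapper.R'",
  "-file ga-reducer.R -reducer 'Rscript ga-reducer.R'")
system(hadoop.cmd)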

Understanding how to run Hadoop streaming with R
Exploring an output from the command prompt
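The finished job leaves its result as part files in the HDFS output directory, which can be listed and printed with the HDFS shell (the /ga/output path is the assumption carried over from the earlier commands):

bin/hadoop fs -ls /ga/output
bin/hadoop fs -cat /ga/output/part-00000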

Understanding how to run Hadoop streaming with R
Exploring an output from R or an RStudio console
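From R, one way to explore the same result is to copy the part file to the local filesystem and read it as a tab-separated table. A sketch, with paths and column names carried over from the hypothetical reducer output above:

# Copy the HDFS result locally, then load it as a data frame
system("bin/hadoop fs -get /ga/output/part-00000 ga-output.txt")
ga.output <- read.delim("ga-output.txt", header = FALSE,
                        col.names = c("city", "pagePaths"))
head(ga.output)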

Understanding how to run Hadoop streaming with R
Monitoring the Hadoop MapReduce job
Even a small syntax error in the Mapper or Reducer script leads to failure of the whole MapReduce job. A running or completed job can be inspected from the Hadoop administration web page, or its history can be dumped from the command line:
$ bin/hadoop job -history /output/location

Exploring the HadoopStreaming R package
The hsTableReader function is designed for reading data in a table format. Its key arguments are:
hsTableReader(con, cols, chunkSize, skip, sep, carryMemLimit, carryMaxRows)

str <- "key1\t1.91\nkey1\t2.1\nkey1\t20.2\nkey1\t3.2\nkey2\t1.2\nkey2\t10\nkey3\t2.5\nkey3\t2.1\nkey4\t1.2\n"
cols <- list(key = '', val = 0)
con <- textConnection(str, open = "r")
hsTableReader(con, cols, chunkSize = 3)

Exploring the HadoopStreaming R package
The hsKeyValReader function is designed for reading data available in the key-value pair format. Its key arguments are:
hsKeyValReader(con, chunkSize, skip, sep, FUN)

# Callback applied to each chunk of keys k and values v
printkeyval <- function(k, v) {
  cat('A chunk:\n')
  cat(paste(k, v, sep = ': '), sep = '\n')
}
str <- "key1\tval1\nkey2\tval2\nkey3\tval3\n"
con <- textConnection(str, open = "r")
hsKeyValReader(con, chunkSize = 1, FUN = printkeyval)

Exploring the HadoopStreaming R package
The hsLineReader function is designed for reading an entire line as a string, without performing any data-parsing operation. Its key arguments are:
hsLineReader(file, chunkSize, skip)

str <- " This is HadoopStreaming!!\n here are,\n examples for chunk dataset!!\n in R\n  ?"
con <- textConnection(str, open = "r")
hsLineReader(con, chunkSize = 2)

Thank you