Coursework II: Google MapReduce in GridSAM
Steve Crouch, School of Electronics and Computer Science

Contents
- Introduction to Google's MapReduce
- Applications of MapReduce
- The coursework: extending a basic MapReduce framework provided in pseudocode
- Coursework deadline: 27th March, 4pm
- Hand-in via the ECS Coursework Handin System

Google MapReduce
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, Google Inc., OSDI '04 (…/papers/mapreduce-osdi04.pdf)

Google's Need for a Distributed Programming Model and Infrastructure
- Google implements many computations over a lot of data
  – Input: e.g. crawled documents, web request logs, etc.
  – Output: e.g. inverted indices, web document graphs, pages crawled per host, frequent per-day queries, etc.
- Input is usually very large (> 1 TB)
- Computations need to be distributed so that results arrive in a timely manner
- Want to do this in an easy, but scalable and robust way: provide a programming model (with a suitable abstraction) for the distributed processing aspects
- Realised that many computations follow a map/reduce approach
  – A map operation is applied to each logical input "record" to generate intermediate key/value pairs
  – A reduce operation is applied to all intermediate values sharing the same key, combining the data in a useful way
  – Used as the basis for a rewrite of their production indexing system!

History of MapReduce – Inspired by Functional Programming!
- Functional operations only create new data structures; they do not alter existing ones
- Order of operations does not matter
- Emphasis on data flow
- e.g. higher-order functions in functional languages (shown here in ML syntax):
  – map() applies a function to each value in a sequence:
      fun map f [] = []
        | map f (x::xs) = (f x) :: (map f xs)
  – reduce() combines all elements of a sequence using a binary operator:
      fun reduce f c [] = c
        | reduce f c (x::xs) = f x (reduce f c xs)
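The same two higher-order functions exist in mainstream languages too. A minimal sketch in Java, using the Java 8+ Stream API (newer than the JDK 6/7 the coursework targets, but the idea is identical; the word list and lambda bodies are purely illustrative):

    import java.util.List;
    import java.util.stream.Collectors;

    public class FunctionalExample {
        public static void main(String[] args) {
            List<String> words = List.of("grid", "map", "reduce");

            // map(): apply a function to each element, producing a new list
            List<Integer> lengths = words.stream()
                                         .map(String::length)
                                         .collect(Collectors.toList());

            // reduce(): combine all elements with a binary operator (here, addition)
            int totalLength = lengths.stream().reduce(0, Integer::sum);

            System.out.println(lengths + " -> total " + totalLength);  // [4, 3, 6] -> total 13
        }
    }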

Looking at map and reduce Another Way…
- map():
  – Delegates or distributes the computation for each piece of data to a given function, creating a new set of data
  – Each computation cannot see the effects of the other computations
  – The order of computation is irrelevant
- reduce() takes this newly created data and reduces it to something we want
- map() moves left to right over the list, applying the given function… can this be exploited in distributed computing?

Applying the Programming Model to the Data
(Source: Distributed Computing Seminar, Lecture 2: MapReduce Theory and Implementation, Christophe Bisciglia, Aaron Kimball & Sierra Michels-Slettvet, Summer 2007.)

For Example…
- Counting the number of occurrences of each word in a large collection of documents:

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));

- map outputs each word plus an occurrence count
- reduce sums together all counts emitted for each word
- Example: given doc1 = "Hello world" and doc2 = "Hello there", map() emits (Hello, 1), (world, 1), (Hello, 1), (there, 1); reduce() then produces Hello: 2, world: 1, there: 1
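A minimal, single-process Java sketch of the same word count, grouping the intermediate (word, 1) pairs by key before reducing. The class and method names are illustrative only; this is neither Google's implementation nor the coursework framework:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class WordCountSketch {

        // map: emit (word, 1) for every word in the document contents
        static List<Map.Entry<String, Integer>> map(String docName, String contents) {
            List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
            for (String word : contents.split("\\s+")) {
                intermediate.add(Map.entry(word, 1));
            }
            return intermediate;
        }

        // reduce: sum all counts emitted for one word
        static int reduce(String word, List<Integer> counts) {
            int result = 0;
            for (int c : counts) {
                result += c;
            }
            return result;
        }

        public static void main(String[] args) {
            Map<String, String> docs = Map.of("doc1", "Hello world", "doc2", "Hello there");

            // group the intermediate pairs by key (the "shuffle" a real framework does for you)
            Map<String, List<Integer>> grouped = new HashMap<>();
            docs.forEach((name, contents) ->
                map(name, contents).forEach(pair ->
                    grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue())));

            grouped.forEach((word, counts) ->
                System.out.println(word + " -> " + reduce(word, counts)));
            // prints Hello -> 2, world -> 1, there -> 1 (in some order)
        }
    }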

How it Works in Practice
"MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat, OSDI '04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
1. The user program splits the work into M pieces (typically 64 MB each); the program starts up across the compute nodes as either Master or Worker (with exactly one Master)
2. The Master assigns the M map tasks and R reduce tasks to idle Workers (either one map or one reduce task each)
3. A map Worker parses key/value pairs out of its input, passes each key/value pair to the map function, and buffers the intermediate keys/values in memory
4. Periodically, a map Worker writes its intermediate key/value pairs to disk and informs the Master of their locations; the Master forwards these locations to the reduce Workers
5/6. When notified of the locations by the Master, a reduce Worker remotely reads in the data, sorts and groups it by key, and passes each key and its values to the reduce function; results are appended to an output file
7. When all map and reduce tasks are done, the Master wakes up the user program, which resumes
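Step 4 implies that every intermediate key must be routed to exactly one of the R reduce tasks, so that a single reduce Worker sees all values for that key. The MapReduce paper's default partitioning scheme is hash(key) mod R; a minimal sketch of such a partition function (the method name and surrounding code are illustrative, not part of GridSAM or the coursework framework):

    // Decide which of the R reduce tasks receives a given intermediate key.
    // Equal keys always map to the same partition, so one reduce task sees
    // every value emitted for that key.
    static int partitionFor(String intermediateKey, int numReduceTasks) {
        // floorMod keeps the result in [0, numReduceTasks), even for negative hash codes
        return Math.floorMod(intermediateKey.hashCode(), numReduceTasks);
    }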

Coursework: Part II

Learning Objectives:
- To develop a general architectural and operational understanding of typical production-level grid software.
- To develop the programming skills required to drive typical services on a production-level grid.

Tasks
- Download and install the GridSAM server and client
- (a) Extend some Java code stubs (which use the GridSAM Java API) to submit jobs to GridSAM and monitor them
- (b) Extend some pseudocode that describes a basic MapReduce framework for performing word counting on a number of files

File Word Count map and reduce…
- Counting the number of occurrences of a given word in a collection of text files:

    Function mapFunction(fileName, fileLocation)
      matchCount = countMatches(word, fileLocation/fileName)
      Return [(fileName, matchCount)]
    End Function

    Function reduceFunction(fileName, countList)
      totalCount = 0
      For Each count In countList
        totalCount += count
      Next
      Return totalCount
    End Function

- map outputs each filename plus the word occurrence count for that file
- reduce sums together all counts emitted for each filename
- Example: for file1 at /some/path/ and file2 at /other/path, map() emits (file1, 1) and (file2, 4); reduce() then sums the counts emitted for each file
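A rough Java rendering of the same two pseudocode functions. This is only an illustration of the pseudocode above, not the coursework framework: countMatches here is a simple whitespace-split comparison, and the method signatures are assumptions.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.Map;

    public class FileWordCountSketch {

        // Illustrative helper: count occurrences of word in the given file
        static int countMatches(String word, Path file) throws IOException {
            int matches = 0;
            for (String line : Files.readAllLines(file)) {
                for (String token : line.split("\\s+")) {
                    if (token.equals(word)) {
                        matches++;
                    }
                }
            }
            return matches;
        }

        // mapFunction: emit one (fileName, matchCount) pair per input file
        static Map.Entry<String, Integer> mapFunction(String word, String fileName, Path fileLocation)
                throws IOException {
            int matchCount = countMatches(word, fileLocation.resolve(fileName));
            return Map.entry(fileName, matchCount);
        }

        // reduceFunction: sum all counts emitted for one fileName
        static int reduceFunction(String fileName, List<Integer> countList) {
            int totalCount = 0;
            for (int count : countList) {
                totalCount += count;
            }
            return totalCount;
        }
    }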

Coursework: Part II – Installing GridSAM

Pre-Requisites
- Client and server: Linux only (e.g. SuSE 9.0, RedHat, Debian, Ubuntu)
  – May work on other Linux distributions, but no exhaustive testing has been done
  – Tested on the undergraduate Linux boxes
- Requires Java JDK 6 (not just the JRE) or above
- Beware:
  – Firewalls blocking port 8080 and your FTP port between client and server: add exceptions
  – VPNs can cause problems with staging data to/from GridSAM

Preparation/Installation
- Java 7 recommended
  – Note: you may need to upgrade your Java
  – Ensure JAVA_HOME is set and Java is on your path
- Install the client…
  – Download the gridsam client.zip from the coursework page
  – unzip the gridsam client.zip (into a file path that contains no spaces)
  – cd into the gridsam client directory
  – java SetupGridSAM
- Install the server (Linux only)…
  – Can just reuse your Apache Tomcat from m-grid (see the m-grid install slides)
  – Download gridsam.war from the coursework page
  – Shut down Tomcat, copy gridsam.war into apache-tomcat /webapps and restart Tomcat
  – Check the log files in apache-tomcat /webapps/gridsam/WEB-INF/logs if any problems occur

Coursework Materials
- Download COMP3019-materials.tgz from the coursework page
  – Copy it to the gridsam client directory
  – Unpack it; you'll find some GridSAMExample* files. Run ./GridSAMExampleCompile to check compilation
  – The code is not complete; that's the coursework!
- GridSAMExampleRun won't work until you have done the coursework
  – Note the server.domain and port in the script: you need to change these to point at your server (use HTTP, not HTTPS!)
- Use the scripts and Java code as a basis
- Refer to the API docs on the coursework page as required
  – To obtain the job status, use e.g. (a usage sketch follows this slide):
    jobStage = jobManager.findJobInstance(jobID).getLastKnownStage().getState().toString();
  – Calling job.getLastKnownStage().getState().toString() directly won't work
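A minimal sketch of how that status call might sit inside a polling loop. The only GridSAM-specific call shown is the one quoted above; everything else (how jobManager and jobID are obtained, the terminal state name, the polling interval, error handling) is assumed to come from the provided GridSAMExample* stubs and your own design, and is not part of the GridSAM API as documented here:

    // Fragment only: jobManager and jobID come from the provided code stubs,
    // so their declarations are not shown.
    String jobStage = "";
    while (!jobStage.equalsIgnoreCase("done")) {          // terminal state name assumed; failure states not handled
        jobStage = jobManager.findJobInstance(jobID)
                             .getLastKnownStage()
                             .getState()
                             .toString();
        System.out.println("Current job stage: " + jobStage);
        Thread.sleep(5000);                               // poll every 5 s (InterruptedException handling omitted)
    }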

The Coursework
- See the coursework handout on the COMP3019 page
- Notes for Part 1 (m-grid):
  – When specifying multiple arguments to your m-grid applet, there is a single string you can use as an argument
  – Consider how you pass the two necessary arguments (i.e. a character and a text file) as a single argument into the applet (one possible approach is sketched after these notes)
  – To load the text file into your applet, package it into the jar file along with the code, and use the following in the applet:
    InputStream in = getClass().getResourceAsStream("textfile.txt");
- Part 2 (GridSAM) notes:
  – If you encounter problems using the GridSAM FTP server, some students have found success using a StupidFTP server (available under Ubuntu)
  – When you want to check the status of a job, use e.g.:
    jobStage = jobManager.findJobInstance(jobID).getLastKnownStage().getState().toString();
  – Calling job.getLastKnownStage().getState().toString() directly won't work
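One possible way to combine the two Part 1 inputs into the single applet argument, then read the packaged text file with getResourceAsStream as suggested above. The parameter name "arg", the space separator, the applet class name and the counting logic are assumptions for illustration only, not the required approach:

    import java.applet.Applet;
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    public class CountApplet extends Applet {
        @Override
        public void init() {
            // Single applet argument, e.g. "e textfile.txt": the character first, then the file name
            String combined = getParameter("arg");       // parameter name is an assumption
            String[] parts = combined.split(" ", 2);
            char target = parts[0].charAt(0);
            String resourceName = parts[1];

            // The text file is packaged inside the jar alongside the code
            try (InputStream in = getClass().getResourceAsStream(resourceName);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                long count = 0;
                int ch;
                while ((ch = reader.read()) != -1) {
                    if (ch == target) {
                        count++;
                    }
                }
                System.out.println("Occurrences of '" + target + "': " + count);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }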

Coursework: Part II – Running a Command Line Example

Example using File Staging
- Objective: submit a simple job with data input and output requirements and monitor its progress
- [Diagram: the OMII GridSAM Client submits a JSDL document to the OMII GridSAM Server and monitors it; the OMII GridSAM FTP Server stages 2 input files in and 1 output file out]

JSDL Example  Gridsam-2.3.0/examples/remotecat-staging.jsdl  Change ftp URLs to match your ftp server e.g. ): … bin/concat dir2/subdir1/file2.txt stdout.txt stderr.txt dir1/file1.txt …

JSDL Example
- [Data staging entries from remotecat-staging.jsdl, abridged:]
  – bin/concat: staged in from ftp://ftp.do:55521/concat.sh (overwrite)
  – dir1/file1.txt: staged in from ftp://ftp.do:55521/input1.txt (overwrite)
  – dir2/subdir1/file2.txt: staged in from ftp://ftp.do:55521/input2.txt (overwrite)
  – stdout.txt: staged out to ftp://ftp.do:55521/stdout.txt (overwrite, true)

Set up the GridSAM Client's FTP Server
- To allow GridSAM to retrieve input and store output
- In the gridsam client directory:
    > ./gridsam.sh GridSAMFTPServer -p <port> -d examples/
    WARN [GridSAMFTPServer] (main:) ../data/examples/ is exposed through FTP
    WARN [GridSAMFTPServer] (main:) Please make sure you understand the security implication of using anonymous FTP for file staging.
    FtpServer.server.config.root.dir = ../data/examples/
    FtpServer.server.config.data = /home/omii/COMP3019/omii-uk-client/gridsam/ftp/ftp
    FtpServer.server.config.port =
    FtpServer.server.config.self.host =
    Started FTP
- Exposes the examples directory through FTP on the given port (anonymous access!)
- Create input1.txt and input2.txt in this directory, with some text in them

CLI Example: Submit to the GridSAM Server
- Ensure Java is on your path
- In the gridsam client directory, submit to the GridSAM server:
    ./gridsam.sh GridSAMSubmit -s "<your GridSAM server URL>" -j examples/remotecat-staging.jsdl
- A unique job ID is returned, i.e. a UID of the form urn:gridsam:…

CLI Example: Monitoring the Job
- Monitor the job until completion:
    > ./gridsam.sh GridSAMStatus -s "<your GridSAM server URL>" -j <job ID>
  – The job ID is the entire urn:gridsam:… string
- Job progress is indicated by the current state:
  – Pending, Staging-in, Staged-in, Active, Executed, Staging-out, Staged-out, Done
- When complete, the output resides in the stdout.txt file in the examples/ directory

What to Hand In
- Submit: source code, results files, parameter files and output
- Other parts that require written answers should form a separate document:
  – In text, Microsoft Word or PDF
  – Up to 800 words in length, not including any source or trace output
- Submission via the ECS Coursework Handin system: a single zip file containing source, results, parameter files, output and written answers