# Coursework II: Google MapReduce in GridSAM Steve Crouch School of Electronics and Computer Science.


Coursework II: Google MapReduce in GridSAM Steve Crouch s.crouch@software.ac.uk, stc@ecs School of Electronics and Computer Science

Contents
- Introduction to Google’s MapReduce
- Applications of MapReduce
- The coursework: extending a basic MapReduce framework provided in pseudocode
- Coursework deadline: 27th March, 4pm
- Hand-in via the ECS Coursework Handin System

Google’s Need for a Distributed Programming Model and Infrastructure
- Google implement many computations over a lot of data
  - Input: e.g. crawled documents, web request logs, etc.
  - Output: e.g. inverted indices, web document graphs, pages crawled per host, frequent per-day queries, etc.
- Input is usually very large (> 1TB)
- Computations need to be distributed for timeliness of results
- Want to do this in an easy, but scalable and robust way; provide a programming model (with a suitable abstraction) for the distributed processing aspects
- Realised that many computations follow a map / reduce approach
  - map operation applied to a set of logical input “records” to generate intermediate key/value pairs
  - reduce operation applied to all intermediate values sharing the same key to combine data in a useful way
  - Used as the basis for a rewrite of their production indexing system!

History of MapReduce: Inspired by Functional Programming!
- Functional operations only create new data structures and do not alter existing ones
- Order of operations does not matter
- Emphasis on data flow
- e.g. higher-order functions in Lisp (the definitions below are written in ML-style notation)
  - map() applies a function to each value in a sequence: `fun map f [] = [] | map f (x::xs) = (f x) :: (map f xs)`
  - reduce() combines all elements of a sequence using a binary operator: `fun reduce f c [] = c | reduce f c (x::xs) = f x (reduce f c xs)`
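The two definitions above can be transcribed almost line for line into Python. This is a minimal sketch for illustration only: the names `my_map` and `my_reduce` are made up here, and in practice Python's built-in `map` and `functools.reduce` would be used instead.

```python
from functools import reduce as builtin_reduce


def my_map(f, xs):
    """Apply f to every element, mirroring: map f (x::xs) = (f x) :: (map f xs)."""
    if not xs:
        return []
    return [f(xs[0])] + my_map(f, xs[1:])


def my_reduce(f, c, xs):
    """Right fold, mirroring: reduce f c (x::xs) = f x (reduce f c xs)."""
    if not xs:
        return c
    return f(xs[0], my_reduce(f, c, xs[1:]))


doubled = my_map(lambda x: x * 2, [1, 2, 3])
total = my_reduce(lambda x, acc: x + acc, 0, doubled)

# The built-ins give the same answer for an associative operator such as addition.
builtin_total = builtin_reduce(lambda acc, x: acc + x, map(lambda x: x * 2, [1, 2, 3]), 0)
```

Note that the definition on the slide folds from the right, whereas `functools.reduce` folds from the left; the two only differ for operators that are not associative, such as subtraction.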

Looking at map and reduce Another Way…
- map():
  - Delegates or distributes the computation for each piece of data to a given function, creating a new set of data
  - Each computation cannot see the effects of the other computations
  - The order of computation is irrelevant
- reduce() takes this created data and reduces it to something we want
- map() moves left to right over the list, applying the given function… can this be exploited in distributed computing?
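One way to see why the independence of each map computation matters: because no call depends on another, the calls can be handed to a pool of workers and the result is the same as a sequential map. The sketch below uses Python's standard `concurrent.futures` thread pool as a stand-in for distributed workers; `count_words` and the sample documents are illustrative assumptions, not part of the coursework.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(text):
    """A pure function: the result depends only on its own input."""
    return len(text.split())


documents = ["Hello world", "Hello there", "map and reduce are higher-order functions"]

# Sequential map: left to right over the list.
sequential = list(map(count_words, documents))

# Parallel map: each call may run on a different worker, in any order.
# Executor.map still returns the results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(count_words, documents))
```

If `count_words` read or modified shared state, the two results could differ, which is why the functional style on the slide suits distribution.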

Applying the Programming Model to the Data

Distributed Computing Seminar, Lecture 2: MapReduce Theory and Implementation, Christophe Bisciglia, Aaron Kimball & Sierra Michels-Slettvet, Summer 2007.

For Example…

Counting the number of occurrences of each word in a large collection of documents:

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));

- map outputs each word plus an occurrence count
- reduce sums together all counts emitted for each word

Example data flow:
- Input: (doc1, "Hello world"), (doc2, "Hello there")
- map() output: (Hello, 1), (world, 1), (there, 1), (Hello, 1)
- reduce() output: 2 (Hello), 1 (world), 1 (there)
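The word-count pseudocode above can be run as ordinary Python. This is a single-process sketch: `run_word_count` is a hypothetical driver that does the grouping by key which a real MapReduce framework would perform between the two phases, and counts are kept as integers rather than the strings used in the pseudocode.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name; value: document contents. Yields (word, 1) pairs."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """key: a word; values: an iterable of counts. Returns the summed count."""
    return sum(values)


def run_word_count(documents):
    """Run map over every document, group by key, then reduce each group."""
    intermediate = defaultdict(list)
    for name, contents in documents:
        for word, count in map_fn(name, contents):
            intermediate[word].append(count)
    return {word: reduce_fn(word, counts) for word, counts in intermediate.items()}


counts = run_word_count([("doc1", "Hello world"), ("doc2", "Hello there")])
```

With the two example documents, `counts` maps Hello to 2 and both world and there to 1, matching the data flow on the slide.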

How it Works in Practice

"MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.

1. User program:
   - Splits work into M 64MB pieces
   - Program starts up across compute nodes as either Master or Worker (with exactly 1 Master)
2. Master assigns M map tasks and R reduce tasks to idle workers (either one map or one reduce task each)
3. A map Worker:
   - Parses key/value pairs out of its input
   - Passes each key/value pair to the map function
   - Buffers intermediate keys/values in memory
4. Periodically, a map Worker writes intermediate key/value pairs to disk and informs the Master of their locations; the Master forwards these to the reduce Workers
5. When notified of the locations by the Master, a reduce Worker remotely reads in the data, then sorts and groups it by key
6. The reduce Worker passes the grouped data to the reduce function, and the results are appended to an output file
7. When all maps and reduces are done, the Master wakes up the user program, which resumes
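The steps above can be imitated in a single process to make the roles of M and R concrete. This is only a sketch under several assumptions: there is no Master, no workers and no disk; input is split round-robin rather than into 64MB pieces; and keys are assigned to reduce tasks with a hash of the key modulo R, which is the default partitioning described in the cited paper. All function names are hypothetical.

```python
import zlib
from collections import defaultdict
from itertools import groupby


def partition(key, r):
    """Choose a reduce task for a key; a stable hash keeps runs repeatable."""
    return zlib.crc32(key.encode("utf-8")) % r


def map_task(split):
    """Step 3: parse records from one input split and apply the map function."""
    return [(word, 1) for _name, text in split for word in text.split()]


def reduce_task(pairs):
    """Steps 5 and 6: sort and group by key, then apply the reduce function."""
    pairs = sorted(pairs, key=lambda kv: kv[0])
    return {key: sum(count for _key, count in group)
            for key, group in groupby(pairs, key=lambda kv: kv[0])}


def run_job(records, m, r):
    # Step 1: split the input into M pieces (round-robin here, not by size).
    splits = [records[i::m] for i in range(m)]
    # Steps 3 and 4: each map task's output is divided into R buckets.
    buckets = defaultdict(list)
    for split in splits:
        for key, value in map_task(split):
            buckets[partition(key, r)].append((key, value))
    # Steps 5 and 6: each reduce task produces one output "file".
    return [reduce_task(buckets[i]) for i in range(r)]


records = [("doc1", "Hello world"), ("doc2", "Hello there"), ("doc3", "world of data")]
outputs = run_job(records, m=2, r=3)
merged = {key: count for part in outputs for key, count in part.items()}
```

Because every occurrence of a key hashes to the same bucket, each word ends up in exactly one of the R outputs, so merging them never has to add counts together.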

Coursework: Part II

Learning Objectives:
- To develop a general architectural and operational understanding of typical production-level grid software.
- To develop the programming skills required to drive typical services on a production-level grid.

Tasks
- Download and install the GridSAM server and client
- (a) Extend some Java code stubs (which use the GridSAM Java API) to submit jobs to GridSAM and monitor them
- (b) Extend some pseudocode that describes a basic MapReduce framework for performing word counting on a number of files

File Word Count map and reduce…

Counting the number of occurrences of a given word in a collection of text files:

    Function mapFunction(fileName, fileLocation)
      matchCount = countMatches(, fileLocation/fileName)
      Return [(fileName, matchCount)]
    End Function

    Function reduceFunction(fileName, countList)
      totalCount = 0
      For Each count In countList
        totalCount += count
      Next
      Return totalCount
    End Function

- map outputs each filename plus a word occurrence count
- reduce sums together all counts emitted for each filename

Example data flow:
- Input: (file1, /some/path/), (file2, /other/path)
- map() output: (file1, 1), (file2, 4)
- reduce() output: 2 (file1), 4 (file2)
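As a rough illustration of the pseudocode above, the following Python sketch counts a word in two throwaway files. Several things here are assumptions rather than part of the coursework: the search word `TARGET_WORD`, the choice of whole-word, case-insensitive matching in `count_matches`, and the sample file contents.

```python
import os
import re
import tempfile

TARGET_WORD = "grid"  # hypothetical search word, chosen for this example only


def count_matches(word, path):
    """Count whole-word, case-insensitive occurrences of word in a text file."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    with open(path, encoding="utf-8") as handle:
        return sum(len(pattern.findall(line)) for line in handle)


def map_function(file_name, file_location):
    """Return a one-element list holding (file_name, match_count)."""
    path = os.path.join(file_location, file_name)
    return [(file_name, count_matches(TARGET_WORD, path))]


def reduce_function(file_name, count_list):
    """Sum every count emitted for one file name."""
    return sum(count_list)


# Demonstration on two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    samples = {"file1": "the grid is a Grid", "file2": "no match here"}
    for name, text in samples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    emitted = [pair for name in samples for pair in map_function(name, tmp)]

totals = {name: reduce_function(name, [c for n, c in emitted if n == name])
          for name, _count in emitted}
```

Each call to `map_function` touches only its own file, so the calls could be submitted as separate jobs and their outputs combined afterwards by `reduce_function`.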

Coursework: Part II – Installing GridSAM

Pre-Requisites
- Client and Server: Linux only (e.g. SuSE 9.0, RedHat, Debian, Ubuntu)
  - May work on other Linux distributions, but these have not been exhaustively tested
  - Tested on undergrad Linux boxes
- Requires Java JDK 6 (not JRE) or above
- Beware:
  - Firewalls blocking port 8080 and your FTP port between client and server: add exceptions
  - VPNs can cause problems with staging data to/from GridSAM

Preparation/Installation
- Java 7 recommended
  - Note: you may need to upgrade your Java
  - Ensure JAVA_HOME is set
- Install client…
  - Download gridsam-2.3.0-client.zip from coursework page
  - unzip gridsam-2.3.0-client.zip (into a file path that contains no spaces)
  - cd gridsam-2.3.0-client
  - java SetupGridSAM
- Install server (Linux only)…
  - Can just reuse your Apache Tomcat 5.5.28/6.0.32 from mgrid (see mgrid install slides)
  - Download gridsam.war from coursework page
  - Shut down Tomcat, copy gridsam.war into apache-tomcat-6.0.32/webapps and restart Tomcat
  - Can check log files in apache-tomcat-6.0.32/webapps/gridsam/WEB-INF/logs if any problems occur

Coursework Materials
- Download COMP3019-materials.tgz from coursework page
  - Copy to gridsam-2.3.0-client directory
  - Unpack; you’ll find some GridSAMExample* files
  - Run ./GridSAMExampleCompile to check compilation
  - Code not complete; that’s the coursework!
    - GridSAMExampleRun won’t run until you have done the coursework
  - Note server.domain and port in script: you need to change these to point at your server (use HTTP not HTTPS!!)
- Use the scripts and Java code as a basis
- Refer to API docs on coursework page as required
  - To obtain job status, use e.g.: jobStage = jobManager.findJobInstance(jobID).getLastKnownStage().getState().toString();
  - Doing job.getLastKnownStage().getState().toString() directly won’t work

The Coursework
- See the coursework handout on the COMP3019 page:
  - http://www.ecs.soton.ac.uk/~stc/COMP3019
- Notes for Part 1:
  - When specifying multiple arguments to your m-grid applet, note that only a single string can be passed as an argument.
  - Consider how you pass the two necessary arguments (i.e. a character and a text file) as a single argument into the applet
  - To load the text file below into your applet, package it into the jar file along with the code, and use the following in the applet:
    - InputStream in = getClass().getResourceAsStream("textfile.txt");
- Part 2 (GridSAM) Notes:
  - If you encounter problems using the GridSAM FTP server, some students have found success using a StupidFTP server (available under Ubuntu)
  - When you want to check the status of a job use e.g. jobStage = jobManager.findJobInstance(jobID).getLastKnownStage().getState().toString();
    - Doing job.getLastKnownStage().getState().toString() directly won’t work

Coursework: Part II – Running a Command Line Example

Example using File Staging
- Objectives: submit a simple job with data input and output requirements, and monitor its progress
- Diagram: the OMII GridSAM Client submits JSDL to the OMII GridSAM Server and monitors the job; the OMII GridSAM FTP Server holds 2 input files and 1 output file

JSDL Example
- Gridsam-2.3.0/examples/remotecat-staging.jsdl
- Change the FTP URLs to match your FTP server (e.g. ftp://anonymous:anonymous@localhost:55521/concat.sh):
  - … bin/concat dir2/subdir1/file2.txt stdout.txt stderr.txt dir1/file1.txt …

JSDL Example
- bin/concat, overwrite, ftp://ftp.do:55521/concat.sh
- dir1/file1.txt, overwrite, ftp://ftp.do:55521/input1.txt
- dir2/subdir1/file2.txt, overwrite, ftp://ftp.do:55521/input2.txt
- stdout.txt, overwrite, true, ftp://ftp.do:55521/stdout.txt
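Changing every FTP URL in the JSDL file by hand is error-prone, so a small text substitution may help. The sketch below is a hypothetical helper, not part of GridSAM: it treats the JSDL as plain text and swaps the server part of each ftp:// URL while leaving the file path alone. The default `anonymous:anonymous` credentials follow the example URL given earlier.

```python
import re

# Matches the scheme and authority (user info, host, port) of an FTP URL.
FTP_AUTHORITY = re.compile(r"ftp://[^/\s\"'<>]+")


def retarget_ftp_urls(jsdl_text, host, port, user="anonymous", password="anonymous"):
    """Replace the server part of every ftp:// URL, leaving the file path alone."""
    authority = f"ftp://{user}:{password}@{host}:{port}"
    # A function is used as the replacement so the text is inserted literally.
    return FTP_AUTHORITY.sub(lambda _match: authority, jsdl_text)


sample = "ftp://ftp.do:55521/concat.sh ftp://ftp.do:55521/stdout.txt"
updated = retarget_ftp_urls(sample, "localhost", 55521)
```

To apply it to a real file, read the file's text, pass it through `retarget_ftp_urls`, and write the result to a new file so the original example stays intact.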

Set up the GridSAM Client’s FTP Server
- To allow GridSAM to retrieve input and store output

In gridsam-2.3.0-client directory:

    >./gridsam.sh GridSAMFTPServer -p 55521 -d examples/
    2010-04-29 08:20:59,250 WARN [GridSAMFTPServer] (main:)../data/examples/ is exposed through FTP at ftp://anonymous@152.78.237.90:55521/
    2010-04-29 08:20:59,268 WARN [GridSAMFTPServer] (main:) Please make sure you understand the security implication of using anonymous FTP for file staging.
    FtpServer.server.config.root.dir =../data/examples/
    FtpServer.server.config.data = /home/omii/COMP3019/omii-uk-client/gridsam/ftp/ftp1215306750
    FtpServer.server.config.port = 55521
    FtpServer.server.config.self.host = 152.78.237.90
    Started FTP

- Exposes the examples directory through FTP on port 55521 (anonymous access!)
- Create input1.txt and input2.txt in this directory with some text in them

CLI Example: Submit to GridSAM Server
- Ensure Java is on your path
- In gridsam-2.3.0-client directory:
  - Submit to GridSAM server: ./gridsam.sh GridSAMSubmit -s "http://localhost:8080/gridsam/services/gridsam?wsdl" -j examples/remotecat-staging.jsdl
- Unique job ID is returned, i.e. UID is urn:gridsam:
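If the submission needs to be scripted, the command above can be wrapped as follows. This is a sketch under assumptions: it uses only the `-s` and `-j` options shown on the slide, it must be run from the gridsam-2.3.0-client directory, and it assumes the job ID appears on standard output as a `urn:gridsam:` token with no spaces in it (the exact output format is not shown here). The function names are hypothetical.

```python
import re
import subprocess

SERVICE_URL = "http://localhost:8080/gridsam/services/gridsam?wsdl"


def build_submit_command(jsdl_path, service_url=SERVICE_URL):
    """Return the argument list for the GridSAMSubmit call shown above."""
    return ["./gridsam.sh", "GridSAMSubmit", "-s", service_url, "-j", jsdl_path]


def extract_job_id(output):
    """Find the first urn:gridsam: token in the client's output, if any."""
    match = re.search(r"urn:gridsam:\S+", output)
    return match.group(0) if match else None


def submit(jsdl_path):
    """Run the client and return the job ID, or None if none was printed."""
    result = subprocess.run(build_submit_command(jsdl_path),
                            capture_output=True, text=True, check=True)
    return extract_job_id(result.stdout)


command = build_submit_command("examples/remotecat-staging.jsdl")
```

Passing the arguments as a list means the service URL, which contains a `?`, reaches the client unchanged without any shell quoting.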

CLI Example: Monitoring the Job
- Monitor job until completion: >./gridsam.sh GridSAMStatus -s "http://localhost:8080/gridsam/services/gridsam?wsdl" -j
  - The value given to -j is the entire urn:gridsam: string
- Job progress indicated by current state:
  - Pending, Staging-in, Staged-in, Active, Executed, Staging-out, Staged-out, Done
- When complete, output resides in the stdout.txt file in the examples/ directory
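Monitoring until completion amounts to a polling loop over the states listed above. The sketch below shows the shape of such a loop without calling GridSAM: `fetch_state` is a hypothetical callable standing in for whatever status query is used, and the poll interval and limit are arbitrary choices. Only "Done" is treated as final, since no other end state is listed here; a job that ends in some other state would make this loop time out.

```python
import time

FINAL_STATE = "Done"


def wait_until_done(fetch_state, poll_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Poll fetch_state() until it returns "Done"; return the states observed.

    fetch_state is a hypothetical zero-argument callable that returns the
    current state as a string, however the caller chooses to obtain it.
    """
    seen = []
    for _ in range(max_polls):
        state = fetch_state()
        if not seen or seen[-1] != state:
            seen.append(state)  # record each change of state once
        if state == FINAL_STATE:
            return seen
        sleep(poll_seconds)
    last = seen[-1] if seen else "unknown"
    raise TimeoutError(f"job not done after {max_polls} polls; last state: {last}")


# Stand-in for a real status query: replays a fixed sequence of states.
replay = iter(["Pending", "Pending", "Active", "Executed", "Done"])
history = wait_until_done(lambda: next(replay), sleep=lambda _seconds: None)
```

The `sleep` parameter exists so that the loop can be exercised without waiting; leaving it at its default gives a real delay between polls.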

What to Hand In
- Submit: source code, results files, parameter files and output
- Other parts that require written answers should form a separate document:
  - In text, Microsoft Word or PDF
  - Up to 800 words in length, not including any source or trace output
- Submission via ECS Coursework Handin system: single Zip file containing source, results, parameter files, output and written answers