1 Poly Hadoop
CSC 550, May 22, 2007
Scott Griffin, Daniel Jackson, Alexander Sideropoulos, Anton Snisarenko

2 Accomplishments
- ITS grid account: OpenPBS, Java, Subversion, Bash, Perl, Vim
- Hadoop on the ITS grid account: HDFS, node configurations
- MapReduce code
- Hadoop running natively on the ITS grid
- Hadoop on VMware images: Fedora 6, image and Hadoop configuration

3 Grid Properties
- All jobs are queued through the management node: qsub script.bsh
- The resource list can specify the physical node assignment, number of processors, allowed execution time, etc.
- A script executes on only one physical node
- The user environment is replicated on all nodes
- Shared file system
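To make the submission step concrete, a PBS job script with a resource list might look roughly like the sketch below. The node counts, walltime, and job name are invented for illustration, not taken from the project:

```shell
#!/bin/bash
# Hypothetical PBS job script; the resource values are illustrative.
#PBS -l nodes=4:ppn=2        # request 4 physical nodes, 2 processors each
#PBS -l walltime=01:00:00    # allowed execution time
#PBS -N hadoop-test          # job name

cd "$PBS_O_WORKDIR"          # PBS starts jobs in $HOME by default
./run_real_test.sh           # note: runs on only ONE of the assigned nodes
```

It would be submitted with `qsub script.bsh`; the `#PBS` directives are read by the scheduler, while the script body executes on a single node.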

4 Hadoop on Grid: Issues & Solutions
Shared file system vs. local file system
Issues:
- A single configuration file is shared by all Hadoop nodes
- Hadoop DataNodes need "local" directories, but the file system is shared
Solution:
- Create separate directories using each node's hostname
- Supply the hostname via a Java system property
- Use Java system property expansion in the Hadoop configuration file
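A sketch of how the pieces fit together: pass the node's hostname to the JVM as a system property, and reference it with `${...}` expansion in the configuration file. The paths and the `host.name` property name are invented for this example; `dfs.data.dir` is the Hadoop-era property for DataNode storage directories:

```shell
# Sketch only: /sharedfs path and "host.name" are example values.
export HADOOP_OPTS="-Dhost.name=$(hostname)"   # supply hostname to the JVM

cat > hadoop-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <!-- Hadoop expands ${host.name} from the Java system property,
         giving each DataNode its own directory on the shared FS -->
    <value>/sharedfs/hadoop/${host.name}/dfs/data</value>
  </property>
</configuration>
EOF
```

With this in place, every node reads the same shared hadoop-site.xml but resolves a distinct data directory.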

5 Hadoop on Grid: Issues & Solutions (cont.)
Pseudo-dynamic namenode selection
Issues:
- Physical node assignments are not guaranteed
- The Hadoop configuration file specifies which nodes to use
Solution:
- On-the-fly modification of the Hadoop configuration file (yay for XML!)
- On-the-fly modification of the Hadoop masters and slaves files
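A minimal sketch of that on-the-fly rewrite, using a fake node list in place of the real `$PBS_NODEFILE` that PBS provides (node names, file names, and the `__MASTER__` placeholder are all invented for the example):

```shell
# Stand-in for $PBS_NODEFILE, which lists the nodes PBS assigned the job.
printf 'node07\nnode12\nnode15\n' > nodefile

# First assigned node becomes the namenode/master; the rest are slaves.
head -n 1 nodefile > masters
tail -n +2 nodefile > slaves

# Patch the master's hostname into the config on the fly
# (update_sitexml.pl did this against the real hadoop-site.xml;
#  here we edit a dummy file).
echo 'fs.default.name=hdfs://__MASTER__:9000' > site.conf
sed -i "s/__MASTER__/$(head -n 1 nodefile)/" site.conf
```

Because the rewrite happens at job start, it works no matter which physical nodes the scheduler hands out.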

6 Hadoop on Grid: Scripts
- run_createdirs.sh: creates directories for each physical node
- update_sitexml.pl: dynamically updates hadoop-site.xml
- run_real_test.sh: formats HDFS, starts job management and DFS, puts the dataset on DFS, runs the MapReduce jobs, exports the output, and stops MapReduce and DFS
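run_createdirs.sh might look roughly like the following; the base path, node names, and subdirectory layout are invented for the sketch (in the real script the node list would come from PBS):

```shell
BASE=/tmp/hadoop-grid-demo               # stand-in for the shared-FS base path
for host in node07 node12 node15; do     # would come from the PBS node list
  # one private directory tree per physical node on the shared file system
  mkdir -p "$BASE/$host/dfs/data" "$BASE/$host/dfs/name" "$BASE/$host/mapred/local"
done
```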

7 MapReduce Progress
- Pushing the dataset onto the Hadoop FS: a simple command, done in the qsub script
- MapReduce Java code
- Selecting the number of tasks: map tasks = 10 per node, reduce tasks = 2 per node
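Scaled by node count, those per-node settings work out as below. The four-node count and jar name are hypothetical; `mapred.map.tasks` / `mapred.reduce.tasks` are the Hadoop-era property names for these knobs, and the `hadoop jar` line is sketched as a comment rather than run:

```shell
NODES=4                      # hypothetical cluster size
MAPS=$((NODES * 10))         # 10 map tasks per node
REDUCES=$((NODES * 2))       # 2 reduce tasks per node
echo "maps=$MAPS reduces=$REDUCES"

# The job would then be submitted roughly like:
# hadoop jar poly-hadoop.jar \
#   -D mapred.map.tasks=$MAPS -D mapred.reduce.tasks=$REDUCES ...
```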

8 Map Code

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.log4j.Logger;

public class UserRatingMapper extends MapReduceBase implements Mapper {
    // Matches "userId,rating,YYYY-MM-DD" lines in the dataset
    private static Pattern userRatingDate =
        Pattern.compile("^(\\d+),(\\d+),\\d{4}-\\d{2}-\\d{2}$");
    private Logger log = Logger.getLogger(this.getClass());

    public void map(WritableComparable key, Writable values,
                    OutputCollector output, Reporter reporter)
            throws IOException {
        String line = ((Text)values).toString();
        Matcher userRating = userRatingDate.matcher(line);
        IntWritable userId = new IntWritable();
        IntWritable rating = new IntWritable();
        if (line.matches("^\\d+:$")) {
            // Movie-ID header line: nothing to emit
        } else if (userRating.matches()) {
            userId.set(Integer.parseInt(userRating.group(1)));
            rating.set(Integer.parseInt(userRating.group(2)));
            output.collect(userId, rating);
        } else {
            log.error("Unexpected input: " + line);
        }
    }
}
```

9 Reduce Code

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AverageValueReducer extends MapReduceBase implements Reducer {
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter)
            throws IOException {
        int sum = 0, count = 0;
        while (values.hasNext()) {
            sum += ((IntWritable)values.next()).get();
            ++count;
        }
        // Emit the average rating for this key
        output.collect(key, new FloatWritable(((float)sum)/count));
    }
}
```

10 VMware Image Progress
- Set up a Fedora Core 6 VM image
- Configured the image to always create a new key when moved
- Turned the firewall off on the image
- Installed and configured Hadoop: master, slaves, HDFS namespace, output directories; formatted HDFS
- Successfully started HDFS and MapReduce with a master and one slave
- Ran a test job with 99 input files

11 VMware Setup on the Grid
- Multiple copies of the image are needed on the grid: one Namenode/JobTracker image, many Datanode/TaskTracker images
- Each copy needs a different MAC address
- Starting up Hadoop: start each image copy on a separate blade, obtain the images' IPs from the DHCP server and place them in each image's config files, then start HDFS and MapReduce from the master

12 VMware Issues
Issues:
- Slaves would not connect to the master
- The master would not start after formatting the HDFS
- Root access is needed to install VMware Player on the grid
- Images too big / not enough disk space
Solutions:
- Turn off the firewall
- Delete all files from the namespace directory, then format the HDFS
- E-mail the admin
- Reduce the virtual hard drive size in the image

13 Evaluation Techniques
- Processing time across the different configurations
- Optimizations that can be made: number of map tasks vs. reduce tasks per node
- Explanation of preliminary data: overhead with redundancy on the grid
- We're all set up and ready to start our experiments as soon as jkempena gives us our nodes back

14 Timeline
- Weeks 5-6: install/configure environment, develop code
- Weeks 7-8: run experiments
- Weeks 9-10: analyze data, write paper, present results

15 Questions?

