
1 Teaching HDFS/MapReduce Systems Concepts to Undergraduates Linh B. Ngo*, Edward B. Duffy**, Amy W. Apon* * School of Computing, Clemson University ** Clemson Computing and Information Technology, Clemson University

2 Contents -Introduction and Learning Objectives -Challenges -Hadoop Computing Platform: Options and Solution -Module Content: Lectures, Assignments, Data -Student Feedback -Module Content: Project -Ongoing and Future Work

3 Introduction and Learning Objectives -Hadoop/MapReduce is an important current technology in the area of data-intensive computing -Learning objectives: -Understand the challenges of data-intensive computing -Become familiar with the Hadoop Distributed File System (HDFS), the underlying driver of MapReduce -Understand the MapReduce (MR) programming model -Understand the scalability and performance of MR programs on HDFS

4 Challenges -Provide students with a high performance, stable, and robust Hadoop computing platform -Balance lecture and hands-on lab hours -Demonstrate the technical relationship between MapReduce and HDFS

5 Computing Platform Options -MapReduce parallel programming interface -WebMapReduce is an example -Enables study of the MR programming model at an introductory level -Does not enable the study of HDFS for advanced students -Dedicated shared Hadoop cluster with individual accounts -Multiple student programs compete for resources -Individual errors affect other students -Dedicated cluster that supports multiple virtual Hadoop clusters -Not supported by Clemson’s supercomputer configuration

6 Computing Platform Solution -Modification of SDSC’s myHadoop -Individual Hadoop platform deployment for each student in the class -First setup: -Medium amount of editing needed to set up -Numerous errors due to typos or misconfiguration -Second setup: -Minimal editing needed (one line) -Only a few students encountered errors due to typos

7 Lecture and Hands-on Labs -Fall 2012: 5 class hours -1 MR lecture, 1 lab, 1 HDFS lecture, 1 lab, 1 advanced MR optimization lecture -Lab time not sufficient due to problems with the Hadoop computing platforms -Spring 2013: 5 class hours -Lab time still not sufficient, due to errors in modifying myHadoop scripts -Fall 2013: 7 class hours -1 MR lecture, 2 labs, 1 HDFS lecture, 2 labs, 1 HBase/Hive lecture

8 Module Content: Lectures -Reused available online material with additional clarification -Slides from UMD, Jimmy Lin -Strong emphasis on the following points: -The MR programming paradigm is a programming model that handles data parallelization -The HDFS infrastructure provides a robust and resilient way to distribute big data evenly across a cluster -The MR library takes advantage of HDFS and the MR programming paradigm to enable programmers to write applications that conveniently and transparently handle big data -Data locality is the big theme in working with big data
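The MR programming model emphasized above can be illustrated with a minimal in-memory word count. This is a hedged sketch of the map/shuffle/reduce phases only, not the actual Hadoop Java API (real jobs implement Mapper and Reducer classes and the framework performs the shuffle across HDFS blocks); all function names here are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate pairs by key, as the framework would."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    """Reduce: sum the counts emitted for each word."""
    for word, counts in grouped:
        yield word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = dict(reduce_phase(shuffle(map_phase(lines))))
print(counts["the"])  # 3
```

The point the lectures stress carries over directly: the programmer writes only the map and reduce logic, while parallelization and grouping are handled by the framework.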

9 Module Content: Lectures (architecture diagram) -Each worker node (HDD/CPU/RAM) runs an HDFS DataNode daemon controlling block location and a MapReduce TaskTracker daemon executing tasks on blocks -HDFS abstractions: directories and files (File 01, File 02, File 03); physical view at the Linux file system: blk_xxx block files -NameNode: block metadata lives in memory; DataNodes report block information to the NameNode -JobTracker: detailed job progress lives in memory; TaskTrackers report progress to the JobTracker -The JobTracker provides the NameNode with file/directory paths, receives block-level information, and assigns and facilitates map/reduce tasks on TaskTrackers based on block location information from the NameNode -The NameNode and JobTracker could be the same machine
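The locality-aware scheduling sketched in the diagram above can be made concrete with a toy simulation. All names here are assumptions for illustration (this is not Hadoop's actual scheduler): the block-to-replica map stands in for what the NameNode reports, and the assignment loop stands in for the JobTracker preferring a TaskTracker that already holds a local replica, i.e. moving computation to the data.

```python
# block_id -> replica locations, as the NameNode would report them
block_locations = {
    "blk_001": ["node1", "node2", "node3"],
    "blk_002": ["node2", "node4", "node5"],
    "blk_003": ["node1", "node5", "node6"],
}

def assign_map_tasks(block_locations, free_slots):
    """Assign each block's map task to a TaskTracker with a free slot,
    preferring a node that holds a local replica of the block; otherwise
    fall back to the node with the most free slots (a remote read)."""
    assignments = {}
    for block, replicas in block_locations.items():
        local = [n for n in replicas if free_slots.get(n, 0) > 0]
        node = local[0] if local else max(free_slots, key=free_slots.get)
        assignments[block] = node
        free_slots[node] -= 1
    return assignments

slots = {"node1": 1, "node2": 1, "node4": 1}
assignments = assign_map_tasks(block_locations, slots)
print(assignments)
```

With these inputs, the first two blocks land on nodes holding local replicas, while blk_003 must fall back to a non-local node because node1's slot is already taken; this is exactly the trade-off the lectures use to motivate data locality.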

10 Module Content: Assignments and Data -Assignments -One MR programming assignment based on existing code that familiarizes students with the MR API and program flow -One MR/HDFS programming assignment that requires students to write an MR program and deploy it to run on a Hadoop computing platform -Data -Strive to be realistic -Big enough, but not too big -Airline Traffic Data (12 GB), Google Trace (200 GB), Yahoo Music Rating (10 GB), Movie Rating (250 MB)

11 Student Feedback -In-class voluntary surveys help to encourage all students to participate (as compared to out-of-class online survey) -IRB approval for survey -Questions addressing: -Improvements in technical skills -Improvements in understanding about Hadoop/MR -Time taken to complete Hadoop/MR assignments -Time taken to set up Hadoop on Palmetto -Usefulness of guides/lectures/labs -Relevancy of Hadoop/MR topics -Appropriate level to begin teaching Hadoop/MR

12 Student Feedback

Question                                        Fall 2013        Spring 2014
Improvements in Technical Skills (0-10)         Start   End      Start   End
  Java                                          6.62    7.28     6.06    7.32
  Linux                                         5.86    7.1      6.67    7.73
  Networking Concepts                           4.38    6.29     5.26    7.54
  Hadoop MapReduce Concepts                     0.03    4.53     0.3     6.67
Time taken to complete Hadoop/MR
assignments (hours)
  Assignment 1                                  invalid*         7.2
  Assignment 2                                  invalid*         7.9
Time taken to set up Hadoop (hours)             2.5              2.1
Helpfulness of materials (1-4: not useful
to very useful)
  Lectures                                      3                3.6
  MapReduce Lab                                 3.6              3.3
  Hadoop-on-Palmetto guide and lab              2.9              3.13
Relevancy of Hadoop/MR to Distributed and
Cluster Computing                               7.97             8.77
Appropriate level of undergraduate              2.91             2.79

* Fall 2013 responses invalid due to incorrect phrasing of the question

13 Student Feedback Primary student requests: Fall 2013 -More labs -More details in HDFS guide Spring 2014 -FAQ to address common configuration errors/interpretation of MR compilation errors -More time for projects -Reduced dependency between two Hadoop/MR assignments

14 Module Content: Project -Was added to the course in Spring 2014 -Project in place of assignments -Three categories: -Data Analytics -Big data set -Interesting analytic problem relating to the data -Performance Comparison -Big data set -Comparison between Hadoop MapReduce and MPI -System Implementation -Augmenting myHadoop with additional software modules: Spark, HBase, or Hadoop 2.0 -Required IEEE two-column conference format for reports

15 Module Content: Project -Data Sets: -Airline Traffic Data (12 GB) -NOAA Global Daily Weather Data (15-20 GB) -Amazon Food Reviews (354 MB, hundreds of thousands of entries) -Amazon Movie Reviews (8.7 GB, millions of entries) -Meme Trackers (53 GB, text) -Million Song Dataset (196 GB, HDF5 compressed) -Google Trace Data (~171 GB)

16 Module Content: Project -Comparing performance between Hadoop and MPI-MR (Sandia) using Amazon Movie Reviews -Configuration and installation of Hadoop 2.0 on myHadoop -Amazon Crawler using iterative implementation of Hadoop MR -Performance comparison between Hadoop/MPI/MPI-IO on NOAA data -Performance comparison between Hadoop/MPI/MPI-IO on Google Trace data

17 Module Content: Project -Positive Evaluation -Appropriateness of scope: 8.17/10 -Appropriateness of difficulty: 7.74/10 -Applicability of Hadoop/MR: 8.94/10 -Student Feedback -An integral element of the module/course -More time is needed -Start the project earlier in the semester -Fewer assignments, more project work

18 Ongoing Work -Transition to Hadoop 2.0 -Inclusion of other current distributed and data-intensive technologies: -Spark/Shark for in-memory computing -Cascade/Tez for workflow computing -Swift? -Inclusion of additional real-world data and problems in student projects

19 Questions? Fall 2012: Spring 2013: Fall 2013: Spring 2014:
