THE HOG LANGUAGE
A scripting MapReduce language.
Jason Halpern (Testing/Validation)
Samuel Messing (Project Manager)
Benjamin Rapaport (System Architect)
Kurry Tran (System Integrator)
Paul Tylkin (Language Guru)

OUTLINE
1. Introduction (Sam)
2. Syntax and Semantics (Paul)
3. Compiler Architecture (Ben)
4. Runtime Environment (Kurry)
5. Testing (Jason)
6. Demo
7. Conclusions

INTRODUCTION SAMUEL MESSING (PROJECT MANAGER)

MOTIVATION
Say you’re a corporation with data from your mail server, and you want to find out the average amount of time a client waits for a response.
Say you’re a statistician with millions upon millions of data points, and you need descriptive statistics about your sample.

IT’S TIME TO THINK DISTRIBUTEDLY.
More and more, we’re looking to distributed-computation frameworks such as Apache’s Hadoop MapReduce™ for ways to process massive amounts of data as quickly as possible.

SAY YOU WANT TO…
Sort 400K numbers stored in a text file, e.g.,
~ > head -12 numbers.txt

JUST WRITE ELEVEN LINES OF CODE
Eleven lines of Hog code are enough to:
1. Read in gigabytes of data formatted as,
2. Distribute the data over a highly scalable network of computers,
3. Synchronize computation across multiple machines to sort and remove duplicate numbers,
4. Store the sorted set of numbers on a fault-tolerant distributed file system.
Running your sort program is as easy as typing,
~ > Hog Sort.hog input/numbers.txt

PROJECT DEVELOPMENT

THE LANGUAGE PAUL TYLKIN (LANGUAGE GURU)

PROGRAM
A Hog program has four parts:
1. User-defined functions
2. Definition of the map stage
3. Definition of the reduce stage
4. A main block that calls mapReduce() and performs other tasks

WORD COUNT
Map stage:
    (int lineNum, text line) -> (text, int) {
        # for every word on this line,
        # emit that word and the number '1'
        foreach text word in line.tokenize(" ") {
            emit(word, 1);
        }
    }

WORD COUNT
Reduce stage:
    (text word, iter values) -> (text, int) {
        # initialize count to zero
        int count = 0;
        while (values.hasNext()) {
            # for every instance of '1' for this word, add to count.
            count = count + values.next();
        }
        # emit the count for this particular word
        emit(word, count);
    }

WORD COUNT
Main block:
    {
        # call map reduce
        mapReduce();
    }
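To make the data flow of the word count program concrete, here is a minimal single-process sketch in Python. It is not how Hog or Hadoop execute the job; the names map_stage, reduce_stage, and map_reduce are illustrative, and the grouping step in map_reduce stands in for the shuffle that the framework performs between the two stages.

```python
from collections import defaultdict

def map_stage(line_num, line):
    """Emit (word, 1) for every word on the line, like the Hog map stage."""
    for word in line.split(" "):
        if word:  # skip empty tokens produced by repeated spaces
            yield word, 1

def reduce_stage(word, values):
    """Add up the 1s emitted for this word, like the Hog reduce stage."""
    count = 0
    for value in values:
        count = count + value
    yield word, count

def map_reduce(lines):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for line_num, line in enumerate(lines):
        for key, value in map_stage(line_num, line):
            groups[key].append(value)
    results = []
    for key in sorted(groups):  # reducers see keys in sorted order
        results.extend(reduce_stage(key, iter(groups[key])))
    return results

counts = map_reduce(["the hog the", "hog language"])
print(counts)  # [('hog', 2), ('language', 1), ('the', 2)]
```

The reduce stage never sees individual lines, only a key and an iterator over every value emitted for that key, which is why the Hog version loops on values.hasNext().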

USER-DEFINED FUNCTIONS
    {
        int fib(int n) {
            if (n == 0) {
                return 1;
            } elseif (n == 1) {
                return 1;
            } else {
                return fib(n-1) + fib(n-2);
            }
        }

USER-DEFINED FUNCTIONS
        list reverseList(list oldList) {
            list newList;
            for (int i = oldList.size() - 1; i >= 0; i--) {
                newList.add(oldList.get(i));
            }
            return newList;
        }
    } # end of functions

A SIMPLE DISTRIBUTED SORT
Map stage:
    (int lineNum, text line) -> (text, text) {
        foreach text number in line.tokenize(" ") {
            emit(number, number);
        }
    }
Reduce stage:
    (text number, iter garbage) -> (text, text) {
        emit(number, "");
    }
Main block:
    {
        mapReduce();
    }
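The sort program does no explicit sorting. It relies on the framework grouping and ordering the keys between the map and reduce stages. The Python sketch below imitates that behavior in a single process; distributed_sort and its helpers are illustrative names, not part of Hog. It also assumes that Hog's text keys compare as strings, as Hadoop's Text keys do, so unpadded numbers come out in lexicographic rather than numeric order.

```python
def map_stage(line_num, line):
    """Emit each number as both key and value, like the Hog map stage."""
    for number in line.split(" "):
        if number:
            yield number, number

def reduce_stage(number, garbage):
    """Emit each distinct key once and ignore the grouped values."""
    yield number, ""

def distributed_sort(lines):
    """Single-process stand-in: group by key, visit keys in sorted order."""
    groups = {}
    for line_num, line in enumerate(lines):
        for key, value in map_stage(line_num, line):
            groups.setdefault(key, []).append(value)
    output = []
    for key in sorted(groups):  # stand-in for the framework's sort by key
        output.extend(reduce_stage(key, iter(groups[key])))
    return [key for key, _ in output]

sorted_text = distributed_sort(["9 10 3", "3 9"])
sorted_padded = distributed_sort(["09 10 03", "03 09"])
print(sorted_text)    # ['10', '3', '9']
print(sorted_padded)  # ['03', '09', '10']
```

Duplicates disappear because all copies of a number share one key and the reduce stage emits once per key. Under the string-comparison assumption, zero-padding the input is one way to make the text order match the numeric order.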

ARCHITECTURE BENJAMIN RAPAPORT (SYSTEM ARCHITECT)

HOG PLATFORM ARCHITECTURE
Hog Source -> Hog Compiler -> Hog.java -> Java Compiler -> Hog.jar
Hog.jar runs on the Hadoop Framework, whose Map and Reduce stages turn the Input into the Output.

HOG COMPILER ARCHITECTURE
Hog Source -> Lexer -> Token Stream -> Parser -> AST -> Semantic Analyzer -> Java Generating Visitor -> Java MapReduce Program
The Semantic Analyzer consists of a Symbol Table Visitor and a Type Checking Visitor. Together with the Symbol Table, they turn the AST into a Partially Decorated AST and then a Fully Decorated AST.
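The two semantic-analysis passes can be illustrated with a small Python sketch. It is loosely modeled on the stages named above and is not the Hog compiler's code. The node classes (Literal, Var, Decl) and the error messages are hypothetical, and plain isinstance dispatch stands in for a full visitor pattern.

```python
from dataclasses import dataclass

@dataclass
class Literal:
    value: object      # an int or a str

@dataclass
class Var:
    name: str

@dataclass
class Decl:
    type_name: str     # "int" or "text"
    name: str
    init: object       # a Literal or a Var

class SymbolTableVisitor:
    """First pass: record declared names and report undeclared uses."""
    def __init__(self):
        self.symbols = {}
        self.errors = []

    def visit(self, node):
        if isinstance(node, Decl):
            self.visit(node.init)  # check the initializer before declaring
            self.symbols[node.name] = node.type_name
        elif isinstance(node, Var) and node.name not in self.symbols:
            self.errors.append(f"undeclared variable: {node.name}")

class TypeCheckingVisitor:
    """Second pass: compute expression types and report mismatches."""
    def __init__(self, symbols):
        self.symbols = symbols
        self.errors = []

    def visit(self, node):
        if isinstance(node, Literal):
            return "int" if isinstance(node.value, int) else "text"
        if isinstance(node, Var):
            return self.symbols.get(node.name)  # None if undeclared
        init_type = self.visit(node.init)       # node is a Decl here
        if init_type is not None and init_type != node.type_name:
            self.errors.append(f"type mismatch in declaration of {node.name}")
        return node.type_name

def analyze(program):
    """Run both passes over a list of statements and collect all errors."""
    table_pass = SymbolTableVisitor()
    for statement in program:
        table_pass.visit(statement)
    type_pass = TypeCheckingVisitor(table_pass.symbols)
    for statement in program:
        type_pass.visit(statement)
    return table_pass.errors + type_pass.errors

program = [
    Decl("int", "count", Literal(0)),
    Decl("text", "word", Var("count")),    # int value in a text declaration
    Decl("int", "total", Var("missing")),  # uses a name never declared
]
errors = analyze(program)
print(errors)
```

Splitting the work into two passes means the type checker can assume every resolvable name is already in the symbol table, so each pass reports only its own kind of error.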

RUNTIME KURRY TRAN (SYSTEM INTEGRATOR)

MAKEFILE AND SHELLSCRIPT
1. Hog Compiler – Compiles Hog Source to Java Source
2. Java Compiler – Compiles Java Source with Hadoop Jars
3. Copies Input Data into HDFS
4. Executes Job on Hadoop Cluster
5. Reports Results to User

RUNTIME ENVIRONMENT
JVM                             Default Memory Used (MB)   Memory Used for 8 Processors (MB)
Datanode                        1,000                      1,000
Tasktracker                     1,000                      1,000
Tasktracker Child Map Task      2 x 200                    7 x 400
Tasktracker Child Reduce Task   2 x 200                    7 x 400
Total                           2,800                      7,600
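The totals in the table can be checked with a few lines of Python. The function name worker_memory_mb and its parameters are illustrative. The sketch assumes the Datanode and Tasktracker daemons use 1,000 MB each in both configurations, which is what the stated totals imply.

```python
def worker_memory_mb(map_slots, reduce_slots, child_heap_mb,
                     datanode_mb=1000, tasktracker_mb=1000):
    """Total JVM memory on one worker node, in MB."""
    # Every map or reduce slot runs in its own child JVM of the same size.
    children_mb = (map_slots + reduce_slots) * child_heap_mb
    return datanode_mb + tasktracker_mb + children_mb

default_total = worker_memory_mb(map_slots=2, reduce_slots=2, child_heap_mb=200)
eight_core_total = worker_memory_mb(map_slots=7, reduce_slots=7, child_heap_mb=400)
print(default_total)     # 2800
print(eight_core_total)  # 7600
```

The child task JVMs dominate the eight-processor configuration: 5,600 MB of the 7,600 MB total.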

TESTING JASON HALPERN (TESTING/VALIDATION)

ITERATIVE TESTING CYCLE
White Box Tests: test internal structure (token streams, nodes, ASTs)
Black Box Tests: test functionality
Six Phases of Unit Testing (JUnit):
1. Lexer Testing
2. Parser Testing
3. AST Testing
4. Type Checker Testing
5. Symbol Table Testing
6. Code Generation Testing

INTEGRATION TESTING
Sample Programs: Word Count, Sort, Log Processing
Exception Handling and Errors: Undeclared Variables, Invalid Arguments, Type Mismatch
Testing on Amazon Elastic MapReduce:
1. Upload Compiled Jar from Hog Program
2. Create Job Flow and Launch EC2 Instances
3. Analyze Output Files

DEMO

CONCLUSIONS THE HOG TEAM

CONCLUSIONS
Modularity is key.
Expend the effort to reduce development time.
Pare down your goals as much as possible in the beginning, and allow yourself to not know at every stage how your language will develop.
Work in the same room as your teammates.

THANK YOU!

HADOOP ARCHITECTURE
A small Hadoop cluster includes a single master and multiple worker nodes.
Master Node – Runs the JobTracker, TaskTracker, NameNode, and DataNode.
DataNode – Sends blocks of data over the network, using the TCP/IP layer for communication; clients use RPC to communicate with each other.
JobTracker – Sends MapReduce tasks to nodes in the cluster.

HADOOP ARCHITECTURE (CONTINUED)
NameNode – Keeps the directory tree of all files in the file system, and tracks where file data is kept.
TaskTracker – A node in the cluster that accepts tasks. The TaskTracker spawns separate JVM processes to do the work, so that a process failure does not take down the TaskTracker. When the process finishes, successfully or not, the TaskTracker notifies the JobTracker.
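The process isolation described for the TaskTracker can be sketched in Python. This is only an analogy: it launches child Python interpreters rather than JVMs. The Tracker class, run_task, and the task identifiers are hypothetical, and the reports list plays the role of the notifications sent to the JobTracker.

```python
import subprocess
import sys

def run_task(python_source):
    """Run one task in a separate interpreter process and report its outcome."""
    result = subprocess.run([sys.executable, "-c", python_source],
                            capture_output=True)
    return "succeeded" if result.returncode == 0 else "failed"

class Tracker:
    """Stand-in for a TaskTracker that reports every outcome it observes."""
    def __init__(self):
        self.reports = []  # plays the role of notifications to the JobTracker

    def execute(self, task_id, python_source):
        status = run_task(python_source)
        # Report whether or not the child process finished successfully.
        self.reports.append((task_id, status))
        return status

tracker = Tracker()
tracker.execute("task-1", "print(sum(range(10)))")
tracker.execute("task-2", "import sys; sys.exit(1)")  # child exits with an error
tracker.execute("task-3", "pass")                      # tracker still accepts work
print(tracker.reports)
```

Because each task runs in its own process, the failing second task only produces a "failed" report, and the tracker goes on to run the third task.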

PERFORMANCE BENEFITS
Improves CPU Utilization
Node Failure Recovery
Data Awareness
Portability
Six Scheduling Priorities