
Presentation transcript:

Presenter: Joshan V John
Paper authors: Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan & Tien N. Nguyen, Iowa State University, USA
Instructor: Christoph Csallner
Joshan Valayil John | The University of Texas at Arlington | CSE 6324

Agenda
- Motivation
- Ultra-large-scale software repositories
- Barriers to mining software repositories
- Solution - Boa
- Goals of Boa
- Boa Architecture
- Evaluation

Motivation
- The big-3 software repositories are known to have close to 1 million projects.
- They contain a wealth of software and information about software.
- Systematically extracting relevant data from these repositories and analyzing it to test hypotheses is hard.
- Boa, a domain-specific language and infrastructure, was developed to ease testing of hypotheses related to mining software repositories.

Ultra-large-scale Software Repositories

Why analyze software repositories?
- Curiosity
- Identify patterns
- Forecasting
- Plan for better designs
- Empirical validation

Barriers to mining software repositories
- Develop programming expertise to access version control systems.
- Establish infrastructure to store downloaded data from software repositories.
- Develop an application to access this local data.
- Improve scalability of analysis infrastructure to process ultra-large-scale data.

Barriers to mining software repositories
- Experiments are often irreproducible.
- Low reusability of experimental infrastructure.
- Lack of systematic curation leads to loss of experimental data.
- Building analysis infrastructure to process ultra-large-scale data efficiently can be very hard.

Solution - Boa
- Boa: a domain-specific language and infrastructure designed to analyze ultra-large-scale software repositories.

Goals of Boa
- Easy to use
- Better abstractions
- Efficient & scalable
- Enhances reproducibility

A Research Question
- Consider a program that answers: “What are the churn rates for all Java projects that use SVN?”
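To make the research question concrete, here is a minimal Python sketch on toy in-memory data. It is not Boa code and not the authors' program. It assumes that "churn rate" means the average number of files changed per revision, and the record layout (`languages`, `repo_kind`, `revisions`) is hypothetical.

```python
from statistics import mean

# Hypothetical toy data: each project lists its languages, its repository
# kind, and, for every revision, the files that revision touched.
projects = [
    {"id": "p1", "languages": ["Java"], "repo_kind": "SVN",
     "revisions": [["A.java", "B.java"], ["A.java"],
                   ["C.java", "D.java", "E.java"]]},
    {"id": "p2", "languages": ["Python"], "repo_kind": "SVN",
     "revisions": [["x.py"]]},
    {"id": "p3", "languages": ["Java", "C"], "repo_kind": "Git",
     "revisions": [["Main.java"]]},
]


def churn_rates(projects):
    """Return {project id: mean files changed per revision} for Java/SVN projects."""
    rates = {}
    for p in projects:
        is_java = "Java" in p["languages"]
        uses_svn = p["repo_kind"] == "SVN"
        # Skip projects without revisions: a mean over nothing is undefined.
        if is_java and uses_svn and p["revisions"]:
            rates[p["id"]] = mean(len(files) for files in p["revisions"])
    return rates


rates = churn_rates(projects)
```

Only `p1` is both a Java project and hosted on SVN, so it is the only project that receives a rate. A real analysis would also have to fetch and parse repository data, which this sketch leaves out.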

Solution in Java
- Full program over 70 lines of code.
- Uses JSON and SVN libraries.
- Runs sequentially.
- Takes over 24 hours.
- Takes almost 3 hours with data locally cached.
- Can be parallelized, but very complex.

Solution in Boa
- Simple program, 6 lines of code.
- Hides implementation specifics.
- Auto parallelization, results in 1 minute.
- Results can be easily reproduced by publishing these small programs with the data sets used.

Performance Results

Boa Architecture

Boa Architecture
- Three main components:
  - The Boa Language
  - Boa Compiler & Runtime
  - Supporting data infrastructure

The Boa Language
- Domain-specific types
- MapReduce support
- Quantifiers
- User-defined functions
- Output aggregators

Boa Language: Domain-Specific Types
- Provides several domain-specific types that abstract away the details of mining software repositories.

Boa Language: MapReduce Support
- Computations are specified via two user-defined functions:
  - Mapper: takes key-value pairs as input and produces key-value pairs as output.
  - Reducer: consumes the mapper's output and aggregates the data by key.
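The mapper/reducer split described above can be sketched in plain Python. This is a single-process illustration of the general MapReduce pattern, not Boa's or Hadoop's actual API. The function names and the toy records are assumptions made for the example.

```python
from collections import defaultdict


def mapper(project_id, revisions):
    """Emit one (key, value) pair per revision: (project id, files changed)."""
    for files in revisions:
        yield project_id, len(files)


def reducer(key, values):
    """Aggregate every value emitted for one key; here, a simple sum."""
    return key, sum(values)


def run_mapreduce(records):
    """Run mapper over all records, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # Shuffle step: collect mapper output under its key.
            grouped[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in grouped.items())


# Hypothetical input: (project id, list of revisions, each a list of files).
records = [("p1", [["a", "b"], ["c"]]), ("p2", [["d"]])]
totals = run_mapreduce(records)
```

In a real cluster the map calls and the per-key reduce calls run on different machines. The grouping loop stands in for the shuffle phase that the framework performs between the two.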

Boa Language: Quantifiers
- Boa defines the quantifiers:
  - exists
  - foreach
  - ifall
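As a rough intuition for the three quantifiers, the Python analogues below may help. They are not Boa syntax. They rest on the assumption that `exists` fires when some element satisfies a condition, `foreach` fires once per satisfying element, and `ifall` fires only when every element satisfies it.

```python
# Hypothetical list of languages attached to one project.
languages = ["Java", "C", "Java"]

# exists: act once if at least one element satisfies the condition.
has_java = any(lang == "Java" for lang in languages)

# foreach: act once per element that satisfies the condition.
java_positions = [i for i, lang in enumerate(languages) if lang == "Java"]

# ifall: act only if every element satisfies the condition.
only_java = all(lang == "Java" for lang in languages)
```

Here `has_java` is true, `java_positions` holds the two matching indices, and `only_java` is false because of the `"C"` entry.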

Boa Language: User-Defined Functions
- Users can define their own mining algorithms.
- Facilitates code reuse.

Boa Language: Output Aggregators
- Output can be indexed.
- Output is defined in terms of predefined data aggregators.
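One way to picture an indexed output backed by a predefined aggregator is the Python sketch below. The `IndexedAggregator` class and its `emit`/`results` methods are hypothetical names invented for illustration. They are not part of Boa.

```python
from collections import defaultdict


class IndexedAggregator:
    """Hypothetical stand-in for an indexed output with a fixed aggregator."""

    def __init__(self, combine):
        self._values = defaultdict(list)
        self._combine = combine  # e.g. sum, max, or a mean function

    def emit(self, index, value):
        """Send one value to the output slot named by index."""
        self._values[index].append(value)

    def results(self):
        """Apply the aggregator to everything emitted under each index."""
        return {idx: self._combine(vals) for idx, vals in self._values.items()}


def mean(values):
    return sum(values) / len(values)


totals = IndexedAggregator(sum)
averages = IndexedAggregator(mean)
for project, files_changed in [("p1", 2), ("p1", 4), ("p2", 1)]:
    totals.emit(project, files_changed)
    averages.emit(project, files_changed)
```

The analysis code only emits values. Choosing `sum` or `mean` when the output is declared decides how those values are combined, which is what lets the runtime perform the aggregation in the reduce phase.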

Boa’s Supporting Infrastructure
- Compiler & Runtime
- Data Infrastructure
- Web-based interface

Boa’s Compiler & Runtime
- The initial implementation was based on the Sizzle compiler and framework.
- Sizzle is an open-source Java implementation of the Sawzall language.
- Sizzle provides support for generating programs that run on the Hadoop open-source MapReduce framework.

Boa’s Data Infrastructure
- Local cache of repository information.
- First step: locally replicate the data.
- Second step: run the caching translator to convert the data into the format the framework requires.
- Input (JSON file + SVN repositories) -> Output (Hadoop SequenceFile)

Boa’s Web-Based Interface
- Submit programs.
- Compile & run them on their clusters.
- Each submission creates a job in the system.

Evaluation
- Programs were executed on a Hadoop installation.
- The cluster was not tuned for performance, except for two settings: the maximum number of map tasks on each compute node was set equal to the number of cores on that node, and the VM heap size was increased.

Evaluation: Applicability
- Research Question 1: Does Boa help researchers analyze ultra-large-scale software repositories?
- A set of 21 tasks in four categories was examined:
  - Programming Languages
  - Project Management
  - Legal
  - Platform/Environment


Evaluation - Applicability

Evaluation - Scalability
- Research Question 2: Does the approach scale to the size of the cluster?
- Research Question 3: Does the approach scale with the size of the input?

Evaluation - Scalability

Evaluation - Scalability

Evaluation - Reproducibility
- Research Question 4: Using their infrastructure, can researchers easily reproduce previously published results?

Evaluation - Reproducibility
- Conducted a controlled experiment.
- Selected a group of 8 researchers.
- Each chose 3 tasks.

References
- 13/icse13.pdf
