1 Presenter: Joshan V John
Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan & Tien N. Nguyen, Iowa State University, USA
Instructor: Christoph Csallner
Joshan Valayil John | The University of Texas at Arlington | CSE 6324

2 Agenda
• Motivation
• Ultra-large-scale software repositories
• Barriers to mining software repositories
• Solution: Boa
• Goals of Boa
• Boa Architecture
• Evaluation

3 Motivation
• The big-3 software repositories are known to host close to 1 million projects.
• They contain a wealth of software and of information about software.
• Systematically extracting relevant data from these repositories and analyzing it to test hypotheses is hard.
• Boa, a domain-specific language and infrastructure, was developed to ease the testing of hypotheses related to mining software repositories (MSR).

4 Ultra-large-scale Software Repositories

5 Why analyze software repositories?
• Curiosity
• Identify patterns
• Forecasting
• Plan for better designs
• Empirical validation

6 Barriers to mining software repositories
• Develop programming expertise to access version control systems.
• Establish infrastructure to store data downloaded from software repositories.
• Develop an application to access this local data.
• Improve scalability of the analysis infrastructure to process ultra-large-scale data.

7 Barriers to mining software repositories
• Experiments are often irreproducible.
• Low reusability of experimental infrastructure.
• Lack of systematic curation leads to loss of experimental data.
• Building analysis infrastructure to process ultra-large-scale data efficiently can be very hard.

8 Solution - Boa
• Boa: a domain-specific language and infrastructure designed to analyze ultra-large-scale software repositories.

9 Goals of Boa
• Easy to use
• Better abstractions
• Efficient & scalable
• Enhances reproducibility

10 A Research Question
• Consider a program that answers: “What are the churn rates for all Java projects that use SVN?”
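
To make the question concrete, here is a minimal Python sketch over a tiny in-memory data set. It is not the presenters' Java or Boa program. Treating "churn rate" as the mean number of files changed per revision is an assumption made for illustration, and the `Project` and `Revision` records and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Revision:
    files_changed: int  # hypothetical: number of files touched by this revision


@dataclass
class Project:
    name: str
    languages: list
    repo_kind: str  # hypothetical label such as "SVN" or "Git"
    revisions: list = field(default_factory=list)


def churn_rates(projects):
    """Mean files changed per revision, for Java projects hosted in SVN."""
    rates = {}
    for p in projects:
        # Keep only projects that match the research question and have history.
        if "Java" in p.languages and p.repo_kind == "SVN" and p.revisions:
            rates[p.name] = mean(r.files_changed for r in p.revisions)
    return rates


# Made-up sample data: only "alpha" is both Java and SVN.
SAMPLE = [
    Project("alpha", ["Java"], "SVN", [Revision(2), Revision(4)]),
    Project("beta", ["Python"], "SVN", [Revision(7)]),
    Project("gamma", ["Java"], "Git", [Revision(1)]),
]

print(churn_rates(SAMPLE))
```

The hard part on real data is not this loop but fetching, storing, and scanning the revision history of many thousands of projects, which is what the following slides compare.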

11 Solution in Java
• Full program is over 70 lines of code.
• Uses JSON and SVN libraries.
• Runs sequentially.
• Takes over 24 hours.
• Takes almost 3 hours with data locally cached.
• Can be parallelized, but doing so is very complex.

12 Solution in Boa
• Simple program, 6 lines of code.
• Hides implementation specifics.
• Automatic parallelization; results in 1 minute.
• Results can be easily reproduced by publishing these small programs together with the data sets used.

13 Performance Results

14 Boa Architecture

15 Boa Architecture
• Three main components:
  • The Boa language
  • Boa compiler & runtime
  • Supporting data infrastructure

16 The Boa Language
• Domain-specific types
• MapReduce support
• Quantifiers
• User-defined functions
• Output aggregators

17 Boa Language – Domain-Specific Types
• Provides several domain-specific types that abstract away the details of mining software repositories (http://boa.cs.iastate.edu/docs/dsl-types.php).

18 Boa Language – MapReduce Support
• Computations are specified via two user-defined functions:
  • Mapper: takes key-value pairs as input and produces key-value pairs as output.
  • Reducer: consumes the mapper's output and aggregates the data for each individual key.
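
The mapper/reducer split can be sketched in a few lines of plain Python. This is a single-process illustration of the general MapReduce pattern, not Boa or Hadoop code; the `mapper`, `reducer`, and `run` names and the sample records are hypothetical.

```python
from collections import defaultdict


def mapper(key, value):
    """Take one input pair (project name, file names) and yield output pairs."""
    for filename in value:
        # Use the file extension as the output key; files without one map to "".
        ext = filename.rsplit(".", 1)[-1] if "." in filename else ""
        yield ext, 1


def reducer(key, values):
    """Aggregate all values that were emitted under a single key."""
    return key, sum(values)


def run(records):
    """Drive the two phases sequentially; a real framework distributes them."""
    grouped = defaultdict(list)
    for k, v in records:
        for out_key, out_value in mapper(k, v):
            grouped[out_key].append(out_value)  # "shuffle": group by output key
    return dict(reducer(k, vs) for k, vs in grouped.items())


# Made-up input: (project name, list of file names).
RECORDS = [("alpha", ["A.java", "B.java", "README"]), ("beta", ["c.py"])]

print(run(RECORDS))
```

Because each mapper call sees one record and each reducer call sees one key, both phases can be spread across machines without changing the user's two functions.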

19 Boa Language – Quantifiers
• Boa defines the quantifiers:
  • exists
  • foreach
  • ifall
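
As a rough intuition only, the three quantifiers resemble the following Python constructs. The mapping is an assumption made for illustration (in Boa the quantifiers guard a statement body rather than return a value); the Boa documentation is the authority on their exact syntax and semantics. The file names and the `is_java` helper are hypothetical.

```python
files = ["Main.java", "Util.java", "notes.txt"]  # made-up sample data


def is_java(name):
    """Hypothetical predicate standing in for a quantifier's condition."""
    return name.endswith(".java")


# exists: act if at least one element satisfies the condition.
has_java = any(is_java(f) for f in files)

# foreach: act once for every element that satisfies the condition.
java_files = [f for f in files if is_java(f)]

# ifall: act only if every element satisfies the condition.
only_java = all(is_java(f) for f in files)

print(has_java, java_files, only_java)
```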

20 Boa Language – User-Defined Functions
• Users can define their own mining algorithms.
• Facilitates code reuse.

21 Boa Language – Output Aggregators
• Output can be indexed.
• Output is defined in terms of predefined data aggregators.
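
The idea of an indexed output backed by a predefined aggregator can be sketched in Python as follows. The `IndexedOutput` class and its `emit` and `results` methods are hypothetical names chosen for this sketch, not Boa's API; the aggregators used here are simply a sum and a mean.

```python
from collections import defaultdict


class IndexedOutput:
    """Collects emitted values per index and reduces them with one aggregator."""

    def __init__(self, aggregate):
        self._aggregate = aggregate      # e.g. sum, max, or a mean function
        self._values = defaultdict(list)

    def emit(self, index, value):
        """Send one value to the output slot identified by `index`."""
        self._values[index].append(value)

    def results(self):
        """Apply the aggregator independently to each index."""
        return {i: self._aggregate(v) for i, v in self._values.items()}


def mean(values):
    return sum(values) / len(values)


# Two outputs over the same made-up stream of (project, files changed) pairs.
totals = IndexedOutput(sum)
averages = IndexedOutput(mean)
for project, changed in [("alpha", 2), ("alpha", 4), ("beta", 7)]:
    totals.emit(project, changed)
    averages.emit(project, changed)

print(totals.results(), averages.results())
```

Declaring the aggregation up front means the program only emits values; how they are combined per index is left to the runtime.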

22 Boa’s Supporting Infrastructure
• Compiler & runtime
• Data infrastructure
• Web-based interface

23 Boa’s Compiler & Runtime
• The initial implementation was based on the Sizzle compiler and framework.
• Sizzle is an open-source Java implementation of the Sawzall language.
• Sizzle provides support for generating programs that run on the open-source Hadoop MapReduce framework.

24 Boa’s Data Infrastructure
• Keeps a local cache of repository information.
• First step: locally replicate the data.
• Second step: run the caching translator to convert the data into the format required by the framework.
• Input (JSON file + SVN repositories) -> Output (Hadoop SequenceFile)

25 Boa’s Web-Based Interface
• Users submit programs.
• Programs are compiled and run on the Boa team's clusters.
• Each submission creates a job in the system.

26 Evaluation
• Programs were executed on a Hadoop 1.0.3 install.
• The cluster was not tuned for performance, except for setting the maximum number of map tasks on each compute node equal to the number of cores on that node and increasing the VM heap size.

27 Evaluation – Applicability
• Research Question 1: Does Boa help researchers analyze ultra-large-scale software repositories?
• A set of 21 tasks in four different categories was examined:
  • Programming languages
  • Project management
  • Legal
  • Platform/environment

29 Evaluation - Applicability

30 Evaluation - Scalability
• Research Question 2: Does the approach scale to the size of the cluster?
• Research Question 3: Does the approach scale with the size of the input?

31 Evaluation - Scalability

32 Evaluation - Scalability

33 Evaluation - Reproducibility
• Research Question 4: Using the Boa infrastructure, can researchers easily reproduce previously published results?

34 Evaluation - Reproducibility
• A controlled experiment was conducted.
• A group of 8 researchers was selected.
• Each chose 3 tasks.

35 References
• http://design.cs.iastate.edu/papers/ICSE-13/icse13.pdf
• http://boa.cs.iastate.edu/docs/
