1 Experimentation in Software Engineering: an introduction Mariano Ceccato FBK Fondazione Bruno Kessler
2 Motivation To understand software engineering better … Why should we perform empirical software engineering? The major reason for carrying out empirical studies is the opportunity to obtain statistically significant results regarding the understanding, prediction, and improvement of software development. Empirical studies are an important input to decision-making in an organization.
3 Experimental questions Experimental studies try to answer experimental questions. Examples, ordered from general to specific: Do stereotypes improve the understandability of UML diagrams? Do design patterns improve the maintainability of code? Is Rational Rose more usable than Omondo in the development work of software house X? Does the use of JUnit reduce the number of defects in the code of industry Y? Researchers carry out an empirical study to answer a question.
4 Our research question Are stereotyped reverse engineered UML diagrams (Conallen proposal) useful in Web application comprehension and maintenance tasks?
5 Experiment process [Diagram: Idea → Definition → Planning → Operation → Analysis & interpretation → Presentation & package → Conclusions. Example idea: "Is technique A better than B?" Example conclusion: "No!"]
6 Definition In this activity the hypothesis that we want to test has to be stated clearly. Usually the goal definition template is used. Object of study. The entity that is studied in the experiment (code, process, design documents, …). Purpose. The intention of the experiment (e.g. to evaluate the impact of two different techniques). Quality focus. The effect under study in the experiment (e.g. cost, reliability, …). Perspective. The viewpoint from which the experiment results are interpreted (developer, project manager, researcher, …). Context. The environment in which the experiment is run; it defines the subjects and objects used.
7 Conallen vs. Pure UML: definition Object of study. Design documents (Conallen and pure UML) of Web applications. Purpose. Evaluating the usefulness of Conallen diagrams in Web application comprehension, impact analysis and maintenance tasks. Quality focus. Comprehensibility and maintainability. Perspective. Multiple. - Researcher: evaluate how effective Conallen diagrams are during maintenance. - Project manager: evaluate the possibility of adopting a Web application reverse engineering tool (Conallen notation) in his/her organization. Context. - Subjects: 13 computer science students. - Objects: two simple Web applications using JSP and servlets (Claros, WfMS).
8 Planning A very important activity! Definition determines why the experiment is conducted; planning prepares for how the experiment is conducted. We have to state clearly: research questions, context, subjects, variables, metrics, design of the experiment, … The results of the experiment can be disturbed or even destroyed if it is not planned properly …
10 Planning activities (1) Context selection. We have four dimensions: off-line vs. on-line, student vs. professional, toy vs. real problems, specific context (e.g. only a particular industry) vs. general context. Hypothesis formulation. The hypothesis of the experiment is stated formally, including a null hypothesis and an alternative hypothesis. Variables selection. Determine the independent variables (inputs) and the dependent variables (outputs). The effect of the treatments is measured in the dependent variable (or variables). Determine the values the variables can actually take. Selection of subjects. In order to generalize the results to the desired population, the selection must be representative of that population. The size of the sample also affects how far the results can be generalized.
11 Planning activities (2) Experiment design. To draw conclusions from an experiment, we apply statistical analysis methods (tests) to the collected data to interpret the results. The statistical analysis methods depend on the chosen design (one factor with two treatments, one factor with more than two treatments, …). Instrumentation. In this activity we decide the guidelines given to the participants in the experiment. Materials (training, questionnaires, diagrams, …) are prepared and measurement procedures (metrics) are defined. Validity evaluation. A fundamental question concerning the results of an experiment is how valid they are. External validity: can the results of the study be generalized outside the scope of our study? Threats to validity. Compile a list of possible threats …
12 Threats to validity Examples: Violated assumptions of statistical tests. Researchers may influence the results by looking for a specific outcome. If the group is very heterogeneous, there is a risk that the variation due to individual differences is larger than the variation due to the treatment. A badly designed experiment (materials, …). Confounding factors. …
13 Hypothesis formulation Two hypotheses have to be formulated: 1. H0. The null hypothesis. H0 states that there are no real underlying trends in the experiment; the only reasons for differences in our observations are coincidental. This is the hypothesis that the experimenter wants to reject with as high significance as possible. 2. H1. The alternative hypothesis. This is the hypothesis in favor of which the null hypothesis is rejected. Example: H0: a new inspection method finds on average the same number of faults as the old one. H1: a new inspection method finds on average more faults than the old one.
14 Conallen vs. Pure UML hypothesis H01: When doing a comprehension task, the use of stereotyped reverse engineered class diagrams (versus non-stereotyped reverse engineered class diagrams) does not significantly affect the comprehension level. Ha1: When doing a comprehension task, the use of stereotyped reverse engineered class diagrams (versus non-stereotyped reverse engineered class diagrams) significantly affects the comprehension level. H02: When doing a maintenance task, the use of stereotyped reverse engineered class diagrams does not significantly affect the effectiveness of the task. Ha2: When doing a maintenance task, the use of stereotyped reverse engineered class diagrams significantly affects the effectiveness of the task.
15 Variables selection Before any design can start we have to choose the dependent and independent variables. The independent variables (inputs) are those variables that we can control and change in the experiment. They describe the treatments and are thus the variables for which the effects should be evaluated. The dependent variables (outputs) are the response variables that describe the effects of the treatments described by the independent variables. Often there is only one dependent variable and it should therefore be derived directly from the hypothesis. [Diagram: independent variables → Experiment → dependent variables]
16 Confounding factors Pay attention to the confounding factors! A confounding factor is defined as a variable that can hide a genuine association between factors and confuse the conclusions of the experiment. For example, it may be difficult to tell if a better result depends on the tool or the experience of the user of the tool …
17 C versus C++ Research question: Is C++ better than C? The independent variable of interest in this study is the choice of programming language (C++ or C). Dependent variables: total time required to develop programs; total time required for testing; total defects; … Potential confounding factor: different experience of the programmers … The experiment is not valid if confounding factors are present …
18 Standard design types One factor with two treatments. One factor with more than two treatments. Two factors with two treatments. More than two factors, each with two treatments. … The design and the statistical analysis are closely related. We decide the design type considering the objects and subjects we are able to use, the hypothesis, and the measurements chosen.
19 One factor with two treatments (1) We want to compare the two treatments against each other. Example: to investigate if a new design method produces software with higher quality than the previously used design method. Factor = design method. Treatments = new/old. [Table: completely randomized design. Columns: Subjects, New design, Old design. Each of the six subjects (1 to 6) is marked with an X under exactly one treatment.] If we have the same number of subjects per treatment, the design is balanced.
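The completely randomized, balanced assignment described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `assign_completely_randomized` and `is_balanced`, the `seed` parameter, and the round-robin dealing strategy are assumptions, not something taken from the slides.

```python
import random

def assign_completely_randomized(subjects, treatments, seed=None):
    """Randomly assign each subject to exactly one treatment.

    Hypothetical helper: subjects are shuffled and then dealt round-robin,
    so group sizes differ by at most one.
    """
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    assignment = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        assignment[treatments[i % len(treatments)]].append(subject)
    return assignment

def is_balanced(assignment):
    """A design is balanced when every treatment has the same number of subjects."""
    return len({len(group) for group in assignment.values()}) == 1

# Six subjects, two treatments, as in the slide's example.
groups = assign_completely_randomized(range(1, 7), ["new design", "old design"], seed=42)
print(groups, "balanced:", is_balanced(groups))
```

With six subjects and two treatments each group receives three subjects, so the design is balanced; with an odd number of subjects `is_balanced` would return `False`.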
20 One factor with two treatments (2) Example: to investigate if a new design method produces software with higher quality than the previously used design method. The dependent variable can be the number of faults found in the development. [Table: the same completely randomized design as on the previous slide.] To understand if the new method is better than the old one we use the hypotheses H0: $\mu_1 = \mu_2$ and H1: $\mu_1 > \mu_2$, where $\mu_i$ is the average number of faults for treatment i.
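As a rough illustration of how a one-sided hypothesis of this form (H1: the mean for treatment 1 is larger than the mean for treatment 2) could be checked on a very small sample, the sketch below uses an exact permutation test. This is a substitute chosen because it needs only the standard library; the slides defer the actual tests (t-test, ANOVA, …) to later lessons. The function name and the fault counts are hypothetical.

```python
from itertools import combinations
from statistics import mean

def one_sided_permutation_test(group1, group2):
    """Exact p-value for H1: mean(group1) > mean(group2).

    Under H0 the treatment labels are exchangeable, so we enumerate every way
    of splitting the pooled observations into groups of the original sizes and
    count how often the difference of means is at least the observed one.
    """
    pooled = list(group1) + list(group2)
    observed = mean(group1) - mean(group2)
    at_least_as_extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), len(group1)):
        chosen = set(idx)
        g1 = [pooled[i] for i in chosen]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if mean(g1) - mean(g2) >= observed - 1e-12:  # tolerance for float noise
            at_least_as_extreme += 1
    return at_least_as_extreme / total

# Hypothetical fault counts for three subjects per treatment (not real data).
faults_treatment_1 = [9, 11, 10]
faults_treatment_2 = [5, 6, 7]
p_value = one_sided_permutation_test(faults_treatment_1, faults_treatment_2)
print("one-sided p-value:", p_value)
```

With only six subjects there are just 20 possible splits, so the smallest achievable p-value is 1/20; this is one concrete way to see why small samples limit the significance that can be reached.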
21 Two factors with two treatments The experiment gets more complex. Example: to investigate the understandability of the design document when using structured or OO design, based on one good and one bad requirements document. Factor A: design method, treatments: OO, structured. Factor B: requirements document, treatments: good, bad. Assignment: good requirements with OO = subjects 4, 6; good requirements with structured = subjects 1, 7; bad requirements with OO = subjects 2, 3; bad requirements with structured = subjects 5, 8. We need more subjects …
22 Conallen vs. Pure UML: experiment design We divided the students into four groups (Yellow, Blue, Red, Green), trying to make sure that high- and low-ability people were uniformly distributed across groups (first questionnaire). Each group undergoes two different treatments. This is one factor (the design) with two treatments (Conallen and pure UML). First table: Conallen on WfMS = Yellow, Conallen on Claros = Blue, pure UML on WfMS = Red, pure UML on Claros = Green. Second table: Conallen on WfMS = Green, Conallen on Claros = Red, pure UML on WfMS = Blue, pure UML on Claros = Yellow. We used this strategy to double the number of observations (26 instead of 13). It is used when the number of subjects is small!
23 Metrics (dependent variables) How to measure the collected data? Using metrics … Metrics reflect the data that are collected from experiments; they are decided at design time of the experiment and computed after the entire experiment has ended. Usually, they are derived directly from the research questions (or hypotheses).
24 Metrics: examples Question: Does design method A produce software with higher quality than design method B? Metric: number of faults found in the development. Question: Are OO design documents easier to understand than structured design documents? Metric: percentage of questions that were answered correctly. Question: Are computer science students more productive (as programmers) than electronic engineers? Metric: number of lines of code per total development time.
25 Conallen vs. Pure UML: understanding task The questionnaire had 12 questions. Each answer was a list of pages/classes/interfaces. Example: Question: Which controller classes are used to retrieve the users from the page referred to in question 4? Correct answer: LoginController.java, CLUserController.java. How should these (real) answers be evaluated? 1. LoginController.java, CLUserController.java 2. CLUser.java 3. LoginController.java 4. LoginController.java, CLUserController.java, CLUser.java Answer 1 is correct (score 1) and answer 2 is wrong (score 0), but what score should answers 3 and 4 get?
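26 Precision, recall, F-measure Precision = $\frac{|\text{answers given} \cap \text{correct answers}|}{|\text{answers given}|}$; Recall = $\frac{|\text{answers given} \cap \text{correct answers}|}{|\text{correct answers}|}$; F-measure = $\frac{2 \cdot \text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}}$. Here $|\ldots|$ denotes the number of elements and $\cap$ denotes intersection. These measures are used in information retrieval …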
27 How to compute the F-measure in our experiment? Question: Which controller classes are used to retrieve the users from the page referred to in question 4? Correct answer: LoginController.java, CLUserController.java. Answer 1: LoginController.java. Precision = 1/1 = 1; Recall = 1/2 = 0.5; F-measure = 0.67. Answer 2: LoginController.java, CLUserController.java, CLUser.java. Precision = 2/3 = 0.67; Recall = 2/2 = 1; F-measure = 0.8.
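The computation on this slide can be reproduced with a short Python sketch. The function name `precision_recall_f` is hypothetical, and returning 0 when an answer is empty or shares nothing with the correct answer is an assumed convention, not something the slides specify.

```python
def precision_recall_f(given, correct):
    """Precision, recall and F-measure of a set-valued answer.

    Assumed convention: any ratio with a zero denominator is reported as 0.0.
    """
    given, correct = set(given), set(correct)
    hits = len(given & correct)  # |answers given ∩ correct answers|
    precision = hits / len(given) if given else 0.0
    recall = hits / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

correct_answer = {"LoginController.java", "CLUserController.java"}

# The two answers worked out on the slide.
answer_1 = {"LoginController.java"}
answer_2 = {"LoginController.java", "CLUserController.java", "CLUser.java"}

for name, answer in [("Answer 1", answer_1), ("Answer 2", answer_2)]:
    p, r, f = precision_recall_f(answer, correct_answer)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F-measure={f:.2f}")
```

An answer that lists only CLUser.java scores 0 on all three measures under this convention, and a fully correct answer scores 1 on all three, so the F-measure gives partial credit exactly to the answers that a plain correct/wrong score cannot grade.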
28 Subject X [Table: precision (P), recall (R) and F-measure (F) for subject X's answers] F-measure average = … Good work!
29 Operation [Diagram: experiment process. Idea → Definition → Planning → Operation → Analysis & interpretation → Presentation & package → Conclusions]
30 Operation When an experiment has been designed and planned, it must be carried out in order to collect the data. The experimenter actually meets the subjects. Treatments are applied to the subjects. Even if an experiment has been perfectly designed and the data are analyzed with the appropriate methods, everything depends on the operation …
31 Operation steps [Diagram: Experiment design → Experiment operation (Preparation → Execution → Data validation) → Experiment data]
32 Preparation, execution, data validation Preparation: before the experiment can be executed, all instruments must be ready (forms, tools, software, …). Participants must be trained to execute the task. Execution: subjects perform their tasks according to the different treatments and data are collected. Data validation: the experimenter must check that the data are reasonable and that they have been collected correctly. Have the participants understood the task? Have the participants taken the experiment seriously?
33 Analysis and interpretation After collecting experimental data in the operation phase, we want to be able to draw conclusions based on this data. [Diagram: Experiment data → Analysis and interpretation (Descriptive statistics → Data set reduction → Hypothesis testing) → Conclusions]
34 Descriptive statistics Descriptive statistics deal with the presentation and numerical processing of a data set. Descriptive statistics may be used to describe and graphically present interesting aspects of the data set. Descriptive statistics may be used before carrying out hypothesis testing, in order to better understand the nature of the data and to identify abnormal data points (outliers).
35 Descriptive statistics Measures of central tendency (mean, median, mode, …). Measures of dispersion (variance, standard deviation, …). Measures of dependency between variables. Graphical visualization (scatter plots, histograms, pie charts, …).
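The measures listed above can be computed with Python's standard `statistics` module, as in the following sketch. The data values are hypothetical, and the Pearson correlation coefficient is used here as one possible measure of dependency between variables; the slides do not name a specific one.

```python
import math
import statistics

# Hypothetical task completion times (minutes) for seven subjects.
times = [12, 15, 15, 18, 20, 22, 40]

summary = {
    "mean": statistics.mean(times),          # central tendency
    "median": statistics.median(times),
    "mode": statistics.mode(times),
    "variance": statistics.variance(times),  # sample variance (dispersion)
    "stdev": statistics.stdev(times),        # sample standard deviation
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - my) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical years of experience for the same seven subjects.
experience = [6, 5, 5, 4, 3, 3, 1]

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
print(f"correlation(time, experience): {pearson(times, experience):.2f}")
```

In this made-up sample the mean is pulled well above the median by the single value 40, which is the kind of pattern that descriptive statistics are meant to reveal before any hypothesis test is run.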
36 Outlier analysis An outlier is a point that is much larger or much smaller than one would expect from looking at the other points. One way to identify outliers is to draw scatter plots. There are also statistical methods to identify outliers. [Figure: scatter plot with one outlier marked]
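One common statistical method for flagging outliers is the 1.5 × IQR rule (the rule behind box plot whiskers). The slides do not say which method they have in mind, so the sketch below is only one possible choice; the function name `iqr_outliers`, the factor `k`, and the data are assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    Quartiles come from statistics.quantiles with its default method;
    other quartile definitions can give slightly different fences.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical task completion times (minutes); 40 looks suspicious.
times = [12, 15, 15, 18, 20, 22, 40]
print("possible outliers:", iqr_outliers(times))
```

A flagged point is only a candidate: whether it is excluded still depends on why it occurred, which is a judgment the rule cannot make.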
37 Data set reduction When outliers have been identified, it is important to decide what to do with them. If the outlier is due to a strange or rare event that will never happen again, the point can be excluded. Example: the task was not understood. If the outlier is due to a rare event that may occur again, it is not advisable to exclude the point, because there is relevant information in the outlier. Example: a module that is implemented by inexperienced programmers.
38 Hypothesis testing The hypotheses of the experiment are evaluated statistically. Is it possible to reject a certain null hypothesis? To answer this question we have to use statistical tests (t-test, ANOVA, Chi-2, …), which are covered in the next lessons!
39 Presentation and package [Diagram: Experiment → Experiment presentation and package (Write report) → Experiment report]