Presentation on theme: "Experimentation in Software Engineering: an introduction"— Presentation transcript:
1 Experimentation in Software Engineering: an introduction. Mariano Ceccato, FBK (Fondazione Bruno Kessler).
2 Motivation. Why should we perform empirical software engineering? To understand software engineering better. The major reason for carrying out empirical studies is the opportunity to obtain statistically significant results regarding understanding, prediction, and improvement of software development. Empirical studies are an important input to decision-making in an organization.
3 Experimental questions. Researchers execute an empirical study to answer a question: experimental studies try to answer experimental questions. Examples, ranging from general to specific:
Do stereotypes improve the understandability of UML diagrams?
Do design patterns improve the maintainability of code?
Is Rational Rose more usable than Omondo in the development work of software house X?
Does the use of JUnit reduce the number of defects in the code of company Y?
4 Our research question. Are stereotyped reverse engineered UML diagrams (Conallen's proposal) useful in Web application comprehension and maintenance tasks?
5 Experiment process. "Is technique A better than B?" is not, by itself, an experiment (No!!). Running an experiment follows a process: Idea → Definition → Planning → Operation → Analysis & interpretation → Presentation & package → Conclusions.
6 Definition. In this activity the hypothesis that we want to test has to be stated clearly. Usually the goal definition template is used:
Object of the study: the entity that is studied in the experiment (code, process, design documents, ...).
Purpose: the intention of the experiment (e.g. evaluate the impact of two different techniques).
Quality focus: the effect under study in the experiment (e.g. cost, reliability, ...).
Perspective: the viewpoint from which the experiment results are interpreted (developer, project manager, researcher, ...).
Context: the "environment" in which the experiment is run; it defines the subjects and objects used.
7 Conallen vs. Pure UML: definition.
Object of the study: design documents (Conallen and pure UML) of Web applications.
Purpose: evaluating the usefulness of Conallen diagrams in Web application comprehension, impact analysis and maintenance tasks.
Quality focus: comprehensibility and maintainability.
Perspective: multiple.
- Researcher: evaluate how effective Conallen diagrams are during maintenance.
- Project manager: evaluate the possibility of adopting a Web application reverse engineering tool (Conallen notation) in his/her organization.
Context:
- Subjects: 13 computer science students.
- Objects: two simple Web applications using JSP and servlets (Claros, WfMS).
8 Planning. Definition determines why the experiment is conducted; planning prepares for how the experiment is conducted. We have to state clearly: research questions, context, subjects, variables, metrics, design of the experiment, ... The results of the experiment can be disturbed, or even destroyed, if it is not planned properly. This activity is very important!
10 Planning activities (1).
Context selection. We have four dimensions: off-line vs. on-line, students vs. professionals, toy vs. real problems, specific context (e.g. only a particular industry) vs. general context.
Hypothesis formulation. The hypothesis of the experiment is stated formally, including a null hypothesis and an alternative hypothesis.
Variables selection. Determine the independent variables (inputs) and the dependent variables (outputs). The effect of the treatments is measured in the dependent variable (or variables). Determine the values the variables can actually take.
Selection of subjects. In order to generalize the results to the desired population, the selection must be representative of that population. The size of the sample also impacts the results when generalizing.
11 Planning activities (2).
Experiment design. To draw conclusions from an experiment, we apply statistical analysis methods (tests) to the collected data to interpret the results. The applicable statistical methods depend on the chosen design (one factor with two treatments, one factor with more than two treatments, ...).
Instrumentation. In this activity guidelines are prepared to guide the participants through the experiment. Materials (training, questionnaires, diagrams, ...) are prepared and measurement procedures (metrics) are defined.
Validity evaluation. A fundamental question concerning the results of an experiment is how valid they are. External validity: can the results of the study be generalized outside its scope? Threats to validity: compile a list of possible threats.
12 Threats to validity. Examples:
Violated assumptions of statistical tests.
Researchers may influence the results by looking for a specific outcome.
If the group is very heterogeneous, there is a risk that the variation due to individual differences is larger than the variation due to the treatment.
Badly designed experiment (materials, ...).
Confounding factors.
...
13 Hypothesis formulation. Two hypotheses have to be formulated:
H0, the null hypothesis. H0 states that there are no real underlying trends in the experiment; the only reasons for differences in our observations are coincidental. This is the hypothesis that the experimenter wants to reject with as high significance as possible.
H1, the alternative hypothesis. This is the hypothesis in favor of which the null hypothesis is rejected.
Example:
H0: the new inspection method finds on average the same number of faults as the old one.
H1: the new inspection method finds on average more faults than the old one.
14 Conallen vs. Pure UML hypotheses.
H01: When doing a comprehension task, the use of stereotyped reverse engineered class diagrams (versus non-stereotyped reverse engineered class diagrams) does not significantly affect the comprehension level.
Ha1: When doing a comprehension task, the use of stereotyped reverse engineered class diagrams (versus non-stereotyped reverse engineered class diagrams) significantly affects the comprehension level.
H02: When doing a maintenance task, the use of stereotyped reverse engineered class diagrams does not significantly affect the effectiveness of the task.
Ha2: When doing a maintenance task, the use of stereotyped reverse engineered class diagrams significantly affects the effectiveness of the task.
15 Variables selection. Before any design can start we have to choose the dependent and independent variables.
The independent variables (inputs) are those variables that we can control and change in the experiment. They describe the treatments and are thus the variables for which the effects should be evaluated.
The dependent variables (outputs) are the response variables that describe the effects of the treatments described by the independent variables.
Often there is only one dependent variable, and it should therefore be derived directly from the hypothesis.
(Diagram: independent variables → experiment → dependent variables.)
16 Confounding factors. Pay attention to confounding factors! A confounding factor is a variable that can hide a genuine association between factors and confuse the conclusions of the experiment. For example, it may be difficult to tell whether a better result depends on the tool or on the experience of the user of the tool.
17 C versus C++. An experiment is not valid with confounding factors. Research question: is C++ better than C?
The independent variable of interest in this study is the choice of programming language (C++ or C).
Dependent variables:
Total time required to develop programs.
Total time required for testing.
Total defects.
...
Potential confounding factor: the different experience of the programmers.
18 Standard design types.
One factor with two treatments.
One factor with more than two treatments.
Two factors with two treatments.
More than two factors, each with two treatments.
...
The design and the statistical analysis are closely related. We decide the design type considering: the objects and subjects we are able to use, the hypothesis, and the measurements chosen.
19 One factor with two treatments (1). Completely randomized design: we want to compare the two treatments against each other. Example: to investigate whether a new design method produces software with higher quality than the previously used design method.
Factor = design method. Treatments = new/old.
(A table randomly assigns each of subjects 1-6 to either the new or the old design.)
If we have the same number of subjects per treatment, the design is balanced.
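A balanced, completely randomized assignment like the one sketched in the table can be produced programmatically. The following is a minimal Python sketch (the function name and the subject IDs are illustrative, not from the original experiment):

```python
import random

def assign_balanced(subjects, treatments, seed=None):
    """Randomly assign subjects to treatments so that each treatment
    receives the same number of subjects (balanced, completely
    randomized design). Assumes len(subjects) is a multiple of
    len(treatments)."""
    rng = random.Random(seed)
    pool = list(subjects)
    rng.shuffle(pool)  # random order -> random assignment
    k = len(pool) // len(treatments)
    return {t: pool[i * k:(i + 1) * k] for i, t in enumerate(treatments)}

# Six hypothetical subjects, two treatments: three subjects each.
groups = assign_balanced([1, 2, 3, 4, 5, 6], ["new design", "old design"])
```

Each call yields one random balanced split; fixing `seed` makes the assignment reproducible, which helps when documenting the experiment package.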
20 One factor with two treatments (2). Completely randomized design. Example: to investigate whether a new design method produces software with higher quality than the previously used design method. The dependent variable can be the number of faults found in the development. To understand whether the new method is better than the old one we test:
H0: μ1 = μ2
H1: μ1 > μ2
where μi is the average number of faults for treatment i.
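The slides defer the standard tests (t-test, ANOVA, ...) to later lessons; as an illustration of comparing the two treatment means, here is a stdlib-only one-sided permutation test. All data values below are made up for the example:

```python
import random
import statistics

def permutation_test(new, old, n_perm=5000, seed=1):
    """One-sided permutation test of
    H0: mean(new) == mean(old)  vs.  H1: mean(new) > mean(old).
    Returns the estimated p-value: the fraction of random relabelings
    whose mean difference is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = statistics.mean(new) - statistics.mean(old)
    pooled = list(new) + list(old)
    n = len(new)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel subjects at random
        diff = statistics.mean(pooled[:n]) - statistics.mean(pooled[n:])
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Made-up fault counts per subject for the two treatments:
new_faults = [12, 15, 11, 14, 13, 16]
old_faults = [9, 10, 8, 11, 10, 9]
p_value = permutation_test(new_faults, old_faults)
# Reject H0 at the 5% significance level if p_value < 0.05.
```

A permutation test makes fewer distributional assumptions than a t-test, which is convenient for the small samples typical of software engineering experiments.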
21 Two factors with two treatments. The experiment gets more complex, and we need more subjects. Example: to investigate the understandability of the design document when using structured or OO design, based on one "good" and one "bad" requirements document.
Factor A: design method; treatments: OO, structured.
Factor B: requirements document; treatments: "good", "bad".
Assignment:
OO + good: subjects 4, 6. Structured + good: subjects 1, 7.
OO + bad: subjects 2, 3. Structured + bad: subjects 5, 8.
22 Conallen vs. Pure UML: design of the experiment. We divided the students into four groups (Yellow, Blue, Red, Green) and tried to make sure that 'high' and 'low' ability people were uniformly distributed across the groups (first questionnaire). Each group undergoes 2 different treatments. This is one factor (the design notation) with two treatments (Conallen and pure UML). We used this strategy to double the number of subjects (26 instead of 13); it is used when the number of subjects is small.
First session: Conallen: WfMS = Yellow, Claros = Blue; pure UML: WfMS = Red, Claros = Green.
Second session: Conallen: WfMS = Green, Claros = Red; pure UML: WfMS = Blue, Claros = Yellow.
23 Metrics (dependent variables). How do we measure the collected data? Using metrics. Metrics reflect the data that are collected from experiments; they are decided at "design time" of the experiment and computed after the entire experiment has ended. Usually, they are derived directly from the research questions (or hypotheses).
24 Metrics: examples.
Question: does design method A produce software with higher quality than design method B? Metric: number of faults found in the development.
Question: are OO design documents easier to understand than structured design documents? Metric: percentage of questions that were answered correctly.
Question: are computer science students more productive (as programmers) than electronic engineers? Metric: number of lines of code per total development time.
25 Conallen vs. Pure UML: understanding task. The questionnaire had 12 questions. Each answer was a list of pages/classes/interfaces.
Example question: which controller classes are used to retrieve the users from the page referred to in question 4?
Correct answer: LoginController.java, CLUserController.java.
How do we evaluate these (real) answers?
LoginController.java, CLUserController.java → correct!
CLUser.java → wrong!
LoginController.java → ?
LoginController.java, CLUserController.java, CLUser.java → ?
26 Precision, recall, F-measure. These measures are used in information retrieval.
Precision = |answers given ∩ correct answers| / |answers given|
Recall = |answers given ∩ correct answers| / |correct answers|
F-measure = 2 · precision · recall / (precision + recall)
(|...| denotes the number of elements of a set; ∩ denotes set intersection.)
27 How to compute the F-measure in our experiment?
Question: which controller classes are used to retrieve the users from the page referred to in question 4?
Correct answer: LoginController.java, CLUserController.java.
Answer 1: LoginController.java.
Precision = 1/1 = 1; Recall = 1/2 = 0.5; F-measure = 0.67.
Answer 2: LoginController.java, CLUserController.java, CLUser.java.
Precision = 2/3 = 0.67; Recall = 2/2 = 1; F-measure = 0.8.
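The computation above can be captured in a small Python helper (a sketch; the function name is ours, the answer sets are the ones from the slide):

```python
def precision_recall_f(given, correct):
    """Precision, recall and F-measure of a given answer set
    against the correct answer set."""
    given, correct = set(given), set(correct)
    hits = len(given & correct)  # |given ∩ correct|
    precision = hits / len(given) if given else 0.0
    recall = hits / len(correct) if correct else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f

correct = {"LoginController.java", "CLUserController.java"}
p1, r1, f1 = precision_recall_f({"LoginController.java"}, correct)
# Answer 1: precision 1.0, recall 0.5, F-measure ≈ 0.67
p2, r2, f2 = precision_recall_f(
    {"LoginController.java", "CLUserController.java", "CLUser.java"}, correct)
# Answer 2: precision ≈ 0.67, recall 1.0, F-measure 0.8
```

Working on sets makes the intersection explicit and ignores the order in which subjects listed their answers.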
28 Subject "X". (A table lists precision, recall and F-measure per question for the 12 questions; among the F-measure values are 0.86, 0.5, 0.78, 0.83, 0.92, 0.67, 0.90, 0.87.) F-measure average = 0.78. Good work!
29 Operation. Experiment process: Idea → Definition → Planning → [Operation] → Analysis & interpretation → Presentation & package → Conclusions.
30 Operation. When an experiment has been designed and planned, it must be carried out in order to collect the data. The experimenter actually meets the subjects, and the treatments are applied to them. Even if an experiment has been perfectly designed and the data are analyzed with the appropriate methods, everything depends on the operation.
32 Preparation, execution, data validation.
Preparation: before the experiment can be executed, all instruments must be ready (forms, tools, software, ...). Participants must be trained to execute the task.
Execution: subjects perform their tasks according to the different treatments, and data are collected.
Data validation: the experimenter must check that the data are reasonable and have been collected correctly. Have the participants understood the task? Have they participated seriously in the experiment?
33 Analysis and interpretation. After collecting experimental data in the operation phase, we want to be able to draw conclusions based on these data. (Flow: experiment data → descriptive statistics → data set reduction → hypothesis testing → conclusions.)
34Descriptive statistics Descriptive statistics deal with the presentation and numerical processing of a data set.Descriptive statistics may be used to describe and graphically present interesting aspects of the data set.Descriptive statistics may be used before carrying out hypothesis testing, in order to better understand the nature of the data and to identify abnormal data points (outliers).
35 Descriptive statistics.
Measures of central tendency (mean, median, mode, ...).
Measures of dispersion (variance, standard deviation, ...).
Measures of dependency between variables.
Graphical visualization (scatter plots, histograms, pie charts, ...).
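The central-tendency and dispersion measures listed above are all available in Python's standard library. A minimal sketch on made-up F-measure-like values (the sample is illustrative, not the experiment's data set):

```python
import statistics

# Illustrative sample of per-question F-measure values:
data = [0.86, 0.50, 0.78, 0.83, 0.92, 0.67, 0.90, 0.87]

mean = statistics.mean(data)           # central tendency
median = statistics.median(data)       # central tendency, robust to outliers
modes = statistics.multimode(data)     # most frequent value(s)
variance = statistics.variance(data)   # sample variance (dispersion)
stdev = statistics.stdev(data)         # sample standard deviation
```

Computing these before hypothesis testing gives a quick feel for the data and helps spot abnormal points early.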
36 Outlier analysis. An outlier is a point that is much larger or much smaller than one could expect by looking at the other points. A way to identify outliers is to draw scatter plots. There are also statistical methods to identify outliers.
37 Data set reduction. When outliers have been identified, it is important to decide what to do with them.
If the outlier is due to a strange or rare event that will never happen again, the point can be excluded. Example: the task was not understood.
If the outlier is due to a rare event that may occur again, it is not advisable to exclude the point; there is relevant information in the outlier. Example: a module implemented by inexperienced programmers.
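One common statistical rule for flagging outliers before deciding what to do with them is Tukey's fences on the interquartile range. A stdlib-only sketch (the data values are made up):

```python
import statistics

def tukey_outliers(data, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences),
    a common statistical rule for identifying outliers."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

# A made-up sample with one suspicious point:
flagged = tukey_outliers([10, 12, 11, 13, 12, 95])
```

The rule only flags candidates; whether each flagged point is excluded or kept remains the experimenter's judgment call, as the slide explains.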
38 Hypothesis testing. The hypotheses of the experiment are evaluated statistically: is it possible to reject a certain null hypothesis? To answer this question we have to use statistical tests (t-test, ANOVA, chi-squared, ...): next lessons!!
39 Presentation and package. The experiment is presented and packaged: write the report. Output: the experiment report.