A HUMAN STUDY OF PATCH MAINTAINABILITY
Zachary P. Fry, Bryan Landau, Westley Weimer
University of Virginia

Bug Fixing

- Fixing bugs manually is difficult and costly.
- Recent techniques explore automated patches:
  - Evolutionary techniques – GenProg
  - Dynamic modification – ClearView
  - Enforcement of pre/post-conditions – AutoFix-E
  - Program transformation via static analysis – AFix
- While these techniques save developers time, there is some concern as to whether the patches produced are human-understandable and maintainable in the long run.

Questions Moving Forward

- How can we concretely measure these notions of human understandability and future maintainability?
- Can we automatically augment machine-generated patches to improve maintainability?
- In practice, are machine-generated patches as maintainable as human-generated patches?

Measuring Quality and Maintainability

- Functional quality – Does the implementation match the specification? Does the code execute “correctly”?
- Non-functional quality – Is the code understandable to humans? How difficult is it to understand and alter the code in the future?

Software Functional Quality

- Perfect: implementation matches specification
- Direct software quality metrics:
  - Testing
  - Defect density
  - Mean time to failure
- Indirect software quality metrics:
  - Cyclomatic complexity
  - Coupling and cohesion (CK metrics)
  - Software readability

Software Non-functional Quality

- Maintainability:
  - Human-centric factors affecting the ease with which bugs can be fixed and features can be added
  - Broadly related to the “understandability” of code
  - Not as easy to measure concretely with heuristics as functional correctness is
- Automatically generated patches have been shown to be of high quality functionally – what about non-functionally?

Patch Maintainability Defined

- Rather than using an approximation to measure understandability, we directly measure humans’ abilities to perform maintenance tasks
- Task: ask human participants questions that require them to read and understand a piece of code, and measure the effort required to provide correct answers
- Simulate the maintenance process as closely as possible

PHP Bug #54454

- Title: “substr_compare incorrectly reports equality in some cases”
- Bug description: “if main_str is shorter than str, substr_compare [mistakenly] checks only up to the length of main_str”
- Result: substr_compare("cat", "catapult") = true
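
To make the failure concrete, here is a minimal, self-contained C sketch of the flawed comparison logic. It is an illustration only: buggy_substr_compare is a hypothetical stand-in that mimics the reported behavior, not PHP’s actual implementation.

#include <stdio.h>
#include <string.h>

/* Hypothetical simplification of PHP bug #54454: the comparison
 * length is clamped to strlen(main_str), so "cat" vs. "catapult"
 * compares only the first three characters and reports equality. */
static int buggy_substr_compare(const char *main_str, const char *str)
{
    size_t cmp_len = strlen(main_str);   /* bug: ignores strlen(str) */
    return memcmp(main_str, str, cmp_len);
}

int main(void)
{
    /* Prints 0 ("equal"), even though the strings clearly differ. */
    printf("%d\n", buggy_substr_compare("cat", "catapult"));
    return 0;
}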

Motivating Example

if (offset >= s1_len) {
    php_error_docref(NULL TSRMLS_CC, E_WARNING,
                     "The start position cannot exceed string length");
    RETURN_FALSE;
}

if (len > s1_len - offset) {
    len = s1_len - offset;
}

cmp_len = (uint) (len ? len : MAX(s2_len, (s1_len - offset)));

Motivating Example

len--;
if (mode & 2) {
    for (i = len - 1; i >= 0; i--) {
        if (mask[(unsigned char)c[i]]) {
            len--;
        } else {
            break;
        }
    }
}

if (return_value) {
    RETVAL_STRINGL(c, len, 1);
} else {

Automatic Documentation

- Intuition suggests that patches augmented with documentation are more maintainable
- Human patches can contain comments with hints as to the developer’s intention when changing code
- Automatic approaches cannot easily reason about why a change is made, but they can describe what was changed
- Automatically synthesized documentation: DeltaDoc (Buse et al., ASE 2010)
  - Measures semantic program changes
  - Outputs natural language descriptions of changes

Automatic Documentation

if (!con->conditional_is_valid[dc->comp]) {
    if (con->conf.log_condition_handling) {
        TRACE("cond[%d] is valid: %d", dc->comp,
              con->conditional_is_valid[dc->comp]);
    }
    /* If not con->conditional_is_valid[dc->comp]
       No longer return COND_RESULT_UNSET; */
    return COND_RESULT_UNSET;
}

/* pass the rules */
switch (dc->comp) {
case COMP_HTTP_HOST: {
    char *ck_colon = NULL, *val_colon = NULL;
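
As a rough illustration of that comment style (not DeltaDoc itself, which computes the facts it reports by analyzing the before-and-after versions of the code), a hypothetical renderer might look like:

#include <stdio.h>

/* Hypothetical, greatly simplified DeltaDoc-style renderer: given the
 * guard condition of a changed statement and the behavior the patch
 * removed, emit a comment like the one on this slide.  Real DeltaDoc
 * derives these facts automatically; here they are supplied by hand. */
static void delta_doc(const char *guard, const char *removed_behavior)
{
    printf("/* If %s\n   No longer %s */\n", guard, removed_behavior);
}

int main(void)
{
    delta_doc("not con->conditional_is_valid[dc->comp]",
              "return COND_RESULT_UNSET;");
    return 0;
}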

Questions Moving Forward

- How can we concretely measure these notions of human understandability and future maintainability?
- Can we automatically augment machine-generated patches to improve maintainability?
- In practice, are machine-generated patches as maintainable as human-generated patches?

Evaluation

Focused research questions to answer:

1) How do different types of patches affect maintainability?
2) Which source code characteristics are predictive of our maintainability measurements?
3) Do participants’ intuitions about maintainability and its causes agree with measured maintainability?

To answer these questions directly, we performed a human study using over 150 participants with real patches from existing systems.

Experiment – Subject Patches

We used patches from six benchmarks over a variety of subject domains:

Program      LOC         Defects   Patches
gzip         491,083     1         2
libtiff      77,258
lighttpd     61,528      3         4
php          1,046,421
python       407,917     1         2
wireshark    2,812,340   1         1
Total        4,896,547

Experiment – Subject Patches

- Original – the defective, un-patched code, used as a baseline for measuring relative changes
- Human-Accepted – human-created patches that have not been reverted to date
- Human-Reverted – human-created patches that were later reverted
- Machine – automatically generated patches created by the GenProg tool
- Machine+Doc – the same machine patches as above, but augmented with automatically synthesized documentation

Experiment – Maintenance Task

- Sillito et al. – “Questions programmers ask during software evolution tasks”
  - Recorded and categorized the questions developers actually asked while performing real maintenance tasks
- Example: “What is the value of the variable ‘y’ on line X?”
- Not: “Does this type have any siblings in the type hierarchy?”

Human Study

…
15 if (dc->prev) {
16     if (con->conf.log_condition_handling) {
17         log_error_write(srv, __FILE__, __LINE__, "sb", "go prev", dc->prev->key);
18     }
19     /* make sure prev is checked first */
20     config_check_cond_cached(srv, con, dc->prev);
21     /* one of prev set me to FALSE */
22     if (COND_RESULT_FALSE == con->cond_cache[dc->context_ndx].result) {
23         return COND_RESULT_FALSE;
24     }
25 }
…
28 if (!con->conditional_is_valid[dc->comp]) {
29     if (con->conf.log_condition_handling) {
30         TRACE("cond[%d] is valid: %d", dc->comp, con->conditional_is_valid[dc->comp]);
31     }
32
33     return COND_RESULT_UNSET;
34 }
…

Human Study

Question presentation:

Question: What is the value of the variable "con->conditional_is_valid[dc->comp]" on line 33? (recall, you can use inequality symbols in your answer)

Answer to the Question Above:

Human Study

…
15 if (dc->prev) {
16     if (con->conf.log_condition_handling) {
17         log_error_write(srv, __FILE__, __LINE__, "sb", "go prev", dc->prev->key);
18     }
19     /* make sure prev is checked first */
20     config_check_cond_cached(srv, con, dc->prev);
21     /* one of prev set me to FALSE */
22     if (COND_RESULT_FALSE == con->cond_cache[dc->context_ndx].result) {
23         return COND_RESULT_FALSE;
24     }
25 }
…
28 if (!con->conditional_is_valid[dc->comp]) {
29     if (con->conf.log_condition_handling) {
30         TRACE("cond[%d] is valid: %d", dc->comp, con->conditional_is_valid[dc->comp]);
31     }
32
33     return COND_RESULT_UNSET;
34 }
…

Human Study

Question presentation:

Question: What is the value of the variable "con->conditional_is_valid[dc->comp]" on line 33? (recall, you can use inequality symbols in your answer)

Answer to the Question Above: False

(Line 33 is reached only inside the if (!con->conditional_is_valid[dc->comp]) branch, so the variable must be false there.)

Evaluation Metrics

- Correctness – is the right answer reported?
- Time – what is the “maintenance effort” associated with understanding this code?
- We favor correctness over time:
  - Participants were instructed to spend as much time as they deemed necessary to correctly answer the questions
  - The percentages of correct answers did not differ across patch types in a statistically significant way
- We therefore focus on time, as it is an analog for the software engineering effort associated with program understanding

Type of Patch vs. Maintainability

Effort = the average number of minutes it took participants to report a correct answer, over all patches of a given type, relative to the original code

Characteristics of Maintainability

- We measured various code features for all patches used in the human study
- Using a logistic regression model over these features, we can predict human accuracy when answering the questions in the study 73.16% of the time
- A Principal Component Analysis shows that 17 features account for 90% of the variance in the data
- Modeling maintainability is a complex problem
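
As an illustration of that modeling step, here is a minimal logistic regression sketch in C, fit by stochastic gradient descent. Everything in it is a placeholder: the feature values and labels are synthetic, the three feature names are borrowed from the table on the next slide, and the study itself used its own data and fitting procedure.

#include <stdio.h>
#include <math.h>

#define N 8   /* synthetic participant-answer samples */
#define F 3   /* code features per sample */

static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

int main(void)
{
    /* Hypothetical features per answered question: uses-per-assignment
     * ratio, readability score, and scaled token count.  The label is 1
     * if the participant answered correctly, 0 otherwise.  All values
     * are invented for illustration. */
    double x[N][F] = {
        {1.2, 0.9, 0.3}, {3.5, 0.2, 0.9}, {1.0, 0.8, 0.4}, {2.8, 0.3, 0.8},
        {1.5, 0.7, 0.2}, {3.1, 0.4, 0.7}, {0.9, 0.9, 0.1}, {2.6, 0.1, 0.6},
    };
    int y[N] = {1, 0, 1, 0, 1, 0, 1, 0};

    double w[F] = {0.0, 0.0, 0.0}, b = 0.0, rate = 0.1;

    /* Stochastic gradient descent on the logistic loss. */
    for (int epoch = 0; epoch < 2000; epoch++) {
        for (int i = 0; i < N; i++) {
            double z = b;
            for (int j = 0; j < F; j++) z += w[j] * x[i][j];
            double err = sigmoid(z) - (double)y[i];   /* dLoss/dz */
            for (int j = 0; j < F; j++) w[j] -= rate * err * x[i][j];
            b -= rate * err;
        }
    }

    /* Training accuracy: predict "answered correctly" when p >= 0.5. */
    int hits = 0;
    for (int i = 0; i < N; i++) {
        double z = b;
        for (int j = 0; j < F; j++) z += w[j] * x[i][j];
        hits += ((sigmoid(z) >= 0.5) ? 1 : 0) == y[i];
    }
    printf("training accuracy: %d/%d\n", hits, N);
    return 0;
}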

Characteristics of Maintainability

Code Feature                                            Predictive Power
Ratio of variable uses per assignment                   0.178
Code readability                                        0.157
Ratio of variables declared out of scope vs. in scope   0.146
Number of total tokens                                  0.097
Number of non-whitespace characters                     0.090
Number of macro uses                                    0.080
Average token length                                    0.078
Average line length                                     0.072
Number of conditionals                                  0.070
Number of variable declarations or assignments          0.056
Maximum conditional clauses on any path                 0.055
Number of blank lines

Human Intuition vs. Measurement

After completing the study, participants were asked to report which code features they thought increased maintainability the most:

Human-Reported Feature                      Votes   Predictive Power
Descriptive variable names                  35*     0.000
Clear whitespace and indentation            25*     0.003
Presence of comments
Shorter function                            8*      0.000
Presence of nested conditionals
Presence of compiler directives / macros
Presence of global variables
Use of goto statements                      5*      0.000
Lack of conditional complexity
Uniform use and format of curly braces

Conclusions

From a human study involving over 150 participants and patches fixing high-priority defects in real systems, we conclude:

- Humans take less time, on average, to answer questions about machine-generated patches with automated documentation than about human-created patches, which validates the possibility of using automatic patch generation techniques in practice
- There is a strong disparity between human intuitions about maintainability and our measurements, so we think further study is merited in this area

Questions?

Modified DeltaDoc

We modify DeltaDoc in the following ways:

- Include all changes, regardless of the length of the output
- Ignore all internal optimizations that lose information (e.g., the heuristic that drops suspected unrelated statements)
- Include all relevant programmatic information (e.g., function arguments)
- Ignore all high-level output optimizations
- Favor comprehensive explanations over brevity
- Insert the output directly above patches as comments

Experiment – Participants

- Over 150 participants:
  - 27 fourth-year undergraduate CS students
  - 14 CS graduate students
  - 116 Mechanical Turk internet participants
- Accuracy cutoff imposed:
  - Ensuring that people don’t try to “game the system” requires special consideration
  - Any participant who failed to answer all questions, or who scored more than one standard deviation below the average undergraduate student’s score, was removed

Experiment – Questions

- What conditions must hold to always reach line X during normal execution?
- What is the value of the variable “y” on line X?
- What conditions must be true for the function “z()” to be called on line X?
- At line X, which variables must be in scope?
- Given the following values for relevant variables (Y=5 && Z=True), what lines are executed by beginning at line X?