Automated Documentation Inference to Explain Failed Tests
Sai Zhang, University of Washington
Joint work with: Cheng Zhang, Michael D. Ernst

A failed test reveals a potential bug. Before fixing the bug, programmers must:
– find the code relevant to the failure
– understand why the test fails

Programmers often need to guess which parts of the test and the tested code are relevant to the failure:
– Long test code
– Multiple class interactions
– Poor documentation

A failed test
Which parts of the test are most relevant to the failure? (The test is minimized, and does not dump a useful stack trace.)

  public void test1() {
    int i = 1;
    ArrayList lst = new ArrayList(i);
    Object o = new Object();
    boolean b = lst.add(o);
    TreeSet ts = new TreeSet(lst);
    Set set = Collections.synchronizedSet(ts);
    assertTrue(set.equals(set));
  }

FailureDoc: inferring explanatory documentation
FailureDoc infers debugging clues:
– Indicates changes to the test that will make it pass
– Helps programmers understand why the test fails
FailureDoc provides a high-level description of the failure from the perspective of the test
– Automated fault localization tools pinpoint the buggy statements without explaining why

Documenting the failed test
(The comments are generated by FailureDoc)

  public void test1() {
    int i = 1;
    ArrayList lst = new ArrayList(i);
    //Test passes if o implements Comparable
    Object o = new Object();
    //Test passes if o is not added to lst
    boolean b = lst.add(o);
    TreeSet ts = new TreeSet(lst);
    Set set = Collections.synchronizedSet(ts);
    assertTrue(set.equals(set));
  }

The documentation indicates: the add method should not accept a non-Comparable object, but it does. It is a real bug.

Outline
– Overview
– The FailureDoc technique
– Implementation & Evaluation
– Related work
– Conclusion

The architecture of FailureDoc
A failed test is turned into a failed test with documentation through four stages:
Mutant Generation → Execution Observation → Filter for Root Causes → Property Generalization
Running example: the failing test "x = -1; assert x > 0;" is mutated to x = 0, x = 5, and x = 2 (each followed by "assert x > 0"); execution shows that the mutants with x = 5 and x = 2 pass; the filter keeps x = 5 and x = 2 as root causes; property generalization produces x > 0, yielding the documented test:
  //Test passes if x > 0
  x = -1; assert x > 0;
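The flattened diagram above can be read as a four-stage pipeline. Below is a minimal orchestration sketch in Java; every type and method name (FailureDocPipelineSketch, MutantGenerator, runAndClassify, and so on) is a hypothetical placeholder, not FailureDoc's actual API.

  import java.util.*;

  // Hypothetical driver tying the four stages together (illustrative only).
  class FailureDocPipelineSketch {
      interface MutantGenerator     { List<String> mutate(String failedTest); }
      interface ExecutionObserver   { Map<String, Boolean> runAndClassify(List<String> mutants); }
      interface RootCauseFilter     { Set<String> filterRootCauses(Map<String, Boolean> observations); }
      interface PropertyGeneralizer { List<String> generalizeToComments(Set<String> failureCorrectingObjects); }

      // Mutant Generation -> Execution Observation -> Filter for Root Causes -> Property Generalization
      static List<String> document(String failedTest,
                                   MutantGenerator gen, ExecutionObserver obs,
                                   RootCauseFilter filter, PropertyGeneralizer doc) {
          List<String> mutants = gen.mutate(failedTest);
          Map<String, Boolean> results = obs.runAndClassify(mutants);   // observed value -> did the test pass?
          Set<String> rootCauses = filter.filterRootCauses(results);
          return doc.generalizeToComments(rootCauses);                  // comments to insert into the test
      }
  }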


Mutant generation via value replacement
Mutate the failed test by repeatedly replacing an existing input value with an alternative one
– Generates a set of slightly different tests

Original test:
  Object o = new Object();
  boolean b = lst.add(o); ...
Mutated test:
  Object o = new Integer(1);
  boolean b = lst.add(o); ...

Original test:
  TreeSet t = new TreeSet(l);
  Set s = synchronizedSet(t); ...
Mutated test:
  TreeSet t = new TreeSet();
  t.add(10);
  Set s = synchronizedSet(t); ...
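A minimal sketch of the value-replacement idea, assuming a test's inputs can be represented as a map from variable names to values; this representation and the class name are illustrative, not how FailureDoc actually manipulates test code.

  import java.util.*;

  // Illustrative only: FailureDoc rewrites real test source code; here a "test"
  // is simplified to a map from input variable names to their values.
  class ValueReplacementSketch {
      // Generate one mutant per (input, alternative) pair: copy the original
      // inputs and replace exactly one value with an alternative.
      static List<Map<String, Object>> mutants(Map<String, Object> original,
                                               Map<String, List<Object>> alternatives) {
          List<Map<String, Object>> result = new ArrayList<>();
          for (Map.Entry<String, List<Object>> e : alternatives.entrySet()) {
              for (Object alternative : e.getValue()) {
                  Map<String, Object> mutant = new LinkedHashMap<>(original);
                  mutant.put(e.getKey(), alternative);   // the single replaced value
                  result.add(mutant);
              }
          }
          return result;
      }

      public static void main(String[] args) {
          Map<String, Object> original = new LinkedHashMap<>();
          original.put("o", new Object());   // the non-Comparable input of the running example
          Map<String, List<Object>> alternatives = new LinkedHashMap<>();
          alternatives.put("o", Arrays.<Object>asList(100, (byte) 1, "hi"));
          System.out.println(mutants(original, alternatives).size() + " mutants generated");
      }
  }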

Value selection in replacement
Exhaustive selection is inefficient; random selection may miss some values.
FailureDoc selects replacement candidates by:
– mapping each value to an abstract domain using an abstract object profile representation
– sampling each abstract domain
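One plausible reading of abstraction-based candidate selection, sketched for integers only; FailureDoc's abstract object profiles are richer than this simple sign-based domain.

  import java.util.*;

  // Sketch of abstraction-based candidate selection (illustrative).
  class AbstractDomainSampling {
      // Map an integer to a coarse abstract domain.
      static String abstractOf(int v) {
          if (v < 0) return "negative";
          if (v == 0) return "zero";
          return "positive";
      }

      // Keep one representative per abstract domain instead of all values.
      static Collection<Integer> sample(List<Integer> pool) {
          Map<String, Integer> representative = new LinkedHashMap<>();
          for (int v : pool) {
              representative.putIfAbsent(abstractOf(v), v);
          }
          return representative.values();
      }

      public static void main(String[] args) {
          List<Integer> pool = Arrays.asList(-10, -3, 0, 1, 5, 42);
          System.out.println(sample(pool));  // prints [-10, 0, 1]
      }
  }

Sampling one representative per abstract domain keeps the number of mutants small while still covering qualitatively different values.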

The architecture of FailureDoc (recap): next stage – Execution Observation

Execution result observation
FailureDoc executes each mutated test and classifies it as:
– Passing
– Failing: the same failure as the original failed test
– Unexpected exception: a different exception is thrown

Original test:
  int i = 1;
  ArrayList lst = new ArrayList(i); ...
Mutated test:
  int i = -10;
  ArrayList lst = new ArrayList(i); ...
  => Unexpected exception: IllegalArgumentException
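A sketch of the classification step, assuming each mutant can be run as a Runnable and that "the same failure" can be approximated by the exception type of the original failure; both are simplifications of what FailureDoc actually does.

  // Sketch of the execution-observation step (illustrative harness, not FailureDoc's).
  class OutcomeClassifierSketch {
      enum Outcome { PASSING, FAILING, UNEXPECTED_EXCEPTION }

      static Outcome classify(Runnable mutant, Class<? extends Throwable> originalFailure) {
          try {
              mutant.run();                       // execute the mutated test
              return Outcome.PASSING;
          } catch (Throwable t) {
              // Same kind of failure as the original test -> FAILING;
              // any other exception -> UNEXPECTED_EXCEPTION.
              return originalFailure.isInstance(t) ? Outcome.FAILING
                                                   : Outcome.UNEXPECTED_EXCEPTION;
          }
      }

      public static void main(String[] args) {
          Runnable mutant = () -> { throw new IllegalArgumentException(); };
          System.out.println(classify(mutant, AssertionError.class));  // UNEXPECTED_EXCEPTION
      }
  }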

Record expression values in test execution
After value replacement, FailureDoc only needs to record expressions that can affect the test result:
– Computes a backward static slice from the assertion in passing and failing tests
– Selectively records expression values in the slice
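A much-simplified backward-slice sketch over straight-line test code, using only def-use information; a real slicer, as FailureDoc uses, also models side effects (e.g., lst.add(o) mutating lst), so the slice below is smaller than what would actually be recorded.

  import java.util.*;

  // Simplified backward slice over straight-line test code (illustrative).
  class SliceSketch {
      static class Stmt {
          final String def;            // variable defined (null if none)
          final Set<String> uses;      // variables used
          Stmt(String def, String... uses) { this.def = def; this.uses = new HashSet<>(Arrays.asList(uses)); }
      }

      // Walk backwards from the assertion, collecting statements whose defined
      // variable is (transitively) needed by the assertion.
      static List<Integer> backwardSlice(List<Stmt> stmts, Set<String> assertionUses) {
          Set<String> needed = new HashSet<>(assertionUses);
          List<Integer> slice = new ArrayList<>();
          for (int i = stmts.size() - 1; i >= 0; i--) {
              Stmt s = stmts.get(i);
              if (s.def != null && needed.contains(s.def)) {
                  slice.add(0, i);
                  needed.addAll(s.uses);
              }
          }
          return slice;   // indices of statements whose values should be recorded
      }

      public static void main(String[] args) {
          List<Stmt> test = Arrays.asList(
              new Stmt("i"),                 // int i = 1;
              new Stmt("lst", "i"),          // ArrayList lst = new ArrayList(i);
              new Stmt("o"),                 // Object o = new Object();
              new Stmt("b", "lst", "o"),     // boolean b = lst.add(o);  (side effect on lst not modeled here)
              new Stmt("ts", "lst"),         // TreeSet ts = new TreeSet(lst);
              new Stmt("set", "ts"));        // Set set = synchronizedSet(ts);
          System.out.println(backwardSlice(test, new HashSet<>(Arrays.asList("set"))));  // [0, 1, 4, 5]
      }
  }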

The architecture of FailureDoc (recap): next stage – Filter for Root Causes (statistical failure correlation)

Statistical failure correlation
A statistical algorithm isolates suspicious statements in a failed test
– A variant of the CBI algorithms [Liblit'05]
– Associates each suspicious statement with a set of failure-correcting objects
It characterizes the likelihood that each observed value v is a failure-correcting object
– Defines 3 metrics for each observed value v of each statement: Pass, Increase, and Importance

Pass(v): the percentage of passing tests in which v is observed

Original test, with an observed value from one mutant:
  public void test1() {
    int i = 1;
    ArrayList lst = new ArrayList(i);
    Object o = new Object();
    boolean b = lst.add(o);            // observed in a mutant: b = false
    TreeSet ts = new TreeSet(lst);
    Set set = synchronizedSet(ts);
    assertTrue(set.equals(set));       // this assertion fails in the original test
  }
  => PASS!  Pass(b = false) = 1
The test always passes when b is observed as false.

Pass(v), continued – same test, different observed value:
  TreeSet ts = new TreeSet(lst);       // observed in a mutant: ts = an empty set
  => PASS!  Pass(ts = an empty set) = 1
The test always passes when ts is observed as an empty set!

Pass(v), continued – same test, different observed value:
  int i = 1;                           // observed in a mutant: i = 10
  => FAIL!  Pass(i = 10) = 0
The test never passes when i is observed as 10.
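A minimal sketch of computing Pass(v) from the classified mutant runs; representing an observation as a (value description, passed?) pair is an illustrative simplification.

  import java.util.*;

  // Sketch: Pass(v) = (# passing mutant runs in which v was observed) /
  //                   (# mutant runs in which v was observed).
  class PassMetricSketch {
      static double pass(String value, List<Map.Entry<String, Boolean>> observations) {
          int observed = 0, passed = 0;
          for (Map.Entry<String, Boolean> obs : observations) {
              if (obs.getKey().equals(value)) {
                  observed++;
                  if (obs.getValue()) passed++;
              }
          }
          return observed == 0 ? 0.0 : (double) passed / observed;
      }

      public static void main(String[] args) {
          List<Map.Entry<String, Boolean>> obs = Arrays.<Map.Entry<String, Boolean>>asList(
              new AbstractMap.SimpleEntry<>("b = false", true),
              new AbstractMap.SimpleEntry<>("b = false", true),
              new AbstractMap.SimpleEntry<>("i = 10", false));
          System.out.println(pass("b = false", obs));  // 1.0
          System.out.println(pass("i = 10", obs));     // 0.0
      }
  }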

Increase(v): distinguishes the difference each observed value makes, indicating the root cause of the test passing.
In the same passing mutant, b = false and ts = an empty set are observed together (changing b's initializer to false implies ts is an empty set). Increase tells them apart:
  Increase(b = false) = 1
  Increase(ts = an empty set) = 0

Importance(v):
– the harmonic mean of Increase(v) and the ratio of passing tests
– balances sensitivity and specificity
– prefers values that score high in both dimensions
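Written as a formula (reconstructed from the slide's wording, reading "the ratio of passing tests" as Pass(v); the paper's exact definition may differ):

  \mathrm{Importance}(v) \;=\; \frac{2}{\dfrac{1}{\mathrm{Increase}(v)} + \dfrac{1}{\mathrm{Pass}(v)}}

A value that is high in both Increase(v) and Pass(v) scores high; a low value in either dimension pulls the harmonic mean down.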

Algorithm for isolating suspicious statements
Input: a failed test t
Output: suspicious statements with their failure-correcting objects
Statement s is suspicious if its failure-correcting object set FC_s ≠ Ø, where
  FC_s = { v | Pass(v) = 1                   /* v corrects the failed test */
             ∧ Increase(v) > 0               /* v is a root cause */
             ∧ Importance(v) > threshold }   /* balance sensitivity & specificity */
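A sketch of just this filtering step, assuming the three metrics have already been computed for each observed value; the Metrics type and string-keyed map are illustrative, not FailureDoc's real data structures.

  import java.util.*;

  // Sketch of the FC_s filter from the slide above.
  class SuspiciousStatementFilter {
      static class Metrics {
          final double pass, increase, importance;
          Metrics(double pass, double increase, double importance) {
              this.pass = pass; this.increase = increase; this.importance = importance;
          }
      }

      // Keep the observed values that correct the failure, are root causes,
      // and score above the importance threshold.
      static Set<String> failureCorrectingObjects(Map<String, Metrics> observed, double threshold) {
          Set<String> fc = new LinkedHashSet<>();
          for (Map.Entry<String, Metrics> e : observed.entrySet()) {
              Metrics m = e.getValue();
              if (m.pass == 1.0 && m.increase > 0 && m.importance > threshold) {
                  fc.add(e.getKey());
              }
          }
          return fc;   // the statement is suspicious iff this set is non-empty
      }

      public static void main(String[] args) {
          Map<String, Metrics> observed = new LinkedHashMap<>();
          observed.put("b = false", new Metrics(1.0, 1.0, 0.9));
          observed.put("ts = empty set", new Metrics(1.0, 0.0, 0.5));  // corrects the test but is not a root cause
          observed.put("i = 10", new Metrics(0.0, 0.0, 0.0));
          System.out.println(failureCorrectingObjects(observed, 0.5));  // [b = false]
      }
  }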

Failure-correcting objects for the example

  public void test1() {
    int i = 1;
    ArrayList lst = new ArrayList(i);
    Object o = new Object();           // failure-correcting object set: o ∈ { 100, (byte)1, "hi" }
    boolean b = lst.add(o);            // failure-correcting object set: b ∈ { false }
    TreeSet ts = new TreeSet(lst);
    Set set = synchronizedSet(ts);
    assertTrue(set.equals(set));       // this assertion fails
  }

The architecture of FailureDoc (recap): next stage – Property Generalization

Property generalization
Generalize properties for the failure-correcting objects
– Use a Daikon-like technique
– E.g., the property of the object set { 100, "hi!", (byte)1 } is: all values are Comparable
Rephrase properties into readable documentation
– Employ a small set of templates, e.g.:
    x instanceof Comparable       ⇒  x implements Comparable
    x.add(y) replaced by false    ⇒  y is not added to x
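A sketch of the template step, with two hard-coded templates mirroring the examples above; the property encoding and the template set are illustrative, not FailureDoc's real ones.

  // Sketch of template-based rephrasing (illustrative).
  class CommentTemplates {
      // Turn a generalized property over a variable into a readable comment.
      static String rephrase(String var, String property) {
          if (property.equals("instanceof Comparable")) {
              return "//Test passes if " + var + " implements Comparable";
          }
          if (property.equals("add returned false")) {
              return "//Test passes if " + var + " is not added to the collection";
          }
          return "//Test passes if " + var + " satisfies: " + property;  // fallback wording
      }

      public static void main(String[] args) {
          System.out.println(rephrase("o", "instanceof Comparable"));
          System.out.println(rephrase("o", "add returned false"));
      }
  }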

Outline
– Overview
– The FailureDoc technique
– Implementation & Evaluation
– Related work
– Conclusion

Research questions
RQ1: Can FailureDoc infer explanatory documentation for failed tests?
RQ2: Is the documentation useful for programmers to understand the test and fix the bug?

Evaluation procedure
An experiment to explain 12 failed tests from 5 subjects
– All tests were automatically generated by Randoop [Pacheco'07]
– Each test reveals a distinct real bug
A user study to investigate the documentation's usefulness
– 16 CS graduate students
– Compare the time cost of test understanding and bug fixing:
  1. Original tests (undocumented) vs. FailureDoc
  2. Delta debugging vs. FailureDoc

Subjects used in explaining failed tests
– Average test size: 41 statements
– Almost all failed tests involve complex interactions between multiple classes
– Hard to tell why they fail by simply looking at the test code

Subject               Lines of Code   # Failed Tests   Test size
Time and Money        2,…             …                …
Commons Primitives    9,…             …                …
Commons Math          14,…            …                …
Commons Collections   55,…            …                …
java.util             48,026          2                27

Results for explaining failed tests
FailureDoc infers meaningful documentation for 10 out of 12 failed tests
– Time cost is acceptable: 189 seconds per test
– Documentation is concise: 1 comment per 17 lines of test code
– Documentation is accurate: each comment indicates a different way to make the test pass, and comments never conflict with one another
FailureDoc fails to infer documentation for 2 tests:
– there is no way to correct them via value replacement

Feedback from developers
We sent all documented tests to the subject developers and got positive feedback.
Feedback from a Commons Math developer:
  "I think these comments are helpful. They give a hint about what to look at. … the comment showed me exactly the variable to look at."
Documented tests and communications with developers are available online.

User study: how useful is the documentation?
Participants: 16 graduate students majoring in CS
– Java experience: max = 7, min = 1, avg = 4.1 years
– JUnit experience: max = 4, min = 0.1, avg = 1.9 years
3 experimental treatments:
– Original tests (undocumented)
– Delta-debugging-annotated tests
– FailureDoc-documented tests
Measures:
– time to understand why a test fails
– time to fix the bug
– 30-minute time limit per test

Results of comparing undocumented tests with FailureDoc
(JUnit = undocumented tests; FailureDoc = tests with FailureDoc-inferred documentation)

Goal                           Success Rate          Average Time Used (min)
                               JUnit   FailureDoc    JUnit   FailureDoc
Understand Failure             75%     …             …       …
Understand Failure + Fix Bug   35%     …             …       …

Conclusions:
– FailureDoc helps participants understand a failed test 2.7 minutes (14%) faster
– FailureDoc slightly speeds up bug fixing (0.6 minutes faster)

Results of comparing Delta debugging with FailureDoc
(DD = tests annotated with Delta-Debugging-isolated faulty statements; FailureDoc = tests with FailureDoc-inferred documentation)

Goal                           Success Rate          Average Time Used (min)
                               DD      FailureDoc    DD      FailureDoc
Understand Failure             75%     …             …       …
Understand Failure + Fix Bug   40%     45%           …       …

Conclusions:
– FailureDoc helps participants fix more bugs
– FailureDoc helps participants understand a failed test faster (1.7 minutes, or 8.5%)
– Participants spent slightly more time (0.4 minutes) fixing a bug on average with FailureDoc, though more bugs were fixed
– Delta debugging could only isolate faulty statements in 3 tests

Feedback from participants
Overall feedback:
– FailureDoc is useful
– FailureDoc is more useful than Delta Debugging
Positive feedback:
  "The comment at line 68 did provide information very close to the bug!"
  "The comments are useful, because they indicate which variables are suspicious, and help me narrow the search space."
Negative feedback:
  "The comments, though [they] give useful information, can easily be misunderstood, when I am not familiar with the [program]."

Experiment discussion & conclusion
Threats to validity:
– Have not used human-written tests yet
– Limited user study: small tasks, a small sample of people, and unfamiliar code (is 30 minutes per test enough?)
Experiment conclusions:
– FailureDoc can infer concise and meaningful documentation
– The inferred documentation is useful for understanding a failed test

Outline
– Overview
– The FailureDoc technique
– Implementation & Evaluation
– Related work
– Conclusion

Related work
Automated test generation: random [Pacheco'07], exhaustive [Marinov'03], systematic [Sen'05], …
– Generates new tests instead of explaining existing tests
Fault localization: testing-based [Jones'04], delta debugging [Zeller'99], statistical [Liblit'05], …
– Localizes the bug in the tested code, but does not explain why a test fails
Documentation inference: method summarization [Sridhara'10], Java exceptions [Buse'08], software changes [Kim'09, Buse'10], API cross references [Long'09]
– Not applicable to tests (e.g., different granularity and techniques)

Outline
– Overview
– The FailureDoc technique
– Implementation & Evaluation
– Related work
– Conclusion

Future work
FailureDoc proposes a different abstraction to help programmers understand a failed test and fix a bug. Is there a better way? Which information is more useful for programmers?
– Fault localization: pinpointing the buggy program entities
– Simplifying a failing test
– Inferring explanatory documentation
– …
More experiments and studies are needed.

Contributions
FailureDoc: an automated technique to explain failed tests
– Mutant Generation
– Execution Observation
– Statistical Failure Correlation
– Property Generalization
An open-source tool implementation is publicly available.
An experiment and a user study show its usefulness
– Also compared with Delta debugging

[Backup slides]

Comparison with Delta debugging
Delta debugging:
– Input: a passing and a failing version of a program
– Output: failure-inducing edits
– Methodology: systematically explore the change space
FailureDoc:
– Input: a single failing test
– Output: a high-level description that explains the test failure
– Methodology: create a set of slightly different tests, and generalize the failure-correcting edits

Comparison with the CBI algorithm
The CBI algorithm:
– Goal: identify likely buggy predicates in the tested code
– Input: a large number of executions
– Method: use the boolean value of an instrumented predicate as the feature vector
Statistical failure correlation in FailureDoc:
– Goal: identify failure-relevant statements in a test
– Input: a single failed execution
– Method: use multiple observed values to isolate suspicious statements; associate each suspicious statement with a set of failure-correcting objects