
An Analysis of Patch Plausibility and Correctness of Generate-And-Validate Patch Generation Systems
Zichao Qi, Fan Long, Sara Achour, and Martin Rinard, MIT CSAIL

Generate-And-Validate Patch Generation Systems
Given a buggy program, generate candidate patches from a candidate patch space and validate each candidate against a test suite of test cases.
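The generate-and-validate loop can be sketched in a few lines (a minimal illustration with hypothetical names; real systems such as GenProg mutate program ASTs and use genetic search rather than iterating over a flat list of candidates):

```python
def generate_and_validate(program, candidate_patches, apply_patch, test_suite):
    """Return the first candidate patch whose patched program passes
    every test case in the test suite (a "plausible" patch)."""
    for patch in candidate_patches:
        patched = apply_patch(program, patch)
        if all(test(patched) for test in test_suite):
            return patch
    return None  # search space exhausted without a plausible patch
```

In this sketch a "program" is just a callable and each "test" checks one input/output pair; note that passing the whole suite only makes a patch plausible, not necessarily correct.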

Generate-And-Validate Patch Generation Systems

GenProg – Genetic Programming
C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer. A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. ICSE 2012
W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest. Automatically finding patches using genetic programming. ICSE 2009
S. Forrest, T. Nguyen, W. Weimer, and C. Le Goues. A genetic programming approach to automated software repair. GECCO 2009
C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. GenProg: A generic method for automatic software repair. IEEE Transactions on Software Engineering, 38(1), 2012

AE – Adaptive Search
W. Weimer, Z. P. Fry, and S. Forrest. Leveraging program equivalence for adaptive program repair: Models and first results. ASE 2013

RSRepair – Random Search
Y. Qi, X. Mao, Y. Lei, Z. Dai, and C. Wang. The strength of random search on automated program repair. ICSE 2014

All of them report impressive results

                               GenProg   AE   RSRepair
Benchmark Defects                 105   105      24
Reported Fixed Defects             55    54      24

The patches generated by these systems are different from the human-written patches, and there had been no systematic analysis of them.

We analyze the reported patches for these systems

Plausible? A patch is plausible if it produces correct outputs for all test cases in the test suite. All generated patches should be plausible.

Plausibility

                               GenProg   AE   RSRepair
Benchmark Defects                 105   105      24
Reported Fixed Defects             55    54      24
Defects With Plausible Patches     18    27      10

Reason – Weak Proxy: the patch evaluation does not check for correct output. For php and libtiff, the test scripts check the exit code, not the output. Accepted php patch: main(){ exit(0); }
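The weak-proxy problem can be reproduced in a few lines. The sketch below is hypothetical (it uses small Python subprocesses in place of the php test harness): an evaluation script that checks only the exit code accepts the do-nothing patch, while one that compares the actual output rejects it.

```python
import subprocess
import sys

def run(source):
    """Run a tiny program in a subprocess; return (exit_code, stdout)."""
    proc = subprocess.run([sys.executable, "-c", source],
                          capture_output=True, text=True)
    return proc.returncode, proc.stdout

ORIGINAL = "print(sum(range(10)))"     # correct program, prints 45
TRIVIAL  = "raise SystemExit(0)"       # analogue of main(){ exit(0); }

def weak_proxy(source):
    """php/libtiff-style evaluation: checks the exit code only."""
    code, _ = run(source)
    return code == 0

def output_check(source, expected="45\n"):
    """Stronger evaluation: also compare the actual output."""
    code, out = run(source)
    return code == 0 and out == expected
```

Both programs satisfy `weak_proxy`, so the trivial patch is accepted as "plausible"; `output_check` rejects it.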

Analysis of the reported patches for these systems

Plausible? Produce correct outputs for all test cases in the test suite. The majority of the patches are not plausible.
Correct? Eliminate the defect. Passing the test suite != correctness.

Correctness

                               GenProg   AE   RSRepair
Benchmark Defects                 105   105      24
Reported Fixed Defects             55    54      24
Defects With Plausible Patches     18    27      10
Defects With Correct Patches        2     3       2

We developed new test cases that expose the defects for all plausible but incorrect patches.

GenProg Statistics

Stronger Test Suites? Will GenProg generate correct patches given new test cases that eliminate incorrect patches?
Fixed Test Scripts? Will GenProg generate plausible patches given fixed patch evaluation scripts?

Analysis of the reported patches for these systems

Plausible? Produce correct outputs for all test cases in the test suite. The majority of the patches are not plausible.
Correct? Eliminate the defect. The overwhelming majority of the patches are not correct.
Do stronger test suites help? Rerun GenProg with fixed patch evaluation scripts and new test cases that eliminate incorrect patches.

Reexecution of GenProg on Remaining 103 Defects

First Reexecution – fixed patch evaluation, new test cases: patches for 2 defects
Second Reexecution – 2 additional test cases: patches for 0 defects

Why? The developer patches are not in the GenProg search space; the GenProg search space may not contain any correct patch for these 103 defects. A richer search space may be needed to generate correct patches.

Bottom Line For GenProg

Rerun with fixed test scripts and stronger test suites, GenProg generates patches for only 2 of 105 defects (both patches are correct).

Examples of Correct GenProg Patch (1/2)
(figure: developer patch and GenProg patch side by side) Note – not a bug!

Examples of Correct GenProg Patch (2/2)
(figure: developer patch and GenProg patch side by side)

All Correct Patches Simply Delete Code

Semantic Analysis

Analyze all the plausible patches: determine whether each patch is equivalent to a single modification that deletes functionality.

Results – GenProg: 14/18, AE: 22/27, RSRepair: 8/10

We found a common scenario

A negative test case exposes the defect
The feature is otherwise unexercised
The patch simply deletes the functionality
This introduces new security vulnerabilities (buffer overflows) or disables critical functionality (gzip cannot decompress non-zero files)

Weak test suites may be appropriate for human developers, but may not be appropriate for automatic patch generation systems (at least not by themselves).
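A Python analogue of this scenario (hypothetical; the actual defects are in C programs such as php and libtiff, where the consequence is a buffer overflow): deleting the failing check makes the negative test pass, and a weak suite that never sends oversized input cannot tell the two versions apart.

```python
def copy_into(buf, data):
    """Original code: a guarded copy that rejects oversized input."""
    if len(data) > len(buf):
        raise ValueError("input too large")
    buf[:len(data)] = list(data)
    return buf

def copy_into_patched(buf, data):
    """'Repaired' version: the check that made the negative test fail was
    simply deleted.  Oversized input now silently grows the buffer, the
    Python analogue of writing past the end of a C array."""
    buf[:len(data)] = list(data)
    return buf

# A weak test suite that only exercises in-bounds input accepts both versions.
weak_suite = [lambda f: f([0, 0, 0, 0], [1, 2]) == [1, 2, 0, 0]]
```

Both versions pass `weak_suite`, so a generate-and-validate system has no reason to prefer the guarded one.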

If all these patches simply delete functionality, why not build a patch generation system that ONLY deletes functionality?

We present Kali

An automatic patch generation system whose search space consists only of patches that remove functionality. Please refer to our paper for more details.
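A deletion-only search is simple enough to sketch in a few lines (a hypothetical illustration, not Kali itself; Kali operates on C statements and, per the paper, considers other functionality-removing modifications as well):

```python
def deletion_only_repair(statements, passes_tests):
    """Try deleting each single statement in turn; return the index of the
    first statement whose removal makes the whole test suite pass."""
    for i in range(len(statements)):
        candidate = statements[:i] + statements[i + 1:]
        if passes_tests(candidate):
            return i
    return None  # no single deletion yields a plausible patch

# Toy example: the "program" is a list of Python source lines.
program = ["x = 10", "x = x - 3", "x = x + 1"]   # the middle line is the bug

def passes_tests(stmts):
    env = {}
    try:
        exec("\n".join(stmts), env)
    except Exception:
        return False
    return env.get("x") == 11                    # the test suite: x must be 11
```

Because the returned index pinpoints the statement whose removal satisfies the suite, even an incorrect deletion patch localizes the defective code for the developer.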

Experimental Results of Kali

                               GenProg   AE   RSRepair   Kali
Benchmark Defects                 105   105      24       105
Reported Fixed Defects             55    54      24        –
Defects With Plausible Patches     18    27      10        –
Defects With Correct Patches        2     3       2        –

Kali is as good as the previous systems, and much simpler:
No need to know which source code file to repair
Can pinpoint the defective code
Can provide insight into important defect characteristics


Is Automatic Patch Generation A Total Failure? NO!

Path To Success

Richer search spaces
More efficient search algorithms
Incorporate additional sources of information: correct code from other applications, learned characteristics of human patches, learned invariants, specifications

Promising directions

Learn invariants from correct executions
ClearView: J. H. Perkins, S. Kim, S. Larsen, S. Amarasinghe, J. Bachrach, M. Carbin, C. Pacheco, F. Sherwood, S. Sidiroglou, G. Sullivan, et al. Automatically patching errors in deployed software. SOSP 2009. Patches security vulnerabilities in 9 of 10 defects; at least 4 patches are correct.

Solvers
NOPOL: F. DeMarco, J. Xuan, D. Le Berre, and M. Monperrus. Automatic repair of buggy if conditions and missing preconditions with SMT. CSTVA 2014
SemFix: H. D. T. Nguyen, D. Qi, A. Roychoudhury, and S. Chandra. SemFix: Program repair via semantic analysis. ICSE 2013

Promising directions

Specifications
AutoFix-E: Yu Pei, Carlo A. Furia, Martin Nordio, Yi Wei, Andreas Zeller, and Bertrand Meyer. Automated Fixing of Programs with Contracts. IEEE Transactions on Software Engineering, 2014
Etienne Kneuss, Manos Koukoutos, and Viktor Kuncak. Deductive Program Repair. CAV 2015

Correctness evaluation
Thomas Durieux, Matias Martinez, Martin Monperrus, Romain Sommerard, and Jifeng Xuan. Automatic Repair of Real Bugs: An Experience Report on the Defects4J Dataset. Technical report 1505.07002, arXiv, 2015

Code from another application
CodePhage: S. Sidiroglou, E. Lahtinen, F. Long, and M. Rinard. Automatic error elimination by multi-application code transfer. PLDI 2015

Promising Directions

                               SPR   Prophet
Benchmark Defects              105     105
Defects With Plausible Patches  41      42
Defects With Correct Patches    11      15

SPR – staged condition synthesis. F. Long and M. Rinard. Staged program repair in SPR. To appear in ESEC/FSE 2015
Prophet – learns from successful human patches. F. Long and M. Rinard. Prophet: Automatic patch generation via learning from successful human patches. Under submission

Take Aways

Facts about GenProg/AE/RSRepair:
These systems fix 2/3/2 of 105 bugs (not 55/54/24)
Errors in test scripts and weak test suites
Fixed test scripts and stronger test suites do not help

Paths to success:
Richer search spaces
More efficient search algorithms
Incorporate additional sources of information: correct code from other applications (CodePhage), learned characteristics of human patches (Prophet), learned invariants (ClearView), specifications (AutoFix-E, Deductive Repair)

Summary

Evaluation of GenProg, AE, and RSRepair:
Incorrect results
Patches equivalent to functionality elimination
Stronger test suites do not help

Kali:
A functionality elimination system
Helps developers better understand the bug

Path to Success for automatic patch generation systems

Richer search spaces
More efficient search algorithms
Incorporate additional sources of information: correct code from other applications, learned characteristics of human patches, learned invariants
Better patch evaluation: correctness, understand the negative effects

Questions?