Quality and applicability of automated repair

1 Quality and applicability of automated repair
Yuriy Brun (UMass Amherst), Ted Smith, Yalin Ke, Manish Motwani, Mauricio Soto, Earl Barr, Prem Devanbu, Claire Le Goues, René Just, Katie Stolee

2 Quality of the patches produced by automated repair
Applicability of automated repair to important and hard defects

3 What does it mean for automated repair to be successful?
How many bugs can it fix? That is, for how many can it make all known test cases pass?
But is making test cases pass the ultimate success? No! It conflates training and evaluation data!
Notable exceptions: Fry et al. ISSTA’12; Kim et al. ICSE’13; Qi et al. ISSTA’15; Martinez et al. EMSE’16.
Don’t get me wrong: tweaking code to pass all tests is really cool! It’s just not the end-all of automated repair.
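To make the pitfall concrete, here is a minimal Python sketch of that success criterion, assuming a hypothetical run_test helper that executes one test case against the patched program:

    # Minimal sketch of the "pass all known tests" success criterion.
    # run_test(program, test) is a hypothetical helper that returns True
    # when the patched program passes the given test case.
    def is_plausible(patched_program, known_tests, run_test):
        return all(run_test(patched_program, t) for t in known_tests)

The search is scored on exactly the tests it optimized against: the same mistake as evaluating a machine-learning model on its training data.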

4 Lingering questions
Are the produced repairs any good?
How can we tell quality objectively?

5 How can we know if APR repairs are good?
Look at the produced patches by hand [Qi, Long, Achour, Rinard ISSTA’15] (Zichao, Fan, Sara, Martin) [Martinez, Durieux, Sommerard, Xuan, Monperrus ESEM’15] (Martin’s team)
Have others look at the produced patches by hand [Fry, Landau, Weimer ISSTA’12] [Kim, Nam, Song, Kim ICSE’13]
Produce patches with test suite T, evaluate them on an independent test suite T' [Brun, Barr, Xiao, Le Goues, Devanbu 2013] [Smith, Barr, Le Goues, Brun ESEC/FSE 2015]: objective and repeatable
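The held-out idea admits a simple quality score: the fraction of T' a patch passes. A hedged sketch, reusing the hypothetical run_test helper from above:

    # Sketch of held-out evaluation: generate the patch with suite T,
    # then score it on an independent suite T' that played no role in
    # the repair. Assumes the held-out suite is non-empty.
    def held_out_quality(patched_program, independent_tests, run_test):
        passed = sum(1 for t in independent_tests if run_test(patched_program, t))
        return passed / len(independent_tests)

A patch that passes all of T but scores well below 1.0 on T' has overfit to the tests used during repair.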

6 IntroClass Benchmark
Measuring quality this way requires a large set of bugs in programs with 2 independent test suites, and the test suites need to be good.
IntroClass: 998 bugs (450 unique) in very small, student-written C programs, each with a KLEE-generated test suite and a human-written test suite. [Le Goues, Holtschulte, Smith, Brun, Devanbu, Forrest, Weimer TSE’15]

7 Cobra effect
The British government was concerned about the number of venomous cobras in Delhi, so it put a bounty on cobra heads. Business-minded individuals began breeding cobras to harvest the heads. The British government cancelled the program, the breeders released their snakes, and the population grew even larger. Rewarding the proxy (cobra heads) instead of the goal (fewer cobras) backfired; rewarding patches for passing tests instead of fixing defects carries the same risk.

8 Do GenProg and TrpAutoRepair patches pass held-out tests?
Overfitting is common! Repair quality is not to be taken for granted.

9 Repair quality distribution
[Figure: distributions of patch quality for GenProg and TrpAutoRepair]

10 More GenProg and TrpAutoRepair findings
The better the test-suite coverage, the better the patch.
APR causes harm to high-quality programs, but is helpful for low-quality programs.
Human-written tests lead to better patches.
More answers and details in "Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair" by Smith, Barr, Le Goues, Brun, ESEC/FSE 2015.

11 Why study patch quality?
To build repair techniques that produce high-quality repairs!

12 Can we improve patch quality?
Recent work:
SearchRepair [Ke, Stolee, Le Goues, Brun ASE’15]
SPR [Long and Rinard ESEC/FSE’15]
Prophet [Long and Rinard POPL’16, ICSE’16]
SearchRepair, SPR, and Prophet produce higher-quality patches than GenProg, TrpAutoRepair, and AE.

13 Applicability
What kinds of defects can automated repair fix?
Can it repair defects that are hard for humans to fix?
Can it repair defects humans think are important?
Can it synthesize feature requests?

14 Defect sets
ManyBugs [Le Goues, Holtschulte, Smith, Brun, Devanbu, Forrest, Weimer TSE’15]
Defects4J [Just, Jalali, Ernst ISSTA’14]

15 How do you measure importance?
Importance: priority, time from report to fix, project versions affected, etc.
Complexity: size of the developer-written patch.
Test effectiveness: relevant tests, triggering tests, coverage, etc.
Independence: does the defect depend on other defects?
We developed a methodology for measuring a defect’s importance, complexity, test-suite effectiveness, and independence using bug trackers, version-control history, and human-written patches; the sketch below illustrates the idea.
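As an illustration only, not the paper’s actual schema, here is a Python sketch of deriving importance and complexity proxies from a bug-tracker record and the developer-written patch; all field names (priority, reported_at, fixed_at) are hypothetical:

    # Illustrative sketch: importance and complexity proxies for one defect.
    # report holds hypothetical tracker fields (reported_at and fixed_at are
    # datetime objects); patch_diff is the developer patch as a unified diff.
    def defect_measures(report, patch_diff):
        days_to_fix = (report["fixed_at"] - report["reported_at"]).days
        # Count added/removed lines, skipping diff headers, as a size proxy.
        patch_size = sum(1 for line in patch_diff.splitlines()
                         if line.startswith(("+", "-"))
                         and not line.startswith(("+++", "---")))
        return {
            "priority": report["priority"],  # importance, straight from the tracker
            "days_to_fix": days_to_fix,      # importance: time from report to fix
            "patch_size": patch_size,        # complexity: size of the human patch
        }

Test-effectiveness measures (relevant and triggering tests, coverage) would come from running the project’s test suite, and independence from the defect’s dependency links in the tracker.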

16 Seven repair techniques
Java: GenProg GenProgJ TrpAutoRepair KaliJ AE Nopol Kali SPR Prophet Our focus was measuring defect properties, so we used prior results for which defects can be repaired.

17 Importance, complexity, tests
APR is more likely to patch defects of a higher priority. (surprising!) Human time to fix didn’t correlate with APR’s ability to patch. (encouraging!) APR is less likely to repair defects for which human patches were large, but fixed some hard defects. More tests can reduce ability to produce patches. (concerning!)

18 Quality, patch attributes
APR [1] is less likely to produce correct patches for more complex defects, but the effect is smaller for producing high-quality patches than low-quality ones. (surprising and encouraging!)
Higher-coverage test suites resulted in higher-quality patches.
APR struggled with inserting loops, new variables, and new function calls, and with changing signatures.
[1] Only Prophet and SPR produce a sufficient number of high-quality patches to evaluate.

19 Contributions
Repair quality:
Research is needed that focuses on repair quality.
A repeatable, automated, objective methodology for evaluating automated-repair quality, including the IntroClass dataset.
Repair applicability:
The defects humans and automated repair find difficult are different.
A methodology and dataset for measuring repair applicability to hard and important defects.

