Mitigating the Effects of Flaky Tests on Mutation Testing

August Shi, Jonathan Bell, Darko Marinov
ISSTA 2019, Beijing, China, 7/18/2019

Speaker notes: Hello, my name is August Shi, and I am here to present our work, “Mitigating the Effects of Flaky Tests on Mutation Testing”. This is joint work with Jonathan Bell and Darko Marinov, and we are from the University of Illinois at Urbana-Champaign and George Mason University.

Mutation Testing
- Compare test suites by mutation score
- Guide testing based on mutant-test matrix

[Figure: test1, test2, and test3 run on the code under test and on Mutant 1 and Mutant 2; each mutant-test pair in the resulting matrix is marked Killed or Survived. Slide overlay: UNRELIABLE.]

Speaker notes: As you can see from our title, our work addresses mutation testing, so I would like to start with some background on mutation testing. The goal of mutation testing is to check the quality of the test suite. Let’s say we have some code under test and the corresponding test suite. The tests all pass when run on the code, but are they able to detect faults that get introduced into the code as code evolves? Mutation testing tries to evaluate the fault-detection capability…

Mutation Testing with Flaky Tests
- Get test suite with deterministic outcomes
- Debug/fix flaky tests [1]
- Remove/ignore flaky tests

[Figure: test1, test2, and test3 run on the code under test in Run 1 and Run 2, and on Mutant 1 and Mutant 2; the pair statuses in the matrix become uncertain (Survived? Killed?). Slide overlay: STILL FLAKY.]

Speaker notes: That was traditional mutation testing, but what happens when we consider the possibility of flaky tests? First, what are flaky tests? Well, let’s say we run the tests once on the code and observe all tests passing, as we saw earlier. But let’s say we run it again on the same version of code with no changes, and now we see a test, test3, failing…

[1] August Shi et al. “iFixFlakies: A Framework for Automatically Fixing Order-Dependent Tests”. ESEC/FSE 2019

Flaky Coverage Example
Other reasons for flakiness:
- Concurrency
- Randomness
- I/O
- Order dependency

    public class WatchDog {
      // ...
      public void run() {
        synchronized (this) {
          long timeLeft = timeout - (System.currentTimeMillis() - startTime);
          isWaiting = timeLeft > 0;
          while (isWaiting) {
            wait(timeLeft);
          }
        }
        // ...
      }
    }

    public void test() {
      new WatchDog().run();
      ...
    }

    Variable/Call          Value (Run 1)   Value (Run 2)
    timeout                5000            5000
    startTime              300000          500000
    currentTimeMillis()    300300          510000
    timeLeft               4700            -5000
    isWaiting              true            false
    TEST OUTCOME           PASS            PASS

Speaker notes: Okay, so what if we don’t have flaky tests to the degree of their outcomes changing between runs? Can we still have problems with mutation testing due to flakiness? Let’s consider this example (adapted from code and tests we observed in Apache commons-exec)…
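The table's arithmetic can be replayed deterministically by passing the clock reading in as a parameter. The sketch below is only an illustration of the slide's timing check: the class name, method name, and injectable clock are hypothetical and are not taken from commons-exec.

```java
import java.util.function.LongSupplier;

/** Illustrative only: the slide's timing check with an injectable clock. */
class WatchDogCoverageDemo {

    /** Returns true if the wait loop body would be entered, i.e. its statements get covered. */
    static boolean entersWaitLoop(long timeout, long startTime, LongSupplier clock) {
        long timeLeft = timeout - (clock.getAsLong() - startTime);
        return timeLeft > 0;
    }

    public static void main(String[] args) {
        // Clock readings are the currentTimeMillis() values from the table.
        boolean run1 = entersWaitLoop(5000, 300000, () -> 300300L);
        boolean run2 = entersWaitLoop(5000, 500000, () -> 510000L);
        System.out.println("Run 1 covers loop body: " + run1); // true
        System.out.println("Run 2 covers loop body: " + run2); // false
    }
}
```

Both runs would let the test pass, yet only the first one executes the statements inside the loop, which is the coverage difference the slide points at.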

Motivating Study
- Measure flakiness of coverage
- 30 open-source GitHub projects from prior work
- No flaky test outcomes! (all 35,850 tests pass in 17 runs)
- Rerun tests and measure differences in coverage
  - 113,356 (22%) statements with different tests covering across runs
  - 5,736 (16%) tests cover different statements across runs
- Lots of flakiness in coverage, even without flaky outcomes!

Speaker notes: We performed a motivating study to measure this flakiness in coverage.
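One way to picture the "tests cover different statements across runs" measurement is to compare each test's covered-statement set between runs. The following is a minimal sketch under assumed data structures (a map from test name to covered statement IDs per run); it is not the study's tooling, and all names and sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CoverageFlakiness {

    /** Names of tests whose covered-statement set differs in at least one run. */
    static Set<String> testsWithFlakyCoverage(List<Map<String, Set<String>>> runs) {
        Set<String> flaky = new TreeSet<>();
        if (runs.isEmpty()) return flaky;
        Map<String, Set<String>> first = runs.get(0);
        for (Map<String, Set<String>> run : runs) {
            // Consider every test seen in either run; a missing test counts as empty coverage.
            Set<String> tests = new HashSet<>(first.keySet());
            tests.addAll(run.keySet());
            for (String test : tests) {
                Set<String> a = first.getOrDefault(test, new HashSet<>());
                Set<String> b = run.getOrDefault(test, new HashSet<>());
                if (!a.equals(b)) flaky.add(test);
            }
        }
        return flaky;
    }

    /** Two hypothetical runs: test3 covers an extra statement in the second run. */
    static List<Map<String, Set<String>>> sampleRuns() {
        Map<String, Set<String>> run1 = new HashMap<>();
        run1.put("test1", new HashSet<>(Arrays.asList("A:10", "A:11")));
        run1.put("test3", new HashSet<>(Arrays.asList("W:5")));
        Map<String, Set<String>> run2 = new HashMap<>();
        run2.put("test1", new HashSet<>(Arrays.asList("A:10", "A:11")));
        run2.put("test3", new HashSet<>(Arrays.asList("W:5", "W:8")));
        List<Map<String, Set<String>>> runs = new ArrayList<>();
        runs.add(run1);
        runs.add(run2);
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(testsWithFlakyCoverage(sampleRuns())); // [test3]
    }
}
```

Comparing every run against the first is enough here: a test has stable coverage exactly when all of its per-run sets equal the first one.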

Mutation Testing with Flaky Coverage
    public class WatchDog {
      // ...
      public void run() {
        synchronized (this) {
          long timeLeft = timeout - (System.currentTimeMillis() - startTime);
          isWaiting = timeLeft > 0;
          while (isWaiting) {
            wait(timeLeft);
          }
        }
        // ...
      }
    }

    public void test() {
      new WatchDog().run();
      ...
    }

    Variable/Call          Value (Run 1)   Value (Mut Run)
    timeout                5000            5000
    startTime              300000          500000
    currentTimeMillis()    300300          510000
    timeLeft               4700            -5000
    isWaiting              true            false

- Mutation: delete call
- Mutation not covered! In the mutant run, isWaiting is false, so the loop body never executes.

Speaker notes: So how does flakiness in coverage affect mutation testing?

Mutation Testing Results are Unreliable
- Flakiness can shift mutation testing results
  - Mutation scores may be inflated/deflated
  - Mutant-test matrix unreliable
- Need to mitigate the effects of flakiness on mutation testing!
  - Mitigation strategies based on reruns and isolation [2]
  - Implemented on PIT, a popular mutation testing tool for Java

[2] Jonathan Bell et al. “DeFlaker: Automatically Detecting Flaky Tests”. ICSE 2018

Mitigating Flakiness in Mutation Testing
Traditional mutation testing:
Full test-suite coverage collection -> Mutants to test -> Test-mutant prioritization -> Sorted tests per mutant -> Mutant execution

Improvements to cope with flakiness:
- Rerun and isolate tests
- Run tests with least flaky coverage first
- Track mutations covered
- Rerun/isolate tests

See paper

Coverage Collection

                            Once         Rerun Multiple Times
    All tests in same JVM   Default      Default-Reruns
    Each test in own JVM    Isolation    Isolation-Reruns

- When running multiple times, union coverage
- More lines covered means more mutants generated
- Run tests in isolation to remove test-order dependencies
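The "union coverage" step can be sketched as merging per-run coverage maps, so that a line counts as covered by a test if any run covered it. This is an assumed representation (test name mapped to covered line numbers), not PIT's internal data model; the class and sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class UnionCoverage {

    /** Merges coverage from several runs: a test covers a line if it did so in any run. */
    static Map<String, Set<Integer>> union(List<Map<String, Set<Integer>>> runs) {
        Map<String, Set<Integer>> merged = new TreeMap<>();
        for (Map<String, Set<Integer>> run : runs) {
            for (Map.Entry<String, Set<Integer>> e : run.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        return merged;
    }

    /** Hypothetical data: test1 reaches line 8 only in the second run. */
    static List<Map<String, Set<Integer>>> sampleRuns() {
        Map<String, Set<Integer>> run1 = new HashMap<>();
        run1.put("test1", new TreeSet<>(Arrays.asList(5, 6)));
        Map<String, Set<Integer>> run2 = new HashMap<>();
        run2.put("test1", new TreeSet<>(Arrays.asList(5, 6, 8)));
        run2.put("test2", new TreeSet<>(Arrays.asList(5)));
        List<Map<String, Set<Integer>>> runs = new ArrayList<>();
        runs.add(run1);
        runs.add(run2);
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(union(sampleRuns())); // {test1=[5, 6, 8], test2=[5]}
    }
}
```

Because the union can only grow with more runs, it is consistent with the slide's remark that more covered lines mean more mutants generated.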

Executing Tests on Mutants
- Monitor if tests actually execute mutated bytecode
- Traditionally, mutant-test pair has status Killed or Survived
  - Only applicable if test executes the mutated bytecode
- Mutant-test pair with test that does not execute mutated bytecode has new status Unknown
  - Test can potentially cover mutation, based on prior coverage

[Figure: mutant-test matrix for Mut 1 and Mut 2 against test1, test2, and test3, with entries such as Survived and Unknown.]
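The pair-level rule described above can be written as a small function. This is a minimal sketch with hypothetical names; how PIT actually records whether the mutated bytecode was executed is not shown here.

```java
class PairStatusDemo {
    enum Status { KILLED, SURVIVED, UNKNOWN }

    /**
     * Classifies one mutant-test pair. Killed/Survived only apply when the test
     * actually executed the mutated bytecode; otherwise the pair is Unknown.
     */
    static Status classify(boolean executedMutatedBytecode, boolean testPassed) {
        if (!executedMutatedBytecode) {
            return Status.UNKNOWN;
        }
        return testPassed ? Status.SURVIVED : Status.KILLED;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, false));  // KILLED
        System.out.println(classify(true, true));   // SURVIVED
        System.out.println(classify(false, true));  // UNKNOWN
    }
}
```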

New Status for Mutants
- Overall mutant status depends on status of all mutant-test pairs run for the mutant
- Statuses: Killed + Covered, Survived + Covered, Unknown (not covered)
- Need to reduce number of Unknown mutants and pairs
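The slide does not spell out how pair statuses combine into a mutant status. The sketch below assumes one plausible rule: a mutant is Killed if any pair is Killed, Survived if every pair is Survived, and Unknown otherwise. That rule and all names are assumptions made for illustration, not a statement of the authors' exact definition.

```java
import java.util.Arrays;
import java.util.List;

class MutantStatusDemo {
    enum Status { KILLED, SURVIVED, UNKNOWN }

    /** Assumed aggregation rule: any Killed wins; all Survived gives Survived; else Unknown. */
    static Status overall(List<Status> pairStatuses) {
        if (pairStatuses.contains(Status.KILLED)) {
            return Status.KILLED;
        }
        if (!pairStatuses.isEmpty() && !pairStatuses.contains(Status.UNKNOWN)) {
            return Status.SURVIVED;
        }
        return Status.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(overall(Arrays.asList(Status.SURVIVED, Status.KILLED)));   // KILLED
        System.out.println(overall(Arrays.asList(Status.SURVIVED, Status.SURVIVED))); // SURVIVED
        System.out.println(overall(Arrays.asList(Status.SURVIVED, Status.UNKNOWN)));  // UNKNOWN
    }
}
```

Under this rule, a single Unknown pair is enough to keep an otherwise surviving mutant Unknown, which is why reducing Unknown pairs matters.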

Rerunning Mutant-test Pairs
- While status of mutant-test pair is Unknown, rerun
- Change isolation level during reruns, with increasing cost:
  - Default: mutant-test pairs for mutants in same class in same JVM
  - More Isolation: mutant-test pairs for same mutant in same JVM
  - Most Isolation: mutant-test pairs in own JVM
- Reduce flakiness at cost of performance

Speaker notes: Why does isolation help? Why is it expensive? Rerun a number of times at each level; the aim is to reduce the number of unknowns, though it may not reach 0.
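The rerun strategy can be sketched as a loop that retries a pair at the current isolation level and escalates only while the status stays Unknown. The `PairRunner` interface, the `rerunsPerLevel` parameter, and the stubbed runner are hypothetical; launching JVMs at each isolation level is deliberately left out.

```java
class EscalatingRerun {
    enum Isolation { DEFAULT, MORE, MOST }
    enum Status { KILLED, SURVIVED, UNKNOWN }

    /** Hypothetical hook that executes one mutant-test pair at a given isolation level. */
    interface PairRunner {
        Status run(Isolation level, int attempt);
    }

    /** Reruns while the pair is Unknown, moving to more isolation after each level is exhausted. */
    static Status resolve(PairRunner runner, int rerunsPerLevel) {
        for (Isolation level : Isolation.values()) {
            for (int attempt = 1; attempt <= rerunsPerLevel; attempt++) {
                Status status = runner.run(level, attempt);
                if (status != Status.UNKNOWN) {
                    return status; // covered: no need for costlier reruns
                }
            }
        }
        return Status.UNKNOWN; // may remain Unknown even at the most isolation
    }

    public static void main(String[] args) {
        // Stub: the pair only covers the mutation on the 2nd rerun with more isolation.
        PairRunner stub = (level, attempt) ->
                (level == Isolation.MORE && attempt == 2) ? Status.SURVIVED : Status.UNKNOWN;
        System.out.println(resolve(stub, 5));                               // SURVIVED
        System.out.println(resolve((level, attempt) -> Status.UNKNOWN, 5)); // UNKNOWN
    }
}
```

Trying the cheaper levels first keeps the expensive own-JVM runs for the pairs that are still Unknown.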

Experimental Setup
- Evaluate on same 30 projects in motivating study
- All modifications on top of PIT mutation testing tool
- RQ1: Flakiness in traditional mutation testing?
- RQ2: Effect of coverage on mutants generated? (See paper)
- RQ3: Effect of re-executing tests on mutant status?
- RQ4: Prioritize tests for mutant-test executions? (See paper)

RQ1: Flakiness in Traditional Mutation Testing
Mutants by Status

               Killed    Survived   Unknown   Total     Mut. Score
    Overall    51,687    11,965     2,866     66,518    77.7%-82.0%

- Max difference up to 23pp!
- Must improve mutation scores more than this variance!

Mutant-Test Pairs by Status

               Killed       Survived     Unknown    Total
    Overall    1,569,658    1,097,506    255,194    2,922,358

- 9% of mutant-test pairs are unknown (max up to 55%)!
- Matrix results can be unreliable

Speaker notes: Call out the findings, that mutation scores can vary! Also call out that the tests did not appear flaky from initial outcomes!
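The reported score range is consistent with treating every Unknown mutant as Survived for the lower bound and as Killed for the upper bound. The sketch below checks that arithmetic against the Overall row; the interpretation of the bounds is an assumption inferred from the numbers, and the class and method names are hypothetical.

```java
import java.util.Locale;

class MutationScoreRange {

    /** Lower bound (percent): every Unknown mutant is counted as Survived. */
    static double lower(long killed, long survived, long unknown) {
        return 100.0 * killed / (killed + survived + unknown);
    }

    /** Upper bound (percent): every Unknown mutant is counted as Killed. */
    static double upper(long killed, long survived, long unknown) {
        return 100.0 * (killed + unknown) / (killed + survived + unknown);
    }

    public static void main(String[] args) {
        // Overall row from the "Mutants by Status" table.
        long killed = 51_687, survived = 11_965, unknown = 2_866;
        System.out.println(String.format(Locale.ROOT, "%.1f%%-%.1f%%",
                lower(killed, survived, unknown),
                upper(killed, survived, unknown))); // 77.7%-82.0%
    }
}
```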

RQ3: Mutant Re-execution Results
Overall

                                 Before     After     Reduction
    Unknown Mutants              2,866      591       2,275 (79.4%)
    Unknown Mutant-Test Pairs    255,194    30,321    224,873 (88.1%)

Add. Covered Pairs (Overall), by rerun number

    Reruns            1         2         3         4        5
    Default           61,437    41,302    14,787    6,590    18,762
    More Isolation    46,819    14,072    1,000     629      3,872
    Most Isolation    15,594

- Increasing isolation greatly increases covered pairs
- Unnecessary to rerun too often with the most isolation

Discussion
- Flakiness can have negative effects beyond mutation testing
  - Tools/studies that rely on coverage must consider flakiness
  - Fault localization, program repair, test prioritization, test-suite reduction, test selection, test generation, runtime verification, …
- Mitigation strategies applicable beyond mutation testing
  - Different isolation strategies for different tasks

Speaker notes: Flakiness in coverage happens and can affect anything that relies on coverage. We observe it in mutation testing, but other techniques can suffer too. Our mitigation strategies can be applicable beyond mutation testing.

Conclusions
- Even seemingly non-flaky tests have flaky coverage
  - 22% of statements not covered consistently!
- We present problems in mutation testing due to flakiness
- We propose techniques to mitigate effects
  - Different combinations of reruns and isolation
  - We reduce Unknown mutants/pairs by 79.4%/88.1%
- Flakiness can have negative effects beyond mutation testing

Speaker notes: Link is in the ACM digital library.

BACKUP

Prioritizing Tests for Mutants
- Run mutant-test pairs in the order that gets the overall mutant status faster, more reliably
  - Once mutant status known, no need to run more
- Prioritize tests per mutant based on coverage
  - Tests with more “stable” coverage on mutant prioritized earlier
  - Later prioritize based on time
- When to rerun?
  - Immediately rerun pair?
  - Run all pairs first before rerunning?
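One way to read "stable coverage first, then time" is as a two-key sort. The sketch below assumes stability is measured as the number of coverage runs in which a test covered the mutated code, with running time as the tie-breaker; that metric, the `TestInfo` type, and the sample data are hypothetical, and the paper should be consulted for the actual prioritization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class StableCoveragePriority {

    /** Hypothetical per-test record for one mutant. */
    static final class TestInfo {
        final String name;
        final int runsCoveringMutant; // assumed stability measure
        final long runtimeMillis;

        TestInfo(String name, int runsCoveringMutant, long runtimeMillis) {
            this.name = name;
            this.runsCoveringMutant = runsCoveringMutant;
            this.runtimeMillis = runtimeMillis;
        }
    }

    /** Orders tests by most stable coverage first, then by shortest running time. */
    static List<String> order(List<TestInfo> tests) {
        List<TestInfo> sorted = new ArrayList<>(tests);
        sorted.sort(Comparator
                .comparingInt((TestInfo t) -> t.runsCoveringMutant).reversed()
                .thenComparingLong(t -> t.runtimeMillis));
        List<String> names = new ArrayList<>();
        for (TestInfo t : sorted) {
            names.add(t.name);
        }
        return names;
    }

    static List<TestInfo> sample() {
        return Arrays.asList(
                new TestInfo("testA", 3, 120),
                new TestInfo("testB", 5, 900),
                new TestInfo("testC", 5, 40));
    }

    public static void main(String[] args) {
        System.out.println(order(sample())); // [testC, testB, testA]
    }
}
```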

RQ2: Coverage and Mutant Generation
Number of Mutants (Overall), across the Default, Isolated, and Reruns configurations:
70,773; 70,993; 70,877; 71,112

Number of Mutant-Test Pairs (Overall), across the Default, Isolated, and Reruns configurations:
3,089,051; 3,162,138; 3,101,314; 3,165,527

- Not much difference in numbers of mutants and pairs
- Can potentially use Default for mutant generation

RQ4: Prioritizing Tests
Running Time for Immediately Rerun (s)

               Random      Coverage    PIT         Best        Worst
    Overall    84,013.0    51,821.8    51,804.9    42,333.4

Running Time for Not Immediately Rerun (s)

               Random      Coverage    PIT         Best        Worst
    Overall    90,479.0    60,810.3    60,793.3    52,014.6    284,820.7
