August Shi, Tifany Yung, Alex Gyori, and Darko Marinov


1 August Shi, Tifany Yung, Alex Gyori, and Darko Marinov
Comparing and Combining Test-Suite Reduction and Regression Test Selection
August Shi, Tifany Yung, Alex Gyori, and Darko Marinov
FSE 2015, Bergamo, Italy, 09/02/2015
NSF Grant Nos. CCF , CCF , CCF , CCF

2 Testing is Important but Slow
(Diagram: test suite test0 … testN run against Code Under Test V0.) Testing is an important part of software development, but running tests is slow. For a given code under test, developers have to run a test suite with a large number of tests, and running all of them takes a long time.

3 Regression Testing is Slow(er)
(Diagram: the same test suite run against Code Under Test V0, V1, and V2.) Unfortunately, the situation gets even worse in regression testing: after every change, developers have to run this large test suite to ensure their changes did not break any existing functionality.

4 Speeding up Regression Testing
Test-Suite Reduction
Regression Test Selection
Test-Suite Parallelization
Refactoring Tests
Many more

5 Speeding up Regression Testing
Test-Suite Reduction
Regression Test Selection
Test-Suite Parallelization
Refactoring Tests
Many more
In this work, we study the first two approaches, test-suite reduction and regression test selection, with the goal of seeing which one is better.

6 Test-Suite Reduction (TSR)
(Diagram: TSR analysis on V0 produces a reduced suite that is reused on V1 and V2.) The analysis runs only on the first revision; there is no need to redo it later.

7 Regression Test Selection (RTS)
(Diagram: Δ between V0 and V1.) Another approach is RTS: select the tests that depend on the change, and do not select the tests that do not depend on the change.

8 Regression Test Selection (RTS)
(Diagram: selection repeated between V0/V1 and V1/V2.) Emphasize that this analysis is all based on changes, so the analysis runs at every revision.

9 TSR versus RTS (Known Qualitative Comparison)
We can compare the two approaches qualitatively:
How are tests chosen to run? TSR: redundancy (one revision); RTS: changes (two revisions)
How often is analysis performed? TSR: infrequently; RTS: every revision
Can it miss failing tests from the original test suite? TSR: yes; RTS: no (if safe)

10 How do TSR and RTS compare quantitatively?
How do TSR and RTS compare quantitatively? How can TSR and RTS be combined? We said how they compare qualitatively; now we care about comparing them quantitatively (measurements, numbers).

11 How do TSR and RTS compare quantitatively?
How do TSR and RTS compare quantitatively? How can TSR and RTS be combined? Furthermore, the qualitative comparison suggests they are orthogonal, so how can we combine them?

12 TSR Background
(Coverage matrix: tests T1–T5 as rows, statements S1–S5 as columns; T = Tests, S = Statements.) Quality metrics are always computed on ONE REVISION. Emphasize the traditional setup: different requirements are used for the evaluation!

13 TSR Background
(Coverage matrix, with the reduced suite highlighted.) Reduced test suite R = {T3, T5}. Quality metrics are always computed on ONE REVISION; emphasize the traditional evaluation using different requirements.

14 TSR Background
(Coverage matrix.) R = {T3, T5}. Researchers want to see how good a TSR technique is and to compare different TSR techniques. First metric: size.

15 TSR Background
(Coverage matrix.) R = {T3, T5}. Size = |R| / |O| = 40%

16 TSR Background
(Coverage matrix.) R = {T3, T5}. Size = |R| / |O| = 40%. Second metric: fault-detection capability. Emphasize the traditional evaluation using different requirements.

17 TSR Background
(Coverage matrix over statements S1–S5 and mutants M1–M4; T = Tests, S = Statements, M = Mutants.) R = {T3, T5}.
Size = |R| / |O| = 40%
ReqLoss = |req(O) \ req(R)| / |req(O)| = 25%, where req ∈ {stmt, mutant}
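The greedy reduction sketched on these slides can be written in a few lines. The coverage matrix below is hypothetical (the exact X marks in the slide graphic are not recoverable), chosen so the result matches the running example R = {T3, T5}:

```python
# Greedy test-suite reduction sketch: repeatedly pick the test that covers
# the most not-yet-covered requirements (here: statements).

def greedy_reduce(coverage):
    """coverage: dict mapping test name -> set of requirements it satisfies."""
    uncovered = set().union(*coverage.values())
    reduced = []
    while uncovered:
        # Test covering the most still-uncovered requirements.
        best = max(coverage, key=lambda t: len(coverage[t] & uncovered))
        reduced.append(best)
        uncovered -= coverage[best]
    return reduced

# Hypothetical coverage data, not the exact matrix from the slide.
coverage = {
    "T1": {"S1"},
    "T2": {"S2"},
    "T3": {"S1", "S2", "S3"},
    "T4": {"S4"},
    "T5": {"S4", "S5"},
}
R = greedy_reduce(coverage)
size = 100 * len(R) / len(coverage)
print(sorted(R), size)  # ['T3', 'T5'] 40.0
```

With this matrix, T3 is picked first (covers 3 statements), then T5, reproducing Size = |R| / |O| = 40%.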

18 RTS Background
(Coverage matrices for Vi-1 with statements S1–S5 and for Vi, where S2 changed and S6 was added; T = Tests, S = Statements.)

19 RTS Background
(Coverage matrices for Vi-1 and Vi; S2 changed, S6 added.) Selected tests Si,Δ = {T1, T2, T3}

20 RTS Background
(Coverage matrices for Vi-1 and Vi; S2 changed, S6 added.) Si,Δ = {T1, T2, T3}. Size = |Si,Δ| / |O| = 60%

21 RTS Background
(Coverage matrices for Vi-1 and Vi; S2 changed, S6 added.) Si,Δ = {T1, T2, T3}. Size = |Si,Δ| / |O| = 60%. Safe RTS does not select tests whose behavior does not change, and safe RTS does not fail to detect change-related faults.
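Coverage-based selection is just an intersection test per test case. A minimal sketch, with per-test statement sets that are hypothetical but arranged to match the slide's Si,Δ = {T1, T2, T3}:

```python
# RTS sketch: select every test whose covered statements intersect the
# changed statements. (A safe tool would also always select new tests.)

def select(coverage, changed):
    return {t for t, stmts in coverage.items() if stmts & changed}

# Hypothetical coverage, mirroring the slide (S2 changed in Vi).
coverage = {
    "T1": {"S1", "S2"},
    "T2": {"S2", "S3"},
    "T3": {"S2", "S5"},
    "T4": {"S4"},
    "T5": {"S4", "S5"},
}
changed = {"S2"}
selected = select(coverage, changed)
print(sorted(selected))  # ['T1', 'T2', 'T3'] -> Size = 60%
```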

22 How can TSR and RTS be combined?
Furthermore, the qualitative comparison suggests they are orthogonal, so how can we combine them?

23 Applying RTS after TSR
(Coverage matrix for Vi-1; T = Tests, S = Statements.) After diving into the details of TSR and RTS, we can start to see a way to combine the two approaches.

24 Applying RTS after TSR
(Coverage matrix for Vi-1.) R = {T3, T5}

25 Applying RTS after TSR
(Coverage matrices for Vi-1 and Vi; S2 changed, S6 added.) Ri = {T3, T5}

26 Applying RTS after TSR
(Coverage matrices for Vi-1 and Vi; S2 changed, S6 added.) Ri = {T3, T5}

27 Applying RTS after TSR: Selection of Reduction (SeRe)
(Coverage matrices for Vi-1 and Vi; S2 changed, S6 added.) SRi,Δ = {T3}. Size = |SRi,Δ| / |O| = 20%. Fault-detection capability: if RTS is safe, then SeRe is as good as the reduced test suite.

28 Metrics to compare between approaches
Size decrease: TSR: |R| / |O|; RTS: |Si,Δ| / |O|; SeRe: |SRi,Δ| / |O|
Fault-detection capability decrease: currently there is NO metric for fault-detection capability across approaches. We need a metric that takes the CHANGE into account.
How can we compare all these different approaches? Each approach already has some way to compare techniques of the same approach. We already have size, which is straightforward, but we also want fault-detection capability, and it needs to be change-related (more on that in a bit).
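Combining TSR and RTS into SeRe is set intersection over the running example's suites. A sketch using the sets from the slides:

```python
# SeRe sketch: run RTS over the reduced suite, i.e. intersect the TSR
# result with the RTS result. Sets taken from the running example.
R = {"T3", "T5"}                    # reduced suite (TSR)
S = {"T1", "T2", "T3"}              # selected tests (RTS)
O = {"T1", "T2", "T3", "T4", "T5"}  # original suite
SR = R & S
print(sorted(SR), 100 * len(SR) / len(O))  # ['T3'] 20.0
```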

29 Map Tests to Faults
(Fault matrices: tests T1–T5 × faults F1–F5 at Vi-1 and F1–F6 at Vi; T = Tests, F = Faults.) This is an idealized evaluation: we somehow have a mapping from failing tests to the faults they detect. The other faults existed before, and the developer consciously ignored them. The change-related faults are important because they can only be detected after the developer's changes. We need criteria that include these change-related faults.

30 Detect all Faults?
(Fault matrices for Vi-1 and Vi.) One could potentially demand to detect all faults, but not all of them are change-related! We should not care about faults that could already be detected before the change. QUESTION: is this the mindset developers should be in?

31 Which is Better?
(Fault matrices for Vi-1 and Vi.) One suite detects 5 faults but only 1 change-related fault.

32 Which is Better?
(Fault matrices for Vi-1 and Vi.) One suite detects 5 faults but only 1 change-related fault; the other detects 4 faults but 2 change-related faults. If the criterion is to detect all faults, we can get misleading comparisons with respect to change-related faults!

33 Finding Change-Related Faults
(Fault matrices for Vi-1 and Vi; T = Tests, F = Faults.) Safe RTS will not fail to select tests whose behavior differs after the change.

34 Finding Change-Related Faults
(Fault matrices for Vi-1 and Vi.) Si,Δ = {T1, T2, T3}

35 Finding Change-Related Faults
(Fault matrices for Vi-1 and Vi.) Si,Δ = {T1, T2, T3}; Faults(Si,Δ) = {F1, F2, F3, F4, F6}

36 Finding Change-Related Faults
Faults detected by non-selected tests cannot be change-related!
ChangeRelatedFaultsi,Δ = Faults(Si,Δ) \ Faults(Oi \ Si,Δ)
= Faults({T1,T2,T3}) \ Faults({T4,T5})
= {F1,F2,F3,F4,F6} \ {F1,F3,F5}
= {F2,F4,F6}
(Fault matrices for Vi-1 and Vi.)
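The set definition above translates directly into code. The test-to-fault mapping below is a plausible reconstruction of the slide's matrix (the exact marks are garbled), chosen to reproduce Faults(Si,Δ) = {F1,F2,F3,F4,F6} and Faults(Oi \ Si,Δ) = {F1,F3,F5}:

```python
# Change-related faults sketch: faults detected by selected tests, minus
# faults also detected by non-selected tests (those cannot be change-related).

# Hypothetical test -> detected-faults mapping at Vi.
faults = {
    "T1": {"F1", "F2"},
    "T2": {"F3", "F6"},
    "T3": {"F4"},
    "T4": {"F1", "F5"},
    "T5": {"F3"},
}

def detected(tests):
    return set().union(*(faults[t] for t in tests)) if tests else set()

O = set(faults)             # original suite
S = {"T1", "T2", "T3"}      # selected tests
change_related = detected(S) - detected(O - S)
print(sorted(change_related))  # ['F2', 'F4', 'F6']
```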

37 Change-Related Requirements (CRR)
Use testing requirements (statements covered or mutants killed) to approximate the fault-detection capability of a test suite T chosen from Oi:
CRR_{Si,Δ}(T) = req(Si,Δ ∩ T) \ req(T \ Si,Δ)
Evaluate the loss in change-related fault-detection capability of the reduced test suite. Since we do not have an idealized version with faults, we use testing requirements, like those used for TSR evaluation:
CRRLossi,Δ = 100 × |CRR_{Si,Δ}(Oi) \ CRR_{Si,Δ}(Ri)| / |CRR_{Si,Δ}(Oi)|
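CRR and CRRLoss are plain set operations over per-test requirement sets. In this sketch the mutant-kill data is made up for illustration, so the resulting loss value is not a number from the paper:

```python
# CRR sketch: approximate change-related fault-detection capability via
# killed mutants. req maps each test to the requirements it satisfies.

def crr(req, S, T):
    """CRR_S(T) = req(S ∩ T) \\ req(T \\ S)."""
    def sat(tests):
        return set().union(*(req[t] for t in tests)) if tests else set()
    return sat(S & T) - sat(T - S)

# Hypothetical mutant-kill data.
req = {"T1": {"M1"}, "T2": {"M2"}, "T3": {"M2", "M3"},
       "T4": {"M4"}, "T5": {"M3", "M4"}}
S = {"T1", "T2", "T3"}   # selected tests
O = set(req)             # original suite
R = {"T3", "T5"}         # reduced suite
loss = 100 * len(crr(req, S, O) - crr(req, S, R)) / len(crr(req, S, O))
print(loss)  # 50.0
```

Here CRR_{S}(O) = {M1, M2} and CRR_{S}(R) = {M2}, so the reduced suite loses half of the change-related requirements in this toy example.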

38 Evaluation Setup

39 Projects: LOC range from 5652 to ; Tests range from 62 to 5281

40 Experimental Setup Use Greedy heuristic to perform TSR
Remove redundant tests with respect to statement coverage Statement coverage/mutants killed collected using PIT Use Ekstazi to perform (safe) RTS Select tests based on file-level dependencies Tests selected at test class level Simulate evolving reduced test suite and selection of reduction

41 Evolving Reduced Test Suite
(Diagram: reduced suite at V0.) The reduced test suite does not necessarily stay static. Ideally we would talk with developers and see how they evolve their reduced test suite, but we cannot, so we simulate instead.

42 Evolving Reduced Test Suite
(Diagram: test0–test3 at V0.) E0,0 = R0

43 Evolving Reduced Test Suite
(Diagram: test0–test3 at V0; test4 added at V1.) E0,0 = R0

44 Evolving Reduced Test Suite
(Diagram: test0–test3 at V0; test4 added at V1.) E0,0 = R0

45 Evolving Reduced Test Suite
(Diagram: test0–test3 at V0; test4 added at V1.) E0,0 = R0; E0,1 = E0,0 ∪ (O1 \ O0)

46 Evolving Reduced Test Suite
(Diagram: suites at V0, V1, and V2.) E0,0 = R0; E0,1 = E0,0 ∪ (O1 \ O0)

47 Evolving Reduced Test Suite
(Diagram: suites at V0, V1, and V2.) E0,0 = R0; E0,1 = E0,0 ∪ (O1 \ O0)

48 Evolving Reduced Test Suite
(Diagram: suites at V0, V1, and V2.) E0,0 = R0; E0,1 = E0,0 ∪ (O1 \ O0); E0,2 = E0,1 ∩ O2

49 Evolving Reduced Test Suite
(Diagram: suites at V0, V1, and V2.) After reducing the test suite at a revision r, evolve the reduced test suite to a subsequent revision i. We can use this evolved reduced test suite everywhere else we used the reduced test suite. Er,i = (Rr ∩ Oi) ∪ (Oi \ Or)
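The evolution formula Er,i = (Rr ∩ Oi) ∪ (Oi \ Or) says: keep the reduced tests that still exist at revision i, and add every test written after the reduction. A sketch with illustrative test names (which tests are added or removed is made up):

```python
# Evolving-reduced-suite sketch: E_{r,i} = (R_r ∩ O_i) ∪ (O_i \ O_r).
O0 = {"test0", "test1", "test2", "test3"}    # original suite at V0
R0 = {"test1", "test3"}                      # reduced at V0 (hypothetical)
O2 = {"test0", "test2", "test3", "test4"}    # at V2: test1 removed, test4 added
E_0_2 = (R0 & O2) | (O2 - O0)
print(sorted(E_0_2))  # ['test3', 'test4']
```

The removed test1 drops out of the evolved suite, while the new test4 is always included, matching the union/intersection steps on the preceding slides.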

50 Selection of Reduction
We can see the result of selection of reduction by looking at the tests chosen by both TSR and RTS. Given Er,i and Si,Δ, we can intersect the two to see which tests from Er,i are selected due to changes: SEi,Δ = Er,i ∩ Si,Δ

51 Size Comparison

52 Evaluation: Size Comparisons
(Plot for Apache Commons-Lang.) Metrics: |Er,i| / |Oi| × 100, |Si,Δ| / |Oi| × 100, |SEi,Δ| / |Oi| × 100

53 Evaluation: Size Comparisons
(Plot for LA4J.) Metrics: |Er,i| / |Oi| × 100, |Si,Δ| / |Oi| × 100, |SEi,Δ| / |Oi| × 100

54 Evaluation: Size Comparison (Aggregated)
(Plot aggregated over all projects; P7 = SQL-Parser, P15 = LA4J.) Metrics: |Er,i| / |Oi| × 100, |Si,Δ| / |Oi| × 100, |SEi,Δ| / |Oi| × 100. RTS runs fewer tests than TSR (difference in medians of 40.15pp); SeRe runs even fewer tests (difference in medians of 5.34pp).

55 Change-Related Fault-Detection Capability Comparison

56 Evaluation: Fault-Detection Capability Comparison
Highest median loss = 5.93% (JOPT-Simple).
CRRLossi,Δ = 100 × |CRR_{Si,Δ}(Oi) \ CRR_{Si,Δ}(Er,i)| / |CRR_{Si,Δ}(Oi)|, where CRR_{Si,Δ}(T) = mut(Si,Δ ∩ T) \ mut(T \ Si,Δ)
TSR has a small loss in change-related fault-detection capability (greatest median loss 5.93%); RTS has no loss; SeRe has the same loss as TSR.

57 Discussion CRR is not an optimal way of measuring change-related fault-detection capability, but it is better than only looking at the changed portions of the code. Future work: finding better criteria.

58 Conclusions Regression testing is slow, but there are approaches to speed it up. Test-suite reduction (TSR) and regression test selection (RTS) are two such approaches, and we compare them quantitatively. RTS performs better than TSR: it runs fewer tests (by 40.15pp in the median) with no loss in change-related fault-detection capability. Selection of Reduction (SeRe) runs even fewer tests (by 5.34pp) with a small loss in change-related fault-detection capability (5.93%).

59 BACKUP

60 Threats to Validity Results for the projects used for evaluation may not generalize to all projects. RTS tracks dependencies at the file level and selects at the test-class level, while TSR tracks dependencies at the statement level and reduces at the test-method level; RTS selects at a coarser granularity, yet our findings show that it selects fewer tests on average than TSR. CRR relies on RTS being safe and precise: although the RTS tool is safe, it is imprecise, meaning possibly more requirements are considered change-related than actually should be.

61 Evaluation: SeRe Selection Ratio
Median difference 0.72pp. Surprising in that the ratios are so similar, not much smaller or much larger. Much smaller would mean there is much redundancy in the selected tests and reduction helps remove it; much larger would mean reduction tends to choose the large tests, which are more likely to be affected by a change and so are always selected. Metrics: |Si,Δ| / |Oi| × 100 vs. |SRi,Δ| / |Er,i| × 100. The ratios are very similar (mean ratio difference only 0.72pp), so the reduced test suite is representative of the original test suite.

62 LA4J Re-Reduction

63 Evaluation: Size Comparisons
(Plots for Apache Commons-Lang and Joda-Time.) Metrics: |Er,i| / |Oi| × 100, |Si,Δ| / |Oi| × 100

64 Evaluation: Size Comparisons
(Plots for LA4J, reduced early vs. reduced late.) Metrics: |Er,i| / |Oi| × 100, |Si,Δ| / |Oi| × 100

65 Evaluation: Size Comparisons
(Plots for LA4J, reduced early vs. reduced late.) Metrics: |Er,i| / |Oi| × 100, |Si,Δ| / |Oi| × 100

