Covrig: A Framework for the Analysis of Code, Test, and Coverage Evolution in Real Software
Paul Marinescu, Petr Hosek, Cristian Cadar
Imperial College London
Goal
Answer questions about software evolution:
– Code quality
– Test quality
– Development model
– Testing improvement opportunities
…using software development historical data
Target Audience
– Researchers: hypothesis validation (e.g. are software patches poorly tested?)
– Programmers/project managers: assess development quality
Software Metrics
– Static: measured by parsing the software artifacts
– Dynamic: require running the evolving software; more challenging; very few studies
Example questions
1. Do executable and test code evolve in sync?
2. How many patches touch only code/test/none/both?
3. What is the distribution of patch sizes?
4. How spread out is each patch through the code?
5. Is test suite execution deterministic?
6. How does the overall coverage evolve?
7. What is the distribution of patch coverage across revisions?
8. What is the latent patch coverage?
9. Are bug fixes better covered than other patches?
10. Is the coverage of buggy code less than average?
Outline
– Data mining infrastructure
– Empirical case study
Covrig Overview
Docker Containers
Lightweight, OS-level virtualization:
– Guest shares kernel with host
– Namespace isolation: PID, network, IPC, filesystem
– Resource limiting: cgroups
Linux Containers + Docker
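As a rough sketch of how a driver could use these container features, the helper below assembles a `docker run` invocation that executes one revision's build and test suite in a throwaway, resource-limited container. The image name, revision id, limits, and build commands are all illustrative, not Covrig's actual interface.

```python
# Sketch (illustrative names): build the argv a mining driver could use to run
# one revision's tests in a fresh, cgroup-limited Docker container.

def docker_run_command(image, revision, mem_limit="2g", cpus="1"):
    """Return the argv for running one revision's tests in a disposable container."""
    return [
        "docker", "run", "--rm",      # discard the container afterwards
        "--memory", mem_limit,        # cgroup memory limit
        "--cpus", cpus,               # cgroup CPU limit
        image,
        "/bin/sh", "-c",
        # Hypothetical build/test recipe; real projects differ.
        f"git checkout {revision} && make && make check",
    ]

cmd = docker_run_command("covrig/lighttpd", "a1b2c3d")
print(" ".join(cmd[:3]))  # docker run --rm
```

Because `--rm` discards all container state, every revision starts from an identical filesystem, which is what makes runs reproducible across machines.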
Docker Containers: Features
– Isolation
– Consistency
– Reproducibility
– Easy cloud deployment
– Performance
Covrig
Metrics collected (metric — granularity)
Static:
– Test size — lines
– Executable code size — lines
– Patch size — lines, hunks, files
Dynamic:
– Overall coverage — lines, branches
– Patch coverage — lines, branches
– Latent patch coverage — lines
– Test result — FAIL/PASS
Challenges
– Evolving dependencies
– Evolving containers
– Custom compile flags (-Wno-error)
Challenges
Branching development structure: consider only the ‘main’ branch
[Figure: Alice commits r1, r3 on the main branch; Bob's r2, r4 arrive through merge m1 and appear as r2+r4]
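The slide's "main branch only" rule is what `git log --first-parent` implements; it can be sketched over a toy commit graph (all commit ids and the parent map below are illustrative):

```python
# Sketch: linearise a branching history by following only first parents,
# mirroring the slide's choice to consider only the 'main' branch
# (the behaviour of `git log --first-parent`).

def first_parent_walk(parents, head):
    """Yield revisions from head back to the root, first parents only."""
    rev = head
    while rev is not None:
        yield rev
        rev = parents[rev][0] if parents[rev] else None

# Toy graph: Alice commits r1, r3 on main; Bob's r2, r4 arrive via merge m1.
parents = {
    "r1": [],            # root
    "r2": ["r1"],        # Bob branches off r1
    "r3": ["r1"],        # Alice continues on main
    "r4": ["r2"],
    "m1": ["r3", "r4"],  # merge commit: first parent is the main-branch side
}
print(list(first_parent_walk(parents, "m1")))  # ['m1', 'r3', 'r1']
```

Bob's r2 and r4 never appear individually on the walked line; their combined effect shows up only through the merge commit m1, matching the r2+r4 collapse on the slide.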
Challenges
Revisions that fail to compile: accumulate until reaching a compilable revision
[Figure: r1, r2, r3 collapsed into r1+r2+r3]
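The accumulation strategy can be sketched as a simple grouping loop; `compiles` below stands in for actually invoking the build, and the revision ids are illustrative:

```python
# Sketch: fold each non-compiling revision into the next one until a
# compilable revision is reached, so analysis runs on r1+r2+r3 as one unit.

def group_compilable(revisions, compiles):
    """Group consecutive revisions so each group ends in a compilable one."""
    groups, pending = [], []
    for rev in revisions:
        pending.append(rev)
        if compiles(rev):          # build succeeded: flush the accumulated group
            groups.append(pending)
            pending = []
    return groups                  # trailing never-compiling revisions are dropped

ok = {"r3", "r5"}  # hypothetical: only r3 and r5 build cleanly
print(group_compilable(["r1", "r2", "r3", "r4", "r5"], lambda r: r in ok))
# [['r1', 'r2', 'r3'], ['r4', 'r5']]
```

Metrics such as patch size and patch coverage are then attributed to the combined group rather than to the individual broken revisions.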
Outline
– Data mining infrastructure
– Empirical case study
Case Study Subjects

App        ELOC     Tests (lang)   Tests (LOC)   Period (mo)
Binutils   27,029   DejaGnu        5,186         35
Git        79,760   C/shell        108,464       5
Lighttpd   23,884   Python         2,440         36
Memcached  4,426    C/Perl         4,605         47
Redis      18,203   Tcl            7,589         6
ZeroMQ     7,276    C++            3,…           …

1,500 revisions and 12 years of development in total
Patch type
[figure]
Is test suite execution deterministic?
FAIL/PASS determinism
[Figure: nondeterministic revisions per application — Binutils, Git, Lighttpd, Memcached, Redis, ZeroMQ]
Is test suite execution deterministic?
Coverage determinism
[Figure: nondeterministic lines (median) per application — Binutils, Git, Lighttpd, Memcached, Redis, ZeroMQ]
Test Suite Nondeterminism Causes
– Bugs: race conditions; hardcoded wall-clock timeouts; incorrect resource-consumption expectations
– Random test data
– Benign race conditions
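Detecting FAIL/PASS nondeterminism amounts to running each revision's suite several times and flagging unstable outcomes; the sketch below uses a toy `run_tests` stand-in (the real execution would happen inside the containers described earlier):

```python
# Sketch: flag revisions whose test outcome is not stable across repeated runs.
# `run_tests` is a hypothetical stand-in for the real containerised execution.

def nondeterministic_revisions(revisions, run_tests, repeats=3):
    flaky = []
    for rev in revisions:
        outcomes = {run_tests(rev) for _ in range(repeats)}
        if len(outcomes) > 1:      # both PASS and FAIL were observed
            flaky.append(rev)
    return flaky

import itertools
flip = itertools.cycle(["PASS", "FAIL"])            # r2's suite has a race
results = {"r1": lambda: "PASS", "r2": lambda: next(flip)}
print(nondeterministic_revisions(["r1", "r2"], lambda r: results[r](), repeats=4))
# ['r2']
```

Coverage nondeterminism can be measured the same way by comparing per-line hit sets across runs instead of a single PASS/FAIL verdict.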
Are patches properly tested?
Sometimes
Patch coverage
[figure]
Patch coverage
[figure]
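Patch coverage, as used on these slides, is the fraction of a patch's executable lines that the test suite actually executes; a minimal sketch, assuming the changed lines come from the diff and the covered lines from gcov/lcov data (the file names and line numbers below are illustrative):

```python
# Sketch: patch coverage = covered changed lines / executable changed lines.

def patch_coverage(changed_lines, covered_lines):
    """changed_lines, covered_lines: sets of (file, line-number) pairs."""
    if not changed_lines:
        return None                        # patch touched no executable code
    hit = changed_lines & covered_lines    # changed lines the tests executed
    return len(hit) / len(changed_lines)

changed = {("src/net.c", 10), ("src/net.c", 11),
           ("src/util.c", 40), ("src/util.c", 41)}
covered = {("src/net.c", 10), ("src/net.c", 11), ("src/util.c", 40)}
print(patch_coverage(changed, covered))  # 0.75
```

Latent patch coverage extends this idea: lines a patch added that only become covered in some later revision's test run are credited back to the patch that introduced them.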
Does covered code contain fewer bugs than uncovered code?
Not really
Does covered code contain fewer bugs than uncovered code?

           Patch coverage (median)   Patches fully covered
           Buggy     All             Buggy     All
Memcached  100%      89%             67%       45%
Redis      94%       0%              47%       25%
ZeroMQ     71%       76%             37%       33%

(… total bugs)
Conclusions
– Dynamic software metrics mining
– Case study on 6 systems / 1,500 revisions / 12 years of development
– Open-source, extensible infrastructure