Topics in Testing We’ve Covered


1 Topics in Testing We’ve Covered
Black box (Finite State Machine) testing
Design for testability
Coverage measures
Random testing
Constraint-based testing
Debugging and test case minimization
Using model checkers for testing
Coverage revisited ("small model property")

2 Topics in Testing We’ve Covered
Black box (Finite State Machine) testing
There "are no Turing machines"
Vasilevskii and Chow algorithm for conformance testing, based on spanning trees and distinguishing sets
Exhaustive testing that cannot miss bugs is often computationally intractable

3 Topics in Testing We’ve Covered
Design for testability
Controllability and observability
Simulation and stubbing, assertions, downward scalability, etc.

4 Topics in Testing We’ve Covered
Coverage measures
Not necessarily correlated with fault detection! Still useful!
Graph coverage: node and edge (statement and branch coverage)
Logic coverage
Input space partitioning
Syntax-based coverage

5 Topics in Testing We’ve Covered
Random testing
Generate inputs at random
Explore very large numbers of executions
Relies on a good automatic test oracle
Feedback to bias choices away from redundant and irrelevant inputs is useful
Good baseline for evaluating other methods, and often very effective
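To make the idea concrete, here is a minimal random-testing sketch in C (not from the slides): the function under test, my_abs, the fixed seed, and the iteration count are all illustrative assumptions, and assertions serve as the automatic oracle.

    /* Minimal random-testing sketch (illustrative, not from the slides). */
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical function under test. */
    static int my_abs(int x) { return x < 0 ? -x : x; }

    int main(void) {
        srand(12345);                            /* fixed seed: reproducible runs */
        for (int i = 0; i < 1000000; i++) {
            int x = rand() - (RAND_MAX / 2);     /* random input, positive or negative */
            int r = my_abs(x);
            assert(r >= 0);                      /* oracle: result is never negative */
            assert(r == x || r == -x);           /* oracle: magnitude is preserved */
        }
        printf("no assertion failures in 1000000 random tests\n");
        return 0;
    }

Note the fixed seed: any failure found this way is reproducible, which matters later for the debugging and test case minimization topics.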

6 Topics in Testing We’ve Covered
Constraint-based testing
Addresses weaknesses of random testing, e.g., finding needles in haystacks, such as inputs where hash(x) = y
Combines concrete and symbolic execution to generate inputs
Concrete execution helps where symbolic solvers choke
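A hypothetical example of the "needle in a haystack" problem: the toy_hash function and the magic constant below are made up for illustration. Random testing is astronomically unlikely to pick the one input that satisfies the branch, whereas a concolic tool can treat the branch condition as a constraint and solve for x.

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t toy_hash(uint32_t x) {       /* hypothetical hash function */
        x ^= x >> 16;
        x *= 0x45d9f3bu;
        x ^= x >> 16;
        return x;
    }

    void check(uint32_t x) {
        if (toy_hash(x) == 0x1234abcdu) {        /* roughly 1 in 2^32 random inputs get here */
            printf("rare branch reached: a bug here would almost never be found randomly\n");
        }
    }

    int main(void) {
        check(42);                               /* almost certainly misses the branch */
        return 0;
    }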

7 Topics in Testing We’ve Covered
Debugging and test case minimization
Automatic minimization of test cases is very valuable for debugging and reducing regression suite size
Debugging can be considered as an application of the scientific method
Various techniques exist for using test cases to localize faults

8 Topics in Testing We’ve Covered
Using model checkers for testing
Testing based on states, rather than on executions or paths
Use abstractions to reduce the state space
Use automatic instrumentation to handle the engineering difficulties

9 NOW BEGINS THE REVIEW Hang onto your hats It’s going to be a fast ride
Anything in these slides is fair game for the test; anything not mentioned in these slides is not fair game (so I'll mention valgrind right now to let you know it might show up…)
So ask questions as we go if something is unclear enough that you think even re-reading the slides isn't going to help

10 Basic Definitions: Testing
What is software testing?
Running a program in order to find faults, a.k.a. defects, a.k.a. errors, a.k.a. flaws, a.k.a. BUGS

11 Testing What isn’t software testing?
Purely static analysis: examining a program's source code or binary in order to find bugs, but not executing the program
Good stuff, and very important, but it's not testing; we'll get back to this in a future class
Fuzzy borderline: what if we only symbolically execute the program?
For this class, we'll call it testing when the program actually runs (but maybe in a virtual machine)

12 Why Testing?
Ideally: we prove code correct, using formal mathematical techniques (with a computer, not chalk)
Extremely difficult: hard even for some trivial (100-line) programs, and for many small (5K-line) programs
Simply not practical to prove correctness in most cases – often not even for safety- or mission-critical code

13 Why Testing?
Nearly ideally: use symbolic or abstract model checking to prove the system correct
Automatically extracts a mathematical abstraction from a system and proves properties over all possible executions
In practice, can work well for very simple properties ("this program never crashes in this particular way"), but can't handle complex properties ("this is a working file system")
Doesn't work well for programs with complex data structures (like a file system)

14 Why Does Testing Matter?
Ariane 5: exception-handling bug forced self-destruct on maiden flight (64-bit to 16-bit conversion; about $370 million lost)
NIST report, "The Economic Impacts of Inadequate Infrastructure for Software Testing" (2002): inadequate software testing costs the US alone between $22 and $59 billion annually; better approaches could cut this amount in half
Major failures: Ariane 5 explosion, Mars Polar Lander, Intel's Pentium FDIV bug
Insufficient testing of safety-critical software can cost lives: THERAC-25 radiation machine, 3 dead
We want our programs to be reliable; testing is how, in most cases, we find out if they are

15 Testing and Monitoring
In this class, we'll look at which executions of a program to run; I'll call this problem "the" testing problem
Second problem: how do we know if an execution reveals a bug?
Key question when monitoring deployed programs to handle faults or send in bug reports from the field
I'll (mostly) take this for granted: we have a reference model or assertions to check

16 Example: File System Testing
How hard would it be to just try "all" the possibilities?
Consider only the 7 core operations (mkdir, rmdir, creat, open, close, read, write)
Most of these take either a file name or a numeric argument, or both
Even for a "reasonable" (but not provably safe) limitation of the parameters, there are an astronomical number of executions of length 10 to try
Not a realistic possibility (unless we have years to test)
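A back-of-the-envelope calculation of why this blows up; only the list of 7 operations comes from the slide, and the number of distinct argument choices per call is an assumed, illustrative figure.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double ops = 7.0;        /* mkdir, rmdir, creat, open, close, read, write */
        double params = 8.0;     /* ASSUMED distinct argument choices per call (made up) */
        double per_step = ops * params;          /* choices at each step of the sequence */
        double total = pow(per_step, 10.0);      /* choices for a length-10 execution */
        printf("about %.2e executions of length 10\n", total);
        return 0;
    }

Even with these modest assumed numbers the count is on the order of 10^17 sequences, which is why exhaustive testing is not a realistic possibility.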

17 The Testing Problem
This is a primary topic of this class: what "questions" do we pose to the software, i.e., how do we select a small set of executions out of a very large set of executions?
The fundamental problem of software testing research and practice
An open (and essentially unsolvable, in the general case) problem

18 Terms: Verification and Validation
These two terms appear a lot, often in vague or sloppy ways, in the literature
Verification is checking that a program matches a specification
Validation is making sure it meets the original requirements – satisfies customers, operates OK onboard the spacecraft, etc.
Verification: "you built it right"
Validation: "you built the right thing" (our focus, for the most part)

19 Terms: Unit, Integration, System Testing
Stages of testing:
Unit testing is the first phase, done by developers of modules
Integration testing combines unit-tested modules and tests how they interact
System testing tests a whole program to make sure it meets requirements
"Design testing" is testing prototypes or very abstract models before implementation – seldom mentioned, but when possible it can save your bacon; exhaustive model checking may be possible at this stage

20 Terms: Functional Testing
Functional testing is a related term
Tests a program from a "user's" perspective – does it do what it should?
Opposed to unit testing, which often proceeds from the perspective of other parts of the program: module spec/interface, not user interaction
Sort of a fuzzy line – consider a file system: how different is use by a program from use of UNIX commands at a prompt by a user?
The building inspector does "unit testing"; you, walking through the house to see if it's livable, perform "functional testing"
Kick the tires vs. take it for a spin?

21 Terms: Regression Testing
Changes can break code, reintroduce old bugs
Things that used to work may stop working (e.g., because of another "fix") – software regresses
Usually a set of cases that have failed (and then succeeded) in the past
Finding small regression test suites is an ongoing research area – analyze dependencies
". . . as a consequence of the introduction of new bugs, program maintenance requires far more system testing per statement written than any other programming. Theoretically, after each fix one must run the entire bank of test cases previously run against the system, to ensure that it has not been damaged in an obscure way. In practice, such regression testing must indeed approximate this theoretical ideal, and it is very costly." – Brooks, The Mythical Man-Month

22 Terms: The Oracle Problem
(Oracle: a magical source of truth, often cryptic, given by the gods)
The oracle problem: how to know if a test fails
If the oracle says every execution is good, why bother running the program?
Some obvious, easily automated approaches: the program probably shouldn't crash; assertions shouldn't be violated
Automatable, but more difficult to apply: differential testing (McKeeman, etc.) – when you have another program, likely correct, that does the same thing, just compare outputs over the same inputs
Last resort, not automatable: hand inspection of executions
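A minimal differential-testing sketch in the spirit of the McKeeman bullet: both max3 implementations and the input ranges below are hypothetical; the point is that the likely-correct reference implementation plays the role of the oracle.

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Reference implementation: simple, likely correct, maybe slow. */
    static int max3_ref(int a, int b, int c) {
        int m = a;
        if (b > m) m = b;
        if (c > m) m = c;
        return m;
    }

    /* Hypothetical implementation under test. */
    static int max3_fast(int a, int b, int c) {
        return a > b ? (a > c ? a : c) : (b > c ? b : c);
    }

    int main(void) {
        srand(1);
        for (int i = 0; i < 1000000; i++) {
            int a = rand() % 201 - 100, b = rand() % 201 - 100, c = rand() % 201 - 100;
            /* The reference serves as the oracle: outputs must agree. */
            assert(max3_fast(a, b, c) == max3_ref(a, b, c));
        }
        printf("implementations agree on all sampled inputs\n");
        return 0;
    }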

23 Terms: Test (Case) vs. Test Suite
Test (case): one execution of the program, that may expose a bug
Test suite: a set of executions of a program, grouped together; a test suite is made of test cases
Tester: a program that generates tests
The line gets blurry when testing functions, not programs – especially with persistent state

24 Terms: Black Box Testing
Treats a program or system as a black box
That is, testing that does not look at source code or internal structure of the system
Send the program a stream of inputs, observe the outputs, decide if the system passed or failed the test
Abstracts away the internals – a useful perspective for integration and system testing
Sometimes you don't have access to source code, and can make little use of object code
True black box? Access only over a network

25 Terms: White Box Testing
Opens up the box! (also known as glass box, clear box, or structural testing)
Use source code (or other structure beyond the input/output spec.) to design test cases
Brings us to the idea of coverage

26 Terms: Coverage
Coverage measures or metrics: an abstraction of "what a test suite tests" in a structural sense
Best explained by giving examples; common measures:
Statement coverage (a.k.a. line coverage or basic block coverage): which statements execute in a test suite
Decision coverage: which boolean expressions in control structures evaluated to both true and false during suite execution
Path coverage: which paths through a program's control flow graph are taken in the test suite

27 Terms: Mutation Testing
A mutation of a program is a version of the program with one or more random changes
Mutation testing is another way to measure the quality of a test suite; Ammann and Offutt call it syntax-based coverage
Idea: generate a large number of mutants and run the test suite on them; if few mutants are detected, the test suite may not be very good
Difficulties: the cost of testing many versions of a program, and how to generate mutants (operators)
In principle, can subsume many other forms of coverage
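A tiny illustration of the idea (not a real mutation tool): the mutant below is written by hand, and the "operator" applied is the common relational-operator replacement (> changed to >=).

    #include <stdio.h>

    static int is_positive(int x)     { return x > 0;  }   /* original */
    static int is_positive_mut(int x) { return x >= 0; }   /* mutant: > replaced by >= */

    int main(void) {
        int tests[] = { 5, -3, 7 };              /* a weak suite: never tries x == 0 */
        int n = sizeof tests / sizeof tests[0];
        int killed = 0;
        for (int i = 0; i < n; i++)
            if (is_positive(tests[i]) != is_positive_mut(tests[i]))
                killed = 1;                      /* suite distinguishes mutant from original */
        printf("mutant %s\n", killed ? "killed" : "survived (add a test with x == 0)");
        return 0;
    }

A real mutation system generates mutants automatically by applying many such operators across the whole program, which is exactly where the cost concern on this slide comes from.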

28 Faults, Errors, and Failures
Fault: a static flaw in a program – what we usually think of as "a bug"
Error: a bad program state that results from a fault; not every fault always produces an error
Failure: an observable incorrect behavior of a program as a result of an error; not every error ever becomes visible

29 To Expose a Fault with a Test
Reachability: the test must actually reach and execute the location of the fault
Infection: the fault must actually corrupt the program state (produce an error)
Propagation: the error must persist and cause an incorrect output – a failure
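A small, self-contained illustration of the three conditions; the square_buggy function is invented for the example.

    #include <stdio.h>

    /* Intended behavior: return x squared. Fault: computes x + x instead. */
    static int square_buggy(int x) { return x + x; }

    int main(void) {
        /* x == 2: the fault is reached, but 2 + 2 == 2 * 2, so the state is
           not corrupted -- no infection, and this test cannot fail. */
        printf("square_buggy(2) = %d (looks correct)\n", square_buggy(2));

        /* x == 3: reached AND infected (6 != 9) AND the wrong value reaches
           the output -- the error propagates and becomes a visible failure. */
        printf("square_buggy(3) = %d (should be 9)\n", square_buggy(3));
        return 0;
    }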

30 Controllability and Observability
Goals for a test case: reach a fault, produce an error, make the error visible as a failure
In order to make this easy, the program must be controllable and observable
Controllability: how easy it is to drive the program where we want to go
Observability: how easy it is to tell what the program is doing

31 Design for Testability
If a program is not designed to be controllable and observable, it generally won't be
We have to start preparing for testing before we write any code
Testing as an after-the-fact, ad hoc exercise is often limited by earlier design choices

32 Test-Driven Development
One way to design for testability is to write the test cases before the code
Idea arising from Extreme Programming and agile development: write automated test cases first, then write the code to satisfy the tests
Helps focus attention on making software well-specified
Forces observability and controllability: you have to be able to handle the test cases you've already written (before deciding they were impractical)
Reduces temptation to tailor tests to idiosyncratic behaviors of the implementation

33 Controllability: Simulation and Stubbing
A key to controllable code is effective simulation and stubbing
Simulation of low-level hardware devices through a clean driver interface: real hardware may be slow, it may be impossible or expensive to induce some hardware failure modes on real hardware, and real hardware may be a limited resource
Stubbing for other routines and code: other code/modules may not be complete, may be slow and irrelevant to the test, or we may need to simulate failure of other modules
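One common way to get this kind of controllability in C is to pass the hardware access routine in as a function pointer, so tests can substitute stubs that simulate values, including failure modes; the sensor/threshold example below is hypothetical.

    #include <stdio.h>

    typedef int (*read_sensor_fn)(void);

    /* Code under test: report whether the sensor reads above a threshold,
       or -1 if the sensor fails. */
    static int over_threshold(read_sensor_fn read_sensor, int threshold) {
        int v = read_sensor();
        if (v < 0) return -1;               /* sensor failure */
        return v > threshold;
    }

    /* Test stubs standing in for the real driver. */
    static int stub_hot(void)    { return 150; }
    static int stub_cold(void)   { return 10;  }
    static int stub_broken(void) { return -1;  }   /* simulate a hardware fault */

    int main(void) {
        printf("hot:    %d\n", over_threshold(stub_hot, 100));    /* expect 1 */
        printf("cold:   %d\n", over_threshold(stub_cold, 100));   /* expect 0 */
        printf("broken: %d\n", over_threshold(stub_broken, 100)); /* expect -1 */
        return 0;
    }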

34 Controllability: Downwards Scalability
Another important aspect of controllability is to make code "downwards scalable"
Many faults cause an error only in a corner case due to a resource limit
An effective strategy for finding errors is to reduce the resource limits: test a version of the program with very tight bounds
Finding corner cases is easier if the corners are close together
Too many programs hard-code resource limits or make assumptions about resources unconnected to defined limits (e.g., not checking the result of malloc)
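A sketch of what "downwards scalable" can look like in code: the capacity is a run-time parameter rather than a hard-coded constant, and the result of malloc is checked. The bounded-stack data structure is invented for illustration.

    #include <assert.h>
    #include <stdlib.h>

    typedef struct { int *items; int capacity; int count; } bounded_stack;

    int stack_init(bounded_stack *s, int capacity) {
        s->items = malloc(sizeof(int) * (size_t)capacity);
        if (s->items == NULL) return -1;        /* don't assume malloc succeeds */
        s->capacity = capacity;
        s->count = 0;
        return 0;
    }

    int stack_push(bounded_stack *s, int v) {
        if (s->count == s->capacity) return -1; /* resource-limit corner case */
        s->items[s->count++] = v;
        return 0;
    }

    int main(void) {
        bounded_stack s;
        assert(stack_init(&s, 2) == 0);   /* tiny bound: the corners are close together */
        assert(stack_push(&s, 1) == 0);
        assert(stack_push(&s, 2) == 0);
        assert(stack_push(&s, 3) == -1);  /* "full" corner case reached immediately */
        free(s.items);
        return 0;
    }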

35 Observability: Assertions
Assertions improve observability by making (some) errors into failures
Even if the effect of a fault doesn't propagate, it may be visible if an assertion checks the state at the right time
Assertions also improve observability by making the error, rather than the failure, visible: we know how the state was corrupted directly, not just the eventual effect
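A small invented example of an assertion turning a silent error into an immediate, visible failure at the point of corruption:

    #include <assert.h>
    #include <stdio.h>

    static int balance = 100;

    static void withdraw(int amount) {
        balance -= amount;
        /* Observability: if the state is ever corrupted (negative balance),
           we fail right here, at the point of infection, rather than relying
           on the error propagating to some later output. */
        assert(balance >= 0);
    }

    int main(void) {
        withdraw(30);
        withdraw(50);
        printf("balance = %d\n", balance);   /* prints 20 */
        withdraw(40);                        /* assertion fires here */
        return 0;
    }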

36 Observability: Invariant Checkers
We can extend the idea of assertions to writing "full" invariant checkers
Do a crawl of the code's basic data structures and check various invariants that would be too expensive to check at runtime
The invariant checker can be written to be easy to write – it can use recursion, memory allocation, etc., since it won't run on the actual system
But be careful! If your invariant checker has a bug and changes the system state. . .
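A sketch of what such a checker might look like for a singly linked list that caches its length; the data structure and invariants are hypothetical. The checker does a full extra traversal that would be too expensive on every production operation, and it only reads the structure, which is exactly the "be careful" point above.

    #include <assert.h>
    #include <stddef.h>

    typedef struct node { int value; struct node *next; } node;
    typedef struct { node *head; size_t length; } int_list;

    /* Returns 1 if the invariants hold, 0 otherwise: the list must be acyclic
       (checked with a tortoise-and-hare walk) and the cached length must
       match the actual number of nodes. */
    int check_list_invariants(const int_list *l) {
        const node *slow = l->head, *fast = l->head;
        while (fast != NULL && fast->next != NULL) {
            slow = slow->next;
            fast = fast->next->next;
            if (slow == fast) return 0;          /* cycle detected */
        }
        size_t count = 0;
        for (const node *p = l->head; p != NULL; p = p->next)
            count++;
        return count == l->length;               /* cached length must agree */
    }

    int main(void) {
        node b = { 2, NULL }, a = { 1, &b };
        int_list l = { &a, 2 };
        assert(check_list_invariants(&l));       /* passes */
        l.length = 3;                            /* simulate a corrupted count */
        assert(!check_list_invariants(&l));      /* checker catches it */
        return 0;
    }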

37 Graph Coverage
Cover all the nodes, edges, or paths of some graph related to the program
Examples: statement coverage, branch coverage, path coverage, data flow (def-use) coverage, model-based testing coverage, and many more
The most common kind of coverage, by far

38 Statement/Basic Block Coverage
if (x < y) { y = 0; x = x + 1; } else x = y;
Statement coverage: cover every node of the control flow graph – for this if-then-else, a node for the test (with edges labeled x < y and x >= y), a node for y = 0; x = x + 1, a node for x = y, and a node for the code after the if
if (x < y) { y = 0; x = x + 1; }
For this if-then version, the graph has just three nodes: the test, the then-block, and the code after the if
y = 0 and x = x + 1 are treated as one node because if one statement executes the other must also execute (the code is a basic block)

39 Branch Coverage
if (x < y) { y = 0; x = x + 1; } else x = y;
Branch coverage vs. statement coverage: they are the same for the if-then-else above
if (x < y) { y = 0; x = x + 1; }
But consider this if-then structure, with node 1 the test, node 2 the then-block, and node 3 the code after the if: for branch coverage we can't just cover all nodes, we must cover all edges – we have to get to node 3 both after 2 and without executing 2!

40 Path Coverage
How many paths through this code are there? We need one test case for each to get path coverage:
if (x < y) { y = 0; x = x + 1; } else x = y;
To get statement and branch coverage we only need two test cases; path coverage needs two more
In general: the number of paths is exponential in the number of conditional branches!

41 Data Flow Coverage
Annotate the program with locations where variables are defined and used (very basic static analysis):
  1: x = 3;          Def(x)
  2: y = 3;          Def(y)
  3: if (w) {
  4:   x = y + 2;    Def(x), Use(y)
     }
  5: if (z) {
  6:   y = x - 2;    Def(y), Use(x)
     }
  7: n = x + y;      Use(x), Use(y)
Def-use pair coverage requires executing all possible pairs of nodes where a variable is first defined and then used, without any intervening re-definitions
E.g., a path that skips node 4 covers the pair where x is defined at 1 and used at 7; a path through node 4 does NOT, because x is re-defined there
There may be many pairs, some not actually executable

42 Logic Coverage
What if, instead of:
  if (x < y) { y = 0; x = x + 1; }
we have:
  if (((a > b) || G) && (x < y)) { y = 0; x = x + 1; }
Now branch coverage will guarantee that we cover all the edges (the true edge, labeled ((a > b) || G) && (x < y), and the false edge, labeled ((a <= b) && !G) || (x >= y)), but it does not guarantee we will do so for all the different logical reasons
We want to test the logic of the guard of the if statement

43 Active Clause Coverage
Predicate: ( (a > b) or G ) and (x < y)

  row   (a > b)   G   (x < y)   predicate
   1       T      F      T          T
   2       F      F      T          F
   3       F      T      T          T
   4       F      F      T          F     (duplicate of row 2)
   5       T      T      T          T
   6       T      T      F          F

Rows 1 and 2: with these values for G and (x < y), (a > b) determines the value of the predicate
Rows 3 and 4: with these values for (a > b) and (x < y), G determines the value of the predicate
Rows 5 and 6: with these values for (a > b) and G, (x < y) determines the value of the predicate
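The rows above can be turned into a small test harness; the concrete variable values below are invented choices that realize each row's truth values.

    #include <stdio.h>

    static int pred(int a, int b, int G, int x, int y) {
        return ((a > b) || G) && (x < y);
    }

    int main(void) {
        /* (a > b) determines the predicate when G is false and (x < y) is true */
        printf("%d %d\n", pred(1, 0, 0, 0, 1), pred(0, 1, 0, 0, 1));  /* 1 0 */
        /* G determines the predicate when (a > b) is false and (x < y) is true */
        printf("%d %d\n", pred(0, 1, 1, 0, 1), pred(0, 1, 0, 0, 1));  /* 1 0 */
        /* (x < y) determines the predicate when ((a > b) || G) is true */
        printf("%d %d\n", pred(1, 0, 1, 0, 1), pred(1, 0, 1, 1, 0));  /* 1 0 */
        return 0;
    }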

44 Input Domain Partitioning
Partition scheme q of domain D
The partition q defines a set of blocks, Bq = {b1, b2, …, bQ}
The partition must satisfy two properties:
  blocks must be pairwise disjoint (no overlap): bi ∩ bj = ∅ for all i ≠ j, with bi, bj ∈ Bq
  together the blocks cover the domain D (complete): the union of all b ∈ Bq equals D
Coverage then means using at least one input from each of b1, b2, b3, . . .
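A tiny invented example: for a shipping_rate function whose input domain is an integer weight, one possible partition is { non-positive, 1–20, over 20 }; the blocks are disjoint and together cover the domain, and one representative per block gives partition coverage.

    #include <stdio.h>

    /* Hypothetical function under test. */
    static const char *shipping_rate(int weight_kg) {
        if (weight_kg <= 0)  return "invalid";    /* block b1: weight <= 0 */
        if (weight_kg <= 20) return "standard";   /* block b2: 1..20       */
        return "freight";                         /* block b3: weight > 20 */
    }

    int main(void) {
        /* One representative input chosen from each block of the partition. */
        int representatives[] = { -5, 10, 100 };
        for (int i = 0; i < 3; i++)
            printf("weight %4d -> %s\n", representatives[i],
                   shipping_rate(representatives[i]));
        return 0;
    }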

45 Syntax-Based Coverage
Based on mutation testing (a pet topic of Ammann and Offutt, who are heavily into this research area)
A bit different kind of creature than the other coverages we've looked at
Idea: generate many syntactic mutants of the original program
Coverage: how many mutants does a test suite kill (detect)?

46 Generation vs. Recognition
Generation of tests based on coverage means producing a test suite to achieve a certain level of coverage
As you can imagine, generally very hard: generating a suite for 100% statement coverage easily reaches "solving the halting problem" level, and it is obviously hard for, say, mutant-killing
Recognition means seeing what level of coverage an existing test suite reaches

47 Coverage and Subsumption
Sometimes one coverage criterion subsumes another: if you achieve 100% coverage of criterion A, you are guaranteed to satisfy criterion B as well
For example, consider node and edge coverage (there's a subtlety here, actually – can you spot it?)
What does this mean? Unfortunately, not a great deal
If test suite X satisfies "stronger" criterion A and test suite Y satisfies "weaker" criterion B, Y may still reveal bugs that X does not!
For example, consider our running example and statement vs. branch coverage
It means we should take coverage with a grain of salt, for one thing

48 Levels of Testing (adapted from Beizer, by Ammann and Offutt)
Level 0: Testing is debugging
Level 1: Testing is to show the program works
Level 2: Testing is to show the program doesn't work
Level 3: Testing is not to prove anything specific, but to reduce the risk of using the program
Level 4: Testing is a mental discipline that helps develop higher quality software

49 What’s So Good About Coverage?
Consider a fault that causes a failure every time the code containing it is executed
Don't execute the code: cannot possibly find the fault! That's a pretty good argument for statement coverage
  int findLast (int a[], int n, int x) {
    // Returns index of last element
    // in a equal to x, or -1 if no
    // such. n is length of a
    int i;
    for (i = n-1; i >= 0; i--) {
      if (a[i] == x) return i;
    }
    return 0;   // the fault: spec says return -1
  }
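To connect the argument to the code above: a suite that achieves statement coverage of findLast must include an input where x is absent from a, and exactly that input executes the faulty final return and exposes the failure. A sketch follows; the test values are made up, and the function body is the one from the slide with the final return as the fault.

    #include <assert.h>
    #include <stdio.h>

    /* The function from the slide, including its fault: when x is not in the
       array it returns 0, although the spec says it should return -1. */
    static int findLast(int a[], int n, int x) {
        for (int i = n - 1; i >= 0; i--)
            if (a[i] == x) return i;
        return 0;                          /* fault */
    }

    int main(void) {
        int a[] = { 1, 2, 3 };
        assert(findLast(a, 3, 2) == 1);    /* found: never reaches the faulty return */
        /* This test is required for statement coverage of the final return,
           and it is also the test that exposes the failure (the assert fires). */
        assert(findLast(a, 3, 5) == -1);
        printf("not reached while the fault is present\n");
        return 0;
    }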

