Software Testing
“There are only 2 hard problems in Computer Science. Naming things, cache invalidation and off-by-one errors.” -- Phil Haack
There are three kinds of mistakes that all programmers make at some time: errors in program logic, and off-by-one errors.
“Modern computer systems are so complicated you would need to perform more tests than there are stars in the sky to be 100% sure there were no problems in the system.” -- Lev Lesokhin, strategy chief at New York-based software analysis firm Cast.
“Program testing can be used to show the presence of bugs, but never to show their absence!” -- Edsger Dijkstra
TBD: add discussion of techniques for gauging whether or not software is ready for release: defect pooling, defect seeding and defect modeling.

Humans are fallible; software is written by humans; expect software to have defects. Testing is the most common way of removing defects in software and improving its quality. Testing also matters because your first responsibility at a new job is likely to involve testing. Second: maintenance. Third: new development. Fourth: design/architecture or management.

Outline
Foundations; Motivations; Terminology
Principles and Concepts
Levels of Testing
Test Process
Techniques
Measures
Deciding when to stop
What is it; how do you do it?
Test Process = project life cycle testing activities.
Measures = Since good engineering relies on quantitative methods, it’s not surprising that there are several measures related to testing.
Deciding when to stop = For all but the most trivial problems, the number of potential test cases is infinite or practically infinite. In most cases there isn’t enough time to run all potential test cases, so you have to decide when to stop.

Defects are Bad
At a minimum, defects in software annoy users. Glitchy software reflects poorly on the company issuing the software. If defects aren’t controlled during a software project, they increase the cost and duration of the project. For safety-critical systems the consequences can be even more severe. Obvious comment of the day: defects in software are bad.

Spectacular Failures
Therac-25: Software defects between 1985 and 1987 led to 6 accidents. Three patients died as a direct consequence.
Ariane 5, 1996: Rocket + Cargo = $500M
Patriot Missile, 1991: Failed to destroy an Iraqi Scud missile which hit a barracks.
People have died from defective software (not in a public, visible, relatable way…yet).

Controlling defects in software
There are two ways of dealing with the potential for defects in software. The most obvious is to identify and remove the defects that make it into the software. Another approach, which often goes unnoticed, is to stop making errors: take action to prevent defects from ever being injected in the first place. This second approach is called Defect Prevention. Testing is one method of uncovering defects in software. (Inspection is another.) Testing might not be the most efficient method of uncovering defects, but for many companies it is their primary means of ensuring quality.

What is testing?
Testing is the dynamic execution of the software for the purpose of uncovering defects. Testing is one technique for improving product quality.
Don’t confuse testing with other, distinct techniques for improving product quality:
Inspections and reviews (sometimes called static testing)
Debugging
Defect prevention
Quality assurance
Quality control
The purpose of testing is to find defects. Testing: running software with test cases. Test case = sample input and expected output. International Software Testing Qualifications Board (ISTQB): testing = static and dynamic testing.

Testing and its relationship to other related activities
Don’t confuse testing with other related activities. Note that there isn’t universal agreement on the scope of the term “testing”. Some, most notably the International Software Testing Qualifications Board (ISTQB), consider inspections and other forms of static review to be a form of testing. Here the term “testing” is used exclusively to refer to defect detection activities that involve the dynamic execution of the software under test. Static techniques such as inspection, technical review, walkthrough and informal review are a complementary form of defect detection. (There is a subtle difference between these two strategies of uncovering defects. Static techniques tend to uncover defects that could result in failures, whereas testing uncovers failures and leaves it up to the programmer to identify the defects that caused the failures. In other words, testing involves the extra step of debugging.) Defect prevention is a form of process improvement. The goal of defect prevention is to keep defects from entering the product in the first place. Inspections and testing are really a second line of defense. The most cost-effective route to quality is to avoid injecting defects at all; those that are injected should be found and fixed earlier rather than later. It’s better to avoid problems than to correct them. Testing is limited to exposing anomalous behavior likely caused by a defect in the software. Investigating (finding) and fixing the defect (debugging) is the responsibility of the programmer. Testing != quality assurance. Testing is concerned with finding defects. Quality assurance has a much broader role: providing confidence that the software is suitable for its intended purpose. Testing, inspections and defect prevention are all aspects of quality assurance. When the testing role is given the title “quality assurance” the distinction is blurred.

Benefits of Testing
Testing improves product quality (at least when the defects that are revealed are fixed).
The rate and number of defects found during testing give an indication of overall product quality. A high rate of defect detection suggests that product quality is low. Finding few errors after rigorous testing increases confidence in overall product quality. Such information can be used to decide the release date. Or, it could mean…???
Defect data from testing may suggest opportunities for process improvement, preventing certain types of defects from being introduced into future systems.
The role of testing. Why is testing necessary? What can you expect, directly or indirectly, from testing? The purpose of testing is to find defects. When these defects are fixed, the quality of the product improves. Testing also provides an assessment of overall product quality. Consequently, testing is used both to improve and to assess product quality. Testing plays a role in quality assurance and continuous process improvement: defect data can be mined for process improvement opportunities.
Goals of testing: improve the product (system being tested) and the development process. That means finding defects; gaining confidence in the quality of the system being tested (confidence that the software system meets its stated requirements); gathering data that will support project decisions (when to ship); and suggesting changes in the development process likely to prevent certain types of defects from being injected into future systems. Performing root cause analysis on recurring defects will provide valuable information that can be used to refine the development process. Testing uncovers failures caused by defects. It is the job of programmers to find and correct the defects (debugging).

Errors, Faults and Failures! Oh my!
Error or Mistake – human action or inaction that produces an incorrect result
Fault or Defect – the manifestation of an error in code or documentation
Failure – an incorrect result
Some definitions. A program may have latent defects. A program's failure is a clear sign it contains a defect, but the presence of a defect doesn't always result in a failure. “A software error occurs when the program does not do what its end user reasonably expects it to do.” [p 123 The Art of S/W Testing]
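The distinction between a fault and a failure can be made concrete with a small sketch. The `average` function below is a hypothetical example, not taken from the slides: the programmer's error (forgetting the empty-list case) leaves a fault in the code, and that fault stays latent until a particular input turns it into a failure.

```python
def average(values):
    """Return the arithmetic mean of a list of numbers.

    Fault (hypothetical): the programmer forgot that the list may be empty.
    """
    return sum(values) / len(values)


# The fault is latent: typical inputs never trigger it.
print(average([2, 4, 6]))  # 4.0

# Only an empty list turns the fault into an observable failure.
try:
    average([])
except ZeroDivisionError as exc:
    print("failure:", exc)
```

A test suite that never tries an empty list would pass every time, even though the defect is present.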

Software Bugs 1947 log book entry for the Harvard Mark II
Computer glitches are often called bugs. A software bug is a general term used to refer to an error, fault or failure in a computer program or system. The term doesn't have a precise meaning but is more innocent sounding than error, fault or defect. It may be used subconsciously to deflect blame or fault away from the real source. The term bug didn’t originate with software. Engineers have long used the term to describe inexplicable defects in mechanical systems. One explanation for why the term is often thought of as exclusive to software is the story of the moth found obstructing a relay in an early mechanical computer. Before transistors, the flow of electricity in computers was controlled by mechanical relays. On September 9th 1947 the operators of the Harvard Mark II found a moth stuck in a relay and reported it as the “first actual case of [a] bug being found”. Well aware of the term’s special meaning to engineers, they were amused it had become more than just a symbol of a problem. (Grace Hopper wasn’t one of the engineers that found the moth.)

Verification and Validation
Verification and validation are two complementary testing objectives.
Verification – Comparing program outcomes against a specification. “Are we building the product right?”
Validation – Comparing program outcomes against user expectations. “Are we building the right product?”
Verification and validation are accomplished using both dynamic testing and static evaluation (peer review) techniques.
Classical distinction between two complementary testing aims. Verification comes from the Latin root veritas meaning "truth". Verification is objective. How well does it meet some spec? Validation comes from the Latin root valere meaning "worth". Validation is subjective. How much is it worth to someone? The specifications may be explicit (Software Requirements Specification, coding standards, etc.) or they may be implicit requirements or reasonable expectations. For example, the specification might not say that all words in dialog boxes should be spelled correctly. It’s reasonable to assume this is an implicit requirement.

Principles of Testing
“Program testing can be used to show the presence of bugs, but never to show their absence!” [Edsger Dijkstra] He is speaking of course about non-trivial programs.
Mindset is important. The goal of testing is to demonstrate that the system doesn’t work correctly, not that the software meets its specification. You are trying to break it. If you approach testing with the attitude of trying to show that the software works correctly, you might unconsciously avoid difficult tests that threaten your assumption. It’s like the product spokesperson showing how well laundry detergent works. They are trying to show that the detergent does a good job getting out stains. They aren’t going to attempt a tough stain that might not confirm their assumption.
Should programmers test their own code? Cognitive dissonance is “the state of having inconsistent thoughts, beliefs, or attitudes, esp. as relating to behavioral decisions and attitude change”: the extreme emotional discomfort we feel when two important beliefs, attitudes or perceptions collide. Humans cannot tolerate dissonance for long, so they ease the tension by making a change in belief or attitude and justifying the change. As it relates to testing, programmers must cope with two opposing thoughts: I’m a great programmer; there might be a defect in my code. The easy way to resolve this cognitive dissonance? Don’t look too hard for bugs (consciously or unconsciously), because finding them may create more cognitive dissonance.
Defect-free programs are an impossible goal. Defects in software are inevitable. For non-trivial programs it is not feasible to verify by testing that the program works correctly for every input. Thorough testing can increase confidence that there are no defects but can’t prove their absence. Exhaustive testing can show that a program is free of defects, but for most software systems exhaustive testing is impossible or impractical. Every additional successful test increases confidence that the software is defect free, but unless you can test all possible inputs you can’t prove the system is defect free. For sequential systems (systems whose behavior depends not just upon the immediate input but also on the history of inputs received over time), exhaustive testing means covering all possible inputs and all possible sequences of inputs, which is impractical for most non-trivial software systems.

Organization Who should do the testing?
Developers shouldn’t system test their own code. There is no problem with developers unit testing their own code (they are probably the most qualified to do so), but experience shows programmers are too close to their code to do a good job of system testing it. Programmers are biased: “their lack of objectivity often limits their effectiveness.” Independent testers are more effective and can verify assumptions. There are levels of independence: independent testers within the team, reporting to the project manager; testers independent of the team (i.e., reporting to executive management or a supervisor above the project manager); and testers independent of the company (testing is outsourced).

The cost of finding and fixing a defect increases with the length of time the defect remains in the product
Phase Containment
Image is from McConnell.
Phase containment: catching a defect in the phase in which it was introduced. This is a crucial concept because so much of what (good) developers do (inspections, pair programming, continuous integration, test-driven development, automated testing, etc.) is a consequence of defects becoming more expensive to find and fix the longer they remain in the software. The trajectory of the curve is what is important. Introduce a defect during maintenance and wait two iterations, and it’s going to be costly to fix as well. The graph extends to the right (time dimension).

Cost to correct late-stage defects
For large projects, a requirements or design error is often 100 times more expensive to find and fix after the software is released than during the phase in which the error was injected. The 100X figure is a widely cited generalization; Boehm 81 is the most common reference. If the consequences of an error are less severe and/or the design of the system makes it easy to make changes late in the software life cycle, the actual cost may be as low as 5X. [Software Defect Reduction Top 10 List, Boehm, et al.] Making Software, page 163: “a 5:1 ratio was more characteristic of smaller systems (2,000–5,000 source lines of code)”. The section also discusses evidence for the 100:1 figure in large projects: “for large projects, the increase from fixing requirements changes and defects during requirements definition to fixing them once the product is fielded continues to be around 100:1. However, this ratio can be significantly reduced by higher investments in early requirements and architecture verification and validation.” “The evidence for small projects continues to show ratios around 5:1, but these can also be flattened by the use of outstanding personnel and by Agile methods, such as pair programming and continuous integration, that shorten the delay-of-fix time. Small, noncritical projects can also spread their architecting activity across the life cycle via refactoring,” TBD: similar data in Implementing software inspections: ?? Software Requirements, Wiegers, page 17: the cost of correcting a requirements defect during the operation phase is 100x the cost of correcting it during the requirements phase.

Correspondence between Development and different opportunities for Verification and Validation
Different levels of testing and their correspondence with the phases/steps and activities of the software life cycle. Q. Can you find a construction error during system testing? A. Sure. But if you are finding a lot of construction errors during system test, it’s an indication your unit testing is inadequate.

Two dimensions to testing
What are you testing for? Performance, correctness, usability, etc.? When are you testing (at what point in the s/w lifecycle)? Tests vary by the point in the software lifecycle when they run and by purpose. Timing: unit, integration, system, alpha, beta and acceptance. Purpose: correctness, usability, reliability, security, performance.

Levels of testing
Unit – testing individual cohesive units (modules). Usually white-box testing done by the programmer.
Integration – verifying the interaction between software components. Integration testing is done on a regular basis during development (possibly once a day/week/month depending on the circumstances of the project). Architecture and design defects typically show up during integration.
System – testing the behavior of the system as a whole. Testing against the requirements (system objectives and expected behavior). Also a good environment for testing non-functional software requirements such as usability, security, performance, etc.
Acceptance – used to determine if the system meets its acceptance criteria and is ready for release.
These levels of testing are distinguished by what is being tested and the objective of the test. Unit = module; Integration = group of modules; System = whole system. The purpose of Unit, Integration and System is to assess and improve product quality. Integration strategies (top-down, bottom-up, etc.) are described shortly.
IEEE: “system testing: Testing conducted on a complete, integrated system to evaluate the system’s compliance with its specified requirements.” “integration testing: Testing in which software components, hardware components, or both are combined and tested to evaluate the interaction among them. This term is commonly used for both the integration of components and the integration of entire systems.” “acceptance testing: (A) Testing conducted to establish whether a system satisfies its acceptance criteria and to enable the customer to determine whether to accept the system. (B) Formal testing conducted to enable a user, customer, or other authorized entity to determine whether to accept a system or component.”
“Finding defects is not the main focus in acceptance testing. Acceptance testing may assess the system’s readiness for deployment and use.” “Acceptance testing is often the responsibility of the customers or users of a system”

Other types of testing
Regression testing – selective re-testing performed to ensure there were no unintended consequences of a program change. It’s not uncommon for a fix in one area of code to cause a problem in another area. Designs based on loose coupling can mitigate this tendency. Regression tests are run frequently and often under time constraints. Therefore, regression testing is often automated.
Alpha and Beta testing – limited release of a product to a few select customers for evaluation before the general release. The primary purpose of a beta test isn’t to find defects, but rather to assess how well the software works in the real world under a variety of conditions that are hard to simulate in the lab. Customers’ impressions are starting to be formed during beta testing, so the product should have release-like quality. Alpha testing = internal operational testing (internal acceptance testing). Beta testing = external operational testing (external acceptance testing); may be a limited audience. Software producers have less control over alpha/beta testing. The expected errors during alpha/beta testing are those that affect a limited number of users (unusual environments that couldn’t be anticipated).
Stress testing, load testing, etc.
Smoke test – a very brief test to determine whether or not there are obvious problems that would make more extensive testing futile. The term originates from electronic hardware testing, where the first test is to plug in the device, turn it on and look for smoke. If it doesn’t smoke, it passes the smoke test.
TBD: Add stress testing, configuration testing (hardware compatibility testing), usability testing. TBD: Add graphic using Word change control showing new code and old code. Could add an example of a down cast causing a problem.

Regression Testing Imagine adding a 24-inch lift kit and monster truck tires to your sensible sedan: Sometimes when you change something in one area you break something in another area. In a complex system it’s hard to foresee or anticipate all possible interactions. After making the changes you would of course test the new and modified components, but is that all that should be tested? Not by a mile!

Regression Testing [Cont]
When making changes to a complex system there is no reliable way of predicting which components might be affected. Therefore, it is imperative that at least a subset of tests be run on all components. In this analogy, that means testing: the heater, air conditioner, radio, cup holders, speedometer…hmm, that’s interesting, there seems to be a problem with the speedometer. It significantly understates the speed of the car. On closer inspection you discover the speedometer has a dependency on wheel size. The implementation of the speedometer makes an assumption about the wheel size and how far the car will move for each rotation of the tires. Larger wheels mean the car travels a greater distance for each revolution. Who would have predicted that? Good thing we performed regression testing.

Regression Testing [Cont]
Making sure new code doesn’t break old code. Regression testing is selective retesting. You want to ensure that changes or enhancements don’t impair existing functionality. During regression testing you rerun a subset of all test cases on old code to make sure new code hasn’t caused old code to regress or stop working properly. It’s not uncommon for a change in one area of code to cause a problem in another area: “Frequently, a fix for a problem in one area inadvertently causes a software bug in another area”. Designs based on loose coupling can mitigate this tendency, but regression testing is still needed in order to increase the assurance there were no unintended consequences of a program change. Regression tests are run frequently and often under time constraints. Therefore, regression testing is often automated. “Regression tests seek to verify that the functions provided by a modified system or software product perform as specified and that no unintended change has occurred in the operation of the system or product.”
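The speedometer analogy can be sketched as an automated regression test. Everything below is hypothetical (the wheel diameters, the RPM value, the tolerance and the function names are assumptions chosen for illustration): an existing check that passed before the lift kit is rerun unchanged afterward and exposes the hidden dependency on wheel size.

```python
import math

# Hypothetical car model for the analogy; all names and numbers are assumptions.
STOCK_WHEEL_DIAMETER_M = 0.60


def displayed_speed_kmh(wheel_rpm):
    """Speedometer reading; silently assumes the stock wheel size."""
    metres_per_minute = wheel_rpm * math.pi * STOCK_WHEEL_DIAMETER_M
    return metres_per_minute * 60 / 1000


def actual_speed_kmh(wheel_rpm, wheel_diameter_m):
    """True road speed for the wheels that are really fitted."""
    metres_per_minute = wheel_rpm * math.pi * wheel_diameter_m
    return metres_per_minute * 60 / 1000


def speedometer_is_accurate(wheel_diameter_m, wheel_rpm=500, tolerance_kmh=1.0):
    """Existing test, rerun unchanged after every modification to the car."""
    shown = displayed_speed_kmh(wheel_rpm)
    real = actual_speed_kmh(wheel_rpm, wheel_diameter_m)
    return abs(shown - real) <= tolerance_kmh


print(speedometer_is_accurate(STOCK_WHEEL_DIAMETER_M))  # True: passes before the change
print(speedometer_is_accurate(1.20))                    # False: fails after the lift kit
```

Nobody changed the speedometer code, yet its test now fails, which is exactly the kind of unintended consequence regression testing is meant to catch.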

Testing Objectives
Conformance testing (aka correctness or functional testing) – does the observed behavior of the software conform to its specification (SRS)?
Non-functional requirements testing – have non-functional requirements such as usability, performance and reliability been met?
Regression testing – does an addition or change break existing functionality?
Stress testing – how well does the software hold up under heavy load and extreme circumstances?
Installation testing – can the system be installed and configured with reasonable effort?
Alpha/Beta testing – how well does the software work under the myriad of real-world conditions?
Acceptance testing – how well does the software work in the user’s environment?
Testing levels (unit, integration, etc.) define different stages of testing, with each stage having somewhat different objectives. You can view testing by stage or by objectives. What is the purpose of the tests? What are you trying to establish? Functional (correctness)? Usability? Reliability? Acceptance? Objectives are things like user acceptance, validating performance, regression testing, etc.
Testing non-functional requirements. Examples: performance testing, load testing, usability testing, etc. Rather than testing what the system does, you are testing how it does it. Non-functional defects are among the hardest to fix because they are cross-cutting. There is no one place in the code you can point to and say, “there’s the error”. When testing functional requirements the result is normally: works or doesn’t work. Non-functional requirements tend to be present to some degree.

Integration Strategies
What doesn’t work?
All-at-once or Big Bang – waiting until all of the components are ready before attempting to build the system for the first time. Not recommended.
What does work?
Top-Down – high-level components are integrated and tested before low-level components are complete. Example high-level components: life-cycle methods of component framework, screen flow of web application.
Bottom-Up – low-level components are integrated and tested before top-level components. Example low-level components: abstract interface onto database, component to display animated image.
Incremental features
Assumption: independent development of components. Could be individuals or teams developing components or one individual developing the components over time. To a certain extent your integration strategy is driven by your architecture and design strategy. For example, if you are designing for reuse you are probably doing bottom-up design. This would necessitate a bottom-up implementation strategy. (Assuming you are developing and testing incrementally. It would be strange to do top-down design but bottom-up integration.) Top-down and bottom-up are incremental integration strategies. Except when integrating very small software systems, incremental integration is much preferred to all-at-once or “big bang” testing.

Advantages of Incremental/ Continuous Integration
Easier to find problems. If there is a problem during integration testing it is most likely related to the last component integrated. Knowing this usually reduces the amount of code that has to be examined in order to find the source of the problem. (“My computer stopped working.” “What was the last thing you changed?”) Testing can begin sooner. Big bang testing postpones testing until the whole system is ready. The challenge of incremental integration is how to build an incomplete system that will still function. Two approaches: top-down and bottom-up.

Top-Down Integration
Stubs and mock objects are substituted for as yet unavailable lower-level components.
Stubs – A stub is a unit of code that simulates the activity of a missing component. A stub has the same interface as the low-level component it emulates, but is missing some or all of its full implementation. Stubs return minimal values to allow the functioning of top-level components.
Mock Objects – mock objects are stubs that simulate the behavior of real objects. The term mock object typically implies a bit more functionality than a stub. A stub may return pre-arranged responses. A mock object has more intelligence. It might simulate the behavior of the real object or make assertions of its own.
Stubs, drivers and mock objects. Stubs and mock objects both emulate the interface for a class. With mock objects there are additional expectations of this interface. For example, a mock object might check to make sure its methods were called in a certain order. Mock objects can be used when the real object doesn't exist yet (i.e. during integration). You might also use mock objects when the real object is too complex or inconvenient to work with itself. For example, an object that encapsulates a relational database might be implemented as a mock object if you don't want to go through the hassle of setting up the actual database. An email service stub might let you call its send() method before its compose() or attachfile() method. An email service mock object would catch this error.
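The difference between a stub and a mock object can be sketched with the email service example. The classes below are hand-written illustrations, not a real library API; the class names, the `send_report` caller and its deliberate ordering defect are all assumptions made for the sketch.

```python
class EmailServiceStub:
    """Stub: same interface as the real service, canned responses, no checks."""

    def compose(self, to, body):
        return None

    def attachfile(self, path):
        return None

    def send(self):
        return True  # always reports success


class EmailServiceMock:
    """Mock: same interface, but records calls and checks their order."""

    def __init__(self):
        self.calls = []

    def compose(self, to, body):
        self.calls.append("compose")

    def attachfile(self, path):
        if "compose" not in self.calls:
            raise AssertionError("attachfile() called before compose()")
        self.calls.append("attachfile")

    def send(self):
        if "compose" not in self.calls:
            raise AssertionError("send() called before compose()")
        self.calls.append("send")
        return True


def send_report(service):
    """Hypothetical high-level code with an ordering defect: it sends first."""
    service.send()
    service.compose("team@example.com", "Weekly report")


send_report(EmailServiceStub())      # the stub accepts the wrong order
try:
    send_report(EmailServiceMock())  # the mock rejects it
except AssertionError as exc:
    print("mock caught:", exc)
```

Both stand-ins let the high-level code run before the real email service exists, but only the mock turns the wrong call order into a test failure.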

Bottom-Up Integration
Scaffolding code or drivers are used in place of high-level code. One advantage of bottom-up integration is that it can begin before the system architecture is in place. One disadvantage of bottom-up integration is that it postpones testing of the system architecture. This is risky because architecture is a critical aspect of a software system that needs to be verified early. Incremental integration can of course be a mix of top-down and bottom-up.
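A driver is the mirror image of a stub: throwaway code that calls a finished low-level component because its real caller doesn't exist yet. The sketch below assumes a hypothetical low-level component, `parse_price`, and a driver that exercises it in place of checkout code that has not been written.

```python
def parse_price(text):
    """Low-level component (hypothetical): convert '$1,234.50' to cents."""
    cleaned = text.replace("$", "").replace(",", "")
    dollars, _, cents = cleaned.partition(".")
    # Pad or trim the fractional part to exactly two digits.
    return int(dollars) * 100 + int(cents.ljust(2, "0")[:2])


def run_driver():
    """Driver: stands in for the not-yet-written code that will call parse_price()."""
    cases = [("$1,234.50", 123450), ("$7", 700), ("$0.5", 50)]
    failures = 0
    for text, expected in cases:
        actual = parse_price(text)
        if actual == expected:
            print(f"ok: parse_price({text!r}) = {actual}")
        else:
            failures += 1
            print(f"FAIL: parse_price({text!r}) = {actual}, expected {expected}")
    return failures


run_driver()
```

Once the real high-level code is integrated, the driver is discarded or kept as a unit test.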

Continuous Integration
Top-down and bottom-up describe how you are going to integrate. Continuous integration describes when, or how often, you are going to integrate. Continuous integration = frequent integration, where frequent = daily, maybe hourly, but not longer than weekly. You can’t find integration problems early unless you integrate frequently. Clarification: what qualifies as frequent depends on the length of the project. If it’s a 6-year project, integration every two weeks might be considered frequent. If it’s a 1-month project, every two weeks definitely isn’t frequent.

Test Process
Test planning
Test case generation
Test environment preparation
Execution
Test results evaluation
Problem reporting
Defect tracking
We can discuss concepts, principles, techniques, methods and measures related to testing independently. In practice all of these ideas are integrated in the testing process: a defined and controlled way of testing software. The test process provides guidelines for the day-to-day activities of testing. Testing should start early. You don’t need code to begin testing activities.
Test planning; test case design; test creation and execution (creating and running tests); reporting issues; measurement; test closure.

Testing artifacts/products
Test plan – who is doing what when.
Test case specification – specification of actual test cases including preconditions, inputs and expected results.
Test procedure specification – how to run test cases.
Test log – results of testing.
Test incident report – record and track errors.

Test Plan
IEEE std: “test plan: (A) A document describing the scope, approach, resources, and schedule of intended test activities. It identifies test items, the features to be tested, the testing tasks, who will do each task, and any risks requiring contingency planning. (B) A document that describes the technical and management approach to be followed for testing a system or component. Typical contents identify the items to be tested, tasks to be performed, responsibilities, schedules, and required resources for the testing activity.”
Q. Why have a separate plan for testing? Why not one project plan with development and testing activities? A. You might want to reuse the test plan. The system will be developed once but tested often.
“test approach: A particular method that will be employed to pick the particular test case values. This may vary in specificity from very general (e.g., black box or white box) to very specific (e.g., minimum and maximum boundary values).”

Test Case
A test case is a set of inputs and expected outputs when the system is operating according to its requirements.
IEEE std: “test case: (A) A set of test inputs, execution conditions, and expected results developed for a particular objective, such as to exercise a particular program path or to verify compliance with a specific requirement.”
ISTQB: “A test case consists of a set of input values, execution preconditions, expected results and execution post-conditions, developed to cover certain test condition(s).”
Expected results include the consequences of the test.

Oracle When you run a test there has to be some way of determining if the test failed. For every test there needs to be an oracle that compares expected output to actual output in order to determine if the test failed. For tests that are executed manually, the tester is the oracle. For automated unit tests, actual and expected results are compared with code.
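The idea of comparing actual and expected results with code can be sketched as follows. This is a minimal illustration, not a real test framework: the class name OracleSketch, the add routine and the oracle helper are all assumptions made up for this example.

```java
// Hypothetical sketch: an automated oracle compares expected output to actual output.
final class OracleSketch {
    // Unit under test (assumed for illustration).
    static int add(int a, int b) {
        return a + b;
    }

    // The oracle: returns true when the test passes, false when it fails.
    static boolean oracle(int expected, int actual) {
        return expected == actual;
    }

    public static void main(String[] args) {
        int actual = add(354, -832);
        System.out.println(oracle(-478, actual) ? "PASS" : "FAIL");
    }
}
```

In a manual test the tester plays the role of the oracle method; in an automated unit test that comparison is usually an assertion supplied by the test framework.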

Test Procedure
IEEE std: “test procedure: (A) Detailed instructions for the setup, execution, and evaluation of results for a given test case. (B) A document containing a set of associated instructions as in (A). (C) Documentation that specifies a sequence of actions for the execution of a test.”
“The test procedure (or manual test script) specifies the sequence of action for the execution of a test. If tests are run using a test execution tool, the sequence of actions is specified in a test script (which is an automated test procedure).”

Incident Reporting
What you track depends on what you need to understand, control and estimate. Data will help identify trends and opportunities for improving the software process.
An “incident” is logged when actual results deviate from expected results. It’s an incident at this point because there usually isn’t enough information to say definitively that it is a defect. Incidents are classified according to rules and tracked.
Example incident report:
Incident ID – Unique identifier for the incident.
Date – Date the incident was discovered.
Product – Affected product.
Originator – Person who discovered the incident.
Severity – How serious is the incident?
Status – One of: new, assigned, resolved (fixed, duplicate, won’t fix, invalid, not reproducible), verified, closed.
Description – Description of the incident, including expected and actual results. Include all the information needed to reproduce the incident.
TBD: Read: IEEE Software article: Enhancing Defect Tracking Systems to Facilitate Software Quality Improvement

Testing Strategies Two very broad testing strategies are:
White-Box (Transparent) – Test cases are derived from knowledge of the design and/or implementation.
Black-Box (Opaque) – Test cases are derived from external software specifications.
White-box testing takes into account the internal design and implementation of the system under test rather than just the inputs and outputs. Look at the internal implementation for clues that are helpful when testing. If you are interested in code coverage, you need to be doing white-box testing.
Opaque vs. transparent are probably better terms.
There is also gray-box testing, which is, as you might imagine, a combination of white-box and black-box testing. With gray-box testing you test from a black-box perspective but take a peek at the code for inspiration.

Testing Strategies White and black box testing are complementary.
Q. What’s the danger (types of defects missed) if you do only white box testing?
A. You risk testing what was written, not what was supposed to be written.
Q. What’s the danger (types of defects missed) if you do only black box testing?
A. If you don’t look at the code, many control flow paths are likely to go untested.

Black-Box Techniques
Equivalence partitioning – Test cases are divided into groups such that two test cases are in the same group if both are likely to find the same error. Classes can be formed based on inputs or outputs.
Boundary value analysis – Create test cases with values that are on the edges of equivalence partitions.
We look for techniques that lead us to tests with a greater probability of causing the software to fail.
Equivalence partitioning (EP) is a strategy for coping with an infinite (or overwhelming) number of potential test cases, which is what most non-trivial programs have. The strategy is to divide the test cases into a finite number of equivalence partitions and run a few test cases from each partition, which gives a profitable and manageable subset. The criterion for inclusion in a group is the expectation that if one test case in the group finds a certain defect, all test cases in the group will find the same defect. For example, if you are testing a function that adds two integers, add(354,-832) is just as likely to find a defect as add(418,-723).
Output example: a function that computes the square root of its input might have a class of all inputs that return an integer.
These are precise terms for common-sense testing techniques that most testers, even those without any formal training, probably use instinctively.
There are limits on valid and invalid input values and limits on values within an equivalence partition. Boundary value analysis results in test cases that exercise these minimum and maximum values.

Equivalence Partitioning
What test cases would you use to test the following routine?
    // This routine returns true if score is >=
    // 50% of possiblePoints, else it returns false.
    // This routine throws an exception if either
    // input is negative or score is > possiblePoints.
    boolean isPassing(int score, int possiblePoints);
ID   Input   Expected Result
1    -1,     Exception
2    50,     true
. . .
Simple example: boolean isEven(int i);
“The goal of equivalence partitioning is to reduce the set of possible test cases into a smaller, manageable set that still adequately tests the software.”
Equivalence partitioning is subjective and non-deterministic. Two equally skilled testers, testing the same non-trivial program, could arrive at different equivalence classes.
Consider augmenting this example with an example of testing the Effective_Interest_Rate(rate,periods) function in Excel. How We Test Software at Microsoft (on Safari) has a good example of creating equivalence classes for testing a Gregorian date.

Equivalence Classes Score/Possible Pts >= 50%
The classes are not unique. Instead of one class, you could have two equivalence classes: Score / Possible Points > 50% could be one class, and Score / Possible Points = 50% could be another.

Test Cases
Test Case   Data     Expected Outcome   Classes Covered
1           5,10     True
2           30,30    True
3           19,40    False
4           -1,10    Exception
Write test cases covering all valid equivalence classes. Cover as many valid equivalence classes as you can with each test case. (Note, there are no overlapping equivalence classes in this example.)
Write one and only one test case for each invalid equivalence class. When testing a value from an equivalence class that is expected to return an invalid result, all other values should be valid. You want to isolate tests of invalid equivalence classes.
It is sufficient to use one test case from each partition. Taking a couple from each group gives extra safety, since it’s possible the equivalence classes were improperly defined.
(TBD: Consider adding the following EC: score = possible points. That would give overlapping equivalence classes.)
Is the following a valid test case: -1,-1? (no, it takes values from two invalid equivalence classes)
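The four test cases above can be turned into code. The sketch below is one possible implementation of isPassing written only so the test cases have something to run against. The class name PassingCheck, the helper throwsException and the choice of IllegalArgumentException are assumptions; the slide only says "throws an exception".

```java
// Hypothetical implementation of the routine described on the slide, plus its test cases.
final class PassingCheck {
    // Returns true if score is >= 50% of possiblePoints, else false.
    // Throws IllegalArgumentException (assumed exception type) if either
    // input is negative or score > possiblePoints.
    static boolean isPassing(int score, int possiblePoints) {
        if (score < 0 || possiblePoints < 0 || score > possiblePoints) {
            throw new IllegalArgumentException("invalid score or possiblePoints");
        }
        // Compare 2*score with possiblePoints to avoid integer-division rounding.
        return 2L * score >= possiblePoints;
    }

    // True if the call throws, which is the expected result for invalid classes.
    static boolean throwsException(int score, int possiblePoints) {
        try {
            isPassing(score, possiblePoints);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPassing(5, 10));        // test case 1
        System.out.println(isPassing(30, 30));       // test case 2
        System.out.println(isPassing(19, 40));       // test case 3
        System.out.println(throwsException(-1, 10)); // test case 4
    }
}
```

Each invalid input is tested with the other argument valid, which keeps the invalid equivalence classes isolated.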

Boundary Value Analysis
Rather than select any element within an equivalence class, select values at the edges of the equivalence class. For example, given the class 1 <= input <= 12, you would select the values 0, 1, 12 and 13. Test all boundary conditions, on both sides.
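A small sketch of picking boundary values for an inclusive range: the value just below the lower bound, the two bounds themselves, and the value just above the upper bound. For the class 1 <= input <= 12 that is 0, 1, 12 and 13. The class name BoundaryValues and the routine isValidInput are hypothetical stand-ins for whatever is being tested.

```java
// Hypothetical sketch of boundary value selection for a range lo..hi (inclusive).
final class BoundaryValues {
    // Values on both sides of each boundary: lo-1, lo, hi, hi+1.
    // (Assumes lo and hi are not at the extremes of the int range.)
    static int[] forRange(int lo, int hi) {
        return new int[] { lo - 1, lo, hi, hi + 1 };
    }

    // Routine under test (assumed): accepts inputs in the class 1 <= input <= 12.
    static boolean isValidInput(int input) {
        return input >= 1 && input <= 12;
    }

    public static void main(String[] args) {
        for (int v : forRange(1, 12)) {
            System.out.println(v + " -> " + isValidInput(v));
        }
    }
}
```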

Experience-Based Techniques
Error guessing – “testers anticipate defects based on experience” Think of potential errors that programmers might make under the circumstances and create test cases that attempt to show that the error was made. Keeping failure data helps to identify recurring errors. Use skill and intuition. Applicable during white- and black-box testing.

Testing Effectiveness Metrics
Defect density; Defect removal effectiveness (efficiency); Code coverage
Are 12 errors a lot to find during testing (suggesting a low quality program)?

Defect Density
Software engineers often need to quantify how buggy a piece of software is. Defect counts alone are not very meaningful, though. Is 12 defects a lot to have in a program? It depends on the size of the product (as measured by features or LOC).
12 defects in a 200 line program = 60 defects/KLOC → low quality.
12 defects in a 20,000 line program = 0.6 defects/KLOC → high quality.
Defect counts are more meaningful when tracked relative to the size of the software.
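The arithmetic behind the two figures above, as a short sketch. The class and method names are made up for illustration; the formula is simply defects divided by size in thousands of lines of code.

```java
// Hypothetical sketch: defect density in defects per KLOC.
final class DefectDensitySketch {
    // defects / (linesOfCode / 1000), written to avoid intermediate rounding.
    static double perKloc(int defects, int linesOfCode) {
        if (linesOfCode <= 0) {
            throw new IllegalArgumentException("linesOfCode must be positive");
        }
        return defects * 1000.0 / linesOfCode;
    }

    public static void main(String[] args) {
        System.out.println(perKloc(12, 200));    // the 200 line program
        System.out.println(perKloc(12, 20000));  // the 20,000 line program
    }
}
```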

Defect Density [Cont]
Defect density is an important measure of software quality.
Defect density = total known defects / size. It is often measured in defects/KLOC (KLOC = thousand lines of code). Dividing by size normalizes the measure, which allows comparison between modules of different sizes. Size is typically measured in LOC or FPs (function points).
Measurement is over a particular time period (e.g. from system test through one year after release, or from system release through the first year in production). If you plan to compare defect densities between products, it is important that they be calculated over the same time period for all products.
You might calculate defect density after inspections to decide which modules should be rewritten or given more focused testing.
LOC are easy to understand and measure (measurement can be automatic) but can deliver misleading results. The main problem is that LOC is a measure of solution size, and depending on the implementation the same problem can have wildly different solution sizes. Two problems with LOC:
(1) It gives the wrong incentive. When defect density is used to evaluate programmers, they have an incentive to write verbose code, which in the long run is harder to maintain and evolve than concise code.
(2) It provides misleading results. For the same number of defects, a concise implementation has a higher defect density than a more verbose implementation, and so can appear to be of lower quality.
Comparisons across teams, application domains or languages are not meaningful. You can compare across releases.
Be sure to define what a defect is and what a LOC is. You need consistent definitions if you are going to do comparisons.
Severity levels matter. 12 trivial defects are not as bad as 12 critical/severe or blocker defects. Consider weighting defects.
Is 12 defects a lot to have in a product? If it is a 200 line program, yes. If it is a 100,000 LOC program, not so much.

Defect Density [Cont]
Defect density measures can be used to track product quality across multiple releases. This is a good example of turning data into information. Note how Release N+2 levels off faster. Release N doesn’t quite level off.

Defect removal effectiveness
DRE (also called defect removal efficiency) tells you what percentage of the defects that are present are being found at a certain point in time.
Example: when you started system test there were 40 errors to be found. You found 30 of them. The defect removal effectiveness of system test is 30/40 or 75%.
The trick, of course, is calculating the number of latent defects at any one point in the development process, say time X. Solution: wait for a specified duration after time X and add up the number of defects found since time X. Number of defects found during a development phase + those found after that phase = latent defects present during the phase.
DRE is therefore a measure that can only be calculated retroactively: the DRE for phase X can only be calculated a certain period of time after phase X. There is no other way to count the latent defects missed while testing during phase X.
Early defect detection is a goal of many teams because the cost of finding and fixing a defect increases with the amount of time the defect remains in the product. DRE can help teams measure their ability to detect defects early, closer to their origin.
There is also development DRE, which is the proportion of defects caught in development as opposed to production.
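The system test example (30 found out of 40 present) can be written as a small calculation. The names below are hypothetical. The sketch assumes the latent defect count is the number found in the phase plus the number found afterwards, as described above.

```java
// Hypothetical sketch: defect removal effectiveness (DRE) for one phase, in percent.
final class DefectRemovalSketch {
    // Latent defects at the start of the phase are only known retroactively:
    // defects found in the phase + defects found after the phase.
    static double effectiveness(int foundInPhase, int foundAfterPhase) {
        if (foundInPhase < 0 || foundAfterPhase < 0) {
            throw new IllegalArgumentException("counts must not be negative");
        }
        int latent = foundInPhase + foundAfterPhase;
        if (latent == 0) {
            throw new IllegalArgumentException("no defects recorded");
        }
        return 100.0 * foundInPhase / latent;
    }

    public static void main(String[] args) {
        // 30 defects found in system test, 10 more found later: 40 were present.
        System.out.println(effectiveness(30, 10) + "%");
    }
}
```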

Example Calculation of Defect Removal Effectiveness
Example is from:

Levels of White-Box Code Coverage
Another important testing metric is code coverage: how thoroughly have paths through the code been tested? Some of the more popular options are:
Statement coverage
Decision coverage (aka branch coverage)
Condition coverage
Basis path coverage
Path coverage
Studying code coverage will provide some justification for Dijkstra’s quote: “Program testing can be used to show the presence of bugs, but never to show their absence!”
Note that code coverage and input value coverage are two different ideas. You will have to choose different input values to get different paths through the code, but covering paths through the code isn’t the same as covering the range of possible inputs.

Statement Coverage
Each line of code is executed.
    if (a) stmt1;
    if (b) stmt2;
a=t;b=t gives statement coverage
a=t;b=f doesn’t give statement coverage
The following needs two test cases:
    if (a) stmt1; else stmt2;

Decision Coverage Decision coverage is also known as branch coverage
The boolean condition at every branch point (if, while, etc.) has been evaluated to both T and F.
    if (a and b) stmt1;
    if (c) stmt2;
a=t;b=t;c=t and a=f;b=?;c=f gives decision coverage
Q: Does decision coverage guarantee statement coverage?
A: Yes, assuming the program has at least one decision, there is only one entry point, and you ignore exception handling code.

Does statement coverage guarantee decision coverage?
    if (a) stmt1;
If no, give an example of input that gives statement coverage but not decision coverage.
No. a = t. If this is the only test case, you don’t have decision coverage because you haven’t tested the case where the if condition evaluates to false.
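The difference can be made visible with a hand-instrumented version of the fragment. This is only an illustrative sketch: real projects would use a coverage tool, and the class CoverageDemo and its bookkeeping fields are assumptions made up for this example.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: hand-instrumented version of "if (a) stmt1;".
final class CoverageDemo {
    // Outcomes observed so far for the decision "if (a)".
    static final Set<Boolean> decisionOutcomes = new HashSet<>();
    static boolean stmt1Executed = false;

    static void reset() {
        decisionOutcomes.clear();
        stmt1Executed = false;
    }

    // Unit under test.
    static void unit(boolean a) {
        decisionOutcomes.add(a);      // record the decision outcome
        if (a) {
            stmt1Executed = true;     // stands in for stmt1
        }
    }

    // Every statement has run (stmt1 is the only statement besides the decision).
    static boolean statementCoverage() {
        return stmt1Executed;
    }

    // The decision has evaluated to both true and false.
    static boolean decisionCoverage() {
        return decisionOutcomes.contains(true) && decisionOutcomes.contains(false);
    }

    public static void main(String[] args) {
        reset();
        unit(true);   // the single test case a = t
        System.out.println("statement: " + statementCoverage() + ", decision: " + decisionCoverage());
        unit(false);  // adding a = f completes decision coverage
        System.out.println("statement: " + statementCoverage() + ", decision: " + decisionCoverage());
    }
}
```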

Condition Coverage
Each boolean sub-expression at a branch point has been evaluated to true and false.
    if (a and b) stmt1;
a=t,b=t and a=f,b=f gives condition coverage

Condition Coverage
Does condition coverage guarantee decision coverage?
    if (a and b) stmt1;
If no, give example input that gives condition coverage but not decision coverage.
No. a=t,b=f and a=f,b=t. This gives condition coverage but doesn’t test the case where the branch is taken.
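A sketch that checks both criteria for a set of test inputs to "if (a and b)". The class and method names are hypothetical. The sketch also assumes each sub-expression counts as evaluated in every test case, that is, it ignores short-circuit evaluation.

```java
// Hypothetical sketch: condition vs. decision coverage for "if (a and b) stmt1;".
// Each test case is a pair {a, b}.
final class ConditionVsDecision {
    // Condition coverage: a and b have each been both true and false.
    static boolean conditionCoverage(boolean[][] tests) {
        boolean aTrue = false, aFalse = false, bTrue = false, bFalse = false;
        for (boolean[] t : tests) {
            if (t[0]) aTrue = true; else aFalse = true;
            if (t[1]) bTrue = true; else bFalse = true;
        }
        return aTrue && aFalse && bTrue && bFalse;
    }

    // Decision coverage: the whole expression (a and b) has been both true and false.
    static boolean decisionCoverage(boolean[][] tests) {
        boolean sawTrue = false, sawFalse = false;
        for (boolean[] t : tests) {
            if (t[0] && t[1]) sawTrue = true; else sawFalse = true;
        }
        return sawTrue && sawFalse;
    }

    public static void main(String[] args) {
        boolean[][] slideAnswer = { { true, false }, { false, true } };
        System.out.println("condition: " + conditionCoverage(slideAnswer));
        System.out.println("decision:  " + decisionCoverage(slideAnswer));
    }
}
```

With the inputs a=t,b=f and a=f,b=t the first check succeeds and the second fails, because the branch is never taken.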

Basis Path Coverage
A path represents one flow of execution from the start of a method to its exit. For example, a method with 3 independent decisions has 2^3 = 8 paths:
    if (a) stmt1; if (b) stmt2; if (c) stmt3;
OR
    if (a) stmt1; elseif (b) stmt2; elseif (c) stmt3;
Input needed to exercise every path through the first fragment:
a=f,b=f,c=f
a=f,b=f,c=t
a=f,b=t,c=f
a=f,b=t,c=t
a=t,b=f,c=f
a=t,b=f,c=t
a=t,b=t,c=f
a=t,b=t,c=t
In the elseif chain only 4 of these inputs lead to distinct paths, because b and c are not evaluated once an earlier condition is true.

Basis Path Coverage Loops in code make path coverage impractical for most programs. Each time through a loop is a new path. A practical alternative to path coverage is basis path coverage.

Basis Path Coverage
Basis path coverage means exercising a set of linearly independent paths through a method or section of code. The set of linearly independent paths through a method is special because it is the smallest set of paths that can be combined to create every other possible path through the method. The cyclomatic complexity of a method is the number of linearly independent paths through the method. These paths define the basis paths for the method.

Basis Path Coverage
A path represents one flow of execution from the start of a method to its exit. For example, a method with 3 independent decisions has 2^3 = 8 paths:
    if (a) stmt1; if (b) stmt2; if (c) stmt3;
    if (a) stmt1; elseif (b) stmt2; elseif (c) stmt3;
There are 4 linearly independent paths through both sections of code above. To see this, draw the control flow graph and calculate the cyclomatic complexity. The set of linearly independent paths for a section of code is not unique.
TBD: add basis path example. References:
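The count of 4 can be checked with McCabe's formula V(G) = E - N + 2, where E is the number of edges and N the number of nodes in a control flow graph with a single entry and a single exit. The sketch below encodes one possible control flow graph for each of the two fragments as successor lists. The node numbering, the class name and the constant names are assumptions made for this illustration.

```java
// Hypothetical sketch: cyclomatic complexity from a control flow graph.
final class CyclomaticSketch {
    // successors[i] lists the nodes reachable directly from node i.
    // V(G) = E - N + 2 for a connected graph with one entry and one exit.
    static int complexity(int[][] successors) {
        int edges = 0;
        for (int[] s : successors) {
            edges += s.length;
        }
        return edges - successors.length + 2;
    }

    // if (a) stmt1; if (b) stmt2; if (c) stmt3;
    // Nodes: 0=if(a) 1=stmt1 2=if(b) 3=stmt2 4=if(c) 5=stmt3 6=exit
    static final int[][] SEQUENTIAL_IFS = {
        { 1, 2 }, { 2 }, { 3, 4 }, { 4 }, { 5, 6 }, { 6 }, { }
    };

    // if (a) stmt1; elseif (b) stmt2; elseif (c) stmt3;
    // Nodes: 0=if(a) 1=stmt1 2=elseif(b) 3=stmt2 4=elseif(c) 5=stmt3 6=exit
    static final int[][] ELSEIF_CHAIN = {
        { 1, 2 }, { 6 }, { 3, 4 }, { 6 }, { 5, 6 }, { 6 }, { }
    };

    public static void main(String[] args) {
        System.out.println(complexity(SEQUENTIAL_IFS));
        System.out.println(complexity(ELSEIF_CHAIN));
    }
}
```

Both graphs have 9 edges and 7 nodes, so both fragments have a cyclomatic complexity of 9 - 7 + 2 = 4, even though the number of complete paths through them differs.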

Path Coverage
Path coverage is the most comprehensive type of code coverage. In order to achieve path coverage you need a set of test cases that executes every possible route through a unit of code. Path coverage is impractical for all but the most trivial units of code. Loops are the biggest obstacle to achieving path coverage. Each time through a loop is a new/different path.

Path Coverage How many paths are there in the following unit of code?
    if (a) stmt1; if (b) stmt2; if (c) stmt3;
If there is a loop in your flow chart and you can’t “unroll” the loop (calculate an upper bound on the number of times through the loop), it’s impossible to achieve path coverage.
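One way to answer the question is to run every combination of a, b and c and count the distinct routes taken. The sketch below does that for the three sequential if statements. The class PathEnumeration and the string encoding of a path are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: enumerate the paths through
//   if (a) stmt1; if (b) stmt2; if (c) stmt3;
final class PathEnumeration {
    // Returns a description of the route taken for one input combination.
    static String path(boolean a, boolean b, boolean c) {
        StringBuilder sb = new StringBuilder("start");
        if (a) sb.append(" stmt1");
        if (b) sb.append(" stmt2");
        if (c) sb.append(" stmt3");
        return sb.append(" exit").toString();
    }

    // Runs all 2^3 input combinations and counts the distinct paths.
    static int countDistinctPaths() {
        Set<String> paths = new HashSet<>();
        for (int bits = 0; bits < 8; bits++) {
            paths.add(path((bits & 4) != 0, (bits & 2) != 0, (bits & 1) != 0));
        }
        return paths.size();
    }

    public static void main(String[] args) {
        System.out.println(countDistinctPaths());
    }
}
```

Because the three decisions are independent, each of the 8 input combinations takes a different route, so 8 test cases are needed for path coverage of this fragment.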

Path Coverage
What inputs (test cases) are needed to achieve path coverage on the following code fragment?
    procedure AddTwoNumbers()
    top:
        print "Enter two numbers";
        read a;
        read b;
        print a+b;
        if (a != -1) goto top;
There is no finite number of test cases that will achieve path coverage. If there is a loop in your flow chart and you can’t “unroll” the loop (calculate an upper bound on the number of times through the loop), it’s impossible to achieve path coverage.

Deciding when to stop testing
“When the marginal cost of finding another defect exceeds the expected loss from that defect.”
Both factors (the cost of finding another defect and the expected loss from that defect) can only be estimated.
Stopping criteria should be determined at the start of a project. Why? The question is really how much testing is enough to achieve the stated objectives.
Factors to consider when deciding when to stop: code coverage; functionality coverage; budget or schedule constraints.

Peer Reviews
Inspection; Walkthrough; Pair Programming; Code Review
Technical review vs. management review
TBD: Wiegers, Karl E. (2001). Peer Reviews in Software: A Practical Guide
TBD: IEEE Standard for Software Reviews and Audits
From Economics of Software Quality (see reference for more info):
“To utilize formal inspections there must be a team of trained individuals that include a moderator and a recorder. In addition, there may be other inspectors, plus, of course, the person whose work products are being inspected. Inspection teams have a minimum size of three (moderator, recorder, and principal); an average size of five (moderator, recorder, principal, and two other inspectors); and a maximum size of eight.”
“Formal inspections have demonstrated about the highest levels of defect removal efficiency ever recorded. They average about 85% efficiency in finding defects and have occasionally topped 97%.”
“While this description of the inspection process might make it sound time-consuming and expensive, empirical data demonstrates that applications that use formal inspections have shorter schedules and lower costs than similar applications that use only testing. A prime example was the IBM IMS database product in the 1970s. Prior to inspections, IMS testing took about 90 days of three-shift testing. After inspections had been deployed, the same testing cycle was reduced to 30 days of one-shift testing.”

Old Example
Use equivalence partitioning to define test cases for the following function:
    // This function takes integer values for month,
    // day and year and returns the day of the
    // week in string format. The function returns
    // an empty string when given invalid input values.
    // Year must be > 1752.
    // Example: DayOfWeek(12,31,2009) → “Thursday”
    // Example: DayOfWeek(13,13,2009) → “”
    String DayOfWeek(int month, int day, int year);
Assume the Gregorian calendar.
Consider replacing or augmenting this example with an example of testing the Effective_Interest_Rate(rate,periods) function in Excel. How We Test Software at Microsoft (on Safari) has a good example of creating equivalence classes for testing a Gregorian date.

Equivalence Classes
Month < 1 (invalid)
Month > 12 (invalid)
Year > 1752 (valid)
Year < 1753 (invalid)
Month = 1; 0 < Day < 32 (valid)
Month = 1; Day > 31 (invalid)
Month = 4; 0 < Day < 31 (valid)
Month = 4; Day > 30 (invalid)
Etc…
Note, month 2 is more difficult to define because of leap years. Different testers will come up with different equivalence classes.
Question: what about adding an equivalence class for 0 < Month < 13 (valid)? Not needed. It would overlap 4 and 6. It doesn’t add anything.

Test Cases
Test Case #   Test Case Data   Expected Outcome   Classes Covered
1             1,1,2010         “Friday”           3,5
2             0,1,1999         “”
3             45,1,1999        “”
4             4,1,1752         “”                 8
Write test cases covering all valid equivalence classes. Cover as many valid equivalence classes as you can with each test case.
Write one and only one test case for each invalid equivalence class. When testing a value from an equivalence class that is expected to return an invalid result, all other values should be valid. You want to isolate tests of invalid equivalence classes.
It is sufficient to use one test case from each partition. Taking a couple from each group gives extra safety, since it’s possible the equivalence classes were improperly defined.
Is the following test case one you would expect to find when doing equivalence class testing? 0,1,1752 (no, it takes values from two invalid equivalence classes)
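To run test cases like these you need something to run them against. The sketch below is one possible implementation of the function built on java.time, which uses the proleptic ISO (Gregorian) calendar. The class name DayOfWeekSketch, the lower-case method name and the use of LocalDate for validation are assumptions, not the implementation the slides had in mind.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.time.format.TextStyle;
import java.util.Locale;

// Hypothetical sketch of the function under test.
final class DayOfWeekSketch {
    // Returns the English day name, or "" for invalid input. Year must be > 1752.
    static String dayOfWeek(int month, int day, int year) {
        if (year <= 1752) {
            return "";
        }
        try {
            // LocalDate.of rejects out-of-range months and days,
            // including day 31 in a 30-day month and Feb 29 in a non-leap year.
            return LocalDate.of(year, month, day)
                    .getDayOfWeek()
                    .getDisplayName(TextStyle.FULL, Locale.ENGLISH);
        } catch (DateTimeException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(dayOfWeek(1, 1, 2010));   // test case 1
        System.out.println(dayOfWeek(0, 1, 1999));   // test case 2
        System.out.println(dayOfWeek(45, 1, 1999));  // test case 3
        System.out.println(dayOfWeek(4, 1, 1752));   // test case 4
    }
}
```

Each of the invalid test cases changes exactly one argument to an invalid value, so a failure points at a single equivalence class.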
