
Directed Random Testing Evaluation

FDRT evaluation: high-level
- Evaluate coverage and error-detection ability on large, real, and stable libraries (~800 KLOC total)
- Internal evaluation
  - Compare with basic random generation (random walk)
  - Evaluate key ideas
- External evaluation
  - Compare with a host of systematic techniques
    - Experimentally
    - Industrial case studies
  - Minimization
  - Random/enumerative generation

Internal evaluation
- Random walk
  - Vast majority of effort spent generating short sequences
  - Rare failures are more likely to be triggered by a long sequence
- Component-based generation
  - Longer sequences, at the cost of diversity
- Randoop
  - Increases diversity by pruning the space
  - Each component yields a distinct object state
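The generation strategy above can be sketched in a few lines. This is a toy model of the feedback-directed idea, not Randoop's actual implementation: sequences of calls against `java.util.ArrayDeque` are grown one random call at a time, each candidate is executed immediately, discarded if it throws, and kept only if its final object state has not been seen before. All names here are illustrative.

```java
import java.util.*;

/**
 * Minimal sketch of feedback-directed random test generation: extend a
 * randomly chosen sequence, execute it, discard illegal extensions, and
 * prune sequences whose final state duplicates one already in the pool.
 */
public class FeedbackDirectedSketch {

    /** Grows sequences for `iterations` rounds; returns {poolSize, discarded}. */
    static int[] generate(int iterations, long seed) {
        Random rnd = new Random(seed);
        List<List<String>> pool = new ArrayList<>();
        pool.add(new ArrayList<>());                    // start from the empty sequence
        Set<List<Integer>> seenStates = new HashSet<>();
        int discarded = 0;

        for (int i = 0; i < iterations; i++) {
            // Pick a random existing sequence and extend it with one random call.
            List<String> seq = new ArrayList<>(pool.get(rnd.nextInt(pool.size())));
            seq.add(rnd.nextBoolean() ? "push " + rnd.nextInt(3) : "pop");
            try {
                Deque<Integer> state = execute(seq);
                // Feedback: keep the sequence only if it reaches a new state.
                if (seenStates.add(new ArrayList<>(state)))
                    pool.add(seq);
            } catch (NoSuchElementException e) {
                discarded++;                            // illegal: pop on an empty deque
            }
        }
        return new int[] { pool.size(), discarded };
    }

    /** Replays a sequence of calls against a fresh ArrayDeque. */
    static Deque<Integer> execute(List<String> seq) {
        Deque<Integer> d = new ArrayDeque<>();
        for (String op : seq) {
            if (op.startsWith("push")) d.push(Integer.parseInt(op.substring(5)));
            else d.pop();                               // throws NoSuchElementException when empty
        }
        return d;
    }

    public static void main(String[] args) {
        int[] r = generate(500, 42);
        System.out.println("kept sequences: " + r[0] + ", discarded: " + r[1]);
    }
}
```

A pure random walk would restart from the empty sequence each time, spending most of its budget on short prefixes; the pool is what lets legal long sequences accumulate.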

External evaluation
1. Small data structures
   - Pervasive in the literature
   - Allows for a fair comparison
2. Libraries
   - To determine practical effectiveness
3. User studies: individual programmers
   - Single (or few) users; MIT students unfamiliar with testing and test generation
4. Industrial case study

Xie data structures
- Seven data structures (stack, bounded stack, list, BST, heap, red-black tree, binomial heap)
- Used in previous research
  - Bounded exhaustive testing [Marinov 2003]
  - Symbolic execution [Xie 2005]
  - Exhaustive method sequence generation [Xie 2004]
- All of the above techniques achieve high coverage in seconds
- Tools not publicly available

FDRT achieves comparable results

data structure             time (s)   branch cov.
Bounded stack (30 LOC)     1          100%
Unbounded stack (59 LOC)   1          100%
BS Tree (91 LOC)           1          96%
Binomial heap (309 LOC)    1          84%
Linked list (253 LOC)      1          100%
Tree map (370 LOC)         1          81%
Heap array (71 LOC)        1          100%

Visser containers
- Visser et al. (2006) compare several input generation techniques:
  - Model checking with state matching
  - Model checking with abstract state matching
  - Symbolic execution
  - Symbolic execution with abstract state matching
  - Undirected random testing
- Comparison in terms of branch and predicate coverage
- Four nontrivial container data structures
- Experimental framework and tool available

FDRT: >= coverage, < time
[Charts comparing branch and predicate coverage: feedback-directed generation matches or exceeds both the best systematic technique and undirected random testing, in less time.]

Libraries: error detection

library                        LOC    Classes   test cases output   error-revealing test cases   distinct errors
JDK (2 libraries)              53K
Apache commons (5 libraries)   150K
.NET framework (5 libraries)   582K
Total                          785K

Errors found: examples
- JDK Collections classes have 4 methods that create objects violating the o.equals(o) contract
- javax.xml creates objects that cause hashCode and toString to crash, even though the objects are well-formed XML constructs
- Apache libraries have constructors that leave fields unset, leading to NPEs in calls to equals, hashCode, and toString (this counts as only one bug)
- Many Apache classes require a call to an init() method before the object is legal, which led to many false positives
- .NET framework has at least 175 methods that throw an exception forbidden by the library specification (NPE, out-of-bounds, or illegal-state exception)
- .NET framework has 8 methods that violate o.equals(o)
- .NET framework loops forever on a legal but unexpected input
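The o.equals(o) violations above come from oracles that screen every generated object against universal Object contracts. A minimal sketch of such a reflexivity check follows; `BrokenPoint` is a hypothetical class (not from the JDK) whose equals delegates to `==` on a double field, so a NaN-valued instance is not equal to itself.

```java
/**
 * Sketch of a default contract oracle: every object a generated sequence
 * creates is screened against universal Object contracts such as o.equals(o).
 */
public class ContractOracle {

    /** Returns a violation description, or null if o passes the checks. */
    static String check(Object o) {
        if (!o.equals(o)) return "violates o.equals(o)";
        if (o.equals(null)) return "violates o.equals(null) == false";
        return null;
    }

    /** Hypothetical buggy class: x == x is false when x is NaN. */
    static class BrokenPoint {
        final double x;
        BrokenPoint(double x) { this.x = x; }
        @Override public boolean equals(Object other) {
            return other instanceof BrokenPoint && ((BrokenPoint) other).x == x;
        }
        @Override public int hashCode() { return Double.hashCode(x); }
    }

    public static void main(String[] args) {
        System.out.println(check(new BrokenPoint(1.0)));         // passes: prints null
        System.out.println(check(new BrokenPoint(Double.NaN)));  // prints the violation
    }
}
```

Because the contracts are universal, the oracle needs no specification of the class under test, which is what makes it usable on arbitrary library code.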

Comparison with model checking
- Used JPF to generate test inputs for the Java libraries (JDK and Apache)
  - Breadth-first search (suggested strategy)
  - Maximum sequence length of 10
- JPF ran out of memory without finding any errors
  - Out of memory after 32 seconds on average
  - Spent most of its time systematically exploring a very localized portion of the space
- For large libraries, random, sparse sampling seems to be more effective
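A back-of-the-envelope calculation shows why bounded breadth-first exploration stays localized on a large API: with branching factor b (roughly, the number of callable methods per state), a budget of n explored states only reaches depth about log_b(n), while a random walk reaches depth k after only k executions. The numbers below are illustrative, not measurements from the study.

```java
/**
 * Sketch: deepest level a breadth-first search completes within a fixed
 * state budget, for a uniform branching factor.
 */
public class BfsVsRandomWalk {

    static int bfsDepthReached(int branching, long budget) {
        long frontier = 1, visited = 1;
        int depth = 0;
        // Advance one full level at a time while the budget allows it.
        while (visited + frontier * branching <= budget) {
            frontier *= branching;
            visited += frontier;
            depth++;
        }
        return depth;
    }

    public static void main(String[] args) {
        // 50 callable methods per state, a budget of one million states:
        System.out.println("BFS depth: " + bfsDepthReached(50, 1_000_000));  // prints 3
        // A random walk of length 10 samples a depth-10 state after only 10 executions.
    }
}
```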

Comparison with an external random test generator
- JCrasher implements undirected random test generation
- Creates random method call sequences
  - Does not use feedback from execution
- Reports sequences that throw exceptions
- Found 1 error in the Java libraries
  - Reported 595 false positives

Regression testing
- Randoop can create regression oracles
- Generated test cases using JDK 1.5
  - Randoop generated 41K regression test cases
- Ran the resulting test cases on:
  - JDK 1.6 Beta: 25 test cases failed
  - IBM's implementation of the JDK: 73 test cases failed
- Failing test cases pointed to 12 distinct errors
- These errors were not found by the extensive compliance test suite that Sun provides to JDK developers
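The regression oracles work by recording what the current library version actually does and turning those observations into assertions for future versions. A simplified sketch, with an illustrative `TreeMap` call sequence standing in for a generated one:

```java
import java.util.*;

/**
 * Sketch of regression-oracle creation: execute a generated sequence once
 * against the current library version, record the observable return values,
 * and require that a future version reproduce them.
 */
public class RegressionOracleSketch {

    /** Runs a fixed call sequence and records each observable result. */
    static List<String> observe() {
        List<String> observed = new ArrayList<>();
        TreeMap<String, Integer> m = new TreeMap<>();
        m.put("b", 2);
        m.put("a", 1);
        observed.add("size=" + m.size());
        observed.add("firstKey=" + m.firstKey());
        observed.add("remove(a)=" + m.remove("a"));
        return observed;
    }

    public static void main(String[] args) {
        // Capture phase (version N): record what the library did.
        List<String> oracle = observe();
        // Each observation becomes an assertion in an emitted test case.
        for (String obs : oracle)
            System.out.println("assert observed value: " + obs);
        // Replay phase (version N+1): re-running must reproduce the record.
        if (!observe().equals(oracle)) throw new AssertionError("regression detected");
    }
}
```

A failing assertion on a new version signals either a behavioral regression or an intentional change, which is why the failures still required manual inspection to map onto distinct errors.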

User study 1
- Goal: regression/compliance testing
- MEng student at MIT, 3 weeks (part-time)
- Generated test cases using the Sun 1.5 JDK
- Ran the resulting test cases on Sun 1.6 Beta and IBM 1.5
  - Sun 1.6 Beta: 25 test cases failed
  - IBM 1.5: 73 test cases failed
- Failing test cases pointed to 12 distinct errors
  - Not found by Sun's extensive compliance test suite

User study 2
- Goal: usability
- 3 PhD students, 2 weeks
- Applied Randoop to a library
- Ask them about their experience (to-do):
  - In what ways was the tool easy to use?
  - In what ways was the tool difficult to use?
  - Would they use the tool on their own code in the future?

Quotes

FDRT vs. symbolic execution

Industrial case study
- Test team responsible for a critical .NET component
  - 100 KLOC, large API, used by all .NET applications
- Highly stable, heavily tested
  - High reliability particularly important for this component
  - 200 person-years of testing effort (40 testers over 5 years)
  - A test engineer finds 20 new errors per year on average
  - High bar for any new test generation technique
    - Many automatic techniques already applied

Case study results

Human time spent interacting with Randoop   15 hours
CPU time running Randoop                    150 hours
Total distinct method sequences             4 million
New errors revealed                         30

Randoop revealed 30 new errors with 15 hours of total human effort (interacting with Randoop, inspecting results). A test engineer discovers on average 1 new error per 100 hours of effort.

Example errors
- Library created a reference to an invalid address
  - In code for which existing tests achieved 100% branch coverage
- A rarely-used exception was missing its message in a file
  - Something another test tool was supposed to check for
  - Led to a fix in the testing tool in addition to the library
- Concurrency errors
  - Found by combining Randoop with a stress tester
- A method doesn't check for an empty array
  - Missed during manual testing
  - Led to code reviews
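The missing empty-array check is a classic shape of error that random input generation finds easily and manual testing misses. A hypothetical method modeled on that bug class (not the actual .NET code):

```java
/**
 * Sketch of the "missing empty-array check" error class: the method assumes
 * at least one element and fails on the boundary input a random generator
 * produces almost immediately.
 */
public class EmptyArrayBug {

    /** Buggy: reads a[0] without checking the length first. */
    static int max(int[] a) {
        int m = a[0];  // throws ArrayIndexOutOfBoundsException for an empty array
        for (int v : a) m = Math.max(m, v);
        return m;
    }

    public static void main(String[] args) {
        System.out.println(max(new int[] {3, 1, 4}));  // prints 4
        try {
            max(new int[0]);                           // the input manual testing missed
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("caught: empty array unchecked");
        }
    }
}
```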

Comparison with other techniques
- Traditional random testing
  - Randoop found errors not caught by previous random testing
  - Those efforts were restricted to files, streams, and protocols
  - Benefits of "API fuzzing" only now emerging
- Symbolic execution
  - Concurrently with Randoop, the test team used a method sequence generator based on symbolic execution
  - It found no errors over the same period of time, on the same subject program
  - It achieved higher coverage on classes that:
    - Can be tested in isolation
    - Do not go beyond the managed code realm

Plateau effect
- Randoop was cost-effective during the span of the study
- After this initial period of effectiveness, Randoop ceased to reveal new errors
- A parallel run of Randoop revealed fewer errors than its first 2 hours of use on a single machine

Minimization

Selective systematic exploration

Odds and ends
- Repetition
- Weights, other