Feedback-Based Specification, Coding and Testing… …with JWalk
Anthony J H Simons, Neil Griffiths and Christopher D Thomson
Overview
Lazy systematic unit testing: the JWalk testing tool (Simons)
The JWalkEditor tool: integrated Java editor and JWalk (Griffiths)
Feedback-based methodology: prototype, validate, specify, test
Evaluation of cost-effectiveness: testing effort, time, coverage (Thomson)

JWalk was first reported in Automated Software Engineering (Simons, 2007). JWalkEditor was an undergraduate project (Griffiths, 2008). Two comparative evaluations of JWalk vs. JUnit (including a poster at TaicPart 2007) focused on time taken and test effectiveness (Simons and Thomson, 2007; 2008).
Motivation: the state of the art in agile testing
Test-driven development is good, but…
…there is no specification to inform the selection of tests
…manual test-sets are fallible (missing, redundant cases)
…reusing saved tests for conformance testing is fallible – state partitions hide paths and faults (Simons, 2005)
Lazy systematic testing method: the insight
Complete testing requires a specification (even in XP!)
Infer an up-to-date specification from a code prototype
Let tools handle systematic test generation and coverage
Let the programmer focus on novel/unpredicted results

No specifications in XP – Holcombe et al. (2003) first questioned "where do the tests come from?" for XP and similar methods; however XP continues to reject even lightweight specifications. Manual test fallibility – should be obvious why. Fallible conformance testing – XP expects regression testing with saved JUnit test-sets to guarantee the conformance of a modified or extended object to the original intent. However, (Simons, 2005) proved that this is a fallacy. Why? Extending an object formally partitions its state-space. Paths in the original state-space that were completely tested by a given test-set are split by the partitioning of state, so the number of old tested paths falls off at a geometric rate, according to the partitioning factor (roughly, each tested path is split across the refined states at every step, so ever fewer of the refined paths remain covered as path length grows). This is without even considering the new test-cases required by the extension. Up-to-date specification – the current state-space must be acquired for a complete testing method to work.
Lazy Systematic Unit Testing
Lazy Specification
late inference of a specification from evolving code
semi-automatic, by static and dynamic analysis of code, with limited user interaction
the specification evolves in step with the modified code
Systematic Testing
bounded exhaustive testing, up to the specification
emphasis on completeness, conformance and correctness properties after testing, and repeatable test quality

Lazy specification – term coined by analogy with "lazy evaluation" in functional programming. The notion is that you can delay specification for as long as possible, until the code is judged stable (and even later, since the specification can evolve in step with the code).
Semi-automatic inference – the tool knows about systematic coverage criteria, and proposes test sequences by simulation from the prototype code, prompting the user with key test results to confirm or reject. The tool uses these key results to predict further outcomes by rule (see the later slide for examples).
Bounded exhaustive testing – based on T. S. Chow's coverage criterion for state-based systems (Chow, 1978). Completeness = all states and transitions: 1-switch, 2-switch, 3-switch cover, etc. (1-switch = reach every state and exercise up to every transition pair; 2-switch = ditto, but up to every transition triple; and so on).
Conformance, correctness = testing that the OUT conforms to the specification; contrasts with exploratory, random, statistical or fault-locating testing methods.
Repeatable quality = a modified OUT can be tested up to the same confidence criteria as the old OUT; viz. what regression testing seeks to deliver, but cannot. Why can we? Because the same Chow criteria are satisfied for the old and the revised state-spaces – which means that test-sets are completely regenerated for the revised state-space (Simons, 2005).
JWalk Testing Tool: lazy systematic unit testing for Java
static analysis – extracts the public API of a compiled Java class
protocol walking (code exploration) – executes all interleaved methods to a given path depth
algebraic testing (memory states) – validates all observations on all mutator-method sequences
state-based testing (high-level states) – validates all state-transitions (n-switch coverage) for inferred high-level states

Protocol walking – exploring all paths through the OUT's method protocols, viz. all legal permutations of its public API. Notion of a method path: a constructor, followed by some sequence of method invocations. Notion of path depth: the number of methods executed on a new OUT after construction, a single test sequence. All interleaved paths: repeatedly constructing the OUT to exercise all single constructors, followed by all permutations of methods, even including repeated invocations of the same methods.
Algebraic testing – informed by the notion of primitive constructors, derived transformers and observers from ADT algebras. The idea is to test all derived constructions for equivalence to more primitive constructions. In Java, you cannot easily distinguish algebraic constructors from transformers, so both are treated as mutators (state-modifying methods); observers = access methods.
State-based testing – informed by Chow's state and transition coverage criteria, based on high-level states. These correspond to complete partitions of the OUT's concrete memory states, inferred from boolean state predicates present in the API (if any) – see the later slide for an explanation.
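As a rough sketch of the protocol-walking idea (our illustration, not JWalk's actual implementation), the following code uses reflection to extract the public API of a compiled class and execute all interleaved method sequences up to a given depth; the restriction to no-argument constructors and methods is a simplifying assumption.

import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class ProtocolWalker {

    // Execute every interleaved sequence of public methods up to maxDepth,
    // re-constructing a fresh object-under-test for each sequence.
    // Simplification: only no-argument constructors/methods are exercised.
    public static void walk(Class<?> cut, int maxDepth) throws Exception {
        List<Method> api = new ArrayList<>();
        for (Method m : cut.getDeclaredMethods()) {
            if (java.lang.reflect.Modifier.isPublic(m.getModifiers())
                    && m.getParameterCount() == 0) {
                api.add(m);
            }
        }
        walkPaths(cut, api, new ArrayList<>(), maxDepth);
    }

    private static void walkPaths(Class<?> cut, List<Method> api,
                                  List<Method> prefix, int depth) throws Exception {
        runSequence(cut, prefix);           // execute the current path
        if (depth == 0) return;
        for (Method m : api) {              // extend by every method, incl. repeats
            prefix.add(m);
            walkPaths(cut, api, prefix, depth - 1);
            prefix.remove(prefix.size() - 1);
        }
    }

    private static void runSequence(Class<?> cut, List<Method> path) throws Exception {
        Constructor<?> con = cut.getConstructor();   // fresh OUT per sequence
        Object target = con.newInstance();
        Object result = null;
        try {
            for (Method m : path) result = m.invoke(target);
            System.out.println(path + " ==> " + result);       // normal outcome
        } catch (Exception e) {
            System.out.println(path + " ==> " + e.getCause()); // exceptional outcome
        }
    }
}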
JWalkEditor: integration of JWalk with a Java editor
Editing features: Java-sensitive; integrated with the JDK compiler tools, tracking exceptions back to the source
Testing features: invokes JWalk (6 validation/testing modes); confirm or reject key results via dialogs; browse test result-sets via tabbed panes

Java-sensitive – keyword highlighting, comment highlighting, literal value highlighting, automatic code indentation. Icon example here; a larger snapshot follows on the next slide.
Integration with JDK tools – invokes Java's own compiler within the same runtime; compiler exceptions are thrown back to the editor tool, so it is possible to track exceptions to points in the original source files.
Integration with JWalk testing – invokes JWalk within the same runtime; this required significant coding effort to get around restrictions on Java ClassLoader objects, which cache loaded CUTs, or fail to recognise that reloaded CUTs have the same type as before. Communication between the editor and JWalk follows Java's event-handling metaphor – the Observer pattern is used to register listeners for testing events.
Confirmation dialogs – not shown here, but they propose single test sequences with the predicted result and 3 buttons: Accept = confirm a valid result; Reject = reject an invalid result; Abort = abandon the rest of the test cycle. Confirmation of a single case typically takes 1-3 seconds.
Test result-sets – list all normally-executing cases, exceptional cases, cases that were manually confirmed/rejected, cases that were automatically passed/failed (from known information), and sequences that were pruned, e.g. because their prefixes had already failed.
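A minimal sketch of the Observer-pattern coupling described above, with invented listener and event names (the talk does not show JWalkEditor's real interfaces, so everything here is an assumption):

import java.util.ArrayList;
import java.util.List;

// Hypothetical listener interface for testing events (names are illustrative).
interface TestListener {
    void onTestResult(String sequence, Object result, boolean passed);
    void onConfirmationNeeded(String sequence, Object predicted);
}

// The test engine keeps registered listeners and notifies them of events,
// mirroring Java's event-handling metaphor described in the notes.
class TestEngine {
    private final List<TestListener> listeners = new ArrayList<>();

    public void addTestListener(TestListener l) { listeners.add(l); }

    protected void fireTestResult(String seq, Object result, boolean passed) {
        for (TestListener l : listeners) l.onTestResult(seq, result, passed);
    }

    protected void fireConfirmationNeeded(String seq, Object predicted) {
        for (TestListener l : listeners) l.onConfirmationNeeded(seq, predicted);
    }
}

In such a design, the editor would register itself (or a dialog controller) as a TestListener, so that confirmation requests pop up dialogs and results populate the tabbed panes.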
Snapshot – Editing a Stack
Full build cycle, multiple sources. Syntax-sensitive text highlights. Set JWalk test parameters.

JWalk's six modes are:
1 – interface inspection: static analysis of the API of the CUT
2 – protocol walking: explore all interleaved method paths in the CUT's protocols (including repeated occurrences of the same method; including observers in the prefix)
3 – algebra walking: explore all mutator-method sequences, followed by a single observation (eliminates observers from the prefix)
4 – state walking: explore all high-level states and transition paths to a given depth; the state cover is discovered by growing mutator-sequences and evaluating whether each reaches a different high-level state (see the later slide)
5 – algebra testing: like algebra walking, but validating all sequences with respect to user confirmation, or a predicted outcome
6 – state testing: like state walking, but validating all sequences with respect to user confirmation, or a predicted outcome

Test prediction – see the later slide. JWalk's test depth = the tested method path length, not counting the initial constructor or the initial state-cover prefix. Depth 0 = state cover; depth 1 = transition (0-switch) cover; depth 2 = 1-switch cover; depth 3 = 2-switch cover; etc.
Snapshot – Testing a Stack
Tabbed pane with test results for the Empty state, for all paths of depth 2. Colour-coded test sequences and test outcomes.

Different tabbed panes hold each result-set, indexed by the states of the OUT and by path depth. Altogether we have: Empty state, paths of length 0, 1, 2 and 3; Default state, paths of length 0, 1, 2 and 3; and Full state, paths of length 0, 1, 2 and 3. The example tabbed pane shows the Empty state, paths of depth 2.
Colour coding of test sequences: blue = constructor expression; orange = method expression; yellow = return value, viz. the test result.
Colour coding of test outcomes: green = normal execution; red = termination with an exception; tick-box = validated correct test outcome; cross-box = validated incorrect test outcome. Note that an exception can be the correct test outcome (as here, for popping an empty Stack).
Dynamic Analysis of States
Memory states
generate all interleaved method paths to depth 1..n
prune sequences ending in observers from the active edges, preserving mutator sequences
distinguish observer/mutator methods by empirical low-level state comparison (extracted by reflection)
High-level states
generate all mutator-method paths (as above)
evaluate all state predicates, eg: isEmpty(), isFull()
seek states corresponding to the product of boolean outcomes, viz: {Default, Empty, Full, Full&Empty}

Introduce this slide as the first of two giving a little more detail about how JWalk works. This slide: how JWalk detects what it thinks are different states.
Memory states – the fine-grained states, corresponding to all possible attribute-value assignments to the OUT (and recursively, if the OUT has sub-objects).
JWalk distinguishes observer and mutator methods empirically; it does not rely on method signatures. It is sensitive enough to determine, from call to call, whether a method has actually modified state (acknowledgement to Arne-Michael Toersel, for the Wallet case-study after TaicPart 2007, which helped us to see this). From this, it is possible to construct sequences known to modify object state (mutator-sequences). Algebra-testing simply appends observations to these.
High-level states are detected using state predicates supplied in the API. Mutator-sequences are grown, then all predicates are evaluated to see whether they return true or false. Each new boolean product generates a new high-level state. Not all tuples in the product may exist if the predicates are not independent (e.g. Full&Empty is potentially expected, but never found, because these predicates are mutually exclusive). JWalk aborts the state search after a given path depth.
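A minimal sketch of the empirical analysis just described, assuming no-argument methods and a flat object (JWalk itself recurses into sub-objects); all names are illustrative, not JWalk's internals:

import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class StateAnalysis {

    // Snapshot the OUT's low-level memory state: all field values,
    // extracted by reflection (flat sketch; no recursion into sub-objects).
    static List<Object> snapshot(Object target) throws Exception {
        List<Object> state = new ArrayList<>();
        for (Field f : target.getClass().getDeclaredFields()) {
            f.setAccessible(true);
            state.add(f.get(target));
        }
        return state;
    }

    // A method is empirically a mutator on this call if the snapshot
    // differs before and after invocation; otherwise it acted as an
    // observer. Assumes m takes no arguments.
    static boolean mutatedState(Object target, Method m) throws Exception {
        List<Object> before = snapshot(target);
        m.invoke(target);
        return !before.equals(snapshot(target));
    }

    // High-level state = the product of boolean outcomes of the API's
    // state predicates, eg: isEmpty(), isFull() ==> "Empty", "Full", ...
    static String highLevelState(Object target, List<Method> predicates) throws Exception {
        StringBuilder name = new StringBuilder();
        for (Method p : predicates) {
            if (Boolean.TRUE.equals(p.invoke(target))) {
                // Strip the "is" prefix to name the state, eg: isEmpty -> Empty
                name.append(p.getName().substring(2));
            }
        }
        return name.length() == 0 ? "Default" : name.toString();
    }
}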
Test Result Prediction
Strong prediction
From known results, guarantee further outcomes in the same equivalence class.
Eg: test sequences containing observers in the prefix map onto a shorter sequence:
target.push(e).size().top() == target.push(e).top()
Weak prediction
From known facts, guess further outcomes; an incorrect guess will be revealed in the next cycle.
Eg: methods with void type usually return no result, but may raise an exception:
target.pop() is predicted to have no result; target.pop().size() == -1 reveals an error

The main effectiveness of JWalk as a testing tool comes from the fact that it can eventually predict the majority of test outcomes, given information it has already acquired from the user. We call this "result prediction". There are two kinds of prediction:
Strong prediction – predictions that are guaranteed to hold, so testing may always assume they are valid. E.g. JWalk empirically determines no-state-change for observers, so it can always eliminate them from the prefix and map such a sequence onto a shorter sequence that was already confirmed (e.g. eliminate size() from the prefix above).
Weak prediction – predictions that typically hold, but which testing assumes could be invalidated by later results. E.g. if a void method returns no result, this is assumed correct. However, a later test result may indicate that the method should have thrown an exception (e.g. pop() called on an empty Stack, only found to be faulty in the next test cycle).
Meaning of "observers in the prefix": a test sequence consists of a chain of method invocations. The prefix is all the methods in the chain except the last. Having observers in the prefix means that access-methods were evaluated for their (non-existent) side-effects, before observing the result of the last method in the sequence.
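A sketch of how strong prediction could be realised as an oracle lookup that eliminates observers from the prefix; the string-keyed oracle and the hard-coded observer set are illustrative assumptions, not JWalk's data structures:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Oracle {

    // Confirmed outcomes, keyed by a canonical sequence such as
    // "push(e).top()" (the representation is illustrative).
    private final Map<String, Object> confirmed = new HashMap<>();

    // Method names empirically determined to be observers (no state change).
    private final Set<String> observers = Set.of("size", "top", "isEmpty");

    // Strong prediction: observers in the prefix cannot change state, so
    // "push(e).size().top()" maps onto the shorter key "push(e).top()",
    // whose confirmed result is guaranteed to carry over.
    public Object predict(List<String> sequence) {
        List<String> canonical = new ArrayList<>();
        for (int i = 0; i < sequence.size(); i++) {
            boolean inPrefix = i < sequence.size() - 1;
            if (inPrefix && observers.contains(methodName(sequence.get(i)))) {
                continue;  // eliminate the observer from the prefix
            }
            canonical.add(sequence.get(i));
        }
        return confirmed.get(String.join(".", canonical));
    }

    private static String methodName(String call) {
        int paren = call.indexOf('(');
        return paren < 0 ? call : call.substring(0, paren);
    }
}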
Feedback-based Methodology
Coding: the programmer prototypes a Java class in the editor
Validation: JWalk systematically explores method paths, providing useful instant feedback to the programmer
Specification: JWalk infers a specification, building a test oracle based on key test results confirmed by the programmer
Testing: JWalk tests the class to bounded exhaustive depths, based on confirmed and predicted test outcomes; JWalk uses state-based test generation algorithms

This is the main new contribution of the JWalkEditor.
Coding – emphasise the difference between this and XP test-driven development. The programmer is free to prototype code in any way, and need not have a fixed set of tests in mind at the start.
Validation – a new aspect supported by JWalkEditor: because the programmer can switch seamlessly between edit mode and test mode, s/he can instantly see the consequences of coding decisions, including unexpected permutations of methods in the API.
Specification – when the programmer is mostly satisfied that the code does what is expected, s/he runs JWalk in one of the test modes to build an oracle, with interactive confirmation of key test results. Definition of a "key test result": a unique observation on a unique mutator-method sequence (excludes predicted void results).
Testing – JWalk uses key test results to predict many more test outcomes. When testing to depth n+1, JWalk reuses all oracles found for depth n. When performing state-based testing, the oracles confirmed during algebraic testing are reused to predict the outcomes of extending the state cover by all transition paths up to the chosen depth. Such paths often have observers in their prefixes.
Test coverage – according to Chow's criteria: state cover, transition cover, 1-switch cover, 2-switch cover, etc.
Example – Library Book
public class LibraryBook {
    private String borrower;
    public LibraryBook();
    public void issue(String);
    public void discharge();
    public String getBorrower();
    public Boolean isOnLoan();
}
Validation surprise: target.issue("a").issue("b").getBorrower() == "b" violates the business rules: fix the code to raise an exception.
Testing: all observations on chains of issue(), discharge(); n-switch cover on the states {Default, OnLoan}.

Note: the code stub just shows the API of the CUT; the full code was written to implement each operation.
Validation – explores sequences of length 0, 1, 2, which seem to behave as expected. At length 3, you start to see interesting observations that perhaps you did not expect. Here, the programmer found that he could issue a book twice to different borrowers (violating a business rule of the Library), replacing the original borrower with a new one. As a consequence, he went back and fixed the code to prevent this from being legal.
Algebraic testing – JWalk determines that the mutator methods are issue() and discharge(), and generates mutator sequences consisting only of a constructor followed by these methods.
State testing – JWalk finds only one state predicate, isOnLoan(), so it discovers two states (for the false and true outcomes), which it automatically names {Default, OnLoan}. JWalk then tests all single transitions, all pairs and all triples of transitions starting in each of these states.
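The slide shows only the API stub; a plausible reconstruction of the full class, including the fix that raises an exception on a double issue (our guess at the implementation, not the authors' exact code):

public class LibraryBook {
    private String borrower;

    public LibraryBook() { borrower = null; }

    // Fixed after the validation surprise: issuing an already-issued
    // book now raises an exception instead of replacing the borrower.
    public void issue(String reader) {
        if (borrower != null)
            throw new IllegalStateException("Book is already on loan");
        borrower = reader;
    }

    public void discharge() { borrower = null; }

    public String getBorrower() { return borrower; }

    public Boolean isOnLoan() { return borrower != null; }
}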
Extension – Reservable Book
public class ReservableBook extends LibraryBook {
    private String requester;
    public ReservableBook();
    public void reserve(String);
    public void cancel();
    public String getRequester();
    public Boolean isReserved();
}
Validation surprise: target.reserve("a").issue("b").getBorrower() == "b" violates the business rules: override issue() to refuse "b" here.
Testing: all observations on chains of issue(), discharge(), reserve(), cancel(); n-switch cover on the states {Default, OnLoan, Reserved, Reserved&OnLoan}.

The idea of this example is to develop an extension to an existing class, in order to explore the advantages, if any, of reusing specifications or tests developed for the base class.
During validation, explore sequences of length 0, 1, 2 as before – and encounter cases like target.reserve("a").reserve("b") that need deciding (do we support one reservation, or more?), and nullops like target.cancel().cancel().
During testing, JWalk determines that the methods issue(), discharge(), reserve() and cancel() are the four mutators, and builds chains interleaving all of these in all permutations. JWalk presents all new (previously unseen) permutations to the programmer. This is a significant help, as it automates the selection of difficult test cases in a systematic way.
JWalk reuses the oracle for LibraryBook. If previously-presented sequences still generate the predicted results, they are not presented again; otherwise they are "novel" and must be confirmed again. So JWalk intelligently reuses old tests (emphasise how this is much smarter than simply reusing saved tests unconditionally).
Note how four high-level states are detected, since the predicates isOnLoan() and isReserved() are orthogonal, so the whole boolean product is possible.
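Again, a plausible reconstruction of the extension, building on the LibraryBook sketch above (and assuming a later reservation replaces an earlier one, which the notes leave as an open design decision):

public class ReservableBook extends LibraryBook {
    private String requester;

    public ReservableBook() { requester = null; }

    public void reserve(String reader) {
        requester = reader;   // assumption: a later reservation replaces an earlier one
    }

    public void cancel() { requester = null; }

    // Overridden after the validation surprise: refuse to issue a
    // reserved book to anyone other than the requester.
    @Override
    public void issue(String reader) {
        if (requester != null && !requester.equals(reader))
            throw new IllegalStateException("Book is reserved for " + requester);
        super.issue(reader);
    }

    public String getRequester() { return requester; }

    public Boolean isReserved() { return requester != null; }
}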
Evaluation
User acceptance: programmers find JWalk habitable; they can concentrate on the creative aspects (coding) while JWalk handles the systematic aspects (validation, testing).
Cost of confirmations: not so burdensome, since it is amortized over many test cycles; metric: measure amortized confirmations per test cycle.
Comparison with JUnit: propose a common testing objective for manual and lazy systematic testing; evaluate coverage and testing effort. Eclipse+JUnit vs. JWalkEditor, given the task of testing the "transition cover + all equivalence partitions of inputs".

Obvious point: just being able to automate the execution and re-execution of saved tests is not such a great thing. The manual paradigm with JUnit puts all the burden on the programmer to think up the right tests. With JWalk, the tool takes on the job of determining the right tests, while the programmer concentrates on what they like to do best, which is the coding, the creative side.
Cost of confirmations – some types require many tens of confirmations. But this is not so burdensome, (a) because they are amortized over test cycles of increasing length, and (b) because even the cumulative total of confirmations is eventually a small fraction of the test-set. Also note that it takes 1-3 seconds for a trained user to confirm a result (that is more than 25 unique tests created per minute!).
Comparison with JUnit – the expectation was that JWalk would be much faster and much better at state coverage, but perhaps not complete on argument equivalence partitions (JWalk doesn't handle those yet).
Amortized Interaction Costs
Test class   a1   a2   a3   s1   s2    s3
LibBk con     3    5    7
LibBk pre     2    8   18   38   133
ResBk con         14   56   11    83
ResBk pre     6   27   89   36   241  1649

Number of new confirmations, amortized over 6 test cycles.
con = manual confirmations, > 25 test cases/minute
pre = JWalk's predictions, eventually > 90% of test cases

Explanation of the columns and rows: a1-a3 = algebra testing at depths 1-3; s1-s3 = state testing at depths 1-3; con = manual confirmations; pre = predicted outcomes; LibBk = LibraryBook; ResBk = ReservableBook. Add con+pre together to get the total number of test cases for the given test mode and depth.
Note how ResBk does not double the cases of LibBk, because the old oracle is reused. ResBk confirmations rise once you start to consider new interleavings.
Amortized cost means that for each increasing depth n, you predict all the results for depths 0..n-1. But even if you add up the total number of new confirmations on each con-row, you get, eg: for LibBk, 20 total confirmations out of 138 test cases; for ResBk, 167 total confirmations out of 1732 test cases! It might take 6 minutes in total to build this oracle, for a trained user. How many unique tests is that?!
Eg: state-test to depth 2, 241 predicted results; algebra-test to depth 2, 14 new confirmations.
Comparison with JUnit: the manual testing method
Manual test creation takes skill, time and effort (eg: ~20 min to develop the manual cases for ReservableBook).
The programmer missed certain corner-cases, eg: target.discharge().discharge() – a nullop?
The programmer redundantly tested some properties, eg: assertTrue(target != null), multiple times.
The state coverage for LibraryBook was incomplete, due to the programmer missing hard-to-see cases.
The saved tests were not reusable for ReservableBook, for which all-new tests were written to test the new interleavings.

Manual testing.
Good points: tests can be arbitrarily complex, so it is easier to write manual tests for complicated code that a tool cannot work out how to exercise automatically.
Bad points: for most classes, it is the simple combinations of methods that programmers don't expect that cause the problems. Eg: should a double discharge be a nullop? Should it raise an exception? The programmer often does not think of all the ways their API might be abused, because their mental focus is on the correct use of the code.
About test reuse – on the one hand, JUnit supports regression testing using saved test-sets. Clearly, this doesn't exercise new functionality in the subclass, but it is expected to validate the functionality inherited from the superclass. We have proved that this is a fallacy (see the notes to the Motivation slide). On the other hand, JUnit expects tests to be packaged in ways that do not foster the reuse and extension of old test code. In particular, new constructors mean that the tests are typically rewritten (though perhaps you can use search-and-replace?).
New interleavings – this is where manual testing really falls down. It is beyond human competence to think up all the interleavings.
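For concreteness, the kind of manual JUnit 4 test the comparison describes might look like this (an illustrative reconstruction against the LibraryBook sketch above, not the actual test code from the study); note the repeated not-null asserts and the absent double-discharge case:

import static org.junit.Assert.*;
import org.junit.Test;

public class LibraryBookTest {

    // Typical manual testing, one method at a time: redundant not-null
    // assertions are repeated across test methods, and the
    // double-discharge corner case mentioned above is never tested.
    @Test
    public void testIssue() {
        LibraryBook target = new LibraryBook();
        assertTrue(target != null);            // redundant
        target.issue("alice");
        assertEquals("alice", target.getBorrower());
        assertTrue(target.isOnLoan());
    }

    @Test
    public void testDischarge() {
        LibraryBook target = new LibraryBook();
        assertTrue(target != null);            // redundant, again
        target.issue("alice");
        target.discharge();
        assertFalse(target.isOnLoan());
    }
}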
Advantages of JWalk: lazy systematic testing
JWalk automates test case selection, relieving the programmer of the burden of thinking up the right test cases!
Each test case is guaranteed to test a unique property.
Interactive test result confirmation is very fast (eg: ~80 sec in total for 36 unique test cases in ReservableBook).
All states and transitions are covered, including nullops, to the chosen depth.
The test oracle created for LibraryBook formed the basis for the new oracle for ReservableBook, but…
JWalk presented only those sequences involving new methods, and all interleavings with inherited methods.

Obvious point: JWalk automates the activity that programmers find hardest, namely thinking up all the unique test cases.
The timing of ~80 sec relates to the "transition cover + …" objective, i.e. we only required testing to depth 1. The manual tester generated far more test cases than this.
When testing ReservableBook, JWalk reuses the oracle for LibraryBook. If previously-presented sequences still generate the predicted results, they are not presented again; otherwise they are "novel" and must be confirmed again. So JWalk intelligently reuses old tests (emphasise how this is much smarter than simply reusing saved tests unconditionally).
Speed and Adequacy of Testing
Test class     T    TE   TR   Adeq   time (min.sec)
LibBk manual   31    9   22   90%    11.00
ResBk manual  104   21   83   53%    20.00
LibBk jwalk    10    –    –   100%    0.30
ResBk jwalk    36    –    –    –      0.46

Test goal: transition cover + equivalence partitions of inputs.
Manual testing is expensive, redundant and incomplete; JWalk testing is very efficient and close to complete. Eg: JWalk achieved 100% test coverage; the manual tester wrote 104 tests, of which 21 were effective and 83 were not!

Meaning of the rows and columns:
T = actual number of tests created. For JWalk, this meant a unique test sequence; for JUnit, one atomic assertTrue() etc. statement within the body of a testMethodX().
TE = the number of effective tests. This was judged later by carefully enumerating all the tests in the "transition cover + input equivalence partitions" by hand, and then determining which of these had been covered by each of the competing approaches.
TR = the number of redundant tests, = T - TE. The large number of redundant manual tests was due to generating more than the transition cover (~1-switch cover), but also because redundant asserts were included in each testMethodX(). This disadvantage is caused by the mental focus on testing one method at a time in JUnit.
Adeq = test adequacy, the % of ideal test cases covered by the actual test-set; it does not penalise redundant tests (eg: 21 effective tests against an ideal set of about 40 gives 53%). So manual testing only managed to test about half the unique behaviours of ReservableBook. JWalk only scores 90% for the last test, because it missed four cases of arguments that fall into different equivalence partitions; otherwise its state coverage is perfect.
Timings – JWalk is between 1-2 orders of magnitude faster!
Conclusion
Performance of JWalk testing: clearly outperformed manual testing; coverage based on all states and transitions; input equivalence partitions are not yet handled.
Performance of JWalkEditor: an unexpected gain was the automatic validation of prototype code, cf. Alloy's model checking from a partial specification.
Moral for testing: just automatically executing saved tests is not so great; systematic test generation tools are needed to get coverage; automate the parts that humans get wrong!

I think the conclusions are pretty obvious! We have done some experimental work on equivalence partitions – at the moment, too many results are presented to the user for confirmation to make this habitable.
Highlight the feedback-based development possible with JWalkEditor, where the programmer validates the code as they create it. This is very much like Daniel Jackson's Alloy tool, where you specify a system in a Z-like notation and then simulate the specification to find counterexamples to expected properties. (The full paper has a reference to Alloy.)
Highlight the moral for testing! (Preachers would say: thump the lectern at this point…)
Any Questions? http://www.dcs.shef.ac.uk/~ajhs/jwalk/
Acknowledgement to Arne-Michael Toersel for devising the Wallet test-class after last year's TaicPart (2007). This allowed us to develop the more robust empirical detection of state mutation on a call-by-call basis.
Let the conference know that they can download JWalk from the URL here. They may integrate the modular version of the JWalk testing tool with the editor of their choice. JWalkEditor is not yet on public release.
Chris – thanks to you also for presenting this. I hope that I've given enough extra information in these notes to allow you to answer any questions they may throw at you! Of course, I'll be happy to respond by email. If you have any questions on the slides or notes, please call me on my mobile (I will be in Austria); I think the number works no matter which country I'm in (i.e. without the country code prefix).
Bibliography
A. J. H. Simons, JWalk: a tool for lazy systematic testing of Java classes by introspection and user interaction, Automated Software Engineering, 14 (4), December, ed. B. Nuseibeh (Springer, USA, 2007). Final draft version also deposited with White Rose Research Online.
A. J. H. Simons and C. D. Thomson, Lazy systematic unit testing: JWalk versus JUnit, Proc. 2nd Testing in Academia and Industry Conference – Practice and Research Techniques, September, eds. P. McMinn and M. Harman (Cumberland Lodge, Windsor Great Park: IEEE, 2007), 138. See also the A1 poster presenting this result.
A. J. H. Simons and C. D. Thomson, Benchmarking effectiveness for object-oriented unit testing, Proc. 1st Software Testing Benchmark Workshop, 9-11 April, eds. M. Roper and W. M. L. Holcombe (Lillehammer: ICST/IEEE, 2008).
A. J. H. Simons, N. Griffiths and C. D. Thomson, Feedback-based specification, coding and testing with JWalk, Proc. 3rd Testing in Academia and Industry Conference – Practice and Research Techniques, August, eds. L. Bottacci, G. M. Kapfhammer and M. Roper (Cumberland Lodge, Windsor Great Park: IEEE, 2008), to appear.
A. J. H. Simons, A theory of regression testing for behaviourally compatible object types, rev. and ext., Software Testing, Verification and Reliability, 16 (3), UKTest 2005 Special Issue, September, eds. M. Woodward, P. McMinn, M. Holcombe and R. Hierons (London: John Wiley, 2006).
A. J. H. Simons, Testing with guarantees and the failure of regression testing in eXtreme Programming, Proc. 6th Int. Conf. on eXtreme Programming and Agile Processes in Software Engineering (XP 2005), eds. H. Baumeister et al., Lecture Notes in Computer Science, 3556 (Berlin: Springer Verlag, 2005).
Wikipedia entry for JWalk
Wikipedia entry for Lazy Systematic Unit Testing
I don't expect them to be able to read this; it is just for reference.