
1 1 Revisiting Difficult Constraints if (hash(x) == hash(y)) {... } How do we cover this code? Suppose we’re running (DART, SAGE, SMART, CUTE, SPLAT, etc.) – we get here, but hash(x) != hash(y). Can we solve for hash(x) == hash(y) ? Concrete values won’t help us much – we still have to solve for hash(x) == C1 or for hash(y) == C2... Any ideas?

2 2 Today A brief “digression” on causality and philosophy (of science) Fault localization & error explanation Renieris & Reiss: Nearest Neighbors Jones & Harrold: Tarantula How to evaluate a fault localization PDGs (+ BFS or ranking) Solving for a nearest run (not really testing)

3 3 Causality When a test case fails we start debugging We assume that the fault (what we’re really after) causes the failure Remember RIP (Reachability, Infection, Propagation)? What do we mean when we say that “A causes B”?

4 4 Causality We don’t know Though it is central to everyday life – and to the aims of science A real understanding of causality eludes us to this day Still no uncontroversial way to answer the question “does A cause B?”

5 5 Causality Philosophy of causality is a fairly active area, back to Aristotle, and (more modern approaches) Hume General agreement that a cause is something that “makes a difference” – if the cause had not been, then the effect wouldn’t have been One theory that is rather popular with computer scientists is David Lewis’ counterfactual approach Probably because it (and probabilistic or statistical approaches) are amenable to mathematical treatment and automation

6 6 Causality (According to Lewis) For Lewis (roughly – I’m conflating his counterfactual dependency and causal dependency) A causes B (in world w) iff In all possible worlds that are maximally similar to w, and in which A does not take place, B also does not take place

7 7 Causality (According to Lewis) Causality does not depend on B being impossible without A Seems reasonable: we don’t, when asking “Was Larry slipping on the banana peel causally dependent on Curly dropping it?” consider worlds in which new circumstances (Moe dropping a banana peel) are introduced

8 8 Causality (According to Lewis) Many objections to Lewis in the literature e.g. that a cause must precede its effect in time seems not to be required by his approach One is not a problem for our purposes Distance metrics (how similar is world w to world w’?) are problematic for “worlds” Counterfactuals are tricky Not a problem for program executions May be details to handle, but no one has in-principle objections to asking how similar two program executions are Or philosophical problems with multiple executions (no run is “privileged by actuality”)

9 9 Causality (According to Lewis) Did A cause B in this program execution? [Diagram: compare d, the distance to the nearest execution in which neither A nor B occurs, with d’, the distance to the nearest execution in which A does not occur but B still does.] Yes, if d < d’. No, if d > d’.

10 10 Formally A predicate e is causally dependent on a predicate c in an execution a iff: 1. c(a) ∧ e(a) 2. ∃b. (¬c(b) ∧ ¬e(b) ∧ (∀b’. ((¬c(b’) ∧ e(b’)) → (d(a, b) < d(a, b’)))))

11 11 What does this have to do with automated debugging?? A fault is an incorrect part of a program In a failing test case, some fault is reached and executes Causing the state of the program to be corrupted (error) This incorrect state is propagated through the program (propagation is a series of “A causes B”s) Finally, bad state is observable as a failure – caused by the fault

12 12 Fault Localization Fault localization, then, is: An effort to automatically find (one of the) causes of an observable failure It is inherently difficult because there are many causes of the failure that are not the fault We don’t mind seeing the chain of cause and effect reaching back to the fault But the fact that we reached the fault at all is also a cause!

13 13 Enough! Ok, let’s get back to testing and some methods for localizing faults from test cases But – keep in mind that when we localize a fault, we’re really trying to automate finding causal relationships The fault is a cause of the failure

14 14 Lewis and Fault Localization Causality: Generally agreed that explanation is about causality. [Ball,Naik,Rajamani],[Zeller],[Groce,Visser],[Sosa,Tooley],[Lewis],etc. Similarity: Also often assumed that successful executions that are similar to a failing run can help explain an error. [Zeller],[Renieris,Reiss][Groce,Visser],etc. This work was not based on Lewis’ approach – it seems that this point about similarity is just an intuitive understanding most people (or at least computer scientists) share

15 15 Distance and Similarity We already saw this idea at play in one version of Zeller’s delta-debugging Trying to find the one change needed to take a successful run and make it fail Most similar thread schedule that doesn’t cause a failure, etc. Renieris and Reiss based a general fault localization technique on this idea – measuring distances between executions To localize a fault, compare the failing trace with its nearest neighbor according to some distance metric

16 16 Renieris and Reiss’ Localization Basic idea (over-simplified) We have lots of test cases Some fail A much larger number pass Pick a failure Find most similar successful test case Report differences as our fault localization “nearest neighbor”

17 17 Renieris and Reiss’ Localization Collect spectra of executions, rather than the full executions For example, just count the number of times each source statement executed Previous work on using spectra for localization basically amounted to set difference/union – for example, find features unique to (or lacking in) the failing run(s) Problem: many failing runs have no such features – many successful test cases have R (and maybe I) but not P! Otherwise, localization would be very easy

18 18 Renieris and Reiss’ Localization Some obvious and not so obvious points to think about Technique makes intuitive sense But what if there are no successful runs that are very similar? Random testing might produce runs that all differ in various accidental ways Is this approach over-dependent on test suite quality?

19 19 Renieris and Reiss’ Localization Some obvious and not so obvious points to think about What if we minimize the failing run using delta-debugging? Now lots of differences with original successful runs just due to length! We could produce a very similar run by using delta-debugging to get a 1-change run that succeeds (there will actually be many of these) Can still use Renieris and Reiss’ approach – because delta-debugging works over the inputs, not the program behavior, spectra for these runs will be more or less similar to the failing test case

20 20 Renieris and Reiss’ Localization Many details (see the paper): Choice of spectra Choice of distance metric How to handle equal spectra for failing/passing tests? Basic idea is nonetheless straightforward
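A minimal sketch of the basic idea, assuming binary statement-coverage spectra and Hamming distance (illustrative only; the paper evaluates several spectra and distance metrics, and this is not their implementation):

    /* Sketch of nearest-neighbor localization over binary coverage spectra,
       using Hamming distance; assumes at least one passing test. */
    #include <stdio.h>

    #define NUM_STMTS 13

    static int hamming(const int *a, const int *b) {
        int d = 0;
        for (int i = 0; i < NUM_STMTS; i++)
            if (a[i] != b[i]) d++;
        return d;
    }

    /* Report statements covered by the failing run but not by its nearest
       passing neighbor. */
    void nearest_neighbor_report(const int fail_spec[NUM_STMTS],
                                 int pass_specs[][NUM_STMTS], int num_pass) {
        int best = 0;
        for (int p = 1; p < num_pass; p++)
            if (hamming(fail_spec, pass_specs[p]) <
                hamming(fail_spec, pass_specs[best]))
                best = p;
        printf("report:");
        for (int i = 0; i < NUM_STMTS; i++)
            if (fail_spec[i] && !pass_specs[best][i])
                printf(" statement %d", i + 1);
        printf("\n");
    }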

21 21 The Tarantula Approach Jones, Harrold (and Stasko): Tarantula Not based on distance metrics or a Lewis-like assumption A “statistical” approach to fault localization Originally conceived of as a visualization approach: produces a picture of all source in program, colored according to how “suspicious” it is Green: not likely to be faulty Yellow: hrm, a little suspicious Red: very suspicious, likely fault

22 22 The Tarantula Approach

23 23 The Tarantula Approach How do we score a statement in this approach? (where do all those colors come from?) Again, assume we have a large set of tests, some passing, some failing “Coverage entity” e (e.g., statement) failed(e) = # tests covering e that fail passed(e) = # tests covering e that pass totalfailed, totalpassed = what you’d expect (the total number of failing and passing tests)

24 24 The Tarantula Approach How do we score a statement in this approach? (where do all those colors come from?)
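For reference, the standard Tarantula suspiciousness formula over these counts (it reproduces the numbers on the worked example slide below) is

\[
\mathrm{suspiciousness}(e) = \frac{failed(e)/totalfailed}{\;passed(e)/totalpassed \;+\; failed(e)/totalfailed\;}
\]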

25 25 The Tarantula Approach Not very suspicious: appears in almost every passing test and almost every failing test Highly suspicious: appears much more frequently in failing than passing tests
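A minimal C sketch of that scoring; the arrays cover[t][e] and failed_test[t] are hypothetical inputs for illustration, not part of the Tarantula tool itself:

    /* Sketch of Tarantula scoring for one coverage entity e.
       cover[t][e] = 1 if test t executed entity e;
       failed_test[t] = 1 if test t failed. */
    #define NUM_TESTS 6
    #define NUM_ENTITIES 13

    double suspiciousness(int cover[NUM_TESTS][NUM_ENTITIES],
                          int failed_test[NUM_TESTS], int e) {
        int failed = 0, passed = 0, totalfailed = 0, totalpassed = 0;
        for (int t = 0; t < NUM_TESTS; t++) {
            if (failed_test[t]) totalfailed++; else totalpassed++;
            if (cover[t][e]) {
                if (failed_test[t]) failed++; else passed++;
            }
        }
        double f = totalfailed ? (double)failed / totalfailed : 0.0;
        double p = totalpassed ? (double)passed / totalpassed : 0.0;
        return (f + p == 0.0) ? 0.0 : f / (f + p);  /* 0 if e never executed */
    }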

26 26 The Tarantula Approach Simple program to compute the middle of three inputs, with a fault. mid() int x, y, z, m; 1 read (x, y, z); 2 m = z; 3 if (y < z) 4 if (x < y) 5 m = y; 6 else if (x < z) 7 m = y; 8 else 9 if (x > y) 10 m = y; 11 else if (x > z) 12 m = x; 13 print (m); (the fault is at line 7: m = y should be m = x)

27 27 The Tarantula Approach mid() int x, y, z, m; 1 read (x, y, z); 2 m = z; 3 if (y < z) 4 if (x < y) 5 m = y; 6 else if (x < z) 7 m = y; 8 else 9 if (x > y) 10 m = y; 11 else if (x > z) 12 m = x; 13 print (m); Run some tests... (3,3,5) (1,2,3) (3,2,1) (5,5,5) (5,3,4) (2,1,3) Look at whether they pass or fail Look at coverage of entities Compute suspiciousness using the formula: lines 1–3: 0.5, line 4: 0.63, line 5: 0.0, line 6: 0.71, line 7: 0.83, lines 8–12: 0.0, line 13: 0.5 Fault is indeed most suspicious!
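Checking one of those numbers by hand (assuming the usual version of this example, in which the fault at line 7 should read m = x and the failing test is (2,1,3)): line 7 is executed by that failing test and by exactly one passing test, (3,3,5), so

    failed(7)/totalfailed = 1/1 = 1.0
    passed(7)/totalpassed = 1/5 = 0.2
    suspiciousness(7) = 1.0 / (1.0 + 0.2) ≈ 0.83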

28 28 The Tarantula Approach Obvious benefits: No problem if the fault is reached in some successful test cases Doesn’t depend on having any successful tests that are similar to the failing test(s) Provides a ranking of every statement, instead of just a set of nodes – directions on where to look next Numerical, even – how much more suspicious is X than Y? The pretty visualization may be quite helpful in seeing relationships between suspicious statements Is it less sensitive to accidental features of random tests, and to test suite quality in general? What about minimized failing tests here?

29 29 Tarantula vs. Nearest Neighbor Which approach is better? Once upon a time: Fault localization papers gave a few anecdotes of their technique working well, showed it working better than another approach on some example, and called it a day We’d like something more quantitative (how much better is this technique than that one?) and much less subjective!

30 30 Evaluating Fault Localization Approaches Fault localization tools produce reports We can reduce a report to a set (or ranking) of program locations Let’s say we have three localization tools which produce A big report that includes the fault A much smaller report, but the actual fault is not part of it Another small report, also not containing the fault Which of these is the “best” fault localization?

31 31 Evaluating a Fault Localization Report Idea (credit to Renieris and Reiss): Imagine an “ideal” debugger, the perfect programmer Starts reading the report Expands outwards from nodes (program locations) in the report to associated nodes, adding those at each step If a variable use is in the report, looks at the places it might be assigned If code is in the report, looks at the condition of any ifs guarding that code In general, follows program (causal) dependencies As soon as a fault is reached, recognizes it!

32 32 Evaluating a Fault Localization Report Score the reports according to How much code the ideal debugger would read, starting from the report Empty report: score = 0 Every line in the program: score = 0 Big report, containing the bug? mediocre score (0.4) Small report, far from the bug? bad score (0.2) Small report, “near” the bug? good score (0.8) Report is the fault: great score (0.9)

33 33 Evaluating a Fault Localization Report Breadth-first search of Program Dependency Graph (PDG) starting from fault localization: Terminate the search when a real fault is found Score is proportion of the PDG that is not explored during the breadth-first search Score near 1.00 = report includes only faults
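A minimal sketch of this scoring over a PDG given as an adjacency matrix; the data structures and the function name are illustrative, not Renieris and Reiss’ implementation:

    /* Sketch: score = fraction of PDG nodes NOT reached by a breadth-first
       search that starts from the reported nodes and stops at the end of the
       first layer containing a real fault. Assumes n <= MAX_NODES. */
    #include <string.h>

    #define MAX_NODES 64

    double localization_score(int n, int adj[MAX_NODES][MAX_NODES],
                              const int *report, int report_len,
                              const int *is_fault) {
        if (report_len == 0)
            return 0.0;                     /* empty report scores 0 by convention */
        int visited[MAX_NODES] = {0}, frontier[MAX_NODES], flen = 0;
        int covered = 0, found = 0;
        for (int i = 0; i < report_len; i++) {   /* layer 0: the report itself */
            if (!visited[report[i]]) {
                visited[report[i]] = 1;
                frontier[flen++] = report[i];
                covered++;
            }
            if (is_fault[report[i]]) found = 1;
        }
        while (!found && flen > 0) {             /* expand one BFS layer */
            int next[MAX_NODES], nlen = 0;
            for (int i = 0; i < flen; i++)
                for (int v = 0; v < n; v++)
                    if (adj[frontier[i]][v] && !visited[v]) {
                        visited[v] = 1;
                        next[nlen++] = v;
                        covered++;
                        if (is_fault[v]) found = 1;
                    }
            memcpy(frontier, next, nlen * sizeof(int));
            flen = nlen;
        }
        return (double)(n - covered) / n;        /* uncovered fraction = score */
    }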

34 34 Details of Evaluation Method PDG 12 total nodes in PDG

35 35 Details of Evaluation Method PDG 12 total nodes in PDG Fault Report

36 36 Details of Evaluation Method PDG 12 total nodes in PDG Fault Report + 1 Layer BFS

37 37 Details of Evaluation Method PDG 12 total nodes in PDG Fault Report + 1 Layer BFS STOP: Real fault discovered

38 38 Details of Evaluation Method PDG 12 total nodes in PDG 8 of 12 nodes not covered by BFS: score = 8/12 ~= 0.67. Fault Report + 1 Layer BFS STOP: Real fault discovered

39 39 Details of Evaluation Method PDG 12 total nodes in PDG Fault Report

40 40 Details of Evaluation Method PDG 12 total nodes in PDG Fault Report + 1 layer BFS

41 41 Details of Evaluation Method PDG 12 total nodes in PDG Fault Report + 2 layers BFS

42 42 Details of Evaluation Method PDG 12 total nodes in PDG Fault Report + 3 layers BFS

43 43 Details of Evaluation Method PDG 12 total nodes in PDG Fault Report + 4 layers BFS STOP: Real fault discovered

44 44 Details of Evaluation Method PDG 12 total nodes in PDG 0 of 12 nodes not covered by BFS: score = 0/12 ~= 0.00. Fault Report + 4 layers BFS

45 45 Details of Evaluation Method PDG Fault = Report 12 total nodes in PDG 11 of 12 nodes not covered by BFS: score = 11/12 ~= 0.92.

46 46 Evaluating a Fault Localization Report Caveats: Isn’t a misleading report (a small number of nodes, far from the bug) actually much worse than an empty report? “I don’t know” vs. “Oh, yeah man, you left your keys in the living room somewhere” (when in fact your keys are in a field in Nebraska) Nobody really searches a PDG like that! Not backed up by user studies to show high scores correlate to users finding the fault quickly from the report

47 47 Evaluating a Fault Localization Report Still, the Renieris/Reiss scoring has been widely adopted by the testing community and some model checking folks Best thing we’ve got, for now

48 48 Evaluating Fault Localization Approaches So, how do the techniques stack up? Tarantula seems to be the best of the test suite based techniques Next best is the Cause Transitions approach of Cleve and Zeller (see their paper), but it sometimes uses programmer knowledge Two different Nearest-Neighbor approaches are next best Set-intersection and set-union are worst For details, see the Tarantula paper

49 49 Evaluating Fault Localization Approaches Tarantula got scores at the 0.99-or-better level 3 times more often than the next best technique Trend continued at every ranking – Tarantula was always the best approach Also appeared to be efficient: Much faster than Cause-Transitions approach of Cleve and Zeller Probably about the same as the Nearest Neighbor and set-union/intersection methods

50 50 Evaluating Fault Localization Approaches Caveats: Evaluation is over the Siemens suite (again!) But Tarantula has done well on larger programs Tarantula and Nearest Neighbor might both benefit from larger test suites produced by random testing Siemens is not that many tests, done by hand

51 51 Another Way to Do It Question: How good would the Nearest Neighbors method be if our test suite contained all possible executions (the universe of tests)? We suspect it would do much better, right? But of course, that’s ridiculous – we can’t check for distance to every possible successful test case! Unless our program can be model checked Leads us into next week’s topic, in a roundabout way: testing via model checking

52 52 Explanation with Distance Metrics Algorithm (very high level): 1. Find a counterexample trace (model checking term for “failing test case”) 2. Encode search for maximally similar successful execution under a distance metric d as an optimization problem 3. Report the differences (Δs) as an explanation (and a localization) of the error

53 53 Implementation #1 CBMC Bounded Model Checker for ANSI-C programs: Input: C program + loop bounds Checks for various properties: assert statements Array bounds and pointer safety Arithmetic overflow Verifies within given loop bounds Provides counterexample if property does not hold Now provides error explanation and fault localization.

54 54 14: assert (a < 4); 5: b = 4 6: c = -4 7: a = 2 8: a = 1 9: a = 6 10: a = 4 11: c = 9 12: c = 10 13: a = 10 4: a = 5 Given a counterexample,

55 55 14: assert (a < 4); 5: b = 4 6: c = -4 7: a = 2 8: a = 1 9: a = 6 10: a = 4 11: c = 9 12: c = 10 13: a = 10 4: a = 5 produce a successful execution that is as similar as possible (under a distance metric)

56 56 14: assert (a < 4); 5: b = 4 6: c = -4 7: a = 2 8: a = 1 9: a = 6 10: a = 4 11: c = 9 12: c = 10 13: a = 10 4: a = 5 14: assert (a < 4); 5: b = -3 6: c = -4 7: a = 2 8: a = 1 9: a = 6 10: a = 4 11: c = 9 12: c = 3 13: a = 3 4: a = 5 produce a successful execution that is as similar as possible (under a distance metric)

57 57 14: assert (a < 4); 5: b = 4 6: c = -4 7: a = 2 8: a = 1 9: a = 6 10: a = 4 11: c = 9 12: c = 10 13: a = 10 4: a = 5 14: assert (a < 4); 5: b = -3 6: c = -4 7: a = 2 8: a = 1 9: a = 6 10: a = 4 11: c = 9 12: c = 3 13: a = 3 4: a = 5 and examine the necessary differences:

58 58 14: assert (a < 4); 5: b = 4 6: c = -4 7: a = 2 8: a = 1 9: a = 6 10: a = 4 11: c = 9 12: c = 10 13: a = 10 4: a = 5 14: assert (a < 4); 5: b = -3 6: c = -4 7: a = 2 8: a = 1 9: a = 6 10: a = 4 11: c = 9 12: c = 3 13: a = 3 4: a = 5 and examine the necessary differences: the Δs

59 59 14: assert (a < 4); 5: b = 4 6: c = -4 7: a = 2 8: a = 1 9: a = 6 10: a = 4 11: c = 9 12: c = 10 13: a = 10 4: a = 5 14: assert (a < 4); 5: b = -3 6: c = -4 7: a = 2 8: a = 1 9: a = 6 10: a = 4 11: c = 9 12: c = 3 13: a = 3 4: a = 5 and examine the necessary differences: these are the causes

60 60 14: assert (a < 4); 5: b = 4 6: c = -4 7: a = 2 8: a = 1 9: a = 6 10: a = 4 11: c = 9 12: c = 10 13: a = 10 4: a = 5 14: assert (a < 4); 5: b = -3 6: c = -4 7: a = 2 8: a = 1 9: a = 6 10: a = 4 11: c = 9 12: c = 3 13: a = 3 4: a = 5 and the localization – lines 5, 12, and 13 are likely bug locations.

61 61 Explanation with Distance Metrics How it’s done: Model checker P+spec First, the program (P) and specification (spec) are sent to the model checker.

62 62 Explanation with Distance Metrics How it’s done: Model checker P+spec C The model checker finds a counterexample, C.

63 63 Explanation with Distance Metrics How it’s done: Model checker BMC/constraint generator P+spec C The explanation tool uses P, spec, and C to generate (via Bounded Model Checking) a formula with solutions that are executions of P that are not counterexamples

64 64 Explanation with Distance Metrics How it’s done: Model checker BMC/constraint generator P+spec C S Constraints are added to this formula for an optimization problem: find a solution that is as similar to C as possible, by the distance metric d. The formula + optimization problem is S

65 65 Explanation with Distance Metrics How it’s done: Model checker BMC/constraint generator P+spec C Optimization tool S -C An optimization tool (PBS, the Pseudo-Boolean Solver) finds a solution to S: an execution of P that is not a counterexample, and is as similar as possible to C: call this execution -C

66 66 Explanation with Distance Metrics How it’s done: Model checker BMC/constraint generator P+spec C Optimization tool S -C C Δs Report the differences (Δs) between C and -C to the user: explanation and fault localization

67 67 Explanation with Distance Metrics The metric d is based on Static Single Assignment (SSA) (plus loop unrolling) A variation on SSA, to be precise CBMC model checker (bounded model checker for C programs) translates an ANSI C program into a set of equations An execution of the program is just a solution to this set of equations

68 68 “SSA” Transformation int main () { int x, y; int z = y; if (x > 0) y--; else y++; z++; assert (y == z); } int main () { int x0, y0; int z0 = y0; y1 = y0 - 1; y2 = y0 + 1; guard1 = x0 > 0; y3 = guard1?y1:y2; z1 = z0 + 1; assert (y3 == z1); }

69 69 Transformation to Equations int main () { int x0, y0; int z0 = y0; y1 = y0 - 1; y2 = y0 + 1; guard1 = x0 > 0; y3 = guard1?y1:y2; z1 = z0 + 1; assert (y3 == z1); } (z0 == y0 ∧ y1 == y0 - 1 ∧ y2 == y0 + 1 ∧ guard1 == (x0 > 0) ∧ y3 == (guard1 ? y1 : y2) ∧ z1 == z0 + 1 ∧ y3 == z1)

70 70 Transformation to Equations int main () { int x0, y0; int z0 = y0; y1 = y0 - 1; y2 = y0 + 1; guard1 = x0 > 0; y3 = guard1?y1:y2; z1 = z0 + 1; assert (y3 == z1); } (z0 == y0 ∧ y1 == y0 - 1 ∧ y2 == y0 + 1 ∧ guard1 == (x0 > 0) ∧ y3 == (guard1 ? y1 : y2) ∧ z1 == z0 + 1 ∧ y3 == z1) Uninitialized variables in CBMC are unconstrained inputs.

71 71 Transformation to Equations int main () { int x0, y0; int z0 = y0; y1 = y0 - 1; y2 = y0 + 1; guard1 = x0 > 0; y3 = guard1?y1:y2; z1 = z0 + 1; assert (y3 == z1); } (z0 == y0 ∧ y1 == y0 - 1 ∧ y2 == y0 + 1 ∧ guard1 == (x0 > 0) ∧ y3 == (guard1 ? y1 : y2) ∧ z1 == z0 + 1 ∧ y3 == z1) CBMC (1) negates the assertion

72 72 Transformation to Equations int main () { int x0, y0; int z0 = y0; y1 = y0 - 1; y2 = y0 + 1; guard1 = x0 > 0; y3 = guard1?y1:y2; z1 = z0 + 1; assert (y3 == z1); } (z0 == y0 ∧ y1 == y0 - 1 ∧ y2 == y0 + 1 ∧ guard1 == (x0 > 0) ∧ y3 == (guard1 ? y1 : y2) ∧ z1 == z0 + 1 ∧ y3 != z1) (assertion is now negated)

73 73 Transformation to Equations int main () { int x0, y0; int z0 = y0; y1 = y0 - 1; y2 = y0 + 1; guard1 = x0 > 0; y3 = guard1?y1:y2; z1 = z0 + 1; assert (y3 == z1); } (z0 == y0 ∧ y1 == y0 - 1 ∧ y2 == y0 + 1 ∧ guard1 == (x0 > 0) ∧ y3 == (guard1 ? y1 : y2) ∧ z1 == z0 + 1 ∧ y3 != z1) then (2) translates to SAT and uses a fast solver to find a counterexample

74 74 Execution Representation (z0 == y0 ∧ y1 == y0 - 1 ∧ y2 == y0 + 1 ∧ guard1 == (x0 > 0) ∧ y3 == (guard1 ? y1 : y2) ∧ z1 == z0 + 1 ∧ y3 != z1) Remove the assertion to get an equation for any execution of the program (take care of loops by unrolling)

75 75 Execution Representation (z0 == y0 ∧ y1 == y0 - 1 ∧ y2 == y0 + 1 ∧ guard1 == (x0 > 0) ∧ y3 == (guard1 ? y1 : y2) ∧ z1 == z0 + 1 ∧ y3 != z1) Execution represented by assignments to all variables in the equations x0 == 1 y0 == 5 z0 == 5 y1 == 4 y2 == 6 guard1 == true y3 == 4 z1 == 6 Counterexample

76 76 Execution Representation x0 == 1 y0 == 5 z0 == 5 y1 == 4 y2 == 6 guard1 == true y3 == 4 z1 == 6 Counterexample Execution represented by assignments to all variables in the equations x0 == 0 y0 == 5 z0 == 5 y1 == 4 y2 == 6 guard1 == false y3 == 6 z1 == 6 Successful execution

77 77 The Distance Metric d x0 == 1 y0 == 5 z0 == 5 y1 == 4 y2 == 6 guard1 == true y3 == 4 z1 == 6 Counterexample d = number of changes (Δs) between two executions x0 == 0 y0 == 5 z0 == 5 y1 == 4 y2 == 6 guard1 == false y3 == 6 z1 == 6 Successful execution

78 78 The Distance Metric d x0 == 1 y0 == 5 z0 == 5 y1 == 4 y2 == 6 guard1 == true y3 == 4 z1 == 6 Counterexample d = number of changes (Δs) between two executions x0 == 0 y0 == 5 z0 == 5 y1 == 4 y2 == 6 guard1 == false y3 == 6 z1 == 6 Successful execution

79 79 The Distance Metric d x0 == 1 y0 == 5 z0 == 5 y1 == 4 y2 == 6 guard1 == true y3 == 4 z1 == 6 Counterexample d = number of changes (Δs) between two executions x0 == 0 y0 == 5 z0 == 5 y1 == 4 y2 == 6 guard1 == false y3 == 6 z1 == 6 Successful execution (Δs at x0, guard1, y3)

80 80 The Distance Metric d x0 == 1 y0 == 5 z0 == 5 y1 == 4 y2 == 6 guard1 == true y3 == 4 z1 == 6 Counterexample d = number of changes (Δs) between two executions x0 == 0 y0 == 5 z0 == 5 y1 == 4 y2 == 6 guard1 == false y3 == 6 z1 == 6 Successful execution (Δs at x0, guard1, y3) d = 3

81 81 The Distance Metric d x0 == 1 y0 == 5 z0 == 5 y1 == 4 y2 == 6 guard1 == true y3 == 4 z1 == 6 Counterexample 3 is the minimum possible distance between the counterexample and a successful execution x0 == 0 y0 == 5 z0 == 5 y1 == 4 y2 == 6 guard1 == false y3 == 6 z1 == 6 Successful execution (Δs at x0, guard1, y3) d = 3

82 82 The Distance Metric d x0 == 1 y0 == 5 z0 == 5 y1 == 4 y2 == 6 guard1 == true y3 == 4 z1 == 6 Counterexample To compute the metric, add a new SAT variable for each potential Δ: Δx0 == (x0 != 1) Δy0 == (y0 != 5) Δz0 == (z0 != 5) Δy1 == (y1 != 4) Δy2 == (y2 != 6) Δguard1 == !guard1 Δy3 == (y3 != 4) Δz1 == (z1 != 6) New SAT variables

83 83 The Distance Metric d x0 == 1 y0 == 5 z0 == 5 y1 == 4 y2 == 6 guard1 == true y3 == 4 z1 == 6 Counterexample And minimize the sum of the Δ variables (treated as 0/1 values): a pseudo-Boolean problem Δx0 == (x0 != 1) Δy0 == (y0 != 5) Δz0 == (z0 != 5) Δy1 == (y1 != 4) Δy2 == (y2 != 6) Δguard1 == !guard1 Δy3 == (y3 != 4) Δz1 == (z1 != 6) New SAT variables
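As a sketch in symbols, using the Δ variables above (the tool's actual encoding may differ in details):

\[
\min \sum_{v} \Delta_v
\quad\text{subject to}\quad
\underbrace{(z_0 = y_0) \wedge \dots \wedge (z_1 = z_0 + 1)}_{\text{program equations}}
\;\wedge\; (y_3 = z_1)
\;\wedge\; \bigwedge_{v} \big(\Delta_v \leftrightarrow (v \neq \mathrm{cex}(v))\big)
\]

where cex(v) is v's value in the counterexample, and the conjunct y3 = z1 forces the solution to be a successful (non-counterexample) execution.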

84 84 The Distance Metric d An SSA-form oddity: Distance metric can compare values from code that doesn’t run in either execution being compared This can be the determining factor in which of two traces is most similar to a counterexample Counterintuitive but not necessarily incorrect: simply extends comparison to all hypothetical control flow paths

85 85 Explanation with Distance Metrics Algorithm (lower level): 1. Find a counterexample using Bounded Model Checking (SAT) 2. Create a new problem: SAT for a successful execution + constraints for minimizing distance to counterexample (least changes) 3. Solve this optimization problem using a pseudo-Boolean solver (PBS) (= 0-1 ILP) 4. Report the differences (Δs) to the user as an explanation (and a localization) of the error

86 86 Explanation with Distance Metrics [Pipeline diagram as before, with the tools named: the model checker is CBMC, the BMC/constraint generator is the explain tool, and the optimization tool is PBS; the inputs and outputs P+spec, C, S, -C, and the reported Δs are unchanged.]

87 87 Explanation with Distance Metrics Details hidden behind a Graphical User Interface (GUI) that hides SAT and distance metrics from users GUI automatically highlights likely bug locations, presents changed values Next slides: GUI in action + a teaser for experimental results

88 88 [GUI screenshot]

89 89 [GUI screenshot]

90 90 Explaining Abstract Counterexamples

91 91 Explaining Abstract Counterexamples First implementation presents differences as changes in concrete values, e.g.: “In the counterexample, x is 14. In the successful execution, x is 18.” Which can miss the point: What really matters is whether x is less than y But y isn’t mentioned at all!

92 92 Explaining Abstract Counterexamples If the counterexample and successful execution were abstract traces, we’d get variable relationships and generalization for “free” Abstraction should also make the model checking more scalable This is why abstraction is traditionally used in model checking, in fact

93 93 Model Checking + Abstraction In abstract model checking, the model checker explores an abstract state space In predicate abstraction, states consist of predicates that are true in a state, rather than concrete values: Concrete: x = 12, y = 15, z = 0 Abstract: x < y, z != 1

94 94 Model Checking + Abstraction In abstract model checking, the model checker explores an abstract state space. In predicate abstraction, states consist of predicates that are true in a state, rather than concrete values: Concrete: x = 12, y = 15, z = 0 Abstract: x < y, z != 1 Potentially represents many concrete states

95 95 Model Checking + Abstraction Conservative predicate abstraction preserves all erroneous behaviors in the original system Abstract “executions” now potentially represent a set of concrete executions Must check execution to see if it matches some real behavior of program: abstraction adds behavior

96 96 Implementation #2 MAGIC Predicate Abstraction Based Model Checker for C programs: Input: C program Checks for various properties: assert statements Simulation of a specification machine Provides counterexample if property does not hold Counterexamples are abstract executions – that describe real behavior of the actual program Now provides error explanation and fault localization

97 97 Model Checking + Abstraction Predicates & counterexample produced by the usual Counterexample Guided Abstraction Refinement Framework. Explanation will work as in the first case presented, except: The explanation will be in terms of control flow differences and Changes in predicate values.

98 98 MAGIC Overview [CEGAR flowchart: P + Spec → Abstraction → Model → Verification → either “Spec Holds” or an Abstract Counterexample → “Counterexample Real?” → if Yes, a Real Counterexample is reported; if No (Spurious Counterexample), Abstraction Refinement produces New Predicates and the loop returns to Abstraction.]

99 99 MAGIC Overview [Same CEGAR flowchart, with the Abstract Counterexample and New Predicates steps highlighted.]

100 100 Model Checking + Abstraction Explain an abstract counterexample that represents (at least one) real execution of the program Explain with another abstract execution that: Is not a counterexample Is as similar as possible to the abstract counterexample Also represents real behavior

101 101 14: assert (a < 4); 5: b = 4 6: c = -4 7: a = 2 8: a = 1 9: a = 6 10: a = 4 11: c = 9 12: c = 10 13: a = 10 4: a = 5 14: assert (a < 4); 5: b = -3 6: c = -4 7: a = 2 8: a = 1 9: a = 6 10: a = 4 11: c = 9 12: c = 3 13: a = 3 4: a = 5 Abstract rather than concrete traces: represent more than one execution Automatic generalization

102 102 14: assert (a < 4); 5: b > 2 6: c < 7 7: a >= 4 8: a <= 4 9: a >= 4 10: a <= 4 11: c >= 7 12: c >= 7 13: a >= 4 4: a >= 4 14: assert (a < 4); 5: b <= 2 6: c < 7 7: a > 4 8: a <= 4 9: a > 4 10: a <= 4 11: c >= 9 12: c < 7 13: a < 3 4: a >= 4 Abstract rather than concrete traces: represent more than one execution Automatic generalization

103 103 14: assert (a < 4); 5: b > 2 6: c < 7 7: a >= 4 8: a <= 4 9: a >= 4 10: a <= 4 11: c >= 7 12: c >= 7 13: a >= 4 4: a >= 4 14: assert (a < 4); 5: b <= 2 6: c < 7 7: a > 4 8: a <= 4 9: a > 4 10: a <= 4 11: c >= 9 12: c < 7 13: a < 3 4: a >= 4 Automatic generalization c >= 7: c = 7, c = 8, c = 9, c = 10… c < 7: c = 6, c = 5, c = 4, c = 3…

104 104 14: assert (a < 4); 5: b > 2 6: c < 7 7: a >= 4 8: a <= 4 9: a >= 4 10: a <= 4 11: c >= 7 12: c >= a 13: a >= 4 4: a >= 4 14: assert (a < 4); 5: b <= 2 6: c < 7 7: a > 4 8: a <= 4 9: a > 4 10: a <= 4 11: c >= 9 12: c < a 13: a < 3 4: a >= 4 Relationships between variables c >= a: c = 7 ∧ a = 7, c = 9 ∧ a = 6… c < a: c = 7 ∧ a = 10, c = 3 ∧ a = 4…

105 105 An Example 1 int main () { 2 int input1, input2, input3; 3 int least = input1; 4 int most = input1; 5 if (most < input2) 6 most = input2; 7 if (most < input3) 8 most = input3; 9 if (least > input2) 10 most = input2; 11 if (least > input3) 12 least = input3; 13 assert (least <= most); 14 }

106 106 An Example 1 int main () { 2 int input1, input2, input3; 3 int least = input1; 4 int most = input1; 5 if (most < input2) 6 most = input2; 7 if (most < input3) 8 most = input3; 9 if (least > input2) 10 most = input2; 11 if (least > input3) 12 least = input3; 13 assert (least <= most); 14 }

107 107 An Example 1 int main () { 2 int input1, input2, input3; 3 int least = input1; 4 int most = input1; 5 if (most < input2) 6 most = input2; 7 if (most < input3) 8 most = input3; 9 if (least > input2) 10 most = input2; 11 if (least > input3) 12 least = input3; 13 assert (least <= most); 14 }

108 108 An Example Value changed (line 2): input3#0 from 2147483615 to 0 Value changed (line 12): least#2 from 2147483615 to 0 Value changed (line 13): least#3 from 2147483615 to 0

109 109 An Example Not very obvious what this means… Value changed (line 2): input3#0 from 2147483615 to 0 Value changed (line 12): least#2 from 2147483615 to 0 Value changed (line 13): least#3 from 2147483615 to 0

110 110 An Example Control location deleted (step #5): 10: most = input2 Predicate changed (step #5): was: most < least now: least <= most Predicate changed (step #5): was: most < input3 now: input3 <= most ------------------------ Predicate changed (step #6): was: most < least now: least <= most Action changed (step #6): was: assertion_failure

111 111 An Example Control location deleted (step #5): 10: most = input2 Predicate changed (step #5): was: most < least now: least <= most Predicate changed (step #5): was: most < input3 now: input3 <= most ------------------------ Predicate changed (step #6): was: most < least now: least <= most Action changed (step #6): was: assertion_failure Here, on the other hand:

112 112 An Example Control location deleted (step #5): 10: most = input2 Predicate changed (step #5): was: most < least now: least <= most Predicate changed (step #5): was: most < input3 now: input3 <= most ------------------------ Predicate changed (step #6): was: most < least now: least <= most Action changed (step #6): was: assertion_failure Here, on the other hand: Line with error indicated Avoid error by not executing line 10

113 113 An Example Control location deleted (step #5): 10: most = input2 Predicate changed (step #5): was: most < least now: least <= most Predicate changed (step #5): was: most < input3 now: input3 <= most ------------------------ Predicate changed (step #6): was: most < least now: least <= most Action changed (step #6): was: assertion_failure Predicates show how change in control flow affects relationship of the variables

114 114 Explaining Abstract Counterexamples Implemented in the MAGIC predicate abstraction-based model checker MAGIC represents executions as paths of states, not in SSA form New distance metric resembles traditional metrics from string or sequence comparison: Insert, delete, replace operations State = PC + predicate values

115 115 Explaining Abstract Counterexamples Same underlying method as for concrete explanation Revise the distance metric to account for the new representation of program executions [Pipeline diagram as before, but the model checker and BMC/constraint generator are MAGIC and MAGIC/explain; the optimization tool is still PBS.]

116 116 CBMC vs. MAGIC Representations CBMC (SSA assignments): input1#0 == 0 input2#0 == -1 input3#0 == 0 least#0 == 0 most#0 == 0 guard0 == true guard1 == false least#1 == 0 … MAGIC (states & actions): [diagram: states s0, s1, s2, s3 linked by actions α0, α1, α2]

117 117 CBMC vs. MAGIC Representations CBMC (SSA assignments): input1#0 == 0 input2#0 == -1 input3#0 == 0 least#0 == 0 most#0 == 0 guard0 == true guard1 == false least#1 == 0 … MAGIC (states & actions): [diagram: states s0–s3 linked by actions α0–α2; each state records a Control location (e.g., Line 5) and Predicates (e.g., input1 > input2, least == input1, ...)]

118 118 A New Distance Metric [Diagram: one execution with states s0–s3 and actions α0–α2, another with states s'0–s'4 and actions α0–α3.] Must determine which states to compare: may be different number of states in two executions Make use of literature on string/sequence comparison & metrics

119 119 Alignment [Diagram: the first execution’s states are at control locations 1, 5, 7, 9; the second’s are at 1, 3, 7, 8, 11.] 1. Only compare states with matching control locations

120 120 Alignment [Diagram: candidate alignment edges drawn between states with matching control locations.]

121 121 Alignment [Diagram: further candidate alignments.]

122 122 Alignment [Diagram as before.] 2. Must be unique

123 123 Alignment [Diagram as before.]

124 124 Alignment [Diagram as before.] 3. Don’t cross over other alignments

125 125 Alignment [Diagram as before.]

126 126 A New Distance Metric [Diagram of the two aligned executions.] In sum: much like the traditional metrics used to compare strings, except the alphabet is over control locations, predicates, and actions
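For comparison, the classic string metric this resembles is Levenshtein edit distance; a sketch of the analogy (the metric actually used is encoded with alignment variables in a pseudo-Boolean problem, not computed by this recurrence):

\[
d(i,0)=i,\qquad d(0,j)=j,\qquad
d(i,j)=\min\big(\,d(i{-}1,j)+1,\; d(i,j{-}1)+1,\; d(i{-}1,j{-}1)+\mathrm{cost}(s_i,\,s'_j)\,\big)
\]

where cost(s_i, s'_j) compares two states' control locations, predicate values, and actions (0 if they match, positive otherwise).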

127 127 A New Distance Metric [Diagram of the two aligned executions.] Encoded using BMC and pseudo-Boolean optimization as in the first case, with variables for alignment and control, predicate and action differences

128 128 Explaining Abstract Counterexamples CBMC: one execution; changes in values; always a real execution; execution as SSA values; counterintuitive metric, but no alignment problem. MAGIC: (potentially) many executions; changes in predicates; may be spurious, so may need to iterate/refine; execution as path & states; intuitive metric, but must consider alignments (which states to compare?). Both: BMC to produce the PBS problem.

129 129 Results

130 130 Results: Overview Produces good explanations for numerous interesting case studies: μC/OS-II RTOS Microkernel (3K lines) OpenSSL code (3K lines) Fragments of Linux kernel TCAS Resolution Advisory component Some smaller, “toy” linear temporal logic property examples μC/OS-II, SSL, some TCAS bugs precisely isolated: report = fault

131 131 Results: Quantitative Evaluation Very good scores by Renieris & Reiss method for evaluating fault localization: Measures how much source code user can avoid reading thanks to the localization. 1 is a perfect score For SSL and μC/OS-II case studies, scores of 0.999 Other examples (almost) all in range 0.720-0.993

132 132 Results: Comparison Scores were generally much better than Nearest Neighbor – when it could be applied at all Much more consistent Testing-based methods of Renieris and Reiss occasionally worked better Also gave useless (score 0) explanations much of the time Scores a great improvement over the counterexample traces alone

133 133 Results: Comparison Scores and times for various localization methods Best score for each program highlighted * alternative scoring method for large programs

134 134 Results: MAGIC No program required iteration to find a non-spurious explanation: good abstraction already discovered

135 135 Results: Time Time to explain comparable to model checking time No more than 10 seconds for abstract explanation (except when it didn’t find one at all…) No more than 3 minutes for concrete explanations

136 136 Results: Room for Improvement Concrete explanation worked better than abstract in some cases When SSA based metric produced smaller optimization constraints For TCAS examples, user assistance was needed in some cases Assertion of form (A implies B) First explanation “explains” by showing how A can fail to hold Easy to get a good explanation—force model checker to assume A

137 137 Conclusions: Good News Counterexample explanation and fault localization can provide good assistance in locating errors The model checking approach, when it can be applied (usually not to large programs or with complex data structures) may be most effective But Tarantula is the real winner, unless model checking starts scaling better

138 138 Future Work? The ultimate goal: testing tool or model checker fixes our programs for us – automatic program repair! That’s not going to happen, I think But we can try (and people are doing just that, right now)

139 139 Model Checking and Scaling Next week we’ll look at a kind of “model checking” that doesn’t involve building SAT equations or producing an abstraction We’ll run the program and backtrack execution Really just an oddball form of testing Can’t do “stupid SAT-solver tricks” like using PBS to produce great fault localizations, but has some other benefits

