Statically Validating Must Summaries for Incremental Compositional Dynamic Test Generation
Patrice Godefroid, Shuvendu K. Lahiri, Cindy Rubio-González
Microsoft Research and University of Wisconsin – Madison
International Static Analysis Symposium – September 2011

Background: Systematic Dynamic Test Generation (= DART)
Run the program on a valid input and symbolically execute it along the recorded trace
Collect the constraints, then negate and solve them to obtain new inputs
The process repeats (possibly forever!)
Used in many tools
  o EXE, CUTE, SAGE, PEX, KLEE, BitScope, Apollo, etc.

Microsoft
  o 1st whitebox fuzzer for security testing
  o 200+ machines (since 2008)
  o 1 billion+ constraints
  o 100s of apps, 100s of security bugs
#1 application for SMT solvers today (CPU usage)
Example: Win7 file fuzzing found ~1/3 of all fuzzing bugs
  o Millions of dollars saved for Microsoft + time/energy for the world

Compositional Test Generation
Systematically executing all feasible paths does not scale
Compositional Dynamic Test Generation
  o Compute summaries that can be reused later
  o Avoid retesting
Can provide the same path coverage exponentially faster!

Example of Function Summary
int is_positive(int x) {
    if (x > 0) return 1;
    return 0;
}
Summary: (x > 0 ∧ ret = 1) ∨ (x ≤ 0 ∧ ret = 0)
where ret denotes the value returned by the function is_positive

Function Summaries
A summary for a function f is a disjunction of clauses, each pairing
  o a conjunction of constraints on the inputs of f
  o a conjunction of constraints on the outputs of f

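A minimal sketch of how such a summary could be represented and checked in C. It assumes that the summary of is_positive is (x > 0 ∧ ret = 1) ∨ (x ≤ 0 ∧ ret = 0). The type summary_clause and the functions summary_holds and check_summary_on_range are hypothetical names introduced only for illustration; they are not part of any tool mentioned in the talk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Function being summarized. */
int is_positive(int x) {
    if (x > 0) return 1;
    return 0;
}

/* One summary clause: a constraint on the input paired with a
   constraint on the output (hypothetical representation). */
typedef struct {
    bool (*pre)(int x);
    bool (*post)(int ret);
} summary_clause;

static bool pre_pos(int x)       { return x > 0; }
static bool post_one(int ret)    { return ret == 1; }
static bool pre_nonpos(int x)    { return x <= 0; }
static bool post_zero(int ret)   { return ret == 0; }

/* Assumed summary: (x > 0 && ret == 1) || (x <= 0 && ret == 0). */
static const summary_clause is_positive_summary[] = {
    { pre_pos,    post_one  },
    { pre_nonpos, post_zero },
};

/* True if some clause of the summary describes the observed call. */
bool summary_holds(int x, int ret) {
    size_t n = sizeof is_positive_summary / sizeof is_positive_summary[0];
    for (size_t i = 0; i < n; i++) {
        if (is_positive_summary[i].pre(x) && is_positive_summary[i].post(ret))
            return true;
    }
    return false;
}

/* Asserts that the summary describes every call in [lo, hi]. */
void check_summary_on_range(int lo, int hi) {
    assert(lo <= hi);
    for (int x = lo; x <= hi; x++)
        assert(summary_holds(x, is_positive(x)));
}
```

With such a summary available, a caller of is_positive can be analyzed against the two clauses instead of re-executing the function body symbolically.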
Must Summaries
int g(int x, int y) {
    if ((x > 0) && (hash(y) > 10))
        return 1;
    return 0;
}
Assume hash is a complex or unknown function
Assume that when g is invoked with y = 45, hash(45) = 987
Under-approximate with a smaller precondition (here, y = 45 in place of hash(y) > 10)

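A small sketch of the under-approximation for g, assuming the must summary has precondition x > 0 ∧ y = 45 and postcondition ret = 1. Only hash(45) = 987 comes from the slide; the rest of the hash body is a made-up placeholder for the "complex or unknown" function, and must_pre, must_post and must_summary_holds_on are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder for the complex or unknown function. Only hash(45) = 987
   is taken from the slide; everything else here is an assumption. */
int hash(int y) {
    if (y == 45) return 987;
    return (int)(((unsigned)y * 2654435761u) % 1000u);
}

int g(int x, int y) {
    if ((x > 0) && (hash(y) > 10))
        return 1;
    return 0;
}

/* Assumed must-summary precondition: x > 0 and y == 45.
   It is smaller than the exact condition x > 0 && hash(y) > 10. */
bool must_pre(int x, int y) { return x > 0 && y == 45; }

/* Assumed must-summary postcondition: ret == 1. */
bool must_post(int ret) { return ret == 1; }

/* True if every x in [x_lo, x_hi] that satisfies the precondition
   (with y fixed to 45) leads to the postcondition. */
bool must_summary_holds_on(int x_lo, int x_hi) {
    assert(x_lo <= x_hi);
    for (int x = x_lo; x <= x_hi; x++) {
        if (must_pre(x, 45) && !must_post(g(x, 45)))
            return false;
    }
    return true;
}
```

The precondition says nothing about other values of y: inputs such as y = 44 may or may not return 1, which is why the summary is an under-approximation rather than an exact description of g.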
Must Summaries
Defined as a quadruple ⟨lp, P, lq, Q⟩ for a program Prog, where:
  o lp and lq are locations in Prog
  o P is the summary precondition holding at lp
  o Q is the summary postcondition holding at lq

Some Facts About Summaries
Time to be produced: weeks/months
Number of summaries: millions
Number of instructions executed between lp and lq: can be hundreds of thousands

Incremental Compositional Test Generation
Have to start from scratch if there is a small code change
Incremental compositional test generation
  o As in smart/selective regression testing
  o Reuse summaries still valid in new program
  o Recompute invalid summaries

Must Summary Checking
Given a valid must summary for a program and a new version of the program, is the summary still valid for the new version?
Intraprocedural summaries
  o locations lp and lq are in the same function f
  o function f does not return between lp and lq when the summary is generated

Some proposals
Naïve
  o For each summary, record executed instructions
  o Too expensive: ~100K instructions executed, runtime overhead
Our proposal
  o Statically verify which summaries are still valid in order to reuse them
  o Less precise than recomputing summaries from scratch, but cheaper

Algorithms
1. Static Change Impact Analysis
2. Predicate-Sensitive Change Impact Analysis
3. Must Summary Validity Checking Analysis

Phase 1: Static Change Impact Analysis
Impact analysis of code changes in the control-flow and call graphs of the program
[Figure: old program and new program, each with locations lp and lq marked]

Modified Instructions and Functions
Instruction i of a program Prog is modified if:
  o i is changed or deleted in Prog, or
  o its ordered set of immediate successors has changed
Function f in a program Prog is modified if f:
  o contains a modified instruction
  o calls a modified function
  o calls an unknown function

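The three rules for functions can be sketched as a fixpoint over the call graph. This is only an illustration under assumptions: the call_graph type, its adjacency-matrix layout, MAX_FUNCS and mark_modified are hypothetical and are not taken from the implementation described in the talk.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FUNCS 16

/* Hypothetical call-graph representation: calls[f][g] is true if f calls g. */
typedef struct {
    int  n;                               /* number of functions          */
    bool calls[MAX_FUNCS][MAX_FUNCS];     /* call edges                   */
    bool has_modified_instr[MAX_FUNCS];   /* f contains a modified instr. */
    bool unknown[MAX_FUNCS];              /* f is an unknown function     */
} call_graph;

/* Marks every function that is modified under the three rules,
   iterating until no new function is marked so that the effect
   of a change propagates up through callers. */
void mark_modified(const call_graph *cg, bool modified[MAX_FUNCS]) {
    assert(cg->n >= 0 && cg->n <= MAX_FUNCS);

    /* Rule 1: the function contains a modified instruction. */
    for (int f = 0; f < cg->n; f++)
        modified[f] = cg->has_modified_instr[f];

    /* Rules 2 and 3: the function calls a modified or unknown function. */
    bool changed = true;
    while (changed) {
        changed = false;
        for (int f = 0; f < cg->n; f++) {
            if (modified[f]) continue;
            for (int g = 0; g < cg->n; g++) {
                if (cg->calls[f][g] && (modified[g] || cg->unknown[g])) {
                    modified[f] = true;
                    changed = true;
                    break;
                }
            }
        }
    }
}
```

For a chain in which function 0 calls 1, 1 calls 2, and only 2 contains a modified instruction, all three end up marked, while a function with no path to a change stays unmarked.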
Phase 1: Static Change Impact Analysis
1. Construct the call graph for the program
2. Find modified and unknown functions
3. Find indirectly modified and unknown functions
4. Map summaries, construct control-flow graphs
5. Classify summaries as valid or invalid
[Figure: call graph with functions labeled M (modified), U (unknown), IM (indirectly modified) and IU (indirectly unknown), and summaries labeled S]

Phase 2: Predicate-Sensitive Change Impact Analysis
Exploit the predicates P and Q in a summary
[Figure: control-flow graph of the old program between lp and lq, branching on x > 0 and y == 0 to w = w + 1, w = 0 or w = 1; the summary with Q: w = 0 is invalidated by Phase 1]

Phase 2: Predicate-Sensitive Change Impact Analysis
Old program (summary between lp and lq, with Q: w = 0)
void foo() {
    if (x > 0) {
        if (y == 10) w++;   // MODIFIED
        else w = 0;
    } else {
        w = 1;              // MODIFIED
    }
    ...
    return;
}

Phase 2: Predicate-Sensitive Change Impact Analysis
[Figure: instrumented old program foo(), with locations lp and lq and Q: w = 0]

Phase 2: Predicate-Sensitive Change Impact Analysis
Check that the assertion in the instrumented code does not fail for any possible input
Verification-condition based program verifier
  o Create a logic formula from the program with assertions
  o Check formula validity using a theorem prover
  o If valid, the assertion does not fail in any execution

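One way to picture the Phase 2 check on the foo() example, under assumptions that are not on the slides: a flag records whether a modified instruction executes between lp and lq, and the assertion at lq is "Q implies that no modified instruction executed". The precondition P: w >= 0 is hypothetical (the slides give only Q: w = 0), and the brute-force loop over a small input range merely stands in for the program verifier, which reasons about all inputs symbolically.

```c
#include <assert.h>
#include <stdbool.h>

/* Runs the old code between lp and lq on (x, y, w) and reports whether
   the assumed check "Q implies no modified instruction executed" holds. */
bool phase2_check(int x, int y, int w) {
    bool modified = false;               /* hypothetical instrumentation flag */

    /* lp */
    if (x > 0) {
        if (y == 10) { modified = true; w++; }   /* MODIFIED */
        else w = 0;
    } else {
        modified = true; w = 1;                  /* MODIFIED */
    }
    /* lq */

    bool q = (w == 0);                   /* Q: w = 0 */
    return !q || !modified;
}

/* Brute-force stand-in for the verifier: tries every input in
   [lo, hi]^3 that satisfies the hypothetical precondition P: w >= 0. */
bool summary_survives_phase2(int lo, int hi) {
    assert(lo <= hi);
    for (int x = lo; x <= hi; x++)
        for (int y = lo; y <= hi; y++)
            for (int w = lo; w <= hi; w++) {
                if (!(w >= 0)) continue;         /* assume P */
                if (!phase2_check(x, y, w)) return false;
            }
    return true;
}
```

Under this assumed P, every path that ends with w = 0 goes through the unmodified assignment w = 0, so the summary survives even though its enclosing function contains modified instructions. Without P, the input w = -1 with y = 10 reaches lq with w = 0 through a modified instruction, and the check fails.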
Phase 3: Must Summary Validity Checking
Check must summary validity against some code, independently of code changes
[Figure: control-flow graph between lp and lq, branching on x < 0 and y < 0 to r = 1, r = 0 (old program; r = 4 in the new program) or w = 1; the summary with P: x < 0 is invalidated by Phase 1 and Phase 2]

Phase 3: Must Summary Validity Checking
New program (summary between lp and lq, with P: x < 0)
void bar() {
    if (x < 0) {
        if (y < 0) r = 1;
        else {
            r = 4;   // r = 0 in old code
        }
    ...
    return;
}

Phase 3: Must Summary Validity Checking
Instrumented new program (P: x < 0)
void bar() {
    reach_lq = false;
    goto lp;
    ...
lp: assume P;
    if (x < 0) {
        if (y < 0) r = 1;
        else {
            r = 4;   // r = 0 in old code
        }
lq: assert(Q);
    reach_lq = true;
    ...
    assert(reach_lq);
    return;
}

Phase 3: Must Summary Validity Checking
Check that assertions hold in the instrumented program for all possible inputs

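The instrumented bar() example can be mimicked with ordinary C for small inputs. This is a sketch under stated assumptions: the slides give P: x < 0 but not Q, so the postcondition Q: r >= 0 used here is hypothetical, as is the initial value of r. "assume P" is modeled by skipping inputs that violate P, and exhaustive testing over a small range stands in for the verifier.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulates the instrumented new program on one input. Returns true if
   both assertions hold, or if "assume P" filters the input out. */
bool phase3_check(int x, int y) {
    bool reach_lq = false;
    int r = 0;                           /* hypothetical initial value */

    /* lp */
    if (!(x < 0)) return true;           /* assume P: x < 0 */
    if (x < 0) {
        if (y < 0) r = 1;
        else r = 4;                      /* r = 0 in old code */

        /* lq */
        if (!(r >= 0)) return false;     /* assert(Q), hypothetical Q: r >= 0 */
        reach_lq = true;
    }
    return reach_lq;                     /* assert(reach_lq) */
}

/* Brute-force stand-in for the verifier over [lo, hi]^2. */
bool summary_survives_phase3(int lo, int hi) {
    assert(lo <= hi);
    for (int x = lo; x <= hi; x++)
        for (int y = lo; y <= hi; y++)
            if (!phase3_check(x, y)) return false;
    return true;
}
```

The check looks only at the new code: although the assignment r = 0 became r = 4, every input with x < 0 still reaches lq with the assumed Q satisfied, so a summary rejected by the change-based phases can still be validated here.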
Result
Validated summaries can be reused
  o Because of soundness
Invalidated summaries are discarded and need to be recomputed
  o New tests are generated to cover their preconditions
Algorithms can be used in isolation or in a pipeline

Experimental Results

Implementation Details
Inputs: old DLL, new DLL, and summaries for the old DLL (produced by SAGE)
Vulcan: library to statically analyze Windows binaries
Map summaries, find modified instructions and functions (C++)
Phase 1 (Change Impact), Phase 2 (Predicate Sensitive), Phase 3 (Validity Checking): used in pipeline or isolation
Output: valid/invalid summaries

Implementation Details
Translator from x86 to BoogiePL
  o Input: procedure (x86, via Vulcan) and summary ⟨lp, P, lq, Q⟩
  o Sound translation
  o Output: instrumented BPL file (Phase 2 or Phase 3), checked with Boogie/Z3

Benchmarks
Image parsers embedded in Windows
  o ANI, GIF and JPEG
Ran SAGE to generate summaries (small sample)
  o 286 for ANI, 288 for GIF and 517 for JPEG
Identified the DLLs involved
  o 3 for ANI, 4 for GIF and 8 for JPEG
Compared old version against a randomly picked newer version
  o Delta ~1 to 3 years

Difference Between Program Versions
Modified functions: 3% - 10%
Indirectly modified functions: 30% - 45%
Unknown functions: 27% - 37%
Indirectly unknown functions: 60% - 74%

Applying Phases in Isolation
Validated summaries, per phase applied in isolation
Benchmark   Phase 1: Change Impact   Phase 2: Predicate Sensitive   Phase 3: Validity Checking   Total Validated
ANI         58%                      85%                            30%                          256/286 (90%)
GIF         69%                      92%                            31%                          274/288 (95%)
JPEG        61%                      94%                            33%                          501/517 (97%)

Applying Phases in Pipeline Fashion
Validated summaries, per phase applied in a Phase 1, Phase 2, Phase 3 pipeline
Benchmark   Phase 1: Change Impact   Phase 2: Predicate Sensitive   Phase 3: Validity Checking   Total Validated
ANI         58%                      27%                            4%                           256/286 (90%)
GIF         69%                      25%                            1%                           274/288 (95%)
JPEG        61%                      35%                            1%                           501/517 (97%)

Running Time (Isolation)
[Figure: running time in minutes for Phase 1 (Change Impact), Phase 2 (Predicate Sensitive) and Phase 3 (Validity Checking)]

Running Time
[Figure: running time in minutes for the Phase 1 (Change Impact), Phase 2 (Predicate Sensitive), Phase 3 (Validity Checking) pipeline; bars labeled 43 min, 28 min and 41 min]
Preliminary results show that statically validating must summaries is up to 20 times faster than recomputing them!

Summary
Formulated the problem of statically validating must summaries
Demonstrated the effectiveness of static must summary checking
  o Validated hundreds of must summaries in minutes
Described three approaches for validating must summaries
Presented a preliminary evaluation on three large Windows image parsers

Questions?