Data Modeling for Program Analysis
Scott McPeak
OSQ Retreat
A Program Verifier
Verification assures that a program meets some specification, e.g. "no segfaults"
– Full correctness vs. partial specs
This is undecidable in general, hence annotations
[Diagram: program, specification, and annotations; annotations supply useful facts and impose new obligations]
Verifier Architecture
[Diagram: the program, its annotations, and a hardcoded specification feed verification condition generation (semantics); the generated predicates, which collectively imply that the program meets the spec, go to a theorem prover, which answers "proved" or "not proved"]
Verification Benefits
Potential for reducing costs of testing and debugging is enormous
– Memory safety
– Concurrency safety
– Adherence to domain-specific protocols
Annotation appeal: capture "why" info
Could prove absence of certain security violations
Run Time is Too Late
Doesn't reduce testing cost
Run-time cost may be significant
– Cumulative across different analyses
Recovery after run-time failure?
Delay between introduction of a bug and the discovery of its effect
Will Anyone Annotate?
Of course, if cost/benefit ratio is right
Benefits can be high (see Verification Benefits)
Abstraction is key to controlling cost
– Can re-use "why" knowledge; libraries, etc.
– Common tasks must be easy (e.g. array of non-null elements)
– Module-wide defaults under user control
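Development Model
[Diagram: code → compile → verifier → testing, with a feedback loop at each stage]
– compile: type error → fix
– verifier: failed proof → diagnosis assistant → explanation → fix
– testing: wrong behavior → debugging...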
Data Modeling
Program analyzer must abstract application data (otherwise it's just executing!)
Model: family of mathematical objects, and axioms which relate them
Enormous design space, little guidance
Direct impact on success of analysis
Example: Strings
Initial model: two function symbols
– size(addr): number of allocated bytes
– strlen(addr): least index of a 0 byte
strcpy(d, s)
– pre: size(d) > strlen(s)
– post: strlen(d) = strlen(s)
strcat(d, s)
– pre: size(d) - strlen(d) > strlen(s)
– post: strlen(d) = pre(strlen(d) + strlen(s))
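The two-symbol model can be sketched as an abstract state that tracks only size and strlen per buffer. This is a minimal illustration, not the verifier's implementation: the names `Buf`, `strcpy_model`, and `strcat_model` are hypothetical, and the preconditions are assumed to require room for the source string plus its terminating 0 byte.

```python
from dataclasses import dataclass


@dataclass
class Buf:
    """Abstract buffer: the model keeps only these two facts (hypothetical type)."""
    size: int    # size(addr): number of allocated bytes
    strlen: int  # strlen(addr): least index of a 0 byte


def strcpy_model(d: Buf, s: Buf) -> None:
    # Assumed pre: destination holds all of s plus the terminating 0 byte.
    if not d.size > s.strlen:
        raise ValueError("strcpy precondition not established")
    # post: strlen(d) = strlen(s)
    d.strlen = s.strlen


def strcat_model(d: Buf, s: Buf) -> None:
    # Assumed pre: the free space in d holds all of s plus the terminator.
    if not d.size - d.strlen > s.strlen:
        raise ValueError("strcat precondition not established")
    # post: strlen(d) = pre(strlen(d) + strlen(s))
    d.strlen = d.strlen + s.strlen
```

Note that this model says nothing about which characters a buffer holds, so it cannot express the behavior of a function such as strchr.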
String as a Set
Add the predicate contains(addr, ch) → {T, F}
strcpy(d, s)
– post: ∀ ch. contains(s, ch) ⇔ contains(d, ch)
strchr(s, ch) → r
– post: (contains(s, ch) ⇒ ∃ i. r = s+i) && (¬contains(s, ch) ⇒ r = NULL)
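A rough sketch of the set view, assuming a string is abstracted to the set of characters it contains. The function names and the "NON_NULL"/"NULL" result tags are hypothetical; in this model the analysis learns whether strchr returns NULL, but not where the result points.

```python
def contains(chars: frozenset, ch: str) -> bool:
    """contains(addr, ch): membership in the abstract character set."""
    return ch in chars


def strcpy_set(s_chars: frozenset) -> frozenset:
    # post: for all ch, contains(s, ch) <=> contains(d, ch)
    return frozenset(s_chars)


def strchr_set(s_chars: frozenset, ch: str) -> str:
    # post: contains(s, ch) => r points somewhere into s;
    #       not contains(s, ch) => r = NULL
    return "NON_NULL" if contains(s_chars, ch) else "NULL"
```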
String as a Sequence
Add another symbol "[]": addr[i] → ch
strcpy(d, s)
– post: ∀ i. d[i] = s[i]
strchr(s, ch) → r
– post: ((∃ i. s[i]=ch) ⇒ *r=ch) && (¬(∃ i. s[i]=ch) ⇒ r=NULL)
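The sequence view can be illustrated by modeling a string as a list of byte values ending in 0, with a pointer into the string represented by an index. This is only a sketch under those assumptions; `strcpy_seq` and `strchr_seq` are hypothetical names.

```python
from typing import List, Optional


def strcpy_seq(s: List[int]) -> List[int]:
    """Return the destination contents: s up to and including its 0 byte."""
    n = s.index(0)          # strlen(s): least index of a 0 byte
    return s[: n + 1]       # post: d[i] = s[i] for every i up to strlen(s)


def strchr_seq(s: List[int], ch: int) -> Optional[int]:
    """Return the index r of the first ch in s, or None to stand for NULL."""
    for i, c in enumerate(s):
        if c == ch:
            return i        # post: s[r] = ch, i.e. *r = ch
        if c == 0:
            break           # reached the terminator without finding ch
    return None             # post: no i with s[i] = ch, so r = NULL
```

Unlike the set model, this one can state what the returned pointer refers to, at the price of quantifying over indexes.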
Example: Integers
"int" is easy to model, right? Well...
– Mathematical integers
– Finite partition: { 1 }
– 32-bit 2's complement with wraparound
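The difference between these models shows up on a single addition. The sketch below compares mathematical integers, a three-way sign partition, and 32-bit two's complement with wraparound. The sign partition {neg, zero, pos} is an assumption chosen for illustration and is not necessarily the partition the slide intended; `wrap32`, `sign`, and `sign_add` are hypothetical helpers.

```python
INT_MAX = 2**31 - 1


def wrap32(x: int) -> int:
    """Interpret x modulo 2**32 as a signed 32-bit two's complement value."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def sign(x: int) -> str:
    """Abstract a mathematical integer to its sign class."""
    return "neg" if x < 0 else "zero" if x == 0 else "pos"


def sign_add(a: str, b: str) -> str:
    """Addition over sign classes, sound for mathematical integers only."""
    if a == "zero":
        return b
    if b == "zero" or a == b:
        return a
    return "unknown"  # pos + neg can have any sign


math_result = INT_MAX + 1                            # mathematical integers
sign_result = sign_add(sign(INT_MAX), sign(1))       # finite partition
wrap_result = wrap32(INT_MAX + 1)                    # 32-bit wraparound
```

The mathematical and sign models both conclude the sum is positive, while the wraparound model yields the most negative 32-bit value, so the choice of model determines which facts the analysis may rely on.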
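Example: Memory
[Diagram: mem maps top-level object addresses (&x, malloc(..)) to objects; a struct maps field offsets to fields; an array maps int indexes to elements]
"x" = sel(mem_0, addr_x)
"a" = sel(mem_0, addr_a)
"a.g" = sel(sel(mem_0, addr_a), g)
"a.g[3]" = sel(sel(sel(mem_0, addr_a), g), 3)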
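The hierarchical memory can be sketched with nested maps: the top level is indexed by object address, a struct by field name, and an array by integer index. The concrete values and the `update` and `write_a_g` helpers are illustrative assumptions; only `sel` appears on the slide.

```python
def sel(m: dict, key):
    """sel(m, key): read one level of the hierarchy."""
    return m[key]


def update(m: dict, key, value) -> dict:
    """Return a new map equal to m except at key; m itself is unchanged."""
    new = dict(m)
    new[key] = value
    return new


# mem0 holds a scalar x and a struct a whose field g is an array.
mem0 = {"addr_x": 7, "addr_a": {"g": {3: 42}}}

x = sel(mem0, "addr_x")                          # "x"
a_g_3 = sel(sel(sel(mem0, "addr_a"), "g"), 3)    # "a.g[3]"


def write_a_g(mem: dict, index: int, value) -> dict:
    """Model the assignment a.g[index] = value as nested updates."""
    a = sel(mem, "addr_a")
    g = sel(a, "g")
    return update(mem, "addr_a", update(a, "g", update(g, index, value)))


mem1 = write_a_g(mem0, 3, 99)
```

Each write produces a new memory value, so facts about `mem0` remain valid after the write and can be related to `mem1`.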
Pointers
Pointers are access paths
– "&(a.g[3])" = sub(sub(sub(whole, a), g), 3)
Rules to read via pointers
– selPtr(obj, whole) = obj
– selPtr(obj, sub(rest, index)) = v if sel(selPtr(obj, rest), index) = v
Can also write, do pointer arithmetic, deeper indexing, e.g. "&(p->x)"
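Reading through an access path can be sketched as a recursive function over the path. The tuple encoding of `whole` and `sub` is an assumption made for this illustration, and the recursive case reads the rule as "selPtr(obj, sub(rest, index)) equals sel(selPtr(obj, rest), index)".

```python
WHOLE = ("whole",)  # the empty access path: the object itself


def sub(rest: tuple, index) -> tuple:
    """Extend an access path by one field name or array index."""
    return ("sub", rest, index)


def sel(m: dict, key):
    return m[key]


def sel_ptr(obj, path: tuple):
    """Read the value that the access path designates within obj."""
    if path == WHOLE:
        return obj                            # selPtr(obj, whole) = obj
    _, rest, index = path
    return sel(sel_ptr(obj, rest), index)     # one more level of indexing


mem = {"a": {"g": {3: 42}}}
ptr = sub(sub(sub(WHOLE, "a"), "g"), 3)       # "&(a.g[3])"
value = sel_ptr(mem, ptr)
```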
Data Structure Invariants
Classic approach: universal quantifier
– ∀ a. type(a)=Foo ⇒ a->x = a->y + 1
Field admission predicate
– Bar *p; admission: p!=NULL;
Object state field: "ok" vs. "not ok"
– Change a field → state:="not ok"
– Manually certify "ok", precondition=invariant
– ∀ a. type(a)=Foo ⇒ a->state="ok"
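The object state field approach might look like the following run-time sketch, in which any field write marks the object "not ok" and an explicit certification step re-checks the invariant x = y + 1. A verifier would discharge the check statically; the class layout and method names here are hypothetical.

```python
class Foo:
    """Object with invariant x = y + 1 and an explicit state field."""

    def __init__(self, y: int):
        self.x = y + 1
        self.y = y
        self.state = "ok"          # invariant holds on construction

    def set_field(self, name: str, value: int) -> None:
        setattr(self, name, value)
        self.state = "not ok"      # any field change invalidates the state

    def certify(self) -> None:
        # Precondition of certification is the invariant itself.
        if self.x != self.y + 1:
            raise ValueError("invariant x = y + 1 does not hold")
        self.state = "ok"
```

Between a write and the next certification the object is allowed to violate the invariant, which is what makes multi-step updates expressible.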
Example: Change Sets
Globals: list of changed / list of unchanged
– Not ideal... name sets of globals?
Hierarchical mem: changed object is easy
– new = update(old, obj_addr, some_value)
But changed field (of many objects) is hard
Possible alternative: staged & weakened invariants; state what is still true, rather than naming what has changed
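The asymmetry between the two cases can be seen in a small sketch, assuming memory is a map from object address to a map of fields. Changing one object is a single `update`, whereas changing one field in every object touches every top-level entry. The `changed_addrs` helper and the sample data are hypothetical.

```python
def update(old: dict, key, value) -> dict:
    """new = update(old, key, value): a copy of old that differs only at key."""
    new = dict(old)
    new[key] = value
    return new


def changed_addrs(old: dict, new: dict) -> set:
    """Top-level addresses whose contents differ between old and new."""
    return {a for a in old.keys() | new.keys() if old.get(a) != new.get(a)}


old = {
    1: {"val": 10, "mark": 0},
    2: {"val": 20, "mark": 0},
    3: {"val": 30, "mark": 0},
}

# Changed object: one update, and the change set is a single address.
one_obj = update(old, 2, {"val": 99, "mark": 0})

# Changed field in many objects: every object must be rewritten.
all_marked = old
for addr in old:
    all_marked = update(all_marked, addr, update(all_marked[addr], "mark", 1))
```

In the second case the top-level change set names every object, even though each object's `val` field is untouched, which is the fact a specification would actually want to state.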
Conclusions
Try to capture invariants implicitly, via representation choices
Be explicit about related entities: inDegree(n)=d vs. inDegree1(n, referrer)
Let user select among possible models, even to choose not to model certain fields
Try to think like a programmer