The New Dyninst Code Parser: Binary Code Isn't as Simple as it Used to Be. Nathan Rosenblum, March 2006. Unconventional Code Constructs. © 2006 Nathan Rosenblum.


1 The New Dyninst Code Parser: Binary Code Isn't as Simple as it Used to Be
Nathan Rosenblum, University of Wisconsin, nater@cs.wisc.edu
March 2006. Unconventional Code Constructs. © 2006 Nathan Rosenblum

2 Binary Analysis
• Processing of the binary code to extract syntactic and symbolic information from many sources:
  - Symbol tables (if present)
  - Decoding (disassembling) instructions
  - Control-flow information: basic blocks, loops, functions
  - Data-flow information: from basic register information to highly sophisticated (and expensive) analyses

3 Products of Binary Analysis
• High-level organization and characteristics
  - Function entry/exit points
  - Intra-procedural call graph
  - Inter-procedural control-flow graph
  - Exception handlers
  - Jump tables
  - Virtual function tables
• Abstract assembly representation
• Data-flow characteristics
  - Register liveness (for instrumentation, modification)

4 Uses of Binary Analysis
• Debugging
• Testing
• Performance profiling
• Performance modeling
• Behavior modeling
• Dynamic modification
• Binary rewriting
• Reverse engineering

5 Binary Analysis Tool Goals
  Safe: Eliminate false positives to make instrumentation safe
  Accurate: Minimize false negatives for a complete view of the binary
  Opportunistic: Use all available information and techniques to maximum effect
  Resilient: Tools are robust to unexpected and unusual applications
  Automated: Analysis does not depend on human interaction
  Complementary: Produce products compatible with source-level analysis tools

6 Why is Binary Analysis Hard?
Source code:
    Func foo() { … switch(a) { … } … }
The compiler turns this into binary code:
    push %ebp
    mov %esp, %ebp
    …
    mov [0x1d], %eax
    jmp *%eax
    …

7 Current Approaches
• Linear disassembly of binaries is insufficient
  - Symbol tables often lie, or are absent
  - Functions are not address ranges, and may be non-contiguous
• Parsing based on program control flow
  - Commonly used approach: UQBT, LEEL, RAD, IDA-Pro, Dyninst
  - Must contend with gaps in known code regions after parsing

8 Dyninst Control Flow Parsing
• Opportunistic parsing: utilizes symbol table and other information when available (and sensible)
• Provides a more accurate view of the binary than linear disassembly
• Addresses the problem of gaps in the binary through speculative parsing
  - Heuristics to identify function preambles

9 Control Flow Traversal Illustrated
    00: mov [a8], r1
    04: mov [ac], r2
    08: add r1, r2, r3
    0c: cmp r3, 0
    10: bne 24
    14: call
    18: add r3, 8, r3
    1c: call
    20: jmp 28
    24: mul r2, 2, r3
    28: sub r1, r3, r1
    ...
Basic blocks shown in the CFG diagram: 00, 14, 24, 28
• Parsing follows control flow
• Control transfers are edges in the CFG
• Target blocks can be parsed in any order

10 Control Flow Traversal Illustrated
(Same code listing as slide 9.)
• Call sites determine the location of functions
• Targets of calls are added to the function parsing work list
Known functions: foo, quux, quuux, bar, baz
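The traversal on slides 9 and 10 can be sketched in a few lines of Python. This is only an illustration, not Dyninst's implementation: the instruction table mirrors the slide's toy listing, but the call targets (0x40, 0x60) and the closing `ret` at 0x2c are hypothetical, since the transcript does not preserve them.

```python
from collections import deque

INSN_SIZE = 4  # the toy listing uses fixed 4-byte instructions

# address -> (mnemonic, control-transfer target or None)
# Call targets 0x40/0x60 and the final ret are assumptions for illustration.
PROGRAM = {
    0x00: ("mov", None), 0x04: ("mov", None), 0x08: ("add", None),
    0x0c: ("cmp", None), 0x10: ("bne", 0x24), 0x14: ("call", 0x40),
    0x18: ("add", None), 0x1c: ("call", 0x60), 0x20: ("jmp", 0x28),
    0x24: ("mul", None), 0x28: ("sub", None), 0x2c: ("ret", None),
}

def traverse(program, entry):
    """Follow control flow from entry; return (block leaders, call targets, visited)."""
    leaders, call_targets, seen = {entry}, set(), set()
    work = deque([entry])
    while work:
        addr = work.popleft()
        while addr in program and addr not in seen:
            seen.add(addr)
            op, target = program[addr]
            nxt = addr + INSN_SIZE
            if op == "call":
                call_targets.add(target)        # goes on the function work list
            elif op == "jmp":
                leaders.add(target)             # unconditional: no fall-through
                work.append(target)
                break
            elif op == "bne":
                leaders.update((target, nxt))   # both successors start blocks
                work.append(target)
            elif op == "ret":
                break
            addr = nxt
    return leaders, call_targets, seen

leaders, call_targets, seen = traverse(PROGRAM, 0x00)
```

With this input the block leaders come out as 00, 14, 24 and 28, matching the blocks in the slide's diagram, and the two call targets are what a parser would push onto its function work list.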

11 Binary Parsing Challenges
• Pointer-based control transfer
• Non-returning calls
• Non-contiguous code sections
• Tail calls
• Gaps in the binary
• Exception handlers
• Shared code and multiple entry representation

12 Non-returning Call Sites
• Some functions will not return
  - Examples: abort, exit
• Code following the call site may not be valid
• Even if names are available, calls may be hard to detect: dfaerror, fatal, exit

13 Detecting Non-Returning Functions
• Goal: detect non-returning functions from first principles
• Identify distinguishing features of non-returning functions
  - The wide variety of behavior in non-returning functions makes this difficult
Example: operations in abort
    abort() ->
        sigaction()
        IO_flush_all()
        raise(SIGABRT) -> kill(getpid(), sig)
        hlt [privileged instruction]
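One simple way to approximate this goal is a fixed-point pass over the call graph: a function is treated as non-returning if no path through it reaches a return without first calling something already known not to return. The sketch below is an assumption-laden illustration, not the technique the talk commits to; the path representation and the call structure of `dfaerror`, `fatal` and `main` are hypothetical.

```python
def path_returns(path, non_returning):
    """True if this straight-line path reaches a ret before a non-returning call."""
    for op, arg in path:
        if op == "call" and arg in non_returning:
            return False
        if op == "ret":
            return True
    return False  # path ends without returning (e.g. hlt)

def find_non_returning(functions, seeds):
    """Grow the seed set until no further function is found to be non-returning."""
    non_returning = set(seeds)
    changed = True
    while changed:
        changed = False
        for name, paths in functions.items():
            if name in non_returning:
                continue
            if not any(path_returns(p, non_returning) for p in paths):
                non_returning.add(name)
                changed = True
    return non_returning

# Hypothetical call structure: each function is a list of paths,
# each path a list of (operation, argument) pairs.
FUNCTIONS = {
    "printf": [[("ret", None)]],
    "fatal": [[("call", "printf"), ("call", "exit")]],
    "dfaerror": [[("call", "fatal"), ("ret", None)]],
    "main": [[("call", "dfaerror"), ("ret", None)], [("ret", None)]],
}

result = find_non_returning(FUNCTIONS, seeds={"exit", "abort"})
```

The seeds still have to come from somewhere (names, or a recognizer for instructions such as `hlt`), which is exactly the part the slide describes as difficult.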

14 Non-returning Call Sites
Example: GNU libc library routines
    000214d0 :
    ...
    2160f: e8 cc db 0a 00    call cf1e0
    21614: e8 07 7f 00 00    call 29520
    21619: 90                nop
    2161a: 90                nop
    2161b: 90                nop
    2161c: 90                nop
    2161d: 90                nop
    2161e: 90                nop
    2161f: 90                nop
    00021620 :
    21620: 55                push %ebp
    21621: 89 e5             mov %esp,%ebp
    ...
• The call to abort does not return
• A naive parser will follow control into the following region
• Bytes following the call site may not be code (e.g., jump tables, other functions, string data)

15 Non-contiguous Code
(Diagram: Func Foo)
• Functions are not address ranges
  - Symbol table representation fails
• Many sources of non-contiguous layout:
  - Jump tables
  - Data (strings, etc.)
  - Unparsed code
  - Exception handlers
  - Padding or junk bytes

16 Non-contiguous Code
Example: Microsoft Word
    ...
    77e7b1cb: 83 41 04 04          addl $0x4,0x4(%ecx)
    77e7b1cf: 5d                   pop %ebp
    77e7b1d0: c2 0c 00             ret $0xc
    77e7b1d3: 68 f5 06 00 00       push $0x6f5
    77e7b1d8: eb 05                jmp 0x77e7b1df
    77e7b1da: 68 e6 06 00 00       push $0x6e6
    77e7b1df: e8 bb 86 02 00       call 0x77ea389f
    77e7b1e4: 4c ba e7 77
    77e7b1e8: 34 b2 e7 77
    77e7b1ec: b5 b1 e7 77
    77e7b1f0: 0c 9f e8 77
    77e7b1f4: 96 37 e8 77
    77e7b1f8: cf b1 e7 77
    77e7b1fc: 00 00 00 00 01 01 01 02 02 02 03 03 04 02 05
    77e7b20c: 3c 10                cmp $0x10,%al
    77e7b20e: 0f 85 a6 3b 02 00    jne 0x77e9edba
    ...
• A jump table separates valid instruction sequences
• Control following the call site is invalid
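To see why the bytes at 77e7b1e4 through 77e7b1fb are data rather than code, it helps to read them as little-endian 32-bit words. The sketch below does that with Python's `struct` module; treating the six words as jump table targets is an interpretation of the listing, not something the slide spells out.

```python
import struct

# The 24 bytes between the call at 77e7b1df and the byte table at 77e7b1fc.
raw = bytes.fromhex(
    "4c ba e7 77 34 b2 e7 77 b5 b1 e7 77 "
    "0c 9f e8 77 96 37 e8 77 cf b1 e7 77"
)

# "<6I" = six little-endian unsigned 32-bit integers.
entries = struct.unpack("<6I", raw)

for value in entries:
    print(hex(value))
```

Every decoded word lands in the same address neighborhood as the surrounding code, and the last one, 0x77e7b1cf, is the address of the `pop %ebp` instruction shown earlier in the listing. A linear disassembler would instead try to decode these bytes as instructions.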

17 Named Non-contiguous Sections
Example: GNU libc library routines
    00021060 :
    ....
    210f0: lock cmpxchg %ecx,0x2968(%ebx)
    210f8: jne 2118e
    210fe: xor %esi,%esi
    21100: cmp $0x6,%esi
    ...
    0002118e :
    2118e: lea 0x2968(%ebx),%ecx
    21194: call ea0f0
    21199: jmp 210fe
• Looks like shared code
• The fragment is not a real function

18 Named Non-contiguous Sections
• Recognizing function fragments:
  - Have a symbol table entry
  - Reached by branches from one function
  - Branch back to one function
• Use a combination of CFG and symbol table clues
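The three clues above combine naturally into a single predicate. The following is a minimal sketch under stated assumptions: the function name `func_21060` is a placeholder (the transcript does not preserve the real symbol names), and requiring the outgoing branches to return to the same function that branches in is one reading of the slide, not a confirmed rule.

```python
def looks_like_fragment(has_symbol, incoming, outgoing):
    """Heuristic: named region entered from, and branching back to, one function.

    incoming: names of functions that branch into the region
    outgoing: names of functions the region branches to (calls excluded)
    """
    incoming, outgoing = set(incoming), set(outgoing)
    return has_symbol and len(incoming) == 1 and outgoing == incoming

# Region at 2118e on slide 17: entered by the jne at 210f8, leaves via jmp 210fe.
fragment = looks_like_fragment(True, {"func_21060"}, {"func_21060"})

# A region entered from two different functions is more likely real shared code.
shared = looks_like_fragment(True, {"func_a", "func_b"}, {"func_a"})
```

A region that passes this test can be folded back into its parent function instead of being reported as a function of its own.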

19 Tail Calls
(Diagram: Func Foo ends in "... call", Func Bar ends in "... jmp", Func Quux ends in "... ret")
• The compiler has joined two functions into one
• Looks like non-contiguous shared code

20 Gap Parsing
(Diagram: Func Foo, an unidentified section of code, Func Bar)
• Gaps between known code regions may contain undiscovered functions
  - Targets of indirect calls
• Speculative parsing: pattern-based heuristics to recognize function prologues in gaps
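A pattern-based prologue heuristic can be as simple as a byte search. The sketch below looks for the `push %ebp; mov %esp,%ebp` sequence (bytes `55 89 e5`, as seen at address 21620 on slide 14) inside a gap. It illustrates the idea only; the single pattern and the function name `find_prologues` are assumptions, and a real parser would use more patterns and then validate each candidate by parsing from it.

```python
PROLOGUE = bytes.fromhex("55 89 e5")  # push %ebp; mov %esp,%ebp

def find_prologues(gap, base, pattern=PROLOGUE):
    """Return addresses inside the gap where the prologue pattern occurs."""
    hits = []
    pos = gap.find(pattern)
    while pos != -1:
        hits.append(base + pos)
        pos = gap.find(pattern, pos + 1)
    return hits

# Bytes from slide 14: seven nops starting at 0x21619, then the prologue.
gap = bytes.fromhex("90" * 7 + "55 89 e5")
candidates = find_prologues(gap, base=0x21619)
```

Each candidate address would then seed an ordinary control-flow traversal, so a false match is discarded if the bytes that follow do not decode into plausible code.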

21 Exceptions
• Exception handling code is normally unreachable
• Use information in the binary where available
  - Example: Linux ELF exception tables
Listing (includes a C++ style exception catch block):
    push %ebp
    mov %esp,%ebp
    push %ebx
    sub $0x24,%esp
    movl $0x6,0xfffffff8(%ebp)
    mov 0x8(%ebp),%eax
    mov %eax,(%esp)
    call 804aafa
    jmp 804abe9
    mov %eax,0xfffffff4(%ebp)
    cmp $0x2,%edx
    je 804ab58
    ...
    mov 0xfffffff4(%ebp),%eax
    mov %eax,(%esp)
    call 804a388
    add $0x24,%esp
    pop %ebx
    pop %ebp
    ret

22 Shared Code Models
(Diagram: Func A and Func B both flow into a Shared Code region)
• Code may be shared between functions
  - Multiple entry functions
  - Compiler optimizations
• Analysis tools must be able to recognize and handle overlapping control flow

23 Summary of Binary Analysis Techniques
• Control flow traversal is a powerful tool for addressing the challenges of modern binaries
  - Lying/missing symbol tables
  - Data/code disambiguation
  - Jump tables
• Speculative parsing techniques can be useful for expanding code coverage
  - Gaps in code
  - Indirect calls and branches

24 Incidence of Shared Code in Binaries
• Parsed 828 Linux/x86 binaries
  - 238 contained shared code
• Most binaries contain only a few code-sharing functions
• Some code sharing may be due to non-returning call sites

25 Where Do We Go From Here?
• Are there good solutions from first principles?
  - Almost certainly. We are just starting to explore the limits of such techniques.
• Are special case solutions necessary?
  - Again, almost certainly. We will try to use these as sparingly as possible.

26 Future Directions in Binary Analysis
• Problem: code exists but is unreachable through standard control-flow traversal parsing
  - Heuristics are a moving target
• Existing opportunistic parsing techniques can help, but only to an extent
  - Exception handlers and virtual function tables may be recoverable from the binary
• Given the information we can recover from traditional techniques, can we synthesize additional information that will increase coverage of the binary?

27 Statistical Binary Parsing
• Can we utilize known code to find unknown code?
  - We have a partial parse of the binary
  - Code in unknown regions of the binary will likely share characteristics with previously identified code
• Identify code in unknown regions:
  - Create a probabilistic model of valid code
  - Identify sections of unknown regions in the binary that are similar to valid code

28 Binary Modeling Techniques
• Code idioms are one possibility for validating potential code
  - Function preambles, jump table bounds tests, system call stubs, case statements
• Idioms can be identified manually
• A model can be trained to identify new idioms with machine learning techniques
  - n-gram models, long-distance interaction
• Unparsed code can be scored to indicate its statistical similarity to known code
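As a rough illustration of the n-gram idea, the sketch below trains a bigram model over opcode mnemonics from already-parsed code and scores candidate sequences by their average log-probability. It is a toy under stated assumptions: the training sequences are invented, add-one smoothing is chosen for simplicity, and nothing here is claimed to be the model the talk has in mind.

```python
import math
from collections import Counter

def train_bigrams(sequences):
    """Count opcode bigrams (with a start marker) over known-code sequences."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    vocab = {op for seq in sequences for op in seq}
    return unigrams, bigrams, vocab

def score(seq, model):
    """Average log-probability per opcode, with add-one smoothing."""
    unigrams, bigrams, vocab = model
    v = len(vocab) + 1  # one extra slot for opcodes never seen in training
    padded = ["<s>"] + list(seq)
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + v))
    return total / len(seq)

# Invented training data standing in for code found by control-flow traversal.
known = [
    ["push", "mov", "sub", "call", "ret"],
    ["push", "mov", "call", "pop", "ret"],
]
model = train_bigrams(known)

code_like = score(["push", "mov", "call", "ret"], model)
data_like = score(["ret", "ret", "pop", "push"], model)
```

Here the sequence that begins like a function preamble scores higher than the scrambled one, which is the property a gap parser would use to rank candidate regions.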

29 Open Questions in Binary Analysis
• What learning techniques will yield the best results?
• How can we overcome the relative dearth of information in binaries with very little code reachable through control flow analysis?
  - Incorporate information from analysis of other binaries
• What techniques will allow us to accurately identify the range of recognizable code?

30 Questions?

31 Backup Slides

32 Shared Code Models
(Diagram: a "Shared Code" model with Func A and Func B, next to a "Multiple Entry" model with Entry A and Entry B)
What is the difference from the perspective of the parser?

33 A Choice of Abstraction
• Shared code and multiple entry models are similar
  - Both represent independent flows of control merging together
• The shared model is a better fit for Dyninst
  - Preserves semantic guarantees of function independence

34 Shared Code
Example: GNU libc library routines
    000a94c0 :
    a94c0: cmpl $0x0,%gs:0xc
    a94c8: jne a94e7
    000a94ca :
    a94ca: push %ebx
    a94cb: mov 0x10(%esp,1),%edx
    a94cf: mov 0xc(%esp,1),%ecx
    a94d3: mov 0x8(%esp,1),%ebx
    a94d7: mov $0x7,%eax
    a94dc: int $0x80
    a94de: pop %ebx
    a94df: cmp $0xfffff001,%eax
    a94e4: jae a9513
    ...
Code common to the two functions is marked as shared.
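Under the shared code model, marking blocks as shared amounts to intersecting the sets of blocks reachable from each function entry. The sketch below applies that to the libc example; it is an illustration, not Dyninst's data structure. The edges come from the visible branches, the fall-through address 0xa94e6 after the `jae` is an assumption (the listing is elided there), and blocks whose contents are not shown are treated as leaves.

```python
def reachable(cfg, entry):
    """Return the set of block addresses reachable from entry."""
    seen, stack = set(), [entry]
    while stack:
        block = stack.pop()
        if block in seen:
            continue
        seen.add(block)
        stack.extend(cfg.get(block, ()))
    return seen

# block start -> successor block starts
CFG = {
    0xa94c0: [0xa94ca, 0xa94e7],   # jne a94e7, else fall through into a94ca
    0xa94ca: [0xa94e6, 0xa9513],   # jae a9513, else (assumed) fall through
}

blocks_a = reachable(CFG, 0xa94c0)   # function entered at a94c0
blocks_b = reachable(CFG, 0xa94ca)   # function entered at a94ca
shared = blocks_a & blocks_b
```

The entry block at a94c0 and the branch target a94e7 belong only to the first function, while everything from a94ca onward is reachable from both entries and would be marked as shared.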

