Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams.

Presentation transcript:

Three parsing stages
[Figure: a binary file shown as raw bytes (7a 77 0e 20 e9 3d e0 09 e8 68 c0 45 be 79 5e ...), then as decoded instructions (xchg %eax,%ecx; fdiv %st(3),%st; jmp *-0xf(%esi); add %edi,%ebp; jmp *-0x39(%ebp); mov 0xc(%esi),%eax), then as a CFG split into the functions foo and bar.]
1. Code Discovery
2. CFG Construction
3. CFG Partitioning

Parsing approaches
o Linear scan
o Control flow (recursive) traversal
o Speculative disassembly

Parsing approach: Linear scan
o Decodes instructions sequentially, starting from a specific point
o Easy to implement
o Discovers almost all the instructions
o Can confuse data with code
o Can be confused about instruction alignment
o Tools that use linear scan:
  o GNU Objdump
  o IDA Pro
  o CodeSurfer/x86
  o BAP
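The two failure modes listed for linear scan can be seen on a few bytes. The following is a minimal sketch, assuming a toy opcode table that knows only a handful of real one-byte x86 opcodes; the table, the `linear_scan` helper, and the sample bytes are illustrative assumptions, not part of any tool named on the slide.

```python
# Toy opcode table: a few real x86 opcodes with their total instruction length.
# Illustrative subset only; a real decoder handles prefixes, ModRM, and more.
OPCODES = {
    0x55: (1, "push %ebp"),
    0x90: (1, "nop"),
    0xC3: (1, "ret"),
    0x74: (2, "je rel8"),
    0xEB: (2, "jmp rel8"),
    0xB8: (5, "mov $imm32,%eax"),
    0xE8: (5, "call rel32"),
    0xE9: (5, "jmp rel32"),
}

def linear_scan(code, start=0):
    """Decode sequentially from `start`; skip one byte when decoding fails."""
    found = []
    off = start
    while off < len(code):
        entry = OPCODES.get(code[off])
        if entry is None or off + entry[0] > len(code):
            off += 1
            continue
        length, text = entry
        found.append((off, text))
        off += length
    return found

# Layout: mov $1,%eax (offset 0); jmp +4 (offset 5);
# four data bytes (offsets 7-10); ret (offset 11).
code = bytes.fromhex("b801000000" "eb04" "e8aabbcc" "c3")
scan = linear_scan(code)
for off, text in scan:
    print(f"{off:2d}: {text}")
```

In this sample the data byte e8 is decoded as a 5-byte call, which also swallows the real ret at offset 11: data is confused with code, and the scan loses instruction alignment.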

Parsing approach: Control flow traversal
o Decodes instructions starting from known function entry points and follows the program's control transfers
o Only looks at code and ignores data
o Completeness depends on the quality of indirect control flow analysis
o Depends on gap parsing to find code not reachable by the traversal
o Tools that use control flow traversal:
  o Dyninst
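A control flow traversal can be sketched as a worklist over branch targets. This is a minimal illustration, assuming a toy decoder for a few real x86 opcodes and assuming every call returns; the names `decode` and `traverse` and the sample bytes are hypothetical and are not Dyninst's API.

```python
import struct

# (length, mnemonic, struct format of the relative operand) for a toy subset.
OPCODES = {
    0x90: (1, "nop", None),
    0xC3: (1, "ret", None),
    0x74: (2, "je", "<b"),
    0xEB: (2, "jmp", "<b"),
    0xB8: (5, "mov", None),
    0xE8: (5, "call", "<i"),
    0xE9: (5, "jmp", "<i"),
}

def decode(code, off):
    """Return (length, mnemonic, branch target or None), or None on failure."""
    entry = OPCODES.get(code[off])
    if entry is None or off + entry[0] > len(code):
        return None
    length, name, rel_fmt = entry
    target = None
    if rel_fmt is not None:
        (rel,) = struct.unpack_from(rel_fmt, code, off + 1)
        target = off + length + rel  # relative to the next instruction
    return length, name, target

def traverse(code, entry_points):
    """Follow control transfers from the entry points; never touch other bytes."""
    seen = {}
    worklist = list(entry_points)
    while worklist:
        off = worklist.pop()
        while 0 <= off < len(code) and off not in seen:
            ins = decode(code, off)
            if ins is None:
                break
            length, name, target = ins
            seen[off] = name
            if target is not None:
                worklist.append(target)
            if name in ("jmp", "ret"):  # no fall-through after these
                break
            off += length               # je and call fall through (assumption)
    return dict(sorted(seen.items()))

# mov $1,%eax ; jmp +4 ; four data bytes ; ret
code = bytes.fromhex("b801000000" "eb04" "e8aabbcc" "c3")
reached = traverse(code, [0])
print(reached)
```

Here the four data bytes at offsets 7 to 10 are never decoded, because no control transfer leads to them. An indirect jump would leave `target` empty, which is why completeness depends on indirect control flow analysis.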

Parsing approach: Speculative disassembly
o Tries to decode instructions starting at every byte
o Will not miss any instruction
o Does not always find the actual start of an instruction
o May confuse code with data
o Tools that use speculative disassembly:
  o Dyninst (gap parsing only)
  o OllyDbg
  o SecondWrite
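Speculative disassembly can be sketched as a decode attempt at every offset. The snippet below is a toy illustration under the assumption of a small opcode table containing a few real x86 opcodes; `speculative_disassembly` and the sample bytes are hypothetical.

```python
# Toy opcode table: total instruction length and a label. Illustrative subset.
OPCODES = {
    0x90: (1, "nop"),
    0xC3: (1, "ret"),
    0x74: (2, "je rel8"),
    0xEB: (2, "jmp rel8"),
    0xB8: (5, "mov $imm32,%eax"),
    0xE8: (5, "call rel32"),
    0xE9: (5, "jmp rel32"),
}

def speculative_disassembly(code):
    """Try to decode one instruction at every byte offset."""
    candidates = {}
    for off in range(len(code)):
        entry = OPCODES.get(code[off])
        if entry is not None and off + entry[0] <= len(code):
            candidates[off] = entry[1]
    return candidates

# mov $1,%eax ; jmp +4 ; four data bytes ; ret
code = bytes.fromhex("b801000000" "eb04" "e8aabbcc" "c3")
candidates = speculative_disassembly(code)
for off, text in candidates.items():
    print(f"{off:2d}: {text}")
```

Every real instruction start (offsets 0, 5, and 11) appears among the candidates, but so does a bogus call at offset 7 inside the data, so a later step still has to decide which candidates are real.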

Difficulties of accurate parsing
[Figure: binary file. The symbol table lists the address of foo (0x4000) and the address of bar. The .text section shows code at 4000 ending in "jmp %eax", followed by data, code ending in "ret", an "xchg %eax,%eax" padding instruction, code at 4040 ending in "call _abort", and code ending in "jmp bar".]
Code Discovery:
o Code and data are mixed
o Compiler may insert padding
o Function symbols can be incomplete or missing
o Instructions may overlap
CFG Construction:
o Indirect control flow
o Non-returning functions
o Exception handling
CFG Partitioning:
o Binary functions are complex
o Functions may share code
o Function body may not be contiguous
o Functions do not always terminate in a return; sometimes they terminate in a jump (tail call) or a call instruction (non-returning function)

How well do other tools do?

Stage              Challenge                   GNU Objdump                      IDA Pro
Code Discovery     Code or data                0/2                              1/2
                   Missing symbols             recover 0/1227, false entry 0    recover 608/1227, false entry 408
                   Overlapping instructions    0/1
CFG Construction   Indirect control flow       0/6                              2/6
                   Non-returning functions     0/2                              1/2
CFG Partitioning   Functions sharing code      0/1
                   Non-contiguous functions    0/1
                   Tail calls                  0/1                              1/1

There is interaction between the parsing stages
[Figure: the three stages Code Discovery, CFG Construction, and CFG Partitioning.]

The ParseAPI approach
o Stages: Code Discovery, CFG Construction, CFG Partitioning
o Control flow (recursive) traversal
o CFG function representation model
o Challenge-specific analyses
[Figure: CFG with the entry point of foo and the entry point of bar. Legend: intraprocedural edge; interprocedural edge.]

ParseAPI's mechanisms and status

Stage              Challenge                  Our technique                              ParseAPI status
Code Discovery     Code or data               Control flow traversal                     Done
                   Missing symbols            Probabilistic gap parsing                  Integrating
                   Overlapping instructions   Control flow traversal                     Done
CFG Construction   Indirect control flow      Backward slicing + symbolic evaluation     Integrating
                   Non-returning functions    Might-ret analysis                         Done
                   Exception handling         Leveraging debugging information           Limited
CFG Partitioning   Complex functions          CFG function representation model          Done
                   Tail calls                 Stack analysis + CFG structure analysis    Prototyping

Challenge: Code or Data?

   80b134e: call ...
   80b1353: add $0x25025,%ebx
   ...
   80b15e5: mov %ebx,%ecx
   80b15e7: sub -0x24d88(%ebx,%eax,4),%ecx   # [80b15f0+eax*4]
   80b15ee: jmp *%ecx

Jump table data:

   Entry   Address   Bytes         Data value
   eax=0   80b15f0   e0 4c 02 00   0x24ce0
   eax=1   80b15f4   98 4a 02 00   0x24a98
   eax=2   80b15f8   a6 4a 02 00   0x24aa6
   eax=3   80b15fc   b3 4a 02 00   0x24ab3

However, if the jump table data bytes are misinterpreted as code:

   80b15f0: e0 4c   loopne 80b163e
   80b15f2: 02 00   add (%eax),%al
   80b15f4: 98      cwtl
   80b15f5: 4a      dec %edx
   80b15f6: 02 00   add (%eax),%al
   80b15f8: a6      cmpsb %es:(%edi),%ds:(%esi)
   80b15f9: 4a      dec %edx
   80b15fa: 02 00   add (%eax),%al
   80b15fc: b3 4a   mov $0x4a,%bl
   80b15fe: 02 00   add (%eax),%al

Challenge: Overlapping instructions

   3fe9e8: je 3fe9eb
   3fe9ea: lock cmpxchg %ecx,0x35b0(%ebx)
   3fe9f2: jne 3ff740

   Address (3fe9..)   e8 e9       ea     eb ec ed ee ef f0 f1        f2~f7
   Bytes              74 01       f0     0f b1 8b b0 35 00 00        0f..00
   Decoded from ea    je 3fe9eb   lock cmpxchg %ecx,0x35b0(%ebx)     jne 3ff740
   Decoded from eb                       cmpxchg %ecx,0x35b0(%ebx)

In ParseAPI, overlapping instructions are in separate basic blocks that have the same predecessors and successors in the CFG.
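The overlap on this slide can be checked with a little arithmetic. This is a small sketch, assuming basic blocks are modelled as half-open address ranges; the helper names are hypothetical, and the block boundaries are read off the slide (the jne occupies f2 through f7, so it ends at 3fe9f8).

```python
def je_rel8_target(addr, rel8):
    """Target of a 2-byte `je rel8`: relative to the next instruction."""
    return addr + 2 + rel8

def overlapping_pairs(blocks):
    """Return pairs of half-open [start, end) blocks that share bytes."""
    pairs = []
    for i, (s1, e1) in enumerate(blocks):
        for s2, e2 in blocks[i + 1:]:
            if s1 < e2 and s2 < e1:
                pairs.append(((s1, e1), (s2, e2)))
    return pairs

# Bytes 74 01 at 3fe9e8: je with an 8-bit displacement of 1.
target = je_rel8_target(0x3FE9E8, 0x01)

blocks = [
    (0x3FE9E8, 0x3FE9EA),  # je 3fe9eb
    (0x3FE9EA, 0x3FE9F2),  # lock cmpxchg (fall-through path)
    (0x3FE9EB, 0x3FE9F2),  # cmpxchg without the lock prefix (taken path)
    (0x3FE9F2, 0x3FE9F8),  # jne 3ff740
]
pairs = overlapping_pairs(blocks)
print(hex(target), [(hex(a[0]), hex(b[0])) for a, b in pairs])
```

The je skips exactly one byte, the lock prefix, so both paths execute the same cmpxchg bytes. The two middle blocks overlap, and both sit between the je block and the jne block.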

Challenge: Indirect control flow
1. Backward slice on jmpq: all instructions that calculate the jump targets

   c8a42a: cmp $0xc,%dil
   c8a42e: ja c8a518
   c8a434: lea 0x31fd41(%rip),%r8
   c8a43b: movzbl %dil,%edi
   c8a43f: mov %rsi,%rbp
   c8a442: movslq (%r8,%rdi,4),%rax
   c8a446: add %rax,%r8
   c8a449: jmpq *%r8

2. Symbolically evaluate the jump target

   c8a42a and c8a42e: dil = RDI and RDI ≤ 12
   c8a434: r8 = rip + 31fd41 = faa17c   (rip is the address of the next instruction, c8a43b)
   c8a43b: RDI ≤ 12
   c8a442: rax = [r8 + rdi × 4] = [faa17c + RDI × 4]
   c8a446: r8 = r8 + rax = faa17c + [faa17c + RDI × 4]

   Variable   Symbolic value
   rax        [faa17c + RDI × 4]
   rdi        RDI (RDI ≤ 12)
   r8         faa17c + [faa17c + RDI × 4]
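The symbolic evaluation step can be sketched as forward substitution over the sliced instructions. This is a minimal illustration, assuming a hand-written tuple encoding of three instructions from the slice; the `evaluate` function and the tuple format are hypothetical and much simpler than a real symbolic evaluator.

```python
def evaluate(slice_ops):
    """Substitute register definitions forward; values are expression strings."""
    env = {}

    def val(reg):
        # A register with no definition in the slice stays symbolic.
        return env.get(reg, reg)

    for op in slice_ops:
        kind = op[0]
        if kind == "lea_rip":            # ("lea_rip", next_ip, disp, dst)
            _, next_ip, disp, dst = op
            env[dst] = hex(next_ip + disp)
        elif kind == "load":             # ("load", base, index, scale, dst)
            _, base, index, scale, dst = op
            env[dst] = f"[{val(base)} + {val(index)}*{scale}]"
        elif kind == "add":              # ("add", src, dst)
            _, src, dst = op
            env[dst] = f"{val(dst)} + {val(src)}"
    return env

slice_ops = [
    ("lea_rip", 0xC8A43B, 0x31FD41, "r8"),  # lea 0x31fd41(%rip),%r8
    ("load", "r8", "rdi", 4, "rax"),        # movslq (%r8,%rdi,4),%rax
    ("add", "rax", "r8"),                   # add %rax,%r8
]
env = evaluate(slice_ops)
print(env["r8"])
```

The rip-relative lea is resolved with the address of the following instruction (c8a43b), which yields the table base faa17c used in the jump target expression.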

Challenge: Indirect control flow
Variation and complexity of indirect control flow
o Jump target formulas can involve 0 or more memory accesses
  [Formula shapes shown on the slide: ± ×, [ ± × ], ± [ ± × ]]
o Variation of formula elements
  o Table stride can vary; we have seen 2, 4 and 8
  o Table index can come from any register or a memory location
  o Offset base and table base may be computed by various instructions
o Table index condition
  o Comparison and conditional jump instruction pair
  o Bit operation "and": and 0x3,%rax implies rax ≤ 3
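The two kinds of table index condition both give an inclusive upper bound on the index, which in turn bounds the number of table entries to read. The sketch below is an illustration under that assumption; the guard encoding and function names are hypothetical.

```python
def index_bound(guard):
    """Largest possible table index implied by a guard.

    ("cmp_ja", imm): `cmp $imm,idx` followed by `ja default`, so idx <= imm
                     on the path that reaches the indirect jump (unsigned).
    ("and", imm):    `and $imm,idx`, so idx <= imm because (x & imm) <= imm.
    """
    kind, imm = guard
    if kind in ("cmp_ja", "and"):
        return imm
    raise ValueError(f"unknown guard kind: {kind}")

def entry_addresses(table_base, stride, guard):
    """Addresses of all table entries the index can select."""
    return [table_base + i * stride for i in range(index_bound(guard) + 1)]

# cmp $0xc,%dil ; ja ...  with a 4-byte stride
by_compare = entry_addresses(0xFAA17C, 4, ("cmp_ja", 0xC))
# and 0x3,%rax  with an 8-byte stride (table base 0x1000 is made up)
by_mask = entry_addresses(0x1000, 8, ("and", 0x3))
print(len(by_compare), len(by_mask))
```

A bound of 12 gives 13 entries and a mask of 0x3 gives 4 entries; the stride decides how far apart they are.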

Challenge: Non-returning functions
Looks like the call is inside a loop

Challenge: Non-returning functions
CFG of the function when not considering non-returning functions
[Figure: CFG. Legend: superficial call fall-through edge; function call edge; intraprocedural edge in loop; intraprocedural edge not in loop.]
The loop is created by superficial call fall-through edges

Challenge: Non-returning functions
CFG of the function: no loop
[Figure: CFG. Legend: intraprocedural edge; function call edge.]

Might-ret analysis
o Return status of a function can be "unknown", "might ret", or "does not ret"
o Calculate a fixed point for all functions
1. Known functions: does not ret
   exit, _exit, abort, __f90_stop, fancy_abort, __stack_chk_fail, __assert_fail, ExitProcess, ...
2. Reaches a ret: might ret
   push %ebp
   mov %esp,%ebp
   sub $0x18,%esp
   mov (%eax),%eax
   leave
   ret
3. Call (foo calls bar): depends on bar's ret status
   unknown: first analyze bar
   might ret: continue in foo
   does not ret: stop
4. Recursion (status still unknown): foo and bar do not ret
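The four rules can be sketched as a fixed-point iteration. This is a simplified illustration, not ParseAPI's implementation: it assumes each function body is a straight-line list of calls and rets, treats unlisted callees as returning, and reads rule 4 as "functions still unknown at the fixed point do not ret". All function names in the example are made up.

```python
KNOWN_NORETURN = {"exit", "_exit", "abort", "__stack_chk_fail", "__assert_fail"}

def walk(body, status):
    """Walk a straight-line body and report what it tells us about the function."""
    for op in body:
        if op[0] == "ret":
            return "might ret"             # rule 2: a ret is reached
        callee = op[1]
        if callee in KNOWN_NORETURN:
            return "does not ret"          # rule 1: stop at a known function
        callee_status = status.get(callee, "might ret")  # assumption for externals
        if callee_status != "might ret":
            return callee_status           # rule 3: stop, or wait for the callee
    return "does not ret"                  # no ret reachable

def might_ret_analysis(funcs):
    status = {name: "unknown" for name in funcs}
    changed = True
    while changed:                         # iterate to a fixed point
        changed = False
        for name, body in funcs.items():
            if status[name] == "unknown":
                new = walk(body, status)
                if new != "unknown":
                    status[name] = new
                    changed = True
    for name in list(status):              # rule 4: unresolved recursion
        if status[name] == "unknown":
            status[name] = "does not ret"
    return status

funcs = {
    "leaf": [("ret",)],
    "die": [("call", "abort"), ("ret",)],
    "foo": [("call", "leaf"), ("call", "die"), ("ret",)],
    "ping": [("call", "pong"), ("ret",)],
    "pong": [("call", "ping"), ("ret",)],
}
status = might_ret_analysis(funcs)
print(status)
```

In the example, foo is found non-returning only after die has been resolved, and the mutually recursive ping and pong never leave "unknown" during iteration, so the final pass marks them as not returning.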

Conclusions
o This work was driven by our experiences with real binaries
o We discussed the challenges in three parsing stages
o We analyzed code examples in detail and discussed how we address these challenges

Missing Symbols
o Stripped binaries
o Pattern matching
o Probabilistic parsing

Indirect control flow

           mov 0x10(%rbx),%rax
           test %rax,%rax
   44575c: je 4457c...
   44575e: pop %rbx
   44575f: pop %rbp
           mov %r12,%rdi
           pop %r...
           pop %r...
           pop %r...
           jmpq *%rax

1. Backward slice
2. Symbolic evaluation
3. Bound fact analysis
4. Get jump targets
Symbolic expression: jmpq [rbx+0x10]
Bound facts:
Calculate and check jump targets: cannot find any candidate jump targets!

Complex functions

   38958db680 <__write>:
     b680: cmpl $0x0,0x2b6fe9(%rip)
     b687: jne 38958db...
   38958db689 <__write_nocancel>:      (entry of __write_nocancel)
     b689: mov $0x1,%eax
     ...
     b696: jae 38958db6c9
     b698: retq
     b699: sub $0x8,%rsp
     ...
     b6c6: jae 38958db6c9
     b6c8: retq
     b6c9: mov 0x2b18d0(%rip),%rcx
     ...
     b6dc: jmp 38958db6c8

Jump into another function!

   Block1: [b680, b689)
   Block2: [b689, b698)
   Block3: [b698, b699)
   Block4: [b699, b6c8)
   Block5: [b6c8, b6c9)
   Block6: [b6c9, b6df)

MECFG Model:
   __write: block 1,2,3,4,5,6
   __write_nocancel: block 2,3
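A function representation in which blocks may belong to several functions can be sketched as a mapping from function names to sets of block numbers. The snippet below only restates the block assignment given on the slide; the helper names are hypothetical and it does not show how ParseAPI derives the assignment.

```python
from collections import defaultdict

# Block membership as listed on the slide.
functions = {
    "__write": {1, 2, 3, 4, 5, 6},
    "__write_nocancel": {2, 3},
}

def block_owners(functions):
    """Invert the mapping: block number -> set of functions containing it."""
    owners = defaultdict(set)
    for name, blocks in functions.items():
        for block in blocks:
            owners[block].add(name)
    return owners

def shared_blocks(functions):
    """Blocks that belong to more than one function."""
    owners = block_owners(functions)
    return sorted(b for b, names in owners.items() if len(names) > 1)

shared = shared_blocks(functions)
print(shared)
```

Because membership is a set per function rather than a single owner per block, blocks 2 and 3 can be reported for both functions without duplicating code.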

Challenge: Indirect control flow
1. Backward slice on jmpq
2. Symbolic evaluation
3. Bound fact analysis
4. Get jump targets

   c8a42a: cmp $0xc,%dil
   c8a42e: ja c8a518
   c8a434: lea 0x31fd41(%rip),%r8
   c8a43b: movzbl %dil,%edi
   c8a43f: mov %rsi,%rbp
   c8a442: movslq (%r8,%rdi,4),%rax
   c8a446: add %rax,%r8
   c8a449: jmpq *%r8

Jump target expression and condition:
   jmpq [rdi*4+0xfaa17c]+0xfaa17c
   rdi <= 0xc

Jump table:
   rdi=0   faa17c: d4 02 ce ff   # -0x31fd2c   jmpq c8a450
   rdi=1   faa180: 14 03 ce ff   # -0x31fcec   jmpq c8a490
   rdi=2   faa184: 44 03 ce ff   # -0x31fcbc   jmpq c8a4c0
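The last step, reading the table and applying the jump target expression, can be sketched as follows. This is an illustration only: the table bytes are rebuilt from the three offsets shown on the slide, and `jump_targets` is a hypothetical helper.

```python
import struct

TABLE_BASE = 0xFAA17C

def jump_targets(table, table_base, count):
    """Evaluate table_base + [table_base + i*4] for i in range(count).

    Each entry is a signed 32-bit little-endian offset, matching the
    sign-extending movslq load in the slice.
    """
    targets = []
    for i in range(count):
        (offset,) = struct.unpack_from("<i", table, 4 * i)
        targets.append(table_base + offset)
    return targets

# First three table entries, packed from the offsets given on the slide.
table = struct.pack("<3i", -0x31FD2C, -0x31FCEC, -0x31FCBC)
targets = jump_targets(table, TABLE_BASE, 3)
print([hex(t) for t in targets])
```

With the bound rdi <= 0xc, a full analysis would read 13 entries; only the three shown on the slide are used here.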

Tail calls
o Identifying tail calls is difficult because known function entry points may be incomplete

   42b678: pop %rbp
   42b679: pop %r12
   42b67b: pop %r13
   42b67d: pop %r14
   42b67f: jmpq 42a5f0

Is 42a5f0 a function entry or not?

Tail calls

   f92a0:
     f92a0: add $0xfffffffffffffff0,%rdi
     f92a4: jmp ...
     f92a6: nopw %cs:0x0(%rax,%rax,1)
     f92ad:
   f92b0:
     f92b0: push %rbx
     f92b1: mov %rdi,%rbx
     ...

Code added by the compiler to implement virtual functions. We do not have evidence to decide whether:
(a) A function tail calls the other one
(b) Two functions share code

Tail call analysis
o Indication of tail calls
o Control flow structures
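One possible indication of a tail call is a jump that leaves the function after the stack frame has been torn down. The following is a rough sketch of that heuristic, not the analysis used in ParseAPI: the instruction encoding, the function range, and the pushes in the example are assumptions made for illustration.

```python
def stack_height_at_end(instrs, word=8):
    """Net stack height change over a path (0 means the frame is torn down)."""
    height = 0
    for mnemonic, arg in instrs:
        if mnemonic == "push":
            height -= word
        elif mnemonic == "pop":
            height += word
        elif mnemonic == "sub_rsp":
            height -= arg
        elif mnemonic == "add_rsp":
            height += arg
    return height

def looks_like_tail_call(instrs, jump_target, func_range):
    """Heuristic: balanced stack and a jump target outside [lo, hi)."""
    lo, hi = func_range
    balanced = stack_height_at_end(instrs) == 0
    leaves_function = not (lo <= jump_target < hi)
    return balanced and leaves_function

# Hypothetical prologue pushes, then the four pops from the slide before jmpq.
path = [
    ("push", "%rbp"), ("push", "%r12"), ("push", "%r13"), ("push", "%r14"),
    ("pop", "%rbp"), ("pop", "%r12"), ("pop", "%r13"), ("pop", "%r14"),
]
func_range = (0x42B600, 0x42B684)   # assumed extent of the enclosing function
candidate = looks_like_tail_call(path, 0x42A5F0, func_range)
unbalanced = looks_like_tail_call(path[:-1], 0x42A5F0, func_range)
print(candidate, unbalanced)
```

A heuristic like this only gives evidence. It still depends on knowing the function's extent, which is exactly what incomplete entry points make uncertain.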