ByteWeight: Learning to Recognize Functions in Binary Code

Presentation transcript:

ByteWeight: Learning to Recognize Functions in Binary Code
Tiffany Bao, Jonathan Burket, Maverick Woo, Rafael Turner, David Brumley
Carnegie Mellon University
USENIX Security ’14
I am very pleased to be here to talk about our work, ByteWeight. This is joint work with …

Function Information
Binary analyses that rely on function information: Malware Analysis, Vulnerability Signature Generation, Decompiler, Binary Reuse, Control Flow Integrity (CFI), …
[Figure: a raw bit string divided into Function 1, Function 2, and Function 3.]
There are many kinds of binary analysis, such as … Many of them need function information. For example, the Phoenix decompiler, published last year at USENIX Security, needs function start information to construct a control flow graph for each function.

Can we automatically and accurately recover function information from stripped binaries?
Binary analyses affected: Decompiler, Control Flow Integrity (CFI), Binary Reuse, Malware Analysis, Vulnerability Signature Generation, …
[Figure: a stripped binary, shown as a raw bit string, from which Function 1, Function 2, and Function 3 must be recovered.]
A stripped binary is essentially one from which function names and other symbols have been discarded. Stripped binaries are very common, especially among those delivered to the public. Without function information, analyses such as decompilation and binary-level Control Flow Integrity may no longer work. So our question is: can we automatically and accurately recover function information from stripped binaries? Recovery must be automatic to meet scalability requirements, and it must be accurate because inaccurate function recovery affects downstream analysis. We start our discussion with a simple example.

Example: GCC
Source Code

    #include <stdio.h>

    int fac(int x){
        if (x == 1) return 1;
        else return x * fac(x - 1);
    }

    void main(int argc, char **argv){
        printf("%d", fac(10));
    }

We start our discussion with a simple example. Suppose we have a stripped binary that calculates a number's factorial, and we want to find the functions fac and main in it. A naïve method is to match the function prologue, such as push %ebp; mov %esp,%ebp.

Example: GCC
-O0: Default

    08048443 <main>:
        push %ebp
        mov  %esp,%ebp
        and  $0xfffffff0,%esp
        sub  $0x10,%esp
        …
    0804841c <fac>:
        push %ebp
        mov  %esp,%ebp
        sub  $0x18,%esp
        cmpl $0x1,0x8(%ebp)
        jne  804842f <fac+0x13>
        mov  $0x1,%eax
        …

If the binary is compiled by gcc with -O0, the prologue match finds both functions, and we are very happy.

Example: GCC
-O1: Optimize

    0804841c <fac>:
        push %ebx
        sub  $0x18,%esp
        mov  0x20(%esp),%ebx
        mov  $0x1,%eax
        cmp  $0x1,%ebx
        …

-O2: Optimize Even More

    08048330 <main>:
        mov  $0x1,%edx
        mov  $0xa,%eax
        lea  0x0(%esi),%esi
        …
        push %ebp
        mov  %esp,%ebp
        and  $0xfffffff0,%esp
        sub  $0x10,%esp

However, if the binary is compiled with -O1, we miss function fac. And if the binary is compiled with -O2, we falsely identify function main.
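The naïve prologue match discussed in this example can be sketched in a few lines of Python. This is only an illustration, not the authors' tool: the function name `naive_function_starts` and the two byte strings are assumptions made for the sketch. The byte strings are hand-assembled from the instructions shown on the slides (`push %ebp; mov %esp,%ebp` is `55 89 e5` on 32-bit x86).

```python
# Naive function-start finder: scan for the classic 32-bit x86 frame-pointer
# prologue `push %ebp; mov %esp,%ebp`, which assembles to the bytes 55 89 e5.
PROLOGUE = bytes.fromhex("5589e5")


def naive_function_starts(code: bytes, base: int = 0) -> list[int]:
    """Return every address at which the prologue byte pattern occurs."""
    starts = []
    pos = code.find(PROLOGUE)
    while pos != -1:
        starts.append(base + pos)
        pos = code.find(PROLOGUE, pos + 1)
    return starts


# Hypothetical -O0-style start: push %ebp; mov %esp,%ebp; sub $0x18,%esp
o0_like = bytes.fromhex("5589e583ec18")
# Hypothetical -O1-style start: push %ebx; sub $0x18,%esp; mov 0x20(%esp),%ebx
o1_like = bytes.fromhex("5383ec188b5c2420")

print(naive_function_starts(o0_like))  # the prologue is found at offset 0
print(naive_function_starts(o1_like))  # nothing matches, so the function is missed
```

A fixed pattern like this has no way to adapt when the compiler drops the frame pointer, which is the gap a learned model is meant to close.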

Current Industry Solution: IDA

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #define MAX 128

    /* IDA misses this function */
    void sum(char a[MAX], char b[MAX]){
        printf("%s + %s = %d\n", a, b, atoi(a) + atoi(b));
    }

    /* IDA misses this function */
    void sub(char a[MAX], char b[MAX]){
        printf("%s - %s = %d\n", a, b, atoi(a) - atoi(b));
    }

    /* IDA misses this function */
    void assign(char a[MAX], char b[MAX]){
        char pre_b[MAX];
        strcpy(pre_b, b);
        strcpy(b, a);
        printf("b is changed from %s to %s\n", pre_b, b);
    }

    int main(){
        void (*funcs[3]) (char x[MAX], char y[MAX]);
        int f;
        char a[MAX], b[MAX];
        funcs[0] = sum;
        funcs[1] = sub;
        funcs[2] = assign;
        scanf("%d %s %s", &f, &a, &b);
        (*funcs[f])(a, b);
        return 0;
    }

We also looked for answers from the current industry solution, IDA. We compiled another example, essentially a calculator, with gcc -O3, and found that IDA failed to identify all three calculation functions, including assign, a vulnerable function that can be exploited through a buffer overflow. In fact, we will see later that IDA misses or falsely identifies a large number of functions. This led us to investigate the function identification problem.
Meeting notes (8/12/14 17:14): v-table in C++; function pointers.

Function Identification Problems
Given a stripped binary, return one of the following:
- A list of function start addresses: the “Function Start Identification (FSI) Problem”
- A list of function (start, end) pairs: the “Function Boundary Identification (FBI) Problem”
- A list of functions as sets of instruction addresses: the “Function Identification (FI) Problem”
In this work, we consider three variants of this problem. The first variant is the FSI problem, which aims to output…

ByteWeight
A machine learning + program analysis approach to function identification
Training: create a model of function start patterns using supervised machine learning.
Usage:
- Use the trained model to match function starts in stripped binaries (Function Start Identification).
- Use program analysis to identify all bytes associated with a function (Function Identification).
- Calculate the minimum and maximum addresses of each function (Function Boundary Identification).
We address function start identification using machine learning and function identification using program analysis. Finally, we extract the minimum and maximum addresses of each identified function to address the function boundary identification problem.

Function Start Identification
Previous approaches
Our approach
We begin the function identification problems with FSI.

Previous Work: Rosenblum et al. [1]
Method: select instruction idioms up to length 4; learn idiom parameters; label test binaries.
Entry idiom: push ebp | * | mov esp,ebp
Prefix idiom: ret | int3
“Feature (idiom) selection for all three data sets (1,171 binaries) consumed over 150 compute-days of machine computation”
[Figure: test-binary bytes b1 … b8; the value 0.84 appears as an example score.]
The machine learning approach to FSI was first proposed by Rosenblum et al. Their method first selects the most representative instruction sequences, which they call idioms, that either occur at a function start or are immediately followed by one. They then learn idiom parameters and, in testing, score each byte of a test binary with a combination of the parameters of the idioms that match there. The main issue with this approach is scalability. According to their paper, feature selection alone consumed over 150 compute-days, which is a large training overhead. So we propose ByteWeight.
[1] N. E. Rosenblum, X. Zhu, B. P. Miller, and K. Hunt. Learning to Analyze Binary Computer Code. In Proceedings of the 23rd National Conference on Artificial Intelligence (2008), AAAI, pp. 798–804.

ByteWeight: Lighter (Linear) Method
Training: Training Binaries → Extraction → Extracted Sequences → Weight Calculation → Weighted Sequences → Tree Generation → Weighted Prefix Tree
Testing: Testing Binary → Classification → Function Start → Function Identification (CFG Recovery, RFCR) → Function Bytes → Function Boundary Identification → Function Boundary
This is a diagram of our design. Training and Classification address the FSI problem and are marked in red. After FSI, we use the identified function starts to find the instructions in each function, which we mark in blue. In the end, we calculate the minimum and maximum addresses of each function, which we mark in green.

Step 1: Extract All ≤ K-length Sequences

    0000000100000e3b <func_1>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        48 83 ec 10             sub    $0x10,%rsp
        89 7d fc                mov    %edi,-0x4(%rbp)
        89 75 f8                mov    %esi,-0x8(%rbp)
        8b 55 f8                mov    -0x8(%rbp),%edx
        8b 45 fc                mov    -0x4(%rbp),%eax
        89 c6                   mov    %eax,%esi
        48 8d 3d c0 00 00 00    lea    0xc0(%rip),%rdi
        b8 00 00 00 00          mov    $0x0,%eax
        e8 86 00 00 00          callq  100000ee8
        c9                      leaveq
        c3                      retq

Bytes (hexadecimal):
    55
    55 48
    55 48 89
    55 48 89 e5
    …
Instructions:
    push %rbp
    push %rbp; mov %rsp,%rbp
    push %rbp; mov %rsp,%rbp; sub $0x10,%rsp
    push %rbp; mov %rsp,%rbp; sub $0x10,%rsp; mov %edi,-0x4(%rbp)
    …
In training, the first step is… The learning data are unstripped binaries, which contain function start information. So we extract the first K or fewer bytes or instructions at the beginning of each function. For example, given function func_1, we can extract either its first K or fewer bytes or its first K or fewer instructions.
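The byte-level variant of this extraction step can be sketched as follows. This is a minimal illustration rather than the authors' code; the helper name `byte_prefixes` is an assumption, and the input is just the first eight bytes of func_1 from the slide.

```python
def byte_prefixes(function_bytes: bytes, k: int) -> list[bytes]:
    """Return the first 1, 2, ..., k bytes of a function (fewer if it is short)."""
    limit = min(k, len(function_bytes))
    return [function_bytes[:n] for n in range(1, limit + 1)]


# First bytes of func_1 from the slide: push %rbp; mov %rsp,%rbp; sub $0x10,%rsp
func_1 = bytes.fromhex("554889e54883ec10")

for prefix in byte_prefixes(func_1, k=4):
    print(prefix.hex(" "))
```

With k = 4 this prints the four byte sequences listed on the slide: 55, 55 48, 55 48 89, and 55 48 89 e5.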

Step 2: Weight Sequences

    0000000100000e3b <_func_1>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        48 83 ec 10             sub    $0x10,%rsp
        89 7d fc                mov    %edi,-0x4(%rbp)
        89 75 f8                mov    %esi,-0x8(%rbp)
        8b 55 f8                mov    -0x8(%rbp),%edx
        8b 45 fc                mov    -0x4(%rbp),%eax
        89 c6                   mov    %eax,%esi
        48 8d 3d c0 00 00 00    lea    0xc0(%rip),%rdi
        b8 00 00 00 00          mov    $0x0,%eax
        e8 86 00 00 00          callq  100000ee8
        c9                      leaveq
        c3                      retq
    0000000100000e64 <_func_2>:
        48 83 ec 16             sub    $0x16,%rsp
        48 8d 3d a6 00 00 00    lea    0xa6(%rip),%rdi
        e8 5d 00 00 00          callq  100000ee8

push %rbp -> 55
score: 2 / (2 + 2) = 0.5
The second step is to weight the extracted sequences. For each sequence, we take its corresponding bytes and count how many times those bytes occur at a function start and how many times they occur elsewhere.

Step 2: Weight Sequences
[Listing of _func_1 and _func_2, as on the previous slide.]
push %rbp; mov %rsp,%rbp -> 55 48 89 e5
score: 2 / (2 + 0) = 1.0
Similarly, if we weight..., we find that its corresponding bytes occur twice, and both occurrences are at a function start. After weighting the sequences, we obtain a list of extracted sequences along with their scores.
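The weight on these two slides is the number of occurrences at a function start divided by the total number of occurrences. A minimal sketch of that ratio follows. The helper names and the toy training data are assumptions; the toy bytes are built so that 55 occurs twice at a start and twice elsewhere, reproducing the slide's 0.5 and 1.0.

```python
def find_all(haystack: bytes, needle: bytes) -> list[int]:
    """Return every offset (overlaps included) at which needle occurs."""
    offsets, pos = [], haystack.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = haystack.find(needle, pos + 1)
    return offsets


def weight(seq: bytes, binaries: list[tuple[bytes, set[int]]]) -> float:
    """Fraction of occurrences of seq that sit exactly at a known function start."""
    at_start = not_at_start = 0
    for code, starts in binaries:
        for off in find_all(code, seq):
            if off in starts:
                at_start += 1
            else:
                not_at_start += 1
    total = at_start + not_at_start
    return at_start / total if total else 0.0


# Hypothetical training binary: two 8-byte functions starting at offsets 0 and 8.
# Each begins with 55 48 89 e5 and also contains 55 inside a later instruction.
TRAINING = [(bytes.fromhex("554889e58b55f8c3" "554889e58b55fcc3"), {0, 8})]

print(weight(bytes.fromhex("55"), TRAINING))        # 2 / (2 + 2)
print(weight(bytes.fromhex("554889e5"), TRAINING))  # 2 / (2 + 0)
```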

Step 3: Generate Weighted Prefix Tree
Weighted sequences:
    push %rbp -> 2 / (2 + 2) = 0.5
    push %rbp; mov %rsp,%rbp -> 2 / (2 + 0) = 1.0
    ...
Weighted prefix tree:
    push %rbp (55): 2 / (2 + 2) = 0.5
        mov %rsp,%rbp (48 89 e5): 2 / (2 + 0) = 1.0
            sub $0x10,%rsp (48 83 ec 10): 1 / (1 + 0) = 1.0
            …
    sub $0x16,%rsp (48 83 ec 16)
    …
Then, in step 3, we store them in a weighted prefix tree (WPT). In this tree, each node represents... The WPT stores the weighted sequences in less space and also speeds up later classification.
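A weighted prefix tree of this kind is an ordinary trie whose nodes carry a weight. The sketch below is one possible representation, not necessarily the authors' data structure; the `Node` class and `insert` helper are assumptions, and the weights are the ones shown on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    weight: float = 0.0
    children: dict[str, "Node"] = field(default_factory=dict)


def insert(root: Node, sequence: list[str], weight: float) -> None:
    """Walk or create one node per instruction and store the weight at the end."""
    node = root
    for instr in sequence:
        node = node.children.setdefault(instr, Node())
    node.weight = weight


root = Node()
insert(root, ["push %rbp"], 0.5)
insert(root, ["push %rbp", "mov %rsp,%rbp"], 1.0)
insert(root, ["push %rbp", "mov %rsp,%rbp", "sub $0x10,%rsp"], 1.0)

# Sequences that share a prefix share nodes, so the three inserts above
# create three nodes in total rather than six.
push = root.children["push %rbp"]
print(push.weight, push.children["mov %rsp,%rbp"].weight)
```

Sharing prefixes is what keeps the stored model small, and a lookup only has to follow one path from the root.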

Classification
Test binary: 00 00 00 00 e8 5d 00 00 00 c9 c3 55 48 89 e5 48 83 ec 60 48 8d 05 9f ff ff ff 48 89 45 b8 48 8d 05 bd ff ff ff 48 89 45 c0 48 8d 4d a8 48 8d 55 ac 48 8d 45 a4
Weighted prefix tree nodes: push %rbp (55); mov %rsp,%rbp (48 89 e5); sub $0x10,%rsp (48 83 ec 10); sub $0x16,%rsp (48 83 ec 16); …
[Figure: bytes of the test binary annotated with scores 0.0, 0.5, and 1.0 against a threshold; the byte run 55 48 89 e5 48 83 ec 60 is highlighted.]
In classification, we score each test byte according to the tree and label it a function start if its score is larger than a default threshold. For each test byte, we disassemble starting at that byte and find the longest path in the tree that matches the disassembly. If we want to score the red hex 55 …
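The longest-match scoring can be sketched at the byte level, which avoids needing a disassembler in the example. Everything here is illustrative: the nested-dict tree layout, the intermediate weights 0.6 and 0.8, and the threshold of 0.5 are assumptions. Only the weights 0.5 for 55 and 1.0 for 55 48 89 e5 come from the slides.

```python
# Byte-level weighted prefix tree as nested dicts: {byte: (weight, children)}.
# The weights for 55 48 and 55 48 89 are made up for this sketch.
TREE = {0x55: (0.5, {0x48: (0.6, {0x89: (0.8, {0xE5: (1.0, {})})})})}


def score(code: bytes, offset: int, tree: dict) -> float:
    """Weight of the longest tree path matching the bytes starting at offset."""
    best, level = 0.0, tree
    for byte in code[offset:]:
        if byte not in level:
            break
        best, level = level[byte]
    return best


def function_starts(code: bytes, tree: dict, threshold: float) -> list[int]:
    """Offsets whose score is strictly larger than the threshold."""
    return [off for off in range(len(code)) if score(code, off, tree) > threshold]


# A slice of the slide's test binary: ... c9 c3 | 55 48 89 e5 48 83 ec 60 | 48 8d 55 ac
code = bytes.fromhex("c9c3554889e54883ec60488d55ac")

print(score(code, 2, TREE))    # the full path 55 48 89 e5 matches
print(score(code, 12, TREE))   # only the single byte 55 matches
print(function_starts(code, TREE, threshold=0.5))
```

The second 55 sits inside another instruction and only reaches the weight of the one-byte path, so it stays at or below the threshold and is not reported.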

Normalization (Optional)
Test bytes: 55 48 89 e5 48 83 ec 60, which disassemble to push %rbp; mov %rsp,%rbp; sub $0x60,%rsp
Normalized nodes:
    sub $0x10,%rsp (48 83 ec 10) and sub $0x16,%rsp (48 83 ec 16) become sub $0x[1-9a-f][0-9a-f]*,%rsp
    jne 0x12345678 (0f 85 1c 01 00 00) becomes jne 0x[0-9a-f]*
[Figure: the prefix tree before and after normalization; node labels give two counts and a weight, such as 4, 6, 0.4; 1, 0, 1.0; 2, 0, 1.0; and 2, 6, 0.25.]
Interestingly, if we disassemble one more instruction at the first hex 55, we get sub $0x60,%rsp, which is very similar to the next nodes of the tree: all of them subtract a stack offset. In general, the actual offset value is not important, because the constant subtracted from the stack pointer rsp differs from function to function. To match different offsets in testing, we must generalize them in the same way in training. It makes more sense to generalize the offset to any positive number. We call this process normalization. Similarly, it makes sense to normalize jump instructions by generalizing the target address to any address.
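The two normalization patterns on this slide translate directly into regular expressions. The sketch below assumes instructions are available as AT&T-syntax strings; the `RULES` table and the `normalize` helper are names invented for the illustration.

```python
import re

# Immediate operands are generalised so that different stack sizes and jump
# targets map to the same tree node. The patterns mirror the ones on the slide.
RULES = [
    (re.compile(r"^sub \$0x[1-9a-f][0-9a-f]*,%rsp$"), "sub $0x[1-9a-f][0-9a-f]*,%rsp"),
    (re.compile(r"^jne 0x[0-9a-f]*$"), "jne 0x[0-9a-f]*"),
]


def normalize(instr: str) -> str:
    """Replace a matching instruction with its generalised label."""
    for pattern, label in RULES:
        if pattern.match(instr):
            return label
    return instr


for instr in ["sub $0x10,%rsp", "sub $0x60,%rsp", "jne 0x12345678", "push %rbp"]:
    print(instr, "=>", normalize(instr))
```

Because the same function is applied in training and in testing, sub $0x60,%rsp in a test binary lands on the node that was learned from sub $0x10,%rsp and sub $0x16,%rsp.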

Function (Boundary) Identification
Identify all bytes associated with a function, and extract the lowest and highest addresses.
That is how we address the function start identification problem. Next, we talk about FI and FBI.

ByteWeight: Function (Boundary) Identification
Control Flow Graph Recovery: recursive disassembly, using Value Set Analysis [2] to resolve indirect jumps.
Recursive Function Call Resolution (RFCR): add any call target as a function start.
[Figure: the ByteWeight pipeline diagram, with the Function Identification stage (CFG Recovery and RFCR) between Function Start and Function Bytes.]
Function identification builds on the function starts produced by classification. Given a function start, we first run CFG recovery, which is recursive disassembly that uses Value Set Analysis to resolve indirect jumps. Then we run RFCR, which adds any call target as a new function start. We run CFG recovery and RFCR iteratively until no new function start is added. This program analysis is based on version 0.7 of the CMU Binary Analysis Platform.
[2] G. Balakrishnan. WYSINWYX: What You See Is Not What You Execute. PhD thesis, University of Wisconsin-Madison, 2007.
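The iteration between CFG recovery and RFCR is a fixed-point loop: recover a function body, collect its call targets, and treat each new target as another start. The sketch below runs on a made-up toy program table and is not built on BAP. It also leaves out Value Set Analysis, so indirect jumps are simply not modelled.

```python
# Toy program: address -> (fall-through successor or None, jump targets, call targets).
PROGRAM = {
    0: (1, [], []),
    1: (2, [], [10]),     # call 10
    2: (None, [], []),    # ret
    10: (11, [12], []),   # conditional jump to 12
    11: (12, [], [20]),   # call 20
    12: (None, [], []),   # ret
    20: (None, [], []),   # ret
}


def recover_function(start: int) -> tuple[set[int], set[int]]:
    """Recursive disassembly from one start: follow fall-through and jump edges."""
    seen, calls, work = set(), set(), [start]
    while work:
        addr = work.pop()
        if addr in seen or addr not in PROGRAM:
            continue
        seen.add(addr)
        fall, jumps, call_targets = PROGRAM[addr]
        calls.update(call_targets)
        work.extend(jumps)
        if fall is not None:
            work.append(fall)
    return seen, calls


def identify_functions(initial_starts: list[int]) -> dict[int, set[int]]:
    """Repeat recovery until call targets stop producing new function starts."""
    functions, pending = {}, list(initial_starts)
    while pending:
        start = pending.pop()
        if start in functions:
            continue
        body, calls = recover_function(start)
        functions[start] = body
        pending.extend(calls)  # RFCR: every call target becomes a start
    return functions


print(identify_functions([0]))
```

Starting from the single classified start 0, the loop also discovers the functions at 10 and 20, neither of which was reported by classification in this toy setup.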

ByteWeight: Function (Boundary) Identification
Function F1 (Function Identification): instr1, instr2, instr3, instr6, instr10, instr12, …, instr100, instr101
Boundary of F1 (Function Boundary Identification): (instr1, instr101)
[Figure: the testing half of the ByteWeight pipeline, from Testing Binary through Classification, CFG Recovery and RFCR, to Function Boundary.]
This is how we designed ByteWeight.
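Once a function is known as a set of instruction addresses, its boundary follows by taking the lowest and highest address, as in the F1 example above. A small sketch, with hypothetical addresses and an assumed helper name:

```python
def function_boundary(instruction_addrs: set[int]) -> tuple[int, int]:
    """Lowest and highest instruction address of one identified function."""
    return min(instruction_addrs), max(instruction_addrs)


# Hypothetical FI output: one set of instruction addresses per function.
functions = [
    {0x1000, 0x1001, 0x1004, 0x1020, 0x1008},
    {0x1030, 0x1034, 0x103B},
]

boundaries = [function_boundary(f) for f in functions]
for low, high in boundaries:
    print(hex(low), hex(high))
```

The instruction sets need not be contiguous, which is why FI carries more information than the (start, end) pair that FBI reports.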

Experiment Results
Compilers: GCC, ICC, and MSVS
Platforms: Linux and Windows
Optimizations: O0 (Od), O1, O2, and O3 (Ox)
Let us talk about the experimental results...

Training Performance
ByteWeight: 10-fold cross-validation, 2200 binaries; 6.1 days to train from all platforms and all compilers, including logging.
Rosenblum et al.: ??? (They reported 150 compute days for one step of training, but did not report total time or make their training implementation available.) Training data and code are both unavailable.

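Precision and Recall
[Figure: Venn diagram of the functions reported by the tool and the ground truth; the overlap is TP, the tool-only region is FP, and the truth-only region is FN.]
\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \]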
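For function starts, both metrics reduce to set operations on addresses. The following sketch uses made-up addresses and an assumed helper name; it only illustrates the definitions above.

```python
def precision_recall(reported: set[int], truth: set[int]) -> tuple[float, float]:
    """Precision and recall of a set of reported function starts."""
    tp = len(reported & truth)   # reported and real
    fp = len(reported - truth)   # reported but not real
    fn = len(truth - reported)   # real but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical tool output versus ground truth.
truth = {0x10, 0x20, 0x30, 0x40}
reported = {0x10, 0x20, 0x50}

precision, recall = precision_recall(reported, truth)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here the tool has two true positives, one false positive, and two misses, giving a precision of 2/3 and a recall of 1/2.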

Function Start Identification: Comparison with Rosenblum et al.

Function Start Identification: Existing Binary Analysis Tools
Some bars are zero because the tool does not support that architecture.

Function Boundary Identification: Existing Binary Analysis Tools

Summary: ByteWeight
Machine-learning based approach
Creates a model of function start patterns using supervised machine learning
Matches the model on new samples
Uses program analysis to identify all bytes associated with a function
Faster and more accurate than previous work

Thank You
Our experiment VM is available at: http://security.ece.cmu.edu/byteweight/
Tiffany Bao, tiffanybao@cmu.edu
Finally, we have uploaded our experiment VM to the URL above. Please feel free to download it, try to reproduce our experiments, or try our tool on your own binaries. I will now be very happy to take any questions. Thank you.