# ByteWeight: Learning to Recognize Functions in Binary Code

## Presentation on theme: "ByteWeight: Learning to Recognize Functions in Binary Code"— Presentation transcript:

ByteWeight: Learning to Recognize Functions in Binary Code
Tiffany Bao Jonathan Burket Maverick Woo Rafael Turner David Brumley Carnegie Mellon University Very pleased to be here to talk about our work ByteWeight. This is joint work with USENIX Security ’14

Function Information Malware Analysis
Binary Analysis Malware Analysis Vulnerability Signature Generation Decompiler Binary Reuse Control Flow Integrity (CFI) Function Information Function 1 Function 3 Function 2 There are different kinds of binary analysis in the world, such as … etc. Many of them need function information. For example, decompiler pheonix, which is published last year in usenix security, needs function start information to construct control flow graph for each function.

Can we automatically and accurately recover function information from stripped binaries?
Binary Analysis Decompiler Control Flow Integrity (CFI) Binary Reuse Malware Analysis Vulnerability Signature Generation Stripped Function Information Function 1 Function 3 Function 2 which means essentially function as well as other symbols are discarded from program binaries. Stripped binaries are very common, especially for those delivered to public. Without function information, analysis such as decompiler and binary level Control Flow Integrity may not be able to work anymore. So our questions is: can... We need it to be automatic to meet scalability requirements. We also need it to be accurate, because inaccurate function recovery will affect downstream analysis. We starts our discussion by a simple example

Example: GCC Source Code #include <stdio.h> int fac(int x){
if (x == 1) return 1; else return x * fac(x - 1); } void main(int argc, char **argv){ printf("%d", fac(10)); We start our discussion by a simple example. Suppose we have a stripped binary. this binary is to calculate number’s factorial And We are going to find function fac and main from the stripped binaries. a naïve method is to match function prologue, like push ebp, move esp ebp. Source Code

Example: GCC –O0: Default 08048443 <main>: push %ebp
mov %esp,%ebp and \$0xfffffff0,%esp sub \$0x10,%esp c <fac>: sub \$0x18,%esp cmpl \$0x1,0x8(%ebp) jne f <fac+0x13> mov \$0x1,%eax If the binary is compiled by gcc under –O0, we will find both of them. We are very happy. –O0: Default

Example: GCC <main>: mov \$0x1,%edx mov \$0xa,%eax lea 0x0(%esi),%esi push %ebp mov %esp,%ebp and \$0xfffffff0,%esp sub \$0x10,%esp c <fac>: push %ebx sub \$0x18,%esp mov 0x20(%esp),%ebx mov \$0x1,%eax cmp \$0x1,%ebx However, if the binary is compiled under –O1, we will miss function fac. And if the binary is compiled under –O2, we will falsely identify function main. –O1: Optimize –O2: Optimize Even More

Current Industry Solution: IDA
#include<stdio.h> #include<string.h> #define MAX 128 void sum(char a[MAX], char b[MAX]){ printf("%s + %s = %d\n", a, b, atoi(a) + atoi(b)); } void sub(char a[MAX], char b[MAX]){ printf("%s - %s = %d\n", a, b, atoi(a) - atoi(b)); void assign(char a[MAX], char b[MAX]){ char pre_b[MAX]; strcpy(pre_b, b); strcpy(b, a); printf("b is changed from %s to %s\n", pre_b, b); int main(){ void (*funcs[3]) (char x[MAX], char y[MAX]); int f; char a[MAX], b[MAX]; funcs[0] = sum; funcs[1] = sub; funcs[2] = assign; scanf("%d %s %s", &f, &a, &b); (*funcs[f])(a, b); return 0; IDA Misses IDA Misses IDA Misses we also look for answers from the current industry solution IDA. We compile another example, which is essentially a calculator, by gcc–O3, and we found IDA failed to identify all 3 calculation functions, including assign, which is a vulnerable function can be exploited by buffer overflow. In fact later on we will see that IDA actually failed to identify and falsely identify a large number of functions. And this led us to investigate the function identification problem. ----- Meeting Notes (8/12/14 17:14) ----- v-table in C++ function pointers

Function Identification Problems
Given a stripped binary, return A list of function start addresses “Function Start Identification (FSI) Problem” A list of function (start, end) pairs “Function Boundary Identification (FBI) Problem” A list of functions as sets of instruction address “Function Identification (FI) Problem” In this work, we consider three variants of this problem. The first variant is FSI problem, which aims to output…

ByteWeight A machine learning + program analysis approach to function identification Training: Creates a model of function start patterns using supervised machine learning Usage: Use trained models to match function start on stripped binaries — Function Start Identification Use program analysis to identify all bytes associated with a function — Function Identification Calculate the minimum and maximum addresses of each function — Function Boundary Identification We address function start identification using machine learning, and address function boundary identification using program analysis. Finally, we extract the minimum and maximum of identified functions to address function boundary identification problem.

Function Start Identification
Previous approaches Our approach We start the function identification problems by firstly talking about FSI.

Previous Work: Rosenblum et al.[1]
Method: Select instruction idioms up to length 4; learn idiom parameters; label test binaries “Feature (idiom) selection for all three data sets (1,171 binaries) consumed over 150 compute-days of machine computation” Entry Idiom: push ebp | * | mov esp,ebp Prefix Idiom: ret | int3 0.84 The ML approach towards FSI is firstly proposed by Rosenblum et al. The basic method of their work is to firstly select most representative instruction sequences, which they call idioms, that either occur at function start, or are followed by function start. Then they learn idiom parameters, and in testing, they score each byte of test binaries with a combination of parameters from those matched idioms. The main issue of the approach is scalability. According to his paper, . This is a big overhead for training. So we proposed ByteWeight b1 b2 b3 b4 b5 b6 b7 b8 [1] N. E. Rosenblum, X. Zhu, B. P. Miller, and K. Hunt. Learning to Analyze Binary Computer Code. In Proceedings of the 23rd National Conference on Artificial Intelligence (2008), AAAI, pp. 798–804.

ByteWeight: Lighter (Linear) Method
Weight Calculation Extraction Tree Generation Weighted Prefix Tree Training Binaries Extracted Sequences Weighted Sequences Training CFG Recovery This is a graph of our design. Training and Classification are used to address FSI problem, which we mark in red. After FSI, we use the identified function start to find the instructions in functions, which we mark in blue. In the end, we calculate the minimum and maximum addresses of each function, which we mark in green. Testing Binary Function Start Function Bytes Function Boundary Function Boundary Identification RFCR Function Identification Classification

Step 1: Extract All ≤ K-length Sequences
55 55 48 e5 Bytes e3b <func_1>: e ec d fc f8 8b 55 f8 8b 45 fc 89 c6 48 8d 3d c b e c9 c3 e3b <func_1>: push %rbp 48 89 e mov %rsp,%rbp 48 83 ec sub \$0x10,%rsp 89 7d fc mov %edi,-0x4(%rbp) 89 75 f mov %esi,-0x8(%rbp) 8b 55 f mov -0x8(%rbp),%edx 8b 45 fc mov -0x4(%rbp),%eax 89 c mov %eax,%esi 48 8d 3d c lea 0xc0(%rip),%rdi b mov \$0x0,%eax e callq ee8 c leaveq c retq e3b <_func_1>: push %rbp mov %rsp,%rbp sub \$0x10,%rsp mov %edi,-0x4(%rbp) mov %esi,-0x8(%rbp) mov -0x8(%rbp),%edx mov -0x4(%rbp),%eax mov %eax,%esi lea 0xc0(%rip),%rdi mov \$0x0,%eax callq ee8 leaveq retq push %rbp push %rbp; mov %rsp,%rbp push %rbp; mov %rsp,%rbp; sub \$0x10,%rsp push %rbp; mov %rsp,%rbp; sub \$0x10,%rsp; mov %edi,-0x4(%rbp) Instructions In training, the first step is… the learning data are unstripped binaries which contain function start information. So we extract every first k or less bytes or instructions at the beginning of each function. For example, given function func 1, we can either extract first K or less bytes, or first K or less instructions. hexadecimal 55

Step 2: Weight Sequences
e3b <_func_1>: push %rbp 48 89 e mov %rsp,%rbp 48 83 ec sub \$0x10,%rsp 89 7d fc mov %edi,-0x4(%rbp) 89 75 f mov %esi,-0x8(%rbp) 8b 55 f mov -0x8(%rbp),%edx 8b 45 fc mov -0x4(%rbp),%eax 89 c mov %eax,%esi 48 8d 3d c lea 0xc0(%rip),%rdi b mov \$0x0,%eax e callq ee8 c leaveq c retq e64 <_func_2>: 48 83 ec sub \$0x16,%rsp 48 8d 3d a lea 0xa6(%rip),%rdi e8 5d callq ee8 push %rbp  55 score: 2 / (2 + 2) = 0.5 The second part is to weight those extracted sequences. For each sequence, we get its corresponding bytes, and then count the times of occurrence of the bytes to be function start and not function start.

Step 2: Weight Sequences
e3b <_func_1>: push %rbp 48 89 e mov %rsp,%rbp 48 83 ec sub \$0x10,%rsp 89 7d fc mov %edi,-0x4(%rbp) 89 75 f mov %esi,-0x8(%rbp) 8b 55 f mov -0x8(%rbp),%edx 8b 45 fc mov -0x4(%rbp),%eax 89 c mov %eax,%esi 48 8d 3d c lea 0xc0(%rip),%rdi b mov \$0x0,%eax e callq ee8 c leaveq c retq e64 <_func_2>: 48 83 ec sub \$0x16,%rsp 48 8d 3d a lea 0xa6(%rip),%rdi e8 5d callq ee8 Similarly, if we weight..., we find that its corresponding bytes occurs twice, and both of them are at function start. After Weighting sequences, we can obtain a list a extracted sequences along with its score. push %rbp; mov %rsp,%rbp  e5 score: 2 / (2 + 0) = 1.0

Step 3: Generate Weighted Prefix Tree
push %rbp  2/(2+2)=0.5 push %rbp; mov %rsp,%rbp  2/(2+0)=1.0 ... push %rbp (55) mov %rsp,%rbp (48 89 e5) 2 / (2 + 0) = 1.0 2 / (2 + 2) = 0.5 sub \$0x10,%rsp (48 83 ec 10) 1 / (1 + 0) = 1.0 sub \$0x16,%rsp (48 83 ec 16) Then in part 3, we store them in weghted prefix tree. In this tree, each node represents... WPT helps storing weighted sequences in a smaller size, and also helps improve future classification speed

Classification Test Binary 1.0 0.0 55 55 push %rbp (55) mov %rsp,%rbp
e8 5d c9 c e ec d 05 9f ff ff ff b8 48 8d 05 bd ff ff ff c0 48 8d 4d a8 48 8d 55 ac 48 8d 45 a4 0.0 55 55 48 89 e5 48 83 ec 60 push %rbp (55) mov %rsp,%rbp (48 89 e5) sub \$0x10,%rsp (48 83 ec 10) sub \$0x16,%rsp (48 83 ec 16) 0.5 55 0.5 1.0 Test Binary In classification, we score test byte according to the tree, and label it to be function start if the score is larger than a default threshold For each test byte, we disassemble at this byte, and find the longest path in the tree that matches the disassembly. If we want to score the red hex 55 Threshold 1.0

Normalization (Optional)
sub \$0x60,%rsp push %rbp (55) mov %rsp,%rbp (48 89 e5) sub \$0x10,%rsp (48 83 ec 10) 4 6 0.4 e8 5d c9 c e ec d 05 9f ff ff ff b8 48 8d 05 bd ff ff ff c0 48 8d 4d a8 48 8d 55 ac 48 8d 45 a4 55 48 89 e5 48 83 ec 60 4 6 0.4 1 0 1.0 2 0 1.0 sub \$0x[1-9a-f][0-9a-f]*,%rsp sub \$0x16,%rsp (48 83 ec 16) jne 0x[0-9a-f]* jne 0x (0f 85 1c ) Interestingly, if we disassemble one more instruction of the first hex 55, we can get…, which is very similar as next node of tree. All of them are doing stack offset subtraction. In general, the actual value of offset is not important because for different function, the stack offset is different, and the constant number subtracted from the stack pointer rsp can be different. in order to match different offset in testing, we have to match them samely in training. It makes more sense to generalize the offset to any positive number. We call this process normalization. Similarly, it makes sense to normalize jump instruction by generalizing the target address to any address. 1 0 1.0 2 6 0.25

Function (Boundary) Identification
Identify all bytes associated with a function, and extract the lowest and highest addresses Below is how we address function start identification problem. Next we are going to talk about FI and FBI.

ByteWeight: Function (Boundary) Identification
Recursive disassembly, using Value Set Analysis[2] to resolve indirect jumps. Weight Calculation Extraction Tree Generation Weighted Prefix Tree Training Binaries Extracted Sequences Weighted Sequences Recursive Function Call Resolution—add any call target as a function start. Training Control Flow Graph Recovery Function Identification is based on function start identification result from classification. Given a function start, we firstly run CFG recovery, in which… Then we run, in the sense that including. We run CFG Recovery and RFCR iteratively until no more new function start is added. These program analysis is based on the version 0.7 of CMU Binary Analysis Platform. Testing Binary Function Start Function Bytes Function Boundary Function Boundary Identification RFCR Function Identification Classification [2] G. Balakrishan. WYSINWYX: What You See Is Not What You Execute. PhD thesis, University of Wisconsin-Madison, 2007.

ByteWeight: Function (Boundary) Identification
(instr1, instr101) instr1, instr2, instr3, instr6, instr10, instr12, …, instr100, instr101. F1 Control Flow Graph Recovery This is how we design byteweight. Testing Binary Function Start Function Bytes Function Boundary Function Boundary Identification RFCR Function Identification Classification [2] G. Balakrishan. WYSINWYX: What You See Is Not What You Execute. PhD thesis, University of Wisconsin-Madison, 2007.

Experiment Results Compilers: GCC , ICC, and MSVS
Platforms: Linux and Windows Optimizations: O0(Od), O1, O2, and O3(Ox) let us talk about experimental result...

Training Performance ByteWeight: Rosenblum et al.:
10-fold cross-validation, 2200 binaries 6.1 days to train from all platforms and all compilers including logging Rosenblum et al.: ??? (They reported 150 compute days for one step of training, but did not report total time, or make their training implementation available.) training data and code both unavailable

Precision and Recall TP FN FP Precision = Recall = TP TP TP + FP
Tool Truth TP TP Precision = Recall = TP + FP TP + FN

Function Start Identification: Comparison with Rosenblum et al.

Function Start Identification: Existing Binary Analysis Tools
some bar is zero because the tool doesn't support the architecture

Function Boundary Identification: Existing Binary Analysis Tools

Summary: ByteWeight Machine-learning based approach
Creates a model of function start patterns using supervised machine learning Matches model on new samples Uses program analysis to identify all bytes associated with a function Faster and more accurate than previous work

Thank You Our experiment VM is available at: Tiffany Bao Finally, we have uploaded our experiment vm at the following url. Please feel free to download the vm. Try to reproduce our experiment, or try our tool with your binaries. At this moment I will be very happy to take any questions. Thank you.