Machine-Learning Assisted Binary Code Analysis N. Rosenblum, X. Zhu, B. Miller Computer Sciences Department University of Wisconsin - Madison


1 Machine-Learning Assisted Binary Code Analysis
N. Rosenblum, X. Zhu, B. Miller, Computer Sciences Department, University of Wisconsin - Madison, {nater,jerryzhu,bart}@cs.wisc.edu
K. Hunt, National Security Agency, huntkc@gmail.com

2 Rosenblum, Zhu, Miller, Hunt 2 ML Assisted Binary Code Analysis
Supporting Static Binary Analysis
Binary analysis is a foundational technique for many areas.
Why analyze binaries?
Source code unavailable
–e.g., malware
Source code is inaccurate
–Compiler transforms structure
Binary provides the most accurate representation
Example uses:
–Malware detection
–Vulnerability analysis
–Static and dynamic instrumentation
–Formal verification
Code is found through symbol information and parsing. This is MUCH HARDER without symbols.

3 Many Binaries are Stripped
Stripped binaries lack symbol and debug information.
Examples:
–Malicious programs
–Operating system distributions
–Commercial software packages
–Legacy codes
Standard approach: parse from the entry point.
[Figure: layout of a binary: headers, code segment (functions?), data segment]

4 Stripped Binaries Exhibit Gaps
After static parsing, gap regions remain in the code segment. Causes:
–Indirect (pointer-based) control ambiguity
–Deliberate call/branch obfuscation
Gaps in the code segment may not contain code.

5 Stripped Binaries Exhibit Gaps
Gap contents may vary: string data
–Dialog constants
–Import names
–Other strings
Example string data from a code segment gap:
.__gmon_start__.libc.so.6.stpcpy.strcpy.__divdi3.printf.stdout.strerror.memmove.getopt_long.re_syntax_options.__ctype_b.getenv.__strtol_internal.getpagesize.re_search_2.memcpy.puts.feof.malloc.optarg.btowc._obstack_newchunk.re_match.__ctype_toupper.__xstat64.abort.strrchr._obstack_begin.calloc.re_set_registers.fprintf.

6 Stripped Binaries Exhibit Gaps
Gap contents may vary: tables or lists of addresses
–Jump tables
–Virtual function tables
–Data objects
Example addresses from a code segment gap:
0x8022346 0x802434b 0x80243ad 0x80403d0 0x80503d0 0x8052140 0x8053142 0x806000b 0x802321a 0x8023332 0x804132a 0x8050ca0

7 Stripped Binaries Exhibit Gaps
Gap contents may vary: code unreachable through standard static parsing
–Function pointers
–Virtual methods
–Obfuscated calls
Example functions in a code segment gap: gap_funcA {... } gap_funcB {... gap_funcC {... }

8 Stripped Binaries Exhibit Gaps
Gap contents may vary, but all of these just look like bytes:
7a 01 00 fd a2 b3 74 68 69 73 20 65 78 61 6d 70 6c 65 20 69 73 20 62 6f 67 75 73 2e 2e 2e
How can we find code in gaps?
–Every byte in a gap may be the start of a function.
–Previous work (Vigna et al., 2007) augments parsing with simple instruction frequency information.
–Our approach: use information in known code to model code in gaps.

9 Modeling Binary Code
Problem reduces to finding function entry points.
Task: classify every byte in a gap as entry point or non-entry point.
Two types of features:
Content: idiom features of function entry points
–Based on instruction sequences
Structure: control flow and conflict features
–Capture relationship of candidate function entry points
–Require joint assignment over all function entry point candidates

10 Content-based Features
Entry idioms are common patterns at function entry points.
Idioms are preceding and succeeding instruction sequences with wildcards.
Entry idioms for candidate C1:
push ebp
push ebp|mov esp,ebp
push ebp|*|sub esp
push ebp|*|mov esp,ebp
*|mov_esp,ebp
*|sub 0x8,esp
*|mov 0x8(ebp),eax
PRE nop
PRE ret|nop
PRE pop ebp|*|nop
For each idiom u,
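A minimal sketch of how succeeding-instruction idioms like those listed on this slide could be turned into binary features. It assumes the bytes at a candidate have already been disassembled into normalized instruction strings, that `*` stands for exactly one arbitrary instruction, and that a slot matches only on exact text equality. The names `parse_idiom`, `matches_idiom`, and `idiom_features` are hypothetical, and `PRE` (preceding) idioms are not handled here.

```python
from typing import List, Sequence

WILDCARD = "*"


def parse_idiom(idiom: str) -> List[str]:
    """Split an idiom such as 'push ebp|*|sub esp' into its instruction slots."""
    return [slot.strip() for slot in idiom.split("|")]


def matches_idiom(instructions: Sequence[str], idiom: str) -> bool:
    """Return True if the instructions starting at a candidate match the idiom.

    Assumption: '*' matches exactly one instruction; any other slot must
    equal the instruction text exactly.
    """
    slots = parse_idiom(idiom)
    if len(instructions) < len(slots):
        return False
    return all(
        slot == WILDCARD or slot == insn
        for slot, insn in zip(slots, instructions)
    )


def idiom_features(instructions: Sequence[str], idioms: Sequence[str]) -> List[int]:
    """One 0/1 feature per idiom for a single candidate entry point."""
    return [1 if matches_idiom(instructions, u) else 0 for u in idioms]


# Hypothetical candidate: the instructions decoded starting at one gap byte.
candidate = ["push ebp", "mov esp,ebp", "sub 0x8,esp"]
idioms = ["push ebp", "push ebp|mov esp,ebp", "*|mov esp,ebp", "*|sub 0x8,esp"]
features = idiom_features(candidate, idioms)
```

In this toy case the first three idioms match and the last does not, because the second instruction is a `mov` rather than a `sub`.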

11 Call Consistency & Overlap
Call and conflict features relate candidate function entry points (FEPs) over the entire gap.
Candidates C1, C2, C3, C4 with example assignment: y1 = 1, y2 = 1, y3 = -1, y4 = 1
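One way to picture the structural features is as counts of violated constraints under a joint labeling. The sketch below is an illustration under stated assumptions, not the slide's definition: two candidates are taken to "overlap" when their decoded byte ranges intersect, and a call is taken to be "inconsistent" when a candidate labeled 1 calls the start address of a candidate labeled -1. The `Candidate` type and `count_violations` helper are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, FrozenSet, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str
    start: int                      # address of the candidate entry point
    end: int                        # one past the last decoded byte
    call_targets: FrozenSet[int] = frozenset()


def overlaps(a: Candidate, b: Candidate) -> bool:
    """Assumed overlap test: the two half-open byte ranges intersect."""
    return a.start < b.end and b.start < a.end


def count_violations(
    cands: Sequence[Candidate], labels: Dict[str, int]
) -> Tuple[int, int]:
    """Return (overlap violations, call-consistency violations) for a labeling."""
    by_start = {c.start: c for c in cands}
    overlap = sum(
        1
        for a, b in combinations(cands, 2)
        if labels[a.name] == 1 and labels[b.name] == 1 and overlaps(a, b)
    )
    call = 0
    for c in cands:
        if labels[c.name] != 1:
            continue
        for target in c.call_targets:
            callee = by_start.get(target)
            if callee is not None and labels[callee.name] == -1:
                call += 1
    return overlap, call


# Toy gap: c1 overlaps c2 and calls the start of c3.
c1 = Candidate("c1", 0x100, 0x120, frozenset({0x140}))
c2 = Candidate("c2", 0x110, 0x130)
c3 = Candidate("c3", 0x140, 0x160)
gap = [c1, c2, c3]
```

With large negative weights on both counts, a labeling such as `{"c1": 1, "c2": -1, "c3": 1}`, which violates neither constraint, would be preferred over labelings that accept both overlapping candidates or reject a called address.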

12 Experimental Setup
Large set (100's) of binaries from department Linux servers and Windows workstations
Additional binaries compiled with the Intel compiler
Binaries have full symbol information; unstripped copies of binaries provide the reference set
Model implemented as extensions to the Dyninst instrumentation library
1. Strip binary copies and parse to obtain training set
2. Select top idiom features by forward feature selection
3. Perform logistic regression to build idiom model
4. Evaluate model on test data from gap regions in Step 1
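Steps 2 and 3 above can be sketched in a few lines of pure Python. This is a toy illustration under assumptions, not the Dyninst extension itself: labels are 0/1, the logistic regression is fit by plain batch gradient descent, and each candidate feature is scored by training-set log loss (a real run would score on held-out data). All function names are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, cols: Sequence[int], epochs: int = 200, lr: float = 0.5) -> List[float]:
    """Fit weights for the chosen feature columns; the last weight is the bias."""
    w = [0.0] * (len(cols) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, label in zip(X, y):
            feats = [row[c] for c in cols] + [1.0]
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, feats)))
            for k, fk in enumerate(feats):
                grad[k] += (p - label) * fk
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def log_loss(X, y, cols: Sequence[int], w: Sequence[float]) -> float:
    """Average negative log-likelihood of the labels under the fitted model."""
    total = 0.0
    for row, label in zip(X, y):
        feats = [row[c] for c in cols] + [1.0]
        p = sigmoid(sum(wi * fi for wi, fi in zip(w, feats)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(X)


def forward_select(X, y, k: int) -> List[int]:
    """Greedily add the feature whose inclusion gives the lowest log loss."""
    selected: List[int] = []
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < k:
        def score(col: int) -> float:
            cols = selected + [col]
            return log_loss(X, y, cols, train_logreg(X, y, cols))
        best = min(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: idiom feature 0 predicts the label; features 1 and 2 are noise.
X = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1],
     [0, 0, 1], [0, 1, 0], [0, 0, 0], [0, 1, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
top = forward_select(X, y, 1)
weights = train_logreg(X, y, top)
```

Forward selection retrains the model once per remaining feature in every round, which is why the cost grows quickly with the number of candidate idioms.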

13 Preliminary Results
Compiler   Programs examined   Total training examples (pos+neg)   Total test examples (pos+neg)   Actual number of functions in gaps
GCC        625                 8,412,711                           22,806,449                      85,870
MS VS      443                 8,020,828                           11,231,721                      70,620
ICC        112                 1,364,598                           13,169,487                      47,841
GNU C Compiler
–Simple, regular function preamble
MS Visual Studio
–High variation in function entry points
Intel C Compiler
–Most variation in entry points; highly optimized

14 Preliminary Results
Comparison of three binary analysis tools:
Original Dyninst
–Scans for common entry preamble
IDA Pro Disassembler
–Scans for common entry preamble
–List of Library Fingerprints (Windows)
Dyninst w/ Model
–Model replaces entry preamble heuristic
Compiler   Orig. Dyninst        IDA Pro              Dyninst w/ Model
           FP        FN         FP        FN         FP        FN
GCC        2,833     2,012      14,576    38,074     403       1,860
MS VS      79,320    65,586     9,044     21,491     725       14,143
ICC        3,786     40,195     14,422    26,970     2,337     16,220
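The false positive (FP) and false negative (FN) counts can be converted into precision and recall if one assumes that the "actual number of functions in gaps" reported on the previous slide is the number of positives in the same test set, so that true positives equal that number minus FN. The slides do not state this, so the figures below are a derived illustration rather than reported results. The helper name `precision_recall` is hypothetical.

```python
from typing import Tuple

# Counts copied from the slides (Dyninst w/ Model columns).
ACTUAL_FUNCTIONS = {"GCC": 85870, "MS VS": 70620, "ICC": 47841}
MODEL_FP_FN = {"GCC": (403, 1860), "MS VS": (725, 14143), "ICC": (2337, 16220)}


def precision_recall(actual: int, fp: int, fn: int) -> Tuple[float, float]:
    """Assumes true positives = actual functions in gaps - false negatives."""
    tp = actual - fn
    precision = tp / (tp + fp)
    recall = tp / actual
    return precision, recall


if __name__ == "__main__":
    for compiler, (fp, fn) in MODEL_FP_FN.items():
        p, r = precision_recall(ACTUAL_FUNCTIONS[compiler], fp, fn)
        print(f"{compiler}: precision={p:.3f} recall={r:.3f}")
```

Under that assumption, precision stays high for all three compilers while recall varies much more between them.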

15 Preliminary Results
Classifier maintains high precision with good recall
Model performance is highly system-dependent
MS Visual Studio and Intel C Compiler FEPs are highly variable

16 Backup Slides

17 Idiom Feature Selection & Training
1. Obtain training data from traditional parse (statically reachable functions). Corpus is hundreds of stripped binaries.
2. Use Condor HTC to drive forward feature selection on idioms. Features: Feat1 Feat2 Feat3 ... Featk
3. Perform logistic regression on the selected idiom features to obtain model parameters t

18 Model Formalization
Joint assignment of y_i ∈ {1, -1} for each FEP x_i in binary P
Unary idiom features f_u
–Weights on f_u are trained through logistic regression
Binary features f_o (overlap), f_c (call consistency)
–Weights on f_o and f_c are large and negative
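The equation on this slide did not survive in the transcript. One plausible form consistent with the bullets above is a log-linear joint model over all candidate labels. The weight symbols \lambda_u, \lambda_o, \lambda_c and the normalizer Z are assumed notation, and the exact arguments of each feature function are a guess:

```latex
% Assumed notation: \lambda_u, \lambda_o, \lambda_c are feature weights,
% Z normalizes over all joint assignments y for the candidates in binary P.
\Pr(\mathbf{y} \mid \mathbf{x}, P) = \frac{1}{Z} \exp\Big(
    \sum_{i} \sum_{u} \lambda_u \, f_u(x_i, y_i, P)
  + \sum_{i \neq j} \big[ \lambda_o \, f_o(x_i, x_j, y_i, y_j)
                        + \lambda_c \, f_c(x_i, x_j, y_i, y_j) \big]
\Big)
```

With \lambda_o and \lambda_c large and negative, any assignment that labels two overlapping candidates as entry points, or that breaks call consistency, receives very low probability.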

