Saumya Debray, The University of Arizona, Tucson, AZ 85721

The Problem
Rapid analysis and understanding of malware code essential for swift response to new threats
‒ Malicious software is usually heavily obfuscated against analysis
Existing approaches to reverse engineering such code are primitive
‒ not a lot of high-level tool support
‒ requires a lot of manual intervention
‒ slow, cumbersome, potentially error-prone
Delays development of countermeasures

Goals
Develop automated techniques for analysis and reverse engineering of obfuscated binaries
semantics-based
‒ output is functionally equivalent to, but simpler than, the input program
generality
‒ should work on any obfuscation, even ones we haven’t thought of yet!
‒ should minimize assumptions about obfuscations

Challenges
can’t make assumptions about obfuscations
‒ what do we leverage for deobfuscation?
‒ distinguishing code we care about from code we don’t: how do we know which instructions we care about?
scale
‒ “needle in haystack”: no. of instructions executed increases by 270x (VMProtect) to 4300x (Themida) [Lau 2008]
anti-analysis defenses
‒ runtime unpacking
‒ anti-emulation, anti-debug checks

Our Approach
no obfuscation-specific assumptions
‒ treat programs as input-to-output transformations
‒ use semantics-preserving transformations to simplify execution traces
dynamic analysis to handle runtime unpacking
Pipeline: input program → taint analysis (bit-level) → semantics-preserving transformations → control flow reconstruction → control flow graph
‒ taint analysis: map flow of values from input to output
‒ semantics-preserving transformations: simplify logic of input-to-output transformation
‒ control flow reconstruction: reconstruct logic of simplified computation

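A minimal sketch of the "map flow of values from input to output" step, under stated assumptions: the `Instr` record, the register names, and the `in`/`out` locations are hypothetical, and taint is tracked per location rather than per bit, so this is coarser than the bit-level analysis the slide names. It keeps only trace instructions that both depend on an input and contribute to an output.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    text: str     # disassembly text, for display only
    dst: str      # location written (register or memory cell)
    srcs: tuple   # locations read

def input_to_output_slice(trace, inputs, outputs):
    """Keep instructions that depend on an input AND contribute to an output."""
    # Forward pass: propagate taint from the input locations.
    tainted = set(inputs)
    from_input = []
    for ins in trace:
        hit = any(s in tainted for s in ins.srcs)
        from_input.append(hit)
        if hit:
            tainted.add(ins.dst)
        else:
            tainted.discard(ins.dst)  # overwritten by input-independent data
    # Backward pass: walk from the outputs to the values they were built from.
    needed = set(outputs)
    keep = [False] * len(trace)
    for i in range(len(trace) - 1, -1, -1):
        ins = trace[i]
        if ins.dst in needed:
            needed.discard(ins.dst)
            needed.update(ins.srcs)
            keep[i] = from_input[i]
    return [ins for ins, k in zip(trace, keep) if k]

# Hypothetical trace: three payload instructions hidden among chaff.
trace = [
    Instr("mov eax, [in]",   "eax", ("in",)),
    Instr("mov ecx, 0x1234", "ecx", ()),        # chaff
    Instr("xor ecx, ecx",    "ecx", ("ecx",)),  # chaff
    Instr("add eax, 1",      "eax", ("eax",)),
    Instr("mov edx, ecx",    "edx", ("ecx",)),  # chaff
    Instr("mov [out], eax",  "out", ("eax",)),
]
kept = [ins.text for ins in input_to_output_slice(trace, {"in"}, {"out"})]
print(kept)
```

On this trace the three chaff instructions are dropped because they neither read input-derived values nor feed the output.
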
Ex 1: Emulation-based Obfuscation
examination of the code reveals only the emulator’s logic
‒ actual program logic embedded in byte code
lots of “chaff” during execution
‒ separating emulator logic from payload logic tricky
emulators can be nested
[diagram labels: obfuscator, input program, random seed, mutation engine, bytecode logic (data), emulator (code)]

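To illustrate why the code reveals only the emulator's logic, here is a toy emulator in Python. The opcode names and the bytecode encoding are invented for this sketch and do not correspond to Themida, VMProtect, or any real obfuscator. The function `run_vm` is a generic fetch-decode-dispatch loop; the fact that it computes a factorial lives entirely in the `FACTORIAL` data.

```python
def run_vm(bytecode, n):
    """Toy emulator: a dispatch loop over (opcode, operand) pairs."""
    acc, pc = 1, 0
    while True:
        op, arg = bytecode[pc]  # fetch and decode: emulator logic, not payload
        pc += 1
        if op == "JZ_N":        # jump to arg when n == 0
            if n == 0:
                pc = arg
        elif op == "MUL_N":     # acc = acc * n
            acc *= n
        elif op == "DEC_N":     # n = n - 1
            n -= 1
        elif op == "JMP":       # unconditional jump
            pc = arg
        elif op == "HALT":
            return acc
        else:
            raise ValueError(f"unknown opcode {op!r}")

# The payload (factorial) exists only as data handed to the emulator.
FACTORIAL = [
    ("JZ_N", 4),
    ("MUL_N", None),
    ("DEC_N", None),
    ("JMP", 0),
    ("HALT", None),
]

print(run_vm(FACTORIAL, 5))
```

Every payload step here costs a fetch, a decode, and a chain of opcode comparisons, which is a small-scale picture of the "chaff" in an emulated execution trace.
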
Ex 2: Return-Oriented Programs (ROP)
Originally designed to bypass anti-code-injection defenses
‒ stitches together existing code fragments (“gadgets”), e.g., in system libraries
Logic can be difficult to discern
‒ gadgets are typically scattered across many different functions and/or libraries
‒ gadgets can overlap in memory in weird ways
‒ control flow structures (if-else, loops, function calls) are typically implemented using non-standard idioms

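A toy model of how a ROP chain executes, assuming a simplified machine: the gadget addresses, register names, and the `HALT` sentinel are all made up for illustration. Each gadget does a small amount of work and then "returns", which here means popping the next gadget address off an attacker-controlled stack. The program's logic is therefore the stack contents, not any contiguous piece of code.

```python
HALT = 0  # hypothetical sentinel address that ends the chain

def pop(m):
    """Pop one word from the simulated stack."""
    value = m["stack"][m["sp"]]
    m["sp"] += 1
    return value

# Gadget bodies; the trailing `ret` of each is modelled in run_chain.
def pop_rax(m):       # pop rax ; ret
    m["rax"] = pop(m)

def pop_rbx(m):       # pop rbx ; ret
    m["rbx"] = pop(m)

def add_rax_rbx(m):   # add rax, rbx ; ret
    m["rax"] += m["rbx"]

# Made-up addresses, deliberately far apart to mimic gadgets scattered
# across different functions and libraries.
GADGETS = {0x401000: pop_rax, 0x7F0010: pop_rbx, 0x402222: add_rax_rbx}

def run_chain(stack):
    m = {"rax": 0, "rbx": 0, "stack": list(stack), "sp": 0}
    while True:
        addr = pop(m)     # `ret`: next gadget address comes from the stack
        if addr == HALT:
            return m["rax"]
        GADGETS[addr](m)

# Stack layout interleaves gadget addresses with the data they consume.
chain = [0x401000, 3, 0x7F0010, 4, 0x402222, HALT]
result = run_chain(chain)
print(result)
```
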
Example 1 (emulation-obfuscation) factorial (Themida)
Example 2 (ROP): factorial
[figure panels: original, ROP]

Interactions between Obfuscations
Example: Unpacking + Emulation
[diagram labels: input, unpack, output, execution trace]
‒ instructions “tainted” as propagating values from input to output
‒ input-to-output computation (further simplified) used to construct control flow graph

Results
Ex. 4. Win32/Kryptik.OHY: Code Virtualizer
[figure panels: obfuscated, deobfuscated; figure label: unpacking code]
‒ multiple layers of runtime code generation
‒ initial unpacker is emulation-obfuscated
‒ the CFG shown materializes incrementally

Results: CFG Similarity
Lessons and Issues
Static vs. dynamic analysis
‒ multiple layers of runtime code generation/unpacking limit the utility of static analysis
‒ dynamic analysis can run into problems of scale: O(n^2) algorithms impractical; even O(n log n) can be problematic; trade memory space for execution time/complexity
‒ code coverage: multi-path exploration?
Taint propagation
‒ byte/word-level analyses may not be precise enough; we use (enhanced) bit-level taint propagation
Simplified trace → CFG: NP-hard
‒ semantic considerations?

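A small sketch of why byte-level taint can be too imprecise, assuming an 8-bit value and deliberately simplified propagation rules (these helper functions are illustrative and are plain bit-level taint, not the enhanced variant the slide refers to). Taint is a bitmask for the bit-level analysis and a single boolean for the byte-level one.

```python
MASK = 0xFF  # model a single 8-bit location

def bit_and_const(taint, const):
    # A result bit can carry input only where the constant bit is 1.
    return taint & const & MASK

def bit_shr(taint, k):
    # Taint bits move with the data; bits shifted in are untainted zeros.
    return (taint >> k) & MASK

def byte_and_const(tainted, const):
    # Byte-level rule (simplified): any tainted source taints the whole result.
    return tainted

def byte_shr(tainted, k):
    return tainted

# y = x & 0x0F ; z = y >> 4, with x fully input-controlled.
# z is always 0 regardless of x, so it does not depend on the input.
bit_z = bit_shr(bit_and_const(0xFF, 0x0F), 4)
byte_z = byte_shr(byte_and_const(True, 0x0F), 4)
print(f"bit-level taint mask of z: {bit_z:#04x}, byte-level tainted: {byte_z}")
```

The byte-level analysis reports `z` as input-dependent even though it is constant, so instructions using `z` would be kept in the simplified trace unnecessarily.
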
Conclusions
Rapid analysis and understanding of malware code essential for swift response to new threats
‒ need to deal with advanced code obfuscations
‒ obfuscation-specific solutions tend to be fragile
We describe a semantics-based framework for automatic code deobfuscation
‒ no assumptions about the obfuscation(s) used
‒ promising results on obfuscators (e.g., Themida) not handled by prior research

Semantics-based simplification
Quasi-invariant locations: locations that have the same value at each use.
Our transformations (currently):
‒ Arithmetic simplification: adaptation of constant folding to execution traces; consider quasi-invariant locations as constants; controlled to avoid over-simplification
‒ Data movement simplification: use pattern-driven rules to identify and simplify data movement.
‒ Dead code elimination: need to consider implicit destinations, e.g., condition code flags.

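Two of these transformations can be sketched on toy traces. Everything below is an assumption made for illustration: the trace tuple formats, the `OPS` table, and the `protected` parameter (one possible guard against over-simplification, not necessarily the one used in this work). `fold_constants` treats quasi-invariant locations as constants; `eliminate_dead` is a backward liveness pass in which implicit destinations such as `flags` are listed among each instruction's writes.

```python
from collections import defaultdict

OPS = {"add": lambda a, b: a + b, "xor": lambda a, b: a ^ b}

def quasi_invariants(trace):
    """Locations that hold the same observed value at every use."""
    seen = defaultdict(set)
    for _op, _dst, srcs, vals in trace:
        for loc, val in zip(srcs, vals):
            seen[loc].add(val)
    return {loc for loc, vals in seen.items() if len(vals) == 1}

def fold_constants(trace, protected=frozenset()):
    """Entries are (op, dst, srcs, values_read). Folded entries become
    ("const", dst, (), (value,)). `protected` locations are never folded."""
    qi = quasi_invariants(trace) - set(protected)
    out = []
    for op, dst, srcs, vals in trace:
        if op in OPS and srcs and all(s in qi for s in srcs):
            out.append(("const", dst, (), (OPS[op](*vals),)))
        else:
            out.append((op, dst, srcs, vals))
    return out

def eliminate_dead(trace, live_out):
    """Entries are (text, writes, reads); writes include implicit ones."""
    live, kept = set(live_out), []
    for text, writes, reads in reversed(trace):
        if live & set(writes):
            live -= set(writes)
            live |= set(reads)
            kept.append(text)
    return kept[::-1]

# k1 and k2 never change, so r1 = k1 ^ k2 folds; `in` varies and is protected.
arith = [
    ("xor", "r1", ("k1", "k2"), (0x5A, 0x3C)),
    ("add", "r2", ("in", "r1"), (7, 0x66)),
    ("xor", "r1", ("k1", "k2"), (0x5A, 0x3C)),
    ("add", "r2", ("in", "r1"), (9, 0x66)),
]
folded = fold_constants(arith, protected={"in"})

# `sub` looks dead (eax is overwritten later) but its flags feed `setz`.
flags_trace = [
    ("add ecx, 5", ("ecx", "flags"), ("ecx",)),
    ("sub eax, 1", ("eax", "flags"), ("eax",)),
    ("setz bl",    ("bl",),          ("flags",)),
    ("mov eax, 0", ("eax",),         ()),
]
kept = eliminate_dead(flags_trace, live_out={"bl", "eax"})
print(folded[0], kept)
```

If `flags` were left out of the write sets, the pass would delete `sub eax, 1` and change the value that `setz bl` produces, which is the hazard the slide points to.
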