Performance Optimizations in Dyninst


1 Performance Optimizations in Dyninst
Andrew Bernat, Matthew Legendre

2 Instrumentation is Complicated
User perspective: “Insert some new code here, here, and here.”
Dyninst’s perspective:
  Relocation: move code to make space for instrumentation
  Infrastructure: save/restore machine state
  Instrumentation: generate user-provided code

3 Sources of Overhead
Relocation
  Extra jumps
  Unnecessary emulation
  Traps
Infrastructure
  Extra register saves
  Tramp guards
Instrumentation
  Inefficient register usage
  Poor code generation
Optimizations
  Inlining instrumentation
  Compiler optimizations of generated code
  665% -> 32%

4 History
Enable fast (and frequent) insertion and removal of code
  “Linked list” model
  Insert/remove by patching branches
Model has evolved over time
  Long-lived instrumentation (particularly with static rewriter)
  Focus on speed of execution instead of speed of insertion

5 Outlined Instrumentation
[Diagram. Columns: Original Code; Relocated Code; Instrumentation/Infrastructure. Elements: Relocated Function, Relocated Block, Branch, Basetramp, Minitramp.]

6 Outlined System
Fast insertion and removal
  Simple to update
  Original serves as a “handle”
Reduced code relocation
  Block or instruction
Hard to optimize
  New code can be inserted without warning
  Poor code locality

7 Partial Inlining
[Diagram. Columns: Original Code; Relocated Code; Instrumentation & Instrumentation. Elements: Relocated Function, Relocated Block, Branch, Basetramp, Minitramp.]

8 Full Inlining
[Diagram. Columns: Original Code; Relocated Code & Instrumentation. Elements: Relocated Function, Relocated Block, Branch. One branch is marked “?”.]

9 Branch Reduction
Inlining removed three levels of branching
  Function to block to basetramp to minitramp
One level is left
  Function original to relocated copy
Can we remove this branch as well?
  Identify and rewrite calls to relocated functions
  Regenerate whenever target is moved
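
The call-rewriting idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CallSite`, `retarget_calls`, and the addresses are hypothetical, and nothing here is Dyninst's actual data model or API.

```python
from dataclasses import dataclass


@dataclass
class CallSite:
    """Hypothetical model of a call instruction: where it is and what it calls."""
    address: int
    target: int


def retarget_calls(call_sites, relocated):
    """Point each call at the relocated copy of its target, if one exists.

    `relocated` maps a function entry address to the address of its
    relocated copy. Returns the call sites that changed, i.e. the ones
    whose code would have to be regenerated.
    """
    patched = []
    for site in call_sites:
        new_target = relocated.get(site.target)
        if new_target is not None and new_target != site.target:
            site.target = new_target
            patched.append(site)
    return patched


sites = [CallSite(0x1000, 0x2000), CallSite(0x1010, 0x3000)]
# The function at 0x2000 was relocated to 0x9000: only the first call changes.
retarget_calls(sites, {0x2000: 0x9000})
# The relocated copy moved again, so the same call is regenerated.
retarget_calls(sites, {0x9000: 0xA000})
```

Calling the function directly at its relocated address skips the remaining branch from the original function to its copy, at the cost of re-patching callers every time the copy moves.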

10 Optimizing BaseTramps and MiniTramps
DyninstAPI contains a built-in compiler
  Converts ASTs to machine code
  Used for BaseTramps and MiniTramps
  Designed to be cross-platform (x86, x86_64, ppc32, ppc64, IA-64, Sparc)
Build new optimizations into compiler
  Some optimizations from classic compilers
  Some optimizations are instrumentation specific

11 Optimizing Code Generation
Register Saves
    pusha
    pushf
Stack frame (Setup)
    push %ebp
    mov %esp,%ebp
    sub $128,%esp
Trampoline Guard (Check)
    mov 0x805a490,%eax
    mov (%eax),%ecx
    test %ecx,%ecx
    je done
    mov $0x0,(%ecx)
Instrumentation
    mov $1,%eax
    mov %eax,4(%ebp)
    mov 0x805a494,%ebx
    mov 4(%ebp),%eax
    add %eax,%ebx
    mov %ebx,0x805a494
Trampoline Guard (Restore)
    mov $0x1,(%eax)
Stack frame (Clean)
    done: leave
Register Restores
    popf
    popa
Problems:
  Saving too many registers
  Stack frame unnecessary
  Tramp guards unnecessary
  Extraneous register usage
  “Virtual” registers unnecessary
  Inefficient instrumentation
  Recalculating old value

12 Register Saves
Before:
    pusha
    pushf
After:
    push %eax
    lahf
Calculate live registers at instrumentation point
Calculate registers used by instrumentation
Save intersection
Use more efficient flag saves
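
The save rule on this slide is a set intersection. A minimal sketch in Python, assuming both register sets are already known; `registers_to_save` and the register names in the example are hypothetical and not part of Dyninst.

```python
def registers_to_save(live_at_point, used_by_instrumentation):
    """Return the registers that must be saved around the instrumentation.

    A register needs saving only if the original program still needs its
    value (it is live) AND the instrumentation overwrites it (it is used).
    """
    return sorted(set(live_at_point) & set(used_by_instrumentation))


# Three registers are live, the instrumentation touches two registers,
# but only one register is in both sets.
print(registers_to_save({"eax", "ecx", "esi"}, {"eax", "ebx"}))
```

Compared with `pusha`, which saves every general-purpose register, the intersection is often a single register or empty.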

13 Virtual Registers
Instrumentation, before:
    mov $1,%eax
    mov %eax,4(%ebp)
    mov 4(%ebp),%eax
Instrumentation, after:
    mov $1,%eax
“Virtual Registers” were stack slots on x86
  Load from virtual register to eax
  Operate on eax
  Store from eax to virtual register
Now use real register allocation algorithm, with spilling
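
To make "register allocation with spilling" concrete, here is a deliberately simplified Python sketch: values get real registers first and fall back to stack slots only once the registers run out. It ignores live ranges and is not the allocation algorithm Dyninst uses; `allocate` and its output format are assumptions made for illustration.

```python
def allocate(values, real_registers):
    """Map each value to a real register while any remain, else to a stack slot.

    Returns a dict: value -> ("reg", name) or ("spill", slot_index).
    """
    free = list(real_registers)
    assignment = {}
    next_slot = 0
    for value in values:
        if free:
            assignment[value] = ("reg", free.pop(0))
        else:
            # No register left: spill this value to the next stack slot.
            assignment[value] = ("spill", next_slot)
            next_slot += 1
    return assignment


# Two registers for three values: only the third value touches the stack.
print(allocate(["a", "b", "c"], ["%eax", "%ebx"]))
```

Under the old virtual-register scheme every value would behave like `c` here, which is where the redundant store and reload in the "before" code come from.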

14 AST to Machine Code Compilation
AST:
    0x805a494 = 0x805a494 + 1
Instrumentation, one instruction per AST node:
    mov $1,%eax
    mov 0x805a494,%ebx
    add %eax,%ebx
    mov $0x805a494,%ecx
    mov %ebx,(%ecx)
Instrumentation, optimized:
    incl 0x805a494
Each AST node is converted to an instruction
  Not optimal on CISC systems
Recognize sequences of ASTs, emit optimized code
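
Recognizing an AST shape before falling back to node-by-node generation might look like the following Python sketch. The tuple-based AST encoding, the `generate` and `emit` helpers, and the fixed three-register pool are all assumptions made for illustration; Dyninst's real AST classes and code generator differ.

```python
def emit(node, out, free):
    """Fallback: one instruction per node; returns the register holding the result."""
    kind = node[0]
    if kind == "const":
        reg = free.pop(0)
        out.append(f"mov ${node[1]},{reg}")
        return reg
    if kind == "load":
        reg = free.pop(0)
        out.append(f"mov {node[1]:#x},{reg}")
        return reg
    if kind == "add":
        left = emit(node[1], out, free)
        right = emit(node[2], out, free)
        out.append(f"add {right},{left}")
        return left
    raise ValueError(f"unknown node kind: {kind}")


def generate(ast):
    """Return AT&T-syntax x86 instructions for a ("store", addr, value) AST."""
    kind, addr, value = ast
    if kind != "store":
        raise ValueError("this sketch only handles store nodes at the root")
    # Pattern: addr = addr + 1 is a single in-memory increment on x86.
    if value == ("add", ("load", addr), ("const", 1)):
        return [f"incl {addr:#x}"]
    out = []
    result = emit(value, out, ["%eax", "%ebx", "%ecx"])
    out.append(f"mov {result},{addr:#x}")
    return out


counter = 0x805a494
print(generate(("store", counter, ("add", ("load", counter), ("const", 1)))))
print(generate(("store", counter, ("add", ("load", counter), ("const", 2)))))
```

The pattern check runs before any instruction is emitted, so the general path stays untouched and more patterns can be added as separate checks.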

15 Optional Infrastructure
Tramp Guard
    mov 0x805a490,%eax
    mov (%eax),%ecx
    test %ecx,%ecx
    je done
    mov $0x0,(%ecx)
Stack Frame
    push %ebp
    mov %esp,%ebp
    sub $0x32,%esp
    ...
FP Saves
    mov %esp,%eax
    sub $512,%esp
    and $0xfffffff0,%esp
    fxsave (%esp)
    push %eax
Stack Shift
    lea 0x128(%rsp),%rsp
Some tramp infrastructure is not always required. E.g.,
  Stack frame only needed for register spilling
  Tramp guard only needed for function calls
Save only necessary infrastructure
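
The two rules stated on this slide can be written as a small decision function. This Python sketch covers only those two rules; the slide does not say when FP saves or the stack shift are needed, so they are left out. `required_infrastructure` and its parameters are hypothetical names.

```python
def required_infrastructure(spills_registers, makes_function_call):
    """Return the tramp infrastructure a snippet needs, per the slide's two rules."""
    needed = []
    if spills_registers:
        # A stack frame is only needed as a place to spill registers.
        needed.append("stack frame")
    if makes_function_call:
        # The tramp guard is only needed when the snippet calls a function.
        needed.append("tramp guard")
    return needed


# A simple counter increment neither spills nor calls, so it needs neither.
print(required_infrastructure(spills_registers=False, makes_function_call=False))
```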

16 Fixed Point Code Generation
Optimizations may be interlinked. E.g.,
  Removing code may leave registers unused
  Removing unused registers eliminates saves
  Eliminating saves removes stack access
  Removing stack accesses may eliminate stack shift
Typical code generation requires 2 passes
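
The chain of dependent optimizations above is a natural fit for a fixed-point loop: rerun every pass until a full round changes nothing. The Python sketch below models the snippet as a small dictionary; the state layout, the two passes, and `run_to_fixed_point` are illustrative assumptions, not Dyninst's code generator.

```python
def drop_stack_shift(state):
    """The stack shift is only needed while some register save touches the stack."""
    return {**state, "stack_shift": state["stack_shift"] and bool(state["saves"])}


def drop_unused_saves(state):
    """Keep a register save only if the remaining code still uses that register."""
    return {**state, "saves": state["saves"] & state["used_registers"]}


def run_to_fixed_point(state, passes):
    """Apply every pass in order, repeating until a whole round changes nothing."""
    rounds = 0
    changed = True
    while changed:
        changed = False
        for optimization in passes:
            new_state = optimization(state)
            if new_state != state:
                state = new_state
                changed = True
        rounds += 1
    return state, rounds


# After earlier optimizations the snippet no longer uses any register,
# but it still carries two saves and a stack shift.
start = {"used_registers": set(), "saves": {"eax", "ebx"}, "stack_shift": True}
final, rounds = run_to_fixed_point(start, [drop_stack_shift, drop_unused_saves])
print(final, rounds)
```

With the passes in this order the stack shift cannot be removed in the first round, because the saves are still present when it is checked; the second round removes it, which is the kind of interlinking the slide describes.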

17 Optimizing Code Generation
Generated code before optimization:
    pusha
    pushf
    push %ebp
    mov %esp,%ebp
    sub $128,%esp
    mov 0x805a490,%eax
    mov (%eax),%ecx
    test %ecx,%ecx
    je done
    mov $0x0,(%ecx)
    mov $1,%eax
    mov %eax,4(%ebp)
    mov 0x805a494,%ebx
    mov 4(%ebp),%eax
    add %eax,%ebx
    mov %ebx,0x805a494
    mov $0x1,(%eax)
    done: leave
    popf
    popa
Sections: Register Saves, Stack frame (Setup), Trampoline Guard (Check), Instrumentation, Trampoline Guard (Restore), Stack frame (Clean), Register Restores
Optimized instrumentation:
    incl 0x805a494

18 Results
Basic block instrumentation on ‘go’ from SPEC2000

Instrumented run time (base: 12.25s)
           Old              Optimizations    Optimizations + Inlining
Dynamic    70.96s (479%)    25.18s (105%)    NA
Static     93.72s (665%)    24.10s (97%)     16.21s (32%)

Instrumentation time
           Original    Optimizations    Optimizations + Inlining
Dynamic    17.43s      3.22s            NA
Static     2.77s       2.12s            4.81s
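
The percentages in the run-time table appear to be overhead relative to the 12.25s uninstrumented base. That reading is inferred from the numbers and is not stated on the slide. A short Python check, with `overhead_percent` as a hypothetical helper:

```python
def overhead_percent(instrumented_seconds, base_seconds):
    """Extra run time caused by instrumentation, as a percentage of the base."""
    return (instrumented_seconds - base_seconds) / base_seconds * 100.0


BASE = 12.25  # uninstrumented run time from the slide, in seconds

# Static rewriting, old code generation versus optimizations plus inlining.
print(overhead_percent(93.72, BASE))
print(overhead_percent(16.21, BASE))
```

Each computed value lands within one percentage point of the figure on the slide, so the slide's numbers look like rounded overheads.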

19 Conclusions
Optimizations in DyninstAPI instrumentation
  Inline instrumentation levels
  Generate more efficient code
Significant performance gains
  Instrumentation code runs faster
  More time spent generating instrumentation

20 Questions?

