Published by Jaquan Locke. Modified about 1 year ago.

1
Binary Translation Using Peephole Superoptimizers
Sorav Bansal, Alex Aiken
Stanford University

2
Binary Translation
Allows software compiled for one ISA to run on another
Applications
– Portability (e.g., running legacy software)
– Virtualization
– Backward and forward compatibility
– On-chip binary translation
– Java Virtual Machines

3
Binary Translation
[Figure: three stack diagrams. Labels in the first: Hypervisor, x86 hardware, x86 OS, x86 app (×2), Binary Translator, powerpc OS, powerpc app. Labels in the second and third: x86 hardware, OS, x86 app (×2), Binary Translator, powerpc app.]

4
Binary Translation Wish-list
Performance
Large Complex ISAs
Retargetability
OS Compatibility

5
Talk Outline
Superoptimization
Peephole Superoptimization
Application to Binary Translation
Implementation & Experimental Results
Conclusion

6
Superoptimization
A superoptimizer is a code generator that uses brute-force search to try to find the optimal code sequence for a function.
E.g.
int signum(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    else return 0;
}
On Motorola 68020:
add.l  d0, d0
subx.l d1, d1
negx.l d0
addx.l d1, d1
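The four-instruction 68020 sequence is not obviously a signum, so a small emulation helps show why it works. The sketch below is in Python (not a language used in the talk) and rests on stated assumptions: x arrives in d0, the result is left in d1, and only the X (extend) flag is modeled. The function name `signum_68020` is hypothetical.

```python
MASK = 0xFFFFFFFF  # 32-bit registers


def signum_68020(x):
    """Emulate the four-instruction sequence, tracking only the X (extend) flag."""
    d0 = x & MASK
    d1 = 0  # initial value is irrelevant: subx computes d1 - d1 - X

    # add.l d0, d0: d0 = d0 + d0, X = carry out (the original sign bit of x)
    total = d0 + d0
    d0, xflag = total & MASK, total >> 32

    # subx.l d1, d1: d1 = d1 - d1 - X, X = borrow
    diff = d1 - d1 - xflag
    d1, xflag = diff & MASK, int(diff < 0)

    # negx.l d0: d0 = 0 - d0 - X, X = borrow (set unless x was zero)
    diff = 0 - d0 - xflag
    d0, xflag = diff & MASK, int(diff < 0)

    # addx.l d1, d1: d1 = d1 + d1 + X
    d1 = (d1 + d1 + xflag) & MASK

    # Reinterpret the 32-bit result as a signed integer.
    return d1 - (1 << 32) if d1 & 0x80000000 else d1
```

For a positive x the sign bit is clear, so d1 becomes 0 and the final addx adds the borrow from negx, giving 1. For a negative x, d1 becomes -1 and the final step computes -1 + -1 + 1 = -1. For zero, no flag is ever set and d1 stays 0.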

7
Superoptimization
Enumerate all instruction sequences up to a certain length
Compare each enumerated sequence with the target function for equivalence
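The enumerate-and-compare loop can be sketched on a toy machine. Everything below is an illustrative assumption rather than the authors' tool: a single 8-bit register, a four-instruction set invented for the example, and an equivalence check that simply tries all 256 inputs.

```python
from itertools import product

MASK = 0xFF  # toy 8-bit machine with a single register

# Hypothetical instruction set, chosen only for illustration.
OPS = {
    "inc": lambda r: (r + 1) & MASK,
    "dec": lambda r: (r - 1) & MASK,
    "neg": lambda r: (-r) & MASK,
    "shl": lambda r: (r << 1) & MASK,
}


def run(seq, r):
    """Execute a sequence of instruction names on register value r."""
    for name in seq:
        r = OPS[name](r)
    return r


def superoptimize(target, max_len):
    """Return the first shortest sequence matching target on all 256 inputs, or None."""
    for length in range(1, max_len + 1):
        for seq in product(OPS, repeat=length):
            if all(run(seq, x) == target(x) for x in range(MASK + 1)):
                return seq
    return None


def bitwise_not(x):
    return ~x & MASK


best = superoptimize(bitwise_not, max_len=3)
```

No single instruction in this set computes a bitwise NOT, so the search should settle on a two-instruction sequence such as inc followed by neg, since -(x + 1) equals ~x in two's complement. The search space grows exponentially with sequence length, which is why the length bound matters.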

8
Talk Outline
Superoptimization
Peephole Superoptimization
Application to Binary Translation
Implementation & Experimental Results
Conclusion

9
Peephole Superoptimization
Use a superoptimizer to automatically infer peephole optimizations
Table of Peephole Optimizations
pattern         replace-with
add $1, reg     inc reg
mul $2, reg     shl reg
…               …
[S. Bansal, A. Aiken. Automatic Generation of Peephole Superoptimizers, ASPLOS 2006]

10
Peephole Superoptimizer Step 1
Harvest instruction sequences that can potentially be optimized from a binary (a.out). Canonicalize and store them.
Target Sequences
mov %eax, %ecx ; mov %ecx, %eax
sub $123, %eax ; add $456, %eax
movl (%eax), %ecx ; inc %ecx ; movl %ecx, (%eax)
…

11
Peephole Superoptimization Step 2
Brute-force optimization
Target Sequences                                      Optimized Sequences
mov %eax, %ecx ; mov %ecx, %eax                       mov %eax, %ecx
sub $123, %eax ; add $456, %eax                       add $333, %eax
movl (%eax), %ecx ; inc %ecx ; movl %ecx, (%eax)      inc (%eax)
…                                                     …

12
Equivalence Test
Two sequences → Execution Test
– fail: not-equivalent
– pass: → Boolean Test
  – fail: not-equivalent
  – pass: equivalent
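The two-stage equivalence check can be sketched as a cheap filter followed by an exact test. This is a simplification under stated assumptions: sequences are modeled as Python functions on one 8-bit register, and the Boolean test is replaced by exhaustive comparison over all 256 inputs, which stands in for a satisfiability query only because the toy input space is tiny. All function names are hypothetical.

```python
import random

MASK = 0xFF  # toy 8-bit register, so exhaustive checking is feasible


def execution_test(f, g, trials=16, seed=0):
    """Cheap filter: run both sequences on a few random inputs."""
    rng = random.Random(seed)
    inputs = [rng.randrange(MASK + 1) for _ in range(trials)]
    return all(f(x) == g(x) for x in inputs)


def boolean_test(f, g):
    """Stand-in for the exact check: compare on every possible input."""
    return all(f(x) == g(x) for x in range(MASK + 1))


def equivalent(f, g):
    # Only candidates that survive the cheap test reach the expensive one.
    return execution_test(f, g) and boolean_test(f, g)


def target(x):      # sub $123 ; add $456, with results wrapped to 8 bits
    return (x - 123 + 456) & MASK


def candidate(x):   # add $333
    return (x + 333) & MASK


def wrong(x):       # add $332, differs from target on every input
    return (x + 332) & MASK
```

Here `equivalent(target, candidate)` holds, while `wrong` is rejected by the execution test before the exact check ever runs. Ordering the tests this way keeps the expensive check off the vast majority of enumerated candidates.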

13
Peephole Superoptimization Step 3
Table of Peephole Optimizations
mov %eax, %ecx ; mov %ecx, %eax                       →  mov %eax, %ecx
sub $123, %eax ; add $456, %eax                       →  add $333, %eax
movl (%eax), %ecx ; inc %ecx ; movl %ecx, (%eax)      →  inc (%eax)
…
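Once the table exists, applying it is a plain pattern-matching pass. The sketch below assumes a simplification: rules are keyed on literal instruction strings taken from the example sequences, whereas the talk describes canonicalized patterns. It also ignores any liveness conditions a rule may need (the third rule, for instance, only holds if %ecx is not read afterwards). The names `PEEPHOLE_TABLE` and `optimize` are hypothetical.

```python
# Rules copied from the example sequences on the slides.
PEEPHOLE_TABLE = {
    ("mov %eax, %ecx", "mov %ecx, %eax"): ("mov %eax, %ecx",),
    ("sub $123, %eax", "add $456, %eax"): ("add $333, %eax",),
    ("movl (%eax), %ecx", "inc %ecx", "movl %ecx, (%eax)"): ("inc (%eax)",),
}
MAX_PATTERN = max(len(pattern) for pattern in PEEPHOLE_TABLE)


def optimize(code):
    """Scan left to right, replacing the longest matching window at each position."""
    out, i = [], 0
    while i < len(code):
        for n in range(min(MAX_PATTERN, len(code) - i), 0, -1):
            window = tuple(code[i:i + n])
            if window in PEEPHOLE_TABLE:
                out.extend(PEEPHOLE_TABLE[window])
                i += n
                break
        else:
            # No rule matched here: copy the instruction through unchanged.
            out.append(code[i])
            i += 1
    return out
```

Because lookup is a hash-table probe per window, the pass runs in time linear in the code length for a fixed maximum pattern size.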

14
Talk Outline
Superoptimization
Peephole Superoptimizers
Application to Binary Translation
Implementation & Experimental Results
Conclusion

15
Application to Binary Translation
Our approach: Use lots of peephole transformations
pattern (ppc)     register map            translate-to (x86)
addi r1,r1,1      r1 → eax                inc %eax
mullw r1,r1,2     r1 → eax                shl %eax
add r1,r1,r2      r1 → eax; r2 → ecx      add %ecx,%eax

16
Peephole Binary Translation
source arch. (ppc)                                register map            destination arch. (x86)
mr r1, r2 ; mr r2, r1                             r1 → eax; r2 → ecx      mov %eax, %ecx
lis r1, 0x12 ; ori r1, r1, 0x3456                 r1 → M_r1               mov $0x123456, M_r1
ldl r2, (r1) ; addi r2, r2, 1 ; stl r2, (r1)      r1 → eax; r2 → ecx      inc (%eax)
…
(M_r1 denotes the memory location that holds r1.)

17
Register Map Selection
The best code may require changing the register map from one code point to another
The choice of register map affects instruction selection, and vice versa

18
Register Map Selection
Example
powerpc sequence:
P0: li r1, 123
P1: addi r2, r2, 1
P2: subf r2, r1, r2
P3: ori r1, r1, 31
x86 sequence: ?
Cost Model
– Instruction costs: if the instruction accesses memory, 10; else, 1
– Switching costs: R → M or M → R: 10
At entry: r1 → M_r1; r2 → M_r2
At exit: r1 → M_r1; r2 → M_r2
(M_r1 denotes the memory location that holds r1.)

19
Register Map Selection
Greedy Strategy
Instruction costs: if the instruction accesses memory, 10; else, 1
Switching costs: R → M or M → R: 10
Register map at entry: r1 → M_r1; r2 → M_r2
point   ppc instruction    x86 instruction     register map              switching cost   instruction cost
P0      li r1, 123         movl $123, M_r1     r1 → M_r1                 0                10
P1      addi r2,r2,1       incl M_r2           r2 → M_r2                 0                10
P2      subf r2,r1,r2      subl M_r1, eax      r1 → M_r1; r2 → eax       10               10
P3      ori r1,r1,31       orl $31, M_r1       r1 → M_r1                 0                10
exit                                           r1 → M_r1; r2 → M_r2      10
Total                                                                    20               40
Grand Total: 60

20
Register Map Selection
Optimal Solution
Instruction costs: if the instruction accesses memory, 10; else, 1
Switching costs: R → M or M → R: 10
Register map at entry: r1 → M_r1; r2 → M_r2
point   ppc instruction    x86 instruction     register map              switching cost   instruction cost
P0      li r1, 123         movl $123, eax      r1 → eax                  10               1
P1      addi r2,r2,1       incl ecx            r2 → ecx                  10               1
P2      subf r2,r1,r2      subl eax, ecx       r1 → eax; r2 → ecx        0                1
P3      ori r1,r1,31       orl $31, eax        r1 → eax                  0                1
exit                                           r1 → M_r1; r2 → M_r2      20
Total                                                                    40               4
Grand Total: 44
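The totals on the two register-map slides (60 and 44) can be re-derived mechanically from the stated cost model. The sketch below is in Python and makes a few assumptions for illustration: each program point carries a full register map for r1 and r2, a register keeps its location until the map changes, and an instruction is charged the memory cost if any register it touches lives in memory. The names `USES`, `IN_MEMORY` and `total_cost` are hypothetical.

```python
SWITCH_COST = 10            # R -> M or M -> R, per register
MEM_COST, REG_COST = 10, 1  # instruction cost with / without a memory access

# PowerPC registers touched at P0..P3 (li, addi, subf, ori).
USES = [("r1",), ("r2",), ("r1", "r2"), ("r1",)]
IN_MEMORY = {"r1": "M", "r2": "M"}  # register map at entry and at exit


def total_cost(maps):
    """maps[i] is the full register map in effect while executing Pi.

    Returns (instruction cost, switching cost, grand total).
    """
    switching = instructions = 0
    previous = IN_MEMORY
    for used, current in zip(USES, maps):
        switching += SWITCH_COST * sum(previous[r] != current[r] for r in current)
        touches_memory = any(current[r] == "M" for r in used)
        instructions += MEM_COST if touches_memory else REG_COST
        previous = current
    # Restore the exit map.
    switching += SWITCH_COST * sum(previous[r] != IN_MEMORY[r] for r in IN_MEMORY)
    return instructions, switching, instructions + switching


greedy = [
    {"r1": "M", "r2": "M"},
    {"r1": "M", "r2": "M"},
    {"r1": "M", "r2": "eax"},
    {"r1": "M", "r2": "eax"},
]
all_registers = [
    {"r1": "eax", "r2": "M"},
    {"r1": "eax", "r2": "ecx"},
    {"r1": "eax", "r2": "ecx"},
    {"r1": "eax", "r2": "ecx"},
]
```

Under these assumptions `total_cost(greedy)` gives (40, 20, 60) and `total_cost(all_registers)` gives (4, 40, 44), matching the slide totals. The greedy choice avoids switching early but pays the memory cost on every instruction.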

21
Register Map Selection
Use Dynamic Programming
– near-optimal solution
– account for translations spanning multiple instructions
– simultaneously perform instruction selection and register mapping
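The dynamic-programming idea can be sketched as a shortest-path computation over (program point, register map) pairs. This is a minimal illustration under simplifying assumptions, not the authors' algorithm: each step translates exactly one instruction, the candidate register maps are a fixed small set, and the per-step costs are supplied as a table. The function `best_register_maps` and the example costs are hypothetical.

```python
def best_register_maps(step_costs, switch_cost, entry, exit_):
    """Viterbi-style DP over register maps.

    step_costs[i][m]  = cost of translating step i under register map m
    switch_cost(a, b) = cost of moving from map a to map b
    Returns (total cost, [map chosen for each step]).
    """
    maps = list(step_costs[0])
    # best[m] = (cheapest cost of finishing the current step under map m, path)
    best = {m: (switch_cost(entry, m) + step_costs[0][m], [m]) for m in maps}
    for costs in step_costs[1:]:
        new = {}
        for m in maps:
            prev_cost, prev_path = min(
                ((best[p][0] + switch_cost(p, m), best[p][1]) for p in maps),
                key=lambda t: t[0],
            )
            new[m] = (prev_cost + costs[m], prev_path + [m])
        best = new
    return min(
        ((cost + switch_cost(m, exit_), path) for m, (cost, path) in best.items()),
        key=lambda t: t[0],
    )


def switch(a, b):
    # One register, so a map change moves exactly one value.
    return 0 if a == b else 10


# Hypothetical example: one register used by every instruction.
three_steps = [{"M": 10, "R": 1}] * 3
two_steps = [{"M": 10, "R": 1}] * 2
```

With three uses, keeping the value in a register wins (10 + 3 + 10 = 23 against 30 in memory); with only two uses, the two switches cost more than they save and the DP stays in memory (20 against 22). The work is proportional to the number of steps times the square of the number of candidate maps, which is why the set of maps considered has to stay small.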

22
Talk Outline
Superoptimization
Peephole Superoptimizers
Application to Binary Translation
Implementation & Experimental Results
Conclusion

23
PowerPC → x86 Translator Implementation
Superoptimizer
– Use a PPC emulator (Qemu) for the execution test
– Use a SAT solver (zChaff) for the boolean test
Static user-level translator
– ELF 32-bit ppc/Linux binary → ELF 32-bit x86/Linux binary
– Translate most (but not all) system calls

24
Implementation
Endianness: ppc is big-endian; x86 is little-endian
– Convert all memory writes to big-endian (source)
– Convert all memory reads to little-endian (dest)
Compiler Optimizations
– Problem: The PowerPC optimizer staggers data-dependent instructions to reduce pipeline stalls
– Solution: Cluster data-dependent instructions in each basic block before translation
Many Issues
– Condition Codes, Endianness, System Calls, Stack and Heap, Indirect Jumps, Function Calls and Returns, Register Name Constraints, Untranslated Opcodes, Compiler Optimizations
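The endianness bullets amount to a byte swap around every translated load and store, so that memory keeps the big-endian layout the PowerPC program expects. The Python sketch below illustrates this for 32-bit accesses on a simulated little-endian host; the helper names are hypothetical, and how the real translator emits the swap is not shown in the slides.

```python
import struct


def bswap32(value):
    """Reverse the byte order of a 32-bit value."""
    return struct.unpack("<I", struct.pack(">I", value & 0xFFFFFFFF))[0]


def translated_store(memory, addr, value):
    # Swap first, then perform the host's native little-endian store,
    # so the bytes land in memory in big-endian (guest) order.
    struct.pack_into("<I", memory, addr, bswap32(value))


def translated_load(memory, addr):
    # Native little-endian load, then swap back to the value the guest expects.
    return bswap32(struct.unpack_from("<I", memory, addr)[0])


memory = bytearray(8)
translated_store(memory, 0, 0x12345678)
```

After the store, the first four bytes of `memory` are 12 34 56 78, exactly what a big-endian machine would have written, and `translated_load(memory, 0)` returns 0x12345678. Keeping memory in guest byte order means byte-wise accesses by the translated program need no adjustment.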

25
Experimental Results
Setup
– Pentium4 3.0 GHz, 1MB Cache, 4GB Memory
– gcc 4.0.1, glibc 2.3.6
– Use soft-float library
– Statically-linked input executables
Benchmarks
– Microbenchmarks, SPEC CINT2000
Metrics
– Compare against natively-compiled code
– Compare against other binary translators: Qemu, Apple’s Rosetta

26
Experimental Setup
For our experiments
– there are around 750 translation rules in the peephole table
– the translation table is computed offline; computing the peephole rules can take up to a week

27
Experimental Results: Setup
C source → gcc -arch=ppc → PowerPC executable → Peephole Binary Translation → x86 executable
C source → gcc -arch=x86 → x86 executable
Compare the two x86 executables

28
Microbenchmarks
emptyloop     A bounded for-loop doing nothing
fibo          Compute first few Fibonacci numbers
quicksort     Quicksort on 64-bit integers
mergesort     Mergesort on 64-bit integers
bubblesort    Bubblesort on 64-bit integers
hanoi1        Towers of Hanoi Algorithm 1
hanoi2        Towers of Hanoi Algorithm 2
hanoi3        Towers of Hanoi Algorithm 3
traverse      Traverse a linked list
binsearch     Binary search on a sorted array

29
Microbenchmarks
[Bar chart: percentage of native (%) per microbenchmark; avg: 90% of native]

30
Experimental Results: Microbenchmarks
We sometimes outperform native performance on these small benchmarks!
– gcc generates better code for PowerPC, primarily because it has the luxury of many registers
– Our register-mapping algorithm performs an efficient “re-allocation” of the PowerPC registers to x86 registers

31
Experimental Results: SPEC CINT2000
[Bar chart: percentage of native (%) per benchmark]

32
Comparisons with Qemu and Rosetta
Qemu
– Use the same PowerPC and x86 executables as used for our own translator
Rosetta
– Runs on Mac OS X and hence supports only Mac executables
– Recompiled the benchmarks on Mac using the same compiler version (gcc 4.0.1)
– Mac hardware: Intel Core 2 Duo 1.83GHz processor, 32KB L1-cache, 2MB L2-cache and 2GB memory

33
Comparisons with Qemu and Rosetta
[Bar charts comparing qemu, rosetta and peep. Panels: -O0, -O2. Annotations, in the same order: avg: 3% faster than rosetta; avg: 12% faster than rosetta]

34
Translation Time
Takes 2-6 minutes to translate a 650KB executable (around 100K instructions)
– majority of time spent in optimal register map computation
It is possible to reduce this to <10 seconds
– For 98K instructions (<0.01% of time), use any register map. Fast (<1 second)
– For the other 2K, use optimal computation

35
Conclusions and Future Work
A scheme to perform efficient binary translation using a superoptimizer
– Competitive performance
– Simplified design
Other applications
– Just-in-time compilation
– Machine virtualization

36
Q&A
Thank you.

37
Backup Slides
