Using Dynamic Binary Translation to Fuse Dependent Instructions Shiliang Hu & James E. Smith.


2 Outline: Introduction; Fused Instruction Set; Fusing Algorithm; Evaluation; Conclusion

3 Microarchitecture Model Dependence-based architectures: ILDP [ISCA’02] etc. Fuse dependent instruction pairs so that each pair is processed as if it were a single instruction in the processor pipeline. Both higher IPC and deeper pipelining can be achieved simultaneously. Original proposal by I. Kim and M. Lipasti: Macro-op Scheduling [MICRO’03] (hardware intensive, RISC). Other related work on pipelined scheduling logic.

4 Pipelined Scheduling Window Critical path in select-wakeup for single-cycle instructions. If the producer has latency > 1, then wakeup can be done a cycle late => wakeup and select in different pipe stages. [Figure: select/wakeup timing for the sequence Reax ← mem[Resi + 4]; Reax ← Reax & 7; Rebx ← Reax + Rebx; Recx ← Rebx + 4, shown unfused and with the pair "Reax ← Reax & 7 :: Rebx ← Reax + Rebx" fused.]

5 Performance Implications Pros: –Effectively larger scheduling window, because two instructions are held in the same window slot. –Effectively wider issue: by issuing one slot with two fused instructions, two dependent instructions are kicked off for execution with a single issue decision. –The scheduling logic can be pipelined without a heavy penalty if the fusing rate is high. Cons: –Non-fused single-cycle instructions have two-cycle latency. –If the head (the 1st instruction in the pair) provides its value to another critical consumer, that consumer may be delayed; however, most values are consumed only once. –If the tail (the 2nd instruction in the pair) has a critical dependence, it slows down the wakeup of the pair.

6 Co-Designed Virtual Machine Concurrently design ISA, microarchitecture, and dynamic binary translation (DBT) system Examples -- Transmeta Crusoe & Efficeon processors; IBM DAISY, BOA. Our Design for x86: –RISC-style implementation ISA with fuse bit –Fetch straightened code generated by fast DBT –Run on an enhanced dynamic superscalar

7 Implementation Instruction Set Allocate 1 bit of each instruction, the fuse bit, to fuse two instructions in the pipeline. Dense instruction encoding: 16/32-bit instruction set design. Features specialized for efficient emulation of the x86 ISA: long immediates, condition codes, addressing modes, etc.

8 Fused Instruction Set

9 An Illustrative Example

10 Dynamic Binary Translation Goals: Simple, Fast & Effective Hot Superblock detection and formation Translation from x86 binary to fused instruction set Code cache placement & linking among superblocks in the code cache

11 Hot Superblock Detection & Formation Modified MRET (Most Recently Executed Tail) -- Stop at indirect jumps. Hot threshold: 32. Max length: 256. [Figure: basic blocks A, B, C, D with the path taken at superblock construction time, and the resulting translated superblocks showing the entry, an early exit, and a superblock generated later.]
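The detection policy on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Block` record, the `start_candidates` set, and the choice to count length in instructions are assumptions made for the example; only the threshold of 32, the maximum length of 256, and the stop-at-indirect-jump rule come from the slide.

```python
from collections import defaultdict
from dataclasses import dataclass

HOT_THRESHOLD = 32  # from the slide
MAX_LEN = 256       # from the slide; unit (instructions) is an assumption


@dataclass
class Block:
    """Hypothetical basic-block record used only in this sketch."""
    addr: int
    n_instrs: int
    ends_in_indirect_jump: bool = False


def form_superblocks(executed, start_candidates):
    """Scan blocks in execution order and record hot tails as superblocks.

    executed: iterable of Block, as observed by an interpreter.
    start_candidates: addresses allowed to start a superblock (assumed input).
    Returns {start_addr: [Block, ...]}.
    """
    counters = defaultdict(int)
    superblocks = {}
    recording = None  # blocks of the superblock being formed, or None
    length = 0

    for block in executed:
        if recording is None:
            if block.addr in superblocks or block.addr not in start_candidates:
                continue
            counters[block.addr] += 1
            if counters[block.addr] < HOT_THRESHOLD:
                continue
            recording, length = [], 0  # start recording at this hot block

        # Close the superblock on a loop back to its start, on reaching
        # already translated code, or when it would exceed the max length.
        if recording and (block.addr == recording[0].addr
                          or block.addr in superblocks
                          or length + block.n_instrs > MAX_LEN):
            superblocks[recording[0].addr] = recording
            recording = None
            continue

        recording.append(block)
        length += block.n_instrs
        if block.ends_in_indirect_jump:  # modified MRET: stop here
            superblocks[recording[0].addr] = recording
            recording = None

    return superblocks
```

For a two-block loop whose head is a start candidate, the sketch records the path executed right after the counter reaches the threshold and closes it when control returns to the head.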

12 Translation Procedure Single Pass Algorithm:
1. Form superblocks using the Modified MRET method.
2. Crack x86 instructions into RISC-like abstract micro-ops.
3. Perform cluster analysis of long immediates and assign them to registers.
4. Generate micro-ops in the implementation ISA.
5. Fusing algorithm: scan looking for dependent pairs to be fused (forward scan, backward pairing).
6. Assign registers; extend live ranges for precise traps; use a consistent state mapping at superblock exits.
7. Code generation.

13 Cluster Analysis Objectives: –Remove embedded long immediates in the x86 binary. –Reduce static and dynamic instruction counts. Long Immediate Conversion: –Scan the superblock looking for all long immediate values. –Perform value clustering analysis and allocate registers to frequent long immediate values. –Convert some x86 embedded long immediates into a register access, or a register plus a short immediate, that can be handled in the implementation ISA.
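One way to picture the clustering step is the greedy sketch below. It is an assumption-laden illustration rather than the method in the talk: the `short_range` of 1024, the `max_regs` default of 3, and the rule of skipping clusters used only once are all hypothetical choices, and the input is assumed to already contain only long immediates from one superblock.

```python
from collections import Counter


def cluster_long_immediates(immediates, short_range=1024, max_regs=3):
    """Map long immediates to (base, delta) pairs.

    A register is assumed to hold `base`; each covered value is then
    expressed as base + delta with 0 <= delta < short_range.
    Values left out of the result stay as embedded long immediates.
    """
    freq = Counter(immediates)

    # Walk values in sorted order and group neighbours that fit within
    # the short-immediate range of the current cluster's base.
    clusters = []  # each entry: [base, use_count, member_values]
    for value in sorted(freq):
        if clusters and 0 <= value - clusters[-1][0] < short_range:
            clusters[-1][1] += freq[value]
            clusters[-1][2].append(value)
        else:
            clusters.append([value, freq[value], [value]])

    # Give registers to the most frequently used clusters first.
    clusters.sort(key=lambda c: -c[1])
    mapping = {}
    for base, count, members in clusters[:max_regs]:
        if count < 2:  # a register for a single use saves nothing
            continue
        for value in members:
            mapping[value] = (base, value - base)
    return mapping
```

With the inputs `0x08049000` (used twice), `0x08049010`, and `0x12345678`, the first two share one base register and the last stays a long immediate.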

14 Fusing Algorithm Objectives: –Maximize fused dependent pairs –Minimize non-fused single cycle ALU ops. Heuristics: –Only single cycle ALU ops can be a head. –Fuse instructions that are close in the original sequence cracked from x86 binary. Fusing Algorithm: –Single pass forward scan. –For each tail candidate, look backward in the scan for its head.
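The forward-scan, backward-pairing idea can be sketched as below. This is a simplified illustration under stated assumptions: the `MicroOp` record, the `max_distance` window of 4, and the register-only hoisting check (a stand-in for the dependence cycle detection on the next slide) are hypothetical, and memory ordering is ignored.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MicroOp:
    """Hypothetical micro-op record for this sketch."""
    name: str
    dst: Optional[str]
    srcs: Tuple[str, ...]
    single_cycle_alu: bool


def _can_hoist(ops, h, t):
    """True if ops[t] can move up to sit right after ops[h] without
    violating register dependences on the ops in between."""
    tail = ops[t]
    for mid in ops[h + 1:t]:
        if mid.dst is not None and (mid.dst in tail.srcs or mid.dst == tail.dst):
            return False  # RAW or WAW hazard with an intervening op
        if tail.dst is not None and tail.dst in mid.srcs:
            return False  # WAR hazard with an intervening op
    return True


def fuse_pairs(ops, max_distance=4):
    """Single forward scan; each op looks backward for a head.

    Returns a list of (head_index, tail_index) pairs.
    """
    fused = set()
    pairs = []
    for t, tail in enumerate(ops):
        lowest = max(t - max_distance, 0)
        for h in range(t - 1, lowest - 1, -1):
            head = ops[h]
            if h in fused or not head.single_cycle_alu:
                continue  # only unfused single-cycle ALU ops can be heads
            if head.dst is None or head.dst not in tail.srcs:
                continue  # the tail must consume the head's result
            if _can_hoist(ops, h, t):
                pairs.append((h, t))
                fused.update((h, t))
                break
    return pairs
```

Applied to the four-instruction sequence from the Pipelined Scheduling Window slide (load, `and`, `add`, `add`), the sketch pairs the `and` with the first `add` and leaves the load and the last `add` unfused.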

15 Dependence Cycle Detection All cases are generalized in (d) due to Anti-Scan Fusing Heuristic

16 Dynamic x86 Superblock Size Average superblock size is about 15 x86 instructions, 20+ RISC ops. String instructions are common in some x86 applications.

17 Static Translation Size Variable length ISA is only about 33% bigger than x86 binary Fixed length ISA is 60% to 120% bigger than original x86 binary.

18 Long Immediate Values Converted Conversion is intra-superblock only for now. Address displacements are easier to convert than general long immediate values.

19 Registers For Long Immediate Two or three registers are enough for 95+% dynamic superblocks. Most SPEC2000INT benchmarks need no more than 5 registers

20 Scheduling Density Consistently high fusing rate across SPEC2000INT benchmarks. A scheduling density of 1.5 means more than 60% of instructions are fused.
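The arithmetic behind that statement can be checked with a short sketch. It assumes, which the slide does not state explicitly, that scheduling density is the number of instructions per issue-window slot and that a fused pair occupies exactly one slot; `fused_fraction` is a hypothetical helper name.

```python
def fused_fraction(density):
    """Fraction of instructions that sit in fused pairs.

    With N instructions and P fused pairs, slots = N - P, so
    density = N / (N - P) and the fused fraction 2P / N
    equals 2 * (1 - 1 / density).
    """
    if density < 1.0 or density > 2.0:
        raise ValueError("density must lie in [1, 2] when slots hold 1 or 2 ops")
    return 2.0 * (1.0 - 1.0 / density)


# A density of 1.5 gives a fused fraction of about two thirds.
print(round(fused_fraction(1.5), 3))
```

Under these assumptions a density of 1.5 corresponds to roughly 67% of instructions being fused, which is consistent with the slide's "more than 60%".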

21 Non-Fused Instruction Profile Consistently low single-cycle ALU leftovers across SPEC2000INT. (~23%) × (~35%) means non-fused single-cycle ALU ops are about 8% of all instructions.

22 Distance Distribution of Fused Pairs Most pairs are consecutive or very close in the original sequence of RISC ops cracked from the x86 superblock.

23 Code Re-organization More than 50% of pairs span x86 instruction boundaries. Pairs of single-cycle ALU ops are about 60%.

24 Source Register Operands 99+% fusable pairs have no more than 3 source register operands. 95+% fusable pairs have no more than 2 source register operands.

25 Conclusion –High degree of fusing in typical x86 binaries: 60% of all dynamic instructions. –Two source register operands are enough for 95% of fusable dependent pairs. –Non-fused instructions are mostly LD, ST, BR, FP and NOPs, so pipelined issue has little impact. –Variable length ISA improves code density: by 30% in our case. –Co-designed VM featuring fused instruction execution is promising. –Future work: complete the co-designed microarchitecture.

26 Backup: Dynamic Binary Translation Start program execution by interpretation; identify “hot” (frequently executed) program paths. Translate hot paths into the translation cache. If program control flow reaches already translated code, execute natively. [Figure: state diagram with Interpret, Translate, and Native execution states under the DBT (VMM); transitions labeled Threshold, End of superblock, Translation found, Translation not found, Target translation found, and Not found (call-DBT instruction).]
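The interpret, translate, and native-execution cycle described here can be sketched as a small dispatch loop. Everything in it is illustrative: `interpret_block` and `translate` are hypothetical callbacks supplied by the caller, the threshold default of 32 is borrowed from the superblock-detection slide, and `max_steps` exists only to keep the toy loop bounded.

```python
def run(entry, interpret_block, translate, threshold=32, max_steps=10000):
    """Toy dispatch loop for a staged DBT system.

    interpret_block(pc) -> next pc (or None to stop); hypothetical callback.
    translate(pc) -> callable returning the next pc; hypothetical callback.
    Returns a log of ("interpret" | "native", pc) events.
    """
    code_cache = {}  # pc -> translated code (a Python callable here)
    counters = {}    # pc -> times reached while still untranslated
    log = []
    pc = entry

    for _ in range(max_steps):
        if pc is None:
            break
        if pc in code_cache:
            # Translation found: execute natively.
            log.append(("native", pc))
            pc = code_cache[pc]()
            continue
        counters[pc] = counters.get(pc, 0) + 1
        if counters[pc] >= threshold:
            # Hot path: translate into the cache, then retry this pc.
            code_cache[pc] = translate(pc)
            continue
        # Cold path: fall back to interpretation.
        log.append(("interpret", pc))
        pc = interpret_block(pc)
    return log
```

With a threshold of 3 and a single block that loops five times, the sketch interprets the first two iterations and runs the remaining three from the code cache.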