Future Superscalar Processors Based on Instruction Compounding
J. E. Smith
Stamatis Vassiliadis Symposium, Sept. 28, 2007


Slide 2: Instruction Compounding (Fusing)

Instruction compounding, or "fusing," has become a key idea in high-performance microprocessors.

"A compound instruction reflects the parallel issue of instructions; it comprises some number of independent instructions or interlocked instructions."

"Instructions composing a compound instruction need not be consecutive."

-- S. Vassiliadis et al., IBM Journal of Research and Development, Jan. 1994

Slide 3: The Future Processor: Three Key Aspects

 Instruction compounding, or fusing
  - Based on S. Vassiliadis' work
  - Employs compounding and a 3-input ALU
 Co-designed VM for dynamic translation/fusing
  - Concealed from all software
  - Optimized (fused) instructions held in a code cache
 Dual-decoder front-end for fast startup
  - Hardware front-end decoder for fast startup
  - Software translator for sustained high performance
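The division of labor above, where the hardware decoder runs cold code immediately while the software translator fills a hidden code cache with fused translations of hotspots, can be sketched as a dispatch loop. Everything below (the threshold value, the function names, a cache keyed by block address) is an illustrative assumption, not a detail from the talk.

```python
# Sketch of the dual-decoder startup policy: cold x86 blocks execute
# through the hardware decoder right away; once a block crosses a hotness
# threshold, the software translator produces a fused translation that is
# cached and used from then on.

HOT_THRESHOLD = 16                 # assumed hotspot-detection threshold

code_cache = {}                    # block address -> translated (fused) code
exec_counts = {}                   # block address -> execution count

def execute_with_hw_decoder(addr):
    """Stand-in for the hardware front-end decoder (fast-startup path)."""
    return ("hw", addr)

def translate_and_fuse(addr):
    """Stand-in for the software translator that fuses dependent pairs."""
    return "translated@%#x" % addr

def execute_translated(tcode):
    """Stand-in for executing optimized code from the code cache."""
    return ("sw", tcode)

def run_block(addr):
    # Sustained-performance path: hot blocks run from the code cache.
    if addr in code_cache:
        return execute_translated(code_cache[addr])
    # Startup path: hardware decoding, with hotspot detection on the side.
    exec_counts[addr] = exec_counts.get(addr, 0) + 1
    if exec_counts[addr] >= HOT_THRESHOLD:
        code_cache[addr] = translate_and_fuse(addr)
    return execute_with_hw_decoder(addr)
```

Concealment from all software corresponds to code_cache living in memory that only the co-designed VM can address; the guest OS and applications never observe the translated code.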

Slide 4: Processor Micro-architecture

Slide 5: Fusible Instruction Set

 RISC-ops with unique features:
  - A fusible bit in each instruction marks a pair of dependent instructions for fusing
  - Dense instruction encoding: 16/32-bit ISA design
 Special features to support the x86 ISA:
  - Condition codes
  - Addressing modes
  - Awareness of long immediate and displacement values
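The slide only states that each RISC-op carries a fusible bit; the bit position and grouping convention below are purely hypothetical, chosen to show how a front end might pair a head with its tail while decoding:

```python
# Hypothetical instruction-word layout: the talk does not specify the
# actual encoding, so the fusible-bit position is invented here.

FUSIBLE_BIT = 1 << 31    # assumed: set on the head of a dependent pair

def decode_pairs(words):
    """Group 32-bit instruction words into fused pairs and single ops."""
    groups, i = [], 0
    while i < len(words):
        if words[i] & FUSIBLE_BIT and i + 1 < len(words):
            groups.append((words[i], words[i + 1]))  # head :: tail macro-op
            i += 2
        else:
            groups.append((words[i],))               # ordinary single op
            i += 1
    return groups
```

Under this convention the decoder never has to scan ahead: one bit in the head is enough to tell it that the next word completes a macro-op.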

Slide 6: Microarchitecture: Macro-op Execution

Enhanced out-of-order superscalar microarchitecture:
 Processes and executes fused macro-ops as single instructions throughout the entire pipeline

Slide 7: Macro-op Fusing Algorithm

 Objectives:
  - Maximize the number of fused dependent pairs
  - Keep the algorithm simple and fast
 Heuristics:
  - Pipelined scheduler: only single-cycle ALU ops can be a head; minimize the number of non-fused single-cycle ALU ops
  - Criticality: fuse instructions that are "close" in the original sequence; the criticality of ALU ops is easier to estimate
  - Simplicity: two or fewer distinct register operands per fused pair
 Solution: a two-pass fusing algorithm
  - The 1st pass, a forward scan, prioritizes ALU ops: for each ALU-op tail candidate, it looks backward for a head
  - The 2nd pass considers all kinds of RISC-ops as tail candidates
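A minimal sketch of the two-pass heuristic above, assuming simplified instruction records (opcode, destination, sources, single-cycle-ALU flag) and an invented "closeness" window. The operand rule is read here as "at most two distinct source registers from outside the pair," which is one plausible interpretation; the real translator would also have to respect ordering constraints the slide does not enumerate.

```python
# Two-pass macro-op fusing sketch: pass 1 takes only single-cycle ALU ops
# as tail candidates, pass 2 takes any RISC-op; each tail scans backward
# for the nearest legal single-cycle-ALU head it depends on.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Inst:
    op: str
    dest: Optional[str]
    srcs: List[str]
    single_cycle_alu: bool
    fused_with: Optional[int] = None   # index of this tail's head, if fused
    is_head: bool = False

MAX_DISTANCE = 4   # assumed "closeness" window (criticality heuristic)

def can_fuse(head, tail):
    # Pipelined-scheduler rule: only a single-cycle ALU op can be a head,
    # and neither instruction may already belong to a pair.
    if not head.single_cycle_alu or head.is_head or head.fused_with is not None:
        return False
    if tail.is_head or tail.fused_with is not None:
        return False
    # The pair must be dependent: the tail consumes the head's result.
    if head.dest is None or head.dest not in tail.srcs:
        return False
    # Simplicity rule: at most two distinct outside source registers.
    outside = set(head.srcs) | {s for s in tail.srcs if s != head.dest}
    return len(outside) <= 2

def fuse(insts):
    for tail_ok in (lambda i: i.single_cycle_alu,   # pass 1: ALU tails only
                    lambda i: True):                # pass 2: any RISC-op
        for t, tail in enumerate(insts):            # forward scan
            if not tail_ok(tail) or tail.fused_with is not None or tail.is_head:
                continue
            for h in range(t - 1, max(t - 1 - MAX_DISTANCE, -1), -1):
                if can_fuse(insts[h], tail):        # look backward for a head
                    tail.fused_with = h
                    insts[h].is_head = True
                    break
    return insts
```

On the RISC-op sequence of the example slide, this sketch pairs the AND with the leading ADD in pass 1 and the trailing LD with the address-generating ADD in pass 2, matching the fused macro-ops shown there.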

Slide 8: Fusing Algorithm: Example

x86 asm:
  1. lea   eax, DS:[edi + 01]
  2. mov   [DS:080b8658], eax
  3. movzx ebx, SS:[ebp + ecx << 1]
  4. and   eax, 007f
  5. mov   edx, DS:[eax + esi << 0 + 0x7c]

RISC-ops:
  1. ADD   Reax, Redi, 1
  2. ST    Reax, mem[R22]
  3. LD.zx Rebx, mem[Rebp + Recx << 1]
  4. AND   Reax, Reax, 007f
  5. ADD   R17, Reax, Resi
  6. LD    Redx, mem[R17 + 0x7c]

After fusing (macro-ops):
  1. ADD   R18, Redi, 1  ::  AND Reax, R18, 007f
  2. ST    R18, mem[R22]
  3. LD.zx Rebx, mem[Rebp + Recx << 1]
  4. ADD   R17, Reax, Resi  ::  LD Redx, mem[R17 + 0x7c]

Slide 9: Instruction Fusing Profile

 55+% of RISC-ops are fused, increasing effective ILP by 1.4x
 Only 6% of single-cycle ALU ops are left un-fused

Slide 10: Other DBT Software Profile

 Of all fused macro-ops:
  - 50% are ALU-ALU pairs
  - 30% are fused condition-test & conditional-branch pairs
  - the others are mostly ALU-MEM pairs
 Of all fused macro-ops:
  - 70+% fuse across x86 instruction boundaries
  - 46% access two distinct source registers; only 15% (6% of all instruction entities) write two distinct destination registers
 Translation overhead profile:
  - About 1000 instructions per translated hotspot instruction

Slide 11: Co-designed x86 Processor Performance

Slide 12: Dual-Decoder Front-End

Slide 13: Evaluation: Startup Performance

Slide 14: Activity of HW Assists

Slide 15: Important Research Issues

 Profiling
  - Probe insertion via the software translator is not feasible
 Multi-core
  - Shared code cache
  - SMT designs
 Memory consistency
  - Stores can be done in order
  - Re-scheduled loads may be important for performance
 Precise traps
  - Potential HW assist?