In-Order Execution In-order execution does not always give the best performance on superscalar machines. The following example uses in-order execution and in-order completion. Multiplication takes one more cycle to complete than addition/subtraction. A scoreboard keeps track of register usage. The user-visible registers are R0 to R8. Multiple instructions can read a register at the same time, but only one at a time may write it.

In-Order Execution

In-Order Execution The scoreboard has a small counter for each register telling how many currently executing instructions use that register as a source. If a maximum of, say, 15 instructions may be executing at once, then a 4-bit counter suffices. The scoreboard also has a counter for each register being used as a destination; since only one write at a time is allowed, these counters need only be 1 bit wide. In a real machine, the scoreboard also keeps track of functional-unit usage.
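
The bookkeeping can be modelled in a few lines of Python. This is a minimal sketch, not the actual hardware: the register count and counter width follow the example above, while the issue test and method names are illustrative assumptions.

```python
class Scoreboard:
    """Tracks how many in-flight instructions read each register (a small
    counter) and whether an in-flight instruction will write it (one bit)."""

    def __init__(self, num_regs=9, max_readers=15):
        self.readers = [0] * num_regs       # 4-bit counters in hardware (max 15)
        self.writer = [False] * num_regs    # 1 bit: a pending write exists
        self.max_readers = max_readers

    def can_issue(self, srcs, dst):
        # RAW: a source still has a pending write.
        if any(self.writer[s] for s in srcs):
            return False
        # WAW: the destination already has a pending write.
        # WAR: the destination is still being read by older instructions.
        if self.writer[dst] or self.readers[dst] > 0:
            return False
        # The read counters must not overflow.
        return all(self.readers[s] < self.max_readers for s in srcs)

    def issue(self, srcs, dst):
        for s in srcs:
            self.readers[s] += 1
        self.writer[dst] = True

    def retire(self, srcs, dst):
        for s in srcs:
            self.readers[s] -= 1
        self.writer[dst] = False
```

With in-order issue, the decode unit simply stalls when can_issue() returns False and retries the same instruction on the next cycle.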

In-Order Execution Three kinds of dependences can cause problems (instruction stalls): RAW (Read After Write) dependence, WAR (Write After Read) dependence, and WAW (Write After Write) dependence. In a RAW dependence, an instruction needs a source register that a previous instruction has not yet finished writing. In a WAR dependence, one instruction is trying to overwrite a register that a previous instruction may not yet have finished reading. A WAW dependence is similar: one instruction is trying to overwrite a register that a previous instruction has not yet finished writing.
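
The three cases can be spotted mechanically by comparing the source and destination registers of an older and a newer instruction. The little classifier below is only an illustration; the instruction encoding (a destination register plus a list of source registers) is an assumption, not the machine's real format.

```python
def classify_hazards(older, newer):
    """Return the dependences of `newer` on `older`; each instruction is
    modelled as (dest_register, [source_registers])."""
    d_old, srcs_old = older
    d_new, srcs_new = newer
    hazards = []
    if d_old in srcs_new:      # newer reads what older writes
        hazards.append("RAW")
    if d_new in srcs_old:      # newer overwrites what older still reads
        hazards.append("WAR")
    if d_new == d_old:         # both write the same register
        hazards.append("WAW")
    return hazards

# I1: R3 = R0 * R1, then I2: R4 = R0 + R3  -> RAW on R3
print(classify_hazards((3, [0, 1]), (4, [0, 3])))   # ['RAW']
# I2: R4 = R0 + R3, then I3: R0 = R5 - R6  -> WAR on R0
print(classify_hazards((4, [0, 3]), (0, [5, 6])))   # ['WAR']
```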

In-Order Execution In-order completion is also important, in order to have precise interrupts. Out-of-order completion leads to imprecise interrupts (we do not know exactly what has completed at the time of an interrupt, which is not good). To avoid stalls, let us now permit out-of-order execution and out-of-order retirement.

Out-of-Order Execution

Out-of-Order Execution The previous example also introduces a new technique called register renaming. The decode unit has changed the use of R1 in I6 and I7 to a secret register, S1, not visible to the programmer. Now I6 can be issued concurrently with I5. Modern CPUs often have dozens of secret registers for use with register renaming. This can often eliminate WAR and WAW dependencies.
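
A sketch of the idea in Python follows; the pool size, the S-register names, and the rename-on-every-write policy are illustrative assumptions (real CPUs also recycle the secret registers once their values are dead).

```python
class Renamer:
    """Illustrative register renamer: architectural registers R0-R8 plus a pool
    of hidden registers S1, S2, ... used to remove WAR and WAW dependences."""

    def __init__(self, num_arch=9, num_secret=4):
        # Where the current value of each architectural register lives.
        self.map = {f"R{i}": f"R{i}" for i in range(num_arch)}
        self.free = [f"S{i}" for i in range(1, num_secret + 1)]

    def rename(self, dst, srcs):
        # Sources read whichever register currently holds the value.
        srcs = [self.map[s] for s in srcs]
        # The destination gets a fresh register, breaking WAR/WAW on `dst`.
        phys = self.free.pop(0) if self.free else dst
        self.map[dst] = phys
        return phys, srcs

r = Renamer()
print(r.rename("R1", ["R0", "R2"]))   # ('S1', ['R0', 'R2']): I6 now writes S1
print(r.rename("R3", ["R1", "R4"]))   # ('S2', ['S1', 'R4']): later reads of R1 follow the rename
```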

Speculative Execution Computer programs can be broken up into basic blocks, with each basic block consisting of a linear sequence of code with one entry point and one exit. A basic block does not contain any control structures. Therefore its machine language translation does not contain any branches. Basic blocks are connected by control statements. Programs in this form can be represented by directed graphs.
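
A sketch of how a linear instruction stream could be split into basic blocks; the instruction representation and opcode names (BRANCH, JUMP) are hypothetical.

```python
def basic_blocks(instrs, branch_targets):
    """Split a list of instructions into basic blocks: an instruction that is a
    branch target starts a new block, and a control transfer ends one."""
    blocks, current = [], []
    for i, ins in enumerate(instrs):
        if i in branch_targets and current:     # entry point: start a new block
            blocks.append(current)
            current = []
        current.append(ins)
        if ins[0] in ("BRANCH", "JUMP"):        # exit point: end the block
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks
```

The blocks returned are the nodes of the directed graph mentioned above; the edges come from the branches and from fall-through between consecutive blocks.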

Basic Blocks

Speculative Execution Within each basic block, the reordering techniques we have seen work well. Unfortunately, most basic blocks are short, so there is insufficient parallelism to exploit. The next step is to allow reordering to cross block boundaries. The biggest gains come when a potentially slow operation can be moved upward in the graph to get it going earlier. Moving code upward over a branch is called hoisting.

Speculative Execution Imagine that all of the variables of the previous example except evensum and oddsum are kept in registers. It might make sense to move their LOAD instructions to the top of the loop, before computing k, to get them started early on, so the values will be available when they are needed. Of course only one of them will be needed on each iteration, so the other LOAD will be wasted.

Speculative Execution Speculative execution introduces some interesting problems. It is essential that none of the speculative instructions have irrevocable results because it may turn out later that they should not have been executed. One way to do this is to rename all the destination registers to be used by speculative code. In this way, only scratch registers are modified.

Speculative Execution Another problem arises if a speculatively executed instruction causes an exception. A LOAD instruction may cause a cache miss on a machine with a large cache line and a memory far slower than the CPU and cache. One solution is to have a special SPECULATIVE-LOAD instruction that tries to fetch the word from the cache, but if it is not there, just gives up.

Speculative Execution A worse situation happens with the following statement: if (x > 0) z = y/x; Suppose that the variables are all fetched into registers in advance and that the (slow) floating-point division is hoisted above the if test. If x is 0, the resulting divide-by-zero trap terminates the program even though the programmer has put in explicit code to prevent this situation. One solution is to have special versions of instructions that might cause exceptions.
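
One such scheme found in several architectures is a poison bit: the speculative instruction records the exception in the result instead of trapping, and the trap is taken only if the poisoned value is actually used. The sketch below is a toy model of that idea, not the hardware of any particular machine.

```python
class Value:
    """A register value plus a poison bit (set when a speculative instruction
    would have trapped)."""
    def __init__(self, val=0.0, poisoned=False):
        self.val, self.poisoned = val, poisoned

def spec_divide(y, x):
    # Speculative divide: on divide-by-zero, poison the result instead of trapping.
    if x.val == 0:
        return Value(poisoned=True)
    return Value(y.val / x.val)

def use(v):
    # A non-speculative use of a poisoned value is where the trap finally occurs.
    if v.poisoned:
        raise ZeroDivisionError("deferred trap: speculative divide faulted")
    return v.val

x, y = Value(0.0), Value(10.0)
z = spec_divide(y, x)    # no trap here, even though x == 0
# If the branch test x > 0 fails, z is never used and the fault quietly vanishes.
```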

Pentium II Microarchitecture Known as the P6 microarchitecture The Pentium II supports 32-bit operands and arithmetic as well as 64-bit floating-point operations. It also supports 8- and 16-bit operands and operations as a legacy from earlier processors in the family. It can address up to 64 GB of memory and it reads words from memory 64 bits at a time. The Pentium II SEC package contains two ICs: the CPU and the unified level 2 cache.

Pentium II Microarchitecture There are three primary components of the CPU: Fetch/Decode unit Dispatch/Execute unit Retire unit Together they act as a high-level pipeline. The units communicate through an instruction pool. The ROB (ReOrder Buffer) is a table which stores information about partially completed instructions.

Pentium II Microarchitecture

Pentium II Microarchitecture The Fetch/Decode unit fetches the instructions and breaks them into micro-operations for storage in the ROB. The Dispatch/Execute unit takes micro-operations from the ROB and executes them. The Retire unit completes execution of each micro-operation and updates the registers. Instructions enter the ROB in order, can be executed out of order, but are retired in order again.
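
The flow through the instruction pool can be pictured with a toy reorder buffer; the entry fields and method names are illustrative, not Intel's.

```python
from collections import deque

class ReorderBuffer:
    """Toy ROB: micro-ops enter in program order, may finish in any order,
    but retire strictly from the head."""

    def __init__(self):
        self.entries = deque()              # each entry: {"uop": ..., "done": bool}

    def insert(self, uop):                  # in order, from the Fetch/Decode unit
        self.entries.append({"uop": uop, "done": False})

    def mark_done(self, uop):               # out of order, from the Dispatch/Execute unit
        for e in self.entries:
            if e["uop"] == uop:
                e["done"] = True

    def retire(self):                       # in order, by the Retire unit
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["uop"])
        return retired
```

A micro-op that finishes early simply waits in the buffer until everything ahead of it has also finished, which is what makes precise interrupts possible.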

The Fetch/Decode Unit The Fetch/Decode unit is highly pipelined, with seven stages. Instructions enter the pipeline in stage IFU0, where entire 32-byte lines are loaded from the I-cache. Since the IA-32 instruction set has variable-length instructions with many formats, IFU1 analyzes the byte stream to locate the start of each instruction. IFU2 aligns the instructions so the next stage can decode them easily. Decoding starts in ID0. Each IA-32 instruction is broken up into one or more micro-operations. Simple instructions may require just 1 micro-op.

The Fetch/Decode Unit The micro-operations are queued in stage ID1. This stage also does branch prediction. The static predictor predicts backward branches to be taken and forward ones not to be taken. After that, the dynamic branch predictor uses a 4-bit history-based algorithm. If the branch is not in the history table, the static prediction is used. To avoid WAR and WAW dependences, the Pentium II supports register renaming using one of 40 internal scratch registers; this is done in the RAT stage. Finally, the micro-operations are deposited in the ROB, up to three per clock cycle. A micro-op is issued once all the resources it needs are available.
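
A sketch of the two-level scheme (dynamic table first, static rule as fallback); the real Pentium II keeps 4 bits of history per branch, but the simple 2-bit saturating counter below is only a stand-in for illustration.

```python
class BranchPredictor:
    """Dynamic predictor with a static fallback.  A 2-bit saturating counter
    per branch is an assumption standing in for the real 4-bit history scheme."""

    def __init__(self):
        self.table = {}                          # branch address -> counter (0..3)

    def predict(self, branch_addr, target_addr):
        if branch_addr in self.table:
            return self.table[branch_addr] >= 2  # dynamic prediction
        return target_addr < branch_addr         # static: backward taken, forward not

    def update(self, branch_addr, taken):
        c = self.table.get(branch_addr, 2)
        self.table[branch_addr] = min(3, c + 1) if taken else max(0, c - 1)
```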

The Fetch/Decode Unit

The Dispatch/Execute Unit The Dispatch/Execute unit schedules and executes the micro-operations, resolving dependences and resource conflicts. Although only three ISA instructions can be decoded per clock cycle (in ID0), as many as five micro-operations can be issued for execution in one clock cycle, one on each port. A complex scoreboard keeps track of resources. Some execution units share a single port. The Reservation Station is a 20-entry queue.

The Dispatch/Execute Unit

The Retire Unit Once a micro-operation has been executed, it goes back to the Reservation Station and then to the ROB to await retirement. The Retire unit is responsible for sending the results to the appropriate place - the correct register. If a speculatively executed instruction needs to be rolled back, it is done here (its results are discarded).

Microarchitecture of the Pentium 4 The microarchitecture of the Pentium 4 represents a break with the Pentium II/III microarchitecture. Known as the NetBurst microarchitecture, it consists of four major subsystems:
- Memory subsystem
- Front end
- Out-of-order control
- Execution unit

Overview of the NetBurst Microarchitecture The block diagram of the Pentium 4.

Overview of the NetBurst Microarchitecture The memory subsystem contains a unified L2 cache:
- 8-way set-associative, with 128-byte cache lines
- 1 MB in size
- Write-back
- Associated prefetch unit
- Provides high-speed data access to the other caches

Overview of the NetBurst Microarchitecture The front end fetches instructions from the L2 cache and decodes them:
- The micro-op sequence for complex instructions is looked up in the micro-ROM
- Decoded micro-ops are fed into the trace cache (the L1 instruction cache)
- No need to decode a second time on a hit in the L1 instruction cache
- Branch prediction

The NetBurst Pipeline A simplified view of the Pentium 4 data path.

NetBurst Pipeline The front end is fed instructions from the L2 cache, 64 bits at a time, decodes them, and stores the resulting micro-ops in the trace cache:
- The trace cache holds groups of 6 micro-ops in a single trace line, to be executed in order
- Multiple trace lines can be linked together for longer sequences
- ISA instructions requiring more than 4 micro-ops are represented in the trace cache by an index into the microcode ROM
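
A rough sketch of how decoded instructions might be packed into trace lines under those rules; the data layout and the MS_ROM_INDEX placeholder are assumptions for illustration, not Intel's actual format.

```python
MAX_UOPS_PER_LINE = 6    # trace line size from the description above
MAX_UOPS_INLINE = 4      # longer instructions become a microcode-ROM reference

def build_trace(decoded):
    """Pack decoded instructions into trace-cache lines.
    `decoded` is a list of (instruction, [micro_ops]) pairs."""
    lines, current = [], []
    for instr, uops in decoded:
        if len(uops) > MAX_UOPS_INLINE:
            uops = [("MS_ROM_INDEX", instr)]   # stand-in for the microcode ROM pointer
        for u in uops:
            if len(current) == MAX_UOPS_PER_LINE:
                lines.append(current)          # line full; the next line is linked to it
                current = []
            current.append(u)
    if current:
        lines.append(current)
    return lines
```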

NetBurst Pipeline Conditional branch targets found by the decode unit are looked up in the L1 BTB (Branch Target Buffer), which stores 4K branches; if a branch is not found there, static prediction is used. The trace BTB predicts where micro-op branches will go. The trace cache feeds the out-of-order control logic, which works in a similar way to the P6's:
- 120 scratch registers
- Separate queues for memory and non-memory micro-ops

NetBurst Pipeline There are two ALUs and two floating-point execution units (superscalar):
- Each ALU and floating-point unit performs a different set of operations
- These units are fed by 128-entry register files: 120 scratch registers plus the 8 ISA-level registers
- The L1 data cache is 4-way set-associative with 64-byte entries
- A 24-entry buffer provides store-to-load forwarding
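
Store-to-load forwarding means a load can obtain its value from a pending store before that store has reached the cache. A minimal sketch of the idea (the 24-entry limit and the rest of the Pentium 4's buffer logic are not modelled):

```python
class StoreBuffer:
    """Pending stores that have not yet reached the data cache; loads check
    this buffer first so they need not wait for the stores to complete."""

    def __init__(self):
        self.pending = []                     # (address, value), oldest first

    def store(self, addr, value):
        self.pending.append((addr, value))

    def load(self, addr, memory):
        for a, v in reversed(self.pending):   # youngest matching store wins
            if a == addr:
                return v                      # forwarded: no cache access needed
        return memory.get(addr, 0)

    def drain(self, memory):                  # at retirement, stores reach the cache
        for a, v in self.pending:
            memory[a] = v
        self.pending.clear()
```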

UltraSPARC III Microarchitecture The UltraSPARC III is a full 64-bit machine, with 64-bit registers and a 64-bit data path. The memory bus is 128 bits wide. The entire SPARC series has been a pure RISC design from the beginning. Most instructions have two source registers and one destination register, so they are well suited for pipelined execution in one cycle. There is no need to break up CISC instructions into RISC micro-operations as the Pentium II/4 has to do.

UltraSPARC III Microarchitecture The UltraSPARC III is a superscalar machine that can issue four instructions per clock cycle on a sustained basis indefinitely. Instructions are issued in order, but they may complete out of order. There is hardware support for speculative loads in the form of a PREFETCH instruction that does not cause a fault on a cache miss. Results are put in a prefetch cache. The compiler can place PREFETCHes far in advance of where the data will be needed to increase performance.

Overview of the UltraSPARC III Microarchitecture The block diagram of the UltraSPARC III.

UltraSPARC III Microarchitecture
- 32-KB four-way set-associative instruction cache with 32-byte cache lines
- The instruction issue unit selects four instructions at a time for issuing; it must find four instructions that can be issued at once
- 16-K branch table for conditional branches
- The integer execution unit has two ALUs for parallel execution
- The FP execution unit has three ALUs

UltraSPARC III Microarchitecture The load/store unit handles the various load and store instructions. It has 3 data caches:
- A traditional 64-KB 4-way set-associative data cache
- A 2-KB prefetch cache
- A 2-KB write cache that allows multiple writes to be batched together for writing into the L2 cache
The memory interface hides the implementation of memory; it can be either bus-based or switched for parallel access.

UltraSPARC III Pipeline The UltraSPARC III has a 14-stage pipeline, some of whose stages are different for integer and floating-point instructions.

UltraSPARC III Pipeline A simplified representation of the UltraSPARC III pipeline.

picoJava II Microarchitecture The picoJava II is a processor for embedded systems which can run JVM binary programs without (a lot of) software interpretation. Most JVM instructions are directly executed by the hardware in a single clock cycle. About 30 JVM instructions are microprogrammed. A very small number of JVM instructions cannot be executed by the picoJava II hardware and cause a trap to a software interpreter.

picoJava II Microarchitecture The picoJava II contains an (optional) direct-mapped I-cache and an (optional) two-way set associative D-cache. It uses writeback and write allocate. An optional FPU is also part of the picoJava II design. The most interesting part of the following slide is a register file containing 64 32-bit registers. These registers can hold the top 64 words of the JVM machine stack, greatly speeding up access to the words of the stack.
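
A toy model of that register file as a stack cache follows; the spill/refill policy shown (moving one word to or from memory at the boundary) is a simplifying assumption, not the picoJava II's exact mechanism.

```python
class StackCache:
    """The top words of the JVM operand stack live in 64 on-chip registers;
    pushes beyond that spill the oldest cached word to memory."""

    SIZE = 64

    def __init__(self):
        self.regs = []        # cached top of the JVM stack, bottom first
        self.spilled = []     # words pushed out to memory, in spill order

    def push(self, word):
        if len(self.regs) == self.SIZE:
            self.spilled.append(self.regs.pop(0))   # spill the oldest cached word
        self.regs.append(word)

    def pop(self):
        if not self.regs and self.spilled:
            self.regs.append(self.spilled.pop())    # refill from memory
        return self.regs.pop()
```

As long as the working part of the stack fits in the 64 registers, pushes and pops never touch memory at all.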

picoJava II Block Diagram

picoJava II Microarchitecture The picoJava II has a six-stage pipeline. The first stage fetches instructions from the I-cache 8 bytes at a time into an instruction buffer. The next stage decodes the instructions and folds (combines) them as described later. What comes out of the decode unit is a sequence of micro-operations, each containing an opcode and three register numbers, similar to the Pentium II. Unlike the Pentium II/4, the picoJava II is not superscalar and instructions are executed and retired in order.

picoJava II Pipeline

Instruction Folding As mentioned before, the decode unit is capable of folding multiple instructions together. The following slide shows an example of this. Assuming that all three variables are high enough on the stack that all are contained in the register file, no memory references at all are needed to carry out this sequence of instructions. In this way, several JVM instructions are folded into one micro-operation.
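
A rough sketch of one such folding rule: the JVM opcodes are real, but the pattern shown (ILOAD, ILOAD, IADD, ISTORE collapsing to one register-register add) and the micro-op format are simplified assumptions.

```python
def fold(instrs):
    """Fold the pattern ILOAD a, ILOAD b, IADD, ISTORE c into one micro-op,
    assuming a, b and c are already held in the stack register file."""
    out, i = [], 0
    while i < len(instrs):
        w = instrs[i:i + 4]
        if (len(w) == 4 and w[0][0] == "ILOAD" and w[1][0] == "ILOAD"
                and w[2][0] == "IADD" and w[3][0] == "ISTORE"):
            a, b, c = w[0][1], w[1][1], w[3][1]
            out.append(("ADD", c, a, b))      # one micro-op, no memory traffic
            i += 4
        else:
            out.append(instrs[i])
            i += 1
    return out

print(fold([("ILOAD", "i"), ("ILOAD", "j"), ("IADD",), ("ISTORE", "k")]))
# -> [('ADD', 'k', 'i', 'j')]
```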

Instruction Folding

Instruction Folding

Instruction Folding

The Microarchitecture of the 8051 CPU A simple CPU for embedded processing:
- No pipelining, no caching
- A set of heterogeneous registers
- A single main bus (for reduced chip area)
- RAM for data, ROM for program code
- 3 on-chip timers for real-time applications
- 4 8-bit I/O ports for controlling external buttons, lights, etc.
- In-order execution/retirement

The Microarchitecture of the 8051 CPU