National & Kapodistrian University of Athens Dep.of Informatics & Telecommunications MSc. In Computer Systems Technology Advanced Computer Architecture.


The Microarchitecture of Superscalar Processors, by J. E. Smith and G. S. Sohi. Presented by Giorgos Matrozos (M 414).

An Introduction Superscalar processing is the capability of initiating multiple instructions in the same clock cycle.  In the IF phase, the outcomes of conditional branches are predicted early.  Data dependences are then resolved, and the instructions are distributed to the functional units.  Execution begins in parallel, based on the availability of the operands. Usually, the sequence of the original program is not followed. (DEFINITION) This is therefore called dynamic instruction scheduling.  After this phase ends, the instructions are put back into the original sequential order.

The microarchitecture of superscalar MPs

The Instruction Processing Model A dominant element in designing a computer architecture is compatibility. In superscalar processors this is binary compatibility: the ability to execute programs written for older versions or generations of the architecture. At some point it became obvious that instruction sets should be designed to preserve this compatibility. Until then, the sequential execution model was followed: instructions were executed in the order in which they entered the processor. This creates the need to define a precise state: the processor saves the state of the memory and the registers as they were at the point in time when the interrupt occurs.

Elements Of High Performance To achieve higher performance, we need to reduce the execution time. The ‘secret’ of superscalar processors is to execute multiple instructions in parallel. (DEFINITION) The time to fetch and execute an instruction is called latency. Superscalar processing comprises:  Fetching strategies for simultaneous fetching of multiple instructions, and branch prediction techniques.  Methods for determining all kinds of dependences.  Methods for issuing multiple instructions in parallel.  Resources for parallel execution (multiple pipelined functional units, memory hierarchies).  Methods for handling the data in the memories.  Methods for committing the state. A ‘good’ superscalar processor addresses all of the above as an integrated whole, not separately.

Problem Solved by Superscalar MPs (I) The sequence of executed instructions forms a dynamic instruction stream. The first step to increase ILP is to overcome control dependences. (DEFINITION) An instruction is said to be control dependent on its preceding instruction, because the flow must pass through the preceding one first. 1st type  due to an incremented PC. 2nd type  due to an updated PC (branches, jumps).  Solution of the 1st type: The static program consists of blocks. Once a block has entered the IF stage, it is known that all of its instructions will eventually be executed. Any sequence of instructions in a block can be initiated into a conceptual window of execution (WoE), within which instructions are free to execute in parallel.  Solution of the 2nd type: Predict the outcome and speculatively fetch and execute instructions from the predicted path. Instructions from the predicted path are entered into the WoE. If the prediction is correct, the speculative status is removed and the effect on the state is the same as for any other instruction. If the prediction is incorrect, the speculative execution was wrong and recovery must be initiated.
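The speculate-then-recover idea for the 2nd type can be sketched in a few lines of Python. The instruction format and the squash-everything recovery policy are illustrative assumptions, not the paper's actual mechanism:

```python
# Hypothetical sketch: instructions fetched past a predicted branch enter
# the window of execution tagged as speculative; on a misprediction the
# whole window is squashed (real recovery is more selective).

def fetch_speculative(window, instrs, predicted_taken, actual_taken):
    """Tag post-branch instructions speculative; squash on mispredict."""
    for ins in instrs:
        window.append({"op": ins, "speculative": True})
    if predicted_taken == actual_taken:
        for entry in window:          # prediction correct: clear the tag
            entry["speculative"] = False
        return window
    return []                         # mispredict: recovery empties the window
```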

Problem Solved by Superscalar MPs (II) Next, we have the data dependences. They occur among instructions because the instructions may R/W the same storage location. (DEFINITION) When this happens, a hazard is said to exist. There are 3 types of hazards: RAW, WAR, WAW.  After control and data dependences are resolved, instructions are issued for execution. In essence, the H/W creates a parallel execution schedule in which the order of instructions differs from the sequential program. Moreover, with speculative execution, some instructions complete execution that would not have been executed at all if the sequential model were followed. Let us see that in the next picture.
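As a rough illustration, the three hazard types can be detected by comparing the register sets that two instructions write and read. The (writes, reads) pair representation is a hypothetical simplification, not hardware detail:

```python
# Sketch: classify the data hazards between an earlier instruction i and
# a later instruction j; each instruction is a (writes, reads) pair of
# register-name sets (illustrative format).

def hazards(i, j):
    i_writes, i_reads = i
    j_writes, j_reads = j
    found = set()
    if i_writes & j_reads:
        found.add("RAW")   # j reads what i writes (true dependence)
    if i_reads & j_writes:
        found.add("WAR")   # j writes what i still needs to read
    if i_writes & j_writes:
        found.add("WAW")   # both write the same location
    return found
```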

Instruction Fetching and Branch Prediction (I) In superscalar MPs there is an instruction cache: a memory containing recently used instructions, organised into blocks or lines. This is done to reduce latency. The default method for IF is to increment the PC by the number of instructions fetched and use the incremented PC to fetch the next block. Processing of conditional branch instructions can be broken down into: Recognizing conditional branches: Obvious!! Some extra bits of decode info are held in the instruction cache, for identifying all types of instructions. Determining the outcome: Some predictors use static info, e.g. that certain opcode types result more often in taken branches, or execution statistics. Other predictors use dynamic info, like the past history of branch outcomes; a history (or prediction) table is used. Usually two bits are used. These bits form a counter that is incremented when the branch is taken and decremented when it is not taken.
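A minimal sketch of the two-bit counter scheme described above. The initial state and the predict-taken threshold are common conventions, assumed here rather than taken from the paper:

```python
# Sketch of a two-bit saturating counter: increment on taken, decrement
# on not-taken, saturate at 0 and 3; predict taken in the upper half
# (states 2 and 3), so a single mispredict does not flip a strong state.

class TwoBitPredictor:
    def __init__(self):
        self.counter = 1  # start weakly not-taken (assumed convention)

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)
```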

Instruction Fetching and Branch Prediction (II) Computing branch targets: Usually an integer addition is required. In most computers, the targets are PC-relative, using an offset. To speed up the process, we have a branch target buffer that holds the target address that was used the last time the branch was executed. Transferring control: On a predicted-taken path, there is at least one cycle of delay in recognizing the branch, modifying the PC and fetching instructions from the target. This may result in pipeline bubbles. The solution is to use the instruction buffer to mask the delay. Some of the earlier RISC instruction sets used delayed branches: a branch did not take effect until the instruction after the branch.
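The branch target buffer described above can be sketched as a small map from a branch's PC to its last-used target. The dictionary representation and names are illustrative (a real BTB is a fixed-size, set-associative structure):

```python
# Sketch of a branch target buffer: remember the target address used the
# last time each branch executed, so the fetch unit can redirect without
# waiting for the target addition.

class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}

    def lookup(self, pc):
        return self.entries.get(pc)   # None means a BTB miss

    def record(self, pc, target):
        self.entries[pc] = target     # update on every branch execution
```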

Decoding, Renaming and Dispatch This phase includes the detection and resolution of hazards. The main job is to set up one or more execution tuples for each instruction. (DEFINITION) A tuple is an ordered list containing the operation, the storage elements for the inputs and the location of the output. Often, to increase ILP, there are more physical storage elements than logical ones, so multiple values with different logical addresses can be held at once. When an instruction creates a new value for a logical address, the physical element is given a name known by the H/W. (DEFINITION) Renaming is defined as replacing the logical register with the new physical name. There are 2 renaming methods. The first uses a physical register file larger than the logical one; a mapping table is used to associate them, and renaming is performed in sequential program order. The second method uses a physical register file of the same size as the logical one, plus a buffer with one entry per active instruction; this buffer is called the reorder buffer.
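The first renaming method can be sketched as follows. Register counts, the r/p naming and the free-list allocation order are illustrative assumptions, not those of any particular processor:

```python
# Sketch of renaming with a physical register file larger than the
# logical one: a map table tracks the current logical-to-physical
# mapping, and a free list supplies fresh physical names.

class Renamer:
    def __init__(self, n_logical=8, n_physical=16):
        self.map = {f"r{i}": f"p{i}" for i in range(n_logical)}
        self.free = [f"p{i}" for i in range(n_logical, n_physical)]

    def rename(self, dest, srcs):
        """Rename one instruction, in sequential program order."""
        phys_srcs = [self.map[s] for s in srcs]  # read current mappings
        new_phys = self.free.pop(0)              # allocate a fresh name
        self.map[dest] = new_phys                # dest now maps to it
        return new_phys, phys_srcs
```

Because every write gets a fresh physical name, later readers see the right version automatically, and WAR/WAW hazards on the logical register disappear.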

Instruction Issuing and Parallel Execution There are 3 ways of organizing the instruction issue buffers:  Single Queue Method: Register renaming is not required. Operand availability can be managed via simple reservation bits assigned to each register. An instruction may issue if there are no reservations on its operands.  Multiple Queue Method: Instructions issue from each queue in order. The individual queues are organized according to instruction type (e.g. fp queues, int queues, load/store queues).  Reservation Stations: Instructions may issue out of order. All stations monitor their source operands for data availability at the same time. The way of doing this is to hold operand data in the reservation station itself.
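Reservation-station issue can be sketched as below. Here an operand is represented either by its value or by a string tag it is still waiting for; this tag-or-value encoding is an illustrative simplification of the real result buses and comparators:

```python
# Sketch of reservation stations: waiting operands (string tags) are
# filled in from broadcast results; a station whose operands are all
# values may issue, regardless of program order. Issued stations would
# be freed in a real design; this sketch only reports them.

def issue_ready(stations, broadcast):
    """Capture broadcast results, then return the ops that can issue."""
    issued = []
    for st in stations:
        st["operands"] = [broadcast.get(op, op) if isinstance(op, str) else op
                          for op in st["operands"]]
        if all(not isinstance(op, str) for op in st["operands"]):
            issued.append(st["op"])
    return issued
```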

Organizing instr. issue queues

Handling Memory Operations To reduce latency, we use memory hierarchies. Most PCs today use caches (L1, L2); the first is smaller but faster, and on chip. Unlike ALU operations, load/store instructions need an address calculation, usually an integer addition. After that, an address translation is needed to generate a physical address; a Translation Lookaside Buffer is used to speed up this step. Some superscalar processors allow only a single memory operation per cycle. The trend is to allow multiple memory requests at the same time, with a multiported memory hierarchy. Most commonly, only the L1 cache is multiported, because most requests do not proceed to the lower levels of the hierarchy. Once an operation has been submitted to the memory hierarchy, it may hit or miss in the data cache. In case of a miss, the accessed location must be fetched into the cache. Miss Handling Status Registers are used to track the status of outstanding misses and allow multiple requests to be overlapped.
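The miss-tracking idea can be sketched as a small table of outstanding misses. The entry count and the address-to-waiting-registers representation are illustrative assumptions:

```python
# Sketch of miss handling status registers: each outstanding miss
# occupies one entry; further misses to the same block merge into the
# existing entry, and a full table forces a stall.

class MSHRFile:
    def __init__(self, n_entries=6):
        self.n_entries = n_entries
        self.outstanding = {}   # block address -> registers waiting on it

    def miss(self, address, target_reg):
        """Record a miss; return False if the request must stall."""
        if address in self.outstanding:          # merge with existing miss
            self.outstanding[address].append(target_reg)
            return True
        if len(self.outstanding) >= self.n_entries:
            return False                         # all entries busy: stall
        self.outstanding[address] = [target_reg]
        return True

    def fill(self, address):
        """Data returned from memory: wake the waiting registers."""
        return self.outstanding.pop(address, [])
```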

Committing The State This is the final phase of an instruction. Its purpose is to maintain the appearance of a sequential execution model, even though the reality is different. The actions necessary in this phase depend on the technique used to recover a precise state.  1st technique: The state is saved (or checkpointed) at certain points, in a history buffer. Instructions update the state as they execute, and when a precise state is needed it is recovered from the history buffer.  2nd technique: Separation of the state into 2 parts: the implemented physical state and a logical state. The physical state is updated immediately as the operations complete. The logical state is updated in sequential program order, as the speculative status of instructions is cleared. The speculative state is maintained in a reorder buffer.
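In-order commit from a reorder buffer, as in the 2nd technique, can be sketched as follows (the entry format is illustrative):

```python
# Sketch of reorder-buffer commit: results retire to the logical state
# strictly from the head in program order, so an entry that finished
# out of order must wait until all of its predecessors are done.

def commit(rob, logical_state):
    """Retire completed entries from the head; stop at the first unfinished."""
    while rob and rob[0]["done"]:
        entry = rob.pop(0)
        logical_state[entry["dest"]] = entry["value"]
    return logical_state
```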

MIPS R10000 Fetches 4 instructions at a time; these are predecoded when they enter the cache. Branch prediction with a prediction table (512 entries, each a 2-bit counter encoding history). If a branch is predicted taken, it takes 1 clock cycle to redirect the IF. During this cycle, sequential instructions are fetched and placed in a resume cache (4 blocks). When a branch is predicted, the processor takes a snapshot of the register mapping table; if the branch is mispredicted, the register mapping can be quickly recovered. 4 instructions are dispatched into one of three instruction queues: memory, integer, fp. Execution units: an address adder, 2 int ALUs (one shifts and the other multiplies/adds), an fp multiplier/divider/square-rooter, and an fp adder. On-chip primary cache, plus an L2 cache. A reorder buffer mechanism maintains a precise state.

ALPHA IF from an 8KB instruction cache, 4 instructions at a time. Instructions are issued in program order; that restricts the instruction issue rate but simplifies the control logic. Branch prediction with a prediction table that records history using 2-bit counters; this table is held in the instruction cache. 2 int ALUs, an fp adder, and an fp multiplier. 2 levels of cache on chip. The primary cache can sustain a number of outstanding misses through six-entry miss address files (MAFs), each entry containing the address and target register for a load that misses. To provide a sequential state, this processor does not issue out of order and keeps instructions in sequence as they flow down the pipeline.

DEC ALPHA Organization

AMD K5 Uses variable-length instructions, sequentially predecoded with 5 predecode bits. Branch prediction with one prediction entry per cache line; a single-bit counter encodes the history. 2 cycles are consumed for decoding. Instructions are translated into RISC-like OPerations known as ROPs. 2 int ALUs (one shifts, the other divides), an fp unit, 2 load/store units, and a branch unit. The reservation stations are distributed among these functional units. 8KB cache. A 16-entry reorder buffer maintains a precise state.

AMD K5 Organization