National & Kapodistrian University of Athens, Dept. of Informatics & Telecommunications, MSc in Computer Systems Technology, Advanced Computer Architecture: The Microarchitecture of Superscalar Processors, by J.E. Smith and G.S. Sohi. Giorgos Matrozos, M 414, email@example.com
An Introduction
Superscalar processing is the ability to initiate multiple instructions in the same clock cycle. In the instruction fetch (IF) phase, the outcomes of conditional branches are predicted ahead of time, so that fetching can continue without waiting for them. Data dependences are then resolved and the instructions are distributed to the functional units. Execution begins in parallel, based on the availability of the operands, and usually does not follow the order of the original program. (DEFINITION) This is therefore called dynamic instruction scheduling. After execution, the results are put back into the original sequential order.
The Instruction Processing Model
A dominant concern in designing a computer architecture is compatibility. For superscalar processors the relevant kind is binary compatibility: the ability to execute programs written for older versions or generations of the processor. At some point it became obvious that instruction sets should be designed with compatibility in mind. Traditionally, the sequential execution model was followed, that is, instructions were executed in the order in which they entered the processor. There is also a need to define the precise state: when an interrupt occurs, the processor saves the state of the memory and the registers as of that point in time.
Elements Of High Performance
To achieve higher performance, we need to reduce the execution time. The ‘secret’ of superscalar processors is to execute multiple instructions in parallel. (DEFINITION) The time to fetch and execute an instruction is called latency.
Superscalar processing involves:
Fetch strategies for fetching multiple instructions simultaneously, together with branch prediction techniques.
Methods for determining all kinds of dependences.
Methods for issuing multiple instructions in parallel.
Resources for parallel execution (multiple pipelined functional units, memory hierarchies).
Methods for handling the data in the memories.
Methods for committing the state.
A ‘good’ superscalar processor treats all of the above as an integrated whole and not separately.
Problem Solved by Superscalar MPs (I)
The sequence of executed instructions forms a dynamic instruction stream. The first step in increasing ILP is to overcome control dependences. (DEFINITION) An instruction is said to be control dependent on its preceding instruction, because the flow of control must pass through the preceding instruction first.
1st type: dependences due to an incremented PC.
2nd type: dependences due to an updated PC (branches, jumps).
Solution for the 1st type: The static program consists of blocks (basic blocks). Once the instruction fetcher enters a block, it is known that all of its instructions will be executed eventually. Any sequence of instructions in a block can therefore be initiated into a conceptual window of execution, and the instructions in this window are free to execute in parallel.
Solution for the 2nd type: Predict the outcome of the branch and speculatively fetch and execute instructions from the predicted path. Instructions from the predicted path are entered into the window of execution. If the prediction is correct, the speculative status is removed and the effect on the state is the same as for any other instruction. If the prediction is incorrect, the speculative execution was wrong and recovery must be initiated.
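The notion of a block used on this slide can be made concrete with a small sketch. It assumes a toy program representation that is not in the slides: a list of `(opcode, target)` pairs, where `target` is the index of the destination instruction for branches and jumps and `None` otherwise. The function name `basic_blocks` is hypothetical as well.

```python
def basic_blocks(program):
    """Split a toy program into basic blocks.

    program: list of (opcode, target) pairs; target is an instruction
    index for branches/jumps and None for everything else (assumed format).
    Returns a list of blocks, each a list of instruction indices.
    """
    if not program:
        return []
    leaders = {0}  # the first instruction always starts a block
    for i, (_opcode, target) in enumerate(program):
        if target is not None:
            leaders.add(target)          # a branch target starts a block
            if i + 1 < len(program):
                leaders.add(i + 1)       # so does the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(program)]
    return [list(range(s, e)) for s, e in zip(starts, ends)]


example = [("add", None), ("beq", 3), ("sub", None), ("mul", None), ("jmp", 0)]
blocks = basic_blocks(example)  # [[0, 1], [2], [3, 4]]
```

Within each returned block, control flow only ever falls through from one instruction to the next, which is why all of its instructions can be placed in the window of execution together.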
Problem Solved by Superscalar MPs (II)
Next come the data dependences. They occur among instructions because the instructions may read or write the same storage location. (DEFINITION) When this happens, a hazard is said to exist. There are 3 types of hazards: RAW (read after write), WAR (write after read) and WAW (write after write). After control and data dependences are resolved, instructions are issued for execution. In essence, the H/W creates a parallel execution schedule in which the order of instructions differs from that of the sequential program. Moreover, speculative execution means that some instructions complete execution that would not have been executed at all if the sequential model had been followed. Let us see that in the next picture.
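The three hazard types named on this slide can be illustrated with a short sketch. The `Instr` record and the `hazards` function are hypothetical names, and the model assumes each instruction writes exactly one register and reads a tuple of registers.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Return the hazards between two instructions in program order."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")    # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")    # later overwrites what earlier still has to read
    if earlier.dest == later.dest:
        found.add("WAW")    # both write the same location
    return found


i1 = Instr("r3", ("r1", "r2"))   # r3 <- r1 op r2
i2 = Instr("r4", ("r3", "r5"))   # r4 <- r3 op r5
raw_example = hazards(i1, i2)    # {"RAW"}
```

Only RAW is a true dependence; WAR and WAW arise from reusing storage locations, which is what renaming later removes.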
Instruction Fetching and Branch Prediction (I)
Superscalar MPs have an instruction cache, a memory containing recently used instructions, which reduces fetch latency. It is organised into blocks or lines. The default method for IF is to increment the PC by the number of instructions fetched and use the incremented PC to fetch the next block.
Processing of conditional branch instructions can be broken down into:
Recognizing conditional branches: Obvious! Some extra bits of decode information are held in the instruction cache to identify all types of instructions.
Determining the outcome: Some predictors use static information, such as the fact that certain opcode types more often result in taken branches, or execution statistics. Other predictors use dynamic information, such as the past history of branch outcomes, kept in a history (or prediction) table. Usually two bits are used per entry. These bits form a saturating counter that is incremented when the branch is taken and decremented when it is not taken.
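A minimal sketch of the two-bit counter scheme described on this slide follows. The class name, the table size of 16, the indexing by low PC bits and the initial counter value are all assumptions made for illustration; the slide only specifies the increment/decrement behaviour.

```python
class TwoBitPredictor:
    """Prediction table of 2-bit saturating counters (values 0..3)."""

    def __init__(self, entries=16):
        self.entries = entries
        # Start every counter at 1 ("weakly not taken"); this is an assumption.
        self.table = [1] * entries

    def _index(self, pc):
        # Assumed indexing: drop the 2 byte-offset bits, then wrap.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Predict taken when the counter is in the upper half (2 or 3)."""
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Increment on taken, decrement on not taken, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

Because the counter saturates, a single not-taken outcome in a long run of taken branches (for example a loop exit) does not flip the prediction.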
Instruction Fetching and Branch Prediction (II)
Computing branch targets: Usually an integer addition is required. In most computers, targets are relative to the PC and are given as an offset. To speed up the process, a branch target buffer holds the target address that was used the last time the branch was executed.
Transferring control: When a branch is predicted taken, there is at least one cycle of delay in recognizing the branch, modifying the PC and fetching instructions from the target. This may result in pipeline bubbles. The solution is to use the instruction buffer to mask the delay. Some of the earlier RISC instruction sets use delayed branches, that is, a branch does not take effect until after the instruction that follows it.
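The branch target buffer mentioned on this slide can be sketched as a small direct-mapped table. The organisation (direct-mapped, tagged with the full branch PC, 64 entries) and the names `BranchTargetBuffer`, `lookup` and `record` are assumptions; the slide only says that the buffer holds the target used the last time the branch executed.

```python
class BranchTargetBuffer:
    """Direct-mapped buffer: branch PC -> target used last time."""

    def __init__(self, entries=64):
        self.entries = entries
        self.slots = [None] * entries   # each slot is (branch_pc, target) or None

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumed indexing by low PC bits

    def lookup(self, pc):
        """Return the remembered target for this branch, or None on a miss."""
        slot = self.slots[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]
        return None

    def record(self, pc, target):
        """Remember the target of a branch that has just executed."""
        self.slots[self._index(pc)] = (pc, target)
```

On a hit the fetch unit can redirect to the remembered target without waiting for the PC-plus-offset addition; on a miss it falls back to computing the target.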
Decoding, Renaming and Dispatch
This phase includes the detection and resolution of hazards. The main job is to set up one or more execution tuples for each instruction. (DEFINITION) A tuple is an ordered list containing the operation, the storage elements for the inputs and the location of the output.
To increase ILP, there are often physical storage elements that differ from the logical ones. Several physical elements may hold different values of the same logical register at the same time. When an instruction creates a new value for a logical register, the physical element that holds it is given a name known to the H/W. (DEFINITION) Renaming is defined as replacing the logical register name with the new physical name.
There are 2 renaming methods:
1st method: There is a physical register file larger than the logical one, and a mapping table is used to associate the two. Renaming is performed in sequential program order.
2nd method: The physical register file is of the same size as the logical one, and there is a buffer with one entry per active instruction. This buffer is called the reorder buffer.
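The first renaming method (a larger physical register file plus a mapping table) can be sketched as follows. The class `Renamer`, the register naming (`r0..`, `p0..`) and the free list are illustrative assumptions, and returning old physical registers to the free list is deliberately left out.

```python
class Renamer:
    """Mapping-table renaming with a physical file larger than the logical one."""

    def __init__(self, logical, physical):
        # Initially logical register ri maps to physical register pi.
        self.map = {f"r{i}": f"p{i}" for i in range(logical)}
        # The remaining physical registers are free for new values.
        self.free = [f"p{i}" for i in range(logical, physical)]

    def rename(self, dest, srcs):
        """Rename one instruction; must be called in sequential program order."""
        # Sources are read through the current mapping before it changes.
        phys_srcs = tuple(self.map[s] for s in srcs)
        if not self.free:
            raise RuntimeError("no free physical register; dispatch must stall")
        new_dest = self.free.pop(0)
        self.map[dest] = new_dest   # later readers of dest see the new name
        return new_dest, phys_srcs


r = Renamer(logical=4, physical=8)
first = r.rename("r1", ("r2", "r3"))   # ("p4", ("p2", "p3"))
second = r.rename("r1", ("r1",))       # ("p5", ("p4",))
```

The two writes to `r1` end up in different physical registers, so the WAW hazard between them disappears while the RAW dependence (`second` reads `p4`) is preserved.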
Instruction Issuing and Parallel Execution
There are 3 ways of organizing the instruction issue buffers:
Single Queue Method: Register renaming is not required. Operand availability can be managed via simple reservation bits assigned to each register. An instruction may issue if there are no reservations on its operands.
Multiple Queue Method: Instructions issue from each queue in order, but the queues may issue out of order with respect to one another. The individual queues are organized according to instruction type (e.g. fp queues, int queues, load/store queues).
Reservation Stations: Instructions may issue out of order. All the stations monitor their source operands for data availability at the same time. The usual way of doing this is to hold the operand data in the reservation station itself.
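The single queue method with reservation bits can be sketched as below. The function name `issue_in_order` and the `(dest, srcs)` instruction format are hypothetical, functional-unit availability is ignored, and treating the destination register as one of the operands to check is an assumption.

```python
def issue_in_order(queue, reserved):
    """Issue instructions from the head of a single queue, in order.

    queue:    list of (dest, srcs) pairs in program order (assumed format).
    reserved: set of registers with a pending write; modified in place.
    Returns the list of instructions issued this cycle.
    """
    issued = []
    for dest, srcs in queue:
        if dest in reserved or any(s in reserved for s in srcs):
            break                 # in-order issue: everything behind stalls too
        reserved.add(dest)        # reserve the result register at issue
        issued.append((dest, srcs))
    return issued


queue = [("r1", ("r2", "r3")), ("r4", ("r5",)), ("r6", ("r1",)), ("r7", ("r8",))]
reserved = set()
cycle1 = issue_in_order(queue, reserved)       # first two issue; r6 waits on r1
reserved.discard("r1")                         # the r1 producer completes
cycle2 = issue_in_order(queue[2:], reserved)   # remaining two issue
```

Note that the independent instruction writing `r7` is held back in the first cycle purely because it sits behind a stalled one, which is the cost the multiple queue and reservation station schemes try to reduce.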
Handling Memory Operations
To reduce latency, memory hierarchies are used. Most computers today use caches (L1, L2). The first is smaller but faster and sits on chip. Unlike ALU operations, load/store instructions need an address calculation, usually an integer addition. After that, an address translation is needed to generate a physical address. A Translation Lookaside Buffer (TLB) is used to speed up this step. Some superscalar processors allow only a single memory operation per cycle. The trend is to allow multiple memory requests at the same time, using a multiported memory hierarchy. Most commonly only the L1 cache is multiported, because most requests do not proceed to the lower levels of the memory hierarchy. Once the operation has been submitted to the memory hierarchy, it may hit or miss in the data cache. On a miss, the accessed location must be fetched into the cache. Miss Handling Status Registers are used to track the status of outstanding misses and allow multiple requests to be overlapped.
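How a set of miss handling status registers lets misses overlap can be sketched as follows. The class `MissTracker`, the number of entries, the block size, and the merging of a second miss to an already pending block are all assumptions for illustration; the slide only states that outstanding misses are tracked so that requests can overlap.

```python
class MissTracker:
    """Tracks outstanding cache misses, one entry per pending block."""

    def __init__(self, entries=4, block_bytes=32):
        self.entries = entries
        self.block_bytes = block_bytes
        self.pending = {}   # block number -> destination registers waiting on it

    def miss(self, addr, dest):
        """Register a load miss; returns 'allocated', 'merged' or 'stall'."""
        block = addr // self.block_bytes
        if block in self.pending:
            self.pending[block].append(dest)   # block is already being fetched
            return "merged"
        if len(self.pending) == self.entries:
            return "stall"                     # no free entry: the load must wait
        self.pending[block] = [dest]
        return "allocated"

    def fill(self, addr):
        """The block arrived from the next level; return the loads it wakes up."""
        return self.pending.pop(addr // self.block_bytes, [])
```

With more than one entry, a second miss can be sent to the next level of the hierarchy while the first is still outstanding, instead of blocking the cache.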
The Committing State
This is the final phase of an instruction. Its purpose is to implement the appearance of a sequential execution model, even though the actual execution is not sequential. The actions necessary in this phase depend on the technique used to recover the precise state.
1st technique: The state is saved (or checkpointed) at certain points in a history buffer. Instructions update the state as they execute, and when a precise state is needed it is recovered from the history buffer.
2nd technique: The state is separated into 2 parts, the implemented physical state and a logical state. The physical state is updated immediately as operations complete. The logical state is updated in sequential program order, as the speculative status of instructions is cleared. The speculative state is maintained in a reorder buffer.
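The second technique, in which results wait in a reorder buffer and reach the logical state in program order, can be sketched like this. The class and method names are hypothetical, and each instruction is assumed to write a single register.

```python
from collections import deque


class ReorderBuffer:
    """Holds results in program order until they may update the logical state."""

    def __init__(self):
        self.entries = deque()   # oldest instruction at the left
        self.logical = {}        # committed (precise) register state

    def dispatch(self, tag, dest):
        """Allocate an entry at the tail, in program order."""
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})

    def complete(self, tag, value):
        """Record a result; completion may happen in any order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"] = value
                entry["done"] = True
                return
        raise KeyError(tag)

    def commit(self):
        """Retire finished instructions from the head only; return their tags."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.logical[entry["dest"]] = entry["value"]
            committed.append(entry["tag"])
        return committed
```

If the youngest instruction finishes first, `commit()` retires nothing until the older ones are done, so `logical` always corresponds to some point in the sequential program.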
MIPS R10000
Fetches 4 instructions per cycle. These are predecoded when they enter the cache.
Branch prediction uses a prediction table (512 entries, each a 2-bit counter that encodes history).
If a branch is predicted taken, it takes 1 clock cycle to redirect the IF. During this cycle, sequential instructions are fetched and placed in a resume cache (4 blocks).
When a branch is predicted, the processor takes a snapshot of the register mapping table. If the branch is mispredicted, the register mapping can be recovered quickly.
4 instructions at a time are dispatched into one of three instruction queues: memory, integer, fp.
Functional units: an address adder, 2 int ALUs (one shifts and the other multiplies/adds), an fp multiplier/divider/square-rooter, an fp adder.
On-chip primary cache, and an L2 cache.
A reorder buffer mechanism maintains a precise state.
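The snapshot idea on the R10000 slide (copy the register mapping table when a branch is predicted, restore it on a misprediction) can be sketched as follows. This is not a model of the actual R10000 hardware: the class `CheckpointedMap`, the branch identifiers and the unbounded snapshot list are assumptions made for illustration.

```python
class CheckpointedMap:
    """Register mapping table with one snapshot per unresolved branch."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)
        self.snapshots = []   # (branch_id, saved mapping), oldest first

    def predict_branch(self, branch_id):
        """Take a snapshot of the mapping at the predicted branch."""
        self.snapshots.append((branch_id, dict(self.mapping)))

    def rename(self, logical, physical):
        """Speculatively remap a logical register."""
        self.mapping[logical] = physical

    def resolve(self, branch_id, mispredicted):
        """Drop the snapshot if the prediction held, else restore from it."""
        for pos, (bid, saved) in enumerate(self.snapshots):
            if bid == branch_id:
                break
        else:
            raise KeyError(branch_id)
        if mispredicted:
            self.mapping = saved
            del self.snapshots[pos:]   # younger branches were on the wrong path
        else:
            del self.snapshots[pos]
```

Restoring a whole snapshot is a single copy, which is why recovery is quick compared with undoing the speculative renames one by one.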
ALPHA 21164
IF from an 8KB instruction cache, 4 instructions per cycle.
Instructions are issued in program order. That restricts the instruction issue rate but simplifies the control logic.
Branch prediction uses a prediction table that records history with 2-bit counters. This table is held in the cache.
Functional units: 2 int ALUs, an fp adder, an fp multiplier.
2 levels of cache on chip. The primary cache can sustain a number of outstanding misses through a six-entry miss address file (MAF), which holds the address and target register for each load that misses.
To provide a sequential state, this processor does not issue out of order and keeps the instructions in sequence as they flow down the pipeline.
AMD K5
Uses variable-length instructions, sequentially predecoded with 5 predecode bits.
Branch prediction uses one prediction entry per cache line, with a single-bit counter to encode history.
2 cycles are consumed for decoding. Instructions are converted into RISC-like OPerations known as ROPs.
Functional units: 2 int ALUs (one shifts, the other divides), an fp unit, 2 load/store units, a branch unit. The reservation stations are distributed among these functional units.
8KB cache.
A 16-entry reorder buffer maintains a precise state.