# CS5365 Pipelining. Divide a task into a sequence of subtasks. Each subtask is executed by a stage (segment) of the pipe.

CS5365 Pipelining

Divide a task into a sequence of subtasks. Each subtask is executed by a stage (segment) of the pipe.

Linear Pipeline Structure All stages simultaneously execute different subtasks. A stage is specialized hardware: combinational circuits, arithmetic/logic (A/L) operations, processors, etc.

Pipelining Ideally, all stages take the same time to execute their subtasks. Otherwise, the pipe operates at the speed of the slowest subtask.

Pipelining

Clock Period

Note: 1. Once the pipeline is full, it yields one result every clock period. 2. A linear pipeline with k stages can process n tasks in $T_k = k + (n - 1)$ clock periods, where k cycles are used to fill the pipe and complete the first task, and n - 1 additional cycles are needed to complete the n - 1 remaining tasks.
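The cycle count in the note can be checked with a small space-time simulation. This is a minimal sketch, assuming every stage takes exactly one clock period and a new task enters on every clock; the function names are illustrative only.

```python
def pipeline_cycles(k: int, n: int) -> int:
    """Closed form from the note: k cycles to fill the pipe, then n - 1 more."""
    return k + (n - 1)


def simulate_cycles(k: int, n: int) -> int:
    """Walk the space-time diagram: task j occupies stage s during clock j + s + 1."""
    last_clock = 0
    for task in range(n):
        for stage in range(k):
            last_clock = max(last_clock, task + stage + 1)
    return last_clock


if __name__ == "__main__":
    for k, n in [(4, 1), (4, 10), (5, 100)]:
        print(k, n, simulate_cycles(k, n), pipeline_cycles(k, n))
```

Both functions should agree for any k and n of at least 1, since the last task enters at clock n and needs k clocks to drain.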

Speedup Denote by $T_{seq}$ the time required by a non-pipelined uniprocessor to execute n tasks. Then $T_{seq} = n \cdot k$ clock periods, assuming each of the k operations needs the same length of time to execute.

Speedup Otherwise

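Speedup The speedup S(k) obtained by a k-pipelined processor is given as follows: $S(k) = \frac{T_{seq}}{T_k} = \frac{n k}{k + (n - 1)}$, where $T_k = k + (n - 1)$ is the pipelined execution time in clock periods.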

Ideal Pipeline Speedup S(k) Note that when n >> k, then $S(k) \to k$.
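A short numeric sketch of how the speedup approaches the number of stages. It assumes the non-pipelined time is n·k clock periods and the pipelined time is k + (n - 1) clock periods, as in the equal-stage case described above; `speedup` is an illustrative name.

```python
def speedup(k: int, n: int) -> float:
    """Speedup of a k-stage pipeline over a non-pipelined unit for n tasks."""
    sequential = n * k          # every task pays for all k operations
    pipelined = k + (n - 1)     # fill the pipe once, then one result per clock
    return sequential / pipelined


if __name__ == "__main__":
    k = 4
    for n in (1, 4, 16, 256, 10**6):
        print(f"n = {n:>7}: S({k}) = {speedup(k, n):.4f}")
```

For a single task there is no gain, and the value climbs toward k as n grows.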

Ideal Pipeline However, the maximum (ideal) speedup is not achievable because of overhead due to: 1. Data dependencies between tasks. 2. Interrupts. 3. Program branches.

Space-time diagram

Efficiency The efficiency is obtained by dividing the speedup by the number of stages k: $E(k) = \frac{S(k)}{k} = \frac{n}{k + (n - 1)}$

Efficiency

Note also that:

Throughput It is defined as the number of tasks completed per unit of time: $\omega = \frac{n}{[k + (n - 1)]\,\tau}$, where $\tau$ is the clock period.
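Efficiency and throughput can be sketched the same way. The code below assumes n tasks finish in k + (n - 1) clock periods and uses a hypothetical parameter `tau` for the clock period in seconds; neither function name comes from the slides.

```python
def efficiency(k: int, n: int) -> float:
    """Speedup divided by the number of stages: n / (k + n - 1)."""
    return n / (k + (n - 1))


def throughput(k: int, n: int, tau: float) -> float:
    """Tasks completed per unit time when n tasks take (k + n - 1) * tau."""
    return n / ((k + (n - 1)) * tau)


if __name__ == "__main__":
    k, tau = 4, 10e-9  # assumed: 4 stages, 10 ns clock period
    for n in (1, 10, 1000):
        print(n, round(efficiency(k, n), 4), f"{throughput(k, n, tau):.3e} tasks/s")
```

As n grows, the efficiency tends to 1 and the throughput tends to 1/tau, which is the steady-state rate of one result per clock.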

Throughput Techniques to increase throughput. Consider a pipeline with the following configuration, where $T_1 = T_3 = T$ and $T_2 = 3T$. Clearly the bottleneck is $S_2$, with a 3T delay.

Throughput Recall that the throughput ω is inversely proportional to the pipe clock period, and the clock period is set by the slowest stage. So the clock period is 3T, with a total delay of 3 × 3T = 9T, and ω = 1/(3T) for a large number of tasks (steady-state operation).

Throughput How might the throughput be increased? Subdivisions?

Throughput Subdivisions What is the total delay now? Stage 2 is split into three substages of delay T each, so the pipe has five stages clocked at T, with a total delay of 5T.

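Throughput Subdivisions Space-time diagram; each row lists the jobs that pass through that stage, in order:

| Stage | Jobs |
| --- | --- |
| S3 | J0 J1 J2 J3 J4 J5 |
| S2C | J0 J1 J2 J3 J4 J5 |
| S2B | J0 J1 J2 J3 J4 J5 |
| S2A | J0 J1 J2 J3 J4 J5 |
| S1 | J0 J1 J2 J3 J4 J5 |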

Throughput What are the disadvantages of this solution? Increased hardware and additional latches.
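The effect of subdividing the slow stage can be sketched numerically. The code assumes a synchronous pipe whose clock period equals the slowest stage delay, with delays expressed in units of T; the helper names are illustrative.

```python
def clock_period(stage_delays):
    """The pipe is clocked at the speed of its slowest stage."""
    return max(stage_delays)


def total_delay(stage_delays):
    """Every stage takes one full clock, so delay = stages * clock period."""
    return len(stage_delays) * clock_period(stage_delays)


def steady_state_throughput(stage_delays):
    """One result per clock period once the pipe is full."""
    return 1 / clock_period(stage_delays)


if __name__ == "__main__":
    T = 1
    original = [T, 3 * T, T]        # S1, S2, S3
    subdivided = [T, T, T, T, T]    # S1, S2A, S2B, S2C, S3
    for name, delays in (("original", original), ("subdivided", subdivided)):
        print(name, clock_period(delays), total_delay(delays),
              steady_state_throughput(delays))
```

Under these assumptions the subdivided pipe has a shorter total delay and three times the steady-state throughput, at the cost of the extra latches between the new substages.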

Throughput Replication: Stage 2 is replicated three times, and the copies are then interleaved.

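Space-time diagram replication; each row lists the jobs handled by that stage, in order:

| Stage | Jobs |
| --- | --- |
| S3 | J0 J1 J2 J3 J4 J5 |
| S2C | J2 J5 |
| S2B | J1 J4 |
| S2A | J0 J3 |
| S1 | J0 J1 J2 J3 J4 J5 |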
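The interleaving can be sketched as a round-robin schedule. This is a simplified model under stated assumptions: S1 and S3 take T, each copy of S2 takes 3T, job j is sent to copy j mod 3, and a job that finds its copy busy simply waits (buffering is not modeled). All names are hypothetical.

```python
def replicated_schedule(n_jobs, copies=3, t=1, slow=3):
    """Return the finish time of each job, in units of T."""
    s1_free = 0
    s3_free = 0
    copy_free = [0] * copies
    finish = []
    for job in range(n_jobs):
        # Stage S1: one job per T.
        s1_end = s1_free + t
        s1_free = s1_end
        # Stage S2: round-robin over the replicated copies.
        c = job % copies
        s2_start = max(s1_end, copy_free[c])
        s2_end = s2_start + slow * t
        copy_free[c] = s2_end
        # Stage S3: one job per T.
        s3_start = max(s2_end, s3_free)
        s3_end = s3_start + t
        s3_free = s3_end
        finish.append(s3_end)
    return finish


if __name__ == "__main__":
    print(replicated_schedule(6))             # three interleaved copies
    print(replicated_schedule(3, copies=1))   # single slow stage for comparison
```

With three copies the jobs leave S3 one clock apart, while a single copy spaces them three clocks apart.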

Control strategies and configurations Unifunctional vs. multi-functional pipelines. Unifunctional pipelines execute a fixed and dedicated function. A multi-functional pipeline may perform several functions, either at the same time or at different times. Multiple functions are made possible by interconnecting (reconfiguring) several stages at different times.

Control strategies and configurations Static vs. dynamic pipelines. A static pipeline may assume only one functional configuration (unifunctional or multi-functional) at a time. A dynamic pipeline allows several functional configurations at any time (multi-functional), which requires more complex control mechanisms than those required for static pipelines.

Control strategies and configurations Scalar vs. vector pipelines. Scalar pipelines process a sequence of scalar operands under the control of a DO loop. Instructions are prefetched and stored in an instruction buffer. As instructions are executed, operands are fetched from a data cache. Vector pipelines (vector processors) handle vector instructions over vector operands under firmware and hardware control.

Levels of processing Arithmetic pipelines: ALUs are partitioned for pipelined operations. Ex: 4-stage pipes are used in the Star-100, the Cray-1 uses 14 pipeline stages, the Cyber 205 uses 26 stages, etc. Instruction pipelines (instruction lookahead): overlap the execution of the current instruction with the fetch, decode and operand fetch of subsequent instructions. Processor pipelines: a cascade of processors, each executing a different task (a job is divided into different tasks).

Floating-Point Arithmetic Pipeline

Processor pipelining

Instruction Pipeline

Instruction Pipelining

Consider the execution of a single instruction in a uniprocessor system. A sequence of steps can be identified and implemented using a pipeline design:

Problems?

Problems 1. Instruction dependency 2. Pipeline Stalling 3. Branching 4. Conflicts 5. Interrupts

Instruction dependency: An instruction I + 1 being fetched may need the results of a previous instruction I currently in the pipe, so I + 1 must be delayed until the results are known. Stalling: An instruction I + 1 must not destroy data that may be needed by a previous instruction I still in the pipe.

Dependency

Stalling

Memory Access

Stalling

Assume an instruction cache access. Assume a data cache access. Stall condition?

Branching This problem is caused by conditional and unconditional branches.

Branching Suppose an unconditional branch instruction is in the pipe. The target of the branch is not known until S3; the branch instruction updates the program counter to point to the target instruction, and all instructions fetched after the branch instruction must be discarded. Draining the pipe results in degraded throughput. A solution is to freeze the pipeline from fetching new instructions as soon as the branch opcode is decoded in stage S2. Memory traffic as well as throughput degradation is minimized.

Branching In the case of a conditional branch, the evaluation of the condition takes place in S5. Possibilities: 1. Freeze the pipeline from fetching new instructions until the branch target is known. 2. Predict that the branch will not be taken and accept new instructions; otherwise, the pipe must be drained and restarted at the target address. 3. Fetch the target instruction sequence into a buffer as soon as the target address is known in S3, while non-branch instructions continue to be fed into the pipe. If the branch is taken, the pipeline is flushed and instructions are taken from the buffer, which is faster.

Branching A branch instruction is decoded halfway through the pipe. If it is successful, all prefetched instructions in the pipe are useless, and fetching of new instructions is delayed until the completion of the branch instruction. The higher the percentage of branch-type instructions in a program, the slower the program will run on a pipelined processor.

Branching Let p be the probability of a conditional branch instruction (i.e., percentage of conditional instructions in a typical program); let q denote the probability that a conditional branch instruction is successful. Now if m is the number of instructions to be executed, then m×q×p corresponds to the number of instructions that cause successful branches.

Branching If a branch is successful, k -1 cycles are needed to fill the pipe again; where k is the number of stages. So (k - 1)mpq are the cycles needed to fill the pipe for all successful branch instructions.

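Branching The average time to execute an instruction, for a large number of instructions m and a pipe cycle of $\tau$:

$t_{avg} = \lim_{m \to \infty} \frac{[(k + m - 1) + (k - 1)mpq]\,\tau}{m} = [1 + (k - 1)pq]\,\tau$

Branching If the instruction cycle T takes k pipe cycles, let the cycle of the pipe be $\tau = T/k$. Substituting this into the previous equation gives $t_{avg} = [1 + (k - 1)pq]\,T/k$, so the number of instructions completed per instruction cycle is $I = \frac{T}{t_{avg}} = \frac{k}{1 + (k - 1)pq}$.

Branching So if pq = 0, then $I = k$, the ideal rate.

Branching However, a typical value is pq = 0.12. Then, if k = 5, $I = 5/1.48 \approx 3.38$ instructions per instruction cycle, which implies that $1 - 3.38/5 \approx 32.4\%$ of pipeline cycles are wasted due to successful conditional branches.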
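The branch penalty model can be evaluated directly. This is a sketch under the assumption stated in the slides that each successful branch costs k - 1 refill cycles, which gives k / (1 + (k - 1)pq) instructions per instruction cycle for a long instruction stream; the function names are illustrative.

```python
def total_pipe_cycles(m: int, k: int, pq: float) -> float:
    """Cycles for m instructions: normal flow plus (k - 1) refill cycles per taken branch."""
    return (k + m - 1) + (k - 1) * m * pq


def instructions_per_instruction_cycle(k: int, pq: float) -> float:
    """Limit for large m, with one instruction cycle equal to k pipe cycles."""
    return k / (1 + (k - 1) * pq)


def wasted_fraction(k: int, pq: float) -> float:
    """Share of the ideal rate k that is lost to successful branches."""
    return 1 - instructions_per_instruction_cycle(k, pq) / k


if __name__ == "__main__":
    k = 5
    for pq in (0.0, 0.05, 0.12, 0.3):
        rate = instructions_per_instruction_cycle(k, pq)
        print(f"pq = {pq:.2f}: I = {rate:.3f}, wasted = {wasted_fraction(k, pq):.1%}")
```

The finite-m count in `total_pipe_cycles` converges to the closed form as m grows, which is a quick way to sanity-check the limit.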

Reducing the Effect of Branching What might be a strategy employed to reduce the negative effects of branching?

Prefetch Strategy Fetch instructions ahead of the instruction currently being decoded. Store these instructions in a buffer local to the pipeline.

Prefetch Strategy Sequential prefetch buffer: holds instructions from the sequential part of the code. If the branch is successful, the buffer is invalidated. Target prefetch buffer: holds instructions from the target of a conditional branch. If the branch is unsuccessful, the buffer is invalidated.

Branching The decoder can be used to: 1. Detect unconditional branches and fetch a new set of sequential instructions. 2. Detect conditional branches and fetch target instructions.

Branching A requested instruction enters the decoder without delay if: 1. The instruction is in the sequential buffer, for sequential instructions. 2. The instruction is in the target buffer, if a conditional branch is successful. The target buffer then becomes the sequential buffer.

Branch Prediction How easy is it to predict branching, given that most branching is associated with loops? `for (i = 0; i < 100; i++) { /* If the prediction is to branch and loop again, how often will the prediction be wrong? */ }`
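The loop question can be answered by counting mispredictions under a fixed prediction. This sketch assumes a single loop-closing branch that is evaluated once per iteration and is taken on every iteration except the last; real compilers may lay out the loop differently, and the function name is illustrative.

```python
def count_mispredictions(iterations: int, predict_taken: bool = True) -> int:
    """Count wrong guesses for a static predictor on a loop-closing branch."""
    wrong = 0
    for i in range(iterations):
        # The branch jumps back to the loop top unless this was the final iteration.
        actually_taken = i < iterations - 1
        if actually_taken != predict_taken:
            wrong += 1
    return wrong


if __name__ == "__main__":
    n = 100
    print("predict taken:    ", count_mispredictions(n, True), "wrong out of", n)
    print("predict not taken:", count_mispredictions(n, False), "wrong out of", n)
```

Under this model, always predicting "branch and loop again" is wrong only when the loop exits.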

Delayed branching Execute an additional instruction, then branch conditionally.

Conflicts Data is written to a common location by two instructions in the pipeline. The order of updates must be preserved.

Interrupts Instruction I is being executed when an interrupt occurs; this delays the execution of instruction I+1.

Interrupts Two types of interrupts can occur. Precise interrupts: these are caused by illegal opcode detection during instruction decoding. As a result, instructions I and I+1 are not executed; all instructions J where J < I are completed before the interrupt is serviced.

Interrupts Imprecise interrupts: these are caused by storage, address generation and execution functions. When instruction I is halfway through the pipeline and this interrupt occurs, all instructions in the pipeline that are before and after I are completed after the interrupt is serviced.

Reservation Tables These are two-dimensional charts used to show how successive pipeline stages are utilized for a specific function that may require feed-forward or feedback connections.

RT No interlocking pipeline conditions.

RT The reservation table identifies the space-time data flow pattern through the pipeline for one function evaluation. The total number of clock units in the table is called the evaluation time for that function.

RT

Collision Vectors Latency: the number of time units between two initiations. An initiation refers to the launching of an operation into the pipeline.

CV A collision occurs when two tasks are initiated with a latency (initiation interval) equal to the column distance between two entries on the same row in the reservation table. A collision implies that a stage is required to perform more than one task at a time.

CV

Forbidden set F: the collection of column distances between all possible pairs of entries on each row of the reservation table. This set contains all possible latencies that may cause collisions between two initiations.
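The forbidden set can be computed mechanically from a reservation table. In this sketch the table is a hypothetical example (not one from the slides), represented as a mapping from stage name to the clock columns in which that stage is marked. The collision vector is built under the usual convention that bit $C_i$ is 1 when latency i is forbidden, written from $C_m$ down to $C_1$.

```python
from itertools import combinations


def forbidden_latencies(table):
    """Column distances between every pair of marks on the same row."""
    forbidden = set()
    for columns in table.values():
        for a, b in combinations(columns, 2):
            forbidden.add(abs(a - b))
    return forbidden


def collision_vector(forbidden):
    """Bit string C_m ... C_1 with C_i = 1 when latency i is forbidden."""
    if not forbidden:
        return ""
    m = max(forbidden)
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))


if __name__ == "__main__":
    # Hypothetical reservation table: stage -> clock columns in use.
    table = {"S1": [0, 5], "S2": [1, 3], "S3": [2], "S4": [4]}
    f = forbidden_latencies(table)
    print(sorted(f), collision_vector(f))
```

Rows with a single mark contribute nothing, since a stage used once per evaluation cannot collide with itself.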

CV What is the RT?

CV What is the F set?

CV What is the collision vector? $C_5, C_4, C_3, C_2, C_1$

CV - controlling a pipeline Approach: The collision vector is initially loaded into a shift register. For a single function, a controller shifts this register to the right at each clock cycle, inserting a zero at the MSB.

CV - controlling a pipeline The following basic steps are observed by the controller: 1. A collision-free initiation is allowed if a 0 emerges from the shift register. 2. If an initiation is allowed then the shift register is updated by OR-ing its contents with the collision vector. 3. If an initiation is not allowed (because a 1 appeared), then shift the register to the right. At the left end 0s are shifted in.
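The controller steps above can be sketched as a small simulation. It assumes a greedy policy (a task is always waiting, so every collision-free slot is used) and that the first task starts at clock 0 and loads the collision vector; the function name and the greedy policy are assumptions for illustration.

```python
def initiation_times(cv_bits: str, n_tasks: int):
    """Return the clocks at which tasks are launched under greedy control."""
    if n_tasks <= 0:
        return []
    cv = int(cv_bits, 2)
    register = cv          # the first initiation loads the collision vector
    starts = [0]
    clock = 0
    while len(starts) < n_tasks:
        clock += 1
        emerged = register & 1   # bit that falls out on the right
        register >>= 1           # a 0 is shifted in at the MSB
        if emerged == 0:
            # Collision-free: launch a task and OR in the collision vector.
            register |= cv
            starts.append(clock)
    return starts


if __name__ == "__main__":
    print(initiation_times("00111", 4))
    print(initiation_times("101010", 5))
```

The differences between successive start times are the latencies that the greedy controller follows.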

CV The pipeline can now be controlled, but how many states does it have and which is optimal?

State Diagram State Diagram – characterizes the successive initiations of tasks in the pipeline. A state on the diagram is represented by the contents of the shift register.

State Diagram Algorithm to generate the state diagram:

1. Start with the initial collision vector (initial state in a register).
2. For each kth bit = 0, do:
   - Shift the register k bits to the right.
   - Delete the first k bits.
   - Append k zeros to the left.
   - Determine the new state: OR the shift register with the CV.
   - Connect the new state by an arc with weight k.
3. The shift register is set to CV if the latency (number of shifts) is n. A weight n is assigned to such arcs.
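The algorithm above can be sketched as a graph search over register contents. One assumption is made explicit here: any latency greater than the length of the collision vector is represented by a single arc, keyed as `len(cv_bits) + 1`, that returns to the initial state. The function and variable names are illustrative.

```python
def state_diagram(cv_bits: str):
    """Map each state (bit string) to {latency: next state}."""
    n = len(cv_bits)
    cv = int(cv_bits, 2)

    def fmt(state):
        return format(state, f"0{n}b")

    diagram = {}
    pending = [cv]
    while pending:
        state = pending.pop()
        if state in diagram:
            continue
        arcs = {}
        for k in range(1, n + 1):
            # Bit k (counted from the right, starting at 1) must be 0.
            if (state >> (k - 1)) & 1 == 0:
                arcs[k] = (state >> k) | cv
        # Assumed convention: latencies longer than the vector reset to the CV.
        arcs[n + 1] = cv
        diagram[state] = arcs
        pending.extend(arcs.values())
    return {fmt(s): {k: fmt(t) for k, t in arcs.items()}
            for s, arcs in diagram.items()}


if __name__ == "__main__":
    for state, arcs in state_diagram("101010").items():
        print(state, arcs)
```

Each key of the result is a state, and each inner key is the weight of an outgoing arc.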

State Diagram A four stage pipeline has the following CV (00111) What is its state diagram?

State Diagram Shifting one position per clock gives the register contents 00111 (state 1) → 00011 (state 2) → 00001 (state 3) → 00000 (state 4), with further arcs labeled 4, 5 and 6. Only after state 4 can a new operation initiate. If no initiation occurs after state 4, state 5 is reached. If no initiation occurs after state 5, state 6 is reached. If an initiation occurs after state 4, 5 or 6, state 1 is reached.

State Diagram Because states 2, 3 and 4 correspond to states that do not allow initiation, they are not important. The diagram can be reduced to a single state 00111 with arcs of latency 4, 5 and 6 that return to it.

Example Consider a forbidden set of latencies: F = {6, 4, 2}, CV = (101010)

Example What is the state diagram?

Example

State Diagram 1 Consider loop 1 above. It is formed by states 1 and 2. If the pipeline were to follow this loop, it would generate 2 outputs every 1 + 7 = 8 cycles. The number of states traversed is an indication of the number of outputs from the pipeline.

State Diagram 1 The average latency is ? 8/2=4

State Diagram 2 The average latency is ? (5 + 3)/2 = 8/2 = 4. Of course, the pipeline has to get to state 3 from state 1 before it can initiate this loop.
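The average latency of a loop can be checked by walking the shift-register transitions. This sketch assumes the collision vector (101010) from the example, and that state 3 is the register content 101111 reached from the initial state with latency 3 (an inference, since the diagram itself is not reproduced here). The function name is illustrative.

```python
def average_latency(cv_bits: str, start_bits: str, latencies):
    """Validate a latency cycle against the collision vector and average it."""
    n = len(cv_bits)
    cv = int(cv_bits, 2)
    start = int(start_bits, 2)
    state = start
    for k in latencies:
        # A latency is forbidden when bit k (from the right) of the state is 1.
        if k <= n and (state >> (k - 1)) & 1:
            raise ValueError(f"latency {k} collides in state {state:0{n}b}")
        state = (state >> k) | cv
    if state != start:
        raise ValueError("latency sequence does not return to the start state")
    return sum(latencies) / len(latencies)


if __name__ == "__main__":
    print(average_latency("101010", "101010", [1, 7]))  # loop through states 1 and 2
    print(average_latency("101010", "101111", [5, 3]))  # loop starting at state 3
```

A sequence that uses a forbidden latency, or that does not close on its starting state, is rejected instead of being averaged.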

State Diagram What are the latencies for the rest of the loops?

State Diagram

Pipeline An instruction requires four stages to execute: stage 1 (instruction fetch) requires 30 ns, stage 2 (instruction decode) = 9 ns, stage 3 (instruction execute) = 20 ns and stage 4 (store results) = 10 ns. An instruction must proceed through the stages in sequence. What is the minimum asynchronous time for any single instruction to complete? 30 + 9 + 20 + 10 = 69 ns

Pipeline We want to pipeline the instruction from the previous question. How many stages should we have, and at what rate should we clock the pipeline? We have 4 natural stages given and no information on how we might be able to further subdivide them, so we use 4 stages in our pipeline. We have a choice of what clock rate to use. The simplest choice would be to use a clock cycle that accommodates the longest stage in our pipe: 30 ns.

Pipeline This would allow us to initiate a new instruction every 30 ns with a latency through the pipe of 30 ns x 4 stages = 120 ns.

Pipeline What if we chose a finer clock rate? The smallest stage time is 9 ns, so we could pick a clock period of 10 ns. A clock of 10 ns would be a good match and would require 3 clocks for the first stage, 1 clock for the second, 2 clocks for the third and 1 clock for the fourth.

Pipeline What is the latency? This would allow us to initiate a new instruction every 30 ns but provide a latency of 70 ns rather than 120 ns. So 30 ns or 10 ns are good choices.

Pipeline What is the speedup of the pipeline? Stage 1 (instruction fetch) requires 30 ns, stage 2 (instruction decode) = 9 ns, stage 3 (instruction execute) = 20 ns and stage 4 (store results) = 10 ns. The ideal speedup is (30 + 9 + 20 + 10)/30 = 2.3. The speedup under the best clocked definition is (30 + 10 + 20 + 10)/30 = 2.33.
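The arithmetic of this worked example can be reproduced in a few lines. The sketch assumes that each stage occupies a whole number of clocks and that a new instruction can start once the slowest stage is free; variable names are illustrative.

```python
import math

stages_ns = [30, 9, 20, 10]   # fetch, decode, execute, store

# Unclocked (asynchronous) time for one instruction.
asynchronous_ns = sum(stages_ns)

# Coarse clock: one clock per stage, sized for the longest stage.
coarse_clock_ns = max(stages_ns)
coarse_latency_ns = coarse_clock_ns * len(stages_ns)

# Fine clock: 10 ns, each stage rounded up to whole clocks.
fine_clock_ns = 10
clocks_per_stage = [math.ceil(d / fine_clock_ns) for d in stages_ns]
fine_latency_ns = fine_clock_ns * sum(clocks_per_stage)
initiation_interval_ns = fine_clock_ns * max(clocks_per_stage)

ideal_speedup = asynchronous_ns / initiation_interval_ns
clocked_speedup = fine_latency_ns / initiation_interval_ns

if __name__ == "__main__":
    print(asynchronous_ns, coarse_latency_ns, clocks_per_stage, fine_latency_ns)
    print(round(ideal_speedup, 2), round(clocked_speedup, 2))
```

Either clock choice starts one instruction every 30 ns; the finer clock only shortens the latency through the pipe.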

Pipeline Draw the reduced state-diagram using the following collision vector:

Pipeline