CS 704 Advanced Computer Architecture

Lecture 11: Computer Hardware Design (Pipeline and Instruction Level Parallelism)
Prof. Dr. M. Ashraf Chughtai
Welcome to the 11th lecture of the series of lectures on Advanced Computer Architecture. Today we will focus our discussion on the pipeline architecture.

Lecture 11 –Computer Hardware Design (5)
Today’s Topics
- Recap: Lecture 10
- Structural Hazards
- Data Hazards
- Control Hazards
MAC/VU-Advanced Computer Architecture Lecture 11 –Computer Hardware Design (5)

Recap: Lecture 10
- Multi-cycle datapath versus pipelined datapath
- Key components of the pipelined datapath
- Performance enhancement due to pipelining
- Introduction to hazards in the pipelined datapath

Structural Hazards
A structural hazard is an attempt to use the same resource in two different ways at the same time. For example, a single memory port accessed for both an instruction fetch and a data read in the same clock cycle would be a structural hazard.
Example: next slide

Single Memory is a Structural Hazard
[Figure: pipeline diagram of a Load followed by further instructions; time (clock cycles) runs across, instruction order runs down]
Two memory read operations occur in the 4th cycle: the Load instruction accesses memory to read data while the 4th instruction is fetched from the same memory.

Single Memory is a Structural Hazard
[Figure: the same pipeline diagram with a stall (bubble) inserted before the conflicting instruction fetch; time (clock cycles) runs across, instruction order runs down]
Insert a stall (bubble) to avoid the memory structural hazard.
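
The bubble insertion on this slide can be sketched in a few lines of Python. This is only an illustrative model, not part of the lecture: the function name `schedule_fetches` is hypothetical, only loads are assumed to use the memory port in the Mem stage, and Mem is assumed to be the 4th stage as in the 5-stage pipeline discussed here.

```python
def schedule_fetches(is_load):
    """Return the fetch cycle of each instruction when instruction fetch
    and data access share one single-ported memory.

    is_load: list of booleans, one per instruction in program order.
    Assumption: only loads occupy the memory port in their Mem stage.
    """
    fetch_cycles = []
    mem_busy = set()  # cycles in which a load's Mem stage owns the port
    cycle = 1
    for load in is_load:
        # If an earlier load is using the port, the fetch must wait: a bubble.
        while cycle in mem_busy:
            cycle += 1
        fetch_cycles.append(cycle)
        if load:
            # Mem is the 4th stage, i.e. three cycles after the fetch.
            mem_busy.add(cycle + 3)
        cycle += 1
    return fetch_cycles


# A Load followed by four other instructions, as on the slide.
print(schedule_fetches([True, False, False, False, False]))
```

In this model the fetch that would have happened in cycle 4 slips to cycle 5, which is the one-cycle bubble described above.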

Structural Hazards
A structural hazard also exists when the single write port of the register file is accessed for two WB operations in the same clock cycle.
- This situation does not arise in the 5-stage pipeline.
- But it may arise when 4-stage and 5-stage instructions are mixed in the pipeline.
Explanation: next slides

Pipelining the Load Instruction
Cycle:   1       2        3        4        5     6     7
1st lw   Ifetch  Reg/Dec  Exec     Mem      Wr
2nd lw           Ifetch   Reg/Dec  Exec     Mem   Wr
3rd lw                    Ifetch   Reg/Dec  Exec  Mem   Wr

- The five independent functional units in the pipeline datapath are: Inst. Fetch, Dec/Reg. Rd, ALU for Exec, Data Mem, and the Register File’s write port for the Wr stage.
- Here we have separate register read and write ports, so the registers can be read and written at the same time.
- Each functional unit is used once.

For the load instructions, the five independent functional units in the pipeline datapath are: (a) Instruction Memory for the Ifetch stage, (b) the Register File’s read ports for the Reg/Dec stage, (c) the ALU for the Exec stage, (d) Data Memory for the Mem stage, and (e) finally the Register File’s write port for the Write Back stage. Notice that I have treated the Register File’s read and write ports as separate functional units because the register file we have allows us to read and write at the same time.
Notice that as soon as the 1st load finishes its Ifetch stage, it no longer needs the Instruction Memory. Consequently, the 2nd load can start using the Instruction Memory (2nd Ifetch). Furthermore, since each functional unit is used only ONCE per instruction, we will not have any conflict down the pipeline (Exec-Ifetch, Mem-Exec, Wr-Mem) either. I will show you the interaction between instructions in the pipelined datapath later.
But for now, I want to point out the performance advantages of pipelining. If these 3 load instructions are executed by the multiple-cycle processor, they will take 15 cycles. With pipelining, they take only 7 cycles. This (7 cycles), however, is not the best way to look at the performance advantages of pipelining. A better way to look at it is that one instruction enters the pipeline every cycle, so one instruction comes out of the pipeline (Wr stage) every cycle. Consequently, the “effective” (or average) number of cycles per instruction is now ONE, even though it takes a total of 5 cycles to complete each instruction.
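
The cycle counts quoted in these notes follow from two simple formulas, sketched here in Python as an illustration. The function names are hypothetical; the model assumes an ideal pipeline with no stalls and one instruction issued per cycle.

```python
def multicycle_cycles(n_instructions, stages=5):
    """Multi-cycle processor: each instruction runs all stages before the next starts."""
    return n_instructions * stages


def pipelined_cycles(n_instructions, stages=5):
    """Ideal pipeline: the first instruction needs `stages` cycles to fill the
    pipe, and every later instruction completes one cycle after its predecessor."""
    return stages + (n_instructions - 1)


print(multicycle_cycles(3))   # the three loads on the multi-cycle machine
print(pipelined_cycles(3))    # the same three loads when pipelined

# The average cycles per instruction approaches 1 as the program gets longer.
for n in (3, 100, 10_000):
    print(n, pipelined_cycles(n) / n)
```

For the three loads this gives 15 cycles against 7, and for long instruction streams the average approaches the effective CPI of one mentioned in the notes.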

The Four Stages of R-type
Cycle:   1       2        3     4
R-type   Ifetch  Reg/Dec  Exec  Wr

- The R-type instruction does not access data memory, so it takes only 4 clocks, or 4 stages, to complete.
- Here the ALU is used to operate on the register operands.
- The result is written into the register during the WB stage.

Well, so far so good. Let’s take a look at the R-type instructions. The R-type instruction does NOT access data memory, so it takes only four clock cycles, or in our new pipeline terminology, four stages to complete. The Ifetch and Reg/Dec stages are identical to those of the Load instruction. They have to be, because at this point we do not yet know that we have an R-type instruction. Instead of calculating the effective address during the Exec stage, the R-type instruction uses the ALU to operate on the register operands. The result of this ALU operation is written back to the register file during the Wr stage.

Pipelining the R-type and Load Instruction
[Figure: pipeline diagram over Cycles 1 to 9 mixing 4-stage R-type instructions (Ifetch, Reg/Dec, Exec, Wr) with a 5-stage Load (Ifetch, Reg/Dec, Exec, Mem, Wr)]
Oops! We have a problem!
We have a pipeline conflict, or structural hazard:
- Two instructions try to write to the register file at the same time!
- There is only one write port.

What happens if we try to pipeline the R-type instructions with the Load instructions? Well, we have a problem here! We end up with two instructions trying to write to the register file at the same time. Why do we have this problem (the write “bubble”)? The reason is that there is something I have not yet told you.

Important Observation
- Each functional unit can be used only once per instruction.
- Each functional unit must be used at the same stage for all instructions:
  - Load uses the Register File’s write port during its 5th stage.
  - R-type uses the Register File’s write port during its 4th stage.
Two possible solutions: next

I already told you that for the pipeline to work perfectly, each functional unit can be used ONLY once per instruction. What I have not told you is that this (1st bullet) is a necessary but NOT sufficient condition for the pipeline to work. The other condition to prevent pipeline hiccups is that each functional unit must be used at the same stage for all instructions. For example, the load instruction uses the Register File’s write port during its 5th stage, but the R-type instruction right now uses the Register File’s write port during its 4th stage. This (5 versus 4) is what caused our problem. How do we solve it? We have 2 solutions.

Solution 1: Insert “Bubble” into the Pipeline
[Figure: pipeline diagram over Cycles 1 to 9 of a Load and several R-type instructions, with a pipeline bubble inserted after the Load]
- Insert a “bubble” into the pipeline to prevent 2 writes in the same cycle.
- The control logic can be complex.
- We lose an instruction fetch and issue opportunity.
- No instruction is started in Cycle 6!

The first solution is to insert a “bubble” into the pipeline AFTER the load instruction, pushing back by one cycle every instruction after the load that is already in the pipeline. At the same time, the bubble delays by one cycle the Instruction Fetch of the instruction that is about to enter the pipeline. Needless to say, the control logic to accomplish this can be complex. Furthermore, this solution also has a negative impact on performance. Notice that due to the “extra” stage (Mem) of the Load instruction, we will not have one instruction finishing every cycle (points to Cycle 5). Consequently, a mix of load and R-type instructions will NOT have an average CPI of 1 because, in effect, the Load instruction has an effective CPI of 2. So this is not that hot an idea. Let’s try something else.

Solution 2: Delay R-type’s Write by One Cycle
Delay R-type’s register write by one cycle:
- Now R-type instructions also use the Reg File’s write port at Stage 5.
- The Mem stage is a NO-OP stage: nothing is being done.

Stage:   1       2        3     4    5
R-type   Ifetch  Reg/Dec  Exec  Mem  Wr

[Figure: pipeline diagram over Cycles 1 to 9 of R-type, R-type, Load, R-type, R-type, each passing through Ifetch, Reg/Dec, Exec, Mem, Wr]

Well, one thing we can do is add a “Nop” stage to the R-type instruction pipeline to delay its register file write by one cycle. Now the R-type instruction ALSO uses the register file’s write port at its 5th stage, so we eliminate the write conflict with the load instruction. This is a much simpler solution as far as the control logic is concerned. As far as performance is concerned, we also get back to having one instruction complete per cycle. This is kind of like promoting socialism: by making each individual R-type instruction take 5 cycles instead of 4 to finish, our overall performance is actually better off. The reason for this higher performance is that we end up with a more efficient pipeline.
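
The write-port conflict and its removal by the no-op Mem stage can be checked with a small Python sketch. It is an illustrative model only: the function names are hypothetical, one instruction is assumed to issue per cycle starting at cycle 1, and the instruction sequence (R-type, R-type, Load, R-type, R-type) is the ordering assumed from the diagram.

```python
def write_cycles(program, stages_of):
    """Cycle in which each instruction uses the register write port.

    program:   list of instruction kinds in issue order (one issue per cycle).
    stages_of: number of pipeline stages per kind; the write is the last stage.
    """
    return [issue + stages_of[kind] - 1
            for issue, kind in enumerate(program, start=1)]


def write_port_conflicts(program, stages_of):
    """Return (earlier_index, later_index, cycle) for every clash on the single write port."""
    first_user = {}
    conflicts = []
    for index, cycle in enumerate(write_cycles(program, stages_of)):
        if cycle in first_user:
            conflicts.append((first_user[cycle], index, cycle))
        else:
            first_user[cycle] = index
    return conflicts


program = ["rtype", "rtype", "load", "rtype", "rtype"]

# R-type finishes in 4 stages, Load in 5: the Load and the next R-type clash.
print(write_port_conflicts(program, {"rtype": 4, "load": 5}))

# Solution 2: pad R-type with a no-op Mem stage so everything writes in stage 5.
print(write_port_conflicts(program, {"rtype": 5, "load": 5}))
```

With the 4-stage R-type, the Load and the R-type issued right after it both reach the write port in cycle 7; once every instruction writes in its 5th stage, the list of conflicts is empty.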

Eliminating Structural Hazards?
Structural hazards can be eliminated or minimized either by using the stall operation or by adding multiple functional units.
[Figure: program flow (Load, 2nd Inst., 3rd Inst., 4th Inst., 5th Inst.) against time through the IFetch, Dcd, Exec, Mem, WB stages, with a stall]

Example: Dual-port vs. Single-port
Machine A: dual-ported memory
Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both
Loads are 40% of the instructions executed

SpeedUpA = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe) = Pipeline Depth
SpeedUpB = Pipeline Depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05)) = (Pipeline Depth / 1.4) x 1.05 = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33

Machine A is 1.33 times faster.
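
The arithmetic of this example can be reproduced with a short Python sketch. The function name `pipeline_speedup` and the pipeline depth of 5 are assumptions made only for illustration; the ratio between the two machines does not depend on the depth chosen.

```python
def pipeline_speedup(depth, stall_cycles_per_instr, clock_ratio=1.0):
    """Speedup over the unpipelined machine.

    depth:                  pipeline depth (ideal speedup)
    stall_cycles_per_instr: average stall cycles added to the ideal CPI of 1
    clock_ratio:            unpipelined cycle time / pipelined cycle time
    """
    return depth / (1.0 + stall_cycles_per_instr) * clock_ratio


DEPTH = 5            # assumed for illustration
LOAD_FRACTION = 0.4  # loads are 40% of instructions (from the slide)

# Machine A: dual-ported memory, no structural stalls.
speedup_a = pipeline_speedup(DEPTH, 0.0)

# Machine B: every load stalls one cycle, but the clock is 1.05 times faster.
speedup_b = pipeline_speedup(DEPTH, LOAD_FRACTION * 1, clock_ratio=1.05)

print(speedup_a, speedup_b, speedup_a / speedup_b)
```

The ratio comes out at about 1.33 for any depth, in line with the slide's conclusion that Machine A is the faster design despite its slower clock.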

Stall degrades the performance
Here is an example: suppose data reference instructions constitute 40% of the mix, and the processor with the structural hazard has a clock rate 1.05 times higher than the processor without the hazard.

Average instruction time = CPI x clock cycle time
= (1 + 0.4 x 1) x (ideal clock cycle time / 1.05)
= (1.4 / 1.05) x ideal clock cycle time
= 1.3 x ideal clock cycle time

The processor without the structural hazard is about 1.3 times faster than the one with the structural hazard.

Additional Functional Units increase cost
The memory structural hazard is removed by using two cache memory units:
- Instruction memory
- Data memory
Two write ports in the register file allow a mix of 4-stage and 5-stage pipelines.

Data Hazards
A data hazard is an attempt to use an item before it is ready.
- Laundry analogy: one sock of a pair is in the dryer and one is in the washer; you can’t fold until the sock from the washer has gone through the dryer.
- An instruction depends on the result of a prior instruction that is still in the pipeline.

Data Hazards
- Pipelining changes the relative timing of instructions by overlapping their execution.
- This overlap introduces data and control hazards.
- A data hazard occurs when the order of operand reads and writes is changed vis-à-vis the sequential access to the operands, which exposes a data dependency.
Let us consider an example.

Example Data Hazard on R1
Add R1, R2, R3
Sub R4, R1, R3
And R6, R1, R7
Or  R8, R1, R9
Xor R10, R1, R11

Data Hazard: Dependencies Backwards in Time Are Hazards
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB; units Im, Reg, ALU, Dm, Reg) against time (clock cycles) for Add R1,R2,R3; Sub R4,R1,R3; And R6,R1,R7; Or R8,R1,R9; Xor R10,R1,R11]
The Add instruction provides its result to Sub after 3 clock cycles, to And after 2, and to Or after 1.

Data Hazard Solution #1 - Stall
Stall cycles are inserted after the next instruction’s IF and decode, before its register read.
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) against time (clock cycles) for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11, with stall cycles inserted]

XOR: No Data Hazard here, as register is read after being written
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB; units Im, Reg, ALU, Dm, Reg) against time (clock cycles) for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11]
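
The stall counts implied by this example can be worked out with a small Python sketch. It is an illustrative model, not the lecture's own: the function name is hypothetical, the producer is assumed to be fetched in cycle 1 of a 5-stage pipeline, and each consumer is treated in isolation (stalls of earlier consumers are not accumulated).

```python
def stalls_without_forwarding(distance, split_cycle_regfile=True):
    """Stall cycles a consumer needs when there is no forwarding.

    distance: how many instructions after the producer the consumer issues
              (1 = the very next instruction).
    split_cycle_regfile: True if the register file is written in the first
              half of a cycle and read in the second half.
    """
    producer_wb = 5                  # producer fetched in cycle 1, WB in cycle 5
    consumer_id = (1 + distance) + 1 # consumer's register read (ID) cycle
    # With a split-cycle register file the read may share the WB cycle;
    # otherwise it has to wait until the cycle after the write.
    earliest_read = producer_wb if split_cycle_regfile else producer_wb + 1
    return max(0, earliest_read - consumer_id)


consumers = ["sub", "and", "or", "xor"]  # issued 1, 2, 3, 4 slots after add
for distance, name in enumerate(consumers, start=1):
    print(name,
          stalls_without_forwarding(distance),
          stalls_without_forwarding(distance, split_cycle_regfile=False))
```

Under these assumptions the xor never stalls, and the or stalls only if the register file cannot be written and read in the same cycle.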

Data Hazard Solution - Forwarding
- “Forward” the result from one stage to another: from the EX/MEM pipeline register to the ALU stage of Sub, and from the MEM/WB pipeline register to the ALU stage of And.
- No forwarding is needed when the register is written in the first half of the cycle and read in the second half.
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) against time (clock cycles) for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11, with forwarding paths]

Forwarding (or Bypassing): What about Loads?
- Dependencies backwards in time are hazards.
- In this case we can’t solve the hazard with forwarding: we must delay (stall) the instruction that depends on the load.
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) against time (clock cycles) for lw r1,0(r2) followed by sub r4,r1,r3]
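
Why forwarding covers the ALU case but not the load case can be seen from a short Python sketch. It is an illustrative model with hypothetical names: the producer is assumed to be fetched in cycle 1, an ALU result is assumed to exist at the end of EX (stage 3), a loaded value at the end of MEM (stage 4), and the consumer is assumed to need its operand at the start of its own EX stage.

```python
# Stage at whose end the producer's value first exists (EX = 3, MEM = 4).
RESULT_READY_STAGE = {"alu": 3, "load": 4}


def stalls_with_forwarding(producer_kind, distance):
    """Stall cycles needed by a consumer issued `distance` slots after the producer."""
    # The producer is fetched in cycle 1, so its stage number equals the cycle.
    ready_at_end_of = RESULT_READY_STAGE[producer_kind]
    # Consumer is fetched in cycle 1 + distance and reaches EX two cycles later.
    consumer_ex = (1 + distance) + 2
    # The value can be forwarded in the cycle after it is produced.
    return max(0, ready_at_end_of + 1 - consumer_ex)


print(stalls_with_forwarding("alu", 1))   # add r1,... followed by sub ...,r1,...
print(stalls_with_forwarding("load", 1))  # lw r1,0(r2) followed by sub r4,r1,r3
print(stalls_with_forwarding("load", 2))  # one independent instruction in between
```

In this model the instruction right after a load still has to wait one cycle, while a consumer two slots later does not, which is why a compiler may try to place an independent instruction after the load.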

Control Hazards
A control hazard occurs when one attempts to take an action before the condition is evaluated.
Explanation: when a branch instruction is executed, it may or may not change the PC to something other than PC+4.
- Branch taken: the branch changes the PC from PC+4 to the new target.
- Branch not taken: the branch leaves the PC at PC+4.
The simple way to deal with the branch is: next please

Control Hazards
The simple way to deal with the branch is to:
- freeze the pipeline, holding any instruction after the branch instruction, and
- flush the pipeline to delete the instructions after the branch once the condition is evaluated and the branch is to be taken.

Control Hazard: Example BEQ Taken
[Figure: pipeline diagram against time (clock cycles) for ADD, BEQ, LOAD]
- Instruction fetched: at the beginning of the ID stage of the BEQ instruction
- Condition evaluated: at the end of the EXE stage of the BEQ instruction

Explanation Branch Hazard
Here, if the BEQ is taken, then:
- the next instruction address is determined after evaluating the branch condition in the EX stage;
- but the next instruction (LOAD) is fetched while the BEQ is in its ID stage, i.e., before the PC is changed from PC+4 to the new target address.
This gives rise to the branch, or control, hazard.
See the example on the next slide.

Dealing with Branches
- Stall
- Redo fetch after branch
- Delayed branching
- Branch prediction
- Multiple streams

Solution#1 Stall
The result of the condition evaluation is available after the EXE stage, and the target address is available in the next stage; thus 3 stall cycles are needed.
[Figure: pipeline diagram against time (clock cycles) for ADD, BEQ, and the target instruction, with stall cycles before the target instruction is fetched]

Reducing number of Stall
Extra H/W is added to evaluate the condition at the end of the ID stage of the BEQ instruction.
[Figure: pipeline diagram against time (clock cycles) for ADD, BEQ, and the target instruction, with fewer stall cycles]
Explanation: next slide

Reducing number of Stall
Here you can see that if we move the decision up to the end of the ID stage (2nd stage) by adding hardware to compare the registers being read, the number of stalls reduces to 2 clock cycles per branch instruction.
It can be further reduced to 1 in the case of BEQZ or BNEZ if the register is tested for zero right after Instruction Fetch.
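
The effect of these stall counts on performance can be estimated with the usual CPI relation, sketched here in Python. The function name is hypothetical, and the 20% branch frequency is an assumed figure chosen only for illustration; the lecture does not give one. The stall counts 3, 2 and 1 are the ones quoted on the slides.

```python
def cpi_with_branch_stalls(branch_fraction, stall_cycles, base_cpi=1.0):
    """Effective CPI when every branch costs `stall_cycles` extra cycles."""
    return base_cpi + branch_fraction * stall_cycles


BRANCH_FRACTION = 0.2  # assumption: 20% of instructions are branches

for label, stalls in [("decision after EXE", 3),
                      ("decision at end of ID", 2),
                      ("BEQZ/BNEZ zero test", 1)]:
    print(label, cpi_with_branch_stalls(BRANCH_FRACTION, stalls))
```

Whatever branch frequency is assumed, each stall cycle removed per branch lowers the effective CPI by that frequency, which is why moving the decision earlier in the pipeline pays off.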

Solution# 2 Redo Fetch after Branch
[Figure: pipeline diagram (IFetch, Dcd, Exec, Mem, WB) for the branch, the branch successor, and branch successor + 1]
We know that once a branch has been detected during the Instruction Decode/Register Read stage, the next instruction fetch cycle should essentially be a stall, if we assume that the branch is taken.
Next slide please

Solution# 2 Redo Fetch after Branch
- However, the instruction fetched in this cycle never performs useful work, and it is ignored.
- Therefore, re-fetching the branch successor instruction provides the correct instruction.
- Indeed, the second fetch is not essential if the branch is not taken.
Impact: 1 clock cycle per branch instruction if the branch is untaken

Solution# 3 Delayed Branch – S/W method
[Figure: pipeline diagram (IF, Reg, ALU, Mem, Reg) against time (clock cycles) for Add, Beq, Misc, Load]
Redefine the branch behavior so that it takes effect after the next instruction, by introducing another instruction (possibly a No-Op) that is always executed.
Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the “slot” (about 50% of the time)

Solution#4 Prediction
This technique suggests that, for a branch instruction, we guess one direction of the branch to begin with, then back up if the guess is wrong.
The two possible predictions are:
- Predict branch not taken
- Predict branch taken

Branch Prediction Flowchart

1. Predict – Branch not taken
This scheme is implemented by assuming that every branch is not taken, so the processor continues to fetch the instructions after the branch as normal.

Sequence when the branch is not taken:
Branch Inst. ‘i’   IF  ID  EX  MEM  WB
Inst. ‘i+1’            IF  ID  EX   MEM  WB
Inst. ‘i+2’                IF  ID   EX   MEM  WB
Inst. ‘i+3’                    IF   ID   EX   MEM  WB

Predict Branch not taken … Cont’d
When the decision has been made and the branch is taken, the fetch operations are turned into NO-OPs and the fetch is restarted at the target address.

Sequence when the branch is taken:
Taken Branch Inst. ‘i’   IF  ID    EX    MEM   WB
Inst. ‘i+1’                  IF    Idle  Idle  Idle  Idle
Branch target                      IF    ID    EX    MEM  WB
Branch target + 1                        IF    ID    EX   MEM  WB

2. Predict - Branch taken
- An alternative way is to treat every branch as taken.
- As soon as the target address is computed, we assume that the branch is to be taken and start fetching and executing at the target.
- In a five-stage pipeline the target address and the condition evaluation are available at the same time, so this technique is of no use.
Let us consider this example of a LOOP to explain the concept:

2. Predict - Branch taken
i = 0
Loop: …….
      i = i+1
      IF i ≠ 1001 THEN Loop
……..
- Here the branch is taken 1000 times, so the prediction “Branch Taken” fails only about 1 time in 1000; hence there is no stall for the 1000 taken branches.
- Further, the compiler can improve performance by organizing the code so that the most frequent path matches the hardware’s choice.
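
The loop example can be checked by simulating the branch and counting wrong guesses for the two static predictions. This Python sketch is only an illustration; the function name `count_mispredictions` is hypothetical, and the loop body is reduced to the counter update and the branch.

```python
def count_mispredictions(predict_taken, limit=1001):
    """Run the loop 'i = i + 1; IF i != limit THEN Loop' and count how often
    a fixed (static) prediction disagrees with the actual branch outcome.

    Returns (branches_executed, mispredictions).
    """
    i = 0
    branches = 0
    mispredictions = 0
    while True:
        i = i + 1
        taken = (i != limit)      # actual outcome of the loop-closing branch
        branches += 1
        if taken != predict_taken:
            mispredictions += 1
        if not taken:             # fall out of the loop
            break
    return branches, mispredictions


print(count_mispredictions(predict_taken=True))
print(count_mispredictions(predict_taken=False))
```

The branch executes 1001 times and is taken 1000 times, so predicting taken is wrong only on the final exit, whereas predicting not taken is wrong on every iteration except the last.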

Solution #5 Multiple Streams
- Have two pipelines
- Pre-fetch each branch into a separate pipeline
- Use the appropriate pipeline
Results:
- Leads to bus and register contention
- Multiple branches lead to further pipelines being needed

Solution# 5 Pre-fetch Branch Target
- The target of the branch is pre-fetched in addition to the instructions following the branch
- Keep the target until the branch is executed
- Used by the IBM 360/91

Summary
Types of hazards in the pipelined datapath:
- Structural hazards occur when the same resource is accessed by more than one instruction, e.g., one memory port or one register write port.
- They can be removed either by using multiple resources or by inserting stalls.
- Stalls degrade the pipeline performance.

Summary
- Data hazards occur when an attempt is made to read invalid data.
- Data hazards can be removed by using stall and forwarding techniques.
- Control hazards occur when an attempt is made to branch prior to the evaluation of the condition.
- There are four ways to handle control hazards.

Summary – 4 ways to handle control hazard
1: Stall until the branch direction is clear
2: Predict Branch Not Taken
   - Execute successor instructions in sequence
   - “Squash” instructions in the pipeline if the branch is actually taken
   - PC+4 is already calculated, so use it to get the next instruction
3: Predict Branch Taken
4: Delayed Branch
   - Define the branch to take place AFTER a following instruction
   - A 1-slot delay allows the proper decision and branch target address in a 5-stage pipeline

Assalam-u-Alaikum and Allah Hafiz
