Dave Patterson (http.cs.berkeley.edu/~patterson)

Slides:



Advertisements
Similar presentations
1 IKI20210 Pengantar Organisasi Komputer Kuliah no. 25: Pipeline 10 Januari 2003 Bobby Nazief Johny Moningka
Advertisements

Chapter 8. Pipelining.
Review: Pipelining. Pipelining Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer.
Pipelining I (1) Fall 2005 Lecture 18: Pipelining I.
Pipelining Hwanmo Sung CS147 Presentation Professor Sin-Min Lee.
Computer Architecture
Chapter Six 1.
Mary Jane Irwin ( ) [Adapted from Computer Organization and Design,
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania Computer Organization Pipelined Processor Design 1.
1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.
Computer ArchitectureFall 2007 © October 24nd, 2007 Majd F. Sakr CS-447– Computer Architecture.
ECE 232 L19.Pipeline2.1 Adapted from Patterson 97 ©UCBCopyright 1998 Morgan Kaufmann Publishers ECE 232 Hardware Organization and Design Lecture 19 Pipelining,
Computer ArchitectureFall 2007 © October 22nd, 2007 Majd F. Sakr CS-447– Computer Architecture.
Pipelining Datapath Adapted from the lecture notes of Dr. John Kubiatowicz (UC Berkeley) and Hank Walker (TAMU)
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Sep 9, 2002 Topic: Pipelining Basics.
1 Atanasoff–Berry Computer, built by Professor John Vincent Atanasoff and grad student Clifford Berry in the basement of the physics building at Iowa State.
Pipelining - II Adapted from CS 152C (UC Berkeley) lectures notes of Spring 2002.
Computer ArchitectureFall 2008 © October 6th, 2008 Majd F. Sakr CS-447– Computer Architecture.
Ceg3420 L13.1 DAP Fa97,  U.CB CEG3420 Computer Design Introduction to Pipelining.
CS152 / Kubiatowicz Lec /17/01©UCB Fall 2001 CS152 Computer Architecture and Engineering Lecture 13 Introduction to Pipelining: Datapath and Control.
Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania ECE Computer Organization Lecture 17 - Pipelined.
Pipelining - II Rabi Mahapatra Adapted from CS 152C (UC Berkeley) lectures notes of Spring 2002.
Lecture 12: Pipeline Datapath Design Professor Mike Schulte Computer Architecture ECE 201.
CS1104: Computer Organisation School of Computing National University of Singapore.
B 0000 Pipelining ENGR xD52 Eric VanWyk Fall
Analogy: Gotta Do Laundry
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.
CMPE 421 Parallel Computer Architecture
1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.
CS 152 Lec 12.1 CS 152: Computer Architecture and Engineering Lecture 12 Multicycle Controller Design Pipelining Randy H. Katz, Instructor Satrajit Chatterjee,
ECE 232 L18.Pipeline.1 Adapted from Patterson 97 ©UCBCopyright 1998 Morgan Kaufmann Publishers ECE 232 Hardware Organization and Design Lecture 18 Pipelining.

Cs 152 L1 3.1 DAP Fa97,  U.CB Pipelining Lessons °Pipelining doesn’t help latency of single task, it helps throughput of entire workload °Multiple tasks.
Chap 6.1 Computer Architecture Chapter 6 Enhancing Performance with Pipelining.
CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and
CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Pipelining CS365 Lecture 9. D. Barbara Pipeline CS465 2 Outline  Today’s topic  Pipelining is an implementation technique in which multiple instructions.
Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.
CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 10 Computer Hardware Design (Pipeline Datapath and Control Design) Prof. Dr.
Chapter Six.
Lecture 18: Pipelining I.
Computer Organization
Pipelines An overview of pipelining
CMSC 611: Advanced Computer Architecture
ECE232: Hardware Organization and Design
Pipelining Lessons 6 PM T a s k O r d e B C D A 30
Chapter 3: Pipelining 순천향대학교 컴퓨터학부 이 상 정 Adapted from
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
Chapter 4 The Processor Part 3
Chapter 4 The Processor Part 2
CMSC 611: Advanced Computer Architecture
CS 704 Advanced Computer Architecture
Lecturer: Alan Christopher
CS152 – Computer Architecture and Engineering Lecture 10 –
CS-447– Computer Architecture Lecture 14 Pipelining (2)
Pipelining Lessons 6 PM T a s k O r d e B C D A 30
Chapter Six.
Chapter Six.
An Introduction to pipelining
Instruction Execution Cycle
Chapter 8. Pipelining.
Pipelining Appendix A and Chapter 3.
CMCS Computer Architecture Lecture 20 Pipelined Datapath and Control April 11, CMSC411.htm Mohamed.
Guest Lecturer: Justin Hsia
A relevant question Assuming you’ve got: One washer (takes 30 minutes)
Recall: Performance Evaluation
Presentation transcript:

Dave Patterson (http.cs.berkeley.edu/~patterson) CS152: Computer Architecture and Engineering Introduction to Pipelining October 15, 1997 Dave Patterson (http.cs.berkeley.edu/~patterson) lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/

Recap: Microprogramming Specialize state-diagrams easily captured by microsequencer simple increment & “branch” fields datapath control fields Control design reduces to Microprogramming Microprogramming is a fundamental concept implement an instruction set by building a very simple processor and interpreting the instructions essential for very complex instructions and when few register transfers are possible overkill when ISA matches datapath 1-1

Summary: Microprogramming one inspiration for RISC If simple instruction could execute at very high clock rate… If you could even write compilers to produce microinstructions… If most programs use simple instructions and addressing modes… If microcode is kept in RAM instead of ROM so as to fix bugs … If same memory used for control memory could be used instead as cache for “macroinstructions”… Then why not skip instruction interpretation by a microprogram and simply compile directly into lowest language of machine? (microprogramming is overkill when ISA matches datapath 1-1)

Recap: Exceptions and Interrupts Exceptions are the hard part of control Need to find convenient place to detect exceptions and to branch to state or microinstruction that saves PC and invokes the operating system As we get pipelined CPUs that support page faults on memory accesses which means that the instruction cannot complete AND you must be able to restart the program at exactly the instruction with the exception, it gets even harder

Modification to the Control Specification IR <= MEM[PC] PC <= PC + 4 “instruction fetch” 0000 undefined instruction “decode” EPC <= PC - 4 PC <= exp_addr cause <= 12 (Ovf) EPC <= PC - 4 PC <= exp_addr cause <= 10 (RI) S<= PC +SX other 0001 LW BEQ R-type ORi SW S <= A - B If A = B then PC <= S Execute S <= A fun B S <= A op ZX S <= A + SX S <= A + SX 0100 0110 1000 1011 0010 overflow Memory M <= MEM[S] MEM[S] <= B 1001 1100 R[rd] <= S R[rt] <= S R[rt] <= M Write-back 0101 0111 1010

The Big Picture: Where are We Now? The Five Classic Components of a Computer Today’s Topics: Pipelining by Analogy Administrivia; Course road map Processor Input Control Memory Datapath Output So where are in in the overall scheme of things. Well, we just finished designing the processor’s datapath. Now I am going to show you how to design the control for the datapath. +1 = 7 min. (X:47)

Pipelining is Natural! Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer takes 30 minutes “Folder” takes 30 minutes “Stasher” takes 30 minutes to put clothes into drawers A B C D

Sequential Laundry 6 PM 7 8 9 10 11 12 1 2 AM T a s k O r d e 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 Time A B C D Sequential laundry takes 8 hours for 4 loads If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP 6 PM 7 8 9 10 11 12 1 2 AM B C D A 30 Time T a s k O r d e Pipelined laundry takes 3.5 hours for 4 loads!

Pipelining Lessons 6 PM 7 8 9 T a s k O r d e B C D A 30 Pipelining doesn’t help latency of single task, it helps throughput of entire workload Multiple tasks operating simultaneously using different resources Potential speedup = Number pipe stages Pipeline rate limited by slowest pipeline stage Unbalanced lengths of pipe stages reduces speedup Time to “fill” pipeline and time to “drain” it reduces speedup Stall for Dependences 6 PM 7 8 9 Time T a s k O r d e B C D A 30

Ifetch: Instruction Fetch The Five Stages of Load Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Load Ifetch Reg/Dec Exec Mem Wr Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory Reg/Dec: Registers Fetch and Instruction Decode Exec: Calculate the memory address Mem: Read the data from the Data Memory Wr: Write the data back to the register file As shown here, each of these five steps will take one clock cycle to complete. And in pipeline terminology, each step is referred to as one stage of the pipeline. +1 = 8 min. (X:48)

Pipelining Improve perfomance by increasing instruction throughput Ideal speedup is number of stages in the pipeline. Do we achieve this?

Basic Idea What do we need to add to actually split the datapath into stages?

Graphically Representing Pipelines Can help with answering questions like: how many cycles does it take to execute this code? what is the ALU doing during cycle 4? use this representation to help understand datapaths

Conventional Pipelined Execution Representation Time IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB Program Flow IFetch Dcd Exec Mem WB

Single Cycle, Multiple Cycle, vs. Pipeline Clk Single Cycle Implementation: Load Store Waste Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Clk Multiple Cycle Implementation: Load Store R-type Ifetch Reg Exec Mem Wr Ifetch Reg Exec Mem Ifetch Here are the timing diagrams showing the differences between the single cycle, multiple cycle, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles. In the multiple clock cycle implementation, however, we cannot start executing the store until Cycle 6 because we must wait for the load instruction to complete. Similarly, we cannot start the execution of the R-type instruction until the store instruction has completed its execution in Cycle 9. In the Single Cycle implementation, the cycle time is set to accommodate the longest instruction, the Load instruction. Consequently, the cycle time for the Single Cycle implementation can be five times longer than the multiple cycle implementation. But may be more importantly, since the cycle time has to be long enough for the load instruction, it is too long for the store instruction so the last part of the cycle here is wasted. +2 = 77 min. (X:57) Pipeline Implementation: Load Ifetch Reg Exec Mem Wr Store Ifetch Reg Exec Mem Wr R-type Ifetch Reg Exec Mem Wr

Suppose we execute 100 instructions Single Cycle Machine Why Pipeline? Suppose we execute 100 instructions Single Cycle Machine 45 ns/cycle x 1 CPI x 100 inst = 4500 ns Multicycle Machine 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns Ideal pipelined machine 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns

Why Pipeline? Because the resources are there! Time (clock cycles) ALU Im Reg Dm I n s t r. O r d e Inst 0 ALU Im Reg Dm Inst 1 ALU Im Reg Dm Inst 2 Inst 3 ALU Im Reg Dm Inst 4 ALU Im Reg Dm

Can pipelining get us into trouble? Yes: Pipeline Hazards structural hazards: attempt to use the same resource two different ways at the same time E.g., combined washer/dryer would be a structural hazard or folder busy doing something else (watching TV) data hazards: attempt to use item before it is ready E.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer instruction depends on result of prior instruction still in the pipeline control hazards: attempt to make a decision before condition is evaulated E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in branch instructions Can always resolve hazards by waiting pipeline control must detect the hazard take action (or delay action) to resolve hazards

Single Memory is a Structural Hazard Time (clock cycles) ALU I n s t r. O r d e Mem Reg Mem Reg Load ALU Mem Reg Instr 1 ALU Mem Reg Instr 2 ALU Instr 3 Mem Reg Mem Reg ALU Mem Reg Instr 4 Detection is easy in this case! (right half highlight means read, left half write)

Structural Hazards limit performance Example: if 1.3 memory accesses per instruction and only one memory access per cycle then average CPI  1.3 otherwise resource is more than 100% utilized

Control Hazard Solutions Stall: wait until decision is clear Its possible to move up decision to 2nd stage by adding hardware to check registers as being read Impact: 2 clock cycles per branch instruction => slow I n s t r. O r d e Time (clock cycles) ALU Mem Reg Mem Reg Add ALU Beq Mem Reg Mem Reg Load ALU Mem Reg Mem Reg

Control Hazard Solutions Predict: guess one direction then back up if wrong Predict not taken Impact: 1 clock cycles per branch instruction if right, 2 if wrong (right ­ 50% of time) More dynamic scheme: history of 1 branch (­ 90%) I n s t r. O r d e Time (clock cycles) ALU Mem Reg Mem Reg Add ALU Beq Mem Reg Mem Reg Load ALU Mem Reg Mem Reg

Control Hazard Solutions Redefine branch behavior (takes place after next instruction) “delayed branch” Impact: 0 clock cycles per branch instruction if can find instruction to put in “slot” (­ 50% of time) As launch more instruction per clock cycle, less useful I n s t r. O r d e Time (clock cycles) ALU Mem Reg Mem Reg Add ALU Beq Mem Reg Mem Reg ALU Misc Mem Reg Mem Reg ALU Load Mem Reg Mem Reg

add r1 ,r2,r3 sub r4, r1 ,r3 and r6, r1 ,r7 or r8, r1 ,r9 Data Hazard on r1 add r1 ,r2,r3 sub r4, r1 ,r3 and r6, r1 ,r7 or r8, r1 ,r9 xor r10, r1 ,r11

add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 Data Hazard on r1: Dependencies backwards in time are hazards Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e ALU sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg ALU Im Reg Dm xor r10,r1,r11

add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 Data Hazard Solution: “Forward” result from one stage to another “or” OK if define read/write properly Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e ALU sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg ALU Im Reg Dm xor r10,r1,r11

Forwarding (or Bypassing): What about Loads Dependencies backwards in time are hazards Can’t solve with forwarding: Must delay/stall instruction dependent on loads Time (clock cycles) IF ID/RF EX MEM WB lw r1,0(r2) Im Reg ALU Reg Dm ALU sub r4,r1,r3 Im Reg Dm Reg

Designing a Pipelined Processor Go back and examine your datapath and control diagram associated resources with states ensure that flows do not conflict, or figure out how to resolve assert control in appropriate stage

Pipelined Datapath (as in book); hard to read

Pipelined Processor (almost) for slides What happens if we start a new instruction every cycle? Valid IRex IR IRwb Inst. Mem IRmem WB Ctrl Dcd Ctrl Ex Ctrl Mem Ctrl Equal Reg. File Reg File A S Exec PC Next PC B Mem Access M Data Mem

Control and Datapath Equal Reg. File Reg File A S Exec PC IR Next PC IR <- Mem[PC]; PC <– PC+4; A <- R[rs]; B<– R[rt] S <– A + B; S <– A or ZX; S <– A + SX; S <– A + SX; If Cond PC < PC+SX; M <– Mem[S] Mem[S] <- B R[rd] <– S; R[rt] <– S; R[rd] <– M; Equal Reg. File Reg File A S Exec PC IR Next PC Inst. Mem B M Mem Access D Data Mem

Pipelining the Load Instruction Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Clock Ifetch Reg/Dec Exec Mem Wr 1st lw 2nd lw Ifetch Reg/Dec Exec Mem Wr 3rd lw Ifetch Reg/Dec Exec Mem Wr The five independent functional units in the pipeline datapath are: Instruction Memory for the Ifetch stage Register File’s Read ports (bus A and busB) for the Reg/Dec stage ALU for the Exec stage Data Memory for the Mem stage Register File’s Write port (bus W) for the Wr stage For the load instructions, the five independent functional units in the pipeline datapath are: (a) Instruction Memory for the Ifetch stage. (b) Register File’s Read ports for the Reg/Decode stage. (c) ALU for the Exec stage. (d) Data memory for the Mem stage. (e) And finally Register File’s write port for the Write Back stage. Notice that I have treat Register File’s read and write ports as separate functional units because the register file we have allows us to read and write at the same time. Notice that as soon as the 1st load finishes its Ifetch stage, it no longer needs the Instruction Memory. Consequently, the 2nd load can start using the Instruction Memory (2nd Ifetch). Furthermore, since each functional unit is only used ONCE per instruction, we will not have any conflict down the pipeline (Exec-Ifet, Mem-Exec, Wr-Mem) either. I will show you the interaction between instructions in the pipelined datapath later. But for now, I want to point out the performance advantages of pipelining. If these 3 load instructions are to be executed by the multiple cycle processor, it will take 15 cycles. But with pipelining, it only takes 7 cycles. This (7 cycles), however, is not the best way to look at the performance advantages of pipelining. A better way to look at this is that we have one instruction enters the pipeline every cycle so we will have one instruction coming out of the pipeline (Wr stages) every cycle. Consequently, the “effective” (or average) number of cycles per instruction is now ONE even though it takes a total of 5 cycles to complete each instruction. +3 = 14 min. (X:54)

The Four Stages of R-type Cycle 1 Cycle 2 Cycle 3 Cycle 4 R-type Ifetch Reg/Dec Exec Wr Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory Reg/Dec: Registers Fetch and Instruction Decode Exec: ALU operates on the two register operands Update PC Wr: Write the ALU output back to the register file Well, so far so good. Let’s take a look at the R-type instructions. The R-type instruction does NOT access data memory so it only takes four clock cycles, or in our new pipeline terminology, four stages to complete. The Ifetch and Reg/Dec stages are identical to the Load instructions. Well they have to be because at this point, we do not know we have a R-type instruction yet. Instead of calculating the effective address during the Exec stage, the R-type instruction will use the ALU to operate on the register operands. The result of this ALU operation is written back to the register file during the Wr back stage. +1 = 15 min. (55)

Pipelining the R-type and Load Instruction Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock Ops! We have a problem! R-type Ifetch Reg/Dec Exec Wr R-type Ifetch Reg/Dec Exec Wr Load Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Wr R-type Ifetch Reg/Dec Exec Wr What happened if we try to pipeline the R-type instructions with the Load instructions? Well, we have a problem here!!! We end up having two instructions trying to write to the register file at the same time! Why do we have this problem (the write “bubble”)? Well, the reason for this problem is that there is something I have not yet told you. +1 = 16 min. (X:56) We have pipeline conflict or structural hazard: Two instructions try to write to the register file at the same time! Only one write port

Important Observation Each functional unit can only be used once per instruction Each functional unit must be used at the same stage for all instructions: Load uses Register File’s Write Port during its 5th stage R-type uses Register File’s Write Port during its 4th stage Ifetch Reg/Dec Exec Mem Wr Load 1 2 3 4 5 Ifetch Reg/Dec Exec Wr R-type 1 2 3 4 I already told you that in order for pipeline to work perfectly, each functional unit can ONLY be used once per instruction. What I have not told you is that this (1st bullet) is a necessary but NOT sufficient condition for pipeline to work. The other condition to prevent pipeline hiccup is that each functional unit must be used at the same stage for all instructions. For example here, the load instruction uses the Register File’s Wr port during its 5th stage but the R-type instruction right now will use the Register File’s port during its 4th stage. This (5 versus 4) is what caused our problem. How do we solve it? We have 2 solutions. +1 = 17 min. (X:57) 2 ways to solve this pipeline hazard.

Solution 1: Insert “Bubble” into the Pipeline Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock Ifetch Reg/Dec Exec Wr Load Ifetch Reg/Dec Exec Mem Wr Ifetch Reg/Dec Exec Wr R-type Ifetch Reg/Dec Pipeline Exec Wr R-type R-type Ifetch Bubble Reg/Dec Exec Wr Ifetch Reg/Dec Exec Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle The control logic can be complex. Lose instruction fetch and issue opportunity. No instruction is started in Cycle 6! The first solution is to insert a “bubble” into the pipeline AFTER the load instruction to push back every instruction after the load that are already in the pipeline by one cycle. At the same time, the bubble will delay the Instruction Fetch of the instruction that is about to enter the pipeline by one cycle. Needless to say, the control logic to accomplish this can be complex. Furthermore, this solution also has a negative impact on performance. Notice that due to the “extra” stage (Mem) Load instruction has, we will not have one instruction finishes every cycle (points to Cycle 5). Consequently, a mix of load and R-type instruction will NOT have an average CPI of 1 because in effect, the Load instruction has an effective CPI of 2. So this is not that hot an idea Let’s try something else. +2 = 19 min. (X:59)

Solution 2: Delay R-type’s Write by One Cycle Delay R-type’s register write by one cycle: Now R-type instructions also use Reg File’s write port at Stage 5 Mem stage is a NOOP stage: nothing is being done. 1 2 3 4 5 R-type Ifetch Reg/Dec Exec Mem Wr Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock R-type Ifetch Reg/Dec Exec Mem Wr Well one thing we can do is to add a “Nop” stage to the R-type instruction pipeline to delay its register file write by one cycle. Now the R-type instruction ALSO uses the register file’s witer port at its 5th stage so we eliminate the write conflict with the load instruction. This is a much simpler solution as far as the control logic is concerned. As far as performance is concerned, we also gets back to having one instruction completes per cycle. This is kind of like promoting socialism: by making each individual R-type instruction takes 5 cycles instead of 4 cycles to finish, our overall performance is actually better off. The reason for this higher performance is that we end up having a more efficient pipeline. +1 = 20 min. (Y:00) R-type Ifetch Reg/Dec Exec Mem Wr Load Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Mem Wr

Modified Control & Datapath IR <- Mem[PC]; PC <– PC+4; A <- R[rs]; B<– R[rt] S <– A + B; S <– A or ZX; S <– A + SX; S <– A + SX; if Cond PC < PC+SX; M <– S M <– Mem[S] Mem[S] <- B M <– S R[rd] <– M; R[rt] <– M; R[rd] <– M; Equal Reg. File Reg File A M Exec S PC IR Next PC Inst. Mem B Mem Access D Data Mem

The Four Stages of Store Cycle 1 Cycle 2 Cycle 3 Cycle 4 Store Ifetch Reg/Dec Exec Mem Wr Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory Reg/Dec: Registers Fetch and Instruction Decode Exec: Calculate the memory address Mem: Write the data into the Data Memory Let’s continue our lecture by looking at the store instruction. Once again, the Ifetch and Reg/Decode stages are the same as all other instructions. The Exec stage of the store instruction calculates the memory address. Once the address is calculated, the store instruction will write the data it read from the register file back at the Reg/Decode stage into the data memory during the Mem stage. Notice that unlike the load instruction which takes five cycles to accomplish its task, the Store instruction only takes four cycles or four pipe stages. In order to keep our pipeline diagram looks more uniform, however, we will keep the Wr stage for the store instruction in the pipeline diagram. But keep in mind that as far as the pipelined control and pipelined datapath are concerned, the store instruction requires NOTHING to be done once it finishes its Mem stage. +2 = 27 min. (Y:07)

Ifetch: Instruction Fetch Reg/Dec: Exec: The Three Stages of Beq Cycle 1 Cycle 2 Cycle 3 Cycle 4 Beq Ifetch Reg/Dec Exec Mem Wr Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory Reg/Dec: Registers Fetch and Instruction Decode Exec: compares the two register operand, select correct branch target address latch into PC Well similar to the store instruction, the branch instruction only consists of four pipe stages. Ifetch and Reg/decode are the same as all other instructions because we do not know what instruction we have at this point. We have not finish decoding the instruction yet. During the Execute stage of the pipeline, the BEQ instruction will use the ALU to compare the two register operands it fetched during the Reg/Dec stage. At the same time, a separate adder is used to calculate the branch target address. If the registers we compared during the Execute stage (point to the last bullet) have the same value, the branch is taken. That is, the branch target address we calculated earlier (last bullet) will be written into the Program Counter. Once again, similar to the Store instruction, the BEQ instruction will require NEITHER the pipelined control nor the pipelined datapath to do ANY thing once it finishes its Mem stage. With all these talk about pipelined datapath and pipelined control, let’s take a look at how the pipelined datapath looks like. +2 = 29 min. (Y:09)

Control Diagram Equal Reg. File Reg File A M S Exec PC IR Next PC IR <- Mem[PC]; PC < PC+4; A <- R[rs]; B<– R[rt] S <– A + B; S <– A or ZX; S <– A + SX; S <– A + SX; If Cond PC < PC+SX; M <– S M <– Mem[S] Mem[S] <- B M <– S R[rd] <– S; R[rt] <– S; R[rd] <– M; Equal Reg. File Reg File A M S Exec PC IR Next PC Inst. Mem B Mem Access D Data Mem

Datapath + Data Stationary Control IR v v v fun rw rw rw wb wb wb Inst. Mem Decode rt me me WB Ctrl rs Mem Ctrl ex op im rs rt Reg. File Reg File A M S Exec B Mem Access D Data Mem PC Next PC

Let’s Try it Out 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 these addresses are octal

Start: Fetch 10 n n n n Inst. Mem Decode WB Ctrl Mem Ctrl IR im rs rt Reg. File Reg File A M S Exec B Mem Access D Data Mem 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 Next PC 10 PC

Fetch 14, Decode 10 n n n Inst. Mem Decode WB Ctrl Mem Ctrl IR im 2 rt lw r1, r2(35) Decode WB Ctrl Mem Ctrl IR im 2 rt Reg. File Reg File A M S Exec B Mem Access D Data Mem 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 Next PC 14 PC

Fetch 20, Decode 14, Exec 10 n n Inst. Mem Decode WB Ctrl Mem Ctrl IR addI r2, r2, 3 Decode WB Ctrl lw r1 Mem Ctrl IR 2 rt 35 Reg. File Reg File M r2 S Exec B Mem Access D Data Mem 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 Next PC 20 PC

Fetch 24, Decode 20, Exec 14, Mem 10 n Inst. Mem Decode WB Ctrl Mem sub r3, r4, r5 Decode addI r2, r2, 3 WB Ctrl lw r1 Mem Ctrl IR 4 5 3 Reg. File Reg File M r2 r2+35 Exec B Mem Access D Data Mem 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 Next PC 24 PC

Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10 Inst. Mem Decode WB Ctrl Mem beq r6, r7 100 Decode addI r2 WB Ctrl sub r3 lw r1 Mem Ctrl IR 6 7 Reg. File Reg File r4 M[r2+35] r2+3 Exec r5 Mem Access D Data Mem 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 Next PC 30 PC

Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14 Inst. Mem Decode WB Ctrl Mem ori r8, r9 17 Decode addI r2 WB Ctrl sub r3 beq Mem Ctrl IR 9 xx 100 Reg. File r1=M[r2+35] Reg File r6 r2+3 r4-r5 Exec r7 Mem Access D Data Mem 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 Next PC 34 PC

Fetch 100, Dcd 34, Ex 30, Mem 24, WB 20 Inst. Mem Decode WB Ctrl Mem ori r8 sub r3 WB Ctrl add r10, r11, r12 beq Mem Ctrl 11 12 17 Reg. File IR r1=M[r2+35] Reg File r9 r4-r5 xxx Exec r2 = r2+3 x Mem Access D Data Mem 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 Next PC 100 PC ooops, we should have only one delayed instruction

Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24 n Inst. Mem Decode add r10 ori r8 beq WB Ctrl and r13, r14, r15 Mem Ctrl 14 15 xx Reg. File IR r1=M[r2+35] Reg File r11 r9 | 17 xxx Exec r2 = r2+3 r3 = r4-r5 r12 Mem Access D Data Mem 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 Next PC 104 PC Squash the extra instruction in the branch shadow!

Fetch 108, Dcd 104, Ex 100, Mem 34, WB 30 n Inst. Mem Decode add r10 ori r8 and r13 WB Ctrl Mem Ctrl xx Reg. File IR r1=M[r2+35] Reg File r14 r9 | 17 r11+r12 Exec r2 = r2+3 r3 = r4-r5 r15 Mem Access D Data Mem 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 Next PC 110 PC Squash the extra instruction in the branch shadow!

Fetch 114, Dcd 110, Ex 104, Mem 100, WB 34 n NO WB NO Ovflow and r13 Inst. Mem Decode add r10 WB Ctrl Mem Ctrl Reg. File IR r1=M[r2+35] Reg File r11+r12 r14 & R15 Exec r2 = r2+3 r3 = r4-r5 r8 = r9 | 17 Mem Access D Data Mem 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 Next PC 114 PC Squash the extra instruction in the branch shadow!

We’ll build a simple pipeline and look at these issues Summary: Pipelining What makes it easy all instructions are the same length just a few instruction formats memory operands appear only in loads and stores What makes it hard? structural hazards: suppose we had only one memory control hazards: need to worry about branch instructions data hazards: an instruction depends on a previous instruction We’ll build a simple pipeline and look at these issues We’ll talk about modern processors and what really makes it hard: exception handling trying to improve performance with out-of-order execution, etc.

Pipelining is a fundamental concept Summary Pipelining is a fundamental concept multiple steps using distinct resources Utilize capabilities of the Datapath by pipelined instruction processing start next instruction while working on the current one limited by length of longest stage (plus fill/flush) detect and resolve hazards