1 Atanasoff–Berry Computer, built by Professor John Vincent Atanasoff and grad student Clifford Berry in the basement of the physics building at Iowa State.

1 Atanasoff–Berry Computer, built by Professor John Vincent Atanasoff and grad student Clifford Berry in the basement of the physics building at Iowa State College during 1939–42. Binary digits, electronic computation. Clock frequency 60 Hz. (Wikipedia). See info on ENIAC patent fight.

2 Homework 1
- On website later today
- Due Thu, Feb 12, beginning of class

3 COMP 740: Computer Architecture and Implementation
Montek Singh
Tue, Feb 3, 2009
Pipelining I: Basics

4 Lecture Overview
Reading: Appendix A (HP4)
- Pipelining basics
  - Introduction to the concept of a pipelined processor
  - Basic 5-stage pipelining
- What occurs at each stage?
- Pipeline registers
- Pipelining the load instruction
- Pipelining register-type instructions
- Pipelining the store instruction
- Pipelining a branch

5 Pipelining: It’s Natural!
Laundry example:
- Ann, Brian, Cathy, and Dave (loads A, B, C, D) each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folding takes 20 minutes

6 Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, each taking 30 + 40 + 20 minutes, one after another]
- Sequential laundry takes 6 hours for 4 loads
- With pipelining, how long would laundry take?

7 Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM; loads A, B, C, D in task order, overlapping, with stage times 30, 40, 20 minutes]
- Pipelined laundry takes 3.5 hours for 4 loads
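The 6-hour and 3.5-hour figures can be checked with a small scheduling sketch. It assumes each machine handles one load at a time and that a load may wait between stages; the function names are illustrative, not from the slides.

```python
def sequential_finish_time(stage_minutes, num_loads):
    """Each load runs through every stage before the next load starts."""
    return num_loads * sum(stage_minutes)


def pipelined_finish_time(stage_minutes, num_loads):
    """Minute at which the last load leaves the last stage.

    A load enters a stage as soon as both the load and the stage are free.
    """
    stage_free_at = [0] * len(stage_minutes)  # when each machine is next free
    finish = 0
    for _ in range(num_loads):
        ready = 0  # when this load finished its previous stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free_at[s])
            ready = start + minutes
            stage_free_at[s] = ready
        finish = ready
    return finish


STAGES = [30, 40, 20]  # washer, dryer, folding (minutes, from the slides)
sequential = sequential_finish_time(STAGES, 4)
pipelined = pipelined_finish_time(STAGES, 4)
print(f"sequential: {sequential / 60} h, pipelined: {pipelined / 60} h")
```

The 40-minute dryer is the bottleneck: after the first load, one load finishes every 40 minutes.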

8 Pipelining Principles
[Figure: pipelined laundry timeline from 6 PM; loads A, B, C, D in task order, with stage times 30, 40, 20 minutes]
- Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to “fill” the pipeline and time to “drain” it reduce speedup
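The last three principles can be put into numbers with a closed-form sketch. It assumes identical tasks, one task per stage at a time, and no other overhead; under those assumptions the pipelined time is the sum of the stage times plus (n - 1) times the slowest stage. The function name is hypothetical.

```python
def pipeline_speedup(stage_times, num_tasks):
    """Speedup of a pipeline over running the tasks one after another.

    Assumes identical tasks and one task per stage at a time.
    """
    unpipelined = num_tasks * sum(stage_times)
    # The first task fills the pipeline; each later task finishes one
    # "slowest stage" after the task before it.
    pipelined = sum(stage_times) + (num_tasks - 1) * max(stage_times)
    return unpipelined / pipelined


for stages in ([30, 30, 30], [30, 40, 20]):
    for n in (4, 1000):
        print(stages, n, round(pipeline_speedup(stages, n), 3))
```

With balanced 30-minute stages the speedup approaches 3 (the number of stages) as the task count grows. With the unbalanced laundry stages it can approach only 90/40 = 2.25, and for just 4 loads the fill and drain time holds it to about 1.71.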

9 The Five Stages of a RISC Instruction
Load: Ifetch (cycle 1) | Reg/Dec (cycle 2) | Exec (cycle 3) | Mem (cycle 4) | WrB (cycle 5)
- Ifetch: Instruction fetch
  - Fetch the instruction from the instruction memory
- Reg/Dec: Register fetch and instruction decode
- Exec: Calculate the memory address
- Mem: Read the data from the data memory
- WrB: Write the data back to the register file

10 Example: Load Instruction
lw $1, -70($2)
lw $5, 100($0)
- First field is the destination register
- Last field is the source register for computing the address
  - memory address = register value + offset
- Note that register 0 is always 0
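A minimal sketch of the address calculation for the two example loads. The 16-bit sign-extended offset and 32-bit wraparound follow the usual MIPS convention rather than anything stated on this slide, and the value placed in $2 is made up for illustration.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(regs, base_reg, offset):
    """memory address = register value + offset, wrapped to 32 bits."""
    base = 0 if base_reg == 0 else regs[base_reg]  # register 0 always reads 0
    return (base + sign_extend_16(offset)) & 0xFFFFFFFF


regs = {2: 1000}  # hypothetical contents of $2
addr1 = effective_address(regs, 2, -70)  # lw $1, -70($2)
addr2 = effective_address(regs, 0, 100)  # lw $5, 100($0)
print(addr1, addr2)
```

Because $0 always reads as 0, the second load addresses memory location 100 directly.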

11 Pipelining the LOAD Instruction
- The five independent pipeline stages are:
  - Read next instruction: the Ifetch stage
  - Decode instruction and fetch register values: the Reg/Dec stage
  - Execute the operation: the Exec stage
  - Access data memory: the Mem stage
  - Write data to destination register: the WrB stage
- One instruction enters the pipeline every cycle
  - The latency of a single load is still 5 cycles
  - The throughput is much higher
- The “effective” CPI for 3 instructions is 7/3 (tends to 1 as the number of instructions grows)
- Cycle time is ~1/5th the cycle time of the unpipelined implementation
- One instruction comes out of the pipeline (completed) every cycle
Cycle    1        2        3        4        5        6        7
1st lw   Ifetch   Reg/Dec  Exec     Mem      WrB
2nd lw            Ifetch   Reg/Dec  Exec     Mem      WrB
3rd lw                     Ifetch   Reg/Dec  Exec     Mem      WrB
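The 7/3 figure and the pipeline diagram can be reproduced with a short sketch. It assumes an ideal pipeline with one instruction entering per cycle and no stalls; the helper names are illustrative.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "WrB"]


def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles to finish all instructions: fill the pipe, then one per cycle."""
    return num_stages + (num_instructions - 1)


def effective_cpi(num_instructions):
    return total_cycles(num_instructions) / num_instructions


def stage_in_cycle(instr_index, cycle):
    """Stage occupied by instruction instr_index (0-based) in a 1-based cycle."""
    s = cycle - 1 - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else None


for i in range(3):
    row = [stage_in_cycle(i, c) or "" for c in range(1, total_cycles(3) + 1)]
    print(f"lw #{i + 1}: " + " ".join(f"{name:8}" for name in row))
print("effective CPI for 3 loads:", effective_cpi(3))
print("effective CPI for 1000 loads:", effective_cpi(1000))
```

The four extra cycles are the fill time; they are amortized over more instructions as the count grows, which is why the CPI tends to 1.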

12 Load, Pipelined and Not

13 A Pipelined MIPS Datapath Review: Let’s look at the types of blocks

14 Load, Fetch Stage
Instruction fetched, PC <- PC+4, new PC saved

15 Detailed View
Location 8: lw $1, 0x100($2)
[Figure: Ifetch stage datapath ("You are here!"): instruction address 8 into the instruction memory, adder with constant 4, PC = 12, PC+4 passed on to Reg/Dec]

16 Load, Decode Stage
Immediate field sign extended, regs fetched

17 Load, Execution Stage
ALU adds the first register operand (the base register value) and the sign-extended immediate; result saved

18 Load, Memory
Use the address to get data from memory

19 Load, Write Back
Write data to register; oops, we also need the destination reg #

20 Corrected Pipeline
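The figure for this slide is not in the transcript. The sketch below assumes the usual fix described in Patterson and Hennessy: the destination register number travels through the pipeline registers together with the load data. Without that, the write port would see the register number of whichever instruction is currently in Reg/Dec, which is three instructions younger. The function and the register numbers are hypothetical.

```python
def write_register_seen(dest_regs, i, carried):
    """Register number on the write port when load i (0-based) is in WrB.

    carried=True:  the number was passed along with the instruction.
    carried=False: the number comes from the instruction now in Reg/Dec,
                   i.e. instruction i + 3 (stage 2 while load i is in stage 5).
    """
    if carried:
        return dest_regs[i]
    younger = i + 3
    return dest_regs[younger] if younger < len(dest_regs) else None


dest_regs = [1, 5, 9, 12, 7]  # destination registers of five back-to-back loads
buggy = write_register_seen(dest_regs, 0, carried=False)
fixed = write_register_seen(dest_regs, 0, carried=True)
print(f"first load should write $1: uncorrected -> ${buggy}, corrected -> ${fixed}")
```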

21 The Four Stages of R-type
e.g.: add R1, R2, R3
R-type: Ifetch (cycle 1) | Reg/Dec (cycle 2) | Exec (cycle 3) | WrB (cycle 4)
- Ifetch: Instruction fetch
  - Fetch the instruction from the instruction memory
- Reg/Dec: Register fetch and instruction decode
- Exec: ALU operates on the two register operands
- WrB: Write the ALU output back to the register file

22 Pipelining the R-type and Load Instructions
Cycle    1        2        3        4        5        6        7        8        9
R-type   Ifetch   Reg/Dec  Exec     Wr
R-type            Ifetch   Reg/Dec  Exec     Wr
Load                       Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                              Ifetch   Reg/Dec  Exec     Wr
R-type                                       Ifetch   Reg/Dec  Exec     Wr
OOPS! We have a problem!
- We have a problem called a pipeline conflict or hazard
  - 2 instructions try to write to the register file at the same time!
  - “Contention for a shared resource” (in OS terminology)
- It is no longer meaningful to talk about the execution of a single instruction in isolation
  - Execution is inherently concurrent; need to achieve serializability
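The collision can be located mechanically. This sketch assumes one instruction issues per cycle, with R-type writing the register file in its 4th stage and load in its 5th; the names are illustrative.

```python
from collections import defaultdict

# Stage in which each instruction type uses the register file's write port.
WRITE_STAGE = {"R-type": 4, "Load": 5}


def write_port_conflicts(program):
    """Map each cycle with more than one register write to the writers' indices."""
    writers = defaultdict(list)
    for index, kind in enumerate(program):
        start_cycle = index + 1  # one instruction enters per cycle
        write_cycle = start_cycle + WRITE_STAGE[kind] - 1
        writers[write_cycle].append(index)
    return {cycle: idxs for cycle, idxs in writers.items() if len(idxs) > 1}


program = ["R-type", "R-type", "Load", "R-type", "R-type"]
conflicts = write_port_conflicts(program)
print(conflicts)
```

For the sequence on this slide, the load (index 2) and the R-type right after it (index 3) both try to write in cycle 7.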

23 Important Observations
- Each functional unit can only be used once per instruction
- Each functional unit must be used at the same stage for all instructions
  - Load uses the register file’s write port during its 5th stage
  - R-type uses the register file’s write port during its 4th stage
Load:   1 Ifetch | 2 Reg/Dec | 3 Exec | 4 Mem | 5 WrB
R-type: 1 Ifetch | 2 Reg/Dec | 3 Exec | 4 WrB
- How to resolve this pipeline hazard?

24 Solution: Delay R-type’s Write by 1 Cycle
- Delay R-type’s register write by one cycle:
  - Now R-type instructions also use the register file’s write port at stage 5
  - Mem stage is a NO-OP stage: nothing is being done
R-type: 1 Ifetch | 2 Reg/Dec | 3 Exec | 4 Mem | 5 WrB
Cycle    1        2        3        4        5        6        7        8        9
R-type   Ifetch   Reg/Dec  Exec     Mem      WrB
R-type            Ifetch   Reg/Dec  Exec     Mem      WrB
Load                       Ifetch   Reg/Dec  Exec     Mem      WrB
R-type                              Ifetch   Reg/Dec  Exec     Mem      WrB
R-type                                       Ifetch   Reg/Dec  Exec     Mem      WrB
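A small sketch of the fix: pad every instruction to the same five stages, marking the stages it does not need as no-ops, and check that the write-back cycles no longer collide. It assumes one instruction issues per cycle; the helper names are hypothetical.

```python
FIVE_STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "WrB"]


def pad_to_five(used_stages):
    """Five-stage sequence with 'nop' for stages the instruction skips."""
    return [stage if stage in used_stages else "nop" for stage in FIVE_STAGES]


def write_cycles(program, stages_by_kind):
    """Cycle in which each instruction reaches WrB, issuing one per cycle."""
    return [i + 1 + stages_by_kind[kind].index("WrB")
            for i, kind in enumerate(program)]


r_type = pad_to_five(["Ifetch", "Reg/Dec", "Exec", "WrB"])  # Mem becomes a no-op
load = pad_to_five(FIVE_STAGES)

program = ["R-type", "R-type", "Load", "R-type", "R-type"]
cycles = write_cycles(program, {"R-type": r_type, "Load": load})
print(r_type)
print(cycles)
```

Every instruction now writes in its 5th stage, so consecutive instructions write in consecutive cycles and the write port is never shared.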

25 The Four Stages of Store
Store: Ifetch (cycle 1) | Reg/Dec (cycle 2) | Exec (cycle 3) | Mem (cycle 4) | WrB (shown in the diagram but not used)
- Ifetch: Instruction fetch
  - Fetch the instruction from the instruction memory
- Reg/Dec: Register fetch and instruction decode
- Exec: Calculate the memory address
- Mem: Write the data into the data memory

26 Third Stage of Store
Very similar to load; the register contents to be written to memory are saved

27 The Four Stages of Beq
(NOTE: This is a slow/unoptimized branch; will cover faster branching later)
Beq: Ifetch (cycle 1) | Reg/Dec (cycle 2) | Exec (cycle 3) | Mem (cycle 4) | WrB (shown in the diagram but not used)
- Ifetch: Instruction fetch
  - Fetch the instruction from the instruction memory
- Reg/Dec: Register fetch and instruction decode
- Exec: ALU compares the two register operands
  - Adder calculates the branch target address
- Mem: If the registers compared in the Exec stage are equal
  - Write the branch target address into the PC
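The Exec and Mem steps for beq can be sketched as a single next-PC function. The slide does not say how the target is formed; the code assumes the standard MIPS convention of a sign-extended 16-bit word offset relative to PC+4. The function names are illustrative.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def beq_next_pc(pc, rs_value, rt_value, offset):
    """Next PC after a beq at address pc.

    Exec: the ALU compares the operands while an adder forms the target.
    Mem:  the target is written into the PC only if the operands were equal.
    """
    fall_through = (pc + 4) & 0xFFFFFFFF
    # Assumed MIPS convention: the offset counts words relative to PC+4.
    target = (fall_through + (sign_extend_16(offset) << 2)) & 0xFFFFFFFF
    return target if rs_value == rt_value else fall_through


print(beq_next_pc(8, 5, 5, 3))   # taken
print(beq_next_pc(8, 5, 6, 3))   # not taken
```

Because the PC is only updated in the Mem stage, the instructions fetched in the meantime come from the fall-through path, which is why the slide calls this branch slow.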

28 Key Ideas Behind Instruction Pipelining
Ifetch | Reg/Dec | Exec | Mem | WrB
- Each instruction has 5 stages:
  - Five independent functional units to work on each stage
- Each functional unit is used only once!
  - A second instr can start doing Ifetch as soon as the first finishes its Ifetch stage
  - Each instr still takes five cycles to complete
- The latency of a single instr is still 5 cycles
  - The throughput is much higher
- CPI approaches 1
- Cycle time is ~1/5th the cycle time of the single-cycle implementation
  - Instructions start executing before previous instructions complete execution
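Putting CPI and cycle time together gives the overall gain over the single-cycle design. This sketch uses execution time = instruction count x CPI x cycle time and assumes perfectly balanced stages, no hazards, and no pipeline register overhead, so it is an upper bound rather than a measured result.

```python
def execution_time(instruction_count, cpi, cycle_time):
    """Execution time = instruction count x CPI x cycle time."""
    return instruction_count * cpi * cycle_time


def pipelined_speedup(n, num_stages=5, single_cycle_time=1.0):
    """Ideal speedup of an num_stages-deep pipeline over a single-cycle design."""
    single = execution_time(n, 1.0, single_cycle_time)
    pipelined_cpi = (num_stages + n - 1) / n  # includes the fill cycles
    pipelined = execution_time(n, pipelined_cpi, single_cycle_time / num_stages)
    return single / pipelined


for n in (1, 3, 100, 10**6):
    print(n, round(pipelined_speedup(n), 3))
```

A single instruction gains nothing (its latency is still five short cycles), while a long instruction stream approaches the 5x limit set by the number of stages.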

29 Next Time
- Hazards
  - Complications that arise due to dependencies between instructions, or disruptions to linear flow of execution

30 Readings/References
- The undergrad book by Patterson and Hennessy (Computer Organization and Design) has a detailed description of pipelining
