
1 15-447 Computer Architecture, Fall 2008 ©
October 6th, 2008
Majd F. Sakr
msakr@qatar.cmu.edu
www.qatar.cmu.edu/~msakr/15447-f08/
CS-447 Computer Architecture, Lecture 13: Pipelining (1)

2 Quiz
°Give an example of an instruction that would take longer to execute in a multi-cycle datapath than in a single-cycle datapath. Explain why it takes longer.
°Give an example of an instruction that takes a shorter time to execute in a multi-cycle datapath than in a single-cycle one. Explain why.

3 Computer Performance
CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
That is: CPU time = instruction count x CPI x cycle time
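The CPU-time equation above can be checked with a short script. This is just an illustrative sketch; the instruction count, CPI, and clock rate below are made-up example numbers, not from the slides.

```python
# Illustrative CPU-time calculation (all numbers are hypothetical).
# CPU time = instruction count x CPI x cycle time
instruction_count = 2_000_000   # instructions executed by the program (assumed)
cpi = 1.5                       # average cycles per instruction (assumed)
clock_rate_hz = 100_000_000     # 100 MHz clock (assumed)
cycle_time_s = 1 / clock_rate_hz

cpu_time_s = instruction_count * cpi * cycle_time_s
print(f"CPU time = {cpu_time_s * 1000:.1f} ms")   # 2e6 * 1.5 / 1e8 s = 30.0 ms
```

Note that the three factors decompose performance into what the compiler controls (instruction count), what the microarchitecture controls (CPI), and what the circuit technology controls (cycle time).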

4 Cycles Per Instruction (Throughput)
"Average Cycles per Instruction":
CPI = (CPU Time x Clock Rate) / Instruction Count = Cycles / Instruction Count
"Instruction Frequency": the fraction of executed instructions of each type, used to weight each type's CPI.

5 Example: Calculating CPI Bottom Up
Typical mix of instruction types in a program, on the base machine (Reg/Reg):

Op      Freq   Cycles   CPI(i)   (% Time)
ALU     50%    1        0.5      (33%)
Load    20%    2        0.4      (27%)
Store   10%    2        0.2      (13%)
Branch  20%    2        0.4      (27%)

Overall CPI = 1.5
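The table's weighted-CPI arithmetic can be reproduced directly; the mix below is exactly the slide's, and each type's share of execution time is its CPI(i) divided by the overall CPI.

```python
# Weighted-CPI calculation matching the slide's instruction mix.
mix = {           # op: (frequency, cycles)
    "ALU":    (0.50, 1),
    "Load":   (0.20, 2),
    "Store":  (0.10, 2),
    "Branch": (0.20, 2),
}

# Overall CPI is the frequency-weighted sum of per-type cycle counts.
cpi = sum(freq * cycles for freq, cycles in mix.values())
print(f"Overall CPI = {cpi:.2f}")       # 1.50, as on the slide

for op, (freq, cycles) in mix.items():
    share = freq * cycles / cpi         # fraction of total execution time
    print(f"{op:6s} CPI(i) = {freq * cycles:.1f}  ({share:.0%} of time)")
```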

6 Sequential Laundry
°Sequential laundry takes 6 hours for 4 loads
°How can we make better use of the available resources?
(Timeline figure: loads A through D, each taking 30 min wash, 40 min dry, 20 min fold, run back to back from 6 PM to Midnight.)

7 Pipelined Laundry: Start Work ASAP
°Pipelined laundry takes 3.5 hours for 4 loads
(Timeline figure: loads A through D overlapped, so the 30/40/20-minute stages of consecutive loads run concurrently, 6 PM to 9:30 PM.)
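The slide's 6-hour and 3.5-hour figures can be verified with a short sketch. The stage lengths are the slide's (wash 30, dry 40, fold 20 minutes); the pipelined formula assumes each later load ends up waiting on the slowest stage (the 40-minute dryer), which holds for this stage mix.

```python
# Sequential vs. pipelined laundry timing for the slide's example.
stages = [30, 40, 20]   # minutes per stage: wash, dry, fold
loads = 4

# Sequential: each load runs start to finish before the next begins.
sequential = loads * sum(stages)

# Pipelined: first load takes the full time; each later load finishes
# one slowest-stage-time after the previous one.
pipelined = sum(stages) + (loads - 1) * max(stages)

print(sequential / 60, "hours sequential")   # 6.0
print(pipelined / 60, "hours pipelined")     # 3.5
```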

8 Pipelining Lessons
°Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
°Pipeline rate is limited by the slowest pipeline stage
°Multiple tasks operate simultaneously
°Potential speedup = number of pipe stages
°Unbalanced lengths of pipe stages reduce speedup
°Time to "fill" the pipeline and time to "drain" it reduce speedup
(Timeline figure: the four overlapped 30/40/20-minute loads from the previous slide.)

9 Pipelining
°Doesn't improve latency!
°We execute billions of instructions, so throughput is what matters!

10 The Five Stages of a Load Instruction
°IFetch: Instruction Fetch and Update PC
°Dec: Register Fetch and Instruction Decode
°Exec: Execute R-type; calculate memory address
°Mem: Read/write the data from/to the Data Memory
°WB: Write the result data into the register file
(Diagram: lw occupies IFetch, Dec, Exec, Mem, WB in Cycles 1 through 5.)

11 Pipelined Processor
°Start the next instruction while still working on the current one
- improves throughput or bandwidth: the total amount of work done in a given time (average instructions per second or per clock)
- instruction latency is not reduced (time from the start of an instruction to its completion)
- pipeline clock cycle (pipeline stage time) is limited by the slowest stage
- for some instructions, some stages are wasted cycles
(Diagram: lw, sw, and an R-type overlapped across Cycles 1 through 8, each passing through IFetch, Dec, Exec, Mem, WB.)

12 Single Cycle vs. Multiple Cycle vs. Pipeline
°Single-cycle implementation: one long clock cycle per instruction, sized for the slowest instruction (Load), so a Store wastes part of its cycle
°Multiple-cycle implementation: lw takes five short cycles (IFetch, Dec, Exec, Mem, WB) before sw can start; some stages are "wasted" cycles for some instructions
°Pipeline implementation: lw, sw, and an R-type overlap, one stage per clock
(Timing diagram: the multicycle case spans Cycles 1 through 10 for lw then sw; the single-cycle case fits Load and Store into Cycles 1 and 2 with waste.)

13 Multiple Cycle vs. Pipeline, Bandwidth vs. Latency
°Latency per lw = 5 clock cycles for both
°Bandwidth of lw is 1 instruction per clock (IPC) for the pipeline vs. 1/5 IPC for multicycle
°Pipelining improves instruction bandwidth, not instruction latency
(Timing diagram: multicycle runs lw then sw back to back over 10 cycles; the pipeline overlaps lw, sw, and an R-type.)
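The bandwidth-vs-latency claim can be quantified with a sketch. Assuming 5 stages and one cycle per stage, as on the slide, a multicycle machine needs 5n cycles for n instructions, while a pipeline needs 5 cycles to fill and then completes one instruction per cycle.

```python
# Total cycles to execute n instructions, 5 one-cycle stages assumed.
def multicycle_cycles(n, stages=5):
    return n * stages            # each instruction finishes before the next starts

def pipelined_cycles(n, stages=5):
    return stages + (n - 1)      # fill the pipe once, then 1 completion per cycle

n = 1_000_000
print(multicycle_cycles(n))      # 5000000
print(pipelined_cycles(n))       # 1000004
print(multicycle_cycles(n) / pipelined_cycles(n))   # speedup approaches 5
```

For large n the speedup approaches the number of stages, which is the "potential speedup = number of pipe stages" lesson from slide 8; the latency of any single instruction is still 5 cycles in both designs.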

14 Pipeline Datapath Modifications
°What do we need to add/modify in our MIPS datapath?
- registers between pipeline stages to isolate them: IFetch/Dec, Dec/Exec, Exec/Mem, Mem/WB
- the stages are IF (IFetch), ID (Dec), EX (Execute), MEM (MemAccess), and WB (WriteBack), all driven by the system clock
(Datapath figure: PC, Instruction Memory, Register File, Sign Extend, ALU, and Data Memory, with the four pipeline registers inserted between stages.)

15 Graphically Representing the Pipeline
Can help with answering questions like:
- how many cycles does it take to execute this code?
- what is the ALU doing during cycle 4?
(Notation: each instruction is drawn as a row of IM, Reg, ALU, DM, Reg boxes, one per stage.)

16 Why Pipeline? For Throughput!
°Once the pipeline is full, one instruction is completed every cycle
°The startup cost is the time to fill the pipeline
(Diagram: Inst 0 through Inst 4 staggered across the clock cycles, each occupying IM, Reg, ALU, DM, Reg in successive cycles.)
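The staggered diagram can be generated as text with a short sketch, using the IM/Reg/ALU/DM/Reg stage names from the slides: instruction i enters stage s in cycle i + s, so each row is shifted one column right of the previous one.

```python
# Text rendering of the staggered pipeline diagram.
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]   # stage names as drawn on the slides

def diagram(n_instructions):
    """Return one text row per instruction, stages staggered by issue cycle."""
    width = n_instructions + len(STAGES) - 1   # cycles until the pipe drains
    rows = []
    for i in range(n_instructions):
        cells = [""] * width
        for s, name in enumerate(STAGES):
            cells[i + s] = name                # instruction i is in stage s at cycle i+s
        rows.append(f"Inst {i}: " + "".join(f"{c:>5}" for c in cells))
    return rows

for row in diagram(5):
    print(row)
```

Reading a column of the output answers "what is the ALU doing during cycle 4?" directly: it is the ALU box of whichever row reaches its third stage in that column.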

17 Important Observation
°Each functional unit can only be used once per instruction (since 4 other instructions are executing at the same time)
°If different instructions use the same functional unit in different stages, hazards result:
- Load uses the Register File's write port during its 5th stage
- R-type uses the Register File's write port during its 4th stage
°There are 2 ways to solve this pipeline hazard.
(Diagram: Load occupies IFetch, Reg/Dec, Exec, Mem, Wr in stages 1 through 5; R-type occupies IFetch, Reg/Dec, Exec, Wr in stages 1 through 4.)

18 Solution 1: Insert a "Bubble" into the Pipeline
°Insert a "bubble" into the pipeline to prevent 2 writes in the same cycle
- the control logic can be complex
- we lose an instruction fetch and issue opportunity
°No instruction is started in Cycle 6!
(Diagram: a Load followed by several R-types over Cycles 1 through 9; the bubble delays a later R-type so its write does not collide with the Load's.)

19 Solution 2: Delay R-type's Write by One Cycle
°Delay R-type's register write by one cycle:
- now R-type instructions also use the Reg File's write port at Stage 5
- the Mem stage becomes a NOP stage for R-types: nothing is being done
(Diagram: R-type now occupies IFetch, Reg/Dec, Exec, Mem, Wr in stages 1 through 5, with the Mem stage idle.)
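A sketch makes the write-port conflict, and why Solution 2 removes it, concrete. The stage numbers are the slides' (Load writes in its 5th stage, an unfixed R-type in its 4th); the issue cycles are an assumed example of back-to-back instructions.

```python
# In which cycle does each instruction use the register file write port?
def write_cycle(issue_cycle, write_stage):
    # An instruction issued in cycle c reaches stage s in cycle c + s - 1.
    return issue_cycle + write_stage - 1

# Load issued in cycle 1, R-type issued right behind it in cycle 2:
load_wb  = write_cycle(1, 5)        # cycle 5
rtype_wb = write_cycle(2, 4)        # cycle 5 -> structural hazard on the write port
print(load_wb == rtype_wb)          # True: both need the port in the same cycle

# Solution 2: delay the R-type write to stage 5 (Mem becomes a NOP stage).
rtype_wb_fixed = write_cycle(2, 5)  # cycle 6 -> no conflict
print(load_wb == rtype_wb_fixed)    # False
```

Once every instruction writes in the same stage, two instructions can never claim the write port in the same cycle, regardless of the instruction mix, which is why this fix needs no extra control logic.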

20 Can Pipelining Get Us Into Trouble?
°Yes: Pipeline Hazards
- structural hazards: attempt to use the same resource by two different instructions at the same time
- data hazards: attempt to use data before it is ready
  - an instruction's source operands are produced by a prior instruction still in the pipeline
  - a load instruction followed immediately by an ALU instruction that uses the load result as a source value
- control hazards: attempt to make a decision before the condition has been evaluated
  - branch instructions
°Can always resolve hazards by waiting
- pipeline control must detect the hazard
- and take action (or delay action) to resolve it

21 A Single Memory Would Be a Structural Hazard
(Diagram: lw followed by four more instructions; in the same cycle, lw is reading data from memory while a later instruction is reading its instruction from the same memory.)

22 How About Register File Access?
(Diagram: "add r1, ..." followed a few instructions later by "add r2, r1, ..."; a potential read-before-write data hazard on r1.)
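The read-before-write situation on the slide can be detected mechanically. This is a simplified sketch, not the course's hazard-detection logic: it assumes registers are read in stage 2 and written in stage 5 with no forwarding, so an instruction that reads a register written by any of the three preceding instructions may see a stale value. The example program and register names are made up.

```python
# Minimal read-after-write (RAW) hazard check, no forwarding assumed.
def raw_hazards(program):
    """program: list of (dest_reg, src_regs); returns (producer, consumer) pairs."""
    hazards = []
    for i, (dest, _) in enumerate(program):
        # With a 5-stage pipe, the write in stage 5 lands too late for the
        # register reads of up to the next 3 instructions.
        for j in range(i + 1, min(i + 4, len(program))):
            if dest in program[j][1]:
                hazards.append((i, j))
    return hazards

prog = [("r1", ["r2", "r3"]),    # add r1, r2, r3
        ("r4", ["r1", "r5"]),    # add r4, r1, r5  <- reads r1 too soon
        ("r6", ["r7", "r8"])]    # independent instruction
print(raw_hazards(prog))         # [(0, 1)]
```

If the register file can write in the first half of a cycle and read in the second half, the window shrinks to two instructions; forwarding (covered later in the course) shrinks it further.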

23 Summary
°All modern day processors use pipelining
°Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
- multiple tasks operate simultaneously using different resources
°Potential speedup = number of pipe stages
°Pipeline rate is limited by the slowest pipeline stage
- unbalanced lengths of pipe stages reduce speedup
- time to "fill" the pipeline and time to "drain" it reduce speedup
°Must detect and resolve hazards
- stalling negatively affects throughput

