# CS510 Computer Architectures, Lecture 6: Introduction to Pipelining


Slide 2: Laundry Example

Pipelining: it's natural! Ann, Brian, Cathy, and Dave (loads A, B, C, and D) each have one load of clothes to wash, dry, and fold.

- The washer takes 30 minutes.
- The dryer takes 40 minutes.
- The "folder" takes 20 minutes.

Slide 3: Sequential Laundry

[Figure: timeline from 6 PM to midnight; loads A, B, C, and D in task order, each taking 30 + 40 + 20 = 90 minutes, one after another.]

Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take?

Slide 4: Pipelined Laundry (Start Work ASAP)

[Figure: timeline from 6 PM to midnight; loads A, B, C, and D in task order, with their wash, dry, and fold steps overlapping.]

Pipelined laundry takes 3.5 hours for 4 loads.
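The 6-hour and 3.5-hour figures can be checked with a small simulation. This is a minimal sketch, not part of the lecture: the function names are hypothetical, and it assumes a finished load can wait between machines (for example, in a basket) until the next machine is free.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return sum(stage_minutes) * loads


def pipelined_minutes(stage_minutes, loads):
    """Each load enters a stage as soon as both the load and the stage are free.

    Assumption: a load may wait between stages for as long as needed.
    """
    stage_free_at = [0] * len(stage_minutes)  # when each machine is next free
    for _ in range(loads):
        load_ready_at = 0  # when this load finished the previous stage
        for s, duration in enumerate(stage_minutes):
            start = max(load_ready_at, stage_free_at[s])
            stage_free_at[s] = start + duration
            load_ready_at = stage_free_at[s]
    return stage_free_at[-1]


laundry = [30, 40, 20]  # washer, dryer, folder (minutes)
print(sequential_minutes(laundry, 4) / 60)  # 6.0 hours
print(pipelined_minutes(laundry, 4) / 60)   # 3.5 hours
```

The dryer, the slowest stage, sets the pace: after the first wash, one load finishes drying every 40 minutes.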

Slide 5: Pipelining Lessons

- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
- The pipeline rate is limited by the slowest pipeline stage.
- Multiple tasks operate simultaneously.
- Potential speedup = number of pipe stages.
- Unbalanced lengths of pipe stages reduce speedup.
- The time to "fill" the pipeline and the time to "drain" it reduce speedup.

[Figure: the pipelined laundry timeline for loads A, B, C, and D, with the filling and draining periods marked.]
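The last three lessons can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes identical tasks that may wait between stages, in which case the pipelined time is the sum of the stage times plus one slowest-stage time for each additional task.

```python
def pipelined_time(stage_times, tasks):
    """Total time for `tasks` identical tasks, assuming tasks may wait between stages."""
    return sum(stage_times) + (tasks - 1) * max(stage_times)


def speedup(stage_times, tasks):
    """Sequential time divided by pipelined time."""
    return tasks * sum(stage_times) / pipelined_time(stage_times, tasks)


unbalanced = [30, 40, 20]  # the laundry stages from the slides
balanced = [30, 30, 30]    # hypothetical stages with the same total time

print(speedup(unbalanced, 4))     # 360 / 210, about 1.71
print(speedup(balanced, 4))       # 360 / 180 = 2.0
print(speedup(balanced, 1000))    # approaches 3, the number of stages
print(speedup(unbalanced, 1000))  # approaches 90 / 40 = 2.25
```

With only four loads, fill and drain time keeps even the balanced pipeline well below the potential speedup of 3. With many loads the balanced pipeline approaches 3, while the unbalanced one is capped by its slowest stage.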

Slide 7: DLX Instructions

| Instruction type / opcode | Instruction meaning |
| --- | --- |
| SLL, SRL, SRA, SLLI, SRLI, SRAI | Shifts: both immediate (`S__I`) and variable form (`S__`); logical, arithmetic |
| `S__`, `S__I` | Set conditional: `__` may be LT, GT, LE, GE, EQ, NE |
| Control | Conditional branches and jumps; PC-relative or through register |
| BEQZ, BNEZ | Branch GPR equal/not equal to zero; 16-bit offset from PC+4 |
| BFPT, BFPF | Test comparison bit in the FP status register and branch; 16-bit offset |
| J, JR | Jumps: 26-bit offset or target in register |
| JAL, JALR | Jump and link: save PC+4 in R31 |
| TRAP | Transfer to operating system at a vectored address |
| RFE | Return to user code from an exception; restore user mode |
| Floating point | FP operations on DP and SP formats |
| FcnD, FcnF | Fcn: ADD, SUB, MULT, DIV |
| CVTF2D, CVTF2I, CVTD2F, CVTD2I, CVTI2F, CVTI2D | Convert instructions: F single precision, D double precision, I integer; both operands are FPRs |
| `__D`, `__F` | DP and SP compares: `__` = LT, GT, LE, GE, EQ, NE; sets bits in FP status register |

Slide 8: DLX Instruction Format

- I-type instruction: Opcode (6 bits), rs1 (5), rd (5), Immediate (16). Used for loads, stores, all immediates, conditional branches, jump register, and jump and link register.
- R-type instruction: Opcode (6 bits), rs1 (5), rs2 (5), rd (5), func (11). Used for register-register ALU operations; func selects the operation (Add, Sub, ...).
- J-type instruction: Opcode (6 bits), offset added to PC (26). Used for jump, jump and link, trap, and return from exception.
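The three formats can be pulled apart with shifts and masks. The following is a minimal sketch, assuming the fields are packed in the order listed with the opcode in the most significant bits and bit 0 denoting the most significant bit (the numbering the later slides use, e.g. IR 6..10). The helper names and the example encoding are hypothetical, not real DLX opcodes.

```python
def field(word, first, last):
    """Extract bits first..last of a 32-bit word, where bit 0 is the most significant."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)


def decode_i_type(word):
    return {"opcode": field(word, 0, 5), "rs1": field(word, 6, 10),
            "rd": field(word, 11, 15), "imm": field(word, 16, 31)}


def decode_r_type(word):
    return {"opcode": field(word, 0, 5), "rs1": field(word, 6, 10),
            "rs2": field(word, 11, 15), "rd": field(word, 16, 20),
            "func": field(word, 21, 31)}


def decode_j_type(word):
    return {"opcode": field(word, 0, 5), "offset": field(word, 6, 31)}


# Arbitrary example word: opcode 8, rs1 = 1, rd = 2, immediate = 0xFFFC.
example = (8 << 26) | (1 << 21) | (2 << 16) | 0xFFFC
print(decode_i_type(example))
```

Because every format keeps the opcode and rs1 in the same positions, the register file can be read before the instruction type is known.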

Slide 9: 5 Steps of DLX Instruction Execution, Step 1

Step 1: Instruction fetch cycle (IF)

- Read the instruction from memory and store it in IR: $\mathrm{IR} \leftarrow \mathrm{Mem}[\mathrm{PC}]$
- Calculate the next instruction address: $\mathrm{NPC} \leftarrow \mathrm{PC} + 4$ (each instruction occupies 4 consecutive bytes)

[Figure: IF datapath with PC, instruction memory, adder (+4), NPC, and IR.]

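Slide 10: 5 Steps of DLX Instruction Execution, Step 2

Step 2: Instruction decode/register fetch cycle (ID)

- Read the source registers into A and B: $A \leftarrow \mathrm{Regs}[\mathrm{IR}_{6..10}]$, $B \leftarrow \mathrm{Regs}[\mathrm{IR}_{11..15}]$
- Sign-extend the 16-bit immediate field to a 32-bit immediate value: $\mathrm{Imm} \leftarrow ((\mathrm{IR}_{16})^{16} \mathbin{\#\#} \mathrm{IR}_{16..31})$
- Decoding is done in parallel with the register reads, which fixed-field decoding makes possible.
- Save the destination register address: $b \leftarrow \mathrm{Rd}$

[Figure: ID datapath with IR, register file, sign extension (16 to 32 bits), and the outputs A, B, Imm, b, and OP.]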
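The ID transfers can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the register file is modeled as a plain list, and bit 0 is taken to be the most significant bit of IR, matching the field numbering on the slide.

```python
def field(word, first, last):
    """Extract bits first..last of a 32-bit word, where bit 0 is the most significant."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)


def sign_extend16(value):
    """Replicate bit 16 of IR (the immediate's sign bit) 16 times in front of the immediate."""
    value &= 0xFFFF
    return (0xFFFF0000 | value) if value & 0x8000 else value


def decode_stage(ir, regs):
    """Return (A, B, Imm) as the ID step would latch them."""
    a = regs[field(ir, 6, 10)]
    b = regs[field(ir, 11, 15)]
    imm = sign_extend16(field(ir, 16, 31))
    return a, b, imm


regs = [10 * i for i in range(32)]    # hypothetical register contents
ir = (1 << 21) | (2 << 16) | 0xFFFC   # rs1 = 1, second register field = 2, immediate = -4
a, b, imm = decode_stage(ir, regs)
print(a, b, hex(imm))                 # 10 20 0xfffffffc
```

Both register reads happen unconditionally, even when the instruction turns out not to need B; that is the price of decoding in parallel.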

Slide 11: 5 Steps of DLX Instruction Execution, Step 3

Step 3: Execution/effective address cycle (EX)

- Memory reference: calculate the effective address. $\mathrm{ALUOutput} \leftarrow A + \mathrm{Imm}$
- Register-register ALU instruction: perform the ALU operation on the two register operands. $\mathrm{ALUOutput} \leftarrow A \;\mathit{func}\; B$
- Register-immediate ALU instruction: perform the ALU operation with the immediate operand. $\mathrm{ALUOutput} \leftarrow A \;\mathit{op}\; \mathrm{Imm}$
- Branch: calculate the branch target address and determine the condition. $\mathrm{ALUOutput} \leftarrow \mathrm{NPC} + \mathrm{Imm}$; $\mathrm{Cond} \leftarrow (A \;\mathit{op}\; 0)$
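The four EX cases map naturally onto a single dispatch function. The sketch below is a simplified model, not the lecture's hardware: the `kind` labels and the `execute` signature are hypothetical, the immediate is passed as an ordinary signed Python integer, and results are wrapped to 32 bits.

```python
import operator

MASK32 = 0xFFFFFFFF


def execute(kind, a, b, imm, npc, op=None):
    """Return (ALUOutput, Cond) for one instruction class.

    kind: "mem", "alu_rr", "alu_imm", or "branch" (hypothetical labels).
    op:   a two-argument function standing in for func/op on the slide.
    """
    if kind == "mem":       # effective address
        return (a + imm) & MASK32, False
    if kind == "alu_rr":    # A func B
        return op(a, b) & MASK32, False
    if kind == "alu_imm":   # A op Imm
        return op(a, imm) & MASK32, False
    if kind == "branch":    # target address and condition
        return (npc + imm) & MASK32, bool(op(a, 0))
    raise ValueError(f"unknown instruction kind: {kind}")


print(execute("mem", a=100, b=0, imm=-4, npc=0))                       # (96, False)
print(execute("alu_rr", a=7, b=5, imm=0, npc=0, op=operator.sub))      # (2, False)
print(execute("branch", a=0, b=0, imm=16, npc=8, op=operator.eq))      # (24, True)
```

Only one of the four cases applies to any given instruction, which is why a single ALU can serve all of them.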

Slide 12: Step 3 (EX)

[Figure: EX datapath with inputs NPC, A, B, Imm, and OP; multiplexers feeding the ALU; a zero test; and the outputs ALUOut and Cond.]

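Slide 13: 5 Steps of DLX Instruction Execution, Step 4

Step 4: Memory access/branch completion cycle (MEM)

- Memory reference: access memory, either for a load, $\mathrm{LMD} \leftarrow \mathrm{Mem}[\mathrm{ALUOutput}]$, or for a store, $\mathrm{Mem}[\mathrm{ALUOutput}] \leftarrow B$
- Branch: test the condition. If Cond is true, $\mathrm{PC} \leftarrow \mathrm{ALUOutput}$; otherwise $\mathrm{PC} \leftarrow \mathrm{NPC}$

[Figure: MEM datapath with data memory, a multiplexer selecting the next PC, and the signals ALUOut, NPC, Cond, PC, B, and LMD.]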
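A small model of the MEM step makes the three outcomes explicit. This is a sketch under stated assumptions: the function name and the `kind` labels are hypothetical, memory is modeled as a dictionary keyed by address, and instructions that are neither memory references nor branches are assumed to continue at NPC.

```python
def memory_stage(kind, alu_output, b, cond, npc, memory):
    """Return (LMD, next PC) after the MEM step.

    kind: "load", "store", "branch", or anything else for non-memory instructions.
    memory: dict mapping addresses to values (a stand-in for data memory).
    """
    lmd = None
    next_pc = npc                   # assumption: default is sequential execution
    if kind == "load":
        lmd = memory[alu_output]    # LMD <- Mem[ALUOutput]
    elif kind == "store":
        memory[alu_output] = b      # Mem[ALUOutput] <- B
    elif kind == "branch" and cond:
        next_pc = alu_output        # taken branch: PC <- ALUOutput
    return lmd, next_pc


mem = {96: 1234}
print(memory_stage("load", 96, 0, False, 8, mem))     # (1234, 8)
print(memory_stage("branch", 24, 0, True, 8, mem))    # (None, 24)
print(memory_stage("branch", 24, 0, False, 8, mem))   # (None, 8)
```

Note that the new PC for a taken branch is only known here, in the fourth step.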

Slide 14: 5 Steps of DLX Instruction Execution, Step 5

Step 5: Write-back cycle (WB)

- Register-register ALU instruction: store the result in the destination register. $\mathrm{Regs}[\mathrm{IR}_{16..20}] \leftarrow \mathrm{ALUOutput}$
- Register-immediate ALU instruction: store the result in the destination register. $\mathrm{Regs}[\mathrm{IR}_{11..15}] \leftarrow \mathrm{ALUOutput}$
- Load instruction: store the data read from memory in the destination register. $\mathrm{Regs}[\mathrm{IR}_{11..15}] \leftarrow \mathrm{LMD}$

[Figure: WB datapath with a multiplexer selecting between LMD and ALUOut, controlled by OP, feeding the register file.]

Slide 15: 5 Steps of DLX Datapath

[Figure: complete datapath divided into the IF, ID, EX, MEM, and WB stages, showing the PC, adder (+4), instruction memory, register file, sign extension (16 to 32 bits), zero test, ALU, multiplexers, data memory, and the SMD, ALU Output, and LMD registers.]

Slide 16: A Simple Implementation

A multi-cycle implementation:

- Needs temporary registers: NPC, IR, A, B, Imm, Cond, ALUOutput, LMD.
- Improves CPI: branches and ALU instructions finish in 4 cycles, while memory-reference (MR) instructions take 5. If the branch frequency is 12% and the ALU instruction frequency is 44%, leaving 44% for MR instructions, then CPI = 0.12 × 4 + 0.44 × 4 + 0.44 × 5 = 4.44.

A single-cycle implementation:

- Uses one long clock cycle.
- Is very inefficient for most machines, because the amount of work varies considerably from one instruction to another.
- Requires duplicating functional units that could be shared in a multi-cycle implementation.
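The CPI figure is a weighted average of cycle counts over the instruction mix. A minimal sketch of that calculation follows; the function name and the mix representation are hypothetical, and the numbers are the ones from the slide.

```python
def average_cpi(mix):
    """Weighted average CPI for a list of (frequency, cycles) pairs."""
    total = sum(freq for freq, _ in mix)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(freq * cycles for freq, cycles in mix)


mix = [
    (0.12, 4),  # branches
    (0.44, 4),  # ALU instructions
    (0.44, 5),  # memory-reference instructions
]
cpi = average_cpi(mix)
print(round(cpi, 2))      # 4.44
print(round(5 / cpi, 2))  # gain over taking 5 cycles for every instruction
```

The second line compares against a design in which every instruction takes all 5 cycles; that baseline is an assumption made here for illustration.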

Slide 17: Visualizing Pipeline

[Figure: five instructions in instruction order, each passing through IM, Reg, ALU, DM, and Reg in successive clock cycles; the time axis runs from CC1 to CC9, with the pipeline filling during the first cycles and draining during the last.]
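The diagram on this slide can be regenerated as text. The sketch below is illustrative: the function names are hypothetical, and it assumes an ideal pipeline in which a new instruction starts every clock cycle with no stalls.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]  # resources used in IF, ID, EX, MEM, WB


def pipeline_diagram(n_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the resource used in cycle c + 1."""
    total_cycles = n_instructions + len(stages) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters stage s in cycle i + s + 1
        rows.append(row)
    return rows


def render(rows):
    """Format the diagram with a CC1..CCn header line."""
    header = "     " + " ".join(f"CC{c + 1:<3}" for c in range(len(rows[0])))
    lines = [header]
    for i, row in enumerate(rows):
        lines.append(f"I{i + 1:<3} " + " ".join(f"{cell:<5}" for cell in row))
    return "\n".join(lines)


print(render(pipeline_diagram(5)))
```

Five instructions over five stages take 5 + 5 - 1 = 9 cycles, and only in CC5 are all five stages busy at once; the cycles before it fill the pipeline and the cycles after it drain it.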

Slide 18: Saving Information Produced by Each Stage of the Pipeline

- Each pipeline stage produces information (data, addresses, and control) by the end of the clock cycle.
- That information must be stored at the end of the clock cycle; otherwise it is lost.
- Thus, we need storage, called an inter-stage buffer, at the end of each pipeline stage.

Slide 19: Inter-Stage Buffers in the DLX Pipeline

- F/D buffer: IR, NPC
- D/A buffer: A, B, Imm, b (destination register address for storing the result), OP (opcode), cond, NPC
- A/M buffer: ALUout (arithmetic result or effective address), NPC, cond, b, OP
- M/W buffer: LMD (data for LD), ALUout (arithmetic result), b, OP

Slide 20: Pipelined DLX Datapath (Multicycle)

[Figure: the IF, ID, EX, MEM, and WB stages of the datapath (PC, adder (+4), instruction memory, register file, sign extension (16 to 32 bits), multiplexers, zero test, ALU, SMD, data memory, LMD), separated by the F/D, D/A, A/M, and M/W buffers.]

Slide 21: Reminder

- In a conventional single-port memory, instruction memory and data memory are the same memory.
  - Both the IF and MEM stages use memory.
  - One instruction uses the same hardware resource in two different cycles.
  - Two instructions, in different stages of the pipeline, try to use the same hardware resource at the same time.
- For branch instructions, the branch target address is available in the MEM stage.
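The memory conflict described above can be located cycle by cycle. This is a hedged sketch, not from the lecture: the function name is hypothetical, and it assumes an ideal five-stage pipeline in which instruction i (counting from 0) is fetched in cycle i + 1 and reaches MEM in cycle i + 4, with only loads and stores accessing memory in MEM.

```python
def memory_conflicts(uses_data_memory):
    """List (cycle, mem_instr, fetch_instr) for every IF/MEM clash on a single-port memory.

    uses_data_memory[i] is True if instruction i is a load or store.
    """
    conflicts = []
    n = len(uses_data_memory)
    for i, is_mem in enumerate(uses_data_memory):
        mem_cycle = i + 4        # IF at i + 1, ID at i + 2, EX at i + 3, MEM at i + 4
        fetcher = mem_cycle - 1  # the instruction whose IF falls in that same cycle
        if is_mem and fetcher < n:
            conflicts.append((mem_cycle, i, fetcher))
    return conflicts


# A load followed by four ALU instructions: the load's MEM access in cycle 4
# collides with the fetch of the fourth instruction (index 3).
print(memory_conflicts([True, False, False, False, False]))  # [(4, 0, 3)]
```

With separate instruction and data memories, or a memory with two ports, the list would always be empty.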
