Pipelining Hwanmo Sung CS147 Presentation Professor Sin-Min Lee.

What is pipelining?
- An implementation technique that overlaps the execution of multiple instructions.
- More generally, any architecture in which digital information flows through a series of stations that each inspect, interpret, or modify the information.

Example (1): Laundry
- Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
- The washer takes 30 minutes.
- The dryer takes 40 minutes.
- The “folder” takes 20 minutes.

Sequential Laundry
- Sequential laundry takes 6 hours for 4 loads.
[Figure: timeline from 6 PM to midnight. Loads A, B, C, and D run one after another in task order, each taking 30 + 40 + 20 minutes.]

Pipelined Laundry: Start work ASAP
- Pipelined laundry takes 3.5 hours for 4 loads.
- Speedup = 6 / 3.5 = 1.7
[Figure: timeline starting at 6 PM. Loads A, B, C, and D overlap in task order, with stage times of 30, 40, and 20 minutes.]
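The two laundry totals can be checked with a short simulation. This is a minimal sketch, not part of the original slides: the function names are made up for illustration, and it assumes each machine handles one load at a time and that a finished load can wait in a basket until the next machine is free.

```python
def pipelined_finish_time(stage_minutes, num_loads):
    """Minutes until the last load leaves the last stage.

    Assumption: a load may wait between stages until the next
    machine is free (the slides do not state this explicitly).
    """
    # finish[s] = time at which stage s finished its most recent load
    finish = [0] * len(stage_minutes)
    for _ in range(num_loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, finish[s])  # wait for both the load and the machine
            finish[s] = start + minutes
            ready = finish[s]
    return finish[-1]


def sequential_finish_time(stage_minutes, num_loads):
    """Minutes when each load runs start to finish before the next begins."""
    return num_loads * sum(stage_minutes)


stages = [30, 40, 20]  # washer, dryer, folder
seq = sequential_finish_time(stages, 4)   # 360 minutes
pipe = pipelined_finish_time(stages, 4)   # 210 minutes
print(seq / 60, pipe / 60, round(seq / pipe, 1))  # 6.0 3.5 1.7
```

The pipelined total is 30 + 4 x 40 + 20 minutes: once the pipeline is full, a load finishes every 40 minutes because the dryer is the slowest stage.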

Pipelining Lessons
- Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload.
- The pipeline rate is limited by the slowest pipeline stage.
- Multiple tasks operate simultaneously.
- Potential speedup = number of pipe stages.
- Unbalanced lengths of pipe stages reduce speedup.
- Time to “fill” the pipeline and time to “drain” it reduce speedup.

Computer Pipelines
- Computers execute billions of instructions, so throughput is what matters.
- Desirable RISC features: all instructions are the same length, registers are located in the same place in the instruction format, and memory operands appear only in loads or stores.

Unpipelined Design
- Single-cycle implementation
  - The cycle time depends on the slowest instruction.
  - Every instruction takes the same amount of time.
- Multi-cycle implementation
  - Divide the execution of an instruction into multiple steps.
  - Each instruction may take a variable number of steps (clock cycles).

Unpipelined System
- One operation must complete before the next can begin.
- Operations are spaced 33 ns apart.
[Figure: combinational logic (30 ns) followed by a register (3 ns), driven by a clock; timing diagram showing Op1, Op2, and Op3 one after another.]

Pipelined Design
- Divide the execution of an instruction into multiple steps (stages).
- Overlap the execution of different instructions in different stages.
- In each cycle, a different instruction occupies each stage.
- For example, in a 5-stage pipeline (Fetch-Decode-Read-Execute-Write), 5 instructions are executed concurrently in 5 different pipeline stages.
- One instruction completes every cycle (instead of every 5 cycles).
- This can increase the throughput of the machine by up to 5 times.

Example (1): 3-Stage Pipeline
- Operations are spaced 13 ns apart.
- 3 operations occur simultaneously.
- Delay = 39 ns, throughput = 77 MHz
[Figure: three stages, each with 10 ns of combinational logic followed by a 3 ns register, driven by a common clock; timing diagram showing Op1 through Op4 overlapping.]
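The delay and throughput figures for the unpipelined and 3-stage designs follow from the stage delays. The sketch below is illustrative only; `pipeline_timing` is a hypothetical helper, and it assumes the clock period is set by the slowest stage plus one register delay.

```python
def pipeline_timing(logic_ns, register_ns=3.0):
    """Return (clock period in ns, latency in ns, throughput in MHz).

    logic_ns lists the combinational delay of each stage; every stage
    is assumed to end in a register with the given delay.
    """
    period = max(logic_ns) + register_ns   # the clock must fit the slowest stage
    latency = period * len(logic_ns)       # one clock period per stage
    throughput_mhz = 1000.0 / period       # 1 / ns is GHz, so 1000 / ns is MHz
    return period, latency, throughput_mhz


unpipelined = pipeline_timing([30.0])              # period 33 ns, about 30 MHz
three_stage = pipeline_timing([10.0, 10.0, 10.0])  # period 13 ns, about 77 MHz
print(unpipelined)
print(three_stage)
```

Splitting the logic into three stages raises latency from 33 ns to 39 ns, because every stage pays the 3 ns register delay, while throughput rises by roughly 2.5 times.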

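Example (2)
- 5-stage pipeline: Fetch - Decode - Read - Execute - Write (F D R E W)
- Non-pipelined processor: 25 cycles = number of instructions (5) * number of stages (5)
- Pipelined processor: 9 cycles = start-up latency (4) + number of instructions (5)
[Figure: five instructions, each passing through F D R E W. In the non-pipelined case they run back to back; in the pipelined case each starts one cycle after the previous one, with the pipeline filling at the start and draining at the end.]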
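The cycle counts in this example can be reproduced, and the staggered diagram redrawn as text, with a few lines of code. This is a sketch under the slide's own assumptions (one cycle per stage, no stalls); the function names are invented for illustration.

```python
def unpipelined_cycles(num_instrs, num_stages):
    """Each instruction finishes all stages before the next one starts."""
    return num_instrs * num_stages


def pipelined_cycles(num_instrs, num_stages):
    """Start-up latency to fill the pipeline, then one completion per cycle."""
    return (num_stages - 1) + num_instrs


def pipeline_diagram(num_instrs, stages="FDREW"):
    """One text row per instruction; each column is a clock cycle, '.' is idle."""
    total = pipelined_cycles(num_instrs, len(stages))
    return [("." * i + stages).ljust(total, ".") for i in range(num_instrs)]


print(unpipelined_cycles(5, 5))  # 25
print(pipelined_cycles(5, 5))    # 9
for row in pipeline_diagram(5):
    print(row)
```

The first row printed by `pipeline_diagram(5)` is `FDREW....` and the last is `....FDREW`, so the diagram spans 9 columns, matching the 9-cycle total.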

Basic Performance Issues in Pipelining
- Pipelining increases the CPU instruction throughput (the number of instructions completed per unit of time), but it does not reduce the execution time of an individual instruction.

Pipeline Speedup Example
- Assume the multi-cycle machine has a 10 ns clock cycle, loads take 5 clock cycles and account for 40% of the instructions, and all other instructions take 4 clock cycles.
- If pipelining the machine adds 1 ns to the clock cycle, how much speedup in instruction execution rate do we get from pipelining?
- Multi-cycle average instruction time = clock cycle x average CPI = 10 ns x (0.6 x 4 + 0.4 x 5) = 44 ns
- Pipelined average instruction time = 10 + 1 = 11 ns
- Speedup = 44 / 11 = 4
- This ignores the time needed to fill and empty the pipeline and delays due to hazards.
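The arithmetic of this example can be written out as a short calculation. The variable names below are illustrative; the numbers are the ones given on the slide, and the pipelined machine is assumed to complete one instruction per clock cycle.

```python
clock_ns = 10.0

# Instruction mix from the slide: (fraction of instructions, clock cycles each)
mix = {"load": (0.4, 5), "other": (0.6, 4)}

# Average CPI of the multi-cycle machine is the weighted sum over the mix.
avg_cpi = sum(fraction * cycles for fraction, cycles in mix.values())

multicycle_ns = clock_ns * avg_cpi   # average time per instruction, multi-cycle
pipelined_ns = clock_ns + 1.0        # one instruction per (slightly longer) cycle

speedup = multicycle_ns / pipelined_ns
print(round(avg_cpi, 2), round(multicycle_ns, 2), round(speedup, 2))  # 4.4 44.0 4.0
```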

What makes it easy?
- All instructions are the same length.
- There are just a few instruction formats.
- Memory operands appear only in loads and stores.

It’s Not Easy for Computers
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
  - Structural hazards: the hardware cannot support this combination of instructions because two instructions need the same resource.
  - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline.
  - Control hazards: pipelining of branches and other instructions that change the PC.
- A common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline.

Limitation: Nonuniform Pipelining
- Throughput is limited by the slowest stage.
- Delay is determined by clock period * number of stages.
- Must attempt to balance stages.
- Delay = 18 * 3 = 54 ns, throughput = 55 MHz
[Figure: three stages with combinational logic delays of 5 ns, 15 ns, and 10 ns, each followed by a 3 ns register, driven by a common clock.]

Limitation: Deep Pipelines
- Diminishing returns as more pipeline stages are added.
- Register delays become the limiting factor.
  - Increased latency
  - Small throughput gains
- Delay = 48 ns, throughput = 125 MHz (one operation per 8 ns clock period)
[Figure: six stages, each with 5 ns of combinational logic followed by a 3 ns register, driven by a common clock.]
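The diminishing returns can be seen by sweeping the number of stages. This is a rough sketch that assumes the 30 ns of combinational logic from the earlier examples divides evenly across stages and that every stage adds a 3 ns register; `deep_pipeline` is a hypothetical helper, and the 12-stage case is an extrapolation beyond the slides.

```python
def deep_pipeline(total_logic_ns, register_ns, num_stages):
    """Return (latency in ns, throughput in MHz) for evenly split logic."""
    period = total_logic_ns / num_stages + register_ns  # per-stage clock period
    latency = period * num_stages
    throughput_mhz = 1000.0 / period
    return latency, throughput_mhz


for n in (1, 3, 6, 12):
    latency, mhz = deep_pipeline(30.0, 3.0, n)
    print(n, latency, round(mhz, 1))
# 1 33.0 30.3
# 3 39.0 76.9
# 6 48.0 125.0
# 12 66.0 181.8
```

With six stages the clock period is 8 ns, so throughput works out to 125 MHz. Going from 1 to 3 stages multiplies throughput by about 2.5, but doubling from 6 to 12 stages gains less than 1.5 times while latency keeps growing, because the fixed 3 ns register delay becomes a larger share of each clock period.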

Limitation: Sequential Dependencies
- Op4 gets its result from Op1.
- This creates a pipeline hazard.
[Figure: three stages of combinational logic, each followed by a register and driven by a common clock; timing diagram showing Op1 through Op4, with Op4 depending on Op1.]

Structural Hazard
- Sometimes called a resource conflict.
- Example: some pipelined machines share a single memory pipeline for data and instructions. As a result, when an instruction contains a data memory reference, it conflicts with the instruction fetch for a later instruction.

Solutions to Structural Hazards
- Resource duplication, for example:
  - Separate I and D caches for memory access conflicts
  - Time-multiplexed or multi-port register file for register file access conflicts

Data Hazard
- A data hazard occurs when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen when instructions execute sequentially on an unpipelined machine.

Solutions to Data Hazards
- Freezing the pipeline
- (Internal) forwarding
- Compiler scheduling
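Before any of these remedies can be applied, the dependence itself has to be found. The sketch below shows one way to flag read-after-write dependences in a short instruction list. It is purely illustrative: the `Instr` record, the register names, the sample program, and the `window` parameter (how many following instructions could still read a stale value) are all assumptions, not something taken from the slides or from a particular machine.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str               # assembly-like text, for display only
    dest: Optional[str]     # register written, or None
    srcs: Tuple[str, ...]   # registers read


def raw_hazards(program, window=2):
    """Return (producer index, consumer index, register) triples.

    A later instruction within `window` slots that reads a register
    written by an earlier one is flagged as a read-after-write hazard.
    """
    hazards = []
    for i, producer in enumerate(program):
        if producer.dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if producer.dest in program[j].srcs:
                hazards.append((i, j, producer.dest))
            if program[j].dest == producer.dest:
                break  # register overwritten; later reads see the newer value
    return hazards


program = [
    Instr("add r1, r2, r3", "r1", ("r2", "r3")),
    Instr("sub r4, r1, r5", "r4", ("r1", "r5")),
    Instr("and r6, r7, r8", "r6", ("r7", "r8")),
    Instr("or  r9, r1, r6", "r9", ("r1", "r6")),
]
print(raw_hazards(program))  # [(0, 1, 'r1'), (2, 3, 'r6')]
```

Each flagged pair is a place where the hardware would need to freeze the pipeline or forward the result, or where a compiler could try to move an independent instruction in between.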

Control (Branch) Hazards
- Caused by branches.
- The fetch of the next instruction has to wait until the target of the current branch instruction (including the branch condition) is resolved.

Solutions to Control Hazards
- Optimized branch processing
  1. Find out early whether the branch is taken or not → simplified branch condition
  2. Compute the branch target address early → extra hardware
- Branch prediction
  - Predict the next target address and, if wrong, flush all the speculatively fetched instructions from the pipeline.
- Delayed branch
  - Pipeline stall to delay the fetch of the next instruction.

Summary
- Pipelining overlaps the execution of multiple instructions.
- With an ideal pipeline, the CPI (cycles per instruction) is one, and the speedup is equal to the number of stages in the pipeline.
- However, several factors prevent us from achieving the ideal speedup, including:
  - Not being able to divide the pipeline evenly
  - The time needed to fill and drain the pipeline
  - Overhead needed for pipelining
  - Structural, data, and control hazards

Summary
- Just overlap tasks; this is easy if the tasks are independent.
- Speedup vs. pipeline depth; if the ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$

- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data: need forwarding, compiler scheduling
  - Control: discuss next time
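The speedup relation on this slide (pipeline depth divided by one plus the stall CPI, times the ratio of unpipelined to pipelined clock cycle) is easy to evaluate directly. The function below is a small sketch of that formula; the function name, parameter names, and sample stall values are illustrative assumptions.

```python
def pipeline_speedup(depth, stall_cpi, clock_unpipelined=1.0, clock_pipelined=1.0):
    """Speedup = depth / (1 + stall CPI) * (unpipelined clock / pipelined clock).

    Assumes an ideal CPI of 1, as stated on the slide.
    """
    return depth / (1.0 + stall_cpi) * (clock_unpipelined / clock_pipelined)


print(pipeline_speedup(5, 0.0))    # 5.0: no stalls, speedup equals the depth
print(pipeline_speedup(5, 0.25))   # 4.0: a quarter of a stall cycle per instruction
```

With no stalls and equal clock cycles the result reduces to the pipeline depth, which matches the ideal case described in the first summary.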
