Published by Carlton Clem. Modified over 9 years ago.
1
Pipelining
Hwanmo Sung
CS147 Presentation
Professor Sin-Min Lee
2
Pipelining
3
What is pipelining?
- An implementation technique that overlaps the execution of multiple instructions.
- More generally, any architecture in which digital information flows through a series of stations that each inspect, interpret, or modify the information.
4
Example (1): Laundry
- Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- “Folder” takes 20 minutes
5
Sequential Laundry
- Sequential laundry takes 6 hours for 4 loads
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, each taking 30 + 40 + 20 minutes, run back to back]
6
Pipelined Laundry
- Start work ASAP
- Pipelined laundry takes 3.5 hours for 4 loads
- Speedup = 6/3.5 = 1.7
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, overlapped so that each load starts washing as soon as the washer is free]
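The 6-hour and 3.5-hour figures can be reproduced with a small simulation. This is a minimal sketch, not part of the original slides: the function names are illustrative, and it assumes a finished load can wait between machines until the next machine is free.

```python
def sequential_time(stage_minutes, loads):
    """Total minutes when each load finishes all stages before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_time(stage_minutes, loads):
    """Total minutes when a load enters a stage as soon as both the load
    and the machine for that stage are free (assumption: loads may wait
    between machines)."""
    stage_free = [0] * len(stage_minutes)  # time at which each machine becomes free
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free[s])
            ready = start + minutes
            stage_free[s] = ready
    return stage_free[-1]


stages = [30, 40, 20]  # washer, dryer, "folder" (minutes)
seq = sequential_time(stages, 4)
pipe = pipelined_time(stages, 4)
print(seq / 60, "hours sequential;", pipe / 60, "hours pipelined")
print("speedup:", round(seq / pipe, 2))
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: after the first wash, the 40-minute dryer is the stage every load has to queue for.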
7
Pipelining Lessons
- Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to “fill” the pipeline and time to “drain” it reduce speedup
8
Computer Pipelines
- Computers execute billions of instructions, so throughput is what matters
- RISC desirable features: all instructions are the same length, registers are located in the same place in the instruction format, and memory operands appear only in loads or stores
9
Unpipelined Design
- Single-cycle implementation
  - The cycle time depends on the slowest instruction
  - Every instruction takes the same amount of time
- Multi-cycle implementation
  - Divide the execution of an instruction into multiple steps
  - Each instruction may take a variable number of steps (clock cycles)
10
Unpipelined System
- One operation must complete before the next can begin
- Operations are spaced 33 ns apart
[Figure: combinational logic (30 ns) followed by a register (3 ns), driven by one clock; timeline showing Op1, Op2, Op3 executing one after another]
11
Pipelined Design
- Divide the execution of an instruction into multiple steps (stages)
- Overlap the execution of different instructions in different stages
- In each cycle, a different instruction occupies each stage
- For example, in a 5-stage pipeline (Fetch-Decode-Read-Execute-Write), 5 instructions are executed concurrently in 5 different pipeline stages
- Complete the execution of one instruction every cycle (instead of every 5 cycles)
- Can increase the throughput of the machine by up to 5 times
12
Example (1): 3-Stage Pipeline
- Operations are spaced 13 ns apart
- 3 operations occur simultaneously
- Delay = 39 ns
- Throughput = 77 MHz
[Figure: three blocks of combinational logic (10 ns each), each followed by a register (3 ns), driven by one clock; timeline showing Op1 to Op4 overlapped]
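The delay and throughput numbers for the unpipelined system and the 3-stage pipeline follow from the clock period. The sketch below assumes the clock period equals the slowest stage's logic delay plus one register delay; the function name `timing` is illustrative, not from the slides.

```python
def timing(logic_ns, reg_ns):
    """Return (clock period in ns, total delay in ns, throughput in MHz)
    for a pipeline whose stages have the given combinational-logic delays.
    Assumption: period = slowest logic delay + register delay."""
    period = max(logic_ns) + reg_ns
    delay = period * len(logic_ns)
    throughput_mhz = 1000.0 / period  # 1 / ns = 1000 MHz
    return period, delay, throughput_mhz


# Unpipelined system: one 30 ns logic block and one 3 ns register.
print(timing([30.0], 3.0))
# 3-stage pipeline: the same logic split into three 10 ns blocks.
print(timing([10.0, 10.0, 10.0], 3.0))
```

Splitting the logic cuts the clock period from 33 ns to 13 ns, but each operation now passes through three registers instead of one, so its total delay rises from 33 ns to 39 ns.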
13
Example (2)
- 5-stage pipeline: Fetch - Decode - Read - Execute - Write
- Non-pipelined processor: 25 cycles = number of instrs (5) * number of stages (5)
- Pipelined processor: 9 cycles = start-up latency (4) + number of instrs (5)
[Figure: staggered F-D-R-E-W rows, one per instruction, each starting one cycle after the previous; the first cycles are labeled "Filling the pipeline" and the last cycles "Draining the pipeline"]
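The cycle counts in this example can be checked, and the staggered diagram redrawn, with a few lines of code. This is a sketch under the slide's own assumptions (one stage per cycle, no stalls); the function names are illustrative.

```python
STAGES = "FDREW"  # Fetch, Decode, Read, Execute, Write


def unpipelined_cycles(num_instrs, num_stages):
    """Each instruction runs through every stage before the next one starts."""
    return num_instrs * num_stages


def pipelined_cycles(num_instrs, num_stages):
    """Start-up latency (num_stages - 1) plus one cycle per instruction."""
    if num_instrs == 0:
        return 0
    return (num_stages - 1) + num_instrs


def diagram(num_instrs, stages=STAGES):
    """One row per instruction; each row is shifted right by one cycle."""
    return [" " * i + stages for i in range(num_instrs)]


for row in diagram(5):
    print(row)
print(unpipelined_cycles(5, len(STAGES)), "cycles unpipelined")
print(pipelined_cycles(5, len(STAGES)), "cycles pipelined")
```

The width of the last row of the diagram equals the pipelined cycle count, which is where the "4 + 5" in the slide comes from.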
14
Basic Performance Issues in Pipelining
- Pipelining increases the CPU instruction throughput (the number of instructions completed per unit of time), but it does not reduce the execution time of an individual instruction.
15
Pipeline Speedup Example
- Assume the multi-cycle machine has a 10 ns clock cycle, loads take 5 clock cycles and account for 40% of the instructions, and all other instructions take 4 clock cycles.
- If pipelining the machine adds 1 ns to the clock cycle, how much speedup in instruction execution rate do we get from pipelining?
MC Ave Instr. Time = Clock cycle x Average CPI = 10 ns x (0.6 x 4 + 0.4 x 5) = 44 ns
PL Ave Instr. Time = 10 + 1 = 11 ns (the ideal pipeline completes one instruction per cycle)
Speedup = 44 / 11 = 4
- This ignores the time needed to fill and empty the pipeline and delays due to hazards.
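The arithmetic of this example can be written out as a short script. It is a sketch using only the numbers given in the slide; the variable and function names are illustrative, and the pipelined machine is assumed to have an ideal CPI of 1.

```python
def average_cpi(mix):
    """Weighted average cycles per instruction.
    mix: list of (fraction of instructions, cycles per instruction)."""
    return sum(fraction * cycles for fraction, cycles in mix)


mix = [(0.4, 5), (0.6, 4)]  # loads: 40% at 5 cycles; all others: 4 cycles
mc_clock_ns = 10.0          # multi-cycle clock period
pl_clock_ns = mc_clock_ns + 1.0  # pipelining adds 1 ns to the clock

mc_time_ns = mc_clock_ns * average_cpi(mix)  # average time per instruction
pl_time_ns = pl_clock_ns * 1.0               # assumption: ideal CPI of 1
speedup = mc_time_ns / pl_time_ns

print(round(mc_time_ns, 2), round(pl_time_ns, 2), round(speedup, 2))
```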
16
What makes it easy?
- All instructions are the same length
- Just a few instruction formats
- Memory operands appear only in loads and stores
17
It’s not Easy for Computers
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  - Structural hazards: the hardware cannot support this combination of instructions (two instructions need the same resource)
  - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
  - Control hazards: pipelining of branches and other instructions that change the PC
- A common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline
18
Limitation: Nonuniform Pipelining
- Throughput is limited by the slowest stage
- Delay is determined by clock period * number of stages
- Must attempt to balance stages
- Delay = 18 * 3 = 54 ns
- Throughput = 55 MHz
[Figure: three stages with combinational logic delays of 5 ns, 15 ns, and 10 ns, each followed by a 3 ns register, driven by one clock]
19
Limitation: Deep Pipelines
- Diminishing returns as more pipeline stages are added
- Register delays become the limiting factor
  - Increased latency
  - Small throughput gains
- Delay = 6 * 8 = 48 ns, Throughput = 1 / 8 ns = 125 MHz
[Figure: six stages, each with 5 ns of combinational logic followed by a 3 ns register, driven by one clock]
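Both limitations, nonuniform stages and deep pipelines, show up when the same timing model is applied to different stage splits. The sketch below assumes 30 ns of total logic and a 3 ns register per stage, as in the figures, and a clock period set by the slowest stage; `stats` is an illustrative name. With a 5 ns stage the period is 8 ns, which corresponds to 125 MHz.

```python
REG_NS = 3.0  # register delay per stage, taken from the figures


def stats(logic_ns, reg_ns=REG_NS):
    """Clock period, total delay, and throughput for the given stage delays.
    Assumption: the clock must accommodate the slowest stage."""
    period = max(logic_ns) + reg_ns
    return {
        "period_ns": period,
        "delay_ns": period * len(logic_ns),
        "throughput_mhz": 1000.0 / period,
    }


# Nonuniform split of 30 ns of logic: the 15 ns stage sets the clock.
print("5/15/10 split:", stats([5.0, 15.0, 10.0]))

# Balanced splits of the same 30 ns at increasing depth.
for depth in (1, 3, 6, 30):
    print(depth, "stages:", stats([30.0 / depth] * depth))
```

Going from 3 balanced stages to 6 doubles the register overhead per operation (9 ns to 18 ns of the total delay) while the throughput rises by well under a factor of two, which is the diminishing return the slide describes.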
20
Limitation: Sequential Dependencies
- Op4 gets its result from Op1
- Pipeline hazard
[Figure: three-stage pipeline of combinational logic and registers driven by one clock; timeline showing Op1 to Op4, with Op4 depending on the result of Op1]
21
Structural Hazard
- Sometimes called a resource conflict.
- Example: some pipelined machines share a single memory pipeline for data and instructions. As a result, when an instruction contains a data memory reference, it conflicts with the instruction fetch for a later instruction.
22
Solutions to Structural Hazard
- Resource duplication
- Examples:
  - Separate I and D caches for memory access conflicts
  - Time-multiplexed or multi-port register file for register file access conflicts
23
Data Hazard
- Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined machine
24
Solutions to Data Hazard
- Freezing the pipeline
- (Internal) forwarding
- Compiler scheduling
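The difference between freezing the pipeline and forwarding can be illustrated with a toy timing model of the 5-stage F-D-R-E-W pipeline. This is a sketch under stated assumptions that are not in the slides: every instruction is an ALU operation whose result is ready at the end of E, a register written in W cannot be read by R in the same cycle, instructions issue in order, and forwarding feeds an E result directly to the next instruction's E stage. The register names and the function `total_cycles` are hypothetical.

```python
def total_cycles(program, forwarding):
    """Cycle in which the last instruction completes its W stage.

    program: list of (dest_register, source_registers) in program order.
    Cycles are numbered from 1; the first instruction is in R in cycle 3.
    """
    earliest_read = {}  # register -> first cycle a consumer may be in R
    prev_read = 2       # so the first instruction reads in cycle 3
    write = 0
    for dest, sources in program:
        read = prev_read + 1  # in order: one cycle behind the previous R
        for src in sources:
            # Freeze (stall) until every source operand is available.
            read = max(read, earliest_read.get(src, 0))
        write = read + 2      # R -> E -> W
        if forwarding:
            # The E result is forwarded, so a consumer may follow directly.
            earliest_read[dest] = read + 1
        else:
            # The consumer must read after the producer's W stage.
            earliest_read[dest] = write + 1
        prev_read = read
    return write


program = [
    ("r1", ("r2", "r3")),  # r1 = r2 op r3
    ("r4", ("r1", "r5")),  # uses r1 right away: data hazard
    ("r6", ("r7", "r8")),  # independent
]
print("freeze only:", total_cycles(program, forwarding=False))
print("forwarding: ", total_cycles(program, forwarding=True))
```

Compiler scheduling attacks the same hazard from the other side: under this model, moving the independent third instruction between the first two shortens the wait for `r1` without any forwarding hardware.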
25
Control (Branch) Hazards
- Caused by branches
- Fetching the next instruction has to wait until the target (including the branch condition) of the current branch instruction is resolved
26
Solutions to Control Hazard
- Optimized branch processing
  1. Find out early whether the branch is taken or not → simplified branch condition
  2. Compute the branch target address early → extra hardware
- Branch prediction
  - Predict the next target address and, if wrong, flush all the speculatively fetched instructions from the pipeline
- Delayed branch
  - Pipeline stall to delay the fetch of the next instruction
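The cost of control hazards is often summarized as extra stall cycles per instruction. The sketch below uses a standard first-order estimate with hypothetical numbers (branch frequency, penalty, and misprediction rate are not from the slides); stalling on every branch is modeled as a misprediction rate of 1.0.

```python
def cpi_with_branches(branch_fraction, mispredict_rate, penalty_cycles,
                      base_cpi=1.0):
    """Effective CPI when each mispredicted (or stalled) branch costs
    penalty_cycles extra cycles. All inputs here are illustrative."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% branches, 3-cycle penalty.
always_stall = cpi_with_branches(0.20, 1.0, 3)   # stall on every branch
predicted = cpi_with_branches(0.20, 0.10, 3)     # 10% of branches mispredicted
print(round(always_stall, 2), round(predicted, 2))
```

Resolving the branch earlier lowers `penalty_cycles`, while better prediction lowers `mispredict_rate`; the two approaches on the slide reduce different factors of the same product.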
27
Summary
- Pipelining overlaps the execution of multiple instructions.
- With an ideal pipeline, the CPI (cycles per instruction) is one, and the speedup is equal to the number of stages in the pipeline.
- However, several factors prevent us from achieving the ideal speedup, including:
  - Not being able to divide the pipeline evenly
  - The time needed to fill and drain the pipeline
  - Overhead needed for pipelining
  - Structural, data, and control hazards
28
Summary
- Just overlap tasks; this is easy if the tasks are independent
- Speedup vs. pipeline depth; if the ideal CPI is 1, then:
  Speedup = [Pipeline Depth / (1 + Pipeline stall CPI)] x (Clock Cycle Unpipelined / Clock Cycle Pipelined)
- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data: need forwarding, compiler scheduling
  - Control: discuss next time
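The speedup relation on this slide, pipeline depth divided by (1 + pipeline stall CPI) and scaled by the ratio of unpipelined to pipelined clock cycle, is easy to evaluate for a few cases. This is a minimal sketch; the function name and the sample inputs are illustrative, not taken from the slides.

```python
def pipeline_speedup(depth, stall_cpi,
                     clock_unpipelined=1.0, clock_pipelined=1.0):
    """Speedup = depth / (1 + stall CPI) * (unpipelined clock / pipelined clock),
    assuming an ideal CPI of 1 as on the slide."""
    return depth / (1.0 + stall_cpi) * (clock_unpipelined / clock_pipelined)


print(pipeline_speedup(5, 0.0))    # no stalls: speedup equals the depth
print(pipeline_speedup(5, 0.25))   # a quarter of a stall cycle per instruction
print(pipeline_speedup(5, 0.25, clock_unpipelined=10.0, clock_pipelined=11.0))
```

With no stalls and equal clock cycles the result reduces to the pipeline depth, matching the ideal case; stalls and a slower pipelined clock each pull the speedup below that bound.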