Pipelining Example Laundry Example: Three Stages


1 Pipelining Example Laundry Example: Three Stages
Wash dirty load of clothes
Dry wet clothes
Fold and put clothes into drawers
Each stage takes 30 minutes to complete
Four loads of clothes to wash, dry, and fold: A, B, C, D

2 Sequential Laundry
[Timeline, 6 PM to 12 AM: loads A, B, C, D each run their three 30-minute stages back-to-back]
Sequential laundry takes 6 hours for 4 loads
Intuitively, we can use pipelining to speed up laundry

3 Pipelined Laundry: Start Load ASAP
[Timeline, 6 PM to 9 PM: loads A–D overlap, with a new load starting every 30 minutes]
Pipelined laundry takes 3 hours for 4 loads
Speedup factor is 2 for 4 loads
Time to wash, dry, and fold one load is still the same (90 minutes)

4 Pipelining: Laundry Example
A small laundry has one washer, one dryer, and one operator; it takes 90 minutes to finish one load:
Washer takes 30 minutes
Dryer takes 40 minutes
Folding by the operator takes 20 minutes
Four loads: A, B, C, D

5 Sequential Laundry
[Timeline: loads A–D run in order, each taking 30 + 40 + 20 = 90 minutes]
Sequential laundry takes 6 hours for 4 loads

6 Pipelined Laundry
[Timeline: loads A–D overlap; a new load enters every 40 minutes, the time of the slowest stage]
Because the dryer (40 minutes) is the slowest stage, a new load can be started only every 40 minutes.
Pipelined laundry takes 3.5 hours for 4 loads
Speedup factor is 6/3.5 ≈ 1.7 for 4 loads
Time to wash, dry, and fold one load is still the same (90 minutes)
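The slide's 3.5-hour figure and 1.7× speedup can be checked with a short sketch (Python; variable names are mine):

```python
# With unequal stages, the pipeline advances at the rate of its slowest
# stage (the 40-minute dryer), so after the first wash finishes, one load
# leaves the dryer every 40 minutes.
wash, dry, fold_time, loads = 30, 40, 20, 4

pipelined = wash + loads * dry + fold_time      # 30 + 160 + 20 = 210 minutes
sequential = loads * (wash + dry + fold_time)   # 4 * 90 = 360 minutes

print(pipelined / 60)                    # 3.5 hours
print(round(sequential / pipelined, 2))  # speedup ≈ 1.71
```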

7 Serial Execution versus Pipelining
Consider a task that can be divided into k subtasks
The k subtasks are executed on k different stages
Each subtask requires one time unit
The total execution time of the task is k time units
Pipelining overlaps the execution:
The k stages work in parallel on k different tasks
Tasks enter/leave the pipeline at the rate of one task per time unit
Without pipelining: one completion every k time units
With pipelining: one completion every 1 time unit

8 Pipeline Performance
Let t_i = time delay in stage S_i
Clock cycle t = max(t_i) is the maximum stage delay
Clock frequency f = 1/t = 1/max(t_i)
A pipeline can process n tasks in k + n − 1 cycles:
k cycles are needed to complete the first task
n − 1 cycles are needed to complete the remaining n − 1 tasks
Speedup of a k-stage pipeline over serial execution:
S_k = (serial execution in cycles) / (pipelined execution in cycles) = nk / (k + n − 1) → k for large n
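The cycle-count and speedup formulas above can be sketched directly (Python; function names are mine):

```python
def pipeline_cycles(k, n):
    """Cycles for n tasks in a k-stage pipeline: k to complete the first
    task, then one more task completes per cycle."""
    return k + n - 1

def speedup(k, n):
    """S_k = serial cycles / pipelined cycles = n*k / (k + n - 1)."""
    return (n * k) / pipeline_cycles(k, n)

print(pipeline_cycles(5, 4))          # 8 cycles for 4 tasks, 5 stages
# As n grows large, the speedup approaches k:
print(round(speedup(5, 1000), 2))     # close to 5
```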

9 The Fetch-Decode-Execute Cycle
Fetch:
MAR ← PC
IR ← M[MAR]
Decode:
MAR ← IR[11-0]
Decode IR[15-12]
Execute:
MBR ← M[MAR]

10 Register Transfer Notation - Example
The common fetch-decode steps are:
MAR ← PC
IR ← M[MAR]
MAR ← IR[11-0]
Decode IR[15-12]

The RTL for the ADD X instruction is:
MAR ← X
MBR ← M[MAR]
AC ← AC + MBR

The RTL for the ADDI X instruction (indirect addressing) is:
MAR ← X
MBR ← M[MAR]
MAR ← MBR
MBR ← M[MAR]
AC ← AC + MBR
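The ADD X execute phase can be traced with a toy register-transfer sketch (Python; the address X = 0x10 and all memory/register values are invented for illustration):

```python
# Toy trace of the ADD X execute-phase RTL. AC, MAR, MBR stand for the
# accumulator, memory address register, and memory buffer register.
M = {0x10: 7}                    # memory: M[X] = 7 (hypothetical contents)
regs = {"AC": 5, "MAR": 0, "MBR": 0}
X = 0x10                         # hypothetical operand address

regs["MAR"] = X                  # MAR <- X
regs["MBR"] = M[regs["MAR"]]     # MBR <- M[MAR]
regs["AC"] += regs["MBR"]        # AC <- AC + MBR
print(regs["AC"])                # 12
```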

11 Instruction Pipelining
Some CPUs divide the fetch-decode-execute cycle into smaller steps. These smaller steps can often be executed in parallel to increase throughput. Such parallel execution is called instruction pipelining. Instruction pipelining provides for instruction-level parallelism (ILP).

12 6-Stage Pipelining
Suppose the fetch-decode-execute cycle were broken into:
1. Fetch instruction
2. Decode opcode
3. Calculate effective address of operands
4. Fetch operands
5. Execute instruction
6. Store result
In a six-stage pipeline: S1 fetches the instruction, S2 decodes it, S3 determines the address of the operands, S4 fetches them, S5 executes the instruction, and S6 stores the result.
Note: Not all instructions must go through each stage of the pipe.

13 6-Stage Pipelining
For every clock cycle, one small step is carried out, and the stages are overlapped: S1 fetches the instruction, S2 decodes it, S3 determines the address of the operands, S4 fetches them, S5 executes the instruction, and S6 stores the result.

14 Serial Execution versus Pipelining
Assumptions: a k-stage pipeline with clock cycle time t_p, and n instructions, each representing a task T in the pipeline.
Each task requires k × t_p time to complete on its own.
n tasks require n × t_p time to enter the pipeline.
The last task requires an additional (k − 1) × t_p time to complete after entering.
So, the total time to complete n tasks using a k-stage pipeline is (n + k − 1) × t_p.

15 Pipeline Performance
The speedup gained by using a pipeline:
Without a pipeline, the total time to complete n tasks is n × k × t_p.
Using a k-stage pipeline, the total time to complete n tasks is (n + k − 1) × t_p.
Speedup = (n × k × t_p) / ((n + k − 1) × t_p)
As n approaches infinity, (n + k − 1) approaches n, giving a theoretical maximum speedup of:
Speedup = (n × k × t_p) / (n × t_p) = k

16 Pipeline Performance Summary
Pipelining doesn't improve the latency of a single instruction
However, it improves the throughput of the entire workload
Instructions are initiated and completed at a higher rate
In a k-stage pipeline, k instructions operate in parallel
Overlapped execution uses multiple hardware resources
Potential speedup = number of pipeline stages k
Unbalanced lengths of pipeline stages reduce speedup
Pipeline rate is limited by the slowest pipeline stage
Also, time to fill and drain the pipeline reduces speedup

17 Pipeline Exercise
A nonpipelined system takes 200 ns to process a task. The same task can be processed in a 5-segment pipeline with a clock cycle of 40 ns. Determine the speedup ratio of the pipeline for 200 tasks.
Speedup = (n × k × t_p) / ((n + k − 1) × t_p) = (200 × 5 × 40) / ((200 + 5 − 1) × 40) ≈ 4.9019
What is the maximum speedup that could be achieved with the pipeline unit over the nonpipelined unit?
Speedup_max = k = 5
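The exercise's arithmetic can be reproduced directly (Python; variable names are mine):

```python
k, n, t_p = 5, 200, 40        # stages, tasks, pipeline clock cycle in ns

pipelined = (n + k - 1) * t_p     # 204 * 40 = 8160 ns for all 200 tasks
nonpipelined = n * 200            # 200 ns per task without the pipeline

print(round(nonpipelined / pipelined, 4))  # ~4.902, matching the slide
print(k)                                   # maximum speedup as n grows large
```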

18 Hazards
(Morgan Kaufmann Publishers, Chapter 4 — The Processor, 26 April, 2017)
Hazards: situations that cause incorrect execution
A bubble or pipeline stall is a delay in the execution of an instruction in an instruction pipeline in order to resolve a hazard.
Structural hazard: a required resource is busy
Data hazard: need to wait for a previous instruction to complete its data read/write
Control hazard: deciding on a control action depends on a previous instruction

19 A 4-Stage Pipeline
S1: fetch instruction
S2: decode and calculate effective address
S3: fetch operand
S4: execute instruction and store results

Time Period:       1   2   3   4   5   6
Instruction 1:     S1  S2  S3  S4
Instr 2 (branch):      S1  S2  S3  S4
Instruction 3:             S1  S2  S3  S4
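The stage/time-period table can be generated mechanically: absent hazards, instruction i occupies stage j during period i + j − 1. A sketch (Python; the helper name is mine):

```python
def schedule(n_instr, stages=("S1", "S2", "S3", "S4")):
    """Fill in the stage/time-period table for a hazard-free pipeline:
    instruction i (1-based) occupies stage j during period i + j - 1."""
    table = {}
    for i in range(1, n_instr + 1):
        for j, stage in enumerate(stages, start=1):
            table[(i, i + j - 1)] = stage
    return table

t = schedule(3)
print(t[(1, 4)])  # S4: instruction 1 completes in period 4
print(t[(3, 3)])  # S1: instruction 3 is only being fetched in period 3
```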

20 Structural Hazards
Conflict for the use of a resource
In an instruction pipeline with a single memory:
Example: one instruction is storing a value to memory while another is being fetched from memory; both need access to memory.
Pipelined datapaths therefore require separate instruction/data memories, or separate instruction/data caches.

21 Data Hazards
A dependency between instructions causes a data hazard when:
The dependent instructions are close to each other
Pipelined execution might change the order of operand access
Read After Write (RAW) Hazard:
Given two instructions I and J, where I comes before J, instruction J should read an operand after it is written by I.
Called a data dependence in compiler terminology.
I: X = Y + 3
J: Z = 2 * X
A hazard occurs when J reads the operand before I writes it.

22 Data Hazards

Time Period:  1            2            3        4
X = Y + 3:    fetch instr  decode       fetch Y  execute & store X
Z = 2 * X:                 fetch instr  decode   fetch X

The problem arises at time period 4: the second instruction needs to fetch X, but the first instruction does not store the result until its execution finishes, so X is not available at the beginning of the time period.
Solutions:
Stalling the pipeline
Data hazard detection: detecting instructions whose source operands are destinations for instructions farther up the pipeline
Forwarding: routing data through special paths that exist between various stages of the pipeline
Code scheduling
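The cost of the first two solutions can be compared with a back-of-envelope sketch on the slide's 4-stage pipeline (Python; the one-bubble stall count is my reading of the timeline, and the helper name is mine):

```python
# Cycle counts for the two-instruction RAW example on a 4-stage pipeline
# (S1 fetch, S2 decode, S3 fetch operand, S4 execute & store).

def total_cycles(n_instr, stalls, stages=4):
    """Ideal pipelined time (stages + n - 1) plus bubble cycles for hazards."""
    return stages + (n_instr - 1) + stalls

# Stalling: Z = 2 * X waits one cycle so it fetches X after X is stored.
print(total_cycles(2, stalls=1))  # 6
# Forwarding: X is routed straight from the first execute to the second.
print(total_cycles(2, stalls=0))  # 5
```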

23 MIPS Processor Pipeline
Five stages, one cycle per stage:
IF: Instruction Fetch from instruction memory
ID: Instruction Decode and register read
EX: Execute operation or calculate load/store address
MEM: Memory access for load and store
WB: Write Back result to register

24 Forwarding
Use a result as soon as it is computed
Don't wait for it to be stored in a register
Requires extra connections in the datapath

25 Code Scheduling to Avoid Stalls
Reorder code to avoid using a load result in the next instruction.
C code for A = B + E; C = B + F;

Original (13 cycles, two stalls):
lw   $t1, 0($t0)
lw   $t2, 4($t0)
add  $t3, $t1, $t2    # stall: waits for $t2
sw   $t3, 12($t0)
lw   $t4, 8($t0)
add  $t5, $t1, $t4    # stall: waits for $t4
sw   $t5, 16($t0)

Scheduled (11 cycles, no stalls):
lw   $t1, 0($t0)
lw   $t2, 4($t0)
lw   $t4, 8($t0)
add  $t3, $t1, $t2
sw   $t3, 12($t0)
add  $t5, $t1, $t4
sw   $t5, 16($t0)
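The 13- vs 11-cycle counts can be reproduced by counting load-use stalls, assuming the 5-stage pipeline above and one bubble whenever an instruction uses the register loaded by the immediately preceding lw (a sketch; the instruction encoding as tuples is mine):

```python
def mips_cycles(instrs, stages=5):
    """Cycles = stages + n - 1 plus one bubble per load-use hazard, i.e.
    whenever an instruction reads the destination of the preceding lw."""
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        if prev[0] == "lw" and prev[1] in cur[2:]:  # cur reads loaded reg
            stalls += 1
    return stages + len(instrs) - 1 + stalls

# Tuples: (opcode, destination, source registers...)
original = [
    ("lw", "$t1"), ("lw", "$t2"),
    ("add", "$t3", "$t1", "$t2"), ("sw", "$t3"),
    ("lw", "$t4"),
    ("add", "$t5", "$t1", "$t4"), ("sw", "$t5"),
]
scheduled = [
    ("lw", "$t1"), ("lw", "$t2"), ("lw", "$t4"),
    ("add", "$t3", "$t1", "$t2"), ("sw", "$t3"),
    ("add", "$t5", "$t1", "$t4"), ("sw", "$t5"),
]
print(mips_cycles(original))   # 13
print(mips_cycles(scheduled))  # 11
```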

26 Control Hazards
A branch determines the flow of control
Fetching the next instruction depends on the branch outcome
The pipeline can't always fetch the correct instruction; it may still be working on the ID stage of the branch

Time Period:       1   2   3   4   5   6   7   8   9
Instruction 1:     S1  S2  S3  S4
Instr 2 (branch):      S1  S2  S3  S4
Instruction 3:                         S1  S2  S3  S4
(The fetch of instruction 3 waits until the branch outcome is known.)

27 Branch Prediction
Predict the outcome of the branch; only stall if the prediction is wrong
Static branch prediction:
Based on typical branch behavior
Example: loop and if-statement branches
Dynamic branch prediction:
Hardware measures actual branch behavior, e.g., records the recent history of each branch
Assumes future behavior will continue the trend
When wrong, stall while re-fetching, and update the history
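A minimal sketch of the dynamic scheme: a single 2-bit saturating counter, a common textbook design (the counter width and function name are my assumptions, not from the slide):

```python
# 2-bit saturating-counter branch predictor: states 0-1 predict not-taken,
# states 2-3 predict taken; each actual outcome moves the counter up
# (taken) or down (not taken), recording recent branch history.

def predict_and_train(outcomes, state=2):
    """Return how many of the actual outcomes were predicted correctly."""
    correct = 0
    for taken in outcomes:
        prediction = state >= 2
        correct += (prediction == taken)
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct

# A loop branch: taken nine times, then falls through once.
history = [True] * 9 + [False]
print(predict_and_train(history))  # 9: only the final branch mispredicts
```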

