Real-World Pipelines: Car Washes

Real-World Pipelines: Car Washes
Sequential Parallel Pipelined Idea Divide process into independent stages Move objects through stages in sequence At any instant, multiple objects being processed

Administrivia HW5 due today Lab5 this week Next Tuesday is Monday
HW6 this week Lab5 this week Next Tuesday is Monday No lab next week (I will fix the website later) Quiz5 due next Thursday in class Midterm starts after class on Tuesday Feb 28 Open book/notes Download PDF from blackboard, work on paper Due 5PM, Mar 5. Drop off at CTB 265 to the secretary.

Computational Example
Combinational logic R e g 300 ps 20 ps Clock Delay/Latency = 320 ps Throughput = 3.12 GOPS System Computation requires total of 300 picoseconds Additional 20 picoseconds to save result in register Clock cycle must be at least 320 ps 1 operation 1000 ps Throughput = x  GOPS 320 ps 1 ns

3-Stage Pipelined Version
Clock Comb. logic A B C 100 ps 20 ps Delay/Latency = 360 ps Throughput = ??? System Divide combinational logic into 3 blocks of 100 ps each Can begin new operation as soon as previous one passes through stage A. Begin new operation every 120 ps

3-Stage Pipelined Version
Clock Comb. logic A B C 100 ps 20 ps Delay/Latency = 360 ps Throughput = 8.33 GOPS System Divide combinational logic into 3 blocks of 100 ps each Can begin new operation as soon as previous one passes through stage A. Begin new operation every 120 ps Performance impact Latency increases: 360 ps from start to finish (was 320) Throughput Speedup is less than 3x – why?

Pipelining Unpipelined 3-way pipelined
Cannot start new operation until previous one completes 3-way pipelined Up to 3 operations in process simultaneously Time OP1 OP2 OP3 Time A B C OP1 OP2 OP3

Operating a Pipeline Clock Comb. logic A B C 100 ps 20 ps 359 100 ps
300 R e g Clock Comb. logic A B C 100 ps 20 ps 239 R e g Clock Comb. logic A B C 100 ps 20 ps 241 Time OP1 OP2 OP3 A B C 120 240 360 480 640 Clock

Real-World Pipelines: Laundry
Assuming you’ve got: One washer (takes 30 minutes) One drier (takes 40 minutes) One “folder” (takes 20 minutes) It takes 90 minutes to wash, dry, and fold 1 load of laundry. How long does 4 loads take?

The slow way If each load is done sequentially it takes 6 hours 6 PM 7
8 9 10 11 Midnight Time 30 40 20 30 40 20 30 40 20 30 40 20 If each load is done sequentially it takes 6 hours

Laundry Pipelining Start each load as soon as possible: overlap loads
Pipelined laundry takes 3.5 hours (WHY NOT 3X?) 6 PM 7 8 9 10 11 Midnight Time 30 40 20

In a Perfect World Would like to divide the logic to uniform stages. R
Clock Comb. logic A B C 100 ps 20 ps Delay/Latency = 360 ps Throughput = 8.33 GOPS Would like to divide the logic to uniform stages.

Limitations: Nonuniform Delays
g Clock Comb. logic B C 50 ps 20 ps 150 ps 100 ps Delay = ??? Throughput = ??? A Time OP1 OP2 OP3 A B C What is the delay or time to execute one instruction? What is the throughput?

Limitations: Nonuniform Delays
g Clock Comb. logic B C 50 ps 20 ps 150 ps 100 ps Delay = 510 ps Throughput = 5.88 GOPS A Time OP1 OP2 OP3 A B C Throughput limited by slowest stage Other stages sit idle for much of the time Challenging to partition system into balanced stages

Limitations: Register Overhead
Delay = 420 ps, Throughput = GOPS Clock R e g Comb. logic 50 ps 20 ps As pipeline deepens, overhead of loading registers becomes more significant Percentage of clock cycle spent loading register: 1-stage pipeline: 6.25% 3-stage pipeline: % 6-stage pipeline: % High speeds of modern processor designs obtained through deep pipelining: high overhead

Data Dependencies Each operation depends on result from preceding one
Clock Combinational logic R e g Time OP1 OP2 OP3 Each operation depends on result from preceding one Not a problem for SEQ implementation

Limitations: Data Hazards
e g Clock Comb. logic A B C Time OP1 OP2 OP3 A B C OP4 Result does not feed back around in time for next operation Unless special action is taken, results will be incorrect!

Review: Arithmetic and Logical Operations
Instruction Code Function Code Add addl rA, rB 6 rA rB Refer to generically as “OPl” Encodings differ only by “function code” Low-order 4 bits in first instruction word All set condition codes as side effect Subtract (rA from rB) subl rA, rB 6 1 rA rB And andl rA, rB 6 2 rA rB Exclusive-Or xorl rA, rB 6 3 rA rB

Review: Move Operations
Register --> Register rrmovl rA, rB 2 rA rB Immediate --> Register irmovl V, rB 3 F rB V Register --> Memory rmmovl rA, D(rB) 4 rA rB D Memory --> Register mrmovl D(rB), rA 5 rA rB D Similar to the IA32 movl instruction Simpler format for memory addresses Separated into different instructions to simplify hardware implementation

Data Dependencies in Y86 Code
Time OP1 OP2 OP3 A B C OP4 1 irmovl $50, %eax 2 addl %eax , %ebx 3 mrmovl 100( %ebx ), %edx Result from one instruction used as operand for another Read-after-write (RAW) dependency Very common in actual programs Must make sure our pipeline handles these properly Essential: results must be correct Desirable: performance impact should be minimized

SEQ Hardware Stages occur in sequence
Instruction memory PC increment CC ALU Data New rB dstE dstM A B Addr srcA srcB read fun. Fetch Decode Execute Memory Write back data out Register file M E Cnd icode ifun rA valC valP valB valA valE valM newPC Stat dmem_error instr_valid imem_error stat SEQ Hardware Stages occur in sequence One operation in process at a time Mem. control write

Adding Pipeline Registers
PC increment CC ALU Data memory Fetch Decode Execute Memory Write back Register file A B M E d_srcA, d_srcB valA, valB aluA, aluB Cnd valE Addr, Data valM W_valE, W_valM, W_dstE, W_dstM W_icode, W_valM icode, ifun, rA, rB, valC W F D valP f_pc predPC Instruction M_icode, M_Cnd, M_valA Instruction memory PC increment CC ALU Data Fetch Decode Execute Memory Write back icode, ifun rA, rB valC Register file A B M E valP srcA, srcB dstE, dstM valA, valB aluA, aluB Cnd valE Addr, Data valM valE, valM newPC

Pipeline Stages Fetch Decode Execute Memory Write back
PC increment CC ALU Data memory Fetch Decode Execute Memory Write back Register file A B M E d_srcA, d_srcB valA, valB aluA, aluB Cnd valE Addr, Data valM W_valE, W_valM, W_dstE, W_dstM W_icode, W_valM icode, ifun, rA, rB, valC W F D valP f_pc predPC Instruction M_icode, M_Cnd, M_valA Fetch Select current PC Read instruction Compute incremented PC Decode Read program registers Execute Perform ALU operation Memory Read or write data memory Write back Update register file

PIPE- Hardware Forward (upward) paths
M W F D Instruction memory PC increment Register file ALU Data Select rB dstE dstM A B Mem. control Addr srcA srcB read write fun. Fetch Decode Execute Memory icode data out data in M_valA W_valM W_valE d_rvalA f_pc Predict valE valM Cnd valA ifun valC valB valP rA predPC CC d_srcB d_srcA e_Cnd M_Cnd stat imem_error instr_valid dmem_error m_stat W_stat M_stat E_stat D_stat f_stat Write back Stat PIPE- Hardware Pipeline registers hold intermediate values from instruction execution Note naming conventions Forward (upward) paths Values flow from one stage to next Values cannot skip stages; look at valC in decode dstE and dstM in execute, memory

PIPE- Example d_srcA d_srcB E_dstE e_valE M_dstE M_valE m_valM
W F D Instruction memory PC increment Register file ALU Data Select rB dstE dstM A B Mem. control Addr srcA srcB read write fun. Fetch Decode Execute Memory icode data out data in M_valA W_valM W_valE d_rvalA f_pc Predict valE valM Cnd valA ifun valC valB valP rA predPC CC d_srcB d_srcA e_Cnd M_Cnd stat imem_error instr_valid dmem_error m_stat W_stat M_stat E_stat D_stat f_stat Write back Stat PIPE- Example 1: mrmovl 4(%ebx), %ecx 2: addl %eax, %edx 3: addl %ecx, %edx d_srcA d_srcB E_dstE e_valE M_dstE M_valE m_valM

PIPE- Example At clock cycle 4, what are the values of d_srcA d_srcB
Instruction memory PC increment Register file ALU Data Select rB dstE dstM A B Mem. control Addr srcA srcB read write fun. Fetch Decode Execute Memory icode data out data in M_valA W_valM W_valE d_rvalA f_pc Predict valE valM Cnd valA ifun valC valB valP rA predPC CC d_srcB d_srcA e_Cnd M_Cnd stat imem_error instr_valid dmem_error m_stat W_stat M_stat E_stat D_stat f_stat Write back Stat PIPE- Example 1: mrmovl 4(%ebx), %ecx 2: addl %eax, %edx 3: addl %ecx, %edx At clock cycle 4, what are the values of d_srcA d_srcB E_dstE e_valE M_dstE M_valE m_valM Highlight the datapath in D, E, M stages.

Problems: Feedback Paths
W F D Instruction memory PC increment Register file ALU Data Select rB dstE dstM A B Mem. control Addr srcA srcB read write fun. Fetch Decode Execute Memory icode data out data in M_valA W_valM W_valE d_rvalA f_pc Predict valE valM Cnd valA ifun valC valB valP rA predPC CC d_srcB d_srcA e_Cnd M_Cnd stat imem_error instr_valid dmem_error m_stat W_stat M_stat E_stat D_stat f_stat Write back Stat Problems: Feedback Paths Predicted PC Guess value of next PC Branch information Jump taken/not-taken Fall-through or target address Return address Read from memory Register updates To register file write ports

Predicting the PC PC increment CC ALU Data memory Fetch Decode Execute Memory Write back Register file A B M E d_srcA, d_srcB valA, valB aluA, aluB Cnd valE Addr, Data valM W_valE, W_valM, W_dstE, W_dstM W_icode, W_valM icode, ifun, rA, rB, valC W F D valP f_pc predPC Instruction M_icode, M_Cnd, M_valA Pipeline timing requires next instruction fetch to begin one cycle after current instruction is fetched Not enough time to determine next instruction in all cases Can be done for all but conditional jumps, return Solution for conditional jumps: guess outcome + continue Recover later if prediction was incorrect

Simple Prediction Strategy
Instructions that don’t transfer control Next PC is always valP (no guess required) Call and unconditional jumps Next PC is always valC (no guess required) Conditional jumps Could predict next PC to be valC (branch target) Correct only if branch is taken (~60% of time) Could predict next PC to be valP (sequential successor) Correct only if branch not taken (~40% of time) Could predict taken if forward, not-taken if backward Correct ~65% of time Return instruction Don’t try to predict (Why? What would be required?)

Branch Misprediction Example
0x000: xorl %eax,%eax 0x002: jne t # Not taken 0x007: irmovl $1, %eax # Fall through 0x00d: nop 0x00e: nop 0x00f: nop 0x010: halt 0x011: t: irmovl $2, %edx # Target (Should not execute) 0x017: irmovl $3, %ebx # Should not execute 0x01d: irmovl $4, %edx # Should not execute Should execute only first 7 instructions Our predictor will guess the branch will be taken What must be done to recover?

Handling Misprediction: Desired Behavior
Detection Branch is in execute, and it was not taken Two incorrect instructions follow branch in pipe Action On next cycle, replace mispredicted instructions with “bubbles”

Return Example 0x000: irmovl Stack,%esp # Initialize stack pointer 0x006: call p # Procedure call 0x00b: irmovl $5,%esi # Return point 0x011: halt 0x020: .pos 0x20 0x020: p: irmovl $-1,%edi # procedure 0x026: ret 0x027: irmovl $1,%eax # Should not be executed 0x02d: irmovl $2,%ecx # Should not be executed 0x033: irmovl $3,%edx # Should not be executed 0x039: irmovl $4,%ebx # Should not be executed 0x100: .pos 0x100 0x100: Stack: # Stack: Stack pointer Without special handling we would fetch three more instructions after ret before we know return address Instead, we’ll freeze the pipe when ret is fetched

Handling Return: Desired Behavior
# demo - retb 0x026: ret F D E M W bubble F D E M W bubble F D E M W bubble F D E M W 0x00b: irmovl $5,% esi # Return F F D D E E M M W W Detection ret is in decode, execute or memory Address of next instruction unavailable Action Fill pipe with bubbles until ret’s memory access completes W valM = 0x0b • F F valC valC f f 5 5 rB rB f f % % esi esi

Pipeline Operation F D E M W Cycle 5 I1 I2 I3 I4 I5 irmovl $1,%eax #I1
6 7 8 9 F D E M W irmovl $2,%ecx #I2 irmovl $3,%edx #I3 irmovl $4,%ebx #I4 halt #I5 Cycle 5 I1 I2 I3 I4 I5

Problems with PIPE-: Example with 3 Nops
irmovl $10,% edx 1 2 3 4 5 6 7 8 9 F D E M W 0x006: $3,% eax 0x00c: nop 0x00d: 0x00e: 0x00f: addl % ,% 10 R[ ] f valA = valB # demo-h3.ys Cycle 6 11 0x011: halt Cycle 7

Example: 2 Nops F D E M W • Cycle 6 0x000: irmovl $10,% edx 0x006:
3 4 5 6 7 8 9 F D E M W 0x006: $3,% eax 0x00c: nop 0x00d: 0x00e: addl % ,% 0x010: halt 10 # demo-h2.ys R[ ] f valA = valB • Cycle 6 Error

Example: 1 Nop F D E M W • Cycle 5 0x000: irmovl $10,% edx 0x006: $3,%
2 3 4 5 6 7 8 9 F D E M W 0x006: $3,% eax 0x00c: nop 0x00d: addl % ,% 0x00f: halt # demo-h1.ys R[ ] f 10 valA = valB • Cycle 5 Error M_ valE = 3 dstE

Example: 0 Nops F D E M W Cycle 4 0x000: irmovl $10,% edx 0x006: $3,%
2 3 4 5 6 7 8 F D E M W 0x006: $3,% eax 0x00c: addl % ,% 0x00e: halt # demo-h0.ys valA f R[ ] = valB Cycle 4 Error M_ valE = 10 dstE e_ 0 + 3 = 3 E_

Hazard Solution 1: Stalling
2 3 4 5 6 7 8 9 10 11 # demo-h2.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W 0x00c: nop F D E M W 0x00d: nop F D E M W bubble F E M W 0x00e: addl %edx,%eax D D E M W 0x010: halt F F D E M W If instruction that reads register follows too closely after instruction that writes same register, delay reading instruction Delay or stall always takes place in decode stage Following instruction also delayed in fetch, but just one “stall cycle” tallied Stall = dynamically injecting nop (bubble) into execute stage Hazard  we get wrong answer without special action

Stalls Detection Action Write is pending for a src register
M W F D Instruction memory PC increment Register file ALU Data Select rB dstE dstM A B Mem. control Addr srcA srcB read write fun. Fetch Decode Execute Memory icode data out data in M_valA W_valM W_valE d_rvalA f_pc Predict valE valM Cnd valA ifun valC valB valP rA predPC CC d_srcB d_srcA e_Cnd M_Cnd stat imem_error instr_valid dmem_error m_stat W_stat M_stat E_stat D_stat f_stat Write back Stat Stalls Detection Write is pending for a src register srcA or srcB (not 0xF) matches dstE or dstM in execute, memory, or write-back stages Action Stall instruction in decode 1: addl %eax, %edx 2: mrmovl 4(%ebx), %ecx 3: addl %ecx, %edx

Detecting Stall Condition
1 2 3 4 5 6 7 8 9 10 11 # demo-h2.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W 0x00c: nop F D E M W 0x00d: nop F D E M W 0x00e: addl %edx,%eax F D E M W 0x010: halt F D E M W Cycle 6 W D • W_dstE = %eax W_valE = 3 srcA = %edx srcB = %eax

Detecting Stall Condition
1 2 3 4 5 6 7 8 9 10 11 # demo-h2.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W 0x00c: nop F D E M W 0x00d: nop F D E M W bubble F E M W 0x00e: addl %edx,%eax D D E M W 0x010: halt F F D E M W Cycle 6 W D • W_dstE = %eax W_valE = 3 srcA = %edx srcB = %eax

Multi-cycle Stalls F D E M W F D E M W E M W E M W E M W F D D D D E M
1 2 3 4 5 6 7 8 9 10 11 # demo-h0.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W bubble E M W bubble E M W bubble E M W 0x00c: addl %edx,%eax F D D D D E M W 0x00e: halt F F F F D E M W Cycle 4 D srcA = %edx srcB = %eax

1 2 3 4 5 6 7 8 9 10 11 # demo-h0.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W bubble E M W bubble E M W bubble E M W 0x00c: addl %edx,%eax F D D D D E M W 0x00e: halt F F F F D E M W Cycle 4 E E_dstE = %eax D srcA = %edx srcB = %eax

1 2 3 4 5 6 7 8 9 10 11 # demo-h0.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W bubble E M W bubble E M W bubble E M W 0x00c: addl %edx,%eax F D D D D E M W 0x00e: halt F F F F D E M W Cycle 5 Cycle 4 E E_dstE = %eax • D srcA = %edx srcB = %eax D srcA = %edx srcB = %eax

1 2 3 4 5 6 7 8 9 10 11 # demo-h0.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W bubble E M W bubble E M W bubble E M W 0x00c: addl %edx,%eax F D D D D E M W 0x00e: halt F F F F D E M W Cycle 5 M M_dstE = %eax Cycle 4 E E_dstE = %eax • D srcA = %edx srcB = %eax D srcA = %edx srcB = %eax

1 2 3 4 5 6 7 8 9 10 11 # demo-h0.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W bubble E M W bubble E M W bubble E M W 0x00c: addl %edx,%eax F D D D D E M W 0x00e: halt F F F F D E M W Cycle 6 Cycle 5 M M_dstE = %eax Cycle 4 • E E_dstE = %eax • D srcA = %edx srcB = %eax D srcA = %edx srcB = %eax D srcA = %edx srcB = %eax

1 2 3 4 5 6 7 8 9 10 11 # demo-h0.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W bubble E M W bubble E M W bubble E M W 0x00c: addl %edx,%eax F D D D D E M W 0x00e: halt F F F F D E M W Cycle 6 W W_dstE = %eax Cycle 5 M M_dstE = %eax Cycle 4 • E E_dstE = %eax • D srcA = %edx srcB = %eax D srcA = %edx srcB = %eax D srcA = %edx srcB = %eax

Stalling: Another View
0x006: irmovl $3,%eax bubble 0x00c: addl %edx,%eax Cycle 6 0x00e: halt bubble Cycle 8 0x00c: addl %edx,%eax 0x00e: halt bubble 0x00c: addl %edx,%eax Cycle 7 0x00e: halt 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax bubble 0x00c: addl %edx,%eax Cycle 5 0x00e: halt 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax Cycle 4 0x00e: halt 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax # demo-h0.ys 0x00e: halt Write Back Memory Execute Decode Fetch Operational rules: Stalling instruction held back in decode stage Following instruction stays in fetch stage Bubbles injected into execute stage Like dynamically generated nop’s Move through stage-by-stage as regular instructions

Summary Pipeline Hazards Data Hazards Solution #1: Stalling
Control Hazard jmp, ret Solution #1: Simple Prediction

Real-World Pipelines: Car Washes

Similar presentations

Presentation on theme: "Real-World Pipelines: Car Washes"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Real-World Pipelines: Car Washes

Similar presentations

Presentation on theme: "Real-World Pipelines: Car Washes"— Presentation transcript:

Similar presentations

About project

Feedback