# Real-World Pipelines: Car Washes


Real-World Pipelines: Car Washes

Idea
- Divide process into independent stages
- Move objects through stages in sequence
- At any instant, multiple objects being processed

[Figure: three car-wash layouts labeled Sequential, Parallel, and Pipelined]

Computational Example

System
- Computation requires a total of 300 picoseconds
- Additional 20 picoseconds to save the result in a register
- Clock cycle must be at least 320 ps

Throughput = (1 operation / 320 ps) × (1000 ps / 1 ns) ≈ 3.12 GOPS

[Figure: combinational logic (300 ps) feeding a register (20 ps), driven by a clock; Delay = 320 ps, Throughput = 3.12 GOPS]

3-Stage Pipelined Version

System
- Divide combinational logic into 3 blocks of 100 ps each
- Can begin a new operation as soon as the previous one passes through stage A
- Begin a new operation every 120 ps

Performance impact
- Latency increases: 360 ps from start to finish (was 320 ps)
- Speedup is less than 3x. Why?

[Figure: Comb. logic A, B, and C (100 ps each), each followed by a register (20 ps), driven by a common clock; Delay = 360 ps, Throughput = 8.33 GOPS]
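The delay and throughput figures for the unpipelined and 3-stage versions can be checked with a short sketch. The function name `pipeline_metrics` and its parameters are illustrative and not from the slides; the sketch assumes every stage is followed by one register and that the clock period equals the slowest stage delay plus the register delay.

```python
def pipeline_metrics(stage_delays_ps, reg_delay_ps=20):
    """Return (clock_ps, latency_ps, throughput_gops) for a register-separated pipeline.

    Assumption: the clock period is the slowest stage delay plus the register delay.
    """
    clock_ps = max(stage_delays_ps) + reg_delay_ps
    latency_ps = clock_ps * len(stage_delays_ps)
    throughput_gops = 1000.0 / clock_ps  # operations per ns, i.e. giga-operations per second
    return clock_ps, latency_ps, throughput_gops


unpipelined = pipeline_metrics([300])
three_stage = pipeline_metrics([100, 100, 100])
speedup = three_stage[2] / unpipelined[2]

print(unpipelined)
print(three_stage)
print(round(speedup, 2))
```

Under these assumptions the speedup is 320/120, about 2.67 rather than 3, because each of the three stages still pays the full 20 ps register delay.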

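Pipelining

Unpipelined
- Cannot start a new operation until the previous one completes

3-way pipelined
- Up to 3 operations in process simultaneously

[Figure: timing diagrams for OP1, OP2, OP3; unpipelined operations run back to back, pipelined operations overlap while each passes through stages A, B, C]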

Operating a Pipeline

[Figure: snapshots of the 3-stage pipeline (Comb. logic A, B, C at 100 ps each, registers at 20 ps each) at times 239, 241, 300, and 359 ps while OP1, OP2, and OP3 move through stages A, B, C; time axis marked 0, 120, 240, 360, 480, 640]

Real-World Pipelines: Laundry

Assuming you've got:
- One washer (takes 30 minutes)
- One drier (takes 40 minutes)
- One "folder" (takes 20 minutes)

It takes 90 minutes to wash, dry, and fold 1 load of laundry.
- How long do 4 loads take?

The slow way

If each load is done sequentially, it takes 6 hours.

[Figure: timeline from 6 PM to midnight; four loads run back to back, each taking 30 + 40 + 20 minutes]

Laundry Pipelining

Start each load as soon as possible: overlap loads.

Pipelined laundry takes 3.5 hours.

[Figure: timeline from 6 PM to midnight; overlapped loads with stage times of 30, 40, and 20 minutes]
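The 6-hour and 3.5-hour totals can be reproduced with a small sketch. The function names are illustrative; the sketch assumes one machine per stage and that a finished load can wait (for example in a basket) until the next machine is free.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes all stages before the next load starts."""
    return sum(stage_minutes) * loads


def pipelined_minutes(stage_minutes, loads):
    """Each load enters a stage as soon as both the load and the machine are ready."""
    done = [0] * len(stage_minutes)  # time at which each machine finished its latest load
    for _ in range(loads):
        ready = 0  # time at which the current load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, done[s])
            done[s] = start + minutes
            ready = done[s]
    return done[-1]


laundry = [30, 40, 20]  # washer, drier, folder
print(sequential_minutes(laundry, 4) / 60)  # hours
print(pipelined_minutes(laundry, 4) / 60)   # hours
```

The pipelined total works out to 30 + 4 × 40 + 20 = 210 minutes: once the pipeline fills, the 40-minute drier sets the pace.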

Limitations: Nonuniform Delays
- Throughput limited by slowest stage
- Other stages sit idle for much of the time
- Challenging to partition system into balanced stages

[Figure: Comb. logic A (50 ps), B (150 ps), and C (100 ps), each followed by a register (20 ps); Delay = 510 ps, Throughput = 5.88 GOPS; timing diagram of OP1, OP2, OP3 through stages A, B, C]
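To put a number on how much the faster stages idle, the sketch below computes the clock period, latency, throughput, and the idle fraction of each stage for the 50/150/100 ps partition. The function name and the definition of idle fraction (the share of each clock period in which a stage and its register have nothing left to do) are assumptions made for illustration.

```python
def nonuniform_metrics(stage_delays_ps, reg_delay_ps=20):
    """stage_delays_ps maps a stage name to its combinational delay in ps."""
    clock_ps = max(stage_delays_ps.values()) + reg_delay_ps  # slowest stage sets the clock
    latency_ps = clock_ps * len(stage_delays_ps)
    throughput_gops = 1000.0 / clock_ps
    idle = {
        name: 1 - (delay + reg_delay_ps) / clock_ps
        for name, delay in stage_delays_ps.items()
    }
    return clock_ps, latency_ps, throughput_gops, idle


clock_ps, latency_ps, throughput_gops, idle = nonuniform_metrics({"A": 50, "B": 150, "C": 100})
print(clock_ps, latency_ps, round(throughput_gops, 2))
for name, fraction in idle.items():
    print(name, f"{fraction:.1%}")
```

With these numbers, stage A is idle for well over half of every cycle while stage B is never idle, which is why balanced partitioning matters.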

Limitations: Register Overhead
- As pipeline deepens, overhead of loading registers becomes more significant
- Percentage of clock cycle spent loading register:
  - 1-stage pipeline: 6.25%
  - 3-stage pipeline: 16.67%
  - 6-stage pipeline: 28.57%
- High speeds of modern processor designs obtained through deep pipelining: high overhead

[Figure: six stages of Comb. logic (50 ps each), each followed by a register (20 ps); Delay = 420 ps, Throughput = 14.29 GOPS]
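The percentages above follow from dividing the 20 ps register delay by the clock period. A minimal sketch, assuming the 300 ps of logic splits evenly across the stages (the function name is illustrative):

```python
def register_overhead(total_logic_ps, stages, reg_delay_ps=20):
    """Fraction of each clock cycle spent loading the register, assuming balanced stages."""
    clock_ps = total_logic_ps / stages + reg_delay_ps
    return reg_delay_ps / clock_ps


for stages in (1, 3, 6):
    print(stages, f"{register_overhead(300, stages):.2%}")
```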

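Data Dependencies
- Each operation depends on result from preceding one
- Not a problem for SEQ implementation

[Figure: combinational logic and register driven by a clock, with the register output fed back to the logic; timing diagram of OP1, OP2, OP3 executed one after another]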

Limitations: Data Hazards
- Result does not feed back around in time for next operation
- Unless special action is taken, results will be incorrect!

[Figure: 3-stage pipeline (Comb. logic A, B, C with registers) with the final result fed back to the first stage; timing diagram of OP1 through OP4 overlapping in stages A, B, C]

Review: Arithmetic and Logical Operations
- Refer to generically as "OPl"
- Encodings differ only by "function code"
  - Low-order 4 bits in first instruction word
- All set condition codes as side effect

| Instruction | Operation | Instruction code | Function code | Register byte |
|---|---|---|---|---|
| addl rA, rB | Add | 6 | 0 | rA rB |
| subl rA, rB | Subtract (rA from rB) | 6 | 1 | rA rB |
| andl rA, rB | And | 6 | 2 | rA rB |
| xorl rA, rB | Exclusive-Or | 6 | 3 | rA rB |
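As a sketch of how the instruction code, function code, and register fields pack into two bytes, the snippet below encodes an OPl instruction. The function name is hypothetical. The register numbering (%eax = 0, %ecx = 1, %edx = 2, %ebx = 3, and so on) is the usual Y86 convention and is not stated on the slide, so treat it as an assumption.

```python
# Assumed Y86 register numbering; not shown on the slide.
REGISTER_IDS = {
    "%eax": 0, "%ecx": 1, "%edx": 2, "%ebx": 3,
    "%esp": 4, "%ebp": 5, "%esi": 6, "%edi": 7,
}
FUNCTION_CODES = {"addl": 0, "subl": 1, "andl": 2, "xorl": 3}
OPL_ICODE = 0x6


def encode_opl(mnemonic, ra, rb):
    """Encode 'mnemonic rA, rB' as two bytes: icode:ifun, then rA:rB."""
    byte0 = (OPL_ICODE << 4) | FUNCTION_CODES[mnemonic]
    byte1 = (REGISTER_IDS[ra] << 4) | REGISTER_IDS[rb]
    return bytes([byte0, byte1])


print(encode_opl("addl", "%edx", "%eax").hex())  # 6020
print(encode_opl("xorl", "%eax", "%eax").hex())  # 6300
```

The two-byte length matches the listings later in the deck, where an `addl` at 0x00c is followed by the next instruction at 0x00e.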

Review: Move Operations
- Similar to the IA32 movl instruction
- Simpler format for memory addresses
- Separated into different instructions to simplify hardware implementation

| Instruction | Transfer | Encoding |
|---|---|---|
| rrmovl rA, rB | Register --> Register | 2 0 rA rB |
| irmovl V, rB | Immediate --> Register | 3 0 F rB V |
| rmmovl rA, D(rB) | Register --> Memory | 4 0 rA rB D |
| mrmovl D(rB), rA | Memory --> Register | 5 0 rA rB D |

Data Dependencies in Y86 Code
- Result from one instruction used as operand for another
  - Read-after-write (RAW) dependency
- Very common in actual programs
- Must make sure our pipeline handles these properly
  - Essential: results must be correct
  - Desirable: performance impact should be minimized

Code:

    irmovl $50, %eax
    addl %eax, %ebx
    mrmovl 100(%ebx), %edx
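The RAW dependencies in the three-instruction example can be found mechanically. In the sketch below, the `Instr` type, the function name, and the hand-written read and write sets are illustrative assumptions; a real tool would derive those sets by decoding each instruction.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str
    reads: frozenset   # registers read as operands
    writes: frozenset  # registers written as results


def raw_dependencies(program):
    """Return (producer_index, consumer_index, register) for each RAW dependency.

    Only the most recent writer of each register is reported.
    """
    last_writer = {}
    deps = []
    for i, ins in enumerate(program):
        for reg in sorted(ins.reads):
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        for reg in ins.writes:
            last_writer[reg] = i
    return deps


program = [
    Instr("irmovl $50, %eax", frozenset(), frozenset({"%eax"})),
    Instr("addl %eax, %ebx", frozenset({"%eax", "%ebx"}), frozenset({"%ebx"})),
    Instr("mrmovl 100(%ebx), %edx", frozenset({"%ebx"}), frozenset({"%edx"})),
]
print(raw_dependencies(program))
```

For this program the sketch reports two dependencies: `addl` needs the `%eax` written by `irmovl`, and `mrmovl` needs the `%ebx` written by `addl`.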

SEQ Hardware
- Stages occur in sequence
- One operation in process at a time

[Figure: SEQ hardware datapath]

SEQ+ Hardware
- Still sequential implementation
- Reorder PC stage to put at beginning

PC stage
- Task is to select PC for current instruction
- Based on results computed by previous instruction

Processor state
- PC is no longer stored in register
- PC is determined from other stored information

[Figure: SEQ+ hardware datapath with the PC stage ahead of the Fetch, Decode, Execute, Memory, and Write back stages; the PC is selected from the stored values pIcode, pCnd, pValM, pValC, and pValP]

Adding Pipeline Registers

[Figure: the SEQ+ datapath shown before and after inserting pipeline registers F, D, E, M, and W between the Fetch, Decode, Execute, Memory, and Write back stages; signals include f_pc, predPC, valP, d_srcA, d_srcB, valA, valB, aluA, aluB, Cnd, valE, valM, M_icode, M_Cnd, M_valA, W_icode, W_valE, W_valM, W_dstE, W_dstM]

Pipeline Stages

Fetch
- Select current PC
- Read instruction
- Compute incremented PC

Decode
- Read program registers

Execute
- Perform ALU operation

Memory
- Read or write data memory

Write back
- Update register file

[Figure: pipelined datapath with pipeline registers F, D, E, M, W and the Fetch, Decode, Execute, Memory, and Write back stages]

PIPE- Hardware
- Pipeline registers hold intermediate values from instruction execution
- Note naming conventions

Forward (upward) paths
- Values flow from one stage to next
- Values cannot skip stages; look at
  - valC in decode
  - dstE and dstM in execute, memory

PIPE- Example

At clock cycle 4, what are the values of:
- d_srcA
- d_srcB
- E_dstE
- e_valE
- M_dstE
- M_valE
- m_valM

Highlight the datapath in D, E, M stages.

Code:

    mrmovl 4(%ebx), %ecx
    addl %eax, %edx
    addl %ecx, %edx

Problems: Feedback Paths

Predicted PC
- Guess value of next PC

Branch information
- Jump taken/not-taken
- Fall-through or target address

Return address
- Read from memory

Register updates
- To register file write ports

Predicting the PC

Pipeline timing requires the next instruction fetch to begin one cycle after the current instruction is fetched
- Not enough time to determine the next instruction in all cases
- The next PC can be determined for all instructions except conditional jumps and return

Solution for conditional jumps: guess the outcome and continue
- Recover later if the prediction was incorrect

[Figure: pipelined datapath with pipeline registers F, D, E, M, W; predPC in register F and f_pc selecting the fetch address]

Simple Prediction Strategy

Instructions that don't transfer control
- Next PC is always valP (no guess required)

Call and unconditional jumps
- Next PC is always valC (no guess required)

Conditional jumps
- Could predict next PC to be valC (branch target)
  - Correct only if branch is taken (~60% of time)
- Could predict next PC to be valP (sequential successor)
  - Correct only if branch not taken (~40% of time)
- Could predict taken if backward, not-taken if forward
  - Correct ~65% of time

Return instruction
- Don't try to predict (Why? What would be required?)
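The selection logic for the two simplest strategies can be sketched as a single function. The string instruction classes (`"jmp"`, `"call"`, `"jXX"`, `"ret"`), the function name, and the `strategy` parameter are all hypothetical stand-ins for the hardware's instruction codes; the direction-based strategy is left out to keep the sketch short.

```python
def predict_pc(kind, val_c, val_p, strategy="always_taken"):
    """Return the predicted next PC, or None when no prediction is attempted.

    kind   -- hypothetical instruction class: "jmp", "call", "jXX", "ret", or anything else
    val_c  -- constant word from the instruction (jump or call target)
    val_p  -- address of the sequentially following instruction
    """
    if kind in ("jmp", "call"):
        return val_c  # target is known, no guess required
    if kind == "jXX":
        if strategy == "always_taken":
            return val_c
        if strategy == "never_taken":
            return val_p
        raise ValueError(f"unknown strategy: {strategy}")
    if kind == "ret":
        return None  # return address comes from memory, so do not predict
    return val_p  # all other instructions fall through


# A conditional jump at 0x002 with target 0x011 and fall-through address 0x007.
print(hex(predict_pc("jXX", 0x011, 0x007)))
print(hex(predict_pc("jXX", 0x011, 0x007, strategy="never_taken")))
print(predict_pc("ret", 0, 0x027))
```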

Branch Misprediction Example
- Should execute only the first 7 instructions
- Our predictor will guess that the branch will be taken
- What must be done to recover?

Code:

    0x000:    xorl %eax,%eax
    0x002:    jne t             # Not taken
    0x007:    irmovl $1, %eax   # Fall through
    0x00d:    nop
    0x00e:    nop
    0x00f:    nop
    0x010:    halt
    0x011: t: irmovl $2, %edx   # Target (Should not execute)
    0x017:    irmovl $3, %ebx   # Should not execute
    0x01d:    irmovl $4, %edx   # Should not execute

Handling Misprediction: Desired Behavior

Detection
- Branch is in execute, and it was not taken
- Two incorrect instructions follow branch in pipe

Action
- On next cycle, replace mispredicted instructions with "bubbles"

Return Example
- Without special handling we would fetch three more instructions after ret before we know the return address
- Instead, we'll freeze the pipe when ret is fetched

Code:

    0x000:    irmovl Stack,%esp   # Initialize stack pointer
    0x006:    call p              # Procedure call
    0x00b:    irmovl $5,%esi      # Return point
    0x011:    halt
    0x020: .pos 0x20
    0x020: p: irmovl $-1,%edi     # procedure
    0x026:    ret
    0x027:    irmovl $1,%eax      # Should not be executed
    0x02d:    irmovl $2,%ecx      # Should not be executed
    0x033:    irmovl $3,%edx      # Should not be executed
    0x039:    irmovl $4,%ebx      # Should not be executed
    0x100: .pos 0x100
    0x100: Stack:                 # Stack: Stack pointer

Handling Return: Desired Behavior

Detection
- ret is in decode, execute or memory
- Address of next instruction unavailable

Action
- Fill pipe with bubbles until ret's memory access completes

[Figure: pipeline diagram for demo-retb; ret at 0x026 is followed by bubbles, then the return point 0x00b: irmovl \$5,%esi is fetched; annotations: W valM = 0x0b, F valC ← 5, rB ← %esi]

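Pipeline Operation

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| irmovl \$1,%eax #I1 | F | D | E | M | W | | | | |
| irmovl \$2,%ecx #I2 | | F | D | E | M | W | | | |
| irmovl \$3,%edx #I3 | | | F | D | E | M | W | | |
| irmovl \$4,%ebx #I4 | | | | F | D | E | M | W | |
| halt #I5 | | | | | F | D | E | M | W |

Cycle 5: W holds I1, M holds I2, E holds I3, D holds I4, F holds I5.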
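When nothing stalls, the stage an instruction occupies depends only on how many cycles have passed since it was fetched. A minimal sketch of that rule, with illustrative function names and 1-based instruction and cycle numbers as in the diagram:

```python
STAGES = "FDEMW"


def stage_of(instr_number, cycle):
    """Stage occupied by instruction I<instr_number> in the given cycle, or None.

    Assumes one instruction is fetched per cycle and the pipeline never stalls.
    """
    offset = cycle - instr_number
    return STAGES[offset] if 0 <= offset < len(STAGES) else None


def snapshot(cycle, instruction_count):
    """Map each occupied stage to the instruction it holds in the given cycle."""
    return {
        stage_of(i, cycle): f"I{i}"
        for i in range(1, instruction_count + 1)
        if stage_of(i, cycle) is not None
    }


print(snapshot(5, 5))
```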

Problems with PIPE-: Example with 3 Nops

Example: 2 Nops

Example: 1 Nop

Example: 0 Nops

Code (demo-h0.ys):

    0x000: irmovl $10,%edx
    0x006: irmovl $3,%eax
    0x00c: addl %edx,%eax
    0x00e: halt

Cycle 4
- M stage: M_valE = 10, M_dstE = %edx
- E stage: e_valE ← 0 + 3 = 3, E_dstE = %eax
- D stage: valA ← R[%edx] = 0, valB ← R[%eax] = 0 (Error)

[Figure: pipeline diagram over cycles 1 to 8; each instruction passes through F, D, E, M, W one cycle behind its predecessor]

Hazard Solution 1: Stalling
- If an instruction that reads a register follows too closely after an instruction that writes the same register, delay the reading instruction
- Delay or stall always takes place in decode stage
- Following instruction is also delayed in fetch, but just one "stall cycle" is tallied
- Stall = dynamically injecting a nop (bubble) into execute stage
- Hazard: we get the wrong answer without special action

Code (demo-h2.ys):

    0x000: irmovl $10,%edx
    0x006: irmovl $3,%eax
    0x00c: nop
    0x00d: nop
    0x00e: addl %edx,%eax
    0x010: halt

[Figure: pipeline diagram over cycles 1 to 11; addl spends two cycles in decode (D D E M W), a bubble is injected into execute, and halt spends two cycles in fetch (F F D E M W)]

Stalls

Detection
- Write is pending for a src register
- srcA or srcB (not 0xF) matches dstE or dstM in execute, memory, or write-back stages

Action
- Stall instruction in decode

Code:

    addl %eax, %edx
    mrmovl 4(%ebx), %ecx
    addl %ecx, %edx
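The detection rule can be written as a small predicate. This is a sketch rather than the actual control logic: the function name and the flat list of pending destinations are illustrative, and the register numbers (%eax = 0, %ecx = 1, %edx = 2, %ebx = 3) follow the usual Y86 convention, which the slide does not spell out.

```python
RNONE = 0xF  # register ID meaning "no register"
EAX, ECX, EDX, EBX = 0, 1, 2, 3  # assumed Y86 register IDs


def needs_stall(src_a, src_b, pending_dsts):
    """True if a decode-stage source matches a pending destination.

    pending_dsts -- the dstE and dstM values held in the execute, memory,
                    and write-back stages (RNONE entries are ignored)
    """
    pending = {dst for dst in pending_dsts if dst != RNONE}
    return any(src != RNONE and src in pending for src in (src_a, src_b))


# Third instruction (addl %ecx, %edx) in decode, while
# mrmovl 4(%ebx), %ecx is in execute (dstM = %ecx) and
# addl %eax, %edx is in memory (dstE = %edx); write back is empty.
pending = [RNONE, ECX, EDX, RNONE, RNONE, RNONE]
print(needs_stall(ECX, EDX, pending))
```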


Detecting Stall Condition

Code (demo-h2.ys):

    0x000: irmovl $10,%edx
    0x006: irmovl $3,%eax
    0x00c: nop
    0x00d: nop
    0x00e: addl %edx,%eax
    0x010: halt

Cycle 6
- W stage: W_dstE = %eax, W_valE = 3
- D stage: srcA = %edx, srcB = %eax

[Figure: pipeline diagram over cycles 1 to 11; in cycle 6 addl is in decode while irmovl \$3,%eax is in write back, so a bubble is injected into execute and addl repeats decode]

Multi-cycle Stalls

Code (demo-h0.ys):

    0x000: irmovl $10,%edx
    0x006: irmovl $3,%eax
    0x00c: addl %edx,%eax
    0x00e: halt

While addl is held in decode (srcA = %edx, srcB = %eax), the pending write to %eax moves down the pipeline:
- Cycle 4: E stage, E_dstE = %eax
- Cycle 5: M stage, M_dstE = %eax
- Cycle 6: W stage, W_dstE = %eax

[Figure: pipeline diagram over cycles 1 to 11; bubbles are injected into execute while addl repeats decode and halt repeats fetch]

Stalling: Another View

Operational rules:
- Stalling instruction held back in decode stage
- Following instruction stays in fetch stage
- Bubbles injected into execute stage
  - Like dynamically generated nops
  - Move through stage-by-stage as regular instructions

Code (demo-h0.ys):

    0x000: irmovl $10,%edx
    0x006: irmovl $3,%eax
    0x00c: addl %edx,%eax
    0x00e: halt

| Stage | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 |
|---|---|---|---|---|---|
| Write Back | | 0x000: irmovl \$10,%edx | 0x006: irmovl \$3,%eax | bubble | bubble |
| Memory | 0x000: irmovl \$10,%edx | 0x006: irmovl \$3,%eax | bubble | bubble | bubble |
| Execute | 0x006: irmovl \$3,%eax | bubble | bubble | bubble | 0x00c: addl %edx,%eax |
| Decode | 0x00c: addl %edx,%eax | 0x00c: addl %edx,%eax | 0x00c: addl %edx,%eax | 0x00c: addl %edx,%eax | 0x00e: halt |
| Fetch | 0x00e: halt | 0x00e: halt | 0x00e: halt | 0x00e: halt | |
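The operational rules can be exercised with a small cycle-by-cycle simulation. This is a sketch under simplifying assumptions, not the PIPE- control logic itself: there is no forwarding, there are no control hazards, one instruction is fetched per cycle, and each instruction is described by hand-written sets of registers it reads and writes. The function name and the tuple layout are illustrative.

```python
def simulate_stalling(program):
    """Simulate a 5-stage pipeline that resolves data hazards by stalling only.

    program -- list of (text, reads, writes); reads and writes are sets of register names
    Returns (total_cycles, bubbles_injected).
    """
    pc = 0                # index of the next instruction to fetch
    d = e = m = w = None  # instruction index held in each stage (None = empty or bubble)
    cycle = 0
    bubbles = 0
    while pc < len(program) or any(x is not None for x in (d, e, m, w)):
        cycle += 1
        fetched = pc if pc < len(program) else None
        # Hazard: decode reads a register that an instruction in E, M, or W will write.
        stall = d is not None and any(
            x is not None and program[d][1] & program[x][2] for x in (e, m, w)
        )
        w, m = m, e  # memory and write back always advance
        if stall:
            e = None  # inject a bubble; decode and fetch hold their instructions
            bubbles += 1
        else:
            e, d = d, fetched
            if fetched is not None:
                pc += 1
    return cycle, bubbles


NOP = ("nop", set(), set())
HEAD = [("irmovl $10,%edx", set(), {"%edx"}), ("irmovl $3,%eax", set(), {"%eax"})]
TAIL = [("addl %edx,%eax", {"%edx", "%eax"}, {"%eax"}), ("halt", set(), set())]

for nops in range(4):
    print(nops, simulate_stalling(HEAD + [NOP] * nops + TAIL))
```

Under these assumptions every variant finishes in 11 cycles: each nop that is removed from the program is replaced by one dynamically injected bubble, giving 3, 2, 1, and 0 bubbles for 0, 1, 2, and 3 nops.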

Summary

Pipeline Hazards
- Data Hazards
  - Solution #1: Stalling
- Control Hazard
  - jmp, ret
  - Solution #1: Simple Prediction
