
1 Pipelining (Chapter 8), TU-Delft TI1400/12-PDS
Slides: http://www.pds.ewi.tudelft.nl/~iosup/Courses/2012_ti1400_8.ppt
Course website: http://www.pds.ewi.tudelft.nl/~iosup/Courses/2012_ti1400_results.htm

2 Basic idea (1)
[Figure: sequential execution of instructions I1-I4, each as a fetch (F1-F4) followed by an execution (E1-E4); an instruction fetch unit passes instructions to the execution unit through a buffer B1.]

3 Basic idea (2): Overlap
[Figure: pipelined execution of I1-I4: while I1 is executed (E1), I2 is fetched (F2), and so on, so a new instruction can start every clock cycle.]

4 Instruction phases
- F: Fetch instruction
- D: Decode instruction and fetch operands
- O: Perform operation
- W: Write result

5 Four-stage pipeline
[Figure: I1-I4 flow through the stages F, D, O, W, one stage per clock cycle; from cycle 4 onwards all four stages are busy and one instruction completes per cycle.]
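With k stages and one instruction entering per cycle, n instructions take k + n - 1 cycles instead of the k x n cycles of sequential execution. The sketch below is our own illustration, not from the slides, and assumes an ideal pipeline with no stalls:

```python
# Minimal sketch (ours, assuming an ideal stall-free pipeline): compare
# sequential execution with 4-stage pipelined execution.
STAGES = 4  # F, D, O, W

def sequential_cycles(n_instructions):
    return STAGES * n_instructions

def pipelined_cycles(n_instructions):
    # The first instruction needs all STAGES cycles; each further instruction
    # completes one cycle later because the stages overlap.
    return STAGES + (n_instructions - 1)

for n in (4, 100):
    print(n, sequential_cycles(n), pipelined_cycles(n))
# 4 instructions: 16 cycles sequential vs 7 pipelined
# 100 instructions: 400 cycles sequential vs 103 pipelined
```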

6 Hardware organization (1)
[Figure: the pipeline as hardware units connected by inter-stage buffers: Fetch unit -> B1 -> Decode-and-fetch-operands unit -> B2 -> Operation unit -> B3 -> Write unit.]

7 Hardware organization (2)
During cycle 4, the buffers contain:
- B1: instruction I3
- B2: the source operands of I2, the specification of the operation, and the specification of the destination operand
- B3: the result of the operation of I1 and the specification of the destination operand

8 Hardware organization (3)
[Figure: the same unit/buffer diagram as slide 6, annotated for cycle 4: B1 holds I3, B2 holds the operands and operation of I2, and B3 holds the result of I1.]

9 Pipeline stall (1)
Pipeline stall: a delay in a stage of the pipeline due to an instruction. Reasons for a pipeline stall:
- Cache miss
- Long operation (for example, division)
- Dependency between successive instructions
- Branching

10 Pipeline stall (2): Cache miss
[Figure: pipeline timing when fetching I2 causes a cache miss: F2 stretches over several clock cycles, so the later stages of I2 and all of I3 are delayed, and the sequence completes in cycle 8 instead of 6.]

11 Pipeline stall (3): Cache miss
[Figure: the same cache miss seen per stage: while the fetch of I2 (F2) takes extra cycles, the Decode, Operate and Write stages sit idle waiting for it.]

12 Pipeline stall (4): Long operation
[Figure: pipeline timing when the operate stage of one of the instructions takes several cycles (for example, a division); the instructions behind it are delayed.]

13 Pipeline stall (5): Dependencies
The instructions ADD R1, 3(R1) and ADD R4, 4(R1) cannot be done in parallel: both depend on register R1.
The instructions ADD R2, 3(R1) and ADD R4, 4(R3) can be done in parallel: they share no registers.

14 Pipeline stall (6): Branch
[Figure: instruction Ii is a branch; fetching of the branch target Ik can only start after the branch has been executed, causing a pipeline stall.]

15 Data dependency (1): example
MUL R2, R3, R4  /* R4 is the destination */
ADD R5, R4, R6  /* R6 is the destination */
The new value of R4 must be available before the ADD instruction uses it.

16 Data dependency (2): example
[Figure: pipeline timing of the MUL/ADD pair: the decode stage of the ADD (D2) must wait for the write stage of the MUL (W1), so the pipeline stalls and the later instructions I3 and I4 are delayed as well.]
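The stall arises because the ADD reads R4 before the MUL has written it (a read-after-write dependence). Below is a small sketch, our own and not from the course, of how a decoder could detect such a dependence between adjacent instructions, using the slide's convention that the last operand is the destination:

```python
# Minimal sketch (our own encoding; last operand = destination, as on the
# slide): detect a read-after-write dependence between adjacent instructions.
def reads(instr):
    # Source registers: all operands except the last.
    return set(instr["operands"][:-1])

def writes(instr):
    # Destination register: the last operand.
    return {instr["operands"][-1]}

def must_stall(first, second):
    """True if 'second' reads a register that 'first' writes (RAW hazard)."""
    return bool(writes(first) & reads(second))

mul = {"op": "MUL", "operands": ["R2", "R3", "R4"]}  # R4 is the destination
add = {"op": "ADD", "operands": ["R5", "R4", "R6"]}  # reads R4, writes R6
print(must_stall(mul, add))  # True: the ADD must wait for the MUL's result
```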

17 Branching: Instruction queue
[Figure: a Fetch unit fills an instruction queue, from which a Dispatch unit issues instructions to the Operation and Write stages.]

18 Idling at branch
[Figure: Ij is a branch; the instruction I(j+1) fetched after it is not used, the execution unit idles, and fetching resumes at the branch target Ik.]

19 Branch with instruction queue
[Figure: with an instruction queue, the fetch unit runs ahead of execution; when the branch is taken, the already fetched I4 is discarded.]
Branch folding: the branch instruction is executed simultaneously with another instruction (i.e., its target is computed in parallel).

20 Delayed branch (1): reordering
Original:
LOOP  Shift_left R1
      Decrement R2
      Branch_if>0 LOOP
NEXT  Add R1, R3
Reordered:
LOOP  Decrement R2
      Branch_if>0 LOOP
      Shift_left R1
NEXT  Add R1, R3
In the reordered version the Shift_left sits in the branch delay slot and is always executed; the original version always loses a cycle after the branch.

21 Delayed branch (2): execution timing
[Figure: fetch/execute timing of the reordered loop: each iteration runs Decrement, Branch, Shift back to back, so the delay slot after the branch does useful work; the Add follows once the loop exits.]

22 Branch prediction (1)
[Figure: effect of an incorrect branch prediction: I1 is a compare, I2 a conditional branch; I3 and I4 are fetched and partly decoded speculatively, but when the prediction proves wrong they are discarded and fetching restarts at the branch target Ik.]

23 Branch prediction (2)
Possible implementation:
- use a single bit
- the bit records the previous choice of the branch
- the bit tells from which location to fetch the next instructions
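A minimal sketch of this single-bit scheme (the table size and indexing are our own assumptions): the bit remembers whether the branch was taken last time, and the fetch unit follows that bit.

```python
# Minimal sketch (ours; table size and indexing are assumptions): a one-bit
# branch predictor that records the previous outcome of each branch.
class OneBitPredictor:
    def __init__(self, entries=256):
        self.entries = entries
        self.taken = [False] * entries   # one bit per table entry

    def predict(self, pc):
        # Predict the same outcome as last time for this branch address.
        return self.taken[pc % self.entries]

    def update(self, pc, actually_taken):
        # Record the actual outcome for the next prediction.
        self.taken[pc % self.entries] = actually_taken

p = OneBitPredictor()
pc = 0x40
print(p.predict(pc))   # False: no history yet, predict not taken
p.update(pc, True)     # the branch was actually taken
print(p.predict(pc))   # True: next time, fetch from the branch target
```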

24 Data paths of CPU (1)
[Figure: the register file supplies Source 1 and Source 2 to the ALU input registers SRC1 and SRC2; the ALU result goes to RSLT and from there to the destination register; operand-forwarding paths route RSLT straight back to the ALU inputs.]

25 Data paths of CPU (2)
[Figure: SRC1, SRC2 and RSLT as the registers between the Operation and Write stages; the forwarding data path carries RSLT from the ALU output back to its inputs, bypassing the register file.]

26 Pipelined operation
I1: Add R1, R2, R3
I2: Shift_left R3
[Figure: pipeline timing of this pair: the result of the Add (the new value of R3) has to be available before the Shift_left can use it.]

27 Short pipeline
[Figure: the same pair in a short pipeline: the Add result is forwarded ('fwd') directly into the shift operation, so I2 does not have to wait for the register write.]

28 Long pipeline
[Figure: the same pair in a longer pipeline whose operate phase takes three stages (O1, O2, O3); the result is forwarded ('fwd') to the dependent instruction once it becomes available.]

29 Compiler solution
I1: Add R1, R2, R3
I2: Shift_left R3
becomes
I1: Add R1, R2, R3
    NOP
I2: Shift_left R3
The compiler inserts no-operations to wait for the result.
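A sketch of what such a compiler pass might look like (the instruction representation and the assumption that a single NOP covers the latency are ours): whenever an instruction writes a register that the next instruction reads, a NOP is inserted between them.

```python
# Minimal sketch (our own representation; assumes one NOP covers the latency):
# insert NOPs between dependent adjacent instructions.
def insert_nops(program):
    """program: list of (opcode, sources, destination) tuples."""
    out = []
    for instr in program:
        opcode, sources, dest = instr
        if out:
            prev_dest = out[-1][2]
            if prev_dest is not None and prev_dest in sources:
                out.append(("NOP", (), None))  # wait for the previous result
        out.append(instr)
    return out

prog = [("Add", ("R1", "R2"), "R3"),
        ("Shift_left", ("R3",), "R3")]
for instr in insert_nops(prog):
    print(instr)
# Add, NOP, Shift_left: the NOP gives the Add time to produce R3
```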

30 Side effects
I2: ADD D1, D2
I3: ADDX D3, D4
The carry set by the ADD is copied into and used by the ADDX. This is another form of (implicit) data dependency: instructions can have side effects that are used by the next instruction.

31 Complex addressing mode
Load (X(R1)), R2
[Figure: the decode/operand phase has to compute X+[R1], then fetch [X+[R1]] and [[X+[R1]]] before R2 can be written; this causes a pipeline stall for the next instruction. X comes from the instruction itself.]

32 Simple addressing modes
Add #X, R1, R2
Load (R2), R2
[Figure: the same access built up from simple instructions takes the same amount of time as the complex-addressing Load, with forwarding between the steps.]

33 Addressing modes
Requirements on addressing modes with pipelining:
- operand access takes no more than one memory access
- only load and store instructions access memory
- addressing modes do not have side effects
Possible addressing modes:
- register
- register indirect
- index

34 Condition codes (1)
Problems in RISC with condition codes (CCs):
- do instructions after reordering have access to the right CC values?
- are the CCs already available at the next instruction?
Solutions:
- compiler detection
- no automatic use of CCs, only when explicitly given in the instruction

35 Explicit specification of CCs
Increment R5 / Add R2, R4 / Add-with-increment R1, R3
As PowerPC instructions (C: change carry flag, E: use carry flag):
ADDI R5, R5, 1
ADDC R4, R2, R4
ADDE R3, R1, R3
The ADDC/ADDE pair performs a double-precision addition.
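To make the ADDC/ADDE pair concrete, here is a sketch in Python (not PowerPC semantics verbatim; the 32-bit word width and function names are our own choices) of how a double-precision addition is built from two single-word additions: the first add records the carry, the second consumes it.

```python
# Minimal sketch (ours): build a 64-bit addition from 32-bit pieces, the way
# ADDC (produce carry) and ADDE (use carry) cooperate on the slide.
MASK32 = 0xFFFFFFFF

def addc(a, b):
    """Add the low words; return (32-bit result, carry flag)."""
    s = a + b
    return s & MASK32, s >> 32

def adde(a, b, carry):
    """Add the high words plus the carry produced by addc."""
    s = a + b + carry
    return s & MASK32, s >> 32

def add64(lo1, hi1, lo2, hi2):
    lo, c = addc(lo1, lo2)      # like ADDC R4, R2, R4
    hi, _ = adde(hi1, hi2, c)   # like ADDE R3, R1, R3
    return hi, lo

# Example: 0x00000001_FFFFFFFF + 0x00000000_00000001 = 0x00000002_00000000
print(add64(0xFFFFFFFF, 0x1, 0x1, 0x0))  # -> (2, 0)
```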

36 Two execution units
[Figure: superscalar organization: a Fetch unit fills an instruction queue; a Dispatch unit issues instructions to a floating-point unit and an integer unit, whose results go to the Write stage.]

37 Instruction flow (superscalar)
[Figure: timing of the sequence Fadd (I1), Add (I2), Fsub (I3), Sub (I4): floating-point and integer operations execute simultaneously in their own units; the floating-point operations occupy their operate stage for several cycles.]

38 Completion in program order
[Figure: the same instruction flow, but each write stage waits until the previous instruction has completed, so results are written in program order.]

39 Consequences of completion order
When an exception occurs:
- if writes are not necessarily in the order of the instructions, the exceptions are imprecise
- if writes are in order, the exceptions are precise

40 PowerPC pipeline
[Figure: block diagram with instruction cache, instruction fetch, branch unit, instruction queue, dispatcher, completion queue, load/store unit (LSU), integer unit (IU), floating-point unit (FPU), store queue and data cache.]

41 Performance Effects (1)
- Execution time of a program: T
- Dynamic instruction count: N
- Number of cycles per instruction: S
- Clock rate: R
Without pipelining: T = (N x S) / R
With an n-stage pipeline: T' = T / n ??? (the following slides show why this ideal speedup is not reached)

42 Performance Effects (2)
- Cycle time: 2 ns (R is 500 MHz)
- Cache hit (miss) ratio for instructions: 0.95 (0.05)
- Cache hit (miss) ratio for data: 0.90 (0.10)
- Fraction of instructions that need data from memory: 0.30
- Cache miss penalty: 17 cycles
Average extra delay per instruction: (0.05 + 0.3 x 0.1) x 17 = 1.36 cycles, so a slowdown by a factor of more than 2!

43 Performance Effects (3)
On average, the fetch stage takes, due to instruction cache misses: 1 + (0.05 x 17) = 1.85 cycles.
On average, the decode stage takes, due to operand cache misses: 1 + (0.3 x 0.1 x 17) = 1.51 cycles.
This gives a total additional cost of 1.36 cycles.
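The figures on slides 42 and 43 follow directly from the stated cache parameters; the sketch below simply re-derives the 1.85, 1.51 and 1.36 values (the variable names are ours):

```python
# Worked check of the numbers on slides 42-43 (cache parameters from the
# slides; variable names are ours).
i_miss = 0.05        # instruction cache miss ratio
d_miss = 0.10        # data cache miss ratio
mem_frac = 0.30      # fraction of instructions that access data memory
penalty = 17         # cache miss penalty in cycles

fetch_cycles = 1 + i_miss * penalty               # 1.85 cycles on average
decode_cycles = 1 + mem_frac * d_miss * penalty   # 1.51 cycles on average
extra = (i_miss + mem_frac * d_miss) * penalty    # 1.36 extra cycles per instruction

print(fetch_cycles, decode_cycles, extra)  # 1.85 1.51 1.36
```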

44 Performance Effects (4)
If only one stage takes longer, the additional time should be counted relative to one stage, not relative to the complete instruction. In other words: the pipeline is as slow as its slowest stage.

45 Performance Effects (5)
A delay of 1 cycle every 4 instructions in only one stage gives an average penalty of 0.25 cycles.
Average inter-completion time: (3 x 1 + 1 x 2) / 4 = 1.25 cycles.
[Figure: timing of instructions I1-I5 illustrating the single delayed stage.]

46 Performance Effects (6)
Delays in two stages:
- k % of the instructions are delayed in one stage, with a penalty of s cycles
- l % of the instructions are delayed in another stage, with a penalty of t cycles
Average inter-completion time: ((100 - k - l) x 1 + k(1 + s) + l(1 + t)) / 100 = (100 + ks + lt) / 100
In the example (k = 5, l = 3, s = t = 17): 2.36 cycles.
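The same formula also covers the single-stage case of slide 45 (k = 25% of instructions with a penalty of s = 1, and l = 0). A minimal sketch, with a function name of our own choosing:

```python
# Minimal sketch (ours) of the inter-completion-time formula of slides 45-46.
def inter_completion_time(k, s, l, t):
    """Average cycles between completions when k% of instructions incur an
    s-cycle penalty in one stage and l% incur a t-cycle penalty in another."""
    return (100 + k * s + l * t) / 100

print(inter_completion_time(25, 1, 0, 0))   # slide 45: 1.25 cycles
print(inter_completion_time(5, 17, 3, 17))  # slide 46: 2.36 cycles
```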

47 Performance Effects (7)
A large number of pipeline stages seems advantageous, but:
- more instructions are being processed simultaneously, so there is more opportunity for conflicts
- the branch penalty becomes larger
- the ALU is usually the bottleneck, so there is no use in having smaller time steps

