
Slide 1: Computer Architecture Slide Sets WS 2011/2012. Part 8: Instruction Level Parallelism (ILP) - Pipelining. Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt.

Slide 2: Parallel Computing. Instruction-level parallelism: pipelining, superscalar, VLIW, EPIC. Thread- and task-level parallelism: multithreading, multiprocessing, multi-cores, clusters of computers, cloud and grid computing.

Slide 3: Architectures with instruction level parallelism (ILP): pipelining vs. concurrency. The basis of most computer architectures is still the well-known von Neumann or Harvard principle, which relies on sequential operation. In modern high-performance processors this sequential operation mode is extended by instruction level parallelism (ILP). ILP can be implemented by two modes of parallelism: parallelism in time (pipelining) and parallelism in space (concurrency).

Slide 4: Pipelining vs. concurrency. Together with technological improvements, these two techniques of parallelism are an important source of high performance. Parallelism in time (pipelining) means that the execution of instructions is overlapped in time by partitioning the instruction cycle. Parallelism in space (concurrency) means that more than one instruction is executed in parallel, either in order or out of order. Both techniques are combined in modern microprocessors and define the instruction level parallelism used for better performance.

Slide 5: Pipelining vs. concurrency (diagram: pipeline stages over clock cycles for pipelining and for concurrency, instructions 1 to 3). Parallelism in time relies on the assembly line principle, which is also highly mature in automotive production, and it can be combined effectively with concurrency. In computer architecture an assembly line is called a pipeline.

Slide 6: Pipelining vs. concurrency. "Pipelines accelerate execution speed in the same way Henry Ford revolutionized car manufacturing with the introduction of the assembly line" (Peter Wayner, 1992). Pipelining means the fragmentation of a machine instruction into several partial operations. These partial operations are executed by partial units in a sequential and synchronized manner; every processing unit executes only one specific partial operation. Taken together, all partial processing units are called a pipeline.

Slide 7: Fragmentation of the instruction cycle. Possible fragmentation into 5 stages:
1. Instruction fetch: the instruction addressed by the program counter is loaded from main memory or a cache into the instruction register; the program counter is incremented.
2. Instruction decode: internal control signals are generated according to the instruction's opcode and addressing modes.
3. Operand fetch: the operands are provided by registers or functional units.

Slide 8: Fragmentation of the instruction cycle (continued).
4. Execute: the operation is executed on the operands.
5. Write back: the result is written into a register or bypassed to serve as an operand for a succeeding operation.
Depending on the instruction or instruction class, some stages may be skipped. The entirety of the stages is called the instruction cycle.

Slide 9: Instruction pipelining. In the first stage, the fetch unit accesses the instruction. The fetched instruction is passed to the instruction decode unit. While this second unit processes the instruction, the first unit already fetches the next instruction. In the best case, an n-stage pipeline executes n instructions in parallel, each in a different stage of its execution. When the pipeline is filled, the execution of one instruction finishes every clock cycle. A processor capable of finishing one instruction per clock cycle is called a scalar processor.

Slide 10: Instruction pipelining (diagram: three successive instructions passing through instruction fetch, instruction decode, operand fetch, execute and write back, each shifted by one clock cycle).

Slide 11: Pipeline design principles. Pipeline stages are linked by registers. The instruction and the intermediate result are forwarded every clock cycle (in special cases every half clock cycle) to the next pipeline register. A pipeline is as fast as its slowest stage; therefore an important issue in pipeline design is to ensure that the stages consume equivalent amounts of time. A high number of pipeline stages (often called a superpipeline) leads to short clock cycles and a higher speedup, but a stall of a long pipeline, e.g. due to a control flow dependency, results in long wait times until the pipeline can be refilled. Thus a real trade-off exists for the designer.

Slide 12: Basic pipeline measures. Pipelining belongs to the class of fine-grain parallelism; it takes place at the microarchitectural level. Definitions: An operation is the application of a function F to operands; an operation produces a result. An operation can be made up of a set of partial operations f_1 ... f_p (in most cases p = k), and it is assumed that the partial operations are applied in sequential order. An instruction defines through its format the function, the operands and the result. A k-stage pipeline executes n operations of F in t_p(n, k) = k + (n - 1) cycles: k cycles to execute the first instruction (filling the pipeline) and n - 1 cycles to execute the remaining n - 1 instructions.
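The following minimal sketch (not part of the slides) simply evaluates this cycle-count formula and reproduces the example given on the next slide:

def pipeline_cycles(n, k):
    """Cycles a k-stage pipeline needs for n operations: t_p(n, k) = k + (n - 1)."""
    return k + (n - 1)

print(pipeline_cycles(10, 5))   # 14, matching t_p(10, 5) on slide 13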

Slide 13: Pipeline operation (diagram: operations i to i+9 flowing through the 5 pipeline stages over time, with a start-up or fill phase, a processing phase and a drain phase). The figure shows the example t_p(10, 5) = 5 + (10 - 1) = 14.

Slide 14: Basic pipeline measures. Pipeline throughput (operations completed per cycle): R(n, k) = n / t_p(n, k) = n / (k + n - 1). Pipeline speedup over non-pipelined execution: S(n, k) = n * k / (k + n - 1). In a best case scenario, where a long stream of linearly succeeding operations is executed, the pipeline speedup converges to the number of pipeline stages (S approaches k as n grows).

Slide 15: Basic pipeline measures. Pipeline efficiency: E(n, k) = S(n, k) / k = n / (k + n - 1). Pipeline efficiency reaches 1 (peak performance) if an infinite operation stream without bubbles or stalls is executed; this is of course only a best case analysis. Practical evaluation uses the Hockney numbers: n_∞ is the pipeline peak performance at an infinite number of operations, and n_1/2 is the number of operations at which the pipeline reaches half of its peak performance.
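A small sketch (assuming the throughput and speedup formulas as written above) shows how speedup and efficiency behave for a 5-stage pipeline as the number of operations grows:

def throughput(n, k):
    return n / (k + n - 1)          # operations per cycle

def speedup(n, k):
    return n * k / (k + n - 1)      # vs. n * k cycles without pipelining

def efficiency(n, k):
    return speedup(n, k) / k

for n in (5, 50, 500):
    print(n, round(speedup(n, 5), 2), round(efficiency(n, 5), 2))
# Speedup approaches k = 5 and efficiency approaches 1 for large n;
# half of the peak throughput is reached at n = k - 1 = 4 operations.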

Slide 16: Pipeline stages (diagram: an operation F partitioned into stages f_1, f_2, f_3, ..., f_k; instructions and operands enter the first stage, results leave the last stage). Stages are separated by registers.

Slide 17: Partitioning of an operation F. Ideally F is split into suboperations f_1 and f_2, each taking time t_f / 2. If such a partitioning of the operation is impossible, F can instead be applied on two parallel units whose executions are overlapped over two clock cycles (diagram: units 1/1' and 2/2').

Slide 18: Operation example for partitioning (timing diagram: operations i to i+3 at cycles t to t+5, shown both for the partitioned version with suboperations f_1, f_2 of length t_f / 2 and for the overlapped parallel version).

Slide 19: Balancing pipeline suboperations. If t_fi = max(t_f1 ... t_fk) determines the clock frequency in an unbalanced pipeline (t_fi >> t_f1, ..., t_fi >> t_fk), f_i should be partitioned further for better performance. In the figure, f_2 is much longer than f_1 and f_3 (version 1), so f_2 is split into f_2a, f_2b and f_2c (version 2).

Slide 20: Overall execution time and clock frequency. Register delays: t_pd = propagation delay time, t_su = setup time. Clock period: cp = max(t_fi) + t_pd + t_su, i.e. the maximum processing time of a suboperation plus the register delay. The overall pipelined execution time of an operation F over k stages is t(F) = (max(t_fi) + t_pd + t_su) * k, i.e. k clock periods, which expands to k * max(t_fi) + k * (t_pd + t_su).
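As a small worked example, the clock period and overall execution time follow directly from these formulas; the stage times and register delays below are assumed numbers, not from the slide:

t_f = [1.2, 0.9, 1.0, 1.4, 0.8]   # per-stage processing times in ns (assumed)
t_pd, t_su = 0.1, 0.1             # register propagation delay and setup time in ns

cp = max(t_f) + t_pd + t_su       # clock period: 1.6 ns
t_F = len(t_f) * cp               # overall pipelined execution time of F: 8.0 ns
print(cp, t_F, 1 / cp)            # clock frequency: 0.625 GHz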

Slide 21: Architecture of a linear 5-stage pipeline with registers (block diagram: IF, ID, OF, EX and WB stages built from IC, IR, DE, CR, RF, OR, ALU, DC and the PC). Abbreviations: IC = instruction cache, DC = data cache, IR = instruction register, CR = control register, RF = register file (e.g. a 3-port register file), DE = decoder (control unit), OR = operand register, PC = program counter, IF = instruction fetch, ID = instruction decode, OF = operand fetch, EX = execute, WB = write back.

Slide 22: Pipeline hazards. So far we have assumed a smooth flow of operations through the pipeline, but there are several effects which can cause stalls in pipelined operation. These effects are called pipeline hazards. Pipeline hazards can be caused by dataflow dependencies, resource dependencies and control flow dependencies.

Slide 23: Dataflow dependencies. Pipelined processors have to consider 3 classes of dataflow dependencies; the same dependencies have to be considered for concurrency. 1. True dependency: read after write (RAW), destination(i) = source(i+1). Example: instruction i: X := A + B; instruction i+1: Y := X + B. X has to be written by instruction i before it is read by the succeeding instruction. A hazard occurs if the distance between the two instructions is smaller than the number of pipeline stages; in that case X would have to be read before it has been produced.

Slide 24: Dataflow dependencies. 2. Anti dependency: write after read (WAR), source(i) = destination(i+1). Example: instruction i: X := Y + B; instruction i+1: Y := A + C. Y has to be read by instruction i before it is written by the succeeding instruction. A hazard occurs if the order of the instructions is changed in the pipeline.

Slide 25: Dataflow dependencies. 3. Output dependency: write after write (WAW), destination(i) = destination(i+1). Example: instruction i: Y := A / B; instruction i+1: Y := C + D. Both instructions write their results into the same register. A hazard occurs if the order of the instructions is changed in the pipeline.

Slide 26: Example of a short assembler program containing a true dependency, anti dependencies and an output dependency:
I1 ADD R1,2,R2   ; R1 = R2+2
I2 ADD R4,R3,R1  ; R4 = R1+R3
I3 MULT R3,3,R5  ; R3 = R5*3
I4 MULT R3,3,R6  ; R3 = R6*3
Dependency graph: I1 -> I2 true dependency (R1), I2 -> I3 anti dependency (R3), I2 -> I4 anti dependency (R3), I3 -> I4 output dependency (R3).
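A minimal sketch (the destination/source tuple encoding of instructions is an assumption made for illustration) that mechanically classifies the dependencies of this example program:

program = [
    ("I1", "R1", ["R2"]),        # R1 = R2 + 2
    ("I2", "R4", ["R3", "R1"]),  # R4 = R1 + R3
    ("I3", "R3", ["R5"]),        # R3 = R5 * 3
    ("I4", "R3", ["R6"]),        # R3 = R6 * 3
]

def dependencies(prog):
    deps = []
    for i, (ni, di, si) in enumerate(prog):
        for nj, dj, sj in prog[i + 1:]:
            if di in sj:
                deps.append((ni, nj, "true (RAW)"))
            if dj in si:
                deps.append((ni, nj, "anti (WAR)"))
            if di == dj:
                deps.append((ni, nj, "output (WAW)"))
    return deps

for a, b, kind in dependencies(program):
    print(a, "->", b, kind)
# I1 -> I2 true (RAW), I2 -> I3 anti (WAR), I2 -> I4 anti (WAR), I3 -> I4 output (WAW)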

Slide 27: Example of a true dependency hazard (RAW) in a 5-stage pipeline for i: X := A op B and i+1: Y := X op C (diagram: instruction i+1 would read X in its read stage before instruction i has written X in its write stage; the issue check at the issue point detects the RAW conflict).

Slide 28: Solutions for true dependency hazards. Software solutions: inserting NOOP instructions, reordering instructions. Hardware solutions: pipeline interlocking, forwarding. Any combination of these solutions is possible as well.

Slide 29: Solving a true dependency hazard by inserting NOOPs. The RAW hazard is eliminated by inserting NOOPs (bubbles) into the pipeline; the NOOPs are inserted by the compiler or programmer so that instruction i+1 reads X only after instruction i has written it. This was the solution used in the first RISC processors.

Slide 30: Solving a true dependency hazard by reordering instructions. Sometimes, instead of inserting NOOPs, instructions can be reordered to achieve the same effect: instructions having no true dependencies and not changing the control flow are placed between the conflicting instructions. Example:
with NOOP:            reordered:
X := A op B           X := A op B
NOOP                  Z := D op E
Y := X op C           F := INP(0)
Z := D op E           Y := X op C
F := INP(0)

Slide 31: Solving a true dependency hazard by pipeline interlocking. Pipeline interlocking means the pipeline processing is delayed by hardware until the conflict is resolved, so the compiler or programmer is relieved of the task (diagram: the interlock delays the issue of i+1 until X has been written). Hardware interlocks are used, e.g., in later MIPS processors, although the acronym MIPS originally stood for Microprocessor without Interlocked Pipeline Stages.

Slide 32: Forwarding. Forwarding is a simple hardware technique to save one delay slot (NOOP). An operand X needed by instruction i+1 is forwarded directly from the output of the ALU to its input; the register file is bypassed. If more than one delay slot is necessary, forwarding is combined with interlocking or NOOP insertion. The forwarding path can also be used to provide operands of a waiting instruction directly from the cache; this shortens the delay slot between a load and an execute instruction using that operand. Data cache access is sped up considerably by this technique.
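A minimal sketch (the tuple representation of instructions is an assumption for illustration) of the decision behind result forwarding: bypass the ALU result whenever the instruction just executed produces a register that the next instruction reads.

def needs_result_forwarding(in_ex, entering_ex):
    """in_ex and entering_ex are (destination, sources) register tuples."""
    dest, _ = in_ex
    _, sources = entering_ex
    return dest is not None and dest in sources

# i: X := A op B, i+1: Y := X op C  ->  forward X from the ALU output
print(needs_result_forwarding(("X", ["A", "B"]), ("Y", ["X", "C"])))   # True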

Slide 33: Load and result forwarding (diagram: a bypass from the cache memory to the ALU input for load forwarding, and a bypass from the ALU output back to its input for result forwarding, in both cases bypassing the register).

Slide 34: Hardware realization of the forwarding path (diagram: the forward control selects between the register file read ports S1/S2, the result forwarding path from EX and the load forwarding path from the data cache; one NOOP or interlock cycle remains, together with the issue check for i+1).

Slide 35: Anti- and output-dependency hazards (false dependencies). An output dependency hazard may occur if an instruction i needs more time units to execute than instruction i+1; of course this is only possible if the processor consists of several processing units with different numbers of stages. Anti-dependency hazards only occur if the order of instructions is changed in the pipeline. This never happens in ordinary scalar pipelines; in superscalar pipelines, this hazard can occur.

Slide 36: Output dependency hazard, shown for only 3 stages (read, execute, write) of the 5-stage pipeline (diagram: instruction i executes A op B over three cycles on functional unit FU1 while instruction i+1 executes C op D in one cycle on FU2; i+1 writes register Y before i does, i.e. in the wrong order).

Slide 37: Removing false dependencies. False dependencies can always be removed by register renaming, which can be done by hardware or by the compiler; the hazard then never occurs. Example:
anti dependency:           output dependency:
X := Y op B                Y := A op B
Y := A op C                Y := C op D
Renaming the second Y to Z:
X := Y op B                Y := A op B
Z := A op C                Z := C op D
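A minimal register-renaming sketch (the list-based representation and the physical register names P0, P1, ... are assumptions for illustration, not the slides' hardware mechanism): every write to an architectural register gets a fresh physical register, and sources are read through the current mapping, so WAR and WAW dependencies disappear.

from itertools import count

def rename(program):
    mapping = {}                                   # architectural -> physical
    fresh = (f"P{i}" for i in count())
    renamed = []
    for dest, srcs in program:
        srcs = [mapping.get(s, s) for s in srcs]   # read through current mapping
        mapping[dest] = next(fresh)                # fresh name for every write
        renamed.append((mapping[dest], srcs))
    return renamed

# Y := A op B ; Y := C op D  (output dependency on Y)
print(rename([("Y", ["A", "B"]), ("Y", ["C", "D"])]))
# [('P0', ['A', 'B']), ('P1', ['C', 'D'])]  -- the WAW hazard is gone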

Slide 38: Resource dependencies. Resource dependencies can be classified into intra-pipeline dependencies and instruction class dependencies. An intra-pipeline dependency occurs if instructions in two succeeding stages need the same pipeline resource; the succeeding instruction (and the following instructions) have to be delayed until the resource becomes available. This happens, e.g., if the common register file lacks a sufficient number of ports or if some instructions need more than one clock cycle to pass through a particular pipeline resource. Examples: a register file with a common read/write port (possible conflict of a read in stage 3 with a write in stage 5), or a multi-cycle division unit in the execute stage.

Slide 39: Resource dependencies. An instruction class dependency occurs if two or more instructions that are in the same pipeline stage need a pipeline resource that exists only once. This never happens in a scalar pipeline, but superscalar processors with several execution units often face this kind of conflict. A two-way superscalar processor may issue two instructions to two execution units simultaneously; if these instructions need the same (only once existent) execution unit, an instruction class dependency arises.

Slide 40: Control flow dependencies. Every change in control flow is a potential candidate for a conflict. Several instruction classes cause changes in control flow: conditional branch, jump, jump to subroutine, return from subroutine. The control flow target is not yet available when the next instruction is to be fetched. Conditional branches in particular cause severe conflicts: the analysis of the condition determines the next instruction to issue, and this analysis usually finishes only in the last pipeline stages.

Slide 41: Control flow hazards. Example of a control flow hazard due to a conditional branch (diagram: CMP and BRANCH COND move through the IF, ID, OF, EX and WB stages, but the condition code only becomes available late in the pipeline, long after the next correct instruction should have been fetched).

Slide 42: Solutions for control flow hazards. Software solutions: inserting NOOP instructions, reordering instructions. Hardware solutions: pipeline interlocking, forwarding, fast compare and jump logic, branch prediction.

Slide 43: Solution: interlocking or NOOP insertion (diagram: the pipeline is filled with NOOPs or stalled until the condition code of BRANCH COND has been written back and the next correct instruction can be fetched; two delay slots appear behind the branch). Penalty: 6 cycles.

Slide 44: Reducing the penalty by forwarding the comparison result from the execute stage directly to the branch (diagram). Penalty: 4 cycles.

Slide 45: Reducing the penalty by forwarding the next correct instruction address to the fetch stage (diagram). Penalty: 3 cycles.

Slide 46: Reducing the penalty by fast compare and jump logic (diagram: a fast compare logic produces the comparison result and a fast jump logic the branch target early in the pipeline). Penalty: 2 cycles.

Slide 47: Reducing the penalty by fast compare and jump logic. Special logic for compare and jump instructions can reduce the penalty by one cycle. These circuits can be much faster than a more general execution unit (ALU), allowing comparison and jump to complete in one clock cycle. The higher speed of the fast compare logic is possible because normally only simple comparisons such as equal, unequal, <0, ≤0, ≥0 or >0 are needed.

Slide 48: Reducing the penalty by fast compare and jump logic plus instruction reordering. The remaining 2 NOOPs or interlock cycles can be removed by reordering code: two independent instructions are moved after the branch instruction (delayed branch). Example:
with NOOPs:                    reordered (delayed branch):
Z := D op E                    CMP
F := INP(0)                    BRANCH COND
CMP                            Z := D op E
BRANCH COND                    F := INP(0)
NOOP                           NEXT INSTR (COND = FALSE)
NOOP                           ...
NEXT INSTR (COND = FALSE)      NEXT INSTR (COND = TRUE)
...
NEXT INSTR (COND = TRUE)

Slide 49: Branch prediction. Another possibility for avoiding control flow hazards is branch prediction. Here the outcome of the branch (taken or not taken) is predicted before the result of the comparison is known. In case of a correct branch prediction, the penalty can be reduced to as little as 0 cycles. First, let us assume we have a perfectly working branch predictor.

Slide 50: Reducing the penalty by branch prediction (diagram: a branch predictor delivers the prediction result, taken or not taken, and the next address early in the pipeline). Penalty: still 2 cycles.

Slide 51: Branch target address cache. To further reduce the penalty, a branch target address cache (BTAC) can be introduced. This cache holds the addresses of branches and the corresponding target addresses. If the BTAC is already filled, a branch and its possible target address can therefore be identified in the fetch phase (diagram: part of the branch address, e.g. the lower m bits, indexes the BTAC, which delivers the branch target address).

Slide 52: Reducing the penalty by branch prediction and a branch target address cache (diagram: the predicted target instruction is fetched immediately after the branch). Penalty: 0 cycles.

Slide 53: Branch prediction and pipeline utilization. To achieve a penalty of 0 cycles, two prerequisites must be met: the branch address must be stored in the BTAC, and the branch prediction must be correct. Otherwise we will get a penalty.

Slide 54: Branch prediction and pipeline utilization. In case of a BTAC miss, the penalty is p_b (in our example 2 cycles). In case of a misprediction, the penalty is the number of cycles p_m needed to flush the pipeline (e.g. 5); in modern processors this can be much more (e.g. 11 for the Pentium II). The overall penalty is p = m * p_m + (1 - m) * b * p_b, with m the misprediction rate and b the BTAC miss rate. The pipeline utilization is u = n / (n + p), with n the number of instructions. So an excellent branch prediction is necessary.
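A minimal sketch of these two formulas; only p_m = 5 and p_b = 2 come from the slide's example, while the rates, the instruction mix and the assumption that the utilization formula uses the penalty accumulated over all executed branches are illustrative choices.

def branch_penalty(m, b, p_m, p_b):
    """Average penalty per branch: m * p_m + (1 - m) * b * p_b."""
    return m * p_m + (1 - m) * b * p_b

def utilization(n, p):
    """Pipeline utilization u = n / (n + p)."""
    return n / (n + p)

per_branch = branch_penalty(m=0.1, b=0.2, p_m=5, p_b=2)   # 0.86 cycles per branch
n_instr, branch_share = 1000, 0.2                         # assumed workload
total_penalty = per_branch * n_instr * branch_share
print(round(per_branch, 2), round(utilization(n_instr, total_penalty), 3))
# 0.86  0.853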

Slide 55: Branch prediction techniques. In general, two classes of branch prediction techniques can be distinguished: static branch prediction, where for a given branch the prediction is always the same and never changes, and dynamic branch prediction, where for a given branch the prediction changes dynamically.

Slide 56: Static branch prediction. Predict always not taken: the simplest technique, no BTAC necessary; on the first attempt the branch is always ignored. Predict always taken: a bit more complicated, needs a BTAC to take the branch on the first attempt; produces slightly better results. Predict backward taken, forward not taken: loop-oriented prediction; a backward branch often belongs to a loop and is therefore taken quite often. Compiler controlled: the compiler sets a bit for each branch to tell the processor how to predict it; still static, since the prediction never changes during runtime.

Slide 57: Dynamic branch prediction. Dynamic branch prediction means that information about the probability of a branch is collected at runtime; it is based on knowledge about the past behavior of the branch. This knowledge can be stored in a table addressed through the address of the branch instruction. Often this information is stored in the BTAC as well, but there are also solutions with separate tables. Dynamic branch prediction produces much better results than static branch prediction; today a misprediction rate below 10% is possible.

Slide 58: Using the BTAC to store branch history information (diagram: part of the branch address, e.g. the lower m bits, indexes a BTAC entry holding the branch address, the branch target address and the branch history bits).

Slide 59: Interferences. Only a part of the branch address is used as an index into the table containing the branch history. If two branches have an identical bit pattern in this part, they share the same table entry: an interference. This often leads to mispredictions, because one branch messes up the history of the other. The larger the history table, the fewer interferences occur. In the best case all bits of the branch address would be used as the index, so no interferences could occur; due to limited chip space, this is not possible for large programs.

Slide 60: One bit predictor. The simplest predictor: only one bit is used to store the branch history. For each branch, one of two states (taken, not taken), depending on the last execution, is stored. The prediction always repeats the last outcome (state diagram: predict taken and predict not taken, switching to the other state whenever the actual outcome differs from the prediction).

Slide 61: Two bit predictor. Two bits per branch store the history, which results in four states: strongly taken (11), weakly taken (10), weakly not taken (01), strongly not taken (00). In a strong state it takes two mispredictions to change the prediction direction (state diagram: two bit predictor with saturation counter, counting up on taken and down on not taken outcomes).
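A minimal sketch of the saturation-counter variant; the table size and the indexing with the low bits of the branch address are assumptions made for illustration.

class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.table = [2] * entries            # counters 0..3, start weakly taken
        self.mask = entries - 1

    def predict(self, branch_addr):
        return self.table[branch_addr & self.mask] >= 2    # True means taken

    def update(self, branch_addr, taken):
        i = branch_addr & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)       # saturate at 11
        else:
            self.table[i] = max(0, self.table[i] - 1)       # saturate at 00

bp = TwoBitPredictor()
for outcome in (True, True, False, True):     # a loop-like branch
    print(bp.predict(0x40), outcome)
    bp.update(0x40, outcome)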

Slide 62: Two bit predictor (state diagram: two bit predictor with hysteresis counter, using the same four states 00, 01, 10, 11 but with different transitions out of the weak states than the saturation counter).

Slide 63: One bit predictor versus two bit predictor. The one bit predictor is simpler and needs less memory. For a branch at the end of a loop, the one bit predictor correctly predicts the branch direction as long as the loop is iterated. In a nested loop, however, each iteration of the outer loop produces two mispredictions of the inner loop branch: one when the inner loop is left and one when it is re-entered. A two bit predictor avoids one of these two mispredictions (only leaving the inner loop is mispredicted). The technique can be extended to n bits, but this yields no significant improvement in performance.
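The following sketch simulates exactly this situation for a single inner-loop branch and counts the mispredictions of both predictors; the workload (nine taken outcomes followed by one not-taken exit per outer iteration) is an assumed example.

def simulate(outcomes, two_bit):
    state, missed = (3 if two_bit else 1), 0          # start predicting taken
    for taken in outcomes:
        predict_taken = state >= (2 if two_bit else 1)
        missed += predict_taken != taken
        if two_bit:
            state = min(3, state + 1) if taken else max(0, state - 1)
        else:
            state = 1 if taken else 0
    return missed

inner = [True] * 9 + [False]          # inner-loop branch: taken 9x, then exit
outcomes = inner * 5                  # five iterations of the outer loop
print(simulate(outcomes, two_bit=False), simulate(outcomes, two_bit=True))
# 9 vs. 5: after warm-up the one bit predictor misses exit and re-entry,
# the two bit predictor only misses the exit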

Slide 64: Correlation predictors. Often, branches are not independent. Example: DEC A; BRZ X; ... X: LD A,0; BRZ Y. The second branch is always taken when the first branch is taken; both branches are correlated. This is not exploited by the one or two bit predictors.

Slide 65: Correlation predictors. One or two bit predictors only use self-history; correlation predictors also use neighbor history. This means that both the branch's own history and the history of neighboring branches (those preceding it in execution order) are used. Notation: an (m, n) predictor uses the last m branches to select one of 2^m predictors, each of which is an n bit predictor for a single given branch. A branch history register (BHR) stores the directions of the last m branches in an m-bit shift register; the BHR is used as an index into a pattern history table (PHT).

Slide 66: Implementation of a (2,2) predictor (figure).
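Since the figure is not reproduced here, a minimal sketch of a (2,2) predictor in the spirit of the previous slide: a global 2-bit branch history register selects one of four 2-bit counters per table entry (table size and address indexing are assumptions for illustration).

class CorrelationPredictor:
    def __init__(self, entries=256, m=2):
        self.m = m
        self.bhr = 0                                          # last m outcomes
        self.pht = [[2] * (1 << m) for _ in range(entries)]   # 2-bit counters
        self.mask = entries - 1

    def predict(self, branch_addr):
        return self.pht[branch_addr & self.mask][self.bhr] >= 2

    def update(self, branch_addr, taken):
        counters = self.pht[branch_addr & self.mask]
        c = counters[self.bhr]
        counters[self.bhr] = min(3, c + 1) if taken else max(0, c - 1)
        # shift the actual outcome into the global history register
        self.bhr = ((self.bhr << 1) | int(taken)) & ((1 << self.m) - 1)

cp = CorrelationPredictor()
print(cp.predict(0x10))     # initial prediction: taken
cp.update(0x10, True)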

Slide 67: Two level adaptive predictors. Two level adaptive predictors were developed by Yeh and Patt at nearly the same time as the correlation predictors (1992). Like the correlation predictor, the two level adaptive predictor uses two levels of tables, where the first level is used to select prediction bits in the second level. Variants of two level adaptive predictors:
                                        global PHT   per-set PHTs   per-address PHTs
global scheme (global BHR)              GAg          GAs            GAp
per-address scheme (per-address BHT)    PAg          PAs            PAp
per-set scheme (per-set BHT)            SAg          SAs            SAp

Slide 68: Two level adaptive predictors. Examples (figures): GAg(4), GAp(4), PAg(4), PAp(4). For the s/S variants, only part of the branch address is used.

Slide 69: gshare and gselect predictors. When using a global PHT, the branch address bits and the BHR can be combined in two ways to address a PHT entry. gselect: branch address bits and BHR are concatenated. gshare: branch address bits and BHR are XORed. gshare performs a bit better than gselect due to fewer interferences. Example:
branch addr   BHR        gselect 4/4   gshare 8/8
00000000      00000001   00000001      00000001
00000000      00000000   00000000      00000000
11111111      00000000   11110000      11111111
11111111      10000000   11110000      01111111
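A minimal sketch of both index functions, with bit widths chosen to match the 4/4 and 8/8 columns of the example above:

def gselect(addr, bhr, addr_bits=4, bhr_bits=4):
    """Concatenate the low address bits with the low history bits."""
    low_addr = addr & ((1 << addr_bits) - 1)
    low_bhr = bhr & ((1 << bhr_bits) - 1)
    return (low_addr << bhr_bits) | low_bhr

def gshare(addr, bhr, bits=8):
    """XOR branch address and branch history register."""
    return (addr ^ bhr) & ((1 << bits) - 1)

print(format(gselect(0b11111111, 0b10000000), "08b"))   # 11110000
print(format(gshare(0b11111111, 0b10000000), "08b"))    # 01111111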

Slide 70: Hybrid predictors. A hybrid or combined predictor consists of two different branch predictors and a selection predictor that chooses one of the two branch predictor results for each branch prediction. Any predictor can be used as the selection predictor. Examples: McFarling: a two bit predictor combined with gshare; Young and Smith: a compiler-controlled static predictor combined with a two level adaptive predictor. Often a simple predictor delivering reasonable results in the warm-up phase is combined with a sophisticated predictor delivering better results later. The combined predictors are often better than the individual ones.

Slide 71: Misprediction rates of SAg, gshare and the McFarling combined predictor (figure).

Slide 72: Multipath execution. Multipath execution: in case of a branch, both paths are followed by the processor simultaneously; the wrong path is discarded later (diagram: a simple multipath pipeline with two instruction fetch and decode stages feeding a common register file read stage, ALU and register file write stage, controlled by the condition code at the instruction issue point).

Slide 73: Predication. Predication means that the execution of an instruction depends on a predicate: only if the predicate is true is the instruction executed. If all instructions of an instruction set support predication, this is called a fully predicated instruction set. Examples of fully predicated instruction sets: IA-64 Itanium, ARM. Fully predicated instruction sets can avoid conditional branches. Example:
with conditional branch:       predicated:
    CMP A, 0                   CMP A, 0, P
    BZ L1                      P.ADD B, C
    ADD B, C                   P.SUB C, D
    SUB C, D                   LD A, 3
L1: LD A, 3

Slide 74: Predication. On the hardware side, the predicated instruction is executed anyway; in case of a false predicate, the result of the instruction is discarded. Advantages: conditional branches can be avoided, no speculation is necessary, and the basic block length is increased, which results in better compiler optimization. Disadvantages: unnecessary execution of instructions, and additional predicate bits are necessary in the instruction format.

Slide 75: Trace cache. A trace is a sequence of executed instructions which can span several basic blocks; therefore, in a trace all branches are already resolved. A trace cache stores such traces while they are executed. If the same trace is executed again, the instruction sequence can be taken from the trace cache and no branch needs to be executed. While an instruction cache contains the static instruction sequence, the trace cache contains the dynamic instruction sequence. Example of a processor with a trace cache: Pentium 4.

