PROCESSING CONTROL TRANSFER INSTRUCTIONS
Chapter No. 8
By Najma Ismat
Presentation transcript:


2 Control Transfer Instructions
- Data hazards are a big enough problem that many resources have been devoted to overcoming them. Unfortunately, the real obstacle and limiting factor in maintaining a good rate of execution in a pipeline is control dependencies.
- Branches are 1 out of every 5 or 6 instructions.
- In an n-issue processor, they arrive n times faster.
- A "control dependence" determines the ordering of an instruction with respect to a branch instruction, so that the non-branch instruction is executed only when it should be.

3 Control Transfer Instructions
- If an instruction is control dependent on a branch, it cannot be moved before the branch.
- Control dependencies make sure instructions execute in order.
- Control dependencies preserve dataflow.
  - They make sure that instructions that produce results and instructions that consume them get the right data at the right time.

4 How Can Control Instructions Be Defined?
- Instructions are normally fetched and executed from sequential memory locations.
- PC is the address of the current instruction, and nPC is the address of the next instruction (nPC = PC + 4).
- Branches and other control transfer instructions change nPC to something else.
- Branches modify, conditionally or unconditionally, the value of the PC.
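The PC/nPC rule above can be sketched as a tiny fetch loop. This is only an illustration: the four-instruction program, the `trace` helper, and the `jmp` mnemonic with a hexadecimal target are assumptions made for the example, not part of the slides.

```python
def trace(program, start, max_steps):
    """Return the addresses fetched, following nPC = PC + 4 unless a jump overrides it."""
    pc = start
    visited = []
    for _ in range(max_steps):
        if pc not in program:
            break                            # ran off the end of the toy program
        instr = program[pc]
        visited.append(pc)
        npc = pc + 4                         # default: next sequential instruction
        if instr.startswith("jmp "):
            npc = int(instr.split()[1], 16)  # control transfer changes nPC
        pc = npc
    return visited

# Hypothetical program: the jump at 0x14 skips the instruction at 0x18.
program = {0x10: "i1", 0x14: "jmp 1c", 0x18: "i3", 0x1C: "i4"}
fetched = trace(program, 0x10, 10)
print([hex(a) for a in fetched])
```

The instruction at 0x18 never appears in the trace, which is the effect of the jump replacing the sequential nPC.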

5 Types of Branches

6 Unconditional Branches
Syntax: jmp address

Address  Instruction
10       i1
14       jmp 24
18       i3
1c       i4
20       i5
24       i6
28       i7
2c       i8
30       jmp 20
34       i10

7 Conditional jumps

Address  Instruction
10       i1
14       jle 24
18       i3
1c       i4
20       jmp 2c
24       i6
28       i7
2c       i8
30       i9
34       i10

[Figure: the same instruction sequence shown a second time, divided into basic blocks]
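One way to recover the basic blocks of the sequence on this slide is the usual "leader" rule: a block starts at the first instruction, at every jump target, and at every instruction that follows a jump. The sketch below assumes that the addresses 10 to 34 pair with the ten instructions in order (the slide text is flattened), and `basic_blocks` is a hypothetical helper name.

```python
def basic_blocks(program):
    """Split a list of (address, instruction) pairs into basic blocks."""
    addresses = [addr for addr, _ in program]
    leaders = {addresses[0]}                    # the first instruction starts a block
    for i, (_, instr) in enumerate(program):
        parts = instr.split()
        if parts[0] in ("jmp", "jle"):          # jump mnemonics used on the slide
            leaders.add(int(parts[1], 16))      # the jump target starts a block
            if i + 1 < len(program):
                leaders.add(addresses[i + 1])   # so does the instruction after a jump
    blocks, current = [], []
    for addr, instr in program:
        if addr in leaders and current:
            blocks.append(current)
            current = []
        current.append((addr, instr))
    blocks.append(current)
    return blocks

instrs = ["i1", "jle 24", "i3", "i4", "jmp 2c", "i6", "i7", "i8", "i9", "i10"]
program = [(0x10 + 4 * k, ins) for k, ins in enumerate(instrs)]
blocks = basic_blocks(program)
for block in blocks:
    print([f"{addr:x}: {ins}" for addr, ins in block])
```

Under these assumptions the sequence splits into four blocks, starting at addresses 10, 18, 24 and 2c.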

8 How Do Architectures Check the Results of Operations?

9 Result State Concept
- Architectures that support the result state approach include the IBM/360 and 370, PDP-11, VAX, x86, Pentium, MC 68000, SPARC and PowerPC.
- Generating the result state requires additional chip area.
- Implementation for VLIW and superscalar architectures requires appropriate mechanisms to avoid multiple or out-of-order updating of the result state.
- Multiple sets of flags or condition codes can be used.

10 Example (Result State Concept)
    add r1, r2, r3     // r1 <- r2 + r3
    beq zero           // test whether the result equals zero and, if yes, branch to location zero
    div r5, r4, r1     // r5 <- r4 / r1
  zero:                // handle the case where the divisor equals zero

11 Example (Result State Concept)
    teq r1             // test for (r1) = 0 and update the result state accordingly
    beq zero           // test whether the result equals zero and, if yes, branch to location zero
    div r5, r4, r1     // r5 <- r4 / r1
  zero:                // handle the case where the divisor equals zero

12 The Direct Check Concept
- Direct checking of a condition and a branch can be implemented in architectures in two ways:
  - Use two separate instructions
    - First, the result value is checked by a compare instruction, and the outcome of the compare is stored in the appropriate register.
    - Then a conditional branch instruction tests the deposited outcome and branches to the given location if the specified condition is met.
  - Use a single instruction
    - A single instruction performs both the test and the conditional branch.

13 Example (Use Two Separate Instructions)
    add r1, r2, r3     // r1 <- r2 + r3
    cmpeq r7, r1       // r7 <- true if (r1) = 0, else r7 <- false
    bt r7, zero        // branch to 'zero' if (r7) = true, else NOP
    div r5, r4, r1     // r5 <- r4 / r1
  zero:

14 Example (Use Single Instruction)
    add r1, r2, r3     // r1 <- r2 + r3
    beq r1, zero       // test for (r1) = 0 and branch if true
    div r5, r4, r1     // r5 <- r4 / r1
  zero:

15 Branch Statistics
- Branch frequency severely affects how much parallelism can be extracted from a program.
- About 20% of the instructions in general-purpose code are branches.
  - On average, every fifth instruction is a branch.
- 5-10% of the instructions in scientific code are branches.
- The majority of branches are conditional (80%).
- 75-80% of all branches are taken.

16 Branch Statistics (taken/not taken)

17 Branch Problem

18 Branch Problem in Case of Pipelining (unconditional branch)

19 Performance Measures of Branch Processing

20
- To evaluate and compare branch processing schemes, a performance measure called the branch penalty is used.
- Branch penalty
  - The number of additional delay cycles occurring until the target instruction is fetched, over the natural 1-cycle delay.
  - The effective branch penalty P for taken and not-taken branches is: P = f_t * P_t + f_nt * P_nt

21 Performance Measures of Branch Processing
- Where:
  - P_t: branch penalty for taken branches
  - P_nt: branch penalty for not-taken branches
  - f_t: frequency of taken branches
  - f_nt: frequency of not-taken branches
  - e.g. 80386: P_t = 8 cycles, P_nt = 2 cycles, therefore P = 0.75 * 8 + 0.25 * 2 = 6.5 cycles
  - e.g. i486: P_t = 2 cycles, P_nt = 0 cycles, therefore P = 0.75 * 2 + 0.25 * 0 = 1.5 cycles
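The two worked examples can be checked with a few lines of Python. This sketch assumes f_nt = 1 - f_t, which matches the 0.75 / 0.25 split used on the slide; the function name is made up for the example.

```python
def effective_branch_penalty(f_taken, p_taken, p_not_taken):
    """P = f_t * P_t + f_nt * P_nt, assuming f_nt = 1 - f_t."""
    f_not_taken = 1.0 - f_taken
    return f_taken * p_taken + f_not_taken * p_not_taken

# Penalty values (in cycles) taken from the slide.
p_80386 = effective_branch_penalty(0.75, p_taken=8, p_not_taken=2)
p_i486 = effective_branch_penalty(0.75, p_taken=2, p_not_taken=0)
print(f"80386: {p_80386:.2f} cycles, i486: {p_i486:.2f} cycles")
```

Because taken branches dominate, the taken-branch penalty P_t carries most of the weight in P.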

22 Performance Measures of Branch Processing
- With branch prediction, the effective branch penalty for correctly predicted and mispredicted branches is: P = f_c * P_c + f_m * P_m
  - e.g. in the Pentium, the penalty for correctly predicted branches is 0 cycles and the penalty for mispredicted branches is taken as 3.5 cycles, so with 90% of branches predicted correctly: P = 0.9 * 0 + 0.1 * 3.5 = 0.35 cycles

23 Zero-cycle Branching (Branch Folding)
- Refers to branch implementations that allow execution of branches with a one-cycle gain compared to sequential execution.
- The instruction logically following the branch is executed immediately after the instruction that precedes the branch.
- This scheme is implemented using a BTAC (branch target address cache).

24 Zero-cycle Branching

25 Basic Approaches to Branch Handling

26 Delayed Branch
- A branch delay slot is a single-cycle delay that comes after a conditional branch instruction has begun execution, but before the branch condition has been resolved and the branch target address has been computed. It is a feature of several RISC designs, such as the SPARC.

27 Delayed Branch
- Assume the branch target address (BTA) is available at the end of the decode stage and the branch target instruction (BTI) can be fetched from the cache in a single cycle (the execution stage).
- In delayed branching, the instruction following the branch is executed in the delay slot.
- Delayed branching can be considered a scheme applicable to branches in general, irrespective of whether they are unconditional or conditional.

28 Delayed Branch


30 Example (Delayed Branch)

31 Performance Gain (Delayed Branch)
- 60-70% of the delay slots can be filled with useful instructions.
  - Fill only with an instruction that can be put in the delay slot without violating a data dependency.
  - Fill only with an instruction that can be executed in a single pipeline cycle.
- The ratio of delay slots that can be filled with useful instructions is f_f.
- The frequency of branches is f_b.
  - 20-30% for general-purpose programs
  - 5-10% for scientific programs

32 Performance Gain (Delayed Branch)
- Delay slot utilization is n_m: n_m = (no. of instructions) * f_b * f_f
- n instructions have n * f_b delay slots; therefore
- 100 instructions have 100 * f_b delay slots, of which n_m = 100 * f_b * f_f can be utilized.
- The performance gain is G_d, the filled delay slots per instruction.
- For 100 instructions: G_d = n_m / 100 = (100 * f_b * f_f) / 100 = f_b * f_f

33 Example (Performance Gain in Delayed Branch)
- Suppose there are 100 instructions, on average 20% of all executed instructions are branches, and 60% of the delay slots can be filled with instructions other than NOPs. What is the performance gain in this case?
    n_m = (no. of instructions) * f_b * f_f
    n_m = 100 * 0.2 * 0.6 = 12 delay slots
    G_d = (no. of instructions * f_b * f_f) / 100 = f_b * f_f
    G_d = n_m / 100 = 12 / 100
    G_d = 12%
- If f_f = 1 (every slot can be filled with a useful instruction), the gain reaches its maximum:
    G_dmax = f_b * f_f = f_b (where f_b is the ratio of branches)
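The arithmetic of this example can be reproduced with a short sketch. The function name `delay_slot_gain` is hypothetical, and the gain is computed as filled slots per instruction, which equals f_b * f_f as on the slide.

```python
def delay_slot_gain(n_instructions, f_b, f_f):
    """Return (n_m, G_d): filled delay slots and the resulting performance gain."""
    n_m = n_instructions * f_b * f_f   # one delay slot per branch, a fraction f_f filled
    g_d = n_m / n_instructions         # equals f_b * f_f
    return n_m, g_d

n_m, g_d = delay_slot_gain(100, f_b=0.2, f_f=0.6)
_, g_dmax = delay_slot_gain(100, f_b=0.2, f_f=1.0)   # every slot filled
print(f"{n_m:.0f} delay slots filled, gain = {g_d:.0%}, maximum gain = {g_dmax:.0%}")
```

The maximum gain equals the branch frequency f_b, so delayed branching cannot recover more than one cycle per branch in this model.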

34 Delayed Branch Pros and Cons
- Pros:
  - Low hardware cost
- Cons:
  - Depends on the compiler to fill delay slots
    - The ability to fill delay slots drops as the number of slots increases
  - Exposes implementation details to the compiler
    - Can't change the pipeline without breaking software
  - Interrupt processing becomes more difficult
  - Compatibility
    - Can't be added to an existing architecture while retaining compatibility, so the architecture has to be redefined

35 Design Space of Delayed Branching
[Figure: design space of delayed branching. Aspects shown: multiplicity of delay slots (most architectures; MIPS-X (1986)) and annulment of an instruction in the delay slot]


37 Kinds of Annulment
- Annul the delay slot if the branch is not taken
- Annul the delay slot if the branch is taken

38 Design Space of Branch Processing

39 Branch Detection Schemes
- Master pipeline approach
  - Branches are detected and processed in a unified instruction processing scheme.
- Early branch detection
  - In-parallel branch detection (Figure 8-16)
    - Branches are detected in parallel with the decoding of other instructions, using a dedicated branch decoder.
  - Look-ahead branch detection
    - Branches are detected from the instruction buffer, but ahead of general instruction decoding.
  - Integrated fetch and branch detection
    - Branches are detected during instruction fetch.



42 Blocking Branch Processing
- Execution of a conditional branch is simply stalled until the specified condition can be resolved.

43 Speculative Branch Processing
- Predict branches and speculatively execute instructions
  - Correct prediction: no performance loss
  - Incorrect prediction: squash the speculative instructions
- It involves three key aspects:
  - Branch prediction scheme
  - Extent of speculativeness
  - Recovery from misprediction

44 Speculative Branch Processing
Basic idea: predict which way the branch will go, and start executing down that path.

45 Branch Prediction
Example:
    if (x > 0) {
        a = 0;
        b = 1;
        c = 2;
    }
    d = 3;
[Figure panels: "When x>0", "When x<0", "Predicting x<0"]

46 Branch Prediction Schemes


48 Comparison Between Taken/Not Taken Approaches

49 Static Branch Prediction

50 Dynamic Branch Prediction

51
- Explicit dynamic technique (based on history bits)
  - 1-bit history
  - 2-bit history
  - 3-bit history
- Implicit dynamic technique (presence of an entry for a predicted branch target access path)
  - BTAC
  - BTIC

52 1-bit Branch History
[Figure: two-state diagram with states "Taken" (1) and "Not Taken" (0) and transitions labeled T and NT]

53 1-bit Branch History
- A single bit per branch is used to express whether the last occurrence of the branch was taken (T) or not taken (NT).
- The Alpha 21064 and R8000 processors use a single-bit prediction scheme.
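A minimal sketch of the 1-bit scheme: the predictor simply repeats the last outcome. The loop-branch pattern (taken nine times, then not taken, repeated three times) and the initial state are assumptions chosen for illustration.

```python
def mispredictions_1bit(outcomes, initial=True):
    """Count mispredictions of a 1-bit predictor (True = taken)."""
    last = initial          # the single history bit: outcome of the last occurrence
    misses = 0
    for taken in outcomes:
        if last != taken:   # the prediction is always "same as last time"
            misses += 1
        last = taken
    return misses

# Hypothetical loop branch: taken 9 times, then falls through, executed 3 times.
loop_branch = ([True] * 9 + [False]) * 3
misses_1bit = mispredictions_1bit(loop_branch)
print(misses_1bit, "mispredictions out of", len(loop_branch))
```

Each loop exit flips the bit, so re-entering the loop costs a second misprediction; on this pattern that gives 5 mispredictions in 30 branches.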

54 2-bit Branch History
[Figure: four-state diagram with two "Predict Taken" states and two "Predict Not Taken" states, and transitions labeled T and NT]
BP state: (predict T/NT) x (last prediction right/wrong)

55 2-bit Branch History
- Operates like a four-state finite state machine.
- Uses run-time information to make predictions. The prediction changes after two consecutive mistakes!
- Increment for taken, decrement for not-taken
  - 00, 01, 10, 11
- A 2-bit predictor is almost as good as any general n-bit predictor.
- Used in the Alpha 21164A, Pentium, PowerPC 604 and 620, etc.
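The increment/decrement rule can be sketched as a saturating counter. The slide only lists the four states, so the mapping used here (10 and 11 predict taken, 00 and 01 predict not taken) and the starting state 11 are assumptions, as is the loop-branch test pattern.

```python
def mispredictions_2bit(outcomes, state=3):
    """Count mispredictions of a 2-bit saturating counter (True = taken)."""
    misses = 0
    for taken in outcomes:
        predict_taken = state >= 2          # states 10 and 11 predict taken
        if predict_taken != taken:
            misses += 1
        if taken:
            state = min(state + 1, 3)       # increment, saturating at 11
        else:
            state = max(state - 1, 0)       # decrement, saturating at 00
    return misses

# Hypothetical loop branch: taken 9 times, then falls through, executed 3 times.
loop_branch = ([True] * 9 + [False]) * 3
misses_2bit = mispredictions_2bit(loop_branch)
print(misses_2bit, "mispredictions out of", len(loop_branch))
```

A single not-taken outcome at the loop exit moves the counter from 11 to 10, which still predicts taken, so only the exit itself is mispredicted (3 misses in 30 branches here).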

56 3-bit Branch History

57
- The outcomes of the last three occurrences of the branch are stored.
- The decision is made on a majority basis.
- Simpler than the 2-bit scheme, with similar accuracy.
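The majority rule can be sketched with a three-entry history. The initial history (three taken outcomes) and the loop-branch pattern are assumptions for the example.

```python
def mispredictions_3bit(outcomes, history=(True, True, True)):
    """Count mispredictions when predicting the majority of the last three outcomes."""
    history = list(history)
    misses = 0
    for taken in outcomes:
        predict_taken = sum(history) >= 2   # at least two of the last three were taken
        if predict_taken != taken:
            misses += 1
        history = history[1:] + [taken]     # shift in the newest outcome
    return misses

# Hypothetical loop branch: taken 9 times, then falls through, executed 3 times.
loop_branch = ([True] * 9 + [False]) * 3
misses_3bit = mispredictions_3bit(loop_branch)
print(misses_3bit, "mispredictions out of", len(loop_branch))
```

On this pattern a single not-taken outcome never outvotes the two taken outcomes next to it, so only the loop exits are mispredicted.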

58 Implicit Dynamic Techniques
- BTIC (Branch Target Instruction Cache)
- BTAC (Branch Target Address Cache)
  - Both schemes are used to access the branch target path and also for branch prediction.
  - An extra cache holds the most recently used branches and either the corresponding branch target addresses (in the BTAC) or the corresponding branch target instructions (in the BTIC).
  - For branch prediction, the BTAC and BTIC simply hold entries for taken branches only.
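A toy model of the implicit prediction idea: a hit in the BTAC means "predict taken and fetch from the stored target", a miss means "predict not taken". The class name, the capacity of 4, the least-recently-used replacement, and the removal of an entry when its branch turns out not taken are all assumptions made for this sketch; real designs differ.

```python
from collections import OrderedDict

class BranchTargetAddressCache:
    """Toy BTAC: branch address -> target address, for taken branches only."""

    def __init__(self, capacity=4):
        self.capacity = capacity            # hypothetical size
        self.entries = OrderedDict()

    def predict(self, pc):
        """Return the predicted target on a hit (predict taken), None on a miss."""
        if pc in self.entries:
            self.entries.move_to_end(pc)    # mark as most recently used
            return self.entries[pc]
        return None

    def update(self, pc, taken, target):
        """Record the resolved outcome of the branch at address pc."""
        if taken:
            self.entries[pc] = target
            self.entries.move_to_end(pc)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)   # evict the least recently used entry
        else:
            self.entries.pop(pc, None)      # assumed policy: drop not-taken branches

btac = BranchTargetAddressCache()
first = btac.predict(0x14)                  # miss: no entry yet
btac.update(0x14, taken=True, target=0x24)
second = btac.predict(0x14)                 # hit: predicted target
print(first, hex(second))
```

A BTIC would follow the same pattern but store the target instruction itself instead of its address.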

59 Implementation of History Bits

60 Extent of Speculativeness

61 Recovery from Misprediction



64 Multiway Branching

65
- Both the taken and the sequential paths of the unresolved conditional branch are pursued.
- Good for VLIW architectures.
- Higher demand for hardware resources.
- Maintaining sequential consistency and discarding superfluously executed computation is a complex and time-consuming job.
- Only experimental implementations are available, such as TRACE 500 and URPR-2.

66 Guarded Execution
- A means to eliminate branches
- Uses conditional operate instructions:
  - IF the condition associated with the instruction is met,
  - THEN perform the specified operation,
  - ELSE do not perform the operation (NOP).
- Converts control dependencies into data dependencies.
- The conditional part is known as the guard part, and the operational part is the instruction part.

67 Guarded Execution
- e.g. original
    beq r1, label        // if (r1) = 0, branch to label
    move r2, r3          // move (r2) into r3
  label: …
- e.g. guarded
    cmovne r1, r2, r3    // if (r1) != 0, move (r2) into r3
    …
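The equivalence of the two forms can be modeled in a few lines. This is only a semantic sketch: Python itself still branches internally, and the function names and the arithmetic select used to stand in for `cmovne` are assumptions for illustration.

```python
def branch_version(r1, r2, r3):
    """Model of: beq r1, label; move r2, r3; label: ..."""
    if r1 == 0:            # beq r1, label: skip the move
        return r3
    r3 = r2                # move (r2) into r3
    return r3

def guarded_version(r1, r2, r3):
    """Model of: cmovne r1, r2, r3 (the guard becomes a data input)."""
    guard = int(r1 != 0)                    # 1 if the guard condition holds, else 0
    return guard * r2 + (1 - guard) * r3    # select the new or the old value of r3

results = [(branch_version(r1, 7, 9), guarded_version(r1, 7, 9)) for r1 in (0, 5)]
print(results)
```

In the guarded form the new value of r3 depends on r1 as data, which is what converting a control dependence into a data dependence means.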

