1 Computer Architecture: A Constructive Approach
Branch Prediction - 1
Arvind, Computer Science & Artificial Intelligence Lab., Massachusetts Institute of Technology
April 9, 2012, L16-1, http://csg.csail.mit.edu/6.S078

2 Control Flow Penalty
[Figure: pipeline diagram with labels PC, Fetch, I-cache, Fetch Buffer, Decode, Issue Buffer, Execute, Func. Units, Result Buffer, Commit, Arch. State; annotations "Next fetch started" and "Branch executed".]
Modern processors may have more than 10 pipeline stages between next-PC calculation and branch resolution!
How much work is lost if the pipeline doesn't follow the correct instruction flow? Roughly loop length x pipeline width.

3 Average Run-Length between Branches
Average dynamic instruction mix from SPEC92:
            SPECint92   SPECfp92
  ALU         39 %        13 %
  FPU Add                 20 %
  FPU Mult                13 %
  load        26 %        23 %
  store        9 %         9 %
  branch      16 %         8 %
  other       10 %        12 %
SPECint92: compress, eqntott, espresso, gcc, li
SPECfp92: doduc, ear, hydro2d, mdljdp2, su2cor
What is the average run-length between branches?
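The question on this slide can be answered directly from the branch row of the table. The sketch below is only an illustration in Python (not the course's Bluespec); the function name `average_run_length` is made up here, and it assumes branches are spread uniformly through the dynamic instruction stream.

```python
# Branch fractions copied from the SPEC92 table on the slide.
BRANCH_FRACTION = {"SPECint92": 0.16, "SPECfp92": 0.08}


def average_run_length(branch_fraction: float) -> float:
    """Average number of dynamic instructions per branch (the branch included)."""
    return 1.0 / branch_fraction


for suite, fraction in BRANCH_FRACTION.items():
    print(f"{suite}: one branch every {average_run_length(fraction):.2f} instructions")
# SPECint92: one branch every 6.25 instructions
# SPECfp92: one branch every 12.50 instructions
```

So integer code meets a branch roughly every 6 instructions, which is why a pipeline with many stages between fetch and branch resolution cannot afford to guess badly.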

4 MIPS Branches and Jumps
Each instruction fetch depends on one or two pieces of information from the preceding instruction:
1. Is the preceding instruction a taken branch?
2. If so, what is the target address?
  Instruction   Taken known?   Target known?
  J
  JR
  BEQZ/BNEZ
Answers shown on the slide (cell placement lost in extraction): After Inst. Decode, After Reg. Fetch, After Exec

5 Currently our simple pipelined architecture does very simple branch prediction. What is it?
Branch is predicted not taken: pc, pc+4, pc+8, …
Can we do better?

6 Branch Prediction Bits
Assume 2 BP bits per instruction.
Use a saturating counter: on taken move one state up, on ¬taken move one state down.
  11  Strongly taken
  10  Weakly taken
  01  Weakly ¬taken
  00  Strongly ¬taken
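The state table above can be sketched as a tiny Python model. This is an illustrative sketch, not code from the lecture: the class name `TwoBitCounter` and the initial state 01 (weakly ¬taken) are assumptions made here.

```python
class TwoBitCounter:
    """2-bit saturating counter: 11/10 predict taken, 01/00 predict not taken."""

    def __init__(self, state: int = 0b01):  # initial state is an assumption
        self.state = state

    def predict_taken(self) -> bool:
        return self.state >= 0b10

    def update(self, taken: bool) -> None:
        # Move one step toward 11 on taken, one step toward 00 on not taken,
        # saturating at both ends.
        if taken:
            self.state = min(self.state + 1, 0b11)
        else:
            self.state = max(self.state - 1, 0b00)


# A loop branch that is taken three times and then falls through, run twice.
outcomes = [True, True, True, False] * 2
counter = TwoBitCounter()
mispredictions = 0
for taken in outcomes:
    if counter.predict_taken() != taken:
        mispredictions += 1
    counter.update(taken)
print(mispredictions)  # 3: one warm-up miss plus one miss per loop exit
```

The point of the second bit shows up in the second pass: the single not-taken outcome at loop exit only weakens the counter to 10, so the next loop entry is still predicted taken.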

7 Branch History Table (BHT)
[Figure: k bits of the Fetch PC (above the low-order 00) form the BHT Index into a 2^k-entry BHT with 2 bits/entry, which answers Taken/¬Taken?; the Instruction from the I-Cache supplies Opcode and offset, which give Branch? and Target PC (PC + offset).]
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
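A behavioural sketch of the BHT lookup and update, again in Python rather than Bluespec. It assumes the index is taken from the PC bits just above the two low-order zero bits, as the figure suggests; the class name, method names and the initial counter value 01 are hypothetical.

```python
class BranchHistoryTable:
    """Direct-mapped table of 2-bit saturating counters, indexed by PC bits."""

    def __init__(self, index_bits: int = 12, initial: int = 0b01):
        self.index_bits = index_bits            # 12 bits -> 4K entries
        self.counters = [initial] * (1 << index_bits)

    def index(self, pc: int) -> int:
        # Drop the two low-order bits (always 00 for word-aligned instructions),
        # then keep the next index_bits bits. There is no tag.
        return (pc >> 2) & ((1 << self.index_bits) - 1)

    def predict_taken(self, pc: int) -> bool:
        return self.counters[self.index(pc)] >= 0b10

    def update(self, pc: int, taken: bool) -> None:
        i = self.index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 0b11)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0b00)


bht = BranchHistoryTable()
bht.update(0x400, taken=True)        # 01 -> 10 for this entry
print(bht.predict_taken(0x400))      # True
print(bht.predict_taken(0x404))      # False: a different entry
print(bht.predict_taken(0x4400))     # True: aliases with 0x400, no tag check
```

Because the table is untagged, two branches whose index bits agree share a counter; the BHT only says taken or not taken, so the target still has to come from the decoded instruction.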

8 Where does BHT fit in the processor pipeline?
BHT can only be used after instruction decode. What should we do at the fetch stage?
Need a mechanism to update the BHT. Where does the update information come from?

9 Overview of branch prediction
  PC:        need next PC immediately (Next Addr Pred; tight loop)
  Decode:    instr type, PC-relative targets available (BP, JMP, Ret; loose loop)
  Reg Read:  simple conditions, register targets available
  Execute:   complex conditions available
Best predictors reflect program behavior

10 Next Address Predictor (NAP), first attempt
[Figure: k bits of the PC index a Branch Target Buffer (2^k entries) alongside iMem; each entry holds BPb and a predicted target, and yields BP and target.]
BP bits are stored with the predicted target address.
IF stage: nPC = if (BP=taken) then target else pc+4
Later: check prediction; if wrong then kill the instruction and update BTB & BPb, else update BPb

11 Address Collisions
Assume a 128-entry NAP.
  Instruction Memory:  132: Jump 100   .....   1028: Add
  NAP entry:           BPb = take, target = 236
What will be fetched after the instruction at 1028?
  NAP prediction = 236
  Correct target = 1032
  ⇒ kill PC=236 and fetch PC=1032
Is this a common occurrence? Can we avoid these bubbles?
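The collision can be reproduced with a small untagged table in Python. The slide does not say how the NAP index is formed; this sketch assumes the low 7 bits of the byte address, which is one choice under which 132 and 1028 land in the same entry. `TaglessNAP` and `nap_index` are names invented for the illustration.

```python
NAP_ENTRIES = 128


def nap_index(pc: int) -> int:
    # Assumption: index with the low 7 bits of the byte address.
    return pc % NAP_ENTRIES


class TaglessNAP:
    """Next-address predictor with no tag: any PC that maps to an entry uses it."""

    def __init__(self):
        self.entries = [None] * NAP_ENTRIES   # each entry: (predict_taken, target)

    def predict(self, pc: int) -> int:
        entry = self.entries[nap_index(pc)]
        if entry is not None and entry[0]:
            return entry[1]
        return pc + 4

    def update(self, pc: int, taken: bool, target: int) -> None:
        self.entries[nap_index(pc)] = (taken, target)


nap = TaglessNAP()
nap.update(132, taken=True, target=236)   # learned from the Jump at 132
print(nap.predict(132))    # 236, correct for the jump
print(nap.predict(1028))   # 236, wrong: the Add at 1028 should be followed by 1032
```

The Add at 1028 inherits the jump's prediction purely because of the shared index, so the instruction fetched from 236 has to be killed once the mistake is detected.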

12 Use NAP for Control Instructions only
NAP contains useful information for branch and jump instructions only ⇒ do not update it for other instructions.
For all other instructions the next PC is (PC)+4!
How to achieve this effect without decoding the instruction?

13 Branch Target Buffer (BTB), a special form of NAP
[Figure: 2^k-entry direct-mapped BTB next to the I-Cache; k bits of the PC select an entry holding Valid, Entry PC and predicted target PC; Entry PC = PC gives match, and valid with match selects the target.]
Keep the (pc, predicted pc) in the BTB.
pc+4 is predicted if no pc match is found.
BTB is updated only for branches and jumps.
Permits nextPC to be determined before the instruction is decoded.

14 Consulting BTB Before Decoding
  Instruction Memory:  132: Jump 100   .....   1028: Add
  BTB entry:           entry PC = 132, BPb = take, target = 236
The match for pc = 1028 fails and 1028+4 is fetched ⇒ eliminates false predictions after ALU instructions.
BTB contains entries only for control transfer instructions ⇒ more room to store branch targets.
Even very small BTBs are very effective.
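A Python sketch of the tagged lookup shows why the Add at 1028 no longer picks up the jump's target. It indexes with `(pc >> 2)` truncated to 7 bits, in the spirit of the `mkBTB` code on slide 25; the `valid` list and the class name `DirectMappedBTB` are additions made for this illustration.

```python
class DirectMappedBTB:
    """Direct-mapped BTB: the full PC is stored as a tag next to the target."""

    def __init__(self, index_bits: int = 7):
        self.mask = (1 << index_bits) - 1
        size = 1 << index_bits
        self.valid = [False] * size
        self.tags = [0] * size
        self.targets = [0] * size

    def index(self, pc: int) -> int:
        return (pc >> 2) & self.mask

    def prediction(self, pc: int) -> int:
        i = self.index(pc)
        if self.valid[i] and self.tags[i] == pc:
            return self.targets[i]
        return pc + 4              # no match: fall through to the next instruction

    def update(self, pc: int, target: int) -> None:
        # Called only for branches and jumps.
        i = self.index(pc)
        self.valid[i] = True
        self.tags[i] = pc
        self.targets[i] = target


btb = DirectMappedBTB()
btb.update(132, 236)
print(btb.prediction(132))    # 236: tag matches
print(btb.prediction(1028))   # 1032: no matching entry PC
print(btb.prediction(644))    # 648: same index as 132, but the tag differs
```

With this indexing 644 and 132 share an entry, yet the tag comparison rejects the lookup, so a non-branch instruction is never redirected by another instruction's entry.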

15 Observations
There is a plethora of branch prediction schemes; their importance grows with the depth of the processor pipeline.
Processors often use more than one prediction scheme.
It is usually easy to understand the data structures required to implement a particular scheme.
It takes considerably more effort to understand how a particular scheme, with its lookup and updates, is integrated in the pipeline and how various schemes interact with each other.

16 Plan
We will begin with a very simple 2-stage pipeline and integrate a simple BTB scheme in it.
We will extend the design to a multistage pipeline and integrate at least one more predictor, say BHT, in the pipeline (next lecture).
First, revisit the simple two-stage pipeline without branch prediction.

17 Decoupled Fetch and Execute
[Figure: Fetch and Execute connected by two FIFOs, ir (Fetch to Execute) and nextPC (Execute to Fetch).]
Fetch sends instructions to Execute along with pc and other control information.
Execute sends information about the target pc to Fetch, which updates pc and other control registers whenever it looks at the nextPC fifo.

18 A solution using epoch
Add fEpoch and eEpoch registers to the processor state; initialize them to the same value.
The epoch changes whenever Execute determines that the pc prediction is wrong. This change is reflected immediately in eEpoch and eventually in fEpoch via the nextPC FIFO.
Associate the fEpoch with every instruction when it is fetched.
In the execute stage, reject, i.e., kill, the instruction if its epoch does not match eEpoch.

19 Two-Stage pipeline: a robust two-rule solution
[Figure: PC (+4), Inst Memory, Decode, Register File, Execute, Data Memory; ir is a Pipeline FIFO from Fetch to Execute, nextPC is a Bypass FIFO from Execute back to Fetch; fEpoch sits with Fetch and eEpoch with Execute.]
Either fifo can be a normal (>1 element) fifo.

20 Two-stage pipeline Decoupled
module mkProc(Proc);
  Reg#(Addr) pc <- mkRegU;
  RFile rf <- mkRFile;
  IMemory iMem <- mkIMemory;
  DMemory dMem <- mkDMemory;
  PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
  Reg#(Bool) fEpoch <- mkReg(False);
  Reg#(Bool) eEpoch <- mkReg(False);
  FIFOF#(Addr) nextPC <- mkBypassFIFOF;

  rule doFetch (ir.notFull);             // explicit guard
    let inst = iMem(pc);
    ir.enq(TypeFetch2Decode {pc:pc, epoch:fEpoch, inst:inst});
    if (nextPC.notEmpty) begin
      pc <= nextPC.first; fEpoch <= !fEpoch; nextPC.deq;
    end
    else pc <= pc + 4;                   // simple branch prediction
  endrule

21 Two-stage pipeline Decoupled cont
  rule doExecute (ir.notEmpty);
    let irpc = ir.first.pc;
    let inst = ir.first.inst;
    if (ir.first.epoch == eEpoch) begin
      let eInst = decodeExecute(irpc, inst, rf);
      let memData <- dMemAction(eInst, dMem);
      regUpdate(eInst, memData, rf);
      if (eInst.brTaken) begin
        nextPC.enq(eInst.addr);
        eEpoch <= !eEpoch;
      end
    end
    ir.deq;    // wrong-epoch instructions are simply dropped
  endrule
endmodule
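To see the epoch mechanism at work, the two rules can be mimicked by a cycle-by-cycle Python model. This is a rough behavioural sketch, not a translation of the Bluespec: it assumes Execute is evaluated before Fetch within a cycle (so a redirect is visible to Fetch in the same cycle, as with the bypass FIFO), models `ir` as a one-entry register, and uses a made-up program format in which each address maps to `("alu", None)` or `("jump", target)`.

```python
from collections import deque


def run(program, start_pc, cycles):
    """Simulate the decoupled two-stage pipeline; return (committed, killed) PCs."""
    pc = start_pc
    f_epoch = e_epoch = False
    ir = None            # one-entry pipeline register holding (pc, epoch)
    next_pc = deque()    # redirect FIFO from Execute to Fetch
    committed, killed = [], []
    for _ in range(cycles):
        # Execute: commit the instruction only if its epoch matches eEpoch.
        if ir is not None:
            ipc, epoch = ir
            if epoch == e_epoch:
                committed.append(ipc)
                kind, target = program.get(ipc, ("alu", None))
                if kind == "jump":           # pc+4 was predicted, so this is a miss
                    next_pc.append(target)
                    e_epoch = not e_epoch
            else:
                killed.append(ipc)           # wrong-path instruction
            ir = None
        # Fetch: always send the instruction at pc, tagged with fEpoch.
        ir = (pc, f_epoch)
        if next_pc:
            pc = next_pc.popleft()
            f_epoch = not f_epoch
        else:
            pc = pc + 4                      # predict not taken
    return committed, killed


program = {100: ("alu", None), 104: ("jump", 200), 108: ("alu", None),
           200: ("alu", None), 204: ("alu", None)}
print(run(program, 100, 6))   # ([100, 104, 200, 204], [108])
```

The instruction at 108 is fetched with the old epoch in the same cycle the jump resolves, so it arrives at Execute with a stale epoch and is dropped; everything fetched after the redirect carries the new epoch and commits normally.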

22 Two-Stage pipeline with a Branch Predictor
[Figure: the two-stage pipeline (PC, Inst Memory, Decode, Register File, Execute, Data Memory, ir, nextPC, fEpoch, eEpoch) with a Branch Predictor added at the PC; the predicted pc (ppc) travels with the instruction to Execute.]

23 Branch Predictor Interface
interface NextAddressPredictor;
  method Addr prediction(Addr pc);
  method Action update(Addr pc, Addr target);
endinterface

24 Null Branch Prediction
module mkNeverTaken(NextAddressPredictor);
  method Addr prediction(Addr pc);
    return pc+4;
  endmethod
  method Action update(Addr pc, Addr target);
    noAction;
  endmethod
endmodule
Replaces PC+4 with …
Already implemented in the pipeline
Right most of the time. Why?

25 Branch Target Prediction (BTB)
module mkBTB(NextAddressPredictor);
  RegFile#(LineIdx, Addr) tagArr <- mkRegFileFull;
  RegFile#(LineIdx, Addr) targetArr <- mkRegFileFull;
  method Addr prediction(Addr pc);
    LineIdx index = truncate(pc >> 2);
    let tag = tagArr.sub(index);
    let target = targetArr.sub(index);
    if (tag==pc) return target;
    else return (pc+4);
  endmethod
  method Action update(Addr pc, Addr target);
    LineIdx index = truncate(pc >> 2);
    tagArr.upd(index, pc);
    targetArr.upd(index, target);
  endmethod
endmodule

26 Two-stage pipeline + BP
module mkProc(Proc);
  Reg#(Addr) pc <- mkRegU;
  RFile rf <- mkRFile;
  IMemory iMem <- mkIMemory;
  DMemory dMem <- mkDMemory;
  PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
  Reg#(Bool) fEpoch <- mkReg(False);
  Reg#(Bool) eEpoch <- mkReg(False);
  FIFOF#(Tuple2#(Addr,Addr)) nextPC <- mkBypassFIFOF;
  NextAddressPredictor bpred <- mkNeverTaken;   // some target predictor

The definition of TypeFetch2Decode is changed to include the predicted pc:
typedef struct {
  Addr pc;
  Addr ppc;
  Bool epoch;
  Data inst;
} TypeFetch2Decode deriving (Bits, Eq);

27 Two-stage pipeline + BP: Fetch rule
  rule doFetch (ir.notFull);
    let ppc = bpred.prediction(pc);
    let inst = iMem(pc);
    ir.enq(TypeFetch2Decode {pc:pc, ppc:ppc, epoch:fEpoch, inst:inst});
    if (nextPC.notEmpty) begin
      match {.ipc, .ippc} = nextPC.first;
      pc <= ippc; fEpoch <= !fEpoch; nextPC.deq;
      bpred.update(ipc, ippc);
    end
    else pc <= ppc;
  endrule

28 Two-stage pipeline + BP: Execute rule
  rule doExecute (ir.notEmpty);
    let irpc = ir.first.pc;
    let inst = ir.first.inst;
    let irppc = ir.first.ppc;
    if (ir.first.epoch == eEpoch) begin
      let eInst = decodeExecute(irpc, irppc, inst, rf);
      let memData <- dMemAction(eInst, dMem);
      regUpdate(eInst, memData, rf);
      if (eInst.missPrediction) begin
        nextPC.enq(tuple2(irpc, eInst.brTaken ? eInst.addr : irpc+4));
        eEpoch <= !eEpoch;
      end
    end
    ir.deq;
  endrule
endmodule
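The effect of plugging a predictor into the two rules can be checked with a behavioural Python model. The sketch makes several simplifying assumptions that are not in the slides: Execute is evaluated before Fetch in each cycle, `ir` holds one entry, the program is a dict from address to `("alu", None)` or `("jump", target)`, and `TinyBTB` is an unbounded dictionary rather than a direct-mapped table.

```python
from collections import deque


class NeverTaken:
    def prediction(self, pc):
        return pc + 4

    def update(self, pc, target):
        pass


class TinyBTB:
    """Simplified BTB: an unbounded pc -> target map (no index, no eviction)."""

    def __init__(self):
        self.table = {}

    def prediction(self, pc):
        return self.table.get(pc, pc + 4)

    def update(self, pc, target):
        self.table[pc] = target


def run(program, start_pc, cycles, bpred):
    """Return (committed, killed) instruction counts after the given cycles."""
    pc, f_epoch, e_epoch = start_pc, False, False
    ir = None              # (pc, predicted pc, epoch)
    next_pc = deque()      # (pc of mispredicted instruction, correct next pc)
    committed = killed = 0
    for _ in range(cycles):
        # Execute: compare the actual next pc with the prediction made at fetch.
        if ir is not None:
            ipc, ippc, epoch = ir
            if epoch == e_epoch:
                committed += 1
                kind, target = program.get(ipc, ("alu", None))
                actual = target if kind == "jump" else ipc + 4
                if actual != ippc:
                    next_pc.append((ipc, actual))
                    e_epoch = not e_epoch
            else:
                killed += 1
            ir = None
        # Fetch: follow the predictor unless Execute sent a redirect.
        ppc = bpred.prediction(pc)
        ir = (pc, ppc, f_epoch)
        if next_pc:
            rpc, rtarget = next_pc.popleft()
            pc, f_epoch = rtarget, not f_epoch
            bpred.update(rpc, rtarget)
        else:
            pc = ppc
    return committed, killed


loop = {100: ("alu", None), 104: ("jump", 100)}   # two-instruction endless loop
print(run(loop, 100, 12, NeverTaken()))   # (8, 3)
print(run(loop, 100, 12, TinyBTB()))      # (10, 1)
```

With the never-taken predictor every trip around the loop kills one wrong-path instruction; with the BTB only the first encounter of the jump mispredicts, after which the pipeline commits one instruction per cycle.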

29 Execute Function
function ExecInst exec(DecodedInst dInst, Data rVal1, Data rVal2, Addr pc, Addr ppc);
  ExecInst einst = ?;
  let aluVal2 = (dInst.immValid) ? dInst.imm : rVal2;
  let aluRes = alu(rVal1, aluVal2, dInst.aluFunc);
  let brAddr = brAddrCal(pc, rVal1, dInst.iType, dInst.imm);
  einst.itype = dInst.iType;
  einst.addr = memType(dInst.iType) ? aluRes : brAddr;
  einst.data = dInst.iType==St ? rVal2 : aluRes;
  einst.brTaken = aluBr(rVal1, aluVal2, dInst.brComp);
  // mispredicted if the predicted pc differs from the actual next pc
  einst.missPrediction = einst.brTaken ? brAddr!=ppc : (pc+4)!=ppc;
  einst.rDst = dInst.rDst;
  return einst;
endfunction
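The misprediction test in the execute function reduces to "actual next pc differs from predicted pc". A minimal Python rendering of that one expression, with hypothetical function names, makes the four cases easy to check:

```python
def actual_next_pc(br_taken: bool, br_addr: int, pc: int) -> int:
    """Next pc the program really goes to: branch target if taken, else pc+4."""
    return br_addr if br_taken else pc + 4


def miss_prediction(br_taken: bool, br_addr: int, pc: int, ppc: int) -> bool:
    """Same test as the missPrediction expression: compare with predicted pc."""
    return actual_next_pc(br_taken, br_addr, pc) != ppc


# Branch at 104 with target 200.
print(miss_prediction(True, 200, 104, 108))    # True: taken, but pc+4 was predicted
print(miss_prediction(True, 200, 104, 200))    # False: taken and target predicted
print(miss_prediction(False, 200, 104, 108))   # False: not taken, pc+4 predicted
print(miss_prediction(False, 200, 104, 200))   # True: not taken, but target predicted
```

The last case matters once a BTB is in use: a stale entry can predict a taken branch that falls through this time, and Execute must then redirect Fetch to pc+4, which is what the `eInst.brTaken ? eInst.addr : irpc+4` expression in the execute rule supplies.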

