
Computer Architecture: A Constructive Approach
Data Hazards and Multistage Pipelines

Teacher: Yoav Etsion

Taken (with permission) from:
Arvind et al.*, Massachusetts Institute of Technology
Derek Chiou, The University of Texas at Austin

* Joel Emer, Li-Shiuan Peh, Murali Vijayaraghavan, Asif Khan, Abhinav Agarwal, Myron King

Two-Stage pipeline: a robust two-rule solution

[Figure: two-stage pipeline datapath. Blocks: PC, +4, Inst Memory, Decode, Register File, Execute, Data Memory. Labels: ir, nextPC, fEpoch, eEpoch, Bypass FIFO, Pipeline FIFO.]

Either FIFO can be a normal (>1 element) FIFO.

A different 2-Stage pipeline: 2-Stage-DH pipeline

[Figure: 2-Stage-DH pipeline datapath. Blocks: PC, +4, Inst Memory, Decode, Register File, Execute, Data Memory. Labels: itr, nextPC, fEpoch, eEpoch.]

TypeDecode2Execute

    typedef struct {
      Addr pc;
      Bool epoch;
      DecodedInst dInst;
      Data rVal1;
      Data rVal2;
    } TypeDecode2Execute deriving (Bits, Eq);

rVal1 and rVal2 carry register values instead of register names.

2-Stage-DH pipeline: first attempt

    typedef struct {
      Addr npc;
      Bool nepoch;
    } TypeNextPCE deriving (Bits, Eq);

    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile rf <- mkRFile;
      IMemory iMem <- mkIMemory;
      DMemory dMem <- mkDMemory;
      PipeReg#(TypeDecode2Execute) itr <- mkPipeReg;
      Reg#(Bool) fEpoch <- mkReg(False);
      Reg#(Bool) eEpoch <- mkReg(False);
      FIFOF#(TypeNextPCE) nextPC <- mkBypassFIFOF;
      // the doFetch and doExecute rules follow on the next slides

2-Stage-DH pipeline: doFetch rule, first attempt

    rule doFetch (itr.notFull);
      let inst = iMem(pc);
      let dInst = decode(inst);
      // validValue extracts the payload of a Maybe value
      let rVal1 = rf.rd1(validValue(dInst.src1));
      let rVal2 = rf.rd2(validValue(dInst.src2));
      itr.enq(TypeDecode2Execute{pc:pc, epoch:fEpoch,
              dInst:dInst, rVal1:rVal1, rVal2:rVal2});
      if(nextPC.notEmpty) begin
        let npc = nextPC.first.npc;
        let nepoch = nextPC.first.nepoch;
        pc <= npc; fEpoch <= nepoch; nextPC.deq;
      end
      else pc <= pc+4;
    endrule

Not quite correct!

2-Stage-DH pipeline: doExecute rule, first attempt

    rule doExecute (itr.notEmpty);
      let itrpc = itr.first.pc;
      let dInst = itr.first.dInst;
      let rVal1 = itr.first.rVal1;
      let rVal2 = itr.first.rVal2;
      if(itr.first.epoch==eEpoch) begin
        let eInst = execute(dInst, rVal1, rVal2, itrpc);
        let memData <- dMemAction(eInst, dMem);
        regUpdate(eInst, memData, rf);
        if(eInst.brTaken) begin
          let nepoch = next(eEpoch);
          eEpoch <= nepoch;
          nextPC.enq(TypeNextPCE{npc:eInst.addr, nepoch:nepoch});
        end
      end
      itr.deq;  // wrong-epoch instructions are dequeued without executing
    endrule
    endmodule

Not quite correct! Fetch is potentially reading stale values from rf.
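The epoch handshake in the two rules above can be tried out in software. The following is a minimal Python sketch, not the course's code: `run`, the `program` dictionary and the `("nop",)` / `("jump", target)` encoding are hypothetical names chosen for illustration. It assumes a one-element itr, that every jump is taken, and that within a cycle doExecute fires before doFetch, with the redirect visible to doFetch in the same cycle (the role of the bypass FIFO nextPC).

```python
def run(program, cycles):
    """Return the PCs whose instructions actually execute, in order."""
    pc, f_epoch, e_epoch = 0, False, False
    itr = None            # one-element fetch-to-execute register: (pc, epoch)
    executed = []
    for _ in range(cycles):
        redirect = None   # plays the role of the nextPC bypass FIFO
        # doExecute: consume itr, dropping wrong-path instructions
        if itr is not None:
            ipc, iepoch = itr
            if iepoch == e_epoch:
                executed.append(ipc)
                op = program[ipc]
                if op[0] == "jump":
                    e_epoch = not e_epoch
                    redirect = (op[1], e_epoch)
            itr = None
        # doFetch: fetch at the current pc, then choose the next pc
        if pc in program:
            itr = (pc, f_epoch)
        if redirect is not None:
            pc, f_epoch = redirect
        else:
            pc += 4
    return executed


# Hypothetical program: the jump at PC 4 skips the instructions at 8 and 12.
program = {0: ("nop",), 4: ("jump", 16), 8: ("nop",), 12: ("nop",), 16: ("nop",)}
trace = run(program, cycles=6)
```

In this run the instruction at PC 8 is fetched in the same cycle in which the jump executes, so it carries the old epoch and is discarded when it reaches doExecute.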

Data Hazards

[Figure: fetch & decode stage and execute stage connected by itr; other labels: pc, rf, dMem.]

    time       t0   t1   t2   t3   t4   t5   t6   t7 ....
    FD stage   FD1  FD2  FD3  FD4  FD5
    EX stage        EX1  EX2  EX3  EX4  EX5

    I1: Add(R1,R2,R3)
    I2: Add(R4,R1,R2)

I2 must be stalled until I1 updates the register file:

    time       t0   t1   t2   t3   t4   t5   t6   t7 ....
    FD stage   FD1  FD2  FD2  FD3  FD4  FD5
    EX stage        EX1       EX2  EX3  EX4  EX5
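The stalled timeline can be reproduced with a small model. This is a sketch under stated assumptions, not the processor from the slides: `schedule` and the tuple layout `(name, dst, src1, src2)` are hypothetical, the first operand of Add is taken to be the destination, and the register file is assumed to have no bypass, so a value written in EX during cycle t is readable by FD only from cycle t+1.

```python
def schedule(program):
    """Map each instruction name to (list of FD cycles, EX cycle)."""
    fd_cycles = {}
    ex_cycle = {}
    in_ex = None   # instruction occupying EX during the current cycle
    pc = 0
    cycle = 0
    while pc < len(program) or in_ex is not None:
        next_ex = None
        if pc < len(program):
            name, dst, src1, src2 = program[pc]
            fd_cycles.setdefault(name, []).append(cycle)
            # Stall while the instruction in EX has yet to write a source.
            hazard = (in_ex is not None and in_ex[1] is not None
                      and in_ex[1] in (src1, src2))
            if not hazard:
                next_ex = program[pc]
                pc += 1
        if in_ex is not None:
            ex_cycle[in_ex[0]] = cycle
        in_ex = next_ex
        cycle += 1
    return {name: (fd_cycles[name], ex_cycle[name]) for name in fd_cycles}


program = [
    ("I1", "R1", "R2", "R3"),   # Add(R1,R2,R3)
    ("I2", "R4", "R1", "R2"),   # Add(R4,R1,R2)
]
timeline = schedule(program)
```

Here I2 occupies FD in cycles t1 and t2 and reaches EX in t3, which is the one-cycle bubble drawn in the second timeline.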

2-Stage-DH pipeline: stall logic

[Figure: 2-Stage-DH pipeline datapath with a scoreboard added. Blocks: PC, +4, Inst Memory, Decode, Register File, Execute, Data Memory. Labels: itr, nextPC, fEpoch, eEpoch, scoreboard.]

Data Hazard

Given two source registers and a destination register, determine whether there is a potential data hazard. To support this, src1, src2 and rDst in DecodedInst are changed from Rindx to Maybe#(Rindx).

    function Bool dataHazard(Maybe#(Rindx) src1,
                             Maybe#(Rindx) src2,
                             Maybe#(Rindx) dst);
      // a hazard exists only if dst is valid and matches a valid source
      return (isValid(dst) &&
              ((isValid(src1) && validValue(dst) == validValue(src1)) ||
               (isValid(src2) && validValue(dst) == validValue(src2))));
    endfunction

Scoreboard: keeping track of instructions in execution

Scoreboard: a data structure that keeps track of the destination registers of the instructions beyond the fetch stage.

- method insert: inserts the destination (if any) of an instruction into the scoreboard when the instruction is decoded
- method search(src1,src2): searches the scoreboard for data hazards
- method remove: deletes the oldest entry when an instruction commits

Scoreboard

    module mkScoreboard(Scoreboard#(size));
      Vector#(size, EHR#(2, Maybe#(Rindx))) sb <- replicateM(mkEHR(Invalid));
      Reg#(Bit#(TAdd#(TLog#(size),1))) iidx <- mkReg(0);
      Reg#(Bit#(TAdd#(TLog#(size),1))) ridx <- mkReg(0);
      EHR#(2, Bit#(TAdd#(TLog#(size),1))) cnt <- mkEHR(0);
      Integer vsize = valueOf(size);
      Bit#(TAdd#(TLog#(size),1)) sz = fromInteger(vsize);

      // guarded: can fire only when the scoreboard is not full
      method Action insert(Maybe#(Rindx) r) if(cnt.r1 != sz);
        sb[iidx].w1(r);
        iidx <= iidx==(sz-1) ? 0 : iidx + 1;
        cnt.w1(cnt.r1 + 1);
      endmethod

Scoreboard (cont.)

      // guarded: can fire only when the scoreboard is not empty
      method Action remove if (cnt.r0 != 0);
        sb[ridx].w0(Invalid);
        ridx <= ridx==sz-1 ? 0 : ridx + 1;
        cnt.w0(cnt.r0 - 1);
      endmethod

      method Bool search(Maybe#(Rindx) s1, Maybe#(Rindx) s2);
        Bool j = False;
        for (Integer i=0; i<vsize; i=i+1)
          j = (j || dataHazard(s1, s2, sb[i].r1));
        return j;
      endmethod
    endmodule

Method ordering within a cycle: remove < search < insert
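The scoreboard's behavior, ignoring how its methods are scheduled within a cycle, can be summarized by a sequential model. The Python below is a sketch for illustration only: the class and function names mirror the slides, but a queue stands in for the vector of EHRs, `None` stands in for an Invalid destination, and the method guards are modeled as exceptions (an assumption of this sketch; in the hardware a guarded method simply cannot fire).

```python
from collections import deque


def data_hazard(src1, src2, dst):
    """True if a valid destination matches a valid source (None = Invalid)."""
    return dst is not None and (dst == src1 or dst == src2)


class Scoreboard:
    def __init__(self, size):
        self.size = size
        self.entries = deque()   # oldest entry on the left

    def insert(self, dst):
        # Mirrors the guard cnt != size on insert.
        if len(self.entries) == self.size:
            raise RuntimeError("scoreboard full")
        self.entries.append(dst)

    def remove(self):
        # Mirrors the guard cnt != 0 on remove.
        if not self.entries:
            raise RuntimeError("scoreboard empty")
        self.entries.popleft()

    def search(self, src1, src2):
        return any(data_hazard(src1, src2, dst) for dst in self.entries)


sb = Scoreboard(size=2)
sb.insert(1)                        # I1 writes R1
stall_for_i2 = sb.search(1, 2)      # I2 reads R1 and R2
sb.remove()                         # I1 commits
stall_after_commit = sb.search(1, 2)
```

An instruction without a destination still occupies an entry (as `None`), so that each later `remove` retires exactly one instruction.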

2-Stage-DH pipeline

    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile rf <- mkBypassRFile;
      IMemory iMem <- mkIMemory;
      DMemory dMem <- mkDMemory;
      PipeReg#(TypeDecode2Execute) itr <- mkPipeReg;
      Scoreboard#(1) sb <- mkScoreboard;  // contains only one instruction
      Reg#(Bool) fEpoch <- mkReg(False);
      Reg#(Bool) eEpoch <- mkReg(False);
      FIFOF#(TypeNextPCE) nextPC <- mkBypassFIFOF;

2-Stage-DH pipeline: doFetch rule

    rule doFetch (itr.notFull);
      let inst = iMem(pc);
      let dInst = decode(inst);
      let stall = sb.search(dInst.src1, dInst.src2);
      if(!stall) begin
        // validValue extracts the payload of a Maybe value
        let rVal1 = rf.rd1(validValue(dInst.src1));
        let rVal2 = rf.rd2(validValue(dInst.src2));
        itr.enq(TypeDecode2Execute{pc:pc, epoch:fEpoch,
                dInst:dInst, rVal1:rVal1, rVal2:rVal2});
        sb.insert(dInst.rDst);
        if(nextPC.notEmpty) begin
          let npc = nextPC.first.npc;
          let nepoch = nextPC.first.nepoch;
          pc <= npc; fEpoch <= nepoch; nextPC.deq;
        end
        else pc <= pc+4;
      end
    endrule

2-Stage-DH pipeline: doExecute rule

    rule doExecute (itr.notEmpty);
      let itrpc = itr.first.pc;
      let dInst = itr.first.dInst;
      let rVal1 = itr.first.rVal1;
      let rVal2 = itr.first.rVal2;
      if(itr.first.epoch==eEpoch) begin
        let eInst = execute(dInst, rVal1, rVal2, itrpc);
        let memData <- dMemAction(eInst, dMem);
        regUpdate(eInst, memData, rf);
        if(eInst.brTaken) begin
          let nepoch = next(eEpoch);
          eEpoch <= nepoch;
          nextPC.enq(TypeNextPCE{npc:eInst.addr, nepoch:nepoch});
        end
      end
      itr.deq;    // every instruction leaves itr, executed or not
      sb.remove;  // and releases its scoreboard entry
    endrule
    endmodule

Concurrency analysis

doExecute < doFetch implies that, for every module whose methods are called by both rules, the method calls must be ordered the same way:

    {itr.first, itr.deq} < {itr.enq}             =>  pipeline FIFO
    sb.remove < {sb.search, sb.insert}
    {nextPC.enq} < {nextPC.first, nextPC.deq}    =>  bypass FIFO
    {rf.wr} < {rf.rd1, rf.rd2}                   =>  bypass RF

Multi-stage pipeline with Data Hazards

Three Stage Pipeline Bypass (1)

[Figure: three-stage pipeline datapath. Blocks: PC, +4, Inst Memory, Decode, Register File, Execute, Data Memory. Labels: itr, cr, nextPC, fEpoch, eEpoch, scoreboard.]

What Sort of Logic?

- What information is needed?
- Does anything need to be done to the pipeline?

Three Stage Pipeline Bypass (2)

[Figure: three-stage pipeline datapath. Blocks: PC, +4, Inst Memory, Decode, Register File, Execute, Data Memory. Labels: itr, cr, nextPC, fEpoch, eEpoch, scoreboard.]

What Sort of Logic?

- What information is needed?
- Does anything need to be done to the pipeline?

Bypass Issues

- Need to ensure that data is never "lost"
  - Conceptually, data needs to live until everyone who needs it has it
- Naming is important
  - There can be different versions throughout the pipeline
- Bypassing once is logically straightforward
  - But not necessarily easy to implement
  - What if you make a change to the pipeline structure?
- One elegant bypassing strategy is to rename registers
  - Only need to look for one tag
  - Eliminates the complexity of bypassing for a specific pipeline

Computer Architecture: A Constructive Approach
Branch Prediction - 1

Control Flow Penalty

[Figure: pipeline stages PC, Fetch, Decode, Execute, Commit, with the structures I-cache, Fetch Buffer, Issue Buffer, Func. Units, Result Buffer and Arch. State; the points "Next fetch started" and "Branch executed" are marked.]

Modern processors may have more than 10 pipeline stages between next-PC calculation and branch resolution!

How much work is lost if the pipeline doesn't follow the correct instruction flow?
~ Loop length x pipeline width

Average Run-Length between Branches

Average dynamic instruction mix from SPEC92:

                SPECint92    SPECfp92
    ALU           39 %         13 %
    FPU Add                    20 %
    FPU Mult                   13 %
    load          26 %         23 %
    store          9 %          9 %
    branch        16 %          8 %
    other         10 %         12 %

SPECint92: compress, eqntott, espresso, gcc, li
SPECfp92: doduc, ear, hydro2d, mdljdp2, su2cor

What is the average run-length between branches?
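One way to approach the question is to read the branch percentage as a frequency: if a fraction f of the dynamic instructions are branches, then on average one instruction in every 1/f is a branch. The sketch below only does that arithmetic on the two percentages from the table; `instructions_per_branch` is a hypothetical helper, and it counts the branch itself as part of the run.

```python
# Branch share of the dynamic instruction mix, in percent (from the table).
branch_percent = {"SPECint92": 16, "SPECfp92": 8}


def instructions_per_branch(percent):
    """Average number of dynamic instructions per branch, branch included."""
    return 100 / percent


run_length = {suite: instructions_per_branch(p)
              for suite, p in branch_percent.items()}
```

Under this reading, SPECint92 sees a branch about every 6 instructions and SPECfp92 about every 12 to 13.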

MIPS Branches and Jumps

Each instruction fetch depends on one or two pieces of information from the preceding instruction:
1. Is the preceding instruction a taken branch?
2. If so, what is the target address?

    Instruction    Taken known?    Target known?
    J
    JR
    BEQZ/BNEZ

Each cell is one of: After Inst. Decode, After Reg. Fetch, After Exec.

Currently our simple pipelined architecture does very simple branch prediction

- What is it?
  - Branch is predicted not taken: pc, pc+4, pc+8, ...
- Can we do better?

Branch Prediction Bits

Assume 2 BP bits per instruction. Use a saturating counter:

    11  Strongly taken
    10  Weakly taken
    01  Weakly ¬taken
    00  Strongly ¬taken

On taken: move up one state (toward 11). On ¬taken: move down one state (toward 00).
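The state table above amounts to a two-bit counter that increments on taken, decrements on not taken, and saturates at both ends. A minimal Python sketch follows; the function names are hypothetical, and the rule "predict taken when the high bit is set" is read off the state names rather than stated on the slide.

```python
STRONGLY_NOT_TAKEN, WEAKLY_NOT_TAKEN, WEAKLY_TAKEN, STRONGLY_TAKEN = 0b00, 0b01, 0b10, 0b11


def update(counter, taken):
    """Move one state toward 11 on taken, toward 00 on not taken."""
    if taken:
        return min(counter + 1, STRONGLY_TAKEN)
    return max(counter - 1, STRONGLY_NOT_TAKEN)


def predict_taken(counter):
    """States 10 and 11 predict taken; 00 and 01 predict not taken."""
    return counter >= WEAKLY_TAKEN


# A branch that is taken three times, falls through once, then is taken again.
counter = WEAKLY_NOT_TAKEN
predictions = []
for outcome in [True, True, True, False, True]:
    predictions.append(predict_taken(counter))
    counter = update(counter, outcome)
```

The single not-taken outcome only moves the counter from 11 to 10, so the branch is still predicted taken on the next encounter; this hysteresis is what the second bit buys over a one-bit predictor.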

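Branch History Table (BHT)

[Figure: k bits of the Fetch PC (above the two low-order 00 bits) form the BHT index into a 2^k-entry BHT with 2 bits/entry, which answers "Taken/¬Taken?". The I-Cache supplies the instruction; its opcode answers "Branch?" and its offset is added to the PC to form the Target PC.]

4K-entry BHT, 2 bits/entry, ~80-90% correct predictions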
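A software model makes the indexing concrete. The sketch below is illustrative only: the class name, the default k = 12 (4K entries) and the initial state 01 are assumptions, and the index is taken to be the k PC bits just above the two low-order zero bits, as the figure suggests.

```python
class BranchHistoryTable:
    def __init__(self, k=12):
        self.mask = (1 << k) - 1
        # Assumption: every counter starts at 01 (weakly not taken).
        self.counters = [0b01] * (1 << k)

    def index(self, pc):
        # Drop the two 00 byte-offset bits, keep the next k bits.
        return (pc >> 2) & self.mask

    def predict_taken(self, pc):
        return self.counters[self.index(pc)] >= 0b10

    def update(self, pc, taken):
        i = self.index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 0b11) if taken else max(c - 1, 0b00)


bht = BranchHistoryTable()
before = bht.predict_taken(0x1004)
bht.update(0x1004, taken=True)
after = bht.predict_taken(0x1004)
# A PC that differs only above the index bits shares the same entry.
aliased = bht.predict_taken(0x1004 + (4096 << 2))
```

Because the table is untagged, two branches whose PCs share the index bits update and read the same counter.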

Where does BHT fit in the processor pipeline?

- The BHT can only be used after instruction decode
  - What should we do at the fetch stage?
- Need a mechanism to update the BHT
  - Where does the update information come from?

Overview of branch prediction

    Stage       What is known
    PC          Need next PC immediately
    Decode      Instr type, PC-relative targets available
    Reg Read    Simple conditions, register targets available
    Execute     Complex conditions available

Tight loop: Next Addr Pred
Loose loop: BP, JMP, Ret

Best predictors reflect program behavior

Next Address Predictor (NAP): first attempt

[Figure: the PC addresses iMem and, through k of its bits, a Branch Target Buffer (2^k entries); each entry holds BPb and a predicted target, read out as BP and target.]

BP bits are stored with the predicted target address.

IF stage: nPC = if (BP = taken) then target else pc+4
Later: check the prediction; if wrong, kill the instruction and update the BTB and BPb; otherwise update only BPb.
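The two steps on this slide, predict in IF and check later, can be written down as a small model. This Python sketch is an illustration rather than the course's design: the class and method names are hypothetical, the index is assumed to be k PC bits above the byte offset, BPb is assumed to be the two-bit saturating counter from the earlier slide, and the stored target is overwritten only when a mispredicted instruction was actually taken.

```python
class NextAddressPredictor:
    def __init__(self, k):
        n = 1 << k
        self.mask = n - 1
        self.bp = [0b00] * n      # BPb: 2-bit saturating counters
        self.target = [0] * n     # predicted target per entry

    def index(self, pc):
        return (pc >> 2) & self.mask

    def next_pc(self, pc):
        """IF stage: nPC = if BP says taken then target else pc + 4."""
        i = self.index(pc)
        return self.target[i] if self.bp[i] >= 0b10 else pc + 4

    def resolve(self, pc, taken, actual_next_pc):
        """Later stage: check the prediction and update the entry.

        Returns True when the fetched successor must be killed.
        """
        mispredicted = self.next_pc(pc) != actual_next_pc
        i = self.index(pc)
        if mispredicted and taken:
            self.target[i] = actual_next_pc
        c = self.bp[i]
        self.bp[i] = min(c + 1, 0b11) if taken else max(c - 1, 0b00)
        return mispredicted


nap = NextAddressPredictor(k=7)
# A jump at PC 132 whose target is 236, executed three times.
outcomes = [nap.resolve(132, taken=True, actual_next_pc=236) for _ in range(3)]
```

With counters that start at 00, the jump is mispredicted twice before the counter crosses into the taken half; from then on the IF stage supplies 236 directly.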

Address Collisions

Assume a 128-entry NAP.

    NAP entry:            BPb = take, target = 236
    Instruction Memory:   132:   Jump 100
                          1028:  Add .....

What will be fetched after the instruction at 1028?

    NAP prediction = 236
    Correct target = 1032
    => kill PC=236 and fetch PC=1032

Is this a common occurrence? Can we avoid these bubbles?
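The collision can be replayed in a few lines. The sketch below makes two simplifying assumptions that are not on the slide: the NAP index is the PC modulo 128, which is the choice under which 132 and 1028 share an entry (with (PC >> 2) mod 128 these two particular addresses would not collide), and a single taken bit stands in for BPb. All names are hypothetical.

```python
ENTRIES = 128
targets = [None] * ENTRIES
taken_bits = [False] * ENTRIES   # one-bit stand-in for BPb


def nap_index(pc):
    # Assumption: low-order bits of the byte address select the entry.
    return pc % ENTRIES


def predict(pc):
    i = nap_index(pc)
    return targets[i] if taken_bits[i] else pc + 4


def train(pc, taken, target):
    i = nap_index(pc)
    taken_bits[i] = taken
    if taken:
        targets[i] = target


train(132, taken=True, target=236)   # the Jump at 132 fills the entry
predicted = predict(1028)            # the Add at 1028 reads the same entry
correct = 1028 + 4                   # an Add always falls through
must_kill = predicted != correct
```

The Add inherits the jump's prediction, so the instruction fetched from 236 has to be killed and fetch restarts at 1032.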

Use NAP for Control Instructions only

The NAP contains useful information for branch and jump instructions only
    => do not update it for other instructions

For all other instructions the next PC is (PC)+4!

How can this effect be achieved without decoding the instruction?
