1 Caches-2 Constructive Computer Architecture Arvind
Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November 3, 2014

2 Blocking vs. Non-Blocking cache
Blocking cache:
- At most one outstanding miss
- Cache must wait for memory to respond
- Cache does not accept requests in the meantime
Non-blocking cache:
- Multiple outstanding misses
- Cache can continue to process requests while waiting for memory to respond to misses
We will first design a write-back, write-miss-allocate, direct-mapped, blocking cache

3 Blocking Cache Interface
[Diagram: Processor sends req / receives resp (via hitQ); the cache holds status and missReq registers, and talks to DRAM or the next-level cache through mReqQ (memReq) and mRespQ (memResp)]

interface Cache;
  method Action req(MemReq r);
  method ActionValue#(Data) resp;
  method ActionValue#(MemReq) memReq;
  method Action memResp(Line r);
endinterface

4 Interface dynamics
- The cache either gets a hit and responds immediately, or it gets a miss, in which case it takes several steps to process the miss
- Reading the response dequeues it
- Requests and responses follow FIFO order
- Methods are guarded, e.g., the cache may not be ready to accept a request because it is processing a miss
- A status register keeps track of the state of the cache while it is processing a miss

typedef enum {Ready, StartMiss, SendFillReq, WaitFillResp}
  CacheStatus deriving (Bits, Eq);

5 Blocking Cache code structure

module mkCache(Cache);
  RegFile#(CacheIndex, Line) dataArray <- mkRegFileFull;
  …
  rule startMiss … endrule;
  method Action req(MemReq r) … endmethod;
  method ActionValue#(Data) resp … endmethod;
  method ActionValue#(MemReq) memReq … endmethod;
  method Action memResp(Line r) … endmethod;
endmodule

6 Extracting cache tags & index
[Address layout, byte addresses: | tag | index | L | 2 |; the index and offset fields together determine the cache size in bytes]
Processor requests are for a single word but internal communications are in line sizes (2^L words, typically L=2)
AddrSz = CacheTagSz + CacheIndexSz + L + 2
Need getIdx, getTag, getOffset functions

function CacheIndex getIdx(Addr addr) = truncate(addr >> 4);
function Bit#(2) getOffset(Addr addr) = truncate(addr >> 2);
function CacheTag getTag(Addr addr) = truncateLSB(addr);

(truncate drops the MSBs and keeps the low bits; truncateLSB drops the LSBs and keeps the high bits)
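The field extraction above can be mimicked with plain shifts and masks. The sketch below is a behavioral model, not the BSV: it fixes L=2 (4-word lines) as on the slide, and assumes a hypothetical 6-bit index width, which the slides leave abstract.

```python
# Behavioral model of getTag/getIdx/getOffset for a direct-mapped cache with
# 4-byte words and 4-word lines (L = 2). IDX_BITS is an assumed parameter.
IDX_BITS = 6          # assumed: 64 cache lines
OFFSET_SHIFT = 2      # low 2 bits select the byte within a word
IDX_SHIFT = 4         # 2 (byte) + L=2 (word-in-line) bits sit below the index

def get_offset(addr):                 # word within the line, like getOffset
    return (addr >> OFFSET_SHIFT) & 0b11

def get_idx(addr):                    # cache index, like getIdx: truncate(addr >> 4)
    return (addr >> IDX_SHIFT) & ((1 << IDX_BITS) - 1)

def get_tag(addr):                    # remaining high bits, like getTag: truncateLSB
    return addr >> (IDX_SHIFT + IDX_BITS)

addr = 0x12345678
print(get_offset(addr), get_idx(addr), hex(get_tag(addr)))
```

Reassembling tag, index, and offset (with the low byte bits zeroed) reproduces the word address, which is a handy sanity check on the field widths.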

7 Blocking cache state elements

RegFile#(CacheIndex, Line) dataArray <- mkRegFileFull;
RegFile#(CacheIndex, Maybe#(CacheTag)) tagArray <- mkRegFileFull;
RegFile#(CacheIndex, Bool) dirtyArray <- mkRegFileFull;
Fifo#(1, Data) hitQ <- mkBypassFifo;
Reg#(MemReq) missReq <- mkRegU;
Reg#(CacheStatus) status <- mkReg(Ready);
Fifo#(2, MemReq) memReqQ <- mkCFFifo;
Fifo#(2, Line) memRespQ <- mkCFFifo;

Tag and valid bits are kept together as a Maybe type
CF Fifos are preferable because they provide better decoupling; an extra cycle here may not affect the performance by much

8 Req method hit processing
It is straightforward to extend the cache interface to include a cache-line flush command

method Action req(MemReq r) if(status == Ready);
  let idx = getIdx(r.addr);
  let tag = getTag(r.addr);
  Bit#(2) wOffset = truncate(r.addr >> 2);
  let currTag = tagArray.sub(idx);
  let hit = isValid(currTag) ? fromMaybe(?, currTag) == tag : False;
  if(hit) begin
    let x = dataArray.sub(idx);
    if(r.op == Ld) hitQ.enq(x[wOffset]);
    else begin // St: overwrite the appropriate word of the line
      x[wOffset] = r.data;
      dataArray.upd(idx, x);
      dirtyArray.upd(idx, True);
    end
  end
  else begin
    missReq <= r;
    status <= StartMiss;
  end
endmethod
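The hit/miss decision in req can be traced with a small Python model (an illustration only, not the BSV): the three arrays become dicts keyed by index, a missing tag entry plays the role of an Invalid Maybe, and the class and field names are hypothetical.

```python
# Minimal model of the blocking cache's req method: a hit responds or updates
# immediately; a miss saves the request in missReq and leaves Ready state.
class BlockingCache:
    def __init__(self):
        self.tag = {}          # idx -> tag; absent key models Invalid (Maybe type)
        self.data = {}         # idx -> list of 4 words (one line)
        self.dirty = {}
        self.status = "Ready"
        self.miss_req = None
        self.hit_q = []        # models hitQ

    def req(self, op, idx, tag, offset, value=None):
        assert self.status == "Ready"           # guarded method
        if self.tag.get(idx) == tag:            # valid and tags match -> hit
            line = self.data[idx]
            if op == "Ld":
                self.hit_q.append(line[offset])
            else:                               # St: overwrite one word, mark dirty
                line[offset] = value
                self.dirty[idx] = True
        else:                                   # miss: hand off to the miss FSM
            self.miss_req = (op, idx, tag, offset, value)
            self.status = "StartMiss"

c = BlockingCache()
c.tag[3] = 0x48; c.data[3] = [10, 11, 12, 13]; c.dirty[3] = False
c.req("Ld", 3, 0x48, 2)       # hit: word 12 goes to hit_q
c.req("St", 3, 0x48, 1, 99)   # hit: line becomes dirty
c.req("Ld", 5, 0x7, 0)        # miss: status changes to StartMiss
print(c.hit_q, c.data[3], c.status)
```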

9 Rest of the methods

method ActionValue#(Data) resp;
  hitQ.deq;
  return hitQ.first;
endmethod

Memory-side methods:

method ActionValue#(MemReq) memReq;
  memReqQ.deq;
  return memReqQ.first;
endmethod

method Action memResp(Line r);
  memRespQ.enq(r);
endmethod

10 Start-miss and Send-fill rules
Ready -> StartMiss -> SendFillReq -> WaitFillResp -> Ready

rule startMiss(status == StartMiss);
  let idx = getIdx(missReq.addr);
  let tag = tagArray.sub(idx);
  let dirty = dirtyArray.sub(idx);
  if(isValid(tag) && dirty) begin // write-back
    let addr = {fromMaybe(?, tag), idx, 4'b0};
    let data = dataArray.sub(idx);
    memReqQ.enq(MemReq{op: St, addr: addr, data: data});
  end
  status <= SendFillReq;
endrule

rule sendFillReq(status == SendFillReq);
  memReqQ.enq(missReq);
  status <= WaitFillResp;
endrule
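The write-back address {fromMaybe(?, tag), idx, 4'b0} is just the victim's tag and index concatenated above four zeroed offset bits, i.e., the address of the start of its line. A quick check (assuming the same hypothetical 6-bit index as earlier):

```python
# Model of the startMiss write-back address {tag, idx, 4'b0}: tag and index
# with the 4 low offset bits zeroed. IDX_BITS = 6 is an assumed parameter.
IDX_BITS = 6

def writeback_addr(tag, idx):
    return (tag << (IDX_BITS + 4)) | (idx << 4)

# Round trip: splitting an address into (tag, idx) and rebuilding yields the
# line-aligned address the dirty victim is written back to.
addr = 0x12345678
tag, idx = addr >> (IDX_BITS + 4), (addr >> 4) & ((1 << IDX_BITS) - 1)
print(hex(writeback_addr(tag, idx)))
```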

11 Wait-fill rule
Ready -> StartMiss -> SendFillReq -> WaitFillResp -> Ready

rule waitFillResp(status == WaitFillResp);
  let idx = getIdx(missReq.addr);
  let tag = getTag(missReq.addr);
  Bit#(2) wOffset = truncate(missReq.addr >> 2);
  let data = memRespQ.first;
  tagArray.upd(idx, Valid (tag));
  if(missReq.op == Ld) begin
    dirtyArray.upd(idx, False);
    dataArray.upd(idx, data);
    hitQ.enq(data[wOffset]);
  end
  else begin
    data[wOffset] = missReq.data;
    dirtyArray.upd(idx, True);
    dataArray.upd(idx, data);
  end
  memRespQ.deq;
  status <= Ready;
endrule

Is there a problem with waitFillResp? What if the hitQ is blocked? Should we not at least write the data into the cache?
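End to end, the three rules perform: write back a dirty victim, fetch the requested line, install it, and complete the saved request. A self-contained Python sketch of those assumed semantics (memory modeled as a dict keyed by (tag, idx); all names hypothetical):

```python
# Model of the blocking cache miss path (startMiss -> sendFillReq ->
# waitFillResp). state: idx -> (tag, line, dirty); memory: (tag, idx) -> line.
def process_miss(state, memory, miss_req):
    op, idx, tag, offset, value = miss_req
    old = state.get(idx)                        # victim line, or None if invalid
    if old is not None and old[2]:              # valid && dirty -> write back
        memory[(old[0], idx)] = list(old[1])
    line = list(memory.get((tag, idx), [0, 0, 0, 0]))   # fill response
    if op == "Ld":
        state[idx] = (tag, line, False)         # clean line installed
        return line[offset]                     # response (goes to hitQ)
    line[offset] = value                        # St: update word in fresh line
    state[idx] = (tag, line, True)              # installed dirty
    return None

memory = {(0x7, 5): [1, 2, 3, 4]}
state = {5: (0x9, [8, 8, 8, 8], True)}          # dirty victim with tag 0x9
resp = process_miss(state, memory, ("Ld", 5, 0x7, 2, None))
print(resp, memory[(0x9, 5)])
```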

12 Hit and miss performance
Hit: combinational read/write, i.e., 0-cycle response
- Requires req and resp methods to be concurrently schedulable, which in turn requires hitQ.enq < {hitQ.deq, hitQ.first}, i.e., hitQ should be a bypass Fifo
Miss:
- No evacuation: memory-load latency plus combinational read/write
- Evacuation: memory-store latency followed by memory-load latency plus combinational read/write
Adding an extra cycle here and there in the miss case should not have a big negative performance impact

13 Non-blocking cache
[Diagram: Processor sends req to the cache and receives out-of-order (OOO) responses; the cache talks to memory through mReqQ (mReq) and mRespQ (mResp)]
Requests have to be tagged because responses come out of order
We will assume that all tags are unique and that the processor is responsible for reusing tags properly

14 Non-blocking Cache
Behavior is described by 2 concurrent FSMs that process input requests and memory responses, respectively
A St req goes into the StQ and waits until its data can be written into the cache
The Ld Buff holds load reqs waiting for data
An extra bit (W) in the cache indicates whether the data for a line is present
[Diagram: cache array with V, D, W, Tag, Data fields; StQ and Ld Buff sit between the processor's req/resp (hitQ) and the cache; wbQ, mReqQ, mRespQ connect to memory]

15 Incoming req
Ld req:
- In StQ? yes: bypass (respond with the StQ value)
- else, cache state V? yes: hit
- else: put in LdBuf; and if cache W is not set: send memReq, set W, and if (evacuate) send wbResp
St req:
- Cache state V and StQ empty? write in cache
- Cache state V but StQ not empty? put in StQ
- else: put in StQ; and if cache W is not set: send memReq, set W, and if (evacuate) send wbResp
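The decision tree above can be written as one function. This is a sketch with hypothetical names, returning the list of actions the flowchart takes for a single request; the cache-line state is passed in as booleans.

```python
# Model of the non-blocking cache's incoming-request flowchart.
def handle_req(op, in_stq, valid, stq_empty, wait_bit, must_evacuate):
    """Return the flowchart's actions for one Ld or St request."""
    acts = []
    if op == "Ld":
        if in_stq:
            return ["bypass"]                    # serviced straight from StQ
        if valid:
            return ["hit"]
        acts.append("put in LdBuf")
    else:                                        # St
        if valid:
            return ["write in cache"] if stq_empty else ["put in StQ"]
        acts.append("put in StQ")
    if not wait_bit:                             # line not already being fetched
        acts += ["send memReq", "set W"]
        if must_evacuate:
            acts.append("send wbResp")
    return acts

# A load that misses on a line needing evacuation triggers the full sequence:
print(handle_req("Ld", False, False, True, False, True))
```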

16 Mem Resp (line)
1. Update the cache line (set V and unset W)
2. Process all matching LdBuff entries and send responses
3. L: If cachestate(oldest StQ entry address) = V
      then update the cache word with the StQ entry; remove the oldest entry; loop back to L
      else if cachestate(oldest StQ entry address) = !W
      then { if (evacuate) wbResp; memReq for this store entry; set W }
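The three steps can be sketched as follows. This is a simplified model with hypothetical names: lines are tracked only by index, the LdBuff and StQ hold bare indices, and the evacuate/wbResp case in step 3 is omitted for brevity.

```python
# Model of non-blocking cache memory-response processing: install the line,
# drain matching LdBuff entries, then retire StQ entries from the head while
# they hit, stopping at the first store whose line must still be fetched.
def process_mem_resp(idx, cache, ld_buff, st_q):
    """cache: idx -> {'V': bool, 'W': bool}. Returns the actions taken."""
    actions = []
    cache[idx] = {"V": True, "W": False}                 # step 1: set V, unset W
    for i in [e for e in ld_buff if e == idx]:           # step 2: matching loads
        ld_buff.remove(i)
        actions.append(("ld resp", i))
    while st_q:                                          # step 3: loop L
        head = st_q[0]
        state = cache.get(head, {"V": False, "W": False})
        if state["V"]:
            st_q.pop(0)                                  # retire oldest store
            actions.append(("st write", head))
        else:
            if not state["W"]:                           # fetch its line
                actions += [("memReq", head), ("set W", head)]
                cache[head] = {"V": False, "W": True}
            break
    return actions

cache = {}
ld_buff = [3, 5, 3]
st_q = [3, 7]
acts = process_mem_resp(3, cache, ld_buff, st_q)
print(acts, ld_buff, st_q)
```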

17 Non-blocking Cache state declaration
Code has not been tested

module mkNBCache(NBCache);
  RegFile#(Index, Bool) valid <- mkRegFileFull;
  RegFile#(Index, Bool) dirty <- mkRegFileFull;
  RegFile#(Index, Bool) wait <- mkRegFileFull;
  RegFile#(Index, Tag) tagArray <- mkRegFileFull;
  RegFile#(Index, Data) dataArray <- mkRegFileFull;
  StQ#(StQSz) stQ <- mkStQ;
  LdBuff#(LdBuffSz) ldBuff <- mkLdBuff;
  FIFOF#(Tuple2#(Addr, Line)) wbQ <- mkFIFOF;
  FIFOF#(Addr) memReqQ <- mkFIFOF;
  FIFOF#(Tuple2#(Addr, Data)) memRespQ <- mkFIFOF;
  FIFOF#(Tuple2#(Id, Data)) respQ <- mkFIFOF;
  Reg#(Addr) addrResp <- mkRegU;
  Reg#(CacheState) buffSearch <- mkReg(None); // Either LdBuff, StQ or None

18 Non-blocking Cache req method
Code has not been tested

method Action req(MemReq r) if(buffSearch == None);
  let idx = getIdx(r.addr);
  let tag = getTag(r.addr);
  let line = dataArray.sub(idx);
  Bit#(2) offset = getOffset(r.addr);
  let v = valid.sub(idx);
  let t = tagArray.sub(idx);
  let d = dirty.sub(idx);
  if(r.op == Ld) begin
    let stqHit = stQ.search(r.addr);
    if(isValid(stqHit))
      respQ.enq(tuple2(r.id, fromMaybe(?, stqHit)));
    else if(t == tag && v)
      respQ.enq(tuple2(r.id, line[offset]));
    else begin
      ldBuff.enq(r);
      if(!wait.sub(idx)) begin
        memReqQ.enq(r.addr);
        wait.upd(idx, True);
        if(t != tag && d) begin // evacuate
          wbQ.enq(tuple2({t, idx, 4'b0}, line));
          tagArray.upd(idx, tag);
        end
      end
    end
  end
  else …

19 Non-blocking Cache req method (cont)
Code has not been tested

  else begin // store req
    if(t == tag && v)
      if(stQ.empty) begin
        line[offset] = r.data;
        dataArray.upd(idx, line);
        dirty.upd(idx, True);
      end
      else stQ.enq(r);
    else begin
      stQ.enq(r);
      if(!wait.sub(idx)) begin
        memReqQ.enq(r.addr);
        wait.upd(idx, True);
        if(t != tag && d) begin // evacuate
          wbQ.enq(tuple2({t, idx, 4'b0}, line));
          tagArray.upd(idx, tag);
        end
      end
    end
  end
endmethod

20 Non-blocking Cache Memory response processing
Code has not been tested

rule memResp(buffSearch == None);
  match {.addr, .data} = memRespQ.first;
  memRespQ.deq;
  dataArray.upd(getIdx(addr), data);
  valid.upd(getIdx(addr), True);
  wait.upd(getIdx(addr), False);
  dirty.upd(getIdx(addr), False);
  buffSearch <= LdBuff;
  addrResp <= addr;
endrule

rule clearLoad(buffSearch == LdBuff);
  let rMaybe = ldBuff.search(addrResp);
  if(isValid(rMaybe)) begin
    let r = fromMaybe(?, rMaybe);
    Bit#(2) offset = getOffset(r.addr);
    respQ.enq(tuple2(r.id, dataArray.sub(getIdx(r.addr))[offset]));
    ldBuff.remove(r.ldBuffId);
  end
  else buffSearch <= StQ;
endrule

21 Non-blocking Cache rules 2
Code has not been tested

rule clearStore(buffSearch == StQ);
  let r = stQ.first;
  let idx = getIdx(r.addr);
  let tag = getTag(r.addr);
  Bit#(2) offset = getOffset(r.addr);
  let line = dataArray.sub(idx);
  let v = valid.sub(idx);
  let t = tagArray.sub(idx);
  if(t == tag && v) begin
    line[offset] = r.data;
    dataArray.upd(idx, line);
    stQ.deq;
    dirty.upd(idx, True);
  end
  else begin
    if(!wait.sub(idx)) begin
      memReqQ.enq(r.addr);
      wait.upd(idx, True);
      if(t != tag && dirty.sub(idx)) begin // evacuate
        wbQ.enq(tuple2({t, idx, 4'b0}, line));
        tagArray.upd(idx, tag);
      end
    end
    buffSearch <= None;
  end
endrule

22 Non-blocking Cache methods (cont)
Code has not been tested

method ActionValue#(Addr) memReq;
  memReqQ.deq;
  return memReqQ.first;
endmethod

method ActionValue#(Tuple2#(Addr, Line)) wbResp;
  wbQ.deq;
  return wbQ.first;
endmethod

method Action memResp(Tuple2#(Addr, Data) r);
  memRespQ.enq(r);
endmethod

23 Four-Stage Pipeline
[Diagram: PC / Next Addr Pred → f12f2 → f2d → Decode → d2e → Execute → e2m → m2w, with the Epoch, Register File, Scoreboard, Inst Memory, and Data Memory alongside]
Insert bypass FIFOs to deal with (0, n)-cycle memory response

