Modeling Processors: Concurrency Issues

Slides:

Advertisements

Similar presentations

BSV execution model and concurrent rule scheduling Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology February.

Advertisements

Elastic Pipelines and Basics of Multi-rule Systems Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February.

Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology October 13, 2009http://csg.csail.mit.edu/koreaL12-1.

Modeling Processors Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 22, 2011L07-1

March, 2007http://csg.csail.mit.edu/arvindIPlookup-1 IP Lookup Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

September 24, L08-1 IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab.

December 12, 2006http://csg.csail.mit.edu/6.827/L24-1 Scheduling Primitives for Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Constructive Computer Architecture: Guards Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology September 24, 2014.

September 22, 2009http://csg.csail.mit.edu/koreaL07-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab.

Constructive Computer Architecture Sequential Circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology

Elastic Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 28, 2011L08-1http://csg.csail.mit.edu/6.375.

Constructive Computer Architecture: Control Hazards Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology October.

February 20, 2009http://csg.csail.mit.edu/6.375L08-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Modular Refinement Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 8,

October 22, 2009http://csg.csail.mit.edu/korea Modular Refinement Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

Computer Architecture: A Constructive Approach Bluespec execution model and concurrent rule scheduling Teacher: Yoav Etsion Taken (with permission) from.

October 20, 2009L14-1http://csg.csail.mit.edu/korea Concurrency and Modularity Issues in Processor pipelines Arvind Computer Science & Artificial Intelligence.

Modeling Processors Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 1, 2010

Elastic Pipelines: Concurrency Issues

Bluespec-3: A non-pipelined processor Arvind

Concurrency properties of BSV methods and rules

Bluespec-6: Modeling Processors

Bluespec-6: Modules and Interfaces

Scheduling Constraints on Interface methods

Blusepc-5: Dead cycles, bubbles and Forwarding in Pipelines Arvind

Sequential Circuits Constructive Computer Architecture Arvind

Sequential Circuits: Constructive Computer Architecture

Performance Specifications

Pipelining combinational circuits

Multirule Systems and Concurrent Execution of Rules

Constructive Computer Architecture: Guards

Sequential Circuits Constructive Computer Architecture Arvind

Pipelining combinational circuits

Modular Refinement Arvind

Modular Refinement Arvind

EHR: Ephemeral History Register

Blusepc-5: Dead cycles, bubbles and Forwarding in Pipelines Arvind

Bluespec-7: Scheduling & Rule Composition

Modules with Guarded Interfaces

Pipelining combinational circuits

Sequential Circuits - 2 Constructive Computer Architecture Arvind

Pipeline control unit (highly abstracted)

Elastic Pipelines: Concurrency Issues

Bluespec-3: A non-pipelined processor Arvind

Multirule systems and Concurrent Execution of Rules

Modular Refinement - 2 Arvind

IP Lookup: Some subtle concurrency issues

Computer Science & Artificial Intelligence Lab.

Elastic Pipelines: Concurrency Issues

Elastic Pipelines and Basics of Multi-rule Systems

Bluespec-5: Modeling Processors

Constructive Computer Architecture: Guards

GCD: A simple example to introduce Bluespec

Elastic Pipelines and Basics of Multi-rule Systems

Bluespec-7: Scheduling & Rule Composition

Control Hazards Constructive Computer Architecture: Arvind

Pipelined Processors Constructive Computer Architecture: Arvind

Pipeline control unit (highly abstracted)

Multirule systems and Concurrent Execution of Rules

IP Lookup: Some subtle concurrency issues

Bluespec-5: Scheduling & Rule Composition

Tutorial 4: RISCV modules Constructive Computer Architecture

Pipeline Control unit (highly abstracted)

Modeling Processors Arvind

Modeling Processors Arvind

Modular Refinement Arvind

Control Hazards Constructive Computer Architecture: Arvind

Implementing for Correct Concurrency

Modular Refinement Arvind

Bluespec-8: Modules and Interfaces

Presentation transcript:

Modeling Processors: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 27, 2008 http://csg.csail.mit.edu/6.375

The Plan Two-stage synchronous pipeline  Bypassing issues Two-stage asynchronous pipeline Concurrency Issues In the rest of the talk, I want to say a little bit more about how to describe hardware as an TRS Next I will explain how to synthesize these operation-centric descriptions. I have a few more design that I can show you in the result section. Finally, I will say a fews things about related work and future work Some understanding of simple processor pipelines is needed to follow this lecture February 27, 2008 http://csg.csail.mit.edu/6.375

Synchronous Pipeline rule SyncTwoStage (True); fetch & decode execute pc rf CPU buReg Synchronous Pipeline rule SyncTwoStage (True); let instr = iMem.read(pc); let stall = stallfuncR(instr,buReg); let fetchAction = action if(!stall) pc <= predIa; buReg <= (stall) ? Invalid : Valid newIt(instr); endaction; case (buReg) matches … endcase endcase endrule February 27, 2008 http://csg.csail.mit.edu/6.375

Synchronous Pipeline case (it) matches rule SyncTwoStage (True); … fetch & decode execute pc rf CPU buReg Synchronous Pipeline rule SyncTwoStage (True); … case (buReg) matches tagged Invalid: fetchAction; tagged Valid .it: begin case (it) matches tagged EAdd{dst:.rd,src1:.va,src2:.vb}: begin rf.upd(rd, va+vb); fetchAction; end tagged EBz {cond:.cv,addr:.av}: if (cv == 0) then begin pc <= av; buReg <= Invalid; end else fetchAction; tagged ELoad{dst:.rd,addr:.av}: begin rf.upd(rd, dMem.read(av)); fetchAction; end tagged EStore{value:.vv,addr:.av}: begin dMem.write(av, vv); fetchAction; end endcase endcase endrule February 27, 2008 http://csg.csail.mit.edu/6.375

The Stall Function function Bool stallfuncR (Instr instr, Maybe#(InstTemplate) buReg); case (buReg) matches tagged Invalid: return False; tagged Valid .it: case (instr) matches tagged Add {dst:.rd,src1:.ra,src2:.rb}: return (findf(ra,it) || findf(rb,it)); tagged Bz {cond:.rc,addr:.addr}: return (findf(rc,it) || findf(addr,it)); tagged Load {dst:.rd,addr:.addr}: return (findf(addr,it)); tagged Store {value:.v,addr:.addr}: return (findf(v,it) || findf(addr,it)); endcase endfunction This first rule describe what happens in the fetch of this pipeline. This rules says that given any processor term, we can always rewrite it to a new term wehre, and we enqueue the current instruction into bf. And at the same time the pc field is incremende by one It may look like this rule can fire forever, but because bf is bounded, when a rule enqueue into bf, there is an implied predicate that say bf must not be full. February 27, 2008 http://csg.csail.mit.edu/6.375 8 6 8 8

The findf function function Bool findf (RName r, InstrTemplate it); case (it) matches tagged EAdd{dst:.rd,op1:.v1,op2:.v2}: return (r == rd); tagged EBz {cond:.c,addr:.a}: return (False); tagged ELoad{dst:.rd,addr:.a}: tagged EStore{value:.v,addr:.a}: endcase endfunction February 27, 2008 http://csg.csail.mit.edu/6.375

Bypasses February 27, 2008 http://csg.csail.mit.edu/6.375

Bypassing will affect ... The newIt function: After decoding it must read the new register values if available (i.e., the values that are still to be committed in the register file) The Stall function: The instruction fetch must not stall if the new value of the register to be read exists In our specific design we never stall because the new register value will be available February 27, 2008 http://csg.csail.mit.edu/6.375

The bypassRF function function bypassRF(r,tobeCommitted); case (tobeCommitted) matches tagged (Valid {.rd, .v} &&& (r==rd)): return (v); tagged Invalid: return (rf[r]); endcase; endfunction February 27, 2008 http://csg.csail.mit.edu/6.375

Modified Decode function function InstrTemplate newItBy (instr,tobeCommitted); let bRf(x) = bypassRF(x, tobeCommitted); case (instr) matches tagged Add {dst:.rd,src1:.ra,src2:.rb}: return EAdd{dst:rd,op1:bRF(ra),op2:bRF(rb)}; tagged Bz {cond:.rc,addr:.addr}: return EBz{cond:bRF(rc),addr:bRF(addr)}; tagged Load {dst:.rd,addr:.addr}: return ELoad{dst:rd,addr:bRF(addr)}; tagged Store{value:.v,addr:.addr}: return EStore{value:bRF(v),addr:bRF(addr)}; endcase endfunction Replace each registerfile read by function bypassRF(ra) which will return the newly written value if it exists February 27, 2008 http://csg.csail.mit.edu/6.375

Synchronous Pipeline with bypassing rule SyncTwoStage (True); let instr = iMem.read(pc); let stall = newstallfuncR(instr,buReg); let fetchAction(tobeCommitted) = action if(!stall) pc <= predIa; buReg <= (stall) ? Invalid : Valid newByIt(instr,tobeCommitted); endaction; case (buReg) matches … endcase endcase endrule February 27, 2008 http://csg.csail.mit.edu/6.375

Synchronous Pipeline with bypassing rule SyncTwoStage (True); … case (buReg) matches tagged Invalid: fetchAction(Invalid); tagged Valid .it: begin case (it) matches tagged EAdd{dst:.rd,src1:.va,src2:.vb}: begin let v = va + vb; rf.upd(rd,t); fetchAction(Valid tuple2(rd,v));end tagged EBz {cond:.cv,addr:.av}: if (cv == 0) then begin pc <= av; buReg <= Invalid; end else fetchAction(Invalid); tagged ELoad{dst:.rd,addr:.av}: begin let v = dMem.read(av); rf.upd(rd,v); fetchAction(Valid tuple2(rd,v));end tagged EStore{value:.vv,addr:.av}: begin dMem.write(av, vv); fetchAction(Invalid);end endcase endcase endrule February 27, 2008 http://csg.csail.mit.edu/6.375

The New Stall Function function Bool newstallfuncR (Instr instr, Reg#(Maybe#(InstTemplate)) buReg); case (buReg) matches tagged Invalid: return False; tagged Valid .it: case (instr) matches tagged Add {dst:.rd,src1:.ra,src2:.rb}: return (findf(ra,it) || findf(rb,it)); … return (false); This first rule describe what happens in the fetch of this pipeline. This rules says that given any processor term, we can always rewrite it to a new term wehre, and we enqueue the current instruction into bf. And at the same time the pc field is incremende by one It may look like this rule can fire forever, but because bf is bounded, when a rule enqueue into bf, there is an implied predicate that say bf must not be full. Previously we stalled when ra matched the destination register of the instruction in the execute stage. Now we bypass that information when we read, so no stall is necessary. February 27, 2008 http://csg.csail.mit.edu/6.375 8 6 8 8

The Plan Two-stage synchronous pipeline Bypassing issues Two-stage asynchronous pipeline  Concurrency Issues In the rest of the talk, I want to say a little bit more about how to describe hardware as an TRS Next I will explain how to synthesize these operation-centric descriptions. I have a few more design that I can show you in the result section. Finally, I will say a fews things about related work and future work Some understanding of simple processor pipelines is needed to follow this lecture February 27, 2008 http://csg.csail.mit.edu/6.375

Two-stage Pipeline rule fetch_and_decode (!stallfunc(instr, bu)); fetch & decode execute pc rf CPU bu Two-stage Pipeline rule fetch_and_decode (!stallfunc(instr, bu)); bu.enq(newIt(instr,rf)); pc <= predIa; endrule Can these rules fire concurrently ? rule execute (True); case (it) matches tagged EAdd{dst:.rd,src1:.va,src2:.vb}: begin rf.upd(rd, va+vb); bu.deq(); end tagged EBz {cond:.cv,addr:.av}: if (cv == 0) then begin pc <= av; bu.clear(); end else bu.deq(); tagged ELoad{dst:.rd,addr:.av}: begin rf.upd(rd, dMem.read(av)); bu.deq(); end tagged EStore{value:.vv,addr:.av}: begin dMem.write(av, vv); bu.deq(); end endcase endrule Does it matter? February 27, 2008 http://csg.csail.mit.edu/6.375

The tension If the two rules never fire in the same cycle then the machine can hardly be called a pipelined machine If both rules fire every cycle they are enabled, then wrong results would be produced February 27, 2008 http://csg.csail.mit.edu/6.375

The compiler issue Can the compiler detect all the conflicting conditions? Important for correctness Does the compiler detect conflicts that do not exist in reality? False positives lower the performance The main reason is that sometimes the compiler cannot detect under what conditions the two rules are mutually exclusive or conflict free What can the user specify easily? Rule priorities to resolve nondeterministic choice In some situations correctness of the design is not enough; the design is not done unless the performance goals are met February 27, 2008 http://csg.csail.mit.edu/6.375

some insight into Concurrent rule firing Rules Ri Rj Rk rule steps Rj HW Rk clocks Ri There are more intermediate states in the rule semantics (a state after each rule step) In the HW, states change only at clock edges February 27, 2008 http://csg.csail.mit.edu/6.375

Parallel execution reorders reads and writes Rules rule steps reads writes reads writes reads writes reads writes reads writes reads writes reads writes clocks HW In the rule semantics, each rule sees (reads) the effects (writes) of previous rules In the HW, rules only see the effects from previous clocks, and only affect subsequent clocks February 27, 2008 http://csg.csail.mit.edu/6.375

Correctness Rules Ri Rj Rk rule steps Rj HW Rk clocks Ri Rules are allowed to fire in parallel only if the net state change is equivalent to sequential rule execution (i.e., CF or SC) Consequence: the HW can never reach a state unexpected in the rule semantics February 27, 2008 http://csg.csail.mit.edu/6.375

Compiler determines if two rules can be executed in parallel Rulea and Ruleb are conflict-free if s . pa(s)  pb(s)  1. pa(db(s))  pb(da(s)) 2. da(db(s)) == db(da(s)) D(Ra)  R(Rb) =  D(Rb)  R(Ra) =  R(Ra)  R(Rb) =  Rulea and Ruleb are sequentially composable if s . pa(s)  pb(s)  pb(da(s)) and dab(s) is implemented as db(da(s)) D(pb)  R(Ra) =  These properties can be determined by examining the domains and ranges of the rules in a pairwise manner. The single rewrite per cycle implementation is correct but not very good. Remember how we described the fetch stage and execute stage in separate rules. If we only fire one rule per clock cycle, the we are not going to get a piplined processor. These conditions are sufficient but not necessary. Parallel execution of CF and SC rules does not increase the critical path delay February 27, 2008 http://csg.csail.mit.edu/6.375 8 8 6 8

Concurrency analysis Two-stage Pipeline fetch & decode execute pc rf CPU bu Concurrency analysis Two-stage Pipeline rule fetch_and_decode (!stallfunc(instr, bu)); bu.enq(newIt(instr,rf)); pc <= predIa; endrule conflicts around: pc, bu, rf rule execute (True); case (it) matches tagged EAdd{dst:.rd,src1:.va,src2:.vb}: begin rf.upd(rd, va+vb); bu.deq(); end tagged EBz {cond:.cv,addr:.av}: if (cv == 0) then begin pc <= av; bu.clear(); end else bu.deq(); tagged ELoad{dst:.rd,addr:.av}: begin rf.upd(rd, dMem.read(av)); bu.deq(); end tagged EStore{value:.vv,addr:.av}: begin dMem.write(av, vv); bu.deq(); end endcase endrule Let us split this rule for the sake of analysis February 27, 2008 http://csg.csail.mit.edu/6.375

Concurrency analysis Add Rule fetch & decode execute pc rf CPU bu Concurrency analysis Add Rule rule fetch_and_decode (!stallfunc(instr, bu)); bu.enq(newIt(instr,rf)); pc <= predIa; endrule rf: sub bu: find, enq pc: read,write rule execAdd (it matches tagged EAdd{dst:.rd,src1:.va,src2:.vb}); rf.upd(rd, va+vb); bu.deq(); endrule fetch < execAdd  execAdd < fetch  execAdd rf: upd bu: first, deq Do either of these concurrency properties hold ? February 27, 2008 http://csg.csail.mit.edu/6.375

Concurrency analysis One Element “Loopy” FIFO From Lecture L07 module mkLFIFO1 (FIFO#(t)); Reg#(t) data <- mkRegU(); Reg#(Bool) full <- mkReg(False); RWire#(void) deqEN <- mkRWire(); Bool deqp = isValid (deqEN.wget())); method Action enq(t x) if (!full || deqp); full <= True; data <= x; endmethod method Action deq() if (full); full <= False; deqEN.wset(?); method t first() if (full); return (data); method Action clear(); full <= False; endmodule not empty not full rdy enab enq deq module FIFO !full or February 27, 2008 http://csg.csail.mit.edu/6.375

One Element Searchable FIFO module mkSFIFO1#(function Bool findf(tr r, t x)) (SFIFO#(t,tr)); Reg#(t) data <- mkRegU(); Reg#(Bool) full <- mkReg(False); RWire#(void) deqEN <- mkRWire(); Bool deqp = isValid (deqEN.wget())); method Action enq(t x) if (!full || deqp); full <= True; data <= x; endmethod method Action deq() if (full); full <= False; deqEN.wset(?); method t first() if (full); return (data); method Action clear(); full <= False; method Bool find(tr r); return (findf(r, data) && full); endmethod endmodule (full && !deqp)); February 27, 2008 http://csg.csail.mit.edu/6.375

Register File concurrency properties Register File implementation would guarantees: rf.sub < rf.upd that is, reads happen before writes in concurrent execution But concurrent rf.sub(r1) and rf.upd(r2,v) where r1 ≠ r2 behaves like both rf.sub(r1) < rf.upd(r2,v) rf.sub(r1) > rf.upd(r2,v) February 27, 2008 http://csg.csail.mit.edu/6.375

Concurrency analysis Two-stage Pipeline fetch & decode execute pc rf CPU bu Concurrency analysis Two-stage Pipeline rule fetch_and_decode (!stallfunc(instr, bu)); bu.enq(newIt(instr,rf)); pc <= predIa; endrule rf: sub bu: find, enq pc: read,write rule execAdd (it matches tagged EAdd{dst:.rd,src1:.va,src2:.vb}); rf.upd(rd, va+vb); bu.deq(); endrule fetch < execAdd  rf: sub < upd bu: {find, enq} < {first , deq} execAdd < fetch  rf: sub > upd bu: {find, enq} > {first , deq} execAdd rf: upd bu: first, deq Do either of these concurrency properties hold ? February 27, 2008 http://csg.csail.mit.edu/6.375

What concurrency do we want? fetch & decode execute pc rf CPU bu What concurrency do we want? Suppose bu is empty initially If fetch and execAdd happened in the same cycle and the meaning was: fetch < execAdd execAdd < fetch February 27, 2008 http://csg.csail.mit.edu/6.375

Concurrency analysis Branch Rules fetch & decode execute pc rf CPU bu Concurrency analysis Branch Rules rule fetch_and_decode (!stallfunc(instr, bu)); bu.enq(newIt(instr,rf)); pc <= predIa; endrule rule execBzTaken(it matches tagged Bz {cond:.cv,addr:.av} &&& (cv == 0)); pc <= av; bu.clear(); endrule rule execBzNotTaken(it matches tagged Bz {cond:.cv,addr:.av} &&& !(cv == 0)); bu.deq(); endrule execBzTaken < fetch ? execBzNotTaken < fetch ? February 27, 2008 http://csg.csail.mit.edu/6.375

Concurrency analysis Load-Store Rules fetch & decode execute pc rf CPU bu Concurrency analysis Load-Store Rules rule fetch_and_decode (!stallfunc(instr, bu)); bu.enq(newIt(instr,rf)); pc <= predIa; endrule rule execLoad(it matches tagged ELoad{dst:.rd,addr:.av}); rf.upd(rd, dMem.read(av)); bu.deq(); endrule rule execStore(it matches tagged EStore{value:.vv,addr:.av}); dMem.write(av, vv); bu.deq(); endrule execLoad < fetch ? execStore < fetch ? February 27, 2008 http://csg.csail.mit.edu/6.375

Properties Required of Register File and FIFO for Instruction Pipelining rf.upd(r1, v) < rf.sub(r2) Our construction of stall guarantees that r1 ≠ r2 in concurrent calls We can assert no conflict here FIFO bu: {first , deq} < {find, enq}  bu.first < bu.find bu.first < bu.enq bu.deq < bu.find bu.deq < bu.enq February 27, 2008 http://csg.csail.mit.edu/6.375

Concurrency analysis Two-stage Pipeline fetch & decode execute pc rf CPU bu Concurrency analysis Two-stage Pipeline rule fetch_and_decode (!stallfunc(instr, bu)); bu.enq(newIt(instr,rf)); pc <= predIa; endrule It all works rule execAdd (it matches tagged EAdd{dst:.rd,src1:.va,src2:.vb}); rf.upd(rd, va+vb); bu.deq(); endrule rule execBz(it matches tagged Bz {cond:.cv,addr:.av}); if (cv == 0) then begin pc <= av; bu.clear(); end else bu.deq(); endrule rule execLoad(it matches tagged ELoad{dst:.rd,addr:.av}); rf.upd(rd, dMem.read(av)); bu.deq(); endrule rule execStore(it matches tagged EStore{value:.vv,addr:.av}); dMem.write(av, vv); bu.deq(); endrule February 27, 2008 http://csg.csail.mit.edu/6.375

Lot of nontrivial analysis but no change in processor code! Needed Fifos and Register files with the appropriate concurrency properties February 27, 2008 http://csg.csail.mit.edu/6.375

Processor Pipelines CPU pc rf fetch decode execute memory write- back iMem dMem CPU The problem of exploiting the right amount of concurrency is quite difficult in complex processor pipelines February 27, 2008 http://csg.csail.mit.edu/6.375