Bluespec-7: Scheduling & Rule Composition

Bluespec-7: Scheduling & Rule Composition Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 28, 2007 http://csg.csail.mit.edu/6.375/

GAA Execution model
Repeatedly:
  Select a rule to execute
  Compute the state updates
  Make the state updates
Highly non-deterministic. User annotations can help in rule selection.
Implementation concern: schedule multiple rules concurrently without violating one-rule-at-a-time semantics.

Rule: As a State Transformer
A rule may be decomposed into two parts, π(s) and δ(s), such that
    s_next = if π(s) then δ(s) else s
π(s) is the condition (predicate) of the rule, a.k.a. the "CAN_FIRE" signal of the rule (the conjunction of explicit and implicit conditions).
δ(s) is the "state transformation" function, i.e., it computes the next-state values in terms of the current state values.
Abstractly, we can think of a rule as having two parts, a π function and a δ function. The π function tells us whether the rule can be applied to a term s. If π evaluates to true, then the δ function tells us what the new term is. If π is false, the rule cannot change s.
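The decomposition of a rule into a predicate and a state-transformation function can be sketched as a small executable model. This is a Python illustration, not BSV; the `Rule` class, the `run` loop, and the example rule `inc` are hypothetical names introduced here, and "pick the first enabled rule" is just one arbitrary resolution of the non-determinism.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]

@dataclass
class Rule:
    name: str
    pi: Callable[[State], bool]      # predicate: the rule's CAN_FIRE condition
    delta: Callable[[State], State]  # state transformer: computes the next state

    def step(self, s: State) -> State:
        # s_next = if pi(s) then delta(s) else s
        return self.delta(s) if self.pi(s) else s

# Hypothetical rule (not from the slides): increment x while it is below 3.
inc = Rule("inc", lambda s: s["x"] < 3, lambda s: {**s, "x": s["x"] + 1})

def run(rules: List[Rule], s: State, max_steps: int = 100) -> State:
    # One-rule-at-a-time execution: select an enabled rule, apply it, repeat.
    for _ in range(max_steps):
        enabled = [r for r in rules if r.pi(s)]
        if not enabled:
            break
        s = enabled[0].step(s)  # assumption: always pick the first enabled rule
    return s

final = run([inc], {"x": 0})
```

Here `final` ends with `x` equal to 3, after which the predicate is false and no rule can change the state.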

Compiling a Rule
    rule r (f.first() > 0);
        x <= x + 1;
        f.deq();
    endrule
[Figure: the current state (FIFO f and register x) feeds two blocks, π and δ. π produces the enable; δ produces the next-state values. Inputs from the current state: rdy signals and read methods. Outputs to the next state: enable signals and action parameters.]
π = enabling condition
δ = action signals & values
In a circuit, π maps to combinational logic that looks at the current state and generates a Boolean enable signal for this rule. The δ function is another block of combinational logic that computes the next-state value from the current-state value. More precisely, δ has to compute the control signals that set the state element to the new value.

Combining State Updates: strawman
[Figure: the π's (π_1 ... π_n) from the rules that update R are ORed to form the latch enable of R; the δ's (δ_1,R ... δ_n,R) from the rules that update R are combined to form the next state value of R.]
What if more than one rule is enabled?
After mapping all the rules, we have to combine their logic somehow. For a particular state element such as the PC register, the latch enable is the OR of the enable signals from all the rules that update PC. The actual next-state value of PC has to be selected through a multiplexer. Notice that this circuit works only if at most one of these π signals is asserted at a time.

Combining State Updates
[Figure: the π's from all the rules (π_1 ... π_n) enter the Scheduler, a priority encoder, which produces φ_1 ... φ_n. The latch enable of R is an OR over the signals of the rules that update R; the δ's (δ_1,R ... δ_n,R) from the rules that update R are combined to form the next state value.]
Scheduler ensures that at most one φ_i is true.

One-rule-at-a-time Scheduler
[Figure: π_1, π_2 ... π_n enter the Scheduler (priority encoder), which produces φ_1, φ_2 ... φ_n.]
1. φ_i ⇒ π_i
2. π_1 ∨ π_2 ∨ .... ∨ π_n ⇒ φ_1 ∨ φ_2 ∨ .... ∨ φ_n
3. One rewrite at a time, i.e., at most one φ_i is true
Very conservative way of guaranteeing correctness.
To resolve this, in the simplest implementation, instead of using the π signals to enable state elements directly, we can use arbitrated φ signals. The φ signals are generated by a priority encoder that asserts only one φ signal, one whose corresponding π signal is asserted. With this scheduler we are, in effect, creating an implementation that executes one rule per clock cycle.
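The three scheduler properties can be checked on a small software model of a priority encoder. The following Python sketch is only an illustration of the combinational behaviour; the function names and the choice of "lowest index wins" as the priority order are assumptions.

```python
from typing import List

def priority_encoder(pi: List[bool]) -> List[bool]:
    """Return phi with at most one bit set: the lowest-index asserted pi wins."""
    phi = [False] * len(pi)
    for i, can_fire in enumerate(pi):
        if can_fire:
            phi[i] = True
            break
    return phi

def check_properties(pi: List[bool], phi: List[bool]) -> bool:
    p1 = all((not f) or p for f, p in zip(phi, pi))  # 1. phi_i implies pi_i
    p2 = (not any(pi)) or any(phi)                   # 2. OR(pi) implies OR(phi)
    p3 = sum(phi) <= 1                               # 3. at most one phi_i is true
    return p1 and p2 and p3

phi = priority_encoder([False, True, True])  # rules 1 and 2 can fire; rule 1 is chosen
ok = check_properties([False, True, True], phi)
```

With this input only the second rule is selected, so `phi` is `[False, True, False]` and all three properties hold.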

Executing Multiple Rules Per Cycle: Conflict-free rules
    rule ra (z > 10);
        x <= x + 1;
    endrule
    rule rb (z > 20);
        y <= y + 2;
    endrule
Parallel execution behaves like ra < rb = rb < ra.
Rule_a and Rule_b are conflict-free if
    ∀s . π_a(s) ∧ π_b(s) ⇒
        1. π_a(δ_b(s)) ∧ π_b(δ_a(s))
        2. δ_a(δ_b(s)) == δ_b(δ_a(s))
Parallel execution can also be understood in terms of a composite rule:
    rule ra_rb ((z > 10) && (z > 20));
        x <= x + 1;
        y <= y + 2;
    endrule
The single-rewrite-per-cycle implementation is correct but not very good. Remember how we described the fetch stage and the execute stage in separate rules. If we fire only one rule per clock cycle, then we are not going to get a pipelined processor.
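The conflict-free definition can be checked by brute force over a small set of states. The sketch below is a Python model (not BSV) of the rules ra and rb from this slide; the function names `pa`, `da`, `pb`, `db`, `parallel` and the sampled state values are assumptions made for illustration.

```python
from itertools import product

def pa(s): return s["z"] > 10                  # predicate of ra
def da(s): return {**s, "x": s["x"] + 1}       # action of ra
def pb(s): return s["z"] > 20                  # predicate of rb
def db(s): return {**s, "y": s["y"] + 2}       # action of rb

def parallel(s):
    # Both rules read the same pre-clock state; writes land together at the clock edge.
    nxt = dict(s)
    if pa(s):
        nxt["x"] = s["x"] + 1
    if pb(s):
        nxt["y"] = s["y"] + 2
    return nxt

def conflict_free(states):
    for s in states:
        if pa(s) and pb(s):
            if not (pa(db(s)) and pb(da(s))):  # condition 1: neither disables the other
                return False
            if da(db(s)) != db(da(s)):         # condition 2: both orders agree
                return False
    return True

# Small sampled state space (an assumption; a real check would be symbolic).
states = [{"x": x, "y": y, "z": z}
          for x, y, z in product(range(3), range(3), (0, 15, 25))]
cf = conflict_free(states)
same = all(parallel(s) == db(da(s)) == da(db(s))
           for s in states if pa(s) and pb(s))
```

Both `cf` and `same` come out true on these states: the one-clock parallel update matches either sequential order.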

Executing Multiple Rules Per Cycle: Sequentially Composable rules
    rule ra (z > 10);
        x <= y + 1;
    endrule
    rule rb (z > 20);
        y <= y + 2;
    endrule
Parallel execution behaves like ra < rb.
Rule_a and Rule_b are sequentially composable if
    ∀s . π_a(s) ∧ π_b(s) ⇒ π_b(δ_a(s))
Parallel execution can also be understood in terms of a composite rule:
    rule ra_rb ((z > 10) && (z > 20));
        x <= y + 1;
        y <= y + 2;
    endrule
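A short Python model (not BSV) makes the asymmetry concrete: when both rules read the pre-clock state and write at the clock edge, the result equals executing ra and then rb, but not rb and then ra. The function names and the starting state `s0` are assumptions chosen for illustration.

```python
def pa(s): return s["z"] > 10                  # predicate of ra
def da(s): return {**s, "x": s["y"] + 1}       # ra: x <= y + 1
def pb(s): return s["z"] > 20                  # predicate of rb
def db(s): return {**s, "y": s["y"] + 2}       # rb: y <= y + 2

def parallel(s):
    # Hardware view: every read sees the state from before the clock edge.
    nxt = dict(s)
    if pa(s):
        nxt["x"] = s["y"] + 1
    if pb(s):
        nxt["y"] = s["y"] + 2
    return nxt

s0 = {"x": 0, "y": 5, "z": 25}   # hypothetical state in which both rules are enabled
hw = parallel(s0)
ra_then_rb = db(da(s0))          # ra reads the old y, then rb updates y
rb_then_ra = da(db(s0))          # ra would read the y already updated by rb
```

From `s0`, `hw` and `ra_then_rb` both give x = 6 and y = 7, while `rb_then_ra` gives x = 8, so only the order ra < rb is a valid explanation of the parallel execution.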

Sequentially Composable rules ...
    rule ra (z > 10);
        x <= 1;
    endrule
    rule rb (z > 20);
        x <= 2;
    endrule
Parallel execution can behave either like ra < rb or rb < ra, but the two behaviors are not the same.
Composite rules:
  Behavior ra < rb:
    rule ra_rb (z > 10 && z > 20);
        x <= 2;
    endrule
  Behavior rb < ra:
    rule rb_ra (z > 10 && z > 20);
        x <= 1;
    endrule

Compiler determines if two rules can be executed in parallel
Rule_a and Rule_b are conflict-free if
    ∀s . π_a(s) ∧ π_b(s) ⇒
        1. π_a(δ_b(s)) ∧ π_b(δ_a(s))
        2. δ_a(δ_b(s)) == δ_b(δ_a(s))
  Check: D(Ra) ∩ R(Rb) = ∅, D(Rb) ∩ R(Ra) = ∅, R(Ra) ∩ R(Rb) = ∅
Rule_a and Rule_b are sequentially composable if
    ∀s . π_a(s) ∧ π_b(s) ⇒ π_b(δ_a(s))
  Check: D(π_b) ∩ R(Ra) = ∅
Here D is the domain (the state read) and R is the range (the state written).
These properties can be determined by examining the domains and ranges of the rules in a pairwise manner.
These conditions are sufficient but not necessary.
Parallel execution of CF and SC rules does not increase the critical path delay.
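The pairwise domain/range test can be sketched with plain sets. This is a Python illustration, not the compiler's actual algorithm: the `RuleInfo` record and function names are hypothetical. Note one deliberate deviation: the slide's sequential-composability check mentions only the domain of the predicate π_b, whereas this sketch uses the whole domain of Rb (predicate and body), a more conservative assumption that also rules out the later rule reading a value the earlier rule writes.

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class RuleInfo:
    reads: FrozenSet[str]   # D(R): state read by the predicate or the body
    writes: FrozenSet[str]  # R(R): state written by the body

def conflict_free(a: RuleInfo, b: RuleInfo) -> bool:
    # D(Ra) ∩ R(Rb) = ∅, D(Rb) ∩ R(Ra) = ∅, R(Ra) ∩ R(Rb) = ∅
    return (not (a.reads & b.writes)
            and not (b.reads & a.writes)
            and not (a.writes & b.writes))

def seq_composable(a: RuleInfo, b: RuleInfo) -> bool:
    # a < b: b must not read anything a writes (conservative variant, see lead-in).
    return not (b.reads & a.writes)

# Conflict-free pair: ra does x <= x + 1, rb does y <= y + 2, both guarded on z.
ra_cf = RuleInfo(frozenset({"z", "x"}), frozenset({"x"}))
rb_cf = RuleInfo(frozenset({"z", "y"}), frozenset({"y"}))

# Sequentially composable pair: ra does x <= y + 1, rb does y <= y + 2.
ra_sc = RuleInfo(frozenset({"z", "y"}), frozenset({"x"}))
rb_sc = RuleInfo(frozenset({"z", "y"}), frozenset({"y"}))

results = (conflict_free(ra_cf, rb_cf),
           conflict_free(ra_sc, rb_sc),
           seq_composable(ra_sc, rb_sc),
           seq_composable(rb_sc, ra_sc))
```

The second pair is not conflict-free, is composable as ra < rb, and is not composable as rb < ra, which agrees with the behaviour described for those rules.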

Mutually Exclusive Rules
Rule_a and Rule_b are mutually exclusive if they can never be enabled simultaneously:
    ∀s . π_a(s) ⇒ ~π_b(s)
Mutually-exclusive rules are conflict-free even if they write the same state.
Mutual-exclusion analysis brings down the cost of conflict-free analysis.
As an implementation consideration, we are going to require that the result of a sequential application can be reconstructed by combining the effects of applying the two rules directly on s. Otherwise, we would have to cascade their combinational logic, which may lead to a longer cycle time. We say two rules are conflict-free if they satisfy all these conditions.

Multiple-Rules-per-Cycle Scheduler
[Figure: π_1 ... π_n enter a set of schedulers, which produce φ_1, φ_2 ... φ_n.]
Divide the rules into smallest conflicting groups; provide a scheduler for each group.
1. φ_i ⇒ π_i
2. π_1 ∨ π_2 ∨ .... ∨ π_n ⇒ φ_1 ∨ φ_2 ∨ .... ∨ φ_n
3. Multiple operations such that φ_i ∧ φ_j ⇒ R_i and R_j are conflict-free or sequentially composable
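One way to picture "a scheduler per conflicting group" is to treat conflicts as edges of a graph, take its connected components as groups, and run a priority encoder inside each group. The Python sketch below does this under stated assumptions: the conflict pairs are given as input, priority within a group is by rule index, and grouping by connected components is a conservative simplification rather than the compiler's actual method.

```python
from typing import Dict, List, Set, Tuple

def conflict_groups(n: int, conflicts: Set[Tuple[int, int]]) -> List[List[int]]:
    """Partition rules 0..n-1 into connected components of the conflict graph."""
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in conflicts:
        parent[find(a)] = find(b)
    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def schedule(pi: List[bool], groups: List[List[int]]) -> List[bool]:
    """One priority encoder per group: at most one rule fires in each group."""
    phi = [False] * len(pi)
    for group in groups:
        for i in group:          # assumption: lower rule index has higher priority
            if pi[i]:
                phi[i] = True
                break
    return phi

# Hypothetical system: rules 0 and 1 conflict, rules 2 and 3 conflict.
groups = conflict_groups(4, {(0, 1), (2, 3)})
phi = schedule([True, True, False, True], groups)
```

In this example rules 0 and 3 fire in the same cycle, one from each group, while rule 1 is suppressed by its conflict with rule 0.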

Muxing structure
Muxing logic requires determining, for each register (action method), the rules that update it and under what conditions.
Conflict-free (mutually exclusive): next value = (δ_1 and π_1) or (δ_2 and π_2)
  CF rules either do not update the same element or are ME: π_1 ⇒ ~π_2
Sequentially composable: next value = (δ_1 and π_1 and ~π_2) or (δ_2 and π_2)
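The two AND-OR structures can be modelled on integers, with bitwise OR standing in for the OR gate. This Python sketch is an illustration only; the function names and the explicit latch-enable handling are assumptions.

```python
def and_or_mux(terms):
    """terms: list of (select, value). OR together the values whose select is set."""
    out = 0
    for select, value in terms:
        out |= value if select else 0
    return out

def next_cf(cur, p1, d1, p2, d2):
    # Conflict-free / mutually exclusive writers: plain AND-OR mux.
    enable = p1 or p2
    return and_or_mux([(p1, d1), (p2, d2)]) if enable else cur

def next_sc(cur, p1, d1, p2, d2):
    # Sequentially composable, rule1 < rule2: rule2's write masks rule1's.
    enable = p1 or p2
    return and_or_mux([(p1 and not p2, d1), (p2, d2)]) if enable else cur

only_rule1 = next_cf(7, True, 1, False, 2)   # one writer enabled
held = next_cf(7, False, 1, False, 2)        # no writer enabled, register holds
last_wins = next_sc(7, True, 1, True, 2)     # both enabled, later rule wins
garbage = next_cf(7, True, 1, True, 2)       # ME assumption violated: values are ORed
```

The last call shows why the plain AND-OR form is safe only when the enables are mutually exclusive: with both selects asserted it yields 1 | 2 = 3, a value neither rule wrote.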

Scheduling and control logic
[Figure: Modules (current state) feed the Rules. Each rule's condition π_i produces a "CAN_FIRE" signal for the Scheduler, which outputs the "WILL_FIRE" signals φ_i. Each rule's action δ_i passes through the Muxing logic, controlled by the φ's, into Modules (next state).]

Some insight, pictorially
[Figure: rules Ri, Rj, Rk shown as a sequence of rule steps in the rule semantics, and the same rules as executed by the HW across clocks.]
There are more intermediate states in the rule semantics (a state after each rule step).
In the HW, states change only at clock edges.

Parallel execution reorders reads and writes
[Figure: in the rule semantics, rule steps alternate reads and writes; in the HW, the reads and writes are grouped by clock.]
In the rule semantics, each rule sees (reads) the effects (writes) of previous rules.
In the HW, rules only see the effects from previous clocks, and only affect subsequent clocks.

Correctness
[Figure: rules Ri, Rj, Rk shown as a sequence of rule steps in the rule semantics, and the same rules as executed by the HW across clocks.]
Rules are allowed to fire in parallel only if the net state change is equivalent to sequential rule execution (i.e., CF or SC).
Consequence: the HW can never reach a state unexpected in the rule semantics.

Synthesis Summary
Bluespec generates a combinational hardware scheduler allowing multiple enabled rules to execute in the same clock cycle.
The hardware makes a rule-execution decision on every clock (i.e., it is not a static schedule).
Among those rules that CAN_FIRE, only a subset WILL_FIRE that is consistent with a rule order.
Since multiple rules can write to a common piece of state, the compiler introduces appropriate muxing logic.
For proper pipelining, dead-cycle elimination and value forwarding, the user needs some understanding and control of scheduling.

Two-stage Pipeline
[Figure: CPU with a fetch & decode stage and an execute stage connected by the buffer bu; state elements pc and rf.]
    rule fetch_and_decode (!stallfunc(instr, bu));
        bu.enq(newIt(instr, rf));
        pc <= predIa;
    endrule
    rule execute (True);
        case (it) matches
            tagged EAdd {dst:.rd, src1:.va, src2:.vb}: begin
                rf.upd(rd, va+vb); bu.deq();
            end
            tagged EBz {cond:.cv, addr:.av}:
                if (cv == 0) begin
                    pc <= av; bu.clear();
                end else bu.deq();
            tagged ELoad {dst:.rd, addr:.av}: begin
                rf.upd(rd, dMem.read(av)); bu.deq();
            end
            tagged EStore {value:.vv, addr:.av}: begin
                dMem.write(av, vv); bu.deq();
            end
        endcase
    endrule
Can these rules fire concurrently? No! There are conflicts around pc, bu and rf.
Are they sequentially composable? Try:
1. fetch < execute
2. execute < fetch

Two-stage Pipeline Analysis
[Figure: CPU with a fetch & decode stage and an execute stage connected by the buffer bu; state elements pc and rf.]
1. fetch < execute: will behave like a non-pipelined machine
2. execute < fetch: will behave like a pipeline with bypasses

Scheduling expectations: execute < fetch schedule
    rule fetch_and_decode (!stallfunc(instr, bu));
        bu.enq(newIt(instr, rf));
        pc <= predIa;
    endrule
    rule execute (True);
        case (it) matches
            tagged EAdd {dst:.rd, src1:.va, src2:.vb}: begin
                rf.upd(rd, va+vb); bu.deq();
            end
            tagged EBz {cond:.cv, addr:.av}:
                if (cv == 0) begin
                    pc <= av; bu.clear();
                end else bu.deq();
            tagged ELoad {dst:.rd, addr:.av}: begin
                rf.upd(rd, dMem.read(av)); bu.deq();
            end
            tagged EStore {value:.vv, addr:.av}: begin
                dMem.write(av, vv); bu.deq();
            end
        endcase
    endrule
pc: conflict in case of Bz. Separate the Bz part of the rule.
bu: (first < deq) < (find < enq). What about clear?
rf: upd < sub

One Element FIFO Analysis
    module mkFIFO1 (FIFO#(t));
        Reg#(t)    data <- mkRegU();
        Reg#(Bool) full <- mkReg(False);
        method Action enq(t x) if (!full);
            full <= True; data <= x;
        endmethod
        method Action deq() if (full);
            full <= False;
        endmethod
        method t first() if (full);
            return (data);
        endmethod
        method Action clear();
            full <= False;
        endmethod
    endmodule
Expectation for bu: (first < deq) < (find < enq)
deq < enq? No. The two methods are ME.
first < deq? Yes.
first < enq? No. The two methods are ME.
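Why deq and enq can never fire in the same cycle shows up directly in a cycle-level model: their guards are `full` and `!full`. The Python sketch below mirrors the one-element FIFO with ordinary registers; the class and the `simulate` driver are hypothetical and serve only to show the throughput consequence.

```python
class FIFO1:
    """Cycle-level model of a one-element FIFO built from plain registers."""
    def __init__(self):
        self.full = False
        self.data = None

    def can_enq(self):   # guard of enq: !full
        return not self.full

    def can_deq(self):   # guard of deq (and first): full
        return self.full

    def enq(self, x):
        self.full, self.data = True, x

    def deq(self):
        self.full = False

def simulate(cycles):
    """Producer always offers an item; consumer always wants one."""
    fifo, received = FIFO1(), []
    for t in range(cycles):
        # Guards are evaluated on the pre-clock state, so exactly one is true.
        if fifo.can_deq():
            received.append(fifo.data)
            fifo.deq()
        elif fifo.can_enq():
            fifo.enq(t)
    return received

received = simulate(10)
```

Over 10 cycles only 5 items get through (those enqueued in cycles 0, 2, 4, 6, 8): the FIFO alternates between filling and draining, which is the dead cycle the scheduling discussion is trying to remove.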

The good news ...
It is always possible to transform your design to meet desired concurrency and functionality.

Register Interfaces
A register schedules read < write. What about write < read?
[Figure: a D flip-flop (D, Q) with inputs write.x and write.en and output read; a 0/1 multiplexer controlled by write.en selects between Q and write.x to produce read'.]
read' returns the current state when write is not enabled.
read' returns the value being written if write is enabled.

Ephemeral History Register (EHR) [MEMOCODE'04]
read0 < write0 < read1 < write1 < ....
[Figure: a D flip-flop (D, Q) with inputs write0.x, write0.en, write1.x, write1.en and outputs read0 and read1; a 0/1 multiplexer after Q forwards the port-0 write to read1.]
write_(i+1) takes precedence over write_i.
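The port ordering read0 < write0 < read1 < write1 can be captured in a small behavioural model: a read at port i sees the writes made at lower-numbered ports in the same cycle, and at the clock edge the highest-numbered write is latched. This Python class is a sketch with hypothetical method names, not the hardware structure or any library API.

```python
class EHR:
    """Behavioural model of an Ephemeral History Register with `ports` ports."""
    def __init__(self, init, ports=2):
        self.q = init                       # value held in the flip-flop
        self.pending = [None] * ports       # writes issued this cycle, by port

    def read(self, i):
        # read_i sees the most recent write among ports 0 .. i-1, else Q.
        for j in range(i - 1, -1, -1):
            if self.pending[j] is not None:
                return self.pending[j][0]
        return self.q

    def write(self, i, value):
        self.pending[i] = (value,)          # wrapped so that None can be a value

    def clock(self):
        # At the clock edge the highest-numbered enabled write wins.
        for w in reversed(self.pending):
            if w is not None:
                self.q = w[0]
                break
        self.pending = [None] * len(self.pending)

r = EHR(0)
r.write(0, 5)
seen0 = r.read(0)    # port 0 still sees the old value
seen1 = r.read(1)    # port 1 sees the value written at port 0
r.write(1, 9)
r.clock()
latched = r.read(0)  # the port-1 write took precedence
```

In this run `seen0` is 0, `seen1` is 5, and `latched` is 9.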

Transformation for Performance
execute < fetch_and_decode ⇒
    rf.upd0 < rf.sub1
    bu.first0 < {bu.deq0, bu.clear0} < bu.find1 < bu.enq1
    // The "1" after stallfunc(instr, bu) is a subscript on the slide; presumably stallfunc uses bu.find1.
    rule fetch_and_decode (!stallfunc(instr, bu)1);
        bu.enq1(newIt(instr, rf));
        pc <= predIa;
    endrule
    rule execute (True);
        case (it) matches
            tagged EAdd {dst:.rd, src1:.va, src2:.vb}: begin
                rf.upd0(rd, va+vb); bu.deq0();
            end
            tagged EBz {cond:.cv, addr:.av}:
                if (cv == 0) begin
                    pc <= av; bu.clear0();
                end else bu.deq0();
            tagged ELoad {dst:.rd, addr:.av}: begin
                rf.upd0(rd, dMem.read(av)); bu.deq0();
            end
            tagged EStore {value:.vv, addr:.av}: begin
                dMem.write(av, vv); bu.deq0();
            end
        endcase
    endrule

One Element FIFO using EHRs
    module mkFIFO1 (FIFO#(t));
        EHReg2#(t)    data <- mkEHReg2U();
        EHReg2#(Bool) full <- mkEHReg2(False);
        method Action enq0(t x) if (!full.read0);
            full.write0 <= True; data.write0 <= x;
        endmethod
        method Action deq0() if (full.read0);
            full.write0 <= False;
        endmethod
        method t first0() if (full.read0);
            return (data.read0);
        endmethod
        method Action clear0();
            full.write0 <= False;
        endmethod
        method Action enq1(t x) if (!full.read1);
            full.write1 <= True; data.write1 <= x;
        endmethod
    endmodule
first0 < deq0 < enq1
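A cycle-level model shows what the EHR-based FIFO buys: deq0 and enq1 can both fire in one cycle, because enq1's guard reads `full` through port 1 and therefore sees the dequeue. The Python sketch below uses a hypothetical two-port `EHR2` class and a hypothetical `cycle` driver; it models behaviour only and is not BSV.

```python
class EHR2:
    """Two-port EHR model: read0 < write0 < read1 < write1."""
    def __init__(self, init):
        self.q, self.w0, self.w1 = init, None, None

    def read0(self):
        return self.q

    def write0(self, v):
        self.w0 = (v,)

    def read1(self):
        return self.w0[0] if self.w0 else self.q   # bypass of the port-0 write

    def write1(self, v):
        self.w1 = (v,)

    def clock(self):
        for w in (self.w1, self.w0):               # write1 takes precedence
            if w:
                self.q = w[0]
                break
        self.w0 = self.w1 = None

class FIFO1:
    def __init__(self):
        self.data, self.full = EHR2(None), EHR2(False)

    def can_deq0(self):
        return self.full.read0()

    def first0(self):
        return self.data.read0()

    def deq0(self):
        self.full.write0(False)

    def can_enq1(self):
        return not self.full.read1()

    def enq1(self, x):
        self.full.write1(True)
        self.data.write1(x)

    def clock(self):
        self.data.clock()
        self.full.clock()

def cycle(fifo, item):
    """One clock: the consumer (deq0) is ordered before the producer (enq1)."""
    out = None
    if fifo.can_deq0():
        out = fifo.first0()
        fifo.deq0()
    accepted = fifo.can_enq1()
    if accepted:
        fifo.enq1(item)
    fifo.clock()
    return out, accepted

f = FIFO1()
trace = [cycle(f, i) for i in range(4)]
```

After the first cycle fills the FIFO, every later cycle both delivers the previous item and accepts a new one, so the trace is `(None, True), (0, True), (1, True), (2, True)`.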

After Renaming
Things will work: both rules can fire concurrently.
Programmer specifies: Rexecute < Rfetch
Compiler derives: (first0, deq0) < (find1, enq1)
What if the programmer wrote this?
    Rexecute < Rexecute < Rfetch < Rfetch

Experiments in scheduling (Dan Rosenband, ICCAD 2005)
What happens if the user specifies, with no change in rules:
    Wb < Wb < Mem < Mem < Exe < Exe < Dec < Dec < IF < IF
A superscalar processor!
[Figure: "A cycle in slow motion". Pipeline stages IF, Dec, Exe, Mem, Wb connected by buffers bI, bD, bE, bW, with RF, iMem and dMem; instructions I0 to I9 move through the stages.]
Executing 2 instructions per cycle requires more resources but is functionally equivalent to the original design.