Bluespec-4: Rule Scheduling and Synthesis

Slides:

Advertisements

Similar presentations

BSV execution model and concurrent rule scheduling Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology February.

Advertisements

Elastic Pipelines and Basics of Multi-rule Systems Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February.

Control path Recall that the control path is the physical entity in a processor which: fetches instructions, fetches operands, decodes instructions, schedules.

An EHR based methodology for Concurrency management Arvind (with Asif Khan) Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

Constructive Computer Architecture: Multirule systems and Concurrent Execution of Rules Arvind Computer Science & Artificial Intelligence Lab. Massachusetts.

CS 162 Memory Consistency Models. Memory operations are reordered to improve performance Hardware (e.g., store buffer, reorder buffer) Compiler (e.g.,

Give qualifications of instructors: DAP

CS 151 Digital Systems Design Lecture 37 Register Transfer Level

Topics covered: CPU Architecture CSE 243: Introduction to Computer Architecture and Hardware/Software Interface.

1 Sequential Circuits Registers and Counters. 2 Master Slave Flip Flops.

Maria-Cristina Marinescu Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology A Synthesis Algorithm for Modular Design of.

Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology October 13, 2009http://csg.csail.mit.edu/koreaL12-1.

December 10, 2009 L29-1 The Semantics of Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute.

Computer Architecture: A Constructive Approach Sequential Circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

September 3, 2009L02-1http://csg.csail.mit.edu/korea Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial.

Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

March 6, 2006http://csg.csail.mit.edu/6.375/L10-1 Bluespec-4: Rule Scheduling and Synthesis Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

September 22, 2009http://csg.csail.mit.edu/koreaL07-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab.

Constructive Computer Architecture Sequential Circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology

Constructive Computer Architecture Sequential Circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology September.

ECE 448 Lecture 6 Finite State Machines State Diagrams vs. Algorithmic State Machine (ASM) Charts.

Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

February 20, 2009http://csg.csail.mit.edu/6.375L08-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Multiple Clock Domains (MCD) Arvind with Nirav Dave Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

Computer Architecture: A Constructive Approach Bluespec execution model and concurrent rule scheduling Teacher: Yoav Etsion Taken (with permission) from.

October 20, 2009L14-1http://csg.csail.mit.edu/korea Concurrency and Modularity Issues in Processor pipelines Arvind Computer Science & Artificial Intelligence.

Overview Logistics Last lecture Today HW5 due today

Introduction to Bluespec: A new methodology for designing Hardware

Introduction to Bluespec: A new methodology for designing Hardware

Lecture 10 Flip-Flops/Latches

Computer Organization and Architecture + Networks

Flip Flops Lecture 10 CAP

Clocks A clock is a free-running signal with a cycle time.

Concurrency properties of BSV methods and rules

Sequential Circuits Constructive Computer Architecture Arvind

Sequential Circuits: Constructive Computer Architecture

Morgan Kaufmann Publishers The Processor

Morgan Kaufmann Publishers

ECE 448 Lecture 6 Finite State Machines State Diagrams vs. Algorithmic State Machine (ASM) Charts.

Introduction to Bluespec: A new methodology for designing Hardware

Inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture #21 State Elements: Circuits that Remember Hello to James Muerle in the.

Performance Specifications

Basics Combinational Circuits Sequential Circuits Ahmad Jawdat

Instruction Level Parallelism and Superscalar Processors

Introduction to cosynthesis Rabi Mahapatra CSCE617

Multirule Systems and Concurrent Execution of Rules

Constructive Computer Architecture: Guards

Sequential Circuits Constructive Computer Architecture Arvind

How does the CPU work? CPU’s program counter (PC) register has address i of the first instruction Control circuits “fetch” the contents of the location.

Bluespec-7: Scheduling & Rule Composition

Modeling Processors: Concurrency Issues

Modules with Guarded Interfaces

Introduction to Bluespec: A new methodology for designing Hardware

Multirule systems and Concurrent Execution of Rules

Elastic Pipelines and Basics of Multi-rule Systems

Constructive Computer Architecture: Guards

GCD: A simple example to introduce Bluespec

COMS 361 Computer Organization

Elastic Pipelines and Basics of Multi-rule Systems

Bluespec-7: Scheduling & Rule Composition

Multirule systems and Concurrent Execution of Rules

Introduction to Bluespec: A new methodology for designing Hardware

Bluespec-5: Scheduling & Rule Composition

Modular Refinement Arvind

ECE 448 Lecture 6 Finite State Machines State Diagrams, State Tables, Algorithmic State Machine (ASM) Charts, and VHDL code ECE 448 – FPGA and ASIC Design.

ECE 448 Lecture 6 Finite State Machines State Diagrams vs. Algorithmic State Machine (ASM) Charts.

Instructor: Michael Greenbaum

Presentation transcript:

Bluespec-4: Rule Scheduling and Synthesis Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology Based on material prepared by Bluespec Inc, January 2005 March 2, 2005 http://csg.csail.mit.edu/6.884/

Synthesis: From State & Rules into Synchronous FSMs interface module Transition Logic I O S“Next” S Collection of State Elements March 2, 2005 http://csg.csail.mit.edu/6.884/

Hardware Elements Combinational circuits . . Mux, Demux, ALU, ... OpSelect - Add, Sub, AddU, ... - And, Or, Not, ... - GT, LT, EQ, ... - SL, SR, SRA, ... Result NCVZ A B ALU Sel O I0 I1 In Mux . Sel I De- Mux . O0 O1 On Synchronous state elements Flipflop, Register, Register file, SRAM, DRAM ff D Q Clk En register March 2, 2005 http://csg.csail.mit.edu/6.884/

Flip-flops with Write Enables ff Q D C EN C D Q EN Edge-triggered: Data is sampled at the rising edge ff Q D C EN 1 ff Q D C EN dangerous! March 2, 2005 http://csg.csail.mit.edu/6.884/

Semantics and synthesis Rules Semantics: “Untimed” (one rule at a time) Verification activities Using Rule Semantics, establish functional correctness Scheduling and Synthesis by the BSV compiler Using Schedules, establish performance correctness Verilog RTL Semantics: clocked synchronous HW (multiple rules per clock) March 2, 2005 http://csg.csail.mit.edu/6.884/

Rule semantics Given a set of rules and an initial state while ( some rules are applicable* in the current state ) choose one applicable rule apply that rule to the current state to produce the next state of the system** (*) “applicable” = a rule’s condition is true in current state (**) These rule semantics are “untimed” – the action to change the state can take as long as necessary provided the state change is seen as atomic, i.e., not divisible. March 2, 2005 http://csg.csail.mit.edu/6.884/

Why are these rule semantics useful? Much easier to reason about correctness of a system when you consider just one rule at a time No problems with concurrency (e.g., race conditions, mis-timing, inconsistent states) We also say that rules are “interlocked”  Major impact on design entry time and on verification time March 2, 2005 http://csg.csail.mit.edu/6.884/

Extensive supporting theory Term Rewriting Systems, Terese, Cambridge Univ. Press, 2003, 884 pp. Parallel Program Design: A Foundation, K. Mani Chandy and Jayadev Misra, Addison Wesley, 1988 Using Term Rewriting Systems to Design and Verify Processors, Arvind and Xiaowei Shen, IEEE Micro 19:3, 1998, p36-46 Proofs of Correctness of Cache-Coherence Protocols, Stoy et al, in Formal Methods for Increasing Software Productivity, Berlin, Germany, 2001, Springer-Verlag LNCS 2021 Superscalar Processors via Automatic Microarchitecture Transformation, Mieszko Lis, Masters thesis, Dept. of Electrical Eng. and Computer Science, MIT, 2000 … and more … The intuitions underlying this theory are easy to use in practice March 2, 2005 http://csg.csail.mit.edu/6.884/

Bluespec’s synthesis introduces concurrency Synthesis is all about executing multiple rules “simultaneously” (in the same clock cycle) When executing a set of rules in a clock cycle in hardware, each rule reads state from the leading clock edge and sets state at the trailing clock edge  none of the rules in the set can see the effects of any of the other rules in the set However, in one-rule-at-a-time semantics , each rule sees the effects of all previous rule executions Thus, a set of rules can be safely executed together in a clock cycle only if A and B produce the same net state change March 2, 2005 http://csg.csail.mit.edu/6.884/

Pictorially Rules Ri Rj Rk rule steps Rj HW Rk clocks Ri There are more intermediate states in the rule semantics (a state after each rule step) In the HW, states change only at clock edges In each clock, a different number of rules may fire March 2, 2005 http://csg.csail.mit.edu/6.884/

Parallel execution reorders reads and writes Rules rule steps reads writes reads writes reads writes reads writes reads writes reads writes reads writes clocks HW In the rule semantics, each rule sees (reads) the effects (writes) of previous rules In the HW, rules only see the effects from previous clocks, and only affect subsequent clocks March 2, 2005 http://csg.csail.mit.edu/6.884/

Correctness Rules Ri Rj Rk rule steps Rj HW Rk clocks Ri Rules are allowed to fire in parallel only if the net state change is equivalent to sequential rule exeuction Consequence: the HW can never reach a state unexpected in the rule semantics Therefore, correctness is preserved March 2, 2005 http://csg.csail.mit.edu/6.884/

Scheduling The tool schedules as many rules in a clock cycle as it can prove are safe Generates interlock hardware to prevent execution of unsafe combinations Scheduling is the tool’s best attempt at the fastest possible correct and safe hardware Delayed rule execution is a consequence of safety checking, i.e., a rule “conflicts” with another rule in the same clock, the tool may delay its execution to a later clock March 2, 2005 http://csg.csail.mit.edu/6.884/

Obviously safe to execute simultaneously always @(posedge CLK) x <= x + 1; y <= y + 2; rule r1; x <= x + 1; endrule rule r2; y <= y + 2; endrule always @(posedge CLK) begin x <= x + 1; y <= y + 2; end Simultaneous execution is equivalent to r1 followed by r2 And also to r2 followed by r1 March 2, 2005 http://csg.csail.mit.edu/6.884/

Safe to execute simultaneously rule r1; x <= y + 1; endrule rule r2; y <= y + 2; endrule always @(posedge CLK) x <= y + 1; y <= y + 2; Simultaneous execution is equivalent to r1 followed by r2 Not equivalent to r2 followed by r1 But that’s ok; just need equivalence to some rule sequence March 2, 2005 http://csg.csail.mit.edu/6.884/

Actions within a single rule rule r1; x <= y + 1; y <= x + 2; endrule always @(posedge CLK) x <= y + 1; y <= x + 2; Actions within a single rule are simultaneous (The above translation is ok assuming no interlocks needed with any other rules involving x and y) March 2, 2005 http://csg.csail.mit.edu/6.884/

Not safe to execute simultaneously rule r1; x <= y + 1; endrule rule r2; y <= x + 2; always @(posedge CLK) x <= y + 1; y <= x + 2; Simultaneous execution is not equivalent to r1 followed by r2 nor to r2 followed by r1 A rule is not a Verilog “always” block! Interlocks will prevent these firing together (by delaying one of them) March 2, 2005 http://csg.csail.mit.edu/6.884/

Rule: As a State Transformer A rule may be decomposed into two parts p(s) and d(s) such that snext = if p(s) then d(s) else s p(s) is the condition (predicate) of the rule, a.k.a. the “CAN_FIRE” signal of the rule. (conjunction of explicit and implicit conditions) d(s) is the “state transformation” function, i.e., computes the next-state value in terms of the current state values. Abstractly, we can think of a rule having to parts, a pi function and a delta function. The pi function tells us whetherule can be applied to a term s If the the pievaluates to true, then the delta function tells us what is the new term. And if pi is false, the rule cannot change s March 2, 2005 http://csg.csail.mit.edu/6.884/

Compiling a Rule p enable d next current state state values rule r (f.first() > 0) ; x <= x + 1 ; f.deq (); endrule enable p f f x x d In a circuit, pi maps to combination logic that looks at the current state and generates a boolean enable signal for this rule The delta functions is another combination logic that computes the next state value from the current state value. Actually, delta has to compute the control signals to set the state element to the new value current state next state values rdy signals read methods enable signals action parameters p = enabling condition d = action signals & values March 2, 2005 http://csg.csail.mit.edu/6.884/

Combining State Updates: strawman p’s from the rules that update R OR pn latch enable After mapping all the rules, we have to combine their logic some how. For a particular state elemetn like the PC register, the latch enable is the or the enable signals from all the rules that updates PC. The actual next state value of PC has to be selected through a multiplexer. Notice, this circuit only works if only one of these pi signal is asserted at a time d1,R dn,R OR R d’s from the rules that update R next state value What if more than one rule is enabled? March 2, 2005 http://csg.csail.mit.edu/6.884/

Combining State Updates f1 p1 Scheduler: Priority Encoder OR p’s from all the rules pn fn latch enable After mapping all the rules, we have to combine their logic some how. For a particular state elemetn like the PC register, the latch enable is the or the enable signals from all the rules that updates PC. The actual next state value of PC has to be selected through a multiplexer. Notice, this circuit only works if only one of these pi signal is asserted at a time d1,R dn,R OR R d’s from the rules that update R next state value Scheduler ensures that at most one fi is true March 2, 2005 http://csg.csail.mit.edu/6.884/

One-rule-at-a-time Scheduler p1 Scheduler: Priority Encoder f1 p2 f2 pn fn 1. fi  pi 2. p1  p2  ....  pn  f1  f2  ....  fn 3. One rewrite at a time i.e. at most one fi is true To resolve this, In the simplest implementation, instead of using the pi signals to enable state elements directly, we can these arbitrate phi signals. Thephi signals are generate a priority encoder that only assert one of the phi’s signals whose corresponding pi signal is asserted. With this scheduler, In effect, we are creating a implementation that executes one rule per clock cycle. Very conservative way of guaranteeing correctness March 2, 2005 http://csg.csail.mit.edu/6.884/

Executing Multiple Rules Per Cycle rule ra (z > 10); x <= x + 1; endrule rule rb (z > 20); y <= y + 2; Can these rules be executed simultaneously? These rules are “conflict free” because they manipulate different parts of the state The single rewrite per cycle implementation is correct but not very good. Remember how we described the fetch stage and execute stage in separate rules. If we only fire one rule per clock cycle, the we are not going to get a piplined processor. Rulea and Ruleb are conflict-free if s . pa(s)  pb(s)  1. pa(db(s))  pb(da(s)) 2. da(db(s)) == db(da(s)) March 2, 2005 http://csg.csail.mit.edu/6.884/ 8 8 6 8

Executing Multiple Rules Per Cycle rule ra (z > 10); x <= y + 1; endrule rule rb (z > 20); y <= y + 2; Can these rules be executed simultaneously? These rules are “sequentially composable”, parallel execution behaves like ra < rb The single rewrite per cycle implementation is correct but not very good. Remember how we described the fetch stage and execute stage in separate rules. If we only fire one rule per clock cycle, the we are not going to get a piplined processor. Rulea and Ruleb are sequentially composable if s . pa(s)  pb(s)  pb(da(s)) March 2, 2005 http://csg.csail.mit.edu/6.884/ 8 8 6 8

Multiple-Rules-per-Cycle Scheduler f1 Scheduler p2 f2 Scheduler Scheduler pn fn 1. fi  pi 2. p1  p2  ....  pn  f1  f2  ....  fn 3. Multiple operations such that fi  fj  Ri and Rj are conflict-free or sequentially composable March 2, 2005 http://csg.csail.mit.edu/6.884/

Scheduling and control logic Modules (Current state) Modules (Next state) “CAN_FIRE” “WILL_FIRE” Rules p1 Scheduler f1 d1 p1 pn fn After mapping all the rules, we have to combine their logic some how. For a particular state elemetn like the PC register, the latch enable is the or the enable signals from all the rules that updates PC. The actual next state value of PC has to be selected through a multiplexer. Notice, this circuit only works if only one of these pi signal is asserted at a time d1 Muxing dn pn cond action dn March 2, 2005 http://csg.csail.mit.edu/6.884/

Synthesis Summary Bluespec generates a combinational hardware scheduler allowing multiple enabled rules to execute in the same clock cycle The hardware makes a rule-execution decision on every clock (i.e., it is not a static schedule) Among those rules that CAN_FIRE, only a subset WILL_FIRE that is consistent with a Rule order Since multiple rules can write to a common piece of state, the compiler introduces suitable muxing and mux control logic This is very simple logic: the compiler will not introduce long paths on its own (details later) March 2, 2005 http://csg.csail.mit.edu/6.884/

Conditionals and rule-spliting In Rule Semantics this rule: Is equivalent to the following two rules: rule r1 won’t fire unless both f anf g queues are not full! rule r1 (p1); if (q1) f.enq(x); else g.enq(y); endrule rule r1a (p1 && q1); f.enq(x); endrule rule r1b (p1 && ! q1); g.enq(y); but not quite because of the compiler treats implicit conditions conservatively March 2, 2005 http://csg.csail.mit.edu/6.884/

This example won’t work properly without rule spliting Rules for Example 3 (* descending_urgency = "r1, r2" *) // Moving packets from input FIFO i1 rule r1; Tin x = i1.first(); if (dest(x)== 1) o1.enq(x); else o2.enq(x); i1.deq(); if (interesting(x)) c <= c + 1; endrule // Moving packets from input FIFO i2 rule r2; Tin x = i2.first(); i2.deq(); Determine Queue +1 Count certain packets This example won’t work properly without rule spliting March 2, 2005 http://csg.csail.mit.edu/6.884/

Conditionals & Concurrency rule r1a (p1 && q1); f.enq(x); endrule rule r1b (p1 && ! q1); g.enq(y); rule r1 (p1); if (q1) f.enq(x); else g.enq(y); endrule rule r2 (p2); f.enq(z); endrule Suppose there is another rule r2 Rule r2 cannot be executed simultaneously with r1 (conflict on f) But rule r2 may be executed simultaneously with r1b (provided p1, !q1 and p2 so permit) Thus, splitting a rule can allow more concurrency because of fewer resource conflicts March 2, 2005 http://csg.csail.mit.edu/6.884/

Scheduling conflicting rules When two rules conflict on a shared resource, they cannot both execute in the same clock The compiler produces logic that ensures that, when both rules are enabled, only one will fire Which one? The compiler chooses (and informs you, during compilation) The “descending_urgency” attribute allows the designer to control the choice March 2, 2005 http://csg.csail.mit.edu/6.884/

Example 2: Concurrent Updates Process 0 increments register x; Process 1 transfers a unit from register x to register y; Process 2 decrements register y 1 2 x y +1 -1 rule proc0 (cond0); x <= x + 1; endrule rule proc1 (cond1); y <= y + 1; x <= x – 1; endrule rule proc2 (cond2); y <= y – 1; endrule (* descending_urgency = “proc2, proc1, proc0” *) March 2, 2005 http://csg.csail.mit.edu/6.884/

Functionality and performance It is often possible to separate the concerns of functionality and performance First, use Rules to achieve correct functionality If necessary, adjust scheduling to achieve application performance goals (latency and bandwidth)* *I.e., # clocks per datum and # data per clock. Technology performance (clock speed) is a separate issue. March 2, 2005 http://csg.csail.mit.edu/6.884/

Improving performance via scheduling Latency and bandwidth can be improved by performing more operations in each clock cycle That is, by firing more rules per cycle Bluespec schedules all applicable rules in a cycle to execute, except when there are resource conflicts Therefore: Improving performance is often about resolving conflicts found by the scheduler March 2, 2005 http://csg.csail.mit.edu/6.884/

Viewing the schedule The command-line flag -show-schedule can be used to dump the schedule Three groups of information: method scheduling information rule scheduling information the static execution order of rules and methods more on this and other rule and method attributes to be discussed in the Friday tutorial ... March 2, 2005 http://csg.csail.mit.edu/6.884/

End March 2, 2005 http://csg.csail.mit.edu/6.884/