Presentation is loading. Please wait.

Presentation is loading. Please wait.

Combinational Circuits and Simple Synchronous Pipelines

Similar presentations


Presentation on theme: "Combinational Circuits and Simple Synchronous Pipelines"— Presentation transcript:

1 Combinational Circuits and Simple Synchronous Pipelines
Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 13, 2009

2 Bluespec: Two-Level Compilation
(Objects, Types, Higher-order functions) Lennart Augustsson @Sandburst Type checking Massive partial evaluation and static elaboration Level 1 compilation Rules and Actions (Term Rewriting System) Now we call this Guarded Atomic Actions Rule conflict analysis Rule scheduling Level 2 synthesis James Hoe & Arvind @MIT Object code (Verilog/C) February 13, 2008

3 Static Elaboration At compile time Software Toolflow: Hardware
Inline function calls and unroll loops Instantiate modules with specific parameters Resolve polymorphism/overloading, perform most data structure operations source Software Toolflow: source Hardware Toolflow: design2 design3 design1 elaborate w/params .exe compile run1 run2.1 run3.1 run1.1 run w/ params run w/ params run1 run February 13, 2008

4 Combinational IFFT … * + - *j t2 t0 t3 t1
Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute * + - *j t2 t0 t3 t1 Constants t0 to t3 are different for each box and can have dramatci impact on optimizations. All numbers are complex and represented as two sixteen bit quantities. Fixed-point arithmetic is used to reduce area, power, ... February 13, 2008

5 4-way Butterfly Node BSV has a very strong notion of types * + - *i t0
k0 k1 k2 k3 function Vector#(4,Complex) bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k); BSV has a very strong notion of types Every expression has a type. Either it is declared by the user or automatically deduced by the compiler The compiler verifies that the type declarations are compatible February 13, 2008

6 BSV code: 4-way Butterfly
function Vector#(4,Complex) bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k); Vector#(4,Complex) m, y, z; m[0] = k[0] * t[0]; m[1] = k[1] * t[1]; m[2] = k[2] * t[2]; m[3] = k[3] * t[3]; y[0] = m[0] + m[2]; y[1] = m[0] – m[2]; y[2] = m[1] + m[3]; y[3] = i*(m[1] – m[3]); z[0] = y[0] + y[2]; z[1] = y[1] + y[3]; z[2] = y[0] – y[2]; z[3] = y[1] – y[3]; return(z); endfunction * + - *i m y z Polymorphic code: works on any type of numbers for which *, + and - have been defined Note: Vector does not mean storage February 13, 2008

7 Combinational IFFT … stage_f function repeat it three times in0 in1
Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute stage_f function repeat it three times February 13, 2008

8 BSV Code: Combinational IFFT
function Vector#(64, Complex) ifft (Vector#(64, Complex) in_data); //Declare vectors Vector#(4,Vector#(64, Complex)) stage_data; stage_data[0] = in_data; for (Integer stage = 0; stage < 3; stage = stage + 1) stage_data[stage+1] = stage_f(stage,stage_data[stage]); return(stage_data[3]); The for loop is unfolded and stage_f is inlined during static elaboration Note: no notion of loops or procedures during execution February 13, 2008

9 BSV Code: Combinational IFFT- Unfolded
function Vector#(64, Complex) ifft (Vector#(64, Complex) in_data); //Declare vectors Vector#(4,Vector#(64, Complex)) stage_data; stage_data[0] = in_data; for (Integer stage = 0; stage < 3; stage = stage + 1) stage_data[stage+1] = stage_f(stage,stage_data[stage]); return(stage_data[3]); stage_data[1] = stage_f(0,stage_data[0]); stage_data[2] = stage_f(1,stage_data[1]); stage_data[3] = stage_f(2,stage_data[2]); Stage_f can be inlined now; it could have been inlined before loop unfolding also. Does the order matter? February 13, 2008

10 Bluespec Code for stage_f
function Vector#(64, Complex) stage_f (Bit#(2) stage, Vector#(64, Complex) stage_in); begin for (Integer i = 0; i < 16; i = i + 1) Integer idx = i * 4; let twid = getTwiddle(stage, fromInteger(i)); let y = bfly4(twid, stage_in[idx:idx+3]); stage_temp[idx] = y[0]; stage_temp[idx+1] = y[1]; stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3]; end //Permutation for (Integer i = 0; i < 64; i = i + 1) stage_out[i] = stage_temp[permute[i]]; return(stage_out); twid’s are mathematically derivable constants February 13, 2008

11 Suppose we want to reuse some part of the circuit ...
Reuse the same circuit three times to reduce area in0 in1 in2 in63 in3 in4 Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute But why? February 13, 2008

12 Architectural Exploration:
Area-Performance tradeoff in a Transmitter February 13, 2009

13 802.11a Transmitter Overview
headers Must produce one OFDM symbol every 4 msec Controller Scrambler Encoder 24 Uncoded bits data Depending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol Interleaver Mapper IFFT Cyclic Extend One OFDM symbol (64 Complex Numbers) IFFT Transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers accounts for 85% area February 13, 2008

14 Preliminary results [MEMOCODE 2006] Dave, Gerding, Pellauer, Arvind
Design Lines of Relative Block Code (BSV) Area Controller % Scrambler % Conv. Encoder % Interleaver % Mapper % IFFT % Cyc. Extender % Complex arithmetic libraries constitute another 200 lines of code February 13, 2008

15 Combinational IFFT … Reuse the same circuit three times to reduce area
Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute February 13, 2008

16 Design Alternatives Reuse a block over multiple cycles we expect:
f g f g we expect: Throughput to Area to decrease – less parallelism decrease – reusing a block The clock needs to run faster for the same throughput  hyper-linear increase in energy February 13, 2008

17 Circular pipeline: Reusing the Pipeline Stage
in1 in2 in63 in3 in4 out0 out1 out2 out63 out3 out4 Bfly4 Permute Stage Counter February 13, 2008

18 Superfolded circular pipeline: Just one Bfly-4 node!
in1 in2 in63 in3 in4 out0 out1 out2 out63 out3 out4 Permute Bfly4 64, 2-way Muxes Stage 0 to 2 4, 16-way Muxes Index: 0 to 15 4, 16-way DeMuxes Index == 15? February 13, 2008

19 Pipelining a block inQ outQ f2 f1 f3 Combinational C inQ outQ f2 f1 f3
Pipeline P inQ outQ f Folded Pipeline FP Clock: C < P  FP Clock? Area? Throughput? Area: FP < C < P Throughput: FP < C < P February 13, 2008

20 Synchronous pipeline x sReg1 inQ f1 f2 f3 sReg2 outQ
rule sync-pipeline (True); inQ.deq(); sReg1 <= f1(inQ.first()); sReg2 <= f2(sReg1); outQ.enq(f3(sReg2)); endrule This rule can fire only if - inQ has an element - outQ has space Atomicity: Either all or none of the state elements inQ, outQ, sReg1 and sReg2 will be updated This is real IFFT code; just replace f1, f2 and f3 with stage_f code February 13, 2008

21 Stage functions f1, f2 and f3
function f1(x); return (stage_f(1,x)); endfunction function f2(x); return (stage_f(2,x)); function f3(x); return (stage_f(3,x)); The stage_f function was given earlier February 13, 2008

22 Problem: What about pipeline bubbles?
x sReg1 inQ f1 f2 f3 sReg2 outQ rule sync-pipeline (True); inQ.deq(); sReg1 <= f1(inQ.first()); sReg2 <= f2(sReg1); outQ.enq(f3(sReg2)); endrule Red and Green tokens must move even if there is nothing in the inQ! Also if there is no token in sReg2 then nothing should be enqueued in the outQ Valid bits or the Maybe type Modify the rule to deal with these conditions February 13, 2008

23 The Maybe type data in the pipeline
typedef union tagged { void Invalid; data_T Valid; } Maybe#(type data_T); data valid/invalid Registers contain Maybe type values rule sync-pipeline (True); if (inQ.notEmpty()) begin sReg1 <= Valid f1(inQ.first()); inQ.deq(); end else sReg1 <= Invalid; case (sReg1) matches tagged Valid .sx1: sReg2 <= Valid f2(sx1); tagged Invalid: sReg2 <= Invalid; case (sReg2) matches tagged Valid .sx2: outQ.enq(f3(sx2)); endrule sx1 will get bound to the appropriate part of sReg1 February 13, 2008

24 Folded pipeline x sReg inQ f outQ stage The same code will work for superfolded pipelines by changing n and stage function f rule folded-pipeline (True); if (stage==0) begin sxIn= inQ.first(); inQ.deq(); end else sxIn= sReg; sxOut = f(stage,sxIn); if (stage==n-1) outQ.enq(sxOut); else sReg <= sxOut; stage <= (stage==n-1)? 0 : stage+1; endrule no for-loop notice stage is a dynamic parameter now! Need type declarations for sxIn and sxOut February 13, 2008

25 802.11a Transmitter Synthesis results (Only the IFFT block is changing)
IFFT Design Area (mm2) ThroughputLatency (CLKs/sym) Min. Freq Required Pipelined 5.25 04 1.0 MHz Combinational 4.91 Folded (16 Bfly-4s) 3.97 Super-Folded (8 Bfly-4s) 3.69 06 1.5 MHz SF(4 Bfly-4s) 2.45 12 3.0 MHz SF(2 Bfly-4s) 1.84 24 6.0 MHz SF (1 Bfly4) 1.52 48 12 MHZ All these designs were done in less than 24 hours! The same source code TSMC .18 micron; numbers reported are before place and route. February 13, 2008

26 Why are the areas so similar
Folding should have given a 3x improvement in IFFT area BUT a constant twiddle allows low-level optimization on a Bfly-4 block a 2.5x area reduction! February 13, 2008

27 Language notes Pattern matching syntax Vector syntax
Implicit conditions Static vs dynamic expression February 13, 2008

28 Pattern-matching: A convenient way to extract datastructure components
typedef union tagged { void Invalid; t Valid; } Maybe#(type t); case (m) matches tagged Invalid : return 0; tagged Valid .x : return x; endcase x will get bound to the appropriate part of m if (m matches (Valid .x) &&& (x > 10)) The &&& is a conjunction, and allows pattern-variables to come into scope from left to right February 13, 2008

29 Syntax: Vector of Registers
suppose x and y are both of type Reg. Then x <= y means x._write(y._read()) Vector of Int x[i] means sel(x,i) x[i] = y[j] means x = update(x,i, sel(y,j)) Vector of Registers x[i] <= y[j] does not work. The parser thinks it means (sel(x,i)._read)._write(sel(y,j)._read), which will not type check (x[i]) <= y[j] parses as sel(x,i)._write(sel(y,j)._read), and works correctly Don’t ask me why February 13, 2008

30 Making guards explicit
rule recirculate (True); if (p) fifo.enq(8); r <= 7; endrule rule recirculate ((p && fifo.enqG) || !p); if (p) fifo.enqB(8); r <= 7; endrule Effectively, all implicit conditions (guards) are lifted and conjoined to the rule guard February 13, 2008

31 Implicit guards (conditions)
Rule rule <name> (<guard>); <action>; endrule where <action> ::= r <= <exp> | m.g(<exp>) | if (<exp>) <action> endif | <action> ; <action> m.gB(<exp>) when m.gG make implicit guards explicit February 13, 2008

32 Guards vs If’s A guard on one action of a parallel group of actions affects every action within the group (a1 when p1); (a2 when p2) ==> (a1; a2) when (p1 && p2) A condition of a Conditional action only affects the actions within the scope of the conditional action (if (p1) a1); a2 p1 has no effect on a2 ... Mixing ifs and whens (if (p) (a1 when q)) ; a2  ((if (p) a1); a2) when ((p && q) | !p) February 13, 2008

33 Static vs dynamic expressions
Expressions that can be evaluated at compile time will be evaluated at compile-time 3+4  7 Some expressions do not have run-time representations and must be evaluated away at compile time; an error will occur if the compile-time evaluation does not succeed Integers, reals, loops, lists, functions, … February 13, 2008


Download ppt "Combinational Circuits and Simple Synchronous Pipelines"

Similar presentations


Ads by Google