
1 November 2, 2006 · L15-1 · http://csg.csail.mit.edu/6.827/
A Hardware-Inspired Model for Parallel Programming
Arvind
Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology

2 This subject is about
- The foundations of functional languages: the λ-calculus, types, monads, confluence, operational semantics, TRS...
- General-purpose implicit parallel programming in Haskell & pH
- Parallel programming based on atomic actions or transactions in Bluespec
- Dataflow model of computation and understanding connections...
Bluespec and pH borrow heavily from functional languages, but their execution models differ completely from each other.
(What we said in the first lecture)

3 pH: Implicit Parallel Programming
Compilation flow (diagram): pH: parallel Haskell (Types, Higher-order functions, I-structures, M-structures) → front-end compilation → Multithreaded Intermediate Language → code generation → Multithreaded C → SMPs, Clusters
Dataflow and multithreaded compilation model
R.S. Nikhil, Arvind & many brilliant students @ MIT, mid 80s to 90s
We didn't discuss compilation much!

4 Fully Parallel, Multithreaded Model
(Diagram labels: Global Heap of Shared Objects; Tree of Activation Frames with f:, g:, h: and loop; active threads; asynchronous at all levels)
Synchronization?
Efficient mappings on architectures have proved difficult.

5 Instead of focusing on compilation, we will study
A hardware-inspired methodology for "synthesizing" parallel programs
- Rule-based specification of behavior (Guarded Atomic Actions): lets you think one rule at a time
- Composition of modules with guarded interfaces
- Example: 802.11a transmitter
(Bluespec; Unity, late 80s, Chandy & Misra)
Warning: The ideas are untested in the software domain; you are the trailblazers.

6 Bluespec: State and Rules organized into modules
All state (e.g., registers, FIFOs, RAMs, ...) is explicit.
Behavior is expressed in terms of atomic actions on the state:
  Rule: condition → action
Rules can manipulate state in other modules only via their interfaces.
(Diagram labels: module, interface)

7 Programming with rules: Example Euclid's GCD
Terms: GCD(x, y), integers
Rewrite rules:
  GCD(x, y) → GCD(y, x)     if x > y, y ≠ 0   (R1)
  GCD(x, y) → GCD(x, y-x)   if x ≤ y, y ≠ 0   (R2)
Initial term: GCD(initX, initY)
Execution:
  GCD(6, 15) →(R2) GCD(6, 9) →(R2) GCD(6, 3) →(R1) GCD(3, 6) →(R2) GCD(3, 3) →(R2) GCD(3, 0)
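The rewriting above can be mimicked in ordinary software. The following is a minimal Python sketch, not Bluespec; the names `step` and `rewrite` are invented for illustration. It applies whichever of R1 or R2 is enabled until neither is, and records which rule fired. It assumes x is nonzero, since otherwise R2 would never make progress.

```python
# Sketch of rewrite rules R1 and R2, applied until no rule is enabled.
# Function names are illustrative and not part of the lecture.

def step(x, y):
    """Apply the enabled rule to GCD(x, y); return (rule, x', y') or None."""
    if y != 0 and x > y:       # R1: swap the arguments
        return ("R1", y, x)
    if y != 0 and x <= y:      # R2: subtract x from y
        return ("R2", x, y - x)
    return None                # y == 0: no rule applies, x is the answer

def rewrite(x, y):
    """Return the list of rule names fired and the final term (x, y)."""
    fired = []
    while (nxt := step(x, y)) is not None:
        rule, x, y = nxt
        fired.append(rule)
    return fired, (x, y)

fired, final = rewrite(6, 15)
print(fired, final)
```

For the initial term GCD(6, 15) the recorded rule sequence matches the execution trace on the slide.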

8 GCD in Bluespec
// typedef: int is Int#(32)
module mkGCD (I_GCD);
  // State
  Reg#(int) x <- mkRegU;
  Reg#(int) y <- mkReg(0);

  // Internal behavior
  rule swap when ((x>y)&&(y!=0)) ==>
    x <= y; y <= x;
  endrule
  rule subtract when ((x<=y)&&(y!=0)) ==>
    y <= y - x;
  endrule

  // External interface
  method Action start(int a, int b) when (y==0) ==>
    x <= a; y <= b;
  endmethod
  method int result() when (y==0);
    return x;
  endmethod
endmodule
Assumes x /= 0 and y /= 0
(Diagram labels: x, y, swap, sub)
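To see how guarded methods behave from a caller's point of view, here is a small Python model of the module. It is only a sketch under stated assumptions: the class, the `GuardError` exception, and the `run` helper are invented for illustration, and raising an exception stands in for a method that is simply not ready in hardware.

```python
# Software model of a module with guarded methods (illustrative names only).

class GuardError(Exception):
    """Raised when a method is called while its guard is false."""

class GCD:
    def __init__(self):
        self.x = 0   # mkRegU leaves x undefined in hardware; 0 is arbitrary here
        self.y = 0   # y == 0 means "idle, result available"

    # Each rule returns True if it fired. Updates are atomic: the
    # right-hand sides read the old register values.
    def rule_swap(self):
        if self.x > self.y and self.y != 0:
            self.x, self.y = self.y, self.x
            return True
        return False

    def rule_subtract(self):
        if self.x <= self.y and self.y != 0:
            self.y = self.y - self.x
            return True
        return False

    def start(self, a, b):
        if self.y != 0:                      # guard: y == 0
            raise GuardError("start is not ready")
        self.x, self.y = a, b

    def result(self):
        if self.y != 0:                      # guard: y == 0
            raise GuardError("result is not ready")
        return self.x

def run(gcd):
    """Fire enabled rules one at a time until none is enabled."""
    while gcd.rule_swap() or gcd.rule_subtract():
        pass

g = GCD()
g.start(6, 15)
run(g)
print(g.result())
```

Calling `result()` between `start` and the end of `run` raises `GuardError` in this model, which corresponds to the y == 0 condition on the method.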

9 GCD Hardware Module
interface I_GCD;
  method Action start (int a, int b);
  method int result();
endinterface
(Diagram: GCD module with start (int inputs, enab, rdy) and result (int output, rdy); implicit conditions: y == 0)
The module can easily be made polymorphic: replace int by t and parameterize with #(type t).
In a GCD call t could be Int#(32), UInt#(16), Int#(13), ...
Many different implementations can provide the same interface: module mkGCD (I_GCD)

10 Bluespec: Two-Level Compilation
Bluespec (Objects, Types, Higher-order functions)
  → Level 1 compilation: type checking; massive partial evaluation and static elaboration (Lennart Augustsson @ Sandburst, 2000-2002)
Rules and Actions (Term Rewriting System)
  → Level 2 synthesis: rule conflict analysis; rule scheduling (James Hoe & Arvind @ MIT, 1997-2000)
Object code (Verilog/C)

11 Static Elaboration
- Inline function calls and datatypes
- Instantiate modules with specific parameters
- Resolve polymorphism/overloading
Software toolflow: source → compile → .exe → run w/ params → run1, run2, run3, ...
Hardware toolflow: source → elaborate w/ params → design1, design2, design3 → run1.1 ..., run2.1 ..., run3.1 ...

12 Expressing designs for 802.11a transmitter in Bluespec (BSV)

13 802.11a Transmitter Overview
headers, data → Controller → Scrambler → Encoder → Interleaver → Mapper → IFFT → Cyclic Extend
- 24 uncoded bits
- Depending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol
- One OFDM symbol (64 complex numbers)
- IFFT transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers; accounts for > 95% area
- Must produce one OFDM symbol every 4 μsec

14 Preliminary results
Design Block     Lines of Code (BSV)   Relative Area
Controller       49                    0%
Scrambler        40                    0%
Conv. Encoder    113                   0%
Interleaver      76                    1%
Mapper           112                   11%
IFFT             95                    85%
Cyc. Extender    23                    3%
Complex arithmetic libraries constitute another 200 lines of code.

15 Combinational IFFT
(Diagram: in0 ... in63 → Radix 4 x16 → Permute_1 → Radix 4 x16 → Permute_2 → Radix 4 x16 → Permute_3 → out0 ... out63)
(Radix-4 node detail: four multipliers on inputs t0 ... t3, two layers of +/- units, and one *j)
All numbers are complex and represented as two sixteen-bit quantities. Fixed-point arithmetic is used to reduce area, power, ...

16 Design Alternative
Reuse a block over multiple cycles. We expect:
- Throughput to reduce: less parallelism
- Energy/unit work to increase: due to extra HW
- Area to decrease: reusing a block
(Diagram labels: f, g)

17 Combinational IFFT: Opportunity for reuse
(Diagram: in0 ... in63 → Radix 4 x16 → Permute_1 → Radix 4 x16 → Permute_2 → Radix 4 x16 → Permute_3 → out0 ... out63)
Reuse the same circuit three times

18 Circular pipeline: Reusing the Pipeline Stage
(Diagram labels: in0 ... in63; Radix 4 nodes; Permute_1, Permute_2, Permute_3; Stage Counter; 64 4-way muxes; out0 ... out63)
The 16 Radix 4s can be shared, but not the three permutations. Hence the need for muxes.

19 Superfolded circular pipeline: Just one Radix-4 node!
(Diagram labels: in0 ... in63; Radix 4; Permute_1, Permute_2, Permute_3; Stage Counter 0 to 2; Index Counter 0 to 15; 64 4-way muxes; 4 16-way muxes; 4 16-way demuxes; out0 ... out63)

20 Which design consumes the least energy to transmit a symbol?
Can we quickly code up all the alternatives?
- A single source with parameters?
Not practical in traditional hardware description languages like Verilog/VHDL.

21 Bluespec code: Radix-4 Node
function Vector#(4,Complex) radix4(Vector#(4,Complex) t, Vector#(4,Complex) k);
  Vector#(4,Complex) m = newVector(), y = newVector(), z = newVector();
  m[0] = k[0] * t[0]; m[1] = k[1] * t[1];
  m[2] = k[2] * t[2]; m[3] = k[3] * t[3];
  y[0] = m[0] + m[2]; y[1] = m[0] - m[2];
  y[2] = m[1] + m[3]; y[3] = i*(m[1] - m[3]);
  z[0] = y[0] + y[2]; z[1] = y[1] + y[3];
  z[2] = y[0] - y[2]; z[3] = y[1] - y[3];
  return(z);
endfunction
Polymorphic code: works on any type of numbers for which *, + and - have been defined
(Diagram: four multipliers, two layers of +/- units, one *j)
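As a cross-check of the butterfly arithmetic, the node can be transcribed into Python using built-in complex numbers. This is a sketch only: Python's `1j` stands in for the slide's `i`, floating point replaces the fixed-point arithmetic of the hardware, and `idft4` is a helper invented here for comparison. With all twiddles equal to 1, the node should compute an unnormalized 4-point inverse DFT of its inputs.

```python
import cmath

def radix4(t, k):
    """Transcription of the radix-4 node: multiply, then two add/subtract layers."""
    m = [k[n] * t[n] for n in range(4)]
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]

def idft4(m):
    """Unnormalized 4-point inverse DFT computed directly (reference only)."""
    return [sum(m[n] * cmath.exp(2j * cmath.pi * n * q / 4) for n in range(4))
            for q in range(4)]

twiddles = [1, 1, 1, 1]
data = [1 + 2j, 3 - 1j, -2 + 0.5j, 4j]
z = radix4(twiddles, data)
ref = idft4(data)
print(all(abs(a - b) < 1e-9 for a, b in zip(z, ref)))
```

The comparison uses a small tolerance because the reference is computed with `cmath.exp`, which is not exact.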

22 Combinational IFFT: Can be used as a reference
(Diagram: in0 ... in63 → Radix 4 x16 → Permute_1 → Radix 4 x16 → Permute_2 → Radix 4 x16 → Permute_3 → out0 ... out63)
One Radix 4 x16 block plus its permutation is the stage_f function; repeat it three times.

23 Bluespec Code for Combinational IFFT
function SVector#(64, Complex) ifft (SVector#(64, Complex) in_data);
  //Declare vectors
  SVector#(4,SVector#(64, Complex)) stage_data = replicate(newSVector);
  stage_data[0] = in_data;
  for (Integer stage = 0; stage < 3; stage = stage + 1)
    stage_data[stage+1] = stage_f(fromInteger(stage), stage_data[stage]);
  return(stage_data[3]);
endfunction
The code is unfolded to generate a combinational circuit.

24 Bluespec Code for stage_f
function SVector#(64, Complex) stage_f (Bit#(2) stage, SVector#(64, Complex) stage_in);
  // (declarations of stage_temp and stage_out are not shown)
  begin
    for (Integer i = 0; i < 16; i = i + 1)
    begin
      Integer idx = i * 4;
      let twid = getTwiddle(stage, fromInteger(i));
      let y = radix4(twid, stage_in[idx:idx+3]);
      stage_temp[idx]   = y[0];
      stage_temp[idx+1] = y[1];
      stage_temp[idx+2] = y[2];
      stage_temp[idx+3] = y[3];
    end
    //Permutation
    for (Integer i = 0; i < 64; i = i + 1)
      stage_out[i] = stage_temp[permute[i]];
  end
  return(stage_out);
endfunction

25 Synchronous pipeline
(Diagram: x → inQ → f1 → sReg1 → f2 → sReg2 → f3 → outQ)
rule sync_pipeline (True);
  inQ.deq();
  sReg1 <= f1(inQ.first());
  sReg2 <= f2(sReg1);
  outQ.enq(f3(sReg2));
endrule
This is real IFFT code; just replace f1, f2 and f3 with stage_f code.

26 What about pipeline bubbles?
typedef union tagged {
  void Invalid;
  data_T Valid;
} Maybe#(type data_T);

rule sync_pipeline (True);
  Maybe#(data_T) sx, ox;
  for (Integer i = 0; i < n; i = i + 1)
  begin
    //Get stage input
    if (i == 0)
      if (inQ.notEmpty)
        begin sx = inQ.first(); inQ.deq(); end
      else
        sx = Invalid;
    else
      sx = sRegs[i-1];
    //Calculate value
    case (sx) matches
      tagged Valid .x: ox = f(fromInteger(i), x);
      tagged Invalid:  ox = Invalid;
    endcase
    //Write outputs
    if (i == n-1) outQ.enq(ox);
    else sRegs[i] <= ox;
  end
endrule
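The bubble handling can be tried out in a small Python model. This is a sketch under assumptions, not the lecture's code: `None` plays the role of `Invalid`, the function and parameter names are invented, and unlike the rule on the slide this model drops invalid values instead of enqueueing them into the output queue. All register reads use a snapshot taken before the step, which mimics the atomic "read old, write new" behavior of a rule.

```python
from collections import deque

def sync_pipeline_step(in_q, s_regs, out_q, f):
    """One atomic firing of an n-stage pipeline, where n == len(s_regs) + 1."""
    n = len(s_regs) + 1
    old = list(s_regs)                 # every stage reads pre-step register values
    for i in range(n):
        # Get stage input: the queue for stage 0, the previous register otherwise.
        if i == 0:
            sx = in_q.popleft() if in_q else None
        else:
            sx = old[i - 1]
        # Calculate value; a bubble (None) stays a bubble.
        ox = None if sx is None else f(i, sx)
        # Write outputs.
        if i == n - 1:
            if ox is not None:         # deviation from the slide: bubbles are dropped
                out_q.append(ox)
        else:
            s_regs[i] = ox

def stage(i, x):
    """Hypothetical stage function: stage i adds 1, 10 or 100."""
    return x + [1, 10, 100][i]

in_q, out_q = deque([1, 2]), deque()
s_regs = [None, None]                  # two registers, three stages
for _ in range(4):
    sync_pipeline_step(in_q, s_regs, out_q, stage)
print(list(out_q))
```

With two inputs and three stages, the first result appears after the third step and the second after the fourth; later steps only move bubbles.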

27 Folded pipeline
(Diagram: x → inQ → f → outQ, with sReg feeding back into f and a stage register selecting f1, f2 or f3)
rule folded_pipeline (True);
  if (stage==1)
    begin inQ.deq(); sxIn = inQ.first(); end
  else
    sxIn = sReg;
  sxOut = f(stage, sxIn);
  if (stage==3) outQ.enq(sxOut);
  else sReg <= sxOut;
  stage <= (stage==3) ? 1 : stage+1;
endrule

function f (stage, sx);
  case (stage)
    1: return f1(sx);
    2: return f2(sx);
    3: return f3(sx);
  endcase
endfunction
This is real IFFT code too...
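A short Python model makes the trade-off of folding visible: one function unit is reused, so each item needs three firings. This is an illustrative sketch, not Bluespec; the class name, the `can_fire` guard, and the unbounded queues are assumptions made for the model.

```python
from collections import deque

class FoldedPipeline:
    """Software model of a folded pipeline that reuses one unit for f1, f2, f3."""

    def __init__(self, fs):
        self.fs = fs              # list of stage functions [f1, f2, f3]
        self.stage = 1            # counts 1..len(fs), as on the slide
        self.s_reg = None
        self.in_q = deque()
        self.out_q = deque()      # unbounded in this model

    def can_fire(self):
        # Assumed guard: stage 1 needs an input to dequeue.
        return self.stage != 1 or bool(self.in_q)

    def fire(self):
        """One atomic firing of the rule."""
        last = len(self.fs)
        sx_in = self.in_q.popleft() if self.stage == 1 else self.s_reg
        sx_out = self.fs[self.stage - 1](sx_in)
        if self.stage == last:
            self.out_q.append(sx_out)
        else:
            self.s_reg = sx_out
        self.stage = 1 if self.stage == last else self.stage + 1

# Hypothetical stage functions, chosen only so the result is easy to check.
p = FoldedPipeline([lambda v: v + 1, lambda v: v * 2, lambda v: v - 3])
p.in_q.extend([5, 7])
cycles = 0
while p.can_fire():
    p.fire()
    cycles += 1
print(list(p.out_q), cycles)
```

Each output equals f3(f2(f1(x))), and two inputs take six firings, three per item.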

28 Expressing these designs in Bluespec is easy
All these designs were done in less than one day!
- Combinational
- Pipelined
- Folded (16 Radices)
- Super-Folded (8 Radices)
- Super-Folded (4 Radices)
- Super-Folded (2 Radices)
- Super-Folded (1 Radix)
Area and power estimates?

29 802.11a Transmitter Synthesis results
IFFT Design                Area (mm²)   Symbol Latency (CLKs)   Throughput Latency (CLKs/sym)   Min. Freq Required   Average Power (mW)
Pipelined                  5.25         12                      04                              1.0 MHz              4.92
Combinational              4.91         10                      04                              1.0 MHz              3.99
Folded (16 Radices)        3.97         12                      04                              1.0 MHz              7.27
Super-Folded (8 Radices)   3.69         15                      06                              1.5 MHz              10.9
SF (4 Radices)             2.45         21                      12                              3.0 MHz              14.4
SF (2 Radices)             1.84         33                      24                              6.0 MHz              21.1
SF (1 Radix)               1.52         57                      48                              12 MHz               34.6
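The minimum-frequency column follows from the throughput column and the real-time requirement of one OFDM symbol every 4 μs in 802.11a. The Python sketch below reproduces that arithmetic; the dictionary and function names are invented for illustration, and the cycle counts are copied from the table.

```python
SYMBOL_PERIOD_US = 4.0   # one OFDM symbol must be produced every 4 microseconds

# Clock cycles needed per symbol ("Throughput Latency" column).
cycles_per_symbol = {
    "Pipelined": 4,
    "Combinational": 4,
    "Folded (16 Radices)": 4,
    "Super-Folded (8 Radices)": 6,
    "SF (4 Radices)": 12,
    "SF (2 Radices)": 24,
    "SF (1 Radix)": 48,
}

def min_freq_mhz(cycles):
    """Cycles per microsecond is the same number as MHz."""
    return cycles / SYMBOL_PERIOD_US

for name, cycles in cycles_per_symbol.items():
    print(f"{name}: {min_freq_mhz(cycles):g} MHz")
```

Halving the number of radix-4 nodes doubles the cycles per symbol and therefore the clock frequency needed to keep up.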

30 Why are the areas so similar?
Folding should have given a 3x improvement in IFFT area.
BUT a constant twiddle allows low-level optimization on a radix4 block: a 2.5x area reduction!

31 802.11a Observation
Dataflow network, aka Kahn networks
How should this level of concurrency be expressed in a reference code (say, in C or SystemC)?
Can we write specs which work for both hardware and software?

32 Bluespec Tool flow
Bluespec SystemVerilog source → Bluespec Compiler → Verilog 95 RTL, or C
- C → Bluesim (cycle accurate)
- Verilog 95 RTL → Verilog sim → VCD output → Debussy visualization
- Verilog 95 RTL → RTL synthesis → gates → FPGA
- Power estimation tool: Sequence Design PowerTheater

