Pipelining combinational circuits

Slides:



Advertisements
Similar presentations
BSV execution model and concurrent rule scheduling Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology February.
Advertisements

Elastic Pipelines and Basics of Multi-rule Systems Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February.
An EHR based methodology for Concurrency management Arvind (with Asif Khan) Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.
March, 2007http://csg.csail.mit.edu/arvind802.11a-1 Architectural Exploration: Area-Performance tradeoff in a Transmitter Arvind Computer Science.
Constructive Computer Architecture: Multirule systems and Concurrent Execution of Rules Arvind Computer Science & Artificial Intelligence Lab. Massachusetts.
March 2007http://csg.csail.mit.edu/arvindSemantics-1 Scheduling Primitives for Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts.
Folding and Pipelining complex combinational circuits Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February.
June 5, Architectural Exploration: a Transmitter Arvind, Nirav Dave, Steve Gerding, Mike Pellauer Computer Science & Artificial Intelligence.
Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology October 13, 2009http://csg.csail.mit.edu/koreaL12-1.
March, 2007http://csg.csail.mit.edu/arvindIPlookup-1 IP Lookup Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.
December 10, 2009 L29-1 The Semantics of Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute.
December 12, 2006http://csg.csail.mit.edu/6.827/L24-1 Scheduling Primitives for Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts.
Pipelining combinational circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology February 20, 2013http://csg.csail.mit.edu/6.375L05-1.
Constructive Computer Architecture: Bluespec execution model and concurrency semantics Arvind Computer Science & Artificial Intelligence Lab. Massachusetts.
September 3, 2009L02-1http://csg.csail.mit.edu/korea Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial.
Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.
Constructive Computer Architecture Cache Coherence Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November.
Folded Combinational Circuits as an example of Sequential Circuits Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.
Constructive Computer Architecture: Guards Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology September 24, 2014.
March, 2007http://csg.csail.mit.edu/arvindIFFT-1 Combinational Circuits: IFFT, Types, Parameterization... Arvind Computer Science & Artificial Intelligence.
September 22, 2009http://csg.csail.mit.edu/koreaL07-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab.
Elastic Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 28, 2011L08-1http://csg.csail.mit.edu/6.375.
September 8, 2009http://csg.csail.mit.edu/koreaL03-1 Combinational Circuits in Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts.
Folding complex combinational circuits to save area Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February.
Constructive Computer Architecture Sequential Circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology September.
Constructive Computer Architecture Cache Coherence Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology Includes.
Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.
Constructive Computer Architecture Sequential Circuits - 2 Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.
Constructive Computer Architecture Store Buffers and Non-blocking Caches Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute.
Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.
Simple Inelastic and Folded Pipelines Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 14, 2011L04-1.
Computer Architecture: A Constructive Approach Pipelining combinational circuits Teacher: Yoav Etsion Taken (with permission) from Arvind et al.*, Massachusetts.
Multiple Clock Domains (MCD) Arvind with Nirav Dave Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.
Combinational Circuits in Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 9, 2011L03-1
Computer Architecture: A Constructive Approach Bluespec execution model and concurrent rule scheduling Teacher: Yoav Etsion Taken (with permission) from.
Bluespec-3: A non-pipelined processor Arvind
Folded Combinational Circuits as an example of Sequential Circuits
Folded “Combinational” circuits
Sequential Circuits - 2 Constructive Computer Architecture Arvind
Sequential Circuits Constructive Computer Architecture Arvind
Architectural Exploration:
Sequential Circuits: Constructive Computer Architecture
Combinational Circuits in Bluespec
FFT: An example of complex combinational circuits
Combinational Circuits in Bluespec
Pipelining combinational circuits
Multirule Systems and Concurrent Execution of Rules
Constructive Computer Architecture: Guards
Sequential Circuits Constructive Computer Architecture Arvind
Pipelining combinational circuits
EHR: Ephemeral History Register
Combinational circuits
Combinational Circuits and Simple Synchronous Pipelines
Combinational Circuits and Simple Synchronous Pipelines
Modules with Guarded Interfaces
Pipelining combinational circuits
Sequential Circuits - 2 Constructive Computer Architecture Arvind
Bluespec-3: A non-pipelined processor Arvind
Multirule systems and Concurrent Execution of Rules
Modular Refinement - 2 Arvind
Combinational Circuits in Bluespec
Computer Science & Artificial Intelligence Lab.
Elastic Pipelines and Basics of Multi-rule Systems
FFT: An example of complex combinational circuits
Constructive Computer Architecture: Guards
Elastic Pipelines and Basics of Multi-rule Systems
Simple Synchronous Pipelines
Multirule systems and Concurrent Execution of Rules
Architectural Exploration:
Simple Synchronous Pipelines
Presentation transcript:

Pipelining combinational circuits Constructive Computer Architecture: Pipelining combinational circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology December 31, 2013 http://csg.csail.mit.edu/6.s195/CDAC

Contributors to the course material Arvind, Rishiyur S. Nikhil, Joel Emer, Muralidaran Vijayaraghavan Staff and students in 6.375 (Spring 2013), 6.S195 (Fall 2012, 2013), 6.S078 (Spring 2012) Andy Wright, Asif Khan, Richard Ruhler, Sang Woo Jun, Abhinav Agarwal, Myron King, Kermin Fleming, Ming Liu, Li-Shiuan Peh External Prof Amey Karkare & students at IIT Kanpur Prof Jihong Kim & students at Seoul Nation University Prof Derek Chiou, University of Texas at Austin Prof Yoav Etsion & students at Technion http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Contents IFFT Inelastic versus Elastic pipelines The role of FIFOs Concurrency issues BSV Concepts The Maybe Type Concurrency analysis http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Combinational IFFT … * + - *j t2 t0 t3 t1 Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute * + - *j t2 t0 t3 t1 Constants t0 to t3 are different for each box and can have dramatci impact on optimizations. All numbers are complex and represented as two sixteen bit quantities. Fixed-point arithmetic is used to reduce area, power, ... http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

BSV code: 4-way Butterfly function Vector#(4,Complex#(s)) bfly4 (Vector#(4,Complex#(s)) t, Vector#(4,Complex#(s)) x); Vector#(4,Complex#(s)) m, y, z; m[0] = x[0] * t[0]; m[1] = x[1] * t[1]; m[2] = x[2] * t[2]; m[3] = x[3] * t[3]; y[0] = m[0] + m[2]; y[1] = m[0] – m[2]; y[2] = m[1] + m[3]; y[3] = i*(m[1] – m[3]); z[0] = y[0] + y[2]; z[1] = y[1] + y[3]; z[2] = y[0] – y[2]; z[3] = y[1] – y[3]; return(z); endfunction * + - *i m y z Polymorphic code: works on any type of numbers for which *, + and - have been defined Note: Vector does not mean storage; just a group of wires with names http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Combinational IFFT … stage_f function Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute stage_f function function Vector#(64, Complex#(n)) stage_f (Bit#(2) stage, Vector#(64, Complex#(n)) stage_in); function Vector#(64, Complex#(n)) ifft (Vector#(64, Complex#(n)) in_data); repeat stage_f three times http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

BSV Code: Combinational IFFT function Vector#(64, Complex#(n)) ifft (Vector#(64, Complex#(n)) in_data); //Declare vectors Vector#(4,Vector#(64, Complex#(n))) stage_data; stage_data[0] = in_data; for (Bit#(2) stage = 0; stage < 3; stage = stage + 1) stage_data[stage+1] = stage_f(stage,stage_data[stage]); return(stage_data[3]); endfunction The for-loop is unfolded and stage_f is inlined during static elaboration http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Folded IFFT: Reusing the stage combinational circuit … in1 in2 in63 in3 in4 out0 … out1 out2 out63 out3 out4 … Bfly4 Permute Stage Counter http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Superfolded IFFT: Just one Bfly-4 node! in0 … in1 in2 in63 in3 in4 out0 … out1 out2 out63 out3 out4 Permute Bfly4 64, 2-way Muxes Stage 0 to 2 4, 16-way Muxes Index: 0 to 15 4, 16-way DeMuxes Index == 15? f will be invoked for 48 dynamic values of stage; each invocation will modify 4 numbers in sReg after 16 invocations a permutation would be done on the whole sReg http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Folding versus Pipelining xi+1 xi xi-1 3 different datasets in the pipeline f0 f1 f2 Lot of area and long combinational delay Folded or multi-cycle version can save area and reduce the combinational delay but throughput per clock cycle gets worse Pipelining: a method to increase the circuit throughput by concurrently evaluating multiple inputs http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Inelastic vs Elastic pipeline x sReg1 inQ f0 f1 f2 sReg2 outQ Inelastic: all pipeline stages move synchronously x fifo1 inQ f1 f2 f3 fifo2 outQ Elastic: A pipeline stage can process data if its input FIFO is not empty and output FIFO is not Full Most complex processor pipelines are a combination of the two styles http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Inelastic vs Elastic Pipelines Inelastic pipeline: typically only one rule or mutually exclusive rules; the designer controls precisely which activities go on in parallel downside: The designer must program the starting and draining of the pipeline. The rule can get complicated -- easy to make mistakes; difficult to make changes Elastic pipeline: several smaller rules, each easy to write, easier to make changes downside: sometimes rules do not fire concurrently when they should http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Inelastic pipeline x sReg1 inQ f0 f1 f2 sReg2 outQ rule sync-pipeline (True); inQ.deq(); sReg1 <= f0(inQ.first()); sReg2 <= f1(sReg1); outQ.enq(f2(sReg2)); endrule This rule can fire only if - inQ has an element - outQ has space Atomicity: Either all or none of the state elements inQ, outQ, sReg1 and sReg2 will be updated http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

FIFO Module: methods with guarded interfaces rdy enab enq deq first FIFO not full not empty not empty fifo.enq(x); // action method fifo.deq(); // action method y=fifo.first() // value method http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Inelastic pipeline Making implicit guard conditions explicit sReg1 inQ f0 f1 f2 sReg2 outQ rule sync-pipeline (!inQ.empty() && !outQ.full); inQ.deq(); sReg1 <= f0(inQ.first()); sReg2 <= f1(sReg1); outQ.enq(f2(sReg2)); endrule Suppose sReg1 and sReg2 have data, outQ is not full but inQ is empty. What behavior do you expect? Leave green and red data in the pipeline? http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Pipeline bubbles x sReg1 inQ f0 f1 f2 sReg2 outQ rule sync-pipeline (True); inQ.deq(); sReg1 <= f0(inQ.first()); sReg2 <= f1(sReg1); outQ.enq(f2(sReg2)); endrule Red and Green tokens must move even if there is nothing in inQ! Also if there is no token in sReg2 then nothing should be enqueued in the outQ Valid bits or the Maybe type Modify the rule to deal with these conditions http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Explicit encoding of Valid/Invalid data inQ f0 f1 f2 outQ sReg1 sReg2 typedef union tagged {void Valid; void Invalid; } Validbit deriving (Eq, Bits); rule sync-pipeline (True); if (inQ.notEmpty()) begin sReg1 <= f0(inQ.first()); inQ.deq(); sReg1f <= Valid end else sReg1f <= Invalid; sReg2 <= f1(sReg1); sReg2f <= sReg1f; if (sReg2f == Valid) outQ.enq(f2(sReg2)); endrule http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

When is this rule enabled? rule sync-pipeline (True); if (inQ.notEmpty()) begin sReg1 <= f0(inQ.first()); inQ.deq(); sReg1f <= Valid end else sReg1f <= Invalid; sReg2 <= f1(sReg1); sReg2f <= sReg1f; if (sReg2f == Valid) outQ.enq(f2(sReg2)); endrule sReg1 sReg2 inQ f0 f1 f2 outQ inQ sReg1f sReg2f outQ inQ sReg1f sReg2f outQ NE V V NF NE V V F NE V I NF NE V I F NE I V NF NE I V F NE I I NF NE I I F yes No Yes E V V NF E V V F E V I NF E V I F E I V NF E I V F E I I NF E I I F yes No Yes Yes1 NE = Not Empty; NF = Not Full Yes1 = yes but no change http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

The Maybe type A useful type to capture valid/invalid data typedef union tagged { void Invalid; data_T Valid; } Maybe#(type data_T); data valid/invalid Registers contain Maybe type values Some useful functions on Maybe type: isValid(x) returns true if x is Valid fromMaybe(d,x) returns the data value in x if x is Valid the default value d if x is Invalid http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Using the Maybe type data valid/invalid typedef union tagged { void Invalid; data_T Valid; } Maybe#(type data_T); data valid/invalid Registers contain Maybe type values rule sync-pipeline if (True); if (inQ.notEmpty()) begin sReg1 <= Valid f0(inQ.first()); inQ.deq(); end else sReg1 <= Invalid; sReg2 <= isValid(sReg1)? Valid f1(fromMaybe(d, sReg1)) : Invalid; if isValid(sReg2) outQ.enq(f2(fromMaybe(d, sReg2))); endrule http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

The Maybe type data using the pattern matching syntax typedef union tagged { void Invalid; data_T Valid; } Maybe#(type data_T); data valid/invalid Registers contain Maybe type values rule sync-pipeline if (True); if (inQ.notEmpty()) begin sReg1 <= Valid (f0(inQ.first())); inQ.deq(); end else sReg1 <= Invalid; case (sReg1) matches tagged Valid .sx1: sReg2 <= Valid f1(sx1); tagged Invalid : sReg2 <= Invalid; endcase case (sReg2) matches tagged Valid .sx2: outQ.enq(f2(sx2)); endcase endrule sx1 will get bound to the appropriate part of sReg1 http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Generalization: n-stage pipeline sReg[0] inQ sReg[1] outQ x f(0) f(1) f(2) f(n-1) ... sReg[n-2] rule sync-pipeline (True); if (inQ.notEmpty()) begin sReg[0]<= Valid f(1,inQ.first());inQ.deq();end else sReg[0]<= Invalid; for(Integer i = 1; i < n-1; i=i+1) begin case (sReg[i-1]) matches tagged Valid .sx: sReg[i] <= Valid f(i,sx); tagged Invalid: sReg[i] <= Invalid; endcase end case (sReg[n-2]) matches tagged Valid .sx: outQ.enq(f(n-1,sx)); endcase endrule http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Elastic pipeline Use FIFOs instead of pipeline registers x inQ fifo1 fifo2 outQ rule stage1 if (True); fifo1.enq(f1(inQ.first()); inQ.deq(); endrule rule stage2 if (True); fifo2.enq(f2(fifo1.first()); fifo1.deq(); endrule rule stage3 if (True); outQ.enq(f3(fifo2.first()); fifo2.deq(); endrule What is the firing condition for each rule? Can tokens be left inside the pipeline? No need for Maybe types http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Firing conditions for reach rule x fifo1 inQ f1 f2 f3 fifo2 outQ inQ fifo1 fifo2 outQ rule1 rule2 rule3 NE NE,NF NE,NF NF NE NE,NF NE,NF F NE NE,NF NE,F NF NE NE,NF NE,F F …. Yes Yes Yes Yes Yes No Yes No Yes Yes No No …. This is the first example we have seen where multiple rules may be ready to execute concurrently Can we execute multiple rules together? http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Informal analysis x fifo1 inQ f1 f2 f3 fifo2 outQ inQ fifo1 fifo2 outQ rule1 rule2 rule3 NE NE,NF NE,NF NF NE NE,NF NE,NF F NE NE,NF NE,F NF NE NE,NF NE,F F …. Yes Yes Yes Yes Yes No Yes No Yes Yes No No …. FIFOs must permit concurrent enq and deq for all three rules to fire concurrently http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Concurrency when the FIFOs do not permit concurrent enq and deq x fifo1 inQ f1 f2 f3 fifo2 outQ not empty not empty & not full not empty & not full not full At best alternate stages in the pipeline will be able to fire concurrently http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Pipelined designs expressed using Multiple rules If rules for different pipeline stages never fire in the same cycle then the design can hardly be called a pipelined design If all the enabled rules fire in parallel every cycle then, in general, wrong results can be produced http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

BSV Execution Model Repeatedly: Select a rule to execute Compute the state updates Make the state updates Highly non-deterministic; User annotations can be used in rule selection A legal behavior of a BSV program can be explained by observing the state updates obtained by applying only one rule at a time One-rule-at-time semantics http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Concurrent scheduling of rules The one-rule-at-a-time semantics plays the central role in defining functional correctness and verification but for meaningful hardware design it is necessary to execute multiple rules concurrently without violating the one-rule-at-a-time semantics What do we mean by concurrent scheduling? First - some hardware intuition Later – the semantics of concurrent scheduling http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Hardware intuition for concurrent scheduling December 31, 2013 http://csg.csail.mit.edu/6.s195/CDAC

Rule Execution Application of a rule modifies some state elements of the system in a deterministic manner f x f x next state computation reg en’s current state next state values nextState http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Executing multiple rules in one clock cycle Forwards old values unless updated current state next values rule2 f x rule 1 Next State mux Reg en Sequential composition ensures that no double updates are made on any register Parallel composition current state next values rule2 f x rule 1 merge NextState Reg en Sequential composition preserves the one-rule-at-a-time semantics but generally increases the critical combinational path Parallel composition does not create longer combinational paths but may not preserve the one-rule-at-a-time semantics http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Violation of sequential semantics rule ra x <= y; endrule rule rb y <= x; Suppose initially x is x0 and y is y0 {x0,y0} ra {y0,y0} rb {y0,y0} {x0,y0} rb {x0,x0} ra {x0,x0} {x0,y0} rb||ra {y0,x0} Parallel execution does not behave like either ra<rb or rb<ra We do not want to allow concurrent execution of ra and rb next time compiler analysis to determine which rules can be executed concurrently http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013

Area estimates Tool: Synopsys Design Compiler Comb. FFT Combinational area: 16536 Noncombinational area:  9279 Folded FFT Combinational area:       29330 Noncombinational area:     11603 Pipelined FFT Combinational area:       20610 Noncombinational area:     18558 Are the results surprising? Why is folded implementation not smaller? Explanation: Because of constant propagation optimization, each bfly4 gets reduced by 60% when twiddle factors are specified. Folded design disallows this optimization because of the sharing of bfly4’s http://csg.csail.mit.edu/6.s195/CDAC December 31, 2013