IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 4, 2013

Slides:

Advertisements

Similar presentations

March 2007http://csg.csail.mit.edu/arvindSemantics-1 Scheduling Primitives for Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Advertisements

Stmt FSM Richard S. Uhler Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology (based on a lecture prepared by Arvind)

Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology October 13, 2009http://csg.csail.mit.edu/koreaL12-1.

Modeling Processors Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 22, 2011L07-1

February 21, 2007http://csg.csail.mit.edu/6.375/L07-1 Bluespec-4: Architectural exploration using IP lookup Arvind Computer Science & Artificial Intelligence.

March, 2007http://csg.csail.mit.edu/arvindIPlookup-1 IP Lookup Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

September 24, L08-1 IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab.

IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 22, 2011L06-1.

February 22, 2005http://csg.csail.mit.edu/6.884/L07-1 Bluespec-1: Design Affects Everything Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

December 12, 2006http://csg.csail.mit.edu/6.827/L24-1 Scheduling Primitives for Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Pipelining combinational circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology February 20, 2013http://csg.csail.mit.edu/6.375L05-1.

Realistic Memories and Caches Li-Shiuan Peh Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology March 21, 2012L13-1

September 3, 2009L02-1http://csg.csail.mit.edu/korea Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial.

Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Multiple Clock Domains (MCD) Continued … Arvind with Nirav Dave Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology November.

March 6, 2006http://csg.csail.mit.edu/6.375/L10-1 Bluespec-4: Rule Scheduling and Synthesis Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Constructive Computer Architecture: Guards Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology September 24, 2014.

September 22, 2009http://csg.csail.mit.edu/koreaL07-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab.

Constructive Computer Architecture Sequential Circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology

March 4, 2008http://csg.csail.mit.edu/arvindDF3-1 Dynamic Dataflow Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Constructive Computer Architecture Sequential Circuits - 2 Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Constructive Computer Architecture: Control Hazards Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology October.

Non-blocking Caches Arvind (with Asif Khan) Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology May 14, 2012L25-1

Constructive Computer Architecture Store Buffers and Non-blocking Caches Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute.

Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

October 20, 2009L14-1http://csg.csail.mit.edu/korea Concurrency and Modularity Issues in Processor pipelines Arvind Computer Science & Artificial Intelligence.

Modeling Processors Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 1, 2010

October 6, 2009http://csg.csail.mit.edu/koreaL10-1 IP Lookup-2: The Completion Buffer Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Caches-2 Constructive Computer Architecture Arvind

Bluespec-3: A non-pipelined processor Arvind

Bluespec-6: Modeling Processors

Folded “Combinational” circuits

Scheduling Constraints on Interface methods

Blusepc-5: Dead cycles, bubbles and Forwarding in Pipelines Arvind

Sequential Circuits - 2 Constructive Computer Architecture Arvind

Sequential Circuits Constructive Computer Architecture Arvind

Sequential Circuits: Constructive Computer Architecture

Caches-2 Constructive Computer Architecture Arvind

IP Lookup: Some subtle concurrency issues

Stmt FSM Arvind (with the help of Nirav Dave)

Performance Specifications

Pipelining combinational circuits

Multirule Systems and Concurrent Execution of Rules

Bluespec-1: Design Affects Everything

Constructive Computer Architecture: Guards

Sequential Circuits Constructive Computer Architecture Arvind

Pipelining combinational circuits

Bluespec-4: Architectural exploration using IP lookup Arvind

Constructive Computer Architecture: Well formed BSV programs

Blusepc-5: Dead cycles, bubbles and Forwarding in Pipelines Arvind

Modules with Guarded Interfaces

Pipelining combinational circuits

Sequential Circuits - 2 Constructive Computer Architecture Arvind

Bluespec-3: A non-pipelined processor Arvind

Stmt FSM Arvind (with the help of Nirav Dave)

Caches-2 Constructive Computer Architecture Arvind

Modular Refinement - 2 Arvind

IP Lookup Arvind Computer Science & Artificial Intelligence Lab

IP Lookup: Some subtle concurrency issues

Constructive Computer Architecture: Guards

Control Hazards Constructive Computer Architecture: Arvind

Simple Synchronous Pipelines

Introduction to Bluespec: A new methodology for designing Hardware

IP Lookup: Some subtle concurrency issues

Modeling Processors Arvind

Modular Refinement Arvind

Implementing for Correct Concurrency

Simple Synchronous Pipelines

Caches-2 Constructive Computer Architecture Arvind

Presentation transcript:

IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 4, L08-1

IP Lookup block in a router Queue Manager Packet Processor Exit functions Control Processor Line Card (LC) IP Lookup SRAM (lookup table) Arbitration Switch LC A packet is routed based on the “Longest Prefix Match” (LPM) of it’s IP address with entries in a routing table Line rate and the order of arrival must be maintained line rate  15Mpps for 10GE March 4,

IP addressResultM Ref F F E C Sparse tree representation 3 A … A … B C … C … 5 D F … F … 14 A … A … 7 F … F … 200 F … F … F* E5.*.*.* D C * B A7.14.*.* F … F … F F … E A In this lecture: Level 1: 16 bits Level 2: 8 bits Level 3: 8 bits  1 to 3 memory accesses March 4,

“C” version of LPM int lpm (IPA ipa) /* 3 memory lookups */ { int p; /* Level 1: 16 bits */ p = RAM [ipa[31:16]]; if (isLeaf(p)) return value(p); /* Level 2: 8 bits */ p = RAM [ptr(p) + ipa [15:8]]; if (isLeaf(p)) return value(p); /* Level 3: 8 bits */ p = RAM [ptr(p) + ipa [7:0]]; return value(p); /* must be a leaf */ } Not obvious from the C code how to deal with - memory latency - pipelining … … … … 0 Must process a packet every 1/15 s or 67 ns Must sustain 3 memory dependent lookups in 67 ns Memory latency ~30ns to 40ns March 4,

Longest Prefix Match for IP lookup: 3 possible implementation architectures Rigid pipeline Inefficient memory usage but simple design Linear pipeline Efficient memory usage through memory port replicator Circular pipeline Efficient memory with most complex control Designer’s Ranking: 123 Which is “best”? March 4, Arvind, Nikhil, Rosenband & Dave [ICCAD 2004]

IP-Lookup module: Circular pipeline done? RAM fifo enter getResult cbuf put no getToken March 4, Completion buffer ensures that departures take place in order even if lookups complete out-of-order Since cbuf has finite capacity it gives out tokens to control the entry into the circular pipeline The fifo must also hold the “token” while the memory access is in progress: Tuple2#(Token,Bit#(16)) remainingIP

Completion buffer: Interface interface CBuffer#(type t); method ActionValue#(Token) getToken; method Action put(Token tok, t d); method ActionValue#(t) getResult; endinterface typedef Bit#(TLog#(n)) TokenN#(numeric type n); typedef TokenN#(16) Token; cbuf getResult getToken put (result & token) March 4,

Addr Ready ctr (ctr > 0) ctr++ ctr-- deq Enable enq Request-Response Interface for Synchronous Memory Synch Mem Latency N interface Mem#(type addrT, type dataT); method Action req(addrT x); method Action deq; method dataT peek; endinterface Data Ack Data Ready req deq peek Making a synchronous component latency- insensitive March 4,

IP-Lookup module: Interface methods done? RAM fifo enter getResult cbuf put no getToken March 4, module mkIPLookup(IPLookup); rule recirculate… ; method Action enter (IP ip); Token tok <- cbuf.getToken; ram.req(ip[31:16]); fifo.enq(tuple2(tok,ip[15:0])); endmethod method ActionValue#(Msg) getResult(); let result <- cbuf.getResult; return result; endmethod endmodule When can enter fire? cbuf, ram & fifo each has space

Circular Pipeline Rules: enter? done? RAM fifo When can recirculate fire? ram & fifo each has an element and ram and fifo, or cbuf has space Requires simultaneous enq and deq in the same rule! Is this possible? March 4, done? Is the same as isLeaf rule recirculate; match{.tok,.rip} = fifo.first; fifo.deq; ram.deq; if(isLeaf(ram.peek)) cbuf.put(tok, ram.peek); else begin fifo.enq(tuple2(tok,(rip << 8))); ram.req(ram.peek + rip[15:8]); end

Dead Cycles enter? done? RAM fifo March 4, rule recirculate; match{.tok,.rip} = fifo.first; fifo.deq; ram.deq; if(isLeaf(ram.peek)) cbuf.put(tok, ram.peek); else begin fifo.enq(tuple2(tok,(rip << 8))); ram.req(ram.peek + rip[15:8]); end Can a new request enter the system when an old one is leaving? Is this worth worrying about? method Action enter (IP ip); Token tok <- cbuf.getToken; ram.req(ip[31:16]); fifo.enq(tuple2(tok,ip[15:0])); endmethod No

The Effect of Dead Cycles enter done? RAM yes fifo no What is the performance loss if “exit” and “enter” don’t ever happen in the same cycle? >33% slowdown!Unacceptable Circular Pipeline RAM takes several cycles to respond to a request Each IP request generates 1-3 RAM requests FIFO entries hold base pointer for next lookup and unprocessed part of the IP address March 4,

So is there a dead cycle? enter? done? RAM fifo March 4, rule recirculate; match{.tok,.rip} = fifo.first; fifo.deq; ram.deq; if(isLeaf(ram.peek)) cbuf.put(tok, ram.peek); else begin fifo.enq(tuple2(tok,(rip << 8))); ram.req(ram.peek + rip[15:8]); end method Action enter (IP ip); Token tok <- cbuf.getToken; ram.req(ip[31:16]); fifo.enq(tuple2(tok,ip[15:0])); endmethod In general these two rules conflict but when isLeaf(p) is true there is no apparent conflict! Assuming cbuf.getToken and cbuf.put are CF

Rule Spliting rule foo (True); if (p) r1 <= 5; else r2 <= 7; endrule rule fooT (p); r1 <= 5; endrule rule fooF (!p); r2 <= 7; endrule  rule fooT and fooF can be scheduled independently with some other rule March 4,

Splitting the recirculate rule rule recirculate(!isLeaf(ram.peek)); match{.tok,.rip} = fifo.first; fifo.enq(tuple2(tok,(rip << 8))); ram.req(ram.peek + rip[15:8]); fifo.deq; ram.deq; endrule rule exit (isLeaf(ram.peek)); match{.tok,.rip} = fifo.first; cbuf.put(tok, ram.peek); fifo.deq; ram.deq; endrule Rule exit and method enter can execute concurrently, if cbuf.put and cbuf.getToken can execute concurrently March 4, method Action enter (IP ip); Token tok <- cbuf.getToken; ram.req(ip[31:16]); fifo.enq(tuple2 (tok,ip[15:0])); endmethod This rule is a valid only if enq and deq can be enable simultaneously and execute concurrently

Concurrent FIFO methods pipelined FIFO rule foo (True); f.enq (5) ; f.deq; endrule  f.notFull can be calculated only after knowing if f.deq fires or not, i.e. there is a combinational path from enable of f.deq to f.notFull Firing condition for rule foo has to be independent of the body March 4, rule foo (f.notFull && f.notEmpty); f.enq (5) ; f.deq; endrule make implicit conditions explicit Can foo be enabled?

Concurrent FIFO methods CF FIFO rule foo (True); f.enq (5) ; f.deq; endrule  The firing condition for rule foo is independent of the body The FIFO in the IP lookup must therefore be CF March 4, rule foo (f.notFull && f.notEmpty); f.enq (5) ; f.deq; endrule make implicit conditions explicit Can foo be enabled?

cbuf Completion buffer: Interface interface CBuffer#(type t); method ActionValue#(Token) getToken; method Action put(Token tok, t d); method ActionValue#(t) getResult; endinterface getResult getToken put (result & token) March 4, For no dead cycles cbuf.getToken and cbuf.put and cbuf.getResult must be able to execute concurrently

cbuf Completion buffer: Concurrency requirements getResult getToken put (result & token) March 4, For no dead cycles cbuf.getToken and cbuf.put and cbuf.getResult must be able to execute concurrently If we make these methods CF then every thing will work concurrently, i.e. ( enter CF exit), (enter CF getResult) and (exit CF getResult) However CF methods are hard to design. Suppose ( getToken < put), (getToken < getResult) and (put < getResult) then ( enter < exit), (enter < getResult) and (exit < getResult) In fact, any ordering will work

Completion buffer: Implementation IIVIVIIIVIVI cnt iidx ridx buf A circular buffer with two pointers iidx and ridx, and a counter cnt Elements are of Maybe type module mkCompletionBuffer(CompletionBuffer#(size)); Vector#(size, EHR#(Maybe#(t))) cb <- replicateM(mkEHR(Invalid)); Reg#(Bit#(TAdd#(TLog#(size),1))) iidx <- mkReg(0); Reg#(Bit#(TAdd#(TLog#(size),1))) ridx <- mkReg(0); EHR#(Bit#(TAdd#(TLog#(size),1))) cnt <- mkEHR(0); Integer vsize = valueOf(size); Bit#(TAdd#(TLog#(size),1)) sz = fromInteger(vsize); rules and methods... endmodule March 4,

Completion Buffer cont method ActionValue#(t) getToken() if(cnt[0]!==sz); cb[iidx][0] <= Invalid; iidx <= iidx==sz-1 ? 0 : iidx + 1; cnt[0] <= cnt[0] + 1; return iidx; endmethod method Action put(Token idx, t data);  cb[idx][1] <= Valid data; endmethod method ActionValue#(t) getResult() if(cnt[1] !== 0  &&&(cb[ridx][2] matches tagged (Valid.x)); cb[ridx][2] <= Invalid; ridx <= ridx==sz-1 ? 0 : ridx + 1; cnt[1] <= cnt[1] – 1; return x; endmethod getToken < put put < getResult getToken < getResult March 4, Concurrency properties?

Longest Prefix Match for IP lookup: 3 possible implementation architectures Rigid pipeline Inefficient memory usage but simple design Linear pipeline Efficient memory usage through memory port replicator Circular pipeline Efficient memory with most complex control Which is “best”? Arvind, Nikhil, Rosenband & Dave [ICCAD 2004] March 4,

Implementations of Static pipelines Two designers, two results LPM versions Best Area (gates) Best Speed (ns) Static V (Replicated FSMs) Static V (Single FSM) Replicated: RAM FSM MUX / De-MUX FSM Counter MUX / De-MUX result IP addr FSM RAM MUX result IP addr BEST: Each packet is processed by one FSM Shared FSM March 4,

Synthesis results LPM versions Code size (lines) Best Area (gates) Best Speed (ns) Mem. util. (random workload) Static V % Static BSV (5% larger)3.32 (7% faster)63.5% Linear V % Linear BSV (8% larger)4.7 (same)99.9% Circular V % Circular BSV (1% larger)3.67 (2% slower)99.9% Synthesis: TSMC 0.18 µm lib - Bluespec results can match carefully coded Verilog - Micro-architecture has a dramatic impact on performance - Architecture differences are much more important than language differences in determining QoR V = Verilog; BSV = Bluespec System Verilog March 4,