Bridging the gap between asynchronous design and designers

Peter A. Beerel, Fulcrum Microsystems, Calabasas Hills, CA, USA
Jordi Cortadella, Universitat Politècnica de Catalunya, Barcelona, Spain
Alex Kondratyev, Cadence Berkeley Labs, Berkeley, CA, USA

Outline
Basic concepts on asynchronous circuit design
Tea Break
Logic synthesis from concurrent specifications
Synchronization of complex systems
Lunch
Design automation for asynchronous circuits
Tea Break
Industrial experiences

Basic concepts on asynchronous circuit design

Outline
What is an asynchronous circuit?
Asynchronous communication
Asynchronous design styles (Micropipelines)
Asynchronous logic building blocks
Control specification and implementation
Delay models and classes of async circuits
Channel-based design
Why asynchronous circuits?

Synchronous circuit
[Figure: registers (R) and combinational logic (CL) driven by a global clock CLK]
Implicit (global) synchronization between blocks
Clock period > Max Delay (CL + R)

Asynchronous circuit
[Figure: pipeline of registers (R) and combinational logic (CL) linked by Req and Ack signals]
Explicit (local) synchronization: Req / Ack handshakes

Motivation for asynchronous
Asynchronous design is often unavoidable: asynchronous interfaces, arbiters, etc.
Modern clocking is multi-phase and distributed, and virtually ‘asynchronous’ (cf. GALS, next slide):
  Mesochronous (clock travels together with data)
  Local (possibly stretchable) clock generation
Robust asynchronous design flows are coming (e.g. VLSI programming from Philips, Balsa from Univ. of Manchester, NCL from Theseus Logic, …)

Globally Async Locally Sync (GALS)
Asynchronous World Clocked Domain Req1 Req3 R CL R Ack1 Ack3 Local CLK Req4 Req2 Note that interfaces might be otherwise encoded (e.g., dual-rail) Ack4 Ack2 Async-to-sync Wrapper

Key Design Differences
Synchronous logic design:
  Proceeds without taking timing correctness (hazards, signal ack-ing, etc.) into account
  Combinational logic and memory latches (registers) are built separately
  Static timing analysis of CL is sufficient to determine the Max Delay (clock period)
  Fixed set-up and hold conditions for latches

Key Design Differences
Asynchronous logic design:
  Must ensure hazard-freedom, signal ack-ing, local timing constraints
  Combinational logic and memory latches (registers) are often mixed in “complex gates”
  Dynamic timing analysis of logic is needed to determine relative delays between paths
  To avoid complex issues, circuits may be built as delay-insensitive and/or speed-independent (as discussed later)

Verification and Testing Differences
Synchronous logic verification and testing:
  Only the functional correctness aspect is verified and tested
  Testing can be done with standard ATE and at low speed (but high speed may be required for DSM)
Asynchronous logic verification and testing:
  In addition to functional correctness, the temporal aspect is crucial: e.g. causality and order, deadlock-freedom
  Testing must cover faults in complex gates (logic + memory) and must proceed at normal operation rate
  Delay fault testing may be needed

Synchronous communication
1 1 1 Clock edges determine the time instants where data must be sampled Data wires may glitch between clock edges (set–up/hold times must be satisfied) Data are transmitted at a fixed rate (clock frequency)

Dual rail
Two wires, each at L (low) or H (high), per bit
“LL” = “spacer”, “LH” = “0”, “HL” = “1”
n-bit data communication requires 2n wires
Each bit is self-timed
An acknowledgement signal is still needed to complete the handshake
Other delay-insensitive codes exist (e.g. k-of-n), as does event-based signalling (choice criteria: pin and power efficiency)
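The dual-rail code on this slide can be sketched in a few lines of Python. This is only an illustration of the encoding table ("LL", "LH", "HL"); the helper names `encode_word`, `decode_word` and `is_valid` are made up for this example, and "HH" is treated as an illegal codeword.

```python
# Dual-rail code from the slide: each bit uses two wires.
# "LL" = spacer, "LH" = 0, "HL" = 1; "HH" is not a legal codeword.
SPACER = "LL"
ENCODE = {0: "LH", 1: "HL"}
DECODE = {"LH": 0, "HL": 1}


def encode_word(bits):
    """Encode a list of bits into dual-rail codewords (2n wires for n bits)."""
    return [ENCODE[b] for b in bits]


def is_valid(word):
    """A word is valid once every bit pair carries a data codeword."""
    return all(pair in DECODE for pair in word)


def decode_word(word):
    """Decode a complete dual-rail word; raise if any pair is not data."""
    if not is_valid(word):
        raise ValueError("word is not complete yet")
    return [DECODE[pair] for pair in word]


word = encode_word([1, 0, 1])
print(word)                             # ['HL', 'LH', 'HL']
print(decode_word(word))                # [1, 0, 1]
print(is_valid(["HL", SPACER, "HL"]))   # False
```

Because validity is carried by the data wires themselves, the receiver can tell when a word has arrived without a separate request wire.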

Bundled data
Validity signal: similar to an aperiodic local clock
n-bit data communication requires n+1 wires
Data wires may glitch when not valid
Signaling protocols:
  level sensitive (latch)
  transition sensitive (register): 2-phase / 4-phase

Example: memory read cycle
Valid address Address A A Valid data Data D D Transition signaling, 4-phase

Example: memory read cycle
Valid address Address A A Valid data Data D D Transition signaling, 2-phase

Asynchronous modules
[Figure: DATA PATH (Data IN, Data OUT) with start and done signals, driven by a CONTROL block with req in, ack in, req out, ack out]
Signaling protocol:
  reqin+ start+ [computation] done+ reqout+ ackout+ ackin+
  reqin- start- [reset] done- reqout- ackout- ackin-
(more concurrency is also possible)

Asynchronous latches: C element
Vdd A B C A B Z Z B A Z B A A B Z+ Z Z Z Static Logic Implementation A B [van Berkel 91] Gnd
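The behavior of the C element can be summarized in a small behavioral model. This is a sketch of the logical function only (output follows the inputs when they agree, and holds otherwise), not of the transistor-level implementation in the figure; the class name `CElement` is an assumption made for this example.

```python
class CElement:
    """Behavioral model of a two-input Muller C element."""

    def __init__(self, z=0):
        self.z = z  # current output value

    def update(self, a, b):
        # When both inputs agree, the output takes their value;
        # when they differ, the output keeps its previous value.
        if a == b:
            self.z = a
        return self.z


c = CElement()
print(c.update(1, 0))  # 0: inputs differ, output holds
print(c.update(1, 1))  # 1: both high, output rises
print(c.update(0, 1))  # 1: inputs differ, output holds
print(c.update(0, 0))  # 0: both low, output falls
```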

C-element: Other implementations
B Gnd Vdd Z Vdd A Weak inverter B Z B Dynamic A Quasi-Static Gnd

Dual-rail logic Dual-rail AND gate
A.f B.t B.f C.t C.f Dual-rail AND gate Valid behavior for monotonic environment

Completion detection Dual-rail logic C done • •
Completion detection tree • •
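A small sketch of dual-rail logic with completion detection follows. It assumes one common gate-level form of the dual-rail AND (C.t = A.t AND B.t, C.f = A.f OR B.f), which may differ from the gate drawn on the slide, and it models the completion tree behaviorally: done rises when every bit is valid and falls when every bit is back at the spacer. Values are (t, f) pairs, and all names are illustrative.

```python
SPACER, ZERO, ONE = (0, 0), (0, 1), (1, 0)  # (t, f) rail pairs


def dual_rail_and(a, b):
    """Dual-rail AND, assuming a monotonic environment (spacer <-> data)."""
    a_t, a_f = a
    b_t, b_f = b
    return (a_t & b_t, a_f | b_f)


class CompletionDetector:
    """Per-bit OR of the two rails, merged by a C-element-like tree."""

    def __init__(self):
        self.done = 0

    def update(self, word):
        valid = [t | f for (t, f) in word]
        if all(valid):        # every bit carries data: done rises
            self.done = 1
        elif not any(valid):  # every bit is a spacer: done falls
            self.done = 0
        return self.done      # otherwise hold the previous value


print(dual_rail_and(ONE, ZERO))     # (0, 1), i.e. logical 0
detector = CompletionDetector()
print(detector.update([ONE, SPACER]))     # 0: word not complete yet
print(detector.update([ONE, ZERO]))       # 1: all bits valid
print(detector.update([SPACER, SPACER]))  # 0: back to spacer
```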

Differential cascode voltage switch logic
start Z.f Z.t done A.t C.f B.f A.f N-type transistor network B.t C.t start 3–input AND/NAND gate

Examples of dual-rail design
Asynchronous dual-rail ripple-carry adder (A. Martin, 1991)
  Critical delay is proportional to log N (N = number of bits)
  32-bit adder delay (1.6 µm MOSIS CMOS): 11 ns versus 40 ns for synchronous
  Async cell transistor count = 34 versus synchronous = 28
More recent success stories (modularity and automatic synthesis) of dual-rail logic come from Null-Convention Logic (Theseus Logic)

Bundled-data logic blocks
Single-rail logic • • delay start done Conventional logic + matched delay

Micropipelines (Sutherland 89)
Micropipeline (2-phase) control blocks r1 r2 g1 g2 d1 d2 Request-Grant-Done (RGD)Arbiter C Join Merge in outf outt sel r1 r2 r a a1 a2 out0 in out1 Select Toggle Call

Micropipelines (Sutherland 89)
Aout delay delay Ain C C L logic L logic L logic L C C Rin delay Rout

Data-path / Control L logic L logic L logic L Rin Rout CONTROL Aout
Ain

Control specification
B+ A– B A input B output B–

B– A B A– B+

B+ A C+ C C A– B– B C–

B+ A C+ C C A– B B– C–

Ri+ Ao+ Ri- Ao- Ro+ Ai+ Ro- Ai- Ri Ro Ao Ai FIFO cntrl C Ri Ro Ai Ao

A simple filter: specification
[Figure: filter block with input channel IN (Rin, Ain) and output channel OUT (Rout, Aout)]
y := 0;
loop
    x := READ (IN);
    WRITE (OUT, (x+y)/2);
    y := x;
end loop
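The filter specification translates almost line for line into Python. In this sketch the channels IN and OUT are replaced by an input iterable and a generator, which is an assumption made for illustration; the handshake signals are not modeled.

```python
def filter_stream(samples):
    """Average each input sample with the previous one (y starts at 0)."""
    y = 0
    for x in samples:        # x := READ(IN)
        yield (x + y) / 2    # WRITE(OUT, (x + y) / 2)
        y = x                # y := x


print(list(filter_stream([4, 8, 2])))  # [2.0, 6.0, 5.0]
```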

A simple filter: block diagram
x y + control Rin Ain Rout Aout Rx Ax Ry Ay Ra Aa IN OUT x and y are level-sensitive latches (transparent when R=1) + is a bundled-data adder (matched delay between Ra and Aa) Rin indicates the validity of IN After Ain+ the environment is allowed to change IN (Rout,Aout) control a level-sensitive latch at the output

A simple filter: control spec.
x y + control Rin Ain Rout Aout Rx Ax Ry Ay Ra Aa IN OUT Rin+ Ain+ Rin– Ain– Rx+ Ax+ Rx– Ax– Ry+ Ay+ Ry– Ay– Ra+ Aa+ Ra– Aa– Rout+ Aout+ Rout– Aout–

A simple filter: control impl.
Rin Ain Rx Ax Ry Ay Aa Ra Aout Rout Rin+ Ain+ Rin– Ain– Rx+ Ax+ Rx– Ax– Ry+ Ay+ Ry– Ay– Ra+ Aa+ Ra– Aa– Rout+ Aout+ Rout– Aout–

Taking delays into account
x+ x– y+ y– z+ z– x z y x’ z’ Delay assumptions: Environment: 3 time units Gates: 1 time unit events: x+  x’–  y+  z+  z’–  x–  x’+  z–  z’+  y–  time:

Taking delays into account
x+ x– y+ y– z+ z– x’ x y z’ z very slow Delay assumptions: unbounded delays events: x+  x’–  y+  z+  x–  x’+  y– failure ! time:

Gate vs wire delay models
Gate delay model: delays in gates, no delays in wires Wire delay model: delays in gates and wires

Delay models for async. circuits
Bounded delays (BD): realistic for gates and wires.
  Technology mapping is easy, verification is difficult
Speed independent (SI): unbounded (pessimistic) delays for gates and “negligible” (optimistic) delays for wires.
  Technology mapping is more difficult, verification is easy
Delay insensitive (DI): unbounded (pessimistic) delays for gates and wires.
  DI class (built out of basic gates) is almost empty
Quasi-delay insensitive (QDI): delay insensitive except for critical wire forks (isochronic forks).
  In practice it is the same as speed independent
[Figure: diagram relating the classes BD, DI and SI ≈ QDI]

Synchronization and communication between blocks
Channel-Based Design
[Figure: a synchronous system driven by a clock next to an asynchronous system whose blocks are linked by asynchronous channels]
Synchronous system: the blocks, and the communication of data between them, are controlled by a global clock
Asynchronous system: the clock is replaced with handshaking. Request/acknowledge protocols synchronize computation among blocks and send data between blocks
Synchronization and communication between blocks are implemented with handshaking over asynchronous channels, by sending/receiving “data tokens”

Channel Design – Single Rail
[Figure: sender and receiver connected by Req, Ack and Data wires; waveform with handshake phases 1 to 4 and the interval in which Data is stable]
4-phase bundled-data channel
Features:
  One request wire
  One wire per data bit
  One acknowledgment wire
  Has timing assumptions

Channel Design: Dual Rail & 1-of-N
Dual rail
  Two wires per data bit
  One acknowledgment wire
  Advantage: supports delay-insensitive design
1-of-N
  Generalization of dual-rail
Dual-rail encoding:
  DataT  DataF  Logical value
  0      0      Reset
  0      1      0
  1      0      1
  1      1      Invalid
[Figure: 4-phase 1-of-N channel between sender and receiver (Data, Ack), with handshake phases 1 to 4]

Anatomy of a Channel-Based Asynchronous Design
Architecture is typically a multi-level hierarchy of communicating blocks
Yields a hierarchical netlist of cells, where at each level blocks communicate along channels
[Figure: hierarchy from the ASIC down to leaf cells connected by channels; labels: Main FSM, Register Bank, Memory, Adder/Mult., Subtract/Divider, Reg A, Reg B, Reg C, Adder, Multiplier, BN-1, BN-2, BN-3, FAN-1, FAN-2, FAN-3, FA0]

Asynchronous Cells F F Output Input Channels Channels Definition
Smallest element that communicates with its neighbors along asynchronous channels Functionality Reads a subset of input channels Computes F and writes to a subset of output channels Linear Pipelines Only one input and one output channel F

Cells for Non-Linear Pipelines
Joins and Forks Conditional Joins: Read only some of the input channels Conditional Splits: Write only to some of the output channels F F Join Fork F F Conditional Join Conditional Split

Template-Based Leaf-Cell Design
Each pipeline style (QDI, timed…) has a different blueprint Create a library using a blueprint to implement the lowest level communicating blocks RCD F LCD C 2-input 1-output pipeline stage 1-input 2-output pipeline stage C LCD RCD F Blueprint for a QDI N-input M-output pipeline stage

Template-Based Leaf-Cell Design
Pros
  Enables fine-grain 2-D pipelining, yielding high performance
  Simplifies logic synthesis by enabling simple control circuit generation and re-use of typical datapath synthesis
  Leaf cells can be laid out and verified to create a leaf-cell library, localizing timing assumptions
Cons
  Unified template may not be optimal in all cases
  In particular, less effective for non-pipelined architectures with more complicated control

Motivation (designer’s view)
Modularity for system-on-chip design
  Plug-and-play interconnectivity
Average-case performance
  No worst-case delay synchronization
Many interfaces are asynchronous
  Buses, networks, ...

Motivation (technology aspects)
Low power
  Automatic clock gating
Electromagnetic compatibility
  No peak currents around clock edges
Security
  No ‘electro-magnetic difference’ between logical ‘0’ and ‘1’ in dual-rail code
Robustness
  High immunity to technology and environment variations (temperature, power supply, ...)

Dissuasion
Concurrent models for specification
  CSP, Petri nets, ...: no more FSMs
Difficult to design
  Hazards, synchronization
Complex timing analysis
  Difficult to estimate performance
Difficult to test
  No way to stop the clock

But ... some successful stories
Philips
AMULET microprocessors
Sharp
Intel (RAPPID)
Start-up companies: Theseus Logic, Fulcrum Microsystems, Self-Timed Solutions
Recent blurb: It's Time for Clockless Chips, by Claire Tristram (MIT Technology Review, v. 104, no. 8, October 2001: oct01/tristram.asp)

Logic synthesis from concurrent specifications
Jordi Cortadella Universitat Politecnica de Catalunya Barcelona, Spain In collaboration with M. Kishinevsky, A. Kondratyev, L. Lavagno and A. Yakovlev

Outline Overview of the synthesis flow Specification
State graph and next-state functions State encoding Implementability conditions Speed-independent circuit Complex gates C-element architecture Review of some advanced topics

Book and synthesis tool
J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno and A. Yakovlev, Logic synthesis for asynchronous controllers and interfaces, Springer-Verlag, 2002 petrify:

Design flow Specification (STG) Reachability analysis State Graph
State encoding SG with CSC Boolean minimization Next-state functions Logic decomposition Decomposed functions Technology mapping Gate netlist

Specification x x y y z z z+ x- x+ y+ z- y-
Signal Transition Graph (STG)

Token flow x y z x+ x- y+ y- z+ z-

State graph xyz 000 100 101 110 111 x+ x- y+ y- z+ z- 001 011 010 x+

Next-state functions xyz 000 100 101 110 111 001 011 010 x+ z+ y+ x-

Gate netlist x y z

VME bus Device Read Cycle Bus DSr LDS LDTACK D DTACK Data Transceiver
Controller DSw LDTACK DTACK

STG for the READ cycle DSr+ DTACK- LDS+ LDTACK+ D+ DTACK+ DSr- D-
VME Bus Controller LDTACK DTACK

Choice: Read and Write cycles
DSr+ DSw+ DTACK- DTACK- LDS+ D+ LDS- LDTACK- LDTACK+ LDS+ LDS- LDTACK- D+ LDTACK+ DTACK+ D- DSr- DTACK+ D- DSw-

Choice: Read and Write cycles
DTACK- DSr+ LDS+ LDTACK+ D+ DTACK+ DSr- D- LDS- LDTACK- DSw+ DSw-

Circuit synthesis Goal:
Derive a hazard-free circuit under a given delay model and mode of operation

Speed independence Delay model Conditions for implementability:
Unbounded gate / environment delays Certain wire delays shorter than certain paths in the circuit Conditions for implementability: Consistency Complete State Coding Persistency

STG for the READ cycle DSr+ DTACK- LDS+ LDTACK+ D+ DTACK+ DSr- D-
VME Bus Controller LDTACK DTACK

Binary encoding of signals
LDS = 0 LDS = 1 LDS - LDS + DSr+ DTACK- LDS+ LDTACK- LDTACK- LDTACK- DSr+ DTACK- LDTACK+ LDS- LDS- LDS- DSr+ DTACK- D+ D- DTACK+ DSr-

Binary encoding of signals
10000 DSr+ DTACK- LDS+ LDTACK- LDTACK- LDTACK- 10010 DSr+ DTACK- 01100 00110 LDTACK+ LDS- LDS- LDS- DSr+ DTACK- 10110 01110 10110 D+ D- DTACK+ DSr- (DSr , DTACK , LDTACK , LDS , D)

Excitation / Quiescent Regions
ER (LDS+) ER (LDS-) QR (LDS+) QR (LDS-) LDS- LDS+

Next-state function 0  1 LDS- LDS+ 0  0 1  1 1  0 10110

Karnaugh map for LDS - 1 - - - 1 - - - - - - - - - - - - - 1 1 1 - -
DTACK DSr D LDTACK 00 01 11 10 DTACK DSr D LDTACK 00 01 11 10 - 1 - - - 1 - - - - - - - - - - - - - 1 1 1 - - 0/1?

Concurrency reduction
DSr+ LDS+ LDS- LDS- LDS- 10110 10110

Concurrency reduction
DSr+ DTACK- LDS+ LDTACK+ D+ DTACK+ DSr- D- LDTACK- LDS-

State encoding conflicts
LDS+ LDTACK- LDTACK+ LDS- 10110 10110
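The conflict on code 10110 can be reproduced with a short token-game sketch. The arcs of the STG did not survive in the text, so the arc list and initial marking below are an assumption: they follow the usual form of the VME read-cycle example, with all signals initially 0 and DSr, LDTACK as inputs. The code enumerates the reachable states and reports binary codes shared by states that enable different output events.

```python
from collections import deque

SIGNALS = ["DSr", "DTACK", "LDTACK", "LDS", "D"]  # code bit order
INPUTS = {"DSr", "LDTACK"}

# Assumed causality arcs of the read-cycle STG (a marked graph).
ARCS = [
    ("DSr+", "LDS+"), ("LDS+", "LDTACK+"), ("LDTACK+", "D+"),
    ("D+", "DTACK+"), ("DTACK+", "DSr-"), ("DSr-", "D-"),
    ("D-", "DTACK-"), ("D-", "LDS-"), ("DTACK-", "DSr+"),
    ("LDS-", "LDTACK-"), ("LDTACK-", "LDS+"),
]
INITIAL_MARKING = frozenset({("DTACK-", "DSr+"), ("LDTACK-", "LDS+")})
TRANSITIONS = sorted({t for arc in ARCS for t in arc})


def enabled(marking):
    """Transitions whose input arcs all carry a token."""
    return [t for t in TRANSITIONS
            if all(arc in marking for arc in ARCS if arc[1] == t)]


def fire(marking, code, t):
    """Move tokens across transition t and update the signal's bit."""
    consumed = {arc for arc in ARCS if arc[1] == t}
    produced = {arc for arc in ARCS if arc[0] == t}
    bits = list(code)
    bits[SIGNALS.index(t[:-1])] = 1 if t.endswith("+") else 0
    return frozenset((marking - consumed) | produced), tuple(bits)


def state_graph():
    """Breadth-first reachability analysis from the initial state."""
    start = (INITIAL_MARKING, (0,) * len(SIGNALS))
    seen, queue = {start}, deque([start])
    while queue:
        marking, code = queue.popleft()
        for t in enabled(marking):
            nxt = fire(marking, code, t)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def csc_conflicts(states):
    """Codes shared by states that enable different output events."""
    by_code = {}
    for marking, code in states:
        outs = frozenset(t for t in enabled(marking) if t[:-1] not in INPUTS)
        by_code.setdefault("".join(map(str, code)), set()).add(outs)
    return sorted(c for c, outs in by_code.items() if len(outs) > 1)


states = state_graph()
print(len(states))            # 14
print(csc_conflicts(states))  # ['10110']
```

In one of the two states coded 10110 the output D+ is enabled, in the other LDS- is enabled, so the next-state functions cannot be defined from the signal values alone.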

Signal Insertion 101101 101100 CSC- CSC+ LDS+ LDTACK- LDTACK+ LDS- D-
DSr-

Complex-gate implementation

Implementability conditions
Consistency Rising and falling transitions of each signal alternate in any trace Complete state coding (CSC) Next-state functions correctly defined Persistency No event can be disabled by another event (unless they are both inputs)
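The consistency condition is easy to check on a single trace. The sketch below assumes signals start at 0 unless an initial value is given, and uses event names of the form "a+" / "a-"; the function name `is_consistent` is illustrative.

```python
def is_consistent(trace, initial=None):
    """True if rising and falling transitions of each signal alternate."""
    value = dict(initial or {})
    for event in trace:
        signal, target = event[:-1], int(event.endswith("+"))
        if value.get(signal, 0) == target:
            return False  # e.g. a+ while a is already 1
        value[signal] = target
    return True


read_cycle = ["DSr+", "LDS+", "LDTACK+", "D+", "DTACK+",
              "DSr-", "D-", "DTACK-", "LDS-", "LDTACK-"]
print(is_consistent(read_cycle))        # True
print(is_consistent(["a+", "b+", "a+"]))  # False
```

A full check would apply this to every trace of the state graph rather than to one sample trace.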

Implementability conditions
Consistency + CSC + persistency
⇒ There exists a speed-independent circuit that implements the behavior of the STG
(under the assumption that any Boolean function can be implemented with one complex gate)

Persistency 100 000 001 a- c+ a b c b+ a c b is this a pulse ?
Speed independence ⇒ glitch-free output behavior under any delay

a+ b+ c+ d+ a- b- d- c- 0000 1000 1100 0100 0110 0111 1111 1011 0011 1001 0001 a+ b+ c+ a- b- c- d- d+

ab cd 00 01 11 10 1 0000 1000 1100 0100 0110 0111 1111 1011 0011 1001 0001 a+ b+ c+ a- b- c- d- d+ ER(d+) ER(d-)

ab 0000 1000 1100 0100 0110 0111 1111 1011 0011 1001 0001 a+ b+ c+ a- b- c- d- d+ cd 00 01 11 10 00 1 01 1 1 1 1 11 1 10 Complex gate

Implementation with C elements
[Figure: C element with inputs S and R and output z]
• • • → S+ → z+ → S- → R+ → z- → R- → • • •
S (set) and R (reset) must be mutually exclusive
S must cover ER(z+) and must not intersect ER(z-) ∪ QR(z-)
R must cover ER(z-) and must not intersect ER(z+) ∪ QR(z+)

ab 0000 1000 1100 0100 0110 0111 1111 1011 0011 1001 0001 a+ b+ c+ a- b- c- d- d+ cd 00 01 11 10 00 1 01 1 1 1 1 11 1 10 S d C R

ab 0000 1000 1100 0100 0110 0111 1111 1011 0011 1001 0001 a+ b+ c+ a- b- c- d- d+ cd 00 01 11 10 00 1 01 1 1 1 1 11 1 10 C S R d Monotonic covers

C-based implementations
R d c d C b a c weak d c weak d a a b generalized C elements (gC)

Speed-independent implementations
Implementability conditions Consistency Complete state coding Persistency Circuit architectures Complex (hazard-free) gates C elements with monotonic covers ...

Synthesis exercise y- z- w- y+ x+ z+ x- w+ 1001 1000 1010 0001 0000
0101 0010 0100 0110 y- y+ x- x+ w+ w- z+ z- 1011 0011 0111 Derive circuits for signals x and z (complex gates and monotonic covers)

Synthesis exercise - - - - 1 1 1 1 1 1 1001 1000 1010 0001 0000 0101
0010 0100 0110 y- y+ x- x+ w+ w- z+ z- 1011 wx yz 00 01 11 10 1 1 - 00 1 1 - 0011 01 - 11 1 1 - 10 0111 Signal x

Synthesis exercise - - - - 1 1 1 1 1001 1000 1010 0001 0000 0101 0010
0100 0110 y- y+ x- x+ w+ w- z+ z- 1011 wx yz 00 01 11 10 - 00 0011 - 01 1 - 1 1 11 1 - 10 0111 Signal z

Logic decomposition: example
y- y- 1001 1011 z- w- 1000 0001 w+ y+ w- z- x+ z- w- w+ 1010 0000 0101 0011 w- y+ x+ z- 0010 0100 y+ x+ x- x- x+ y+ z+ 0110 0111 z+

1001 1011 1000 1010 0001 0000 0101 0010 0100 0110 0111 0011 y- y+ x- x+ w+ w- z+ z- y- yz=1 yz=0 1001 1011 y w z- w- z y 1000 0001 w+ y+ w- z- x+ z x w 1010 0000 0101 0011 w- y+ x+ z- w C z y 0010 0100 x- z x+ y+ z+ 0110 0111 y C x z y

y- w s 1001 1011 y z- s- z 1001 w+ 1000 z- y+ s- w- x w 0011 1000 0001 1010 y+ s- w- z- x+ w x- C 1010 0000 0101 z y z w- y+ x+ z- 0111 0010 0100 y x+ y+ s+ C x z s=0 z+ 0111 y 0110

y- y- 1001 1011 z- s- s- 1001 w+ 1000 z- y+ s- w- z- w- w+ 0011 1000 0001 1010 y+ s- w- z- x+ x- 1010 0000 0101 y+ x+ x- w- y+ x+ z- 0111 0010 0100 x+ y+ s+ s+ z+ s=0 z+ 0111 0110

Speed-independent Netlist
DSr+ DTACK- LDS+ LDTACK+ D+ DTACK+ DSr- D- LDTACK- LDS- D DTACK LDS map csc DSr LDTACK

Adding timing assumptions
DSr+ DTACK- LDS+ LDTACK+ D+ DTACK+ DSr- D- LDTACK- before DSr+ LDTACK- LDS- D DTACK FAST SLOW LDS map csc DSr LDTACK

Adding timing assumptions
DSr+ DTACK- LDS+ LDTACK+ D+ DTACK+ DSr- D- LDTACK- before DSr+ LDTACK- LDS- D DTACK LDS map csc DSr LDTACK

State space domain LDTACK- before DSr+ DSr+ LDTACK-

State space domain LDTACK- before DSr+ DSr+ LDTACK-
Two more unreachable states

Boolean domain - 1 - - - 1 - - - - - - - - - - - - - 1 1 1 - - 0/1?
LDS = 0 LDS = 1 DTACK DSr D LDTACK 00 01 11 10 DTACK DSr D LDTACK 00 01 11 10 - 1 - - - 1 - - - - - - - - - - - - - 1 1 1 - - 0/1?

Boolean domain - 1 - - - 1 - - - - - - - - - - - - - 1 1 1 - - - 1
LDS = 0 LDS = 1 DTACK DSr D LDTACK 00 01 11 10 DTACK DSr D LDTACK 00 01 11 10 - 1 - - - 1 - - - - - - - - - - - - - 1 1 1 - - - 1 One more DC vector for all signals One state conflict is removed

Netlist with one constraint
DSr+ DTACK- LDS+ LDTACK+ D+ DTACK+ DSr- D- LDTACK- LDS- D DTACK LDS map csc DSr LDTACK

Netlist with one constraint
DSr+ DTACK- LDS+ LDTACK+ D+ DTACK+ DSr- D- LDTACK- LDS- D DTACK LDTACK- before DSr+ TIMING CONSTRAINT LDS DSr LDTACK

Conclusions
STGs have high expressive power at a low level of granularity (similar to FSMs for synchronous systems)
Synthesis from STGs can be fully automated
Synthesis tools often suffer from the state explosion problem (symbolic techniques are used)
The theory of logic synthesis from STGs can be found in: J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno and A. Yakovlev, Logic Synthesis for Asynchronous Controllers and Interfaces, Springer-Verlag, 2002.

Synchronization of complex systems
Jordi Cortadella Universitat Politecnica de Catalunya Barcelona, Spain Thanks to A. Chakraborty, T. Chelcea, M. Greenstreet and S. Nowick

Multiple clock domains
f1/f0 CLK1 CLK (f0) f2/f0 CLK2 CLK0 CLK f3/f0 CLK3 Independent clocks (plesiochronous if frequencies closely match) Single clock (Mesochronous) Rational clock frequencies

The problem: metastability
D Q D Q ФT ФR ФR setup hold D Q ?

Classical “synchronous” solution
[Figure: data leaves a flip-flop clocked by ФT and passes through a chain of flip-flops clocked by ФR]
Mean Time Between Failures:
  MTBF = e^(tr/τ) / (W · fФ · fD)
  fФ: frequency of the clock
  fD: frequency of the data
  tr: resolve time available
  W: metastability window
  τ: resolve time constant
Example:
  # FFs   MTBF
  1 FF    15 min
  2 FF    9 days
  3 FF    23 years
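The trend in the table (each extra flip-flop buys an exponential gain in MTBF) can be illustrated numerically. The sketch assumes the usual synchronizer model, MTBF = exp(tr/τ) / (W · fФ · fD), and uses hypothetical parameter values that are not taken from the slides, so the printed numbers will not match the table.

```python
import math


def mtbf_seconds(t_r, tau, window, f_clk, f_data):
    """MTBF = exp(t_r / tau) / (W * f_clk * f_data), all in SI units."""
    return math.exp(t_r / tau) / (window * f_clk * f_data)


# Hypothetical numbers, chosen only to show the trend.
F_CLK = 200e6      # receiver clock frequency, Hz
F_DATA = 50e6      # data transition rate, Hz
TAU = 100e-12      # resolve time constant, s
WINDOW = 50e-12    # metastability window, s
PERIOD = 1 / F_CLK

for n_ffs in (1, 2, 3):
    # Assumption: the first stage gets half a clock period to resolve,
    # and each extra flip-flop adds one full period.
    t_r = (n_ffs - 0.5) * PERIOD
    print(n_ffs, "FF:", mtbf_seconds(t_r, TAU, WINDOW, F_CLK, F_DATA), "s")
```

Under this model, adding one clock period T of resolve time multiplies the MTBF by exp(T/τ), at the cost of one more cycle of latency.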

How to live with metastability ?
Metastability cannot be avoided, it must be tolerated.
Having a decent MTBF (on the order of years) may have a tangible impact on latency
Purely asynchronous systems can be designed failure-free
Synchronous and mixed synchronous-asynchronous systems need mechanisms that impact latency
But latency can be hidden in many cases …

Different approaches
Pausible clocks (Yun & Donohue 1996)
Predict metastability-free transmission windows for domains with related clocks (Chakraborty & Greenstreet 2003)
Use the waiting time in FIFOs to resolve metastability (Chelcea & Nowick 2001)
And others …
The term “Globally Asynchronous, Locally Synchronous” is typically used for these systems (Chapiro 1984)

Mutual exclusion element
req1 1 ack1 req2 1 ack2

Metastability

Mutual exclusion element
Metastability resolver 1 req1 ack2 req2 1 ack1 An asynchronous data latch with MS resolver can be built similarly

Abstraction of the MUTEX
G1 MUTEX R2 G2

A pausible clock generator
Environment MUTEX [δ1, δ2] delay

Pausible clocks ME [δ1, δ2] delay CLK Req Ack MUTEX Cntr FF
Yun & Dooply, IEEE Trans. VLSI, Dec. 1999 Moore et al., ASYNC 2002

STARI (Self-Timed At Receiver’s Input)
Both clocks are generated from the same source The FIFO compensates for skew between transmitter and receiver M. Greenstreet, 1993

A Minimalist Interface
FIFO reduces to latch-X and a latch controller Φx can always be generated in such a way as to reliably transfer data from input to output Chakraborty & Greenstreet, 2002

A Minimalist Interface: 3 scenarios
Latch-X setup & hold Latch-R setup & hold Фx Permitted The scenario is chosen at initialization

A Minimalist Interface: latch controller
The controller detects which transition arrives first (from ΦT and ΦR) and generates ΦX accordingly

A Minimalist Interface: rational clocks

A Minimalist Interface: arbitrary clocks
Assumption: clocks are stable Each domain estimates the other’s frequency Residual error corrected using stuff bits

Mixed-Timing Interfaces
Async-Sync FIFO Asynchronous Domain Synchronous Domain 2 Async-Sync FIFO Synchronous Domain 1 Sync-Async FIFO Mixed-Clock FIFO’s Chelcea & Nowick, 2001

Mixed-Clock FIFO: Block Level
[Figure: Mixed-Clock FIFO block with a synchronous put interface (req_put, data_put, full, CLK_put) and a synchronous get interface (req_get, data_get, valid_get, empty, CLK_get)]
The Mixed-Clock FIFO interfaces two synchronous domains operating under different clocks. The FIFO has a put interface and a get interface.
The put interface is controlled by one of the clocks. The sender initiates a put operation on a request wire. The data items are placed on a data bus. The FIFO signals when it is full on the full wire.
The get interface is controlled by the other clock. The receiver initiates a get operation on a request wire. The FIFO places the data items on a data bus, and it also indicates their validity. However, in this design, the FIFO passes only valid data items. When the FIFO becomes empty, it signals its state on the empty wire.

req_put: initiates put operations
req_get: initiates get operations
data_put, data_get: buses for data items
CLK_put: controls put operations
CLK_get: controls get operations

valid_get: indicates data item validity (always 1 in this design)
full: indicates when the FIFO is full
empty: indicates when the FIFO is empty

Mixed-Clock FIFO: Architecture
[Figure: ring of identical cells; put side with Full Detector (full), Put Controller (req_put), data_put and CLK_put; get side with Empty Detector (empty), Get Controller (req_get, valid_get), data_get and CLK_get]
The mixed-clock FIFO’s architecture is a token ring. It has an array of identical cells. The put interface has shared data and control buses. Similarly, the get interface has common data and control buses.
At all times, the FIFO contains a put token (the red circle). The put token is used to enqueue data items. The cell with the put token is the tail of the queue. The put token moves around the FIFO in a put token ring.
Similarly, the FIFO also contains a get token, used to dequeue data items. The get token defines the head of the queue. The get token moves around the FIFO in the get token ring.
The full detector observes the state of each cell and computes when the FIFO is full. Its result is communicated to the put interface. The put controller receives the put requests as input. Normally, it passes them to the FIFO. However, when the FIFO becomes full, the put controller stalls the put interface.
Similarly, the get interface has an empty detector and a get controller. The empty detector detects when the FIFO is empty. The get controller passes get requests to the FIFO and stalls the get interface when the FIFO is empty.
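The token-ring organization described above can be sketched as a purely functional model. This is an untimed illustration only: it ignores the two clock domains, the synchronizers and the controllers' stall logic, and the class and attribute names are assumptions made for the example rather than names from the design.

```python
class TokenRingFifo:
    """Untimed model of a ring of cells with a put token and a get token."""

    def __init__(self, n_cells):
        self.data = [None] * n_cells
        self.cell_full = [False] * n_cells  # per-cell status bit
        self.put_tok = 0                    # cell holding the put token (tail)
        self.get_tok = 0                    # cell holding the get token (head)

    @property
    def full(self):
        # Full detector: every cell holds a data item.
        return all(self.cell_full)

    @property
    def empty(self):
        # Empty detector: no cell holds a data item.
        return not any(self.cell_full)

    def put(self, item):
        """Enqueue at the put token; return False (stall) when full."""
        if self.full:
            return False
        self.data[self.put_tok] = item
        self.cell_full[self.put_tok] = True
        self.put_tok = (self.put_tok + 1) % len(self.data)  # pass the token
        return True

    def get(self):
        """Dequeue at the get token; return None (stall) when empty."""
        if self.empty:
            return None
        item = self.data[self.get_tok]
        self.cell_full[self.get_tok] = False
        self.get_tok = (self.get_tok + 1) % len(self.data)  # pass the token
        return item


fifo = TokenRingFifo(3)
for value in (10, 20, 30):
    fifo.put(value)
print(fifo.full)      # True
print(fifo.put(40))   # False: the put side stalls
print(fifo.get())     # 10
print(fifo.put(40))   # True: one cell was freed
```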

Mixed-Clock FIFO: Cell Implementation
[Figure: FIFO cell with a put part (CLK_put, en_put, req_put, data_put, ptok_in, ptok_out), a register REG, an SR latch producing f_i and e_i, and a get part (CLK_get, en_get, data_get, valid, gtok_in, gtok_out)]
Here is the implementation of a mixed-clock FIFO cell. I will point out only some important characteristics; more details are provided in the paper.
The cell receives the put and get tokens from the right cell, and passes them to the left cell. The register enqueues both the data item and its validity, and dequeues them to similar buses on the get interface. The put and get operations in the cell are enabled independently, on two different control buses.
The cell outputs its state on two status bits. One indicates when the cell is full, and one indicates when the cell is empty.
The cell is partitioned into three components: a reusable synchronous put part, a reusable synchronous get part, and a data validity controller. The data validity controller (a simple SR latch) is unique to the mixed-clock FIFO cell.

Mixed-Clock FIFO: Cell Implementation
[Figure: the same cell with its PUT INTERFACE and GET INTERFACE parts highlighted]

Design Automation for Asynchronous Circuits
Alex Kondratyev Cadence Berkeley Labs, Berkeley, CA, USA In collaboration with Jordi Cortadella, Luciano Lavagno Kelvin Lwin and Christos Sotiriou

Outline
What do we optimize?
End of deterministic design
Technical and business implications
Asynchronous design with commercial tools
Desynchronization
Delay-insensitive datapath
Fine-grain pipelining

Optimization metrics nodes of a Boolean network
Late 70-s: Literals nodes of a Boolean network Levels of a Boolean network Area Speed Nowadays: Literals nodes of a Boolean network Levels of a Boolean network Wire length Area Speed Tools are optimizing for area and speed!

Universal metrics
Power:
  P_avg = P_dyn + P_short + P_leak
  P_dyn = a * f_clk * C * V_dd^2
  P_short: small
  P_leak: ?

Universal metrics
Power:
  P_avg = P_dyn + P_short + P_leak
  P_dyn = a * f_clk * C * V_dd^2
Delay:
  t_d = Q / I_ds = C * V_dd / (k * (V_dd - V_t)^2)
Lowering the supply voltage reduces power and increases delay
Speed can be taken as a universal metric
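The quadratic dependence of dynamic power on the supply voltage can be checked with a one-line function. The sketch uses only the dynamic term, P_dyn = a * f_clk * C * V_dd^2, and the parameter values are hypothetical.

```python
def dynamic_power(activity, f_clk, capacitance, v_dd):
    """P_dyn = a * f_clk * C * V_dd^2 (watts, with SI inputs)."""
    return activity * f_clk * capacitance * v_dd ** 2


# Hypothetical operating points: same circuit, supply voltage halved.
nominal = dynamic_power(activity=0.1, f_clk=500e6, capacitance=1e-9, v_dd=1.2)
scaled = dynamic_power(activity=0.1, f_clk=500e6, capacitance=1e-9, v_dd=0.6)
print(scaled / nominal)  # about 0.25: half the voltage, a quarter of the power
```

The same voltage reduction slows the gates down, which is why speed margin can be traded for power.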

Timing margins Algorithms/tools (approximations)
Modeling (process corners e.g.) Architecture (unbalanced computation)

Algorithms/tools
False paths (< 5%)
Common path pessimism removal
Hierarchy hurts!!! 10-35% gain from floorplan flattening (Reshape)
Bad news: we do not know how far we are from optimum
Good news: optimum is not possible to find

Modeling
[Figure: INVX2 (fall) delay at slow, typical and fast corners; 0.25 µm, Vdd = 2.5 V ± 10%, T = 0 and 125 °C; comparisons of Fast vs. Typical and Slow vs. Typical]
Why panic?
New BIG players: signal integrity and process variability

Variability sources Environment (T, Vdd) + signal integrity
Within-die only Process variations (gate length L, wire width W, threshold voltage Vt) Die-to-die (design independent) Within-die (design dependent)

Environment + SI IR drop – decrease in the current from Vdd Bad news:
Supply voltage: ± 10% Temperature: -40C to 125 C VDD V’DD IR drop – decrease in the current from Vdd Bad news: Good news: 7 6 Field solvers can handle 10 variables 10 gates x 8metal layers Abstraction, model reduction, IP reuse help further 9  10 RC elements in VDD grid Tools make IR drop sign off at 5%Vdd (still  10% delay penalty)

Compute switching windows Worst coupling estimation
Environment + SI aggressor victim pulse aggressor victim delay Crosstalk Pruning by coupling Compute switching windows Worst coupling estimation H-Spice simulation Pruning by timing Tc (%) Conservative analysis: up to 20% delay penalty (post-layout fixes)

Process variations Die-to-die Within-die design independent, well
modeled via worst-case files Within-die design dependent, systematic and random!! within-die die-to-die Lgate Wwire Tt Nassif’01

Measuring variability
Microprocessor: at-speed functional testing, with frequency binning
ASIC: no delay testing, no binning
[Figure: % chips vs. frequency, split into Bin1, Bin2, Bin3]
Strategically placed oscillators
Problem: up to 15% delay variation in RO (Nassif'03): vertical/horizontal (4%), poly-Si spacing (7%), distance (5%)

Modeling variability
Model for gate delay (linear with respect to the variability sources): $d_{var} = d_{env} + d_{device} + d_{wire}$
Independence of sources (within a group: model reduction by PCA or SVD)
For a single variability source: $L_{var} = L_{spatial} + L_{random}$ (modeled by normally distributed random variables $N(0, \sigma)$)
Variation of path delay: $D_{var} = \sum d_{var}(L_{var})$
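A small sketch of how the linear model above could be applied to one path. It adds assumptions the slide does not state: every gate has the same sensitivity to L, the spatial component is shared by all gates on the path (fully correlated), and the random component is independent per gate. The function name and parameters are hypothetical.

```python
import math

def path_delay_sigma(n_gates, sensitivity, sigma_spatial, sigma_random):
    """Standard deviation of D_var = sum of d_var(L_var) over n_gates.

    Assumes d_var = sensitivity * L_var and L_var = L_spatial + L_random,
    with L_spatial common to all gates and L_random independent per gate.
    """
    # Fully correlated terms add linearly.
    spatial = n_gates * sensitivity * sigma_spatial
    # Independent terms add in quadrature.
    random_part = math.sqrt(n_gates) * sensitivity * sigma_random
    return math.sqrt(spatial ** 2 + random_part ** 2)

print(path_delay_sigma(n_gates=16, sensitivity=1.0, sigma_spatial=0.5, sigma_random=1.0))
```

The split matters because the correlated part grows with the path length while the independent part only grows with its square root, so long paths average out random variation but not spatial variation.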

Statistical timing analysis
Reconvergence needs some care
Numerical computation of a distribution:
- approximate convolution (5% accuracy)
- use upper and lower bounds (10% difference, Blaauw'03)
Algorithms have linear complexity!
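To make the "numerical computation of a distribution" concrete, here is a toy sketch using discrete delay distributions stored as dictionaries. It is not the algorithm of any particular tool: delays in series are combined by convolution and parallel arrivals by a maximum, and both helpers assume the two operands are independent, which is exactly what fails at reconvergent paths.

```python
def convolve(pmf_a, pmf_b):
    """Distribution of the sum of two independent delays (series connection)."""
    out = {}
    for da, pa in pmf_a.items():
        for db, pb in pmf_b.items():
            out[da + db] = out.get(da + db, 0.0) + pa * pb
    return out

def maximum(pmf_a, pmf_b):
    """Distribution of the later of two independent arrival times."""
    out = {}
    for da, pa in pmf_a.items():
        for db, pb in pmf_b.items():
            key = max(da, db)
            out[key] = out.get(key, 0.0) + pa * pb
    return out

gate = {1: 0.5, 2: 0.5}          # hypothetical gate: delay 1 or 2, equally likely
two_in_series = convolve(gate, gate)
merge_point = maximum(gate, gate)
print(two_in_series, merge_point)
```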

What does it buy?
[Figure: delay distribution with the worst case and the confidence margin marked]
The worst-case (WC) confidence margin must be big (so that chips work), but it is fully unknown
Statistical timing analysis helps to quantify risk (reduce the margin and be structure specific)
It might help to trade off confidence margin against yield (testing?)
Open issues: why normal? How to derive σ? How to derive sensitivity coefficients?

Summing this up
[Figure: cycle time split into real computation time and overheads: clock overhead, clock skew, SI, non-balanced stages, worst-vs-average case, variability; percentages shown: 10%, 20%, 45%, 25%, 30%]
Some designs run twice as fast as the spec requires!
Everything boils down to $$$
Synchronous design is becoming a costly proposition

Is asynchronous an option?
It is about time, but ...
"Must" requirements for an asynchronous CAD tool:
Competitive
- added value with minimal (or no) penalty
- scalable (capable of handling large designs)
Simple
- minimal knowledge of asynchronous design
- RTL input
Risk-free
- does not change sign-off (STA)
- complete solution in verification and testing
- backup options (synchronous implementation)

Sliding the trade-off curve
[Figure: automation efforts along a trade-off curve, from blocks to gates: bundled-data desynchronization (EMI, skew penalty), QDI datapath with NCL and phased logic (variability), QDI + fine-grain pipelining with template-based gate-level pipelining (average speed). Penalties?]

Desynchronization flow
Think synchronous
Design synchronous: one clock and edge-triggered flip-flops
De-synchronize (automatically)
Run it asynchronously
"Asynchronous for dummies"

Synchronous circuit
[Figure: pipeline of master-slave (MS) flip-flops, each drawn as a pair of latches (L), clocked by CLK]

De-synchronization
[Figure: the same pipeline with the flip-flops split into latches (L) and the clock replaced by controllers (C)]

De-synchronization
Distributed controllers (C) substitute the clock network
The data path remains intact!

Non-overlapping handshake protocol
[Figure: latches A, B, C, D with enable transitions A+, A-, B+, B-, C+, C-, D+, D-; pulses of adjacent latches do not overlap]

[Figure: latches A, B, C, D with enable transitions A+, B+, C+, D+, A-, B-, C-, D-; pulses of adjacent latches overlap]
Overlapping is also acceptable

Concurrent model
[Figure: latches A, B, C with transitions A+, A-, B+, B-, C+, C-; latches hold either data or a bubble]
+ and - must alternate
Data must be available at the previous latch
The next latch must be closed before receiving new data

For any netlist

Synchronization layer

Synchronization layer
This is a circuit marked graph (CMG)

Properties of CMGs
Any CMG is live and safe
Safeness: no data overwriting
Liveness: no deadlock
[Figure: CMG over the transitions A+, A-, B+, B-, C+, C-]
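The safeness and liveness claims can be checked by brute force on a tiny example. The sketch below plays the token game on a hypothetical two-latch marked graph whose arcs are one reading of the concurrent-model constraints (alternation of + and -, data available at the previous latch, next latch closed before new data). The arc set and initial marking are assumptions for illustration, not the exact model from the slides.

```python
from collections import deque

# Each place is an arc (source transition, target transition).
PLACES = [
    ("A+", "A-"), ("A-", "A+"),   # + and - alternate on latch A
    ("B+", "B-"), ("B-", "B+"),   # + and - alternate on latch B
    ("A+", "B+"),                 # data available at the previous latch
    ("B-", "A+"),                 # next latch closed before new data arrives
]
INITIAL = (0, 1, 0, 1, 0, 1)      # tokens per place; both latches start closed
TRANSITIONS = ("A+", "A-", "B+", "B-")

def enabled(marking, t):
    """A transition is enabled when all of its input places hold a token."""
    return all(marking[i] > 0 for i, (_, dst) in enumerate(PLACES) if dst == t)

def fire(marking, t):
    """Consume one token per input place, produce one per output place."""
    new = list(marking)
    for i, (src, dst) in enumerate(PLACES):
        if dst == t:
            new[i] -= 1
        if src == t:
            new[i] += 1
    return tuple(new)

def reachable(initial, limit=10_000):
    """Breadth-first exploration of all reachable markings."""
    seen, queue = {initial}, deque([initial])
    while queue:
        m = queue.popleft()
        for t in TRANSITIONS:
            if enabled(m, t):
                nxt = fire(m, t)
                if nxt not in seen:
                    if len(seen) >= limit:
                        raise RuntimeError("state space too large")
                    seen.add(nxt)
                    queue.append(nxt)
    return seen

def is_safe(initial):
    """Safe: no place ever holds more than one token (no data overwriting)."""
    return all(max(m) <= 1 for m in reachable(initial))

def is_deadlock_free(initial):
    """No reachable marking leaves every transition disabled."""
    return all(any(enabled(m, t) for t in TRANSITIONS) for m in reachable(initial))

print(is_safe(INITIAL), is_deadlock_free(INITIAL))
```

Removing the initial token from the last place (marking `(0, 1, 0, 1, 0, 0)`) leaves nothing enabled, which shows how a badly initialized controller network would deadlock.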

Flow equivalence [Guernic, Talpin, Lann, 2003]

Flow equivalence
[Figure: waveforms of CLK, A, and B for the synchronous behavior, and of A and B for the de-synchronized behavior]
Theorem: the de-synchronization model preserves flow equivalence
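Flow equivalence compares what each latch stores, not when it stores it. The sketch below assumes a trace is a list of hypothetical `(time, latch, value)` events and treats two traces as flow-equivalent when every latch sees the same sequence of values in both.

```python
def stored_sequence(trace, latch):
    """Values stored by one latch, in order, ignoring timestamps."""
    return [value for (_time, name, value) in trace if name == latch]

def flow_equivalent(trace_x, trace_y, latches):
    """True when each latch stores the same value sequence in both traces."""
    return all(
        stored_sequence(trace_x, latch) == stored_sequence(trace_y, latch)
        for latch in latches
    )

# Synchronous run: A and B update together on every clock edge.
sync_trace = [(0, "A", 1), (0, "B", 7), (10, "A", 2), (10, "B", 8)]
# De-synchronized run: same values per latch, different interleaving and timing.
desync_trace = [(0, "A", 1), (3, "A", 2), (4, "B", 7), (9, "B", 8)]

print(flow_equivalent(sync_trace, desync_trace, ["A", "B"]))
```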

Timing equivalence
[Figure: latches La, Lb, Lc, Ld separated by matched delays del_a, del_b, del_c; timing diagram of A, B, C, D with transitions A+ B- C+ D- and A- B+ C- D+]
Case del_a = del_b = del_c = del_d: synchronous-like behavior

Timing equivalence
[Figure: latches La, Lb, Lc, Ld separated by matched delays del_a, del_b, del_c; timing diagram of A, B, C, D with transitions A+ B- C+ D- and A- B+ C- D+]
Case del_b > del_a = del_c = del_d: B keeps the same period and determines the period of the rest

Compatibility
Synchronous: $T_{sync} \ge T_{comb} + T_{setup} + T_{skew} + T_{CQ}$
Desynchronized: $T_{desync} \ge T_{comb} + T_{controller} + T_{CQ}$
Statement: a desynchronized design is behavior- and timing-compatible with its synchronous counterpart
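A worked instance of the two period bounds, reading the garbled relations as sums of the listed terms. All delay values below are made-up picosecond figures chosen only to show the arithmetic.

```python
def sync_period(t_comb, t_setup, t_skew, t_cq):
    """Lower bound on the clock period of the synchronous design."""
    return t_comb + t_setup + t_skew + t_cq

def desync_period(t_comb, t_controller, t_cq):
    """Lower bound on the cycle time of the desynchronized design."""
    return t_comb + t_controller + t_cq

# Hypothetical delays in picoseconds
t_sync = sync_period(t_comb=800, t_setup=50, t_skew=100, t_cq=80)
t_desync = desync_period(t_comb=800, t_controller=120, t_cq=80)
print(t_sync, t_desync)
```

With these numbers the controller overhead (120 ps) is smaller than setup plus skew (150 ps), so the desynchronized bound is slightly lower; with a slower controller the comparison would go the other way.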

Synchronous environment
[Figure: latches A, B, C in a synchronous environment; the clock transitions Clk+ and Clk- are tied by timing arcs to A+, B+, C+ and A-, B-, C-]

Implementation of a controller
Only local handshakes with adjacent controllers are necessary
Synthesis by using intuition, common sense, ... and petrify

Implementation of a controller

Delay matching
[Figure: combinational logic with a matched delay d alongside it]

Post-layout delay matching
[Figure: combinational logic with its matched delay, adjusted after layout]

Desynchronization. Gaining Trust
[Figure: synchronous RTL and its implementations, related by "="]

Async DLX block diagram

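Desynchronization. Gaining Trust
Synchronous RTL implemented both ways: Synchronous = Desynchronized
Synchronous: cycle [value missing] ns, power [value missing] mW, area 372,656 µm²
Desynchronized: cycle [value missing] ns, power [value missing] mW, area 378,058 µm²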

DLX lessons. Positive
Asynchronous design with no area, power, or delay penalties
30% less EMI
Partial tolerance of variability (matched delays scale with the rest of the gates)
Binning!
[Figure: stages B and C driven by Clk compared with stages B and C driven by req; with a clock, Treq > Tclk leads to an error]

DLX lessons. Negative
Asynchronous design with no area, power, or delay advantage
Clock power is saved, but latched designs have higher loads
P&R constraints of a de-synchronized design are non-trivial
Matched delay variability might hurt
Hard work to come out even with synchronous

Can we do better?
Clustering
Timing optimization
Retiming of M-latches
[Figure: master (M) and slave (S) latches with early and late arrival times]

Automation efforts
[Figure: trade-off curve from blocks to gates: bundled-data desynchronization (EMI, skew penalty), QDI datapath with NCL and phased logic (variability), QDI + fine-grain pipelining with template-based gate-level pipelining (average speed)]

Introduction to NCL
2-phase operation: evaluate (DATA), then precharge (NULL)
Self-timed register interaction (acknowledgement of phases)
Micropipeline with a delay-insensitive (DI) datapath
[Figure: register, combinational logic, register, with completion detection (CD); NULL and DATA wavefronts are each acknowledged (Ack+)]
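The DATA/NULL alternation relies on a code in which "no data yet" is distinguishable from any value. The sketch below uses the common dual-rail convention (rail0, rail1) with NULL = (0, 0); this encoding and the helper names are assumptions for illustration, since the slide does not spell out the code. The completion detector reports a full DATA or NULL wavefront and returns `None` while a wavefront is still arriving.

```python
NULL = (0, 0)

def encode_bit(bit):
    """Dual-rail DATA codeword as (rail0, rail1)."""
    return (0, 1) if bit else (1, 0)

def encode_word(bits):
    return [encode_bit(b) for b in bits]

def completion(word):
    """Return 'DATA', 'NULL', or None while the wavefront is incomplete."""
    if any(pair == (1, 1) for pair in word):
        raise ValueError("both rails high is not a legal codeword")
    if all(pair == NULL for pair in word):
        return "NULL"
    if all(pair != NULL for pair in word):
        return "DATA"
    return None

word = encode_word([1, 0, 1])
print(completion(word), completion([NULL] * 3), completion([encode_bit(1), NULL]))
```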

NCL Design Flow
[Figure: VHDL is synthesized with the GTECH library into a synchronous netlist; 2-rail expansion plus optimization then produces the asynchronous netlist]
Two approaches:
1. Pattern matching (Ligthart'00)
2. Completion separation (NCLX)

From 2 to 3-rail Scheme
[Figure: 2-rail gate F with inputs x.0, x.1, y.0, y.1, ... and outputs z.1, z.0]
z.1, z.0 are 2-rail, but they do not acknowledge the inputs
Not a DI scheme!

From 2 to 3-rail Scheme
[Figure: functional part (2-rail gate F with inputs x.0, x.1, y.0, y.1, ... and outputs z.1, z.0) plus completion part (C-element combining x.go, y.go, ... into z.go)]
Rationale behind the delay-insensitivity of the 3-rail scheme:
- a 2-rail circuit is hazard-free under monotonic input changes
- all input changes are observable at the outputs
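A sketch of why the extra "go" rail is needed, using a dual-rail AND as a hypothetical instance of the generic gate F. Rails are ordered (rail0, rail1) and NULL is (0, 0); the `go` helper and the `CElement` class are illustrative names, and the C-element is modeled behaviorally.

```python
NULL, ZERO, ONE = (0, 0), (1, 0), (0, 1)

def and_2rail(x, y):
    """Unate dual-rail AND: z.0 = x.0 OR y.0, z.1 = x.1 AND y.1."""
    (x0, x1), (y0, y1) = x, y
    return (x0 | y0, x1 & y1)

def go(signal):
    """1 when the dual-rail signal carries DATA, 0 when it is NULL."""
    return signal[0] | signal[1]

class CElement:
    """Muller C-element: output changes only when all inputs agree."""
    def __init__(self):
        self.out = 0

    def update(self, *inputs):
        if all(inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        return self.out

# The functional part alone produces a valid 0 while y is still NULL,
# so the output does not acknowledge the arrival of y.
print(and_2rail(ZERO, NULL))

# The completion part waits for every input before raising z.go.
c = CElement()
print(c.update(go(ZERO), go(NULL)), c.update(go(ZERO), go(ONE)))
```

The C-element also holds z.go high until every input has returned to NULL, so the NULL wavefront is acknowledged in the same way as the DATA wavefront.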

NCLX flow (MUX)
Steps: unate technology mapping, 2-rail expansion, completing
[Figure: a MUX with inputs a, b, s and output z is expanded into an incomplete 2-rail gate (inputs a.1, a.0, b.1, b.0, s.1, s.0; outputs z.1, z.0), then completed: the functional part is paired with a completion part in which a C-element combines a.go, b.go, s.go into z.go]

NCL lessons. Positive
Very low EMI
High security of computation
Automatic stand-by mode
Tolerance to variability

NCL lessons. Negative
Big area overhead: 2.7-3.0x
No performance advantage (average-case performance is swallowed by the penalty from NULL)
Completion introduces further penalties (power and delay)

Can we do better?
Timing optimization of the completion network (may recover about 25% of area and power)
Partial recovery of single-rail nodes in the datapath
Fast NULL
4-rail data communication to save power

Phased Logic
Linden'94
Each signal uses two bits: the LSB is the "value" bit (v), the MSB is the "timing" bit (t)
Even phase: 00 (value 0), 11 (value 1)
Odd phase: 10 (value 0), 01 (value 1)
A signal changes phase or value (only one bit changes)
[Figure: waveforms of t and v for the sequence odd1, even0, odd0, even1, odd1, even0, even1]

Phased logic gate
A PL gate has an internal state, Even or Odd
A PL gate fires when all inputs match the gate phase
[Figure: a gate in phase E with inputs in mixed phases is not ready to fire; with all inputs in phase E it is ready to fire; after firing the gate phase is O]
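The two-bit code and the firing rule can be sketched as follows. Codewords are written as (t, v) pairs, matching the table of even and odd codewords on the Phased Logic slide; the function names are hypothetical, and only the "ready to fire" check is modeled, not the full gate.

```python
def encode(value, phase):
    """Return the (t, v) codeword for a value bit in phase 'even' or 'odd'."""
    t = value if phase == "even" else 1 - value
    return (t, value)

def phase_of(code):
    """Even when t and v are equal (00, 11), odd otherwise (10, 01)."""
    t, v = code
    return "even" if t == v else "odd"

def ready_to_fire(gate_phase, inputs):
    """A PL gate may fire only when every input matches the gate phase."""
    return all(phase_of(code) == gate_phase for code in inputs)

def bits_changed(a, b):
    return sum(x != y for x, y in zip(a, b))

print(encode(0, "even"), encode(1, "even"), encode(0, "odd"), encode(1, "odd"))
print(ready_to_fire("even", [encode(1, "even"), encode(0, "odd")]))
```

Moving from one phase to the next flips exactly one wire: the timing bit when the value repeats, the value bit when the value changes.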

LUT-4 based implementation
[Figure: PL gate built around a LUT4: the value bits a_v, b_v, c_v, d_v feed the LUT4, whose output new_v is stored in a D-latch as v; input completion detection on the timing bits a_t, b_t, c_t, d_t, through a C-element and a delay, produces new_t, stored in a second D-latch as t; out_phase = gate_phase; reset and feedback signals (fi, fo) initialize the gate]
Functionality: v(a_v, b_v, c_v, d_v)
Phase: a_t, b_t, c_t, d_t, t
Area penalty!

DI-datapath summary
NCL and PL show a way to tolerate variability
Both have significant penalties
May be good for niche applications (smart cards, mixed signals)
Average-case speed is masked by DI-coordination overhead
New optimization approaches
Fine-grain pipelining

Automation efforts
[Figure: trade-off curve from blocks to gates: bundled-data desynchronization (EMI, skew penalty), QDI datapath with NCL and phased logic (variability), QDI + fine-grain pipelining with template-based gate-level pipelining (average speed)]

Industrial Experiences Pioneering Asynchronous Commercial Design
Peter A. Beerel Fulcrum Microsystems Calabasas Hills, CA, USA

Agenda
Introduction to Fulcrum
Description of Integrated Pipelining: Fulcrum’s clockless circuit architecture
Description of Fulcrum’s design flow
Overview of Nexus: Fulcrum’s terabit crossbar
Overview of PivotPoint: Fulcrum’s first commercial product
[Figure: icons for each agenda item: circuits A and B; flow stages from specification through design & verification, synthesis & floor planning, and physical design to database release to manufacturing, with simulation & verification]

Company Snapshot
"Clockless" semiconductor company
Formed out of Caltech (1/00)
Technology proven in large-scale designs
Located in Calabasas, CA (30 people)
Backed by top-tier investors (raised $14M in June)

Fulcrum’s Integrated Pipelining
Robust, power efficient, and high performance
Fast delay-insensitive style using domino logic without latches (developed at Caltech by Fulcrum’s founders)
[Figure: three dual-rail domino logic stages in a row, linked by acknowledge signals]

Integrated Pipelining
Harnessing the power of domino logic
Addresses delay variability with completion sensing
Addresses power inefficiency with async handshakes
Leverages more efficient "N" transistors
[Figure: leaf cells A, B, C, each containing dual-rail domino logic, input completion detection, output completion detection, and control]

Hierarchical Design
Multi-level hierarchy of communicating blocks
At each level, blocks communicate along channels
An asynchronous design is a hierarchical netlist of leaf cells
[Figure: an ASIC decomposed level by level (Main FSM, Register Bank, Memory, Adder/Mult., Subtract/Divider; Reg A, Reg B, Reg C, Adder, Multiplier; BN-1, BN-2, BN-3; FAN-1, FAN-2, FAN-3, ..., FA0), with channels between blocks and leaf cells at the lowest level]

Leaf Cells
Definition
- Smallest block that performs logic and communicates via channels
- Based on a small number of pipeline templates guiding design
- Forms the basic building block for physical design
Features
- Facilitates high throughput and low latency
- Provides easy timing validation and analog verification
- ~1,000 digital leaf cell types compose our leaf cell library
- ~200 additional subtypes for different physical environments (e.g., loads)
[Figure: leaf cell with LCD, RCD, and D blocks]

Template-Based Cell Design
Each pipeline style (QDI, timed, ...) has a different blueprint
The library uses a blueprint to implement the lowest-level blocks
[Figure: blueprint for a QDI N-input M-output pipeline stage, built from function blocks (F), LCD and RCD blocks, and C-elements (C); examples: a 2-input 1-output pipeline stage and a 1-input 2-output pipeline stage]

Summary of Characteristics
Delay-insensitive timing model
- Gates and wires can have arbitrary delays
4-phase 1-of-4 handshake
- Uses 4 wires to send 2 bits
- Plus an acknowledge wire for flow control
- Returns to neutral between data transfers
- Self-shielding
Precharge domino logic plus async handshake
- Low latency; high frequency; robust
- Automatic power conservation; zero standby power
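A minimal sketch of the 1-of-4 code described above: four wires carry two bits by raising exactly one wire, and the wires return to neutral (all low) between transfers. The acknowledge wire and the handshake timing are not modeled; the function names are hypothetical.

```python
NEUTRAL = (0, 0, 0, 0)

def encode_1of4(value):
    """Raise exactly one of four wires to send a 2-bit value (0..3)."""
    if value not in range(4):
        raise ValueError("a 1-of-4 channel carries exactly 2 bits")
    wires = [0, 0, 0, 0]
    wires[value] = 1
    return tuple(wires)

def decode_1of4(wires):
    """Return the 2-bit value, or None when the channel is neutral."""
    if wires == NEUTRAL:
        return None
    if sum(wires) != 1:
        raise ValueError("illegal codeword: more than one wire high")
    return wires.index(1)

def wire_sequence(values):
    """Wire states seen on the channel: data, neutral, data, neutral, ..."""
    states = []
    for value in values:
        states.append(encode_1of4(value))
        states.append(NEUTRAL)   # return to neutral between transfers
    return states

print(wire_sequence([2, 3]))
```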

Fulcrum Design Flow
[Figure: flow stages: design specification, architecture design & verification, micro-architecture synthesis & floor planning, physical design, database release to manufacturing, with mitered simulation & verification throughout]
Hierarchical design flow
- Executable specifications
- Formal decomposition
- Creates design hierarchy
Semi-custom synthesis & layout
- Hierarchical floor planning
- Automated transistor sizing
- Semi-automated physical design
Supports synchronous & asynchronous designs
- Hard macro from place & route

Managing Design Hierarchy
Proprietary object-oriented hardware language
- Integrated hierarchical design/verification language
- Defines cell specification & implementation
Specification
- Java or communicating sequential processes (CSP)
Implementation: multiple forms
- Sub-cells, defined in terms of specification or implementation
Defines an integrated test environment for each cell
- Enables verification at all pairs of levels
Efficiency features
- Supports refinement of cells and channels

Physical Design
Layout hierarchy based on design hierarchy
Hierarchical floor planning, semi-automated
- Large-scale hand placement before sizing
- Long-distance channels planned carefully
Timing closure by construction
- Placement drives sizing
- Can insert extra pipelining on long wires late in the design
Tradeoffs between performance and design time
- Hand layout where necessary
- Automated layout where possible
Goals
- Full-custom density and speed within ASIC design time

Design Verification: System-Level
Mission
- Verify that executable spec = written spec + gate-level model
- Use industry-standard tools & methods: Cadence NCSIM and an efficient Java-Verilog interface
- Directed random testing
- Line & functional coverage
[Figure: test bench (configuration manager, bus functional model, test cases, traffic generator & checker, monitor) around the device under test, which is the executable spec or the gate-level Verilog model]

Design Verification: Unit-Level
Mitered co-simulation for unit-level verification
- Checks correctness of the digital model by comparing it to the golden CSP/Java model
Features
- Framework automated and regressed
- Checks correctness
- Checks delay insensitivity and/or throughput and latency
[Figure: a test engine feeds a copy of the same stimulus to the high-level (Java/CSP) and low-level (CSP/PRS/CDL) models, compares their outputs (==), and writes a log]
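The mitering idea, reduced to its core, is to drive two models with the same stimulus and compare their outputs. The sketch below uses plain Python functions as stand-ins for the golden and low-level models; the 8-bit adder and all names are hypothetical and unrelated to Fulcrum's actual framework.

```python
def miter(golden, dut, stimuli):
    """Return the first (stimulus, expected, actual) mismatch, or None."""
    for stimulus in stimuli:
        expected, actual = golden(stimulus), dut(stimulus)
        if expected != actual:
            return (stimulus, expected, actual)
    return None

def golden_add(pair):
    """High-level reference: 8-bit addition."""
    a, b = pair
    return (a + b) & 0xFF

def lowlevel_add(pair):
    """Lower-level model: the same adder built from XOR/AND and carry shifts."""
    a, b = pair
    while b:
        a, b = (a ^ b) & 0xFF, ((a & b) << 1) & 0xFF
    return a

def buggy_add(pair):
    """A model that forgets carries, to show what a mismatch looks like."""
    a, b = pair
    return a ^ b

stimuli = [(a, b) for a in range(0, 256, 17) for b in range(0, 256, 17)]
print(miter(golden_add, lowlevel_add, stimuli), miter(golden_add, buggy_add, stimuli))
```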

Analog Verification: Charge Sharing
SPICE-based charge-sharing analysis
- Test case generation and analysis automated
Charge-sharing problems solved in numerous ways
- Symmetrization
- Less transistor sharing
- Delay perturbations
[Figure: flow from test generator and synthesis to SPICE]

Synthesis: Gate Generation / Sizing
Automated generation of transistor netlists
- Dynamic logic generation
- Transistor sharing
- Symmetrization
- Gate-library matching
Transistor sizing
- Path-based sizing to meet an amortized unit-delay model
Micro-architecture feedback
- Identifies where fanout limits performance
[Figure: CSP, the gate library, and floor-planning information feed logic synthesis and transistor sizing, which produce the CDL netlist]

Fulcrum QDI v. Synchronous Flows
Saves clock tree design, analysis, optimization, and verification
No timing closure problems
- Unexpected long-wire bottlenecks are easily solved with additional pipeline buffers late in the design cycle
- The QDI/DI timing model reduces timing analysis challenges
Fulcrum QDI hierarchical design facilitates:
- Composability, re-use, and early bug detection
- Hierarchical floor planning improves the predictability of wires
- Template-based leaf cell designs simplify logic design
- Design reuse reduces the criticality of high-level synthesis
- The decomposition methodology is amenable to formal verification

Globally Asynchronous, Locally Synchronous
SoC designs: many cores with different clock domains
Async circuits can interconnect multiple sync cores in an SoC design, eliminating global clock distribution and simplifying clock domain crossing
Fulcrum’s "Nexus" is a high-speed on-chip interconnect:
- 16-port, 36-bit asynchronous crossbar
- Asynchronous cross-chip channels
- Async-sync clock domain converters
- Runs at 1.35GHz in a 130nm process

Nexus System-on-Chip Interconnect
Non-blocking crossbar
16 full-duplex ports
Flow control extends through the crossbar
Full-speed arbitration
Arbitrary-length "bursts"
Bridges clock domains
Scales in bit width and ports
Process portable
[Figure: generic Nexus example connecting synchronous IP blocks (through clock domain converters) and asynchronous IP blocks, with pipelined repeaters on long links]

Nexus Burst Format
Arbitrary-length source-routed bursts provide flexibility
[Figure: a burst as it arrives from the source module and as it leaves toward the target module: data words D1, D2, D3, ..., DN (36 bit), a tail marker (1 bit), and a control field (4 bit) labeled "To" on the incoming side and "From" on the outgoing side]

Sync-to-Async Conversion
Synchronous request/grant FIFO protocol
- Data is transferred if request and grant are both high on the rising edge of the clock
- Compensates for any skew on the asynchronous side
- Low latency: 1/2 to 3/2 clock cycles at A2S
Seamlessly bridges different clock domains
[Figure: S2A converter (synchronous datapath to asynchronous datapath) and A2S converter (asynchronous datapath to synchronous datapath), each with request, grant, and clock on its synchronous side]
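The transfer rule of the request/grant protocol is simple enough to state as code. The sketch below samples hypothetical request, grant, and data values once per rising clock edge and keeps the words for which both signals are high; it models only the rule quoted on the slide, not the converter circuits.

```python
def transferred_words(request, grant, data):
    """Words moved across the interface, one list entry per rising clock edge.

    A word is transferred on an edge only if request and grant are both high.
    """
    return [d for r, g, d in zip(request, grant, data) if r and g]

# One entry per rising clock edge (illustrative values)
request = [1, 1, 0, 1, 1]
grant   = [1, 0, 1, 1, 1]
data    = ["w0", "w1", "w2", "w3", "w4"]

print(transferred_words(request, grant, data))
```

In a real sender the word offered on an edge without a grant would be held and offered again; the lists here simply record what sat on the bus at each edge.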

Arbitration and Ordering
Unrelated sender/receiver links are independent
Bursts sent from multiple input ports to the same output port are serviced fairly by built-in arbitration circuitry
Bursts from A to B remain ordered
Producer-consumer and global-store ordering are satisfied
- A sends X to B, A notifies C, C can read X from B
- A writes X to B, A writes Y to C; if D reads Y from C, it can read X from B
Split transactions implement loads
- Load request and load completion bursts
- Load completions returned out of order
Can tunnel common bus and cache coherence protocols

Example: Load/Store Systems
Option 1: Pure Master/Target ports
- Masters send Requests to Targets, which may return Completions
- Each port must be either a Master or a Target, so that Completions are never blocked by Requests
- Devices that need to be both Masters and Targets are given two separate full-duplex ports
- Could use two separate Nexus crossbars
Option 2: Peers
- Modules that are both Masters and Targets implement an internal buffer to hold Requests, so that Completions can bypass them
- All Masters or Peers restrict the number of outstanding Requests to avoid overflowing Request buffers

Example: Switch Fabric
Each module maintains input/output queues for traffic to/from each other module
Data is sent from an input queue to an output queue over Nexus as a series of short bursts
Flow control credits for each output queue are sent backward
Eliminates head-of-line blocking
Segmentation, buffering, and overspeed optimize performance during congestion
Used in PivotPoint, Fulcrum’s first chip product

Nexus Silicon Validation
TSMC 130nm LV results

Proc    V     GHz    ns    pJ/bit
Low-K   1.2   1.35   2.0   10.4
Low-K   1.0   1.11   2.4    7.0
FSG           1.10   2.5   11.2
FSG           0.87   3.1    7.6

Crossbar area: 1.75 mm^2
Total interconnect area: 4.15 mm^2
Peak cross-section bandwidth: 778 Gb/s
[Figure: block diagram of the Nexus validation chip (ALU, S1 to S7, serial IO) and plot of the Nexus crossbar]

Nexus Summary
Nexus is an asynchronous crossbar interconnect designed to connect up to 16 synchronous modules in an SoC
Nexus can be used to implement load/store systems as well as switch fabrics
Systems using Nexus can be tested with standard equipment
Nexus runs at up to 1.35GHz in TSMC 130nm
Asynchronous interconnect is now viable for very high performance SoC designs

PivotPoint Blade Interconnect
Large-scale SoC design
>32.5M transistors (83% async)
14 separate clock domains
Includes key Fulcrum IP: Nexus terabit crossbar, quad-port 600MHz async SRAM
Operates at over 1GHz
Delivers 192Gbps of non-blocking switching capacity
Testable via standard tools: JTAG; scan chain
Activity-based power scaling
9-month project
World’s first high-performance clockless chip
[Figure: generic system "blade": CPU, NPU, ASIC, and FPGA devices, SPI-4 I/O (Phy/MAC), and a x8 backplane interface]

PivotPoint Leverages Nexus
Flexible architecture
- 6 duplex SPI-4.2 interfaces
- All paths are independent
Optimized for performance
- Up to 14.4Gbps per interface
- Up to 32Gbps per Nexus port
- Full-rate buffer memories
- Lossless flow control
Easily configurable
- 16-bit CPU interface
- JTAG support
Modest size and power
- ~2 Watt per active interface
- 1036-ball package
3ns latency
A true SoC GALS design
[Figure: block diagram: SPI-4 interfaces, each with 16KB buffers and a route table, plus the CPU interface, JTAG boundary scan, and a control bus (serial tree)]

Testing – A Multi-Dimensional Approach
DFT
- Synchronous scan chains for synchronous logic
- Asynchronous scan-chain-like structures for asynchronous logic and sync-async interfaces
- Standardized JTAG interface for testing
Fault grading
- Verilog fault model for domino logic
- Industry-standard fault grading tools
BIST
- Use Nexus for observability in Nexus-based SoCs
- RAM self-test and repair

Differentiating Through Technology
Leveraging our clockless technology foundation
Differentiated product offering
- High performance (latency, capacity)
- Power efficient (linear scaling)
- Robust in operation
Unique IP blocks
- Unmatched performance
- Extremely robust (power and temperature)
- Easy to integrate (benign behavior)
Clockless technology foundation
- Silicon proven and customer validated
- Mature CAD flow (integrated with commercial tools)
- Robust cell library (thousands of unique cells)

Thank You!
Peter A. Beerel, PhD, VP Strategic CAD
26775 Malibu Hills Road, Suite 200, Calabasas Hills, CA 91301
“A group of engineers wants to turn the microprocessor world on its head by doing the unthinkable: tossing out the clock and letting the signals move about unencumbered. For those designers, inspired by research conducted at Caltech, clocks are for wimps.” Anthony Cataldo, EE Times
