NanoNet’07, Catania 17/09/2007 1 Asynchronous Links, for NanoNets? Alex Yakovlev University of Newcastle, UK.

1 NanoNet’07, Catania 17/09/ Asynchronous Links, for NanoNets? Alex Yakovlev University of Newcastle, UK

2 NanoNet’07, Catania 17/09/ Motivation-1 At very deep submicron, gate delay is much less than interconnect delay: total interconnect length can reach several meters; interconnect delay can be as much as 90% of total path delay in VDSM circuits Timing issue is a problem, particularly for global wires Multiple clock domains are reality, problem of interface between them ITRS’05 predicted: 4x (8x) increase in global asynchronous signalling by 2012 (2020)

3 NanoNet’07, Catania 17/09/ Motivation-2 Variability and uncertainty –Geometry and process: for long channels, intra-die variations are less correlated across different parts of the interconnect, both for interconnects and repeaters, e.g. M4 and M5 resistance/um massively differ, leading to mistracking (C. Visweswariah, SLIP’06); e.g. 250nm clock skew has 25% variability due to interconnect variations (Y. Liu et al., DAC’00) –Behavioural: crosstalk (sidewall capacitance can cause up to 7x variation in delay (R. Ho, M. Horowitz))

4 NanoNet’07, Catania 17/09/ A Network on Chip Synchronization required Arbitration required Multiple Clocks Async Links

5 NanoNet’07, Catania 17/09/ Example from the Past: Fault-Tolerant Self-Timed Ring (Varshavsky et al. 1986) For an onboard airborne computer-control system which tolerated up to two faults. Self-timed ring was a GALS system with self-checking and self-repair at the hardware level Individually clocked subsystems Self-timed adapters forming a ring

6 NanoNet’07, Catania 17/09/ Communication Channel Adapter Data (DR,DS) is encoded using 3-of-6 Sperner code (16 data values for half-byte, plus 4 tokens for ring acquisition protocol) AR, AS – acknowledgements RR, RS – spare (for self-repair) lines Much higher reliability than a bus and other forms of redundancy MCC was developed in TTL-Schottky gate arrays, approx. 2K gates.

7 NanoNet’07, Catania 17/09/ Outline Token-based view of communication Basics of asynchronous signalling Self-timed data encoding Pipelining How to hide acknowledgements Serial vs Parallel links Arbiters and routers Async2sync interface CAD issues

8 NanoNet’07, Catania 17/09/ Data exchange: token-based view Question 1: when can Rx look at the incoming data? Data validity issue – Forming a well-defined token source txrx dest Data

9 NanoNet’07, Catania 17/09/ Data exchange: token-based view Question 1: when can Rx look at the data? Data validity issue – Forming a well-defined token Question 2: when can Tx send new data? Acknowledgement issue – Separation b/w tokens source txrx dest Data

10 NanoNet’07, Catania 17/09/ Data exchange: token-based view Question 1: when can Rx look at the data? Data validity issue – Forming a well-defined token Question 2: when can Tx send new data? Acknowledgement issue – Separation b/w tokens These are fundamental issues of flow control at the physical and link levels The answers are determined by many design aspects: technology level, system architecture (application, pipelining), latency, throughput, power, design process etc. source txrx dest Data

11 NanoNet’07, Catania 17/09/ Tokens and spaces with global clocking In globally clocked systems both Q1 and Q2 are resolved with the aid of clock pulses source txrx dest Data clk

12 NanoNet’07, Catania 17/09/ Tokens and spaces Without global clocking: Q1 can be resolved differently from Q2 E.g.: Q1 – source-synchronous (mesochronous), bundled data or self-synchronising codes; Q2 – ack or stop signal, or by local timing source txrx dest Data Clk_tx Clk_rx D_valid bundle

13 NanoNet’07, Catania 17/09/ Tokens and spaces Without global clocking: Q1 can be resolved differently from Q2 E.g.: Q1 – source-synchronous (mesochronous), bundled data or self-synchronising codes; Q2 – ack or stop signal, or by local timing source txrx dest Data D_valid bundle ack

14 NanoNet’07, Catania 17/09/ Petri net model TxRxsource dest Tx delay Rx delay TxRxsource dest Tx delay or ack Rx delay or ack Data Valid ack Always safe but with a round trip delay! One way delay, but may be unsafe!

15 NanoNet’07, Catania 17/09/ Asynchronous handshake signalling Valid data tokens and safe spaces between them can be created by different means of signalling and encoding Level-based -> Return-To-Zero (RTZ) or 4-phase protocol Transition-based -> Non-Return-to-Zero (NRZ) or 2-phase protocol Pulse-based, e.g. GasP Phase-difference-based Data encoding: bundled data (BD), delay-insensitive (DI)

16 NanoNet’07, Catania 17/09/ Handshake Signalling Protocols Level Signalling (RTZ or 4-phase) Transition Signalling (NRZ or 2-phase) One cycle req ack req ack One cycle req ack One cycle
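
The difference between level (RTZ, 4-phase) and transition (NRZ, 2-phase) signalling can be sketched by listing the wire events each protocol produces. This is only an illustrative model; the function names are hypothetical and not taken from the slides.

```python
def four_phase_cycle():
    """One RTZ (4-phase) handshake: req and ack both rise, then both return to zero."""
    return ["req+", "ack+", "req-", "ack-"]


def two_phase_cycles(n):
    """n NRZ (2-phase) handshakes: every transition, rising or falling, is an event."""
    events, req, ack = [], 0, 0
    for _ in range(n):
        req ^= 1  # sender toggles req to signal a new token
        events.append("req+" if req else "req-")
        ack ^= 1  # receiver toggles ack to signal the token was taken
        events.append("ack+" if ack else "ack-")
    return events


# Two 2-phase handshakes use the same wire activity as one 4-phase handshake.
print(four_phase_cycle())
print(two_phase_cycles(2))
```

In this model one RTZ cycle costs four transitions per token, while NRZ costs two, which is why transition signalling is attractive on long wires.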

17 NanoNet’07, Catania 17/09/ Handshake Signalling Protocols Pulse Signalling Single-track Signalling (GasP) One cycle req ack req ack One cycle req + ack req ack

18 NanoNet’07, Catania 17/09/ GasP signalling Pull up from pred (req) Pull down here (ack) Pull up from here (req) Pull down from succ (ack) Pulse length control loops Source: R. Ho et al, Async’04

19 NanoNet’07, Catania 17/09/ Data encoding Bundled data –Code is positional binary, token is determined by Req+ signal; Req+ arrives with a safe set-up delay from data Delay-insensitive codes (tokens determined by the codeword values; require a spacer, or NULL, state if RTZ) –1-of-2 (dual-rail per bit) – systematic code; encoding and decoding are straightforward –m-of-n (n>2) – not systematic, i.e. incurs encoding and decoding costs; optimal when m=n/2 –One-hot, 1-of-n (n>2) – completion detection is easy; not practical for n>4 –Systematic, such as Berger – incurs complex completion detection

20 NanoNet’07, Catania 17/09/ Bundled Data req ack Data One cycle req ack Data RTZ: NRZ: One cycle req ack Data One cycle

21 NanoNet’07, Catania 17/09/ DI encoded data (Dual-Rail) ack Data.0 One cycle Data.1 ack Data.0 Data.1 Logical 1 Logical 0 One cycle NULL (spacer)NULL cycle Data.1 ack Data.0 Logical 1 Logical 0 cycle Logical 1 cycle RTZ: NRZ:

22 NanoNet’07, Catania 17/09/ DI encoded data (Dual-Rail) ack Data.0 One cycle Data.1 ack Data.0 Data.1 Logical 1 Logical 0 One cycle NULL (spacer)NULL cycle Data.1 ack Data.0 Logical 1 Logical 0 cycle Logical 1 cycle RTZ: NRZ: This coding leads to complex logic implementation; hard to track odd and even phases and logic values – hence see LEDR below
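
A minimal sketch of RTZ dual-rail signalling follows, assuming the usual convention that a pulse on Data.0 means logical 0 and a pulse on Data.1 means logical 1, with the all-zero NULL spacer between tokens. The function names are illustrative only.

```python
NULL = (0, 0)  # spacer: neither rail asserted


def dual_rail_encode(bits):
    """Return (data.0, data.1) wire pairs, with a NULL spacer after each bit."""
    pairs = []
    for b in bits:
        pairs.append((0, 1) if b else (1, 0))
        pairs.append(NULL)
    return pairs


def dual_rail_decode(pairs):
    """Recover bits; the codeword itself says when data is valid."""
    bits = []
    for d0, d1 in pairs:
        if (d0, d1) == NULL:
            continue  # spacer separates tokens
        if d0 and d1:
            raise ValueError("both rails high is not a valid dual-rail codeword")
        bits.append(1 if d1 else 0)
    return bits


print(dual_rail_encode([1, 0]))  # [(0, 1), (0, 0), (1, 0), (0, 0)]
```

Because validity is carried by the codeword, the receiver needs no separate request wire or timing assumption.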

23 NanoNet’07, Catania 17/09/ DI codes (1-of-n and m-of-n) 1-of-4: –0001=>00, 0010=>01, 0100=>10, 1000=>11 2-of-4: –1100, 1010, 1001, 0110, 0101, 0011 – total 6 combinations (cf. 2-bit dual-rail – 4 comb.) 3-of-6: –111000, … – total 20 combinations (can encode 4 bits + 4 control tokens) 2-of-7: –… – total 21 combinations (4 bits + 5 control tokens)
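
The codeword counts quoted for the m-of-n codes can be checked by enumeration. The sketch below lists every n-wire word with exactly m wires high and reports how many whole data bits fit, with the remainder available as control tokens; the helper names are hypothetical.

```python
from itertools import combinations


def m_of_n_codewords(m, n):
    """All n-wire codewords with exactly m wires high, as bit strings."""
    return ["".join("1" if i in ones else "0" for i in range(n))
            for ones in combinations(range(n), m)]


def capacity(m, n):
    """(codewords, whole data bits, spare codewords left for control tokens)."""
    count = len(m_of_n_codewords(m, n))
    bits = count.bit_length() - 1  # floor(log2(count))
    return count, bits, count - 2 ** bits


for m, n in [(2, 4), (3, 6), (2, 7)]:
    print(f"{m}-of-{n}:", capacity(m, n))
```

For 3-of-6 this gives 20 codewords, 4 data bits and 4 spare tokens; for 2-of-7 it gives 21 codewords, 4 data bits and 5 spare tokens.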

24 NanoNet’07, Catania 17/09/ DI codes completion detection and decoding 1-of-4 completion detection is a 4-input OR gate (CD=d0+d1+d2+d3) Decode 1-of-4 to dual rail is a set of four 2-input OR gates (q0.0=d0+d2; q0.1=d1+d3; q1.0=d0+d1; q1.1=d2+d3) For m-of-n codes CD and decoding is non-trivial From J.Bainbridge et al, ASYNC’03
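
The OR-gate equations for 1-of-4 completion detection and decoding can be written out directly. The sketch assumes that wire d0 is the rightmost bit of the codeword as printed (so 0001 means d0 is high), which is consistent with the 1-of-4 mapping 0001=>00, 0010=>01, 0100=>10, 1000=>11.

```python
def completion_detect(d):
    """CD = d0 + d1 + d2 + d3: high as soon as any of the four wires is high."""
    d0, d1, d2, d3 = d
    return d0 | d1 | d2 | d3


def one_of_four_to_dual_rail(d):
    """Four 2-input ORs convert a 1-of-4 symbol into two dual-rail bits."""
    d0, d1, d2, d3 = d
    return {
        "q0.0": d0 | d2,  # low bit is 0 for values 00 and 10
        "q0.1": d1 | d3,  # low bit is 1 for values 01 and 11
        "q1.0": d0 | d1,  # high bit is 0 for values 00 and 01
        "q1.1": d2 | d3,  # high bit is 1 for values 10 and 11
    }


# Value 2 (binary 10) is carried by wire d2.
print(one_of_four_to_dual_rail((0, 0, 1, 0)))
```

In the spacer state all four wires are low, so both the completion signal and every dual-rail output are low as well.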

25 NanoNet’07, Catania 17/09/ Incomplete DI codes Incomplete 2-of-7: Composed of 1-of-3 and 1-of-4 From J.Bainbridge et al ASYNC’03

26 NanoNet’07, Catania 17/09/ Phase difference based encoding (C. D’Alessandro et al. ASYNC’06,’07) The proposed system consists in encoding a bit of data in the phase relationship between two signals generated using a reference This would ensure that any transient fault appearing on one of the reference signals will be ignored if it is not mirrored by a corresponding transition on the other line Similarity with multi-wire communication

27 NanoNet’07, Catania 17/09/ Phase encoding: multiple rail No group of wires has the same delay All wires toggle when an item of data is sent Increased number of states available (n wires = n! states), hence more bits/symbol Table illustrates examples of phase encoding compared to the respective m-of-n counterpart [Table columns: Type of Link | Number of states | Bits per Symbol | Extra states | Transitions per symbol | Symbols per packet | Transitions per packet; rows: Phase enc. (4), Phase enc. (6) and their m-of-n counterparts]
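
The "n wires = n! states" rule can be turned into a small capacity calculation. The numbers below are computed from that rule alone, not recovered from the slide's table, and the function name is illustrative.

```python
from math import factorial


def phase_encoding_capacity(wires):
    """(states, whole bits per symbol, extra states) for an n-wire phase-encoded link."""
    states = factorial(wires)       # one state per arrival order of the n transitions
    bits = states.bit_length() - 1  # floor(log2(states))
    return states, bits, states - 2 ** bits


for n in (4, 6):
    print(n, "wires:", phase_encoding_capacity(n))
```

Under this rule four wires give 24 states (4 bits plus 8 extra states) and six wires give 720 states (9 bits plus 208 extra states).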

28 NanoNet’07, Catania 17/09/ Phase encoding Repeater Phase detectors (Mutexes) 1<3 3<1 2<3 3<2 1<2 2<1

29 NanoNet’07, Catania 17/09/ Pipelines Dual-rail pipeline From J.Bainbridge & S. Furber IEEE Micro, 2002

30 NanoNet’07, Catania 17/09/ The problem of Acking Question 2 “when can Tx send new data?” has two aspects: –Safety (not to overflow the channel or when Tx and Rx have much variation in delay) –Performance (to maximize throughput and reduce latency) Can we hide ack (round trip) delay?

31 NanoNet’07, Catania 17/09/ From R.Ho et al. ASYNC’04 To maintain throughput more pipeline stages are required but that costs too much latency and power First minimize latency along a long wire (not specific to asynchronous) and then maximize throughput (using “wagging tail buffer” approach)

32 NanoNet’07, Catania 17/09/ From R.Ho et al. ASYNC’04 Use of wagging buffer approach Alternate between top and bottom control

33 NanoNet’07, Catania 17/09/ “Wagging tail buffer” approach reqtop acktop ackbot reqbot data Top and bot control channels work at ½ the frequency of the data channel
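
The alternation between top and bottom control can be modelled as a round-robin split and merge. This is a behavioural sketch only (no timing), with hypothetical function names; it shows why each control channel sees half of the tokens while order is preserved.

```python
def wag(tokens):
    """Steer successive tokens alternately to the top and bottom lanes."""
    top, bot = [], []
    for i, token in enumerate(tokens):
        (top if i % 2 == 0 else bot).append(token)
    return top, bot


def merge(top, bot):
    """Read the lanes back in the same alternating order."""
    out = []
    for i in range(len(top) + len(bot)):
        lane = top if i % 2 == 0 else bot
        out.append(lane[i // 2])
    return out


top, bot = wag(list(range(7)))
print(top, bot, merge(top, bot))
```

Each lane handles every second token, so its handshake has two data periods in which to complete.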

34 NanoNet’07, Catania 17/09/ Serial Link vs Parallel Link (from R. Dobkin) Why Serial Link? –Less interconnect area –Less routing congestion –Less coupling –Less power (depends on range) The relative improvement grows with technology scaling. The example on the right refers to: –Single gate delay serial link –Fully-shielded parallel link with 8 gate delay clock cycle –Equal bit-rate –Word width N=8 [Figure: regions where the serial or the parallel link dissipates less power and requires less area; axes: Technology Node [nm] and Link Length [mm]]

35 NanoNet’07, Catania 17/09/ Serialization model TxRx Acking at the bit level … …

36 NanoNet’07, Catania 17/09/ Serialization model TxRx Acking at the word level

37 NanoNet’07, Catania 17/09/ Serialization model TxRx Acking at the word level (with more concurrency)

38 NanoNet’07, Catania 17/09/ Serial Link – Top Structure (R. Dobkin, Async’07) Transition signaling instead of sampling: two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS) Acknowledge per word instead of per bit Synchronizers used at the level of the ack signals Wave-pipelining over channel Differential encoding (DS-DE, IEEE ) Reported throughput: 67 Gbps for 65nm process (viz. one bit per 15ps – expected FO4 inverter delay), based on simulations

39 NanoNet’07, Catania 17/09/ Encoding –Two Phase NRZ LEDR Two Phase Non-Return-to-Zero Level Encoded Dual Rail –“delta” encoding (one transition per bit) Uncoded (B) State bit (S) Phase bit (P)
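
A minimal sketch of two-phase LEDR follows, assuming the common formulation in which the state wire S always carries the data value and the phase wire P toggles only when the value repeats, so exactly one wire changes per bit. The start state (0, 0) and the function names are assumptions for illustration.

```python
def ledr_encode(bits, start=(0, 0)):
    """Return the (S, P) wire levels after each transmitted bit."""
    s, p = start
    symbols = []
    for b in bits:
        if b != s:
            s = b   # value changed: toggle the state wire
        else:
            p ^= 1  # value repeated: toggle the phase wire instead
        symbols.append((s, p))
    return symbols


def ledr_decode(symbols, start=(0, 0)):
    """Each single-wire transition marks a new bit; its value is read from S."""
    prev, bits = start, []
    for cur in symbols:
        changed = (cur[0] != prev[0]) + (cur[1] != prev[1])
        if changed != 1:
            raise ValueError("LEDR requires exactly one transition per bit")
        bits.append(cur[0])
        prev = cur
    return bits


print(ledr_encode([1, 1, 0, 0]))  # [(1, 0), (1, 1), (0, 1), (0, 0)]
```

One transition per bit is what lets the receiver detect each symbol without a separate clock or a return-to-zero phase.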

40 NanoNet’07, Catania 17/09/ Transmitter – Fast SR Approach (from R. Dobkin)

41 NanoNet’07, Catania 17/09/ Receiver Splitter (from R. Dobkin)

42 NanoNet’07, Catania 17/09/ Self Timed Networks Router requires priority arbitration –Arbitration necessary at every router merge –Potential delay at every node on the path BUT –Asynchronous merge/arbitration time is average not worst case Adapters to locally clocked cells require synchronization Synchronization necessary when clocks are unknown –Occurs when receiving data (data valid), and when sending (acknowledge) BUT –Time can be long (2 cycles?) –Must assume worst case time (maybe)

43 NanoNet’07, Catania 17/09/ Router priority Virtual channels implement scheduling algorithm Contention for link resolved by priority circuits MergeSplit Link Flow Control

44 NanoNet’07, Catania 17/09/ Asynchronous Arbiters Multiway arbiters (e.g. for Xbar switches): –Cascaded mesh (latency ~ N) –Cascaded Tree (latency ~ logN) –Token-Ring (busy ring and lazy ring) (latency ~ from 1 to N) Priority arbiters (e.g. for Routers with different QoS): –Static priority (topological order) –Dynamic priority (request arrives with priority code) –Ordered (time-priority) - multiway arbiter, followed by a FIFO buffer

45 NanoNet’07, Catania 17/09/ Static Priority Arbiter sq r* C MUTEX C s*q r MUTEX C s*q r MUTEX C s*q r G1 G2 G3 R1 R2 R3 Lock Lock Register Priority Module r1 r2 r3 s1 s2 s3
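
The behaviour of a static priority arbiter can be sketched at a purely functional level: the lock register freezes a snapshot of the requests, and the priority module grants the highest-priority locked request. The sketch assumes that index 0 (r1) has the highest priority; it does not model the MUTEX or C-element circuitry.

```python
def static_priority_grant(requests):
    """Return one-hot grants for a list of request flags, index 0 = highest priority."""
    locked = list(requests)  # snapshot, standing in for the lock register
    grants = [False] * len(locked)
    for i, requested in enumerate(locked):
        if requested:
            grants[i] = True  # first (highest-priority) locked request wins
            break
    return grants


print(static_priority_grant([False, True, True]))  # [False, True, False]
```

Requests that arrive after the snapshot wait for the next arbitration round, which is the role the lock plays in the circuit.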

46 NanoNet’07, Catania 17/09/ Why Synchronizer? Here one clock cycle is used for the metastability to resolve. DFF CLK DATAQ CLK Q Metastability DFF CLK DATA DFF Q Metastability Two DFF Synchronizer

47 NanoNet’07, Catania 17/09/ CAD support: Async design flow

48 NanoNet’07, Catania 17/09/ Device LDS LDTACK D DSr DSw DTACK VME Bus Controller Data Transceiver Bus DSr LDS LDTACK D DTACK Read Cycle Synthesis of Asynchronous link interfaces

49 NanoNet’07, Catania 17/09/ DTACK- DSr+ LDS+ LDTACK+ D+ DTACK+ DSr- D- LDS- LDTACK- DSw- DSw+ D+ LDS+ LDTACK+ D- DTACK+

50 NanoNet’07, Catania 17/09/ DSr+ DTACK- LDS- LDTACK- D- DSr- DTACK+ D+ LDTACK+ LDS+ Complete State Coding (CSC) csc - csc + Boolean equations: LDS = D  csc DTACK = D D = LDTACK csc = DSr Logic asynchronous circuit DTACK D DSr LDS LDTACK csc synthesis DTACK- DSr+ LDS+ LDTACK+ D+ DTACK+ DSr- D- LDS- LDTACK- DSw- DSw+ D+ LDS+ LDTACK+ D- DTACK+

51 NanoNet’07, Catania 17/09/ Conclusions on Async Links At nm level links will be more asynchronous, perhaps first, mesochronous to avoid global clock skew Delay-insensitive codes can be used to tolerate interwire-delay variability Phase-encoding can be used for higher power-bit efficiency and SEU tolerance Acking will be mainly used for flow control (word level) and its overhead can be ‘hidden’ by using the “wagging buffer” technique Serial Links save area and power for long interconnects, with buffering (pipelining) if one wants to maintain high throughput; they also simplify building switches Synthesis tools can be used to build clock-free interfaces between different links Asynchronous logic can be used for building higher level circuits, e.g. arbiters for switches and routers

52 NanoNet’07, Catania 17/09/ And finally …

53 NanoNet’07, Catania 17/09/ ASYNC’08 and NOCs’08 …plus SLIP’08 Held in Newcastle upon Tyne, UK, 7-11 April 2008 (SLIP on 5-6 April – weekend) async.org.uk/async2008 async.org.uk/nocs2008 Submission deadlines: –Async’08: Abstract – Oct. 8, Full paper – Oct. 15 –NOCs’08: Abstract – Nov. 12, Full paper – Nov. 19

54 NanoNet’07, Catania 17/09/ Extras More slides if I have time!

55 NanoNet’07, Catania 17/09/ Chain Network Components From J.Bainbridge & S. Furber IEEE Micro, 2002

56 NanoNet’07, Catania 17/09/ A Network on Chip Synchronization required Arbitration required Multiple Clocks

57 NanoNet’07, Catania 17/09/ Transmitter – Fast SR Approach (from R. Dobkin)

58 NanoNet’07, Catania 17/09/ Receiver Splitter (from R. Dobkin)

59 NanoNet’07, Catania 17/09/ Self Timed Networks Router requires priority arbitration –Arbitration necessary at every router merge –Potential delay at every node on the path BUT –Asynchronous merge/arbitration time is average not worst case Adapters to locally clocked cells require synchronization Synchronization necessary when clocks are unknown –Occurs when receiving data (data valid), and when sending (acknowledge) BUT –Time can be long (2 cycles?) –Must assume worst case time (maybe)

60 NanoNet’07, Catania 17/09/ Router priority Virtual channels implement scheduling algorithm Contention for link resolved by priority circuits MergeSplit Link Flow Control

61 NanoNet’07, Catania 17/09/ Static priority arbiter sq r* C MUTEX C s*q r MUTEX C s*q r MUTEX C s*q r G1 G2 G3 R1 R2 R3 Lock Lock Register Priority Module r1 r2 r3 s1 s2 s3

62 NanoNet’07, Catania 17/09/ Reliability and latency Asynchronous arbiters fail only if time is bounded –Latency depends on fixed gates plus MUTEX lock time –τ for 2 channels, τ + τ ln(N-1) for more –This is likely to be small compared with flow control latency Synchronizers fail at (fairly) predictable rates but these rates may get worse –Latency can be 35τ now for good reliability

63 NanoNet’07, Catania 17/09/ The synchronizer Clock and valid can happen very close together Flip Flop #1 gets caught in metastability We wait until it is resolved (1 –2 clock periods) DQDQ CLK2 VALID #1#2 DATA CLK1

64 NanoNet’07, Catania 17/09/ MTBF $\mathrm{MTBF} = \dfrac{e^{t/\tau}}{T_w \, f_c \, f_d}$ For a 0.18 µm process τ is 20–50 ps; T_w is similar Suppose the clock and data frequencies are 2 GHz t needs to be > 25τ (more than one clock period) to get MTBF > 28 days –100 synchronizers: +5τ –MTBF > 1 year: +2τ –PVT variations ...

65 NanoNet’07, Catania 17/09/ Event Histogram Measurement Convert to log scale, slope gives τ

66 NanoNet’07, Catania 17/09/ Not always simple More than one slope 350ps 120ps 140ps

67 NanoNet’07, Catania 17/09/ Synchronization Strategies Avoid synchronization time (and arbitration time) by –predicting clocks, stoppable clocks –dedicate link paths for long periods of time Minimize time by circuit methods –Higher power, better τ –Reducing apparent device variability - wide transistors –many parallel synchronizers increase throughput Reduce average latency by speculation –Reduce synchronization time, detect errors and roll back

68 NanoNet’07, Catania 17/09/ Timing regions can have predictable relationships Locked –Two clocks from same source –Linked by PLL –One produced by dividing the other –Some asynchronous systems –Some GALS Not locked together but predictable –Two clocks same frequency, but different oscillators. –As above, same frequency ratio

69 NanoNet’07, Catania 17/09/ Don’t synchronise when you don’t need to If the two clocks are locked together, you don’t need a synchroniser, just an asynchronous FIFO big enough to accommodate any jitter/skew FIFO must never overflow Next read clock can be predicted and metastability avoided REQ IN Write Data Available Read done ACK INREQ OUT ACK OUT FIFO DATA

70 NanoNet’07, Catania 17/09/ Conflict Prediction Receiver Clock Transmitter Clock Predicted Transmitter Clock Synchronization problem known a cycle in advance of the Receiver clock. We can do this thanks to the periodic nature of the clocks

71 NanoNet’07, Catania 17/09/ Problems predicting next cycle Difficult to predict –Multiple source clocks –Input output interfaces Dynamic jitter and noise –GALS start up clocks take several cycles to stabilise –Crosstalk –power supply variations introducing noise into both data and clock. –temperature changes alter relative delays As a proportion of cycle time, this is likely to increase with smaller geometries

72 NanoNet’07, Catania 17/09/ Synchronizer reliability trends Clock rates increase. 10 GHz gives 100ps for a cycle. –Both data and clock rates up by n –τ down by n Assume τ scales with cycle time; reliability (MTBF) of one synchronizer down by n Number of synchronizers goes up by N –Die reliability down by N Die-to-die and on-die variability increases to as much as 40% –40% more time needed for all synchronizers

73 NanoNet’07, Catania 17/09/ An example Example –10 GHz clock and data rate –τ = 10 ps –100 synchronizers –MTBF required 3.8 months (10^7 seconds) –Time required 41τ, or 4.1 cycles; + 40% = 5.8 cycles Does this matter?
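
The figures in this example can be reproduced from the usual synchronizer relation MTBF = e^(t/τ) / (T_w · f_c · f_d), solved for the settling time t. The sketch assumes T_w equals τ (the deck only says the two are similar) and that the system MTBF is the single-synchronizer MTBF divided by the number of synchronizers; the function name is hypothetical.

```python
from math import log


def settling_time_in_tau(mtbf_s, n_sync, t_w, f_clk, f_data):
    """Settling time t/tau needed so that n_sync synchronizers together reach mtbf_s."""
    return log(mtbf_s * n_sync * t_w * f_clk * f_data)


tau = 10e-12        # 10 ps, from the example
cycle = 1 / 10e9    # 10 GHz clock gives a 100 ps cycle
k = settling_time_in_tau(mtbf_s=1e7, n_sync=100, t_w=tau, f_clk=10e9, f_data=10e9)
cycles = k * tau / cycle

print(f"t = {k:.1f} tau = {cycles:.1f} cycles; with 40% margin: {1.4 * cycles:.1f} cycles")
```

Under these assumptions the result is about 41τ, or 4.1 cycles, rising to about 5.8 cycles once the 40% variability margin is added.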

74 NanoNet’07, Catania 17/09/ Power futures Total synchronizer area/power small, BUT τ is very sensitive to voltage/power – both n and p transistors can turn off at low voltages – no gain This affects MUTEX circuits as well

75 NanoNet’07, Catania 17/09/ Power/speed tradeoffs Increase Vdd when synchronisation required Make synchronizer transistors wide to reduce variation and, to some extent, τ Make many synchronizer circuits, and select the consistently fastest one Avoid reducing synchronizer Vdd when running slow

76 NanoNet’07, Catania 17/09/ Speculation Mostly, the synchronizer does not need 35τ to settle Only e^-10 (0.005%) need more than 10τ Why not go ahead anyway, and try again if more time was needed?
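
The case for speculation can be made with a back-of-envelope calculation. The sketch assumes, as the slide does, that the fraction of events still unresolved after kτ is e^(-k), and uses a deliberately simple cost model in which a failed speculation pays one further full safe wait; both function names are hypothetical.

```python
from math import exp


def fraction_unresolved(k):
    """Fraction of synchronization events needing more than k time constants."""
    return exp(-k)


def expected_latency_tau(k_fast, k_safe):
    """Mean latency in tau: go ahead after k_fast, pay k_safe more on a failure."""
    return k_fast + fraction_unresolved(k_fast) * k_safe


print(f"{100 * fraction_unresolved(10):.4f}% of events need more than 10 tau")
print(f"mean latency: {expected_latency_tau(10, 35):.3f} tau instead of 35 tau")
```

With these assumptions the mean latency stays very close to 10τ, because retries are rare enough that their cost is negligible on average.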

77 NanoNet’07, Catania 17/09/ Low latency synchronization Data Available, or Free to write are produced early –After one cycle? If they prove to be in error, synchronization failed –Only know this after two or more cycles Read Fail or Write Fail flag is then raised and the action can be repeated. Read Fail Data Available WRITE FIFO Write Fail Write DataRead done Free to write FullNot Empty READ DATA Write clockRead Clock Speculative synchronizer

78 NanoNet’07, Catania 17/09/ Comments Synchronization time will be an issue for future GALS Latency and throughput can be affected –Should the flit be large to reduce the effective overhead of time and power? Some power/speed trade-off is possible –Higher power synchronization can buy some performance? Speculation is complex –Is it worth it?

