
1 NanoNet’07, Catania 17/09/2007 1 Asynchronous Links, for NanoNets? Alex Yakovlev University of Newcastle, UK

2 Motivation-1
At very deep submicron (VDSM), gate delay is much less than interconnect delay: total interconnect length can reach several meters, and interconnect delay can be as much as 90% of total path delay in VDSM circuits
Timing is a problem, particularly for global wires
Multiple clock domains are a reality, and so is the problem of interfacing between them
ITRS’05 predicted a 4x (8x) increase in global asynchronous signalling by 2012 (2020)

3 Motivation-2
Variability and uncertainty
–Geometry and process: over long channels, intra-die variations are less correlated between different parts of the interconnect, for both interconnects and repeaters
 e.g. M4 and M5 resistance/um differ massively, leading to mistracking (C. Visweswariah, SLIP’06)
 e.g. 250nm clock skew has 25% variability due to interconnect variations (Y. Liu et al., DAC’00)
–Behavioural: crosstalk (sidewall capacitance can cause up to 7x variation in delay (R. Ho, M. Horowitz))

4 A Network on Chip
[Figure labels: Synchronization required; Arbitration required; Multiple Clocks; Async Links]

5 Example from the Past: Fault-Tolerant Self-Timed Ring (Varshavsky et al. 1986)
Built for an onboard airborne computer-control system that tolerated up to two faults
The self-timed ring was a GALS system with self-checking and self-repair at the hardware level
[Figure labels: individually clocked subsystems; self-timed adapters forming a ring]

6 Communication Channel Adapter
Data (DR, DS) is encoded using a 3-of-6 Sperner code (16 data values for a half-byte, plus 4 tokens for the ring acquisition protocol)
AR, AS – acknowledgements
RR, RS – spare (for self-repair) lines
Much higher reliability than a bus and other forms of redundancy
MCC was developed in TTL-Schottky gate arrays, approx. 2K gates

7 Outline
Token-based view of communication
Basics of asynchronous signalling
Self-timed data encoding
Pipelining
How to hide acknowledgements
Serial vs Parallel links
Arbiters and routers
Async2sync interface
CAD issues

8 Data exchange: token-based view
Question 1: when can Rx look at the incoming data? Data validity issue – forming a well-defined token
[Figure: source, Tx, Rx, dest, with Data passing from Tx to Rx]

9 Data exchange: token-based view
Question 1: when can Rx look at the data? Data validity issue – forming a well-defined token
Question 2: when can Tx send new data? Acknowledgement issue – separation between tokens
[Figure: source, Tx, Rx, dest, with Data passing from Tx to Rx]

10 Data exchange: token-based view
Question 1: when can Rx look at the data? Data validity issue – forming a well-defined token
Question 2: when can Tx send new data? Acknowledgement issue – separation between tokens
These are fundamental issues of flow control at the physical and link levels
The answers are determined by many design aspects: technology level, system architecture (application, pipelining), latency, throughput, power, design process etc.
[Figure: source, Tx, Rx, dest, with Data passing from Tx to Rx]

11 Tokens and spaces with global clocking
In globally clocked systems both Q1 and Q2 are resolved with the aid of clock pulses
[Figure: source, Tx, Rx, dest, with Data and a common clk]

12 Tokens and spaces
Without global clocking, Q1 can be resolved differently from Q2
E.g.: Q1 – source-synchronous (mesochronous), bundled data or self-synchronising codes; Q2 – ack or stop signal, or by local timing
[Figure: source, Tx, Rx, dest, with Clk_tx and Clk_rx; Data and D_valid form a bundle]

13 Tokens and spaces
Without global clocking, Q1 can be resolved differently from Q2
E.g.: Q1 – source-synchronous (mesochronous), bundled data or self-synchronising codes; Q2 – ack or stop signal, or by local timing
[Figure: source, Tx, Rx, dest; Data and D_valid form a bundle, and an ack is returned]

14 Petri net model
[Figure: two Petri net models of source, Tx, Rx and dest, one timed by Tx delay and Rx delay, the other by Tx delay or ack and Rx delay or ack with Data Valid and ack. Captions: “Always safe but with a round trip delay!” and “One way delay, but may be unsafe!”]

15 Asynchronous handshake signalling
Valid data tokens and safe spaces between them can be created by different means of signalling and encoding
Level-based -> Return-To-Zero (RTZ) or 4-phase protocol
Transition-based -> Non-Return-to-Zero (NRZ) or 2-phase protocol
Pulse-based, e.g. GasP
Phase-difference-based
Data encoding: bundled data (BD), delay-insensitive (DI)

16 Handshake Signalling Protocols
Level Signalling (RTZ or 4-phase)
Transition Signalling (NRZ or 2-phase)
[Figure: req and ack waveforms for each protocol, each marking one cycle]
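The difference between the two protocols can be sketched in a few lines of Python. This is an illustrative model only, assuming an idealized channel with one req wire and one ack wire; the function names are hypothetical and not from the talk. It lists the wire events per transfer: four for level (RTZ, 4-phase) signalling and two for transition (NRZ, 2-phase) signalling.

```python
def four_phase(n_tokens):
    """Return the req/ack events of n_tokens RTZ (4-phase) transfers."""
    events = []
    for _ in range(n_tokens):
        # req+ : data valid, ack+ : data taken, then both wires return to zero
        events += ["req+", "ack+", "req-", "ack-"]
    return events


def two_phase(n_tokens):
    """Return the req/ack events of n_tokens NRZ (2-phase) transfers."""
    events = []
    req = ack = 0
    for _ in range(n_tokens):
        req ^= 1                      # any req transition signals a new token
        events.append("req+" if req else "req-")
        ack ^= 1                      # any ack transition frees the channel
        events.append("ack+" if ack else "ack-")
    return events
```

For the same number of tokens the 2-phase list is half as long, which is the usual argument for transition signalling on long wires; the price is logic that must react to both edges.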

17 Handshake Signalling Protocols
Pulse Signalling
Single-track Signalling (GasP)
[Figure: req and ack pulse waveforms, and the shared req + ack wire of single-track signalling, each marking one cycle]

18 GasP signalling
[Figure labels: pull up from pred (req); pull down here (ack); pull up from here (req); pull down from succ (ack); pulse length control loops]
Source: R. Ho et al., ASYNC’04

19 Data encoding
Bundled data
–Code is positional binary; the token is determined by the Req+ signal, which arrives with a safe set-up delay after the data
Delay-insensitive codes (tokens determined by the codeword values; require a spacer, or NULL, state if RTZ)
–1-of-2 (dual-rail per bit) – systematic code; encoding and decoding are straightforward
–m-of-n (n>2) – not systematic, i.e. incurs encoding and decoding costs; optimal when m=n/2
–One-hot, 1-of-n (n>2) – completion detection is easy; not practical for n>4
–Systematic, such as Berger – incurs complex completion detection
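As an illustration of the 1-of-2 (dual-rail) case with RTZ signalling, the following Python sketch encodes each bit on two rails and separates codewords with a NULL spacer. It assumes the convention that rail Data.0 high means logical 0 and rail Data.1 high means logical 1; all function names are hypothetical.

```python
NULL = (0, 0)  # spacer: neither rail asserted


def dual_rail_encode(bit):
    """Map one bit to (data.0, data.1): exactly one rail is high."""
    return (1, 0) if bit == 0 else (0, 1)


def rtz_stream(bits):
    """Interleave codewords with NULL spacers, as RTZ signalling requires."""
    stream = []
    for b in bits:
        stream.append(dual_rail_encode(b))
        stream.append(NULL)
    return stream


def is_valid(word):
    """Completion detection for one dual-rail bit: OR of the two rails."""
    return bool(word[0] | word[1])


def decode(stream):
    """Recover bits from the valid codewords, skipping spacers."""
    return [w[1] for w in stream if is_valid(w)]
```

The receiver never needs a separate request wire: a token is present exactly when one rail is high, and the spacer separates consecutive tokens.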

20 Bundled Data
[Figure: req, ack and Data waveforms for RTZ and for NRZ signalling, each marking one cycle]

21 DI encoded data (Dual-Rail)
[Figure: Data.0, Data.1 and ack waveforms. RTZ: each cycle carries a Logical 1 or Logical 0 codeword followed by a NULL (spacer). NRZ: each cycle carries a Logical 1 or Logical 0 codeword without a spacer]

22 DI encoded data (Dual-Rail)
[Figure: Data.0, Data.1 and ack waveforms. RTZ: each cycle carries a Logical 1 or Logical 0 codeword followed by a NULL (spacer). NRZ: each cycle carries a Logical 1 or Logical 0 codeword without a spacer]
The NRZ coding leads to complex logic implementation; it is hard to track odd and even phases and logic values – hence see LEDR below

23 DI codes (1-of-n and m-of-n)
1-of-4:
–0001=>00, 0010=>01, 0100=>10, 1000=>11
2-of-4:
–1100, 1010, 1001, 0110, 0101, 0011 – total 6 combinations (cf. 2-bit dual-rail – 4 comb.)
3-of-6:
–111000, 110100, …, 000111 – total 20 combinations (can encode 4 bits + 4 control tokens)
2-of-7:
–1100000, 1010000, …, 0000011 – total 21 combinations (4 bits + 5 control tokens)
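The codeword counts above follow directly from the binomial coefficient C(n, m). A short Python sketch (helper names are hypothetical) enumerates the codewords and splits each code into whole data bits plus leftover codewords available as control tokens:

```python
from itertools import combinations
from math import comb


def m_of_n_codewords(m, n):
    """All n-bit strings with exactly m ones."""
    words = []
    for ones in combinations(range(n), m):
        words.append("".join("1" if i in ones else "0" for i in range(n)))
    return words


def capacity(m, n):
    """Return (number of codewords, whole data bits, codewords left over)."""
    total = comb(n, m)
    bits = total.bit_length() - 1     # floor(log2(total)) without floats
    return total, bits, total - 2 ** bits
```

For example, `capacity(3, 6)` gives 20 codewords, 4 data bits and 4 spare codewords, and `capacity(2, 7)` gives 21, 4 and 5, matching the figures on the slide.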

24 DI codes completion detection and decoding
1-of-4 completion detection is a 4-input OR gate (CD=d0+d1+d2+d3)
Decoding 1-of-4 to dual-rail is a set of four 2-input OR gates (q0.0=d0+d2; q0.1=d1+d3; q1.0=d0+d1; q1.1=d2+d3)
For m-of-n codes, CD and decoding are non-trivial
From J. Bainbridge et al., ASYNC’03
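The gate equations on this slide can be checked with a small Python model. It assumes that wire d_k is the one asserted for the 2-bit value k, which is an interpretation of the 1-of-4 mapping given earlier (0001=>00 with d0 as the rightmost wire); the function names are hypothetical.

```python
def encode_1of4(value):
    """Return wires (d0, d1, d2, d3) with wire `value` high (value in 0..3)."""
    return tuple(1 if i == value else 0 for i in range(4))


def completion_detect(d):
    """CD = d0 + d1 + d2 + d3 (a 4-input OR)."""
    d0, d1, d2, d3 = d
    return d0 | d1 | d2 | d3


def to_dual_rail(d):
    """Decode 1-of-4 into two dual-rail bits using four 2-input ORs."""
    d0, d1, d2, d3 = d
    q0 = (d0 | d2, d1 | d3)   # (q0.0, q0.1): low-order bit
    q1 = (d0 | d1, d2 | d3)   # (q1.0, q1.1): high-order bit
    return q0, q1
```

Because the decoder is purely combinational ORs, the all-zero spacer on the 1-of-4 side maps to the all-zero spacer on both dual-rail bits, so the RTZ discipline carries through unchanged.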

25 Incomplete DI codes
Incomplete 2-of-7: composed of 1-of-3 and 1-of-4
From J. Bainbridge et al., ASYNC’03

26 Phase difference based encoding (C. D’Alessandro et al., ASYNC’06,’07)
The proposed system encodes a bit of data in the phase relationship between two signals generated using a reference
This ensures that any transient fault appearing on one of the reference signals is ignored if it is not mirrored by a corresponding transition on the other line
Similarity with multi-wire communication

27 Phase encoding: multiple rail
No group of wires has the same delay
All wires toggle when an item of data is sent
Increased number of states available (n wires = n! states), hence more bits/symbol
The table compares phase encoding with the respective m-of-n counterpart
Type of link   | States | Bits/symbol | Extra states | Transitions/symbol | Symbols/packet | Transitions/packet
Phase enc. (4) | 24     | 4           | 8            | 4                  | 32             | 128
1-of-4         | 4      | 2           | 0            | 2                  | 64             | 128
Phase enc. (6) | 720    | 9           | 208          | 6                  | 15             | 90
3-of-6         | 20     | 4           | 4            | 6                  | 32             | 192
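The table entries can be reproduced from the number of states alone. The Python sketch below is a reconstruction under one stated assumption: the packet is 128 bits long, which is inferred from the symbol counts in the table and is not stated on the slide. Helper names are hypothetical.

```python
from math import ceil, comb, factorial

PACKET_BITS = 128  # assumption: inferred from the table, not stated on the slide


def link_row(states, transitions_per_symbol, packet_bits=PACKET_BITS):
    """Return (states, bits/symbol, extra states, transitions/symbol,
    symbols/packet, transitions/packet) for one link type."""
    bits = states.bit_length() - 1          # whole bits per symbol
    extra = states - 2 ** bits              # codewords left over
    symbols = ceil(packet_bits / bits)      # symbols needed per packet
    return (states, bits, extra, transitions_per_symbol,
            symbols, symbols * transitions_per_symbol)


rows = {
    "Phase enc. (4)": link_row(factorial(4), 4),   # every wire toggles once
    "1-of-4":         link_row(comb(4, 1), 2),     # RTZ: one wire up, then down
    "Phase enc. (6)": link_row(factorial(6), 6),
    "3-of-6":         link_row(comb(6, 3), 6),     # RTZ: three wires up, then down
}
```

With six wires, phase encoding needs 90 transitions per packet against 192 for 3-of-6, which is where the power-per-bit advantage claimed for phase encoding comes from.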

28 Phase encoding Repeater
[Figure: phase detectors (mutexes) for each pair of wires, with outputs 1<3, 3<1, 2<3, 3<2, 1<2, 2<1]

29 Pipelines
[Figure: dual-rail pipeline]
From J. Bainbridge & S. Furber, IEEE Micro, 2002

30 The problem of Acking
Question 2, “when can Tx send new data?”, has two aspects:
–Safety (not overflowing the channel, or when Tx and Rx have much variation in delay)
–Performance (maximizing throughput and reducing latency)
Can we hide the ack (round trip) delay?

31 From R. Ho et al., ASYNC’04
To maintain throughput, more pipeline stages are required, but that costs too much latency and power
First minimize latency along a long wire (not specific to asynchronous) and then maximize throughput (using the “wagging tail buffer” approach)

32 From R. Ho et al., ASYNC’04
Use of the wagging buffer approach
Alternate between top and bottom control

33 “Wagging tail buffer” approach
Top and bot control channels work at ½ the frequency of the data channel
[Figure labels: reqtop, acktop, reqbot, ackbot, data]

34 Serial Link vs Parallel Link (from R. Dobkin)
Why Serial Link?
–Less interconnect area
–Less routing congestion
–Less coupling
–Less power (depends on range)
The relative improvement grows with technology scaling
The example on the right refers to:
–Single gate delay serial link
–Fully-shielded parallel link with 8 gate delay clock cycle
–Equal bit-rate
–Word width N=8
[Figure: two plots over Technology Node [nm] and Link Length [mm]; one separates the region where the parallel link dissipates less power from the region where the serial link dissipates less power, the other separates the regions where the parallel or the serial link requires less area]

35 Serialization model
Acking at the bit level
[Figure: Tx and Rx]

36 Serialization model
Acking at the word level
[Figure: Tx and Rx]

37 Serialization model
Acking at the word level (with more concurrency)
[Figure: Tx and Rx]

38 Serial Link – Top Structure (R. Dobkin, ASYNC’07)
Transition signaling instead of sampling: two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS)
Acknowledge per word instead of per bit
Synchronizers used at the level of the ack signals
Wave-pipelining over channel
Differential encoding (DS-DE, IEEE1355-95)
Reported throughput: 67 Gbps for a 65nm process (viz. one bit per 15ps – the expected FO4 inverter delay), based on simulations

39 Encoding – Two Phase NRZ LEDR
Two Phase Non-Return-to-Zero Level Encoded Dual Rail
–“delta” encoding (one transition per bit)
[Figure: waveforms of Uncoded (B), State bit (S) and Phase bit (P); bit labels shown: 00110 00010]
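The “one transition per bit” property can be demonstrated with a small Python sketch. It assumes the common LEDR convention in which one wire (called `s` here) carries the data value and the other wire (`p`) toggles whenever the data value repeats; the function names, the wire names and the all-zero starting state are assumptions made for illustration.

```python
def ledr_encode(bits, s=0, p=0):
    """Encode bits as (s, p) wire pairs, starting from wire state (s, p)."""
    out = []
    for b in bits:
        if b == s:
            p ^= 1      # same value as before: toggle only the second wire
        else:
            s = b       # new value: toggle only the data wire
        out.append((s, p))
    return out


def ledr_decode(pairs):
    """The data wire carries the value directly."""
    return [s for s, _ in pairs]


def transitions(pairs, s=0, p=0):
    """Number of wire transitions between consecutive symbols."""
    counts = []
    for ns, np_ in pairs:
        counts.append((ns != s) + (np_ != p))
        s, p = ns, np_
    return counts
```

Since exactly one wire changes per bit, the parity `s ^ p` alternates with every symbol; the receiver can use that alternation as its timing reference, which is why no return-to-zero phase is needed.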

40 Transmitter – Fast SR Approach (from R. Dobkin)

41 Receiver Splitter (from R. Dobkin)

42 Self Timed Networks
Router requires priority arbitration
–Arbitration necessary at every router merge
–Potential delay at every node on the path
BUT
–Asynchronous merge/arbitration time is average, not worst case
Adapters to locally clocked cells require synchronization
Synchronization necessary when clocks are unknown
–Occurs when receiving data (data valid), and when sending (acknowledge)
BUT
–Time can be long (2 cycles?)
–Must assume worst case time (maybe)

43 Router priority
Virtual channels implement scheduling algorithm
Contention for link resolved by priority circuits
[Figure labels: Merge; Split; Link; Flow Control]

44 Asynchronous Arbiters
Multiway arbiters (e.g. for Xbar switches):
–Cascaded mesh (latency ~ N)
–Cascaded tree (latency ~ logN)
–Token-ring (busy ring and lazy ring) (latency ~ from 1 to N)
Priority arbiters (e.g. for routers with different QoS):
–Static priority (topological order)
–Dynamic priority (request arrives with priority code)
–Ordered (time-priority) – multiway arbiter, followed by a FIFO buffer

45 Static Priority Arbiter
[Figure labels: MUTEX and C-elements with s, q, r ports; requests R1, R2, R3; grants G1, G2, G3; Lock; Lock Register; Priority Module with r1, r2, r3 and s1, s2, s3]

46 Why Synchronizer?
Here one clock cycle is used for the metastability to resolve
[Figure: a single DFF (DATA, CLK, Q) whose output Q shows metastability, and a two-DFF synchronizer with waveforms for CLK, DATA and Q]

47 CAD support: Async design flow

48 Synthesis of Asynchronous link interfaces
[Figure: VME Bus Controller between the Bus (DSr, DSw, DTACK) and the Device (LDS, LDTACK), driving the Data Transceiver through D; timing diagram of the Read Cycle for DSr, LDS, LDTACK, D and DTACK]

49 [Figure: signal transition graph of the VME bus controller. Transitions shown: DTACK-, DSr+, LDS+, LDTACK+, D+, DTACK+, DSr-, D-, LDS-, LDTACK-, DSw-, DSw+, D+, LDS+, LDTACK+, D-, DTACK+]

50 Complete State Coding (CSC)
Boolean equations:
LDS = D + csc
DTACK = D
D = LDTACK · csc
csc = DSr · (csc + ¬LDTACK)
[Figure: signal transition graph of the read cycle (DSr+, LDS+, LDTACK+, D+, DTACK+, DSr-, D-, DTACK-, LDS-, LDTACK-) with the inserted csc+ and csc- transitions, and the logic asynchronous circuit obtained by synthesis, with signals DSr, LDTACK, LDS, D, DTACK and csc]
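The behaviour of the synthesized controller can be explored with a small gate-level simulation in Python. This is a sketch under stated assumptions: it uses the next-state equations LDS = D + csc, DTACK = D, D = LDTACK · csc and csc = DSr · (csc + ¬LDTACK), the form usually quoted for this VME bus controller read cycle; it fires one gate at a time in a fixed order; and it lets the environment (bus and device) move only after the circuit has settled. Function and variable names are hypothetical.

```python
GATES = {
    # output signal: next-state function of the current signal values
    "csc":   lambda v: v["DSr"] & (v["csc"] | (1 - v["LDTACK"])),
    "LDS":   lambda v: v["D"] | v["csc"],
    "D":     lambda v: v["LDTACK"] & v["csc"],
    "DTACK": lambda v: v["D"],
}


def settle(v, trace):
    """Fire one excited gate at a time until no gate output wants to change."""
    changed = True
    while changed:
        changed = False
        for name, fn in GATES.items():
            new = fn(v)
            if new != v[name]:
                v[name] = new
                trace.append(name + ("+" if new else "-"))
                changed = True
                break


def read_cycle():
    """Simulate one read cycle; return the transition trace and final state."""
    v = dict.fromkeys(["DSr", "LDTACK", "csc", "LDS", "D", "DTACK"], 0)
    trace = []
    # Environment moves: the bus raises DSr, the device answers LDS+ with
    # LDTACK+, the bus answers DTACK+ with DSr-, the device releases LDTACK.
    for signal, value in [("DSr", 1), ("LDTACK", 1), ("DSr", 0), ("LDTACK", 0)]:
        v[signal] = value
        trace.append(signal + ("+" if value else "-"))
        settle(v, trace)
    return trace, v
```

Calling `read_cycle()` yields the sequence DSr+, csc+, LDS+, LDTACK+, D+, DTACK+, DSr-, csc-, D-, LDS-, DTACK-, LDTACK-. In the real circuit LDS- and DTACK- are concurrent; the fixed gate order here merely picks one interleaving.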

51 Conclusions on Async Links
At nm level, links will be more asynchronous; perhaps mesochronous first, to avoid global clock skew
Delay-insensitive codes can be used to tolerate inter-wire delay variability
Phase encoding can be used for higher power-bit efficiency and SEU tolerance
Acking will be mainly used for flow control (word level), and its overhead can be ‘hidden’ by using the “wagging buffer” technique
Serial links save area and power for long interconnects, with buffering (pipelining) if one wants to maintain high throughput; they also simplify building switches
Synthesis tools can be used to build clock-free interfaces between different links
Asynchronous logic can be used for building higher level circuits, e.g. arbiters for switches and routers

52 And finally …

53 ASYNC’08 and NOCs’08 …plus SLIP’08
Held in Newcastle upon Tyne, UK, 7-11 April 2008 (SLIP on 5-6 April – weekend)
async.org.uk/async2008
async.org.uk/nocs2008
Submission deadlines:
–ASYNC’08: Abstract – Oct. 8, Full paper – Oct. 15
–NOCs’08: Abstract – Nov. 12, Full paper – Nov. 19

54 Extras
More slides if I have time!

55 Chain Network Components
From J. Bainbridge & S. Furber, IEEE Micro, 2002

62 Reliability and latency
Asynchronous arbiters fail only if time is bounded
–Latency depends on fixed gates plus MUTEX lock time
–τ for 2 channels, τ + τ ln(N-1) for more
–This is likely to be small compared with flow control latency
Synchronizers fail at (fairly) predictable rates, but these rates may get worse
–Latency can be 35τ now for good reliability

63 The synchronizer
Clock and valid can happen very close together
Flip-flop #1 gets caught in metastability
We wait until it is resolved (1–2 clock periods)
[Figure: DATA and VALID from the CLK1 domain; VALID passes through flip-flops #1 and #2 clocked by CLK2]

64 MTBF
$$\mathrm{MTBF} = \frac{e^{t/\tau}}{T_w \, f_c \, f_d}$$
For a 0.18 µm process, τ is 20–50 ps and T_w is similar
Suppose the clock and data frequencies are 2 GHz
t needs to be > 25τ (more than one clock period) to get MTBF > 28 days
–100 synchronizers: +5τ
–MTBF > 1 year: +2τ
–PVT variations: +5–10τ

65 Event Histogram Measurement
Convert to log scale; the slope is τ

66 Not always simple
More than one slope
[Figure labels: 350ps, 120ps, 140ps]

67 Synchronization Strategies
Avoid synchronization time (and arbitration time) by
–predicting clocks, stoppable clocks
–dedicating link paths for long periods of time
Minimize time by circuit methods
–Higher power, better τ
–Reducing apparent device variability – wide transistors
–Many parallel synchronizers increase throughput
Reduce average latency by speculation
–Reduce synchronization time, detect errors and roll back

68 Timing regions can have predictable relationships
Locked
–Two clocks from same source
–Linked by PLL
–One produced by dividing the other
–Some asynchronous systems
–Some GALS
Not locked together but predictable
–Two clocks same frequency, but different oscillators
–As above, same frequency ratio

69 Don’t synchronise when you don’t need to
If the two clocks are locked together, you don’t need a synchroniser, just an asynchronous FIFO big enough to accommodate any jitter/skew
FIFO must never overflow
Next read clock can be predicted and metastability avoided
[Figure: FIFO with DATA, REQ IN, ACK IN, REQ OUT, ACK OUT, Write, Data Available and Read done signals]

70 Conflict Prediction
The synchronization problem is known a cycle in advance of the receiver clock
We can do this thanks to the periodic nature of the clocks
[Figure: waveforms of Receiver Clock, Transmitter Clock and Predicted Transmitter Clock]

71 Problems predicting next cycle
Difficult to predict
–Multiple source clocks
–Input output interfaces
Dynamic jitter and noise
–GALS start-up clocks take several cycles to stabilise
–Crosstalk
–Power supply variations introduce noise into both data and clock
–Temperature changes alter relative delays
As a proportion of cycle time, this is likely to increase with smaller geometries

72 Synchronizer reliability trends
Clock rates increase: 10 GHz gives 100ps for a cycle
–Both data and clock rates up by n
–τ down by n
Assuming τ scales with cycle time, the reliability (MTBF) of one synchronizer goes down by n
Number of synchronizers goes up by N
–Die reliability down by N
Die-to-die and on-die variability increases to as much as 40%
–40% more time needed for all synchronizers

73 An example
–10 GHz clock and data rate
–τ = 10 ps
–100 synchronizers
–MTBF required: 3.8 months (10^7 seconds)
–Time required: 41τ, or 4.1 cycles; + 40% = 5.8 cycles
Does this matter?
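The numbers in this example can be checked by inverting the usual synchronizer formula MTBF = e^(t/τ) / (T_w · f_c · f_d). The Python sketch below makes two assumptions that the slide does not state explicitly: the metastability window T_w equals τ (10 ps), and N synchronizers divide the system MTBF by N. Names are hypothetical.

```python
from math import log

TAU = 10e-12        # metastability time constant, seconds
F_CLK = 10e9        # clock frequency, Hz


def settling_time_in_tau(mtbf_s, n_sync, t_w_s, f_clk_hz, f_data_hz):
    """Solve MTBF = exp(t/tau) / (n_sync * T_w * f_c * f_d) for t/tau."""
    return log(mtbf_s * n_sync * t_w_s * f_clk_hz * f_data_hz)


t_over_tau = settling_time_in_tau(
    mtbf_s=1e7, n_sync=100,
    t_w_s=TAU,               # assumption: T_w equal to tau
    f_clk_hz=F_CLK, f_data_hz=10e9)
cycles = t_over_tau * TAU * F_CLK        # settling time in clock cycles
cycles_with_margin = cycles * 1.4        # 40% allowance for variability
```

Under these assumptions the settling time comes out a little above 41τ, that is about 4.1 clock cycles, and about 5.8 cycles with the 40% margin, in line with the slide.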

74 Power futures
Total synchronizer area/power is small, BUT
τ is very sensitive to voltage/power – both n and p transistors can turn off at low voltages – no gain
This affects MUTEX circuits as well

75 Power/speed tradeoffs
Increase V_dd when synchronisation is required
Make synchronizer transistors wide to reduce variation and, to some extent, τ
Make many synchronizer circuits, and select the consistently fastest one
Avoid reducing synchronizer V_dd when running slow

76 Speculation
Mostly, the synchronizer does not need 35τ to settle
Only e^-10 (0.005%) need more than 10τ
Why not go ahead anyway, and try again if more time was needed?

77 Low latency synchronization
Data Available, or Free to write, are produced early
–After one cycle?
If they prove to be in error, synchronization failed
–Only know this after two or more cycles
Read Fail or Write Fail flag is then raised and the action can be repeated
[Figure: speculative synchronizer around a FIFO, with Write clock, WRITE, Write Data, Free to write, Full and Write Fail on one side, and Read Clock, READ, DATA, Data Available, Not Empty, Read done and Read Fail on the other]

78 Comments
Synchronization time will be an issue for future GALS
Latency and throughput can be affected
–Should the flit be large to reduce the effective overhead of time and power?
Some power/speed trade-off is possible
–Higher power synchronization can buy some performance?
Speculation is complex
–Is it worth it?

