Clocking and Timing in Fault-Tolerant Systems-on-Chip, Andreas Steininger


1 Clocking and Timing in Fault-Tolerant Systems-on-Chip, Andreas Steininger

2 Outline: The Clock as a Blessing; The Clock as a Curse; Alternative Synchronization Schemes (GALS, fully asynchronous, the DARTS approach); Conclusion

3 Contributors to this Work: the DARTS project team. TU Vienna: Gottfried Fuchs, Matthias Fuegger, Ulrich Schmid, Thomas Handl. RUAG Space: Gerald Kempf, Manfred Sust, Wolfgang Zangerl

4 The Need for Fault Tolerance: miniaturization is key to progress in VLSI; smaller structures, lower voltage swing, smaller critical charge and higher operating frequencies result in higher susceptibility to faults (SET, EMI, …) => cannot avoid faults, need to tolerate them

5 The Role of Time: "The only reason for time is so that everything doesn't happen at once", Albert Einstein

6 The Need for Clocking: activities need to be co-ordinated on system level (braking of wheels, …), on algorithmic level (consensus, …), on communication level, and on logic level (state machine switching, …); co-ordination in the time domain (synchronization) is an efficient way to attain this => need a global notion of time (discrete "ticks")

7 The Quality of Synchronization [figure: local time (number of ticks) versus real time, with the precision π marked]

8 Typical Precision Values: on system level: µs … ms; on algorithm level: µs … ms; on communication level: ns … µs; on logic level: ps … ns

9 Synchronization Requirements: phase synchronisation (for "hardware clock" on logic level) versus clock synchronisation (for distributed time base on algorithmic level); 1 µs is excellent precision for a distributed clock, but at 1 GHz it spans 1000 clock periods, i.e. a phase shift of 360,000°
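
The arithmetic behind the phase-shift figure on slide 9 can be checked with a small Python sketch. The function name and the choice of units (nanoseconds and gigahertz, so that their product is a number of clock periods) are illustrative assumptions, not part of the talk.

```python
def phase_shift_degrees(offset_ns: float, clock_ghz: float) -> float:
    """Express a time offset as a phase angle of a clock.

    offset_ns * clock_ghz is the number of clock periods covered by the
    offset; each period corresponds to 360 degrees of phase.
    """
    periods = offset_ns * clock_ghz
    return periods * 360.0


# 1 µs = 1000 ns of clock-synchronisation precision, seen by a 1 GHz clock
print(phase_shift_degrees(1000.0, 1.0))  # 360000.0
```

A precision that is excellent for a distributed time base is therefore three orders of magnitude too coarse to serve as a phase reference on logic level.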

10 Globally Synchronous Design: whole design is "isochronic" ("perfect" precision); time conveyed by clock transitions; perfect co-ordination of all activities; very efficient design: can assume consistent states, high level of abstraction; very efficient implementation: single crystal oscillator, single control line (clock net)

11 "Isochronic" Regions? speed of light (in medium) = 2 × 10^8 m/s = 20 cm/ns [figure: 2 cm reference length with regions for 1 GHz, 4 GHz and 8 GHz]
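
To see why the slide questions isochrony, the following Python sketch computes how far a signal travels within one clock period, using the propagation speed quoted on slide 11. Comparing the result with the 2 cm reference length is an interpretation of the figure, and the function and constant names are made up for this example.

```python
# Propagation speed from slide 11: 2 x 10^8 m/s, written here in cm/s.
SIGNAL_SPEED_CM_PER_S = 2e10


def distance_per_period_cm(clock_hz: float) -> float:
    """Distance a signal covers during one clock period, in centimetres."""
    return SIGNAL_SPEED_CM_PER_S / clock_hz


for freq_ghz in (1, 4, 8):
    d = distance_per_period_cm(freq_ghz * 1e9)
    print(f"{freq_ghz} GHz: {d} cm per clock period")
# 1 GHz: 20.0 cm, 4 GHz: 5.0 cm, 8 GHz: 2.5 cm
```

At 8 GHz a full clock period corresponds to only 2.5 cm of wire, so a die of about 2 cm can no longer be treated as a single point in time.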

12 The Variation Problem [figure: the designer works with a system model and projected conditions, covered by worst-case safety margins; the user faces the actual system (imperfections) and the actual conditions (unknown)] Timing is completely fixed after design; there is no way to react to the actual conditions and system ("PVT variations")

13 Fault-Tolerant Architectures: Duplication & Comparison [figure: two FUs feeding a comparator (=?) that raises ERR]; Triple-Modular Redundancy [figure: three FUs feeding a voter that produces Y]
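
The two architectures on slide 13 can be modelled in a few lines of Python. This is a behavioural sketch only: the function names are invented here, and the replicas are represented simply by their output values.

```python
from collections import Counter


def duplicate_and_compare(out_a, out_b):
    """Duplication & comparison: forward one output and flag a mismatch.

    A mismatch is detected (ERR) but cannot be corrected, because the
    comparator does not know which replica is wrong.
    """
    return out_a, out_a != out_b


def tmr_vote(out_a, out_b, out_c):
    """Triple-modular redundancy: return the value of the majority."""
    value, count = Counter([out_a, out_b, out_c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: more than one replica disagrees")
    return value


print(duplicate_and_compare(3, 4))  # (3, True): error detected
print(tmr_vote(3, 3, 4))            # 3: single fault masked
```

Duplication detects a single fault, while TMR masks it, at the price of a third replica and a voter.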

14 Lock-Step Operation, single clock: single point of failure, good replica determinism [figure: three FUs and a voter with output Y; all replicas switch from "3" to "4" at the same time]

15 Lock-Step Operation, independent clocks: single fault tolerant, bad replica determinism [figure: three FUs and a voter with output Y; the replicas switch from "3" to "4" at different times]

16 Fault-Tolerant HW-Clocking [figure: three FUs and a voter with output Y; each FU clock is produced by a voter (v)]

17 Fault-Tolerant HW-Clocking [figure: three FUs, voter with output Y, clock voters (v)]

18 The Charm of SoCs: billions of transistors fit on one die => structuring into (IP) modules: "System-on-Chip". BUT: large clock distribution networks => "isochronic"?; FT clocking does not work with large skew; may need individual clocks for function modules => clock synchrony neither attainable nor desirable

19 Co-ordination of Data Exchange [figure: SRC -> f(x) -> SNK] When can SNK use its input? When it is valid and consistent. When can SRC apply the next input? When SNK has consumed the previous one.

20 The Synchronous Approach [figure: SRC -> f(x) -> SNK]: co-ordination based on (global) time

21 Alternative: Asynchronous Design [figure: SRC -> f(x) -> SNK]: co-ordination based on handshaking; REQ: "Data word valid, you can use it"; ACK: "Data word consumed, send the next"
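
The REQ/ACK co-ordination of slide 21 can be illustrated with a sequential Python model. The slide does not say which handshake protocol is meant; the sketch assumes a four-phase (return-to-zero) handshake, and the function name and the trace format are illustrative.

```python
def four_phase_transfer(words):
    """Move words from SRC to SNK using a four-phase REQ/ACK handshake.

    Returns the words received by SNK and the (REQ, ACK) state after
    every signal transition.
    """
    req = ack = False
    received, trace = [], []
    for word in words:
        data, req = word, True      # SRC: data word valid, raise REQ
        trace.append((req, ack))
        received.append(data)       # SNK: consume the word ...
        ack = True                  # ... and raise ACK
        trace.append((req, ack))
        req = False                 # SRC: saw ACK, withdraw REQ
        trace.append((req, ack))
        ack = False                 # SNK: saw REQ drop, withdraw ACK
        trace.append((req, ack))
    return received, trace


received, trace = four_phase_transfer([0xFF14])
print(received)  # [65300]
print(trace)     # [(True, False), (True, True), (False, True), (False, False)]
```

No step depends on elapsed time: each party proceeds only after observing the other's transition, which is the closed-loop property that replaces the global clock.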

22 Async. Design – Advantages: closed-loop control makes timing much more robust and adaptive to PVT variations; no need for worst-case timing; local handshakes replace global clock; activity only when needed; beneficial for EMI; tends to stop operation in case of fault

23 Async. Design – Disadvantages: need to handle race between REQ and data

24 Async. Design – Disadvantages: need to handle race between REQ and data [figure: SRC -> f(x) -> SNK with REQ: "Data word valid, you can use it"]

25 Async. Design – Disadvantages: need to handle race between REQ and data; Solution 1: "Bundled Data" [figure: SRC -> f(x) -> SNK with REQ: "Data word valid, you can use it"]

26 Async. Design – Disadvantages: need to handle race between REQ and data; Solution 2: "Delay Insensitive" (Coding) [figure: SRC -> f(x) -> SNK with completion detection generating REQ: "Data word valid, you can use it"]
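
Slide 26 names delay-insensitive coding with completion detection but not a specific code. As one common instance, the Python sketch below assumes dual-rail encoding: every bit travels on two wires, (1, 0) for logic 1, (0, 1) for logic 0, and (0, 0) as the empty spacer. All names are illustrative.

```python
def encode_dual_rail(bits):
    """Encode each bit as a (true_rail, false_rail) pair."""
    return [(1, 0) if b else (0, 1) for b in bits]


def is_complete(word):
    """Completion detection: every bit position carries a valid code word.

    This signal plays the role of REQ, so no timing assumption between
    REQ and data is needed.
    """
    return all(t != f for t, f in word)


def decode_dual_rail(word):
    """Recover the bits; only legal once completion has been detected."""
    if not is_complete(word):
        raise ValueError("data word not yet complete")
    return [t for t, _ in word]


word = encode_dual_rail([1, 0, 1])
partial = [word[0], (0, 0), word[2]]   # middle bit has not arrived yet
print(is_complete(partial))            # False
print(is_complete(word))               # True
print(decode_dual_rail(word))          # [1, 0, 1]
```

The doubling of wires and the completion detector are part of the hardware overhead that the talk lists among the disadvantages.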

27 Async. Design – Disadvantages: need to handle race between REQ and data; significant HW overhead (coding, delay elements); "adaptive" timing not as predictable; more difficult to design; classical fault-tolerance schemes not applicable; tends to stop operation in case of fault

28 Best of Both Worlds: GALS (Globally Asynchronous Locally Synchronous): retain efficiency of synchronous design wherever possible ("intra-module"); use asynchronous principle where clock distribution is too cumbersome ("inter-module"); first mentioned in the PhD thesis by Chapiro (Stanford, 1984)

29 A GALS Example [figure: CPU 2 GHz, PCI-IF 533 MHz, DSP 2.7 GHz, USB-IF 24 MHz]

30 Communication in GALS. Shared Memory: producer writes to memory, consumer reads from there; pro: control flow stays independent; variants: shared single-port memory, true dual-port memory. Direct Messages (data words): move data word from producer's output register to consumer's input register; non-buffered / buffered (FIFO queues); clock fixed, data-driven or pausible

31 Shared Memory: decoupling of clock domains by memory acting as a third party => high area overhead => unusual. For single-port memory: arbitration required; arbitration problem (unbounded delay…); one side may block the other at the arbiter. For multiport memory: problems are confined to access to the same cell; busy flag may become metastable; blocking still possible for one specific address

32 Shared Memory [figure: CPU 2 GHz and DSP 2.7 GHz accessing address 0xff14 of a shared memory through arbitration]: perfect decoupling of data path; potential metastability problems at arbitration logic; potential blocking through arbitration

33 Direct Messages: clock domain boundary is between producer's output register and consumer's input register; in general a synchronizer is needed at consumer's input (definitely for conventional (fixed) clock; can be avoided by data-driven / pausible clocking); control flows of producer and consumer are strongly coupled: a party that does not service its input/output register blocks the other party; buffers/queues/FIFOs can mitigate, but not avoid this problem (full/empty): they compensate variations in the data rate on both sides, but not different average data rates

34 Direct Messages: data moving over clock domain boundary => metastability problems => need to insert handshake with synchronizers and (optional) buffers [figure: CPU 2 GHz passing 0xff14 to DSP 2.7 GHz through synchronizers (S)]

35 Arbiter: Principle. Purpose: manage competing requests to a shared resource. Method: handle pairs of request_in / grant_out; requests may arrive in any order; arbiter must activate only one grant_out at a time (respond to the first requester). Mutual Exclusion (MUTEX) problem: resolve concurrent requests => metastability problem

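36 Arbiter: Circuit [figure: MUTEX element built from an SR-latch (inputs R1, R2; latch outputs G1', G2') followed by a "metastability filter", e.g., a hi-threshold inverter, producing G1, G2; plot of V out,FF over t with V th,inv and V meta marked; from D. J. Kinniment, "Synchronization and Arbitration in Digital Systems", Wiley]

37 Arbiter: Operation [figure: waveforms of R1, G1, R2, G2; circuit with R1, R2, G1', G2', G1, G2]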

38 Muller C-Element: IF a = b THEN y = a ELSE hold y [figure: C-element symbol with inputs a, b and output y; equivalent RS-latch with set and reset]
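
The rule on slide 38 translates directly into a small state-holding model. The Python class below is a behavioural sketch with an assumed initial output of 0; it says nothing about the transistor-level circuit.

```python
class MullerCElement:
    """Behavioural model: IF a = b THEN y = a ELSE hold y."""

    def __init__(self, initial=0):
        self.y = initial  # assumed reset value of the output

    def update(self, a, b):
        if a == b:
            self.y = a    # inputs agree: output follows them
        return self.y     # inputs differ: output keeps its old value


c = MullerCElement()
inputs = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print([c.update(a, b) for a, b in inputs])  # [0, 0, 1, 1, 0]
```

The output changes only after both inputs have changed, which is why the element is used to join handshake signals in the clock circuits that follow in the talk.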

39 Muller C-Element: Circuit [figure, from Alain Martin, Caltech]

40 Data-Driven Clocking. Principle: as soon as new data arrive => start clocking; determine number k of clock cycles required to process new data; stop clocking after k cycles, wait for next data. Properties: need to switch clock on and off => beware spurious clock pulses!; no metastability problem: data stable as soon as consumer clock starts; potential for power saving; useful for specific applications only (no pipe!)

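41 Data-Driven Clock: Circuit / 1 [figure: oscillator loop with delay element driving CLK out] CLK half period determined by the delay element

42 Data-Driven Clock: Circuit / 2 [figure: Muller C-element (C) with REQ, ACK and delay element driving CLK out; waveforms of REQ, ACK, CLK out] transition on REQ answered by transition on CLK out; min CLK half period determined by the delay element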

43 Pausible Clocking. Principle: producer requests consumer's clock to pause; data provided to input register during idle time; consumer's clock may resume free running ("pausible clock") or with one cycle only ("stoppable clock"). Properties: need to switch clock on and off => beware spurious clock pulses! => beware of clock tree delays!; producer controls consumer's clock (blocking!); applications must cope with paused clock

44 Pausible Clock: Circuit / 1 [figure: Muller C-element (C) with REQ, ACK, delay element and CLK out] inverter generates next REQ from ACK => self-oscillation

45 Pausible Clock: Circuit / 2 [figure: arbiter (Arb) in front of the Muller C-element (C), with REQ', ACK' and CLK out] external unit can safely stop CLK by activating REQ' … and gets ACK' as a response

46 Pausible Clock: Circuit / 3 [figure: arbiters (Arb) for REQ1/ACK1 … REQn/ACKn in front of the Muller C-element (C) driving CLK out] for more external sources, arbiters can be added and "anded" before the Muller C-element; the two inverters can be eliminated by using a Muller C-element with inverting output

47 Advantages of GALS: synchronous islands can be designed efficiently; modules operate independently; can use module-specific clock & timing; clocking is no single point of failure

48 Problems with GALS: operation of modules not (inherently) co-ordinated; synchrony for communication but not on system / algorithm level; communication has to cross clock boundaries: potential for metastability => performance penalty through synchronizers OR => module must handle irregular clocking

49 The DARTS Idea [figure: phase synchronisation, tick synchronisation, clock synchronisation] DARTS: Distributed Algorithms for Robust Tick Synchronization

50 The DARTS Approach. Concept: multiple synchronized tick generators. Method: distributed algorithm for fault-tolerant tick generation, implemented in (asynchronous) digital logic. Advantages: no crystal oscillator(s); no critical clock tree; clock is no single point of failure!; reasonable synchrony

51 The DARTS Principle: every function unit Fu i is augmented with a simple local clock unit (TG-Alg); TG-Algs communicate over a dedicated TG-Net to generate tick-synchronized local clock signals; up to f TG-Algs can be Byzantine faulty => need n ≥ 3f + 2 TG-Algs; formally proven synchronization properties [figure: Fu 1, Fu 2, Fu 3 on a data bus, clocked either by a clock tree (standard synchronous clocking) or by TG-Algs connected via the TG-Net (DARTS clocks)]
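
The resilience bound n ≥ 3f + 2 from slide 51 is easy to evaluate for concrete system sizes. The helper names in this Python sketch are illustrative; only the inequality itself comes from the talk.

```python
def min_tg_algs(f: int) -> int:
    """Smallest number of TG-Algs that tolerates f Byzantine faulty ones."""
    return 3 * f + 2


def max_byzantine_faults(n: int) -> int:
    """Largest f that satisfies n >= 3f + 2 for a system of n TG-Algs."""
    return (n - 2) // 3


for n in (5, 8, 11, 12):
    print(n, "TG-Algs tolerate", max_byzantine_faults(n), "faulty node(s)")
```

For the configurable range of 5 to 12 TG-Algs given for the prototype, this yields between one and three tolerated faulty nodes, which matches the figure of up to three Byzantine faulty nodes stated for DARTS.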

52 A Comparison. synchronous SoC: global synchrony (< 1 tick), single point of failure. DARTS: global synchrony (potentially ≥ 1 tick), no single point of failure. GALS: no single point of failure, NO (inherent) global synchrony [figure: Fu 1 clk and Fu 2 clk with tick(3) and tick(4)]

53 The Distributed Algorithm [Srikanth & Toueg, 87]
  Initially:
    send tick(0) to all; clock := 0;
  "Relay Rule":
    If received tick(m) from at least f+1 remote nodes and m > clock:
      send tick(clock+1), …, tick(m) to all [once]; clock := m;
  "Increment Rule":
    If received tick(m) from at least 2f+1 remote nodes and m >= clock:
      send tick(m+1) to all [once]; clock := m+1;
  [figure: TG-Alg 1 to TG-Alg 6 connected via the TG-Net]
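
The two rules of slide 53 can be exercised in a small event-driven simulation. The Python sketch below rests on several assumptions that are not in the talk: faulty nodes simply stay silent (the mildest form of Byzantine behaviour), message delays are drawn uniformly from [1, 2] time units, and tick generation stops at max_tick so that the run terminates. It models the software algorithm, not the asynchronous hardware implementation.

```python
import heapq
import random


def simulate_ticks(n, f, max_tick, seed=0):
    """Run the relay and increment rules on n nodes, f of them silent."""
    rng = random.Random(seed)
    correct = range(n - f)                  # remaining f nodes never send
    clock = {i: 0 for i in correct}
    sent = {i: set() for i in correct}      # enforces "[once]"
    seen = {i: {} for i in correct}         # seen[i][m]: senders of tick(m)
    events, seq, max_skew = [], 0, 0

    def send(now, src, m):
        nonlocal seq
        if m in sent[src]:
            return
        sent[src].add(m)
        for dst in correct:
            if dst != src:
                seq += 1
                arrival = now + rng.uniform(1.0, 2.0)
                heapq.heappush(events, (arrival, seq, dst, src, m))

    for i in correct:                       # Initially: send tick(0) to all
        send(0.0, i, 0)

    while events:
        now, _, dst, src, m = heapq.heappop(events)
        senders = seen[dst].setdefault(m, set())
        senders.add(src)
        # Relay rule: f+1 remote nodes are ahead, so catch up to m.
        if len(senders) >= f + 1 and m > clock[dst]:
            for k in range(clock[dst] + 1, m + 1):
                send(now, dst, k)
            clock[dst] = m
        # Increment rule: 2f+1 remote nodes reached m, so advance to m+1.
        if len(senders) >= 2 * f + 1 and clock[dst] <= m < max_tick:
            send(now, dst, m + 1)
            clock[dst] = m + 1
        skew = max(clock.values()) - min(clock.values())
        max_skew = max(max_skew, skew)
    return clock, max_skew


clocks, skew = simulate_ticks(n=5, f=1, max_tick=10)
print(clocks, "largest observed tick difference:", skew)
```

With n = 3f + 2, every correct node has exactly 2f + 1 correct remote peers, so the increment rule can always fire without any message from the faulty nodes. That is why all correct nodes in this sketch reach max_tick.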

54 Implementation Challenges (mapping the software-based algorithm of slide 53 to hardware): k-bit messages with unbounded k => replacement by zero-bit messages (k-bit msg vs. zero-bit tick); atomicity of actions => to be ensured by the architecture and delay constraints; threshold functions for fault tolerance => glitch-free asynchronous implementation

55 The DARTS Prototype: ASIC design in radhard 180 nm technology; 2 designs: flexible and fast; prototype board: 8 chips plus fixed & programmable interconnect

56 Proof of Concept [figure]

57 Frequency Stability (Warm-up) [figure]

58 Frequency Stability (detail) [figure]

59 DARTS – General Properties: fully asynchronous implementation; NO oscillators; tolerates up to three Byzantine faulty nodes (configurable number of TG-Algs; 5 to 12); adapts to operating conditions (asynchronous logic)

60 Still Room for Improvements: transient faults are permanently stored in the elastic pipelines; no on-the-fly integration of TG-Alg; relatively low clock speed; interfacing to traditional synchronous designs; scaling with number of faults is costly

61 Summary: Trends & Needs: ongoing miniaturization necessitates fault tolerance; co-ordination of activities is fundamental, thus tight synchrony is a desirable feature on all levels; SoCs are large modular designs on a single die

62 Summary: SoC Clocking. globally synchronous clock: + ideal synchrony, efficient in design & implementation; - isochrony unrealistic, single point of failure. DARTS clock: + best attainable global synchrony, adaptive timing, FT; - high implementation efforts, frequency not stable. GALS: + uses best of syn & asyn, indep. & module-specific clock; - no global synchrony, metastability issues. asynchronous design: + power-efficient, robust against faults & PVT; - high overheads, difficult to design, timing hard to predict

63 More information on DARTS: http://ti.tuwien.ac.at/ecs/research/projects/darts

