Clocking and Timing in Fault-Tolerant Systems-on-Chip
Andreas Steininger

Outline
○ The Clock as a Blessing
○ The Clock as a Curse
○ Alternative Synchronization Schemes
  - GALS
  - fully asynchronous
  - the DARTS approach
○ Conclusion

Contributors to this Work
The DARTS project team
○ TU Vienna: Gottfried Fuchs, Matthias Fuegger, Ulrich Schmid, Thomas Handl
○ RUAG Space: Gerald Kempf, Manfred Sust, Wolfgang Zangerl

The Need for Fault Tolerance
○ miniaturization is key to progress in VLSI
  => smaller structures
  => lower voltage swing
  => smaller critical charge
  => higher operating frequencies
○ … which result in higher susceptibility to faults (SET, EMI, …)
=> cannot avoid faults, need to tolerate them

The Role of Time
"The only reason for time is so that everything doesn't happen at once" (commonly attributed to Albert Einstein)

The Need for Clocking
○ activities need to be co-ordinated
  - on system level (braking of wheels, …)
  - on algorithmic level (consensus, …)
  - on communication level
  - on logic level (state machine switching, …)
○ co-ordination in the time domain (synchronization) is an efficient way to attain this
=> need a global notion of time (discrete "ticks")

The Quality of Synchronization
[Figure: local time (number of ticks) plotted over real time, with the precision π marked]

Typical Precision Values
○ on system level: µs … ms
○ on algorithm level: µs … ms
○ on communication level: ns … µs
○ on logic level: ps … ns

Synchronization Requirements
○ phase synchronisation (for "hardware clock" on logic level)
○ clock synchronisation (for distributed time base on algorithmic level)
○ 1 µs is excellent precision for a distributed clock
○ at 1 GHz this means a phase shift of 1000 clock periods (360,000°)
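The gap between the two requirements can be checked with a one-line calculation. The sketch below is illustrative only; the function name `phase_shift_degrees` is an assumption and does not come from the talk.

```python
def phase_shift_degrees(precision_s: float, clock_hz: float) -> float:
    """Phase shift (in degrees) that a time offset of precision_s
    represents for a clock running at clock_hz."""
    periods = precision_s * clock_hz   # offset expressed in clock periods
    return periods * 360.0             # one full period equals 360 degrees


# 1 µs of clock precision, seen from a 1 GHz hardware clock
shift = phase_shift_degrees(1e-6, 1e9)
periods = shift / 360.0
```

A precision that is excellent for a distributed time base is therefore about three orders of magnitude too coarse for phase synchronisation at logic level.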

Globally Synchronous Design
○ whole design is "isochronic" ("perfect" precision)
○ time conveyed by clock transitions
○ perfect co-ordination of all activities
○ very efficient design
  - can assume consistent states
  - high level of abstraction
○ very efficient implementation
  - single crystal oscillator
  - single control line (clock net)

"Isochronic" Regions?
○ speed of light (in medium) = 2 × 10^8 m/s = 20 cm/ns
[Figure: regions for 1 GHz, 4 GHz and 8 GHz shown against a 2 cm reference]
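To put the figure's frequencies into numbers, the following sketch computes how far a signal travels in one clock period and what share of the period is spent crossing a given distance. It uses the 20 cm/ns from the slide; the function names and the choice of 2 cm as the distance of interest are assumptions made for illustration.

```python
SPEED_CM_PER_NS = 20.0  # signal speed in the medium, as stated on the slide


def distance_per_period_cm(clock_ghz: float) -> float:
    """Distance a signal covers during one clock period."""
    period_ns = 1.0 / clock_ghz
    return SPEED_CM_PER_NS * period_ns


def flight_time_fraction(distance_cm: float, clock_ghz: float) -> float:
    """Share of one clock period needed to cross distance_cm."""
    flight_ns = distance_cm / SPEED_CM_PER_NS
    return flight_ns * clock_ghz


for ghz in (1.0, 4.0, 8.0):
    print(ghz, "GHz:",
          distance_per_period_cm(ghz), "cm per period,",
          flight_time_fraction(2.0, ghz), "of a period to cross 2 cm")
```

Under these assumptions, crossing 2 cm costs a tenth of the period at 1 GHz but most of the period at 8 GHz, which is why a whole die can no longer be treated as one isochronic region.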

The Variation Problem
○ Designer: system model, projected conditions
○ User: actual system (imperfections?), actual conditions (unknown?)
○ the difference is covered by worst-case safety margins
○ Timing completely fixed after design
○ No way to react to actual conditions & system ("PVT variations")

Fault-Tolerant Architectures
○ Duplication & Comparison
○ Triple-Modular Redundancy
[Figure: two FUs feeding a comparator (=?) with output ERR; three FUs feeding a voter with output Y]
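The two architectures can be modelled in a few lines. This is a behavioural sketch only, assuming the function units deliver plain comparable values; the names `duplicate_and_compare` and `tmr_vote` are hypothetical.

```python
from collections import Counter


def duplicate_and_compare(y1, y2):
    """Duplication & Comparison: return (output, err).
    err is True when the two replicas disagree; the fault is detected,
    but it is unknown which replica is wrong."""
    return y1, y1 != y2


def tmr_vote(y1, y2, y3):
    """Triple-Modular Redundancy: return the value that at least two of
    the three replicas agree on, masking a single faulty replica."""
    value, count = Counter([y1, y2, y3]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three replicas disagree")
    return value
```

For example, `tmr_vote(7, 7, 9)` yields 7, while `duplicate_and_compare(7, 9)` can only raise the error flag.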

Lock-Step Operation: single clock
○ single point of failure
○ good replica determinism
[Figure: three FUs and a voter with output Y; tick labels "3" and "4" on the replicas]

Lock-Step Operation: independent clocks
○ single fault tolerant
○ bad replica determinism
[Figure: three FUs and a voter with output Y; tick labels "3" and "4" on the replicas]

Fault-Tolerant HW-Clocking
[Figure: three FUs and a voter with output Y, with additional voters (v) in the clock paths]

The Charm of SoCs
○ billions of transistors fit on one die
  => structuring into (IP) modules: "System-on-Chip"
○ BUT: large clock distribution networks
  => "isochronic"??
○ FT clocking does not work with large skew
○ may need individual clocks for function modules
=> clock synchrony is neither attainable nor desirable

Co-ordination of Data Exchange
[Figure: SRC -> f(x) -> SNK]
○ When can SNK use its input? When it is valid and consistent.
○ When can SRC apply the next input? When SNK has consumed the previous one.

The Synchronous Approach
[Figure: SRC -> f(x) -> SNK]
○ co-ordination based on (global) time

Alternative: Asynchronous Design
[Figure: SRC -> f(x) -> SNK]
○ co-ordination based on handshaking
  - REQ: "Data word valid, you can use it"
  - ACK: "Data word consumed, send the next"
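The REQ/ACK exchange can be traced step by step. The slide does not say which handshake protocol is meant; the sketch below assumes a four-phase (return-to-zero) protocol and runs both parties sequentially in one function, so it shows the ordering of events rather than real concurrency. The name `four_phase_transfer` is hypothetical.

```python
def four_phase_transfer(words):
    """Move each word from SRC to SNK using REQ/ACK.
    Returns (received_words, event_trace)."""
    req = ack = False
    data = None
    received, trace = [], []
    for word in words:
        data, req = word, True        # SRC: data valid, then raise REQ
        trace.append(("REQ+", word))
        received.append(data)         # SNK: REQ seen, consume the word
        ack = True                    # SNK: raise ACK ("send the next")
        trace.append(("ACK+", word))
        req = False                   # SRC: ACK seen, withdraw REQ
        trace.append(("REQ-", word))
        ack = False                   # SNK: REQ low, withdraw ACK
        trace.append(("ACK-", word))
    return received, trace
```

Each word costs four signal transitions here, and neither side proceeds before the other has responded, which is the closed-loop behaviour the asynchronous approach relies on.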

Async. Design – Advantages
○ closed-loop control makes timing much more robust and adaptive to PVT variations
○ no need for worst-case timing
○ local handshakes replace global clock
○ activity only when needed
○ beneficial for EMI
○ tends to stop operation in case of fault

Async. Design – Disadvantages
○ need to handle race between REQ and data
  [Figure: SRC -> f(x) -> SNK, with REQ: "Data word valid, you can use it"]
  - Solution 1: "Bundled Data"
  - Solution 2: "Delay Insensitive" (coding), with completion detection
○ significant HW overhead (coding, delay elements)
○ "adaptive" timing not as predictable
○ more difficult to design
○ classical fault-tolerance schemes not applicable
○ tends to stop operation in case of fault
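One common delay-insensitive code is dual-rail encoding, where validity is carried by the data wires themselves and a completion detector replaces the separate REQ. The slides only say "coding", so dual-rail is an assumption here, as are all names in the sketch.

```python
SPACER = (0, 0)  # both rails low: no data present on this bit


def encode(bits):
    """Encode each bit as (true_rail, false_rail): 1 -> (1, 0), 0 -> (0, 1)."""
    return [(1, 0) if b else (0, 1) for b in bits]


def is_complete(word):
    """Completion detection: every bit has exactly one rail high."""
    return all(t + f == 1 for t, f in word)


def decode(word):
    """Return the bit values of a complete dual-rail word."""
    if not is_complete(word):
        raise ValueError("word is not (yet) a valid code word")
    return [t for t, _ in word]
```

A word in which some bits are still at `SPACER` is simply reported as incomplete, so the receiver cannot use data before all of it has arrived. The price is two wires per bit plus the detector, which is the hardware overhead listed above.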

Best of Both Worlds
GALS: Globally Asynchronous Locally Synchronous
○ retain efficiency of synchronous design wherever possible: "intra-module"
○ use asynchronous principle where clock distribution is too cumbersome: "inter-module"
○ first mentioned in the PhD thesis by Chapiro (Stanford, 1984)

A GALS Example
○ CPU: 2 GHz
○ PCI-IF: 533 MHz
○ DSP: 2.7 GHz
○ USB-IF: 24 MHz

Communication in GALS
○ Shared Memory
  - producer writes to memory, consumer reads from there
  - pro: control flow stays independent
  - shared single-port memory
  - true dual-port memory
○ Direct Messages (data words)
  - move data word from producer's output register to consumer's input register
  - non-buffered / buffered (FIFO queues)
  - clock fixed, data-driven or pausible

Shared Memory
○ decoupling of clock domains by memory acting as a third party
  => high area overhead
  => unusual
○ for single-port memory, arbitration is required
  - arbitration problem (unbounded delay …)
  - one side may block the other at the arbiter
○ for multiport memory, problems are confined to access to the same cell
  - busy flag may become metastable
  - blocking still possible for one specific address

Shared Memory
[Figure: CPU (2 GHz) and DSP (2.7 GHz) accessing address 0xff14 in a shared memory through arbitration logic]
○ perfect decoupling of data path
○ potential metastability problems at arbitration logic
○ potential blocking through arbitration

Direct Messages
○ clock domain boundary is between producer's output register and consumer's input register
○ in general a synchronizer is needed at consumer's input
  - definitely for conventional (fixed) clock
  - can be avoided by data-driven / pausible clocking
○ control flows of producer and consumer are strongly coupled: a party that does not service its input/output register blocks the other party
○ buffers/queues/FIFOs can mitigate, but not avoid this problem (full/empty)
  - they compensate variations in the data rate on both sides, but not different average data rates
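The last point can be illustrated with a small discrete-time model of FIFO occupancy. The model is an assumption made for illustration: time advances in steps, and the producer and consumer rates are given as repeating patterns of words per step. The name `fifo_fill_trace` is hypothetical.

```python
def fifo_fill_trace(produce, consume, steps):
    """Return the FIFO occupancy after each step.
    produce / consume are repeating per-step patterns (words per step)."""
    fill = 0
    trace = []
    for t in range(steps):
        fill += produce[t % len(produce)]              # producer writes
        fill -= min(fill, consume[t % len(consume)])   # consumer reads what is there
        trace.append(fill)
    return trace


# Bursty producer, steady consumer, same average rate: occupancy stays bounded.
bounded = fifo_fill_trace([2, 0], [1, 1], steps=100)

# Producer faster on average: occupancy grows without bound.
growing = fifo_fill_trace([2, 1], [1, 1], steps=100)
```

In the first case a FIFO of modest depth absorbs the bursts; in the second, any finite FIFO eventually fills and the producer is blocked again.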

Direct Messages
○ data moving over clock domain boundary
○ metastability problems => need to insert handshake
○ … with synchronizers (S) and (optional) buffers
[Figure: CPU (2 GHz) and DSP (2.7 GHz) exchanging data word 0xff14 through synchronizers (S)]

Arbiter: Principle
purpose:
○ manage concurrent requests to a shared resource
method:
○ handle pairs of request_in / grant_out
○ requests may arrive in any order
○ arbiter must activate only one grant_out at a time (respond to the first requester)
Mutual Exclusion (MUTEX) problem:
○ resolve concurrent requests => metastability problem
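The request/grant behaviour can be sketched as a small software model. This is purely behavioural: requests reach the model one after another, so the truly simultaneous case that causes metastability in hardware cannot occur here. The class and method names are assumptions.

```python
from collections import deque


class Arbiter:
    """At most one client holds the grant; later requesters wait in order."""

    def __init__(self):
        self.owner = None
        self.waiting = deque()

    def request(self, client):
        """request_in: grant immediately if free, otherwise queue the client."""
        if self.owner is None:
            self.owner = client
        elif client != self.owner and client not in self.waiting:
            self.waiting.append(client)

    def release(self, client):
        """The current owner withdraws its request; the next waiter is granted."""
        if client != self.owner:
            raise ValueError("only the current owner can release")
        self.owner = self.waiting.popleft() if self.waiting else None

    def grant(self, client):
        """grant_out for this client."""
        return self.owner == client
```

With two clients, `request(1)` followed by `request(2)` grants client 1 only; client 2 receives the grant once client 1 releases.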

Arbiter: Circuit
○ MUTEX element: SR latch (inputs R1, R2; internal outputs G1', G2'; outputs G1, G2)
○ "metastability filter": e.g., high-threshold inverter
[Figure: circuit and plot of V_out,FF over t with V_th,inv and V_meta marked; from D. J. Kinniment, "Synchronization and Arbitration in Digital Systems", Wiley]

Arbiter: Operation
[Figure: timing diagram of R1, G1, R2, G2 for the arbiter circuit]

Muller C-Element
○ IF a = b THEN y = a ELSE hold y
[Figure: C-element symbols with inputs a, b and output y; RS latch with set and reset inputs]
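The rule on this slide translates directly into a small state-holding model. The class name `CElement` and the initial output value of 0 are assumptions for this sketch.

```python
class CElement:
    """Muller C-element: output follows the inputs when they agree,
    and holds its previous value when they differ."""

    def __init__(self, initial=0):
        self.y = initial

    def update(self, a, b):
        if a == b:        # IF a = b THEN y = a
            self.y = a
        return self.y     # ELSE hold y


c = CElement()
outputs = [c.update(a, b) for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]
```

The output rises only after both inputs have risen and falls only after both have fallen, which makes the element a natural "wait for both" primitive in handshake circuits.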

Muller C-Element: Circuit
[Figure: circuit; source: Alain Martin, Caltech]

Data-Driven Clocking
Principle:
○ as soon as new data arrive => start clocking
○ determine number k of clock cycles required to process new data
○ stop clocking after k cycles, wait for next data
Properties:
○ need to switch clock on and off => beware spurious clock pulses!
○ no metastability problem: data stable as soon as consumer clock starts
○ potential for power saving
○ useful for specific applications only (no pipelining!)

Data-Driven Clock: Circuit / 1
[Figure: oscillator loop with delay element producing CLK out]
○ CLK half period determined by the delay element

Data-Driven Clock: Circuit / 2
[Figure: Muller C-element (C) with delay element; signals REQ, ACK and CLK out, with timing diagram]
○ transition on REQ answered by transition on CLK out
○ min CLK half period determined by the delay element

Pausible Clocking
Principle:
○ producer requests consumer's clock to pause
○ data provided to input register during idle time
○ consumer's clock may resume
  - free running ("pausible clock")
  - with one cycle only ("stoppable clock")
Properties:
○ need to switch clock on and off
  => beware spurious clock pulses!
  => beware of clock tree delays!
○ producer controls consumer's clock (blocking!)
○ applications must cope with paused clock

Pausible Clock: Circuit / 1
[Figure: Muller C-element (C) with delay element; signals REQ, ACK and CLK out, with timing diagram]
○ inverter generates next REQ from ACK
○ self-oscillation

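Pausible Clock: Circuit / 2
[Figure: oscillator loop with Muller C-element (C), delay element and arbiter (Arb); signals REQ', ACK' and CLK out, with timing diagram]
○ external unit can safely stop CLK by activating REQ'
○ … and gets ACK' as a response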

Pausible Clock: Circuit / 3
[Figure: oscillator loop with Muller C-element (C), delay element and one arbiter (Arb) per request pair REQ1/ACK1 … REQn/ACKn; output CLK out]
○ for more external sources, arbiters can be added and "anded" before the Muller C-element
○ the two inverters can be eliminated by using a Muller C-element with inverting output

Advantages of GALS
○ synchronous islands can be designed efficiently
○ modules operate independently
○ can use module-specific clock & timing
○ clocking is no single point of failure

Problems with GALS
○ operation of modules not (inherently) co-ordinated
  - synchrony for communication but not on system / algorithm level
○ communication has to cross clock boundaries
  - potential for metastability
  => performance penalty through synchronizers
  OR
  => module must handle irregular clocking

The DARTS Idea
○ phase synchronisation, tick synchronisation, clock synchronisation
○ DARTS: Distributed Algorithms for Robust Tick Synchronization

The DARTS Approach
○ Concept: multiple synchronized tick generators
○ Method: distributed algorithm for fault-tolerant tick generation, implemented in (asynchronous) digital logic
○ Advantages
  - no crystal oscillator(s)
  - no critical clock tree
  - clock is no single point of failure!
  - reasonable synchrony

The DARTS Principle
○ every function unit Fu_i is augmented with a simple local clock unit (TG-Alg)
○ TG-Algs communicate over a dedicated TG-Net to generate tick-synchronized local clock signals
○ up to f TG-Algs can be Byzantine faulty => need n ≥ 3f + 2 TG-Algs
○ formally proven synchronization properties
[Figure: Fu 1, Fu 2, Fu 3 on a data bus; standard synchronous clocking through a clock tree compared with DARTS clocks from TG-Algs connected by the TG-Net]
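The resilience bound n ≥ 3f + 2 can be turned around to see how many faults a given number of TG-Algs tolerates. The helper names below are hypothetical.

```python
def min_nodes(f: int) -> int:
    """Smallest number of TG-Algs that tolerates f Byzantine faulty ones."""
    return 3 * f + 2


def max_faults(n: int) -> int:
    """Largest f satisfying n >= 3f + 2 (0 if n is too small for any fault)."""
    return max((n - 2) // 3, 0)


table = {n: max_faults(n) for n in range(5, 13)}
```

For 5 to 12 TG-Algs this gives between one and three tolerated faults, in line with the prototype properties listed later in the talk.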

A Comparison
○ synchronous SoC: global synchrony (< 1 tick), but single point of failure
○ GALS: no single point of failure, but NO (inherent) global synchrony
○ DARTS: global synchrony (potentially ≥ 1 tick) and no single point of failure
[Figure: local clocks Fu 1 clk and Fu 2 clk with tick(3) and tick(4) marked]

The Distributed Algorithm [Srikanth & Toueg, 87]
Initially:
    send tick(0) to all; clock := 0;
"Relay Rule":
    If received tick(m) from at least f+1 remote nodes and m > clock:
        send tick(clock+1), …, tick(m) to all [once]; clock := m;
"Increment Rule":
    If received tick(m) from at least 2f+1 remote nodes and m >= clock:
        send tick(m+1) to all [once]; clock := m+1;
[Figure: TG-Alg 1 … TG-Alg 6 connected by the TG-Net]
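The two rules can be tried out in a small message-passing simulation. This is a software sketch under stated assumptions, not the DARTS hardware: faulty nodes are modelled as silent (they never send), messages are delivered one at a time in a seeded random order, and ticks above `max_tick` are not broadcast so that the run terminates. All class, function and parameter names are hypothetical.

```python
import random
from collections import defaultdict


class Node:
    def __init__(self, f):
        self.f = f
        self.clock = 0
        self.sent = set()               # ticks already sent ("[once]")
        self.seen = defaultdict(set)    # tick m -> remote senders of tick(m)

    def _emit(self, ticks):
        new = [m for m in ticks if m not in self.sent]
        self.sent.update(new)
        return new

    def start(self):
        return self._emit([0])          # Initially: send tick(0); clock := 0

    def receive(self, sender, m):
        """Record tick(m) from sender, apply both rules until neither fires,
        and return the list of ticks to broadcast."""
        self.seen[m].add(sender)
        out, progress = [], True
        while progress:
            progress = False
            for k, senders in list(self.seen.items()):
                if len(senders) >= self.f + 1 and k > self.clock:       # relay
                    out += self._emit(range(self.clock + 1, k + 1))
                    self.clock = k
                    progress = True
                if len(senders) >= 2 * self.f + 1 and k >= self.clock:  # increment
                    out += self._emit([k + 1])
                    self.clock = k + 1
                    progress = True
        return out


def simulate(n, f, max_tick, seed=0, faulty=()):
    """Run n nodes (those in `faulty` stay silent); return (clocks, max_skew)."""
    rng = random.Random(seed)
    nodes = {i: Node(f) for i in range(n) if i not in faulty}
    queue = []

    def broadcast(sender, ticks):
        for m in ticks:
            if m <= max_tick:           # cap so the simulation terminates
                queue.extend((dest, sender, m) for dest in nodes if dest != sender)

    for i, node in nodes.items():
        broadcast(i, node.start())
    max_skew = 0
    while queue:
        dest, sender, m = queue.pop(rng.randrange(len(queue)))
        broadcast(dest, nodes[dest].receive(sender, m))
        clocks = [node.clock for node in nodes.values()]
        max_skew = max(max_skew, max(clocks) - min(clocks))
    return {i: node.clock for i, node in nodes.items()}, max_skew
```

Calling `simulate(n=5, f=1, max_tick=10, faulty={4})` runs four correct nodes next to one silent node; once all messages are delivered, every correct node has advanced past `max_tick`. The returned skew is only what was observed under this particular random schedule; the sketch makes no claim about the precision bound proven for DARTS, which depends on delay assumptions that this model does not capture.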

Implementation Challenges
Annotations on the distributed algorithm (relay rule and increment rule as stated above):
○ software-based algorithm
○ k-bit messages, k unbounded: replacement by zero-bit messages (k-bit msg vs. zero-bit tick)
○ atomicity of actions: to be ensured by the architecture and delay constraints
○ threshold functions for fault tolerance
○ glitch-free asynchronous implementation

The DARTS Prototype
○ ASIC design: radhard 180 nm technology
○ 2 designs: flexible, fast
○ prototype board: 8 chips plus fixed & programmable interconnect

Proof of Concept

Frequency Stability (Warm-up)

Frequency Stability (detail)

DARTS – General Properties
○ fully asynchronous implementation
○ NO oscillators
○ tolerates up to three Byzantine faulty nodes (configurable number of TG-Algs; 5 to 12)
○ adapts to operating conditions (asynchronous logic)

Still Room for Improvements
○ transient faults are permanently stored in the elastic pipelines
○ no on-the-fly integration of TG-Alg
○ relatively low clock speed
○ interfacing to traditional synchronous designs
○ scaling with number of faults is costly

Summary: Trends & Needs
○ ongoing miniaturization necessitates fault tolerance
○ co-ordination of activities is fundamental, thus tight synchrony is a desirable feature on all levels
○ SoCs are large modular designs on a single die

Summary: SoC Clocking
○ globally synchronous clock
  + ideal synchrony, efficient in design & implementation
  - isochrony unrealistic, single point of failure
○ DARTS clock
  + best attainable global synchrony, adaptive timing, fault tolerant
  - high implementation effort, frequency not stable
○ GALS
  + uses best of synchronous & asynchronous design, independent & module-specific clock
  - no global synchrony, metastability issues
○ asynchronous design
  + power-efficient, robust against faults & PVT variations
  - high overheads, difficult to design, timing hard to predict

More information on DARTS