Asynchronous Logic: Results and Prospects Alain J. Martin California Institute of Technology NTU, March 2007.

1 Asynchronous Logic: Results and Prospects Alain J. Martin California Institute of Technology NTU, March 2007

2 What Is Asynchronous Logic?

3 Sequencing and Computation
• “An algorithm is a sequence of computational steps.” (CL&R)
• How do we implement sequencing in a continuous physical medium?
• Traditional answer: use of a global time reference (“the clock”)
[Figure: blocks A, B, C, D, E with a clock signal CLK]

4 Can we compute without a clock?
• Yes! “Asynchronous” or “clockless” logic
• Also “self-timed” or “speed-independent”
• David Muller, “Theory of Asynchronous Circuits” (1959)
• ILLIAC (1959) and ILLIAC II (1962): partially asynchronous
• PDP6 (1960): asynchronous

5 Can we compute without a clock and without delay assumptions?
• Delay-insensitivity (Molnar, 198x…)
• Almost:
• “The class of delay-insensitive circuits is limited (not Turing-complete).” (Martin, 1990)
• Quasi-delay-insensitive (QDI) logic:
  – Delay-insensitive
  – Isochronic forks (only delay assumption)
• QDI is Turing-complete (Martin & Manohar, 1996)

6 What is an Asynchronous Circuit?
• Asynchronous system: collection of modules communicating by handshake protocols
• Distributed system on a chip (communicating by message exchange)
[Figure: modules A, B, C, D, E connected by channels with ack signals]

7 Caltech QDI Approach
• Quasi-delay-insensitive (QDI) design
• Minimal delay assumptions (only isochronic forks)
• Stricter logic synthesis (DI codes for datapath, completion trees), but…
• Robust and efficient (no evidence that delay assumptions improve efficiency)

8 Why Asynchronous and QDI Logic?

9 Scientific Reasons
• Understanding the role of time in computation
• Limit of delay insensitivity
• Implementing a digital computation directly in a continuous physical medium
• Design by program transformation (real correctness-by-construction approach)
• “VLSI design as programming” paradigm

10 Engineering Reasons
• Better match for high-level synthesis
  – Can separate correctness from performance issues
• Modularity and better use of concurrency
• Large system design (SoC): only local communication
• Efficiency
  – Average-case instead of worst-case behavior
  – Less pressure for global optimization (“timing closure”)
• Robustness and reliability
  – Robust to variations in fabrication technology, temperature, voltage, and noise; SEU-tolerant
• Energy efficiency

11 Energy Advantages of Async
• No clock
  – Up to 50% of clock power recovered
• Automatic shut-off of idle parts
  – Perfect clock gating
• No glitches (spurious transitions)
  – Up to 50% of power in combinational circuits
• Automatic adaptation to parameter variations
  – Voltage scaling: perfect exchange of delay against energy through voltage scaling
• Flexibility of asynchronous interfaces:
  – Better use of concurrency

12 Reactive Use in Embedded Systems
• Archetype of a reactive system
• Average execution time may be much shorter than maximal execution time
• Sleep sequence without race condition
  – Modeled after wait/signal with condition variables
• Instant wake-up from deep sleep

13 Robustness to PVT Variations
• Increase in physical parameter variations (PVT) is becoming a huge problem…
• Even worse in future technologies (nano CMOS or others)
• Variations of physical parameters all affect timing
• Increased timing variations reduce robustness and/or performance
• Single time reference (clock) may become unavailable or too expensive in future technologies and large systems (SoC)

14 Robustness to Voltage and Temperature Variations

15 Single-event Upset and Soft-error Tolerance of QDI circuits
• Soft errors caused by alpha particles, cosmic rays, and other radiation sources are becoming increasingly problematic, even at ground level
• QDI circuits can absorb most “dose-effects”
• Single-event upsets that cause a soft error (bit flip) can be corrected efficiently in QDI circuits
• Error-correction scheme specific to QDI
• Entire async microcontroller SEU-tolerant

16 Detection and Correction of SE in QDI circuits
• Single-error detection: duplicate and compare
• Correction:
  – Prevent propagation of detected SE
  – Stability of guards corrects automatically
  – “Detection is correction”
• Simplest, most expensive coding, but simplest detection mechanism
• Entire microcontroller SEU-tolerant

17 Disadvantages of Async
• Size overhead (more transistors)
• Poorly understood and rarely taught
• No industrial CAD tools (yet)
• No well-developed testing procedure (yet)
• No easy transition path for large established companies…

18 Experimental Evidence

19 Asynchronous Caltech
• World-first asynchronous microprocessor (1988)
• MiniMIPS (1998)
• Lutonium 8051 microcontroller (2005)
• Lattice-structure filter (1994)

20 First Asynchronous Microprocessor (Caltech, 1988)
• Performance:
  – 5 MIPS, 2V
  – 18 MIPS, 5V
  – 26 MIPS, 10V
• 16-bit RISC, 2-micron CMOS
• Formal synthesis:
  – Initial sequential description was a single page of CHP code
  – 5 months from start of project to tape-out (small group)
  – Fully functional on first silicon
• Potato-chip experiment
  – Runs on a potato as power supply!
  – 0.75V, 0.9V

21 Asynchronous MIPS R3000 Microprocessor
• Standard 32-bit RISC ISA
• Single instruction issue, one branch delay slot
• Precise exceptions
• 2 on-chip caches: 4kB Icache and 4kB Dcache
• First prototype (1998):
  – No TLB
  – 2M transistors
  – First asynchronous processor competitive with large synchronous designs

22 MiniMIPS Low-Voltage Operation
• Functional from 0.5V Vdd up
• Functional at 0.4V with some transistor resizing

23 Asynchronous MIPS: Practical Results
• HP’s 0.6-micron CMOS
  – Expected: … °C
  – First prototype: … °C
  – Voltage range: 1V (… W) to 8V
• Functional on first silicon despite
  – Inconsistencies in HP’s process parameters (e.g. higher V_t’s)
  – Long polysilicon wire overlooked in the critical fetch loop
  – (Testament to the robustness of asynchronous design style!)
• Roughly 4x faster than commercial synchronous MIPS ported to same technology
  – Note: no particular effort made towards designing for low power.

24 Lutonium-18: QDI 8051 Microcontroller
• TSMC SCN018 through MOSIS
  – 0.18 µm CMOS
  – 1.8V nominal
  – |V_t| = 0.4V to 0.5V
• Expected area: 5 mm² (including 8kB SRAM)
• Performance from low-level simulation (conservative!)
  Vdd     Throughput   Power      Energy/inst   Efficiency
  1.8 V   200 MIPS     100.0 mW   500 pJ/inst    1800 MIPS/W
  1.1 V   100 MIPS      20.7 mW   207 pJ/inst    4830 MIPS/W
  0.9 V    66 MIPS       9.2 mW   139 pJ/inst    7200 MIPS/W
  0.8 V    48 MIPS       4.4 mW    92 pJ/inst   10900 MIPS/W
  0.5 V     4 MIPS       170 µW    43 pJ/inst   23000 MIPS/W
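The energy-per-instruction figures on this slide follow from dividing power by throughput. A minimal Python sketch of that arithmetic, using only the (voltage, MIPS, power) values listed on the slide; the function and variable names are illustrative and not part of the original deck:

```python
def energy_per_instruction_pj(power_w: float, mips: float) -> float:
    """Energy per instruction in picojoules: power / (instructions per second)."""
    instructions_per_second = mips * 1e6
    return power_w / instructions_per_second * 1e12


# (supply voltage in V, throughput in MIPS, power in W), as listed on the slide
LUTONIUM_POINTS = [
    (1.8, 200, 100.0e-3),
    (1.1, 100, 20.7e-3),
    (0.9, 66, 9.2e-3),
    (0.8, 48, 4.4e-3),
    (0.5, 4, 170e-6),
]

for vdd, mips, power in LUTONIUM_POINTS:
    print(f"{vdd} V: {energy_per_instruction_pj(power, mips):.1f} pJ/inst")
```

The computed values can be compared against the pJ/inst column; small differences come from rounding on the slide.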

25 Energy Efficiency Metric: Et²
• E = C*V², t = k/V
• E*t² independent of V
• Estimate of energy efficiency
• Comparison of designs
• “Algorithmic of energy”
• See Chapter 15 in “Power Aware Computing”, Graybill & Melhem, eds., Kluwer
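The voltage independence claimed on this slide can be checked in one line. The derivation below assumes the slide's first-order model, with C and k treated as constants that do not depend on V (an idealization that is expected to weaken near the threshold voltage):

```latex
\[
E\,t^{2} \;=\; \left(C V^{2}\right)\left(\frac{k}{V}\right)^{2} \;=\; C\,k^{2}
\]
```

Within this model, a design with a smaller product C k² is better at every supply voltage, which is why Et² can be used to compare designs operated at different voltages.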

26 Voltage Scaling Advantage: Comparison to Intel XScale

27 Energy Breakdown and Comparisons
Microprocessor (results):
• MIPS energy: async-0.6µ 33nJ, sync-0.6µ 70nJ
• MIPS cycle time: async-0.6µ 6ns, sync-0.6µ 21ns
Microcontroller (estimation):
• 8051 energy per instruction (bars labeled sync-0.5µ, async-0.5µ, async…, async…): 10.00nJ (1X), 1.67nJ (6X), 0.56nJ (18X), … nJ (72X)
• 8051 cycle time (bars labeled sync-0.5µ, async-0.5µ, async…, async-0.18): 20ns (1X), 10ns (2X), 5ns (4X), … ns (2X)
• More than 100X Et² improvement over any other 8051
[Figure: energy breakdown across icache, decode, write back, regfile (bypass), fetch, and exec units (adder, shifter, fblock, mem, mult/div)]

28 Design Methodology

29 Handshakes & Dual-Rail Encoding
• Four-phase handshake
• Dual-rail encoding:
  – 3 wires (2 data, 1 ack) for one bit of information
  – Other DI codes are used: 1-of-N
• BUFFER: *[ L?x; R!x ]
[Figure: DATA and ACK waveforms; buffer with left port L0, L1, La and right port R0, R1, Ra. Code table with columns C0, C1, Data; surviving entries: 00 = hasn’t arrived, invalid]
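The dual-rail code and the four-phase handshake on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the circuit itself: the rail assignment (rail 0 high encodes 0, rail 1 high encodes 1) is a common convention assumed here, and all function names are hypothetical.

```python
from typing import List, Optional, Tuple

NEUTRAL = (0, 0)  # both rails low: data has not arrived yet


def encode(bit: int) -> Tuple[int, int]:
    """Return (rail0, rail1) for one bit; exactly one rail is high."""
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    return (1, 0) if bit == 0 else (0, 1)


def decode(rails: Tuple[int, int]) -> Optional[int]:
    """Return the bit for a valid codeword, or None for the neutral state."""
    if rails == NEUTRAL:
        return None
    if rails == (1, 1):
        raise ValueError("both rails high: invalid codeword")
    return 0 if rails == (1, 0) else 1


def four_phase_transfer(bit: int) -> List[Tuple[int, int, int]]:
    """List the (rail0, rail1, ack) states of one four-phase transfer."""
    r0, r1 = encode(bit)
    return [
        (0, 0, 0),    # idle: rails neutral, ack low
        (r0, r1, 0),  # sender raises one data rail
        (r0, r1, 1),  # receiver sees valid data and raises ack
        (0, 0, 1),    # sender returns the rails to neutral
        (0, 0, 0),    # receiver lowers ack; ready for the next transfer
    ]
```

Because validity is carried by the data rails themselves, the receiver needs no timing assumption to know when a bit has arrived; it only waits for one rail to go high.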

30 A QDI pipeline stage
*[ L?x; R!f(x) ]
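The CHP process *[ L?x; R!f(x) ] reads as "forever: receive x on L, then send f(x) on R". A rough Python sketch of that behavior follows, using threads and one-place queues as stand-ins for CHP channels. This is only an approximation: CHP communication is a synchronization of both parties, while `queue.Queue(maxsize=1)` is a one-place buffer. The `DONE` sentinel and the function names are conventions of this sketch, not part of CHP.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the input stream (sketch-only)


def stage(f, left, right):
    """Model of *[ L?x; R!f(x) ]: receive on left, send f(x) on right."""
    while True:
        x = left.get()       # L?x
        if x is DONE:
            right.put(DONE)  # pass the end marker downstream and stop
            return
        right.put(f(x))      # R!f(x)


def run_pipeline(functions, inputs):
    """Chain one stage per function and push the inputs through the chain."""
    channels = [queue.Queue(maxsize=1) for _ in range(len(functions) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, channels[i], channels[i + 1]))
        for i, f in enumerate(functions)
    ]

    def feed():
        for x in inputs:
            channels[0].put(x)
        channels[0].put(DONE)

    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()

    outputs = []
    while True:
        y = channels[-1].get()
        if y is DONE:
            break
        outputs.append(y)
    for t in threads:
        t.join()
    return outputs


if __name__ == "__main__":
    print(run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))
```

Each stage proceeds as soon as its own input is available and its output channel is free, so stages work on different data items concurrently without any global clock.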

31 QDI PIPELINE vs Bundled Data
• Dual-rail or 1-of-n data encoding
• Completion tree
• Criticism: high overhead (2*N + 1 wires and completion tree)
• Alternative: bundled data
• N + 1 wires, no completion tree
• Delay line for indicating completion, spurious transitions
• Big controversy!

32 Fine-grain Pipeline (PCHB)
[Figure: PCHB stage with input L, output R, acknowledges La and Ra, validity signals Lv and Rv, enable en, function block f, and validity/completion logic]

33 FINE-GRAIN PIPELINE
• No need for separate register
• Very high throughput and low forward latency
• Excellent Et² performance
• Entirely QDI
• Used in MiniMIPS and Lutonium
• Area overhead significant

34 Lower-Level Synthesis: HSE
CHP Program
  *[ L?x; R!x ]
Handshaking Expansion
  *[ [ ¬Ra ∧ L0 → R0↑ [] ¬Ra ∧ L1 → R1↑ ]; La↑; [ Ra → R0↓, R1↓ ]; [ ¬L0 ∧ ¬L1 → La↓ ] ]
  [ Ld ]; La↑; [ ¬Ld ]; La↓
  [ ¬Ra ]; Rd↑; [ Ra ]; Rd↓

35 Lower-Level Synthesis: PRS
CHP Program
  *[ L?x; R!x ]
Handshaking Expansion
  *[ [ ¬Ra ∧ L0 → R0↑ [] ¬Ra ∧ L1 → R1↑ ]; La↑; [ Ra → R0↓, R1↓ ]; [ ¬L0 ∧ ¬L1 → La↓ ] ]
Production Rule Set
  L0 ∨ L1 → Lv↑
  ¬La ∧ ¬Ra ∧ L0 → R0↑
  ¬La ∧ ¬Ra ∧ L1 → R1↑
  R0 ∨ R1 → Rv↑
  Lv ∧ Rv → La↑
  ¬L0 ∧ ¬L1 → Lv↓
  Ra ∧ La → R0↓
  Ra ∧ La → R1↓
  ¬R0 ∧ ¬R1 → Rv↓
  ¬Lv ∧ ¬Rv → La↓
To PRS for CMOS …

36 Lower-Level Synthesis: PRS
Production Rule Set
  L0 ∨ L1 → Lv↑
  ¬La ∧ ¬Ra ∧ L0 → R0↑
  ¬La ∧ ¬Ra ∧ L1 → R1↑
  R0 ∨ R1 → Rv↑
  Lv ∧ Rv → La↑
  ¬L0 ∧ ¬L1 → Lv↓
  Ra ∧ La → R0↓
  Ra ∧ La → R1↓
  ¬R0 ∧ ¬R1 → Rv↓
  ¬Lv ∧ ¬Rv → La↓
To PRS for CMOS …
• Each production rule has the form: guard expr → node↑ or guard expr → node↓
• These can be evaluated as: if (guard expr is true) node = Vdd, or if (guard expr is true) node = GND
• A set of production rules must be stable and non-interfering (for hazard-free circuits)
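The "if guard then node = Vdd / GND" reading of a production rule, and the non-interference requirement, can be illustrated with a small Python sketch. The example uses the two production rules of a C-element rather than the buffer on the slide, to keep it short. The rule representation and function names are assumptions of this sketch, and the check enumerates all states rather than only the reachable ones, so it is stricter than the actual requirement.

```python
from itertools import product

# A production rule is (guard, node, value): value 1 means "node↑" (drive to
# Vdd) and value 0 means "node↓" (drive to GND). Guards read a state dict.
C_ELEMENT_RULES = [
    (lambda s: s["a"] and s["b"], "c", 1),          # a ∧ b → c↑
    (lambda s: not s["a"] and not s["b"], "c", 0),  # ¬a ∧ ¬b → c↓
]


def is_non_interfering(rules, variables):
    """True if no state enables both a pull-up and a pull-down of one node."""
    for values in product([0, 1], repeat=len(variables)):
        state = dict(zip(variables, values))
        driven = {}
        for guard, node, value in rules:
            if guard(state) and driven.setdefault(node, value) != value:
                return False
    return True


def step(rules, state):
    """Fire the first enabled rule that changes the state; report if one fired."""
    for guard, node, value in rules:
        if guard(state) and state[node] != value:
            state[node] = value
            return True
    return False


if __name__ == "__main__":
    print(is_non_interfering(C_ELEMENT_RULES, ["a", "b", "c"]))
    state = {"a": 1, "b": 1, "c": 0}
    step(C_ELEMENT_RULES, state)
    print(state)
```

When neither guard holds (the inputs disagree), no rule fires and the node keeps its previous value, which is what makes the C-element a state-holding operator.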

37 Asynchronous Architectures
• New asynchronous solutions for pipelined microprocessors
• Execution units are in parallel, allowing concurrent and out-of-order execution of instructions

38 CAD Tools
• Complete suite of tools: synthesis, simulation, verification, optimization, layout
• Designer-assisted compilation
• Tools are modular and customizable
• Main representations: CHP, PRS, Cast

39 Design Flow
[Figure: design flow chart. Logical representations: sequential program, concurrent system, PRS, sized PRS. Physical representations: collection of cells, placed cells, routed cells, physical layout. Tools: DDD, SDD, Sizer, Placer, Router, PL2, with physical resizing using wire information. Simulators: chpsim, cosim, prsim/esim, spice. A shared database links the steps.]

40 Robustness and Reliability

41 Robustness to Power-Supply Noise
• HSPICE simulation of a typical QDI asynchronous circuit: a five-stage ring of async (PCHB) pipeline stages.
• Technology: TSMC 0.18-micron CMOS; Vdd: 1.8V, Vt: 0.5V; complete layout.
• Vdd oscillates between 3.5V and 0V (maximal amplitude), at various frequencies.
• The circuit keeps working correctly! (It will malfunction at some very high-frequency noise in phase with circuit frequency.)

42 Robustness to Power-Supply Noise

43 SE-Tolerant QDI Circuits
[Figure: duplicated C-elements with inputs xa, ya, xb, yb, intermediate outputs z’a, z’b, and final outputs za, zb]

44 Soft-error Tolerant Asynchronous Microprocessor (STAM)
• The STAM architecture defines a simplified 32-bit RISC instruction set, which has eight general registers and four types of instructions: arithmetic, branch, memory, and shift operations.
• A partially wired layout of the STAM was completed in TSMC SCN 0.18-µm CMOS. In SPICE simulation, it runs at about 120 MHz.
• The soft-error tolerance of the STAM has been tested by injecting errors randomly while the STAM runs the RC4 program (a simple stream cipher) in the digital-level simulator.
• About five soft errors, whose locations are chosen randomly from a list of all nodes of the STAM, are injected in each execution of an instruction.
• About 25% of the 203,000 nets in the STAM experience a bit flip in each test.
• The figure shows locations of errors by dots; a box in the figure represents a CHP process.

45 Soft-error Tolerant Asynchronous Microprocessor (STAM)

46 Async Molecular Nanoelectronics
• Molecular nano was our motivation for XQDI: extreme case of variability!

47 “Extreme” QDI (XQDI)
• Can we improve QDI to eliminate (or reduce further) the remaining variability dependencies?
• Isochronic forks
• Keepers on state-holding nodes
• Slew rates and oscillating rings

48 Isochronic Forks
• Only timing assumption in QDI design
• New design style that (1) minimizes the number of isochronic forks, and (2) mitigates their effect
• d(single transition) << d(multi-transition path)
• One-sided inequality can always be satisfied

49 Cell Design without Keeper
• Keepers needed for state-holding cells
• Keeper requires transistor sizing and balancing current strengths. Difficult with variability…
• Example of the C-element:
[Figure: C-element with keeper; C-element without keeper]

50 Ring Oscillators
• An async system is a collection of rings of operators. Oscillating rings are the engine of an asynchronous circuit.
• Right choices of slew rates and number of stages guarantee that each ring oscillates.
• What are the limits? How many restoring stages per ring?

51 Theoretical Results & General Comments

52 Concurrency and the digital/analog interface
• Elementary building block: guarded transition (PR: guard expr → node↑ or guard expr → node↓)
• Stability and non-interference are necessary and sufficient to guarantee the absence of logical hazards
• Stable and non-interfering PR set is deterministic (Church-Rosser property)
• Any sequential execution is OK (powerful simulator and execution model)
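The claim that any sequential execution of a stable, non-interfering production-rule set gives the same result can be illustrated with a small experiment. The Python sketch below fires enabled rules in random order on a hypothetical three-gate example (two inverters feeding a C-element) and checks that every order reaches the same final state. The circuit, the rule encoding, and the function names are assumptions of this sketch, not taken from the slides.

```python
import random

# (guard, node, value): value 1 means "node↑", value 0 means "node↓".
RULES = [
    (lambda s: s["a"], "x", 0),                     # a → x↓   (inverter)
    (lambda s: not s["a"], "x", 1),                 # ¬a → x↑
    (lambda s: s["b"], "y", 0),                     # b → y↓   (inverter)
    (lambda s: not s["b"], "y", 1),                 # ¬b → y↑
    (lambda s: s["x"] and s["y"], "c", 1),          # x ∧ y → c↑   (C-element)
    (lambda s: not s["x"] and not s["y"], "c", 0),  # ¬x ∧ ¬y → c↓
]


def run(rules, initial_state, rng):
    """Fire randomly chosen enabled rules until no rule changes the state."""
    state = dict(initial_state)
    while True:
        enabled = [
            (node, value)
            for guard, node, value in rules
            if guard(state) and state[node] != value
        ]
        if not enabled:
            return state
        node, value = rng.choice(enabled)
        state[node] = value


if __name__ == "__main__":
    # Inputs a and b have just risen; x, y, c still hold their old values.
    start = {"a": 1, "b": 1, "x": 1, "y": 1, "c": 1}
    finals = [run(RULES, start, random.Random(seed)) for seed in range(20)]
    print(all(final == finals[0] for final in finals), finals[0])
```

This is why a simple sequential simulator suffices for such circuits: the interleaving chosen by the simulator does not affect the outcome.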

53 Analog Implementation
• There exists a QDI (stable, non-interfering) implementation for any deterministic computation (Turing-completeness)
• Arbitration treated separately. Metastability of arbiters is not a problem because of asynchrony
• Analog requirements on isochronic forks and ring oscillation can always be satisfied by adding restoring delays to the circuit (single-sided timing requirement).

54 Knowledge vs. Ignorance
• Cost of implementing sequencing
• In a clocked discipline: relies on knowledge of delays
• Because of increasing variability and complexity, this knowledge is increasingly expensive!
• In a QDI system, timing is ignored; cost to implement sequencing is high but fixed!
• “If knowledge is expensive, try ignorance”

55 At some point in time the costs cross…
[Figure: cost per component versus technology (increasing variability and complexity), with one curve for QDI and one for clocked design]
• Crossing point already passed for SoC…

56 Intel Says…
• From ISSCC 2005 article by Intel about Itanium L3 cache:
• “…traditional synchronous design becomes increasingly inefficient. Much of total delay is dedicated to clock skew, latch delay, margin in each cycle, and non-ideal division to cycle boundaries. …Significant margins must be added to account for slow marginal cells that are statistically probable in a 24MB cache. The delivery of low clock skew over such an area is also difficult and costly. This single-ended asynchronous design eliminates the drawbacks above…”

57 Conclusion
• Async QDI logic can be made extremely robust to timing variations and therefore to parameter variability
• Flexible interfaces of async and absence of global signal are better suited for complex system design as in SoC
• Better match for probabilistic design
• Energy efficient
• No synchronization failure because of metastability
• As technology advances, less costly for complex designs

58 Conclusion
As we enter the nanoscale era:
• System complexity (interfaces, clocking in SoC, reuse)
• Robustness issues (parameter variations, soft errors, noise)
• Costs: masks, design time
• Power and energy consumption
• “End-of-Moore’s-law” argument for parallelism
An asynchronous approach offers many advantages and is unavoidable in the long run.

59 Industrial Prospects
• Time is ripe. Why is industry so aloof?
• Absence of industrial CAD tools
• No seamless transition (GALS the stopgap solution?)
• Maybe not in Intel’s interest?
• Perhaps we need an industrial environment untied to traditional approaches and EDA tools
• Async offers an opportunity to leapfrog the current technology limitations

62 Managing Complexity: The Design Productivity Gap
From: The International Roadmap for Semiconductors: 1999

63 Managing Complexity
• All circuits designed have been found fully functional on first silicon:
[Table: columns Year, Transistors, Description. Surviving Description entries: distributed mutual exclusion element, stack element, first microprocessor, DSP filter, MIPS microprocessor]

