3 Sequencing and Computation “An algorithm is a sequence of computational steps.” CL&R How do we implement sequencing in a continuous physical medium? Traditional answer: use a global time reference (“the clock”) [Figure: stages A, B, C, D, E driven by a global clock CLK]
4 Can we compute without a clock? Yes: “asynchronous” or “clockless” logic, also called “self-timed” or “speed-independent” David Muller, “Theory of Asynchronous Circuits” (1959) ILLIAC (1952) and ILLIAC II (1962): partially asynchronous PDP-6 (1964): asynchronous
5 Can we compute without a clock and without delay assumptions? Delay-insensitivity (Molnar, 198x…) Almost: “The class of delay-insensitive circuits is limited (not Turing-complete).” (Martin, 1990) Quasi-delay-insensitive (QDI) logic: –Delay-insensitive –Isochronic forks (only delay assumption) QDI is Turing-complete (Martin & Manohar, 1996)
6 What is an Asynchronous Circuit? Asynchronous system: collection of modules communicating by handshake protocols Distributed system on a chip (communicating by message exchange) [Figure: modules A, B, C, D, E connected by channels with ack signals]
7 Caltech QDI Approach Quasi delay-insensitive (QDI) design Minimal delay assumptions (only isochronic forks) Stricter logic synthesis (DI codes for datapath, completion trees), but… Robust and efficient (no evidence that delay assumptions improve efficiency)
9 Scientific Reasons Understanding the role of time in computation Limit of delay insensitivity Implementing a digital computation directly in a continuous physical medium Design by program transformation (real correctness-by-construction approach) “VLSI design as programming” paradigm
10 Engineering Reasons Better match for high-level synthesis –Can separate correctness from performance issues Modularity and better use of concurrency Large system design (SoC): only local communication Efficiency –Average-case instead of worst-case behavior –Less pressure for global optimization (“timing closure”) Robustness and reliability –Robust to variations in fabrication technology, temperature, and voltage, and to noise –SEU tolerance Energy efficiency
11 Energy Advantages of Async No clock –Up to 50% of clock power recovered Automatic shut-off of idle parts –Perfect clock gating No glitches (spurious transitions) –Up to 50% of power in combinational circuits Automatic adaptation to parameter variations –Voltage scaling: perfect exchange of delay against energy Flexibility of asynchronous interfaces: –Better use of concurrency
12 Reactive Use in Embedded Systems Archetype of a reactive system Average execution time may be much shorter than maximal execution time Sleep sequence without race condition –Modeled after wait/signal with condition variables Instant wake-up from deep sleep
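The race-free sleep sequence "modeled after wait/signal with condition variables" can be illustrated in software. The sketch below is only an analogy, not the hardware mechanism: the class `SleepController` and its methods are hypothetical names introduced here. It shows why testing the wake-up condition and going to sleep under the same lock means an event arriving just before the sleep cannot be lost.

```python
import threading


class SleepController:
    """Hypothetical sleep/wake-up controller built on a condition variable."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = 0   # events signaled but not yet handled
        self.handled = 0

    def signal(self):
        # An external event arrives: record it, then wake the sleeper.
        with self._cond:
            self._pending += 1
            self._cond.notify()

    def serve(self, n_events):
        # Handle n_events, sleeping whenever nothing is pending.
        for _ in range(n_events):
            with self._cond:
                # The test and the wait happen under the same lock, so an
                # event that arrives just before sleeping is never missed.
                while self._pending == 0:
                    self._cond.wait()
                self._pending -= 1
            self.handled += 1


def run_demo(n_events=5):
    ctrl = SleepController()
    worker = threading.Thread(target=ctrl.serve, args=(n_events,))
    worker.start()
    for _ in range(n_events):
        ctrl.signal()
    worker.join()
    return ctrl.handled


if __name__ == "__main__":
    print(run_demo())
```

Counting pending events, rather than keeping a single flag, is a design choice of this sketch so that several events arriving before the worker wakes are all served.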
13 Robustness to PVT Variations Increase in physical parameter variations (PVT) is becoming a huge problem… Even worse in future technologies (nano CMOS or others) Variations of physical parameters all affect timing Increased timing variations reduce robustness and/or performance Single time reference (clock) may become unavailable or too expensive in future technologies and large systems (SoC)
14 Robustness to Voltage and Temperature Variations
15 Single-event Upset and Soft-error Tolerance of QDI circuits Soft-errors caused by alpha particles, cosmic rays and other radiation sources are becoming increasingly problematic, even at ground-level QDI circuits can absorb most “dose-effects” Single-event upsets that cause a soft-error (bit flip) can be corrected efficiently in QDI circuits Error-correction scheme specific to QDI Entire async microcontroller SEU-tolerant
16 Detection and Correction of SE in QDI circuits Single-error detection: duplicate and compare Correction: –prevent propagation of detected SE –stability of guards corrects automatically –“Detection is correction” Simplest, most expensive coding, but simplest detection mechanism Entire microcontroller SEU tolerant
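The "duplicate and compare" idea can be sketched with a behavioral model of a C-element, whose output follows its inputs only when they agree. This is a simplified, step-based illustration under stated assumptions, not the actual circuit: the function names are hypothetical, each signal is modeled as two copies, and an upset flips one copy for one step.

```python
def c_element(a, b, prev):
    """Muller C-element: follow the inputs when they agree, otherwise hold."""
    return a if a == b else prev


def run_with_upset(values, upset_step, upset_copy):
    """Drive both copies with the same values, flipping one copy at upset_step.

    Returns the C-element output after each step (initial output is 0).
    """
    out, trace = 0, []
    for step, v in enumerate(values):
        a, b = v, v
        if step == upset_step:
            if upset_copy == "a":
                a ^= 1
            else:
                b ^= 1
        # The two copies disagree during the upset, so the output holds
        # its previous value instead of propagating the flipped bit.
        out = c_element(a, b, out)
        trace.append(out)
    return trace


if __name__ == "__main__":
    print(run_with_upset([0, 1, 1, 0, 0], upset_step=2, upset_copy="a"))
```

In this model a flipped copy can only stall the output until the two copies agree again; it never drives the output to a value the copies did not agree on, which is the sense in which detection already prevents propagation.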
17 Disadvantages of Async Size overhead (more transistors) Poorly understood and rarely taught No industrial CAD tools (yet) No well-developed testing procedure (yet) No easy transition path for large established companies…
20 First Asynchronous Microprocessor (Caltech, 1988) Performance: – 5 MIPS, 5mA @ 2V –18 MIPS, 45mA @ 5V –26 MIPS, 100mA @ 10V 16-bit RISC, 2-micron CMOS Formal synthesis: –Initial sequential description was a single page of CHP code –5 months from start of project to tape-out (small group) –Fully functional on first silicon Potato-chip experiment –Runs on a potato as power supply! – 50kHz @ 0.75V, 300kHz @ 0.9V
21 Asynchronous MIPS R3000 Microprocessor Standard 32-bit RISC ISA Single instruction issue, one branch delay slot Precise exceptions 2 on-chip caches: 4kB Icache and 4kB Dcache First prototype (1998): –No TLB –2M transistors –First asynchronous processor competitive with large synchronous designs
22 MiniMIPS Low-Voltage Operation Functional from 0.5V Vdd up Functional at 0.4V with some transistor resizing
23 Asynchronous MIPS: Practical Results HP’s 0.6-micron CMOS –Expected: 275 MIPS @ 7W @ 3.3V @ 25 °C –First prototype: 190 MIPS @ 4W @ 3.3V @ 25 °C –Voltage range: 1V (9.66MHz @ 0.021 W) to 8V Functional on first silicon despite –Inconsistencies in HP’s process parameters (e.g. higher Vt’s) –Long polysilicon wire overlooked in the critical fetch loop –(Testament to the robustness of asynchronous design style!) Roughly 4x faster than commercial synchronous MIPS ported to same technology –Note: no particular effort made towards designing for low power.
24 Lutonium-18: QDI 8051 Microcontroller TSMC SCN018 through MOSIS –0.18 µm CMOS –1.8V nominal –|Vt| = 0.4V to 0.5V Expected area: 5 mm^2 (including 8kB SRAM) Performance from low-level simulation (conservative!):
Vdd     Throughput   Power      Energy/inst    Efficiency
1.8 V   200 MIPS     100.0 mW   500 pJ/inst     1800 MIPS/W
1.1 V   100 MIPS      20.7 mW   207 pJ/inst     4830 MIPS/W
0.9 V    66 MIPS       9.2 mW   139 pJ/inst     7200 MIPS/W
0.8 V    48 MIPS       4.4 mW    92 pJ/inst    10900 MIPS/W
0.5 V     4 MIPS       170 µW    43 pJ/inst    23000 MIPS/W
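The energy column of the table above follows from the throughput and power columns, and the same figures can be fed into an E*t^2 estimate. The sketch below is a small calculator under stated assumptions: the function names are hypothetical, and t is taken to be the average time per instruction (1 / throughput), which is an assumption of this sketch rather than something the slide states.

```python
# (Vdd in V, throughput in MIPS, power in W), copied from the table above.
LUTONIUM_POINTS = [
    (1.8, 200, 100.0e-3),
    (1.1, 100, 20.7e-3),
    (0.9, 66, 9.2e-3),
    (0.8, 48, 4.4e-3),
    (0.5, 4, 170e-6),
]


def energy_per_instruction(mips, power_w):
    """Energy per instruction in joules: power / instructions per second."""
    return power_w / (mips * 1e6)


def et2(mips, power_w):
    """E*t^2 in J*s^2, with t assumed to be the average time per instruction."""
    t = 1.0 / (mips * 1e6)
    return energy_per_instruction(mips, power_w) * t * t


def summarize(points=LUTONIUM_POINTS):
    """Return (Vdd, energy per instruction in pJ, E*t^2) for each point."""
    return [
        (vdd, energy_per_instruction(m, p) * 1e12, et2(m, p))
        for vdd, m, p in points
    ]


if __name__ == "__main__":
    for vdd, e_pj, metric in summarize():
        print(f"{vdd:.1f} V  {e_pj:7.1f} pJ/inst  Et^2 = {metric:.3e} J*s^2")
```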
25 Energy Efficiency Metric: Et^2 E = C*V^2, t = k / V E*t^2 independent of V Estimate of energy efficiency Comparison of designs “Algorithmic of energy” See Chapter 15 in “Power Aware Computing” book by Graybill & Melhem eds. Kluwer
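The voltage independence of the metric follows directly from the two first-order models on the slide (E proportional to the square of V, t inversely proportional to V). Written out, with C and k the same constants as above:

```latex
\begin{align*}
E &= C V^{2}, \qquad t = \frac{k}{V} \\
E\,t^{2} &= C V^{2} \cdot \frac{k^{2}}{V^{2}} = C k^{2}
\end{align*}
```

The supply voltage cancels, so within the range where both models hold, E*t^2 characterizes the design rather than its operating point.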
26 Voltage Scaling Advantage: Comparison to Intel Xscale
27 Energy Breakdown and Comparisons
Microprocessor -- Results
MIPS               Energy    Cycle time
async-0.6          33nJ      6ns
sync-0.6           70nJ      21ns
Microcontroller -- Estimation
8051               Energy per instr   Cycle time
sync-0.5           10.00nJ (1X)       20ns (1X)
async-0.5          1.67nJ (6X)        10ns (2X)
async-0.18 @1.8V   0.56nJ (18X)       5ns (4X)
async-0.18 @0.9V   0.14nJ (72X)       10ns (2X)
More than 100X Et^2 improvement over any other 8051
[Figure: Energy Breakdown chart with segments icache, decode, write back, regfile (bypass), fetch, exec units (adder, shifter, fblock, mem, mult/div)]
31 QDI PIPELINE vs Bundled Data Dual-rail or 1-of-n data encoding Completion tree Critics: high overhead (2*N +1 wires and completion tree) Alternative: Bundled data N + 1 wires, no completion tree Delay line for indicating completion, spurious transitions Big controversy!
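The wire-count comparison and the role of the completion tree can be made concrete with a small model. This is a behavioral sketch under stated assumptions, not a circuit description: the function names are hypothetical, each bit is a (false rail, true rail) pair, per-bit validity is the OR of the two rails, and the completion tree is modeled as one N-input C-element over the validity bits.

```python
def dual_rail_encode(bits):
    """Encode each bit as (false_rail, true_rail); exactly one rail is high."""
    return [(1, 0) if b == 0 else (0, 1) for b in bits]


def neutral_word(n):
    """The all-low spacer between two data words."""
    return [(0, 0)] * n


def completion(word, prev):
    """Completion tree as an N-input C-element over per-bit validity.

    Goes high when every bit is valid, low when every bit is neutral,
    and holds its previous value while the word is only partly there.
    """
    validity = [f | t for f, t in word]
    if all(validity):
        return 1
    if not any(validity):
        return 0
    return prev


def wires_qdi(n):
    """Dual-rail channel: 2 rails per bit plus one acknowledge wire."""
    return 2 * n + 1


def wires_bundled(n):
    """Bundled-data channel: N data wires plus one wire, as on the slide."""
    return n + 1


if __name__ == "__main__":
    word = dual_rail_encode([1, 0, 1])
    print(word, completion(word, 0), wires_qdi(8), wires_bundled(8))
```

Because completion is computed from the data rails themselves, no delay line has to be matched to the datapath; that is the overhead being traded against the extra wires.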
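32 Fine-grain Pipeline (PCHB) [Figure: PCHB stage between input channel L? and output channel R!, showing data L and R, acknowledges La and Ra, validity signals Lv and Rv, enable en, function block f, and the validity and completion detection]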
33 FINE-GRAIN PIPELINE No need for separate register Very high throughput and low forward latency Excellent Et^2 performance Entirely QDI Used in MiniMIPS and Lutonium Area overhead significant
34 Lower-Level Synthesis: HSE
CHP Program: *[ L?x; R!x ]
Handshaking Expansion: *[ [ ¬Ra ∧ L0 → R0↑ [] ¬Ra ∧ L1 → R1↑ ]; La↑; [ Ra → R0↓, R1↓ ]; [ ¬L0 ∧ ¬L1 → La↓ ] ]
Handshake on L: [ Ld ]; La↑; [ ¬Ld ]; La↓
Handshake on R: [ ¬Ra ]; Rd↑; [ Ra ]; Rd↓
[Figure: the eight handshake transitions, numbered 1 to 8]
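The sequencing enforced by a four-phase handshake can be sketched as two processes that only communicate through two wires. This is a minimal illustration under stated assumptions: `d` stands for the data/request wire and `a` for the acknowledge wire (hypothetical names), each process is a Python generator, and a random scheduler picks which process takes the next step. The waits alone fix the order of transitions.

```python
import random


def sender(wires, n):
    for _ in range(n):
        while wires["a"] != 0:      # wait [~a]
            yield
        wires["d"] = 1              # d+
        wires["log"].append("d+")
        yield
        while wires["a"] != 1:      # wait [a]
            yield
        wires["d"] = 0              # d-
        wires["log"].append("d-")
        yield


def receiver(wires, n):
    for _ in range(n):
        while wires["d"] != 1:      # wait [d]
            yield
        wires["a"] = 1              # a+
        wires["log"].append("a+")
        yield
        while wires["d"] != 0:      # wait [~d]
            yield
        wires["a"] = 0              # a-
        wires["log"].append("a-")
        yield


def run(n_tokens, seed=0):
    """Interleave the two processes in a random order; return the transition log."""
    rng = random.Random(seed)
    wires = {"d": 0, "a": 0, "log": []}
    procs = [sender(wires, n_tokens), receiver(wires, n_tokens)]
    while procs:
        p = rng.choice(procs)
        try:
            next(p)
        except StopIteration:
            procs.remove(p)
    return wires["log"]


if __name__ == "__main__":
    print(run(2))
```

Whatever the scheduling seed, the log is the same repeating sequence of four transitions, which is the point of the protocol: ordering comes from the handshake, not from any assumption about relative speed.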
35 Lower-Level Synthesis: PRS
CHP Program: *[ L?x; R!x ]
Handshaking Expansion: *[ [ ¬Ra ∧ L0 → R0↑ [] ¬Ra ∧ L1 → R1↑ ]; La↑; [ Ra → R0↓, R1↓ ]; [ ¬L0 ∧ ¬L1 → La↓ ] ]
Production Rule Set:
L0 ∨ L1 → Lv↑
¬La ∧ ¬Ra ∧ L0 → R0↑
¬La ∧ ¬Ra ∧ L1 → R1↑
R0 ∨ R1 → Rv↑
Lv ∧ Rv → La↑
¬L0 ∧ ¬L1 → Lv↓
Ra ∧ La → R0↓
Ra ∧ La → R1↓
¬R0 ∧ ¬R1 → Rv↓
¬Lv ∧ ¬Rv → La↓
To PRS for CMOS …
36 Lower-Level Synthesis: PRS
Production Rule Set:
L0 ∨ L1 → Lv↑
¬La ∧ ¬Ra ∧ L0 → R0↑
¬La ∧ ¬Ra ∧ L1 → R1↑
R0 ∨ R1 → Rv↑
Lv ∧ Rv → La↑
¬L0 ∧ ¬L1 → Lv↓
Ra ∧ La → R0↓
Ra ∧ La → R1↓
¬R0 ∧ ¬R1 → Rv↓
¬Lv ∧ ¬Rv → La↓
To PRS for CMOS …
Each production rule has the form: guard expr → node↑ or guard expr → node↓
These can be evaluated as: if (guard expr is true) node = Vdd, or if (guard expr is true) node = GND
A set of production rules must be stable and non-interfering (for hazard-free circuits)
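The "if guard is true, drive the node" reading of a production rule can be turned into a few lines of simulation. The sketch below is a toy evaluator, not one of the tools named in this talk: the rule representation and function names are hypothetical, and the example rule set is the standard two-input C-element (a ∧ b → c↑, ¬a ∧ ¬b → c↓) rather than the buffer above.

```python
import random

# A production rule is (guard, node, value):
# when guard(state) holds, node is driven to value (1 = Vdd, 0 = GND).
C_ELEMENT_PRS = [
    (lambda s: s["a"] and s["b"], "c", 1),           # a & b   -> c+
    (lambda s: not s["a"] and not s["b"], "c", 0),   # ~a & ~b -> c-
]


def enabled(rules, state):
    """Rules whose guard is true and whose firing would change the node."""
    return [r for r in rules if r[0](state) and state[r[1]] != r[2]]


def interferes(rules, state):
    """True if some node is driven to both Vdd and GND in this state."""
    driven = {}
    for guard, node, value in rules:
        if guard(state) and driven.setdefault(node, value) != value:
            return True
    return False


def settle(rules, state, rng=None):
    """Fire enabled rules one at a time, in arbitrary order, until none is left."""
    rng = rng or random.Random(0)
    state = dict(state)
    while True:
        ready = enabled(rules, state)
        if not ready:
            return state
        _, node, value = rng.choice(ready)
        state[node] = value


if __name__ == "__main__":
    print(settle(C_ELEMENT_PRS, {"a": 1, "b": 1, "c": 0}))
```

When neither guard holds (the inputs disagree), no rule fires and `c` keeps its value, which is why such a node is state-holding.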
37 Asynchronous Architectures New asynchronous solutions for pipelined microprocessors Execution units are in parallel, allowing concurrent and out-of-order execution of instructions
38 CAD Tools Complete suite of tools: synthesis, simulation, verification, optimization, layout Designer-assisted compilation Tools are modular and customizable Main representations: CHP, PRS, Cast
39 Design Flow [Figure: design flow from sequential program to concurrent system, PRS, sized PRS, collection of cells, placed cells, routed cells, and physical layout, split into logical and physical stages. Synthesis tools: DDD, SDD, Sizer, Placer, Router, PL2 (adds physical information and resizes using wire information). Simulators: chpsim, cosim, prsim/esim, spice. A shared database connects the steps.]
41 Robustness to Power-Supply Noise HSPICE simulation of a typical QDI asynchronous circuit: a five-stage ring of async (PCHB) pipeline stages. Technology: TSMC 0.18-micron CMOS, Vdd: 1.8V, Vt: 0.5V, complete layout. Vdd is oscillating between 3.5V and 0V (maximal amplitude), and at various frequencies. The circuit keeps working correctly! (It will malfunction at some very high-frequency noise in phase with circuit frequency.)
43 SE-Tolerant QDI Circuits [Figure: duplicated gate with inputs xa, ya and xb, yb, intermediate outputs z’a and z’b, and two C-elements producing the final outputs za and zb]
44 Soft-error Tolerant Asynchronous Microprocessor (STAM) The STAM architecture defines a simplified 32-bit RISC instruction set with eight general registers and four types of instructions: arithmetic, branch, memory, and shift operations. A partially wired layout of the STAM was completed in TSMC SCN 0.18-µm CMOS. In SPICE simulation, it runs at about 120 MHz. The soft-error tolerance of the STAM has been tested by injecting errors randomly while the STAM runs the RC4 program (a simple stream cipher) in the digital-level simulator. About five soft errors, whose locations are chosen randomly from a list of all nodes of the STAM, are injected during the execution of each instruction. About 25% of the 203,000 nets in the STAM experience a bit flip in each test. The figure shows the locations of errors as dots; each box in the figure represents a CHP process.
46 Async Molecular Nanoelectronics Molecular nano was our motivation for XQDI: Extreme case of variability!
47 “Extreme” QDI (XQDI) Can we improve QDI to eliminate (or reduce further) the remaining variability dependencies? Isochronic forks Keepers on state-holding nodes Slew rates and oscillating rings
48 Isochronic Forks Only timing assumption in QDI design New design style that (1) minimizes the number of isochronic forks, and (2) mitigates their effect d(single transition) << d(multi-transition path) One-sided inequality can always be satisfied
49 Cell Design without Keeper Keepers needed for state-holding cells Keeper requires transistor sizing and balancing current strengths. Difficult with variability… Example of the C-element: With keeper Without keeper
50 Ring Oscillators An async system is a collection of rings of operators. Oscillating rings are the engine of an asynchronous circuit. Right choices of slew rates and number of stages guarantee that each ring oscillates. What are the limits? How many restoring stages per ring?
52 Concurrency and the digital/analog interface Elementary building block: guarded transition (PR: guard expr → node↑ or guard expr → node↓) Stability and non-interference are necessary and sufficient to guarantee the absence of logical hazards Stable and non-interfering PR set is deterministic (Church-Rosser property) Any sequential execution is OK (powerful simulator and execution model)
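The claim that any sequential execution order is acceptable can be illustrated on a tiny rule set with real concurrency. The sketch below is an assumption-laden toy, not a proof: the rule set (a fork from `x` to `y` and `z`, joined by a C-element into `w`) and all names are hypothetical. It fires enabled rules in different random orders and compares the final states.

```python
import random

# (node, value, guard): when guard(state) holds, node is driven to value.
RULES = [
    ("y", 1, lambda s: s["x"]),                      # x -> y+
    ("y", 0, lambda s: not s["x"]),                  # ~x -> y-
    ("z", 1, lambda s: s["x"]),                      # x -> z+
    ("z", 0, lambda s: not s["x"]),                  # ~x -> z-
    ("w", 1, lambda s: s["y"] and s["z"]),           # y & z -> w+
    ("w", 0, lambda s: not s["y"] and not s["z"]),   # ~y & ~z -> w-
]


def run(order_seed, start):
    """Fire one enabled rule at a time, chosen at random, until quiescent.

    Returns the final state and the list of transitions in firing order.
    """
    rng = random.Random(order_seed)
    state = dict(start)
    firings = []
    while True:
        ready = [(n, v) for n, v, g in RULES if g(state) and state[n] != v]
        if not ready:
            return state, firings
        node, value = rng.choice(ready)
        state[node] = value
        firings.append(f"{node}{'+' if value else '-'}")


if __name__ == "__main__":
    start = {"x": 1, "y": 0, "z": 0, "w": 0}
    for seed in range(4):
        print(run(seed, start))
```

Here `y+` and `z+` may fire in either order, but `w+` always comes last and every interleaving reaches the same final state, so a simulator is free to pick any one sequential order.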
53 Analog Implementation There exists a QDI (stable, non-interfering) implementation for any deterministic computation (Turing-completeness) Arbitration treated separately. Metastability of arbiters is not a problem because of asynchrony Analog requirements on isochronic forks and ring oscillation can always be satisfied by adding restoring delays to the circuit (single-sided timing requirement).
54 Knowledge vs. Ignorance Cost of implementing sequencing In a clocked discipline: relies on knowledge of delays Because of increasing variability and complexity, this knowledge is increasingly expensive! In a QDI system, timing is ignored; cost to implement sequencing is high but fixed! “If knowledge is expensive, try ignorance”
55 At some point in time the costs cross… QDI CLOCKED COST/ COMPONENT TECHNOLOGY (increasing variability and complexity) Crossing point already passed for SoC…
56 Intel Says… From ISSCC 2005 article by Intel about Itanium L3 cache: “ …traditional synchronous design becomes increasingly inefficient. Much of total delay is dedicated to clock skew, latch delay, margin in each cycle, and non-ideal division to cycle boundaries. …Significant margins must be added to account for slow marginal cells that are statistically probable in a 24MB cache. The delivery of low clock skew over such an area is also difficult and costly. This single-ended asynchronous design eliminates the drawbacks above…”
57 Conclusion Async QDI logic can be made extremely robust to timing variations and therefore to parameter variability Flexible interfaces of async & absence of global signal better suited for complex system design as in SoC Better match for probabilistic design Energy efficient No synchronization failure because of metastability As technology advances, less costly for complex designs
58 Conclusion As we enter the nanoscale era: System complexity (interfaces, clocking in SoC, reuse) Robustness issues (parameter variations, soft errors, noise) Costs: masks, design time Power and energy consumption “End-of-Moore’s-law” argument for parallelism An asynchronous approach offers many advantages and is unavoidable in the long run.
59 Industrial Prospects Time is ripe. Why is industry so aloof? Absence of industrial CAD tools No seamless transition (GALS the stopgap solution?) Maybe not in Intel’s interest? Perhaps, we need an industrial environment untied to traditional approaches and EDA tools Async offers an opportunity to leapfrog the current technology limitations
62 Managing Complexity: The Design Productivity Gap From: International Technology Roadmap for Semiconductors (ITRS), 1999
63 Managing Complexity All circuits designed have been found fully functional on first silicon:
Year   Transistors   Description
1985         200     Distributed mutual exclusion element
1986       2 000     Stack element
1989      20 000     First microprocessor
1995     500 000     DSP filter
1998   2 000 000     MIPS microprocessor