Clockless Logic: Asynchronous Pipelines

Slides:



Advertisements
Similar presentations
Self-Timed Logic Timing complexity growing in digital design -Wiring delays can dominate timing analysis (increasing interdependence between logical and.
Advertisements

CS370 – Spring 2003 Hazards/Glitches. Time Response in Combinational Networks Gate Delays and Timing Waveforms Hazards/Glitches and How To Avoid Them.
Sequential Circuit Design
Andrey Mokhov, Victor Khomenko Danil Sokolov, Alex Yakovlev Dual-Rail Control Logic for Enhanced Circuit Robustness.
Reading1: An Introduction to Asynchronous Circuit Design Al Davis Steve Nowick University of Utah Columbia University.
Introduction to CMOS VLSI Design Sequential Circuits.
VLSI Design EE 447/547 Sequential circuits 1 EE 447/547 VLSI Design Lecture 9: Sequential Circuits.
Introduction to CMOS VLSI Design Sequential Circuits
MICROELETTRONICA Sequential circuits Lection 7.
Lecture 11: Sequential Circuit Design. CMOS VLSI DesignCMOS VLSI Design 4th Ed. 11: Sequential Circuits2 Outline  Sequencing  Sequencing Element Design.
Delay/Phase Regeneration Circuits Crescenzo D’Alessandro, Andrey Mokhov, Alex Bystrov, Alex Yakovlev Microelectronics Systems Design Group School of EECE.
Slide 1/20IWLS 2003, May 30Early Output Logic with Anti-Tokens Charlie Brej, Jim Garside APT Group Manchester University.
Sequential Circuits. Outline  Floorplanning  Sequencing  Sequencing Element Design  Max and Min-Delay  Clock Skew  Time Borrowing  Two-Phase Clocking.
1 Clockless Logic  Recap: Lookahead Pipelines  High-Capacity Pipelines.
Z. Feng MTU EE4800 CMOS Digital IC Design & Analysis EE4800 CMOS Digital IC Design & Analysis Lecture 11 Sequential Circuit Design Zhuo Feng.
Introduction to CMOS VLSI Design Lecture 19: Design for Skew David Harris Harvey Mudd College Spring 2004.
Synchronous Digital Design Methodology and Guidelines
Clock Design Adopted from David Harris of Harvey Mudd College.
RTL Hardware Design by P. Chu Chapter 161 Clock and Synchronization.
1 Clockless Logic Montek Singh Thu, Jan 13, 2004.
1 Clockless Logic Montek Singh Tue, Mar 23, 2004.
© Ran GinosarAsynchronous Design and Synchronization 1 VLSI Architectures Lecture 2: Theoretical Aspects (S&F 2.5) Data Flow Structures.
COMP Clockless Logic and Silicon Compilers Lecture 3
Lab for Reliable Computing Generalized Latency-Insensitive Systems for Single-Clock and Multi-Clock Architectures Singh, M.; Theobald, M.; Design, Automation.
1 Clockless Logic Montek Singh Tue, Mar 21, 2006.
High-Throughput Asynchronous Pipelines for Fine-Grain Dynamic Datapaths Montek Singh and Steven Nowick Columbia University New York, USA
Introduction to CMOS VLSI Design Lecture 10: Sequential Circuits Credits: David Harris Harvey Mudd College (Material taken/adapted from Harris’ lecture.
1 Clockless Computing Montek Singh Thu, Sep 13, 2007.
Lecture 11 MOUSETRAP: Ultra-High-Speed Transition-Signaling Asynchronous Pipelines.
1 Recap: Lectures 5 & 6 Classic Pipeline Styles 1. Williams and Horowitz’s PS0 pipeline 2. Sutherland’s micropipelines.
1 Clockless Logic: Dynamic Logic Pipelines (contd.)  Drawbacks of Williams’ PS0 Pipelines  Lookahead Pipelines.
Amitava Mitra Intel Corp., Bangalore, India William F. McLaughlin
MOUSETRAP Ultra-High-Speed Transition-Signaling Asynchronous Pipelines Montek Singh & Steven M. Nowick Department of Computer Science Columbia University,
Paper review: High Speed Dynamic Asynchronous Pipeline: Self Precharging Style Name : Chi-Chuan Chuang Date : 2013/03/20.
Ratioed Circuits Ratioed circuits use weak pull-up and stronger pull-down networks. The input capacitance is reduced and hence logical effort. Correct.
DCSL & LVDCSL: A High Fan-in, High Performance Differential Current Switch Logic Families Dinesh Somasekhaar, Kaushik Roy Presented by Hazem Awad.
1 Clockless Computing Montek Singh Thu, Sep 6, 2007  Review: Logic Gate Families  A classic asynchronous pipeline by Williams.
12004 MAPLD: 153Brej Early output logic and Anti-Tokens Charlie Brej APT Group Manchester University.
FPGA-Based System Design: Chapter 6 Copyright  2004 Prentice Hall PTR Topics n Low power design. n Pipelining.
Clocking System Design
Penn ESE370 Fall DeHon 1 ESE370: Circuit-Level Modeling, Design, and Optimization for Digital Systems Day 20: October 25, 2010 Pass Transistors.
1 Recap: Lecture 4 Logic Implementation Styles:  Static CMOS logic  Dynamic logic, or “domino” logic  Transmission gates, or “pass-transistor” logic.
RTL Hardware Design by P. Chu Chapter 9 – ECE420 (CSUN) Mirzaei 1 Sequential Circuit Design: Practice Shahnam Mirzaei, PhD Spring 2016 California State.
EE141 Timing Issues 1 Chapter 10 Timing Issues Rev /11/2003 Rev /28/2003 Rev /05/2003.
EE141 Timing Issues 1 Chapter 10 Timing Issues Rev /11/2003.
1 Clockless Logic Montek Singh Thu, Mar 2, Review: Logic Gate Families  Static CMOS logic  Dynamic logic, or “domino” logic  Transmission gates,
COE 360 Principles of VLSI Design Delay. 2 Definitions.
Digital Integrated Circuits A Design Perspective
Lecture 11: Sequential Circuit Design
Other Approaches.
Advanced Digital Design
Asynchronous Interface Specification, Analysis and Synthesis
Sequential circuit design with metastability
Pipelining and Retiming 1
Recap: Lecture 1 What is asynchronous design? Why do we want to study it? What is pipelining? How can it be used to design really fast hardware?
SEQUENTIAL LOGIC -II.
Introduction to CMOS VLSI Design Lecture 10: Sequential Circuits
Timing Analysis 11/21/2018.
触发器 Flip-Flops 刘鹏 浙江大学信息与电子工程学院 March 27, 2018
Clocking in High-Performance and Low-Power Systems Presentation given at: EPFL Lausanne, Switzerland June 23th, 2003 Vojin G. Oklobdzija Advanced.
Chapter 10 Timing Issues Rev /11/2003 Rev /28/2003
CSE 370 – Winter Sequential Logic - 1
ARM implementation the design is divided into a data path section that is described in register transfer level (RTL) notation control section that is viewed.
332:578 Deep Submicron VLSI Design Lecture 14 Design for Clock Skew
Day 21: October 29, 2010 Registers Dynamic Logic
Chapter 3 Overview • Multi-Level Logic
Wagging Logic: Moore's Law will eventually fix it
A Quasi-Delay-Insensitive Method to Overcome Transistor Variation
Early output logic and Anti-Tokens
Clockless Computing Lecture 3
Presentation transcript:

Clockless Logic: Asynchronous Pipelines MOUSETRAP: Ultra-High-Speed Transition-Signaling Asynchronous Pipelines

MOUSETRAP Pipelines Simple asynchronous implementation style, uses… transparent D-latches simple control: 1 gate/pipeline stage Target = static logic blocks “MOUSETRAP”: uses a “capture protocol” Latches … are normally transparent: before new data arrives become opaque: after data arrives (“capture” data) Control Signaling: transition-signaling = 2-phase simple protocol: req/ack = only 2 events per handshake (not 4) no “return-to-zero” each transition (up/down) signals a distinct operation Our Goal: very fast cycle time simple inter-stage communication

MOUSETRAP: A Basic FIFO Stages communicate using transition-signaling: Latch Controller 1 transition per data item! ackN-1 ackN En reqN doneN reqN+1 Data in Data out Data Latch Stage N-1 Stage N Stage N+1 2nd data item flowing through the pipeline 1st data item flowing through the pipeline 1st data item flowing through the pipeline

MOUSETRAP: A Basic FIFO (contd.) Latch controller (XNOR) acts as “phase converter”: 2 distinct transitions (up or down)  pulsed latch enable Latch is re-enabled when next stage is “done” Latch is disabled when current stage is “done” Latch Controller 2 transitions per latch cycle ackN-1 ackN En reqN doneN reqN+1 Data in Data out Data Latch Stage N-1 Stage N Stage N+1

MOUSETRAP: FIFO Cycle Time reqN ackN-1 reqN+1 ackN Data Latch Latch Controller doneN Data in Data out Stage N Stage N-1 Stage N+1 En 3 Fast self-loop: N disables itself 2 1 2 N re-enabled to compute N computes N+1 computes Cycle Time =

Detailed Controller Operation Stage N’s Latch Controller ack from N+1 done from N to Latch One pulse per data item flowing through: down transition: caused by “done” of N up transition: caused by “done” of N+1 No minimum pulse width constraint! simply, down transition should start “early enough” can be “negative width” (no pulse!)

MOUSETRAP: Pipeline With Logic Simple Extension to FIFO: insert logic block + matching delay in each stage Latch Controller ackN-1 ackN reqN reqN+1 delay delay delay doneN logic logic logic Data Latch Stage N-1 Stage N Stage N+1 Logic Blocks: can use standard single-rail (non-hazard-free) “Bundled Data” Requirement: each “req” must arrive after data inputs valid and stable

Special Case: Using “Clocked Logic” Clocked-CMOS = C2MOS: eliminate explicit latches latch folded into logic itself pull-up network pull-down “keeper” En A General C2MOS gate logic inputs output “keeper” En A B logic output C2MOS AND-gate

Gate-Level MOUSETRAP: with C2MOS Use C2MOS: eliminate explicit latches New Control Optimization = “Dual-Rail XNOR” eliminate 2 inverters from critical path Latch Controller ackN-1 2 ackN 2 2 2 2 En,En 2 doneN 2 2 (En,En’) (done,done’) (ack,ack’) reqN reqN+1 pair of bit latches C2MOS logic Stage N-1 Stage N Stage N+1

Complex Pipelining: Forks & Joins Problems with Linear Pipelining: handles limited applications; real systems are more complex fork join Non-Linear Pipelining: has forks/joins Contribution: introduce efficient circuit structures Forks: distribute data + control to multiple destinations Joins: merge data + control from multiple sources Enabling technology for building complex async systems

Forks and Joins: Implementation req req2 Stage N C req1 ack req ack2 Stage N C ack1 Join: merge multiple requests Fork: merge multiple acknowledges

Day/Woods (’97), and Charlie Boxes (’00) Related Protocols Day/Woods (’97), and Charlie Boxes (’00) Similarities: all use… transition signaling for handshakes phase conversion for latch signals Differences: MOUSETRAP has… higher throughput ability to handle fork/join datapaths more aggressive timing, less insensitivity to delays

Performance, Timing and Optzn. MOUSETRAP with Logic: Stage Latency = Cycle Time = MOUSETRAP Using C2MOS Gates: Stage Latency = Cycle Time =

Timing Analysis Main Timing Constraint: avoid “data overrun” Data must be safely “captured” by Stage N before new inputs arrive from Stage N-1 simple 1-sided timing constraint: fast latch disable Stage N’s “self-loop” faster than entire path through previous stage Stage N Data Latch Latch Controller doneN logic delay Stage N-1 reqN ackN-1 reqN+1 ackN

Timing Optzn: Reducing Cycle Time Analytical Cycle Time = Goal: shorten (in steady-state operation) Steady-state = no undue pipeline congestion Observation: XNOR switches twice per data item: only 2nd (up) transition critical for performance: Solution: reduce XNOR output swing degrade “slew” for start of pulse allows quick pulse completion: faster rise time Still safe when congested: pulse starts on time pulse maintained until congestion clears

Timing Optzn (contd.) “optimized” XNOR output “unoptimized” N “done” N+1 “done” “optimized” XNOR output latch only partly disabled; recovers quicker! (no pulse width requirement) “unoptimized” XNOR output N’s latch disabled N’s latch re-enabled

Comparison with Wave Pipelining Two Scenarios: Steady State: both MOUSETRAP and wave pipelines act like transparent “flow through” combinational pipelines Congestion: right environment stalls: each MOUSETRAP stage safely captures data internal stage slow: MOUSETRAP stages to its left safely capture data  congestion properly handled in MOUSETRAP Conclusion: MOUSETRAP has potential of… speed of wave pipelining greater robustness and flexibility

Timing Issues: Handling Wide Datapaths Buffers inserted to amplify latch signals (En): reqN reqN+1 doneN Stage N Stage N-1 En Reducing Impact of Buffers: control uses unbuffered signals  buffer delay off of critical path! datapath skewed w.r.t. control Timing assumption: buffer delays roughly equal

Preliminary Results Pre-Layout Simulations of FIFO’s: do not account for wire delays, parasitics, etc. careful transistor sizing/verification of timing constraints

Conclusions and Future Work Introduced a new asynchronous pipeline style: Static logic blocks Simple latches and control: transparent latches, or C2MOS gates single gate control = 1 XNOR gate/stage Highly concurrent event-driven protocol High throughputs obtained: 3.5 GHz in 0.25, 1.9 GHz in 0.6 comparable to wave pipelines; yet more robust/less design effort Correctly handle forks and joins in datapaths Timing constrains: local, 1-sided, easily met Ongoing Work: more realistic performance measurement (incl. parasitics) layout and fabrication