Lecture 11 MOUSETRAP: Ultra-High-Speed Transition-Signaling Asynchronous Pipelines.



2 MOUSETRAP Pipelines
Simple asynchronous implementation style, uses:
  - transparent D-latches
  - simple control: 1 gate per pipeline stage
Target = static logic blocks
"MOUSETRAP": uses a "capture protocol". Latches:
  - are normally transparent: before new data arrives
  - become opaque: after data arrives ("capture" data)
Control signaling: transition signaling = 2-phase
  - simple protocol: req/ack = only 2 events per handshake (not 4)
  - no "return-to-zero"
  - each transition (up/down) signals a distinct operation
Our goal: very fast cycle time
  - simple inter-stage communication

3 MOUSETRAP: A Basic FIFO
Stages communicate using transition signaling: 1 transition per data item!
[Figure: Stages N-1, N and N+1, each with a data latch and a latch controller. Signals: req N, ack N-1, req N+1, ack N, done N, En; Data in, Data out. Animation: 1st data item flowing through the pipeline, then 2nd data item flowing through the pipeline.]

4 MOUSETRAP: A Basic FIFO (contd.)
Latch controller (XNOR) acts as "phase converter":
  - 2 distinct transitions (up or down) -> pulsed latch enable
  - 2 transitions per latch cycle
Latch is disabled when current stage is "done"
Latch is re-enabled when next stage is "done"
[Figure: Stages N-1, N and N+1, each with a data latch and a latch controller. Signals: req N, ack N-1, req N+1, ack N, done N, En; Data in, Data out.]
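The phase-converter behaviour of the XNOR controller can be sketched as a small behavioural model. This is an illustration only, not the circuit from the slides: the function names are hypothetical, delays are ignored, and the reset state (all control wires low) is an assumption.

```python
def xnor(a: int, b: int) -> int:
    """Return 1 when both inputs are equal, else 0."""
    return 1 if a == b else 0


def latch_enable(done_n: int, ack_n: int) -> int:
    """Enable of stage N's latch: 1 (transparent) when done_N equals ack_N."""
    return xnor(done_n, ack_n)


def trace_two_items():
    """Walk stage N through two data items and record En after each event."""
    done_n, ack_n = 0, 0                        # assumed reset state: wires low
    trace = [latch_enable(done_n, ack_n)]       # latch starts transparent
    for _ in range(2):
        done_n ^= 1                             # stage N is "done": one transition
        trace.append(latch_enable(done_n, ack_n))   # latch becomes opaque
        ack_n ^= 1                              # stage N+1 is "done": one transition
        trace.append(latch_enable(done_n, ack_n))   # latch is transparent again
    return trace


print(trace_two_items())  # [1, 0, 1, 0, 1]
```

The first item uses rising transitions and the second uses falling ones, yet En shows the same low pulse each time, which is the sense in which the XNOR converts phases into a pulsed enable.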

5 MOUSETRAP: FIFO Cycle Time
Cycle Time = [formula image]
Steps in one cycle:
  1. N computes
  2. N+1 computes
  3. N re-enabled to compute
Fast self-loop: N disables itself
[Figure: Stages N-1, N and N+1, each with a data latch and a latch controller. Signals: req N, ack N-1, req N+1, ack N, done N, En; Data in, Data out.]
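One plausible reading of the three numbered steps as a sum of delays is given below. The symbols are assumptions, since the slide's own formula is not in the transcript: t_LT stands for the delay through a transparent latch and t_XNOR↑ for the rising delay of the XNOR output.

```latex
T_{\mathrm{FIFO}}
  = \underbrace{t_{LT}}_{\text{1: } N \text{ computes}}
  + \underbrace{t_{LT}}_{\text{2: } N{+}1 \text{ computes}}
  + \underbrace{t_{XNOR\uparrow}}_{\text{3: } N \text{ re-enabled}}
  = 2\,t_{LT} + t_{XNOR\uparrow}
```

Under this reading the self-loop that disables N does not appear in the sum, because it runs in parallel with step 2.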

6 Detailed Controller Operation
One pulse per data item flowing through:
  - down transition: caused by "done" of N
  - up transition: caused by "done" of N+1
No minimum pulse width constraint!
  - simply, down transition should start "early enough"
  - can be "negative width" (no pulse!)
[Figure: Stage N's latch controller, with inputs "done from N" and "ack from N+1" and output "to Latch".]

7 MOUSETRAP: Pipeline With Logic
Simple extension to FIFO: insert logic block + matching delay in each stage
Logic blocks: can use standard single-rail (non-hazard-free) logic
"Bundled data" requirement:
  - each "req" must arrive after data inputs are valid and stable
[Figure: Stages N-1, N and N+1, each with a data latch, a latch controller, a logic block and a matching delay. Signals: req N, ack N-1, req N+1, ack N, done N.]
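The bundled-data requirement amounts to a one-line check on each stage. The sketch below is hypothetical: the function name, the optional safety margin and the picosecond figures are illustrative and do not come from the slides.

```python
def bundled_data_ok(logic_delay_ps: float,
                    matched_delay_ps: float,
                    margin_ps: float = 0.0) -> bool:
    """True when req (through the matching delay) arrives no earlier than
    the worst-case data output of the logic block, plus an optional margin."""
    return matched_delay_ps >= logic_delay_ps + margin_ps


# Illustrative numbers only.
print(bundled_data_ok(logic_delay_ps=300, matched_delay_ps=320))   # True
print(bundled_data_ok(logic_delay_ps=300, matched_delay_ps=280))   # False
print(bundled_data_ok(300, 320, margin_ps=30))                     # False
```

Because the check uses only the worst-case logic delay, glitches inside the logic block do not matter as long as they settle before req arrives, which is why non-hazard-free logic is acceptable.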

8 Special Case: Using "Clocked Logic"
Clocked-CMOS = C²MOS: eliminate explicit latches
  - latch folded into logic itself
[Figure: a general C²MOS gate (logic inputs, pull-up network, pull-down network, En, "keeper", logic output) and a C²MOS AND-gate (inputs A and B, En, "keeper", logic output).]

9 Gate-Level MOUSETRAP: with C²MOS
Use C²MOS: eliminate explicit latches
New control optimization = "Dual-Rail XNOR"
  - eliminate 2 inverters from critical path
[Figure: pipeline stages built from C²MOS logic, each with a latch controller and a pair of bit latches. Dual-rail control signals: (En,En'), (done,done'), (ack,ack'); req N, ack N-1, req N+1, ack N, done N.]

10 Complex Pipelining: Forks & Joins
Problems with linear pipelining:
  - handles limited applications; real systems are more complex
Non-linear pipelining: has forks/joins
Contribution: introduce efficient circuit structures
  - Forks: distribute data + control to multiple destinations
  - Joins: merge data + control from multiple sources
=> Enabling technology for building complex async systems
[Figure: a fork and a join.]

11 Forks and Joins: Implementation
Join: merge multiple requests
Fork: merge multiple acknowledges
[Figure: fork: Stage N with one req and a "C" block combining ack1 and ack2. Join: Stage N with one ack and a "C" block combining req1 and req2.]
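Assuming the "C" block in the figure is a standard Muller C-element, the merging of acknowledges in a fork can be sketched as follows. The class and function names are hypothetical, and the model ignores gate delays.

```python
class CElement:
    """Two-input Muller C-element: the output copies the inputs when they
    agree and holds its previous value when they differ."""

    def __init__(self, initial: int = 0):
        self.out = initial

    def update(self, a: int, b: int) -> int:
        if a == b:
            self.out = a
        return self.out


def fork_ack_trace():
    """Merged ack seen by the forking stage as the two branches respond."""
    c = CElement()
    # (ack1, ack2) after each event: branch 1 acks, branch 2 acks,
    # then branch 2 acks the next item, then branch 1 does.
    events = [(1, 0), (1, 1), (1, 0), (0, 0)]
    return [c.update(ack1, ack2) for ack1, ack2 in events]


print(fork_ack_trace())  # [0, 1, 1, 0]
```

The merged ack makes a transition only after both branches have made theirs, so the forking stage is not re-enabled until every destination has captured the data. A join applies the same element to req1 and req2.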

12 Related Protocols
Day/Woods ('97), and Charlie Boxes ('00)
Similarities: all use...
  - transition signaling for handshakes
  - phase conversion for latch signals
Differences: MOUSETRAP has...
  - higher throughput
  - ability to handle fork/join datapaths
  - more aggressive timing, less insensitivity to delays

13 Performance, Timing and Optzn.
MOUSETRAP with Logic:
  - Stage Latency = [formula image]
  - Cycle Time = [formula image]
MOUSETRAP Using C²MOS Gates:
  - Cycle Time = [formula image]
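The formulas on this slide did not survive in the transcript. The sketch below shows one way latency and cycle time could be computed if the stage with logic is modelled as the FIFO stage plus a logic delay. The formulas, parameter names and picosecond values are all assumptions made for illustration, not figures from the lecture.

```python
def stage_latency(t_latch: int, t_logic: int) -> int:
    """Assumed forward latency of one stage: through the latch, then the logic."""
    return t_latch + t_logic


def cycle_time(t_latch: int, t_logic: int, t_xnor_up: int) -> int:
    """Assumed cycle time: N computes (latch + logic), N+1 captures (latch),
    then N's XNOR rises to re-enable N's latch."""
    return (t_latch + t_logic) + t_latch + t_xnor_up


# Illustrative delays in picoseconds (not measured values).
print(stage_latency(t_latch=100, t_logic=300))              # 400
print(cycle_time(t_latch=100, t_logic=300, t_xnor_up=80))   # 580
print(cycle_time(t_latch=100, t_logic=0, t_xnor_up=80))     # 280
```

Setting t_logic to 0 reduces the model to a plain FIFO stage.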

14 Timing Analysis
Main timing constraint: avoid "data overrun"
Data must be safely "captured" by Stage N before new inputs arrive from Stage N-1
  - simple 1-sided timing constraint: fast latch disable
  - Stage N's "self-loop" faster than entire path through previous stage
[Figure: Stages N-1 and N, each with a data latch, a latch controller, a logic block and a matching delay. Signals: req N, ack N-1, req N+1, ack N, done N.]
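The one-sided constraint can be written as a comparison of two paths that both start when Stage N is "done". The decomposition of each path into gate delays below is an assumption, as are the parameter names, the hold-time term and the picosecond values.

```python
def no_data_overrun(t_xnor_down_n: int,
                    t_xnor_up_prev: int,
                    t_latch_prev: int,
                    t_logic_prev: int,
                    t_hold: int = 0) -> bool:
    """True when Stage N's self-loop closes its latch before new data from
    Stage N-1 can reach it."""
    # Self-loop: done_N -> XNOR output falls -> Stage N's latch is opaque.
    self_loop = t_xnor_down_n + t_hold
    # Previous stage: ack re-enables N-1, data crosses its latch and logic.
    prev_path = t_xnor_up_prev + t_latch_prev + t_logic_prev
    return self_loop < prev_path


# Illustrative delays in picoseconds (not measured values).
print(no_data_overrun(90, 80, 100, 300))          # True
print(no_data_overrun(90, 80, 100, 0, t_hold=100))  # False
```

The second call shows the tightest case, a stage with no logic and a long hold time, where the previous stage's path is short enough to compete with the self-loop.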

15 Timing Optzn: Reducing Cycle Time
Analytical Cycle Time = [formula image]
Goal: shorten the XNOR up-transition delay (in steady-state operation)
Steady-state = no undue pipeline congestion
Observation:
  - XNOR switches twice per data item
  - only 2nd (up) transition critical for performance
Solution: reduce XNOR output swing
  - degrade "slew" for start of pulse
  - allows quick pulse completion: faster rise time
Still safe when congested: pulse starts on time
  - pulse maintained until congestion clears

16 Timing Optzn (contd.)
[Figure: waveforms of the "unoptimized" and "optimized" XNOR output. N "done" marks where N's latch is disabled; N+1 "done" marks where N's latch is re-enabled. With the optimized output the latch is only partly disabled and recovers quicker (no pulse width requirement).]

17 Comparison with Wave Pipelining
Two scenarios:
  - Steady state:
      - both MOUSETRAP and wave pipelines act like transparent "flow through" combinational pipelines
  - Congestion:
      - right environment stalls: each MOUSETRAP stage safely captures data
      - internal stage slow: MOUSETRAP stages to its left safely capture data
      - congestion properly handled in MOUSETRAP
Conclusion: MOUSETRAP has potential of...
  - speed of wave pipelining
  - greater robustness and flexibility

18 Timing Issues: Handling Wide Datapaths
Buffers inserted to amplify latch signals (En)
Reducing impact of buffers:
  - control uses unbuffered signals -> buffer delay off of critical path!
  - datapath skewed w.r.t. control
Timing assumption: buffer delays roughly equal
[Figure: Stages N-1 and N with buffered En. Signals: req N, req N+1, done N.]

19 Preliminary Results
Pre-layout simulations of FIFOs:
  - do not account for wire delays, parasitics, etc.
  - careful transistor sizing/verification of timing constraints

20 Conclusions and Future Work
Introduced a new asynchronous pipeline style:
  - Static logic blocks
  - Simple latches and control:
      - transparent latches, or C²MOS gates
      - single gate control = 1 XNOR gate/stage
  - Highly concurrent event-driven protocol
  - High throughputs obtained:
      - 3.5 GHz in 0.25 µm, 1.9 GHz in 0.6 µm
      - comparable to wave pipelines; yet more robust/less design effort
  - Correctly handle forks and joins in datapaths
  - Timing constraints: local, 1-sided, easily met
Ongoing work:
  - more realistic performance measurement (incl. parasitics)
  - layout and fabrication