High-Throughput Asynchronous Pipelines for Fine-Grain Dynamic Datapaths Montek Singh and Steven Nowick Columbia University New York, USA

Presentation transcript:

High-Throughput Asynchronous Pipelines for Fine-Grain Dynamic Datapaths Montek Singh and Steven Nowick Columbia University New York, USA Intl. Symp. Adv. Res. Asynchronous Circ. Syst. (ASYNC), April 2-6, 2000, Eilat, Israel.

2 Outline
• Introduction
• Background: Williams’ PS0 pipelines
• New Pipeline Designs
  - Dual-Rail: LP3/1, LP2/2 and LP2/1
  - Single-Rail: LP SR 2/1
• Practical Issue: Handling slow environments
• Results and Conclusions

3 Why Dynamic Logic?
Potentially:
• Higher speed
• Smaller area
• “Latch-free” pipelines: the logic gate itself provides an implicit latch
  - lower latency
  - shorter cycle time
  - smaller area
  (very important in gate-level pipelining!)
→ Our Focus: Dynamic logic pipelines

4 How Do We Achieve High Throughput?
• Introduce novel pipeline protocols:
  - specifically target dynamic logic
  - reduce impact of handshaking delays
    - shorter cycle times
• Pipeline at very fine granularity:
  - “gate-level”: each stage is a single gate deep
    - highest throughputs possible
  - latch-free datapaths especially desirable
    - dynamic logic is a natural match

5 Prior Work: Asynchronous Pipelines
• Sutherland (1989), Yun/Beerel/Arceo (1996)
  - very elegant 2-phase control
  - expensive transition latches
• Day/Woods (1995), Furber/Liu (1996)
  - 4-phase control
  - simpler latches, but complex controllers
• Kol/Ginosar (1997)
  - double latches
  - greater concurrency, but area-expensive
• Molnar et al. ( ): two designs, asp* and micropipeline
  - both very fast, but:
    - asp*: complex timing, cannot handle latch-free dynamic datapaths
    - micropipeline: area-expensive, cannot do logic processing at all!
• Williams (1991), Martin (1997)
  - dynamic stages: no explicit latches!
  - low latency
  - throughput still limited

6 Background
• Introduction
→ Background: Williams’ PS0 pipelines
• New Pipeline Designs
  - Dual-Rail: LP3/1, LP2/2 and LP2/1
  - Single-Rail: LP SR 2/1
• Practical Issue: Handling slow environments
• Results and Conclusions

7 PS0 Pipelines (Williams )
Basic Architecture:
[Figure: pipeline stages, each with a Function Block and a Completion Detector; labels: Data in, Data out, PC]

8 PS0 Function Block
Each output is produced using a dynamic gate:
[Figure: dynamic gate with pull-down stack and “keeper”; labels: evaluation control, precharge control, PC, data inputs, data outputs, to completion detector]

9 Dual-Rail Completion Detector
• OR together the two rails of each bit
• Combine the results using a C-element
[Figure: OR gates for bit 0, bit 1, ..., bit n feeding a C-element (C) that produces Done]
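The two steps on this slide can be sketched in software. This is a minimal behavioral model, not the authors' circuit: the class and function names are hypothetical, and it assumes the usual dual-rail convention that a bit is valid when exactly one rail is high and empty (precharged) when both rails are low.

```python
class CElement:
    """Muller C-element: output rises when all inputs are 1,
    falls when all inputs are 0, and otherwise holds its value."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, inputs):
        if all(inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        return self.out


def bit_valid(rails):
    """OR together the two rails of one dual-rail bit."""
    true_rail, false_rail = rails
    return true_rail | false_rail


def completion_detector(c_element, dual_rail_bits):
    """Combine the per-bit OR results with a C-element to produce Done."""
    return c_element.update([bit_valid(b) for b in dual_rail_bits])


# Illustrative 3-bit walk through one evaluate/precharge cycle.
c = CElement()
done_empty = completion_detector(c, [(0, 0), (0, 0), (0, 0)])    # all precharged
done_partial = completion_detector(c, [(1, 0), (0, 0), (0, 1)])  # one bit still empty
done_full = completion_detector(c, [(1, 0), (0, 1), (0, 1)])     # all bits valid
done_holding = completion_detector(c, [(0, 0), (0, 1), (0, 1)])  # precharge in progress
done_reset = completion_detector(c, [(0, 0), (0, 0), (0, 0)])    # fully precharged
```

The hold behavior of the C-element is what makes Done change only once every bit has evaluated, and again only once every bit has precharged.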

10 PS0 Protocol
• PRECHARGE N: when N+1 completes evaluation
• EVALUATE N: when N+1 completes precharging
Evaluate → Precharge: 3 events
Precharge → Evaluate: another 3 events
Complete cycle: 6 events
[Figure: stages N, N+1, N+2; event labels: N evaluates, N+1 evaluates, N+1 indicates “done”, N+2 evaluates, N+2 indicates “done”, N+1 precharges, N+1 indicates “done”]

11 PS0 Performance Cycle Time =
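The cycle-time formula itself did not survive in this transcript. As a hedged sketch, the six-event cycle named on the PS0 Protocol slide can be summed directly, assuming those events (three evaluations, two completion detections, one precharge) form the critical cycle. The function name and the delay values are illustrative and are not taken from the talk.

```python
def ps0_cycle_time(t_eval, t_cd, t_prech):
    """Sum the six events of one PS0 cycle (assumed critical path)."""
    events = [
        ("N evaluates", t_eval),
        ("N+1 evaluates", t_eval),
        ("N+2 evaluates", t_eval),
        ("N+2 indicates done", t_cd),
        ("N+1 precharges", t_prech),
        ("N+1 indicates done", t_cd),
    ]
    return sum(delay for _, delay in events)


# Hypothetical delays in picoseconds, chosen only for illustration.
example_cycle_ps = ps0_cycle_time(t_eval=100, t_cd=50, t_prech=80)
```

Under that assumption the cycle time is 3 evaluations + 2 completion detections + 1 precharge.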

12 New Pipeline Designs
• Introduction
• Background: Williams’ PS0 pipelines
→ New Pipeline Designs
  → Dual-Rail: LP3/1, LP2/2 and LP2/1
  - Single-Rail: LP SR 2/1
• Practical Issue: Handling slow environments
• Results and Conclusions

13 Overview of Approach
Our Goal: shorter cycle time, without degrading latency
Our Approach: use “Lookahead Protocols” (LP)
• main idea: anticipate critical events based on richer observation
Two new protocol optimizations:
• “Early evaluation”: give a stage a head start on evaluation by observing events further down the pipeline (a similar idea was proposed by Williams in PA0, but our designs exploit it much better)
• “Early done”: a stage signals “done” when it is about to precharge/evaluate

14 Dual-Rail Design #1: LP3/1
Uses “early evaluation”:
• each stage now has two control inputs
  - the new input comes from two stages ahead
• evaluate N as soon as N+1 starts precharging
[Figure: stages N, N+1, N+2; labels: Data in, Data out, PC, Eval (from N+2)]

15 LP3/1 Protocol
• PRECHARGE N: when N+1 completes evaluation
• EVALUATE N: when N+2 completes evaluation (New! Enables “early evaluation”!)
[Figure: stages N, N+1, N+2; event labels: N evaluates, N+1 evaluates, N+1 indicates “done”, N+2 evaluates, N+2 indicates “done”]

16 LP3/1: Comparison with PS0
• PS0: 6 events in cycle
• LP3/1: only 4 events in cycle!
[Figure: event diagrams over stages N, N+1, N+2 for PS0 and for LP3/1]

17 LP3/1 Performance
Cycle Time =
Savings over PS0: 1 Precharge + 1 Completion Detection
[Figure label: saved path]

18 Inside a Stage: Merging Two Controls
A NAND gate combines the two control inputs:
• Precharge when PC=1 (and Eval=0)
• Evaluate “early” when Eval=1 (or PC=0)
[Figure: dynamic gate with pull-down stack and “keeper”; NAND gate with inputs PC (from stage N+1) and Eval (from stage N+2)]
Problem: “early” Eval=1 is non-persistent!
• it may get de-asserted before the stage has completed evaluation!

19 LP3/1 Timing Constraints: Example
Problem: “early” Eval=1 is non-persistent!
Observation: PC=0 soon after Eval=1, and is persistent
• use PC as safe “takeover” for Eval!
Solution: no change!
Timing Constraint: PC=0 arrives before Eval=1 is de-asserted
• simple one-sided timing requirement
• other constraints as well… all easily satisfied in practice
[Figure: NAND gate with inputs PC (from stage N+1) and Eval (from stage N+2)]
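The merged control and its timing constraint can be illustrated with a small Boolean model. This is a sketch under assumptions: the slides give only the behavior (precharge when PC=1 and Eval=0, evaluate otherwise), so the inverter on Eval and the function names here are modeling choices, not the actual gate-level polarities.

```python
def nand(a, b):
    """Two-input NAND on 0/1 values."""
    return 1 - (a & b)


def stage_control(pc, eval_):
    """Return 1 when the stage evaluates, 0 when it precharges.

    Precharge happens only when PC=1 and Eval=0; the inversion of Eval
    is an assumption made to express that with a single NAND.
    """
    return nand(pc, 1 - eval_)


def control_trace(signal_trace):
    """Apply stage_control to a sequence of (PC, Eval) samples."""
    return [stage_control(pc, ev) for pc, ev in signal_trace]


# Constraint met: PC=0 arrives before Eval=1 is de-asserted,
# so PC takes over and the stage keeps evaluating.
safe = control_trace([(1, 0), (1, 1), (0, 1), (0, 0)])

# Constraint violated: Eval drops while PC is still 1,
# so the stage is briefly pushed back into precharge.
unsafe = control_trace([(1, 0), (1, 1), (1, 0), (0, 0)])
```

In the `safe` trace the control stays at 1 once evaluation starts; in the `unsafe` trace it glitches back to 0, which is the failure the one-sided constraint rules out.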

20 Dual-Rail Design #2: LP2/2
Uses “early done”:
• completion detector now before functional block
• stage indicates “done” when about to precharge/evaluate
[Figure: stage with “early” Completion Detector ahead of the Function Block; labels: Data in, Data out]

21 LP2/2 Completion Detector
Modified completion detectors needed:
• Done=1 when stage starts evaluating, and inputs valid
• Done=0 when stage starts precharging
• asymmetric C-element
[Figure: OR gates for bit 0, bit 1, ..., bit n feeding “+” inputs of an asymmetric C-element (C), together with PC, producing Done]
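The set/reset asymmetry described on this slide can be sketched behaviorally. This is a rough model with hypothetical names: it assumes the “+” inputs (the per-bit OR results) take part only in setting Done, while the precharge control alone resets it. Signal polarities in the real circuit are not given in the transcript.

```python
class EarlyCompletionDetector:
    """Behavioral model of an asymmetric C-element used for "early done"."""

    def __init__(self):
        self.done = 0

    def update(self, precharging, dual_rail_inputs):
        # Each bit is valid when either of its two rails is high.
        inputs_valid = all(t | f for t, f in dual_rail_inputs)
        if precharging:
            self.done = 0      # reset depends on the precharge control only
        elif inputs_valid:
            self.done = 1      # set needs evaluate phase AND valid inputs
        return self.done       # otherwise hold the previous value


det = EarlyCompletionDetector()
d_precharge = det.update(True, [(0, 0), (0, 0)])
d_waiting = det.update(False, [(1, 0), (0, 0)])   # evaluating, inputs incomplete
d_set = det.update(False, [(1, 0), (0, 1)])       # evaluating, inputs valid
d_hold = det.update(False, [(0, 0), (0, 0)])      # inputs withdrawn, still evaluating
d_cleared = det.update(True, [(0, 0), (0, 0)])    # stage starts precharging
```

Because the inputs are ignored on reset, Done can fall as soon as the stage starts precharging rather than after its inputs have emptied.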

22 LP2/2 Protocol
Completion detection occurs in parallel with evaluation/precharge:
[Figure: stages N, N+1, N+2; event labels: N evaluates, N+1 “early done”, N+1 evaluates, N+2 “early done”]

23 LP2/2 Performance
Cycle Time =
LP2/2 savings over PS0: 1 Evaluation + 1 Precharge
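The savings stated on the LP3/1 and LP2/2 performance slides can be applied to a PS0 baseline to compare the three styles. This is only a sketch: the baseline of 3 evaluations + 2 completion detections + 1 precharge is an assumption read off the six-event PS0 cycle, any extra control-gate delay (such as the NAND in LP3/1) is ignored, and the delay values are hypothetical.

```python
def ps0_cycle(t_eval, t_cd, t_prech):
    """Assumed PS0 baseline: 3 evaluations, 2 completion detections, 1 precharge."""
    return 3 * t_eval + 2 * t_cd + t_prech


def lp31_cycle(t_eval, t_cd, t_prech):
    """PS0 minus the stated LP3/1 savings: 1 precharge + 1 completion detection."""
    return ps0_cycle(t_eval, t_cd, t_prech) - (t_prech + t_cd)


def lp22_cycle(t_eval, t_cd, t_prech):
    """PS0 minus the stated LP2/2 savings: 1 evaluation + 1 precharge."""
    return ps0_cycle(t_eval, t_cd, t_prech) - (t_eval + t_prech)


# Hypothetical delays in picoseconds, for illustration only.
delays = dict(t_eval=100, t_cd=50, t_prech=80)
cycles = {
    "PS0": ps0_cycle(**delays),
    "LP3/1": lp31_cycle(**delays),
    "LP2/2": lp22_cycle(**delays),
}
```

With these made-up numbers the LP3/1 result equals 3 evaluations + 1 completion detection, which agrees with the “only 4 events in cycle” remark on the comparison slide.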

24 Dual-Rail Design #3: LP2/1
Hybrid of LP3/1 and LP2/2. Combines:
• early evaluation of LP3/1
• early done of LP2/2
Cycle Time =

25 New Pipeline Designs
• Introduction
• Background: Williams’ PS0 pipelines
→ New Pipeline Designs
  - Dual-Rail: LP3/1, LP2/2 and LP2/1
  → Single-Rail: LP SR 2/1
• Practical Issue: Handling slow environments
• Results and Conclusions

26 Single-Rail Design: LP SR 2/1
Derivative of LP2/1, adapted to single-rail:
• bundled-data: matched delays instead of completion detectors
“Ack” to previous stages is “tapped off early”:
• once in evaluate (precharge), dynamic logic is insensitive to input changes
[Figure: pipeline stages with matched delay elements]

27 Inside an LP SR 2/1 Stage
PC and Eval are combined exactly as in LP3/1.
“done” is generated by an asymmetric C-element:
• done=1 when stage evaluates, and data inputs valid
• done=0 when stage precharges
[Figure: stage with NAND gate and asymmetric C-element (aC); labels: PC (from stage N+1), Eval (from stage N+2), “ack”, “req” in, data in, data out, “req” out, matched delay, done]

28 LP SR 2/1 Protocol
Cycle Time =
[Figure: stages N, N+1, N+2; event labels: N evaluates, N+1 evaluates, N+1 indicates “done”, N+2 evaluates, N+2 indicates “done”]

29 Practical Issue: Handling Slow Environments
We inherit a timing assumption from Williams’ PS0:
• Input (left) environment must precharge reasonably fast
Problem: If the environment is stuck in precharge, all pipelines (incl. PS0) will malfunction!
Our Solution:
• Add a special robust controller for the 1st stage
  - simply synchronizes input environment and pipeline
  - delays critical events until environment has finished precharge
• Modular solution overcomes shortcoming of Williams’ PS0
• No serious throughput overhead: the real bottleneck is the slow environment!

30 Results and Conclusions
• Introduction
• Background: Williams’ PS0 pipelines
• New Pipeline Designs
  - Dual-Rail: LP3/1, LP2/2 and LP2/1
  - Single-Rail: LP SR 2/1
• Practical Issue: Handling slow environments
→ Results and Conclusions

31 Results
Designed/simulated FIFOs for each pipeline style
Experimental Setup:
• design: 4-bit wide, 10-stage FIFO
• technology: 0.6 µm HP CMOS
• operating conditions: 3.3 V and 300 K

32 Comparison with Williams’ PS0
• LP2/1: >2X faster than Williams’ PS0
• LP SR 2/1: 1.2 Giga items/sec
[Figure: results for dual-rail and single-rail designs]

33 Comparison: LP SR 2/1 vs. Molnar FIFOs
LP SR 2/1 FIFO: 1.2 Giga items/sec
Adding logic processing to FIFO:
• simply fold logic into dynamic gate: little overhead
Comparison with Molnar FIFOs:
• asp* FIFO: 1.1 Giga items/sec
  - more complex timing assumptions, not easily formalized
  - requires explicit latches, separate from logic!
  - adding logic processing between stages: significant overhead
• micropipeline: 1.7 Giga items/sec
  - two parallel FIFOs, each only 0.85 Giga/sec
  - very expensive transition latches
  - cannot add logic processing to FIFO!
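The throughput figures on this slide can be turned into per-item cycle times with simple arithmetic. The helper names below are hypothetical; the numbers are the ones quoted on the slide.

```python
def cycle_time_ns(giga_items_per_sec):
    """Cycle time in nanoseconds for a throughput given in Giga items/sec."""
    return 1.0 / giga_items_per_sec


def combined_throughput(per_fifo_giga, parallel_fifos):
    """Aggregate throughput of several FIFOs operating in parallel."""
    return per_fifo_giga * parallel_fifos


lp_sr_cycle = cycle_time_ns(1.2)                 # LP SR 2/1 FIFO
asp_cycle = cycle_time_ns(1.1)                   # asp* FIFO
micropipeline_total = combined_throughput(0.85, 2)
```

So 1.2 Giga items/sec corresponds to a cycle time of roughly 0.83 ns, and the micropipeline's 1.7 Giga items/sec is the sum of two parallel FIFOs at 0.85 Giga items/sec each.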

34 Practicality of Gate-Level Pipelining
When datapath is wide:
• Can often split into narrow “streams”
• Use “localized” completion detector for each stream:
  - need to examine only a few bits: small fan-in
  - send “done” to only a few gates: small fan-out
• comp. det. fairly low cost!
[Figure: datapath width = 32 dual-rail bits; comp. det. fan-in = 2; done fan-out = 2]
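The idea of localized completion detection can be sketched as a partition of the datapath into streams, each checked by its own small detector. The function names are hypothetical, and the validity check is a purely combinational stand-in for a per-stream detector; the 32-bit width and fan-in of 2 follow the figure labels.

```python
def split_into_streams(width_bits, bits_per_stream):
    """Partition bit indices 0..width_bits-1 into narrow streams."""
    return [
        list(range(start, min(start + bits_per_stream, width_bits)))
        for start in range(0, width_bits, bits_per_stream)
    ]


def stream_valid(dual_rail_bits, stream):
    """Local check: every bit in this stream has one rail high."""
    return all(dual_rail_bits[i][0] | dual_rail_bits[i][1] for i in stream)


streams = split_into_streams(32, 2)
max_fan_in = max(len(s) for s in streams)

# Illustrative data: every bit valid except bit 5, which is still empty.
data = [(1, 0)] * 32
data[5] = (0, 0)
local_done = [stream_valid(data, s) for s in streams]
```

Each detector looks at only two bits, so a slow bit delays its own stream (here the third one) without adding fan-in to any other detector.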

35 Conclusions
Introduced several new dynamic pipelines:
• Use two novel protocols:
  - “early evaluation”
  - “early done”
• Especially suitable for fine-grain (gate-level) pipelining
• Very high throughputs obtained:
  - dual-rail: >2X improvement over Williams’ PS0
  - single-rail: 1.2 Giga items/second in 0.6 µm CMOS
• Use easy-to-satisfy, one-sided timing constraints
• Robustly handle arbitrary-speed environments
  - overcome a major shortcoming of Williams’ PS0 pipelines
Recent Improvement: Even faster single-rail pipeline (WVLSI’00)