Uncle – An RTL Approach to Asynchronous Design
Presenter: Chi-Chuan Chuang
Date: 2012.12.20


Outline
Introduction
  ◦ C-element
  ◦ Null convention logic (NCL)
  ◦ NCL asynchronous systems
UNCLE synthesis flow
  ◦ From RTL to gates
  ◦ Ack generation
  ◦ Net buffering
  ◦ Latch balancing
  ◦ Relaxation, cell merging
Comparisons
Conclusion

C-element
Commonly used asynchronous logic component
Hysteresis: the output changes only when all inputs agree, and otherwise holds its value
Implementations
  ◦ Semi-static: with two cross-coupled inverters
  ◦ Static: doesn’t rely on feedback inverters
  ◦ Gate-level: depends on which gates are used
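The hysteresis of a C-element can be illustrated with a small behavioral model. This is a minimal Python sketch and is not taken from the slides; the class name `CElement` and its `update` method are hypothetical names chosen for illustration.

```python
class CElement:
    """Behavioral model of a two-input C-element (hypothetical helper)."""

    def __init__(self, reset_value=0):
        # Output value after reset.
        self.out = reset_value

    def update(self, a, b):
        # The output follows the inputs only when they agree;
        # when they differ, the previous output is held (hysteresis).
        if a == b:
            self.out = a
        return self.out


c = CElement()
trace = [c.update(a, b) for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]
print(trace)  # [0, 0, 1, 1, 0]
```

The output rises only after both inputs are 1 and falls only after both are 0, which is the state-holding behavior the slide refers to.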

C-element (cont.) Semi-static

C-element (cont.) Static Gate-level

Null convention logic
Dual-rail
Delay-insensitive logic style
Based on threshold logic
Uses 27 fundamental threshold gates with 2 to 4 inputs
Hysteresis gives the gates a state-holding capability

Null convention logic (cont.)

An example implementation of TH23
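The behavior of a threshold gate such as TH23 (threshold 2, 3 inputs) can be sketched in Python. This is an illustrative behavioral model, not the transistor-level implementation shown on the slide; `ThresholdGate` is a hypothetical name, and the model covers only unweighted THmn gates.

```python
class ThresholdGate:
    """Behavioral model of an unweighted NCL THmn gate (hypothetical helper)."""

    def __init__(self, m, n):
        self.m = m      # threshold: number of asserted inputs needed to set
        self.n = n      # number of inputs
        self.out = 0    # gate starts in the NULL (0) state

    def update(self, *inputs):
        if len(inputs) != self.n:
            raise ValueError("expected %d inputs" % self.n)
        ones = sum(inputs)
        if ones >= self.m:
            self.out = 1        # set: threshold reached
        elif ones == 0:
            self.out = 0        # reset: every input has returned to 0
        # otherwise hold the previous output (hysteresis)
        return self.out


th23 = ThresholdGate(2, 3)
steps = [(1, 0, 0), (1, 1, 0), (0, 1, 0), (0, 0, 0)]
print([th23.update(*s) for s in steps])  # [0, 1, 1, 0]
```

With m equal to n (for example TH22) the same model behaves as a C-element.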

Null convention logic (cont.)
Comparison of two types of dual-rail (DR) AND2
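To make the dual-rail AND2 concrete, here is a minimal Python sketch of one possible input-complete mapping. It is not necessarily either of the two variants compared on the slide. The encoding as a `(t, f)` pair, the names `encode`, `decode` and `DualRailAnd2`, and the set functions are assumptions for illustration: the true rail sets when both inputs are DATA1, and the false rail sets when both inputs are DATA and at least one is DATA0.

```python
NULL = (0, 0)  # neither rail asserted


def encode(bit):
    """Encode a Boolean value as a dual-rail (t, f) pair."""
    return (1, 0) if bit else (0, 1)


def decode(dr):
    """Return 0 or 1 for DATA, or None for NULL."""
    if dr == (1, 0):
        return 1
    if dr == (0, 1):
        return 0
    return None


class DualRailAnd2:
    """Input-complete dual-rail AND2 with hysteresis (illustrative sketch)."""

    def __init__(self):
        self.t = 0
        self.f = 0

    def update(self, a, b):
        (at, af), (bt, bf) = a, b
        if not (at or af or bt or bf):
            self.t = self.f = 0              # all inputs NULL: reset
        else:
            if at and bt:
                self.t = 1                   # both inputs are DATA1
            if (af and bf) or (af and bt) or (at and bf):
                self.f = 1                   # both DATA, at least one DATA0
        return (self.t, self.f)


gate = DualRailAnd2()
for a in (0, 1):
    for b in (0, 1):
        gate.update(NULL, NULL)              # NULL wavefront between data
        print(a, b, decode(gate.update(encode(a), encode(b))))
```

Because neither rail sets until both inputs carry DATA, the output also signals that both inputs have arrived.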

27 Basic NCL macros

NCL asynchronous systems
Data-driven approach
  ◦ Uses NCL gates for both registers and control
Control-driven approach
  ◦ Uses Balsa-style registers and control

Data-driven approach
Uses dual-rail latches with acknowledge signals ki and ko to control the datapath

Dual-rail latches
  ◦ C_0 = C-element with async reset to 0
  ◦ C_1 = C-element with async reset to 1
  ◦ t_d/f_d = dual-rail in
  ◦ ko = ackout
  ◦ t_q/f_q = dual-rail out
  ◦ ki = ackin
Types of latch
  ◦ drlatn
  ◦ drlatr
  ◦ drlats
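The signal names above can be tied together with a behavioral sketch of a dual-rail half-latch. The structure is an assumption based on the conventional NCL register: each output rail is a C-element of the matching data rail and ki, and ko is the NOR of the two output rails. The reset states assigned to drlatn (NULL), drlatr (DATA0) and drlats (DATA1) are also assumptions, since the slides only show them as figures.

```python
class DualRailLatch:
    """Behavioral dual-rail half-latch (sketch under the stated assumptions)."""

    # Assumed reset values of (t_q, f_q) for each latch type.
    RESET = {"drlatn": (0, 0), "drlatr": (0, 1), "drlats": (1, 0)}

    def __init__(self, kind="drlatn"):
        self.t_q, self.f_q = self.RESET[kind]

    @staticmethod
    def _c(a, b, prev):
        # C-element: follow the inputs when they agree, else hold.
        return a if a == b else prev

    @property
    def ko(self):
        # ko = 1 requests DATA (latch holds NULL); ko = 0 requests NULL.
        return int(not (self.t_q or self.f_q))

    def update(self, t_d, f_d, ki):
        self.t_q = self._c(t_d, ki, self.t_q)
        self.f_q = self._c(f_d, ki, self.f_q)
        return (self.t_q, self.f_q, self.ko)


latch = DualRailLatch("drlatn")
print(latch.update(1, 0, 1))  # DATA1 captured while ki requests data
print(latch.update(1, 0, 0))  # ki requests NULL, input still DATA: hold
print(latch.update(0, 0, 0))  # input NULL: latch returns to NULL
```

In a pipeline, the ko of each destination latch would drive the ki of its source latch, directly or through a completion network.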

Dual-rail latches (cont.) drlatn drlatr drlats

Data-driven approach (cont.)
Finite state machine
  ◦ The middle half-latch contains initial data
  ◦ All ports and registers are read and written every cycle

Control-driven Approach
Registers with selective read/write
Control network is separate from the datapath
Read ports can easily be added to a register

UNCLE synthesis flow
Both data-driven and control-driven styles are supported
Lower-level synthesis tool
Uses Verilog as its input language

From RTL to Gates
RTL is transformed to a gate-level netlist using commercial synthesis tools
The target library read by the tool contains:
  ◦ AND2, XOR2, OR2, inverter
  ◦ D-flip-flop (DFF), D-latch (DLAT)
  ◦ Special gates (T-elements, S-elements…)
  ◦ Complex gates that have been mapped into NCL
Gates have unit delays for timing
Area is proportional to transistor count

Ack Generation
Data-driven
  ◦ Each latch receives an ack signal from each destination latch of its output
Control-driven
  ◦ Each control element receives an ack signal from each destination latch
A simple ack merging algorithm:
  ◦ Any latches having at least one common destination have their ack networks merged
An ack checker step is included at the end of the flow to check ack network validity
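The merging rule can be sketched as a grouping problem. The Python below is an illustration, not Uncle's actual code: `merge_ack_groups` and the latch names are hypothetical, and treating the rule as transitive (A merges with B, B merges with C, so all three share one network) is an assumption implemented here with union-find.

```python
def merge_ack_groups(destinations):
    """Group source latches that share at least one destination latch.

    `destinations` maps each source latch name to the set of latches it
    drives. Returns a sorted list of sorted groups (illustrative sketch).
    """
    parent = {latch: latch for latch in destinations}

    def find(x):
        # Find the group representative, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_source = {}  # destination -> first source latch seen driving it
    for latch, dests in destinations.items():
        for dest in dests:
            if dest in first_source:
                # Common destination: merge the two ack groups.
                parent[find(latch)] = find(first_source[dest])
            else:
                first_source[dest] = latch

    groups = {}
    for latch in destinations:
        groups.setdefault(find(latch), set()).add(latch)
    return sorted(sorted(group) for group in groups.values())


example = {
    "A": {"X", "Y"},
    "B": {"Y"},
    "C": {"Z"},
    "D": {"Z", "W"},
    "E": {"V"},
}
print(merge_ack_groups(example))  # [['A', 'B'], ['C', 'D'], ['E']]
```

Each resulting group would share one ack network; a latch with no shared destinations keeps its own.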

Net Buffering
Timing data uses a non-linear delay model (NLDM)
The signal net target transition time used for all examples in this paper is approximately equivalent to a 1X inverter driving four separate 4X inverter loads
Gate sizing
Builds a buffer tree with inverters

Latch Balancing
An optimization for the data-driven style that moves half-latches in the netlist to balance data delays with ack delays
Ack delay
  ◦ Depends on the number of destinations, which sets the completion network depth
Data delay
  ◦ Depends on the data logic complexity
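The two delays being balanced can be estimated with a rough sketch. The Python below assumes unit gate delays and a completion tree built from gates with at most four inputs (the maximum NCL gate fan-in mentioned earlier). The function names and the simple difference used as an imbalance measure are hypothetical, not Uncle's cost function.

```python
from math import ceil


def completion_depth(n_destinations, max_inputs=4):
    """Gate levels needed to combine the acks of n destinations."""
    depth = 0
    remaining = n_destinations
    while remaining > 1:
        remaining = ceil(remaining / max_inputs)  # one level of merging
        depth += 1
    return depth


def imbalance(data_levels, n_destinations):
    """Positive: the data path dominates; negative: the ack path dominates."""
    return data_levels - completion_depth(n_destinations)


for n in (1, 4, 5, 16, 17):
    print(n, completion_depth(n))
# 1 0, 4 1, 5 2, 16 2, 17 3

print(imbalance(6, 16))  # 4
```

A large positive or negative imbalance marks a latch stage where moving a half-latch across a gate level might help, which is the kind of candidate the balancing step looks for.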

Latch Balancing (cont.)

Generally results in more transistors, because the datapath width increases moving towards the source registers
This requires more latches and increases the ack network size
Implemented as an iterative heuristic algorithm

Latch Balancing (cont.)

Several sorting/pruning stages based on data/ack/cycle delays are used to find the latches that are most likely to improve performance if pushed
Chosen latches are pushed one gate level, and affected ack networks are rebuilt
Latches that feed only primary outputs are ineligible

Latch Balancing (cont.)
Works appropriately for FSMs
Has problems with linear pipelines if latches are pushed in one direction only

Relaxation and Cell Merging
Relaxation is a technique that
  ◦ Looks for redundant paths from a primary input (PI) to a primary output (PO)
  ◦ Finds gates that don’t have to be fully expanded to dual-rail versions, but can be implemented by eager versions that require fewer transistors
Cell merging
  ◦ Adjacent gates with no fanout are merged into more complex gates
  ◦ Area-driven

Example RTL Statements

Comparison
GCD16 with different Uncle versions
Conditional port activity caused the data-driven designs to be large and slow
Latch balancing helped DD performance
Control-driven produced the best results
DD: data-driven, CD: control-driven, LB: latch balanced, NB: net buffered, *: ratio to best
[Table: transistors, cycle time (ns), and energy (pJ), each with its ratio to best (*), for Uncle versions DD, DD/NB, DD/LB/NB, CD, CD/NB]

Comparison (cont.)
GCD16 between Uncle and Balsa
Balsa used more read ports on registers, reducing loading but increasing transistor count
Net buffering helped offset the increased loading in the Uncle design and improved performance
[Table: transistors, cycle time (ns), and energy (pJ) for Balsa vs. Uncle (CD/NB); *: ratio to best]

Comparison (cont.)
Viterbi decoder design
  ◦ Branch Metric Unit (BMU)
    ▪ Just combinational logic
    ▪ With a half-latch at the output for UNCLE ack
  ◦ Path Metric Unit (PMU)
    ▪ It’s a set of parallel accumulator-like registers resulting in many parallel three half-latch loops
  ◦ History Unit (HU)
    ▪ It has three 16-entry register files (4-bit, 2-bit, and 1-bit)
    ▪ An outer loop writes the registers, and can conditionally trigger an inner while loop that contains register read/write operations and executes a variable number of iterations

Comparison (cont.)
Viterbi’s Branch Metric Unit comparison
  ◦ Combinational only
The Uncle version is just combinational logic with a half-latch on the output
The Balsa version used loop splitting to split the combinational logic into concurrent blocks, which increased the parallelism of internal computations at the cost of more transistors
[Table: transistors, cycle time (ns), and energy (pJ) for Balsa vs. Uncle (CD/NB); *: ratio to best]

Comparison (cont.)
Uncle’s Viterbi Path Metric Unit (PMU)
LB+ = latch-balanced, with two sets of half-latches added to the RTL (one in the FSM loop, and one on the output port)
[Table: transistors, cycle time (ns), and energy (pJ), each with its ratio to best (*), for Uncle versions DD/NB, DD/NB/LB, DD/NB/LB+, CD/NB]

Comparison (cont.)
Viterbi’s Path Metric Unit comparison
[Table: transistors, cycle time (ns), and energy (pJ) for Balsa vs. Uncle (DD/NB/LB+); *: ratio to best]

Comparison (cont.)
Viterbi’s History Unit comparison
[Table: transistors, V1 cycle time (ns) and energy (pJ), V2 cycle time (ns) and energy (pJ), each with its ratio to best (*), for Balsa, Uncle CD/NB, and Uncle CD]

Comparison (cont.)
Viterbi comparison between Balsa and Uncle
The Uncle decoder uses the DD/NB/LB+ PMU RTL
[Table: transistors, cycle time (ns), and energy (pJ) for Balsa vs. Uncle (DD/NB/LB+); *: ratio to best]

Comparison (cont.)
Balsa vs. Uncle
  ◦ Combinational synthesis: Yes
  ◦ Control synthesis: Balsa: yes; Uncle: data-driven only
  ◦ Logic style: Balsa: different dual-rail styles, bundled data; Uncle: NCL only
  ◦ Behavioral simulation: Balsa: yes; Uncle: limited
  ◦ Area optimizations: Balsa: no; Uncle: relaxation, limited cell merging, ack sharing
  ◦ RTL style allows area/perf. tradeoffs, latch balancing, net buffering
  ◦ Timing model: Balsa: fixed delay; Uncle: NLDM

Conclusion
Uncle requires more effort from the designer than Balsa, but can produce a higher-quality design
If the goal is performance of an always-active module, the data-driven style is the better choice
The control-driven style is better for modules with conditional port activity

Appendix: Teak
Teak is a successor toolset to Balsa that uses a data-driven style
One of Teak’s goals is to automatically insert latch stages and balance delays for optimum throughput
Teak is a fairly new tool with only one public release

Reference
  ◦ Uncle – An RTL Approach to Asynchronous Design
  ◦ ASYNC12 PowerPoint about Uncle – An RTL Approach To Asynchronous Design
  ◦ Design of Asynchronous Circuits Using Synchronous CAD Tools
  ◦ Optimization of NULL convention self-timed circuits