
Presentation transcript:

Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication
Amitava Mitra, Intel Corp., Bangalore, India
William F. McLaughlin, Columbia University, Electrical Engineering
Steven M. Nowick, Columbia University, Computer Science

Outline
- Motivation and Contribution
  - System-on-Chip: Concepts and Trends
  - Asynchronous Signaling Styles
  - Target Asynchronous SOC Architecture
  - Contribution
- Proposed System Architecture
- Experimental Results
- Extensions: Other Signaling Styles
- Conclusions and Future Work

System-on-Chip (SOC): Concept and Trends Microelectronic trends enabling SOC design Increasing integration density + chip size Formerly discrete functions (memory, I/O) now integrated Popularity of “multi-core” designs Heterogeneous SOC: Large complex chip with broad functionality Many independent computation nodes Multiple cores, memories, accelerators, multimedia processing, etc. Often includes multiple timing domains Complex network-style interconnect fabric Challenges in Heterogeneous SOC design: Wire costs not scaling down with device size Increasing proportion of power and delay in interconnect Robust and high-performance interconnect design: High latencies between remote nodes Mixed timing, timing variability/uncertainty Need to support varied components: modular/scalable design

SOC Communication Fabric Growing factor in overall system performance Ideal Requirements: Speed: high throughput, low latency Low power Robust to timing variations Flexibility: integrate modular IPs and upgrades Asynchronous design well-suited to these goals Timing robust flexible designs Lower power than synchronous Work by Quinton, Greenstreet, and Wilton [ICCD 2005] GALS-style: global LEDR interconnect + local synchronous blocks does not provide details of protocol converters

Asynchronous for SOC Communication Advantages of asynchronous global communication Delay-insensitive (DI) encoding Removes timing constraints on global routing No clock signals to route across chip Significant power advantage Can support both async + sync computation Delay-insensitive async logic combats growing variability concerns GALS style: Globally-Asynchronous Locally-Synchronous Several popular async signaling protocols Dual rail four-phase, LEDR, 1-of-4, bundled data, others No single protocol ideal for both logic and communication

Background: LEDR Signaling
- Dual-rail encoding: two wires per bit, delay-insensitive
- “Level-encoding”:
  - Data rail: holds actual data value
  - Parity rail: holds parity value
- Alternating-phase protocol: encoding parity alternates between odd and even

LEDR Encoding (each entry is: data rail, parity rail)

Phase | Bit value 0 | Bit value 1
Even  | 0 0         | 1 1
Odd   | 0 1         | 1 0

LEDR Signaling
- Exactly one wire transition for each new data item
- Data rail: carries bit value in both phases
- Parity rail: phase alternates with each data item
[Figure: data-rail and parity-rail waveforms for a sequence of data items, with phases alternating even, odd, even, odd, even, odd, even]
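The LEDR encoding described in these two slides can be sketched in a few lines of Python. This is only an illustration: the function names and the convention that phase 0 means even and phase 1 means odd are assumptions made for the example, not something taken from the slides.

```python
def ledr_encode(bit, phase):
    """Return (data_rail, parity_rail) for one bit.

    phase: 0 = even, 1 = odd (naming convention assumed for this sketch).
    """
    data = bit              # data rail always carries the bit value
    parity = bit ^ phase    # even phase: rails equal; odd phase: rails differ
    return data, parity


def ledr_decode(data, parity):
    """Return (bit, phase) recovered from the two rails."""
    return data, data ^ parity


def ledr_stream(bits, start_phase=0):
    """Encode a sequence of bits, alternating the phase on every item."""
    phase = start_phase
    codewords = []
    for bit in bits:
        codewords.append(ledr_encode(bit, phase))
        phase ^= 1
    return codewords


def wire_transitions(a, b):
    """Number of rails that change between two consecutive codewords."""
    return (a[0] != b[0]) + (a[1] != b[1])
```

Walking through any bit sequence with `ledr_stream` and comparing neighbouring codewords with `wire_transitions` shows the single-transition property stated on the slide.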

Four-Phase Dual-Rail Signaling
Alternative DI code. Key differences:
- Four-phase (return-to-zero) protocol: spacer (reset) state required between each data item
- One-hot encoding: true rail (encodes 1) and false rail (encodes 0)
[Figure: true-rail and false-rail waveforms for a sequence of data values, alternating between evaluation (one rail high) and reset (both rails low)]

Four-Phase Dual-Rail vs. LEDR
- Advantages of four-phase dual-rail:
  - Delay-insensitive logic using standard gates
  - Implementations are simple and fast: widely used
  - LEDR: complex and impractical
- Disadvantages of four-phase dual-rail:
  - System-level communication throughput: spacer state doubles round-trip communication latency (LEDR: no spacer required)
  - Power dissipation: two transitions/bit (up and down) for each data item (LEDR: only one transition/bit)
- Conclusion:
  - Four-phase dual-rail is better for implementing function blocks
  - LEDR is better for global communication
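The transition counts quoted on this slide (two per bit for four-phase dual-rail, one per bit for LEDR) can be checked with a small Python sketch. The helper names are hypothetical, and the sketch assumes that both links start from an all-zero idle state, which for LEDR is taken to be bit 0 in the even phase.

```python
def four_phase_states(bits):
    """Wire states (true_rail, false_rail), with a spacer after every item."""
    states = [(0, 0)]                                # idle spacer
    for bit in bits:
        states.append((1, 0) if bit else (0, 1))     # evaluation: one rail high
        states.append((0, 0))                        # reset: both rails low
    return states


def ledr_states(bits):
    """Wire states (data_rail, parity_rail), starting from an assumed idle codeword."""
    states = [(0, 0)]        # assumed idle state: bit 0, even phase
    phase = 1                # so the first data item is sent in the odd phase
    for bit in bits:
        states.append((bit, bit ^ phase))
        phase ^= 1
    return states


def count_transitions(states):
    """Total number of individual wire transitions over a state sequence."""
    return sum((a[0] != b[0]) + (a[1] != b[1])
               for a, b in zip(states, states[1:]))
```

For any sequence of n bits, `count_transitions(four_phase_states(bits))` gives 2n and `count_transitions(ledr_states(bits))` gives n, which is the power argument made on the slide.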

Target Asynchronous SOC Architecture Our goal – Protocol converters to enable this global LEDR SOC Three major components: Global communication network (LEDR) Local computation nodes (varied styles) New requirement: protocol converters at interfaces Allow full separation of computation and communication

Contribution High-speed protocol converters to enable heterogeneous SOC architectures Supports high-throughput, robust global communication LEDR encoding Supports efficient design of local function blocks (i) 4-phase dual-rail, (ii) 1-of-4, (iii) single-rail bundled data Features: Family of low-latency protocol converters: support above 3 local encoding styles High throughput: facilitates concurrent interaction of nodes Timing-robust: converters almost entirely QDI Low design effort: standard cell design flow Fully implemented in 0.18 μm CMOS Layout and simulation FIFO throughputs up to 250 MHz

Two Target SOC Topologies 1. “Pipeline-style” topology Feed-forward data path: uni-directional token flow Receiving node returns a single ACK (control signal) Supports concurrency between nodes Data feeds forward Acknowledge sent back

Two SOC Topologies (cont.) 2. “Server-style” topology Client passes data token to server Server computes/returns data token to client (result) Explicit ACK unnecessary Proposed SOC architecture supports both topologies Four-phase server Four-phase data client Bi-directional data flow: data passed back to client on completion

Outline Motivation and Contribution Proposed System Architecture Architecture Overview System Simulation Detailed Hardware Implementation Timing Analysis Experimental Results Extensions: Other Signaling Styles Conclusions and Future Work

Architecture Overview Four-phase core LEDR input LEDR output External LEDR interface, internal four-phase core Four-phase signals are shown in red Two-phase or transition signals are shown in yellow

Control Signals Two-phase control signals Phase of LEDR input (request from left) Phase of LEDR output (forward complete) Acknowledge to left neighbor Acknowledge from right neighbor

Control Signals Four-phase control signals Completion detect four-phase evaluate and RZ Enable four-phase evaluate and RZ

System Simulation LEDR inputs begin arriving at quiescent system LEDR inputs arrive Completion detection

System Simulation Input completion detection sent to control All input phases matching Transition to new phase

System Simulation Control enables four-phase evaluate phase Enable rises

System Simulation LEDR input converted to four-phase Enable now high One wire of each four-phase pair rises

System Simulation Four-phase function evaluation

System Simulation Four-phase bits decoded to LEDR Each bit converted as soon as it computes LEDR outputs to next node generated Four-phase complete not used in evaluate phase

System Simulation LEDR output completion detection Output pairs ACK from right may come any time after all pairs are sent

System Simulation Control enables four-phase reset phase Enable falls

System Simulation Function block inputs return-to-zero ACK is sent concurrently to left Enable now low Pipeline concurrency: request new data during reset phase

System Simulation Four-phase reset propagates through logic block New data may arrive now that ACK has been sent Reset Completion detection Enable remains low

System Simulation Four-phase reset completes Complete internal cycle has now been performed Complete falls

System Simulation New evaluate phase begins when Enable rises again Pre-conditions: reset finished, new data REQ, and old data ACK Three-way synchronization Input phase transitions when new data ready ACK transitions when outputs safe to change Complete low (means reset finished)
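The three pre-conditions listed on this slide can be written as a simple predicate. The sketch below is a behavioral illustration only: the function names are hypothetical, and detecting a two-phase event by comparing a signal level with its previously observed level is an assumption of this example, not a description of the actual control circuit.

```python
def two_phase_event(current_level, previous_level):
    """A two-phase (transition) signal carries an event when its level changes."""
    return current_level != previous_level


def enable_may_rise(new_data_req, old_data_acked, complete):
    """Three-way synchronization for starting a new evaluate phase.

    new_data_req:   input phase has transitioned (new data ready)
    old_data_acked: ACK from the right has transitioned (outputs safe to change)
    complete:       four-phase completion signal; low means reset has finished
    """
    return bool(new_data_req and old_data_acked and not complete)
```

For example, `enable_may_rise(two_phase_event(1, 0), two_phase_event(1, 0), complete=0)` is true, while the same call with `complete=1` is false because the previous reset has not finished.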

Detailed Hardware Implementation Four-phase core LEDR input LEDR output Each block implemented in CMOS standard cells Design has few non-QDI timing constraints

Four-phase Encode (Input Converter) Converts LEDR input to four-phase dual-rail Enable=‘1’: outputs evaluate based on LEDR data Enable=‘0’: outputs reset (LEDR data blocked)
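A minimal behavioral sketch of the input converter follows. The gate-level form shown (an AND of Enable with the data rail or its complement) is an assumption chosen to match the behavior stated on the slide; the function names are hypothetical.

```python
def four_phase_encode_bit(ledr_data, enable):
    """Convert one LEDR data-rail value to a (true_rail, false_rail) pair.

    enable = 1: exactly one rail goes high, selected by the LEDR data value.
    enable = 0: both rails are low (reset), so the LEDR data is blocked.
    """
    true_rail = enable & ledr_data
    false_rail = enable & (1 - ledr_data)
    return true_rail, false_rail


def four_phase_encode_word(ledr_pairs, enable):
    """Convert a word given as (data_rail, parity_rail) pairs, bit by bit."""
    return [four_phase_encode_bit(data, enable) for data, _parity in ledr_pairs]
```

Only the data rail is used here; in this sketch the parity rail matters for phase and completion detection rather than for the data value.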

Four-phase Decode (Output Converter) Converts four-phase bits to LEDR output LEDR data rail encoding Assert either S (1 value) or R (0 value), then return-to-hold More robust alternative: C-element

Four-phase Decode (Output Converter) Converts four-phase bits to LEDR output LEDR parity rail encoding Parity output: based on 4-phase data and LEDR input phase (parity) Alternating phases: green vs. red gates D-latch: blocks new input parity arrival until 4-phase reset complete even phase odd phase
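The two output-converter slides can be summarized with a small behavioral model of one output bit. This is a sketch under stated assumptions: the class name is hypothetical, the `phase` argument stands for the latched LEDR input phase (0 = even, 1 = odd), and the D-latch that holds that phase until the four-phase reset completes is not modeled.

```python
class LedrOutputBit:
    """Behavioral model of one four-phase to LEDR output bit (illustrative only)."""

    def __init__(self):
        self.data = 0      # LEDR data rail (held by the SR latch)
        self.parity = 0    # LEDR parity rail

    def update(self, true_rail, false_rail, phase):
        if true_rail and not false_rail:      # S asserted: value 1
            self.data = 1
            self.parity = 1 ^ phase
        elif false_rail and not true_rail:    # R asserted: value 0
            self.data = 0
            self.parity = 0 ^ phase
        # both rails low (four-phase reset): hold the current LEDR output
        return self.data, self.parity
```

In this model the LEDR output changes only during the evaluate phase and is held through the four-phase reset, so the next node sees no spacer.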

1-Bit Completion Detectors LEDR CD at input and output Four-phase CD in function block Both protocols have one gate CD XOR (parity) for LEDR OR for four-phase dual-rail 1-bit LEDR completion detector 1-bit four-phase completion detector

N-Bit Completion Detectors C-element trees Used for both LEDR and four-phase C-element: standard cell implementation (AOI222 w/feedback)
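The 1-bit detectors and the C-element tree from the last two slides can be modeled behaviorally as follows. The class and function names are hypothetical, the tree uses two-input C-elements that all start low, and an odd leftover signal is simply passed up a level; these are assumptions of the sketch, not details of the standard cell implementation.

```python
def ledr_cd(data_rail, parity_rail):
    """1-bit LEDR completion detector: XOR of the rails gives the phase."""
    return data_rail ^ parity_rail


def four_phase_cd(true_rail, false_rail):
    """1-bit four-phase completion detector: OR of the rails (1 = valid data)."""
    return true_rail | false_rail


class CElementTree:
    """Tree of 2-input C-elements; each node holds its output until its inputs agree."""

    def __init__(self):
        self.state = {}    # (level, index) -> held output of that C-element

    def update(self, inputs):
        level, values = 0, list(inputs)
        while len(values) > 1:
            next_values = []
            for i in range(0, len(values) - 1, 2):
                key = (level, i // 2)
                a, b = values[i], values[i + 1]
                out = a if a == b else self.state.get(key, 0)
                self.state[key] = out
                next_values.append(out)
            if len(values) % 2:                  # odd signal passes through
                next_values.append(values[-1])
            values, level = next_values, level + 1
        return values[0]
```

Feeding the per-bit `four_phase_cd` outputs into the tree yields a word-level Complete signal that rises only when every bit is valid and falls only when every bit has reset. Feeding per-bit `ledr_cd` outputs yields the word-level phase in the same way.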

Control Block Main Purpose: controls 4-phase function block 4-phase eval requires 3-way synchronization Function block: previous RZ complete Primary inputs: new data arrival Right interface (in pipeline): ACK received In pipeline topology: also sends left ACK For pipeline topology only

Control Block Converts two-phase inputs to four-phase outputs Two-phase to four-phase conversion

Control Block: Signaling Conversion Pulse-mode (timed) Transition-signal (falling or rising) Four-phase (level-sensitive) SR latch captures the pulse Inverter and XNOR form simple pulse gen
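The signaling conversion on this slide (transition signal to pulse to level) can be illustrated with a discrete-time Python sketch. It assumes the inverter output lags its input by exactly one time step, that the XNOR compares the live signal with that delayed inverted copy, and that the latch reset is supplied externally. The function names and these timing assumptions are hypothetical and do not capture the real circuit's analog pulse width.

```python
def pulse_gen(samples):
    """Turn every transition (rising or falling) of a signal into a one-step pulse.

    samples: list of 0/1 levels, one per time step.
    """
    pulses = []
    previous = samples[0]
    for level in samples:
        delayed_inverted = 1 - previous                      # inverter output, one step late
        pulses.append(1 if level == delayed_inverted else 0)  # XNOR
        previous = level
    return pulses


def sr_latch(set_seq, reset_seq, q=0):
    """Level output of an SR latch driven by set and reset sequences."""
    outputs = []
    for s, r in zip(set_seq, reset_seq):
        if s and not r:
            q = 1
        elif r and not s:
            q = 0
        outputs.append(q)      # otherwise hold
    return outputs
```

In this model each transition of the two-phase input produces one pulse, the SR latch stretches that pulse into a level, and a later reset returns the level to zero, giving a four-phase style signal.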

Timing Requirements
- Circuits almost entirely QDI
- Exceptions:
  - Control block:
    - Two-sided timing constraint on length of pulse
    - Sensitive to both gate and wire delays
    - Careful layout required
  - Latches: simple hold time constraints
    - SR latches can be replaced by C-elements
    - C-elements also have implementation-specific timing constraints
    - SR latch much faster than our standard cell C-element
    - D latch can be removed at cost of concurrency

Outline Motivation and Contribution Proposed System Architecture Experimental Results Design Methodology Datapath Setup Simulation Results Latency and Throughput Analysis Extensions: Other Signaling Styles Conclusions and Future Work

Design Methodology Standard cell design flow with complete layout 0.18 μm TSMC CMOS process 4 metal layers of 7 available used in routing Custom place-and-route used Only major layout concern: pulse generator circuit Design could be automated with constraints on pulse Analog simulations: based on layout-extracted design Test vectors including limiting fast and slow cases

Datapath Implementation Two function blocks implemented An 8x8 carry-save multiplier An empty FIFO stage FIFO contains four-phase completion detector only Demonstrates minimum possible node latency Blocks are QDI in evaluate, but “eager” in reset Implemented in combinational CMOS “DIMS”-style logic (with C-elements) could be used instead QDI in both directions Increases both forward and reverse latencies

Multiplier Layout Includes dual rail multiplier and all conversion circuits Total area of 0.051 mm² FIFO stage has area of 0.018 mm²

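Measured Block Latencies

Category | Design Block | Simulated Latency
Function block latencies (includes four-phase completion detection) | Multiplier evaluate | 4.2 – 4.9 ns
Function block latencies (includes four-phase completion detection) | Multiplier reset | 2.2 ns
Function block latencies (includes four-phase completion detection) | FIFO (evaluate or reset) | 0.7 ns
CD latency | LEDR completion detector | 1.3 ns (even), 0.9 ns (odd)
Overhead of converters | Input Converter | 0.2 ns
Overhead of converters | Output Converter | 0.5 ns
Overhead of converters | Control block (longest path) | 1.1 ns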

Performance Results (average values)
3 Metrics:
- Forward Latency: input arrival → output data available
  - Multiplier: 6.8 ns; FIFO: 2.9 ns
- Stabilization Time: input arrival → reset complete (circuit quiescent)
  - Multiplier: 10.5 ns; FIFO: 6.3 ns
- Pipelined Cycle Time: min processing time/data item (steady-state)
  - Multiplier: 8.3 ns; FIFO: 4.0 ns
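The pipelined cycle times on this slide convert directly to throughput figures. The short sketch below does only that arithmetic; the function name is hypothetical, and reading throughput as the reciprocal of the steady-state cycle time is the usual convention, assumed here.

```python
def throughput_mhz(cycle_time_ns):
    """Items per second, in MHz, for a given steady-state cycle time in ns."""
    return 1000.0 / cycle_time_ns


fifo_mhz = throughput_mhz(4.0)          # FIFO stage, 4.0 ns cycle time
multiplier_mhz = throughput_mhz(8.3)    # multiplier node, 8.3 ns cycle time
print(f"FIFO: {fifo_mhz:.0f} MHz, multiplier: {multiplier_mhz:.1f} MHz")
```

The 4.0 ns FIFO cycle time corresponds to the 250 MHz figure quoted elsewhere in the talk; the multiplier's 8.3 ns cycle time works out to roughly 120 MHz.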

Performance Analysis Forward latency: overhead 2.2 ns for both nodes Overhead independent of function block size Includes: LEDR CD, control unit, input/output converters Throughput: increased by concurrency Benefit: 2.2 ns reduction in cycle time (vs. post-reset ACK) Savings achieved even in environment without channel latency “Core converter” overhead (no CD) extremely low Only 1.1 ns average latency for converters + control Completion detectors: Account for half of forward latency overhead Account for 55% of FIFO cycle time Faster CDs would provide big improvement
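The overhead figures on this slide can be cross-checked against the latency and performance numbers reported earlier in the talk. The sketch below is plain arithmetic on those quoted values, not additional measurement data; the variable and function names are hypothetical.

```python
# Values quoted in the slides (all in ns).
FIFO_FORWARD, FIFO_BLOCK, FIFO_CYCLE = 2.9, 0.7, 4.0
MULT_FORWARD = 6.8
LEDR_CD_EVEN, LEDR_CD_ODD = 1.3, 0.9
FORWARD_OVERHEAD = 2.2


def forward_overhead(forward_latency, block_latency):
    """Converter overhead = node forward latency minus function block latency."""
    return forward_latency - block_latency


fifo_overhead = forward_overhead(FIFO_FORWARD, FIFO_BLOCK)    # about 2.2 ns
mult_block_implied = MULT_FORWARD - FORWARD_OVERHEAD          # about 4.6 ns
ledr_cd_average = (LEDR_CD_EVEN + LEDR_CD_ODD) / 2            # about 1.1 ns
cd_share_of_overhead = ledr_cd_average / FORWARD_OVERHEAD     # about 0.5
cd_time_per_fifo_cycle = 0.55 * FIFO_CYCLE                    # about 2.2 ns
```

The implied multiplier evaluate latency of about 4.6 ns falls inside the reported 4.2 to 4.9 ns range, and the average LEDR completion detector latency of about 1.1 ns is half of the 2.2 ns forward overhead, consistent with the slide's statement.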

Outline Motivation and Contribution Proposed System Architecture Experimental Results Extensions: Other Signaling Styles Converters for 1-of-4 function blocks Converters for bundled data function block Conclusions and Future Work

Extensions to Other Local Protocols Only small changes to handle 1-of-4 or bundled data No change to control block 1-of-4 encoding: Input/output converters: Small changes to logic Needs standard 1-of-4 completion detector Single-rail bundled data: Input converter: not needed – use LEDR data rail Output converter: New basic circuit required (see paper for details) Function block completion detection: Use bundled ‘done’ signal Asymmetric delay chain (fast reset)

Outline Background and Motivation Contribution Proposed System Architecture Experimental Results Extensions: Other Signaling Styles Conclusions and Future Work Summary and Conclusion Future Work

Summary and Conclusions Support heterogeneous SOCs using hybrid protocols LEDR: low-power, delay-insensitive communication fabric Dual rail four-phase: Simple, fast logic blocks Designed Converters for LEDR/four-phase SOC: Low latency, high throughput, timing robust design Robust concurrency system developed Exploits four-phase reset to mask communication time Simulations with realistic mid-sized function nodes Demonstrated low latency overhead Demonstrated low area overhead Achieved throughputs up to 250 MHz for FIFO stage

Future Work Evaluating system-level benefits Determine design spaces where converters most useful Quantify benefits over using either protocol exclusively Optimal partitioning of converter nodes Explore dependence on system topology Potential applications: use in async SOCs Beigne/Vivet – GALS NoC Architectures (Async-06) Scott et al. (Intel/Silistix) – PXA27x System (Async-07) Dobkin/Ginosar/Kolodny – fast LEDR serial links (Async-06/07) Convert 4-phase dual-rail to LEDR (for parallel load)