Amitava Mitra Intel Corp., Bangalore, India William F. McLaughlin


1 Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication
Amitava Mitra (Intel Corp., Bangalore, India); William F. McLaughlin (Columbia University, Electrical Engineering); Steven M. Nowick (Columbia University, Computer Science)

2 Outline
Motivation and Contribution (System-on-Chip: Concepts and Trends; Asynchronous Signaling Styles; Target Asynchronous SOC Architecture; Contribution)
Proposed System Architecture
Experimental Results
Extensions: Other Signaling Styles
Conclusions and Future Work

3 System-on-Chip (SOC): Concept and Trends
Microelectronic trends enabling SOC design: increasing integration density and chip size; formerly discrete functions (memory, I/O) now integrated; popularity of “multi-core” designs.
Heterogeneous SOC: a large, complex chip with broad functionality; many independent computation nodes (multiple cores, memories, accelerators, multimedia processing, etc.); often includes multiple timing domains; complex network-style interconnect fabric.
Challenges in heterogeneous SOC design: wire costs not scaling down with device size; increasing proportion of power and delay in interconnect.
Robust and high-performance interconnect design: high latencies between remote nodes; mixed timing and timing variability/uncertainty; need to support varied components (modular/scalable design).

4 SOC Communication Fabric
Growing factor in overall system performance.
Ideal requirements: speed (high throughput, low latency); low power; robustness to timing variations; flexibility (integrate modular IPs and upgrades).
Asynchronous design is well suited to these goals: timing-robust, flexible designs; lower power than synchronous.
Work by Quinton, Greenstreet, and Wilton [ICCD 2005]: GALS-style, with global LEDR interconnect plus local synchronous blocks; does not provide details of protocol converters.

5 Asynchronous for SOC Communication
Advantages of asynchronous global communication:
Delay-insensitive (DI) encoding: removes timing constraints on global routing.
No clock signals to route across chip: significant power advantage.
Can support both async and sync computation: delay-insensitive async logic combats growing variability concerns; GALS style (Globally-Asynchronous Locally-Synchronous).
Several popular async signaling protocols: dual-rail four-phase, LEDR, 1-of-4, bundled data, others. No single protocol is ideal for both logic and communication.

6 Background: LEDR Signaling
Dual-rail encoding: two wires per bit, delay-insensitive.
“Level-encoding”: the data rail holds the actual data value; the parity rail holds the parity value.
Alternating-phase protocol: encoding parity alternates between odd and even.
LEDR encoding (each entry is data rail, parity rail):
Phase   Bit value 0   Bit value 1
Even    0 0           1 1
Odd     0 1           1 0
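The encoding table can be expressed in a few lines of Python. This is a minimal sketch: the function names `ledr_encode` and `ledr_decode` and the numeric phase constants are illustrative assumptions, not taken from the slides.

```python
EVEN, ODD = 0, 1  # assumed numeric labels for the two LEDR phases

def ledr_encode(bit, phase):
    """Return (data_rail, parity_rail) for one bit in the given phase."""
    # Even phase: parity equals data (00 or 11).
    # Odd phase: parity differs from data (01 or 10).
    return bit, bit ^ phase

def ledr_decode(data_rail, parity_rail):
    """Return (bit, phase) recovered from the two rails."""
    return data_rail, data_rail ^ parity_rail

for phase in (EVEN, ODD):
    for bit in (0, 1):
        print(phase, bit, ledr_encode(bit, phase))
```

The data rail always carries the bit value directly, so a receiver only needs the parity rail to work out which phase it is in.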

7 LEDR Signaling
Exactly one wire transition for each new data item. Data rail: carries the bit value in both phases. Parity rail: phase alternates with each data item. [Waveform: data and parity rails across alternating even and odd phases.]
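The one-transition property can be checked with a short simulation. This is a sketch under stated assumptions: the helper names and the initial rail state (0, 0) in the even phase are choices made for illustration.

```python
def ledr_stream(bits):
    """Encode a bit sequence as successive (data, parity) rail states."""
    rails = [(0, 0)]   # assumed initial state: value 0, even phase
    phase = 0
    for bit in bits:
        phase ^= 1     # the phase alternates with every data item
        rails.append((bit, bit ^ phase))
    return rails

def wire_transitions(a, b):
    """Number of rails that differ between two consecutive states."""
    return (a[0] != b[0]) + (a[1] != b[1])

states = ledr_stream([1, 1, 0, 1, 0, 0])
counts = [wire_transitions(a, b) for a, b in zip(states, states[1:])]
print(counts)
```

When the new bit differs from the previous one the data rail toggles; when it repeats, the parity rail toggles instead, so every count is 1.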

8 Four-Phase Dual-Rail Signaling
Alternative DI code. Key differences: four-phase (return-to-zero) protocol; spacer (reset) state required between each data item; one-hot encoding with a true rail (encodes 1) and a false rail (encodes 0). [Waveform: true and false rails for a sequence of data values; evaluation has one rail high, reset has both rails low.]
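For comparison, the four-phase dual-rail sequence can be sketched the same way. The names `dual_rail_encode` and `four_phase_stream` are hypothetical; rails are written as (true_rail, false_rail).

```python
SPACER = (0, 0)  # reset state: both rails low

def dual_rail_encode(bit):
    """One-hot code: true rail high encodes 1, false rail high encodes 0."""
    return (1, 0) if bit else (0, 1)

def four_phase_stream(bits):
    """Rail states for a bit sequence, with a spacer after every item."""
    states = [SPACER]
    for bit in bits:
        states.append(dual_rail_encode(bit))  # evaluation: one rail high
        states.append(SPACER)                 # reset: both rails low
    return states

print(four_phase_stream([1, 0]))
```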

9 Four-Phase Dual-Rail vs. LEDR
Advantages of four-phase dual-rail: delay-insensitive logic using standard gates; implementations are simple and fast, and widely used. LEDR logic, by contrast, is complex and impractical.
Disadvantages of four-phase dual-rail: (1) system-level communication throughput: the spacer state doubles round-trip communication latency, whereas LEDR requires no spacer; (2) power dissipation: two transitions per bit (up and down) for each data item, whereas LEDR has only one transition per bit.
Conclusion: four-phase dual-rail is better for implementing function blocks; LEDR is better for global communication.
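The transition counts behind the power comparison can be reproduced with a small script. This is an illustrative sketch for a single bit lane; the function names and the starting states are assumptions.

```python
def ledr_wire_transitions(bits):
    """Count rail transitions when sending bits with LEDR (start: rails 0, 0)."""
    data, parity, phase, count = 0, 0, 0, 0
    for bit in bits:
        phase ^= 1                              # phase alternates per item
        new_data, new_parity = bit, bit ^ phase
        count += (new_data != data) + (new_parity != parity)
        data, parity = new_data, new_parity
    return count

def four_phase_wire_transitions(bits):
    """Count rail transitions for four-phase dual-rail (start: spacer)."""
    count, rails = 0, (0, 0)
    for bit in bits:
        for nxt in ((bit, 1 - bit), (0, 0)):    # evaluate, then reset
            count += (nxt[0] != rails[0]) + (nxt[1] != rails[1])
            rails = nxt
    return count

bits = [1, 0, 0, 1, 1, 0, 1, 0]
print(ledr_wire_transitions(bits), four_phase_wire_transitions(bits))
```

For these eight items the LEDR lane toggles 8 times and the four-phase lane 16 times, matching the one versus two transitions per bit stated above.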

10 Target Asynchronous SOC Architecture
Our goal: protocol converters to enable this global LEDR SOC.
Three major components: global communication network (LEDR); local computation nodes (varied styles); new requirement: protocol converters at interfaces.
Allows full separation of computation and communication.

11 Contribution
High-speed protocol converters to enable heterogeneous SOC architectures.
Supports high-throughput, robust global communication: LEDR encoding.
Supports efficient design of local function blocks: (i) 4-phase dual-rail, (ii) 1-of-4, (iii) single-rail bundled data.
Features: family of low-latency protocol converters supporting the above 3 local encoding styles; high throughput, facilitating concurrent interaction of nodes; timing-robust, with converters almost entirely QDI; low design effort, using a standard cell design flow.
Fully implemented in 0.18 μm CMOS: layout and simulation; FIFO throughputs up to 250 MHz.

12 Two Target SOC Topologies
1. “Pipeline-style” topology: feed-forward data path with uni-directional token flow; receiving node returns a single ACK (control signal); supports concurrency between nodes. [Figure: data feeds forward, acknowledge sent back.]

13 Two SOC Topologies (cont.)
2. “Server-style” topology: client passes data token to server; server computes and returns data token to client (result); explicit ACK unnecessary. [Figure labels: four-phase server; four-phase data client; bi-directional data flow, with data passed back to client on completion.]
The proposed SOC architecture supports both topologies.

14 Outline
Motivation and Contribution
Proposed System Architecture (Architecture Overview; System Simulation; Detailed Hardware Implementation; Timing Analysis)
Experimental Results
Extensions: Other Signaling Styles
Conclusions and Future Work

15 Architecture Overview
[Figure: four-phase core between LEDR input and LEDR output.] External LEDR interface, internal four-phase core. Four-phase signals are shown in red; two-phase (transition) signals are shown in yellow.

16 Control Signals: two-phase control signals
Phase of LEDR input (request from left); phase of LEDR output (forward complete); acknowledge to left neighbor; acknowledge from right neighbor.

17 Control Signals: four-phase control signals
Completion detect: four-phase evaluate and RZ (return-to-zero). Enable: four-phase evaluate and RZ.

18 System Simulation: LEDR inputs begin arriving at quiescent system
LEDR inputs arrive. Completion detection.

19 System Simulation: input completion detection sent to control
All input phases matching. Transition to new phase.

20 System Simulation: control enables four-phase evaluate phase
Enable rises.

21 System Simulation: LEDR input converted to four-phase
Enable now high. One wire of each four-phase pair rises.

22 System Simulation: four-phase function evaluation

23 System Simulation: four-phase bits decoded to LEDR
Each bit is converted as soon as it computes. LEDR outputs to the next node are generated. Four-phase complete is not used in the evaluate phase.

24 System Simulation: LEDR output completion detection
Output pairs. ACK from right may come any time after all pairs are sent.

25 System Simulation: control enables four-phase reset phase
Enable falls.

26 System Simulation: function block inputs return-to-zero
ACK is sent concurrently to left. Enable now low. Pipeline concurrency: request new data during reset phase.

27 System Simulation: four-phase reset propagates through logic block
New data may arrive now that ACK has been sent. Reset. Completion detection. Enable remains low.

28 System Simulation: four-phase reset completes
Complete internal cycle has now been performed. Complete falls.

29 System Simulation: new evaluate phase begins when Enable rises again
Pre-conditions (three-way synchronization): reset finished, new data REQ, and old data ACK. Input phase transitions when new data is ready. ACK transitions when outputs are safe to change. Complete low means reset finished.
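The three-way synchronization can be written as a single predicate. This is a behavioral sketch only: the slides do not give the control equations, so the signal names and the two-phase conventions used here (a request is pending when the input phase differs from the last consumed phase; old data is acknowledged when the ACK from the right matches the last output phase) are assumptions.

```python
def enable_may_rise(complete, in_phase, consumed_phase, ack_in, out_phase):
    """True when a new four-phase evaluate phase may start."""
    reset_finished = complete == 0             # four-phase Complete is low
    new_request = in_phase != consumed_phase   # assumed: new LEDR data arrived
    old_data_acked = ack_in == out_phase       # assumed: right neighbor ACKed
    return reset_finished and new_request and old_data_acked

# All three conditions hold: evaluation may begin.
print(enable_may_rise(complete=0, in_phase=1, consumed_phase=0,
                      ack_in=1, out_phase=1))
# Reset still in progress: Enable must stay low.
print(enable_may_rise(complete=1, in_phase=1, consumed_phase=0,
                      ack_in=1, out_phase=1))
```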

30 Detailed Hardware Implementation
[Figure: four-phase core between LEDR input and LEDR output.] Each block is implemented in CMOS standard cells. The design has few non-QDI timing constraints.

31 Four-phase Encode (Input Converter)
Converts LEDR input to four-phase dual-rail. Enable = ‘1’: outputs evaluate based on LEDR data. Enable = ‘0’: outputs reset (LEDR data blocked).
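A behavioral model of the input converter for one bit might look as follows. The gate structure (AND-gating the data rail and its complement with Enable) is an assumption consistent with the described behavior, not the circuit from the slides; `four_phase_encode` is a hypothetical name.

```python
def four_phase_encode(data_rail, enable):
    """Return (true_rail, false_rail) from the LEDR data rail and Enable."""
    true_rail = enable & data_rail          # high only for a 1 while enabled
    false_rail = enable & (1 - data_rail)   # high only for a 0 while enabled
    return true_rail, false_rail

print(four_phase_encode(1, 1))  # evaluate a 1
print(four_phase_encode(0, 1))  # evaluate a 0
print(four_phase_encode(1, 0))  # Enable low: spacer, LEDR data blocked
```

Only the LEDR data rail is needed here, because it carries the bit value in both phases.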

32 Four-phase Decode (Output Converter)
Converts four-phase bits to LEDR output. LEDR data rail encoding: assert either S (1 value) or R (0 value), then return to hold. More robust alternative: C-element.

33 Four-phase Decode (Output Converter)
Converts four-phase bits to LEDR output. LEDR parity rail encoding: the parity output is based on the 4-phase data and the LEDR input phase (parity). Alternating phases: green vs. red gates (even phase, odd phase). D-latch: blocks new input parity arrival until 4-phase reset is complete.
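The output converter for one bit can be modeled behaviorally as below. This is a simplified sketch: the class name is hypothetical, the D-latch is abstracted away by assuming the `phase` argument is already the latched input phase, and the parity rule (data XOR phase) follows the LEDR encoding table rather than the actual gate network.

```python
class LedrOutputBit:
    """Behavioral model of one four-phase to LEDR output bit."""

    def __init__(self):
        self.data = 0     # LEDR data rail (held by the SR latch)
        self.parity = 0   # LEDR parity rail

    def update(self, true_rail, false_rail, phase):
        if true_rail:             # S asserted: store a 1
            self.data = 1
        elif false_rail:          # R asserted: store a 0
            self.data = 0
        if true_rail or false_rail:
            # Parity from the 4-phase data and the (latched) input phase.
            self.parity = self.data ^ phase
        # Spacer (both rails low): hold the previous LEDR output.
        return self.data, self.parity

out = LedrOutputBit()
print(out.update(1, 0, phase=1))  # bit 1 in odd phase
print(out.update(0, 0, phase=1))  # four-phase reset: output held
print(out.update(0, 1, phase=0))  # bit 0 in even phase
```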

34 1-Bit Completion Detectors
LEDR CD at input and output; four-phase CD in function block. Both protocols have a one-gate CD: XOR (parity) for LEDR, OR for four-phase dual-rail. [Figures: 1-bit LEDR completion detector; 1-bit four-phase completion detector.]
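Both 1-bit detectors reduce to a single Boolean operation, as this sketch shows. The function names are illustrative.

```python
def ledr_phase(data_rail, parity_rail):
    """XOR of the two rails: 0 in the even phase, 1 in the odd phase."""
    return data_rail ^ parity_rail

def four_phase_valid(true_rail, false_rail):
    """OR of the two rails: 1 when data is present, 0 for the spacer."""
    return true_rail | false_rail

print(ledr_phase(1, 1), ledr_phase(1, 0))              # even, odd
print(four_phase_valid(0, 0), four_phase_valid(0, 1))  # spacer, data
```

The LEDR detector output toggles once per data item, while the four-phase detector output rises on evaluation and falls on reset.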

35 N-Bit Completion Detectors
C-element trees: used for both LEDR and four-phase. C-element: standard cell implementation (AOI222 with feedback).
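An N-bit detector built from a tree of two-input C-elements can be modeled as follows. This is a behavioral sketch with hypothetical class names; it models the standard C-element function (output follows the inputs when they agree, otherwise holds), not the AOI222 cell itself.

```python
class CElement:
    """Two-input Muller C-element: changes only when both inputs agree."""

    def __init__(self, init=0):
        self.out = init

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out

class CTree:
    """Binary tree of C-elements combining 1-bit completion signals."""

    def __init__(self):
        self.cells = {}   # (level, index) -> CElement, created on demand

    def update(self, inputs):
        level, vals = 0, list(inputs)
        while len(vals) > 1:
            nxt = []
            for i in range(0, len(vals) - 1, 2):
                cell = self.cells.setdefault((level, i), CElement())
                nxt.append(cell.update(vals[i], vals[i + 1]))
            if len(vals) % 2:
                nxt.append(vals[-1])   # odd input passes up unchanged
            vals, level = nxt, level + 1
        return vals[0]

tree = CTree()
print(tree.update([0, 0, 0, 0]))  # all bits in the old phase
print(tree.update([1, 1, 1, 0]))  # one bit still pending
print(tree.update([1, 1, 1, 1]))  # all bits arrived
```

The tree output changes only after every input has changed, which is what makes it usable for both the alternating LEDR phase and the four-phase valid/spacer signal.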

36 Control Block
Main purpose: controls the 4-phase function block. 4-phase eval requires 3-way synchronization: function block (previous RZ complete); primary inputs (new data arrival); right interface, in pipeline (ACK received). In pipeline topology: also sends left ACK. [Callout: for pipeline topology only.]

37 Control Block
Converts two-phase inputs to four-phase outputs (two-phase to four-phase conversion).

38 Control Block: Signaling Conversion
Signal types: pulse-mode (timed); transition-signal (falling or rising); four-phase (level-sensitive). An SR latch captures the pulse. An inverter and an XNOR form a simple pulse generator.
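The inverter-plus-XNOR pulse generator can be illustrated with a discrete-time model. This is a sketch under stated assumptions: the inverter is modeled as a fixed delay of `inv_delay` time steps, the XNOR is ideal, and the function name is hypothetical; real pulse widths depend on gate and wire delays.

```python
def pulse_train(signal, inv_delay=2):
    """XNOR of a signal with a delayed, inverted copy of itself."""
    out = []
    for t, x in enumerate(signal):
        delayed = signal[t - inv_delay] if t >= inv_delay else signal[0]
        inverted = 1 - delayed                  # inverter output (delayed)
        out.append(1 if x == inverted else 0)   # XNOR of the two
    return out

two_phase = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(pulse_train(two_phase))
```

In steady state the signal and its inverted copy differ, so the XNOR output is low; after either a rising or a falling transition they briefly agree, producing a pulse whose width equals the inverter delay.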

39 Timing Requirements
Circuits are almost entirely QDI. Exceptions:
Control block: two-sided timing constraint on length of pulse; sensitive to both gate and wire delays; careful layout required.
Latches: simple hold time constraints; SR latches can be replaced by C-elements; C-elements also have implementation-specific timing constraints; the SR latch is much faster than our standard cell C-element; the D latch can be removed at the cost of concurrency.

40 Outline
Motivation and Contribution
Proposed System Architecture
Experimental Results (Design Methodology; Datapath Setup; Simulation Results; Latency and Throughput Analysis)
Extensions: Other Signaling Styles
Conclusions and Future Work

41 Design Methodology
Standard cell design flow with complete layout: 0.18 μm TSMC CMOS process; 4 of the 7 available metal layers used in routing; custom place-and-route used.
Only major layout concern: pulse generator circuit. Design could be automated with constraints on the pulse.
Analog simulations: based on layout-extracted design; test vectors include limiting fast and slow cases.

42 Datapath Implementation
Two function blocks implemented: an 8x8 carry-save multiplier and an empty FIFO stage. The FIFO contains a four-phase completion detector only and demonstrates the minimum possible node latency.
Blocks are QDI in evaluate, but “eager” in reset: implemented in combinational CMOS.
“DIMS”-style logic (with C-elements) could be used instead: QDI in both directions, but increases both forward and reverse latencies.

43 Multiplier Layout Includes dual rail multiplier and all conversion circuits Total area of mm2 FIFO stage has area of mm2

44 Measured Block Latencies
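Category | Design Block | Simulated Latency
Function block latencies (includes four-phase completion detection) | Multiplier evaluate | 4.2 – 4.9 ns
Function block latencies (includes four-phase completion detection) | Multiplier reset | 2.2 ns
Function block latencies (includes four-phase completion detection) | FIFO (evaluate or reset) | 0.7 ns
CD latency | LEDR completion detector | 1.3 ns (even), 0.9 ns (odd)
Overhead of converters | Input Converter | 0.2 ns
Overhead of converters | Output Converter | 0.5 ns
Overhead of converters | Control block (longest path) | 1.1 ns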

45 Performance Results
3 metrics (average values):
Forward latency: input arrival to output data available. Multiplier: 6.8 ns; FIFO: 2.9 ns.
Stabilization time: input arrival to reset complete (circuit quiescent). Multiplier: 10.5 ns; FIFO: 6.3 ns.
Pipelined cycle time: minimum processing time per data item (steady state). Multiplier: 8.3 ns; FIFO: 4.0 ns.
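The pipelined cycle times translate directly into throughput, since one data item is processed per cycle. A short worked conversion, with a hypothetical helper name:

```python
def throughput_mhz(cycle_time_ns):
    """Items per microsecond (MHz) for a given steady-state cycle time."""
    return 1000.0 / cycle_time_ns

fifo_mhz = throughput_mhz(4.0)        # FIFO pipelined cycle time
multiplier_mhz = throughput_mhz(8.3)  # multiplier pipelined cycle time
print(fifo_mhz, round(multiplier_mhz, 1))
```

The 4.0 ns FIFO cycle time corresponds to the 250 MHz figure quoted for the FIFO stage; the 8.3 ns multiplier cycle time works out to roughly 120 MHz.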

46 Performance Analysis
Forward latency overhead: 2.2 ns for both nodes; independent of function block size; includes LEDR CD, control unit, and input/output converters.
Throughput is increased by concurrency: 2.2 ns reduction in cycle time (vs. post-reset ACK); savings achieved even in an environment without channel latency.
“Core converter” overhead (no CD) is extremely low: only 1.1 ns average latency for converters plus control.
Completion detectors: account for half of forward latency overhead; account for 55% of FIFO cycle time; faster CDs would provide a big improvement.
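The reported figures can be cross-checked with simple arithmetic. This is a sketch under two assumptions that the slides do not state explicitly: forward latency is taken as function block latency plus the fixed overhead, and the multiplier's evaluate latency is taken as the midpoint of its 4.2 to 4.9 ns range.

```python
OVERHEAD_NS = 2.2                     # forward latency overhead per node

fifo_block_ns = 0.7                   # FIFO function block latency
mult_eval_ns = (4.2 + 4.9) / 2        # assumed midpoint of reported range

fifo_forward_ns = fifo_block_ns + OVERHEAD_NS
mult_forward_ns = mult_eval_ns + OVERHEAD_NS

core_converter_ns = 1.1               # converters + control, no CD
cd_part_ns = OVERHEAD_NS - core_converter_ns   # remainder attributed to CD

fifo_cycle_ns = 4.0
cd_in_cycle_ns = 0.55 * fifo_cycle_ns          # 55% of the FIFO cycle time

print(round(fifo_forward_ns, 2), round(mult_forward_ns, 2))
print(round(cd_part_ns, 2), round(cd_in_cycle_ns, 2))
```

Under these assumptions the FIFO forward latency comes to 2.9 ns and the multiplier to about 6.75 ns, in line with the reported 2.9 ns and 6.8 ns averages, and the completion detector accounts for 1.1 ns of the 2.2 ns overhead, that is, half.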

47 Outline
Motivation and Contribution
Proposed System Architecture
Experimental Results
Extensions: Other Signaling Styles (Converters for 1-of-4 function blocks; Converters for bundled data function block)
Conclusions and Future Work

48 Extensions to Other Local Protocols
Only small changes are needed to handle 1-of-4 or bundled data; no change to the control block.
1-of-4 encoding: input/output converters need small changes to logic; needs a standard 1-of-4 completion detector.
Single-rail bundled data: input converter not needed (use the LEDR data rail); output converter requires a new basic circuit (see paper for details); function block completion detection uses a bundled ‘done’ signal with an asymmetric delay chain (fast reset).

49 Outline
Background and Motivation
Contribution
Proposed System Architecture
Experimental Results
Extensions: Other Signaling Styles
Conclusions and Future Work (Summary and Conclusion; Future Work)

50 Summary and Conclusions
Support heterogeneous SOCs using hybrid protocols: LEDR for a low-power, delay-insensitive communication fabric; dual-rail four-phase for simple, fast logic blocks.
Designed converters for LEDR/four-phase SOC: low latency, high throughput, timing-robust design.
Robust concurrency system developed: exploits four-phase reset to mask communication time.
Simulations with realistic mid-sized function nodes: demonstrated low latency overhead; demonstrated low area overhead; achieved throughputs up to 250 MHz for the FIFO stage.

51 Future Work
Evaluating system-level benefits: determine design spaces where converters are most useful; quantify benefits over using either protocol exclusively; optimal partitioning of converter nodes; explore dependence on system topology.
Potential applications (use in async SOCs): Beigne/Vivet, GALS NoC Architectures (Async-06); Scott et al. (Intel/Silistix), PXA27x System (Async-07); Dobkin/Ginosar/Kolodny, fast LEDR serial links (Async-06/07): convert 4-phase dual-rail to LEDR (for parallel load).

