High-Throughput Asynchronous Pipelines for Fine-Grain Dynamic Datapaths Montek Singh and Steven Nowick Columbia University New York, USA

1 High-Throughput Asynchronous Pipelines for Fine-Grain Dynamic Datapaths Montek Singh and Steven Nowick Columbia University New York, USA {montek,nowick}@cs.columbia.edu http://www.cs.columbia.edu/~montek Intl. Symp. Adv. Res. Asynchronous Circ. Syst. (ASYNC), April 2-6, 2000, Eilat, Israel.

2 Outline
• Introduction
• Background: Williams’ PS0 pipelines
• New Pipeline Designs
  - Dual-Rail: LP3/1, LP2/2 and LP2/1
  - Single-Rail: LP SR 2/1
• Practical Issue: Handling slow environments
• Results and Conclusions

3 Why Dynamic Logic?
Potentially:
• Higher speed
• Smaller area
• “Latch-free” pipelines: the logic gate itself provides an implicit latch
  → lower latency
  → shorter cycle time
  → smaller area (very important in gate-level pipelining!)
→ Our Focus: Dynamic logic pipelines

4 How Do We Achieve High Throughput?
• Introduce novel pipeline protocols:
  - specifically target dynamic logic
  - reduce impact of handshaking delays → shorter cycle times
• Pipeline at very fine granularity:
  - “gate-level”: each stage is a single gate deep → highest throughputs possible
  - latch-free datapaths especially desirable → dynamic logic is a natural match

5 Prior Work: Asynchronous Pipelines
• Sutherland (1989), Yun/Beerel/Arceo (1996)
  - very elegant 2-phase control → expensive transition latches
• Day/Woods (1995), Furber/Liu (1996)
  - 4-phase control → simpler latches, but complex controllers
• Kol/Ginosar (1997)
  - double latches → greater concurrency, but area-expensive
• Molnar et al. (1997-99): two designs, asp* and micropipeline
  → both very fast, but:
  - asp*: complex timing, cannot handle latch-free dynamic datapaths
  - micropipeline: area-expensive, cannot do logic processing at all!
• Williams (1991), Martin (1997)
  - dynamic stages → no explicit latches!
  - low latency → throughput still limited

6 Background
• Introduction
→ Background: Williams’ PS0 pipelines
• New Pipeline Designs
  - Dual-Rail: LP3/1, LP2/2 and LP2/1
  - Single-Rail: LP SR 2/1
• Practical Issue: Handling slow environments
• Results and Conclusions

7 PS0 Pipelines (Williams 1986-91)
Basic Architecture:
[Figure: pipeline stages, each with a Function Block and a Completion Detector. Labels: Data in, Data out, PC.]

8 PS0 Function Block
Each output is produced using a dynamic gate:
[Figure: dynamic gate. Labels: pull-down stack, “keeper”, evaluation control, precharge control, PC, data inputs, data outputs, to completion detector.]

9 Dual-Rail Completion Detector
• OR together two rails of each bit
• Combine results using C-element
[Figure: OR gates for bit 0, bit 1, …, bit n feed a C-element (C) whose output is Done.]
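The two bullets on this slide can be sketched behaviorally in Python. This is a minimal model, not the authors' circuit: the class `CElement` and the function `completion_detector` are hypothetical names, and the C-element is modeled with its textbook behavior (output rises when all inputs are 1, falls when all are 0, and otherwise holds).

```python
class CElement:
    """Muller C-element: output follows the inputs only when they all agree."""

    def __init__(self, state=0):
        self.state = state

    def update(self, inputs):
        if all(inputs):          # every input high: output rises
            self.state = 1
        elif not any(inputs):    # every input low: output falls
            self.state = 0
        return self.state        # otherwise hold the previous value


def completion_detector(c_element, bits):
    """bits is a list of (true_rail, false_rail) pairs, one per dual-rail bit.

    A bit is valid when exactly one rail is high, so OR-ing the two rails
    gives a per-bit validity signal; the C-element combines them into Done.
    """
    valid = [t | f for t, f in bits]
    return c_element.update(valid)
```

In this model Done stays low while only some bits have evaluated and stays high while only some bits have precharged, which is the hysteresis a C-element provides.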

10 PS0 Protocol
  - PRECHARGE N: when N+1 completes evaluation
  - EVALUATE N: when N+1 completes precharging
Evaluate → Precharge: 3 events
Precharge → Evaluate: another 3 events
Complete cycle: 6 events
[Figure: timing diagram for stages N, N+1, N+2 with events numbered 1 to 6. Event labels: N evaluates; N+1 evaluates; N+1 indicates “done”; N+2 evaluates; N+2 indicates “done”; N+1 precharges; N+1 indicates “done”.]

11 PS0 Performance
[Figure: PS0 cycle with events numbered 1 to 6.]
Cycle Time = [expression lost in transcript]
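The cycle-time expression did not survive in this transcript. As a hedged reconstruction, summing the six events listed on the PS0 Protocol slide (three evaluations, two completion detections, one precharge) gives the expression below. The symbols $t_{Eval}$, $t_{Prech}$ and $t_{CD}$ are names assumed here for the per-stage evaluation, precharge and completion-detection delays; they are not taken from the slide.

```latex
T_{PS0} \;=\; 3\,t_{Eval} \;+\; 2\,t_{CD} \;+\; t_{Prech}
```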

12 New Pipeline Designs
• Introduction
• Background: Williams’ PS0 pipelines
→ New Pipeline Designs
  → Dual-Rail: LP3/1, LP2/2 and LP2/1
  - Single-Rail: LP SR 2/1
• Practical Issue: Handling slow environments
• Results and Conclusions

13 Overview of Approach
Our Goal: Shorter cycle time, without degrading latency
Our Approach: Use “Lookahead Protocols” (LP)
• main idea: anticipate critical events based on richer observation
Two new protocol optimizations:
  - “Early evaluation”: give a stage a head start on evaluation by observing events further down the pipeline (Williams proposed a similar idea in PA0, but our designs exploit it much better)
  - “Early done”: a stage signals “done” when it is about to precharge/evaluate

14 Dual-Rail Design #1: LP3/1
Uses “early evaluation”:
  - each stage now has two control inputs → the new input comes from two stages ahead
  - evaluate N as soon as N+1 starts precharging
[Figure: stages N, N+1, N+2. Labels: Data in, Data out, PC, Eval (from N+2).]

15 LP3/1 Protocol
  - PRECHARGE N: when N+1 completes evaluation
  - EVALUATE N: when N+2 completes evaluation (New! Enables “early evaluation”!)
[Figure: timing diagram for stages N, N+1, N+2 with events numbered 1 to 4. Event labels: N evaluates; N+1 evaluates; N+1 indicates “done”; N+2 evaluates; N+2 indicates “done”.]

16 LP3/1: Comparison with PS0
• PS0: 6 events in cycle
• LP3/1: only 4 events in cycle!
[Figure: side-by-side timing diagrams for stages N, N+1, N+2: PS0 (events 1 to 6) and LP3/1 (events 1 to 4).]

17 LP3/1 Performance
[Figure: LP3/1 cycle with events numbered 1 to 4; the saved path is marked.]
Cycle Time = [expression lost in transcript]
Savings over PS0: 1 Precharge + 1 Completion Detection
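The stated savings can be turned into a rough cycle-time sketch. Assume the PS0 cycle consists of its six listed events (three evaluations, two completion detections, one precharge), and let $t_{Eval}$, $t_{Prech}$ and $t_{CD}$ be assumed names for the corresponding per-stage delays. Removing one precharge and one completion detection then leaves three evaluations and one completion detection, which matches the four events in the LP3/1 cycle. Any extra delay from the gate that merges the two control inputs is neglected in this sketch.

```latex
T_{LP3/1} \;\approx\; \bigl(3\,t_{Eval} + 2\,t_{CD} + t_{Prech}\bigr) \;-\; \bigl(t_{Prech} + t_{CD}\bigr)
          \;=\; 3\,t_{Eval} \;+\; t_{CD}
```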

18 Inside a Stage: Merging Two Controls
A NAND gate combines the two control inputs:
  - Precharge when PC=1 (and Eval=0)
  - Evaluate “early” when Eval=1 (or PC=0)
[Figure: dynamic gate with pull-down stack and “keeper”; a NAND gate merges PC (from stage N+1) and Eval (from stage N+2).]
Problem: “early” Eval=1 is non-persistent! → it may get de-asserted before the stage has completed evaluation!
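The merging rule on this slide can be written as a small truth-table sketch. The function name `stage_control` and the output polarity (1 = evaluate, 0 = precharge) are assumptions made for illustration, as is the inversion of Eval at the NAND input; the slide only states when the stage precharges and when it evaluates.

```python
def stage_control(pc: int, eval_: int) -> int:
    """Merge the two control inputs of a stage.

    Returns 0 (precharge) only when PC=1 and Eval=0, and 1 (evaluate)
    when Eval=1 or PC=0. This is NAND(PC, NOT Eval); the polarity and
    the inverter on Eval are assumptions of this sketch.
    """
    return 1 - (pc & (1 - eval_))


# Full truth table: (PC, Eval) -> control
TRUTH_TABLE = {(pc, ev): stage_control(pc, ev) for pc in (0, 1) for ev in (0, 1)}
```

The table also shows why a non-persistent Eval matters: with PC still at 1, dropping Eval from 1 back to 0 flips the stage from evaluate to precharge.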

19 LP3/1 Timing Constraints: Example
Problem: “early” Eval=1 is non-persistent!
Observation: PC=0 arrives soon after Eval=1, and is persistent → use PC as a safe “takeover” for Eval!
Solution: no change!
Timing Constraint: PC=0 arrives before Eval=1 is de-asserted
  → simple one-sided timing requirement
  → other constraints as well… all easily satisfied in practice
[Figure: NAND gate with inputs PC (from stage N+1) and Eval (from stage N+2).]

20 Dual-Rail Design #2: LP2/2
Uses “early done”:
  - completion detector now before functional block → stage indicates “done” when about to precharge/evaluate
[Figure: stage with an “early” Completion Detector placed before the Function Block. Labels: Data in, Data out.]

21 LP2/2 Completion Detector
Modified completion detectors needed:
  - Done=1 when stage starts evaluating, and inputs valid
  - Done=0 when stage starts precharging
  → asymmetric C-element
[Figure: OR gates for bit 0, bit 1, …, bit n feed the “+” inputs of an asymmetric C-element (C), which also takes PC and produces Done.]
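The asymmetric behavior described on this slide can be sketched as follows. The class name `AsymmetricCElement` and the single `evaluating` flag (1 when the stage control says evaluate, 0 when it says precharge) are assumptions for illustration; the real gate takes PC directly, and its polarity is not recoverable from the transcript.

```python
class AsymmetricCElement:
    """Behavioral model of the 'early done' detector.

    Rising edge: needs the stage to be evaluating AND every "+" input
    (per-bit validity of the incoming data) to be high.
    Falling edge: depends only on the stage control, not on the "+" inputs.
    """

    def __init__(self):
        self.done = 0

    def update(self, evaluating, plus_inputs):
        if evaluating and all(plus_inputs):
            self.done = 1          # stage starts evaluating on valid inputs
        elif not evaluating:
            self.done = 0          # stage starts precharging
        return self.done           # otherwise hold the previous value
```

Because the "+" inputs take part only in the rising transition, Done stays high in this model even if the input data is withdrawn while the stage is still evaluating.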

22 LP2/2 Protocol
Completion detection occurs in parallel with evaluation/precharge.
[Figure: timing diagram for stages N, N+1, N+2 with events numbered 1 to 4. Event labels: N evaluates; N+1 “early done”; N+1 evaluates; N+2 “early done”.]

23 LP2/2 Performance
[Figure: LP2/2 cycle with events numbered 1 to 4.]
Cycle Time = [expression lost in transcript]
LP2/2 savings over PS0: 1 Evaluation + 1 Precharge
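As with the other designs, the stated savings suggest a rough cycle-time expression. Assume the PS0 cycle is three evaluations, two completion detections and one precharge, with $t_{Eval}$, $t_{Prech}$ and $t_{CD}$ as assumed names for the per-stage delays. Removing one evaluation and one precharge leaves four events, consistent with the four numbered events of the LP2/2 cycle. This is a sketch derived from the slide text, not the expression shown on the original slide.

```latex
T_{LP2/2} \;\approx\; \bigl(3\,t_{Eval} + 2\,t_{CD} + t_{Prech}\bigr) \;-\; \bigl(t_{Eval} + t_{Prech}\bigr)
          \;=\; 2\,t_{Eval} \;+\; 2\,t_{CD}
```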

24 Dual-Rail Design #3: LP2/1
Hybrid of LP3/1 and LP2/2. Combines:
  - early evaluation of LP3/1
  - early done of LP2/2
Cycle Time = [expression lost in transcript]
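To compare the dual-rail designs numerically, the sketch below computes cycle times from three per-stage delays. Everything here is an assumption layered on the slide text: the function `cycle_times` and its parameter names are hypothetical, the PS0 entry sums its six listed events, the LP3/1 and LP2/2 entries subtract the savings stated on their performance slides, and the LP2/1 entry is only an inference that combining both optimizations leaves two evaluations and one completion detection. Control-gate delays are ignored.

```python
def cycle_times(t_eval, t_prech, t_cd):
    """Return estimated cycle times for each pipeline style.

    All expressions are reconstructions, not values from the slides.
    """
    ps0 = 3 * t_eval + 2 * t_cd + t_prech          # six events per cycle
    return {
        "PS0": ps0,
        "LP3/1": ps0 - (t_prech + t_cd),           # saves 1 precharge + 1 completion detection
        "LP2/2": ps0 - (t_eval + t_prech),         # saves 1 evaluation + 1 precharge
        "LP2/1 (inferred)": 2 * t_eval + t_cd,     # assumption: both savings combined
    }


if __name__ == "__main__":
    # Hypothetical, unitless delays chosen only to exercise the function.
    for name, t in cycle_times(t_eval=2, t_prech=3, t_cd=4).items():
        print(f"{name}: {t}")
```

With all three delays set equal, this sketch gives a PS0 cycle twice as long as the inferred LP2/1 cycle; measured ratios depend on the real delays.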

25 New Pipeline Designs
• Introduction
• Background: Williams’ PS0 pipelines
→ New Pipeline Designs
  - Dual-Rail: LP3/1, LP2/2 and LP2/1
  → Single-Rail: LP SR 2/1
• Practical Issue: Handling slow environments
• Results and Conclusions

26 Single-Rail Design: LP SR 2/1
Derivative of LP2/1, adapted to single-rail:
• bundled-data: matched delays instead of completion detectors
• “Ack” to previous stages is “tapped off early” → once in evaluate (precharge), dynamic logic is insensitive to input changes
[Figure: pipeline stages, each with a matched delay element.]

27 Inside an LP SR 2/1 Stage
• PC and Eval are combined exactly as in LP3/1
• “done” is generated by an asymmetric C-element:
  - done=1 when stage evaluates, and data inputs valid
  - done=0 when stage precharges
[Figure: stage schematic. Labels: PC (from stage N+1), Eval (from stage N+2), NAND, aC, “req” in, “req” out, “ack”, data in, data out, matched delay, done.]

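28 LP SR 2/1 Protocol
[Figure: timing diagram for stages N, N+1, N+2 with events numbered 1 to 3. Event labels: N evaluates; N+1 evaluates; N+1 indicates “done”; N+2 evaluates; N+2 indicates “done”.]
Cycle Time = [expression lost in transcript]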

29 Practical Issue: Handling Slow Environments
We inherit a timing assumption from Williams’ PS0:
• Input (left) environment must precharge reasonably fast
Problem: If environment is stuck in precharge, all pipelines (incl. PS0) will malfunction!
Our Solution:
• Add a special robust controller for 1st stage
  → simply synchronizes input environment and pipeline
  → delays critical events until environment has finished precharge
  - Modular solution overcomes shortcoming of Williams’ PS0
  - No serious throughput overhead → real bottleneck is the slow environment!

30 Results and Conclusions
• Introduction
• Background: Williams’ PS0 pipelines
• New Pipeline Designs
  - Dual-Rail: LP3/1, LP2/2 and LP2/1
  - Single-Rail: LP SR 2/1
• Practical Issue: Handling slow environments
→ Results and Conclusions

31 Results
Designed/simulated FIFOs for each pipeline style
Experimental Setup:
  - design: 4-bit wide, 10-stage FIFO
  - technology: 0.6 µm HP CMOS
  - operating conditions: 3.3 V and 300 K

32 Comparison with Williams’ PS0
[Figure: results grouped as dual-rail and single-rail designs.]
• LP2/1: >2X faster than Williams’ PS0
• LP SR 2/1: 1.2 Giga items/sec

33 Comparison: LP SR 2/1 vs. Molnar FIFOs
LP SR 2/1 FIFO: 1.2 Giga items/sec
Adding logic processing to FIFO:
• simply fold logic into dynamic gate → little overhead
Comparison with Molnar FIFOs:
  - asp* FIFO: 1.1 Giga items/sec
    • more complex timing assumptions → not easily formalized
    • requires explicit latches, separate from logic!
    • adding logic processing between stages → significant overhead
  - micropipeline: 1.7 Giga items/sec
    • two parallel FIFOs, each only 0.85 Giga/sec
    • very expensive transition latches
    • cannot add logic processing to FIFO!
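The throughput figures on this slide translate directly into per-item cycle times, and the micropipeline figure is the sum of its two parallel FIFOs. The short sketch below does that arithmetic; the helper name `cycle_time_ns` is hypothetical, and the only inputs are the numbers quoted on the slide.

```python
def cycle_time_ns(giga_items_per_sec: float) -> float:
    """Cycle time in nanoseconds for a throughput given in Giga items/sec."""
    return 1.0 / giga_items_per_sec


lp_sr_21 = cycle_time_ns(1.2)          # LP SR 2/1 FIFO
asp_star = cycle_time_ns(1.1)          # asp* FIFO
per_fifo = cycle_time_ns(0.85)         # one of the two parallel micropipeline FIFOs
micropipeline_total = 2 * 0.85         # Giga items/sec for both FIFOs together

if __name__ == "__main__":
    print(f"LP SR 2/1: {lp_sr_21:.3f} ns per item")
    print(f"asp*: {asp_star:.3f} ns per item")
    print(f"micropipeline, single FIFO: {per_fifo:.3f} ns per item")
    print(f"micropipeline, combined: {micropipeline_total:.2f} Giga items/sec")
```

So 1.2 Giga items/sec corresponds to a cycle time of roughly 0.83 ns, while each individual micropipeline FIFO cycles more slowly than the LP SR 2/1 FIFO even though the pair together reaches 1.7 Giga items/sec.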

34 Practicality of Gate-Level Pipelining
When datapath is wide:
• Can often split into narrow “streams” → comp. det. fairly low cost!
• Use “localized” completion detector for each stream:
  - need to examine only a few bits → small fan-in
  - send “done” to only a few gates → small fan-out
[Figure: datapath width = 32 dual-rail bits; comp. det. fan-in = 2; “done” fan-out = 2.]

35 Conclusions
Introduced several new dynamic pipelines:
  - Use two novel protocols: “early evaluation” and “early done”
  - Especially suitable for fine-grain (gate-level) pipelining
  - Very high throughputs obtained:
    • dual-rail: >2X improvement over Williams’ PS0
    • single-rail: 1.2 Giga items/second in 0.6 µm CMOS
  - Use easy-to-satisfy, one-sided timing constraints
  - Robustly handle arbitrary-speed environments: overcome a major shortcoming of Williams’ PS0 pipelines
Recent Improvement: Even faster single-rail pipeline (WVLSI’00)

