Wagging Logic: Moore's Law will eventually fix it
Charlie Brej
APT Group, University of Manchester
14/07/2019 Group Talk
Introduction
- Quasi-Delay-Insensitive (QDI) approach
- Prove the high performance potential
- What is performance?
  - Latency
  - Throughput
- Why is async better?
  - Average case performance
  - Variability and data-dependent delays
  - Bit level pipelining
Forward Safe Guarding
- Ensure all wire pairs are cycled up and down
- QDI
[Diagram: logic gates guarded by C-elements]
Behaviour
- Viewpoint of a single output
- Many inputs
Behaviour
- All or nothing
- Synchronises inputs together
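The all-or-nothing behaviour matches that of a Muller C-element, the component whose delay is quoted elsewhere in the talk. As an illustration only (the class below is a hypothetical model, not the talk's simulator), a C-element changes its output only once every input agrees and otherwise holds its previous value:

```python
class CElement:
    """Muller C-element: the output follows the inputs only when all agree."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, inputs):
        if all(inputs):          # every input high: output rises
            self.output = 1
        elif not any(inputs):    # every input low: output falls
            self.output = 0
        # mixed inputs: hold the previous output
        return self.output

c = CElement()
steps = [(1, 0, 0), (1, 1, 0), (1, 1, 1), (0, 1, 1), (0, 0, 1), (0, 0, 0)]
trace = [c.update(s) for s in steps]
print(trace)  # [0, 0, 1, 1, 1, 0]
```

The output rises only after the last input has arrived and falls only after the last input has been withdrawn, so the slowest input sets the pace in both directions.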
Why is it so slow?
- Delays:
  - Gate: 1, C-element: 2
  - Stage data propagation: X
- Cycle time (times 2 for set and reset):
  - Forward guarding: 2X (a C-element for each gate)
  - Acknowledge propagation: 2X (a C-element for each fork; fork depth ~ gate depth)
- About eight times slower than worst case!
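One way to read the figures on this slide, sketched in Python. The serial model is an assumption: forward guarding (2X) and acknowledge propagation (2X) are taken to add up within each phase, the sum is doubled for set and reset, and the result is compared with a stage whose data propagation alone costs X. The function name and the value of `x` are hypothetical.

```python
def four_phase_cycle(x, forward_guarding=2, acknowledge=2, phases=2):
    """Cycle time in gate delays under the assumed serial model."""
    per_phase = forward_guarding * x + acknowledge * x  # 2X + 2X
    return phases * per_phase                           # set and reset

x = 10  # hypothetical stage data propagation, in gate delays
cycle = four_phase_cycle(x)
print(cycle, cycle / x)  # 80 8.0
```

Under this reading the ratio is 8 whatever the value of X, which agrees with the "about eight times slower" estimate.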
Why is four-phase so slow?
- Low latency
- Low throughput
- Only 1/8th of the system doing useful work
- Rest is resetting/completing
[Diagram: pipeline stages, one labelled "Workie" for every seven labelled "Sleepy"]
Solutions
- Ultra/Hyper/Super pipelining
  - Need 8 times finer pipelining
  - Impossible
  - Each latch adds to the latency
- Faster completion detection
  - Balanced treeing of C-elements
  - Arranging to suit arrival order
  - Backward guarding
  - Not even close to 8x improvement
Inspiration: Wagging Latches
- Alternate latch read/write
- Capacity of two latches
- Depth of one latch
Wagging Logic
- Apply same method to the logic
- Alternate logic allowing one to set while the other resets (precharges)
[Diagram: two copies of the logic, one in Set while the other is in Reset, then swapping]
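A rough Python model of why alternating copies helps. It is a deliberate simplification and not the simulator used in the talk: each copy of the logic is assumed to be busy for `cycle` time units per token (set plus reset), tokens are handed to the copies in strict rotation, and at most one token is issued every `issue` time units. All names and numbers are hypothetical.

```python
def finish_time(tokens, ways, cycle, issue=1):
    """Time at which the last copy finishes, under the assumed model."""
    free_at = [0] * ways   # when each copy has finished resetting
    now = 0                # earliest time the next token can be issued
    for i in range(tokens):
        copy = i % ways                    # strict rotation between copies
        start = max(now, free_at[copy])    # wait until this copy is reset
        free_at[copy] = start + cycle      # busy for a full set/reset cycle
        now = start + issue
    return max(free_at)

for ways in (1, 2, 8):
    print(ways, finish_time(tokens=8, ways=ways, cycle=8))
# 1 64
# 2 33
# 8 15
```

With a cycle eight times the issue interval, one copy serialises everything, two copies roughly halve the total time, and eight copies hide the reset phase entirely in this toy model.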
Wagging Logic
- Between wagging stages
  - No need to wag
  - No need to synchronize
- Wag only when communicating with non-wagging logic
Non FIFO Example
Duplicate the Logic
Connect to Complementary
A Harder Example
Duplicate the Logic
Connect to Complementary
Triplicate the Logic
Connect to the next on the list
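The figures for these steps are not reproduced here, so the following Python sketch shows only one plausible reading of the slide titles: with several copies of the logic, the output of each copy feeds the next copy in rotation. For two copies that is the complementary one. The function name and the index-based numbering are assumptions made for illustration.

```python
def wagging_connections(ways):
    """Map each copy's output to the copy whose input it feeds (assumed rotation)."""
    return {i: (i + 1) % ways for i in range(ways)}

print(wagging_connections(2))  # {0: 1, 1: 0}        duplicated: complementary
print(wagging_connections(3))  # {0: 1, 1: 2, 2: 0}  triplicated: next on the list
```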
Other example
Proof of the pudding
- Simple gate level simulation
  - My own simulator
  - Delays: C-element = 2, Gate = 1
- Example circuits
  - Fibonacci sequence generators
    - Vertically pipelined 64-bit ripple carry adder
    - Non-pipelined 8-bit ripple carry adder
  - 16-input XOR
  - Backward and forward guarded
- Relative measurements of speed, power, area
- 10,000 gate delays simulation
64-bit Fibonacci Performance
- Synchronous worst case: 74
8-bit Fibonacci Performance
- Synchronous worst case: 500
XOR Performance
- Synchronous worst/best case: 1250 (8 gate delays)
- Inc. flip-flop: 1000 (10 gate delays)
- Inc. timing margins
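The synchronous figures on this slide are consistent with counting operations completed within the 10,000 gate-delay simulation window: 10,000 / 8 = 1250 and 10,000 / 10 = 1000. That reading is inferred from the numbers rather than stated in the talk. A small Python check, with a hypothetical function name:

```python
SIM_WINDOW = 10_000  # gate delays simulated, as stated in the talk

def operations(gate_delays_per_op, window=SIM_WINDOW):
    """Whole operations completed in the window (assumed reading of the figures)."""
    return window // gate_delays_per_op

print(operations(8))   # 1250
print(operations(10))  # 1000
```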
Power Consumption
- Synchronous: 610
Area
Future work
- Larger and more complex designs
  - Small CPU
  - Layout
  - Silicon?
- Improve completion time
  - Current optimal wagging ~ 5
  - Target ~ 3
- Fully automated flow
  - Verilog input & output
  - Partitioning
Conclusions
- Matching and surpassing synchronous performance every time
- DI logic for performance
- Very expensive
  - 20 times more power
  - 5 times bigger (times wagging)
- Fastest logic on the planet!
  - Discounting increase in wire delays
  - Assuming other things will be able to keep up