Circuit Design for SRCMOS Asynchronous Wave Pipelines
Oliver Hauck
Integrated Circuits and Systems Lab, Departments of Computer Science and Electrical Engineering, Darmstadt University of Technology

2 Outline
- Pipelines: synchronous, asynchronous, wave pipelined, and asynchronous wave pipelined (AWP)
- Comparison: AWPs vs. synchronous, asynchronous, and synchronous wave pipelines
- AWP circuit design
- Conclusion

3 Pipelining
- Pipelining is the premier technique for better exploiting hardware and boosting the performance of VLSI chips
- Clocking overhead is a serious threat to deeply pipelined systems built on sub-micron CMOS processes and running at GHz frequencies

4 General Framework for Pipelines
[Figure: pipeline stage with Logic and Latch/Reg blocks; signals Data and Clk]

5 Some Notations...

6 General Relations

7 Synchronous Pipeline
[Figure: Logic and Latch/Reg blocks; signals Data and Clk]
- Throughput is determined by the longest logic path plus clock/register overhead
- Fine-grain pipelining allows high throughput at the cost of increased clock/register overhead
Negative side effects of gate-level pipelining:
- Increased latency, clock load/skew, power, area, and design time
- More area for clocking and registers than for logic
Implementation options:
- Register- vs. latch-based, explicit latches vs. latchless
- TSPC vs. local clocks derived from the global clock
- Static vs. dynamic, single-ended vs. dual-rail
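The trade-off between throughput and overhead can be illustrated with a minimal Python sketch. It assumes an idealized pipeline whose logic is cut into equally long stages and a fixed per-stage clock/register overhead; the function name, parameter names, and numbers are hypothetical and not taken from the slides.

```python
def sync_pipeline(total_logic_delay_ps, stages, overhead_ps):
    """Return (cycle_time_ps, latency_ps) for an evenly cut synchronous pipeline.

    Assumption: the logic splits into `stages` equal pieces and every stage
    pays the same clock/register overhead.
    """
    cycle = total_logic_delay_ps / stages + overhead_ps
    latency = stages * cycle
    return cycle, latency


if __name__ == "__main__":
    # Hypothetical numbers: 1600 ps of logic, 100 ps overhead per stage.
    for n in (1, 4, 16):
        cycle, latency = sync_pipeline(1600.0, n, 100.0)
        overhead_share = 100.0 / cycle
        print(f"{n:2d} stages: cycle {cycle:7.1f} ps, "
              f"latency {latency:7.1f} ps, overhead share {overhead_share:.0%}")
```

Under these assumptions, deeper pipelining shortens the cycle, but latency grows and the overhead takes up an ever larger share of each cycle.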

8 Asynchronous Pipeline
[Figure: Logic and Handshake blocks; signals Data, req_in, ack_in, req_out, ack_out]
Micropipeline (Sutherland 1989)
- Synchronous clock replaced by asynchronous handshaking
- Elastic operation: input and output rate may differ momentarily, and the pipeline will buffer
- Plug & Play composability
- Load on req and ack lines is distributed
- Used by Furber's group at Manchester U for AMULET1/2/3
- Operation is data dependent, saves power during idle
- As with fine-grain sync pipelines, throughput can be high; handshake causes high latency and backward stall
Implementation options:
- 4-phase (level) vs. 2-phase (event) protocol
- Bundled data (matched delay) vs. completion detection
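The difference between the two protocol options can be sketched by listing the signal transitions needed per data transfer. This is an illustrative Python sketch only; the function names and the "+"/"-" edge notation are assumptions and not part of the slides.

```python
def four_phase(transfers):
    """4-phase (level) protocol: req and ack both return to zero each transfer."""
    trace = []
    for _ in range(transfers):
        trace += ["req+", "ack+", "req-", "ack-"]
    return trace


def two_phase(transfers):
    """2-phase (event) protocol: every edge on req or ack is an event."""
    trace = []
    req = ack = 0
    for _ in range(transfers):
        req ^= 1                                  # one request event
        trace.append("req+" if req else "req-")
        ack ^= 1                                  # one acknowledge event
        trace.append("ack+" if ack else "ack-")
    return trace


if __name__ == "__main__":
    print("4-phase:", four_phase(2))
    print("2-phase:", two_phase(2))
```

In this model the 4-phase protocol needs four transitions per transfer and the 2-phase protocol needs two, at the cost of edge-sensitive control logic.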

9 Synchronous Wave Pipeline
[Figure: Wave Logic and Latch/Reg blocks; signals Data and Clk]
- Several data waves are simultaneously active in the logic
- Logic has to minimize delay variations over P, T, V corners
- Global clock is used with constructive skew to adjust phases
- Wave pipelining potentially gives higher throughput than conventional pipelines at decreased latency and with reduced clock load, area, and power
- However, tuning the logic and the delay elements is difficult

10 Wave Pipelining: A Short Outline
- Wave pipelining occurs when combinational logic is clocked faster than its latency would allow
- Several data waves are then active in the logic without being separated by storage elements
- Latency remains constant, and throughput is determined by delay differences rather than absolute delay
- The requirement for delay-balanced logic and the complicated timing are the main hurdles
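The statement that throughput depends on delay differences can be made concrete with a simplified timing model. The sketch below assumes the commonly used first-order bound that the clock period must exceed the spread between longest and shortest path delay plus a lumped overhead (setup, hold, skew); the names d_max_ps, d_min_ps, and overhead_ps are assumptions, and the slides' own notation is not preserved in this transcript.

```python
def wave_pipeline(d_max_ps, d_min_ps, overhead_ps):
    """Return (min_period_ps, waves_in_flight) for a simplified wave pipeline.

    Assumption: period >= (d_max - d_min) + overhead, where overhead lumps
    register setup/hold time and clock uncertainty into a single number.
    """
    min_period = (d_max_ps - d_min_ps) + overhead_ps
    waves_in_flight = d_max_ps / min_period   # roughly latency / period
    return min_period, waves_in_flight


if __name__ == "__main__":
    # Hypothetical block: 1600 ps longest path, 100 ps overhead.
    for d_min in (0.0, 800.0, 1400.0):
        period, waves = wave_pipeline(1600.0, d_min, 100.0)
        print(f"d_min {d_min:6.1f} ps: period >= {period:6.1f} ps, "
              f"about {waves:.1f} waves in flight")
```

With unbalanced logic (shortest path near zero) the bound collapses to the conventional clock period; the better the paths are balanced, the more waves fit into the logic.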

11 Wave Pipelining: A Little History
- The technique stems from the 60s and has had a reputation for being exotic ever since
- Wave pipelining was long dead before being revived by W. Burleson (U. Mass.), M. Flynn (Stanford U., PhDs by Wong, Klass, and Nowka), and C. Gray at NCSU
- Some working academic chips exist, mainly datapath
- Some commercial memory is wave pipelined (e.g. ULTRA-III cache), but no logic, as far as we know

12 Asynchronous Wave Pipeline (AWP)
[Figure: Wave Logic and Wave Latch blocks; signals Data, req_in, req_out; matched delay element on the request line]
- Data words are associated with events on the request line
- Several data waves and protocol events are simultaneously active in the logic and the matched delay element, respectively
- An AWP is a special case of the sync wave pipeline with the constructive skew set to the worst-case logic delay
- It is crucial that the delay element accurately tracks the delay behaviour of the logic over P, T, V corners
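Why the delay element has to track the logic can be sketched with a simple window check. The model below is an assumption, not the author's timing analysis: a wave launched at time 0 is taken to be valid at the output from d_max until period + d_min (when the next wave may arrive), and the request edge must fall inside that window. Setup and hold times are ignored, and the corner scale factors are made up.

```python
def request_captures_wave(d_max, d_min, d_req, period):
    """True if the delayed request edge falls inside the data-valid window."""
    return d_max <= d_req <= period + d_min


def tracks_over_corners(d_max, d_min, d_req, period, corners):
    """Check the window at each corner.

    corners maps a name to (logic_scale, delay_element_scale); both scale
    factors are hypothetical stand-ins for P, T, V variation.
    """
    results = {}
    for name, (logic_scale, delay_scale) in corners.items():
        results[name] = request_captures_wave(
            d_max * logic_scale, d_min * logic_scale,
            d_req * delay_scale, period)
    return results


if __name__ == "__main__":
    corners = {
        "typical": (1.0, 1.0),
        "slow, tracking": (1.3, 1.3),
        "fast, tracking": (0.8, 0.8),
        "slow logic, delay element not tracking": (1.3, 1.0),
    }
    # Hypothetical delays in ps: logic 900..1000, matched delay 1050.
    for name, ok in tracks_over_corners(1000.0, 900.0, 1050.0, 400.0,
                                        corners).items():
        print(f"{name}: {'ok' if ok else 'request misses the data wave'}")
```

In this model a larger period only widens the window, which is consistent with AWPs running at any rate below the maximum, while a delay element that fails to slow down with the logic fires the request too early.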

13 AWPs vs. Synchronous Pipelines
- No global clock; instead, a local clock (request) is fed through the pipeline and obeys a simple asynchronous protocol, i.e. data is associated with an event on request
- Many pipeline registers are removed, thus requirements on the clock (request) are relaxed
- Synchronous pipelines can reach the throughput of AWPs only at excessive cost in area, power, and latency

14 AWPs vs. Asynchronous Pipelines
- AWPs deliberately sacrifice the ack and keep only the req to avoid protocol overhead
- AWPs are not elastic: data at the output has to be consumed
- AWPs eliminate hazards as a side effect of delay balancing
- AWPs have in common with other async methodologies: data-dependent operation (avoids redundant transitions), composability (though inelastic), no global clock

15 AWPs vs. Synchronous Wave Pipelines
AWPs tackle two main difficulties in sync wave pipes:
- Replacing the constructive skew by the worst-case delay removes the double-sided timing constraint, i.e. in contrast to sync wave pipes, AWPs operate at any rate
- Using dynamic self-resetting logic controls delay variation and doesn't impact latency much

16 Wave Pipelining Combinational Logic
- Overall goal: keep the data wave coherent under all possible conditions (data, PTV)
- Desirable architecture features: most logic paths have the same depth; fanin/fanout are the same everywhere
- First step: pad all short paths to maximum length
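The first step, padding short paths, can be sketched as a levelization pass over the gate netlist. This is a minimal Python sketch assuming unit delay per gate and buffer; the netlist representation (a dict from gate name to its driver names) and the function name are hypothetical.

```python
def pad_to_uniform_depth(fanin, outputs):
    """Compute buffer counts so every input-to-output path has equal depth.

    fanin:   dict mapping gate name -> list of driver names; names that do
             not appear as keys are treated as primary inputs (level 0).
    outputs: list of gate names that are primary outputs.
    Returns (edge_buffers, output_buffers, target_depth).
    Assumption: every gate and every buffer counts as one level.
    """
    level = {}

    def depth(node):
        if node not in level:
            drivers = fanin.get(node, [])
            level[node] = 1 + max(depth(d) for d in drivers) if drivers else 0
        return level[node]

    target = max(depth(o) for o in outputs)
    # A gate at level L needs all its inputs to arrive from level L - 1.
    edge_buffers = {(d, g): depth(g) - 1 - depth(d)
                    for g, drivers in fanin.items() for d in drivers}
    # Outputs that finish early are padded up to the deepest output.
    output_buffers = {o: target - depth(o) for o in outputs}
    return edge_buffers, output_buffers, target


if __name__ == "__main__":
    netlist = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["a"]}
    edges, outs, target = pad_to_uniform_depth(netlist, ["g2", "g3"])
    print("target depth:", target)
    print("buffers per edge:", edges)
    print("buffers after outputs:", outs)
```

In the example, input c needs one buffer before g2 and output g3 needs one buffer after it, so that all paths are two levels deep. Equal depth is only the starting point; the gates in each level still have to be delay matched.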

17 Example: 64-b Brent-Kung Parallel Adder
[Figure: adder schematic; labels pg, PG, G, xor, xor; columns 0 1 2 3 4]
Buffers provide for the same depth on every logic path
All gates in the same column must have the same delay
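For reference, the Brent-Kung prefix structure itself can be sketched at the bit level in Python. This is a functional model of the textbook algorithm (up-sweep followed by down-sweep over generate/propagate pairs), not the author's circuit; it does not model buffers, SRCMOS gates, or the column arrangement of the figure, and the function name is hypothetical.

```python
def brent_kung_add(a, b, width=64):
    """Add two unsigned integers with a Brent-Kung prefix network.

    Returns (sum modulo 2**width, number of prefix levels).
    Assumptions: width is a power of two and the carry-in is 0.
    """
    p = [((a ^ b) >> i) & 1 for i in range(width)]   # bit propagate
    g = [((a & b) >> i) & 1 for i in range(width)]   # bit generate
    G, P = g[:], p[:]
    levels = 0

    # Up-sweep: build group signals over blocks of 2, 4, 8, ... bits.
    d = 1
    while d < width:
        for i in range(2 * d - 1, width, 2 * d):
            G[i], P[i] = G[i] | (P[i] & G[i - d]), P[i] & P[i - d]
        d *= 2
        levels += 1

    # Down-sweep: fill in the remaining prefixes from the block results.
    d = width // 4
    while d >= 1:
        for i in range(3 * d - 1, width, 2 * d):
            G[i], P[i] = G[i] | (P[i] & G[i - d]), P[i] & P[i - d]
        d //= 2
        levels += 1

    carries = [0] + G[:-1]                 # carry into bit i is G[i-1:0]
    total = sum((p[i] ^ carries[i]) << i for i in range(width))
    return total, levels


if __name__ == "__main__":
    total, levels = brent_kung_add(0x0123456789ABCDEF, 0x0FEDCBA987654321)
    print(hex(total), "in", levels, "prefix levels")
```

The prefix network has 2*log2(width) - 1 levels, and many bit positions skip most of them, which is where the padding buffers mentioned on the slide come in.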

18 Circuits
- The logic style used has to minimize delay variation
- Earlier work focused on bipolar logic (ECL, CML), but CMOS is mainstream
- Static CMOS is not well suited for wave pipelining; fixing the problem results in more power and slower speed
- Pass transistor logic gives sloppy edges, thereby introducing delay variation
- Dynamic logic is attractive because only the output high transition is data dependent; the output pulldown is done by precharge

19 Circuits (cont.)
- Using dynamic logic as in Burleson's Wave Domino jeopardizes the concept, as it needs fine-grain precharge
- What is needed is a dynamic logic family without precharge overhead: SRCMOS
- Work done at IBM: classic paper by Chappell et al., "A 2-ns Cycle, 3.8-ns Access 512-kb CMOS ECL SRAM with a Fully Pipelined Architecture," JSSC 26(11), 1991; or, more recently, Hwang et al., "Implementation of a Self-Resetting CMOS 64-Bit Parallel Adder with Enhanced Testability," JSSC 34(8), 1999

20 SRCMOS
- Distinguishing property of our SRCMOS circuits: precharge feedback is fully local, and NMOS trees are delay balanced
[Figure: SRCMOS gate with N inputs and one output]

21 Operation of a 2-AND

22 Delay Balancing at Transistor Level
- The NMOS tree is designed so that the precharge node is pulled down by a constant number of series devices
- Short paths are padded with dummy devices
- Delay variation is minimal when exactly one path is on, i.e. wide fanin ORs are hard to use
- Every output has to see the same load
- Lightly loaded outputs are given dummy capacitance

23 Example: Carry tree in a 64-bit adder

24 Gim Layout

25 Simulation of Gim cell
- Pulses for the 4 possible input situations giving '1' at the output are tightly matched
- Note: in this case, Pxy and Gxy are never both 1

26 First Pulse Problem

27 Miller Effect

28 64-bit Adder Output Waveforms
[Figure: adder output waveforms with the latching window marked]

29 Transistor Sizing
[Figure: SRCMOS gate with N inputs and one output; labels Wpd, Wkeeper, Wprecharge, Cdrive, Cload, Cfeedback]
Wpd / Cdrive = const
Cdrive / (Cload + Cfeedback + Wkeeper) = const
Cfeedback / Wprecharge = const
Wprecharge / Cdrive = const
LINEAR SIZING
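One way to read the four ratio rules is as a small linear system that fixes every size once the load is known. The Python sketch below solves that system under loud assumptions: all quantities are in arbitrary normalized units (widths and capacitances are mixed exactly as on the slide), the ratio constants are invented, and the keeper is assumed to scale with Cdrive (Wkeeper = k_keep * Cdrive), which the slide does not state.

```python
def size_gate(c_load, k_pd=2.0, k_stage=0.25, k_fb=0.5, k_pre=0.4, k_keep=0.1):
    """Solve the ratio rules for one gate, given its load.

    Rules (constants are hypothetical):
      Wpd        = k_pd    * Cdrive
      Cdrive     = k_stage * (Cload + Cfeedback + Wkeeper)
      Cfeedback  = k_fb    * Wprecharge
      Wprecharge = k_pre   * Cdrive
      Wkeeper    = k_keep  * Cdrive      (assumption, not on the slide)
    """
    loop = k_stage * (k_fb * k_pre + k_keep)
    if loop >= 1.0:
        raise ValueError("ratios have no solution: self-loading exceeds drive")
    c_drive = k_stage * c_load / (1.0 - loop)
    w_precharge = k_pre * c_drive
    return {
        "Cdrive": c_drive,
        "Wpd": k_pd * c_drive,
        "Wprecharge": w_precharge,
        "Cfeedback": k_fb * w_precharge,
        "Wkeeper": k_keep * c_drive,
    }


if __name__ == "__main__":
    for load in (10.0, 20.0, 40.0):
        sizes = size_gate(load)
        print(load, {name: round(value, 3) for name, value in sizes.items()})
```

Under these assumptions every size comes out proportional to Cload, so doubling the load doubles every device, which is one plausible reading of "linear sizing".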

30 Interconnect: Resistive Effects
- 0.9 µm x 900 µm MET2 parasitics: C = 116 fF, R = 70 Ohms
[Figure: wire models compared: C only; RC only; R/2, R/2; R/3, R/3, R/3]
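The effect of splitting the wire into more RC segments can be estimated with the Elmore delay. The Python sketch below uses the R and C values from the slide, but the ladder topology (each R/N segment followed by a C/N capacitor to ground, ideal driver, no receiver load by default) is an assumption and may differ from the models simulated by the author.

```python
def elmore_delay_ladder(r_total, c_total, segments, c_load=0.0):
    """Elmore delay of a wire modeled as `segments` equal RC sections.

    Assumption: each section is a series R/N followed by a shunt C/N;
    c_load is an optional lumped capacitance at the far end.
    """
    r_seg = r_total / segments
    c_seg = c_total / segments
    delay = 0.0
    for i in range(1, segments + 1):
        # Resistor i charges every capacitor from node i to the far end.
        downstream = c_seg * (segments - i + 1) + c_load
        delay += r_seg * downstream
    return delay


if __name__ == "__main__":
    R, C = 70.0, 116e-15            # values from the slide
    for n in (1, 2, 3, 100):
        print(f"{n:3d} segments: {elmore_delay_ladder(R, C, n) * 1e12:.2f} ps")
```

In this model a single lumped RC section overestimates the wire delay, and the estimate moves toward R*C/2 as the number of segments grows.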

31 Interconnect: Coupling Effects
- 2 adjacent MET2 lines coupled by C = 54 fF

32 PTV Variations
- SRCMOS provides some robustness by generating fresh pulses at every gate output
- Pulsed operation reduces data dependency and coupling
- PTV noise is not critical when the drift is in the same direction across the die
- Critical are: temperature gradients, supply drop, and local variations
- What is needed: a rule of thumb like "For process X, to be on the safe side, keep the area between two latches < Y sqmm"

33 Conclusion
- AWPs were presented as an alternative approach to high-speed design that shows potential for GHz throughput without clocks
- AWPs avoid some problems of conventional wave pipes and (a)synchronous systems
- 64b adder + test circuit and EC crypto layout are in the making
- Not covered here: feedback + controllers
- To do: support for transistor sizing
