1 Clockless Logic Montek Singh Tue, Mar 21, 2006.

2 Dynamic Logic Pipelines (contd.): Drawbacks of Williams’ PS0 Pipelines; Lookahead Pipelines; High-Capacity Pipelines

3 Drawbacks of PS0 Pipelining. 1. Poor throughput: long cycle time (6 events per cycle); data “tokens” are forced far apart in time. 2. Limited storage capacity: at most 50% of stages can hold distinct tokens, because data tokens must be separated by at least one spacer. Our research goals: address both issues while still maintaining very low latency.
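
The 50% capacity limit can be illustrated with a small counting sketch. It is not from the slides: the helper name `max_tokens` and its `spacers_between` parameter are hypothetical, and the model simply counts each token together with the spacer stage(s) behind it.

```python
def max_tokens(n_stages: int, spacers_between: int = 1) -> int:
    """Number of distinct data tokens an n-stage pipeline can hold when
    every token must be followed by `spacers_between` spacer stages.
    Hypothetical helper for illustration only."""
    if n_stages < 0 or spacers_between < 0:
        raise ValueError("arguments must be non-negative")
    # Each token is counted together with the spacer stage(s) behind it.
    return n_stages // (spacers_between + 1)


# One spacer per token (PS0-style): half of the stages hold data.
ps0_capacity = max_tokens(10, spacers_between=1)   # 5
# No spacers needed: every stage can hold a distinct token.
full_capacity = max_tokens(10, spacers_between=0)  # 10
```

With one spacer per token the 10-stage example holds 5 tokens; a protocol that needs no spacers could hold 10.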

4 Recent Approaches. 3 novel styles for high-speed async pipelining: MOUSETRAP Pipelines [Singh/Nowick, TAU-00, ICCD-01]; “Lookahead Pipelines” (LP) [Singh/Nowick, Async-00]; “High-Capacity Pipelines” (HC) [Singh/Nowick, WVLSI-00]. Goal: significantly improve throughput of PS0. Two distinct strategies: LP: introduce protocol optimizations that “shave off” components from the critical cycle; HC: a fundamentally new protocol with greater concurrency (“loosely-coupled” stages).

5 Outline. New Asynchronous Pipelines: MOUSETRAP Pipelines (static circuit style); → Lookahead Pipelines (LP) and High-Capacity Pipelines (HC) (dynamic circuit style).

6 Lookahead Pipeline Styles [Singh/Nowick, Async-2000]

7 Lookahead Pipelines: Strategy #1. Use non-neighbor communication: a stage receives information from multiple later stages, which allows “early evaluation”. Benefit: the stage gets a head start on its next cycle.

8 Lookahead Pipelines: Strategy #2. Use early completion detection: the completion detector is moved before the stage (not after), so the stage indicates “early done” in parallel with its computation. Benefit: again, the stage gets a head start on its next cycle. (Figure: early completion detector.)

9 Lookahead Pipelines: Overview. 5 new designs. → “Dual-Rail” data signaling: LP3/1 (“early evaluation”); LP2/2 (“early done”); LP2/1 (“early evaluation” + “early done”). “Single-Rail” bundled-data signaling: LP SR 2/2 (“early done”); LP SR 2/1 (“early evaluation” + “early done”).

10 Dual-Rail Design #1: LP3/1. Optimization = “early evaluation”: each stage has two control inputs, from stages N+1 and N+2. Idea: shorten the precharge phase by terminating precharge early, when N+2 is done evaluating. (Figure: stages N, N+1, N+2, each a processing block with a completion detector; data in, data out; control inputs PC and Eval, the latter from N+2.)

11 LP3/1 Protocol. PRECHARGE N: when N+1 completes evaluation. EVALUATE N: when N+2 completes evaluation (new: enables “early evaluation”). Events in the figure (stages N, N+1, N+2): N evaluates; N+1 evaluates; N+1 indicates “done”; N+2 evaluates; N+2 indicates “done”.

12 LP3/1: Comparison with PS0. PS0: PRECHARGE N when N+1 completes evaluation; EVALUATE N when N+1 completes precharging; 6 events in cycle. LP3/1: PRECHARGE N when N+1 completes evaluation; EVALUATE N when N+2 completes evaluation (enables “early evaluation”); only 4 events in cycle.

13 LP3/1 Performance. Cycle Time = [equation shown in figure]. Savings over PS0 (the “saved path” in the figure): 1 Precharge + 1 Completion Detection.

14 LP3/1: Inside a Stage. Merging 2 control inputs: PC (from stage N+1) and Eval (from stage N+2) are combined by a NAND gate (figure labels: “early Eval”, “old Eval”). Timing issues: must satisfy several simple constraints. Ex.: PC must arrive before Eval is de-asserted, a 1-sided timing requirement that is easily satisfied in practice.
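
A minimal sketch of how the two control inputs might combine, and of why the ordering constraint matters. The boolean rule below (precharge while N+1 is done and N+2 is not yet done) is one reading of the protocol slides; the signal polarities, the function names, and the event traces are assumptions made for illustration, not the actual gate-level design.

```python
def lp31_precharge(done_n1: bool, done_n2: bool) -> bool:
    """Assumed control rule for stage N: precharge while N+1 has finished
    evaluating (done_n1) and N+2 has not yet finished (done_n2 is False)."""
    return done_n1 and not done_n2


def phases(events):
    """Replay (done_n1, done_n2) pairs and name stage N's phase after each."""
    return ["precharge" if lp31_precharge(d1, d2) else "evaluate"
            for d1, d2 in events]


# Intended order: N+1 done, N+2 done, N+1 withdraws done, N+2 withdraws done.
good = [(False, False), (True, False), (True, True), (False, True), (False, False)]
# Ordering violated: N+2 withdraws its done before N+1 does.
bad = [(False, False), (True, False), (True, True), (True, False), (False, False)]

good_trace = phases(good)  # a single precharge, ended early by N+2's done
bad_trace = phases(bad)    # a second, spurious precharge appears
```

In the well-ordered trace stage N precharges once and then stays in evaluate; in the violating trace the same rule produces an unwanted second precharge, which is the kind of hazard a one-sided timing constraint rules out.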

15 Dual-Rail Design #2: LP2/2. Optimization = “early done”. Idea: move the completion detector before the processing block, so the stage indicates when it is “about to” precharge/evaluate. (Figure: data in, “early” completion detector producing “early done”, processing block, data out.)

16 LP2/2 Completion Detector. Modified completion detectors needed: Done = 1 when the stage starts evaluating and its inputs are valid; Done = 0 when the stage starts precharging. Implemented with an asymmetric C-element fed by PC and by one OR gate per dual-rail bit (bit 0, bit 1, …, bit n).
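
The set/reset behaviour described above can be sketched behaviourally. This is an illustrative model, not the circuit: the class name `EarlyCompletionDetector`, the `evaluating` flag (standing in for the PC input, whose polarity is not given here), and the `(rail0, rail1)` pair encoding are all assumptions.

```python
from typing import Sequence, Tuple

DualRailBit = Tuple[int, int]  # assumed encoding: (rail0, rail1)


def inputs_valid(bits: Sequence[DualRailBit]) -> bool:
    """A dual-rail bit is valid when one of its rails is high (the OR gates)."""
    return all(bool(r0 or r1) for r0, r1 in bits)


class EarlyCompletionDetector:
    """Asymmetric behaviour: Done rises only when the stage is evaluating AND
    all inputs are valid; it falls as soon as the stage starts precharging,
    regardless of the data inputs; otherwise it holds its value."""

    def __init__(self) -> None:
        self.done = False

    def update(self, evaluating: bool, bits: Sequence[DualRailBit]) -> bool:
        if not evaluating:
            self.done = False          # precharge alone resets Done
        elif inputs_valid(bits):
            self.done = True           # evaluate + valid inputs sets Done
        return self.done               # any other combination: hold


det = EarlyCompletionDetector()
trace = [
    det.update(True, [(0, 0), (0, 1)]),   # one bit still empty: Done stays 0
    det.update(True, [(1, 0), (0, 1)]),   # all bits valid: Done rises
    det.update(True, [(0, 0), (0, 0)]),   # inputs removed: Done holds
    det.update(False, [(1, 0), (0, 1)]),  # precharge starts: Done falls
]
```

The asymmetry is the point of the sketch: the data inputs take part only in the rising transition, while the precharge control alone forces the falling one.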

17 LP2/2 Protocol. Completion detection is performed in parallel with the evaluation/precharge of the stage. Events in the figure (stages N, N+1, N+2): N evaluates; N+1 evaluates; “early done” of N+1 eval; “early done” of N+2 eval; “early done” of N+1 prech.

18 LP2/2 Performance. Cycle Time = [equation shown in figure]. LP2/2 savings over PS0: 1 Evaluation + 1 Precharge.
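
The stated savings can be turned into a rough cycle-time comparison. The sketch below assumes that the six events of a PS0 cycle are 3 evaluations, 2 completion detections and 1 precharge; that breakdown, the names `Delays`, `ps0_cycle`, `lp31_cycle` and `lp22_cycle`, and the delay values are assumptions for illustration, and small extra gate delays (such as the gate that merges control inputs in LP3/1) are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delays:
    t_eval: float   # evaluation delay of one stage
    t_prech: float  # precharge delay of one stage
    t_cd: float     # completion-detector delay


def ps0_cycle(d: Delays) -> float:
    # Assumed PS0 cycle: 3 evaluations + 2 completion detections + 1 precharge.
    return 3 * d.t_eval + 2 * d.t_cd + d.t_prech


def lp31_cycle(d: Delays) -> float:
    # LP3/1 savings over PS0: 1 precharge + 1 completion detection.
    return ps0_cycle(d) - (d.t_prech + d.t_cd)


def lp22_cycle(d: Delays) -> float:
    # LP2/2 savings over PS0: 1 evaluation + 1 precharge.
    return ps0_cycle(d) - (d.t_eval + d.t_prech)


# Placeholder delays in arbitrary units, not measured values.
d = Delays(t_eval=1.0, t_prech=1.0, t_cd=0.5)
cycles = {"PS0": ps0_cycle(d), "LP3/1": lp31_cycle(d), "LP2/2": lp22_cycle(d)}
```

With these placeholder numbers the cycle shrinks from 5.0 units (PS0) to 3.5 (LP3/1) and 3.0 (LP2/2); which style wins in practice depends on the relative size of the completion-detection and evaluation delays.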

19 Dual-Rail Design #3: LP2/1. Hybrid of LP3/1 and LP2/2; combines the early evaluation of LP3/1 with the early done of LP2/2. Cycle Time = [equation shown in figure].

20 Lookahead Pipelines: Overview. 5 new designs. “Dual-Rail” data signaling: LP3/1 (“early evaluation”); LP2/2 (“early done”); LP2/1 (“early evaluation” + “early done”). → “Single-Rail” bundled-data signaling: LP SR 2/2 (“early done”); LP SR 2/1 (“early evaluation” + “early done”).

21 Single-Rail Design: LP SR 2/1. Derivative of LP2/1, adapted to single-rail bundled data: matched delays instead of completion detectors. The “Ack” to previous stages is “tapped off early”: once in evaluate (precharge), dynamic logic is insensitive to input changes.

22 Inside an LP SR 2/1 Stage. PC and Eval are combined exactly as in LP3/1. “done” is generated by an asymmetric C-element (aC): done = 1 when the stage evaluates and its data inputs are valid; done = 0 when the stage precharges. (Figure labels: PC (from stage N+1), Eval (from stage N+2), NAND, aC, “ack”, “req” in, data in, data out, “req” out, matched delay, done.)

23 LP SR 2/1 Protocol. Cycle Time = [equation shown in figure]. Events in the figure (stages N, N+1, N+2): N evaluates; N+1 evaluates; N+1 indicates “done”; N+2 evaluates; N+2 indicates “done”.

24 Results. Designed/simulated FIFOs for each pipeline style. Experimental setup: design: 4-bit wide, 10-stage FIFO; technology: 0.6 µm HP CMOS (old!); operating conditions: 3.3 V and 300 K.

25 FIFO Results (simulations), dual-rail and single-rail. LP dual-rail: over 80% faster than Williams’ PS0, with comparable latency. LP single-rail: even faster. (0.6 µm HP CMOS, 3.3 V, 300 K.)

26 Practicality of Gate-Level Pipelining. When the datapath is wide: can often split it into narrow “streams”; use a “localized” completion detector for each stream, which needs to examine only a few bits (small fan-in) and sends “done” to only a few gates (small fan-out), so completion detection is fairly low cost. (Figure: datapath width = 32 dual-rail bits; completion detector fan-in = 2; done fan-out = 2.)

27 High-Capacity Pipelines [Singh/Nowick, WVLSI-00, ISSCC-02, Async-02]

28 HC Pipeline Style. High-Capacity Pipelines (HC): bundled datapaths with dynamic logic function blocks; latch-free: no explicit latches needed, since dynamic logic provides implicit latching; a novel highly-concurrent protocol maximizes storage capacity (in traditional latch-free approaches, “spacers” limit capacity to 50%). Key idea: obtain greater control of a stage’s operation through separate control of pull-up and pull-down; the result is a new “isolate phase” in which the stage holds its outputs and is impervious to input changes. Advantage: each stage can hold a distinct data item → 100% storage capacity. Extra benefit: greater concurrency → high throughput.

29 HC: Basic Structure. Key idea: 2 independent control signals: pc controls precharge; eval controls evaluation. This allows a novel 3-phase cycle: Evaluate, “Isolate” (hold), Precharge. Single-rail “bundled datapath”: a matched delay produces a delayed “done” signal; its worst-case delay is longer than the slowest path for data. (Figure: stages N, N+1, N+2, each with a stage controller driving pc and eval; matched delay elements; ack from the next stage.)

30 HC: Inside a Stage. Independent control of pull-up and pull-down allows a new 3rd phase, “isolate”: pc asserted: precharge; eval asserted: evaluate; pc and eval both de-asserted: enter “isolate” (hold) phase. (Figure: dynamic gate with inputs and outputs; pc controls precharge, eval controls evaluation; “keeper” on the output.)
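
The three phases can be written as a small decoder over the two control signals. This is a sketch: the names `Phase` and `hc_phase` are hypothetical, the arguments are logical "asserted" flags rather than wire levels, and treating both signals asserted at once as an error is an assumption (pull-up and pull-down would then be on together).

```python
from enum import Enum


class Phase(Enum):
    PRECHARGE = "precharge"
    EVALUATE = "evaluate"
    ISOLATE = "isolate"


def hc_phase(pc_asserted: bool, eval_asserted: bool) -> Phase:
    """Map the two independent control signals to the stage's phase."""
    if pc_asserted and eval_asserted:
        # Assumption: never intended, pull-up and pull-down would fight.
        raise ValueError("pc and eval must not be asserted together")
    if pc_asserted:
        return Phase.PRECHARGE
    if eval_asserted:
        return Phase.EVALUATE
    return Phase.ISOLATE  # both de-asserted: hold outputs, ignore inputs


# One pass through the 3-phase cycle: evaluate, isolate, precharge.
cycle = [hc_phase(False, True), hc_phase(False, False), hc_phase(True, False)]
```

A conventional dynamic stage with a single control signal has only two of these states, so it cannot hold its output while ignoring its inputs.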

31 HC: Protocol. Most existing protocols: 3 synchronization arcs (1 forward arc: data dependency; 2 backward arcs: control synchronization). Our protocol: only 2 synchronization arcs, with only 1 backward arc: once stage N+1 evaluates, N can complete its entire next cycle! (Figure: stages N and N+1 each cycle through Eval, Isolate, Precharge; labels pc=1 eval=1, pc=1 eval=0, pc=0 eval=0; one arc is crossed out.)

32 Formal Specification of Controller. Problem: the specification is too concurrent for direct synthesis. Desired precharge condition: N and N+1 have evaluated the same data. Problem: this condition is not uniquely captured by the given signals! N may evaluate its next data item while N+1 is stuck on the current item. (Specification signals: eval+ (start evaluate), S+ (evaluate complete), eval- (isolate), pc+ (start precharge), S- (precharge complete), pc-; T+ (evaluate of N+1 complete), T- (precharge of N+1 complete).)

33 Modified Specification of Controller. Solution: add a state variable, ok2pc, which records whether N+1 has “absorbed” N’s data item: ok2pc resets immediately when N deletes the item (N precharges); ok2pc is set when N+1 deletes the item (N+1 precharges). (Specification signals: ok2pc+, ok2pc-, pc+, eval+, S+, eval-, pc-, S-; T+ (evaluate of N+1 complete), T- (precharge of N+1 complete).)
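
A behavioural sketch of why the extra state variable is needed. The class `HCController` and its method names are hypothetical; the precharge condition used here (ok2pc and S and T) and the initial value of ok2pc are assumptions chosen to match the set/reset rules stated above, not a transcription of the synthesized controller.

```python
class HCController:
    """Tracks whether stage N may precharge.
    S: stage N has evaluated; T: stage N+1 has evaluated;
    ok2pc: N+1 is free to absorb (or has absorbed) N's current item."""

    def __init__(self) -> None:
        self.S = False
        self.T = False
        self.ok2pc = True  # assumed reset value: N+1 starts out empty

    def may_precharge(self) -> bool:
        # Assumed condition: N and N+1 have evaluated the same item.
        return self.ok2pc and self.S and self.T

    def n_evaluates(self) -> None:
        self.S = True

    def n_precharges(self) -> None:
        if not self.may_precharge():
            raise RuntimeError("precharge would destroy an unabsorbed item")
        self.S = False
        self.ok2pc = False  # resets immediately when N deletes its item

    def n1_evaluates(self) -> None:
        self.T = True

    def n1_precharges(self) -> None:
        self.T = False
        self.ok2pc = True   # set when N+1 deletes its item


c = HCController()
c.n_evaluates(); c.n1_evaluates()
first = c.may_precharge()        # True: both stages hold item 1
c.n_precharges(); c.n_evaluates()
stuck = c.may_precharge()        # False, although S and T are both high
c.n1_precharges(); c.n1_evaluates()
second = c.may_precharge()       # True: N+1 has now absorbed item 2
```

In the middle step S and T alone look exactly as they did before the first precharge, even though N+1 is still holding the old item; only ok2pc distinguishes the two situations.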

34 Controller Implementation. The controller implementation is very simple: each signal is implemented using a single gate, and ok2pc is typically off the critical path. (Figure: INV, NAND3 and aC gates; signals S, T, ok2pc, pc, eval.)

35 HC: Stage Implementation. (Figure labels: req, done, ack, eval, pc; NAND, INV, aC, delay. Annotations: state variable: off the critical path; self-loop from current stage: key to fast “isolation”; early ack from next stage.)

36 HC: Operation. Events in the figure (stages N, N+1): N evaluates; N+1 starts to evaluate; N isolates; N precharges; N enables itself for next evaluation. Figure annotations: fast self-loop (twice), early Ack. Cycle Time = 8 CMOS gate delays.

37 Performance. Cycle Time = [equation shown in figure]. Events in the figure (stages N, N+1, N+2): N evaluates; N+1 evaluates; N isolates; N precharges; N enables itself for next evaluation.

38 FIFO Results (simulation), dual-rail and single-rail. LP dual-rail: over 80% faster than Williams’ PS0, with comparable latency. HC single-rail: 1.3 Giga items/second. (0.6 µm HP CMOS, 3.3 V, 300 K.)

39 Fabricated Chip: HC FIFO. 2.5 GHz in 0.18 µm.

40 Ripple-Carry Adder: One Stage. Mixed dual-rail/single-rail datapath: single-rail: sum; dual-rail: A, B, Carry-in and Carry-out, since binate functions must be implemented using unate dynamic logic. (Figure: full-adder stage with inputs a0, a1, b0, b1, c_in0, c_in1 and req; outputs c_out0, c_out1, sum and done.)
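
A sketch of how a full-adder stage can be expressed with only AND/OR of dual-rail inputs, which is what unate dynamic logic allows (no inversions). The function names and the `(rail1, rail0)` tuple order are assumptions for illustration; the equations are the standard majority and parity functions written without complemented literals.

```python
from itertools import product


def encode(bit: int):
    """Dual-rail encoding, assumed order (rail1, rail0): 1 -> (1, 0), 0 -> (0, 1)."""
    return (1, 0) if bit else (0, 1)


def full_adder_stage(a, b, c):
    """Return (sum, (cout1, cout0)) using only AND/OR of input rails.
    The sum is single-rail; the carry-out stays dual-rail for the next stage."""
    a1, a0 = a
    b1, b0 = b
    c1, c0 = c
    cout1 = (a1 & b1) | (a1 & c1) | (b1 & c1)   # majority of the true rails
    cout0 = (a0 & b0) | (a0 & c0) | (b0 & c0)   # majority of the false rails
    s = (a1 & b0 & c0) | (a0 & b1 & c0) | (a0 & b0 & c1) | (a1 & b1 & c1)
    return s, (cout1, cout0)


# Exhaustive check against ordinary binary addition.
all_ok = all(
    full_adder_stage(encode(a), encode(b), encode(c))
    == ((a + b + c) % 2, encode((a + b + c) // 2))
    for a, b, c in product((0, 1), repeat=3)
)
```

Because the sum is a parity (binate) function, it needs both rails of every input; the carry chain stays dual-rail so that the next stage has both rails of its carry-in available.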

41 Final Adder Architecture. Chain of adder stages from least significant to most significant, each taking A, B and carry in and producing sum and carry out; shift registers provide the operand bits; shift registers accumulate the sum bits.

42 Results. Designed/simulated an adder in each pipeline style. Experimental setup: design: 32-bit ripple-carry adder; technology: 0.6 µm HP CMOS, at 3.3 V and 300 K. HC is 10% faster than LP SR 2/1.

43 Conclusions. Introduced 2 new async high-speed pipeline styles: Lookahead Pipelines, which use novel protocol optimizations; High-Capacity Pipelines, a fundamentally new protocol that allows 100% storage capacity. Obtain very high throughputs: FIFOs: up to 1.3 GHz in 0.6 µm CMOS; adders: ~1.0 GHz in 0.6 µm CMOS; near-best performance, and significantly simpler and easier to construct. Fabricated chip: 2.5 GHz (HC FIFOs in 0.18 µm).
