Clockless Logic Montek Singh Tue, Apr 6, 2004. Case Study: An Adaptively-Pipelined Mixed Synchronous-Asynchronous System Montek Singh Univ. of North Carolina.

Case Study: An Adaptively-Pipelined Mixed Synchronous-Asynchronous System
Montek Singh, Univ. of North Carolina at Chapel Hill
Jose Tierno, Alexander Rylyakov and Sergey Rylov, IBM TJ Watson Research Center
Steven M. Nowick, Columbia University

3 Motivation
Read channel filters: key components of disk drives
- used in: hard disks, CD/DVD drives, IBM microdrives …
- used for: removing “inter-symbol interference”
    - adjacent bits of data overlap due to dispersion of read pulses
[Figure: read head pickup over time for successive bits, 10 years ago vs. today]
High-speed, high-density drives:
- require high-speed, high-resolution filters

4 Challenges
- High data rates: ~1 Giga items/second
    - to handle high disk rotational speeds (up to 10,000 RPM)
- Complex filter architectures
    - to handle high disk storage densities
- Short design cycle/low designer effort
    - to target consumer electronics ($5 per chip or less)
    - to allow ease of migration to new silicon technology
- Variable clock-rate operation
    - to handle variable input data rates
    - innermost tracks produce 1/5th the data rate of outermost tracks!
- Very low latency
    - filter is part of a tight feedback loop which aligns clock frequency and phase with input data

5 Contribution
A fabricated real-world read channel filter chip:
- Provides high-speed operation (over 1.3 Giga items/sec)
    - asynchronous portion estimated capable of 1.8 Giga items/sec
- Adaptively pipelined: provides variable “pipelining depth”
    - behaves as a deep pipeline (7 clocked stages) for high input rates
    - behaves as a shallow pipeline (4 clocked stages) at lowest rates
    - latency can be kept low at all input rates!
- Provides clocked interfaces
    - can be embedded into a synchronous environment
- Fairly low power consumption (400 mW at 1 GHz)
- Easy to implement
    - large fraction of design uses standard library components
    - automated placement and routing
    - no post-layout tweaking needed

6 What is Asynchronous Design?
- Digital design with no centralized clock
- Synchronization using local “handshaking”
[Figure: asynchronous system (distributed control, handshaking interface) vs. synchronous system (centralized control, clock)]

7 Why Asynchronous Design?
- Higher performance
    - may obtain “average-case” operation (not “worst-case”): not limited by slowest component
    - avoids overheads of multi-GHz clock distribution
- Lower power
    - no clock power expended
    - inactive components consume negligible power
- Better electromagnetic compatibility
    - smooth radiation spectra: no clock spikes
    - much less interference with sensitive receivers [e.g., Philips pagers, smartcards]
- Greater flexibility/modularity
    - naturally adapts to variable-speed environments
    - supports reusable components

8 Outline
- Background
    - Read Channel Filters
    - High-Capacity (HC) Asynchronous Pipeline Style
- New Filter Design
    - Implementation
    - Operation
    - Performance Analysis
- Layout and Fabrication
- Experimental Results
- Conclusions

9 Background: Read Channel Filters
Read channel filter:
- finite impulse response (FIR) filter
- output determined by a finite history of inputs
An “N-tap” filter computes the weighted sum:
    y(k) = h_0 x(k) + h_1 x(k-1) + … + h_{N-1} x(k-N+1)
where:
- x(k) … x(k-N+1) is the input sequence
- h_0 … h_{N-1} are constant tap weights
[Figure: data input feeding a shift register; compute engine combining the shift-register contents with the tap weights to produce the data output]
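The weighted sum above can be sketched directly in software. This is a minimal behavioral model, not the chip's implementation; the function name `fir_filter` and the zero-initialized history are assumptions made for illustration.

```python
from collections import deque


def fir_filter(samples, weights):
    """Return y(k) = sum_i weights[i] * x(k - i) for every input sample.

    Assumption: samples before the start of the stream are taken as zero.
    """
    n_taps = len(weights)
    # Shift register holding x(k), x(k-1), ..., x(k-N+1).
    history = deque([0] * n_taps, maxlen=n_taps)
    outputs = []
    for x in samples:
        history.appendleft(x)  # newest sample enters, oldest drops out
        outputs.append(sum(h * v for h, v in zip(weights, history)))
    return outputs


# Feeding a unit impulse reads the tap weights back out, one per step.
impulse_response = fir_filter([1, 0, 0, 0], [3, 2, 1])
```

The `deque` plays the role of the input shift register on the slide, and the `sum` plays the role of the compute engine.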

10 Background: HC Pipeline Style
High-Capacity (HC) pipelines [Singh/Nowick WVLSI-00]
- bundled datapaths; dynamic logic function blocks
- latch-free: no explicit latches needed
    - dynamic logic provides implicit latching
- novel highly-concurrent protocol maximizes storage capacity
    - traditional latch-free approaches: “spacers” limit capacity to 50%
Key Idea: obtain greater control of a stage’s operation
- separate control of pull-up/pull-down
- result = new “isolate phase”
- stage holds outputs/impervious to input changes
Advantage: each stage can hold a distinct data item => 100% storage capacity
Extra Benefit: greater concurrency => high throughput

11 HC: Basic Structure
Key Idea: 2 independent control signals
- pc: controls precharge
- eval: controls evaluation
Allows novel 3-phase cycle:
- Evaluate
- “Isolate” (hold)
- Precharge
Single-rail “bundled datapath”:
- matched delay: produces delayed “done” signal
    - worst-case delay: longer than slowest path for data
[Figure: stages N, N+1 and N+2, each with a stage controller driving pc and eval; matched delay elements between stages; ack returned from the next stage]

12 HC: Inside a Stage
Independent controls of pull-up and pull-down:
- allows new third phase: “isolate”
    - pc asserted: precharge
    - eval asserted: evaluate
    - pc and eval de-asserted: enter “isolate” (hold) phase
[Figure: dynamic gate with inputs and outputs, a “keeper”, a transistor controlled by pc (controls precharge) and a transistor controlled by eval (controls evaluation)]

13 HC: Protocol
Most existing protocols: 3 synchronization arcs
- 1 forward arc: data dependency
- 2 backward arcs: control synchronization
Our protocol: only 2 synchronization arcs
- only 1 backward arc
    - once stage N+1 evaluates, N can complete its entire next cycle!
[Figure: phase cycles of stage N and stage N+1, each Eval (pc=1, eval=1) -> Isolate (pc=1, eval=0) -> Precharge (pc=0, eval=0), with one of the backward arcs crossed out]
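The three-phase cycle and its single backward arc can be sketched as a small state machine. This is a simplified behavioral model, not the gate-level controller: the names `Phase`, `CONTROLS` and `next_phase` are hypothetical, and the (pc, eval) levels are read off the slide's phase labels under the assumption that pc is active-low (pc=0 means precharge is asserted).

```python
from enum import Enum


class Phase(Enum):
    EVALUATE = "evaluate"
    ISOLATE = "isolate"
    PRECHARGE = "precharge"


# (pc, eval) levels per phase, as labeled on the slide.
# Assumption: pc is active-low, so pc=0 asserts precharge.
CONTROLS = {
    Phase.EVALUATE: (1, 1),
    Phase.ISOLATE: (1, 0),
    Phase.PRECHARGE: (0, 0),
}


def next_phase(phase, *, eval_done=False, next_stage_evaluated=False,
               precharge_done=False):
    """Advance one stage by at most one phase.

    eval_done:            the stage's own "done" signal (self-loop)
    next_stage_evaluated: ack from stage N+1, the single backward arc
    precharge_done:       the stage has finished precharging
    """
    if phase is Phase.EVALUATE and eval_done:
        return Phase.ISOLATE      # hold outputs, ignore input changes
    if phase is Phase.ISOLATE and next_stage_evaluated:
        return Phase.PRECHARGE    # N+1 has captured the data item
    if phase is Phase.PRECHARGE and precharge_done:
        return Phase.EVALUATE     # re-enabled for the next data item
    return phase                  # otherwise stay in the current phase
```

In this sketch the only input from a neighboring stage is `next_stage_evaluated`; the other two transitions depend on the stage's own signals, which is what lets stage N proceed through its next cycle once N+1 has evaluated.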

14 HC: Stage Implementation
[Figure: stage controller generating pc and eval from req, done and ack, drawn with NAND, INV and delay elements]
- state variable: off the critical path
- input from current stage (self-loop): key to fast “isolation”
- input from next stage: early ack

15 HC: Operation
[Figure: stages N and N+1 with numbered events 1 to 3]
- N evaluates
- N+1 starts to evaluate
- N isolates (fast self-loop)
- N precharges (early ack)
- N enables itself for next evaluation (fast self-loop)
Cycle Time = 8 CMOS gate delays

17 Filter Architecture: Overview
[Figure: block diagram with input shift register, bit slices, table lookup, mux, partial sums, carry-save adder and carry look-ahead adder; bus widths on the slide are not recoverable from the transcript]

18 Filter Architecture (contd.)
Distributed arithmetic architecture:
- Bit-slice the weighted sum computation
    - each x is a 6-bit value; compute 6 partial sums in parallel
- Precompute all partial sums, store in registers/memory
Problem: lookup table can get quite big…
- e.g., for a 10-tap filter, all addresses are 10-bit words
    - 1024-word memory
Solution: use two techniques to reduce table size:
- Partitioning: split data into odd and even interleaves
    - each interleave generates 5-bit addresses
    - memory requirement drops to two 32-word tables
- Exploit symmetry: use signed-digit offset binary notation
    - “1” means +1, while “0” means -1
    - makes lookup table symmetric, allowing half to be discarded
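The table-size reductions above can be checked with a small behavioral sketch. It is an illustration under stated assumptions rather than the chip's datapath: the helper names are hypothetical, "odd and even interleaves" is taken to mean odd- and even-indexed taps, and the six bit slices are processed in a loop instead of in parallel.

```python
def decode(code, n_bits=6):
    """Value of an offset-binary code: bit 1 means digit +1, bit 0 means -1."""
    return sum((1 if (code >> j) & 1 else -1) * (1 << j) for j in range(n_bits))


def build_half_table(weights):
    """Partial sums for addresses whose top bit is 0.

    The other half is not stored: complementing every address bit flips
    every digit, so the entry for ~addr is the negation of the entry for addr.
    """
    n = len(weights)
    return [
        sum(h if (addr >> i) & 1 else -h for i, h in enumerate(weights))
        for addr in range(1 << (n - 1))
    ]


def lookup(table, addr, n):
    """Read the full n-bit address space out of the half-size table."""
    if (addr >> (n - 1)) & 1:
        mask = (1 << n) - 1
        return -table[~addr & mask]  # complement the address, restore the sign
    return table[addr]


def da_fir_output(window, weights, n_bits=6):
    """One filter output; window[i] is the offset-binary code of x(k - i)."""
    total = 0
    # Partitioning: even- and odd-indexed taps get separate small tables.
    for start in (0, 1):
        taps = weights[start::2]
        codes = window[start::2]
        table = build_half_table(taps)
        for j in range(n_bits):  # one bit slice per input bit position
            addr = sum(((c >> j) & 1) << i for i, c in enumerate(codes))
            total += lookup(table, addr, len(taps)) * (1 << j)
    return total
```

For a 10-tap filter each call to `build_half_table` sees 5 weights and returns 16 entries, compared with 1024 entries for a single unpartitioned table. The sign handling in `lookup` corresponds loosely to the step the architecture slide describes as restoring the sign bit to the partial sums.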

19 Filter Architecture (contd.)
[Figure: datapath from Table Lookup through latch L1, a domino XOR stage, CSA stages 1 to 5 and CLA stages 1 to 3, to latch L2. Regions are labeled Clocked and Asynchronous, and static and dynamic. Annotations: sync-async interface; XOR restores sign bit to partial sums; async-sync interface; asynchronous portion pipelined using the High-Capacity style]

20 Asynchronous Pipelined Portion
Challenge: providing adequate control buffering
[Figure: pipeline stages linked by req and ack signals]

21 Asynchronous Pipelined Portion (cont.)
Optimization: eliminating buffer delays
[Figure: pipeline stages linked by req and ack signals]

22 FIR Filter: Sync-Async Interfaces
[Figure: Table Lookup, latch L1, domino XOR, CSA stages 1 to 5, CLA stages 1 to 3 and latch L2, with data, req and ack signals between stages; interface signals Clk and Clk’; programmed delay (shift-reg)]

24 Filter Operation
Performance goals:
- operation desired over a wide range of clock frequencies
    - input data rate to a read channel can vary greatly
    - data rate varies as the read head moves from innermost to outermost track
    - variation up to a factor of 5!
- low filter latency required at all clock frequencies
    - filter is part of a closed feedback loop (“clock recovery loop”)
    - low loop latency critical to accurate alignment of clock w.r.t. data
Challenge:
- a purely synchronous pipeline cannot easily satisfy the above goals
    - deep pipeline design required to meet highest data rates…
    - … but deep pipelining implies long clock cycle latency
    - at lowest data rates, long clock cycle latency is unacceptable

25 Our Solution: Adaptive Pipelining
Key Idea:
- exploit the constant latency (in ns) of asynchronous pipelines
- behaves similar to a clocked pipeline with variable depth
[Figure: asynchronous pipeline between a clocked input side and a clocked output side, shown for a slow-speed scenario and a high-speed scenario]
Benefit:
- filter appears to the external clocked environment as a clocked pipeline with variable depth
- obtain 1 clock cycle latency at lowest data rates
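A back-of-the-envelope sketch of why a fixed nanosecond latency looks like a variable pipeline depth to the clocked environment. Everything numeric here is hypothetical: the 3000 ps asynchronous latency and the 3 fixed clocked stages are illustrative values chosen so that the two ends of the range resemble the 7-stage and 4-stage behavior quoted on the Contribution slide; they are not measurements from the chip.

```python
def apparent_depth(clock_period_ps, async_latency_ps, fixed_clocked_stages):
    """Pipeline depth, in clock cycles, seen by the clocked environment.

    Assumption: a result that takes async_latency_ps to cross the
    asynchronous portion is captured on the first clock edge after it
    arrives, so that portion costs ceil(latency / period) cycles.
    """
    async_cycles = -(-async_latency_ps // clock_period_ps)  # integer ceiling
    return fixed_clocked_stages + async_cycles


# Hypothetical numbers, for illustration only.
ASYNC_LATENCY_PS = 3000
FIXED_STAGES = 3

fast = apparent_depth(769, ASYNC_LATENCY_PS, FIXED_STAGES)    # ~1.3 GHz clock
slow = apparent_depth(3845, ASYNC_LATENCY_PS, FIXED_STAGES)   # 5x slower clock
```

With these values the same asynchronous hardware spans four clock periods at the fast clock and a single clock period at the slow clock, so the depth adapts without any reconfiguration.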

26 Comparison: Adaptive vs. Wave Pipelining
Similarity:
- both allow a variable number of data tokens in the datapath
- both allow interfacing with clocked environments
… but significant differences:
- Wave pipelining: requires much designer effort
    - at all levels of design: from architectural down to layout level
    - needs accurate balancing of path delays (incl. data dependent)
    - vulnerable to process, temperature and voltage variations
    - cannot handle varying input/output data rates
- Adaptive pipelining: significantly more robust
    - uses robust handshake protocol between stages
    - is elastic: can handle stalls, congestion, etc.

27 Performance Analysis
Key Result: filter behaves similar to a self-timed ring … with 9 ½ stages!
[Figure: clocked input/output closing a self-timed ring; plot of throughput vs. tokens in pipeline, with 1/T_F, 1/T_B, 1/T_C and the reachable throughput marked]
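The shape of a throughput-versus-tokens plot for a self-timed ring can be sketched with a commonly used first-order model. This is a general textbook-style approximation, not the analysis from the talk: the parameter names `forward_latency` and `reverse_latency` are assumptions and are not claimed to correspond to the T_F, T_B and T_C labels on the slide.

```python
def ring_throughput(tokens, stages, forward_latency, reverse_latency):
    """First-order throughput of a self-timed ring, in items per time unit.

    With few tokens the ring is data-limited: each token needs
    stages * forward_latency to complete a lap. With many tokens it is
    hole-limited: each empty slot needs stages * reverse_latency to
    travel back around. The achievable rate is the smaller of the two.
    """
    if not 0 <= tokens <= stages:
        raise ValueError("tokens must be between 0 and the number of stages")
    data_limited = tokens / (stages * forward_latency)
    hole_limited = (stages - tokens) / (stages * reverse_latency)
    return min(data_limited, hole_limited)


# Sweep the token count for a ring with a fractional effective stage count.
curve = [ring_throughput(k, 9.5, 1.0, 1.0) for k in range(10)]
```

The curve rises while tokens are scarce, peaks, then falls as free slots become scarce, which is the general behavior a throughput-versus-tokens plot of this kind is drawn to show.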

29 Layout and Fabrication
Technology:
- IBM’s CMOS-7SF technology with Cu interconnect
- 0.18 micron process, 5 metal layers, and 1.8V supply
Layout:
- part standard-cell, part full-custom
    - entire clocked portion: standard-cell
    - asynchronous datapath: full-custom dynamic gates
    - asynchronous control: standard-cell for basic gates, full-custom for C- and aC-elements
Placement and Routing (P&R):
- fully automated using the Silicon Ensemble tool
    - chip partitioned into 8 parts, each P&R’ed automatically
    - top-level P&R also automated
No resizing of gates performed after P&R

30 Results: Throughput Over 1.3 Giga items/second

31 Results: Power Consumption Less than 500 mW at 1 Giga items/sec

32 Conclusions
Designed, fabricated and tested a real-world FIR filter:
- Hybrid synchronous-asynchronous design
- Exhibits adaptive pipelining
    - variable number of tokens in the datapath
    - enables low clock-cycle latency operation at all frequencies
- Exceeds all performance specifications:
    - obtains throughput over 1.3 GHz
        - 15% faster than best existing read channel filter
        - asynchronous portion estimated capable of up to 1.8 Giga items/sec
    - obtains latency as low as 4 clock cycles
- Testable
- Required low designer effort:
    - layout: mostly using library components
    - placement and routing: fully automated
