1 Clockless Logic or How do I make hardware fast, power-efficient, less noisy, and easy-to-design? Montek Singh, Tue, Jan 14, 2003.

2 Course Information (1)
Course Number: COMP290-084
Time and Place: Tue/Thu 3:30-4:45pm, Sitterson Hall 325
Instructor: Montek Singh
- montek@cs.unc.edu (not singh@cs!)
- SN 245, 962-1832
- Office hours: most afternoons/by appointment
Teaching Assistant: None
Course Web Page: http://www.cs.unc.edu/~montek

3 Course Information (2)
Prerequisites:
- undergraduate knowledge of: digital logic, algorithms, discrete math (sets and graphs)
- no knowledge of advanced circuit design or of VLSI is assumed
  - relevant topics will be covered in class as needed
- you are assumed to know the following topics:
  - digital logic: Boolean algebra, logic gates, and latches and registers
  - algorithms: search techniques, enumeration, divide and conquer, and time complexity
  - discrete math: elementary set theory and graph theory

4 Course Information (3)
Reading Material:
- Papers and technical reports supplied by instructor
Course Content: the following topics will be covered:
- Introduction to clockless logic
- Graphical representation of asynchronous systems
- Algorithms for logic synthesis
  - Combinational
  - Sequential
- Design techniques
  - High-performance
  - Low-power
- Formal methods (performance analysis and verification)
- Case studies of real-world asynchronous processors

5 Course Information (4)
Grading:
- 30% homework assignments
- 35% class project
  - your choice of topic: from pure algorithms to VLSI design
- 30% exams
- 5% class participation
Honor Code is in effect:
- encouraged to discuss ideas/concepts
- work handed in must be your own

6 Lecture 1: Introduction
- What is asynchronous design?
- Why do we want to study it?
- How is data represented in an asynchronous system?
- How is information exchanged?

7 Introduction: Clocked Digital Design
Most current digital systems are synchronous:
- Clock: a global signal that paces operation of all components
Benefit of clocking: enables discrete-time representation
- all components operate exactly once per clock tick
- component outputs need to be ready by next clock tick
  - allows “glitchy” or incorrect outputs between clock ticks
[Figure: clock signal]

8 Microelectronics Trends
Current and Future Trends: Significant Challenges
- Large-Scale “Systems-on-a-Chip” (SoC)
  - 100 million to 1 billion transistors/chip
- Very High Speeds
  - multiple-gigahertz clock rates
- Explosive Growth in Consumer Electronics
  - demand for ever-increasing functionality …
  - … with very low power consumption (limited battery life)
- Higher Portability/Modularity/Reusability
  - “plug ’n play” components, robust interfaces

9 Challenges to Clocked Design
Breakdown of Single-Clock Paradigm:
- Chip will be partitioned into multiple timing domains
  - challenge: gluing together multiple timing domains
    - glue logic is susceptible to “metastability” (= incorrect values transferred) and latency overheads
Increasing Difficulties with Clocked Design:
- Clock distribution: requires significant designer effort
- Performance bottleneck: a single slow component
- Clock burns large fraction of chip power (~40-70%)
- Fixed clock rate: poor match for
  - designing reusable components
  - interfacing with mixed-timing environments

10 What is Asynchronous Design?
- Digital design with no centralized clock
- Synchronization using local “handshaking”
[Figure: Synchronous System (Centralized Control) with a clock, next to an Asynchronous System (Distributed Control) with handshaking interfaces]

11 Why Asynchronous Design? (1)
- Higher Performance
  - May obtain “average-case” operation (not “worst-case”)
    - not limited by slowest component
  - Avoids overheads of multi-GHz clock distribution
- Lower Power
  - No clock power expended
  - Inactive components consume negligible power
- Better Electromagnetic Compatibility
  - Smooth radiation spectra: no clock spikes
  - Much less interference with sensitive receivers [e.g., Philips pagers, smartcards]
- Greater Flexibility/Modularity
  - Naturally adapts to variable-speed environments
  - Supports reusable components
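To make the “average-case” versus “worst-case” point concrete, here is a minimal Python sketch. It assumes an idealized model that is not from the lecture: a clocked design pays the slowest delay on every operation, while a self-timed design pays each operation's actual delay plus a fixed handshake overhead. The function names, the `handshake_overhead` parameter, and the delay values are all hypothetical.

```python
def clocked_total(delays):
    """Total time when every operation takes one clock period.

    Assumption: the clock period must cover the slowest operation.
    """
    return len(delays) * max(delays)


def self_timed_total(delays, handshake_overhead=0):
    """Total time when each operation signals its own completion.

    Assumption: each operation costs its actual delay plus a fixed
    handshake overhead (hypothetical parameter).
    """
    return sum(d + handshake_overhead for d in delays)


# Hypothetical delays in arbitrary units: mostly fast, one slow case.
delays = [1, 1, 1, 5]
print(clocked_total(delays))        # 4 operations, each charged 5 units
print(self_timed_total(delays))     # 1 + 1 + 1 + 5
print(self_timed_total(delays, 1))  # same, plus 1 unit of overhead each
```

In this model the benefit shrinks as the handshake overhead grows, which is consistent with the slide saying average-case operation “may” be obtained.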

12 Why Asynchronous Design? (2)
The world already is mostly asynchronous!
- Events at the level of (or in between) large-scale systems are asynchronous
  - several seconds to several milliseconds
  - e.g., PC-printer communication, keyboard inputs, network communication
- Events at the board level (or between chips) are often asynchronous
  - milliseconds to 100 nanoseconds
  - e.g., CPU-memory interface, interface with I/O subsystem (interrupts)
- Events within a chip, at the level of functional units (e.g., adders, control logic), are currently synchronous
  - several nanoseconds to 100 picoseconds
- Events at the level of a single logic gate are asynchronous
  - 10 picoseconds
- Events at the quantum level are asynchronous
  - picoseconds to femtoseconds
So, why bother with clocks at all?!
- make everything asynchronous -> greater elegance and robustness

13 Challenges of Asynchronous Design
- Hazards: potential “glitches” on wires
  - no problem for clocked systems
  - communication must be hazard-free!
  - special design challenge = “hazard-free synthesis”
- Testability Issues:
  - absence of clock means no “single-stepping”
- Lack of Commercial CAD Tools:
  - chicken-and-egg problem
[Figure: clean signals versus hazardous signals, shown relative to the clock tick]

14 Asynchronous Design: Past & Present
Async Design: in existence for 50 years, but … many recent technical advances:
- Hazard-Free Circuit Design:
  - several practical techniques for controllers [Stanford/Columbia]
- Design for Testability:
  - several test solutions, e.g. Philips Research
- Maturing Computer-Aided-Design (“CAD”) Tools:
  - software tools for automated design [Philips, Columbia, Manchester]
- Successful Fabricated Chips:
  - embedded processors, high-speed pipelines, consumer electronics…

15 Recent Commercial Interest
Several commercial asynchronous chips:
- Philips: asynchronous 80c51 microcontrollers
  - used in commercial pagers [1998] and smartcards [2001]
- Univ. of Manchester: async ARM processor [2000]
- Motorola: async divider in PowerPC chip [2000]
- HAL: async floating-point divider
  - in HAL-I and II processors [early 1990’s]
Recent experimental chips:
- IBM, Sun and Intel:
  - fast pipelines, arbiters, instruction-length decoder…
- IBM/Columbia/UNC: asynchronous digital FIR filter
Several recent startups:
- Theseus Logic, Fulcrum, Self-Timed Solutions…

16 A 5-minute Homework Problem
Alice and Bob live on opposite sides of a wide river:
- Alice is supposed to send a message (say, a “Yes”/“No”) across to Bob around midnight.
- Both have flashlights, but neither owns a watch.
What should they do? Suggest several strategies, and discuss pros and cons of each.
[Figure: Alice and Bob]

17 Solution 1
Alice uses 2 lamps:
- 1 to indicate that she is ready with the message, and
- 1 for the message itself
Bob uses 1 lamp:
- to indicate that he has received the message
[Figure: Alice's “ready” and “yes/no” lamps face Bob; Bob's “got it” lamp faces Alice]
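Solution 1 can be sketched as a tiny Python simulation. This is only an illustration under stated assumptions: the lamp names (`message`, `ready`, `got_it`) and the `transfer` function are hypothetical, and the ordering rule (Alice sets the message lamp before the ready lamp) is an assumption about how the protocol would have to work, not something stated on the slide.

```python
def transfer(message):
    """Simulate one transfer with Alice's two lamps and Bob's one lamp.

    Returns what Bob wrote down and the ordered log of lamp changes.
    """
    lamps = {"message": False, "ready": False, "got_it": False}
    log = []

    def set_lamp(name, on):
        lamps[name] = on
        log.append((name, on))

    # Alice: set the message lamp first, so it is stable before "ready".
    set_lamp("message", message == "yes")
    set_lamp("ready", True)

    # Bob: only look at the message lamp once "ready" is on.
    received = None
    if lamps["ready"]:
        received = "yes" if lamps["message"] else "no"
        set_lamp("got_it", True)

    return received, log


received, log = transfer("no")
print(received)
print(log)
```

Bob never needs a watch here: the “ready” lamp tells him when the message lamp is meaningful.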

18 Solution 2
Alice uses 2 lamps:
- Green lamp to indicate “yes”
- Red lamp to indicate “no”
Bob uses 1 lamp:
- to indicate that he has received the message
[Figure: Alice's “yes” and “no” lamps face Bob; Bob's “got it” lamp faces Alice]

19 Solution 3
What if Alice and Bob could keep time?
Alice uses 1 lamp for the message:
- At 12 midnight: turns on lamp if message = “yes”
- At 12:01: turns lamp off
Bob needs no lamps!
- Takes down the message between 12 and 12:01
Pros: fewer signals, less processing needed
Cons: Alice and Bob must keep their clocks closely synchronized
- If Bob’s watch is off by a minute, incorrect communication is possible
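The clock-skew risk in Solution 3 can be illustrated with a short Python sketch. Times are in minutes after midnight; the `bob_skew` and `look_at` parameters and both function names are hypothetical, chosen only to model “Bob’s watch is off by a minute”.

```python
def lamp_is_on(true_time, message):
    """Alice's lamp: on during [0, 1) minutes after midnight if "yes"."""
    return message == "yes" and 0 <= true_time < 1


def bob_reads(message, bob_skew, look_at=0.5):
    """What Bob writes down when his watch shows `look_at` minutes.

    bob_skew is how far Bob's watch lags true time, in minutes
    (hypothetical parameter): when his watch shows t, the true time
    is t + bob_skew.
    """
    true_time = look_at + bob_skew
    return "yes" if lamp_is_on(true_time, message) else "no"


print(bob_reads("yes", bob_skew=0))  # watches agree
print(bob_reads("yes", bob_skew=1))  # Bob is a minute late: lamp is already off
```

With a one-minute skew Bob records “no” even though Alice sent “yes”, and nothing in the protocol lets either of them notice.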

20 Data Representation Styles: “Bundled Data”
Single-rail “Bundled Datapath”: simplest approach, widely used
Features:
- datapath: 1 wire per bit (e.g. standard sync blocks)
- matched delay: produces delayed “done” signal
  - worst-case delay: longer than slowest path
Pros: practical style: can reuse sync components; small area
Cons: fixed (worst-case) completion time
[Figure: function block with inputs bit 1 … bit n and outputs bit 1 … bit m; “request” passes through a matched delay to produce “done”; “done” indicates valid data]
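The matched-delay idea can be sketched numerically in Python. This is a simplified timing model, not a circuit: `path_delays`, `margin`, and the function names are hypothetical, and the margin is modeled as a simple additive amount on top of the slowest path.

```python
def matched_delay(path_delays, margin):
    """Delay of the matched-delay element.

    Assumption: it is sized to the slowest datapath delay plus a margin.
    """
    return max(path_delays) + margin


def done_time(request_time, path_delays, margin):
    """Time at which "done" is asserted for a request."""
    return request_time + matched_delay(path_delays, margin)


def data_valid_at_done(actual_delay, path_delays, margin):
    """True if the datapath has settled by the time "done" rises."""
    return actual_delay <= matched_delay(path_delays, margin)


# Hypothetical path delays (arbitrary units) through the function block.
paths = [2, 3, 5]
print(matched_delay(paths, margin=1))
print(done_time(10, paths, margin=1))
print(data_valid_at_done(2, paths, margin=1))
```

Note that `done_time` does not depend on which path the current data actually exercises, which is the fixed (worst-case) completion time listed as the drawback.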

21 Data Representation Styles: Dual-Rail
Dual-rail: uses 2 wires per data bit
Each Dual-Rail Pair: provides both data value and validity
Pros: provides robust data-dependent completion
Cons: needs completion detectors
[Figure: function block with dual-rail inputs bit 1 … bit n and outputs bit 1 … bit m]
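A minimal Python sketch of one common dual-rail convention follows. The slide does not say which rail means what, so the mapping here is an assumption: each bit is a pair `(true_rail, false_rail)`, `(0, 0)` means “no data yet”, and `(1, 1)` is treated as illegal. The names `encode`, `decode`, and `SPACER` are hypothetical.

```python
SPACER = (0, 0)  # assumed "no valid data" state for one bit


def encode(bit):
    """Encode a bit as (true_rail, false_rail); assumed convention."""
    return (1, 0) if bit else (0, 1)


def decode(pair):
    """Return 0 or 1 for valid data, or None while the pair is a spacer."""
    if pair == (1, 0):
        return 1
    if pair == (0, 1):
        return 0
    if pair == SPACER:
        return None
    raise ValueError("both rails high is not a legal dual-rail value")


word = [encode(b) for b in [1, 0, 1]]
print(word)
print([decode(p) for p in word])
print(decode(SPACER))
```

Because each pair carries its own validity, a receiver can tell when a bit has arrived without any separate timing assumption.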

22 Dual-Rail (contd.)
Dual-Rail Completion Detector:
- combines dual-rail signals
- indicates when all bits are valid (or reset)
- OR together 2 rails per bit
- Merge results using a Muller “C-element”
C-element:
- if all inputs = 1, output -> 1
- if all inputs = 0, output -> 0
- else, maintain output value
[Figure: one OR gate per bit (bit 0, bit 1, … bit n) feeding a C-element (C) that produces “Done”]
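The C-element rule and the completion detector built from it can be sketched behaviorally in Python. This models only the logical behavior described on the slide, not gate delays; the class and function names are hypothetical.

```python
class CElement:
    """Behavioral model of a C-element: output follows the inputs only
    when they all agree, and otherwise holds its previous value."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, inputs):
        if all(v == 1 for v in inputs):
            self.out = 1
        elif all(v == 0 for v in inputs):
            self.out = 0
        # else: inputs disagree, so keep the old output
        return self.out


def completion(c_element, dual_rail_pairs):
    """OR the two rails of each bit, then merge with the C-element."""
    per_bit_valid = [a | b for a, b in dual_rail_pairs]
    return c_element.update(per_bit_valid)


c = CElement()
print(completion(c, [(0, 0), (0, 0)]))  # all bits reset
print(completion(c, [(1, 0), (0, 0)]))  # only one bit valid: hold
print(completion(c, [(1, 0), (0, 1)]))  # all bits valid
print(completion(c, [(0, 0), (0, 1)]))  # partly reset: hold
print(completion(c, [(0, 0), (0, 0)]))  # all bits reset again
```

The holding behavior is what lets “Done” mean both “all bits valid” on the way up and “all bits reset” on the way down.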

23 Handshaking Styles: 4-phase
4-Phase: requires 4 events per handshake
[Figure: Request and Acknowledge waveforms, labeled: start event, done, get ready for next event, ready for next event]
Pros: “level-sensitive” -> simpler logic implementation
Cons: overhead of “return-to-zero” (RTZ or resetting)
- extra events which do no useful computation
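The four events of a 4-phase handshake can be listed with a short Python sketch. It assumes the usual ordering (request rises, acknowledge rises, request falls, acknowledge falls); the function name and the trace format are hypothetical.

```python
def four_phase(transfers):
    """Return the ordered wire events for a number of 4-phase handshakes.

    Each event is (wire, new_level). Assumed order per handshake:
    req up, ack up, req down, ack down.
    """
    trace = []
    for _ in range(transfers):
        trace.append(("req", 1))  # start event
        trace.append(("ack", 1))  # done
        trace.append(("req", 0))  # get ready for next event (return-to-zero)
        trace.append(("ack", 0))  # ready for next event (return-to-zero)
    return trace


print(four_phase(1))
print(len(four_phase(3)))  # 4 events per handshake
```

Half of the events in every handshake are the return-to-zero steps, which is the overhead the slide refers to.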

24 Handshaking Styles: 2-phase
2-Phase: requires 2 events per handshake
[Figure: Request and Acknowledge waveforms, labeled: start event, done, start next event, next event done]
Pros: elegant: no return-to-zero
Cons: slower logic implementation
- logic primitives are inherently level-sensitive, not event-based (at least in CMOS)
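A 2-phase handshake treats every transition, rising or falling, as an event. The Python sketch below shows this under the assumption that each transfer is one toggle of request followed by one toggle of acknowledge; the function name and trace format are hypothetical.

```python
def two_phase(transfers):
    """Return the ordered wire events for a number of 2-phase handshakes.

    Each event is (wire, new_level). Every toggle carries meaning,
    so the wires are never reset between transfers.
    """
    req = ack = 0
    trace = []
    for _ in range(transfers):
        req ^= 1
        trace.append(("req", req))  # start event
        ack ^= 1
        trace.append(("ack", ack))  # done
    return trace


print(two_phase(2))
print(len(two_phase(3)))  # 2 events per handshake
```

Two transfers here produce the same wire activity that a single 4-phase handshake needs, which is where the potential speed and power savings come from. The cost is that the receiver must react to changes rather than levels.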

25 Handshaking + Data Representation
Several combinations possible:
- dual-rail 4-phase, single-rail 4-phase, dual-rail 2-phase, and single-rail 2-phase
Example: dual-rail 4-phase
- dual-rail data: functions as an implicit “request”
- 4-phase cycle: between acknowledge and implicit request
[Figure: blocks A and B connected by dual-rail data (bit 1 … bit m) and an ack wire]

26 Other Data Representation Styles
- Level-Encoded Dual-Rail (LEDR)
  - 2 wires per bit: “data” and “phase”
  - exactly one wire per bit changes value
    - if new value is different, “data” wire changes value
    - else “phase” wire changes value
- M-of-N Codes
  - N wires used for a data word
  - M wires (M <= N) change value
  - Values of N and M: have impact on…
    - information transmitted, power consumed and logic complexity
- Knuth codes, Huffman codes, …
[Figure: “data” and “phase” waveforms]
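Both styles can be illustrated with a small Python sketch. The LEDR function follows the rule exactly as worded on the slide (change “data” if the value differs, otherwise change “phase”); the function names are hypothetical. The M-of-N helper relies only on the counting fact that there are C(N, M) ways to pick which M of N wires are used.

```python
import math


def ledr_next(state, new_bit):
    """Advance one LEDR bit from state (data, phase) to carry new_bit.

    Exactly one of the two wires changes on every new value.
    """
    data, phase = state
    if new_bit != data:
        return (new_bit, phase)  # value differs: "data" wire changes
    return (data, phase ^ 1)     # same value: "phase" wire changes


def m_of_n_codewords(m, n):
    """Number of distinct codewords in an M-of-N code."""
    return math.comb(n, m)


state = (0, 0)
for bit in [1, 1, 0]:
    state = ledr_next(state, bit)
    print(bit, state)

print(m_of_n_codewords(1, 4))  # 1-of-4: enough for 2 bits
print(m_of_n_codewords(2, 4))
```

Sending the same value twice still produces a visible wire change in LEDR, so the receiver can detect each new data item without a return-to-zero phase.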

27 Which to use?
Depends on several performance parameters:
- speed
  - single-rail vs. dual-rail
    - single-rail may be faster (if designed aggressively)
    - dual-rail may be faster (if completion times vary widely)
  - 2-phase vs. 4-phase
    - 2-phase may be faster (if logic overhead is small)
    - 4-phase may be faster (if overhead of return-to-zero is small)
- power consumption
  - 2-phase typically has fewer gate transitions (-> lower power)
- amount of logic used (#gates/wires/pins -> chip area)
  - single-rail needs fewer gates/wires/pins
- design and verification effort
  - dual-rail, 1-of-N, M-of-N, Knuth codes…:
    - delay-insensitive: robust in the presence of arbitrary delays
  - single-rail: requires greater timing verification effort

28 Sutherland’s Micropipelines: Seminal Paper

29 Focus of Sutherland’s Turing Award Lecture: Pipelining
Motivation: Pipelining is at the heart of nearly all high-performance digital systems
Additional Benefits:
- Low power
- Interfacing with mixed systems
- Modular and scalable design

30 Background: Pipelining
What is Pipelining?: Breaking up a complex operation on a stream of data into simpler sequential operations
Pros: throughput significantly increased
Cons: latency somewhat degraded
Throughput = #data items processed/second
[Figure: a “coarse-grain” pipeline (e.g. simple processor: fetch, decode, execute) and a “fine-grain” pipeline (e.g. pipelined adder), with storage elements (latches/registers) between stages]
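The throughput/latency trade-off can be worked through with a short Python sketch. The model is an assumption for illustration only: the logic splits evenly across stages, and each stage adds a fixed storage-element overhead. The function name and all numbers are hypothetical.

```python
def pipeline_metrics(total_logic_delay, stages, latch_overhead):
    """Return (stage_time, latency, throughput) for an idealized pipeline.

    Assumptions: logic divides evenly across stages, and every stage
    adds the same latch/register overhead.
    """
    stage_time = total_logic_delay / stages + latch_overhead
    latency = stages * stage_time  # time for one item to pass through
    throughput = 1 / stage_time    # items completed per time unit
    return stage_time, latency, throughput


# Hypothetical numbers: 12 units of logic, 1 unit of latch overhead.
print(pipeline_metrics(12, 1, 1))  # unpipelined
print(pipeline_metrics(12, 4, 1))  # 4 stages
```

In this model, going from 1 to 4 stages raises latency from 13 to 16 units while the time between results drops from 13 to 4 units, matching the slide's “throughput significantly increased, latency somewhat degraded”.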

31 Focus of Async Community
Our Focus: Extremely fine-grain pipelines
- “gate-level” pipelining = use narrowest possible stages
- each stage consists of only a single level of logic gates
  - some of the fastest existing digital pipelines to date
Application areas:
- multimedia hardware (graphics accelerators, video DSP’s, …)
  - naturally pipelined systems, throughput is critical
  - input is often “bursty”
- optical networking
  - serializing/deserializing FIFO’s
- genomic string matching?
  - KMP style string matching: variable skip lengths

