IHP, Im Technologiepark 25, 15236 Frankfurt (Oder), Germany, www.ihp-microelectronics.com © 2009 -


Asynchronous Circuit Design, GALS Systems, Synchronous and GALS NoCs. DAAD Workshop, Nis, Serbia, July. Dr. Miloš Krstić, IHP. © All rights reserved

Overview: Motivation; Problems of the synchronous design; Asynchronous circuit design; GALS - State of the Art; Synchronous and GALS NoCs

Challenges with Synchronous Design: Most digital systems today operate synchronously. However, the complexity of electronic systems is growing enormously.

Classical Synchronous Paradigm: Digital circuits are usually designed to work synchronously. [Figure: registers R1, R2, R3, R4 with combinational logic blocks CL3 and CL4, clocked by CLK]

Synchronous communication: Clock edges determine the time instants at which data must be sampled. Data wires may glitch between clock edges (setup/hold times must be satisfied). Data are transmitted at a fixed rate, the clock frequency.

Problems with Synchronous Design: As clock speeds increase, clock distribution becomes difficult because clock skew must be minimized. There is an upper limit to clock speed that depends on the material properties of the device. It is not possible to propagate a signal from one side of the chip to the other within a single clock cycle. Performance is determined by the worst case. The design is sensitive to variations in process, voltage and temperature. It is not modular (a fixed clock rate is a poor match for reusable components). The clock burns a large fraction of chip power (~40-70%). Synchronization failures can occur.

What is Asynchronous Design? (I): Synchronization is achieved without a global clock. Asynchronous communication relies on handshake mechanisms. [Figure: sender and receiver connected by request, acknowledge and data signals]

What is Asynchronous Design? (II) [Figure, example: registers R1, R2, R3, R4 with combinational logic CL3 and CL4 and a controller CTL using REQ and ACK; the registers are connected by links/channels carrying REQ, ACK and DATA, with token flow along the channels]

Asynchronous design styles (I): Bundled data (single rail), 4-phase protocol. This style is very widely used because its asynchronous controllers are very small and fast. [Figure: REQ, ACK and n-bit DATA waveforms of the 4-phase protocol, annotated "always like this" and "some variations"]

Bundled data: A validity signal accompanies the data, similar to an aperiodic local clock. n-bit data communication requires n+1 wires. Data wires may glitch when not valid.

Asynchronous design styles (II): Bundled data (single rail), 2-phase protocol. This style looks simpler and faster than 4-phase, but the controllers are more complex. [Figure: REQ, ACK and n-bit DATA waveforms of the 2-phase protocol]
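
The difference between the 4-phase and 2-phase bundled-data protocols can be sketched as event traces. This is a minimal behavioral sketch, not a circuit model: the function names and the trace format are assumptions made for illustration, and delays are ignored entirely.

```python
def four_phase(words):
    """Return (received, trace) for a 4-phase (return-to-zero) handshake."""
    received, trace = [], []
    for word in words:
        data = word                # sender drives DATA before REQ (bundling constraint)
        trace.append("req+")       # REQ rises: DATA is valid
        received.append(data)      # receiver latches DATA
        trace.append("ack+")       # ACK rises: DATA has been taken
        trace.append("req-")       # return-to-zero phase, no data transferred
        trace.append("ack-")
    return received, trace


def two_phase(words):
    """Return (received, trace) for a 2-phase (transition signalling) handshake."""
    received, trace = [], []
    req = ack = 0
    for word in words:
        data = word
        req ^= 1                   # any REQ transition signals valid DATA
        trace.append(f"req->{req}")
        received.append(data)
        ack ^= 1                   # any ACK transition acknowledges it
        trace.append(f"ack->{ack}")
    return received, trace
```

In this sketch each transfer costs four signal events in the 4-phase protocol and two in the 2-phase protocol, which is why 2-phase looks faster on paper even though its controllers must react to both edges.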

Asynchronous design styles (III): 4-phase dual-rail protocol. Each data bit is encoded onto 2 wires. This allows the generation of delay-insensitive circuits, but introduces a very large area overhead. Encoding of (d.t, d.f): EMPTY = (0, 0); VALID 0 = (0, 1); VALID 1 = (1, 0); (1, 1) is not used. [Figure: ACK and 2n-wire DATA waveforms alternating between EMPTY and VALID]

Dual rail: Two wires per bit. 00 = spacer, 01 = 0, 10 = 1
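
A minimal sketch of the dual-rail code, assuming each bit is held as a pair (d.t, d.f) so that the two-character codes on the slide read as "d.t d.f". The helper names are hypothetical.

```python
SPACER = (0, 0)  # "empty" code word separating two valid values


def encode_bit(bit):
    """Map a logical bit onto the (d.t, d.f) wire pair."""
    return (1, 0) if bit else (0, 1)


def decode_pair(pair):
    """Return 0 or 1 for a valid pair, None for the spacer."""
    if pair == SPACER:
        return None
    if pair == (0, 1):
        return 0
    if pair == (1, 0):
        return 1
    raise ValueError("code 11 is not used in dual-rail encoding")


def encode_word(value, width):
    """Encode an integer as a list of dual-rail pairs, least significant bit first."""
    return [encode_bit((value >> i) & 1) for i in range(width)]
```

For example, `encode_word(2, 2)` gives `[(0, 1), (1, 0)]`: bit 0 is a valid 0 and bit 1 is a valid 1, using 2n = 4 wires for n = 2 bits.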

Asynchronous modules: Signaling protocol: reqin+, start+, [computation], done+, reqout+, ackout+, ackin+, reqin-, start-, [reset], done-, reqout-, ackout-, ackin-. [Figure: DATA PATH with Data IN and Data OUT, and CONTROL with req in, req out, ack in, ack out; control drives start and receives done from the data path]

Asynchronous components: Asynchronous design requires additional components and special logic. Such components are not available in standard synchronous design kits. The critical components are the C-element and the Mutex.

Muller C-element: Truth table for inputs a, b and output z: a = 0, b = 0 gives z = 0; a = 0, b = 1 gives no change; a = 1, b = 0 gives no change; a = 1, b = 1 gives z = 1.
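
The truth table can be captured in a few lines of behavioral code. This is a sketch with a hypothetical class name; it models only the logical behavior (output follows the inputs when they agree, holds otherwise), not delays or hazards.

```python
class CElement:
    """Behavioral model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.z = initial

    def update(self, a, b):
        # The output changes only when both inputs agree; otherwise it holds.
        if a == b:
            self.z = a
        return self.z
```

Starting from z = 0, the input sequence (0,1), (1,1), (1,0), (0,0) yields outputs 0, 1, 1, 0: the output rises only after both inputs are high and falls only after both are low.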

Mutual Exclusion element: The ME element prevents multiple event propagation. It is used for arbitration.

Dual-rail logic: Dual-rail logic requires additional logic for each logical operation. [Figure: dual-rail AND gate with inputs A.t, A.f, B.t, B.f and outputs C.t, C.f]
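
One common way to build a dual-rail AND is to produce one term per valid input combination and OR together the terms that yield 0. The sketch below assumes that structure and uses plain AND terms where a real implementation would typically use C-elements, so it is stateless and only illustrates the logic, not the hysteresis. The function name is hypothetical.

```python
def dual_rail_and(a, b):
    """Dual-rail AND of two (t, f) pairs; returns the output (t, f) pair."""
    at, af = a
    bt, bf = b
    ct = at & bt                             # output is 1 only if both inputs are 1
    cf = (af & bf) | (af & bt) | (at & bf)   # output is 0 if either input is 0
    return (ct, cf)
```

With this structure the output stays at the spacer (0, 0) until both inputs carry a valid value, which is what lets downstream completion detection work. It also shows the cost noted on the slide: four product terms and an OR replace a single AND gate.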

Completion detection (dual-rail): [Figure: completion detection tree built from C-elements, producing the done signal]
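
A behavioral sketch of dual-rail completion detection, assuming the usual arrangement in which each bit's two rails are ORed into a per-bit "valid" flag and the flags are combined by a tree of C-elements. The tree is modeled here as a single n-input C-element, and the class name is hypothetical.

```python
class CompletionDetector:
    """done rises when every bit is valid and falls when every bit is empty."""

    def __init__(self):
        self.done = 0

    def update(self, pairs):
        valid = [t | f for t, f in pairs]   # per-bit OR of the two rails
        if all(valid):
            self.done = 1                   # all bits valid: word complete
        elif not any(valid):
            self.done = 0                   # all bits empty: spacer complete
        return self.done                    # mixed state: hold previous value
```

The hold behavior matters: while some bits are valid and others are still empty, `done` keeps its previous value, so it never signals a partially arrived word.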

Completion detection (bundled-data): Conventional logic plus a matched delay. [Figure: logic block in parallel with a delay element connecting start to done]

Muller pipeline: The delay-insensitive handshake machine. C[i] accepts 1/0 from C[i-1] only if C[i+1] = 0/1. Think of it as waves: the C-elements propagate the waves precisely. Timing depends on local delays and may vary along the pipe. If RIGHT is quiet, the pipe will fill and stall.
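
The fill-and-stall behavior can be reproduced with a small simulation. This is a sketch under stated assumptions: stage i is a C-element fed by stage i-1 and by the inverted output of stage i+1, the left environment always offers the next transition (it drives the complement of C[0]), the right acknowledge is held constant, and stages are evaluated one at a time from left to right. The function names are hypothetical.

```python
def c_element(a, b, prev):
    """Output follows the inputs when they agree, otherwise holds."""
    return a if a == b else prev


def run_until_stall(n_stages, right_ack=0, max_sweeps=100):
    """Sweep the pipeline until no stage changes; return the final stage outputs."""
    c = [0] * n_stages
    for _ in range(max_sweeps):
        changed = False
        for i in range(n_stages):
            left = (1 - c[0]) if i == 0 else c[i - 1]
            right = right_ack if i == n_stages - 1 else c[i + 1]
            new = c_element(left, 1 - right, c[i])
            if new != c[i]:
                c[i] = new
                changed = True
        if not changed:
            break
    return c
```

With four stages and a quiet right side, the simulation settles at `[0, 1, 0, 1]`: neighbouring stages hold opposite values, no stage can accept a new transition, and the pipe is full.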

Micropipelines (Sutherland 89): [Figure: pipeline of latches (L) separated by logic, controlled by a chain of C-elements with delay elements; handshake signals R in, A out, R out, A in]

Abstract Pipeline: Bubbles and tokens. Tokens are either Valid (0 or 1, the value does not matter here) or Empty. Example state: E V V E E

Abstract Rings: 3 stages, 1 bubble: 3 steps for a token round, 6 steps to cycle. States shown: VEV, VEE, VVE, EVE. [Figure labels: token, bubble]

Building Blocks: Latch, Source, Sink, Fork, Join (wait for all), Merge (wait for one), MUX, DEMUX, Function Block (Join; CL; Fork)

Describing Asynchronous Circuits - STGs: [Figure: signal transition graph with transitions A+, B+, A-, B- and the corresponding waveforms of A and B; A is an input, B is an output]

Control specification – C element: [Figure: C-element with inputs A, B and output C; STG with transitions A+, B+, C+, A-, B-, C-]

Control specification – FIFO Controller: [Figure: FIFO controller built from C-elements with signals Ri, Ao, Ro, Ai; STG with transitions Ri+, Ao+, Ri-, Ao-, Ro+, Ai+, Ro-, Ai-]

A simple filter: specification. y := 0; loop x := READ (IN); WRITE (OUT, (x+y)/2); y := x; end loop. [Figure: filter block with data ports IN and OUT and handshake signals R in, A in, R out, A out] Source: J. Cortadella - Introduction to asynchronous circuit design: specification and synthesis
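
The specification is a two-sample moving average. A direct software rendering, with the READ and WRITE channel operations replaced by a plain input list and output list (an assumption made so the sketch runs on its own; the function name is hypothetical):

```python
def averaging_filter(samples):
    """Software model of the filter: out[k] = (x[k] + x[k-1]) / 2, with x[-1] = 0."""
    y = 0                        # y := 0
    out = []
    for x in samples:            # x := READ(IN)
        out.append((x + y) / 2)  # WRITE(OUT, (x + y) / 2)
        y = x                    # y := x
    return out
```

For the input `[2, 4, 6]` this produces `[1.0, 3.0, 5.0]`. In the asynchronous implementation each READ and WRITE becomes a handshake on the corresponding channel.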

A simple filter: block diagram. x and y are level-sensitive latches (transparent when R = 1). + is a bundled-data adder (matched delay between Ra and Aa). Rin indicates the validity of IN. After Ain+ the environment is allowed to change IN. (Rout, Aout) control a level-sensitive latch at the output. [Figure: latches x and y, adder + and control block between IN and OUT, with handshake pairs Rin/Ain, Rx/Ax, Ry/Ay, Ra/Aa, Rout/Aout]

A simple filter: control spec. [Figure: block diagram as before, with the control STG over the transitions Rin+, Ain+, Rin-, Ain-, Rx+, Ax+, Rx-, Ax-, Ry+, Ay+, Ry-, Ay-, Ra+, Aa+, Ra-, Aa-, Rout+, Aout+, Rout-, Aout-]

A simple filter: control impl. [Figure: the control STG over Rin, Ain, Rx, Ax, Ry, Ay, Ra, Aa, Rout, Aout and its implementation with a C-element connecting Rin, Ain, Rx, Ax, Ry, Ay, Aa, Ra, Aout, Rout]

Taking delays into account: [Figure: STG with transitions x+, x-, y+, y-, z+, z- and a circuit with signals x, y, z] Delay assumptions: environment 3 time units, gates 1 time unit. Events: x+ x- y+ z+ z- x- x+ z- z+ y- Time:

Taking delays into account: [Figure: STG with transitions x+, x-, y+, y-, z+, z- and a circuit with signals x, y, z] Delay assumptions: unbounded delays. Events: x+ x- y+ z+ x- x+ y- Time: [a very slow gate leads to a failure]

Gate vs wire delay models: Gate delay model: delays in gates, no delays in wires. Wire delay model: delays in gates and wires.

Delay models for async. circuits:
Bounded delays (BD): realistic for gates and wires. Technology mapping is easy, verification is difficult.
Speed independent (SI): unbounded (pessimistic) delays for gates and negligible (optimistic) delays for wires. Technology mapping is more difficult, verification is easy.
Delay insensitive (DI): unbounded (pessimistic) delays for gates and wires. The DI class (built out of basic gates) is almost empty.
Quasi-delay insensitive (QDI): delay insensitive except for critical wire forks (isochronic forks). Formally, it is the same as speed independent. In practice, different synthesis strategies are used.
[Figure: relationship between the classes BD, SI, QDI, DI]

Desynchronization - concept: Start with a synchronous design. Replace the clock with local handshakes. Use standard CAD tools. The datapath is not changed. Correctness is guaranteed. Source: Eyal Friedman, Desynchronization - From Synchronous to Asynchronous design, Seminar in VLSI Architecture, Technion, Israel, Spring 2008

Desynchronization - flow steps: Main assumptions: normal combinatorial logic, DFFs, a single clock, a single clock edge.

Desynchronization flow step #1: Replace each DFF by master and slave (M+S) latches.

Desynchronization flow step #2: Add matched delays. Respect the bundling assumption: Delay > Tpd of CL. The delay serves as the completion signal.
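
The bundling condition (Delay > Tpd of CL) can be illustrated as a simple sizing rule. This is only a sketch: the function name, the picosecond units and the 10% safety margin are assumptions for illustration, not values from the flow described here.

```python
def matched_delay_ps(path_delays_ps, margin=0.10):
    """Pick a matched delay that exceeds the slowest combinational path.

    path_delays_ps: propagation delays (ps) of the paths through one CL block.
    margin: hypothetical relative safety margin added on top of the worst case.
    """
    worst_case = max(path_delays_ps)       # Tpd of the combinational logic
    return worst_case * (1.0 + margin)     # must be strictly greater than Tpd
```

Because the delay is sized for the slowest path, a desynchronized bundled-data stage still runs at worst-case speed locally; the benefit is that each stage is timed against its own logic instead of a global clock period.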

Desynchronization flow step #3: Replace the clock by local handshake controllers.

Why Asynchronous Design? We are used to synchronous design. Its logic and timing assumptions are simpler, but they do not hold in reality. Currently it is very hard to solve the big problems of synchronous design, such as clock skew, high power consumption and process variability. Common arguments for asynchronous design: Low power? High speed? Low emission? Low sensitivity to PVT (Process, Voltage, Temperature) variations? High modularity (SoC)? No clock distribution and timing problems (works)? Secure chips?

Why not Asynchronous Design? Overhead (area, speed, power). Hard to design: not decomposable into small combinatorial logic blocks, and converting a synchronous design to an asynchronous one typically fails. Few CAD tools: there is no real complete design flow available, and there is only one commercial async EDA vendor (Handshake Solutions), with a very specific design flow (HASTE). Hard to test: asynchronous test methods do not exist yet (or are not mature enough), and it is difficult to go into production without proper testing.

Available tools: Several tools are available for the automation of asynchronous design. Most of them were developed at universities. There are two groups of tools: those for the synthesis of asynchronous controllers and those for the design of complete systems. Group I: Minimalist, Petrify, 3D. Group II: BALSA, TAST, HASTE.

Minimalist: Developed at Columbia University. A burst-mode synthesis package based on the synthesis of asynchronous FSMs; it integrates synthesis, testability and verification tools. Good side: produces hazard-free control circuits; contains several different algorithms for synthesis; can provide generalized C-element based mapping and also behavioral Verilog. Bad side: does not support arbitration and EBM; no optimal algorithm selection.

Petrify: Designed by J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, A. Yakovlev. Synthesis of asynchronous controllers defined as Petri Nets or Signal Transition Graphs (STG). Good side: produces optimal hazard-free control circuits; can provide generalized C-element based mapping, complex-gate mapping and mapping to technology libraries. Bad side: supports only asynchronous design, not mixed sync-async; synthesis time grows exponentially with the number of signals, so it is suitable only for relatively small controllers.

3D: Produced by Kenneth Yun. An Extended Burst-Mode synthesis package. Good side: produces hazard-free control circuits; supports restricted multiple-input changes (input bursts) with don't-care inputs; supports input choices based on sampling possibly glitchy signals; suitable for mixed sync-async systems (like GALS). Bad side: no technology mapping; no optimal algorithm selection; no support or further development.

TAST: Produced by TIMA Laboratory, France. TAST is a compiler/synthesizer of asynchronous digital circuits from a high-level communication description language. The input is the CHP language, which can describe Petri Nets. It uses VHDL as the format for behavioral and post-synthesis simulation. It produces QDI (dual-rail, 1-M code rail) circuits. Good side: produces a complete asynchronous system and provides a full design flow. Bad side: uses the QDI style, which gives a very large area overhead; the output circuits are not optimized; not available at the moment.

TAST Design flow

BALSA: Produced by the University of Manchester. BALSA is a compiler/synthesizer of asynchronous digital circuits from a high-level communication description language. The input is the BALSA language, developed specially for this package. It produces bundled-data, dual-rail and 1-M code rail circuits. Good side: produces a complete asynchronous system and provides a full design flow. Bad side: gives a large overhead compared with manual design (up to 300%); not all tools are freely available.

BALSA Design Flow

Asynchronous Success Stories - Philips: Philips developed its own full design flow based on the TANGRAM language. The design flow also covers design for testability. Asynchronous demonstrators: DCC error corrector (low power); 80C (low power, low EMI); smartcards (low power, security). [Table: DCC error corrector, with columns date, area [mm²] and power [mW] and rows for synchronous, async (dual-rail, 93/), synchronous and async (single rail, 94/) versions]

Asynchronous Success Stories - Philips 80c51 (I): Application: pager baseband controller. First asynchronous microcontroller (µC) ever on the market. Motivations for an asynchronous implementation of the 80c51: low power, and low EMI for easy integration.

Asynchronous Success Stories - Philips 80c51 (II): Low power: the circuit is only active when and where needed.

Asynchronous Success Stories - Philips 80c51 (III): Low current peaks

Asynchronous Success Stories - Philips 80c51 (IV): Low EMI

Asynchronous Success Stories - RAPPID: RAPPID stands for Revolving Asynchronous Pentium Processor Instruction-length Decoder. The instruction length decoder was a performance bottleneck in ca vintage CISC processors. There was potential for optimizing the common (RISC-like) cases. Results: a novel, aggressive asynchronous method was developed. About 3x throughput (T = 3x); about one half the latency (L = 2x); about one half the power (P = 2x); about the same area (A = 0.8x). Taken together, T x L x P x A = 3 x 2 x 2 x 0.8 = 9.6, i.e. roughly a 10x improvement.

Asynchronous Success Stories - Amulet: The Amulet group was formed at Manchester University. Amulet1 (1994): transistors in 1.0 µm, ARM6 instruction set; half the instruction throughput with the same energy efficiency as ARM6. Amulet2e (1996): transistors in 0.5 µm, ARM7 compatible; still half the performance of a synchronous chip. Amulet3i (2000): transistors in 0.35 µm, ARM9 compatible; same performance as the synchronous solution with equal or marginally better energy efficiency.

Globally Asynchronous Locally Synchronous (GALS) Systems

GALS Technique: GALS is an abbreviation for Globally-Asynchronous Locally-Synchronous systems. GALS techniques have the potential to solve some of the most challenging design issues of SoC integration of communication systems.

Asynchronous wrapper GALS method: GALS can be used on its own or within the NoC concept. [Figure: synchronous blocks 1, 2 and 3, each enclosed in an asynchronous wrapper and attached to a network node; the wrappers communicate via Req, Ack and Data]

GALS as a Powerful Design Technique: In wireless communication systems, GALS can address the main design challenges. GALS makes data transfer between blocks very easy. Design problems such as timing closure and clock-tree generation are confined to much smaller local blocks. Decoupling the local blocks from a central clock source reduces spectral noise considerably. Power saving is automatically integrated in the asynchronous wrapper.

Power reduction with GALS: [Figure: power distribution in a high-performance CPU] The clock signal is the dominant source of power consumption. First estimations showed that power savings of about 30% could be expected in the clock net due to the application of GALS. Recently, some more pessimistic power estimation figures were presented. GALS techniques allow the frequency and voltage levels to be set independently for each locally synchronous module. When using dynamic voltage scaling (DVS), an average energy reduction of up to 30% can be reached.

Potential for reducing EMI with GALS: We simulated the noise generated on the power supply line in the synchronous and the request-driven GALS system. GALS gives a reduction of about 20 dB. [Figure: two noise spectra, amplitude in dB versus frequency in GHz]

GALS Opportunities – 3D Integration: 3D integration can be a very interesting application field. [Figure: stacked layers labeled Sensor, A/D, Memory, DSP, Comm]

GALS Opportunities - NoCs: Another interesting application is Networks on Chip and MPSoCs (Multi-Processor Systems-on-Chip).

GALS Opportunities – Process Scaling and Variability: Asynchronous design gives average-case performance, in contrast to the worst-case performance of a synchronous system. Variability of Vth makes individual transistors faster or slower, and more or less energy consuming. [Figure: 65 nm minimum-size transistor, nominal Vt; Vth variability = +/- 30% (+/- 3σ)]

GALS Methods: GALS based on synchronizers; GALS based on asynchronous FIFOs; GALS based on pausible clocking

GALS with the Synchronizers

GALS with FIFOs

Asynchronous wrappers: A GALS system usually contains synchronous islands that communicate with each other through asynchronous wrappers. An asynchronous wrapper surrounds each locally synchronous island. The wrapper consists of a pausable clock and input and output ports.

Classical Pausible Clocking GALS approach: [Figure: Locally Synchronous Module 1 with Local Clock Generator 1 and an output port inside Asynchronous Wrapper 1; Locally Synchronous Module 2 with Local Clock Generator 2 and an input port inside Asynchronous Wrapper 2; the ports exchange Data and a handshake, and drive stretch1 and stretch2 to the clock generators] Published in Jens Muttersbach et al., Globally-Asynchronous Locally-Synchronous Architectures to Simplify the Design of On-Chip Systems, In Proc. of ASIC/SOC Conference, pp , Sept

Pausable Clock Generator

Main challenges of the typical GALS methods: In many solutions, data transfer and throughput are critical problems. Most of them can transfer data only every second cycle of the local clock. Some described circuits can theoretically transfer data every clock cycle; however, intensive stretching of the pausable clock generator significantly diminishes the practical performance. The latency of the transferred data is not known in advance and may vary significantly from one transfer to the next. It is not very practical to use ring oscillators for local clock generation. All solutions are oriented towards very general applications; they are not optimised for specific systems and environmental demands.

Basic concept of the request-driven operation: This approach covers point-to-point communication with very intensive but bursty data transfer. While receiving an input burst, a GALS block can operate in a request-driven mode. When there is no input activity, the data stored inside the locally synchronous pipeline has to be flushed out; a local clock generator then drives the GALS block. A time-out function controls the transition from request-driven operation to local clock generation mode.

Request-driven asynchronous wrapper: The local clock can be generated either internally or externally.

What can we gain from this GALS technique? Reliable and fast transfer of large bursts of data is achieved. Data transfer is possible at every clock cycle of the synchronous block. In request-driven mode there is no arbitration in the input port, so the circuit responds immediately to input requests. The clock speed is determined by the master and not by the slower participant in the communication. The local clock can be generated internally or externally. The proposed architecture offers an efficient power-saving mechanism, similar to clock gating. EMI should be reduced due to the varying delays and frequencies in the different asynchronous wrappers.

Building the wrapper components - input port: The input port has to control the dataflow according to a broad 4-phase handshake protocol. It consists of a speed-independent (SI) input controller along with a few additional gates that provide glitch-free transitions of the input signals.

Input controller specification: The input controller is modeled as an AFSM (asynchronous finite state machine). The controller is specified according to burst-mode requirements. The burst-mode AFSM is implemented as a Huffman machine without explicit latches. [Figure: Huffman machine consisting of a hazard-free combinational network with inputs A, B, C, outputs X, Y, Z and fed-back state (several bits)] [Figure: state graph of the input controller with request-driven mode, local clock generation mode, transitional mode and idle mode]

Input controller implementation: The burst-mode input controller is synthesized using the 3D tool, which supports 2-level hazard-free logic minimization and achieves optimal state assignment:
REQ_INT = REQ_A1 REQ_INT + ACKC' REQ_INT + REQ_A1 ACKC' ST' ACKEN'
ACK_A = ACKC' REQ_INT + REQ_A1 RST + ACKC' ST ACKI1' ACKEN Z0' + REQ_A1 ACKC' ST' ACKEN'
ACKEN = ACKI1 + REQ_A1 ACKEN + ST ACKEN
RST = STOP + ACKC' REQ_INT + REQ_A1 RST + ST RST + ACKC' ST ACKI1' ACKEN Z0' + REQ_A1 ACKC' ST' ACKEN'
REQ_I1 = REQ_A1 ST ACKI1' ACKEN'
Z0 = ACKI1 + REQ_A1' ACKC + REQ_A1' ST' Z0 + ACKC' ACKEN Z0 + ACKC ACKEN' Z0
The logic equations are automatically converted into synthesizable structural VHDL code with our 3DC tool. Formal analysis of the asynchronous wrapper is performed.
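
To show how such sum-of-products equations are read, the sketch below evaluates two of them, ACKEN and REQ_I1, as Boolean functions. It assumes the usual notation (juxtaposition is AND, + is OR, a trailing ' is complement) and treats the fed-back ACKEN as an explicit previous-value argument. The function names are hypothetical and this is not the output of the 3D or 3DC tools.

```python
def next_acken(acki1, req_a1, st, acken):
    """ACKEN = ACKI1 + REQ_A1 ACKEN + ST ACKEN (ACKEN on the right is the fed-back value)."""
    return acki1 | (req_a1 & acken) | (st & acken)


def req_i1(req_a1, st, acki1, acken):
    """REQ_I1 = REQ_A1 ST ACKI1' ACKEN'."""
    return req_a1 & st & (1 - acki1) & (1 - acken)
```

Read this way, ACKEN is set by ACKI1 and then holds itself for as long as REQ_A1 or ST is high, which is the kind of feedback that lets a Huffman machine keep state without explicit latches.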

VHDL description of a port:
UN1: inv1x port map (ackc,t3);
UN2: inv1x port map (st,t4);
UN3: inv1x port map (clk1,t5);
UN4: inv1x port map (req,t6);
UN5: inv1x port map (ackeni,t7);
UN6: inv1x port map (endi,t8);
UN7: inv1x port map (z0,t9);
UN8: inv1x port map (z1,t10);
UN8i: inv1x port map (dvsi,t11);
U6: and2ix port map (reqci,ackc,t1);
U7: and2x port map (req,reqci,t28);
U8: and4x port map (req,t3,t4,t9,t12);
U9: or3x port map (t1,t28,t12,reqcix);
U7i: and2x port map (req,reseti,t2);
U7ii: and2x port map (st,acki,t31);
U13: and3x port map (req,t3,z0,t13);
U14: or5x port map (t1,t13,t12,t2,t31,ackix);
U10: and2x port map (ackc,ackeni,t14);
U12: and2x port map (t9,ackeni,t15);
U15: or3x port map (t15,t14,clk1,ackenix);
U11: and3x port map (st,t3,z0,t16);
U19: or5x port map (endi,t1,t2,t12,t16,resetix);
U17: and2x port map (t7,t9,t17);
U18: and3x port map (req,st,t5,t18);
U20: and2x port map (t18,t17,reqiix);
U25: and2x port map (req,z0,t22);
U26: and2x port map (st,z0,t23);
U23: and3x port map (ackc,t5,ackeni,t21);
U27: or4x port map (t21,t22,t23,endi,z0x);
U28: and2x port map (t6,ackc,t24);
U29: and2x port map (ackc,z1,t25);
U30: and3x port map (t6,t4,z1,t26);
U32: or3x port map (t25,t26,t24,z1x);

entity and2x is
  port (a,b: in std_logic; c: out std_logic);
end and2x;

architecture struc of and2x is
  attribute DONT_TOUCH_NETWORK of a,b,c: signal is true;
begin
  c<=(a and b) after 100 ps;
end struc;

Externally-driven GALS Wrapper

Clock Management Unit

Baseband processor for WLAN: The goal of one of our projects was to develop a wireless broadband communication system in the 5 GHz band. The modem is compliant with the IEEE 802.11a WLAN standard. The system uses Orthogonal Frequency Division Multiplexing (OFDM) with data rates ranging from 6 to 54 Mbit/s. The synchronous baseband processor was implemented as an ASIC (700k gates).

Structure of the synchronous baseband processor
- The baseband processor includes receiver and transmitter datapath structures.
- Very complex blocks are implemented, such as the Viterbi decoder, FFT, IFFT, and CORDIC processors.
[Figure: block diagram of the baseband processor (transmitter and receiver). Blocks shown: Input buffer, Scrambler, Signal field generator, Encoder, Interleaver, Mapper, Pilot insertion, Pilot scrambler, IFFT, Guard interval insertion, Preamble insertion, Synchronizer datapath, Channel estimator, Demapper, Deinterleaver, Viterbi decoder, Encoder, Interleaver, Mapper, Descrambler, Parallel converter, FFT, Synchronizer tracking, Buffers. The legend distinguishes blocks by sample rate (Msps).]

Design challenges in the baseband processor
The design of the baseband processor involves challenges such as:
- several clock domains,
- global clock tree generation,
- large number of clock leaves (36 k flip-flops),
- clock skew handling,
- timing closure between the different modules,
- clock gating,
- power consumption,
- EMI.
A request-driven GALS architecture was developed as a possible solution to these problems.

GALS partitioning
- The partitioning process has to take into account possible power saving.
[Figure: the baseband processor blocks grouped into GALS partitions Tx_1, Tx_2, Tx_3, Tx_int (async-sync interface), Rx_1, Rx_2, Rx_3, Rx_int (async-sync interface), Rx_TRA (token rate adaptation), FIFO TA, and an activation interface. Legend: 80 Msps block, 20 Msps block, rate adaption block, interface block.]

Test strategy
- We are using a hardware tester which is strictly cycle based and cannot react to asynchronous output signals of the circuit.
- The GALS arbitration processes preclude cycle-level determinism.
- We want to be able to run very complex functional tests internally.
- The applied test technique should support system diagnosis.
- A test strategy based on Built-In Self-Test (BIST) is proposed.
- BIST reduces the effort for generating a test program and enables us to use a synchronous tester.

Design for Testability in GALS
- TPG and TDE are based on the linear feedback shift register (LFSR) structure with embedded additional logic.
- A central BIST controller controls the test procedure.
- We can run hierarchical tests.
- This BIST technique can be used as a method for prototype verification.
- In combination with the scan approach, BIST can even be used as a basis for the manufacturing test.
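The LFSR structure named on this slide can be illustrated with a small sketch. The Python below models a 4-bit Fibonacci LFSR used as a pattern generator. The register width, the feedback polynomial x^4 + x^3 + 1, and the function names are illustrative assumptions, not the configuration of the TPG/TDE blocks in the GALS design.

```python
def lfsr_step(state: int) -> int:
    """Advance a 4-bit Fibonacci LFSR by one shift.

    Assumed feedback polynomial: x^4 + x^3 + 1 (illustrative choice).
    """
    feedback = (state ^ (state >> 1)) & 1   # XOR of the two tapped bits
    return (state >> 1) | (feedback << 3)   # shift right, feed back into the MSB


def lfsr_patterns(seed: int = 0b0001) -> list:
    """Return the patterns generated until the sequence returns to the seed."""
    if not 0 < seed < 16:
        raise ValueError("seed must be a nonzero 4-bit value")
    patterns = [seed]
    state = lfsr_step(seed)
    while state != seed:
        patterns.append(state)
        state = lfsr_step(state)
    return patterns


if __name__ == "__main__":
    for pattern in lfsr_patterns():
        print(f"{pattern:04b}")
```

With this polynomial the register cycles through all 15 nonzero states before repeating, which is why LFSRs are a compact source of pseudo-random test patterns.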

Design flow
- We have used the IHP 0.25 µm CMOS process.
- The asynchronous wrapper is equivalent to about 1.3 k inverter gates. The tunable clock generator alone accounts for 0.9 k gates.
- The asynchronous wrapper has a throughput of up to 150 Msps in request-driven mode and 100 Msps in local mode. This application needs 80 Msps.
[Figure: design flow. Asynchronous wrappers: AFSM specification, 3D logic synthesis, 3DC tool (translation from 3D to structural VHDL). Synchronous blocks: functional specification, VHDL description. Flow steps: abstract behavioural simulation, gate mapping, realistic behavioural simulation, timing-driven synthesis, post-synthesis simulation, layout, back annotation, tape-out. Tools: Synopsys DC, Cadence Silicon Encounter, ModelSim, PrimePower (power estimation), LoLA (formal analysis).]

Area and power distribution
- Area and power statistics are based on the synthesized netlist data.
- Locally synchronous blocks occupy around 90% of the total area. The BIST circuitry requires around 3.5%, interface blocks 2.9%, and asynchronous wrappers 2%.
- Power was estimated with the PrimePower tool, based on the switching activities in a realistic transceiver scenario.
- Synchronous datapath logic uses most of the power (around 52.4%), followed by the local synchronous clock trees (34.5%), async-to-sync interfaces (7%), and asynchronous wrappers (2.9%).
- After layout, the estimated power consumption is mW.

Implementation results
- Our GALS baseband processor has been fabricated and tested.
- The total number of pins is 120 and the silicon area including pads is 45.1 mm².
- The measured dynamic power dissipated in the purely synchronous baseband processor was 332 mW; for the GALS baseband processor it was slightly lower, at 328 mW.

Improving System Integration with GALS
Synchronous baseband processor challenges:
- several clock domains,
- global clock tree generation,
- large number of clock leaves,
- clock skew handling,
- timing closure between blocks,
- clock gating.
Solved by the GALS architecture:
- No global clock in GALS
- Clock leaves distributed over GALS blocks
- Clock skew is reduced from 660 ps to 486 ps
- Communication between the blocks through handshaking
- Clock gating embedded in the asynchronous wrapper

EMI measurement (I)
- The supply voltage variation spectrum of the inner processor core is measured.
[Figure: measured supply voltage spectrum; annotation in the plot: ~5 dB]

EMI measurement (II)
- Additionally, instantaneous cycle-to-cycle supply voltage peaks are reduced from 140 mV (synchronous design) to less than 100 mV (GALS).
- This reduction can be very important for mixed-signal designs and for secure systems.
- An application with fine-grained GALS partitioning can lead to results closer to the theoretical maximum reduction.
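To put the measured figures from the implementation, system integration, and EMI slides side by side, the relative reductions can be worked out directly. This is a plain arithmetic sketch; the helper name `relative_reduction` is hypothetical, and the supply-peak figure is a lower bound because the slide only states "less than 100 mV".

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the reduction from `before` to `after` as a percentage of `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Values quoted on the slides: synchronous design vs. GALS design.
    figures = {
        "dynamic power (mW)": (332.0, 328.0),
        "clock skew (ps)": (660.0, 486.0),
        "supply voltage peaks (mV, at least)": (140.0, 100.0),
    }
    for name, (sync_value, gals_value) in figures.items():
        print(f"{name}: {relative_reduction(sync_value, gals_value):.1f} % lower")
```

The numbers show that the GALS benefit in this design is mainly in skew and supply noise, while the dynamic power is almost unchanged.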

Conclusions
- There are several asynchronous designs currently on the market
- Asynchronous design is used most successfully in medium-complexity, medium-performance circuits
- Future applications:
  - GALS, large networks on chip (NoCs)
  - 3D integration
  - Some local blocks in GALS could then be asynchronous
- Asynchronous circuitry can provide lower EMI for SoCs
- The design and test flow remains a problem

Synchronous and GALS Networks on Chips

Synchronous and GALS NoCs
- Today, on-chip design is more and more communication-centric
- Classical topologies are not sufficient (point-to-point, mesh, bus, etc.)
- Shared bus = low performance
  - Bandwidth is shared
  - Bus width (bits) relatively small
  - Global clock frequency limited
- Disadvantage of multiple buses
  - Not scalable, not generic
- A promising alternative could be Networks on Chip (NoCs)
- NoCs can be implemented completely synchronously, mesochronously, or in GALS fashion

NoC Paradigm
- Apply network protocols to SoC
- Network:
  - Provides communication
  - Satisfies quality-of-service requirements:
    - Reliability
    - Performance: throughput, latency, ...
    - Power?
- Additional requirements unique to NoC
  - Energy bounds
  - Area
  - Fit into the standard design flow

Switching Network Basics
- Transport layer: message end-to-end
  - Implemented using network adapters
  - Assembly and disassembly of the packets at source/destination
- Network layer: packet end-to-end
  - Implemented using routers
  - Routers decide the routing path to the destination, using the header of the packet and topology knowledge
  - Scalable distributed system: load shared between routers
- Data-link layer: packet over link
  - Packets: header, payload, trailer
  - Error correction (on packet): redundancy, error correction codes
* Technion - Asynchronous NoC - Nikolai Samolazov

Bus vs. Network Arguments

Criterion         | Bus                                 | NoC
Scalability       | Every IP adds parasitic capacitance | Only P2P connections
                  | Timing is difficult                 | Can be pipelined
                  | Bus arbiter performance             | Load shared by routers
Bandwidth         | Limited and shared by all IP        | Scales with network size
Latency           | Zero when granted control           | Network latency always exists
Cost              | Low area                            | Significant area
Design complexity | Simple: well known and understood   | Requires changes in HW and sometimes SW levels

Hybrid Network
- Shared buses as first-level communication medium
- NoC routers as main communication devices

Homogeneous NoC
[Figure: NoC connecting identical functional units (FU)]
* NoC General Concepts - Andreas Ehliar - Per Karlström

Heterogeneous NoC
[Figure: NoC connecting different unit types: FU, MUL, ALU, DSP]


Quality of Service
- Guaranteed latency
- Guaranteed bandwidth
- Correctness

Design Issues - Physical layer
- To reduce power for signaling: low swing
- To improve noise immunity: differential signaling
- Trade-off: pseudo-differential signaling
- System integration possibilities: synchronous, mesochronous, GALS

Design Issues - Architecture
[Figure: NoC architecture with functional units (FU)]


Design Issues - Data-link
- Is error detection/correction needed?
  - Depends on BER/PER
- Media access control affects the power
  - Defined by how individual nodes access the media

NoC Design
- Architecture
  - Network adapter and router architecture: asynchronous or synchronous
  - Network topology
  - Routing strategy: static routing, adaptive routing
  - Interconnect: repeaters, pipelining
- Design technology
  - Tools and methodologies
  - Simulation and validation (correctness, performance, power): SystemC

Design Issues - Flow Control

Design Issues - Long Wires
- Solving the global interconnect mess
  - Delay
  - Bit errors
  - Repeaters
  - Clock domains
- Create one optimized solution that can be reused

Design Issues - Long Wires
- Add flip-flops to increase clock frequency
- What about ACKs?
[Figure: pipelined link between two NoC routers]


Design Issues - Long Wires
- Bit errors on long wires will not be avoidable in the future
- Use error correcting codes
  - Disadvantage: more wires, more throughput needed
- Use parity bits to discover errors
  - Resend damaged packets
  - No longer possible to guarantee real-time performance
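The parity-based detection mentioned on this slide can be sketched in a few lines. The example assumes one even-parity bit per transmitted word; the word width and the function names are illustrative and not taken from the slides.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word contains an odd number of set bits."""
    return bin(word).count("1") % 2


def is_damaged(received_word: int, received_parity: int) -> bool:
    """Return True if the received word does not match its parity bit."""
    return parity_bit(received_word) != received_parity


if __name__ == "__main__":
    word = 0b10110010
    parity = parity_bit(word)

    single_flip = word ^ 0b00000100   # one bit corrupted on the wire
    double_flip = word ^ 0b00001100   # two bits corrupted on the wire

    print("intact detected as damaged:", is_damaged(word, parity))
    print("single-bit error detected:", is_damaged(single_flip, parity))
    print("double-bit error detected:", is_damaged(double_flip, parity))
```

A single parity bit catches any odd number of flipped bits but misses an even number, and the receiver can only request a resend, which is why the slide notes that real-time guarantees are lost.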

Design Issues - Long Wires
- Possibility to create a heavily optimized solution
  - Low voltage signaling
  - Advanced symbol encoding/decoding
  - Wave pipelining

Design Issues - Long Wires
- High-performance interconnect through wave pipelining
- Needs very careful analysis
[Figure: wave-pipelined links between NoC routers]

Design Issues - Long Wires
- Wave pipelining performance
  - 3.45 GHz signaling on one bit line in 0.25 µm
  - More energy efficient than a regular pipeline
  - Faster than a regular pipeline
- Disadvantage
  - Much harder to test/verify

Network Topologies
- Mesh
- Tree
- Fat-tree
- Routing algorithm depends on topology

Routing
- Routing: path from source to destination
- Must be deadlock free and livelock free
  - Livelock: the message keeps moving indefinitely but never arrives
    - Possible only in adaptive non-minimal routing
  - Deadlock: packets waiting for each other in a cycle
- Three main categories:
  - Static (non-adaptive): predetermined path
  - Minimal fully adaptive: routes through any shortest path
  - Partially adaptive: multiple routing paths, some of which are not shortest
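As an illustration of the static (non-adaptive) category, the sketch below computes a predetermined path on a 2D mesh using dimension-order (XY) routing. XY routing is a common textbook example and is an assumption here; the slides do not name a specific algorithm. Coordinates and the function name are hypothetical.

```python
def xy_route(src, dst):
    """Return the list of (x, y) routers visited from src to dst.

    Dimension-order routing: travel along X until the column matches,
    then along Y. The path depends only on src and dst (static routing).
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:                 # first resolve the X offset
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:                 # then resolve the Y offset
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

Because every packet turns from X to Y at most once, this scheme is minimal and, on a mesh, avoids the cyclic waiting that causes deadlock; the price is that it cannot route around congested links.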

Wormhole Routing
- Header is forwarded as soon as possible, without waiting for the trailer
- Used in high-performance parallel computing networks (lumped)
  - Not in the internet (distributed)
- Packet may span several routers
  - Packet divided into flits (atomic flow control units)
- Main disadvantage: cascaded contention
  - Packet requests a busy link
  - VLSI routers have small buffers, so the packet cannot be buffered in one router
  - Routers spanned by the packet are stalled
  - Practical limitation, prevents achieving the theoretical bandwidth
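The division of a packet into flits can be sketched as follows. The flit format used here (a head flit carrying a one-byte destination, fixed-size body flits, and an empty tail flit) is a simplifying assumption for illustration; real NoCs define their own flit layouts.

```python
def to_flits(dest: int, payload: bytes, flit_bytes: int = 4):
    """Split a packet into (kind, data) flits: one head, N body, one tail."""
    if flit_bytes <= 0:
        raise ValueError("flit_bytes must be positive")
    if not 0 <= dest <= 255:
        raise ValueError("dest must fit in one byte in this sketch")
    flits = [("head", bytes([dest]))]            # header leaves first
    for i in range(0, len(payload), flit_bytes):
        flits.append(("body", payload[i:i + flit_bytes]))
    flits.append(("tail", b""))                  # trailer releases the path
    return flits


def from_flits(flits):
    """Reassemble (dest, payload) from flits received in order."""
    dest = flits[0][1][0]
    payload = b"".join(data for kind, data in flits if kind == "body")
    return dest, payload


if __name__ == "__main__":
    flits = to_flits(3, b"abcdefghij")
    for kind, data in flits:
        print(kind, data)
    print(from_flits(flits))
```

Since only the head flit carries routing information, the body flits must follow it through the same routers, which is what makes a blocked header stall every router the packet currently spans.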

NoC Design Characteristics: Cost
- Area
  - Network components area
  - Wires, repeaters area
- Power
  - Energy per transmitted packet
  - Idle power

NoC Design Characteristics: Performance
- Latency [sec]
  - From header leaving source to trailer reaching destination
  - Composed of waiting latency + network latency
- Waiting latency
  - Time the message waits before entering the network
- Network latency
  - Time the message travels inside the network
- Throughput [bits/sec]
  - Measured at a network port
  - Average amount of user data that is accepted by the network on that port in a certain amount of time
- Aggregate throughput [bits/sec]
  - Sum of the throughputs at all network ports
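The definitions on this slide translate directly into arithmetic. The sketch below restates them as three small helpers; the function names and the sample numbers are illustrative assumptions, not measurements from the presentation.

```python
def message_latency(waiting_s: float, network_s: float) -> float:
    """Latency = waiting latency + network latency, in seconds."""
    return waiting_s + network_s


def throughput(accepted_bits: float, interval_s: float) -> float:
    """Average user data accepted at one port per unit time, in bits/sec."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return accepted_bits / interval_s


def aggregate_throughput(port_throughputs) -> float:
    """Sum of the throughputs at all network ports, in bits/sec."""
    return sum(port_throughputs)


if __name__ == "__main__":
    print(message_latency(waiting_s=2e-6, network_s=3e-6))
    ports = [throughput(8000, 0.5), throughput(2000, 0.5)]
    print(ports, aggregate_throughput(ports))
```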

NoC Saturation
- Offered load: traffic produced by network clients as a percentage of the maximal network bandwidth
  - L: number of cycles needed to accept the message
  - D: average number of cycles between messages
- Saturation threshold: offered load at which the average latency rises exponentially towards an infinite value

Cost - Performance Tradeoff
Santiago Gonzalez Pestana et al., Cost-Performance Trade-offs in Networks on Chip: A Simulation-Based Approach, DATE

Architecture of On-Chip Router
[Figure: on-chip router architecture]
Sources: Technion, Asynchronous vs. Synchronous Design Techniques for NoCs; Robert Mullins, Asynchronous vs. Synchronous Design Techniques for NoCs

Router Pipeline
Numerous stages in the router pipeline:
- Raise communication latency
- Can make packet buffers less effective
- Incur pipelining overheads

Synchronous NoCs - Summary
- Can design high-performance single-cycle routers
- Design is simplified by the presence of global synchrony
- Distribution of the global clock can be eased by:
  - New clock generation / distribution techniques
  - Source-synchronous communication

Limitations of Fully-Synchronous Networks
1. Difficult to distribute clock
- Network is spread over the die and may have an irregular layout
- Minimising skew costs complexity and power
- Alternatives/extensions to PLL and H-tree:
  - Clock deskewing techniques
  - Distributed Clock Generator (DCG), distributed PLLs
  - Standing-wave oscillators and rotary clock schemes
  - Resonant global clocks, optical clock distribution, etc.

Limitations of Fully-Synchronous Networks
2. Single network clock frequency
- Communicating synchronous IP blocks may operate at different and potentially adaptive clock frequencies
- What is the most appropriate network clock frequency?

Why Asynchronous NoCs
- No clock distribution, simple solution
- Networked IP blocks run at different clock frequencies
- No synchronization issues at interfaces
- Ability to exploit data / path-dependent delays
- Low-latency common or high-priority paths through the router
- Freedom to optimize network links
- Not constrained by the need to distribute/generate multiple clock frequencies
- Can exploit high-frequency narrow links
- Dynamic latency/throughput trade-offs (adaptive pipeline depth)
- Exploit dynamic optimizations on links (e.g. DVS)
- Easy-to-use interfaces, modularity
- Robust and simple implementation
- Reduced design time
- Some arguments for reduced power

Different NoC Architectures
- Router clocks derived from a single source
- Locally generated clocks (periodic and free-running)
- Synchronous routers with asynchronous links
- Locally clocked routers / asynchronous interconnect (GALS style network)
  - Can support asynchronous interconnects
  - No longer exploiting the periodic nature of router clocks
  - Correct operation is independent of the delay of the link
- GALS interfaces with pausible clocks
  - If necessary, the clock is stretched; data is always transferred reliably
  - Need to construct a local delay line
  - Local aperiodic clock generation
- Data-driven local clock
  - Similarities to stoppable GALS interface and asynchronous priority arbiters

Mesochronous Clocking
- Clock skew may force the system to be partitioned into multiple clock domains
- With a single clock source, only the phase of each router's clock differs, so simple error-free clock-domain crossing is possible

Router clocks derived from a single source
- Each router's clock may be generated from the global network clock, either by:
  - Clock division, or
  - Clock multiplication
- Clock domain crossing techniques can exploit known clock frequency relationships
References:
- Chakraborty and M. Greenstreet, Efficient Self-Timed Interfaces for Crossing Clock Domains, In Proceedings ASYNC03
- L. F. G. Sarmenta, G. A. Pratt and S. A. Ward, Rational Clocking, ICCD95

Using Synchronisers for GALS NoCs
- Asynchronous channel uses 4-phase bundled data protocol
A. Sheibanyrad, A. Greiner, Two efficient synchronous asynchronous converters well-suited for networks-on-chip in GALS architectures,
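The 4-phase bundled-data protocol mentioned on this slide is a return-to-zero handshake: the sender places data and raises the request, the receiver raises the acknowledge, and then both signals return to zero before the next transfer. The sketch below only records that event order for a list of data items; the signal names `req` and `ack` and the event representation are illustrative assumptions.

```python
def four_phase_transfer(items):
    """Return the ordered (signal, value) events for a 4-phase handshake."""
    events = []
    for item in items:
        events.append(("data", item))  # data must be stable before req rises
        events.append(("req", 1))      # phase 1: sender signals data valid
        events.append(("ack", 1))      # phase 2: receiver has latched the data
        events.append(("req", 0))      # phase 3: sender returns to zero
        events.append(("ack", 0))      # phase 4: receiver returns to zero
    return events


if __name__ == "__main__":
    for signal, value in four_phase_transfer([0xA5, 0x3C]):
        print(signal, value)
```

Each transfer costs four control-signal transitions, which is the usual trade-off of 4-phase signaling against the simpler circuitry it allows compared with 2-phase schemes.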

Locally Generated Clocks (periodic & free-running)
- Can exploit knowledge about clocks (when crossing clock domains) even if all we know is that they are periodic. Examples:
  - predictive synchronizers [Dally][Frank/Ginosar]
  - asynchronous FIFOs [Chakraborty/Greenstreet]

Using Asynchronous FIFOs in GALS NoCs
- Synchronous network wrapper: assembly/disassembly of data packets
- Can connect many independent clock domains

NoC architecture for low power
- The NoC concept together with the GALS methodology gives good opportunities for power saving
- Each hardware block in a NoC system can be set to its optimal frequency/voltage
- It is best to combine DVFS with the GALS concept in order to reduce power

NoC architecture for DVFS - LETI Solution (NoCs 2008)
- A fully asynchronous Network-on-Chip
- IP units are synchronous islands using a programmable Local Clock Generator
- Within the IP unit:
  - Synchronization is done with a pausable clock
  - A Power Unit manages the internal Vcore, generated from external Vhigh and Vlow
  - A Network Interface is in charge of NoC communications
  - Local power management
- Main CPU in charge of global power management

DVFS with GALS NoCs
- Each synchronous IP is an independent power and frequency domain
- Local fine-grain Dynamic Voltage Scaling:
  - A local hardware controller controls the transitions between Vhigh and Vlow
  - Ensures smooth DVS transitions for safe IP computation
- Local fine-grain Dynamic Frequency Scaling:
  - Automatic frequency scaling
  - Clock generation is reprogrammed to find the optimal V/F point of operation
- Thanks to the pausable clock technique, the IP unit continues its operation during DVFS phases
- The GALS architecture with local clock generation is a natural enabler for easy local DVFS

NoC Unit architecture
- Each IP core is encapsulated with:
  - Network Interface
  - Test Wrapper
  - Pausable Clock
  - Power Supply Unit
- IP units have 5 supply modes
  - Init: reset at Vhigh (1.2 V)
  - High: Vhigh supply
  - Low: Vlow supply (0.8 V)
  - Hopping: switch Vhigh / Vlow for DVFS
  - Idle: retention state at Vlow (no clock)
  - Off: stand-by mode

Local Power Manager
- The Local Power Manager handles unit power modes
- A set of registers, programmable through the NoC
- Configuration of:
  - Programmable delay line
  - Power Supply Unit
- A pulse width modulator is used to control the Hopping mode

Power Supply Unit
- The Power Supply Unit manages Vcore
- Two power switches, Thigh and Tlow
  - LVT transistors
- A Hopping Unit
- An Ultra Cut-Off Generator

Hopping Unit
- Energy per operation scales with V²
  - Decrease voltage (and frequency) to be energy efficient
- "Triple state" power supply
  - Use of two PMOS power switches
  - Vhigh (1.2 V), Vlow (0.7 V), or OFF (0 V)
- Switch between Vhigh and Vlow
  - Transitions take less than 100 ns
  - Mean speed / mean power of the IP is programmed by a PWM
- Compatible with synchronous and asynchronous IPs
  - For a GALS system: coordination is done with the local clock generator
- Can easily be integrated in any CMOS circuit
  - No inductor, contrary to traditional DC/DC converters
  - No capacitor, contrary to charge pump implementations
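The two statements on this slide, that energy per operation scales with V² and that a PWM sets the mean speed and mean power, can be illustrated with a first-order model. The sketch below uses the Vhigh and Vlow values quoted on the slide; the clock frequencies, the duty cycle, and the function names are hypothetical, and transition costs between the two supply levels are ignored.

```python
def energy_ratio(v_high: float, v_low: float) -> float:
    """Energy per operation at v_low relative to v_high, assuming E ~ V^2."""
    if v_high <= 0 or v_low <= 0:
        raise ValueError("voltages must be positive")
    return (v_low / v_high) ** 2


def time_weighted_mean(duty_high: float, value_high: float, value_low: float) -> float:
    """Mean of a quantity when a fraction duty_high of the time is spent at Vhigh."""
    if not 0.0 <= duty_high <= 1.0:
        raise ValueError("duty_high must be between 0 and 1")
    return duty_high * value_high + (1.0 - duty_high) * value_low


if __name__ == "__main__":
    # Vhigh = 1.2 V and Vlow = 0.7 V are the values quoted on the slide.
    print(f"energy per operation at Vlow: {energy_ratio(1.2, 0.7):.2f} x Vhigh")
    # Hypothetical clock frequencies (MHz) at the two supply levels.
    print(f"mean speed at 25 % duty: {time_weighted_mean(0.25, 100.0, 40.0):.1f} MHz")
```

Sweeping the duty cycle between 0 and 1 gives a continuum of mean speed and power points from only two physical supply levels, which is the point of the hopping scheme.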

Ultra Cut-Off Generator
- When the gate is reverse polarized, the leakage current goes through a minimum
- The optimal polarization point varies with the temperature, the supply voltage, and the process corners
- The proposed UCO generator automatically polarizes the gate of the power switch to its point of minimum leakage
  - Compensates for temperature variation and alleviates corner variations
- Gate oxide reliability is addressed by introducing a passive stress reduction mechanism

Pausable Clock Interface
- Temporarily pauses the clock when a transfer (NoC) or a supply switch is required
- Based on:
  - Two GALS ports: Synchronous-to-Asynchronous and Asynchronous-to-Synchronous
  - A programmable delay line
  - A pausable clock generator
- The pausable clock generator arbitrates pause requests

Pausable Clock Interface
- Programmable delay line
  - Precise, small, and low power
  - Uses standard cells
  - On the same unit power domain

Power Gain
- The programmable delay line matches the unit logic on the same power domain
  - Any mismatch is compensated by reprogramming
- Power reduction
  - Vhigh = 1.2 V and Vlow = 0.8 V
  - 35 % dynamic power reduction between High and Low modes
  - Hopping mode is used to save power without any latency cost
  - Leakage power is reduced by two decades thanks to the UCO
- Power Supply Unit efficiency
  - Hopping Unit: only resistive losses in the power transistors
  - About 1 mW dynamic power => more than 95 % power efficiency
  - 90 % total efficiency (external DC-DC taken into account)
- An adaptive and reliable Power Supply Unit giving a high power reduction factor and high power efficiency

Physical Implementation
- Power switch
  - One single power switch for the complete power domain
  - Sized to get a speed loss < 5%
  - Area: about < 5% of the power domain
- Hopping Unit
  - Area: 140 μm * 35 μm
  - Hopping transition: < 100 ns

Synchronous or Asynchronous?
- A clockless on-chip network appears to be an elegant solution, although some questions remain:
  - Test
  - Performance concerns
- Shouldn't asynchronous designs offer latency advantages?
  - Fast local control, path/data-dependent delays, DI interconnects
- Perhaps asynchronous routers mimic synchronous architectures too closely?
  - Exploit flexibility, novel architectures, different topologies
- Overheads for data-driven clocking or GALS currently look small in comparison to the classical approach
- Synchronous design has advantages too
  - Predictability and determinism can be exploited
  - Fast single-cycle routers possible
  - Global snapshot of state is good for scheduling
- Still lots of interesting research to be done

GALAXY project
- The GALAXY project (GALS InterfAce for CompleX Digital SYstem Integration) is funded under the FP7 program of the EU

Project goals
- This project builds on a technology approach in which the EU currently has world leadership
- We are on the way to providing an integrated GALS NoC design flow
- We will provide an interoperability framework between the existing open and commercial CAD tools
- The project is evaluating the ability of the GALS approach to solve system integration issues, implementing a complex GALS system in a 40 nm CMOS process, and exploring its low-EMI and low-power properties and its robustness to process variability