1 Asynchronous Circuit Design, GALS Systems, Synchronous and GALS NoCs
DAAD Workshop, Nis, Serbia, July 2009
Dr. Miloš Krstić
IHP, Im Technologiepark 25, 15236 Frankfurt (Oder), Germany, www.ihp-microelectronics.com
© 2009 - All rights reserved

2 Overview
Motivation
Problems of the synchronous design
Asynchronous circuit design
GALS: state of the art
Synchronous and GALS NoCs

3 Challenges with Synchronous Design
Most digital systems today operate synchronously. However, the complexity of electronic systems grows enormously.

4 Classical Synchronous Paradigm
Usually digital circuits are designed to work synchronously.
[Figure: registers R1 to R4, combinational logic CL3 and CL4, clock CLK]

5 Synchronous communication
Clock edges determine the time instants at which data must be sampled.
Data wires may glitch between clock edges (setup/hold times must be satisfied).
Data are transmitted at a fixed rate, the clock frequency.
[Figure: example bit stream 110010]

6 Problems with Synchronous Design
As clock speeds increase, clock distribution becomes difficult:
We need to minimize clock skew.
There is an upper limit to clock speed that depends on the material properties of the device.
It is not possible to propagate a signal from one side of the chip to the other within a single clock cycle.
Worst-case performance.
Sensitive to variations in voltage, temperature, and process.
Not modular (fixed clock rate: poor match for reusability of components).
The clock burns a large fraction of chip power (~40-70%).
Synchronization failure.

7 What is Asynchronous Design? (I)
Synchronization is achieved without a global clock.
Asynchronous communication: handshake mechanisms.
[Figure: sender and receiver connected by request, acknowledge, and data signals]

8 What is Asynchronous Design? (II)
[Figure: registers R1 to R4 and combinational logic CL3, CL4 with a controller CTL and REQ/ACK signals; example of token flow over a link/channel with REQ, ACK, and DATA]

9 Asynchronous design styles (I)
Bundled data (single rail), 4-phase protocol.
This style is very widely used because its asynchronous controllers are very small and fast.
[Figure: REQ, ACK, and n-bit DATA waveforms of the 4-phase protocol, annotated "always like this" and "some variations"]
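The 4-phase (return-to-zero) sequence named on this slide can be sketched as an ordered event list. This is a minimal illustration in Python, not taken from the slides; the function name `four_phase_transfer` and the tuple format are assumptions chosen for the example.

```python
def four_phase_transfer(value):
    """Return the ordered signal events of one 4-phase bundled-data transfer.

    Hypothetical helper for illustration only; the event format is an assumption.
    """
    return [
        ("DATA", value),  # sender drives stable data before raising REQ (bundling)
        ("REQ", 1),       # phase 1: sender raises request
        ("ACK", 1),       # phase 2: receiver captures data and raises acknowledge
        ("REQ", 0),       # phase 3: sender returns request to zero
        ("ACK", 0),       # phase 4: receiver returns acknowledge to zero
    ]


events = four_phase_transfer(0b110010)
```

Each transfer costs four control transitions, which is the price paid for the simple controllers.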

10 Bundled data
Validity signal: similar to an aperiodic local clock.
n-bit data communication requires n+1 wires.
Data wires may glitch when not valid.
[Figure: example bit stream 110010]

11 Asynchronous design styles (II)
Bundled data (single rail), 2-phase protocol.
This style looks simpler and faster than 4-phase, but the controllers are more complex.
[Figure: REQ, ACK, and n-bit DATA waveforms of the 2-phase protocol]
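For comparison with the 4-phase style, a 2-phase (transition-signalling) transfer can be sketched in Python as follows. The function name `two_phase_transfers` and the trace format are assumptions made for this illustration.

```python
def two_phase_transfers(values):
    """Return the event trace for a sequence of 2-phase transfers.

    Hypothetical helper for illustration only.
    """
    req, ack, trace = 0, 0, []
    for value in values:
        trace.append(("DATA", value))  # data is stable before REQ toggles
        req ^= 1                       # any REQ transition marks new valid data
        trace.append(("REQ", req))
        ack ^= 1                       # any ACK transition completes the transfer
        trace.append(("ACK", ack))
    return trace


trace = two_phase_transfers([5, 9])
```

Only two control transitions occur per transfer, but the controller has to react to both rising and falling edges, which is where the extra complexity comes from.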

12 Asynchronous design styles (III)
4-phase dual-rail protocol.
Each data bit is encoded into 2 wires.
Allows the generation of delay-insensitive circuits.
Introduces a very big area overhead.
[Figure: ACK and 2n-wire DATA; waveform alternating between EMPTY and VALID]

Value      d.t  d.f
EMPTY      0    0
VALID 0    0    1
VALID 1    1    0
Not used   1    1

13 Dual rail
Two wires per bit.
00 = spacer, 01 = 0, 10 = 1.
[Figure: example dual-rail waveform]
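The dual-rail code can be written down directly. The sketch below is in Python and assumes each bit is held as a pair (d.t, d.f), matching the slide's convention that 01 means 0 and 10 means 1; the function names are hypothetical.

```python
SPACER = (0, 0)  # "empty" codeword separating two valid values


def encode_bit(bit):
    """Map a logical bit to its (d.t, d.f) pair: 0 -> (0, 1), 1 -> (1, 0)."""
    return (1, 0) if bit else (0, 1)


def decode_pair(pair):
    """Return 0 or 1 for a valid pair, None for the spacer."""
    if pair == SPACER:
        return None
    if pair == (0, 1):
        return 0
    if pair == (1, 0):
        return 1
    raise ValueError("11 is not a valid dual-rail codeword")


def encode_word(bits):
    """Encode a list of bits; n bits need 2n wires."""
    return [encode_bit(b) for b in bits]


word = encode_word([1, 1, 0])
```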

14 Asynchronous modules
Signaling protocol:
reqin+ start+ [computation] done+ reqout+ ackout+ ackin+
reqin- start- [reset] done- reqout- ackout- ackin-
[Figure: module with DATA PATH (Data IN, Data OUT) and CONTROL (req in, ack in, req out, ack out), connected by start and done]

15 Asynchronous components
Asynchronous design requires additional components and special logic.
Such components are not available in a standard synchronous design kit.
The critical components are the C-element and the Mutex.

16 Muller C-element

a  b  z
0  0  0
0  1  no change
1  0  no change
1  1  1
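The truth table can be captured in a small behavioural model. This is a Python sketch for illustration; the class name `CElement` and its `update` method are assumptions, and delays are not modelled.

```python
class CElement:
    """Behavioural Muller C-element: output follows the inputs when they agree."""

    def __init__(self, state=0):
        self.z = state  # stored output value

    def update(self, a, b):
        if a == b:      # both 0 -> z = 0, both 1 -> z = 1
            self.z = a
        return self.z   # inputs differ: no change


c = CElement()
outputs = [c.update(0, 1), c.update(1, 1), c.update(1, 0), c.update(0, 0)]
```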

17 Mutual Exclusion element
ME prevents multiple event propagation.
ME is used for arbitration.

18 Dual-rail logic
Dual-rail logic requires additional logic for each logical operation.
[Figure: dual-rail AND gate with inputs A.t, A.f, B.t, B.f and outputs C.t, C.f]
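A dual-rail AND can be sketched functionally as below. This Python model is an assumption-laden simplification: it uses the combinational minterm form, whereas a real delay-insensitive gate would typically combine the rails with C-elements so that the return to the spacer is also sequenced. The function name `dual_rail_and` is hypothetical.

```python
def dual_rail_and(a, b):
    """Dual-rail AND on (t, f) pairs; output stays at the spacer until both inputs are valid."""
    a_t, a_f = a
    b_t, b_f = b
    c_t = a_t & b_t                                  # true only if both inputs are 1
    c_f = (a_f & b_f) | (a_f & b_t) | (a_t & b_f)    # false if at least one input is 0
    return (c_t, c_f)


results = [
    dual_rail_and((1, 0), (1, 0)),  # 1 AND 1
    dual_rail_and((1, 0), (0, 1)),  # 1 AND 0
    dual_rail_and((0, 0), (1, 0)),  # spacer AND 1: output not yet valid
]
```

Four product terms and an OR replace one single-rail AND gate, which illustrates the area overhead.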

19 Completion detection (dual-rail)
[Figure: completion detection tree built from C-elements, producing done]
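The idea behind dual-rail completion detection can be sketched as follows. This Python model assumes the detection tree behaves like one n-input C-element fed by the per-bit OR of the two rails; the class name `CompletionDetector` is hypothetical.

```python
class CompletionDetector:
    """done rises when every bit is valid and falls when every bit is a spacer."""

    def __init__(self):
        self.done = 0

    def update(self, word):
        valid = [t | f for t, f in word]  # a bit is valid when one rail is high
        if all(valid):
            self.done = 1
        elif not any(valid):
            self.done = 0
        return self.done                  # mixed state: hold the previous value


det = CompletionDetector()
trace = [
    det.update([(1, 0), (0, 0)]),  # only one bit has arrived
    det.update([(1, 0), (0, 1)]),  # all bits valid
    det.update([(0, 0), (0, 1)]),  # partially reset
    det.update([(0, 0), (0, 0)]),  # all spacers
]
```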

20 Completion detection (bundled-data)
Conventional logic + matched delay.
[Figure: logic block with a delay element between start and done]

21 Muller pipeline
The delay-insensitive handshake machine.
C[i] accepts 1/0 from C[i-1] only if C[i+1] = 0/1.
Think of 1010101... as waves: 1 0 1 0 1 0 1...
The C-elements propagate waves precisely.
Timing depends on local delays and may vary along the pipe.
If RIGHT is quiet, the pipe will fill and stall.
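The rule "C[i] accepts 1/0 from C[i-1] only if C[i+1] = 0/1" can be simulated with a few lines of Python. This is a sketch under simplifying assumptions: all stages are updated simultaneously in unit steps, `left` is the request driven into the first stage, and `right` is the acknowledge seen by the last stage. The function names are hypothetical.

```python
def step(c, left, right):
    """One simultaneous update of all C-elements in the pipeline."""
    n = len(c)
    nxt = list(c)
    for i in range(n):
        a = left if i == 0 else c[i - 1]              # value offered by the predecessor
        b = 1 - (right if i == n - 1 else c[i + 1])   # inverted successor state
        if a == b:                                    # C-element: change only on agreement
            nxt[i] = a
    return nxt


def settle(c, left, right, max_steps=100):
    """Apply step() until the pipeline state stops changing."""
    for _ in range(max_steps):
        nxt = step(c, left, right)
        if nxt == c:
            break
        c = nxt
    return c


s1 = settle([0, 0, 0], left=1, right=0)  # a 1-wave travels to the end
s2 = settle(s1, left=0, right=0)         # a 0-wave follows and stops behind it
s3 = settle(s2, left=1, right=0)         # one more wave fits, then the pipe is full
s4 = settle(s3, left=0, right=0)         # a further input change is not accepted
```

With the right side quiet (`right` held at 0), the waves pile up and the state stops changing, which corresponds to the fill-and-stall behaviour described on the slide.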

22 Micropipelines (Sutherland 89)
[Figure: pipeline of latches (L) and logic, controlled by C-elements with delay elements; handshake signals R in, A out, R out, A in]

23 Abstract Pipeline
Tokens and bubbles.
Valid (0 or 1, who cares) and Empty tokens.
[Figure: pipeline state E V V E E]

24 Abstract Rings
3 stages, 1 bubble: 3 steps for token round, 6 steps to cycle.
[Figure: ring states VEV, VEE, VVE, EVE, with token and bubble marked]

25 Building Blocks
Latch
Source
Sink
Fork
Join (wait for all)
Merge (wait for one)
MUX (0, 1)
DEMUX (0, 1)
Function Block (Join; CL; Fork)

26 Describing Asynchronous Circuits: STGs
[Figure: signal transition graph with events A+, B+, A-, B- and the corresponding waveforms of A and B; A is an input, B is an output]

27 Control specification: C-element
[Figure: C-element with inputs A, B and output C, and its signal transition graph over the events A+, B+, C+, A-, B-, C-]

28 Control specification: FIFO controller
[Figure: FIFO controller with signals Ri, Ao, Ro, Ai, built from C-elements, and its signal transition graph over the events Ri+, Ao+, Ri-, Ao-, Ro+, Ai+, Ro-, Ai-]

29 A simple filter: specification
y := 0;
loop
  x := READ (IN);
  WRITE (OUT, (x+y)/2);
  y := x;
end loop
[Figure: filter block with ports IN and OUT and handshake signals R in, A in, R out, A out]
Source: J. Cortadella, Introduction to asynchronous circuit design: specification and synthesis
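The filter specification translates almost line by line into an executable sketch. The Python version below assumes READ and WRITE map onto a generator that consumes an input sequence and yields outputs, and it uses real-valued division; the name `averaging_filter` is hypothetical.

```python
def averaging_filter(samples):
    """Yield the average of each sample and its predecessor (y starts at 0)."""
    y = 0
    for x in samples:      # x := READ(IN)
        yield (x + y) / 2  # WRITE(OUT, (x + y) / 2)
        y = x              # y := x


out = list(averaging_filter([2, 4, 8]))
```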

30 A simple filter: block diagram
[Figure: latches x and y, adder +, and control block, with handshake pairs (R in, A in), (R out, A out), (Rx, Ax), (Ry, Ay), (Ra, Aa) and data ports IN and OUT]
x and y are level-sensitive latches (transparent when R=1).
+ is a bundled-data adder (matched delay between Ra and Aa).
R in indicates the validity of IN.
After A in+, the environment is allowed to change IN.
(R out, A out) control a level-sensitive latch at the output.

31 A simple filter: control spec.
[Figure: filter block diagram and the signal transition graph of the control over the rising and falling events of R in, A in, Rx, Ax, Ry, Ay, Ra, Aa, R out, A out]

32 A simple filter: control impl.
[Figure: signal transition graph of the control and its implementation with a C-element connecting R in, A in, Rx, Ax, Ry, Ay, Ra, Aa, R out, A out]

33 Taking delays into account
[Figure: signal transition graph over x+, x-, y+, y-, z+, z- and a circuit with signals x, y, z]
Delay assumptions: environment 3 time units, gates 1 time unit.
events: x+ x- y+ z+ z- x- x+ z- z+ y-
time: 3 4 5 6 7 9 10 12 13 14

34 Taking delays into account
[Figure: signal transition graph over x+, x-, y+, y-, z+, z- and a circuit with signals x, y, z, annotated "very slow" and "failure!"]
Delay assumptions: unbounded delays.
events: x+ x- y+ z+ x- x+ y-
time: 3 4 5 6 9 10 11

35 Gate vs wire delay models
Gate delay model: delays in gates, no delays in wires.
Wire delay model: delays in gates and wires.

36 Delay models for async. circuits
Bounded delays (BD): realistic for gates and wires. Technology mapping is easy, verification is difficult.
Speed independent (SI): unbounded (pessimistic) delays for gates and negligible (optimistic) delays for wires. Technology mapping is more difficult, verification is easy.
Delay insensitive (DI): unbounded (pessimistic) delays for gates and wires. The DI class (built out of basic gates) is almost empty.
Quasi-delay insensitive (QDI): delay insensitive except for critical wire forks (isochronic forks). Formally, it is the same as speed independent. In practice, different synthesis strategies are used.
[Figure: relation of the classes BD, SI, QDI, DI]

37 Desynchronization: concept
Start with a synchronous design.
Replace the clock with local handshakes.
Use standard CAD tools.
Does not change the datapath.
Guaranteed correctness.
Source: Eyal Friedman, Desynchronization - From Synchronous to Asynchronous design, Seminar in VLSI Architecture, Technion, Israel, Spring 2008

38 Desynchronization: flow steps
Main assumptions:
Normal combinatorial logic, DFFs
Single clock
Single clock edge

39 Desynchronization flow step #1
Replace each DFF by master and slave (M+S) latches.

40 Desynchronization flow step #2
Add matched delays.
Respect the bundling assumption: delay > Tpd of CL.
The delay serves as the completion signal.
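The bundling constraint "delay > Tpd of CL" amounts to sizing the matched delay above the slowest combinational path. A minimal Python sketch, assuming path delays are known in picoseconds and a fixed safety margin is added; the function name, the margin, and the numbers are hypothetical.

```python
def matched_delay_ps(path_delays_ps, margin_ps):
    """Return a matched delay that exceeds the worst-case path delay (Tpd) by margin_ps."""
    t_pd = max(path_delays_ps)  # Tpd of the combinational logic block
    return t_pd + margin_ps     # a positive margin keeps delay > Tpd


delay = matched_delay_ps([620, 850, 410], margin_ps=100)
```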

41 Desynchronization flow step #3
Replace the clock by local handshake controllers.

42 Why Asynchronous Design?
We are used to synchronous design.
Its logic and timing assumptions are simpler, but not true in reality.
Currently it is very hard to solve the big problems of synchronous design, such as clock skew, high power consumption, and process variability.
Common arguments for asynchronous design:
Low power?
High speed?
Low emission?
Low sensitivity to PVT (process, voltage, temperature) variations?
High modularity (SoC)?
No clock distribution and timing problems (works)?
Secure chips?

43 Why not Asynchronous Design?
Overhead (area, speed, power).
Hard to design: not decomposable into small combinatorial logic blocks; converting a synchronous design to an asynchronous one typically fails.
Few CAD tools: there is no really complete design flow available; there is only one commercial async EDA vendor (Handshake Solutions), with a very specific design flow (HASTE).
Hard to test: asynchronous test methods are not present yet (or not mature enough), and it is difficult to go into production without proper testing.

44 Available tools
There are several tools available for the automation of asynchronous design.
Most tools are developed at universities.
Two groups of tools: for the synthesis of asynchronous controllers and for the design of complete systems.
Group I: Minimalist, Petrify, 3D
Group II: BALSA, TAST, HASTE

45 Minimalist
Developed at Columbia University.
Burst-mode synthesis package based on the synthesis of asynchronous FSMs.
Integrates synthesis, testability, and verification tools.
Good side:
Produces hazard-free control circuits.
Contains several different algorithms for synthesis.
Can provide generalized C-element based mapping and also behavioral Verilog.
Bad side:
Does not support arbitration and EBM.
No optimal algorithm selection.

46 Petrify
Designed by J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, A. Yakovlev.
Synthesis of asynchronous controllers defined as Petri nets or signal transition graphs (STG).
Good side:
Produces optimal hazard-free control circuits.
Can provide generalized C-element based mapping, complex-gate mapping, and mapping to technology libraries.
Bad side:
Supports only asynchronous design, not mixed sync-async.
Synthesis time grows exponentially with the number of signals.
Suitable only for relatively small controllers.

47 3D
Produced by Kenneth Yun.
Extended burst-mode synthesis package.
Good side:
Produces hazard-free control circuits.
Supports restricted multiple-input changes (input bursts) with don't-care inputs.
Supports input choices based on sampling possibly glitchy signals.
Suitable for mixed sync-async systems (like GALS).
Bad side:
No technology mapping.
No optimal algorithm selection.
No support and no further development.

48 TAST
Produced by TIMA Laboratory, France.
TAST is a compiler/synthesizer of asynchronous digital circuits from a high-level communication description language.
The input is the CHP language, which can describe Petri nets.
It uses VHDL as the format for behavioral and post-synthesis simulation.
Produces QDI (dual-rail, 1-M code rail) circuits.
Good side:
Produces a complete asynchronous system and provides a full design flow.
Bad side:
Uses the QDI style, which gives a very big area overhead.
Output circuits are not optimized.
Not available at the moment.

49 TAST Design flow
[Figure: TAST design flow]

50 BALSA
Produced by the University of Manchester.
BALSA is a compiler/synthesizer of asynchronous digital circuits from a high-level communication description language.
The input is the BALSA language, developed specially for this package.
Produces bundled-data, dual-rail, and 1-M code rail circuits.
Good side:
Produces a complete asynchronous system and provides a full design flow.
Bad side:
Gives a large overhead compared with manual design (up to 300%).
Not all tools are freely available.

51 BALSA Design Flow
[Figure: BALSA design flow]

52 Asynchronous Success Stories: Philips
Philips developed its own full design flow based on the TANGRAM language.
The design flow also contains design for testability.
Asynchronous demonstrators:
DCC error corrector, 1993-1994: low power
80C51, 1995: low power, low EMI
Smartcards, 1998: low power, security

DCC error corrector    date    area [mm^2]   power [mW]
synchronous            93      3.4           2.60
async (dual-rail)      93/05   7.0           0.41
synchronous            94      3.3           0.60
async (single rail)    94/09   3.9           0.08

53 Asynchronous Success Stories: Philips 80c51 (I)
Application: pager baseband controller.
First asynchronous µC ever on the market.
Motivations for an asynchronous solution of the 80c51:
Low power
Low EMI for easy integration

54 Asynchronous Success Stories: Philips 80c51 (II)
Low power issue: the circuit is only active when and where needed.

55 Asynchronous Success Stories: Philips 80c51 (III)
Low current peaks.

56 Asynchronous Success Stories: Philips 80c51 (IV)
Low EMI.

57 Asynchronous Success Stories: RAPPID
RAPPID: Revolving Asynchronous Pentium Processor Instruction-length Decoder.
The instruction length decoder was the performance bottleneck in ca. 1995-vintage CISC processors.
Potential for optimization for common cases (RISC-like).
Results:
Developed a novel, aggressive asynchronous method.
About 3x throughput (T = 3x)
About one half latency (L = 2x)
About one half power (P = 2x)
About same area (A = 0.8x)
Namely, this is a T x L x P x A improvement of about 10.
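The combined figure of merit follows from multiplying the four improvement factors listed on the slide. A short Python check, with the dictionary keys chosen only for this illustration:

```python
# Improvement factors as given on the slide (values above 1 are improvements).
factors = {"throughput": 3.0, "latency": 2.0, "power": 2.0, "area": 0.8}

product = 1.0
for value in factors.values():
    product *= value  # 3 * 2 * 2 * 0.8

rounded = round(product, 1)
```

The product is 9.6, which the slide rounds to 10.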

58 Asynchronous Success Stories: Amulet
The Amulet group was formed at Manchester University.
Amulet1 (1994): 60000 transistors in 1.0 µm, ARM6 instruction set. Half the instruction throughput with the same energy efficiency as ARM6.
Amulet2e (1996): 450000 transistors in 0.5 µm, ARM7 compatible. Still half the performance of a synchronous chip.
Amulet3i (2000): 800000 transistors in 0.35 µm, ARM9 compatible. Same performance as the synchronous solution with an equal or marginally better energy efficiency.

59 Globally Asynchronous Locally Synchronous (GALS) Systems

60 GALS Technique
GALS is the abbreviation for Globally-Asynchronous Locally-Synchronous systems.
GALS techniques have the potential to solve some of the most challenging design issues of SoC integration of communication systems.

61 GALS method
GALS can be used on its own or within the NoC concept.
[Figure: synchronous blocks 1, 2, and 3, each surrounded by an asynchronous wrapper and communicating via Req, Ack, and Data; alternatively connected through network nodes]

62 GALS as a Powerful Design Technique
In wireless communication systems, GALS can address the main design challenges.
GALS makes data transfer between the blocks very easy.
Design problems such as timing closure or clock-tree generation are limited to the level of much smaller local blocks.
Decoupling the local blocks from a central clock source reduces spectral noise considerably.
Power saving is automatically integrated in the asynchronous wrapper.

63 Power reduction with GALS
[Figure: power distribution in a high-performance CPU]
The clock signal is the dominant source of power consumption.
First estimations showed that power savings of about 30% could be expected in the clock net due to the application of GALS. Recently, some more pessimistic power estimation figures were presented.
GALS techniques offer independent setting of frequency and voltage levels for each locally synchronous module.
When using dynamic voltage scaling (DVS), an average energy reduction of up to 30% can be reached.

64 Potential for reducing EMI with GALS
We have simulated the noise generated on the power supply line in the synchronous and the request-driven GALS system.
[Figure: two noise spectra, amplitude in dB versus frequency in GHz (0.5 to 4.5 GHz)]
GALS introduces a reduction of about 20 dB.

65 GALS Opportunities: 3D Integration
3D integration can be a very interesting application field.
[Figure: stacked layers labeled Sensor, A/D, Memory, DSP, Comm]

66 GALS Opportunities: NoCs
Another interesting application can be Networks on Chips and MP SoCs (Multi-Processor Systems-on-Chip).

67 GALS Opportunities: Process Scaling and Variability
Asynchronous design gives average-case performance, in comparison to the worst-case performance of a synchronous system.
Variability of Vth makes individual transistors faster or slower, and more or less energy consuming.
[Figure: Vth distribution around nominal Vt for a 65 nm minimum-size transistor; Vth variability = +/- 30% (+/- 3σ)]

68 GALS Methods
GALS based on synchronizers
GALS based on asynchronous FIFOs
GALS based on pausible clocking

69 GALS with Synchronizers
[Figure: GALS with synchronizers]

70 GALS with FIFOs
[Figure: GALS with FIFOs]

71 Asynchronous wrappers
A GALS system usually contains synchronous islands communicating with each other through asynchronous wrappers.
An asynchronous wrapper surrounds each locally synchronous island.
The wrapper consists of a pausable clock and input and output ports.

72 Classical Pausible Clocking GALS approach
[Figure: locally synchronous module 1 with local clock generator 1 and output port inside asynchronous wrapper 1; locally synchronous module 2 with local clock generator 2 and input port inside asynchronous wrapper 2; the ports exchange data and a handshake and drive stretch1 and stretch2]
Published in: Jens Muttersbach et al., Globally-Asynchronous Locally-Synchronous Architectures to Simplify the Design of On-Chip Systems, in Proc. of ASIC/SOC Conference, pp. 317-321, Sept. 1999.

73 Pausable Clock Generator
[Figure: pausable clock generator]

74 Main challenges of the typical GALS methods
In many solutions, the problems of data transfer and throughput are critical.
Most of them can perform a data transfer only every second cycle of the local clock.
Some described circuits can theoretically transfer data every clock cycle. However, intensive stretching of the pausable clock will significantly diminish the practical performance.
The latency of the transferred data is not known in advance and may vary significantly from one data transfer to the next.
It is not very practical to use ring oscillators for local clock generation.
All solutions are oriented towards very general applications. They are not optimised for specific systems and environmental demands.

75 Basic concept of the request-driven operation
This approach covers point-to-point communication with very intensive but bursty data transfer.
When receiving an input burst, a GALS block can operate in a request-driven mode.
When there is no input activity, the data stored inside the locally synchronous pipeline has to be flushed out. A local clock generator then drives the GALS block.
A time-out function controls the transition from request-driven operation to local clock generation mode.

76 Request-driven asynchronous wrapper
The local clock can be generated either internally or externally.

77 What can we gain from this GALS technique?
Reliable and fast transfer of large bursts of data is achieved.
Data transfer is possible at every clock cycle of the synchronous block.
In request-driven mode there is no arbitration in the input port. The circuit responds immediately to input requests.
The clock speed is determined by the master and not by the slower participant in the communication.
The local clock can be generated internally or externally.
The proposed architecture offers an efficient power-saving mechanism, similar to clock gating.
EMI should be reduced due to varying delays and frequencies in the different asynchronous wrappers.

78 Building the wrapper components: input port
The input port has to control the dataflow according to a broad 4-phase handshake protocol.
The input port consists of a speed-independent (SI) input controller along with a few additional gates that provide glitch-free transitions of the input signals.

79 Input controller specification
The input controller is modeled as an AFSM (asynchronous finite state machine).
The controller is specified according to burst-mode requirements.
The burst-mode AFSM is implemented as a Huffman machine without explicit latches.
[Figure: Huffman machine: hazard-free combinational network with inputs A, B, C, outputs X, Y, Z, and a fed-back state of several bits]
[Figure: state graph of the input controller with request-driven mode, local clock generation mode, transitional mode, and idle mode]

80 Input controller implementation
The burst-mode input controller is synthesized using the 3D tool, which supports 2-level hazard-free logic minimization and achieves optimal state assignment:
REQ_INT = REQ_A1 REQ_INT + ACKC' REQ_INT + REQ_A1 ACKC' ST' ACKEN'
ACK_A = ACKC' REQ_INT + REQ_A1 RST + ACKC' ST ACKI1' ACKEN Z0' + REQ_A1 ACKC' ST' ACKEN'
ACKEN = ACKI1 + REQ_A1 ACKEN + ST ACKEN
RST = STOP + ACKC' REQ_INT + REQ_A1 RST + ST RST + ACKC' ST ACKI1' ACKEN Z0' + REQ_A1 ACKC' ST' ACKEN'
REQ_I1 = REQ_A1 ST ACKI1' ACKEN'
Z0 = ACKI1 + REQ_A1' ACKC + REQ_A1' ST' Z0 + ACKC' ACKEN Z0 + ACKC ACKEN' Z0
The logic equations are automatically converted into synthesizable structural VHDL code with our 3DC tool.
Formal analysis of the asynchronous wrapper is performed.
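The equations are in sum-of-products form, where juxtaposition means AND, + means OR, and ' means complement; signals that appear on both sides are fed back. As an illustration, the ACKEN equation can be evaluated in Python as a next-state function. The function name `next_acken` and the zero-delay evaluation are assumptions of this sketch.

```python
def next_acken(acki1, req_a1, st, acken):
    """ACKEN = ACKI1 + REQ_A1 ACKEN + ST ACKEN, with ACKEN fed back as a state variable."""
    return acki1 | (req_a1 & acken) | (st & acken)


set_by_ack = next_acken(acki1=1, req_a1=0, st=0, acken=0)   # ACKI1 sets ACKEN
held_by_req = next_acken(acki1=0, req_a1=1, st=0, acken=1)  # REQ_A1 holds it
released = next_acken(acki1=0, req_a1=0, st=0, acken=1)     # nothing holds it any more
```

Read this way, ACKEN acts like a latch: it is set by ACKI1 and stays set while REQ_A1 or ST is high.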

81 VHDL description of a port
UN1: inv1x port map (ackc,t3);
UN2: inv1x port map (st,t4);
UN3: inv1x port map (clk1,t5);
UN4: inv1x port map (req,t6);
UN5: inv1x port map (ackeni,t7);
UN6: inv1x port map (endi,t8);
UN7: inv1x port map (z0,t9);
UN8: inv1x port map (z1,t10);
UN8i: inv1x port map (dvsi,t11);
U6: and2ix port map (reqci,ackc,t1);
U7: and2x port map (req,reqci,t28);
U8: and4x port map (req,t3,t4,t9,t12);
U9: or3x port map (t1,t28,t12,reqcix);
U7i: and2x port map (req,reseti,t2);
U7ii: and2x port map (st,acki,t31);
U13: and3x port map (req,t3,z0,t13);
U14: or5x port map (t1,t13,t12,t2,t31,ackix);
U10: and2x port map (ackc,ackeni,t14);
U12: and2x port map (t9,ackeni,t15);
U15: or3x port map (t15,t14,clk1,ackenix);
U11: and3x port map (st,t3,z0,t16);
U19: or5x port map (endi,t1,t2,t12,t16,resetix);
U17: and2x port map (t7,t9,t17);
U18: and3x port map (req,st,t5,t18);
U20: and2x port map (t18,t17,reqiix);
U25: and2x port map (req,z0,t22);
U26: and2x port map (st,z0,t23);
U23: and3x port map (ackc,t5,ackeni,t21);
U27: or4x port map (t21,t22,t23,endi,z0x);
U28: and2x port map (t6,ackc,t24);
U29: and2x port map (ackc,z1,t25);
U30: and3x port map (t6,t4,z1,t26);
U32: or3x port map (t25,t26,t24,z1x);

entity and2x is
  port (a,b: in std_logic; c: out std_logic);
end and2x;

architecture struc of and2x is
  attribute DONT_TOUCH_NETWORK of a,b,c: signal is true;
begin
  c<=(a and b) after 100 ps;
end struc;

82 Externally-driven GALS Wrapper
[Figure: externally-driven GALS wrapper]

83 Clock Management Unit
[Figure: clock management unit]

84 Baseband processor for WLAN
The goal of one of our projects was to develop a wireless broadband communication system in the 5 GHz band.
The modem is compliant with the IEEE 802.11a WLAN standard.
The system uses Orthogonal Frequency Division Multiplexing (OFDM) with data rates ranging from 6 to 54 Mbit/s.
The synchronous baseband processor was implemented as an ASIC (700k gates).

85 Structure of the synchronous baseband processor
The baseband processor includes receiver and transmitter datapath structures. Very complex blocks are implemented, such as the Viterbi decoder, FFT, IFFT, CORDIC processors, ...
[Figure: block diagram of the baseband processor (transmitter and receiver), with each block marked as an 80 Msps block or a 20 Msps block. Block labels: Input buffer, Scrambler, Signal field generator, Encoder, Interleaver, Mapper, Pilot insertion, Pilot scrambler, IFFT, Guard interval insertion, Preamble insertion, Synchronizer datapath, Channel estimator, Demapper, Deinterleaver, Viterbi decoder, Encoder, Interleaver, Mapper, Descrambler, Parallel converter, FFT, Synchronizer tracking, Buffer 20-80, Buffer 80-20.]

86 Design challenges in the baseband processor
The design of the baseband processor involves challenges such as:
- several clock domains,
- global clock tree generation,
- large number of clock leaves (36 k flip-flops),
- clock skew handling,
- timing closure between the different modules,
- clock gating,
- power consumption,
- EMI.
A request-driven GALS architecture was developed as a possible solution for those problems.

87 GALS partitioning
The partitioning process has to take into account possible power saving.
[Figure: the baseband processor blocks grouped into GALS blocks Tx_1, Tx_2, Tx_3, Rx_1, Rx_2, Rx_3, with Tx_int and Rx_int (async-sync interfaces). Further labels: Token rate adaptation, FIFO TA, Rx_TRA, Activation interface. Legend: 80 Msps block, 20 Msps block, rate adaption block, interface block.]

88 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Test strategy We are using a hardware tester which is strictly cycle based and cannot react to asynchronous output signals of the circuit. The GALS arbitration processes preclude cycle level determinism. We want to have a possibility to run very complex functional tests internally. Applied test technique should support system diagnosis. A test strategy based on Built-In Self-Test (BIST) is proposed. BIST reduces the effort for generating a test program and enables us to use a synchronous tester. 88

89 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Design for Testability in GALS TPG and TDE are based on the linear feedback shift register structure with embedded additional logic. A central BIST controller performs control of the test procedure. We can run hierarchical tests. This BIST technique can be used as a method for prototype verification. In combination with the scan approach, BIST can be even used as a basis for the manufacturing test. 89
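
The LFSR-based BIST idea can be illustrated with a small software model. This is only a sketch under stated assumptions, not the circuit used in the chip: TPG and TDE are read here as test pattern generator and test data evaluator, the 4-bit register with polynomial x^4 + x^3 + 1 is chosen purely for illustration, and the names `lfsr_step`, `generate_patterns` and `compact` are hypothetical.

```python
def lfsr_step(state):
    """Advance a 4-bit Fibonacci LFSR (assumed polynomial x^4 + x^3 + 1)."""
    feedback = ((state >> 3) ^ (state >> 2)) & 1
    return ((state << 1) | feedback) & 0xF


def generate_patterns(seed, count):
    """TPG model: return `count` pseudo-random test patterns starting at `seed`."""
    patterns = []
    state = seed
    for _ in range(count):
        patterns.append(state)
        state = lfsr_step(state)
    return patterns


def compact(responses, signature=0):
    """TDE model: fold circuit responses into one signature (MISR style)."""
    for response in responses:
        signature = lfsr_step(signature) ^ (response & 0xF)
    return signature


# A non-zero seed walks through all 15 non-zero states before repeating.
patterns = generate_patterns(seed=1, count=15)

# Stand-in for the block under test: any deterministic function would do.
good_responses = [p ^ 0b0101 for p in patterns]
golden_signature = compact(good_responses)
```

A synchronous tester then only has to start the test and compare the final signature with the golden value, which is why the cycle-level nondeterminism of the GALS arbitration does not have to be observed from outside.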

90 Design flow
We have used the IHP 0.25 µm CMOS process. The asynchronous wrapper is equivalent to about 1.3 k inverter gates. The tunable clock generation alone is 0.9 k gates. The asynchronous wrapper has a throughput of up to 150 Msps in request-driven mode and 100 Msps in local mode. This application needs 80 Msps.
[Figure: design flow with two branches, asynchronous wrappers and synchronous blocks. Steps shown: AFSM specification, 3D logic synthesis, 3DC tool (translation from 3D to structural VHDL), functional specification, VHDL description, abstract behavioural simulation, gate mapping, realistic behavioural simulation, timing-driven synthesis, post-synthesis simulation, layout, back annotation, tape-out. Tools shown: Synopsys DC, Cadence Silicon Encounter, ModelSim, PrimePower (power estimation), LoLA (formal analysis).]

91 Area and power distribution
Area and power statistics are based on the synthesized netlist data. Locally synchronous blocks occupy around 90% of the total area. The BIST circuitry requires around 3.5%, interface blocks 2.9%, and asynchronous wrappers 2%. Power estimation was performed with the PrimePower tool, based on the switching activities in a realistic transceiver scenario. Synchronous datapath logic uses most of the power (around 52.4%), followed by the local synchronous clock trees with 34.5%, async-to-sync interfaces with 7%, and asynchronous wrappers with 2.9%. After layout, the estimated power consumption is 324.6 mW.

92 Implementation results
Our GALS baseband processor has been fabricated and tested. The total number of pins is 120 and the silicon area including pads is 45.1 mm². The measured dynamic power dissipated in the purely synchronous baseband processor was 332 mW; for the GALS baseband processor it was slightly lower, at 328 mW.

93 Improving System Integration with GALS
Synchronous baseband processor challenges:
- several clock domains,
- global clock tree generation,
- large number of clock leaves,
- clock skew handling,
- timing closure between blocks,
- clock gating.
Solved by GALS architecture:
- No global clock in GALS
- Clock leaves distributed over GALS blocks
- Clock skew is reduced from 660 ps to 486 ps
- Communication between the blocks through handshaking
- Clock gating embedded in the asynchronous wrapper

94 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved EMI measurement (I) The supply voltage variation spectrum of the inner processor core is measured. 94 ~ 5 dB

95 EMI measurement (II)
Additionally, instantaneous cycle-to-cycle supply voltage peaks are reduced from 140 mV (synchronous design) to less than 100 mV (GALS). This reduction can be very important for mixed-signal designs and for secure systems. An application with fine-grained GALS partitioning can lead to results closer to the theoretical maximum reduction.

96 Conclusions
- There are several asynchronous designs currently on the market
- Asynchronous design is used with the greatest success in medium-complexity, medium-performance circuits
- Future applications: GALS, large networks on chip (NoCs), 3D integration
- Some local blocks in a GALS system could then be asynchronous
- Asynchronous circuitry can provide lower EMI for SoCs
- Design and test flow remains a problem

97 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Synchronous and GALS Networks on Chips 97

98 Synchronous and GALS NoCs
- Today on-chip design is more and more communication-centric
- Classical topologies are not sufficient (point-to-point, mesh, bus, etc.)
- Shared bus = low performance: bandwidth is shared, bus width (bits) relatively small, global clock frequency limited
- Disadvantage of multiple buses: not scalable, not generic
- A promising alternative could be Networks on Chip (NoCs)
- NoCs can be implemented completely synchronously, mesochronously, or in GALS fashion

99 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved NoC Paradigm Apply Networks Protocols to SoC Network: Provides communication Satisfy quality-of-service requirements: Reliability Performance: Throughput, latency,.. Power ? Additional requirements unique to NoC Energy bounds Area Fit it to the standard design flow 99

100 Switching Network Basics
Transport Layer: Msg end-to-end
- Implemented using network adapters
- Assembly and disassembly of the packets at source/destination
Network Layer: Pkt end-to-end
- Implemented using routers
- Routers decide the routing path to the destination, based on the header of the packet and on topology knowledge
- Scalable distributed system: load shared between routers
Data-Link Layer: Pkt over link
- Packets: header, payload, trailer
- Error correction (on packet): redundancy, error correction codes
* Technion - Asynchronous NoC - Nikolai Samolazov

101 Bus vs. Network Arguments
Scalability
- Bus: every IP adds parasitic capacitance; timing is difficult; bus arbiter performance
- NoC: only P2P connections; can be pipelined; load shared by routers
Bandwidth
- Bus: limited and shared by all IP
- NoC: scales with network size
Latency
- Bus: zero when granted control
- NoC: network latency always exists
Cost
- Bus: low area
- NoC: significant area
Design complexity
- Bus: simple, well known and understood
- NoC: requires changes in HW and sometimes SW levels

102 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Hybrid Network Shared Busses as first level communication medium NoC routers as main communication devices 102

103 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Homogenous NoC FU * NoC General Concepts - Andreas Ehliar - Per Karlström 103

104 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Heterogeneous NoC FU MUL ALU DSP 104

105 Heterogeneous NoC FU MUL ALU DSP

106 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Quality of Service Guaranteed latency Guaranteed bandwidth Correctness 106

107 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Design Issues - Physical layer To reduce power for signaling - low swing To improve noise immunity - Differential signaling Trade-off - Pseudo differential signaling System integration possibilities Synchronous, Mesochronous, GALS 107

108 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Design Issues - Architecture FU 108

109 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Design Issues - Architecture FU 109

110 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Design Issues - Data-link Is error detection/correction needed? Depends on BER/PER Media access control affects the power Defined by how individual nodes access the media 110

111 NoC Design
Architecture
- Network Adapter and Router Architecture: asynchronous or synchronous
- Network Topology
- Routing Strategy: static routing, adaptive routing
- Interconnect: repeaters, pipelining
Design Technology
- Tools and Methodologies
- Simulation and validation (correctness, performance, power): SystemC

112 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Design Issues - Flow Control 112

113 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Design Issues - Long Wires Solving the global interconnect mess Delay Bit errors Repeaters Clock domains Create one optimized solution that can be reused 113

114 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Design Issues - Long Wires Add flip flops to increase clock frequency What about ACKs? NoC Router NoC Router 114

115 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Design Issues - Long Wires Add flip flops to increase clock frequency What about ACKs? NoC Router NoC Router 115

116 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Design Issues - Long Wires Bit errors on long wires will not be avoidable in the future Use error correcting codes Disadvantage: More wires, more throughput needed Use parity bits to discover errors Resend damaged packets No longer possible to guarantee real-time performance 116
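
The detect-and-resend option can be sketched as follows. This is a minimal illustration, assuming a single even-parity bit per word and a hypothetical `channel` callable standing in for the long wire; the function names are not from the presentation. A single parity bit only detects an odd number of flipped bits, and the retry loop shows why real-time guarantees are lost: the number of attempts depends on the error pattern.

```python
def add_parity(word, width=8):
    """Append an even-parity bit as the new least significant bit."""
    word &= (1 << width) - 1
    parity = bin(word).count("1") % 2
    return (word << 1) | parity


def check_parity(coded):
    """Return True when the coded word has an even number of set bits."""
    return bin(coded).count("1") % 2 == 0


def send_with_retry(word, channel, max_attempts=3):
    """Send `word` over `channel`, resending while the parity check fails.

    Returns (received_word, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        received = channel(add_parity(word))
        if check_parity(received):
            return received >> 1, attempt
    raise RuntimeError("link failed after %d attempts" % max_attempts)


# Hypothetical noisy link: corrupts one bit on the first transfer only.
transfers = {"count": 0}

def flaky_channel(coded):
    transfers["count"] += 1
    return coded ^ 0b100 if transfers["count"] == 1 else coded


result = send_with_retry(0xA5, flaky_channel)
```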

117 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Design Issues - Long Wires Possibility to create heavily optimized solution Low voltage signaling Advanced symbol encoding/decoding Wave pipelining 117

118 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Design Issues - Long Wires High performance interconnect through wave pipelining Need very careful analysis NoC Router NoC Router NoC Router NoC Router 118

119 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Design Issues - Long Wires Wave pipelining performance 3.45 GHz signaling on one bit line in 0.25 um More energy efficient than regular pipeline Faster than regular pipeline Disadvantage Much harder to test/verify 119

120 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Network Topologies Mesh Tree Fat-Tree Routing algorithm depends on topology 120

121 Routing
Routing: path from source to destination.
Must be deadlock free and livelock free
- Livelock: message proceeds indefinitely, but never arrives. Possible only in adaptive non-minimal routing
- Deadlock: packets waiting for each other in a cycle
Three main categories:
- Static (non-adaptive): predetermined path
- Minimal fully adaptive: routes through any shortest path
- Partially adaptive: multiple routing paths, some paths not shortest
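
As an example of the static category, dimension-order (XY) routing on a 2D mesh is sketched below. XY routing is not named on the slide; it is used here as a well-known static scheme that is minimal and deadlock free on a mesh. Coordinates and the function name `xy_route` are assumptions for illustration.

```python
def xy_route(src, dst):
    """Return the list of (x, y) routers visited from src to dst.

    The packet first travels along X until the column matches,
    then along Y. The path depends only on src and dst.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
hops = len(route) - 1
```

Because the path is fixed by the source and destination, each router can decide the output port from the packet header alone.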

122 Wormhole Routing
- Header is forwarded as soon as possible, without waiting for the trailer
- Used in high-performance parallel computing networks (lumped), not in the internet (distributed)
- Packet may span several routers
- Packet is divided into flits (atomic flow control units)
Main disadvantage: cascaded contention
- Packet requests a busy link
- VLSI routers have small buffers, so the packet cannot be buffered in one router
- Routers spanned by the packet are stalled
- Practical limitation, prevents achieving theoretical bandwidth
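
The division of a packet into flits can be sketched as below. The flit layout is an assumption for illustration (a one-byte destination in the head flit, 4-byte body flits, an empty tail flit); real routers define their own formats, and the names `Flit`, `packetize` and `reassemble` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flit:
    kind: str    # "head", "body" or "tail"
    data: bytes


def packetize(dest, payload, flit_bytes=4):
    """Split a payload into a head flit, body flits and a tail flit."""
    flits = [Flit("head", bytes([dest]))]    # dest must fit in one byte
    for i in range(0, len(payload), flit_bytes):
        flits.append(Flit("body", payload[i:i + flit_bytes]))
    flits.append(Flit("tail", b""))
    return flits


def reassemble(flits):
    """Rebuild (dest, payload) from a complete, in-order flit sequence."""
    if flits[0].kind != "head" or flits[-1].kind != "tail":
        raise ValueError("incomplete packet")
    payload = b"".join(f.data for f in flits[1:-1])
    return flits[0].data[0], payload


flits = packetize(dest=3, payload=b"0123456789")
```

Only the head flit carries routing information, so the body and tail flits must follow the same path; this is what stalls every router the packet spans when the head is blocked.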

123 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved NoC Design Characteristics: Cost Area Network components area Wires, repeaters area Power Energy per transmitted packet Idle power 123

124 NoC Design Characteristics: Performance
Latency [sec]
- From header leaving source to trailer reaching destination
- Composed of waiting latency + network latency
- Waiting latency: time the message waits before entering the network
- Network latency: time the message travels inside the network
Throughput [bits/sec]
- Measured at a network port
- Average amount of user data that is accepted by the network on that port in a certain amount of time
Aggregate Throughput [bits/sec]
- Sum of the throughputs at all network ports
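
These definitions translate directly into arithmetic. The helper names and the sample figures below are illustrative assumptions, not measurements from the presentation.

```python
def total_latency(waiting, network):
    """Latency = waiting latency + network latency (same time unit)."""
    return waiting + network


def port_throughput(accepted_bits, interval_s):
    """Average user data accepted at one port per second [bits/sec]."""
    return accepted_bits / interval_s


def aggregate_throughput(port_rates):
    """Sum of the throughputs at all network ports [bits/sec]."""
    return sum(port_rates)


# Example: 20 ns waiting plus 80 ns in the network, expressed in ns.
latency_ns = total_latency(20, 80)

# Example: four ports, each accepting 6400 bits over a 2 s window.
rates = [port_throughput(6400, 2.0) for _ in range(4)]
total_rate = aggregate_throughput(rates)
```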

125 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved NoC Saturation Offered Load Traffic produced by network clients as percentage of maximal network bandwidth L : number of cycles needed to accept the message, D : average number of cycles between messages Saturation Threshold: Offered Load at which average latency rises exponentially to infinite value 125

126 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Cost - Performance Tradeoff Santiago Gonzalez Pestana et al. Cost-Performance Trade-offs in Networks on Chip: A Simulation-Based Approach, DATE 2004 126

127 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Architecture of On-Chip Router 127 Technion, Asynchronous vs. Synchronous Design Techniques for NoCs Robert Mullins, Asynchronous vs. Synchronous Design Techniques for NoCs

128 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Router Pipeline Numerous stages of Router Pipeline Raise communication latency Can make packet buffers less effective Incurs pipelining overheads 128

129 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Synchronous NoCs - Summary Can design high-performance single cycle routers Design is simplified by presence of global synchrony Distribution of global clock can be eased by New clock generation / distribution techniques Source synchronous communication 129

130 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Limitations of Fully-Synchronous Networks 1. Difficult to distribute clock Network spread over die & may have irregular layout Minimising skew costs complexity and power Alternatives/extensions to PLL and H-tree: Clock deskewing techniques Distributed Clock Generator (DCG). Distributed PLLs Standing-wave oscillators and rotary clock schemes Resonant global clocks, optical clock distribution etc. 130

131 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Limitations of Fully-Synchronous Networks 2. Single Network Clock Frequency Communicating synchronous IP blocks may operate at different and potentially adaptive clock frequencies What is most appropriate network clock frequency? 131

132 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Why Asynchronous NoCs No clock distribution, simple solution Networked IP blocks run at different clock frequencies No synchronization issues at interfaces Ability to exploit data / path-dependent delays Low-latency common or high-priority paths through router Freedom to optimize network links Not constrained by need to distribute/generate multiple clock frequencies. Can exploit high-frequency narrow links Dynamic latency/throughput trade-offs (adaptive pipeline depth) Exploit dynamic optimizations on links (e.g. DVS) Easy to use interfaces, modularity, Robust and simple implementation, Reduced design time Some arguments for reduced power 132

133 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Different NoC Architectures Router clocks derived from a single source Locally Generated Clocks (periodic & free-running) Synchronous Routers with Asynchronous Links Locally Clocked Routers / Asynchronous Interconnect (GALS style network) Can support asynchronous interconnects No longer exploiting periodic nature of router clocks Correct operation is independent of the delay of the link GALS interfaces with pausible clocks If necessary clock is stretched, data is always transferred reliably Need to construct local delay line Local aperiodic clock generation Data-Driven Local Clock Similarities to stoppable GALS interface and asynchronous priority arbiters 133

134 Mesochronous Clocking
Clock skew may force the system to be partitioned into multiple clock domains. We can exploit the fact that only the phase of each router's clock differs: simple, error-free clock-domain crossing is possible (single clock source).

135 Router clocks derived from a single source
Each router's clock may be generated from the global network clock, either by clock division or by clock multiplication. Clock domain crossing techniques can exploit known clock frequency relationships.
Chakraborty and M. Greenstreet, Efficient Self-Timed Interfaces for Crossing Clock Domains, In Proceedings ASYNC03
L. F. G. Sarmenta, G. A. Pratt and S. A. Ward, Rational Clocking, ICCD95

136 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Using Synchronisers for GALS NoCs Asynchronous channel uses 4-phase bundled data protocol A. Sheibanyrad, A. Greiner, Two efficient synchronous asynchronous converters well-suited for networks-on-chip in GALS architectures, 2005 136
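
The signalling order of a 4-phase bundled data channel can be written out as a small trace generator. This is a sketch under assumptions: a push channel (the sender initiates), with data treated as valid from the rising request until the rising acknowledge; the slide does not specify these details, and `four_phase_transfer` is a hypothetical name.

```python
def four_phase_transfer(words):
    """Yield (req, ack, data) after every signal transition.

    data is None in the return-to-zero phases, where it need not be valid.
    """
    req, ack = 0, 0
    for data in words:
        req = 1                     # 1. sender sets up data, then raises req
        yield (req, ack, data)
        ack = 1                     # 2. receiver latches data, raises ack
        yield (req, ack, data)
        req = 0                     # 3. sender withdraws req
        yield (req, ack, None)
        ack = 0                     # 4. receiver withdraws ack, channel idle
        yield (req, ack, None)


trace = list(four_phase_transfer([7]))
```

Each word costs four transitions, two of them a return to zero, which is the price paid for level-sensitive, simple control logic.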

137 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Locally Generated Clocks (periodic & free-running) Can exploit knowledge about clocks (when crossing clock domains) even if all we know is that they are periodic, examples: predictive synchronizers [Dally][Frank/Ginosar] asynchronous FIFOs [Chakraborty/Greenstreet] 137

138 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Using Asynchronous FIFOs in GALS NoCs Synchronous network wrapper assembly/disassembly data packets Can connect many independent clock domains 138

139 NoC architecture for low power
The NoC concept together with the GALS methodology gives good opportunities for power saving. Each hardware block in a NoC system can be set to its optimal frequency/voltage. It is best to combine DVFS with the GALS concept in order to reduce power.

140 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved NoC architecture for DVFS – LETI Solution (NoCs 2008) A fully asynchronous Network-on-Chip IP units are synchronous islands using programmable Local Clock Generator Within the IP unit Synchronization is done thanks to Pausable Clock A Power Unit manages internal Vcore generated using external Vhigh and Vlow A Network Interface is in charge of NoC communications Local Power Management Main CPU in charge of global power management 140

141 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved DVFS with GALS NoCs Each synchronous IP is an independent power and frequency domain A local fine grain Dynamic Voltage Scaling: Implementation of a local hardware controller to control transitions between Vhigh and Vlow Ensures smooth DVS transitions for IP safe computation A local fine grain Dynamic Frequency Scaling: Automatic frequency scaling Use of clock generation re-programming to find the optimal V/F point of operation Thanks to pausable clock technique, IP unit continues its operation during DVFS phases GALS architecture and local clock generation is a natural enabler for easy local DVFS 141

142 NoC Unit architecture
Each IP core is encapsulated with:
- Network Interface
- Test Wrapper
- Pausable Clock
- Power Supply Unit
IP units have 5 supply modes
- Init: reset at V high (1.2 V)
- High: V high supply
- Low: V low supply (0.8 V)
- Hopping: switch V high / V low for DVFS
- Idle: retention state at V low (no clock)
- Off: stand-by mode

143 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Local Power Manager Local Power Manager handles unit power modes A set of programmable registers, through the NoC Configuration of Programmable delay line Power Supply Unit Pulse Width modulator used to control the Hopping mode 143

144 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Power Supply Unit Power Supply Unit manages Vcore Two power switches Thigh and Tlow LVT transistors A Hopping Unit An Ultra Cut-Off Generator 144

145 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Hopping Unit Energy per operation scales with V² Decrease Voltage (and Frequency) to be energy efficient «Triple state» power supply Use of two PMOS power switches Vhigh (1.2 V), Vlow (0.7 V), or OFF (0 V) Switch between Vhigh and Vlow Transitions take less than 100 ns Mean speed / mean power of the IP is programmed by a PWM Compatible with synchronous and asynchronous IPs For GALS system: coordination done with local clock generator Can easily be integrated in any CMOS circuit No inductor contrary to traditional DC/DC converters No capacitor contrary to charge pump implementation 145
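
The V² relation and the effect of hopping can be checked with a first-order model. The sketch below assumes purely dynamic switching energy proportional to V², ignores leakage and transition cost, and models the PWM setting as the fraction of operations executed at Vhigh; these simplifications and the function names are assumptions, not the LETI implementation.

```python
def energy_ratio(v_low, v_high):
    """Energy per operation at v_low relative to v_high (E ~ V^2)."""
    return (v_low / v_high) ** 2


def mean_energy_per_op(fraction_high, v_low, v_high):
    """Mean energy per operation, relative to running always at v_high.

    fraction_high: share of operations executed at v_high (0.0 to 1.0).
    """
    if not 0.0 <= fraction_high <= 1.0:
        raise ValueError("fraction_high must be between 0 and 1")
    return fraction_high + (1.0 - fraction_high) * energy_ratio(v_low, v_high)


# Supply values quoted on the slide: Vhigh = 1.2 V, Vlow = 0.7 V.
ratio = energy_ratio(0.7, 1.2)
half_hopping = mean_energy_per_op(0.5, 0.7, 1.2)
```

Under this model, intermediate PWM settings interpolate linearly between the two supply points, which is how two fixed voltages can approximate a continuously variable supply.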

146 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Ultra Cut-Off Generator When reverse polarizing the gate, the leakage current goes through a minimum The optimal polarization point varies with the temperature, the supply voltage and the process corners The proposed UCO generator automatically polarizes the gate of the Power switch to its point of minimum leakage Compensates for temperature variation, alleviates corners variations. The gate oxide reliability is considered by introducing a passive stress reduction mechanism 146

147 Pausable Clock Interface
Temporarily pauses the clock when a transfer (NoC) or a supply switch is required
Based on:
- Two GALS ports: Synchronous-to-Asynchronous and Asynchronous-to-Synchronous
- A programmable delay line
- A pausable clock generator
The Pausable Clock Generator arbitrates pause requests

148 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Pausable Clock Interface Programmable delay line Precise, small and low power Using Standard cells On the same unit power domain 148

149 Power Gain
Programmable delay line
- Matches the unit logic on the same power domain
- Compensates any mismatch through reprogramming
Power reduction (Vhigh = 1.2 V and Vlow = 0.8 V)
- 35 % dynamic power reduction between High and Low modes
- Hopping mode is used to save power without any latency cost
- Leakage power is reduced by two decades thanks to UCO
Power Supply Unit efficiency
- Hopping Unit: only resistive losses in the power transistors
- About 1 mW dynamic power => more than 95 % power efficiency
- 90 % total efficiency (external DC-DC taken into account)
An adaptive and reliable Power Supply Unit giving a high power reduction factor and high power efficiency

150 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Physical Implementation Power Switch One single Power-Switch for the complete power domain Sized to get a speed loss<5% Area : about <5% of the power domain Hopping Unit Area : 140μm*35μm Hopping Transition : <100 ns 150

151 Synchronous or Asynchronous?
A clockless on-chip network appears to be an elegant solution, although some questions remain:
- Test
- Performance concerns
- Shouldn't asynchronous designs offer latency advantages? Fast local control, path/data dependent delays, DI interconnects
- Perhaps asynchronous routers mimic synchronous architectures too closely? Exploit flexibility, novel architectures, different topologies
- Overheads for data-driven clocking or GALS currently look small in comparison to the classical approach
Synchronous design has advantages too
- Predictability and determinism can be exploited
- Fast single cycle routers possible
- Global snapshot of state is good for scheduling
Still lots of interesting research to be done

152 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved GALAXY project GALAXY project (GALS InterfAce for CompleX Digital SYstem Integration) is funded in the FP7 program of EU www.galaxy-project.org 152

153 IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved Project goals This project builds on a technology approach in which the EU currently has world leadership We are on the way to provide an integrated GALS NoC design flow We will provide an interoperability framework between the existing open and commercial CAD tools The project is evaluating the ability of the GALS approach to solve system integration issues, implement a complex GALS system on 40 nm CMOS process, explore the low EMI and low-power properties, and robustness to process variability problems. 153

