5/3/2011 International Symposium on Network-on-Chip 1 DART: A Programmable Architecture for NoC Simulation on FPGAs Danyao Wang*† Natalie Enright Jerger*

1 DART: A Programmable Architecture for NoC Simulation on FPGAs
Danyao Wang*†, Natalie Enright Jerger*, J. Gregory Steffan*
*Department of Electrical & Computer Engineering, University of Toronto
†Google Inc.

2 Why yet another NoC simulator?
Software simulators
–Stand-alone or integrated
–Parallel NoC simulator (DARSIM)
FPGA-based models
–Direct-mapped NoC emulators (Genko et al., NoCem)
–Dynamic reconfiguration (DRNoC)
–Decoupled timing and functional models (RAMPGold, ProtoFlex, A-Ports)
Analytical models: FIST

3 Why yet another NoC simulator?
Requirement | Software Simulation
Accurate | Possible
Fast to run | < 10 KIPS to 100s of KIPS
Easy to implement | Yes
Easy to use & modify | Yes
Available early | Yes
At 100 KIPS: 1 s of execution at 1 GHz = 10K sec = 2.8 hrs
The benefit of thread-based parallelization is limited by high synchronization overhead
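
The slowdown figure on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes, as the slide implicitly does, that the target executes one instruction per cycle, so 1 s at 1 GHz corresponds to 1e9 simulated instructions; the function and parameter names are illustrative only.

```python
def wall_clock_seconds(simulated_seconds, target_hz, simulator_rate_per_sec):
    """Wall-clock time needed to simulate `simulated_seconds` of target execution.

    Assumes one simulated instruction per target cycle (an assumption made
    here to match the slide's arithmetic).
    """
    target_events = simulated_seconds * target_hz  # cycles the target executes
    return target_events / simulator_rate_per_sec


seconds = wall_clock_seconds(simulated_seconds=1, target_hz=1e9,
                             simulator_rate_per_sec=100e3)
hours = seconds / 3600
print(f"{seconds:.0f} s = {hours:.1f} hours")  # 10000 s = 2.8 hours
```

At the low end of the quoted range (10 KIPS) the same second of execution would take ten times longer.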

4 Why yet another NoC simulator?
Requirement | Software Simulation | FPGA-based Emulators
Accurate | Possible | Possible
Fast to run | < 10 KIPS to 100s of KIPS | 10s to 100s of MIPS
Easy to implement | Yes | No
Easy to use & modify | Yes | No
Available early | Yes
Callouts: Orders of magnitude faster! Hardware changes; hours of synthesis-place-route time

5 DART: Hybrid Approach
Generic NoC simulation engine
Fixed-function nodes for basic NoC building blocks
–Router, traffic generator, link
Software-configurable parameters in each node
Simulate different NoCs without changing hardware
[Figure: a PC connected over UART to the Control FSM and DART simulator on the FPGA; configuration and commands go to the FPGA, simulation results come back]

6 Why yet another NoC simulator?
Requirement | Software Simulation | FPGA-based Emulators | DART
Accurate | Possible | Possible | Yes
Fast to run | < 10 KIPS to 100s of KIPS | 10s to 100s of MIPS | 10s of MIPS
Easy to implement | Yes | No
Easy to use & modify | Yes | No | Yes
Available early | Yes

7 DART Simulator Architecture

8 Generic NoC Model
[Figure: DART components (Traffic Generator, Flit Queue, Router, Global interconnect) annotated with the NoC properties they model: topology, routing algorithm, flow control, router microarchitecture, simulated traffic, link properties]

9 DART Architecture
Global Timer: synchronizes all network transfers to a global time counter

10 DART Nodes
Node | Parameters | Statistics Counter
Traffic Generator | Traffic pattern; injection intervals; packet size (# of flits) | # of injected packets; # of received packets; cumulative packet latency
Flit Queue | Latency (flit cycles); bandwidth (flits / cycle) |
Routers | Routing table; input buffer sizes (credits); pipeline delay (flit cycles) |
More can be added easily
Parameters implemented using a shift register
Configuration byte stream generated on the PC and sent to the FPGA
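
As a rough illustration of the configuration path described on this slide, the sketch below packs a traffic generator's parameters into a byte stream on the PC side and then models a shift register receiving it bit by bit. The field names follow the slide, but the field widths, the MSB-first bit order, and the function names are assumptions made for the example, not DART's actual format.

```python
# Hypothetical layout for one traffic generator: (name, width in bits).
# Widths and ordering are assumptions for illustration only.
TG_FIELDS = [("traffic_pattern", 4), ("injection_interval", 12), ("packet_size", 8)]


def pack_config(values, fields=TG_FIELDS):
    """PC side: concatenate the fields MSB-first and split them into bytes."""
    bits = ""
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        bits += format(value, f"0{width}b")
    bits += "0" * (-len(bits) % 8)  # pad to a whole number of bytes
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def shift_in(stream, fields=TG_FIELDS):
    """FPGA side (modeled): shift the stream in one bit at a time, then read the fields."""
    total = sum(width for _, width in fields)
    register = 0
    bits_seen = 0
    for byte in stream:
        for i in range(7, -1, -1):
            if bits_seen == total:
                break  # ignore padding bits
            register = (register << 1) | ((byte >> i) & 1)
            bits_seen += 1
    out = {}
    remaining = total
    for name, width in fields:
        remaining -= width
        out[name] = (register >> remaining) & ((1 << width) - 1)
    return out


config = {"traffic_pattern": 3, "injection_interval": 100, "packet_size": 2}
stream = pack_config(config)
assert shift_in(stream) == config  # the register ends up holding the original fields
```

Chaining the nodes' registers into one long shift chain would let a single serial stream program every node, which is one plausible reading of the slide.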

11 Simulating a NoC
1. Map simulated NoC to DART nodes
2. Program the routing tables to implement the simulated topology
3. Record timing of flit transfers
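
Step 2 can be sketched for the 3 x 3 mesh with XY routing that the deck uses in its correctness experiment. The code below builds one routing table per node and traces a route through them; the row-major node numbering, the port names, and the function names are assumptions for illustration.

```python
def xy_routing_table(node, width, height):
    """Routing table for `node` in a width x height mesh under XY routing.

    Maps each destination to an output port: travel along X first, then Y.
    Nodes are numbered row-major; port names are hypothetical labels.
    """
    x, y = node % width, node // width
    table = {}
    for dest in range(width * height):
        dx, dy = dest % width, dest // width
        if dx > x:
            table[dest] = "EAST"
        elif dx < x:
            table[dest] = "WEST"
        elif dy > y:
            table[dest] = "SOUTH"
        elif dy < y:
            table[dest] = "NORTH"
        else:
            table[dest] = "LOCAL"
    return table


WIDTH = HEIGHT = 3
tables = {n: xy_routing_table(n, WIDTH, HEIGHT) for n in range(WIDTH * HEIGHT)}

# Node-id change for one hop through each port in a row-major mesh.
STEP = {"EAST": 1, "WEST": -1, "SOUTH": WIDTH, "NORTH": -WIDTH}


def route(src, dest):
    """Follow the programmed tables hop by hop from src to dest."""
    path = [src]
    while path[-1] != dest:
        path.append(path[-1] + STEP[tables[path[-1]][dest]])
    return path


print(route(0, 8))  # [0, 1, 2, 5, 8]
```

Because the topology lives entirely in the tables, simulating a different network means regenerating the tables rather than changing the hardware.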

12-25 Example Walk-Through
[Animation: eight nodes (0-3 and 4-7), each with a router, traffic generator, and flit queues, attached to the Global Interconnect and Global Timer; a timeline advances from 0 to 7 as the example progresses]
Final counters: # injected: 1, # received: 1, Σlatency: 6
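
The counters shown at the end of the walk-through can be modeled with a small class. This is a sketch only: the class and method names are hypothetical, and the injection time of 0 and arrival time of 6 are assumed values chosen to reproduce the Σlatency of 6 on the slide.

```python
class TrafficGenStats:
    """Per-node statistics counters named on the DART Nodes slide."""

    def __init__(self):
        self.injected = 0
        self.received = 0
        self.total_latency = 0  # the slide's "Σlatency"

    def on_inject(self):
        self.injected += 1

    def on_receive(self, inject_time, arrival_time):
        self.received += 1
        self.total_latency += arrival_time - inject_time

    def average_latency(self):
        """Mean packet latency, or None if nothing has been received yet."""
        return self.total_latency / self.received if self.received else None


source, sink = TrafficGenStats(), TrafficGenStats()
source.on_inject()                              # packet leaves at time 0 (assumed)
sink.on_receive(inject_time=0, arrival_time=6)  # and arrives at time 6 (assumed)
print(source.injected, sink.received, sink.total_latency)  # 1 1 6
```

Keeping only a running sum and a count lets the PC compute the average latency after the run without the FPGA storing per-packet records.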

26 DART Router
Virtualizes the ports → replace crossbar with MUX
–No large switch allocators and crossbars
–Routes 1 flit per DART cycle
–N cycles for N ports
Input ports selected based on timestamp
Multiplexing in time saves area
[Figure: two routers, each with input ports 0-4; one uses a Routing Table and Arbiter, the other uses Routing Logic and an Allocator]
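
One way to picture the time-multiplexed router is as a loop that serves a single input port per DART cycle, choosing the port whose head flit carries the earliest timestamp. The sketch below is a behavioral model under assumptions: the class name, the tie-breaking rule (lower port index wins), and the omission of credits, pipeline delay, and timestamp wraparound are all simplifications made here, not details from the slides.

```python
from collections import deque


class TimeMultiplexedRouter:
    """Routes at most one flit per DART cycle, oldest head flit first."""

    def __init__(self, num_ports, routing_table):
        self.inputs = [deque() for _ in range(num_ports)]
        self.routing_table = routing_table  # destination -> output port

    def enqueue(self, port, timestamp, dest):
        self.inputs[port].append((timestamp, dest))

    def step(self):
        """One DART cycle. Returns (timestamp, dest, out_port), or None if idle."""
        candidates = [(q[0][0], p) for p, q in enumerate(self.inputs) if q]
        if not candidates:
            return None
        _, port = min(candidates)  # earliest timestamp; ties go to the lower port
        timestamp, dest = self.inputs[port].popleft()
        return timestamp, dest, self.routing_table[dest]


router = TimeMultiplexedRouter(5, {0: "LOCAL", 1: "EAST"})
router.enqueue(2, timestamp=7, dest=1)
router.enqueue(0, timestamp=9, dest=0)
router.enqueue(4, timestamp=7, dest=0)
served = [router.step() for _ in range(4)]
print(served)
```

Three flits that a five-port crossbar could move in one simulated cycle take three DART cycles here, which is the speed-for-area trade the slide describes.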

27 DART Summary
Configurable functional model of a NoC
–Easy to modify and reuse
–Fast by exploiting fine-grained parallelism
Decouple simulated cycles from FPGA cycles
–Trade simulation speed for area and programmability
Software-configurable parameters
–Familiar simulation flow and fast turn-around time

28 Evaluation & Results
Overhead Architecture Scalability Implementation & Performance

29 Methodology
C++ cycle-accurate architecture simulator
–Explore various DART architectures
–Evaluate performance trade-offs
9-node implementation on a Virtex-II Pro FPGA
Baseline: Booksim 2.0
–Cycle-based software simulator (C++)
Metrics
–Overhead: DART cycles per simulated cycle (CPS)
–Performance: thousands of simulated cycles per second
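
The two metrics are linked by the FPGA clock rate: dividing the clock frequency by the overhead (DART cycles per simulated cycle) gives simulated cycles per second. The sketch below uses the 50 MHz clock quoted elsewhere in the deck; the overhead values of 5 and 20 are made-up inputs for illustration, not measured results.

```python
def simulated_kcps(fpga_clock_hz, dart_cycles_per_simulated_cycle):
    """Thousands of simulated cycles per second for a given overhead."""
    return fpga_clock_hz / dart_cycles_per_simulated_cycle / 1e3


CLOCK_HZ = 50e6  # 50 MHz, as stated on the implementation slide

for overhead in (5, 20):  # hypothetical overhead values
    print(overhead, simulated_kcps(CLOCK_HZ, overhead))
```

Since overhead grows with the number of flits that must be serialized, the same relation explains why simulation slows down as traffic increases.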

30-33 Programmability Overhead
Measure the performance overhead of the global interconnect and the simplified router model
Four combinations of two options
–Interconnect: dedicated vs. global
–Router: 5-port vs. 1-port
Baseline: dedicated + 5-port
Benchmarks: 9-node mesh and 64-node mesh

34 Overhead: 9-node DART
[Plot: overhead of a simulated 9-node DART for four configurations; lower overhead is better]
–Dedicated links + true 5-ported router
–Dedicated links + 1-ported router
–Global interconnect + 5-ported router
–Global interconnect + 1-ported router
Overhead (2-3x) due to global interconnect
Overhead (2-6x) due to 1-port router
Router overhead dominates

35 Overhead: 64-node DART
[Plot: overhead of a simulated 64-node DART for the same four configurations; lower overhead is better; annotated where the simulated NoC saturates]
–Dedicated links + true 5-ported router
–Dedicated links + 1-ported router
–Global interconnect + 5-ported router
–Global interconnect + 1-ported router
Global interconnect is the bottleneck

36 Scalability
Compare DART’s performance scaling to Booksim beyond 9 nodes
–64-node DART with 8-partition global interconnect
Benchmarks: mesh sizes from 9 to 64
DART performance extrapolated from architecture simulator assuming 50 MHz clock

37 Scalability: Mesh Benchmarks
[Plot: simulation speed of Booksim and 64-node DART on the mesh benchmarks; higher is faster]
DART simulation speed depends on network load only
Higher speedups over Booksim for large NoCs

38 An Implementation of DART
9 nodes (max. that fit)
8-partition interconnect
50 MHz
XUPV2P Development Board, Virtex-II Pro XC2VP30
Component | Utilization (LUTs)
Router (x9) | 612
TrafficGen (x9) | 691
FlitQueue (x36) | 305
Interconnect | 2,144
Control FSM | 152
Total | 26,385 (96%)

39 Real Speed-up vs. Booksim
[Plot: simulation speed of Booksim and DART, with DART speedup; higher is faster; DART is slower with more traffic]
70x to 160x speedup
Large NoC simulations can become more interactive

40 Future Work
Virtualize DART nodes using multithreading
–Further trade performance for area
Off-chip traffic generation
–Integrate with full-system evaluation framework
Better coverage of the router design space
–Adaptive routing, speculative routing, etc.
–Investigate specialized soft processors

41 Summary
Software-configurable FPGA-based NoC simulator is feasible
–Area overhead vs. existing emulators is negligible
Over 100x speedup over software NoC simulator (Booksim)
Hardware and software tools available at http://www.eecg.toronto.edu/DART

42 Q & A
Thank you!

43 Backup Slides
Classic Router Microarchitecture
Global Interconnect
DART Software Flow
Correctness Analysis
Interconnect Performance vs. Resource Utilization
DART vs. Booksim Speedup

44 Classic Router Microarchitecture

45 Global Interconnect

46 DART Software
DARTgen
–Placement of simulated nodes in DART partitions
–Evenly distribute nodes across partitions to balance load
–Generate configuration bytes
DARTportal
–Communicates with the DART simulator on the FPGA through a serial port
–Interactive
[Figure: UART, Control FSM, and DART Simulator on the FPGA]
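
The "evenly distribute nodes across partitions" step could be as simple as a round-robin assignment. The sketch below is one possible placement policy, not DARTgen's actual algorithm; the function name is hypothetical, and the 9-node, 8-partition sizes are taken from the implementation slide.

```python
def place_nodes(num_nodes, num_partitions):
    """Round-robin placement: partition sizes differ by at most one node."""
    partitions = [[] for _ in range(num_partitions)]
    for node in range(num_nodes):
        partitions[node % num_partitions].append(node)
    return partitions


placement = place_nodes(num_nodes=9, num_partitions=8)
print(placement)  # [[0, 8], [1], [2], [3], [4], [5], [6], [7]]
```

A placement that also accounts for expected traffic between nodes might balance load better, but that would need information the slides do not give.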

47 Correctness (1/2)
Topology | 3 x 3 mesh
Router architecture | Input queued
Routing algorithm | XY
# of VCs per port | 2
VC allocation | Round-robin
Traffic pattern | Random permutation
Packet size | 2 flits
booksim: 5-cycle routing delay
booksim2: 4-cycle routing delay + 1-cycle switch allocation delay
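
A random permutation traffic pattern is commonly understood as fixing one destination per source by shuffling the node list once. The sketch below follows that reading; the function name and the seed are assumptions, and the experiment's actual permutation is not given on the slide.

```python
import random


def random_permutation_traffic(num_nodes, seed=0):
    """Map every source node to a distinct destination via one random shuffle.

    A node may be mapped to itself, which produces 0-hop packets.
    """
    rng = random.Random(seed)  # seeded so the pattern is reproducible
    dests = list(range(num_nodes))
    rng.shuffle(dests)
    return dict(enumerate(dests))


pattern = random_permutation_traffic(9)  # 3 x 3 mesh
for src, dest in pattern.items():
    print(f"node {src} -> node {dest}")
```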

48 Correctness (2/2)
[Plots: one panel each for 0-hop, 1-hop, 2-hop, 3-hop, and 4-hop packets]
Booksim has longer tail

49 Interconnect Scalability (1/2)
[Plots: flit injection rate = 0.1 and flit injection rate = 0.5]

50 Interconnect Scalability (2/2)

51 DART vs. Booksim Speedup
Better speedup for larger NoCs

52 Related Work (1/2)
FPGA-based processor simulation
–RAMPGold: Tan et al., DAC 2010
–ProtoFlex: Chung et al., IPDPS 2007
–A-Ports: Pellauer et al., FPGA 2008
Direct NoC emulation
–Genko et al., DATE 2005
–NoCem: Schelle and Grunwald, WARFP 2006

53 Related Work (2/2)
DRNoC: exploit dynamic reconfiguration of Xilinx FPGAs
–Krasteva et al., Reconfig. 2008
Virtualized simulation
–Wolkotte et al., NoCS 2007
DARSIM: parallel software NoC simulator
–Lis et al., MoBS 2010

54 Software Simulators
Modular design (typically in an OO language)
Stand-alone or integrated
Pros:
–Easy to implement new models
–Fast to develop and debug
–As detailed and accurate as desired
Cons:
–Simulating large NoCs in detail can be slow: <10 KIPS to 100s of KIPS
–Parallelizing using threads is non-trivial: high synchronization overhead
At 100 KIPS: 1 s of execution at 1 GHz = 10K sec = 2.8 hrs

55 FPGA-based Models
FPGAs have become big enough
Map entire NoC to FPGA
Pros:
–Faster than software simulation (10s to 100s of MIPS): lots of parallelism, low-overhead synchronization
Cons:
–Emulators can’t be reused to evaluate different NoCs
–Redesign is difficult and time-consuming
–Max simulatable NoC size limited by FPGA size

56 DART: Configurable Simulator on FPGA
Emulators can’t be reused to evaluate different NoCs
–A generic NoC simulation model that is decoupled from the architecture of a specific NoC
Redesign is difficult and time-consuming
–Software configurable, no hardware redesign needed
Max simulatable NoC size limited by FPGA size
–Optimize simulator architecture for area by trading off some speed
Fixed framework, configurable settings, still fast!

57 Architecture Evaluation Methods
Requirement | Software Simulation | FPGA Prototypes | FPGA-based Emulators | DART
Accurate | Possible | Very | Possible | Yes
Fast to run | < 10 KIPS to 100s of KIPS | 100s of MIPS | 10s to 100s of MIPS | 10s of MIPS
Easy to build | Yes | No
Easy to modify | Yes | No | No | Yes
Available early | Yes | No | Yes
KIPS: Thousands of Instructions per Second
MIPS: Millions of Instructions per Second

58 DART Simulator Model (cont’d)
Descriptors without data payload
–Flits: 36 bits
–Credits: 12 bits
10-bit timestamp
–Correctly captures latency up to 1024 cycles
Scale up to 256 nodes, 8 ports/node, 4 VCs/port
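
The latency limit follows from the timestamp width: a 10-bit counter wraps every 1024 cycles, so the difference between two timestamps is only meaningful modulo 1024. The sketch below shows that modular subtraction, plus the number of bits needed to index the stated maximum of 256 nodes, 8 ports, and 4 VCs. How these fields are actually arranged within the 36-bit flit descriptor is not stated on the slide, and all names here are illustrative.

```python
TIMESTAMP_BITS = 10
TIMESTAMP_MOD = 1 << TIMESTAMP_BITS  # 1024


def latency(inject_ts, arrival_ts):
    """Latency from two wrapped 10-bit timestamps.

    Correct as long as the true latency is below 1024 cycles; a longer
    latency aliases to a smaller value.
    """
    return (arrival_ts - inject_ts) % TIMESTAMP_MOD


def bits_for(count):
    """Bits needed to index `count` distinct items."""
    return (count - 1).bit_length()


print(latency(1020, 4))  # 8: the counter wrapped between injection and arrival
print(bits_for(256), bits_for(8), bits_for(4))  # 8 3 2
```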

59 NoC Basics
Topology
Routing algorithm
Flow control
Router microarchitecture

60 Motivation
Multi-core is here to stay
Communication is the performance bottleneck
Network-on-Chip (NoC) advantages
–Higher bandwidth
–More efficient sharing of on-chip resources
–Easier to build, verify, fabricate
Need high-quality evaluation tools
[Figures: Intel SCC (48 cores & mesh NoC); Cell Processor (8 SPEs & ring NoC)]

61 The Ideal Simulator
Accurate
Fast
Easy to implement, use and modify
Available early in the design process
Existing tools don’t offer all four properties

