Presentation is loading. Please wait.

Presentation is loading. Please wait.

Mohamed ABDELFATTAH Andrew BITAR Vaughn BETZ. 2 Module 1 Module 2 Module 3 Module 4 FPGAs are big! Design big systems High on-chip communication.

Similar presentations


Presentation on theme: "Mohamed ABDELFATTAH Andrew BITAR Vaughn BETZ. 2 Module 1 Module 2 Module 3 Module 4 FPGAs are big! Design big systems High on-chip communication."— Presentation transcript:

1 Mohamed ABDELFATTAH Andrew BITAR Vaughn BETZ

2 2 Module 1 Module 2 Module 3 Module 4 FPGAs are big! Design big systems High on-chip communication

3 3 Bus 1 Bus 2 Bus 3 Module 4 Module 1 Module 3 Module 2 FPGAs are big! Design big systems High on-chip communication Wide links from single bit- programmable interconnect Somewhat unpredictable frequency  re-pipeline 100s of bits Tough timing constraints Physical distance affects frequency Buses are unique to the application, and.. inefficient Design a bus for each set of connections

4 4 FPGAs are big! Design big systems High on-chip communication Design a bus for each set of connections Wide links from single bit- programmable interconnect Somewhat unpredictable frequency  re-pipeline Module 4 Module 1 Module 3 Module 2 Buses are unique to the application, and.. inefficient

5 5 FPGAs are big! Design big systems High on-chip communication Module 4 Module 1 Module 3 Module 2 Embedded NoC Data Packet flit Data Sys-level interconnect used in most designs  harden! NoC=complete interconnect Data transport Switching Buffering NoC=complete interconnect Data transport Switching Buffering Links Routers IEEE Micro 2014: Smaller and less power than bus connecting 3 modules to DDR3 More efficient than soft buses and huge bandwidth

6 Module 6 NoC can transport 22.5 GB/s on one link Easy for designers No segmentation of data DDR3 max = 17 GB/s Run NoC at fixed, high frequency NoC can transport 22.5 GB/s on one link Easy for designers No segmentation of data DDR3 max = 17 GB/s Run NoC at fixed, high frequency Link Width# Nodes 150 bits16 Area %Freq. 1.3 %1.2 GHz 22.5 GB/s = 1.2 GHz x 150 bits = 300 MHz x 600 bits22.5 GB/s

7 Module 7 NoC can transport 22.5 GB/s on one link Easy for designers No segmentation of data DDR3 max = 17 GB/s Run NoC at fixed, high frequency NoC can transport 22.5 GB/s on one link Easy for designers No segmentation of data DDR3 max = 17 GB/s Run NoC at fixed, high frequency Link Width# Nodes 150 bits16 Area %Freq. 1.3 %1.2 GHz 22.5 GB/s = 1.2 GHz x 150 bits = 300 MHz x 600 bits22.5 GB/s

8 How do we adapt Embedded NoCs to FPGAs? 8 1. FabricPort: NoC-to-FPGA interface 2. NoC with FPGA design styles 3. Case studies: JPEG & Ethernet switch [Easy-to-use]

9 9 NoC FabricPort = On-ramp Photo from: http://www.aaroads.com

10 10 Ready/Valid Any* width 0-600 bits Any* width 0-600 bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency 100-400 MHz Frequency 100-400 MHz FPGA Module Embedded NoC Width # flits 0-150 bits 1 150-300 bits 2 300-450 bits 3 450-600 bits 4

11 11 3 Ready/Valid Any* width 0-600 bits Any* width 0-600 bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency 100-400 MHz Frequency 100-400 MHz FPGA Module Embedded NoC Time-domain multiplexing: Divide width by 4 Multiply frequency by 4

12 12 3 Ready/Valid Any* width 0-600 bits Any* width 0-600 bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency 100-400 MHz Frequency 100-400 MHz FPGA Module Embedded NoC Asynchronous FIFO: Cross into NoC clock No restriction on module frequency Time-domain multiplexing: Divide width by 4 Multiply frequency by 4

13 13 3 Ready/Valid Any* width 0-600 bits Any* width 0-600 bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency 100-400 MHz Frequency 100-400 MHz FPGA Module Embedded NoC Asynchronous FIFO: Cross into NoC clock No restriction on module frequency Time-domain multiplexing: Divide width by 4 Multiply frequency by 4 NoC Writer: Track available buffers in NoC Router Forward flits to NoC Backpressure 2 10

14 14 Ready/Valid Any* width 0-600 bits Any* width 0-600 bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency 100-400 MHz Frequency 100-400 MHz FPGA Module Embedded NoC 3 Virtual Channels: Multiple buffers Logical separation of data VC2 VC1 2121 Packet on VC1 Packet on VC2 Interleaved flits from 2 packets

15 15 Ready/Valid Any* width 0-600 bits Any* width 0-600 bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency 100-400 MHz Frequency 100-400 MHz FPGA Module Embedded NoC 32121 Packet on VC2 321 21 Virtual Channels: Multiple buffers Logical separation of data Packet on VC1 Interleaved flits from 2 packets

16 16 Ready/Valid Any* width 0-600 bits Any* width 0-600 bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency 100-400 MHz Frequency 100-400 MHz FPGA Module Embedded NoC 32121 Packet on VC2 321 21 Asynchronous FIFO: Reads packet from one VC at a time Supports Priorities Virtual Channels: Multiple buffers Logical separation of data Packet on VC1 321 21 Interleaved flits from 2 packets

17 17 Ready/Valid Any* width 0-600 bits Any* width 0-600 bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency 100-400 MHz Frequency 100-400 MHz FPGA Module Embedded NoC 32121 Packet on VC2 321 21 Asynchronous FIFO: Reads packet from one VC at a time Supports Priorities Virtual Channels: Multiple buffers Logical separation of data Packet on VC1 TD Demultiplexing: Output 1 packet at a time  easy for designers 321 21 3 2 1 2 1 ? Interleaved flits from 2 packets

18 18 Ready/Valid Any* width 0-600 bits Any* width 0-600 bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency 100-400 MHz Frequency 100-400 MHz FPGA Module Embedded NoC 2121 Packet on VC2 21 21 Packet on VC1 combine_data mode: Packets from VC1  upper bits Packets from VC2  lower bits 21 21 2 1 2 1

19 19 1. FabricPort: NoC-to-FPGA interface 2. NoC with FPGA design styles 3. Case studies: JPEG & Ethernet switch How do we adapt Embedded NoCs to FPGAs?

20 20 NoC Rules for using an NoC with FPGA design styles Photo from: http://www.dailymail.co.uk

21 FPGA Designs Mostly streaming Cannot tolerate reordering – Hardware expensive and difficult 21 Multiprocessors Memory mapped Packets arrive out-of-order – Fine for cache lines – Processors have re-order buffers 1 2 All packets with same src/dst must take same NoC path RULE

22 FPGA Designs Mostly streaming Cannot tolerate reordering – Hardware expensive and difficult 22 Multiprocessors Memory mapped Packets arrive out-of-order – Fine for cache lines – Processors have re-order buffers 1 2 All packets with same src/dst must take same NoC path RULE FULL 2 1 1 2 All packets with same src/dst must take same VC RULE

23 23 1 2 1 2 0

24 24 Each message type must take a different VC  FabricPort output must support that RULE 1 2 2 0 1

25 25 Design Styles Latency- insensitive Latency- sensitive

26 Stall-wrappers 26 Latency-Insensitive Modules are stallable Ready/valid signals AB C data Ready signals A B C Tolerate latency on links

27 27 ABC Stall-wrappers Latency-Insensitive Modules are stallable Ready/valid signals AB C data Ready signals Tolerate latency on links

28 28 ABC Longer path, lower frequency  Insert pipeline regs  Doesn’t affect functionality Stall-wrappers Latency-Insensitive Modules are stallable Ready/valid signals Tolerate latency on links AB C data Ready signals

29 29 ABC NoC for Latency-insensitive: Pipelined Timing closure: yes Little/no physical info (?) NoC for Latency-insensitive: Pipelined Timing closure: yes Little/no physical info (?) Stall-wrappers Latency-Insensitive Modules are stallable Ready/valid signals AB C data Ready signals Tolerate latency on links

30 30 Latency-Sensitive Not stallable No Ready/valid signals Latency must remain constant to be functional Permapaths A BC Reserved Paths Never stall Fixed latency Reserved Paths Never stall Fixed latency AB C data

31 31 1. FabricPort: NoC-to-FPGA interface 2. NoC with FPGA design styles 3. Case studies: JPEG & Ethernet switch How do we adapt Embedded NoCs to FPGAs?

32 32 Booksim - NoC simulator - Cycle-accurate - Widely-used - NoC simulator - Cycle-accurate - Widely-used Define traffic pattern www.eecg.utoronto.ca/~mohamed/rtl2booksim RTL (Verilog) - Hardware designs - FabricPort - NoC looks exactly like a hardware module - Hardware designs - FabricPort - NoC looks exactly like a hardware module Sockets SV DPI Simulate any RTL design with a flexible NoC Simulator

33 Latency-sensitive [ Permapaths ] Parallelize 10 – 40 times – Data width 130 – 520 bits Important for data center applications Compare NoC version to non-NoC version 33 DCTQNRRLE Strobe Pixel in valid Pixel out DCTQNRRLE DCTQNRRLE DCTQNRRLE Strobe0100000…0 PixelXXP1P2P3P4P5…P64 Photo from: http://www.fbpapa.com

34 34 DCT QNR RLE Original [close]

35 35 DCT QNR RLE Original [close]

36 36 QNR RLE DCT Critical path Original [far] 130 MHz =50% Original [close]

37 37 QNR RLE DCT Original [far] NoC [average] Original [close]

38 38 QNR RLE DCT Original [far] NoC [average] 5% drop in frequency from best case 80% improvement from worst case ~6% variability regardless of module placement Frequency is good, and (much) more predictable with NoC Original [close] No drop in frequency for largest design

39 39 NoC doesn’t produce routing hotspots – it reduces them NoC reduces utilization of scarce long FPGA wires Max 40% Max 100% [long wires]

40 Communications and networking are the largest FPGA customers Important design: Ethernet Switch 40 High Bandwidth Ethernet switches  ASIC land – 10/25/40/100 Gb/s per port switches, e.g. 32x32 Best previous FPGA switch: – 16x16, 10 Gb/s per port, total 160 Gb/s – FPGAs have transceiver BW more than 1000 Gb/s Photo from: http://wisegeek.org

41 Embedded NoC is the switch (16X16 switch) Use RTL2Booksim to measure latency/throughput 41

42 42 Previous work 10 Gb/s NoC Switch 10 Gb/s Approx. Half the latency Approx. Half the latency Saturates at higher injection: Handles more bursty traffic Saturates at higher injection: Handles more bursty traffic

43 Previous work 10 Gb/s NoC Switch 10 Gb/s 25 Gb/s 3X smaller area 40 Gb/s 5X more switching than best: 819 Gb/s vs 160 Gb/s ~ Half the latency Reconfigurable speed, radix, protocol – augment pkt processing

44 44 How do we adapt Embedded NoCs to FPGAs? 1. FabricPort: NoC-to-FPGA interface 2. NoC with FPGA design styles Module (any freq./width)  Embedded NoC (1.2 GHz, 150 bits) Designer-friendly Leverage VCs, credits Rules for NoC design on FPGAs: packet ordering and dependencies Support FPGA Design Styles: Latency-insensitive design Latency-sensitive design Parallel JPEG: Frequency predictable and good, Reduces routing hotspots and long wire utilization Ethernet Switch: 5X more switching, 3X less area, reconfigurable 3. Case studies: JPEG & Ethernet switch Compelling case for Embedded NoCs

45 45 www.eecg.utoronto.ca/~mohamed

46


Download ppt "Mohamed ABDELFATTAH Andrew BITAR Vaughn BETZ. 2 Module 1 Module 2 Module 3 Module 4 FPGAs are big! Design big systems High on-chip communication."

Similar presentations


Ads by Google