
1 Mohamed ABDELFATTAH Vaughn BETZ

2 Outline
1. Why NoCs on FPGAs?
2. Hard/soft efficiency gap
3. Integrating hard NoCs with FPGA

3 Outline
1. Why NoCs on FPGAs? (Motivation, Previous Work)
2. Hard/soft efficiency gap
3. Integrating hard NoCs with FPGA

4 1. Why NoCs on FPGAs?
Figure labels: Interconnect, Logic Blocks, Switch Blocks, Wires

5 1. Why NoCs on FPGAs?
Figure labels: Logic Blocks, Switch Blocks, Wires, Hard Blocks (Memory, Multiplier, Processor)

6 1. Why NoCs on FPGAs?
Figure labels: Logic Blocks, Switch Blocks, Wires, Hard Blocks (Memory, Multiplier, Processor), Hard Interfaces (DDR/PCIe..), 1600 MHz, 200 MHz, 800 MHz
Interconnect still the same

7-10 1. Why NoCs on FPGAs?
Figure labels: DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet, 1600 MHz, 200 MHz, 800 MHz
1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
   - Huge CAD problem
   - Slow compilation
   - Power/area utilization
4. Wire speed not scaling:
   - Delay is interconnect-dominated
5. Low-level interconnect hinders modularity:
   - Parallel compilation
   - Partial reconfiguration
   - Multi-chip interconnect

11 Keep the “roads”, but add “freeways”.
Figure labels: Barcelona, Los Angeles, Hard Blocks, Logic Cluster. Source: Google Earth

12 1. Why NoCs on FPGAs?
Figure labels: DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet, NoC, Routers, Links
(Problems 1 to 5 as listed on slide 10.)
- Router forwards data packet
- Router moves data to local interconnect

13 1. Why NoCs on FPGAs?
(Problems 1 to 5 as listed on slide 10.)
- Pre-design NoC to requirements
- NoC links are “re-usable”
- Latency-tolerant communication
- NoC abstraction favors modularity
- High bandwidth endpoints known

14 1. Why NoCs on FPGAs?
(Problems 1 to 5 as listed on slide 10.)
- Latency-tolerant communication
- NoC abstraction favors modularity

15 1. Why NoCs on FPGAs?
Figure labels: DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet
Implementation options:
- Soft Logic (LUTs,.. )
- Hard Logic (unchangeable)
- Mixed Soft/Hard

Soft NoC | Hard NoC
Build as needed out of LUTs | Must build the whole thing
Tailor to application | Must be general enough for any application
Slower, bigger | Faster, smaller
Configurability | Efficiency

Investigate the hard vs. soft tradeoff for NoCs (area/delay)

16 1. Why NoCs on FPGAs?
- FPGA-tuned Soft NoCs: LiPar (2005), NoCeM (2008), Connect (2012)
- Hard NoCs: Francis and Moore (2008): Exploring Hard and Soft Networks-on-Chip for FPGAs
- Applications that leverage NoCs: Chung et al. (2011): CoRAM: An In-Fabric Memory Architecture for FPGA-based Computing
Our Contributions:
1. Quantify area/performance gap of hard and soft NoCs
2. Investigate how this impacts NoC design (hard/soft)
3. Integrate hard NoC with FPGA fabric

17 Outline
1. Why NoCs on FPGAs?
2. Hard/soft efficiency gap (NoC Architecture, Methodology, Soft NoC design, Results, Area/Speed Efficiency Gap)
3. Integrating hard NoCs with FPGA

18 2. Hard/Soft Efficiency
- NoC = Routers + Links
- State-of-the-art router architecture from Stanford:
  1. Acknowledge that the NoC community has excelled at building a router: we just use it
  2. To meet FPGA bandwidth requirements: high-performance router
  3. A complex router includes a superset of NoC components that may be used: more complete analysis
- Split router into 5 components

19 2. Hard/Soft Efficiency

20 2. Hard/Soft Efficiency
Input Modules: Multi-Queue Buffer = Memory + Control Logic
Parameters: Port Width, Buffer Depth, Number of VCs

21 2. Hard/Soft Efficiency
Crossbar: Multiplexers = Logic + crowded interconnect
Parameters: Port Width, Number of Ports

22 2. Hard/Soft Efficiency
Output Modules: Retiming Register = Registers + little control logic
Parameters: Port Width, Number of VCs

23 2. Hard/Soft Efficiency
Allocators: Arbiters = Logic + Registers
Parameters: Number of Ports, Number of VCs

24 2. Hard/Soft Efficiency
5 Components: Input Module, Crossbar, VC Allocator, SW Allocator, Output Module
4 Parameters: Port Width, Number of Ports, Number of VCs, Buffer Depth
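The component/parameter breakdown on slides 20 to 24 can be written down as a small lookup table. The following is only an illustrative sketch: the names `RouterParams`, `COMPONENT_PARAMS`, and `components_affected_by` are hypothetical, the allocator slide is assumed to apply to both the VC and SW allocators, and the "typical" values are the ones quoted later on slide 37.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RouterParams:
    """The 4 router parameters named on the slides (field names are hypothetical)."""
    port_width: int    # bits per port
    num_ports: int     # router ports
    num_vcs: int       # virtual channels
    buffer_depth: int  # buffer words


# Parameters listed next to each of the 5 components on slides 20 to 23.
# Assumption: the "Allocators" slide covers both the VC and SW allocators.
COMPONENT_PARAMS = {
    "Input Module": ("port_width", "buffer_depth", "num_vcs"),
    "Crossbar": ("port_width", "num_ports"),
    "VC Allocator": ("num_ports", "num_vcs"),
    "SW Allocator": ("num_ports", "num_vcs"),
    "Output Module": ("port_width", "num_vcs"),
}


def components_affected_by(param: str) -> List[str]:
    """Return the components whose slide lists `param` as a parameter."""
    return [name for name, params in COMPONENT_PARAMS.items() if param in params]


# "Typical router" values quoted on slide 37.
typical = RouterParams(port_width=32, num_ports=5, num_vcs=2, buffer_depth=10)

print(components_affected_by("num_ports"))
print(components_affected_by("buffer_depth"))
```

Laying the mapping out this way makes it easy to see, for example, that buffer depth is listed only for the input module, while port count is listed for the crossbar and both allocators.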

25 2. Hard/Soft Efficiency
Per router component:
- Post-routing FPGA (soft) area and delay
- Post-synthesis ASIC (hard) area and delay
- Both TSMC 65 nm technology (Stratix III)
- Verify results against previous FPGA:ASIC comparison by Kuon and Rose

26 2. Hard/Soft Efficiency
- Relatively small memories
- Critical component in router design
- 3 options for FPGA:
  - Registers: one per LUT
  - LUTRAM: 640 bits
  - Block RAM: 9 Kbits
- Area of each implementation option

27 2. Hard/Soft Efficiency
Figure annotations: Width = 32 Bits; Another logic cluster used

28 2. Hard/Soft Efficiency
- Relatively small memories
- 3 options for implementation on FPGA:
  - Registers: one per LUT, 0.77 Kbit/mm²
  - LUTRAM: 640 bits, 23 Kbit/mm²
  - Block RAM: 9 Kbits, 142 Kbit/mm²
- 16% utilized BRAM more area efficient than fully used LUTRAM (valid for Stratix III)
- LUTRAM could win for some points in other FPGAs
Soft: Use BRAM for FPGA (soft) implementation
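The 16% figure follows from the densities quoted on this slide. The sketch below reproduces the arithmetic. It assumes that a partly used BRAM still costs its full area, so that useful density scales linearly with utilization; the function and variable names are hypothetical.

```python
# Memory densities quoted on the slide (Stratix III), in Kbit per mm^2.
DENSITY_KBIT_PER_MM2 = {
    "Registers": 0.77,
    "LUTRAM": 23.0,
    "Block RAM": 142.0,
}


def effective_density(kind: str, utilization: float = 1.0) -> float:
    """Useful Kbit per mm^2 when only a fraction `utilization` of the memory is used.

    Assumption: the whole memory block is paid for even if it is partly used.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return DENSITY_KBIT_PER_MM2[kind] * utilization


# BRAM utilization at which a BRAM matches a fully used LUTRAM.
break_even = DENSITY_KBIT_PER_MM2["LUTRAM"] / DENSITY_KBIT_PER_MM2["Block RAM"]
print(f"BRAM break-even utilization vs. full LUTRAM: {break_even:.1%}")  # 16.2%
```

Under this assumption, any BRAM filled beyond roughly 16% stores more bits per unit area than a completely full LUTRAM, which is consistent with the slide's recommendation to use BRAM for the soft implementation.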

29 2. Hard/Soft Efficiency
Soft: High port count inefficient in soft
Figure annotations: 24X to 94X, 60X to 170X

30 2. Hard/Soft Efficiency
Soft: High port count inefficient in soft → Width scales better
Figure annotations: 26X to 17X, 72X

31 2. Hard/Soft Efficiency
Soft: Buffer depth is free on FPGAs when using BRAM
Figure annotation: Filling up the BRAM

32 2. Hard/Soft Efficiency
- Design recommendations based on FPGA silicon area
- Supported by delay measurements
Soft: Buffer depth is free on FPGAs when using BRAM
Soft: High port count inefficient in soft → Width scales better
Soft: Use BRAM for FPGA (soft) implementation

33 2. Hard/Soft Efficiency
Router Component | Mean Area Ratio | LUT:REG
Input Module | 17 | --
Crossbar | 85 | --
VC Allocator | 48 | 8:1
Switch Allocator | 56 | 20:1
Output Module | 39 | 0.6:1
Router | 30 |
Annotations: Memory, Logic + Registers

34 2. Hard/Soft Efficiency
Router Component | Mean Delay Ratio
Input Module | 2.9
Crossbar | 4.4
VC Allocator | 3.9
Switch Allocator | 3.3
Output Module | 3.4
Router | 3.6

35 Outline
1. Why NoCs on FPGAs?
2. Hard/soft efficiency gap
3. Integrating hard NoCs with FPGA (Hard NoC + FPGA, Wiring, Conclusion, Future Work)

36 3. Hard NoC with FPGA
Router Component | Area Ratio | Delay Ratio
Input Module | 17 | 2.9
Crossbar | 85 | 4.4
VC Allocator | 48 | 3.9
SW Allocator | 56 | 3.3
Output Module | 39 | 3.4
Router | 30 | 3.6
Figure annotations: 50% Total Area, 40%, 10%, Critical Path
Results suggest hardening Crossbar and Allocators → Mixed hard/soft implementation

37 3. Hard NoC with FPGA
For a typical router: 5 ports, 32 bits wide, 2 VCs, 10 buffer words
Area: Soft 4.1 mm² (1X); Hard 0.14 mm² (30X); Mixed 2.3 mm² (1.8X)
Speed: Soft 150 MHz (1X); Hard 810 MHz (5X); Mixed 390 MHz (2.5X)
Mixed: not worth hardening
- How to connect hard and soft?
- How efficient is mixed/hard after doing that?

38 3. Hard NoC with FPGA
Figure labels: FPGA, Router, Logic clusters, Router Logic
- Same I/O mux structure as a logic block, 9X the area
- Conventional FPGA interconnect between routers

39 3. Hard NoC with FPGA
Figure labels: FPGA, Router, 730 MHz
- Same I/O mux structure as a logic block, 9X the area
- Conventional FPGA interconnect between routers

40 3. Hard NoC with FPGA
Figure labels: FPGA, Router
Assumed a mesh → Can form any topology

41 3. Hard NoC with FPGA
Router:
Area: Soft 4.1 mm² (1X); Hard 0.14 mm² (30X); Hard (+ interconnect) 0.18 mm² = 9 LABs (22X)
Speed: Soft 150 MHz (1X); Hard 810 MHz (5X); Hard (+ interconnect) 730 MHz (4.7X)
64-node NoC on Stratix V:
Area: Soft ~12,500 LABs; Hard (+ interconnect) 576 LABs
%LABs: Soft 33 %; Hard (+ interconnect) 1.6 %
%FPGA: Soft 12 %; Hard (+ interconnect) 0.6 %
- Hard NoC + Soft Interconnect is very compelling
- Provides 47 GB/s peak bisection bandwidth
- Very Cheap! Less than cost of 3 soft nodes
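The 64-node figures can be cross-checked with a few lines of arithmetic. The slides do not state how the 47 GB/s bisection bandwidth was derived. The sketch below shows one reading that reproduces it, assuming an 8x8 mesh, 32-bit links (the "typical router" width), the 730 MHz clock, and counting links in both directions across the cut. All variable names are hypothetical.

```python
# Figures quoted on the slides.
NODES = 64
LABS_PER_HARD_ROUTER = 9   # hard router plus interconnect interface
SOFT_NOC_LABS = 12_500     # approximate, for the 64-node soft NoC
LINK_WIDTH_BITS = 32       # "typical router" port width
CLOCK_HZ = 730e6           # hard router with soft interconnect

# Area of the whole hard NoC, in LABs.
hard_noc_labs = NODES * LABS_PER_HARD_ROUTER           # 576

# How many soft routers would cost the same area?
soft_labs_per_node = SOFT_NOC_LABS / NODES             # about 195 LABs
hard_noc_in_soft_nodes = hard_noc_labs / soft_labs_per_node

# Assumption: 8x8 mesh, cut down the middle, 8 links per direction.
MESH_SIDE = 8
bisection_links = 2 * MESH_SIDE
bisection_bytes_per_s = bisection_links * (LINK_WIDTH_BITS / 8) * CLOCK_HZ

print(f"Hard NoC area: {hard_noc_labs} LABs")
print(f"Equivalent soft nodes: {hard_noc_in_soft_nodes:.2f}")
print(f"Peak bisection bandwidth: {bisection_bytes_per_s / 1e9:.1f} GB/s")
```

With these assumptions the whole hard NoC costs slightly less than three soft routers, and the bisection bandwidth comes out at roughly 47 GB/s, matching the slide.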

42 Outline
1. Why NoCs on FPGAs?
2. Hard/soft efficiency gap
3. Integrating hard NoCs with FPGA
- Big city needs freeways to handle traffic
- Solve communication problems for a large/heterogeneous FPGA: Timing Closure, Interconnect Scaling, Modular Design
- A hard NoC is on average 30X smaller and 3.6X faster than soft
- Crossbars and allocators worst, input buffer best
- An efficient soft NoC: uses BRAMs, large width, low port count, deep buffers
- Mixed implementation does not make sense
- Integrated fully hard NoC with FPGA fabric (for NoC links)
- 22X area improvement over soft
- Reaches max. FPGA frequency (4.7X faster than soft)
- 64-node NoC = 0.6% of total FPGA area (Stratix V)

43 3. Hard NoC with FPGA
- Power analysis
- More hardening:
  - Dedicated inter-router links (hard wires)
  - Clock domain crossing hardware
- How do traffic hotspots (DDR/PCIe) influence NoC design?
- Latency insensitive design methodology that uses NoC
- CAD tool changes for a NoC-based FPGA
