Architecture and Routing for NoC-based FPGA
Israel Cidon*
*Joint work with Roman Gindin and Idit Keidar
Israel Cidon - Technion
One NoC does not fit all!
  [Figure: NoC design space. Labels: flexibility; traffic uncertainty; single application; general-purpose computer; chip design; configuration; run time. Plotted classes: FPGA, SoC, CMP.]
  Source: I. Cidon and K. Goossens, in “Networks on Chips”, G. De Micheli and L. Benini, Morgan Kaufmann, 2006
Field Programmable Gate Array
  - Flexible soft logic
    - Configurable logic blocks (CLBs) and routing channels
    - Programmed look-up tables (LUTs)
    - Configurable switching boxes
  - Area-, power- and speed-efficient hard logic
    - Wire and clock infrastructure
    - Special-purpose modules, e.g., CPU, SerDes
FPGA Example
  [Figure]
Challenges for Future FPGA
  - Scalability of design methodology
  - Dominance of wire delays: already more than 50% of delay
  - Power
  - Complex communication patterns
  - Prototyping for NoC-based SoCs
NoC-Based FPGA Architecture
  [Figure labels: functional unit; routers; NoC for inter-region routing; configurable region: user logic; configurable network interface]
Future FPGA: NoC-Based
  - Hierarchical: divide the chip into regions
    - Programmable wiring inside regions
    - Regions interconnected by NoCs
  - Scalable: short wires, spatial reuse, power cost
  - Modular design
  - Prototype for NoC-based SoC
Hard or soft NoC?
  - Why hard
    - Interconnect is a performance bottleneck
    - Interconnect power
    - Part of FPGA infrastructure
  - Why soft
    - Application is not known when the network is built
    - Provides maximum flexibility
    - Prevents resource lockup
Suggested FPGA NoC Architecture
  NoC Element                                     Implementation
  Wires, repeaters, etc.                          Hard
  Routers, including VCs, buffers, QoS support    Hard
  Network interfaces                              Soft: Configurable Network Interface (CNI)
  Routing algorithm and headers                   Soft: determined in CNI
  Routing tables                                  Soft
FPGA Routing: Optimization Problem
  - Set of applications
    - Different architectures
    - Different traffic patterns
  - Implemented on the same chip
  - Common efficient NoC
The NoC design problem
  - The cost
    - Hard grid links: for uniform grids, the capacity of the most congested link
    - NoC logic: hard logic for routers; soft logic for routing tables, headers, CNIs
  - Design envelope: the collection of designs supported by a given programmable chip
  - The variables
    - Number of “hard-coded” wires per link
    - Possible configurable routing schemes
Routing Schemes: XY
  - Very simple logic
  - Deadlock free
  - Unbalanced: high cost in uniform-capacity grids
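To make the "unbalanced" point concrete, here is a minimal sketch of XY routing on a grid and the per-link load it produces. The 3x3 grid, the unit traffic rates, and the helper names `xy_route` and `link_loads` are illustrative assumptions, not taken from the talk.

```python
from collections import Counter

def xy_route(src, dst):
    """Directed links visited when moving along X first, then along Y."""
    x, y = src
    links = []
    while x != dst[0]:
        nxt = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nxt, y)))
        x = nxt
    while y != dst[1]:
        nxt = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, nxt)))
        y = nxt
    return links

def link_loads(flows):
    """Sum the rate of every flow over each directed link it crosses."""
    loads = Counter()
    for (src, dst), rate in flows.items():
        for link in xy_route(src, dst):
            loads[link] += rate
    return loads

# Hypothetical workload: every node of a 3x3 grid sends 1 unit to the centre.
hotspot = (1, 1)
flows = {((x, y), hotspot): 1
         for x in range(3) for y in range(3) if (x, y) != hotspot}
loads = link_loads(flows)
print(max(loads.values()))       # 3: the links entering the hotspot along Y
print(loads[((0, 1), (1, 1))])   # 1: the links entering along X
```

In this toy case eight flows share four incoming links, so a perfectly balanced scheme would need capacity 2 per link, while XY needs 3 on a uniform-capacity grid.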
Toggle XY (TXY)
  - Split packets evenly between XY and YX routes
  - Deadlock avoided with 2 VCs
  - Near-optimal for symmetric traffic (permutations) [Seo et al. 05; Towles & Dally 02]
  - Simple
  - Better balanced
  - Split routes
  - Does not take the traffic pattern into account
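The even split can be sketched by charging half of each flow to its XY route and half to its YX route. The grid size, unit rates, and the names `route` and `txy_loads` are assumptions made for illustration only.

```python
from collections import Counter

def route(src, dst, order):
    """Links of a dimension-ordered route; order is 'xy' or 'yx'."""
    pos = list(src)
    links = []
    for axis in ((0, 1) if order == "xy" else (1, 0)):
        while pos[axis] != dst[axis]:
            prev = tuple(pos)
            pos[axis] += 1 if dst[axis] > pos[axis] else -1
            links.append((prev, tuple(pos)))
    return links

def txy_loads(flows):
    """Expected link loads when each flow is split 50/50 between XY and YX."""
    loads = Counter()
    for (src, dst), rate in flows.items():
        for order in ("xy", "yx"):
            for link in route(src, dst, order):
                loads[link] += rate / 2
    return loads

# Hypothetical workload: every node of a 3x3 grid sends 1 unit to the centre.
hotspot = (1, 1)
flows = {((x, y), hotspot): 1
         for x in range(3) for y in range(3) if (x, y) != hotspot}
print(max(txy_loads(flows).values()))  # 2.0
```

For this single-hotspot toy case the maximum expected load drops to 2.0, which is the best any scheme can do with four links entering the hotspot. The loads are averages: individual packets of one flow still take different paths.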
Weighted Schemes
  - TXY does not always produce the best results
  [Figure: maximum capacity for a traffic graph with two hotspots at (1,1) and (1,2) on a 5x5 grid. Panels: TXY, Optimum.]
WTXY
  - Given a traffic pattern, choose the XY/YX ratio that gives the lowest maximum capacity
  - Compute the ratio at programming time
  - Load it into the C_xy field in the router
  - The router chooses the XY route with probability C_xy, otherwise YX
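One way to picture the programming-time step is a simple sweep over candidate ratios, followed by a per-packet random choice in the router. This is a sketch under stated assumptions: a single global C_xy value, a coarse sweep in steps of 0.05, and a hypothetical workload in which every non-hotspot node sends one unit to each hotspot. The talk does not say how the ratio is actually computed, and the names `max_load`, `best_c_xy`, and `pick_order` are illustrative.

```python
import random
from collections import Counter

def route(src, dst, order):
    """Links of a dimension-ordered route; order is 'xy' or 'yx'."""
    pos = list(src)
    links = []
    for axis in ((0, 1) if order == "xy" else (1, 0)):
        while pos[axis] != dst[axis]:
            prev = tuple(pos)
            pos[axis] += 1 if dst[axis] > pos[axis] else -1
            links.append((prev, tuple(pos)))
    return links

def max_load(flows, c_xy):
    """Largest expected link load when a fraction c_xy of traffic goes XY."""
    loads = Counter()
    for (src, dst), rate in flows.items():
        for order, share in (("xy", c_xy), ("yx", 1.0 - c_xy)):
            for link in route(src, dst, order):
                loads[link] += rate * share
    return max(loads.values())

def best_c_xy(flows, steps=20):
    """Programming-time sweep (assumed method): try c = 0, 1/steps, ..., 1."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda c: max_load(flows, c))

def pick_order(c_xy, rng):
    """Run-time choice: XY with probability c_xy, otherwise YX."""
    return "xy" if rng.random() < c_xy else "yx"

# Hypothetical workload: two hotspots at (1,1) and (1,2) on a 5x5 grid.
hotspots = [(1, 1), (1, 2)]
flows = {((x, y), h): 1.0
         for x in range(5) for y in range(5) if (x, y) not in hotspots
         for h in hotspots}
c = best_c_xy(flows)
print(c, max_load(flows, c), max_load(flows, 0.5))
print(pick_order(c, random.Random(0)))
```

Because 0.5 is one of the swept candidates, the chosen ratio is never worse than plain TXY for the given traffic pattern.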
TXY, WTXY Limitation
  - Traffic is split: packets of the same flow take different paths
  - Delays may cause out-of-order arrivals
  - Re-ordering buffers are costly
Ordered Routing Algorithms
  - One route per source-destination (S-D) pair
  - No traffic splitting
  [Figure panels: Unordered Routing, Ordered Routing]
Source Toggle XY (STXY)
  - The route is a function of the bitwise XOR of the source and destination IDs
  - Very simple algorithm
  - Maximum capacity is similar to TXY
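The slide does not spell out how the XOR is reduced to a route choice. The sketch below assumes one plausible reading: node IDs are row-major indices, and the parity of the XOR of the two IDs selects XY (even) or YX (odd). Both the ID encoding and the parity rule are assumptions, as are the names `node_id` and `stxy_order`.

```python
def node_id(node, width):
    """Assumed ID encoding: row-major index of an (x, y) grid position."""
    x, y = node
    return y * width + x

def stxy_order(src, dst, width):
    """Assumed rule: parity of the XOR of the two IDs picks the route."""
    parity = bin(node_id(src, width) ^ node_id(dst, width)).count("1") % 2
    return "xy" if parity == 0 else "yx"

# Every S-D pair always gets the same route, so packets stay in order.
width = 4
nodes = [(x, y) for y in range(width) for x in range(width)]
orders = [stxy_order(s, d, width) for s in nodes for d in nodes if s != d]
print(orders.count("xy"), orders.count("yx"))  # 112 128
```

Under these assumptions the rule needs no stored state, keeps each pair on a single path, and still divides the pairs of a 4x4 grid almost evenly between XY and YX.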
Weighted Ordered Toggle (WOT)
  - The route per S-D pair is chosen at programming time
  - Each source stores a routing bit for each destination
  - Objective: minimize the maximum link capacity
  - Optimal route assignment is difficult
WOT Min-max Route Assignment
  - Initial assignment: STXY
  - Make changes that reduce the capacity:
    - Find the most loaded link
    - Among the S-D pairs sharing this link, change the one that minimizes the maximum capacity (if possible)
  - Sub-optimal
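The iteration above can be sketched as a greedy local search. This is an illustration under stated assumptions, not the authors' implementation: ties between equally loaded links are broken by also counting how many links sit at the maximum (an added tie-breaker), and the demo starts from an all-XY assignment rather than STXY so the steps are easy to follow by hand. The names `link_loads`, `cost`, and `wot_assign` are hypothetical.

```python
from collections import Counter

def route(src, dst, order):
    """Links of a dimension-ordered route; order is 'xy' or 'yx'."""
    pos = list(src)
    links = []
    for axis in ((0, 1) if order == "xy" else (1, 0)):
        while pos[axis] != dst[axis]:
            prev = tuple(pos)
            pos[axis] += 1 if dst[axis] > pos[axis] else -1
            links.append((prev, tuple(pos)))
    return links

def link_loads(flows, assignment):
    loads = Counter()
    for pair, rate in flows.items():
        for link in route(pair[0], pair[1], assignment[pair]):
            loads[link] += rate
    return loads

def cost(flows, assignment):
    """(max load, number of links at max load); second term is an assumed tie-breaker."""
    loads = link_loads(flows, assignment)
    top = max(loads.values())
    return top, sum(1 for v in loads.values() if v == top)

def flip(order):
    return "yx" if order == "xy" else "xy"

def wot_assign(flows, initial):
    """Greedy search: repeatedly flip one S-D pair on the most loaded link."""
    assignment = dict(initial)
    while True:
        loads = link_loads(flows, assignment)
        worst = max(loads, key=loads.get)            # most loaded link
        best_pair, best_cost = None, cost(flows, assignment)
        for pair in flows:
            if worst not in route(pair[0], pair[1], assignment[pair]):
                continue                             # pair does not use this link
            trial = dict(assignment)
            trial[pair] = flip(assignment[pair])
            trial_cost = cost(flows, trial)
            if trial_cost < best_cost:               # strict improvement only
                best_pair, best_cost = pair, trial_cost
        if best_pair is None:
            return assignment                        # local optimum reached
        assignment[best_pair] = flip(assignment[best_pair])

# Hypothetical workload: every node of a 3x3 grid sends 1 unit to the centre.
hotspot = (1, 1)
flows = {((x, y), hotspot): 1
         for x in range(3) for y in range(3) if (x, y) != hotspot}
all_xy = {pair: "xy" for pair in flows}
result = wot_assign(flows, all_xy)
print(cost(flows, all_xy)[0], cost(flows, result)[0])  # 3 2
```

Each accepted flip strictly lowers the cost, so the loop terminates, but it can stop at a local optimum, which matches the "sub-optimal" caveat on the slide. The result is one routing bit per S-D pair, so packet order is preserved.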
Iteration Demonstration
  [Figure: grid example with sources S1, S2, S3 and destinations D1, D2, D3]
Benchmarks
  - Previous work considers uniform permutations
  - Chips have one or more hotspots: CPU, on-chip memory, off-chip memory interface
  - We use several hotspot traffic models
  - We also use a real-world example
Single Hotspot
  [Figure]
Two Hotspots
  [Figure captions: Maximum Capacity; Design Envelope for various distances between the hotspots for WOT]
Three Hotspots
  [Figure: maximum capacity vs. minimum distance between the hotspots]
Mixed Traffic Model
  - Three parameters per node:
    - A probability to be a hotspot
    - A probability to send data to a hotspot
    - A probability to send data to a non-hotspot
  - Average improvement for WOT is 12% vs. TXY and 25% vs. XY
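A traffic pattern of this kind could be generated roughly as follows. The slide gives only the three parameters, so the details here are assumptions: the same three values are used for every node, each S-D pair is drawn independently, and every generated flow has unit rate. The function name `mixed_traffic` is hypothetical.

```python
import random

def mixed_traffic(nodes, p_hotspot, p_to_hot, p_to_other, rng):
    """Draw a random hotspot set and a set of unit-rate S-D flows."""
    hotspots = {n for n in nodes if rng.random() < p_hotspot}
    flows = {}
    for src in nodes:
        for dst in nodes:
            if src == dst:
                continue
            # Destination type decides which send probability applies.
            p = p_to_hot if dst in hotspots else p_to_other
            if rng.random() < p:
                flows[(src, dst)] = 1
    return hotspots, flows

nodes = [(x, y) for x in range(5) for y in range(5)]
hotspots, flows = mixed_traffic(nodes, 0.1, 0.8, 0.05, random.Random(42))
print(len(hotspots), len(flows))
```

The returned `flows` dictionary has the same shape as a routing-cost evaluation would need: one entry per S-D pair with its rate.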
Real-World Example
  - Based on Bertozzi: video encoder
  - Mapping and placement are done manually
Real-World Example
  [Figure: maximum capacity for WOT, STXY, and XY]
Summary
  - A new NoC-based architecture for FPGA
  - A design methodology for this architecture
  - WOT routing algorithm: balanced, in-order, low cost