Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints
F. Gilabert, D. Ludovici§, S. Medardoni, D. Bertozzi, L. Benini, G. N. Gaydadjiev§
University of Ferrara. University of Bologna. Universidad Politecnica de Valencia. §Delft University of Technology.

Multi-dimension topologies
The 2D mesh is frequently used for NoC design:
- it perfectly matches the 2D silicon surface
- high level of modularity
- controllability of electrical parameters
But its average latency and resource consumption scale poorly with network size.
Topologies with more than 2 dimensions are attractive:
- higher bandwidth and lower average latency
- on-chip wiring is more cost-effective than off-chip wiring
But physical design issues might impact their effectiveness and even their feasibility:
- decreased operating frequency
- higher link latency

Objective
Explore the effectiveness and feasibility of multi-dimensional topologies under realistic technological constraints.
1. Physical synthesis impact over performance
- Over-the-cell routing?
- Latency in injection links?
- Latency in express links?
- Which switch operating frequency?
- Regularity is broken by asymmetric tile sizes or heterogeneous tiles!
Our approach: physical parameters from the physical synthesis are applied to system-level simulations (silicon-aware performance analysis).

Objective
Explore the effectiveness and feasibility of multi-dimensional topologies under realistic architectural constraints.
1. Physical synthesis impact over performance
2. Impact of chip I/O interface over topology performance
- The I/O interface may introduce an upper bound on topology performance, affecting the performance differentiation between topologies
Our approach: chip I/O interface modeling, to capture the implications of I/O performance on topology performance differentiation.

Objective
Explore the effectiveness and feasibility of multi-dimensional topologies under software constraints: the communication semantics of the middleware.
1. Physical synthesis impact over performance
2. Impact of chip I/O interface over topology performance
3. Realistically capture traffic behavior
- Traffic is usually abstracted as an average link bandwidth utilization or as a synthetic traffic pattern
- This may lead to highly inaccurate performance predictions (traffic peaks, different kinds of messaging, synchronization mismatches)
Our approach: project network traffic based on the latest advances in MPSoC communication middleware, and generate NoC traffic patterns shaped by that middleware (e.g., synchronization, communication semantics).

- Backend synthesis flow
- Communication semantics
- Topologies under test
- Physical synthesis
- Layout-aware topology performance
- Conclusions

Flow diagram blocks: topology specification; topology generation; RTL SystemC/Verilog; simulation; VCD trace; physical synthesis (placement, floorplan, clock tree synthesis, power grid, routing, post-routing optimization); netlist and parasitic extraction; PrimeTime SDF (timing); PrimeTime power estimation; OCP traffic generator; transactional simulator.

Tile Architecture
- Processor core: connected through a Network Interface Initiator
- Local memory core: connected through a Network Interface Target
- The two network interfaces can be used in parallel
(Figure: a tile containing the processor core, the memory core, the network interface initiator and the network interface target.)

Communication protocol
Step 1: The producer checks its local semaphores for pending messages for the destination. If there are none, it writes the data to the local tile memory and unblocks a semaphore at the consumer tile. The producer is then free to carry out other tasks.
Step 2: The consumer detects the unblocked semaphore and requests the data from the producer.
Step 3: The consumer reads the data from the producer.
Step 4: The consumer sends a notification upon completion. This allows the producer to send another message to this consumer.
- A message is sent only when the consumer is ready to read it
- Only one outstanding message per producer-consumer pair
- Low network bandwidth utilization
- Tight latency constraints on the topology
(Figure: producer tile and consumer tile, each with local polling; arrows 1 to 4 labeled write message, read operation and reset semaphore.)
Dalla Torre, A. et al., MP-Queue: an Efficient Communication Library for Embedded Streaming Multimedia Platform, IEEE Workshop on Embedded Systems for Real-Time Multimedia, 2007.
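The four steps can be sketched as a small software model. This is only an illustration under assumptions: the names Tile, try_send and poll_and_receive are hypothetical and are not taken from the MP-Queue library, and remote reads and writes over the NoC are modeled as plain dictionary accesses.

```python
from dataclasses import dataclass, field


@dataclass
class Tile:
    """Hypothetical tile model: local memory plus per-peer semaphores."""
    name: str
    memory: dict = field(default_factory=dict)   # local tile memory, keyed by consumer name
    pending: dict = field(default_factory=dict)  # producer side: message outstanding for a consumer?
    ready: dict = field(default_factory=dict)    # consumer side: semaphore unblocked by a producer?


def try_send(producer: Tile, consumer: Tile, data) -> bool:
    """Step 1: write locally and unblock the consumer, unless a message is pending."""
    if producer.pending.get(consumer.name, False):
        return False  # only one outstanding message per producer-consumer pair
    producer.memory[consumer.name] = data
    producer.pending[consumer.name] = True
    consumer.ready[producer.name] = True  # models the remote semaphore write
    return True


def poll_and_receive(consumer: Tile, producer: Tile):
    """Steps 2 to 4: poll the local semaphore, read from the producer, notify completion."""
    if not consumer.ready.get(producer.name, False):
        return None  # nothing to read yet
    data = producer.memory.pop(consumer.name)  # steps 2 and 3: request and remote read
    consumer.ready[producer.name] = False      # reset the local semaphore
    producer.pending[consumer.name] = False    # step 4: completion notification
    return data
```

In this model a second try_send to the same consumer returns False until poll_and_receive has completed, which is the "one outstanding message" property listed on the slide.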

Topologies Under Test – 16 tiles
4-ary 2-mesh (2D Mesh): baseline topology
Switches             16
Bisection bandwidth  4
Tiles per switch     1
Switch arity         6
Max. hops            6

Topologies Under Test – 16 tiles
4-ary 2-mesh: baseline topology. 2-ary 4-mesh: high bandwidth.
                     4-ary 2-mesh   2-ary 4-mesh
                     (2D Mesh)      (Hypercube)
Switches             16             16
Bisection bandwidth  4              8
Tiles per switch     1              1
Switch arity         6              6
Max. hops            6              4

Topologies Under Test – 16 tiles
4-ary 2-mesh: baseline topology. 2-ary 4-mesh: high bandwidth. 2-ary 2-mesh: low latency.
                     4-ary 2-mesh   2-ary 4-mesh   2-ary 2-mesh
                     (2D Mesh)      (Hypercube)    (Concentrated)
Switches             16             16             4
Bisection bandwidth  4              8              2
Tiles per switch     1              1              4
Switch arity         6              6              10
Max. hops            6              4              2

Topologies Under Test – 64 tiles
8-ary 2-mesh (2D Mesh): baseline topology
Switches             64
Bisection bandwidth  8
Tiles per switch     1
Switch arity         6
Max. hops            14

Topologies Under Test – 64 tiles
8-ary 2-mesh: baseline topology. 2-ary 6-mesh: high bandwidth.
                     8-ary 2-mesh   2-ary 6-mesh
                     (2D Mesh)      (Hypercube)
Switches             64             64
Bisection bandwidth  8              32
Tiles per switch     1              1
Switch arity         6              8
Max. hops            14             6

Topologies Under Test – 64 tiles
8-ary 2-mesh: baseline topology. 2-ary 6-mesh: high bandwidth. 2-ary 4-mesh: low latency.
                     8-ary 2-mesh   2-ary 6-mesh   2-ary 4-mesh
                     (2D Mesh)      (Hypercube)    (Concentrated)
Switches             64             64             16
Bisection bandwidth  8              32             8
Tiles per switch     1              1              4
Switch arity         6              8              12
Max. hops            14             6              4
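The table entries for the 16 and 64 tile systems follow from k, n and the number of tiles per switch. The sketch below reproduces them under stated assumptions: the function name mesh_metrics is hypothetical, each tile is assumed to occupy two switch ports (one per network interface, as in the tile architecture), k is assumed even for the bisection count, and the arity is that of the best-connected switch. These formulas are an inference that matches the slide values, not something the slides state.

```python
def mesh_metrics(k: int, n: int, tiles_per_switch: int = 1, ports_per_tile: int = 2) -> dict:
    """Basic metrics of a k-ary n-mesh with optional concentration (assumed formulas)."""
    switches = k ** n
    # A switch has up to 2 neighbours per dimension, or exactly 1 when k == 2.
    neighbours = n * (2 if k > 2 else 1)
    return {
        "tiles": switches * tiles_per_switch,
        "switches": switches,
        "bisection": k ** (n - 1),  # links cut when halving one dimension (even k)
        "arity": neighbours + ports_per_tile * tiles_per_switch,
        "max_hops": n * (k - 1),    # corner-to-corner distance
    }


if __name__ == "__main__":
    for label, args in [
        ("4-ary 2-mesh", (4, 2)),
        ("2-ary 4-mesh", (2, 4)),
        ("2-ary 2-mesh, concentrated", (2, 2, 4)),
        ("8-ary 2-mesh", (8, 2)),
        ("2-ary 6-mesh", (2, 6)),
        ("2-ary 4-mesh, concentrated", (2, 4, 4)),
    ]:
        print(label, mesh_metrics(*args))
```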

Physical Synthesis
- Link latency and maximum frequency
  – Performance, area and power
  – Quantified by post-layout analysis
- For 16 tile systems
  – Real physical parameter values were obtained
- For 64 tile systems
  – Physical parameter values were extrapolated from the 16 tile results
  – Synthesis time constraints

Physical Synthesis – 16 Tiles
- Network building blocks are synthesized for maximum performance
- Timing paths in the network logic
  – Switch-to-switch links are ignored
  – Critical paths are in the switches, never in the network interfaces
  – Network speed closely reflects the maximum switch radix
                           4-ary 2-mesh   2-ary 4-mesh   2-ary 2-mesh
                           (2D Mesh)      (Hypercube)    (Concentrated)
Switch arity               6              6              10
Post-synthesis freq.       1 GHz          1 GHz          850 MHz
Post-layout freq.          786 MHz        640 MHz        600 MHz
Core speed (max. 500 MHz)  393 MHz        320 MHz        300 MHz
Cell area                  949k μm²       1108k μm²      733k μm²

Physical Synthesis – 16 Tiles
- Inter-switch wiring reduces performance
- The connectivity pattern of the 2-ary 4-mesh results in a larger frequency drop than for the 2D mesh
- The 2-ary 2-mesh pays for its lower number of switching resources with a larger switch-to-switch separation
  – Severe degradation of network performance

Physical Synthesis – 16 Tiles
- Frequency-ratioed clock domain crossing in the network interface
  – Network speed affects core speed
  – A maximum core speed of 500 MHz is assumed
- Post-layout speed drop
  – Cores still cannot sustain the network speed
  – A divider of 2 is applied
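The core speeds in the table are the post-layout network frequencies divided by 2. A minimal sketch of that relation, assuming an integer divider in the network interface and the 500 MHz core limit from the slide; the function name core_speed_mhz is hypothetical.

```python
def core_speed_mhz(network_mhz: float, divider: int = 2, max_core_mhz: float = 500.0) -> float:
    """Core clock obtained from the network clock through an integer divider."""
    core = network_mhz / divider
    if core > max_core_mhz:
        # The cores cannot run this fast, so a larger divider would be needed.
        raise ValueError(f"{core} MHz exceeds the {max_core_mhz} MHz core limit")
    return core


if __name__ == "__main__":
    # Post-layout network frequencies of the three 16 tile topologies.
    for name, f_net in [("4-ary 2-mesh", 786), ("2-ary 4-mesh", 640), ("2-ary 2-mesh", 600)]:
        print(name, core_speed_mhz(f_net), "MHz")
```

With a divider of 1, every 16 tile network frequency exceeds the assumed 500 MHz limit, which is why the sketch raises an error in that case.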

Physical Synthesis – 16 Tiles
- The 2-ary 4-mesh has a larger area footprint than the 2D mesh
- The 2-ary 2-mesh reduces the number of switches
  – Larger radix
  – Area is not halved

Physical Synthesis – 64 tiles
- 64 tile hypercubes present very long links
  – Switch-to-switch link delay impacts overall network speed
  – Overall network speed is unacceptably low for 64 tile systems
- Link pipelining becomes mandatory
  – It makes it possible to sustain network speed even in the presence of long links
- The number of pipeline stages depends on the link length in the layout
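How the number of pipeline stages follows from link delay can be sketched as below. This is a generic retiming rule and not the authors' procedure: the function names are hypothetical, and the link delays in the example are made-up values, since the slides give no wire delays.

```python
import math


def link_cycles(link_delay_ps: float, network_mhz: float) -> int:
    """Clock cycles needed to traverse a link at the given network frequency."""
    period_ps = 1e6 / network_mhz  # clock period in picoseconds
    return max(1, math.ceil(link_delay_ps / period_ps))


def pipeline_stages(link_delay_ps: float, network_mhz: float) -> int:
    """Retiming registers to insert so that each link segment fits in one cycle."""
    return link_cycles(link_delay_ps, network_mhz) - 1


if __name__ == "__main__":
    # Hypothetical link delays, in picoseconds, at a 640 MHz network clock.
    for delay_ps in (1000, 2500, 4000):
        print(delay_ps, "ps ->", pipeline_stages(delay_ps, 640), "stage(s)")
```

A short link needs no extra stage, while a link whose delay spans several clock periods gains one cycle of latency per inserted stage.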

Physical Synthesis – 64 tiles (figure: concentrated 2-ary 4-mesh)

                           8-ary 2-mesh   2-ary 6-mesh   2-ary 4-mesh
                           (2D Mesh)      (Hypercube)    (Concentrated)
Switch arity               6              8              12
Post-synthesis freq.       1 GHz          900 MHz        790 MHz
Post-layout freq.          786 MHz        640 MHz        500 MHz
Core speed (max. 500 MHz)  393 MHz        320 MHz        250 MHz
Cell area                  4461k μm²      7356k μm²      2610k μm²
Latency on top dimensions
  Dimension 3              -              1              1
  Dimension 4              -              1              2
  Dimension 5              -              2              -
  Dimension 6              -              3              -

Physical Synthesis – 64 tiles
                           8-ary 2-mesh   2-ary 6-mesh   2-ary 4-mesh     Reduced
                           (2D Mesh)      (Hypercube)    (Concentrated)   2-ary 4-mesh
Switch arity               6              8              12
Post-synthesis freq.       1 GHz          900 MHz        790 MHz
Post-layout freq.          786 MHz        640 MHz        500 MHz
Core speed (max. 500 MHz)  393 MHz        320 MHz        250 MHz          500 MHz
Cell area                  4461k μm²      7356k μm²      2610k μm²
Latency on top dimensions
  Dimension 3              -              1              1                1
  Dimension 4              -              1              2                2
  Dimension 5              -              2              -                -
  Dimension 6              -              3              -                -

Physical Synthesis – 64 tiles
                           8-ary 2-mesh   2-ary 6-mesh   2-ary 6-mesh
                           (2D Mesh)      (Hypercube)    High-Speed
Switch arity               6              8              8
Post-synthesis freq.       1 GHz          900 MHz        900 MHz
Post-layout freq.          786 MHz        640 MHz        786 MHz
Core speed (max. 500 MHz)  393 MHz        320 MHz        393 MHz
Cell area                  4461k μm²      7356k μm²      22784k μm²
Latency on top dimensions
  Dimension 3              -              1              2
  Dimension 4              -              1              2
  Dimension 5              -              2              2
  Dimension 6              -              3              3
- Aggressive link pipelining
- 200% area overhead for a 20% improvement in performance
- Not usable
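The rounded "200%" and "20%" figures can be checked against the table. The sketch below compares the high-speed 2-ary 6-mesh with the plain hypercube; it assumes that "performance" refers to the post-layout frequency, which the slide does not say explicitly. The helper name percent_change is hypothetical.

```python
def percent_change(new: float, base: float) -> float:
    """Relative change of new with respect to base, in percent."""
    return 100.0 * (new - base) / base


# Cell area in k um^2 and post-layout frequency in MHz, taken from the table.
area_overhead = percent_change(22784, 7356)
freq_gain = percent_change(786, 640)

if __name__ == "__main__":
    print(f"area overhead: {area_overhead:.1f}%")
    print(f"frequency gain: {freq_gain:.1f}%")
```

Under that assumption the table values give roughly 210% more cell area for about 23% more frequency, in line with the rounded figures on the slide.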

Workload distribution
- Producer, worker and consumer tasks
- I/O devices are dedicated to input OR output data
(Figure label: external I/O.)

Topology performance
- 1 input and 1 output port to the external memory are assumed for 16 tile systems
- 4 input and 4 output ports to the external memory are assumed for 64 tile systems
- I/O ports are accessed through sidewall tiles
  – The mapping of producer and consumer tasks is therefore constrained to these tiles

Topology performance
Several I/O mapping strategies were considered. For the sake of space, only the most significant one is shown here:
– OneSided: all the I/O tiles are placed on the same side of the chip.

Topology performance - 16 tiles
- The 2-ary 4-mesh reduces the total number of cycles by 27.4%
- The 2-ary 2-mesh reduces cycles by only 1.6% over the hypercube
  – Chip I/O becomes the bottleneck
- The real operating frequency of each topology changes the conclusions
  – Physical degradation is too severe to be compensated
- The 2-ary 2-mesh shows superior energy saving properties
  – 50% over the 2D mesh

Topology performance - 64 tiles
- The 2D mesh outperforms the non-reduced hypercubes
- The systems under test are I/O constrained
  – Computation tiles spend around 50% of their time waiting to send data to the consumer tile
  – This sets an upper bound on topology-related performance optimization
- Improvement in terms of execution cycles
  – The improvements in cycles are not large enough to offset the lower operating speed
- Removing the I/O bottleneck is mandatory to achieve performance differentiation between topologies

Topology performance - 64 tiles
- Network and tiles work at the same frequency
  – Maximum frequency for all tiles: I/O tiles and processing tiles
- Very similar performance
  – Reduced number of cycles
  – Low network frequency
- Reduced hardware resources
  – 4 times fewer switches, half the number of ports, and half the operating frequency

Bottom-up approach to assess k-ary n-mesh topologies
A number of real-life issues are considered:
- Physical constraints of nanoscale technologies
- Impact of the I/O interface
- Communication semantics of the middleware
The intricate wiring of multi-dimension topologies, or the long links required by concentrated k-ary n-meshes, can be turned into 2 different kinds of performance overhead by means of proper design techniques:

Conclusions
- Operating frequency reduction: in spite of a lower number of execution cycles, multi-dimension topologies lose in terms of real elapsed time (RET) due to their lower working frequency.
  – Concentrated topologies provide a way to trade performance for power/area
- Increase of link latency: the use of retiming stages makes it possible to sustain the operating frequency while increasing the network latency.
  – Area and power overhead have to be taken into account
  – Link pipelining cannot deliver a frequency higher than the one allowed by the switch radix itself
- For 64 tile systems we found that, in general, the 2D mesh outperforms the hypercubes. In spite of fewer execution cycles, the real elapsed time is worse because of the lower operating frequency.
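The trade-off between fewer cycles and a lower clock can be made concrete with elapsed time = cycles / frequency. The sketch below is a simplified model under assumptions: it takes cycle counts to be measured at each topology's own post-layout frequency, which the slides do not confirm, and it ignores the I/O bottleneck and the core clock divider. The function names are hypothetical.

```python
def elapsed_time_ratio(cycle_reduction: float, f_base_mhz: float, f_new_mhz: float) -> float:
    """Elapsed time of the new topology relative to the baseline (below 1.0 means faster)."""
    return (1.0 - cycle_reduction) * f_base_mhz / f_new_mhz


def break_even_cycle_reduction(f_base_mhz: float, f_new_mhz: float) -> float:
    """Fraction of cycles the new topology must save just to match the baseline."""
    return 1.0 - f_new_mhz / f_base_mhz


if __name__ == "__main__":
    # Post-layout frequencies from the tables: 2D mesh 786 MHz, hypercube 640 MHz.
    needed = break_even_cycle_reduction(786, 640)
    print(f"cycle reduction needed to break even: {100 * needed:.1f}%")
```

In this model a topology running at 640 MHz must save about 18.6% of the cycles of a 786 MHz baseline before it gains anything in real elapsed time.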

Conclusions
- Unexpected results for the reduced 2-ary 4-mesh:
  – Expected: low cost, low performance solution
  – Results: low cost with performance similar to the 2D mesh
- The increment in core speed reduces the impact:
  – I/O tile congestion
  – Processing tiles
- Possible solution to the hypercube physical degradation issues: decouple network speed from core speed (GALS)
- Other solutions: high performance, high radix switches
