
Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints
F. Gilabert, D. Ludovici§, S. Medardoni, D. Bertozzi, L. Benini, G. N. Gaydadjiev§
University of Ferrara, University of Bologna, Universidad Politecnica de Valencia, §Delft University of Technology

Multi-dimension topologies
The 2D mesh is frequently used for NoC design:
- it perfectly matches the 2D silicon surface
- high level of modularity
- controllability of electrical parameters
But its average latency and resource consumption scale poorly with network size.
Topologies with more than 2 dimensions are attractive:
- higher bandwidth and lower average latency
- on-chip wiring is more cost-effective than off-chip
But physical design issues might impact their effectiveness and even feasibility (decreased operating frequency, higher link latency).
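The latency-scaling claim can be checked directly. The brute-force sketch below (ours, not from the slides) computes the mean hop distance between switches of a k-ary n-mesh under uniform traffic with dimension-ordered routing: a 64-switch hypercube averages 3.0 hops, against 5.25 for a 64-switch 2D mesh.

```python
from itertools import product

def avg_hops(k, n):
    """Mean Manhattan distance between two uniformly chosen switches
    of a k-ary n-mesh (dimension-ordered routing, no wraparound)."""
    nodes = list(product(range(k), repeat=n))
    total = sum(sum(abs(a - b) for a, b in zip(u, v))
                for u in nodes for v in nodes)
    return total / len(nodes) ** 2

# Both networks have 64 switches, but the hop counts differ sharply:
print(avg_hops(8, 2))  # 8-ary 2-mesh (2D mesh): 5.25
print(avg_hops(2, 6))  # 2-ary 6-mesh (hypercube): 3.0
```

The same asymmetry shows up in the maximum-hop rows of the topology tables later in the deck (14 hops for the 8-ary 2-mesh versus 6 for the 2-ary 6-mesh).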

Objective
Explore the effectiveness and feasibility of multi-dimensional topologies under realistic technological constraints.
1. Physical synthesis impact on performance
- Over-the-cell routing? Latency of injection links? Latency of express links? Which switch operating frequency?
- Regularity broken by asymmetric tile size or heterogeneous tiles!
Our approach: physical parameters from the physical synthesis are applied to system-level simulations, giving a silicon-aware performance analysis.

Objective
Explore the effectiveness and feasibility of multi-dimensional topologies under realistic architectural constraints.
1. Physical synthesis impact on performance
2. Impact of the chip I/O interface on topology performance
- The I/O interface may introduce an upper bound to topology performance, affecting the performance differentiation between topologies
Our approach: model the chip I/O interface to capture the implications of I/O performance on topology performance differentiation.

Objective
Explore the effectiveness and feasibility of multi-dimensional topologies under software constraints: the communication semantics of the middleware.
1. Physical synthesis impact on performance
2. Impact of the chip I/O interface on topology performance
3. Realistically capture traffic behavior
- Traffic patterns are usually abstracted as an average link bandwidth utilization or as a synthetic traffic pattern
- This may lead to highly inaccurate performance predictions (traffic peaks, different kinds of messaging, synchronization mismatches)
Our approach: project network traffic based on the latest advances in MPSoC communication middleware, and generate traffic patterns for the NoC shaped by that middleware (e.g., synchronization, communication semantics).

Outline
- Backend synthesis flow
- Communication semantics
- Topologies under test
- Physical synthesis
- Layout-aware topology performance
- Conclusions

Backend synthesis flow
- Topology generation from the topology specification → RTL (SystemC/Verilog)
- Simulation → VCD trace
- Physical synthesis: floorplan, placement, clock tree synthesis, power grid, routing, post-routing optimization → netlist and parasitic extraction
- PrimeTime → SDF (timing); PrimeTime → power estimation
- OCP traffic generator → transactional simulator


Tile Architecture
- Processor core, connected through a Network Interface Initiator
- Local memory core, connected through a Network Interface Target
- The two network interfaces can be used in parallel

Communication protocol
Step 1: The producer checks local semaphores for pending messages for the destination. If there are none, it writes the data to the local tile memory and unblocks a semaphore at the consumer tile. The producer is then free to carry out other tasks.
Step 2: The consumer detects the unblocked semaphore and requests the data from the producer.
Step 3: The consumer reads the data from the producer's tile.
Step 4: The consumer sends a notification upon completion. This allows the producer to send another message to this consumer.
Consequences:
- A message is sent only when the consumer is ready to read it
- Only one outstanding message per producer-consumer pair
- Low network bandwidth utilization
- Tight latency constraints on the topology
Dalla Torre, A. et al., "MP-Queue: an Efficient Communication Library for Embedded Streaming Multimedia Platform", IEEE Workshop on Embedded Systems for Real-Time Multimedia, 2007.
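The four-step handshake can be sketched with ordinary threading primitives. This is an illustrative model only (the class and names are ours, not the MP-Queue API), with counting semaphores standing in for the tile-local hardware semaphores.

```python
import threading

class ProducerConsumerChannel:
    """One producer-consumer pair with a single outstanding message,
    mimicking the semaphore handshake described above."""
    def __init__(self):
        self.local_memory = None                   # producer-tile buffer
        self.msg_ready = threading.Semaphore(0)    # unblocked in step 1
        self.ack = threading.Semaphore(1)          # released in step 4

    def send(self, data):
        self.ack.acquire()         # wait until the previous message is consumed
        self.local_memory = data   # step 1: write to the local tile memory
        self.msg_ready.release()   # unblock the consumer's semaphore

    def receive(self):
        self.msg_ready.acquire()   # step 2: detect the unblocked semaphore
        data = self.local_memory   # step 3: read from the producer's tile
        self.ack.release()         # step 4: completion notification
        return data

channel = ProducerConsumerChannel()
received = []
consumer = threading.Thread(
    target=lambda: [received.append(channel.receive()) for _ in range(4)])
consumer.start()
for i in range(4):
    channel.send(i)
consumer.join()
print(received)  # [0, 1, 2, 3]
```

Because `ack` starts at 1 and is re-acquired before every send, at most one message per producer-consumer pair is ever in flight, exactly the property the slide points out (low bandwidth utilization, tight latency constraints).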


Topologies Under Test – 16 tiles

                        4-ary 2-mesh (2D Mesh)
Switches                16
Bis. Band.              4
Tiles x Switch          1
Switch Arity            6
Max. Hops               6

The 4-ary 2-mesh is the baseline topology.

Topologies Under Test – 16 tiles

                        4-ary 2-mesh    2-ary 4-mesh
                        (2D Mesh)       (Hypercube)
Switches                16              16
Bis. Band.              4               8
Tiles x Switch          1               1
Switch Arity            6               6
Max. Hops               6               4

4-ary 2-mesh: baseline topology. 2-ary 4-mesh: high bandwidth.

Topologies Under Test – 16 tiles

                        4-ary 2-mesh    2-ary 4-mesh    2-ary 2-mesh
                        (2D Mesh)       (Hypercube)     (Concentrated)
Switches                16              16              4
Bis. Band.              4               8               2
Tiles x Switch          1               1               4
Switch Arity            6               6               10
Max. Hops               6               4               2

4-ary 2-mesh: baseline topology. 2-ary 2-mesh: low latency.
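The table entries follow from closed-form properties of a k-ary n-mesh. The sketch below (ours) reproduces them; it counts two local ports per attached tile, since each tile carries both an initiator and a target network interface, and it assumes even k for the bisection count.

```python
def mesh_metrics(k, n, tiles):
    """Metrics of a k-ary n-mesh NoC with `tiles` tiles spread evenly
    over its switches; each tile contributes 2 local switch ports."""
    switches = k ** n
    c = tiles // switches                  # concentration: tiles per switch
    bisection = k ** (n - 1)               # channels cut by the midplane (even k)
    max_hops = n * (k - 1)                 # corner-to-corner switch distance
    arity = n * min(2, k - 1) + 2 * c      # neighbor ports + 2 NIs per tile
    return dict(switches=switches, bisection=bisection,
                tiles_per_switch=c, arity=arity, max_hops=max_hops)

# Reproduce the 16-tile table above:
print(mesh_metrics(4, 2, 16))  # 2D mesh column: 16 switches, arity 6, 6 hops
print(mesh_metrics(2, 4, 16))  # hypercube column: bisection 8, arity 6, 4 hops
print(mesh_metrics(2, 2, 16))  # concentrated column: 4 switches, arity 10, 2 hops
```

The same function reproduces the 64-tile tables further on (e.g., the 2-ary 6-mesh: 64 switches, bisection 32, arity 8, 6 hops).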

Topologies Under Test – 64 tiles

                        8-ary 2-mesh (2D Mesh)
Switches                64
Bis. Band.              8
Tiles x Switch          1
Switch Arity            6
Max. Hops               14

The 8-ary 2-mesh is the baseline topology.

Topologies Under Test – 64 tiles

                        8-ary 2-mesh    2-ary 6-mesh
                        (2D Mesh)       (Hypercube)
Switches                64              64
Bis. Band.              8               32
Tiles x Switch          1               1
Switch Arity            6               8
Max. Hops               14              6

8-ary 2-mesh: baseline topology. 2-ary 6-mesh: high bandwidth.

Topologies Under Test – 64 tiles

                        8-ary 2-mesh    2-ary 6-mesh    2-ary 4-mesh
                        (2D Mesh)       (Hypercube)     (Concentrated)
Switches                64              64              16
Bis. Band.              8               32              8
Tiles x Switch          1               1               4
Switch Arity            6               8               12
Max. Hops               14              6               4

8-ary 2-mesh: baseline topology. 2-ary 4-mesh: low latency.


Physical Synthesis
- Link latency, maximum frequency, performance, area and power quantified by post-layout analysis
- 16-tile systems: real physical parameter values obtained
- 64-tile systems: physical parameter values extrapolated from the 16-tile results, due to synthesis time constraints

Physical Synthesis – 16 Tiles
Network building blocks were synthesized for maximum performance.
Timing paths in the network logic (ignoring switch-to-switch links):
- Critical paths are in the switches, never in the network interfaces
- Network speed closely reflects the maximum switch radix

                        4-ary 2-mesh    2-ary 4-mesh    2-ary 2-mesh
                        (2D Mesh)       (Hypercube)     (Concentrated)
Switch Arity            6               6               10
Post-synthesis freq.    1 GHz           1 GHz           850 MHz
Post-layout freq.       786 MHz         640 MHz         600 MHz
Core speed (max. 500)   393 MHz         320 MHz         300 MHz
Cell Area               949k μm²        …               733k μm²

Physical Synthesis – 16 Tiles
Inter-switch wiring reduces performance:
- The connectivity pattern of the 2-ary 4-mesh results in a larger frequency drop than for the 2D mesh
- The 2-ary 2-mesh pays for its lower number of switching resources with a larger switch-to-switch separation, severely degrading network performance

Physical Synthesis – 16 Tiles
Frequency-ratioed clock domain crossing in the network interface:
- Network speed affects core speed; a maximum core speed of 500 MHz is assumed
- After the post-layout speed drop, cores still cannot sustain the network speed, so a clock divider of 2 is applied
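The "Core speed" rows in these tables are just the post-layout network clock divided by the smallest integer ratio that respects the 500 MHz core limit; a quick sketch (ours):

```python
import math

def core_clock_mhz(network_mhz, core_max_mhz=500):
    """Frequency-ratioed clock domain crossing: run the core at the
    network clock divided by the smallest integer divider that keeps
    it at or below the core's maximum speed."""
    divider = math.ceil(network_mhz / core_max_mhz)
    return divider, network_mhz / divider

for f in (786, 640, 600):
    print(f, core_clock_mhz(f))  # divider 2 everywhere: 393, 320, 300 MHz
```

With all three post-layout frequencies between 500 MHz and 1 GHz, the divider is 2 in every case, which is why the core speeds track the network speeds at exactly half their value.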

Physical Synthesis – 16 Tiles
- The 2-ary 4-mesh has a larger area footprint than the 2D mesh
- The 2-ary 2-mesh reduces the number of switches, but each switch has a larger radix, so the area is not halved

Physical Synthesis – 64 tiles
64-tile hypercubes present very long links:
- Switch-to-switch link delay impacts the overall network speed
- The resulting network speed is unacceptably low for 64-tile systems
Link pipelining becomes mandatory:
- It allows the network speed to be sustained even in the presence of long links
- The number of pipeline stages depends on the link length in the layout
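How the stage count follows from link length can be sketched with back-of-the-envelope arithmetic. The 0.15 ns/mm repeated-wire delay used here is an illustrative assumption, not a figure from the slides.

```python
import math

def pipeline_stages(link_length_mm, target_freq_mhz, wire_delay_ns_per_mm=0.15):
    """Retiming stages needed so every wire segment of a link fits
    within one clock period at the target network frequency."""
    cycle_ns = 1000.0 / target_freq_mhz
    total_delay_ns = link_length_mm * wire_delay_ns_per_mm
    segments = math.ceil(total_delay_ns / cycle_ns)  # cycles the link spans
    return max(0, segments - 1)                      # flip-flop stages inserted

print(pipeline_stages(2, 640))   # short neighbor link: no extra stages
print(pipeline_stages(20, 640))  # long top-dimension link: 1 stage
```

Under these assumptions, only the long express links of the top dimensions need retiming, which matches the "Latency on top dimensions" rows of the 64-tile tables below (1 to 3 extra cycles on dimensions 3 through 6, none on the 2D mesh).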

Physical Synthesis – 64 tiles: floorplan of the concentrated 2-ary 4-mesh.

Physical Synthesis – 64 tiles

                        8-ary 2-mesh    2-ary 6-mesh    2-ary 4-mesh
                        (2D Mesh)       (Hypercube)     (Concentrated)
Switch Arity            6               8               12
Post-synthesis freq.    1 GHz           900 MHz         790 MHz
Post-layout freq.       786 MHz         640 MHz         500 MHz
Core speed (max. 500)   393 MHz         320 MHz         250 MHz
Cell Area               4461k μm²       …               …
Latency on top dimensions:
  Dimension 3           –               1               1
  Dimension 4           –               1               2
  Dimension 5           –               2               –
  Dimension 6           –               3               –

Physical Synthesis – 64 tiles

                        8-ary 2-mesh    2-ary 6-mesh    2-ary 4-mesh     Reduced
                        (2D Mesh)       (Hypercube)     (Concentrated)   2-ary 4-mesh
Switch Arity            6               8               12
Post-synthesis freq.    1 GHz           900 MHz         790 MHz
Post-layout freq.       786 MHz         640 MHz         500 MHz
Core speed (max. 500)   393 MHz         320 MHz         250 MHz          500 MHz
Cell Area               4461k μm²       …               …
Latency on top dimensions:
  Dimension 3           –               1               1                1
  Dimension 4           –               1               2                1
  Dimension 5           –               2               –                –
  Dimension 6           –               3               –                –

Physical Synthesis – 64 tiles

                        8-ary 2-mesh    2-ary 6-mesh    2-ary 6-mesh
                        (2D Mesh)       (Hypercube)     High-Speed
Switch Arity            6               8               8
Post-synthesis freq.    1 GHz           900 MHz         900 MHz
Post-layout freq.       786 MHz         640 MHz         786 MHz
Core speed (max. 500)   393 MHz         320 MHz         393 MHz
Cell Area               4461k μm²       …               …
Latency on top dimensions:
  Dimension 3           –               1               2
  Dimension 4           –               1               2
  Dimension 5           –               2               2
  Dimension 6           –               3               3

The high-speed variant uses aggressive link pipelining: a 200% area overhead for a 20% improvement in performance. Not usable.


Workload distribution
- Producer, worker and consumer tasks
- I/O devices dedicated to either input or output data
- External I/O

Topology performance
- 1 input and 1 output port to the external memory are assumed for 16-tile systems
- 4 input and 4 output ports to the external memory are assumed for 64-tile systems
- I/O ports are accessed through sidewall tiles; the mapping of producer and consumer tasks is therefore constrained to these tiles

Topology performance
Several I/O mapping strategies were considered. For the sake of space, only the most significant one is shown here:
- OneSided: all the I/O tiles are placed on the same side of the chip.

Topology performance – 16 tiles
- The 2-ary 4-mesh reduces the total number of cycles by 27.4%
- The 2-ary 2-mesh reduces cycles by only 1.6% over the hypercube: chip I/O becomes the bottleneck
- The real operating frequency of each topology changes the conclusions: the physical degradation is too severe to be compensated
- The 2-ary 2-mesh shows superior energy-saving properties: 50% over the 2D mesh

Topology performance – 64 tiles
- The 2D mesh outperforms the non-reduced hypercubes
- The systems under test are I/O constrained: computation tiles spend around 50% of their time waiting to send data to the consumer tile, which places an upper bound on topology-related performance optimization
- The improvements in terms of execution cycles are not enough to offset the lower operating speed
- Removing the I/O bottleneck is mandatory to achieve performance differentiation between topologies
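The cycles-versus-frequency trade-off above is simple arithmetic: real elapsed time is cycles divided by clock frequency, so a cycle-count win evaporates if the clock is slower. The cycle counts below are illustrative assumptions, not measurements from the paper; only the two post-layout frequencies come from the tables above.

```python
def elapsed_us(cycles, freq_mhz):
    """Real execution time in microseconds: cycles / clock frequency."""
    return cycles / freq_mhz

# Hypothetical workload: the hypercube saves 15% of the cycles but runs
# at 640 MHz instead of the mesh's 786 MHz, and still finishes later.
mesh_time = elapsed_us(1_000_000, 786)       # assumed cycle count
hypercube_time = elapsed_us(850_000, 640)    # assumed cycle count
print(mesh_time < hypercube_time)  # True: the 2D mesh finishes first
```

At these frequencies, the hypercube would need to cut roughly 19% of the cycles (640/786) just to break even in real elapsed time.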

Topology performance – 64 tiles
Network and tiles work at the same frequency:
- Maximum frequency for all tiles, both I/O tiles and processing tiles
Very similar performance:
- Fewer cycles, but a low network frequency
Reduced hardware resources:
- 4 times fewer switches, half the number of ports, and operation at half the frequency


Conclusions
A bottom-up approach to assess k-ary n-mesh topologies, considering a number of real-life issues:
- Physical constraints of nanoscale technologies
- Impact of the I/O interface
- Communication semantics of the middleware
The intricate wiring of multi-dimension topologies, or the long links required by concentrated k-ary n-meshes, can be turned into two different kinds of performance overhead by means of proper design techniques:

Conclusions
- Operating frequency reduction: in spite of a lower number of execution cycles, multi-dimension topologies lose in terms of real execution time due to their lower working frequency. Concentrated topologies provide a way to trade performance for power/area.
- Increase of link latency: the use of retiming stages allows the operating frequency to be sustained while increasing the network latency. The area and power overhead have to be taken into account. Link pipelining cannot deliver a frequency higher than the one allowed by the switch radix itself.
For 64-tile systems we found that, in general, the 2D mesh outperforms the hypercubes: in spite of a better execution cycle count, the real elapsed time is worse because of the lower operating frequency.

Conclusions
Unexpected results for the reduced 2-ary 4-mesh:
- Expected: a low-cost, low-performance solution
- Result: low cost with performance similar to the 2D mesh
- Increasing the core speed reduces the impact of I/O tile congestion on the processing tiles
Possible solution to the hypercube physical degradation issues:
- Decouple the network speed from the core speed (GALS)
Other solutions:
- High-performance, high-radix switches
