Networks-on-Chips (NoCs) Basics


Networks-on-Chips (NoCs) Basics
ECE 284 On-Chip Interconnection Networks, Spring 2013

Examples of Tiled Multiprocessors

2D-mesh networks are often used as the on-chip fabric. Meshes are a very popular on-chip network topology and have been used in commercial CMPs such as the Tilera Tile64 processor, as well as in Intel's 80-tile Polaris research chip.

[Die photos: Tilera Tile64 and Intel 80-core Polaris; annotated dimensions include a 21.72mm × 12.64mm die, a 1.5mm × 2.0mm single tile, and the I/O areas.]

Typical architecture

[Figure: a tile containing a compute unit (CPU, L1 cache, slice of L2 cache) attached to a router.]

In recent times, computer architects have discovered that the design of heavily pipelined, monolithic superscalar uniprocessor cores, which aim to speed up application execution through ILP and high operating frequencies, has reached a fundamental limit: as technology keeps scaling, the performance gains from one generation to the next are threatened by diminishing returns. Instead, economies of scale and new design paradigms point to "divide and conquer" strategies, where applications are broken down into smaller concurrent operations that are distributed across many smaller processing units residing on a single chip. These many small processing units communicate with each other over a shared communication medium. As the number of on-chip units increases, core interconnectivity is moving away from fully connected crossbars (whose complexity is on the order of n^2, where n is the number of connected cores) and bus architectures, which connect only a handful of cores, toward interconnection networks.

An on-chip interconnection network is the communication medium of choice because its wire segments are shorter than global wires, enabling wire delay to scale with the size of the architecture: links between computation units can be traversed in a single cycle at high bandwidth, offering delay-predictable communication. Interconnection networks also enable resource reuse, since traffic from different components and applications is routed over shared routers and wiring; this sharing can be exploited for traffic balancing and even fault tolerance, allowing throughput and performance to keep scaling. An application can also be mapped so that its load is evenly distributed across the network topology.

Each tile typically comprises the CPU, a local L1 cache, a "slice" of a distributed L2 cache, and a router.

Router function

The job of the router is to forward packets from a source tile to a destination tile (e.g., when a cache line is read from a "remote" L2 slice).

Two example switching modes:
- Store-and-forward: bits of a packet are forwarded only after the entire packet has first been stored.
- Cut-through: bits of a packet are forwarded once the header portion is received.

Parallel programming and parallel machines have been around for a while in the context of clusters, supercomputers, and grids. However, the granularity of parallelism in these machines is quite coarse because off-chip communication costs are high. With multi-core processors, the communication cost is drastically reduced by the proximity of the cores within a chip and the abundance of on-chip wiring, so we can expect to exploit parallelism at a much finer granularity. This in turn will greatly increase the traffic volume between cores and will require networks that can deliver high throughput.

Store-and-forward switching

Requirement: buffers must be sized to hold an entire packet.

Packets are completely stored at each switch before any portion is forwarded toward the destination end node.

[Figure: a packet fully stored in the data-packet buffers of a switch between the source end node and the destination end node, then forwarded whole.]

[adapted from instructional slides of Pinkston & Duato, Computer Architecture: A Quantitative Approach]

Cut-through switching

Virtual cut-through: buffers must be sized to hold an entire packet (MTU). When the outgoing link is busy, the packet is completely stored at the blocked switch.

Wormhole: buffers hold flits, so packets can be larger than the buffers. When the outgoing link is busy, the packet remains stored along the path, spread across several switches.

[adapted from instructional slides of Pinkston & Duato, Computer Architecture: A Quantitative Approach]
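The latency difference between store-and-forward and cut-through can be made concrete with a little arithmetic. The sketch below is a contention-free, one-flit-per-cycle model; the function and parameter names are illustrative, not from the slides.

```python
# Illustrative latency model (no contention, one flit per cycle per link).

def store_and_forward_latency(hops: int, packet_flits: int) -> int:
    # Every router must receive the whole packet before forwarding it,
    # so the full serialization cost is paid at every hop.
    return hops * packet_flits

def cut_through_latency(hops: int, packet_flits: int, header_flits: int = 1) -> int:
    # Each router forwards as soon as the header arrives, so the body
    # is pipelined across the path and serialization is paid only once.
    return hops * header_flits + (packet_flits - header_flits)

print(store_and_forward_latency(hops=5, packet_flits=8))  # 40 cycles
print(cut_through_latency(hops=5, packet_flits=8))        # 5 + 7 = 12 cycles
```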

Packets to flits

Transaction Type | Message Type | Packet Size
Read             | Request      | 1 flit
                 | Reply        | 1+n flits
Write            | Request      | 1+n flits
                 | Reply        | 1 flit

[adapted from Becker STM’09 talk]
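As a sketch of how message sizes map to flit counts, the following assumes a hypothetical 64-bit flit and a one-flit header; both parameters are assumptions for illustration, not values from the talk.

```python
import math

FLIT_BITS = 64  # hypothetical flit width, for illustration only

def packet_flits(payload_bits: int) -> int:
    # 1 head flit carrying routing/header info, plus "n" payload flits.
    return 1 + math.ceil(payload_bits / FLIT_BITS)

print(packet_flits(0))    # read request: header only -> 1 flit
print(packet_flits(512))  # reply with a 64-byte cache line -> 1 + 8 = 9 flits
```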

Wormhole routing

- The head flit establishes the connection from an input port to an output port; it contains the destination address.
- Body flits go through the established connection (they do not need destination address information).
- The tail flit releases the connection; all other flits are blocked until the connection is released.
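The following is a minimal behavioral sketch of that head/body/tail protocol at one output port; FlitType, OutputPort, and the offer method are illustrative names of our own, not from the slides.

```python
from enum import Enum, auto

class FlitType(Enum):
    HEAD = auto()
    BODY = auto()
    TAIL = auto()

class OutputPort:
    """Tracks which input port currently owns this output (wormhole)."""

    def __init__(self):
        self.owner = None  # input port holding the connection, if any

    def offer(self, flit_type: FlitType, input_port: int) -> bool:
        """Returns True if the flit may traverse to this output now."""
        if self.owner is None:
            # Only a head flit can establish the connection (it carries
            # the destination address used to pick this output).
            if flit_type is FlitType.HEAD:
                self.owner = input_port
                return True
            return False
        if self.owner != input_port:
            return False  # other packets block until the tail releases
        if flit_type is FlitType.TAIL:
            self.owner = None  # the tail flit tears the connection down
        return True
```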

Deadlock

Virtual channels

- Share channel capacity between multiple data streams
- Interleave flits from different packets
- Provide dedicated buffer space for each virtual channel, decoupling channels from buffers

"The Swiss Army Knife for Interconnection Networks":
- Prevent deadlocks
- Reduce head-of-line blocking
- Also useful for providing QoS

[adapted from Becker STM’09 talk]

Using VCs for deadlock prevention

Protocol deadlock: circular dependencies between messages at the network edge.
- Solution: partition the range of VCs into different message classes.

Routing deadlock: circular dependencies between resources within the network.
- Solution: partition the range of VCs into different resource classes, and restrict transitions between resource classes to impose a partial order on resource acquisition.

{packet classes} = {message classes} × {resource classes}

[adapted from Becker STM’09 talk]
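One way to realize this product of classes is to carve up the VC index space, as in the sketch below; the class counts and the example class meanings are made-up assumptions, not values from the talk.

```python
# Hypothetical partitioning of the VC index space:
# {packet classes} = {message classes} x {resource classes}.

NUM_MSG_CLASSES = 2    # e.g., 0 = request, 1 = reply (illustrative)
NUM_RSRC_CLASSES = 2   # e.g., 0 = before dateline, 1 = after (illustrative)
VCS_PER_CLASS = 2      # extra VCs per class reduce HoL blocking (next slide)

def vc_range(msg_class: int, rsrc_class: int) -> range:
    # Group VCs first by message class, then by resource class.
    base = (msg_class * NUM_RSRC_CLASSES + rsrc_class) * VCS_PER_CLASS
    return range(base, base + VCS_PER_CLASS)

print(list(vc_range(msg_class=1, rsrc_class=0)))  # e.g., VCs [4, 5]
```

A packet is then only ever assigned VCs from its own class's range, so the circular dependencies that cause protocol or routing deadlock cannot form across classes.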

Using VCs for flow control

Coupling between channels and buffers causes head-of-line blocking:
- Adds false dependencies between packets
- Limits channel utilization
- Increases latency
- Even with VCs used for deadlock prevention, this still applies to packets in the same class

Solution: assign multiple VCs to each packet class.

[adapted from Becker STM’09 talk]

VC router pipeline

Per-packet stages:
- Route Computation (RC): determine candidate output port(s) and VC(s); can be precomputed at the upstream router (lookahead routing).
- Virtual Channel Allocation (VA): assign available output VCs to waiting packets at input VCs.

Per-flit stages:
- Switch Allocation (SA): assign switch time slots to buffered flits.
- Switch Traversal (ST): send flits through the crossbar switch to the appropriate output.

[adapted from Becker STM’09 talk]
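The control flow per flit can be summarized as below. This is a behavioral sketch only: the router object, flit fields, and method names are illustrative placeholders, not an implementation from the talk.

```python
def advance_flit(router, flit):
    # Per-packet stages: only the head flit computes the route and
    # acquires an output VC; body and tail flits inherit both.
    if flit.is_head:
        flit.out_port = router.route_computation(flit.dest)        # RC
        flit.out_vc = router.vc_allocator.allocate(flit.out_port)  # VA
    # Per-flit stages: every flit competes for the crossbar each cycle.
    granted = router.switch_allocator.request(flit)                # SA
    if granted:
        router.crossbar.traverse(flit, flit.out_port)              # ST
    return granted  # a flit that loses SA simply retries next cycle
```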

Allocation basics

Arbitration: multiple requestors, a single resource; request + grant vectors.

Allocation: multiple equivalent resources; request + grant matrices.

Matching:
- Each grant must satisfy a request
- Each requester gets at most one grant
- Each resource is granted at most once

[adapted from Becker STM’09 talk]
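For the single-resource case, a round-robin arbiter is the classic building block. The following is a behavioral sketch (returning a winner index rather than a one-hot grant vector), not a gate-level design.

```python
class RoundRobinArbiter:
    """Grants one of n requesters per cycle, rotating priority for fairness."""

    def __init__(self, n: int):
        self.n = n
        self.next_prio = 0  # requester examined first in the next cycle

    def arbitrate(self, requests):
        """requests: list of n booleans; returns the winner's index or None."""
        for offset in range(self.n):
            idx = (self.next_prio + offset) % self.n
            if requests[idx]:
                self.next_prio = (idx + 1) % self.n  # rotate past the winner
                return idx
        return None  # no requests this cycle

arb = RoundRobinArbiter(4)
print(arb.arbitrate([False, True, True, False]))  # grants 1
print(arb.arbitrate([False, True, True, False]))  # grants 2 (fairness)
```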

Separable allocators

Matchings have at most one grant per row and per column. Implement via two phases of arbitration, column-wise and row-wise, performed in either order (input-first or output-first). The arbiters in each stage are fully independent, which makes separable allocators fast and cheap, but bad choices in the first phase can prevent the second stage from generating a good matching! A behavioral sketch of the input-first variant follows.

[adapted from Becker STM’09 talk]
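This sketch reuses the RoundRobinArbiter from the previous example; the structure is illustrative rather than a reference implementation.

```python
def separable_input_first(requests, input_arbs, output_arbs):
    """requests[i][j] = True if input i requests output j.
    Returns a dict mapping granted inputs to outputs."""
    n_in, n_out = len(requests), len(requests[0])
    # Phase 1 (row-wise): each input's arbiter keeps at most one request.
    chosen = {}
    for i in range(n_in):
        j = input_arbs[i].arbitrate(requests[i])
        if j is not None:
            chosen[i] = j
    # Phase 2 (column-wise): each output grants at most one survivor.
    grants = {}
    for j in range(n_out):
        column = [chosen.get(i) == j for i in range(n_in)]
        winner = output_arbs[j].arbitrate(column)
        if winner is not None:
            grants[winner] = j
    return grants

n = 3
reqs = [[True, True, False],
        [True, False, False],
        [False, False, True]]
in_arbs = [RoundRobinArbiter(n) for _ in range(n)]
out_arbs = [RoundRobinArbiter(n) for _ in range(n)]
print(separable_input_first(reqs, in_arbs, out_arbs))  # {0: 0, 2: 2}
```

Note how input 1 goes unserved even though output 1 is idle: had input 0's arbiter picked output 1 instead, all three inputs could have been matched. This is exactly the bad-first-phase-choice problem mentioned above.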

Wavefront allocators

Avoid the separate phases, and with them the bad first-phase decisions; they generate better matchings, but their delay scales linearly and they are also difficult to pipeline.

Principle of operation:
- Pick an initial diagonal.
- Grant all requests on that diagonal; they can never conflict, since the cells of a diagonal lie in distinct rows and columns.
- For each grant, delete the remaining requests in the same row and column.
- Repeat for the next diagonal.

[adapted from Becker STM’09 talk]
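The diagonal sweep can be modeled behaviorally as below (square request matrix assumed). A real wavefront allocator is combinational hardware, so this loop is only a software illustration.

```python
def wavefront_allocate(requests, start_diag=0):
    """requests[i][j] = True if input i requests output j (n x n)."""
    n = len(requests)
    row_free = [True] * n
    col_free = [True] * n
    grants = {}
    for d in range(n):
        diag = (start_diag + d) % n
        for i in range(n):
            j = (i + diag) % n  # cells on one diagonal never share a row/column
            if requests[i][j] and row_free[i] and col_free[j]:
                grants[i] = j
                row_free[i] = False  # delete remaining requests in this row
                col_free[j] = False  # ...and in this column
    return grants

# Same request matrix as the separable example: the wavefront sweep
# finds the full matching that the input-first allocator missed.
reqs = [[True, True, False],
        [True, False, False],
        [False, False, True]]
print(wavefront_allocate(reqs, start_diag=1))  # {0: 1, 1: 0, 2: 2}
```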

Wavefront allocator timing

Originally conceived as a full-custom, tiled design whose true delay scales linearly. Signal wraparound creates combinational loops; these are effectively broken at the priority diagonal, but static timing analysis cannot infer that, so synthesized designs must be modified to avoid the loops!

[adapted from Becker STM’09 talk]

Diagonal Propagation Allocator

An unrolled request matrix avoids the combinational loops, with a sliding priority window activating the sub-matrix cells. But static timing analysis again sees false paths: the actual delay is ~n, while the reported delay is ~(2n-1), which hurts synthesized designs. Another design that avoids combinational loops is […]

[adapted from Becker STM’09 talk]

VC allocation

Before a packet can proceed through the router, it needs to acquire ownership of a VC at the downstream router. The VC allocator matches unassigned input VCs with output VCs that are not currently in use: P×V requestors (input VCs) and P×V resources (output VCs). The VC is acquired by the head flit and inherited by the body and tail flits.

[adapted from Becker STM’09 talk]
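Seen this way, VC allocation is a (P×V)-by-(P×V) matching problem, so either allocator sketched above applies. The request-matrix construction might look like the following, where the record fields and the busy-VC masking are illustrative assumptions of ours.

```python
def build_vc_requests(input_vcs, output_vc_busy, P, V):
    """input_vcs: P*V records, each a dict with 'needs_vc' (head flit
    waiting for an output VC) and 'out_port' (already known from RC).
    output_vc_busy[c]: True if output VC c is owned by some packet."""
    n = P * V
    requests = [[False] * n for _ in range(n)]
    for r, ivc in enumerate(input_vcs):
        if not ivc["needs_vc"]:
            continue  # only unassigned input VCs participate in VA
        for v in range(V):
            c = ivc["out_port"] * V + v
            if not output_vc_busy[c]:  # mask VCs already in use downstream
                requests[r][c] = True
    return requests
```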

VC allocator implementations Not shown: Masking logic for busy VCs [adapted from Becker STM’09 talk]

Typical pipelined router

RC → VA → SA → ST → LT
route computation → VC allocation → switch allocation → switch traversal → link traversal