Networks-on-Chips (NoCs) Basics

1 Networks-on-Chips (NoCs) Basics
ECE 284 On-Chip Interconnection Networks Spring 2013

2 Examples of Tiled Multiprocessors
2D-mesh networks are often used as the on-chip fabric. Meshes are a very popular on-chip network topology and have been used in commercial CMPs such as the Tilera Tile64 processor and in Intel's 80-tile Polaris research chip. [Figure: die photos of the Tilera Tile64 and the Intel 80-core chip, with the I/O areas and a single 1.5mm × 2.0mm tile highlighted on a 12.64mm × 21.72mm die.]

3 Typical architecture [Tile diagram: compute unit (CPU, L1 cache, slice of L2 cache) plus router.] In recent times computer architects have found that heavily pipelined, monolithic superscalar uniprocessor cores, which aim to speed up application execution through ILP and high operating frequencies, have reached a fundamental limit: performance gains from one generation to the next face diminishing returns as technology keeps scaling. Instead, economies of scale and new design paradigms point to "divide and conquer" strategies, where applications are broken down into smaller concurrent operations that are distributed across many smaller processing units residing on a single chip. These many small processing units communicate with each other over a shared communication medium. As the number of on-chip units increases, core interconnectivity is moving away from fully connected crossbars (whose complexity grows on the order of n^2, where n is the number of connected cores) and from bus architectures that connect only a handful of cores, toward interconnection networks. An on-chip interconnection network is the communication medium of choice because its wire segments are shorter than global wires, so wire delay scales with the size of the architecture: links between computation units can be traversed in a single cycle at high bandwidth, offering delay-predictable communication. Additional benefits include resource reuse, since traffic from different components and applications is routed over shared routers and wiring, which can be exploited for traffic balancing and even fault tolerance, allowing throughput and performance to keep scaling. An application can also be mapped so that its load is evenly distributed across the network topology. Each tile typically comprises the CPU, a local L1 cache, a "slice" of a distributed L2 cache, and a router.

4 Router function The job of the router is to forward packets from a source tile to a destination tile (e.g., when a "cache line" is read from a "remote" L2 slice). Two example switching modes: Store-and-forward: bits of a packet are forwarded only after the entire packet has first been stored. Cut-through: bits of a packet are forwarded once the header portion has been received. Parallel programming and parallel machines have been around for a while in the context of clusters, supercomputers, and grids. However, the granularity of parallelism in these machines is quite coarse, because off-chip communication costs are quite high. With multi-core processors, the communication cost is drastically reduced thanks to the proximity of the cores within a chip and the abundance of on-chip wiring, so we can expect to exploit parallelism at a much finer granularity. This in turn will greatly increase the traffic volume between cores and will require networks that can deliver high throughput.
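To make the difference between the two switching modes concrete, here is a minimal zero-load latency sketch in Python (function and parameter names are hypothetical; the model ignores contention and per-hop router pipeline overheads): store-and-forward pays the full packet serialization at every hop, while cut-through pays it only once.

```python
# Zero-load latency sketch (simplified model; ignores contention and
# router pipeline delays). All parameter names are illustrative.

def store_and_forward_latency(hops, packet_flits, cycles_per_flit=1):
    # Each hop must receive the entire packet before forwarding it.
    return hops * packet_flits * cycles_per_flit

def cut_through_latency(hops, packet_flits, cycles_per_flit=1, header_flits=1):
    # Each hop forwards as soon as the header has arrived, so only the
    # header serialization is paid per hop; the body is pipelined behind it.
    return (hops * header_flits + (packet_flits - header_flits)) * cycles_per_flit

if __name__ == "__main__":
    hops, packet_flits = 6, 16
    print("store-and-forward:", store_and_forward_latency(hops, packet_flits))  # 96 cycles
    print("cut-through:      ", cut_through_latency(hops, packet_flits))        # 21 cycles
```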

5 Store-and-forward switching
[Figure: a packet travels from the source end node to the destination end node; each switch along the path has buffers for whole data packets, and the packet is stored at each hop.] Packets are completely stored before any portion is forwarded. [adapted from instructional slides of Pinkston & Duato, Computer Architecture: A Quantitative Approach]

6 Store-and-forward switching
Requirement: buffers must be sized to hold an entire packet. [Figure: the packet is stored at one switch and then forwarded toward the destination end node.] Packets are completely stored before any portion is forwarded. [adapted from instructional slides of Pinkston & Duato, Computer Architecture: A Quantitative Approach]

7 Cut-through switching
Virtual cut-through: buffers for data packets; buffers must be sized to hold an entire packet. Wormhole: buffers for flits, so packets can be larger than the buffers. [Figures: both schemes forwarding a packet from the source end node to the destination end node.] [adapted from instructional slides of Pinkston & Duato, Computer Architecture: A Quantitative Approach]

8 Cut-through switching
Virtual cut-through: buffers must be sized to hold an entire packet (MTU); when a link is busy, the packet is completely stored at the switch. Wormhole: buffers for flits, so packets can be larger than the buffers; when a link is busy, the packet is stored along the path. [Figures: both schemes blocked at a busy link between the source end node and the destination end node.] [adapted from instructional slides of Pinkston & Duato, Computer Architecture: A Quantitative Approach]

9 [adapted from Becker STM’09 talk]
Packets to flits
Transaction Type   Message Type   Packet Size
Read               Request        1 flit
                   Reply          1+n flits
Write
[adapted from Becker STM’09 talk]
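To illustrate the read-request/reply rows above, here is a small Python sketch (the Flit dataclass and its field names are invented for illustration) that splits a message into a head flit carrying the destination plus body/tail flits carrying the payload, producing 1-flit and 1+n-flit packets.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative flit format; field names are invented for this sketch.
@dataclass
class Flit:
    kind: str                    # "head", "body", or "tail"
    dest: Optional[int] = None   # only the head flit carries the destination
    payload: Optional[int] = None

def packetize(dest: int, payload_words: List[int]) -> List[Flit]:
    """Build a packet: a head flit plus n body/tail flits carrying the payload.
    A request with no payload becomes a single-flit packet."""
    if not payload_words:
        return [Flit(kind="head", dest=dest)]                      # 1-flit packet
    flits = [Flit(kind="head", dest=dest)]
    flits += [Flit(kind="body", payload=w) for w in payload_words[:-1]]
    flits.append(Flit(kind="tail", payload=payload_words[-1]))
    return flits                                                   # 1 + n flits

print(len(packetize(dest=3, payload_words=[])))              # 1 (e.g., a read request)
print(len(packetize(dest=3, payload_words=list(range(4)))))  # 5 (e.g., a reply carrying 4 words)
```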

10 Wormhole routing Head flit establishes the connection from an input port to an output port; it contains the destination address. Body flits go through the established connection (they do not need destination address information). Tail flit releases the connection. All other flits are blocked until the connection is released.
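A minimal sketch of this behavior, reusing the Flit objects from the previous example and an invented OutputPort class: the head flit claims the output, body flits ride the established connection, and the tail flit releases it.

```python
# Minimal wormhole switching sketch for one router output port.
# Class and method names are illustrative, not from the slides.

class OutputPort:
    def __init__(self):
        self.owner = None   # input port currently holding the connection

    def offer(self, input_port, flit):
        """Try to forward a flit from input_port; return True if it advances."""
        if flit.kind == "head":
            if self.owner is None:
                self.owner = input_port      # head flit establishes the connection
                return True
            return False                     # blocked: connection held by another packet
        if self.owner != input_port:
            return False                     # body/tail must follow their own head
        if flit.kind == "tail":
            self.owner = None                # tail flit releases the connection
        return True                          # body/tail flow over the established connection
```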

11 Deadlock

12 [adapted from Becker STM’09 talk]
Virtual channels share channel capacity between multiple data streams: interleave flits from different packets, provide dedicated buffer space for each virtual channel, and decouple channels from buffers. "The Swiss Army Knife for Interconnection Networks": they prevent deadlocks, reduce head-of-line blocking, and are also useful for providing QoS. [adapted from Becker STM’09 talk]
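Below is a small illustrative Python sketch (the class name and the round-robin link policy are my own choices, not from the talk) of per-VC flit buffers that share one physical channel and interleave flits from different packets.

```python
from collections import deque

# Sketch: one physical channel with per-VC flit buffers; flits from different
# packets are interleaved flit-by-flit onto the shared link.

class VirtualChannels:
    def __init__(self, num_vcs, depth):
        self.buffers = [deque(maxlen=depth) for _ in range(num_vcs)]
        self.next_vc = 0

    def enqueue(self, vc, flit):
        if len(self.buffers[vc]) < self.buffers[vc].maxlen:
            self.buffers[vc].append(flit)    # dedicated buffer space per VC
            return True
        return False                         # per-VC backpressure

    def send_one(self):
        """Pick one flit per link cycle, round-robin over non-empty VCs."""
        for i in range(len(self.buffers)):
            vc = (self.next_vc + i) % len(self.buffers)
            if self.buffers[vc]:
                self.next_vc = (vc + 1) % len(self.buffers)
                return vc, self.buffers[vc].popleft()
        return None                          # link idles this cycle
```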

13 Using VCs for deadlock prevention
Protocol deadlock: circular dependencies between messages at the network edge. Solution: partition the range of VCs into different message classes. Routing deadlock: circular dependencies between resources within the network. Solution: partition the range of VCs into different resource classes and restrict transitions between resource classes to impose a partial order on resource acquisition. {packet classes} = {message classes} × {resource classes} [adapted from Becker STM’09 talk]
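One common way to realize this partitioning, sketched below with invented names and counts, is to give each (message class, resource class) pair its own contiguous range of VC indices.

```python
# Sketch of {packet classes} = {message classes} x {resource classes}:
# each (message class, resource class) pair gets its own contiguous range
# of VC indices. The class names and counts are illustrative.

MESSAGE_CLASSES = ["request", "reply"]   # separate classes break protocol-level cycles
RESOURCE_CLASSES = 2                     # e.g., two classes that break routing cycles
VCS_PER_CLASS = 1

def vc_range(message_class: str, resource_class: int):
    """Return the VC indices reserved for one packet class."""
    mc = MESSAGE_CLASSES.index(message_class)
    base = (mc * RESOURCE_CLASSES + resource_class) * VCS_PER_CLASS
    return range(base, base + VCS_PER_CLASS)

# 2 message classes x 2 resource classes x 1 VC each = 4 VCs total:
for m in MESSAGE_CLASSES:
    for r in range(RESOURCE_CLASSES):
        print(m, r, list(vc_range(m, r)))
```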

14 Using VCs for flow control
Coupling between channels and buffers causes head-of-line blocking: it adds false dependencies between packets, limits channel utilization, and increases latency. Even with VCs for deadlock prevention, this still applies to packets in the same class. Solution: assign multiple VCs to each packet class. [adapted from Becker STM’09 talk]
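Building on the previous sketch (with VCS_PER_CLASS raised above 1), a packet may then claim any idle VC within its class, as in this hypothetical helper: a blocked packet occupies only one of the class's VCs and no longer stalls the others.

```python
# Sketch: with several VCs per packet class, a packet may claim any free VC
# in its class's range. Uses vc_range() from the previous sketch.

def pick_free_vc(message_class, resource_class, busy_vcs):
    """Return the first idle VC allowed for this packet class, or None."""
    for vc in vc_range(message_class, resource_class):
        if vc not in busy_vcs:
            return vc
    return None   # every VC of this class is occupied; the packet must wait
```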

15 [adapted from Becker STM’09 talk]
VC router pipeline Route Computation (RC): determine candidate output port(s) and VC(s); can be precomputed at the upstream router (lookahead routing). Virtual Channel Allocation (VA): assign available output VCs to waiting packets at input VCs. Switch Allocation (SA): assign switch time slots to buffered flits. Switch Traversal (ST): send flits through the crossbar switch to the appropriate output. RC and VA are performed per packet; SA and ST are performed per flit. [adapted from Becker STM’09 talk]
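A toy Python sketch of the per-packet versus per-flit split (stage names from the slide, the function name is invented): only the head flit goes through RC and VA, while every flit goes through SA and ST.

```python
# Which pipeline stages each flit of a packet passes through:
# RC and VA happen once per packet (on the head flit); SA and ST are
# repeated for every flit. Purely illustrative.

def stages_for(flit_kind: str):
    per_packet = ["RC", "VA"] if flit_kind == "head" else []
    per_flit = ["SA", "ST"]
    return per_packet + per_flit

for kind in ["head", "body", "tail"]:
    print(kind, "->", stages_for(kind))
# head -> ['RC', 'VA', 'SA', 'ST']
# body -> ['SA', 'ST']
# tail -> ['SA', 'ST']
```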

16 [adapted from Becker STM’09 talk]
Allocation basics Arbitration: multiple requestors, a single resource, request + grant vectors. Allocation: multiple equivalent resources, request + grant matrices. Matching: each grant must satisfy a request, each requestor gets at most one grant, and each resource is granted at most once. [adapted from Becker STM’09 talk]
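As a concrete example of arbitration, here is a sketch of a round-robin arbiter (an assumed policy, not stated in the slides): a request vector comes in, a one-hot grant vector goes out, and priority rotates past the most recent winner.

```python
# Round-robin arbiter sketch: many requestors, one resource.
# Class and method names are illustrative.

class RoundRobinArbiter:
    def __init__(self, n):
        self.n = n
        self.priority = 0          # requestor with highest priority this round

    def arbitrate(self, requests):
        """requests: list of n bools. Returns a one-hot grant list."""
        grants = [False] * self.n
        for i in range(self.n):
            idx = (self.priority + i) % self.n
            if requests[idx]:
                grants[idx] = True
                self.priority = (idx + 1) % self.n   # rotate priority past the winner
                break
        return grants

arb = RoundRobinArbiter(4)
print(arb.arbitrate([True, False, True, False]))   # grants requestor 0
print(arb.arbitrate([True, False, True, False]))   # grants requestor 2 next time
```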

17 [adapted from Becker STM’09 talk]
Separable allocators Matchings have at most one grant per row and per column. Implemented via two phases of arbitration, column-wise and row-wise, performed in either order. The arbiters in each stage are fully independent: fast and cheap, but bad choices in the first phase can prevent the second stage from generating a good matching! [Figures: input-first and output-first separable allocator structures.] [adapted from Becker STM’09 talk]
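A behavioral sketch of an input-first separable allocator built from the round-robin arbiters above (function name and data layout are illustrative): each input first picks one of its requested outputs, then each output picks among the inputs that chose it.

```python
# Input-first separable allocator sketch, built from RoundRobinArbiter above.

def separable_input_first(requests, input_arbs, output_arbs):
    """requests[i][j] == True if input i requests output j.
    Returns a grant matrix with at most one grant per row and per column."""
    n_in, n_out = len(requests), len(requests[0])

    # Phase 1 (row-wise): each input arbitrates among its own requests.
    intermediate = [input_arbs[i].arbitrate(requests[i]) for i in range(n_in)]

    # Phase 2 (column-wise): each output arbitrates among surviving requests.
    grants = [[False] * n_out for _ in range(n_in)]
    for j in range(n_out):
        column = [intermediate[i][j] for i in range(n_in)]
        col_grant = output_arbs[j].arbitrate(column)
        for i in range(n_in):
            grants[i][j] = col_grant[i]
    return grants

n = 3
in_arbs = [RoundRobinArbiter(n) for _ in range(n)]
out_arbs = [RoundRobinArbiter(n) for _ in range(n)]
reqs = [[True, True, False],
        [True, False, False],
        [False, True, True]]
print(separable_input_first(reqs, in_arbs, out_arbs))
# Input 1 gets nothing even though a perfect matching exists (0->1, 1->0, 2->2):
# a bad first-phase choice, exactly as the slide warns.
```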

18 [adapted from Becker STM’09 talk]
Wavefront allocators Avoid separate phases … and the bad decisions made in the first one. Generate better matchings, but delay scales linearly and they are also difficult to pipeline. Principle of operation: pick an initial diagonal; grant all requests on the diagonal (they can never conflict!); for each grant, delete the requests in the same row and column; repeat for the next diagonal. [adapted from Becker STM’09 talk]
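Here is a behavioral model of that principle (a software sketch, not the hardware structure; names are illustrative): sweep the wrapped diagonals of the request matrix, grant everything still possible on the current diagonal, and remove the granted rows and columns.

```python
# Behavioral sketch of the wavefront principle for an n x n request matrix.

def wavefront_allocate(requests, start_diag=0):
    n = len(requests)
    grants = [[False] * n for _ in range(n)]
    free_row, free_col = [True] * n, [True] * n
    for d in range(n):                        # one pass over all n diagonals
        diag = (start_diag + d) % n
        for i in range(n):
            j = (i + diag) % n                # cells on one wrapped diagonal never conflict:
            if requests[i][j] and free_row[i] and free_col[j]:   # one cell per row and column
                grants[i][j] = True
                free_row[i] = free_col[j] = False   # delete this row and column
    return grants
```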

19 Wavefront allocator timing
Originally conceived as a full-custom, tiled design; true delay scales linearly. Signal wraparound creates combinational loops that are effectively broken at the priority diagonal, but static timing analysis cannot infer that, so synthesized designs must be modified to avoid loops! [adapted from Becker STM’09 talk]

20 Diagonal Propagation Allocator
The unrolled matrix avoids combinational loops; a sliding priority window activates the sub-matrix cells. But static timing analysis again sees false paths: the actual delay is ~n while the reported delay is ~(2n-1), which hurts synthesized designs. Another design that avoids combinational loops is […] [adapted from Becker STM’09 talk]

21 [adapted from Becker STM’09 talk]
VC allocation Before packets can proceed through the router, they need to acquire ownership of a VC at the downstream router. The VC allocator matches unassigned input VCs with output VCs that are not currently in use: P×V requestors (input VCs), P×V resources (output VCs). The VC is acquired by the head flit and inherited by the body & tail flits. [adapted from Becker STM’09 talk]
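For illustration, the sketch below (invented function and argument names) builds the (P×V)×(P×V) request matrix that such a VC allocator could feed into an allocator like the ones sketched earlier, masking output VCs that are already in use.

```python
# Sketch: VC allocation cast as a (P*V) x (P*V) allocation problem that the
# separable or wavefront allocators above can solve. An input VC holding an
# unassigned head flit requests every candidate output VC not currently in use.

def build_vc_requests(P, V, waiting, candidates, busy_output_vcs):
    """waiting: set of input VC ids (0 .. P*V-1) with an unassigned head flit.
    candidates[in_vc]: output VC ids the routing function allows for that packet.
    busy_output_vcs: set of output VC ids already owned by another packet."""
    n = P * V
    requests = [[False] * n for _ in range(n)]
    for in_vc in waiting:
        for out_vc in candidates[in_vc]:
            if out_vc not in busy_output_vcs:      # masking logic for busy VCs
                requests[in_vc][out_vc] = True
    return requests
```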

22 VC allocator implementations
Not shown: Masking logic for busy VCs [adapted from Becker STM’09 talk]

23 Typical pipelined router
Route computation (RC) → VC allocation (VA) + switch allocation (SA) → switch traversal (ST) → link traversal (LT)
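A toy cycle-by-cycle sketch of how the flits of a 3-flit packet would occupy these stages (illustrative only; it assumes one cycle per stage and that body/tail flits skip RC and VA, reusing the head flit's route and VC).

```python
# Cycle-by-cycle occupancy of a 5-stage router pipeline for a 3-flit packet.

STAGES = ["RC", "VA", "SA", "ST", "LT"]

def timeline(num_flits):
    rows = []
    for f in range(num_flits):
        start_stage = 0 if f == 0 else 2        # head starts at RC, others at SA
        start_cycle = f if f == 0 else 2 + f    # each flit follows one cycle behind the previous
        row = {start_cycle + k: STAGES[start_stage + k]
               for k in range(len(STAGES) - start_stage)}
        rows.append(row)
    return rows

for f, row in enumerate(timeline(3)):
    print(f"flit {f}:", row)
# flit 0: {0: 'RC', 1: 'VA', 2: 'SA', 3: 'ST', 4: 'LT'}
# flit 1: {3: 'SA', 4: 'ST', 5: 'LT'}
# flit 2: {4: 'SA', 5: 'ST', 6: 'LT'}
```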

