A Case for Globally Shared-Medium On-Chip Interconnect: Enhancing Effective Throughput for Transmission Line-Based Bus. Aaron Carpenter, Jianyun Hu, Jie Xu, Michael Huang, Hui Wu. University of Rochester
Motivation: e.g., a 5x5 mesh. Worst case: 4 + 4 = 8 hops. Per hop = pipeline delay + queue delay. Example: 5 + 10 = 15 clock cycles/hop. Worst case: 15 * 8 = 120 clock cycles; at a 1 GHz clock = 120 ns. Much slower than a DRAM access.
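The worst-case arithmetic on this slide can be reproduced with a short sketch. The function name and parameters are illustrative assumptions rather than part of the presentation, and the model simply multiplies the corner-to-corner hop count by a fixed per-hop delay.

```python
def mesh_worst_case_latency(rows, cols, pipeline_cycles, queue_cycles, clock_hz):
    """Return (hops, cycles, nanoseconds) for a corner-to-corner trip in a 2D mesh.

    Assumes every hop costs the same pipeline + queue delay (a simplification).
    """
    hops = (rows - 1) + (cols - 1)                 # 4 + 4 = 8 for a 5x5 mesh
    cycles = hops * (pipeline_cycles + queue_cycles)
    nanoseconds = cycles * 1e9 / clock_hz          # cycles -> ns at the given clock
    return hops, cycles, nanoseconds


hops, cycles, ns = mesh_worst_case_latency(5, 5, 5, 10, 1e9)
print(f"{hops} hops, {cycles} cycles, {ns:.0f} ns")
```

With the slide's numbers (5 pipeline cycles, 10 queue cycles, 1 GHz clock), this gives 8 hops, 120 cycles, and 120 ns.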
Motivation Non-uniform cache access (NUCA) delays create problems. Significant existing research aimed to reduce unnecessary remote accesses by trying to map data closer to the threads that frequently access the data.
Motivation: Transmission-line circuit technology allows data rates of >= 26 Gb/s, i.e., about 0.04 ns per bit. Latency across the chip is ~2 ns. Claims to significantly reduce power because there are no power costs at intermediate routers (and queues).
Their Proposed Architecture: Use transmission lines (TLs) to create a shared bus: – Two-level network: the first level connects 2-4 nodes per hub. – A shared bus connects all hubs. – Within a hub, nodes can be connected via, e.g., a crossbar. – Centralized arbitration controls bus access.
Arbitration: When a message is to be transferred from node i to node j: 1. A setup step is performed to wake up transmitter i. 2. In the background, the arbiter passes on the grant to node j. 3. Time is needed to drain the signal (waiting until the last bit has been transmitted). 4. The arbiter can then process the next task.
Implementation problems: Where to put the arbiter? How to account for the communication delay of getting requests from the nodes to the arbiter and grants back? What is the overhead of routing request/grant lines between the arbiter and the nodes? Put the arbiter in the middle?
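The four arbitration steps can be sketched as a timeline, under loudly stated assumptions: a first-come-first-served grant policy, a setup phase that is not overlapped with the previous transfer, and made-up timing values. None of these come from the presentation, and `schedule_transfers` is a hypothetical helper.

```python
from collections import deque


def schedule_transfers(requests, setup_ns, bit_time_ns, drain_ns):
    """Return [(src, dst, start_ns, bus_free_ns)] for requests served in FIFO order.

    requests: iterable of (src, dst, bits). FIFO order is an assumption; the
    slides only say that arbitration is centralized.
    """
    queue = deque(requests)
    now = 0.0
    timeline = []
    while queue:
        src, dst, bits = queue.popleft()
        start = now + setup_ns               # step 1: wake up transmitter src
        end_tx = start + bits * bit_time_ns  # serialize the packet onto the line
        bus_free = end_tx + drain_ns         # step 3: wait for the signal to drain
        timeline.append((src, dst, start, bus_free))
        now = bus_free                       # step 4: arbiter handles the next task
    return timeline


# Hypothetical timing values chosen only to keep the arithmetic easy to follow.
for entry in schedule_transfers([(0, 5, 8), (3, 1, 8)], 1.0, 0.5, 2.0):
    print(entry)
```

A real design could overlap the setup of the next transmitter with the drain of the current one; this sketch keeps the phases strictly sequential for readability.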
Outline of Remaining Talk: Transmission line (transmission line medium, transceiver circuitry); Node structure; Bus architecture; Arbitration; Interface circuit design
Transmission Line: transmission line medium. Microstrips: simple; in isolation, each line can support a high data rate (> 20Gb/s), but crosstalk from neighboring lines requires very large spacing. Coplanar waveguides: use a grounded strip in between the signal lines; significant spacing between signal lines. Coplanar strips: use the more noise-tolerant differential signaling on a pair of lines.
Transceiver circuitry: Digital systems. Analog receivers: tolerate more attenuation and thus allow higher rates than digital systems. Analog transmitters: can be used together with more sophisticated encoding schemes.
In their design: Coplanar strips are used, as they utilize the space of the top metal layer more efficiently, together with basic differential transmitters and receivers. A data rate of 26.4Gb/s can be achieved for a pair of transmission lines with a total pitch (including spacing) of 45μm. Within 2.5mm of space, this pitch allows 55 pairs to be laid out, allowing 1.45Tb/s of total bandwidth.
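The bandwidth figure follows directly from the pitch and the per-pair data rate. The sketch below recomputes it; the function name is an illustrative assumption, and the pair count is simply rounded down to whole pairs.

```python
def aggregate_bandwidth(width_um, pitch_um, rate_gbps):
    """Return (pairs, total_gbps) for line pairs laid side by side in width_um."""
    pairs = int(width_um // pitch_um)   # only whole pairs fit in the available width
    return pairs, pairs * rate_gbps


pairs, total_gbps = aggregate_bandwidth(2500, 45, 26.4)
print(f"{pairs} pairs, {total_gbps / 1000:.2f} Tb/s")
```

With 2.5mm (2500μm) of width and a 45μm pitch, 55 pairs fit, and 55 pairs at 26.4Gb/s give roughly 1.45Tb/s, matching the slide.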
Node structure: The assumption is that a chip consists of tiles, each with a core, an L1 cache, and a slice of a globally shared L2 (last-level) cache. If an L1 miss occurs and the address maps to a remote node, the access results in a packet injected into the interconnect. Otherwise, the L1 miss is served by the local L2 bank.
Node structure: Clustering groups a small number of cores and L2 slices into a node. The backbone network only makes a stop at every node. An intra-node fabric connects the multiple L1 caches and the L2 cache banks in the node.
Performance: Clustering adds extra latency for accesses from an L1 cache to the nearest L2 bank (Figure 4-b, Core0 to L20). It makes accessing neighboring cache banks within the node (Figure 4-b, Core1 to L20) faster. It reduces the number of hubs a long-distance packet needs to traverse. The extra cost of a larger intra-node fabric offsets the savings due to a lower number of hubs for the inter-node fabric.
Bus Architecture: Each node uses a high-speed communication circuit to deliver packets. Our bus is merely a shared medium that allows point-to-point communication.
Partitioning the bus: To increase throughput, use a wide bus or have multiple buses for different packets. Bundling: for better utilization of the bus bandwidth, send multiple packets for each bus arbitration.
Interface Circuit Design: The interface consists of a transmitter, a receiver, a serializer (SER), a deserializer (DES), and a phase and data recovery circuit (PDR). The transmitter (Tx) and receiver (Rx) are both implemented in standard CMOS technology without any special RF devices such as inductors. At 26.4Gb/s, synchronization between the received data and the local clock is needed.
Increasing Effective Bus Throughput: There are many ways to increase the throughput of a bus at the circuit or architecture level. The proposed techniques can be categorized into three groups: 1. Increasing raw link throughput. 2. Increasing the utilization efficiency. 3. Optimization on the use of buses.
Increasing raw link throughput: The potential link throughput is high because the inherent channel bandwidth of the transmission line is quite high. There are many coding methods to increase the raw throughput.
Increasing raw link throughput: First, we turn to 4-PAM, which doubles the data rate compared to OOK. The additional circuitry consists of a DAC for the transmitter and an ADC for the receiver. Because these elements increase energy and latency, we use 4-PAM only for the data packet bus to minimize the latency impact.
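The doubling comes from carrying two bits per symbol instead of one. The sketch below illustrates this with a Gray-coded level mapping, which is an assumption chosen for illustration; the slides do not specify which mapping the DAC uses.

```python
def ook_symbols(bits):
    """On-off keying: one symbol (off = 0, on = 1) per bit."""
    return list(bits)


# Hypothetical Gray-coded mapping from bit pairs to four amplitude levels.
PAM4_LEVELS = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}


def pam4_symbols(bits):
    """4-PAM: one four-level symbol per pair of bits."""
    if len(bits) % 2 != 0:
        raise ValueError("4-PAM needs an even number of bits")
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


bits = [1, 0, 0, 1, 1, 1, 0, 0]
print(len(ook_symbols(bits)), "OOK symbols")
print(len(pam4_symbols(bits)), "4-PAM symbols:", pam4_symbols(bits))
```

At the same symbol rate, the 4-PAM stream needs half as many symbol times as OOK for the same bits, which is where the 2x data rate comes from.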
Increasing raw link throughput: We then use frequency-division multiplexing (FDM), which allows us to use higher frequency bands. The attenuation in these bands increases with frequency and can be high. When used as a global bus, the higher bands become lossy. The higher-frequency channels are intended for shorter-distance communication rather than for long transmission lines.
Increasing raw link throughput: The supporting circuitry includes mixers on both the transmitter and receiver sides and a filter at the receiver end. It is challenging to estimate the power cost of this support circuitry. We use a simplified analysis to estimate the minimum power cost of supporting frequency-division and multi-band transmission.
Increasing the Utilization Efficiency: The underlying global transmission lines support a high data rate, but using them to shuttle short packets can cause under-utilization: 1. Long lines mean it takes a long time for a signal to drain from the transmission line. 2. Packets destined for near neighbors are a poor match for the global line structure. A number of techniques can address these issues, including partitioning, wave-based arbitration, and segmentation.
Partitioning: It is straightforward to partition the same number of underlying links into more, narrower buses. Longer serialization reduces the waste due to draining. The finer granularity also allows a better load balance between the two types of buses. For example, we can partition five 1-flit-wide buses into any combination of meta buses and data buses. In this paper, we use the fixed configuration that achieves the best average performance.
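Why longer serialization reduces drain waste can be seen with a simple duty-cycle model, sketched below. The model (utilization = serialization time / (serialization time + drain time)) and all numeric values are assumptions for illustration; the 2 ns drain time borrows the cross-chip latency quoted earlier in the talk.

```python
def bus_utilization(packet_bits, lines_per_bus, rate_gbps, drain_ns):
    """Fraction of bus time spent sending bits rather than waiting for the drain."""
    serialization_ns = packet_bits / (lines_per_bus * rate_gbps)  # bits / (Gb/s) = ns
    return serialization_ns / (serialization_ns + drain_ns)


# Hypothetical setup: a 512-bit packet, 25 Gb/s per line, 2 ns drain time.
wide = bus_utilization(512, lines_per_bus=32, rate_gbps=25, drain_ns=2.0)
narrow = bus_utilization(512, lines_per_bus=8, rate_gbps=25, drain_ns=2.0)
print(f"one 32-line bus: {wide:.2f}, each of four 8-line buses: {narrow:.2f}")
```

Under this model, splitting the same 32 lines into four narrower buses lengthens each packet's serialization time, so the fixed drain time is a smaller share of each bus's busy period.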
Segmentation: We can also improve the bus's spatial utilization to increase efficiency. This is achieved by dividing the transmission line into a few segments. If a node is communicating with another node within the same segment, it only needs to arbitrate for that segment. When communication crosses multiple segments, the transmitter needs to obtain permission for all of them. The connected segments then act as a single transmission line.
Segmentation: The segments can be connected in two ways: 1. A pass gate, which is a passive, bi-directional connection. It adds a small amount of attenuation and signal distortion, but this is acceptable. 2. Two separate uni-directional amplifiers. The cost of this approach is the power consumed by the amplifiers. But with these amplifiers, the source transmitter power can be lower, since a signal travels at most the length of one segment.
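The segment-permission rule (arbitrate only for the segments a transfer touches) can be sketched as follows. The linear node numbering, the equal-sized segments, and both function names are assumptions for illustration, not details from the presentation.

```python
def segments_needed(src, dst, nodes_per_segment):
    """Return the segment indices a transfer from src to dst must be granted.

    Assumes nodes are numbered along the line and segments are equal in size.
    """
    first = min(src, dst) // nodes_per_segment
    last = max(src, dst) // nodes_per_segment
    return list(range(first, last + 1))


def can_run_concurrently(transfer_a, transfer_b, nodes_per_segment):
    """Two transfers can share the bus in time if they need disjoint segments."""
    a = set(segments_needed(*transfer_a, nodes_per_segment))
    b = set(segments_needed(*transfer_b, nodes_per_segment))
    return a.isdisjoint(b)


# Hypothetical 16-node bus split into 4 segments of 4 nodes each.
print(segments_needed(0, 3, 4))                     # stays inside one segment
print(segments_needed(2, 9, 4))                     # spans several segments
print(can_run_concurrently((0, 3), (12, 15), 4))    # local transfers at both ends
```

Local transfers in different segments can proceed at the same time, which is the spatial-utilization gain the slides describe; a transfer that spans segments blocks all of the ones it crosses.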
Optimization on the use of buses: Invalidation acknowledgement omission: With a packet-switched network, protocols rely on explicit invalidation acknowledgements to signal completion. The explicit acknowledgement can be avoided if the interconnect offers a certain capability to infer delivery. Limited multicasting: A transmission line can support multicast operation. It is easy to support a small number of receivers operating at once, with acceptable attenuation. Even though it may not reduce traffic dramatically, it cuts latency and queuing delay.
Interaction between techniques: These three groups of techniques focus on different sources of performance gain. But within each group, there is a varying degree of overlap. In general, implementing one technique reduces the potential of another, so when multiple techniques are applied, we can reach diminishing returns. Example: When we are trying to increase the utilization efficiency, we send a pulse train on the bus and wait until it propagates beyond the ends before allowing another pulse train. Since the propagation delay is significantly longer than the pulse train, the duty cycle is low. The different techniques all try to improve this same duty cycle in different ways.
Experimental Setup: Transmission line links: a total pitch of 45μm and a line width of 10μm. The transmission lines have a serpentine shape and measure about 7.5cm in total length. Traffic and performance analysis: The L1 miss rate of these applications ranges up to 61 misses per thousand instructions (MPKI). [Figure: (a) Percentage of L2 accesses that are remote. (b) Speedup due to clustering; the left bar is for 1 core per node, the right bar is for 2 cores per node. The baseline in this case is a 16-core mesh.]
Performance comparison with mesh: On average, the TLL bus achieves a speedup of 1.15x over mesh in the 16-node configuration and 1.17x in the 8-node configuration. The TLL bus reduces network energy by about 26x compared with mesh.
The Impact of Bundling: The turn-around time also wastes bus bandwidth and can be mitigated with bundling. Too much bundling can be detrimental to performance as well.
Scaling Up: performance compared with mesh. We conduct a limited scalability test with a 64-core system organized into 2- or 4-core nodes (32 nodes, 2 cores each; and 16 nodes, 4 cores each). On average, the TLL bus performs 16% and 25% better than mesh for the 32- and 16-node systems, respectively.
Scaling Up: performance compared with an idealized circuit. The bus system achieves 67% and 72% of the idealized performance (using digital wire) for 32 and 16 nodes, respectively. In a 16-core, 8-node system, the bus can achieve 91% of the idealized performance.
CONCLUSIONS: Mainstream chip multiprocessors are unlikely to require an extreme amount of bandwidth for on-chip backbone communication. Only a small number of nodes will be connected by the packet-based backbone interconnect, and the traffic on this fabric can be rather limited. Experiments show that in a medium-scale 16-core system, this design achieves 91% of the performance of an idealized wire-based interconnect. An important benefit of avoiding packet switching and relaying is the inherent energy efficiency of the communication system.