Presentation on theme: "A Novel 3D Layer-Multiplexed On-Chip Network"— Presentation transcript:
1 A Novel 3D Layer-Multiplexed On-Chip Network
Rohit Sunkam Ramanujam and Bill Lin, Electrical and Computer Engineering, University of California, San Diego
2 Networks-on-Chip
Chip multiprocessors (CMPs) are increasingly popular. 2D mesh networks are often used as the on-chip fabric.
[Figure labels: Tilera Tile64; Intel 80-core; I/O area; single tile, 1.5 mm x 2.0 mm; die dimensions 12.64 mm and 21.72 mm.]
Chip multiprocessors are becoming increasingly popular. With increasing core counts, the interconnect fabric connecting the processing elements within a chip has become an important component, with a significant impact on both the performance and the power consumed by the chip. On-chip networks connect the different processing elements within a single chip. They provide a modular and scalable communication fabric for multi-core and many-core architectures. 2D mesh networks are often used as the on-chip communication fabric.
3 3D Integrated Circuits
[Figure labels: Through-Silicon Via; Device layer 1; Device layer 2.]
Two or more active device layers. Short inter-layer distances. Reduced chip footprint. Reduced wire delays. High inter-layer bandwidth. Heterogeneous system integration.
Another technology that is emerging rapidly is 3D integrated circuits, which use the concept of vertical integration. Instead of having a single device layer in a 2D plane, we can now have multiple device layers stacked on top of each other. This has several benefits, including a reduced chip footprint. Wire delays are reduced because the smaller footprint shortens the horizontal global interconnects. Vertical interconnects connecting components on different layers have very low delays because inter-layer distances are very small. We can have high inter-layer bandwidth because the wires running through the layers can be densely packed in two dimensions. 3D technology also opens up opportunities for heterogeneous system integration.
4 Natural Progression: 3D Mesh for 3D CMPs
[Figure labels: 2D Mesh; 3D Mesh.]
Since 3D technology is so promising, the natural way for the architecture community to take advantage of 3D integration is to extend 2D CMPs to three dimensions. The simplest way to extend 2D mesh topologies to a 3D layout is to add two extra ports for vertical communication to each router and rearrange the tiles in the form of a 3D mesh.
What routing algorithms should be used for 3D mesh networks?
5 Outline
Oblivious routing on a 3D mesh. Layer-multiplexed 3D architecture. Evaluation.
6 Oblivious Routing Objectives
Maximize throughput: distribute traffic evenly on network links; maximize worst-case throughput, as traffic is application dependent.
Minimize hop count: minimize routing delay between source and destination; reduce power.
Next, we take another look at the routing algorithm objectives. The main task of the routing algorithm is to maximize throughput by evenly distributing the traffic over all network links. Since we are concerned with routing algorithms for general-purpose processors, the application that will run on the processor is unknown, and as a result the traffic, which is application dependent, is also unknown. So we try to maximize worst-case throughput, that is, the throughput under the most adversarial traffic. A second objective, which was part of the constraints in my original problem statement, is to minimize the number of intermediate router hops between the source and the destination. This has a two-fold benefit: it reduces delay, and it also reduces power, as each intermediate router hop consumes power.
7 Routing Algorithms for 3D Mesh Networks
Valiant routing (VAL): optimal worst-case throughput, poor latency.
Dimension-ordered routing (DOR): minimal latency, poor worst-case throughput.
O1TURN routing: minimal latency, poor worst-case throughput.
Ideal routing algorithm: minimal latency, maximum worst-case throughput.
[Figure: average hop count (normalized to minimal) plotted against worst-case throughput (fraction of network capacity). Labeled points: IDEAL, VAL, DOR, O1TURN. Tick labels: 1 and 2; 0.25 and 0.5.]
In this graph, we have the two routing objectives along the X and Y axes, and we plot current routing algorithms on this 2D plane. First we see what an ideal routing algorithm should look like.
8 Randomized Partially-Minimal Routing (RPM)
[Figure labels: Source; Destination; random intermediate layer; X and Z axes.]
Phase-1 Z: source to the intermediate layer.
XY or YX routing on the intermediate layer.
Phase-2 Z: intermediate layer to the destination.
Ensuring that the projected traffic in each 2D plane is admissible requires only a very simple step: load-balance equally across all 2D layers.
9 Main Idea
Load-balance uniformly across the vertical layers. Two phases of vertical routing. Minimal XY/YX routing used on each layer.
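The three steps of RPM can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `rpm_route` and `_walk` are hypothetical, nodes are assumed to be `(x, y, z)` tuples, and the sketch assumes the intermediate layer is drawn uniformly at random and that XY and YX are chosen with equal probability.

```python
import random


def _walk(path, axis, target):
    """Append unit steps along one axis until that coordinate equals target."""
    node = list(path[-1])
    while node[axis] != target:
        node[axis] += 1 if target > node[axis] else -1
        path.append(tuple(node))


def rpm_route(src, dst, num_layers, rng=random):
    """Return (path, intermediate_layer) for one packet.

    Assumption: uniform choice of intermediate layer and of XY vs. YX order.
    """
    mid = rng.randrange(num_layers)          # random intermediate layer
    order = rng.choice([(0, 1), (1, 0)])     # XY or YX on that layer
    path = [src]
    _walk(path, 2, mid)                      # phase-1 Z: source to intermediate layer
    for axis in order:                       # minimal routing on the 2D layer
        _walk(path, axis, dst[axis])
    _walk(path, 2, dst[2])                   # phase-2 Z: intermediate layer to destination
    return path, mid
```

The route is minimal in X and Y but may take extra hops in Z, which is where the "partially minimal" in the name comes from.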
10 Routing Algorithms for 3D Mesh Networks
Randomized partially-minimal routing (RPM): near-optimal worst-case throughput, low latency.
[Figure: average hop count (normalized to minimal) plotted against worst-case throughput (fraction of network capacity). Labeled points: IDEAL, VAL, RPM, DOR, O1TURN. Tick labels: 1, 1.1 and 2; 0.25 and 0.5.]
In this graph, we have the two routing objectives along the X and Y axes, and we plot current routing algorithms on this 2D plane. First we see what an ideal routing algorithm should look like.
11 RPM has Near-optimal Worst-case Throughput
RPM is optimal for even radix k and within 1/k² of optimal for odd radix.
12 Performance of RPM: Average-case Throughput
The message we would like to convey here is that RPM is the best known oblivious routing algorithm for 3D mesh networks.
13 Outline
Oblivious routing on a 3D mesh. Layer-multiplexed (LM) 3D architecture. Evaluation.
Now we will shift our focus from the routing algorithm to an optimized 3D network architecture that implements RPM routing effectively.
14 Unique Features of 3D ICs
Inter-layer distances are very small (~50 μm), an order of magnitude lower than the distances between adjacent tiles on a 2D plane (~1500 μm).
Vertical interconnects implemented using Through-Silicon Vias (TSVs) have very low delay.
[Figure labels: TSV; 50 μm; 1500 μm.]
15 Unique Features of 3D ICs
Inter-layer distances are very small (~50 μm), an order of magnitude lower than the distances between adjacent tiles on a 2D plane (~1500 μm).
Vertical wires using Through-Silicon Vias (TSVs) have very low delay.
Vertical bandwidth is abundant, as TSVs can be densely packed in 2D with a small via pitch (~4 μm).
[Figure label: 4 μm.]
16 Unique Features of 3D ICs
Inter-layer distances are very small (~50 μm), an order of magnitude lower than the distances between adjacent tiles on a 2D plane (~1500 μm).
Vertical wires using Through-Silicon Vias (TSVs) have very low delay.
Vertical wiring is abundant, as TSVs can be packed in 2D with a small via pitch (~4 μm).
The number of device layers is likely to remain small (4-5 layers) due to thermal and manufacturing issues.
17 RPM on a 3D Mesh
[Figure labels: Source; Destination; random intermediate layer; X and Z axes.]
Phase-1 Z: source to the intermediate layer.
XY or YX routing on the intermediate layer.
Phase-2 Z: intermediate layer to the destination.
Now let's take a look at how RPM routes a packet on a conventional 3D mesh. Let's say that the source and destination nodes are on the bottom layer and the intermediate layer chosen at random is the top layer.
18 Proposed Layer-Multiplexed Architecture
[Figure labels: Source; Destination; random intermediate layer; processors P1 to P4; X, Y and Z axes.]
Phase-1 Z: source to the intermediate layer.
XY or YX routing on the intermediate layer.
Phase-2 Z: intermediate layer to the destination.
RPM routing adapted to the LM architecture: RPM-LM.
The layer-multiplexed architecture we propose replaces hop-by-hop vertical communication with demultiplexing and multiplexing structures.
The first step in RPM routing is to demultiplex packets from each processor to a randomly chosen intermediate layer. Since the inter-layer distances are very short, we use a vertical demultiplexing switch that spans all 4 layers. This switch connects the processors on the 4 layers to the injection queues of the routers on all 4 layers. With this switch in place, a processor on any layer can inject a packet into a router on any other layer in just a single hop.
Once a packet is injected into a router on one of the layers, routing on that layer is the same as routing on a 2D mesh. So we only need 5-port routers for routing on a 2D plane, instead of large 7-port routers.
Finally, when a packet reaches the (X,Y) coordinates of the destination, it is ejected directly to a packet ejection multiplexer at the destination processor. These multiplexers multiplex the packets arriving from different layers at each processor. RPM on the LM architecture is called RPM-LM.
19 Power and Area Savings
5x5 crossbar in LM vs. 7x7 crossbar in 3D mesh.
[Figure labels: Conventional 3D Mesh; Layer-Multiplexed Architecture; processors P1 to P4; packet injection demultiplexer; packet ejection multiplexer.]
Layer-multiplexed architecture: decouple vertical routing from horizontal routing; restrict vertical routing to packet injection and packet ejection.
Now let's take a look at the advantages of the LM architecture over a conventional 3D mesh.
First, the LM architecture uses only 5-port routers, compared to the 7-port routers used in a 3D mesh. Since power and area increase quadratically with the number of ports, 7-port routers are almost twice as expensive as 5-port routers in terms of power and area.
In the LM architecture we decouple vertical routing from routing on each horizontal plane. By doing this we are able to reorganize the 7-port routers of a 3D mesh into 5-port routers for routing on each 2D plane, integrated with demultiplexing and multiplexing structures for routing between layers.
We also restrict vertical routing to the packet injection and packet ejection stages. Since RPM requires two phases of vertical routing, these two phases map easily onto the packet injection and ejection stages.
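The "almost twice as expensive" figure follows directly from the quadratic scaling stated in the notes. As a rough check, assuming cost scales purely with the square of the port count (the symbol C is introduced here for illustration and ignores buffers and other per-port overheads):

```latex
\[
\frac{C_{7\text{-port}}}{C_{5\text{-port}}}
\approx \left(\frac{7}{5}\right)^{2}
= \frac{49}{25}
= 1.96
\]
```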
20 Single Hop Vertical Communication
Single-hop vertical routing is more power efficient than one-layer-per-hop routing. It leverages the short inter-layer distances in 3D ICs and better utilizes the available vertical bandwidth.
The next advantage of LM over a 3D mesh is that it allows single-hop vertical communication instead of one-layer-per-hop routing. As it turns out, when inter-layer distances are very short, single-hop routing using the demultiplexing and multiplexing stages is more power efficient than one-layer-per-hop routing, where power is dissipated at every intermediate router. Single-hop routing is made possible by the opportunities available in 3D ICs.
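To make the hop-count difference concrete, the sketch below counts vertical hops for RPM on a conventional 3D mesh and compares them with the two vertical traversals (one injection demultiplexer, one ejection multiplexer) used by the LM architecture. The function names are hypothetical, and the average assumes source, intermediate and destination layers are each uniformly distributed; it counts hops only and says nothing about power by itself.

```python
from fractions import Fraction
from itertools import product

# LM architecture: one demultiplexer traversal at injection and one
# multiplexer traversal at ejection, regardless of the layers involved.
LM_VERTICAL_TRAVERSALS = 2


def mesh_vertical_hops(src_layer, mid_layer, dst_layer):
    """Vertical router hops for RPM on a conventional 3D mesh (one layer per hop)."""
    return abs(src_layer - mid_layer) + abs(mid_layer - dst_layer)


def average_mesh_vertical_hops(num_layers):
    """Exact average over all (source, intermediate, destination) layer triples.

    Assumption: all three layers are uniformly distributed and independent.
    """
    layers = range(num_layers)
    total = sum(mesh_vertical_hops(s, m, d) for s, m, d in product(layers, repeat=3))
    return Fraction(total, num_layers ** 3)
```

For the case described earlier in the talk, with source and destination on the bottom layer and the top layer of a 4-layer stack as the intermediate layer, `mesh_vertical_hops(0, 3, 0)` gives 6 vertical hops on the mesh, against 2 traversals in LM.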
21 Packet Injection Demultiplexer
[Figure labels: processors P1 to P4; Route Selection/Load Balancing; Flit Counters; VC Allocation; Switch Arbitration; credits in from the injection ports of the routers on layers 1-4; outputs to the injection ports of the Layer 1 to Layer 4 routers.]
The packet injection demultiplexer is basically a vertical switch, and its architecture is very similar to that of a 2D router.
The route selection stage selects an intermediate layer to route a packet to. Since the goal is to load-balance across the vertical layers, we need a set of counters to keep track of the amount of traffic sent from every processor to each layer. The routing decision is based on the counter values.
In this case there is also some added flexibility in the route selection stage, as "technically" any output port can be used, provided we load-balance over time. So route selection can also try to reduce contention at the output ports.
What is different here is that the route selection stage needs to balance traffic uniformly across the 4 layers. It does the balancing with the help of flit counters, which keep track of the number of flits a processor sends to each layer, and it ensures that over time traffic is distributed uniformly across the layers.
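One way to picture counter-based route selection is the sketch below. It is an illustration under assumptions, not the selection logic from the talk: the class name `InjectionLayerSelector`, the least-loaded-layer policy, and the `congested` argument (standing in for the flexibility to avoid contended output ports) are all hypothetical.

```python
class InjectionLayerSelector:
    """Per-processor flit counters used to balance traffic across layers.

    Assumed policy: send each packet to the layer that has received the
    fewest flits so far, skipping layers currently flagged as congested.
    """

    def __init__(self, num_layers=4):
        self.flit_counts = [0] * num_layers

    def select(self, packet_flits, congested=()):
        layers = range(len(self.flit_counts))
        # Prefer uncongested layers; fall back to all layers if none is free.
        candidates = [l for l in layers if l not in congested] or list(layers)
        layer = min(candidates, key=lambda l: self.flit_counts[l])
        self.flit_counts[layer] += packet_flits
        return layer
```

Because a skipped layer ends up with the lowest count, it is chosen first once it is no longer congested, so the counters even out over time.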
22 Packet Ejection Multiplexer
[Figure labels: for each processor P1 to P4, an arbiter with VC ID, the router on layer 1, packets arriving from layers 2, 3 and 4, input queues L1-Pn to L4-Pn, and credits out for L1-Pn, L2-Pn, L3-Pn and L4-Pn.]
The job of the packet ejection multiplexers is to multiplex packets arriving from different layers to the destination processor.
In a 4-layer architecture, each multiplexer has 4 queues to receive packets from the routers on each of the 4 layers. When a packet is ejected from a layer router, its destination field is used to direct it to the right ejection multiplexer. Each multiplexer can then independently choose which flit to forward to the destination processor every cycle.
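The per-cycle choice among the four queues can be sketched as follows. The talk does not say which arbitration policy is used, so round-robin is an assumption here, as are the names `EjectionMultiplexer`, `enqueue` and `cycle`; the sketch also ignores virtual channels and credits.

```python
from collections import deque


class EjectionMultiplexer:
    """One queue per layer; forwards at most one flit per cycle.

    Assumption: round-robin arbitration among non-empty queues.
    """

    def __init__(self, num_layers=4):
        self.queues = [deque() for _ in range(num_layers)]
        self._next = 0  # layer that has priority in the next cycle

    def enqueue(self, layer, flit):
        self.queues[layer].append(flit)

    def cycle(self):
        """Return the flit forwarded to the processor, or None if all queues are empty."""
        n = len(self.queues)
        for offset in range(n):
            layer = (self._next + offset) % n
            if self.queues[layer]:
                self._next = (layer + 1) % n
                return self.queues[layer].popleft()
        return None
```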
23 Outline
Oblivious routing on a 3D mesh. Layer-multiplexed 3D architecture. Evaluation: power and area; performance.
Having described the details of the LM architecture, let's move on to its evaluation. First I will compare the power and area of the LM architecture with those of a 3D mesh.
24 Power and Area Evaluation
Used Orion 2.0 models for router power and area estimation.
65 nm process at 1 V and 1 GHz.
Buffers: 4 VCs/port and 5 flits/VC for routers; 5 flits/port for the packet injection demultiplexer; 5 flits/port for each packet ejection multiplexer.
25 Power Comparison
3D mesh: one 7-port router per tile.
LM: one packet injection demultiplexer for every 4 tiles; one packet ejection multiplexer per tile.
28 Outline
Oblivious routing on a 3D mesh. Layer-multiplexed 3D architecture. Evaluation: power and area; performance.
29 RPM on a 3D mesh vs. RPM-LM
Worst-case throughput: RPM-LM achieves the same (near-optimal) worst-case throughput as RPM.
Average-case throughput: RPM-LM outperforms RPM on the symmetric topology because the LM architecture offers higher vertical bandwidth than a 3D mesh.
30 Flit-Level Simulation
Ideal throughput evaluation assumes: an ideal single-cycle router; infinite buffers; no contention in switches and no flow control.
Flit-level simulation: PopNet network simulator; 5-stage router pipeline; credit-based flow control; 8 virtual channels, each 5 flits deep; multi-flit packets injected into the network (5 flits/packet).
Until now, the throughput results presented were for an ideal scenario in which we assumed ideal single-cycle routers, infinite buffers and no contention in switches. To get a more realistic insight into the performance of RPM and RPM-LM, we need to account for the non-idealities present in practical implementations. For this purpose we used a cycle-accurate flit-level simulator.
31 Flit-Level Simulation (cont'd)
Network configurations simulated: 4 x 4 x 4 mesh; 8 x 8 x 4 mesh.
Four different traffic traces used:
Uniform traffic.
Transpose traffic: (x,y,z) → (y,z,x).
Complement traffic: (x,y,z) → (k-x-1, k-y-1, k-z-1).
Worst-case traffic pattern for DOR (DOR-WC): (x,y,z) → (k-z-1, k-y-1, k-x-1).
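The three permutation patterns translate directly into destination functions. The sketch below assumes a cubic k x k x k mesh with coordinates in 0..k-1, as in the 4 x 4 x 4 configuration; how the patterns are applied to the non-cubic 8 x 8 x 4 mesh is not stated on the slide and is not covered here. The function names are hypothetical.

```python
def transpose(node, k):
    """Transpose traffic: (x, y, z) -> (y, z, x). k is unused, kept for a uniform signature."""
    x, y, z = node
    return (y, z, x)


def complement(node, k):
    """Complement traffic: (x, y, z) -> (k-x-1, k-y-1, k-z-1)."""
    x, y, z = node
    return (k - x - 1, k - y - 1, k - z - 1)


def dor_worst_case(node, k):
    """DOR-WC traffic: (x, y, z) -> (k-z-1, k-y-1, k-x-1)."""
    x, y, z = node
    return (k - z - 1, k - y - 1, k - x - 1)
```

Each function is a permutation of the node set on a cubic mesh, so every node sends to exactly one destination and receives from exactly one source.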
35 Summary of Contributions
Proposed a 3D layer-multiplexed architecture, which is an optimization of a 3D mesh.
It exploits the optimality of RPM together with the high vertical bandwidth enabled by 3D technology.
The LM architecture consumes 27% less power and occupies 26% less area than a 3D mesh.
RPM-LM has comparable (marginally better) performance to RPM on a 3D mesh.