
1 Scaling Towards Kilo-Core Processors with Asymmetric High-Radix Topologies
Nilmini Abeyratne, Reetuparna Das, Qingkun Li, Korey Sewell, Bharan Giridhar, Ronald G. Dreslinski, David Blaauw, and Trevor Mudge
University of Michigan, Ann Arbor
HPCA 19, February 27, 2013

2 Many-Core Trend
Thousand-core chips are in our future; a scalable on-chip interconnect is required.
Over the past decade, the number of cores on a single chip has been steadily increasing. Today, 100-core chips such as Tilera's TILE64 and TILE-Gx100 and the Intel SCC are available, and we predict there will be kilo-core chips with 1000 cores in the future. With more and more cores, the movement of data around the chip becomes a bottleneck, which calls for a power-efficient and scalable interconnect for future kilo-core systems. Current multicore systems use crossbars or rings, but these do not scale: their bisection bandwidth is low (just 2 for a ring) and does not grow with the number of cores, and hop counts are large (63 hops for a 64-core ring). Current many-core systems with up to 100 cores use a mesh network. Can we take it one step further, to 1000 cores?
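The scaling argument above can be checked with a few lines of arithmetic. This sketch (illustrative, not from the talk) compares the worst-case hop count and bisection width of a unidirectional ring against a square 2D mesh:

```python
import math

def ring_stats(n):
    """Unidirectional ring: worst-case hop count is n-1; a bisection cuts 2 links."""
    max_hops = n - 1
    bisection = 2  # constant, regardless of core count
    return max_hops, bisection

def mesh_stats(n):
    """Square 2D mesh: worst case is corner to corner; a vertical cut crosses `side` links."""
    side = int(math.isqrt(n))
    max_hops = 2 * (side - 1)
    bisection = side  # grows with sqrt(n)
    return max_hops, bisection

for cores in (64, 256, 1024):
    print(cores, ring_stats(cores), mesh_stats(cores))
```

For 64 cores this reproduces the numbers in the notes: 63 hops and a bisection of 2 for the ring, versus 14 hops and a bisection of 8 for the mesh, and the gap widens at 1024 cores.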

3 Outline
Motivation
Symmetric Low-Radix and High-Radix Designs
Asymmetric High-Radix Designs: Super-Star, Super-StarX
Results
Conclusion

4 Mesh Topology
Popular in tile-based many-core processors: low complexity and a planar 2D layout. As mentioned earlier, the mesh topology is commonly used in tile-based many-core processors; Tilera's TILE64 is a 64-core chip with an 8x8 mesh network. But can such a mesh topology scale to hundreds of cores?

5 High-Radix Topologies
An alternative to low-radix topologies, built using concentration. Starting from a conventional low-radix mesh, take a subset of the tiles and consolidate their small routers into one large router, creating a concentrated cluster of tiles. Doing this across the entire chip yields a high-radix topology. Concentration improves network latency because packets traverse fewer hops; however, the single links between the routers become bottlenecks.

6 High-Radix Topologies
Throughput can be improved by adding connectivity: parallel links and express links. An important component of high-radix topologies is the high-radix router itself. With conventional crossbar designs it is hard to build large routers within reasonable power constraints, so something more efficient is needed, such as the Swizzle-Switch.

7 High-Radix Switch: Swizzle-Switch
A traditional matrix-style crossbar keeps the arbiter separate from the crossbar. This is not scalable as the radix increases: routing to and from the arbiter becomes more challenging as wires grow longer, and the arbitration logic grows more complex. The Swizzle-Switch* addresses this by combining the routing-dominated arbiter with the logic-dominated crossbar: its SRAM-like technology embeds priority bits within the crossbar and senses those bits to perform arbitration. The Swizzle-Switch has been shown to scale to radix-64 in 32nm technology while operating at 1.5 GHz.
*VLSIC 2011, ISSCC 2012, DAC 2012, JETCAS 2012, HotChips 2012
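The priority-bit arbitration can be illustrated with a least-recently-granted matrix arbiter, a simplified software model of per-output priority state such as the bits the Swizzle-Switch embeds at its cross-points. The class and its API are ours, for illustration only, and abstract away the circuit-level sensing:

```python
class MatrixArbiter:
    """Least-recently-granted matrix arbiter.
    prio[i][j] == True means requester i currently beats requester j."""

    def __init__(self, n):
        self.n = n
        # Start from a fixed order: lower index beats higher index.
        self.prio = [[i < j for j in range(n)] for i in range(n)]

    def arbitrate(self, requests):
        """Grant the active requester that beats every other active requester,
        then demote the winner below all others (the LRG update)."""
        for i in range(self.n):
            if requests[i] and all(self.prio[i][j] for j in range(self.n)
                                   if j != i and requests[j]):
                for j in range(self.n):
                    if j != i:
                        self.prio[i][j] = False
                        self.prio[j][i] = True
                return i
        return None  # no active requests

arb = MatrixArbiter(4)
print(arb.arbitrate([True, True, False, True]))  # grants 0, then demotes it
print(arb.arbitrate([True, True, False, True]))  # now grants 1
```

Storing the priority state with the switch itself, as the matrix does, is what removes the long wires to and from a separate arbiter.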

8 High-Radix Topologies
Why don't purely low- or high-radix topologies scale? As the radix of a router increases, its delay rises steeply; the Swizzle-Switch lessens this effect but does not eliminate it. The advantage of high-radix routers is a lower hop count. Low-radix routers are ideal for local communication, because router delay is small and local traffic only traverses a couple of routers, but bad for global communication due to the high hop count. High-radix routers are the opposite: ideal for global communication thanks to fewer hops, but bad for local communication due to slower routers. The tradeoff: symmetric high-radix topologies sacrifice the efficiency of local communication to achieve faster global communication.
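The tradeoff can be made concrete with a toy zero-load latency model. All the numbers below are assumptions for illustration, not measurements from the talk; the point is only the shape of the tradeoff:

```python
def zero_load_latency(hops, router_delay_ns, wire_delay_ns_per_hop):
    """Illustrative zero-load latency: each hop pays one router traversal
    plus one wire traversal. Contention and serialization are ignored."""
    return hops * (router_delay_ns + wire_delay_ns_per_hop)

# Assumed numbers: low-radix routers are ~2x faster but need ~3x the hops.
print(zero_load_latency(hops=10, router_delay_ns=1.0, wire_delay_ns_per_hop=0.5))  # low-radix, cross-chip
print(zero_load_latency(hops=3,  router_delay_ns=2.0, wire_delay_ns_per_hop=2.0))  # high-radix, cross-chip
print(zero_load_latency(hops=2,  router_delay_ns=1.0, wire_delay_ns_per_hop=0.5))  # low-radix, neighbor
print(zero_load_latency(hops=2,  router_delay_ns=2.0, wire_delay_ns_per_hop=0.5))  # high-radix, neighbor
```

With these assumed constants, the high-radix network wins on cross-chip traffic (12ns vs 15ns) but loses on neighbor traffic (5ns vs 3ns), which is exactly the tension the asymmetric designs resolve.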

9 Outline
Motivation
Symmetric Low-Radix and High-Radix Designs
Asymmetric High-Radix Designs: Super-Star, Super-StarX
Results
Conclusion

10 Asymmetric High-Radix Topologies
Low-radix topologies optimize local communication; high-radix topologies optimize global communication. We propose to use both: fast, low-radix local routers (LR) for local communication and slow, high-radix global routers (GR) for global communication. We call these asymmetric high-radix topologies; they merge the best features of low-radix and high-radix designs. The important question is what the radix of each router type should be. The heuristic is this: local routers connect nearby tiles over very short, fast wires, so they should be fast with lower radices. Global routers connect local routers across the entire chip; they reduce hop count by having higher radices, and it is acceptable for them to be slow because the long cross-chip wires are slow as well. We call this heuristic matching router speed to wire speed.

11 Asymmetric High-Radix Topologies
Decouple local and global communication, and match router speed to wire speed: local communication → short wires → fast low-radix routers; global communication → long wires → slow high-radix routers → reduced hop count. In summary, our asymmetric topologies decouple local and global communication with two types of routers. For local communication—where cores are close by, wires are short, and wire delay is small—the router should be fast and have a lower radix; since communication is local, the lower radix does not increase hop count significantly. For global communication the routes are inherently long and wire latency is large regardless of the number of pipeline stages, so global routers can afford to be slower, allowing their radix to be increased. With higher radix, the number of hops is reduced, which lowers network latency for global communication.

12 Super-Star
Our first design, Super-Star, is a hierarchical star topology. Each local router (LR) connects a cluster of tiles and handles local communication; each global router (GR) connects to all local routers and handles global communication. Local routers are not connected to each other.

13 Super-StarX
Our second design, Super-StarX, adds inter-cluster links that further reduce local communication latency, together with a locality-aware routing policy. In Super-Star, a local router communicating with an adjacent local router must still send its packet through the global router; because the layout is 2D, we can optimize this kind of local communication with extra inter-cluster links. The routing policy: at low load, traffic to adjacent clusters uses the inter-cluster links; at very high load, the global router is also used to take the burden off the inter-cluster links. Communication to all other clusters still uses the global router, as in Super-Star.
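The locality-aware policy described above amounts to a simple per-packet decision. This sketch is our own reading of that policy; the cluster adjacency map, the load metric, and the threshold are illustrative assumptions, not values from the paper:

```python
def choose_route(src_cluster, dst_cluster, adjacency, link_load,
                 load_threshold=0.75):
    """Locality-aware routing sketch: adjacent clusters use the direct
    inter-cluster link unless it is congested; all other traffic (and
    spill-over under high load) goes through a global router."""
    if dst_cluster in adjacency.get(src_cluster, ()):
        if link_load.get((src_cluster, dst_cluster), 0.0) < load_threshold:
            return "inter-cluster link"
    return "global router"

adjacency = {0: {1, 4}, 1: {0, 2, 5}}          # hypothetical 2D cluster grid
print(choose_route(0, 1, adjacency, {(0, 1): 0.2}))  # light load, neighbor
print(choose_route(0, 1, adjacency, {(0, 1): 0.9}))  # heavy load, spill to GR
print(choose_route(0, 7, adjacency, {}))             # non-adjacent cluster
```

The first call takes the inter-cluster link, while the other two fall back to a global router, matching the low-load/high-load behavior on the slide.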

14 Super-StarX
Multiple global routers provide higher throughput and energy proportionality. The more global routers, the more throughput—and the more power—so the number of global routers can be used to tune throughput and power.

15 Super-StarX Layout
To do an accurate analysis, we produced a layout of our topologies. In the Super-StarX layout there are 576 tiles in total; each local router connects a 4x4 cluster of tiles, and all local routers connect to all global routers. The annotated die is 25.2mm x 21.6mm, with wire lengths ranging from 3.6mm inter-cluster links up to 21.6mm global links. Wire lengths were calculated to accurately measure power and performance.

16 Evaluation
576 tiles; synthetic uniform random traffic with 4-flit messages; 128-bit Swizzle-Switch in 15nm; 4 VCs/port with a buffer depth of 5 flits/VC; power and delay from SPICE modeling in 32nm, scaled to 15nm. The network simulator is cycle-accurate: it models the routers, the staged arbitration logic, and the pipelined link stages, as well as contention in routers and buffer occupancy. Router power is broken down into its components (buffers and switches) for accuracy. Many variations of each topology were explored; the variant providing the best balance of latency, throughput, and power was chosen.

Router Information & Link Dimensions (cells lost in extraction shown as —):

  Topology     # Routers  Radix (Local)  Radix (Global)  Network Area  Avg. Link Length (mm)
  mesh         576        5              —               38.19         0.79
  cmesh-low    144        8              —               13.18         1.28
  cmesh-high   16         52             —               15.20         3.25
  fbfly        —          42             —               10.82         3.56
  superstar    36         24             —               18.24         1.80 / 12.90
  superstarX   —          28             —               21.45         2.11 / 11.30
  superring    4          17             11              7.12          6.48

17 Results: Latency
Compared with the Mesh topology, the Super-Star topologies achieve 39% more throughput and a 45% reduction in latency.

18 Results: Power
Compared with the Mesh topology, the Super-Star topologies use 40% less power. At the same 30W of power they deliver 3x the throughput: a 3x performance improvement over Mesh and 2.3x over Fbfly.

19 Results: Energy Proportionality
Available throughput can be tuned using the global routers, and a single global router can provide full network connectivity. Here we look more closely at the energy proportionality given by multiple global routers. In this study we statically vary the number of global routers; by doing so we tune the available throughput of the network, which gives energy proportionality. The number of active global routers can be a design-time or a dynamic decision: design with all 8 GRs and power-gate some during low network load, or, under a power budget, design with only as many GRs as the budget allows. GRs can be power-gated because even a single router provides full network connectivity. This cannot be done with a mesh, where all routers are needed for full connectivity. For example, to bound power to a 30W budget in a mesh, one must (1) underclock the routers (reducing frequency) and sacrifice latency, (2) use complex source-throttling mechanisms to control the injection rate so that network power does not exceed the budget, or (3) use adaptive routing methods.

20 Results: Localized Traffic
Nearest-neighbor traffic between LRs: maximum one hop. We analyzed the additional benefit of the inter-cluster links in the Super-StarX topology. The uniform random traffic pattern used so far does not bring out this benefit because its fraction of local traffic is small, so we studied a more localized pattern in which all packets are sent within the cluster or to an adjacent cluster. Per the routing policy, at low loads all one-hop traffic uses the inter-cluster links; only at very high loads do the global routers also route these packets, because contention makes it impossible to keep using only the local links. The inter-cluster links yield a 20% reduction in latency for localized traffic (5.5ns to 4.3ns), and they bring the power-throughput curve down: more throughput for the same power.

21 Results: Applications
Processor Configuration 576 nodes: 552 cores + 24 memory controllers (1 GHz frequency) Private L1 cache; shared, distributed L2 cache Workloads 4 workloads – 12 SPECCPU 2006 benchmarks each 1 workload – 8 SPLASH-2 benchmarks Metrics Performance (execution time in cycles) Power Results: Super-StarX Average over Mesh: 17% performance improvement, 39% less power Average over Fbfly: 32% performance improvement, 5% worse power

22 Conclusion Goal: a scalable on-chip network topology for kilo-core chips Made feasible by Swizzle-Switches Asymmetric high-radix topologies: Super-Star and Super-StarX Fast low-radix local routers, slow high-radix global routers Multiple global routers for higher throughput and energy proportionality Results: Super-StarX Average latency: 45% reduction over Mesh Power: 40% less over Mesh 30W TDP: 3x Mesh, 2.3x Fbfly

23 Thank You!
Scaling Towards Kilo-Core Processors with Asymmetric High-Radix Topologies
Nilmini Abeyratne, Reetuparna Das, Qingkun Li, Korey Sewell, Bharan Giridhar, Ronald G. Dreslinski, David Blaauw, and Trevor Mudge
University of Michigan, Ann Arbor
HPCA 19, February 27, 2013

24 BACKUP SLIDES

25 High-Radix Switch: Swizzle-Switch
We studied the scalability of the Swizzle-Switch: radix-64 with 128-bit channels in 32nm, running at 1.5 GHz with 2W of power and ~2mm² of area.

26 Super-Star Layout
To do an accurate analysis, we produced a layout of our topologies. In the Super-Star layout there are 576 tiles in total on a 25.2mm x 21.6mm die; each local router connects a 4x4 cluster of tiles, and all local routers connect to all global routers. Wire lengths were calculated to accurately measure power and performance.

27 Super-Ring (Anti-design)
Medium-radix local and global routers; limited connectivity hinders scalability. Finally, we explore a topology that does not adhere to our principle and show that it does not scale in performance. In the Super-Ring topology, the chip is divided into 4 logical quadrants, each with a global router connected only to the local routers in that quadrant. Both the local and global routers are medium-radix, so this design does not exploit the fact that global routers can be slow and connect to many local routers. Limiting the connectivity of the global routers limits throughput and thus hinders scalability.

