Presentation transcript:

University of Michigan, Ann Arbor Scaling Towards Kilo-Core Processors with Asymmetric High-Radix Topologies Nilmini Abeyratne, Reetuparna Das, Qingkun Li, Korey Sewell, Bharan Giridhar, Ronald G. Dreslinski, David Blaauw, and Trevor Mudge University of Michigan, Ann Arbor HPCA 19 February 27, 2013

Many-Core Trend Thousand-core chips are in our future; a scalable on-chip interconnect is required. (Mesh-based examples: Tilera TILE-Gx100, TILE64, Intel SCC.) Over the past decade, the number of cores on a single chip has been steadily increasing. Today, 100-core chips are available, and we predict there will be kilo-core chips with 1,000 cores in the future. With more and more cores, the movement of data around the chip becomes a bottleneck, which calls for a power-efficient and scalable interconnect for future kilo-core systems. Current multicore systems use crossbars or rings, but these do not scale because of their limited bisection bandwidth; current many-core systems with up to 100 cores use a mesh network. Can we take that one step further to 1,000 cores? Details on why rings and buses fall short: their bisection bandwidth is low (just 2 links for a ring) and does not grow with the number of cores, and their hop count is large (up to 63 hops for a 64-core unidirectional ring).
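
To make the scaling argument concrete, here is a back-of-the-envelope sketch (not taken from the talk) comparing bisection bandwidth and hop counts for a unidirectional ring versus a square mesh under uniform random traffic. The formulas and core counts are illustrative assumptions; they are only meant to reproduce the flavor of the claim above (bisection of 2 and 63 worst-case hops for a 64-core ring).

```python
# Sketch: why rings stop scaling while meshes fare better.
import math

def ring_stats(n_cores):
    bisection_links = 2                   # a ring is cut in exactly two places
    worst_hops = n_cores - 1              # unidirectional ring
    avg_hops = (n_cores - 1) / 2
    return bisection_links, worst_hops, avg_hops

def mesh_stats(n_cores):
    k = math.isqrt(n_cores)               # assume a square k x k mesh
    bisection_links = k                   # links crossing the midline
    worst_hops = 2 * (k - 1)
    avg_hops = 2 * (k * k - 1) / (3 * k)  # E[|dx|] + E[|dy|] for uniform pairs
    return bisection_links, worst_hops, avg_hops

for n in (64, 144, 1024):
    print(f"{n:5d} cores  ring: {ring_stats(n)}  mesh: {mesh_stats(n)}")
```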

Outline Motivation Symmetric Low-Radix and High-Radix Designs Asymmetric High-Radix Designs Super-Star Super-StarX Results Conclusion

Mesh Topology Popular in tile-based many-core processors: low complexity and a planar 2D layout. As mentioned earlier, the mesh topology is commonly used in tile-based many-core processors; Tilera's TILE64 is a 64-core chip with an 8x8 mesh network. But can such a mesh topology scale to hundreds of cores?

High-Radix Topologies An alternative to low-radix topologies, built using concentration. Next I studied an alternative topology, the high-radix mesh. High-radix topologies are designed using concentration: starting from a conventional low-radix mesh, I can take a subset of the tiles and consolidate their small routers into one large router, creating a concentrated cluster of tiles. Doing this across the entire chip results in a high-radix topology. Concentration improves network latency because packets traverse fewer hops; however, the single links between the routers become bottlenecks.
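
The following sketch illustrates what concentration buys for a 576-tile chip: folding c tiles into one router shrinks the router grid and cuts the average hop count. The concentration factors and the hop-count formula are illustrative assumptions, not the exact configurations evaluated in the paper, and the extra serialization on the shared links is ignored.

```python
# Sketch: average mesh hop count as the concentration factor grows.
import math

def avg_mesh_hops(num_routers):
    k = math.isqrt(num_routers)               # square router grid assumed
    return 2 * (k * k - 1) / (3 * k)          # uniform random traffic

TILES = 576
for c in (1, 4, 16):                          # tiles per router (illustrative)
    routers = TILES // c
    print(f"concentration {c:2d}: {routers:3d} routers, "
          f"avg hops ~ {avg_mesh_hops(routers):5.2f}")
```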

High-Radix Topologies Throughput is improved through additional connectivity: parallel links and express links. We try to improve throughput by increasing connectivity using parallel links and express links. An important component of high-radix topologies is the high-radix router itself. With conventional crossbar designs it is hard to build large routers within reasonable power constraints, so we need something more efficient, such as the Swizzle-Switch.

High-Radix Switch: Swizzle-Switch Traditional matrix-style crossbar: separate crossbar and arbiter; not scalable as the radix increases, because routing to/from the arbiter becomes more challenging and the arbitration logic grows more complex. Swizzle-Switch*: combines the routing-dominated crossbar with the logic-dominated arbiter; SRAM-like technology; scales to radix-64 in 32nm at 1.5GHz. Traditional routers have a matrix-style crossbar where the arbitration logic is separate from the crossbar itself. These crossbars are not scalable: as the radix increases, routing to/from the arbiter becomes more challenging as wires grow longer, and the arbitration logic grows more complex. The Swizzle-Switch addresses this problem by integrating the arbitration logic into the crossbar itself; its SRAM-like technology embeds priority bits within the crossbar and senses those bits to perform arbitration. The Swizzle-Switch has been shown to scale to a radix of 64 in 32nm technology while operating at 1.5GHz. *VLSIC 2011, ISSCC 2012, DAC 2012, JETCAS 2012, HotChips 2012
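
To make the embedded-priority idea concrete, here is a simplified software model of a matrix arbiter with per-pair priority bits and a least-recently-granted update, which is the general flavor of arbitration the slide describes. This is an assumption-laden sketch of the concept only; the real Swizzle-Switch is an SRAM-like circuit, not this Python loop, and the class and method names are hypothetical.

```python
# Sketch: matrix arbitration with embedded priority bits.
class MatrixArbiter:
    def __init__(self, n):
        self.n = n
        # prio[i][j] == True means input i currently beats input j.
        self.prio = [[i < j for j in range(n)] for i in range(n)]

    def arbitrate(self, requests):
        """requests: list of bools, one per input. Returns the granted input or None."""
        for i in range(self.n):
            if requests[i] and all(self.prio[i][j]
                                   for j in range(self.n)
                                   if j != i and requests[j]):
                # The winner drops to lowest priority (least recently granted).
                for j in range(self.n):
                    if j != i:
                        self.prio[i][j] = False
                        self.prio[j][i] = True
                return i
        return None

arb = MatrixArbiter(4)
print(arb.arbitrate([True, True, False, True]))   # -> 0 on the first pass
print(arb.arbitrate([True, True, False, True]))   # -> 1 once 0 has been demoted
```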

High-Radix Topologies [Chart: router delay and hop count versus router radix, comparing a conventional router with the Swizzle-Switch, annotated for local and global communication.] Here we analyze why purely low-radix or high-radix topologies don't scale. As the radix of the router increases, the router delay rises steeply; we lessen this effect by using the Swizzle-Switch. The advantage of high-radix routers is the lower hop count. The left (low-radix) side of the graph is ideal for local communication, because router delay is small and local traffic only requires a couple of routers, but it is bad for global communication due to the higher hop count. The right (high-radix) side is ideal for global communication due to fewer hops, but bad for local communication due to slower routers. What we see is a tradeoff: symmetric high-radix topologies trade off the efficiency of local communication to achieve faster global communication.
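
A toy zero-load latency model can show the shape of this trade-off: latency is roughly hop count times per-router delay plus wire delay. All delay and hop values in this sketch are invented for illustration; only the shape of the trade-off, not the numbers, reflects the argument in the talk.

```python
# Sketch: zero-load latency = hops * router_delay + wire_delay.
def latency(hops, router_delay, wire_delay):
    return hops * router_delay + wire_delay

LOW_RADIX_DELAY, HIGH_RADIX_DELAY = 0.7, 1.5     # ns per router (assumed)

# (hops for local traffic, hops for global traffic), both assumed
scenarios = {"low-radix mesh": (2, 16),
             "high-radix net": (2, 3)}

for name, (local_hops, global_hops) in scenarios.items():
    rd = LOW_RADIX_DELAY if name.startswith("low") else HIGH_RADIX_DELAY
    print(f"{name:15s} local: {latency(local_hops, rd, 0.2):4.1f} ns  "
          f"global: {latency(global_hops, rd, 2.0):4.1f} ns")
```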

Outline Motivation Symmetric Low-Radix and High-Radix Designs Asymmetric High-Radix Designs Super-Star Super-StarX Results Conclusion

Asymmetric High-Radix Topologies Low-radix topologies optimize local communication; high-radix topologies optimize global communication. (LR = local router: fast, low-radix. GR = global router: slow, high-radix.) To summarize, low-radix topologies optimize for local communication and high-radix topologies optimize for global communication. We propose to use both: local routers for local communication AND global routers for global communication. We call these asymmetric high-radix topologies, and they merge the best features of both low-radix and high-radix designs. The important question now is what the radix of each type of router should be. The heuristic is this: local routers connect tiles that are close by, where the wires are very short and fast, so local routers should be fast, with lower radices. Global routers, on the other hand, connect local routers across the entire chip; they reduce the hop count by having higher radices, and it is OK for them to be slow because the long wires across the chip are slow as well. We call this heuristic matching router speed to wire speed.

Asymmetric High-Radix Topologies Decouple local and global communication; match router speed to wire speed. Local communication → short wires → fast, low-radix routers. Global communication → long wires → slow, high-radix routers → reduced hop count. In summary, our asymmetric topologies decouple local and global communication with two types of routers. From our experiments, the heuristic we came up with to determine the radix is to match router speed to wire speed: local routers, which connect short wires, should be fast and have a low radix, while global routers should reduce hop count with as many connections as possible. This results in slower routers, but that is OK because they also connect longer, slower wires. Details: for local communication, where cores are close by, wires are short, and wire delay is small, the router should be fast and have a lower radix; since communication is local, the lower radix does not increase hop count significantly. For global communication the routes will be inherently long and wire latency will be large regardless of the number of pipeline stages; hence, global routers can afford to be slower, allowing their radix to be increased. With a higher radix, the number of hops is reduced, which results in lower network latency for global communication.

Super-Star Each local router connects a cluster of tiles; each global router connects to all local routers. Our first design is called Super-Star. It is a hierarchical star topology: local routers handle local communication, global routers handle global communication, and the local routers are not connected to each other. A sketch of this connectivity follows.
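
The sketch below builds the Super-Star connectivity for the 576-tile configuration used later in the talk: 36 local routers, each owning a 4x4 cluster of tiles, with every global router connected to every local router and no links between local routers. The choice of 8 global routers is one configuration mentioned in the deck; the function and variable names are illustrative assumptions.

```python
# Sketch: Super-Star adjacency and the resulting router radices.
def build_super_star(num_tiles=576, tiles_per_cluster=16, num_global=8):
    num_local = num_tiles // tiles_per_cluster          # 36 local routers
    local_ports = {lr: list(range(lr * tiles_per_cluster,
                                  (lr + 1) * tiles_per_cluster))
                   for lr in range(num_local)}
    # Every global router links to every local router (a "star of stars").
    global_links = {gr: list(range(num_local)) for gr in range(num_global)}
    local_radix = tiles_per_cluster + num_global        # tiles + global routers
    global_radix = num_local                            # all local routers
    return local_ports, global_links, local_radix, global_radix

_, _, lr_radix, gr_radix = build_super_star()
print("local-router radix:", lr_radix, " global-router radix:", gr_radix)
```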

Super-StarX Inter-cluster links further reduce local communication latency, paired with a locality-aware routing policy: at low load, traffic to adjacent clusters uses the inter-cluster links; at high load, it uses the inter-cluster links plus the global routers. Our second design is called Super-StarX. In Super-Star, for a local router to communicate with an adjacent local router, it still needs to send its packet through the global router. Because we have a 2D layout, we can optimize this type of local communication with extra inter-cluster links. In addition to these links, we implement a locality-aware routing policy: communication to adjacent clusters uses the inter-cluster links, and during very high network load we also utilize the global router to take the burden off the inter-cluster links. Communication to all other clusters still uses the global router, as in Super-Star.
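
The routing decision just described can be written down compactly. This sketch assumes each packet knows its source and destination cluster and that the router can observe congestion on its inter-cluster link; the function names and the congestion test are illustrative, not the exact hardware policy.

```python
# Sketch: locality-aware routing choice in Super-StarX.
def route(src_cluster, dst_cluster, adjacent, inter_cluster_busy):
    """Pick the resource a packet should use next."""
    if dst_cluster == src_cluster:
        return "local router only"            # stays inside the 4x4 cluster
    if adjacent(src_cluster, dst_cluster):
        if not inter_cluster_busy:
            return "inter-cluster link"       # low load: bypass the global router
        return "global router"                # high load: spill onto the GR
    return "global router"                    # non-adjacent clusters always use a GR

# Example with a toy 1-D adjacency test over cluster ids.
adj = lambda a, b: abs(a - b) == 1
print(route(3, 4, adj, inter_cluster_busy=False))  # -> inter-cluster link
print(route(3, 4, adj, inter_cluster_busy=True))   # -> global router
print(route(3, 9, adj, inter_cluster_busy=False))  # -> global router
```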

Super-StarX Multiple global routers provide higher throughput and energy proportionality. Multiple global routers improve throughput further and provide energy proportionality: the more global routers, the more throughput and the more power. The number of active global routers can therefore be used to tune throughput and power.

Super-StarX Layout [Layout figure: 576 tiles in total; local routers serve 4x4 tile clusters of roughly 3.6 mm on a side; inter-cluster links between adjacent clusters; labeled link lengths between 3.6 mm and 18 mm; die approximately 21.6 mm x 25.2 mm.] To do an accurate analysis, we did a layout of our topologies. Here is the Super-StarX layout. Local routers connect a 4x4 cluster of tiles, and all local routers connect to all global routers. Wire lengths were calculated from the layout to accurately measure power and performance.

Evaluation 576 tiles; synthetic uniform random traffic with 4-flit messages; 128-bit Swizzle-Switch in 15nm; 4 VCs/port with a buffer depth of 5 flits/VC; power and delay from SPICE modeling in 32nm, scaled to 15nm.
Router Information & Link Dimensions (per topology: routers, radix, network area in mm², average link length in mm):
mesh: 576 routers, radix 5, area 38.19, avg. link 0.79
cmesh-low: 144 routers, radix 8, area 13.18, avg. link 1.28
cmesh-high: 16 routers, radix 52, area 15.20, avg. link 3.25
fbfly: radix 42, area 10.82, avg. link 3.56
superstar: 36 local routers, local radix 24, area 18.24, avg. link 1.80 (local) / 12.90 (global)
superstarX: local radix 28, area 21.45, avg. link 2.11 (local) / 11.30 (global)
superring: 4 global routers, radix 17 (local) / 11 (global), area 7.12, avg. link 6.48
The network simulator is cycle-accurate and models the network in detail: the routers, the arbitration logic in stages, and the pipelined stages of the links, as well as contention in routers and buffer occupancy. We broke the router down into its components (buffers and switches) to get accurate power numbers. Many variations of each topology were explored; the configuration that provided the best balance of latency, throughput, and power was chosen.
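
For reference, the evaluation parameters above can be collected into the kind of configuration record a cycle-accurate NoC simulator might consume. The field names are hypothetical; the values come from the slide.

```python
# Sketch: evaluation parameters as a configuration record.
from dataclasses import dataclass

@dataclass
class NocEvalConfig:
    tiles: int = 576
    traffic: str = "synthetic uniform random"
    message_size_flits: int = 4
    channel_width_bits: int = 128          # 128-bit Swizzle-Switch datapath
    technology_nm: int = 15                # scaled from 32 nm SPICE models
    virtual_channels_per_port: int = 4
    buffer_depth_flits_per_vc: int = 5

print(NocEvalConfig())
```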

Results: Latency Compared with the mesh topology, the Super-Star topologies achieve 39% more throughput and a 45% reduction in latency.

Results: Power Compared with the mesh topology, the Super-Star topologies consume 40% less power. At 30W they also deliver 3x more throughput: a 3x performance improvement over Mesh and a 2.3x performance improvement over Fbfly for the same 30W of power.

Results: Energy Proportionality Available throughput can be tuned using the global routers; a single global router can provide full network connectivity. Now we look more closely at the energy proportionality given to us by multiple global routers. In this study, we statically increase the number of global routers. We can tune the available throughput of the network by varying the number of global routers, and hence we get energy proportionality. The number of active global routers can be decided at design time or dynamically: one can design with all 8 GRs and power-gate some of them during low network load, or, under a power budget, build in only the number of GRs that meets that budget. GRs can be power-gated because even a single global router provides full network connectivity. This cannot be done with a mesh, where all routers are needed to provide full connectivity. For example, to bound the power budget to, say, 30W, a mesh must (1) underclock its routers (reduce frequency) and sacrifice latency, (2) use complex source-throttling mechanisms to control the injection rate so that network power does not exceed the budget, or (3) resort to adaptive routing methods.
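
The knob described above can be sketched as a simple budget check: keep only as many global routers powered on as the power budget allows, relying on the fact that even one global router preserves full connectivity. The per-router and baseline power figures below are placeholders, not measurements from the paper.

```python
# Sketch: choosing how many global routers to keep active under a power budget.
def active_global_routers(power_budget_w, base_power_w, per_gr_power_w,
                          max_grs=8):
    """Largest number of global routers (>= 1) that fits the budget."""
    for grs in range(max_grs, 0, -1):
        if base_power_w + grs * per_gr_power_w <= power_budget_w:
            return grs
    return 1                               # one GR is the connectivity floor

# Example: a 30 W budget with assumed 14 W of local routers/links, 2.5 W per GR.
print(active_global_routers(power_budget_w=30, base_power_w=14,
                            per_gr_power_w=2.5))   # -> 6 active global routers
```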

Results: Localized Traffic Nearest-neighbor traffic between LRs (maximum one hop), routed over inter-cluster links plus global routers. We wanted to analyze more closely the additional benefit of the inter-cluster links in the Super-StarX topology. The uniform random traffic pattern we have been using thus far does not bring out this benefit because the portion of local traffic is small, so we studied a more localized traffic pattern in which all packets are sent within the cluster or to an adjacent cluster. To recap the routing policy: at low loads, all one-hop traffic uses inter-cluster links, and only at very high loads do the global routers also route these packets. We are able to show a 20% reduction in latency for localized traffic (from 5.5 ns to 4.3 ns). Below the load marked by the red line in the graph, traffic uses mostly the local links; above it, contention prevents the local links from carrying the traffic alone. We are also able to bring down the power-throughput curve: more throughput for the same power.

Results: Applications Processor configuration: 576 nodes (552 cores + 24 memory controllers at 1 GHz); private L1 caches; a shared, distributed L2 cache. Workloads: 4 workloads of 12 SPEC CPU2006 benchmarks each, and 1 workload of 8 SPLASH-2 benchmarks. Metrics: performance (execution time in cycles) and power. Results for Super-StarX: on average, a 17% performance improvement and 39% less power than Mesh, and a 32% performance improvement with 5% worse power than Fbfly.

Conclusion Goal: a scalable on-chip network topology for kilo-core chips, made feasible by Swizzle-Switches. Asymmetric high-radix topologies (Super-Star and Super-StarX) combine fast low-radix local routers with slow high-radix global routers, and use multiple global routers for higher throughput and energy proportionality. Results for Super-StarX: average latency 45% lower than Mesh; power 40% lower than Mesh; throughput at a 30W TDP 3x that of Mesh and 2.3x that of Fbfly.

University of Michigan, Ann Arbor Thank You! Scaling Towards Kilo-Core Processors with Asymmetric High-Radix Topologies Nilmini Abeyratne, Reetuparna Das, Qingkun Li, Korey Sewell, Bharan Giridhar, Ronald G. Dreslinski, David Blaauw, and Trevor Mudge University of Michigan, Ann Arbor HPCA 19 February 27, 2013

BACKUP SLIDES

High-Radix Switch: Swizzle-Switch We studied the scalability of the Swizzle-Switch: radix-64, 128-bit channels, in 32nm, at 1.5 GHz, consuming 2W of power and ~2 mm² of area.

Super-Star Layout [Layout figure: 576 tiles in total; local routers serve 4x4 tile clusters; die approximately 21.6 mm x 25.2 mm.] To do an accurate analysis, we did a layout of our topologies. Here is the Super-Star layout: local routers connect a 4x4 cluster of tiles, and all local routers connect to all global routers. Wire lengths were calculated from the layout to accurately measure power and performance.

Super-Ring (Anti-design) Medium-radix local and global routers; the limited connectivity hinders scalability. Finally, we explore a topology that does not adhere to our principle and show that it does not scale in performance. In the Super-Ring topology, the chip is divided into 4 logical quadrants, and each quadrant has a global router connected only to the local routers in that quadrant. Both the local and global routers are medium-radix, so we do not take advantage of the fact that global routers can be slow and can connect to many local routers. Limiting the connectivity of the global routers limits throughput and hence hinders scalability.