A Case for Interconnect-Aware Architectures

Presentation transcript:

A Case for Interconnect-Aware Architectures
Naveen Muralimanohar, Rajeev Balasubramonian
School of Computing, University of Utah
Work supported in part by NSF award CCF-0430063 and NSF CAREER award CCF-0545959

Abstract
In future multi-core chips, a large fraction of chip area will be dedicated to cache hierarchies and to the interconnects between their components. Large caches will likely be partitioned into many banks connected by an on-chip network. We show results from a design space exploration of non-uniform cache architecture (NUCA) organizations, and we make a case for a heterogeneous interconnect architecture in which each link is composed of different wire types.

Key Observations
– Network parameters (wire and router types) have a major impact on cache access latency.
– Some parts of a message are more important than others.
– Wires can be designed in a variety of ways (latency-optimized L-wires; bandwidth- and power-optimized B-wires).

Contributions
– Design space exploration: CACTI tool extensions that model average network and bank delay/power for each cache bank size.
– Cache access optimizations: low-latency wires carry a subset of the address to initiate a prefetch out of a cache bank.
– Hybrid architectures: the address network employs low-latency wires and a combination of buses and point-to-point links.

Models Simulated
Design space exploration:
– Model 1 – 512 banks, B-wires
– Model 2 – 16 banks, B-wires
– Model 3 – 16 banks, B- and L-wires, Early Lookup
– Model 4 – 16 banks, B- and L-wires, Aggressive Lookup
Hybrid model:
– Model 5 – 16 banks, B- and L-wires
– Model 6 – 1-cycle address transfer (upper bound)
– Model 7 – 16 banks, L-wires for both address and data
– Model 8 – 16 banks, address transfer on L-wires

Baseline cache access (Figure 1: CPU, address and data wires, cache bank):
Access Time = Address transmission + Bank access + Data transmission

Leveraging Interconnect Choices for Performance Optimizations
Early Look-up: a partial address (16 bits) is sent on fast wires to start indexing into the cache bank while the remaining address bits are sent on slower wires; transmission of the full address and the bank look-up happen in parallel. (Figure: partial address on fast L-wires; full address on power- and bandwidth-optimized 8x and 4x wires.)
Aggressive Look-up: the partial address sent on fast wires performs a partial tag match, and all matching blocks are aggressively sent to the cache controller, which completes the full tag comparison. (Figure: cache access with partial tag match.)
Hybrid Network: the aggressive look-up approach with low-latency wires allows the address network to employ fewer routers – a bus broadcasts the address to all banks in a row, while the data network continues to employ a grid network. (Figure: cores, L2 controller, a shared bus per row of banks, routers on the data grid.)
Illustrative code sketches of these optimizations appear at the end of this transcript.

Results
Figure 2: design space exploration results – energy and latency as a function of the number of banks.

Methodology
– SimpleScalar simulator, SPEC2k benchmarks
– Network parameters: grid topology, virtual-channel flow control, partially adaptive routing, four virtual channels per physical channel, router pipeline delay of three cycles
– Wire model: ITRS 2005, analytical equations (Ho et al.)

Future Work
– A comprehensive tool to identify optimal cache organizations: explore various link and router types; model network contention (especially in CMPs)
– Exploiting heterogeneity within routers

For more details and related work (HPCA 2005, ISCA 2006), visit:
Draft paper: http://www.cs.utah.edu/~rajeev/pubs/draft06.pdf
Related papers: http://www.cs.utah.edu/~rajeev/research.html
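
To make the Early Look-up timing concrete, here is a minimal latency sketch in Python. The cycle counts are illustrative assumptions, not measurements from the poster; only the access-time decomposition itself comes from the transcript.

```python
# Minimal latency sketch of the Early Look-up optimization. All cycle
# counts below are illustrative assumptions, not values from the poster.

def baseline_access(addr_cycles, bank_cycles, data_cycles):
    # Access Time = Address transmission + Bank access + Data transmission
    return addr_cycles + bank_cycles + data_cycles

def early_lookup_access(fast_addr_cycles, slow_addr_cycles,
                        bank_cycles, data_cycles):
    # The 16-bit partial address on fast L-wires starts indexing the bank;
    # the remaining bits arrive on slower B-wires in parallel. The access
    # cannot complete before the full address has arrived.
    return max(fast_addr_cycles + bank_cycles, slow_addr_cycles) + data_cycles

base = baseline_access(addr_cycles=12, bank_cycles=8, data_cycles=14)
early = early_lookup_access(fast_addr_cycles=4, slow_addr_cycles=12,
                            bank_cycles=8, data_cycles=14)
print(base, early)  # 34 vs. 26 cycles under these assumed numbers
```

Under these assumptions the bank look-up hides most of the slow-wire address transfer, which is the intuition behind Model 3.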
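The Aggressive Look-up scheme can be sketched as a partial tag comparison in the bank followed by full resolution at the controller. The data structures and the 16-bit partial width below are hypothetical; the poster only specifies the idea of matching on a subset of address bits and forwarding all candidates.

```python
# Sketch of Aggressive Look-up (hypothetical structures): only the
# low-order tag bits travel on fast L-wires, a partial tag match selects
# candidate blocks in the bank, and every candidate is forwarded to the
# cache controller, which finishes the full comparison once the whole
# address has arrived on slower wires.

PARTIAL_BITS = 16  # assumed width of the partial address on L-wires

def partial_match(bank_tags, full_tag):
    mask = (1 << PARTIAL_BITS) - 1
    partial = full_tag & mask
    # All ways whose low-order tag bits match are candidates.
    return [(way, tag) for way, tag in enumerate(bank_tags)
            if tag & mask == partial]

def controller_resolve(candidates, full_tag):
    # The full comparison at the controller discards false positives.
    for way, tag in candidates:
        if tag == full_tag:
            return way
    return None  # miss

ways = [0x1A2B0042, 0x77770042, 0x1A2B9999]   # made-up tags
hits = partial_match(ways, full_tag=0x1A2B0042)
print(hits, controller_resolve(hits, 0x1A2B0042))
```

Note the trade-off this exposes: the partial match may forward extra blocks (way 1 above), spending data-network bandwidth to shave latency off the common case.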
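For the Hybrid Network, a back-of-the-envelope comparison of address delivery on the routed grid versus the row-bus scheme. The three-cycle router delay comes from the methodology section; the link and bus latencies are invented assumptions.

```python
# Back-of-the-envelope address-delivery comparison: routed grid vs. the
# hybrid row-bus scheme. ROUTER_CYCLES matches the methodology; the other
# constants are illustrative assumptions.

ROUTER_CYCLES = 3    # router pipeline delay, from the methodology section
LINK_CYCLES_B = 2    # assumed per-hop latency on B-wires
BUS_CYCLES_L = 4     # assumed row-wide broadcast latency on L-wires

def grid_address_latency(hops):
    # On the grid, every hop pays a router traversal plus a link.
    return hops * (ROUTER_CYCLES + LINK_CYCLES_B)

def hybrid_address_latency(vertical_hops):
    # Reach the destination row, then one bus broadcast covers all banks
    # in that row -- no routers on the address path within the row.
    return vertical_hops * LINK_CYCLES_B + BUS_CYCLES_L

for hops in (2, 4, 6):
    print(hops, grid_address_latency(hops), hybrid_address_latency(hops // 2))
```

The point is structural rather than numeric: removing routers from the address path eliminates the dominant per-hop cost, while the wide data network keeps its grid.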
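Finally, a toy version of the design space exploration that the CACTI extensions automate: sweeping the bank count trades shorter bank access against more network hops. The delay model below is invented for illustration and stands in for the real CACTI bank and network models.

```python
# Toy design-space sweep in the spirit of the CACTI extensions. As the
# bank count grows, each bank gets faster but the network adds hops.
# The scaling functions are illustrative assumptions only.

import math

def bank_access_cycles(total_kb, banks):
    # Smaller banks are faster; sqrt scaling is a rough stand-in for the
    # real CACTI bank model.
    return max(2, int(math.sqrt(total_kb / banks)))

def avg_network_cycles(banks, per_hop=5):
    # Average hop count grows with the side of a sqrt(banks) x sqrt(banks)
    # grid (average Manhattan distance, up to a constant).
    return per_hop * math.sqrt(banks)

def total_latency(total_kb, banks):
    return bank_access_cycles(total_kb, banks) + avg_network_cycles(banks)

best = min((total_latency(32 * 1024, b), b) for b in (4, 16, 64, 256, 512))
print("best bank count:", best[1])
```

Even this crude model reproduces the qualitative result in Figure 2: latency is minimized at an intermediate bank count, not at the extremes represented by Models 1 and 2.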