University of Utah 1 Interconnect Design Considerations for Large NUCA Caches Naveen Muralimanohar Rajeev Balasubramonian

University of Utah 2 Large Caches
- Cache hierarchies will dominate chip area
- Montecito has two private 12 MB L3 caches (27 MB including L2)
- Long global wires are required to transmit data/address
(Figure: Intel Montecito cache)

University of Utah 3 Wire Delay/Power
- Wire delays are costly for performance and power
- Latencies of 60 cycles to reach the ends of a chip (at 32 nm, 5 GHz)
- 50% of dynamic power is in interconnect switching (Magen et al., SLIP '04)
- CACTI* access time for a 24 MB cache is 90 cycles (5 GHz, 65 nm technology)
*CACTI version 3.2

University of Utah 4 Contributions
- Methodology to compute an optimal baseline NUCA organization
  - Performs 51% better than prior NUCA models
- Introduce heterogeneity in the network
  - Additional 15% improvement in performance

University of Utah 5 Cache Design Basics
(Figure: cache read path - input address, decoder, wordlines, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, output drivers, data output)

University of Utah 6 Existing Model - CACTI
(Figure: cache models with 4 sub-arrays and with 16 sub-arrays, showing decoder delay and wordline & bitline delay; decoder delay = H-tree delay + logic delay)

University of Utah 7 CACTI Shortcomings
- Access delay is equal to the delay of the slowest sub-array
  - Very high hit time for large caches
- Employs a separate bus for each cache bank in multi-banked caches
  - Not scalable
Potential solution - NUCA:
- Exploit different wire types and network design choices to reduce access latency
- Extend CACTI to model NUCA

University of Utah 8 Non-Uniform Cache Access (NUCA)*
- Large cache is broken into a number of small banks
- Employs an on-chip network for communication
- Access delay ∝ distance between the bank and the cache controller
(Figure: CPU & L1 connected to a grid of cache banks)
*(Kim et al., ASPLOS '02)

University of Utah 9 Extension to CACTI
- On-chip network
  - Wire model based on ITRS 2005 parameters
  - Grid network
  - 3-stage speculative router pipeline
- Network latency vs. bank access latency trade-off
  - Iterate over different bank sizes
  - Calculate the average network delay based on the number of banks and bank sizes
- Similarly, we also consider the power consumed by each organization (see the exploration sketch below)
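The exploration described above can be written as a short search loop. The following is a minimal sketch, assuming placeholder cost functions and made-up cycle constants (bank_access_cycles, avg_hops, ROUTER_CYCLES, LINK_CYCLES are illustrative, not CACTI output); it only shows the structure of the bank-count search.

import math

ROUTER_CYCLES = 3   # 3-stage speculative router pipeline (per hop)
LINK_CYCLES = 1     # assumed per-hop link traversal on global wires

def bank_access_cycles(bank_kb):
    # Placeholder for a CACTI call: larger banks are slower to access.
    return 3 + 2 * math.log2(bank_kb / 64)

def avg_hops(num_banks):
    # Rough average hop count on a square grid of banks with the
    # cache controller at one edge.
    return int(math.sqrt(num_banks))

def avg_access_cycles(cache_mb, num_banks):
    bank_kb = cache_mb * 1024 / num_banks
    network = avg_hops(num_banks) * (ROUTER_CYCLES + LINK_CYCLES)
    return bank_access_cycles(bank_kb) + network

if __name__ == "__main__":
    cache_mb = 32
    best = min((avg_access_cycles(cache_mb, n), n)
               for n in (4, 8, 16, 32, 64, 128, 256))
    print(f"best bank count for {cache_mb} MB: {best[1]} "
          f"({best[0]:.1f} cycles on average)")

A power-optimal organization would run the same loop with an energy estimate in place of avg_access_cycles.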

University of Utah 10 Effect of Network Delay (32 MB cache)
(Figure: contribution of network delay across bank counts; the earlier NUCA model is marked for comparison)

University of Utah 11 Power Centric Design (32 MB Cache)

University of Utah12  Wires can be tuned for low latency or low power  Low power wires with small, fewer repeaters Wire Design Space Global wires B wires 8x plane Semi global wires W wires 4x plane Power optimized PW wires 4x plane Fast, low bandwidth L wires 8x plane  Fat, low-bandwidth fast wires

University of Utah 13 Wire Model
(Figure: distributed RC wire model showing resistance, capacitance, side-wall capacitance, and adjacent-layer capacitance)

Wire Type  | Relative Latency | Relative Area | Dynamic Power | Static Power
B-Wire 8x  | 1x               | 1x            | 2.65α         | 1x
B-Wire 4x  | 1.6x             | 0.5x          | 2.9α          | 1.13x
L-Wire 8x  | 0.25x            | 8x            | 1.46α         | 0.55x
PW-Wire 4x | 3.2x             | 0.5x          | 0.87α         | 0.3x
(α: switching activity factor)

Ref: Banerjee et al., IEEE TED; 65 nm process
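The relative numbers in the table can be captured in a small lookup structure for first-order comparisons. A minimal sketch; only the relative factors come from the table above, while BASE_LINK_CYCLES and the hop count are assumed placeholders rather than values from the paper.

# Relative wire properties from the table above; the 8x B-wire is the baseline.
WIRES = {
    "B-8x":  {"latency": 1.0,  "area": 1.0, "dyn_power": 2.65, "static_power": 1.0},
    "B-4x":  {"latency": 1.6,  "area": 0.5, "dyn_power": 2.90, "static_power": 1.13},
    "L-8x":  {"latency": 0.25, "area": 8.0, "dyn_power": 1.46, "static_power": 0.55},
    "PW-4x": {"latency": 3.2,  "area": 0.5, "dyn_power": 0.87, "static_power": 0.30},
}

BASE_LINK_CYCLES = 4  # assumed cycles for a B-8x wire to cross one bank

def link_cycles(wire, hops):
    # Cycles spent on the wires alone (router delay not included).
    return hops * BASE_LINK_CYCLES * WIRES[wire]["latency"]

for name in WIRES:
    print(f"{name}: {link_cycles(name, hops=4):.1f} cycles over 4 hops")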

University of Utah 14 Access Time for Different Link Types
(Figure: average access time and bank access time vs. bank count, for 8x-wires, 4x-wires, and L-wires)

University of Utah 15 Cache Look-Up
Total cache access time = network delay + bank access + data transfer
- Network delay: 4-6 bits identify the cache bank
- Bank access: decoder, wordline, bitline delay (10-15 bits of the address), then comparator and output driver delay (rest of the address)
- The entire access happens in a sequential manner

University of Utah 16 Early Look-Up
- Send the partial address (10-15 index bits) on L-wires
- Initiate the bank lookup
- In parallel, send the complete address
- Complete the access
- We can hide ~70% of the bank access delay
(Figure: timelines comparing the traditional access - address, lookup, tag match, data transfer - with the early lookup; a sketch of this timing follows below)
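A back-of-the-envelope model of the two timelines in the figure. All cycle constants below are illustrative assumptions, not measured values; the only point is that the bank read overlaps the transfer of the remaining address bits.

ADDR_B_WIRE = 12  # assumed cycles to send the full address on B-wires
ADDR_L_WIRE = 3   # assumed cycles to send the index bits on fast L-wires
BANK_ACCESS = 17  # assumed bank access time (decoder, wordlines, bitlines)
TAG_MATCH = 2     # assumed comparator delay once the full address arrives
DATA_RETURN = 12  # assumed cycles to return the data block

# Traditional access: every step happens in sequence.
traditional = ADDR_B_WIRE + BANK_ACCESS + TAG_MATCH + DATA_RETURN

# Early lookup: the bank starts reading as soon as the index arrives on
# L-wires; the tag match waits for whichever finishes later, the bank
# read or the full-address transfer.
early = max(ADDR_L_WIRE + BANK_ACCESS, ADDR_B_WIRE) + TAG_MATCH + DATA_RETURN

print(f"traditional: {traditional} cycles, early lookup: {early} cycles")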

University of Utah 17 Aggressive Look-Up
(Figure: timelines for the traditional access, early lookup, and aggressive lookup; the aggressive lookup sends an additional 8 bits on L-wires for a partial tag match against the full tag entries, and the full tag match is done at the cache controller)

University of Utah 18 Aggressive Look-Up
Benefits:
- Reduction in link delay (for address transfer)
Drawbacks:
- Increase in traffic due to false matches: < 1%
- Marginal increase in link overhead: additional 8 bits
- More logic at the cache controller for the tag match
- Address transfer for writes happens on L-wires
(A sketch of the partial tag match follows below.)
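A minimal sketch of the partial-tag filtering implied above. The 8-bit partial tag width comes from the slide; the way layout, field names, and example tag values are hypothetical.

PARTIAL_TAG_BITS = 8

def partial_tag(full_tag):
    # Low-order tag bits that travel on the L-wires along with the index.
    return full_tag & ((1 << PARTIAL_TAG_BITS) - 1)

def bank_lookup(ways, incoming_partial):
    # At the bank: return every way whose stored partial tag matches.
    # False matches are filtered later at the cache controller, which
    # holds the complete address.
    return [w for w in ways if partial_tag(w["tag"]) == incoming_partial]

# Example set with one true match and one false (partial-only) match.
ways = [{"tag": 0x3A51, "data": "block A"}, {"tag": 0x1B51, "data": "block B"}]
request_tag = 0x3A51
candidates = bank_lookup(ways, partial_tag(request_tag))
hits = [w for w in candidates if w["tag"] == request_tag]  # controller-side full match
print(len(candidates), "candidate(s) returned,", len(hits), "true hit(s)")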

University of Utah 19 Heterogeneous Network
- Routers introduce significant overhead (especially in the L-network)
  - L-wires can transfer a signal across four banks in four cycles
  - A router adds three cycles at each hop
- Modify the network topology to take advantage of wire properties
  - Different topologies for address and data transfers

University of Utah 20 Hybrid Network
- Combination of point-to-point links and a shared bus
  + Reduction in latency
  + Reduction in power
  + Efficient use of L-wires
  - Low bandwidth
(Figure labels: Core, L2 Controller, Shared bus, Router; a rough latency comparison follows below)
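A rough latency comparison that motivates the hybrid address network. The per-hop and bus cycle counts are assumptions (only the 3-cycle router and four-banks-in-four-cycles L-wire figures appear earlier in the deck), and the split into "a few point-to-point hops plus one bus broadcast" is an interpretation of the figure labels, not the paper's exact topology.

ROUTER = 3  # cycles per router traversal (from the earlier slide)
L_LINK = 1  # L-wire cycles per bank-to-bank hop
BUS = 4     # assumed cycles to broadcast across a group of banks on the shared bus

def grid_address_latency(hops):
    # Pure point-to-point L-wire grid: a router at every hop.
    return hops * (ROUTER + L_LINK)

def hybrid_address_latency(row_hops):
    # Hybrid: a few point-to-point hops to reach the right region,
    # then one shared-bus broadcast to the destination bank.
    return row_hops * (ROUTER + L_LINK) + BUS

print("grid  :", grid_address_latency(hops=6), "cycles")
print("hybrid:", hybrid_address_latency(row_hops=2), "cycles")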

University of Utah 21 Experimental Setup
- Simplescalar with contention modeled in detail
- Single-core, 8-issue out-of-order processor
- 32 MB, 8-way set-associative on-chip L2 cache (SNUCA organization)
- 32 KB L1 I-cache and 32 KB L1 D-cache with a hit latency of 3 cycles
- Main memory latency: 300 cycles

University of Utah 22 CMP Setup
- Eight-core CMP (Simplescalar tool)
- 32 MB, 8-way set-associative (SNUCA organization)
- Two cache controllers
- Main memory latency: 300 cycles
(Figure: eight cores C1-C8 surrounding a grid of L2 banks)

University of Utah 23 Network Model
- Virtual channel flow control
  - Four virtual channels per physical channel
  - Credit-based flow control (for backpressure)
- Adaptive routing
  - Each hop must reduce the Manhattan distance between the current router and the destination (see the routing sketch below)
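A minimal sketch of the routing constraint stated above: at every hop the message moves to a neighbour that reduces the Manhattan distance to the destination. Grid size, coordinates, and the tie-breaking choice are illustrative; virtual channels and credit-based backpressure are not modelled here.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def closer_neighbours(cur, dst, grid_dim):
    # All neighbouring banks that are strictly closer to the destination.
    x, y = cur
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [n for n in candidates
            if 0 <= n[0] < grid_dim and 0 <= n[1] < grid_dim
            and manhattan(n, dst) < manhattan(cur, dst)]

# Example: route from the cache controller at (0, 0) to bank (3, 2) on a 4x4 grid.
cur, dst = (0, 0), (3, 2)
path = [cur]
while cur != dst:
    # A real adaptive router would pick among the options based on congestion.
    cur = closer_neighbours(cur, dst, grid_dim=4)[0]
    path.append(cur)
print(path)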

University of Utah 24 Cache Models

Model | Bank Access (cycles) | Bank Count | Network Link | Description
1     | 3                    | 512        | B-wires      | Based on prior work
2     | 17                   | 16         | B-wires      | CACTI-L2
3     | 17                   | 16         | B & L-wires  | Early Lookup
4     | 17                   | 16         | B & L-wires  | Agg. Lookup
5     | 17                   | 16         | B & L-wires  | Hybrid network
6     | 17                   | 16         | B-wires      | Upper bound

University of Utah 25 Performance Results (Uniprocessor)
Latency-sensitive benchmarks: ~70% of the SPEC suite
(Figure: results for the Prior work, CACTI-L2, Early, Aggr., Hybrid, and Ideal models; the CACTI-L2 model improves over prior work by 73%, and by 114% for L2-sensitive benchmarks)

University of Utah 26 Performance Results (Uniprocessor)
Latency-sensitive benchmarks: ~70% of the SPEC suite
(Figure: results for the Prior work, CACTI-L2, Early, Aggr., Hybrid, and Ideal models; annotated improvements over CACTI-L2: 6%, 8%, 9%, 15%, 20%, 19%, 26%)

University of Utah 27 Performance Results (CMP)

University of Utah 28 Performance Results (4X Wires)
- Wire-delay-constrained model
- Performance improvements are larger:
  - Early lookup: 7%
  - Aggressive model: 20%
  - Hybrid model: 29%

University of Utah 29 Conclusion
- Network parameters play a significant role in the performance of large caches
- The modified CACTI model, which includes network overheads, performs 51% better than previous models
- Methodology to compute an optimal baseline NUCA

University of Utah 30 Conclusion
- Wires can be tuned for different metrics
- Routers impose non-trivial overhead
- Address and data have different bandwidth needs
- We introduce heterogeneity at three levels:
  - Different types of wires for address and data transfers
  - Different topologies for the address and data networks
  - Different architectures within the address network (point-to-point and bus)
- Yields an additional performance improvement of 15% over the optimal baseline NUCA

University of Utah 31 Performance Results (Uniprocessor)
Model derived from CACTI: improvement over the model assumed in prior work - 73% (L2-sensitive: 114%)
(Figure: Prior work, CACTI-L2, Early, Aggr., Hybrid, Ideal)

University of Utah 32 Performance Results (Uniprocessor)
Early lookup technique: average improvement over Model 2 - 6% (L2-sensitive: 8%)
(Figure: Prior work, CACTI-L2, Early, Aggr., Hybrid, Ideal)

University of Utah 33 Performance Results (Uniprocessor)
Aggressive lookup technique: average improvement over Model 2 - 8% (L2-sensitive: 9%)
(Figure: Prior work, CACTI-L2, Early, Aggr., Hybrid, Ideal)

University of Utah 34 Performance Results (Uniprocessor)
Hybrid model: average improvement over Model 2 - 15% (L2-sensitive: 20%)
(Figure: Prior work, CACTI-L2, Early, Aggr., Hybrid, Ideal)

University of Utah 35 Outline: Problem Overview, Cache Design Basics, Extensions to CACTI, Effect of Network Parameters, Wire Design Space, Exploiting Heterogeneous Wires, Results

University of Utah 36 Outline: Problem Overview, Cache Design Basics, Extensions to CACTI, Effect of Network Parameters, Wire Design Space, Exploiting Heterogeneous Wires, Results

University of Utah 37 Outline: Problem Overview, Cache Design Basics, Extensions to CACTI, Effect of Network Parameters, Wire Design Space, Exploiting Heterogeneous Wires, Results

University of Utah 38 Outline: Overview, Cache Design, Effect of Network Parameters, Wire Design Space, Exploiting Heterogeneous Wires, Methodology, Results

University of Utah 39 Aggressive Look-Up
(Figure: an additional 8 bits of the address travel on L-wires for a partial tag match; the full tag match happens at the cache controller; the partial tag is compared against the full tag entries of Way 1 through Way n)