1 Lecture 11: Large Cache Design Topics: large cache basics and… An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches, Kim et al.,

Slides:

Advertisements

Similar presentations

Virtual Hierarchies to Support Server Consolidation Michael Marty and Mark Hill University of Wisconsin - Madison.

Advertisements

Directory-Based Cache Coherence Marc De Melo. Outline Non-Uniform Cache Architecture (NUCA) Cache Coherence Implementation of directories in multicore.

Cache Coherence “Can we do a better job of supporting cache coherence?” Ross Daly Chan Kim.

Jaewoong Sim Alaa R. Alameldeen Zeshan Chishti Chris Wilkerson Hyesoon Kim MICRO-47 | December 2014.

1 Lecture 17: Large Cache Design Papers: Managing Distributed, Shared L2 Caches through OS-Level Page Allocation, Cho and Jin, MICRO’06 Co-Operative Caching.

Managing Wire Delay in Large CMP Caches Bradford M. Beckmann David A. Wood Multifacet Project University of Wisconsin-Madison MICRO /8/04.

Nikos Hardavellas, Northwestern University

Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring Lei Jin and Sangyeun Cho Dept. of Computer Science University.

Non-Uniform Cache Architecture Prof. Hsien-Hsin S

CSIE30300 Computer Architecture Unit 10: Virtual Memory Hsin-Chou Chi [Adapted from material by and

1 Lecture 16: Large Cache Design Papers: An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches, Kim et al., ASPLOS’02 Distance Associativity.

1 Lecture 11: Large Cache Design IV Topics: prefetch, dead blocks, cache networks.

1 Lecture 12: Large Cache Design Papers (papers from last class and…): Co-Operative Caching for Chip Multiprocessors, Chang and Sohi, ISCA’06 Victim Replication,

HK-NUCA: Boosting Data Searches in Dynamic NUCA for CMPs Javier Lira ψ Carlos Molina ф Antonio González ψ,λ λ Intel Barcelona Research Center Intel Labs.

1 Lecture 18: Large Caches, Multiprocessors Today: NUCA caches, multiprocessors (Sections ) Reminder: assignment 5 due Thursday (don’t procrastinate!)

1 Lecture 15: Virtual Memory and Large Caches Today: TLB design and large cache design basics (Sections )

1 Lecture 16: Large Cache Innovations Today: Large cache design and other cache innovations Midterm scores  91-80: 17 students  79-75: 14 students 

Utilizing Shared Data in Chip Multiprocessors with the Nahalal Architecture Zvika Guz, Idit Keidar, Avinoam Kolodny, Uri C. Weiser The Technion – Israel.

1 Lecture 19: Networks for Large Cache Design Papers: Interconnect Design Considerations for Large NUCA Caches, Muralimanohar and Balasubramonian, ISCA’07.

1 Lecture 8: Large Cache Design I Topics: Shared vs. private, centralized vs. decentralized, UCA vs. NUCA, recent papers.

1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

Lecture 17: Virtual Memory, Large Caches

1 Lecture 10: Large Cache Design III Topics: Replacement policies, prefetch, dead blocks, associativity Sign up for class mailing list Pseudo-LRU has a.

1 Lecture 15: Large Cache Design Topics: innovations for multi-mega-byte cache hierarchies Reminders:  Assignment 5 posted.

1 Dynamic Hardware-Assisted Software-Controlled Page Placement to Manage Capacity Allocation and Sharing within Caches Manu Awasthi, Kshitij Sudan, Rajeev.

CS 7810 Lecture 17 Managing Wire Delay in Large CMP Caches B. Beckmann and D. Wood Proceedings of MICRO-37 December 2004.

University of Utah 1 The Effect of Interconnect Design on the Performance of Large L2 Caches Naveen Muralimanohar Rajeev Balasubramonian.

TLC: Transmission Line Caches Brad Beckmann David Wood Multifacet Project University of Wisconsin-Madison 12/3/03.

Non-Uniform Cache Architectures for Wire Delay Dominated Caches Abhishek Desai Bhavesh Mehta Devang Sachdev Gilles Muller.

1 University of Utah & HP Labs 1 Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 Naveen Muralimanohar Rajeev Balasubramonian.

ECE8833 Polymorphous and Many-Core Computer Architecture Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Lecture 5 Non-Uniform Cache.

Design and Management of 3D CMP’s using Network-in-Memory Feihui Li et.al. Penn State University (ISCA – 2006)

Javier Lira (Intel-UPC, Spain)Timothy M. Jones (U. of Cambridge, UK) Carlos Molina (URV, Spain)Antonio.

Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture.

10/18: Lecture topics Memory Hierarchy –Why it works: Locality –Levels in the hierarchy Cache access –Mapping strategies Cache performance Replacement.

Computer Architecture Memory organization. Types of Memory Cache Memory Serves as a buffer for frequently accessed data Small  High Cost RAM (Main Memory)

Computer Architecture Lecture 26 Fasih ur Rehman.

1 Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5)

Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs Javier Lira ψ Carlos Molina ф Antonio González λ λ Intel Barcelona Research.

Managing Distributed, Shared L2 Caches through OS-Level Page Allocation Jason Bosko March 5 th, 2008 Based on “Managing Distributed, Shared L2 Caches through.

1 Lecture 13: Cache, TLB, VM Today: large caches, virtual memory, TLB (Sections 2.4, B.4, B.5)

Analysis of NUCA Policies for CMPs Using Parsec Benchmark Suite Javier Lira ψ Carlos Molina ф Antonio González λ λ Intel Barcelona Research Center Intel.

Managing Distributed, Shared L2 Caches through OS-Level Page Allocation Sangyeun Cho and Lei Jin Dept. of Computer Science University of Pittsburgh.

CMP L2 Cache Management Presented by: Yang Liu CPS221 Spring 2008 Based on: Optimizing Replication, Communication, and Capacity Allocation in CMPs, Z.

ASPLOS’02 Presented by Kim, Sun-Hee.  Technology trends ◦ The rate of frequency scaling is slowing down  Performance must come from exploiting concurrency.

1 Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1)

By Islam Atta Supervised by Dr. Ihab Talkhan

1 Lecture 12: Large Cache Design Topics: Shared vs. private, centralized vs. decentralized, UCA vs. NUCA, recent papers.

Optimizing Replication, Communication, and Capacity Allocation in CMPs Z. Chishti, M. D. Powell, and T. N. Vijaykumar Presented by: Siddhesh Mhambrey Published.

1 Lecture: Virtual Memory Topics: virtual memory, TLB/cache access (Sections 2.2)

1 Lecture 7: PCM Wrap-Up, Cache coherence Topics: handling PCM errors and writes, cache coherence intro.

LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors Javier Lira ψ Carlos Molina ψ,ф Antonio González ψ,λ λ Intel Barcelona Research.

15-740/ Computer Architecture Lecture 18: Caching in Multi-Core Prof. Onur Mutlu Carnegie Mellon University.

1 Lecture 29: Interconnection Networks Papers: Express Virtual Channels: Towards the Ideal Interconnection Fabric, ISCA’07, Princeton Interconnect Design.

1 Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5)

Lecture: Large Caches, Virtual Memory

ASR: Adaptive Selective Replication for CMP Caches

Lecture: Large Caches, Virtual Memory

Lecture: Cache Hierarchies

Lecture: Large Caches, Virtual Memory

Lecture: Cache Hierarchies

Lecture 13: Large Cache Design I

Lecture 12: Cache Innovations

Optical Overlay NUCA: A High Speed Substrate for Shared L2 Caches

Lecture: Cache Innovations, Virtual Memory

Lecture 15: Large Cache Design III

Lecture 14: Large Cache Design II

Lecture: Cache Hierarchies

Chapter Five Large and Fast: Exploiting Memory Hierarchy

Presentation transcript:

1 Lecture 11: Large Cache Design Topics: large cache basics and… An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches, Kim et al., ASPLOS’02 Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures, Chishti et al., MICRO’03 Managing Wire Delay in Large Chip-Multiprocessor Caches, Beckmann and Wood, MICRO’04 Managing Distributed, Shared L2 Caches through OS-Level Page Allocation, Cho and Jin, MICRO’06

2 Shared Vs. Private Caches in Multi-Core Advantages of a shared cache:  Space is dynamically allocated among cores  No wastage of space because of replication  Potentially faster cache coherence (and easier to locate data on a miss) Advantages of a private cache:  small L2  faster access time  private bus to L2  less contention

Core 0 L1 D$ L1 I$ Core 1 L1 D$ L1 I$ Core 2 L1 D$ L1 I$ Core 3 L1 D$ L1 I$ Core 4 L1 D$ L1 I$ Core 5 L1 D$ L1 I$ Core 6 L1 D$ L1 I$ Core 7 L1 D$ L1 I$ Shared L2 Cache and Directory State L2 Cache Controller Scalable Non-broadcast Interconnect

Core 0 L1 D$ L1 I$ Core 1 L1 D$ L1 I$ Core 2 L1 D$ L1 I$ Core 3 L1 D$ L1 I$ Core 4 L1 D$ L1 I$ Core 5 L1 D$ L1 I$ Core 6 L1 D$ L1 I$ Core 7 L1 D$ L1 I$ Replicated Tags of all L2 and L1 Caches Controller that handles L2 misses Scalable Non-broadcast Interconnect L2$ Off-chip access

Core 0 L1 D$ L1 I$ L2 $ Core 1 L1 D$ L1 I$ L2 $ Core 2 L1 D$ L1 I$ L2 $ Core 3 L1 D$ L1 I$ L2 $ Core 4 L1 D$ L1 I$ L2 $ Core 5 L1 D$ L1 I$ L2 $ Core 6 L1 D$ L1 I$ L2 $ Core 7 L1 D$ L1 I$ L2 $ Memory Controller for off-chip access A single tile composed of a core, L1 caches, and a bank (slice) of the shared L2 cache The cache controller forwards address requests to the appropriate L2 bank and handles coherence operations

Bottom die with cores and L1 caches Core 0 L1 D$ L1 I$ L2 $ Core 1 L1 D$ L1 I$ L2 $ Core 2 L1 D$ L1 I$ L2 $ Core 3 L1 D$ L1 I$ L2 $ Memory controller for off-chip access Top die with L2 cache banks Each core has low-latency access to one L2 bank

7 Large NUCA CPU Issues to be addressed for Non-Uniform Cache Access: Mapping Migration Search Replication

8 Static and Dynamic NUCA Static NUCA (S-NUCA)  The address index bits determine where the block is placed  Page coloring can help here as well to improve locality Dynamic NUCA (D-NUCA)  Blocks are allowed to move between banks  The block can be anywhere: need some search mechanism  Each core can maintain a partial tag structure so they have an idea of where the data might be (complex!)  Every possible bank is looked up and the search propagates (either in series or in parallel) (complex!)

9 Kim et al. (ASPLOS’02) Search policies:  incremental: check each bank before propagating the search  multicast: search in parallel  smart search: cache controller maintains partial tags that guide search or quickly signal a cache miss Movement: Data gradually moves closer as it is accessed Placement policy:  bring data close or far  replaced data is evicted or moved to furthest bank

10 Results Average IPC values (16 MB, 50nm technology) : UCA cache: 0.26 Multi-level UCA (L2/L3): 0.64 Static NUCA: 0.65 D-NUCA (simple map, multicast, 0.71 insert at tail, 1-hit/1-bank promotion) D-NUCA with smart search 0.75 Upper bound (instant L2 miss 0.89 detection and all hits in first bank)

11 Chishti et al. (MICRO’03) Decouples the tag and data arrays Tag arrays are first examined (serial tag-data access is common and more power-efficient for large caches) Only the appropriate bank is then accessed Tags are organized conventionally, but within the data arrays, a set may have all its ways concentrated nearby The tags maintain forward pointers to data and data blocks maintain reverse pointers to tags

12 NuRAPID and Distance-Associativity

13 Beckmann and Wood, MICRO’04 Latency 13-17cyc Latency 65 cyc Data must be placed close to the center-of-gravity of requests

14 Examples: Frequency of Accesses Dark  more accesses  OLTP (on-line transaction processing) Ocean  (scientific code)

Block Migration Results While block migration reduces avg. distance, it complicates search.

Alternative Layout From Huh et al., ICS’05: Paper also introduces the notion of sharing degree A bank can be shared by any number of cores between N=1 and 16. Will need support for L2 coherence as well

17 Cho and Jin, MICRO’06 Page coloring to improve proximity of data and computation Flexible software policies Has the benefits of S-NUCA (each address has a unique location and no search is required) Has the benefits of D-NUCA (page re-mapping can help migrate data, although at a page granularity) Easily extends to multi-core and can easily mimic the behavior of private caches

18 Page Coloring Example P C P C P C P C P C P C P C P C Recent work (Awasthi et al., HPCA’09) proposes a mechanism for hardware-based re-coloring of pages without requiring copies in DRAM memory

19 Title Bullet