Cache coherence for CMPs Miodrag Bolic

Private cache
Each cache bank is private to a particular core. Cache coherence is maintained at the L2 cache level. Examples: Intel Montecito [81], AMD Opteron [56], and IBM POWER6 [63].

Private cache
Advantages:
- Short L2 cache access latency.
- Little network traffic: the local L2 cache bank filters most memory requests, so few coherence messages are injected into the interconnection network.
Disadvantages:
- Data blocks can be duplicated across the private caches.
- If the working set accessed by the different cores is not well balanced, some caches can be over-utilized while others are under-utilized.

Shared cache
Cache coherence is maintained at the L1 cache level. The bits usually chosen to map a block to a particular bank are the least-significant bits of the block address. Examples: Piranha [16], Hydra [47], Sun UltraSPARC T2 [105], and Intel Merom [104].
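To illustrate this bank mapping, here is a minimal sketch; the block size, bank count, and function name are illustrative assumptions, not values from the processors cited above:

```python
# Minimal sketch of static bank mapping in a shared L2 cache.
# BLOCK_SIZE and NUM_BANKS are assumed values for illustration.

BLOCK_SIZE = 64   # bytes per cache block (assumed)
NUM_BANKS = 16    # number of L2 cache banks (assumed)

def home_bank(address: int) -> int:
    """Select the L2 bank that owns the block containing `address`.

    The block-offset bits are discarded first; the next
    least-significant bits of the block address pick the bank,
    so consecutive blocks interleave across the banks.
    """
    block_address = address // BLOCK_SIZE
    return block_address % NUM_BANKS

# Consecutive blocks map to consecutive banks:
for addr in (0x0000, 0x0040, 0x0080, 0x0400):
    print(hex(addr), "->", home_bank(addr))
```

Because consecutive block addresses map to consecutive banks, data is spread across the banks at block granularity regardless of which core issues the accesses.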

Shared cache
Advantages:
- A single copy of each block is kept on-chip.
- Workload balancing: the utilization of each cache bank does not depend on the working set accessed by each core, because blocks are distributed uniformly among the banks in a round-robin fashion; this increases the effective aggregate cache capacity.
Disadvantages:
- Many requests are serviced by remote banks, so access latency is non-uniform (L2 NUCA architecture).

Hammer protocol
Used in AMD Opteron systems. It relies on broadcasting requests to all tiles to resolve cache misses, and it targets systems that use unordered point-to-point interconnection networks.
On every cache miss, Hammer sends a request to the home tile. If the memory block is present on-chip, the request is forwarded to the rest of the tiles to obtain the requested block.
All tiles answer the forwarded request by sending either an acknowledgement or a data message to the requesting core, which must wait until it has received a response from every other tile. Once all responses have arrived, the requester sends an unblock message to the home tile.
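The sketch below models this miss flow; the tile and message names are hypothetical and it captures only the message counting, not AMD's actual implementation:

```python
# Hedged sketch of the broadcast-based (Hammer-style) miss flow
# described above. Tile/message names are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class Tile:
    tid: int
    cached: set = field(default_factory=set)  # blocks held by this tile

def resolve_miss(requester, home, tiles, block):
    """Resolve a miss on `block` by broadcasting and counting replies."""
    # Hop 1: the requester sends its request to the home tile.
    # Hop 2: the home tile forwards it to every other tile.
    others = [t for t in tiles if t.tid != requester.tid]
    # Hop 3: every tile answers the requester with data or an ack.
    responses = [("data" if block in t.cached else "ack", t.tid)
                 for t in others]
    # The requester must collect a response from each other tile
    # before it can proceed.
    assert len(responses) == len(others)
    requester.cached.add(block)
    # Finally, the requester unblocks the home tile.
    return responses, ("unblock", home.tid)

tiles = [Tile(i) for i in range(4)]
tiles[2].cached.add(0x40)            # tile 2 happens to hold the block
print(resolve_miss(tiles[0], tiles[1], tiles, 0x40))
```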

Hammer protocol
Disadvantages:
- Requires three hops in the critical path before the requested data block is obtained.
- Broadcasting invalidation messages considerably increases the traffic injected into the interconnection network and, therefore, its power consumption.

Directory protocol
A directory tracks, for each memory block, which caches hold a copy of it. In order to accelerate cache misses, this directory information is not stored in main memory; instead, it is usually stored on-chip at the home tile of each block. In tiled CMPs, the directory structure is split into banks that are distributed across the tiles, each bank tracking a particular range of memory blocks.
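As a rough illustration, the sketch below models one directory bank with a full-map sharer bit vector per block; the names and sizes are assumptions, not a specific machine's organization:

```python
# Illustrative sketch of a distributed on-chip directory: each tile
# holds a directory bank that tracks the sharers of the blocks whose
# home it is. Sizes and names are assumed for illustration.

NUM_TILES = 64
BLOCK_SIZE = 16   # bytes (matches the example on the next slide)

def home_tile(address):
    """The tile whose directory bank tracks this block."""
    return (address // BLOCK_SIZE) % NUM_TILES

class DirectoryBank:
    """One bank per tile; tracks sharers of the blocks homed there."""
    def __init__(self):
        self.sharers = {}   # block address -> bit vector, one bit/tile

    def add_sharer(self, block, tile_id):
        self.sharers[block] = self.sharers.get(block, 0) | (1 << tile_id)

    def sharer_list(self, block):
        bits = self.sharers.get(block, 0)
        return [t for t in range(NUM_TILES) if (bits >> t) & 1]
```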

Directory protocol
The indirection problem:
- Every cache miss must reach the home tile before any coherence action can be performed.
- This adds unnecessary hops to the critical path of cache misses.
The directory memory overhead of tracking the sharers of each memory block can become intolerable for large-scale configurations.
- Example: with 16-byte blocks and 64 tiles, a full-map directory needs a 64-bit sharer vector for every 128-bit block, a 50% overhead (see the calculation below).
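The overhead for this example can be worked out directly, assuming a full-map directory with one sharer bit per tile:

```python
# Worked version of the overhead example on this slide: a full-map
# directory stores one sharer bit per tile for every tracked block.

block_bits = 16 * 8          # 16-byte block -> 128 bits of data
sharer_bits = 64             # 64 tiles -> 64-bit sharer vector
print(f"directory overhead: {sharer_bits / block_bits:.0%}")  # 50%
```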

Comparison of protocols

Interleaving

Mapping between cache entries and directory entries
One way to keep the size of the directory entries constant is to store duplicate tags: the directory holds a copy of the tags in the private caches, so its size is fixed by the cache geometry rather than by the number of sharers per block.
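A sketch of the idea, under an assumed private-cache geometry (all parameters and names below are illustrative): the directory mirrors every private-cache tag, and sharers are found by matching a block's tag against each core's mirror.

```python
# Hedged sketch of a duplicate-tag directory. The directory mirrors
# the tag array of every private cache, so its storage is set by the
# cache geometry, not by how many blocks exist in memory.

NUM_CORES = 4
NUM_SETS = 256
BLOCK_SIZE = 64

def split(address):
    """Split an address into (set index, tag), dropping the offset."""
    block = address // BLOCK_SIZE
    return block % NUM_SETS, block // NUM_SETS

# duplicate_tags[core][set] holds that core's tags for the set
duplicate_tags = [[set() for _ in range(NUM_SETS)]
                  for _ in range(NUM_CORES)]

def on_fill(core, address):
    """Mirror a private-cache fill in the duplicate tags."""
    idx, tag = split(address)
    duplicate_tags[core][idx].add(tag)   # eviction handling omitted

def sharers(address):
    """Find sharers by tag match in every core's mirror."""
    idx, tag = split(address)
    return [c for c in range(NUM_CORES)
            if tag in duplicate_tags[c][idx]]

on_fill(0, 0x1040)
on_fill(3, 0x1040)
print(sharers(0x1040))   # -> [0, 3]
```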