Cache Coherence: Directory Protocol

Slides:

Advertisements

Similar presentations

The University of Adelaide, School of Computer Science

Advertisements

Multi-core systems System Architecture COMP25212 Daniel Goodman Advanced Processor Technologies Group.

4/16/2013 CS152, Spring 2013 CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering.

1 Lecture 4: Directory Protocols Topics: directory-based cache coherence implementations.

Cache Optimization Summary

CIS629 Coherence 1 Cache Coherence: Snooping Protocol, Directory Protocol Some of these slides courtesty of David Patterson and David Culler.

CS252/Patterson Lec /23/01 CS213 Parallel Processing Architecture Lecture 7: Multiprocessor Cache Coherency Problem.

1 Lecture 2: Snooping and Directory Protocols Topics: Snooping wrap-up and directory implementations.

1 Lecture 24: Transactional Memory Topics: transactional memory implementations.

1 Lecture 20: Coherence protocols Topics: snooping and directory-based coherence protocols (Sections )

1 Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations.

NUMA coherence CSE 471 Aut 011 Cache Coherence in NUMA Machines Snooping is not possible on media other than bus/ring Broadcast / multicast is not that.

1 Lecture 3: Directory-Based Coherence Basic operations, memory-based and cache-based directories.

CS252/Patterson Lec /28/01 CS 213 Lecture 10: Multiprocessor 3: Directory Organization.

1 Lecture 20: Protocols and Synchronization Topics: distributed shared-memory multiprocessors, synchronization (Sections )

1 Shared-memory Architectures Adapted from a lecture by Ian Watson, University of Machester.

Spring 2003CSE P5481 Cache Coherency Cache coherent processors reading processor must get the most current value most current value is the last write Cache.

1 Cache coherence CEG 4131 Computer Architecture III Slides developed by Dr. Hesham El-Rewini Copyright Hesham El-Rewini.

Computer Science and Engineering Copyright by Hesham El-Rewini Advanced Computer Architecture CSE 8383 March 20, 2008 Session 9.

1 Lecture 19: Scalable Protocols & Synch Topics: coherence protocols for distributed shared-memory multiprocessors and synchronization (Sections )

1 Lecture: Coherence Topics: snooping-based coherence, directory-based coherence protocols (Sections )

Computer Science and Engineering Parallel and Distributed Processing CSE 8380 April 7, 2005 Session 23.

The University of Adelaide, School of Computer Science

1 Lecture 8: Snooping and Directory Protocols Topics: 4/5-state snooping protocols, split-transaction implementation details, directory implementations.

Multiprocessors – Locks

Lecture 8: Snooping and Directory Protocols

Cache Coherence: Directory Protocol

תרגול מס' 5: MESI Protocol

Architecture and Design of AlphaServer GS320

The University of Adelaide, School of Computer Science

The University of Adelaide, School of Computer Science

CS 704 Advanced Computer Architecture

Lecture 18: Coherence and Synchronization

Multiprocessor Cache Coherency

Jason F. Cantin, Mikko H. Lipasti, and James E. Smith

The University of Adelaide, School of Computer Science

CMSC 611: Advanced Computer Architecture

Krste Asanovic Electrical Engineering and Computer Sciences

Example Cache Coherence Problem

The University of Adelaide, School of Computer Science

Lecture 2: Snooping-Based Coherence

Cache Coherence Protocols 15th April, 2006

CMSC 611: Advanced Computer Architecture

Lecture 8: Directory-Based Cache Coherence

Lecture 7: Directory-Based Cache Coherence

11 – Snooping Cache and Directory Based Multiprocessors

Chapter 5 Exploiting Memory Hierarchy : Cache Memory in CMP

CS 213 Lecture 11: Multiprocessor 3: Directory Organization

Lecture 25: Multiprocessors

Lecture 9: Directory-Based Examples

Slides developed by Dr. Hesham El-Rewini Copyright Hesham El-Rewini

Lecture 8: Directory-Based Examples

Lecture 25: Multiprocessors

Lecture 26: Multiprocessors

The University of Adelaide, School of Computer Science

Lecture 17 Multiprocessors and Thread-Level Parallelism

Cache coherence CEG 4131 Computer Architecture III

Lecture 24: Virtual Memory, Multiprocessors

Lecture 24: Multiprocessors

Lecture: Coherence, Synchronization

Lecture: Coherence Topics: wrap-up of snooping-based coherence,

Coherent caches Adapted from a lecture by Ian Watson, University of Machester.

Lecture 17 Multiprocessors and Thread-Level Parallelism

CPE 631 Lecture 20: Multiprocessors

Lecture 23: Transactional Memory

Lecture 19: Coherence and Synchronization

Lecture 18: Coherence and Synchronization

The University of Adelaide, School of Computer Science

Lecture 10: Directory-Based Examples II

Lecture 17 Multiprocessors and Thread-Level Parallelism

Presentation transcript:

Cache Coherence: Directory Protocol Smruti R. Sarangi, IIT Delhi

Contents Overview of the Directory Protocol Details Optimizations

Basic Idea of a Coherence Protocol Memory Level n Level n+1 Coherent memory: multiple private caches  appear as one large shared cache Private Cache Private Cache Private Cache Private Cache Private Cache Private Cache Memory Level n+2

Limitations of the Snoopy Protocol Fundamentally relies on a broadcast based mechanism The order of writes is determined by the order of grant accesses to the broadcast bus This is not a scalable solution Solution: We can have different broadcast buses for different sets of addresses Can work for 4-8 core systems (not beyond, too many conflicts) Ultimately: not scalable We cannot make efficient use of the NoC Also not very power efficient All the cores need to keep on snooping the bus Need to migrate to a message based system

Single Writer Multiple Reader Model for each Cache Line Multiple Readers Ensures a global order of writes to the same memory location

Solution: Directory Protocol Don’t have a bus Have a dedicated structure called a directory The directory co-ordinates the actions of the coherence protocol It sends and receives messages to/from all the caches and the lower level in the memory hierarchy Scalable Cache Cache Directory Cache Cache Lower Level in the Memory Hierarchy

State of Lines in the Shared Cache Bank tag array entry coherence state tag Every line that is shared has some state associated with it I  Invalid S  Shared (with some other cache) E  Exclusive (no other cache contains a copy of this line) M  Modified (no other cache contains a copy of this line + this cache has modified it) Meaning of the states I  Nothing exists, content is not valid (cannot read or write) S  Can read the line, not write E  We know that no other cache contains a copy of the line. Cannot immediately write, can read. M  Can read or write Let us explain some common scenarios with two simple examples

Main Actions of the Protocol (Example: read) Line is not in the S or E or M state We want to read a line from cache C1 Directory locates the line Send a message to the directory Ask the other cache for a read permission Line is contained in another cache Acknowledgement + contents of the line Other cache: transitions the line to the S state (if not already so). Returns acknowledgement + contents Directory forwards the line to C1

Main Actions of the Protocol (Example: write) Line is not in the E or M state Directory locates all the caches that contain the line (sharers) We want to write a line to cache C1 Send a message to the directory Ask the other caches to invalidate the line Line is present in other caches Acknowledgement + contents of the line Other cache: transitions the line to the I state (if not already so). Returns acknowledgement + contents Directory forwards the line to C1

Contents Overview of the Directory Protocol Details Optimizations

Let us look at the protocol in some detail ... There are three entities The Cache (contains some state for each line) It receives messages from the upper level (read or write) It receives messages from the NoC (read miss, write miss) Generates events or gets messages from the lower level  evictions The NoC Routes messages The directory Gets messages from caches, processes them Sends messages to caches (if required)

Let us consider the cache’s point of view readX  read miss on NoC, writeX  write miss on NoC read  local read, write  local write Let us consider the I (invalid, line not present) state Standard format: event/action, (send: default sent to dir.) No other cache contains the line E I read/send readX Some other cache contains the line S I read/send readX

More I state transitions write/send writeX I I readX or writeX/-

S M I S state transitions read/- readX/send value(?) write/send writeX writeX/send value(?) evict/send evict 1) Support multiple readers (at a time), yet only one writer. 2)Also seamlessly evict lines. I

M S I M State Transitions read/- write/- readX/write-back+send data writeX/send data evict/send evict + write-back 1) Only one writer at a time 2) In the M  S transition, write back data to the lower level. This is done to enable seamless evictions in the S state. I

E S M I E State Transitions read/- write/send writeX readX/send data writeX/send data evict/send evict 1) We can seamlessly move to the M state. It is not necessary to send any messages. I

Protocol Summary: readX and writeX readX or writeX/- I S readX/send data(?) writeX/send data(?) writeX/send data writeX/send data readX/send data readX/write-back+send data M E

Protocol Summary: read,write, and evict evict/send evict S read/- read/send readX read/send readX write/send writeX write/send writeX evict/send evict + write-back evict/send evict write,read/- M E read/- write/-

Design of the Directory What are the messages that the directory receives? readX writeX evict What should the directory do: readX  Locate a cache that contains the line(sharer) and fetch the line writeX  Ask all sharers to invalidate their lines, give exclusive rights to the cache that wants to write evict  Delete the cache from the list of sharers Basic Design: Directory Directory Entry State Line Address List of Sharers

Actions of the Directory Notation: readX.C  id of the cache List of sharers  S Directory line state  DirState On receiving a readX Entry for line not present in the directory Create an entry, S = [readX.C], Read from the lower level Send an acknowledgement to the cache (state transition to the E state) Entry present DirState = Shared S = S U readX.C Send a readX message to one of the sharers to send the line’s contents to the requester DirState = Modified Send a readX message to the cache that is the exclusive owner

Actions of the Directory On receiving a writeX Entry for line not present in the directory Create an entry, S = [writeX.C], Read from the lower level Send an acknowledgement to the cache (state transition to the M state) Entry present DirState = Shared Send a writeX message to all the sharers Ask one of the sharers to send a copy of the line to the requester S = [writeX.C] DirState = Modified Send a writeX message to the cache that is the exclusive owner The exclusive owner needs to send the contents of the block to the requester

Actions of the Directory On receiving an evict from cache, C S = S – C If (S == Empty), set the state as invalid

Contents Overview of the Directory Protocol Details Optimizations

Enhancements to the Directory Protocol Let us list some of the common problems associated with directories We need an entry for each line in a program’s working set (lot of storage) In each directory entry, we need an entry for each constituent cache (storage overheads) The directory itself can become a point of contention Let us look at 

Distributed Directories Split the address space Address Space Directory Directory Directory Directory Directory Split the physical address space A directory handles all the requests for the part of the address space it owns Resolves the issue of the single point of contention.

List of Sharers How to maintain the list of sharers? Solution 1: If there are N processors, have a bit vector of N processors. Each block is associated with a bit vector of sharers block address 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 Problems with this solution: If there are a large number of processors and most of the entries are 0  space wasted Better solution: Maintain a bit for a set of caches. Run a snoopy protocol inside the set. OR Store the ids of only k sharers. Design the list of sharers as a linked list. Every block maintains a pointer to the next copy.

Size of the Directory The directory should ideally be as large as the number of blocks in the programs’ working sets Having an entry for every block in the physical address space is impractical Practical Solution: Design a directory as a cache Keep the state of a limited number of blocks If an entry is evicted from the directory, invalidate it in all the caches

Question Implement a critical section with the help of the directory protocol. A critical section has two functions: lock and unlock Let us implement both of them with assembly instructions Assume register r2 contains the lock address The lock address can contain 1 (it is locked) or 0 (unlocked) lock function Atomically exchange the contents of register r1 (=1), with [r2] If r1 = 0 after the exchange, we have the lock Otherwise keep repeating Unlock function: release the lock Write 0 to [r2] Why do you need a fence? .lock: mov r1, 1 atomicxchg r1, [r2] cmp r1, 0 bne .lock .unlock: fence st [r2], 0

Atomic Exchange atomicxchg r1, [r2] It involves 3 steps Method temp = r1, r1 = [r2], [r2] = temp It involves 3 steps 1 memory read + 1 memory write All 3 need to happen atomically This is called a read-modify-write instruction (RMW) Method Get exclusive access (M state) with write permissions for the memory address in r2 Perform the read-modify-write operation Do not respond to any other requests from the local cache, or other caches, or the directory when the operation is in progress Respond to the directory or other caches only when the operation is over The directory/requesting cache need to stall till they get valid responses from all caches that have the line in the M state