Cache Coherence
Instructor: Josep Torrellas
CS533, Spring 2008
Copyright Josep Torrellas 2003, 2008

The Cache Coherence Problem
Caches are critical to modern high-speed processors
Multiple copies of a block can easily get inconsistent:
– processor writes, I/O writes, ...
[Figure: two processors read A = 5 from memory into their caches (steps 1 and 2); one processor then writes A = 7 into its own cache (step 3), leaving the other cache and main memory holding the stale value A = 5]

Cache Coherence Solutions
Software-based vs. hardware-based
Software-based:
– compiler-based or with run-time system support
– with or without hardware assist
– a tough problem, because perfect information is needed in the presence of memory aliasing and explicit parallelism
We focus on hardware-based solutions, as they are more common

Hardware Solutions
The schemes can be classified based on:
– shared caches vs. snoopy schemes vs. directory schemes
– write-through vs. write-back (ownership-based) protocols
– update vs. invalidation protocols
– dirty-sharing vs. no-dirty-sharing protocols

Snoopy Cache Coherence Schemes
A distributed cache coherence scheme based on the notion of a snoop, which watches all activity on a global bus or is informed about such activity by some global broadcast mechanism.
The most commonly used method in commercial multiprocessors.

Write-Through Schemes
All processor writes result in:
– an update of the local cache, and
– a global bus write that:
  updates main memory
  invalidates/updates all other caches holding that item
Advantage: simple to implement
Disadvantages:
– Since ~15% of references are writes, this scheme consumes tremendous bus bandwidth; thus only a few processors can be supported.
– Some designs need dual-tagged caches, so that snoop lookups do not contend with the processor for the cache tags.
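To make the bus-side mechanics concrete, here is a minimal C sketch, not from the slides, of the two actions a write-through, write-invalidate scheme performs; cache_line_t and bus_write() are hypothetical names:

```c
/* Sketch of write-through, write-invalidate snooping. Every local
 * write also goes on the bus; every remote bus write invalidates the
 * local copy. Types and bus_write() are illustrative assumptions. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t tag;
    bool     valid;   /* no dirty bit: memory is always up to date */
} cache_line_t;

void bus_write(uint64_t tag, uint32_t data); /* assumed bus primitive */

/* Snoop side: another processor wrote 'tag' on the bus. */
void snoop_bus_write(cache_line_t *line, uint64_t tag) {
    if (line->valid && line->tag == tag)
        line->valid = false;   /* invalidate the stale copy */
}

/* Processor side: a local write updates the cache, then broadcasts. */
void processor_write(cache_line_t *line, uint64_t tag, uint32_t data) {
    line->tag = tag;
    line->valid = true;
    bus_write(tag, data);      /* update memory, trigger remote snoops */
}
```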

Write-Back/Ownership Schemes
When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth.
Most bus-based multiprocessors nowadays use such schemes.
Many variants of ownership-based protocols exist:
– Goodman's write-once scheme
– Berkeley ownership scheme
– Firefly update protocol
– ...

Invalidation vs. Update Strategies
1. Invalidation: on a write, all other caches with a copy are invalidated.
2. Update: on a write, all other caches with a copy are updated.
Invalidation is bad when:
– there is a single producer and many consumers of data (every write forces each consumer to miss again).
Update is bad when:
– one PE performs multiple writes before the data is read by another PE.
– junk data accumulates in large caches (e.g., after process migration).
Overall, invalidation schemes are more popular as the default.

[State-transition diagram for a write-back invalidation protocol with states Dirty, Shared, and Invalid; the edges are labeled with local accesses (P-read, P-write) and the bus transactions they trigger or respond to (Bus-read, Bus write-miss, Bus-invalidate)]
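As a rough illustration, the per-line behavior of such a three-state protocol can be written as a transition function. This is a sketch of a generic Dirty/Shared/Invalid (MSI-style) protocol using the slide's event names; the exact transitions on the slide may differ in detail:

```c
/* Sketch of a Dirty/Shared/Invalid per-line state machine. */
typedef enum { INVALID, SHARED, DIRTY } line_state_t;

typedef enum {
    P_READ, P_WRITE,          /* local processor accesses        */
    BUS_READ,                 /* remote read observed on the bus */
    BUS_WRITE_MISS,           /* remote write miss               */
    BUS_INVALIDATE            /* remote upgrade (write hit)      */
} line_event_t;

line_state_t next_state(line_state_t s, line_event_t e) {
    switch (s) {
    case INVALID:
        if (e == P_READ)  return SHARED;   /* read miss: fetch block  */
        if (e == P_WRITE) return DIRTY;    /* write miss: fetch + own */
        return INVALID;
    case SHARED:
        if (e == P_WRITE) return DIRTY;    /* invalidate other copies */
        if (e == BUS_WRITE_MISS || e == BUS_INVALIDATE)
            return INVALID;                /* a remote PE took ownership */
        return SHARED;                     /* P_READ and BUS_READ hit */
    case DIRTY:
        if (e == BUS_READ)       return SHARED;  /* supply data, demote */
        if (e == BUS_WRITE_MISS) return INVALID; /* supply data, yield  */
        return DIRTY;                      /* local hits stay dirty */
    }
    return s;
}
```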

Illinois Scheme
States: I (invalid), VE (valid-exclusive), VS (valid-shared), D (dirty)
Two features:
– The cache knows if it has a valid-exclusive (VE) copy; in the VE state, write hits cause no invalidation traffic.
– If some other cache has a copy, a cache-to-cache transfer is used.
Advantages:
– closely approximates the traffic of a uniprocessor for sequential programs
– in large cluster-based machines, cuts down latency
Disadvantages:
– complexity of the mechanism that determines exclusiveness
– memory needs to wait until the sharing status is determined

[State-transition diagram for the Illinois scheme with states Dirty, Shared, Valid-Exclusive, and Invalid; a P-read miss enters Valid-Exclusive when no one else has the block and Shared when someone has it, with the remaining edges labeled with P-read/P-write and the corresponding Bus-read, Bus write-miss, and Bus-invalidate transactions]
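The distinguishing read-miss decision can be sketched as follows, assuming a hypothetical bus_read() that reports whether any other cache asserted a "shared" signal (and supplied the data cache-to-cache if so):

```c
/* Sketch of the Illinois-scheme read miss (hypothetical interfaces). */
#include <stdbool.h>
#include <stdint.h>

typedef enum { I, VS, VE, D } illinois_state_t;

/* Assumed bus primitive: performs the read and returns true if some
 * other cache holds the block. */
bool bus_read(uint64_t addr);

illinois_state_t illinois_read_miss(uint64_t addr) {
    /* Entering VE when no one else has the block is what lets later
     * write hits proceed without any bus traffic. */
    return bus_read(addr) ? VS : VE;
}
```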

DEC Firefly Scheme
Classification: write-back, update, no-dirty-sharing.
States:
– VE (valid-exclusive): only copy, and clean
– VS (valid-shared): shared, clean copy; write hits update memory and the other caches, and the entry remains in this state
– D (dirty): dirty exclusive (only copy)
Uses a special shared line (SL) on the bus to detect the sharing status of a cache line.
Supports the producer-consumer model well.
What about sequential processes migrating between CPUs?

[State-transition diagram for the Firefly scheme with states Dirty, Shared, and Valid-Exclusive; the transitions depend on whether the shared line (SL) is asserted: for example, a P-write with SL asserted stays in Shared and updates main memory, a P-write without SL leads to Dirty, and a P-read miss with no other holder enters Valid-Exclusive]

Directory-Based Cache Coherence
Key idea: keep track, in a global directory (in main memory), of which processors are caching each memory location and of its state.

Motivation
Snoopy schemes do not scale because they rely on broadcast.
Hierarchical snoopy schemes have the root as a bottleneck.
Directory-based schemes allow scaling:
– They avoid broadcasts by keeping track of all PEs caching a memory block, and then using point-to-point messages to maintain coherence.
– They allow the flexibility to use any scalable point-to-point network.

Basic Scheme (Censier and Feautrier)
[Figure: processors with private caches connected by an interconnection network; memory holds the directory, with one dirty bit and a vector of presence bits per memory block]
Assume K processors.
With each cache block in memory: K presence bits and 1 dirty bit.
With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit.
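A minimal C sketch of the bookkeeping this implies; the names and the choice K = 64 are illustrative assumptions, and the handlers on the next slides are written against these types:

```c
/* Sketch of Censier-Feautrier directory and cache state: one
 * dir_entry_t per memory block, one cache_tag_t per cache block. */
#include <stdbool.h>
#include <stdint.h>

#define K 64  /* number of processors (assumed) */

typedef struct {
    uint64_t presence[K / 64]; /* bit i set => PE i caches the block */
    bool     dirty;            /* if set, exactly one presence bit is on */
} dir_entry_t;

typedef struct {
    bool valid;
    bool dirty;                /* set => this cache owns the block */
} cache_tag_t;
```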

Read Miss
Read miss to main memory by PE_i:
– If the dirty bit is OFF, then { read from main memory; turn P[i] ON; }
– If the dirty bit is ON, then { recall the line from the dirty PE (its cache state goes to shared); update memory; turn the dirty bit OFF; turn P[i] ON; supply the recalled data to PE_i; }
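The same two cases as a handler over the dir_entry_t sketched above; owner_of(), recall_line(), and the other primitives are assumed names for directory/network operations, not a real API:

```c
/* Directory-side read-miss handler for 'block', requested by PE i.
 * Uses dir_entry_t from the previous sketch; primitives are assumed. */
int      owner_of(const dir_entry_t *d);    /* index of the lone sharer */
void     recall_line(int pe, uint64_t blk); /* owner -> shared, data back */
void     mem_update(uint64_t blk);          /* write recalled data back  */
uint64_t mem_read(uint64_t blk);
void     send_data(int pe, uint64_t data);

void dir_read_miss(dir_entry_t *d, uint64_t block, int i) {
    if (d->dirty) {                          /* dirty bit ON */
        recall_line(owner_of(d), block);     /* demote the owner to shared */
        mem_update(block);                   /* memory is now up to date */
        d->dirty = false;
    }
    d->presence[i / 64] |= 1ULL << (i % 64); /* turn P[i] ON */
    send_data(i, mem_read(block));           /* supply the data to PE_i */
}
```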

Write Miss
Write miss to main memory by PE_i:
– If the dirty bit is ON, then { recall the data from the owner PE, which invalidates itself; (update memory); clear the previous owner's presence bit; forward the data to PE_i; turn P[i] ON; (the dirty bit stays ON the whole time) }
– If the dirty bit is OFF, then { supply the data to PE_i; send invalidations to all PEs caching that block and clear their P[k] bits; turn the dirty bit ON; turn P[i] ON; }
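And the corresponding write-miss handler, continuing the same sketch (clear_presence(), set_presence(), and invalidate_sharers() are further assumed helpers):

```c
/* Directory-side write-miss handler: in both cases PE i ends up as
 * the sole, dirty owner. Same sketch types/primitives as above. */
void clear_presence(dir_entry_t *d, int pe);
void set_presence(dir_entry_t *d, int pe);
void invalidate_sharers(dir_entry_t *d, uint64_t blk); /* clears all P[k] */

void dir_write_miss(dir_entry_t *d, uint64_t block, int i) {
    if (d->dirty) {                      /* dirty bit ON */
        int owner = owner_of(d);
        recall_line(owner, block);       /* owner invalidates itself */
        mem_update(block);
        clear_presence(d, owner);
        /* dirty bit stays ON: ownership simply moves to PE i */
    } else {                             /* dirty bit OFF */
        invalidate_sharers(d, block);    /* clear every sharer's P[k] */
        d->dirty = true;
    }
    set_presence(d, i);                  /* turn P[i] ON */
    send_data(i, mem_read(block));       /* forward the data to PE_i */
}
```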

Write Hit to Non-Owned Data
Write hit to data that is valid (but not owned) in the cache:
{ access the memory directory; send invalidations to all PEs caching the block and clear their P[k] bits; supply the data to PE_i; turn the dirty bit ON; turn P[i] ON }
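The upgrade case is essentially the clean-block path of the write miss without the data transfer, since PE_i already holds a valid copy; invalidate_sharers_except() is another assumed helper:

```c
/* Directory-side handler for a write hit to a valid, non-owned line:
 * invalidate every other sharer and hand ownership to PE i. */
void invalidate_sharers_except(dir_entry_t *d, uint64_t blk, int keep_pe);

void dir_write_hit_upgrade(dir_entry_t *d, uint64_t block, int i) {
    invalidate_sharers_except(d, block, i); /* clear the other P[k] bits */
    d->dirty = true;                        /* PE i is now the owner */
    /* P[i] is already ON: the line was valid in PE i's cache */
}
```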

Key Issues
Scaling of memory and directory bandwidth:
– cannot have main memory or directory memory centralized
– need a distributed cache coherence protocol
As shown, directory memory requirements do not scale well:
– The reason is that the number of presence bits needed grows with the number of PEs. --> But how many bits are really needed?
– Also: the larger the main memory, the larger the directory.

Directory Organizations
Memory-based schemes (DASH) vs. cache-based schemes (SCI)
Cache-based (linked-list-based) schemes:
– singly linked
– doubly linked (SCI)
Memory-based (pointer-based) schemes:
– full-map (Dir-N) vs. partial-map schemes (Dir-i-B, Dir-i-CV-r, ...)
– dense (DASH) vs. sparse directory schemes

Pointer-Based Coherence Schemes
The full-bit-vector scheme
Limited-pointer schemes
Sparse directories (caching)
LimitLESS (software assistance)

The Full-Bit-Vector Scheme
One bit of directory memory per main-memory block per PE.
Memory requirement is P x (P x M/B) bits, where P is the number of PEs, M is the main memory per PE, and B is the cache-block size (not counting the dirty bit).
Invalidation traffic is best: invalidations go exactly to the caches that hold the block.
One way to reduce the overhead is to increase B:
– can result in false sharing and increased coherence traffic
Overhead is not too large for medium-scale machines:
– Example: 256 PEs organized as 64 4-PE clusters, with 64-byte cache blocks --> 12% memory overhead.
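Checking the example's arithmetic, as a sketch: with one presence bit per 4-PE cluster, each 64-byte (512-bit) block needs 64 directory bits:

```c
/* Worked check of the slide's overhead example: 256 PEs as 64
 * clusters of 4 PEs (one presence bit per cluster), 64-byte blocks. */
#include <stdio.h>

int main(void) {
    int clusters   = 64;       /* presence bits per memory block */
    int block_bits = 64 * 8;   /* 64-byte block = 512 bits */
    printf("directory overhead = %.1f%%\n",
           100.0 * clusters / block_bits);  /* prints 12.5% */
    return 0;
}
```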

Limited-Pointer Schemes
Since data is expected to be in only a few caches at any one time, a limited number of pointers per directory entry should suffice.
Overflow strategy: what to do when the number of sharers exceeds the number of pointers?
Many different schemes exist, based on different overflow strategies.

Some Examples
Dir-i-B:
– beyond i pointers, set the invalidate-broadcast bit ON
– storage needed is i x log(P) x (P x M/B) bits (in addition to the invalidate-broadcast bit)
– expected to do well, since widely shared data is not written often
Dir-i-NB:
– when the sharers exceed i, invalidate one of the existing sharers
– significant degradation expected for widely shared, mostly-read data
Dir-i-CV-r:
– when the sharers exceed i, use the bits allocated to the i pointers as a coarse-resolution vector (each bit points to multiple PEs)
– always results in less coherence traffic than Dir-i-B
LimitLESS directories: handle overflow using software traps (described below).
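A worked comparison of the full-map and Dir-i-B storage formulas, as a sketch with assumed parameters (P = 256, i = 4, M = 16 MB per PE, B = 64 bytes):

```c
/* Sketch comparing full-map vs. Dir-i-B directory storage using the
 * slide's formulas. All parameter values below are assumptions. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double P = 256;                  /* processors */
    double M = 16.0 * 1024 * 1024;   /* bytes of main memory per PE */
    double B = 64;                   /* cache-block size in bytes */
    double i = 4;                    /* pointers per directory entry */

    double entries  = P * M / B;             /* memory blocks in total    */
    double full_map = P * entries;           /* P presence bits per entry */
    double dir_i_b  = i * log2(P) * entries; /* i pointers of log2(P) bits */

    printf("full map: %.3g bits, Dir-i-B: %.3g bits (%.1fx smaller)\n",
           full_map, dir_i_b, full_map / dir_i_b);  /* 8.0x smaller here */
    return 0;
}
```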

Performance of Directories
[Figures 7 and 10 in the Gupta et al. paper]

LimitLESS Directories
Limit the number of hardware pointers.
On overflow:
– the memory module interrupts the local processor
– the processor emulates the full-map directory for the block in software
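A rough sketch of that overflow path, with purely illustrative structures; the real scheme (on the MIT Alewife machine) traps on incoming protocol packets rather than on a local call:

```c
/* Sketch of LimitLESS overflow handling: the first few sharers use
 * the hardware pointers; on overflow, software takes over with a
 * full bit map. All names and sizes are illustrative assumptions. */
#include <stdint.h>

#define HW_PTRS 4   /* hardware pointers per entry (assumed) */
#define P       256 /* processors (assumed) */

typedef struct {
    int      hw_ptr[HW_PTRS];    /* hardware sharer pointers */
    int      n_hw;
    int      overflowed;         /* software map is now authoritative */
    uint64_t sw_map[P / 64];     /* software-emulated full bit vector */
} limitless_entry_t;

void record_sharer(limitless_entry_t *d, int pe) {
    if (!d->overflowed && d->n_hw < HW_PTRS) {
        d->hw_ptr[d->n_hw++] = pe;           /* fast path, in hardware */
        return;
    }
    if (!d->overflowed) {                    /* overflow: trap to software */
        d->overflowed = 1;
        for (int k = 0; k < d->n_hw; k++)    /* move hw pointers to sw map */
            d->sw_map[d->hw_ptr[k] / 64] |= 1ULL << (d->hw_ptr[k] % 64);
    }
    d->sw_map[pe / 64] |= 1ULL << (pe % 64); /* emulated full-map entry */
}
```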

LimitLESS Directories Require...
A rapid trap handler: trap code executes within 5-10 cycles of trap initiation.
Software with complete access to the coherence controller.
An interface to the network that allows the processor to launch and to intercept coherence-protocol packets.